
Since 2022-04-06, article content apparently can no longer be captured (see screenshot) #64

Open
mrchzh opened this issue Apr 8, 2022 · 7 comments

Comments

@mrchzh

mrchzh commented Apr 8, 2022

[screenshot]

@mrchzh

mrchzh commented Apr 14, 2022

I've found the relevant code in the source:
deal_data.py
1. Change the title XPath on line 350: selector.xpath('//h1[@Class="rich_media_title"]/text()')
2. Change the content XPath on line 347: '//div[@Class="rich_media_content "]|//div[@Class="rich_media_content"]|//div[@Class="share_media"]|//div[@Class="rich_media_meta_list"]'

Unfortunately I don't know Python; I'll find time to watch a video and pick up the basics.
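The two XPath changes above can be sketched against a minimal stand-in page. Note two assumptions in this sketch: the sample HTML is hypothetical (not a real mp.weixin.qq.com response), and a lowercase @class is used, since XPath attribute names are case-sensitive and served HTML attributes are lowercase. deal_data.py itself uses lxml; the stdlib ElementTree is used here only to keep the sketch dependency-free (ElementTree has no XPath union operator, so the alternation becomes a loop):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a WeChat article page.
html = '''<html><body>
<h1 class="rich_media_title"> Sample Title </h1>
<div class="rich_media_content"><p>Article body here.</p></div>
</body></html>'''

root = ET.fromstring(html)

# Title: lowercase @class assumed (XPath attribute names are case-sensitive).
title = root.find('.//h1[@class="rich_media_title"]').text.strip()

# Content: ElementTree has no XPath union (|), so try each candidate
# class from the alternation above, in order:
content = ''
for cls in ('rich_media_content ', 'rich_media_content',
            'share_media', 'rich_media_meta_list'):
    node = root.find('.//div[@class="{}"]'.format(cls))
    if node is not None:
        content = ''.join(node.itertext()).strip()
        break

print(title)    # Sample Title
print(content)  # Article body here.
```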

@fengxuangit

Here is how I solved it: inside def deal_article(self, req_url, text): in deal_data.py, I used selenium + chromedriver to inject JS and pull out the content. Code:

import time

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# Selenium 3 style; Selenium 4 takes a Service object instead of a path.
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get("http://mp.weixin.qq.com/s?__biz=MzI4MTQxMjExMw==&mid=2247484946&idx=1&sn=2ba55c5c2e82457ea9c23ad600ddc1ea&chksm=eba8d16cdcdf587afbe233be563c2e521143be1139c1975e4bfe014c7c6248c386885bac4403&scene=27#wechat_redirect")
time.sleep(1)
# element = driver.find_element(By.CLASS_NAME, 'rich_media_content')
# Wait until the page's JS has rendered the article body.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "rich_media_content")))
# Read the rendered text and ship it to a local collector on port 9001.
result = driver.execute_script('''
var result = document.getElementsByClassName("rich_media_content")[0].innerText;
let xhr = new XMLHttpRequest();
let url = "http://127.0.0.1:9001/?a=" + encodeURIComponent(result);
xhr.open("get", url, false);
xhr.send(null);
return result;
''')

content = driver.page_source
soup = BeautifulSoup(content, "lxml")
driver.close()
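The injected JS above sends the article text to http://127.0.0.1:9001/?a=<encoded text>. The original snippet does not show what listens on that port; a minimal stdlib sketch of such a collector (hypothetical, not part of the original code) could look like this:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


def extract_article_param(path):
    """Pull the percent-decoded article text out of the ?a=... query string."""
    return parse_qs(urlparse(path).query).get('a', [''])[0]


class CollectorHandler(BaseHTTPRequestHandler):
    """Receives the article text sent by the injected XMLHttpRequest."""
    def do_GET(self):
        text = extract_article_param(self.path)
        print('received %d chars' % len(text))  # store/forward as needed
        self.send_response(200)
        self.end_headers()


if __name__ == '__main__':
    # Listen on the port the injected JS targets.
    HTTPServer(('127.0.0.1', 9001), CollectorHandler).serve_forever()
```

parse_qs handles the percent-decoding, so the collector receives the article text exactly as encodeURIComponent produced it.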

@wuweijie007

The title selector now has to be written like this (note the trailing space inside the class name): selector.xpath('//h1[@Class="rich_media_title "]/text()')

@wuweijie007

Quoting the fix above: I've found the relevant code in the source: deal_data.py 1. Change the title XPath on line 350: selector.xpath('//h1[@Class="rich_media_title"]/text()') 2. Change the content XPath on line 347: '//div[@Class="rich_media_content "]|//div[@Class="rich_media_content"]|//div[@Class="share_media"]|//div[@Class="rich_media_meta_list"]'

Unfortunately I don't know Python; I'll find time to watch a video and pick up the basics.

[screenshot]
The page format has now changed to this; how can it be parsed?

@halobug

halobug commented Jun 26, 2023

Quoting the fix above: I've found the code in deal_data.py. 1. Change the title XPath on line 350: selector.xpath('//h1[@Class="rich_media_title"]/text()') 2. Change the content XPath on line 347: '//div[@Class="rich_media_content"]|//div[@Class="rich_media_content"]|//div[@Class="share_media"]|//div[@Class="rich_media_meta_list"]'
Unfortunately I don't know Python; I'll find time to watch a video and pick up the basics.

[screenshot] The page format has now changed to this; how can it be parsed?

The latest content XPath is: content = '//div[@id="js_content"]'
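For the newer layout, where the body sits under a div with id js_content rather than a class-named div, extraction could be sketched like this (the sample HTML is hypothetical, and the stdlib ElementTree stands in for the project's lxml just to keep the sketch dependency-free):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of the newer page layout.
html = '''<html><body>
<div id="js_content"><p>New-format article body.</p></div>
</body></html>'''

root = ET.fromstring(html)
# Equivalent of the XPath //div[@id="js_content"] above.
node = root.find('.//div[@id="js_content"]')
content = ''.join(node.itertext()).strip() if node is not None else ''
print(content)  # New-format article body.
```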

@1049451037

This is very strange. Why is selenium needed at all? In theory every request should pass through mitmproxy, so how can the article content bypass mitmproxy and still be visible on the phone?

@1049451037

Debugging locally, I found that mitmproxy can no longer capture the article content either: none of the response.text payloads contain anything related to the article body.
