show	version	enable_checker
step	1.0	true

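Crawling Baidu

Recap

Last time we actually crawled a real website
- oeasy.org
Right-click and inspect an element
- to get its xpath
After crawling, read the value of the href attribute
Then slice it and join it into an absolute URL
And crawl every one of those links
Can we go out and crawl Baidu? 🤔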
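
The slice-and-join step from the recap can also be done with the standard library. A minimal sketch, assuming made-up href values; urljoin stands in for the manual slicing and is not necessarily what the earlier lesson used.

```python
from urllib.parse import urljoin

# Page on which the links were found (the site from the recap).
base_url = "https://oeasy.org/"

# Hypothetical href values, as they might appear in <a> tags.
hrefs = ["./python/index.html", "/about.html", "https://example.com/x"]

# urljoin resolves each href against the base URL.
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```

Relative paths are resolved against the base, and an href that is already absolute passes through unchanged.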

Confirm network access

firefox https://www.baidu.com &

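Ordinary registered members of Shiyanlou (实验楼)
- cannot reach sites such as baidu from inside the virtual machine
- but can use Firefox and python on their own machine

To crawl Baidu
- first confirm that you can reach Baidu

Prepare the environment

Import the requests package
- then send a request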

import requests
from lxml import etree
response = requests.get("http://www.baidu.com")
print(response.content)

The response is fairly simple

It is a
- byte sequence (bytes)
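
To make the difference between bytes and str concrete, here is a small offline sketch. The text is a made-up sample, not the real response body.

```python
# A str is text; encoding it gives the kind of byte sequence found in response.content.
text = "百度一下"
raw = text.encode("utf-8")

print(type(raw))            # <class 'bytes'>
print(raw)                  # shown with a b'...' prefix and \x escapes
print(len(text), len(raw))  # number of characters vs. number of bytes

# Decoding with the same encoding restores the text.
restored = raw.decode("utf-8")
print(restored)
```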

Saving the page

Decode the byte sequence into a string

import requests
from lxml import etree
response = requests.get("http://www.baidu.com")
print(response.content.decode("utf-8"))

Save the script as b.py and redirect its output to b.html

python3 b.py > b.html
firefox b.html &
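
As an alternative to shell redirection, the page can be written from Python with an explicit encoding. A sketch only: html_text is a stand-in for the decoded response, and a temporary directory is used so nothing is left behind.

```python
from pathlib import Path
import tempfile

# Stand-in for response.content.decode("utf-8"); a real run would use the fetched page.
html_text = "<html><head><meta charset='utf-8'></head><body>百度一下</body></html>"

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "b.html"
    out_path.write_text(html_text, encoding="utf-8")
    saved_bytes = out_path.read_bytes()
    print(out_path.name, len(saved_bytes), "bytes")
```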

Start nginx

sudo cp b.html /usr/share/nginx/html/
sudo service nginx start
firefox http://localhost/b.html &

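Then launch Firefox to open the page

This is the Baidu home page we fetched

Compare

Try comparing it with the real Baidu home page
- the hot-search list is clearly missing

Baidu itself relies on crawlers
- it spotted us as one of its own kind
- and so it blocked us 😰
So what now? 😱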
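
One likely giveaway is the User-Agent header. When no headers are passed, requests announces itself by name, which this sketch prints.

```python
import requests

# The User-Agent that requests sends when no headers are supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"
```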

Pretend

Then take the browser's User-Agent key-value pair
- put it in the headers
- and pretend we are a browser

import requests
from lxml import etree
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0"}
response = requests.get("https://www.baidu.com", headers=headers)
print(response.content.decode("utf-8"))

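Note that headers is a dict
- a colon separates each key from its value
- both key and value are wrapped in double quotes
- the colon and the double quotes are all half-width (ASCII) characters
The protocol in the request is https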
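
To check the headers without touching the network, the request can be prepared but not sent. A minimal sketch using the same User-Agent string as above.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0"
}

# Build the request without sending it, then inspect what would go out.
prepared = requests.Request("GET", "https://www.baidu.com", headers=headers).prepare()
print(prepared.url)
print(prepared.headers["User-Agent"])
```

A full-width colon or quote in the dict literal would stop the script with a SyntaxError before any request is built.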

Result

Redirect the output again

python3 b.py > b.html
sudo cp b.html /usr/share/nginx/html/
firefox http://localhost/b.html &

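It looks like we can now get the hot-search list

Build an etree

Put the correct response
- into lxml for parsing
- to build an etree

But with this tree in hand
- how do we pick out the links we want?
We need xpath!!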
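
Parsing and querying can be tried offline first. A minimal sketch on a hypothetical fragment; the markup and link targets are illustrative, not the real Baidu page.

```python
from lxml import etree

# Hypothetical stand-in for a fetched page.
html = """
<html><body>
  <div id="s-top-left">
    <a href="http://news.baidu.com">News</a>
    <a href="http://map.baidu.com">Maps</a>
  </div>
</body></html>
"""

et_html = etree.HTML(html)                  # parse the markup into an element tree
print(et_html.tag)                          # root element of the tree
l_et_a = et_html.xpath("/html/body/div/a")  # xpath returns a list of matching elements
for anchor in l_et_a:
    print(anchor.tag, anchor.text)
```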

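Finding the xpath

Note: look up the xpath on http://localhost/b.html
- because b.html is the page we have already crawled

Right-click 新闻 (News) in the top-left corner
- Inspect element
You can see that it corresponds to an a tag
- 地图 (Maps) and the others after it also correspond to a tags
How do we get the xpath?

Getting the xpath

Once the xpath is copied
- what next?

Try it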

import requests
from lxml import etree
headers ={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0"}
response = requests.get("https://www.baidu.com", headers=headers)
#print(response.text)
et_html = etree.HTML(response.content)
l_et_a = et_html.xpath("/html/body/div/div[1]/div[@id='s-top-left']/a")
for anchor in l_et_a:
    print(anchor.text)

An id attribute predicate such as [@id='s-top-left'] can also be used to pin down the div
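
Why prefer the id predicate? A sketch with two hypothetical layouts: when an extra div appears before the target, the positional path stops matching while the id predicate still finds it.

```python
from lxml import etree

# Two hypothetical layouts: the second has an extra div before the target.
layout_a = "<html><body><div><div id='s-top-left'><a>News</a></div></div></body></html>"
layout_b = "<html><body><div><div>ad</div><div id='s-top-left'><a>News</a></div></div></body></html>"

by_position = "/html/body/div/div[1]/a"
by_id = "/html/body/div/div[@id='s-top-left']/a"

results = {}
for name, markup in (("layout_a", layout_a), ("layout_b", layout_b)):
    tree = etree.HTML(markup)
    # Count the matches of each xpath on this layout.
    results[name] = (len(tree.xpath(by_position)), len(tree.xpath(by_id)))
    print(name, results[name])
```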

What if I also want to print the links?
What should I do?

Printing the links

First, check the documentation

import requests
from lxml import etree
headers ={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0"}
response = requests.get("https://www.baidu.com", headers=headers)
#print(response.text)
et_html = etree.HTML(response.content)
l_et_a = et_html.xpath("/html/body/div/div[1]/div[@id='s-top-left']/a")
for anchor in l_et_a:
    print(anchor.text,end="    ->   ")
    print(anchor.attrib.get("href"))

Traversal done
- that was easy
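
The two ways of reading an attribute behave differently when the attribute is missing. A small sketch on hypothetical markup.

```python
from lxml import etree

tree = etree.HTML("<html><body><a href='/news'>News</a><a>No link</a></body></html>")
with_href, without_href = tree.xpath("//a")

print(with_href.attrib.get("href"))     # value of the attribute
print(with_href.get("href"))            # element.get() is a shortcut for the same lookup
print(without_href.attrib.get("href"))  # None when the attribute is missing

try:
    without_href.attrib["href"]         # index access raises instead
except KeyError:
    print("KeyError: no href on this element")
```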

Next, let's look at the Baidu hot-search list

Baidu hot search

The xpath is
- /html/body/div[1]/div[1]/div[5]/div/div/div[3]/ul/li/a/span[2]

Traverse it

import requests
from lxml import etree
headers ={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0"}
response = requests.get("https://www.baidu.com", headers=headers)
#print(response.text)
et_html = etree.HTML(response.content)
l_et_a = et_html.xpath("/html/body/div[1]/div[1]/div[5]/div/div/div[3]/ul/li/a/span[2]")
for anchor in l_et_a:
    print(anchor.text)

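Result

But what if I also want to print the actual link?

Rebuilding the list

The span is inside the a element
- so first get the list of all the a elements
- then use an index to reach the text of the span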

import requests
from lxml import etree
headers ={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0"}
response = requests.get("https://www.baidu.com", headers=headers)
#print(response.text)
et_html = etree.HTML(response.content)
l_et_a = et_html.xpath("/html/body/div[1]/div[1]/div[5]/div/div/div[3]/ul/li/a")
for anchor in l_et_a:
    et_span = anchor.xpath("./span[2]")[0]
    print(et_span.text,end="   ->   ")
    print(anchor.attrib.get("href"))
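
Indexing with [0] raises IndexError if an a element has no second span. A defensive sketch on hypothetical hot-search markup, where the last item lacks that span.

```python
from lxml import etree

# Hypothetical hot-search markup; the last item lacks the second span.
html = """
<ul>
  <li><a href="/s?wd=one"><span>1</span><span>First topic</span></a></li>
  <li><a href="/s?wd=two"><span>2</span><span>Second topic</span></a></li>
  <li><a href="/s?wd=three"><span>3</span></a></li>
</ul>
"""

tree = etree.HTML(html)
rows = []
for anchor in tree.xpath("//ul/li/a"):
    spans = anchor.xpath("./span[2]")         # relative to this anchor, not the whole page
    title = spans[0].text if spans else None  # avoid IndexError when the span is missing
    rows.append((title, anchor.get("href")))

for title, href in rows:
    print(title, "->", href)
```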

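Result

Can we also traverse the links at the very bottom?

The bottom ones

This one looks simpler
The text and the link are both in the a element

xpath

The a elements are inside p elements
So drop the index on the p element
/html/body/div[1]/div[1]/div[7]/div/p/a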
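
What dropping the index changes can be seen on a small hypothetical footer: p[1] matches only the first p, while a bare p matches every p under the div.

```python
from lxml import etree

# Hypothetical footer with links spread over two p elements.
html = """
<div>
  <p><a href="/about">About</a><a href="/help">Help</a></p>
  <p><a href="/terms">Terms</a></p>
</div>
"""
tree = etree.HTML(html)

first_p_only = tree.xpath("/html/body/div/p[1]/a")  # index kept: only the first p
all_p = tree.xpath("/html/body/div/p/a")            # index dropped: every p under the div

print([a.text for a in first_p_only])
print([a.text for a in all_p])
```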

Traverse

import requests
from lxml import etree
headers ={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0"}
response = requests.get("https://www.baidu.com", headers=headers)
#print(response.text)
et_html = etree.HTML(response.content)
#l_et_a = et_html.xpath("/html/body/div/div[1]/div[3][@id='s-top-left']/a")
#l_et_a = et_html.xpath("/html/body/div/div[1]/div[5]/div/div/div[3]/ul/li/a/span")
l_et_a = et_html.xpath("/html/body/div[1]/div[1]/div[7]/div/p/a")
for anchor in l_et_a:
    print(anchor.text)
    print(anchor.attrib["href"])

Crawling Baidu Images

# -*- coding:utf8 -*-
import requests
import json
from urllib import parse
import os
import time

class BaiduImageSpider(object):
    def __init__(self):
        self.json_count = 0  # number of JSON pages to request (each page lists up to 30 images)
        self.url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&logid=5179920884740494226&ipn=rj&ct' \
                   '=201326592&is=&fp=result&queryWord={' \
                   '}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word={' \
                   '}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&nojc=&pn={' \
                   '}&rn=30&gsm=1e&1635054081427='
        self.directory = r"."  # base directory for downloads; change it to wherever you want to save
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30'
        }

    # Create the folder that will hold the images
    def create_directory(self, name):
        self.directory = os.path.join(self.directory, name)
        # create the directory if it does not exist yet
        os.makedirs(self.directory, exist_ok=True)

    # Collect the image links from one JSON page
    def get_image_link(self, url):
        list_image_link = []
        strhtml = requests.get(url, headers=self.header)  # fetch the JSON data with a GET request
        jsonInfo = json.loads(strhtml.text)
        for item in jsonInfo.get('data', []):
            # entries without a thumbURL are skipped instead of raising KeyError
            if item.get('thumbURL'):
                list_image_link.append(item['thumbURL'])
        return list_image_link

    # Download one image
    def save_image(self, img_link, filename):
        res = requests.get(img_link, headers=self.header)
        if res.status_code != 200:
            print(f"Failed to download image {img_link} ------->")
            return  # do not write an error page to disk
        with open(filename, "wb") as f:
            f.write(res.content)
            print("Saved to: " + filename)

    # Entry point
    def run(self):
        searchName = input("Search term: ")
        searchName_parse = parse.quote(searchName)  # URL-encode the term

        self.create_directory(searchName)

        pic_number = 0  # running image count
        for index in range(self.json_count):
            pn = (index+1)*30
            request_url = self.url.format(searchName_parse, searchName_parse, str(pn))
            list_image_link = self.get_image_link(request_url)
            for link in list_image_link:
                pic_number += 1
                self.save_image(link, os.path.join(self.directory, str(pic_number)+'.jpg'))
                time.sleep(0.2)  # sleep 0.2 s to avoid getting the IP blocked
        print(searchName+"----image download finished--------->")

if __name__ == '__main__':
    spider = BaiduImageSpider()
    spider.json_count = 10   # download 10 pages, i.e. up to 300 images
    spider.run()
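
The pure-Python steps of the spider can be checked offline. A sketch only: the search term is an example, and sample_body is a hypothetical response whose shape is assumed from the fields the spider reads (data, thumbURL).

```python
import json
from urllib import parse

# Percent-encode a search term, as run() does before filling the URL template.
encoded = parse.quote("猫")
print(encoded)

# Offsets passed as pn for the first three pages, using the same formula as run().
offsets = [(index + 1) * 30 for index in range(3)]
print(offsets)

# Hypothetical response body: two usable entries and one empty entry.
sample_body = (
    '{"data": [{"thumbURL": "https://example.com/1.jpg"}, '
    '{"thumbURL": "https://example.com/2.jpg"}, {}]}'
)
json_info = json.loads(sample_body)

# Keep only the entries that actually carry a thumbURL.
links = [item["thumbURL"] for item in json_info.get("data", []) if item.get("thumbURL")]
print(links)
```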

Using Baidu as a database

Query a media outlet
- where riscv appears in 澎湃新闻 (The Paper)
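
One way such a query could be built is sketched below. Everything here is an assumption: that Baidu web search takes the query in the wd parameter, that a site: prefix restricts results to one domain, and that thepaper.cn is the domain to search. The source does not show the query it actually used.

```python
from urllib.parse import urlencode

# Assumptions: Baidu's web search reads the query from `wd`,
# and a `site:` prefix restricts results to one domain.
query = "site:thepaper.cn riscv"
search_url = "https://www.baidu.com/s?" + urlencode({"wd": query})
print(search_url)
```

The resulting URL could then be fetched with requests.get and the same headers as before.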

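Using Baidu as a database

Query the 搏击周评 account
- for the articles it has published
- about jiu-jitsu

Summary

This time we crawled baidu.com
- found three groups of links
- and traversed each of them
But building the headers by hand is a bit tedious
Is there a faster way to generate headers?
More on that next time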
