Web Page Parsing: A Concise Guide to BeautifulSoup

博集华仿

October 11, 2019, 15:14


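Abstract: After fetching a page's HTML document, you have to parse it before you can extract the text you need. In the author's view, BeautifulSoup is the most convenient tool for parsing web pages.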

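00 Installing the bs4 library

pip install bs4

The example in the next section also uses requests, chardet, and the lxml parser:

pip install requests chardet lxml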

01 Parsing HTML


import requests
import chardet
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozillaxxxxxxxxxx'}
link = 'https://xxxxxxxxxxxxxx'
res = requests.get(link, headers=headers, timeout=10)
# Detect the encoding from the raw bytes so that res.text decodes correctly
res.encoding = chardet.detect(res.content)['encoding']
soup = BeautifulSoup(res.text, 'lxml')  # parse the response body with the lxml parser
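As an offline illustration of the encoding step, the sketch below parses raw bytes directly and tells BeautifulSoup which encoding to use through its from_encoding argument. The sample bytes and the choice of GBK are assumptions made for the demonstration, and the built-in html.parser stands in for lxml so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for res.content: a page whose bytes are GBK-encoded.
raw_bytes = "<html><body><p>中文解析</p></body></html>".encode("gbk")

# Passing bytes plus from_encoding lets BeautifulSoup do the decoding itself.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gbk")

decoded_text = soup.p.string
print(decoded_text)
print(soup.original_encoding)  # the encoding BeautifulSoup ended up using
```

When the encoding is not known in advance, detecting it first (as the chardet call above does) and passing the result along is one reasonable approach.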

Take a look at soup:

print(soup)


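The output looks much like the HTML you see when viewing the page source in a browser. For more readable, indented output, use prettify():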

print(soup.prettify())
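To make the difference concrete without a network request, here is a small sketch on a hand-written HTML string (the string is an assumption for illustration, as is the use of the built-in html.parser instead of lxml).

```python
from bs4 import BeautifulSoup

html = "<div><p>Hello</p></div>"
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)       # what print(soup) shows: everything on one line
pretty = soup.prettify()  # one tag or text node per line, indented by depth

print(compact)
print(pretty)
```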


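What BeautifulSoup does is convert the HTML document into a tree structure. The tree contains four kinds of objects: Tag, NavigableString, Comment, and BeautifulSoup. A Tag object corresponds to a tag in the original HTML; a NavigableString object is a piece of text inside a tag; a Comment object is a special type of NavigableString that holds an HTML comment; the BeautifulSoup object represents the document as a whole. The two most important are Tag and NavigableString.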
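The four object types can be seen on a tiny document. This is a minimal sketch: the HTML string is made up for illustration, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = "<html><body><p class='intro'>Hello<!-- a note --></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p            # a Tag
text = p.contents[0]  # a NavigableString holding "Hello"
note = p.contents[1]  # a Comment holding " a note "

kinds = [type(obj).__name__ for obj in (soup, p, text, note)]
print(kinds)

# Comment is a subclass of NavigableString, so filter comments out
# explicitly when only visible text is wanted.
print(isinstance(note, NavigableString))
```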

02 Locating elements (traversal)

Navigating by tag name alone locates only the first matching element:

soup.body.a
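A short sketch of this behavior, on an assumed HTML snippet with two links. find_all is not covered in this section and is used here only to show that a second link really exists.

```python
from bs4 import BeautifulSoup

html = """<html><body>
<a href="/first">first</a>
<div><a href="/second">second</a></div>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.body.a               # only the first <a> under <body>
all_links = soup.body.find_all("a")    # every <a> under <body>
missing = soup.body.table              # None when no such tag exists

print(first_link["href"], len(all_links), missing)
```

Because a missing tag yields None, chained access such as soup.body.table.tr raises AttributeError, so it is worth checking for None first.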

Locate all direct child nodes (contents returns a list, children returns an iterator):

soup.body.contents
soup.body.children

To pick one child by position, index into contents (children is an iterator and cannot be subscripted): soup.body.contents[2]
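The difference between the two attributes can be checked directly. The snippet below is a sketch on an assumed three-child body; in real pages, whitespace between tags also shows up as child strings, so positions may shift.

```python
from bs4 import BeautifulSoup

html = "<body><h1>Title</h1><p>one</p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")

third = soup.body.contents[2]            # contents is a list, so indexing works
as_list = list(soup.body.children)       # materialize the iterator when needed

try:
    soup.body.children[2]                # iterators do not support indexing
    subscript_ok = True
except TypeError:
    subscript_ok = False

print(third, len(as_list), subscript_ok)
```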

Locate all descendant nodes (children, their children, and so on):

soup.body.descendants
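As a rough illustration, the sketch below walks descendants on an assumed nested snippet and separates tags from text nodes, since both appear in the iteration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = "<body><div><p>text <b>bold</b></p></div></body>"
soup = BeautifulSoup(html, "html.parser")

direct_children = len(soup.body.contents)        # only <div>
nodes = list(soup.body.descendants)              # every nested node, depth-first

tag_names = [n.name for n in nodes if isinstance(n, Tag)]
strings = [str(n) for n in nodes if isinstance(n, NavigableString)]

print(direct_children, tag_names, strings)
```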

Locate the parent node: soup.body.parent

Locate all ancestor nodes: soup.body.parents
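A minimal sketch of walking upward, using an assumed snippet. Note that the chain of ancestors ends at the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html = "<html><body><div><p>hi</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p
parent_name = p.parent.name                       # the immediate parent
ancestor_names = [tag.name for tag in p.parents]  # nearest ancestor first
body_parent = soup.body.parent.name

print(parent_name, ancestor_names, body_parent)
```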

Locate sibling nodes (nodes that share the same parent): next_sibling, previous_sibling, next_siblings, previous_siblings.
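The sibling attributes can be sketched on an assumed list with no whitespace between items. In real pages, line breaks between tags become string siblings, so next_sibling often returns a whitespace string rather than the next tag.

```python
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li><li>c</li></ul>"
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")
first, second, third = items

after_second = second.next_sibling.string
before_second = second.previous_sibling.string
following = [s.string for s in first.next_siblings]      # document order
preceding = [s.string for s in third.previous_siblings]  # nearest first

print(after_second, before_second, following, preceding)
```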

Locate adjacent nodes in parse order, regardless of nesting: next_element, next_elements, previous_element, previous_elements.
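The contrast with the sibling attributes is easiest to see side by side. In this sketch (assumed HTML, html.parser in place of lxml), next_sibling jumps to the next node at the same level, while next_element steps to whatever was parsed next, which is usually the tag's own first child.

```python
from bs4 import BeautifulSoup

html = "<div><p>one <b>bold</b></p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p
sibling = first_p.next_sibling       # the second <p>, same level
element = first_p.next_element       # the string "one ", first thing inside <p>

# Everything parsed after the first <p>, in document order.
following = [str(node) for node in first_p.next_elements]
before_b = soup.b.previous_element   # the string parsed just before <b>

print(sibling, repr(element), following, repr(before_b))
```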