A simple guide to web scraping with Python and Beautiful Soup

Author: Ayush Sharma, 2021-12-16 15:09:45

Today we'll look at how to use the Beautiful Soup library to extract content from an HTML page, and then convert that content into a Python list or dictionary.


The Beautiful Soup library in Python makes it easy to extract HTML content from web pages.


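What is web scraping, and why do I need it?

The answer is simple: not every website has an API for fetching content. You might want to pull recipes from your favorite cooking site or photos from a travel blog. Without an API, extracting the HTML (that is, scraping) may be the only way to get that content. I'll show you how to do it in Python.

Not all websites take kindly to being scraped, and some prohibit it explicitly. Check with the site's owner that scraping is okay.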
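One common machine-readable signal of a site's preferences is its robots.txt file. It does not replace asking the owner, but it is a quick first check. The sketch below uses Python's standard urllib.robotparser module; the robots.txt rules, the user agent name "my-scraper", and the example.com URLs are all made up for illustration so the code runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" and the example.com URLs are placeholders.
allowed_public = is_allowed(
    SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/technology")
allowed_private = is_allowed(
    SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/notes")

print(allowed_public, allowed_private)
```

Against a real site, you would load the live file with the parser's set_url() and read() methods instead of parsing an inline string.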

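How do you scrape a website with Python?

Scraping with Python takes three basic steps:

Fetch the HTML content using the requests library
Analyze the HTML structure and identify the tags that hold the content we need
Extract those tags with Beautiful Soup and put the data in a Python list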

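Installing the libraries

First, install the libraries we need. The requests library fetches the HTML content from a website, and Beautiful Soup parses the HTML and converts it into Python objects. To install them for Python 3, run: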

pip3 install requests beautifulsoup4
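If you want to confirm that the install worked, a small check like the one below may help. It is only a sketch, and the helper name check_modules is my own invention. Note that the pip package beautifulsoup4 is imported under the name bs4.

```python
import importlib.util

# Import names can differ from pip package names:
# the beautifulsoup4 package is imported as bs4.
REQUIRED_MODULES = ["requests", "bs4"]


def check_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_modules(REQUIRED_MODULES)
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```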

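Fetching the HTML

For this example, I'll scrape the Technology section of this website. If you go to that page, you'll see a list of articles with titles, excerpts, and publication dates. Our goal is to build a list of articles containing that information.

The full URL of the page is: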

https://notes.ayushsharma.in/technology

We can fetch the HTML content from this page using requests:

#!/usr/bin/python3
import requests
 
url = 'https://notes.ayushsharma.in/technology'
 
data = requests.get(url)
 
print(data.text)

The variable data holds the response, and data.text contains the page's HTML source code.
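The script above assumes the request succeeds. A slightly more defensive variant might look like the sketch below. The function name fetch_html and the 10-second timeout are my own choices, not part of the original script; timeout, raise_for_status(), and requests.RequestException are standard parts of the requests library.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails."""
    try:
        # Without a timeout, requests can wait indefinitely on a stalled server.
        response = requests.get(url, timeout=timeout)
        # Turn 4xx and 5xx status codes into exceptions instead of parsing
        # an error page as if it were content.
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text
```

Calling fetch_html('https://notes.ayushsharma.in/technology') would then return either the HTML text or None, which the rest of the script can check before parsing.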

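Extracting content from the HTML

To extract our data from data, we need to work out which tags hold the content we need.

If you skim through the HTML, you'll find this section near the top: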

<div class="col">
  <a href="/2021/08/using-variables-in-jekyll-to-define-custom-content" class="post-card">
    <div class="card">
      <div class="card-body">
        <h5 class="card-title">Using variables in Jekyll to define custom content</h5>
        <small class="card-text text-muted">I recently discovered that Jekyll's config.yml can be used to define custom
          variables for reusing content. I feel like I've been living under a rock all this time. But to err over and
          over again is human.</small>
      </div>
      <div class="card-footer text-end">
        <small class="text-muted">Aug 2021</small>
      </div>
    </div>
  </a>
</div>

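This is the section that repeats throughout the page for every article. We can see that .card-title holds the article title, .card-text holds the excerpt, and .card-footer > small holds the publication date.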
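Before pointing these selectors at the live page, you can try them on a copy of the markup. The sketch below parses a trimmed version of the card shown above (the excerpt is cut to its first sentence) and uses select_one(), which returns the first match for a CSS selector.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the card markup, inlined so the example runs offline.
SAMPLE_HTML = """
<div class="col">
  <a href="/2021/08/using-variables-in-jekyll-to-define-custom-content" class="post-card">
    <div class="card">
      <div class="card-body">
        <h5 class="card-title">Using variables in Jekyll to define custom content</h5>
        <small class="card-text text-muted">I recently discovered that Jekyll's config.yml can be used to define custom variables for reusing content.</small>
      </div>
      <div class="card-footer text-end">
        <small class="text-muted">Aug 2021</small>
      </div>
    </div>
  </a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Each article is wrapped in an <a class="post-card"> element.
card = soup.select_one("a.post-card")

title = card.select_one(".card-title").get_text()
excerpt = card.select_one(".card-text").get_text()
# The ">" combinator matches a <small> that is a direct child of .card-footer.
pub_date = card.select_one(".card-footer > small").get_text()
# Attributes are read with dictionary-style access.
link = card["href"]

print(title)
print(pub_date)
print(link)
```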

Let's extract this content using Beautiful Soup.

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
from pprint import pprint
 
url = 'https://notes.ayushsharma.in/technology'
data = requests.get(url)
 
my_data = []
 
html = BeautifulSoup(data.text, 'html.parser')
articles = html.select('a.post-card')
 
for article in articles:
 
    title = article.select('.card-title')[0].get_text()
    excerpt = article.select('.card-text')[0].get_text()
    pub_date = article.select('.card-footer small')[0].get_text()
 
    my_data.append({"title": title, "excerpt": excerpt, "pub_date": pub_date})
 
pprint(my_data)

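The code above extracts the article details and puts them in the my_data variable. I used pprint to pretty-print the output, but you can leave it out of your own code. Save the code above in a file named fetch.py, then run it: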

python3 fetch.py

If everything went well, you should see:

[{'excerpt': "I recently discovered that Jekyll's config.yml can be used to "
             "define custom variables for reusing content. I feel like I've "
             'been living under a rock all this time. But to err over and over '
             'again is human.',
  'pub_date': 'Aug 2021',
  'title': 'Using variables in Jekyll to define custom content'},
 {'excerpt': "In this article, I'll highlight some ideas for Jekyll "
             'collections, blog category pages, responsive web-design, and '
             'netlify.toml to make static website maintenance a breeze.',
  'pub_date': 'Jul 2021',
  'title': 'The evolution of ayushsharma.in: Jekyll, Bootstrap, Netlify, '
           'static websites, and responsive design.'},
 {'excerpt': "These are the top 5 lessons I've learned after 5 years of "
             'Terraform-ing.',
  'pub_date': 'Jul 2021',
  'title': '5 key best practices for sane and usable Terraform setups'},
 
... (truncated)

And that's all it takes! In 22 lines of code, we built a web scraper in Python. You can find the source code in my example repo.
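One caveat: fetch.py indexes [0] on each select() result, so it raises an IndexError if a card is missing one of those elements. A more forgiving variant could look like the sketch below. The helper names text_or_none and extract_articles and the sample markup are hypothetical, and the whitespace cleanup is an extra I added, not something the original script does.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the whitespace-normalized text of the first match, or None."""
    node = parent.select_one(selector)
    if node is None:
        return None
    # Collapse newlines and runs of spaces left over from HTML indentation.
    return " ".join(node.get_text().split())


def extract_articles(html_text):
    """Build a list of article dicts, tolerating cards with missing parts."""
    soup = BeautifulSoup(html_text, "html.parser")
    articles = []
    for card in soup.select("a.post-card"):
        articles.append({
            "title": text_or_none(card, ".card-title"),
            "excerpt": text_or_none(card, ".card-text"),
            "pub_date": text_or_none(card, ".card-footer small"),
        })
    return articles


# Hypothetical markup: the second card has no excerpt and no footer.
SAMPLE = """
<a class="post-card"><h5 class="card-title">First   post</h5>
<small class="card-text">Spans
   two lines</small>
<div class="card-footer"><small>Aug 2021</small></div></a>
<a class="post-card"><h5 class="card-title">Second post</h5></a>
"""

result = extract_articles(SAMPLE)
print(result)
```

Missing fields come back as None, so the caller can decide whether to skip the entry or keep it with gaps.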

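Conclusion

With the website's content in a Python list, we can now do cool things with it. We could return it as JSON to another application, or convert it to HTML with custom styling. Feel free to copy and paste the code above and experiment with your favorite website.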
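As a rough sketch of both ideas, the snippet below serializes a list shaped like my_data to JSON and renders it as a simple HTML list. It uses only the standard library. The to_html helper and its markup are my own illustration, and the single sample entry is copied from the output shown earlier with the excerpt cut to its first sentence.

```python
import json
from html import escape

# One sample entry in the same shape that fetch.py produces.
my_data = [
    {
        "title": "Using variables in Jekyll to define custom content",
        "excerpt": "I recently discovered that Jekyll's config.yml can be "
                   "used to define custom variables for reusing content.",
        "pub_date": "Aug 2021",
    }
]

# Option 1: serialize the list to JSON for another application.
as_json = json.dumps(my_data, indent=2, ensure_ascii=False)


def to_html(articles):
    """Render the articles as an HTML list, escaping all scraped text."""
    items = []
    for article in articles:
        items.append(
            f"<li><strong>{escape(article['title'])}</strong> "
            f"({escape(article['pub_date'])})</li>"
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Option 2: convert the list to HTML.
as_html = to_html(my_data)

print(as_json)
print(as_html)
```

Escaping matters here because scraped text can contain characters such as < and & that would otherwise be interpreted as markup.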

Have fun, and keep coding.

Editor: 龐桂玉. Source: Linux中國
