Python Web Scraping: Basic Tools (1)

Tuesday 29 Jan 2019   even
Tutorial

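Welcome to the first tutorial in this Python web scraping series. If you have read my introduction to Python web scraping, you already have a basic understanding of what a web scraper is.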

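We use Python code to act like an ordinary user: we send a request to a web server by its URL, and once we get a successful response, we extract the information we want from the page source.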

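Starting with this tutorial, I will briefly introduce two basic tools, requests and beautifulsoup, which are used to send requests to a website and to parse the page source, respectively. Once you have a handle on these two, you can scrape a great many websites.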

Let's get familiar with the tools on the simplest kind of site, a static website, using Wikipedia as practice: https://en.wikipedia.org/

First, install the two packages:

pip install requests beautifulsoup4

Once they are installed, start Python and use requests to visit https://en.wikipedia.org/

import requests

r = requests.get('https://en.wikipedia.org/')
r
r.ok

<Response [200]>
True

Check whether r displays as <Response [200]> or whether r.ok is True. If so, the server accepted your request and sent back the page source.
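
The check above can be wrapped in a small helper. This is only a sketch: the function name fetch_html and the 10-second timeout are my own assumptions, not something from the original walkthrough. It uses requests.get with a timeout and raise_for_status(), so a failed request returns None instead of a misleading page.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page source as a string, or None if the request fails.

    fetch_html and the default timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4XX and 5XX responses
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors
        print(f"Request to {url} failed: {exc}")
        return None
    return response.text
```

You could then call html = fetch_html('https://en.wikipedia.org/') and continue only when html is not None.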

In this demo we clearly got a successful response, so next we can look at what we received.

PS: 200 is an HTTP status code. 2XX means success, 3XX redirection, 4XX client error, and 5XX server error.
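
To make the status code families concrete, here is a minimal sketch using only the standard library. The helper name status_category is hypothetical; HTTPStatus comes from Python's built-in http module.

```python
from http import HTTPStatus


def status_category(code):
    """Map an HTTP status code to its family (hypothetical helper)."""
    categories = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # The first digit of the code decides the family
    return categories.get(code // 100, "other")


for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```

With a requests response, the same number is available as r.status_code.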

print(type(r.text))
print(len(r.text))

<class 'str'>
78684

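This is the page source (a plain string) that the server sent back. I usually check its length to confirm that I fetched the right thing.

A browser would render this string into a page with text and images, but in web scraping we want to pull specific information out of the source. That is where the next key package, beautifulsoup, comes in.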

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')

Besides 'html.parser', other parsers are available, such as 'lxml' and 'html5lib' (both need to be installed separately).

Now we can use soup to quickly inspect the page content, for example the title tag:

print(soup.title)

<title>Wikipedia, the free encyclopedia</title>

Find all hyperlinks (the href of every a tag):

for link in soup.find_all('a'):
    print(link.get('href'))

None
#mw-head
#p-search
/wiki/Wikipedia
/wiki/Free_content
/wiki/Encyclopedia
/wiki/Wikipedia:Introduction
/wiki/Special:Statistics
/wiki/English_language
/wiki/Portal:Arts
/wiki/Portal:Biography
/wiki/Portal:Geography
/wiki/Portal:History
/wiki/Portal:Mathematics
/wiki/Portal:Science
/wiki/Portal:Society
/wiki/Portal:Technology
/wiki/Portal:Contents/Portals
/wiki/File:ZETA_reactor_left_side.jpg
/wiki/ZETA_(fusion_reactor)
/wiki/Fusion_power
/wiki/Atomic_Energy_Research_Establishment
/wiki/Neutron
/wiki/Kelvin
/wiki/Nuclear_fusion
/wiki/Laser
/wiki/Tokamak
/wiki/ZETA_(fusion_reactor)
/wiki/Operation_Pamphlet
/wiki/Tottenham_outrage
/wiki/Mario_%26_Sonic_at_the_Olympic_Games
/wiki/Wikipedia:Today%27s_featured_article/January_2019
https://lists.wikimedia.org/mailman/listinfo/daily-article-l
/wiki/Wikipedia:Featured_articles
/wiki/File:FW_Bernstein_(square_crop).jpg
/wiki/F._W._Bernstein
/wiki/Cyclonic_Ni%C3%B1o
/wiki/El_Ni%C3%B1o
/wiki/Simon_Kaloa_Ka%CA%BBai
/wiki/Cullands_Grove
/wiki/David_Gwynne-James
/wiki/King%27s_Shropshire_Light_Infantry
/wiki/Rollercoaster_(Jim_Verraros_album)
/wiki/American_Idol
/wiki/Jim_Verraros
/wiki/George_Michael
/wiki/Green_Day
/wiki/Zhuo-Hua_Pan
/wiki/Optogenetics
/wiki/Lion_Versus
/wiki/Wikipedia:Recent_additions
/wiki/Wikipedia:Your_first_article
/wiki/Template_talk:Did_you_know
/wiki/File:Paul_Kaba_Thieba_(cropped).jpg
/wiki/Afghan_National_Security_Forces
/wiki/Maidan_Shar_attack
/wiki/Maidan_Shar
/wiki/United_Nations
/wiki/United_Nations_Multidimensional_Integrated_Stabilization_Mission_in_Mali#Incidents
/wiki/Al-Qaeda_in_the_Islamic_Maghreb
/wiki/Chad%E2%80%93Israel_relations
/wiki/Burkina_Faso
/wiki/Paul_Kaba_Thieba
/wiki/2019_Burkina_Faso_government_resignation
/wiki/Tlahuelilpan_pipeline_explosion
/wiki/Tlahuelilpan
/wiki/Portal:Current_events
/wiki/Brexit
/wiki/2018%E2%80%9319_United_States_federal_government_shutdown
/wiki/2019_Venezuelan_presidential_crisis
/wiki/Deaths_in_2019
/wiki/Andrew_Fairlie_(chef)
/wiki/Harris_Wofford
/wiki/Marcel_Azzola
/wiki/Russell_Baker
/wiki/Wikipedia:In_the_news/Candidates
/wiki/January_25
/wiki/Calendar_of_saints
/wiki/Gregory_of_Nazianzus
/wiki/Eastern_Orthodox_Church
/wiki/Tatiana_Day
/wiki/File:Mikatagahara_no_tatakai.jpg
/wiki/1573
/wiki/Sengoku_period
/wiki/Takeda_Shingen
/wiki/Tokugawa_Ieyasu
/wiki/Battle_of_Mikatagahara
/wiki/Hamamatsu
/wiki/Mikawa_Province
/wiki/1704
/wiki/Province_of_Carolina
/wiki/Apalachee_massacre
/wiki/Apalachee
/wiki/Spanish_Florida
/wiki/1949
/wiki/Academy_of_Television_Arts_%26_Sciences
/wiki/1st_Primetime_Emmy_Awards
/wiki/Emmy_Award
/wiki/Television_in_the_United_States
/wiki/1995
/wiki/Black_Brant_(rocket)
/wiki/Sounding_rocket
/wiki/Norwegian_rocket_incident
/wiki/Trident_(missile)
/wiki/2011
/wiki/Timeline_of_the_Egyptian_revolution_of_2011
/wiki/Egyptian_revolution_of_2011
/wiki/Hosni_Mubarak
/wiki/Leo_IV_the_Khazar
/wiki/Helene_Bresslau_Schweitzer
/wiki/Ali_Hassan_al-Majid
/wiki/January_24
/wiki/January_25
/wiki/January_26
/wiki/Wikipedia:Selected_anniversaries/January
https://lists.wikimedia.org/mailman/listinfo/daily-article-l
/wiki/List_of_historical_anniversaries
/wiki/File:Sonam_Kapoor_snapped.jpg
/wiki/List_of_accolades_received_by_Neerja
/wiki/Neerja
/wiki/Ram_Madhvani
/wiki/Atul_Kasbekar
/wiki/Sonam_Kapoor
/wiki/Shabana_Azmi
/wiki/Yogendra_Tiku
/wiki/Shekhar_Ravjiani
/wiki/64th_National_Film_Awards
/wiki/National_Film_Awards
/wiki/National_Film_Award_for_Best_Feature_Film_in_Hindi
/wiki/National_Film_Award_%E2%80%93_Special_Jury_Award_(feature_film)
/wiki/62nd_Filmfare_Awards
/wiki/Filmfare_Critics_Award_for_Best_Film
/wiki/Filmfare_Critics_Award_for_Best_Actress
/wiki/Filmfare_Award_for_Best_Director
/wiki/Filmfare_Award_for_Best_Film
/wiki/Screen_Awards
/wiki/Screen_Award_for_Best_Actress
/wiki/Screen_Award_for_Best_Director
/wiki/List_of_accolades_received_by_Neerja
/wiki/List_of_Sites_of_Special_Scientific_Interest_in_Rutland
/wiki/Trans-Tasman_Trophy
/wiki/List_of_Chancellors_of_Germany
/wiki/Wikipedia:Today%27s_featured_list/January_2019
/wiki/Wikipedia:Featured_lists
/wiki/File:Flooded_Albizia_Saman_(rain_tree)_in_the_Mekong.jpg
/wiki/Samanea_saman
/wiki/Fabaceae
/wiki/Stamen
/wiki/Pseudanthium
/wiki/User:Basile_Morin
/wiki/Template:POTD/2019-01-24
/wiki/Template:POTD/2019-01-23
/wiki/Template:POTD/2019-01-22
/wiki/Wikipedia:Picture_of_the_day/January_2019
/wiki/Wikipedia:Featured_pictures
/wiki/Wikipedia:Community_portal
/wiki/Wikipedia:Help_desk
/wiki/Wikipedia:Local_Embassy
/wiki/Wikipedia:Reference_desk
/wiki/Wikipedia:News
/wiki/Wikipedia:Village_pump
/wiki/Wikimedia_Foundation
https://wikimediafoundation.org/our-work/wikimedia-projects/
https://commons.wikimedia.org/wiki/
//commons.wikimedia.org/
https://www.mediawiki.org/wiki/
//mediawiki.org/
https://meta.wikimedia.org/wiki/
//meta.wikimedia.org/
https://en.wikibooks.org/wiki/
//en.wikibooks.org/
https://www.wikidata.org/wiki/
//www.wikidata.org/
https://en.wikinews.org/wiki/
//en.wikinews.org/
https://en.wikiquote.org/wiki/
//en.wikiquote.org/
https://en.wikisource.org/wiki/
//en.wikisource.org/
https://species.wikimedia.org/wiki/
//species.wikimedia.org/
https://en.wikiversity.org/wiki/
//en.wikiversity.org/
https://en.wikivoyage.org/wiki/
//en.wikivoyage.org/
https://en.wiktionary.org/wiki/
//en.wiktionary.org/
/wiki/English_language
/wiki/Special:Statistics
https://de.wikipedia.org/wiki/
https://es.wikipedia.org/wiki/
https://fr.wikipedia.org/wiki/
https://it.wikipedia.org/wiki/
https://nl.wikipedia.org/wiki/
https://ja.wikipedia.org/wiki/
https://pl.wikipedia.org/wiki/
https://pt.wikipedia.org/wiki/
https://ru.wikipedia.org/wiki/
https://sv.wikipedia.org/wiki/
https://vi.wikipedia.org/wiki/
https://zh.wikipedia.org/wiki/
https://ar.wikipedia.org/wiki/
https://id.wikipedia.org/wiki/
https://ms.wikipedia.org/wiki/
https://ca.wikipedia.org/wiki/
https://cs.wikipedia.org/wiki/
https://eo.wikipedia.org/wiki/
https://eu.wikipedia.org/wiki/
https://fa.wikipedia.org/wiki/
https://ko.wikipedia.org/wiki/
https://hu.wikipedia.org/wiki/
https://no.wikipedia.org/wiki/
https://ro.wikipedia.org/wiki/
https://sr.wikipedia.org/wiki/
https://sh.wikipedia.org/wiki/
https://fi.wikipedia.org/wiki/
https://tr.wikipedia.org/wiki/
https://uk.wikipedia.org/wiki/
https://bs.wikipedia.org/wiki/
https://bg.wikipedia.org/wiki/
https://da.wikipedia.org/wiki/
https://et.wikipedia.org/wiki/
https://el.wikipedia.org/wiki/
https://simple.wikipedia.org/wiki/
https://gl.wikipedia.org/wiki/
https://he.wikipedia.org/wiki/
https://hr.wikipedia.org/wiki/
https://lv.wikipedia.org/wiki/
https://lt.wikipedia.org/wiki/
https://ml.wikipedia.org/wiki/
https://nn.wikipedia.org/wiki/
https://sk.wikipedia.org/wiki/
https://sl.wikipedia.org/wiki/
https://th.wikipedia.org/wiki/
https://meta.wikimedia.org/wiki/List_of_Wikipedias
https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=870437359
/wiki/Special:MyTalk
/wiki/Special:MyContributions
/w/index.php?title=Special:CreateAccount&returnto=Main+Page
/w/index.php?title=Special:UserLogin&returnto=Main+Page
/wiki/Main_Page
/wiki/Talk:Main_Page
/wiki/Main_Page
/w/index.php?title=Main_Page&action=edit
/w/index.php?title=Main_Page&action=history
/wiki/Main_Page
/wiki/Main_Page
/wiki/Portal:Contents
/wiki/Portal:Featured_content
/wiki/Portal:Current_events
/wiki/Special:Random
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
//shop.wikimedia.org
/wiki/Help:Contents
/wiki/Wikipedia:About
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
//en.wikipedia.org/wiki/Wikipedia:Contact_us
/wiki/Special:WhatLinksHere/Main_Page
/wiki/Special:RecentChangesLinked/Main_Page
/wiki/Wikipedia:File_Upload_Wizard
/wiki/Special:SpecialPages
/w/index.php?title=Main_Page&oldid=870437359
/w/index.php?title=Main_Page&action=info
https://www.wikidata.org/wiki/Special:EntityPage/Q5296
/w/index.php?title=Special:CiteThisPage&page=Main_Page&id=870437359
/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page
/w/index.php?title=Special:ElectronPdf&page=Main+Page&action=show-download-screen
/w/index.php?title=Main_Page&printable=yes
https://commons.wikimedia.org/wiki/Main_Page
https://www.mediawiki.org/wiki/MediaWiki
https://meta.wikimedia.org/wiki/Main_Page
https://species.wikimedia.org/wiki/Main_Page
https://en.wikibooks.org/wiki/Main_Page
https://www.wikidata.org/wiki/Wikidata:Main_Page
https://en.wikinews.org/wiki/Main_Page
https://en.wikiquote.org/wiki/Main_Page
https://en.wikisource.org/wiki/Main_Page
https://en.wikiversity.org/wiki/Wikiversity:Main_Page
https://en.wikivoyage.org/wiki/Main_Page
https://en.wiktionary.org/wiki/Wiktionary:Main_Page
https://ar.wikipedia.org/wiki/
https://bg.wikipedia.org/wiki/
https://bs.wikipedia.org/wiki/
https://ca.wikipedia.org/wiki/
https://cs.wikipedia.org/wiki/
https://da.wikipedia.org/wiki/
https://de.wikipedia.org/wiki/
https://et.wikipedia.org/wiki/
https://el.wikipedia.org/wiki/
https://es.wikipedia.org/wiki/
https://eo.wikipedia.org/wiki/
https://eu.wikipedia.org/wiki/
https://fa.wikipedia.org/wiki/
https://fr.wikipedia.org/wiki/
https://gl.wikipedia.org/wiki/
https://ko.wikipedia.org/wiki/
https://hr.wikipedia.org/wiki/
https://id.wikipedia.org/wiki/
https://it.wikipedia.org/wiki/
https://he.wikipedia.org/wiki/
https://ka.wikipedia.org/wiki/
https://lv.wikipedia.org/wiki/
https://lt.wikipedia.org/wiki/
https://hu.wikipedia.org/wiki/
https://ms.wikipedia.org/wiki/
https://nl.wikipedia.org/wiki/
https://ja.wikipedia.org/wiki/
https://no.wikipedia.org/wiki/
https://nn.wikipedia.org/wiki/
https://pl.wikipedia.org/wiki/
https://pt.wikipedia.org/wiki/
https://ro.wikipedia.org/wiki/
https://ru.wikipedia.org/wiki/
https://simple.wikipedia.org/wiki/
https://sk.wikipedia.org/wiki/
https://sl.wikipedia.org/wiki/
https://sr.wikipedia.org/wiki/
https://sh.wikipedia.org/wiki/
https://fi.wikipedia.org/wiki/
https://sv.wikipedia.org/wiki/
https://th.wikipedia.org/wiki/
https://tr.wikipedia.org/wiki/
https://uk.wikipedia.org/wiki/
https://vi.wikipedia.org/wiki/
https://zh.wikipedia.org/wiki/
//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
//creativecommons.org/licenses/by-sa/3.0/
//foundation.wikimedia.org/wiki/Terms_of_Use
//foundation.wikimedia.org/wiki/Privacy_policy
//www.wikimediafoundation.org/
https://foundation.wikimedia.org/wiki/Privacy_policy
/wiki/Wikipedia:About
/wiki/Wikipedia:General_disclaimer
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute
https://foundation.wikimedia.org/wiki/Cookie_statement
//en.m.wikipedia.org/w/index.php?title=Main_Page&mobileaction=toggle_view_mobile
https://wikimediafoundation.org/
//www.mediawiki.org/

This pulls out every OOOOXXX from the <a href="OOOOXXX"> ...</a> elements in the source. An a tag without an href attribute yields None.
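
Many of the hrefs printed above are relative (/wiki/...), protocol-relative (//commons.wikimedia.org/), in-page anchors (#mw-head), or missing (None). If you plan to visit them, one possible approach is to normalize them with urljoin from the standard library. The helper name to_absolute and the choice to skip anchors are my own assumptions.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/"


def to_absolute(href, base=BASE_URL):
    """Return an absolute URL, or None for missing hrefs and in-page anchors."""
    if href is None or href.startswith("#"):
        return None
    # urljoin resolves relative and protocol-relative links against the base
    return urljoin(base, href)


samples = [
    None,
    "#mw-head",
    "/wiki/Wikipedia",
    "//commons.wikimedia.org/",
    "https://de.wikipedia.org/wiki/",
]
for href in samples:
    print(href, "->", to_absolute(href))
```

In the loop over soup.find_all('a'), you would pass link.get('href') to to_absolute and keep only the results that are not None.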

If you want to find the div elements whose class includes otd-footer, you can use:

soup.find_all("div", class_="otd-footer")

[<div class="otd-footer hlist noprint" style="text-align: right;">
<ul><li><b><a href="/wiki/Wikipedia:Selected_anniversaries/January" title="Wikipedia:Selected anniversaries/January">Archive</a></b></li>
<li><b><a class="extiw" href="https://lists.wikimedia.org/mailman/listinfo/daily-article-l" title="mail:daily-article-l">By email</a></b></li>
<li><b><a href="/wiki/List_of_historical_anniversaries" title="List of historical anniversaries">List of historical anniversaries</a></b></li></ul>
</div>]

** The syntax here is class_ because class is a reserved keyword in Python, so beautifulsoup uses class_ instead. **
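
For comparison, here is a small sketch of two other ways to express the same class filter: an attrs dictionary and a CSS selector via select. The HTML snippet is made up for illustration so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for the real page source
html = """
<div class="otd-footer hlist noprint"><a href="/wiki/Archive">Archive</a></div>
<div class="other"><a href="/wiki/Other">Other</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Three ways to ask for div elements that carry the class otd-footer
by_keyword = soup.find_all("div", class_="otd-footer")
by_attrs = soup.find_all("div", attrs={"class": "otd-footer"})
by_css = soup.select("div.otd-footer")

print(len(by_keyword), len(by_attrs), len(by_css))
print(by_keyword[0]["class"])
```

All three match a div that has otd-footer as one of several classes, so which one to use is mostly a matter of taste.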

By the same logic, to list the contents of the a tag with title="Special:Statistics":

soup.find("a", title="Special:Statistics").contents

['5,791,601']
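
One thing to watch for: find returns None when nothing matches, so chaining .contents directly raises an AttributeError if the page layout changes. A defensive sketch follows; the helper name article_count and the inline HTML are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for the real page source
html = '<a href="/wiki/Special:Statistics" title="Special:Statistics">5,791,601</a>'
soup = BeautifulSoup(html, "html.parser")


def article_count(soup):
    """Return the article count as an int, or None if the link is absent."""
    tag = soup.find("a", title="Special:Statistics")
    if tag is None:
        return None
    # Strip the thousands separators before converting to int
    return int(tag.get_text(strip=True).replace(",", ""))


print(article_count(soup))
print(article_count(BeautifulSoup("<p>no link here</p>", "html.parser")))
```

Converting the text to an int also makes the value easier to compare or store later.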

Finally, and most importantly: how do you know which part of the source holds the information you are after?

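Open the target site in your browser, right-click the element you want to scrape, and choose Inspect Element. The browser will show you where that element sits in the HTML.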

For example, this is how I looked up the HTML behind the total number of articles on English Wikipedia.

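OK, that covers the basic operation of the basic tools for Python web scraping. With just this much you can already do plenty of interesting things.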

In the next tutorial we will get straight into real web scraping.

See you next time!

Hope you have a good time!

Code for this chapter:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/')
print(r)

print(type(r.text))
print(len(r.text))

soup = BeautifulSoup(r.text, 'html.parser')

print(soup.title)

for link in soup.find_all('a'):
    print(link.get('href'))
    
    
print(soup.find_all("div", class_="otd-footer"))

print(soup.find("a", title="Special:Statistics").contents)
