Scraping article titles from the PTT movie board with Python

Feb 10, 2020

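First, search Google for "ptt 電影版" (the PTT movie board) and open the board.

Our goal is to grab every normal title on the PTT movie board, meaning the titles of posts that have not been deleted.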

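Press F12 to open the browser's developer tools and find your own User-Agent. Sending it makes the program look like a person connecting with a browser, so the connection is less likely to be refused.

Click the Network tab, then reload the page (or press F5). A list of files appears, with details about the selected file on the right.

Click index.html and scroll to the bottom of the panel beside it. The User-Agent line there is what we want.

For example, 廣利's User-Agent is Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36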
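As a rough sketch of how that string gets used, the snippet below attaches it to a urllib request and reads it back. Nothing is sent over the network here. The board URL is an assumption, and the User-Agent is just the example above, so swap in your own values.

```python
import urllib.request as req

# Assumed board URL and the example User-Agent from above; use your own values.
url = "https://www.ptt.cc/bbs/movie/index.html"
user_agent = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36"
)

# Building the Request object does not open a connection yet.
request = req.Request(url, headers={"User-Agent": user_agent})

# urllib stores header names capitalized, so the key becomes "User-agent".
print(request.get_header("User-agent"))
```

If the printed value matches what the browser shows, the header is set the way the crawler needs it.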

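Next, go to the board's page and inspect the HTML source to see what distinguishes the titles from the markup around them. Then let bs4 parse the page and pull out the parts we want.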

bs4 is beautifulsoup4. The install command is:

pip install beautifulsoup4

That's it, installed!
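To get a feel for what bs4 does before touching the real page, here is a small sketch that runs on a hand-written HTML sample. The sample only imitates the board's listing markup (title text sitting in a div with class="title") and is an assumption, not a copy of the live page. The second entry mimics a deleted post, which has no a tag.

```python
import bs4

# Hand-written sample (hypothetical), imitating the listing markup.
sample_html = """
<div class="r-ent"><div class="title"><a href="/bbs/movie/M.1.A.html">[討論] Sample title one</a></div></div>
<div class="r-ent"><div class="title">(本文已被刪除) [someone]</div></div>
<div class="r-ent"><div class="title"><a href="/bbs/movie/M.2.A.html">[新聞] Sample title two</a></div></div>
"""

root = bs4.BeautifulSoup(sample_html, "html.parser")

found = []
for div in root.find_all("div", class_="title"):
    # Deleted posts have no <a> tag, so div.a is None and they are skipped.
    if div.a is not None:
        found.append(div.a.get_text(strip=True))

print(found)
```

Only the two entries that contain a link end up in found, which is the same filter the crawler applies to the real page.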

Once the program is written, run the crawler.

# Fetch the HTML source of the PTT movie board

#---------------crawler.py starts below---------------------

import urllib.request as req
import bs4

# Address of the board's index page (assumed here; use the one shown in your browser)
url = "https://www.ptt.cc/bbs/movie/index.html"

# Build a Request object with the Request Headers attached
request = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36"
})

with req.urlopen(request) as response:
    data = response.read().decode("utf-8")

#print(data)

# Parse the source with BeautifulSoup to get each article's title
root = bs4.BeautifulSoup(data, "html.parser")

# Find every div tag with class="title"
titles = root.find_all("div", class_="title")

for title in titles:
    # Print the title only if it contains an a tag (the post was not deleted)
    if title.a is not None:
        print(title.a.string)

#---------------crawler.py ends above---------------------

Then, at the command line, type python crawler.py

Press Enter to run it, and the titles are printed!

Then copy the results, paste them into Excel, and tidy them up!
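If copying and pasting gets tedious, one option is to write the titles to a CSV file that Excel can open directly. This is only a sketch: save_titles and the file name ptt_movie_titles.csv are hypothetical names made up for illustration, and the titles in the example call are placeholders rather than real results.

```python
import csv

def save_titles(titles, path):
    """Write one title per row to a CSV file (hypothetical helper)."""
    # utf-8-sig adds a byte order mark, which helps Excel detect UTF-8
    # so that Chinese titles are not shown as garbled text.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])  # header row
        for t in titles:
            writer.writerow([t])

if __name__ == "__main__":
    # Placeholder titles; in crawler.py you would collect title.a.string instead.
    save_titles(["[討論] Sample title one", "[新聞] Sample title two"],
                "ptt_movie_titles.csv")
```

In the crawler, this would mean appending each title to a list inside the loop and calling save_titles once at the end.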

CC BY-NC-ND 2.0
