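Building a keyword library is one of the core tasks in SEO. How do you get more keywords? The usual approach is to find a batch of seed terms, treat them as roots, and run them through a keyword expansion tool to generate more long-tail keywords.
So where do the roots come from? One of the better sources is a competitor's keyword library. Today's post covers how to use Python to collect a competitor's keyword library from Zhanzhang Tools (站长工具, chinaz.com).
In testing, without logging in to chinaz.com you can view at most the first 10 pages of a keyword list. Logged in, you can view at most the first 57 pages. Going beyond that requires a VIP subscription.
I have not signed up for VIP, so the script only collects the first 57 pages available when logged in: 1,140 keywords with a Baidu Index in total (20 per page). Sample code follows: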
import random
import re
import time

import browsercookie
import requests


class ZhanZhang(object):
    MAX_PAGES = 57  # page limit for a logged-in, non-VIP account

    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
        # One domain per line; utf-8-sig drops a BOM if the file has one
        with open('domains.txt', encoding='utf-8-sig') as f:
            self.domains = [line.strip() for line in f if line.strip()]

    def get_cookies(self):
        # Reuse the chinaz.com login cookies stored by the local Chrome browser
        return {c.name: c.value for c in browsercookie.chrome() if 'chinaz.com' in c.domain}

    def curl(self, url, retries=3):
        # Fetch a page, retrying up to `retries` times; returns None if every attempt fails
        cookies = self.get_cookies()
        for attempt in range(retries + 1):
            try:
                response = requests.get(url, cookies=cookies, headers=self.headers, timeout=5)
                response.encoding = 'utf-8'
                return response.text
            except requests.RequestException:
                if attempt < retries:
                    print('Request failed, retry %s' % (attempt + 1))
        return None

    def page(self, host):
        # Read the total page count from the first page, capped at MAX_PAGES
        html = self.curl('http://rank.chinaz.com/?host=%s' % host)
        time.sleep(random.uniform(1, 2))
        # "共N页" is the "N pages in total" label on the site
        data = re.search(r'共(\d+?)页', html or '')
        num = min(int(data.group(1)), self.MAX_PAGES) if data else 1
        return ['http://rank.chinaz.com/%s-0--%s' % (host, i) for i in range(1, num + 1)]

    def words(self, host):
        # Collect (keyword, index) pairs from every list page of one domain
        words = []
        for url in self.page(host):
            html = self.curl(url)
            time.sleep(random.uniform(1, 3))
            if html is None:
                continue  # skip pages that could not be fetched
            kws = re.findall(r'<li class="ReListCent ReLists bor-b1s clearfix">([\s\S]*?)</div></li>', html)
            for kw in kws:
                keys = re.findall(r'class="ellipsis block">(.*?)</a><div class="folwc"[\s\S]*?<div class="w8-0"><a href=".*?" target="_blank">(.*?)</a></div>', kw)
                words.extend(keys)  # rows that do not match add nothing
        return words

    def save_file(self):
        # Write one "keyword<TAB>index" line per keyword to words.txt
        with open('words.txt', 'w', encoding='utf-8') as f:
            for host in self.domains:
                print('++++++ Collecting keywords for %s ++++++' % host)
                for word, index in self.words(host):
                    f.write('%s\t%s\n' % (word, index))


if __name__ == '__main__':
    ZhanZhang().save_file()
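To see what the two regular expressions in words() pull out, here is a minimal sketch that runs them offline against a hand-written HTML fragment. The fragment is an assumption shaped to fit the patterns, not real chinaz.com output, and parse_keywords is a hypothetical helper name; the live page markup may differ or change over time.

```python
import re

# Same two patterns as in words(): the first isolates each list row,
# the second extracts the keyword and its index from one row.
ROW_PATTERN = r'<li class="ReListCent ReLists bor-b1s clearfix">([\s\S]*?)</div></li>'
FIELD_PATTERN = (r'class="ellipsis block">(.*?)</a><div class="folwc"[\s\S]*?'
                 r'<div class="w8-0"><a href=".*?" target="_blank">(.*?)</a></div>')

# Hand-written fragment (assumption), built only to exercise the patterns.
SAMPLE_HTML = (
    '<ul>'
    '<li class="ReListCent ReLists bor-b1s clearfix">'
    '<div class="w25-0"><a href="/kw/1" class="ellipsis block">seo tools</a>'
    '<div class="folwc"></div></div>'
    '<div class="w8-0"><a href="/index/1" target="_blank">320</a></div>'
    '<div class="w8-0">1</div></li>'
    '<li class="ReListCent ReLists bor-b1s clearfix">'
    '<div class="w25-0"><a href="/kw/2" class="ellipsis block">keyword research</a>'
    '<div class="folwc"></div></div>'
    '<div class="w8-0"><a href="/index/2" target="_blank">150</a></div>'
    '<div class="w8-0">2</div></li>'
    '</ul>'
)


def parse_keywords(html):
    """Return a list of (keyword, index) string pairs found in the HTML."""
    pairs = []
    for row in re.findall(ROW_PATTERN, html):
        # Rows that do not match the field pattern contribute nothing.
        pairs.extend(re.findall(FIELD_PATTERN, row))
    return pairs


if __name__ == '__main__':
    for word, index in parse_keywords(SAMPLE_HTML):
        print('%s\t%s' % (word, index))
```

Testing the patterns on a saved page like this makes it easier to notice when a markup change causes the scraper to return nothing.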
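The code requires the requests and browsercookie libraries: requests handles the HTTP requests, and browsercookie reads the post-login cookies from the browser. Put the domains to crawl in domains.txt, one per line, without the http:// prefix. When the script finishes, the results are written to words.txt: each keyword together with its overall Baidu Index.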
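Since the script expects bare hostnames in domains.txt, a small cleanup step can guard against entries pasted with a scheme or path. The following is a sketch only; normalize_domain and load_domains are hypothetical helpers that are not part of the script above.

```python
def normalize_domain(line):
    """Strip whitespace, an http:// or https:// prefix, and any path."""
    host = line.strip()
    for scheme in ('http://', 'https://'):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]
            break
    # Keep only the part before the first slash (drops paths and trailing '/').
    return host.split('/', 1)[0]


def load_domains(lines):
    """Normalize each line, skip blanks, and drop duplicates in order."""
    seen = set()
    domains = []
    for line in lines:
        host = normalize_domain(line)
        if host and host not in seen:
            seen.add(host)
            domains.append(host)
    return domains


if __name__ == '__main__':
    raw = ['http://www.example.com/\n', '\n', 'example.org/path\n', 'www.example.com\n']
    print(load_domains(raw))
```

In the script, the same function could be applied to the lines read from domains.txt before the list is stored.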
This post uses chinaz.com as the example, but the same approach works for similar sites such as Aizhan (爱站网) and 5118. Combining several sites yields a larger and more complete keyword set.
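Combining results from several sites mostly comes down to de-duplicating keywords. The sketch below assumes every source has been saved in the same "keyword, tab, index" layout as words.txt and that index values are plain integers; it keeps the highest index seen for each keyword. Index figures from different sites may not be directly comparable, so treat that rule as one possible choice. parse_lines and merge_keywords are hypothetical names.

```python
def parse_lines(lines):
    """Parse 'keyword<TAB>index' lines, skipping any line that does not fit."""
    pairs = []
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue
        pairs.append((parts[0].strip(), int(parts[1].strip())))
    return pairs


def merge_keywords(*sources):
    """Merge (keyword, index) lists, keeping the highest index per keyword."""
    merged = {}
    for pairs in sources:
        for word, index in pairs:
            if index > merged.get(word, -1):
                merged[word] = index
    # Highest index first; ties are ordered alphabetically.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


if __name__ == '__main__':
    site_a = parse_lines(['seo tools\t320\n', 'keyword research\t150\n'])
    site_b = parse_lines(['seo tools\t300\n', 'long tail keywords\t90\n'])
    for word, index in merge_keywords(site_a, site_b):
        print('%s\t%s' % (word, index))
```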
PS: Follow the official account and reply with the keyword 【站长工具】 to get this code. If anything is unclear, send me a private message.