
Brief overview:
Without further ado, here's the word cloud first. Let's see what Jacky Cheung actually sang about in the 108 albums and 1,300+ songs on NetEase Cloud Music, from 《BTB 3EP張學友 + 黃凱芹》 (1985-01-01) through 《醒著做夢》 (2014-12-23)!
First fetch the lyrics of all of Jacky Cheung's songs from NetEase Cloud Music, deduplicate them, and join everything into one large text; then run word segmentation over it to extract all the nouns and verbal nouns. The segmented results (word, weight) look like this:
愛情 (love) 0.07072471042756841
世界 (world) 0.06268317887188746
感覺 (feeling) 0.05650215736324936
眼淚 (tears) 0.044849511174034074
傷心 (heartbreak) 0.039332645851527105
思念 (longing) 0.03834038757061952
世間 (the world) 0.03682804854767579
故事 (story) 0.03475690253214765
眼睛 (eyes) 0.03445409416550336
天空 (sky) 0.03350860508998348
流淚 (shedding tears) 0.03312398416346928
夢想 (dream) 0.03280878370931647
心窩 (heart) 0.03243319396705214
記憶 (memory) 0.03158641388911513
風雨 (wind and rain) 0.031456840476264326
星光 (starlight) 0.031069493653815176
笑容 (smile) 0.030964867927921528
心痛 (heartache) 0.0307011710153588
空虛 (emptiness) 0.029434459508895196
春風 (spring breeze) 0.02918087252372948
感情 (affection) 0.028306639850443988
癡心 (devotion) 0.02801903059970057
深宵 (late night) 0.027155899335446567
光陰 (passing time) 0.02679152211589159
情人 (lover) 0.026615621955302014
黑夜 (dark night) 0.026111296684488385
我會 ("I will") 0.025771411677059368
人生 (life) 0.02545445923579143
背影 (departing figure) 0.02465825420524522
The word cloud was generated with Tagul, an excellent online word-cloud generator: just download a Unicode font such as arial unicode ms.ttf and paste the list above into the word-cloud form to produce a slick cloud. Tagul (now WordArt): https://wordart.com/gallery
The lyric fetching and word segmentation are done in Python, mainly using the bs4, requests, jieba, and MySQLdb modules, each of which installs quickly with pip install xxx. Since we only need Jacky Cheung's lyrics, Scrapy would be overkill here; use the Scrapy framework if you want to crawl the whole site.
Strategy & approach:
Getting data out of NetEase Music is not easy: most content is loaded dynamically and the anti-crawling measures are fairly thorough. If you're a beginner you will run into plenty of pitfalls here, so here is a brief outline of the crawling approach, for reference only.

因?yàn)橹幌敕治鰧W(xué)友唱了什么,所以就直接在搜索欄搜學(xué)友,點(diǎn)選所有專輯如上圖,我們看到這樣一串鏈接http://music.163.com/#/artist/album?id=6460,看不出是什么?別急

我們點(diǎn)選到第二頁看看變化,我們看到此時(shí)的url是http://music.163.com/#/artist/album?id=6460&limit=12&offset=12
第三頁http://music.163.com/#/artist/album?id=6460&limit=12&offset=24
我們分析出其實(shí)offset從12變化到了24,有經(jīng)驗(yàn)的爬友們可以知道limit12其實(shí)是每個(gè)頁面有12張專輯,所以每翻一頁url的offset+12,那么我們不用做翻,看他他總頁數(shù)是9頁那么9*12=108,那么學(xué)友至今108張專輯,真的是牛逼了。
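The paging rule can be sketched in a few lines (the artist id 6460, the 12-per-page limit, and the 108-album total all come from the URLs above):

```python
# Reconstruct every page URL from the limit/offset rule observed above.
base = 'http://music.163.com/#/artist/album?id=6460'
limit = 12
total_albums = 108

page_urls = ['{}&limit={}&offset={}'.format(base, limit, offset)
             for offset in range(0, total_albums, limit)]

print(len(page_urls))   # 9
print(page_urls[2])     # third page, offset=24
```

With 108 albums at 12 per page, the comprehension yields exactly the 9 page URLs, matching the second- and third-page URLs seen in the browser.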
那么我們嘗試將url改成http://music.163.com/#/artist/album?id=6460&limit=108 不需要翻頁,讓頁面直接顯示108張專輯,ok我們接下來開始requests.get...Beautifulsoup...。
如果你們真的requests.get...Beautifulsoup...了,那么恭喜你,你成功被耍了。因?yàn)槟惴祷氐捻撁鏇]有你要的數(shù)據(jù)。
Fine! Open Chrome's network panel and capture the traffic. Among the dense list of loaded resources there is one very familiar link; click it and you'll find that the real request URL has no # in it, which means the earlier URL was a decoy.
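Everything after the `#` is a browser-side fragment that never reaches the server, which is why the first attempt came back empty. A minimal sketch of turning the address-bar URL into the real request URL:

```python
# The '#/' part is a client-side fragment; drop it to get the server-side URL
fake_url = 'http://music.163.com/#/artist/album?id=6460&limit=108'
real_url = fake_url.replace('/#/', '/')
print(real_url)  # http://music.163.com/artist/album?id=6460&limit=108
```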

OK, requests.get... BeautifulSoup once more

This time we get the data we want. Grab each album's href, splice it into a full link, and enter the album detail page. Fetching the detail page works the same way: capture the traffic to find the actual request URL

Crawl, crawl, crawl ~ crawl it all ~ leave no blade of grass standing ~
When crawling the track list you need the song id at the end of each link, as shown below

Then splice together the lyric URL using NetEase's lyric API: put the song id into the template below and send one more request to get the full lyrics
http://music.163.com/api/song/lyric?os=pc&id=<song_id>&lv=-1&kv=-1&tv=-1
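Filling the template with a song id looks like this (187449 is the example song id from the notes below):

```python
# Splice a song id into the lyric API template
song_id = 187449
lyric_url = ''.join(['http://music.163.com/api/song/lyric?os=pc&id=',
                     str(song_id),
                     '&lv=-1&kv=-1&tv=-1'])
print(lyric_url)
```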
注意事項(xiàng):
Note that the site frequently returns 503 while crawling, which causes errors or missed pages, and you then can't crawl again for roughly half an hour. So use code along these lines inside the loop to catch failed requests:
if response.status_code != 200:    # '<>' is Python 2 only; use '!='
    requests_error_list.append(response.url)
    time.sleep(300)
With that failure list, retry those URLs at the end. Again check for status 200 first, then dispatch on the URL pattern: URLs like album?id=19008 are album pages, and URLs like song?id=187449 are song pages.
Reference code:
if requests_error_list:
    # Iterate over a snapshot so re-appending failed URLs doesn't disturb the loop
    for url in list(requests_error_list):
        if 'album?id=' in url:
            response = requests_album_url()
        if 'song?id=' in url:
            response = requests_song_url()
        if response.status_code != 200:
            requests_error_list.append(url)
            time.sleep(300)
Crawler code:
import json
import re

import MySQLdb
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

conn = MySQLdb.connect(host="localhost",
                       user="root",
                       passwd="root",
                       db="webspider",
                       charset="utf8",
                       use_unicode=True)
cursor = conn.cursor()
cursor2 = conn.cursor()


def ua_random():
    # Rotate a random Chrome User-Agent on every request
    headers = {
        "User-Agent": UserAgent().chrome,
        "Connection": "keep-alive",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Referer": "http://music.163.com/"
    }
    return headers


# The real album-list URL has no '#'; limit=108 shows all albums on one page
url = 'http://music.163.com/artist/album?id=6460&limit=108'
response = requests.get(url, headers=ua_random())
print(response.status_code)
soup = BeautifulSoup(response.content, 'lxml')
album_href_taglist = soup.select('a.tit.s-fc0')
album_href_list = [_.get('href') for _ in album_href_taglist]
album_id_list = [_.split('=')[1].strip() for _ in album_href_list]
print(album_id_list)

for album_id in album_id_list:
    url2 = ''.join(['http://music.163.com/album?id=', album_id])
    response2 = requests.get(url2, headers=ua_random())
    soup2 = BeautifulSoup(response2.content, 'lxml')
    song_taglist = soup2.select('ul.f-hide a')
    song_href_list = [_.get('href') for _ in song_taglist]
    song_id_list = [_.split('=')[1].strip() for _ in song_href_list]
    album_id = album_id.strip()
    try:
        album_name = soup2.select_one('h2.f-ff2').get_text().strip()
        album_issue_date = soup2.select('p.intr')[1].get_text().split(':')[1].strip()
        try:
            issue_company_name = soup2.select('p.intr')[2].get_text().split(':')[1].strip()
        except IndexError:
            issue_company_name = None
    except (AttributeError, IndexError):
        # Page came back without album info (e.g. a 503); skip it
        album_name = None
        album_issue_date = None
        print(response2.status_code, response2.url)
        continue
    insert_sql = """INSERT INTO album_info(album_id, album_name, album_issue_date, issue_company_name)
                    VALUES (%s, %s, %s, %s)
                 """
    print(album_id, album_name, album_issue_date, issue_company_name)
    cursor.execute(insert_sql, (album_id,
                                album_name,
                                album_issue_date,
                                issue_company_name))
    # The album page embeds its track list as JSON inside a <textarea>
    song_info_temp = soup2.textarea.get_text()
    song_info = json.loads(song_info_temp)
    for v in song_info:
        song_id = v['id']
        song_name = v['name']
        singer = ','.join([_['name'] for _ in v['artists']])
        song_time = None
        url3 = ''.join(['http://music.163.com/api/song/lyric?os=pc&id=',
                        str(song_id),
                        '&lv=-1&kv=-1&tv=-1'])
        response3 = requests.get(url3, headers=ua_random())
        soup3 = BeautifulSoup(response3.content, 'lxml')
        try:
            lrc_temp = json.loads(soup3.get_text())['lrc']['lyric']
            # Non-greedy match strips only the [mm:ss.xx] time tags
            lyric = re.sub(r"\[.*?\]", '', lrc_temp)
        except KeyError:
            lyric = '無歌詞'  # song has no lyrics
        except json.decoder.JSONDecodeError:
            lyric = None
            print(response3.status_code, response3.url)
        insert_sql = """INSERT INTO song_info(song_id, song_name, song_time, singer, lyric, album_id)
                        VALUES (%s, %s, %s, %s, %s, %s)
                     """
        cursor2.execute(insert_sql, (song_id,
                                     song_name,
                                     song_time,
                                     singer,
                                     lyric,
                                     album_id))
conn.commit()
conn.close()
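The re.sub step in the crawler strips the [mm:ss.xx] LRC time tags before the lyric text is stored. A quick illustration on a made-up two-line sample (note the non-greedy `.*?`, which stops at the first closing bracket on each line):

```python
import re

# Sample LRC text: each line starts with a [mm:ss.xx] time tag
lrc_temp = '[00:12.34]只想一生跟你走\n[03:45.67]吻別\n'
lyric = re.sub(r"\[.*?\]", '', lrc_temp)
print(lyric)
```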
jieba segmentation code:
import MySQLdb
import pandas
import jieba.analyse

conn = MySQLdb.connect(host="localhost",
                       user="root",
                       passwd="root",
                       db="webspider",
                       charset="utf8",
                       use_unicode=True)
sql = '''
    SELECT * FROM webspider.album_info AS a
    LEFT JOIN webspider.song_info AS b ON a.album_id = b.album_id
    WHERE b.lyric IS NOT NULL AND b.lyric <> '無歌詞'
'''
df = pandas.read_sql(sql, con=conn)
t_list = df.lyric
# Strip the composer/lyricist credit labels, then deduplicate first by line
# and then by token, so repeated choruses do not skew the weights
all_union_text = ';'.join(t_list).replace('作曲 :', '').replace('作詞 :', '')
text = ','.join(set(','.join(set(all_union_text.split('\n'))).split()))
# TF-IDF keyword extraction, keeping nouns ('n') and verbal nouns ('vn')
keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=True, allowPOS=('n', 'vn'))
for word, weight in keywords:
    print(word, weight)