Goal
Scrape Weibo's trending posts, paging through the feed by scrolling.

Tools
puppeteer, cheerio
Approach
Use puppeteer to render the page in a simulated browser, then use cheerio to select the DOM nodes and extract the data.
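The scrolling part boils down to a small control loop: scroll one step, count what has been collected so far, and stop once the count stops growing. Below is a minimal sketch of that loop on its own, with the browser work hidden behind a callback. The names `Step`, `hasPlateaued`, and `collectUntilStable` are made up for illustration and are not part of puppeteer or cheerio; the defaults (50 rounds, 1000 items, 200 ms) mirror the values used in the full code.

```typescript
// Hypothetical callback: performs one scroll-and-parse round and
// resolves to the total number of items collected so far.
type Step = (round: number) => Promise<number>;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// True when more than `window` rounds have run and the last `window`
// recorded totals are identical (totals never decrease).
function hasPlateaued(counts: number[], window: number): boolean {
  if (counts.length <= window) return false;
  return counts[counts.length - 1] === counts[counts.length - window];
}

// Repeat `step` until the total plateaus or exceeds `maxItems`.
// Returns the total recorded after each round.
async function collectUntilStable(
  step: Step,
  window = 50,
  maxItems = 1000,
  delayMs = 200
): Promise<number[]> {
  const counts: number[] = [];
  for (let round = 1; ; round++) {
    counts.push(await step(round));
    const latest = counts[counts.length - 1];
    if (hasPlateaued(counts, window) || latest > maxItems) return counts;
    await sleep(delayMs); // give the page time to load the next batch
  }
}
```

In the scraper, `step` would scroll the page, parse the rendered HTML, and return the current length of the result array. Using a plain loop here is only one option; a recursive timer works just as well.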
Here is the code:
import * as cheerio from "cheerio";
import chalk from "chalk"; // a library for colorizing console output
import fs from "fs";
import crypto from "crypto";
import puppeteer from "puppeteer";

const log = console.log; // shorthand for console.log

// Shape of one scraped post
interface listType {
  id: string;
  time: string;
  from: string;
  description: string;
  imgList: string[];
  forward: string;
  discuss: string;
  fabulous: string;
}
class Reptile {
  // https://weibo.com/u/5587951849
  // https://weibo.com/newlogin?tabtype=list&gid=1028039999&url=https%3A%2F%2Fweibo.com%2F
  private url =
    "https://weibo.com/newlogin?tabtype=list&gid=1028039999&url=https%3A%2F%2Fweibo.com%2F";

  async getHtml() {
    // Launch a browser environment with Puppeteer
    const browser = await puppeteer.launch({
      headless: true, // set to false to open a visible browser window
    });
    // Create a new page
    const page = await browser.newPage();
    // Set the viewport size used for rendering
    await page.setViewport({
      width: 1200,
      height: 900,
      deviceScaleFactor: 1,
    });
    // Disable the navigation timeout (0 means wait indefinitely)
    page.setDefaultNavigationTimeout(0);
    // Load the page
    await page.goto(this.url);
    // Wait for the page to finish loading
    await page.reload();
    await page.waitForNavigation();
    log(chalk.yellow("Initial page load finished"));
    let num = 0;
    const getLen: number[] = []; // item count recorded after each round
    const data: listType[] = [];
    // Scroll down to load more data, one round per call
    const loadData = async () => {
      log(chalk.blue(`Round ${num}: ${data.length} items collected so far`));
      num++;
      // page.click(".navbtmbox");
      await page.evaluate((num: number) => {
        window.scrollTo(0, num * 900);
      }, num);
      const content = await page.content();
      // Parse the rendered HTML with cheerio
      const $ = cheerio.load(content);
      const list = $(".vue-recycle-scroller__item-wrapper").find(
        ".vue-recycle-scroller__item-view"
      );
      list.each((i, el) => {
        const item = $(el).find(".woo-panel-main");
        // Collect the image URLs of this post
        const arr: string[] = [];
        item
          .find(".woo-box-wrap")
          .find(".woo-picture-img")
          .each((j, img) => {
            const src = $(img).attr("src");
            if (src) arr.push(src);
          });
        const time = item.find(".head-info_time_6sFQg").text();
        const from = item.find(".head-info_cut_1tPQI").text();
        const nums = item.find(".toolbar_num_JXZul");
        // Hash time + source + counts into an id used for de-duplication
        const id = crypto
          .createHash("md5")
          .update(`${time}${from}${nums.text()}`)
          .digest("hex");
        // Skip posts already collected in an earlier round
        if (!data.some((d) => d.id === id)) {
          data.push({
            id: id,
            time: time,
            from: from,
            description: item.find(".detail_wbtext_4CRf9").text(),
            imgList: arr,
            // Assumes the first toolbar count is reposts and the second is comments
            forward: nums.eq(0).text(),
            discuss: nums.eq(1).text(),
            fabulous: item.find(".woo-like-count").text(),
          });
        }
      });
      getLen.push(data.length);
      // Stop when the count has not changed over the last 50 rounds,
      // or once more than 1000 items have been collected
      if (
        (getLen.length > 50 &&
          getLen[getLen.length - 1] === getLen[getLen.length - 50]) ||
        data.length > 1000
      ) {
        try {
          await fs.promises.writeFile("./src/index.html", content, "utf8");
          log(chalk.green("DOM written successfully"));
        } catch (error) {
          console.log(error);
        }
        try {
          await fs.promises.writeFile(
            "./src/data.json",
            JSON.stringify(data),
            "utf8"
          );
          // Elapsed time is estimated from the 200 ms delay between rounds only
          log(
            chalk.green(
              `Scraped ${data.length} items in about ${
                (num * 200) / 1000
              }s, written successfully`
            )
          );
        } catch (error) {
          console.log(error);
        }
        await page.close();
        await browser.close();
      } else {
        // Wait 200 ms, then run the next round
        setTimeout(async () => {
          await loadData();
        }, 200);
      }
    };
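    await loadData();
  }

  constructor() {
    // Start scraping immediately; log any error instead of leaving it unhandled
    this.getHtml().catch((error) => console.log(error));
  }
}

new Reptile();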
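Because the feed is a virtual scroller, the same post shows up in several consecutive rounds, which is why the code above hashes a few fields into an id and skips ids it has already seen. The same idea can be sketched with a `Map`, so each lookup does not have to scan the whole result array. This is an illustrative variant, not the code used above: `Post`, `postId`, and `addUnique` are hypothetical names, and the `\u0000` separator between fields is an addition to avoid two different posts concatenating to the same string.

```typescript
import { createHash } from "crypto";

// Hypothetical record shape; only the fields needed for this sketch.
interface Post {
  time: string;
  from: string;
  counts: string;
  description: string;
}

// Hash the identifying fields of a post into a stable 32-character hex id.
function postId(post: Pick<Post, "time" | "from" | "counts">): string {
  return createHash("md5")
    .update(`${post.time}\u0000${post.from}\u0000${post.counts}`)
    .digest("hex");
}

// Add posts whose id is not in `seen` yet; returns how many were new.
function addUnique(seen: Map<string, Post>, posts: Post[]): number {
  let added = 0;
  for (const post of posts) {
    const id = postId(post);
    if (!seen.has(id)) {
      seen.set(id, post);
      added++;
    }
  }
  return added;
}

// Usage: feed every round's parsed posts into the same map.
const seen = new Map<string, Post>();
const round1: Post[] = [
  { time: "10:00", from: "iPhone", counts: "12", description: "first" },
  { time: "10:05", from: "Android", counts: "3", description: "second" },
];
addUnique(seen, round1);
addUnique(seen, round1); // the second pass adds nothing
```

One caveat applies to any id built this way: if a post's counts change between two rounds, its hash changes too and it will be stored twice.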
Create a TypeScript file, for example index.ts, and run it directly.

[Screenshot: the scraped data]
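The `forward`, `discuss`, and `fabulous` fields are stored as the raw text shown on the page, so they need converting before they can be sorted or summed. Here is a small sketch of such a conversion. It assumes the page abbreviates large numbers with "万" (ten thousand), sometimes followed by "+", and shows a text label instead of a number when the count is zero; `parseCount` is a hypothetical helper, and the assumed formats should be checked against the actual data.json.

```typescript
// Convert a count string such as "356" or "1.2万" to a number.
// Assumption: "万" means ten thousand; anything that does not look like
// a number (for example a label shown in place of a zero count) becomes 0.
function parseCount(text: string): number {
  const match = text.trim().match(/^(\d+(?:\.\d+)?)(万)?\+?$/);
  if (!match) return 0;
  const value = parseFloat(match[1]);
  return Math.round(match[2] ? value * 10000 : value);
}

// Usage: sort posts by like count, highest first.
const posts = [
  { description: "a", fabulous: "356" },
  { description: "b", fabulous: "1.2万" },
  { description: "c", fabulous: "赞" },
];
const sorted = [...posts].sort(
  (x, y) => parseCount(y.fabulous) - parseCount(x.fabulous)
);
```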
For learning purposes only. If anything here infringes on your rights, please contact the author and it will be amended or removed.