Node crawler: scraping Weibo data with puppeteer and scroll pagination

Goal
Scrape Weibo's trending posts, loading further pages by scrolling.


Tools
puppeteer, cheerio


Approach
Use puppeteer to simulate a browser and render the page, then use cheerio to select DOM nodes from the rendered HTML and extract the data.

Here is the code:

import * as cheerio from "cheerio"; // namespace import; recent cheerio releases have no default export
import chalk from "chalk"; // library for colorizing console output
import fs from "fs";
import crypto from "crypto";
import puppeteer from "puppeteer";

const log = console.log; // shorthand for console.log

interface listType {
  id: string;
  time: string;
  from: string;
  description: string;
  imgList: string[];
  forward: string;
  discuss: string;
  fabulous: string;
}

class Reptile {
  // https://weibo.com/u/5587951849
  // https://weibo.com/newlogin?tabtype=list&gid=1028039999&url=https%3A%2F%2Fweibo.com%2F
  private url =
    "https://weibo.com/newlogin?tabtype=list&gid=1028039999&url=https%3A%2F%2Fweibo.com%2F";

  async getHtml() {
    // Launch a browser instance with Puppeteer
    const browser = await puppeteer.launch({
      headless: true, // set to false to open a visible browser window
    });

    // Create a new page
    const page = await browser.newPage();

    // Set the viewport size (setViewport returns a promise, so await it)
    await page.setViewport({
      width: 1200,
      height: 900,
      deviceScaleFactor: 1,
    });

    // Disable the navigation timeout (0 means no limit)
    page.setDefaultNavigationTimeout(0);

    // Load the page
    await page.goto(this.url);

    // Wait for the page to finish loading
    await page.reload();
    await page.waitForNavigation();

    log(chalk.yellow("Initial page load complete"));

    let num = 0;
    let getLen: number[] = [];
    let data: listType[] = [];

    // Scroll down page by page to load more data
    const loadData = async () => {
      log(chalk.blue(`Pass ${num}: ${data.length} items collected so far`));
      num++;
      // page.click(".navbtmbox");
      await page.evaluate((num: number) => {
        window.scrollTo(0, num * 900);
      }, num);

      const content = await page.content();
      // Parse the rendered HTML with cheerio and select the post elements
      const $ = cheerio.load(content);
      const list = $(".vue-recycle-scroller__item-wrapper").find(
        ".vue-recycle-scroller__item-view"
      );
      list.each((i, el) => {
        const item = $(el).find(".woo-panel-main");
        let arr: string[] = [];
        item
          .find(".woo-box-wrap")
          .find(".woo-picture-img")
          .each((j, img) => {
            // Skip images that have no src attribute
            const src = $(img).attr("src");
            if (src) arr.push(src);
          });
        const id = crypto
          .createHash("md5")
          .update(
            `${item.find(".head-info_time_6sFQg").text()}${item
              .find(".head-info_cut_1tPQI")
              .text()}${item.find(".toolbar_num_JXZul").text()}`
          )
          .digest("hex");
        // Skip posts whose id has already been collected
        if (!data.some((d) => d.id === id)) {
          data.push({
            id: id,
            time: item.find(".head-info_time_6sFQg").text(),
            from: item.find(".head-info_cut_1tPQI").text(),
            description: item.find(".detail_wbtext_4CRf9").text(),
            imgList: arr,
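            // Assumption: the first toolbar count is reposts and the second is comments;
            // without eq(), text() concatenates every match and both fields get the same value
            forward: item.find(".toolbar_num_JXZul").eq(0).text(),
            discuss: item.find(".toolbar_num_JXZul").eq(1).text(),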
            fabulous: item.find(".woo-like-count").text(),
          });
        }
      });
      getLen.push(data.length);
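      // Stop when the count is unchanged across the last 50 recorded passes,
      // or when more than 1000 items have been collected
      if (
        (getLen.length > 50 &&
          getLen[getLen.length - 1] === getLen[getLen.length - 50]) ||
        data.length > 1000
      ) {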
        fs.writeFile("./src/index.html", content, "utf8", (error) => {
          if (error) {
            console.log(error);
            return; // do not report success when the write failed
          }
          log(chalk.green("DOM snapshot written successfully"));
        });
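        fs.writeFile(
          "./src/data.json",
          JSON.stringify(data),
          "utf8",
          async (error) => {
            if (error) {
              console.log(error);
            } else {
              // Rough estimate: counts only the 200 ms delays between passes
              log(
                chalk.green(
                  `Scraped ${data.length} items in roughly ${
                    (num * 200) / 1000
                  }s, written successfully`
                )
              );
            }
            // Close the page and browser whether or not the write succeeded
            await page.close();
            await browser.close();
          }
        );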
      } else {
        setTimeout(async () => {
          await loadData();
        }, 200);
      }
    };
    await loadData();
  }

  constructor() {
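    // Log failures instead of leaving the promise rejection unhandled
    this.getHtml().catch((error) => console.log(error));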
  }
}

new Reptile();
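
Two pieces of logic inside loadData, the id-based deduplication and the stop rule, can be pulled out as small pure functions, which makes them easy to check without opening a browser. The following is only a sketch under assumptions: the `Post` shape, `postId`, `addIfNew`, and `shouldStop` are hypothetical names introduced here, and hashing the description (rather than the toolbar count used above) is a choice made for this example, not the original method.

```typescript
import { createHash } from "crypto";

// Hypothetical, trimmed-down post shape used only in this sketch.
interface Post {
  time: string;
  from: string;
  description: string;
}

// Derive a stable id from fields that should identify a post.
// MD5 serves only as a fingerprint here, not as a security measure.
function postId(post: Post): string {
  return createHash("md5")
    .update(`${post.time}|${post.from}|${post.description}`)
    .digest("hex");
}

// Add a post only if its id has not been seen; returns true when added.
function addIfNew(seen: Map<string, Post>, post: Post): boolean {
  const id = postId(post);
  if (seen.has(id)) return false;
  seen.set(id, post);
  return true;
}

// Stop when the newest count equals the count recorded `window` entries
// from the end of the history, or when more than `max` items exist.
function shouldStop(history: number[], window = 50, max = 1000): boolean {
  const last = history[history.length - 1] ?? 0;
  if (last > max) return true;
  if (history.length <= window) return false;
  return last === history[history.length - window];
}
```

Keeping collected posts in a Map keyed by id makes each duplicate check a constant-time lookup, whereas scanning the whole array on every item grows more expensive as the data set approaches the 1000-item limit.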

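Create a TypeScript file, for example index.ts, paste in the code above, and run it (for example with a TypeScript runner such as ts-node). The results are written to ./src/index.html and ./src/data.json, so the ./src directory must already exist.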
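
One thing to keep in mind when running the script: every pass after the first is scheduled with setTimeout, so the promise returned by getHtml settles after the first pass and the caller cannot await the whole crawl. A possible alternative, sketched here under assumptions, is an awaitable loop. `sleep`, `repeatUntilDone`, and the `step` callback are hypothetical names introduced for illustration; they are not part of puppeteer or of the code above.

```typescript
// Resolve after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: call `step` repeatedly until it reports completion.
// `step` receives the 1-based pass number and resolves to true when done.
// Returns the number of passes that were run (capped at `maxPasses`).
async function repeatUntilDone(
  step: (pass: number) => Promise<boolean>,
  delayMs = 200,
  maxPasses = 10000
): Promise<number> {
  for (let pass = 1; pass <= maxPasses; pass++) {
    if (await step(pass)) return pass;
    await sleep(delayMs);
  }
  return maxPasses;
}
```

In the crawler, `step` would scroll, parse the page, and return the stop decision. The caller could then await the whole run, write the files afterwards, and close the browser in a finally block so that it is released even when a pass throws.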

[Screenshot: the scraped data]

For learning purposes only. If this infringes any rights, please contact the author and it will be revised or removed.
