Getting Started with Puppeteer

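Introduction

Our team has recently needed to analyze website data fairly often, which meant manually copying data from several sites into Excel. This kind of repetitive, low-value manual work should be handed to a machine, freeing people up for more interesting things, and that is how this article on learning web scraping came about. Open-source scraping libraries include scraper for Python, Selenium for Java, Watir for Ruby, Puppeteer for Node.js, and chromedp for Go. I chose Puppeteer because it is quick to pick up, with chromedp as the fallback, since our day-to-day projects are written in Go.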


Contents

  1. Environment setup
  2. Page screenshot demo
  3. Scraping the target data with a script run from the terminal
  4. Scraping the target data with the script run as a web service
  5. Summary
  6. Learn more


1. Environment setup

# Run in a macOS terminal
# Install Homebrew; to configure a mirror inside China, see https://mirrors.ustc.edu.cn/help/brew.git.html
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install node and wait for it to finish
$ brew install node

# Check the versions
$ node -v
$ npm -v

# Install Puppeteer
$ npm i puppeteer

2. Page screenshot demo

// https://github.com/puppeteer/puppeteer/blob/main/examples/screenshot.js

"use strict";

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("http://example.com");
  await page.screenshot({ path: "example.png" });
  await browser.close();
})();

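Running this directly on macOS may fail with a browser version error; a specific Chrome revision has to be downloaded before the script will run. The modified code is:

// Revision numbers can be looked up at http://omahaproxy.appspot.com/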

const puppeteer = require("puppeteer");
const browserFetcher = puppeteer.createBrowserFetcher();

// Download the given Chrome revision; the promise resolves with info about the downloaded browser
browserFetcher.download("809590").then(async (res) => {
  // For the launch options, see
  // https://zhaoqize.github.io/puppeteer-api-zh_CN/#?product=Puppeteer&version=v8.0.0&show=api-class-puppeteer
  const options = {
    executablePath: res.executablePath, // path to the Chrome executable
    headless: true, // headless mode runs in the background; false opens a browser window
    defaultViewport: {
      width: 1800,
      height: 768,
    },
    args: ["--start-maximized"],
  };

  puppeteer.launch(options).then(async (browser) => {
    const page = await browser.newPage();
    await page.goto("http://www.baidu.com");
    await page.screenshot({ path: "baidu.png" });
    await browser.close();
  });
});
# Run the script above
$ node test.js
$ ls baidu.png

That covers the basics of Puppeteer.

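3. Scraping the target data with a script run from the terminal

Next comes scraping the data our team actually needs to analyze. Start by getting familiar with the Puppeteer API documentation, especially the Page, JSHandle, and ElementHandle classes; the code below uses these three constantly.

Besides these three classes, you also need to know the Chrome DevTools APIs and how to find the path to a DOM node. The screenshot below shows the context menu that appears when you open Chrome DevTools and right-click the target DOM node in the Elements panel.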

<img src="https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/7960a891dae9454f9eaccd3ab36b70c7~tplv-k3u1fbpfcp-zoom-1.image" width="30%">

For details on these features, see the Console Utilities API reference.
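
How the three classes fit together can be sketched roughly as follows. This is only an illustration, not part of the original scripts: `innerTextOf` is a hypothetical helper name, and it uses nothing beyond `page.$`, `ElementHandle.getProperty`, and `JSHandle.jsonValue`, the same calls the scraping script in this section relies on.

```javascript
// Hypothetical helper: read the visible text of the first node matching `selector`.
// `page` is assumed to be a Puppeteer Page (any object with a compatible `$` works).
async function innerTextOf(page, selector) {
  // page.$ resolves to an ElementHandle, or null when nothing matches.
  const element = await page.$(selector);
  if (element === null) {
    return null;
  }
  // getProperty resolves to a JSHandle wrapping the value inside the page...
  const handle = await element.getProperty("innerText");
  // ...and jsonValue() copies that value out of the browser into Node.
  return handle.jsonValue();
}
```

In a script it would be called after `page.goto(...)`, for example `const title = await innerTextOf(page, "h1");`.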

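Now for the real thing. Getting data from the target site takes two steps: logging in with an account and opening the target page. The login script signs in and saves the cookies to a local file; the scraping script then loads that cookie file directly, which avoids running without an authenticated session. The drawback of this login script is that it cannot handle QR-code login in headless mode.
Here is the login script: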

// login.js
const puppeteer = require("puppeteer");
const fs = require("fs");
const fsp = fs.promises;

const LOGIN_URL = "https://dy.mock.com/login?routerstr=workbench";
const COOKIE_FILE = "./cookies.json";
const browserFetcher = puppeteer.createBrowserFetcher();

(async () => {
  // Download the pinned Chrome revision, then launch it
  const res = await browserFetcher.download("809590");
  const browser = await puppeteer.launch({
    executablePath: res.executablePath, // path to the Chrome executable
    headless: false, // headed mode, so the browser window is visible
    defaultViewport: {
      width: 1800,
      height: 768,
    },
    args: ["--start-maximized"],
  });
  const page = await browser.newPage();

  if (fs.existsSync(COOKIE_FILE)) {
    // Reuse the cookies saved by an earlier login
    const cookies = JSON.parse(await fsp.readFile(COOKIE_FILE, "utf8"));
    await page.setCookie(...cookies);
    await page.goto(LOGIN_URL);
  } else {
    const tabSelector = "#app > div > div.bg_login > div > div > div.of_hidden > div";
    const inputSelector = "#input-msg > div > input";
    const submitSelector = "#app > div > div > div > div > div > div > form > div.form_item.pointer";

    await page.goto(LOGIN_URL);
    await page.waitForSelector(tabSelector);
    const tabs = await page.$$(tabSelector);
    // Switch to the second tab, then fill in the account and password
    await tabs[1].click();
    await page.waitForSelector(inputSelector);
    await page.waitForSelector(submitSelector);
    const inputs = await page.$$(inputSelector);
    await inputs[0].type("your-account");
    await inputs[1].type("your-password");
    await page.click(submitSelector);
    // In headed mode, wait here so a QR-code login can be completed by hand
    // await page.waitForTimeout(5000);
    const cookies = await page.cookies();
    await fsp.writeFile(COOKIE_FILE, JSON.stringify(cookies, null, 2));
  }
  await browser.close();
})();
$ node login.js
$ ls cookies.json
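
The cookie hand-off between the login script and the scraping script can be isolated into two small helpers. This is a sketch under assumptions: `saveCookies` and `loadCookies` are hypothetical names that do not appear in the original scripts, and `page` is assumed to expose Puppeteer's `page.cookies()` and `page.setCookie(...)`, as in the login script above.

```javascript
const fsp = require("fs").promises;

// Hypothetical helper: write the page's current cookies to `file` as JSON.
// Resolves to the number of cookies saved.
async function saveCookies(page, file) {
  const cookies = await page.cookies();
  await fsp.writeFile(file, JSON.stringify(cookies, null, 2));
  return cookies.length;
}

// Hypothetical helper: load cookies from `file` into the page.
// Resolves to false when the file does not exist, so the caller
// knows that a fresh login is needed.
async function loadCookies(page, file) {
  let raw;
  try {
    raw = await fsp.readFile(file, "utf8");
  } catch (err) {
    if (err.code === "ENOENT") {
      return false;
    }
    throw err;
  }
  await page.setCookie(...JSON.parse(raw));
  return true;
}
```

Returning `false` for a missing file lets a caller write `if (!(await loadCookies(page, "./cookies.json"))) { /* log in first */ }` instead of checking for the file separately.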

Below is the script that scrapes the page content.

// grab.js
const puppeteer = require("puppeteer");
const fs = require("fs").promises;
const browserFetcher = puppeteer.createBrowserFetcher();

(async () => {
  const res = await browserFetcher.download("809590");
  const browser = await puppeteer.launch({
    executablePath: res.executablePath, // path to the Chrome executable
    headless: false, // headed mode, so the browser window is visible
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1800, height: 768 });
  // No login check here: the login script above is assumed to have run already
  const cookies = JSON.parse(await fs.readFile("./cookies.json", "utf8"));
  await page.setCookie(...cookies);
  await page.goto("https://dy.fake.com/kol_list/kol_list");
  await page.waitForSelector("table");
  const tabSelector = "#app > div > div.main_view > div > div:nth-child(1) > div:nth-child(1) > div > div > div > div > div";
  await page.waitForSelector(tabSelector);
  const tabs = await page.$$(tabSelector);

  for (const [i, tab] of tabs.entries()) {
    const tabName = await (await tab.getProperty("innerText")).jsonValue();
    await tab.click();
    // Crude fixed delay to give the table time to refresh after the click
    await page.waitForTimeout(3000);
    await page.waitForSelector("table");
    const csv = await page.$$eval("table", (tables) => {
      // Rows come from the second child (presumably the <tbody>) of the third table
      const trs = tables[2].children[1].children;
      let out = "Name,Total followers,Follower quality,Median likes,Median comments,Median shares,Index\n";
      for (const tr of trs) {
        const cells = [tr.children[2].innerText];
        // Columns 3 to 8 are numeric: strip every thousands separator, not just the first
        for (let c = 3; c <= 8; c++) {
          cells.push(tr.children[c].innerText.replace(/,/g, ""));
        }
        out += cells.join(",") + "\n";
      }
      return out;
    });
    await fs.writeFile("./" + i + "-" + tabName + "-index.csv", csv);
  }
  await browser.close();
})();
# Run the script
$ node grab.js
$ ls -la *.csv

Here is what the scraped CSV files look like:

<img src="https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/6fba17f2e7204db990544c053683d9f4~tplv-k3u1fbpfcp-zoom-1.image" width="30%">
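
The scraping script keeps the CSV columns aligned by deleting commas from the numeric cells. A more general option is to quote any field that contains a comma, a double quote, or a line break. The helpers below are a sketch of that idea; `csvField`, `csvRow`, and `stripCommas` are hypothetical names, not functions from the original script.

```javascript
// Quote a single value for CSV: wrap it in double quotes when it contains
// a comma, a double quote, or a line break, and double any inner quotes.
function csvField(value) {
  const text = String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Join an array of values into one CSV line (without a trailing newline).
function csvRow(values) {
  return values.map(csvField).join(",");
}

// Remove every thousands separator. A string pattern such as
// text.replace(",", "") would only remove the first comma.
function stripCommas(text) {
  return text.replace(/,/g, "");
}
```

Functions defined in Node are not visible inside the `page.$$eval` callback, which runs in the page. To use helpers like these, one approach is to return plain arrays of cell text from `$$eval` and build the CSV lines on the Node side.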

That wraps up scraping from the terminal.

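4. Scraping the target data with the script run as a web service

The web service code is simple: it wraps the terminal scripts above and exposes them through a web service, so everything can be operated from a page in the browser instead of running commands in a terminal.

For the web service I went straight to the Egg.js framework for the business logic, so there was no need to write the HTTP plumbing by hand.
The code is as follows: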

'use strict';
const Controller = require('egg').Controller;
const path = require('path');
const puppeteer = require('puppeteer');
const fs = require('fs').promises;
const fs2 = require('fs');
const archiver = require('archiver');


class HomeController extends Controller {

  async index() {
    const { ctx } = this;
    await ctx.render('home/index.tpl');
  }

  async login() {
    const { ctx } = this;
    const v = await this.puppeteerLogin(ctx);
    await ctx.render('home/login.tpl', { v });
  }

  async grab() {
    const { ctx } = this;
    const account = ctx.cookies.get('account');
    if (account === '' || account === undefined) {
      return ctx.redirect('/');
    }
    const puppeteerNode = puppeteer;
    const browserFetcher = puppeteerNode.createBrowserFetcher();
    let v = await browserFetcher.download('809590').then(res => {
        const options = {
          executablePath: res.executablePath, // path to the Chrome executable
          headless: true, // run the browser in headless mode
          defaultViewport: {
            width: 1800,
            height: 768,
          },
          args: [ '--start-maximized' ],
        };
        const v = puppeteer.launch(options)
          .then(async browser => {
            const page = await browser.newPage();
            await page.setViewport({ width: 1800, height: 768 });
            const sessionCookieDir = path.join(ctx.app.config.sessionDir, account);
            const fileName = account + '-target.zip';
            const publicZipFile = path.join(ctx.app.config.publicDir, fileName);

            // Every early return closes the browser so no Chrome process is left running
            if (!fs2.existsSync(sessionCookieDir)) {
              await browser.close();
              return 'pls wait seconds for login ';
            }
            const lockFile = sessionCookieDir + '/start.lock';
            if (fs2.existsSync(lockFile)) {
              console.log('pls wait seconds for done');
              await browser.close();
              return 'pls wait seconds for done';
            }
            const zipFilePath = path.join(sessionCookieDir, fileName);
            if (fs2.existsSync(publicZipFile)) {
              console.log(publicZipFile);
              await browser.close();
              return publicZipFile;
            }
            await fs.writeFile(sessionCookieDir + '/start.lock', 'lock');
            const sessionDataDir = path.join(sessionCookieDir, 'data');
            if (!fs2.existsSync(sessionDataDir)) {
              fs2.mkdirSync(sessionDataDir, { recursive: true, mode: 0o777 });
            }
            const sessionCookiePath = path.join(sessionCookieDir, 'cookies.json');
            const cookiesString = await fs.readFile(sessionCookiePath);
            const cookies = JSON.parse(cookiesString);
            await page.setCookie(...cookies);

            await page.goto('https://dy.fake.com/kol_list/kol_list');
            await page.waitForSelector('table');
            await page.waitForSelector('#app > div > div.main_view > div > div:nth-child(1) > div:nth-child(1) > div > div > div > div > div');
            const tabs = await page.$$('#app > div > div.main_view > div > div:nth-child(1) > div:nth-child(1) > div > div > div > div > div');

            for (const [ i, tab ] of tabs.entries()) {
              const tabName = await (await tab.getProperty('innerText')).jsonValue();
              await tab.click();
              // Crude fixed delay to give the table time to refresh after the click
              await page.waitForTimeout(1000);
              await page.waitForSelector('table');
              const csv = await page.$$eval('table', tables => {
                // Rows come from the second child (presumably the <tbody>) of the third table
                const trs = tables[2].children[1].children;
                let out = 'Name,Total followers,Follower quality,Median likes,Median comments,Median shares,Index\n';
                for (const tr of trs) {
                  const cells = [ tr.children[2].innerText ];
                  // Columns 3 to 8 are numeric: strip every thousands separator, not just the first
                  for (let c = 3; c <= 8; c++) {
                    cells.push(tr.children[c].innerText.replace(/,/g, ''));
                  }
                  out += cells.join(',') + '\n';
                }
                return out;
              });
              await fs.writeFile(sessionDataDir + '/' + i + '-' + tabName + '-cassindex.csv', csv);
            }
            const output = fs2.createWriteStream(zipFilePath);
            const archive = archiver('zip', { zlib: { level: 9 } });
            // Settles once the zip is fully written and its file descriptor is closed
            const zipWritten = new Promise((resolve, reject) => {
              output.on('close', resolve);
              output.on('error', reject);
              archive.on('error', reject);
              archive.on('warning', err => {
                // A missing file is only logged; any other warning aborts the archive
                if (err.code === 'ENOENT') {
                  console.log('warning---->' + err);
                } else {
                  reject(err);
                }
              });
            });
            archive.pipe(output);
            archive.directory(sessionDataDir, false);
            await archive.finalize();
            await zipWritten;
            console.log(archive.pointer() + ' total bytes');
            // Publish the zip only after it is complete, then clean up in order
            await fs.copyFile(zipFilePath, publicZipFile);
            await fs.rm(zipFilePath);
            await fs.rm(lockFile);
            await browser.close();
            return publicZipFile;
          });
        return v;
      });

    if (v.indexOf('/public/') >= 0) {
      v = v.replaceAll(ctx.app.config.webRootDir, '');
      await ctx.render('home/download.tpl', { v });
    } else {
      ctx.body = v;
    }
  }

   puppeteerLogin(ctx) {
    const username = ctx.request.body.username;
    const password = ctx.request.body.password;
    const puppeteerNode = puppeteer;
    const browserFetcher = puppeteerNode.createBrowserFetcher();
    const v = browserFetcher.download('809590')
      .then(res => {
        const options = {
          executablePath: res.executablePath, // path to the Chrome executable
          headless: true, // run the browser in headless mode
          defaultViewport: {
            width: 1800,
            height: 768,
          },
          args: [ '--start-maximized' ],
        };

        const v = puppeteer.launch(options)
          .then(
            async browser => {
              let cookies = {};
              const page = await browser.newPage();
              const sessionCookieDir = path.join(ctx.app.config.sessionDir, username);
              if (!fs2.existsSync(sessionCookieDir)) {
                fs2.mkdirSync(sessionCookieDir, { recursive: true, mode: 0o777 });
              }
              const sessionCookiePath = path.join(sessionCookieDir, 'cookies.json');

              if (fs2.existsSync(sessionCookiePath)) {
                const cookiesString = await fs.readFile(sessionCookiePath);
                const cookies = JSON.parse(cookiesString);
                await page.setCookie(...cookies);
                await page.goto('https://dy.fake.com/login?routerstr=workbench');
              } else {
                await page.goto('https://dy.fake.com/login?routerstr=workbench');
                await page.waitForSelector('#app > div > div.bg_login > div > div > div.of_hidden > div');
                const tabs = await page.$$('#app > div > div.bg_login > div > div > div.of_hidden > div');
                await tabs[1].click()
                  .then(async res => {
                    await page.waitForSelector('#input-msg > div > input');
                    await page.waitForSelector('#app > div > div > div > div > div > div > form > div.form_item.pointer');
                    const inputs = await page.$$('#input-msg > div > input');
                    await inputs[0].type(username);
                    await inputs[1].type(password);
                    await page.click('#app > div > div > div > div > div > div > form > div.form_item.pointer');
                    await page.waitForTimeout(1500);
                    cookies = await page.cookies();
                    await fs.writeFile(sessionCookiePath, JSON.stringify(cookies, null, 2));
                  });
              }
              await browser.close();
              ctx.cookies.set('account', username);
              return 'success';
            }
          );
        return v;
      });
    return v;
  }
}
module.exports = HomeController;
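
The article does not show how requests reach this controller. A minimal Egg.js router sketch might look like the following. The file layout (`app/router.js`, with the controller saved as `app/controller/home.js`), the URL paths, and the HTTP methods are all assumptions: only the `/` path appears in the controller above (as a redirect target), and the login action is mapped to POST because it reads `ctx.request.body`.

```javascript
// app/router.js (assumed file layout; paths and HTTP methods are guesses)
function routes(app) {
  const { router, controller } = app;
  // Landing page with the login form.
  router.get("/", controller.home.index);
  // The login action reads ctx.request.body, so it is mapped as a POST here.
  router.post("/login", controller.home.login);
  // Starts a scrape, or renders the download page once the zip exists.
  router.get("/grab", controller.home.grab);
}

module.exports = routes;
```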

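For deploying the web service, see the official Egg.js documentation. After following those steps, starting the service may still run into problems: installing Puppeteer on a Linux server pulls in a number of library dependencies, and the resulting errors can usually be resolved by searching for the specific message. Even with the environment installed, the code above still fails on Linux until the browser launch arguments are changed to: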

 args: [ '--start-maximized', '--no-sandbox', '--disable-setuid-sandbox'],

With the sandbox disabled, the service starts and runs normally.
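
To keep one code path for both macOS and Linux, the launch options could be built by a small function that adds the sandbox flags only on Linux. This is a sketch, not the original code: `buildLaunchOptions` is a hypothetical name, and switching on `process.platform` is my assumption about how to tell the two environments apart.

```javascript
// Hypothetical helper: build Puppeteer launch options, adding the
// sandbox flags only when running on Linux.
function buildLaunchOptions(executablePath, platform = process.platform) {
  const args = ["--start-maximized"];
  if (platform === "linux") {
    // Disabling the sandbox weakens isolation, so only load pages you trust.
    args.push("--no-sandbox", "--disable-setuid-sandbox");
  }
  return {
    executablePath,
    headless: true,
    defaultViewport: { width: 1800, height: 768 },
    args,
  };
}
```

It would be used as `puppeteer.launch(buildLaunchOptions(res.executablePath))` in place of the inline options object.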

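5. Summary

Getting started with Puppeteer threw up all sorts of problems: finding nodes, extracting node text, looping over nodes, and so on. Repeated debug output, reading the API docs, and searching Google got every one of them solved. Even so, the code is far from polished; it is only a demo that runs, and it still needs continued cleanup.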

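I have not used Node.js much, so with asynchronous callbacks I kept forgetting to wait for a result, or the promise's callback never ran. That made this part slow to write, since I had to look up the API before each step. Next I plan to study Node.js more systematically.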
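
The difference between waiting and not waiting inside a loop can be seen without a browser at all. In the sketch below, `fakeStep` is a hypothetical stand-in for an asynchronous action such as `tab.click()`; the names and delays are made up for illustration.

```javascript
// Stand-in for an asynchronous browser action such as tab.click().
// It records its name in `log` when it completes.
function fakeStep(name, ms, log) {
  return new Promise((resolve) => {
    setTimeout(() => {
      log.push(name);
      resolve(name);
    }, ms);
  });
}

// Not awaited: the loop finishes immediately and the steps complete
// later, in whatever order their timers fire.
function runWithoutAwait(log) {
  for (const [name, ms] of [["slow", 30], ["fast", 5]]) {
    fakeStep(name, ms, log).then(() => {});
  }
  log.push("loop done");
}

// Awaited: each step completes before the next one starts.
async function runWithAwait(log) {
  for (const [name, ms] of [["slow", 30], ["fast", 5]]) {
    await fakeStep(name, ms, log);
  }
  log.push("loop done");
}
```

With `runWithoutAwait`, "loop done" is logged before either step has finished; with `runWithAwait`, the steps are logged in loop order and "loop done" comes last.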

The URLs of the scraped pages in the code have been masked and cannot be visited. Readers who need them can get in touch via the Learn more section.



Related project links

6. Learn more


Original article: Getting Started with Puppeteer
