
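I recently started learning Puppeteer, so as an exercise I used it to collect statistics on the articles of Zongheng Research Institute (縱橫研究院).

Puppeteer is the official headless Chrome tool from the Google Chrome team. It is a Node library that provides a high-level API for controlling headless Chrome over the DevTools Protocol. With Puppeteer you can simulate most of what a user does in the browser, such as taking screenshots, scraping content after the page has rendered, and interacting with the page.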

The scraped article data is available at the following addresses:

All articles
Article data for each topic
Data for every author who has published in any of the topics

Data display page: http://47.104.205.189:30000/

Next, let's walk through how Puppeteer simulates user actions to scrape the data.

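1. Get all topics of Zongheng Research Institute

Launch a Puppeteer browser

// The snippets in this post run inside an async function
const puppeteer = require('puppeteer')

const browser = await puppeteer.launch({
  headless: false // show the browser window while debugging
})

headless controls whether the browser runs in headless mode. Turning it off opens a visible browser window driven by the code, which makes debugging easier.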

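Go to the page http://www.itdecent.cn/u/9b797d42a0cc

// Page load options
const pageOptions = {
  timeout: 0,
  waitUntil: [
    'domcontentloaded',
    'networkidle0'
  ]
}
const page = await browser.newPage()
await page.goto('http://www.itdecent.cn/u/9b797d42a0cc', pageOptions)

timeout: the page load timeout. Jianshu pages can load resources very slowly when they are visited frequently, so it is set to 0 here to disable the timeout.
waitUntil: when the navigation counts as finished. domcontentloaded means the page's DOMContentLoaded event has fired; networkidle0 means there have been no network connections for at least 500 ms. When an array is given, navigation finishes only after all listed events have fired.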
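With `timeout: 0`, a load that stalls will wait forever. As a hedged alternative, the sketch below bounds each attempt and retries a few times. The helper name `gotoWithRetry`, the `retries` parameter, and the 30-second default are illustrative assumptions and not part of the original script; `page.goto(url, options)` is the only Puppeteer call it relies on.

```javascript
// Hypothetical helper: retry page.goto with a bounded timeout per attempt.
async function gotoWithRetry (page, url, options = {}, retries = 3) {
  let lastError
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      // page.goto rejects when the timeout elapses or navigation fails.
      // A timeout given in `options` overrides the 30 s default.
      return await page.goto(url, { timeout: 30000, ...options })
    } catch (e) {
      lastError = e
    }
  }
  throw lastError
}
```

Note that passing the article's `pageOptions` unchanged would bring back `timeout: 0`, so this only helps if that field is dropped or given a finite value.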

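Click "View more" under "Topics he created" to show all Zongheng Research Institute topics

By default the right side of the page shows only 10 topics, so a click has to be simulated to reveal the rest.

Figure: topic list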

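// Resolve after ms milliseconds
const delay = ms => new Promise(resolve => setTimeout(resolve, ms))

// Run an async function and return [error, result] instead of throwing
async function safeFunc (func) {
  try {
    const res = await func()
    return [null, res]
  } catch (e) {
    return [e, null]
  }
}
await safeFunc(async () => {
  await page.click('.list .check-more')
  await delay(1000) // give the expanded list time to render
})

page.click simulates a user click. It throws if the selector matches no element, so the generic safeFunc helper is used to handle the error.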
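Because safeFunc resolves to an `[error, result]` pair, the caller can check whether the click actually happened instead of silently ignoring a failure. A minimal sketch of that usage follows; `clickIfPresent` is a hypothetical name introduced here for illustration, and safeFunc is repeated so the block stands alone.

```javascript
// Same helper as in the article: resolve to [error, result] instead of throwing.
async function safeFunc (func) {
  try {
    const res = await func()
    return [null, res]
  } catch (e) {
    return [e, null]
  }
}

// Hypothetical wrapper: try to click a selector and report whether it worked.
async function clickIfPresent (page, selector) {
  const [err] = await safeFunc(() => page.click(selector))
  return err === null
}
```

A caller could then skip the one-second wait when `clickIfPresent(page, '.list .check-more')` resolves to `false`, for example when an account has 10 topics or fewer.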

Get all topics

const res = await page.evaluate(async () => {
  // Heading text "Topics he created"; it must match the text rendered on the page exactly
  const titleDom = Array.from(document.querySelectorAll('.title'))
    .find(one => one.innerText === '他創建的專題')
  if (!titleDom) return []
  // Use selectors and DOM methods to read the topic data from the page
  return Array.from(titleDom.nextElementSibling.querySelectorAll('li'))
    .reduce((acc, current) => {
      const item = current.querySelector('.name')
      if (!item) return acc
      return acc.concat({
        topicName: item.innerText,
        topicHome: item.href
      })
    }, [])
})

page.evaluate runs the function you pass in the browser context, so inside that function you can access window, document, and so on, and call the browser's DOM methods.
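One consequence is that the function handed to `page.evaluate` is serialized and executed in the page, so it cannot see variables from the surrounding Node scope. Values from Node have to be passed as extra arguments after the function. A minimal sketch (the name `countMatches` is hypothetical):

```javascript
// Hypothetical helper: count the elements matching a selector chosen in Node.
async function countMatches (page, selector) {
  // `selector` is passed as an argument; referencing the outer variable
  // directly inside the callback would fail in the browser context.
  return page.evaluate(sel => document.querySelectorAll(sel).length, selector)
}
```

The same pattern would let the heading text or the list selector in the topic-scraping code be supplied as parameters instead of being hard-coded inside the callback.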

2. Visit each topic and get all of its articles

Getting the article list from a topic page looks like this:

async function getArticles (page) {
  await autoScroll(page)
  const articles = await page.evaluate(async () => {
    return Array.from(document.querySelectorAll('.note-list > li'))
      .reduce((acc, current) => {
        const titleDom = current.querySelector('.title')
        const nicknameDom = current.querySelector('.nickname')
        if (!titleDom || !nicknameDom) return acc

        const starIcon = nicknameDom.parentElement.querySelector('.ic-list-like')
        const stars = (starIcon && Number.parseInt(starIcon.nextSibling.data)) || 0
        const commentIcon = nicknameDom.parentElement.querySelector('.ic-list-comments')
        const comments = (commentIcon && Number.parseInt(commentIcon.nextSibling.data)) || 0
        return acc.concat({
          authorName: nicknameDom.innerText, // author name
          authorHome: nicknameDom.href, // author homepage
          title: titleDom.innerText, // article title
          url: titleDom.href, // article URL
          stars, // number of likes
          comments // number of comments
        })
      }, [])
  })
  return articles
}

This function likewise uses selectors in the browser context to pick out the DOM elements and read each article's data one by one. Before reading the articles it calls autoScroll, which scrolls the page to the bottom: the article list in a topic is lazy-loaded, so all articles are only available once the page has been scrolled to the end. autoScroll looks like this:

async function autoScroll (page) {
  await page.evaluate(async () => {
    await new Promise((resolve, reject) => {
      let totalHeight = 0
      let distance = 100
      let timer = setInterval(() => {
        let scrollHeight = document.body.scrollHeight
        window.scrollBy(0, distance)
        totalHeight += distance
        if (totalHeight >= scrollHeight) {
          clearInterval(timer)
          resolve()
        }
      }, 100)
    })
  })
}

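As shown above, a timer scrolls the page step by step to load more articles, until the distance scrolled reaches the actual page height, at which point all articles have been loaded.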
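If lazy-loaded items arrive slowly, a distance counter may catch up with the page height before the next batch has been appended. One alternative, sketched here under the assumption that the list only grows when the bottom is reached, is to keep scrolling until the height stops changing. `scrollUntilStable`, `pauseMs`, and `maxRounds` are hypothetical names, not part of the original script.

```javascript
// Hypothetical helper: scroll to the bottom repeatedly until the page height
// stops growing. Resolves to the number of rounds that changed the height.
async function scrollUntilStable (page, { pauseMs = 500, maxRounds = 50 } = {}) {
  let lastHeight = -1
  for (let round = 0; round < maxRounds; round++) {
    // Scroll and read the resulting height in a single browser-side call.
    const height = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight)
      return document.body.scrollHeight
    })
    if (height === lastHeight) return round // nothing new was appended
    lastHeight = height
    // Give the lazy loader time to fetch and render the next batch.
    await new Promise(resolve => setTimeout(resolve, pauseMs))
  }
  return maxRounds // gave up: the page kept growing
}
```

The `maxRounds` cap keeps the loop from running forever on a page that never settles; a suitable `pauseMs` depends on how fast the site responds.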

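Iterate over the topic list obtained earlier and visit each topic page to get its articles (getTopics is assumed to wrap the steps from section 1):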

const topics = await getTopics(browser)
const page = await browser.newPage()
for (const topic of topics) {
  await page.goto(topic.topicHome, pageOptions)
  const articles = await getArticles(page)
  Object.assign(topic, {
    articles: articles.map(one => ({ ...topic, ...one }))
  })
}

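3. Visit user pages to get each article's read count and publish time

If the topic pages showed each article's read count and publish time directly, the data from the two steps above would be enough for the statistics. They do not, so the next step is to group all articles in the topics by author and then visit each author's homepage to get the article details.

Group by author:

const authors = topics.reduce((acc, topic) => {
  topic.articles.forEach(article => {
    const { authorName, authorHome } = article
    // Authors are identified by their homepage URL
    const existingAuthor = acc.find(one => one.authorHome === authorHome)
    if (existingAuthor) {
      Object.assign(existingAuthor, { articles: [...existingAuthor.articles, article] })
    } else {
      acc.push({ authorName, authorHome, articles: [article] })
    }
  })
  return acc
}, [])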
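The reduce above searches the accumulator with `find` for every article, which is fine at this scale. For larger inputs the same grouping can be done with a `Map` keyed by the author's homepage. This is a minimal sketch; the function name `groupByAuthor` is introduced here for illustration.

```javascript
// Hypothetical alternative: group articles by author homepage using a Map.
function groupByAuthor (topics) {
  const byHome = new Map()
  for (const topic of topics) {
    for (const article of topic.articles) {
      const { authorName, authorHome } = article
      if (!byHome.has(authorHome)) {
        byHome.set(authorHome, { authorName, authorHome, articles: [] })
      }
      byHome.get(authorHome).articles.push(article)
    }
  }
  // A Map iterates in insertion order, so authors appear in first-seen order.
  return Array.from(byHome.values())
}
```

The result has the same shape as `authors` in the article: one entry per homepage with that author's articles collected in an array.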

Get each article's read count and publish time from the author's homepage:

async function getArticlesDetail (page) {
  await autoScroll(page)
  const articles = await page.evaluate(async () => {
    return Array.from(document.querySelectorAll('.note-list > li')).map(one => {
      if (!one) return {}
      const titleDom = one.querySelector('.title')
      const url = titleDom && titleDom.href
      const readIcon = one.querySelector('.ic-list-read')
      const readCount = (readIcon && Number.parseInt(readIcon.nextSibling.data)) || 0
      const timeDom = one.querySelector('.time')
      // This runs in the browser, so moment must exist as a global on the page;
      // a moment module required in Node is not visible here
      const publishTime = timeDom && moment(timeDom.dataset.sharedAt).format('YYYY-MM-DD HH:mm')
      return { url, readCount, publishTime }
    })
  })
  return articles
}

Iterate over the users who have published in the topics and visit each user's page to get the article details:

for (const author of authors) {
  const { authorHome, articles } = author
  await page.goto(authorHome, pageOptions)
  const authorAllArticles = await getArticlesDetail(page)
  articles.forEach(article => {
    const articleExtraInfo = authorAllArticles.find(one => article.url === one.url)
    Object.assign(article, articleExtraInfo)
  })
}
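If an article from a topic is not found on the author's page, `articleExtraInfo` is `undefined` and the article ends up without a `readCount`, which would later turn the summed totals into `NaN`. A hedged sketch of a merge that indexes the details by URL and fills in defaults follows. `mergeDetails` is a hypothetical name, and unlike the loop above it returns new objects instead of mutating the originals.

```javascript
// Hypothetical helper: merge per-article details by URL, with safe defaults.
function mergeDetails (articles, details) {
  const byUrl = new Map(details.map(detail => [detail.url, detail]))
  return articles.map(article => ({
    readCount: 0, // default when the author page did not list the article
    publishTime: null,
    ...article,
    ...byUrl.get(article.url) // spreading undefined adds nothing
  }))
}
```

With these defaults every merged article has a numeric `readCount`, so sums and sorts over it stay well defined.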

4. Sort, tidy up the data format, and export JSON

const allArticles = authors.reduce((acc, current) => acc.concat(current.articles), [])
// Treat a missing readCount as 0 so the total never becomes NaN
const allReadCount = allArticles.reduce((acc, current) => (acc + (current.readCount || 0)), 0)

// Save the article list
// (output is a helper that is not shown in this post)
output({
  articleCount: allArticles.length,
  readCount: allReadCount,
  articles: allArticles.sort((a, b) => (b.readCount - a.readCount))
}, './縱橫研究院文章列表.json')

// Fill in the missing details on each topic's articles
topics.forEach(one => {
  one.articles.forEach(article => {
    const articleExtraInfo = allArticles.find(detail => article.url === detail.url)
    Object.assign(article, articleExtraInfo)
  })
})

// Save the per-topic statistics
output({
  articleCount: allArticles.length,
  readCount: allReadCount,
  topicCount: topics.length,
  topics: topics
    .sort((a, b) => (b.articles.length - a.articles.length))
    .map(one => ({
      articleCount: one.articles.length,
      readCount: one.articles.reduce((acc, current) => (acc + (current.readCount || 0)), 0),
      ...one,
      articles: one.articles.sort((a, b) => (b.readCount - a.readCount))
    }))
}, './縱橫研究院專題統計.json')

// Save the per-author statistics
output({
  articleCount: allArticles.length,
  readCount: allReadCount,
  authorCount: authors.length,
  authors: authors
    .sort((a, b) => (b.articles.length - a.articles.length))
    .map(one => ({
      articleCount: one.articles.length,
      readCount: one.articles.reduce((acc, current) => (acc + (current.readCount || 0)), 0),
      ...one,
      articles: one.articles.sort((a, b) => (b.readCount - a.readCount))
    }))
}, './縱橫研究院作者統計.json')
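The `output` helper called above is not shown in the post. A minimal stand-in, assuming it simply serializes its first argument as JSON and writes it to the given path, might look like the sketch below. The body is a guess at the behavior, not the author's implementation.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical stand-in for the article's `output` helper.
function output (data, filePath) {
  // Create the target directory if it does not exist yet.
  fs.mkdirSync(path.dirname(filePath), { recursive: true })
  // Pretty-print with 2-space indentation so the file is easy to inspect.
  fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf8')
}
```

Writing UTF-8 explicitly keeps the Chinese titles and file names readable in the exported files.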

Those are all the steps; click here for the final code and the run results.

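Further exploration

Running the steps above to collect the statistics takes about 6 minutes each time, because the script has to visit 20 topics and more than 60 user homepages one by one, and pages with many articles have to be scrolled to the bottom to lazy-load everything.

If several pages were opened at once and handled navigation, lazy loading, and data extraction in parallel, the run time should improve. Processing tasks with multiple pages looks like this: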

async function execTasks (browser, tasks, maxPageCount = 5) {
  // 0 = not started, 1 = claimed by a worker page
  const taskStatus = new Array(tasks.length).fill(0)
  await Promise.all(Array.from({ length: maxPageCount }).map(async () => {
    const page = await browser.newPage()
    while (true) {
      // Native Array.prototype.findIndex, so no extra import is needed
      const index = taskStatus.findIndex(status => !status)
      if (index === -1) break
      taskStatus[index] = 1
      await tasks[index](page)
    }
    await page.close()
  }))
}

const topics = await getTopics(browser)
await execTasks(browser, topics.map(topic => async (page) => {
  await page.goto(topic.topicHome, pageOptions)
  const articles = await getArticles(page)
  Object.assign(topic, {
    articles: articles.map(one => ({ ...topic, ...one }))
  })
}))

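The code above opens 5 pages that work through the topic tasks together. Unfortunately, it did not work as hoped.

Possibly because Jianshu limits concurrent page requests from one browser, only one page actually loaded properly. After some experimenting, even two pages running in parallel produced load failures, so in the end I gave up and went back to using a single page.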
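The scheduling idea itself can be exercised without a browser, which helps separate site-side limits from bugs in the pool. The sketch below is a generic version of the same pattern: `limit` workers pull tasks from a shared queue, and with `limit = 1` it degrades to the sequential processing the script ended up using. `runPool` and its parameters are hypothetical names.

```javascript
// Hypothetical generic task pool: at most `limit` tasks run at the same time.
async function runPool (tasks, limit = 5) {
  const results = new Array(tasks.length)
  let next = 0 // index of the next unclaimed task
  const worker = async () => {
    while (next < tasks.length) {
      const index = next++ // claimed synchronously, before any await
      results[index] = await tasks[index]()
    }
  }
  // Start the workers and wait until all of them have drained the queue.
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker))
  return results
}
```

Because each index is claimed before the first `await`, two workers can never pick the same task, and results come back in task order regardless of which worker finished first.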

References for this post:

https://pptr.dev/
https://juejin.im/post/5bbc96785188255c72286403

[Original] Using Puppeteer to Collect Statistics on Zongheng Research Institute Articles
