Preparation
Libraries used:
- superagent (requires the type declaration package @types/superagent)
  npm install superagent @types/superagent -D
- cheerio, which parses HTML with a jQuery-like API
  npm install cheerio @types/cheerio -D
Getting started
Fetching the HTML with superagent
async getRawHtml(){
  const result = await superagent.get(this.url)
  console.log('result test',result.text)
}
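The request can fail (network error, non-2xx status), and the fragment above does not handle that case. A minimal sketch of one way to do so, assuming a hypothetical `HttpGet` function type that stands in for the HTTP client so the example runs without network access; `fetchHtml`, `fakeGet` and the sample URLs are illustrative names, not part of superagent:

```typescript
// Hypothetical abstraction: any function that resolves to an object with a `text` field.
type HttpGet = (url: string) => PromiseLike<{ text: string }>

// Fetch the raw HTML, returning null instead of throwing when the request fails.
async function fetchHtml(url: string, get: HttpGet): Promise<string | null> {
  try {
    const result = await get(url)
    return result.text
  } catch (err) {
    console.error('request failed:', err)
    return null
  }
}

// Stand-in for a real HTTP client, used here only for illustration.
const fakeGet: HttpGet = async (url) => {
  if (url.startsWith('http://bad')) throw new Error('connection refused')
  return { text: '<html><body>ok</body></html>' }
}
```

With superagent installed, passing `(url) => superagent.get(url)` as the second argument should fit the same shape, since the response exposes the body as `text`.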
Parsing the HTML with cheerio
cheerio follows jQuery's syntax:
- eq: selects the matched element at the given index
getCourseInfo(html:string){
  const courseInfos:Course[] = []
  const $ = cheerio.load(html)
  // each() is used because the callback only has side effects
  $('.content').find('.course-item').each((index,element) =>{
    const desc = $(element).find('.course-desc')
    const name = desc.eq(0).text()
    // the count is the number after the colon in the second .course-desc
    const count = parseInt(desc.eq(1).text().split(':')[1],10)
    courseInfos.push({
      name,
      count
    })
  })
  console.log('result',courseInfos)
}
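Note that `parseInt(text.split(':')[1])` yields `NaN` when the text contains no colon, or when the page uses a full-width colon. A minimal sketch of a more defensive helper; the name `parseCount` and the sample strings in the usage note are hypothetical, since the page's actual text format is not shown here:

```typescript
// Extract the number after the first ASCII or full-width colon.
// Returns null when no number can be parsed, instead of NaN.
function parseCount(text: string): number | null {
  const parts = text.split(/[::]/)
  const raw = parts[1]
  if (raw === undefined) return null
  const value = parseInt(raw.trim(), 10)
  return Number.isNaN(value) ? null : value
}
```

For example, `parseCount('Learners: 42')` gives `42`, while `parseCount('no number here')` gives `null`, which the caller can skip or log rather than storing `NaN` in the JSON file.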
Writing the scraped content to a JSON file
- Use fs to read and write the file
- Define an interface for the keys and values of the stored object
interface Content{
  [propName:number]:Course[]
}
generateJsonFile(courseResult:courseResult){
  // write to json file
  let fileContent:Content = {}
  const filePath = path.resolve(__dirname,'../data/course.json')
  if(fs.existsSync(filePath)){
    fileContent = JSON.parse(fs.readFileSync(filePath,'utf-8'))
  }
  fileContent[courseResult.time] = courseResult.data
  fs.writeFileSync(filePath,JSON.stringify(fileContent,null,2),'utf-8')
}
Finally, tidy up the code so that reading the file and writing the file are separate steps.
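One way to keep reading and writing separate is to make the merge step a pure function of the existing file text and the new result, so that only the caller touches `fs`. A minimal sketch under that assumption; `mergeSnapshot`, the capitalized `CourseResult` name and the sample data are hypothetical, not taken from the original code:

```typescript
interface Course {
  name: string
  count: number
}

interface CourseResult {
  time: number
  data: Course[]
}

// Numeric index signature: each timestamp maps to the courses scraped at that time.
interface Content {
  [time: number]: Course[]
}

// Pure step: merge one scrape result into the previously stored JSON text.
// No file access happens here; the caller decides where the text comes from and goes.
function mergeSnapshot(existingJson: string | null, result: CourseResult): string {
  const content: Content = existingJson ? JSON.parse(existingJson) : {}
  content[result.time] = result.data
  return JSON.stringify(content, null, 2)
}

// Two successive scrapes accumulate under their own timestamps.
const first = mergeSnapshot(null, { time: 1000, data: [{ name: 'TypeScript', count: 12 }] })
const second = mergeSnapshot(first, { time: 2000, data: [{ name: 'TypeScript', count: 15 }] })
```

Because JSON object keys are always strings, the timestamps come back as string keys after `JSON.parse`; the numeric index signature still accepts them when indexing with a number.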
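Refactoring the code with the composition pattern
Extracting the site-specific logic
Move the code that analyzes the HTML and generates the file content into a separate class, and call it from the crawler:
- crowller is only responsible for fetching and writing
- analyzer is only responsible for analysis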
// class crowller
constructor(private analyzer:any){
  this.initSpiderProcess()
}
// index
const analyzer = new Analyzer()
const crowller = new Crowller(analyzer)
Put all of the analyzer-specific logic inside the analyze method, and define an interface that describes the analyzer class and its analyze method.
// crowller.ts
async initSpiderProcess(){
  const html = await this.getRawHtml()
  const fileContent = this.analyzer.analyze(html,this.filePath)
  this.writeFile(fileContent)
}
constructor(private url:string,private analyzer:Analyzer){
  this.initSpiderProcess()
}
// analyzer.ts
public analyze(html:string,filePath:string){
  const courseInfo = this.getCourseInfo(html)
  const fileContent = this.generateJsonFile(courseInfo,filePath)
  return JSON.stringify(fileContent,null,2)
}
// A different analyzer only needs to reimplement this one method
export default class Web1Analyzer implements Analyzer{
  public analyze(html:string,filePath:string){
    const courseInfo = html;
    return courseInfo
  }
}
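The benefit of this composition is that analyzers can be swapped without touching the crawler. A minimal, offline sketch of the idea; `Crawler`, `RawAnalyzer`, `LengthAnalyzer` and the injected `fetchHtml` function are hypothetical names for illustration, and the HTTP request is replaced by a stub so the example runs on its own:

```typescript
// The contract every analyzer must satisfy.
interface Analyzer {
  analyze: (html: string, filePath: string) => string
}

// Hypothetical analyzer: returns the HTML unchanged.
class RawAnalyzer implements Analyzer {
  public analyze(html: string): string {
    return html
  }
}

// Hypothetical analyzer: reports only the length of the HTML.
class LengthAnalyzer implements Analyzer {
  public analyze(html: string): string {
    return JSON.stringify({ length: html.length })
  }
}

// The crawler owns fetching and output; analysis is delegated to whatever it was given.
class Crawler {
  constructor(
    private fetchHtml: () => Promise<string>,
    private analyzer: Analyzer,
    private filePath: string
  ) {}

  async run(): Promise<string> {
    const html = await this.fetchHtml()
    return this.analyzer.analyze(html, this.filePath)
  }
}

// Same crawler class, two different behaviours, chosen at construction time.
const stubFetch = async () => '<p>hi</p>'
const rawCrawler = new Crawler(stubFetch, new RawAnalyzer(), 'out.json')
const lengthCrawler = new Crawler(stubFetch, new LengthAnalyzer(), 'out.json')
```

In this sketch the work starts from an explicit `run()` call rather than from the constructor, so the caller can `await` the result and handle a rejected promise; that is a design choice of the sketch, not something the original code does.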
Finally, here is all of the relevant code:
// analyzer.ts
import cheerio from 'cheerio'
import fs from 'fs'
import path from 'path'
import {Analyzer} from './crowller'
interface Course{
  name:string,
  count:number
}
interface courseResult{
  time:number,
  data:Course[]
}
interface Content{
  [propName:number]:Course[]
}
export default class Web1Analyzer implements Analyzer{
  // extract the course name and count from every .course-item element
  private getCourseInfo(html:string){
    const courseInfos:Course[] = []
    const $ = cheerio.load(html)
    $('.content').find('.course-item').each((index,element) =>{
      const desc = $(element).find('.course-desc')
      const name = desc.eq(0).text()
      const count = parseInt(desc.eq(1).text().split(':')[1],10)
      courseInfos.push({
        name,
        count
      })
    })
    const result = {
      time:new Date().getTime(),
      data:courseInfos
    }
    return result
  }
  // merge the new result into the existing file content, keyed by timestamp
  private generateJsonFile(courseResult:courseResult,filePath:string){
    let fileContent:Content = {}
    if(fs.existsSync(filePath)){
      fileContent = JSON.parse(fs.readFileSync(filePath,'utf-8'))
    }
    fileContent[courseResult.time] = courseResult.data
    return fileContent
  }
  public analyze(html:string,filePath:string){
    const courseInfo = this.getCourseInfo(html)
    const fileContent = this.generateJsonFile(courseInfo,filePath)
    return JSON.stringify(fileContent,null,2)
  }
}
// crowller
import superagent from 'superagent'
import cheerio from 'cheerio'
import fs from 'fs'
import path from 'path'
import Web1Analyzer from './analyzer'
import analyzerB from './analyzerB'
interface Course{
  name:string,
  count:number
}
interface courseResult{
  time:number,
  data:Course[]
}
interface Content{
  [propName:number]:Course[]
}
export interface Analyzer{
  analyze:(html:string,filePath:string) => string;
}
class Crowller{
  private filePath = path.resolve(__dirname,'../data/course.json')
  async getRawHtml(){
    const result = await superagent.get(this.url)
    return result.text
  }
  writeFile(fileContent:string){
    fs.writeFileSync(this.filePath,fileContent,'utf-8')
  }
  async initSpiderProcess(){
    const html = await this.getRawHtml()
    const fileContent = this.analyzer.analyze(html,this.filePath)
    this.writeFile(fileContent)
  }
  constructor(private url:string,private analyzer:Analyzer){
    this.initSpiderProcess()
  }
}
const secret = 'x3b174jsx'
const url = `http://www.dell-lee.com/typescript/demo.html?secret=${secret}`
const analyzer = new analyzerB()
new Crowller(url,analyzer)
// superagent is a plain JS library, so TS needs type information to use it:
// ts -> .d.ts declaration file (from @types) -> js