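Preface

I recently wanted to learn some data analysis, so I decided to first scrape some data with a crawler and use it for analysis later. I had never worked on a crawler project before, but after some research I learned that, besides the mainstream choice of Python, Java is also quite convenient for scraping data, thanks to the Webmagic framework.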

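Project Overview

The scraped data has to be stored somewhere. Webmagic's built-in Pipelines, such as JsonFilePipeline, make it easy to save data locally, but to learn the framework better I chose to write a custom Pipeline that persists the data to a database. For a small project like this, writing insert statements with a plain Statement would be enough. However, I have hardly used Hibernate (I mostly use Mybatis) and wanted some practice, and I also expected to extend the project later, so I used Hibernate for data storage. The technology stack is therefore Webmagic + Hibernate. The main topics are: implementing Webmagic's PageProcessor, and customizing a Pipeline and a Scheduler.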

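Introduction to Webmagic

According to the official Webmagic site, the framework has four main components: Downloader, PageProcessor, Scheduler and Pipeline. They correspond to the downloading, processing, management and persistence stages of the crawler lifecycle, respectively. The overall architecture is shown below:

[Figure: Webmagic architecture diagram]

In practice, the only thing we have to implement is the PageProcessor; the other three (Scheduler, Downloader and Pipeline) all come with ready-made implementations to choose from. Spider is the engine that drives everything: once the PageProcessor, Scheduler, Downloader and Pipeline are in place, Spider is used to start the crawl. In this project I implement a PageProcessor, custom Pipelines for downloading images and for storing data in the database, and a custom Scheduler.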

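Inserting Data with Hibernate

1. Add the dependencies

Before using Hibernate, the dependencies must be added. The ones below are the most important, including those required by Webmagic itself. Add them to the pom file to pull the corresponding packages into the project.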

        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate</groupId>
            <artifactId>hibernate-entitymanager</artifactId>
            <version>5.0.12.Final</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.43</version>
        </dependency>

2. Set up the configuration file

Next, add a persistence.xml file under resources/META-INF to hold the database configuration. Its content is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<persistence xmlns="http://java.sun.com/xml/ns/persistence" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd" version="2.0">
    <persistence-unit name="persistenceUnit" transaction-type="RESOURCE_LOCAL">
        <properties>
            <property name="hibernate.dialect" value="org.hibernate.dialect.MySQL5InnoDBDialect"/>
            <property name="hibernate.hbm2ddl.auto" value="update"/>
            <property name="hibernate.connection.driver_class" value="com.mysql.jdbc.Driver"/>
            <property name="hibernate.connection.username" value="root"/>
<!--            <property name="hibernate.connection.password" value=""/>-->
            <property name="hibernate.connection.url" value="jdbc:mysql://localhost:3306/test?useSSL=false&amp;characterEncoding=UTF-8"/>
            <property name="hibernate.show_sql" value="true"/>
        </properties>
    </persistence-unit>
</persistence>

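3. Build the entity class

With the configuration done, an entity class is needed to map to the database table. I am going to crawl Jianshu, so I created a table named t_simple_book with just three columns for now: id, title and user, which are the primary key, the article title and the user name. The corresponding entity class is: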

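import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;

import lombok.Data;

// @Data (Lombok) generates the getters and setters for the fields below.
@Data
@Entity
@Table(name = "t_simple_book")
public class SimpleBook {
    // Auto-increment primary key generated by the database.
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Integer id;

    // Article title.
    @Column(name = "title")
    private String title;

    // Name of the user who wrote the article.
    @Column(name = "user")
    private String user;
}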

I also defined a static method for inserting data:

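public static <T> void insertOneData(T data) {
    EntityManagerFactory factory = Persistence.createEntityManagerFactory("persistenceUnit");
    EntityManager entityManager = factory.createEntityManager();
    try {
        entityManager.getTransaction().begin();
        entityManager.persist(data);
        entityManager.getTransaction().commit();
    } catch (RuntimeException e) {
        // Roll back so a failed insert does not leave a transaction open.
        if (entityManager.getTransaction().isActive()) {
            entityManager.getTransaction().rollback();
        }
        throw e;
    } finally {
        entityManager.close();
        factory.close();
    }
}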

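At this point data can be inserted into the database. Of course, this approach creates and closes the connection objects on every call, which hurts performance. I will ignore that for now; it can be handled later with a connection pool.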
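One low-effort step short of a real connection pool is to create the expensive factory object once and reuse it for every insert. The sketch below only illustrates that idea with plain Java: the `Shared` helper and the demo class are hypothetical names, and the `Object` created in the supplier is a stand-in for what `Persistence.createEntityManagerFactory("persistenceUnit")` would return in the real project.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper (not part of Hibernate or Webmagic): creates an
// expensive object once and hands the same instance to every caller.
final class Shared<T> {
    private final Supplier<T> creator;
    private T instance;

    Shared(Supplier<T> creator) {
        this.creator = creator;
    }

    // synchronized so that concurrent callers never create two instances
    synchronized T get() {
        if (instance == null) {
            instance = creator.get();
        }
        return instance;
    }
}

class SharedFactoryDemo {
    // Counts how often the "expensive" creator actually runs.
    static final AtomicInteger CREATED = new AtomicInteger();

    // Stand-in for Persistence.createEntityManagerFactory("persistenceUnit").
    static final Shared<Object> FACTORY = new Shared<>(() -> {
        CREATED.incrementAndGet();
        return new Object();
    });

    public static void main(String[] args) {
        Object first = FACTORY.get();
        Object second = FACTORY.get();
        System.out.println(first == second); // the same instance is reused
        System.out.println(CREATED.get());   // the creator ran only once
    }
}
```

With a holder like this, an insert method would only open and close an EntityManager per call and leave the factory alive until the crawl ends. This is reuse of one object, not pooling of database connections, so it complements rather than replaces a pool.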

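Implementing the PageProcessor

Now we can implement the PageProcessor and do the actual crawling. Jianshu's home page is itself a list page, and every new request to it returns a fresh list of articles. So if we keep adding the home page URL to the requests, we keep getting new article lists. Page.addTargetRequests() adds requests to a List inside the Page; these are then pushed into the Scheduler's internal queue and later poll()ed from that queue to be sent. The approach is therefore to add the URLs found in each list page to the requests and then add the home page URL as well, so that new content keeps being crawled. Once on an article detail page, the page elements can be analyzed and the needed content extracted. Here I only extract the article title, the author, and the URLs of all images that appear in the article.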
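The "articles first, then the list page again" idea can be sketched without the framework. The example below is only an illustration in plain Java: the class name, the home page URL constant and the `/p/<id>` article path pattern are assumptions made for the sketch, and in the real project the resulting list would be handed to Page.addTargetRequests() instead of being printed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListPageLinks {
    // Assumed list page URL and assumed article link pattern ("/p/<id>").
    static final String HOME = "https://www.jianshu.com/";
    static final Pattern ARTICLE = Pattern.compile("href=\"(/p/[0-9a-z]+)\"");

    // Returns the requests to queue after visiting one list page:
    // every article found on it, followed by the list page itself.
    static List<String> nextRequests(String listPageHtml) {
        List<String> targets = new ArrayList<>();
        Matcher m = ARTICLE.matcher(listPageHtml);
        while (m.find()) {
            targets.add("https://www.jianshu.com" + m.group(1));
        }
        targets.add(HOME); // re-queue the list page to get a fresh list later
        return targets;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/p/abc123\">A</a> "
                + "<a href=\"/u/someone\">U</a> "
                + "<a href=\"/p/def456\">B</a>";
        // Only the two /p/ links are kept, then the home page is appended.
        System.out.println(nextRequests(html));
    }
}
```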

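Customizing the Pipeline

I wrote two Pipelines: one downloads images and the other writes article data to the database. A Pipeline is essentially a post-processing step for the results of the PageProcessor, so anything you do in a Pipeline could also be done in the PageProcessor. But since the two belong to different stages of the crawl, it is better to keep them separate. Likewise, my two Pipelines could have been merged into one, but they do different jobs and keeping them apart makes the structure clearer, so I wrote them as two separate Pipelines.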
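To make the separation concrete, here is a minimal sketch in plain Java. `FieldPipeline` is a hypothetical, simplified stand-in for a pipeline stage, not Webmagic's own Pipeline interface, and both stages only record what they would do instead of really downloading files or touching a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified pipeline stage: it receives the fields that
// were extracted from one page.
interface FieldPipeline {
    void process(Map<String, Object> fields);
}

class TwoPipelinesDemo {
    // Records what each stage did, so the order is easy to inspect.
    static final List<String> LOG = new ArrayList<>();

    // Stage 1: only cares about image URLs (a real one would download them).
    static final FieldPipeline IMAGES = fields ->
            LOG.add("download " + fields.get("images"));

    // Stage 2: only cares about title and user (a real one would persist them).
    static final FieldPipeline DATABASE = fields ->
            LOG.add("save " + fields.get("title") + " by " + fields.get("user"));

    // Every stage sees the same extracted fields, one after the other.
    static void runAll(Map<String, Object> fields, List<FieldPipeline> stages) {
        for (FieldPipeline stage : stages) {
            stage.process(fields);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("title", "Hello");
        fields.put("user", "alice");
        fields.put("images", Arrays.asList("https://example.com/a.png"));
        runAll(fields, Arrays.asList(IMAGES, DATABASE));
        LOG.forEach(System.out::println);
    }
}
```

Because each stage reads only the fields it needs, either one can be removed or replaced without touching the other, which is the structural benefit described above.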

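Customizing the Scheduler

During a crawl, the Scheduler's main job is to keep adding requests to a queue and to keep polling requests out of it. While working on the project I sometimes needed to start the program just to test something, but once started it would crawl forever unless I killed it. So I wondered whether the program could stop by itself once some condition was met. To find out how the program eventually stops, I looked at the source of Spider.run(), shown below: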

    public void run() {
        this.checkRunningStat();
        this.initComponent();
        this.logger.info("Spider {} started!", this.getUUID());

        while(!Thread.currentThread().isInterrupted() && this.stat.get() == 1) {
            final Request request = this.scheduler.poll(this);
            if (request == null) {
                if (this.threadPool.getThreadAlive() == 0 && this.exitWhenComplete) {
                    break;
                }

                this.waitNewUrl();
            } else {
                this.threadPool.execute(new Runnable() {
                    public void run() {
                        try {
                            Spider.this.processRequest(request);
                            Spider.this.onSuccess(request);
                        } catch (Exception var5) {
                            Spider.this.onError(request);
                            Spider.this.logger.error("process request " + request + " error", var5);
                        } finally {
                            Spider.this.pageCount.incrementAndGet();
                            Spider.this.signalNewUrl();
                        }

                    }
                });
            }
        }

        this.stat.set(2);
        if (this.destroyWhenExit) {
            this.close();
        }

        this.logger.info("Spider {} closed! {} pages downloaded.", this.getUUID(), this.pageCount.get());
    }

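As the code shows, the while loop keeps polling requests from the Scheduler and, whenever a request is not null, hands it to a thread for execution. When poll() keeps returning null, the loop exits once the active threads have finished their tasks (provided exitWhenComplete is set). So I came up with a counting Scheduler that counts the requests it hands out and, once a given limit is reached, makes poll() always return null so that the crawl ends. This lets me download only a fixed number of pages. In the same way, one could write a timed Scheduler that takes a duration and stops once it has elapsed. The code below is my counting Scheduler, modeled on the built-in QueueScheduler.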

@ThreadSafe
public class CountableScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

    private final BlockingQueue<Request> queue = new LinkedBlockingQueue<>();

    private int count = -1;

    public CountableScheduler() {
    }

    @Override
    public void pushWhenNoDuplicate(Request request, Task task) {
        this.queue.add(request);
    }

    @Override
    public Request poll(Task task) {
        // count == -1 means no limit was set: behave like a plain queue.
        if (count == -1) {
            return this.queue.poll();
        }
        if (count > 0) {
            count--;
            return this.queue.poll();
        }
        // Limit reached: drop pending requests and return null so that
        // Spider.run() can leave its loop.
        this.queue.clear();
        return null;
    }

    @Override
    public int getLeftRequestsCount(Task task) {
        return this.queue.size();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return this.getDuplicateRemover().getTotalRequestsCount(task);
    }

    public CountableScheduler setCount(int count) {
        if(count <= 0) {
            return this;
        }
        this.count = count + 1;
        return this;
    }
}
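The time-based variant mentioned earlier follows the same pattern: hand out requests normally until a deadline, then keep returning null. The sketch below shows only that logic in plain Java. `TimedQueue` is a hypothetical stand-alone class that queues strings; a real version would extend a Webmagic scheduler and queue Request objects. The clock is passed in as a parameter purely so the demo can run without sleeping, and the time window is assumed to start when the object is constructed.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.LongSupplier;

// Hypothetical stand-alone queue illustrating the time-limit idea.
class TimedQueue {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final LongSupplier clock;   // supplies the current time in ms
    private final long deadlineMillis;  // poll() returns null from here on

    TimedQueue(long durationMillis, LongSupplier clock) {
        this.clock = clock;
        this.deadlineMillis = clock.getAsLong() + durationMillis;
    }

    void push(String url) {
        queue.add(url);
    }

    // Before the deadline: behave like a normal queue.
    // After the deadline: drop pending work and keep returning null.
    String poll() {
        if (clock.getAsLong() >= deadlineMillis) {
            queue.clear();
            return null;
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock so the demo needs no sleeping
        TimedQueue q = new TimedQueue(1000, () -> now[0]);
        q.push("a");
        q.push("b");
        System.out.println(q.poll()); // prints "a": still inside the window
        now[0] = 1000;                // the deadline has been reached
        System.out.println(q.poll()); // prints "null": remaining work dropped
    }
}
```

In real use the clock argument would simply be `System::currentTimeMillis`.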

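Afterword

That is all for this crawler experiment. It is just a brief record of my first, fairly shallow look at the framework. Later I would like to build a project that uses Elasticsearch + Webmagic to analyze a large amount of data.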
