
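When crawling with Java, a site's anti-crawling measures can leave the fetched content incomplete, so you need to simulate a browser that opens the page and returns its full content. This article uses Tencent News (https://news.qq.com/) as the example.
For environment setup, see http://www.itdecent.cn/p/6c3d90bef17f, which covers configuring the nodejs environment.
1. Parsing the page with jsoup. When jsoup parses Tencent News it only gets the page source; none of the news-related content comes back, so nothing useful can be scraped.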

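import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/**
 * Fetch and parse a page with jsoup.
 * @param url address of the page to fetch
 * @return the parsed document, or null if the request failed
 */
public static Document getDocumentByJsoup(String url){
    Document document = null;
    try {
        // jsoup downloads the HTML as served; it does not execute JavaScript
        document = Jsoup.connect(url).timeout(15000).get();
        // Print the visible text of <body> to see what was actually fetched
        String text = document.getElementsByTag("body").text();
        System.err.println(text);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return document;
}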

Test result: the news list is not retrieved.

[Screenshot: result.png]
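To see the same effect without jsoup, here is a minimal sketch that uses only the JDK (Java 11+ for `java.net.http`). It fetches the raw HTML and strips scripts and tags to approximate the visible text. The class and method names (`StaticSourceCheck`, `fetchRaw`, `visibleText`) are hypothetical, and the regex stripping is a rough heuristic, not a real HTML parser. If the visible text of a page is nearly empty, the content is probably filled in by JavaScript after load.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of the original article.
class StaticSourceCheck {

    /** Fetches the HTML exactly as the server sends it; no JavaScript is executed. */
    static String fetchRaw(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(15))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(15))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Drops script/style blocks and all tags, leaving roughly the visible text. */
    static String visibleText(String html) {
        return html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ") // remove script and style blocks
                .replaceAll("(?s)<[^>]*>", " ")                      // remove remaining tags
                .replaceAll("\\s+", " ")                             // collapse whitespace
                .trim();
    }
}
```

Calling `StaticSourceCheck.visibleText(StaticSourceCheck.fetchRaw("https://news.qq.com/"))` and checking the length of the result gives a quick hint about whether a browser-based approach is needed.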

2. Testing with chrome + nodejs.

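// Additional import needed by this method: java.io.InputStream

/**
 * Fetch the rendered page through Chrome, driven by a nodejs script.
 * @param url address of the page to render
 * @return the parsed document, or null if the command failed
 */
public static Document getDocument(String url){
    Document document = null;
    // Path to the Chrome browser
    String chromePath = "<your Chrome browser root directory>";
    
    // nodejs path + path of the page-rendering js (both must be in the same directory)
    String nodeJSPath = "<nodejs root path>   <path to the page-rendering script>.js";
    
    String BLANK = "    ";
    
    // Runtime.exec(String) splits the command on whitespace,
    // so none of these paths may contain spaces
    String exec =  nodeJSPath + BLANK + chromePath + BLANK + url;
    
    try {
        System.err.println("exec =======> " + exec);
        
        // Run the script command
        Process process = Runtime.getRuntime().exec(exec);
        
        // The script's stdout is the rendered HTML; read it fully before waiting
        InputStream is = process.getInputStream();
        document = Jsoup.parse(is, "UTF-8", url);
      
        try {
            process.waitFor();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can react to it
            Thread.currentThread().interrupt();
            e.printStackTrace();
        }

        process.destroy();
         
    } catch (Exception e) {
        e.printStackTrace();
    }
    return document;
}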
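Because `Runtime.exec(String)` splits its argument on whitespace, a path such as `C:\Program Files\...` would be broken into several arguments. One possible alternative, sketched below with only JDK classes, passes each argument separately through `ProcessBuilder`. The names `RenderProcess`, `buildCommand`, and `runAndCapture` are hypothetical, and the sketch assumes the nodejs script prints the rendered HTML to stdout, as `getDocument` above does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of the original article.
class RenderProcess {

    /** Builds the argument list; each element reaches the OS as exactly one argument. */
    static List<String> buildCommand(String nodePath, String scriptPath,
                                     String chromePath, String url) {
        return List.of(nodePath, scriptPath, chromePath, url);
    }

    /** Runs the command and returns everything it wrote to stdout, decoded as UTF-8. */
    static String runAndCapture(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Forward the child's stderr to this JVM's stderr so it cannot fill a pipe
        // and stall the child, and so log output stays out of the captured HTML.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        try (InputStream in = process.getInputStream()) {
            // Read stdout until the child closes it, then collect the exit code.
            String output = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("Command exited with code " + exitCode);
            }
            return output;
        } finally {
            // No-op if the process already exited; cleans up if reading failed.
            process.destroy();
        }
    }
}
```

The returned string can then be handed to jsoup with `Jsoup.parse(html, url)`, which takes the HTML text and a base URI.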

Result of running getDocument: