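Implementation approach
- Use WKWebView to open the web page to be scraped
- Once the webView has finished rendering, inject a scraping script
- Read the scraped data in the script's completion callback
Code
The example below scrapes a Tmall product page.
First, print the page content
Inject the script document.body.innerHTML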
- (void)viewDidLoad {
    [super viewDidLoad];
    self.webView = [[WKWebView alloc] initWithFrame:CGRectMake(0, 100.f, FULL_WIDTH, 200.f)];
    self.webView.navigationDelegate = self;
    [self.view addSubview:self.webView];
    [self.webView loadRequest:[NSURLRequest requestWithURL:[NSURL URLWithString:@"https://detail.tmall.com/item.htm?id=578502467835&ali_refid=a3_430406_1007:1121266184:N:1060515764_0_100:61033457550edeff91391950420fef46&ali_trackid=1_61033457550edeff91391950420fef46&spm=a21bo.2017.201874-sales.57"]]];
}
- (void)webView:(WKWebView *)webView didFinishNavigation:(null_unspecified WKNavigation *)navigation {
    // Evaluate the script once navigation has finished and log what comes back
    [self.webView evaluateJavaScript:@"document.body.innerHTML" completionHandler:^(id _Nullable result, NSError * _Nullable error) {
        NSLog(@"Scrape result: %@", result);
    }];
}

[Figure: printed result]
After formatting:

[Figure: DOM of the product image]

[Figure: DOM of the product name]

[Figure: DOM of the product price]
Writing the script
Getting the product image:
document.getElementsByClassName('item')[0].getElementsByTagName('img')[0].src
Getting the price:
document.getElementsByClassName('real-price')[0].getElementsByClassName('price')[0].textContent
Getting the product name:
document.getElementsByClassName('main')[0].textContent
Combine them and return the result as a dictionary (complete script)
(function() {
    var init = function() {
        return {
            imgSrc: document.getElementsByClassName('item')[0].getElementsByTagName('img')[0].src,
            price: document.getElementsByClassName('real-price')[0].getElementsByClassName('price')[0].textContent,
            title: document.getElementsByClassName('main')[0].textContent
        };
    };
    return init();
})()
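The complete script above throws as soon as one of the class names is missing from the page, because indexing an empty collection gives undefined. Below is a minimal sketch of a more defensive variant that returns null for any field it cannot find. The helper names firstByClass and extractProduct, and the idea of passing the document in as a parameter, are assumptions made for illustration and are not part of the original script.

```javascript
// Returns the first element with the given class under `root`, or null.
function firstByClass(root, className) {
    if (!root) return null;
    var found = root.getElementsByClassName(className);
    return found.length > 0 ? found[0] : null;
}

// `doc` is any document-like object; inside the page it would be `document`.
function extractProduct(doc) {
    var item = firstByClass(doc, 'item');
    var img = item ? item.getElementsByTagName('img')[0] : null;
    var price = firstByClass(firstByClass(doc, 'real-price'), 'price');
    var title = firstByClass(doc, 'main');
    return {
        imgSrc: img ? img.src : null,
        price: price ? price.textContent : null,
        title: title ? title.textContent : null
    };
}
```

When injecting this sketch, the script string would presumably end with the expression extractProduct(document) so that the object becomes the value handed to the completion handler.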
Inject the new script
- (void)webView:(WKWebView *)webView didFinishNavigation:(null_unspecified WKNavigation *)navigation {
    // The complete script, collapsed onto one line so it fits in a string literal
    [self.webView evaluateJavaScript:@"(function(){var init = function(){return {imgSrc:document.getElementsByClassName('item')[0].getElementsByTagName('img')[0].src,price:document.getElementsByClassName('real-price')[0].getElementsByClassName('price')[0].textContent,title:document.getElementsByClassName('main')[0].textContent};}; return init();})()" completionHandler:^(id _Nullable result, NSError * _Nullable error) {
        NSLog(@"Scrape result: %@", result);
    }];
}

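[Figure: printed result]
Notes
(1) Write the selectors against the HTML that the client (the WKWebView) actually receives; it is not the same as the HTML you see when opening the page in a browser.
(2) When the script is faulty, error reports Error Domain=WKErrorDomain Code=4 "A JavaScript exception occurred". Fix the script according to the message.
(3) A script hosted on a server can be downloaded and converted to a string as follows: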
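// Note: this NSURLConnection method has been deprecated since iOS 9; NSURLSession is its replacement.
[NSURLConnection sendAsynchronousRequest:[NSURLRequest requestWithURL:[NSURL URLWithString:@"https://xxxxx.js"]] queue:[NSOperationQueue mainQueue] completionHandler:^(NSURLResponse *response, NSData *data, NSError *connectionError) {
    // Decode the downloaded bytes into the script string to inject
    NSString *script = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
}];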
Update, 2019.02.16
Most page data arrives asynchronously, so when the didFinishNavigation callback fires, the data to be scraped has often not been loaded yet.
To handle this, add a listener for DOM changes and debounce it so the scraper is not called too often.
var timer = null;
var body = document.getElementsByTagName("body")[0];
body.addEventListener("DOMSubtreeModified", function(evt) {
    clearTimeout(timer);
    timer = setTimeout(function(){
        spider();
    }, 1000);
}, false);
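The listener above relies on DOMSubtreeModified, which is a deprecated mutation event. A minimal sketch of the same debounce idea built on MutationObserver follows. The names debounce and watchDom, and the injectable timers parameter (there only so the helper can be exercised outside a browser), are assumptions for illustration and not part of the original script.

```javascript
// Returns a wrapper that runs `fn` only after `waitMs` without further calls.
// `timers` is optional and hypothetical: { set(fn, ms) -> handle, clear(handle) }.
function debounce(fn, waitMs, timers) {
    var t = timers || {
        set: function (f, ms) { return setTimeout(f, ms); },
        clear: function (h) { clearTimeout(h); }
    };
    var handle = null;
    return function () {
        if (handle !== null) t.clear(handle); // cancel the pending run
        handle = t.set(function () {
            handle = null;
            fn();
        }, waitMs);
    };
}

// Calls `onSettled` once the DOM has been quiet for one second.
// Returns the observer, or null when MutationObserver is unavailable.
function watchDom(onSettled) {
    if (typeof MutationObserver === 'undefined') return null;
    var observer = new MutationObserver(debounce(onSettled, 1000));
    observer.observe(document.body, { childList: true, subtree: true, characterData: true });
    return observer;
}
```

In the page this would be started with something like watchDom(spider), assuming spider is the scraping function defined by the injected script.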
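Because spider() now runs later from a timer, its result can no longer come back through the evaluateJavaScript completion handler, so the JS has to call into Objective-C instead.
Create a configuration for the webView when initializing it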
WKWebViewConfiguration *configuration = [[WKWebViewConfiguration alloc] init];
// Register the message handler name
[configuration.userContentController addScriptMessageHandler:self name:@"spider"];
self.webView = [[WKWebView alloc] initWithFrame:frame configuration:configuration];
Implement the WKScriptMessageHandler protocol
- (void)userContentController:(WKUserContentController *)userContentController didReceiveScriptMessage:(WKScriptMessage *)message
{
    if ([message.name isEqualToString:@"spider"])
    {
        // Data passed over from the JS side
        NSLog(@"%@", message.body);
    }
}
JS script
var spider = function() {
    ...
    //window.webkit.messageHandlers.<name>.postMessage(<messageBody>)
    window.webkit.messageHandlers.spider.postMessage(spiderData);
    ...
}
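Since the DOM listener can fire many times, the posting step is worth guarding. The sketch below checks that the message handler exists before posting and skips payloads identical to the last one sent. The names postToNative and reportOnce, and the optional win parameter used to pass in a stand-in for window, are hypothetical. Only the handler name spider and the window.webkit.messageHandlers path come from the code above.

```javascript
// Posts `payload` to the native handler `handlerName`.
// Returns false when no such handler is reachable (e.g. outside WKWebView).
function postToNative(handlerName, payload, win) {
    var w = win || (typeof window !== 'undefined' ? window : null);
    var handler = w && w.webkit && w.webkit.messageHandlers &&
        w.webkit.messageHandlers[handlerName];
    if (!handler) return false;
    handler.postMessage(payload);
    return true;
}

// Remembers the last payload so repeated DOM changes do not resend the same data.
var lastSent = null;
function reportOnce(data, win) {
    var serialized = JSON.stringify(data);
    if (serialized === lastSent) return false; // nothing new to report
    if (!postToNative('spider', data, win)) return false;
    lastSent = serialized;
    return true;
}
```

Inside spider(), the direct postMessage call could then be swapped for reportOnce(spiderData), under the assumption that spiderData is a plain object of strings and numbers.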