為了適應(yīng)不同項(xiàng)目對(duì)不同感興趣屬性的解析存儲(chǔ),數(shù)據(jù)存儲(chǔ)結(jié)構(gòu)采用縱向的屬性列表方式,即一個(gè)url頁(yè)面多個(gè)屬性存儲(chǔ)多條記錄方式,并且按照text,html,data,num幾大典型類(lèi)型分別對(duì)應(yīng)存儲(chǔ)。
創(chuàng)建UTF-8字符集的nutch數(shù)據(jù)庫(kù),并執(zhí)行表初始化腳本,參考DDL:
CREATE TABLE `crawl_data` (
`url` varchar(255) NOT NULL,
`code` varchar(255) NOT NULL,
`name` varchar(255) DEFAULT NULL,
`category` varchar(255) DEFAULT NULL,
`order_index` int(255) DEFAULT NULL,
`fetch_time` datetime NOT NULL,
`text_value` text, `html_value` text,
`date_value` datetime DEFAULT NULL,
`num_value` decimal(18,2) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

為了一般在業(yè)務(wù)系統(tǒng)獲取同一個(gè)url的多個(gè)屬性或友好查詢顯示,參考如下MySQL腳本實(shí)現(xiàn)把縱向的屬性列表屬性轉(zhuǎn)換為橫向的多列顯示模式。網(wǎng)上有 相關(guān)參考腳本大多是基于數(shù)字字段sum匯總等模式的縱轉(zhuǎn)橫SQL腳本,本項(xiàng)目需求是單一的基于字符串信息轉(zhuǎn)換,經(jīng)過(guò)一段摸索最后找到基于 GROUP_CONCAT可以實(shí)現(xiàn):
SELECT url ,fetch_time,
GROUP_CONCAT(CASE WHEN code = 'domain' THEN text_value ELSE null END) AS `domain`,
GROUP_CONCAT(CASE WHEN code = 'name' THEN text_value ELSE null END) AS `name`,
GROUP_CONCAT(CASE WHEN code = 'brand' THEN text_value ELSE null END) AS `brand`,
GROUP_CONCAT(CASE WHEN code = 'category' THEN text_value ELSE null END) AS `category`,
GROUP_CONCAT(CASE WHEN code = 'purpose' THEN text_value ELSE null END) AS `purpose`,
GROUP_CONCAT(CASE WHEN code = 'price' THEN num_value ELSE null END) AS `price`,
GROUP_CONCAT(CASE WHEN code = 'refPrice' THEN num_value ELSE null END) AS `refPrice`,
GROUP_CONCAT(CASE WHEN code = 'primaryImage' THEN text_value ELSE null END) AS `primaryImage` FROM crawl_data GROUP BY url,fetch_time
