**Big Data**
Reportedly, the term traces back to Alvin Toffler's The Third Wave, published in 1980.
Related reading: "In memoriam | Alvin Toffler: how to defuse future shock"

"Big data" is a broad, catch-all term, yet discussion of big data and of how to process and analyze it keeps heating up, and it is now treated as a topic on the scale of a new industrial revolution.
What is big data? As a data collection team, we have spent a long time thinking about what it is and where its prospects and value lie.
In this article I will share my own views, along with some interesting content and resources, on:
- What big data is
- Big data in practice
- Application scenarios for big data

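A shameless plug: our team's tool lets you collect data with zero barrier to entry:
造數 (Zaoshu): the handiest cloud crawler tool, a crawler on the attack!
Layoffs are on everyone's lips lately. If you want to know whether the wave of layoffs in the internet industry has really had a lasting negative effect on jobs and salaries, you can use our tool to collect job listings on a schedule, a few times a day, and see for yourself.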
(1) What big data is
First, hear what the experts say:
"Big data means a lot, simply a lot. The old equipment can't store it and can't compute it." -- 啪菠蘿·畢加索 (a play on "Pablo Picasso")
"Big data is not random samples but all the data; not exactness but messiness; not causation but correlation." -- Schönberger
Over to TED: Kenneth Cukier: Big data is better data
America's favorite pie is? Audience: Apple. Kenneth Cukier: Apple. Of course it is. How do we know it? Because of data. You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple. But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened? Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. (Laughter) But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.
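People used to think that apple pie was everyone's favorite. With finer-grained data, though, you find that apple pie's popularity is really the result of a compromise: apple is everyone's second-favorite flavor.
Once you have the sales data for small pies, you find that apple pie actually ranks only around fourth or fifth.
With more data, you can see information that you could not see before.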
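The pie story can be replayed with a toy model. The sketch below is only an illustration: the households, their flavor rankings, and the rule that a family buys the flavor with the lowest total rank are all assumptions made up for this example, not data from the talk.

```python
from collections import Counter

# Hypothetical households: each member ranks the flavors, best first.
# These rankings are invented for illustration only.
HOUSEHOLDS = [
    [["cherry", "apple", "pecan", "pumpkin"],
     ["pecan", "apple", "cherry", "pumpkin"],
     ["pumpkin", "apple", "cherry", "pecan"]],
    [["cherry", "apple", "pecan", "pumpkin"],
     ["cherry", "apple", "pumpkin", "pecan"]],
    [["pecan", "apple", "pumpkin", "cherry"],
     ["pumpkin", "apple", "pecan", "cherry"],
     ["cherry", "apple", "pecan", "pumpkin"]],
    [["pecan", "apple", "cherry", "pumpkin"],
     ["cherry", "apple", "pecan", "pumpkin"],
     ["pumpkin", "apple", "pecan", "cherry"]],
]


def family_pick(members):
    """Assumed compromise rule: buy the flavor with the lowest total rank."""
    flavors = members[0]
    return min(flavors, key=lambda f: sum(r.index(f) for r in members))


# Large pies: one shared purchase per household.
large_pie_sales = Counter(family_pick(h) for h in HOUSEHOLDS)
# Small pies: every person simply buys their own first choice.
small_pie_sales = Counter(r[0] for h in HOUSEHOLDS for r in h)

print("large pies:", large_pie_sales.most_common())
print("small pies:", small_pie_sales.most_common())
```

In this made-up data apple dominates the shared purchases even though nobody ranks it first, which is the pattern the finer-grained small-pie sales expose.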
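"What is the core value of big data?" (Business, Zhihu): I recommend @Han Hsiao's answer. It is very clearly structured and gives a lucid discussion of the positive significance of big data.
"Big data sounds impressive, but is it really?" (Artificial Intelligence, Zhihu): @陳萌萌's answer here is also excellent, so good that I wonder whether she is actually an AI.
"What is the core value of big data?" (Business, Zhihu): the same question again, this time the article by @劉飛.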
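Big data is the collection of big data
The big data industry is, at its core, a service industry that depends on data sources.
The most fundamental thing about big data is that the way information is collected has changed dramatically. The rise of big data is closely tied to the huge amount of information that now appears directly on the internet.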

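Weibo, Tmall, Taobao, WeChat and others directly generate enormous volumes of information, including location data, message logs, purchase records, reviews, reading history and more. You could say that every internet company naturally carries the label of a data company. But if we look more closely at where data actually originates, we find that a great deal of it still needs to be collected and classified.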

Joel Selanikio: Transcript of "The big-data revolution in healthcare"
There's a concept that people talk about nowadays called "big data." And what they're talking about is all of the information that we're generating through our interaction with and over the Internet, everything from Facebook and Twitter to music downloads, movies, streaming, all this kind of stuff, the live streaming of TED. And the folks who work with big data, for them, they talk about that their biggest problem is we have so much information. The biggest problem is: how do we organize all that information?
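Everyone talks about big data these days, but what they really mean is the information generated every day on Facebook, Twitter, streaming sites and the like. People who work with big data feel that the volume of data we have is simply too large. (Organizing the information remains the hardest problem.)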
I can tell you that, working in global health, that is not our biggest problem. Because for us, even though the light is better on the Internet, the data that would help us solve the problems we're trying to solve is not actually present on the Internet. So we don't know, for example, how many people right now are being affected by disasters or by conflict situations. We don't know for, really, basically, any of the clinics in the developing world, which ones have medicines and which ones don't. We have no idea of what the supply chain is for those clinics. We don't know -- and this is really amazing to me -- we don't know how many children were born -- or how many children there are -- in Bolivia or Botswana or Bhutan. We don't know how many kids died last week in any of those countries. We don't know the needs of the elderly, the mentally ill. For all of these different critically important problems or critically important areas that we want to solve problems in, we basically know nothing at all.
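A lot of useful data is still not on the internet at all and has to be gathered by old-fashioned methods. In a great many fields, data problems at the most basic level are still plain to see.
"What are some 'magical' ways of obtaining data?" (answer by Liu Cao, Zhihu): at this point I recommend @Liu Cao's answer.
嚴瀾 (lanceyan)'s blog: tech sharing, framework discussion, big data processing, architecture building, robots
Highly recommended: "How would you describe the big data technology ecosystem with vivid analogies? What is the relationship between Hadoop, Hive and Spark?", in particular the answer by @Xiaoyu Ma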
(2) Big data in practice
For tools, see: "What tools are generally used for big data analysis?" (JavaScript, Zhihu)
I recently came across an example about how Pokémon GO changed its players' activity levels:
1. Example of data analysis within an app:


Six months on, the activity level of most Pokémon GO players had gradually fallen back to roughly that of non-players.
So it does look like a game with a real effect.
2. Example of big data analysis of traffic conditions:





Susan Etlinger: What do we do with all this big data?
Now, there's a group of data scientists out of the University of Illinois-Chicago, and they're called the Health Media Collaboratory, and they've been working with the Centers for Disease Control to better understand how people talk about quitting smoking, how they talk about electronic cigarettes, and what they can do collectively to help them quit. The interesting thing is, if you want to understand how people talk about smoking, first you have to understand what they mean when they say "smoking." And on Twitter, there are four main categories: number one, smoking cigarettes; number two, smoking marijuana; number three, smoking ribs; and number four, smoking hot women.
This part is very interesting.
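To make the disambiguation problem concrete, here is a deliberately naive sketch that sorts a post into one of the four senses of "smoking" by counting cue words. The cue lists, the label names and the sample posts are all hypothetical, and this is not the Health Media Collaboratory's method; real systems need far more than keyword matching.

```python
import re

# Hypothetical cue words for each sense of "smoking" (assumed, not from the talk).
CUES = {
    "cigarettes": {"cigarette", "cigarettes", "nicotine", "quit", "pack"},
    "marijuana": {"weed", "joint", "marijuana"},
    "barbecue": {"ribs", "brisket", "bbq"},
    "slang_hot": {"hot"},
}


def classify(text):
    """Return the sense whose cue words overlap the post the most."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {label: len(words & cues) for label, cues in CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue word at all, refuse to guess.
    return best if scores[best] > 0 else "unknown"


# Invented sample posts, for illustration only.
for post in [
    "Day 3 of trying to quit smoking cigarettes",
    "Smoking ribs all afternoon for the bbq",
    "She looked smoking hot tonight",
    "smoking",
]:
    print(classify(post), "<-", post)
```

The last post shows why the step matters: the word "smoking" alone carries no signal about which conversation it belongs to.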
(3) Application scenarios for big data
First, two links for news and industry watching:
Current state of big data industry development in Beijing-Tianjin-Hebei | Report | 數據觀 | China Big Data Industry Observation, big data portal
數據觀 | China Big Data Industry Observation, big data portal
Today, big data is receiving more and more attention at both the policy level and the national-strategy level.
In terms of application scenarios, it is currently used in:
- Supply chain and channel analysis & optimization
- Pricing analysis and optimization
- Fraud analysis & detection
- Equipment management
- Social media analysis & customer analysis

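Viktor Mayer-Schönberger, author of Big Data (《大數據時代》), argues that the big data era brings three major shifts:
"First, we can analyze far more data, sometimes even all the data relating to a particular phenomenon, instead of relying on random sampling. Greater precision lets us discover more detail.
Second, there is so much data to study that we are no longer fixated on exactness. Appropriately ignoring exactness at the micro level brings better insight and greater commercial benefit.
Third, we no longer fixate on finding causation and instead look for correlations between things. For example, rather than investigating why airfares change, we focus on the best time to buy a ticket."
Big data has broken the boundaries of traditional enterprise data and ended the situation in which business intelligence relied solely on a company's internal business data. Data sources are now more diverse: they include not only a company's internal data but also external data, especially data about consumers.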
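According to unofficial histories, the ancient Central Asian kingdom of Khwarezm had a peculiar custom: any messenger who brought the king good news was promoted, while anyone who brought bad news was fed to the tigers. People used to mock this king's naivety in believing that rewarding the bearers of good news would encourage good news to arrive, and that executing the bearers of bad news would put an end to bad news.
In today's age of information explosion, we cannot guarantee that the messenger will always bring good news, but you can have our crawler deliver, on a schedule, the information that is most useful to you and best fits your needs.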