官網(wǎng)學(xué)習(xí)Spark

1.讀取文件

1. 使用sparksession.read.textFile可以讀取 textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
2. 使用.wholeTextFiles 可以讀取一個(gè)包含多個(gè)小文件的目錄,它會(huì)返回一個(gè)鍵值對(duì)(文件名,文件內(nèi)容)
2.1 它可以提供第二個(gè)參數(shù)用來操作最小分區(qū)數(shù)

2. RDDOperations

1. transformations:從已經(jīng)存在的創(chuàng)建一個(gè)新的dataset,(lazy)
2. actions,返回一個(gè)值在對(duì)dataset進(jìn)行計(jì)算之后,

3. 持久化rdd

1. 調(diào)用persist或者是(cache)方法

輸出RDD

1. 可以rdd.collect().foreach(println)他可以將各個(gè)節(jié)點(diǎn)的數(shù)據(jù)匯集到一起來,但是這樣會(huì)耗盡你的內(nèi)存.
2. 我們建議是獲取小部分的數(shù)據(jù)進(jìn)行獲取遍歷.rdd.take(100).foreach(println)

常見actioin操作

Action Meaning
reduce(func) row 1 col 2
collect row 2 col 2
count row 1 col 2
first row 2 col 2
take(n)
takeSample(withReplacement,num,[seed])
takeOrdered(n,[ordering])
saveAsTextFile(path)
saveAsSeTextFile(path)
saveAsSequenceFile(path)
countByKey()

foreach(func)

shuffle

1. The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. 
2. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.
3. Spark also automatically persists some intermediate(內(nèi)部) data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it

removing Data

1. 移除緩存,默認(rèn)是使用LRU(最近,最少未使用),
2. 手動(dòng)(manually)移除,RDD.unpersist()method

Shared Variables

1. spark提供兩個(gè)共享變量,可以分發(fā)的每個(gè)機(jī)器中
1.1 廣播變量(Broadcast Variables)
1.2 
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容