1. Reading files
    1. `spark.read.textFile` can read a single file, a whole directory, or glob patterns: textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
    2. `wholeTextFiles` can read a directory containing many small files, returning each one as a key-value pair of (fileName, fileContent).
        2.1 It accepts an optional second argument that controls the minimum number of partitions.
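The two read paths above can be sketched as follows (a minimal sketch; the paths are placeholders, and local mode is assumed for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-examples")
  .master("local[*]")   // assumption: local mode, for illustration only
  .getOrCreate()

// Read lines of text; directories and glob patterns are supported.
val lines = spark.read.textFile("/my/directory/*.txt")

// Read many small files as (fileName, fileContent) pairs;
// the second argument is the minimum number of partitions.
val files = spark.sparkContext.wholeTextFiles("/my/directory", 4)
```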
2. RDD Operations
    1. Transformations: create a new dataset from an existing one (lazy — nothing is computed until an action needs the result).
    2. Actions: return a value to the driver program after running a computation on the dataset.
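The transformation/action split can be sketched as follows (`sc` is assumed to be an existing SparkContext):

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4))

// Transformation: lazily describes a new dataset; nothing runs yet.
val doubled = data.map(_ * 2)

// Action: triggers the computation and returns a value to the driver.
val sum = doubled.reduce(_ + _)   // 20
```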
3. Persisting RDDs
    1. Call the persist() (or cache()) method.
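A minimal persistence sketch (`sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// persist() lets you choose another storage level explicitly.
val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)

cached.count()   // the first action materializes the persisted data
```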
Printing an RDD
1. `rdd.collect().foreach(println)` gathers the data from every node onto the driver, but this can exhaust the driver's memory.
2. It is better to fetch and print only a small subset: `rdd.take(100).foreach(println)`.
Common action operations
| Action | Meaning |
|---|---|
| reduce(func) | Aggregate the elements of the dataset using a function func (taking two arguments and returning one); func should be commutative and associative so it can be computed correctly in parallel. |
| collect() | Return all the elements of the dataset as an array to the driver program. |
| count() | Return the number of elements in the dataset. |
| first() | Return the first element of the dataset (similar to take(1)). |
| take(n) | Return an array with the first n elements of the dataset. |
| takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements, with or without replacement, optionally using a given random seed. |
| takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator. |
| saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory. |
| saveAsSequenceFile(path) | Write the elements of the dataset as a Hadoop SequenceFile (available on RDDs of key-value pairs). |
| saveAsObjectFile(path) | Write the elements of the dataset using Java serialization, loadable later with objectFile(). |
| countByKey() | Only available on RDDs of type (K, V); returns a map of (K, Int) pairs with the count of each key. |
| foreach(func) | Run a function func on each element of the dataset, e.g. to update an accumulator or interact with external storage. |
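A few of these actions in use (a minimal sketch; `sc` is assumed to be an existing SparkContext):

```scala
val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

nums.count()          // 5
nums.take(2)          // Array(5, 1)
nums.takeOrdered(2)   // Array(1, 2)
nums.reduce(_ + _)    // 15
```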
Shuffle
1. The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O.
2. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.
3. Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
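A transformation like reduceByKey triggers a shuffle, because values with the same key may live on different partitions (a minimal sketch; `sc` is assumed to be an existing SparkContext):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey shuffles values with the same key onto the same partition,
// then aggregates them; Spark persists the intermediate shuffle files itself.
val counts = pairs.reduceByKey(_ + _)   // ("a", 2), ("b", 1)
```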
Removing Data
1. Cached data is evicted automatically, by default in least-recently-used (LRU) fashion.
2. To remove it manually, call the RDD.unpersist() method; note that by default this does not block until the data is removed.
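Manual removal can be sketched as follows (`sc` is assumed to be an existing SparkContext):

```scala
val cached = sc.parallelize(1 to 100).cache()
cached.count()       // first action materializes the cache

cached.unpersist()   // free the cached blocks manually (non-blocking by default)
```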
Shared Variables
1. Spark provides two kinds of shared variables that can be distributed to every machine:
    1.1 Broadcast variables
    1.2 Accumulators
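Both kinds of shared variables in one sketch (`sc` is assumed to be an existing SparkContext; the lookup map and counter name are illustrative):

```scala
// Broadcast variable: a read-only value efficiently shipped to each executor.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: a shared counter the executors can only add to.
val matched = sc.longAccumulator("matched")

sc.parallelize(Seq("a", "b", "c")).foreach { k =>
  if (lookup.value.contains(k)) matched.add(1)
}

println(matched.value)   // 2
```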