Creating RDDs

1. From a collection

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.collect

Spark will run one task for each partition of the cluster.
One partition corresponds to one task.
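To make the partition-to-task mapping concrete, `parallelize` takes an optional second argument (`numSlices`) that sets the partition count explicitly. A minimal sketch, assuming a running spark-shell where `sc` is the SparkContext:

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is predefined.
val data = Array(1, 2, 3, 4, 5)

// Ask for 4 partitions explicitly; Spark will schedule 4 tasks
// when an action such as collect() runs on this RDD.
val distData = sc.parallelize(data, 4)

distData.getNumPartitions  // → 4
distData.collect()         // gathers the elements back to the driver
```

If `numSlices` is omitted, Spark picks a default based on the cluster (typically the total number of cores).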

2. From an external shared file

scala> val distFile = sc.textFile("data.txt")
  • If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
    When reading from the local filesystem, every node must have that same directory and file.
  • All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
    On HDFS, you can read all files under a directory, or select files with a wildcard.
  • The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
    You can pass a parameter to set the partition count; it cannot be smaller than the number of blocks.
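The second argument described above can be sketched as follows, again assuming spark-shell with `sc` available; "data.txt" is a hypothetical input file:

```scala
// Assumes spark-shell; "data.txt" is a placeholder path.
// The second argument is minPartitions: Spark may create more
// partitions than requested, but never fewer than the file's blocks.
val distFile = sc.textFile("data.txt", 8)

distFile.getNumPartitions  // at least 8 for a small file
distFile.map(_.length)     // transformations run one task per partition
        .reduce(_ + _)     // total character count across all lines
```

For an HDFS file larger than 8 blocks, requesting 8 partitions here would have no effect, since the block count sets the floor.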