1. Creating from a collection
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data) // distribute the local collection as an RDD
distData.collect                    // gather the elements back to the driver
Spark will run one task for each partition of the cluster.
Each partition is processed by exactly one task.
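As a minimal sketch of the partition/task relationship, assuming a spark-shell session where `sc` (a SparkContext) is predefined, `parallelize` takes an optional second argument that requests a specific number of partitions:

```scala
// Assumes spark-shell, where `sc` is already defined.
val distData = sc.parallelize(Array(1, 2, 3, 4, 5), 3) // request 3 partitions
distData.getNumPartitions // 3 partitions, so 3 tasks per stage over this RDD
```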
2. Creating from an external shared file
scala> val distFile = sc.textFile("data.txt")
- If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
When reading from the local filesystem, the same directory and file must exist on every worker node.
- All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
On HDFS you can read every file under a directory at once, or select files with wildcards.
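The directory and wildcard forms above can be sketched as follows, assuming a spark-shell session with `sc` available; the paths are illustrative:

```scala
// Assumes spark-shell; paths below are placeholders, not real data.
val dirRDD = sc.textFile("/my/directory")       // every file in the directory
val txtRDD = sc.textFile("/my/directory/*.txt") // only the .txt files
val gzRDD  = sc.textFile("/my/directory/*.gz")  // compressed files, decompressed transparently
```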
- The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
The second argument sets the partition count; it cannot be smaller than the number of blocks.
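A short sketch of that second argument, again assuming a spark-shell session and the same data.txt as in the earlier example:

```scala
// Assumes spark-shell; data.txt as in the textFile example above.
val distFile = sc.textFile("data.txt", 10) // ask for at least 10 partitions
distFile.getNumPartitions // at least 10, and never fewer than the block count
```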