Application
a driver program + n executors
SparkContext = application
spark-shell is itself an application
(typically launched from a gateway machine)
application1: 1 driver + 10 executors
application2: 1 driver + 10 executors
executors are not shared across applications
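A minimal sketch of the point above: each SparkContext is one application with its own driver and its own executors (the app names, path, and executor count here are illustrative, not from the notes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One SparkContext = one application: the cluster manager allocates
// this application its own set of executors.
val conf = new SparkConf()
  .setAppName("application1")
  .set("spark.executor.instances", "10")   // 1 driver + 10 executors

val sc = new SparkContext(conf)
// ... jobs submitted through sc run only on this application's executors ...
sc.stop()   // releases the executors; another application cannot reuse them
```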
application ==> n jobs ==> n stages ==> n tasks
one partition ==> one task
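The breakdown above can be sketched in spark-shell (the file path and partition count are hypothetical):

```scala
// One action ==> one job; a shuffle splits the job into stages;
// within a stage, one task per partition.
val rdd = sc.textFile("input.txt", 4)    // assume 4 input partitions

val counts = rdd.map(line => (line, 1))
                .reduceByKey(_ + _)      // shuffle ==> stage boundary

counts.collect()                         // action ==> submits 1 job with 2 stages
// stage 1: 4 tasks (one per input partition)
// stage 2: one task per post-shuffle partition
```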
textFile("")............ count      // 3 separate actions
textFile("")............ count      // without cache, each action re-reads the file
textFile("")............ count
textFile("").cache                  // cached, later actions read from memory
cache is lazy === like a transformation
unpersist is eager
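A short sketch of the lazy/eager contrast (hypothetical file path):

```scala
val data = sc.textFile("logs.txt")

data.cache()       // lazy: nothing is stored yet, just marks the RDD
data.count()       // first action: reads the file AND populates the cache
data.count()       // second action: served from memory, no re-read
data.unpersist()   // eager: drops the cached blocks immediately (blocking by default)
```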
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
def cache(): this.type = persist()
class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1)

val MEMORY_ONLY = new StorageLevel(false, true, false, true)
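To make the constructor flags concrete, a hedged sketch of how the predefined levels map onto them (the path is hypothetical; the flag mappings follow Spark's StorageLevel object):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("data.txt")

rdd.persist(StorageLevel.MEMORY_ONLY)   // (useDisk=false, useMemory=true,
                                        //  useOffHeap=false, deserialized=true)

// other predefined levels flip the same flags:
// MEMORY_ONLY_SER  ==> deserialized = false (stored as serialized bytes)
// MEMORY_AND_DISK  ==> useDisk = true (spill to disk instead of recomputing)
// MEMORY_ONLY_2    ==> _replication = 2
```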
Lineage
textFile ==> xx ==> yy ==> zz ==> .....
         (map, filter, map, .....)
lineage: describes how an RDD is computed from its parent RDD(s)
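The chain above can be inspected directly: `toDebugString` prints an RDD's lineage (hypothetical path and transformations):

```scala
val xx = sc.textFile("input.txt")
val yy = xx.map(_.toLowerCase)
val zz = yy.filter(_.nonEmpty)

// prints the lineage: how zz is computed from its parent RDDs
println(zz.toDebugString)
// Spark uses this lineage to recompute lost partitions after a failure
```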
Dependency
narrow dependency
each partition of the parent RDD is used by at most one partition of the child RDD
can be pipelined (no shuffle)
wide dependency
a partition of the parent RDD is used by multiple partitions of the child RDD
the xxByKey operations (e.g. reduceByKey, groupByKey)
join when the inputs are not co-partitioned
shuffle ==> stage boundary
lines.flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect
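Annotating that word-count line shows where the stage boundary falls (tab delimiter and identifiers as in the notes; the input path is assumed):

```scala
val lines = sc.textFile("input.txt")

lines.flatMap(_.split("\t"))   // narrow: pipelined  \
     .map((_, 1))              // narrow: pipelined   } stage 1
     .reduceByKey(_ + _)       // wide: shuffle ==> stage 2
     .collect()                // action ==> submits the job
```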