Starting with version 2.4, Spark supports an image data source.
import org.apache.spark.sql.SparkSession

object ImageDataSourceTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("ImageDataSourceTest")
      .getOrCreate()
    // $example on$
    val df = spark.read.format("image")
      .option("dropInvalid", value = true) // drop invalid images from the result
      .load("D:\\data\\image")
    df.select("image.origin", "image.width", "image.height")
      .show(truncate = true)
    // $example off$
    spark.stop()
  }
}
Schema of df:
root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
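If the images sit in nested directories and you need the directory name as well, name each directory in the form cls=<value>. The schema then gains an extra column at the same level as image: “|-- cls: string (nullable = true)”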
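
To make the directory convention concrete, here is a minimal sketch in plain Scala that mimics how key=value directory names in a path map to column values. It is not Spark's own partition-discovery code. The object name PartitionPathSketch, the helper partitionValues, and the sample paths are hypothetical and only serve the illustration.

```scala
object PartitionPathSketch {
  // Collects key=value directory segments from a file path.
  // Assumption: the last segment is the file name, so it is skipped.
  def partitionValues(path: String): Map[String, String] =
    path.split("[/\\\\]").toSeq.dropRight(1).flatMap { segment =>
      segment.split("=", 2) match {
        case Array(key, value) if key.nonEmpty => Some(key -> value)
        case _                                 => None
      }
    }.toMap
}
```

With a layout such as D:\data\image\cls=cat\001.jpg, the value of cls for that row would be the directory's value part, and the column can be selected next to the image fields, for example df.select("image.origin", "cls").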
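- origin: path of the image file
- height: image height
- width: image width
- nChannels: number of image channels; typically 1 for grayscale images, 3 for color images (e.g. RGB), and 4 for color images with an alpha channel
- mode: the OpenCV-compatible type, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24, which corresponds one-to-one with the channel count
- data: BinaryType, the image bytes in OpenCV-compatible order, in most cases row-wise BGR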
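
The mode codes and the byte layout of data described above can be sketched in plain Scala as follows. This is a minimal sketch, assuming one byte per channel, row-major order, interleaved channels, and no padding between rows. The object name ImageLayoutSketch and its helpers are hypothetical and are not part of Spark's API.

```scala
object ImageLayoutSketch {
  // Mode codes from the list above, mapped to (OpenCV type name, channel count).
  val ocvTypes: Map[Int, (String, Int)] = Map(
    0  -> ("CV_8UC1", 1),
    16 -> ("CV_8UC3", 3),
    24 -> ("CV_8UC4", 4)
  )

  // Byte offset of channel c of the pixel at (row, col).
  // Assumption: row-major, channel-interleaved, no row padding.
  def offset(row: Int, col: Int, c: Int, width: Int, nChannels: Int): Int =
    (row * width + col) * nChannels + c

  // Reads one pixel from 3-channel data stored as B, G, R and
  // returns it as (r, g, b) with unsigned values in 0..255.
  def rgbAt(data: Array[Byte], row: Int, col: Int, width: Int): (Int, Int, Int) = {
    val base = offset(row, col, 0, width, 3)
    val b = data(base) & 0xFF
    val g = data(base + 1) & 0xFF
    val r = data(base + 2) & 0xFF
    (r, g, b)
  }
}
```

The masking with 0xFF is needed because Scala's Byte is signed, while the channel values are meant as unsigned 8-bit numbers.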