Outline
5.1 Information Theory
5.2 Information Technology
5.3 Data quality
5.4 Data cleaning
5.5 Data fusion
5.6 Data storage
5.7 Data mining
5.8 Multimedia information processing
5.3 Data Quality
Uncertain Data
- Data uncertainty occurs during:
| Stage |
|---|
| Data collection |
| Data transmission |
| Data processing |
Causes of Data Uncertainty
| Cause |
|---|
| Environmental factors |
| Low battery power |
| Packet losses |
Classification of Data Uncertainty
- Source Classification: by the source of the uncertain data (key point)
| Category | Example |
|---|---|
| Undesirable uncertainty | Noisy sensor data |
| Undesirable uncertainty | Imprecise GPS data |
| Undesirable uncertainty | Unreliable extracted/integrated data |
| Desirable uncertainty | Medical data with generalized attributes |
| Desirable uncertainty | Cloaked trajectory data |
- Granularity Classification: by granularity
| Type |
|---|
| Tuple uncertainty |
| Attribute uncertainty |
- Correlation Classification: by the correlations among uncertain items
| Type |
|---|
| Independent uncertainty |
| Correlated uncertainty |
| Uncertainty with local correlations |
Meaning of Data Quality (key point)
- Generally, you have a problem if the data doesn't mean what you think it does, or should.
- Data quality problems are expensive and pervasive.
Conventional Definition of Data Quality
| Dimension | Meaning |
|---|---|
| Accuracy | The data was recorded correctly |
| Completeness | All relevant data was recorded |
| Uniqueness | Each entity is recorded once |
| Timeliness | The data is kept up to date |
| Consistency | The data agrees with itself |
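Three of these dimensions (completeness, uniqueness, timeliness) can be checked mechanically. The following is a minimal sketch, not part of the original notes: the record layout, the field names `id`, `temp`, and `updated`, and the 30-day freshness limit are all assumptions made for illustration.

```python
from datetime import date

def quality_report(records, key, required, max_age_days, today):
    """Count violations of completeness, uniqueness, and timeliness."""
    seen = set()
    report = {"incomplete": 0, "duplicate": 0, "stale": 0}
    for rec in records:
        # Completeness: every required field must be present and non-empty.
        if any(rec.get(field) in (None, "") for field in required):
            report["incomplete"] += 1
        # Uniqueness: each key value should be recorded only once.
        if rec.get(key) in seen:
            report["duplicate"] += 1
        seen.add(rec.get(key))
        # Timeliness: the record must have been updated recently enough.
        if (today - rec["updated"]).days > max_age_days:
            report["stale"] += 1
    return report

# Hypothetical sample records.
records = [
    {"id": 1, "temp": 21.5, "updated": date(2024, 1, 10)},
    {"id": 2, "temp": None, "updated": date(2024, 1, 10)},
    {"id": 1, "temp": 21.5, "updated": date(2023, 6, 1)},
]
report = quality_report(records, "id", ["id", "temp"], 30, date(2024, 1, 15))
```

Accuracy and consistency usually need domain rules or a reference source, so they are left out of this sketch.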
5.4 Data Cleaning
Data cleaning is the process of detecting and correcting (or removing) errors and inconsistencies in data in order to improve data quality.
Its goal is to identify data that is incomplete, incorrect, inaccurate, irrelevant, etc.
Data Cleaning Tasks (key point)
| Task |
|---|
| Fill in missing values |
| Identify outliers and smooth out noisy data |
| Correct inconsistent data |
| Resolve redundancy caused by data integration |
Methods to Handle Noisy Data
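| Name | Explanation |
|---|---|
| Binning | Sort the data into bins and smooth the values within each bin, which removes extreme (edge) values |
| Regression | Fit the data to a regression function |
| Clustering | Detect elements that do not belong to any major cluster and remove them |
| Combined inspection | Combine computer and human inspection |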
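As an illustration of binning, the sketch below uses equal-frequency bins and smoothing by bin means, which is one common variant; the notes do not say which variant is intended, and the function name and sample values are assumptions.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size` items,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical noisy measurements.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_bin_means(prices, 4)
# Bins: [4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]
```

Smoothing by bin medians or bin boundaries follows the same structure with a different replacement value.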
Sensor Cleaning Pipeline

The pipeline uses the temporal and spatial characteristics of sensor data.
Step 1: Point
- Operates on: a single value of the sensor stream.
- Purpose: filter individual values
① Errant (dirty / faulty) RFID tags
② Obvious outliers
③ Conversion of raw data into tuples
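A minimal sketch of the Point stage, assuming raw readings arrive as `tag,timestamp,value` strings; the input format, the set of valid tags, and the plausible value range are assumptions for illustration, not something the notes specify.

```python
def point_stage(raw_readings, valid_tags, low, high):
    """Filter single readings and convert the survivors into
    (tag, timestamp, value) tuples."""
    cleaned = []
    for raw in raw_readings:
        tag, timestamp, value = raw.split(",")
        if tag not in valid_tags:        # errant (dirty / faulty) tag
            continue
        value = float(value)
        if not (low <= value <= high):   # obvious outlier
            continue
        cleaned.append((tag, int(timestamp), value))
    return cleaned

# Hypothetical raw stream: "ZZ" is an unknown tag, 999.0 is out of range.
raw = ["A1,0,20.5", "ZZ,1,20.7", "A1,2,999.0", "B2,3,21.1"]
tuples = point_stage(raw, {"A1", "B2"}, low=-40.0, high=60.0)
```

Each reading is judged on its own, which is what distinguishes this stage from the later window-based stages.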
Step 2: Smoothing
- Purpose: interpolate (insert) lost readings
① Temporal interpolation
② Outlier detection
- Method: window-based queries
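Temporal interpolation with a window can be sketched as follows. This assumes a regularly sampled stream in which a lost reading is represented by `None`, and it fills each gap with the mean of the known values inside the window; both choices are assumptions, and a real system would typically express this as a windowed stream query.

```python
def smooth(readings, window):
    """Fill each missing reading (None) with the mean of the known
    values at most `window` time steps away on either side."""
    filled = []
    for i, value in enumerate(readings):
        if value is not None:
            filled.append(value)
            continue
        neighbours = [v for v in readings[max(0, i - window): i + window + 1]
                      if v is not None]
        # Leave the gap open if the window holds no known value.
        filled.append(sum(neighbours) / len(neighbours) if neighbours else None)
    return filled

# Hypothetical stream with three lost readings.
stream = [20.0, None, 22.0, 23.0, None, None, 26.0]
smoothed = smooth(stream, window=1)
```

A larger window fills longer gaps at the cost of blurring real changes in the signal.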
Step 3: Merge
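- Purpose: spatial interpolation
- Example: within one spatial granule, compute the average of the readings from the different motes, ignoring individual readings that lie more than two standard deviations from the mean.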
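The Merge example (average the readings of the motes in one spatial granule while ignoring readings more than two standard deviations from the mean) can be sketched like this. The use of the population standard deviation and the sample values are assumptions.

```python
from statistics import mean, pstdev

def merge_granule(readings):
    """Average the readings of one spatial granule, ignoring readings
    that lie more than two standard deviations from the mean."""
    mu = mean(readings)
    sigma = pstdev(readings)
    kept = [r for r in readings if abs(r - mu) <= 2 * sigma]
    return mean(kept)

# Hypothetical readings from eight motes in the same granule; 35.0 is faulty.
granule = [20.0, 20.5, 19.5, 20.0, 20.5, 19.5, 20.0, 35.0]
merged = merge_granule(granule)
```

With very few motes a single faulty reading inflates the standard deviation so much that it can stay inside the two-deviation band, so this rule needs a reasonable number of readings per granule.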

Step 4: Arbitrate
- Purpose: remove
① Conflicting readings
② Duplicate readings (de-duplication)
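One possible way to arbitrate is a majority vote per tag and time slot, sketched below. The notes do not name a specific rule, so the voting policy, the tuple layout, and the sample locations are assumptions.

```python
from collections import Counter, defaultdict

def arbitrate(reports):
    """reports: (tag, time_slot, location) tuples from several readers.
    Returns one location per (tag, time_slot): the most frequently
    reported one, which removes both duplicates and conflicts."""
    votes = defaultdict(Counter)
    for tag, slot, location in reports:
        votes[(tag, slot)][location] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}

# Hypothetical reports from overlapping readers.
reports = [
    ("tag7", 0, "shelf_A"),
    ("tag7", 0, "shelf_A"),   # duplicate reading
    ("tag7", 0, "shelf_B"),   # conflicting reading
    ("tag9", 0, "shelf_B"),
]
resolved = arbitrate(reports)
```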

Step 5: Virtualize
- Purpose: multi-source integration

5.5 Data Fusion
Concept (key point)
Data fusion combines data from multiple sources and gathers that information in order to achieve inferences that are more efficient and potentially more accurate than those achieved by means of a single source. (Fill-in-the-blank item.)
Sensors only give an estimate of the measured physical property.
The nature of the errors often determines the preferred fusion algorithm.
Three Processing Architectures
| Name | Description |
|---|---|
| Data-level fusion | Direct fusion of sensor data |
| Feature-level fusion | Representation of sensor data via feature vectors, with subsequent fusion of the feature vectors |
| Decision-level fusion | Processing of each sensor to achieve high-level inferences or decisions, which are subsequently combined |
Data-level Fusion
- Condition for use: the sensors are measuring the same physical phenomenon.
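When several sensors measure the same physical quantity, their raw estimates can be fused directly. The sketch below uses an inverse-variance weighted mean, which is one standard choice when the sensor errors are independent and their variances are known; the notes do not prescribe this algorithm, and the sample values are assumptions.

```python
def fuse(estimates):
    """estimates: (value, variance) pairs from sensors measuring the
    same quantity. Returns the inverse-variance weighted mean and the
    variance of that fused estimate (assuming independent errors)."""
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, 1.0 / total

# Hypothetical thermometers: a precise one (variance 1.0) and a noisy one (4.0).
fused_value, fused_var = fuse([(20.0, 1.0), (22.0, 4.0)])
```

The fused variance is smaller than either input variance, which is the sense in which fusion can be more accurate than a single source.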

5.6 Data Storage
Database System
- Database: a collection of persistent data
- Data: known facts that can be recorded and have an implicit meaning
- Database Management System (DBMS): a software system that supports the creation, population, and querying of a database
- Database System: DBMS + database
DBMS Functions
| Name | Explanation |
|---|---|
| Define | Define a particular database |
| Construct | Construct the initial database |
| Manipulate | Insert, delete, update, and query the database |
| Share a database | Share the database among users and programs |
- Define a database: specify a particular database in terms of its data types, structures, and constraints.
- Construct or load the initial database: build or load the initial database contents on a secondary storage medium.
- Manipulate the database:
① Retrieval, modification
② Accessing the database through Web applications
- Share a database: allow multiple users and programs to access the database simultaneously.
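The define, construct, and manipulate functions can be seen in a few lines using Python's built-in `sqlite3` module. This is only an illustrative sketch: the table name `reading` and its columns are hypothetical, and an in-memory database does not demonstrate sharing among multiple users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only

# Define: data types, structure, and constraints.
conn.execute(
    "CREATE TABLE reading ("
    " sensor_id TEXT NOT NULL,"
    " ts INTEGER NOT NULL,"
    " value REAL,"
    " PRIMARY KEY (sensor_id, ts))"
)

# Construct: load the initial contents.
conn.executemany(
    "INSERT INTO reading VALUES (?, ?, ?)",
    [("s1", 0, 20.5), ("s1", 1, 20.7), ("s2", 0, 19.9)],
)

# Manipulate: modification, then retrieval.
conn.execute("UPDATE reading SET value = 21.0 WHERE sensor_id = 's1' AND ts = 1")
rows = conn.execute(
    "SELECT ts, value FROM reading WHERE sensor_id = 's1' ORDER BY ts"
).fetchall()
conn.close()
```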
Data Storage Solutions (key point)
| Name | Abbreviation |
|---|---|
| Direct Attached Storage | DAS |
| Network Attached Storage | NAS |
| Storage Area Network | SAN |
- Direct Attached Storage (DAS)
Characteristics: storage devices are attached directly to servers, which are the only point of access.

- Network Attached Storage (NAS)
Characteristics: more reliable than DAS, limited by LAN bandwidth.

- Storage Area Network (SAN)
Characteristics: more expensive
5.7 Data Mining
Major Data Mining Tasks
| Name | Explanation |
|---|---|
| Classification | Predict the class of an item |
| Association Rule Discovery | Discover associations |
| Clustering | Find the classes (groups) of items |
| Sequential Pattern Discovery | Discover sequential patterns |
| Deviation Detection | Detect deviations |
| Forecasting | Prediction |
| Description | Describe the data |
| Link analysis | Find links and associations |
Classification
Definition
Find a model for the class attribute as a function of the values of the other attributes.
Test set
A test set is used to determine the accuracy of the model.
Classification methods
| Method |
|---|
| Decision tree |
| Naive Bayesian classifiers |
| Association rules |
| Neural networks |
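To make the roles of the model and the test set concrete, here is a minimal sketch that trains a one-level decision tree (a single threshold on one numeric attribute) and measures its accuracy on a separate test set. The data, the attribute, and the function names are assumptions for illustration.

```python
def train_stump(samples):
    """Pick the threshold on a single numeric attribute that classifies
    the most training samples correctly (predict True when x > threshold)."""
    best_threshold, best_correct = None, -1
    for threshold, _ in samples:
        correct = sum((x > threshold) == label for x, label in samples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def accuracy(threshold, samples):
    """Fraction of samples whose predicted class matches the true class."""
    return sum((x > threshold) == label for x, label in samples) / len(samples)

# Hypothetical data: (temperature, "is hot") pairs.
train = [(18.0, False), (19.5, False), (21.0, False),
         (24.0, True), (26.5, True), (30.0, True)]
test = [(17.0, False), (22.0, False), (25.0, True), (29.0, True)]

threshold = train_stump(train)
test_accuracy = accuracy(threshold, test)
```

The model fits the training set perfectly but misclassifies one test sample, which is why accuracy is reported on data the model has not seen.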
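Clustering
Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that data points in the same cluster are more similar to one another than to data points in other clusters.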
5.8 Multimedia Information Processing
- Definition
Multimedia is a combination of text, graphics, sound, animation, and video that is delivered interactively to the user by electronic or digitally manipulated means.
Digital Image Processing
- Digital image
A digital image is a representation of a two-dimensional image as a finite set of digital values, called picture elements or pixels.
- Pixel values
Pixel values typically represent gray levels, colours, opacities, etc.
- Fill-in-the-blank item: remember that digitization implies that a digital image is an approximation of a real scene.
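The idea that a digital image is a finite set of pixel values, and therefore only an approximation, can be illustrated by reducing the number of gray levels. This sketch assumes an 8-bit grayscale image stored as nested lists and a number of levels that divides 256 evenly; both are assumptions for illustration.

```python
def quantize(image, levels):
    """Map 8-bit gray values (0-255) onto `levels` evenly spaced gray
    levels. Assumes `levels` divides 256 evenly."""
    step = 256 // levels
    return [[(pixel // step) * step for pixel in row] for row in image]

# Hypothetical 2 x 3 grayscale image.
image = [
    [0, 64, 128],
    [192, 255, 30],
]
coarse = quantize(image, levels=4)
```

Fewer levels mean a coarser approximation: here 255 and 192 become indistinguishable, as do 30 and 0.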
Major Tasks of Digital Image Processing
- Improvement of pictorial information for human interpretation.
- Processing of image data for storage, transmission, and representation for autonomous machine perception.
Processing level




