Data Mining Ch. 1

What is Big Data?
“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” — Gartner

“Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” — McKinsey & Company

Data mining
People have been analysing and investigating data for centuries.

Statistics
Mean, Variance, Correlation, Distribution …
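As a small illustration of these classical summaries, here is a minimal sketch using only the Python standard library. The sample values and the helper names (mean, variance, correlation) are made up for this example; variance here is the population variance.

```python
import math

def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)

def variance(xs):
    """Population variance: average squared deviation from the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / math.sqrt(variance(xs) * variance(ys))

# Made-up sample: weights are an exact linear function of heights.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [50.0, 58.0, 66.0, 74.0, 82.0]

print(mean(heights), variance(heights), correlation(heights, weights))
```

Because the second list is an exact linear function of the first, the correlation comes out at 1 (up to floating-point rounding).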

Today, data are often far beyond human comprehension.
Diversity, Volume, Dimensionality

Definition
Data Mining is the process of automatically extracting interesting and useful hidden patterns from data that are usually massive, incomplete and noisy.

Not a fully automatic process
Human interventions are often inevitable.
Domain Knowledge
Data Collection and Pre-processing

Synonym: Knowledge Discovery

Data Integration & Analysis

Process of Data Mining

DM Techniques - Classification
“Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”

Given a training set {(x1, y1), …, (xn, yn)}, produce a classifier (a function) that maps any previously unseen object x to its class label y.

Algorithms
Decision Trees
K-Nearest Neighbours
Neural Networks
Support Vector Machines
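
To make the "training set in, classifier out" idea concrete, here is a minimal sketch of K-Nearest Neighbours, one of the algorithms listed above. The toy points, the "stay"/"churn" labels, and the function name knn_classify are assumptions made for illustration; ties are broken by whichever label was seen first.

```python
import math
from collections import Counter

def knn_classify(training_set, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort the labeled points by Euclidean distance to the query point.
    by_distance = sorted(training_set, key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training set: (features, class label).
training_set = [
    ((1.0, 1.0), "stay"), ((1.5, 2.0), "stay"), ((2.0, 1.0), "stay"),
    ((6.0, 6.0), "churn"), ((7.0, 5.5), "churn"), ((6.5, 7.0), "churn"),
]

print(knn_classify(training_set, (1.2, 1.4)))
print(knn_classify(training_set, (6.4, 6.1)))
```

K-NN has no training phase: the "classifier" is simply the stored training set plus the voting rule.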

Applications
Churn Prediction
Medical Diagnosis
Classification Boundaries

Overfitting – Classification

Confusion Matrix

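TPR (true positive rate, sensitivity) = TP / (TP + FN)

TNR (true negative rate, specificity) = TN / (TN + FP)

Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = TN + FP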
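
The three ratios can be checked on a small example. This is a minimal sketch; the labels and predictions are made up, and confusion_counts and rates are hypothetical helper names.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """TPR, TNR and accuracy, following the formulas in the notes."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "Accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Made-up ground truth and predictions: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)
print(rates(tp, fp, tn, fn))
```

Here TP = 3, FP = 2, TN = 4 and FN = 1, so TPR = 3/4, TNR = 4/6 and Accuracy = 7/10.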

Receiver Operating Characteristic
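An ROC curve plots TPR against the false positive rate (FPR = 1 - TNR) as the decision threshold of a scoring classifier is varied. The sketch below sweeps the threshold over every distinct score and computes the area under the curve with the trapezoidal rule. The scores and labels are made up, and roc_points and auc are hypothetical helper names.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, starting from (0, 0)."""
    p = sum(y_true)              # number of positives
    n = len(y_true) - p          # number of negatives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; items with score >= threshold are predicted positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

For this toy data the area is 8/9: of the nine positive/negative pairs, eight are ranked correctly by the scores.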


DM Techniques - Clustering
“Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”

Distance Metrics
Euclidean Distance
Manhattan Distance
Mahalanobis Distance
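
The three metrics can be compared on a single pair of points. This is a minimal standard-library sketch; to avoid a linear-algebra dependency, the Mahalanobis version is restricted to 2-D points with an explicit 2x2 covariance matrix, and the function name mahalanobis_2d and the sample values are assumptions of this example.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance sqrt(d^T S^-1 d) for 2-D points and a 2x2 covariance S."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    # Closed-form inverse of a 2x2 matrix.
    i11, i12, i21, i22 = s22 / det, -s12 / det, -s21 / det, s11 / det
    dx, dy = a[0] - b[0], a[1] - b[1]
    return math.sqrt(dx * (i11 * dx + i12 * dy) + dy * (i21 * dx + i22 * dy))

a, b = (0.0, 0.0), (3.0, 4.0)
identity = ((1.0, 0.0), (0.0, 1.0))
scaled = ((9.0, 0.0), (0.0, 16.0))   # made-up covariance: variances 9 and 16

print(euclidean(a, b), manhattan(a, b))
print(mahalanobis_2d(a, b, identity), mahalanobis_2d(a, b, scaled))
```

With the identity covariance, Mahalanobis distance reduces to Euclidean distance; with unequal variances, each coordinate difference is rescaled by its spread before being combined.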

Algorithms
K-Means
Sequential Leader
Affinity Propagation
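
As an illustration of the first algorithm in the list, here is a minimal K-Means sketch (the standard alternating assign/update loop). The points and initial centroids are made up, and passing the starting centroids explicitly is a simplification of this example; practical implementations usually pick them at random and stop on convergence instead of after a fixed iteration count.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

The result depends on K and on the starting centroids, which is one reason the choice of algorithm and parameters matters.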

Applications
Market Research
Image Segmentation
Social Network Analysis

Hierarchical Clustering

DM Techniques – Association Rule

DM Techniques – Regression

Overfitting – Regression

Data Preprocessing
Real data are often surprisingly dirty.
A Major Challenge for Data Mining

Typical Issues
Missing Attribute Values
Different Coding/Naming Schemes
Infeasible Values
Inconsistent Data
Outliers

Data Quality
Accuracy
Completeness
Consistency
Interpretability
Credibility
Timeliness

Data Cleaning
Fill in missing values.
Correct inconsistent data.
Identify outliers and noisy data.
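
A minimal sketch of two of these cleaning steps, assuming a single numeric attribute with missing entries stored as None. Filling with the median and flagging values more than two standard deviations from the mean are illustrative choices, not rules from the notes; the ages and the helper names are made up.

```python
import statistics

def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up attribute: one missing age and one infeasible value (240).
ages = [23, 25, None, 31, 27, 29, 240, 26]

filled = fill_missing(ages)
print(filled)
print(find_outliers(filled))
```

The median is used for filling because, unlike the mean, it is barely affected by the infeasible value. Whether a flagged value is an error or a genuine extreme case still needs domain knowledge.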

Data Integration
Combine data from different sources.

Data Transformation
Normalization
Aggregation
Type Conversion
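
Two common forms of normalization are min-max scaling to [0, 1] and z-score standardization. The sketch below shows both; the income figures are made up, and both functions assume the values are not all identical (otherwise the denominator is zero).

```python
import statistics

def min_max(xs):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

incomes = [20_000, 35_000, 50_000, 80_000]   # made-up attribute values

print(min_max(incomes))
print(z_score(incomes))
```

Normalization matters for distance-based methods such as K-Nearest Neighbours and K-Means, where an attribute with a large numeric range would otherwise dominate the distance.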

Data Reduction
Feature Selection
Sampling

Privacy Protection
Data: A Double-Edged Sword
People can benefit greatly from data analysis.
The consequences of an information leak can be catastrophic.

People may be reluctant to give sensitive information due to privacy concerns.
Drug use, taxes, sexuality …

How to find out the percentage of people with a certain attribute?
The interviewer should not know the true answer of each respondent.

Randomized Response
Used in structured survey research.
Can maintain the confidentiality of respondents.
Two questions are presented:
Q1: I have the attribute A.
Q2: I do not have the attribute A.

The respondent uses a random device to:
Answer Q1 with probability p.
Answer Q2 with probability 1-p.
The interviewer does not know which question was answered.
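
The percentage can still be recovered from the aggregate. If π is the true proportion with attribute A and every respondent answers the selected question truthfully, then P(yes) = p·π + (1 - p)·(1 - π), so π = (P(yes) - (1 - p)) / (2p - 1), provided p ≠ 0.5. The simulation below is a minimal sketch of this estimator; the values of π, p, the sample size, and the function names are assumptions chosen for illustration.

```python
import random

def respond(has_attribute, p, rng):
    """One respondent: answer Q1 with probability p, otherwise Q2, truthfully."""
    if rng.random() < p:
        return has_attribute         # Q1: "I have the attribute A."
    return not has_attribute         # Q2: "I do not have the attribute A."

def estimate_proportion(answers, p):
    """Invert P(yes) = p*pi + (1 - p)*(1 - pi); requires p != 0.5."""
    yes_rate = sum(answers) / len(answers)
    return (yes_rate - (1 - p)) / (2 * p - 1)

rng = random.Random(0)               # fixed seed so the run is repeatable
true_pi, p, n = 0.3, 0.7, 100_000    # made-up survey parameters

population = [rng.random() < true_pi for _ in range(n)]
answers = [respond(person, p, rng) for person in population]
estimate = estimate_proportion(answers, p)
print(estimate)
```

The interviewer only sees the list of yes/no answers, yet the estimate lands close to the true proportion. The closer p is to 0.5, the stronger the privacy but the noisier the estimate, since the denominator 2p - 1 shrinks.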

Cloud Computing

Why bother with so many different algorithms?

No algorithm is always superior to others.

No parameter setting is optimal over all problems.

Look for the best match between problem and algorithm.
Experience
Trial and Error

Factors to consider:
Applicability
Computational Complexity
Interpretability

Always start with simple ones.

Grouping
