data mining 知識(shí)大綱

說明:該知識(shí)大綱是根據(jù)電子科技大學(xué)計(jì)算機(jī)學(xué)院研究生學(xué)位課《Data Mining》的授課內(nèi)容整理而成。該課程由邵俊明老師進(jìn)行講授,且是英文授課。

Chapter 1

Definition of data mining

Data mining consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data

image.png

Key factors:

  • Data storage

  • Data availability

  • Computation power

Application:

  • Target marketing: Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

  • Cross-market analysis: Find associations/co-relations between product sales, & predict based on such association.

  • Resource planning: summarize and compare the resources and spending

  • Fraud detection

Task:

  • Association rule mining

  • Cluster analysis

  • Classification

  • Outlier detection

Direction:

  • Volume (Scale of Data)

  • Velocity (Data Stream)

  • Variety (Different format of data, difference sources)

  • Veracity (Uncertainty, missing value)

Chapter 2

Nearest Neighbor

image.png

Predict class label of test instance with major vote strategy

SVM Kernel

image.png

Ensemble learning

  1. bagging: random forest

  2. boosting: adaboost

  3. stacking

image.png
image.png

Chapter 3

Why do we need Hashing?

Challenge in big data applications:

  • Curse of dimensionality

  • Storage cost

  • Query speed

Examples:

  • Information retrieval

  • Storage cost

  • Fast nearest neighbor search

Three steps for similar documents:

  • shingling

  • Min hashing

  • Locality-sensitive hashing

image.png

Min-hashing

  1. Compute signatures of columns = small summaries of columns.

  2. Examine pairs of signatures to find similar signatures.

  3. (Optional) check that columns with similar signatures are really similar.

image.png

Use several (e.g., 100) independent hash functions to create a signature.

Locality-sensitive hashing

  • General idea: Use a function f(x,y) that tells whether or not x and y is a candidate pair : a pair of elements whose similarity must be evaluated.

  • For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.

LSH for min-hash signatures

Matrix M is the matrix of signatures.

image.png

For each band, hash its portion of each column to a hash table with k buckets.

image.png

Tradeoff

Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.

Learn to hash

  • PCA hashing: The basic idea is rotating the data to minimize quantization loss.

  • Spectral hashing

image.png

Chapter 4

Definition of sampling

Giving a p(x), we want to draw some samples to represent p(x).

Inverse transform sampling

image.png

Drawbacks: Usually, it’s hard to get the inverse function

Rejection sampling

image.png

Adaptive reject sampling: only if p(x) is log-concave

Importance sampling

image.png
image.png

Markov chain Monte Carlo(MCMC)

image.png

Detailed balance condition: π(i)Pij = π(j)Pij

Acceptance ratio

image.png

Drawbacks: acceptance ratio is too small

Metropolis–Hastings (MH) Sampling

Based on MCMC rewriting the acceptance ratio

image.png

But acceptance ratio still isn’t 100%

Gibbs sampling (based on MCMC)

Idea: Gibbs sampling further make acceptance ratio being 100%

image.png

other features of Gibbs:

  • Do not need p(x)
image.png

Sampling on data stream

  • Bernoulli Sampling

  • Reservoir Sampling: not need to know stream size;
    image.png

Chapter 5

What is data stream?

A data stream is a massive sequence of data objects which have some unique features: One by One; Potentially Unbounded; Concept Drift

image.png

Challenges:

  • Single Pass Handling

  • Memory Limitation

  • Low Time Complexity

  • Concept Drift

What is concept drift?

The probability distribution changes.

image.png
image.png

Concept drift detection

  • Distribution-based detector

  • Error-rate based detector

image.png
image.png

Data stream classification

image.png

Typical algorithm

  • VFDT (very fast decision tree): A decision-tree learning system based on the Hoeffding tree algorithm

  • CVFDT (Concept-adapting Very Fast Decision Tree learner)

VFDT

image.png
image.png
image.png
image.png
image.png

CVFDT

image.png
image.png
image.png
image.png

Data stream clustering

  • Online phase: Summarize the data into memory-efficient data structures

  • Offline phase: Use a clustering algorithm to find the data partition (k-means, decision tree)

Framework

image.png
image.png
image.png
image.png

Chapter 6

Key node identification

image.png
image.png

PageRank

image.png
image.png

Community detection (graph clustering)

  • Minimum cut: find a graph partition such that the number of edges between the two sets is minimized.
    image.png

    But minimum cut always return an imbalanced partition.

  • Normalized cut & ratio cut
    image.png

    ,prefer a balanced partition.

  • modularity
    image.png
  • Random walk

  • Multi-level clustering

  • Dynamic community detection: a new viewpoint for community detection, the basic idea is Simulate the change of edge distances.

    • View network as dynamical system (Dynamic vs. Static)

    • Simulate the distance dynamics based on different interaction patterns (Distance dynamics vs. Node dynamics)

    • All edge distances will converge, and the community structure is intuitively identified.

Chapter 7

What is hadoop?

  • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.

  • Hadoop is open-source implementation for Google MapReduce

  • Hadoop is based on a simple programming model called MapReduce

  • Hadoop is based on a simple data model, any data will fit

image.png
image.png
image.png

Core
Filesystems and I/O:

  • Abstraction APIs
  • RPC / Persistence

Avro
Cross-language serialization:

  • RPC / persistence
  • ~ Google ProtoBuf / FB Thrift

MapReduce
Distributed execution (batch)

  • Programming model
  • Scalability / fault-tolerance

HDFS
Distributed storage (read-opt.)

  • Replication / scalability
  • ~ Google filesystem (GFS)

Zoo keeper
Coordination service

  • Locking / configuration
  • ~ Google Chubby

HBase
Column-oriented, sparse store

  • Batch & random access
  • ~ Google BigTable

Pig
Data flow language

  • Procedural SQL-inspired lang.
  • Execution environment

Hive
Distributed data warehouse

  • SQL-like query language
  • Data mgmt / query execution

image.png
image.png
image.png
image.png
image.png

Limitation of MapReduce

  • Inefficient for multi-pass algorithm

  • No efficient primitives for data sharing

    • State between steps goes to distributed file system

    • Slow due to replication & disk storage

Spark

Apache Spark is a fast and general-purpose cluster computing system. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for streaming processing.

image.png
image.png
image.png
image.png
image.png
image.png
image.png

Row-key is unique for a row

image.png
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容