The Iris dataset is a widely used dataset for classification experiments, collected and compiled by Fisher in 1936. Also known as the iris flower dataset, it is a multivariate dataset containing 150 samples in 3 classes, with 50 samples per class and 4 attributes per sample.
Independent variables (features)
- petal length
- petal width
- sepal length
- sepal width
Dependent variable (target variable): Species
- versicolor (Iris versicolor)
- virginica (Iris virginica)
- setosa (Iris setosa)
A short program shows what the data looks like:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from pandas.plotting import scatter_matrix
plt.style.use('ggplot')
iris = datasets.load_iris()
print('--- %s ---' % 'iris type')
print(type(iris))
print('--- %s ---' % 'iris keys')
print(iris.keys())
print('--- %s ---' % 'iris data')
print(type(iris.data))
print('--- %s ---' % 'iris target')
print(type(iris.target))
print('--- %s ---' % 'iris data shape')
print(iris.data.shape)
print('--- %s ---' % 'iris target names')
print(iris.target_names)
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
print('--- %s ---' % 'df.head')
print(df.head())
print('--- %s ---' % 'df.info')
print(df.info())
print('--- %s ---' % 'df.describe')
print(df.describe())
print('--- %s ---' % 'iris scatter_matrix diagram')
# Color each point by its class label; plt.show() is needed outside a notebook
_ = scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D')
plt.show()
Output:
--- iris type ---
<class 'sklearn.utils.Bunch'>
--- iris keys ---
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
--- iris data ---
<class 'numpy.ndarray'>
--- iris target ---
<class 'numpy.ndarray'>
--- iris data shape ---
(150, 4)
--- iris target names ---
['setosa' 'versicolor' 'virginica']
--- df.head ---
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
--- df.info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4)
memory usage: 4.7 KB
None
--- df.describe ---
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000
mean            5.843333          3.054000           3.758667
std             0.828066          0.433594           1.764420
min             4.300000          2.000000           1.000000
25%             5.100000          2.800000           1.600000
50%             5.800000          3.000000           4.350000
75%             6.400000          3.300000           5.100000
max             7.900000          4.400000           6.900000

       petal width (cm)
count        150.000000
mean           1.198667
std            0.763161
min            0.100000
25%            0.300000
50%            1.300000
75%            1.800000
max            2.500000
--- iris scatter_matrix diagram ---
Scatter matrix of the iris dataset, plotted with pandas
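
The describe() summary pools all three species together, while the scatter matrix separates them only by color. As a rough sketch (not part of the original program), the class labels can be attached to the DataFrame to check the 50-per-class split and compare feature means by species. The column name species and the variables counts and species_means are names chosen here for illustration.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# iris.target holds integer labels 0/1/2; index target_names with them
# to get the species name for every row ('species' is a name chosen here).
df['species'] = iris.target_names[iris.target]

# Number of samples in each class
counts = df['species'].value_counts()
print(counts)

# Mean of each of the 4 features within each species
species_means = df.groupby('species').mean()
print(species_means)
```

Grouping by species makes it easier to see which features differ most between classes than the pooled statistics do.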

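Besides the iris dataset, sklearn ships several other toy datasets for examples, which can be loaded directly with the following functions:
Iris dataset, for classification practice
load_iris([return_X_y])
Boston house prices dataset, for regression analysis (deprecated in scikit-learn 1.0 and removed in 1.2)
load_boston([return_X_y])
Diabetes dataset, for regression analysis
load_diabetes([return_X_y])
Digits dataset, for classification practice
load_digits([n_class, return_X_y])
Linnerud physical exercise dataset, for multivariate regression analysis
load_linnerud([return_X_y])
Wine dataset, for classification practice
load_wine([return_X_y])
Breast cancer dataset, for classification practice
load_breast_cancer([return_X_y])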
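
The optional arguments in these signatures change what the loaders return. A minimal sketch, assuming a scikit-learn version that includes load_wine: return_X_y=True gives a (data, target) tuple instead of a Bunch object, and n_class limits load_digits to the first n_class digit classes. The variable names are chosen here for illustration.

```python
from sklearn import datasets

# With return_X_y=True the loader returns (data, target) instead of a Bunch.
X, y = datasets.load_wine(return_X_y=True)
print(X.shape, y.shape)

# n_class keeps only the first n_class classes of the digits dataset,
# here the digits 0, 1 and 2.
X_digits, y_digits = datasets.load_digits(n_class=3, return_X_y=True)
print(sorted(set(y_digits.tolist())))
```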
If you need the raw data files, these datasets can be downloaded from https://archive.ics.uci.edu/ml/datasets.html;
the iris dataset is at https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
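
The sketch below shows one way to read a raw file of this kind with pandas. It assumes the file is comma-separated with no header row, four measurements followed by a class label such as Iris-setosa; the column names are not in the file and are chosen here for illustration. Three sample rows are embedded as a string so that the example runs without a download; for the real file, pass its path to read_csv instead.

```python
import io
import pandas as pd

# A few rows in the assumed raw format: no header, label in the last column.
raw = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Column names are supplied by hand because the file has no header line.
columns = ['sepal length (cm)', 'sepal width (cm)',
           'petal length (cm)', 'petal width (cm)', 'species']

df_raw = pd.read_csv(io.StringIO(raw), header=None, names=columns)

# Strip the 'Iris-' prefix so labels match sklearn's target_names.
df_raw['species'] = df_raw['species'].str.replace('Iris-', '', regex=False)
print(df_raw)
```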