scikit-learn datasets

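This post introduces the sklearn.datasets module. It provides utilities for loading datasets, including methods to load and fetch popular reference datasets, and it also ships some artificial data generators.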

  • sklearn datasets
  • sklearn.datasets

    (1) datasets.load_*()

    Loads small datasets that ship with the datasets package.

    (2) datasets.fetch_*()

    Fetches larger datasets, which must be downloaded over the network. These functions accept a data_home parameter that sets the download directory; the default is ~/scikit_learn_data/. The default can also be changed through the SCIKIT_LEARN_DATA environment variable.

    (3) datasets.make_*()

    Generates datasets locally.

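    The load_* and fetch_* functions return a Bunch object (sklearn.utils.Bunch; datasets.base.Bunch in older releases). It is essentially a dict whose key-value pairs can also be accessed as object attributes. It mainly contains the following attributes:

    • data: feature array, a two-dimensional numpy.ndarray of shape n_samples * n_features
    • target: label array, a one-dimensional numpy.ndarray of length n_samples
    • DESCR: description of the dataset
    • feature_names: names of the features
    • target_names: names of the labels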
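
    As a minimal sketch of the dict-and-attribute behaviour described above, the snippet below loads the bundled iris dataset and reads the same array both ways. The variable names are illustrative only.

```python
from sklearn.datasets import load_iris

# Load a small bundled dataset; the result is a Bunch.
iris = load_iris()

# A Bunch is a dict subclass, so the usual dict API works.
print(isinstance(iris, dict))
print(sorted(iris.keys()))

# Key access and attribute access return the same underlying array.
same_object = iris["data"] is iris.data
print(same_object)

# data is (n_samples, n_features); target is (n_samples,)
print(iris.data.shape, iris.target.shape)
```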

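    The dataset directory can be obtained with datasets.get_data_home(); clear_data_home(data_home=None) deletes all downloaded data.

    • datasets.get_data_home(data_home=None)

    Returns the path of the scikit-learn data directory. This folder is used by some of the large dataset loaders to avoid downloading the data more than once. By default, the data directory is a folder named "scikit_learn_data" in the user's home folder. Alternatively, it can be set through the "SCIKIT_LEARN_DATA" environment variable, or programmatically by passing an explicit folder path. The '~' symbol is expanded to the user's home folder. If the folder does not exist, it is created automatically.

    • sklearn.datasets.clear_data_home(data_home=None)

    Deletes the data stored in the data directory.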
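
    The two helpers can be tried safely against a throwaway folder, so the real cache is left alone. This is only a sketch: tmp_root and custom_home are hypothetical names, and whether clear_data_home removes the folder itself or only its contents is left unchecked here on purpose.

```python
import os
import shutil
import tempfile

from sklearn.datasets import clear_data_home, get_data_home

# Work in a temporary location (hypothetical paths, for illustration only).
tmp_root = tempfile.mkdtemp()
custom_home = os.path.join(tmp_root, "my_sklearn_data")

# get_data_home creates the folder if needed and returns its path.
resolved = get_data_home(data_home=custom_home)
existed_after_call = os.path.isdir(resolved)
print(resolved, existed_after_call)

# Delete whatever is cached under that folder.
clear_data_home(data_home=custom_home)

# Clean up the temporary location.
shutil.rmtree(tmp_root, ignore_errors=True)
```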

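  • Loading small datasets

    For classification

    • sklearn.datasets.load_iris

      The iris dataset contains measurements of iris flowers together with the class each flower belongs to. The measurements are sepal length, sepal width, petal length and petal width. There are three classes: Iris Setosa, Iris Versicolour and Iris Virginica. The dataset is suitable for multi-class classification.

    • The loader takes the following parameter:
      • return_X_y:

      If True, the data is returned as a (data, target) tuple. The default is False, in which case a dict-like Bunch with all the information (including data and target) is returned.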

    from sklearn.datasets import load_iris
    # With return_X_y=True the loader returns a (data, target) tuple
    X, y = load_iris(return_X_y=True)
    
    from sklearn.datasets import load_iris
    data = load_iris()
    # List the attributes and methods available on data
    print(dir(data))
    print('*'*80)
    # Show the dataset description
    print(data.DESCR)
    print('*'*80)
    # Show the feature names
    print(data.feature_names)
    #print(data.data)
    print('*'*80)
    # Show the class names
    print(data.target_names)
    print('*'*80)
    print(data.target)
    print('*'*80)
    # Show the target values of samples 2, 11 and 101
    print(data.target[[1,10, 100]])
    
    ['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']
    ********************************************************************************
    .. _iris_dataset:
    
    Iris plants dataset
    --------------------
    
    **Data Set Characteristics:**
    
        :Number of Instances: 150 (50 in each of three classes)
        :Number of Attributes: 4 numeric, predictive attributes and the class
        :Attribute Information:
            - sepal length in cm
            - sepal width in cm
            - petal length in cm
            - petal width in cm
            - class:
                    - Iris-Setosa
                    - Iris-Versicolour
                    - Iris-Virginica
                    
        :Summary Statistics:
    
        ============== ==== ==== ======= ===== ====================
                        Min  Max   Mean    SD   Class Correlation
        ============== ==== ==== ======= ===== ====================
        sepal length:   4.3  7.9   5.84   0.83    0.7826
        sepal width:    2.0  4.4   3.05   0.43   -0.4194
        petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
        petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
        ============== ==== ==== ======= ===== ====================
    
        :Missing Attribute Values: None
        :Class Distribution: 33.3% for each of 3 classes.
        :Creator: R.A. Fisher
        :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
        :Date: July, 1988
                
       '''       partially omitted      '''
    
    ********************************************************************************
    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
    ********************************************************************************
    ['setosa' 'versicolor' 'virginica']
    ********************************************************************************
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
     2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
     2 2]
    ********************************************************************************
    [0 0 2]
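
    The integer targets printed above are indices into target_names. As a small sketch (variable names are illustrative), the same three samples can be mapped back to their class names with NumPy fancy indexing:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Integer labels of samples 2, 11 and 101 (zero-based indices 1, 10, 100).
picked = iris.target[[1, 10, 100]]

# target_names is a NumPy array, so it can be indexed by the label array.
names = iris.target_names[picked]

print(picked)
print(names)
```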
    
    • sklearn.datasets.load_digits

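      The handwritten digits dataset contains 1797 samples of the digits 0-9. Each digit is an 8*8 matrix whose values range from 0 to 16 and represent the intensity of each pixel.

    • The loader takes the following parameters:
      • return_X_y: if True, the data is returned as a (data, target) tuple. The default is False, in which case a dict-like Bunch with all the information (including data and target) is returned.
      • n_class: the number of classes to return, default 10. For example, n_class=5 returns only the samples of digits 0 to 4.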

    from sklearn.datasets import load_digits
    digits = load_digits(n_class=5,return_X_y=False)
    # Show the target values of the first 10 samples
    print(digits.target[0:10])
    
    [0 1 2 3 4 0 1 2 3 4]
    
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    digits = load_digits(n_class=10,return_X_y=False)
    print(dir(digits))
    print('*'*80)
    print(digits.DESCR)
    print('*'*80)
    print(digits.data)
    print('*'*80)
    print(digits.target_names)
    print('*'*80)
    print(digits.target[[2,20,200]])
    print('*'*80)
    print(digits.images.shape)
    plt.matshow(digits.images[1])
    plt.savefig('handwritten_digit_1')
    plt.show()
    
    ['DESCR', 'data', 'images', 'target', 'target_names']
    ********************************************************************************
    .. _digits_dataset:
    
    Optical recognition of handwritten digits dataset
    --------------------------------------------------
    
    **Data Set Characteristics:**
    
        :Number of Instances: 5620
        :Number of Attributes: 64
        :Attribute Information: 8x8 image of integer pixels in the range 0..16.
        :Missing Attribute Values: None
        :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
        :Date: July; 1998
    '''       partially omitted      '''
    ********************************************************************************
    [[ 0.  0.  5. ...  0.  0.  0.]
     [ 0.  0.  0. ... 10.  0.  0.]
     [ 0.  0.  0. ... 16.  9.  0.]
     ...
     [ 0.  0.  1. ...  6.  0.  0.]
     [ 0.  0.  2. ... 12.  0.  0.]
     [ 0.  0. 10. ... 12.  1.  0.]]
    ********************************************************************************
    [0 1 2 3 4 5 6 7 8 9]
    ********************************************************************************
    [2 0 1]
    ********************************************************************************
    (1797, 8, 8)
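
    The output above shows both a data array and an images array. A short sketch, with illustrative variable names, checks how the two relate (data is images flattened to 64 columns) and what n_class=5 returns:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten each 8x8 image into a row of 64 pixel values.
flat = digits.images.reshape(len(digits.images), -1)
same_values = np.array_equal(flat, digits.data)
print(digits.data.shape, same_values)

# n_class=5 keeps only the samples of digits 0 to 4.
first_five = load_digits(n_class=5)
classes = sorted(set(first_five.target.tolist()))
print(classes)
```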
    

For regression

  • sklearn.datasets.load_boston

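    The Boston house prices dataset contains 506 records, each describing a housing tract and its surroundings. The fields include the town crime rate, nitric oxide concentration, average number of rooms per dwelling, weighted distance to employment centres and the median value of owner-occupied homes. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the example below needs an older release.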

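  • Attributes of the Boston house prices dataset
    CRIM: per capita crime rate by town.
    ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
    INDUS: proportion of non-retail business acres per town.
    CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).
    NOX: nitric oxide concentration.
    RM: average number of rooms per dwelling.
    AGE: proportion of owner-occupied units built before 1940.
    DIS: weighted distance to five Boston employment centres.
    RAD: index of accessibility to radial highways.
    TAX: full-value property-tax rate per $10,000.
    PTRATIO: pupil-teacher ratio by town.
    B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
    LSTAT: percentage of the population with lower status.
    MEDV: median value of owner-occupied homes, in thousands of dollars.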

  • The loader takes the following parameter:
    • return_X_y:

    If True, the data is returned as a (data, target) tuple. The default is False, in which case a dict-like Bunch with all the information (including data and target) is returned.

from sklearn.datasets import load_boston
boston = load_boston()
print(dir(boston))
print('*'*80)
print(boston.DESCR)
print('*'*80)
print(boston.feature_names)
print(boston.data)
print('*'*80)
print(boston.filename)
print('*'*80)
print(boston.target)
['DESCR', 'data', 'feature_names', 'filename', 'target']
********************************************************************************
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.
'''       partially omitted      '''
********************************************************************************
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
********************************************************************************
D:\Anaconda3\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv
********************************************************************************
[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 '''       partially omitted      '''
 16.7 12.  14.6 21.4 23.  23.7 25.  21.8 20.6 21.2 19.1 20.6 15.2  7.
  8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
 22.  11.9]
  • sklearn.datasets.load_diabetes
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(dir(diabetes))
print('*'*80)
print(diabetes.DESCR)
print('*'*80)
print(diabetes.data_filename)
print('*'*80)
print(diabetes.feature_names)
print(diabetes.data)
print('*'*80)
print(diabetes.target_filename)
['DESCR', 'data', 'data_filename', 'feature_names', 'target', 'target_filename']
********************************************************************************
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
'''       partially omitted      '''
********************************************************************************
D:\Anaconda3\lib\site-packages\sklearn\datasets\data\diabetes_data.csv.gz
********************************************************************************
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
[[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990842
  -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06832974
  -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286377
  -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04687948
   0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452837
  -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00421986
   0.00306441]]
********************************************************************************
D:\Anaconda3\lib\site-packages\sklearn\datasets\data\diabetes_target.csv.gz
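
The note in the description above says each feature column is scaled so that its sum of squares is 1. A minimal sketch to check that on the loaded data, using return_X_y=True (the tolerance is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.datasets import load_diabetes

# return_X_y=True yields the feature matrix and the target vector directly.
X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)

# Sum of squares of every column; each entry should be close to 1.
col_sum_sq = (X ** 2).sum(axis=0)
print(np.allclose(col_sum_sq, 1.0, atol=1e-3))
```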
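  • Fetching large datasets
    • sklearn.datasets.fetch_20newsgroups

    • The loader takes the following parameters:

      subset: 'train', 'test' or 'all', optional. Selects which part of the dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both.

      data_home: optional, default None. Download and cache folder for the dataset. If None, all scikit-learn data is stored in the '~/scikit_learn_data' subfolder.

      categories: list of category names to load; by default all 20 categories are loaded.

      shuffle: whether to shuffle the data.

      random_state: a numpy random number generator or an integer seed, used for shuffling.

      download_if_missing: optional, default True. If the data is not available locally, it is downloaded.

      remove: tuple containing any subset of ('headers', 'footers', 'quotes'); the corresponding parts of each post are stripped from the text.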

    from sklearn.datasets import fetch_20newsgroups
    data_test = fetch_20newsgroups(subset='test', data_home=None, categories=None,
                                   shuffle=True, random_state=42, remove=(),
                                   download_if_missing=True)
    
    from sklearn.datasets import fetch_20newsgroups
    data_test = fetch_20newsgroups(subset='test',shuffle=True,random_state=42)
    data_train = fetch_20newsgroups(subset='train',shuffle=True,random_state=42)
    print(dir(data_train))
    print('*'*80)
    #print(data_train.DESCR)
    print('*'*80)
    print(data_test.data[0]) # first document in the test set
    print('-'*80)
    print('Training set category names: {}'.format(data_train.target_names))
    print(data_test.target[:10])
    print('*'*80)
    print('Training set: {} samples'.format(data_train.target.shape[0]))
    print('Test set: {} samples'.format(data_test.target.shape[0]))
    
    ['DESCR', 'data', 'filenames', 'target', 'target_names']
    ********************************************************************************
    ********************************************************************************
    From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
    Subject: Need info on 88-89 Bonneville
    Organization: University at Buffalo
    Lines: 10
    News-Software: VAX/VMS VNEWS 1.41
    Nntp-Posting-Host: ubvmsd.cc.buffalo.edu
    
    
     I am a little confused on all of the models of the 88-89 bonnevilles.
    I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
    differences are far as features or performance. I am also curious to
    know what the book value is for prefereably the 89 model. And how much
    less than book value can you usually get them for. In other words how
    much are they in demand this time of year. I have heard that the mid-spring
    early summer is the best time to buy.
    
                            Neil Gandler
    
    --------------------------------------------------------------------------------
    Training set category names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
    [ 7  5  0 17 19 13 15 15  5  1]
    ********************************************************************************
    Training set: 11314 samples
    Test set: 7532 samples
    
    • sklearn.datasets.fetch_20newsgroups_vectorized

      • Loads the 20 newsgroups dataset and converts it to tf-idf vectors. This is a convenience function; the tf-idf transformation is done with the default settings of sklearn.feature_extraction.text.Vectorizer.

    from sklearn.datasets import fetch_20newsgroups_vectorized
    from sklearn.utils import shuffle
    bunch = fetch_20newsgroups_vectorized(subset='all')
    X,y = shuffle(bunch.data,bunch.target)
    print(X.shape)
    # Split the dataset into 70% training and 30% test
    offset = int(X.shape[0]*0.7)
    X_train, y_train = X[0:offset], y[0:offset]
    X_test, y_test = X[offset:], y[offset:]
    print(X_train.shape)
    print(X_test.shape)
    
    (18846, 130107)
    (13192, 130107)
    (5654, 130107)
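
    The same vectorize-then-split pattern can be tried without downloading the corpus. The sketch below is an illustration only: the toy documents and labels are made up, TfidfVectorizer stands in for the vectorization step, and train_test_split replaces the manual slicing used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up toy corpus standing in for newsgroup posts (0 = autos, 1 = space).
docs = [
    "the car engine needs a new part",
    "which dealer sells this car model",
    "the engine makes a strange noise",
    "book value of a used car",
    "best season to buy a car",
    "the rocket reached orbit today",
    "the shuttle launch was delayed",
    "a probe sent images from mars",
    "orbit of the new satellite",
    "the telescope observed a distant galaxy",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Each document becomes one row of a sparse tf-idf matrix.
X = TfidfVectorizer().fit_transform(docs)

# 70% training, 30% test, shuffled with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, shuffle=True, random_state=42)

print(X.shape[0], X_train.shape[0], X_test.shape[0])
```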
    
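  • Generating data locally

    Generating classification data:

    • sklearn.datasets.make_classification

    • The generator takes the following parameters:

      n_samples: int, optional (default = 100), number of samples

      n_features: int, optional (default = 20), total number of features. These comprise n_informative informative features, n_redundant redundant features and n_repeated duplicated features; any remaining features are random noise.

      n_informative: number of informative features

      n_redundant: number of redundant features, generated as random linear combinations of the informative features

      n_repeated: number of duplicated features, drawn at random from the informative and redundant features

      n_classes: int, optional (default = 2), number of classes

      n_clusters_per_class: number of clusters each class is made of

      random_state: int or RandomState instance, optional (default = None). If an int, it is the seed used by the random number generator.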

    from sklearn import datasets
    import matplotlib.pyplot as plt 
     
    data,target = datasets.make_classification(n_samples=100,n_features=2,
                                               n_informative=2,n_redundant=0,n_repeated=0,
                                               n_classes=2,n_clusters_per_class=1,
                                               random_state=0)
    print(data.shape)
    print(target.shape)
    #print(data)
    #print(target)
    plt.scatter(data[:,0],data[:,1],c=target)
    plt.show()
    
    (100, 2)
    (100,)
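
    To make the roles of n_informative, n_redundant and n_repeated more concrete, here is a small sketch. It assumes shuffle=False so that the columns keep their generation order (informative, then redundant, then repeated); the sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# 3 informative + 2 redundant + 1 repeated = 6 features, columns not shuffled.
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=3, n_redundant=2, n_repeated=1,
                           n_classes=2, shuffle=False, random_state=0)

# Redundant columns are linear combinations of the informative ones,
# so the first five columns only span three dimensions.
rank = np.linalg.matrix_rank(X[:, :5])
print(rank)

# The repeated column is a copy of one of the first five columns.
repeated_matches = [j for j in range(5) if np.allclose(X[:, 5], X[:, j])]
print(repeated_matches)
```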
    

    Generating regression data:

    • sklearn.datasets.make_regression

    • The generator takes the following parameters:

      n_samples: int, optional (default = 100), number of samples

      n_features: int, optional (default = 100), number of features

      coef: boolean, optional (default = False). If True, the coefficients of the underlying linear model are returned as well.

      random_state: int or RandomState instance, optional (default = None)

    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=100, n_features=10, random_state=1)
    print(X.shape)
    print(y.shape)
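
    As a sketch of the coef parameter, the call below also returns the coefficients of the underlying linear model. With the default settings (no noise, zero bias) the targets should equal X multiplied by those coefficients, which the last line checks.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True adds the ground-truth coefficients as a third return value.
X, y, coef = make_regression(n_samples=100, n_features=10,
                             coef=True, random_state=1)
print(X.shape, y.shape, coef.shape)

# Defaults are noise=0.0 and bias=0.0, so y is an exact linear function of X.
exact_linear = np.allclose(X @ coef, y)
print(exact_linear)
```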
    
  • Image data

    In an Anaconda installation, the sample images bundled with sklearn are stored in this directory:

    D:\Anaconda3\Lib\site-packages\sklearn\datasets\images

    It contains china.jpg and flower.jpg.

from sklearn.datasets import load_sample_image
import matplotlib.pyplot as plt
img = load_sample_image('china.jpg')
plt.imshow(img)
plt.show()

References:

Web:

https://blog.csdn.net/wangdong2017/article/details/81326341

Videos:

"Python Machine Learning Applications" (《python机器学习应用》), "Itheima Programmer: Machine Learning" (《黑马程序员之机器学习》)
