41 Reading Excel with Pandas and Plotting a Histogram
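Histogram:
A histogram is an accurate graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable, and it is drawn as a kind of bar chart.
To build a histogram, first split the full range of values into a series of intervals (bins), then count how many values fall into each interval.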
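The two steps above (split the range, then count) can be sketched in plain Python. This is only an illustration: the function name `build_histogram` and the sample values are hypothetical, and the sketch assumes equal-width intervals and at least two distinct values.

```python
def build_histogram(values, n_bins):
    """Split [min, max] into n_bins equal-width intervals and count the values in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # Index of the interval containing v; the maximum goes into the last interval.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return edges, counts


# Hypothetical sample values, not taken from the dataset.
sample = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
edges, counts = build_histogram(sample, n_bins=4)
print(edges)   # [1.0, 2.0, 3.0, 4.0, 5.0]
print(counts)  # [2, 2, 1, 2]
```

The counts always add up to the number of input values, which is a quick sanity check for any binning.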
1. Load the data
Boston house-price dataset
import pandas as pd
import numpy as np
df = pd.read_excel("./datas/boston-house-prices/housing.xlsx")
df
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273 | 21.0 | 396.90 | 7.88 | 11.9 |
506 rows × 14 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null int64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null int64
9 TAX 506 non-null int64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 MEDV 506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB
df["MEDV"]
0 24.0
1 21.6
2 34.7
3 33.4
4 36.2
...
501 22.4
502 20.6
503 23.9
504 22.0
505 11.9
Name: MEDV, Length: 506, dtype: float64
2. Plot a histogram with matplotlib
matplotlib histogram documentation: https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.hist.html
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(12, 5))
plt.hist(df["MEDV"], bins=100)
plt.show()

3. Plot a histogram with pyecharts
pyecharts histogram example: http://gallery.pyecharts.org/#/Bar/bar_histogram
numpy histogram documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html
from pyecharts import options as opts
from pyecharts.charts import Bar
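# We have to compute the intervals and the number of values in each interval ourselves
hist, bin_edges = np.histogram(df["MEDV"], bins=100)
# These are the boundaries (edges) of the intervals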
bin_edges
array([ 5. , 5.45, 5.9 , 6.35, 6.8 , 7.25, 7.7 , 8.15, 8.6 ,
9.05, 9.5 , 9.95, 10.4 , 10.85, 11.3 , 11.75, 12.2 , 12.65,
13.1 , 13.55, 14. , 14.45, 14.9 , 15.35, 15.8 , 16.25, 16.7 ,
17.15, 17.6 , 18.05, 18.5 , 18.95, 19.4 , 19.85, 20.3 , 20.75,
21.2 , 21.65, 22.1 , 22.55, 23. , 23.45, 23.9 , 24.35, 24.8 ,
25.25, 25.7 , 26.15, 26.6 , 27.05, 27.5 , 27.95, 28.4 , 28.85,
29.3 , 29.75, 30.2 , 30.65, 31.1 , 31.55, 32. , 32.45, 32.9 ,
33.35, 33.8 , 34.25, 34.7 , 35.15, 35.6 , 36.05, 36.5 , 36.95,
37.4 , 37.85, 38.3 , 38.75, 39.2 , 39.65, 40.1 , 40.55, 41. ,
41.45, 41.9 , 42.35, 42.8 , 43.25, 43.7 , 44.15, 44.6 , 45.05,
45.5 , 45.95, 46.4 , 46.85, 47.3 , 47.75, 48.2 , 48.65, 49.1 ,
49.55, 50. ])
len(bin_edges)
101
# These are the counts per interval
hist
array([ 2, 1, 1, 0, 5, 2, 1, 6, 3, 0, 3, 3, 5, 3, 4, 6, 3,
5, 14, 9, 9, 6, 11, 8, 6, 8, 6, 10, 9, 9, 15, 13, 20, 16,
19, 10, 14, 19, 13, 15, 21, 16, 9, 12, 14, 1, 0, 4, 5, 2, 6,
5, 5, 4, 3, 6, 2, 3, 4, 3, 4, 3, 6, 2, 1, 1, 5, 3,
1, 4, 1, 3, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 2, 0, 1, 1, 0, 1, 1, 0, 0, 0, 2, 1, 0, 16],
dtype=int64)
len(hist)
100
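Why does bin_edges have 101 elements, one more than the hist counts?
Example: if bins is [1, 2, 3, 4], the data is split into 3 intervals: [1, 2), [2, 3), and [3, 4]. Every interval is half-open except the last one, which also includes its right edge. In general, n intervals need n + 1 edges, so 100 bins produce 101 edges. When bins is given as an integer, as it is here, the first edge is the minimum of the array and the last edge is its maximum.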
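The interval convention in the example can be reproduced with a few lines of plain Python. This is a sketch for illustration only, not NumPy's implementation; the helper name `count_with_edges` is hypothetical.

```python
from bisect import bisect_right


def count_with_edges(values, edges):
    """Count values per bin; each bin is [left, right) except the last, which is [left, right]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # values outside the overall range are ignored
        # bisect_right returns the position of the first edge strictly greater than v;
        # clamping places v == edges[-1] into the last (closed) bin.
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


edges = [1, 2, 3, 4]                            # 4 edges
counts = count_with_edges([1, 2, 3, 4], edges)
print(counts)                                   # [1, 1, 2]: 3 bins, one fewer than the edges
```

Calling `np.histogram([1, 2, 3, 4], bins=[1, 2, 3, 4])` reports the same counts, because NumPy documents the last bin as closed on the right.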
# Note: min equals the first bin edge and max equals the last bin edge
df["MEDV"].describe()
count 506.000000
mean 22.532806
std 9.197104
min 5.000000
25% 17.025000
50% 21.200000
75% 25.000000
max 50.000000
Name: MEDV, dtype: float64
# Difference between each edge and the previous one: the bins all have the same width
np.diff(bin_edges)
array([0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
0.45])
# The number of intervals is exactly the number of counts in hist
len(np.diff(bin_edges))
100
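The width of 0.45 follows directly from the range of the data and the number of bins. A short check in plain Python, using the min and max reported by describe() and assuming equal-width bins as produced by bins=100:

```python
# Values taken from the describe() output; n_bins matches bins=100.
medv_min, medv_max, n_bins = 5.0, 50.0, 100

# Equal-width bins: width = (max - min) / number of bins
width = (medv_max - medv_min) / n_bins

# n_bins intervals need n_bins + 1 edges
edges = [medv_min + i * width for i in range(n_bins + 1)]

print(width)       # 0.45
print(len(edges))  # 101
```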
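# In pyecharts the histogram is drawn with a Bar chart
# bin_edges[:-1] uses the left edge of each interval as its x-axis value;
# edges are rounded to two decimals so the labels stay short
bar = (
    Bar()
    .add_xaxis([str(round(float(x), 2)) for x in bin_edges[:-1]])
    .add_yaxis("Price distribution", [float(x) for x in hist], category_gap=0)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Boston house prices - price distribution - histogram", pos_left="center"),
        legend_opts=opts.LegendOpts(is_show=False)
    )
)
bar.render_notebook()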
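The chart above labels each bar with the left edge of its interval only. If you would rather show the whole interval on the x-axis, one option is to build the labels from consecutive edges. The helper `interval_labels` below is a hypothetical sketch, not part of pyecharts.

```python
def interval_labels(edges, digits=2):
    """Build one label per bin, such as '5.00-5.45', from consecutive edges."""
    return [
        f"{left:.{digits}f}-{right:.{digits}f}"
        for left, right in zip(edges[:-1], edges[1:])
    ]


# First few edges copied from the bin_edges output.
labels = interval_labels([5.0, 5.45, 5.9, 6.35])
print(labels)  # ['5.00-5.45', '5.45-5.90', '5.90-6.35']
```

Passing `interval_labels(list(bin_edges))` to `add_xaxis` in place of the left-edge list would give one label per bar, since the helper returns one fewer label than there are edges.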
Exercise: take the sales or price data of your own products, extract it into a one-dimensional array, and plot a histogram to see how the data is distributed.