一、背景介紹
Adventure Works Cycles是基于微軟SQL Server中AdventureWorks 示例數(shù)據(jù)庫所構(gòu)建的虛擬公司,它是一家大型跨國制造公司。該公司向北美,歐洲和亞洲商業(yè)市場生產(chǎn)并銷售金屬和復(fù)合材料自行車。盡管其基本業(yè)務(wù)位于華盛頓州的博塞爾市,擁有290名員工,但幾個(gè)區(qū)域銷售團(tuán)隊(duì)遍布整個(gè)市場。
2000年,Adventure Works Cycles收購了一家位于墨西哥的小型制造工廠Importadores Neptuno。Importadores Neptuno為Adventure Works Cycles產(chǎn)品系列制造了幾個(gè)關(guān)鍵子組件。這些子組件被運(yùn)送到Bothell位置進(jìn)行最終產(chǎn)品組裝。2001年,Importadores Neptuno成為旅游自行車產(chǎn)品集團(tuán)的唯一制造商和分銷商。
在成功實(shí)現(xiàn)財(cái)政年度之后,Adventure Works Cycles希望通過下面三種方式擴(kuò)大銷售額,第一銷售目標(biāo)定位到最佳客戶、第二通過外部網(wǎng)站擴(kuò)展適用的產(chǎn)品、第三通過降低生產(chǎn)成本來降低銷售成本。
其中關(guān)于客戶類型、產(chǎn)品介紹、采購和供應(yīng)商這三個(gè)方面來做一個(gè)簡單的介紹
客戶類型
Adventure Works Cycle的客戶主要有兩種:
個(gè)體:通過網(wǎng)上零售店鋪來購買商品
商戶: 從Adventure Works Cycles銷售代表處購買轉(zhuǎn)售產(chǎn)品的零售店或批發(fā)店。
產(chǎn)品介紹
作為一家自行車生產(chǎn)公司,Adventure公司主要有以下四條產(chǎn)品線:
Adventure Works Cycles 生產(chǎn)的自行車
自行車部件,例如車輪,踏板或制動組件
從供應(yīng)商處購買的自行車服裝,用于轉(zhuǎn)售給Adventure Works Cycles的客戶。
從供應(yīng)商處購買的自行車配件,用于轉(zhuǎn)售給Adventure Works Cycles客戶。
采購和供應(yīng)商方面
在Adventure Works Cycles,采購部門購買用于制造Adventure Works Cycles自行車的原材料和零件。 Adventure Works Cycles還購買轉(zhuǎn)售產(chǎn)品,如自行車服裝和自行車附加裝置,如水瓶和水泵。
二、項(xiàng)目流程
1、在Hive數(shù)據(jù)庫中建表并導(dǎo)入數(shù)據(jù)
2、探索數(shù)據(jù)庫,羅列指標(biāo),了解數(shù)據(jù)庫中各個(gè)表格的構(gòu)成及相互間的關(guān)系
3、匯總數(shù)據(jù),建立數(shù)據(jù)倉庫
4、連接至Power BI進(jìn)行可視化展示
三、數(shù)據(jù)導(dǎo)入
1、轉(zhuǎn)換csv文件
解壓Adventure數(shù)據(jù)庫后,我們得到了29個(gè)csv文件以及一個(gè)導(dǎo)出建表語句的sql文件。csv文件都以“|”作為分隔符,而hive默認(rèn)用逗號作為分隔符,需要對其進(jìn)行調(diào)整。
用遍歷文件的方法加上pandas庫,對文件進(jìn)行轉(zhuǎn)換操作
1)用os.walk獲取文件名,建立for循環(huán),
2)用pd.read_csv來讀取文件,同時(shí)設(shè)置好分隔符和編碼格式
3)用os.path.join將文件名和文件路徑合并在一起以供調(diào)用,df.to_csv輸出轉(zhuǎn)換后的csv文件

2、讀取文件
利用python的正則表達(dá)式來循環(huán)讀取文件,同時(shí)獲取表名和字段名
create_file=open(r"create_table.sql")
table_info = {}
# 循環(huán)讀取文件,獲取表名和字段名
content=create_file.readline()
while len(content)!=0:
# print(content,end="")
# 如果是有 create table
table_name = ''
table_columns = []
while "GO" not in content:
# print(content,end="")
if "CREATE TABLE" in content.upper():
# 正則表達(dá)式,獲取表名
se0bj = re.search(r"\[(.*?)\].\[(.*?)\]",content,re.I) # re.I 對大小寫不敏感;(.*?)用于分組
if se0bj:
table_name = se0bj.group(2)
matOjb = re.search(r"\[(.*?)\] \[(.*?)\].*",content.lstrip(),re.I) # 讀取字段名和字段類型
if matOjb:
column = matOjb.group(1)
if column.upper() == "DATE":
column = "date_time"
type = matOjb.group(2)
table_columns.append([column,type])
content = create_file.readline()
table_info[table_name]=table_columns
print(table_name)
print(table_columns)
content=create_file.readline()
create_file.close()
得到結(jié)果如下
3、生成建表腳本
用字段結(jié)果生成建表shell腳本
# 解析 table_info 字典,用來創(chuàng)建表
shell_file = open(r"create_table.sh","w")
shell_file.writelines("#!/bin/sh\n\nhive -v -e\"\nuse adventure_ods_qtfy;\n\n")
for key in table_info.keys():
# 先寫入表頭
# shell_file.writelines("drop table " + key + ";\n")
shell_file.write("create table if not exists " + key + "(\n")
for columns in table_info[key]:
if columns == table_info[key][len(table_info[key])-1]:
shell_file.write(" " + columns[0] + " string\n")
else:
shell_file.write(" "+columns[0]+" string,\n")
# 寫入建表格式
shell_file.write(""")row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties(
'field.delim'=',',
'serialization.encoding'='UTF-8')
stored as textfile;\n\n""")
shell_file.write("\";")
shell_file.close()
4、生成導(dǎo)數(shù)shell文件
# 導(dǎo)入數(shù)據(jù)語句
load_file = open("load_create_data.sh","w")
load_file.writelines("#!/bin/sh\n\nhive -v -e\"\nuse adventure_ods_qtfy;\n\n")
for key in table_info.keys():
load_file.write("load data local inpath \'/root/qtfy/ods_datas/%s.csv\' overwrite into table %s;\n"%(key,key))
load_file.write("\n\";")
load_file.close()
5、建庫并導(dǎo)數(shù)
登錄linux云服務(wù)器,進(jìn)入hive環(huán)境,新建ods層數(shù)據(jù)庫,用于存放基礎(chǔ)數(shù)據(jù)

上傳csv文件至云服務(wù)器,然后執(zhí)行shell文件,從上傳的csv文件中導(dǎo)出數(shù)據(jù)到hive環(huán)境下ods層數(shù)據(jù)庫中

四、數(shù)據(jù)探索
1、了解數(shù)據(jù)
29個(gè)表格名稱如圖所示

此處我們暫以銷售為主題進(jìn)行探索
表格分為維度表(以dim開頭)和事實(shí)表(以fact開頭)兩類,表格設(shè)計(jì)結(jié)構(gòu)為星座模型。
維度表:表示對分析主題所屬類型的描述。比如產(chǎn)品維度表中有產(chǎn)品id、子類別id、名稱、產(chǎn)地、價(jià)格等屬性,如果別的表有產(chǎn)品id,就可以通過join連接來獲得相應(yīng)其他字段,節(jié)約表的空間。
事實(shí)表:表示對分析主題的度量。比如淘寶用戶每日下單的記錄,買家可獲得其匯總信息,包括有userid、訂單號、支付時(shí)間、支付鏈接等等
2、目的及問題
1)對數(shù)據(jù)倉庫的數(shù)據(jù)進(jìn)行可視化界面的加工,方便數(shù)據(jù)的查看;
2)以銷售情況為主題觀察數(shù)據(jù)的基本情況;
3)搭建指標(biāo)框架結(jié)構(gòu),設(shè)計(jì)可視化圖表布局;
3、數(shù)據(jù)指標(biāo)整理
此次分析以銷售為主題,因此重點(diǎn)放在銷售的表格

事實(shí)表:
FactResellerSales和FactInternetSales兩張表,除了FactInternetSales中多了EmployeeKey字段外,其他字段兩張表均保持一致;
維度表:
事實(shí)表中存在的key有:ProductKey,OrderDateKey,PromotionKey, SalesTerritoryKey等;
ER關(guān)系圖如下:

維度:
產(chǎn)品維度:產(chǎn)品分類、產(chǎn)品子分類
時(shí)間維度:年、季、月
地區(qū)維度:銷售地區(qū)
推廣維度:推廣方式
可分析的指標(biāo):
1.總銷售額=銷售量 * 客單價(jià)
2.總成本=產(chǎn)品標(biāo)準(zhǔn)成本 * 銷售量+每筆訂單的稅費(fèi)+運(yùn)費(fèi)
3.利潤情況=銷售額-總成本
4.銷售指標(biāo)達(dá)成情況
5.銷售量最佳的產(chǎn)品Top10
6.各維度下搭配,如時(shí)間維度下的銷售情況、銷售區(qū)域變化情況
地區(qū)維度下的銷售額、推廣情況
產(chǎn)品分類下的銷售情況
五、相應(yīng)SQL語句
1、找出產(chǎn)品主類和子品類的對應(yīng)關(guān)系,并且觀察消費(fèi)者的青睞程度
select a.ProductCategoryKey, a.EnglishProductCategoryName,
b.ProductSubcategoryKey, b.EnglishProductSubcategoryName
from DimProductCategory a join DimProductSubcategory b
on a.ProductcategoryKey=b.ProductCategoryKey;

2、受歡迎程度
a、最受消費(fèi)者歡迎(購買的顧客最多)的產(chǎn)品主類排序
Select EnglishProductCategoryName, count(CustomerKey) as Popularity
From FactSurveyResponse
Group by EnglishProductCategoryName
Order by Popularity desc;

b、最受消費(fèi)者歡迎的產(chǎn)品子類排序
Select EnglishProductCategoryName, EnglishProductSubcategoryName, count(CustomerKey) as Popularity
From FactSurveyResponse
Group by EnglishProductCategoryName, EnglishProductSubcategoryName
Order by Popularity desc;

3、各大品類的銷售情況
select
case when b.ProductSubcategoryKey between 1 and 3 then 'Bikes'
when b.ProductSubcategoryKey between 4 and 17 then 'Components'
when b.ProductSubcategoryKey between 18 and 25 then 'Clothing'
when b.ProductSubcategoryKey between 26 and 37 then 'Accessories'
else 'errors' end as product_catory,
sum(a.SalesAmount) as cloth_sales
from FactInternetSales a
inner join dimproduct b
on a.ProductKey=b.productkey
group by case when b.ProductSubcategoryKey between 1 and 3 then 'Bikes'
when b.ProductSubcategoryKey between 4 and 17 then 'Components'
when b.ProductSubcategoryKey between 18 and 25 then 'Clothing'
when b.ProductSubcategoryKey between 26 and 37 then 'Accessories'
else 'errors' end ;

4、建立匯總表
數(shù)據(jù)倉庫一般可分為ods基礎(chǔ)層和dm匯總層,我們已經(jīng)在ods層導(dǎo)入存放了基礎(chǔ)數(shù)據(jù),現(xiàn)在需要在dm層匯總我們想要分析的數(shù)據(jù)
先新建一個(gè)dm層數(shù)據(jù)庫

1)聚合產(chǎn)品表
可以發(fā)現(xiàn)DimProductCategory以及DimProductSubcategory這兩張產(chǎn)品種類維度的表格同時(shí)存在 ['ProductCategoryKey', 'int']字段,對這兩張表格進(jìn)行連接,可以得到產(chǎn)品的主分類和子分類的情況,同時(shí)可以得到產(chǎn)品種類英文名
create table if not exists adventure_dw_qtfy.dimproductsubcategory_new as
select b.ProductSubcategoryKey, a.EnglishProductCategoryName, b.EnglishProductSubcategoryName
from DimProductCategory a join DimProductSubcategory b
on a.ProductcategoryKey=b.ProductCategoryKey;
2)為了方便查閱,將ods層部分表格遷移過來
use adventure_dw_qtfy;
# 銷售地區(qū)維度表
drop table if exists dimsalesterritory;
create table dimsalesterritory as select * from adventure_ods_qtfy.dimsalesterritory;
# 產(chǎn)品維度表
drop table if exists DimProduct;
create table DimProduct as select * from adventure_ods_qtfy.DimProduct;
3)建立事實(shí)表fact_time
create table fact_time as
SELECT
a.*,
b.amount AS amount_last_year,
b.order_number AS order_number_last_year,
c.amount AS amount_last_month,
c.order_number AS order_number_last_month,
round(((a.amount-c.amount)/c.amount)*100,2) AS amount_comp_last_month,
round(((a.order_number-c.order_number)/c.order_number)*100,2) AS order_number_comp_last_month,
round(((a.amount-b.amount)/b.amount)*100,2) AS amount_comp_last_year,
round(((a.order_number-b.order_number)/b.order_number)*100,2) AS order_number_comp_last_year
FROM (
SELECT
SalesTerritoryKey,
concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)) AS orderdate,
YEAR(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2))) AS time_YEAR,
QUARTER(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2))) AS time_QUARTER,
MONTH(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2))) AS time_MONTH,
WEEKofyear(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2))) AS time_WEEK,
count( SalesAmount ) AS order_number,
round(count(SalesAmount)*(0.9+rand ()*0.4),2) AS order_number_forcost,
round( sum( SalesAmount ), 2 ) AS amount,
round(sum(SalesAmount)*(0.9+rand ()*0.4),2) AS amount_forcost,
round(sum(SalesAmount)/count(SalesAmount),2) AS customerunitprice,
round( avg( TotalProductCost ), 2 ) AS per_productcost,
round( avg( TaxAmt ), 2 ) AS per_tax,
round( avg( freight ), 2 ) AS avg_freight
FROM
adventure_ods_qtfy.FactInternetSales
GROUP BY
SalesTerritoryKey,
OrderDateKey
) a
LEFT JOIN (
SELECT
SalesTerritoryKey,
OrderDateKey,
date_add(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)), 365 ) AS orderdate,
count( SalesAmount ) AS order_number,
round( sum( SalesAmount ), 2 ) AS amount
FROM
adventure_ods_qtfy.FactInternetSales
GROUP BY
SalesTerritoryKey,
OrderDateKey
) b
ON a.SalesTerritoryKey = b.SalesTerritoryKey AND a.orderdate = b.orderdate
LEFT JOIN (
SELECT
SalesTerritoryKey,
OrderDateKey,
date_add(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)),30) AS orderdate,
count( SalesAmount ) AS order_number,
round( sum( SalesAmount ), 2 ) AS amount
FROM
adventure_ods_qtfy.FactInternetSales
GROUP BY
SalesTerritoryKey,
OrderDateKey
) c
ON b.SalesTerritoryKey = c.SalesTerritoryKey AND b.orderdate = c.orderdate
WHERE
a.orderdate <= current_date()
ORDER BY
a.SalesTerritoryKey, a.orderdate;
4)建立事實(shí)表Factinternet
create table Factinternet
as
select a.*,round(a.order_number/a.order_number_forcost,2) as order_number_forcost_comp,
round(a.order_number/a.order_number_forcost,2) as amount_forcost_comp
from (
SELECT
a.OrderDatekey as orderdate,
YEAR(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2))) AS time_YEAR,
QUARTER(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2))) AS time_QUARTER,
MONTH(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2))) AS time_MONTH,
WEEKofyear(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2))) AS time_WEEK,
a.SalesTerritoryKey,
b.ProductSubcategoryKey,
count(a.CustomerKey) AS order_number,
round(count(a.CustomerKey)* (0.9+rand ( ) * 0.4), 2) AS order_number_forcost,
round(sum( a.SalesAmount ), 2) AS Amount,
round(sum(SalesAmount)*(0.9+rand ()*0.4),2) AS amount_forcost,
round(sum(a.SalesAmount)/count(a.SalesAmount),2) AS customerunitprice,
round(avg(a.TotalProductCost), 2) AS per_productcost,
round(avg(a.TaxAmt), 2) AS per_tax,
round(avg(a.freight), 2) AS avg_freight
FROM
adventure_ods_qtfy.FactinternetSales a
LEFT JOIN adventure_ods_qtfy.DimProduct b ON a.ProductKey = b.ProductKey
where concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)) <= current_date()
GROUP BY
a.OrderDatekey,
a.SalesTerritoryKey,
b.ProductSubcategoryKey) a;
5)建立事實(shí)表fact_geography
create table fact_geography as
select
b.GeographyKey,
sum(a.SalesAmount) as amount,
count(a.CustomerKey) as order_number
from FactInternetSales a
left join DimCustomer b on a.CustomerKey=b.CustomerKey
GROUP BY b.GeographyKey;
六、Power BI可視化展示
