Adventure項(xiàng)目總結(jié)(2020版)

一、背景介紹

Adventure Works Cycles是基于微軟SQL Server中AdventureWorks 示例數(shù)據(jù)庫所構(gòu)建的虛擬公司,它是一家大型跨國制造公司。該公司向北美,歐洲和亞洲商業(yè)市場生產(chǎn)并銷售金屬和復(fù)合材料自行車。盡管其基本業(yè)務(wù)位于華盛頓州的博塞爾市,擁有290名員工,但幾個(gè)區(qū)域銷售團(tuán)隊(duì)遍布整個(gè)市場。

2000年,Adventure Works Cycles收購了一家位于墨西哥的小型制造工廠Importadores Neptuno。Importadores Neptuno為Adventure Works Cycles產(chǎn)品系列制造了幾個(gè)關(guān)鍵子組件。這些子組件被運(yùn)送到Bothell位置進(jìn)行最終產(chǎn)品組裝。2001年,Importadores Neptuno成為旅游自行車產(chǎn)品集團(tuán)的唯一制造商和分銷商。

在成功實(shí)現(xiàn)財(cái)政年度之后,Adventure Works Cycles希望通過下面三種方式擴(kuò)大銷售額,第一銷售目標(biāo)定位到最佳客戶、第二通過外部網(wǎng)站擴(kuò)展適用的產(chǎn)品、第三通過降低生產(chǎn)成本來降低銷售成本。

其中關(guān)于客戶類型、產(chǎn)品介紹、采購和供應(yīng)商這三個(gè)方面來做一個(gè)簡單的介紹

客戶類型

Adventure Works Cycle的客戶主要有兩種:

個(gè)體:通過網(wǎng)上零售店鋪來購買商品

商戶: 從Adventure Works Cycles銷售代表處購買轉(zhuǎn)售產(chǎn)品的零售店或批發(fā)店。

產(chǎn)品介紹

作為一家自行車生產(chǎn)公司,Adventure公司主要有以下四條產(chǎn)品線:

  • Adventure Works Cycles 生產(chǎn)的自行車

  • 自行車部件,例如車輪,踏板或制動組件

  • 從供應(yīng)商處購買的自行車服裝,用于轉(zhuǎn)售給Adventure Works Cycles的客戶。

  • 從供應(yīng)商處購買的自行車配件,用于轉(zhuǎn)售給Adventure Works Cycles客戶。

采購和供應(yīng)商方面

在Adventure Works Cycles,采購部門購買用于制造Adventure Works Cycles自行車的原材料和零件。 Adventure Works Cycles還購買轉(zhuǎn)售產(chǎn)品,如自行車服裝和自行車附加裝置,如水瓶和水泵。

Adventure數(shù)據(jù)庫簡介

二、項(xiàng)目流程

1、在Hive數(shù)據(jù)庫中建表并導(dǎo)入數(shù)據(jù)

2、探索數(shù)據(jù)庫,羅列指標(biāo),了解數(shù)據(jù)庫中各個(gè)表格的構(gòu)成及相互間的關(guān)系

3、匯總數(shù)據(jù),建立數(shù)據(jù)倉庫

4、連接至Power BI進(jìn)行可視化展示

三、數(shù)據(jù)導(dǎo)入

1、轉(zhuǎn)換csv文件

解壓Adventure數(shù)據(jù)庫后,我們得到了29個(gè)csv文件以及一個(gè)導(dǎo)出建表語句的sql文件。csv文件都以“|”作為分隔符,而hive默認(rèn)用逗號作為分隔符,需要對其進(jìn)行調(diào)整。

用遍歷文件的方法加上pandas庫,對文件進(jìn)行轉(zhuǎn)換操作
1)用os.walk獲取文件名,建立for循環(huán),
2)用pd.read_csv來讀取文件,同時(shí)設(shè)置好分隔符和編碼格式
3)用os.path.join將文件名和文件路徑合并在一起以供調(diào)用,df.to_csv輸出轉(zhuǎn)換后的csv文件

完整語句如下
轉(zhuǎn)換文件截圖

2、讀取文件

利用python的正則表達(dá)式來循環(huán)讀取文件,同時(shí)獲取表名和字段名

create_file=open(r"create_table.sql")

table_info = {}
# 循環(huán)讀取文件,獲取表名和字段名
content=create_file.readline()
while len(content)!=0:
    # print(content,end="")
    # 如果是有 create table
    table_name = ''
    table_columns = []
    while "GO" not in content:
        # print(content,end="")
        if "CREATE TABLE" in content.upper():
            # 正則表達(dá)式,獲取表名
            se0bj = re.search(r"\[(.*?)\].\[(.*?)\]",content,re.I)  # re.I 對大小寫不敏感;(.*?)用于分組
            if se0bj:
                table_name = se0bj.group(2)
        matOjb = re.search(r"\[(.*?)\] \[(.*?)\].*",content.lstrip(),re.I)  #  讀取字段名和字段類型
        if matOjb:
            column = matOjb.group(1)
            if column.upper() == "DATE":
                column = "date_time"
            type = matOjb.group(2)
            table_columns.append([column,type])
        content = create_file.readline()
    table_info[table_name]=table_columns

    print(table_name)
    print(table_columns)

    content=create_file.readline()

create_file.close()

得到結(jié)果如下
image.png

3、生成建表腳本

用字段結(jié)果生成建表shell腳本

# 解析 table_info 字典,用來創(chuàng)建表
shell_file = open(r"create_table.sh","w")
shell_file.writelines("#!/bin/sh\n\nhive -v -e\"\nuse adventure_ods_qtfy;\n\n")
for key in table_info.keys():

    # 先寫入表頭
    # shell_file.writelines("drop table " + key + ";\n")
    shell_file.write("create table if not exists  " + key + "(\n")

    for columns in table_info[key]:
        if columns == table_info[key][len(table_info[key])-1]:
            shell_file.write("  " + columns[0] + "      string\n")
        else:
            shell_file.write("  "+columns[0]+"      string,\n")

    # 寫入建表格式
    shell_file.write(""")row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    with serdeproperties(
    'field.delim'=',',
    'serialization.encoding'='UTF-8')
    stored as textfile;\n\n""")

shell_file.write("\";")
shell_file.close()

4、生成導(dǎo)數(shù)shell文件

# 導(dǎo)入數(shù)據(jù)語句
load_file = open("load_create_data.sh","w")
load_file.writelines("#!/bin/sh\n\nhive -v -e\"\nuse adventure_ods_qtfy;\n\n")

for key in table_info.keys():
    load_file.write("load data local inpath \'/root/qtfy/ods_datas/%s.csv\' overwrite into table %s;\n"%(key,key))


load_file.write("\n\";")
load_file.close()

5、建庫并導(dǎo)數(shù)

登錄linux云服務(wù)器,進(jìn)入hive環(huán)境,新建ods層數(shù)據(jù)庫,用于存放基礎(chǔ)數(shù)據(jù)

image.png

上傳csv文件至云服務(wù)器,然后執(zhí)行shell文件,從上傳的csv文件中導(dǎo)出數(shù)據(jù)到hive環(huán)境下ods層數(shù)據(jù)庫中


導(dǎo)數(shù)shell文件部分截圖

四、數(shù)據(jù)探索

1、了解數(shù)據(jù)

29個(gè)表格名稱如圖所示

image.png

此處我們暫以銷售為主題進(jìn)行探索
表格分為維度表(以dim開頭)和事實(shí)表(以fact開頭)兩類,表格設(shè)計(jì)結(jié)構(gòu)為星座模型。
維度表:表示對分析主題所屬類型的描述。比如產(chǎn)品維度表中有產(chǎn)品id、子類別id、名稱、產(chǎn)地、價(jià)格等屬性,如果別的表有產(chǎn)品id,就可以通過join連接來獲得相應(yīng)其他字段,節(jié)約表的空間。
事實(shí)表:表示對分析主題的度量。比如淘寶用戶每日下單的記錄,買家可獲得其匯總信息,包括有userid、訂單號、支付時(shí)間、支付鏈接等等

2、目的及問題

1)對數(shù)據(jù)倉庫的數(shù)據(jù)進(jìn)行可視化界面的加工,方便數(shù)據(jù)的查看;
2)以銷售情況為主題觀察數(shù)據(jù)的基本情況;
3)搭建指標(biāo)框架結(jié)構(gòu),設(shè)計(jì)可視化圖表布局;

3、數(shù)據(jù)指標(biāo)整理

此次分析以銷售為主題,因此重點(diǎn)放在銷售的表格


image.png

事實(shí)表:
FactResellerSales和FactInternetSales兩張表,除了FactInternetSales中多了EmployeeKey字段外,其他字段兩張表均保持一致;
維度表:
事實(shí)表中存在的key有:ProductKey,OrderDateKey,PromotionKey, SalesTerritoryKey等;
ER關(guān)系圖如下:


image.png

維度:
產(chǎn)品維度:產(chǎn)品分類、產(chǎn)品子分類
時(shí)間維度:年、季、月
地區(qū)維度:銷售地區(qū)
推廣維度:推廣方式
可分析的指標(biāo):
1.總銷售額=銷售量 * 客單價(jià)
2.總成本=產(chǎn)品標(biāo)準(zhǔn)成本 * 銷售量+每筆訂單的稅費(fèi)+運(yùn)費(fèi)
3.利潤情況=銷售額-總成本
4.銷售指標(biāo)達(dá)成情況
5.銷售量最佳的產(chǎn)品Top10
6.各維度下搭配,如時(shí)間維度下的銷售情況、銷售區(qū)域變化情況
地區(qū)維度下的銷售額、推廣情況
產(chǎn)品分類下的銷售情況

五、相應(yīng)SQL語句

1、找出產(chǎn)品主類和子品類的對應(yīng)關(guān)系,并且觀察消費(fèi)者的青睞程度

select a.ProductCategoryKey, a.EnglishProductCategoryName, 
  b.ProductSubcategoryKey, b.EnglishProductSubcategoryName
from DimProductCategory a join DimProductSubcategory b 
on a.ProductcategoryKey=b.ProductCategoryKey;
image.png

2、受歡迎程度

a、最受消費(fèi)者歡迎(購買的顧客最多)的產(chǎn)品主類排序

Select  EnglishProductCategoryName, count(CustomerKey) as Popularity
From FactSurveyResponse
Group by EnglishProductCategoryName
Order by Popularity desc;
image.png

b、最受消費(fèi)者歡迎的產(chǎn)品子類排序

Select  EnglishProductCategoryName, EnglishProductSubcategoryName, count(CustomerKey) as Popularity
From FactSurveyResponse
Group by EnglishProductCategoryName, EnglishProductSubcategoryName
Order by Popularity desc;
image.png

3、各大品類的銷售情況

select
   case when b.ProductSubcategoryKey between 1 and 3 then 'Bikes'
                        when b.ProductSubcategoryKey between 4 and 17 then 'Components'
                        when b.ProductSubcategoryKey between 18 and 25 then 'Clothing'
                        when b.ProductSubcategoryKey between 26 and 37 then 'Accessories'
                        else 'errors' end  as product_catory,
   sum(a.SalesAmount) as cloth_sales
from FactInternetSales a 
inner join dimproduct b 
on a.ProductKey=b.productkey
group by case when b.ProductSubcategoryKey between 1 and 3 then 'Bikes'
                        when b.ProductSubcategoryKey between 4 and 17 then 'Components'
                        when b.ProductSubcategoryKey between 18 and 25 then 'Clothing'
                        when b.ProductSubcategoryKey between 26 and 37 then 'Accessories'
                        else 'errors' end ;
image.png

4、建立匯總表

數(shù)據(jù)倉庫一般可分為ods基礎(chǔ)層和dm匯總層,我們已經(jīng)在ods層導(dǎo)入存放了基礎(chǔ)數(shù)據(jù),現(xiàn)在需要在dm層匯總我們想要分析的數(shù)據(jù)
先新建一個(gè)dm層數(shù)據(jù)庫


image.png

1)聚合產(chǎn)品表
可以發(fā)現(xiàn)DimProductCategory以及DimProductSubcategory這兩張產(chǎn)品種類維度的表格同時(shí)存在 ['ProductCategoryKey', 'int']字段,對這兩張表格進(jìn)行連接,可以得到產(chǎn)品的主分類和子分類的情況,同時(shí)可以得到產(chǎn)品種類英文名

create table if not exists adventure_dw_qtfy.dimproductsubcategory_new as 
    select b.ProductSubcategoryKey, a.EnglishProductCategoryName, b.EnglishProductSubcategoryName 
from DimProductCategory a join DimProductSubcategory b 
on a.ProductcategoryKey=b.ProductCategoryKey;

2)為了方便查閱,將ods層部分表格遷移過來

use adventure_dw_qtfy;
# 銷售地區(qū)維度表
drop table if exists dimsalesterritory;
create table dimsalesterritory as select * from adventure_ods_qtfy.dimsalesterritory;

# 產(chǎn)品維度表
drop table if exists DimProduct;
create table DimProduct as select * from adventure_ods_qtfy.DimProduct;

3)建立事實(shí)表fact_time

create table fact_time as
SELECT
  a.*,
  b.amount       AS amount_last_year,
  b.order_number AS order_number_last_year,
  c.amount       AS amount_last_month,
  c.order_number AS order_number_last_month,
  round(((a.amount-c.amount)/c.amount)*100,2)                   AS amount_comp_last_month,
  round(((a.order_number-c.order_number)/c.order_number)*100,2) AS order_number_comp_last_month,
  round(((a.amount-b.amount)/b.amount)*100,2)                   AS amount_comp_last_year,
  round(((a.order_number-b.order_number)/b.order_number)*100,2) AS order_number_comp_last_year
FROM  (
  SELECT
    SalesTerritoryKey,
    concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)) AS orderdate,
    YEAR(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)))         AS time_YEAR,
    QUARTER(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)))      AS time_QUARTER,
    MONTH(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)))        AS time_MONTH,
    WEEKofyear(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)))   AS time_WEEK,
    count( SalesAmount )                          AS order_number,
    round(count(SalesAmount)*(0.9+rand ()*0.4),2) AS order_number_forcost,
    round( sum( SalesAmount ), 2 )                AS amount,
    round(sum(SalesAmount)*(0.9+rand ()*0.4),2)   AS amount_forcost,
    round(sum(SalesAmount)/count(SalesAmount),2)  AS customerunitprice,
    round( avg( TotalProductCost ), 2 )           AS per_productcost,
    round( avg( TaxAmt ), 2 )                     AS per_tax,
    round( avg( freight ), 2 )                    AS avg_freight 
  FROM
    adventure_ods_qtfy.FactInternetSales 
  GROUP BY
    SalesTerritoryKey,
    OrderDateKey 
  ) a
  LEFT JOIN (
    SELECT
      SalesTerritoryKey,
      OrderDateKey,
      date_add(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)), 365 ) AS orderdate,
      count( SalesAmount )            AS order_number,
      round( sum( SalesAmount ), 2 )  AS amount 
    FROM
      adventure_ods_qtfy.FactInternetSales 
    GROUP BY
      SalesTerritoryKey,
      OrderDateKey 
  ) b 
ON a.SalesTerritoryKey = b.SalesTerritoryKey   AND a.orderdate = b.orderdate
LEFT JOIN (
    SELECT
      SalesTerritoryKey,
      OrderDateKey,
      date_add(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)),30) AS orderdate,
      count( SalesAmount ) AS order_number,
      round( sum( SalesAmount ), 2 ) AS amount
    FROM
      adventure_ods_qtfy.FactInternetSales 
    GROUP BY
      SalesTerritoryKey,
      OrderDateKey 
  ) c 
ON b.SalesTerritoryKey = c.SalesTerritoryKey AND b.orderdate = c.orderdate 
WHERE
  a.orderdate <= current_date() 
ORDER BY
  a.SalesTerritoryKey, a.orderdate;

4)建立事實(shí)表Factinternet

create table Factinternet
as
select a.*,round(a.order_number/a.order_number_forcost,2) as order_number_forcost_comp,
       round(a.order_number/a.order_number_forcost,2) as amount_forcost_comp
from (
SELECT
    a.OrderDatekey as orderdate,
    YEAR(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)))       AS time_YEAR,
    QUARTER(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)))    AS time_QUARTER,
    MONTH(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)))      AS time_MONTH,
    WEEKofyear(concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2))) AS time_WEEK,
    a.SalesTerritoryKey,
    b.ProductSubcategoryKey,
    count(a.CustomerKey) AS order_number,
    round(count(a.CustomerKey)* (0.9+rand ( ) * 0.4), 2) AS order_number_forcost,
    round(sum( a.SalesAmount ), 2) AS Amount,
    round(sum(SalesAmount)*(0.9+rand ()*0.4),2) AS amount_forcost,
    round(sum(a.SalesAmount)/count(a.SalesAmount),2) AS customerunitprice,
    round(avg(a.TotalProductCost), 2) AS per_productcost,
    round(avg(a.TaxAmt), 2) AS per_tax,
    round(avg(a.freight), 2) AS avg_freight 
FROM
    adventure_ods_qtfy.FactinternetSales a
LEFT JOIN adventure_ods_qtfy.DimProduct b ON a.ProductKey = b.ProductKey 
where concat_ws('-',substr(orderdatekey,0,4),substr(orderdatekey,5,2),substr(orderdatekey,7,2)) <= current_date()
GROUP BY
    a.OrderDatekey,
    a.SalesTerritoryKey,
    b.ProductSubcategoryKey) a;

5)建立事實(shí)表fact_geography

create table fact_geography as
select 
    b.GeographyKey,
    sum(a.SalesAmount) as amount,
    count(a.CustomerKey) as order_number
from FactInternetSales a
left join DimCustomer b on a.CustomerKey=b.CustomerKey
GROUP BY b.GeographyKey;

六、Power BI可視化展示

image.png

Power BI報(bào)表展示鏈接

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容