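Analysis report on Google Play Store apps
Google Play Store analysis
The dataset comes from Kaggle and contains app data scraped from the Google Play Store.
Today we explore the data and look at which factors may influence the user rating (Rating).
Environment: Python 3.6, Windows 10, Jupyter Notebook
First, import the analysis packages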
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the dataset
data = pd.read_csv('googleplaystore.csv')
Exploring the data
# Look at the first rows
data.head()

# Overall summary
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App 10841 non-null object
Category 10841 non-null object
Rating 9367 non-null float64
Reviews 10841 non-null object
Size 10841 non-null object
Installs 10841 non-null object
Type 10840 non-null object
Price 10841 non-null object
Content Rating 10840 non-null object
Genres 10841 non-null object
Last Updated 10841 non-null object
Current Ver 10833 non-null object
Android Ver 10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
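The dataset has 10,841 rows and 13 columns, including app name, category, rating, number of reviews, installs, free/paid type, price, last updated date, and version.
First, convert the data into the formats we need: Reviews, Size, and Price must become numeric (Rating is already float64), and Last Updated must become a datetime.
# Convert to numeric
#data.Reviews.value_counts()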
pd.to_numeric(data['Reviews'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
pandas\src\inference.pyx in pandas.lib.maybe_convert_numeric (pandas\lib.c:55708)()
ValueError: Unable to parse string "3.0M"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-5-e509e4352e56> in <module>()
1 # Convert to numeric
2 #data.Reviews.value_counts()
----> 3 pd.to_numeric(data['Reviews'])
C:\Users\renhl1\Anaconda3\lib\site-packages\pandas\tools\util.py in to_numeric(arg, errors, downcast)
193 coerce_numeric = False if errors in ('ignore', 'raise') else True
194 values = lib.maybe_convert_numeric(values, set(),
--> 195 coerce_numeric=coerce_numeric)
196
197 except Exception:
pandas\src\inference.pyx in pandas.lib.maybe_convert_numeric (pandas\lib.c:56097)()
ValueError: Unable to parse string "3.0M" at position 10472
# Row 10472 fails to parse; inspect it to find out why
data.loc[10472,]
App Life Made WI-Fi Touchscreen Photo Frame
Category 1.9
Rating 19
Reviews 3.0M
Size 1,000+
Installs Free
Type 0
Price Everyone
Content Rating NaN
Genres February 11, 2018
Last Updated 1.0.19
Current Ver 4.0 and up
Android Ver NaN
Name: 10472, dtype: object
# This row is malformed (its values are shifted one column to the left), so drop it
data.drop(10472,inplace=True)
data['Reviews']=data['Reviews'].astype(int)
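Instead of waiting for pd.to_numeric to raise, one option is to coerce unparseable values to NaN and list the offending rows in a single pass. The following is a minimal sketch on a small made-up frame; the sample values and the names sample, parsed, bad_rows, and clean are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample standing in for the Reviews column
sample = pd.DataFrame({'Reviews': ['159', '967', '3.0M', '87510']})

# errors='coerce' turns values that cannot be parsed into NaN instead of raising
parsed = pd.to_numeric(sample['Reviews'], errors='coerce')

# Rows that failed to parse are exactly the ones that became NaN
bad_rows = sample[parsed.isnull()]
print(bad_rows)

# Keep only the clean rows and convert them to integers
clean = sample[parsed.notnull()].copy()
clean['Reviews'] = clean['Reviews'].astype(int)
print(clean)
```

This only locates suspicious rows; whether to drop or repair them is still a manual decision, as with row 10472 above.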
# Convert Size to numeric
data.Size.unique()
array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
'28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
'31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
'5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
'1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
'3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
'8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M', '2.2M',
'4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
……
'892k', '154k', '860k', '364k', '387k', '626k', '161k', '879k',
'39k', '970k', '170k', '141k', '160k', '144k', '143k', '190k',
'376k', '193k', '246k', '73k', '658k', '992k', '253k', '420k',
'404k', '470k', '226k', '240k', '89k', '234k', '257k', '861k',
'467k', '157k', '44k', '676k', '67k', '552k', '885k', '1020k',
'582k', '619k'], dtype=object)
# 'Varies with device' is not a usable size; replace it with NaN
data['Size'] = data['Size'].replace('Varies with device', np.nan)
data['Size'].isnull().sum() # number of missing values
1695
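# Size contains the suffixes 'k' and 'M'; use a regular expression to parse them into numbers
import re # regular expression module
# Define a function that scales 'k' by 1,000 and 'M' by 1,000,000
def change(i):
    if pd.isnull(i):
        return np.nan  # leave missing values untouched
    A, B = re.split('[kM]+', i)    # A is the numeric part
    C, D = re.split('[0-9.]+', i)  # D is the unit suffix
    if D == 'M':
        A = float(A) * 1000000
    elif D == 'k':
        A = float(A) * 1000
    return A
# Convert the Size column to numeric
data['Size'] = data['Size'].apply(change)
# Fill missing values with the mean Size of the app's Category
data['Size'] = data['Size'].fillna(data.groupby('Category')['Size'].transform('mean'))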
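The same k/M conversion can also be written without a row-by-row function. This is a sketch of one possible vectorized alternative using str.extract, shown on a hypothetical sample; the names sizes, parts, and size_bytes are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of Size strings, including a missing value
sizes = pd.Series(['19M', '8.7M', '201k', np.nan, '1020k'])

# Split each string into its numeric part (column 0) and unit suffix (column 1)
parts = sizes.str.extract(r'^([0-9.]+)([kM])$')
number = parts[0].astype(float)
multiplier = parts[1].map({'k': 1000, 'M': 1000000})

# Missing inputs stay NaN because both extracted parts are NaN
size_bytes = number * multiplier
print(size_bytes)
```

Strings that do not match the pattern also become NaN, which makes unexpected formats easy to spot with isnull().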
#data['Price'].value_counts()
# Inspect the values in Price
# Convert Price to float by stripping the leading '$'; free apps are stored as '0'
data['Price']=data['Price'].apply(lambda x: float(x[1:]) if x !='0' else 0 )
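A slightly more defensive variant of the Price conversion removes the dollar sign only when it is present, so values without a prefix need no special case. This is a small sketch on invented sample values, not the notebook's data.

```python
import pandas as pd

# Hypothetical Price strings: free apps are '0', paid apps carry a '$' prefix
prices = pd.Series(['0', '$4.99', '$399.99', '0'])

# lstrip('$') removes a leading dollar sign if there is one, then convert to float
price_float = prices.str.lstrip('$').astype(float)
print(price_float)
```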
# How many distinct apps are there?
len(data.App.unique())
9659
# Fewer than the number of rows, so there are duplicates; see which apps are duplicated
data.App.value_counts()
ROBLOX 9
CBS Sports App - Scores, News, Stats & Watch Live 8
Candy Crush Saga 7
ESPN 7
Duolingo: Learn Languages Free 7
……
# Inspect the rows for the first of these apps
data[data['App']=='ROBLOX']

# The duplicate rows differ in Reviews; drop the duplicates
# For apps listed under several categories (100+ apps), keep only one category
data=data.drop_duplicates(subset=['App'])
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9659 entries, 0 to 10840
Data columns (total 13 columns):
App 9659 non-null object
Category 9659 non-null object
Rating 8196 non-null float64
Reviews 9659 non-null int32
Size 9659 non-null float64
Installs 9659 non-null object
Type 9658 non-null object
Price 9659 non-null float64
Content Rating 9659 non-null object
Genres 9659 non-null object
Last Updated 9659 non-null object
Current Ver 9651 non-null object
Android Ver 9657 non-null object
dtypes: float64(3), int32(1), object(9)
memory usage: 1018.7+ KB
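The deduplication above keeps whichever duplicate row appears first. If the row with the largest review count is preferred (on the assumption that it is the most recent scrape, which the dataset does not confirm), one possible sketch is to sort by Reviews before dropping duplicates. The values below are invented for illustration.

```python
import pandas as pd

# Hypothetical duplicate rows for the same app with different review counts
apps = pd.DataFrame({
    'App': ['ROBLOX', 'ROBLOX', 'ESPN', 'ROBLOX'],
    'Category': ['GAME', 'FAMILY', 'SPORTS', 'GAME'],
    'Reviews': [4447388, 4449910, 521138, 4443407],
})

# Sort so the highest review count comes first, keep the first row per app,
# then restore the original row order
deduped = (apps.sort_values('Reviews', ascending=False)
               .drop_duplicates(subset=['App'])
               .sort_index())
print(deduped)
```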
Analyzing each field
# Analyze Category
cate= data['Category'].groupby(data['Category']).count()
cate=cate.sort_values(ascending=False)
plt.figure(figsize=(15,10))
sns.barplot(x=cate.index,y=cate.values)
plt.xticks(rotation=90)
plt.xlabel('Category')
plt.ylabel('App qty')
plt.title("App qty by category")
<matplotlib.text.Text at 0x1b76369a2e8>

labels=data['Category'].value_counts().index
sizes= data['Category'].value_counts().values
# Pie chart of each category's share
plt.figure(figsize = (10,10))
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('App qty by category',color = 'blue',fontsize = 15)
<matplotlib.text.Text at 0x1b7638f8128>

Conclusion: by number of apps, the top three categories are FAMILY (19.6%), GAME (9.9%), and TOOLS (8.5%), well ahead of the remaining categories.
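The percentages in the pie chart can also be computed directly, which avoids reading them off the figure. A minimal sketch with value_counts(normalize=True) on a made-up category column:

```python
import pandas as pd

# Hypothetical category column
cats = pd.Series(['FAMILY', 'FAMILY', 'GAME', 'TOOLS', 'FAMILY', 'GAME'])

# normalize=True returns fractions instead of counts; multiply by 100 for percent
share = cats.value_counts(normalize=True) * 100
print(share.round(1))
```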
# Analyze Genres
len(data.Genres.value_counts())
118
# There are 118 distinct genres in total
genr= data['Genres'].groupby(data['Genres']).count()
genr=genr.sort_values(ascending=False)
genr.index[:15] # top 15 genres
Index(['Tools', 'Entertainment', 'Education', 'Business', 'Medical',
'Personalization', 'Productivity', 'Lifestyle', 'Finance', 'Sports',
'Communication', 'Action', 'Health & Fitness', 'Photography',
'News & Magazines'],
dtype='object', name='Genres')
plt.figure(figsize=(15,10))
sns.barplot(x=genr.index[:15],y=genr.values[:15])
plt.xticks(rotation=90)
plt.xlabel('Genres')
plt.ylabel('App qty')
plt.title("App qty by Genres")
<matplotlib.text.Text at 0x1b764a7c278>

data.describe()

# Distribution of Rating
fig=plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)
sns.violinplot(y=data['Rating'],data=data,ax=ax1)
sns.kdeplot(data.Rating,ax=ax2,shade=True)
sns.boxplot(y=data.Rating,ax=ax3)

Conclusion: 50% of apps are rated between 4.0 and 4.5, and the mean rating is 4.17.
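The "50% of apps" statement corresponds to the interquartile range, which by definition holds the middle half of the values. A sketch of how it can be checked numerically, using invented ratings rather than the real column:

```python
import pandas as pd

# Hypothetical ratings
ratings = pd.Series([3.0, 4.0, 4.1, 4.3, 4.4, 4.5, 4.7, 5.0])

# First quartile, median, and third quartile
q1, median, q3 = ratings.quantile([0.25, 0.5, 0.75])

# Fraction of ratings that fall inside the interquartile range
inside = ratings.between(q1, q3).mean()
print(q1, median, q3, inside)
```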
# Look at Reviews
#data['Reviews'].value_counts()
fig=plt.figure(figsize=(12,8))
sns.kdeplot(data.Reviews,shade=True) # density of Reviews

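The density is concentrated at very low review counts, with a long tail of heavily reviewed apps.
# Distribution of apps with fewer than 200 reviews
a = list(range(0, 200, 5))  # bin edges in steps of 5
fig=plt.figure(figsize=(15,8))
plt.hist(data['Reviews'],a,histtype="bar",rwidth=0.8,alpha=0.4)
plt.xticks(np.arange(0, 200, step=5))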

# Find the highest review counts
b=data['Reviews'].value_counts()
b.sort_index(ascending=False)
78158306 1
69119316 1
66577313 1
56642847 1
44891723 1
42916526 1
27722264 1
25655305 1
24900999 1
23133508 1
22426677 1
……
16 35
15 30
14 41
13 49
12 58
11 52
10 62
9 64
8 72
7 88
6 945
4 137
3 170
2 213
1 272
0 593
Name: Reviews, dtype: int64
data[data['Reviews']>20000000]

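Apart from four games, the most-reviewed apps are, surprisingly, mostly from the Facebook family; YouTube is the only Google app on the list, and the last two are from Cheetah Mobile.
Next, analyze the effect of pricing, using the Type and Price fields.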
a=data.Type.value_counts()
labels=data['Type'].value_counts().index
explode = [0.2,0] # offset of each wedge from the center
sizes= data['Type'].value_counts().values
#colors = ['grey','blue','red','yellow','green','brown']
plt.figure(figsize = (9,9))
plt.pie(sizes, labels=labels, autopct='%1.1f%%',explode=explode)
plt.rcParams.update({'font.size': 10})
plt.title('App qty by type',color = 'blue',fontsize = 20)
<matplotlib.text.Text at 0x1b765455208>

92.2% of apps are free and 7.8% are paid.
# Analyze Price
data['Price'].value_counts()
0.00 8903
0.99 145
2.99 124
1.99 73
4.99 70
3.99 57
1.49 46
5.99 26
2.49 25
9.99 19
399.99 12
6.99 11
14.99 9
4.49 9
...
Name: Price, dtype: int64
price = data['Price'].value_counts()
price.drop(0,inplace=True) # drop free apps to focus on paid ones
price=price.sort_values(ascending=False)
fig = plt.figure(figsize=(15,10))
sns.kdeplot(data[data['Price']!=0]['Price']) # density of paid-app prices

Most paid apps cost less than 30 USD, but there is a bump around 400 USD; select those apps to see what is going on.
data[data['Price']==399.99]

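A quick search shows that these are joke apps with no real functionality. On Google Play they do have several thousand reviews and 100,000+ installs; I don't understand why the install count is so high, so let me know if you do.
These outliers can be removed from the later price analysis.
# Look at all price points in more detail
#num = str(a.tolist()).count("1")
#num
# Most paid apps are priced at 0.99, 1.99, 2.99, and so on. For a cleaner analysis, drop price points used by only one app (63 values in total)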
price =price[price>1]
#a=data['Price'].value_counts().values
fig = plt.figure(figsize=(12,10))
sns.kdeplot(price.values,shade=True)
C:\Users\renhl1\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
<matplotlib.axes._subplots.AxesSubplot at 0x1b765bcbb38>

fig = plt.figure(figsize=(18,10))
sns.barplot(x=price.index, y=price.values)

Most paid apps cost less than 10 USD; the five most common prices are, in order, 0.99, 2.99, 1.99, 4.99, and 3.99 USD.
# Convert Last Updated to datetime
data['Last Updated']=pd.to_datetime(data['Last Updated'])
fig = plt.figure(figsize=(10,7))
plt.plot(data['Last Updated'],'.')

# Look at Installs
data['Installs'].value_counts()
1,000,000+ 1417
100,000+ 1112
10,000+ 1031
10,000,000+ 937
1,000+ 888
100+ 710
5,000,000+ 607
500,000+ 505
50,000+ 469
5,000+ 468
10+ 385
500+ 328
50+ 204
50,000,000+ 202
100,000,000+ 188
5+ 82
1+ 67
500,000,000+ 24
1,000,000,000+ 20
0+ 14
0 1
Name: Installs, dtype: int64
install=data['Installs'].groupby(data['Installs']).count()
install =install.sort_values(ascending=False)
fig = plt.figure(figsize=(9,12))
sns.barplot(x=install.values,y=install.index)
plt.ylabel('installed times')
plt.xlabel('App qty')
plt.title("App qty by installed times")

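The most common bucket is 1,000,000+ installs. Another interesting detail: buckets that start with 5 contain noticeably fewer apps than the neighboring buckets that start with 1.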
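Because Installs is stored as text, sorting it directly gives lexicographic order ('1,000+' before '10+'). One way to get a true numeric order is to strip the separators first. A sketch on hypothetical bucket labels, with installs_num as an assumed helper name:

```python
import pandas as pd

# Hypothetical Installs labels
installs = pd.Series(['1,000,000+', '500+', '10,000+', '0', '5,000,000+'])

# Remove thousands separators and the trailing '+', then convert to integers
installs_num = (installs.str.replace(',', '', regex=False)
                        .str.rstrip('+')
                        .astype(int))

# Sorting the numeric values now gives the expected order
print(installs_num.sort_values())
```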
# Apps with more than one billion installs
data[data['Installs']=='1,000,000,000+']


Most apps with more than one billion installs are Google products.
# Is the install count related to the number of reviews?
reviews=data['Reviews'].groupby(data['Installs']).mean()
fig = plt.figure(figsize=(15,9))
sns.barplot(x=reviews.values,y=reviews.index)
plt.ylabel('installed times')
plt.xlabel('reviews')
plt.title("avg. reviews by installed times")
plt.xscale('log') # use a log scale

Install count and review count are indeed positively correlated.
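The positive relationship can also be summarized with a single number. Since both quantities span many orders of magnitude, a rank-based (Spearman) coefficient is a reasonable choice; it equals the Pearson correlation computed on ranks. This is a sketch on invented values, assuming Installs has already been converted to numbers.

```python
import pandas as pd

# Hypothetical numeric installs and review counts
df = pd.DataFrame({
    'installs': [100, 1000, 10000, 1000000, 100000000],
    'reviews': [3, 20, 150, 9000, 2500000],
})

# Spearman correlation: Pearson correlation of the rank-transformed columns
spearman = df['installs'].rank().corr(df['reviews'].rank())
print(spearman)
```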
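Exploring which parameters may be related to the rating
First, drop the rows that have no rating and assign the result to a new dataset
#data['Rating'].value_counts()
newdata=data[data['Rating'].notnull()].copy() # drop rows without a rating; copy so new columns can be added safely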
newdata.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8196 entries, 0 to 10840
Data columns (total 13 columns):
App 8196 non-null object
Category 8196 non-null object
Rating 8196 non-null float64
Reviews 8196 non-null int32
Size 8196 non-null float64
Installs 8196 non-null object
Type 8196 non-null object
Price 8196 non-null float64
Content Rating 8196 non-null object
Genres 8196 non-null object
Last Updated 8196 non-null datetime64[ns]
Current Ver 8192 non-null object
Android Ver 8194 non-null object
dtypes: datetime64[ns](1), float64(3), int32(1), object(8)
memory usage: 864.4+ KB
# Is Last Updated related to Rating?
fig = plt.figure(figsize=(10,7))
plt.plot(newdata['Last Updated'],newdata['Rating'],'.')
[<matplotlib.lines.Line2D at 0x1b7670ec5f8>]

# Extract the year into a new column
newdata['updated_year']=newdata['Last Updated'].dt.year
fig = plt.figure(figsize=(15,9))
sns.boxplot(x=newdata['updated_year'],y=newdata['Rating'])
plt.xlabel('updated year')
plt.ylabel('rating')
plt.title('rating with different updated year')

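The median rating rises with the year of the last update, and 2018 is the first year in which more than 75% of apps score above 4. One reading is that, as the mobile app ecosystem matures, low-quality apps have largely lost their market.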
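The quartiles read off the box plot can be tabulated per year to check statements such as "more than 75% of apps score above 4" (that is, the 25% quantile exceeds 4). A sketch on made-up rows; the column names mirror the notebook, the values do not.

```python
import pandas as pd

# Hypothetical rows with an update date and a rating
df = pd.DataFrame({
    'Last Updated': pd.to_datetime(['2016-03-01', '2016-07-15', '2017-01-20',
                                    '2017-11-02', '2018-05-05', '2018-07-30']),
    'Rating': [3.5, 4.1, 3.9, 4.3, 4.2, 4.6],
})
df['updated_year'] = df['Last Updated'].dt.year

# describe() per group includes the 25%, 50%, and 75% quantiles
by_year = df.groupby('updated_year')['Rating'].describe()[['25%', '50%', '75%']]
print(by_year)
```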
plt.figure(figsize=(12,9))
sns.boxplot(x=newdata['Type'],y=newdata['Rating'],data=newdata)

Paid apps tend to be rated higher than free apps.
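To put numbers on the box plot comparison, the ratings can be summarized per Type. A minimal sketch with groupby and agg on invented data:

```python
import pandas as pd

# Hypothetical ratings for free and paid apps
df = pd.DataFrame({
    'Type': ['Free', 'Free', 'Free', 'Paid', 'Paid'],
    'Rating': [4.0, 4.2, 4.4, 4.3, 4.5],
})

# Count, mean, and median rating for each Type
summary = df.groupby('Type')['Rating'].agg(['count', 'mean', 'median'])
print(summary)
```

With very different group sizes, as in the real data (most apps are free), the medians are usually a safer comparison than the means.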
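# Are Reviews and Rating correlated?
# Pearson correlation lies between -1 and +1: +1 is a perfect positive correlation, -1 a perfect negative correlation, and 0 means no linear correlation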
plt.figure(figsize=(10,10))
sns.jointplot(newdata['Reviews'],newdata['Rating'],kind='reg',size =7)

# Is Size correlated with Rating?
plt.figure(figsize=(10,10))
sns.jointplot(newdata['Size'],newdata['Rating'],kind='reg',size =7)

Conclusion: Rating shows no meaningful correlation with Reviews or Size.
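The Pearson coefficient reported on the joint plots can be computed directly. The sketch below implements the textbook formula in a hypothetical helper named pearson and compares it with pandas' Series.corr; the sample values are invented.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centered x
    yc = y - y.mean()  # centered y
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Hypothetical review counts and ratings
reviews = pd.Series([10, 200, 3000, 40000, 500000])
rating = pd.Series([4.1, 3.9, 4.4, 4.0, 4.3])

r_manual = pearson(reviews, rating)
r_pandas = reviews.corr(rating)  # Series.corr defaults to Pearson
print(r_manual, r_pandas)
```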
# Relationship between Category and Rating
fig =plt.figure(figsize=(15,12))
sns.boxplot(y=newdata['Category'],x=newdata['Rating'],data=newdata)
#plt.xticks(rotation=90)
plt.ylabel('category')
plt.xlabel('rating')
plt.title('rating distribution by category')
<matplotlib.text.Text at 0x1b768aabb00>

The lowest-rated category is DATING :), while the higher-rated categories include ART_AND_DESIGN, EVENTS, PERSONALIZATION, and PARENTING.
# Relationship between Installs and Rating
installrate =newdata['Rating'].groupby(newdata['Installs']).count()
installrate
Installs
1+ 3
1,000+ 697
1,000,000+ 1415
1,000,000,000+ 20
10+ 69
10,000+ 987
10,000,000+ 937
100+ 303
100,000+ 1094
100,000,000+ 188
5+ 9
5,000+ 425
5,000,000+ 607
50+ 56
50,000+ 457
50,000,000+ 202
500+ 199
500,000+ 504
500,000,000+ 24
Name: Rating, dtype: int64
# Drop the buckets with very few installs and keep only apps with 100 or more installs
selected = newdata.loc[~newdata['Installs'].isin(['1+', '5+', '10+', '50+'])]
# Relationship between Installs and Rating
fig =plt.figure(figsize=(15,9))
sns.boxplot(x=selected['Installs'],y=selected['Rating'])
plt.xticks(rotation=45)
plt.xlabel('Installed qty')
plt.ylabel('rating')
plt.title('rating distribution by installs')

# Ratings cluster between 4.0 and 4.5; Rating is not strongly correlated with Installs
# Relationship with Price. The earlier Type comparison was effectively price = 0 versus price > 0; here we look at differences among paid price levels
# Drop free apps and the anomalous "I'm Rich" apps
selected =newdata.loc[(newdata['Price']!=0) & (newdata['Price']<200)]
# Relationship between Price and Rating
fig =plt.figure(figsize=(15,9))
sns.jointplot(x=selected['Price'],y=selected['Rating'],kind='reg')
#plt.xticks(rotation=45)
plt.xlabel('Price')
plt.ylabel('rating')
plt.title('rating distribution vs. price')

The Pearson coefficient is -0.029, so Price and Rating are essentially uncorrelated.
# Look at Category together with Genres
data['App'].groupby([data['Category'],data['Genres']]).count()
Category Genres
ART_AND_DESIGN Art & Design 57
Art & Design;Action & Adventure 1
Art & Design;Creativity 5
Art & Design;Pretend Play 1
AUTO_AND_VEHICLES Auto & Vehicles 85
BEAUTY Beauty 53
BOOKS_AND_REFERENCE Books & Reference 222
BUSINESS Business 420
COMICS Comics 55
Comics;Creativity 1
COMMUNICATION Communication 315
DATING Dating 171
EDUCATION Education 99
Education;Action & Adventure 1
Education;Brain Games 3
Education;Creativity 3
Education;Education 8
Education;Music & Video 1
Education;Pretend Play 4
ENTERTAINMENT Entertainment 92
Entertainment;Brain Games 2
Entertainment;Creativity 1
Entertainment;Music & Video 7
EVENTS Events 64
FAMILY Action;Action & Adventure 9
Adventure;Action & Adventure 4
Adventure;Brain Games 1
Adventure;Education 1
Arcade;Action & Adventure 14
Arcade;Pretend Play 1
...
GAME Simulation;Education 1
Sports 6
Strategy 17
Trivia 38
Word 23
HEALTH_AND_FITNESS Health & Fitness 288
HOUSE_AND_HOME House & Home 74
LIBRARIES_AND_DEMO Libraries & Demo 84
LIFESTYLE Lifestyle 368
Lifestyle;Pretend Play 1
MAPS_AND_NAVIGATION Maps & Navigation 131
MEDICAL Medical 395
NEWS_AND_MAGAZINES News & Magazines 254
PARENTING Parenting 46
Parenting;Brain Games 1
Parenting;Education 7
Parenting;Music & Video 6
PERSONALIZATION Personalization 376
PHOTOGRAPHY Photography 281
PRODUCTIVITY Productivity 374
SHOPPING Shopping 202
SOCIAL Social 239
SPORTS Sports 325
TOOLS Tools 826
Tools;Education 1
TRAVEL_AND_LOCAL Travel & Local 218
Travel & Local;Action & Adventure 1
VIDEO_PLAYERS Video Players & Editors 162
Video Players & Editors;Music & Video 1
WEATHER Weather 79
Name: App, dtype: int64
# Paid vs. free app counts by category
a=data['App'].groupby([data['Category'],data['Type']]).count()
c=[]
d=[]
for i in a.index.values:
    c.append(i[0])  # category name
    d.append(i[1])  # 'Free' or 'Paid'
typedata=pd.DataFrame({'Category':c,'Type':d,'values':list(a.values)})
fig =plt.figure(figsize=(15,12))
sns.barplot(y=typedata[typedata['Type']=='Paid']['Category'],x=typedata[typedata['Type']=='Paid']['values'],color='yellow',alpha=0.8,label='Paid')
sns.barplot(y=typedata[typedata['Type']=='Free']['Category'],x=typedata[typedata['Type']=='Free']['values'],color='green',alpha = 0.2,label='Free')
<matplotlib.axes._subplots.AxesSubplot at 0x1b76968add8>

Reading the chart, the categories where paid apps appear to make up the largest share are ENTERTAINMENT, LIBRARIES_AND_DEMO, BEAUTY, and SHOPPING.
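The overlaid bars show absolute counts, so the paid share per category is hard to read from them. One way to compute it explicitly is a row-normalized cross-tabulation. This is a sketch on invented rows; the category mix below is hypothetical and says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical Category/Type pairs
df = pd.DataFrame({
    'Category': ['MEDICAL', 'MEDICAL', 'MEDICAL', 'MEDICAL',
                 'BEAUTY', 'BEAUTY', 'GAME', 'GAME', 'GAME', 'GAME'],
    'Type': ['Paid', 'Free', 'Free', 'Free',
             'Free', 'Free', 'Paid', 'Paid', 'Free', 'Free'],
})

# normalize='index' makes each row (category) sum to 1
share = pd.crosstab(df['Category'], df['Type'], normalize='index')

# Categories ranked by their share of paid apps
paid_share = share['Paid'].sort_values(ascending=False)
print(paid_share)
```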
# Relationship between Android version and Rating
newdata['Android Ver'].value_counts()
4.1 and up 1811
4.0.3 and up 1141
4.0 and up 1042
Varies with device 947
4.4 and up 713
2.3 and up 547
5.0 and up 447
4.2 and up 316
2.3.3 and up 232
2.2 and up 203
3.0 and up 201
4.3 and up 185
2.1 and up 112
1.6 and up 87
6.0 and up 42
7.0 and up 41
3.2 and up 31
2.0 and up 27
5.1 and up 16
1.5 and up 16
3.1 and up 8
2.0.1 and up 7
4.4W and up 5
8.0 and up 5
7.1 and up 3
4.0.3 - 7.1.1 2
1.0 and up 2
5.0 - 8.0 2
4.1 - 7.1.1 1
7.0 - 7.1.1 1
5.0 - 6.0 1
Name: Android Ver, dtype: int64
fig = plt.figure(figsize=(15,9))
sns.boxplot(x=newdata['Rating'],y=newdata['Android Ver'])
plt.xlabel('rating')
plt.ylabel('android ver')
<matplotlib.text.Text at 0x1b76a0e3240>

The supported Android version shows no particular relationship with Rating.
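With more than 30 distinct labels, the box plot by Android Ver is hard to read. One possible simplification is to extract the minimum supported version as a number and group on that instead. A sketch on hypothetical labels; min_version is an assumed name, and patch levels such as '4.0.3' are truncated to '4.0'.

```python
import pandas as pd

# Hypothetical Android Ver labels
ver = pd.Series(['4.1 and up', '4.0.3 and up', 'Varies with device',
                 '4.4W and up', '5.0 - 8.0'])

# Take the leading major.minor number; labels without one become NaN
min_version = ver.str.extract(r'^(\d+\.\d+)', expand=False).astype(float)
print(min_version)
```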
# Relationship between Content Rating and Rating
data['Content Rating'].value_counts()
Everyone 7903
Teen 1036
Mature 17+ 393
Everyone 10+ 322
Adults only 18+ 3
Unrated 2
Name: Content Rating, dtype: int64
fig = plt.figure(figsize=(15,9))
sns.boxplot(x=newdata['Content Rating'],y=newdata['Rating'])
plt.xlabel('content rating')
plt.ylabel('rating')
<matplotlib.text.Text at 0x1b76b76c198>

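Conclusion
This post analyzed the Google Play Store app dataset, covering 9,659 apps in total.
The mean rating is 4.17, and 50% of apps score between 4.0 and 4.5.
By number of apps, the top three categories are FAMILY (19.6%), GAME (9.9%), and TOOLS (8.5%).
Paid apps account for 7.8% of all apps. Reading the chart, ENTERTAINMENT, LIBRARIES_AND_DEMO, BEAUTY, and SHOPPING appear to have the largest paid share. Most paid apps cost less than 10 USD, and the five most common prices are, in order, 0.99, 2.99, 1.99, 4.99, and 3.99 USD.
Most apps support Android 4.0 and up; few apps still support Android 2.x or 3.x.
Most apps with more than one billion installs are Google products, but the most-reviewed apps come from the Facebook family.
The factors that appear to influence Rating are Type, Category, and updated year.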