題目:http://datawhale.club/t/topic/579/4
【任務(wù)一】企業(yè)收入的多樣性
【題目描述】一個(gè)企業(yè)的產(chǎn)業(yè)收入多樣性可以仿照信息熵的概念來定義收入熵指標(biāo):
其中 p(xi) 是企業(yè)該年某產(chǎn)業(yè)收入額占該年所有產(chǎn)業(yè)總收入的比重。在company.csv中存有需要計(jì)算的企業(yè)和年份,在company_data.csv中存有企業(yè)、各類收入額和收入年份的信息?,F(xiàn)請(qǐng)利用后一張表中的數(shù)據(jù),在前一張表中增加一列表示該公司該年份的收入熵指標(biāo) I 。
df1 = pd.read_csv('../data/exercise/company.csv')
df2 = pd.read_csv('../data/exercise/company_data.csv')
def px(x):
return -np.sum(x/x.sum()*np.log2(x/x.sum()))
df3 = df2.groupby(['日期','收入類型'])['收入額'].sum()
df3 = df3.reset_index()
df3 = df3.groupby('日期')['收入額'].apply(px).rename(index=lambda x: int(x[:4])).reset_index()
df1.merge(df3, on='日期',how='left').rename(columns={'收入額':'收入熵指標(biāo)'}).head()
結(jié)果:
證券代碼 日期 收入熵指標(biāo)
0 #000007 2014 5.633910
1 #000403 2015 5.753387
2 #000408 2016 5.704088
3 #000408 2017 NaN
4 #000426 2015 5.753387
【任務(wù)二】組隊(duì)學(xué)習(xí)信息表的變換
【題目描述】請(qǐng)把組隊(duì)學(xué)習(xí)的隊(duì)伍信息表變換為如下形態(tài),其中“是否隊(duì)長”一列取1表示隊(duì)長,否則為0
是否隊(duì)長 隊(duì)伍名稱 昵稱 編號(hào)
0 1 你說的都對(duì)隊(duì) 山楓葉紛飛 5
1 0 你說的都對(duì)隊(duì) 蔡 6
2 0 你說的都對(duì)隊(duì) 安慕希 7
3 0 你說的都對(duì)隊(duì) 信仰 8
4 0 你說的都對(duì)隊(duì) biubiu???? 20
... ... ... ... ...
141 0 七星聯(lián)盟 Daisy 63
142 0 七星聯(lián)盟 One Better 131
143 0 七星聯(lián)盟 rain 112
144 1 應(yīng)如是 思無邪 54
145 0 應(yīng)如是 Justzer0 58
df = pd.read_excel('../data/exercise/組隊(duì)信息匯總表(Pandas).xlsx')
import re
def replaceName(x):
obj = re.match("隊(duì)員_群昵稱\.(\d)",x)
obj2 = re.match("隊(duì)員(\d+)編號(hào)",x.replace(" ",""))
if x == '隊(duì)員_群昵稱':
return '群昵稱_隊(duì)員1'
if x == '隊(duì)長編號(hào)':
return '編號(hào)_隊(duì)長'
if x == '隊(duì)長_群昵稱':
return '群昵稱_隊(duì)長'
if obj:
return '群昵稱_隊(duì)員' + str(int(obj.group(1))+1)
else:
if obj2:
return '編號(hào)_隊(duì)員' + str(int(obj2.group(1)))
return x
df = df.rename(columns=lambda x: replaceName(x))
df['tmpIndex'] = df.index
df = pd.wide_to_long(df,
stubnames=['群昵稱','編號(hào)'],
i=['隊(duì)伍名稱'],
j='是否隊(duì)長',
sep="_",
suffix='.+').swaplevel(0,1).rename(columns={'群昵稱':'昵稱'})
df = df.drop('所在群', axis = 1).reset_index(['隊(duì)伍名稱']).rename(index=lambda x: 1 if x == '隊(duì)長' else 0).reset_index()
df = df.drop(df[pd.isnull(df.編號(hào))].index).sort_values(['tmpIndex','是否隊(duì)長'],ascending=[True,False]).drop(columns='tmpIndex').reset_index().drop(columns='index')
【任務(wù)三】
#1. 有多少縣滿足總投票數(shù)超過縣人口數(shù)的一半
df1 = pd.read_csv('../data/exercise/county_population.csv')
df2 = pd.read_csv('../data/exercise/president_county_candidate.csv')
df1 = df1.set_index('US County')
new_idx = df1.index.map(lambda x:tuple(x.split(', ')))
df1.index = new_idx
df1 = df1.reset_index(level=0).rename_axis(index={'US County': 'State'}).reset_index().set_index('US County').rename(index=lambda x:x[1:]).reset_index()
df2 = df2.rename(columns={'state':'State', 'county':'US County'})
df3 = df2.merge(df1, on=['US County','State'],how='left')
df3['percent'] = df3['total_votes']/df3['Population']
df3.query('percent > 0.5')['US County'].count()
'''
74
'''
#2.把州(`state`)作為行索引,把投票候選人作為列名,列名的順序按照候選人在全美的總票數(shù)由高到低排序,行列對(duì)應(yīng)的元素為該候選人在該州獲得的總票數(shù)
series1 = df2.groupby(['candidate'])['total_votes'].sum().sort_values(ascending=False)
df4 = df2.pivot_table(index='State',columns=['candidate'],values='total_votes',aggfunc='sum')
df4.columns = series1.index
df4
'''
candidate Joe Biden Donald Trump Jo Jorgensen Howie Hawkins Write-ins Rocky De La Fuente Gloria La Riva Kanye West Don Blankenship Brock Pierce ... Tom Hoefling Ricki Sue King Princess Jacob-Fambro Blake Huber Richard Duncan Joseph Kishore Jordan Scott Gary Swing Keith McCormic Zachary Scalf
State
Alabama NaN 7312.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Alaska NaN 34210.0 NaN NaN NaN NaN 825.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 318.0 NaN NaN NaN
Arizona NaN 2032.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Arkansas NaN NaN NaN NaN NaN 1713.0 2141.0 NaN NaN 1475.0 ... NaN 2812.0 NaN NaN NaN NaN 1321.0 NaN NaN NaN
California NaN 80.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 60155.0 NaN NaN NaN
5 rows × 38 columns
'''
#3.每一個(gè)州下設(shè)若干縣,定義拜登在該縣的得票率減去川普在該縣的得票率為該縣的`BT`指標(biāo),若某個(gè)州所有縣`BT`指標(biāo)的中位數(shù)大于0,則稱該州為`Biden State`,請(qǐng)找出所有的`Biden State`
df5=df2.merge(df2.groupby(['State','US County'])['total_votes'].sum().rename('county sum'), on=["State",'US County'],how='left')
df5['voteshare'] = df5['total_votes']/df5['county sum']
def test(x):
biden = float(x.query('candidate=="Joe Biden"')['voteshare'])
Trump = float(x.query('candidate=="Donald Trump"')['voteshare'])
return str(biden-Trump) + "_" + x.name[1]
BTSeries = df5.groupby(['State','US County'])['candidate','voteshare'].apply(lambda x: test(x)).rename('result').reset_index().set_index('result')
new_idx =BTSeries.index.map(lambda x: tuple(x.split('_')))
BTSeries.index = new_idx
BTSeries = BTSeries.reset_index(level=0).rename_axis(index={'result': 'BT'}).reset_index()
BTSeries['result'] = BTSeries['result'].apply(lambda x:float(x))
BTSeries.groupby('State').filter(lambda x: x.median() > 0)['State'].drop_duplicates()
'''
197 California
319 Connecticut
488 Delaware
491 District of Columbia
725 Hawaii
1878 Massachusetts
2999 New Jersey
3536 Rhode Island
4065 Vermont
'''