【JP】Task Special & Task 11 綜合練習(xí)(2020.12)

題目:http://datawhale.club/t/topic/579/4

【任務(wù)一】企業(yè)收入的多樣性

【題目描述】一個(gè)企業(yè)的產(chǎn)業(yè)收入多樣性可以仿照信息熵的概念來定義收入熵指標(biāo):

\rm I=-\sum_{i}p(x_i)\log(p(x_i))

其中 p(xi) 是企業(yè)該年某產(chǎn)業(yè)收入額占該年所有產(chǎn)業(yè)總收入的比重。在company.csv中存有需要計(jì)算的企業(yè)和年份,在company_data.csv中存有企業(yè)、各類收入額和收入年份的信息?,F(xiàn)請(qǐng)利用后一張表中的數(shù)據(jù),在前一張表中增加一列表示該公司該年份的收入熵指標(biāo) I 。

df1 = pd.read_csv('../data/exercise/company.csv')
df2 = pd.read_csv('../data/exercise/company_data.csv')
def px(x):
    return -np.sum(x/x.sum()*np.log2(x/x.sum()))
df3 = df2.groupby(['日期','收入類型'])['收入額'].sum()
df3 = df3.reset_index()
df3 = df3.groupby('日期')['收入額'].apply(px).rename(index=lambda x: int(x[:4])).reset_index()
df1.merge(df3, on='日期',how='left').rename(columns={'收入額':'收入熵指標(biāo)'}).head()

結(jié)果:

  證券代碼    日期  收入熵指標(biāo)
0  #000007  2014  5.633910
1  #000403  2015  5.753387
2  #000408  2016  5.704088
3  #000408  2017       NaN
4  #000426  2015  5.753387

【任務(wù)二】組隊(duì)學(xué)習(xí)信息表的變換

【題目描述】請(qǐng)把組隊(duì)學(xué)習(xí)的隊(duì)伍信息表變換為如下形態(tài),其中“是否隊(duì)長”一列取1表示隊(duì)長,否則為0

    是否隊(duì)長    隊(duì)伍名稱        昵稱      編號(hào)
0   1       你說的都對(duì)隊(duì)  山楓葉紛飛   5
1   0       你說的都對(duì)隊(duì)  蔡           6
2   0       你說的都對(duì)隊(duì)  安慕希     7
3   0       你說的都對(duì)隊(duì)  信仰      8
4   0       你說的都對(duì)隊(duì)  biubiu????  20
... ...     ...         ...         ...
141 0       七星聯(lián)盟        Daisy       63
142 0       七星聯(lián)盟        One Better  131
143 0       七星聯(lián)盟        rain        112
144 1       應(yīng)如是     思無邪     54
145 0       應(yīng)如是     Justzer0    58
df = pd.read_excel('../data/exercise/組隊(duì)信息匯總表(Pandas).xlsx')
import re
def replaceName(x):
    obj = re.match("隊(duì)員_群昵稱\.(\d)",x)
    obj2 = re.match("隊(duì)員(\d+)編號(hào)",x.replace(" ",""))
    if x == '隊(duì)員_群昵稱':
        return '群昵稱_隊(duì)員1'
    if x == '隊(duì)長編號(hào)':
        return '編號(hào)_隊(duì)長'
    if x == '隊(duì)長_群昵稱':
        return '群昵稱_隊(duì)長'
    if obj:
        return '群昵稱_隊(duì)員' + str(int(obj.group(1))+1)
    else:
        if obj2:
            return '編號(hào)_隊(duì)員' + str(int(obj2.group(1)))
        return x
df = df.rename(columns=lambda x: replaceName(x))
df['tmpIndex'] = df.index

df = pd.wide_to_long(df, 
                stubnames=['群昵稱','編號(hào)'],
                i=['隊(duì)伍名稱'],
                j='是否隊(duì)長',
                sep="_",
                suffix='.+').swaplevel(0,1).rename(columns={'群昵稱':'昵稱'})
df = df.drop('所在群', axis = 1).reset_index(['隊(duì)伍名稱']).rename(index=lambda x: 1 if x == '隊(duì)長' else 0).reset_index()
df = df.drop(df[pd.isnull(df.編號(hào))].index).sort_values(['tmpIndex','是否隊(duì)長'],ascending=[True,False]).drop(columns='tmpIndex').reset_index().drop(columns='index')

【任務(wù)三】

#1. 有多少縣滿足總投票數(shù)超過縣人口數(shù)的一半
df1 = pd.read_csv('../data/exercise/county_population.csv')
df2 = pd.read_csv('../data/exercise/president_county_candidate.csv')
df1 = df1.set_index('US County')
new_idx = df1.index.map(lambda x:tuple(x.split(', ')))
df1.index = new_idx
df1 = df1.reset_index(level=0).rename_axis(index={'US County': 'State'}).reset_index().set_index('US County').rename(index=lambda x:x[1:]).reset_index()
df2 = df2.rename(columns={'state':'State', 'county':'US County'})
df3 = df2.merge(df1, on=['US County','State'],how='left')
df3['percent'] = df3['total_votes']/df3['Population']
df3.query('percent > 0.5')['US County'].count()
'''
74
'''
#2.把州(`state`)作為行索引,把投票候選人作為列名,列名的順序按照候選人在全美的總票數(shù)由高到低排序,行列對(duì)應(yīng)的元素為該候選人在該州獲得的總票數(shù)
series1 = df2.groupby(['candidate'])['total_votes'].sum().sort_values(ascending=False)
df4 = df2.pivot_table(index='State',columns=['candidate'],values='total_votes',aggfunc='sum')
df4.columns = series1.index
df4
'''
candidate   Joe Biden   Donald Trump    Jo Jorgensen    Howie Hawkins   Write-ins   Rocky De La Fuente  Gloria La Riva  Kanye West  Don Blankenship Brock Pierce    ... Tom Hoefling    Ricki Sue King  Princess Jacob-Fambro   Blake Huber Richard Duncan  Joseph Kishore  Jordan Scott    Gary Swing  Keith McCormic  Zachary Scalf
State                                                                                   
Alabama NaN 7312.0  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Alaska  NaN 34210.0 NaN NaN NaN NaN 825.0   NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 318.0   NaN NaN NaN
Arizona NaN 2032.0  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Arkansas    NaN NaN NaN NaN NaN 1713.0  2141.0  NaN NaN 1475.0  ... NaN 2812.0  NaN NaN NaN NaN 1321.0  NaN NaN NaN
California  NaN 80.0    NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 60155.0 NaN NaN NaN
5 rows × 38 columns
'''
#3.每一個(gè)州下設(shè)若干縣,定義拜登在該縣的得票率減去川普在該縣的得票率為該縣的`BT`指標(biāo),若某個(gè)州所有縣`BT`指標(biāo)的中位數(shù)大于0,則稱該州為`Biden State`,請(qǐng)找出所有的`Biden State`
df5=df2.merge(df2.groupby(['State','US County'])['total_votes'].sum().rename('county sum'), on=["State",'US County'],how='left')
df5['voteshare'] = df5['total_votes']/df5['county sum']
def test(x):
    biden = float(x.query('candidate=="Joe Biden"')['voteshare'])
    Trump = float(x.query('candidate=="Donald Trump"')['voteshare'])
    return str(biden-Trump) + "_" + x.name[1]
BTSeries = df5.groupby(['State','US County'])['candidate','voteshare'].apply(lambda x: test(x)).rename('result').reset_index().set_index('result')
new_idx =BTSeries.index.map(lambda x: tuple(x.split('_')))
BTSeries.index = new_idx
BTSeries = BTSeries.reset_index(level=0).rename_axis(index={'result': 'BT'}).reset_index()
BTSeries['result'] = BTSeries['result'].apply(lambda x:float(x))
BTSeries.groupby('State').filter(lambda x: x.median() > 0)['State'].drop_duplicates()
'''
197               California
319              Connecticut
488                 Delaware
491     District of Columbia
725                   Hawaii
1878           Massachusetts
2999              New Jersey
3536            Rhode Island
4065                 Vermont
'''
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

  • 題目:http://datawhale.club/t/topic/579/4[http://datawhale.c...
    葉小刀_b59f閱讀 237評(píng)論 0 0
  • 1、【任務(wù)一】企業(yè)收入的多樣性 【題目描述】一個(gè)企業(yè)的產(chǎn)業(yè)收入多樣性可以仿照信息熵的概念來定義收入熵指標(biāo): I=?...
    58506fd3fbed閱讀 218評(píng)論 0 0
  • 數(shù)據(jù)集下載地址:https://gitee.com/lbwnb_lbwnb/learning_pandas_dat...
    陳易男閱讀 362評(píng)論 0 0
  • 我的圖書館 留言交流 2000個(gè)最常用的英語單詞 2016-05-15酴羰駘璺 2000個(gè)最常用的英語單詞 (英語...
    Mr_Wang92閱讀 1,030評(píng)論 0 0
  • 漸變的面目拼圖要我怎么拼? 我是疲乏了還是投降了? 不是不允許自己墜落, 我沒有滴水不進(jìn)的保護(hù)膜。 就是害怕變得面...
    悶熱當(dāng)乘涼閱讀 4,502評(píng)論 0 13

友情鏈接更多精彩內(nèi)容