
2019-3-13

Current progress: the processed records can now be stored in ES (Elasticsearch). The code is as follows:

def b_job():
    for page_i in range(1,2):
        def a_job():
            return get_res(page_i)
        res = retry_job(a_job,60)
    return res

outcome=b_job()

es = Elasticsearch(hosts='192.168.31.9:9200')
es_bhv = HjEsIO(es=es)

for data in key_value_list:
    es_bhv.create_and_upload(index_name='bond',data=data)

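Open problem: how to run the job on a schedule.

Current idea:
1 Get the current time and compare it with the scheduled time.
2 If the difference is more than a hours, sleep for a hours and then run.
3 If the difference is less than 1 hour but more than b minutes, sleep for b minutes and then run.
4 If the difference is less than 1 minute, sleep for 1 second and then run.
Blocker: the time is a str, which is awkward to compare directly, and it cannot be converted to int or float.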
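
One possible way around the str blocker, as a minimal sketch: parse the scheduled time with datetime.strptime, after which the two times can be subtracted directly. The function name seconds_until and the "%H:%M:%S" format are assumptions made for illustration, not part of the exercise.

```python
from datetime import datetime, timedelta


def seconds_until(target_str, now=None, fmt="%H:%M:%S"):
    """Return the seconds from `now` until the next occurrence of target_str."""
    now = now or datetime.now()
    # Parse the str into a time-of-day, then attach today's date.
    target_time = datetime.strptime(target_str, fmt).time()
    target = datetime.combine(now.date(), target_time)
    if target <= now:
        # The scheduled time has already passed today, so use tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


# Example with a fixed "now" so the result is reproducible.
wait = seconds_until("09:00:00", now=datetime(2019, 3, 13, 8, 30, 0))
print(wait)
```

With the wait expressed in seconds, a single time.sleep(wait) before calling the job could replace the staged sleeping in steps 2 to 4.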

2019-3-11

Q&A

Current progress: the fields have been extracted, and I am now working on storing them in ES.
Questions

1 When storing the fields in ES, how do I name each one by its field name? The fields look like this:

'entyDefinedCode': '305888', 'issueEndDate': '2019-03-28', 'bondDefinedCode': '6498280024', 'issueStartDate': '2019-03-22', 'entyFullName': '四川閬中農村商業銀行股份有限公司', 'debtRtng': '---', 'bondType': '大額存單', 'bondTypeCode': '100058', 'bondName': '閬中農商銀行2019年第5期個人大額存單3Y', 'bondCode': '1906280024'

2 I looked it up: is periodic re-crawling implemented with code like the following?

while True:
    print(time.strftime('%Y-%m-%d %X',time.localtime()))
    b_job()
    time.sleep(5)

2019-3-7

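Q&A

Sorry, I have been a bit busy these past two days. How do I get a result out of a function? For example, how do I get the final value of res from the function below?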

def b_job():
    for page_i in range(1,10):
        def a_job():
            return get_res(page_i)
        res = retry_job(a_job,60)
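
A minimal sketch of one way to get the values out: collect each page's result in a list and return the list. Here get_res is a hypothetical stand-in for the real page fetch, and the first_page/last_page parameters are assumptions added for illustration.

```python
import time


def get_res(page_i):
    # Hypothetical stand-in for the real request; returns one page's data.
    return {"page": page_i}


def retry_job(a_job, sleep_time):
    # Keep calling a_job until it succeeds, sleeping between attempts.
    while True:
        try:
            return a_job()
        except Exception:
            time.sleep(sleep_time)


def b_job(first_page=1, last_page=10):
    results = []
    for page_i in range(first_page, last_page):
        # Bind the current page number as a default argument so that each
        # a_job remembers its own page.
        def a_job(page_i=page_i):
            return get_res(page_i)
        results.append(retry_job(a_job, 60))
    return results  # without a return statement, the caller gets None


outcome = b_job(1, 4)
print(outcome)
```

Returning only res after the loop would hand back the last page alone, which is why the sketch accumulates into a list.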

2019-3-5

Q&A

How about this?

def a_job(page_i):
    print(page_i)
    data = {'pageNo': page_i, 'pageSize': '15'}
    page_source = requests.post(url1, data=data,headers=headers).content.decode(encoding='utf-8', errors='ignore')
    page_json = json.loads(page_source)
    data_list=page_json['data']['resultList']
    key_value_list.extend(data_list)
    frm = pd.DataFrame(key_value_list[0], index=[0])
    return frm


def retry_job(a_job, sleep_time):
    while 1:
        try:
            return(a_job())
        except:
            time.sleep(sleep_time)

def b_job():
    for page_i in range(1,10):
        res = retry_job(a_job,60)

Q&A

Please advise: where is the problem? And will this skip the pages that fail?

def a_job(page_i):
    print(page_i)
    data = {'pageNo': page_i, 'pageSize': '15'}
    page_source = requests.post(url1, data=data,headers=headers).content.decode(encoding='utf-8', errors='ignore')
    page_json = json.loads(page_source)
    data_list=page_json['data']['resultList']
    key_value_list.extend(data_list)
    return(key_value_list)


def retry_job(a_job, sleep_time):
        for page_i in range(1,100):
            while 1:
                try:
                    return a_job(page_i)
                except:
                    time.sleep(sleep_time)

2019-3-3

Q&A

I wrote the code following the hints, but it reports

image.png

and I cannot work out where the return error is. The code is as follows:

image.png

2019-2-20

Solution process

url1 = r'http://www.chinamoney.com.cn/ags/ms/cm-u-bond-md/BondMarketInfoList2'
key_value_list=[]
for page_i in range(1,11):
    try:
        print(page_i)
        data = {'pageNo': page_i, 'pageSize': '15'}
        page_source = requests.post(url1, data=data,headers=headers).content.decode(encoding='utf-8', errors='ignore')
        page_json = json.loads(page_source)
        data_list=page_json['data']['resultList']
        key_value_list.extend(data_list)
    except:
        time.sleep(60)
        page_i=-1
frm=pd.DataFrame(key_value_list[0],index=[0])
for key_value in key_value_list[1:]:
    frm = frm.append(pd.DataFrame(key_value,index=[0]),ignore_index=True)

2019-2-19

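Solution idea in words

I do not yet know how to implement retry-and-resume in code, so I am writing the idea down in words first.

Below the requests.post......... call, add:
if the POST request fails (①): sleep 60 seconds, then re-run the previous step (②)
else: carry on with the normal steps

If this idea is workable, what is the concrete code for ① and ②?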
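
The idea above can be sketched with try/except inside a loop. This is only an illustration under assumptions: post_with_retry, do_post, max_tries and the injectable sleep argument are hypothetical names, and the real requests.post call would go inside do_post.

```python
import time


def post_with_retry(do_post, wait_seconds=60, max_tries=5, sleep=time.sleep):
    """Call do_post(); on failure wait and try again, up to max_tries times."""
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return do_post()
        except Exception as exc:      # (1) the request failed
            last_error = exc
            if attempt < max_tries:
                sleep(wait_seconds)   # (2) wait, then the loop repeats the step
    raise last_error


# Demonstration with a fake request that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []
result = post_with_retry(flaky, wait_seconds=60, sleep=waits.append)
print(result, waits)
```

Capping the number of attempts is a design choice of this sketch; an unbounded `while True` loop would keep retrying forever if the site stays down.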

2019-2-15

Solution result and process

image.png

headers = {
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'no-cache',
'Content-Length': '111',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie': '_ulta_id.CM-Prod.e9dc=d8a40698afc88ac2; _ulta_ses.CM-Prod.e9dc=65eb9b4c1557b1e7; A9qF0lbkGa=MDAwM2IyNTRjZDAwMDAwMDAwNGYwGhsQa1gxNTUwMjE5NzQ3; JSESSIONID=JbfulNwClqSdk3XtHxbjKmNbnfILjAfTrSiS_RnYsdyMmUJcyYBB!2102842532',
'Host': 'www.chinamoney.com.cn',
'Origin': 'http://www.chinamoney.com.cn',
'Pragma': 'no-cache',
'Proxy-Connection': 'keep-alive',
'Referer': 'http://www.chinamoney.com.cn/chinese/qwjsn/?searchValue=%25E6%25B7%25B1%25E9%25AB%2598%25E9%2580%259F',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'X-Requested-With':'XMLHttpRequest'
}
url1 = r'http://www.chinamoney.com.cn/ags/ms/cm-u-bond-md/BondMarketInfoList2'
key_value_list=[]
for page_i in range(1,5):
    print(page_i)
    data = {'pageNo': page_i, 'pageSize': '15'}
    page_source = requests.post(url1, data=data,headers=headers).content.decode(encoding='utf-8', errors='ignore')
    page_json = json.loads(page_source)
    data_list=page_json['data']['resultList']
    key_value_list.extend(data_list)
frm=pd.DataFrame(key_value_list[0],index=[0])
for key_value in key_value_list[1:]:
    frm = frm.append(pd.DataFrame(key_value,index=[0]),ignore_index=True)

2019-2-14

Q&A

I added the data parameter in the requests.post style, but the result is still wrong, as shown:

image.png

2019-1-30

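Q&A

Getting the URL the same way as before, I found that however I change the page number or the query conditions, the URL is always

image.png

Requesting this URL directly only returns the information for a single issuer on the first page.

I also noticed a line of form data:

image.png

Can this information be used together with the URL?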

2019-1-24

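Notes

Approach: find the URL and iterate over the pages with format-----extract the information and parse it as JSON-----convert it to a DataFrame (frm) and tidy it-----convert it to dicts-----store the records in ES one by one
Key points:
1 Select the date range with begin_date and end_date;
2 Use frm.to_dict(orient='records') to convert frm into a list of dicts;
3 Loop with for data in frm_dict and call es.index to store each announcement. Duplicates are only overwritten if a stable id is passed; without an id, es.index creates a new document each time.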

begin_date='2019-01-25'
end_date='2019-01-25'
for page_i in range(1,100):
    print(page_i)
    page_source = requests.get(url1.format(page_i,begin_date,end_date), headers=headers).content.decode(encoding='utf-8', errors='ignore')
    res=re.findall('[A-Za-z0-9]\((.*?)\)',page_source)[0]
res_json = json.loads(res)
key_value_list=res_json['pageHelp']['data']
frm=pd.DataFrame(key_value_list[0],index=[0])
for key_value in key_value_list[1:]:
   frm = frm.append(pd.DataFrame(key_value,index=[0]),ignore_index=True)
frm=frm.loc[:,['security_Code','title','URL','SSEDate','bulletin_Type','bulletin_Year']]
frm_dict=frm.to_dict(orient='records')
es = Elasticsearch(hosts='192.168.1.9:9200')
es.indices.create(index='anouncement_ww', ignore=400)
for data in frm_dict:
    es.index(index='anouncement_ww', doc_type='sz', body=data)
result = es.search(index='anouncement_ww', doc_type='sz')
print(result)

Q&A

Each record stored in ES contains {code: , title: , url: , etc.}. How can I search a given index for all the records with a particular code?
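
A minimal sketch of the query body such a search might use, following the match style that appears elsewhere in these notes. The helper name build_code_query and the size default are assumptions; the field name security_Code is taken from the column list above, and whether match or term fits best depends on the index mapping.

```python
def build_code_query(code, field="security_Code", size=100):
    """Build a search body that matches documents whose `field` equals `code`."""
    return {
        "query": {"match": {field: code}},
        "size": size,  # maximum number of hits to return
    }


body = build_code_query("600000")
print(body)
# With a live client, this body would be passed along the lines of:
#   es.search(index='anouncement_ww', body=body)['hits']['hits']
```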

2019-1-22

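Q&A

1 Previously I stored dicts in ES. Can a DataFrame (frm) be stored in ES directly? I tried and got the following error:

image.png

2 What is the rough logic of the following code, which updates ES with the data in frm?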

    def update_es_by_frm(self, es, frm, index_name, doc_type=cf.DEFAULT_TYPE, id=None, id_col_name="file_id"):
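        """
        [external] Update ES with the data in frm
        :param frm: the data, stored as a DataFrame
        :param es: the ES client (connection to the ES host)
        :param doc_type: doc_type; it only needs to be kept consistent
        :param index_name: name of the ES index
        :param id: how the id is written; by default it is file_id from row_dict
        """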
        assert_isinstance([frm, doc_type, index_name], [pd.DataFrame, str, str])
        actions = []
        assert isinstance(frm, pd.DataFrame)
        row_dict_list = frm.to_dict(orient='records')
        if id == None:
            for row_dict in row_dict_list:
                action = {
                    "_index": index_name,
                    "_type": doc_type,
                    "_id": row_dict[id_col_name],
                    "_source": row_dict
                }
                actions.append(action)
        else:
            for row_dict in row_dict_list:
                action = {
                    "_index": index_name,
                    "_type": doc_type,
                    # "_id": row_dict['file_id'],
                    "_source": row_dict
                }
                actions.append(action)
        helpers.bulk(es, actions)
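
The rough logic of the method above, restated as a small sketch: every DataFrame row becomes one dict, each dict is wrapped in an "action" that says which index it goes to (and optionally which _id it gets), and the whole list is sent in one bulk request. The function build_actions below is a hypothetical simplification that works on plain dicts so it can run without pandas or a live ES server.

```python
def build_actions(rows, index_name, id_col_name=None):
    """Turn a list of row dicts into bulk actions for one index."""
    actions = []
    for row in rows:
        action = {"_index": index_name, "_source": row}
        if id_col_name is not None:
            # A fixed _id means re-uploading the same row overwrites it.
            action["_id"] = row[id_col_name]
        actions.append(action)
    return actions


rows = [{"file_id": "a1", "title": "x"}, {"file_id": "b2", "title": "y"}]
actions = build_actions(rows, "bond", id_col_name="file_id")
print(actions)
# In the original method the list is then sent with helpers.bulk(es, actions).
```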

2019-1-21

Notes

result = es.create(index=xxx, doc_type=xxxx, id=xxx)
result = es.delete(index=***, ignore=[400, 404])
result = es.index(index=xxx, doc_type=xxx, body=xxx)
result = es.update(index=xxx, doc_type=xxx, body=xxx, id=xxx)
result = es.search(index=xxx, doc_type=xxx)

Q&A

For the exercise, does storing the JSON data in ES directly with es.index satisfy the requirements?

for page_i in range(1,5):
    print(page_i)
    page_source = requests.get(url1.format(page_i), headers=headers).content.decode(encoding='utf-8', errors='ignore')
    res=re.findall('[A-Za-z0-9]\((.*?)\)',page_source)[0]
res_json = json.loads(res)
key_value_list=res_json['pageHelp']['data']
frm1=pd.DataFrame(key_value_list[0],index=[0])

for key_value in key_value_list[1:]:
   frm1 = frm1.append(pd.DataFrame(key_value,index=[0]),ignore_index=True)

es = Elasticsearch(hosts='192.168.1.9:9200')
# es_bhv = HjEsIO(es)
res_es_list = es.index(index=1111, doc_type='d_type',body=key_value,ignore=400)
result = es.search(index=1111, doc_type='d_type')
print(result)

2019-1-18

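Progress

The DataFrame is done, as shown:

image.png

The code is as shown:

image.png

Notes

An error I hit when using the pandas DataFrame constructor: ValueError: If using all scalar values, you must pass an index.
This happens because every value in the dict is a scalar, so pandas cannot infer the number of rows. Either wrap the values in an iterable (list, tuple, etc.) or pass the index argument to DataFrame, as shown:

image.png

Q&A

Why does i get a red underline when I write i==0?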

2019-1-16

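Blockers

1 Converting a str to JSON raises an error (see the picture below). I suspect some data in the str is not valid JSON. How can I find the error quickly?

image.png

2 When debugging code that contains a for loop, at first the debugger showed the result after iterating over the whole range; later runs stopped after the first value and showed the result. Why?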
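
For blocker 1, one quick approach is to catch json.JSONDecodeError, which carries the position of the problem. A minimal sketch; the helper name show_json_error and the sample string are made up for illustration.

```python
import json


def show_json_error(text, context=20):
    """Return (line, column, snippet) for the first JSON error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character offset where parsing failed.
        start = max(err.pos - context, 0)
        return err.lineno, err.colno, text[start:err.pos + context]
    return None


bad = '{"a": 1, "b": tru, "c": 3}'
info = show_json_error(bad)
print(info)
```

Printing the snippet around err.pos usually makes the offending character obvious without reading the whole string.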

2019-1-15

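Blockers

1 (Solved: obtained by clicking "Next page" and "Query") I got the raw data via Network--XHR and tidied it into the result shown below, but it does not seem to match the exercise. Did I fetch the wrong data (I did not find many fields)? Should I get the data via Elements instead?

image.png

2 Importing the local module still raises an error, as shown:

image.png

3 I read several online tutorials on using ES but do not fully understand them. Is there an easy, beginner-friendly tutorial?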

2019-1-14

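Progress

Half of today's exercise is done; the rest is not finished yet.

Q&A

How do I change the project path so that imports of local modules work?

image.png

In today's exercise the announcement titles can be obtained directly; I did not find a JSON string?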

2019-1-12

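Notes

How to find the URL when the XHR filter shows nothing useful: switch to All and check the requests one by one.
json.loads decodes a JSON-encoded string into a Python object.

Q&A

1 How do I install what is needed for from esfrm.es_class import HjEsIO and from yhj_tool.req_bhv import ReqBhvBhv?

2 When writing update_shanghai myself, I wrote the following code and found that it only fetches page 1 and cannot iterate over all pages.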

url2=r'http://query.sse.com.cn/security/stock/getStockListData2.do?&jsonCallBack=jsonpCallback99776&isPagination=true&stockCode=&csrcCode=&areaName=&stockType={0}&pageHelp.cacheSize=1&pageHelp.beginPage={1}&pageHelp.pageSize=25&pageHelp.pageNo=2&_=1547295073470'
headers={
    'Accept-Encoding': 'gzip,deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Content-Type': 'application/json',
    'Accept': '*/*',
    'Host': 'query.sse.com.cn',
    'Referer': 'http://www.sse.com.cn/assortment/stock/list/share/',
    'Cookie': 'yfx_c_g_u_id_10000042=_ck19011219292018089363771730333; VISITED_COMPANY_CODE=%5B%22600000%22%5D; VISITED_STOCK_CODE=%5B%22600000%22%5D; seecookie=%5B600000%5D%3A%u6D66%u53D1%u94F6%u884C; VISITED_MENU=%5B%229055%22%2C%228528%22%5D; yfx_f_l_v_t_10000042=f_t_1547292560801__r_t_1547292560801__v_t_1547295072921__r_c_0'
}
source_dict={}
for stock_type in ['1','2']:
    for page_i in range(1,1000):
        print(page_i)
        page_source=requests.get(url2.format(stock_type, page_i),headers=headers).content.decode(encoding='utf-8')

3 In update_to_es: res_es_list = es.search(index_name, doc_type='d_type',body=....., size=10)['hits']['hits']
At this point ES is empty, so why can it be queried?

def update_to_es(source_dict,index_name):
    es = Elasticsearch(hosts='192.168.1.9:9200')
    es_bhv = HjEsIO(es)
    for i,j in source_dict.items():
        print(i,j)
        announce_date = i[0]
        trade_code = i[1]
        res_es_list = es.search(index_name, doc_type='d_type',body={"query": {"bool":{"must":
                                                                                             [{"match": {"announce_date": announce_date}},
                                                                                              {"match": {"trade_code": trade_code}}
                                                                                              ]}}}, size=10)['hits']['hits']

        if res_es_list:
            row = res_es_list[0]['_source']
            new_list = row['short_company_name']
            if j not in new_list:
                new_list.append(j)
                print('New name found; appended to the list!')
                print('------------------------')
            row['short_company_name'] = new_list
            es_bhv.write_row_by_id(index_name,id=res_es_list[0]['_id'],data=row)
        else:
            row = {'announce_date': announce_date, 'trade_code':trade_code, 'short_company_name':[j], 'status':100}
            es_bhv.write_row_by_id(index_name, id=None, data=row)

2019-1-10

Notes

.*? is a lazy match: it repeats any number of times, but as few times as possible.
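
A small illustration of the difference between the greedy .* and the lazy .*? (the sample text is made up):

```python
import re

text = '<b>one</b><b>two</b>'

# Greedy: .* grabs as much as it can, so it runs to the last </b>.
greedy = re.findall(r'<b>(.*)</b>', text)

# Lazy: .*? stops at the first place where the rest of the pattern fits.
lazy = re.findall(r'<b>(.*?)</b>', text)

print(greedy)
print(lazy)
```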

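Q&A

1 How do I decide which headers are needed?

2 Why do the following headers differ slightly between the code and the web page?
Web page: Accept-Language: zh-CN,zh;q=0.9,en;q=0.8  Code: 'Accept-Language': 'zh-CN,zh;q=0.9'
Web page: Proxy-Connection: keep-alive  Code: 'Connection': 'keep-alive',

3 In res1=re.findall('"agdm":"(.*?)","agjc":"(.*?)",',res), what does the first comma in ,',res) mean? Why does removing it directly change the result?

4 Timeliness of the data: for example, on 2019-1-10 I need to query the data for 2019-1-1. Although the final output shows 2019-1-1, the data fetched by the code should still be that of 2019-1-10. How can this be solved?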
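
On question 3, a sketch of what is probably going on: that comma is inside the pattern string, so it is part of the regular expression. A lazy group at the very end of a pattern has nothing after it to reach for, so it matches as little as possible, which is the empty string. The sample text below is invented for illustration.

```python
import re

res = '{"agdm":"000001","agjc":"PingAn","other":"x"}'

# With the trailing '",' the second lazy group must extend to the closing quote.
with_tail = re.findall(r'"agdm":"(.*?)","agjc":"(.*?)",', res)

# Without anything after it, the second lazy group matches the empty string.
without_tail = re.findall(r'"agdm":"(.*?)","agjc":"(.*?)', res)

print(with_tail)
print(without_tail)
```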

2019-1-9

How to get the url2 address?

Open the page--F12--Network--XHR--Name--Headers

How to fetch several consecutive pages?

Example: url2 = r'http://www.szse.cn/api/report/ShowReport/data?SHOWTYPE=JSON&CATALOGID=1110&TABKEY=tab{0}&PAGENO={1}&random={2}'

page_source = requests.get(url2.format(stock_type, page_i, random.random()), headers= headers).content.decode(encoding='utf-8', errors='ignore')

Key point: in this example we need the stock codes of both A shares and B shares. Whether A or B shares are selected is decided by the number after tab in url2, and the page is decided by the number after PAGENO=. These two positions are written as {0} and {1}, and format substitutes variables into them so that we can iterate.
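
The substitution can be tried without sending any request. In this sketch build_urls is a hypothetical helper; the random value in the third slot is commonly used to make every URL unique so that a cached response is not returned, though that purpose is an assumption here rather than something the site documents.

```python
import random

url2 = r'http://www.szse.cn/api/report/ShowReport/data?SHOWTYPE=JSON&CATALOGID=1110&TABKEY=tab{0}&PAGENO={1}&random={2}'


def build_urls(stock_types, pages):
    """Fill {0} with the tab number, {1} with the page, {2} with a random float."""
    return [url2.format(stock_type, page_i, random.random())
            for stock_type in stock_types
            for page_i in pages]


urls = build_urls(['1', '2'], range(1, 4))
for u in urls:
    print(u)
```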

Q&A

In the example above, what is the purpose of replacing {2} after random= in url2 with random.random()?

2019-1-8

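Blocker

I do not know how to fetch several consecutive pages.

Questions

How do I get url2?
How do I get the headers, and what are they for?
What do these 3 lines of code do?
sub_dict = {' +': '', 'Ｂ': 'B', 'Ａ': 'A'}
for key, value in sub_dict.items():
    company_name = re.sub(key, value, company_name)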
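
What the three lines do can be seen by running them on a sample: each key of sub_dict is a regular expression and each value is its replacement, so the loop removes runs of spaces and turns the full-width letters Ｂ and Ａ into ordinary B and A. The wrapper function normalize_name and the sample inputs are added here for illustration.

```python
import re

sub_dict = {' +': '', 'Ｂ': 'B', 'Ａ': 'A'}


def normalize_name(company_name):
    # Apply every pattern -> replacement pair in turn.
    for key, value in sub_dict.items():
        company_name = re.sub(key, value, company_name)
    return company_name


print(normalize_name('ABC  Ｂ'))
print(normalize_name('Ａ 1'))
```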

The complete code is as follows:

url2 = r'http://www.szse.cn/api/report/ShowReport/data?SHOWTYPE=JSON&CATALOGID=1110&TABKEY=tab{0}&PAGENO={1}&random={2}'
headers = {
        'Accept-Encoding': 'gzip,deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
       'Content-Type': 'application/json',
       'Accept': 'application/json, text/javascript, */*; q=0.01',
       'Host': 'www.szse.cn',
       'Referer': 'http://www.szse.cn/market/stock/list/index.html',
       'X-Request-Type': 'ajax',
       'X-Requested-With': 'XMLHttpRequest',
    }
source_dict = {}
for stock_type in ['2','1']:
        for page_i in range(1,1000):
            print(page_i)
            page_source = requests.get(url2.format(stock_type, page_i, random.random()), headers= headers).content.decode(encoding='utf-8', errors='ignore')
            res = re.findall('"data":\[(.*?)\]', page_source)[int(stock_type) - 1]
            day = re.findall('"subname":"(.*?)",', page_source)[int(stock_type) - 1]
            day = re.sub(' +','',day)
            if day == '':
                break
            if stock_type == '1':
                res1 = re.findall('"agdm":"(.*?)","agjc":"(.*?)",',res)
            else:
                res1 = re.findall('"bgdm":"(.*?)","bgjc":"(.*?)",', res)
            for i in res1:
                # today = datetime.datetime.today()
                today = datetime.datetime.today() - datetime.timedelta(days=1)  # adjust the date offset as needed
                today_str = today.strftime('%Y-%m-%d')
                if today_str != day:
                    day = today_str
                company_name = i[1]
                sub_dict = {' +': '', 'Ｂ': 'B', 'Ａ': 'A'}
                for key, value in sub_dict.items():
                    company_name = re.sub(key, value, company_name)
                source_dict[(day,i[0])] = company_name

2019-1-4

How to aggregate data with groupby plus agg?

grouped = a_frm.groupby(by='QuestionId') # groups the rows by QuestionId (grouping, not sorting); returns a DataFrameGroupBy object
last_status = grouped.agg(lambda x: x.iloc[-1]) # within each group, takes the last row; returns a DataFrame

Compound, chained code is easier to understand when it is broken down step by step.

Q&A

In the answer to exercise 1805104, why is names = locals() needed? Would names = {} be more appropriate?

names = locals()
company_names = ['寧滬高速','深高速','四川成渝']
years = [2017,2016,2015]
for i in company_names:
    for j in years:
        names['%s_%s'%(i,j)] = str(i)+str(j)
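
For comparison, a sketch of the same loop with an ordinary dict. locals() is only needed if the goal is to create real variables named after each key; if the values are looked up by key anyway, a plain dict keeps them out of the namespace. Which one the exercise intended is not stated, so this is offered as an alternative rather than the answer.

```python
company_names = ['寧滬高速', '深高速', '四川成渝']
years = [2017, 2016, 2015]

names = {}  # an ordinary dict instead of the namespace returned by locals()
for company in company_names:
    for year in years:
        names['%s_%s' % (company, year)] = str(company) + str(year)

print(len(names))
print(names['深高速_2016'])
```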

2019-1-3

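How to insert a column into a DataFrame?

frm.insert(position, 'new column name', values to insert)

How to remove a column from a DataFrame?

frm.pop('column label')

Q&A (answered)

bb2[2016] = round(bb2.pop(1)/366,2)
This line creates a new column 2016 whose values are column 1 divided by 366, so that column 2016 replaces column 1. How exactly does it work, and what does each argument and function do?
pop removes the column from the DataFrame and returns it; round keeps the given number of decimal places.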

2019-1-2

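How to find which column contains a given string?

for i in frm.index:
    if 'target string' in list(frm.iloc[i]):
        column_num = list(frm.iloc[i]).index('target string')
        break

What does frm[[0,3]] mean?

The whole columns labelled 0 and 3 of frm.

Q&A (answered):

In re.findall('(.+?)(高速|公路)',highway)[0][0], what do the () and the [0] do?
The parentheses are capture groups that limit what is returned. With two groups, findall returns a list of tuples, so the first [0] picks the first match and the second [0] picks its first group.

In percentage = re.findall(r'\（(.+)\）', company)[0], what does the innermost () do?
Same as above: it is the capture group. With a single group, findall returns a list of strings, so [0] is the text captured in the first match.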
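
The two findall calls above can be checked on small samples (the sample strings are invented for illustration):

```python
import re

highway = '寧滬高速'
pairs = re.findall('(.+?)(高速|公路)', highway)
# Two capture groups -> a list of tuples, one tuple per match.
print(pairs)
print(pairs[0][0])  # first match, first group

company = '寧滬高速（50%）'
percentage = re.findall(r'\（(.+)\）', company)
# One capture group -> a list of strings.
print(percentage[0])
```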

12-30

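How to apply a function to a DataFrame?

frm.apply(function)
For example: frm6['16年收入(百萬元)'].apply(toyuan) # toyuan is a user-defined function.

Is there another way to write the following code: frm6=frm5.loc[frm5['高速公路名稱'].str.contains('高速|公路')]

Yes: frm6=frm5[frm5['高速公路名稱'].str.contains('高速|公路')] ## selects the rows whose value contains one of the substrings.

How to write a function that converts its argument to a number multiplied by 1000000?

def tofloat(x): ## convert the argument to float when possible
    try:
        x = float(x)
    except (TypeError, ValueError):  # float('abc') raises ValueError, float(None) raises TypeError
        pass
    return x
def toyuan(x): ## multiply the argument by 1000000
    try:
        the_float = float(x)
        return the_float * 1000000
    except (TypeError, ValueError):
        return x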

12-29

How to set up Edit Configurations?

Having the Edit Configurations dialog pop up on every run is a nuisance. The setup is: File---Settings---Project Interpreter---choose a Python version---Apply---OK. Done.

How to remove the green annotation text shown next to the code?

Click the small bug icon shown in the picture.

image.png

How to write a DataFrame to Excel?

frm.to_excel('xxx.xlsx',sheet_name='yyy')
frm = pd.read_excel('xxx.xlsx',sheet_name='yyy')
Blocker:
frm.to_excel raised ImportError: No module named 'openpyxl'; solved by installing openpyxl.

Q&A:

frm6=frm4.loc[frm4['高速公路名稱'],str.contains('高速|公路')] raises AttributeError: type object 'str' has no attribute 'contains'
Solved: change the comma before str to a period.
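12-28 (Q&A updated)

Two DataFrames share a column; how to merge them?

Use merge: newfrm=pd.merge(frm1,frm2,on='name of the shared column');

When merging on two shared columns: newfrm=pd.merge(frm1,frm2,on=['column1','column2'],how='inner' or 'outer'). With inner, only the rows whose keys match in both frames are kept; with outer, the non-matching rows are kept as well, and missing values default to NaN.

Q&A

What do how='left' and how='right' mean in merge?
Tested: left keeps every row of the first frame, and where the keys have no match the columns from the second frame are NaN; right is the reverse.

On the remote machine DataFrame could not be used from pandas, and reinstalling pandas did not help.
Answered: import searches the current directory first; because the current directory contained a file named pandas, that file was imported instead and the import failed.

On the local machine PyCharm will not open after installation and shows the message below; I could not find a working 32-bit JDK download online.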

image.png

12-27

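How to concatenate several DataFrames?

Use concat to stack vertically: frm3=pd.concat([frm1,frm2],axis=0,ignore_index=True) # note the name is concat, easily misremembered as contact; ignore_index=True renumbers the row labels of the combined frm in order.
Use append, which only stacks vertically, not horizontally: frm3=frm1.append(frm2,ignore_index=True) (DataFrame.append was removed in pandas 2.0; use pd.concat there.)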

12-23

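How to create a random array with a given number of rows and columns?

np.random.randn(rows, columns)

How to create consecutive dates starting from a given date?

pd.date_range('date', periods=count)

How to create a DataFrame?

pd.DataFrame(existing array, index=list of row labels, columns=list of column labels)

How to get the rows, columns and data of a DataFrame?

df = some DataFrame
Row labels: df.index
Column labels: df.columns
Data: df.values

How to get the data at a given position of df?

df.loc['row label']
df.loc[:,['column label']]
df.loc['row label','column label']
df.iloc[[row number (starting from 0)],:] or df.iloc[row range,:]
df.iloc[:,[column number]] or df.iloc[:,column range]
df.iloc[row number,column number]

How to select the rows of df whose given column holds a given value?

df[df['column'].isin(['value'])]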

12-8

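How to restore the objects stored in a .list file?

Use pickle
import pickle
Open the .list file: f=open(file path including the file name, mode 'rb' for reading binary)
Restore the object stored in the file: pickle.load(f)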
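
A self-contained round trip, as a minimal sketch: the list of codes and the temporary file location are made up so that the example runs anywhere.

```python
import os
import pickle
import tempfile

codes = ['600000', '000651']

# Write the list to a binary file, as the .list file would have been created.
path = os.path.join(tempfile.mkdtemp(), 'trade_code.list')
with open(path, 'wb') as f:
    pickle.dump(codes, f)

# Read it back: 'rb' opens the file for reading in binary mode.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored)
```

Using `with` closes the file automatically, which the bare f=open(...) form does not.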

How to export a DataFrame to Excel?

**.to_excel(file path including the file name, sheet name)

How to save tushare API data to a local Excel file?

import pickle
import tushare as ts
f=open(r'D:\ww\trade_code.list','rb')
ts.set_token('12345')  # put your token here
pro=ts.pro_api()
for i in pickle.load(f):
    if str(i).startswith('6'):
        ii=i+'.SH'
        df=pro.balancesheet(ts_code=ii,period='20180630',start_date='20180101',end_date='20180630')
        address=r'd:\ww\bs'+r'\\'+i+r'.xlsx'
        df.to_excel(address,sheet_name='bs')
    else:
        ii = i + '.SZ'
        df = pro.balancesheet(ts_code=ii, period='20180630', start_date='20180101', end_date='20180630')
        address = r'd:\ww\bs' + r'\\' + i + r'.xlsx'
        df.to_excel(address, sheet_name='bs')

11-29

How to install a new Python module

1: File---Settings---Project---Project Interpreter---'+'----search-----Install
2: Windows key+R-----cmd-----pip install tushare

From memory: fetching a balance sheet through the tushare API

import tushare as ts
ts.set_token('token')
pro=ts.pro_api()
df=pro.balancesheet(ts_code='stock code',period='report period',start_date='report start date',end_date='report end date')

Result

  ts_code  ann_date f_ann_date  end_date report_type comp_type  \
0  000651.SZ  20180831   20180831  20180630           1         1   

    total_share      cap_rese  undistr_porfit  surplus_rese    ...      \
0  6.015731e+09  1.038806e+08    6.854628e+10  3.499672e+09    ...       

  lt_payroll_payable  oth_comp_income  oth_eqt_tools  oth_eqt_tools_p_shr  \
0        112708961.0    -1.693679e+08           None                 None   

   lending_funds  acc_receivable  st_fin_payable  payables  hfs_assets  \
0           None            None            None      None        None   

   hfs_sales  
0       None  

[1 rows x 137 columns]

11-25

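How to use urlretrieve?

urlretrieve(download URL, save path, a third argument that seems to relate to download progress; not used for now)
Note: the save path must be the full path including the file name, and the extension must match the file format.

Nothing else for it: copy out the file-naming function once more so that it sticks

import os
downloadDirectory =r'd:\temp'
def getDownloadPath(baseUrl,fileUrl,downloadDirectory):
    path = fileUrl.replace(baseUrl,'')  # strip baseUrl from fileUrl
    path = downloadDirectory+r'\\'+path  # prepend the download directory
    directory=os.path.dirname(path)  # the folder that will hold the file
    if not os.path.exists(directory):
        os.makedirs(directory)  # create the folder if it does not exist
    return path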

11-22

import requests
params={'form_email':'ss','form_password':'11'}
r=requests.post('https://www.douban.com/accounts/login',data=params)
print(r.text)

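Summary of the differences between find and findAll in bs4

find returns the first match; findAll returns all matches;
find returns a Tag while findAll returns a list, so after find(*****) you can directly append '.child_tag', .get_text(), or ['attribute'].

How to strip the start and end of a string with re.sub?

re.sub('^ +','',s) removes leading spaces and re.sub(' +$','',s) removes trailing spaces; the pattern can also be built from a variable.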
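
A minimal sketch of building those two patterns from a variable. The helper name strip_chars is hypothetical; re.escape is used so that characters with a special meaning in a regular expression are treated literally.

```python
import re


def strip_chars(text, chars=' '):
    """Remove any run of `chars` from the start and the end of text."""
    char_class = '[' + re.escape(chars) + ']+'
    text = re.sub('^' + char_class, '', text)   # ^ anchors at the start
    return re.sub(char_class + '$', '', text)   # $ anchors at the end


print(strip_chars('  hello world  '))
print(strip_chars('--a-b--', '-'))
```

For plain whitespace or a fixed set of characters, the built-in str.strip does the same job without a regular expression.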

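What is the difference between .url, .text and .status_code on a requests response?

.url is the final URL of the response; .text is the response body decoded as text; .status_code is the HTTP status code.

How to use selectors in selenium?

Single element: driver.find_element_by_(' ')
Multiple elements: driver.find_elements_by_(' ')

How to wait for an element?

WebDriverWait together with expected_conditions waits until a given marker appears before the data is extracted. (This is an explicit wait; an implicit wait is set with driver.implicitly_wait.)

How to find the real URL for a requests POST?

Submit the form, then find the request in the page's Inspect-Network panel.

How to join the strings 'a', 'b' and a backslash?

'a'+'\\'+'b', because \ combines with the following character to form an escape sequence, so a literal backslash has to be written as '\\'.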
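
A quick check of the escaping rule; ntpath is the standard-library module behind os.path on Windows and is used here only so the example behaves the same on every platform.

```python
import ntpath

joined = 'a' + '\\' + 'b'      # '\\' is ONE backslash character
print(joined)                  # prints a\b
print(len(joined))             # three characters: a, \, b

# For file paths, letting the path module add the separator avoids escaping.
same = ntpath.join('a', 'b')
print(same == joined)
```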

11-21

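Why does appending .get_text() directly after findAll(*****) raise an error?

.findAll returns a list, so .get_text() cannot be called on it directly.

11-18

What are attrs, global, random and datetime for?

.attrs['attribute name'] gets the value of that attribute; it is equivalent to tag['attribute name'].

Crawling the data of a whole site

Comments wanted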

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getLinks(pageUrl):
    global pages  # use the module-level set of pages already seen
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())  # page title
        print(bsObj.find(id="mw-content-text").findAll("p")[0])  # first paragraph
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])  # edit link
    except AttributeError:
        print("This page is missing some attributes, but no need to worry!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # we have found a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)  # recurse into the new page
getLinks("")

11-16

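What is a regular expression?

A way of finding strings that satisfy some condition.

Which symbols are commonly used? (written from memory)

. matches any character except a newline
* repeats 0 or more times
+ repeats 1 or more times
\b matches the start or end of a word
\d matches a digit, like [0-9]
\w matches letters, digits, underscore and Chinese characters
\s matches whitespace
^ start of the string
$ end of the string
\ cancels the special meaning of a character
? repeats 0 or 1 time
{n} repeats n times
{n,} repeats n or more times
{n,m} repeats n to m times
[ ] matches any one character inside the brackets
| alternation; branches are tried from left to right, and once one branch matches the others are not tried
() grouping
[^ ] matches any character except those listed
Zero-width assertions: (?=...) matches the part in front of text that ends with ...
(?<=...) matches the part after text that starts with ...
(?!...) matches the part in front of text that does not end with ...
(?<!...) matches the part after text that does not start with ...
*? lazy match: repeats 0 or more times, but as few times as possible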
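
The four zero-width assertions are the hardest entries in the list to remember, so here is a small sketch on an invented sample string:

```python
import re

text = 'price: 100 USD, cost: 80 EUR'

# (?=...)  lookahead: digits that are followed by ' USD'
ahead = re.findall(r'\d+(?= USD)', text)

# (?<=...) lookbehind: digits that are preceded by 'cost: '
behind = re.findall(r'(?<=cost: )\d+', text)

# (?!...)  negative lookahead: whole numbers NOT followed by ' USD'
not_ahead = re.findall(r'\b\d+\b(?! USD)', text)

# (?<!...) negative lookbehind: whole numbers NOT preceded by 'price: '
not_behind = re.findall(r'(?<!price: )\b\d+', text)

print(ahead, behind, not_ahead, not_behind)
```

The assertions only check the surrounding text; what they look at is not included in the returned match.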

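How to find tags that have a given attribute?

findAll('tag',{'attribute':'value'})

Why use get_text()?

It strips all the tags from the HTML document, leaving only the text.

Family relations?

Child (direct next level): .children  Descendant (all lower levels): .descendants
Sibling (same level): .next_siblings  Parent (direct upper level): .parent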

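Before 11-15

How to scrape information from the web?

Rough approach
Get the URL of the page holding the information--->>open it with html=urlopen(wangzhi)--->>convert the page source into a bs object with bsobj=BeautifulSoup(html,'lxml') (lxml is the parser, currently the recommended one)--->>use "Inspect" on the page source to locate the information you need---->>find, filter and collect it (regular expressions and the bs find family, of which findAll is used most)

How to output the contents of a Word document?

Read the Word file as binary data-->>unzip the Word file to get the XML file that stores the content-->>convert the XML file into a bs object wordObj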

wordFile = urlopen("http://pythonscraping.com/pages/AWordDocument.docx").read()
wordFile = BytesIO(wordFile)
document = ZipFile(wordFile)
xml_content = document.read('word/document.xml')

wordObj = BeautifulSoup(xml_content.decode('utf-8'), "lxml-xml")

1803.1 (written from memory)

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)  # open the URL; returns a response object
    except HTTPError as e:  # runs when the server returns an HTTP error
        print(e)
        return None
    try:
        bsObj = BeautifulSoup(html, "lxml")  # parse into a bs object
        title = bsObj.body.h1  # take the h1 as the title
    except AttributeError as e:  # runs when the page has no body
        return None
    return title

title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")
if title == None:
    print("Title could not be found")
else:
    print(title)

1803.2 (written from memory)

The p tags found are attached:

image.png

import re  # regular expression module
import string
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://book.douban.com/subject/24753651/discussion/58975313/'
html=urlopen(url)  # open the URL; returns a response object
bsobj=BeautifulSoup(html.read(),'lxml')  # parse into a bs object (HTML page, so an HTML parser)
comment=bsobj.findAll('p')  # find the p tags
for emails in comment:
    emails=re.findall(r'([A-Za-z0-9\._]+@(163|qq)\.com)',emails.get_text())  # find e-mail addresses; returns a list of tuples
    if emails !=[]:
        for email in emails:
            em =email[0]  # each match is a tuple; the first item is the whole address
            em=em.strip(string.punctuation)  # strip punctuation from both ends
            print(em)

1803.03 (annotated)

Screenshot of getting the 'src' attribute

image.png

import os  # os module: files and directories
import re  # regular expression module
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = r'd:\temp'

def getDownloadPath(baseUrl, fileUrl, downloadDirectory):
    path = fileUrl.replace(baseUrl, '')  # remove baseUrl from fileUrl
    path = downloadDirectory+r'\\'+path
    directory = os.path.dirname(path)  # the folder that contains path
    if not os.path.exists(directory):  # if the folder does not exist
        os.makedirs(directory)  # create it
    return path

html = urlopen("https://book.douban.com/subject/24753651/discussion/58975313/")
bsObj = BeautifulSoup(html, "html.parser")
downloadList = bsObj.findAll('img')  # get the img tags

for download in downloadList:
    fileUrl = download['src']
    baseUrl = os.path.dirname(fileUrl)  # the folder part of fileUrl
    if fileUrl is not None:
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))  # download fileUrl and save it as path

1803.04 (annotated; can write from memory)

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://exif.tuchong.com/view/5775300/'
html = urlopen(url)
bsObj = BeautifulSoup(html.read(), 'lxml')
tables = bsObj.findAll('table')  # find the table tags

csvFile = open('exif.csv', 'wt', newline='',encoding='utf-8')  # open exif.csv for writing
writer = csv.writer(csvFile)  # writer that writes rows to csvFile
try:
    for table in tables:
        headline = table.parent.find('h2').get_text()  # find the h2 heading
        writer.writerow('')
        writer.writerow([headline])  # write the heading as a row
        rows = table.findAll('tr')
        for row in rows:
            csvRow = []
            for cell1 in row.findAll('td',{'class':'exif-desc'}):
                csvRow.append(cell1.get_text())  # add the description cell to csvRow
            for cell2 in row.findAll('td',{'class':'exif-content'}):
                csvRow.append(cell2.get_text())  # add the content cell to csvRow
            writer.writerow(csvRow)  # write csvRow to the file
finally:
    csvFile.close()

1805 (annotated)

import csv  # module for writing csv files
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://baike.baidu.com/fenlei/%E6%A4%8D%E7%89%A9'
html = urlopen(url)
bsObj = BeautifulSoup(html.read(), 'lxml')
# tables = bsObj.findAll('table')

csvFile = open('baike_zhiwu.csv', 'wt', newline='',encoding='utf-8')
writer = csv.writer(csvFile)  # writer that writes rows to csvFile
try:
    headlines = ['名稱','簡介','開放分類']
    writer.writerow(headlines)  # write the header row
    rows = bsObj.findAll('div', {'class': 'list'})  # get the div tags with class list
    for row in rows :
        csvRow = []
        csvRow.append(row.find('a',{'class':'title nslog:7450'}).get_text())  # add the name
        csvRow.append(row.find('p').get_text())  # add the summary
        # print(csvRow)
        for cell1 in row.findAll('div',{'class':'text'}):
            csvRow.append(cell1.get_text())  # add the categories
        writer.writerow(csvRow)  # write csvRow to the file
finally:
    csvFile.close()  # close the csv file

1803.6 (annotated)

from zipfile import ZipFile  # unzip module
from urllib.request import urlopen
from io import BytesIO  # in-memory binary stream
from bs4 import BeautifulSoup

wordFile = urlopen("http://pythonscraping.com/pages/AWordDocument.docx").read()
wordFile = BytesIO(wordFile)  # wrap the bytes as a file-like object
document = ZipFile(wordFile)  # open as a zip archive
xml_content = document.read('word/document.xml')  # read document.xml
wordObj = BeautifulSoup(xml_content.decode('utf-8'), "lxml-xml")  # parse into a bs object
textStrings = wordObj.findAll("w:t")  # get the 'w:t' tags
for textElem in textStrings:
    closeTag = ""
    try:
        style = textElem.parent.previousSibling.find("w:pStyle")  # look for the 'w:pStyle' tag next to textElem's parent
        if style is not None and style["w:val"] == "Title":
            print("<h1>")
            closeTag = "</h1>"
    except AttributeError:  # no style tag: print without h1 tags
        pass
        pass
    print(textElem.text)
    print(closeTag)

1803.8 (annotated)

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string


def cleaninput(input):
    input = re.sub(r'\n+', ' ', input)  # replace newlines (1 or more) with a space
    input = re.sub(r'\[[0-9]*\]', '', input)  # remove citation marks such as [12]
    input = re.sub(' +', ' ', input)  # collapse runs of spaces (1 or more) into one
    input = bytes(input, 'utf-8')  # encode as bytes
    input = input.decode('ascii', 'ignore')  # keep only ASCII characters
    cleaninput = []
    input = input.split(' ')  # split on spaces
    for item in input:
        item = item.strip(string.punctuation)  # strip punctuation from both ends
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):  # keep words, plus the one-letter words a and i
            cleaninput.append(item)
    return cleaninput


def ngrams(words, n):
    # words is the list returned by cleaninput
    output = []
    for i in range(len(words) - n + 1):
        output.append(words[i: i + n])
    return output


html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html.read(), 'html.parser')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
content = cleaninput(content)  # clean first, then build the 2-grams
bigrams = ngrams(content, 2)  # a new name, so the ngrams function is not overwritten
print(bigrams)

