Reference material: https://github.com/karpathy/arxiv-sanity-preserver#arxiv-sanity-preserver
This is a search engine for research papers.
First, the project's own introduction:
arxiv sanity preserver
This project is a web interface that attempts to tame the overwhelming flood of papers on Arxiv. It allows researchers to keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, to add papers to a personal library, and to get personalized recommendations of (new or old) Arxiv papers. This code is currently running live at www.arxiv-sanity.com/, where it's serving 25,000+ Arxiv papers from Machine Learning (cs.[CV|AI|CL|LG|NE]/stat.ML) over the last ~3 years. With this code base you could replicate the website to any of your favorite subsets of Arxiv by simply changing the categories in fetch_papers.py.
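Roughly, the introduction says that this search engine is smart: to follow the latest progress in any field, you only need to change the categories in fetch_papers.py. It is a fine piece of machine learning work, and so on.
Signing up takes only a few seconds, about as fast as you can type. Once logged in, you see this interface: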
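To make "changing the categories" more concrete, here is a minimal sketch of how a category list could be turned into an arXiv API query URL. The function name `build_query_url` and its parameters are hypothetical and are not taken from fetch_papers.py; only the public arXiv API endpoint and its `search_query`, `sortBy`, `start` and `max_results` parameters are real. The sketch only builds the URL and does not contact arXiv.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(categories, start=0, max_results=100):
    """Build an arXiv API URL listing recently updated papers in the given categories."""
    # "cat:cs.CV OR cat:cs.LG" matches papers in any of the listed categories.
    search_query = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",
        "start": start,              # offset, useful for resuming an interrupted crawl
        "max_results": max_results,  # page size per request
    }
    return f"{ARXIV_API}?{urlencode(params)}"


# Hypothetical choice of categories, only for illustration.
url = build_query_url(["cs.CV", "cs.LG", "stat.ML"], start=0, max_results=5)
print(url)
```

Following a different subfield would then amount to passing a different category list.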

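Code layout
The code has two main parts:
Indexing code. It uses the Arxiv API to download the most recent papers from any categories you like, then downloads all the papers, extracts all the text, and creates tfidf vectors based on the content of each paper. This code is therefore concerned with the backend scraping and computation: building up a database of arxiv papers, calculating content vectors, creating thumbnails, computing SVMs for people, etc.
User interface. Then there is a web server (based on Flask/Tornado/sqlite) that allows searching through the database and filtering for similar papers, etc.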
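To illustrate the idea behind "tfidf vectors" and "sort by similarity", here is a small sketch using only the Python standard library. It is not the project's code: the project relies on scikit-learn's tfidf vectorizer, whereas this sketch uses single words split on whitespace, a smoothed idf, and made-up one-line "papers". All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed idf, so a term found in every document still gets idf 1.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b[t] for t, w in a.items() if t in b)


# Made-up one-line "papers", only for illustration.
docs = [
    "convolutional networks for image recognition",
    "deep convolutional networks for image segmentation",
    "bayesian inference for topic models",
]
vectors = tfidf_vectors(docs)
sims = [cosine(vectors[0], v) for v in vectors]
# Indices of all documents, most similar to the first one first.
ranking = sorted(range(len(docs)), key=lambda i: -sims[i])
print(ranking)
```

The second document shares most of its words with the first, so it ranks above the third. Ranking papers by similarity to a chosen paper follows the same principle on a much larger scale.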
Dependencies
Several: You will need numpy, feedparser (to process xml files), scikit learn (for tfidf vectorizer, training of SVM), flask (for serving the results), flask_limiter, and tornado (if you want to run the flask server in production). Also dateutil, and scipy. And sqlite3 for database (accounts, library support, etc.). Most of these are easy to get through pip, e.g.:
$ virtualenv env # optional: use virtualenv
$ source env/bin/activate # optional: use virtualenv
$ pip install -r requirements.txt
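You may also need ImageMagick and pdftotext, which on Ubuntu can be installed with sudo apt-get install imagemagick poppler-utils. That is a lot of dependencies.
The processing pipeline is as follows; it is best to run the steps in order: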
- Run fetch_papers.py to query arxiv API and create a file db.p that contains all information for each paper. This script is where you would modify the query, indicating which parts of arxiv you'd like to use. Note that if you're trying to pull too many papers arxiv will start to rate limit you. You may have to run the script multiple times, and I recommend using the arg --start-index to restart where you left off when you were last interrupted by arxiv.
- Run download_pdfs.py, which iterates over all papers in parsed pickle and downloads the papers into folder pdf
- Run parse_pdf_to_text.py to export all text from pdfs to files in txt
- Run thumb_pdf.py to export thumbnails of all pdfs to thumb
- Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves a tfidf.p, tfidf_meta.p and sim_dict.p pickle files.
- Run buildsvm.py to train SVMs for all users (if any), exports a pickle user_sim.p
- Run make_cache.py for various preprocessing so that server starts faster (and make sure to run sqlite3 as.db < schema.sql if this is the very first time ever you're starting arxiv-sanity, which initializes an empty database).
- Start the mongodb daemon in the background. Mongodb can be installed by following the instructions here - https://docs.mongodb.com/tutorials/install-mongodb-on-ubuntu/.
  - Start the mongodb server with - sudo service mongod start.
  - Verify if the server is running in the background : The last line of /var/log/mongodb/mongod.log file must be - [initandlisten] waiting for connections on port <port>
- Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers!
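The mongodb check in the list above (looking at the last line of the log) can be scripted. The following is only a sketch: the function name `mongod_is_ready` is hypothetical, and because log formats may differ between MongoDB versions, it merely looks for the phrase "waiting for connections", case-insensitively, in the last non-empty line. The demo runs against a temporary file with made-up log lines, not against real mongod output.

```python
import tempfile
from pathlib import Path

READY_PHRASE = "waiting for connections"


def mongod_is_ready(log_path):
    """Return True if the last non-empty log line says mongod is accepting connections."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    lines = [line for line in text.splitlines() if line.strip()]
    return bool(lines) and READY_PHRASE in lines[-1].lower()


# Demo against a throwaway file with made-up log lines (not real mongod output).
with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "mongod.log"
    demo_log.write_text(
        "[initandlisten] waiting for connections on port 27017\n", encoding="utf-8"
    )
    ready = mongod_is_ready(demo_log)
    demo_log.write_text("[initandlisten] shutting down\n", encoding="utf-8")
    not_ready = mongod_is_ready(demo_log)

print(ready, not_ready)
```

On a real machine you would call mongod_is_ready("/var/log/mongodb/mongod.log"), which may require permission to read that file.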
Optional: you can also run twitter_daemon.py in a screen session. It uses your Twitter API credentials (stored in twitter.txt) to periodically query Twitter for mentions of papers in the database, and writes the results to the pickle file twitter.p.
The author also has a simple shell script that runs these commands one by one. He runs it every day to fetch new papers, incorporate them into the database, and recompute all tfidf vectors/classifiers. More details on this process follow below.
protip: numpy/BLAS: The script analyze.py does a lot of heavy lifting with numpy. The author recommends carefully setting up numpy to use BLAS (e.g. OpenBLAS), otherwise the computations will take a long time. With 25,000 papers and 5000 users, the script runs in several hours on his machine with a BLAS-linked numpy.
Running online
If you'd like to run the flask server online (e.g. AWS) run it as python serve.py --prod.
You also want to create a secret_key.txt file and fill it with random text (see top of serve.py).
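One way to produce the "random text" for secret_key.txt is Python's standard secrets module. This is a sketch under assumptions: the helper name `write_secret_key` and the key length are hypothetical, and the format serve.py expects should be checked at the top of that file. The demo writes into a temporary directory rather than the project folder.

```python
import secrets
import tempfile
from pathlib import Path


def write_secret_key(path="secret_key.txt", n_bytes=32):
    """Write a random hex string to `path` and return it."""
    key = secrets.token_hex(n_bytes)  # 2 hex characters per random byte
    Path(path).write_text(key, encoding="utf-8")
    return key


# Demo in a throwaway directory so no real key file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "secret_key.txt"
    demo_key = write_secret_key(key_file)
    stored = key_file.read_text(encoding="utf-8")

print(len(demo_key))
```

The key file should stay out of version control, since anyone who has it could forge session data.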
Current workflow
The author says the process is not yet fully automatic. So how does he keep the site alive? He uses a script that performs the following update after the new arxiv papers come out (~midnight PST):
python fetch_papers.py
python download_pdfs.py
python parse_pdf_to_text.py
python thumb_pdf.py
python analyze.py
python buildsvm.py
python make_cache.py
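The author's own update script is a shell script and is not shown here. As a rough equivalent, the same sequence could be driven from Python as sketched below. The function name `run_pipeline` and its `dry_run` flag are hypothetical; only the script names come from the list above. With check=True the run stops at the first failing step, so later steps never work on incomplete data.

```python
import subprocess
import sys

# Script names as listed above, in the order they must run.
PIPELINE = [
    "fetch_papers.py",
    "download_pdfs.py",
    "parse_pdf_to_text.py",
    "thumb_pdf.py",
    "analyze.py",
    "buildsvm.py",
    "make_cache.py",
]


def run_pipeline(scripts=PIPELINE, dry_run=False):
    """Run each script in order and return the list of commands that were planned."""
    commands = []
    for script in scripts:
        cmd = [sys.executable, script]
        commands.append(cmd)
        if not dry_run:
            # Raises CalledProcessError if the step exits with a non-zero status.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: only show what would be executed, without running anything.
planned = run_pipeline(dry_run=True)
for cmd in planned:
    print(" ".join(cmd))
```

Calling run_pipeline() without dry_run from inside the repository directory would execute the real steps.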
The author runs the server in a screen session, so he runs screen -S serve to create it (or uses -r to reattach to it) and then runs:
python serve.py --prod --port 80
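The server will load the new files and begin hosting the site. Note that on some systems you cannot use port 80 without sudo. The two options are to use iptables to reroute ports, or to use setcap to elevate the permissions of the python interpreter that runs serve.py. In that case the author recommends careful permissions and perhaps a virtualenv, etc. (I do not fully understand this setup; presumably the concern is something like data leakage.)
Since I have not yet studied Python systematically, I do not dare to experiment with this for now.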
ImageMagick
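One of the dependencies mentioned above, ImageMagick, is an almost unfairly capable image tool (something like a photo editor such as Meitu combined with a scanner app such as CamScanner?): http://www.imagemagick.org/script/index.php
It is also free, open-source software. The current version at the time of writing is ImageMagick 7.0.9-2. It runs on Linux, Windows, Mac OS X, iOS, Android OS, and others.
See Examples of ImageMagick Usage for how to accomplish tasks with ImageMagick from the command line. See also Fred's ImageMagick Scripts, which include a large number of command-line scripts for geometric transforms, blurring, sharpening, edge detection, noise removal, and color manipulation. You can also use Magick.NET, which lets you use ImageMagick without installing the client.
Download and installation: http://www.imagemagick.org/script/download.php
The other dependency is pdftotext, a tool that reads PDFs and converts them to text.
It is adapted from the open-source XpdfReader code: http://www.xpdfreader.com/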
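For reference, pdftotext is called as pdftotext input.pdf output.txt. The sketch below shows how such a call could be wrapped in Python. The function names, the file name example.pdf, and the output naming scheme (PDF file name plus ".txt" inside a txt folder) are assumptions for illustration, not taken from parse_pdf_to_text.py. Only the command is built in the demo; nothing is executed.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_dir="txt"):
    """Build the pdftotext command that converts one PDF into a text file."""
    pdf_path = Path(pdf_path)
    # Assumed naming scheme: <pdf file name>.txt inside txt_dir.
    txt_path = Path(txt_dir) / (pdf_path.name + ".txt")
    return ["pdftotext", str(pdf_path), str(txt_path)]


def convert_pdf(pdf_path, txt_dir="txt"):
    """Run pdftotext on one file; requires poppler-utils to be installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils first")
    Path(txt_dir).mkdir(exist_ok=True)
    subprocess.run(pdftotext_command(pdf_path, txt_dir), check=True)


# Only build and show the command; example.pdf is a hypothetical file name.
cmd = pdftotext_command(Path("pdf") / "example.pdf")
print(cmd)
```

Checking for the executable with shutil.which first gives a clearer error message than a failed subprocess call when poppler-utils is missing.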