Batch downloading SCI papers with Python: still scraping Tieba images? Try batch downloading SCI papers by title or DOI instead, with Sci-Hub, the research download tool
Last night I was downloading SCI papers, 295 of them in total. Downloading them all by hand would have been exhausting.
So I looked for a way to batch-download SCI papers.
From Web of Science, export a txt file containing the title, DOI and other fields of the papers to download, then filter it down to just the DOI and title and save that as a new file.
Loop over the DOI/title pairs, downloading each paper and saving it with its title as the file name (a sketch of this loop follows the repository link below).
The program is based on the code at the following repository:
https://github.com/zaytoun/scihub.py
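A minimal sketch of that batch loop, assuming the filtered file is a tab-separated text file with one "DOI<TAB>Title" pair per line; the file name doi_title.txt is a placeholder, and the error handling is an assumption, since scihub.py's actual failure behaviour may differ:
import re
from scihub import SciHub

sh = SciHub()

with open('doi_title.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        doi, title = line.split('\t', 1)
        # replace characters that are not allowed in file names
        safe_title = re.sub(r'[\\/:*?"<>|]', '_', title)
        try:
            sh.download(doi, path=safe_title + '.pdf')
            print('downloaded:', title)
        except Exception as err:  # assumption: failures surface as exceptions
            print('failed:', doi, err)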
Setup
pip install -r requirements.txt
Usage
You can interact with scihub.py from the commandline:
usage: scihub.py [-h] [-d (DOI|PMID|URL)] [-f path] [-s query] [-sd query]
                 [-l N] [-o path] [-v]

SciHub - To remove all barriers in the way of science.

optional arguments:
  -h, --help            show this help message and exit
  -d (DOI|PMID|URL), --download (DOI|PMID|URL)
                        tries to find and download the paper
  -f path, --file path  pass file with list of identifiers and download each
  -s query, --search query
                        search Google Scholars
  -sd query, --search_download query
                        search Google Scholars and download if possible
  -l N, --limit N       the number of search results to limit to
  -o path, --output path
                        directory to store papers
  -v, --verbose         increase output verbosity
  -p, --proxy           set proxy
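For example, with the options above, a whole list of identifiers can be downloaded in one go (dois.txt and papers/ are placeholder names):
python scihub.py -f dois.txt -o papers/ -v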
You can also import scihub. The following examples demonstrate all the features.
fetch
from scihub import SciHub
sh = SciHub()
# fetch specific article (don't download to disk)
# this will return a dictionary in the form
# {'pdf': PDF_DATA,
#  'url': SOURCE_URL,
#  'name': UNIQUE_GENERATED_NAME
# }
result = sh.fetch('http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1648853')
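Because fetch returns the raw PDF bytes instead of writing a file, the result can be saved manually using the keys documented above (a sketch; it assumes the fetch succeeded and that result['name'] is a usable file name):
# write the fetched PDF bytes to disk under the generated name
with open(result['name'], 'wb') as out:
    out.write(result['pdf'])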
download
from scihub import SciHub
sh = SciHub()
# exactly the same thing as fetch except downloads the articles to disk
# if no path given, a unique name will be used as the file name
result = sh.download('http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1648853', path='paper.pdf')
search
from scihub import SciHub
sh = SciHub()
# retrieve 5 articles on Google Scholars related to 'bittorrent'
results = sh.search('bittorrent', 5)
# download the papers; will use sci-hub.io if it must
for paper in results['papers']:
    sh.download(paper['url'])
But Sci-Hub throws up CAPTCHAs, and how to get past them is the open question.
The CAPTCHA checks cause downloads to fail, so cracking CAPTCHA recognition will be the key!
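This does not solve CAPTCHA recognition itself, but for long batch runs a simple pause-and-retry wrapper can at least keep going past occasional failures (purely a sketch; whether scihub.py signals a CAPTCHA as an exception or as an error return should be checked against the library):
import time

def download_with_retry(sh, doi, path, retries=3, wait=60):
    # retry a few times with a pause between attempts; this only works
    # around transient failures, it does not recognize or solve CAPTCHAs
    for attempt in range(retries):
        try:
            sh.download(doi, path=path)
            return True
        except Exception:  # assumption: failures surface as exceptions
            time.sleep(wait)
    return False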
I'll give it another try when I have time!