Scraping file paths from a GitHub repo yields a 400 response, but viewing the URL in a browser works fine
I'm trying to scrape all file paths from this link: https://github.com/themichaelusa/Trinitum/find/master, without using the GitHub API at all.
The page at that link contains a data-url attribute in its HTML (on the table with id='tree-finder-results', class='tree-browser css-truncate'), which is used to build a URL like this: https://github.com/themichaelusa/Trinitum/tree-list/45a2ca7145369bee6c31a54c30fca8d3f0aae6cd
When you view that URL in a browser such as Chrome, it displays this dictionary:
{"paths":["Examples/advanced_example.py","Examples/basic_example.py","LICENSE","README.md","Trinitum/AsyncManager.py","Trinitum/Constants.py","Trinitum/DatabaseManager.py","Trinitum/Diagnostics.py","Trinitum/Order.py","Trinitum/Pipeline.py","Trinitum/Position.py","Trinitum/RSU.py","Trinitum/Strategy.py","Trinitum/TradingInstance.py","Trinitum/Trinitum.py","Trinitum/Utilities.py","Trinitum/__init__.py","setup.cfg","setup.py"]}
However, a GET request made from code yields <Response [400]>.
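Once the tree-list response does come back, pulling the paths out of it is straightforward. The following is a minimal sketch, assuming the body is JSON with a single "paths" key as seen in the browser; the helper names extract_paths and python_files are hypothetical, and the payload is an abbreviated copy of the one above.

```python
import json

# Abbreviated copy of the payload shown in the browser (assumption: the
# endpoint always returns a JSON object with a "paths" list).
payload = '{"paths":["Examples/advanced_example.py","LICENSE","Trinitum/Order.py","setup.py"]}'


def extract_paths(body):
    """Return the list stored under "paths", or an empty list if the key is absent."""
    return json.loads(body).get("paths", [])


def python_files(paths):
    """Keep only the entries that end in .py."""
    return [p for p in paths if p.endswith(".py")]


paths = extract_paths(payload)
print(paths)
print(python_files(paths))
```

With requests, the same list could be read from a successful response via its json() method instead of json.loads.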
Here is the code I'm using:
import requests
from bs4 import BeautifulSoup

username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'
# File-finder page for the repo's master branch
url = ghURL + '/{}/{}/find/master'.format(username, repo)
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
repoContent = soup.find('div', class_='tree-finder clearfix')
# The table's data-url attribute holds the relative tree-list path
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = requests.get(fileLinksURL)
print(filePaths)  # prints <Response [400]>
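I'm not sure what's wrong. My theory is that the first link creates a cookie that allows the second link to show the file paths of the repo we're targeting. I'm just not sure how to achieve this in code. I'd really appreciate some pointers!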
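One way to test the cookie theory is to send both requests through a single client that stores cookies between calls. The sketch below uses only the standard library; make_cookie_opener and fetch_with_shared_cookies are hypothetical helper names, and whether cookies are actually what the second URL needs is an untested assumption.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Build an opener that stores cookies from every response in one jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_with_shared_cookies(first_url, second_url):
    """Visit first_url, then request second_url with any cookies it set.

    Note: opener.open raises urllib.error.HTTPError for a 4xx/5xx status,
    so a 400 on the second request surfaces as an exception here.
    """
    opener, jar = make_cookie_opener()
    with opener.open(first_url) as first:
        first.read()
    with opener.open(second_url) as second:
        return second.status, second.read()
```

With requests, the equivalent is to create one requests.Session() and call its get method for both URLs. If the second request still fails with the cookies attached, the cookie theory can be ruled out.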
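Give this a go. The links to the .py files are generated dynamically, so to catch them you need to use Selenium. I think this is the output you were after.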
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://github.com/themichaelusa/Trinitum/find/master'
driver = webdriver.Chrome()
driver.get(url)
# Parse the page source as the browser sees it after loading the page
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
for link in soup.select('#tree-finder-results .js-tree-finder-path'):
    print(urljoin(url, link['href']))
Partial results:
https://github.com/themichaelusa/Trinitum/blob/master
https://github.com/themichaelusa/Trinitum/blob/master/Examples/advanced_example.py
https://github.com/themichaelusa/Trinitum/blob/master/Examples/basic_example.py
https://github.com/themichaelusa/Trinitum/blob/master/LICENSE
https://github.com/themichaelusa/Trinitum/blob/master/README.md
https://github.com/themichaelusa/Trinitum/blob/master/Trinitum/AsyncManager.py
@Michael Usachenko, have you tried this code? – SIM
Did you notice that 'Examples/advanced_example.py' is not relative to 'https://github.com/themichaelusa/Trinitum/find/master', but to 'https://github.com/themichaelusa/Trinitum/blob/master'?
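The point about relative paths can be checked with urljoin from the standard library. This is a small sketch using the URLs from this thread; note that the base needs a trailing slash, otherwise urljoin replaces its last path segment instead of appending to it.

```python
from urllib.parse import urljoin

path = "Examples/advanced_example.py"

# Joining against the file-finder page drops its last segment ("master")
# and produces a URL under /find/, which is not where the file lives.
wrong = urljoin("https://github.com/themichaelusa/Trinitum/find/master", path)

# Joining against the blob base (with a trailing slash) gives the file URL.
right = urljoin("https://github.com/themichaelusa/Trinitum/blob/master/", path)

print(wrong)
print(right)
```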
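My suggestion is to use the browser's developer tools to check carefully which requests are actually sent, then print 'url' and 'fileLinksURL' and compare them.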
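To make that comparison concrete, it can help to build the request in code without sending it and list the headers it would carry next to the ones shown in the developer tools. The sketch below is a guess at what might differ: it assumes the browser's request includes AJAX-style headers such as X-Requested-With that a bare GET does not. Which headers the server actually checks is not established in this thread, and build_ajax_request is a hypothetical helper name.

```python
import urllib.request

TREE_LIST_URL = (
    "https://github.com/themichaelusa/Trinitum/tree-list/"
    "45a2ca7145369bee6c31a54c30fca8d3f0aae6cd"
)


def build_ajax_request(url):
    """Build (but do not send) a GET request with browser-like headers.

    The header set is an assumption; copy the real values from the
    developer tools' network tab when comparing.
    """
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(url, headers=headers)


req = build_ajax_request(TREE_LIST_URL)
print(req.get_method(), req.full_url)
# urllib stores header names with only the first letter capitalized.
for name, value in req.header_items():
    print(name, value)
```

With requests, the same headers can be passed as requests.get(fileLinksURL, headers=...), adding or removing one header at a time to find which one changes the status code.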