Scraping file paths from a GitHub repo yields a 400 response, but viewing in the browser works fine

Problem description:

I am trying to scrape all file paths from this link: https://github.com/themichaelusa/Trinitum/find/master, without using the GitHub API at all.

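The page at the link above contains a data-url attribute in its HTML (on the table with id='tree-finder-results', class='tree-browser css-truncate'), which is used to build a URL like this: https://github.com/themichaelusa/Trinitum/tree-list/45a2ca7145369bee6c31a54c30fca8d3f0aae6cd

which displays this dictionary: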

{"paths":["Examples/advanced_example.py","Examples/basic_example.py","LICENSE","README.md","Trinitum/AsyncManager.py","Trinitum/Constants.py","Trinitum/DatabaseManager.py","Trinitum/Diagnostics.py","Trinitum/Order.py","Trinitum/Pipeline.py","Trinitum/Position.py","Trinitum/RSU.py","Trinitum/Strategy.py","Trinitum/TradingInstance.py","Trinitum/Trinitum.py","Trinitum/Utilities.py","Trinitum/__init__.py","setup.cfg","setup.py"]} 

when you view it in a browser such as Chrome. However, a GET request from my code yields <Response [400]>.
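For reference, the dictionary shown above is plain JSON, so once the request succeeds the paths can be read with the standard json module. This is only a sketch: the payload string is a shortened copy of the output above, and the helper name python_files is made up for illustration.

```python
import json

# Shortened copy of the payload shown above (a subset of the real list).
payload = (
    '{"paths":["Examples/advanced_example.py","LICENSE","README.md",'
    '"Trinitum/Trinitum.py","setup.py"]}'
)


def python_files(raw_json):
    """Return only the .py paths from a tree-list style JSON string."""
    paths = json.loads(raw_json)["paths"]
    return [path for path in paths if path.endswith(".py")]


print(python_files(payload))
```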

Here is the code I am using:

import requests
from bs4 import BeautifulSoup

username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'
url = ghURL + '/{}/{}/find/master'.format(username, repo)
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
# The results table carries the tree-list path in its data-url attribute
repoContent = soup.find('div', class_='tree-finder clearfix')
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = requests.get(fileLinksURL)
print(filePaths)  # prints <Response [400]>

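I'm not sure what is wrong. My theory is that the first link creates a cookie that allows the second link to display the file paths of the repo we are targeting. I'm just not sure how to achieve this in code. I would really appreciate some pointers!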
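One way to test the cookie theory is to send both requests through a single requests.Session, which stores cookies from the first response and sends them with the second. The sketch below is untested against GitHub: the helper name fetch_paths, the AJAX_HEADERS dict, and the choice of headers (X-Requested-With, Accept, Referer) are all assumptions worth trying, not something confirmed to be what the server checks.

```python
import requests

# Assumption: headers a browser's background (XHR) request would typically carry.
AJAX_HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json",
}


def fetch_paths(session: requests.Session, finder_url: str, tree_list_url: str) -> list:
    """Request the tree-list JSON with a session that already loaded finder_url."""
    # Referer points at the page the data-url was taken from (also an assumption).
    headers = dict(AJAX_HEADERS, Referer=finder_url)
    response = session.get(tree_list_url, headers=headers, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on a 400
    return response.json()["paths"]
```

With session = requests.Session(), replacing requests.get(url) with session.get(url) for the first page and then calling fetch_paths(session, url, fileLinksURL) keeps both requests on the same cookie jar. If the 400 persists with cookies in place, the cookies are probably not the cause and the headers are the next thing to compare.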

Did you notice that 'Examples/advanced_example.py' is relative not to 'https://github.com/themichaelusa/Trinitum/find/master' but to 'https://github.com/themichaelusa/Trinitum/blob/master'?
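To illustrate the point about the base URL: a relative path from the list can be joined onto the blob/master address with urllib.parse.urljoin. The constant BLOB_BASE and the helper blob_url are names invented for this sketch. Note the trailing slash on the base; without it urljoin replaces the last segment ("master") instead of appending to it.

```python
from urllib.parse import urljoin

# Trailing slash matters: urljoin treats the last segment as a file otherwise.
BLOB_BASE = "https://github.com/themichaelusa/Trinitum/blob/master/"


def blob_url(path, base=BLOB_BASE):
    """Turn a repo-relative path into its blob URL on the master branch."""
    return urljoin(base, path)


print(blob_url("Examples/advanced_example.py"))
```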

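My suggestion is to use the browser's developer tools to check exactly which requests are actually sent, then print 'url' and 'fileLinksURL' and compare them.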
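For the comparison suggested here, it can help to see what requests itself would send before anything goes over the network. A minimal sketch, assuming a default requests.Session: prepare_request builds the outgoing request without sending it, so the final URL and headers can be set side by side with the request shown in the browser's Network tab. The helper name headers_requests_would_send is made up for this example.

```python
import requests


def headers_requests_would_send(url, extra_headers=None):
    """Build (but do not send) a GET request and return its URL and headers."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=extra_headers)
    prepared = session.prepare_request(request)
    return prepared.url, dict(prepared.headers)


url = "https://github.com/themichaelusa/Trinitum/find/master"
fileLinksURL = (
    "https://github.com/themichaelusa/Trinitum/tree-list/"
    "45a2ca7145369bee6c31a54c30fca8d3f0aae6cd"
)

for target in (url, fileLinksURL):
    final_url, headers = headers_requests_would_send(target)
    print(final_url)
    for name, value in sorted(headers.items()):
        print("    {}: {}".format(name, value))
```

Any header the browser sends for the tree-list request that is missing from this output is a candidate for the difference in behavior.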

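Give this a go. The links to the .py files are generated dynamically, so to capture them you need to use Selenium. I think this is the output you were expecting.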

from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://github.com/themichaelusa/Trinitum/find/master'
driver = webdriver.Chrome()
driver.get(url)
# Parse the page source as seen by the browser after loading the page
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
for link in soup.select('#tree-finder-results .js-tree-finder-path'):
    print(urljoin(url, link['href']))

Partial results:

https://github.com/themichaelusa/Trinitum/blob/master 
https://github.com/themichaelusa/Trinitum/blob/master/Examples/advanced_example.py 
https://github.com/themichaelusa/Trinitum/blob/master/Examples/basic_example.py 
https://github.com/themichaelusa/Trinitum/blob/master/LICENSE 
https://github.com/themichaelusa/Trinitum/blob/master/README.md 
https://github.com/themichaelusa/Trinitum/blob/master/Trinitum/AsyncManager.py 

@Michael Usachenko, have you tried this code? – SIM