PHP cURL returns 400 Bad Request when called in a loop

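Problem description:

I'm trying to do screen scraping with the cURL library.

I can successfully scrape a handful of sites (5-10).

But whenever I run it in a for loop to scrape URLs in bulk (10-20),

it reaches a point where the last few URLs return "HTTP/1.1 400 Bad Request": Your browser sent a request that this server could not understand.
The number of request header fields exceeds this server's limit.

I'm pretty sure the URLs are correct and properly trimmed, and each one sends headers of the same length when requested on its own. If I move those last few URLs to the top of the list, they go through, but then whichever URLs end up last in the list get the 400 Bad Request error instead. What could be causing this?

Any suggestions?

The code looks something like this: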

 

for($i=0;$i < sizeof($url);$i++)
    $data[$i] = $this->get($url[$i]);



function get($url) { 

$this->headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg'; 
     $this->headers[] = 'Connection: Keep-Alive'; 
     $this->headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8'; 
     $this->user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 3.5.30729)'; 

set_time_limit(EXECUTION_TIME_LIMIT); 
     $default_exec_time = ini_get('max_execution_time'); 

     $this->redirectcount = 0; 
     $process = curl_init($url); 
     curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers); 
     curl_setopt($process, CURLOPT_HEADER, 1); 
     curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent); 
     if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file); 
     if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file); 

     //off compression for debugging's sake 
     //curl_setopt($process,CURLOPT_ENCODING , $this->compression); 

     curl_setopt($process, CURLOPT_TIMEOUT, 180); 
     if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy); 
     if ($this->proxyauth){ 
      curl_setopt($process, CURLOPT_HTTPPROXYTUNNEL, 1); 
      curl_setopt($process, CURLOPT_PROXYUSERPWD, $this->proxyauth); 
     } 
     curl_setopt($process, CURLOPT_RETURNTRANSFER, 1); 
     curl_setopt($process, CURLOPT_FOLLOWLOCATION, TRUE); 
     curl_setopt($process,CURLOPT_MAXREDIRS,10); 

     //added 
     //curl_setopt($process, CURLOPT_AUTOREFERER, 1); 
     curl_setopt($process,CURLOPT_VERBOSE,TRUE); 
     if ($this->referrer) curl_setopt($process,CURLOPT_REFERER,$this->referrer); 

     if($this->cookies){ 
      foreach($this->cookies as $cookie){ 
       curl_setopt ($process, CURLOPT_COOKIE, $cookie); 
       //echo $cookie; 
      } 
     } 

     $return = $this->redirect_exec($process);//curl_exec($process) or curl_error($process); 
     curl_close($process); 
     set_time_limit($default_exec_time);//setback to default 

     return $return; 
    } 

    function redirect_exec($ch, $curlopt_header = false) { 

    //curl_setopt($ch, CURLOPT_HEADER, true); 
    //curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); 
    $data = curl_exec($ch); 
    $file = fopen(DP_SCRAPE_DATA_CURL_DIR.$this->redirectcount.".html","w"); 
    fwrite($file,$data); 
    fclose($file); 

    $info = curl_getinfo($ch); 
    print_r($info);echo "\n";
    $http_code = $info['http_code'];
    if ($http_code == 301 || $http_code == 302 || $http_code == 303) {
        //list($header) = explode("\r\n\r\n", $data);
        //print_r($header);
        $matches = array();
        //print_r($data);
        //Check if the response has a Location to redirect to
        preg_match('/(Location:|URI:)(.*?)\n/', $data, $matches);
        $url = trim(array_pop($matches));
        //print_r($url);
        $url_parsed = parse_url($url);
        //print_r($url_parsed);
        if (isset($url_parsed['path']) && isset($url) && !empty($url)) {
            //echo "\n".$url;
            curl_setopt($ch, CURLOPT_URL, MY_HOST.$url);
            //echo "\n".$url;
            $this->redirectcount++;
            return $this->redirect_exec($ch);
            //return $this->get(MY_HOST.$url);
            //$this->redirect_exec($ch);
        }
    } elseif($http_code == 200){
        $matches = array();
        preg_match('/(/i', $data, $matches);
        //print_r($matches);
        $url = trim(array_pop($matches));
        //print_r($url);
        $url_parsed = parse_url($url);
        //print_r($url_parsed);
        if (isset($url_parsed['path']) && isset($url) && !empty($url)) {
            curl_setopt($ch, CURLOPT_URL, $url);
            //echo "\n".$url;
            $this->redirectcount++;
            sleep(SLEEP_INTERVAL);
            return $this->redirect_exec($ch);
            //return $this->get($url);
            //$this->redirect_exec($ch);
        }
    }
    //echo "data ".$data;
    $this->redirectcount++;
    return $data ; // $info['url'];
    }

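where $url is an array containing all the URLs, each with its full query string, for the GET requests.

From curl_getinfo I noticed that [request_size] keeps getting bigger, which it shouldn't; it should stay about the same size. How can I print/echo my HTTP request information for debugging?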

+0

Please show us your code. I suspect you keep piling on parameters instead of resetting them on each iteration. – deceze 2010-11-15 08:00:42

+1

We can't tell the time without a clock, yet you're telling us the clock is broken. Show us the clock. – stillstanding 2010-11-15 08:03:19

+0

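Basically, I'm running a for loop that calls curl_exec with GET on each $url. $url[0] .. $url[99] are all the same length, and I'm not piling on parameters. Yet from around $url[90] onward, I keep getting 400 Bad Request errors. – flyclassic 2010-11-15 08:17:19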

Your problem with the multiplying headers is at the top of the get method:

$this->headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg'; 
$this->headers[] = 'Connection: Keep-Alive'; 
$this->headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8'; 

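On every iteration you are adding the same headers to the headers array of the object instance. (The array[] syntax appends to the array.) You need to reset the array on each iteration, or perhaps move the header setup into another method.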
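To see the accumulation in isolation, here is a minimal sketch that makes no network calls. The `AppendingScraper` class and its `buildHeaders()` method are hypothetical stand-ins for the object and the top of the `get` method in the question, and the `Accept` value is abbreviated; only the `$this->headers[] = ...` pattern is taken from the original code.

```php
<?php
// Hypothetical stand-in for the scraper object in the question.
class AppendingScraper
{
    public $headers = array();

    // Mimics the top of get(): appends to the instance array on every call.
    public function buildHeaders()
    {
        $this->headers[] = 'Accept: text/html'; // abbreviated for the example
        $this->headers[] = 'Connection: Keep-Alive';
        $this->headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';

        return count($this->headers);
    }
}

$scraper = new AppendingScraper();
$counts = array();
for ($i = 0; $i < 3; $i++) {
    // Each call leaves three more entries behind on the same object.
    $counts[] = $scraper->buildHeaders();
}

echo implode(', ', $counts), "\n"; // prints "3, 6, 9"
```

With three headers added per call, a server that caps the number of request header fields (Apache's `LimitRequestFields` directive, for instance, defaults to 100) would start rejecting requests after a few dozen iterations. Whether that is the exact limit in play here is an assumption; the question does not say which server was involved.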

If headers is only ever set inside the get method, you can fix the problem by changing it to this:

$this->headers = array(
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg', 
    'Connection: Keep-Alive', 
    'Content-type: application/x-www-form-urlencoded;charset=UTF-8' 
); 

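...but if the headers are always the same and do not change between iterations, you may as well set the headers value in the object's constructor and only read it from the get method, since resetting the array to the same value every time is redundant.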
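The constructor approach could look roughly like the sketch below. `FixedScraper` and `headersForRequest()` are hypothetical names chosen for illustration, and the `Accept` value is shortened; in the real class, `get` would pass `$this->headers` to `CURLOPT_HTTPHEADER` without appending to it.

```php
<?php
// Hypothetical class; names are illustrative, not from the question's code base.
class FixedScraper
{
    private $headers;

    public function __construct()
    {
        // Set once; every request reuses the same three entries.
        $this->headers = array(
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Connection: Keep-Alive',
            'Content-type: application/x-www-form-urlencoded;charset=UTF-8',
        );
    }

    // Stands in for get(): it only reads the headers and never appends.
    public function headersForRequest()
    {
        return $this->headers;
    }
}

$scraper = new FixedScraper();
$sizes = array();
for ($i = 0; $i < 3; $i++) {
    $sizes[] = count($scraper->headersForRequest());
}

echo implode(', ', $sizes), "\n"; // prints "3, 3, 3"
```

The header count stays constant no matter how many URLs the loop processes.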

+0

I guess that was a silly mistake on my part.. thanks! – flyclassic 2010-11-16 01:47:07

+0

@fly: My pleasure. – 2010-11-16 06:03:36

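With CURLINFO_HEADER_OUT set to true, I can retrieve the request information that was actually sent.

Sure enough, the request headers keep growing with every request.

In particular, these headers keep piling up!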

 
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg 
Connection: Keep-Alive 
Content-type: application/x-www-form-urlencoded;charset=UTF-8 
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg 
Connection: Keep-Alive 
Content-type: application/x-www-form-urlencoded;charset=UTF-8 
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg 
Connection: Keep-Alive 
Content-type: application/x-www-form-urlencoded;charset=UTF-8 
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg 
Connection: Keep-Alive 
Content-type: application/x-www-form-urlencoded;charset=UTF-8
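
Capturing the outgoing request as described above can be sketched as follows. The helper name `fetch_and_dump_request` and the shape of its return value are assumptions made for this example; the curl calls themselves (`CURLINFO_HEADER_OUT`, `CURLINFO_REQUEST_SIZE`) are standard PHP cURL options.

```php
<?php
// Hypothetical helper: performs a GET and returns the raw request that curl sent.
function fetch_and_dump_request($url, array $headers)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Ask curl to keep a copy of the outgoing request line and headers.
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);

    $body = curl_exec($ch);

    // The request as sent; only populated because CURLINFO_HEADER_OUT was enabled above.
    $sent = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    // Total size of the issued request in bytes.
    $size = curl_getinfo($ch, CURLINFO_REQUEST_SIZE);
    curl_close($ch);

    return array('body' => $body, 'request' => $sent, 'request_size' => $size);
}
```

Echoing the `request` entry for each URL in the loop makes repeated header lines visible immediately, and a steadily growing `request_size` is a quick signal that something is being appended between iterations.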
+0

Does anyone know what's going on? How are the Accept and Content-type headers getting added again on every iteration? – flyclassic 2010-11-15 11:46:37

+0

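If you want to add more information, you should update your question rather than post an answer. Not everyone views answers in chronological order. (AFAIK the default sort is by votes.) – 2010-11-15 12:16:32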