What is the best way to handle "too many open files"?

Question:

I'm building a crawler that takes a URL, extracts the links from it, and visits each of them down to a certain depth, building a tree of paths for a specific site.

The way I parallelized this crawler is to visit each newly discovered URL as soon as it is found, like this:

func main() {
    link := "https://example.com"

    wg := new(sync.WaitGroup)
    wg.Add(1)

    q := make(chan string)
    go deduplicate(q, wg)
    q <- link
    wg.Wait()
}

func deduplicate(ch chan string, wg *sync.WaitGroup) {
    for link := range ch {
        // seen is a global variable that holds all seen URLs
        if seen[link] {
            wg.Done()
            continue
        }
        seen[link] = true
        go crawl(link, ch, wg)
    }
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
    // handle the link and create a variable "links" containing the links found inside the page
    wg.Add(len(links))
    for _, l := range links {
        q <- l
    }
}

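This works fine for relatively small sites, but when I run it on a large one with lots of links everywhere, I start getting one of these two errors on some requests: socket: too many open files and no such host (the host does exist).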

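What is the best way to handle this? Should I check for these errors and pause execution until other requests have finished? Or cap the maximum number of requests that can be in flight at any given time? (That makes more sense to me, but I'm not sure how to code it exactly.)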

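You are hitting the operating system's per-user limit on open files. On Linux/Unix you can raise the limit with the ulimit -n 4096 command. That command has a ceiling, though: it cannot raise the soft limit above the hard limit, so it may not get you as many open files as you want. To push it further, edit /etc/security/limits.conf and set both the hard and the soft limit. –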

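Also, you are starting a goroutine for every link. If there are many of them, at some point that defeats the purpose of goroutines and the task actually takes longer to finish. Try using a fixed number of goroutines that do the processing and read from a channel, instead of starting a new one for every link. Take a look at https://blog.golang.org/pipelines – Topo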

Or maybe a pattern like this: https://gobyexample.com/worker-pools? (By the way, your use of WaitGroup is very odd. Add 1 for each goroutine, and defer Done inside each goroutine. Anything else is asking for bugs.) – JimB
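A minimal sketch of the fixed-size worker pool the comments above suggest, under some assumptions: the names processAll and fn are hypothetical, fn stands in for "fetch the page and return its links", and the input is a fixed list of URLs rather than a queue that workers feed back into.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs fn over every URL using exactly `workers` goroutines.
// fn is a hypothetical stand-in for fetching a page and extracting its links.
func processAll(urls []string, workers int, fn func(string) []string) map[string][]string {
	jobs := make(chan string)
	results := make(map[string][]string)
	var mu sync.Mutex // protects results
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1) // add 1 for each goroutine...
		go func() {
			defer wg.Done() // ...and defer Done inside it
			for u := range jobs {
				links := fn(u)
				mu.Lock()
				results[u] = links
				mu.Unlock()
			}
		}()
	}

	// Feed the jobs, then close the channel so the workers exit their loops.
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	fake := func(u string) []string { return []string{u + "/a", u + "/b"} }
	res := processAll([]string{"https://example.com", "https://example.org"}, 2, fake)
	fmt.Println(len(res), "pages processed") // prints "2 pages processed"
}
```

With this shape, at most `workers` calls to fn run at once, so the number of open sockets stays bounded no matter how many URLs are queued. A real crawler sends newly found links back into the queue, which needs extra care (for example a buffered queue or a separate dispatcher) so that workers do not block on a channel nobody is reading.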

Answer

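The "files" in the error socket: too many open files are file descriptors, which include sockets (here, the HTTP requests that load the pages being scraped). See this question.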

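DNS lookups are also very likely to fail because no file descriptor can be created for them, but in that case the error reported is no such host.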
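If you do want to recognize the descriptor-exhaustion case in code, here is a small sketch, assuming a Unix-like system. The names isFDExhaustion and simulatedDialError are hypothetical; the simulated error only mimics the way a failed dial is usually wrapped, so that the example runs without actually exhausting descriptors.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// isFDExhaustion reports whether err wraps EMFILE (per-process limit reached)
// or ENFILE (system-wide limit reached).
func isFDExhaustion(err error) bool {
	return errors.Is(err, syscall.EMFILE) || errors.Is(err, syscall.ENFILE)
}

// simulatedDialError builds an error shaped like a failed dial whose
// socket() call returned EMFILE. For demonstration only.
func simulatedDialError() error {
	return &net.OpError{
		Op:  "dial",
		Net: "tcp",
		Err: os.NewSyscallError("socket", syscall.EMFILE),
	}
}

func main() {
	fmt.Println(isFDExhaustion(simulatedDialError()))           // true
	fmt.Println(isFDExhaustion(errors.New("some other error"))) // false
}
```

errors.Is walks the chain of wrapped errors, so the check also works when the error has been wrapped further. A failure reported only as no such host may not carry the underlying errno, so this check can miss that case.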

This problem can be fixed in two ways:

1) Increase the maximum number of open file handles 
2) Limit the maximum number of concurrent `crawl` calls

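1) is the simplest solution, but it may not be ideal, because it only postpones the problem until you hit a website with more links than the new limit allows. On Linux this limit can be set with ulimit -n.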
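To see which limit your crawler is actually running under, you can read it from inside the program. This is a sketch that assumes Linux (the syscall package is platform-specific); fileLimits is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"syscall"
)

// fileLimits returns the soft (Cur) and hard (Max) limits on open file
// descriptors for the current process.
func fileLimits() (syscall.Rlimit, error) {
	var rl syscall.Rlimit
	err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl)
	return rl, err
}

func main() {
	rl, err := fileLimits()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("soft limit: %d, hard limit: %d\n", rl.Cur, rl.Max)
}
```

The soft limit is the one that produces too many open files; ulimit -n changes it for the current shell and for the processes started from it.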

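2) is more of a design issue. We need to limit the number of HTTP requests that can be made concurrently. I have modified the code a bit. The most important change is maxGoRoutines. Every scraping call that starts inserts a value into the channel. Once the channel is full, the next call blocks until a value is removed from the channel. Every time a scraping call finishes, a value is removed from the channel.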

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    link := "https://example.com"

    // The WaitGroup counts links that were queued but not yet fully handled
    wg := new(sync.WaitGroup)
    wg.Add(1) // for the start link

    q := make(chan string)
    go deduplicate(q, wg)
    q <- link
    fmt.Println("waiting")
    wg.Wait()
}

// MaxCount is the maximum number of concurrent scraping calls running
const MaxCount = 100

// maxGoRoutines is a counting semaphore: each running doCrawl holds one slot
var maxGoRoutines = make(chan struct{}, MaxCount)

func deduplicate(ch chan string, wg *sync.WaitGroup) {
    // seen holds all URLs seen so far; only this goroutine touches it
    seen := make(map[string]bool)
    for link := range ch {
        if seen[link] {
            wg.Done() // duplicate: nothing more to do for this link
            continue
        }
        seen[link] = true
        go crawl(link, ch, wg)
    }
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
    // This allows us to know when all the requests are done, so that we can quit
    defer wg.Done()

    links := doCrawl(link)

    // Count the new links before queueing them, so the counter cannot
    // reach zero while work is still pending
    wg.Add(len(links))
    for _, l := range links {
        q <- l
    }
}

func doCrawl(link string) []string {
    // This limits the maximum number of concurrent scraping requests
    maxGoRoutines <- struct{}{}
    defer func() { <-maxGoRoutines }()

    // handle the link and create a variable "links" containing the links found inside the page
    // (placeholder: this stub invents two new links per page, so the demo never finishes)
    time.Sleep(time.Second)
    return []string{link + "a", link + "b"}
}
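To convince yourself that the buffered channel really caps concurrency, here is a small self-contained sketch that measures the peak number of tasks running at once. The name peakConcurrency is hypothetical, and the 1 ms sleep is only a stand-in for the HTTP request.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency starts `tasks` goroutines that each do simulated work behind
// a semaphore of size `limit`, and returns the highest number of tasks that
// were observed inside the guarded section at the same time.
func peakConcurrency(tasks, limit int) int64 {
	sem := make(chan struct{}, limit)
	var running, peak int64
	var wg sync.WaitGroup

	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks while the channel is full)
			defer func() { <-sem }() // release the slot when done

			n := atomic.AddInt64(&running, 1)
			// Record n as the new peak if it is larger than the current one.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			time.Sleep(time.Millisecond) // stand-in for the HTTP request
			atomic.AddInt64(&running, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak concurrent tasks:", peakConcurrency(50, 5))
}
```

The reported peak never exceeds the limit, because a task can only enter the guarded section while it holds one of the `limit` slots in the channel. The goroutines themselves are still created up front; only the guarded work is bounded.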
