Reading from a URL connection in Java

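Question:

I am trying to read HTML from a URL connection. In one case, the HTML file I am trying to read contains 5 line breaks before the actual doctype declaration. In that case the input reader throws an EOF exception.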

URL pageUrl = 
    new URL(
     "http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html" 
    ); 

URLConnection getConn = pageUrl.openConnection(); 
getConn.connect(); 
DataInputStream dis = new DataInputStream(getConn.getInputStream()); 
//some read method here 

Has anyone run into a problem like this?

URL pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html"); 
URLConnection getConn = pageUrl.openConnection(); 
getConn.connect(); 
DataInputStream dis = new DataInputStream(getConn.getInputStream()); 
String urlData = ""; 
while ((urlData = dis.readUTF()) != null) 
    System.out.println(urlData); 

// throws exception

java.io.EOFException
    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
    at java.io.DataInputStream.readUTF(DataInputStream.java:572)
    at java.io.DataInputStream.readUTF(DataInputStream.java:547)

With a BufferedReader, it just returns null and does not continue:

pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html"); 
URLConnection getConn = pageUrl.openConnection(); 
getConn.connect(); 
BufferedReader br = new BufferedReader(new InputStreamReader(getConn.getInputStream())); 
String urlData = ""; 
while (true) {
    urlData = br.readLine();
    System.out.println(urlData);
}

The output is null.

+1

A line break is not EOF. Maybe post your reading code and the exception that is thrown? – 2011-03-20 22:25:43

+0

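I agree with Brian R.'s comment above, but without a stack trace it is hard to say what the problem is. Also, I am not sure why you need a DataInputStream to read HTML. It is mainly meant for reading Java primitive types (binary). If you want to read line by line, BufferedReader is a better (non-deprecated) choice. – 2011-03-20 22:33:11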

+0

The BufferedReader outputs null – Penny 2011-03-20 22:45:50

This:

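import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class Main { 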
    public static void main(String[] args) 
     throws MalformedURLException, IOException 
    { 
     URL pageUrl = new URL("http://www.google.com"); 
     URLConnection getConn = pageUrl.openConnection(); 
     getConn.connect(); 
     BufferedReader dis = new BufferedReader( 
           new InputStreamReader(
            getConn.getInputStream())); 
     String myString; 
     while ((myString = dis.readLine()) != null) 
     { 
      System.out.println(myString); 
     } 
    } 
} 

works perfectly. However, the URL you provided does not return anything.

+0

The provided URL produces a 301 response ("Moved Permanently"). – seh 2011-03-20 22:53:36

+0

OK, thanks everyone. I had not noticed the 301, but I have fixed it now – Penny 2011-03-21 15:32:28
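The 301 mentioned in the comments above can be detected in code by checking the status before reading the body. The following is only a sketch, assuming an http or https URL; the class name RedirectCheck, its helper methods, and the hop limit are made up for illustration and are not from the thread.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class RedirectCheck {

    /** True for the status codes that normally carry a Location header. */
    static boolean isRedirect(int status) {
        return status == HttpURLConnection.HTTP_MOVED_PERM   // 301
            || status == HttpURLConnection.HTTP_MOVED_TEMP   // 302
            || status == HttpURLConnection.HTTP_SEE_OTHER    // 303
            || status == 307
            || status == 308;
    }

    /** Resolves a Location header (absolute or relative) against the requested URL. */
    static String resolveLocation(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location).toString();
    }

    /** Follows at most maxHops redirects by hand and returns the final connection. */
    static HttpURLConnection openFollowing(String address, int maxHops) throws IOException {
        String current = address;
        for (int hop = 0; hop <= maxHops; hop++) {
            // Assumption: the address is http(s), so the cast below is valid.
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // inspect redirects ourselves
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (!isRedirect(status) || location == null) {
                return conn;
            }
            conn.disconnect();
            current = resolveLocation(current, location);
        }
        throw new IOException("Too many redirects starting from " + address);
    }

    public static void main(String[] args) throws IOException {
        for (String address : args) {
            HttpURLConnection conn = openFollowing(address, 5);
            System.out.printf("%s -> %s (HTTP %d)%n",
                address, conn.getURL(), conn.getResponseCode());
            conn.disconnect();
        }
    }
}
```

Printing the status code this way makes an empty body much easier to diagnose than an EOFException or a null from readLine().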

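You are using DataInputStream to read data that was not encoded with DataOutputStream. Check the documented behavior of your call to DataInputStream#readUTF(): it first reads two bytes to form a 16-bit integer giving the number of bytes that follow, which hold the UTF-encoded string. The data you are reading from the HTTP server is not encoded in this format.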
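To make the framing concrete, here is a small sketch that needs no network. The class and method names (ReadUtfDemo, encode, tryReadUtf) are invented for this illustration. It writes a string with DataOutputStream.writeUTF, then tries readUTF on three inputs: the framed bytes, an empty stream, and plain HTML text that starts with line breaks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadUtfDemo {

    /** Encodes s as writeUTF does: a 2-byte length prefix followed by the string bytes. */
    static byte[] encode(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Returns what readUTF produced, or the simple name of the exception it threw. */
    static String tryReadUtf(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] framed = encode("hello");
        // Two length bytes (0x00 0x05) plus five character bytes.
        System.out.println("framed length: " + framed.length);
        System.out.println("framed input:  " + tryReadUtf(framed));

        // An empty body fails while reading the length prefix.
        System.out.println("empty input:   " + tryReadUtf(new byte[0]));

        // Plain text: the first two '\n' bytes are misread as the length 0x0A0A,
        // which is far more than the bytes that actually follow.
        byte[] html = "\n\n\n\n\n<!DOCTYPE html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("html input:    " + tryReadUtf(html));
    }
}
```

The empty-input case matches the stack trace in the question, where the failure occurs inside readUnsignedShort, that is, while reading the length prefix.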

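Instead, the HTTP server sends headers encoded in ASCII, per RFC 2616 sections 6.1 and 2.2. You need to read the text headers as text, and then determine how the message body (the "entity") is encoded.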
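One way to decide how to decode the body is to look at the charset parameter of the Content-Type header. The sketch below is a simplified parser, not a full implementation of the header grammar; ContentTypeCharset and charsetOf are hypothetical names, and the fallback charset is whatever the caller chooses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {

    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=UTF-8". Returns fallback when the header is
     * missing, has no charset parameter, or names an unknown charset.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=UTF-8", fallback));
        System.out.println(charsetOf("text/html", fallback));
        System.out.println(charsetOf(null, fallback));
    }
}
```

With a URLConnection named getConn as in the question, the result could be passed to the reader, for example new InputStreamReader(getConn.getInputStream(), ContentTypeCharset.charsetOf(getConn.getContentType(), StandardCharsets.ISO_8859_1)). The ISO-8859-1 fallback there is an assumption based on the default RFC 2616 gives for text media types.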

This works fine:

package url; 

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStreamReader; 
import java.io.Reader; 
import java.net.URL; 

/** 
* UrlReader 
* @author Michael 
* @since 3/20/11 
*/ 
public class UrlReader 
{ 

    public static void main(String[] args) 
    { 
     UrlReader urlReader = new UrlReader(); 

     for (String url : args) 
     { 
      try 
      { 
       String contents = urlReader.readContents(url); 
       System.out.printf("url: %s contents: %s\n", url, contents); 
      } 
      catch (Exception e) 
      { 
       e.printStackTrace(); 
      } 
     } 
    } 


    public String readContents(String address) throws IOException 
    { 
     StringBuilder contents = new StringBuilder(2048); 
     BufferedReader br = null; 

     try 
     { 
      URL url = new URL(address); 
      br = new BufferedReader(new InputStreamReader(url.openStream())); 
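      String line; 
      // Stop at end of stream; readLine() returns null there, and
      // appending it would add the literal text "null" to the result.
      while ((line = br.readLine()) != null) 
      { 
       contents.append(line); 
      } 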
     } 
     finally 
     { 
      close(br); 
     } 

     return contents.toString(); 
    } 

    private static void close(Reader br) 
    { 
     try 
     { 
      if (br != null) 
      { 
       br.close(); 
      } 
     } 
     catch (Exception e) 
     { 
      e.printStackTrace(); 
     } 
    } 
}
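For comparison, here is a sketch of the same read loop written with try-with-resources, an explicit charset, and line breaks kept in the result. StreamText and readAll are names invented for this example. It reads from any InputStream, so it can be tried on an in-memory sample that starts with blank lines, like the page in the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamText {

    /** Reads the whole stream as text, ending every line with '\n'. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder contents = new StringBuilder(2048);
        // try-with-resources closes the reader (and the stream) on every path.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                contents.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        // Leading blank lines come back as empty strings, not as end of stream.
        byte[] sample = "\n\n\n\n\n<!DOCTYPE html>\n<html></html>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(sample), StandardCharsets.UTF_8));
    }
}
```

With a URL, the stream from url.openStream() could be passed in the same way. The charset to use depends on what the server declares.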