Reading from a URL connection in Java
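Question:
I am trying to read HTML from a URL connection. In one case, the HTML file I am trying to read contains 5 line breaks before the actual doctype declaration. In that case, the input reader throws an EOF exception.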
URL pageUrl =
    new URL(
        "http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html"
    );
URLConnection getConn = pageUrl.openConnection();
getConn.connect();
DataInputStream dis = new DataInputStream(getConn.getInputStream());
//some read method here
Has anyone run into a problem like this?
URL pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html");
URLConnection getConn = pageUrl.openConnection();
getConn.connect();
DataInputStream dis = new DataInputStream(getConn.getInputStream());
String urlData = "";
while ((urlData = dis.readUTF()) != null)
    System.out.println(urlData);
// exception thrown
java.io.EOFException
    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
    at java.io.DataInputStream.readUTF(DataInputStream.java:572)
    at java.io.DataInputStream.readUTF(DataInputStream.java:547)
With a BufferedReader, it just returns null and does not go any further:
pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html");
URLConnection getConn = pageUrl.openConnection();
getConn.connect();
BufferedReader br = new BufferedReader(new InputStreamReader(getConn.getInputStream()));
String urlData = "";
while (true)
{
    urlData = br.readLine();
    System.out.println(urlData);
}
Output: null
Answer
This:
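import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class Main {
    public static void main(String[] args)
        throws MalformedURLException, IOException
    {
        URL pageUrl = new URL("http://www.google.com");
        URLConnection getConn = pageUrl.openConnection();
        getConn.connect();
        BufferedReader dis = new BufferedReader(
            new InputStreamReader(
                getConn.getInputStream()));
        String myString;
        // readLine() returns null once the end of the stream is reached
        while ((myString = dis.readLine()) != null)
        {
            System.out.println(myString);
        }
    }
}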
works perfectly. The URL you provided, however, returns nothing.
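An empty response body would be consistent with the null that the BufferedReader prints. A minimal sketch of that behavior, using in-memory strings in place of the network stream (the class name ReadLineDemo and the sample bodies are made up for illustration): readLine() returns null only at the end of the stream, while leading blank lines come back as empty strings.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the bodies passed in stand in for an HTTP response body.
class ReadLineDemo {
    // Collects every line that readLine() returns before it signals end of stream with null.
    static List<String> lines(String body) throws IOException {
        List<String> result = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(body))) {
            String line;
            while ((line = br.readLine()) != null) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Empty body: the very first readLine() call returns null, so nothing is collected.
        System.out.println(lines("").size());
        // Five leading newlines: five empty strings, then the doctype line.
        System.out.println(lines("\n\n\n\n\n<!DOCTYPE html>").size());
    }
}
```

So five line breaks before the doctype should not, on their own, make a reader stop early; a null on the first call points to a body with no content at all.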
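Answer
You are using DataInputStream to read data that was not encoded with DataOutputStream. Check the documented behavior of the DataInputStream#readUTF() call you are making: it first reads two bytes to form a 16-bit integer giving the number of bytes that follow, which contain the UTF-encoded string. The data you are reading from the HTTP server is not encoded in this format.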
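The length-prefix behavior can be seen without any network access. The following is only a sketch using in-memory byte arrays; the class name ReadUtfDemo and its helper methods are made up for illustration. It round-trips a string through writeUTF/readUTF, then shows what readUTF makes of plain HTML text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answers.
class ReadUtfDemo {
    // Writes s with writeUTF (2-byte length prefix + modified UTF-8) and reads it back.
    static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return in.readUTF();
        }
    }

    // Returns the "length" that readUTF would take from the first two bytes of plain text.
    static int impliedLength(String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUnsignedShort();
        }
    }

    // Returns true if readUTF runs off the end of the given plain text.
    static boolean failsWithEof(String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            in.readUTF();
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello"));
        // '<' is 0x3C and '!' is 0x21, so the prefix is read as 0x3C21.
        System.out.println(impliedLength("<!DOCTYPE html>"));
        System.out.println(failsWithEof("<!DOCTYPE html>"));
        // An empty stream fails while reading the two length bytes.
        System.out.println(failsWithEof(""));
    }
}
```

The stack trace in the question ends in readUnsignedShort, which is where readUTF fails when fewer than two bytes are left, for example on an empty or already exhausted stream.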
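Instead, the HTTP server is sending headers encoded in ASCII, per RFC 2616 sections 6.1 and 2.2. You need to read the textual headers as text, and then determine how the message body (the "entity") is encoded.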
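One way to determine the body encoding is to look at the charset parameter of the Content-Type header. The helper below is a minimal sketch, not a full header parser; the class name ContentTypeCharset, the method charsetOf, and the fallback argument are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; real Content-Type values can be more complex than this handles.
class ContentTypeCharset {
    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or returns fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the parameter name.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(
            charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

With a URLConnection named getConn as in the question, the result could be passed to the reader, for example new InputStreamReader(getConn.getInputStream(), ContentTypeCharset.charsetOf(getConn.getContentType(), StandardCharsets.UTF_8)). The choice of UTF-8 as the fallback is the caller's assumption, not something the header guarantees.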
Answer
This works fine:
package url;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

/**
 * UrlReader
 * @author Michael
 * @since 3/20/11
 */
public class UrlReader
{
    public static void main(String[] args)
    {
        UrlReader urlReader = new UrlReader();
        for (String url : args)
        {
            try
            {
                String contents = urlReader.readContents(url);
                System.out.printf("url: %s contents: %s\n", url, contents);
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
    }

    public String readContents(String address) throws IOException
    {
        StringBuilder contents = new StringBuilder(2048);
        BufferedReader br = null;
        try
        {
            URL url = new URL(address);
            br = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            // Check for null before appending, so the end-of-stream marker
            // is not appended as the literal text "null".
            while ((line = br.readLine()) != null)
            {
                // readLine() strips the line terminator, so add one back.
                contents.append(line).append('\n');
            }
        }
        finally
        {
            close(br);
        }
        return contents.toString();
    }

    private static void close(Reader br)
    {
        try
        {
            if (br != null)
            {
                br.close();
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
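A newline is not EOF. Perhaps post your reading code and the exception being thrown? – 2011-03-20 22:25:43
I agree with Brian R.'s comment above, but without a stack trace it is hard to say what the problem is. Also, I'm not sure why you need DataInputStream to read HTML. It is mainly meant for reading Java primitive types (binary). If you want to read line by line, BufferedReader is a better (non-deprecated) choice. – 2011-03-20 22:33:11
The BufferedReader outputs null – Penny 2011-03-20 22:45:50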