Getting the body content of an HTML file in Java

Question:

Suppose I have this HTML file:

<?xml version="1.0" encoding="utf-8" standalone="no"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> 

<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
    <link href="../Styles/style.css" rel="STYLESHEET" type="text/css" /> 

    <title></title> 
</head> 

<body> 
<p> text 1 </p> 
<p> text 2 </p> 
</body> 
</html>

What I want to get is:

<p> text 1 </p> 
<p> text 2 </p>

So I thought SAXParser would do the job (if you know an easier way, please tell me).

Here is my code, but I always get null as the body content:

private final String HTML_NAME_SPACE = "http://www.w3.org/1999/xhtml"; 
private final String HTML_TAG = "html"; 
private final String BODY_TAG = "body"; 
public static void parseHTML(InputStream in, ContentHandler handler) throws IOException, SAXException, ParserConfigurationException 
{ 
    if(in != null) 
    { 
     try 
     { 
      SAXParserFactory parseFactory = SAXParserFactory.newInstance(); 
      XMLReader reader = parseFactory.newSAXParser().getXMLReader(); 
      reader.setContentHandler(handler); 
      InputSource source = new InputSource(in); 
      source.setEncoding("UTF-8"); 
      reader.parse(source); 
     } 
     finally 
     { 
      in.close(); 
     } 
    } 
} 

public ContentHandler constrauctHTMLContentHandler() 
{ 
    RootElement root = new RootElement(HTML_NAME_SPACE, HTML_TAG); 
    root.setStartElementListener(new StartElementListener() 
     {   
     @Override 
     public void start(Attributes attributes) 
     {   
      String body = attributes.getValue(BODY_TAG); 
      Log.d("html parser", "body: " + body); 
     } 
    }); 
return root.getContentHandler(); 
}

Then:

parseHTML(inputStream, constrauctHTMLContentHandler()); // inputStream is html file as stream

What is wrong with this code?

An easier way: consider [jsoup](http://jsoup.org/) for HTML parsing; see [here](http://stackoverflow.com/questions/22043592/trying-to-extract-content-from-url-in-java/22043838#22043838) – PopoFibo

Have you checked which `attributes` you actually receive in the `start` method? If I remember correctly, the callback is invoked for every start element. – Smutje

@PopoFibo: since I'm not familiar with jsoup, I'd rather not use it unless I have to – mehdok
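On why the log prints null: `attributes.getValue(BODY_TAG)` looks for an attribute named `body` on the `<html>` start tag, and that tag has no such attribute. `<body>` is a child element, so its content arrives through later element and character callbacks. Below is a minimal sketch using the plain JDK SAX API instead of `RootElement`. The class and method names (`BodyExtractor`, `BodyHandler`, `extractBody`) are made up for illustration. It assumes the input is well-formed XHTML like the file in the question and uses no entities defined only in the DTD (such as `&nbsp;`), since the DTD is replaced by an empty one to avoid downloading it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original question's code.
final class BodyExtractor {

    /** Rebuilds the markup found between <body> and </body>. */
    static final class BodyHandler extends DefaultHandler {
        private final StringBuilder body = new StringBuilder();
        private boolean insideBody = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (insideBody) {
                // Simplification: attributes of nested elements are dropped.
                body.append('<').append(qName).append('>');
            } else if ("body".equals(qName)) {
                insideBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                insideBody = false;
            } else if (insideBody) {
                body.append("</").append(qName).append('>');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (insideBody) {
                // Simplification: text is copied as-is, without re-escaping & or <.
                body.append(ch, start, length);
            }
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Hand the parser an empty DTD so it does not fetch xhtml11.dtd over the network.
            return new InputSource(new StringReader(""));
        }

        String getBody() {
            return body.toString().trim();
        }
    }

    /** Parses the stream, closes it, and returns the markup inside <body>. */
    static String extractBody(InputStream in)
            throws IOException, SAXException, ParserConfigurationException {
        BodyHandler handler = new BodyHandler();
        try (InputStream stream = in) {
            // The default factory is not namespace-aware, so qName is simply "body", "p", ...
            SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(stream), handler);
        }
        return handler.getBody();
    }
}
```

Calling `BodyExtractor.extractBody(inputStream)` on the file from the question should return the two `<p>` elements as a string. For HTML that is not well-formed XML, a lenient HTML parser is the safer choice.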

Answer

How about using Jsoup? Your code could look like this:

Document doc = Jsoup.parse(html); 
Elements elements = doc.select("body").first().children(); 
//Elements elements = doc.select("p");//or only `<p>` elements 
for (Element el : elements) 
    System.out.println("element: "+el);

Thanks, I'll try this. – mehdok

Answer

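I'm not sure how you are obtaining the HTML. If it is a local file, you can load it directly into Jsoup. If you have to fetch it from a URL, I usually use Apache's HttpClient. There is a quick start guide here: HttpClient, and it does a good job of getting you started.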

That lets you retrieve the data with something like this:

HttpClient client = new DefaultHttpClient(); 
HttpPost post = new HttpPost(URL); 
// 
// here you can do things like add parameters used when connecting to the remote site  
// 
HttpResponse response = client.execute(post); 
BufferedReader rd = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
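The request code ends with a `BufferedReader`, but a parser call that takes the page as a `String` needs the stream drained first. A minimal sketch of that step with only the JDK follows. The names `StreamText` and `readAll` are made up for illustration, and the charset is passed in explicitly on the assumption that the caller knows the page's encoding (UTF-8 for the file in the question).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original answer.
final class StreamText {

    /** Reads the whole stream into a String using the given charset, then closes it. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[4096];
            int read;
            // read() returns -1 once the end of the stream is reached.
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        }
        return text.toString();
    }
}
```

With the response above, this would be called as `StreamText.readAll(response.getEntity().getContent(), StandardCharsets.UTF_8)`. Naming the charset avoids depending on the platform default, which `new InputStreamReader(stream)` would otherwise use.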

Then (as Pshemo already suggested) I use Jsoup to parse and extract the data:

Document document = Jsoup.parse(HTML); 
// OR 
Document doc = Jsoup.parseBodyFragment(HTML); 
Elements elements = doc.select("p"); // p for <p>text</p>

Thanks, my HTML is a local file, so if my approach doesn't work, I'll use jsoup – mehdok
