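Parsing text with HTML tags

Problem description:

I have to parse an XML file from a server. I tried both a DOM parser and a SAX parser, but I cannot parse the HTML tags: parsing stops at the first "<" it finds.

This is my parser class: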

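import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.io.UnsupportedEncodingException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import android.util.Log;

public class XMLParser { 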

    // constructor 
    public XMLParser() { 

    } 


    public String getXmlFromUrl(String url) { 
     String xml = null; 
     BufferedReader in = null; 

     try { 
      // defaultHttpClient 
      DefaultHttpClient httpClient = new DefaultHttpClient(); 
      HttpPost httpPost = new HttpPost(url); 

      HttpResponse httpResponse = httpClient.execute(httpPost); 
      in = new BufferedReader(new InputStreamReader(
        httpResponse.getEntity().getContent(), "UTF-8")); 


      StringBuilder sb = new StringBuilder(); 
      String line; 
      String NL = System.getProperty("line.separator"); 

      // The loop condition already reads the next line; reading again 
      // inside the body would silently drop every second line. 
      while ((line = in.readLine()) != null) 
       { 
        sb.append(line); 
        sb.append(NL); 
       } 
      in.close(); 

      xml = sb.toString(); 

     } catch (UnsupportedEncodingException e) { 
      e.printStackTrace(); 
     } catch (ClientProtocolException e) { 
      e.printStackTrace(); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 
     // return XML 
     return xml; 
    } 

    public Document getDomElement(String xml){ 
     Document doc = null; 
     DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance(); 
     try { 

      DocumentBuilder db = dbf.newDocumentBuilder(); 

      InputSource is = new InputSource(); 
       is.setCharacterStream(new StringReader(xml)); 
       doc = db.parse(is); 

      } catch (ParserConfigurationException e) { 
       Log.e("Error: ", e.getMessage()); 
       return null; 
      } catch (SAXException e) { 
       Log.e("Error: ", e.getMessage()); 
       return null; 
      } catch (IOException e) { 
       Log.e("Error: ", e.getMessage()); 
       return null; 
      } 

      return doc; 
    } 


    public final String getElementValue(Node elem) { 
     Node child; 
     if(elem != null){ 
      if (elem.hasChildNodes()){ 
       for(child = elem.getFirstChild(); child != null; child = child.getNextSibling()){ 
        if(child.getNodeType() == Node.TEXT_NODE ){ 
         return child.getNodeValue(); 
        } 
       } 
      } 
     } 
     return ""; 
    } 

    /** 
     * Returns the first text value of the first descendant with the given tag name. 
     * @param item parent element to search in 
     * @param str tag name to look up 
     * */ 
    public String getValue(Element item, String str) {  
      NodeList n = item.getElementsByTagName(str);   
     return this.getElementValue(n.item(0)); 
    } 

    }

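Answer

If your HTML is not well-formed (for example, it contains unclosed tags), none of these parsers will work. You may end up having to parse it manually (for example, using regular expressions and classes). If the HTML is well-formed, you should post the error you are getting, and perhaps link to the page.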
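
To make the well-formedness point concrete, here is a minimal sketch that asks the standard JAXP DOM parser whether a piece of markup is acceptable as XML. The class name `WellFormedCheck` and the sample strings are made up for illustration; they are not taken from the page linked in this thread.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormedCheck {

    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler ignores warnings and recoverable errors but rethrows
            // fatal errors; it also keeps the parser from printing to stderr.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed, e.g. an unclosed tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // <br/> is closed, so the whole fragment is well-formed XML.
        System.out.println(isWellFormed("<item><p>Hello<br/>world</p></item>")); // true
        // <br> is never closed, so </p> does not match the open element.
        System.out.println(isWellFormed("<item><p>Hello<br>world</p></item>"));  // false
    }
}
```

Running a check like this on the server response is a quick way to tell whether the problem is the document itself or the code that reads values out of it.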

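[link](http://mirsitelfi.comoj.com/test.php) Do you think this HTML is not well-formed? – mir

I'm no expert, but it looks like an XML document with an HTML header. Your boss should start by fixing the header (see http://www.w3schools.com/xml/) – Melllvar

Do you think it would work if I tried to fix it like that? – mir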

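Answer

You should use an HTML parser, because most HTML content available on the web does not conform to the XML specification. In simple cases regular expressions are enough, but in complex cases you will probably need an HTML parser.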
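
For the "simple cases" mentioned above, a regular expression can be as small as the sketch below. It only removes anything that looks like a tag; it does not decode entities and will misbehave on comments, scripts, or attribute values containing ">". The class name `TagStripper` and the sample input are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class TagStripper {

    // Matches "<", then one or more characters that are not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Removes everything that looks like a tag and keeps the text in between.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The unclosed <br> is no problem here, since nothing is validated.
        System.out.println(stripTags("<p>Hello <b>world</b><br></p>")); // Hello world
    }
}
```

Anything beyond this kind of flat tag removal (nested structure, attributes, broken markup you need to navigate) is where a real HTML parser becomes the safer choice.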

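I have no choice, I have to use an XML parser!! :( – mir

Then you are out of options, because as I explained, an XML parser simply cannot be used here. By the way, why do you have to use an XML parser? –

It's a project, and the "boss" asked me to do it with an XML parser O_o – mir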
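
If the XML parser has to stay, one thing worth checking is how the value is read once parsing succeeds. The sketch below assumes the markup is well-formed. In that case a DOM parser turns embedded tags into child elements, and a method that returns only the first text node (as `getElementValue` in the question does) yields just the text before the first "<". `Node.getTextContent()` collects the text of all descendants instead. The class name `NestedTextDemo` and the sample document are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class NestedTextDemo {

    // Parses the XML string and returns the first element with the given tag name.
    static Element firstElement(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName(tag).item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Same approach as getElementValue in the question: first text node only.
    static String firstTextNode(Element elem) {
        for (Node c = elem.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                return c.getNodeValue();
            }
        }
        return "";
    }

    // Concatenated text of the element and all of its descendants.
    static String allText(Element elem) {
        return elem.getTextContent();
    }

    public static void main(String[] args) {
        Element desc = firstElement(
                "<item><desc>Hello <b>bold</b> world</desc></item>", "desc");
        System.out.println("[" + firstTextNode(desc) + "]"); // [Hello ]
        System.out.println("[" + allText(desc) + "]");       // [Hello bold world]
    }
}
```

This does not help with markup that is not well-formed; in that case the parser fails before any value can be read, as the answers above explain.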
