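Parsing text that contains HTML tags
Problem description:
I have to parse an XML file from a server. I tried both a DOM parser and a SAX parser, but I cannot parse the HTML tags: parsing stops at the first "<" it finds.
Here is my parser class: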
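import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import android.util.Log;

public class XMLParser {

    // constructor
    public XMLParser() {
    }

    // Downloads the document at the given URL and returns it as a string (null on failure).
    public String getXmlFromUrl(String url) {
        String xml = null;
        BufferedReader in = null;
        try {
            // defaultHttpClient
            DefaultHttpClient httpClient = new DefaultHttpClient();
            HttpPost httpPost = new HttpPost(url);
            HttpResponse httpResponse = httpClient.execute(httpPost);
            in = new BufferedReader(new InputStreamReader(
                    httpResponse.getEntity().getContent(), "UTF-8"));
            StringBuilder sb = new StringBuilder();
            String line;
            String NL = System.getProperty("line.separator");
            // Read exactly one line per iteration; a second readLine() in the
            // loop body would silently skip every other line.
            while ((line = in.readLine()) != null) {
                sb.append(line);
                sb.append(NL);
            }
            in.close();
            xml = sb.toString();
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException and ClientProtocolException.
            e.printStackTrace();
        }
        // return XML
        return xml;
    }

    // Parses an XML string into a DOM Document (null if parsing fails).
    public Document getDomElement(String xml) {
        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            InputSource is = new InputSource();
            is.setCharacterStream(new StringReader(xml));
            doc = db.parse(is);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Pass the exception itself: e.getMessage() may be null.
            Log.e("XMLParser", "Could not parse XML", e);
            return null;
        }
        return doc;
    }

    // Returns the value of the FIRST text node under elem only. Nested tags are
    // child elements, so any text after the first "<" is not included.
    public final String getElementValue(Node elem) {
        Node child;
        if (elem != null) {
            if (elem.hasChildNodes()) {
                for (child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.TEXT_NODE) {
                        return child.getNodeValue();
                    }
                }
            }
        }
        return "";
    }

    /**
     * Getting node value
     * @param item parent element to search in
     * @param str tag name of the child element
     */
    public String getValue(Element item, String str) {
        NodeList n = item.getElementsByTagName(str);
        return this.getElementValue(n.item(0));
    }
}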
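Answer:
If your HTML is not well-formed (for example, it contains unclosed tags), neither of these parsers will work. You may end up having to parse it manually (for example, using regular expressions and custom classes). If the HTML is well-formed, you should post the error you are receiving, and possibly a link to the page.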
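When the embedded markup is well-formed, a DOM parser turns each embedded tag into a child element, so reading only the first text node yields just the text before the first "<". The following is a minimal sketch of two ways around that, assuming a made-up `<description>` element and sample string (neither is taken from the page in the question): `getTextContent()` returns the text with the tags dropped, while serializing the child nodes with a `Transformer` keeps the tags. The names `InnerMarkupExample`, `firstElement` and `innerMarkup` are hypothetical helpers for illustration only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class InnerMarkupExample {

    // Hypothetical helper: parses xml and returns the first element named tag.
    static Element firstElement(String xml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (Element) doc.getElementsByTagName(tag).item(0);
    }

    // Hypothetical helper: serializes every child of elem (text and nested
    // tags) back into a single string, without an XML declaration.
    static String innerMarkup(Element elem) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
            t.transform(new DOMSource(child), new StreamResult(out));
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input is an assumption, not the document from the question.
        String xml = "<item><description>Hello <b>bold</b> world</description></item>";
        Element desc = firstElement(xml, "description");

        // First text node only: the text that precedes the first nested tag.
        System.out.println(desc.getFirstChild().getNodeValue());
        // All descendant text, with the tags themselves dropped.
        System.out.println(desc.getTextContent());
        // Text and nested tags together.
        System.out.println(innerMarkup(desc));
    }
}
```

If the server side can be changed, wrapping the HTML in a CDATA section is another option, since the parser then delivers it as character data instead of child elements. None of this helps when the document is not well-formed, because the parser fails before any node can be read.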
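[link](http://mirsitelfi.comoj.com/test.php) Do you think this HTML is not well-formed? – mir
I'm not an expert, but it looks like an XML document with an HTML header. A good first step would be to fix the header (see http://www.w3schools.com/xml/) – Melllvar
Do you think it would work if I tried to fix it like this? – mir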