Expat (C): "invalid token" on (almost) every line

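Problem description:

I have some XML that I want to process with Expat in C. The XML parses fine in Java, so I have no reason to believe it is malformed. Moreover, the C code I have will parse a string that I insert by hand, but it fails to parse my XML file.

Here is the code (with some stuff I have added: if God had wanted us to use debuggers, he wouldn't have given us printf):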

static void XMLCALL 
starthandler(void *data, const XML_Char *name, const XML_Char **attr) 
{ 
int i; 
if (strcmp(name, "file") == 0) { 
    for (i = 0; attr[i]; i += 2) { 
     if (strcmp(attr[i], "path") == 0) { 
      printf("File is at %s\n", attr[i + 1]); 
     } 
    } 
} 
}  

int main(int argc, char *argv[]) 
{ 
FILE* inXML; 
ssize_t read; 
char* line; 
size_t len = 0; 

XML_Parser p_ctrl = XML_ParserCreate("UTF-8"); 
if (!p_ctrl) { 
    fprintf(stderr, "Could not create parser\n"); 
    exit(-1); 
} 

XML_SetStartElementHandler(p_ctrl, starthandler); 
inXML = fopen(argv[1], "r"); 
if (inXML == NULL) { 
    fprintf(stderr, "Could not open %s\n", argv[1]); 
    XML_ParserFree(p_ctrl); 
    exit(-1); 
} 

while ((read = getline(&line, &len, inXML)) != -1) { 
    printf("Line is %s", line); 
    enum XML_Status status = XML_Parse(p_ctrl, line, len, 0); 
    if (status == 0) { 
     enum XML_Error errcde = XML_GetErrorCode(p_ctrl); 
     printf("ERROR: %s\n", XML_ErrorString(errcde)); 
     printf("Error at column number %lu\n", XML_GetCurrentColumnNumber(p_ctrl)); 
     printf("Error at line number %lu\n", XML_GetCurrentLineNumber(p_ctrl)); 
    } 
    free(line); 
    line = NULL; 
    len = 0; 
} 

XML_ParserFree(p_ctrl); 
fclose(inXML); 
return 0; 
} 

This is the XML file I am trying to parse:

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<!DOCTYPE threadrecordml [ 
<!ELEMENT threadrecordml (file)*> 
<!ATTLIST threadrecordml version CDATA #FIXED "0.1"> 
<!ATTLIST threadrecordml xmlns CDATA #FIXED "http://cartesianproduct.wordpress.com"> 
<!ELEMENT file EMPTY> 
<!ATTLIST file thread CDATA #REQUIRED> 
<!ATTLIST file path CDATA #REQUIRED> 
]> 
<threadrecordml xmlns="http://cartesianproduct.wordpress.com"> 
<file thread="1" path="tester_1.xml" /> 
<file thread="3" path="tester_3.xml" /> 
<file thread="2" path="tester_2.xml" /> 
<file thread="4" path="tester_4.xml" /> 
<file thread="5" path="tester_5.xml" /> 
<file thread="6" path="tester_6.xml" /> 
<file thread="7" path="tester_7.xml" /> 
<file thread="8" path="tester_8.xml" /> 
<file thread="9" path="tester_9.xml" /> 
<file thread="10" path="tester_10.xml" /> 
<file thread="11" path="tester_11.xml" /> 
<file thread="12" path="tester_12.xml" /> 
<file thread="13" path="tester_13.xml" /> 
<file thread="14" path="tester_14.xml" /> 
<file thread="15" path="tester_15.xml" /> 
<file thread="16" path="tester_16.xml" /> 
<file thread="17" path="tester_17.xml" /> 
<file thread="18" path="tester_18.xml" /> 
</threadrecordml> 

Here is a sample of the output...

[email protected]:/n/staffstore/adrianm/optGenC$ ./optgenc ../tester_control.xml 
Line is <?xml version="1.0" encoding="UTF-8" standalone="no"?> 
ERROR: not well-formed (invalid token) 
Error at column number 0 
Error at line number 2 
Line is <!DOCTYPE threadrecordml [ 
ERROR: not well-formed (invalid token) 
Error at column number 0 
Error at line number 3 
Line is <!ELEMENT threadrecordml (file)*> 
ERROR: not well-formed (invalid token) 
Error at column number 0 
Error at line number 4 
Line is <!ATTLIST threadrecordml version CDATA #FIXED "0.1"> 
ERROR: not well-formed (invalid token) 
Error at column number 0 

(and so on for every line)

If I "cheat" and add this line after the read...

line = "<file thread=\"1\" path=\"tester.xml\" />"; 

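then that line is parsed (though the code then, of course, breaks for other reasons).

So something seems to change when the data is read from the file on disk... is it being read as 16-bit? But changing the parser's encoding to NULL or UTF-16 seems to make no difference.

Can anyone offer an explanation? (If it makes any difference, I have run this code on 64-bit OSX and Linux machines, with the same problem on both.)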

+1

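Does the line start with a newline, given that you get your first error on line 2 for the 'xml' tag? Otherwise, there may be some other unexpected character at the start of the file. –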

+0

Good point - I hadn't noticed that it starts at line 2. – adrianmcmenamin

+0

Viewing the file in a hex editor shows no stray characters - every line ends with \x0A and that's it. – adrianmcmenamin

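The answer is that getline(...) writes a null character after the newline, and that null is then passed to the parser: the loop hands XML_Parse the value of len, which getline sets to the size of the allocated buffer, rather than read, the number of characters actually read. A null character is of course not valid XML, hence the failure; and because it comes after the newline, the error is reported on line 2, and so on.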
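
To see the mismatch concretely, here is a minimal sketch, assuming a POSIX system. The helper name probe_getline is hypothetical and not part of the original code. It reads one line with getline() and reports the return value, the buffer size that getline() stores through its second argument, and whether the byte just past the line is the terminator.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: reads the first line of `text` with getline() and
 * reports the returned length, the buffer capacity, and whether the byte
 * just past the line is the NUL terminator. Returns 0 on success. */
static int probe_getline(const char *text, ssize_t *nread, size_t *cap,
                         int *nul_after)
{
    FILE *fp = tmpfile();            /* scratch file standing in for the XML file */
    if (fp == NULL)
        return -1;
    fputs(text, fp);
    rewind(fp);

    char *line = NULL;               /* must be NULL (or malloc'd) before the first call */
    size_t len = 0;
    ssize_t n = getline(&line, &len, fp);
    fclose(fp);
    if (n == -1) {
        free(line);
        return -1;
    }

    *nread = n;                      /* bytes in the line, newline included */
    *cap = len;                      /* allocated size: always larger than n */
    *nul_after = (line[n] == '\0');  /* the terminator sits right after the newline */
    free(line);
    return 0;
}
```

For the input "<a/>\n<b/>\n" the returned length is 5, while the capacity is some larger, implementation-dependent number, so a parser given the capacity as the length sees the terminator and whatever else is in the buffer after it.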

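Reading the file in fixed-size chunks with fread and passing the number of bytes actually read solved the problem: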

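char data[BUFSIZ];  /* fixed-size read buffer */
int done;

do { 
    /* fread returns the number of bytes actually read */
    len = fread(data, 1, sizeof(data), inXML); 
    done = len < sizeof(data);  /* short read: end of file (or a read error) */

    /* pass exactly len bytes, and set isFinal on the last chunk */
    if (XML_Parse(p_ctrl, data, (int)len, done) == XML_STATUS_ERROR) { 
     enum XML_Error errcde = XML_GetErrorCode(p_ctrl); 
     printf("ERROR: %s\n", XML_ErrorString(errcde)); 
     printf("Error at column number %lu\n", XML_GetCurrentColumnNumber(p_ctrl)); 
     printf("Error at line number %lu\n", XML_GetCurrentLineNumber(p_ctrl)); 
     break;  /* the parser cannot continue after an error */
    } 
} while(!done);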
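
The line-by-line approach from the question should also work if the byte count returned by getline() is passed on instead of the buffer capacity. The sketch below illustrates that under some assumptions: chunk_sink, feed_lines, count_bytes and byte_stats are hypothetical names introduced here, and the sink is a stand-in for the parser call so that the example does not need Expat to build.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical sink type standing in for the parser call. With Expat the
 * body would be roughly:
 *     return XML_Parse(p, buf, len, is_final) != XML_STATUS_ERROR; */
typedef int (*chunk_sink)(const char *buf, int len, int is_final, void *ctx);

/* Feeds fp to sink one line at a time. Returns 0 on success, -1 if the
 * sink reports a failure. */
static int feed_lines(FILE *fp, chunk_sink sink, void *ctx)
{
    char *line = NULL;   /* getline() allocates and grows this buffer */
    size_t cap = 0;      /* buffer capacity: NOT the length of the line */
    ssize_t nread;       /* bytes read, excluding the NUL terminator */
    int rc = 0;

    while ((nread = getline(&line, &cap, fp)) != -1) {
        /* Pass nread, never cap, so the terminator stays out of the input. */
        if (!sink(line, (int)nread, 0, ctx)) {
            rc = -1;
            break;
        }
    }
    /* A zero-length final call tells the sink that the input is complete. */
    if (rc == 0 && !sink("", 0, 1, ctx))
        rc = -1;

    free(line);          /* one free after the loop; getline() reuses the buffer */
    return rc;
}

/* Example sink for demonstration: totals the bytes and counts NULs seen. */
struct byte_stats { size_t total; size_t nuls; int finals; };

static int count_bytes(const char *buf, int len, int is_final, void *ctx)
{
    struct byte_stats *st = ctx;
    for (int i = 0; i < len; i++)
        if (buf[i] == '\0')
            st->nuls++;
    st->total += (size_t)len;
    st->finals += is_final;
    return 1;            /* nonzero means "keep going" */
}
```

Keeping one buffer for the whole loop and freeing it once afterwards also avoids the free and reset on every iteration that the question's loop performs.
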
+0

Exactly my problem. – FractalSpace