Converting numeric character entity references to readable text

Problem description:

I have been struggling to find a class that converts/decodes ASCII characters into readable text.

I found the method below on Stack Overflow, and it turns many of the characters into readable text. But I am still struggling with, for example:

#&44; 
#&46; 
#&58; 
#&39; 

...and so on.

I read my data from an XML file using TBXML, and the encoding declared in the XML is:

iso-8859-1 

Does anyone have a method that converts/decodes all of these ASCII characters into readable text?

- (NSString *)stringByDecodingXMLEntities { 
    NSUInteger myLength = [self length]; 
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location; 

    // Short-circuit if there are no ampersands. 
    if (ampIndex == NSNotFound) { 
     return self; 
    } 
    // Make result string with some extra capacity. 
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)]; 

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner. 
    NSScanner *scanner = [NSScanner scannerWithString:self]; 

    [scanner setCharactersToBeSkipped:nil]; 

    NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"]; 

    do { 
     // Scan up to the next entity or the end of the string. 
     NSString *nonEntityString; 
     if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) { 
      [result appendString:nonEntityString]; 
     } 
     if ([scanner isAtEnd]) { 
      goto finish; 
     } 
     // Scan either a HTML or numeric character entity reference. 
     if ([scanner scanString:@"&amp;" intoString:NULL]) 
      [result appendString:@"&"]; 
     else if ([scanner scanString:@"&apos;" intoString:NULL]) 
      [result appendString:@"'"]; 
     else if ([scanner scanString:@"&quot;" intoString:NULL]) 
      [result appendString:@"\""]; 
     else if ([scanner scanString:@"&lt;" intoString:NULL]) 
      [result appendString:@"<"]; 
     else if ([scanner scanString:@"&gt;" intoString:NULL]) 
      [result appendString:@">"]; 
     else if ([scanner scanString:@"&#" intoString:NULL]) { 
      BOOL gotNumber; 
      unsigned charCode; 
      NSString *xForHex = @""; 

      // Is it hex or decimal? 
      if ([scanner scanString:@"x" intoString:&xForHex]) { 
       gotNumber = [scanner scanHexInt:&charCode]; 
      } 
      else { 
       gotNumber = [scanner scanInt:(int*)&charCode]; 
      } 

      if (gotNumber) { 
       [result appendFormat:@"%C", charCode]; 

       [scanner scanString:@";" intoString:NULL]; 
      } 
      else { 
       NSString *unknownEntity = @""; 

       [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity]; 


       [result appendFormat:@"&#%@%@", xForHex, unknownEntity]; 

       //[scanner scanUpToString:@";" intoString:&unknownEntity]; 
       //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity]; 
       NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity); 

      } 

     } 
     else { 
      // Initialize so that appendString: never receives an uninitialized pointer. 
      NSString *amp = @""; 

      [scanner scanString:@"&" intoString:&amp];  // an isolated & symbol 
      [result appendString:amp]; 

      NSString *unknownEntity = @""; 
      [scanner scanUpToString:@";" intoString:&unknownEntity]; 
      NSString *semicolon = @""; 
      [scanner scanString:@";" intoString:&semicolon]; 
      [result appendFormat:@"%@%@", unknownEntity, semicolon]; 
      NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon); 

     } 

    } 
    while (![scanner isAtEnd]); 

finish: 
    return result; 
} 
+1

A note on terminology: these are not "ASCII characters"; they are "numeric character entity references". – 2010-09-14 16:52:18

+0

Aha, thanks. Do you know what I can do to get what I want? I tried reading my XML document with NSXMLParser, as Anders suggested in his answer. But that gave the same result as TBXML. – 2010-09-14 17:10:52

+0

I have now also tried MWFeedParser's stringByEncodingXMLEntities method, which handles some of the characters. But many are still left over, such as these - and so on. – 2010-09-14 17:48:05

Normally you would let NSXMLParser handle this for you. You don't need to do the conversion by hand.

If you search Google for NSXMLParser, you will find plenty of examples.
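
For illustration, here is a minimal sketch of that approach. The class name TextCollector, the helper CollectText, and the sample XML are made up for this example and do not come from the question. It relies on NSXMLParser resolving numeric character references before it calls parser:foundCharacters:, so the delegate only has to concatenate the pieces it receives.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate that gathers all text content of a document.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector

- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
    }
    return self;
}

// May be called several times for one text node, so append rather than assign.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}

@end

// Hypothetical helper: returns the concatenated text, or nil if parsing fails.
static NSString *CollectText(NSData *data) {
    TextCollector *collector = [[TextCollector alloc] init];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    parser.delegate = collector;
    return [parser parse] ? [collector.text copy] : nil;
}

int main(void) {
    @autoreleasepool {
        // Sample input (an assumption), using the encoding named in the question.
        NSString *xml = @"<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>"
                        @"<title>Hello&#44; world&#46; It&#39;s 10&#58;30</title>";
        NSData *data = [xml dataUsingEncoding:NSISOLatin1StringEncoding];
        NSLog(@"%@", CollectText(data));
    }
    return 0;
}
```

If the references still come through undecoded, as the comments above report, one possible cause (a guess, not something the question confirms) is that the feed is double-escaped, for example &amp;#44; in the raw file. In that case the parser correctly yields the literal text &#44;, and a second decoding pass such as stringByDecodingXMLEntities would still be needed on the result.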