Converting a multi-byte Unicode byte array to an NSString using a partial buffer

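Question:

In Objective-C, is there a way to convert a multi-byte Unicode byte array into an NSString that works even when the array is a partial buffer (one that does not end on a complete character boundary)?

This comes up when you receive byte buffers from a stream and want to parse the string version of the buffer, but more data is still on the way and the buffer may end in the middle of a multi-byte Unicode character.

NSString's initWithData:encoding: method does not work for this purpose, as shown below...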

Test code:

- (void)test {
    char myArray[] = {'f', 'o', 'o', (char) 0xc3, (char) 0x97, 'b', 'a', 'r'};
    size_t sizeOfMyArray = sizeof(myArray);
    [self dump:myArray sizeOfMyArray:sizeOfMyArray];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 1];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 2];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 3];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 4];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 5];
}

- (void)dump:(char[])myArray sizeOfMyArray:(size_t)sourceLength {
    // initWithData:encoding: returns nil when the bytes are not valid in the given encoding.
    NSString *string = [[NSString alloc] initWithData:[NSData dataWithBytes:myArray length:sourceLength] encoding:NSUTF8StringEncoding];
    // size_t and NSUInteger are cast to unsigned long so that they match %lu.
    NSLog(@"sourceLength: %lu bytes, string.length: %lu bytes, string :'%@'", (unsigned long)sourceLength, (unsigned long)string.length, string);
}

Output:

sourceLength: 8 bytes, string.length: 7 bytes, string :'foo×bar' 
sourceLength: 7 bytes, string.length: 6 bytes, string :'foo×ba' 
sourceLength: 6 bytes, string.length: 5 bytes, string :'foo×b' 
sourceLength: 5 bytes, string.length: 4 bytes, string :'foo×' 
sourceLength: 4 bytes, string.length: 0 bytes, string :'(null)' 
sourceLength: 3 bytes, string.length: 3 bytes, string :'foo' 

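As you can see, converting the "sourceLength: 4 bytes" byte array fails and returns (null). This is because the UTF-8 encoded '×' character (0xc3 0x97) is only partially included.

Ideally there would be a function I could call that returns the correct NSString and tells me how many bytes are "left over".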

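You largely have your own answer. If the initWithData:encoding: method returns nil, then you know there is a partial (invalid) character at the end of the buffer.

Modify dump to return an int. Then try to create the NSString in a loop. Each time you get nil, shorten the length by one byte and try again. Once you get a valid NSString, return the difference between the length you used and the length that was passed in.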
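As a rough sketch of that retry loop in plain C (which also compiles as Objective-C): `utf8_is_well_formed` is a hypothetical stand-in for "did initWithData:encoding: return non-nil?" and checks only the lead-byte/continuation-byte structure, not overlong forms or surrogates. Both function names are made up for this illustration. Because a UTF-8 character is at most four bytes long, at most three trailing bytes ever need to be dropped, so the loop stops after three retries.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a real decoder: returns true if buf[0..len) consists
 * only of complete UTF-8 sequences (structure check only). */
static bool utf8_is_well_formed(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;                               /* continuation bytes expected */
        if (c <= 0x7f)                   need = 0;
        else if (c >= 0xc2 && c <= 0xdf) need = 1;
        else if (c >= 0xe0 && c <= 0xef) need = 2;
        else if (c >= 0xf0 && c <= 0xf4) need = 3;
        else return false;                         /* stray continuation or invalid lead byte */
        if (need > len - i - 1) return false;      /* sequence is cut off by the end of the buffer */
        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xc0) != 0x80) return false;
        }
        i += need + 1;
    }
    return true;
}

/* Retry loop: drop 0, 1, 2, then 3 trailing bytes until the prefix decodes.
 * Returns the usable prefix length; returns 0 if none of the four attempts succeed. */
static size_t longest_decodable_prefix(const unsigned char *buf, size_t len)
{
    for (size_t drop = 0; drop <= 3 && drop <= len; drop++) {
        if (utf8_is_well_formed(buf, len - drop)) {
            return len - drop;
        }
    }
    return 0;
}
```

The number of left-over bytes is then `len - longest_decodable_prefix(buf, len)`. Note that every retry still re-validates the whole prefix, so the cost per call remains proportional to the buffer size.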


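Thanks... upvoted for your idea. What you say is correct, but performance is very poor in some cases, and performance is critical here because I may be processing gigabytes of data. – TJez 2014-08-28 16:12:22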

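Here is my inefficient implementation, which I don't consider a proper answer. I'll leave it here in case someone finds it useful (and in the hope that someone else will post a better answer than this!)

It is in a category on NSMutableData...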

/**
 * Removes the biggest string possible from this NSMutableData, leaving any remainder unicode half-characters behind.
 *
 * NOTE: This is a very inefficient implementation, it may require multiple parsing of the complete NSMutableData buffer,
 * it is especially inefficient when the data buffer does not contain a valid string encoding, as all lengths will be
 * attempted.
 */
- (NSString *)removeMaximumStringUsingEncoding:(NSStringEncoding)encoding {
    if (self.length > 0) {
        // Quick test for the case where the whole buffer can be used (is common case, and doesn't require NSData manipulation).
        NSString *result = [[NSString alloc] initWithData:self encoding:encoding];
        if (result != nil) {
            self.length = 0; // Simple case, we used the whole buffer.
            return result;
        }

        // Try to find the largest subData that is a valid string.
        for (NSUInteger subDataLength = self.length - 1; subDataLength > 0; subDataLength--) {
            NSRange subDataRange = NSMakeRange(0, subDataLength);
            result = [[NSString alloc] initWithData:[self subdataWithRange:subDataRange] encoding:encoding];
            if (result != nil) {
                // Delete the bytes we used from our buffer, leave the remainder.
                [self replaceBytesInRange:subDataRange withBytes:NULL length:0];
                return result;
            }
        }
    }
    return @"";
}

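I ran into this problem a while ago and then forgot about it, so this was a chance to revisit it. The code below was written using the information on the UTF-8 page on Wikipedia. It is a category on NSData.

It inspects the data from the end, and only the last four bytes, because the OP said it could be gigabytes of data. Otherwise it would be simpler to walk the UTF-8 bytes from the beginning.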

/*
Return the range of valid UTF-8 encoded text by
removing a partial trailing multi-byte character.
It assumes that all the bytes are valid UTF-8 encoded characters,
e.g. it doesn't raise a flag if a continuation byte is preceded
by a single-byte character.
*/
- (NSRange)rangeOfUTF8WithoutPartialTrailingMultibytes
{
    NSRange validRange = {0, 0};

    // Copy at most the last four bytes; a UTF-8 character is never longer than that.
    NSUInteger trailLength = MIN([self length], 4U);
    unsigned char trail[4];
    [self getBytes:trail
             range:NSMakeRange([self length] - trailLength, trailLength)];

    unsigned multibyteCount = 0;

    // Walk backwards until the start of the last character is found.
    for (NSInteger i = (NSInteger)trailLength - 1; i >= 0; i--) {
        if (isUTF8SingleByte(trail[i])) {
            validRange = NSMakeRange(0, [self length] - trailLength + i + 1);
            break;
        }

        if (isUTF8ContinuationByte(trail[i])) {
            multibyteCount++;
            continue;
        }

        if (isUTF8StartByte(trail[i])) {
            multibyteCount++;
            if (multibyteCount == lengthForUTF8StartByte(trail[i])) {
                // The last character is complete: keep everything.
                validRange = NSMakeRange(0, [self length] - trailLength + i + multibyteCount);
            }
            else {
                // The last character is cut off: stop just before its start byte.
                validRange = NSMakeRange(0, [self length] - trailLength + i);
            }
            break;
        }
    }
    return validRange;
}

Here are the static functions used in the method:

static BOOL isUTF8SingleByte(const unsigned char c) 
{ 
    return c <= 0x7f; 
} 

static BOOL isUTF8ContinuationByte(const unsigned char c) 
{ 
    return (c >= 0x80) && (c <= 0xbf); 
} 

static BOOL isUTF8StartByte(const unsigned char c) 
{ 
    return (c >= 0xc2) && (c <= 0xf4); 
} 

static BOOL isUTF8InvalidByte(const unsigned char c) 
{ 
    return (c == 0xc0) || (c == 0xc1) || (c > 0xf4); 
} 

static unsigned lengthForUTF8StartByte(const unsigned char c) 
{ 
    if ((c >= 0xc2) && (c <= 0xdf)) { 
     return 2; 
    } 
    else if ((c >= 0xe0) && (c <= 0xef)) { 
     return 3; 
    } 
    else if ((c >= 0xf0) && (c <= 0xf4)) { 
     return 4; 
    } 
    return 1; 
}
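
For comparison, the same tail check can be sketched without Foundation, as a plain C function over a raw byte pointer (C also compiles as Objective-C). The name `utf8_complete_prefix_length` is made up for this illustration. Like the category method, it assumes the data is valid UTF-8 apart from a possibly truncated last character. Unlike the category method, it returns the full length for malformed tails, so that the decoder is the one to report the error.

```c
#include <stddef.h>

/* Returns how many leading bytes of buf end on a complete UTF-8 character.
 * Only the last four bytes are inspected, so the cost does not depend on len. */
static size_t utf8_complete_prefix_length(const unsigned char *buf, size_t len)
{
    size_t seen = 0;                          /* bytes inspected so far, counted from the end */
    while (seen < len && seen < 4) {
        unsigned char c = buf[len - 1 - seen];
        seen++;
        if ((c & 0xc0) == 0x80) {
            continue;                         /* continuation byte: keep walking backwards */
        }
        size_t expected;                      /* total length announced by this first byte */
        if (c <= 0x7f)                   expected = 1;
        else if (c >= 0xc2 && c <= 0xdf) expected = 2;
        else if (c >= 0xe0 && c <= 0xef) expected = 3;
        else if (c >= 0xf0 && c <= 0xf4) expected = 4;
        else return len;                      /* invalid lead byte: leave it to the decoder */
        /* Complete character: keep everything. Truncated: cut before its first byte. */
        return (seen >= expected) ? len : len - seen;
    }
    return len;                               /* empty buffer, or no lead byte in the last four bytes */
}
```

With the byte array from the question, a 4-byte prefix gives 3 (the lone 0xc3 is held back) and a 5-byte prefix gives 5. The caller would decode the first `n` bytes and carry the remaining `len - n` bytes over to the front of the next buffer.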