Find the String.Index that lies on a character boundary, from any UTF-16 offset

Question:

My goal: given an arbitrary UTF-16 position in a String, find the corresponding String.Index that represents the Character (i.e. the extended grapheme cluster) that the specified UTF-16 code unit is part of.

Example:

(I put the code in a Gist for easy copying and pasting.)

This is my test string:

let str = "👨🏾‍🚒"

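(Note: to see the string as a single character, you need to read this on a reasonably recent OS/browser combination that can handle the new profession emoji with skin tones introduced in Unicode 9.)

It is a single Character (grapheme cluster) that consists of four Unicode scalars, or 7 UTF-16 code units: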

print(str.unicodeScalars.map { "0x\(String($0.value, radix: 16))" }) 
// → ["0x1f468", "0x1f3fe", "0x200d", "0x1f692"] 
print(str.utf16.map { "0x\(String($0, radix: 16))" }) 
// → ["0xd83d", "0xdc68", "0xd83c", "0xdffe", "0x200d", "0xd83d", "0xde92"] 
print(str.utf16.count) 
// → 7

Given an arbitrary UTF-16 offset (say, 2), I can create a corresponding String.Index:

let utf16Offset = 2 
let utf16Index = String.Index(encodedOffset: utf16Offset)

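I can subscript the string with this index, but if the index doesn't fall on a Character boundary, the Character returned by the subscript might not cover the entire grapheme cluster: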

let char = str[utf16Index] 
print(char) 
// → 🏾‍🚒
print(char.unicodeScalars.map { "0x\(String($0.value, radix: 16))" }) 
// → ["0x1f3fe", "0x200d", "0x1f692"]

Or the subscript operation might even trap (I'm not sure whether this is the intended behavior):

let trappingIndex = String.Index(encodedOffset: 1) 
str[trappingIndex] 
// fatal error: Can't form a Character from a String containing more than one extended grapheme cluster

You can test whether an index falls on a Character boundary:

extension String.Index {
    func isOnCharacterBoundary(in str: String) -> Bool {
        return String.Index(self, within: str) != nil
    }
}

trappingIndex.isOnCharacterBoundary(in: str) 
// → false (as expected) 
utf16Index.isOnCharacterBoundary(in: str) 
// → true (WTF!)

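Problem:

I think the problem is that this last expression returns true. The documentation for String.Index.init(_:within:) says:

"If the index passed as sourcePosition represents the start of an extended grapheme cluster - the element type of a string - then the initializer succeeds."

Here, utf16Index does not represent the start of an extended grapheme cluster: the grapheme cluster starts at offset 0, not at offset 2. Yet the initializer succeeds.

As a result, all my attempts to find the start of the grapheme cluster by repeatedly decrementing the index's encodedOffset and testing isOnCharacterBoundary fail.

Am I overlooking something? Is there another way to test whether an index falls on the start of a Character? Is this a bug in Swift?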

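My environment: Swift 4.0/Xcode 9.0 on macOS 10.13.

Update: Check out the interesting Twitter thread about this question.

Update: I reported the behavior of String.Index.init?(_:within:) in Swift 4.0 as a bug: SR-5992.

Comment: It seems that String.Index(_:within:) does not treat emoji sequences as a single grapheme cluster (even though Swift 4 is based on Unicode 9).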

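Answer

A possible solution, using the rangeOfComposedCharacterSequence(at:) method:

import Foundation  // rangeOfComposedCharacterSequence(at:) comes from Foundation

extension String {
    /// Returns the index of the Character that contains the given UTF-16 offset,
    /// or nil if the offset is out of bounds.
    func index(utf16Offset: Int) -> String.Index? {
        guard utf16Offset >= 0 && utf16Offset < utf16.count else { return nil }
        let idx = String.Index(encodedOffset: utf16Offset)
        // Expand to the enclosing composed character sequence and take its start.
        let range = rangeOfComposedCharacterSequence(at: idx)
        return range.lowerBound
    }
}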

Example:

let str = "a👨🏾‍🚒b🇩🇪c😀d👨‍👩‍👧‍👦e"
for utf16Offset in 0..<str.utf16.count {
    if let idx = str.index(utf16Offset: utf16Offset) {
        print(utf16Offset, str[idx])
    }
}

Output:

0 a
1 👨🏾‍🚒
2 👨🏾‍🚒
3 👨🏾‍🚒
4 👨🏾‍🚒
5 👨🏾‍🚒
6 👨🏾‍🚒
7 👨🏾‍🚒
8 b
9 🇩🇪
10 🇩🇪
11 🇩🇪
12 🇩🇪
13 c
14 😀
15 😀
16 d
17 👨‍👩‍👧‍👦
18 👨‍👩‍👧‍👦
19 👨‍👩‍👧‍👦
20 👨‍👩‍👧‍👦
21 👨‍👩‍👧‍👦
22 👨‍👩‍👧‍👦
23 👨‍👩‍👧‍👦
24 👨‍👩‍👧‍👦
25 👨‍👩‍👧‍👦
26 👨‍👩‍👧‍👦
27 👨‍👩‍👧‍👦
28 e
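
If pulling in Foundation is not an option, the same lookup can probably be done with the standard library alone. What follows is only a sketch, not part of the answer above: the name characterIndex(containingUTF16Offset:) is made up for illustration. It walks the string's Character indices and counts UTF-16 code units until the requested offset is covered. It runs in O(n), and it depends on the standard library's own grapheme breaking, so the results for emoji sequences may vary with the Swift version.

```swift
extension String {
    /// Hypothetical helper (an assumption, not from the answer): returns the
    /// index of the Character whose UTF-16 code units include `utf16Offset`,
    /// or nil if the offset is out of range.
    func characterIndex(containingUTF16Offset utf16Offset: Int) -> String.Index? {
        guard utf16Offset >= 0 else { return nil }
        var consumed = 0  // UTF-16 code units covered by the Characters visited so far
        for i in indices {
            consumed += String(self[i]).utf16.count
            if utf16Offset < consumed { return i }
        }
        return nil  // offset is at or past utf16.count
    }
}

let sample = "a😀b"  // UTF-16 code units: "a" = 1, "😀" = 2, "b" = 1
for offset in 0..<sample.utf16.count {
    if let i = sample.characterIndex(containingUTF16Offset: offset) {
        print(offset, sample[i])
    }
}
// prints:
// 0 a
// 1 😀
// 2 😀
// 3 b
```

The linear scan is the price of never constructing an index from a raw offset, which is exactly the step that behaves unexpectedly in the question.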

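Comment: Thanks! That's a very nice solution that I hadn't thought of. It even works with indices coming from the UTF-8 view, so it isn't limited to UTF-16 offsets.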
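
The question also asks how to test whether an index falls on the start of a Character. One way that avoids String.Index.init(_:within:) entirely is to compare the index against the string's own Character indices. This is a minimal sketch under that assumption; isCharacterBoundary(_:) is a hypothetical name, the check is O(n), and it again relies on the standard library's grapheme breaking.

```swift
extension String {
    /// Hypothetical helper (an assumption, not from the thread): true if
    /// `index` is the start of a Character in this string, or is endIndex.
    func isCharacterBoundary(_ index: String.Index) -> Bool {
        return index == endIndex || indices.contains(index)
    }
}

// "é" written as "e" + combining acute accent:
// one Character, two UTF-16 code units.
let text = "e\u{301}"
// Index of the second UTF-16 code unit, i.e. the combining accent.
let middle = text.utf16.index(text.utf16.startIndex, offsetBy: 1)

print(text.isCharacterBoundary(text.startIndex))  // true
print(text.isCharacterBoundary(middle))           // false
```

The example uses a combining accent rather than an emoji sequence because that cluster has been a single Character in every Swift version.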
