Scala/Java - Library to parse some text and remove punctuation?
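Question:
I have a Java implementation that uses BreakIterator to remove punctuation from a string. I need to rewrite it in Scala, so I figured this might be a good opportunity to replace it with a better library (my implementation is very naive, and I'm sure it fails on edge cases).
Is there an existing library I could use for this?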
Edit: here is my quick solution in Scala:
private val getWordsFromLine = (line: String) => {
line.split(" ")
.map(_.toLowerCase())
.map(word => word.filter(Character.isLetter(_)))
.filter(_.length() > 1)
.toList
}
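Given this List[String] (one entry per line; yes, it's the Bible, which makes a good test case):
THE SECOND BOOK OF MOSES, CALLED EXODUS
CHAPTER 1 1 Now these [are] the names of the children of Israel, which came into Egypt; every man and his household came with Jacob. 2 Reuben, Simeon, Levi, and Judah, 3 Issachar, Zebulun, and Benjamin, 4 Dan, and Naphtali, Gad, and Asher.
you get a List[String] like this: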
List(the, second, book, of, moses, called, exodus, chapter, now, these, are, the, names, of, the, children, of, israel, which, came, into, egypt, every, man, and, his, household, came, with, jacob, reuben, simeon, levi, and, judah, issachar, zebulun, and, benjamin, dan, and, naphtali, gad, and, asher)
Answer
Here is an approach using regular expressions. Note that it does not yet filter out single-character words.
val s = """
THE SECOND BOOK OF MOSES, CALLED EXODUS
CHAPTER 1 1 Now these [are] the names of the children of Israel,
which came into Egypt; every man and his household came with
Jacob. 2 Reuben, Simeon, Levi, and Judah, 3 Issachar, Zebulun,
and Benjamin, 4 Dan, and Naphtali, Gad, and Asher.
"""
/* \p{L} denotes Unicode letters */
var items = """\b\p{L}+\b""".r findAllIn s
println(items.toList)
/* List(THE, SECOND, BOOK, OF, MOSES, CALLED, EXODUS,
CHAPTER, Now, these, are, the, names, of, the,
children, of, Israel, which, came, into, Egypt,
every, man, and, his, household, came, with,
Jacob, Reuben, Simeon, Levi, and, Judah,
Issachar, Zebulun, and, Benjamin, Dan, and,
Naphtali, Gad, and, Asher)
*/
/* \w denotes word characters */
items = """\b\w+\b""".r findAllIn s
println(items.toList)
/* List(THE, SECOND, BOOK, OF, MOSES, CALLED, EXODUS,
CHAPTER, 1, 1, Now, these, are, the, names, of,
the, children, of, Israel, which, came, into,
Egypt, every, man, and, his, household, came,
with, Jacob, 2, Reuben, Simeon, Levi, and, Judah,
3, Issachar, Zebulun, and, Benjamin, 4, Dan, and,
Naphtali, Gad, and, Asher)
*/
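To cover the missing single-character filter, the letter-only pattern can be combined with lowercasing and a length check, which mirrors what the quick solution in the question does. This is only a sketch: the object name `WordExtractor` and the `minLength` parameter are made up for illustration. It also leaves out the `\b` anchors, since `\p{L}+` already matches maximal runs of letters; one consequence is that letters glued to digits (for example `3Issachar`) still yield a word.

```scala
import java.util.Locale

object WordExtractor {
  // \p{L}+ matches maximal runs of Unicode letters, so digits,
  // whitespace and punctuation never end up inside a match.
  private val Word = """\p{L}+""".r

  // minLength is a hypothetical knob; 2 reproduces the question's
  // `_.length() > 1` filter.
  def words(text: String, minLength: Int = 2): List[String] =
    Word.findAllIn(text)
      .map(_.toLowerCase(Locale.ROOT)) // locale-independent lowercasing
      .filter(_.length >= minLength)   // drop single-letter tokens
      .toList
}
```

Calling `WordExtractor.words(s)` on the sample text above should give the lowercase word list from the question, without the verse numbers.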
Answer
For this particular case, I would go with a regular expression.
def toWords(lines: List[String]) = lines flatMap { line =>
"[a-zA-Z]+".r findAllIn line map (_.toLowerCase)
}
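A small usage sketch, to show where the ASCII-only character class starts to matter. The wrapper object `ToWordsDemo` and the sample lines are invented for illustration; `toWords` itself is the function above, written with explicit method-call syntax.

```scala
object ToWordsDemo {
  // Same logic as toWords above: ASCII letter runs, lowercased.
  def toWords(lines: List[String]): List[String] = lines.flatMap { line =>
    "[a-zA-Z]+".r.findAllIn(line).map(_.toLowerCase)
  }

  def main(args: Array[String]): Unit = {
    // "caf\u00e9" is "café" written with a Unicode escape.
    val lines = List("Jacob. 2 Reuben, Simeon,", "don't", "caf\u00e9")
    println(toWords(lines))
    // prints List(jacob, reuben, simeon, don, t, caf)
  }
}
```

The apostrophe splits "don't" into two tokens and the accented letter is dropped, which is fine for the sample text but is the kind of edge case where `\p{L}` or a dedicated tokenizer may be a better fit.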
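Why not use the Java implementation from Scala? The two are interoperable. You could still add some Scala niceties on top of the Java API to make it easier to use. – 2012-08-08 09:03:18
I could. I just don't want to rewrite it if I don't have to. – 2012-08-08 09:13:24
It would help if you provided an example of what you are looking for. From the current description, I think a regular expression should be able to do the job. – 2012-08-08 09:19:48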
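For completeness, the interop route suggested in the comments could look roughly like the following. This is not the asker's original Java code (which was never posted), just a minimal sketch of calling `java.text.BreakIterator` directly from Scala; the object name `BreakIteratorWords` and the default locale are assumptions.

```scala
import java.text.BreakIterator
import java.util.Locale
import scala.collection.mutable.ListBuffer

object BreakIteratorWords {
  // Walks the word boundaries reported by the JDK and keeps only the
  // segments that contain at least one letter.
  def words(text: String, locale: Locale = Locale.ENGLISH): List[String] = {
    val it = BreakIterator.getWordInstance(locale)
    it.setText(text)

    val out = ListBuffer.empty[String]
    var start = it.first()
    var end = it.next()
    while (end != BreakIterator.DONE) {
      val token = text.substring(start, end)
      // Segments made only of punctuation, whitespace or digits are skipped.
      if (token.exists(_.isLetter)) out += token.toLowerCase(locale)
      start = end
      end = it.next()
    }
    out.toList
  }
}
```

Compared with the regex answers, the boundary rules here come from the JDK rather than from a hand-written pattern, so results on unusual input may differ from the regex versions.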