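How can I get the position of specified text using xpdf or mupdf?

Question:

I want to extract some specific text, along with its position, from a PDF file. How can I get the position of a given piece of text using xpdf or mupdf?

I know that xpdf and mupdf can parse PDF files, so I think they can help me with this task.

But how do I use these two libraries to get the text position?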

What do you mean by "text position"?

@Dan D. By text position I mean the position of the first character on the page. – PDF1001

Answer

MuPDF ships with several tools, one of which is pdfdraw.

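If you run pdfdraw with the -tt option, it generates XML containing every character together with its exact position.
From there you should be able to find what you need.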
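
As a rough illustration of "finding what you need" in such XML, here is a minimal Python sketch. The element and attribute names it relies on (a `char` element with `c`, `x` and `y` attributes) are assumptions made for this example, as is the sample document; the actual layout of the XML differs between MuPDF versions, so inspect your own output and adjust the names before using it. The helper `find_text_position` is hypothetical and not part of MuPDF.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the tool's XML output.
# ASSUMPTION: each character is a <char> element carrying its text in "c"
# and its origin in "x"/"y". Real MuPDF output varies by version.
SAMPLE_XML = """<page>
  <line>
    <char c="H" x="72.0" y="100.0"/>
    <char c="i" x="80.5" y="100.0"/>
    <char c=" " x="84.0" y="100.0"/>
    <char c="P" x="88.0" y="100.0"/>
    <char c="D" x="96.0" y="100.0"/>
    <char c="F" x="104.5" y="100.0"/>
  </line>
</page>"""


def find_text_position(xml_text, target):
    """Return (x, y) of the first character of the first match, or None."""
    if not target:
        return None
    root = ET.fromstring(xml_text)
    letters = []    # one entry per character, in document order
    positions = []  # the (x, y) origin belonging to each entry in letters
    for el in root.iter("char"):
        for ch in el.get("c", ""):
            letters.append(ch)
            positions.append((float(el.get("x")), float(el.get("y"))))
    index = "".join(letters).find(target)
    if index < 0:
        return None
    return positions[index]


print(find_text_position(SAMPLE_XML, "PDF"))
```

Note that this sketch concatenates characters in document order without inserting line breaks, so a match can span two lines of the page; whether that is desirable depends on the text you are looking for.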

In newer versions it is called mudraw.c, and the trail leads to structured-text.h and stext-output.c. Very helpful, thanks. – Amoss

Answer

If you don't mind using a Python binding for MuPDF, here is a Python solution using PyMuPDF (I am one of its developers):

import json
import fitz                      # the PyMuPDF module

doc = fitz.open("input.pdf")     # PDF input file
n = 0                            # page number (0-based); pick the page you need
page = doc[n]

# A list of all words on the page, together with their position info
# (a rectangle containing the word). Older releases: page.getTextWords()
wordlist = page.get_text("words")

# Or, if you are only interested in blocks of lines belonging together
# (older releases: page.getTextBlocks()):
blocklist = page.get_text("blocks")

# If you need yet more detail, use the JSON-based output, which also gives
# images and their positions, as well as font information for the text.
tdict = json.loads(page.get_text("json"))   # older releases: page.getText("json")
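
To get from the word list to the position of one specified piece of text, something along the following lines may work. It is only a sketch: `find_phrase` and `sample_words` are names made up for this example, and the sample data is invented. It relies on each word entry being a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no), which is what page.getTextWords() (page.get_text("words") in recent PyMuPDF releases) returns.

```python
def find_phrase(words, phrase):
    """Return (x0, y0) of the first word of the first match, or None.

    words  -- sequence of (x0, y0, x1, y1, word, block_no, line_no, word_no)
    phrase -- whitespace-separated words to look for, in order
    """
    wanted = phrase.split()
    if not wanted:
        return None
    texts = [w[4] for w in words]
    # Slide a window of len(wanted) words over the page's word list.
    for i in range(len(texts) - len(wanted) + 1):
        if texts[i:i + len(wanted)] == wanted:
            # Top-left corner of the rectangle around the first matched word.
            return (words[i][0], words[i][1])
    return None


# Invented sample data in the word-tuple layout described above.
sample_words = [
    (72.0, 90.0, 100.0, 102.0, "Invoice", 0, 0, 0),
    (104.0, 90.0, 140.0, 102.0, "number:", 0, 0, 1),
    (144.0, 90.0, 180.0, 102.0, "12345", 0, 0, 2),
]

print(find_phrase(sample_words, "number: 12345"))
```

With a real document you would pass the page's word list instead of `sample_words`. The match is exact and case-sensitive, and punctuation attached to a word counts as part of it. PyMuPDF also offers a built-in page search (`page.search_for(...)` in recent releases) that returns the rectangles of each hit, which may be simpler if you do not need word-level control.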

We're on GitHub, if you're interested.
