How to extract hyperlink information with PDFBox
Question:
I am trying to extract hyperlink information from a PDF using PDFBox, but I am not sure how to get it.
for(Object p : pages) {
    PDPage page = (PDPage)p;
    List<?> annotations = page.getAnnotations();
    for(Object a : annotations) {
        PDAnnotation annotation = (PDAnnotation)a;
        if(annotation instanceof PDAnnotationLink) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());
        }
    }
}
I want to extract the hyperlink's destination URL and the hyperlink's text. How can I do this?
Thanks
Answer
Use this code, taken from the PrintURLs sample code in the source download:
int pageNum = 0;
for(PDPage page : doc.getPages())
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for(int j=0; j<annotations.size(); j++)
    {
        PDAnnotation annot = annotations.get(j);
        if(annot instanceof PDAnnotationLink)
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if(rotation == 0)
            {
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if(rotation == 90)
            {
                //do nothing
            }
            Rectangle2D.Float awtRect = new Rectangle2D.Float(x,y,width,height);
            stripper.addRegion("" + j, awtRect);
        }
    }
    stripper.extractRegions(page);
    //then print each link's text together with its URL
    for(int j=0; j<annotations.size(); j++)
    {
        PDAnnotation annot = annotations.get(j);
        if(annot instanceof PDAnnotationLink)
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion("" + j);
            if(action instanceof PDActionURI)
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println("Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI());
            }
        }
    }
}
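It works in two parts: one is getting the URL, which is easy; the other is getting the URL text, which is done by extracting the text inside the annotation's rectangle.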
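The "reposition" step is there because the annotation rectangle is given in PDF coordinates (origin at the bottom-left, y growing upward), while the stripper regions are measured from the top-left. Here is a small PDFBox-free sketch of that flip for an unrotated page. The class name `LinkRegionDemo`, the helper `toTextSpace`, and the sample numbers are made up for illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

class LinkRegionDemo {
    // Hypothetical helper: maps a PDF-space rectangle (origin bottom-left, y up)
    // to a text-space rectangle (origin top-left, y down). Assumes rotation 0.
    static Rectangle2D.Float toTextSpace(float lowerLeftX, float upperRightY,
                                         float width, float height, float pageHeight) {
        // The top edge of the link, measured down from the top of the page
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // Sample values: a 100x12 link whose top edge sits at y=720 on a 792-high page
        Rectangle2D.Float r = toTextSpace(72f, 720f, 100f, 12f, 792f);
        System.out.println(r.x + " " + r.y + " " + r.width + " " + r.height);
    }
}
```

With these sample values the region starts 72 units below the top of the page, which is the kind of rectangle that gets passed to `addRegion`.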
Answer
We needed to get both the hyperlink information and the internal links (such as jumps to another page...). I use the code below:
int pageNum = 0;
for (PDPage page : originalPDF.getPages()) {
    pageNum++;
    List<PDAnnotation> annotations = page.getAnnotations();
    for (PDAnnotation annot : annotations) {
        if (!(annot instanceof PDAnnotationLink)) {
            continue;
        }
        PDAnnotationLink link = (PDAnnotationLink) annot;
        // the action holds either an external URL or an internal jump
        PDAction action = link.getAction();
        // special case: some internal links have no action and store the destination directly
        PDDestination pDestination = link.getDestination();
        if (action instanceof PDActionURI) {
            // external link
            PDActionURI uri = (PDActionURI) action;
            System.out.println("uri link:" + uri.getURI());
        } else if (action instanceof PDActionGoTo) {
            // internal link stored in a GoTo action
            PDDestination destination = ((PDActionGoTo) action).getDestination();
            PDPageDestination pageDestination;
            if (destination instanceof PDPageDestination) {
                pageDestination = (PDPageDestination) destination;
            } else if (destination instanceof PDNamedDestination) {
                pageDestination = originalPDF.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) destination);
            } else {
                // unsupported destination type: skip this link, keep scanning the rest
                continue;
            }
            if (pageDestination != null) {
                System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
            }
        } else if (action == null && pDestination != null) {
            // internal link stored directly on the annotation
            PDPageDestination pageDestination;
            if (pDestination instanceof PDPageDestination) {
                pageDestination = (PDPageDestination) pDestination;
            } else if (pDestination instanceof PDNamedDestination) {
                pageDestination = originalPDF.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) pDestination);
            } else {
                // unsupported destination type: skip this link, keep scanning the rest
                continue;
            }
            if (pageDestination != null) {
                System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
            }
        }
    }
}
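The `+ 1` above turns the 0-based index from `retrievePageNumber()` into a page number a person would recognize. As far as I understand, that method returns -1 when the page cannot be determined, in which case the line above would print `page move: 0`. Below is a small sketch of guarding against that. A plain `int` stands in for the PDFBox call, and the names `PageLabelDemo` and `describeTarget` are hypothetical.

```java
class PageLabelDemo {
    // Hypothetical helper: turns a 0-based page index into a 1-based label,
    // treating negative values as "unknown" instead of reporting page 0.
    static String describeTarget(int zeroBasedIndex) {
        if (zeroBasedIndex < 0) {
            return "page move: unknown target";
        }
        return "page move: " + (zeroBasedIndex + 1);
    }

    public static void main(String[] args) {
        // Index 9 corresponds to the tenth page of the document
        System.out.println(describeTarget(9));
        // A negative index stands for a destination that could not be resolved
        System.out.println(describeTarget(-1));
    }
}
```

In the real loop you would pass `pageDestination.retrievePageNumber()` to the helper instead of a literal.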
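This code extracts the external links in a PDF nicely, but it does not seem to extract links to internal pages. For example, page 3 of my PDF contains a link to page 10. I need to get that information too. Any idea how to do this?
@ShiranSEkanayake Please look at the other answer. The bottom part (with PDPageDestination) should do what you want. I have not tested it, but it looks fine to me.
Thanks. It works!!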