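How to extract hyperlink information with PDFBox

Problem description:

I am trying to extract hyperlink information from a PDF using PDFBox, but I am not sure how to get it.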

for (Object p : pages) {
    PDPage page = (PDPage) p;

    List<?> annotations = page.getAnnotations();
    for (Object a : annotations) {
        PDAnnotation annotation = (PDAnnotation) a;

        if (annotation instanceof PDAnnotationLink) {
            PDAnnotationLink link = (PDAnnotationLink) annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());
        }
    }
}

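I want to extract the hyperlink's target URL and the hyperlink's text. How can I do that?

Thanks

Use this code, taken from the PrintURLs sample code in the source download: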

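// Imports for this fragment (PDFBox 2.x package names)
import java.awt.geom.Rectangle2D;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// doc is an already loaded PDDocument
int pageNum = 0;
for (PDPage page : doc.getPages())
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    // first set up one text extraction region per link annotation
    for (int j = 0; j < annotations.size(); j++)
    {
        PDAnnotation annot = annotations.get(j);
        if (annot instanceof PDAnnotationLink)
        {
            PDAnnotationLink link = (PDAnnotationLink) annot;
            PDRectangle rect = link.getRectangle();
            // need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if (rotation == 0)
            {
                // flip y: PDF origin is bottom-left, text space origin is top-left
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if (rotation == 90)
            {
                // do nothing
            }

            Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
            stripper.addRegion("" + j, awtRect);
        }
    }

    stripper.extractRegions(page);

    // then pair the text found in each region with the link's URI
    for (int j = 0; j < annotations.size(); j++)
    {
        PDAnnotation annot = annotations.get(j);
        if (annot instanceof PDAnnotationLink)
        {
            PDAnnotationLink link = (PDAnnotationLink) annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion("" + j);
            if (action instanceof PDActionURI)
            {
                PDActionURI uri = (PDActionURI) action;
                System.out.println("Page " + pageNum + ":'" + urlText.trim() + "'=" + uri.getURI());
            }
        }
    }
}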

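It works in two parts. The first is getting the URL, which is easy. The second is getting the URL text, which is done by extracting the text inside the annotation's rectangle.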
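The rectangle repositioning is the least obvious step, so here is a minimal sketch of it in isolation, using only the JDK. It mirrors the rotation == 0 branch of the code above: the annotation rectangle is measured from the bottom-left of the page, while the extraction region is measured from the top-left, so the top edge becomes the page height minus the rectangle's upper-right y. The class name LinkRegionSketch and the method toTextSpace are hypothetical names for illustration and are not part of PDFBox; the page height of 792 is just an assumed US Letter page.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration only; not part of PDFBox.
class LinkRegionSketch {

    // Converts a rectangle given in PDF page coordinates (origin at bottom-left)
    // into a region with its origin at top-left, for an unrotated page.
    static Rectangle2D.Float toTextSpace(float lowerLeftX, float upperRightY,
                                         float width, float height, float pageHeight) {
        // The region's top edge is the distance from the top of the page
        // down to the rectangle's upper edge.
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // Assumed values: a 128 x 12 link whose upper edge is at y = 712
        // on a page that is 792 units high.
        Rectangle2D.Float region = toTextSpace(72f, 712f, 128f, 12f, 792f);
        System.out.println(region);
    }
}
```

In the real code the four inputs would come from link.getRectangle() and page.getMediaBox(). Rotated pages are not covered by this sketch, just as the sample leaves them unhandled.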

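This code extracts the external links in a PDF nicely, but it does not seem to extract internal page links. For example, page 3 of my PDF contains a link to page 10, and I need that information too. Any idea how to do this?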

@ShiranSEkanayake Please see the other answer. The bottom part (with PDPageDestination) should do what you want. I have not tested it, but it looks fine to me.

Thanks. It works!

We need to get both the hyperlink information and the internal links (such as jumps to another page). I use the code below:

int pageNum = 0;
for (PDPage page : originalPDF.getPages()) {
    pageNum++;
    for (PDAnnotation annot : page.getAnnotations()) {
        if (!(annot instanceof PDAnnotationLink)) {
            continue;
        }
        PDAnnotationLink link = (PDAnnotationLink) annot;
        // the action covers both URI links and GoTo (internal) links
        PDAction action = link.getAction();

        PDDestination destination;
        if (action instanceof PDActionURI) {
            // external link: print the URI
            System.out.println("uri link:" + ((PDActionURI) action).getURI());
            continue;
        } else if (action instanceof PDActionGoTo) {
            // internal link expressed as a GoTo action
            destination = ((PDActionGoTo) action).getDestination();
        } else if (action == null) {
            // special case: no action, the destination is set on the link itself
            destination = link.getDestination();
        } else {
            // other action types are not handled
            continue;
        }

        // resolve the destination to a page destination
        PDPageDestination pageDestination = null;
        if (destination instanceof PDPageDestination) {
            pageDestination = (PDPageDestination) destination;
        } else if (destination instanceof PDNamedDestination) {
            pageDestination = originalPDF.getDocumentCatalog()
                    .findNamedDestinationPage((PDNamedDestination) destination);
        }
        // unknown destination types are skipped instead of aborting the whole page

        if (pageDestination != null) {
            // retrievePageNumber() is zero-based, so add 1 for a human-readable page number
            System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
        }
    }
}