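Counting images in a PDF using PDFBox

Question:

I need to extract text from a PDF to verify some content, and count the number of images in the PDF document, using Java. With the getText function below I can get the text content without any problem, but I can't find a way to count only the image objects. The code below lets me count all objects, but I can't find any documentation on how to count just the images. Any ideas would be greatly appreciated. Thanks.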

static String getText(File pdfFile) throws IOException {
    // try-with-resources closes the document even if text extraction fails
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        return new PDFTextStripper().getText(doc);
    }
}

static void countImages(File pdfFile) throws IOException {
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        // Counts every COS object in the file, not just images
        List<COSObject> myObjects = doc.getDocument().getObjects();
        System.out.println("Count: " + myObjects.size());
    }
}

Answer

A quick and dirty solution would be this:

static void countImages(File pdfFile) throws IOException{ 
    PDDocument doc = PDDocument.load(pdfFile); 
    PDResources res = doc.getDocumentCatalog().getPages().getResources(); 

    int numImg = 0; 
    for (PDXObject xobject : res.getXObjects().values()) { 
     if (xobject instanceof PDXObjectImage) { 
      numImg++; 
     } 
    } 
    System.out.println("Count: " + numImg); 

    doc.close(); 
}

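This ignores quite a lot of places where images can be. Also, there is no guarantee that images in the page resources are actually used on the page. – mkl

@mkl Interesting but vague. Why don't you share your pearls of wisdom and post a better answer? I only use this with two test cases, to make sure that a specific PDF does or does not contain images. Since it works reliably for that, I have not dug deeper into the topic.

@Würgspaß Have a look at the ExtractImages source code, either in the source code download or at https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup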
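To make mkl's objection concrete, here is a minimal sketch in plain Java. It does not use PDFBox at all: the `XObject` type, the `"INLINE"` marker and the names `Im1`, `Im2`, `Im3`, `Fm1` are all hypothetical stand-ins for a page's resource dictionary and content stream. It only illustrates why "images listed in the page resources" and "images actually drawn" can differ: a listed image may never be drawn, an image may sit inside a form XObject, and inline images are not listed in the resources at all.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types are PDFBox classes.
class ResourceCountSketch {

    /** Either an image, or a form with its own resources and drawing operations. */
    static final class XObject {
        final boolean isImage;
        final Map<String, XObject> resources; // empty for images
        final List<String> content;           // drawn names or "INLINE"; empty for images

        XObject(boolean isImage, Map<String, XObject> resources, List<String> content) {
            this.isImage = isImage;
            this.resources = resources;
            this.content = content;
        }
    }

    static XObject image() {
        return new XObject(true, Collections.<String, XObject>emptyMap(),
                Collections.<String>emptyList());
    }

    static XObject form(Map<String, XObject> resources, String... content) {
        return new XObject(false, resources, Arrays.asList(content));
    }

    /** Naive count: every image listed in the top-level resources, drawn or not. */
    static int countListed(Map<String, XObject> resources) {
        int n = 0;
        for (XObject x : resources.values()) {
            if (x.isImage) {
                n++;
            }
        }
        return n;
    }

    /** Counts images that are drawn: descends into forms, counts inline images,
     *  and counts each image XObject once (by object identity). */
    static int countDrawn(Map<String, XObject> resources, List<String> content,
                          Set<XObject> seen) {
        int n = 0;
        for (String op : content) {
            if (op.equals("INLINE")) {
                n++;                       // inline images have no resource entry
                continue;
            }
            XObject x = resources.get(op);
            if (x == null) {
                continue;                  // unknown name, nothing to draw
            }
            if (x.isImage) {
                if (seen.add(x)) {
                    n++;
                }
            } else {
                n += countDrawn(x.resources, x.content, seen);
            }
        }
        return n;
    }

    /** Sample page: Im1 drawn twice, Im2 never drawn, Im3 inside form Fm1, one inline image. */
    static int[] demo() {
        Map<String, XObject> formResources = new LinkedHashMap<>();
        formResources.put("Im3", image());

        Map<String, XObject> pageResources = new LinkedHashMap<>();
        pageResources.put("Im1", image());
        pageResources.put("Im2", image());
        pageResources.put("Fm1", form(formResources, "Im3"));

        List<String> pageContent = Arrays.asList("Im1", "Im1", "Fm1", "INLINE");
        return new int[] {
            countListed(pageResources),
            countDrawn(pageResources, pageContent, new HashSet<XObject>())
        };
    }
}
```

For the sample page, the resource-based count is 2 (Im1 and the unused Im2), while the drawn count is 3 (Im1, Im3 inside the form, and the inline image). The toy model does not guard against forms that reference themselves; a real implementation would need to.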

Answer

This should do the trick:

public static void main(String[] args) throws IOException {
    int numImages = 0;

    // try-with-resources closes the document once counting is done
    try (PDDocument document = PDDocument.load(new File(""))) {
        for (int i = 0; i < document.getNumberOfPages(); i++) {
            PDPage page = document.getPage(i);

            // One engine per page; it counts the images drawn by that page
            CountImages countImages = new CountImages(page);
            countImages.processPage(page);

            numImages += countImages.numImages;
        }
    }

    System.out.println(numImages);
}

static class CountImages extends PDFGraphicsStreamEngine { 
    public int numImages = 0; 
    private final Set<COSStream> duplicates = new HashSet<>(); 

    protected CountImages(PDPage page) throws IOException 
    { 
     super(page); 
    } 

    @Override 
    public void appendRectangle(Point2D pd, Point2D pd1, Point2D pd2, Point2D pd3) throws IOException { 
    } 

    @Override 
    public void drawImage(PDImage pdImage) throws IOException { 
     if (pdImage instanceof PDImageXObject) {
      PDImageXObject xobject = (PDImageXObject)pdImage;

      // Set.add returns false if this stream was already counted on this page
      if (duplicates.add(xobject.getCOSObject())) {
       numImages++;
      }
     } else {
      numImages++; // not an XObject, so it is an inline image
     }
    } 

    @Override 
    public void clip(int i) throws IOException { 
    } 

    @Override 
    public void moveTo(float f, float f1) throws IOException { 
    } 

    @Override 
    public void lineTo(float f, float f1) throws IOException { 
    } 

    @Override 
    public void curveTo(float f, float f1, float f2, float f3, float f4, float f5) throws IOException { 
    } 

    @Override 
    public Point2D getCurrentPoint() throws IOException { 
     return new Point2D.Float(0, 0); 
    } 

    @Override 
    public void closePath() throws IOException { 
    } 

    @Override 
    public void endPath() throws IOException { 
    } 

    @Override 
    public void strokePath() throws IOException { 
    } 

    @Override 
    public void fillPath(int i) throws IOException { 
    } 

    @Override 
    public void fillAndStrokePath(int i) throws IOException { 
    } 

    @Override 
    public void shadingFill(COSName cosn) throws IOException { 
    } 
}
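
One detail worth noting about the code above: the `duplicates` set belongs to each `CountImages` instance, and a new instance is created for every page, so an image XObject that is reused on several pages (a logo, for example) is counted once per page. Whether that is what you want depends on the use case. The following plain-Java sketch, which does not use PDFBox and models pages as lists of hypothetical integer ids standing in for image streams, shows the difference between per-page and per-document de-duplication.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration; the integer ids are hypothetical stand-ins for image streams.
class DedupScopeSketch {

    /** Mirrors the answer above: a fresh set per page, so reuse across pages counts again. */
    static int countPerPage(List<List<Integer>> pages) {
        int total = 0;
        for (List<Integer> page : pages) {
            Set<Integer> seenOnPage = new HashSet<>(page);
            total += seenOnPage.size();
        }
        return total;
    }

    /** One set shared by all pages: each distinct image counts once per document. */
    static int countPerDocument(List<List<Integer>> pages) {
        Set<Integer> seen = new HashSet<>();
        for (List<Integer> page : pages) {
            seen.addAll(page);
        }
        return seen.size();
    }

    /** Three pages; image 1 appears on every page and twice on the first. */
    static List<List<Integer>> samplePages() {
        return Arrays.asList(
                Arrays.asList(1, 1, 2),
                Arrays.asList(1, 3),
                Arrays.asList(1));
    }
}
```

With the sample pages, per-page de-duplication gives 5 and per-document de-duplication gives 3. In the PDFBox code, the equivalent change would presumably be to create the set once and hand the same instance to every `CountImages`; that variant is a suggestion and was not part of the original answer.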
