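Tesseract: IndexOutOfBoundsException from the OCR method
Problem description:
I am working on a Spring-MVC application that uses Tesseract for OCR. I am getting an IndexOutOfBoundsException for the file I pass in. Any ideas?
Error log: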
net.sourceforge.tess4j.TesseractException: java.lang.IndexOutOfBoundsException
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:215)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:196)
    at com.tooltank.spring.service.GroupAttachmentsServiceImpl.testOcr(GroupAttachmentsServiceImpl.java:839)
    at com.tooltank.spring.service.GroupAttachmentsServiceImpl.lambda$addAttachment$0(GroupAttachmentsServiceImpl.java:447)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException
    at javax.imageio.stream.FileCacheImageOutputStream.seek(FileCacheImageOutputStream.java:170)
    at net.sourceforge.tess4j.util.ImageIOHelper.getImageByteBuffer(ImageIOHelper.java:297)
    at net.sourceforge.tess4j.Tesseract.setImage(Tesseract.java:397)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:290)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:212)
    ... 4 more
Code:
private String testOcr(String fileLocation, int attachId) {
    try {
        File imageFile = new File(fileLocation);
        BufferedImage img = ImageIO.read(imageFile);
        BufferedImage blackNWhite = new BufferedImage(img.getWidth(), img.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        Graphics2D graphics = blackNWhite.createGraphics();
        graphics.drawImage(img, 0, 0, null);
        String identifier = String.valueOf(new BigInteger(130, random).toString(32));
        String blackAndWhiteImage = previewPath + identifier + ".png";
        File outputfile = new File(blackAndWhiteImage);
        ImageIO.write(blackNWhite, "png", outputfile);
        ITesseract instance = new Tesseract();
        // Point to one folder above tessdata directory, must contain training data
        instance.setDatapath("/usr/share/tesseract-ocr/");
        // ISO 639-3 standard
        instance.setLanguage("deu");
        String result = instance.doOCR(outputfile);
        result = result.replaceAll("[^a-zA-Z0-9öÖäÄüÜß@\\s]", "");
        Files.delete(new File(blackAndWhiteImage).toPath());
        GroupAttachments groupAttachments = this.groupAttachmentsDAO.getAttachmenById(attachId);
        System.out.println("OCR Result is "+result);
        if (groupAttachments != null) {
            saveIndexes(result, groupAttachments.getFileName(), null, groupAttachments.getGroupId(), false, attachId);
        }
        return result;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
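Thanks.
Answer
Because of a bug in Java Image IO (fixed in Java 9), the current version of the Java Tesseract wrapper (3.4.0 at the time this answer was written) does not work with Java versions below 9. To use it with an older Java version, you can try the following fix to the Tesseract ImageIOHelper class. Just make a copy of the class in your project and apply the necessary changes; it should then handle both files and BufferedImages without problems.
Note: this version does not use the TIFF optimization used in the original class. You can add it back if your project needs it.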
public static ByteBuffer getImageByteBuffer(RenderedImage image) throws IOException {
    // A BufferedImage can be converted directly
    if (image instanceof BufferedImage) {
        return convertImageData((BufferedImage) image);
    }
    // Otherwise copy the pixel data and properties into a new BufferedImage first
    ColorModel cm = image.getColorModel();
    int width = image.getWidth();
    int height = image.getHeight();
    WritableRaster raster = cm
            .createCompatibleWritableRaster(width, height);
    boolean isAlphaPremultiplied = cm.isAlphaPremultiplied();
    Hashtable<String, Object> properties = new Hashtable<>();
    String[] keys = image.getPropertyNames();
    if (keys != null) {
        for (int i = 0; i < keys.length; i++) {
            properties.put(keys[i], image.getProperty(keys[i]));
        }
    }
    BufferedImage result = new BufferedImage(cm, raster,
            isAlphaPremultiplied, properties);
    image.copyData(raster);
    return convertImageData(result);
}
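To make the copying step easier to try in isolation, here is a minimal, self-contained sketch of only the RenderedImage-to-BufferedImage part. It uses nothing from tess4j; the class name RenderedImageCopy and its two methods are hypothetical helpers made up for illustration. In a patched ImageIOHelper, the resulting BufferedImage would then be passed on to convertImageData.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ColorModel;
import java.awt.image.RenderedImage;
import java.awt.image.WritableRaster;
import java.util.Hashtable;

// Hypothetical helper, not part of tess4j.
class RenderedImageCopy {

    /** Returns the image itself if it is already a BufferedImage, otherwise a copy. */
    static BufferedImage toBufferedImage(RenderedImage image) {
        if (image instanceof BufferedImage) {
            return (BufferedImage) image;
        }
        return copy(image);
    }

    /** Always copies pixel data and properties into a new BufferedImage. */
    static BufferedImage copy(RenderedImage image) {
        ColorModel cm = image.getColorModel();
        WritableRaster raster =
                cm.createCompatibleWritableRaster(image.getWidth(), image.getHeight());

        // Carry over any image properties; getPropertyNames() may return null.
        Hashtable<String, Object> properties = new Hashtable<>();
        String[] keys = image.getPropertyNames();
        if (keys != null) {
            for (String key : keys) {
                properties.put(key, image.getProperty(key));
            }
        }

        // Fill the new raster with the source pixels, then wrap it.
        image.copyData(raster);
        return new BufferedImage(cm, raster, cm.isAlphaPremultiplied(), properties);
    }
}
```

Calling RenderedImageCopy.copy(img) on an image and comparing a few pixels with getRGB is a quick way to confirm that the copy preserves the data before wiring it into the OCR call.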
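Answer
Try upgrading to tess4j version 3.4.1. That solved the problem for me.
So I should replace the getImageByteBuffer method in ImageIOHelper with the code you provided. How do I then call the OCR method? Thanks.
Just add the fixed copy to the classpath and call Tesseract the usual way; your fixed copy will be used ahead of the library's copy. – ruhsuzbaykus
Sorry, that didn't work; same exception. I put the file in a different package and then added that package under Module Settings -> Modules -> Dependencies in IntelliJ 13.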
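A note on the last comment: the JVM identifies a class by its fully qualified name, so a copy placed in a different package is a different class, and tess4j will keep calling its own net.sourceforge.tess4j.util.ImageIOHelper. For the override to have a chance of working, the copy presumably has to keep that exact package and class name and come before the tess4j jar on the classpath (in a typical servlet container, WEB-INF/classes is normally searched before the jars in WEB-INF/lib). Below is a small sketch, not part of tess4j, for checking where a class was actually loaded from; ClassOrigin and originOf are names made up for this example.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of tess4j.
class ClassOrigin {

    /** Returns the location a class was loaded from, or "bootstrap/unknown". */
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // JDK core classes and some dynamically defined classes have no code source.
            return "bootstrap/unknown";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults to the tess4j helper class named in the stack trace.
        String name = args.length > 0
                ? args[0]
                : "net.sourceforge.tess4j.util.ImageIOHelper";
        System.out.println(name + " loaded from " + originOf(Class.forName(name)));
    }
}
```

If the printed location is the tess4j jar rather than your project's output directory, the library's original class is still the one in use.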