为什么这个正则表达式匹配块需要很长时间才能完成?
我有一个程序使用PDF文档的正则表达式提取文本。为什么这个正则表达式匹配块需要很长时间才能完成?
我的问题是匹配块需要较长时间对于某些PDF文件执行...
这是代码:
String title = "(?s)\\(54\\)\\s*([\\w\\s,-]+)|(?s)\\[54\\]\\s*([\\w\\s,-]+)";
String in ="((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\[\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))";
String as ="((?s)\\(\\d\\d\\)\\s+Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Notice:))|((?s)\\(\\d\\d\\)\\s+Assignee:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Notice:))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+)(?=Notice:))";
String app_no ="(?s)\\(21\\)\\s*([\\w\\s,.://-]+)|(?s)\\[21\\]\\s*([\\w\\s,.://-]+)";
String filed ="((?s)\\(22\\)\\s*([\\w\\s,.://-]+))|((?s)\\(22\\)\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))";
String term ="((?s)\\s*Term\\s*([\\w\\s,.://-]+))|((?s)\\s*Term\\s*([\\w\\s,.://-]+))";
String pat_no = "(?s)\\s*Patent No\\.\\:\\s*([\\w\\d\\s,.://-]+)|(?s)\\s*Patent Number\\:\\s*([\\w\\d\\s,.://-]+)";
String pat_dt = "(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventor:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\[\\d*\\]\\s+Inventor:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)";
String region = stripper.getTextForRegion("class1");
String regiont = stripper.getTextForRegion("class2");
Pattern p = Pattern.compile(in);
Matcher m = p.matcher(region);
Pattern p2 = Pattern.compile(as);
Matcher m2 = p2.matcher(region);
Pattern p3 = Pattern.compile(title);
Matcher m3 = p3.matcher(region);
Pattern p4 = Pattern.compile(pat_no);
Matcher m4 = p4.matcher(regiont);
Pattern p5 = Pattern.compile(app_no);
Matcher m5 = p5.matcher(region);
Pattern p6 = Pattern.compile(filed);
Matcher m6 = p6.matcher(region);
Pattern p7 = Pattern.compile(pat_dt);
Matcher m7 = p7.matcher(regiont);
long TIMEOUT = 15000l; // 15 seconds
long now = System.currentTimeMillis(); // init the long just above the while
System.out.println("find start");
while(m.find())
{
// System.out.println(m.group());
}
Long nowtime = System.currentTimeMillis() ;
while(m2.find())
{
// System.out.println(m2.group());
}
while(m3.find() && (System.currentTimeMillis() - now) < TIMEOUT)
{
// System.out.println(m3.group());
patit = m3.group().replace("(54)", " ");
patit = patit.trim();
// System.out.println("m3");
}
while(m4.find() && (System.currentTimeMillis() - now) < TIMEOUT)
{
// System.out.println(m4.group());
patno = m4.group().replace("Patent No.: ", " ");
patno = patno.replace("Patent No: ", " ");
patno = patno.replace("Patent", " ");
patno = patno.replace("No.:", " ");
patno = patno.replace("No:", " ");
patno = patno.replace("Number: ", " ");
patno = patno.replace("Number.: ", " ");
patno = patno.trim();
// System.out.println("m4");
}
while(m5.find() && (System.currentTimeMillis() - now) < TIMEOUT)
{
// System.out.println(m5.group());
appno = m5.group().replace("(21)", " ");
appno = appno.replace("Appl. No.: ", " ");
appno = appno.replace("Appl.", " ");
appno = appno.replace("No.", " ");
appno = appno.replace(":"," ");
appno = appno.trim();
// System.out.println("m5");
}
while(m6.find() && (System.currentTimeMillis() - now) < TIMEOUT)
{
// System.out.println(m6.group());
patfilled = m6.group().replace("(22)", " ");
patfilled = patfilled.replace("Filed", " ");
patfilled= patfilled.replace("PCT", " ");
patfilled = patfilled.replace(":", " ");
patfilled = patfilled.replace("\n", "");
patfilled= patfilled.trim();
// System.out.println("m6");
}
while (m7.find() && (System.currentTimeMillis() - now) < TIMEOUT)
{
patdate = m7.group().replace("(45) Date of Patent: ", " ");
patdate = patdate.replace("(45) Date of Patent.: ", " ");
patdate = patdate.replace("(45)", " ");
patdate = patdate.replace("Date", " ");
patdate = patdate.replace("of", " ");
patdate = patdate.replace("Patent.: ", " ");
patdate = patdate.replace("Patent: ", " ");
patdate = patdate.replace("Reissued", " ");
patdate = patdate.replace(":", " ");
patdate = patdate.replace("Patent", " ");
patdate = patdate.replace("*", " ");
patdate = patdate.trim();
// System.out.println("m7");
}
System.out.println("find end");
在上面的代码中,mX.find()需要很长执行某些文件迭代的时间......这就是执行在一些迭代处冻结在System.out.println("find start");
。
这是输出示例:(滚动看到)
-------
find start
1ms Elasped
1841
File name:06377334.pdf
US 6,377,334 B2
METHOD FOR CONTROLLING IMAGE
SIZE OF INTEGRATED CIRCUITS ON
WAFERS SUPPORTED ON HOT PLATES
DURING POST EXPOSURE BAKING OF THE
WAFERS
Apr. 23, 2002
Jan. 24, 2001
Related U.S. Application Data
09/768,973
-------
find start
1ms Elasped
1842
File name:06377337.pdf
US 6,377,337 B1
PROJECTION EXPOSURE APPARATUS
Apr. 23, 2002
Apr. 27, 1999
09/299,558
-------
find start
1843
File name:06377338.pdf
US 6,377,338 B1
EXPOSURE APPARATUS AND METHOD
Apr. 23, 2002
Oct. 13, 2000
Related U.S. Application Data
09/299,558
-------
find start
1844
File name:06377339.pdf
US 6,377,339 B1
DOCUMENT IMAGING SYSTEM
INCORPORATING A SELECTIVELY
OPAQUE
Apr. 23, 2002
Mar. 29, 1999
09/280,186
-------
find start
1845
File name:06377340.pdf
US 6,377,340 B1
METHOD OF DETECTION OF NATURAL
DIAMONDS THAT HAVE BEEN PROCESSED
AT HIGH PRESSURE AND HIGH
TEMPERATURES
Apr. 23, 2002
Oct. 29, 1999
09/430,477
-------
find start
1846
File name:06377341.pdf
US 6,377,341 B1
REFRACTIVE INDEX BASED DETECTOR
SYSTEM FOR LIQUID CHROMATOGRAPHY
Apr. 23, 2002
Aug. 3, 1999
09/368,310
-------
find start
(execution freezes here)
为什么出现这种情况?为什么正则表达式匹配器需要很长时间?
在这里,整个程序:
import java.awt.Rectangle;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;
import org.apache.commons.io.filefilter.WildcardFileFilter;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;
public class PatentAdder {
/**
* @param args
*/
public static String patno,patit,patdate,patfilled,appno;
private static int File;
public static void main(String[] args) {
try {
int cnt=0;
if(args.length == 1)
{
// usage();
}
else
{
PDDocument document = null;
try
{
File dataDir = new File("F:/patents/test/tittest/USP2002w17/06/378/pdfs");
File[] files = dataDir.listFiles();
int count=0;
long TIMEOUT1 = 60000l; // 15 seconds
long now1 = System.currentTimeMillis();
for (File file : files) {
try {
// System.out.println ("Satrt2");
File f = file;
if (!f.isDirectory()) {
document = PDDocument.load(f.getAbsolutePath());
if(document.isEncrypted())
{
try
{
document.decrypt("");
}
catch(InvalidPasswordException e)
{
System.err.println("Error: Document is encrypted with a password.");
System.exit(1);
}
} }
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
Rectangle rectt = new Rectangle(288, 60, 222, 40);
Rectangle rect = new Rectangle(55, 108, 230, 600); // US-Patent title h40
stripper.addRegion("class1", rect);
stripper.addRegion("class2", rectt);
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get(0);
stripper.extractRegions(firstPage);
String title = "(?s)\\(54\\)\\s*([\\w\\s,-]+)|(?s)\\[54\\]\\s*([\\w\\s,-]+)";
String in ="((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\[\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))";
String as ="((?s)\\(\\d\\d\\)\\s+Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Notice:))|((?s)\\(\\d\\d\\)\\s+Assignee:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Notice:))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+)(?=Notice:))";
String app_no ="(?s)\\(21\\)\\s*([\\w\\s,.://-]+)|(?s)\\[21\\]\\s*([\\w\\s,.://-]+)";
String filed ="((?s)\\(22\\)\\s*([\\w\\s,.://-]+))|((?s)\\(22\\)\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))";
String term ="((?s)\\s*Term\\s*([\\w\\s,.://-]+))|((?s)\\s*Term\\s*([\\w\\s,.://-]+))";
String pat_no = "(?s)\\s*Patent No\\.\\:\\s*([\\w\\d\\s,.://-]+)|(?s)\\s*Patent Number\\:\\s*([\\w\\d\\s,.://-]+)";
String pat_dt = "(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventor:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\[\\d*\\]\\s+Inventor:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)";
String region = stripper.getTextForRegion("class1");
String regiont = stripper.getTextForRegion("class2");
Pattern p = Pattern.compile(in);
Matcher m = p.matcher(region);
Pattern p2 = Pattern.compile(as);
Matcher m2 = p2.matcher(region);
Pattern p3 = Pattern.compile(title);
Matcher m3 = p3.matcher(region);
Pattern p4 = Pattern.compile(pat_no);
Matcher m4 = p4.matcher(regiont);
Pattern p5 = Pattern.compile(app_no);
Matcher m5 = p5.matcher(region);
Pattern p6 = Pattern.compile(filed);
Matcher m6 = p6.matcher(region);
Pattern p7 = Pattern.compile(pat_dt);
Matcher m7 = p7.matcher(regiont);
System.out.println("find start");
Long nowtime = System.currentTimeMillis() ;
while(m3.find())
{
patit = m3.group().replace("(54)", " ");
patit = patit.trim();
}
while(m4.find())
{
patno = m4.group().replace("Patent No.: ", " ");
patno = patno.replace("Patent No: ", " ");
patno = patno.replace("Patent", " ");
patno = patno.replace("No.:", " ");
patno = patno.replace("No:", " ");
patno = patno.replace("Number: ", " ");
patno = patno.replace("Number.: ", " ");
patno = patno.trim();
}
while(m5.find())
{
appno = m5.group().replace("(21)", " ");
appno = appno.replace("Appl. No.: ", " ");
appno = appno.replace("Appl.", " ");
appno = appno.replace("No.", " ");
appno = appno.replace(":"," ");
appno = appno.trim();
}
while(m6.find())
{
patfilled = m6.group().replace("(22)", " ");
patfilled = patfilled.replace("Filed", " ");
patfilled= patfilled.replace("PCT", " ");
patfilled = patfilled.replace(":", " ");
patfilled = patfilled.replace("\n", "");
patfilled= patfilled.trim();
}
while (m7.find())
{
patdate = m7.group().replace("(45) Date of Patent: ", " ");
patdate = patdate.replace("(45) Date of Patent.: ", " ");
patdate = patdate.replace("(45)", " ");
patdate = patdate.replace("Date", " ");
patdate = patdate.replace("of", " ");
patdate = patdate.replace("Patent.: ", " ");
patdate = patdate.replace("Patent: ", " ");
patdate = patdate.replace("Reissued", " ");
patdate = patdate.replace(":", " ");
patdate = patdate.replace("Patent", " ");
patdate = patdate.replace("*", " ");
patdate = patdate.trim();
}
PrintWriter out = new PrintWriter (new File("F:/patents/test/tittest/USP2002w17/06/378/pdfs/output.txt"));
System.out.println(count);
out.println(count);
System.out.println("File name:"+f.getName());
out.println("File name:"+f.getName());
System.out.println(patno +"\n"+patit+"\n"+patdate+"\n"+patfilled+"\n"+appno+"\n-------");
out.println(patno +"\n"+patit+"\n"+patdate+"\n"+patfilled+"\n"+appno+"\n-------");
Long endtime = System.currentTimeMillis()-nowtime;
System.out.println(endtime+"ms Elasped") ;
out.println(endtime+"ms Elasped") ;
count++;
}
catch (IOException e)
{
continue;
}
}
System.out.print("-----Finised "+count+" Files------ \n");
}
finally
{
if(document != null)
{
document.close();
}
}
}
}
catch (Exception e)
{
System.out.println(e.getStackTrace());
//System.out.println(e.getLocalizedMessage());
System.out.println(e.getMessage());
System.out.println(e.getCause());
//System.out.println(e.getClass());
e.printStackTrace();
}
}
请告诉如何optamize正则表达式和解决这个执行冻结问题...
我喜欢用正则表达式自己,但它看起来像这不是你尝试去做的理想方法。正则表达式很适合从文本中提取特定位置的信息。但是,在文本上反复应用多个正则表达式意味着解析方法会更好。
您的方法的一个问题是每个while循环都是再次读取整个文本。这可以通过编写自己的解析器并让它在文档中一次完成来避免。
你的正则表达式的另一个问题是他们有很多可选的部分(\s*
等)。可选部件使得对正则表达式的评估成本更高。相反,使用一个非常简单的正则表达式可能是一个好方法,并重新检查它的匹配位置是否有误报。例如,而不是你的正则表达式
String term ="((?s)\\s*Term\\s*([\\w\\s,.://-]+))|((?s)\\s*Term\\s*([\\w\\s,.://-]+))";
,你可以随便找
String simple_term ="Term";
然后对Term
每次出现检查它是否真的是你正在寻找的一部分。
顺便看一下我从代码中随机抽取的字符串,我注意到它比它更复杂。只是删除备选|
,因为第一个和第二个选择是一样的。
'可选部分使正则表达式的评估更加昂贵。'它确实,但不会导致这样的减速。减速是由于'\ s * \ s *'问题引起的。 – nhahtdh
由于您的构造类似于\s*\s*c
- 连续重复的字符类,它具有非空的交叉点,后面是非相交的续集,所以会遇到回溯地狱。
让我们来看看字符串in
(由发动机所看到):
((?s)\(\d\d\)\s+Inventor\w*:\s*\w*([\w\d,.\s)(-]+);([\w\s.\',();-]+)(?=\(\d*\)\s+Assignee:))
|
((?s)\[\d\d\)\s+Inventor:\s*([\-\w\d\s,\.\(\)-]+)*[\w\']*(?=\n))
|
(Inventor\w*:\s*\w*([\w\d,.\s)(-]+);([\w\s.\',();-]+)(?=Assignee:))
|
((?s)\(\d\d\)\s+Inventor\w*:\s*\w*([\w\d,.\s)(-]+);([\w\s.\',();-]+)(?=\(\d*\)\s+Assignee:))
|
((?s)\(\d\d\)\s+Inventor:\s*([\-\w\d\s,\.\(\)-]+)*[\w\']*(?=\n))
|
(Inventor\w*:\s*\w*([\w\d,.\s)(-]+);([\w\s.\',();-]+)(?=Assignee:))
你有足够的式样,在您的正则表达式:
((?s)\(\d\d\)\s+Inventor\w*:\s*\w*([\w\d,.\s)(-]+);([\w\s.\',();-]+)(?=\(\d*\)\s+Assignee:))
^^^^^^^^^^^^^^^^^^^^
((?s)\[\d\d\)\s+Inventor:\s*([\-\w\d\s,\.\(\)-]+)*[\w\']*(?=\n))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Inventor\w*:\s*\w*([\w\d,.\s)(-]+);([\w\s.\',();-]+)(?=Assignee:))
^^^^^^^^^^^^^^^^^^^^
这是不提的是,你包括在你的正则表达式中两次相同的3个子模式,这是完全多余的。
我需要多余的..因为没有...字符串不正确; y解析@nhahtdh –
@RageshDAntony:你使用捕获组的结果?并且在交替中有两次相同的模式没有意义(嗯,假设发动机工作正常)。 – nhahtdh
我很抱歉,但我不认为我们可以提供帮助。它可以是任何数量的东西。也许目录中有一个损坏的PDF文件或类似的东西。也许在PDF中的特定页面上有些不同。有些情况下,我们不会有任何帮助。你有没有看过这个冻结的文件,围绕1847年的记录看看有什么不同?你有没有试过一个条件调试器,当你到达那条记录时会触发,这样你就可以遍历代码并查看可能发生了什么? –