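Apache POI - How to remove all links from a Word document

Problem description:

I want to remove all hyperlinks from a Word document while keeping the text. I have these two methods for reading documents with the doc and docx extensions.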

private void readDocXExtensionDocument(){ 
    File inputFile = new File(inputFolderDir, "test.docx"); 
    try { 
     XWPFDocument document = new XWPFDocument(OPCPackage.open(new FileInputStream(inputFile))); 
     XWPFWordExtractor extractor = new XWPFWordExtractor(document); 
     extractor.setFetchHyperlinks(true); 
     String context = extractor.getText(); 
     System.out.println(context); 
    } catch (InvalidFormatException e) { 
     e.printStackTrace(); 
    } catch (FileNotFoundException e) { 
     e.printStackTrace(); 
    } catch (IOException e) { 
     e.printStackTrace(); 
    } 

} 

private void readDocExtensionDocument(){ 
    File inputFile = new File(inputFolderDir, "test.doc"); 
    POIFSFileSystem fs; 
    try { 
     fs = new POIFSFileSystem(new FileInputStream(inputFile)); 
     HWPFDocument document = new HWPFDocument(fs); 
     WordExtractor wordExtractor = new WordExtractor(document); 
     String[] paragraphs = wordExtractor.getParagraphText(); 
     System.out.println("Word document has " + paragraphs.length + " paragraphs"); 
     for(int i=0; i<paragraphs.length; i++){ 
      paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", ""); 
      System.out.println(paragraphs[i]); 
     } 
    } catch (IOException e) { 
     e.printStackTrace(); 
    } 
} 

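Is it possible to remove all links from a Word document using the Apache POI library? If not, is there another library that can do this?

My solution, at least for the .docx case, would be to use a regular expression. Have a look at this: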

private void readDocXExtensionDocument() {
    // Matches anything enclosed in angle brackets
    Pattern p = Pattern.compile("\\<(.+?)\\>");
    File inputFile = new File(inputFolderDir, "test.docx");
    try {
        XWPFDocument document = new XWPFDocument(OPCPackage.open(new FileInputStream(inputFile)));
        XWPFWordExtractor extractor = new XWPFWordExtractor(document);
        extractor.setFetchHyperlinks(true);
        String context = extractor.getText();
        Matcher m = p.matcher(context);
        while (m.find()) {
            String link = m.group(0); // the bracketed part
            String textString = m.group(1); // the text of the link without the brackets
            // replace() is literal; replaceAll() would treat '.', '?' etc. in a URL as regex syntax
            context = context.replace(link, ""); // ordering important. Link then textString
            context = context.replace(textString, "");
        }
        System.out.println(context);
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

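The only caveat with this approach is that any content enclosed in angle brackets that is not a link may be removed as well. If you have a better idea of what the links will look like, you can try a more specific regular expression instead of the one I provided.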
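
As a rough illustration of what "more specific" could look like, here is a minimal sketch that only strips bracketed text starting with a common URL scheme. It assumes the extractor writes each link target as " <url>" after the link text; the class name LinkStripper, the method stripLinks, and the list of schemes are hypothetical choices made for this example, not part of Apache POI.

```java
import java.util.regex.Pattern;

public class LinkStripper {
    // Assumption: link targets appear as "<scheme:...>" with an optional leading space.
    // The scheme list is illustrative; extend it to match the links in your documents.
    private static final Pattern BRACKETED_URL =
            Pattern.compile("\\s?<(?:https?|ftp|mailto):[^<>\\s]+>");

    // Removes bracketed URLs and leaves any other angle-bracketed text untouched.
    public static String stripLinks(String text) {
        return BRACKETED_URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "See the docs <https://poi.apache.org/> and the <draft> tag.";
        System.out.println(stripLinks(sample));
    }
}
```

With something like this, the result of extractor.getText() could be passed through stripLinks instead of the find-and-replace loop, so that bracketed text such as "<draft>" is left alone. Unlike the loop above, it does not remove a URL that also appears outside brackets as the visible link text.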