Lucene 8.0 Hands-On Introduction, Part 1

First, download the Lucene jars:

http://apache.01link.hk/lucene/java/8.0.0/lucene-8.0.0.zip

Unzip the archive and take out three jars:

lucene-queryparser-8.0.0.jar
lucene-core-8.0.0.jar
lucene-analyzers-common-8.0.0.jar

Create a Java project, create a lib folder in its root directory, copy the jars into it, and add them to the project (Add as Library...).

1. Writing to the index: IndexWriter

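Next, create the following directories and files in the project root (file contents are shown in parentheses):

docs/
    hello.txt (素胚勾勒出青花，笔锋浓转淡；瓶身描绘的牡丹一如你初妆)
    subdocs/
        hi.txt (落霞与孤鹜齐飞，秋水共长天一色)

Then test writing them to the index: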

package com.company;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class Index {
    public static void main(String[] args) throws Exception {
        index0();     
    }

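    private static void index0() throws IOException {
        // Directory holding the files to read and index
        Path fileDoc = Paths.get("docs");
        // Directory where the index is stored; it is created automatically if it does not exist
        Path index = Paths.get("index");
        // The standard analyzer tokenizes English well but splits Chinese into single characters
        // (the IK analyzer is recommended for Chinese)
        Analyzer analyzer = new StandardAnalyzer();
        // Writer configuration; it sets the analyzer to the standard analyzer
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        // FSDirectory opens the index directory; FS means file system, i.e. disk-based (a memory-based Directory also exists)
        // IndexWriter is the core Lucene class used to write the index
        // try-with-resources closes both even if indexing throws
        try (Directory directory = FSDirectory.open(index);
             IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig)) {
            // Files is the NIO utility class for file operations
            // Files.isDirectory(path) returns true if the path is a directory, false otherwise
            if (Files.isDirectory(fileDoc)) {
                // Files.walkFileTree walks a directory tree recursively
                // Parameters: Path start, FileVisitor<? super Path> visitor
                // The first is the directory to walk; the second is the visitor interface (SimpleFileVisitor is one implementation)
                Files.walkFileTree(fileDoc, new SimpleFileVisitor<Path>() {
                    // visitFile is called once for every file found in the tree
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                        // file is the path of the visited file; attrs holds its basic attributes
                        indexDocs(file, indexWriter);
                        return FileVisitResult.CONTINUE;
                    }
                });
            } else {
                // fileDoc is a single file: index it directly
                indexDocs(fileDoc, indexWriter);
            }
        }
    }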

    private static void indexDocs(Path path, IndexWriter indexWriter) throws IOException {
        // Open the file as a stream; try-with-resources closes it even if indexing throws
        try (InputStream inputStream = Files.newInputStream(path)) {
            // The document to be added to the index
            Document document = new Document();
            // Field arguments: field name, value, and whether to store the value
            // Field has many subclasses for different kinds of data; StringField indexes a String as one token, without analysis
            Field field = new StringField("fileName", path.getFileName().toString(), Field.Store.YES);
            // Add the field to the document
            document.add(field);
            // TextField is for larger text that should be analyzed; this constructor takes a field name and a Reader
            // (the text is indexed but not stored). The stream is decoded as UTF-8 through a buffered Reader
            document.add(new TextField("content", new BufferedReader(new InputStreamReader(inputStream, Charset.forName("utf-8")))));
            // LongPoint indexes a long value for exact and range queries; it is not analyzed and not stored
            document.add(new LongPoint("modified", Files.getLastModifiedTime(path).toMillis()));
            // Log each file as it is indexed
            System.out.println("adding files:" + path);
            // Add the document; the Reader is consumed here, so the stream must still be open
            indexWriter.addDocument(document);
        }
    }
}
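
The recursive traversal used above can be tried on its own, without Lucene. The following is a minimal sketch using only the JDK; the class name WalkDemo and the helper listFiles are made up for illustration, and the docs layout is recreated in a temporary directory rather than in the project root.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class WalkDemo {
    // Collects every file (non-directory entry) under root, recursing into subdirectories
    static List<Path> listFiles(Path root) throws IOException {
        List<Path> found = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                found.add(file);
                // CONTINUE tells the walker to keep going with the next entry
                return FileVisitResult.CONTINUE;
            }
        });
        // Sort so the result does not depend on file-system ordering
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the docs/hello.txt and docs/subdocs/hi.txt layout in a temp directory
        Path docs = Files.createTempDirectory("docs");
        Files.createDirectories(docs.resolve("subdocs"));
        Files.write(docs.resolve("hello.txt"), "hello".getBytes(StandardCharsets.UTF_8));
        Files.write(docs.resolve("subdocs").resolve("hi.txt"), "hi".getBytes(StandardCharsets.UTF_8));
        for (Path p : listFiles(docs)) {
            System.out.println(docs.relativize(p));
        }
    }
}
```

Running main prints the two relative paths, one per line, with the path separator of the current operating system.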

Running the code above prints the following to the console:

adding files:docs\hello.txt
adding files:docs\subdocs\hi.txt

A new index directory then appears in the project root, containing several files:

_0.cfe
_0.cfs
_0.si
segments_1
write.lock

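This shows that the index has been written.

To inspect the index, you can read it back through the API in a test; that will be covered in the next post.

You can also inspect the index with the Luke tool:

Download: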

https://github.com/DmitryKey/luke

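Note: Lucene 8.0 is the latest release, so Luke also needs to be a fairly recent version; some older versions appear unable to open the index.

Since I have git installed locally, I ran:

git clone git@github.com:DmitryKey/luke.git

to pull the project into a local folder. Go into that folder (the one containing pom.xml) and run mvn package from the cmd command line to build it. If you skip the build, double-clicking luke.bat fails with: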

ERROR,unable to access jarfile:.\target\luke-swings-with-deps.jar

Once the build finishes, double-click luke.bat to start Luke.

Click the Browse button, locate the project's index directory, and open it.

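With the index open you can see the content field, but its text has been split into single characters, because that is how the standard analyzer handles Chinese. For better Chinese segmentation, use the IK analyzer. (Afterwards, click around in Luke for a while to get familiar with it.)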
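
To make "split into single characters" concrete, here is a rough sketch in plain Java. It is not Lucene code and does not reproduce StandardAnalyzer; it only mimics the behavior described above, emitting each Han character as its own token while keeping runs of other letters and digits together in lowercase. The class name UnigramSketch and the method tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class UnigramSketch {
    // Splits text so that each Han character is one token and
    // runs of other letters or digits form one lowercased token
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                // Still inside a non-Han word: keep accumulating
                word.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            // A Han character, punctuation or whitespace ends the current word
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (han) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Content of hi.txt from the example above
        System.out.println(tokenize("落霞与孤鹜齐飞，秋水共长天一色"));
    }
}
```

Every character of the sentence ends up as a separate token and the comma is dropped, which is why a search for a multi-character word cannot match it as a single term in such an index.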

Using the IK analyzer:

You need to add one more jar. Download link (if the download fails, you may need a *):

https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/ik-analyzer/IK%20Analyzer%202012FF_hf1.zip

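Unzip it and add IKAnalyzer2012FF_u1.jar to the project.

In the code, change Analyzer analyzer = new StandardAnalyzer() to Analyzer analyzer = new IKAnalyzer(); (the class is org.wltea.analyzer.lucene.IKAnalyzer).

That should be all. Run it again and look: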

adding files:docs\hello.txt
Exception in thread "main" java.lang.AbstractMethodError: Receiver class org.wltea.analyzer.lucene.IKAnalyzer does not define or inherit an implementation of the resolved method abstract createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents; of abstract class org.apache.lucene.analysis.Analyzer.
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:163)

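...

What on earth? After some head-scratching I looked at the source: Analyzer's createComponents method has gone from two parameters to one, while the IK analyzer still implements the two-parameter version, which is why the exception is thrown. It looks like the IK source will have to be patched. Time for bed; I'll deal with it tomorrow.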
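
A mismatch like this can also be checked with reflection instead of reading the source. The sketch below lists the parameter types of every createComponents method a class declares. To stay runnable without any jars it uses two stand-in classes, OldStyle and NewStyle, which are hypothetical and only imitate the two signatures; with the real jars on the classpath you could pass Class.forName("org.wltea.analyzer.lucene.IKAnalyzer") instead.

```java
import java.io.Reader;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class SignatureCheck {
    // Hypothetical stand-in for an analyzer written against the two-parameter API
    static class OldStyle {
        protected Object createComponents(String fieldName, Reader reader) {
            return null;
        }
    }

    // Hypothetical stand-in for an analyzer written against the one-parameter API
    static class NewStyle {
        protected Object createComponents(String fieldName) {
            return null;
        }
    }

    // Returns one entry per method called `name` declared directly on `type`,
    // formatted as name(ParamType, ParamType)
    static List<String> signatures(Class<?> type, String name) {
        List<String> result = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            if (!m.getName().equals(name)) {
                continue;
            }
            List<String> params = new ArrayList<>();
            for (Class<?> p : m.getParameterTypes()) {
                params.add(p.getSimpleName());
            }
            result.add(name + "(" + String.join(", ", params) + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(signatures(OldStyle.class, "createComponents"));
        System.out.println(signatures(NewStyle.class, "createComponents"));
    }
}
```

If the class under inspection only reports a two-parameter createComponents, it never overrides the one-parameter abstract method that the newer Analyzer calls, and the AbstractMethodError above is the result.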
