A Lucene-Based Blog Search System

I've been looking into Lucene recently, so I decided to build a simple blog search system with it.

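First, let's get to know Lucene. The official site is http://lucene.apache.org/core/
I opened it and found the whole thing in English. I was completely lost and immediately regretted not studying English harder, but I gritted my teeth and read through it slowly. Here is roughly what I understood: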

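Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that needs full-text search, especially cross-platform ones. Best of all, it is an open-source project that is free to download (I love anything open source, haha). My own take: it is a toolkit for full-text retrieval.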

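When learning something new, I first learn how to use it and only then dig into how it works (just my personal view). The official site provides detailed API documentation.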

Opening the core package in the docs shows the official demo, which gives a rough idea of how the library is used.

OK, enough talk, on to the code. First we need to add the dependencies:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.1.0</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>7.1.0</version>
</dependency>

Start with a Lucene utility class that creates the index and runs searches (the files here are txt; other file types need their own handling):

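import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.store.FSDirectory;
// JSONObject and MyDocument also need imports; this post does not say which JSON library is used.

public class LuceneUtil {

    private static IndexWriter indexWriter;
    private static IndexWriterConfig indexWriterConfig;
    private static FSDirectory fsDirectory;
    private static Analyzer analyzer;
    // Index directory, relative to the working directory
    private static final String DIR = "/luceneIndex";


    // Create an index entry for a file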
    public static void createIndex(File file,String title){

        String name = file.getName();
        // Only .txt files are indexed
        if(name.endsWith(".txt")){
            String content = txtToString(file);
            String id = name.substring(0,name.lastIndexOf("."));

            Document document = new Document();
            document.add(new StringField("id",id, Field.Store.YES));
            document.add(new StringField("title",title, Field.Store.YES));
            document.add(new TextField("content",content, Field.Store.YES));
            try {
                File src = new File("");
                fsDirectory = FSDirectory.open(Paths.get(src.getCanonicalPath() + DIR));
                analyzer = new SmartChineseAnalyzer();
                indexWriterConfig = new IndexWriterConfig(analyzer);
                // try-with-resources closes the writer even if addDocument throws;
                // closing an IndexWriter commits pending changes by default
                try (IndexWriter writer = new IndexWriter(fsDirectory,indexWriterConfig)) {
                    writer.addDocument(document);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Run a search
    public static JSONObject doSearch(String value){
        JSONObject jsonObject = new JSONObject();

        if (value == null || "".equals(value)){
            return jsonObject;
        }

        try {
            File src = new File("");
            fsDirectory = FSDirectory.open(Paths.get(src.getCanonicalPath() + DIR));
            analyzer = new SmartChineseAnalyzer();

            // try-with-resources closes the reader even if parsing or searching throws
            try (IndexReader reader = DirectoryReader.open(fsDirectory)) {
                IndexSearcher indexSearcher = new IndexSearcher(reader);

                QueryParser parser = new QueryParser("content",analyzer);
                Query query = parser.parse(value);

                // Set up highlighting: matched terms are wrapped in bold red tags
                SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<b><font color='red'>","</font></b>");
                QueryScorer queryScorer = new QueryScorer(query);
                Fragmenter fragmenter = new SimpleSpanFragmenter(queryScorer);
                Highlighter highlighter = new Highlighter(simpleHTMLFormatter,queryScorer);
                highlighter.setTextFragmenter(fragmenter);

                // Time the search itself and fetch the top 10 hits
                long start = System.currentTimeMillis();
                TopDocs topDocs = indexSearcher.search(query, 10);
                long end = System.currentTimeMillis();
                System.out.println((end-start) + " ms");

                jsonObject.put("total",topDocs.totalHits);
                List<MyDocument> list = new ArrayList<>();
                for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                    Document doc1 = indexSearcher.doc(scoreDoc.doc);
                    MyDocument myDocument = new MyDocument();
                    myDocument.setId(doc1.get("id"));
                    myDocument.setTitle(doc1.get("title"));
                    String content = doc1.get("content");
                    if (content != null){
                        TokenStream tokenStream = analyzer.tokenStream("content",new StringReader(content));
                        String bestFragment = highlighter.getBestFragment(tokenStream, content);
                        // getBestFragment returns null when nothing in the text matches; keep the full content then
                        if (bestFragment != null){
                            content = bestFragment;
                        }
                    }
                    myDocument.setContent(content);
                    list.add(myDocument);
                }
                jsonObject.put("doc",list);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return jsonObject;
    }

    // Read the contents of a txt file into a string
    private static String txtToString(File file){
        StringBuilder buffer = new StringBuilder();
        // try-with-resources closes the reader even if readLine throws.
        // Note: FileReader decodes with the platform default charset.
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String s;
            while ((s = br.readLine()) != null) {
                buffer.append(s).append("\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return buffer.toString();
    }
}
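
The MyDocument class used in doSearch is not shown in this post. Here is a minimal sketch, assuming it is a plain bean with just the three properties set above (id, title, content). The getters are my assumption, added because JSON libraries usually serialize bean properties through them.

```java
// Hypothetical reconstruction of MyDocument: the post only shows calls to
// setId, setTitle and setContent, so everything else here is an assumption.
// In a real project this would normally be a public class in its own file.
class MyDocument {
    private String id;       // file name without its extension
    private String title;    // title supplied at upload time
    private String content;  // highlighted fragment of the body

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }

    public static void main(String[] args) {
        MyDocument d = new MyDocument();
        d.setId("first-post");
        d.setTitle("Hello");
        d.setContent("some <b>highlighted</b> text");
        System.out.println(d.getId() + " / " + d.getTitle());
    }
}
```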

Next, create a controller:

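import java.io.File;
import java.util.Map;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

@RestController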
public class BlogController {

    @GetMapping("/index")
    public String index(){
        return "Hello blog";
    }

    @PostMapping("/upload")
    public String upload(MultipartFile file,String title){

        String filePath = FileUpload.upload(file);

        LuceneUtil.createIndex(new File(filePath),title);

        return "ok";

    }

    @PostMapping("/search")
    public Object doSearch(@RequestBody Map<String,String> query){

        return LuceneUtil.doSearch(query.get("query"));
    }
}
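
The FileUpload helper called in upload is not shown in this post either. Below is a rough sketch of what such a helper might do, written against plain java.nio so it runs without Spring. The class name FileUploadSketch, the save method and its parameters are all hypothetical. It keeps the original file name, because createIndex only indexes files ending in .txt and uses the name without its extension as the id.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for the FileUpload helper, which the post does not show.
class FileUploadSketch {

    // Copies the uploaded bytes into dir and returns the absolute path of the saved file.
    static String save(InputStream in, String originalName, Path dir) throws IOException {
        // Keep only the last path segment so "../../x.txt" cannot escape dir.
        Path name = Paths.get(originalName.replace('\\', '/')).getFileName();
        if (name == null || name.toString().isEmpty()) {
            throw new IOException("empty file name");
        }
        Files.createDirectories(dir);
        Path base = dir.toAbsolutePath().normalize();
        Path target = base.resolve(name.toString()).normalize();
        // Reject names such as "." or ".." that resolve to base itself or outside it.
        if (!target.startsWith(base) || target.equals(base)) {
            throw new IOException("illegal file name: " + originalName);
        }
        // An existing file with the same name is overwritten.
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return target.toString();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("upload");
        InputStream body = new ByteArrayInputStream("hello lucene".getBytes(StandardCharsets.UTF_8));
        System.out.println(save(body, "../first-post.txt", dir));
    }
}
```

With Spring's MultipartFile, the inputs would come from file.getInputStream() and file.getOriginalFilename(); the target directory is whatever the application chooses.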

The HTML for the home page:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Xiaojun Blog Search</title>
    <script src="../js/jquery.js"></script>
    <style type="text/css">
        input.text{text-align:center;padding:10px 20px;width:300px;}
    </style>
</head>

<body>
<div align="right"><a href="writeBlog.html">Write a post</a>&nbsp;&nbsp;&nbsp;&nbsp;<a href="upload.html">Upload a file</a></div>
<div align="center"><h1 style="font-family: Consolas; color: darkseagreen">Xiaojun Blog Search</h1></div>
<div align="center">
<input type="text" id="searchText" class="text"><input type="button" id="search" value="Search" style="padding:10px 20px;">
</div>
<br><br><br>
<div align="center">
Results: <input type="text" id="total" readonly="readonly"> record(s)
<br><br>
<!--<textarea style="height: 500px; width: 1200px" id="textarea" readonly="readonly"></textarea>-->

    <div id="result"></div>
</div>
<script>
    $(function () {
        $("#search").on("click",function () {
            var query = $("#searchText").val();

            $.ajax({
                url: "/search",
                type: "post",
                contentType: "application/json;charset=UTF-8",
                data:JSON.stringify({"query": query}),
                dataType: "json",
                success: function (data) {
                    $("#total").val(data.total);
                    $("#result").empty();

                    // !data.total also covers the empty object returned for a blank query
                    if (!data.total) {
                        $("#result").html("No results found. Please try different keywords.");
                    }else {
                        $.each(data.doc,function (i,n) {
                            var div = document.createElement("div");
                            div.innerHTML = n.title + "<br>" + n.content + "<br><hr><br><br><br>";
                            $("#result").append(div);
                        });
                    }
                },
                error: function () {
                    alert("Search failed!");
                }

            });

        });
    });

</script>
</body>

</html>

The HTML for the file upload page:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Xiaojun Blog Search</title>
    <script src="../js/jquery.js"></script>
</head>
<body>
<div align="center">
<input type="file" id="upload" accept=".txt">
<label>Title:</label><input type="text" id="title">
<input type="button" id="uploadButton" value="Upload">
</div>
<script>
    $(function () {
        $("#uploadButton").on("click",function () {

            var dataForm = new FormData();
            dataForm.append("file",document.getElementById("upload").files[0]);
            dataForm.append("title",$("#title").val());
            $.ajax({
                url:"/upload",
                type:"post",
                data:dataForm,
                contentType: false,
                processData: false,
                success:function (data) {
                    if (data == "ok"){
                        alert("Upload succeeded");
                    }
                },
                error:function () {
                    alert("Upload failed!");
                }

            });

        });
    });

</script>

</body>
</html>

In my tests the search was quite fast.

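In full-text search, the analyzer (tokenizer) determines how accurate the results are. I use the smartcn analyzer here. Some commonly used Chinese analyzers:

paoding: the "Paoding Jieniu" Chinese analyzer for Lucene (Paoding Analysis)
imdict: the intelligent Chinese segmenter used by the imdict smart dictionary
mmseg4j: a Chinese analyzer built on Chih-Hao Tsai's MMSeg algorithm
ik: uses its own "forward iterative finest-granularity segmentation algorithm", with a multi-sub-processor analysis mode
smartcn: derived from ICTCLAS, from the Chinese Academy of Sciences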
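
To see why segmentation matters so much, here is a toy dictionary-based segmenter using forward maximum matching, a classic baseline for text without word boundaries. It is not the algorithm behind smartcn or any of the analyzers listed above. The class name ForwardMaxMatch and the dictionary are made up for illustration, and unspaced ASCII text stands in for Chinese so the snippet compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy segmenter: illustrative only, not what any Lucene analyzer does internally.
class ForwardMaxMatch {

    // At each position take the longest dictionary word (up to maxLen chars);
    // if none matches, fall back to a single character.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> small = new HashSet<>(Arrays.asList("blog", "search", "system"));
        Set<String> large = new HashSet<>(Arrays.asList("blog", "search", "system", "searchsystem"));
        System.out.println(segment("blogsearchsystem", small, 12)); // prints [blog, search, system]
        System.out.println(segment("blogsearchsystem", large, 12)); // prints [blog, searchsystem]
    }
}
```

With the second dictionary the text is indexed under the single token "searchsystem", so a query for "system" alone would no longer find it. The same kind of difference between analyzers is what changes search accuracy.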

The smartcn analyzer and search-result highlighting each need one more dependency:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>7.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>7.1.0</version>
</dependency>
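
To make concrete what highlighting produces, here is a sketch in plain Java with no Lucene involved: escape the text, then wrap every occurrence of a term in a pair of tags like the ones passed to SimpleHTMLFormatter earlier. Lucene's Highlighter works on the analyzer's token stream and scores fragments, which this sketch does not attempt. HighlightSketch and its method names are hypothetical.

```java
// Illustrative only: a literal, case-sensitive substring highlighter.
class HighlightSketch {

    // Escape the characters that would otherwise be interpreted as HTML.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap each occurrence of term in pre/post; everything else is escaped.
    static String highlight(String text, String term, String pre, String post) {
        if (term.isEmpty()) {
            return escapeHtml(text);
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(term, from)) >= 0) {
            out.append(escapeHtml(text.substring(from, hit)));
            out.append(pre).append(escapeHtml(term)).append(post);
            from = hit + term.length();
        }
        out.append(escapeHtml(text.substring(from)));
        return out.toString();
    }

    public static void main(String[] args) {
        String html = highlight("a <b> about lucene", "lucene",
                "<b><font color='red'>", "</font></b>");
        System.out.println(html);
    }
}
```

Escaping before wrapping matters for a page like the one above, which assigns the content through innerHTML: without it, markup inside an uploaded file would be rendered rather than displayed.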

How Lucene works:
There are two key objects here: IndexWriter, which builds the index, and IndexSearcher, which queries it.
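
The split between writing and searching can be illustrated with a toy inverted index in plain Java. This is only a conceptual sketch, not how Lucene is implemented: ToyIndex, addDocument and search are made-up names, tokenization is a naive split on non-alphanumeric characters, and there is no scoring or on-disk storage.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Conceptual toy, not Lucene: maps each term to the ids of documents containing it.
class ToyIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // "Writer" side: tokenize the text and record each term under the new document id.
    int addDocument(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // "Searcher" side: a lookup in the term map, with no scan over document text.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase());
        if (ids == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("Lucene is a search library");
        index.addDocument("A blog about Java");
        index.addDocument("Search the blog with Lucene");
        System.out.println(index.search("lucene")); // prints [0, 2]
        System.out.println(index.search("blog"));   // prints [1, 2]
    }
}
```

The work of tokenizing happens once, when a document is added, which is why a query afterwards only needs a map lookup.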