A Blog Search System Based on Lucene
I recently started looking into Lucene, so I decided to build a simple blog search system myself. The rough outline follows.
First, let's get to know Lucene. Official site: http://lucene.apache.org/core/
The site is entirely in English, which was daunting at first (I immediately regretted not studying English harder), but I pushed through and got the gist:
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform search, and best of all it is a free, open-source project (I love anything open source, haha...). My own take: it is essentially a toolkit for full-text retrieval.
When learning something new, I prefer to first learn how to use it and only then dig into the principles (just my personal approach). The official site actually provides a detailed API reference:
Opening the core package, you can see the demo the docs provide, which gives a rough idea of the usage:
OK, enough talk, on to the code. First, add the dependencies:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>7.1.0</version>
</dependency>
Next, create a Lucene utility class that handles index creation and search (the files here are .txt; other file types would need their own handling):
public class LuceneUtil {

    private static IndexWriter indexWriter;
    private static IndexWriterConfig indexWriterConfig;
    private static FSDirectory fsDirectory;
    private static Analyzer analyzer;
    private static final String DIR = "/luceneIndex";

    // Create an index entry for a file.
    // Note: for simplicity a new IndexWriter is opened and closed per call;
    // in production an IndexWriter should be long-lived and shared.
    public static void createIndex(File file, String title) {
        String name = file.getName();
        // Only .txt files are indexed
        if (name.endsWith(".txt")) {
            String content = txtToString(file);
            String id = name.substring(0, name.lastIndexOf("."));
            Document document = new Document();
            document.add(new StringField("id", id, Field.Store.YES));
            document.add(new StringField("title", title, Field.Store.YES));
            document.add(new TextField("content", content, Field.Store.YES));
            try {
                // Store the index under <working dir>/luceneIndex
                File src = new File("");
                fsDirectory = FSDirectory.open(Paths.get(src.getCanonicalPath() + DIR));
                analyzer = new SmartChineseAnalyzer();
                indexWriterConfig = new IndexWriterConfig(analyzer);
                indexWriter = new IndexWriter(fsDirectory, indexWriterConfig);
                indexWriter.addDocument(document);
                indexWriter.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Execute a search
    public static JSONObject doSearch(String value) {
        JSONObject jsonObject = new JSONObject();
        if (value == null || "".equals(value)) {
            return jsonObject;
        }
        try {
            File src = new File("");
            fsDirectory = FSDirectory.open(Paths.get(src.getCanonicalPath() + DIR));
            analyzer = new SmartChineseAnalyzer();
            IndexReader reader = DirectoryReader.open(fsDirectory);
            IndexSearcher indexSearcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("content", analyzer);
            Query query = parser.parse(value);
            // Configure highlighting: wrap matched terms in red bold tags
            SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<b><font color='red'>", "</font></b>");
            QueryScorer queryScorer = new QueryScorer(query);
            Fragmenter fragmenter = new SimpleSpanFragmenter(queryScorer);
            Highlighter highlighter = new Highlighter(simpleHTMLFormatter, queryScorer);
            highlighter.setTextFragmenter(fragmenter);
            long start = System.currentTimeMillis();
            TopDocs topDocs = indexSearcher.search(query, 10);
            long end = System.currentTimeMillis();
            System.out.println((end - start) + " ms");
            jsonObject.put("total", topDocs.totalHits);
            List<MyDocument> list = new ArrayList<>();
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                Document doc = indexSearcher.doc(scoreDoc.doc);
                MyDocument myDocument = new MyDocument();
                myDocument.setId(doc.get("id"));
                myDocument.setTitle(doc.get("title"));
                String content = doc.get("content");
                if (content != null) {
                    // Replace the full content with the best highlighted fragment
                    TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(content));
                    content = highlighter.getBestFragment(tokenStream, content);
                }
                myDocument.setContent(content);
                list.add(myDocument);
            }
            jsonObject.put("doc", list);
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return jsonObject;
    }

    // Read a txt file's content into a String
    private static String txtToString(File file) {
        StringBuilder buffer = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String s;
            while ((s = br.readLine()) != null) {
                buffer.append(s).append("\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return buffer.toString();
    }
}
Then create a controller:
@RestController
public class BlogController {

    @GetMapping("/index")
    public String index() {
        return "Hello blog";
    }

    @PostMapping("/upload")
    public String upload(MultipartFile file, String title) {
        String filePath = FileUpload.upload(file);
        LuceneUtil.createIndex(new File(filePath), title);
        return "ok";
    }

    @PostMapping("/search")
    public Object doSearch(@RequestBody Map<String, String> query) {
        return LuceneUtil.doSearch(query.get("query"));
    }
}
The home page HTML:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Xiaojun Blog Search</title>
    <script src="../js/jquery.js"></script>
    <style type="text/css">
        input.text{text-align:center;padding:10px 20px;width:300px;}
    </style>
</head>
<body>
<div align="right"><a href="writeBlog.html">Write a Post</a> <a href="upload.html">Upload File</a></div>
<div align="center"><h1 style="font-family: Consolas; color: darkseagreen">Xiaojun Blog Search</h1></div>
<div align="center">
    <input type="text" id="searchText" class="text"><input type="button" id="search" value="Search" style="padding:10px 20px;">
</div>
<br><br><br>
<div align="center">
    Results: <input type="text" id="total" readonly="readonly"> records
    <br><br>
    <div id="result"></div>
</div>
<script>
    $(function () {
        $("#search").on("click", function () {
            var query = $("#searchText").val();
            $.ajax({
                url: "/search",
                type: "post",
                contentType: "application/json;charset=UTF-8",
                data: JSON.stringify({"query": query}),
                dataType: "json",
                success: function (data) {
                    $("#total").val(data.total);
                    $("#result").empty();
                    if (data.total == 0) {
                        $("#result").html("No results found, please try another keyword");
                    } else {
                        // Render each hit's title and highlighted content
                        $.each(data.doc, function (i, n) {
                            var div = document.createElement("div");
                            div.innerHTML = n.title + "<br>" + n.content + "<br><hr><br><br><br>";
                            $("#result").append(div);
                        });
                    }
                },
                error: function () {
                    alert("Search failed!");
                }
            });
        });
    });
</script>
</body>
</html>
The file upload HTML:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Xiaojun Blog Search</title>
    <script src="../js/jquery.js"></script>
</head>
<body>
<div align="center">
    <input type="file" id="upload" accept=".txt">
    <label>Title:</label><input type="text" id="title">
    <input type="button" id="uploadButton" value="Upload">
</div>
<script>
    $(function () {
        $("#uploadButton").on("click", function () {
            var dataForm = new FormData();
            dataForm.append("file", document.getElementById("upload").files[0]);
            dataForm.append("title", $("#title").val());
            $.ajax({
                url: "/upload",
                type: "post",
                data: dataForm,
                contentType: false,
                processData: false,
                success: function (data) {
                    if (data == "ok") {
                        alert("Upload succeeded");
                    }
                },
                error: function () {
                    alert("Upload failed!");
                }
            });
        });
    });
</script>
</body>
</html>
Test results show the search is quite fast.
For full-text search, the analyzer (word segmenter) determines search accuracy. I use smartcn here. Some commonly used Chinese analyzers:
paoding: the "Paoding Analysis" ("butcher carving the ox") Chinese analyzer for Lucene
imdict: the intelligent Chinese segmenter used by the imdict smart dictionary
mmseg4j: a Chinese analyzer implementing Chih-Hao Tsai's MMSeg algorithm
ik: uses a distinctive "forward-iteration finest-granularity segmentation" algorithm with a multi-subprocessor analysis mode
smartcn: derived from ICTCLAS of the Chinese Academy of Sciences
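To see what an analyzer actually does, here is a small sketch (not from the project above; it assumes the same Lucene 7.1.0 smartcn dependency and a hypothetical class name) that prints the terms SmartChineseAnalyzer produces for a Chinese sentence — it segments into words rather than single characters, which is what makes the search results accurate:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {

    // Collect the terms an analyzer produces for the given text,
    // following the standard TokenStream workflow: reset -> incrementToken -> end
    static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("content", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // smartcn segments the sentence into words, not single characters
        System.out.println(tokenize(new SmartChineseAnalyzer(), "我爱北京天安门"));
    }
}
```

The same `tokenize` helper works with any `Analyzer`, so you can swap in ik or mmseg4j and compare segmentations side by side.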
To use the smartcn analyzer and to highlight matches in the search results, two more dependencies are needed:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>7.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>7.1.0</version>
</dependency>
How Lucene works:
There are two key objects here: IndexWriter and IndexSearcher. IndexWriter runs documents through the analyzer and writes the resulting terms into an inverted index; IndexSearcher opens that index and scores documents against a query.
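The interplay of the two objects can be sketched in a few lines (my own minimal example, not part of the blog project; it uses StandardAnalyzer and an in-memory RAMDirectory instead of the on-disk index above, and the class and method names are made up for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class WriterSearcherDemo {

    static long indexAndSearch() throws Exception {
        Directory dir = new RAMDirectory(); // in-memory index, fine for a demo

        // IndexWriter: analyzes documents and writes them into the inverted index
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("content", "hello lucene world", Field.Store.YES));
            writer.addDocument(doc);
        }

        // IndexSearcher: opens the index via a reader and scores documents against a query
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("content", "lucene")), 10);
            return hits.totalHits;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexAndSearch());
    }
}
```

This is exactly the flow LuceneUtil follows, just stripped of file handling and highlighting: write once with IndexWriter, then open a fresh reader and query with IndexSearcher.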