Learning Lucene 5: Custom Collectors

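You're all asleep while I'm still up writing a blog post, haha! I don't know why, but it's late at night and I'm not the least bit sleepy, so I'll ride the momentum and finish this post on custom result collection. It has been two days since my last update. I can't keep you waiting too long, and I want to keep this blogging habit going.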

What is a Collector? Let's look at the source code, which is the most authoritative explanation.

Java code
/** 
 * <p>Expert: Collectors are primarily meant to be used to 
 * gather raw results from a search, and implement sorting 
 * or custom result filtering, collation, etc. </p> 
 * 
 * <p>Lucene's core collectors are derived from {@link Collector} 
 * and {@link SimpleCollector}. Likely your application can 
 * use one of these classes, or subclass {@link TopDocsCollector}, 
 * instead of implementing Collector directly: 
 * 
 * <ul> 
 * 
 *   <li>{@link TopDocsCollector} is an abstract base class 
 *   that assumes you will retrieve the top N docs, 
 *   according to some criteria, after collection is 
 *   done.  </li> 
 * 
 *   <li>{@link TopScoreDocCollector} is a concrete subclass 
 *   {@link TopDocsCollector} and sorts according to score + 
 *   docID.  This is used internally by the {@link 
 *   IndexSearcher} search methods that do not take an 
 *   explicit {@link Sort}. It is likely the most frequently 
 *   used collector.</li> 
 * 
 *   <li>{@link TopFieldCollector} subclasses {@link 
 *   TopDocsCollector} and sorts according to a specified 
 *   {@link Sort} object (sort by field).  This is used 
 *   internally by the {@link IndexSearcher} search methods 
 *   that take an explicit {@link Sort}. 
 * 
 *   <li>{@link TimeLimitingCollector}, which wraps any other 
 *   Collector and aborts the search if it's taken too much 
 *   time.</li> 
 * 
 *   <li>{@link PositiveScoresOnlyCollector} wraps any other 
 *   Collector and prevents collection of hits whose score 
 *   is &lt;= 0.0</li> 
 * 
 * </ul> 
 * 
 * @lucene.experimental 
 */  
public interface Collector {  
  
  /** 
   * Create a new {@link LeafCollector collector} to collect the given context. 
   * 
   * @param context 
   *          next atomic reader context 
   */  
  LeafCollector getLeafCollector(LeafReaderContext context) throws IOException;  
  
}  

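The Collector family of interfaces gathers raw search results and implements sorting, custom result filtering, and collation. Collector and LeafCollector are the core of result collection in Lucene.

TopDocsCollector: an abstract base class for collecting the top N results.

TopScoreDocCollector: a concrete subclass of TopDocsCollector that sorts results by score and docID. IndexSearcher uses this class internally in the search methods that do not take an explicit Sort, which makes it probably the most frequently used collector.

TopFieldCollector: also a subclass of TopDocsCollector. Where TopScoreDocCollector sorts by score and docID, TopFieldCollector sorts by the fields described in a user-supplied Sort. IndexSearcher uses it internally in the search methods that take an explicit Sort.

TimeLimitingCollector: a wrapper around any other Collector that aborts the search once it has taken too much time.

PositiveScoresOnlyCollector: the name says it all. It wraps another Collector and only passes through hits whose score is greater than zero. Like TimeLimitingCollector, it is a wrapper around another Collector, and both follow the decorator pattern.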
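The decorator idea behind these wrapper collectors can be sketched without Lucene on the classpath. The following is only a simplified model: `HitSink`, `ListSink`, and `PositiveOnlySink` are hypothetical names invented for this illustration and are not part of Lucene's API. The wrapper forwards a hit to the wrapped sink only when its score is greater than zero, which is the behavior described for PositiveScoresOnlyCollector.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the decorator idea; these types are hypothetical, not Lucene's API.
class DecoratorDemo {

    /** Hypothetical stand-in for a collector: receives one hit at a time. */
    interface HitSink {
        void collect(int doc, float score);
    }

    /** Innermost sink: remembers every doc ID it is given. */
    static final class ListSink implements HitSink {
        final List<Integer> docs = new ArrayList<>();

        @Override
        public void collect(int doc, float score) {
            docs.add(doc);
        }
    }

    /** Decorator: forwards a hit to the wrapped sink only when its score is > 0. */
    static final class PositiveOnlySink implements HitSink {
        private final HitSink delegate;

        PositiveOnlySink(HitSink delegate) {
            this.delegate = delegate;
        }

        @Override
        public void collect(int doc, float score) {
            if (score > 0.0f) {
                delegate.collect(doc, score);
            }
        }
    }

    /** Feeds four hits through the wrapper and returns the doc IDs that got through. */
    static List<Integer> run() {
        ListSink sink = new ListSink();
        HitSink wrapped = new PositiveOnlySink(sink);
        wrapped.collect(0, 1.5f);
        wrapped.collect(1, 0.0f);
        wrapped.collect(2, -0.3f);
        wrapped.collect(3, 0.2f);
        return sink.docs;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [0, 3]
    }
}
```

Because the wrapper implements the same interface as the thing it wraps, wrappers can be stacked in any order without the innermost sink knowing about them.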

The Collector interface declares only one method:

Java code
LeafCollector getLeafCollector(LeafReaderContext context) throws IOException;  

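Given the LeafReaderContext for a segment, this method returns a LeafCollector. A LeafCollector is the collector for one segment of the index, and the method is called once each time the search moves on to the next segment.

LeafCollector is the interface that actually collects results; Collector only produces the LeafCollector for each segment. In earlier Lucene 4.x releases the two were not separate, whereas in Lucene 5.0 the interfaces are finer-grained, which makes custom extensions more convenient.

Next, let's look at the source documentation of LeafCollector:

Java code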
/** 
 * <p>Collector decouples the score from the collected doc: 
 * the score computation is skipped entirely if it's not 
 * needed.  Collectors that do need the score should 
 * implement the {@link #setScorer} method, to hold onto the 
 * passed {@link Scorer} instance, and call {@link 
 * Scorer#score()} within the collect method to compute the 
 * current hit's score.  If your collector may request the 
 * score for a single hit multiple times, you should use 
 * {@link ScoreCachingWrappingScorer}. </p> 
 *  
 * <p><b>NOTE:</b> The doc that is passed to the collect 
 * method is relative to the current reader. If your 
 * collector needs to resolve this to the docID space of the 
 * Multi*Reader, you must re-base it by recording the 
 * docBase from the most recent setNextReader call.  Here's 
 * a simple example showing how to collect docIDs into a 
 * BitSet:</p> 
 *  
 * <pre class="prettyprint"> 
 * IndexSearcher searcher = new IndexSearcher(indexReader); 
 * final BitSet bits = new BitSet(indexReader.maxDoc()); 
 * searcher.search(query, new Collector() { 
 * 
 *   public LeafCollector getLeafCollector(LeafReaderContext context) 
 *       throws IOException { 
 *     final int docBase = context.docBase; 
 *     return new LeafCollector() { 
 * 
 *       <em>// ignore scorer</em> 
 *       public void setScorer(Scorer scorer) throws IOException { 
 *       } 
 * 
 *       public void collect(int doc) throws IOException { 
 *         bits.set(docBase + doc); 
 *       } 
 * 
 *     }; 
 *   } 
 * 
 * }); 
 * </pre> 
 * 
 * <p>Not all collectors will need to rebase the docID.  For 
 * example, a collector that simply counts the total number 
 * of hits would skip it.</p> 
 * 
 * @lucene.experimental 
 */  
public interface LeafCollector {  
  
  /** 
   * Called before successive calls to {@link #collect(int)}. Implementations 
   * that need the score of the current document (passed-in to 
   * {@link #collect(int)}), should save the passed-in Scorer and call 
   * scorer.score() when needed. 
   */  
  void setScorer(Scorer scorer) throws IOException;  
    
  /** 
   * Called once for every document matching a query, with the unbased document 
   * number. 
   * <p>Note: The collection of the current segment can be terminated by throwing 
   * a {@link CollectionTerminatedException}. In this case, the last docs of the 
   * current {@link org.apache.lucene.index.LeafReaderContext} will be skipped and {@link IndexSearcher} 
   * will swallow the exception and continue collection with the next leaf. 
   * <p> 
   * Note: This is called in an inner search loop. For good search performance, 
   * implementations of this method should not call {@link IndexSearcher#doc(int)} or 
   * {@link org.apache.lucene.index.IndexReader#document(int)} on every hit. 
   * Doing so can slow searches by an order of magnitude or more. 
   */  
  void collect(int doc) throws IOException;  
  
}  

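LeafCollector decouples scoring from document collection: if you don't need scores, the score computation is skipped entirely.

If you do need scores, implement setScorer to hold on to the Scorer that is passed in, then call Scorer.score() inside collect to compute the score of the current hit. If your LeafCollector may ask for the score of the same hit more than once inside collect, wrap the Scorer in a ScoreCachingWrappingScorer, which caches the score so that it is not computed repeatedly.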
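To make the caching idea concrete, here is a minimal sketch in plain Java. It is a simplified model, not Lucene's implementation: `DocScorer`, `CountingScorer`, and `CachingScorer` are hypothetical names made up for this example. The wrapper remembers which document the cached score belongs to and only asks the underlying scorer again once the current document changes.

```java
// Simplified model of score caching; these types are hypothetical, not Lucene's API.
class ScoreCachingDemo {

    /** Hypothetical stand-in for a scorer positioned on one document at a time. */
    interface DocScorer {
        int docID();
        float score();
    }

    /** Pretend-expensive scorer that counts how often score() really runs. */
    static final class CountingScorer implements DocScorer {
        int doc = -1;
        int computations = 0;

        @Override
        public int docID() {
            return doc;
        }

        @Override
        public float score() {
            computations++;
            return 1.0f / (doc + 1);
        }
    }

    /** Wrapper that computes the score at most once per document. */
    static final class CachingScorer implements DocScorer {
        private final DocScorer in;
        private int cachedDoc = -1;
        private float cachedScore;

        CachingScorer(DocScorer in) {
            this.in = in;
        }

        @Override
        public int docID() {
            return in.docID();
        }

        @Override
        public float score() {
            int doc = in.docID();
            if (doc != cachedDoc) {       // new document: compute and remember
                cachedScore = in.score();
                cachedDoc = doc;
            }
            return cachedScore;           // same document: reuse the cached value
        }
    }

    /** Asks for the score five times over two documents; returns the real computation count. */
    static int run() {
        CountingScorer raw = new CountingScorer();
        DocScorer cached = new CachingScorer(raw);
        raw.doc = 7;
        cached.score();
        cached.score();
        cached.score();
        raw.doc = 8;
        cached.score();
        cached.score();
        return raw.computations;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 2
    }
}
```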

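The doc parameter passed to collect is relative to the current segment reader. If you need the docID in the top-level IndexReader's space, you must re-base it by adding the docBase of the current segment. The Javadoc above still refers to the older setNextReader call; with the Lucene 5 API shown here, docBase comes from the LeafReaderContext passed to getLeafCollector, as in the BitSet example in the Javadoc.

Not every Collector needs to re-base docIDs. A Collector that only counts the total number of hits, for example, can skip this step.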

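To sum up: the Collector interface produces a LeafCollector, and the LeafCollector does the actual result collection and obtains the score of each hit. In other words, LeafCollector is the one doing the real work underneath.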
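The division of labor summarized above can be modeled in a few lines of plain Java. This is only a toy sketch under stated assumptions: `LeafSink`, `HitCounter`, and the `search` loop are hypothetical stand-ins, not Lucene classes. The factory is asked for a per-segment sink once per segment, the sink is called once per matching document, and a collector that merely counts hits never touches docBase.

```java
import java.util.function.IntFunction;

// Toy model of the Collector -> LeafCollector flow; all names here are hypothetical.
class PerSegmentDemo {

    /** Hypothetical per-segment sink, standing in for a leaf-level collector. */
    interface LeafSink {
        void collect(int doc);
    }

    /** Counts hits across all segments; it receives docBase but has no use for it. */
    static final class HitCounter implements IntFunction<LeafSink> {
        int segmentsSeen = 0;
        int totalHits = 0;

        @Override
        public LeafSink apply(int docBase) {
            segmentsSeen++;               // runs once per segment
            return doc -> totalHits++;    // runs once per matching document
        }
    }

    /** Toy search loop: ask the factory for a sink per segment, then feed it that segment's hits. */
    static void search(int[] docBases, int[][] hitsPerSegment, IntFunction<LeafSink> factory) {
        for (int i = 0; i < docBases.length; i++) {
            LeafSink leaf = factory.apply(docBases[i]);
            for (int doc : hitsPerSegment[i]) {
                leaf.collect(doc);
            }
        }
    }

    /** Runs the toy search over three segments; returns {segments seen, total hits}. */
    static int[] run() {
        HitCounter counter = new HitCounter();
        search(new int[] {0, 5, 8}, new int[][] {{1, 4}, {0, 2}, {1}}, counter);
        return new int[] {counter.segmentsSeen, counter.totalHits};
    }

    public static void main(String[] args) {
        int[] result = run();
        System.out.println(result[0] + " segments, " + result[1] + " hits"); // prints 3 segments, 5 hits
    }
}
```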

Java code
void collect(int doc) throws IOException;  

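The collect method is called for each matching document. The doc parameter is the document number relative to the current segment, not an index-wide ID. To get the index-wide document ID, add the segment's docBase, which is a field of LeafReaderContext: context.docBase + doc.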
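The re-basing arithmetic itself is simple enough to show without Lucene. In this sketch the `Segment` class and the sample numbers are assumptions made up for illustration: three segments holding 5, 3, and 2 documents get docBase values 0, 5, and 8, and every segment-relative hit is shifted by its segment's docBase to obtain an index-wide ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of docBase re-basing; Segment is a hypothetical type, not Lucene's.
class DocBaseDemo {

    /** Hypothetical segment: its docBase plus the segment-relative IDs of its hits. */
    static final class Segment {
        final int docBase;
        final int[] localHits;

        Segment(int docBase, int... localHits) {
            this.docBase = docBase;
            this.localHits = localHits;
        }
    }

    /** Converts every segment-relative hit into an index-wide document ID. */
    static List<Integer> globalIds(List<Segment> segments) {
        List<Integer> ids = new ArrayList<>();
        for (Segment segment : segments) {
            int docBase = segment.docBase;   // recorded once per segment
            for (int doc : segment.localHits) {
                ids.add(docBase + doc);      // re-base each hit
            }
        }
        return ids;
    }

    /** Segments of 5, 3, and 2 documents, so the docBase values are 0, 5, and 8. */
    static List<Integer> run() {
        List<Segment> segments = Arrays.asList(
                new Segment(0, 1, 4),
                new Segment(5, 0, 2),
                new Segment(8, 1));
        return globalIds(segments);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [1, 4, 5, 7, 9]
    }
}
```

Note that the local ID 1 appears in two different segments and maps to two different index-wide IDs (1 and 9), which is exactly why a collector that stores doc IDs must not skip this step.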

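If you need custom scoring, extend Scorer with your own implementation. So when does setScorer get called? You can find out by following the trail through IndexSearcher's search method.

Internally, search first combines the Query with the Filter argument, if one is given, and then calls createNormalizedWeight to produce a Weight. If you look at the Weight class, you will find that it has a scorer method that returns a Scorer.

That settles it: our LeafCollector doesn't need to care how the Scorer is created and handed to it. If we want custom scoring, we only need to supply our own Scorer. IndexSearcher.search first creates a Weight and uses it to produce a Scorer; since we pass our Collector to search, the Scorer is then passed to the LeafCollector through setScorer.

Note that if you implement your own Scorer, you must also implement your own Weight and have it produce your custom Scorer. To keep things simple, this post does not define a custom Scorer.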

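Below is a simple example of a custom Collector. I hope it serves as a starting point and clears up some confusion. If the code has any bugs or oversights, please let me know.

Java code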
package com.yida.framework.lucene5.collector;  
  
import java.io.IOException;  
import java.util.ArrayList;  
import java.util.List;  
  
import org.apache.lucene.index.LeafReaderContext;  
import org.apache.lucene.index.SortedDocValues;  
import org.apache.lucene.search.Collector;  
import org.apache.lucene.search.LeafCollector;  
import org.apache.lucene.search.ScoreDoc;  
import org.apache.lucene.search.Scorer;  
/** 
 * Custom Collector for gathering search results 
 * @author Lanxiaowei 
 * 
 */  
public class GroupCollector implements Collector, LeafCollector {  
    /** Scorer handed in by the search for the current segment */  
    private Scorer scorer;  
    /** docBase of the current segment */  
    private int docBase;  
      
    private String fieldName;  
    private SortedDocValues sortedDocValues;  
      
    private List<ScoreDoc> scoreDocs = new ArrayList<ScoreDoc>();  
      
    public LeafCollector getLeafCollector(LeafReaderContext context)  
            throws IOException {  
        // Record this segment's docBase so collect() can re-base doc IDs  
        this.docBase = context.docBase;  
        this.sortedDocValues = context.reader().getSortedDocValues(fieldName);  
        return this;  
    }  
      
    public void setScorer(Scorer scorer) throws IOException {  
        this.scorer = scorer;  
    }  
  
    public void collect(int doc) throws IOException {  
        // ScoreDoc: index-wide docID (docBase + segment-relative doc) and score  
        this.scoreDocs.add(new ScoreDoc(this.docBase + doc, this.scorer.score()));  
    }  
  
    public GroupCollector(String fieldName) {  
        super();  
        this.fieldName = fieldName;  
    }  
  
    public int getDocBase() {  
        return docBase;  
    }  
  
    public void setDocBase(int docBase) {  
        this.docBase = docBase;  
    }  
  
    public String getFieldName() {  
        return fieldName;  
    }  
  
    public void setFieldName(String fieldName) {  
        this.fieldName = fieldName;  
    }  
  
    public SortedDocValues getSortedDocValues() {  
        return sortedDocValues;  
    }  
  
    public void setSortedDocValues(SortedDocValues sortedDocValues) {  
        this.sortedDocValues = sortedDocValues;  
    }  
  
    public List<ScoreDoc> getScoreDocs() {  
        return scoreDocs;  
    }  
  
    public void setScoreDocs(List<ScoreDoc> scoreDocs) {  
        this.scoreDocs = scoreDocs;  
    }  
  
    public Scorer getScorer() {  
        return scorer;  
    }  
}  

Java code
package com.yida.framework.lucene5.collector;  
  
import java.io.IOException;  
import java.nio.file.Paths;  
import java.util.List;  
  
import org.apache.lucene.document.Document;  
import org.apache.lucene.index.DirectoryReader;  
import org.apache.lucene.index.IndexReader;  
import org.apache.lucene.index.Term;  
import org.apache.lucene.search.IndexSearcher;  
import org.apache.lucene.search.ScoreDoc;  
import org.apache.lucene.search.TermQuery;  
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.FSDirectory;  
/** 
 * Test for the custom Collector 
 * @author Lanxiaowei 
 * 
 */  
public class GroupCollectorTest {  
    public static void main(String[] args) throws IOException {  
        String indexDir = "C:/lucenedir";  
        Directory directory = FSDirectory.open(Paths.get(indexDir));  
        IndexReader reader = DirectoryReader.open(directory);  
        IndexSearcher searcher = new IndexSearcher(reader);  
        TermQuery termQuery = new TermQuery(new Term("title", "lucene"));  
        GroupCollector collector = new GroupCollector("title2");  
        searcher.search(termQuery, null, collector);  
        List<ScoreDoc> docs = collector.getScoreDocs();  
        for (ScoreDoc scoreDoc : docs) {  
            int docID = scoreDoc.doc;  
            Document document = searcher.doc(docID);  
            String title = document.get("title");  
            float score = scoreDoc.score;  
            System.out.println(docID + ":" + title + "  " + score);  
        }  
          
        reader.close();  
        directory.close();  
    }  
}  

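This is only a simple example. If you need tighter control over which documents are collected, put that logic in the collect method. If you need finer-grained control over scoring, extend the Scorer abstract class and the Weight abstract class with your own implementations, then call the following IndexSearcher method (it is protected, so it has to be called from a subclass of IndexSearcher):

Java code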
protected TopFieldDocs search(Weight weight, FieldDoc after, int nDocs,  
                                Sort sort, boolean fillFields,  
                                boolean doDocScores, boolean doMaxScore)  
      throws IOException  

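As always, the demo source code is uploaded as an attachment below. As for those who asked me not to build my demos with Maven: I am very sorry, I can't meet your requirements. If you don't know Maven, take some time to learn it. OK, it's past 1 a.m., time to put down the pen and go to bed!

My QQ: 7-3-6-0-3-1-3-0-5. You're welcome to join my Java tech group to learn and exchange ideas.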


Reposted from: http://iamyida.iteye.com/blog/2202111
