Solr query on PDF files does not return highlighted content

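Problem description:

I set up Solr 6.5.1 on my Debian server today, but I cannot get the PDF text content back. Searching itself works: when I query for my name, "juan", for example, the document is returned. However, the highlighting section for each result comes back empty.

Here is an example query: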

http://localhost:8983/solr/ex/select?q=juan&fl=title&wt=xml&hl=true&hl.snippets=20&hl.fl=content&hl.usePhraseHighlighter=true

Here is the result:

<response> 
    <lst name="responseHeader"> 
     <int name="status">0</int> 
     <int name="QTime">1</int> 
     <lst name="params"> 
      <str name="hl.snippets">20</str> 
      <str name="q">juan</str> 
      <str name="hl">true</str> 
      <str name="fl">title</str> 
      <str name="hl.usePhraseHighlighter">true</str> 
      <str name="hl.fl">content</str> 
      <str name="wt">xml</str> 
     </lst> 
    </lst> 
    <result name="response" numFound="1" start="0"> 
     <doc> 
      <arr name="title"> 
       <str>CV_Juan_Jara_ultimo</str> 
      </arr> 
     </doc> 
    </result> 
    <lst name="highlighting"> 
     <lst name="/solr-6.5.1/mydocs/CV_Juan_Jara_ultimo.pdf"/> 
    </lst> 
</response>

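Also, the log shows all of the PDF text, so I believe the file is indexed correctly (I indexed the PDF with the command: bin/post -c ex mydocs/CV_Juan_Jara_ultimo.pdf).

I added the "content" field to the schema using curl: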

curl -X POST -H 'Content-type:application/json' --data-binary '{ 
    "add-field" : { 
    "name":"text", 
    "type":"text_general", 
    "indexed":"true", 
    "stored":"false", 
    "multiValued":"true" 
    } 
}' localhost:8983/solr/ex/schema

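Do you know what is going wrong?

All I want to do is search for a term in my PDFs and get every result back with the matches highlighted, like this: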

http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/

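Answer

This is a very common and simple mistake:

"stored":"false" should be "stored":"true" for the 'content' field.

Currently all highlighters require the field to be stored in order to work [1].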

[1] https://cwiki.apache.org/confluence/display/solr/Highlighting
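Before changing anything, it can help to check what the schema currently says about the field. The following is a minimal sketch, assuming the same host, port and core name ("ex") as in the question; the helper names field_url and show_field are made up for illustration. It uses the Schema API's per-field endpoint.

```shell
#!/bin/sh
# Assumption: same host, port and core ("ex") as in the question.
SOLR_CORE_URL="http://localhost:8983/solr/ex"

# Build the Schema API URL for a single field (hypothetical helper).
# $1 = field name
field_url() {
    printf '%s/schema/fields/%s\n' "$SOLR_CORE_URL" "$1"
}

# Fetch the field definition as JSON. Only explicitly set properties
# are listed by default, so "stored" appears if the schema sets it.
show_field() {
    curl -s "$(field_url "$1")"
}

# Usage against a running Solr:
#   show_field content
#   show_field _text_
```

If the output shows "stored":false for the field named in hl.fl, the highlighter has no stored text to build snippets from.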

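Thanks for the quick reply. I changed stored to true, but still nothing. However, I noticed that when I add the parameter "hl.method=unified", the highlighting section of the response does include the field, but it is empty.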
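For reference, the query from the question with the extra parameter from this comment could be built roughly as follows. This is only a sketch: the helper name highlight_url is hypothetical, and it assumes a plain search term that needs no URL encoding.

```shell
#!/bin/sh
# Assumption: same host, port and core ("ex") as in the question.
SOLR_CORE_URL="http://localhost:8983/solr/ex"

# Build the select URL from the question and append hl.method
# (hypothetical helper; does no URL encoding of its arguments).
# $1 = search term, $2 = field to highlight, $3 = hl.method value
highlight_url() {
    printf '%s/select?q=%s&fl=title&wt=xml&hl=true&hl.snippets=20&hl.fl=%s&hl.usePhraseHighlighter=true&hl.method=%s\n' \
        "$SOLR_CORE_URL" "$1" "$2" "$3"
}

# Usage against a running Solr:
#   curl -s "$(highlight_url juan content unified)"
```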

Answer

Solved: what fixed everything for me was replacing the _text_ field in the schema with this curl command:

curl -X POST -H 'Content-type:application/json' --data-binary '{ 
"replace-field" : { 
"name":"_text_", 
"type":"text_general", 
"indexed":"true", 
"stored":"true", 
"multiValued":"true" 
} 
}' http://localhost:8983/solr/ex/schema

This is because the _text_ field comes with "stored":"false" by default.

Note: remember to reindex all of your documents into the core if they were indexed before you replaced this field in the schema.
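The reindexing step might look roughly like the sketch below. It assumes the core name and PDF path from the question; the function names are invented for illustration. Clearing the core first is optional and destructive, so it is kept in a separate function.

```shell
#!/bin/sh
# Assumption: same host, port and core ("ex") as in the question.
SOLR_CORE_URL="http://localhost:8983/solr/ex"

# XML body for a delete-by-query that matches every document.
delete_all_body() {
    printf '%s\n' '<delete><query>*:*</query></delete>'
}

# Optional and destructive: remove all documents from the core and commit.
clear_core() {
    curl -s "$SOLR_CORE_URL/update?commit=true" \
        -H 'Content-Type: text/xml' \
        --data-binary "$(delete_all_body)"
}

# Post the PDF again with the same command used in the question.
# Must be run from the Solr installation directory.
reindex_pdf() {
    bin/post -c ex mydocs/CV_Juan_Jara_ultimo.pdf
}

# Usage against a running Solr:
#   clear_core      # only if a clean core is wanted
#   reindex_pdf
```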
