Search (Pt 1) - A Gentle Introduction


The ability to search the entire web in less than a second for whatever we fancy knowing is one of the greatest achievements of recent history. But how does it work? What are its building blocks? And, most importantly, … can we hack together our own version of it? That last question matters because search is inevitably personal: it is all about our focus, our preferences, the resources at our disposal and even our emotions. Plus, it’s really cool!

In this three-part series, I will:


  • Pt 1. Provide a gentle introduction to search, using both Google and Elasticsearch as examples

  • Pt 2. Explain some state-of-the-art NLP techniques, compare their results to traditional approaches and discuss pros and cons

  • Pt 3. Provide a hacker's guide to building your own search engine with Elasticsearch, containing 1 million news headlines and employing state-of-the-art NLP for enhanced semantic searches…

Search - in a nutshell

When we talk about search nowadays, we often mean semantic search. What is semantic search, you ask? Imagine searching for the phrase “virus threat”. A simple lexical search approach will come back with documents containing exactly those words, in a particular order of importance. Documents about "security threat" will also be considered relevant, as they contain part of the query.

Semantic search, on the other hand, can also pick up on related ideas such as “disease”, “infection” and “corona”. The search becomes far wider and potentially more accurate, reflecting the "meaning" of what we are looking for rather than sticking to its specific keywords. In this section, I draw frequently on the work by Bast, Hannah; Buchhold, Björn; Haussmann, Elmar (2016), “Semantic search on text and knowledge bases”. In that text they state:

Semantic search denotes search with meaning, as distinguished from lexical search where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query
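
To make the lexical side of that distinction concrete, here is a minimal sketch of literal matching. It is my own toy illustration rather than how any real engine is implemented, and the three headlines are invented for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_score(query, document):
    """Count how many distinct query words literally occur in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))


# Hypothetical headlines, invented for this illustration.
headlines = [
    "New virus threat puts hospitals on alert",
    "Security threat found in popular router",
    "Infection rates climb as disease spreads",
]

for headline in headlines:
    print(lexical_score("virus threat", headline), headline)
```

The first headline matches both words and the second matches only "threat". The third, arguably very relevant, scores zero because it shares no literal word with the query. That gap is exactly what semantic search tries to close.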


The diagram below shows core components of a search engine of this kind


[Figure: core components of a semantic search engine]
Image by the author

We focus on semantic search on text with some additional annotations (such as names, dates, links, etc.), as opposed to, say, search on structured databases. This is essentially the typical web search we use all the time.

Please note that this article deals with search that produces lists of relevant documents or individual facts, not with additional steps such as ranking based on source quality (e.g. PageRank), results summarisation, etc.


Query Types

These can be broken down into:


  • Keyword - these are shorthand searches, not proper sentences, but ones where the set of keywords, and sometimes their order, carries semantic meaning. For instance: "Neil Armstrong date of birth" or "pasta recipe under 10 mins"


  • Structured/Semi-structured - special syntax used in a query. It can represent either the full query or just refinements to it. For instance, this might be a restriction to search only a specific source, e.g. news from AP only. In other cases it might restrict the language of the results or state mandatory elements of the query

  • Natural language & natural questions - fully or mostly grammatically formed questions: “What is Neil Armstrong’s date of birth?”. This is the most natural way to interact with search; however, it also poses many difficulties. For instance, we could be asking multiple questions at once ("Where can I park and what are the opening hours?") or posing philosophical queries instead of fact-seeking ones ("What is the meaning of life?"). As you can see from the examples, the scope of questions is quite broad. While they all make sense to us, algorithms tend to specialise in narrow tasks, hence the need for various algorithms working in concert to determine which results are appropriate.
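
As a rough sketch of the semi-structured case, the snippet below splits a query into free-text keywords and `key:value` refinements. The `source:` and `lang:` syntax is a hypothetical convention I made up for this example; real engines each define their own.

```python
def parse_query(raw):
    """Split a query into free-text keywords and key:value refinements."""
    keywords, filters = [], {}
    for token in raw.split():
        key, separator, value = token.partition(":")
        if separator and key and value:
            # Looks like a refinement such as source:AP
            filters[key.lower()] = value
        else:
            keywords.append(token)
    return {"keywords": " ".join(keywords), "filters": filters}


print(parse_query("virus threat source:AP lang:en"))
```

The keywords would then go to the search mechanism, while the filters restrict which documents are considered at all.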

Query processing

These are the different types of transformations the system might need to perform on the original entry before passing it on to the search algorithm. They could be:

  • Extractive - specific names, entities and places are extracted to help the search further, by comparing them with values in the document metadata or against knowledge bases. For instance, in the query "Neil Armstrong" below, the information box on the left is the result of invoking Google’s Knowledge Base, because the query was successfully matched with an entry in it

[Figure: Google results for the query "Neil Armstrong"]
  • Filters and constraints - where a semi-structured query specifies some restrictions on the results, e.g. only news in English, those restrictions are translated into a narrower search scope for the search engine


  • Other transformations are modifications to the search itself, e.g. for wildcard or fuzzy search. In these cases the original query may be transformed into one or many variants. For instance, with fuzzy search, we might allow some number of character modifications to the keywords entered until we find the most likely word searched for. See below the result in Google when I look for "Neil Armslong". Even though a gentleman by the name of Armslong probably exists and is important, the system considers it far more likely that we made a typo.

[Figure: Google results for the query "Neil Armslong"]
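
The "number of character modifications" idea is usually measured as edit distance. Below is a minimal sketch using the classic Levenshtein distance. The small vocabulary and the `max_edits` cut-off are my own assumptions for illustration, and this is not a claim about how Google resolves typos.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_term(word, vocabulary, max_edits=2):
    """Return the vocabulary term nearest to `word`, or None if none is close enough."""
    best = min(vocabulary, key=lambda term: levenshtein(word.lower(), term.lower()))
    return best if levenshtein(word.lower(), best.lower()) <= max_edits else None


vocabulary = ["Armstrong", "Aldrin", "Collins"]  # hypothetical index terms
print(levenshtein("Armslong", "Armstrong"))
print(closest_term("Armslong", vocabulary))
```

"Armslong" is two edits away from "Armstrong" (substitute "l" with "t", then insert "r"), so a fuzzy search that tolerates two edits would treat it as a likely typo.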

Search and Rank

Finally, one or more types of search & ranking approaches may be used. These will either find an answer directly or return a ranked list of results matching the query. Ranking makes sure that more pertinent results appear higher up - for example, results that mention the search keywords more often than others, or that contain information relevant to the query in their title or opening paragraphs. The main approaches are:

  • Keyword searches - the most common type, where exact or very close to literal matches are made. The majority of searches are still done this way. What makes them semantic is that they use term occurrences to rank documents that appear more relevant to a keyword higher, and they recognize when some of the keywords are rare, ranking hits on those higher than hits on more 'common' words in the query. A number of algorithms are available: BM25, tf-idf, various Learning to Rank methods, etc.

  • Contextual searches - by this I mean any search based on text embeddings that attempts to use the query as a whole and find contextually relevant results, as opposed to relying on specific keywords or phrases individually to determine results. We will focus on this a bit more later, as it is central to this series. Some recent advances in NLP techniques will help us improve the quality of search significantly.
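
To illustrate the rare-keyword weighting mentioned under keyword searches, here is a minimal sketch of plain tf-idf scoring. It is a simplification: BM25, for example, adds term-frequency saturation and document-length normalisation on top of this idea. The headlines are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_scores(query, documents):
    """Score each document by summing tf * idf over the distinct query terms."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0:
                continue  # term appears nowhere, so it cannot contribute
            idf = math.log(n_docs / doc_freq)  # rarer terms get a larger weight
            score += counts[term] * idf
        scores.append(score)
    return scores


# Hypothetical headlines, invented for this illustration.
headlines = [
    "Virus threat grows in the region",
    "Security threat found in router",
    "Bomb threat closes station",
    "Threat level raised after storm",
    "Markets rally on trade deal",
]

for score, headline in zip(tf_idf_scores("virus threat", headlines), headlines):
    print(round(score, 3), headline)
```

Because "threat" appears in four of the five headlines while "virus" appears in only one, a hit on "virus" counts for far more. The first headline therefore clearly outranks the ones that only mention "threat".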

Let's quickly have a face-off - keyword vs contextual search. Searching for “virus threat” on news headlines, the left set of results is from a keyword approach, while the set on the right is from contextual search. The latter gives us a number of results that do not match any search term directly, like example 5 on the right: “WHO highlights dangers of vector borne diseases”

[Figure: keyword search results (left) vs contextual search results (right) for “virus threat”]
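
How can a headline with no matching words still rank well? Contextual search compares vectors (embeddings) rather than words, typically using cosine similarity. The sketch below uses tiny hand-assigned vectors purely for illustration. Real embeddings come from a trained model and have hundreds of dimensions. Apart from the WHO headline quoted above, the headlines and all the numbers are my own inventions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings", hand-assigned for illustration only.
# The axes loosely stand for (health, computing, danger).
query_vector = [0.8, 0.2, 0.8]  # stands in for the query "virus threat"
headline_vectors = {
    "WHO highlights dangers of vector borne diseases": [0.9, 0.0, 0.7],
    "Security threat found in popular router": [0.1, 0.9, 0.7],
    "Markets rally on trade deal": [0.1, 0.2, 0.0],
}

ranked = sorted(
    headline_vectors,
    key=lambda headline: cosine(query_vector, headline_vectors[headline]),
    reverse=True,
)
for headline in ranked:
    print(round(cosine(query_vector, headline_vectors[headline]), 3), headline)
```

With these toy vectors the WHO headline comes out on top even though it shares no word with the query, while the router headline, which does contain "threat", ranks below it.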
  • Knowledge base - as seen above, queries can be matched directly to entries in a knowledge base, which are then used to generate a result. More advanced techniques can also apply, where a keyword or natural language query is transformed into a query to a knowledge base. For instance, ‘Astronauts on the moon’ would return another knowledge base result

[Figure: Google knowledge base result for the query ‘Astronauts on the moon’]
  • Question-answering - traditionally, search engines have used modifications from the processing step to transform a natural question into a more keyword-like query and process it as such. More recently, advances in NLP have shown strong performance from algorithms that directly pinpoint whether and where an answer to a natural question can be found within a specific document. Unlike the other search paradigms above, question answering focuses on providing an actual (single) answer as opposed to a list of documents. Here is what happens when we ask about the moon landing as a natural question. In addition to a list of results, we get a specific answer.

[Figure: Google result for a natural question about the moon landing]

The technique works similarly for a not-so-natural question such as ‘year of first moon landing’


[Figure: Google result for the query ‘year of first moon landing’]

Finally, slight modifications to the query can break the result: we no longer get an explicit answer, and we may even land somewhere else completely


[Figure: Google result for a slightly modified moon landing query]
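
The traditional trick mentioned under question-answering, turning a natural question into a more keyword-like query, can be sketched very roughly as stop-word removal. The stop list below is a tiny hand-picked assumption of mine. Production systems use far richer processing.

```python
import re

# A tiny, hand-picked stop list; an assumption for illustration only.
STOP_WORDS = {
    "what", "when", "where", "who", "which", "how",
    "is", "was", "are", "were", "did", "do", "does",
    "the", "a", "an", "of", "in", "on", "to",
}


def to_keyword_query(question):
    """Drop question words and other filler, keeping the content-bearing terms."""
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)


print(to_keyword_query("When was the first moon landing?"))
```

The natural question collapses to "first moon landing", which a keyword search can handle. It also hints at why small changes in wording can change the result: the surviving keywords change.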

Putting it all together

To summarize, any query type can pass through a number of different modifications and be run through any of a number of search mechanisms to produce candidate results. Each of these approaches will express a confidence in its results; however, confidence scores from different algorithms may not be directly comparable. At this stage, a further decision algorithm determines which answers are well suited and "confident" enough to be passed on to the user as the final list of answers.

[Figure: query types, query processing, and search and rank mechanisms combined into one pipeline]
Image by the author
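
To give a feel for that final decision step, here is a minimal sketch that rescales each engine's scores to a common range before merging them. Min-max scaling and the 0.5 threshold are my own simplifying assumptions. They are just one of many possible ways to make scores comparable, not a description of what any particular engine does. The document ids and scores are made up.

```python
def normalise(results):
    """Min-max scale one engine's (doc, score) pairs to the range [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    span = high - low
    return [(doc, (score - low) / span if span else 1.0) for doc, score in results]


def merge(result_lists, threshold=0.5):
    """Keep each document's best normalised score and drop low-confidence ones."""
    best = {}
    for results in result_lists:
        for doc, score in normalise(results):
            best[doc] = max(score, best.get(doc, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(doc, score) for doc, score in ranked if score >= threshold]


# Hypothetical raw scores on two very different scales.
keyword_hits = [("doc_a", 12.4), ("doc_b", 7.1), ("doc_c", 1.3)]
contextual_hits = [("doc_d", 0.91), ("doc_a", 0.88), ("doc_e", 0.40)]

for doc, score in merge([keyword_hits, contextual_hits]):
    print(doc, round(score, 3))
```

Without the rescaling, the keyword scores would swamp the contextual ones simply because they live on a larger scale.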

A functioning search engine needs at least one option from each of the three steps of the process, and may combine several. We have seen that Google uses most of them under the hood, but what about making our own...

I should reveal my secret agenda…

I actually wanted to hack my own search engine all along.


The tool of choice is Elasticsearch, primarily because it actually comes out of the box with a lot of search features. At the same time, it is very well supported and gets you a long way in terms of open source features.


Here is a diagram of what we get out of the box with Elastic for the purposes of this discussion. Note that you should not rely on this summary being complete, as I have a specific objective in mind.

[Figure: search features available out of the box with Elasticsearch]
Image by the author

You will notice that Elastic can handle any query type (even though, by default, they will all be handled by a keyword search mechanism) and allows you to modify your queries further with fuzzy matching, wildcards and quite a few other options. If the data allows it, one can also apply any number of structured conditions to the results: date of publishing, source, etc.
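
As a taste of what this looks like in practice, below is a sketch of a request body in Elasticsearch's query DSL, built as a plain Python dictionary. It combines a fuzzy keyword match with two structured conditions. The field names `headline`, `source` and `publish_date` are assumptions about how the data might be indexed (with `source` mapped as a keyword field), not part of any real dataset described here.

```python
import json

# A bool query: "must" clauses are scored, "filter" clauses only restrict results.
search_body = {
    "query": {
        "bool": {
            "must": [
                # Keyword match that tolerates small typos in the query terms.
                {"match": {"headline": {"query": "virus threat", "fuzziness": "AUTO"}}}
            ],
            "filter": [
                {"term": {"source": "AP"}},                          # only this source
                {"range": {"publish_date": {"gte": "2020-01-01"}}},  # only recent items
            ],
        }
    }
}

print(json.dumps(search_body, indent=2))
```

The printed JSON is what would be sent to an index's `_search` endpoint. I am only building the body here, so the snippet runs without a cluster.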

In terms of search & ranking, there is a lot of flexibility in keyword search, but not much else.


Overall, this is a very impressive list of out-of-the-box features. As it turns out, with some extra legwork we can even add contextual search, which is what we will do next...

Conclusion

We have explored the major building blocks of search, how they work together and their impact on search results. Different types of queries may trigger different search algorithms, with a result that is a mixture of approaches. Looking at examples from Google, we see that the same user experience (typing into a simple text box) is serviced by a number of techniques.

In the following articles, we will compare contextual and keyword search side-by-side (Pt 2) and finally combine a few different tools to extend Elasticsearch with additional contextual capabilities and build our own semantic search engine (Pt 3).

Btw, Neil Armslong


I hope you enjoyed reading this. We will be back with more in Pt 2 next week. In the meantime, if you feel like saying hi or would just like to tell me I am wrong, feel free to reach out via LinkedIn.

Special thanks to Rich Knuszka for valuable feedback.


Please note that I have no affiliation with Google or Elasticsearch and the opinions and analysis are my own


