A New Approach to Fuzzy Matching in Elasticsearch: The ngram Tokenizer Explained

Author: @小白创作中心

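Fuzzy (partial) matching is a common requirement in Elasticsearch, especially when handling search queries typed by users. The traditional wildcard query is flexible but slow, particularly on large datasets. The ngram tokenizer offers a more efficient alternative: substrings are indexed in advance, so a partial match becomes an ordinary term lookup, at the cost of a larger index.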

How the ngram tokenizer works

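The core idea of the ngram tokenizer is to slide a window over the text and emit every contiguous character sequence (a gram) whose length lies between min_gram and max_gram. Each gram is stored as a term in the inverted index, so a query for a fragment of a word can be answered by looking up the grams of that fragment.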

For example, for the text "Quick Fox", the default configuration (min_gram=1, max_gram=2) produces:

[Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x]

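The min_gram and max_gram parameters control the granularity of the tokens. A min_gram of 3 is the usual recommended starting point, because shorter grams tend to produce too many irrelevant matches. Note that the gap between max_gram and min_gram is limited by the index-level setting index.max_ngram_diff, which defaults to 1.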
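
The sliding window described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Elasticsearch's actual implementation; the function name ngrams is made up for this example, and it ignores token_chars (every character, including spaces, is kept).

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Emit every substring whose length is between min_gram and max_gram.

    Tokens are produced position by position, shortest gram first,
    which mirrors the order shown for "Quick Fox" above.
    """
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end <= len(text):  # skip windows that run past the end
                tokens.append(text[start:end])
    return tokens


# Default-like settings (1 to 2 characters) versus trigrams only.
print(ngrams("Quick Fox"))
print(ngrams("Quick Fox", min_gram=3, max_gram=3))
```

Comparing the two printed lists shows why a larger min_gram keeps the index smaller and the matches more selective: single-character grams such as "u" or "o" occur in almost every document.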

ngram vs. wildcard: performance comparison

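A wildcard query whose pattern starts with * has to scan a large number of terms, so its performance degrades markedly as the data grows. The ngram tokenizer builds the gram index in advance, at index time, so matches can be located quickly and queries are much faster.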
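
The difference can be illustrated with a toy model in Python. Everything here is an assumption made for the sketch: the sample values, the function names, and the simplification that each document holds a single term. It is not how Lucene stores data; it only contrasts "test every value" with "look up precomputed grams".

```python
from collections import defaultdict


def trigrams(text, n=3):
    """Return all contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical sample data: doc id -> field value.
docs = {
    1: "13800138000",
    2: "13912345678",
    3: "02138001234",
}


def wildcard_scan(fragment):
    """Model of *fragment*: every stored value has to be tested."""
    return sorted(doc_id for doc_id, value in docs.items() if fragment in value)


# Built once at index time: gram -> set of doc ids containing it.
gram_index = defaultdict(set)
for doc_id, value in docs.items():
    for gram in trigrams(value):
        gram_index[gram].add(doc_id)


def gram_lookup(fragment):
    """Look up the fragment's grams and intersect their posting sets."""
    grams = trigrams(fragment)
    if not grams:  # fragment shorter than the gram size: nothing to look up
        return []
    candidates = set.intersection(*(gram_index.get(g, set()) for g in grams))
    # The final substring check stands in for the position check that a
    # phrase-style query would perform on the candidates.
    return sorted(d for d in candidates if fragment in docs[d])


print(wildcard_scan("3800"))
print(gram_lookup("3800"))
```

Both calls return the same documents, but the scan touches every value while the lookup only reads the posting sets for "380" and "800". The price is the extra storage for gram_index, which is the space-for-time trade-off in miniature.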

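Practical use cases

The ngram tokenizer is particularly well suited to the following scenarios:

Searching text without delimiters: for example mobile phone numbers and model numbers
Highlighting: combine it with a match or match_phrase query to highlight the matched keywords
Fuzzy search over large datasets: trade index space for query time to improve retrieval efficiency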
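
As a rough sketch of the first two scenarios, the Python snippet below builds a search request body that looks for a fragment of a phone number and asks for highlighting. The field name phone and the helper name are hypothetical, and the sketch assumes the field is mapped with an ngram analyzer whose min_gram and max_gram are equal, so that the grams of the fragment line up as a phrase.

```python
import json

FIELD = "phone"  # hypothetical field, assumed to use an ngram analyzer


def substring_search_body(user_input, field=FIELD):
    """Build a search body: phrase match on the grams, plus highlighting."""
    return {
        # match_phrase requires the grams to appear consecutively, which
        # avoids hits that merely share some grams with the input.
        "query": {"match_phrase": {field: user_input}},
        # Ask Elasticsearch to mark the matched parts of the field.
        "highlight": {"fields": {field: {}}},
    }


print(json.dumps(substring_search_body("8001"), indent=2))
```

The printed JSON can be sent as the body of a _search request. One caveat worth keeping in mind: input shorter than min_gram produces no grams at all, so such a query would not match anything.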

Configuration example

The following example shows how to configure the ngram tokenizer in Elasticsearch:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
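
To get a feel for what token_chars does in the configuration above, here is a small Python approximation. It assumes that characters outside the listed classes simply act as split points, and it uses str.isalpha and str.isdigit as stand-ins for the letter and digit classes; the function names and the sample input "SM-G9980" are invented for the illustration.

```python
from itertools import groupby


def is_token_char(ch):
    """Approximate token_chars = [letter, digit]."""
    return ch.isalpha() or ch.isdigit()


def ngram_tokenize(text, min_gram=3, max_gram=3):
    """Split on non-token characters, then emit grams within each run."""
    tokens = []
    for kept, group in groupby(text, key=is_token_char):
        if not kept:
            continue  # punctuation and whitespace only separate runs
        run = "".join(group)
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(run):
                    tokens.append(run[start:start + size])
    return tokens


# "SM" is shorter than min_gram, so only "G9980" contributes grams.
print(ngram_tokenize("SM-G9980"))
```

Because grams never cross a split point, runs shorter than min_gram yield no tokens at all, which is worth remembering when choosing min_gram for short codes.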

With this configuration, analyzing "2 Quick Foxes." produces: