Fix fuzziness for ElasticSearch/OpenSearch query

Asked by: RobertPro Asked: 11/16/2023 Updated: 11/18/2023 Views: 35

Q:

I am having a problem with a simple query. Given this data:

POST test/_doc/1
{
  "id": 1,
  "title": "Test Name"
}

POST test/_doc/2
{
  "id": 2,
  "title": "TestName"
}

And this query:

GET test/_search
{
  "query": {
    "match": {
      "title": {
        "query": "TestName",
        "fuzziness": "AUTO"
      }
    }
  }
}

I get this output:

{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.605183,
    "hits": [
      {
        "_index": "test",
        "_id": "2",
        "_score": 1.605183,
        "_source": {
          "id": 2,
          "title": "TestName"
        }
      }
    ]
  }
}

Why does the output not return both records?

How can I fix this?

Elasticsearch OpenSearch

A:

1 upvote Paulo 11/16/2023 #1

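TL;DR

Fuzziness in Elasticsearch is limited. The limit applies to the Levenshtein distance, which is capped at a maximum of 2.

This means you cannot match anything that is more than 2 edits away from the query term.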
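
To make the limit concrete, here is a minimal Python sketch (not Elasticsearch code) that computes the edit distance between the query term and a few indexed terms. It assumes the default analyzer has lowercased the text and split "Test Name" into `test` and `name`, and that `"fuzziness": "AUTO"` uses its default length thresholds of 3 and 6. The helper names `levenshtein` and `auto_fuzziness` are made up for this illustration. Elasticsearch also counts a swap of two adjacent characters as a single edit by default, which this sketch ignores because none of the terms below involve one.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def auto_fuzziness(term: str) -> int:
    """Edits allowed by AUTO with its default thresholds (assumed: 3 and 6)."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2


query_term = "testname"  # "TestName" after lowercasing
allowed = auto_fuzziness(query_term)
for indexed_term in ["test", "name", "testname", "testna"]:
    distance = levenshtein(query_term, indexed_term)
    verdict = "match" if distance <= allowed else "no match"
    print(indexed_term, distance, verdict)
```

Under these assumptions, `test` and `name` are each 4 edits away from `testname`, so the document "Test Name" is out of reach, while `testna` is exactly 2 edits away and still matches.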

To understand

POST 77491663/_doc/1
{
  "id": 1,
  "title": "Test Name"
}

POST 77491663/_doc/2
{
  "id": 2,
  "title": "TestName"
}

POST 77491663/_doc/3
{
  "id": 2,
  "title": "TestNa"
}

GET 77491663/_search
{
  "query": {
    "match": {
      "title": {
        "query": "TestName",
        "fuzziness": "2"
      }
    }
  }
}

should give you

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0925692,
    "hits": [
      {
        "_index": "77491663",
        "_id": "2",
        "_score": 1.0925692,
        "_source": {
          "id": 2,
          "title": "TestName"
        }
      },
      {
        "_index": "77491663",
        "_id": "3",
        "_score": 0.7283795,
        "_source": {
          "id": 2,
          "title": "TestNa"
        }
      }
    ]
  }
}

To fix

You may want to look into what analyzers can do.

For example, if you were to use an ngram tokenizer, you would get it to work.

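Comments

0 upvotes RobertPro 11/17/2023
I was able to get it working with ngram, but it looks like it only works when everything is all uppercase or all lowercase. That is something I can change, but I would prefer case-insensitive queries.
0 upvotes Paulo 11/17/2023
You may need to add a lowercase filter? elastic.co/guide/en/elasticsearch/reference/current/......
0 upvotes RobertPro 11/17/2023
Yes, I was able to get it working and will update here soon.

0 upvotes Mouad Slimane 11/16/2023 #2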

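If you are using the default Elasticsearch analyzer, the value Test Name is split into separate terms (test and name) that are stored in the inverted index. This means that when you search with the value TestName, Elasticsearch checks whether TestName matches each stored term, with or without fuzziness, at the term level and not as a phrase. That is why you cannot get the first document, Test Name.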
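
The term-level behaviour described above can be sketched in Python with a toy inverted index. This is only an approximation: the hypothetical `standard_like_tokens` helper splits plain ASCII text on non-alphanumeric characters and lowercases it, whereas the real default analyzer follows Unicode text segmentation rules. For the two titles in the question the result should be the same.

```python
import re
from collections import defaultdict


def standard_like_tokens(text: str) -> list:
    """Rough stand-in for the default analyzer on plain ASCII text:
    split on non-alphanumeric characters, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


docs = {1: "Test Name", 2: "TestName"}

# Build a toy inverted index: term -> set of document ids containing it.
inverted_index = defaultdict(set)
for doc_id, title in docs.items():
    for term in standard_like_tokens(title):
        inverted_index[term].add(doc_id)

# The query string goes through the same analysis before lookup.
query_terms = standard_like_tokens("TestName")
exact_hits = set()
for term in query_terms:
    exact_hits |= inverted_index.get(term, set())

print(dict(inverted_index))
print(query_terms, exact_hits)
```

The index holds `test` and `name` for document 1 and `testname` for document 2, so the single query term `testname` only finds document 2 unless a stored term lies within the allowed edit distance.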

0 upvotes RobertPro 11/18/2023 #3

So the solution was to use the Elasticsearch edge n-gram tokenizer. I also had to add the lowercase filter to the analyzer.

Thanks @paulo!

PUT test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "edge_ngram": {
            "type": "text",
            "analyzer": "edge_ngram_analyzer"
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST test/_doc/1
{
  "title": "Test Name"
}

POST test/_doc/2
{
  "title": "TestName"
}

GET test/_search
{
  "query": {
    "match": {
      "title.edge_ngram": {
        "query": "Test Name",
        "fuzziness": "AUTO"
      }
    }
  }
}

Now it returns the expected output:

{
  ...
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 3.1782691,
    "hits": [
      {
        "_index": "test",
        "_id": "1",
        "_score": 3.1782691,
        "_source": {
          "title": "Test Name"
        }
      },
      {
        "_index": "test",
        "_id": "2",
        "_score": 0.68817455,
        "_source": {
          "title": "TestName"
        }
      }
    ]
  }
}
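
As a rough illustration of why both documents now match, the following Python sketch imitates the analyzer defined above: split on anything that is not a letter or digit, emit prefixes of 3 to 10 characters for each word, then lowercase. The function `edge_ngram_tokens` is a hypothetical helper, handles ASCII only, and ignores fuzziness and scoring. It also assumes the query text is analyzed with the same analyzer as the field, which is the default when no separate search analyzer is configured.

```python
import re


def edge_ngram_tokens(text: str, min_gram: int = 3, max_gram: int = 10) -> list:
    """Approximate the edge_ngram tokenizer plus lowercase filter from the mapping."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        # Words shorter than min_gram produce no tokens at all.
        for length in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:length].lower())
    return tokens


doc1_terms = edge_ngram_tokens("Test Name")
doc2_terms = edge_ngram_tokens("TestName")
query_terms = edge_ngram_tokens("Test Name")

print(doc1_terms)
print(doc2_terms)
print(sorted(set(query_terms) & set(doc2_terms)))
```

Under these assumptions, the query and document 2 share the prefixes `tes` and `test`, which is enough for a match query to return it, while document 1 shares all four of its terms with the query and therefore scores higher.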