Configuring Language Analyzers | Elasticsearch: The Definitive Guide | Elastic
2024-11-13
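The language analyzers work out of the box with no configuration, but most of them let you control aspects of their behavior, specifically which words are excluded from stemming and which words are treated as stopwords.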
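Suppose, for instance, that users searching for World Health Organization instead get results for organ health. The confusion arises because organ and organization are both stemmed to the same root word: organ. This is often not a problem, but in this particular collection of documents it leads to confusing results, so we want to prevent the words organization and organizations from being stemmed.

The default English stopword list is as follows:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with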
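The words no and not are unusual in that they invert the meaning of the words that follow them. We may decide that these two words are important and should not be treated as stopwords.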
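The effect of keeping no and not can be sketched with a few lines of Python. This is only a toy illustration, not how Elasticsearch implements its stopword filter: the helper `remove_stopwords` is a hypothetical name, it merely lowercases and splits on whitespace, and it does no stemming at all.

```python
# Default English stopwords, copied from the list in the text.
DEFAULT_STOPWORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
]

# Custom list: the default list with the two negations taken out.
CUSTOM_STOPWORDS = [w for w in DEFAULT_STOPWORDS if w not in {"no", "not"}]


def remove_stopwords(text, stopwords):
    """Toy stand-in for lowercasing plus stopword removal (hypothetical helper)."""
    stop = set(stopwords)
    tokens = text.lower().replace(".", " ").split()
    return [t for t in tokens if t not in stop]


sentence = "The World Health Organization does not sell organs."
with_default = remove_stopwords(sentence, DEFAULT_STOPWORDS)
with_custom = remove_stopwords(sentence, CUSTOM_STOPWORDS)

print(with_default)  # "not" is dropped, so the negation is lost
print(with_custom)   # "not" survives
```

With the default list the sentence loses its negation; with the custom list, not is kept as a token.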
To customize the behavior of the english analyzer, we create a custom analyzer that uses the english analyzer as its base and adds some configuration:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ],
          "stopwords": [
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
The World Health Organization does not sell organs.
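The stem_exclusion setting prevents organization and organizations from being stemmed.
The stopwords setting specifies a custom list of stopwords.
The _analyze request shows the tokens that my_english emits for the sample sentence.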
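If the index is created from a script rather than typed by hand, the same request body can be assembled programmatically. The following is a minimal sketch that only builds and prints JSON; it does not talk to a cluster. The function name `english_analyzer_settings` is hypothetical, and the JSON form of the `_analyze` body (with `analyzer` and `text` fields) is an assumption about newer Elasticsearch releases that should be checked against the version in use.

```python
import json

# Words that must not be stemmed, as in the example above.
STEM_EXCLUSION = ["organization", "organizations"]

# Custom stopword list from the example above ("no" and "not" are absent).
STOPWORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will",
    "with",
]


def english_analyzer_settings(name, stem_exclusion, stopwords):
    """Build an index-creation body for an analyzer based on `english`.

    Hypothetical helper: it only mirrors the JSON structure shown above.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    name: {
                        "type": "english",
                        "stem_exclusion": list(stem_exclusion),
                        "stopwords": list(stopwords),
                    }
                }
            }
        }
    }


index_body = english_analyzer_settings("my_english", STEM_EXCLUSION, STOPWORDS)

# Assumption: newer releases take the analyzer name and text in a JSON body
# instead of the query-string form used in the request above.
analyze_body = {
    "analyzer": "my_english",
    "text": "The World Health Organization does not sell organs.",
}

print(json.dumps(index_body, indent=2))
print(json.dumps(analyze_body, indent=2))
```

The first printed document is the body for PUT /my_index; the second would be sent to /my_index/_analyze.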
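We discuss stemming and stopwords in much more detail in Reducing Words to Their Root Form and Stopwords: Performance Versus Precision, respectively.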
Official page: https://www.elastic.co/guide/cn/elasticsearch/guide/current/configuring-language-analyzers.html