使用停用词 | Elasticsearch: 权威指南

使用停用词 | Elasticsearch: 权威指南 | Elastic

2025-11-19

请注意:
本书基于 Elasticsearch 2.x 版本，有些内容可能已经过时。

» » »

« 停用词的优缺点停用词与性能 »

使用停用词编辑

移除停用词的工作是由 stop 停用词过滤器完成的，可以通过创建自定义的分析器来使用它（参见使用停用词过滤器stop 停用词过滤器)。但是，也有一些自带的分析器预置使用停用词过滤器：

语言分析器: 每个语言分析器默认使用与该语言相适的停用词列表，例如：english 英语分析器使用 _english_ 停用词列表。
standard 标准分析器: 默认使用空的停用词列表：_none_ ，实际上是禁用了停用词。
pattern 模式分析器: 默认使用空的停用词列表：为 _none_ ，与 standard 分析器类似。

停用词和标准分析器（Stopwords and the Standard Analyzer）编辑

为了让标准分析器能与自定义停用词表连用，我们要做的只需创建一个分析器的配置好的版本，然后将停用词列表传入：

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { 
          "type": "standard", 
          "stopwords": [ "and", "the" ] 
        }
      }
    }
  }
}

	自定义的分析器名称为 `my_analyzer` 。
	这个分析器是一个标准 `standard` 分析器，进行了一些自定义配置。
	过滤掉的停用词包括 `and` 和 `the` 。

任何语言分析器都可以使用相同的方式配置自定义停用词。

保持位置（Maintaining Positions）编辑

analyzer API 的输出结果很有趣:

GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead

{
   "tokens": [
      {
         "token":        "quick",
         "start_offset": 4,
         "end_offset":   9,
         "type":         "<ALPHANUM>",
         "position":     1 
      },
      {
         "token":        "dead",
         "start_offset": 18,
         "end_offset":   22,
         "type":         "<ALPHANUM>",
         "position":     4
      }
   ]
}

position 标记每个词汇单元的位置。

停用词如我们期望被过滤掉了，但有趣的是两个词项的位置 position 没有变化：quick 是原句子的第二个词，dead 是第五个。这对短语查询十分重要，因为如果每个词项的位置被调整了，一个短语查询 quick dead 会与以上示例中的文档错误匹配。

指定停用词（Specifying Stopwords）编辑

停用词可以以内联的方式传入，就像我们在前面的例子中那样，通过指定数组:

"stopwords": [ "and", "the" ]

特定语言的默认停用词，可以通过使用 _lang_ 符号来指定:

"stopwords": "_english_"

TIP: Elasticsearch 中预定义的与语言相关的停用词列表可以在文档"languages"stop 停用词过滤器中找到。

停用词可以通过指定一个特殊列表 _none_ 来禁用。例如，使用 _english_ 分析器而不使用停用词，可以通过以下方式做到：

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":      "english", 
          "stopwords": "_none_" 
        }
      }
    }
  }
}

	`my_english` 分析器是基于 `english` 分析器。
	但禁用了停用词。

最后，停用词还可以使用一行一个单词的格式保存在文件中。此文件必须在集群的所有节点上，并且通过 stopwords_path 参数设置路径:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords_path": "stopwords/english.txt" 
        }
      }
    }
  }
}

停用词文件的路径，该路径相对于 Elasticsearch 的 config 目录。

使用停用词过滤器（Using the stop Token Filter）编辑

当你创建 custom 分析器时候，可以组合多个 stop 停用词过滤器分词器。例如：我们想要创建一个西班牙语的分析器:

自定义停用词列表
light_spanish 词干提取器
在 asciifolding 词汇单元过滤器中除去附加符号

我们可以通过以下设置完成:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type":        "stop",
          "stopwords": [ "si", "esta", "el", "la" ]  
        },
        "light_spanish": { 
          "type":     "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "spanish",
          "filter": [ 
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}

	停用词过滤器采用与 `standard` 分析器相同的参数 `stopwords` 和 `stopwords_path` 。
	参见算法提取器（Algorithmic Stemmers）。
	过滤器的顺序非常重要，下面会进行解释。

我们将 spanish_stop 过滤器放置在 asciifolding 过滤器之后.这意味着以下三个词组 esta 、ésta 、++está++ ，先通过 asciifolding 过滤器过滤掉特殊字符变成了 esta ，随后使用停用词过滤器会将 esta 去除。如果我们只想移除 esta 和 ésta ，但是 ++está++ 不想移除。必须将 spanish_stop 过滤器放置在 asciifolding 之前，并且需要在停用词中指定 esta 和 ésta 。

更新停用词（Updating Stopwords）编辑

想要更新分析器的停用词列表有多种方式，分析器在创建索引时，当集群节点重启时候，或者关闭的索引重新打开的时候。

如果你使用 stopwords 参数以内联方式指定停用词，那么你只能通过关闭索引，更新分析器的配置update index settings API，然后在重新打开索引才能更新停用词。

如果你使用 stopwords_path 参数指定停用词的文件路径，那么更新停用词就简单了。你只需更新文件(在每一个集群节点上)，然后通过两者之中的任何一个操作来强制重新创建分析器:

关闭和重新打开索引 (参考索引的开与关)，
一一重启集群下的每个节点。

当然，更新的停用词不会改变任何已经存在的索引。这些停用词的只适用于新的搜索或更新文档。如果要改变现有的文档，则需要重新索引数据。参加重新索引你的数据。

« 停用词的优缺点停用词与性能 »

官方地址：https://www.elastic.co/guide/cn/elasticsearch/guide/current/using-stopwords.html

有任何技术问题请点击这里网站运营推广招聘

IT PHP 编程语言开发编程 Linux 科技 Elasticsearch HTML/CSS/XML 面试数据库网络 JAVA NoSQL C/C++ Golang 操作系统 Git 算法正则表达式 Redis 互联网 MySql 软件运维 JavaScript 国际架构设计 Mac OS TCP/IP Excel Windows Oracle Socket VR Vim MongoDB 运营 Python MemCache 商业硬件电子娱乐设计摄影 nginx WordPress 游戏 HTTP 团建数码电器 Docker

携程Elasticsearch数据同步实践 Elasticsearch集群模式知多少 Elasticsearch是做什么的以及它的使用和基本原理 Elasticsearch简介与实战 elasticsearch动态映射如何配置使用Elasticsearch的动态映射 (dynamic mapping) elasticsearch配置 elasticsearch集群分布式特性 Elasticsearch集群高亮搜索 elasticsearch最新版安装两节点Elasticsearch集群 elasticsearch集群部署文档 ElasticSearch自带的分词类型安装elasticsearch的java环境确认 es 相关配置文件 ES查找空字符串【Elasticsearch集群】打分策略详解与explain手把手计算 Elasticsearch Mapping设置 [Elasticsearch集群分页]from-size VS scroll-scan Elasticsearch－基础介绍及索引原理分析

略微加速

Elasticsearch权威指南 - 互联网笔记

使用停用词编辑

停用词和标准分析器（Stopwords and the Standard Analyzer）编辑

保持位置（Maintaining Positions）编辑

指定停用词（Specifying Stopwords）编辑

使用停用词过滤器（Using the stop Token Filter）编辑

更新停用词（Updating Stopwords）编辑

略微加速

Elasticsearch权威指南 - 互联网笔记

使用停用词编辑

停用词和标准分析器（Stopwords and the Standard Analyzer）编辑

保持位置（Maintaining Positions）编辑

指定停用词（Specifying Stopwords）编辑

使用停用词过滤器（Using the stop Token Filter）编辑

更新停用词（Updating Stopwords）编辑

Getting Started Videos