忽略 TF/IDF | Elasticsearch: 权威指南

忽略 TF/IDF | Elasticsearch: 权威指南 | Elastic

2025-11-18

请注意:
本书基于 Elasticsearch 2.x 版本，有些内容可能已经过时。

» » »

« Not Quite Not function_score 查询 »

忽略 TF/IDF编辑

有时候我们根本不关心 TF/IDF ，只想知道一个词是否在某个字段中出现过。可能搜索一个度假屋并希望它能尽可能有以下设施：

WiFi
Garden（花园）
Pool（游泳池）

这个度假屋的文档如下：

{ "description": "A delightful four-bedroomed house with ... " }

可以用简单的 match 查询进行匹配：

GET /_search
{
  "query": {
    "match": {
      "description": "wifi garden pool"
    }
  }
}

但这并不是真正的 全文搜索 ，此种情况下，TF/IDF 并无用处。我们既不关心 wifi 是否为一个普通词，也不关心它在文档中出现是否频繁，关心的只是它是否曾出现过。实际上，我们希望根据房屋不同设施的数量对其排名——设施越多越好。如果设施出现，则记 1 分，不出现记 0 分。

constant_score 查询编辑

在 constant_score 查询中，它可以包含查询或过滤，为任意一个匹配的文档指定评分 1 ，忽略 TF/IDF 信息：

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "description": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "garden" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "pool" }}
        }}
      ]
    }
  }
}

或许不是所有的设施都同等重要——对某些用户来说有些设施更有价值。如果最重要的设施是游泳池，那我们可以为更重要的设施增加权重：

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "description": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "garden" }}
        }},
        { "constant_score": {
          "boost":   2 
          "query": { "match": { "description": "pool" }}
        }}
      ]
    }
  }
}

pool 语句的权重提升值为 2 ，而其他的语句为 1 。

最终的评分并不是所有匹配语句的简单求和，协调因子（coordination factor）和查询归一化因子（query normalization factor）仍然会被考虑在内。

我们可以给 features 字段加上 not_analyzed 类型来提升度假屋文档的匹配能力：

{ "features": [ "wifi", "pool", "garden" ] }

默认情况下，一个 not_analyzed 字段会禁用字段长度归一值（field-length norms）的功能，并将 index_options 设为 docs 选项，禁用词频，但还是存在问题：每个词的倒排文档频率仍然会被考虑。

可以采用与之前相同的方法 constant_score 查询来解决这个问题：

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "features": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "features": "garden" }}
        }},
        { "constant_score": {
          "boost":   2
          "query": { "match": { "features": "pool" }}
        }}
      ]
    }
  }
}

实际上，每个设施都应该看成一个过滤器，对于度假屋来说要么具有某个设施要么没有——过滤器因为其性质天然合适。而且，如果使用过滤器，我们还可以利用缓存。

这里的问题是：过滤器无法计算评分。这样就需要寻求一种方式将过滤器和查询间的差异抹平。 function_score 查询不仅正好可以扮演这个角色，而且有更强大的功能。

« Not Quite Not function_score 查询 »

官方地址：https://www.elastic.co/guide/cn/elasticsearch/guide/current/ignoring-tfidf.html

有任何技术问题请点击这里网站运营推广招聘

IT PHP 编程语言开发编程 Linux 科技 Elasticsearch HTML/CSS/XML 面试数据库网络 JAVA NoSQL C/C++ Golang 操作系统 Git 算法正则表达式 Redis 互联网 MySql 软件运维 JavaScript 国际架构设计 Mac OS TCP/IP Excel Windows Oracle Socket VR Vim MongoDB 运营 Python MemCache 商业硬件电子娱乐设计摄影 nginx WordPress 游戏 HTTP 团建数码电器 Docker

携程Elasticsearch数据同步实践 Elasticsearch集群模式知多少 Elasticsearch是做什么的以及它的使用和基本原理 Elasticsearch简介与实战 elasticsearch动态映射如何配置使用Elasticsearch的动态映射 (dynamic mapping) elasticsearch配置 elasticsearch集群分布式特性 Elasticsearch集群高亮搜索 elasticsearch最新版安装两节点Elasticsearch集群 elasticsearch集群部署文档 ElasticSearch自带的分词类型安装elasticsearch的java环境确认【Elasticsearch集群】打分策略详解与explain手把手计算 es 相关配置文件 ES查找空字符串 Elasticsearch Mapping设置 [Elasticsearch集群分页]from-size VS scroll-scan elasticsearch集群新机搭建

略微加速

Elasticsearch权威指南 - 互联网笔记

忽略 TF/IDF编辑

constant_score 查询编辑

略微加速

Elasticsearch权威指南 - 互联网笔记

忽略 TF/IDF编辑

constant_score 查询编辑

Getting Started Videos