elasticsearch. 向导

Search API - Highlighting

实现对搜索结果的一个或者多个字段进行高亮. 使用Lucene的 fast-vector-highlighter 或者 highlighter 来实现. 搜索请求的Body如下:

{
    "query" : {...},
    "highlight" : {
        "fields" : {
            "content" : {}
        }
    }
}

如上所示, 每个搜索结果里面的 content 字段都将会被高亮处理(同时,在每个搜索结果里面,会多出来额外的一个字段 highlight, 里面包含了高亮的字段和高亮之后的文本片段) 如下:

{
    "_index": "index",
    "_type": "fulltext",
    "_id": "3",
    "_score": 0.61370564,
    "_source": {
        "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
    },
    "highlight": {
        "content": [
            "均每天扣1艘<tag1>中国</tag1>渔船 "
        ]
    }
}

为了执行高亮操作,字段是实际内容是必须要有的. 如果字段设置为存储 (mapping里面设置 storeyes ), 高亮的时候就会优先使用这个, 否则, 要加载 _source 里面的数据,并抽取相关的字段内容.

如果没有提供 term_vector 信息(mapping字段定义,term_vector设置为 with_positions_offsets ), 则高亮的时候使用的是最朴素的高亮器. 这意味着,如果你保存了单词向量信息,高亮操作的时候速度会更快. 因为你存储了单词的位置和偏移信息,这些在高亮的时候可以直接使用,同时需要注意的是这些额外的信息也会增加索引的体积.

下面是一个简单的例子,设置 content 字段用来做高亮 (索引会增大):

{
    "type_name" : {
        "content" : {"store" : "yes", "term_vector" : "with_positions_offsets"}
    }
}

高亮标记

默认情况下, 高亮器会在高亮的文本两边包裹 <em></em> 标签. 你可以通过设置 pre_tagspost_tags 来进行自定义, 例如:

{
    "query" : {...},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "_all" : {}
        }
    }
}

你可以设置一个或者多个标签,并且按优先级排序. There are also built in “tag” schemas, with currently a single schema called styled with pre_tags of:

<em class="hlt1">, <em class="hlt2">, <em class="hlt3">,
<em class="hlt4">, <em class="hlt5">, <em class="hlt6">,
<em class="hlt7">, <em class="hlt8">, <em class="hlt9">,
<em class="hlt10">

And post tag of </em>. If you think of more nice to have built in tag schemas, just send an email to the mailing list or open an issue. Here is an example of switching tag schemas:

{
    "query" : {...},
    "highlight" : {
        "tags_schema" : "styled",
        "fields" : {
            "content" : {}
        }
    }
}

Highlighted Fragments

Each field highlighted can control the size of the highlighted fragment in characters (defaults to 100), and the maximum number of fragments to return (defaults to 5). For example:

{
    "query" : {...},
    "highlight" : {
        "fields" : {
            "content" : {"fragment_size" : 150, "number_of_fragments" : 3}
        }
    }
}

On top of this it is possible to specify that highlighted fragments are order by score:

{
    "query" : {...},
    "highlight" : {
        "order" : "score",
        "fields" : {
            "content" : {"fragment_size" : 150, "number_of_fragments" : 3}
        }
    }
}

Note the score of text fragment in this case is calculated by Lucene highlighting framework. For implementation details you can check ScoreOrderFragmentsBuilder.java class.

If the number_of_fragments value is set to 0 then no fragments are produced, instead the whole content of the field is returned, and of course it is highlighted. This can be very handy if short texts (like document title or address) need to be highlighted but no fragmentation is required. Note that fragment_size is ignored in this case.

{
    "query" : {...},
    "highlight" : {
        "fields" : {
            "_all" : {},
            "bio.title" : {"number_of_fragments" : 0}
        }
    }
}

When using fast-vector-highlighter one can use fragment_offset parameter to conrol the margin to start highlighting from.

Global Settings

Highlighting settings can be set on a global level and then overridden at the field level.

{
    "query" : {...},
    "highlight" : {
        "number_of_fragments" : 3,
        "fragment_size" : 150,
        "tag_schema" : "styled",
        "fields" : {
            "_all" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] },
            "bio.title" : { "number_of_fragments" : 0 },
            "bio.author" : { "number_of_fragments" : 0 },
            "bio.content" : { "number_of_fragments" : 5, "order" : "score" }
        }
    }
}
 
Fork me on GitHub