elasticsearch. blog

New Search Types

By Shay Banon | 24 Mar 2011

elasticsearch has had the ability to control the search type since its inception. It includes the ability to control how to execute the “query” and “fetch” phases of a distributed search execution, as well as the ability to compute distributed frequencies across shards (the “dfs” phase).

Two things that many users have been asking for is the ability to easily iterate over a very large result set as fast as possible, and the ability to just count results (possibly with facets) without actually fetching any hits back. Both features are now supported as new search types.

count

The first is the count search type. It allows to get back the total number of hits matching a query, with the ability to have facets configured in an optimized (implementation wise and performance wise) manner. For example:

curl -XGET 'http://localhost:9200/twitter/tweet/_search?search_type=count' -d '{
    "query": {
        "filtered" : {
            "query" : {
                "query_string" : {
                    "query" : "some query string here"
                }
            },
            "filter" : {
                "term" : { "user" : "kimchy" }
            }
        }
    }
}
'

The result will not include any hits, just the total_hits and optional facets results.

scan

The new scan type allows to scroll a very large result set in an optimized manner. When scrolling large result set using one of the other search types, there is an overhead (that increases the more we scroll “into” the result set) that comes from the fact that sorting needs to be computed (either by score, or custom). The scan type does no sorting, allowing it to optimize the scrolling process. It uses the same scroll mechanism when using other search types.

The initial search execution bootstraps the scanning process:

curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=50' -d '
{
    "query" : {
        "match_all" : {}
    }
}
'

The scroll parameter indicates two things, the fact that we wish to scroll further into the result set, and the timeout value for maintaing the scroll “open” (the timeout value applies per request, not globally across the scroll process).

The size parameter controls how many hits we want to get back. Note, the actual number of hits will be the the size provided times the number of shards, which is done in order to further optimize the scrolling process.

The result of the search request is a scroll_id. The scroll_id should then be used as a parameter to the next search request:

curl -XGET 'localhost:9200/_search/scroll?scroll=10m' -d 'c2NhbjsxOjBLMzdpWEtqU2IyZHlmVURPeFJOZnc7MzowSzM3aVhLalNiMmR5ZlVET3hSTmZ3OzU6MEszN2lYS2pTYjJkeWZVRE94Uk5mdzsyOjBLMzdpWEtqU2IyZHlmVURPeFJOZnc7NDowSzM3aVhLalNiMmR5ZlVET3hSTmZ3Ow=='

The results now will include hits, and another scroll_id, which should then be used as the parameter to the next scroll request. Also, the scroll parameter needs to be provided again in order to indicate we wish to continue the scrolling process.

The “exit” point from the scrolling process is when no hits are returned back.

-shay.banon

blog comments powered by Disqus
 
Fork me on GitHub