With all of the aggregation examples given so far, you may have noticed that we
omitted a query
from the search request. The entire request was
simply an aggregation.
Aggregations can be run at the same time as search requests, but you need to understand a new concept: scope. By default, aggregations operate in the same scope as the query. Put another way, aggregations are calculated on the set of documents that match your query.
If we look at one of our first aggregation examples:
GET /cars/transactions/_search?search_type=count
{
"aggs" : {
"colors" : {
"terms" : {
"field" : "color"
}
}
}
}
…you can see that that the aggregation is in isolation. In reality, Elasticsearch assumes "no query specified" is equivalent to "query all documents". The above query is internally translated into:
GET /cars/transactions/_search?search_type=count
{
"query" : {
"match_all" : {}
}
"aggs" : {
"colors" : {
"terms" : {
"field" : "color"
}
}
}
}
The aggregation always operates in the scope of the query, so an isolated
aggregation really operates in the scope of a match_all
query…that is to say,
all documents.
Once armed with the knowledge of scoping, we can start to customize aggregations even further. All of our previous examples calculated statistics about all of the data: top selling cars, average price of all cars, most sales per month, etc.
With scope, we can ask questions like "How many colors are Ford cars are
available in?". We do this by simply adding a query to the request (in this case
a match
query):
GET /cars/transactions/_search (1)
{
"query" : {
"match" : {
"make" : "ford"
}
},
"aggs" : {
"colors" : {
"terms" : {
"field" : "color"
}
}
}
}
-
We are omitting
search_type=count
so that search hits are returned too
By omitting the search_type=count
this time, we can see both the search
results and the aggregation results:
{
...
"hits": {
"total": 2,
"max_score": 1.6931472,
"hits": [
{
"_source": {
"price": 25000,
"color": "blue",
"make": "ford",
"sold": "2014-02-12"
}
},
{
"_source": {
"price": 30000,
"color": "green",
"make": "ford",
"sold": "2014-05-18"
}
}
]
},
"aggregations": {
"colors": {
"buckets": [
{
"key": "blue",
"doc_count": 1
},
{
"key": "green",
"doc_count": 1
}
]
}
}
}
This may seem trivial, but it is the key to advanced and powerful dashboards. You can transform any static dashboard into a real-time data exploration device by adding a search bar. This allows the user to search for terms and see all of the graphs (which are powered by aggregations, and thus scoped to the query) update in real-time. Try that with Hadoop!
when the search changes?
You’ll often want your aggregation to be scoped to your query. But sometimes you’ll want to search for some subset of data, but aggregate across all of your data.
For example, say you want to know the average price for Ford cars compared to the
average price of all cars. We can use a regular aggregation (scoped to the query)
to get the first piece of information. The second piece of information can be
obtained by using a global
bucket.
The global bucket will contain all of your documents, regardless of the query scope; it bypasses the scope completely. Because it is a bucket, you can nest aggregations inside of it like normal:
GET /cars/transactions/_search?search_type=count
{
"query" : {
"match" : {
"make" : "ford"
}
},
"aggs" : {
"single_avg_price": {
"avg" : { "field" : "price" } (1)
},
"all": {
"global" : {}, (2)
"aggs" : {
"avg_price": {
"avg" : { "field" : "price" } (3)
}
}
}
}
}
-
This aggregation operates in the query scope (e.g. all docs matching "ford")
-
The
global
bucket has no parameters -
This aggregation operates on the all documents, regardless of the make
The first avg
metric calculates is based on all documents that fall under the
query scope — all "ford" cars. The second avg
metric is nested under a
global
bucket, which means it ignores scoping entirely and calculates on
all the documents. The average returned for that aggregation represents
the average price of all cars.