This tutorial walks you through setting up the attachment type and using it in search, including highlighting.
Installation
First we need to install the attachments plugin. Navigate to ES_HOME
and execute the following from the command line:
bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.4.0
This will download the attachments plugin from the plugins repository and install it into the ES_HOME/plugins/mapper-attachments/ folder.
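To confirm that the install step did what is described above, a small check like the following can help. It is only a sketch: the function name plugin_installed is made up for this example, and it assumes the plugin landed in the plugins/mapper-attachments/ folder mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Elasticsearch): succeeds when the given
# plugin folder exists under <es_home>/plugins/ and contains at least one entry.
plugin_installed() {
  es_home="$1"
  plugin_name="$2"
  plugin_dir="${es_home}/plugins/${plugin_name}"
  [ -d "$plugin_dir" ] && [ -n "$(ls -A "$plugin_dir" 2>/dev/null)" ]
}

# Usage, assuming ES_HOME points at your installation:
# plugin_installed "$ES_HOME" mapper-attachments && echo "plugin found"
```

The check only looks at the file system; it does not tell you whether the running node has actually loaded the plugin.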
You can also install plugins manually by copying the relevant files directly into the ES_HOME/lib/ folder. For example, you can extract the mapper attachments zip file from ES_HOME/build/distributions/plugins/ into the ES_HOME/lib/ folder (this can be useful if you work with a SNAPSHOT version of ES).
Make sure you restart ElasticSearch, so the plugins are picked up.
Download some data
Let’s download a sample PDF document:
curl -C - -O http://www.intersil.com/data/fn/fn6742.pdf
The attachments plugin can parse and index documents in many formats. Check the Tika web page for the list of supported formats.
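Before indexing, it can be worth checking that the download really produced a PDF and not, say, an HTML error page. The following is a minimal sketch: is_pdf is a made-up helper name, and it relies only on the fact that PDF files start with the bytes "%PDF".

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the file starts with the PDF magic
# bytes "%PDF"; fails for other content or for a missing file.
is_pdf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "%PDF" ]
}

# Usage:
# is_pdf fn6742.pdf || echo "fn6742.pdf does not look like a PDF" >&2
```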
Setup mapping
Prepare a new index for the data.
curl -X DELETE "localhost:9200/test"
curl -X PUT "localhost:9200/test" -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    }
  }
}'
NOTICE: Before we can use the attachments plugin, we need to create the correct mapping for the attachment type.
curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "title" : { "store" : "yes" },
          "file" : { "term_vector" : "with_positions_offsets", "store" : "yes" }
        }
      }
    }
  }
}'
Indexing the Data
We are ready to index the data. We just need to encode the content of the file with Base64, for which we will use a simple script built around a Perl one-liner.
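#!/bin/sh
# Slurp the whole file (-0777) and encode it as one Base64 string without line breaks.
coded=`perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' fn6742.pdf`
# Wrap the encoded content in a JSON document and post it to the index.
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file
curl -X POST "localhost:9200/test/attachment/" -d @json.file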
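If Perl is not at hand, the same payload can be built with the base64 command-line tool found on most Linux and macOS systems. The sketch below is an alternative under that assumption, not the script used in this tutorial: make_payload is a made-up name, and tr strips the line breaks that base64 may add so the value stays on one line inside the JSON string.

```shell
#!/bin/sh
# Hypothetical helper: prints {"file":"<base64 of the file>"} on stdout.
make_payload() {
  # base64 may wrap its output at a fixed width; tr removes the newlines.
  coded=$(base64 < "$1" | tr -d '\n')
  printf '{"file":"%s"}\n' "$coded"
}

# Usage, mirroring the script above:
# make_payload fn6742.pdf > json.file
# curl -X POST "localhost:9200/test/attachment/" -d @json.file
```

The Base64 alphabet contains no characters that need escaping in a JSON string, which is why the encoded value can be dropped between the quotes as-is.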
Search for Highlighted results
We are done indexing the data and can now search it. To make the results more useful, we ask for the document title and some highlighted fragments back (note how we set up the mapping for the title and file content).
curl "localhost:9200/_search?pretty=true" -d '{
  "fields" : ["title"],
  "query" : {
    "query_string" : { "query" : "amplifier" }
  },
  "highlight" : {
    "fields" : { "file" : {} }
  }
}'
This will produce the following result:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.005872132,
    "hits" : [ {
      "_index" : "test",
      "_type" : "attachment",
      "_id" : "UUaHJ6CfTOC3T2I4Kj_pXg",
      "_score" : 0.005872132,
      "fields" : {
        "file.title" : "ISL99201"
      },
      "highlight" : {
        "file" : [
          "\nMono <em>Amplifier</em> • Filterless Class D with Efficiency > 86% at 400mW\nThe ISL99201 is a fully integrat",
          "\nmono <em>amplifier</em>. It is designed to maximize performance for \nmobile phone applications. The applicat"
        ]
      }
    } ]
  }
}
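The highlighted fragments come back with each match wrapped in <em> tags. When reading results in a terminal it can be handy to strip those tags. Here is a small sketch; the function name strip_em is made up for illustration, and it only handles the <em> tags seen in the response above.

```shell
#!/bin/sh
# Hypothetical helper: removes <em> and </em> from text read on stdin.
strip_em() {
  sed -e 's/<em>//g' -e 's/<\/em>//g'
}

# Usage:
# echo 'Mono <em>Amplifier</em>' | strip_em
```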
Want to try it yourself?
Grab this script and try it yourself.