Attachment Type in Action

By Lukas Vlcek | 18 Jul 2011

This tutorial walks you through basic attachment type setup and its use in search, including highlighting.

Installation

First we need to install the attachments plugin. Navigate to ES_HOME and execute the following from the command line:

bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.4.0

This will download the attachments plugin from the plugins repository and install it into the ES_HOME/plugins/mapper-attachments/ folder.
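
To confirm that the plugin landed where expected, a quick directory check can help. The following is only a minimal sketch: plugin_installed is a hypothetical helper name, and the script assumes that the ES_HOME environment variable points at your installation (it falls back to the current directory). It checks that the folder exists, not that the plugin loaded correctly.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named plugin has a directory
# under <es_home>/plugins/. It only checks for the directory.
plugin_installed() {
  es_home="$1"
  plugin="$2"
  [ -d "$es_home/plugins/$plugin" ]
}

# Assumption: ES_HOME points at the installation; default to ".".
if plugin_installed "${ES_HOME:-.}" "mapper-attachments"; then
  echo "mapper-attachments: found"
else
  echo "mapper-attachments: not found"
fi
```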

You can also install plugins manually by copying the relevant files directly into the ES_HOME/lib/ folder. For example, you can extract the mapper attachments zip file from ES_HOME/build/distributions/plugins/ into the ES_HOME/lib/ folder (this can be useful if you work with a SNAPSHOT version of ES).

Make sure you restart ElasticSearch so that the plugin is picked up.

Download some data

Let’s download a PDF document:

curl -C - -O http://www.intersil.com/data/fn/fn6742.pdf

The attachments plugin can parse and index documents in many formats. Check the Tika web page for a list of supported formats.
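
If the download URL has moved, curl may save an HTML error page under the same file name. As a rough sanity check (a sketch, not a real validator), you can test whether the file starts with the PDF signature %PDF. The helper name looks_like_pdf is hypothetical, and head -c is assumed to be available, as it is on common Linux and BSD systems.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file starts with the PDF
# signature "%PDF". A missing or unreadable file counts as a failure.
looks_like_pdf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "%PDF" ]
}

if looks_like_pdf fn6742.pdf; then
  echo "fn6742.pdf looks like a PDF"
else
  echo "fn6742.pdf is missing or is not a PDF"
fi
```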

Set up the mapping

Prepare a new index for the data. The first command removes any existing index named test, so that we start from a clean state.

curl -X DELETE "localhost:9200/test"

curl -X PUT "localhost:9200/test" -d '{
  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}
}'

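NOTICE: Before we can use the attachments plugin, we need to create the correct mapping for the attachment type. The mapping stores the title and the file content (the latter with term vectors including positions and offsets), so that the title can be returned and the content highlighted at search time.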

curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "title" : { "store" : "yes" },
          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }
        }
      }
    }
  }
}'

Indexing the Data

We are ready to index the data. We just need to encode the content of the file with Base64, for which we will use a short shell script that calls a Perl one-liner.

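#!/bin/sh

# Base64-encode the whole PDF in one piece (-0777), without line breaks
coded=$(perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' fn6742.pdf)
# Wrap the encoded content in a JSON document under the "file" field
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file
# Index the document into the test index with type attachment
curl -X POST "localhost:9200/test/attachment/" -d @json.file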

Search for Highlighted results

We are done indexing the data and can now search it. To make it more interesting, we ask for the document title and some highlighted results back (note how we set up the mapping for the title and file content).

curl "localhost:9200/_search?pretty=true" -d '{
  "fields" : ["title"],
  "query" : {
    "query_string" : {
      "query" : "amplifier"
    }
  },
  "highlight" : {
    "fields" : {
      "file" : {}
    }
  }
}'

This will produce a result similar to the following (values such as took and _id will differ):

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.005872132,
    "hits" : [ {
      "_index" : "test",
      "_type" : "attachment",
      "_id" : "UUaHJ6CfTOC3T2I4Kj_pXg",
      "_score" : 0.005872132,
      "fields" : {
        "file.title" : "ISL99201"
      },
      "highlight" : {
        "file" : [ "\nMono <em>Amplifier</em> • Filterless Class D with Efficiency > 86% at 400mW\nThe ISL99201 is a fully integrat", "\nmono <em>amplifier</em>. It is designed to maximize performance for \nmobile phone applications. The applicat" ]
      }
    } ]
  }
}
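
To repeat the search with other terms, the request body used earlier can be generated by a small helper. This is a sketch under stated assumptions: build_search_body is a hypothetical name, and the search term is not JSON-escaped, so it must not contain double quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: print the search request body for a given term.
# Assumption: the term contains no double quotes or backslashes,
# because it is inserted into the JSON without escaping.
build_search_body() {
  printf '{"fields":["title"],"query":{"query_string":{"query":"%s"}},"highlight":{"fields":{"file":{}}}}\n' "$1"
}

# Print the body for inspection.
build_search_body "amplifier"

# To run the search, pass the body to curl, for example:
#   curl "localhost:9200/_search?pretty=true" -d "$(build_search_body "amplifier")"
```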

Want to try it yourself?

Grab this script and try it yourself.
