
CouchDB Integration

By | 01 Aug 2010

This tutorial explains how to set up ElasticSearch to automatically index data
stored in CouchDB and make it searchable. ElasticSearch 0.11 introduced a feature called a river, which
allows it to connect to external systems and listen for document updates.
On receiving a notification, ElasticSearch indexes the data and makes it available for search.

Using this feature, it becomes easy to integrate various content stores with ElasticSearch.

ElasticSearch supports river plug-ins for several systems (CouchDB, RabbitMQ, etc.). The CouchDB river plug-in makes integration with CouchDB extremely simple: all it takes is a small piece of configuration.

Quick Steps

  • Create a CouchDB database
curl -XPUT 'http://localhost:5984/my_couch_db'
  • Enable the couchdb-river plugin in ElasticSearch
cd /path/to/elasticsearch/
bin/plugin -install elasticsearch/elasticsearch-river-couchdb/1.1.0
  • Configure ElasticSearch to start indexing
curl -XPUT 'http://elasticsearch-host:9200/_river/my_es_idx/_meta' -d '{
    "type" : "couchdb",
    "couchdb" : {
        "host" : "couchdb-host",
        "port" : 5984,
        "db" : "my_couch_db",
        "filter" : null
    }
}'

Note: The river created above is named my_es_idx, but it can be given any name.
The name is only used internally by ElasticSearch.
For more on river names, see https://github.com/goog/doc.elasticsearch.cn/blob/master/guide/reference/river/index.textile

That’s it. We are ready to go. At this point, what we have is:

  • An ElasticSearch configuration that indexes all data from the CouchDB database – my_couch_db
  • ElasticSearch makes use of dynamic mapping for the documents it receives from CouchDB
  • Any changes to CouchDB documents are automatically updated in ElasticSearch
  • The (continuous) indexing runs on a single node of the ElasticSearch cluster at any given time.
    If that node fails, another node takes over.
  • You can query ElasticSearch for the couchdb data at
    http://elasticsearch-host:9200/my_couch_db/my_couch_db.
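
To check that documents are arriving, you can run a search against that URL. The following is a minimal sketch, not part of the original setup: the helper names `search_url` and `search_couch_docs` and the `title` field in the usage line are made up for illustration, and the host is the placeholder used throughout this tutorial.

```shell
# Placeholder host used throughout this tutorial.
ES_HOST="elasticsearch-host:9200"

# Build the search URL for an index/type pair ($1 = index, $2 = type).
search_url() {
    printf 'http://%s/%s/%s/_search' "$ES_HOST" "$1" "$2"
}

# Run a query-string search against the default index and type created by
# the river (both named after the CouchDB database).
# -G sends the data as URL query parameters; --data-urlencode escapes them.
search_couch_docs() {
    curl -sS -G "$(search_url my_couch_db my_couch_db)" \
        --data-urlencode "q=$1" \
        --data-urlencode "pretty=true"
}

# Usage (needs a running cluster; "title" is a hypothetical field):
#   search_couch_docs 'title:elasticsearch'
```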

Detailed Setup

Having gone through the basic setup, let us look at how this works and how we can customize this further.

CouchDB

Change Notifications

CouchDB can notify interested external systems of every change made to a database.
Clients receive these notifications by opening an HTTP connection to `http://couchdb-host:5984/my_couch_db/_changes`.

Features supported by `_changes` are:

  • Obtain a list (JSON format) of all changes in the database since its creation
  • Sequence ids for changes. These allow a client to request only the changes after a particular sequence id
  • Continuous mode – a client can stay connected to the HTTP interface indefinitely, waiting for changes.
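
The three modes above can be tried by hand with curl. This is a sketch under assumptions: the helper names are invented for this example, 42 is an arbitrary sequence id, and the `since` and `feed=continuous` query parameters should be checked against the documentation of your CouchDB version.

```shell
# Placeholder CouchDB address used throughout this tutorial.
COUCH_URL="http://couchdb-host:5984"

# Build a _changes URL. $1 = database, $2 = optional query string.
changes_url() {
    if [ -n "${2:-}" ]; then
        printf '%s/%s/_changes?%s' "$COUCH_URL" "$1" "$2"
    else
        printf '%s/%s/_changes' "$COUCH_URL" "$1"
    fi
}

# Every change since the database was created.
all_changes() {
    curl -sS "$(changes_url "$1")"
}

# Only the changes after sequence id $2.
changes_since() {
    curl -sS "$(changes_url "$1" "since=$2")"
}

# Continuous mode: the connection stays open and one line is printed per
# change. -N turns off curl's output buffering so lines appear immediately.
follow_changes() {
    curl -sS -N "$(changes_url "$1" 'feed=continuous')"
}

# Usage (needs a running CouchDB):
#   changes_since my_couch_db 42
```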

Filters

By default, the `_changes` interface reports every change to the database. However, it is
possible to restrict the changes that are sent out to clients. For this, a `filter` has to be created in
CouchDB.
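
As an illustration, a filter is a JavaScript function stored in a design document that returns true for each document that should be reported. The sketch below assumes a hypothetical `type` field on the documents and reuses the `my_design/my_filter` and `param1` names from the river configuration later in this tutorial; the shell helper names are invented for this example.

```shell
# Placeholder CouchDB address used throughout this tutorial.
COUCH_URL="http://couchdb-host:5984"

# Design document with one filter. The function is called once per changed
# document; query parameters of the _changes request arrive in req.query.
# "type" is a hypothetical document field.
filter_design_doc() {
    cat <<'EOF'
{
    "_id" : "_design/my_design",
    "filters" : {
        "my_filter" : "function(doc, req) { return doc.type === req.query.param1; }"
    }
}
EOF
}

# Store the design document in the database (body is read from stdin).
install_filter() {
    filter_design_doc | curl -sS -XPUT "$COUCH_URL/my_couch_db/_design/my_design" \
        -H 'Content-Type: application/json' --data-binary @-
}

# Request only the changes accepted by the filter; $1 is passed as param1.
filtered_changes() {
    curl -sS -G "$COUCH_URL/my_couch_db/_changes" \
        --data-urlencode "filter=my_design/my_filter" \
        --data-urlencode "param1=$1"
}
```

Keep in mind that a deleted document normally reaches the filter as a stub without user fields, so a filter like this one would also hide deletions.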

More documentation on this is available at:

CouchDB River

The ElasticSearch CouchDB river plug-in uses the change notification interface of CouchDB
to keep itself synchronized with the CouchDB database.

Configuration

Detailed documentation is available at : http://www.elasticsearch.org/guide/reference/river/couchdb.html

In the quick how-to, we created a simple river. Let us see what those config parameters mean.

To set up a river, we need the following:

  • A CouchDB database – (eg: my_couch_db)
  • A CouchDB database filter (optional) – (eg: my_design/my_filter – where my_design is a design document)
  • HTTP auth parameters (optional) for accessing CouchDB (supported in ElasticSearch >= 0.12)
  • Filter parameters (optional) (supported in ElasticSearch >= 0.12)
  • An ElasticSearch instance
  • An index and a type for indexing the CouchDB documents (eg: my_es_idx and my_es_type)
    • This is what users will be querying against.
  • An internal ElasticSearch river name (eg: my_es_int_idx)
    • This is only used for internal ElasticSearch bookkeeping and is not to be used for querying or searching.

You can configure the CouchDB river as follows:

curl -XPUT 'elasticsearch-host:9200/_river/my_es_int_idx/_meta' -d '{
    "type" : "couchdb",

    "couchdb" : {
        "host" : "couchdb-host",
        "port" : 5984,
        "user" : "admin",
        "password" : "admin",
        "db" : "my_couch_db",
        "filter" : "my_design/my_filter",
        "filter_params" :  {
               "param1" : "value1",
               "param2" : "value2"
        }
    },
    "index" : {
        "index" : "my_es_idx",
        "type" : "my_es_type",
        "bulk_size" : "100",
        "bulk_timeout" : "10ms"
    }
}'

Notes:

  • The username, password feature is available only in ElasticSearch 0.12 and above
  • The filter_params feature is available only in ElasticSearch 0.12 and above
  • The data must be searched against `http://elasticsearch-host:9200/my_es_idx/my_es_type` and not against `http://elasticsearch-host:9200/_river/my_es_int_idx`
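
After creating the river, it can help to confirm that the configuration was stored and that documents are reaching the target index. The sketch below relies on the river configuration being an ordinary document in the `_river` index, as the PUT above suggests; the helper names are invented here, and the `_count` call is an assumption you should verify against your ElasticSearch version.

```shell
# Placeholder host used throughout this tutorial.
ES_HOST="elasticsearch-host:9200"

# URL of a document kept under the river's name in the _river index.
# $1 = river name, $2 = document id (for example _meta).
river_doc_url() {
    printf 'http://%s/_river/%s/%s' "$ES_HOST" "$1" "$2"
}

# Fetch the configuration that was PUT when the river was created.
show_river_meta() {
    curl -sS -XGET "$(river_doc_url "$1" _meta)?pretty=true"
}

# Count the documents indexed so far in the index users query against.
count_indexed_docs() {
    curl -sS -XGET "http://$ES_HOST/my_es_idx/my_es_type/_count?q=*:*"
}

# Usage (needs a running cluster):
#   show_river_meta my_es_int_idx
#   count_indexed_docs
```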

Data Mapping

By default, ElasticSearch uses dynamic mapping for the data that is indexed from CouchDB. However, it is possible
to define mappings on an index before any data from CouchDB is indexed:

  • First create the index `curl -XPUT http://elasticsearch-host:9200/my_es_idx`
  • Create the type and upload the mapping
curl -XPUT 'elasticsearch-host:9200/my_es_idx/my_es_type/_mapping' -d '{
    ...
}'
  • Configure the river plug-in
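
The first two steps can be scripted as shown below. This is only a sketch: the fields `title` and `created_at` are hypothetical stand-ins for whatever your CouchDB documents contain, and the helper names are invented for this example.

```shell
# Placeholder host used throughout this tutorial.
ES_HOST="elasticsearch-host:9200"

# Hypothetical mapping body: "title" and "created_at" are illustrative
# field names, not something the river requires.
mapping_body() {
    cat <<'EOF'
{
    "my_es_type" : {
        "properties" : {
            "title" : { "type" : "string" },
            "created_at" : { "type" : "date" }
        }
    }
}
EOF
}

# Create the index, then upload the mapping. -f makes curl return a
# non-zero status on an HTTP error, so the mapping is only uploaded if the
# index was created (this also stops if the index already exists).
prepare_index() {
    curl -sS -f -XPUT "http://$ES_HOST/my_es_idx" &&
    mapping_body | curl -sS -f -XPUT "http://$ES_HOST/my_es_idx/my_es_type/_mapping" \
        --data-binary @-
}

# Usage (needs a running cluster): run prepare_index, then create the river.
```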

Known Issues

Design documents created with couchapp

Indexing these documents may fail in ElasticSearch. This can be avoided with a suitable filter, since it is pointless to index design documents anyway. This will be fixed in ElasticSearch 0.12.
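
One possible filter for this purpose rejects every document whose id starts with `_design/`. This is a sketch, not the plug-in's own mechanism: the names `my_design` and `no_design_docs` and the helper function are made up, and the design document would be stored in CouchDB like any other.

```shell
# Design document with a filter that passes everything except design
# documents. "my_design" and "no_design_docs" are hypothetical names.
design_doc_filter() {
    cat <<'EOF'
{
    "_id" : "_design/my_design",
    "filters" : {
        "no_design_docs" : "function(doc, req) { return doc._id.indexOf('_design/') !== 0; }"
    }
}
EOF
}

# Print the document so it can be reviewed or piped to curl.
design_doc_filter
```

The river would then reference it with `"filter" : "my_design/no_design_docs"` in its `couchdb` section.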

Auth support for _changes

There seems to be an issue in CouchDB when accessing _changes with a username and password. This
is being investigated.

Frequently Asked Questions

  • What happens when CouchDB restarts?
    • The ElasticSearch river keeps retrying to connect to CouchDB (with an interval of 5 seconds).
  • What happens when the entire ElasticSearch cluster is restarted?
    • One of the ElasticSearch nodes will pick up the job of indexing the data.
  • What happens when the ElasticSearch node running the indexer crashes?
    • Some other ElasticSearch node in the cluster will pick up the job of indexing.
  • What happens when the update of a document fails?
    • As of now, nothing. There are plans for sending out notifications etc. This is being discussed on the ElasticSearch mailing list. Join in and give your feedback.
  • How can I provide auth parameters to CouchDB _changes?
    • This feature will be added in ElasticSearch 0.12. Alternatively, you can try the latest development version.
  • How can I provide additional arguments to the CouchDB filter?
    • This feature will be added in ElasticSearch 0.12. Alternatively, you can try the latest development version.