Real-time indexing of events with ElasticSearch

A common need in many applications, including Enonic CMS, is the ability to store and retrieve events. What differentiates events from other types of content or data is the requirement that they be stored quickly and transparently. Often, hundreds or thousands of events are produced per second. They might not be the main purpose of the application, but it is still important to be able to access and analyze the information later on. Events can be useful for tracing and debugging; they can also be mined to discover interesting trends and patterns.

The purpose of this labs project is to be able to generate and store events in a way that is both efficient and scalable. A secondary objective is to analyze the information by retrieving and searching the data set. Storing new events should have minimal impact on performance, since handling events is usually not the main task of the application.

Events

For the purpose of this project we think of an event as an action of some kind, initiated at a specific instant by someone (a user or an agent). The idea is to define it as generically as possible while still covering the most common use cases. It should also be simple: we could make the fields completely dynamic and customizable, but that would add complexity to querying and analyzing the data.

Events are defined by the following properties: id, timestamp, duration, actor, title, source, description, content, type, category.

Note that most of the fields are optional; only a globally unique identifier and a timestamp are strictly necessary. Also note that the meaning of each field (e.g. what counts as a source) is up to each concrete application.
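As a minimal sketch, the event definition above could be modelled as a small Java class. The class name `Event`, the field types and the choice of milliseconds for the duration are assumptions made for illustration; they are not taken from the project's code.

```java
import java.time.Instant;
import java.util.Objects;
import java.util.UUID;

/** Hypothetical value class mirroring the event properties listed above. */
final class Event {
    // Required fields.
    final String id;
    final Instant timestamp;

    // Optional fields; null means "not set".
    Long durationMillis; // unit is an assumption
    String actor;
    String title;
    String source;
    String description;
    String content;
    String type;
    String category;

    Event(String id, Instant timestamp) {
        this.id = Objects.requireNonNull(id, "id is required");
        this.timestamp = Objects.requireNonNull(timestamp, "timestamp is required");
    }

    /** Convenience factory: random UUID as the unique id, current time as the timestamp. */
    static Event now() {
        return new Event(UUID.randomUUID().toString(), Instant.now());
    }
}
```

An application would typically call `Event.now()` and then set only the optional fields that make sense for it.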

Elastic Events

This labs project, code-named Elastic Events, is implemented in Java and consists of 3 main parts:

  • Core: The main processing unit. The core takes care of indexing and retrieving events. It uses an in-memory queue to decouple the delivery of events from storing and indexing. When an event is received, the system puts it into the queue and returns control to the caller immediately. A background thread periodically takes events from the queue and proceeds to store and index the data.
  • Datasource: A function library extension with methods for querying and retrieving stored events. It provides methods for searching by the different fields, sorting, paging and faceting.
  • Http Interceptor: An extension plugin that intercepts HTTP requests sent to Enonic CMS and generates events. This is a specific application of this project, in which we consider HTTP requests as events. The plan is to extract and analyze information from the traffic received by the CMS in a second phase.
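The queue-based decoupling described for the core can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual implementation: the class name `EventBuffer`, the `sink` callback (standing in for "store and index") and the flush interval parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: producers enqueue, a background thread drains in batches. */
final class EventBuffer<E> implements AutoCloseable {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Consumer<List<E>> sink; // stands in for "store and index"
    private final ScheduledExecutorService worker =
            Executors.newSingleThreadScheduledExecutor();

    EventBuffer(Consumer<List<E>> sink, long flushIntervalMillis) {
        this.sink = sink;
        // Note: an exception thrown by the sink cancels further scheduled runs;
        // a real implementation would catch and log it.
        worker.scheduleWithFixedDelay(this::flush, flushIntervalMillis,
                flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Called by the producer: enqueue the event and return immediately. */
    void publish(E event) {
        queue.add(event);
    }

    /** Moves everything currently queued to the sink as one batch. */
    synchronized void flush() {
        List<E> batch = new ArrayList<>();
        queue.drainTo(batch);
        if (!batch.isEmpty()) {
            sink.accept(batch);
        }
    }

    /** Stops the background thread, then flushes whatever is left. */
    @Override
    public void close() {
        worker.shutdown();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        flush();
    }
}
```

Handing the sink a whole batch rather than single events leaves room for bulk indexing, which is usually cheaper than one request per event.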

Actually, there is a 4th part, which is where the "magic" happens. To store and index events we use a NoSQL data store and search engine that offers real-time indexing and searching.

ElasticSearch


ElasticSearch (abbreviated ES) is an open source, distributed search engine based on Apache Lucene. It is implemented in Java and provides a RESTful API using a JSON-based DSL. There is also a Java API, which, perhaps surprisingly, is just a thin wrapper around the main JSON API. The fact that all the features are available through JSON is significant: it makes ES easy to learn, since hacking, testing and browsing the indexed data can be done with a web browser or with simple shell commands like wget or curl.

Although the interface to ElasticSearch is relatively simple, the functionality it offers is very powerful. Some interesting ES features are real-time indexing, full-text search, distributed storage and, last but not least, faceting.

To start using ElasticSearch, all you have to do is create an index and define a mapping. You can think of creating an index as the equivalent of creating a table in the relational database world. A mapping defines how the fields are mapped in the search engine. If an index is like a table, a mapping is like the definition of the table's columns.

As mentioned before, there are two ways to interact with ES: you can either send an HTTP request with a JSON body or use the Java API. There is a one-to-one mapping between the two. Creating the index and the mapping can be combined in a single operation. It can be done from the command line:

The command above creates an index named "labs" with a mapping "event".
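As a rough illustration, the same kind of request can also be built from Java. The sketch below is an assumption, not the project's code: the class name `CreateIndexRequest`, the two mapped fields and the base URL are hypothetical, and the body uses the typed-mapping layout (a mapping named "event" under "mappings"), which recent ElasticSearch versions no longer accept. It relies on the `java.net.http` client available from Java 11.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that builds an index-creation request for the "labs" index. */
final class CreateIndexRequest {
    // Assumed body: index-level "mappings" containing the "event" mapping.
    // Only two of the event fields are mapped here to keep the sketch short.
    static final String BODY =
          "{\"mappings\":{\"event\":{\"properties\":{"
        + "\"timestamp\":{\"type\":\"date\"},"
        + "\"duration\":{\"type\":\"long\"}"
        + "}}}}";

    /** Builds (but does not send) a PUT request to baseUrl + "/labs". */
    static HttpRequest build(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/labs"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(BODY))
                .build();
    }
}
```

Against a local node, the request could be sent with `HttpClient.newHttpClient().send(CreateIndexRequest.build("http://localhost:9200"), HttpResponse.BodyHandlers.ofString())`, assuming ES listens on its default HTTP port 9200.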

Unlike SQL databases, ES stores documents. A document in ElasticSearch is a JSON object with some or all of the fields defined in the mapping. This is an example of an event document indexed in ElasticSearch:
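For illustration only, the sketch below assembles a hypothetical event document in Java. All field values are made up, and the small hand-written serializer (`EventDocument.toJson`) is an assumption standing in for whatever JSON library an application would actually use; it only handles flat objects with string values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that renders a flat event as a JSON document. */
final class EventDocument {
    /** Renders a flat map of string fields as a JSON object (no nested values). */
    static String toJson(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> quote(e.getKey()) + ":" + quote(e.getValue()))
                .collect(Collectors.joining(",", "{", "}"));
    }

    /** Minimal JSON string escaping: backslash, double quote and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Made-up event for an intercepted HTTP request; every value is hypothetical. */
    static Map<String, String> sample() {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "6f1c2a9e-0000-4000-8000-000000000001");
        doc.put("timestamp", "2012-01-01T12:00:00Z");
        doc.put("actor", "anonymous");
        doc.put("type", "http-request");
        doc.put("source", "/en/news");
        return doc;
    }
}
```

Calling `EventDocument.toJson(EventDocument.sample())` yields a single JSON object containing only the fields that were set, which matches the idea that a document may carry some or all of the mapped fields.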



This article is not intended as an introduction to ElasticSearch, but if you are interested in learning more about it, the best place to start is the guide on their website. The code for this project will be made available on GitHub, hopefully very soon.

So, how fast is it? To get a first impression, I wrote a simple test that reads an Apache log file with 100,000 lines, parses it and generates one event per line. It may not be the most scientifically rigorous test, but it should give an idea of how the system performs. Running the test on my laptop gives a throughput of more than 1500 events indexed per second.
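The parsing step of such a test might look like the sketch below. It assumes the log uses the Apache Common Log Format, and the way log fields are mapped onto event fields (remote user as actor, client host as source, request line as title) is a hypothetical choice, not the one used in the actual test. The class name `AccessLogParser` is likewise invented for illustration.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser turning one Common Log Format line into event fields. */
final class AccessLogParser {
    // Layout: host ident user [time] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    // Example time value: 10/Oct/2000:13:55:36 -0700
    private static final DateTimeFormatter TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    /**
     * Returns the event fields for a line, or empty if the line does not match.
     * A line that matches but carries a malformed date throws DateTimeParseException.
     */
    static Optional<Map<String, String>> parse(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        Map<String, String> event = new LinkedHashMap<>();
        // Normalize the local timestamp to a UTC instant in ISO-8601 form.
        event.put("timestamp", OffsetDateTime.parse(m.group(4), TIME).toInstant().toString());
        event.put("actor", m.group(3));   // remote user, "-" if unknown
        event.put("source", m.group(1));  // client host
        event.put("title", m.group(5));   // request line
        event.put("type", "http-request");
        return Optional.of(event);
    }
}
```

Lines that do not match are skipped rather than treated as errors, so a single odd entry does not abort a run over a large file.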

To sum up, in this post I have introduced my labs project, Elastic Events, and given a high-level overview of the ElasticSearch engine. In a following article I will present some of the search possibilities offered by the datasource library in Enonic CMS, including the power of faceting.
