Playing around with the Ehcache Search API
September 19, 2014
Tags: #ehcache #search #java #cache
In the last few days, I had to deal with a requirement from one of my customers:
- Let's say there is an application that produces about 200 events per second - that's not that much.
- These events are being pumped into my middleware (actually it's not "mine", but the one I'm responsible for).
- An event can have several attributes, but the most important two are the timestamp and the category of an event.
- Third-party applications request events from the middleware by a cutoff timestamp and one or more categories.
- These requests may be sent in short intervals of 1 to 5 minutes, so the gap between the requested cutoff timestamp and the current time may be small and only a few events will be affected.
- However, there may also be a bigger gap since the last request of a third-party application, so the requested cutoff timestamp can be up to two hours in the past.
- A request should be completed in about 5 seconds.
- Events older than 2 hours are not relevant.
- ~200 events per second over two hours are about 1,500,000 events in total I have to deal with.
- The events are not really pumped into the middleware; actually, it's a polling mechanism because of limitations of the source systems.
- The requests of the 3rd party applications are REST-like requests (it's not a real REST approach).
- A Java EE application server, running Java EE 6 with JDK 7.
- So far, no persistence layer is used, but Ehcache is used for caching purposes.
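To make things concrete, here is a minimal sketch of how such an event could look as a Java object. The class and field names are my own assumption, not the customer's actual model - the only given facts are the two searchable attributes, timestamp and category:

```java
// A minimal event carrier: timestamp and category are the two
// attributes the third-party applications will search by.
public class Event {

    private final long timestamp;   // epoch millis of the event
    private final String category;  // e.g. "ALARM", "STATUS", ...
    private final String payload;   // remaining event data

    public Event(long timestamp, String category, String payload) {
        this.timestamp = timestamp;
        this.category = category;
        this.payload = payload;
    }

    public long getTimestamp() { return timestamp; }
    public String getCategory() { return category; }
    public String getPayload() { return payload; }
}
```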
Storing all the events in a (relational) database didn't seem reasonable to me. Too many inserts, too many selects, manual housekeeping for deleting old events, permanent indexing... not a good way. Discarded!
Perhaps a NoSQL store? OK, maybe... But which one? Redis might be a good choice, but hey, I have to search through all these events, looking for elements greater than a cutoff timestamp and equal to a category.
So what about using Elasticsearch? Hmm, not bad, but perhaps overkill? I only have to cope with a small number of objects, a small time interval, and few attributes to search for. Perhaps even plain Lucene is enough. Sure, it would be enough. But what about Ehcache, which is already available in the middleware?
So far, I had only noticed Ehcache as a caching provider when using an ORM like Hibernate. Nothing more. "Cache" in my mind was related to Infinispan or Hazelcast...
I played around with it, putting objects into the cache and getting them back - everything is very fast, because I used the on-heap cache only. Disk usage is also fast, but of course slower than memory only (due to serialization/deserialization of the elements). There's an off-heap option called "BigMemory", available for single or distributed caches - for when caches grow bigger than your GC can handle ;-) But this BigMemory option is not free of charge, it's commercial. So no option for my customer.
I started calculating... An average event has a size of 600 bytes (the Java object stored in the on-heap cache). In total we get a cache size of about 850MB - that's not that much. Java on-heap caches are fine up to about 4GB of memory; that can still be handled by the GC. Anything bigger may slow down the system. (Hint: you can save a lot of memory if you don't use objects like java.util.Date in your objects, but convert such attributes to, e.g., integer values; this is much more memory-efficient and can also be handled while searching!)
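As a sketch of that hint: instead of storing a java.util.Date (an extra object per event), the timestamp can be kept as a primitive int holding seconds since the epoch, which still compares correctly in range searches. This is my own illustration, not code from the project:

```java
import java.util.Date;

public class Timestamps {

    // Collapse a Date into an int holding seconds since the epoch.
    // An int is fine here: it overflows only in 2038, and events
    // older than two hours are discarded anyway.
    public static int toEpochSeconds(Date date) {
        return (int) (date.getTime() / 1000L);
    }

    // Convert back when the full Date is needed,
    // e.g. for the JSON response.
    public static Date fromEpochSeconds(int seconds) {
        return new Date(seconds * 1000L);
    }
}
```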
After having a deeper look into the (very good) documentation of Ehcache (deeper means "more than the getting started page"!), I stumbled upon the Ehcache Search API. Wow! Sounds promising! Simply declare your cache as searchable in your configuration and, for performance reasons, specify the attributes you're going to search on. Then create a query using those attributes, execute it, get the results, and be happy! :-)
```xml
<cache name="events" maxEntriesLocalHeap="1500000">
    <searchable keys="false" values="false">
        <searchAttribute name="date"/>
        <searchAttribute name="category"/>
    </searchable>
</cache>
```

```java
Cache cache = CacheManager.getInstance().getCache("events");

Attribute<Integer> date = cache.getSearchAttribute("date");
Attribute<String> category = cache.getSearchAttribute("category");

// e.g. all events newer than a cutoff timestamp and of a given category
Query query = cache.createQuery().includeValues()
        .addCriteria(date.ge(cutoffTimestamp).and(category.eq("ALARM")));

Results eventsResult = query.execute();
int eventsTotalCount = eventsResult.size();
```

How did they do it? Is there a search index like in Lucene? Or do they simply iterate through all elements in the cache and compare the attribute values?
They do both - depending on how you use Ehcache!
If you use Ehcache, like I do, in a standalone, on-heap-only mode, no index is used and all elements are touched while searching. Because all elements are in memory and no deserialization has to take place, this is pretty fast: one million elements can be scanned in under 1 second! And this seems to scale very linearly and stably. So my requirement of up to 1.5 million elements can be handled in less than 2 seconds. That is well within the requested 5 seconds, but it covers only the search itself; the results still have to be fetched and serialized to JSON for the REST response. But that's another task.
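Conceptually, the on-heap search is nothing more than a full scan with a predicate - roughly like filtering a plain in-memory collection. This is a simplified illustration of the idea, not Ehcache's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class LinearScan {

    // Scan all cached timestamps and keep those newer than the cutoff -
    // no index, just one pass over data that is already on the heap,
    // which is why a million elements go by in well under a second.
    public static List<Long> eventsAfter(List<Long> timestamps, long cutoff) {
        List<Long> hits = new ArrayList<>();
        for (long t : timestamps) {
            if (t >= cutoff) {
                hits.add(t);
            }
        }
        return hits;
    }
}
```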
If you use Ehcache with BigMemory, an index is created, using Lucene under the hood. Java objects in the off-heap store are serialized and can't be searched directly, so an index is needed. I didn't test that option, but since Lucene is used, I'm sure it performs as well as other Lucene-based products (like Elasticsearch).
You can read more about the implementation and performance here!
I created a small demo project on GitHub - not using events, but fake-data Person objects - putting 1 million of them into the cache and performing some search queries. Whether the queries perform better or worse strongly depends on the power of your computer and on how the cache was used before (warm-up). The examples are unit tests, so just clone the repository and run "mvn test"!
Ah, and what about the requirement that events older than two hours are not interesting?
That's easy. Simply configure a timeToLiveSeconds attribute, or, as I solved it, don't care about the age of an element at all: just specify FIFO (first-in-first-out) as the eviction policy, set the maximum size to 1.5 million elements, and you are done! ;-)
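Put together, the cache configuration could look roughly like this (the exact numbers follow from the requirements above; treat it as a sketch, not the project's actual config):

```xml
<!-- ~200 events/s for 2 hours is about 1.5 million entries: with FIFO
     eviction, the oldest entries fall out once the cache is full.
     Alternatively, add timeToLiveSeconds="7200" to expire by age. -->
<cache name="events"
       maxEntriesLocalHeap="1500000"
       memoryStoreEvictionPolicy="FIFO">
    <searchable keys="false" values="false">
        <searchAttribute name="date"/>
        <searchAttribute name="category"/>
    </searchable>
</cache>
```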