Thursday, July 18, 2013

Extending Guava caches to overflow to disk

Caching allows you to significantly speed up applications with little effort. Two great cache implementations for the Java platform are the Guava caches and Ehcache. While Ehcache is much richer in features (such as its Searchable API and the possibility of persisting caches to disk or overflowing to BigMemory), it also comes with quite an overhead compared to Guava. In a recent project, I needed to overflow a comprehensive cache to disk, but at the same time I regularly needed to invalidate particular values of this cache. Because Ehcache's Searchable API is only available for in-memory caches, this put me in quite a dilemma. However, it was quite easy to extend a Guava cache to overflow to disk in a structured manner, which gave me both the overflow behavior and the required invalidation feature. In this article, I want to show how this can be achieved.

I will implement this file-persisting cache, FilePersistingCache, as a wrapper around an actual Guava Cache instance. This is of course not the most elegant solution (more elegant would be to implement an actual Guava Cache with this behavior), but it will do for most cases.

To begin with, I will define a protected method that creates the backing cache I mentioned before:

private LoadingCache<K, V> makeCache() {
  return customCacheBuild(CacheBuilder.newBuilder())
    .removalListener(new PersistingRemovalListener())
    .build(new PersistedStateCacheLoader());
}

protected CacheBuilder<Object, Object> customCacheBuild(CacheBuilder<Object, Object> cacheBuilder) {
  return cacheBuilder;
}
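The second method is a hook for subclasses. As an illustrative sketch (the class name, limits, and demo code below are my own choices, not part of the original project), a subclass could bound the in-memory portion of the cache like this:

```java
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class BoundedCacheDemo {

  // Hypothetical override: keep at most 100 entries in memory and evict
  // entries that were not read for ten minutes. Evicted entries are then
  // handed to the removal listener, which persists them to disk.
  protected static CacheBuilder<Object, Object> customCacheBuild(
      CacheBuilder<Object, Object> cacheBuilder) {
    return cacheBuilder
        .maximumSize(100)
        .expireAfterAccess(10, TimeUnit.MINUTES);
  }

  public static void main(String[] args) {
    Cache<String, String> cache =
        customCacheBuild(CacheBuilder.newBuilder()).build();
    for (int i = 0; i < 150; i++) {
      cache.put("key" + i, "value");
    }
    cache.cleanUp();
    System.out.println(cache.size() <= 100); // prints "true"
  }
}
```

Any configuration that Guava's CacheBuilder offers can be applied this way without touching the persistence logic.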

The first method will be used internally to build the backing cache. The second method is meant to be overridden in order to apply custom requirements to the cache, for example an eviction strategy such as a maximum number of entries or soft references. The resulting cache is used just like any other Guava cache. The key to its functionality are the RemovalListener and the CacheLoader it is built with. We define these two implementations as inner classes of the FilePersistingCache:

private class PersistingRemovalListener implements RemovalListener<K, V> {
  @Override
  public void onRemoval(RemovalNotification<K, V> notification) {
    if (notification.getCause() != RemovalCause.EXPLICIT
        && notification.getCause() != RemovalCause.COLLECTED) {
      try {
        persistValue(notification.getKey(), notification.getValue());
      } catch (IOException e) {
        LOGGER.error(String.format("Could not persist key-value: %s, %s", 
          notification.getKey(), notification.getValue()), e);
      }
    }
  }
}

private class PersistedStateCacheLoader extends CacheLoader<K, V> {
  @Override
  public V load(K key) {
    V value = null;
    try {
      value = findValueOnDisk(key);
    } catch (Exception e) {
      LOGGER.error(String.format("Error on finding disk value to key: %s", 
        key), e);
    }
    if (value != null) {
      return value;
    } else {
      return makeValue(key);
    }
  }
}

As is obvious from the code, these inner classes call methods of FilePersistingCache that we have not yet defined; this allows custom serialization behavior to be added by overriding the class. The removal listener checks why a cache entry is being removed: if the RemovalCause is EXPLICIT, the user deliberately invalidated the entry, and if it is COLLECTED, the key or value was already garbage collected so that there is no value left to write. In both cases nothing is persisted; only entries evicted by the cache's own eviction strategy are written to disk. The CacheLoader first attempts to restore an existing value from disk and creates a new value only if no value could be restored.

The missing methods are defined as follows:

private V findValueOnDisk(K key) throws IOException {
  if (!isPersist(key)) return null;
  File persistenceFile = makePathToFile(persistenceDirectory, directoryFor(key));
  if (!persistenceFile.exists()) return null;
  FileInputStream fileInputStream = new FileInputStream(persistenceFile);
  try {
    FileLock fileLock = fileInputStream.getChannel().lock();
    try {
      return readPersisted(key, fileInputStream);
    } finally {
      fileLock.release();
    }
  } finally {
    fileInputStream.close();
  }
}

private void persistValue(K key, V value) throws IOException {
  if (!isPersist(key)) return;
  File persistenceFile = makePathToFile(persistenceDirectory, directoryFor(key));
  persistenceFile.getParentFile().mkdirs();
  persistenceFile.createNewFile();
  FileOutputStream fileOutputStream = new FileOutputStream(persistenceFile);
  try {
    FileLock fileLock = fileOutputStream.getChannel().lock();
    try {
      persist(key, value, fileOutputStream);
    } finally {
      fileLock.release();
    }
  } finally {
    fileOutputStream.close();
  }
}


private File makePathToFile(@Nonnull File rootDir, List<String> pathSegments) {
  File persistenceFile = rootDir;
  for (String pathSegment : pathSegments) {
    persistenceFile = new File(persistenceFile, pathSegment);
  }
  if (rootDir.equals(persistenceFile) || persistenceFile.isDirectory()) {
    throw new IllegalArgumentException();
  }
  return persistenceFile;
}

protected abstract List<String> directoryFor(K key);

protected abstract void persist(K key, V value, OutputStream outputStream) 
  throws IOException;

protected abstract V readPersisted(K key, InputStream inputStream) 
  throws IOException;

protected abstract boolean isPersist(K key);

protected abstract V makeValue(K key);

The implemented methods take care of serializing and deserializing values while synchronizing file access and guaranteeing that streams are closed appropriately. The remaining methods are abstract and are up to the cache's user to implement. The directoryFor(K) method should identify a unique file path for each key; in the easiest case, the toString method of the key's class is implemented in such a way. Additionally, I made the persist, readPersisted and isPersist methods abstract in order to allow for a custom serialization strategy such as Kryo. In the simplest scenario, you would use the built-in Java serialization via ObjectInputStream and ObjectOutputStream. For isPersist, you would return true, assuming that you only use this implementation when you need persistence; I added this method to support mixed caches where only the values of some keys can be serialized. Be sure not to close the streams within the persist and readPersisted methods since the file system locks rely on the streams being open. The implementation above will take care of closing the streams for you.
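For the built-in Java serialization just mentioned, the abstract methods could be implemented roughly as follows. This is a sketch under the assumption that keys are Strings and values are Serializable; the class name and the path layout are mine. Note that the object streams are flushed but never closed, as required above:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

// Hypothetical String-keyed implementation using plain Java serialization.
public class JavaSerializationSupport {

  // Two path segments: a fixed subdirectory plus a file named after the key.
  protected List<String> directoryFor(String key) {
    return Arrays.asList("cache", key + ".ser");
  }

  protected boolean isPersist(String key) {
    return true; // persist every key in this simple scenario
  }

  protected void persist(String key, Serializable value, OutputStream outputStream)
      throws IOException {
    ObjectOutputStream objectOut = new ObjectOutputStream(outputStream);
    objectOut.writeObject(value);
    objectOut.flush(); // flush, but do not close: the caller owns the stream
  }

  protected Serializable readPersisted(String key, InputStream inputStream)
      throws IOException {
    try {
      return (Serializable) new ObjectInputStream(inputStream).readObject();
    } catch (ClassNotFoundException e) {
      throw new IOException("Unknown class for key " + key, e);
    }
  }

  public static void main(String[] args) throws IOException {
    JavaSerializationSupport support = new JavaSerializationSupport();
    java.io.ByteArrayOutputStream buffer = new java.io.ByteArrayOutputStream();
    support.persist("greeting", "hello", buffer);
    System.out.println(support.readPersisted("greeting",
        new java.io.ByteArrayInputStream(buffer.toByteArray()))); // prints "hello"
  }
}
```

A Kryo-based variant would only need to swap the bodies of persist and readPersisted.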

Finally, I added some service methods to access the cache. Implementing Guava's Cache interface would of course be a more elegant solution:

public V get(K key) {
  return underlyingCache.getUnchecked(key);
}

public void put(K key, V value) {
  underlyingCache.put(key, value);
}

public void remove(K key) {
  underlyingCache.invalidate(key);
}

protected Cache<K, V> getUnderlyingCache() {
  return underlyingCache;
}

Of course, this solution can be further improved. If you use the cache in a concurrent scenario, be aware that the RemovalListener is, unlike most Guava cache operations, not necessarily executed at the moment of eviction but lazily during cache maintenance. As the code shows, I added file locks to avoid read/write conflicts on the file system. This asynchronicity does however imply a small chance that a value gets recreated from disk even though a value still exists in memory. If you need to avoid this, be sure to call the underlying cache's cleanUp method within the wrapper's get method. Finally, remember to clean up the file system when you expire your cache. Ideally, you would store your cache entries in a temporary folder of your system to avoid this problem altogether. In the example code, the directory is represented by an instance field named persistenceDirectory, which could for example be initialized in the constructor.
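The cleanUp guarantee can be demonstrated with a plain Guava cache. In this self-contained sketch (all names are mine), the removal listener stands in for the persistValue call; after cleanUp returns, the notification for the evicted entry is guaranteed to have been delivered:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;

public class CleanUpDemo {

  public static void main(String[] args) {
    final StringBuilder persisted = new StringBuilder();
    // A single-segment cache that holds one entry; the listener stands in
    // for the persistValue call of the FilePersistingCache.
    Cache<String, String> cache = CacheBuilder.newBuilder()
        .concurrencyLevel(1)
        .maximumSize(1)
        .removalListener(new RemovalListener<String, String>() {
          @Override
          public void onRemoval(RemovalNotification<String, String> notification) {
            persisted.append(notification.getKey());
          }
        })
        .build();
    cache.put("a", "1");
    cache.put("b", "2"); // evicts "a"; the notification may still be pending
    cache.cleanUp();     // after this call the listener is guaranteed to have run
    System.out.println(persisted); // prints "a"
  }
}
```

Whether the extra maintenance pass on every read is worth its cost depends on your workload.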

Update: I wrote a clean implementation of what I described above, which you can find on my GitHub page and on Maven Central. Feel free to use it if you need to store your cache objects on disk.

Friday, July 5, 2013

Object-based micro-locking for concurrent applications by using Guava

One of the presumably most annoying problems when writing concurrent Java applications is the handling of resources that are shared among threads, such as a web application's session and application data. As a result, many developers choose not to synchronize such resources at all if an application's concurrency level is low. It is for example unlikely that a session resource is accessed concurrently: if request cycles complete within a short time span, a user will hardly ever send a concurrent request from a second browser tab while the first request cycle is still in progress. With the ascent of Ajax-driven web applications, this trusting approach does however become increasingly hazardous. In an Ajax application, a user could for example trigger a longer-lasting task and then start a similar task in another browser window. If these tasks access or write session data, you need to synchronize such access. Otherwise you will face subtle bugs or even security issues, as is for example pointed out in this blog entry.

An easy way of introducing a lock is Java's synchronized keyword. The following example only blocks a request cycle's thread if a new instance needs to be written to the session:
HttpSession session = request.getSession(true);
if (session.getAttribute("shoppingCart") == null) {
  synchronized (session) {
    if (session.getAttribute("shoppingCart") == null) {
      session.setAttribute("shoppingCart", new ShoppingCart());
    }
  }
}
ShoppingCart cart = (ShoppingCart) session.getAttribute("shoppingCart");
doSomethingWith(cart);

This code will add a new instance of ShoppingCart to the session. Whenever no shopping cart is found, the code acquires a monitor on the current user's session and adds a new ShoppingCart to the HttpSession. This solution has however several downsides:
  1. Whenever any value is added to the session using the pattern above, every thread accessing the current session will block, even when two threads access different session values. This blocks the application more restrictively than necessary.
  2. A servlet API implementation might choose not to implement HttpSession as a singleton instance. In that case, the whole synchronization would fail. (This is however not a common implementation of the servlet API.)
It would be much better to synchronize on a different object than the HttpSession instance. Creating such objects and sharing them between different threads would however introduce the same problems. A nice way of avoiding that is a Guava cache, which is both intrinsically concurrent and allows the use of weak values:

LoadingCache<String, Object> monitorCache = CacheBuilder.newBuilder()
    .weakValues()
    .build(new CacheLoader<String, Object>() {
      @Override
      public Object load(String key) {
        return new Object();
      }
    });

Now we can rewrite the locking code like this:

HttpSession session = request.getSession(true);
@SuppressWarnings("unchecked")
LoadingCache<String, Object> monitorCache =
  (LoadingCache<String, Object>) session.getAttribute("cache");
Object monitor = monitorCache.getUnchecked("shoppingCart");
if (session.getAttribute("shoppingCart") == null) {
  synchronized (monitor) {
    if (session.getAttribute("shoppingCart") == null) {
      session.setAttribute("shoppingCart", new ShoppingCart());
    }
  }
}
ShoppingCart cart = (ShoppingCart) session.getAttribute("shoppingCart");
doSomethingWith(cart);

The Guava cache is self-populating and will simply return a monitor Object instance which can be used as a lock on the shared session resource that is universally identified by the key shoppingCart. The Guava cache is backed by a ConcurrentHashMap-like structure which avoids global locking by synchronizing only on the bucket of the key's hash value. As a result, the application was made thread-safe without blocking it globally. Also, you do not need to worry about running out of memory since the monitors (and the related cache entries) will be garbage collected once they are no longer in use. If you do not use other caches, you can even consider soft references to optimize run time.
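The per-key monitor idea does not strictly require Guava; a minimal JDK-only sketch (class and method names are mine) shows the core mechanism using ConcurrentHashMap.putIfAbsent. What the Guava cache adds on top is the weak-value cleanup, which this variant does not get for free:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class MonitorMapDemo {

  private static final ConcurrentMap<String, Object> MONITORS =
      new ConcurrentHashMap<String, Object>();

  // Returns the one monitor object associated with the given resource name.
  static Object monitorFor(String resourceName) {
    Object candidate = new Object();
    Object existing = MONITORS.putIfAbsent(resourceName, candidate);
    return existing == null ? candidate : existing;
  }

  public static void main(String[] args) {
    // Two lookups for the same resource yield the same lock object.
    synchronized (monitorFor("shoppingCart")) {
      // critical section for the shoppingCart resource
    }
    System.out.println(monitorFor("shoppingCart") == monitorFor("shoppingCart")); // prints "true"
  }
}
```

Without the weak values, entries in this map live until the map itself is discarded, which is usually acceptable for a bounded set of resource names.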

This mechanism can of course be refined. Instead of returning an Object instance, one could for example return a ReadWriteLock. Also, it is important to instantiate the LoadingCache on the session's start-up, which can be achieved for example by an HttpSessionListener.
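A lock-returning variant of the monitor cache could look like this (a sketch; the class and field names are mine):

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class LockCacheDemo {

  static final LoadingCache<String, ReadWriteLock> LOCKS = CacheBuilder.newBuilder()
      .weakValues() // locks are collected once no thread references them
      .build(new CacheLoader<String, ReadWriteLock>() {
        @Override
        public ReadWriteLock load(String key) {
          return new ReentrantReadWriteLock();
        }
      });

  public static void main(String[] args) {
    ReadWriteLock lock = LOCKS.getUnchecked("shoppingCart");
    lock.readLock().lock(); // many readers may hold this lock concurrently
    try {
      // read the shared resource
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Readers of a resource then no longer block each other; only writers acquire the exclusive write lock.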