In the last column we talked about accelerating recursive computations with memoization. I mentioned that memoization is just one form of caching, a technique that can be applied to a much wider range of problems. And indeed, recursive computations are not the only ones that can benefit from caching. In this column we'll implement a complete caching class that can be readily applied to any problem, and we'll see a concrete example of using the cache to speed up calculations.
[size="5"]Software vs. Hardware cache
By reading this column you are most likely already using a cache. Your computer caches the memory it accesses most often and is thus able to work faster. This cache is a hardware cache, deep down in the hierarchy of computer architecture / hardware / software. It is completely transparent to us, simple PC users.
It was noticed a long time ago that while CPU (Central Processing Unit) speeds increase quickly, the speed of computer memory (DRAM - Dynamic Random Access Memory) and of the bus to the memory (the Front Side Bus in a PC, or FSB for short) can't keep up. Each access to main memory is expensive, and the CPU spends many cycles just waiting for the data to arrive. Thus, caches were invented. A hardware cache is a small and extremely fast memory that is usually located on the same chip as the CPU. Its access time is almost as fast as the CPU itself, and there is no external bus to wait for. When the CPU accesses a memory location, its contents are stored in the cache, and future accesses to it are served much more quickly. Caches also guess that if the CPU reads some memory location, there's a good chance it will read the next one as well, so they store a whole chunk of memory at once, which often results in a very high cache hit rate (the percentage of memory accesses that find what they need in the cache).
In reality, things are much more complicated. Caches pre-fetch data from memory using methods based on the principles of temporal and spatial locality. These days there are at least 2 (sometimes 3) levels of cache in a computer, and complex algorithms are involved in cleaning up the cache and keeping it coherent (in sync) with the main memory, especially when multiple CPUs / cores work together. This is a fascinating topic, and if you're interested there's a lot of free and well-written information floating around the web; just run a web search.
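You can actually observe the effect of spatial locality from ordinary C++ code. The following small program (my own illustration, not part of this column's source package) sums the same matrix twice - once walking it in memory order and once jumping between rows - and on most machines the cache-friendly traversal is noticeably faster:
#include <cstdio>
#include <ctime>
#include <vector>

using namespace std;

int main()
{
    const size_t N = 4000;
    vector<int> matrix(N * N, 1);

    // Row-major traversal: consecutive memory addresses, cache-friendly.
    long sum1 = 0;
    clock_t t0 = clock();
    for (size_t row = 0; row < N; ++row)
        for (size_t col = 0; col < N; ++col)
            sum1 += matrix[row * N + col];
    clock_t t1 = clock();

    // Column-major traversal: each access jumps N elements ahead,
    // so the pre-fetched cache lines are mostly wasted.
    long sum2 = 0;
    for (size_t col = 0; col < N; ++col)
        for (size_t row = 0; row < N; ++row)
            sum2 += matrix[row * N + col];
    clock_t t2 = clock();

    printf("row-major:    %.2f sec (sum = %ld)\n", double(t1 - t0) / CLOCKS_PER_SEC, sum1);
    printf("column-major: %.2f sec (sum = %ld)\n", double(t2 - t1) / CLOCKS_PER_SEC, sum2);
    return 0;
}
Both loops do exactly the same amount of arithmetic; the difference comes purely from how well their memory access pattern matches what the hardware cache expects.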
But this is all hardware cache. In this article I want to discuss software caches. A software cache is a programming technique used to speed up repetitive calculations, and we saw concrete examples of this in the previous column. The implementation level may be different, but the principles are the same. All we really want is to remember computations we already did and not repeat them unnecessarily.
[size="5"]Basic requirements and definitions
A cache has to store results of computations. That is, for inputs, it stores outputs. Therefore, a cache is somewhat similar to a dictionary data structure - it stores key/value pairs - given a key we want to quickly find the value. A hardware cache, for example, stores the contents of memory locations. Therefore its key is a memory address and its value is the address's contents.
How fast should a cache be? The obvious answer - as fast as possible - is not accurate. It depends on the keys we use. A solution that works fastest in the most general case is not always the solution that works fastest for some specific cases. We'll get back to this a bit later.
[size="5"]To infinity and beyond?
There is an important complication with caches we still haven't considered, however. Caches, like all data structures (and physical structures), have a finite size. Can we store all the calculations we'll ever need in a cache? Can we store the contents of all memory locations in a hardware cache?
The answer to both questions is, of course, no. Hardware caches, for example, are far smaller than the main memory (caches are made of fast hardware, which makes them very expensive). Since the amount of memory is finite, we can't let our software caches grow forever either. Therefore, the cache must have a limited size. The exact limit depends heavily on the application, the data and the amount of available memory, so it's best to let the user of the cache decide how large he wants it to be.
[size="5"]Cache removal
So now we know our cache should be of limited size. This raises an important question: what to do when the cache is full? We can just stop and not add new keys, but this is obviously a bad solution. The alternative is to make free space for new keys by removing old ones - cache removal.
There are many algorithms and methods of cache removal, some of which depend on the data. Here are some of the more popular approaches:
- Random
Using this approach, when the cache is full and we want to add a new key to it, we just throw out some old key at random (a small sketch of this approach follows the list).
- LRU
LRU stands for Least Recently Used. Using this approach, we throw out the oldest key - the one that was accessed least recently.
- MRU
MRU stands for Most Recently Used. We throw out the newest key, the one accessed most recently.
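To make the random approach concrete, here is a minimal sketch of a randomly-evicting cache (the class and member names are mine, for illustration only - this is not the cache we'll build below):
#include <cstdlib>
#include <map>
#include <vector>

using namespace std;

// A toy cache with random removal: when full, evict a key picked at random.
// The 'keys' vector mirrors the map so a victim can be picked in O(1).
template <typename key_type, typename value_type>
class random_cache
{
public:
    random_cache(unsigned maxsize_) : maxsize(maxsize_) {}

    void insert(const key_type& key, const value_type& value)
    {
        if (table.find(key) == table.end())
        {
            if (keys.size() >= maxsize)
            {
                size_t victim = rand() % keys.size();
                table.erase(keys[victim]);
                keys[victim] = key;      // reuse the evicted key's slot
            }
            else
            {
                keys.push_back(key);
            }
        }
        table[key] = value;              // insert or update the value
    }

    value_type* find(const key_type& key)
    {
        typename map<key_type, value_type>::iterator it = table.find(key);
        return it == table.end() ? 0 : &(it->second);
    }

private:
    unsigned maxsize;
    vector<key_type> keys;
    map<key_type, value_type> table;
};
The appeal of random removal is its simplicity; its drawback is that it may evict a key we're just about to need again. LRU, which we'll use below, tries to do better by assuming that recently used keys are likely to be used again soon.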
[size="5"]Requirements revisited
Let's define the operations we want the cache to perform.
- Creation and initialization
We'd like to specify the cache size upon its creation - that is, the maximal number of keys it stores.
- Lookup
We'd like to ask the cache for a key and get its value, or an indication that this key doesn't exist in the cache.
- Insertion
We'd like to add keys to the cache. If the key already exists in the cache, its value is updated with the latest value. If there's no such key in the cache, it is added. If the cache is full, the LRU key is removed to make space for the new key.
[size="5"]Design
We certainly need a data structure that lets us look up values for keys efficiently. This will be the core cache table. We can use the C++ standard [font="Courier New"]map[/font] container for this purpose - it provides logarithmic lookup and insertion ([font="Courier New"]O(log N)[/font] where N is the cache size).
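As a quick reminder of the [font="Courier New"]map[/font] interface we'll be relying on (the names and values here are purely illustrative):
#include <iostream>
#include <map>
#include <string>

using namespace std;

int main()
{
    // The kind of structure our core cache table will use: key -> value.
    map<string, double> table;
    table.insert(make_pair(string("pi"), 3.14));
    table["e"] = 2.71;                      // operator[] inserts or updates

    // find() is an O(log N) lookup; it returns end() when the key is absent.
    map<string, double>::iterator it = table.find("pi");
    if (it != table.end())
        cout << it->first << " = " << it->second << endl;
    return 0;
}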
But how do we implement LRU removal? We could keep some kind of "recent access" time stamp for each key, but how would we know which key to throw away? Going over the whole cache to find the LRU key is an [font="Courier New"]O(N)[/font] operation - too slow.
We solve this using a very common programming trick - we sacrifice space for time. Such problems are usually solved by adding another data structure that provides the special requirement quickly and is kept fully coherent with the main data structure. What we need here is a queue of keys ordered by recency of use - a linear structure with the most recently used key at the front and the least recently used key at the back, a known location from which it can be removed quickly.
This leaves the question of how to implement the queue. We could go for a simple array, but that won't do (can you figure out why?). The problem is that when there's a lookup on some cache key, it immediately becomes the most-recently-used key and should be marked as such, for example by being moved to the front of the queue. This operation is called splicing - taking an item from the middle of a container and moving it to one of the ends. Splicing in arrays is expensive ([font="Courier New"]O(N)[/font]), which is unacceptable.
Fortunately, there is a solution - a linked list. In a linked list, insertion and removal at both ends are [font="Courier New"]O(1)[/font], and so is splicing, given that we already have a pointer/handle to the key we want to splice. And that can be arranged by holding such a pointer in the main cache data structure.
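Here is a tiny demonstration of the splice operation we'll be using, on a plain [font="Courier New"]list[/font] of strings (not yet the cache itself):
#include <iostream>
#include <list>
#include <string>

using namespace std;

int main()
{
    list<string> lru_list;
    lru_list.push_back("pi");
    lru_list.push_back("e");
    lru_list.push_back("gold");     // order is now: pi e gold

    // Pretend "e" was just accessed: move it to the front in O(1).
    list<string>::iterator li = lru_list.begin();
    ++li;                           // li now points at "e"
    lru_list.splice(lru_list.begin(), lru_list, li);

    // Prints: e pi gold
    for (list<string>::iterator i = lru_list.begin(); i != lru_list.end(); ++i)
        cout << *i << " ";
    cout << endl;
    return 0;
}
Note that splicing doesn't invalidate the iterator - [font="Courier New"]li[/font] still points at "e" after the move - which is exactly why the cache can afford to keep such iterators around.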
So, we'll go for two data structures: a [font="Courier New"]map[/font] for the table, and a [font="Courier New"]list[/font] (another container in the C++ standard library) for the recent-usage queue. For each key, the table will hold the value and a pointer to the key in the queue, which makes it trivial to mark it as recent on lookups.
So, enough babbling, let's get to the code.
[size="5"]Implementation
The source code package provided with this column contains a file named cache.h - this is the implementation of the cache (it is wholly in a .h file because it's templated):
template <typename key_type, typename value_type>
class cache
Our cache can work for any key type and value type, given to it at creation as template arguments. Here is a portion of the cache class that lists its data members:
typedef typename list<key_type>::iterator list_iter;
struct cached_value
{
cached_value(value_type value_, list_iter cache_i_)
: value(value_), cache_i(cache_i_)
{
}
value_type value;
list_iter cache_i;
};
typedef typename map<key_type, cached_value>::iterator table_iter;
unsigned maxsize;
list<key_type> lru_list;
map<key_type, cached_value> table;
[font="Courier New"]maxsize[/font] is the maximal size given to the cache at creation. [font="Courier New"]table[/font] is the main cache table - for each key, it holds a value and a pointer to the queue. [font="Courier New"]lru_list[/font] is the queue - a list sorted by recent use (with the most recently used key in the front).
Note that the class also defines a [font="Courier New"]cache_statistics[/font] subtype. This is to collect statistics of cache usage. The implementation of statistics is simple enough that I won't mention it in the column. It can be very useful, however, when you plan to use the cache for your needs and want to analyze its performance.
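Roughly, it's a bunch of counters along these lines (a sketch based on the counters used later in this column, not necessarily the exact code from cache.h):
// A sketch of the statistics type: counters updated by the cache methods
// (only when compiled in debug mode, via the IF_DEBUG macro).
struct cache_statistics
{
    cache_statistics() : finds(0), finds_hit(0), removed(0) {}

    unsigned long finds;       // total number of find() calls
    unsigned long finds_hit;   // find() calls that found the key (cache hits)
    unsigned long removed;     // keys evicted because the cache was full
};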
Lookup of keys in the cache is done as follows:
value_type* find(const key_type& key)
{
table_iter ti = table.find(key);
IF_DEBUG(stats.finds++);
if (ti == table.end())
return 0;
IF_DEBUG(stats.finds_hit++);
list_iter li = ti->second.cache_i;
lru_list.splice(lru_list.begin(), lru_list, li);
return &(ti->second.value);
}
The key is looked up in the table which has efficient lookups. If the key wasn't found, we simply return 0. If the key was found, we have to splice the accessed key out of its place in the queue and place it in the front - since now this key is the most recently used. Then, we return the value of the key.
Insertion is just a little more complex:
void insert(const key_type& key, const value_type& value)
{
value_type* valptr = find(key);
if (valptr)
{
*valptr = value;
}
else
{
lru_list.push_front(key);
cached_value cv(value, lru_list.begin());
table.insert(make_pair(key, cv));
if (lru_list.size() > maxsize)
{
key_type lru_key = lru_list.back();
table.erase(lru_key);
lru_list.pop_back();
IF_DEBUG(stats.removed++);
}
}
}
First we look for the key in the table. Note that the cache's own [font="Courier New"]find()[/font] method is used here, because if we do find the element, we want it marked as MRU.
If the key was found, we just update its value and return. More interesting is what happens when the key is not found - here the insertion takes place. After adding the key to the cache, we check to see if the cache size is exceeded. If it is, we throw out the key that's in the back of [font="Courier New"]lru_list[/font] which is, if you recall, the LRU key - just what we need!
[size="5"]Using the cache
Using this cache is very simple. Here's a small demonstration:
cache<string, double> cc(4);
cc.insert("pi", 3.14);
cc.insert("e", 2.71);
cc.insert("gold", 1.61);
cc.insert("sq2", 1.41);
cc.debug_dump();
cc.insert("zero", 0);
cc.debug_dump();
double* e_value = cc.find("e");
cc.insert("one", 1);
cc.debug_dump();
cc.statistics();
for (int i = 0; i < 30; ++i)
double* one_value = cc.find("one");
cc.statistics();
Run this (don't forget to [font="Courier New"]#include "cache.h"[/font] and run in debug mode, so that statistics will be collected and printed). Try to predict what the state of the cache is during the execution.
In the first dump, you see the items you inserted, in MRU order. In the second dump, you don't see "pi". That's because it was the LRU key and was removed when "zero" was added. In the third dump you don't see "gold". Why not "e", which was inserted before "gold"? Because "e" was accessed by [font="Courier New"]find[/font], and thus was marked MRU.
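Worked out by hand from the code above (the exact debug_dump output format may differ), the cache contents at each point, in MRU-to-LRU order, are:
After the first four insertions:   sq2, gold, e, pi
After inserting "zero":            zero, sq2, gold, e    ("pi" was LRU and got evicted)
After find("e"):                   e, zero, sq2, gold    ("e" moved to the front)
After inserting "one":             one, e, zero, sq2     ("gold" was LRU and got evicted)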
[size="5"]Efficiency revisited
The way the cache is currently implemented, it does all operations in [font="Courier New"]O(log N)[/font] (N being the cache size). LRU removal/splicing is very efficient ([font="Courier New"]O(1)[/font]). What takes most time is the map lookups. Can't we make those more efficient?
As a matter of fact, we can. Well, in most cases. By using a hash table instead of [font="Courier New"]map[/font] (which uses trees and hence is logarithmic), we can make all cache operations [font="Courier New"]O(1)[/font]. There's only one catch though - this can be done only if we have good hashing functions for our keys. But since most of the keys would be either numbers or strings, and good hashing functions for those exist, it's not a real problem.
Interestingly, many C++ standard library implementations provide an extension container named [font="Courier New"]hash_map[/font], which is a hash table. Since it's not standard yet (it's only an extension), implementations differ and aren't always very mature. Bruce Eckel, in his book "Thinking in C++", presents a benchmark that gives him a 4x speedup with [font="Courier New"]hash_map[/font] over [font="Courier New"]map[/font].
Maybe his implementation of [font="Courier New"]hash_map[/font] is better, but I didn't get such results in my tests (with Microsoft Visual C++ .NET's implementation of the STL). I got only a minimal (about 20%) speedup for integer keys (Eckel's benchmark, in my opinion, is quite dubious - the data he uses isn't well suited for reliable benchmarking). When I tried [font="Courier New"]string[/font] keys, [font="Courier New"]hash_map[/font] was in fact twice as slow as [font="Courier New"]map[/font].
Hence, I stuck with [font="Courier New"]map[/font], but I'm confident that given a good implementation of a hash map and a good hashing function for the keys, the cache can be made more efficient. The fact that the cache size is limited and known beforehand only helps to create a very speedy hash table. This is left as an exercise for the astute reader.
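As a starting point for that exercise, here is a minimal sketch of using the [font="Courier New"]hash_map[/font] extension with integer keys, as it ships with Visual C++ .NET (the header, the namespace and the hashing requirements all differ between compilers, so treat this purely as an illustration):
#include <hash_map>    // non-standard extension header; g++ has <ext/hash_map> instead
#include <iostream>

int main()
{
    // In recent versions of Visual C++ the extension lives in namespace stdext.
    stdext::hash_map<int, double> table;
    table[42] = 3.14;                   // average O(1) insertion

    // Average O(1) lookup - the same find()/end() idiom as with map,
    // so swapping the container inside the cache requires few changes.
    stdext::hash_map<int, double>::iterator it = table.find(42);
    if (it != table.end())
        std::cout << it->first << " -> " << it->second << std::endl;
    return 0;
}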
[hr]Copyright (C) 2005 Eli Bendersky