Linked Lists Are Tricky

Published August 15, 2007
Life is too damn busy. Still, I think I prefer work life over school life, even if the net amount of free time, or at least time spent at home, is less. Some cool things have been happening with XNA recently; I'll write about them soon, but I need to collect my thoughts a bit first.

Unfortunately, I suspect that what I'm writing today will be very, very old news to the sorts of people who read this journal.

Linked Lists Are Tricky

Everybody knows linked lists, especially if you learned C or C++ at some point. They're a relatively simple data structure that provides constant-time insertion and removal, and linear-time random access by index. They're popular with people writing hobby games because of how quick and easy it is to add, remove, and rearrange things. Since these games usually traverse the list from beginning to end, the linear-time random access isn't a concern. For getting things done, it's hard to beat linked lists, and the performance characteristics seem ideal: they don't suffer from the nasty linear-time removals and insertions of arrays, and their drawbacks don't apply to a typical update pattern in a game.
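
For concreteness, here's a minimal sketch of the kind of list I mean; the names are illustrative rather than from any particular engine:

```cpp
// A sketch of the usual doubly linked game object list.
struct GameObject {
    GameObject* prev = nullptr;
    GameObject* next = nullptr;
    // ... game data ...
};

// Constant-time removal from anywhere in the list: just relink neighbors.
void Unlink(GameObject*& head, GameObject* obj) {
    if (obj->prev) obj->prev->next = obj->next;
    else           head = obj->next;
    if (obj->next) obj->next->prev = obj->prev;
    obj->prev = obj->next = nullptr;
}
```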

As usual, things aren't quite so simple. On a purely theoretical level, looking at asymptotic performance bounds and such, linked lists are great for this sort of thing. Once you add real computer architecture into the mix, though, linked lists turn out to have some rather inconvenient hidden drawbacks. Now, these aren't necessarily fatal problems, and they're largely performance-oriented. However, it's important to be aware of them, and they complicate the landscape quite a bit.

The problem, you see, is locality. Computers love locality. Linked lists, however, tend to exhibit terrible locality (and in the case of native code, practically no locality at all, particularly spatial locality). An application that relies heavily on linked lists will cache miss at a high rate, wasting dozens of cycles doing nothing but waiting every time it looks at an object. Throw in a few thousand objects and it really starts to add up; before long you're losing full percentage points of CPU time to nothing but cache misses.

In order to understand why linked lists are so bad with locality, we first need to look at how a processor cache works. The cache exists to avoid going all the way out to main memory for recently used data. There are also management structures involved, in order to keep track of what memory is currently resident in the cache, which cached lines need to be written back to main memory, and so on. It would be prohibitively expensive to maintain this information for every single byte, as well as extremely redundant in many cases. So instead, things are managed in chunks called cache lines. The actual size of a cache line can vary between architectures, between sub-architectures, and even between the L1/L2/L3 caches. 64 and 128 bytes are common sizes, though.
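
If you're curious what your own machine uses, you can ask the OS at runtime. This is a glibc-specific sketch, not portable code; other platforms expose the cache line size differently:

```cpp
#include <cstdio>
#include <unistd.h>  // sysconf; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension

int main() {
    // Query the L1 data cache line size. May return 0 or -1 if the
    // C library can't determine it, so treat the result as a hint.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1 data cache line: %ld bytes\n", line);
    return 0;
}
```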

So we can think of all memory as being sliced into cache-line-sized chunks. When some memory is looked up, the entire chunk it belongs to is loaded into cache at once; even if you access a lone int, you'll get 64 or 128 bytes of nearby data loaded as well. Now, consider how the cache behaves with a contiguous list. Let's assume we have a 32-byte cache composed of two 16-byte cache lines, and a list of 8 items, each of which is 4 bytes. Some simple math shows that we have just enough cache space for all of our items. Let's take a look at what happens as we iterate over the list a couple of times. When we hit the first item, we'll get a cache miss and load the first half of the list into cache:
[1 2 3 4] 5 6 7 8
The bracketed items are in cache; the rest aren't. The next three items will be pulled from cache. When we hit 5, there will be another cache miss and we'll load the other half of the list:
[1 2 3 4] [5 6 7 8]
Once again, we'll continue through the next few items without any misses. And on every later pass, the entire list is already in cache, so there are no cache misses at all. This is an exaggerated scenario, of course, but you get the basic idea.
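
In code, the contiguous case is just the toy example above written out, with the cache behavior noted in comments:

```cpp
#include <cstdint>

// 8 four-byte items laid out back to back, so each 16-byte cache line
// from the example covers four of them.
int32_t items[8] = {1, 2, 3, 4, 5, 6, 7, 8};

int64_t Sum() {
    int64_t total = 0;
    for (int i = 0; i < 8; ++i) {
        // Touching items[0] faults in items[0..3]; touching items[4]
        // faults in items[4..7]. Two misses for the whole first pass,
        // and later passes hit cache entirely in this idealized model.
        total += items[i];
    }
    return total;
}
```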

Now, what happens with a linked list? Because a linked list allocates a new node for each item, each node exists on its own, usually surrounded either by garbage memory or by memory that simply isn't relevant to it. This scattering tends to get worse over time if you're using a normal heap; eventually all of the nodes end up spread far and wide. So when we go to traverse the list, we get something like this:
[1 -> {+}] 2 -> {+} 3 -> {+} 4 -> {+} 5 -> {+} 6 -> {+} 7 -> {+} 8
I've used the {+} symbol to represent the junk data following each node, and brackets to mark what's in cache. So far we've loaded one item, its next pointer, and a bunch of junk data, and we've already used up an entire cache line. Here's what happens as we continue to traverse:
[1 -> {+}] [2 -> {+}] 3 -> {+} 4 -> {+} 5 -> {+} 6 -> {+} 7 -> {+} 8
1 -> {+} [2 -> {+}] [3 -> {+}] 4 -> {+} 5 -> {+} 6 -> {+} 7 -> {+} 8
1 -> {+} 2 -> {+} [3 -> {+}] [4 -> {+}] 5 -> {+} 6 -> {+} 7 -> {+} 8
1 -> {+} 2 -> {+} 3 -> {+} [4 -> {+}] [5 -> {+}] 6 -> {+} 7 -> {+} 8
You get the idea. The lack of locality has caused a drastic increase in the cache miss rate, because each cache read only pulls in one item. Worse still, the actual cache space used has gotten a lot larger, especially if that junk data really is junk that we're not interested in. We now need eight cache lines to store what previously fit in only two. And since the beginning of the list falls out of cache, we will start the whole pattern of misses over again on future passes.
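
If you want to see this effect for yourself, a rough (and unscientific) way is to time a traversal of the same data stored both ways. This is a sketch assuming a reasonably modern C++ standard library; exact numbers will vary wildly with your allocator and CPU, and a freshly built list may still be fairly compact, so the gap widens as the heap churns:

```cpp
#include <chrono>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

int main() {
    const int N = 1000000;
    std::vector<int> v(N, 1);               // one contiguous block
    std::list<int> l(v.begin(), v.end());   // one heap node per element

    auto time = [](const char* label, auto&& c) {
        auto start = std::chrono::steady_clock::now();
        long long sum = std::accumulate(c.begin(), c.end(), 0LL);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: sum=%lld, %lld us\n", label, sum, (long long)us);
    };
    time("vector", v);  // mostly cache hits: many elements per line
    time("list", l);    // a potential miss per element once nodes scatter
    return 0;
}
```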

Of course, this effect diminishes quite a bit as the size of each individual node increases. Once your node size (including the prev/next/etc. pointers) is a multiple of the cache line size, you're not doing nearly as badly as in this example. It's still not as good as contiguous storage (for weaker, harder-to-pin-down reasons like virtual address translation, hardware prefetching, and so on), but it helps a lot. A structure like std::deque, which stores its elements in a series of large contiguous blocks, will do just about as well as a single contiguous block. The point is not to avoid linked lists, but to be careful about the exact behavior and structure of the list. A linked list of integers has the potential to really ruin your day; a linked list of 512-byte objects won't really hurt you.
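
Here's a sketch of that middle ground, essentially an unrolled linked list; the sizes and names are illustrative, not from any particular library:

```cpp
#include <cstddef>

// Each node holds a contiguous block of elements, so one pointer hop
// is amortized over many items instead of paid per item.
template <typename T, std::size_t ChunkSize = 64>
struct Chunk {
    T items[ChunkSize];     // contiguous payload: good spatial locality
    std::size_t count = 0;  // how many slots are in use
    Chunk* next = nullptr;
};

template <typename T, std::size_t N, typename Fn>
void ForEach(Chunk<T, N>* head, Fn fn) {
    for (Chunk<T, N>* c = head; c; c = c->next)     // one hop per chunk
        for (std::size_t i = 0; i < c->count; ++i)  // contiguous scan
            fn(c->items[i]);
}
```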

It's worth pointing out that this advice applies to any linked structure, especially trees. A big reason that octrees are preferable to binary space partitions is that you can allocate all eight child nodes in a single contiguous block, which can really help your cache hit rate during a traversal. It doesn't help much if you just have an array of pointers to your children, though; you need to allocate the actual children in a contiguous block to really see a benefit. Again, the exact structure and node size are key, and can make or break the caching behavior. Accessing scattered small objects is something you want to avoid at the design level. Don't create data structures that force you to do it, because when these things show up in profiles, you'll find yourself micro-optimizing code inside traversals and completely missing the actual source of the problem. Preserve locality as much as you can from the beginning, and it will save you headaches later on.
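
To make the octree point concrete, here's a sketch of the allocation pattern; the structure and names are illustrative:

```cpp
// All eight children of an octree node live in a single block, so a
// traversal over siblings walks adjacent memory.
struct OctreeNode {
    OctreeNode* children = nullptr;  // null, or a block of exactly 8 nodes
    // ... bounds, contents, etc. ...
};

void Subdivide(OctreeNode& node) {
    // One allocation keeps the children adjacent, unlike eight separate
    // news scattered across the heap. (Cleanup omitted for brevity.)
    node.children = new OctreeNode[8];
}

template <typename Fn>
void Visit(OctreeNode& node, Fn fn) {
    fn(node);
    if (node.children)
        for (int i = 0; i < 8; ++i)
            Visit(node.children[i], fn);  // sequential walk over one block
}
```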

Comments

Ravuya
I remember reading some caffeine-infused rant from a developer on a mailing list who had decided to "lump together" nodes of a list (sort of like a 2,4-tree except with a list) in order to improve locality.

The problem? He had to create new nodes whenever he wanted to increase the size of one. [rolleyes]
August 15, 2007 08:06 PM