
Many-Core processors -- AI application

Started by July 19, 2008 05:10 PM
29 comments, last by wodinoneeye 16 years, 3 months ago
Quote: Original post by mpipe
Quote: Original post by wodinoneeye

From what I've seen so far of Intel's Larrabee, it may be very useful in AI applications. The chip, which allegedly will start being produced late this year, is to have 16 or 32 P5-class 2 GHz CPU cores on it (each with its own L1 cache).
Intel plans to initially use them as GPUs, but as each processor is capable of 'atomic' (non-ganged) operations, they should be able to run independent programs.

AI in games could easily be improved by orders of magnitude, getting away from the mannequin-like or narrowly choreographed, scripted NPCs/monsters/opponents. Simultaneously running 32 AI scripts would allow more complex/longer/deeper behaviors. Planners and task-driven AI methods would be more doable (these often call for a lot of evaluation of parallel solutions, of which only one gets picked as 'best'), as would running numerous pattern matches to classify a changing situation.

One limitation may be memory bandwidth. Some AI solutions search through large data sets (the AI 'programming' itself is often really just script data).
32 data-hungry cores would quickly overwhelm the memory subsystem (even if they go to DDR5...). Trying to retain code in the cache space of each core might be difficult (how small can you fit a byte-code engine?).
Intel is planning to use Larrabee for graphics (to compete with Nvidia/ATI), so they must have some very high bandwidths, but AI requires much more random access than the typical GPU workload.

Physics processing could no doubt also make use of many-core architectures, and with the added versatility there might be more likelihood that add-on boards become easily available ('cheap') enough to be common in consumer-grade gaming computers.


you've never programmed anything multithreaded in your life, have you?



I have. Multithreading for a game engine: a main thread that does most of the simulation sequentially with 'fibers' to avoid all the lock/switching overhead, plus a separate file service thread and a separate network service thread (I did my own reliable UDP protocol and file transfer).
The full design (for the AI load) requires a cluster of machines with functionality divided across them: zone servers, AI servers, a central controller (where the primary file archive also lives), and clients (very few, since this was to be very heavy AI and not some script-type MMORPG, but they have additional input/sound/rendering threads of their own).
That of course was on single-CPU systems. I've only started investigating the use of quad-CPU systems (I have similar doubts about keeping 4 CPUs fed with data for the type of AI crunching done, such as task planners). I have started redesigning the cluster's network links/protocol (how the quad CPUs need to be shunted locally) and possibly some shared data (zone chunks were replicated across cluster nodes). The AI servers don't have any disk, and the inflow of world updates is entirely network driven. The main process 'threading' will still be via fibers, and as I've mentioned it will churn data far in excess of what even the large caches can hold (8 MB on a Q6600, 12 MB on the newer quads).

So I'm not talking about some simple classroom multi-threading assignment here, but a mass of computing power running heavyweight, real AI in (near) real time. The issue of optimization is significant because you have to add more hardware ($$$) if you misuse the architecture or have a threading system that wastes most of its resources in inefficient, unneeded locking and context switching.
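(For anyone who hasn't seen the fiber approach: below is a deliberately tiny single-threaded cooperative scheduler sketch using the Win32 fiber API, in the spirit of what I described above. The task struct and the placeholder work loop are invented for illustration, not the actual engine code.)

```cpp
// Minimal cooperative scheduler sketch using Win32 fibers: one OS thread,
// many lightweight tasks, no locks and no preemptive context switches.
#include <windows.h>
#include <cstdio>
#include <vector>

static LPVOID g_schedulerFiber;   // main fiber; tasks switch back here to yield

struct AITask {
    LPVOID fiber = nullptr;
    bool   done  = false;
    int    id    = 0;
};

// Called from inside a task to hand control back to the scheduler.
static void YieldToScheduler() { SwitchToFiber(g_schedulerFiber); }

static VOID CALLBACK AITaskProc(LPVOID param) {
    AITask* self = static_cast<AITask*>(param);
    for (int step = 0; step < 3; ++step) {
        std::printf("task %d, step %d\n", self->id, step);  // placeholder AI work
        YieldToScheduler();   // cooperative switch: no lock needed
    }
    self->done = true;
    YieldToScheduler();       // never return from a fiber procedure
}

int main() {
    g_schedulerFiber = ConvertThreadToFiber(nullptr);

    std::vector<AITask> tasks(4);                 // fixed size: addresses stay stable
    for (int i = 0; i < static_cast<int>(tasks.size()); ++i) {
        tasks[i].id = i;
        tasks[i].fiber = CreateFiber(0, AITaskProc, &tasks[i]);
    }

    // Round-robin over the tasks until every one has finished.
    bool anyLeft = true;
    while (anyLeft) {
        anyLeft = false;
        for (AITask& t : tasks) {
            if (!t.done) {
                SwitchToFiber(t.fiber);
                if (!t.done) anyLeft = true;
            }
        }
    }
    for (AITask& t : tasks) DeleteFiber(t.fiber);
    return 0;
}
```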
--------------------------------------------
Ratings are Opinion, not Fact
Having 16 to 32 cores is nice and all, but in the end the problem still comes down to the "lowest common denominator." It becomes pointless to design a game, or the AI for a game, that only 1% of the hardcore gaming population can enjoy, because most people don't have and/or can't afford the hardware. This brings up the biggest problem most people have to tackle when multi-threading: scalability. The ability to scale up and down while maintaining optimal performance and minimal loss of features is usually the biggest headache.
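(One way to handle the scaling question in code, as a sketch rather than anything from a shipped engine, is simply to ask the machine how many hardware threads it has and size the worker pool from that, so the same build runs on a dual core and a 32-core part.)

```cpp
// Sketch: size the AI worker pool from the detected core count instead of
// hard-coding it, so the same executable scales from 2 cores to 32.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // hardware_concurrency() may return 0 if the count is unknown.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;

    // Made-up policy for illustration: reserve two cores for rendering/audio
    // on bigger machines, otherwise run a single AI worker.
    unsigned aiWorkers = (cores > 2) ? cores - 2 : 1;
    std::printf("detected %u hardware threads, spawning %u AI workers\n",
                cores, aiWorkers);

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < aiWorkers; ++i)
        pool.emplace_back([i] { std::printf("AI worker %u running\n", i); });
    for (std::thread& t : pool) t.join();
    return 0;
}
```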

Having more cores isn't always a good thing, since it is really implementation dependent. Intel, which is currently just trying to churn out processors with as many cores as possible as fast as possible, doesn't really care how it gets there; it's an example of marketing putting too much pressure on engineering. For example, they can claim the first to market with 4 cores simply because they took 2 dual core processors and slapped them onto a die. This is also the same way they did dual core. How can you tell? Well, for dual core, the 2 cores shared L1 and L2 cache, and if you read the specs for quad core, you'll realize that it only lists 2 sets of L1 and L2 cache. Which simply implies, and has been verified by many sources, that they slapped two dual cores together. So, how are they going to get 8 in a short time with guaranteed stability and little R&D? Slap 2 quad cores on a die, or maybe 4 dual cores (I'm guessing it'll be 2 quad cores, though, with maybe an added L3 cache for each), and they'll keep scaling like that.

So this could become a nightmare of cache misses if you accidentally group unrelated stuff onto 2 adjacent cores that share cache. Trying not to sound like an AMD fanboy, but AMD's multicore design makes better sense from a programmer's perspective, as each core has its own L1 and L2 cache, and the memory controller is on the processor, so memory access doesn't have to go through the northbridge as it does with Intel.
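(To make the "who shares cache with whom" concern concrete, here is a small Linux-only sketch, not from anyone's actual engine, that pins two threads working on the same data onto two logical CPUs that are assumed, on this hypothetical part, to be the pair sharing an L2.)

```cpp
// Illustration only: keep two threads that touch the same data on two logical
// CPUs assumed (for this hypothetical topology) to share an L2 cache.
// Linux/glibc specific; build with: g++ -pthread affinity.cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void PinToCpu(pthread_t thread, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (pthread_setaffinity_np(thread, sizeof(set), &set) != 0)
        std::perror("pthread_setaffinity_np");
}

static void* Producer(void*) { /* fill a shared buffer here */ return nullptr; }
static void* Consumer(void*) { /* read the shared buffer here */ return nullptr; }

int main() {
    pthread_t a, b;
    pthread_create(&a, nullptr, Producer, nullptr);
    pthread_create(&b, nullptr, Consumer, nullptr);
    PinToCpu(a, 0);   // CPUs 0 and 1 assumed to be the cache-sharing pair
    PinToCpu(b, 1);
    pthread_join(a, nullptr);
    pthread_join(b, nullptr);
    return 0;
}
```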

(Sorry, I've been sitting on these thoughts for the past few days now, so this will go on a little bit longer)

If Intel truly wants to do graphics on the CPU, I don't know how many cores will be left over for AI or anything else. I'm not sure Intel can even compete with nVidia or ATI on the graphics side with only 32 cores, unless each core can magically process more than 1 pixel/vertex per clock cycle. Well, they probably could if they ran at a higher clock speed, considering that the two newest nVidia cards run at 576 and 602 MHz respectively. But based on the number of stream processors they have (192 and 240), Intel will have to run all 32 cores at at least 3 GHz to come close to matching the lower-end one. Then there's ATI's new card with 800 stream processors running at 625 MHz; with 32 cores, each core may have to run at around 15 GHz to match. Of course, these numbers are theoretical and based on the assumption that a stream processor and a CPU core can process the same amount of data in a given cycle.
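(Spelling out the arithmetic behind those figures, assuming, as above, one operation per stream processor per clock: 192 x 0.576 GHz is roughly 111 billion shader ops/s and 240 x 0.602 GHz is roughly 144 billion, so 32 general cores would need about 3.5 to 4.5 GHz each just to keep pace; ATI's 800 x 0.625 GHz = 500 billion ops/s works out to about 15.6 GHz per core.)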

Ok, I've gotten that part out of my system.

As an AI programmer that works on console games, I've come to see the sad reality that unless your game is heavily AI driven, you'll be lucky if they give you 10% of the available processing power. It doesn't really matter how much processing power is there; there will always be something more visible in the user experience that needs it. Also, if they really are shifting towards using a monolithic many-core CPU to do everything from physics to graphics to whatever else, then we shouldn't be too optimistic about the amount of power we will have access to (think winmodem). I can already see multiple cores doing multiple physics sims, more cores doing real-time 3D sound mixing, another core or two doing gameplay, some more doing rendering pre-processing, and then AI gets whatever is left over.
Quote: Original post by WeirdoFu
For example, they can claim the first to market with 4 cores simply because they took 2 dual core processors and slapped them onto a die. [...] Trying not to sound like an AMD fanboy, but AMD's multicore design makes better sense from a programmer's perspective, as each core has its own L1 and L2 cache, and the memory controller is on the processor, so memory access doesn't have to go through the northbridge as it does with Intel.

What does it matter if a Penryn is simply two dual-cores slapped together? They are still significantly faster in all benchmarks I've seen than AMD's "true" quad-core solution. Same for energy efficiency. Furthermore, the Nehalem (which has even better performance and energy efficiency) is a new architecture, specifically designed to scale to many cores; it's rather unlikely that they'll "slap together" a bunch of quads in their many-core processors.
-------------Please rate this post if it was useful.
One thing to consider is that from next year Intel will no longer be selling single-core processors. The market is moving, and we as programmers need to move with it. At a minimum we need to understand the possibilities the technology offers, even if we elect to use only a fraction of it in our applications.

I spent an interesting afternoon a few weeks back meeting with Intel to discuss incorporating more multithreaded and parallel education into our degree programs. The simple fact is that we need to be developing these skills now if we're to meet the needs of the market in the years to come.

What will that mean for game AI? I agree with WeirdoFu that we still haven't perceptibly changed the industry's allocation of resources to AI; we're still going to be lining up for the crumbs in the clock-cycle line. What we WILL be able to do, though, is contemplate the efficiency improvements that parallel computation can provide and develop some standard parallel approaches for readily parallelised tasks. This will give us some short-term gains (indeed, we have been seeing some of these in recent years, but it certainly isn't industry-standard practice yet).
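(As one concrete example of a readily parallelised AI task, consider something like an influence-map refresh where each worker owns whole rows. The sketch below is illustrative only; the map layout and scoring function are made up.)

```cpp
// Sketch: refresh an influence map with its rows split across worker threads.
// Each worker writes only its own rows, so no locking is needed.
#include <algorithm>
#include <thread>
#include <vector>

struct InfluenceMap {
    int width = 256, height = 256;
    std::vector<float> cells = std::vector<float>(256 * 256, 0.0f);
};

// Placeholder scoring; a real one would weight distances to units, cover, etc.
static float ScoreCell(int x, int y) {
    return static_cast<float>(x + y);
}

void UpdateMapParallel(InfluenceMap& map, unsigned workers) {
    if (workers == 0) workers = 1;
    const int rowsPerWorker =
        (map.height + static_cast<int>(workers) - 1) / static_cast<int>(workers);
    std::vector<std::thread> pool;

    for (unsigned wi = 0; wi < workers; ++wi) {
        const int yBegin = static_cast<int>(wi) * rowsPerWorker;
        const int yEnd   = std::min(map.height, yBegin + rowsPerWorker);
        pool.emplace_back([&map, yBegin, yEnd] {
            for (int y = yBegin; y < yEnd; ++y)        // whole rows per worker,
                for (int x = 0; x < map.width; ++x)    // so no two threads share a cell
                    map.cells[y * map.width + x] = ScoreCell(x, y);
        });
    }
    for (std::thread& t : pool) t.join();
}
```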

I guess my conclusion is a wait and see... let's start turning out more developers that can at least think clearly about the problem and wait to see what they come up with. On a side note: Intel's research figures suggest only 15% of programmers understand and can apply multithreaded programming techniques! I can understand from that perspective why so many people are averse to incorporating these techniques into their applications!

Cheers,

Timkin



Just today I was at my local computer superstore and they had a half-decent quad system (Q6600) for $650 (display not included). Multi-core is going mainstream (and, from what was said here, there will soon be no more single-core consumer CPUs out of Intel).

I'm not sure how much further graphics has to go before any improvements aren't noticeable any more (3000x2000 screens!!!??), but games can definitely improve much further physics-wise, and certainly in AI. Even consumer uses like MPEG conversion of camcorder footage might become a big enough use to make these many-cores mainstream sooner. (Maybe voice recognition will become mainstream somehow...) Games will certainly drive part of it (as they have with graphics), and the lower-to-midrange many-core parts will eventually be sold in common, non-gamer consumer systems.

For certain, if the big companies don't build them, not much moves forward.

As for retraining/reindoctrinating the programmers, you can do quite a lot of parallelization in the appropriate parts of programs without that much effort, if you just understand which parts are easiest and which are counterproductive.
In time the compilers will do a bit more (only so much, though) while we wait for a new batch of college graduates who were bottle-fed on threading methods.
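(As a concrete "easy win" example: an independent per-agent update loop can be parallelised with a single OpenMP pragma. The Agent type and UpdateAgent body below are placeholders.)

```cpp
// One concrete "easy win": agents that update independently can be spread
// across cores with a single OpenMP pragma (build with -fopenmp or /openmp).
#include <vector>

struct Agent { float x = 0.0f, y = 0.0f; };

static void UpdateAgent(Agent& a) {
    a.x += 1.0f;   // stand-in for real per-agent AI/steering work
    a.y += 1.0f;
}

void UpdateAllAgents(std::vector<Agent>& agents) {
    // Each iteration touches only its own agent, so the loop splits cleanly;
    // parts that mutate shared state would stay serial (the counterproductive cases).
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(agents.size()); ++i)
        UpdateAgent(agents[i]);
}
```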

--------------------------------------------
Ratings are Opinion, not Fact
Quote: Original post by wodinoneeye
I'm not sure how much further graphics has to go before any improvements aren't noticeable any more (3000x2000 screens!!!??)
People can tell the difference between a 2400 dpi and a 1200 dpi print-out, so our ~70 dpi LCDs (and the pixel-pushing hardware behind them) have a long way to go yet.
However, in the larger picture, we are at a level of diminishing returns with regard to graphics technology. Physics, on the other hand, is going to keep chewing things up for a while as we add more objects to the world. I don't think we are going to be as squished on processor time for AI as we used to be, however.

Dave Mark - President and Lead Designer of Intrinsic Algorithm LLC
Professional consultant on game AI, mathematical modeling, simulation modeling
Co-founder and 10 year advisor of the GDC AI Summit
Author of the book, Behavioral Mathematics for Game AI
Blogs I write:
IA News - What's happening at IA | IA on AI - AI news and notes | Post-Play'em - Observations on AI of games I play

"Reducing the world to mathematical equations!"

Quote: Original post by wodinoneeye
From what I've seen so far of Intel's Larrabee, it may be very useful in AI applications. The chip, which allegedly will start being produced late this year, is to have 16 or 32 P5-class 2 GHz CPU cores on it (each with its own L1 cache).
Intel plans to initially use them as GPUs, but as each processor is capable of 'atomic' (non-ganged) operations, they should be able to run independent programs.

Larrabee is an in-order CPU, which cuts real-world throughput by roughly 3x per core. Larrabee is a graphics card. Larrabee has a thermal design power (TDP) of 300 W.
Most people like their electricity bill too much to run a 300 W monster.

Quote: AI in games could easily be improved by orders of magnitude, getting away from the mannequin-like or narrowly choreographed, scripted NPCs/monsters/opponents. Simultaneously running 32 AI scripts would allow more complex/longer/deeper behaviors. Planners and task-driven AI methods would be more doable (these often call for a lot of evaluation of parallel solutions, of which only one gets picked as 'best'), as would running numerous pattern matches to classify a changing situation.

All my AIs are compiled (in Java), so no scripting.

The majority of bugs in games come from scripting, so multiple parallel scripts are a recipe for disaster.


Quote: One limitation may be memory bandwidth. Some AI solutions search through large data sets (the AI 'programming' itself is often really just script data).

Yes, the on-die memory controller is still too weak.

Quote: 32 data-hungry cores would quickly overwhelm the memory subsystem (even if they go to DDR5...). Trying to retain code in the cache space of each core might be difficult (how small can you fit a byte-code engine?).

It depends on the memory interface. A graphics card can have a wide memory interface and thus compensate fairly easily for the increased number of cores.
With 8 memory chips the card would have a 512-bit memory interface; more chips, or wider chips, would raise the transfer rate even further, as long as the data stayed on the card.
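(For a rough sense of scale, and purely as an illustration rather than a spec: at an effective 3.6 Gbit/s per pin, a 512-bit interface moves 512 x 3.6 / 8, or about 230 GB/s, versus the roughly 10-20 GB/s of a contemporary dual-channel desktop memory bus, which is why keeping the data local to the card matters so much.)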

[Edited by - Raghar on July 30, 2008 9:03:33 AM]
Quote: Original post by WeirdoFu
For example, they can claim the first to market with 4 cores simply because they took 2 dual core processors and slapped them onto a die.
Yes, for the Q6600 they used two Core 2 Duo dies and connected them with a fast crossbar.
Quote: This is also the same way they did dual core. How can you tell? Well, for dual core, the 2 cores shared L1 and L2 cache,
Current dual cores have separate L1 caches and a shared L2 cache (to reduce cost). Both cores are on one die. Nehalem will have all four cores and the memory controller on a single die.
Quote: and if you read the specs for quad core, you'll realize that it only lists 2 sets of L1 and L2 cache. Which simply implies, and has been verified by many sources, that they slapped two dual cores together.
Actually, people who opened up the CPU saw two separate dies and the very sensitive hardware connecting them, so there is no need for such indirect proofs.
The only disadvantage is sensitivity to high FSB voltages (the quads are not great overclockers).
Quote: So, how are they going to get 8 in a short time with guaranteed stability and little R&D?
Are you talking about Larrabee, or about more general CPU architectures? Larrabee was designed as an architecture that would scale easily. (It's an in-order CPU, so there are no big problems.)

Because of TDP, Intel would prefer to push higher core counts on 32 nm technology, to avoid the kinds of problems AMD has had.
Quote: Intel will have to run all 32 cores at at least 3 GHz to come close to matching the lower-end one. Then there's ATI's new card with 800 stream processors running at 625 MHz; with 32 cores, each core may have to run at around 15 GHz to match. Of course, these numbers are theoretical and based on the assumption that a stream processor and a CPU core can process the same amount of data in a given cycle.

It depends on how many pixels are processed at once per core, doesn't it? In addition, Intel will try to push developers toward ray tracing, which should cause problems for both ATI and Nvidia.
Quote: As an AI programmer that works on console games

I.e., no memory and no CPU power?
Quote: Original post by Raghar
Larrabee is an in-order CPU, which cuts real-world throughput by roughly 3x per core. Larrabee is a graphics card. Larrabee has a thermal design power (TDP) of 300 W. Most people like their electricity bill too much to run a 300 W monster.
[...]
The majority of bugs in games come from scripting, so multiple parallel scripts are a recipe for disaster.
[...]
It depends on the memory interface. A graphics card can have a wide memory interface and thus compensate fairly easily for the increased number of cores. With 8 memory chips the card would have a 512-bit memory interface; more chips, or wider chips, would raise the transfer rate even further, as long as the data stayed on the card.

Larrabee is a chip that Intel also plans to use for data crunching (not just GPU/physics). Likely there will be variants in multiples of 8 cores (so they can sell the half-defective chips...), and they don't have to stop at 32. It's also allegedly built on a 45 nm fabrication process, which will be superseded later, dropping the power draw significantly.

When I say scripting, it could be bytecode (or equivalent) or compiled scripts; it's more the pattern of coding at a somewhat higher level. It might be the source of errors, but it's the sheer bulk of it that is needed, and its irregularity (it's effectively an 'asset', often created by semi-programmers). The testing difficulties are also problematic because of the combinatoric explosion of edge cases within so much logic (and because the specs are 'loose').

With this architecture, if you could get the interpreter to fit in the L1 cache (which each core has) and stream the compacted bytecode blocks through the data cache, it might be comparable to, or faster than, bulkier compiled code that has to be continually fetched over the memory bus.
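(To give a feel for how small such an engine can be, here is a toy dispatch loop with made-up opcodes; the core is only a handful of instructions per opcode, so an interpreter like this could plausibly sit in a per-core L1 instruction cache while the byte-code streams through the data cache.)

```cpp
// Deliberately tiny byte-code dispatch loop; the opcode set is invented and
// exists only to show how small an interpreter core can be.
#include <cstdint>
#include <cstdio>
#include <vector>

enum Op : std::uint8_t { OP_PUSH, OP_ADD, OP_JNZ, OP_HALT };

static std::int32_t Run(const std::vector<std::uint8_t>& code) {
    std::int32_t stack[64];
    int sp = 0;
    std::size_t pc = 0;
    for (;;) {
        switch (code[pc++]) {
            case OP_PUSH: stack[sp++] = static_cast<std::int8_t>(code[pc++]); break;
            case OP_ADD:  { std::int32_t b = stack[--sp]; stack[sp - 1] += b; } break;
            case OP_JNZ:  { std::uint8_t target = code[pc++];      // jump if top != 0
                            if (stack[--sp] != 0) pc = target; } break;
            case OP_HALT: return stack[sp - 1];
        }
    }
}

int main() {
    // Compute 2 + 3, then halt.
    std::vector<std::uint8_t> program = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
    std::printf("%d\n", Run(program));   // prints 5
    return 0;
}
```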

As for a wide bus, you still have to wait on cache misses (in contention with 31 other cores), aggravated by the more random access patterns of AI data. Context switching between objects' data when each core is running hundreds of scripts makes for a high data turnover, which only increases as more complex AI is used. Planner-style task processing evaluates many solutions (and their options) and then picks only one to execute, and behaviors reactive to the game situation require that to be done constantly (churning through a lot of data AND script logic...).
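(A rough sketch of that "evaluate many candidate plans, keep only the best" pattern, fanned out across cores with std::async; the Plan encoding and scoring function are placeholders, not my actual planner.)

```cpp
// Sketch of parallel plan evaluation: score every candidate independently on
// its own task, then keep the single best one.
#include <cstddef>
#include <future>
#include <vector>

struct WorldState { int threatLevel = 0; };        // stand-in for game state
struct Plan       { std::vector<int> actions; };   // hypothetical plan encoding

// Placeholder heuristic: a real scorer would simulate the plan against the world.
static double ScorePlan(const Plan& plan, const WorldState& world) {
    return static_cast<double>(plan.actions.size()) - world.threatLevel * 0.1;
}

Plan PickBestPlan(const std::vector<Plan>& candidates, const WorldState& world) {
    if (candidates.empty()) return Plan{};

    // Fan the (read-only) scoring work out across cores.
    std::vector<std::future<double>> scores;
    scores.reserve(candidates.size());
    for (const Plan& p : candidates)
        scores.push_back(std::async(std::launch::async, ScorePlan,
                                    std::cref(p), std::cref(world)));

    // Reduce: keep only the best-scoring candidate, discard the rest.
    std::size_t best = 0;
    double bestScore = scores[0].get();
    for (std::size_t i = 1; i < scores.size(); ++i) {
        double s = scores[i].get();
        if (s > bestScore) { bestScore = s; best = i; }
    }
    return candidates[best];
}
```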

--------------------------------------------
Ratings are Opinion, not Fact

