
Many-Core processors -- AI application

Started by July 19, 2008 05:10 PM
29 comments, last by wodinoneeye 16 years, 3 months ago
Quote: Original post by Raghar
Quote: Original post by wodinoneeye
From what I've seen so far of Intel's Larrabee, it may be very useful in AI applications. The chip, which allegedly will start being produced late this year, is to have 16 or 32 P5 2 GHz CPU cores on it (each with its own L1 cache).
Intel plans to use them initially as GPUs, but as each processor is capable of 'atomic' (non-ganged) operation, they should be able to run independent programs.

Larrabee is an in-order CPU, which decreases real performance by about 3x. Larrabee is a GFX card. Larrabee has a thermal design power (TDP) of 300 W.
Most people like their electricity bill too much to run a 300 W monster.

Quote: AI in games could easily be improved by orders of magnitude, to get away from the mannequin-like or limited, choreographed, scripted NPCs/monsters/opponents. Simultaneously running 32 AI scripts would allow more complex/longer/deeper behaviors. Planners and task-driven AI methods would be more doable (these often call for a lot of evaluation of parallel solutions where only one gets picked as 'best'), as would doing numerous pattern matchings to classify a changing situation.

All my AIs are compiled (in Java), so no scripting.

The majority of bugs in games are because of scripting, thus multiple parallel scripts are a recipe for disaster.


Quote: One limitation may be the memory bandwidth. Some AI solutions search through large data sets (the AI 'programming' itself is often really just script data).

Yes, the on-die controller is still too weak.

Quote: 32 data-hungry cores would quickly overwhelm memory processing (even if they go to DDR5...). Trying to retain code in the cache space of each core might be difficult (how small can you fit a byte-code engine???)

It depends on the memory interface. A GFX card can have a wide memory interface and thus easily compensate for the increased number of cores.
With 8 memory chips, the card would have a 512-bit memory interface. More chips, or wider chips, would increase the transfer rate even further, as long as the data stays on die.




Larrabee is a chip that Intel also plans to use for data crunching (not just GPU/physics). There will likely be variants in multiples of 8 cores (so they can sell the half-defective chips...), and they don't have to stop at 32. It's also allegedly built on a 45 nm fabrication process which will be superseded later, dropping the power used significantly.

When I say scripting, it could be bytecode (or equivalent) or compiled scripts; it's more the pattern of coding at a somewhat higher level. It might be a source of errors, but it's the sheer bulk of it that is needed, and its irregularity (it's effectively an 'asset' often created by semi-programmers). The testing difficulties are also problematic because of the combinatoric explosion of end cases within so much logic (and the specs being 'loose').

With this architecture, if you could get the interpreter to fit in the L1 cache (which each core has) and run the compacted bytecode code blocks through the data cache, it might be similar to or faster(?) than bulkier compiled code which has to be continually fetched over the wide memory bus.

As for a wide bus, you still have to wait on cache misses (in contention with 31 other cores), aggravated by the more random nature of AI data flow. Context switching between objects' data when each core is running hundreds of scripts makes for high data turnover, which only increases as more complex AI is used. Planner-style task processing evaluates many solutions (and their options) and then picks only one to execute. Behavior reactive to the game situation requires that to be done constantly (churning through a LOT of data AND script logic...).
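To make that 'evaluate many candidate solutions, execute one' step concrete, here is a minimal sketch in C++; the Plan fields, the scoring weights, and the sample candidates are invented placeholders, not any particular engine's planner:

#include <cstddef>
#include <vector>

struct Plan {
    float expectedGain;   // estimated payoff of carrying out the plan
    float risk;           // estimated danger to the NPC
    float cost;           // time/resources the plan consumes
};

// Arbitrary illustrative weighting of the three factors.
static float score(const Plan& p)
{
    return p.expectedGain - 0.5f * p.risk - 0.25f * p.cost;
}

// Evaluate every candidate, keep only the best one to execute.
static std::size_t pickBest(const std::vector<Plan>& candidates)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); ++i)
        if (score(candidates[i]) > score(candidates[best]))
            best = i;
    return best;
}

int main()
{
    std::vector<Plan> candidates;
    candidates.push_back(Plan{3.0f, 1.0f, 0.5f});   // e.g. "flank left"
    candidates.push_back(Plan{2.0f, 0.2f, 0.1f});   // e.g. "hold position"
    candidates.push_back(Plan{5.0f, 4.0f, 2.0f});   // e.g. "charge"
    std::size_t chosen = pickBest(candidates);      // only this one gets executed
    return (int)chosen;
}

Most of the work (and the memory traffic) goes into scoring the candidates that never get executed, which is why doing this constantly for many NPCs eats bandwidth.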

Quote: Original post by wodinoneeye
With this architecture, if you could get the interpreter to fit in the L1 cache (which each core has) and run the compacted bytecode code blocks through the data cache, it might be similar to or faster(?) than bulkier compiled code which has to be continually fetched over the wide memory bus.
Wouldn't bytecode be larger than native instructions? And wouldn't it need to be continually fetched as well?

Quote: Original post by Hodgman
Quote: Original post by wodinoneeye
With this architecture, if you could get the interpreter to fit in the L1 cache (which each core has) and run the compacted bytecode code blocks through the data cache, it might be similar to or faster(?) than bulkier compiled code which has to be continually fetched over the wide memory bus.
Wouldn't bytecode be larger than native instructions? And wouldn't it need to be continually fetched as well?



You could probably shrink a custom bytecode down considerably compared to a more general-purpose bytecode mechanism (i.e. Java): closer ties to the game engine, a simple(r) high-level scripting language, exposing only a subset of the game functionality, limited variable use, etc...

You could have (more than?) several different specialized interpreters for the different assignments (object animation controller/sequencer, NPC behavior, special effects scripter, dialog controller...). By subsetting/specializing them you are more likely to have them fit into an L1 instruction cache.
Bytecode 'script as data' is hopefully tighter in locality (maximizing usage of cache-line fetches) and allegedly smaller than the equivalent native code.
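For illustration, a minimal sketch of the kind of tiny, special-purpose interpreter being described: a handful of opcodes, a switch-dispatch loop, and the script itself is just a byte array (data). The opcode set and the native call are made-up placeholders, not anyone's actual engine:

#include <cstdio>

enum Op : unsigned char { OP_PUSH, OP_ADD, OP_JMP_IF_ZERO, OP_CALL_NATIVE, OP_HALT };

// Stand-in for an engine function reached from the script (message passing, etc.).
static void nativeCall(int arg) { std::printf("native call, arg=%d\n", arg); }

static void run(const unsigned char* code)
{
    int stack[16];
    int sp = 0;                                   // stack pointer
    for (int pc = 0; ; )                          // program counter
    {
        switch (code[pc++])
        {
        case OP_PUSH:        stack[sp++] = code[pc++];              break;
        case OP_ADD:         --sp; stack[sp - 1] += stack[sp];      break;
        case OP_JMP_IF_ZERO: { int target = code[pc++];
                               if (stack[--sp] == 0) pc = target; } break;
        case OP_CALL_NATIVE: nativeCall(stack[--sp]);               break;
        case OP_HALT:        return;
        }
    }
}

int main()
{
    // The 'script' as data: push 2, push 3, add, hand the result to native code, halt.
    const unsigned char script[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_CALL_NATIVE, OP_HALT };
    run(script);
    return 0;
}

A dispatch loop like this plus its handlers is only a few hundred bytes of native code, which is the sense in which the interpreter could stay resident in an L1 instruction cache while the scripts stream through as data.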

Data from attributes is still more random (poorer locality/higher cache-miss rates), but that's the same either way.


It's interesting, though, that this sounds a bit like going back 25 years, when you had only 4K to work with or had to cram functionality into the same memory page. We don't really have the hard limits of back then, but for certain solutions the code efficiency can depend greatly on placement considerations.

Quote: Original post by wodinoneeye
Larrabee is a chip that Intel also plans to use for data crunching (not just GPU/physics). There will likely be variants in multiples of 8 cores (so they can sell the half-defective chips...), and they don't have to stop at 32. It's also allegedly built on a 45 nm fabrication process which will be superseded later, dropping the power used significantly.


From 300 W for the 32-core version to 75 W for a version with half the cores at 32 nm? Nvidia and ATI cards would run circles around this (better in terms of GFX, similar in terms of computation). In addition, the average user dislikes cards that draw more than 75 W.

Quote: When I say scripting, it could be bytecode (or equivalent) or compiled scripts; it's more the pattern of coding at a somewhat higher level. It might be a source of errors, but it's the sheer bulk of it that is needed, and its irregularity (it's effectively an 'asset' often created by semi-programmers). The testing difficulties are also problematic because of the combinatoric explosion of end cases within so much logic (and the specs being 'loose').


Quote: With this architecture, if you could get the interpreter to fit in the L1 cache (which each core has) and run the compacted bytecode code blocks through the data cache, it might be similar to or faster(?) than bulkier compiled code which has to be continually fetched over the wide memory bus.

Imagine an FSM that calls hand-optimized asm code. Wouldn't it be faster than any general-purpose VM? In addition, it would be entirely data driven, and the commands are small by definition.

Basically, compiled code runs circles around a VM. Try GFX card programming to get some idea of the problems with in-order CPUs.
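To illustrate that suggestion, here is a minimal sketch of such a data-driven FSM: the transition table is pure data, and each state dispatches straight to a native handler where the hand-optimized code would live. The states, events, and handlers are invented examples:

#include <cstdio>

enum State { IDLE, CHASE, ATTACK, NUM_STATES };
enum Event { SEE_ENEMY, IN_RANGE, LOST_ENEMY, NUM_EVENTS };

// Transition table: next state for (current state, event). Pure data.
static const State kNext[NUM_STATES][NUM_EVENTS] = {
    /* IDLE   */ { CHASE,  IDLE,   IDLE  },
    /* CHASE  */ { CHASE,  ATTACK, IDLE  },
    /* ATTACK */ { ATTACK, ATTACK, CHASE },
};

// Native handlers -- where the hand-optimized code would go.
static void doIdle()   { std::puts("idle");   }
static void doChase()  { std::puts("chase");  }
static void doAttack() { std::puts("attack"); }

typedef void (*Handler)();
static const Handler kHandlers[NUM_STATES] = { doIdle, doChase, doAttack };

int main()
{
    State s = IDLE;
    const Event events[] = { SEE_ENEMY, IN_RANGE, LOST_ENEMY };   // sample event stream
    for (int i = 0; i < 3; ++i) {
        s = kNext[s][events[i]];
        kHandlers[s]();               // dispatch straight to native code
    }
    return 0;
}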

Quote: As for a wide bus, you still have to wait on cache misses (in contention with 31 other cores), aggravated by the more random nature of AI data flow. Context switching between objects' data when each core is running hundreds of scripts makes for high data turnover, which only increases as more complex AI is used. Planner-style task processing evaluates many solutions (and their options) and then picks only one to execute. Behavior reactive to the game situation requires that to be done constantly (churning through a LOT of data AND script logic...).


When access to memory is sufficiently quick, at least for reading, cache misses don't matter. (This is the reason an overclocked E2220 runs circles around an E8200.)

Now the catch with Larrabee might be the instruction cache. AI could run down the instruction cache quite quickly.

As long as it's able to process multiple nodes at once, it might be fine.

Quote: Original post by wodinoneeye
It's interesting, though, that this sounds a bit like going back 25 years, when you had only 4K to work with or had to cram functionality into the same memory page. We don't really have the hard limits of back then, but for certain solutions the code efficiency can depend greatly on placement considerations.

People 25 years ago had more experience with ASM and tight optimization than the average programmer today. If things like that came back, quite a lot of programmers would be screwed.
Quote: Original post by Raghar
Quote: Original post by wodinoneeye
Larrabee is a chip that Intel also plans to use for data crunching (not just GPU/physics). There will likely be variants in multiples of 8 cores (so they can sell the half-defective chips...), and they don't have to stop at 32. It's also allegedly built on a 45 nm fabrication process which will be superseded later, dropping the power used significantly.


From 300 W for the 32-core version to 75 W for a version with half the cores at 32 nm? Nvidia and ATI cards would run circles around this (better in terms of GFX, similar in terms of computation). In addition, the average user dislikes cards that draw more than 75 W.



It depends on the task. If it's graphics, then the ganged special-purpose subprocessors in the Nvidia/ATI parts are obviously faster for that (though they are trying to break the parallelism up into smaller, narrower units to make them more general purpose, they are still stuck with 8-wide sets running in lockstep, i.e. if-thens apply to all 8 data pipes).

For more irregular logic (physics and AI and raytracing) you need 'atomic' processors, which Larrabee is much closer to (supposedly each P5 core has an extra-wide MMX-type unit which lets each core do some parallel data operations).
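As a rough illustration of that kind of per-core data parallelism, here is a sketch using standard SSE intrinsics as a stand-in (Larrabee's actual vector unit is much wider and has its own instruction set); the threat-scoring formula is invented:

#include <xmmintrin.h>

// Score four AI threat candidates per iteration instead of one at a time.
void scoreThreats(const float* dist, const float* health, float* score, int n)
{
    const __m128 k = _mm_set1_ps(0.5f);             // made-up weighting constant
    int i = 0;
    for (; i + 4 <= n; i += 4)
    {
        __m128 d = _mm_loadu_ps(dist + i);          // 4 distances
        __m128 h = _mm_loadu_ps(health + i);        // 4 health values
        __m128 s = _mm_sub_ps(_mm_mul_ps(h, k), d); // score = health*0.5 - distance
        _mm_storeu_ps(score + i, s);
    }
    for (; i < n; ++i)                              // scalar tail
        score[i] = health[i] * 0.5f - dist[i];
}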



Quote:
Quote: When I say scripting, it could be bytecode (or equivalent) or compiled scripts; it's more the pattern of coding at a somewhat higher level. It might be a source of errors, but it's the sheer bulk of it that is needed, and its irregularity (it's effectively an 'asset' often created by semi-programmers). The testing difficulties are also problematic because of the combinatoric explosion of end cases within so much logic (and the specs being 'loose').


Quote: With this architecture, if you could get the interpreter to fit in the L1 cache (which each core has) and run the compacted bytecode code blocks through the data cache, it might be similar to or faster(?) than bulkier compiled code which has to be continually fetched over the wide memory bus.


Imagine an FSM that calls hand-optimized asm code. Wouldn't it be faster than any general-purpose VM? In addition, it would be entirely data driven, and the commands are small by definition.

Basically, compiled code runs circles around a VM. Try GFX card programming to get some idea of the problems with in-order CPUs.



Maybe you don't understand. The bytecode system calls native code functions elsewhere (message passing) to do a lot of the complex operations and would really run at a higher level ('scripting' style), and as I mentioned elsewhere you could have special-purpose VMs that fit in the L1 code cache (with 32 cores you can mix and match the ratio of cores dedicated to certain operations). I try to stay away from VMs because they usually run at half the speed of native code, but in this case, for Larrabee, memory access may be the critical chokepoint, and it might work out that a mini-VM could beat native code.


Quote:
Quote: As for a wide bus, you still have to wait on cache misses (in contention with 31 other cores), aggravated by the more random nature of AI data flow. Context switching between objects' data when each core is running hundreds of scripts makes for high data turnover, which only increases as more complex AI is used. Planner-style task processing evaluates many solutions (and their options) and then picks only one to execute. Behavior reactive to the game situation requires that to be done constantly (churning through a LOT of data AND script logic...).


When access to memory is sufficiently quick, at least for reading, cache misses don't matter. (This is the reason an overclocked E2220 runs circles around an E8200.)

Now the catch with Larrabee might be the instruction cache. AI could run down the instruction cache quite quickly.

As long as it's able to process multiple nodes at once, it might be fine.



Again, with 32 cores all in contention for the same main ring bus, it's quite different from a dual-CPU system (and as I said before, AI has data reads that will be more random than normal and will likely make poor use of the cache lines fetched).



Quote:
Quote: Original post by wodinoneeye
It's interesting, though, that this sounds a bit like going back 25 years, when you had only 4K to work with or had to cram functionality into the same memory page. We don't really have the hard limits of back then, but for certain solutions the code efficiency can depend greatly on placement considerations.


People 25 years ago had more experience with ASM and tight optimization than the average programmer today. If things like that came back, quite a lot of programmers would be screwed.


Compilers are a bit better than they were 25 years ago. It's probably closer to what console programmers have to do these days -- with limitations much more restrictive than the ordinary PC architecture.

The AI engine is only a small part of the project (the specific game's 'script' logic is magnitudes more work), and it (the mini-VMs) can be optimized as the 10% of the program that does 90% of the work. The advantage with Larrabee is that you can always fall back to conventional (bloated/unoptimized) programming for parts of the code that run very little.

[Edited by - wodinoneeye on August 3, 2008 3:34:09 PM]
This may sound kind of stupid, and kind of out of place, but all this talk about the L1 cache started to make me wonder. Is it even possible to specify where to put a specific block of code? Because I keep seeing talk of putting code in L1 cache as if it were a physical address space that we can actually point to and manually allocate something into. From the little asm that I know, I recall moving around registers and pointers, but not actual blocks of code. And the other question is, on a multi-processor system, do we even have fine-grained control over which specific hardware core a set of instructions goes on? So, if there are more threads than cores, which happens a lot, do we even get the ability to specify whether a thread gets exclusive use of a core?

All this talk of VMs reminds me of an ongoing argument/debate I've had with some co-workers. I say ongoing because we've had it on and off over the past 2 years. (One of them is an AI programmer who has been working on consoles for most of his career, with a lot of experience in asm, while the other is about as knowledgeable about C++ as you can get without being on the standards committee.) They are generally allergic to VMs for a variety of reasons. First, performance. Garbage collection is usually a big issue as there are inherent overheads, and it can be rather cumbersome at times. I think their general thought is, "programmers who use C++ and don't know how to properly manage memory have no place in the gaming industry." Also, the whole purpose of a VM is to enable scripting. By enabling scripting, which usually means programming by scripters or producers, you expose yourself to a whole new set of optimization issues. If you implement, say, A* in code, you can optimize a lot of its behaviors and control a lot of performance-related issues. If you implement it in script, and it's implemented by someone who doesn't care, you've just lost a chunk of performance there, no matter how optimized your VM is, which just becomes another layer of overhead. (I know A* is a bad example, but it was the first thing that came to mind.) Also, optimizing the bytecode that scripts get compiled to is another possible headache. Essentially, you're almost trying to do what the C++ compiler may already do for you, just reinventing the wheel in a slightly different form. Not to mention that if something goes wrong with your optimization and creates critical but hard-to-reproduce bugs, that becomes another headache in itself.

It really doesn't matter if it was 25 years ago, 2 years ago, or 2 months ago. Programming is still programming. There will always be good programmers and bad programmers, and those who are just godly at what they do. One thing has not changed in the past 5-10 years, though. The simple fact is, a lot of games developed for the PC bank on the increase in average hardware specs to cover the bloat that was injected during development. Then people start to forget why virtual machines were invented in the first place. It was to have code that runs cross-platform without much effort from the programmer. But if you're developing for the PC, then there's no multi-platform involved. I remember when Java first came out in the mid 90s. It wasn't a viable solution for ANYTHING that required any form of critical performance. Why is Java so popular now? Because all the processor performance we have can mask over its inherent inefficiency. Why did people start using VMs and running scripting languages for games? Partially because they could. There's no performance gain there. You don't add any more features to a game beyond the simple fact that any yahoo can now pick up the script and mess with it. I feel that you can probably do more with AI just with the processor cycles you get back from NOT using a VM. (Sorry, I'm kind of allergic to VMs too.)
Quote: Original post by WeirdoFu
This may sound kind of stupid, and kind of out of place, but all this talk about L1 cache started to make me wonder. Is it even possible to specify where to put a specific block of code?
...

When you create a thread on the Win32 platform, you can specify the 'processor affinity', or which core you want the code to run on. I assume that the other platforms and OSes have this functionality as well.
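For example, a minimal Win32 sketch: affinity is actually set after creation with SetThreadAffinityMask (CreateThread itself doesn't take an affinity argument), and error handling is omitted:

#include <windows.h>

static DWORD WINAPI AiWorker(LPVOID param)
{
    // ... run this core's AI workload here ...
    return 0;
}

int main()
{
    // Create the thread suspended so it can't run before it is pinned.
    HANDLE h = CreateThread(NULL, 0, AiWorker, NULL, CREATE_SUSPENDED, NULL);

    // Bit 2 set => this thread may run only on logical processor 2.
    SetThreadAffinityMask(h, (DWORD_PTR)1 << 2);

    ResumeThread(h);
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    return 0;
}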

Quote: Original post by WeirdoFu
All this talk of VMs reminds me of an ongoing argument/debate I've had with some co-workers. I say ongoing because we've had it on and off over the past 2 years. (One of them is an AI programmer who has been working on consoles for most of his career, with a lot of experience in asm, while the other is about as knowledgeable about C++ as you can get without being on the standards committee.) They are generally allergic to VMs for a variety of reasons. First, performance. Garbage collection is usually a big issue as there are inherent overheads, and it can be rather cumbersome at times. I think their general thought is, "programmers who use C++ and don't know how to properly manage memory have no place in the gaming industry." Also, the whole purpose of a VM is to enable scripting. By enabling scripting, which usually means programming by scripters or producers, you expose yourself to a whole new set of optimization issues. If you implement, say, A* in code, you can optimize a lot of its behaviors and control a lot of performance-related issues. If you implement it in script, and it's implemented by someone who doesn't care, you've just lost a chunk of performance there, no matter how optimized your VM is, which just becomes another layer of overhead. (I know A* is a bad example, but it was the first thing that came to mind.) Also, optimizing the bytecode that scripts get compiled to is another possible headache. Essentially, you're almost trying to do what the C++ compiler may already do for you, just reinventing the wheel in a slightly different form. Not to mention that if something goes wrong with your optimization and creates critical but hard-to-reproduce bugs, that becomes another headache in itself.

I'm not sure I understand this argument. There's a confusion of items here. VMs are NOT built to enable scripting, they're built mainly to enable cross-platform work, as far as I know.

Scripting languages are not compiled, and are going to have severe performance penalties. This has nothing to do with memory management or any other issues. They do allow for rapid prototyping, short development cycles, and are useful for some things like level design, etc. There are some initiatives like the DLR that hope to cut down on the performance limitations.

When talking about VMs, let's break it down a little and look at Managed C++ and C++. Memory management is one example: C++ has it up front, and I believe most people use reference-counted classes to take care of it. Managed C++ has it as part of the library, which takes care of it on a background thread. To use either one, you need to understand what the internals are doing. For every deallocation, you're going to take a CPU hit. For C++ it's a small, consistent hit; for Managed C++ there is no hit up front -- that happens later when the background thread runs. For either language, you should avoid allocating/deallocating in loops when possible. For either language, you're going to be going through an optimizer to compile the code. Managed C++ does this at load time; C++ does it before it ships. On the PC platform, that means Managed C++ can take advantage of the actual CPU type, while C++ has to compile to the lowest common denominator. It's possible that Managed C++ will be slower on some PCs, and there is always the possibility of a processor-specific bug being found, although I haven't heard of any. In my limited tests, Managed C++ was faster than the regular MS version of C++ (ignoring memory management). That's probably because the normal version had to compile assuming a late-model PIII, and the .NET version could take advantage of some newer features.
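As a minimal sketch of the reference-counted approach mentioned above (an intrusive count that frees the object deterministically when the last owner releases it; not thread-safe, and the class names are invented):

#include <cstdio>

class RefCounted {
public:
    RefCounted() : m_refs(1) {}           // the creator holds the first reference
    virtual ~RefCounted() {}
    void addRef()  { ++m_refs; }
    void release() { if (--m_refs == 0) delete this; }
private:
    int m_refs;
};

class PathRequest : public RefCounted {
public:
    ~PathRequest() { std::puts("PathRequest freed (deterministic, no GC pause)"); }
};

int main()
{
    PathRequest* req = new PathRequest;   // refcount = 1
    req->addRef();                        // a second owner, refcount = 2
    req->release();                       // refcount = 1
    req->release();                       // refcount = 0 -> freed right here
    return 0;
}

The point of the comparison is where the cost lands: here the small hit is paid at each release, while a collector defers it to whenever the background thread runs.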

Quote: Original post by WeirdoFu
It really doesn't matter if it was 25 years ago, 2 years ago, or 2 months ago. Programming is still programming. There will always be good programmers and bad programmers, and those who are just godly at what they do. One thing has not changed in the past 5-10 years, though. The simple fact is, a lot of games developed for the PC bank on the increase in average hardware specs to cover the bloat that was injected during development. Then people start to forget why virtual machines were invented in the first place. It was to have code that runs cross-platform without much effort from the programmer. But if you're developing for the PC, then there's no multi-platform involved. I remember when Java first came out in the mid 90s. It wasn't a viable solution for ANYTHING that required any form of critical performance. Why is Java so popular now? Because all the processor performance we have can mask over its inherent inefficiency. Why did people start using VMs and running scripting languages for games? Partially because they could. There's no performance gain there. You don't add any more features to a game beyond the simple fact that any yahoo can now pick up the script and mess with it. I feel that you can probably do more with AI just with the processor cycles you get back from NOT using a VM. (Sorry, I'm kind of allergic to VMs too.)


I'm not sure why the swing back to VMs happened. (Yes, back: languages like Lisp and VMs like the UCSD P-System have been around for many decades, as far back as the Apple II, and were very usable.) I think it's just an artifact of the fact that Sun/Java wanted to be multi-platform, and MS/.NET wanted one compiler backend instead of many, and chose to do that with a VM. The one potential advantage I see is that it may become possible to write code that recognizes you've got 16 additional non-x86 processors it can share with the graphics system, and compiles some of the code for them. I don't know if that will ever happen, but those possibilities are starting to open up.

Programming IS changing, BTW. MS is looking heavily at functional programming. Languages like C# now offer things like lambda expressions and LINQ. Both of these should help me write code faster. Tying this back to the title, MS is pushing functional languages and constructs, and these have the potential to work better with multiple CPUs. They are also very interested in asymmetric multiprocessing, because that has a large impact on OS design and on their own programs.

My 2c,
Ralph

Another thing to think about is the genius of the Cell design. It places a small local memory store next to each processor. This keeps things from ever reaching a point where the code/data thrashes a cache. It also means that you can do as much random access to your data set as you want, as long as the data set fits in the local store.

And all this talk of in-order vs. out-of-order chips has never made any sense to me. What is the difference? All the out-of-order processor is doing is scheduling instructions so there are no stalls (if possible) by putting instructions that are not waiting on data to be loaded ahead of ones that are stalled for some reason. But how is that ANY different from doing the exact same reordering before the instructions even reach the CPU? You know the instruction timings (assuming everything is already in cache, which you can if you add processor-specific prefetch instructions). Thus the instruction-reordering step of the compiler should be able to negate the need for the out-of-order part of the CPU, right? Or am I missing something entirely?

Keeping code resident in L1 cache:
There is a processor affinity option that you can set for each 'process' (I forget if you can do it per thread). But I think it would be more a case of minimizing your code block so that when it runs it's small enough not to require new chunks to be pulled from main memory. It is initially loaded from main memory, and churning a bit on infrequently run bits is OK, but if it's small (and you push certain operations somewhere else by message passing) it won't have cache misses on the code (the data will likely have that problem and keep the L1 data cache constantly changing). Now it's up to the programmer (library? compiler?) to control the threading so that the scheduler doesn't keep flipping threads back and forth between different cores and doesn't run more than will fit in each core's code cache. Since the parallelizing likely requires many 'threads', a solution is to use 'fibers', which are very lightweight mini-threads that you build your own scheduler for, keep within the same core, AND pair with a mini-VM (or a small native code fiber-running routine, FSM, etc.) to keep the code cache from thrashing.
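A minimal Win32 fiber sketch of that idea: many lightweight script 'fibers' cooperatively scheduled on one thread, so your own scheduler (not the OS) decides what runs and the working set stays small. The fiber body and the round-robin policy are just placeholders:

#include <windows.h>
#include <vector>

static LPVOID g_schedulerFiber = NULL;

static VOID CALLBACK ScriptFiber(LPVOID param)
{
    (void)param;                          // which script this fiber would run
    for (;;)
    {
        // ... interpret a slice of this script's bytecode here ...
        SwitchToFiber(g_schedulerFiber);  // yield back to the scheduler
    }
}

int main()
{
    g_schedulerFiber = ConvertThreadToFiber(NULL);   // this thread becomes the scheduler

    std::vector<LPVOID> fibers;
    for (int i = 0; i < 8; ++i)
        fibers.push_back(CreateFiber(16 * 1024, ScriptFiber, (LPVOID)(INT_PTR)i));

    for (int round = 0; round < 100; ++round)        // trivial round-robin schedule
        for (size_t i = 0; i < fibers.size(); ++i)
            SwitchToFiber(fibers[i]);

    for (size_t i = 0; i < fibers.size(); ++i)
        DeleteFiber(fibers[i]);
    return 0;
}

Pin the owning thread to one core (SetThreadAffinityMask) and the whole batch of scripts stays on that core's caches.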



Garbage Collection:
You don't have to have it in a language -- especially in special-purpose VMs (or in small native code programs). Many scripts simply need static data blocks/stacks or simple 'object' pools for semi-dynamic operations. You can get away with it because the intended processing for the 'core load' is specialized.
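A minimal sketch of that kind of fixed-size object pool: script objects are recycled from a preallocated array, so the VM never touches the heap (and never needs a collector) at run time. The Blackboard type and the pool size are invented:

#include <cstddef>

struct Blackboard { int target; float threat; };    // per-script scratch data

template <typename T, std::size_t N>
class Pool {
public:
    Pool() : m_freeCount(N) {
        for (std::size_t i = 0; i < N; ++i) m_free[i] = &m_items[i];
    }
    T* acquire() {                      // returns 0 when the pool is exhausted
        return m_freeCount ? m_free[--m_freeCount] : 0;
    }
    void release(T* p) { m_free[m_freeCount++] = p; }
private:
    T m_items[N];                       // all storage preallocated up front
    T* m_free[N];
    std::size_t m_freeCount;
};

int main()
{
    static Pool<Blackboard, 256> pool;  // e.g. one static pool per core/VM
    Blackboard* bb = pool.acquire();
    if (bb) { bb->target = 42; bb->threat = 0.7f; pool.release(bb); }
    return 0;
}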


"Scripting languages are not compiled" :
actually many are (into bytecodes or other intermediary code forms -- which are still interpreted when run (Python as example does that and even can be converted to an .exe (really more of a packaging thing than a speed up)). The main thing is that it can be 'compiled' seperately without requiring the entire program to be recompiled (which you CAN do also in C/C++ by using DLLs -- one of the methods I favor when I have the 'dont recompile the entire thing' requirement.
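A minimal Win32 sketch of that DLL route: the game logic lives in a separately compiled DLL and is looked up at run time, so it can be rebuilt without relinking the whole game. The file name 'ai_logic.dll' and the UpdateNPC signature are invented for the example:

#include <windows.h>
#include <cstdio>

typedef void (*UpdateNpcFn)(int npcId, float dt);

int main()
{
    HMODULE dll = LoadLibraryA("ai_logic.dll");
    if (!dll) { std::puts("ai_logic.dll not found"); return 1; }

    UpdateNpcFn updateNpc = (UpdateNpcFn)GetProcAddress(dll, "UpdateNPC");
    if (updateNpc)
        updateNpc(7, 0.016f);            // call into the separately built logic

    FreeLibrary(dll);                    // unload, e.g. before swapping in a rebuilt DLL
    return 0;
}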



Out-of-Order vs In-Order:
The CPU does parallelization on a very fine scale, running other chunks of machine code when various internal ALU resources are free (doing 2 or 3 things at once) or while an instruction is still waiting for its data to arrive in cache. So there CAN be a speed-up (significant in some cases). Intel probably went with in-order for Larrabee because the CPU core is much simpler (smaller) than with the more complex control logic that out-of-order requires.




The Cell design has a bunch of ALUs, each with a register store/scratch memory which is still fairly small (good enough for many operations, especially pipelining data like graphics). Larrabee's cores have 16 KB of L1 'scratch memory' (it dynamically overlaps main memory and so is a bit more flexible).
You can do far more with that much local memory for certain problems (without having to fetch more data, more slowly, from elsewhere). The Larrabee cores are more self-contained, whereas the Cell's units are really auxiliary ALUs.



Hmmm, I remember you could fit a micro-BASIC interpreter in 4 KB of memory.....




http://en.wikipedia.org/wiki/Larrabee_(GPU)
Multithreading isn't an AI problem; it is an architectural problem. The most pragmatic solution is to convert any chunks of code that take a significant amount of time into tasks.

AI code is a good candidate for this, as there are plenty of heavyweight operations in it. At the same time, updating a dynamic mesh or calculating a ray intersection with the world fits the same model.

The right solution here, in my opinion, is to get a task framework running and to leverage it heavily in the AI systems. Moving AI to other threads (ideally on another processor...) will separate AI from the biggest CPU user -- the CPU cost of rendering.
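To make that concrete, a minimal sketch of such a task framework: heavyweight AI operations get pushed onto a shared queue and picked up by a pool of worker threads. It uses C++11 threads for brevity; a real engine scheduler would add priorities, core affinities, and completion events:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TaskQueue {
public:
    void push(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m_mutex); m_tasks.push(std::move(task)); }
        m_cv.notify_one();
    }
    bool pop(std::function<void()>& task) {          // false once drained after shutdown
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_done || !m_tasks.empty(); });
        if (m_tasks.empty()) return false;
        task = std::move(m_tasks.front()); m_tasks.pop();
        return true;
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lock(m_mutex); m_done = true; }
        m_cv.notify_all();
    }
private:
    std::queue<std::function<void()> > m_tasks;
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_done = false;
};

int main()
{
    TaskQueue queue;
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                               // fall back if the count is unknown

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&queue] {
            std::function<void()> task;
            while (queue.pop(task)) task();          // run AI tasks until shutdown
        });

    for (int npc = 0; npc < 100; ++npc)              // e.g. one plan/pathfind per NPC
        queue.push([npc] { (void)npc; /* plan or pathfind for this NPC */ });

    queue.shutdown();                                // workers drain the queue, then exit
    for (size_t i = 0; i < workers.size(); ++i) workers[i].join();
    return 0;
}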

That said, I respectfully disagree that the main barrier in games is a lack of CPU/memory power. In my personal experience, the biggest challenges have been working with designers to get what they really want and dealing with relatively static presentation (i.e. lack of high-quality animation/audio generation). The AI logic side of the system can be designed to scale relatively easily compared to asset generation.

