Juliean said:
which completely fucks speculative execution and prefetching.
Which raises the question of what's more fucked: current CPUs relying on speculative execution (in other words, accelerating only trivial work while wasting power on executing stuff that isn't used), or computer science aiming to reduce work to what's actually needed? That's really why i like GPUs better. Having more cores makes more sense than trying to increase single-core perf. with speculative execution and prefetching.
Juliean said:
we might be talking about a constant factor of up to “times 300
It's not that bad. You miss a cache line, but not the entire cache hierarchy. Just try to keep your random access addresses close together and minimize big jumps.
And with hyper-threading the CPU can switch to the other thread while waiting on a memory access. (At least i assume CPUs do this as well, not just GPUs.)
Finally, if you manage to have sequential access for a big workload, at some point caches become pointless and main memory bandwidth becomes the limit, which is really tight on CPU.
One and a half cores are enough to saturate bandwidth on my machine, so with such a workload we negate the multi-core advantage as well. That's far from efficient.
The conclusion is: minimize memory access, increase ALU work. And both of those goals lead away from brute-force processing.
Juliean said:
But you are essentially using a list for the node-elements, because you want to avoid allocations and copying objects inside vectors, right?
Yes, at least for my example about grid cells pointing to lists of objects. The primary goals are avoiding the O(n^2) cost of collision detection and avoiding dynamic allocations.
I'm not sure if dynamic allocations are still that costly today, but >10 years ago they were very bad, causing unpredictable runtime spikes. Plus, on GPU we do not even have this option at all. So i have learned to design around dynamic allocation, and i stick to that.
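To make that concrete, here's a minimal sketch of what i mean by grid cells pointing to lists of objects without dynamic allocation, not code from any particular engine: each cell stores the index of its first object, and each object stores the index of the next object in the same cell, so the 'lists' live entirely inside preallocated arrays and nothing gets allocated per frame. All names and constants (GRID_RES, CELL_SIZE, the 2D layout) are made up for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int     GRID_RES  = 64;    // cells per axis (made-up value)
constexpr float   CELL_SIZE = 1.0f;  // world units per cell (made-up value)
constexpr int32_t NO_OBJECT = -1;

struct Object {
    float   x = 0, y = 0;            // position
    int32_t nextInCell = NO_OBJECT;  // intrusive link: index of the next object in the same cell
};

struct Grid {
    int32_t firstObject[GRID_RES * GRID_RES]; // head of each cell's object list

    void clear() {
        for (int32_t& head : firstObject) head = NO_OBJECT;
    }

    static int cellIndex(float x, float y) {
        int cx = std::clamp(int(x / CELL_SIZE), 0, GRID_RES - 1);
        int cy = std::clamp(int(y / CELL_SIZE), 0, GRID_RES - 1);
        return cy * GRID_RES + cx;
    }

    // Rebuild each frame: push every object to the front of its cell's list.
    // No allocations happen here; the links live inside the objects themselves.
    void build(std::vector<Object>& objects) {
        clear();
        for (int32_t i = 0; i < (int32_t)objects.size(); ++i) {
            int cell = cellIndex(objects[i].x, objects[i].y);
            objects[i].nextInCell = firstObject[cell];
            firstObject[cell] = i;
        }
    }
};
```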
Remember that back then using the STL in games was just a no-go in general, for reasons like this.
If it works well enough today, then that can only have two reasons: the OS being faster at allocation, or devs being sloppy with optimization, giving us $2000 GPUs and making Jensen very happy.
And to justify the sloppiness, you bring up smart reasoning, like talking about spec. exec. and prefetching. But i don't buy it. That's just the usual game dev slang for ‘i don't have time to properly optimize every little bit.’ hihihi >:)
Juliean said:
Of course (I might sound like a broken records), this needs to be benchmarked
Yeah, but to do that, we need to implement both (actually many more than just two) methods so we can compare them.
But we don't have time for that. So we trust our gut feeling instead, which is a product of prior, limited experience.
Neither of us can be sure we picked the best option. If it's fast enough, we're happy and move on. Maybe, sometime later it turns out we need to optimize harder, but ideally not.
It's quite hard to tell how much confirmation bias is involved, and hard facts like knowledge about HW or benchmarks do not help much against that.
Juliean said:
I did include this notion from the start
Yeah, you did. I skimmed some earlier posts to check that. So i have to apologize.
Premature optimization is a problem too, so my accusation of ‘cultivating brute force’ wasn't justified, i see.
Juliean said:
I do wholeheartedly disagree with this as well. Simplicty should always be first, unless time-complexity is really needed, which should be a) absolutely be know, and b) measured.
I have to agree with this as well, ofc. But as said, to measure and to know, we need multiple implementations.
Before doing this work, it makes sense to ask others first. Often, the impression you get is convincing, so you can spare yourself the double work.
Juliean said:
Writing O(N^2) collision-detection should cost you the fraction of the time as creating a tree
Yep. My usual proposal is to start with a regular grid. In terms of implementation complexity it sits maybe 10% of the way between brute-force O(n^2) and using a tree, so it's still simple. And if you later need a quadtree because the world is large, the earlier work on the grid isn't lost; it makes the extension much easier.
I use such a grid approach quite often, not for collision detection, but to find things spatially for various reasons.
Before i do this, i work on the other things first to see that the entire algorithm works, and i just iterate over all objects for that purpose. After that, if the iteration turns out to be a bottleneck (very often the case), i make some trivial regular grid.
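The query side of such a trivial grid looks roughly like this, continuing the made-up Grid/Object layout from the sketch further up: instead of testing every object against every other one, each object only visits its own cell plus the 8 neighbouring cells. A real broad phase would still run narrow-phase tests on the reported pairs and deduplicate them; this is just the shape of it.

```cpp
// Report candidate pairs for one object by walking the intrusive lists of the
// 3x3 cell neighbourhood around it.
template <class Callback>
void queryNeighbours(const Grid& grid, const std::vector<Object>& objects,
                     int32_t objIndex, Callback&& onCandidate) {
    const Object& o = objects[objIndex];
    int cx = std::clamp(int(o.x / CELL_SIZE), 0, GRID_RES - 1);
    int cy = std::clamp(int(o.y / CELL_SIZE), 0, GRID_RES - 1);

    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= GRID_RES || ny >= GRID_RES) continue;
            for (int32_t i = grid.firstObject[ny * GRID_RES + nx];
                 i != NO_OBJECT; i = objects[i].nextInCell) {
                if (i != objIndex) onCandidate(objIndex, i);
            }
        }
    }
}
```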
A need to use an actual tree arises only very rarely. If so, it's often a core concept of the entire work, so i make a specific system tailored to the problem, which is usually a long-term task.
I also have some generic BVH class, which i could use for anything. But actually i think i've used it only twice so far. It's not really great either, just good enough for low requirements.
I say that just to show i'm not a tree fetishist or something like that. : )
Juliean said:
Especially occlussion-culling always appears a bit “shizoid” to me - you render the geometry so you know what geometry not to render. K?
Hehe, yeah. Occlusion culling is a treasure chest if you look for bad algorithms or pointless HW features. : )
But it's a really hard problem. It was the holy grail in times of software rendering, and people were very creative about finding solutions. I worked on this too quite a bit.
But so far, no general solution has been found. And raw GPU power allowed us to postpone the problem. It even forced us to do so, because the basic solution (render front to back, and store information about which regions of the screen are already occluded) can't be implemented on early fixed-function GPUs.
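Just to illustrate that 'store information about occluded regions of screen' idea, here's a deliberately tiny sketch, not any real engine's code; the tile counts, the depth convention (0 = near, 1 = far) and all names are assumptions. The screen is reduced to a coarse tile grid, each tile remembers the depth up to which it is already fully covered, and objects processed front to back get skipped when every tile they touch is covered in front of them.

```cpp
#include <algorithm>

constexpr int TILES_X = 40, TILES_Y = 22;  // coarse occlusion buffer resolution (assumption)

struct ScreenRect {                        // projected screen-space bounds of an object
    int   x0, y0, x1, y1;                  // tile coordinates, inclusive
    float minZ, maxZ;                      // nearest / farthest depth of the object
};

struct OcclusionBuffer {
    float tileDepth[TILES_X * TILES_Y];

    void clear() {
        for (float& d : tileDepth) d = 1.0f;  // nothing covered yet (1 = far plane)
    }

    // Occluded only if the object's nearest depth is behind the covered depth
    // of every tile it touches. Conservative: any still-open tile keeps it visible.
    bool isOccluded(const ScreenRect& r) const {
        for (int y = r.y0; y <= r.y1; ++y)
            for (int x = r.x0; x <= r.x1; ++x)
                if (r.minZ < tileDepth[y * TILES_X + x]) return false;
        return true;
    }

    // After drawing an occluder, close its tiles down to its farthest depth.
    // Simplified: assumes the occluder really covers those tiles completely.
    void addOccluder(const ScreenRect& r) {
        for (int y = r.y0; y <= r.y1; ++y)
            for (int x = r.x0; x <= r.x1; ++x)
                tileDepth[y * TILES_X + x] = std::min(tileDepth[y * TILES_X + x], r.maxZ);
    }
};
```

Usage would be: sort roughly front to back, call isOccluded() per object, render it if visible, and call addOccluder() for big solid occluders.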
I often say: visibility and LOD are the key problems of computer graphics. Interestingly, HW acceleration blocked us from addressing them. Early rasterization HW prevented efficient occlusion culling methods, and now raytracing HW prevents any form of fine-grained LOD, including Nanite.
That's a bold claim, and we could discuss it, but it holds true. It's the primary reason i rant so much against the widespread assumption that GPUs gave us all graphics progress. People ignore all the limitations HW acceleration has caused, and still causes.
Juliean said:
So nanite from my understand is a solution that mostly solves those problems. You have all your data on the GPU, you don't need no more culling, etc…
I would say we had this long before, under the umbrella term of 'GPU driven rendering'.
The Nanite renderer does culling in a similar way to what UE4 already did before, so there is not much progress there. The software rasterizer also isn't that interesting, imo.
The remarkable feature is fine-grained (almost continuous) LOD, resolving the popping problem we had with earlier discrete LOD solutions. It's like having mip-maps for geometry, selecting only the resolution you actually need.
Because it's fine-grained, it's also much more efficient to process and needs less memory.
To select the current level of detail, some hierarchical data structure is needed (actually a BVH), which they also use for culling.
But there is no need for actual traversal. Instead you can just loop over the currently selected LOD cut, and for each cluster decide whether to increase, decrease, or keep the current detail.
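A hedged sketch of that loop, just to show the shape of it; this is not Nanite's actual code, and the data layout, the error heuristic and the thresholds are all my own assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Cluster {
    int32_t parent     = -1;        // -1: already the coarsest level
    int32_t firstChild = -1;        // -1: already the finest level
    int32_t childCount = 0;
    float   cx = 0, cy = 0, cz = 0; // bounding sphere center
    float   radius = 0;             // bounding sphere radius
};

// Rough screen-space size of a cluster's bound; purely an illustrative heuristic.
static float projectedError(const Cluster& c, float camX, float camY, float camZ) {
    float dx = c.cx - camX, dy = c.cy - camY, dz = c.cz - camZ;
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    return c.radius / std::max(dist, 0.001f);
}

// One step of cut refinement: no tree traversal, just a flat loop over the
// currently selected clusters. Too coarse -> replace by children; finer than
// needed -> fall back to the parent; otherwise keep.
std::vector<int32_t> updateCut(const std::vector<Cluster>& clusters,
                               const std::vector<int32_t>& currentCut,
                               float camX, float camY, float camZ,
                               float errorThreshold) {
    std::vector<int32_t> next;
    for (int32_t id : currentCut) {
        const Cluster& c = clusters[id];
        float err = projectedError(c, camX, camY, camZ);
        if (err > errorThreshold && c.firstChild >= 0) {
            for (int32_t i = 0; i < c.childCount; ++i)   // increase detail
                next.push_back(c.firstChild + i);
        } else if (err < 0.5f * errorThreshold && c.parent >= 0) {
            next.push_back(c.parent);                    // decrease detail
        } else {
            next.push_back(id);                          // keep current detail
        }
    }
    // Caveat: several siblings may push the same parent; a real system would
    // deduplicate and keep sibling groups consistent. Omitted for brevity.
    return next;
}
```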
None of that is actually a new invention. Even the smart way to avoid a need for stitching was proposed in a much older paper, as i have learned. And we had continuous LOD in some games before, e.g. Messiah from the year 2000.
But still, nobody expected the amount of detail this enables. I didn't either, although i work on LOD myself. I always thought LOD was useful to increase efficiency, but i did not think it could be used to give such insane detail. :D
Juliean said:
We don't really know all the trials that have went into creating nanite, how much didn't pan out, how had to be discareded, redone etc…
We do. Karis gave a talk about that, similar to the very nice one about Dreams. The devs explain all the things they tried that did not work out.
That's really nice to read, and actually more inspiring than papers about actual results and progress.
I wish the path of failure would be a topic more often. It makes me feel less stupid. :D
Juliean said:
Using an optimized algorithm often can make your code extremely more complex, to the point where you can't even ensure that it is “good” anymore (in terms of reusability, coupling, what have you…).
Absolutely, yes. The price is high.
But for me that's everyday practice and normal. It turned out i'm more the research guy than the ‘make a game’ guy.
Now that you said this, i realize my programming experience is quite different from the norm, i guess? Probably so. Might explain some things, hmmm….