
What kind of coding standard do you prefer

Started by December 16, 2022 11:35 AM
82 comments, last by JoeJ 1 year, 11 months ago

JoeJ said:

Juliean said:
This optimization only makes sense when accessing a static variable multiple times within the same function, but it does provide a noticeable speedup.

I would never have figured this out on my own, thanks for sharing! :D

Will add and compare later, with a somewhat larger grid…

You're welcome! Looking forward to the results.

I'm now also implementing the binning/type-filter I mentioned before. So I hope you're happy to hear that you got me motivated to do at least that :P I'll post the results once I'm done, though that might take a while - I'll probably hardcode things for now; making it super configurable could be tricky on the tool side at the start, especially since I don't know yet how it will turn out.

edit: more buggy results removed


@Gnollrunner unsigned short is good enough for indexing meshes, textures, and sounds. It was not good enough for objects - there I used an integer.

Pointer vs compacted local data: this discussion has a lot of interesting examples demonstrating why sequential access is so much faster than random access, in many cases even when acceleration structures are used. It is an interesting read, but my original point is that I do it for simplicity, and in most cases it turns out to even give a speed boost. Literally looping through 10000 ordered entries to find a specific one will be faster than doing a binary search or using other acceleration structures. So it is simple and fast, but definitely not for every situation. I think the only conclusion is that testing the different options is the only right way to go. I did mention that I use SortForEfficienty a lot for all kinds of containers.

Maybe even if you have objects all over the place (in your heap), you could precache the local ones into a small vector and see how much faster that solution is. I did this when I was simulating entire solar systems with a lot of detail on the planets - obviously I can see less than 1% of the entire world, so accessing everything in one big heap all the time was a big no. I kept that heap, plus localized data that would serialize back every now and then, and whenever an object left my local space.

An interesting example of how sequential access can be much faster:

I once trained a neural network on 100 GB of data. You would think that with 128 GB of RAM you wouldn't have to do much to make things fast, but that is wrong: if you shuffle the data (which you must), nearly every memory access becomes a cache miss, which causes a massive slowdown. In the end I implemented a memory-efficient shuffle that always goes sequentially through segments of the original data and randomly dumps entries into files (picking segments at random). I repeat this several times, and the good news is that everything now works even on 8 GB of RAM - and works faster, though it requires 300 GB of SSD space. So using the SSD as a cache in a sequential way ended up being much faster than the original solution that had 128 GB of RAM and was constantly missing the cache.

Ivica Kolic said:
@Gnollrunner unsigned short is good enough for indexing meshes, textures, and sounds. It was not good enough for objects - there I used an integer.

One comment on that: unsigned short specifically does have an overhead of its own on standard PC hardware. The instructions related to it, and how registers need to be handled, are not optimal. It shouldn't really matter in most cases, but it's something to keep in mind. Specifically, unless you pack multiple numbers into one struct as I described earlier, you should not use “short” as an optimization measure. Use uint8_t if possible (that one is actually fast), or uint32_t. You can check this yourself by seeing what uint_fast16_t etc. are typedefed to. On PC it's generally “unsigned int” and not “unsigned short”, while uint_fast8_t is “unsigned char”.

Ivica Kolic said:
Pointer vs compacted local data: this discussion has a lot of interesting examples demonstrating why sequential access is so much faster than random access, in many cases even when acceleration structures are used. It is an interesting read, but my original point is that I do it for simplicity, and in most cases it turns out to even give a speed boost. Literally looping through 10000 ordered entries to find a specific one will be faster than doing a binary search or using other acceleration structures. So it is simple and fast, but definitely not for every situation. I think the only conclusion is that testing the different options is the only right way to go. I did mention that I use SortForEfficienty a lot for all kinds of containers.

Is it simpler for you? Outside of optimization, I do have one example where I used it for objects, storing a “handle” to an entity instead of a pointer because the pointer could become invalid. But that actually makes things a lot more complicated in most places, at least for me. You now either have to store the container the things reside in alongside the int (if you want to make it usable like a pointer - which is what I ended up doing), or store the container separately and access/pass it all over the place. But maybe for problems that are more local, it's not as bad.
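A handle scheme of this kind might look roughly like the following sketch. The names (`EntityPool`, `Handle`) and the generation counter are mine, not from the code being discussed; generations are one common way to detect that a handle has gone stale.

```cpp
#include <cstdint>
#include <vector>

// A pool of entities addressed by index + generation. The generation lets a
// handle detect that its entity was destroyed, which a raw pointer cannot do.
struct Entity { int hp = 100; std::uint32_t generation = 0; bool alive = false; };

class EntityPool {
public:
    struct Handle { std::uint32_t index = 0; std::uint32_t generation = 0; };

    Handle create() {
        slots_.push_back(Entity{100, ++nextGen_, true});
        return Handle{static_cast<std::uint32_t>(slots_.size() - 1), nextGen_};
    }
    void destroy(Handle h) {
        if (Entity* e = resolve(h)) e->alive = false;
    }
    // Resolve a handle back to a pointer; returns nullptr if stale or dead.
    Entity* resolve(Handle h) {
        if (h.index >= slots_.size()) return nullptr;
        Entity& e = slots_[h.index];
        return (e.alive && e.generation == h.generation) ? &e : nullptr;
    }
private:
    std::vector<Entity> slots_;
    std::uint32_t nextGen_ = 0;
};
```

Note the trade-off mentioned above: every place that wants to use the handle like a pointer needs access to the pool, either stored alongside the handle or passed around explicitly.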

JoeJ said:
Haha, 10 times faster. That was a good tip. : ) Now with optimization on in release:

That's already a good thing! I'm feeling vindicated right now about my assessment that debug performance is important (from an earlier thread), since newer MSVC versions now make std::move and std::forward no-ops even in debug mode, and even provide the [[msvc::intrinsic]] attribute so custom cast functions can achieve the same effect. So even if the speedup in release and on different compilers is not that big, it's still a win in my book for a relatively easy improvement (one that doesn't require any structural adjustments to the underlying algorithm).
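For reference, the attribute applies to a function whose body is a single cast of its sole argument, like `std::move` itself. A portable sketch (the `_MSC_VER` guard is mine; on other compilers `my_move` is just an ordinary inline cast):

```cpp
#include <string>
#include <type_traits>

// A custom move-cast. With [[msvc::intrinsic]] (MSVC 17.5+), the compiler
// replaces the call with the cast itself, so there is no function-call
// overhead even in unoptimized debug builds. The attribute requires the
// body to be a single cast of the single argument.
#if defined(_MSC_VER)
[[msvc::intrinsic]]
#endif
template <typename T>
constexpr std::remove_reference_t<T>&& my_move(T&& value) noexcept
{
    return static_cast<std::remove_reference_t<T>&&>(value);
}
```

Used exactly like `std::move`: `std::string b = my_move(a);` transfers the buffer instead of copying it.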

Juliean said:

Is it simpler for you? Outside of optimization, I do have one example where I used it for objects, storing a “handle” to an entity instead of a pointer because the pointer could become invalid. But that actually makes things a lot more complicated in most places, at least for me. You now either have to store the container the things reside in alongside the int (if you want to make it usable like a pointer - which is what I ended up doing), or store the container separately and access/pass it all over the place. But maybe for problems that are more local, it's not as bad.

Examples:

  • Mesh boolean operations in my CSG library: unsigned short was more than enough for everything. All memory can be preallocated, since you know the resulting mesh won't have more faces than the total of mesh A and mesh B (in most cases fewer). Debugging is much easier when you have the index of a face/edge/vertex rather than a pointer to it (who could even read that?).
  • The other example is the editor, where objects change only when the user is doing something. 99.99% of the time the rendering thread just loops through them; when I'm changing something, the rendering thread is stopped. So this is a very simple example with practically static data.
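The preallocation argument from the first example can be sketched as follows. The types are hypothetical, and the bound on the face count is the post's own claim, simply encoded as an assert:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using FaceIdx = std::uint16_t; // 65535 faces is plenty here

struct Face { FaceIdx v[3]; };
struct Mesh { std::vector<Face> faces; };

// Per the post, the boolean result never has more faces than mesh A and
// mesh B combined, so all memory can be reserved upfront: no reallocation
// while building, and uint16_t indices stay valid throughout.
Mesh prepareResult(const Mesh& a, const Mesh& b)
{
    const std::size_t maxFaces = a.faces.size() + b.faces.size();
    assert(maxFaces <= UINT16_MAX && "uint16_t face index would overflow");
    Mesh out;
    out.faces.reserve(maxFaces);
    return out;
}
```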

Counter-example (but also an example, because of precaching):

  • Engine that handles entire solar systems. Since there are millions of objects and no way they can fit in the CPU cache, I precache the local objects into those small preallocated vectors. In most cases less than 0.001% of the objects are accessible to the player - if those objects are locally precached, then practically everything fits in the cache and the CPU doesn't have to jump all over the place to access the right memory. The network engine did, however, have to jump all over the place, but very rarely, because it was a p2p network engine and you might end up being the server for the interior of some space station while also being a client on a spaceship that left that station.
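The precaching step might look something like this toy sketch (names are mine, and a 1-D distance check stands in for whatever real visibility/range test the engine uses):

```cpp
#include <cmath>
#include <vector>

struct Object { float pos; int id; };

// Copy the few objects near the player out of the huge world heap into a
// small, contiguous scratch vector once per frame. Hot loops then iterate
// only over this cache-friendly vector instead of chasing pointers across
// the whole heap.
void precacheLocal(const std::vector<Object>& world, float playerPos,
                   float radius, std::vector<Object>& local)
{
    local.clear(); // capacity is kept, so no per-frame allocation
    for (const Object& o : world)               // one sequential pass
        if (std::fabs(o.pos - playerPos) <= radius)
            local.push_back(o);
}
```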

Ivica Kolic said:
Mesh boolean operations in my CSG library: unsigned short was more than enough for everything. All memory can be preallocated, since you know the resulting mesh won't have more faces than the total of mesh A and mesh B (in most cases fewer). Debugging is much easier when you have the index of a face/edge/vertex rather than a pointer to it (who could even read that?).

So for indexing into “primitive” data inside the mesh, if I understand correctly? This would also fall under the category of “local” problems that I described - you are probably using those indices inside a class/system where you always have access to all the data needed to resolve the index back to a pointer/data if you need to.

As for readability/debuggability: if we are talking about primitive data, sure. A pointer to an int is way less indicative than an index saying which int it is. But if you have a pointer to an object (with multiple fields, potential inheritance, etc.), then the index becomes way less understandable IMHO. So it's object 425… nice? What are its properties? What type is it? And if you still want to know the index, you can always store it alongside the pointer (even conditionally, in debug builds only, if so desired).

Ivica Kolic said:
The other example is the editor, where objects change only when the user is doing something. 99.99% of the time the rendering thread just loops through them; when I'm changing something, the rendering thread is stopped. So this is a very simple example with practically static data.

That is also one of the benefits I mentioned earlier about handling the backend of my renderer with indexed integers. The new code for handling device removal can simply change the stored pointers internally, without forcing users to update what they have.
But on the other hand, the actual API/plugin/script code still gets a “Texture” object pointer (which is decoupled from the backend via the integer). I find it way easier to code high-level logic that way. Want the size of a texture? texture.GetSize(). Obviously getTextureSize(textureID) could be used instead, but that would require the texture storage to live globally behind the scenes; and you'd have to make different “ID”/handle types so that you cannot just call getTextureSize(vertexBufferID) by accident (and so you keep the other benefits of strongly typed fields, for example when binding them to a scripting API via templates without having to specify the type yourself).

edit: even more bugs.

Btw, I did the same thing on GPU, maybe 7-8 years ago, on some GCN card.
The GPU really is super fast at brute-force N-body problems like this.
But the threshold was 64 bodies.
After that, building such a grid became faster.

I guess nothing has changed since then for modern GPUs, but I'm not sure.

JoeJ said:
However, only now I notice that brute force became faster than any grid with Clang. With a whopping 5000 objects. I will need days to swallow this and deal with it… :O

Wuut? Are you sure that you didn't introduce some subtle error in the algorithm in either case? I mean, it does find the same number of elements, but even I would not have expected this to happen with that many objects.

Alternatively, maybe it's because with that many elements the cells all become “saturated”, and you end up doing O(N^2) within each cell over a comparatively high number of elements? I'm not really sure how the cells are divided exactly, but the root of 5000 is about 70, so if we assume 8*8 cells, that means roughly 8-9 elements per axis per cell, or about 81 per cell - meaning that even per cell you are doing about as many brute-force steps as my game does in high-load scenarios. That assumes intra-cell comparisons are O(N^2), which I do assume. Following that logic, brute force would be 12,500,000 “steps”, while 64 grid cells * (81*81) comparisons would be about 420,000 by my napkin math - roughly a 30x difference in favor of the grid, which could still fall within the range where raw speed outweighs the complexity difference.

On my own front, I did implement the collision filtering, with pretty good results:

Running my test sequence with the boss fight that has the most projectiles, debug execution speed goes up from 35x to 72x. Nice! I don't have a release variant for this single test, but for the whole project's tests it went from 87s to 83s, which is a good improvement considering the length of gameplay being run in that time. Sorry for the unusual performance metrics - I could measure the actual time spent inside the loop itself, but what I care about most is how it affects those global test speeds; that has a practical advantage for me. Considering that those numbers include everything the game runs, they actually mean a lot.
EDIT: See next post for some corrections


The code is here: https://pastebin.com/gJRJ3M0V

There is a lot of engine-specific yada yada going on, but I hope the core of the algorithm is understandable. Notable things:

  • Iterating over the actual containers is still O(N^2), which made me chuckle a bit. But I am fairly certain that for an N of 5, as it is now, it is still better - and easier to type than creating an explicit list of target types for this quick example.
  • There is some additional complexity in having to order collisions, which already existed before and which not every system might need. So this could be used to speed things up even more.
  • There are a few things that could potentially be moved around to either improve or worsen the characteristics. For example, certain properties could additionally be cached when moving the elements into their bins - or fewer of them, depending on the use case. This depends a bit on how often things actually collide, which brings me to my next point:
  • This implementation is only for “touch” triggers, meaning no physically blocking collision, which also means the actual percentage of overlaps each frame is very low. The actual collision system, at least in this new format, could also be updated fairly easily, as I have certain guarantees (such as an entity not being physically deleted until the end of the frame). However, the system in its entirety is a bit more complex, so going beyond that with a grid wouldn't be as easy. In short, blocking collisions are not handled in just one loop: for the specific movement system I have, it is necessary to dynamically update the potential locations of objects while they move, as well as the potential collision pairs, in a way that cannot simply be resolved by pushing collisions out of their contact points.
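The binning/type-filter idea, as I understand it from the thread, can be sketched as follows. All names and the bitmask representation are my own; this is not the pastebin code:

```cpp
#include <cstdint>
#include <vector>

// Entities are grouped into bins per collision type; a pair of bins is only
// tested if the first type's "collides with" mask contains the second type.
// Whole groups (e.g. projectile vs projectile) are skipped without testing
// a single individual pair. The bin-vs-bin loop is itself O(N^2) in the
// number of types, which is fine for a handful of them.
enum Type : std::uint32_t { Player = 1 << 0, Enemy = 1 << 1, Projectile = 1 << 2 };

struct Entity { float x, y, r; };

struct Bin { std::uint32_t type; std::uint32_t collidesWith; std::vector<Entity> items; };

static bool overlaps(const Entity& a, const Entity& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, rr = a.r + b.r;
    return dx * dx + dy * dy <= rr * rr; // circle-circle test
}

int countCollisions(const std::vector<Bin>& bins)
{
    int hits = 0;
    for (const Bin& a : bins)
        for (const Bin& b : bins)
        {
            if (!(a.collidesWith & b.type)) continue; // whole pair of bins filtered
            for (const Entity& ea : a.items)
                for (const Entity& eb : b.items)
                    if (&ea != &eb && overlaps(ea, eb))
                        ++hits;
        }
    return hits;
}
```

Note that directed pairs are counted separately here (projectile-hits-enemy and enemy-hit-by-projectile are two events), which matches systems where each side reacts to its own trigger.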

I'll continue implementing this for blocking collisions as well, and see what the final numbers say.

This topic is closed to new replies.
