
ECS - Transform and Mesh component link

Started by June 24, 2022 08:08 PM
9 comments, last by All8Up 2 years, 5 months ago

Hi guys!

I made a little project to understand and learn about ECS, and I came across this wonderful library EnTT, so I decided to give it a shot.

My issue began when I noticed I need some kind of link between the TransformComponent and the MeshComponent. I ask because right now my MeshComponent needs to update its own matrix based on the TransformComponent, and the only ways I see to solve this are:

A) Every update, go through all the entities that have both a MeshComponent and a TransformComponent, like this:

auto group = mRegistry.group<TransformComponent>( entt::get<MeshComponent> );
for ( auto e : group )
{
    auto [transform, mesh] = group.get<TransformComponent, MeshComponent>( e );

    mesh.mMatrix = transform.GetTransform();
}

but if nothing in the transform has changed, this just copies the data needlessly.

B) Call a function whenever I change the position, rotation or scale of a transform component:

void OnTransformChanged(entt::entity e, TransformComponent& transform)
{
    if ( !EMS::HasComponent<MeshComponent>( e ) )
        return;

    auto& mesh = EMS::GetComponent<MeshComponent>( e );

    mesh.mMatrix = transform.GetTransform();
}

But that would mean extra work on every change to the transform component.

I'm open to ideas, books, YouTube videos or any recommendation on this subject.

Thanks!

The approach you describe in A is probably what you want, and is probably less expensive than you imagine, for a couple of (potential) reasons:

  1. Simple is good. No need to do anything more complicated unless you determine you have a performance bottleneck here.
  2. A simple loop building and/or copying data like this is likely very performant, and probably will never be a performance bottleneck unless you approach tens of thousands or more entities. If you're worried about this, I believe in EnTT specifically there are ways to sort the component arrays such that entities with both transform and mesh components end up sorted together in a linear chunk of memory with no holes.
  3. Depending on how your transform hierarchy works, you could defer transforming things from local space to world space until this point in the frame. Then you're paying the cost of calculating final world-space matrices only once for each object (and only for objects that get rendered).
  4. You could potentially combine this loop with the loop that actually operates on or submits meshes for drawing, making it effectively “free” (if that makes sense for your architecture).

@Valakor

Yeah, agreed on 1. IMO simple is always good; I don't like verbose code, or adding extra cruft in general.

The only issue I see is that I might have around 5,000 entities loaded on the map at once (my idea is to have something similar to WoW as a game), so I wonder if running through 5,000 entities and copying the transform to the mesh might become a performance bottleneck.

2. Do you happen to know where that component-array sorting is documented? I've been looking but I can't find it; I only found the group stuff.

3. What about hierarchy? Say I have an entity “Chain” and another “Box”, where Box is of course a child of Chain, so if I move Chain I also have to move Box. What would be a good solution for that?

4. I think I get it: in that loop, also retrieve the transform and submit the mesh for drawing. Unfortunately the render engine I use doesn't allow that. :/

Rewaz said:
I wonder if running through 5,000 entities and copying the transform to the mesh might become a performance bottleneck

There's no way to know until you profile it - no use assuming something that may or may not be true.

Rewaz said:
Do you happen to know where that component-array sorting is documented? I've been looking but I can't find it; I only found the group stuff.

Unfortunately I don't remember, you'll have to browse the docs.

Rewaz said:
What about hierarchy? Say I have an entity “Chain” and another “Box”, where Box is of course a child of Chain, so if I move Chain I also have to move Box. What would be a good solution for that?

Hierarchical relationships in entity component systems can be implemented in a handful of ways - you'll probably want to do some research on that topic specifically to find something that works best for you.
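One common approach, sketched here in self-contained C++ with hypothetical names (not a specific EnTT API): store a parent index per entity, keep the array ordered so parents precede their children, and resolve world transforms in one linear pass. The example uses a single translation value instead of a full matrix to stay short.

```cpp
#include <vector>

// Hierarchy sketch (hypothetical, not EnTT-specific): each node stores the
// index of its parent, and the array is ordered so a parent always appears
// before its children. A single float offset stands in for a full matrix.
struct Node {
    int parent;   // -1 for roots like "Chain"
    float local;  // local-space offset relative to the parent
    float world;  // resolved world-space offset
};

// One forward pass: by the time we reach a child, its parent's world
// transform has already been computed.
void updateWorldTransforms(std::vector<Node>& nodes) {
    for (auto& node : nodes) {
        if (node.parent < 0)
            node.world = node.local;
        else
            node.world = nodes[node.parent].world + node.local;
    }
}
```

Moving “Chain” then just means changing its local value; the next pass moves “Box” along with it. Real implementations compose matrices instead of adding floats and often track dirty subtrees, but the parent-before-child ordering trick carries over.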

Rewaz said:
The only issue I see is that I might have around 5,000 entities loaded on the map at once (my idea is to have something similar to WoW as a game), so I wonder if running through 5,000 entities and copying the transform to the mesh might become a performance bottleneck.

As others have pointed out, this should not be an issue. I regularly work with > 100,000 entities in an ECS and have tested variations of this problem regularly. It typically comes down to 3 things:

  1. If most things in the scene are moving at all times, just iterate and copy the data. It's cheap, quick and I've never seen it actually show up in profiles so I really don't suggest the following alternatives.
  2. If you promise that most things in the scene are not moving, you could add a “dirty” flag to the transforms and skip those which have not changed. In most cases where even 5% of entities move, this is more expensive than #1. There are many ways you can approach this but typically speaking the combination of branching code, random cache misses and potentially wasted memory put the point of diminishing returns at very low numbers.
  3. Like #2, if you promise most things are not moving, bypass the ECS and use a queue to create a list of entities which moved. Then just iterate that doing a single entity lookup and copy. This is inherently against the ECS style but sometimes is the right thing to do. From my findings, this is almost always faster than #2 but until the entity counts get pretty high with few moving items, slower than #1.
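Option #2 might look like the following sketch (plain C++ with hypothetical component types, not the EnTT API). Note the per-entity branch that the warning above refers to:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical flat components; in the real thing these live in ECS pools.
struct Transform {
    std::array<float, 16> matrix{};
    bool dirty = false;  // set by whatever code moves the entity
};
struct Mesh {
    std::array<float, 16> matrix{};
};

// Copy only the transforms flagged as changed, then clear the flag.
// The branch and the scattered writes are exactly why this often loses to
// an unconditional linear copy once more than a few percent of entities
// move per frame.
void syncDirtyTransforms(std::vector<Transform>& transforms,
                         std::vector<Mesh>& meshes) {
    for (std::size_t i = 0; i < transforms.size(); ++i) {
        if (!transforms[i].dirty)
            continue;
        meshes[i].matrix = transforms[i].matrix;
        transforms[i].dirty = false;
    }
}
```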

Now, having said the above, there are a few more things to consider:

First, if you have a transform and a mesh component in the ECS and they are both flat POD structures containing matrices, why copy at all? Remove the matrix from the mesh and, when you go to render, just iterate (Transform, Mesh), splicing the data from the two components into the result you need. This is typically how you want to use the ECS when possible. Of course, in many cases the Mesh structure is supplied by a rendering engine and you may not be able to, or at least don't want to, modify it. You could of course reverse this and just use the matrix in the Mesh component rather than a separate one. The downside there depends on the size of the Mesh component: if it is large, iterating to update that matrix could end up touching considerably more memory and slowing things down such that #1 makes more sense again. It's a balancing act, and depends on what you are using.
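The "don't copy at all" idea above, sketched with plain index-aligned arrays standing in for the two ECS pools (all names hypothetical): the mesh component carries no matrix, and the transform is read directly when building the draw list.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Transform {
    std::array<float, 16> matrix{};
};
struct Mesh {
    int meshId = 0;  // render data only; no duplicated matrix
};

// What the renderer consumes: mesh id plus a pointer to the matrix.
// The Transform array must outlive the draw list.
struct DrawItem {
    int meshId;
    const float* worldMatrix;
};

// Splice (Transform, Mesh) at submission time instead of keeping a second
// copy of the matrix in sync inside the Mesh component.
std::vector<DrawItem> buildDrawList(const std::vector<Transform>& transforms,
                                    const std::vector<Mesh>& meshes) {
    std::vector<DrawItem> items;
    items.reserve(meshes.size());
    for (std::size_t i = 0; i < meshes.size(); ++i)
        items.push_back({meshes[i].meshId, transforms[i].matrix.data()});
    return items;
}
```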

Going off the deep end is always possible. Typically speaking, most rendering engines put an “easy to use” retained scene graph between you and the rendering device, often implemented very similarly to an ECS in fact. This duplicates the functionality and data contained in the ECS, and is generally speaking something you don't really want between you and the underlying rendering device. This is how my solution works, generally speaking, and the difference becomes notable at scale.

A bad comparison: the debug visualizer I wrote against a retained-mode renderer versus the current solution, which bypasses the retained mode and uses the same backend device code. With 5,000 entities (low counts because this had no culling), the retained mode rendered at 10-15 FPS, where the current code renders at over 120 FPS with a quarter of the memory bandwidth. The results look identical, but it's 5 times faster simply by bypassing code that isn't necessary.

Let me be clear, this is a totally unfair and questionable comparison, as there are two huge differences: a) the renderer managed the mesh instances and they were pointers to random memory, so all contiguous memory access was out the door when pushing the transforms (aka: cache-miss palooza!), and b) the ECS was executing in parallel while each of those meshes had to be protected via a mutex. Those are pretty huge differences, but unfortunately they are likely to be standard fare unless a given renderer is written specifically with ECS and multicore execution in mind. All said and done, if you want to go off the deep end for performance, this is a pretty big one to consider.

Now, the reality check: keep in mind that different needs play a role in where you should be spending your time. If you're making a game with fewer than 5,000 entities, ignore everything above except #1. Seriously, you have better things to do with your time than optimize the crap out of something that is likely good enough. The primary reason for all of the above, though, is that you will run into this again if/when you integrate a physics engine, a sound system, networking and/or many other things where middleware is probably desired. In a lot of those cases, solution #3 starts to look very good…

The matrix is 16 numbers. 5,000 of them is 80,000 numbers.

Each of your processor cores probably runs at 4 to 5.5 billion cycles per second. A modern CPU core has around 7 internal processing units: four typically execute one operation per cycle, and three can do simple operations at up to 3 per cycle. You might have 4, 6, 8, 16, even 32 of these cores, each individually able to exceed ten or even twenty billion instructions per second.

While some tasks do show up in profiling, copying those around will not be an issue unless you have done something terribly wrong.
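To put a rough number on the claim above: 5,000 4x4 float matrices are 5,000 × 16 × 4 bytes, about 320 KB, which a modern CPU streams through in a tiny fraction of a frame. A minimal sketch of the cost and the "option A" copy (sizes assume the typical 4-byte float with no padding):

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Matrix = std::array<float, 16>;

// 5,000 matrices of 16 four-byte floats is only ~320 KB: a trivial amount
// of data for one linear sweep per frame.
constexpr std::size_t kEntities = 5000;
constexpr std::size_t kBytesCopied = kEntities * sizeof(Matrix);  // 320,000

// The unconditional "option A" copy from the start of the thread.
void copyAll(const std::vector<Matrix>& src, std::vector<Matrix>& dst) {
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = src[i];
}
```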

That said, ECS is a fancier name for a simple CS concept: a container. You can use an index, a handle, a pointer, or some other reference if you want, but it is just a container that links one thing (a game object entity) to another (a component). A container with a few thousand items is a small thing on a modern computer.

You are unlikely to hit the limits of a modern computer as an individual game developer.


frob said:
Each of your processor cores probably runs at 4 to 5.5 billion cycles per second.

All very true; current CPUs are so insanely fast it's not worth worrying about. I figure I'll add one thing though: as you push past about 5,000 entities, memory bandwidth becomes the limiting factor long before anything else. CPUs are hundreds of times faster than 10 or 20 years ago, while memory is only 10-20 times faster. As you scale past 5,000 entities, most of the work is in optimizing memory access, even if it means wasting significant CPU cycles recomputing rather than touching memory. But below 5,000, blow the memory budget, who cares! :D

Yep, somewhere around 5000-10000.

It depends on the details, but on current systems, with cache effects, CPU design and data structures, the tradeoff between linear and random access performance is around 5,000-10,000 items.

That is, a linear scan of 5,000 items is faster than a binary search requiring up to 13 comparisons. On various hardware it might be 8,000 or 10,000 or more, depending on the details. Simple linear access of data is extremely fast, and with cache effects the clock time of doing one is often exactly the same as the clock time for doing 4 or 8 or 16, as all are fetched at once. With detection and prefetch, the second batch is ready before the first is processed, the third batch before the second, and so on.

It is hard to overstate how fast modern processors are at raw computing speed. Most optimization these days is not about raw processing. The era of poring over CPU instruction timings in the 70s and 80s ended decades ago, except for some relatively rare occasions. There are bigger fish to fry.

frob said:
That is, a linear scan of 5,000 items is faster than a binary search requiring up to 13 comparisons. On various hardware it might be 8,000 or 10,000 or more, depending on the details. Simple linear access of data is extremely fast, and with cache effects the clock time of doing one is often exactly the same as the clock time for doing 4 or 8 or 16, as all are fetched at once. With detection and prefetch, the second batch is ready before the first is processed, the third batch before the second, and so on.

In the quick test I made (https://pastebin.com/13V2Sxsy; unfortunately it cannot be copied directly as it uses my own system libraries, but you should be able to understand and modify it), the cutoff point where binary search became faster was more like 600 ± 100, on a very fast i9-9900X CPU. I'm not sure if I've been measuring something wrong, but either way it's still pretty impressive. I previously assumed the cutoff would be at around 50-100 elements, which is why I made this quick benchmark. I've been using a vector-backed small_map/set with linear search for that reason on many occasions where there weren't many elements, but I guess I can use it in even more places.

EDIT: After making the cutoff-point detection a bit more resistant to flukes, and reducing the overhead of preventing the “find” operations from being optimized away, the cutoff is now at around 1,400 elements. That's wild. https://pastebin.com/fpVqXi0S
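A self-contained version of that kind of measurement (standard library only, with a hypothetical setup, so any cutoff you observe will differ from the pastebin numbers): compare a linear scan against std::lower_bound over the same sorted vector.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Linear scan over a sorted vector; returns the index of `key` or -1.
long linearFind(const std::vector<int>& v, int key) {
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] == key) return static_cast<long>(i);
    return -1;
}

// Binary search via std::lower_bound on the same data.
long binaryFind(const std::vector<int>& v, int key) {
    auto it = std::lower_bound(v.begin(), v.end(), key);
    if (it != v.end() && *it == key)
        return static_cast<long>(it - v.begin());
    return -1;
}

// Time one lookup of every key; returns nanoseconds. A serious benchmark
// would repeat runs, randomize key order, and report a distribution.
template <typename F>
long long timeAll(const std::vector<int>& v, F find) {
    auto start = std::chrono::steady_clock::now();
    long long sink = 0;  // consume results so the loop isn't optimized away
    for (int key : v) sink += find(v, key);
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    return sink != 0 ? ns : ns;
}
```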

frob said:
Depends on details, but on current systems with cache effects and cpu design and data structure, the tradeoff between linear and random access performance is around 5000-10000 items.

I think we're basically saying the same things, just pointing out different situations. The detail here is that the ECS provides a framework where, without much programmer work, you tend to get this linear-access benefit. It's just the nature of the ECS in this area, and why it can end up as a major performance win in certain scenarios. This brings with it the possibility that you need to rethink what is optimal: what is more expensive on the compute side but lighter on memory can very often be the winner, even if by every other measure it should be slower.

That was the only point I was making; I wasn't arguing that linear access isn't way faster. I was pointing out that it can actually run you into those theoretical limits, which won't be so theoretical anymore… :D A specific case in point: in some of my code, inline computation of a full matrix inverse, rather than pulling in the pre-computed one available in a different linear array, is a notable improvement at scale. This is especially noticeable on AMD CPUs when measuring the 8, 12 and 16 core variations of the standard gaming Ryzens. The 8-core doesn't see much difference either way, the 12-core shows a notable improvement, and the 16-core difference is exceptional. All for an “optimization” which by every other measure should not be an improvement.

This topic is closed to new replies.
