Rewaz said:
I don't know if there is a better way to load those files. I wanted to write the code from scratch (the game uses D3D9), and I didn't know whether I should use FILE or std::ifstream, or which techniques I could use to improve the speed of loading those assets.
In my experience, it is harder to write simple code than optimized code. I kept my bitmap loader quite simple (I also support PNG, JPEG and TGA files) and it runs in good time depending on the size of the image.
bool Asset::BmpReadHeader(IDataReader& stream, BmpHeader& header)
{
    byte tmp[4] = { 0 };
    // "BM" magic bytes
    if (stream.Get() != 'B' || stream.Get() != 'M')
        return false;
    stream.Read((byte*)&header.Size, 4);
    stream.Read(tmp, 2); // reserved
    stream.Read(tmp, 2); // reserved
    stream.Read((byte*)&header.OffBits, 4);
    stream.Read(tmp, 4); // info header size, 40 = BITMAPINFOHEADER
    if (*(uint32*)tmp != 40)
        return false;
    stream.Read(tmp, 4); header.DataHeader.Width = *(uint32*)tmp;
    stream.Read(tmp, 4); header.DataHeader.Height = *(uint32*)tmp;
    stream.Read(tmp, 2); header.DataHeader.Planes = *(uint16*)tmp;
    stream.Read(tmp, 2); header.DataHeader.BitCount = *(uint16*)tmp;
    stream.Read(tmp, 4); header.DataHeader.Compression = *(uint32*)tmp;
    stream.Read(tmp, 4); header.DataHeader.SizeImage = *(uint32*)tmp;
    stream.Read(tmp, 4); header.DataHeader.X_pels_per_meter = *(uint32*)tmp;
    stream.Read(tmp, 4); header.DataHeader.Y_pels_per_meter = *(uint32*)tmp;
    stream.Read(tmp, 4); header.DataHeader.Clr_used = *(uint32*)tmp;
    stream.Read(tmp, 4); header.DataHeader.Clr_important = *(uint32*)tmp;

    byte BitsPerPixel = (header.DataHeader.BitCount == 32) ? 32 : 24;
    header.Buffer = header.DataHeader.Width * (BitsPerPixel / 8) * header.DataHeader.Height;
    return true;
}
bool Asset::BmpRead(IDataReader& stream, BmpHeader const& header, byte* data)
{
    uint32 BytesPerPixel = header.DataHeader.BitCount / 8;
    uint32 BytesPerRow = BytesPerPixel * header.DataHeader.Width;
    uint32 BytePaddingPerRow = (4 - BytesPerRow % 4) % 4; // BMP rows are padded to 4-byte boundaries

    if (header.DataHeader.BitCount <= 8) // palettized, using a color table
    {
        uint32 colors = (1 << header.DataHeader.BitCount) * 4; // BGRA palette entries
        byte* colorMap = MainAllocator::Allocator().Allocate<byte>((size_t)colors);
        stream.Read(&colorMap[0], colors);
        for (uint32 y = header.DataHeader.Height; y > 0; y--) // rows are stored bottom-up
        {
            for (uint32 x = 0; x < header.DataHeader.Width; x++)
            {
                uint32 clIdx = stream.Get() * 4;
                for (uint32 i = 0; i < 3; i++)
                    data[(((y - 1) * header.DataHeader.Width) + x) * 3 + i] = colorMap[clIdx + i];
            }
            for (uint32 i = 0; i < BytePaddingPerRow; i++)
                stream.Get();
        }
        MainAllocator::Allocator().Release(colorMap);
    }
    else if (header.DataHeader.BitCount == 16)
    {
        for (uint32 y = header.DataHeader.Height; y > 0; y--)
        {
            for (uint32 x = 0; x < header.DataHeader.Width; x++) // copy the 16-bit pixel raw
                stream.Read(&data[(((y - 1) * header.DataHeader.Width) + x) * BytesPerPixel], BytesPerPixel);
            for (uint32 i = 0; i < BytePaddingPerRow; i++)
                stream.Get();
        }
    }
    else // 24-bit RGB and 32-bit RGBA images
    {
        for (uint32 y = header.DataHeader.Height; y > 0; y--)
        {
            for (uint32 x = 0; x < header.DataHeader.Width; x++)
                for (uint32 i = 0; i < BytesPerPixel; i++)
                    data[(((y - 1) * header.DataHeader.Width) + x) * BytesPerPixel + i] = stream.Get();
            for (uint32 i = 0; i < BytePaddingPerRow; i++)
                stream.Get();
        }
    }
    return true;
}
Everything in my engine code works through a streaming interface. I don't load everything at once into one big heap-allocated binary block, because that would take more time and waste resources. My streams are custom classes that inherit from a common interface (I'm only showing the reader side here; the writer side is not relevant).
/**
    Interface abstracting the I/O read operation on some underlying data
*/
interface IDataReader
{
public:
    /**
        Class destructor
    */
    virtual ~IDataReader() {}
    /**
        Gets the current position of the reader
    */
    virtual int64 Position() const = 0;
    /**
        Sets the current position of the reader
    */
    virtual void Position(int64 pos) = 0;
    /**
        Gets the underlying data size in bytes
    */
    virtual int64 Size() const = 0;
    /**
        Returns whether the data pointer has reached the end
    */
    virtual bool Eof() const = 0;
    /**
        Reads the next byte without advancing the data pointer
    */
    virtual byte Peek() = 0;
    /**
        Reads the next byte of data
    */
    inline byte Get()
    {
        byte bt;
        Read(&bt, 1);
        return bt;
    }
    /**
        Reads size bytes into the given memory. Returns 0 when anything
        failed inside the file handle, otherwise the number of bytes read.
    */
    virtual size_t Read(byte* buffer, size_t size) = 0;
    /**
        Reads size bytes into another stream. Returns 0 when anything
        failed inside the file handle, otherwise the number of bytes read.
    */
    api size_t Copy(IDataWriter& stream, size_t size);
};
My InFileStream implements this interface and is built on two fundamentals: the OS-specific API to open file handles and read the data, and a circular buffer to simulate a continuous read. Every read operation first checks whether the buffer still has data to serve, and otherwise fills the next N bytes of the file into the buffer. 256 bytes turned out to be a good buffer size: it caches enough data from disk without wasting too much memory on smaller files. The time-consuming operation here is fetching the data from disk!
However, a few milliseconds are acceptable when loading a file from disk.
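To make the idea concrete, here is a minimal sketch of such a read-through buffer on top of plain stdio. The class name BufferedFileReader and the fixed 256-byte window are illustrative only, not my actual InFileStream:

```cpp
#include <cstdio>
#include <cstring>
#include <cstdint>
#include <cstddef>

// Illustrative buffered reader: refills a small window from the file
// only when the caller has consumed everything cached so far.
class BufferedFileReader {
public:
    explicit BufferedFileReader(const char* path)
        : file_(std::fopen(path, "rb")), pos_(0), filled_(0) {}
    ~BufferedFileReader() { if (file_) std::fclose(file_); }

    bool IsOpen() const { return file_ != nullptr; }

    // Read `size` bytes, serving from the cached window first and
    // refilling it from disk only when it runs dry.
    size_t Read(uint8_t* out, size_t size) {
        if (!file_) return 0;
        size_t total = 0;
        while (total < size) {
            if (pos_ == filled_) {               // window exhausted: refill
                filled_ = std::fread(window_, 1, sizeof(window_), file_);
                pos_ = 0;
                if (filled_ == 0) break;         // real end of file
            }
            size_t n = filled_ - pos_;
            if (n > size - total) n = size - total;
            std::memcpy(out + total, window_ + pos_, n);
            pos_ += n;
            total += n;
        }
        return total;
    }

    uint8_t Get() { uint8_t b = 0; Read(&b, 1); return b; }

private:
    FILE*   file_;
    uint8_t window_[256];  // small cache: batches disk reads, cheap for tiny files
    size_t  pos_;          // next unread byte inside the window
    size_t  filled_;       // valid bytes currently in the window
};
```

With this, a loader that calls Get() byte by byte still only touches the disk once per 256 bytes.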
As I wrote above, if you need more speed, you have to use memory-mapped file I/O, which maps your data to virtual memory pages managed by the OS and offers access to them via pointer. I also implemented a MemoryStream class that wraps such a pointer and a given size, to make static memory available as a stream as well (remember, everything works on streams?).
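A rough sketch of such a mapping, using the POSIX API for brevity (on Windows the equivalents are CreateFile, CreateFileMapping and MapViewOfFile); the MappedFile name is illustrative, not an actual engine class:

```cpp
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Illustrative memory-mapped view: the OS pages the file in on demand,
// and the data is accessed through a plain pointer.
struct MappedFile {
    const uint8_t* data = nullptr;
    size_t         size = 0;

    bool Open(const char* path) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return false;
        struct stat st;
        if (::fstat(fd, &st) != 0) { ::close(fd); return false; }
        size = static_cast<size_t>(st.st_size);
        void* p = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);                       // the mapping keeps the file alive
        if (p == MAP_FAILED) { size = 0; return false; }
        data = static_cast<const uint8_t*>(p);
        return true;
    }

    void Close() {
        if (data) ::munmap(const_cast<uint8_t*>(data), size);
        data = nullptr;
        size = 0;
    }
};
```

A MemoryStream-style wrapper then only needs this pointer and size to serve the usual stream interface.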
Rewaz said:
like a static map that always keeps texture X loaded, since it's used like 80% of the time; when I need it, I can get it instantly. Have some "refCount" so that when two objects use the same texture, I can use the same loaded pointer for both.
This is a matter of optimizing memory, not load times, and you should ALWAYS use references for your models rather than give each model its own texture instance. You should also share the same mesh, shader etc. between similar models.
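A minimal sketch of that sharing, using std::shared_ptr/std::weak_ptr for the refcounting (the Texture struct is a placeholder; in a real loader you would fill it with the D3D9 resource instead):

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Placeholder texture record; a real one would hold the GPU resource,
// e.g. an IDirect3DTexture9* under D3D9.
struct Texture {
    std::string name;
};

// Illustrative texture cache: every model holds a shared handle, so a
// texture used by many objects is loaded exactly once and freed when
// the last user releases it.
class TextureCache {
public:
    std::shared_ptr<Texture> Acquire(const std::string& name) {
        auto it = cache_.find(name);
        if (it != cache_.end())
            if (auto alive = it->second.lock())   // still referenced somewhere
                return alive;
        // Not cached (or already released): load it once.
        auto tex = std::make_shared<Texture>(Texture{name});
        cache_[name] = tex;                       // weak entry: cache owns nothing
        return tex;
    }

private:
    std::unordered_map<std::string, std::weak_ptr<Texture>> cache_;
};
```

Because the cache only stores weak references, it never keeps a texture alive on its own; the refcount lives entirely in the handles the models hold.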
Rewaz said:
The textures are packed into .pak files (for example, I have 500 MB in the texture folder, so I do 500 / 100 and get 5 files of 100 MB each with all the textures inside). There is also an encrypted file which contains all the metadata (fileName, offset, endPos), like a DB of those assets. I read that header first, so if I want "PLAYER.MODEL.02" I know in which of those 5 files my texture is, at which offset it starts and at which position it ends. That way I end up with a buffer like char buffer[end - offset] holding the raw data of the .DDS.
This solution sounds similar to what I use; however, our .sep archives also use a digital signature algorithm and hashes to ensure no one has modified them. A major difference, also in performance, between our format and yours is that we use 64k chunks and pad our data to them. As I wrote above, this has the advantage of filling an entire disk cache line, which prevents jumping around on the hard drive. 64k is also a multiple of 4k, the usual OS memory page size.
Once, in a job, I also wrote a database from scratch with the same techniques. Thanks to memory-mapped I/O, it was able to handle up to 5 TB of data while access was even faster than reading from disk through a stream. This is how I did it:
- Handle everything in chunks of 64k
- Always load a full chunk, no matter if you need 1 or 1000 bytes
- Don't try to optimize storage, some trash bytes in between chunks are ok
Our packaging tool also tries to puzzle the data into chunks in the most optimized way and writes the header chunks afterwards. A file may take several chunks on its own, share some chunks with other data, or be so small that multiple files fit into one chunk. It doesn't matter where the data is; the primary goal is to have it stored for linear access on page/chunk boundaries.
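As a rough sketch of what such a chunk-based index lookup could look like (the names and fields are illustrative, not our actual .sep format):

```cpp
#include <cstdint>
#include <string>

constexpr uint64_t kChunkSize = 64 * 1024;  // one disk cache line, 16 OS pages

// Illustrative archive index entry, in the spirit of the
// (fileName, offset, endPos) records described above.
struct IndexEntry {
    std::string name;     // e.g. "PLAYER.MODEL.02"
    uint64_t    offset;   // byte offset of the payload inside the archive
    uint64_t    size;     // real payload size, padding excluded; must be > 0
};

// Which contiguous chunk range must be fetched to read this entry?
// Entries may share chunks or span several of them; either way the
// loader always reads whole 64k chunks for linear access.
inline void ChunkRange(const IndexEntry& e, uint64_t& first, uint64_t& count) {
    first = e.offset / kChunkSize;
    uint64_t last = (e.offset + e.size - 1) / kChunkSize;
    count = last - first + 1;
}
```

The loader then maps or reads count chunks starting at first * kChunkSize and slices the payload out of them.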
Rewaz said:
Sorry, didn't quite understand, how do I fill the HDD cache?
Your data needs to be as large as the disk cache line, which is usually 64k bytes. If your data is smaller, fill the remaining space with trash bytes to reach that size; if it is larger, pad it to the next multiple of that size.
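Assuming the 64k figure, the padding is one line of arithmetic (an illustrative helper, not from any particular codebase):

```cpp
#include <cstdint>

constexpr uint64_t kDiskCacheLine = 64 * 1024;

// Round a payload size up to the next multiple of the disk cache line.
// kDiskCacheLine is a power of two, so the bitmask form is valid.
inline uint64_t PadToCacheLine(uint64_t size) {
    return (size + kDiskCacheLine - 1) & ~(kDiskCacheLine - 1);
}
```

So a 1-byte file still occupies 64k on disk, and a 100k file is padded out to 128k; the difference is filled with trash bytes by the packer.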
Rewaz said:
I don't want something that complicated, honestly; I'd prefer something like a memory pool with automatic alignment. That's why I'd prefer some library, because I know there should be a good one out there, not the best, but useful.
I doubt there is a library out there that you will want to use and be happy with for the rest of your life. As I wrote, memory allocation is always a matter of personal taste and differs from one engine/developer to another.
Do you want memory buckets, garbage collection, or something not even thought of yet? Do you prefer memory reallocation, and how do you handle pointers then? What about different platforms and CPU memory alignment? Those questions allow millions of possible designs, and so will a memory allocator.
I really suggest reading the blog first and then thinking about what you want; if you really "wanted to do the code from 0", you'll get this done too. It is, however, important to write your own std::shared_ptr, so that whatever memory manager you intend to use gets the chance to reallocate memory for better use. This is the main secret of memory management; otherwise you could just fire-and-forget malloc/free and not care about anything. The real magic happens when you optimize the memory by filling gaps and moving data around, and handling all of this in a multithreaded environment is a lot of fun (and failure).
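To make the reallocation point concrete, here is a bare-bones sketch of the usual trick: user code holds a stable handle, and when the allocator compacts memory it only patches one table entry instead of chasing every raw pointer. The names are illustrative, and a real engine version would add refcounts, generation counters and locking:

```cpp
#include <cstdint>
#include <vector>

// Illustrative indirection table for relocatable memory: handles stay
// valid across moves because user code never stores the raw pointer.
class HandleTable {
public:
    using Handle = uint32_t;

    // Register a block and hand out its stable slot index.
    Handle Register(void* block) {
        slots_.push_back(block);
        return static_cast<Handle>(slots_.size() - 1);
    }

    // Look the current address up on every access.
    void* Resolve(Handle h) const { return slots_[h]; }

    // Called by the allocator after moving a block during compaction:
    // every outstanding handle stays valid, only the entry changes.
    void Relocate(Handle h, void* newBlock) { slots_[h] = newBlock; }

private:
    std::vector<void*> slots_;
};
```

A custom shared_ptr built on top of this would store the Handle plus a refcount and call Resolve() on dereference, which is exactly what gives the memory manager the freedom to move data around.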