
CPU-side performance...

Started by March 12, 2014 09:32 AM
0 comments, last by cr88192 10 years, 8 months ago

basically, I am mostly just sitting around writing plain C code, largely without resorting to anything "fancy" to get performance (ex: ASM, SIMD, multithreading, ...), though admittedly there was some amount of micro-optimizing and "fine tuning" (a lot of the fancier stuff is used elsewhere in the project, just not in the code in question).

ok, so?...

I "recently" went and wrote a version of my currently-active video codec mostly for real-time capture, and with some fiddling got it up to doing around 1680x1050 at 30fps (with the recording front-end being VirtualDub). basically does full desktop capture at typically around 12% CPU load on my system (though if the CPU load for VirtualDub hits 15% it will drop below 30 and start dropping frames, with a single-threaded encoder).

what is a mystery to me is: ok, I have done this, but why then are there so few other options that can do similarly or better?

one program will encode using MS-CRAM (MS Video 1), which goes fast enough but the video quality leaves much to be desired.

FRAPS sort of holds up ok (despite the free version having a recording time-limit and similar), but it grinds the HDD pretty badly and goes through unreasonably large amounts of HDD space.

another program based around x264 basically runs the CPU at full load (on all cores), lags the computer somewhat, and has to use downsampling to be able to record in real-time (even on settings intended for high-speed encoding).

well, ok, there is Lagarith: using it via VirtualDub pulls around 27 fps and runs the CPU at 30%-40% load, with VirtualDub still dropping frames (still a viable option though).

also tested capturing with XviD, but it didn't hold up well (it quickly dropped below 20 fps while pretty much maxing out the CPU in the process).

well, never mind the differences between the various formats, which can contribute a fair bit to the computational cost of encoding.

well, and I also threw together a BC7 encoder (now with partition support) that is generally fast enough for load-time encoding (still probably a bit too slow for real-time though).

then again, I don't use a brute-force search, instead driving most of the process by arithmetic and lookup tables.

ex:

RGB -> YCbCr, then do VQ magic on the CbCr values, and use this to drive a lookup table (to get the partition; a series of LUTs is built when the coder initializes, mapping chroma-space vectors to partition numbers; a rough sketch of this is given below, after the footnotes);

the partition is then used by the good old CYGM endpoint-selector (*), which chooses endpoints independently per-partition;

the various options (block format, etc.) are then evaluated, and the final output block is produced.

the logic could still be faster (ex: as-is, it has to run the filter / endpoint classifier multiple times per block, which seems to be the bottleneck here).

*: CYGM=Cyan, Yellow, Green, Magenta. basically the pixels are converted into a CYGM-based color-space, and this is used to evaluate the endpoints (in prior tests of naive linear classifier axes, CYGM seemed to give the best results). this was also computationally cheaper than some other options, while giving generally better results than simply classifying things based on Luma.

(the CYGM filter was previously used some with video recording, but framerates didn't hold at higher resolutions, so recording has reverted to a luma-driven selector, with a potentially cheaper algorithm being considered as an option to improve image quality. it is still used for batch encoding, load-time texture conversion, and similar, as these are less speed-critical.)
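for illustration, a minimal sketch of the chroma-LUT partition lookup idea (the table size, scaling, and names here are assumptions for illustration, not the actual code):

#include <stdint.h>

#define PART_LUT_BITS  5                    /* assumed: 32x32 chroma grid */
#define PART_LUT_SIZE  (1 << (2 * PART_LUT_BITS))

static uint8_t part_lut[PART_LUT_SIZE];     /* filled at coder init (the "VQ magic" / training step) */

/* integer RGB -> YCbCr approximation (BT.601-style weights) */
static void rgb_to_ycbcr(int r, int g, int b, int *y, int *cb, int *cr)
{
    *y  = ( 77 * r + 150 * g +  29 * b) >> 8;
    *cb = ((-43 * r -  85 * g + 128 * b) >> 8) + 128;
    *cr = ((128 * r - 107 * g -  21 * b) >> 8) + 128;
}

/* map a block's average chroma to a partition number via the LUT */
static int lookup_partition(int avg_cb, int avg_cr)
{
    int i = avg_cb >> (8 - PART_LUT_BITS);
    int j = avg_cr >> (8 - PART_LUT_BITS);
    return part_lut[(j << PART_LUT_BITS) | i];
}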

I don't really get it; it didn't seem all that difficult to get ok speeds out of BC7 encoding, but a lot of the existing encoders seem to take a very long time and need to resort to GPGPU stuff and similar...

unless the time is mostly spent searching for an "optimal" solution, rather than settling for a "good approximation" as sufficient?... it does still seem a little steep though.

well, also BC7 is used as the current primary texture format for sending video to the GPU (replacing DXT5, mostly as BC7 can give better image quality).

sorry if there is no particular question here, but, thoughts?...


I don't really get it; it didn't seem all that difficult to get ok speeds out of BC7 encoding, but a lot of the existing encoders seem to take a very long time and need to resort to GPGPU stuff and similar...

unless the time is mostly spent searching for an "optimal" solution, rather than settling for a "good approximation" as sufficient?... it does still seem a little steep though.
It is quite trivial to encode BC7 since it's merely an interpolation between some values that you can choose, but finding an optimal encoding is daunting.

Also, there are at least four different interpretations of "optimal":

  1. Minimum RMS difference within each block
  2. "Looks best" regardless of any statistical metrics
  3. Minimum blocking artefacts (that is, minimum difference between individual 4x4 blocks) while also attaining a reasonable approximation for either (1) or (2)
  4. High compression ratio if a general purpose compressor is applied after BC

If you spend a few minutes thinking, you can probably come up with another 2 or 3.

I can't tell why the different video codecs that you have tried are so different; it seems rather unlikely that they are simply not optimized properly (though of course, that's possible). There may be differences in how different codecs compress the DCT-transformed data (or whatever transform they use), e.g. Huffman vs. arithmetic coding, or in how they traverse the image (Morton order?), which might make the difference between running 3 or 4 times (or 20 times?) slower. For some formats these details are exactly defined, but not necessarily for all.

It is also very easy to use more CPU and run slower with multithreading, if done wrong (I'd assume someone writing a video codec gets it right, but you never know).

What's funny is that yours only takes 15% CPU, which means that either you are stalled waiting for image data most of the time, or you're stalled in a blocking disk write -- but you're certainly not doing compression work most of the time.


I don't really get it; it didn't seem all that difficult to get ok speeds out of BC7 encoding, but a lot of the existing encoders seem to take a very long time and need to resort to GPGPU stuff and similar...

unless the time is mostly spent searching for an "optimal" solution, rather than settling for a "good approximation" as sufficient?... it does still seem a little steep though.

It is quite trivial to encode BC7 since it's merely an interpolation between some values that you can choose, but finding an optimal encoding is daunting.


yes.

though there are the usual time-wasters:
pulling in input pixels;
writing output bits.

some things, like the classifier or luma function, tend to influence things a fair bit, since they represent work that needs to be done per-pixel.

the luma classifier basically just calculates luma, and chooses the brightest and darkest pixels as the endpoints.
the CYGM classifier basically does the same thing as the luma classifier, but evaluating each pixel along each of the color axes, then afterwards making a choice based on the distances between the resulting values. however, because it has to calculate and classify multiple values per pixel, it runs a bit slower.
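roughly, the luma classifier amounts to something like this (the luma weights and names here are my own approximation, not necessarily what the actual code does):

#include <stdint.h>

/* pick the min-luma and max-luma pixels of a 4x4 block as the two endpoints;
   the weights are a BT.601-style integer approximation (an assumption here) */
static void luma_pick_endpoints(const uint8_t rgba[16][4], int *end_lo, int *end_hi)
{
    int i, y, y_min = 0x7FFFFFFF, y_max = -1;
    *end_lo = *end_hi = 0;
    for (i = 0; i < 16; i++)
    {
        y = 77 * rgba[i][0] + 150 * rgba[i][1] + 29 * rgba[i][2];
        if (y < y_min) { y_min = y; *end_lo = i; }
        if (y > y_max) { y_max = y; *end_hi = i; }
    }
    /* rgba[*end_lo] and rgba[*end_hi] then serve as the block's color endpoints */
}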

another considered classifier strategy was:
calculate average pixel;
calculate error-matrix (basically, a similar calculation to an inertia tensor, based on the delta from the average);
calculate vector from matrix;
re-run and classify pixels along this vector.

this was not used mostly because the logic was unreasonably expensive.
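for reference, a rough sketch of that strategy (a generic covariance / power-iteration approach, not taken from the actual code):

#include <stdint.h>
#include <math.h>

/* estimate the principal color axis of a 4x4 block via a covariance matrix
   (the "error matrix") and a few power-iteration steps */
static void block_principal_axis(const uint8_t rgba[16][4], float axis[3])
{
    float avg[3] = {0, 0, 0}, cov[3][3] = {{0}}, v[3] = {1, 1, 1};
    float d[3], t[3], len;
    int i, j, k, it;

    for (i = 0; i < 16; i++)
        for (j = 0; j < 3; j++)
            avg[j] += rgba[i][j] / 16.0f;

    for (i = 0; i < 16; i++)
    {
        for (j = 0; j < 3; j++)
            d[j] = rgba[i][j] - avg[j];
        for (j = 0; j < 3; j++)
            for (k = 0; k < 3; k++)
                cov[j][k] += d[j] * d[k];
    }

    /* a few power-iteration steps converge on the dominant eigenvector */
    for (it = 0; it < 4; it++)
    {
        for (j = 0; j < 3; j++)
            t[j] = cov[j][0] * v[0] + cov[j][1] * v[1] + cov[j][2] * v[2];
        len = sqrtf(t[0] * t[0] + t[1] * t[1] + t[2] * t[2]);
        if (len < 1e-6f)
            break;                          /* flat block: keep the previous vector */
        for (j = 0; j < 3; j++)
            v[j] = t[j] / len;
    }
    for (j = 0; j < 3; j++)
        axis[j] = v[j];
    /* pixels are then projected onto 'axis' and classified along it */
}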


some variants (of the BC7 block encode and decode logic) partly sidestep the bits issue by packing the bits into larger values which can be read/written as a single unit: for example, rather than reading/writing four 7-bit fields separately, reading/writing a single 28-bit value, ... (this being mostly due to the cost of the "if()" checks to determine whether to emit or read in another byte).
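a minimal sketch of the idea, assuming an accumulator-style bit writer (the names are illustrative):

#include <stdint.h>

typedef struct {
    uint64_t acc;                           /* bit accumulator */
    int      nbits;                         /* bits currently held in acc */
    uint8_t *out;                           /* output position */
} BitWriter;

/* append the 'n' low bits of 'val' to the stream (n <= 32) */
static void bw_put(BitWriter *bw, uint32_t val, int n)
{
    bw->acc   |= (uint64_t)val << bw->nbits;
    bw->nbits += n;
    while (bw->nbits >= 8)                  /* flush whole bytes, no per-field branching */
    {
        *bw->out++ = (uint8_t)bw->acc;
        bw->acc  >>= 8;
        bw->nbits -= 8;
    }
}

/* e.g. pack four 7-bit fields into one 28-bit value and emit it in one call,
   instead of four separate calls with their own branching */
static void bw_put4x7(BitWriter *bw, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    bw_put(bw, a | (b << 7) | (c << 14) | (d << 21), 28);
}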

in most of my (normal) block encoders, no attempt is made to minimize error; they just proceed top-down via if/else chains and heuristics, usually with me fiddling with the logic in an attempt to get good image quality from whatever I was using for testing.


the partition-based encoder, though, does make use of MSE/RMSE in choosing the output block format.
generally modes 1, 3, 5, and 6 seem to dominate at present.
modes 0, 2, and 4 are chosen less often.

thus far, the encoder seems to never choose mode 7, implying that the logic for it is potentially broken.

the logic basically works by producing blocks in each format, then decoding them and comparing them with the original.
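in outline, something like this (the per-mode encode/decode helpers are placeholders for the real functions):

#include <stdint.h>
#include <string.h>

/* hypothetical per-mode helpers; encode returns non-zero if the mode was attempted */
int  bc7_try_encode_mode(int mode, const uint8_t rgba[16][4], uint8_t blk[16]);
void bc7_decode_block(const uint8_t blk[16], uint8_t rgba[16][4]);

/* pick the candidate mode with the lowest sum of squared error */
static void bc7_encode_best(const uint8_t rgba[16][4], uint8_t out_blk[16])
{
    uint8_t blk[16], dec[16][4];
    long    err, best_err = -1;
    int     mode, i, j, d;

    for (mode = 0; mode < 8; mode++)
    {
        if (!bc7_try_encode_mode(mode, rgba, blk))
            continue;                       /* mode not attempted / not supported */
        bc7_decode_block(blk, dec);
        err = 0;
        for (i = 0; i < 16; i++)
            for (j = 0; j < 4; j++)
            {
                d = (int)rgba[i][j] - (int)dec[i][j];
                err += d * d;
            }
        if (best_err < 0 || err < best_err)
        {
            best_err = err;
            memcpy(out_blk, blk, 16);
        }
    }
}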

Also, there are at least four different interpretations of "optimal":

  1. Minimum RMS difference within each block
  2. "Looks best" regardless of any statistical metrics
  3. Minimum blocking artefacts (that is, minimum difference between individual 4x4 blocks) while also attaining a reasonable approximation for either (1) or (2)
  4. High compression ratio if a general purpose compressor is applied after BC

If you spend a few minutes thinking, you can probably come up with another 2 or 3.


probably. I am not really sure what exactly the standard encoders are aiming for, but minutes per texture seems a bit severe in any case.

I can't tell why the different video codecs that you have tried are so different; it seems rather unlikely that they are simply not optimized properly (though of course, that's possible). There may be differences in how different codecs compress the DCT-transformed data (or whatever transform they use), e.g. Huffman vs. arithmetic coding, or in how they traverse the image (Morton order?), which might make the difference between running 3 or 4 times (or 20 times?) slower. For some formats these details are exactly defined, but not necessarily for all.
It is also very easy to use more CPU and run slower with multithreading, if done wrong (I'd assume someone writing a video codec gets it right, but you never know).


different ones use different strategies:
x264 and XviD are both DCT based, but given they are being used with fast settings (such as x264 "ultrafast"), they should presumably encode quickly (I don't know much about the format specifics, but I wouldn't expect the designs to be *that* hostile to viable real-time encoding).

Lagarith is basically like an arithmetic-coded PNG-like format (predicting pixels and entropy-coding the differences).
FRAPS also uses a PNG-like encoding (but is Huffman coded AFAIK).


though, granted, in my case, M-JPEG was one of my slower encoding strategies, with a fair bit of effort put into trying to make it reasonably fast.
like with a lot of my other encoders, most of the time seems to get spent on dealing with input pixels (YCbCr conversion, ...), followed by the logic to spit out coded blocks (so, more stupid bit-writing calls; JPEG also adds the cost of needing to check for and escape 0xFF bytes, ...).
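(for reference, the JPEG rule in question: any 0xFF byte in the entropy-coded data has to be followed by a stuffed 0x00 so the decoder doesn't mistake it for a marker, roughly:)

#include <stdint.h>

/* emit one entropy-coded byte, stuffing a 0x00 after any 0xFF
   so it can't be confused with a marker */
static uint8_t *jpeg_emit_byte(uint8_t *out, uint8_t b)
{
    *out++ = b;
    if (b == 0xFF)
        *out++ = 0x00;
    return out;
}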

couldn't get it much past around 60 megapixels/second in benchmarks, and it bogs down if used for capture at higher resolutions (in my tests, in-engine capture doesn't really work well at higher resolutions if the encode speed is much below around 100-120 megapixels/second in benchmarks). so, it ended up largely not being used for real-time capture.

though, in a strict sense, 1680x1050 at 30fps should only require about 53 Mpix/sec; the rest of the headroom seems to be mostly eaten by buffering for IO / timing / image acquisition / ...


my current codec basically works like this:
converts input pixels into a "metablock" format (basically similar to a DXT5/BC7 hybrid, but using 256 bits per block instead of 128, and typically storing raw 32-bit color endpoints, with everything kept nicely byte-aligned like in DXT5, *1);
encodes the blocks into an RPZA-like format (*2);
runs the result through a speed-optimized LZ77+Huffman encoder (the format is Deflate-like, but uses a 1MB sliding window and up to 64kB matches). it is kept fast mostly by only checking a single match candidate and working single-pass (using "running statistics" to build the Huffman tables); a rough sketch of the match lookup is given below.
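the single-match lookup is along these lines (a generic single-probe hash match finder; the window size, hash size, and names are assumptions, not the actual encoder):

#include <stdint.h>
#include <string.h>

#define LZ_HASH_BITS  15
#define LZ_WINDOW     (1 << 20)             /* 1MB sliding window */
#define LZ_MAX_MATCH  0xFFFF                /* up to 64kB matches */

static uint32_t lz_head[1 << LZ_HASH_BITS]; /* most recent position per hash slot */

static uint32_t lz_hash(const uint8_t *p)   /* hash the next 4 bytes (caller ensures they exist) */
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - LZ_HASH_BITS);
}

/* single-probe match finder: check only the most recent position with the same
   hash and extend the match; no chain walking, so it stays fast */
static int lz_find_match(const uint8_t *buf, uint32_t pos, uint32_t end,
                         uint32_t *match_pos)
{
    uint32_t h = lz_hash(buf + pos), cand = lz_head[h];
    int      len = 0, max = (int)(end - pos);

    if (max > LZ_MAX_MATCH)
        max = LZ_MAX_MATCH;
    lz_head[h] = pos;                       /* running update of the hash table */
    if (cand < pos && pos - cand <= LZ_WINDOW)
    {
        while (len < max && buf[cand + len] == buf[pos + len])
            len++;
        *match_pos = cand;
    }
    return len;                             /* caller emits a match if len passes a threshold */
}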

*1: for real-time video-capture, a simple luma-classifier is used. the block encoder also unrolls pretty much everything, basically transforming the block as one big ugly mass of expressions (though with an annoying number of "if()" blocks, and "if()" isn't exactly cheap; most of these are related to pixel classification, which uses a slightly odd / non-raster pixel ordering, *3).

though, on the plus side, since only Y needs to be calculated per pixel (and not Cb or Cr), there is a small saving here.

*2: it is derived from RPZA, and is in the same general category as MS-CRAM.

basically it either encodes runs of skipped blocks (unchanged from the prior frame), flat-color blocks, or raw blocks (in this case, a typical block uses 80 bits: 48 bits for the 23-bit color endpoints and 32 bits of interpolation data).
the format supports a lot more features than this, but this is basically what is used for real-time recording.
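in rough outline, the block commands amount to something like this (the opcodes and byte layout here are illustrative guesses at the general scheme, not the actual bitstream):

#include <stdint.h>

enum {
    OP_SKIP_RUN = 0,                        /* N blocks unchanged from the prior frame */
    OP_FLAT_RUN = 1,                        /* N blocks filled with a single color */
    OP_RAW      = 2                         /* one coded block: endpoints + interpolation bits */
};

static uint8_t *emit_skip_run(uint8_t *out, int count)
{
    *out++ = OP_SKIP_RUN;
    *out++ = (uint8_t)count;
    return out;
}

static uint8_t *emit_flat_run(uint8_t *out, int count, uint32_t color)
{
    *out++ = OP_FLAT_RUN;
    *out++ = (uint8_t)count;
    *out++ = (uint8_t)(color);              /* R */
    *out++ = (uint8_t)(color >> 8);         /* G */
    *out++ = (uint8_t)(color >> 16);        /* B */
    return out;
}

static uint8_t *emit_raw_block(uint8_t *out, const uint8_t endpoints[6], uint32_t interp)
{
    int i;
    *out++ = OP_RAW;
    for (i = 0; i < 6; i++)                 /* 48 bits of endpoint data */
        *out++ = endpoints[i];
    for (i = 0; i < 4; i++)                 /* 32 bits of interpolation data */
        *out++ = (uint8_t)(interp >> (8 * i));
    return out;
}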

support for a larger/higher-quality special-case block has been partly developed, which would store a 4x4 block as four 2x2-pixel sub-blocks, with each sub-block having its own color (working similarly to ETC/ETC2). the motivation is mostly that this should be a lot cheaper (in terms of encoding cost) than a CYGM classifier, while partition-based blocks are non-viable for real-time encoding.


*3: ADD:
basically, if pixels are:
00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

they can be visited in the order:
00, 33, 03, 30,
01, 32, 13, 20,
02, 10, 31, 23,
11, 22, 12, 21

effectively trying to maximize the distance between pixels, and clustering most of the branch mispredicts into the first several pixels (though, the profiler implies that this order may still not be ideal).

ADD2: this seems to be a better traversal order:

00, 33, 03, 30, 11, 22,
01, 32, 13, 20, 12, 21,
02, 10, 31, 23
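written out as an index table (index = row*4 + column), the ADD2 order is (the classify_pixel in the usage comment is a hypothetical helper):

/* ADD2 traversal order from above, as flat pixel indices (index = row*4 + column) */
static const int blk_order[16] = {
     0, 15,  3, 12,  5, 10,
     1, 14,  7,  8,  6,  9,
     2,  4, 13, 11
};

/* usage: visit the block's pixels in this order instead of raster order, e.g.
   for (i = 0; i < 16; i++)
       classify_pixel(&rgba[blk_order[i]]);            */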

What's funny is that yours only takes 15% CPU, which means that either you are stalled waiting for image data most of the time, or you're stalled in a blocking disk write -- but you're certainly not doing compression work most of the time.

well, but on a quad-core, a single thread is limited to 25% of total CPU (going above 25% requires more threads), so 15% overall means about 60% of that one core's time is being used.

but, yeah, IO or similar is likely an issue, as VirtualDub can only maintain 30 fps when the load stays below 15%.

its ability to maintain 30fps also depends somewhat on settings within VirtualDub, such as disabling preview during capture (preview did hurt the framerate).
it apparently uses GDI+ for image acquisition, but the OpenGL and DXGI options don't work.

