Edit: to clarify, this is about reading an on-disk file through a memory-mapped file, not about shared-memory-only mappings (which may have been a source of confusion in the discussion).
I've heard that reading a disk file through a memory-mapped file should be very fast when the file is already in the OS page cache, because the application can read directly from that cache. But in a test I noticed an overhead of about 5,000 CPU cycles per 4 KB page on first access (page fault handling etc.), which makes it twice as slow (2 GB/s) as simply fread()'ing the data into a buffer (4 GB/s), and 11 times as slow as reading from a plain large memory block (22 GB/s). Reading the memory-mapped file a second time (without remapping it) is as fast as reading from a memory block. I used a 100 MB file so that the CPU cache should not be a factor.
Also, when a newly allocated 100 MB memory block (for which malloc() presumably forwards to VirtualAlloc()) is read for the first time, it suffers the same 10x penalty as a memory-mapped file.
Is there a way to overcome this overhead, or is it true that memory-mapped files are slower even in the cached case, where they were supposed to be at their best? It looks like there's a per-page committing cost of some kind, and it's much higher than simply copying or zeroing memory. It's possible to call VirtualLock() to commit several pages at once, but total performance remains the same. Flags passed to CreateFile or CreateFileMapping (e.g. FILE_FLAG_SEQUENTIAL_SCAN) didn't seem to help either.