Read from main memory will take 100+ clocks
{writes} ... are around 1 clock
Ah! You're looking at thread-blocking stall time. I'm looking at overall system throughput.
What actually happens when you write is that a cache line is allocated for the write (which needs bus arbitration on multi-CPU systems), and the bytes are put in the right part of the cache line (or, for unaligned writes, cache lines). Then a read is issued to fill the rest of the cache line. Later, once the cache line gets evicted, the memory bus is used again to put the data back out.
At a minimum, this will take twice as long as a read, because the line crosses the memory bus once on the way in and once on the way out. If you have data structures (or a heap) that put data that different CPUs want to update into the same cache line (false sharing), the CPUs will end up ping-ponging the cache line between them, costing more than 10x the "base" cost.
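To make the ping-pong case concrete, here is a rough C++ sketch of the two layouts. The names (SharedCounters, PaddedCounters, same_cache_line) are made up for illustration, and the 64-byte line size is an assumption that holds on most current x86 parts but should be checked for your target.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is common, but not universal.
constexpr std::size_t kCacheLine = 64;

// Both counters normally land in the same cache line, so two CPUs updating
// "a" and "b" separately still fight over that one line.
struct SharedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter is aligned to its own cache line, so the CPUs stop
// invalidating each other's copy.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<std::uint64_t> a{0};
    alignas(kCacheLine) std::atomic<std::uint64_t> b{0};
};

// True if both addresses fall inside the same kCacheLine-sized block.
inline bool same_cache_line(const void* p, const void* q) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(q) / kCacheLine;
}

// Small usage check: the padded layout keeps the two counters apart.
inline void check_padded_layout() {
    PaddedCounters p;
    assert(!same_cache_line(&p.a, &p.b));
}
```

The padded version spends 128 bytes on two counters, so it only makes sense for data that really is written by different CPUs at a high rate.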
Similarly, unaligned writes that span cache line boundaries may cause significant additional slowdowns, depending on the specific (micro-)architecture. (Some RISC architectures disallow unaligned access for this reason.)
Btw: When a CPU thread stalls, waiting for a read to complete, a hyper-threaded CPU will switch to its next thread. Hopefully, that thread was previously stalled waiting for a cache line that has now loaded, so it can still do useful work.
The recommendation is still: If you can do a compare and a branch to avoid a write, that will improve your overall throughput if you are memory bound (which almost all workloads are).
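In code, the compare-and-branch amounts to something like the sketch below. store_if_changed is a hypothetical helper name, not something from a library; the point is only that the cache line stays clean when the value is already what you want.

```cpp
#include <cassert>

// Writes "value" into "dst" only if it differs from what is already there.
// Returns true if a write actually happened. When the values match, the
// cache line is only read, so it is never marked dirty and never has to be
// written back.
template <typename T>
bool store_if_changed(T& dst, const T& value) {
    if (dst == value) {
        return false;  // skip the write entirely
    }
    dst = value;
    return true;
}

// Small usage check: the second call changes the value, the first does not.
inline void check_store_if_changed() {
    int hit_points = 5;
    assert(!store_if_changed(hit_points, 5));
    assert(store_if_changed(hit_points, 7));
    assert(hit_points == 7);
}
```

Whether this wins depends on how often the value really changes; if nearly every call ends up writing anyway, the extra compare and branch buy you nothing.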
I think we're somewhat below the useful level of abstraction for timers based on priority queues, though :-)
And, finally: If you "step" each entity in your world each game tick anyway, you might as well just have an array of currently attached timers inline in each object, and compare the current game time to the expiry time in each of those timers; they don't need a queue separate from the collection of all game entities.
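A minimal sketch of that inline-timer layout, assuming a fixed number of timer slots per entity and an integer tick counter. Entity, GameTime, kNoTimer and the "fired" counter are all invented for the example; a real game would call some handler where the counter is bumped.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using GameTime = std::uint64_t;  // hypothetical tick counter

// Sentinel meaning "this slot has no timer attached".
constexpr GameTime kNoTimer = std::numeric_limits<GameTime>::max();

struct Entity {
    std::array<GameTime, 4> timers;  // expiry times, stored inline
    int fired = 0;                   // stand-in for a real timer callback

    Entity() { timers.fill(kNoTimer); }

    // Attach a timer in the given slot, expiring at "when".
    void set_timer(std::size_t slot, GameTime when) {
        assert(slot < timers.size());
        timers[slot] = when;
    }

    // Called once per game tick: compare each slot against the current time
    // and write only to the slots that actually expired.
    void step(GameTime now) {
        for (GameTime& t : timers) {
            if (t != kNoTimer && t <= now) {
                t = kNoTimer;
                ++fired;
            }
        }
    }
};

// Step every entity; the timers ride along with the data being touched anyway.
inline void step_world(std::vector<Entity>& world, GameTime now) {
    for (Entity& e : world) {
        e.step(now);
    }
}
```

On ticks where nothing expires, the loop only reads the timer slots, so it costs a few compares on memory the step already pulled into cache.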