Ah! You're looking at thread-blocking stall time. I'm looking at overall system throughput.
What will actually happen when you write is that a cache line will be allocated...
Yep, it is a loooong story. However, the writing thread still does NOT need to wait for the write to complete.
The recommendation still stands: if you can do a compare and a branch to avoid a write, that will improve your overall throughput if you are memory-bound (which almost all workloads are).
Ahh, it depends on HOW "memory-bound" your workload is. From what I've seen, real-world workloads (except for DBs) tend to be memory-LATENCY-bound (i.e. the time the thread spends waiting for a read), rather than memory-THROUGHPUT-bound. Even for DBs I am not 100% sure about it, since the FSB was replaced with NUMA 10 years ago...
Alternatively, if we don't want to rely on "whatever I've seen", we can do some math. A Xeon E5 can pump up to 100GB/sec at 14 cores (http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/10 ), or about 2.5 bytes/core/tick. While it is certainly possible to write a program pumping more, for any kind of real-world load (except, maybe, for a DB) such patterns will be quite unusual. Even more so for simulation games, which traditionally do a bit more than just memcpy :-).
And we're back to square one ;-). In other words, I contend that most of the loads out there are memory-LATENCY-bound, and therefore writes are cheap (and reads are expensive).
I think we're somewhat below the useful level of abstraction for timers based on priority queues, though :-)
As usual :-).