There is a current thread on this: http://www.gamedev.net/topic/681723-faster-sin-and-cos/
Bottom line: a pretty good custom implementation (way better than yours, and about twice as much work) can be easily 10 times, if not 20 times faster than the standard library.
Hardware trig functions have been around since... forever (I think the 386 generation had them already, via the 387 coprocessor). But they have latencies in the 200-cycle range (and I think you can only start one every 200 or so cycles, too). Even traditional 387 math implementations are usually faster (and approximations anyway).
On modern hardware, a couple of fused-multiply-add instructions (which is what your code boils down to) have a latency of maybe a dozen or so cycles at most, and depending on how well the actual code allows pipelining to kick in, it might effectively cost you 1-2 cycles in the best case.
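To make the "couple of fused-multiply-adds" concrete, here is a minimal sketch. It is not the code from this thread: `fast_sin` is a hypothetical name, the coefficients are plain Taylor terms rather than a tuned minimax fit, and the input is assumed to be already reduced to [-pi/2, pi/2].

```cpp
#include <cmath>

// Hypothetical helper (not from the thread): approximates sin(x) for x
// already reduced to [-pi/2, pi/2], using a degree-7 Taylor polynomial in
// Horner form. Each step is one fused multiply-add.
inline float fast_sin(float x) {
    const float x2 = x * x;
    float p = -1.0f / 5040.0f;            // coefficient of x^7
    p = std::fma(p, x2, 1.0f / 120.0f);   // coefficient of x^5
    p = std::fma(p, x2, -1.0f / 6.0f);    // coefficient of x^3
    p = std::fma(p, x2, 1.0f);            // coefficient of x^1
    return p * x;                         // sin is odd, so factor out x
}
```

With these coefficients the error grows toward the ends of the interval (roughly 1.6e-4 at pi/2), which is why a serious implementation would use fitted coefficients. Also note that `std::fma` only compiles down to a single instruction when the target supports hardware FMA and the compiler is told so (for example via `-mfma` or `-march=native` on GCC/Clang); otherwise it may end up as a much slower library call.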
For the timer I use the high resolution timer available on the platform
This is mighty fine for anything from about 10 million iterations upward. QPC has a resolution of about 0.3 µs on my system (which is also an i7).
Thus, you can say that anything that takes at least something-milliseconds will be fine to measure with good precision, anything that is something-microseconds (like ten thousand to a million iterations) is kinda acceptable, but precision will not be stellar, and anything that is something-nanoseconds (such as only running e.g. 5-10 iterations) is a pure bullshit measure.
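A minimal sketch of the "run lots of iterations and divide" approach, under a few assumptions: `ns_per_iteration` is a hypothetical helper, `std::chrono::steady_clock` stands in as a portable substitute for QPC, and the function under test takes and returns a `float`.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls f `iterations` times and returns the average
// wall-clock cost per call in nanoseconds.
template <class F>
double ns_per_iteration(F&& f, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    volatile float sink = 0.0f;  // volatile store keeps the calls from being optimized away

    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        sink = f(static_cast<float>(i) * 1e-7f);  // vary the input a little
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Called as something like `ns_per_iteration([](float x) { return std::sin(x); }, 10000000)`, the total run lands in the millisecond range, where a 0.3 µs timer resolution no longer matters. The result still includes loop and store overhead, so it is best used to compare two implementations measured the same way rather than as an absolute number.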
You can time single iterations (or few) using the CPU's TSC counter, but there are three important gotchas. First, TSC may not be in sync between different cores. That's said to be mostly a "historic problem" and no longer the case, but I just had "negative time" on my current system trying this last week. Second, what you measure is processor cycles, not time. That's actually an advantage because you are no longer subject to different clock rates. If your algorithm takes 15 CPU cycles, it will take 15 cycles at 800MHz and it will still take 15 cycles at 4,000MHz (only now cycles are faster!). It's something to be aware of, however... some people use TSC for measuring time, and that's bound to fail. (One caveat: on CPUs with an invariant TSC the counter ticks at a fixed reference rate, so what you get there is reference cycles, which only approximate actual core cycles when the clock speed changes.) Lastly, to give a precise measurement, you must completely empty the processor pipeline before taking the TSC. In other words, rather than RDTSC, you must execute CPUID followed by RDTSC (CPUID being the only serializing instruction that you can call in a user process).
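A minimal sketch of the CPUID-then-RDTSC sequence, assuming an x86 target and GCC/Clang-style inline assembly (MSVC would need its `__cpuid` and `__rdtsc` intrinsics instead). `serialized_tsc` is a hypothetical name.

```cpp
#include <cstdint>

#if !defined(__x86_64__) && !defined(__i386__)
#error "This sketch assumes an x86 target and GCC/Clang inline assembly."
#endif

// Hypothetical helper: executes CPUID (leaf 0) to drain the pipeline, then
// reads the time stamp counter with RDTSC and returns it as a 64-bit value.
inline std::uint64_t serialized_tsc() {
    std::uint32_t lo, hi;
    asm volatile(
        "cpuid\n\t"               // serializing: earlier instructions must retire first
        "rdtsc"                   // result arrives in EDX:EAX
        : "=a"(lo), "=d"(hi)
        : "a"(0)                  // CPUID leaf 0
        : "ebx", "ecx", "memory"  // CPUID also overwrites EBX and ECX
    );
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

Usage would be to call it once before and once after the code under test and subtract. CPUID itself is not cheap, so it is worth measuring an empty before/after pair first and subtracting that overhead. Pinning the thread to one core sidesteps the cross-core sync issue mentioned above.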