Moore only said that transistor counts double roughly every two years. Over the years, that's become synonymous with "processing power doubles every couple of years," which is really what the consumer cares about anyhow. There's a relationship between performance and transistor count, sure, but no one cares how many transistors their CPU sports.
To continue the single-threaded performance curve, I suspect chip manufacturers will find ways around the slowing of Moore's law. As we approach the limit of transistor shrinks on current technology, we'll switch to new materials and processes, make bigger dies and bigger wafers, and move into '3-D' chips -- stacking 2-D silicon dies one on top of the other. There are still hurdles there today, but they're far from insurmountable.
I also think the transition to predominantly multi-threaded, asynchronous programming can happen quite gradually. As those practices become more mainstream, a chip manufacturer can build future CPUs out of cores that are, for example, only half as fast but consume only 25% of the area of previous cores -- in the same silicon area, aggregate performance doubles. As apps take advantage of this new way of doing things, they'll still seem twice as fast. That said, some problems are inherently sequential, so there will probably always be a place for high single-threaded performance. In the future, I can foresee a CPU with 1-2 'big' cores and 8-16 (or more) smaller ones -- all functionally identical, but the big ones chasing the highest possible IPC and the little ones chasing the smallest reasonable area. Ideally the OS scheduler would know about these different kinds of cores, but you could even handle scheduling to the appropriate kind at a silicon level, migrating threads between core types dynamically.
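The area math above works out like this (the half-speed, quarter-area figures are just the illustrative numbers from the paragraph, not real chip data):

```python
# Hypothetical figures from the argument above: a small core runs at
# half the speed of a big core but takes a quarter of its area.
BIG_CORE_AREA = 1.0
BIG_CORE_PERF = 1.0

small_area = BIG_CORE_AREA * 0.25
small_perf = BIG_CORE_PERF * 0.5

# How many small cores fit in the area of one big core:
small_per_big = BIG_CORE_AREA / small_area   # 4 small cores

# Aggregate throughput from that same silicon area:
aggregate = small_per_big * small_perf       # 4 * 0.5 = 2.0

print(aggregate)  # 2.0 -- double the throughput, same area
```

Of course that 2x only materializes if the workload actually spreads across all four cores, which is exactly why the software transition matters.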
Another angle is stream processing (GPGPU, et al.), which accounts for a significant portion of our heaviest workloads. There, we've already figured out how to spread work across thousands of execution units. If your problem fits that computing model, we're already at a point where we can just throw more area at it, or spread it across multiple chips trivially.
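A tiny sketch of what makes a workload fit that model -- every item is independent, so throughput scales with worker count. This uses a handful of Python threads just to illustrate the shape; a GPU does the same thing with thousands of hardware lanes:

```python
from concurrent.futures import ThreadPoolExecutor

def scale(x):
    # Each work item is independent -- the defining property of a
    # stream/data-parallel workload: no cross-item communication.
    return x * 2.0

data = list(range(1000))

# Because no item depends on any other, adding workers (or chips)
# scales throughput almost linearly.
with ThreadPoolExecutor(max_workers=4) as pool:
    result = list(pool.map(scale, data))

print(result[:3])  # [0.0, 2.0, 4.0]
```

The moment items start depending on each other, this stops being trivial -- which is why not every heavy workload gets this escape hatch.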
The single-threaded performance wall is a hardware issue. The multi-threaded performance wall is a wetware (that's us) issue.