Actually, it applies equally well to both TCP and UDP. From the OS side of things it doesn't matter whether it is TCP or UDP, multiple sockets or a single socket: these systems are just going to generate events which wake a thread to pull the data.
Yes and no. The hardware generates interrupts of course, and the OS eventually wakes a thread, but not necessarily (indeed, not usually) in a 1:1 correspondence. However, with poll+receive you have two guaranteed user-kernel-user transitions instead of one.
For TCP that makes sense, since you must somehow multiplex between many sockets; there is not much of a choice if you wish to be responsive. -- For UDP, it's wasting 10k cycles per packet received for nothing, since there is only one socket and nothing to multiplex. You can just as well receive right away instead of doing another round trip. Same for IOCP, where you have two round trips: one for kicking off the operation, and one for checking completion.
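To make the UDP point concrete, here is a minimal sketch (socket names are illustrative): with a single UDP socket there is nothing to multiplex, so a plain blocking recvfrom() is the whole receive path -- no select/poll round trip before every read.

```python
import socket

# Single UDP socket: just block in recvfrom(), no poll step needed.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))          # let the OS pick a free port
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"hello", rx.getsockname())

# One user->kernel->user transition per packet, nothing more.
data, addr = rx.recvfrom(2048)
print(data)                        # b'hello'
tx.close()
rx.close()
```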
Throwing more threads at the problem doesn't help, by the way. Operating systems even try to do the opposite and coalesce interrupts: the network card DMAs several packets into memory and raises a single interrupt. A single kernel thread then does the checksumming and other work (like re-routing and fragment assembly), bound entirely by memory bandwidth, not ALU. It eventually notifies whoever is waiting for data.
A single user thread can easily receive data at the maximum rate any existing network device is capable of delivering, using APIs from the early 1980s. Several threads will receive the same amount of data in smaller chunks, none faster, but with many more context switches.
Basically, what you are describing with waiting on recvfrom and then pushing to a queue is exactly what the underlying OS async solutions would be doing for you. The benefit, even with UDP, is that you can throw more threads at the OS solution to wait on work, without writing the queue portion yourself.
Yes, this is more or less how Windows overlapped I/O or the glibc userland aio implementation works, but not traditional Unix-style non-blocking I/O (or socket networking as such). Of course, in reality there is no queue at all; it exists only conceptually, insofar as the worker thread reads into "some location" and then signals another thread by some means.
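The "recvfrom + hand-off" pattern described above can be sketched like this (names such as `inbox` and `reader` are mine, not from any API): one worker blocks in recvfrom and hands each datagram to a consumer via a queue, which is conceptually what the OS async machinery does for you.

```python
import queue
import socket
import threading

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
inbox = queue.Queue()              # the hand-off "queue" made explicit

def reader():
    # A real worker would loop; one iteration is enough for the sketch.
    data, addr = rx.recvfrom(2048)  # blocking receive in the worker
    inbox.put((data, addr))         # signal the consumer with the result

t = threading.Thread(target=reader)
t.start()

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"packet-1", rx.getsockname())

data, addr = inbox.get()            # consumer blocks until data arrives
print(data)                         # b'packet-1'
t.join()
tx.close()
rx.close()
```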
Additionally, with Windows IOCP at least, you bypass much of the normal user-space buffering of packet data; instead, the data is written directly into your working buffers.
Yes, though overlapped I/O is troublesome, and in some cases considerably slower than non-overlapped I/O. I have not benchmarked it for sockets, since I deem that pointless, but e.g. for file I/O, overlapped is roughly half the speed on every system I've measured (for no apparent reason). The copy to the userland buffer does not seem to be a performance issue at all, surprising as that is. That's also true for Linux: try one of the complicated zero-copy APIs like tee/splice, and you'll see that while they look great on paper, in reality they're more complicated and more troublesome, but none faster. Sometimes they're even slower than APIs that simply copy the buffer contents -- don't ask me why.
But even disregarding the performance issue (if it exists for overlapped sockets, it likely does not really matter), overlapped I/O is complicated and full of quirks. If it just worked as you expect, without exceptions and special cases, then it would be great, but it doesn't. Windows is a particular piss-head in that respect, but e.g. Linux is sometimes not much better.
Normally, when you do async I/O then your expectation is that you tell the OS to do something, and it doesn't block or stall or delay more than maybe a few dozen cycles, just to record the request. It may then take nanoseconds, milliseconds, or minutes for the request to complete (or fail) and then you are notified in some way. That's what happens in your dreams, at least.
Reality has it that Windows will sometimes just decide that it can serve the request "immediately", even though they have a very weird idea of what "immediately" means. I've had "immediately" take several milliseconds in extreme cases, which is a big "WTF?!!" when you expect that stuff happens asynchronously and thus your thread won't block. Also there is no way of preventing Windows from doing that, nor is there anything you can do (since it's already too late!) when you realize it happened.
Linux, on the other hand, has some obscure, undocumented limits that you will usually not run into, but when you do, submitting a command just blocks for an arbitrarily long time -- bang, you're dead. Since this isn't even documented, it is actually an even bigger "WTF?!!" than on the Windows side (although you can't do anything about it, at least Microsoft tells you right away about the quirks in their API).
In summary, I try to stay away from async APIs since they not only require more complicated program logic but also cause much more trouble than they're worth, compared to just having one I/O thread of your own perform the work using a blocking API (with select/(e)poll/kqueue for TCP, and nothing else for UDP).
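The summary above can be sketched as a single blocking I/O thread multiplexing TCP sockets with select() (epoll/kqueue are drop-in upgrades on Linux/BSD); the socket names here are illustrative, not from any particular codebase.

```python
import select
import socket

# One I/O thread, blocking API, select() to multiplex TCP sockets.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(listener.getsockname())
server, _ = listener.accept()
client.sendall(b"ping")

received = None
while received is None:
    # Block until at least one watched socket is ready.
    readable, _, _ = select.select([listener, server], [], [])
    for sock in readable:
        if sock is server:
            received = sock.recv(2048)  # data is ready; recv won't block

print(received)                         # b'ping'
for s in (client, server, listener):
    s.close()
```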