I'm still learning this myself, but from what I understand this is how the transforms work:
Model Space → World Space → View Space (camera space) → Clip Space → NDC → Screen Space
So after you apply the projection matrix you are in homogeneous clip space. Once the perspective divide happens (x, y and z are divided by w; this is done internally by DirectX) you have the point in normalized device coordinates (NDC), where x and y are in the range -1 to 1 and, in Direct3D, z is in the range 0 to 1. Once you're in NDC, a viewport transform (given by the viewport you set up) is applied to move you from NDC to actual screen space coordinates.
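To make those last two steps concrete, here is a rough sketch of the perspective divide and the viewport transform in plain C++. The types and function names (Vec3, Vec4, Viewport, clipToNdc, ndcToScreen) are made up for illustration and are not DirectX API; the viewport fields just mirror the usual top-left/width/height/min-depth/max-depth setup, and I'm assuming the Direct3D convention where screen y grows downward.

```cpp
#include <cassert>

// Hypothetical helper types for illustration, not DirectX types.
struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };
struct Viewport { float topLeftX, topLeftY, width, height, minDepth, maxDepth; };

// Perspective divide: homogeneous clip space -> NDC.
Vec3 clipToNdc(const Vec4& clip) {
    assert(clip.w != 0.0f);  // a point with w == 0 cannot be divided
    return { clip.x / clip.w, clip.y / clip.w, clip.z / clip.w };
}

// Viewport transform: NDC -> screen space (pixels, origin at top-left, y down).
Vec3 ndcToScreen(const Vec3& ndc, const Viewport& vp) {
    return {
        vp.topLeftX + (ndc.x + 1.0f) * 0.5f * vp.width,
        vp.topLeftY + (1.0f - ndc.y) * 0.5f * vp.height,   // flip y
        vp.minDepth + ndc.z * (vp.maxDepth - vp.minDepth)
    };
}
```

With an 800x600 viewport at the origin, the NDC point (0, 0) lands in the middle of the screen and NDC (-1, 1) lands on the top-left pixel corner.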
In other words, I don't think you can just multiply the pixel position on screen by the inverse of the projection matrix. You have to first reverse the viewport transform to get back to NDC, and then reverse the projection to get to view space.
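As a sketch of that reverse path, the snippet below goes from a pixel to a view-space ray direction. It assumes a standard left-handed perspective projection (the kind XMMatrixPerspectiveFovLH builds), where the projection only scales x and y and copies view-space z into w, so undoing it for a direction is just a division by the (0,0) and (1,1) matrix entries. The names screenToViewRay, proj00 and proj11 are hypothetical, and a different projection would need a full inverse matrix instead.

```cpp
#include <cassert>

// Hypothetical helper types for illustration, not DirectX types.
struct Vec3 { float x, y, z; };
struct Viewport { float topLeftX, topLeftY, width, height; };

// Pixel position -> view-space ray direction through that pixel (at z = 1).
// proj00 and proj11 are the (0,0) and (1,1) entries of the projection matrix.
Vec3 screenToViewRay(float px, float py, const Viewport& vp,
                     float proj00, float proj11) {
    assert(vp.width > 0.0f && vp.height > 0.0f);
    assert(proj00 != 0.0f && proj11 != 0.0f);

    // Step 1: reverse the viewport transform (screen -> NDC).
    float ndcX = 2.0f * (px - vp.topLeftX) / vp.width - 1.0f;
    float ndcY = 1.0f - 2.0f * (py - vp.topLeftY) / vp.height;  // flip y back

    // Step 2: reverse the projection scaling (NDC -> view space at z = 1).
    return { ndcX / proj00, ndcY / proj11, 1.0f };
}
```

The ray starts at the view-space origin (the camera). A single pixel only gives you a direction, not a point, unless you also know the depth at that pixel.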
Just as a disclaimer I'm not 100% sure this is entirely correct but I think this is how it works.
If you need a very good example of an actual implementation of this, you can check out the Picking demo from the DirectX sampler; it should be on GitHub afaik.