Let me try to demonstrate the results you can get (time to put the new site to the test, this is going to be image heavy). For this I've used a real-time path tracer (the one used in the GROOM game).
So let's start with the various results I have experimented with, first the randomization of ray direction:
Fig. 01 - Randomization of ray direction, left to right - 0.1%, 0.5%, 2.5%
This method is extremely fast, but it ends up blurring more distant objects (it behaves as if your focus were at distance 0). GROOM used this solution (with 0.1% randomization).
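In code, the trick is just a tiny random offset added to each ray direction before normalizing. Here's a minimal sketch of how I'd write it - my reconstruction, not GROOM's actual code, and the function/parameter names are my own (`amount` being the randomization strength, e.g. 0.001 for 0.1%):

```cpp
#include <cmath>
#include <random>

struct Vec3 {
    float x, y, z;
};

static Vec3 normalize(const Vec3& v) {
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / len, v.y / len, v.z / len};
}

// Offset the ray direction by a small random vector and renormalize.
// Because the offset is applied in direction space (not scaled by hit
// distance), nearby geometry barely moves on screen while distant geometry
// blurs more and more - hence it behaves like a focus at distance 0.
Vec3 jitterDirection(const Vec3& dir, float amount, std::mt19937& rng) {
    std::uniform_real_distribution<float> u(-1.0f, 1.0f);
    Vec3 d = {dir.x + amount * u(rng),
              dir.y + amount * u(rng),
              dir.z + amount * u(rng)};
    return normalize(d);
}
```

The cost is basically three random numbers and one normalize per ray, which is why it is so cheap compared to a proper lens model.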
A physical simulation of the camera is also possible by doing proper depth of field, which looks like this (increasing lens aperture, constant focal plane):
Fig. 02 - Proper depth of field simulation (lens angle grows from left to right, focal distance is the same)
In Fig. 02, notice the aliasing around the focus plane. That can be solved by combining this with the previous method. This method is a bit more expensive, and I personally didn't want to use it, as in my opinion depth of field should be used in cut-scenes but not in gameplay (as you technically don't know what the player's eyes are focused on).
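For reference, the usual way to do proper depth of field is the thin-lens model: sample a point on a lens disk and aim the ray through the point where the original pinhole ray crosses the focal plane. A minimal sketch (my own version with assumed names, not the engine's code - `right`/`up` are the camera basis vectors spanning the lens plane):

```cpp
#include <cmath>
#include <random>

struct V3 { float x, y, z; };
static V3 add(V3 a, V3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static V3 sub(V3 a, V3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 mul(V3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static V3 norm(V3 v) {
    float l = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / l, v.y / l, v.z / l};
}

struct Ray { V3 origin, dir; };

// Turn a pinhole camera ray into a thin-lens ray. All lens rays pass through
// the point where the pinhole ray hits the focal plane, so geometry on that
// plane stays sharp; everything else blurs proportionally to 'aperture'.
Ray thinLensRay(const Ray& pinhole, V3 right, V3 up,
                float aperture, float focalDist, std::mt19937& rng) {
    V3 focusPt = add(pinhole.origin, mul(pinhole.dir, focalDist));
    // Uniform sample on the lens disk via polar mapping.
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    float r = aperture * std::sqrt(u(rng));
    float phi = 6.2831853f * u(rng);
    V3 offset = add(mul(right, r * std::cos(phi)),
                    mul(up, r * std::sin(phi)));
    V3 origin = add(pinhole.origin, offset);
    return {origin, norm(sub(focusPt, origin))};
}
```

With `aperture` set to 0 this degenerates back to the pinhole camera, which is a handy way to toggle the effect for gameplay vs. cut-scenes.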
Now, you have asked about texture filtering - GROOM uses only bilinear filtering (no mip mapping). For comparison, here is single vs. multiple samples per pixel (with temporal filtering in the second one). I hope YouTube didn't destroy the noise though (it is intentionally captured at quite low resolution - I worked in a window, as that is far more comfortable for me ... in the worst case I'll re-record it in at least 720p).
EDIT: The new forum HATES videos (they stretch over the whole post), so please check the links (first is unfiltered, second is filtered) .. also, you have to copy/paste the links, they don't work yet:
https://youtu.be/Kq61i5oyZzA
https://youtu.be/GUKflK03D8I
EDIT: To follow up, the main reasons for texture filtering are performance and visual quality. While you additionally have to compute ray differentials, the cost of cache misses when reading pixels from a high-resolution texture (compared to reading neighboring pixels of a low-resolution one) is often the bigger performance hit - so mip mapping can give you quite a boost, especially when running the ray tracer on the CPU.
In addition, you don't need redundant samples to filter the texture image (suppose you have just 1 spp - you will always have noise, and temporal filtering would make those surfaces flicker ... with mip mapping, you magically fix this problem too).
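To make the mip selection concrete, here's a rough sketch of picking a level from the ray footprint. All names are hypothetical, and this uses a much cruder footprint estimate than full ray differentials (a linear spread with hit distance), but it shows the idea of reading from the level whose texels roughly match the pixel footprint:

```cpp
#include <algorithm>
#include <cmath>

// 'texelsPerUnit' = texture resolution divided by the surface size in world
// units; 'spread' = footprint growth per unit of distance (from ray
// differentials, or roughly the pixel solid angle for primary rays).
int selectMipLevel(float hitDistance, float spread,
                   float texelsPerUnit, int mipCount) {
    // Estimate how many texels of the base level the ray footprint spans.
    float footprintTexels = hitDistance * spread * texelsPerUnit;
    // Each mip level doubles the texel size, so the level is log2 of that
    // span; clamp to the available mip chain.
    int level = footprintTexels <= 1.0f
                    ? 0
                    : static_cast<int>(std::log2(footprintTexels));
    return std::clamp(level, 0, mipCount - 1);
}
```

Near surfaces get the finest level (sharp bilinear lookups), while distant surfaces read from a coarse, pre-averaged level - which is exactly why the 1 spp texture noise disappears without any extra samples.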