The problem with extracting that information from an image is that you have to tell the algorithm which parts of the image actually affect the footprint. The algorithm can't look at your example image and know that the tall part is a chimney. For all it knows, the tall part lies on the ground and therefore contributes to the footprint, which would make the object occupy far more area behind it, in the region the chimney occludes, even though that area should be open and unaffected by the footprint.
When I was still working on isometric games, I always did the bounds/occlusion/footprint calculation directly from the geometry in the 3D file from which I rendered the object sprite. The 3D file contains all the information needed to determine exactly what the footprint of the object is. In the absence of the original 3D file, you'll probably have to go with a more brute-force approach.
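For the case where the 3D geometry is still available, here is a minimal sketch of how a footprint could be derived from it. This is only an illustration, not the exact code I used: it assumes the mesh is a plain list of triangles with `(x, y, z)` vertices, that z points up, that one tile is one unit wide, and that only triangles lying flat on the ground (within a hypothetical `ground_eps`) count toward the footprint. The function name `footprint_from_mesh` is made up for this example.

```python
import math

def _edge(a, b, p):
    # Signed-area test: tells which side of the edge a->b the point p is on.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def _contains(tri, p):
    # p is inside (or on the border of) tri if the three edge tests
    # do not disagree in sign.
    d = [_edge(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
    return not (min(d) < 0 and max(d) > 0)

def footprint_from_mesh(triangles, ground_eps=0.01):
    """Return the set of (x, y) tiles whose centre is covered by
    ground-level triangles. Assumed conventions: z is up, 1 tile = 1 unit."""
    tiles = set()
    for tri in triangles:
        # Skip anything raised off the ground (e.g. the top of a chimney).
        if any(v[2] > ground_eps for v in tri):
            continue
        flat = [(v[0], v[1]) for v in tri]
        xs = [p[0] for p in flat]
        ys = [p[1] for p in flat]
        # Only test tiles inside the triangle's bounding box.
        for tx in range(math.floor(min(xs)), math.ceil(max(xs))):
            for ty in range(math.floor(min(ys)), math.ceil(max(ys))):
                if _contains(flat, (tx + 0.5, ty + 0.5)):
                    tiles.add((tx, ty))
    return tiles
```

A real exporter would also need to handle walls that meet the ground without a floor polygon, which this sketch deliberately ignores.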
You could probably ease the brute-force work a bit by building some sort of tool that lets you associate a particular object with a rough shape. That is, you could say "this image is LIKE a cube of size XYZ", and build the footprint data from that rough shape. Provide some sort of "grammar" or tool to aggregate simple primitives into more complex shapes, e.g. "this image is like a cube of size XYZ plus a box of size WUV located at MNO offset from the original shape", and so on. How you build this tool or grammar is entirely up to you, but ideally it would:
1) Not take as long to develop as manually determining the footprint of each image would take
2) Be concise enough to actually reduce the workload of creating each footprint, so that the overall time savings offset the time spent developing the tool.
Given that you could iterate on such a tool for quite some time, there would need to be a LOT of images to process to make it worthwhile.
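To make the "rough shape" idea concrete, here is one possible sketch of such a grammar. Everything in it is an assumption for illustration: the `Box` class, its `size` and `offset` fields (measured in whole tiles), and the rule that only boxes resting on the ground (z offset of 0) contribute to the footprint. A real tool could use entirely different primitives.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    # Hypothetical primitive: size is (width, depth, height) in tiles,
    # offset is (x, y, z) in tiles from the object's origin.
    size: tuple
    offset: tuple = (0, 0, 0)

    def ground_tiles(self):
        ox, oy, oz = self.offset
        if oz > 0:
            # Raised parts (a chimney on a roof) never touch the ground.
            return set()
        w, d, _ = self.size
        return {(ox + i, oy + j) for i in range(w) for j in range(d)}

def footprint(shapes):
    """Union of the ground tiles of every primitive in the description."""
    tiles = set()
    for shape in shapes:
        tiles |= shape.ground_tiles()
    return tiles

# "This image is like a 3x2x2 box plus a 1x1x2 box sitting on top of it."
house = [
    Box((3, 2, 2)),
    Box((1, 1, 2), offset=(2, 0, 2)),  # the chimney
]
```

With this description, `footprint(house)` covers only the six tiles under the main box. The chimney adds height for occlusion purposes but no ground area, which is exactly the distinction the image alone cannot give you.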