
Reaching out for possible answers

Started July 22, 2011 01:03 PM
6 comments, last by zacaj 13 years, 3 months ago
Hello everyone. I'm in a bit of a pickle. I couldn't decide which forum section to put this in, so I played it safe and put it in the general spot, but it is a technical question.

I'm stuck working with a bit of software that I don't have access to the source code for, so hackery isn't an option. This software operates over files. Its input is files [large files], its output is files [large files]. The way I currently do it, because I pretty much have to, is: run a script that generates the input to feed into the program, run the program, then run a script that chews through the output, grabs the one nugget of knowledge I need, and throws the rest away.

I need to do a lot of these runs, but because this thing is so IO intensive, running it on a naive cluster isn't an option: NFS just chokes on all the file IO. Instead I'm spreading it across a few machines, using their local disks. Still, it's heavily IO-bound, and it ties up an enormous amount of disk space that is really unnecessary.

So I started thinking... The files it inputs and outputs have a very regular structure. I know exactly where the data I need sits in the output file. Likewise, I'm generating the input file from a script that is logically simple. If only I could intercept the file requests and route them to my script instead of to the disk. If only I could channel the file output into a script that catches the data I want and throws the rest away instead of ever writing it to disk. If only... then I could take this stubborn piece of software that I unfortunately need, make it scale, and thus soar.

So. Any ideas? An instrumentable file system would work great. Virtual files?
On Windows (and I assume something similar is possible on other systems, but I can't say for sure) you can potentially hook the file IO API calls the program makes and redirect them to your own versions, using a library like Microsoft Detours. It's not pretty and still requires a fair amount of work, but it might serve as a temporary 'solution'.
How large are these files? Would a ramdisk work?
If it's a Unix app, make the files named pipes (FIFOs). The other end of each pipe can then be anything you like -- including another application that reads/writes the real files, or one that feeds and extracts just the parts you need up and down the pipes. See the sketch below.

If it's a Windows application, you could look at virtualisation, using the back end of the virtualised filesystem to feed data in.
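For the Unix case, the generator side might look roughly like this. This is only a sketch: the path and the record format are made up, and the real generator logic would replace the snprintf.

// fifo_feed.cpp -- feed generated input to a program through a FIFO,
// so no input file ever touches the disk.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char *path = "/tmp/input.dat";   // hypothetical path the tool opens
    unlink(path);                          // remove any stale file/FIFO
    if (mkfifo(path, 0600) != 0) { perror("mkfifo"); return 1; }

    // open() blocks here until the consumer opens the FIFO for reading.
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    for (long i = 0; i < 1000000; ++i) {
        // Stand-in for the real input generator.
        int n = snprintf(buf, sizeof buf, "record %ld\n", i);
        if (write(fd, buf, n) != n) { perror("write"); break; }
    }
    close(fd);
    unlink(path);
    return 0;
}

One caveat: a FIFO can't be seeked, so this only works for files the program reads or writes strictly sequentially.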
I'd second Wan's suggestion. In fact, if you hook all the IO API functions the application uses, you can 'fake' writing to disk and not use any space except for the information you want. It depends on how well you can identify the data you're interested in. I wrote an application that's somewhat similar to what you're doing now. It injected a DLL into a remote process, and this DLL intercepted the CreateFile and CloseHandle functions. Instead of creating a handle to a file, the hooked CreateFile returned a handle to a named pipe. Then, whenever the hooked process thought it was writing to disk, it was actually sending data straight over the pipe into the first program.

It was actually really straightforward. I used Detours for the hooks. You can probably get something like that going with a lot less effort than you'd think.
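The skeleton of such a hook might look something like this. It's only a sketch from memory: the target file name and pipe name are placeholders, and in a real setup the harness process has to create the pipe server (CreateNamedPipe) before the hooked CreateFileW fires.

// hook.cpp -- DLL using Microsoft Detours to redirect one file to a pipe.
#include <windows.h>
#include <detours.h>
#include <cwchar>

// Pointer to the real CreateFileW; Detours points this at a trampoline.
static HANDLE (WINAPI *TrueCreateFileW)(
    LPCWSTR, DWORD, DWORD, LPSECURITY_ATTRIBUTES,
    DWORD, DWORD, HANDLE) = CreateFileW;

static HANDLE WINAPI HookedCreateFileW(
    LPCWSTR name, DWORD access, DWORD share, LPSECURITY_ATTRIBUTES sa,
    DWORD disposition, DWORD flags, HANDLE tmpl)
{
    // Redirect only the file we care about; everything else passes through.
    if (wcscmp(name, L"C:\\work\\output.dat") == 0) {      // placeholder path
        return TrueCreateFileW(L"\\\\.\\pipe\\capture",    // placeholder pipe
                               access, share, sa, OPEN_EXISTING, flags, tmpl);
    }
    return TrueCreateFileW(name, access, share, sa, disposition, flags, tmpl);
}

BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID)
{
    if (reason == DLL_PROCESS_ATTACH) {
        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourAttach(&(PVOID&)TrueCreateFileW, HookedCreateFileW);
        DetourTransactionCommit();
    } else if (reason == DLL_PROCESS_DETACH) {
        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourDetach(&(PVOID&)TrueCreateFileW, HookedCreateFileW);
        DetourTransactionCommit();
    }
    return TRUE;
}

Since the program in question is a JVM app, note that on Windows the JVM's file IO ultimately goes through CreateFileW, so a user-mode hook like this should still see the calls.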


> [...] This software operates over files. Its input is files [large files], its output is files [large files]. [...]

I'm curious about "large". How large?

Also, which OS?
Operating system doesn't matter, but I'm much more comfortable developing on Windows in general [I can operate in either; I'm just much less experienced with Linux]. The program being used is written in Java, and it's written portably.

Input file sizes are typically between 500 MB and on the order of 3 GB, producing output from about 250 MB to on the order of 1 GB. A bit too much to run concurrently from a ramdisk [at least for me; I've got about 4-8 GB per machine on a cluster I have access to, which I need to share with others, or 12 GB on my own machine. The cluster is Linux, my machine is Windows]. Output is streamed; input is read random-access [not completely random -- it streams for a few MB, then skips to somewhere else].

I'm not sure how to make files into pipes, but I'll read up on it. I'm also reading up on Detours. Thanks for the pointers so far.
Might try making a filesystem driver using FUSE, or something of the like.
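To give a feel for the FUSE route: a bare-bones read-only filesystem that synthesizes the input file on demand could look roughly like this. A sketch only -- the file name, advertised size, and byte generator are stand-ins for the real input format.

// synth_fs.cpp -- minimal read-only FUSE filesystem serving one synthetic file.
// Build (Linux): g++ synth_fs.cpp $(pkg-config fuse --cflags --libs) -o synth_fs
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <cstring>
#include <cerrno>

static const char *kPath = "/input.dat";              // hypothetical file name
static const off_t kSize = 3LL * 1024 * 1024 * 1024;  // advertised size: ~3 GB

static int synth_getattr(const char *path, struct stat *st) {
    memset(st, 0, sizeof *st);
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0;
    }
    if (strcmp(path, kPath) == 0) {
        st->st_mode = S_IFREG | 0444; st->st_nlink = 1; st->st_size = kSize;
        return 0;
    }
    return -ENOENT;
}

static int synth_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t, struct fuse_file_info *) {
    if (strcmp(path, "/") != 0) return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, kPath + 1, NULL, 0);
    return 0;
}

static int synth_open(const char *path, struct fuse_file_info *) {
    return strcmp(path, kPath) == 0 ? 0 : -ENOENT;
}

// Bytes are computed from the requested offset, so seeking anywhere is cheap.
static int synth_read(const char *path, char *buf, size_t size, off_t off,
                      struct fuse_file_info *) {
    if (strcmp(path, kPath) != 0) return -ENOENT;
    if (off >= kSize) return 0;
    if (off + (off_t)size > kSize) size = (size_t)(kSize - off);
    for (size_t i = 0; i < size; ++i)
        buf[i] = (char)('A' + (off + i) % 26);  // stand-in for the real generator
    return (int)size;
}

int main(int argc, char *argv[]) {
    struct fuse_operations ops;
    memset(&ops, 0, sizeof ops);
    ops.getattr = synth_getattr;
    ops.readdir = synth_readdir;
    ops.open    = synth_open;
    ops.read    = synth_read;
    return fuse_main(argc, argv, &ops, NULL);
}

Because read() computes bytes from the requested offset, the random-access read pattern described above is fine: mount it, point the Java program at the mounted path, and no real input file ever exists on disk. Swallowing the output the same way would just need write handlers that inspect the data and discard what you don't want.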
