
File Merging Tool?

Started by November 13, 2010 08:32 PM
6 comments, last by phresnel 13 years, 11 months ago
Does this tool exist?

I have a ton of files on my external hard drive: movie clips, MP3s, e-books, and other stuff. Every chance I get, I 'sync up' files with friends. We get our externals together, and I get everything from them and they get everything from me.

I'm looking for a program to make this easier, faster, better.

In the beginning it was simple. I would just copy all of their files to my HDD, and they would copy all of my files to their HDD. Then we would each sort out our new files.

As our hard drives get bigger (2 TB+), and the number and size of the files grow, the above strategy no longer works well. First, we typically don't have the free space to copy the other guy's entire HDD to our own. Second, we're finding that most of what we copy over is stuff we already have.

I want an app that will only copy over stuff I don't already have. Going through and doing this by hand would take forever.

To make things much more complicated, not everyone organizes their files the same way. So in order to decide whether or not to copy a file over, the app will have to be smart enough to realize that his
\\Files\Mp3s\Mozart\Marriage of Figaro.mp3
is the same as my
\\Music\Classical\Mozart - The Marriage of Figaro.mp3

I'm sure something like this must exist, but my googling has turned up nothing but diff programs for merging the contents of a single file...
Give SyncBack a try.
I just realized how BIG of a job this would be for the computer to run.

If you've got a 2 TB hard drive and you have something like 100,000 files (being really generous here), and you want to find any duplicate files which may have different names or some slight variation in the data (e.g. a 192 kbps song vs a 128 kbps song), then you'll probably end up doing a bit-level file comparison against every other file on the hard drive.

This would be an O(N^2) problem with N being a really large number. The bottleneck wouldn't be the CPU but the seek time of your 2 TB HDD. I can only guess that this would take weeks to months to churn through a 2 TB drive.
Quote: Original post by slayemin
If you've got a 2 TB hard drive and you have something like 100,000 files (being really generous here), and you want to find any duplicate files which may have different names or some slight variation in the data (e.g. a 192 kbps song vs a 128 kbps song), then you'll probably end up doing a bit-level file comparison against every other file on the hard drive.


In most cases, though, you won't be trying to determine if two files are "similar". You can just check file sizes and then do a hash comparison when they're the same. This will only degenerate to N^2 when every file is the same size, which is highly improbable.

As for a program that does precisely this, I was unable to find one either. However, this can easily be done with a small Python program. Since you're not concerned about directory structure, just group all the files you want to check on each side by size, and then compute hashes only for sizes that appear on both sides. The set difference of the hashes for a given size will tell you what each of you is missing.
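
A minimal sketch of the size-then-hash approach described above. The function names and the choice of SHA-256 are assumptions made for illustration, not something specified in the post. It only detects byte-identical files, so a re-encoded copy of the same song would still count as missing.

```python
import hashlib
import os
from collections import defaultdict


def files_by_size(root):
    """Map file size -> list of paths under root (directory layout is ignored)."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                groups[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
    return groups


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_from_mine(mine_root, theirs_root):
    """Return paths under theirs_root whose content is not found under mine_root."""
    mine = files_by_size(mine_root)
    theirs = files_by_size(theirs_root)
    missing = []
    for size, their_paths in theirs.items():
        my_paths = mine.get(size)
        if not my_paths:
            # No file of this size on my side, so no hashing is needed.
            missing.extend(their_paths)
            continue
        my_hashes = {sha256_of(p) for p in my_paths}
        seen = set()  # avoid listing the friend's own duplicates twice
        for path in their_paths:
            file_hash = sha256_of(path)
            if file_hash not in my_hashes and file_hash not in seen:
                missing.append(path)
            seen.add(file_hash)
    return sorted(missing)
```

Calling `missing_from_mine` with the two drive roots gives the list of files to copy in one direction; swapping the arguments gives the other direction. Files whose size is unique to one side are never read, which is where most of the savings come from.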

[Edited by - DaWanderer on November 14, 2010 9:15:27 AM]
Why don't you search for files that have been created after the last sync and then just copy those?
Quote: Original post by slayemin
I just realized how BIG of a job this would be for the computer to run.

If you've got a 2 TB hard drive and you have something like 100,000 files (being really generous here), and you want to find any duplicate files which may have different names or some slight variation in the data (e.g. a 192 kbps song vs a 128 kbps song), then you'll probably end up doing a bit-level file comparison against every other file on the hard drive.

This would be an O(N^2) problem with N being a really large number. The bottleneck wouldn't be the CPU but the seek time of your 2 TB HDD. I can only guess that this would take weeks to months to churn through a 2 TB drive.


I don't think that kind of deep comparison would be necessary in the general case.

If two files have the same name, size, and modified date, I would just assume they are identical. This would be a pretty common scenario, especially after the first time I run the app.

As for slightly different versions of things, it gets trickier. For MP3s at 128 vs 192 kbps, I would need some pretty fancy AI to determine that the songs are the same, even if the names are slightly different. For cases where the program can't reliably decide what to do, it could prompt the user about the two files and ask what should be done.
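
The name, size, and date shortcut could look roughly like this. `build_index`, `probably_have`, and the 2-second tolerance are hypothetical choices for illustration; the tolerance is an assumption meant to absorb the coarse timestamps some filesystems store.

```python
import os
from collections import defaultdict


def build_index(root):
    """Index files under root by (lowercased name, size) -> list of modified times."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            info = os.stat(os.path.join(dirpath, name))
            index[(name.lower(), info.st_size)].append(info.st_mtime)
    return index


def probably_have(path, index, tolerance=2.0):
    """True if the index holds a file with the same name and size and a close mtime."""
    info = os.stat(path)
    key = (os.path.basename(path).lower(), info.st_size)
    return any(abs(info.st_mtime - t) <= tolerance for t in index.get(key, []))
```

Anything for which `probably_have` returns False would still need a closer look (a hash comparison, or a prompt to the user), since a renamed copy of a file I already own fails this check.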
Quote: Original post by Kambiz
Why don't you search for files that have been created after the last sync and then just copy those?


This is a great idea and I was thinking about it while I typed out the original post. I think I will start with something like this just because it is very easy and should really speed things up.

I guess I'll just run a BAT or exe to recursively spit out the names, sizes, and modified dates of all files on a particular HDD. Then I can do a diff on the current list vs the previous list and easily see what has changed or is new. Then save the current list as the previous list. I would have to maintain a list for each friend that I swap with.
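
The list-and-diff plan above can be sketched in Python instead of a BAT file. The JSON snapshot format and the function names are assumptions for illustration; one snapshot file would be kept per friend, stored somewhere outside the drive being scanned.

```python
import json
import os


def take_snapshot(root):
    """Return {relative path: [size, modified time]} for every file under root."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            snapshot[os.path.relpath(path, root)] = [info.st_size, int(info.st_mtime)]
    return snapshot


def new_or_changed(previous, current):
    """Paths in current that are absent from previous or differ in size/mtime."""
    return sorted(p for p, meta in current.items() if previous.get(p) != meta)


def sync_list(root, snapshot_file):
    """Diff root against the saved snapshot, then save the new snapshot."""
    previous = {}
    if os.path.exists(snapshot_file):
        with open(snapshot_file, "r", encoding="utf-8") as handle:
            previous = json.load(handle)
    current = take_snapshot(root)
    changed = new_or_changed(previous, current)
    with open(snapshot_file, "w", encoding="utf-8") as handle:
        json.dump(current, handle)
    return changed
```

On the first run there is no previous snapshot, so every file is reported. Later runs report only files that were added or whose size or modified time changed since the last swap.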

I could probably improve this system by creating mappings between our file systems. For example, a mapping that says my music folder:
\Music
maps to Johnny's music folder:
\Files\MP3s
etc.
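
A folder mapping like that could be a small lookup table. In this sketch, `FOLDER_MAP` and `to_my_path` are hypothetical names, and the table translates a friend's path into my layout, using only the two folders mentioned in the post.

```python
from pathlib import PureWindowsPath

# Hypothetical mapping from a friend's folders to mine, using the folders from the post.
FOLDER_MAP = {
    PureWindowsPath(r"\Files\MP3s"): PureWindowsPath(r"\Music"),
}


def to_my_path(their_path, folder_map=FOLDER_MAP):
    """Translate a friend's path into my layout; return None if no mapping applies."""
    their_path = PureWindowsPath(their_path)
    # Try the longest (most specific) mapped folder first.
    for theirs in sorted(folder_map, key=lambda p: len(p.parts), reverse=True):
        try:
            rest = their_path.relative_to(theirs)
        except ValueError:
            continue  # their_path is not inside this mapped folder
        return folder_map[theirs] / rest
    return None
```

A path that matches no mapped folder returns None, which is a natural place to fall back to asking the user where the file should go.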
Quote: Original post by AndreTheGiant
Does this tool exist?


rsync, maybe git.

This topic is closed to new replies.
