Reconstruct version control from a set of files - database

I am looking for an approach to the following task:
given a set of files that are highly similar (I am using fuzzy hashing here), I would like to know if there is an algorithm that can label those files with a version number. The output should be the sequential order in which those files were generated.
The reason is I have to re-organize data of a team who were not familiar with version control.
Thank you

A fairly simple approach (I hope) would be to try and convert this into some kind of graph problem.
Let's say every file is a node with edges between every two files.
The weight of an edge between two nodes would be, for instance, the number of lines that differ between the two files (or some other distance function).
What you do next is find an acyclic path that visits every file with minimum total cost (essentially a minimum-weight Hamiltonian path), if you know the first file and the last.
You could add an empty file and the latest version you have as your start and end nodes.
I'm guessing this won't give you the exact result, but it'll probably give you a good starting point.
Hope this is helpful.
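For what it's worth, here's a rough Python sketch of that idea: weight edges by the number of differing lines (via difflib) and chain files greedily starting from an empty "version zero". The function names and the greedy nearest-neighbour heuristic are my own assumptions; it approximates, rather than solves, the minimum-cost path.

import difflib

def diff_distance(lines_a, lines_b):
    # Rough edge weight: number of lines that differ between the two files.
    sm = difflib.SequenceMatcher(None, lines_a, lines_b)
    matched = sum(block.size for block in sm.get_matching_blocks())
    return (len(lines_a) - matched) + (len(lines_b) - matched)

def guess_order(paths):
    contents = {p: open(p, encoding="utf-8", errors="replace").read().splitlines()
                for p in paths}
    order = []
    current = []                 # start from an empty file as the root version
    remaining = set(paths)
    while remaining:
        nearest = min(remaining, key=lambda p: diff_distance(current, contents[p]))
        order.append(nearest)
        current = contents[nearest]
        remaining.remove(nearest)
    return order

If the history might branch rather than be strictly linear, a minimum spanning tree over the same distances would be a more forgiving starting point than a single path.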

Related

My program relies on hashes to identify files, but some are repeated. How can I work around this?

Sorry for the messy title, but I can't come up with something that really describes what's happening here. I'm making a program that fetches .cue files for PlayStation 1 ROMs. To do this, the program creates a SHA-1 hash of each file and checks it against a database. The database can be found in the "psx.hash" file in this repo. This has been working fine, but I suddenly stumbled upon a very nasty problem: there are plenty of files that have the same hash, because they are essentially the same file.
Let me break the problem down a bit. PSX ROMs are essentially CD images, and they can come in multiple tracks. These tracks usually contain audio, and the .cue file is used to tell the emulator where each audio track is located in the disc image. So what I do is identify each and every track file (based on its SHA-1 hash), see if it matches the database, and then construct a link based on its name (minus the track text) to get to the original cue file. Then I read the text and add it to the cue, simple as that. Well, apparently many games use the exact same track for some reason. Exactly 175 of them.
So... what can I do to differentiate them? As it stands, I fetch the wrong cue file whenever this hash comes into play. This is the hash, by the way: "d9f92af296360772e62caa4cb276de3fa74f5538". I tried other algorithms to see if it was just an extremely unlikely coincidence, but no: SHA-256, CRC and MD5 all gave the same result (by "the same result" I mean the same between files; of course the results of different algorithms for the same file are different).
So I don't know what to do. This is a giant bug in my program that I have no idea how to fix, so any insight is welcome. I'm afraid I may have explained myself poorly; if so, I apologize, but I have a hard time seeing where I may not be clear enough, so if you have any doubts, please do ask.
It's worth noting that the database was not constructed by me, but by redump.org. Also, here's the code I'm using to retrieve the hashes of the files:
import hashlib

def getSha1(file):
    # Hash in 4 KB chunks so large track files don't need to fit in memory.
    hashSha1 = hashlib.sha1()
    with open(file, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hashSha1.update(chunk)
    return hashSha1.hexdigest()
The correct solution would be to construct the hash file in such a way that I can differentiate between track files for each game, but I ended up doing the following:
1. Sort the list of tracks to have them ordered.
2. Get the first track file and retrieve the hash (this one will always be unique since it contains the game).
3. For every next track file that isn't Track 1, assume it belongs to the game before it. So if the next file is Track 2, assume it belongs to the previous file that had Track 1.
This nicely avoids the issue, although it's circumventing the bigger problem of not having properly formatted data.
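For what it's worth, here is a rough Python sketch of that workaround. The "(Track N)" naming convention and the helper name are assumptions of mine, not something from the actual program:

import re

def group_tracks(track_files):
    # Sort the track files, then attribute every non-"Track 1" file to the most
    # recent "Track 1" seen, since that one uniquely identifies the game.
    games = {}
    current_game = None
    for name in sorted(track_files):
        match = re.search(r"\(Track (\d+)\)", name)   # assumed naming scheme
        track_no = int(match.group(1)) if match else 1
        if track_no == 1:
            current_game = re.sub(r"\s*\(Track \d+\)", "", name)
            games[current_game] = []
        if current_game is not None:
            games[current_game].append(name)
    return games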

how to get added content of a file since last modification

I'm working on a project in golang that needs to index recently added file content (using a framework called bleve), and I'm looking for a way to get the content of a file since its last modification. My current workaround is to record the last indexed position of each file, and during later indexing passes I only retrieve file content starting from the previously recorded position.
So I wonder: is there any library or built-in functionality for this? (It doesn't need to be restricted to Go; any language could work.)
I'd also really appreciate it if anyone has a better idea than my workaround!
Thanks
It depends on how the files change.
If the files are append-only, then you only need to record the last offset where you stopped indexing, and start from there.
If the changes can happen anywhere, and the changes are mostly replacing old bytes with new bytes (like changing pixels of an image), then perhaps you can consider computing a checksum for small chunks, and only re-index those chunks that have different checksums (a sketch of this follows below).
You can check out the crypto/* packages in the Go standard library (e.g. crypto/sha256) for computing hashes.
If the changes are line insertion/deletion to text files (like changes to source code), then maybe a diff algorithm can help you find the differences. Something like https://github.com/octavore/delta.
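Here is a minimal sketch of the chunk-checksum idea mentioned above (in Python rather than Go, since any language works for the question; the chunk size and the shape of the stored state are assumptions):

import hashlib

CHUNK_SIZE = 64 * 1024

def changed_chunks(path, previous_digests):
    # Hash the file in fixed-size chunks and report which chunk indexes differ
    # from the digests recorded on the last indexing pass.
    changed = []
    digests = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            digest = hashlib.sha256(chunk).hexdigest()
            digests.append(digest)
            if index >= len(previous_digests) or previous_digests[index] != digest:
                changed.append(index)
            index += 1
    return changed, digests

You would persist the returned digests alongside your last-indexed positions and re-index only the reported chunks.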
If you're running on a Unix-like system, you could just use tail. If you tell it to follow the file (tail -f), the process will keep waiting after it reaches the end of the file. You can invoke this from your program with os/exec and pipe its Stdout to your program, which can then read from it periodically or with blocking reads.
The only way I can think of to do this natively in Go is like how you described. There's also a library that tries to emulate tail in Go here: https://github.com/hpcloud/tail

Selecting a file randomly from a file system

This question relates to Simulating file system access.
I need to choose files and directories randomly to act as arguments of file operations like rename, write, read, etc. What I was planning to do was to make a list of all files and directories with their paths and randomly select from this list. But as files and directories are created and deleted in the actual file system, the list also has to be updated. I find maintaining and updating the list in this manner inefficient, and it also has to be atomic so that a later operation does not access a file that was deleted by a previous operation.
Can you suggest a different way of selecting the files? Maybe some way to do it directly from the file system... but how would we know the paths to files then?
Thanks
I found something interesting here: Randomly selecting a file from a tree of directories in a completely fair manner, especially in
Michael J. Barber's answer, but I'm not able to follow it completely due to my Python ignorance.
You don't want to try to maintain a list of files when the filesystem is right there. You should be able to do it right from C. Walk from the root directory, selecting a random entry within it. You can pick a random maximum depth, and if you hit a regular file at or before this depth, use it. If it's a directory, repeat, up to the max depth. If it's a special file, maybe start over.
This should be pretty quick. The operation shouldn't have to be atomic: if the file's not there when you want to do your operation, try again. It shouldn't be too complicated. You can build the path up as you find your target file. This will be simpler than fussing with the filesystem directly (I assume you meant at a much lower level) and should be simple to implement.
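A rough Python sketch of that walk, just to make the idea concrete (the question is about C, but it translates directly; note that this is not uniform over files, since small directories get over-weighted):

import os
import random

def random_walk_pick(root, max_depth=10):
    # Descend from the root, picking a random entry at each level; return the
    # first regular file we hit, or None so the caller can simply retry.
    path = root
    for _ in range(max_depth):
        try:
            entries = os.listdir(path)
        except OSError:
            return None          # directory vanished underneath us
        if not entries:
            return None
        choice = os.path.join(path, random.choice(entries))
        if os.path.isfile(choice):
            return choice
        if os.path.isdir(choice) and not os.path.islink(choice):
            path = choice
        else:
            return None          # special file or symlink: start over
    return None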
Here is my proposed solution. It is not the fastest, but it should be quick (after preparation), use only modest memory, and be fairly well distributed. It is, of course, 100% untested and somewhat complex (as complex as maintaining an RB-tree or similar, anyway) -- I pity anyone having to do this in C ;-)
1. For each directory in the target domain, build a directory tree using a depth-first walk of the filesystem and record the "before" file count (files found to date, in tree order) and the "after" file count (the "before" count plus the number of files in the directory). Do not store the files themselves. "Fast way to find the number of files" gives some example C code; it still requires iterating over the directory contents, but does not need to store the files themselves.
2. Count up the total number of files in the tree. (This should really just be the "after" count of the last node in the tree.)
3. Pick a random number in the range [0, total files).
4. Navigate to the node in the tree such that "before" file count <= random number < "after" file count. This is just walking down the (RB-)tree structure and is itself O(lg n) or similar.
5. Pick a random file in the directory associated with the selected node. Make sure to count the directory again and use that count as the [0, limit) for the selection (with a fallback in case of running off the end due to concurrency issues). If the number of files changed, update the tree accordingly; also update/fix the tree if the directory has been deleted, etc. (The extra full count here shouldn't be as bad as it sounds, since readdir must, on average, already walk through half of the entries in the directory. However, the benefit of the re-count, if any, should be explored.)
6. Repeat steps 2-5 as needed.
7. Periodically rebuild the entire tree (step #1) to account for filesystem changes. Deleting/adding files will slowly skew the randomness; step #5 can help to update the tree in certain situations. The frequency of rebuilds should be determined through experimentation. It might also be possible to reduce the introduced error by rebuilding the parent/grandparent nodes or [random] child nodes on each pass, etc. Using the modified time as a fast way to detect changes may also be worth looking into.
Happy coding.
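A rough Python sketch of this scheme, using a flat list of cumulative counts in place of the RB-tree (the names are illustrative, and the periodic rebuild is simply a fresh call to build_counts):

import bisect
import os
import random

def build_counts(root):
    # One depth-first pass: record each directory and its cumulative ("after")
    # file count, mirroring the before/after bookkeeping described above.
    dirs, cumulative, total = [], [], 0
    for dirpath, dirnames, filenames in os.walk(root):
        total += len(filenames)
        dirs.append(dirpath)
        cumulative.append(total)
    return dirs, cumulative

def pick_random_file(dirs, cumulative):
    if not cumulative or cumulative[-1] == 0:
        return None
    target = random.randrange(cumulative[-1])
    i = bisect.bisect_right(cumulative, target)    # node whose range covers target
    dirpath = dirs[i]
    files = [f for f in os.listdir(dirpath)
             if os.path.isfile(os.path.join(dirpath, f))]
    if not files:                                  # directory changed since the scan
        return None
    return os.path.join(dirpath, random.choice(files))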
All you really need to know is how many files are in each directory, in order to pick which directory to traverse into. Avoid traversing symbolic links, and don't count files behind symbolic links.
You can use a similar solution to the one pst described.
Example: you have 3 directories with 20, 40 and 1000 files in each.
You make the running totals [20, 60, 1060] and pick a random number in [0, 1060). If this number is greater than or equal to 60, you go into the 3rd folder.
You stop traversing once you reach a folder without subfolders.
To find a random file along this path you can apply the same trick as before.
This way you will pick any file with equal probability.

What's the fastest way to tell if two MP3 files are duplicates?

I want to write a program that deletes duplicate iTunes music files. One approach to identifying dupes is to compare MD5 digests of the MP3 and m4a files. Is there a more efficient strategy?
BTW the "Display Duplicates" menu command on iTunes shows false positives. Apparently it just compares on the Artist and Track title strings.
If you use hashes to compare two sets of data, they have to have exactly the same input to produce exactly the same output (barring a collision, where different inputs happen to produce the same output). If you compare two MP3 files by hashing the entire file, the two sets of song data might be identical, but since ID3 tags are stored inside the file, discrepancies there can make the files appear completely different. Because you're using a hash, you won't notice that perhaps 99% of the two files match, because the outputs will be completely different.
If you really want to use a hash for this, you should hash only the sound data, excluding any tags attached to the file. Even that is not fully reliable: if the music is ripped from CDs, for example, and the same CD is ripped twice, the results might be encoded/compressed differently depending on the ripping parameters.
A better (but much more complicated) alternative would be to compare the uncompressed audio data. A little trial and error with known inputs can lead to a decent algorithm. Doing this perfectly will be very hard (if it's possible at all), but if you get something that's more than 50% accurate, it'll be better than going through everything by hand.
Note that even an algorithm that can detect whether two songs are close (say, the same song ripped with different parameters) would have to be more complex than it's worth to tell whether a live version is anything like a studio version. If you can do that, there's money to be made here!
Coming back to the original question of how quickly you can tell whether two files are duplicates: a hash would be a lot faster, but also a lot less accurate, than any purpose-built comparison. It's speed versus accuracy and complexity.
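To make the "hash only the sound data" idea above concrete, here is a rough Python sketch. It only skips a leading ID3v2 tag and a trailing ID3v1 tag; it is not a complete tag parser (APE tags, ID3v2 footers, etc. are ignored), so treat it as an illustration:

import hashlib

def audio_only_md5(path):
    # Hash only the MP3 audio frames, skipping a leading ID3v2 tag and a
    # trailing 128-byte ID3v1 tag, so retagged copies of the same rip match.
    with open(path, "rb") as f:
        data = f.read()
    start = 0
    if data[:3] == b"ID3" and len(data) >= 10:
        # The ID3v2 tag size is stored in 4 "synchsafe" bytes (7 bits each).
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        start = 10 + size
    end = len(data)
    if data[-128:-125] == b"TAG":
        end -= 128
    return hashlib.md5(data[start:end]).hexdigest()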

What should I know before poking around an unknown archive file for things?

A game that I play stores all of its data in a .DAT file. There has been some work done by people in examining the file. There are also some existing tools, but I'm not sure about their current state. I think it would be fun to poke around in the data myself, but I've never tried to examine a file, much less anything like this before.
Is there anything I should know about examining a file format for data extraction purposes before I dive headfirst into this?
EDIT: I would like very general tips, as examining file formats seems interesting. I would like to be able to take File X and learn how to approach the problem of learning about it.
You'll definitely want a hex editor before you get too far. It will let you see the raw data as numbers, instead of as the large empty blocks Notepad (or whatever text editor) renders binary bytes as.
Try opening it in any archive extractors you have (i.e. zip, 7z, rar, gz, tar, etc.) to see if it's just a renamed archive format (.PK3, for instance, is just a renamed zip).
Look for headers of known file formats somewhere within the file, which will help you discover where certain parts of the data are stored (e.g. search for the PNG signature \x89PNG, or its IHDR chunk, to find any uncompressed PNG files embedded within).
If you do find where a certain piece of data is stored, take note of its offset and length, and see if you can find numbers equal to either of those values near the beginning of the file; these usually act as pointers to the actual data.
Sometimes you just have to guess or intuit what a certain value means, and if you're wrong, well, keep moving. There's not much you can do about it.
I have found that http://www.wotsit.org is particularly useful for known file type formats, for help finding headers within the .dat file.
Back up the file first. Once you've restricted the amount of damage you can do, just poke around as Ed suggested.
Looking at your rep level, I guess a basic primer on hexadecimal numbers, endianness, representations for various data types, and all that would be a bit superfluous. A good tool that can show the data in hex is of course essential, as is the ability to write quick scripts to test complex assumptions about the data's structure. All of these should be obvious to you, but might perhaps help someone else so I thought I'd mention them.
One of the best ways to attack unknown file formats, when you have some control over contents is to take a differential approach. Save a file, make a small and controlled change, and save again. Do a binary compare of the files to find the difference - preferably using a tool that can detect inserts and deletions. If you're dealing with an encrypted file, a small change will trigger a massive difference. If it's just compressed, the difference will not be localized. And if the file format is trivial, a simple change in state will result in a simple change to the file.
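A small Python sketch of that differential comparison, assuming the saves are small enough to diff in memory (difflib gets slow on large inputs, but for controlled test saves it shows insertions, deletions and replacements nicely):

import difflib

def binary_diff(path_a, path_b):
    # Compare two saves of the same archive and print where bytes were inserted,
    # deleted or replaced, to see how localised the change actually is.
    with open(path_a, "rb") as f:
        a = f.read()
    with open(path_b, "rb") as f:
        b = f.read()
    sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            print(f"{op:8} old[{i1:#x}:{i2:#x}]  new[{j1:#x}:{j2:#x}]")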
The other thing is to look at some of the common compression techniques, notably zip and gzip, and learn their "signatures". Most of these formats are "self identifying" so when they start decompressing, they can do quick sanity checks that what they're working on is in a format they understand.
Barring encryption, an archive file format is basically some kind of indexing mechanism (a directory of sorts) plus a way of locating those elements within the archive via pointers in the index.
With the ubiquity of the standard compression algorithms, it's mostly a matter of finding where those blocks start and trying to hunt down the index, or table of contents.
Some formats will have the index all in one spot (like a file system does); others will simply precede each element within the archive with its identity information. But in the end, somewhere, there is information about offsets from one block to another, information about data types (for example, if they're storing GIF files, GIFs have a signature as well), etc.
Those are the patterns that you're trying to hunt down within the file.
It would be nice if you could somehow get your hands on two versions of data in the same format. For example, with a game, you might be able to get the initial version off the CD and a newer, patched version. These can really highlight the information you're looking for.
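As an illustration of the signature hunting mentioned above, here is a small Python sketch that scans a file for a handful of well-known magic numbers; the table is a tiny, non-exhaustive set chosen for the example:

SIGNATURES = {
    b"PK\x03\x04": "zip local file header",
    b"\x1f\x8b": "gzip stream",
    b"\x89PNG": "PNG image",
    b"OggS": "Ogg container",
    b"RIFF": "RIFF (WAV/AVI)",
}

def scan_signatures(path):
    # Report every offset where a known signature appears in the file.
    with open(path, "rb") as f:
        data = f.read()
    hits = []
    for magic, label in SIGNATURES.items():
        offset = data.find(magic)
        while offset != -1:
            hits.append((offset, label))
            offset = data.find(magic, offset + 1)
    for offset, label in sorted(hits):
        print(f"{offset:#010x}  {label}")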
