My program relies on hashes to identify files, some are repeated. How can I work around this? - database

sorry for the messy title but I can't come up with something that really describes what's happening here. So I'm making a program that fetches .cue files for Playstation 1 roms. To do this, the program creates a SHA-1 hash of the file and checks it in a database. The database can be found in the "psx.hash" file in this repo. This has been working fine but I suddenly stumbled upon a very very nasty problem. There's plenty of files that have the same hash, because they are essentially the same file.
Let me break down the problem a bit. PSX roms are essentially cd files, and they can come in tracks. These tracks usually contain audio, and the .cue file is used to tell the emulator where each audio track is located [in the disc file]. So what I do is to identify each and every track file (based on their SHA-1 hash), see if they match the database, and then construct a link based on their name (minus the track text) to get to the original cue file. Then I read the text and add it to the cue, simple as that. Well, apparently many games use the same track for some reason? Exactly 175 of them
So... what can I do to difentiate them? This leads to the problem that I fetch the wrong cue file whenever this hash comes into play. This is the hash by the way: "d9f92af296360772e62caa4cb276de3fa74f5538". I tried other algorithms to see if it was just an extremely unlikely coincidence, but nope, all gave the same results. SHA-256 gave the same result, CRC gave the same result, MD5 gave the same result (by the same result I mean the same between files, of course the results of different algorithms for the same file will be different).
So I don't know what to do. This is a giant bug in my program that I have no idea on how to fix, any insight is welcome. I'm afraid I explained myself poorly, if so, I apologize, but I have a hard time seeing where I may not be clear enough, so if you have any doubts please, do ask.
It's worth noting that the database was not constructed by myself, but by redump.org, also, here's the code I'm using to retrieve the hashes of the files:
def getSha1(file):
hashSha1 = hashlib.sha1()
with open(file, "rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
hashSha1.update(chunk)
return hashSha1.hexdigest()

The correct solution would be to construct the hash file in such a way that I can differentiate between track files for each game, but I ended up doing the following:
Sort the list of Tracks to have them ordered.
Get the first track file and retrieve the hash (this one will always be unique since it contains the game)
For every next track file that isn't Track 1, assume it belongs to the game before it. So if the next file is Track 2, assume it belongs to the previous file that had Track 1.
This nicely avoids the issue, although it's circumventing the bigger problem of not having properly formatted data.

Related

how to get added content of a file since last modification

I'm working on a project in golang that needs to index recently added file content (using framework called bleve), and I'm looking for a solution to get content of a file since last modification. My current work-around is to record the last indexed position of each file, and during indexing process later on I only retrieve file content starting from the previous recorded position.
So I wonder if there's any library or built-in functionality for this? (doesn't need to be restricted to go, any language could work)
I'll really appreciate it if anyone has a better idea than my work-around as well!
Thanks
It depends on how the files change.
If the files are append-only, then you only need to record the last offset where you stopped indexing, and start from there.
If the changes can happen anywhere, and the changes are mostly replacing old bytes with new bytes (like changing pixels of an image), then perhaps you can consider computing checksum for small chucks, and only index those chunks that has different checksums.
You can check out crypto package in Go standard library for computing hashes.
If the changes are line insertion/deletion to text files (like changes to source code), then maybe a diff algorithm can help you find the differences. Something like https://github.com/octavore/delta.
If you're running in a Unix-like system, you could just use tail. If you specify to follow the file, the process will keep waiting after reaching end of file. You can invoke this in your program with os/exec and pipe the Stdout to your program. Your program can then read from it periodically or with blocking.
The only way I can think of to do this natively in Go is like how you described. There's also a library that tries to emulate tail in Go here: https://github.com/hpcloud/tail

read thunderbird address mab files content

I have several address list's on my TBIRD address book.
every time I need to edit an address that is contained in several lists, is a pain on the neck to find which list contains the address to be modified.
As a help tool I want to read the several files and just gave the user a list of which
xxx.MAB files includes the searched address on just one search.
having the produced list, the user can simply go to edit just the right address list's.
Will like to know a minimum about the format of mentioned MAB files, so I can OPEN + SEARCH for strings into the files.
thanks in advance
juan
PD have asked mozilla forum, but there are no plans from mozilla to consolidate the address on one master file and have the different list's just containing links to the master. There is one individual thinking to do that, but he has no idea when due to lack of resources,
on this forum there is a similar question mentioning MORK files, but my actual TBIRD looks like to have all addresses contained on MAB files
I am afraid there is no answer that will give you a proper solution for this question.
MORK is a textual database containing the files Address Book Data (.mab files) and Mail Folder Summaries (.msf files).
The format, written by David McCusker, is a mix of various numerical namespaces and is undocumented and seem to no longer be developed/maintained/supported. The only way you would be able to get the grips of it is to reverse engineer it parallel with looking at source code using this format.
However, there have been experienced people trying to write parsers for this file format without any success. According to Wikipedia former Netscape engineer Jamie Zawinski had this to say about the format:
...the single most brain-damaged file format that I have ever seen in
my nineteen year career
This page states the following:
In brief, let's count its (Mork's) sins:
Two different numerical namespaces that overlap.
It can't decide what kind of character-quoting syntax to use: Backslash? Hex encoding with dollar-sign?
C++ line comments are allowed sometimes, but sometimes // is just a pair of characters in a URL.
It goes to all this serious compression effort (two different string-interning hash tables) and then writes out Unicode strings
without using UTF-8: writes out the unpacked wchar_t characters!
Worse, it hex-encodes each wchar_t with a 3-byte encoding, meaning the file size will be 3x or 6x (depending on whether whchar_t is 2
bytes or 4 bytes.)
It masquerades as a "textual" file format when in fact it's just another binary-blob file, except that it represents all its magic
numbers in ASCII. It's not human-readable, it's not hand-editable, so
the only benefit there is to the fact that it uses short lines and
doesn't use binary characters is that it makes the file bigger. Oh
wait, my mistake, that isn't actually a benefit at all."
The frustration shines through here and it is obviously not a simple task.
Consequently there apparently exist no parsers outside Mozilla products that is actually able to parse this format.
I have reversed engineered complex file formats in the past and know it can be done with the patience and right amount of energy.
Sadly, this seem to be your only option as well. A good place to start would be to take a look at Thunderbird's source code.
I know this doesn't give you a straight-up solution but I think it is the only answer to the question considering the circumstances for this format.
And of course, you can always look into the extension API to see if that allows you to access the data you need in a more structured way than handling the file format directly.
Sample code which reads mork
Node.js: https://www.npmjs.com/package/mork-parser
Perl: http://metacpan.org/pod/Mozilla::Mork
Python: https://github.com/KevinGoodsell/mork-converter
More links: https://wiki.mozilla.org/Mork

How to find out if two binary files are exactly the same

I have got a repository where I store all my image files. I know that there are much images which are duplicated and I want to delete each one of duplicated ones.
I thought if I generate checksum for each image file and rename the file to its checksum, I can easily find out if there are duplicated ones by examining the filename. But the problem is that, I cannot be sure about selecting the checksum algorithm to use. For example, if I generate the checksums using MD5, can I exactly trust if the checksums are the same that means the files are exactly the same?
Judging from the response to a similar question in security forum (https://security.stackexchange.com/a/3145), the collision rate is about 1 collision per 2^64 messages. If your files are differenet and your collection is not huge (i.e. close to this number), md5 can be used safely.
Also, see response to a very similar question here: How many random elements before MD5 produces collisions?
The chances of getting the same checksum for 2 different files are extremely slim, but can never be absolutely guaranteed (Pigeonhole principle). An indication of how slim may be that GIT uses the SHA-1 checksum for software development source code including Linux and has never caused any known problems so I would say that you are safe. I would use SHA-1 instead of MD5 because it is slightly better if you are really paranoid.
To make really sure you best follow a two-step-procedure: first calculate a checksum for every file. If the checksums differ you're sure the files are not identical. If you happen to find some files with equal checksums there's no way around doing a bit-by-bit-comparison to make 100% sure if they are really identical. This holds regardless of the hashing-algorithm used.
What you'll get is a massive time-saving as doing bit-by-bit comparison for every possible pair of files will take forever and a day while comparing a hand full of possible candidates is fairly easy.

What's the fastest way to tell if two MP3 files are duplicates?

I want to write a program that deletes duplicate iTunes music files. One approach to identifying dupes is to compare MD5 digests of the MP3 and m4a files. Is there a more efficient strategy?
BTW the "Display Duplicates" menu command on iTunes shows false positives. Apparently it just compares on the Artist and Track title strings.
If you use hashes to compare two sets of data, ideally they'd have to have exactly the same input each time in order to get exactly the same output (unless you miraculously picked two collisions of different input resulting in the same output). If you wanted to compare two MP3 files by hashing the entire file, then the two sets of song data might be exactly the same but since ID3 is stored inside the file, discrepancies there might make the files appear to be completely different. Since you're using a hash, you won't notice that perhaps 99% of the two files are matches because the outputs will be too different.
If you really want to use a hash to do this, you should only hash the sound data excluding any tags that may be attached to the file. This is not recommended, if music is ripped from CDs for example, and the same CD is ripped two different times, the results might be encoded/compressed differently depending on ripping parameters.
A better (but much more complicated) alternative would be an attempt to compare the uncompressed audio data values. With a little trial and error with known inputs can lead to a decent algo. Doing this perfectly will be very hard (if possible at all), but if you get something that's more than 50% accurate, it'll be better than going through by hand.
Note that even an algorithm that can detect if two songs are close (say the same song ripped under different parameters), the algo would have to be more complex than it's worth to tell if a live version is anything like a studio version. If you can do that, there's money to be made here!
And touching back on the original idea of how fast to tell if they're duplicates. A hash would be a lot faster, but a lot less accurate than any algorithm with this purpose. It's speed vs accuracy and complexity.

What should I know before poking around an unknown archive file for things?

A game that I play stores all of its data in a .DAT file. There has been some work done by people in examining the file. There are also some existing tools, but I'm not sure about their current state. I think it would be fun to poke around in the data myself, but I've never tried to examine a file, much less anything like this before.
Is there anything I should know about examining a file format for data extraction purposes before I dive headfirst into this?
EDIT: I would like very general tips, as examining file formats seems interesting. I would like to be able to take File X and learn how to approach the problem of learning about it.
You'll definitely want a hex editor before you get too far. It will let you see the raw data as numbers instead of as large empty blocks in whatever font notepad is using (or whatever text editor).
Try opening it in any archive extractors you have (i.e. zip, 7z, rar, gz, tar etc.) to see if it's just a renamed file format (.PK3 is something like that).
Look for headers of known file formats somewhere within the file, which will help you discover where certain parts of the data are stored (i.e. do a search for "IPNG" to find any (uncompressed) png files somewhere within).
If you do find where a certain piece of data is stored, take a note of its location and length, and see if you can find numbers equal to either of those values near the beginning of the file, which usually act as pointers to the actual data.
Some times you just have to guess, or intuit what a certain value means, and if you're wrong, well, keep moving. There's not much you can do about it.
I have found that http://www.wotsit.org is particularly useful for known file type formats, for help finding headers within the .dat file.
Back up the file first. Once you've restricted the amount of damage you can do, just poke around as Ed suggested.
Looking at your rep level, I guess a basic primer on hexadecimal numbers, endianness, representations for various data types, and all that would be a bit superfluous. A good tool that can show the data in hex is of course essential, as is the ability to write quick scripts to test complex assumptions about the data's structure. All of these should be obvious to you, but might perhaps help someone else so I thought I'd mention them.
One of the best ways to attack unknown file formats, when you have some control over contents is to take a differential approach. Save a file, make a small and controlled change, and save again. Do a binary compare of the files to find the difference - preferably using a tool that can detect inserts and deletions. If you're dealing with an encrypted file, a small change will trigger a massive difference. If it's just compressed, the difference will not be localized. And if the file format is trivial, a simple change in state will result in a simple change to the file.
The other thing is to look at some of the common compression techniques, notably zip and gzip, and learn their "signatures". Most of these formats are "self identifying" so when they start decompressing, they can do quick sanity checks that what they're working on is in a format they understand.
Barring encryption, an archive file format is basically some kind of indexing mechanism (a directory or sorts), and a way located those elements from within the archive via pointers in the index.
With the the ubiquitousness of the standard compression algorithms, it's mostly a matter of finding where those blocks start, and trying to hunt down the index, or table of contents.
Some will have the index all in one spot (like a file system does), others will simply precede each element within the archive with its identity information. But in the end somewhere, there is information about offsets from one block to another, there is information about data types (for example, if they're storing GIF files, GIF have a signature as well), etc.
Those are the patterns that you're trying to hunt down within the file.
It would be nice if somehow you can get your hand on two versions of data using the same format. For example, on a game, you might be able to get the initial version off the CD and a newer, patched version. These can really highlight the information you're looking for.

Resources