Is it possible to determine the degree to which 2 files are alike?

For the purposes of this example, suppose there are 2 binary files, A and B, each containing a variation of, say, a YouTube video, where:
A contains a 5-second ad
B contains no ad
Apart from the ad, A contains the same content as B
The total length of file A is 60 seconds
The total length of file B is 55 seconds
As a general rule, if we were to compare the bit patterns of the two files, would we arrive at the same conclusion: that the files contain 55 seconds' worth of common bits?
If we extend the problem further, say to 2 jars whose only difference is their comments, would it be appropriate to compare the order of bits and, based on what we find, determine the degree of likeness?
It's easy to determine whether files are identical or not. Will comparing bits accurately determine the degree to which files are close to one another?
The question is not about video files, but about binary files in general. I mention video files above for example purposes only.

It depends on the file-format, but in your examples — no, probably not.
Video with and without initial ad: videos are usually encoded by breaking them into small time-blocks, and then encoding and compressing those blocks; if you insert an ad at the beginning, then you will most likely cause the block-transitions to happen at different time offsets within the main video.
Jar-file with and without comments (or with different comments): same story; changing the length of a comment within a file will affect the splitting of the entire file into compressible blocks, so all blocks after an altered comment will be compressed differently. (This is, of course, assuming that the jar-file actually includes the comments. Just because comments were in the source-code, that doesn't mean the jar-file will have them; that depends on compiler settings and so on.)
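To make the point concrete, here is a minimal Python sketch, not a real media comparison. The "files" are invented word streams built by a hypothetical `fake_stream` helper, and zlib stands in for whatever compression a real container uses. It compares the two inputs byte-for-byte at the same offsets, then with an alignment-aware matcher (`difflib.SequenceMatcher`), and finally runs the aligned comparison again on the compressed bytes.

```python
import difflib
import random
import zlib


def positional_similarity(a: bytes, b: bytes) -> float:
    """Share of offsets at which both inputs hold the same byte."""
    if not a and not b:
        return 1.0
    same = sum(x == y for x, y in zip(a, b))
    return same / max(len(a), len(b))


def aligned_similarity(a: bytes, b: bytes) -> float:
    """difflib ratio (2 * matched / total length); tolerates insertions."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def fake_stream(seconds: int, vocabulary: list, seed: int) -> bytes:
    """Hypothetical stand-in for media data: 8 random words per 'second'."""
    rng = random.Random(seed)
    words = [rng.choice(vocabulary) for _ in range(seconds * 8)]
    return " ".join(words).encode("ascii")


body = fake_stream(55, ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"], seed=1)
ad = fake_stream(5, ["buy", "now", "sale", "cheap", "offer"], seed=2)
file_a = ad + b" " + body   # "5 second ad" followed by the "55 second" body
file_b = body               # body only

print("raw, same offsets:  ", round(positional_similarity(file_a, file_b), 3))
print("raw, aligned:       ", round(aligned_similarity(file_a, file_b), 3))

# Compress each file as a whole, the way an archive or codec might.
packed_a, packed_b = zlib.compress(file_a), zlib.compress(file_b)
print("compressed, aligned:", round(aligned_similarity(packed_a, packed_b), 3))
```

On the uncompressed data, the aligned comparison recovers the shared body even though the ad shifts every offset. Once both files are compressed, the shared content is no longer visible as shared bytes, which is the situation described above for video and jar files.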

Most video compression these days is done with lossy algorithms. The compression is done both within a frame and BETWEEN frames. If the extra video frames added in your "A" video "leak" into the original movie because of the inter-frame compression, then by definition your two video files will be different videos, even though logically they're the same movie with 5 seconds of ad tacked onto the front. The compression algorithm will have merged 1 or more frames of the two videos into a hybrid of the two, and this fundamentally changes things.


Read the content of Thunderbird address book MAB files

I have several address lists in my Thunderbird address book.
Every time I need to edit an address that is contained in several lists, it is a pain in the neck to find which lists contain the address to be modified.
As a helper tool, I want to read the several files and give the user, in a single search, a list of which xxx.MAB files include the searched address.
With that list, the user can simply go and edit just the right address lists.
I would like to know the minimum about the format of these MAB files, so I can open them and search for strings in them.
Thanks in advance,
Juan
P.S. I have asked on the Mozilla forum, but Mozilla has no plans to consolidate the addresses into one master file and have the different lists just contain links to the master. There is one individual thinking of doing that, but he has no idea when, due to lack of resources.
On this forum there is a similar question mentioning MORK files, but my current Thunderbird appears to have all its addresses in MAB files.
I am afraid there is no answer that will give you a proper solution for this question.
MORK is a textual database format used for the Address Book data files (.mab) and the Mail Folder Summary files (.msf), so your MAB files are MORK files.
The format, written by David McCusker, is a mix of various numerical namespaces, is undocumented, and seems to no longer be developed, maintained, or supported. The only way to get to grips with it is to reverse engineer it while reading the source code that uses the format.
However, experienced people have tried to write parsers for this file format without any success. According to Wikipedia, former Netscape engineer Jamie Zawinski had this to say about the format:
"...the single most brain-damaged file format that I have ever seen in my nineteen year career"
This page states the following:
"In brief, let's count its (Mork's) sins:
Two different numerical namespaces that overlap.
It can't decide what kind of character-quoting syntax to use: Backslash? Hex encoding with dollar-sign?
C++ line comments are allowed sometimes, but sometimes // is just a pair of characters in a URL.
It goes to all this serious compression effort (two different string-interning hash tables) and then writes out Unicode strings without using UTF-8: writes out the unpacked wchar_t characters!
Worse, it hex-encodes each wchar_t with a 3-byte encoding, meaning the file size will be 3x or 6x (depending on whether wchar_t is 2 bytes or 4 bytes.)
It masquerades as a "textual" file format when in fact it's just another binary-blob file, except that it represents all its magic numbers in ASCII. It's not human-readable, it's not hand-editable, so the only benefit there is to the fact that it uses short lines and doesn't use binary characters is that it makes the file bigger. Oh wait, my mistake, that isn't actually a benefit at all."
The frustration shines through here and it is obviously not a simple task.
Consequently, there apparently exist no parsers outside Mozilla products that are actually able to parse this format.
I have reverse engineered complex file formats in the past and know it can be done with patience and the right amount of energy.
Sadly, this seems to be your only option as well. A good place to start would be to take a look at Thunderbird's source code.
I know this doesn't give you a straight-up solution but I think it is the only answer to the question considering the circumstances for this format.
And of course, you can always look into the extension API to see if that allows you to access the data you need in a more structured way than handling the file format directly.
Sample code which reads mork
Node.js: https://www.npmjs.com/package/mork-parser
Perl: http://metacpan.org/pod/Mozilla::Mork
Python: https://github.com/KevinGoodsell/mork-converter
More links: https://wiki.mozilla.org/Mork
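If all you need is "which .mab files mention this address", a full parser may not be necessary. Because Mork is written out as text, a raw substring search over each file can be enough. The sketch below is only that: `address_books_containing` is a hypothetical helper, the file names and contents in the demo are made-up placeholders rather than real Mork, and it assumes the address appears unescaped in the file (Mork's quoting and hex-encoding could hide values that contain special or non-ASCII characters).

```python
import tempfile
from pathlib import Path


def address_books_containing(profile_dir, address):
    """Return the sorted names of .mab files whose raw bytes contain `address`."""
    needle = address.lower().encode("utf-8")
    hits = []
    for path in sorted(Path(profile_dir).glob("*.mab")):
        # Plain case-insensitive byte search; no attempt to parse Mork itself.
        if needle in path.read_bytes().lower():
            hits.append(path.name)
    return hits


with tempfile.TemporaryDirectory() as profile:
    # Placeholder contents, NOT real Mork syntax; the file names are invented.
    Path(profile, "friends.mab").write_bytes(b"(81=Juan)(82=juan@example.com)")
    Path(profile, "work.mab").write_bytes(b"(81=Ana)(82=ana@example.com)")
    Path(profile, "family.mab").write_bytes(b"(81=Juan)(82=JUAN@example.com)")
    matches = address_books_containing(profile, "juan@example.com")

print(matches)
```

Point `profile_dir` at a copy of the Thunderbird profile folder rather than the live one, so nothing is read while Thunderbird is writing to it.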

What's the fastest way to tell if two MP3 files are duplicates?

I want to write a program that deletes duplicate iTunes music files. One approach to identifying dupes is to compare MD5 digests of the MP3 and m4a files. Is there a more efficient strategy?
BTW the "Display Duplicates" menu command on iTunes shows false positives. Apparently it just compares on the Artist and Track title strings.
If you use hashes to compare two sets of data, ideally they'd have to have exactly the same input each time in order to get exactly the same output (unless you miraculously picked two collisions of different input resulting in the same output). If you wanted to compare two MP3 files by hashing the entire file, then the two sets of song data might be exactly the same but since ID3 is stored inside the file, discrepancies there might make the files appear to be completely different. Since you're using a hash, you won't notice that perhaps 99% of the two files are matches because the outputs will be too different.
If you really want to use a hash to do this, you should only hash the sound data, excluding any tags that may be attached to the file. This is not recommended, though: if music is ripped from CDs, for example, and the same CD is ripped two different times, the results might be encoded/compressed differently depending on ripping parameters.
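Hashing only the sound data could look roughly like the Python sketch below. It assumes the only metadata is a leading ID3v2 tag and/or a trailing 128-byte ID3v1 tag (other tag types, such as APE tags or m4a atoms, are not handled), and it uses SHA-256 instead of MD5 purely as a choice for the example. `strip_id3`, `audio_digest` and `fake_id3v2` are hypothetical helper names, and the "audio" bytes are a placeholder rather than real MPEG frames.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return `data` without a leading ID3v2 tag or a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, then a 4-byte
    # "synchsafe" size that carries only 7 bits per byte.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:          # footer flag: 10 more bytes after the tag
            start += 10
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_digest(data: bytes) -> str:
    """Hash of the file contents with ID3 tags removed."""
    return hashlib.sha256(strip_id3(data)).hexdigest()


def fake_id3v2(payload: bytes) -> bytes:
    """Build a minimal ID3v2.3-style tag around `payload` (demo only)."""
    size = len(payload)
    synchsafe = bytes((size >> shift) & 0x7F for shift in (21, 14, 7, 0))
    return b"ID3" + b"\x03\x00" + b"\x00" + synchsafe + payload


audio = bytes(range(256)) * 4       # placeholder, not real MPEG frames
copy_1 = fake_id3v2(b"title=Song") + audio + b"TAG" + bytes(125)
copy_2 = fake_id3v2(b"title=Song (remaster)") + audio

print(hashlib.sha256(copy_1).hexdigest() == hashlib.sha256(copy_2).hexdigest())
print(audio_digest(copy_1) == audio_digest(copy_2))
```

The whole-file hashes differ because the tags differ, while the tag-stripped digests agree. This still only catches byte-identical audio, so two separate rips of the same CD would not match.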
A better (but much more complicated) alternative would be to compare the uncompressed audio data values. A little trial and error with known inputs can lead to a decent algorithm. Doing this perfectly will be very hard (if possible at all), but if you get something that's more than 50% accurate, it'll be better than going through by hand.
Note that even with an algorithm that can detect whether two songs are close (say, the same song ripped under different parameters), telling whether a live version is anything like a studio version would take an algorithm more complex than it's worth. If you can do that, there's money to be made here!
And touching back on the original question of how to tell quickly whether they're duplicates: a hash would be a lot faster, but a lot less accurate, than any algorithm built for this purpose. It's speed vs. accuracy and complexity.

How are files (especially audio files) organized internally?

I'm trying to grok this: Apple is talking about "packets" in audio files, and there is a fancy function called AudioFileReadPackets which takes a lot of arguments. One of them specifies the "start packet", and another one the number of packets which you want to read.
So I imagine an audio file to look like this, internally: It's made up of a lot of packets. If it's an audio file which has a variable bit rate format, then every packet may have a different size. If the file has a constant bit rate format, then every packet is the same size. So an audio file is like a truck full of boxes, and every box contains some interesting stuff.
Is that correct? Does it apply to any kind of file? Is this what files actually look like?
The question (even with the "especially audio files" qualification) is far too broad; different file formats are, well, different!
So to answer the question you will first have to specify a particular file type; then the answer will invariably be to look at its specification. Proprietary formats may not have a publicly available specification.
Specifications for many files (official and reverse engineered) can be found at the brilliant Wotsit's Format site.
AAC used by Apple iTunes and others is defined by ISO/IEC 13818-7:2006. The document will cost you 252 Swiss Francs (about US$233)! You'd have to be really interested (commercially) to pay that rather than use an existing AAC Codec.
"Packet" is a term commonly used in data transmission, so may be more applicable to audio streaming than audio files, where a "frame" may be more appropriate, or for data files in general a "record", but the terminology is flexible because it means whatever the person that wrote it thought it meant! If enough people misuse a term, it essentially becomes redefined (or multiply defined) to mean that, so I would not get too hung up on that. The author was no doubt using it to define a unit that has a defined format within a file that has multiple such units repeated sequentially.
"Packet" looks to me like Apple-specific terminology. I just did a lot of reading and coding to process WAV and MP3 files and I don't believe I saw the term "packet" once.
Files contain whatever the application that created them chose to place in them. Files are essentially a sequence of bytes. Any further organisation is a semantic distinction made by the program that created them. It is untrue to think of all files containing the same structure.
That said, certain data storage problems are similar enough to be solved in similar ways, and patterns start to emerge. Splitting data into records or packets is an example of that.
That's pretty much what audio files look like: a series of chunks of data, or frames. AudioFileReadPacketData and AudioFileReadPackets shield you from the details of, for instance, how big a frame might be in bytes (because you might be reading from a WAV file, which has a different structure to an MP3 file, or your MP3 file uses a variable bit rate).
The concept of frames doesn't apply in general to any file, but then you wouldn't be using the Audio File Services API to access just any old file.
For MP3 (and MP1, MP2) the file consists of frames. And yes, your understanding is correct: in VBR files packets have different sizes. In WAV files packets have the same length, if memory serves (I wrote a decoder/player 11 years ago).
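The fixed-size case is easy to check for uncompressed WAV. The sketch below uses Python's standard `wave` module, not Apple's Audio File Services, so it only illustrates the idea. It writes a short 16-bit stereo file into memory and reads it back; in WAV terms a "frame" is one sample for every channel, so each frame occupies channels × sample width bytes.

```python
import io
import wave

# Write 100 frames of 16-bit stereo silence into an in-memory WAV file.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(2)       # stereo
    out.setsampwidth(2)       # 2 bytes (16 bits) per sample
    out.setframerate(44100)
    out.writeframes(bytes(4 * 100))

# Read it back: every frame has the same size, so frame N starts at
# a fixed offset of N * frame_size within the audio data.
buf.seek(0)
with wave.open(buf, "rb") as wav:
    frame_size = wav.getnchannels() * wav.getsampwidth()
    total_frames = wav.getnframes()
    first_ten = wav.readframes(10)

print(frame_size, total_frames, len(first_ten))
```

With a variable bit rate format there is no such fixed stride, which is the bookkeeping a packet-oriented read call hides from you.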

How can I predict the size of an ISO 9660 filesystem?

I'm archiving data to DVD, and I want to pack the DVDs full. I know the names and sizes of all the files I want on the DVD, but I don't know how much space is taken up by metadata. I want to get as many files as possible onto each DVD, so I'm using a Bubblesearch heuristic with greedy bin-packing. I try 10,000 alternatives and get the best one. Currently I know the sizes of all the files and because I don't know how files are stored in an ISO 9660 filesystem, I add a lot of slop for metadata. I'd like to cut down the slop.
I could use genisoimage -print-size except it is too slow---given 40,000 files occupying 500MB, it takes about 3 seconds. Taking 8 hours per DVD is not in the cards. I've modified the genisoimage source before and am really not keen to try to squeeze the algorithm out of the source code; I am hoping someone knows a better way to get an estimate or can point me to a helpful specification.
Clarifying the problem and the question:
I need to burn archives that split across multiple DVDs, typically around five at a time. The problem I'm trying to solve is to decide which files to put on each DVD, so that each DVD (except the last) is as full as possible. This problem is NP-hard.
I'm using the standard greedy packing algorithm where you place the largest file first and you put it in the first DVD having sufficient room. So j_random_hacker, I am definitely not starting from random. I start from sorted and use Bubblesearch to perturb the order in which the files are packed. This procedure improves my packing from around 80% of estimated capacity to over 99.5% of estimated capacity. This question is about doing a better job of estimating the capacity; currently my estimated capacity is lower than real capacity.
I have written a program that tries 10,000 perturbations, each of which involves two steps:
Choose a set of files
Estimate how much space those files will take on DVD
Step 2 is the step I'm trying to improve. At present I am "erring on the side of caution" as Tyler D suggests. But I'd like to do better. I can't afford to use genisoimage -print-size because it's too slow. Similarly, I can't tar the files to disk, because not only is it too slow, but a tar file is not the same size as an ISO 9660 image. It's the size of the ISO 9660 image I need to predict. In principle this could be done with complete accuracy, but I don't know how to do it. That's the question.
Note: these files are on a machine with 3TB of hard-drive storage. In all cases the average size of the files is at least 10MB; sometimes it is significantly larger. So it is possible that genisoimage will be fast enough after all, but I doubt it---it appears to work by writing the ISO image to /dev/null, and I can't imagine that will be fast enough when the image size approaches 4.7GB. I don't have access to that machine right now, nor did I when I posted the original question. When I do have access in the evening I will try to get better numbers for the question. But I don't think genisoimage is going to be a good solution---although it might be a good way to learn a model of the filesystem that tells me how it works. Knowing that the block size is 2KB is already helpful.
It may also be useful to know that files in the same directory are burned to the same DVD, which simplifies the search. I want access to the files directly, which rules out tar-before-burning. (Most files are audio or video, which means there's no point in trying to hit them with gzip.)
Thanks for the detailed update. I'm satisfied that your current bin-packing strategy is pretty efficient.
As to the question, "Exactly how much overhead does an ISO 9660 filesystem pack on for n files totalling b bytes?" there are only 2 possible answers:
(1) Someone has already written an efficient tool for measuring exactly this. A quick Google search turned up nothing however which is discouraging. It's possible someone on SO will respond with a link to their homebuilt tool, but if you get no more responses for a few days then that's probably out too.
(2) You need to read the readily available ISO 9660 specs and build such a tool yourself.
Actually, there is a third answer:
(3) You don't really care about using every last byte on each DVD. In that case, grab a small representative handful of files of different sizes (say 5), pad them till they are multiples of 2048 bytes, and put all 2^5 possible subsets through genisoimage -print-size. Then fit the equation nx + y = iso_size - total_input_size on that dataset, where n = number of files in a given run, to find x, which is the number of bytes of overhead per file, and y, which is the constant amount of overhead (the size of an ISO 9660 filesystem containing no files). Round x and y up and use that formula to estimate your ISO filesystem sizes for a given set of files. For safety, make sure you use the longest filenames that appear anywhere in your collection for the test filenames, and put each one under a separate directory hierarchy that is as deep as the deepest hierarchy in your collection.
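The fitting step in (3) is ordinary least squares on one variable. Here is a minimal Python sketch, under the assumption that the overhead really is close to linear in the number of files. `fit_overhead` and `estimate_iso_size` are hypothetical helper names, and the sample numbers are invented placeholders, not measurements from genisoimage.

```python
import math


def fit_overhead(samples):
    """Least-squares fit of overhead = x * n + y over (n, overhead) pairs."""
    count = len(samples)
    mean_n = sum(n for n, _ in samples) / count
    mean_o = sum(o for _, o in samples) / count
    var_n = sum((n - mean_n) ** 2 for n, _ in samples)
    cov = sum((n - mean_n) * (o - mean_o) for n, o in samples)
    x = cov / var_n                 # bytes of overhead per file
    y = mean_o - x * mean_n         # constant overhead of an empty image
    return x, y


def estimate_iso_size(file_sizes, x, y, block=2048):
    """Padded file data plus the fitted overhead, with x and y rounded up."""
    padded = sum(math.ceil(size / block) * block for size in file_sizes)
    return padded + math.ceil(x) * len(file_sizes) + math.ceil(y)


# Each pair is (number of files, iso_size - total_input_size) for one run.
# These values are made up for the demo; real ones come from your own runs.
samples = [(n, 300 * n + 360448) for n in range(6)]
x, y = fit_overhead(samples)
print(x, y)
print(estimate_iso_size([1, 2048, 2049], x, y))
```

Real measurements will not fall exactly on a line, since directory records are allocated in whole blocks, so it may be worth checking the largest residual and adding it to y as a safety margin.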
I'm not sure exactly how you are currently doing this -- according to my googling, "Bubblesearch" refers to a way to choose an ordering of items that is in some sense near a greedy ordering, but in your case, the order of adding files to a DVD does not change the space requirements so this approach wastes time considering multiple different orders that amount to the same set of files.
In other words, if you are doing something like the following to generate a candidate file list:
Randomly shuffle the list of files.
Starting from the top of the list, greedily choose all files that you estimate will fit on a DVD until no more will.
Then you are searching the solution space inefficiently -- for any final candidate set of n files, you are potentially considering all n! ways of producing that set. My suggestion:
1. Sort all files in decreasing order of file size.
2. Mark the top (largest) file as "included," and remove it from the list. (It must be included on some DVD, so we might as well include it now.)
3. Can the topmost file in the list be included without the (estimated) ISO filesystem size exceeding the DVD capacity? If so:
   With probability p (e.g. p = 0.5), mark the file as "included".
4. Remove the topmost file from the list.
5. If the list is now empty, you have a candidate list of files. Otherwise, goto 3.
Repeat this many times and choose the best file list.
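A rough Python sketch of that procedure follows. `random_candidate` and `best_candidate` are hypothetical names, `estimate` is whatever size estimator you plug in (the demo just uses `sum`), and the sketch assumes the largest file fits on a disc by itself and works with sizes only rather than file names.

```python
import random


def random_candidate(sizes, capacity, estimate, p=0.5, rng=random):
    """One randomized pass: largest file first, then coin-flip the rest."""
    remaining = sorted(sizes, reverse=True)
    chosen = [remaining.pop(0)]        # the largest file is always included
    for size in remaining:             # consider the others, largest first
        fits = estimate(chosen + [size]) <= capacity
        # A file that fits is included with probability p; either way it
        # is then dropped from consideration for this pass.
        if fits and rng.random() < p:
            chosen.append(size)
    return chosen


def best_candidate(sizes, capacity, estimate, tries=200, seed=0):
    """Repeat the randomized pass and keep the fullest candidate found."""
    rng = random.Random(seed)
    candidates = (
        random_candidate(sizes, capacity, estimate, rng=rng)
        for _ in range(tries)
    )
    return max(candidates, key=sum)


# Toy numbers in arbitrary units, with the plain sum as the "estimator".
sizes = [700, 500, 400, 300, 200, 100]
best = best_candidate(sizes, capacity=1000, estimate=sum)
print(best, sum(best))
```

To fill several DVDs, remove the chosen files from the pool and run it again on what is left.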
Tyler D's suggestion is also good: if you have ~40000 files totalling ~500MB, that means an average file size of 12.5KB. ISO 9660 uses a block size of 2KB, meaning those files are wasting on average 1KB of disk space each (half a block), or about 8% of their size. So packing them together with tar first will save around 8% of space.
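The block-rounding waste can be computed directly for a real file list. This is a small Python sketch assuming only the 2048-byte block size mentioned above; `padded_size` and `slack` are hypothetical helper names.

```python
import math

BLOCK = 2048  # ISO 9660 block size in bytes, as stated above


def padded_size(size, block=BLOCK):
    """Size of one file once rounded up to a whole number of blocks."""
    return math.ceil(size / block) * block


def slack(sizes, block=BLOCK):
    """Total bytes lost to rounding every file up to a block boundary."""
    return sum(padded_size(size, block) - size for size in sizes)


sizes = [1, 2048, 2049]
print(slack(sizes))                  # bytes wasted across these three files
print(slack(sizes) / sum(sizes))     # the same, as a fraction of the payload
```

Running `slack` over the actual sizes tells you whether the half-a-block-per-file rule of thumb holds for your collection. With files averaging 10MB or more, the relative waste is far below 8%.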
Can't use tar to store the files on disk?
It's unclear if you're writing a program to do this, or simply making some backups.
Maybe do some experimentation and err on the side of caution - some free space on a disk wouldn't hurt.
Somehow I imagine you've already considered these, or that my answer is missing the point.
I recently ran an experiment to find a formula for a similar filling estimate on DVDs, and found a simple formula given some assumptions. Judging from your original post, this formula will likely give a low number for you, since it sounds like you have multiple directories and longer file names.
Assumptions:
all the file names are exactly 8.3 characters.
all the files are in the root directory.
no extensions such as Joliet.
The formula:
174 + floor(count / 42) + sum( ceil(file_size / 2048) )
count is the number of files
file_size is each file's size in bytes
the result is in 2048 byte blocks.
An example script:
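#!/usr/bin/perl -w
use strict;
use POSIX;    # provides ceil() and floor()
# Add up a list of numbers.
sub sum {
    my $out = 0;
    for (@_) {
        $out += $_;
    }
    return $out;
}
my @sizes = ( 2048 ) x 1000;    # example input: 1000 files of 2048 bytes each
my $file_count = @sizes;        # an array in scalar context gives its length
my $data_size = sum(map { ceil($_ / 2048) } @sizes);    # file data, in blocks
my $dir_size = floor( $file_count / 42 ) + 1;           # per-file-count term, in blocks
my $overhead = 173;                                     # constant term, in blocks
my $size = $overhead + $dir_size + $data_size;
$\ = "\n";    # make print append a newline
print $size;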
I verified this on disks with up to 150k files, with sizes ranging from 200 bytes to 1 MiB.
Nice thinking, J. Random. Of course I don't need every last byte, this is mostly for fun (and bragging rights at lunch). I want to be able to type du at the CD-ROM and have it very close to 4700000000.
I looked at the ECMA spec but like most specs it's medium painful and I have no confidence in my ability to get it right. Also it appears not to discuss Rock Ridge extensions, or if it does, I missed it.
I like your idea #3 and think I will carry it a bit further: I'll try to build a fairly rich model of what's going on and then use genisoimage -print-size on a number of filesets to estimate the parameters of the model. Then I can use the model to do my estimation. This is a hobby project so it will take a while, but I will get around to it eventually. I will post an answer here to say how much wastage is eliminated!

Reuse of characters in compiled .exe file

Once long ago, out of curiosity, I tried hex-editing the executable file of the game "Dangerous Dave".
I looked around the file for any strings I could find, and made some random edits to see if it would actually change the text displayed within the game.
I was surprised to see the result, which I have now recreated using a hex-editor and DOSBox:
As can be seen, editing the two characters "RO" in the string "ROMERO" resulted in 4 characters being changed, with the result becoming "ZUMEZU". It seems as if the program is reusing the two characters and prints them at the start and end of that string.
What is the cause of this? My first guess would be an attempt to make the executable smaller, but the code that reuses the characters would by itself probably take more space than the 2 bytes saved.
Is it just a trick done by the author, or just some compiler voodoo?
Tricky to say for sure without reverse-engineering, but my guess would be that a lot of the constant data in the program is compressed using an algorithm from the LZ family. These compression schemes work essentially in the way that you've observed: they encode repeated substrings as references to text that has previously been decoded.
These compression algorithms were probably used for more than just this one string, and not just for text either; it's quite possible that they were also used to compress other data, such as graphics or level layouts. In short, there were probably significant savings made by using this algorithm!
The use of these compression algorithms is common in older games as a way of saving disk space, but was not automatic - the implementation of this algorithm would likely have been something Romero added himself.
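The effect is easy to reproduce with a toy decoder. The Python sketch below uses a made-up token format (literal bytes, or a `(distance, length)` pair meaning "copy from output already produced"); it is not the format the game actually uses, which would need reverse-engineering to confirm.

```python
def lz_decode(tokens):
    """Decode a toy LZ77-style stream of literals and (distance, length) copies."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, bytes):
            out += token                    # literal bytes, emitted as stored
        else:
            distance, length = token
            for _ in range(length):         # copy from already-decoded output
                out.append(out[-distance])
    return bytes(out)


# "ROMERO" stored as the literal "ROME" plus "copy 2 bytes from 4 bytes back".
original = [b"ROME", (4, 2)]
# Hex-editing only the literal bytes "RO" -> "ZU" changes the copy as well.
edited = [b"ZUME", (4, 2)]

print(lz_decode(original))
print(lz_decode(edited))
```

Only two stored bytes were changed, yet four characters change in the decoded string, because the trailing "RO" was never stored at all; it existed only as a back-reference to the first two bytes.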
