CRC checks for files

I'm working with a small FAT16 filesystem, and I want to generate CRC values for individual XML files which store configuration information. In case the data changes or is corrupted, I want to be able to check the CRC to determine that the file is still in its original state.
The question is, how do I put the CRC value into the file without changing the CRC value of the file itself? I can think of a couple of solutions, but I think there must be a fairly standard solution for this issue.

You could append the CRC value to the end of the file. Then, when computing the CRC value later for checking, omit the last four bytes.

Define a header, generate the CRC of everything except the header, then put the value in the header.

A common solution is to just use different files. Alongside each file simply have a file with the same file name with a different extension. For example: foobar.txt and foobar.txt.md5 (or .crc).

The common solution, widely used in communication protocols, is to set the CRC field to 0, compute the CRC, and then write it in place of the 0. The checking code should do the reverse: read the CRC, zero the field, calculate the CRC and compare.
Also, for a file checksum I strongly recommend MD5 instead of CRC.
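A minimal sketch in C of that zero-the-field technique, using a plain bitwise CRC-32 for illustration (the record layout and field sizes here are made up; a real format would also pin down the byte order of the stored CRC):
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC-32 (reflected, poly 0xEDB88320); any CRC would do,
 * the point is the zero/compute/fill pattern below. */
static uint32_t crc32(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical fixed-size record with a 4-byte CRC field at a known offset. */
struct record {
    uint32_t crc;
    uint8_t  data[60];
};

static void seal(struct record *r)
{
    r->crc = 0;                                     /* 1. zero the field     */
    r->crc = crc32((const uint8_t *)r, sizeof *r);  /* 2. CRC the whole record */
}

static int verify(struct record *r)
{
    uint32_t stored = r->crc;
    r->crc = 0;                                     /* reverse the process   */
    uint32_t fresh = crc32((const uint8_t *)r, sizeof *r);
    r->crc = stored;                                /* restore the field     */
    return fresh == stored;
}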

One solution would be to use dsofile.dll to add extended properties to your files. You could save the CRC value (converted to a string) as an extended file property. That way you don't change the structure of the file.
dsofile.dll is an ActiveX dll so it can be called from various languages, however it confines you to running on Windows. Here's more information on dsofile.dll: http://support.microsoft.com/kb/224351

I wouldn't store the CRC in the file itself. I would have a single file (I would use XML format) that your program uses, with a list of filenames and their associated CRC values. No need to make it that complicated.

There is no way to do this directly. You could make the first x bytes (CRC-32 uses a 32-bit integer, so 4 bytes) of the file contain the CRC, and then, when calculating your CRC, only consider the bytes that come after those initial 4 bytes.
Another solution would be to include the CRC into the file name. So MyFile.Config would end up being MyFile.CRC1234567.Config.

Related

Common CRC file formats

Are there any common (or even standardized) file formats that are used to store CRC parameter sets (i.e. polynomial, width, initial value, XOR, etc.)?
It doesn't matter if the format is based on XML, TXT, CSV etc. as long as it commonly used.
Yes. Ross Williams wrote an excellent tutorial on CRCs, in which he defines a canonical set of parameters to define a CRC. Greg Cook took that further and created a list of known CRCs that uses those parameters with a standard format and set of names in a single line for each CRC. Here is an example description from that site of what is probably the most common CRC:
width=32 poly=0x04c11db7 init=0xffffffff refin=true refout=true xorout=0xffffffff check=0xcbf43926 name="CRC-32"
The check value is the CRC of the nine-byte string of ASCII characters "123456789".
I wrote the crcany code which reads those parameter lines and computes a CRC using them, applying bit-wise, byte-wise, and word-wise algorithms. Included with crcany is crcgen, which generates C code for any CRC.
This parameter set does not tell you how to put the CRC in a byte stream. That is a convention of the data format, not the CRC itself. So you are left having to say whether the bytes of the CRC word are to be stored in the stream in little-endian or big-endian order, as well as where they go and what the CRC is computed over.
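As a concrete illustration, here is a minimal bit-wise implementation in C driven by that parameter line; run, it prints the check value cbf43926 for the string "123456789":
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* CRC-32 from the parameters above: width=32, poly=0x04c11db7 (0xedb88320
 * once reflected), init=0xffffffff, refin/refout=true, xorout=0xffffffff. */
static uint32_t crc32_ieee(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;                 /* init                    */
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];                          /* refin: bits LSB-first   */
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;                   /* xorout                  */
}

int main(void)
{
    const char *check = "123456789";            /* the "check" string      */
    printf("%08x\n", crc32_ieee((const unsigned char *)check, strlen(check)));
    return 0;                                   /* prints cbf43926         */
}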

How to properly work with file upon encoding and decoding it?

It doesn't matter exactly how I encrypt and decrypt the files. I treat the file as a char array, and everything is almost fine until I get a file whose size is not divisible by 8 bytes. The algorithm processes 8 bytes at a time because its block size is 64 bits.
So when I hit, for example, a .jpg and simply tried appending spaces to the end of the file, the resulting file can't be opened (of course, with .txt files nothing bad happens).
Is there any way out here?
If you want information about the algorithm: http://en.wikipedia.org/wiki/GOST_(block_cipher).
UPD: I can't store how many bytes were added, because the initial file can be deleted or moved. What are we supposed to do when we only have the key and the encrypted file?
You need padding.
The best way to do this would be to use PKCS#7.
However, GOST is not so good; you'd be better off using AES-CBC.
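For illustration, a minimal PKCS#7 padding sketch in C, assuming an 8-byte block to match GOST's 64-bit block size:
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 8   /* 64-bit block, as for GOST */

/* PKCS#7: always append 1..BLOCK bytes, each equal to the pad length.
 * Returns the new length; buf must have room for len + BLOCK bytes. */
size_t pkcs7_pad(uint8_t *buf, size_t len)
{
    uint8_t pad = (uint8_t)(BLOCK - (len % BLOCK));
    memset(buf + len, pad, pad);
    return len + pad;
}

/* Returns the original length, or (size_t)-1 if the padding is malformed. */
size_t pkcs7_unpad(const uint8_t *buf, size_t len)
{
    if (len == 0 || len % BLOCK != 0)
        return (size_t)-1;
    uint8_t pad = buf[len - 1];
    if (pad == 0 || pad > BLOCK)
        return (size_t)-1;
    for (size_t i = len - pad; i < len; i++)
        if (buf[i] != pad)
            return (size_t)-1;
    return len - pad;
}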
There is an ongoing similar discussion in "python-channel".

checksum and md5, not the same thing?

I downloaded a file and used md5sum to see if the download was successful without corruption. I got the following value:
a7099fcf9572d91b10d0073b07e112cb ./Macaca_mulatta.MMUL_1.70.dna.chromosome.1.fa.gz
But when I checked the website I downloaded the file from, it gave me the following value.
10256 63747 Macaca_mulatta.MMUL_1.70.dna.chromosome.1.fa.gz
What is this 10-digit code? Is it not md5?
I downloaded the file from : ftp://ftp.ensembl.org/pub/release-70/fasta/macaca_mulatta/dna/CHECKSUMS
Ensembl is using the unix 'sum' utility to calculate the values in the CHECKSUMS file.
Here's more info about the program : http://en.wikipedia.org/wiki/Sum_%28Unix%29
To see if your download is correct, try:
sum Macaca_mulatta.MMUL_1.70.dna.chromosome.1.fa.gz
NOTE: It has happened before that Ensembl did not update their CHECKSUMS file, so it can always happen that the download is correct but the CHECKSUMS file is incorrect.
They are not the same thing. MD5 is a checksum but there are other checksum algorithms that are not MD5, such as SHA, CRC etc.
Generally, a checksum is a function that takes an input larger than its output and (ideally) produces very different outputs even if a single bit of the input is changed.
The output you're looking at consists of two decimal numbers, so it is most likely from the unix sum command (a 16-bit checksum followed by a block count), not an MD5 hash. The sum command may be used to calculate/verify it.
MD5 is a way to do a checksum, but there are others. CRC is one, so is SHA. All MD5 does is produce a hash code, and it is not the only algorithm to do so. I'm not sure what the 10 digit one is, but it can't be MD5.

How to include the md5 value of a file into itself

Suppose that you want to create a file which contains the md5 digest of itself. How do you do that?
Before you can include the md5 value in the file you have to calculate it, but you can only calculate the final value once the md5 digest has already been included in the file. It's kind of a dilemma. Any ideas?
The short answer is that unless you use MD5 vulnerabilities you can't do that. I believe that even using MD5 vulnerabilities building such collision is impractical. A solution would be to either attach the digest at the end of the file or ship it separately.
You have to avoid that cyclic dependency, so you can't checksum the checksum. To work around that, you could reserve space for the checksum in your file, but set that space to zeroes. Then calculate the checksum and embed it. To check it later, you'd read out the embedded checksum, set those bytes in the file to zero again, recompute the checksum and compare.
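A rough sketch of that workaround in C, assuming OpenSSL's MD5() is available and the whole file fits in a memory buffer with a 16-byte slot reserved at a known offset:
#include <openssl/md5.h>   /* assumes OpenSSL; MD5() is deprecated in 3.0 but present */
#include <stddef.h>
#include <string.h>

#define SLOT_LEN MD5_DIGEST_LENGTH   /* 16 bytes reserved inside the file */

/* buf holds the whole file; bytes [off, off+16) are the reserved slot. */
static void embed_md5(unsigned char *buf, size_t len, size_t off)
{
    unsigned char digest[SLOT_LEN];
    memset(buf + off, 0, SLOT_LEN);        /* 1. zero the reserved slot   */
    MD5(buf, len, digest);                 /* 2. hash the whole buffer    */
    memcpy(buf + off, digest, SLOT_LEN);   /* 3. write the digest back in */
}

/* To check: save the stored digest, zero the slot, re-hash, compare. */
static int check_md5(unsigned char *buf, size_t len, size_t off)
{
    unsigned char stored[SLOT_LEN], fresh[SLOT_LEN];
    memcpy(stored, buf + off, SLOT_LEN);
    memset(buf + off, 0, SLOT_LEN);
    MD5(buf, len, fresh);
    memcpy(buf + off, stored, SLOT_LEN);   /* restore the original bytes  */
    return memcmp(stored, fresh, SLOT_LEN) == 0;
}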

C: creating an archive file header

I am creating a file archiver/extractor (like tar), using POSIX API system calls in C. I have done part of the archiving bit.
I would like to know if anyone could help me with some C source code (using the above) to create a file header in C (where the header acts as an index), describing the file's attributes/metadata (name, date/time, etc.). All I have done so far is understand (not sure if that's even correct) that to create a file header it needs
a struct to hold the meta data, and lseek is needed to seek to beginning/end of file
like:
FileName=file.txt FileSize=0
FileDir=./blah/blah
FilePerms=000
\n\n
The archiving part of program has this process:
1. Get a list of all the files from the command line. (I can do this part)
2. Create a structure to hold the meta data about each file: name (255 char), size (64-bit int), date and time, and permissions.
3. For each file, get its stats.
4. Store the stats of each file within an array of structures.
5. Open the archive for writing. (I can do this part)
6. Write the header structure.
7. For each file, append its content to the archive file (at the end/start of each file).
8. Close the archive file. (I can do this part)
I am having difficulty creating the header as a whole, even though I know what it needs to do; as mentioned in the numbered points above, the bits I can't do are steps 2, 3, 4, 6 and 7.
Any help will be appreciated.
Thanks.
As ijw notes, there are several ways to create an archive file header. If cross-platform portability is going to be an issue at all - or if you need to switch between 32-bit and 64-bit builds of the software on the same platform, even - then you need to ensure that the sizes and layouts of the fields are fully understood on all platforms.
Per-file Metadata
One way to do that is to use a fixed format binary header with types of known size and endianness. This is what ijw suggested. However, you will need to handle long file names, and so you will need to store a length (probably in a 2-byte unsigned integer) and then follow that with the actual pathname.
The alternative, and generally now favoured technique, is to use printable fields (often called ASCII format, though that is something of a misnomer). The time is recorded as the decimal number of seconds since the Epoch converted to a string, etc. This is what modern ar archives use; it is what GNU tar does (more or less; there are some historical quirks that make that more confusing); it is what cpio -c (which is usually the default these days) does. The fields might be separated by nulls or spaces; there is an easy way to detect the end of the header; the header contains information about the file name (not necessarily as directly as you'd like or expect, but again, that is usually because the format has evolved over the years), and then is followed by the actual data. Somehow, you know the size of each field, and the file which the header describes, so that you can read the data reliably.
Efficiency is a red herring. The conversion to/from the text format is so swift by comparison with the first disk access that there is essentially no measurable performance issue. And the guaranteed portability typically far outweighs the (microscopic) performance benefit from using binary data format instead - doubly so when the binary data has to be transformed on input or output anyway to get it into an architecture-neutral format.
Central Index vs Distributed Index
The other issue to consider is whether the index of files in the archive is centralized (at the front, or at the end) or distributed (the metadata for each file immediately precedes the data for the file). There are some advantages to each format - generally, systems use the distributed version because you can write the information for each file without knowing how many files there are to process in total (for example, because you are recursively archiving a directory's contents). Having a central index up front means you can list the files without reading the whole archive - distributed metadata means you have to read the whole file. However, the central index complicates the building of the archive.
Note that even with a distributed index, you will normally need a header for the archive as a whole so that you can detect that the file is in the format you expect. Typically, there is some sort of marker information (!<arch>\n for an ar archive, usually; %PDF-1.2\n at the start of a PDF file, etc) to reassure you that the file contains what you expect. There might be some overall (archive-level) metadata. Then you will have the first file metadata followed by the file data, repeating until the end of archive (which might, or might not, have a formal end marker - more metadata).
[H]ow would I go about implementing it in the 'fixed format binary header' you suggested. I am having trouble with deciding what commands/functions are needed.
I intended to suggest that you do not go with a fixed format binary header; you should use a text-based header format. If you can work out how to do the binary format, be my guest (I've done it numerous times over the years - that doesn't mean I think it is a good idea).
So, some pointers here towards the 'text header' format.
For the file metadata, you might define that you include:
size
mode (permissions, type)
owner
group
modification time
length of name
name
You might reasonably decide that your file sizes are limited to 64-bit unsigned integer quantities, which means 20 decimal digits. The mode might be printed as a 16-bit octal number, requiring 6 octal digits. The owner and group might be printed as UID and GID values (rather than names), in which case you could use 10 digits for each. Alternatively, you could decide to use names, but you should then allow for names up to say 32 characters each. Note that names are typically more portable than numbers. Neither name nor number is of much relevance on the receiving machine unless you extract the data as root (but why would you want to do that?). The modification time is classically a 32-bit signed integer, representing the number of seconds since the Epoch (1970-01-01 00:00:00Z). You should allow for the Y2038 bug by allowing the number of seconds to grow bigger than the 32-bit quantity; you might decide that 12 leading digits will take you beyond the Y10K crisis (by a factor of 4 or so), and this is good enough; you might decide to allow for fractional seconds too. Together, this suggests that 26 spaces for the timestamp should be overkill. You can decide that each field will be separated from the next by a space (for legibility - think 'ease of debugging'!). You might reasonably decide that the length of each file name will be recorded in at most 4 decimal digits (so names up to 9999 characters long).
You need to know how to format the types portably - #include <inttypes.h> is your friend.
You then devise a format string for printing (writing) the file metadata, and a parallel string for scanning (reading) the file metadata.
Printing:
"%20" PRIu64 " %06o %-.32s %-.32s %26" PRIu64 " %-4d %s\n"
This prints the name too. It terminates the header with a newline. The total size is 127 bytes plus the length of the file name. That's probably excessive, but you can tweak the numbers to suit yourself.
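For example, a small (hypothetical) helper using that format string; in the real program the values would come from stat() on each file:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write one header line for a file in the text format described above. */
static int write_header(FILE *archive, uint64_t size, unsigned mode,
                        const char *owner, const char *group,
                        uint64_t mtime, const char *name)
{
    return fprintf(archive,
                   "%20" PRIu64 " %06o %-.32s %-.32s %26" PRIu64 " %-4d %s\n",
                   size, mode, owner, group, mtime,
                   (int)strlen(name), name);
}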
Scanning:
"%" SCNu64 " %o %.32s %.32s %" SCNu64 "%d"
This does not scan the name; you need to create the scanner for the name carefully, not least because you need to read spaces in the name. In fact, the code for scanning the user name and group name both assume no spaces too. If that is not acceptable (that is, names may contain spaces), then you need a more complex scan format, or something other than sscanf() to process the input data.
I'm assuming a 64-bit integer for the time field, rather than mixing up fractional seconds, etc, even though there's enough space to allow for fractional seconds. You'd likely save some space here.
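Here's a sketch of the matching reader, assuming (as noted) that the owner and group names contain no spaces; the file name is copied out separately using the recorded length, since the name itself may contain spaces:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int read_header(FILE *archive, uint64_t *size, unsigned *mode,
                       char owner[33], char group[33], uint64_t *mtime,
                       char name[4096])
{
    char line[8192];
    int namelen, pos;

    if (fgets(line, sizeof line, archive) == NULL)
        return -1;
    /* %n records where the fixed fields end, so the name can be copied raw. */
    if (sscanf(line, "%" SCNu64 " %o %32s %32s %" SCNu64 " %d %n",
               size, mode, owner, group, mtime, &namelen, &pos) != 6)
        return -1;
    if (namelen < 0 || namelen >= 4096 ||
        (size_t)pos + (size_t)namelen > strlen(line))
        return -1;
    memcpy(name, line + pos, (size_t)namelen);   /* name may contain spaces */
    name[namelen] = '\0';
    return 0;
}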
The getting of information for each file you can do with the stat() system call.
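For example, a minimal sketch pulling the relevant fields out of struct stat (the exact fields and formatting here are just illustrative):
#include <sys/stat.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static int show_meta(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1) {
        perror(path);
        return -1;
    }
    /* size, permission bits, owner/group IDs and modification time */
    printf("size=%" PRIu64 " mode=%06o uid=%u gid=%u mtime=%lld name=%s\n",
           (uint64_t)st.st_size,
           (unsigned)(st.st_mode & 07777),
           (unsigned)st.st_uid, (unsigned)st.st_gid,
           (long long)st.st_mtime, path);
    return 0;
}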
For the writing of the header, here's two solutions.
Trivial but evil:
struct file_header {
    ... data you want to put in
} fhdr;
fwrite(&fhdr, sizeof(fhdr), 1, file);
This is evil because structure packing varies from machine to machine, as does byte order and the size of basic types like 'int'. A file written by your program may not be readable by your program when it's compiled on another machine, or even with another compiler on the same machine in some cases.
Non-trivial but safe:
char name[xxx];
uint32_t length; /* Fixed byte length across architectures */
...
fwrite(name, sizeof(name), 1, file);
length = htonl(length); /* Or something else that converts
                           the length to a known endianness */
fwrite(&length, sizeof(length), 1, file);
Personally I'm not a fan of htonl() and friends; I prefer to write something that converts a uint32_t to a uchar[4] (which can be done trivially with shift operators) because C doesn't pin down the in-memory representation of even an integer. In practice you'd be hard pushed to find something that doesn't store a uint32_t as 4 bytes of 8 bits, but it's a thing to consider.
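For instance, a quick sketch of that shift-operator conversion (big-endian here, but either order works as long as you're consistent):
#include <stdint.h>

/* uint32_t -> 4 bytes and back, independent of the host's byte order. */
static void put_u32_be(uint32_t v, unsigned char out[4])
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

static uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}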
The variables listed above can be structure members in your structure. Reversing the process on read is left as an exercise to the reader.
