MD5 implementation in C for an XML file - c

I need to implement MD5 in order to verify a checksum embedded in an XML file that we received from our client. The checksum covers the whole file, including all XML tags, and the received MD5 checksum is 32 hexadecimal digits long.
The MD5 checksum field must be set to 0 in the received XML file before the checksum is calculated, and we then have to independently calculate and verify the MD5 checksum value of the received XML file.
Our application is implemented in C. Please assist me on how to implement this.
Thanks

This depends directly on the library used for XML parsing. It is tricky, however, because you can't simply embed the MD5 in the XML file itself: embedding the checksum changes the file and therefore its checksum, unless you compute the checksum only from specific elements. As I understand it, you receive the MD5 independently? Is it calculated from the whole file, or only from the tags/content?
MD5 Public Domain code link - http://www.fourmilab.ch/md5/
XML library for C - http://xmlsoft.org/
Exact solutions depend on the code used.
Based on your comment you need to do the following steps:
load the XML file (possibly even as plain text) and read the MD5
replace the MD5 in the file with zero and write the result back to a file (or, better, keep it in memory)
run MD5 on the raw file data and compare the result with the value stored earlier
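The last step above (comparing the computed digest with the stored text) can be sketched as follows. This is only an illustration: it assumes whichever MD5 library you pick hands back the usual 16-byte binary digest, and the helper names digest_to_hex and hex_equal_nocase are made up for this example. The comparison ignores case because the client might send upper-case hex.

```c
#include <ctype.h>
#include <stddef.h>

/* Convert a 16-byte MD5 digest into a 32-character lower-case hex string.
   hex must have room for 33 bytes (32 digits plus the terminating NUL). */
void digest_to_hex(const unsigned char digest[16], char hex[33])
{
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < 16; i++) {
        hex[2 * i]     = digits[digest[i] >> 4];
        hex[2 * i + 1] = digits[digest[i] & 0x0F];
    }
    hex[32] = '\0';
}

/* Compare two hex strings, ignoring letter case. Returns 1 if equal. */
int hex_equal_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}
```

You would call digest_to_hex on the digest computed from the zeroed file data and then pass the result, together with the checksum read from the XML, to hex_equal_nocase.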

There are public-domain implementations of MD5 that you should use, instead of writing your own. I hear that Colin Plumb's version is widely used.

Don't reinvent the wheel, use a proven existing solution: http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
Incidentally that was the first link that came up when I googled "md5 c implementation".

This is rather nasty. The approach suggested seems to imply you need to parse the XML document into something like a DOM tree, find the MD5 checksum and store it for future reference. Then you would replace the checksum with 0 before re-serializing the document and calculating its MD5 hash. This all sounds doable but potentially tricky. The major difficulty I see is that your new serialization of the document may not be the same as the original one, and differences that are irrelevant to XML (single versus double quotes around attribute values, added line breaks, or even a different encoding) will cause the hashes to differ. If you go down this route you'll need to make sure your app and the procedure used to create the document in the first place make the same choices. For this sort of problem canonical XML is the standard solution (http://www.w3.org/TR/xml-c14n).
However, I would do something different. With any luck it should be quite easy to write a regular expression that locates the MD5 hash in the file. You can then use it to grab the hash and replace it with 0 in the XML file before recalculating the hash. This sidesteps all the possible issues with parsing, changing and re-serializing the XML document. To illustrate I'm going to assume the hash '33d4046bea07e89134aecfcaf7e73015' lives in the XML file like this:
<docRoot xmlns='some-irrelevant-uri'>
<myData>Blar blar</myData>
<myExtraData number='1'/>
<docHash MD5="33d4046bea07e89134aecfcaf7e73015" />
<evenMoreOfMyData number='34'/>
</docRoot>
(which I've called hash.xml), that the MD5 should be replaced by 32 zeros (so the hash is correct) and illustrate the procedure on a shell command line using perl, md5 and bash. (Hopefully translating this into C won't be too hard given the existence of regular expression and hashing libraries.)
Breaking down the problem, you first need to be able to find the hash that is in the file:
perl -p -e'if (m#<docHash.+MD5="([a-fA-F0-9]{32})#) {$_ = "$1\n"} else {$_ = ""}' hash.xml
(this works by looking for the start of the MD5 attribute of the docHash element, allowing for possible other attributes, and then grabbing the next 32 hex characters. If it finds them it bungs them in the magic $_ variable, if not it sets $_ to be empty, then the value of $_ gets printed for each line. This results in the string "33d4046bea07e89134aecfcaf7e73015" being printed.)
Then you need to calculate the hash of the file with the hash replaced by zeros:
perl -p -e's#(<docHash.+MD5=)"([a-fA-F0-9]{32})#$1"00000000000000000000000000000000#' hash.xml | md5
(where the regular expression is almost the same, but this time the hex characters are replaced by 32 zeros and the whole file is printed. Then the MD5 of this is calculated by piping the result through an md5 hashing program.) Putting this together with a bit of bash gives:
if [ `perl -p -e'if (m#<docHash.+MD5="([a-fA-F0-9]{32})#) {$_ = "$1\n"} else {$_ = ""}' hash.xml` = `perl -p -e's#(<docHash.+MD5=)"([a-fA-F0-9]{32})#$1"00000000000000000000000000000000#' hash.xml | md5` ] ; then echo OK; else echo ERROR; fi
which executes those two small commands, compares the output and prints "OK" if the outputs match or "ERROR" if they don't. Obviously this is just a simple prototype, and it is in the wrong language, but I think it illustrates the most straightforward solution.
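For what it's worth, a rough C counterpart of the "find the hash and overwrite it with zeros" step might look like the sketch below. It is a simplification under several assumptions: the whole file has already been read into a NUL-terminated buffer, the hash sits in an attribute spelled exactly MD5= followed by a quoted run of 32 hex digits, and the function name extract_and_zero_md5 is invented for this example. Unlike the regular expressions above, it accepts either quote style and does not check that the attribute belongs to a docHash element.

```c
#include <ctype.h>
#include <string.h>

/* Find the first MD5="..." or MD5='...' attribute holding 32 hex digits.
   On success, copy the digits into hash (33 bytes), overwrite them in
   place with '0' characters and return 1. Return 0 if nothing matched. */
int extract_and_zero_md5(char *xml, char hash[33])
{
    char *p = xml;
    while ((p = strstr(p, "MD5=")) != NULL) {
        p += 4;                        /* skip past MD5= */
        char quote = *p;
        if (quote != '"' && quote != '\'')
            continue;                  /* not a quoted value, keep searching */
        char *digits = p + 1;
        size_t n = 0;
        while (n < 32 && isxdigit((unsigned char)digits[n]))
            n++;
        if (n == 32 && digits[32] == quote) {
            memcpy(hash, digits, 32);
            hash[32] = '\0';
            memset(digits, '0', 32);   /* same length, so offsets stay valid */
            return 1;
        }
    }
    return 0;
}
```

After this call the buffer is ready to be fed to whichever MD5 routine you use, and hash holds the value to compare against.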
Incidentally, why do you put the hash inside the XML document? As far as I can see it doesn't have any advantage compared to passing the hash along on a side channel (even something as simple as in a second file called documentname.md5) and makes the hash validation more difficult.

Check out these examples for how to use the XMLDSIG standard with .net
How to: Sign XML Documents with Digital Signatures
How to: Verify the Digital Signatures of XML Documents
You may also want to change the setting for preserving whitespace.

Related

Replace a number in a file using array data, bash

I'm not an expert in bash coding and I'm trying to write an iterative-style script to help me in my work.
I have a file that contains some numbers (coordinates), and I'm trying to write code that reads some specific numbers from the file and stores them in an array, modifies that array using some arithmetic operation, and then replaces the numbers in the original file with the modified array. So far I've done everything except replacing the numbers in the file. I tried using sed but it does not change the file. The original numbers are stored in an array called "readfile" and the new numbers are stored in an array called "d".
I'm trying to use sed in this way: sed -i 's/${readfile[$j]}/${d[$k]}/' file.txt
And I loop j and k to cover all the numbers in the arrays. Everything seems to work but the file is not being modified. After some digging, I'm noticing that sed is not reading the value of the array, but I do not know how to fix that.
Your help is really appreciated.
When a file isn't modified by sed -i, it means sed didn't find any matches to modify. Your pattern is wrong somehow.
First, use " instead of ' around the sed expression so that the shell actually expands the variables inside the string. Then look at the contents of the readfile array and check whether it actually matches the text. If it seems to match, look for special characters in the pattern, that is, characters that mean something specific to sed (the most common mistake is /, which interferes with the search command).
The fix for special characters is either to (1) escape them, e.g. \/ instead of just /, or (2) (and especially for /) to use another delimiter for the search/replace command (instead of s/foo/bar/ you can use s|foo|bar| or s,foo,bar, etc - pretty much any delimiter works, so you can pick one that you know isn't in the pattern string).
If you post data samples and more of your script, we can look at where you went wrong.

How to read content of unknown file

I have a file that holds manufacturing orders for a machine.
I would like to read the content of this file and edit it, but when I open it in a text editor, e.g. Notepad++, I get a bunch of weird characters:
xÚ¥—_HSQÀo«a)’êaAXŽâê×pD8R‰¬©s“i+ƒ´#¡$
-þl-ó/ÓíºIúPôàƒHˆP–%a&RÎÈn÷ü¹·;Ú;ç<ìòÝÃý}¿ó}‡{϶«rWg>˜›ãR‡)Çn0³Ûf³yÎW[5–šw½ÇRW{ñ’rO6¹ŽŸp¦ÙœcÏ.9yÀnýg
)Ë—e90ejÕø£rC. f¦}3ËŒ˜hü”å1g[…ø±ú ÜJøz®‹˜YfÈ,4`ŽKÉ—ù“ÔË¿d„þlG3#=˜Ž´+hF¬¦£€«šm¿áØ
ïÖµv‡ËpíÍ~™‡Aù
šëÈÚ]ÿç™DŒÉFØ ïƒæsij  ¦y=-74Æ/t=ÕŠr\˜š»Âä‰Ý­¨žã΢
dz·à‡'fœ½­yâ½4qåPjácòÄŒeÊhñ“ý™ÙÎÕ÷5ôlñ=˜Õ{ú;ø=Û;4OêYä>Ìpxbæâ­'è"oëB×1gQ9“'¹]Ô³’Ô³ø!ÌózÞyŸõžÓIŽù*&OÌXPÕ"ŽWžpíOÌè‚Þ3Òr0{Ž†R=_?…/¼žÞ0,ê=/?£ûÓËîy“2Z<ij³[ËÁì™÷–ôžÎ’Ããa÷<Maêéí…¼ž}©žYýZ-˜=­”á¤}π>3°¢÷œ$ïè‰3ìž«ƒÄs¿—xnŒÀ*¯gi$ÕómDËÁìùIeоû‡À¬?3°x¾"~ª§c˜öÝÇî颌°›x¾Fßb>Ï}QXÓ{öFi-êÙßóR”œe^Ñ÷ü‘¿g[Lë ŽwJZϘë¹3”³L©gH‚,^Ïe 2ôžWGøëÙ2‚Î
øœL¾ÅqÈäõ,ýç\œË3¾þeྗ&`Ϻ<KÒf“’»ðù]í‰ãžU^wèþåÔÖy”H}ò•6ø6
It looks like the file is encoded.
Any idea how to find the encoding and make the file readable and editable?
It's binary and probably encoded, so without knowledge of the data structure you can't do much beyond reverse engineering: change something, check what changed, and inspect the result with a hex editor.
It isn't impossible, though. If you can change the data in a known way (e.g. change the number of orders from 1 to 2) and export to a file, you can compare the binary values and find which byte holds that number. Of course, if it is encrypted and you don't know the key, it's easier to find another way.
For further reading, check this out - https://en.wikibooks.org/wiki/Reverse_Engineering/File_Formats
If you've got access to a Linux box why not use
hexdump -C <filename>
You will be able to get a much better insight into how the file is structured, than by using a text editor.
There are also many "hexdump" equivalent commands on Windows.
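One concrete thing a hex dump lets you do is check the first bytes against known signatures. As a guess based only on the pasted text: it starts with the characters xÚ, which would be the bytes 0x78 0xDA if the editor showed them as Latin-1, and that pair is a common zlib stream header. The sketch below applies the two-byte header rules from the zlib format (RFC 1950); the function name looks_like_zlib is made up for this example, and a positive result only suggests zlib, it does not prove it.

```c
#include <stddef.h>

/* Return 1 if the first two bytes form a plausible zlib header:
   compression method 8 (deflate), window size field at most 7, and the
   two bytes read as a big-endian 16-bit number divisible by 31. */
int looks_like_zlib(const unsigned char *data, size_t len)
{
    if (len < 2)
        return 0;
    unsigned cmf = data[0];
    unsigned flg = data[1];
    if ((cmf & 0x0F) != 8)      /* low nibble: compression method */
        return 0;
    if ((cmf >> 4) > 7)         /* high nibble: log2(window size) - 8 */
        return 0;
    return ((cmf << 8) | flg) % 31 == 0;
}
```

If the check passes, trying to inflate the file with a zlib-based tool would be a reasonable next experiment.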

Sorting huge volumed data using Serialized Binary Search Tree

I have 50 GB structured (as key/value) data like this which are stored in a text file (input.txt / keys and values are 63 bit unsigned integers);
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
2436941118228099529 7438724021973510085
3370171830426105971 6928935600176631582
3370171830426105971 5928936601176631564
I need to sort this data as keys in increasing order with the minimum value of that key. The result must be presented in another text file (data.out) under 30 minutes. For example the result must be like this for the sample above;
2436941118228099529 7438724021973510085
3370171830426105971 5928936601176631564
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
I decided that;
I will create a BST tree with the keys from the input.txt with their minimum value, but this tree would be more than 50GB. I mean, I have time and memory limitation at this point.
So I will use another text file (tree.txt) and I will serialize the BST tree into that file.
After that, I will traverse the tree using in-order traverse and write the sequenced data into data.out file.
My problem is mostly with the serialization and deserialization part. How can I serialize this type of data? I also want to use the INSERT operation on the serialized data, because my data is bigger than memory and I can't do this in memory. In effect I want to use text files as memory.
By the way, I am very new to this kind of thing. If there is a problem with my algorithm steps, please warn me. Any thoughts, techniques and code samples would be helpful.
OS: Linux
Language: C
RAM: 6 GB
Note: I am not allowed to use built-in functions like sort and merge.
Considering that your file seems to have a fairly uniform line size of around 40 characters, which gives around 1250000000 lines in total, I'd split the input file into smaller ones with the command:
split -l 2500000 biginput.txt
then I'd sort each of them
for f in x*; do sort -n "$f" > "s$f"; done
and finally I'd merge them (with -n again, so that the merge uses the same numeric ordering as the sort) by
sort -n -m sx* > bigoutput.txt
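Since the question rules out the built-in sort and merge, the merge step would have to be written by hand in C. Here is a minimal in-memory sketch of merging two already-sorted runs while keeping only the minimum value per key, which is what the expected output asks for. The names pair_t, emit_min and merge_runs_min are invented for this example; a real solution would stream the runs from the chunk files instead of holding them in arrays, and each input run is assumed to be sorted by key.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t key;
    uint64_t value;
} pair_t;

/* Append p to out[0..n). If the last emitted key equals p.key, keep the
   smaller value instead of adding a new entry. Returns the new length. */
static size_t emit_min(pair_t *out, size_t n, pair_t p)
{
    if (n > 0 && out[n - 1].key == p.key) {
        if (p.value < out[n - 1].value)
            out[n - 1].value = p.value;
        return n;
    }
    out[n] = p;
    return n + 1;
}

/* Merge two runs sorted by key into out, which must have room for
   na + nb entries. Returns the number of entries written. */
size_t merge_runs_min(const pair_t *a, size_t na,
                      const pair_t *b, size_t nb, pair_t *out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i].key <= b[j].key)
            n = emit_min(out, n, a[i++]);
        else
            n = emit_min(out, n, b[j++]);
    }
    while (i < na)
        n = emit_min(out, n, a[i++]);
    while (j < nb)
        n = emit_min(out, n, b[j++]);
    return n;
}
```

Merging the sorted chunks pairwise (or all at once, with one read position per chunk) in this fashion keeps memory use bounded by the buffer sizes rather than by the 50 GB input.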

HDF gzip compression vs. ASCII gzip compression

I have a 2D matrix with 1100x1600 data points. Initially, I stored it in an ascii-file which I tar-zipped using the command
tar -cvzf ascii_file.tar.gz ascii_file
Now, I wanted to switch to hdf5 files, but they are too large, at least in the way I am using them... First, I write the array into an hdf5-file using the c-procedures
H5Fcreate, H5Screate_simple, H5Dcreate, H5Dwrite
in that order. The data is not compressed within the hdf-file and it is relatively large, so I compressed it using the command
h5repack --filter=GZIP=9 hdf5_file hdf5_file.gzipped
Unfortunately, this hdf file with the zipped content is still larger than the compressed ascii file by a factor of about 4, see the following table:
file                 size
--------------------------
ascii_file           5721600
ascii_file.tar.gz     287408
hdf5_file            7042144
hdf5_file.gzipped    1117033
Now my question(s): Why is the gzipped ascii-file so much smaller and is there a way to make the hdf-file smaller?
Thanks.
Well, after reading Mark Adler's comment, I realized that this question is somewhat stupid: in the ascii case, the values are truncated after a certain number of digits, whereas in the hdf case the "real" values ("real" = whatever precision the data type I am using has) are stored.
There was, however, one possibility to further reduce the size of my hdf file: by applying the shuffle filter using the option
--filter=SHUF
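To see why shuffling can help gzip, here is a small sketch of the idea behind a byte shuffle: the first bytes of all elements are stored together, then all second bytes, and so on, so that slowly varying bytes (such as sign and exponent bytes of neighbouring floating-point values) end up next to each other. This is only an illustration of the principle, not HDF5's actual implementation, and the function name byte_shuffle is made up.

```c
#include <stddef.h>

/* Rearrange count elements of elem_size bytes each so that byte 0 of every
   element comes first, then byte 1 of every element, and so on.
   in and out must not overlap; both hold count * elem_size bytes. */
void byte_shuffle(const unsigned char *in, unsigned char *out,
                  size_t count, size_t elem_size)
{
    for (size_t i = 0; i < count; i++)
        for (size_t b = 0; b < elem_size; b++)
            out[b * count + i] = in[i * elem_size + b];
}
```

For two 4-byte elements with bytes 1 2 3 4 and 5 6 7 8, the shuffled order is 1 5 2 6 3 7 4 8.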

How do I find out if a file name has any extension in Unix?

I need to find out if a file or directory name contains any extension in Unix, for a Bourne shell script.
The logic will be:
If there is a file extension
Remove the extension
And use the file name without the extension
This is my first question in SO so will be great to hear from someone.
The concept of an extension isn't as strictly well-defined as in traditional / toy DOS 8+3 filenames. If you want to find file names containing a dot where the dot is not the first character, try this.
case $filename in
[!.]*.*) filename=${filename%.*};;
esac
This will trim the extension (as per the above definition, starting from the last dot if there are several) from $filename if there is one; otherwise it does nothing.
If you will not be processing files whose names might start with a dot, the case is superfluous, as the assignment will also not touch the value if there isn't a dot; but with this belt-and-suspenders example, you can easily pick the approach you prefer, in case you need to extend it, one way or another.
To also handle names that begin with a dot but contain another dot later on, use the pattern ?*.* instead, which only requires that the dot being matched is not the first character.
The case expression in pattern ) commands ;; esac syntax may look weird or scary, but it's quite versatile, and well worth learning.
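If the same rule is ever needed outside the shell, a rough C equivalent of the first case pattern could look like this. It is a sketch only: the function name strip_extension is invented, and it mirrors the definition used above (a name that does not start with a dot and contains a dot loses everything from the last dot onwards).

```c
#include <string.h>

/* Remove the extension from name in place, using the same rule as the
   shell pattern [!.]*.* : the name must not start with a dot and must
   contain a dot; everything from the last dot onwards is cut off. */
void strip_extension(char *name)
{
    char *dot = strrchr(name, '.');
    if (name[0] != '\0' && name[0] != '.' && dot != NULL)
        *dot = '\0';
}
```

With this rule archive.tar.gz becomes archive.tar, while .bashrc and README are left alone.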
I would use a shell-agnostic solution. Running the name through:
cut -d . -f 1
will give you everything up to the first dot ('-d .' sets the delimiter and '-f 1' selects the first field). Note that this cuts at the first dot, not the last. You can play with the params (try '--complement', a GNU extension, to reverse the selection) and get pretty much anything you want.

Resources