CSUM's MD5 calculation is different from fciv -md5

Is there any reason why an MD5 hash calculated by CSUM differs from the same hash calculated by the fciv program on Windows?

If FCIV output is set to XML, the hash values are stored in base64-encoded form rather than as hexadecimal strings. When you view the XML database directly, the base64-encoded representation of a hash does not visually match the hexadecimal value that the console displays. FCIV decodes the base64-encoded hashes when it displays the contents of the database on screen, so it shows the correct hexadecimal value.
try: fciv.exe -list -md5 -xml hashdb.xml
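To convince yourself that the two representations describe the same digest, you can decode the base64 value stored in the XML database and print it as hex; a minimal Python sketch (the stored value below is just the well-known MD5 of an empty file, used as a placeholder):

import base64
# Base64 string as stored in fciv's XML database (placeholder: MD5 of an empty file).
stored_b64 = "1B2M2Y8AsgTpgAmY7PhCfg=="
# Decoding and re-printing as hex gives the value the console shows.
print(base64.b64decode(stored_b64).hex())  # d41d8cd98f00b204e9800998ecf8427e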

Related

Is there a query in Snowflake to identify the characters in a file that are invalid utf8 [duplicate]

This question already has answers here:
How to find rows with non utf8 characters in Snowflake?
(2 answers)
Closed 2 years ago.
I have a file that, when loaded into Snowflake, produces an error about invalid UTF-8 characters. I have managed to load it into a table using another encoding, by creating a file format with the option ENCODING = 'iso-8859-1', but I would like to find a way to query those characters.
I've tried the TO_BINARY(col, 'UTF-8') function, hoping it would fail on the column that has invalid UTF-8, but I was not able to get a result that captures those characters. Has anyone faced the same issue?
Please note that ALL character data within Snowflake is encoded using UTF-8. There is no other option. A while back, this was not strictly true, and it was possible to have character data in Snowflake that was NOT valid UTF-8. But that should not be possible now.
Specifying the ENCODING = 'iso-8859-1' option instructed Snowflake (during the COPY INTO operation) to perform character set translation on the file (which was then interpreted as being encoded in ISO-8859-1), mapping all characters into their UTF-8 equivalent as it was written into Snowflake. As a result, all data in Snowflake is UTF-8 encoded, and therefore there should not be ANY non-UTF-8 characters to discover. That said, the result of character set translation might not end up translating to the correct/expected UTF-8 characters if the underlying (source) file was not truly encoded with the encoding that you specified during the COPY INTO (in this case, ISO-8859-1).
Given this, what is the ultimate problem that you are trying to resolve here? Did you load a source file with ENCODING = 'iso-8859-1' that was not actually ISO-8859-1? Or are you saying that the source file WAS truly encoded as ISO-8859-1, and yet somehow the resulting characters in Snowflake are either (1) incorrect or (2) invalid UTF-8? Or are you trying to determine the actual encoding of a source file (ignoring the whole ISO-8859-1 aspect altogether)?
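If the immediate goal is simply to locate the offending bytes in the source file before loading, one way (outside Snowflake; a minimal Python sketch with a placeholder file name) is to attempt a strict UTF-8 decode and report where it fails:

# Scan a file for byte sequences that are not valid UTF-8 and report their offsets.
path = "source_file.csv"  # placeholder
data = open(path, "rb").read()
pos = 0
while pos < len(data):
    try:
        data[pos:].decode("utf-8")
        break  # everything from pos onward decodes cleanly
    except UnicodeDecodeError as e:
        bad = pos + e.start
        print(f"invalid byte 0x{data[bad]:02x} at offset {bad}")
        pos = bad + 1  # skip past the offending byte and keep scanning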
A detailed answer can be found here: How to find rows with non utf8 characters in Snowflake?
Please mark my question as a duplicate and refer to that link.

Identify files based on the ASCII characters in the binary

I have a file uploader, and in order to validate that a file is actually of the expected type, I am inspecting the binary and checking the ASCII identifying characters (see the PDF example here).
The majority of files have an ASCII identifier; however, some (like XLS files) don't.
How would I best identify these?
I see they all have hex values, but as it stands I don't currently have the capability of converting the binary data to hex.
Workaround...
I don't have a hex converter, but I can convert both the binary data and the hex signatures to Base64, so that is what I am doing: comparing the Base64 output.
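For what it's worth, comparing the leading bytes (the file signature) directly avoids the detour through hex or Base64 altogether; here is a minimal Python sketch with an illustrative, far-from-complete signature table and a placeholder file name:

# Map a few well-known file signatures ("magic numbers") to a type label.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 compound file (legacy .xls/.doc)",
    b"PK\x03\x04": "zip container (.xlsx/.docx among others)",
}

def identify(path):
    with open(path, "rb") as f:
        head = f.read(16)  # long enough for the signatures above
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return "unknown"

print(identify("upload.bin"))  # placeholder file name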

Transferring external images & files to Google Cloud Storage

If I use the Google Cloud Storage File Transfer console
https://console.cloud.google.com/storage/transfer?project=XXXX
How do I generate an MD5 string for my image? Say my image is located at https://www.planwallpaper.com/static/images/desktop-year-of-the-tiger-images-wallpaper.jpg for example.
I can easily get the bytes value, but how would I generate the MD5 for this?
The docs were a bit vague. Any ideas?
An MD5 hash is used to ensure the data transferred into GCS is imported correctly. HTTPS data transfers include a variety of built-in checksums, but for very large imports of many, many files, errors can and do show up, and so GCS wants to be sure that each object that it downloads is exactly what you think it is.
An MD5 is a 128-bit number that is the result of running the MD5 algorithm on an object. This number can be represented in a variety of ways (the popular md5sum command uses hexadecimal strings). GCS asks that you represent this number as a base64 encoding. Here's a command that can generate an MD5 sum in the right format:
openssl md5 -binary NameOfSourceFile | openssl enc -base64
There's a standard GCS object that can be used to validate your MD5 logic. The object https://storage.googleapis.com/md5-test/md5-test has a base64'd MD5 string of BfnRTwvHpofMOn2Pq7EVyQ==.
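If you would rather compute it programmatically, a minimal Python sketch of the same pipeline (the file name is just a placeholder) would look like this:

import base64
import hashlib

# Raw 16-byte MD5 digest, then base64, matching
# "openssl md5 -binary NameOfSourceFile | openssl enc -base64".
with open("NameOfSourceFile", "rb") as f:  # placeholder file name
    digest = hashlib.md5(f.read()).digest()
print(base64.b64encode(digest).decode("ascii"))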

Convert WAV to base64

I have some wave files (.wav) and I need to convert them to base64-encoded strings. Could you guide me on how to do that in Python, C, or C++?
ForceBru's answer
import base64
enc=base64.b64encode(open("file.wav").read())
has one problem. I noticed that for some WAV files I encoded, the generated string was shorter than expected.
Python docs for "open()" say
If mode is omitted, it defaults to 'r'.
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability.
Hence, the code snippet above doesn't read the file in binary mode. The code below should be used instead for correct output.
import base64
enc = base64.b64encode(open("file.wav", "rb").read())
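As a quick sanity check (a minimal round-trip sketch, assuming the same file name), decoding the result should give back exactly the original bytes:

import base64
with open("file.wav", "rb") as f:
    original = f.read()
enc = base64.b64encode(original)
# b64decode reverses the encoding byte for byte.
assert base64.b64decode(enc) == original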
Python
The easiest way
from base64 import b64encode
f=open("file.wav")
enc=b64encode(f.read())
f.close()
Now enc contains the encoded value.
You can use a slightly simplified version:
import base64
enc=base64.b64encode(open("file.wav").read())
C
See this file for an example of base64 encoding of a file.
C++
Here you can see the base64 conversion of strings. I think it wouldn't be too difficult to do the same for files.

MD5 implementation in C for an XML file

I need to implement MD5 checksum verification for an XML file, including all XML tags, which we have received from our client. The received MD5 checksum is 32 hexadecimal digits.
We need to set the MD5 checksum field to 0 in the received XML file prior to the checksum calculation, and we have to independently calculate and verify the MD5 checksum value of the received XML file.
Our application is implemented in C. Please assist me on how to implement this.
Thanks
This depends directly on the library used for XML parsing. It is tricky, however, because you can't simply embed the MD5 in the XML file itself: embedding the checksum changes the very bytes it was calculated from, unless you compute the checksum only over specific elements. As I understand it, you receive the MD5 independently? Is it calculated from the whole file, or only from the tags/content?
MD5 Public Domain code link - http://www.fourmilab.ch/md5/
XML library for C - http://xmlsoft.org/
Exact solutions depend on the code used.
Based on your comment you need to do the following steps (a minimal Python sketch of these steps follows below):
load the XML file (possibly even as plain text) and read the MD5
substitute the MD5 in the file with zeros and write the file out (or better, keep it in memory)
run MD5 on the pure file data and compare it with the value stored before
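A minimal sketch of those three steps in Python (the attribute name, the 32-zero fill convention and the file path are assumptions, taken from the worked example further down):

import hashlib
import re

def verify_embedded_md5(path):
    # Work on the raw bytes so that no re-serialization differences creep in.
    with open(path, "rb") as f:
        data = f.read()
    # Locate the embedded 32-hex-digit checksum (attribute name is assumed).
    match = re.search(rb'MD5="([0-9a-fA-F]{32})"', data)
    if match is None:
        return False
    stored = match.group(1).decode("ascii").lower()
    # Zero out the stored checksum before recomputing, as the spec requires.
    zeroed = data[:match.start(1)] + b"0" * 32 + data[match.end(1):]
    return hashlib.md5(zeroed).hexdigest() == stored

print(verify_embedded_md5("hash.xml"))  # file name is a placeholder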
There are public-domain implementations of MD5 that you should use, instead of writing your own. I hear that Colin Plumb's version is widely used.
Don't reinvent the wheel, use a proven existing solution: http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
Incidentally that was the first link that came up when I googled "md5 c implementation".
This is rather nasty. The approach suggested seems to imply you need to parse the XML document into something like a DOM tree, find the MD5 checksum and store it for future reference. Then you would replace the checksum with 0 before re-serializing the document and calculating its MD5 hash. This all sounds doable but potentially tricky. The major difficulty I see is that your new serialization of the document may not be the same as the original one, and differences that are irrelevant to XML, like the use of single or double quotes around attribute values, added line breaks or even a different encoding, will cause the hashes to differ. If you go down this route you'll need to make sure your app and the procedure used to create the document in the first place make the same choices. For this sort of problem canonical XML is the standard solution (http://www.w3.org/TR/xml-c14n).
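If you do go the canonicalization route, Python 3.8+ ships a C14N helper in the standard library; a minimal sketch, assuming the hash.xml example file shown further down:

from xml.etree.ElementTree import canonicalize
# Canonical XML normalizes details such as attribute order, quoting and line endings,
# so that both sides hash the same byte stream.
print(canonicalize(from_file="hash.xml"))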
However, I would do something different. With any luck it should be quite easy to write a regular expression to locate the MD5 hash in the file. You can then use this to grab the hash, and to replace it with zeros in the XML file before recalculating the hash. This sidesteps all the possible issues with parsing, changing and re-serializing the XML document. To illustrate, I'm going to assume the hash '33d4046bea07e89134aecfcaf7e73015' lives in the XML file like this:
<docRoot xmlns="some-irrelevant-uri">
<myData>Blar blar</myData>
<myExtraData number="1"/>
<docHash MD5="33d4046bea07e89134aecfcaf7e73015" />
<evenMoreOfMyData number="34"/>
</docRoot>
(which I've called hash.xml), that the MD5 should be replaced by 32 zeros (so the hash is correct) and illustrate the procedure on a shell command line using perl, md5 and bash. (Hopefully translating this into C won't be too hard given the existence of regular expression and hashing libraries.)
Breaking down the problem, you first need to be able to find the hash that is in the file:
perl -p -e'if (m#<docHash.+MD5="([a-fA-F0-9]{32})#) {$_ = "$1\n"} else {$_ = ""}' hash.xml
(this works by looking for the start of the MD5 attribute of the docHash element, allowing for possible other attributes, and then grabbing the next 32 hex characters. If it finds them it bungs them in the magic $_ variable, if not it sets $_ to be empty, then the value of $_ gets printed for each line. This results in the string "33d4046bea07e89134aecfcaf7e73015" being printed.)
Then you need to calculate the hash of the file with the hash replaced with zeros:
perl -p -e's#(<docHash.+MD5=)"([a-fA-F0-9]{32})#$1"00000000000000000000000000000000#' hash.xml | md5
(where the regular expression is almost the same, but this time the hex characters are replaced by 32 zeros and the whole file is printed; the MD5 of this is then calculated by piping the result through an md5 hashing program.) Putting this together with a bit of bash gives:
if [ `perl -p -e'if (m#<docHash.+MD5="([a-fA-F0-9]{32})#) {$_ = "$1\n"} else {$_ = ""}' hash.xml` = `perl -p -e's#(<docHash.+MD5=)"([a-fA-F0-9]{32})#$1"00000000000000000000000000000000#' hash.xml | md5` ] ; then echo OK; else echo ERROR; fi
which executes those two small commands, compares the outputs, and prints "OK" if they match or "ERROR" if they don't. Obviously this is just a simple prototype, and it is in the wrong language, but I think it illustrates the most straightforward solution.
Incidentally, why do you put the hash inside the XML document? As far as I can see it doesn't have any advantage compared to passing the hash along on a side channel (even something as simple as in a second file called documentname.md5) and makes the hash validation more difficult.
Check out these examples for how to use the XMLDSIG standard with .NET:
How to: Sign XML Documents with Digital Signatures
How to: Verify the Digital Signatures of XML Documents
You should maybe also consider changing the setting for preserving whitespace.
