MD5 checksum of the whole file is different from checksum of content - md5

I created a file a.txt containing one word - 'dog'.
Here is its MD5 checksum:
$ md5sum a.txt
c52605f607459b2b80e0395a8976234d a.txt
Here is the MD5 checksum of the word 'dog':
$ perl -e "use Digest::MD5 qw(md5_base64 md5_hex); print(md5_hex('dog'));"
06d80eb0c50b49a509b49f2424e8c805
Why are checksums different?
Thank you,
Martin

Presumably you have a newline at the end of the file. Try using echo -n:
$ perl -e "use Digest::MD5 qw(md5_base64 md5_hex); print(md5_hex('dog'));"
06d80eb0c50b49a509b49f2424e8c805
$ echo 'dog' >a.txt
$ md5sum a.txt
362842c5bb3847ec3fbdecb7a84a8692 a.txt
$ echo -n 'dog' >a.txt
$ md5sum a.txt
06d80eb0c50b49a509b49f2424e8c805 a.txt
This is quite a common question:
Why does Perl and /bin/sha1 give different results?
Why PHP's md5 is different from OpenSSL's md5?
Why is the same input returning two different MD5 hashes?

md5_base64 is just an imported function; it isn't used here.
use Digest::MD5 qw(md5_base64 md5_hex)
means that I can use the functions md5_base64() or md5_hex() from the Digest::MD5 library.
Basically, you could use tools other than Perl to compute the MD5 hash of the word...
I'm wondering why the checksum of the file (using md5sum) is different from the checksum of the content itself...
Does md5sum append some information about the file to the content before computing the MD5?
Or is there some "end of file" character?
Thank you for your time...

When you compute the MD5 of a file (a .txt file in your situation), the entire content of the file is taken into consideration, including non-printable control characters such as line feeds (LF) and carriage returns (CR). These bytes have hex values of their own and change the resulting MD5, which makes it different from the result of just passing the bare string to the MD5 function. There is no special "end of file" character appended; in your case the difference comes from the trailing newline at the end of the file.
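You can confirm this by hashing the string with the trailing newline included; the result matches the checksum of the file written by echo 'dog' above:
$ perl -e 'use Digest::MD5 qw(md5_hex); print md5_hex("dog\n");'
362842c5bb3847ec3fbdecb7a84a8692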

Related

c - merging two text files

Say I have 2 text files:
A.txt:
Param1: Value1
Param2: Value2
.
.
.
.
ParamM: ValueM
B.txt:
Param1: Value1
Param2: Value2
.
.
.
.
ParamN: ValueN
Number of parameters in A.txt, i.e. M can be greater than, lesser than or equal to the number of parameters in B.txt, i.e. N.
Values for the same parameter need not be the same in A.txt and B.txt.
M and N can probably reach a maximum value of 200.
The parameter names are arbitrary. They have no numbers. The above was just for an illustration.
My objective is to merge A.txt and B.txt. If any conflicts occur, I have a file/in-memory storage which dictates which one takes precedence.
For example,
A.txt could look like:
Path: C:\Program\Files
Data: ImportantInfo.dat
Version: 1.2.3
Useless: UselessParameter.txt
and
B.txt could look like:
Path: C:\ProgramFiles
Data: NotSoImportant.dat
Version: 1.0.0
Useful: UsefulParameter.txt
The final text file should look like:
Path: C:\ProgramFiles
Data: ImportantInfo.dat
Version: 1.2.3
Useful: UsefulParameter.txt
Now my approach to thinking about this is:
get a line from A.txt
get a line from B.txt
tokenize both by ":"
compare param names
if same
    write A.txt's value to Result.txt
else if different
    write A.txt's line into Result.txt
    write B.txt's line into Result.txt    /* Order doesn't really matter */
repeat above steps until end of both text files
This approach doesn't take care of the case when A.txt has a certain parameter and B.txt doesn't. It would be extremely helpful if some lightweight library/API existed to do this. I am not looking for command-line tools (because of the overhead of system() calls).
Thanks!
This would be so much easier in a language that has a Map datatype.
I would do it like this:
Read all of B and store the key-value strings in memory
Read A and for each key either overwrite the value (if it exists) or add it (if it doesn't)
Output the key-values to file
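For instance, awk's associative arrays give you exactly that map. A minimal sketch of the approach above (purely to illustrate the algorithm, since the asker wants a C library): it reads B.txt first and then A.txt, so A's values overwrite B's on conflicts, and it splits on the first ": " only because values such as C:\ProgramFiles themselves contain colons.
awk '
{
    i = index($0, ": ")                     # assumes every line looks like "Param: Value"
    key = substr($0, 1, i - 1)
    if (!(key in vals)) order[++n] = key    # remember the order keys were first seen
    vals[key] = substr($0, i + 2)
}
END {
    for (j = 1; j <= n; j++) print order[j] ": " vals[order[j]]
}' B.txt A.txt > Result.txt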

Shell Script to remove duplicate entries from file

I would like to remove duplicate entries from a file. The file looks like this:
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd3:mE7YHNejLCviM:cvsabc
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
How can I remove the duplicates from this file by using shell script?
From the sort manpage:
-u, --unique
with -c, check for strict ordering; without -c, output only the first of an equal run
sort -u yourFile
should do.
If you do not want to change the order of the input file, you can do:
$ awk '!v[$0]{ print; v[$0]=1 }' input-file
or, if the file is small enough (less than 4 billion lines, to ensure that no line is repeated 4 billion times), you can do:
$ awk '!v[$0]++' input-file
Depending on the implementation of awk, you may not need to worry about the file being less than 2^32 lines long. The concern is that if you see the same line 2^32 times, you may overflow an integer in the array value, and the 2^32nd instance (or 2^31st) of the duplicate line will be output a second time. In reality, this is highly unlikely to be an issue!
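For example, against the sample file from the question (saved as input-file), either one-liner prints only the first occurrence of each line, in the original order:
$ awk '!v[$0]++' input-file
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd3:mE7YHNejLCviM:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc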
@shadyabhi's answer is correct; if the output needs to be redirected to a different file, use:
sort -u inFile -o outFile

linux appending files

I have a program that I generally run like this: a.out<guifpgm.txt>gui.html
Where a.out is my compiled C program, guifpgm.txt is the input file, and gui.html is the output file. But what I really want to do is take the output from a.out<guifpgm.txt and, rather than just replacing whatever is in gui.html with the output, place the output in the middle of the file.
So something like this:
gui.html contains the following to start: <some html>CODEGOESHERE<some more html>
output of a.out: alert("this is some dynamically generated stuff");
I want gui.html to contain the following: <some html>alert("this is some dynamically generated stuff");<some more html>
How can I do this?
Thanks!
Sounds like you want to replace text. For that, use sed, not C:
sed -i s/CODEGOESHERE/alert(\"this is some dynamically generated stuff\")/g gui.html
If you really need to run a.out to get its output, then do something like:
sed -i s/CODEGOESHERE/`a.out`/g gui.html
I ended up using the Linux cat command: a.out < guifpgm.txt > output.txt, then cat before.txt output.txt after.txt > final.txt.
A simplification of your cat method would be to use
./a.out < guifpgm.txt | cat header.txt - footer.txt > final.txt
The - is replaced with the input from STDIN. This cuts down somewhat on the intermediate files. Using > instead of >> overwrites the contents of final.txt, rather than appending.
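If gui.html already contains the CODEGOESHERE placeholder, one way to produce header.txt and footer.txt is to split gui.html around it (a rough sketch, assuming the placeholder sits alone on its own line):
sed '/CODEGOESHERE/,$d' gui.html > header.txt     # everything before the placeholder
sed '1,/CODEGOESHERE/d' gui.html > footer.txt     # everything after the placeholder
./a.out < guifpgm.txt | cat header.txt - footer.txt > final.txt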
Just for the fun of it, here is the awk solution for program a.out and a template file template in which the line "REPLACE ME" needs to be replaced. It puts the resulting output in output_file.txt.
awk '/^REPLACE ME$/{while("./a.out <input.txt"|getline){print $0}getline} {print $0}' template > output_file.txt
EDIT: Minor correction to add input file, remove UUOC, and fix a minor bug (the last line of a.out was being printed twice)
Alternatively... perl:
perl -pe '$_=`./a.out <input.txt` if /REPLACE ME/' template > output_file.txt
Although dedicated perlers could probably do better

BSD md5 vs GNU md5sum output format?

Anyone know why the BSD md5 program produces hash output in this format ...
MD5 (checksum.md5) = 9eb7a54d24dbf6a2eb9f7ce7a1853cd0
... while GNU md5sum produces a much more sensible format like this?
9eb7a54d24dbf6a2eb9f7ce7a1853cd0 checksum.md5
As far as I can tell, the md5sum format is much easier to parse and makes more sense. How do you do md5sum --check with md5? And what do the -p, -q, -r, -t, -x options mean? man md5 says nothing about those options! :|
On current OS X (BSD-based) systems, you can use the md5 -r switch to get the expected output.
sgwilbur@gura:/vms/DevOps-v3.4$ md5 vmware*
MD5 (vmware-0.log) = 61ba1d68a144023111539abee08f4044
MD5 (vmware-1.log) = 97bc6f22b25833c3eca2b2cc40b83ecf
MD5 (vmware-2.log) = f92a281102710c4528d4ceb88aa0ac9b
MD5 (vmware.log) = 1f7858d361929d4bc5739931a075c0ad
Adding the md5 -r switch made the output look more like I was expecting, and made it easier to diff against the md5 sums produced on a Linux machine.
sgwilbur@gura:/vms/DevOps-v3.4$ md5 -r vmware*
61ba1d68a144023111539abee08f4044 vmware-0.log
97bc6f22b25833c3eca2b2cc40b83ecf vmware-1.log
f92a281102710c4528d4ceb88aa0ac9b vmware-2.log
1f7858d361929d4bc5739931a075c0ad vmware.log
This was the simplest approach for me to make it easy to diff against output generated by the md5sum command on a Linux box.
Historical reasons, I guess. Meanwhile, -q suppresses the "MD5 (...) = " output, so md5 -q checksum.md5 gives 9eb7a54d24dbf6a2eb9f7ce7a1853cd0
This is implied if md5 is not given any arguments and reads from stdin.
Unfortunately, md5sum in this case leaves " -" after the checksum ("9eb7a54d24dbf6a2eb9f7ce7a1853cd0 -"),
so if you're looking for some generic function to return the checksum, here is what might help:
checksum() {
    (md5sum <"$1"; test $? = 127 && md5 <"$1") | cut -d' ' -f1
}
checksum /etc/hosts
FreeBSD's man page says this about the arguments:
-p    Echo stdin to stdout and append the checksum to stdout.
-q    Quiet mode - only the checksum is printed out. Overrides the -r option.
-r    Reverses the format of the output. This helps with visual diffs. Does nothing when combined with the -ptx options.
-t    Run a built-in time trial.
-x    Run a built-in test script.
I realize this is an old page, but I was creating checksums on FreeBSD and checking them on Linux, and this page didn't help me solve the problem, so I came up with this small sed script to create checksums on FreeBSD that match the Linux md5sum output:
md5 file [file ...] | sed -e 's#^MD5 [(]\(.*\)[)] = \(.*\)$#\2 \1#' > md5sums.txt
This will use the FreeBSD md5 command and rearrange the output to look like the GNU md5sum.
Then on Linux I can just use md5sum --check md5sums.txt
You can also use the above sed script with an existing file produced by FreeBSD's md5 command.
I also put this alias in my FreeBSD .cshrc file:
alias md5sum "md5 \!* | sed -e '"'s#MD5 [(]\(.*\)[)] = \(.*\)$#\2 \1#'"'"
now on FreeBSD I can just say md5sum file1 file2 file3 ... and it just works.
You can use GNU md5sum -c checksum.md5, which reads checksum.md5 and verifies the file it lists against the stored checksum.
md5sum -c checksum.md5 | grep "checksum: OK" -
Example inside a Ruby system call to check against a BSD formatted .md5 file:
system("md5sum -c checksum.md5 | grep \"checksum: OK\" -")
This will return true or false depending on whether the check passed.

How can I find encoding of a file via a script on Linux?

I need to find the encoding of all files that are placed in a directory. Is there a way to find the encoding used?
The file command is not able to do this.
The encoding that is of interest to me is ISO 8859-1. If the encoding is anything else, I want to move the file to another directory.
It sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.
Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too :)
file -bi <file name>
If you want to do this for a bunch of files:
for f in `find | egrep -v Eliminate`; do echo "$f" ' -- ' `file -bi "$f"` ; done
uchardet - An encoding detector library ported from Mozilla.
Usage:
~> uchardet file.java
UTF-8
Various Linux distributions (Debian, Ubuntu, openSUSE, Arch Linux, etc.) provide binaries.
On Debian you can also use encguess:
$ encguess test.txt
test.txt US-ASCII
As it is a Perl script, it can be installed on most systems by installing Perl, or installed as a standalone script if Perl is already available.
$ dpkg -S /usr/bin/encguess
perl: /usr/bin/encguess
Here is an example script using file -I and iconv which works on Mac OS X.
For your question, you need to use mv instead of iconv:
#!/bin/bash
# 2016-02-08
# Check the encoding of each file and convert ISO-8859-1 files to UTF-8
for f in *.java
do
    encoding=$(file -I "$f" | cut -f 2 -d";" | cut -f 2 -d=)
    case $encoding in
        iso-8859-1)
            iconv -f iso8859-1 -t utf-8 "$f" > "$f.utf8"
            mv "$f.utf8" "$f"
            ;;
    esac
done
To convert encoding from ISO 8859-1 to ASCII:
iconv -f ISO_8859-1 -t ASCII filename.txt
It is really hard to determine whether a file is ISO 8859-1. If a text contains only 7-bit characters, it could be ISO 8859-1, but you can't know for sure. If it contains 8-bit characters, the upper-region characters exist in other encodings as well, so you would have to use a dictionary to get a better guess of which word was intended and determine from there which letter it must be. Finally, if you detect that the file might be UTF-8, then you are sure it is not ISO 8859-1.
Encoding is one of the hardest things to determine, because nothing in the file tells you which encoding it uses.
With Python, you can use the chardet module.
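A minimal sketch of calling it from the shell (this assumes the chardet package is installed, e.g. via pip install chardet):
# print chardet's guess and confidence for a file
python3 -c 'import sys, chardet; print(chardet.detect(open(sys.argv[1], "rb").read()))' myfile.txt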
With this command:
for f in `find .`; do echo `file -i "$f"`; done
you can list all files in a directory and subdirectories and the corresponding encoding.
If files have a space in the name, use:
IFS=$'\n'
for f in `find .`; do echo `file -i "$f"`; done
Keep in mind that setting IFS this way changes how the rest of your current Bash session splits words on spaces.
In PHP you can check it like below:
Specifying the encoding list explicitly:
php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"
More accurate "mb_list_encodings":
php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"
In the first example, you can see that I used an explicit list of encodings (checked in detection order) that are likely to match.
To get a more accurate result, you can try all possible encodings via mb_list_encodings().
Note the mb_* functions require php-mbstring:
apt-get install php-mbstring
This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that it doesn't contain any characters in the ranges 0x00 - 0x1F or 0x7F - 0x9F but, as I said, that may be true for any number of encodings, including at least one other variant of ISO 8859.
Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.
So, for example, find the equivalent of the English "and", "but", "to", "of" and so on in all the supported languages of ISO 8859-1 and see if they have a large number of occurrences within the file.
I'm not talking about literal translation such as:
English French
------- ------
of de, du
and et
the le, la, les
although that's possible. I'm talking about common words in the target language (for all I know, Icelandic has no word for "and" - you'd probably have to use their word for "fish" [sorry that's a little stereotypical. I didn't mean any offense, just illustrating a point]).
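A rough shell sketch of that heuristic, counting a few common French words from the table above as a signal (myfile.txt is just a placeholder name):
# count occurrences of some common French words; a high count hints at French text
grep -o -w -i -E 'de|du|et|le|la|les' myfile.txt | wc -l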
I know you're interested in a more general answer, but what's good in ASCII is usually good in other encodings. Here is a Python one-liner to determine if standard input is ASCII. (I'm pretty sure this works in Python 2, but I've only tested it on Python 3.)
python -c 'from sys import exit,stdin;exit()if 128>max(c for l in open(stdin.fileno(),"rb") for c in l) else exit("Not ASCII")' < myfile.txt
If you're talking about XML files (ISO-8859-1), the XML declaration inside them specifies the encoding: <?xml version="1.0" encoding="ISO-8859-1" ?>
So, you can use regular expressions (e.g., with Perl) to check every file for such specification.
More information can be found here: How to Determine Text File Encoding.
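For example, a minimal sketch along those lines using grep instead of Perl (the target directory is just a placeholder, and it assumes every .xml file carries such a declaration):
# move XML files that do not declare ISO-8859-1 to another directory
for f in *.xml; do
    grep -qi 'encoding="ISO-8859-1"' "$f" || mv "$f" /path/to/other_dir/
done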
I am using the following script to
Find all files that match FILTER with SRC_ENCODING
Create a backup of them
Convert them to DST_ENCODING
(optional) Remove the backups
 
#!/bin/bash -xe
SRC_ENCODING="iso-8859-1"
DST_ENCODING="utf-8"
FILTER="*.java"
echo "Find all files that match the encoding $SRC_ENCODING and filter $FILTER"
FOUND_FILES=$(find . -iname "$FILTER" -exec file -i {} \; | grep "$SRC_ENCODING" | grep -Eo '^.*\.java')
for FILE in $FOUND_FILES ; do
    ORIGINAL_FILE="$FILE.$SRC_ENCODING.bkp"
    echo "Backup original file to $ORIGINAL_FILE"
    mv "$FILE" "$ORIGINAL_FILE"
    echo "converting $FILE from $SRC_ENCODING to $DST_ENCODING"
    iconv -f "$SRC_ENCODING" -t "$DST_ENCODING" "$ORIGINAL_FILE" -o "$FILE"
done
echo "Deleting backups"
find . -iname "*.$SRC_ENCODING.bkp" -exec rm {} \;
I was working on a project that requires cross-platform support, and I encountered many problems related to file encoding.
I made this script to convert everything to UTF-8:
#!/bin/bash
## Retrieve the encoding of files and convert them
for f in `find "$1" -regextype posix-egrep -regex ".*\.(cpp|h)$"`; do
    echo "file: $f"
    ## Read the entire file and get the encoding
    bytes_to_scan=$(wc -c < "$f")
    encoding=`file -b --mime-encoding -P bytes=$bytes_to_scan "$f"`
    case $encoding in
        iso-8859-1 | euc-kr)
            iconv -f euc-kr -t utf-8 "$f" > "$f.utf8"
            mv "$f.utf8" "$f"
            ;;
    esac
done
I used a hack to read the entire file and estimate the file encoding using file -b --mime-encoding -P bytes=$bytes_to_scan $f
You can extract the encoding of a single file with the file command. I have a sample.html file:
$ file sample.html
sample.html: HTML document, UTF-8 Unicode text, with very long lines
$ file -b sample.html
HTML document, UTF-8 Unicode text, with very long lines
$ file -bi sample.html
text/html; charset=utf-8
$ file -bi sample.html | awk -F'=' '{print $2 }'
utf-8
In Cygwin, this looks like it works for me:
find -type f -name "<FILENAME_GLOB>" | while read <VAR>; do (file -i "$<VAR>"); done
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done
You could pipe that to AWK and create an iconv command to convert everything to UTF-8, from any source encoding supported by iconv.
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done | awk -F[:=] '{print "iconv -f "$3" -t utf8 \""$1"\" > \""$1"_utf8\""}' | bash
With Perl, use Encode::Detect.

Resources