BSD md5 vs GNU md5sum output format? - md5

Does anyone know why the BSD md5 program produces hash output in this format ...
MD5 (checksum.md5) = 9eb7a54d24dbf6a2eb9f7ce7a1853cd0
... while GNU md5sum produces a much more sensible format like this?
9eb7a54d24dbf6a2eb9f7ce7a1853cd0 checksum.md5
As far as I can tell, the md5sum format is much easier to parse and makes more sense. How do you do the equivalent of md5sum --check with md5? And what do the -p, -q, -r, -t, -x options mean? man md5 says nothing about those options! :|

On current OS X (BSD) systems you can run md5 -r to get the expected output.
sgwilbur@gura:/vms/DevOps-v3.4$ md5 vmware*
MD5 (vmware-0.log) = 61ba1d68a144023111539abee08f4044
MD5 (vmware-1.log) = 97bc6f22b25833c3eca2b2cc40b83ecf
MD5 (vmware-2.log) = f92a281102710c4528d4ceb88aa0ac9b
MD5 (vmware.log) = 1f7858d361929d4bc5739931a075c0ad
Adding the -r switch made the output look more like what I was expecting, and easier to diff against the md5 sums produced on a Linux machine.
sgwilbur@gura:/vms/DevOps-v3.4$ md5 -r vmware*
61ba1d68a144023111539abee08f4044 vmware-0.log
97bc6f22b25833c3eca2b2cc40b83ecf vmware-1.log
f92a281102710c4528d4ceb88aa0ac9b vmware-2.log
1f7858d361929d4bc5739931a075c0ad vmware.log
This was the simplest way for me to make the output easy to diff against output generated by the md5sum command on a Linux box.

Historical reasons, I guess. Meanwhile, -q suppresses the "MD5 (...) = " prefix, so md5 -q checksum.md5 gives 9eb7a54d24dbf6a2eb9f7ce7a1853cd0
This is implied if md5 is not given any arguments and reads from stdin.
Unfortunately, md5sum in this case leaves "-" after the checksum ("9eb7a54d24dbf6a2eb9f7ce7a1853cd0 -"),
so if you're looking for a generic function that returns the checksum, here is something that might help:
checksum() {
(md5sum <"$1"; test $? = 127 && md5 <"$1") | cut -d' ' -f1
}
checksum /etc/hosts
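If neither tool can be relied on, the same idea can be sketched in Python, whose standard hashlib module computes the digest directly, so there is no output format to parse. This is only a sketch: the function name file_md5 and the chunk size are illustrative choices and not part of md5 or md5sum.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_md5("/etc/hosts") returns just the 32-character hex string, the same value the shell function above prints.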
FreeBSD's man page says this about the arguments:
-p Echo stdin to stdout and append the checksum to stdout.
-q Quiet mode - only the checksum is printed out. Overrides the -r option.
-r Reverses the format of the output. This helps with visual diffs. Does nothing when combined with the -ptx options.
-t Run a built-in time trial.
-x Run a built-in test script.

I realize this is an old page, but I was making checksums on FreeBSD and checking them on Linux and I came across this page too. This page didn't help me solve the problem, so I came up with this small sed script to create the checksums on FreeBSD that match the Linux md5sum output:
md5 file [file ...] | sed -e 's#^MD5 [(]\(.*\)[)] = \(.*\)$#\2 \1#' > md5sums.txt
This will use the FreeBSD md5 command and rearrange the output to look like the GNU md5sum.
Then on Linux I can just use md5sum --check md5sums.txt
You can also use the above sed script with an existing file produced by FreeBSD's md5 command.
I also put this alias in my FreeBSD .cshrc file:
alias md5sum "md5 \!* | sed -e '"'s#MD5 [(]\(.*\)[)] = \(.*\)$#\2 \1#'"'"
now on FreeBSD I can just say md5sum file1 file2 file3 ... and it just works.
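For anyone who would rather not maintain the sed expression, the same rearrangement can be sketched in Python. The names BSD_LINE and bsd_to_gnu are made up for this example, and the regular expression assumes the "MD5 (name) = digest" layout shown earlier in this thread.

```python
import re

# Matches BSD-style lines such as: MD5 (file name) = <32 hex digits>
BSD_LINE = re.compile(r"^MD5 \((.*)\) = ([0-9a-fA-F]{32})$")


def bsd_to_gnu(line):
    """Convert one BSD md5 line to md5sum layout; return other lines unchanged."""
    stripped = line.rstrip("\n")
    match = BSD_LINE.match(stripped)
    if match is None:
        return stripped
    name, digest = match.groups()
    # md5sum prints the digest, a space, a mode character (a space in text
    # mode), and then the file name, which gives two spaces in total.
    return f"{digest}  {name}"
```

Applying bsd_to_gnu to each line of a file produced by md5 should give a list that md5sum --check accepts.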

With GNU md5sum you can run md5sum -c checksum.md5, which looks for the file listed in checksum.md5 (here a file named checksum) and checks it against the hash recorded there.
md5sum -c checksum.md5 | grep "checksum: OK" -
Example inside a Ruby system call to check against a BSD formatted .md5 file:
system("md5sum -c checksum.md5 | grep \"checksum: OK\" -")
The system call returns true if grep finds the "checksum: OK" line and false otherwise.

Related

using a Bash variable in place of a file as input for an executable

I have an executable that is used in a way such as the following:
executable -v -i inputFile.txt -o outputFile.eps
In order to be more efficient, I want to use a Bash variable in place of the input file. So, I want to do something like the following:
executable -v -i ["${inputData}"] -o outputFile.eps
Here, the square brackets represent some clever code.
Do you know of some trick that would allow me to pipe information into the described executable in this way?
Many thanks for your assistance
You can use the following construct:
<(command)
So, to have bash create a FIFO with the command as the output for you, instead of your attempted -i ["${inputData}"], you would do:
-i <(echo "$inputData")
Therefore, here is your final total command:
executable -v -i <(echo "$inputData") -o outputFile.eps
Echo is not safe to use for arbitrary input.
To correctly handle pathological cases like inputdata='\ntest' or inputdata='-e', you need
executable -v -i <(cat <<< "$inputData")
In zsh, the cat is not necessary
Edit: even this adds a trailing newline. To output the exact variable contents byte-by-byte, you need
executable -v -i <(printf "%s" "$inputData")
Note: zsh only:
To get a filename containing the contents of ${variable}, use:
<(<<<${variable})
Note:
<<<${variable} redirects STDIN to come from ${variable}
<<<${variable} is equivalent to (but faster than) cat <<<${variable}
So for the OP's case:
executable -v -i <(<<<${inputData}) -o outputFile.eps
I couldn't find a better solution than creating a temporary file and storing its path in a variable. For example:
key=$(mktemp) # Create a randomly named file in the temporary directory
echo "some data" > "$key" # Fill the file with content
ssh -i "$key" user@localhost
Note that the variable disappears when the Bash session closes, but the temporary file does not; delete it yourself with rm "$key" when you are done.
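When the caller is a script and not an interactive shell, the temporary-file approach can clean up after itself. Below is a minimal Python sketch, assuming a POSIX system. The child process here is a stand-in Python one-liner that prints the file it is given; the real executable from the question would take its place.

```python
import subprocess
import sys
import tempfile

input_data = "some data\n"  # stands in for the Bash variable

# The file is removed automatically when the with-block exits.
with tempfile.NamedTemporaryFile("w", suffix=".txt") as tmp:
    tmp.write(input_data)
    tmp.flush()  # make sure the data is on disk before the child reads it
    # Placeholder child process: prints the content of the file named in argv[1].
    child_code = "import sys; print(open(sys.argv[1]).read(), end='')"
    result = subprocess.run(
        [sys.executable, "-c", child_code, tmp.name],
        capture_output=True,
        text=True,
        check=True,
    )

print(result.stdout, end="")
```

The design choice is the same as with mktemp: the program only ever sees an ordinary file name, so it does not need to support reading from a pipe.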

Moving things in terminal based on their name

Edit: I think this has been answered successfully, but I can't check 'til later. I've reformatted it as suggested though.
The question: I have a series of files, each with a name of the form XXXXNAME, where XXXX is some number. I want to move them all to separate folders called XXXX and have them called NAME. I can do this manually, but I was hoping that by naming them XXXXNAME there'd be some way I could tell Terminal (I think that's the right name, but not really sure) to move them there. Something like
mv *NAME */NAME
but where it takes whatever * was in the first case and regurgitates it to the path.
This is on some form of Linux, with a bash shell.
In the real life case, the files are 0000GNUmakefile, with sequential numbering. I'm having to make lots of similar-but-slightly-altered versions of a program to compile and run on a cluster as part of my research. It would probably have been quicker to write a program to edit all the files and put in the right place in the first place, but I didn't.
This is probably extremely simple, and I should be able to find an answer myself, if I knew the right words. Thing is, I have no formal training in programming, so I don't know what to call things to search for them. So hopefully this will result in me getting an answer, and maybe knowing how to find out the answer for similar things myself next time. With the basic programming I've picked up, I'm sure I could write a program to do this for me, but I'm hoping there's a simple way to do it just using functionality already in Terminal. I probably shouldn't be allowed to play with these things.
Thanks for any help! I can actually program in C and Python a fair amount, but that's through trial and error largely, and I still don't know what I can do and can't do in Terminal.
SO many ways to achieve this.
I find that the old standbys sed and awk are often the most powerful.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p'
If you're satisfied that the commands look right, pipe the command line through a shell:
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p' | sh
I put NAME in brackets and used \2 so that if it varies more than your example indicates, you can come up with a regular expression to handle your filenames better.
To do the same thing in gawk (GNU awk, the variant found in most GNU/Linux distros):
ls | gawk '/^[0-9]{4}NAME$/ {printf("mv -iv %s %s/%s\n", $1, substr($0,1,4), substr($0,5))}'
As with the first sample, this produces commands which, if they make sense to you, can be piped through a shell by appending | sh to the end of the line.
Note that with all these mv commands, I've added the -i and -v options. This is for your protection. Read the man page for mv (by typing man mv in your Linux terminal) to see if you should be comfortable leaving them out.
Also, I'm assuming with these lines that all your directories already exist. You didn't mention if they do. If they don't, here's a one-liner to create the directories.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mkdir -p \1:p' | sort -u
As with the others, append | sh to run the commands.
I should mention that it is generally recommended to use constructs like for (in Tim's answer) or find instead of parsing the output of ls. That said, when your filename format is as simple as /[0-9]{4}word/, I find the quick sed one-liner to be the way to go.
Lastly, if by NAME you actually mean "any string of characters" rather than the literal string "NAME", then in all my examples above, replace NAME with .*.
The following script will do this for you. Copy the script into a file on the remote machine (we'll call it sortfiles.sh).
#!/bin/bash
# Get all files in current directory having names XXXXsomename, where X is an integer
files=$(find . -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9]*')
# Build a list of the XXXX patterns found in the list of files
dirs=
for name in ${files}; do
# Characters 3-6 skip the leading "./" that find prints
dirs="${dirs} $(echo "${name}" | cut -c 3-6)"
done
# Remove redundant entries from the list of XXXX patterns
# (one entry per line, because sort -u works on lines)
dirs=$(printf '%s\n' ${dirs} | sort -u)
# Create any XXXX directories that are not already present
for name in ${dirs}; do
if [[ ! -d ${name} ]]; then
mkdir "${name}"
fi
done
# Move each XXXXsomename file to XXXX/somename
for name in ${files}; do
mv "${name}" "$(echo "${name}" | cut -c 3-6)/$(echo "${name}" | cut -c 7-)"
done
# Return from script with normal status
exit 0
From the command line, do chmod +x sortfiles.sh
Execute the script with ./sortfiles.sh
Just open the Terminal application, cd into the directory that contains the files you want moved/renamed, and copy and paste these commands into the command line.
shopt -s extglob # enable the *(...) patterns used below
for file in [0-9][0-9][0-9][0-9]*; do
dirName="${file%%*([^0-9])}"
mkdir -p "$dirName"
mv "$file" "$dirName/${file##*([0-9])}"
done
This assumes all the files that you want to rename and move are in the same directory. The file globbing also assumes that there are at least four digits at the start of the filename. Names with more than four digits are still caught, but names with fewer than four are not. If yours have fewer than four, remove the appropriate number of [0-9]s from the glob on the for line.
It does not handle the case where "NAME" (i.e. the name of the new file you want) starts with a number.
See this site for more information about string manipulation in bash.
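Since the asker mentions knowing some Python, here is a rough equivalent of the loop above as a sketch. The function name sort_into_dirs is invented for this example, and it assumes exactly four leading digits followed by a non-empty NAME, as in 0000GNUmakefile.

```python
import re
import shutil
from pathlib import Path

# Four leading digits, then the rest of the name (assumed non-empty).
PATTERN = re.compile(r"^([0-9]{4})(.+)$")


def sort_into_dirs(root):
    """Move each XXXXNAME file under `root` to XXXX/NAME; return the new paths."""
    moved = []
    for path in sorted(Path(root).iterdir()):
        match = PATTERN.match(path.name)
        if not path.is_file() or match is None:
            continue
        number, name = match.groups()
        target_dir = Path(root) / number
        target_dir.mkdir(exist_ok=True)
        target = target_dir / name
        if target.exists():
            # Mirror the caution of `mv -i`: never overwrite silently.
            raise FileExistsError(target)
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Calling sort_into_dirs(".") in the directory that holds the files would move 0000GNUmakefile to 0000/GNUmakefile, and so on.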

MD5 checksum of the whole file is different from checksum of content

I created a file a.txt containing one word - 'dog'.
Here is a MD5 checksum:
$md5sum a.txt
c52605f607459b2b80e0395a8976234d a.txt
Here is MD5 checksum of the word dog:
$perl -e "use Digest::MD5 qw(md5_base64 md5_hex); print(md5_hex('dog'));"
06d80eb0c50b49a509b49f2424e8c805
Why are checksums different?
Thank you,
Martin
Presumably you have a newline at the end of the file. Try using echo -n:
$ perl -e "use Digest::MD5 qw(md5_base64 md5_hex); print(md5_hex('dog'));"
06d80eb0c50b49a509b49f2424e8c805
$ echo 'dog' >a.txt
$ md5sum a.txt
362842c5bb3847ec3fbdecb7a84a8692 a.txt
$ echo -n 'dog' >a.txt
$ md5sum a.txt
06d80eb0c50b49a509b49f2424e8c805 a.txt
This is quite a common question:
Why does Perl and /bin/sha1 give different results?
Why PHP's md5 is different from OpenSSL's md5?
Why is the same input returning two different MD5 hashes?
md5_base64 is just a function declaration.
use Digest::MD5 qw(md5_base64 md5_hex)
means that I can use functions md5_base64() or md5_hex() from the library Digest::MD5
Basically you can use some other tools than Perl to compute MD5 hash of the word...
I'm wondering why the checksum of the file (using md5sum) is different from the checksum of the content itself...
Does the md5sum append some information about the file to the content before computing MD5?
Or is there some character "end of file"?
Thank you for your time...
When you compute the MD5 of a file (a text file in your case), the whole content of the file is taken into account, including control characters such as SOH, LF, and CR. They are non-printable, but they are still bytes with their own values, so they change the resulting MD5 and make it differ from the result of passing just the string to the MD5 function. There is no stored "end of file" character; the usual culprit is a trailing newline (LF).
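The effect of a single trailing newline is easy to reproduce without any files. A minimal Python sketch using only the standard hashlib module:

```python
import hashlib

# The same three letters, hashed without and with a trailing line feed.
bare = hashlib.md5(b"dog").hexdigest()
with_newline = hashlib.md5(b"dog\n").hexdigest()

print(bare)
print(with_newline)
print(bare == with_newline)
```

The first value corresponds to the Perl and echo -n results quoted above, the second to a file written with plain echo, and the last line shows that the two digests differ.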

How can I find encoding of a file via a script on Linux?

I need to find the encoding of all files that are placed in a directory. Is there a way to find the encoding used?
The file command is not able to do this.
The encoding that is of interest to me is ISO 8859-1. If the encoding is anything else, I want to move the file to another directory.
It sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.
Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too :)
file -bi <file name>
If you'd like to do this for a bunch of files:
for f in `find | egrep -v Eliminate`; do echo "$f" ' -- ' `file -bi "$f"` ; done
uchardet - An encoding detector library ported from Mozilla.
Usage:
~> uchardet file.java
UTF-8
Various Linux distributions (Debian, Ubuntu, openSUSE, Arch Linux via pacman, etc.) provide binaries.
In Debian you can also use: encguess:
$ encguess test.txt
test.txt US-ASCII
As it is a Perl script, it is available on most systems: it ships with the perl package, or it can be installed standalone where Perl is already present.
$ dpkg -S /usr/bin/encguess
perl: /usr/bin/encguess
Here is an example script using file -I and iconv which works on Mac OS X.
For your question, you need to use mv instead of iconv:
#!/bin/bash
# 2016-02-08
# check encoding and convert files
for f in *.java
do
encoding=$(file -I "$f" | cut -f 2 -d";" | cut -f 2 -d=)
case $encoding in
iso-8859-1)
iconv -f iso8859-1 -t utf-8 "$f" > "$f.utf8"
mv "$f.utf8" "$f"
;;
esac
done
To convert encoding from ISO 8859-1 to ASCII:
iconv -f ISO_8859-1 -t ASCII filename.txt
It is really hard to determine whether a file is ISO 8859-1. If you have a text with only 7-bit characters, it could also be ISO 8859-1, but you cannot know. If you have 8-bit characters, the upper-region characters exist in other encodings as well. Therefore you would have to use a dictionary to get a better guess at which word it is and determine from there which letter it must be. Finally, if you detect that it might be UTF-8, then you can be fairly sure it is not ISO 8859-1.
Detecting an encoding is one of the hardest things to do, because you can never know for sure unless something tells you.
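The reasoning above (7-bit data is ambiguous, valid UTF-8 is probably not ISO 8859-1, anything else is some 8-bit encoding) can be sketched in Python. The function name classify and its three labels are invented for this illustration; this is a heuristic, not a detector.

```python
def classify(data):
    """Roughly classify raw bytes as 'ascii', 'utf-8', or '8-bit' (hypothetical labels)."""
    if all(byte < 0x80 for byte in data):
        # Pure 7-bit data is valid ASCII, UTF-8, and ISO 8859-1 all at once.
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: some single-byte encoding, ISO 8859-1 being one candidate.
        return "8-bit"
    return "utf-8"
```

For example, classify(open("myfile.txt", "rb").read()) returning "8-bit" only says the file might be ISO 8859-1; it cannot rule out other single-byte encodings.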
With Python, you can use the chardet module.
With this command:
for f in `find .`; do echo `file -i "$f"`; done
you can list all files in a directory and subdirectories and the corresponding encoding.
If files have a space in the name, use:
IFS=$'\n'
for f in `find .`; do echo `file -i "$f"`; done
Remember that this changes IFS for your current Bash session, which affects how words are split on spaces afterwards.
In PHP you can check it like below:
Specifying the encoding list explicitly:
php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"
More accurate "mb_list_encodings":
php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"
Here in the first example, you can see that I used a list of encodings (detect list order) that might be matching.
To have a more accurate result, you can use all possible encodings via: mb_list_encodings()
Note the mb_* functions require php-mbstring:
apt-get install php-mbstring
This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that it doesn't contain any characters in the ranges 0x00 - 0x1f or 0x7f -0x9f but, as I said, this may be true for any number of files, including at least one other variant of ISO 8859.
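The byte-range check just described can be sketched in Python as follows. The function name is made up for this example, and one assumption is added that the text above does not state: tab, line feed, and carriage return are allowed, since ordinary text files contain them.

```python
# Tab, line feed, and carriage return are control characters that normal text uses.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def could_be_iso_8859_1(data):
    """Return False if `data` contains bytes that ISO 8859-1 text should not."""
    for byte in data:
        in_c0 = byte <= 0x1F and byte not in ALLOWED_CONTROLS
        in_c1 = 0x7F <= byte <= 0x9F
        if in_c0 or in_c1:
            return False
    return True
```

As warned above, a True result proves little: the two bytes of a UTF-8 encoded "é" (0xC3 0xA9) pass this check as well.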
Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.
So, for example, find the equivalent of the English "and", "but", "to", "of" and so on in all the supported languages of ISO 8859-1 and see if they have a large number of occurrences within the file.
I'm not talking about literal translation such as:
English French
------- ------
of de, du
and et
the le, la, les
although that's possible. I'm talking about common words in the target language (for all I know, Icelandic has no word for "and" - you'd probably have to use their word for "fish" [sorry that's a little stereotypical. I didn't mean any offense, just illustrating a point]).
I know you're interested in a more general answer, but what's good in ASCII is usually good in other encodings. Here is a Python one-liner to determine if standard input is ASCII. (It requires Python 3, because it opens the standard input file descriptor in binary mode and iterates over the bytes as integers.)
python3 -c 'from sys import exit,stdin;exit()if 128>max((c for l in open(stdin.fileno(),"rb") for c in l),default=0) else exit("Not ASCII")' < myfile.txt
If you're talking about XML files (ISO-8859-1), the XML declaration inside them specifies the encoding: <?xml version="1.0" encoding="ISO-8859-1" ?>
So, you can use regular expressions (e.g., with Perl) to check every file for such specification.
More information can be found here: How to Determine Text File Encoding.
I am using the following script to:
1) Find all files that match FILTER with SRC_ENCODING
2) Create a backup of them
3) Convert them to DST_ENCODING
4) (optional) Remove the backups
#!/bin/bash -xe
SRC_ENCODING="iso-8859-1"
DST_ENCODING="utf-8"
FILTER="*.java"
echo "Find all files that match the encoding $SRC_ENCODING and filter $FILTER"
FOUND_FILES=$(find . -iname "$FILTER" -exec file -i {} \; | grep "$SRC_ENCODING" | grep -Eo '^.*\.java')
for FILE in $FOUND_FILES ; do
ORIGINAL_FILE="$FILE.$SRC_ENCODING.bkp"
echo "Backup original file to $ORIGINAL_FILE"
mv "$FILE" "$ORIGINAL_FILE"
echo "converting $FILE from $SRC_ENCODING to $DST_ENCODING"
iconv -f "$SRC_ENCODING" -t "$DST_ENCODING" "$ORIGINAL_FILE" -o "$FILE"
done
echo "Deleting backups"
find . -iname "*.$SRC_ENCODING.bkp" -exec rm {} \;
I was working on a project that requires cross-platform support, and I encountered many problems related to file encoding.
I made this script to convert everything to UTF-8:
#!/bin/bash
## Retrieve the encoding of files and convert them
for f in `find "$1" -regextype posix-egrep -regex ".*\.(cpp|h)$"`; do
echo "file: $f"
## Read the entire file to determine the encoding
bytes_to_scan=$(wc -c < "$f")
encoding=`file -b --mime-encoding -P bytes=$bytes_to_scan "$f"`
case $encoding in
iso-8859-1 | euc-kr)
## Both cases are converted from euc-kr; change -f if your files really are iso-8859-1
iconv -f euc-kr -t utf-8 "$f" > "$f.utf8"
mv "$f.utf8" "$f"
;;
esac
done
I used a hack to read the entire file and estimate the file encoding using file -b --mime-encoding -P bytes=$bytes_to_scan $f
You can extract encoding of a single file with the file command. I have a sample.html file with:
$ file sample.html
sample.html: HTML document, UTF-8 Unicode text, with very long lines
$ file -b sample.html
HTML document, UTF-8 Unicode text, with very long lines
$ file -bi sample.html
text/html; charset=utf-8
$ file -bi sample.html | awk -F'=' '{print $2 }'
utf-8
In Cygwin, this looks like it works for me:
find -type f -name "<FILENAME_GLOB>" | while read <VAR>; do (file -i "$<VAR>"); done
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done
You could pipe that to AWK and create an iconv command to convert everything to UTF-8, from any source encoding supported by iconv.
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done | awk -F[:=] '{print "iconv -f "$3" -t utf8 \""$1"\" > \""$1"_utf8\""}' | bash
With Perl, use Encode::Detect.

Calculate checksum of audio files without considering the header

I want to programmatically create a SHA1 checksum of audio files (MP3, Ogg Vorbis, Flac).
The requirement is that the checksum should be stable even if the header (eg. ID3) changes.
Note: The audio files don't have CRCs.
This is what I have tried so far:
1) Reading + Hashing all MPEG frames using Perl and MPEG::Audio::Frame
my $sha1 = Digest::SHA1->new;
while (my $frame = MPEG::Audio::Frame->read(\*FH)) {
$sha1->add($frame->content());
}
2) Decoding + Hashing all MPEG frames using Python and libmad (pymad)
mf = mad.MadFile(path)
sha1 = hashlib.sha1()
while 1:
buf = mf.read()
if (buf is None):
break
sha1.update(buf)
3) Using mp3cat
> mp3cat - - < file.mp3 | sha1sum
However, none of those methods provided a stable checksum. Namely, in some cases the checksum changed after retagging the file with picard.
Are there any libraries that already provide what I want?
I don't care about the programming language…
Update:
I debugged the case a bit further.
The libmad checksum inconsistency seems to happen in cases where libmad gets some decoding errors, like "Huffman data overrun (0x0238)".
As this happens on many of the MP3 files, I'm not sure if it really indicates a broken file…
If you are looking for stable hashes for the actual music you might want to look at libOFA. Your current methods will give you different results because the formats can have embedded tags. Also if you want two different files with the same song to return the same hash you need to regard things like bitrate and sample frequencies.
libOFA on the other hand can give you a stable hash that can be used between formats and different encodings. Might be what you want?
I needed tools to quickly check if my MP3/OGG library is still valid.
For MP3 I found mp3md5.py (http://snipplr.com/view/4025/mp3-checksum-in-id3-tag/), which does the job. I found no simple tool for Ogg Vorbis, so I coded a little bash script to do this for me.
Both tools should tolerate modifications of the comment/ID3Tag.
#!/bin/bash
# This bash script appends an MD5SUM to the vorbiscomment and/or verifies it if it exists
# Later modification of the vorbis comment does not alter the MD5SUM
# Julian M.K.
FILE="$1"
if [[ ! -f "$FILE" || ! -r "$FILE" || ! -w "$FILE" ]] ; then
echo "File $FILE does not exist or is not readable or writable"
exit 1
fi
OLDCRC=`vorbiscomment "$FILE" | grep ^CRC=|cut -d "=" -f 2`
NEWCRC=`ogginfo "$FILE" |grep "Total data length:" |cut -d ":" -f 2 | md5sum |cut -d " " -f 1`
if [[ "$OLDCRC" == "" ]] ; then
echo "ADDED $FILE $NEWCRC"
vorbiscomment -a -t "CRC=$NEWCRC" "$FILE"
# rewrite CRC to get proper data length, I dont know why this is necessary
NEWCRC=`ogginfo "$FILE" |grep "Total data length:" |cut -d ":" -f 2 | md5sum |cut -d " " -f 1`
vorbiscomment -w -t "CRC=$NEWCRC" "$FILE"
elif [[ "$OLDCRC" == "$NEWCRC" ]] ; then
echo "VERIFIED $FILE"
else
echo "FAILURE $FILE -- $OLDCRC - $NEWCRC"
fi
There is an easy stable way to do it. Just make a copy of the file and remove all the tags from it (e.g. using mutagen.id3) and take the hashsum of the resulting file.
The only disadvantage of this method is its performance.
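The copy can be avoided by skipping the tags while hashing. The sketch below is not what mutagen does; it is a hand-rolled illustration under these assumptions: only an ID3v2 tag at the very start and a 128-byte ID3v1 tag at the very end are handled (no APEv2, Lyrics3, or tags in other positions), and the function name audio_sha1 is made up.

```python
import hashlib


def audio_sha1(data):
    """SHA-1 of `data` with a leading ID3v2 tag and trailing ID3v1 tag skipped."""
    start, end = 0, len(data)
    if data[:3] == b"ID3" and len(data) >= 10:
        # Bytes 6-9 hold the tag size as a 28-bit "syncsafe" integer (7 bits per byte).
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # footer flag: a 10-byte footer follows the tag
            start += 10
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128  # fixed-size ID3v1 tag at the very end
    return hashlib.sha1(data[start:end]).hexdigest()
```

Usage would be audio_sha1(open(path, "rb").read()). If a tagger also rewrites bytes outside these two regions, the hash will still change, which may be what the question observed.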
Bene, if I were you (and I am in the process of working on something very similar), I would hash the MP3 data block. Extract it to raw data first and write it out to disk, so you know what you are dealing with. Then modify the ID3 tag and hash your data again. If the hash changes, compare your two sets of raw data and find out WHERE it changed. Chances are you are overstepping a boundary somewhere. If I recall correctly, each MPEG audio frame starts with a sync pattern of eleven set bits, so FF followed by a byte from E0 upward (commonly FF FB for MP3).
I'm interested in your findings, as I'm still writing all my code to deal with the fingerprints, etc., and haven't gotten to the actual hashing yet.
Update many years later:
See my answer here to a very similar question. It turns out that ffmpeg actually supports doing checksums of the individual streams. To get the md5 hash of only the audio stream:
ffmpeg -i "$filename" -map 0:a -codec copy -f md5 "$filename.md5"
There is also support for other hash formats with the generic -f hash format, or for doing it per frame with -f framemd5.
I'm trying to do the same thing. I used MD5 instead of SHA1. I started to export audio checksums using mp3tag (www.mp3tag.de/en/); then made a Perl script similar to yours to do the same thing. Then I removed all tags from my test file, and the audio checksum remained the same.
This is the script:
use MPEG::Audio::Frame;
use Digest::MD5 qw(md5_hex);
use strict;
my $file = 'E:\Music\MP3\Russensoul\01 - 5nizza , Soldat (Russensoul - Russensoul).mp3';
my $mp3tag_audio_md5 = lc '2EDFBD62995A46A45CEEC08C1F303486';
my $md5 = Digest::MD5->new;
open(FILE, $file) or die "Cannot open $file : $!\n";
binmode FILE;
while(my $frame = MPEG::Audio::Frame->read(\*FILE)){
$md5->add($frame->asbin);
}
print '$md5->hexdigest : ', $md5->hexdigest, "\n",
'mp3tag_audio_md5 : ', $mp3tag_audio_md5, "\n",
;
Is it possible that whatever you use to modify your tags sometimes also modifies mp3 headers?

Resources