Calculate checksum of audio files without considering the header

I want to programmatically create a SHA1 checksum of audio files (MP3, Ogg Vorbis, FLAC).
The requirement is that the checksum should be stable even if the header (e.g. ID3) changes.
Note: the audio files don't have CRCs.
This is what I have tried so far:
1) Reading + Hashing all MPEG frames using Perl and MPEG::Audio::Frame
use MPEG::Audio::Frame;
use Digest::SHA1;

open(FH, '<', $path) or die "Cannot open $path: $!";
binmode FH;
my $sha1 = Digest::SHA1->new;
while (my $frame = MPEG::Audio::Frame->read(\*FH)) {
    $sha1->add($frame->content());
}
2) Decoding + Hashing all MPEG frames using Python and libmad (pymad)
import hashlib
import mad

mf = mad.MadFile(path)
sha1 = hashlib.sha1()
while True:
    buf = mf.read()
    if buf is None:
        break
    sha1.update(buf)
3) Using mp3cat
> mp3cat - - < file.mp3 | sha1sum
However, none of those methods provided a stable checksum. Namely, in some cases the checksum changed after retagging the file with Picard.
Are there any libraries that already provide what I want?
I don't care about the programming language…
Update:
I debugged the case a bit further.
The libmad checksum inconsistency seems to happen in cases where libmad reports decoding errors, such as "Huffman data overrun (0x0238)".
As this happens on many of my MP3 files, I'm not sure whether it really indicates a broken file…

If you are looking for stable hashes of the actual music, you might want to look at libOFA. Your current methods will give you different results because the formats can carry embedded tags. Also, if you want two different files of the same song to return the same hash, you need to account for things like bitrate and sample frequency.
libOFA, on the other hand, can give you a stable hash that works across formats and different encodings. Might be what you want?

I needed tools to quickly check whether my MP3/OGG library is still valid.
For MP3 I found mp3md5.py (http://snipplr.com/view/4025/mp3-checksum-in-id3-tag/), which does the job, but I found no simple tool for Ogg Vorbis, so I wrote a little bash script to do it for me.
Both tools tolerate modifications of the comment/ID3 tag.
#!/bin/bash
# This bash script appends an MD5SUM to the vorbis comment and/or verifies it if it exists.
# Later modification of the vorbis comment does not alter the MD5SUM.
# Julian M.K.
FILE="$1"
if [[ ! -f "$FILE" || ! -r "$FILE" || ! -w "$FILE" ]] ; then
    echo "File $FILE does not exist or is not readable or writable"
    exit 1
fi
OLDCRC=`vorbiscomment "$FILE" | grep ^CRC= | cut -d "=" -f 2`
NEWCRC=`ogginfo "$FILE" | grep "Total data length:" | cut -d ":" -f 2 | md5sum | cut -d " " -f 1`
if [[ "$OLDCRC" == "" ]] ; then
    echo "ADDED $FILE $NEWCRC"
    vorbiscomment -a -t "CRC=$NEWCRC" "$FILE"
    # rewrite CRC to get the proper data length; I don't know why this is necessary
    NEWCRC=`ogginfo "$FILE" | grep "Total data length:" | cut -d ":" -f 2 | md5sum | cut -d " " -f 1`
    vorbiscomment -w -t "CRC=$NEWCRC" "$FILE"
elif [[ "$OLDCRC" == "$NEWCRC" ]] ; then
    echo "VERIFIED $FILE"
else
    echo "FAILURE $FILE -- $OLDCRC - $NEWCRC"
fi

There is an easy, stable way to do it. Just make a copy of the file, remove all the tags from it (e.g. using mutagen.id3), and take the hash of the resulting file.
The only disadvantage of this method is its performance.
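A minimal sketch of this approach for MP3, assuming the id3v2 CLI is installed (mutagen's mid3v2 takes the same flag):
#!/bin/bash
# Sketch: hash the audio with tags stripped, without touching the original.
# mktemp --suffix is GNU-specific; adjust the tag tool for other formats as needed.
TMP=$(mktemp --suffix=.mp3)
cp "$1" "$TMP"
id3v2 --delete-all "$TMP" > /dev/null   # strip ID3v1 and ID3v2 tags from the copy
sha1sum "$TMP" | cut -d ' ' -f 1
rm "$TMP"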

Bene, if I were you (and I am in the process of working on something very similar to what you want to do), I would hash the MP3 data block. (Extract it to raw data first, and write it out to disk, so you know what you are dealing with.) Then modify the ID3 tag. Hash your data again. Now, if it changes, compare your two sets of raw data and find out WHERE it changed. Chances are, you might be over-stepping a boundary somewhere. If I recall, MP3 files start with something like FF F8. Well, at least the frame does.
I'm interested in your findings, as I'm still writing all my code to deal with the fingerprints, etc., and haven't gotten to the actual hashing yet.

Update many years later:
See my answer here to a very similar question. It turns out that ffmpeg actually supports doing checksums of the individual streams. To get the MD5 hash of only the audio stream:
ffmpeg -i "$filename" -map 0:a -codec copy -f md5 "$filename.md5"
There is also support for other hash algorithms with the generic -f hash format, or for doing it per frame with -f framemd5.
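For instance, hedged sketches of those two variants (check ffmpeg -formats for the muxers available in your build):
# SHA-1 of only the audio stream, written to stdout
ffmpeg -i "$filename" -map 0:a -codec copy -f hash -hash sha1 -
# one MD5 per frame instead of one hash for the whole stream
ffmpeg -i "$filename" -map 0:a -codec copy -f framemd5 -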
I'm trying to do the same thing. I used MD5 instead of SHA1. I started by exporting audio checksums using Mp3tag (www.mp3tag.de/en/); then I wrote a Perl script similar to yours to do the same thing. Then I removed all tags from my test file, and the audio checksum remained the same.
This is the script:
use MPEG::Audio::Frame;
use Digest::MD5 qw(md5_hex);
use strict;

my $file = 'E:\Music\MP3\Russensoul\01 - 5nizza , Soldat (Russensoul - Russensoul).mp3';
my $mp3tag_audio_md5 = lc '2EDFBD62995A46A45CEEC08C1F303486';

my $md5 = Digest::MD5->new;
open(FILE, $file) or die "Cannot open $file : $!\n";
binmode FILE;
while (my $frame = MPEG::Audio::Frame->read(\*FILE)) {
    $md5->add($frame->asbin);
}
print '$md5->hexdigest : ', $md5->hexdigest, "\n",
      'mp3tag_audio_md5 : ', $mp3tag_audio_md5, "\n";
Is it possible that whatever you use to modify your tags sometimes also modifies mp3 headers?

Related

Create a movie from one image's updates in Linux

I'm a C programmer on Linux.
I wrote a program that saves an image to /srv/ftp/preview.png, which is updated frequently, and I want to create a movie from these updates.
The timestamps are important to me: e.g. if the image updates after 3.654 seconds, I want the movie to show this update (frame) after 3.654 seconds too.
I searched the Internet for several hours, but I can't find any solution.
I know about ffmpeg, but it converts a series of images (not one image) to a movie, without millisecond timestamps.
I found this question, but it doesn't seem useful in this case.
Is there any tool to do that? If not, please suggest a C API I can use to write a program myself.
You can try to use inotify to watch for modifications of the file, and ffmpeg to append the file to the movie:
#!/bin/bash
FRAMERATE=1
FILE="/path/to/image.jpg"
while true
do
    inotifywait -e modify "$FILE"
    echo "file changed"
    # create temp file name
    TMP=$(mktemp --suffix=.jpg)
    # copy file
    cp "$FILE" "$TMP"
    # append copied file to movie
    # from https://video.stackexchange.com/q/17228
    # if movie already exists
    if [ -f movie.mp4 ]
    then
        # append image to a new movie
        ffmpeg -y -i movie.mp4 -loop 1 -f image2 -t $FRAMERATE -i "$TMP" -filter_complex "[0:v] [1:v] concat=n=2:v=1 [v]" -map "[v]" newmovie.mp4
        # replace old movie with the new one
        mv newmovie.mp4 movie.mp4
    else
        # create a movie from one image
        ffmpeg -framerate 1 -t $FRAMERATE -i "$TMP" movie.mp4
    fi
    rm "$TMP"
done
This script will certainly need to be adapted (in particular if your framerate is high), but I think you can play with it.
One drawback is that movie creation becomes slower and slower as the movie gets bigger.
A better approach may be to store the images for a certain time span in a directory and convert them all at once (say, once an hour or once a day), as sketched below.
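For the batch approach, ffmpeg's concat demuxer can even preserve the per-frame timing the question asks for. A minimal sketch, assuming you log each captured frame and its display duration into frames.txt (the names and durations here are illustrative):
cat > frames.txt <<'EOF'
file 'frame_0001.jpg'
duration 3.654
file 'frame_0002.jpg'
duration 1.200
EOF
ffmpeg -f concat -safe 0 -i frames.txt -vsync vfr movie.mp4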
If you want to serve a stream instead of creating a video file, you can look at https://stackoverflow.com/a/31705978/1212012

Is there a way to validate a .env file?

I'm getting errors when trying to parse a .env file I have, but I have no way of figuring out where it's erring out. Is there an easy way to lint/validate the file, online or otherwise?
Many thanks!
You can try https://github.com/dotenv-linter/dotenv-linter.
It's a lightning-fast linter for .env files, written in Rust.
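A hedged usage sketch (check the project README for current flags and install options):
cargo install dotenv-linter   # prebuilt binaries are also available
dotenv-linter .env            # prints each problem with its line number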
It depends on the syntax you are using. Looking at the Docker and npm documentation, different tools seem to have a different scope of what they are able to parse.
I use a simple grep to validate that each line matches a <key>=<value> pattern, where key and value are non-empty. You can adapt the patterns to match your context, for example to enforce upper-case keys.
#!/bin/bash
for envfile in $(find . -maxdepth 1 -type f -name '.env.*'); do
    while read -r line; do
        # skip blank lines and comments
        if [[ -z "${line}" || "${line:0:1}" == "#" ]]; then
            continue
        fi
        match_line=$(echo "${line}" | grep -E "^[A-Za-z0-9_]+=.+$")
        if [[ "${match_line}" == "" ]]; then
            echo "Error in file: ${envfile}: line: ${line}"
        fi
    done < "${envfile}"
done
Alternatively, look at your language's dotenv-loading library to see if you can catch specific parsing exceptions, if available, to narrow down the specific line that causes the error.

Export VCF image to JPEG

I have created a .vcf contact with an iPhone and emailed the file to myself. The .vcf contains a photo that is saved directly in the vCard, not in the phone's memory.
In the source of the .vcf, there is a code part starting like this:
PHOTO;ENCODING=b;TYPE=JPEG:/9j/4AAQSkZJRgABAQAAAQABAAD/4QBYRXhpZgAATU0AKgAA
And it continues on... Now I would like to extract this photo and save it as a .jpeg. Any ideas how to do that?
Thanks.
On macOS, it is easy to do from the command line with vi and base64.
For example, to extract the logo from the "Apple Inc." contact that comes with every user account:
1) Export the "Apple Inc." contact from Contacts.
2) Use vi to manually remove the other lines.
3) Remove the heading and the metadata from the photo line:
PHOTO;ENCODING=b;TYPE=JPEG:
4) base64-decode the remaining file:
base64 -D -i Apple\ Inc..vcf -o Apple_Logo.jpeg
The encoding is Base64. You can find a tool for decoding online.
I can recommend Freeformatter.com's decoder, which lets you save as a binary file. You will then need to rename that file to photo.jpg.
You should use a vCard parser (like vpim) that provides the ability to pull photo data from the vCard.
Another vCard parser is ez-vcard, which is written in Java (disclaimer: I am the author).
File file = new File("vcard.vcf");
VCard vcard = Ezvcard.parse(file).first();
for (PhotoType photo : vcard.getPhotos()) {
    byte[] data = photo.getData();
    // save byte array to file
}
Because this isn't https://apple.stackexchange.com/, I'll suggest a quick bash script that I've used to extract images from .vcf files on the command line:
#!/bin/bash
# vcf_photo_extractor ver 20180207094631 Copyright 2018 alexx, MIT Licence
if [ ! -f "$1" ]; then
    echo "Usage: $(basename $0) [path/]any/contact.vcf"
    exit 1
fi
DATA=$(cat "$1" | tr -d "\r\n" | sed -e 's/.*TYPE=//' -e 's/END:VCARD.*//')
NAME=$(grep -a '^N;' "$1" | sed -e 's/.*://')
#if [ $(wc -c <<< $DATA) -lt 5 ]; then #bashism
if [ $(echo $DATA | wc -c) -lt 5 ]; then
    echo "No images found in $1"
    exit 2
fi
EXT=${DATA%%:*}
if [ "$EXT" == 'BEGIN' ]; then echo "FAILED to extract $EXT"; exit 3; fi
IMG=${DATA#*:}
FILE=${1%.*}
Fn=${FILE##*/}
if [ -f "${FILE}.${EXT}" ]; then
    echo "Overwrite ${FILE}.${EXT} ? "
    read -r YN
    if [ "$YN" != 'y' ]; then exit; fi
fi
echo $IMG | base64 -id > "${FILE}.${EXT}" || \
    echo "Failed to output $NAME to ${FILE}.${EXT}"
This script tries to extract the base64 data, decode it using base64, and create an image file. I found on Linux that base64 -id worked where base64 -d threw errors.
If you are a fan of one-liners or code golf, this might work:
cat 1.vcf | tr -d "\n" | sed -e 's/.*TYPE=[^:]*://' -e 's/END:V.*//' | base64 -id > 1.jpg
If you want something cleaner, then Matt Brock's vCard_photo_extractor.sh might be what you are looking for.
I used http://www.sobolsoft.com/convertvcfjpg/ with vCards from OS X, with success.

Bash scripting: testing for an existing file with an arbitrary extension in a while or for loop

I've been trying to figure this one out for a while. I've read through multiple threads and feel like I'm close, but the script just isn't coming together.
Scenario:
I have a media server and thousands of movie files. Each movie file has various accessory files such as the cover artwork, database info, fanart, and trailer. While everything in the directory has its cover art and database info, some files may or may not have their respective fanart or trailer. For those files I'm trying to get this script working, which will create an empty "dummy" file in place of the file that should be there. Then, when I actually have the time, I can go back, search out just the dummy files, and work to fill in the gaps where I can.
Here is what I have so far.
#!/bin/bash
find . -type f -print0 | while read -d $'\0' movie ;
do
    echo $movie
    moviename=${movie%\.*} #remove the extension from the string
    moviename1=`echo $moviename | sed 's/\ /\\ /'` #add escaped spaces to the string
    echo $moviename1 #echo the string (for debugging)
    if [ ! -f $moviename-fanart* ]; #because the fanart could be .jpg, or .png, etc
    then
        echo "Creating $moviename-fanart.dummy"
        touch "$moviename-fanart.dummy"
    fi
    if [ ! -f $moviename-trailer* ]; #because tralers could be .mp4, .mov, .mkv, .avi, etc
    then
        echo "Creating $moviename-trailer.dummy"
        touch "$moviename-trailer.dummy"
    fi
done
This should be pretty simple, but I think I'm not getting the proper formatting for the input string going into the test operators.
Any help would be greatly appreciated.
Thanks
Line-by-line analysis:
find . -type f -print0 | while read -d $'\0' movie; do
OK, but with bash 4 you can just use shopt -s globstar to operate recursively on a directory (see the sketch after this analysis).
moviename=${movie%\.*} #remove the extension from the string
You don't need the backslash.
moviename1=`echo $moviename | sed 's/\ /\\ /'` #add escaped spaces to the string
This line is suspect because if you quote the name, escaped spaces become doubly-escaped. You're confusing the value of the string with the representation you see of it.
if [ ! -f $moviename-fanart* ]; then #because the fanart could be .jpg, or .png, etc
Quote the string or use bash's [[ test keyword. It's a little dangerous to expand a glob inside the test expression, because if it matches multiple results you'll get an error. That said, if you're sure there can be only one match, you can quote everything up to the glob: "$moviename-fanart"*.
touch "$moviename-fanart.dummy"
Here, you quote it. So essentially you're dealing with a different name now.
fi
if [ ! -f $moviename-trailer* ]; then #because tralers could be .mp4, .mov, .mkv, .avi, etc
echo "Creating $moviename-trailer.dummy"
touch "$moviename-trailer.dummy"
fi
Same thing.
done
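For reference, a rough globstar-based skeleton of the same loop (untested sketch, bash 4+ only):
shopt -s globstar nullglob
for movie in ./**/*; do
    [ -f "$movie" ] || continue
    moviename=${movie%.*}
    # ...same fanart/trailer checks as above...
done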

How can I find encoding of a file via a script on Linux?

I need to find the encoding of all files that are placed in a directory. Is there a way to find the encoding used?
The file command is not able to do this.
The encoding that is of interest to me is ISO 8859-1. If the encoding is anything else, I want to move the file to another directory.
It sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.
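A hedged usage sketch (flags as per its man page; -L names the expected language, or none):
enca -L none myfile.txt              # print the guessed encoding
enconv -L none -x UTF-8 myfile.txt   # convert the file in place to UTF-8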
Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too :)
file -bi <file name>
If you'd like to do this for a bunch of files:
for f in `find | egrep -v Eliminate`; do echo "$f" ' -- ' `file -bi "$f"` ; done
uchardet - An encoding detector library ported from Mozilla.
Usage:
~> uchardet file.java
UTF-8
Various Linux distributions (Debian, Ubuntu, openSUSE, Arch Linux, etc.) provide binaries.
In Debian you can also use encguess:
$ encguess test.txt
test.txt US-ASCII
As it is a Perl script, it can be installed on most systems by installing Perl, or used as a standalone script if Perl is already installed.
$ dpkg -S /usr/bin/encguess
perl: /usr/bin/encguess
Here is an example script using file -I and iconv which works on Mac OS X.
For your question, you need to use mv instead of iconv:
#!/bin/bash
# 2016-02-08
# check encoding and convert files
for f in *.java
do
    encoding=`file -I "$f" | cut -f 2 -d";" | cut -f 2 -d=`
    case $encoding in
        iso-8859-1)
            iconv -f iso8859-1 -t utf-8 "$f" > "$f.utf8"
            mv "$f.utf8" "$f"
            ;;
    esac
done
To convert encoding from ISO 8859-1 to ASCII:
iconv -f ISO_8859-1 -t ASCII filename.txt
It is really hard to determine whether a file is ISO 8859-1. If you have a text with only 7-bit characters, that could also be ISO 8859-1, but you don't know. If you have 8-bit characters, then the upper-region characters exist in other encodings as well. Therefore you would have to use a dictionary to get a better guess which word it is, and determine from there which letter it must be. Finally, if you detect that it might be UTF-8, then you are sure it is not ISO 8859-1.
Encoding detection is one of the hardest things to do, because nothing in the file tells you the encoding.
With Python, you can use the chardet module.
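chardet also ships a small command-line tool, so a quick check from the shell might look like this sketch (assumes pip install chardet):
chardetect myfile.txt
# prints something like: myfile.txt: ISO-8859-1 with confidence 0.73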
With this command:
for f in `find .`; do echo `file -i "$f"`; done
you can list all files in a directory and its subdirectories, along with the corresponding encoding.
If files have a space in the name, use:
IFS=$'\n'
for f in `find .`; do echo `file -i "$f"`; done
Keep in mind that this changes how the rest of your current Bash session splits words on spaces.
In PHP you can check it like below:
Specifying the encoding list explicitly:
php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"
More accurate "mb_list_encodings":
php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"
In the first example, you can see that I passed an explicit list of encodings (checked in list order) that might match.
To get a more accurate result, you can pass all possible encodings via mb_list_encodings().
Note the mb_* functions require php-mbstring:
apt-get install php-mbstring
This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that it doesn't contain any characters in the ranges 0x00-0x1f or 0x7f-0x9f but, as I said, this may be true for any number of files, including at least one other variant of ISO 8859.
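A quick shell sketch of that byte-range check (assumes GNU grep with PCRE support; tab, LF and CR are deliberately allowed):
LC_ALL=C grep -qP '[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]' myfile.txt \
    && echo "suspicious control bytes found"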
Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.
So, for example, find the equivalent of the English "and", "but", "to", "of" and so on in all the supported languages of ISO 8859-1 and see if they have a large number of occurrences within the file.
I'm not talking about literal translation such as:
English   French
-------   ------
of        de, du
and       et
the       le, la, les
although that's possible. I'm talking about common words in the target language (for all I know, Icelandic has no word for "and" - you'd probably have to use their word for "fish" [sorry that's a little stereotypical. I didn't mean any offense, just illustrating a point]).
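A toy sketch of that word-counting idea (the word lists are illustrative, not exhaustive):
# count hits for a few common words per candidate language
count() { grep -oiwE "$2" "$1" | wc -l; }
echo "English: $(count myfile.txt 'of|and|the')"
echo "French:  $(count myfile.txt 'de|du|et|le|la|les')"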
I know you're interested in a more general answer, but what's good in ASCII is usually good in other encodings. Here is a Python one-liner to determine whether standard input is ASCII. (I'm pretty sure this works in Python 2, but I've only tested it on Python 3.)
python -c 'from sys import exit,stdin;exit() if 128>max(c for l in open(stdin.fileno(),"rb") for c in l) else exit("Not ASCII")' < myfile.txt
If you're talking about XML files (ISO-8859-1), the XML declaration inside them specifies the encoding: <?xml version="1.0" encoding="ISO-8859-1" ?>
So, you can use regular expressions (e.g., with Perl) to check every file for such a specification.
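For example, a quick shell sketch of that check (the glob is illustrative):
# print the declared encoding, if any, of each XML file
grep -oE 'encoding="[^"]+"' *.xml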
More information can be found here: How to Determine Text File Encoding.
I am using the following script to:
1) Find all files that match FILTER with SRC_ENCODING
2) Create a backup of them
3) Convert them to DST_ENCODING
4) (Optional) Remove the backups
#!/bin/bash -xe
SRC_ENCODING="iso-8859-1"
DST_ENCODING="utf-8"
FILTER="*.java"
echo "Find all files that match the encoding $SRC_ENCODING and filter $FILTER"
FOUND_FILES=$(find . -iname "$FILTER" -exec file -i {} \; | grep "$SRC_ENCODING" | grep -Eo '^.*\.java')
for FILE in $FOUND_FILES ; do
    ORIGINAL_FILE="$FILE.$SRC_ENCODING.bkp"
    echo "Backup original file to $ORIGINAL_FILE"
    mv "$FILE" "$ORIGINAL_FILE"
    echo "converting $FILE from $SRC_ENCODING to $DST_ENCODING"
    iconv -f "$SRC_ENCODING" -t "$DST_ENCODING" "$ORIGINAL_FILE" -o "$FILE"
done
echo "Deleting backups"
find . -iname "*.$SRC_ENCODING.bkp" -exec rm {} \;
I was working on a project that required cross-platform support, and I encountered many problems related to file encoding.
I made this script to convert everything to UTF-8:
#!/bin/bash
## Retrieve the encoding of files and convert them
for f in `find "$1" -regextype posix-egrep -regex ".*\.(cpp|h)$"`; do
    echo "file: $f"
    ## Read the entire file and get the encoding
    bytes_to_scan=$(wc -c < "$f")
    encoding=`file -b --mime-encoding -P bytes=$bytes_to_scan "$f"`
    case $encoding in
        iso-8859-1 | euc-kr)
            iconv -f euc-kr -t utf-8 "$f" > "$f.utf8"
            mv "$f.utf8" "$f"
            ;;
    esac
done
I used a hack to read the entire file and estimate the encoding: file -b --mime-encoding -P bytes=$bytes_to_scan "$f" (by default, file only inspects the beginning of the file).
You can extract the encoding of a single file with the file command. I have a sample.html file:
$ file sample.html
sample.html: HTML document, UTF-8 Unicode text, with very long lines
$ file -b sample.html
HTML document, UTF-8 Unicode text, with very long lines
$ file -bi sample.html
text/html; charset=utf-8
$ file -bi sample.html | awk -F'=' '{print $2 }'
utf-8
In Cygwin, this looks like it works for me:
find -type f -name "<FILENAME_GLOB>" | while read <VAR>; do (file -i "$<VAR>"); done
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done
You could pipe that to AWK and create an iconv command to convert everything to UTF-8, from any source encoding supported by iconv.
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done | awk -F[:=] '{print "iconv -f "$3" -t utf8 \""$1"\" > \""$1"_utf8\""}' | bash
With Perl, use Encode::Detect.
