I have a very large CSV file (~10 million rows) with two numeric columns representing ids. The requirement is: given the first id, return the second id very fast.
I need to get the CSV to behave like a map structure and it has to be in memory. I couldn't find a way to expose awk variables back to the shell so I thought of using bash associative arrays.
The problem is that loading the CSV into an associative array gets very slow/stuck after ~8 million rows. I've been trying to eliminate the causes of slowdown I could think of: file reading/IO, associative array limitations. So I have a couple of functions that read the file into an associative array, but all of them have the same slowness problem.
Here is the test data
loadSplittedFilesViaMultipleArrays -> assumes the original file was split into smaller files (1 mil rows) and uses a while read loop to build 4 associative arrays (max 3 mil records each)
loadSingleFileViaReadarray -> uses readarray to read the original file into a temp array and then goes through that to build the associative array
loadSingleFileViaWhileRead -> uses a while read loop to build the associative array
But I can't seem to figure it out. Maybe this way of doing it is completely wrong... Can anyone pitch in with some suggestions?
Bash is the wrong tool for an associative array of this size. Consider using a language more suited (Perl, Python, Ruby, PHP, js, etc etc)
For a Bash-only environment you could use an sqlite3 database; the sqlite3 command-line shell is usually available on systems where Bash is installed. (It is not POSIX, however.)
First you would create the database from your csv file. There are many ways to do this (Perl, Python, Ruby, GUI tools) but this is simple enough to do interactively in sqlite3 command line shell (exp.db must not exist at this point):
$ sqlite3 exp.db
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
sqlite> create table mapping (id integer primary key, n integer);
sqlite> .separator ","
sqlite> .import /tmp/mapping.csv mapping
sqlite> .quit
Or, pipe in the sql statements:
#!/bin/bash
cd /tmp
[[ -f exp.db ]] && rm exp.db # must be a new db as written
echo 'create table mapping (id integer primary key, n integer);
.separator ","
.import mapping.csv mapping' | sqlite3 exp.db
(Note: as written, exp.db must not exist or you will get INSERT failed: UNIQUE constraint failed: mapping.id. You can write it so the database exp.db is updated rather than created by the csv file, but you would probably want to use a language like Python, Perl, Tcl, Ruby, etc to do that.)
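(If you do want to stay in the sqlite3 shell for updates, one possible approach - a sketch, not the only way - is to import into a temporary staging table and then upsert from it:)
# Sketch: refresh an existing exp.db from the csv without recreating it
sqlite3 exp.db <<'EOF'
create temp table staging (id integer, n integer);
.separator ","
.import /tmp/mapping.csv staging
insert or replace into mapping select id, n from staging;
EOF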
Either way you create it, you end up with an indexed database mapping the first column onto the second. The import will take a little while (15-20 seconds with the 198 MB example) but it creates a new persistent database from the imported csv:
$ ls -l exp.db
-rw-r--r-- 1 dawg wheel 158105600 Nov 19 07:16 exp.db
Then you can quickly query that new database from Bash:
$ time sqlite3 exp.db 'select n from mapping where id=1350044575'
1347465036
real 0m0.004s
user 0m0.001s
sys 0m0.001s
That takes 4 milliseconds on my older iMac.
If you want to use Bash variables for your query you can concatenate or construct the query string as needed:
$ q=1350044575
$ sqlite3 exp.db 'select n from mapping where id='"$q"
1347465036
And since the db is persistent, you can just compare file times of the csv file to the db file to test whether you need to recreate it:
if [[ ! -f "$db_file" || "$csv_file" -nt "$db_file" ]]; then
[[ -f "$db_file" ]] && rm "$db_file"
echo "creating $db_file"
# create the db as above...
else
echo "reusing $db_file"
fi
# query the db...
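Putting the pieces together, a minimal end-to-end sketch (paths and names are assumptions based on the example above) could look like this:
#!/bin/bash
# Sketch: rebuild exp.db only when mapping.csv is newer, then query it
csv_file=/tmp/mapping.csv
db_file=/tmp/exp.db

if [[ ! -f "$db_file" || "$csv_file" -nt "$db_file" ]]; then
    rm -f "$db_file"
    echo "creating $db_file"
    sqlite3 "$db_file" <<EOF
create table mapping (id integer primary key, n integer);
.separator ","
.import $csv_file mapping
EOF
else
    echo "reusing $db_file"
fi

# look up the second id for a given first id
lookup() { sqlite3 "$db_file" "select n from mapping where id=$1"; }
lookup 1350044575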
More:
sqlite tutorial
sqlite home
Inspired by @HuStmpHrrr's comment, I thought about another, maybe simpler alternative.
You can use GNU Parallel to split the file up into 1MB (or other) sized chunks and then use all your CPU cores to search each of the resulting chunks in parallel:
parallel --pipepart -a mapping.csv --quote awk -F, -v k=1350044575 '$1==k{print $2;exit}'
1347465036
Takes under a second on my iMac and that was the very last record.
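The key can just as easily come from a shell variable; a variant of the same command (only the variable name is new here):
# Same search, with the key taken from a variable
key=1350044575
parallel --pipepart -a mapping.csv --quote awk -F, -v k="$key" '$1==k{print $2;exit}'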
I made a little Perl-based TCP server that reads the CSV into a hash and then sits looping forever doing lookups for requests coming via TCP from clients. It is pretty self-explanatory:
#!/usr/bin/perl
use strict;
use warnings;
################################################################################
# Load hash from CSV at startup
################################################################################
open DATA, "mapping.csv";
my %hash;
while( <DATA> ) {
chomp $_;
my ($field1,$field2) = split /,/, $_;
if( $field1 ne '' ) {
$hash{$field1} = $field2;
}
}
close DATA;
print "Ready\n";
################################################################################
# Answer queries forever
################################################################################
use IO::Socket::INET;
# auto-flush on socket
$| = 1;
my $port=5000;
# creating a listening socket
my $socket = new IO::Socket::INET (
LocalHost => '127.0.0.1',
LocalPort => $port,
Proto => 'tcp',
Listen => 5,
Reuse => 1
);
die "cannot create socket $!\n" unless $socket;
while(1)
{
# waiting for a new client connection
my $client_socket = $socket->accept();
my $data = "";
$client_socket->recv($data, 1024);
my $key=$data;
chomp $key;
my $reply = "ERROR: Not found $key";
if (defined $hash{$key}){
$reply=$hash{$key};
}
print "DEBUG: Received $key: Replying $reply\n";
$client_socket->send($reply);
# notify client that response has been sent
shutdown($client_socket, 1);
}
So, you save the code above as go.pl and then make it executable with:
chmod +x go.pl
then start the server in the background with:
./go.pl &
Then, when you want to do a lookup as a client, you send your key to localhost:5000 using the standard socat utility like this:
socat - TCP:127.0.0.1:5000 <<< "1350772177"
1347092335
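If you are doing lots of interactive lookups, a tiny wrapper function (just a convenience, not part of the server) keeps it readable:
# Hypothetical helper around the socat call above
lookup() {
    socat - TCP:127.0.0.1:5000 <<< "$1"
    echo    # the server's reply has no trailing newline
}
lookup 1350772177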
As a quick benchmark, it does 1,000 lookups in 8 seconds.
START=$SECONDS; tail -1000 *csv | awk -F, '{print $1}' |
while read a; do echo $a | socat - TCP:127.0.0.1:5000 ; echo; done; echo $START,$SECONDS
It could probably be sped up by a slight change that handles multiple lookup keys per request, to reduce socket connection and teardown overhead.
I've come up with the following bash script to randomly generate invoice numbers, preventing duplications by logging all generated numbers to a text file "database".
To my surprise the script actually works, and it seems robust (although I'd be glad to have any flaws pointed out to me at this early stage rather than later on).
What I'm now wondering is whether it's at all possible to move the "database" of generated numbers into the script file itself. This would allow me to rely on and keep track of just the one file rather than two separate ones.
Is this at all possible, and if so, how? If it isn't a good idea, what valid reasons are there not to do so?
#!/usr/bin/env bash
generate_num() {
#num=$(head /dev/urandom | tr -dc '[:digit:]' | cut -c 1-5) [Original method, no longer used]
num=$(shuf -i 10000-99999 -n 1)
}
read -p "Are you sure you want to generate a new invoice ID? [Y/n] " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]
then
generate_num && echo Generating a random invoice ID and checking it against the database...
sleep 2
while grep -xq "$num" "ID_database"
do
echo Invoice ID \#$num already exists in the database...
sleep 2
generate_num && echo Generating new random invoice ID and checking against database...
sleep 2
done
while [[ ${#num} -gt 5 ]]
do
echo Invoice ID \#$num is more than 5 digits...
sleep 2
generate_num && echo Generating new random invoice ID and checking against database...
sleep 2
done
echo Generated random invoice ID \#$num
sleep 1
echo Invoice ID \#$num does not exist in database...
sleep 2
echo $num >> "ID_database" && echo Successfully added Invoice ID \#$num to the database.
else
echo "Exiting..."
fi
I do not recommend this because:
These things are fragile. One bad edit and your invoice database is corrupt.
It makes version control a pain. Each new version of the script should preferably be checked in. You could add logic to make sure that "$mydir" is an empty directory when you run the script (except for "$myname", .git and other git-related files) then run git -C "$mydir" init if "$mydir"/.git doesn't exist. Then for each database update, git -C "$mydir" add "$myname" and git -C "$mydir" commit -m "$num". It's just an idea to explore...
Locking - It's possible to do file locking to make sure that no two users run the script at the same time, but it adds to the complexity so I didn't bother. If you feel that's a risk, you need to add that.
... but you want a self-modifying script, so here goes.
This just adds a new invoice number to its internal database for each time you run it. I've explained what goes on as comments. The last line should read __INVOICES__ (+ a newline) if you copy the script.
As always when dealing with things like this, remember to make a backup before making changes :-)
As it's currently written, you can only add one invoice per run. It shouldn't be hard to move things around (you need a new tempfile) to get it to add more than one if you need that.
#!/bin/bash
set -e # exit on error - important for this type of script
#------------------------------------------------------------------------------
myname="$0"
mydir=$(dirname "$myname")
if [[ ! -w $myname ]]; then
echo "ERROR: You don't have permission to update $myname" >&2
exit 1
fi
# create a tempfile to be able to update the database in the file later
#
# set -e makes the script end if this fails:
temp=$(mktemp -p "$mydir")
trap "{ rm -f "$temp"; }" EXIT # remove the tempfile if we die for some reason
# read current database from the file
readarray -t ID_database <<< $(sed '0,/^__INVOICES__$/d' "$0")
#declare -p ID_database >&2 # debug
#------------------------------------------------------------------------------
# a function to check if a number is already in the db
is_it_taken() {
local num=$1
# return 1 (true, yes it's taken) if the regex found a match
[[ ! " ${ID_database[#]} " =~ " ${num} " ]]
}
generate_num() {
local num
(exit 1) # set $? to 1
# loop until $? becomes 0
while (( $? )); do
num=$(shuf -i 10000-99999 -n 1)
is_it_taken "$num"
done
# we found a free number
echo $num
}
add_to_db() {
local num=$1
# add to db in memory
ID_database+=($num)
# add to db in file:
# copy the script to the tempfile
cp -pf "$myname" "$temp"
# add the new number
echo $num >> "$temp"
# move the tempfile into place
mv "$temp" "$myname"
}
#------------------------------------------------------------------------------
num=$(generate_num)
add_to_db $num
# your business logic goes here:
echo "All current invoices:"
for invoice in "${ID_database[@]}"
do
echo ">$invoice<"
done
#------------------------------------------------------------------------------
# leave the rest untouched:
exit
__INVOICES__
Edited
To answer the question you asked -
Make sure your file ends with an explicit exit statement.
Without some sort of branching it won't execute past that, so unless there is a gross parsing error anything below could be used as storage space. Just
echo $num >> $0
If you write your records directly onto the bottom of the script, the script grows, but relatively harmlessly. Just make sure your grep pattern doesn't grab any lines of code; something like grep -E '^[0-9]{5}$' (only lines that are exactly five digits) is pretty safe.
This is only ever going to give you a max of ~90k IDs, and it spends unneeded time and cycles on redundancy checking. Is there a limit on the length of the value?
If you can assure there won't be more than one invoice processed per second,
date +%s >> "ID_database" # the UNIX epoch, seconds since 00:00:00 01/01/1970
If you need more precision than that,
date +%Y%m%d%H%M%S%N
will output Year month day hour minute second nanoseconds, which is both immediate and "pretty safe".
date +%s%N # epoch with nanoseconds
is shorter, but doesn't have the convenient side effect of automatically giving you the date and time of invoice creation.
If you absolutely need to guarantee uniqueness and nanoseconds isn't good enough, use a lock of some sort, and maybe a more fine-grained language.
On the other hand, if minutes are unique enough, you could use
date +%y%m%d%H%M
You get the idea.
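As a concrete sketch of that idea, assuming one invoice per second is unique enough, the whole generate-and-check loop collapses to:
# Sketch: timestamp-based invoice ID, assuming at most one invoice per second
num=$(date +%Y%m%d%H%M%S)
echo "$num" >> "ID_database"
echo "Generated invoice ID #$num"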
I need your help with a bash >= 4 script I'm writing.
I am retrieving some files from remote hosts to back them up.
I have a for loop that iterate through the hosts and for each one tests connection and the start a function that retrieves the various files.
My problem is that I need to know what went wrong (and if anything did), so I am trying to store OK or KO values in an array and parse it later.
This is the code:
...
for remote_host in $hosts ; do
short_host=$(echo "$remote_host" | grep -o '^[^.]\+')
declare -A cluster
printf "INFO: Testing connectivity to %s... " "$remote_host"
if ssh -q "$remote_host" exit ; then
printf "OK!\n"
cluster[$short_host]="Reacheable"
mkdir "$short_host"
echo "INFO: Collecting files ..."
declare -A ${short_host}
objects1="/etc/krb5.conf /etc/passwd /etc/group /etc/fstab /etc/sudoers /etc/shadow"
for obj in ${objects1} ; do
if file_retrieve "$user" "$remote_host" "$obj" ; then
-> ${short_host}=["$obj"]=OK
else
${short_host}=["$obj"]=KO
fi
done
...
So I'm using an array named cluster to record whether the nodes were reachable, and another array - named after the short name of the node - to record OK or KO for the individual files.
On execution, I got the following error (line 130 is the line I marked with the arrow above):
./test.sh: line 130: ubuntu01=[/etc/krb5.conf]=OK: command not found
I think this is a syntax error for sure, but I can't fix it. I tried a bunch of combinations without success.
Thanks for your help.
Since the array name is contained in the variable short_host, you need eval to make the assignment work:
${short_host}=["$obj"]=OK
Change it to (dropping the stray = before the bracket as well):
eval ${short_host}["$obj"]=OK
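As an aside, on bash >= 4.3 a nameref avoids eval entirely; this is an alternative sketch, not what the answer above proposed:
# Alternative sketch using a nameref (bash >= 4.3)
declare -A "$short_host"          # make sure the array is associative (the question already does this)
declare -n current=$short_host    # 'current' now refers to the array named in $short_host
current["$obj"]=OK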
I have thousands of DB files that need to be converted to CSV files. This can be achieved by a simple script / batch file, e.g.:
.open "Test.db"
.mode csv
.headers on
I need the script to open the other DB files, which all have different names. Is there a way this can be done so that I do not have to write the above script for each DB file?
I made a script, called 'sqlite2csv', that batch-converts all SQLite db files in the current directory to CSV. It outputs each table of each db file as a separate CSV file, so if you have 10 files with 3 tables each you will get 30 CSV files. Hope it helps, at least as a starting point for making your own script.
#!/bin/bash
# USAGE EXAMPLES :
# sqlite2csv
# - Will loop all sqlite files in the current directory, take the tables of
# each of these sqlite files, and generate a CSV file per table.
# E.g. If there are 10 sqlite files with 3 tables each, it will generate
# 30 CSV output files, each containing the data of one table.
# The naming of the generated CSV files take from the original sqlite
# file name, prepended with the name of the table.
# check for dependencies
if ! type "sqlite3" > /dev/null; then
echo "[ERROR] SQLite binary not found."
exit 1
fi
# define list of string tokens that an SQLite file type should contain
# the footprint for SQLite 3 is "SQLite 3.x database"
declare -a list_sqlite_tok
list_sqlite_tok+=( "SQLite" )
#list_sqlite_tok+=( "3.x" )
list_sqlite_tok+=( "database" )
# get a list of only the files in the current path
list_files=( $(find . -maxdepth 1 -type f) )
# loop the list of files
for f in ${!list_files[@]}; do
# get current file
curr_fname=${list_files[$f]}
# get file type result
curr_ftype=$(file -e apptype -e ascii -e encoding -e tokens -e cdf -e compress -e elf -e tar $curr_fname)
# loop through necessary token and if one is not found then skip this file
curr_isqlite=0
for t in ${!list_sqlite_tok[@]}; do
curr_tok=${list_sqlite_tok[$t]}
# check if 'curr_ftype' contains 'curr_tok'
if [[ $curr_ftype =~ $curr_tok ]]; then
curr_isqlite=1
else
curr_isqlite=0
break
fi
done
# test if curr file was sqlite
if (( ! $curr_isqlite )); then
# if not, skip this file and move on to the next one
continue
fi
# print sqlite filename
echo "[INFO] Found SQLite file $curr_fname, exporting tables..."
# get tables of sqlite file in one line
curr_tables=$(sqlite3 $curr_fname ".tables")
# split tables line into an array
IFS=$' ' list_tables=($curr_tables)
# loop array to export each table
for t in ${!list_tables[@]}; do
curr_table=${list_tables[$t]}
# strip unsafe characters as well as newline
curr_table=$(tr '\n' ' ' <<< $curr_table)
curr_table=$(sed -e 's/[^A-Za-z0-9._-]//g' <<< $curr_table)
# temporarily strip './' from filename
curr_fname=${curr_fname//.\//}
# build target CSV filename
printf -v curr_csvfname "%s_%s.csv" $curr_table "$curr_fname"
# put back './' to filenames
curr_fname="./"$curr_fname
curr_csvfname="./"$curr_csvfname
# export current table to target CSV file
sqlite3 -header -csv $curr_fname "select * from $curr_table;" > $curr_csvfname
# log
echo "[INFO] Exported table $curr_table in file $curr_csvfname"
done
done
The sqlite3 command-line shell allows some settings to be done with command-line arguments, so you can simply execute a simple SELECT * for the table in each DB file:
for %%a in (*.db) do sqlite3 -csv -header "%%a" "select * from TableName" > %%~na.csv
(When this is not part of a batch file but run directly from the command line, you must replace %% with %.)
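On a Unix shell, the equivalent loop would be something like the sketch below (TableName is still a placeholder you must fill in):
# Bash equivalent of the batch one-liner above
for f in *.db; do
    sqlite3 -csv -header "$f" "select * from TableName" > "${f%.db}.csv"
done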
I prepared a short python script which will write a csv file from multiple sqlite databases.
# function for merging sqlite files to csv
import csv
import os
import sqlite3

def convert_sqlite_to_csv(inputFolder, ext, tableName):
    """ inputFolder - Folder where sqlite files are located.
    ext - Extension of your sqlite file (eg. db, sqlite, sqlite3 etc.)
    tableName - table name from which you want to select the data.
    """
    csvWriter = csv.writer(open(inputFolder+'/output.csv', 'w', newline=''))
    for file1 in os.listdir(inputFolder):
        if file1.endswith('.'+ext):
            conn = sqlite3.connect(inputFolder+'/'+file1)
            cursor = conn.cursor()
            cursor.execute("SELECT * FROM "+tableName)
            rows = cursor.fetchall()
            for row in rows:
                csvWriter.writerow(row)
        else:
            continue
Or find the script at the GitHub link below for converting multiple files in a folder. Running
python multiple_sqlite_files_tocsv.py -d <inputFolder> -e <extension> -t <tableName>
will output the data to the output.csv file.
Jupyter notebook and a python script are on github.
https://github.com/darshanz/CombineMultipleSqliteToCsv
I currently have a R script written to perform a population genetic simulation, then write a table with my results to a text file. I would like to somehow run multiple instances of this script in parallel using an array job (my University's cluster uses SGE), and when its all done I will have generated results files corresponding to each job (Results_1.txt, Results_2.txt, etc.).
I spent the better part of the afternoon reading and trying to figure out how to do this, but haven't really found anything along the lines of what I am trying to do. I was wondering if someone could provide an example or perhaps point me in the direction of something I could read to help with this.
To boil down mithrado's answer to the bare essentials:
Create a job script, pop_gen.bash, that may or may not take the SGE task id as an argument, and that stores its results in a file identified by that same task id:
#!/bin/bash
Rscript pop_gen.R ${SGE_TASK_ID} > Results_${SGE_TASK_ID}.txt
Submit this script as a job array, e.g. 1000 jobs:
qsub -t 1-1000 pop_gen.bash
Grid Engine will execute pop_gen.bash 1000 times, each time setting SGE_TASK_ID to a value ranging from 1 to 1000.
Additionally, as shown above, by passing SGE_TASK_ID as a command-line argument to pop_gen.R, you can use it inside the R script to name the output file:
args <- commandArgs(trailingOnly = TRUE)
out.file <- paste("Results_", args[1], ".txt", sep="")
# d <- "some data frame"
write.table(d, file=out.file)
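Once the array job finishes, the per-task files can be combined from the shell; a rough sketch (assuming write.table wrote a header row in each file):
# Hypothetical post-processing: merge per-task outputs, keeping one header
head -n 1 Results_1.txt > all_results.txt
for f in Results_*.txt; do
    tail -n +2 "$f" >> all_results.txt
done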
HTH
I am not used to doing this in R, but I've been using the same approach in Python. Imagine that you have a script genetic_simulation.r and it has 3 parameters:
--gene_id, --khmer_len and --output_file.
You will have one csv file, genetic_sim_parms.csv with n rows:
first_gene,10,/result/first_gene.txt
...
nth_gene,6,/result/nth_gene.txt
An important detail is the first line of your genetic_simulation.r: it needs to tell the cluster which interpreter to use. You might need to tweak the path depending on your setup, but it will look like:
#!/path/to/Rscript --vanilla
And finally, you will need an array-job bash script:
#!/bin/bash
#$ -t 1-N   <- change N to the number of rows in genetic_sim_parms.csv
#$ -N genetic_simulation.r
echo "Starting on : $(date)"
echo "Running on node : $(hostname)"
echo "Current directory : $(pwd)"
echo "Current job ID : $JOB_ID"
echo "Current job name : $JOB_NAME"
echo "Task index number : $SGE_TASK_ID"
ID=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $1}' genetic_sim_parms.csv)
LEN=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $2}' genetic_sim_parms.csv)
OUTPUT=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $3}' genetic_sim_parms.csv)
echo "id is: $ID"
Rscript genetic_simulation.r --gene_id $ID --khmer_len $LEN --output_file $OUTPUT
echo "Finished on : $(date)"
Hope this helps!
I want to programmatically create a SHA1 checksum of audio files (MP3, Ogg Vorbis, Flac).
The requirement is that the checksum should be stable even if the header (eg. ID3) changes.
Note: The audio files don't have CRCs
This is what I tried by now:
1) Reading + Hashing all MPEG frames using Perl and MPEG::Audio::Frame
my $sha1 = Digest::SHA1->new;
while (my $frame = MPEG::Audio::Frame->read(\*FH)) {
$sha1->add($frame->content());
}
2) Decoding + Hashing all MPEG frames using Python and libmad (pymad)
mf = mad.MadFile(path)
sha1 = hashlib.sha1()
while 1:
    buf = mf.read()
    if (buf is None):
        break
    sha1.update(buf)
3) Using mp3cat
> mp3cat - - < file.mp3 | sha1sum
However, none of those methods provided a stable checksum. Namely, in some cases the checksum changed after retagging the file with picard.
Are there any libraries that already provide what I want?
I don't care about the programming language…
Update:
I debugged the case a bit further.
The libmad checksum inconsistency seems to happen in cases where libmad gets some decoding errors, like "Huffman data overrun (0x0238)".
As this happens on many of my mp3 files, I'm not sure it really indicates a broken file…
If you are looking for stable hashes of the actual music you might want to look at libOFA. Your current methods will give you different results because the formats can have embedded tags. Also, if you want two different files containing the same song to return the same hash, you need to account for things like bitrate and sample frequency.
libOFA on the other hand can give you a stable hash that can be used between formats and different encodings. Might be what you want?
I needed tools to quickly check if my MP3/OGG library is still valid.
For MP3 I found mp3md5.py (http://snipplr.com/view/4025/mp3-checksum-in-id3-tag/) which does the job, but I found no simple tool for Ogg Vorbis, so I coded a little bash script to do this for me.
Both tools should tolerate modifications of the comment/ID3Tag.
#!/bin/bash
# This bash script appends an MD5SUM to the vorbiscomment and/or verifies it if it exists
# Later modification of the vorbis comment does not alter the MD5SUM
# Julian M.K.
FILE="$1"
if [[ ! -f "$FILE" || ! -r "$FILE" || ! -w "$FILE" ]] ; then
echo "File $FILE" does not exist or is not readable or writable
exit 1
fi
OLDCRC=`vorbiscomment "$FILE" | grep ^CRC=|cut -d "=" -f 2`
NEWCRC=`ogginfo "$FILE" |grep "Total data length:" |cut -d ":" -f 2 | md5sum |cut -d " " -f 1`
if [[ "$OLDCRC" == "" ]] ; then
echo "ADDED $FILE $NEWCRC"
vorbiscomment -a -t "CRC=$NEWCRC" "$FILE"
# rewrite CRC to get proper data length, I don't know why this is necessary
NEWCRC=`ogginfo "$FILE" |grep "Total data length:" |cut -d ":" -f 2 | md5sum |cut -d " " -f 1`
vorbiscomment -w -t "CRC=$NEWCRC" "$FILE"
elif [[ "$OLDCRC" == "$NEWCRC" ]] ; then
echo "VERIFIED $FILE"
else
echo "FAILURE $FILE -- $OLDCRC - $NEWCRC"
fi
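Typical usage over a whole library might look like this (the script name checkogg.sh is just an assumption):
# Hypothetical batch run: add/verify the checksum tag on every Ogg file
find /path/to/music -name '*.ogg' -exec ./checkogg.sh {} \;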
There is an easy stable way to do it. Just make a copy of the file and remove all the tags from it (e.g. using mutagen.id3) and take the hashsum of the resulting file.
The only disadvantage of this method is its performance.
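A shell sketch of the same idea, using the id3v2 tool instead of mutagen to strip the tags from a throwaway copy (the tool choice is an assumption):
# Sketch: hash the MP3 with its tags stripped from a temporary copy
tmp=$(mktemp --suffix=.mp3)
cp file.mp3 "$tmp"
id3v2 --delete-all "$tmp" >/dev/null   # remove ID3v1 and ID3v2 tags
sha1sum "$tmp"
rm -f "$tmp"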
Bene, if I were you (and I am in the process of working on something very similar to what you want to do), I would hash the mp3 data block. (Extract it to raw data first, and write it out to disk, so you know what you are dealing with.) Then modify the ID3 tag and hash your data again. Now, if it changes, compare your two sets of raw data and find out WHERE it changed. Chances are, you might be over-stepping a boundary somewhere. If I recall, MP3 files start with something like FF F8. Well, at least the frame does.
I'm interested in your findings, as I'm still writing all my code to deal with the finger prints, etc, and haven't gotten to the actual hashing yet.
Update many years later:
See my answer here to a very similar question. It turns out that ffmpeg actually supports doing checksums of the individual streams. To get the md5 hash of only the audio stream:
ffmpeg -i "$filename" -map 0:a -codec copy -f md5 "$filename.md5"
There is also support for other hash formats with the generic -f hash format, or for doing it per frame with -f framemd5.
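For example, the per-frame variant mentioned above is invoked the same way, just with the other muxer name:
# Per-frame hashes of the audio stream only
ffmpeg -i "$filename" -map 0:a -codec copy -f framemd5 "$filename.framemd5"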
I'm trying to do the same thing. I used MD5 instead of SHA1. I started to export audio checksums using mp3tag (www.mp3tag.de/en/); then made a Perl script similar to yours to do the same thing. Then I removed all tags from my test file, and the audio checksum remained the same.
This is the script:
use MPEG::Audio::Frame;
use Digest::MD5 qw(md5_hex);
use strict;
my $file = 'E:\Music\MP3\Russensoul\01 - 5nizza , Soldat (Russensoul - Russensoul).mp3';
my $mp3tag_audio_md5 = lc '2EDFBD62995A46A45CEEC08C1F303486';
my $md5 = Digest::MD5->new;
open(FILE, $file) or die "Cannot open $file : $!\n";
binmode FILE;
while(my $frame = MPEG::Audio::Frame->read(\*FILE)){
$md5->add($frame->asbin);
}
print '$md5->hexdigest : ', $md5->hexdigest, "\n",
'mp3tag_audio_md5 : ', $mp3tag_audio_md5, "\n",
;
Is it possible that whatever you use to modify your tags sometimes also modifies mp3 headers?