convert multiple DBs to CSV - batch-file

I have thousands of .db files that need to be converted to CSV files. A single file can be converted with a simple script/batch file, i.e.
.open "Test.db"
.mode csv
.headers on
I need the script to open the other .db files, which all have different names. Is there a way to do this, as I do not want to write the above script for each .db file?

I made a script called 'sqlite2csv' that batch-converts all SQLite db files in the current directory to CSV. It outputs each table of each database as a separate CSV file, so if you have 10 files with 3 tables each you will get 30 CSV files. Hope it helps, at least as a starting point for making your own script.
#!/bin/bash
# USAGE EXAMPLES:
#   sqlite2csv
#     - Will loop over all SQLite files in the current directory, take the
#       tables of each of these SQLite files, and generate a CSV file per table.
#       E.g. if there are 10 SQLite files with 3 tables each, it will generate
#       30 CSV output files, each containing the data of one table.
#       The name of each generated CSV file is taken from the original SQLite
#       file name, prepended with the name of the table.

# check for dependencies
if ! type "sqlite3" > /dev/null; then
    echo "[ERROR] SQLite binary not found."
    exit 1
fi

# define list of string tokens that an SQLite file type should contain
# the footprint for SQLite 3 is "SQLite 3.x database"
declare -a list_sqlite_tok
list_sqlite_tok+=( "SQLite" )
#list_sqlite_tok+=( "3.x" )
list_sqlite_tok+=( "database" )

# get a list of only the files in the current path
list_files=( $(find . -maxdepth 1 -type f) )

# loop the list of files
for f in "${!list_files[@]}"; do
    # get current file
    curr_fname=${list_files[$f]}
    # get file type result
    curr_ftype=$(file -e apptype -e ascii -e encoding -e tokens -e cdf -e compress -e elf -e tar "$curr_fname")
    # loop through the required tokens and if one is not found then skip this file
    curr_isqlite=0
    for t in "${!list_sqlite_tok[@]}"; do
        curr_tok=${list_sqlite_tok[$t]}
        # check if 'curr_ftype' contains 'curr_tok'
        if [[ $curr_ftype =~ $curr_tok ]]; then
            curr_isqlite=1
        else
            curr_isqlite=0
            break
        fi
    done
    # test if the current file was SQLite
    if (( ! curr_isqlite )); then
        # if not, do not continue executing the rest of the loop body
        continue
    fi
    # print SQLite filename
    echo "[INFO] Found SQLite file $curr_fname, exporting tables..."
    # get tables of the SQLite file in one line
    curr_tables=$(sqlite3 "$curr_fname" ".tables")
    # split tables line into an array
    IFS=$' ' list_tables=($curr_tables)
    # loop the array to export each table
    for t in "${!list_tables[@]}"; do
        curr_table=${list_tables[$t]}
        # strip unsafe characters as well as newlines
        curr_table=$(tr '\n' ' ' <<< "$curr_table")
        curr_table=$(sed -e 's/[^A-Za-z0-9._-]//g' <<< "$curr_table")
        # temporarily strip './' from the filename
        curr_fname=${curr_fname//.\//}
        # build target CSV filename
        printf -v curr_csvfname "%s_%s.csv" "$curr_table" "$curr_fname"
        # put back './' on the filenames
        curr_fname="./"$curr_fname
        curr_csvfname="./"$curr_csvfname
        # export the current table to the target CSV file
        sqlite3 -header -csv "$curr_fname" "select * from $curr_table;" > "$curr_csvfname"
        # log
        echo "[INFO] Exported table $curr_table in file $curr_csvfname"
    done
done

The sqlite3 command-line shell allows some settings to be given as command-line arguments, so you can simply execute a SELECT * for the table in each DB file:
for %%a in (*.db) do sqlite3 -csv -header "%%a" "select * from TableName" > "%%~na.csv"
(When this is not part of a batch file but run directly from the command line, you must replace %% with %.)
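On Linux or macOS the same idea can be expressed as a small bash loop (a sketch, assuming every .db file contains a table named TableName):
# Sketch: export one table from each .db file in the current directory to CSV.
for f in *.db; do
    sqlite3 -csv -header "$f" "select * from TableName;" > "${f%.db}.csv"
done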

I prepared a short Python script which writes a CSV file from multiple SQLite databases.
import csv
import os
import sqlite3

# function for merging sqlite files to csv
def convert_sqlite_to_csv(inputFolder, ext, tableName):
    """ inputFolder - Folder where sqlite files are located.
        ext - Extension of your sqlite file (eg. db, sqlite, sqlite3 etc.)
        tableName - table name from which you want to select the data.
    """
    csvWriter = csv.writer(open(inputFolder + '/output.csv', 'w', newline=''))
    for file1 in os.listdir(inputFolder):
        # only process files with the given extension
        if file1.endswith('.' + ext):
            conn = sqlite3.connect(inputFolder + '/' + file1)
            cursor = conn.cursor()
            cursor.execute("SELECT * FROM " + tableName)
            rows = cursor.fetchall()
            # append every row of this database's table to the combined CSV
            for row in rows:
                csvWriter.writerow(row)
            conn.close()
Or find the script at the GitHub link below for converting multiple files in a folder:
python multiple_sqlite_files_tocsv.py -d <inputFolder> -e <extension> -t <tableName>
This will output the data to an output.csv file.
A Jupyter notebook and a Python script are on GitHub:
https://github.com/darshanz/CombineMultipleSqliteToCsv

Related

Loading of large CSV file into bash associative array slow/stuck

I have a very large CSV file (~10 million rows) with two numeric columns representing ids. The requirement is: given the first id, return the second id very fast.
I need the CSV to behave like a map structure, and it has to be in memory. I couldn't find a way to expose awk variables back to the shell, so I thought of using bash associative arrays.
The problem is that loading the csv into an associative array gets very slow/stuck after ~8 million rows. I've been trying to eliminate the causes of slowdown that I could think of: file reading/IO, associative array limitations. So, I have a couple of functions that read the file into an associative array, but all of them have the same slowness problem.
Here is the test data and the functions I tried:
loadSplittedFilesViaMultipleArrays -> assumes the original file was split into smaller files (1 mil rows) and uses a while read loop to build 4 associative arrays (max 3 mil records each)
loadSingleFileViaReadarray -> uses readarray to read the original file into a temp array and then goes through that to build the associative array
loadSingleFileViaWhileRead -> uses a while read loop to build the associative array
But I can't seem to figure it out. Maybe this way of doing it is completely wrong... Can anyone pitch in with some suggestions?
Bash is the wrong tool for an associative array of this size. Consider using a more suitable language (Perl, Python, Ruby, PHP, JavaScript, etc.).
For a Bash-only environment you could use an sqlite3 SQL database, which is usually available wherever Bash is installed. (It is not POSIX, however.)
First you would create the database from your csv file. There are many ways to do this (Perl, Python, Ruby, GUI tools) but it is simple enough to do interactively in the sqlite3 command-line shell (exp.db must not exist at this point):
$ sqlite3 exp.db
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
sqlite> create table mapping (id integer primary key, n integer);
sqlite> .separator ","
sqlite> .import /tmp/mapping.csv mapping
sqlite> .quit
Or, pipe in the sql statements:
#!/bin/bash
cd /tmp
[[ -f exp.db ]] && rm exp.db # must be a new db as written
echo 'create table mapping (id integer primary key, n integer);
.separator ","
.import mapping.csv mapping' | sqlite3 exp.db
(Note: as written, exp.db must not exist or you will get INSERT failed: UNIQUE constraint failed: mapping.id. You could write it so that the database exp.db is updated rather than created from the csv file, but you would probably want to use a language like Python, Perl, Tcl, Ruby, etc. to do that.)
In either case, that will create an indexed database mapping the first column onto the second. The import will take a little while (15-20 seconds with the 198 MB example) but it creates a new persistent database from the imported csv:
$ ls -l exp.db
-rw-r--r-- 1 dawg wheel 158105600 Nov 19 07:16 exp.db
Then you can quickly query that new database from Bash:
$ time sqlite3 exp.db 'select n from mapping where id=1350044575'
1347465036
real 0m0.004s
user 0m0.001s
sys 0m0.001s
That takes 4 milliseconds on my older iMac.
If you want to use Bash variables for your query you can concatenate or construct the query string as needed:
$ q=1350044575
$ sqlite3 exp.db 'select n from mapping where id='"$q"
1347465036
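That query pattern can be wrapped in a small shell function for reuse (a sketch, assuming exp.db is in the current directory and the ids are plain integers):
# Sketch: return the second id for a given first id, using the exp.db built above.
lookup() {
    sqlite3 exp.db "select n from mapping where id=$1;"
}

second_id=$(lookup 1350044575)
echo "$second_id"    # prints 1347465036 with the example data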
And since the db is persistent, you can just compare file times of the csv file to the db file to test whether you need to recreate it:
if [[ ! -f "$db_file" || "$csv_file" -nt "$db_file" ]]; then
[[ -f "$db_file" ]] && rm "$db_file"
echo "creating $db_file"
# create the db as above...
else
echo "reusing $db_file"
fi
# query the db...
More:
sqlite tutorial
sqlite home
Inspired by @HuStmpHrrr's comment, I thought about another, maybe simpler alternative.
You can use GNU Parallel to split the file up into 1MB (or other) sized chunks and then use all your CPU cores to search each of the resulting chunks in parallel:
parallel --pipepart -a mapping.csv --quote awk -F, -v k=1350044575 '$1==k{print $2;exit}'
1347465036
Takes under a second on my iMac and that was the very last record.
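If the key comes from a shell variable, the same scan can be parameterised (a sketch, assuming GNU Parallel as above and that mapping.csv has no header line):
# Sketch: same parallel lookup, with the key taken from a shell variable.
key=1350044575
parallel --pipepart -a mapping.csv --quote awk -F, -v k="$key" '$1==k{print $2;exit}'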
I made a little Perl-based TCP server that reads the CSV into a hash and then sits looping forever doing lookups for requests coming via TCP from clients. It is pretty self-explanatory:
#!/usr/bin/perl
use strict;
use warnings;

################################################################################
# Load hash from CSV at startup
################################################################################
open DATA, "mapping.csv";
my %hash;
while( <DATA> ) {
    chomp $_;
    my ($field1,$field2) = split /,/, $_;
    if( $field1 ne '' ) {
        $hash{$field1} = $field2;
    }
}
close DATA;
print "Ready\n";

################################################################################
# Answer queries forever
################################################################################
use IO::Socket::INET;

# auto-flush on socket
$| = 1;
my $port=5000;

# creating a listening socket
my $socket = new IO::Socket::INET (
    LocalHost => '127.0.0.1',
    LocalPort => $port,
    Proto     => 'tcp',
    Listen    => 5,
    Reuse     => 1
);
die "cannot create socket $!\n" unless $socket;

while(1)
{
    # waiting for a new client connection
    my $client_socket = $socket->accept();
    my $data = "";
    $client_socket->recv($data, 1024);
    my $key=$data;
    chomp $key;
    my $reply = "ERROR: Not found $key";
    if (defined $hash{$key}){
        $reply=$hash{$key};
    }
    print "DEBUG: Received $key: Replying $reply\n";
    $client_socket->send($reply);
    # notify client that response has been sent
    shutdown($client_socket, 1);
}
So, you save the code above as go.pl and then make it executable with:
chmod +x go.pl
then start the server in the background with:
./go.pl &
Then, when you want to do a lookup as a client, you send your key to localhost:5000 using the standard socat utility like this:
socat - TCP:127.0.0.1:5000 <<< "1350772177"
1347092335
As a quick benchmark, it does 1,000 lookups in 8 seconds.
START=$SECONDS; tail -1000 *csv | awk -F, '{print $1}' |
while read a; do echo $a | socat - TCP:127.0.0.1:5000 ; echo; done; echo $START,$SECONDS
It could probably be sped up by a slight change that handles multiple keys per request, to reduce socket connection and teardown overhead.
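For use inside other scripts, the socat call can be wrapped in a tiny helper function (a sketch, assuming socat is installed and the server above is listening on 127.0.0.1:5000):
# Sketch: query the Perl lookup server from bash.
lookup() {
    socat - TCP:127.0.0.1:5000 <<< "$1"
}

lookup 1350772177    # prints 1347092335 with the example data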

importing data from a CSV in Bash

I have a CSV file that I need to use in a bash script. The CSV is formatted like so.
server1,file.name
server1,otherfile.name
server2,file.name
server3,file.name
I need to be able to pull this information into either an array or in some other way so that I can then filter the information and only pull out data for a single server, which I can then pass to another command within the script.
I need it to go something like this.
Import workfile.csv
check hostname | return only lines from workfile.csv that have the hostname as column one and store column 2 as a variable.
find / -xdev -type f -perm -002 | compare to stored info | chmod o-w all files not in listing
I'm stuck using bash because of the environment that I'm working in.
The csv can be too big to add all the filenames to the find parameter list.
You also do not want to call find in a loop for every line in the csv.
Solution:
First, make a complete list of files in a tmp file.
Second, parse the csv and filter the files.
Third, chmod -w the files that remain.
The following solution stores the files in a tmp file.
Make a script that gets the servername as a parameter.
See the comments in the code:
# Before EDIT:
# Hostname by parameter 1
# Check that you have a hostname
if [ $# -ne 1 ]; then
    echo "Usage: $0 hostname"
    # Exit script, failure
    exit 1
fi
hostname=$1

# Edit, get hostname by system call
hostname=$(hostname)
# Or: hostname=$(hostname -s)

# Additional check
if [ ! -f workfile.csv ]; then
    echo "inputfile missing"
    exit 1
fi

# After the edits, ${hostname} is now filled.
# Make the complete list of world-writable files in a tmp file.
find / -xdev -type f -perm -002 > /tmp/allfiles.tmp

# Do not use cat workfile.csv | grep ..., you do not need to call cat
# grep with ^ for beginning of line, add a , for a complete first field
#   grep "^${hostname}," workfile.csv
# cut for selecting the second field with delimiter ','
#   cut -d"," -f2
# while read file => can be improved with xargs but let's start with this.
grep "^${hostname}," workfile.csv | cut -d"," -f2 | while read -r file; do
    # Using sed with #, not /, since you need / in the search string
    # (a custom address delimiter needs a leading backslash: \#...#d)
    # The variable in sed must be outside the single quotes and in double quotes
    # Add $ after the file for end-of-line
    # delete the line with the file (\#searchstring#d)
    sed -i '\#/'"${file}"'$#d' /tmp/allfiles.tmp
done

echo "Review /tmp/allfiles.tmp before chmodding all these files"
echo "Delete the echo and exit when you are happy"
# Just an exit for testing
exit

# Using < avoids a call to cat
</tmp/allfiles.tmp xargs chmod -w
It might be easier to chmod -w all the files and then chmod +w the files in the csv. This is a little different from what you asked, since all files from the csv end up writable after this process; maybe you do not want that.
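A rough sketch of that alternative, reusing /tmp/allfiles.tmp from the script above (this uses o-w/o+w to match the chmod o-w goal in the question, assumes the csv's second column holds bare file names, and assumes GNU xargs for the -r option):
# Sketch: drop world-write from every file found, then restore it only for
# the files listed in the csv for this host.
</tmp/allfiles.tmp xargs chmod o-w
grep "^$(hostname)," workfile.csv | cut -d"," -f2 | while read -r name; do
    # re-enable world-write for entries from the full list that match this name
    grep "/${name}\$" /tmp/allfiles.tmp | xargs -r chmod o+w
done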

Listing files to an array then converting them to pipe-delimited text files

I have a couple of .dpff files. I want to do the following on Solaris 10 SPARC:
Listen/wait on the directory /cm/vic/digital/orcr/vic_export for the arrival of one or more .dpff files, then:
Remove the ^M characters in all the .dpff files
add the file path to the first column of all .dpff files
the file path is: /cm/vic/digital/orcr/vic_export
The .dpff files are currently tab-delimited, so I then want to convert them to pipe-delimited.
Lastly, rename each file with a timestamp, e.g. 20140415140648.txt
My code is below. I am not able to get the expected result.
Please advise.
#! /bin/bash
declare -a files
declare -a z
i=1
z=`ls *.dpff`
c=`ls *.dpff`|wc -l
echo "Start listening for the .dpff files"
while :;
do [ -f /cm/vic/digital/orcr/vic_export/*.dpff ]
sleep 60;
echo " Assing array with list of .dpff files"
for i in c
do
dirs[i]=$z
done
echo " Listing files"
for i in c
do
sed 's/^/\/cm\/vic\/digital\/orcr\/\vic_export\//' $files[i] > `date +"%Y%m%d%H%M%S"`.dpfff
tr '\t' '|' < $files[i] > t.txt
done
done
Hard to be sure what you're really asking: your description and your code don't really match up. Nevertheless, how's this?
#! /bin/bash
declare -a files files_with_timestamp
shopt -s nullglob   # so the glob below expands to an empty array when there are no matches

echo "Start listening for the .dpff files"
while :; do
    # Listen/wait on the directory /cm/vic/digital/orcr/vic_export for the
    # arrival of one or more .dpff files
    while :; do
        files=( /cm/vic/digital/orcr/vic_export/*.dpff )
        (( ${#files[@]} > 0 )) && break
        sleep 60
    done

    timestamp=$( date "+%Y%m%d%H%M%S" )
    for file in "${files[@]}"; do
        sed -i '
            # Remove the ^M character in all the .dpff files
            s/\r//g
            # add the file path to the first column of all .dpff files
            s#^#/cm/vic/digital/orcr/vic_export/#
            # .dpff files are currently tab delimited, so convert them
            # to pipe delimited
            s/\t/|/g
        ' "$file"
        newfile="${file%.dpff}.$timestamp.dpff"
        mv "$file" "$newfile"
        files_with_timestamp+=( "$newfile" )
    done

    echo ".dpff files converted:"
    printf "%s\n" "${files_with_timestamp[@]}"
done

Ubuntu terminal rename files (vlc*.png)

I need a lot of files in one of my programs, and to get them all loaded in one line of code I would like to rename all of the files I need.
It is over 100 files, so doing it manually isn't really an option.
The files are named vlc(random numbers).png and I want to rename them to vlc(incrementing number).png.
I already found how I can get all the needed files and rename them (see below), but I can't get an incrementing file number on the end. How can I get this?
clear; for f in vlc*.png; do echo $f ${f/vlc*/vlc}; done
Here is how you can do it, and it works:
Copy the code below these lines into a file named rename.sh; then all you need to do is make the script executable and run it using ./rename.sh
#!/bin/bash
#
# Author:
# rename script
# rename.sh
x=0
for filename in vlc*.png
do
    echo "$filename"
    x=$(( x + 1 ))
    echo "vlc$x.png"
    mv "$filename" "vlc$x.png"
done

Unique file names in a directory in Unix

I have a capture file in a directory into which some logs are being written:
word.cap
Now there is a script which, when the file's size reaches exactly 1.6 GB, clears it and prepares files in the below format in the same directory:
word.cap.COB2T_1389889231
word.cap.COB2T_1389958275
word.cap.COB2T_1390035286
word.cap.COB2T_1390132825
word.cap.COB2T_1390213719
Now I want to pick up all these files one by one in a script and perform some actions on them.
My script is:
today=`date +%d_%m_%y`
grep -E '^IPaddress|^Node' /var/rawcap/word.cap.COB2T* | awk '{print $3}' >> snmp$today.txt
sort -u snmp$today.txt > snmp_final_$today.txt
So, what should I write to pick up all file names of the above-mentioned format one by one? I will place this script in crontab, but I don't want to read the main word.cap file as that is being written to.
As per your comment:
Thanks, this is working but I have a small issue. There are
some files which are bzipped, i.e. word.cap.COB2T_1390213719.bz2, so I
don't want these files in the list; what should be done?
You could add a condition inside the loop:
for file in word.cap.COB2T*; do
    if [[ "$file" != *.bz2 ]]; then
        # Do something here
        echo "${file}"
    fi
done
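Putting that together with the pipeline from the question, a sketch of the full cron job might look like this (assuming the same /var/rawcap path and output file naming as in the question):
#!/bin/bash
# Sketch: collect values from each rotated capture file, skipping .bz2
# archives; the live word.cap file is never matched by this glob.
today=$(date +%d_%m_%y)
for file in /var/rawcap/word.cap.COB2T*; do
    if [[ "$file" != *.bz2 ]]; then
        grep -E '^IPaddress|^Node' "$file" | awk '{print $3}' >> "snmp$today.txt"
    fi
done
sort -u "snmp$today.txt" > "snmp_final_$today.txt"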