I have a dataset of many files. Each file contains many reviews of the following form, separated by a blank line:
<Author>bigBob
<Content>definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES
<Date>Jan 2, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>4
<Location>4
<Cleanliness>5
<Check in / front desk>4
<Service>3
<Business service>4
<Author>rickMN... next review goes on
For every review I need to extract the data after each tag and put it in something like this (which I plan to write to a .sql file, so that when I do ".read" it will populate my database):
INSERT INTO [HotelReviews] ([Author], [Content], [Date], [Image], [No_Reader], [No_Helpful], [Overall], [Value], [Rooms], [Location], [Cleanliness], [Check_In], [Service], [Business_Service]) VALUES ('bigBob', 'definitely above...', ...)
My question is how can I extract the data after each tag and put it in an insert statement using bash?
EDIT
Text after the <Content> tag is usually a paragraph spanning a number of lines.
This is the right approach for what you're trying to do:
$ cat tst.awk
NF {
    if ( match($0,/^<img\s+src="([^"]+)/,a) ) {
        name  = "Image"
        value = a[1]
    }
    else if ( match($0,/^<([^>"]+)>(.*)/,a) ) {
        name  = a[1]
        value = a[2]
        sub(/ \/.*|\./,"",name)
        gsub(/ /,"_",name)
    }
    names[++numNames] = name
    values[numNames]  = value
    next
}
{ prt() }
END { prt() }

function prt() {
    printf "INSERT INTO [HotelReviews] ("
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " [%s]", names[nameNr]
    }
    printf ") VALUES ("
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " \047%s\047", values[nameNr]
    }
    print ""
    numNames = 0
    delete names
    delete values
}
$ awk -f tst.awk file
INSERT INTO [HotelReviews] ( [Author] [Content] [Date] [Image] [No_Reader] [No_Helpful] [Overall] [Value] [Rooms] [Location] [Cleanliness] [Check_in] [Service] [Business_service]) VALUES ( 'bigBob' 'definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES' 'Jan 2, 2009' 'http://cdn.tripadvisor.com/img2/new.gif' '-1' '-1' '4' '4' '4' '4' '5' '4' '3' '4'
INSERT INTO [HotelReviews] ( [Author]) VALUES ( 'rickMN... next review goes on'
The above uses GNU awk for the 3rd arg to match(). Massage to get the precise formatting/output you want.
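If GNU awk isn't available, here is a minimal sketch of the same tag/value split for POSIX awk, using RSTART/RLENGTH and substr() instead of match()'s third argument (it skips the <img> special case and the name cleanup, which would need the same sub()/gsub() calls as above):

awk 'match($0, /^<[^>"]+>/) {
    name  = substr($0, 2, RLENGTH - 2)   # text between "<" and ">"
    value = substr($0, RLENGTH + 1)      # rest of the line after the tag
    print name "=" value
}' file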
Example:
#!/bin/bash

while IFS= read -r line; do
    [[ $line =~ ^\<Author\>(.*) ]] && Author="${BASH_REMATCH[1]}"
    [[ $line =~ ^\<Content\>(.*) ]] && Content="${BASH_REMATCH[1]}"
    # capture lines not starting with < and append to variable Content
    [[ $line =~ ^[^\<] ]] && Content+="$line"
    # match an empty line
    [[ $line =~ ^$ ]] && echo "${Author}, ${Content}"
done < file
Output with your file:
bigBob, definitely above average! we had a really nice stay there last year when I and ...
=~: match against a regex (string on the left, unquoted regex on the right)
^: match start of line
\< or \>: match a literal < or >
.*: here, match the rest of the line
(.*): capture the rest of the line into BASH_REMATCH[1] (element 0 holds the whole match)
See: The Stack Overflow Regular Expressions FAQ
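One gap in both snippets above: a value that itself contains a single quote (common in review text) will break the generated INSERT. A hypothetical helper (the name sqlquote is mine) that doubles embedded quotes, which is the standard SQL escape:

sqlquote() {
    local q="'" s=$1
    s=${s//$q/$q$q}      # double any embedded single quotes: ' -> ''
    printf "'%s'" "$s"
}

sqlquote "it's nice"     # prints 'it''s nice'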
Related
I have a text file with the following:
Paige
Buckley
Govan
Mayer
King
Harrison
Atkins
Reinhardt
Wilson
Vaughan
Sergovia
Tarrega
My goal is to create an array for each set of names, then iterate through the first array of values, then the second, and lastly the third. Each set is separated by a blank line in the text file. Help with code or logic is much appreciated!
So far I have the following; I am unsure of the logic moving forward when I reach a line break. My research also suggests that I can use readarray -d.
#!/bin/bash

my_array=()
while IFS= read -r line || [[ "$line" ]]; do
    if [[ $line -eq "" ]];
    .
    .
    .
    arr+=("$line") # i know this adds the value to the array
done < "$1"

printf '%s\n' "${my_array[@]}"
desired output:
array1 = (Paige Buckley Govan Mayer King)
array2 = (Harrison Atkins Reinhardt Wilson)
array3 = (Vaughan Sergovia Tarrega)
# then loop through each array, one after the other.
Bash has no arrays of arrays, so you have to represent the groups in another way.
You could keep the newlines and have an array of newline-separated elements:
array=()
elem=""
while IFS= read -r line; do
    if [[ "$line" != "" ]]; then
        elem+="${elem:+$'\n'}$line"   # accumulate lines in elem
    else
        array+=("$elem")              # flush elem as array element
        elem=""
    fi
done
if [[ -n "$elem" ]]; then
    array+=("$elem")                  # flush the last elem
fi

# iterate over array
for ((i=0;i<${#array[@]};++i)); do
    # each array element is newline separated items
    readarray -t elem <<<"${array[i]}"
    printf 'array%d = (%s)\n' "$i" "${elem[*]}"
done
You could simplify the loop by collapsing each blank-line separator to some unique character with sed, for example:
readarray -d '#' -t array < <(sed -z 's/\n\n/#/g' file)
But overall, this awk generates the same output:
awk -v RS= -v FS='\n' '{
    printf "array%d = (", NR;
    for (i=1;i<=NF;++i) printf "%s%s", $i, i==NF?"":" ";
    printf ")\n"
}'
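Run against the question's file (RS= puts awk in paragraph mode, so each blank-line-separated block is one record, numbered from 1), it should print:

array1 = (Paige Buckley Govan Mayer King)
array2 = (Harrison Atkins Reinhardt Wilson)
array3 = (Vaughan Sergovia Tarrega)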
Using a nameref:
#!/usr/bin/env bash
declare -a array1 array2 array3
declare -n array=array$((n=1))
while IFS= read -r line; do
    test "$line" = "" && declare -n array=array$((n=n+1)) || array+=("$line")
done < "$1"
declare -p array1 array2 array3
Called with:
bash test.sh data
# result
declare -a array1=([0]="Paige" [1]="Buckley" [2]="Govan" [3]="Mayer" [4]="King")
declare -a array2=([0]="Harrison" [1]="Atkins" [2]="Reinhardt" [3]="Wilson")
declare -a array3=([0]="Vaughan" [1]="Sergovia" [2]="Tarrega")
Assumptions:
blank lines are truly blank (ie, no need to worry about any white space on said lines)
could have consecutive blank lines
names could have embedded white space
the number of groups could vary and won't always be 3 (as with the sample data provided in the question)
OP is open to using a (simulated) 2-dimensional array as opposed to a (variable) number of 1-dimensional arrays
My data file:
$ cat names.dat
<<< leading blank lines
Paige
Buckley
Govan
Mayer
King Kong
<<< consecutive blank lines
Harrison
Atkins
Reinhardt
Wilson
Larry
Moe
Curly
Shemp
Vaughan
Sergovia
Tarrega
<<< trailing blank lines
One idea that uses a couple arrays:
array #1: associative array - the previously mentioned (simulated) 2-dimensional array with the index - [x,y] - where x is a unique identifier for a group of names and y is a unique identifier for a name within a group
array #2: 1-dimensional array to keep track of max(y) for each group x
Loading the arrays:
unset names max_y                   # make sure array names are not already in use
declare -A names                    # declare associative array
x=1                                 # init group counter
y=0                                 # init name counter
max_y=()                            # initialize the max(y) array
inc=                                # clear increment flag

while read -r name
do
    if [[ "${name}" = '' ]]         # if we found a blank line ...
    then
        [[ "${y}" -eq 0 ]] &&       # if this is a leading blank line then ...
        continue                    # ignore and skip to the next line
        inc=y                       # set flag to increment 'x'
    else
        [[ "${inc}" = 'y' ]] &&     # if increment flag is set ...
        max_y[${x}]="${y}" &&       # make note of max(y) for this 'x'
        ((x++)) &&                  # increment 'x' (group counter)
        y=0 &&                      # reset 'y'
        inc=                        # clear increment flag

        ((y++))                     # increment 'y' (name counter)
        names[${x},${y}]="${name}"  # save the name
    fi
done < names.dat

max_y[${x}]="${y}"                  # make note of the last max(y) value
Contents of the array:
$ typeset -p names
declare -A names=([1,5]="King Kong" [1,4]="Mayer" [1,1]="Paige" [1,3]="Govan" [1,2]="Buckley" [3,4]="Shemp" [3,3]="Curly" [3,2]="Moe" [3,1]="Larry" [2,4]="Wilson" [2,2]="Atkins" [2,3]="Reinhardt" [2,1]="Harrison" [4,1]="Vaughan" [4,2]="Sergovia" [4,3]="Tarrega" )
$ for (( i=1; i<=${x}; i++ ))
do
    for (( j=1; j<=${max_y[${i}]}; j++ ))
    do
        echo "names[${i},${j}] : ${names[${i},${j}]}"
    done
    echo ""
done
names[1,1] : Paige
names[1,2] : Buckley
names[1,3] : Govan
names[1,4] : Mayer
names[1,5] : King Kong
names[2,1] : Harrison
names[2,2] : Atkins
names[2,3] : Reinhardt
names[2,4] : Wilson
names[3,1] : Larry
names[3,2] : Moe
names[3,3] : Curly
names[3,4] : Shemp
names[4,1] : Vaughan
names[4,2] : Sergovia
names[4,3] : Tarrega
I have an issue and no idea which solution is better to use, so I will describe my need first:
I have a text file, titled with a server name, for example: server1.txt
Inside this file I have something like an array (basically it's a check result):
cciex6:~/folder$
cciex6:~/folder$ cat server1
8 16 UP 10-6-2020-16:04
8 16 UP 15-6-2020-16:04
8 16 UP 20-6-2020-16:04
6 16 UP 25-6-2020-16:04
cciex6:~/folder$
8 : means 8 cores
16 : means 16 Gb of ram
UP : means ping successful, server is up
20-1-2020-16:04 : date of check and creation of the result.
I have some server IPs in a servers.txt, and 2 scripts:
the first script does a check for each server in servers.txt (CPU cores, memory, up or down, and date of the check) and stores this info in a server1 file
the second script should read the lines from the server1 file and compare the new result (last line in the file) with the old result (the line before it); if the results are the same, there is nothing to do, but if memory increases or the server is DOWN, it should give me an alert in a log file or console or something.
The first script is done; I'm able to run it and get the information I need from the list of servers.
My issue is the second script: how can I read the RAM + CPU + UP/DOWN status from the server1 file and compare the last result with the one before it?
Any ideas or help would be really appreciated.
Take the last two lines of the file and compare them, something like:
#!/usr/bin/env bash

file=server1.txt

declare -A old_result new_result

mapfile -t results < <(tail -n2 "$file")

read -r old_core old_ram old_status old_date <<< "${results[0]}"
read -r new_core new_ram new_status new_date <<< "${results[1]}"

old_result=(
    [core]="$old_core"
    [ram]="$old_ram"
    [status]="$old_status"
    [date]="$old_date"
)

new_result=(
    [core]="$new_core"
    [ram]="$new_ram"
    [status]="$new_status"
    [date]="$new_date"
)

for item in "${!old_result[@]}"; do
    [[ $item == date ]] && continue
    if [[ ${old_result[$item]} != "${new_result[$item]}" ]]; then
        printf >&2 'old %s and new %s do not match!\n' "$item ${old_result[$item]}" "$item ${new_result[$item]}"
    fi
done
It needs bash 4+ because of mapfile and associative arrays.
It skips the date; if you do want to compare the date as well, just remove the line that skips it.
Just using an associative array without mapfile.
#!/usr/bin/env bash

file=server1.txt

declare -A old_result new_result

{
    read -r old_core old_ram old_status old_date
    old_result=(
        [core]="$old_core"
        [ram]="$old_ram"
        [status]="$old_status"
        [date]="$old_date"
    )

    read -r new_core new_ram new_status new_date
    new_result=(
        [core]="$new_core"
        [ram]="$new_ram"
        [status]="$new_status"
        [date]="$new_date"
    )
} < <(tail -n2 "$file")

for item in "${!old_result[@]}"; do
    [[ $item == date ]] && continue
    if [[ ${old_result[$item]} != "${new_result[$item]}" ]]; then
        printf >&2 'old %s and new %s do not match!\n' "$item ${old_result[$item]}" "$item ${new_result[$item]}"
    fi
done
Without mapfile and associative array, just using a while read loop.
#!/usr/bin/env bash

file=server1.txt

while read -r old_core old_ram old_status old_date; do
    read -r new_core new_ram new_status new_date
    if [[ $old_core != "$new_core" ]]; then
        printf >&2 'old core %s and new core %s do not match!\n' "$old_core" "$new_core"
    fi
    if [[ $old_ram != "$new_ram" ]]; then
        printf >&2 'old ram %s and new ram %s do not match!\n' "$old_ram" "$new_ram"
    fi
    if [[ $old_status != "$new_status" ]]; then
        printf >&2 'old status %s and new status %s do not match!\n' "$old_status" "$new_status"
    fi
done < <(tail -n2 "$file")
should give me alert in a log file or console or something.
Pipe the script to wall, which should work even when invoked via cron; it also has some options that you can use.
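For example, a sketch (the script filename here is illustrative; the mismatch messages go to stderr, hence the redirect):

./compare_results.sh 2>&1 | wall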
It seems as if all you need to do is compare the last two lines. Here's an awk script that does this:
awk '{
    line_prev = line_last
    $4 = ""                 # drop the date field before comparing
    line_last = $0
}
END {
    if (line_prev && line_prev != line_last || line_last ~ /DOWN/) {
        print "Alert!", FILENAME, line_prev, "--->", line_last
    }
}' "$1"
The script, let's call it ./second_script, can be invoked like this:
./second_script server1
You can pipe the output to syslog so that you can receive an alert.
./second_script server1 | logger "<add-your-favorite-syslog-parameters here>"
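For instance, one plausible set of options (logger reads stdin when given no message argument; -t tags the entries and -p sets the priority):

./second_script server1 | logger -t server-check -p user.warning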
I'd like to either process one row of a csv file or the whole file.
The variables are set by the header row, which may be in any order.
There may be up to 12 columns, but only 3 or 4 variables are needed.
The source files might be in either format, and all I want from both is lastname and country. I know of many different ways and tools to do it if the columns were fixed and always in the same order. But they're not.
examplesource.csv:
firstname,lastname,country
Linus,Torvalds,Finland
Linus,van Pelt,USA
examplesource2.csv:
lastname,age,country
Torvalds,66,Finland
van Pelt,7,USA
I have cobbled together something from various Stackoverflow postings which looks a bit voodoo but seems fairly robust. I say "voodoo" because shellcheck complains that, for example, "firstname is referenced but not assigned". And yet it prints it.
#!/bin/bash

# set the field separator to newline
IFS=$'\n'

# split/transpose the first-line column titles to rows
COLUMNAMES=$(head -n1 examplesource.csv | tr ',' '\n')

# set an array and read the columns into it
columns=()
for line in $COLUMNAMES; do
    columns+=("$line")
done

# reset the field separator
IFS=","

# using -p here to debug in output
declare -ap columns

# read from line 2 onwards
sed 1d examplesource.csv | while read "${columns[@]}"; do
    echo "${firstname} ${lastname} is from ${country}"
done
In the case of looping through everything, it works perfectly for my needs and I can process within the "while read" loop. But to make it cleaner, I'd rather pass the current element(?) to an external function to process (not just echo).
And if I only wanted the array (current row) belonging to "Torvalds", I cannot find how to access that or even get its current index, eg: "if $wantedname && $lastname == $wantedname then call function with currentrow only otherwise loop all rows and call function".
I know there aren't multidimensional associative arrays in bash from reading
Multidimensional associative arrays in Bash and I've tried to understand arrays from
https://opensource.com/article/18/5/you-dont-know-bash-intro-bash-arrays
Is it clear what I'm trying to achieve in a bash-only manner and does the question make sense?
Many thanks.
Let's shorten your script. Don't read the source twice (first with head, then with sed) when you can do it in one pass. Also, the whole array-reading loop can be shortened to just IFS=',' COLUMNAMES=($(head -n1 source.csv)). Here's a shorter version:
#!/bin/bash

cat examplesource.csv |
{
    IFS=',' read -r -a columnnames
    while IFS=',' read -r "${columnnames[@]}"; do
        echo "${firstname} ${lastname} is from ${country}"
    done
}
If you want to parse both files at the same time, i.e. join them, nothing simpler ;). First, let's number the lines in the first file using nl -w1 -s,. Then we use join to join the files on the people's names. Remember that join input needs to be sorted on the join fields. Then we sort the output with sort, using the line numbers from the first file. After that we can read all the data just like before:
# join the files, using `,` as the separator,
# on the 3rd field from the first file and the 1st field from the second file;
# the output should be first the fields from the first file, then the second file
# (the country, field 1.4, is duplicated in 2.3, so just omitting it)
join -t, -13 -21 -o 1.1,1.2,1.3,2.2,2.3 <(
    # number the lines in the first file
    <examplesource.csv nl -w1 -s, |
    # there is one field more, sort using the 3rd field
    sort -t, -k3
) <(
    # sort the second file using the first field
    <examplesource2.csv sort -t, -k1
) |
# sort the output using the numbers from the first file
sort -t, -k1 -n |
# well, remove the numbers
cut -d, -f2- |
# just a normal read follows
{
    # read the headers
    IFS=, read -r -a names
    while IFS=, read -r "${names[@]}"; do
        # finally our output!
        echo "${firstname} ${lastname} is from ${country} and is so many ${age} years old!"
    done
}
Tested on tutorialspoint.
GNU Awk has multidimensional arrays. It also has array sorting mechanisms, which I have not used here. Please comment if you are interested in pursuing this solution further. The following depends on consistent key names and line numbers across input files, but can handle an arbitrary number of fields and input files.
$ gawk -V |gawk NR==1
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.2)
$ gawk -F, '
FNR == 1 {for(f=1;f<=NF;f++) Key[f]=$f}
FNR != 1 {for(f=1;f<=NF;f++) People[FNR][Key[f]]=$f}
END {
    for(Person in People) {
        for(attribute in People[Person])
            output = output FS People[Person][attribute]
        print substr(output,2)
        output=""
    }
}
' file*
66,Finland,Linus,Torvalds
7,USA,Linus,van Pelt
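The array-sorting mechanism mentioned above is gawk's PROCINFO["sorted_in"]. A minimal sketch, assuming the same input files, that pins the traversal order of both for (... in ...) loops so rows and attributes come out in sorted index order rather than hash order (@ind_num_asc would sort the row numbers numerically instead):

$ gawk -F, '
FNR == 1 {for(f=1;f<=NF;f++) Key[f]=$f}
FNR != 1 {for(f=1;f<=NF;f++) People[FNR][Key[f]]=$f}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # visit indices in ascending string order
    for(Person in People) {
        for(attribute in People[Person])
            output = output FS People[Person][attribute]
        print substr(output,2)
        output=""
    }
}
' file*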
A bash solution takes a bit more work than an awk solution, but if this is an exercise in what bash provides, bash has all you need: determine which column holds the last name from the first line of input, then output that field from the remaining lines.
An easy approach is to read each line into a normal array, loop over the elements of the first line to locate the column "lastname" appears in, and save that column index in a variable. You can then read each of the remaining lines the same way and output the lastname field at the saved index.
A short example would be:
#!/bin/bash

col=0   ## column index for lastname
cnt=0   ## line count

while IFS=',' read -a arr; do                       ## read each line into array
    if [ "$cnt" -eq '0' ]; then                     ## test if line-count is zero
        for ((i = 0; i < "${#arr[@]}"; i++)); do    ## loop for lastname
            [ "${arr[i]}" = 'lastname' ] &&         ## test for lastname
            { col=i; break; }                       ## if found, set col=i and break loop
        done
    fi
    [ "$cnt" -gt '0' ] &&                           ## if not header row
    echo "line $cnt lastname: ${arr[col]}"          ## output lastname field
    ((cnt++))                                       ## increment line count
done < "$1"
Example Use/Output
Using your two files data files, the output would be:
$ bash readcsv.sh ex1.csv
line 1 lastname: Torvalds
line 2 lastname: van Pelt
$ bash readcsv.sh ex2.csv
line 1 lastname: Torvalds
line 2 lastname: van Pelt
A similar implementation using awk would be:
awk -F, -v col=1 '
NR == 1 {
for (i in FN) {
if (i = "lastname") next
}
col++
}
NR > 1 {
print "lastname: ", $col
}
' ex1.csv
Example Use/Output
$ awk -F, 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "lastname") { col = i; break } } NR > 1 { print "lastname: ", $col }' ex1.csv
lastname: Torvalds
lastname: van Pelt
(output is the same for either file)
Thank you all. I've taken a couple of bits from two answers: I used the answer from David to find the number of the row, then the elegantly simple solution from Kamil to loop through what I need.
The result is exactly what I wanted. Thank you all.
$ readexample.sh examplesource.csv "Torvalds"
Everyone
Linus Torvalds is from Finland
Linus van Pelt is from USA
now just Torvalds
Linus Torvalds is from Finland
And this is the code - now that you know what I want it to do, if anyone can see any dangers or improvements, please let me know as I'm always learning. Thanks.
#!/bin/bash

FILENAME="$1"
WANTED="$2"

printDetails() {
    SINGLEROW="$1"
    [[ ! -z "$SINGLEROW" ]] && opt=("--expression" "1p" "--expression" "${SINGLEROW}p") || opt=("--expression" "1p" "--expression" "2,199p")
    sed -n "${opt[@]}" "$FILENAME" |
        {
            IFS=',' read -r -a columnnames
            while IFS=',' read -r "${columnnames[@]}"; do
                echo "${firstname} ${lastname} is from ${country}"
            done
        }
}

findRow() {
    col=0   ## column index for lastname
    cnt=0   ## line count
    while IFS=',' read -a arr; do                       ## read each line into array
        if [ "$cnt" -eq '0' ]; then                     ## test if line-count is zero
            for ((i = 0; i < "${#arr[@]}"; i++)); do    ## loop for lastname
                [ "${arr[i]}" = 'lastname' ] &&         ## test for lastname
                {
                    col=i
                    break
                }                                       ## if found, set col=i and break loop
            done
        fi
        [ "$cnt" -gt '0' ] &&                           ## if not header row
            if [ "${arr[col]}" == "$1" ]; then
                echo "$cnt"                             ## output the matching row number
            fi
        ((cnt++))                                       ## increment line count
    done <"$FILENAME"
}

echo "Everyone"
printDetails

if [ ! -z "${WANTED}" ]; then
    echo -e "\nnow just ${WANTED}"
    row=$(findRow "${WANTED}")
    printDetails "$((row + 1))"
fi
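One hazard worth noting in printDetails(): the fallback range 2,199p silently drops any rows past line 199. sed accepts $ as an address for the last line, so the cap can be removed:

opt=("--expression" "1p" "--expression" '2,$p')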
The structure of my input files is as follows:
<string1> <string2> <stringN>
hello nice world
one three
NOTE: the second row has a tab/null in the second column, so the second column of the second row is empty, not 'three'.
In bash, I want to loop through each row and also be able to process each individual column (string[1-N])
I am able to iterate to each row:
#!/bin/bash

while IFS='' read -r line || [[ -n "$line" ]]; do
    line=${line/$/'\t'/,}
    read -r -a columns <<< "$line"
    echo "current Row: $line"
    echo "column[1]: '${columns[1]}'"
    #echo "column[N] '${columns[N]}'"
done < "${1}"
Expected result:
current Row: hello,nice,world
column[1]: 'nice'
current Row: one,,three
column[1]: ''
Basically what I do is iterate through the input file (here passed as an argument) and do all the "cleaning", like preventing whitespace from being trimmed, ignoring backslashes, and also considering the last line.
Then I replace the tabs '\t' with commas,
and finally read the line into an array (columns) to be able to select a particular column.
The input file has tabs as the separator value, so I tried to convert it to csv format. I am not sure if the regex I use is correct in bash, or if something else is wrong, because this does not return a value in the array.
Thanks
You are almost there; a little fix on translating '\t' to commas, and you also have to set IFS to the comma.
try this:
#!/bin/bash

while IFS='' read -r line || [[ -n "$line" ]]; do
    line=${line//$'\t'/,}
    IFS=',' read -r -a columns <<< "$line"
    #echo "current Row: $line"
    echo "column[0]:'${columns[0]}' column[1]:'${columns[1]}' column[2]:'${columns[2]}'"
done < "${1}"
run:
$> <the_script> <the_file>
Outputs:
column[0]:'hello' column[1]:'nice' column[2]:'world '
column[0]:'one' column[1]:'' column[2]:'three'
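Why translate the tabs at all? Tab is an IFS whitespace character, so splitting on IFS=$'\t' directly would collapse the run of tabs in row two and silently drop the empty column; a non-whitespace separator like the comma preserves empty fields. A quick demonstration:

line=$'one\t\tthree'
IFS=$'\t' read -r -a a <<< "$line"
echo "${#a[@]}"   # 2 -- the empty field is lost
IFS=',' read -r -a b <<< "${line//$'\t'/,}"
echo "${#b[@]}"   # 3 -- the empty field is kept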
So what I'm trying to do in my code is basically read in a spreadsheet that has this format
username, lastname, firstname, x1, x2, x3, x4
user1, dudette, mary, 7, 2, 4
user2, dude, john, 6, 2, 4,
user3, dudest, rad,
user4, dudaa, pad, 3, 3, 5, 9
basically, it has usernames, the names those usernames correspond to, and values for each x. What I want to do is read this in from a csv file and then find all of the blank spaces and fill them in with 5s. My approach to doing this was to read in the whole array and then substitute all null spaces with 5s. This is the code so far...
#!/bin/bash
while IFS=$'\t' read -r -a myarray
do
echo $myarray
done < something.csv
for e in ${myarray[#]
do
echo 'Can you see me #1?'
if [[-z $e]]
echo 'Can you see me #2?'
sed 's//0'
fi
done
The code isn't really changing my csv file at all. EDITED NOTE: the data is all comma separated.
What I've figured out so far:
Okay, the 'Can you see me' and the echo myarray are test code. I wanted to see if the whole csv file was being read in from echo myarray (which according to the output of the code seems to be the case). It doesn't seem, however, that the code is running through the for loop at all...which I can't seem to understand.
Help is much appreciated! :)
The format of your .csv file is not comma separated, it's left aligned with a non-constant number of whitespace characters separating each field. This makes it difficult to be accurate when trying to find and replace empty columns which are followed by non-empty columns.
Here is a Bash only solution that would be entirely accurate if the fields were comma separated.
#!/bin/bash

n=5

while IFS=, read username lastname firstname x1 x2 x3 x4; do
    ! [[ $x1 ]] && x1=$n
    ! [[ $x2 ]] && x2=$n
    ! [[ $x3 ]] && x3=$n
    ! [[ $x4 ]] && x4=$n
    echo $username,$lastname,$firstname,$x1,$x2,$x3,$x4
done < something.csv > newfile.csv && mv newfile.csv something.csv
Output:
username,lastname,firstname,x1,x2,x3,x4
user1,dudette,mary,7,2,5,4
user2,dude,john,6,2,4,5
user3,dudest,rad,5,5,5,5
user4,dudaa,pad,3,3,5,9
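If the file really is comma-plus-space separated, as in the question's sample, a sketch of the same loop can split on both at once (read treats the comma as a hard delimiter and the surrounding spaces as collapsible whitespace) and use ${var:-5} default expansion in place of the four tests:

while IFS=', ' read -r username lastname firstname x1 x2 x3 x4; do
    echo "$username,$lastname,$firstname,${x1:-5},${x2:-5},${x3:-5},${x4:-5}"
done < something.csv > newfile.csv && mv newfile.csv something.csv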
I realize you asked for bash, but if you don't mind perl in lieu of bash, perl is a great tool for record-oriented files.
#!/usr/bin/perl

open (FILE, 'something.csv');
open (OUTFILE, '>outdata.txt');

while (<FILE>) {
    chomp;
    ($username, $lastname, $firstname, $x1, $x2, $x3, $x4) = split("\t");
    $x1 = 5 if $x1 eq "";
    $x2 = 5 if $x2 eq "";
    $x3 = 5 if $x3 eq "";
    $x4 = 5 if $x4 eq "";
    print OUTFILE "$username\t$lastname\t$firstname\t$x1\t$x2\t$x3\t$x4\n";
}

close (FILE);
close (OUTFILE);
exit;
This reads your infile, something.csv which is assumed to have tab-separated fields, and writes a new file outdata.txt with the re-written records.
I'm sure there's a better or more idiomatic solution, but this works:
#!/bin/bash

infile=bashcsv.csv   # Input filename
declare -i i         # Iteration variable
declare -i defval=5  # Default value for missing cells
declare -i n_cells=7 # Total number of cells per line
declare -i i_start=3 # Starting index for numeric cells
declare -a cells     # Array variable for cells

# We'd usually save/restore the old value of IFS, but there's no need here:
IFS=','

# Convenience function to bail/bug out on error:
bail () {
    echo "$@" >&2
    exit 1
}

# Strip whitespace and replace empty cells with `$defval`:
sed -s 's/[[:space:]]//g' $infile | while read -a cells; do
    # Skip empty/malformed lines:
    if [ ${#cells[*]} -lt $i_start ]; then
        continue
    fi
    # If there are fewer cells than $n_cells, pad to $n_cells
    # with $defval; if there are more, bail:
    if [ ${#cells[*]} -lt $n_cells ]; then
        for ((i=${#cells[*]}; $i<$n_cells; i++)); do
            cells[$i]=$defval
        done
    elif [ ${#cells[*]} -gt $n_cells ]; then
        bail "Too many cells."
    fi
    # Replace empty cells with default value:
    for ((i=$i_start; $i<$n_cells; i++)); do
        if [ -z "${cells[$i]}" ]; then
            cells[$i]=$defval
        fi
    done
    # Print out whole line, interpolating commas back in:
    echo "${cells[*]}"
done
Here's a gratuitous awk one-liner that gets the job done:
awk -F'[[:space:]]*,[[:space:]]*' 'BEGIN{OFS=","} /,/ {NF=7; for(i=4;i<=7;i++) if($i=="") $i=5; print}' infile.csv
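The same command unrolled, with comments (behavior unchanged):

awk -F'[[:space:]]*,[[:space:]]*' '   # split fields on commas plus surrounding whitespace
BEGIN { OFS = "," }                   # rebuild records with bare commas
/,/ {                                 # only process lines that contain commas
    NF = 7                            # pad the record out to 7 fields
    for (i = 4; i <= 7; i++)          # x1..x4 live in fields 4-7
        if ($i == "") $i = 5          # fill each empty cell with 5
    print
}' infile.csv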