Reduce processing time for 'While read' loop - arrays

I'm new to shell scripting.
I have a huge CSV file with a variable-length field 11 (f11), like
"000000aaad000000bhb200000uwwed..."
"000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew..."
.
.
After splitting the string into chunks of 10 characters, I need characters 6-9 of each chunk; then I have to join them back using the delimiter '|', like
0aaa|0bhb|uwwe...
0aba|bbrb|0wwq|caba|0bhb|0qwe...
and then join the processed f11 with the other fields.
This is the time taken for processing 10K records:
real 4m43.506s
user 0m12.366s
sys 0m12.131s
20K records:
real 5m20.244s
user 2m21.591s
sys 3m20.042s
80K records (around 3.7 million f11 chunks split and merged with '|'):
real 21m18.854s
user 9m41.944s
sys 13m29.019s
My target is 30 minutes for processing 650K records (around 56 million f11 chunks split and merged). Is there any way to optimize this?
while read -r line1; do
f10=$( echo $line1 | cut -d',' -f1,2,3,4,5,7,9,10)
echo $f10 >> $path/other_fields
f11=$( echo $line1 | cut -d',' -f11 )
f11_trim=$(echo "$f11" | tr -d '"')
echo $f11_trim | fold -w10 > $path/f11_extract
cat $path/f11_extract | awk '{print $1}' | cut -c6-9 >> $path/str_list_trim
arr=($(cat $path/str_list_trim))
printf "%s|" ${arr[#]} >> $path/str_list_serialized
printf '\n' >> $path/str_list_serialized
arr=()
rm $path/f11_extract
rm $path/str_list_trim
done < $input
sed -i 's/.$//' $path/str_list_serialized
sed -i 's/\(.*\)/"\1"/g' $path/str_list_serialized
paste -d "," $path/other_fields $path/str_list_serialized > $path/final_out

Your code is not time-efficient because it:
invokes multiple commands, including awk, within the loop;
generates many intermediate temporary files.
You can do the whole job with awk alone:
awk -F, -v OFS="," ' # assign input/output field separator to a comma
{
len = length($11) # length of the 11th field
s = ""; d = "" # clear output string and the delimiter
for (i = 1; i <= len / 10; i++) { # iterate over the 11th field
s = s d substr($11, (i - 1) * 10 + 6, 4) # concatenate 6-9th substring of 10 characters long chunks
d = "|" # set the delimiter to a pipe character
}
$11 = "\"" s "\"" # assign the 11th field to the generated string
} 1' "$input" # the final "1" tells awk to print all fields
Example of the input:
1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew
Output:
1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"

Related

How to split string to array with specific word in bash

I have a string after I do a command:
[username#hostname ~/script]$ gsql ls | grep "Graph graph_name"
- Graph graph_name(Vertice_1:v, Vertice_2:v, Vertice_3:v, Vertice_4:v, Edge_1:e, Edge_2:e, Edge_3:e, Edge_4:e, Edge_5:e)
Then I do
IFS=", " read -r -a vertices <<< "$(gsql use graph ifgl ls | grep "Graph ifgl(" | cut -d "(" -f2 | cut -d ")" -f1)" to make the string splitted and append to array. But, what I want is to split it by delimiter ", " then append each word that contain ":v" to an array, its mean word that contain ":e" will excluded.
How to do it? without do a looping
Like this, using grep
mapfile -t array < <(gsql ls | grep "Graph graph_name" | grep -oP '\b\w+:v')
The regular expression matches as follows:
\b      the boundary between a word character (\w) and something that is not a word character
\w+     word characters (a-z, A-Z, 0-9, _), one or more times, matching as much as possible
:v      the literal ':v'
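As a quick check without gsql, the same grep can be fed the sample line from the question through a here-string (just a sketch; the line variable is introduced here only for illustration):
line='- Graph graph_name(Vertice_1:v, Vertice_2:v, Vertice_3:v, Vertice_4:v, Edge_1:e, Edge_2:e, Edge_3:e, Edge_4:e, Edge_5:e)'
mapfile -t array < <(grep -oP '\b\w+:v' <<< "$line")
declare -p array # should list only the four Vertice_*:v entries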
This bash script should work:
# declare arr as an array variable
arr=()
# use ", " as delimiter to parse the input fed through process substituion
while read -r -d ', ' val || [[ -n $val ]]; do
val="${val%)}"
val="${val#*\(}"
[[ $val == *:v ]] && arr+=("$val")
done < <(gsql ls | grep "Graph graph_name")
# check array content
declare -p arr
Output:
declare -a arr='([0]="Vertice_1:v" [1]="Vertice_2:v" [2]="Vertice_3:v" [3]="Vertice_4:v")'
Since there is a condition on each element, the logical way is to use a loop. There may be ways to do it without one, but here is a solution with a for loop:
#!/bin/bash
input="Vertice_1:v, Vertice_2:v, Vertice_3:v, Vertice_4:v, Edge_1:e, Edge_2:e, Edge_3:e, Edge_4:e, Edge_5:e"
input="${input//,/ }" #replace , with SPACE (bash array uses space as separator)
inputarray=($input)
outputarray=()
for item in "${inputarray[#]}"; do
if [[ $item =~ ":v" ]]; then
outputarray+=($item) #append the item to the output array
fi
done
echo "${outputarray[#]}"
will give output: Vertice_1:v Vertice_2:v Vertice_3:v Vertice_4:v
Since the elements don't have spaces in them, this works.
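For reference, here is one loop-free variant of the same idea, sketched under the assumption that the elements contain no embedded whitespace: print the elements one per line and let grep do the filtering.
inputarray=(Vertice_1:v Vertice_2:v Vertice_3:v Vertice_4:v Edge_1:e Edge_2:e)
mapfile -t outputarray < <(printf '%s\n' "${inputarray[@]}" | grep ':v$')
echo "${outputarray[@]}" # Vertice_1:v Vertice_2:v Vertice_3:v Vertice_4:v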

Shell script which reads each line of a csv file and counts the number of each column

I want a script that reads each row of a CSV file called sample.csv and counts the number of fields in each row. If the number is more than a threshold (here, 14), it stores either the whole line or just two fields of that line in another file (Hello.bsd). The script I wrote is below:
while read -r line
do
echo "$line" > tmp.kk
count= $(awk -F, '{ print NF; exit }' ~/tmp.kk)
if [ "$count" -gt 14 ]; then
field1=$(echo "$line" | awk -F',' '{printf "%s", $1}' | tr -d ',')
field2=$(echo "$line" | awk -F',' '{printf "%s", $2}' | tr -d ',')
echo "$field1 $field2" >> Hello.bsd
fi
done < ~/sample.csv
There is no output from the above code.
I would be so grateful if you could help me in this regard.
Best regards,
sina
FOR JUST FIRST 2 FIELDS
< sample.csv mawk 'NF=(_=(+__<NF))+_' FS=',' __="14" # enter constant or shell variable
SAMPLE INPUT
echo "${a}"
04z,Y7N,=TT,WLq,n54,cb8,qfy,LLG,ria,hIQ,Mmd,8N2,FK=,7a9,
us6,ck6,LvI,tnY,CQm,wBp,gPH,8ly,JAH,Phv,uwm,x1r,MF1,ide,
03I,GEs,Mok,BxK,z2D,IUH,VWn,Zb7,TkP,Ddt,RE9,mv2,XyD,tr5,
A2t,u0z,MLi,3RF,es1,goz,G0S,l=h,8Ka,coN,vHP,snk,tTV,xNF,
RiU,yBI,QrS,N6D,fWG,oOr,CwZ,9lb,f8h,g5I,c1u,D3X,kOo,lKG,
CSj,da4,Y54,S7R,AEj,Vqx,Fem,sqn,l4Z,YEA,OKe,6Bu,0xU,hGc,
1X8,jUD,XZM,pMc,Q6V,piz,6jp,SJp,E3W,zgJ,BuW,5wd,qVg,wBy,
TQC,O9k,RJ9,fie,2AV,XZ4,meR,tEC,U7v,JWH,LTs,ngF,3A3,ZPa,
ONJ,Phw,jrp,UvY,9Kb,qxf,57f,yHo,a0Q,2S=,=Ob,l1b,XjC
echo "${a}" | mawk 'NF=(_=(+__<NF))+_' FS=',' __="14"
04z Y7N
us6 ck6
03I GEs
A2t u0z
RiU yBI
CSj da4
1X8 jUD
TQC O9k
Note that the last line didn't print because it didn't meet the NF threshold.
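For readers who find the one-liner cryptic: _ becomes 1 when NF exceeds __ (here 14) and 0 otherwise, so NF is set to 2 (keep the first two fields and print the line) or 0 (suppress it). A more explicit, roughly equivalent sketch using the question's file names:
# keep the first two fields of rows that have more than 14 fields
awk -F, 'NF > 14 { print $1, $2 }' sample.csv > Hello.bsd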

How to perform string operations on an input string and get the desired output

I have to write a shell script that accepts the product order details as string input and displays the total order amount and the number of orders under each category.
INPUT:
101:Redmi:Mobile:15000#102:Samsung:TV:20000#103:OnePlus:Mobile:35000#104:HP:Laptop:65000#105:Samsung:Mobile:10000#106:Samsung:TV:30000
OUTPUT:
Mobile:60000:3#TV:50000:2#Laptop:65000:1
I have to achieve this using only the sort, tr, cut, and grep commands; no sed or awk should be used.
Here is my solution. It can be improved; I will explain it:
x="101:Redmi:Mobile:15000#102:Samsung:TV:20000#103:OnePlus:Mobile:35000#104:HP:Laptop:65000#105:Samsung:Mobile:10000#106:Samsung:TV:30000"
echo $x > origin.txt
cat origin.txt | grep -E -o ':[a-Z]+:[0-9]+' | cut -c 2- > temp.txt
categories=()
quantity=()
items=()
while IFS= read -r line
do
    category=$(echo $line | grep -E -o '[a-Z]+')
    amount=($(( $(echo $line | grep -E -o '[0-9]+') )))
    if [ "0" = "${#categories[@]}" ]
    then
        # Add new element at the end of the array
        categories+=( $category )
        quantity+=( $amount )
        items+=(1)
    else
        let in=0
        let index=0
        let i=0
        # Iterate over the array to look for the incoming category
        for value in "${categories[@]}"
        do
            if [ $category = $value ]
            then
                let in=$in+1
                let index=$i
            fi
            let i=$i+1
        done
        if [ $in = 0 ]
        then
            categories+=( $category )
            quantity+=( $amount )
            items+=(1)
        else
            let sum=$amount+${quantity[$index]}
            quantity[$index]=$sum
            let newitems=${items[$index]}+1
            items[$index]=$newitems
        fi
    fi
done < temp.txt
let j=0
for value in "${categories[@]}"
do
    echo -n $value
    echo -n ':'
    echo -n ${quantity[$j]}
    echo -n ':'
    echo -n ${items[$j]}
    let k=$j+1
    if [ $k != ${#categories[@]} ]
    then
        echo -n '#'
    fi
    let j=$j+1
done
First I save the string in x and later in origin.txt; with cat, grep, and cut I get the string in this format inside temp.txt:
Mobile:15000
TV:20000
Mobile:35000
Laptop:65000
Mobile:10000
TV:30000
I create three arrays: categories to store the category names, quantity to store the amount for each category, and items to count the objects in each category.
Then I read temp.txt line by line, and with a simple regex I get the category name and the amount on that line.
amount=($(( $(echo $line | grep -E -o '[0-9]+') )))
This is used to convert the string into an int.
The first if checks whether the array is empty; if it is, we append the category, the amount, and 1 object.
If it is not empty, we need to declare three variables:
in, to check whether the incoming category is already in the array
index, to store the index of the existing category
i, to count the loop iterations
If the category already exists in the array, $in is set to 1 or higher.
If not, $in keeps its default value of 0.
Next, if in is equal to 0, the category is new and we append it to the three arrays.
If in is not equal to 0, we use the index to add the new amount to the stored amount.
Finally, when the main loop finishes, I print the data in the format you mentioned.
The variable k is there to avoid printing the last "#"; we compare k with the size of the array. Note that arrays start at index 0.
The input is:
x="101:Redmi:Mobile:15000#102:Samsung:TV:20000#103:OnePlus:Mobile:35000#104:HP:Laptop:65000#105:Samsung:Mobile:10000#106:Samsung:TV:30000"
First, split records on lines:
tr '#' '\n' <<< "$x"
This gives:
101:Redmi:Mobile:15000
102:Samsung:TV:20000
103:OnePlus:Mobile:35000
104:HP:Laptop:65000
105:Samsung:Mobile:10000
106:Samsung:TV:30000
Better to keep only relevant data and store it in a variable:
data="$(tr '#' '\n' <<< "$x" | cut -d: -f3,4)"
echo "$data"
This gives:
Mobile:15000
TV:20000
Mobile:35000
Laptop:65000
Mobile:10000
TV:30000
Count how many items in each category. sed is used for formatting:
categories="$(cut -d: -f1 <<< "$data" | sort | uniq -c | sed 's/^ *//;s/ /:/')"
echo "$categories"
This gives:
1:Laptop
3:Mobile
2:TV
Then loop over categories to extract and compute total amount for each:
for i in $categories
do
count=${i%:*}
item=${i/*:}
amount=$(($(grep $item <<< "$data"| cut -d: -f2 | tr '\n' '+')0))
echo $item $amount $count
done
This gives:
Laptop 65000 1
Mobile 60000 3
TV 50000 2
Finally, a little bit of formatting after the done:
<...> done | tr ' \n' ':#' | sed 's/#$//' ; echo
This gives:
Laptop:65000:1#Mobile:60000:3#TV:50000:2
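Collected into a single script (the same commands and variable names as in the steps above):
x="101:Redmi:Mobile:15000#102:Samsung:TV:20000#103:OnePlus:Mobile:35000#104:HP:Laptop:65000#105:Samsung:Mobile:10000#106:Samsung:TV:30000"
data="$(tr '#' '\n' <<< "$x" | cut -d: -f3,4)"
categories="$(cut -d: -f1 <<< "$data" | sort | uniq -c | sed 's/^ *//;s/ /:/')"
for i in $categories
do
    count=${i%:*}
    item=${i/*:}
    amount=$(($(grep $item <<< "$data" | cut -d: -f2 | tr '\n' '+')0))
    echo $item $amount $count
done | tr ' \n' ':#' | sed 's/#$//' ; echo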

How do I echo specific rows and columns from CSVs in a variable?

The below script:
#!/bin/bash
otscurrent="
AAA,33854,4528,38382,12
BBB,83917,12296,96213,13
CCC,20399,5396,25795,21
DDD,27198,4884,32082,15
EEE,2472,981,3453,28
FFF,3207,851,4058,21
GGG,30621,4595,35216,13
HHH,8450,1504,9954,15
III,4963,2157,7120,30
JJJ,51,59,110,54
KKK,87,123,210,59
LLL,573,144,717,20
MMM,617,1841,2458,75
NNN,234,76,310,25
OOO,12433,1908,14341,13
PPP,10627,1428,12055,12
QQQ,510,514,1024,50
RRR,1361,687,2048,34
SSS,1,24,25,96
TTT,0,5,5,100
UUU,294,1606,1900,85
"
IFS="," array1=(${otscurrent})
echo ${array1[4]}
Prints:
$ ./test.sh
12
BBB
I'm trying to get it to just print 12... and I am not even sure how to make it print just row 5, column 4.
The variable is the output of a SQL query that has been parsed with several sed commands to change the formatting to CSV.
otscurrent="$(sqlplus64 user/password#dbserverip/db as sysdba #query.sql |
sed '1,11d; /^-/d; s/[[:space:]]\{1,\}/,/g; $d' |
sed '$d'|sed '$d'|sed '$d' | sed '$d' |
sed 's/Used,MB/Used MB/g' |
sed 's/Free,MB/Free MB/g' |
sed 's/Total,MB/Total MB/g' |
sed 's/Pct.,Free/Pct. Free/g' |
sed '1b;/^Name/d' |
sed '/^$/d'
)"
Ultimately I would like to be able to call on a row and column and run statements on the values.
Initially I was piping that into:
awk -F "," 'NR>1{ if($5 < 10) { printf "%-30s%-10s%-10s%-10s%-10s\n", $1,$2,$3,$4,$5"%"; } else { echo "Nothing to do" } }')"
which works, but I couldn't run commands from the if/else... or at least I didn't know how.
If you have bash 4.0 or newer, an associative array is an appropriate way to store data in this kind of form.
otscurrent=${otscurrent#$'\n'} # strip leading newline present in your sample data
declare -A data=( )
row=0
while IFS=, read -r -a line; do
for idx in "${!line[#]}"; do
data["$row,$idx"]=${line[$idx]}
done
(( row += 1 ))
done <<<"$otscurrent"
This lets you access each individual item:
echo "${data[0,0]}" # first field of first line
echo "${data[9,0]}" # first field of tenth line
echo "${data[9,1]}" # second field of tenth line
"I'm trying to get it to just print 12..."
The issue is that IFS="," splits on commas and there is no comma between 12 and BBB. If you want those to be separate elements, add a newline to IFS. Thus, replace:
IFS="," array1=(${otscurrent})
With:
IFS=$',\n' array1=(${otscurrent})
Output:
$ bash test.sh
12
All you need to print the value of the 4th column on the 5th row is:
$ awk -F, 'NR==5{print $4}' <<< "$otscurrent"
3453
and just remember that in awk row (record) and column (field) numbers start at 1, not 0. Some more examples:
$ awk -F, 'NR==1{print $5}' <<< "$otscurrent"
12
$ awk -F, 'NR==2{print $1}' <<< "$otscurrent"
BBB
$ awk -F, '$5 > 50' <<< "$otscurrent"
JJJ,51,59,110,54
KKK,87,123,210,59
MMM,617,1841,2458,75
SSS,1,24,25,96
TTT,0,5,5,100
UUU,294,1606,1900,85
If you'd like to avoid all of the complexity and simply parse your SQL output to produce what you want, without 20 sed commands in between, post a new question showing the raw sqlplus output as the input and what you finally want as output, and someone will post a brief, clear, simple, efficient awk script to do it all at one time, or maybe 2 commands if you still want an intermediate CSV for some reason.
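For completeness, the if/else attempted in the question is valid awk once the shell's echo is replaced by awk's print. A sketch (the NF guard only skips the blank lines produced by the leading and trailing newlines in $otscurrent):
awk -F, 'NF {
    if ($5 < 10)
        printf "%-30s%-10s%-10s%-10s%-10s\n", $1, $2, $3, $4, $5 "%"
    else
        print "Nothing to do"
}' <<< "$otscurrent"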

Append elements of an array to the end of a line

First let me say that I followed questions on stackoverflow.com that relate to my question, and it seems the rules are not applying. Let me show you.
The following script:
#!/bin/bash
OUTPUT_DIR=/share/es-ops/Build_Farm_Reports/WorkSpace_Reports
TODAY=`date +"%m-%d-%y"`
HOSTNAME=`hostname`
WORKSPACES=( "bob" "mel" "sideshow-ws2" )
if ! [ -f $OUTPUT_DIR/$HOSTNAME.csv ] && [ $HOSTNAME == "sideshow" ]; then
echo "$TODAY","$HOSTNAME" > $OUTPUT_DIR/$HOSTNAME.csv
echo "${WORKSPACES[0]}," >> $OUTPUT_DIR/$HOSTNAME.csv
sed -i "/^'"${WORKSPACES[0]}"'/$/'"${WORKSPACES[1]}"'/" $OUTPUT_DIR/$HOSTNAME.csv
sed -i "/^'"${WORKSPACES[1]}"'/$/${WORKSPACES[2]}"'/" $OUTPUT_DIR/$HOSTNAME.csv
fi
I want the output to look like:
09-20-14,sideshow
bob,mel,sideshow-ws2
The sed statements are supposed to append successive array elements to the preceding ones on the same line. Now I know there's a simpler way to do this, like:
echo "${WORKSPACES[0]},${WORKSPACES[1]},${WORKSPACES[2]}" >> $OUTPUT_DIR/$HOSTNAME.csv
But let's say I had 30 elements in the array and I wanted to append them one after the other on the same line. Can you show me how to loop through the elements of an array and append them one after the other on the same line?
Also let's say I had the output of a command like:
df -m /export/ws/$ws | awk '{if (NR!=1) {print $3}}'
and I wanted to append that to the end of the same line.
But when I run it I get:
+ OUTPUT_DIR=/share/es-ops/Build_Farm_Reports/WorkSpace_Reports
++ date +%m-%d-%y
+ TODAY=09-20-14
++ hostname
+ HOSTNAME=sideshow
+ WORKSPACES=("bob" "mel" "sideshow-ws2")
+ '[' -f /share/es-ops/Build_Farm_Reports/WorkSpace_Reports/sideshow.csv ']'
And the file right now looks like:
09-20-14,sideshow
bob,
I am happy to report that user syme solved this (see below) but then I realized I need the date in the first column:
09-7-14,bob,mel,sideshow-ws2
Can I do this using syme's for loop?
Okay, user syme solved this too; he said "Just add $TODAY to the for loop", like this:
for v in "$TODAY" "${WORKSPACES[#]}"
Okay, now the output looks like this (I changed the elements in the array, by the way):
sideshow
09-20-14,bob_avail,bob_used,mel_avail,mel_used,sideshow-ws2_avail,sideshow-ws2_used
Now, below that, the next line will be populated with a , in the first column (skipping the date), and then:
df -m /export/ws/$v | awk '{if (NR!=1) {print $3}}'
which equals the value of available space on bob in the first iteration,
and then:
df -m /export/ws/$v | awk '{if (NR!=1) {print $2}}'
which equals the value of used space on bob in the second iteration,
and then we just move on to the next value in ${WORKSPACES[@]},
which will be mel, and do the available and used as we did with bob ($v above).
I know you geniuses on here will make child's play out of this.
I solved my own last question on this thread:
WORKSPACES2=( "bob" "mel" "sideshow-ws2" )
separator="," # defined empty for the first value
for v in "${WORKSPACES2[#]}"
do
available=`df -m /export/ws/$v | awk '{if (NR!=1) {print $3}}'`
used=`df -m /export/ws/$v | awk '{if (NR!=1) {print $2}}'`
echo -n "$separator$available$separator$used" >> $OUTPUT_DIR/$HOSTNAME.csv # append, concatenated, the separator and the value to the file
done
produces:
sideshow
09-20-14,bob_avail,bob_used,mel_avail,mel_used,sideshow-ws2_avail,sideshow-ws2_used
,470400,1032124,661826,1032124,43443,1032108
echo -n prints text without a trailing linebreak.
To loop over the values of the array, you can use a for-loop:
echo "$TODAY,$HOSTNAME" > $OUTPUT_DIR/$HOSTNAME.csv # with a linebreak
separator="" # defined empty for the first value
for v in "${WORKSPACES[#]}"
do
echo -n "$separator$v" >> $OUTPUT_DIR/$HOSTNAME.csv # append, concatenated, the separator and the value to the file
separator="," # comma for the next values
done
echo >> $OUTPUT_DIR/$HOSTNAME.csv # add a linebreak (if you want it)
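As an alternative sketch for the joining step itself: inside double quotes, "${WORKSPACES[*]}" expands to the elements joined by the first character of IFS, so the date plus all workspaces can be written on one line without an explicit loop (the subshell keeps the IFS change local):
( IFS=','; echo "$TODAY,${WORKSPACES[*]}" >> "$OUTPUT_DIR/$HOSTNAME.csv" )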
