Importing data from a CSV in Bash - arrays

I have a CSV file that I need to use in a bash script. The CSV is formatted like so:
server1,file.name
server1,otherfile.name
server2,file.name
server3,file.name
I need to pull this information into an array (or some other structure) so that I can filter it and pull out only the rows for a single server, which I can then pass to another command within the script.
I need it to go something like this.
Import workfile.csv
check hostname | return only lines from workfile.csv that have the hostname as column one and store column 2 as a variable.
find / -xdev -type f -perm -002 | compare to stored info | chmod o-w all files not in listing
I'm stuck using bash because of the environment that I'm working in.

The CSV can be too big to pass all of the filenames on the find command line.
You also do not want to call find in a loop for every line in the CSV.
Solution:
First, make a complete list of files in a tmp file.
Second, parse the CSV and filter those files out of the list.
Third, chmod -w what is left.
The solution below stores the file list in a tmp file.
Make a script that takes the server name as a parameter.
See the comments in the code:
# Before EDIT:
# Hostname by parameter 1
# Check that you have a hostname
if [ $# -ne 1 ]; then
echo "Usage: $0 hostname"
# Exit script, failure
exit 1
fi
hostname=$1
# Edit, get hostname by system call
hostname=$(hostname)
# Or: hostname=$(hostname -s)
# Additional check
if [ ! -f workfile.csv ]; then
echo "inputfile missing"
exit 1
fi
# After edits, ${hostname} is now filled.
# Step 1: complete list of other-writable files
find / -xdev -type f -perm -002 > /tmp/allfiles.tmp
# Do not use cat workfile.csv | grep ..., you do not need to call cat
# grep with ^ for beginning of line, add a , for a complete first field
# grep "^${hostname}," workfile.csv
# cut for selecting second field with delimiter ','
# cut -d"," -f2
# while read file => could be improved with xargs, but let's start with this.
grep "^${hostname}," workfile.csv | cut -d"," -f2 | while read file; do
# Using sed with #, not /, since you need / in the search string
# The variable in sed must be outside the single quotes and inside double quotes
# Add $ after the file for end-of-line
# delete the line ending with the file (\#searchstring#d - the first delimiter needs a leading backslash)
sed -i '\#/'"${file}"'$#d' /tmp/allfiles.tmp
done
echo "Review /tmp/allfiles.tmp before chmodding all these files"
echo "Delete the echo and exit when you are happy"
# Just an exit for testing
exit
# Using < is for avoiding a call to cat
</tmp/allfiles.tmp xargs chmod -w
It might be easier to chmod -w all the files and then chmod +w the files in the CSV. This is a little different from what you asked, since all files from the CSV end up writable after this process; maybe you do not want that.
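A one-pass alternative to the sed loop above (just a sketch, not part of the original answer; the tmp file names are illustrative): build a pattern file from the CSV and drop all matching lines with a single grep call. It assumes the filenames contain no regex metacharacters other than dots.
# Build one anchored pattern per filename from the CSV, escaping the dots
grep "^${hostname}," workfile.csv | cut -d"," -f2 |
sed 's|\.|\\.|g; s|^|/|; s|$|$|' > /tmp/csvfiles.patterns
# Drop every CSV-listed file from the candidate list in a single pass
grep -v -f /tmp/csvfiles.patterns /tmp/allfiles.tmp > /tmp/tochmod.tmp
</tmp/tochmod.tmp xargs chmod -w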

Related

"basename" command won't include multiple files

I have a problem with the basename command, as follows:
In my host directory I have fastq.gz files for two samples, named:
A29_WES_S3_R1_001.fastq.gz
A29_WES_S3_R2_001.fastq.gz
A30_WES_S1_R1_001.fastq.gz
A30_WES_S1_R2_001.fastq.gz
Now I need their basenames without the suffix, like:
A29_WES_S3_R1_001
A29_WES_S3_R2_001
A30_WES_S1_R1_001
A30_WES_S1_R2_001
I used the following bash script:
#!/bin/bash
FILES1=(*R1_001.fastq.gz)
FILES2=(*R2_001.fastq.gz)
read1="${FILES1[@]}"
read2="${FILES2[@]}"
Ffile=$read1
Ffileprevix=$(basename "$Ffile" .fastq.gz)
Mfile=$read2
Mfileprevix=$(basename "$Mfile" .fastq.gz)
echo $Ffileprevix
echo $Mfileprevix
exit;
But every time I just get this output:
A29_WES_S3_R1_001.fastq.gz A30_WES_S1_R1_001
A29_WES_S3_R2_001.fastq.gz A30_WES_S1_R2_001
Only the last file (A30) would be included in the command!
I checked my pipeline in this way:
echo $read1
echo $read2
The result:
A29_WES_S3_R1_001.fastq.gz A30_WES_S1_R1_001.fastq.gz
A29_WES_S3_R2_001.fastq.gz A30_WES_S1_R2_001.fastq.gz
Then I did:
echo $Ffile
echo $Mfile
The result:
A29_WES_S3_R1_001.fastq.gz A30_WES_S1_R1_001.fastq.gz
A29_WES_S3_R2_001.fastq.gz A30_WES_S1_R2_001.fastq.gz
So $read1, $read2, $Ffile, and $Mfile work well.
Then I put "-a" in my basename command, since it will take multiple files:
Ffileprevix=$(basename -a "$Ffile" .fastq.gz)
Mfileprevix=$(basename -a "$Mfile" .fastq.gz)
But it got worse! The result was like:
A29_WES_S3_R1_001.fastq.gz A30_WES_S1_R1_001.fastq.gz .fastq.gz
A29_WES_S3_R2_001.fastq.gz A30_WES_S1_R2_001.fastq.gz .fastq.gz
Finally, I tried a "for ... do ..." loop around the basename command. Again, nothing changed!
Can anybody help me obtain what I want:
A29_WES_S3_R1_001
A29_WES_S3_R2_001
A30_WES_S1_R1_001
A30_WES_S1_R2_001
I'd leave basename out of this entirely, but that's entirely personal preference. You could do something more like:
FILES_PATTERN_1=".*R1_001.fastq.gz"
FILES_PATTERN_2=".*R2_001.fastq.gz"
# Get FILE PATTERN 1
echo "Pattern 1:"
for FILE in $(find . | grep "${FILES_PATTERN_1}" | cut -d. -f2 | tr -d /); do
echo $FILE
done
# Get FILE PATTERN 2
echo "Pattern 2:"
for FILE in $(find . | grep "${FILES_PATTERN_2}" | cut -d. -f2 | tr -d /); do
echo $FILE
done
Output should be:
Pattern 1:
A30_WES_S1_R1_001
A29_WES_S3_R1_001
Pattern 2:
A29_WES_S3_R2_001
A30_WES_S1_R2_001
You could also play with awk to parse things instead:
# Get FILE PATTERN 1
echo "Pattern 1:"
for FILE in $(find . | grep "${FILES_PATTERN_1}" | awk -F '[/.]' '{print $3}'); do
echo $FILE
done
There are a number of ways to approach this. If you had a lot more patterns to test you could make more use of functions here to reduce code duplication.
Also note, I'm doing this from a shell on Mac OSX, so if you're doing this from a Linux box some of these commands may need to be tweaked due to differences in output for some commands, like find. (ex: print $1 instead of print $3)
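For instance, a small helper along those lines might look like this (a sketch using the same find/awk approach as above; list_basenames is just a made-up name):
# Hypothetical helper: print the basenames (minus .fastq.gz) matching one pattern
list_basenames () {
    local pattern=$1
    for FILE in $(find . | grep "${pattern}" | awk -F '[/.]' '{print $3}'); do
        echo "$FILE"
    done
}
echo "Pattern 1:"; list_basenames ".*R1_001.fastq.gz"
echo "Pattern 2:"; list_basenames ".*R2_001.fastq.gz"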

Putting files in directory into array variable

I'm writing bash code that will search for specific files in the directory it is run in and add them to an array variable. The problem I am having is formatting the results. I need to find all the compressed files in the current directory and display both the names and sizes of the files, in order of last modification. I want to take the results of that command and put them into an array variable, with each element containing a file's name and corresponding size, but I don't know how to do that. I'm not sure whether I should be using find instead of ls, but here is what I have so far:
find_files="$(ls -1st --block-size=MB)"
arr=( ($find_files) )
I'm not sure exactly what format you want the array to be in, but here is a snippet that creates an associative array keyed by filename with the size as the value:
$ ls -l test.{zip,bz2}
-rw-rw-r-- 1 user group 0 Sep 10 13:27 test.bz2
-rw-rw-r-- 1 user group 0 Sep 10 13:26 test.zip
$ declare -A sizes; while read SIZE FILENAME ; do sizes["$FILENAME"]="$SIZE"; done < <(find * -prune -name '*.zip' -o -name '*.bz2' | xargs stat -c "%Y %s %N" | sort | cut -f 2,3 -d " ")
$ echo "${sizes[@]@A}"
declare -A sizes=(["'test.zip'"]="0" ["'test.bz2'"]="0" )
$
And if you just want an array of literally "filename size" entries, that's even easier:
$ while read SIZE FILENAME ; do sizes+=("$FILENAME $SIZE"); done < <(find * -prune -name '*.zip' -o -name '*.bz2' | xargs stat -c "%Y %s %N" | sort | cut -f 2,3 -d " ")
$ echo "${sizes[@]@A}"
declare -a sizes=([0]="'test.zip' 0" [1]="'test.bz2' 0")
$
Both of these solutions work, and were tested via copy paste from this post.
The first is fairly slow. One problem is external program invocations within a loop: date, for example, is invoked for every file. You could make it quicker by not including the date in the output array (see Notes below). Particularly for method 2, that would result in no external command invocations inside the while loop. But method 1 is really the problem: it is orders of magnitude slower.
Also, somebody probably knows how to convert an epoch date to another format in awk, for example, which could be faster (there is a sketch of that idea below). Maybe you could do the sort in awk too. Perhaps just keep the epoch date?
These solutions are bash / GNU heavy and not portable to other environments (process substitution, find -printf). OP tagged linux and bash though, so GNU can be assumed.
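As a sketch of only that date-formatting idea (assumes GNU awk for strftime; like the scripts below, it assumes paths without spaces so the field numbering holds):
# Format the epoch field with gawk's strftime() instead of calling date once per file
find "$TARGET" -type f -printf '%p %s %T@\n' |
sort -n -k 3 |
gawk '{ $3 = strftime("%Y-%m-%d %H:%M:%S", int($3)); print }'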
Solution 1 - capture any compressed file - using file to match (slow)
The criteria for 'compressed' is if file output contains the word compress
Reliable enough, but perhaps there is a conflict with some other file type description?
file -l | grep compress (file 5.38, Ubuntu 20.04, WSL) indicates for me there are no conflicts at all (all files listed are compression formats)
I couldn't find a way of classifying any compressed file other than this
I ran this on a directory containing 1664 files - time (real) was 40 seconds
#!/bin/bash
# Capture all files, recursively, in $TARGET, that are
# compressed files. In an indexed array. Using file name
# extensions to match.
# Initialise variables, and check the target is valid
declare -g c= compressed_files= path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1
# Make the array
# Process substitution (< <(...)) is used so the loop runs in the current shell and the array stays in the global environment
while IFS= read -r -d '' path; do
[[ "$(file --brief "${path%% *}")" == *compress* ]] &&
compressed_files[c++]="${path% *} $(date -d @${path##* })"
done < \
<(
find "$TARGET" -type f -printf '%p %s %T@\0' |
awk '{$2 = ($2 / 1024); print}' |
sort -n -k 3
)
# Print results - to test
printf '%s\n' "${compressed_files[@]}"
Solution 2 - use file extensions - orders of magnitude faster
If you know exactly what extensions you are looking for, you can
compose them in a find command
This is a lot faster
On the same directory as above, containing 1664 files - time (real) was 200 milliseconds
This example looks for .gz, .zip, and .7z (gzip, zip and 7zip respectively)
I'm not sure if -type f -and -regex '.*[.]\(gz\|zip\|7z\)' -and -printf might be faster still, now that I think of it. I started with globs because I assumed that was quicker
That may also allow storing the extension list in a variable.
This method avoids a file analysis on every file in your target
It also makes the while loop shorter - you're only iterating matches
Note the repetition of -printf here; this is due to the logic that
find uses: -printf is 'True'. If it were included by itself, it would
act as a 'match' and print all files
It has to be used as the result of a name match being true (using -and)
Perhaps somebody has a better composition? (One grouped alternative is sketched after the notes below.)
#!/bin/bash
# Capture all files, recursively, in $TARGET, that are
# compressed files. In an indexed array. Using file name
# extensions to match.
# Initialise variables, and check the target is valid
declare -g c= compressed_files= path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1
while IFS= read -r -d '' path; do
compressed_files[c++]="${path% *} $(date -d @${path##* })"
done < \
<(
find "$TARGET" \
-type f -and -name '*.gz' -and -printf '%p %s %T@\0' -or \
-type f -and -name '*.zip' -and -printf '%p %s %T@\0' -or \
-type f -and -name '*.7z' -and -printf '%p %s %T@\0' |
awk '{$2 = ($2 / 1024); print}' |
sort -n -k 3
)
# Print results - for testing
printf '%s\n' "${compressed_files[@]}"
Sample output (of either method):
$ comp-find.bash /tmp
/tmp/comptest/websters_english_dictionary.tmp.tar.gz 265.148 Thu Sep 10 07:53:37 AEST 2020
/tmp/comptest/What_is_Systems_Architecture_PART_1.tar.gz 1357.06 Thu Sep 10 08:17:47 AEST 2020
Note:
You can add a literal K to indicate the block size / units (kilobytes)
If you want to print the path only from this array, you can use suffix removal: printf '%s\n' "${compressed_files[@]%% *}"
For no date in the array (it's used to sort, but then its job may be done), simply remove $(date -d @${path##* }) (incl. the space).
Kind of tangential, but to use different date formats, replace $(date -d @${path##* }) with:
$(date -I -d @${path##* }) ISO format - note that the short-opts style date -Id @[date] did not work for me
$(date -d @${path##* } +%Y-%m-%d_%H-%M-%S) like ISO, but w/ seconds
$(date -d @${path##* } +%Y-%m-%d_%H-%M-%S.%N) same again, but w/ nanoseconds (find gives you nanoseconds)
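On the 'better composition' question above: the three -name tests can be grouped so that a single -printf serves all of them. This sketch only replaces the find invocation inside the process substitution; the rest of the script stays the same (not benchmarked here):
find "$TARGET" -type f \
    \( -name '*.gz' -o -name '*.zip' -o -name '*.7z' \) \
    -printf '%p %s %T@\0'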
Sorry for the long post, hopefully it's informative.

Looking to take only main folder name within a tarball & match it to folders to see if it's been extracted

I have a situation where I need to keep .tgz files & if they've been extracted, remove the extracted directory & contents.
In all examples, the only top-level directory within the tarball has a different name than the tarball itself:
[host1]$ find / -name "*@*.tgz" #(has an @ symbol somewhere in the name)
/1-@-test.tgz
[host1]$ tar -tzvf /1-@-test.tgz | head -n 1 | awk '{ print $6 }'
TJ #(directory name)
What I'd like to accomplish (pulling my hair out; rusty scripting fingers), is to look at each tarball, see if the corresponding directory name (like above) exists. If it does, echo "rm -rf /directoryname" into an output file for review.
I can read all of the tarballs into an array ... but how to check the directories?
Frustrated & appreciate any help.
Maybe you're looking for something like this:
find / -name "*@*.tgz" | while read line; do
dir=$(tar ztf "$line" | awk -F/ '{print $1; exit}')
test -d "$dir" && echo "rm -fr '$dir'"
done
Explanation:
We iterate over the *#*.tgz files found with a while loop, line by line
Get the list of files in the tgz file with tar ztf "$line"
Since path components are separated by /, use that as the separator in the awk and print the 1st field, i.e. the top-level directory. After the print we exit, making this equivalent to, but more efficient than, using head -n1 first
With dir=$(...) we put the entire output of the tar..awk chain, thus the top-level directory of the first entry in the tar, into the variable dir
We check if such directory exists, if yes then echo an rm command so you can review and execute later if looks good
My original answer used a find ... -exec but I think that's not so good in this particular case:
find / -name "*@*.tgz" -exec \
sh -c 'dir=$(tar ztf "{}" | awk -F/ "{print \$1; exit}");\
test -d "$dir" && echo "rm -fr \"$dir\""' \;
It's not so good because of running sh for every file, and since we are using {} in the subshell, we lose the usual benefits of a typical find ... -exec where special characters in {} are correctly handled.
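If you do want the tarball list in an array first, as the question mentions, a minimal sketch (assumes bash 4.4+ for mapfile -d '' and GNU find for -print0):
# Collect the tarball paths NUL-delimited, then test each one's top-level directory
mapfile -d '' -t tarballs < <(find / -name "*@*.tgz" -print0)
for tgz in "${tarballs[@]}"; do
    dir=$(tar ztf "$tgz" | awk -F/ '{print $1; exit}')
    test -d "$dir" && echo "rm -fr '$dir'"
done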

Using a variable to pass grep pattern in bash

I am struggling with passing several grep patterns that are contained within a variable. This is the code I have:
#!/bin/bash
GREP="$(which grep)"
GREP_MY_OPTIONS="-c"
for i in {-2..2}
do
GREP_MY_OPTIONS+=" -e "$(date --date="$i day" +'%Y-%m-%d')
done
echo $GREP_MY_OPTIONS
IFS=$'\n'
MYARRAY=( $(${GREP} ${GREP_MY_OPTIONS} "/home/user/this path has spaces in it/"*"/abc.xyz" | ${GREP} -v :0$ ) )
This is what I wanted it to do:
determine/define where grep is
assign a variable (GREP_MY_OPTIONS) holding parameters I will pass to grep
assign several patterns to GREP_MY_OPTIONS
using grep and the patterns I have stored in $GREP_MY_OPTIONS search several files within a path that contains spaces and hold them in an array
When I use "echo $GREP_MY_OPTIONS" it is generating what I expected but when I run the script it fails with an error of:
/bin/grep: invalid option -- ' '
What am I doing wrong? If the path does not have spaces in it, everything seems to work fine, so I think it is something to do with the IFS, but I'm not sure.
If you want to grep some content in a set of paths, you can do the following:
find <directory> -type f |
grep "/home/user/this path has spaces in it/.*/abc.xyz" |
xargs -I {} grep <your_options> -f <patterns> {}
Here <patterns> is a file containing the patterns you want to search for in each file found under <directory>.
Considering your answer, this should do what you want:
find "/path with spaces/" -type f | xargs -I {} grep -H -c -e 2013-01-17 {}
From man grep:
-H, --with-filename
Print the file name for each match. This is the default when
there is more than one file to search.
Since you want to insert the elements into an array, you can do the following:
IFS=$'\n'; array=( $(find "/path with spaces/" -type f -print0 |
xargs -0 -I {} grep -H -c -e 2013-01-17 "{}") )
And then use the values as:
echo ${array[0]}
echo ${array[1]}
echo ${array[...]}
When using variables to pass the parameters, use eval to evaluate the entire line. Do the following:
parameters="-H -c"
eval "grep ${parameters} file"
If you build the GREP_MY_OPTIONS as an array instead of as a simple string, you can get the original outline script to work sensibly:
#!/bin/bash
path="/home/user/this path has spaces in it"
GREP="$(which grep)"
GREP_MY_OPTIONS=("-c")
j=1
for i in {-2..2}
do
GREP_MY_OPTIONS[$((j++))]="-e"
GREP_MY_OPTIONS[$((j++))]=$(date --date="$i day" +'%Y-%m-%d')
done
IFS=$'\n'
MYARRAY=( $(${GREP} "${GREP_MY_OPTIONS[@]}" "$path/"*"/abc.xyz" | ${GREP} -v :0$ ) )
I'm not clear why you use GREP="$(which grep)" since you will execute the same grep as if you wrote grep directly — unless, I suppose, you have some alias for grep (which is then the problem; don't alias grep).
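As a side note on the same idea, bash's += array append avoids the manual index counter; a sketch of the same loop with unchanged behaviour:
GREP_MY_OPTIONS=("-c")
for i in {-2..2}
do
    # append the option and its argument as two separate array elements
    GREP_MY_OPTIONS+=("-e" "$(date --date="$i day" +'%Y-%m-%d')")
done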
You can do one thing without making things complex:
First change directory in your script, like the following:
cd /home/user/this\ path\ has\ spaces\ in\ it/
$ pwd
/home/user/this path has spaces in it
or
$ cd "/home/user/this path has spaces in it/"
$ pwd
/home/user/this path has spaces in it
Then do whatever you want in your script.
$(${GREP} ${GREP_MY_OPTIONS} */abc.xyz)
EDIT :
[sgeorge@sgeorge-ld stack1]$ ls -l
total 4
drwxr-xr-x 2 sgeorge eng 4096 Jan 19 06:05 test tesd
[sgeorge@sgeorge-ld stack1]$ cat test\ tesd/file
SUKU
[sgeorge@sgeorge-ld stack1]$ grep SUKU */file
SUKU
EDIT :
[sgeorge@sgeorge-ld stack1]$ find */* -print | xargs -I {} grep SUKU {}
SUKU

Need bash to separate cat'ed string to separate variables and do a for loop

I need to get a list of files added to a master folder and copy only the new files to the respective backup folders. The paths to each folder contain multiple folders, all named by numbers and only one level deep,
i.e. /tester/a/100
/tester/a/101 ...
diff -r typically returns "Only in /testing/a/101: 2093_thumb.png" per line in the generated diff.txt file.
NOTE: there is a space after the colon
I need to get the 101 from the path and filename into separate variables and copy them to the backup folders.
I need the lesserfolder variable to get 101 without the colon,
and the mainfile variable to get 2093_thumb.png from each line of diff.txt, and then do the for loop, but I can't seem to get $file to behave. Each time I try echoing the variables to test them, I get all the wrong results.
#!/bin/bash
diff_file=/tester/diff.txt
mainfolder=/testing/a
bacfolder= /testing/b
diff -r $mainfolder $bacfolder > $diff_file
LIST=`cat $diff_file`
for file in $LIST
do
maindir=$file[3]
lesserfolder=
mainfile=$file[4]
# cp $mainfolder/$lesserFolder/$mainfile $bacfolder/$lesserFolder/$mainfile
echo $maindir $mainfile $lesserfolder
done
If I could just get the echo statement working the cp would work then too.
I believe this is what you want:
#!/bin/bash
diff_file=/tester/diff.txt
mainfolder=/testing/a
bacfolder=/testing/b
diff -r -q $mainfolder $bacfolder | egrep "^Only in ${mainfolder}" | awk '{print $3,$4}' > $diff_file
cat ${diff_file} | while read foldercolon mainfile ; do
folderpath=${foldercolon%:}
lesserFolder=${folderpath#${mainfolder}/}
cp $mainfolder/$lesserFolder/$mainfile $bacfolder/$lesserFolder/$mainfile
done
But it is much more reliable (and much easier!) to use rsync for this kind of backup. For example:
rsync -a /testing/a/* /testing/b/
You could try a while read loop:
diff -r $mainfolder $bacfolder | while read dummy dummy dir file; do
echo $dir $file
done
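Building on that, stripping the trailing colon and doing the copy might look like this (a sketch that borrows the grep filter and the substring removal from the answer above):
diff -r "$mainfolder" "$bacfolder" | grep "^Only in ${mainfolder}" |
while read -r dummy dummy dir file; do
    dir=${dir%:}                          # strip the trailing colon from "Only in /testing/a/101:"
    lesserfolder=${dir#${mainfolder}/}    # e.g. 101
    cp "$mainfolder/$lesserfolder/$file" "$bacfolder/$lesserfolder/$file"
done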
