Copy a range of directories with float number naming in bash - arrays

I have a few thousand folders which store some files; the folders are named in the following fashion:
0.01
0.02
.
.
.
1.01
.
.
.
Normally I would use cp -r {1..1000} some/destination; however, trying cp -r {0.01..0.21} some/destination does not work.
Also, what if I wanted to copy only every fifth folder?
0.05
0.1
0.15
.
.
.
Once again, the folders would be in an array and the range would end at a specific number, for instance 1.2.

Bash brace expansion only supports integer and character ranges, so make use of the seq command instead:
seq [OPTION]... FIRST INCREMENT LAST
-f, --format=FORMAT use printf style floating-point FORMAT
source: man seq
Thus for all folders:
cp -r $(seq 0.01 0.01 0.21) some/destination
If you want to copy only every 5th folder:
cp -r $(seq 0.05 0.05 0.21) some/destination
However, seq can create a list of folder names with trailing zeros, like 2.00. If you do not want the trailing zeros, reformat the output by adding the flag -f '%g'.
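For example, applying the format flag to the every-fifth-folder copy, ending at 1.2 as in the question (a minimal sketch of the same seq approach):
cp -r $(seq -f '%g' 0.05 0.05 1.2) some/destination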

Probably you are better off using an array, so that you are resilient to folders not following a strict sequence, though this is not as efficient as copying multiple directories with each invocation of cp, as in kvantour's answer above:
dirs=(*.*/)                       # get the list of directories into an array
n=0
for dir in "${dirs[@]}"; do       # traverse the array
    (( ++n % 5 == 0 )) || continue  # skip unless it is every 5th
    cp -r -- "$dir" "$dest"         # copy! ($dest is your destination directory)
done
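If you do want a single cp per batch, GNU cp accepts -t to name the target directory first, so the selected directories can be collected and copied together; a sketch along the same lines, assuming GNU coreutils:
dirs=(*.*/)
pick=()
n=0
for dir in "${dirs[@]}"; do
    (( ++n % 5 == 0 )) && pick+=("$dir")            # keep every 5th directory
done
(( ${#pick[@]} )) && cp -r -t "$dest" -- "${pick[@]}"   # one cp for the whole batch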

Related

Putting files in directory into array variable

I'm writing bash code that will search for specific files in the directory it is run in and add them to an array variable. The problem I am having is formatting the results. I need to find all the compressed files in the current directory and display both the names and sizes of the files in order of last modified. I want to take the results of that command and put them into an array variable, with each element containing a file's name and corresponding size, but I don't know how to do that. I'm not sure if I should be using command "find" instead of "ls", but here is what I have so far:
find_files="$(ls -1st --block-size=MB)"
arr=( ($find_files) )
I'm not sure exactly what format you want the array to be in, but here is a snippet that creates an associative array keyed by filename with the size as the value:
$ ls -l test.{zip,bz2}
-rw-rw-r-- 1 user group 0 Sep 10 13:27 test.bz2
-rw-rw-r-- 1 user group 0 Sep 10 13:26 test.zip
$ declare -A sizes; while read SIZE FILENAME ; do sizes["$FILENAME"]="$SIZE"; done < <(find * -prune -name '*.zip' -o -name '*.bz2' | xargs stat -c "%Y %s %N" | sort | cut -f 2,3 -d " ")
$ echo "${sizes[@]@A}"
declare -A sizes=(["'test.zip'"]="0" ["'test.bz2'"]="0" )
$
And if you just want an array of literally "filename size" entries, that's even easier:
$ while read SIZE FILENAME ; do sizes+=("$FILENAME $SIZE"); done < <(find * -prune -name '*.zip' -o -name '*.bz2' | xargs stat -c "%Y %s %N" | sort | cut -f 2,3 -d " ")
$ echo "${sizes[@]@A}"
declare -a sizes=([0]="'test.zip' 0" [1]="'test.bz2' 0")
$
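As an aside (my own variation, not part of the answer above): stat's %N quotes the file names, which is why the keys come out as "'test.zip'". Using %n instead gives plain names as keys, so lookups are simpler; a minimal sketch, assuming GNU find and stat:
declare -A sizes
while read -r size filename; do
    sizes["$filename"]=$size
done < <(find . -maxdepth 1 \( -name '*.zip' -o -name '*.bz2' \) -exec stat -c '%Y %s %n' {} + | sort -n | cut -d' ' -f2,3)
echo "${sizes[./test.zip]}"   # prints the size in bytes (note find prefixes names with ./)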
Both of these solutions work and were tested by copy-pasting from this post.
The first is fairly slow. One problem is external program invocations within the loop: date, for example, is invoked for every file. You could make it quicker by not including the date in the output array (see the Note section below). Particularly for method 2, that would leave no external command invocations inside the while loop. But method 1 is really the problem: it is orders of magnitude slower.
Also, somebody probably knows how to convert an epoch date to another format in awk, for example, which could be faster. Maybe you could do the sort in awk too. Perhaps just keep the epoch date?
These solutions are bash/GNU heavy and not portable to other environments (bash process substitution, find -printf). The OP tagged linux and bash, though, so GNU can be assumed.
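On that awk idea, a rough sketch (my own, not from the answer; assumes GNU awk for strftime, and the field-based handling means paths containing whitespace would throw it off):
find "$TARGET" -type f -printf '%p %s %T@\n' |
    sort -n -k3 |
    gawk '{ $NF = strftime("%Y-%m-%d %H:%M:%S", $NF); print }'
This keeps the per-file date formatting out of the bash loop entirely.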
Solution 1 - capture any compressed file - using file to match (slow)
The criterion for 'compressed' is whether the output of file contains the word compress
Reliable enough, but perhaps there is a conflict with some other file type description?
file -l | grep compress (file 5.38, Ubuntu 20.04, WSL) indicates for me there are no conflicts at all (all files listed are compression formats)
I couldn't find a way of classifying any compressed file other than this
I ran this on a directory containing 1664 files - time (real) was 40 seconds
#!/bin/bash
# Capture all files, recursively, in $TARGET, that are
# compressed files, in an indexed array, using the output
# of file(1) to match.

# Initialise variables, and check the target is valid
declare -g c= compressed_files= path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1

# Make the array.
# Process substitution (< <(...)) is used so the array is built
# in the current shell and stays in the global environment.
while IFS= read -r -d '' path; do
    [[ "$(file --brief "${path%% *}")" == *compress* ]] &&
        compressed_files[c++]="${path% *} $(date -d @${path##* })"
done < \
    <(
        find "$TARGET" -type f -printf '%p %s %T@\0' |
            awk '{$2 = ($2 / 1024); print}' |
            sort -n -k 3
    )

# Print results - to test
printf '%s\n' "${compressed_files[@]}"
Solution 2 - use file extensions - orders of magnitude faster
If you know exactly what extensions you are looking for, you can
compose them in a find command
This is a lot faster.
On the same directory as above, containing 1664 files, the time (real) was 200 milliseconds.
This example looks for .gz, .zip, and .7z (gzip, zip and 7zip respectively)
Now that I think of it, I'm not sure whether -type f -and -regex '.*[.]\(gz\|zip\|7z\)' -and -printf might be faster again; see the sketch after these notes. I started with globs because I assumed they would be quicker.
That may also allow storing the extension list in a variable.
This method avoids a file analysis on every file in your target
It also makes the while loop shorter - you're only iterating matches
Note the repetition of -printf here; this is due to the logic that
find uses: -printf always evaluates as 'true'. If it were included by itself, it would
act as a 'match' and print all files,
so it has to be applied as the result of a name match being true (using -and).
Perhaps somebody has a better composition?
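For what it's worth, a sketch of that -regex composition (GNU find; my own assumption, not timed against the version below). A single test replaces the three -name alternatives, so -printf appears only once:
find "$TARGET" -type f \
    -regextype posix-extended -regex '.*\.(gz|zip|7z)' \
    -printf '%p %s %T@\0'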
#!/bin/bash
# Capture all files, recursively, in $TARGET, that are
# compressed files, in an indexed array, using file name
# extensions to match.

# Initialise variables, and check the target is valid
declare -g c= compressed_files= path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1

while IFS= read -r -d '' path; do
    compressed_files[c++]="${path% *} $(date -d @${path##* })"
done < \
    <(
        find "$TARGET" \
            -type f -and -name '*.gz'  -and -printf '%p %s %T@\0' -or \
            -type f -and -name '*.zip' -and -printf '%p %s %T@\0' -or \
            -type f -and -name '*.7z'  -and -printf '%p %s %T@\0' |
            awk '{$2 = ($2 / 1024); print}' |
            sort -n -k 3
    )

# Print results - for testing
printf '%s\n' "${compressed_files[@]}"
Sample output (of either method):
$ comp-find.bash /tmp
/tmp/comptest/websters_english_dictionary.tmp.tar.gz 265.148 Thu Sep 10 07:53:37 AEST 2020
/tmp/comptest/What_is_Systems_Architecture_PART_1.tar.gz 1357.06 Thu Sep 10 08:17:47 AEST 2020
Note:
You can append a literal K in the awk expression to indicate the block size / units (kilobytes).
If you want to print only the path from this array, you can use suffix removal: printf '%s\n' "${compressed_files[@]%% *}"
For no date in the array (it is used to sort, but then its job may be done), simply remove $(date -d @${path##* }) (incl. the space).
Kind of tangential, but to use different date formats, replace $(date -d @${path##* }) with:
$(date -I -d @${path##* }) for ISO format - note that the short-opts style date -Id @[date] did not work for me
$(date -d @${path##* } +%Y-%m-%d_%H-%M-%S) like ISO, but with the time down to seconds
$(date -d @${path##* } +%Y-%m-%d_%H-%M-%S.%N) same again, but with nanoseconds (find gives you nanoseconds)
Sorry for the long post, hopefully it's informative.

Running an array on multiple files with different output files linux

I would like to break down 8 files (representing a chromosome each) that each have roughly 2e9 lines into roughly 5 chunks of 4e8 lines. These are VCF files (https://en.wikipedia.org/wiki/Variant_Call_Format), which have a header and then the genetic variants, so I need to preserve the header for each chromosome and reattach it to each chunk. I am doing this on Linux on an HPC.
I have done this with a single file before using:
#grab the header
head -n 10000 my.vcf | grep "^#" >header
#grab the non header lines
grep -v "^#" my.vcf >variants
#split into chunks with 40000000 lines
split -l 40000000 variants
#reattach the header to each and clean up
for i in x*;do cat header $i >$i.vcf && rm -f $i;done
rm -f header variants
I could do this manually with all 8 chromosomes, but I work within an HPC with array capabilities and feel like this could be done better with a for loop; however, the syntax is slightly confusing to me.
I have tried:
#filelist is a list of the 8 chromosome files i.e. chr001.vcf, chr002.vcf...chr0008.vcf
for f in 'cat filelist.txt'; head -n 10000 my.vcf | grep "^#" >header; done
This puts everything into the same header. How would I put the outputs into unique per chromosome headers? Similarly how would this then work with splitting the variants and reattaching the headers to each chunk per chromosome?
The desired output would be:
chr001_chunk1.vcf
chr001_chunk2.vcf
chr001_chunk3.vcf
chr001_chunk4.vcf
chr001_chunk5.vcf
...
chr008_chunk5.vcf
with each VCF chunk having the header from its respective chromosome "parent".
Many thanks
#!/bin/bash
#
# scan the current directory for chr[0-9]*.vcf
# extract header lines (^#)
# extract variants (non-header lines) and split into 40m-line partial files
# combine the header with each partial file
#
# for tuning
lines=40000000

shopt -s nullglob                 # an empty match yields an empty array
vcf_list=(chr[0-9]*.vcf)
if [ ${#vcf_list[@]} -eq 0 ]; then
    echo no .vcf files
    exit 1
fi

tmpv=variants
hdr=header

for chrfile in "${vcf_list[@]}"; do
    # isolate the name without the . extension
    base=${chrfile%%.*}
    echo $chrfile
    # extract header lines
    head -1000 $chrfile | grep "^#" > $hdr
    # extract variants
    grep -v "^#" $chrfile > $tmpv
    #
    # split variants into files with at most $lines lines;
    # output files are created with a filter to combine header data and
    # partial variant data in 1 pass, avoiding additional file I/O;
    # output files are named with a leading 'p' to support multiple
    # runs without filename collision
    #
    split -d -l $lines $tmpv p${base}_chunk --additional-suffix=.vcf \
        --filter="cat $hdr - > \$FILE; echo \" \$FILE\""
done

rm -f $tmpv $hdr
exit 0
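As a usage note (the script name split_chroms.sh below is just an illustration): split -d numbers the chunks from 00, so for chr001.vcf a run prints something like
$ ./split_chroms.sh
chr001.vcf
 pchr001_chunk00.vcf
 pchr001_chunk01.vcf
rather than the chr001_chunk1.vcf style in the question; the chunks can be renamed afterwards if the exact names matter.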

Bash Array Script Exclude Duplicates

So I have written a bash script (named music.sh) for a Raspberry Pi to perform the following functions:
When executed, look into one single directory (Music folder) and select a random folder to look into. (Note: none of these folders here have subdirectories)
Once a folder within "Music" has been selected, then play all mp3 files IN ORDER until the last mp3 file has been reached
At this point, the script would go back to the folders in the "Music" directory and select another random folder
Then it would again play all mp3 files in that folder in order
Loop indefinitely until input from user
I have this code which does all of the above EXCEPT for the following items:
I would like to NOT play any "album" that has been played before
Once all albums have been played once, then shut down the system
Here is my code so far that is working (WITH duplicates allowed):
#!/bin/bash
folderarray=($(ls -d /home/alphekka/Music/*/))
for i in "${folderarray[@]}";
do
    folderitems=(${folderarray[RANDOM % ${#folderarray[@]}]})
    for j in "${folderitems[@]}";
    do
        echo `ls $j`
        cvlc --play-and-exit "${j[@]}"
    done
done
exit 0
Please note that there isn't a single folder or file that has a space in the name. If there is a space, then I face some issues with this code working.
Anyways, I'm getting close, but I'm not quite there with the entire functionality I'm looking for. Any help would be greatly appreciated! Thank you kindly! :)
Use an associative array as a set. Note that this will work for all valid folder and file names.
#!/bin/bash
declare -A folderarray
# Each folder name is a key mapped to an empty string
for d in /home/alphekka/Music/*/; do
    folderarray["$d"]=
done
while [[ "${!folderarray[*]}" ]]; do
    # Get a list of the remaining folder names
    foldernames=( "${!folderarray[@]}" )
    # Pick a folder at random
    folder=${foldernames[RANDOM%${#foldernames[@]}]}
    # Remove the folder from the set
    # Must use single quotes; see below
    unset folderarray['$folder']
    for j in "$folder"/*; do
        cvlc --play-and-exit "$j"
    done
done
Dealing with keys that contain spaces (and possibly other special characters) is tricky. The quotes shown in the call to unset above are not syntactic quotes in the usual sense: because of the single quotes, $folder is not expanded by the shell before unset sees the subscript; unset evaluates the subscript itself, so the right key is removed even if it contains spaces.
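A small illustration of that point (my own example, not from the answer):
declare -A seen=( ["a b"]=1 )
k="a b"
unset seen['$k']     # the subscript is passed literally; unset expands $k itself
echo "${#seen[@]}"   # prints 0 - the key containing a space was removed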
Here's another solution: randomize the list of directories first, save the result in an array and then play (my script just prints) the files from each element of the array
MUSIC=/home/alphekka/Music
OLDIFS=$IFS
IFS=$'\n'
folderarray=($(ls -d $MUSIC/*/ | while read line; do echo $RANDOM $line; done | sort -n | cut -f2- -d' '))
for folder in ${folderarray[*]};
do
    printf "Folder: %s\n" $folder
    fileArray=($(find $folder -type f))
    for j in ${fileArray[@]};
    do
        printf "play %s\n" $j
    done
done
For the random shuffling I used this answer.
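As an aside, GNU shuf can do the shuffling in one step and, combined with NUL delimiters, also copes with spaces in names; a minimal sketch (assumes bash 4.4+ for mapfile -d and GNU shuf):
MUSIC=/home/alphekka/Music
# NUL-delimit the directory list, shuffle it, and read it into an array
mapfile -d '' folderarray < <(printf '%s\0' "$MUSIC"/*/ | shuf -z)
for folder in "${folderarray[@]}"; do
    printf "Folder: %s\n" "$folder"
    for j in "$folder"*; do
        printf "play %s\n" "$j"
    done
done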
One-liner solution with mpv, rl (randomlines), xargs, and find:
find /home/alphekka/Music/ -maxdepth 1 -type d -print0 | rl -d \0 | xargs -0 -l1 mpv

Bash - Removing files using Arrays

I have one flat directory. I have 2 arrays. Array1 stores the contents of the directory (all .PNG files). Array2 has six files; these six files also exist within Array1. How do I use Array2 to remove those 6 files from the directory? The two arrays are as follows:
array1= (`ls ${files}*.PNG`)
array2= $(find . ! -name 'PHOTO*')
Tried using a for loop but not sure how to proceed:
for files in $array2;do
rm -f files $array1
No space is allowed after the = in an assignment.
Don't parse ls. Your code will not work for files whose names contain whitespace.
array1=( "${files}"*.PNG )
Your array2 isn't an array; it's a string consisting of a sequence of file names separated by whitespace.
array2=( $(find . ! -name 'PHOTO*') )
Also, using find in an unquoted command substitution like this can fail for the same reason as parsing ls: the result undergoes word splitting and globbing. Use an extended pattern instead (activated by running shopt -s extglob):
array2=( !(PHOTO*) )
To iterate over the files in an array, you first need to expand the array into a sequence of words, one element per word:
for files in "${array2[@]}"; do
    rm -f "$files"
done
To run rm only once... (assuming no spaces in filenames)
unset rm_tmpfiles
if [ ${#tmpfiles[@]} -gt 0 ]; then
    for tmpfile in "${tmpfiles[@]}"; do
        rm_tmpfiles+="$tmpfile "
    done
    rm -f $rm_tmpfiles
fi
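A simpler way to run rm only once, and one that also copes with whitespace in names, is to expand the array directly; a sketch using the array2 built above:
# one rm invocation for the whole array; quoting keeps names with spaces intact
(( ${#array2[@]} )) && rm -f -- "${array2[@]}"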

Need bash to separate cat'ed string to separate variables and do a for loop

I need to get a list of files added to a master folder and copy only the new files to the respective backup folders. The path to each folder contains multiple folders, all named by numbers and only 1 level deep.
ie /tester/a/100
/tester/a/101 ...
diff -r typically returns "Only in /testing/a/101: 2093_thumb.png" per line in the generated diff.txt file.
NOTE: there is a space after the colon
I need to get the 101 from the path, and the filename, into separate variables, and then copy the files to the backup folders.
I need the lesserfolder variable to hold 101 without the colon,
and the mainfile variable to hold 2093_thumb.png, from each line of diff.txt, and then do the for loop; but I can't seem to get $file to behave. Each time I try to echo the variables I get all the wrong results.
#!/bin/bash
diff_file=/tester/diff.txt
mainfolder=/testing/a
bacfolder= /testing/b
diff -r $mainfolder $bacfolder > $diff_file
LIST=`cat $diff_file`
for file in $LIST
do
maindir=$file[3]
lesserfolder=
mainfile=$file[4]
# cp $mainfolder/$lesserFolder/$mainfile $bacfolder/$lesserFolder/$mainfile
echo $maindir $mainfile $lesserfolder
done
If I could just get the echo statement working the cp would work then too.
I believe this is what you want:
#!/bin/bash
diff_file=/tester/diff.txt
mainfolder=/testing/a
bacfolder=/testing/b
diff -r -q $mainfolder $bacfolder | egrep "^Only in ${mainfolder}" | awk '{print $3,$4}' > $diff_file
cat ${diff_file} | while read foldercolon mainfile ; do
    folderpath=${foldercolon%:}
    lesserFolder=${folderpath#${mainfolder}/}
    cp $mainfolder/$lesserFolder/$mainfile $bacfolder/$lesserFolder/$mainfile
done
But it is much more reliable (and much easier!) to use rsync for this kind of backup. For example:
rsync -a /testing/a/* /testing/b/
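A small aside on the rsync form (my own note): a trailing slash on the source means "the contents of", so it also picks up dotfiles and does not depend on the shell glob:
rsync -a /testing/a/ /testing/b/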
You could try a while read loop
diff -r $mainfolder $bacfolder | while read dummy dummy dir file; do
    echo "${dir%:}" "$file"   # ${dir%:} drops the trailing colon from the directory name
done
