How to sort content of arrays? - arrays

Ultimately, I want to get rid of the possibility of duplicate entries showing up in my array. The reason I'm doing this is because I'm working on a script that compares two directories, searches for, and deletes duplicate files. The potential duplicate files are stored in an array and the files are only deleted if they have the same name and checksum as the originals. So if there are duplicate entries, I wind up encountering minor errors where md5 either tries to find the checksum of a file that doesn't exist (because it was already deleted) or rm tries to delete a file that was deleted already.
Here's part of the script.
compare()
{
read -p "Please enter two directories: " dir1 dir2
if [[ -d "$dir1" && -d "$dir2" ]]; then
echo "Searching through $dir2 for duplicates of files in $dir1..."
else
echo "Invalid entry. Please enter valid directories." >&2
exit 1
fi
#create list of files in specified directory
while read -d $'\0' file; do
test_arr+=("$file")
done < <(find $dir1 -print0)
#search for all duplicate files in the home directory
#by name
#find checksum of files in specified directory
tmpfile=$(mktemp -p $dir1 del_logXXXXX.txt)
for i in "${test_arr[@]}"; do
Name=$(sed 's/[][?*]/\\&/g' <<< "$i")
if [[ $(find $dir2 -name "${Name##*/}" ! -wholename "$Name") ]]; then
[[ -f $i ]] || continue
find $dir2 -name "${Name##*/}" ! -wholename "$Name" >> $tmpfile
origray[$i]=$(md5sum "$i" | cut -c 1-32)
fi
done
#create list of duplicate file locations.
dupe_loc
#compare similarly named files by checksum and delete duplicates
local count=0
for i in "${!indexray[@]}"; do
poten=$(md5sum "${indexray[$i]}" | cut -c 1-32)
for i in "${!origray[@]}"; do
if [[ "$poten" = "${origray[$i]}" ]]; then
echo "${indexray[$count]} is a duplicate of a file in $dir1."
rm -v "${indexray[$count]}"
break
fi
done
count=$((count+1))
done
exit 0
}
dupe_loc is the following function.
dupe_loc()
{
if [[ -s $tmpfile ]]; then
mapfile -t indexray < $tmpfile
else
echo "No duplicates were found."
exit 0
fi
}
I figure the best way to solve this issue would be to use the sort and uniq commands to dispose of duplicate entries in the array. But even with process substitution, I encounter errors when trying to do that.

First things first. Bash array sorting has been answered here: How to sort an array in BASH
That said, I don't know that sorting the array will be much help. A simpler solution would be to wrap your md5 check and rm statements in an if statement:
if [ -f "${origarr[$i]}" ]; then #True if file exists and is a regular file.
#file exists
...
rm "${origarr[$i]}"
fi
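If removing the duplicate entries themselves is still wanted, one possible approach (a sketch, not the asker's code) is to filter the array through an associative array of already-seen paths. This needs bash 4 or later; the sample paths and the names seen and unique are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the list of candidate files.
indexray=("/tmp/b dir/file.txt" "/tmp/a.txt" "/tmp/b dir/file.txt" "/tmp/a.txt" "/tmp/c.txt")

declare -A seen=()   # path -> 1 once the path has been encountered
unique=()            # first occurrence of each path, original order kept

for path in "${indexray[@]}"; do
    # ${seen[$path]+x} expands to "x" only if the key is already set
    if [[ -z ${seen[$path]+x} ]]; then
        seen[$path]=1
        unique+=("$path")
    fi
done

printf '%s\n' "${unique[@]}"
```

Unlike piping through sort and uniq, this keeps the original order and does not depend on how newlines in file names are handled.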

Related

How to iterate a file filled with file and directory paths to check what they are

I have a file, called for example fileAndFolderPaths, and in another script I have to iterate over it and check whether each line is a file path or a folder path.
fileAndFolderPaths
/opt/sampleFolder
/opt/sampleFolder/aText.txt
/opt/otherFolder
Then my script file is something like this:
myScript.sh
#!/bin/bash
mapfile -t array < /tmp/fileAndFolderPaths
function checkIfFilesOrFolder(){
for i in "${array[@]}"; do
if [ -f $i ]; then
echo -e "[Info] found the file: $i"
elif [ -d $i ]; then
echo -e "[Info] found the directory: $i"
else
echo -e "[Error] Nor directory or file were found based on this value: $i"
fi
done
}
checkIfFilesOrFolder
exit 0;
The problem is the check only works for the last line of the array created by the mapfile command. Any thoughts about that? I'm new to shell scripting so probably this is a really basic problem, but even so I wasn't able to fix it yet.
A couple of review suggestions, if you don't mind:
You don't need the global variable: pass the filename to the function and loop over the file:
checkIfFilesOrFolder() {
local file=$1
while IFS= read -r line; do
# test "$line" here ...
done < "$file"
}
checkIfFilesOrFolder /tmp/fileAndFolderPaths
I recommend using local for function variables, to minimize polluting the global namespace.
Always quote your variables, unless you're aware of exactly what expansions occur on them unquoted:
if [ -f "$line" ]; then ...
Is there a reason you're using echo -e? The common advice is to use printf instead:
printf '[Info] found the file: %s\n' "$line"
Interesting reading: Why is printf better than echo?
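Putting these suggestions together, a complete version might look like the sketch below. The temporary directory and list file are hypothetical stand-ins for /tmp/fileAndFolderPaths. Stripping a trailing carriage return is an assumption on my part: a list saved with Windows line endings is one common reason why only the last line of such a file matches, but the question does not confirm that this is the cause.

```shell
#!/usr/bin/env bash
checkIfFilesOrFolder() {
    local file=$1 line
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line%$'\r'}   # assumption: tolerate CRLF line endings
        if [ -f "$line" ]; then
            printf '[Info] found the file: %s\n' "$line"
        elif [ -d "$line" ]; then
            printf '[Info] found the directory: %s\n' "$line"
        else
            printf '[Error] neither a file nor a directory: %s\n' "$line" >&2
        fi
    done < "$file"
}

# Usage with a throwaway list (hypothetical paths)
tmpdir=$(mktemp -d)
touch "$tmpdir/aText.txt"
printf '%s\r\n' "$tmpdir" "$tmpdir/aText.txt" "$tmpdir/missing" > "$tmpdir/list"
output=$(checkIfFilesOrFolder "$tmpdir/list" 2>&1)
printf '%s\n' "$output"
rm -r "$tmpdir"
```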

Search for a substring within listed files with spaces from multiple directories in bash

I want to create a script that loops over multiple directories from an array and, if the files there, which are not in the blacklist, are older than a certain time period, remove them. The problem is that any type of string comparison (whether grep -q or wildcards) doesn't work when trying to list a directory with files that contain spaces in them (so I change the $IFS value to loop through them), making the script unusable. Blacklisted strings can also have spaces in them, of course.
Here's what I wrote so far:
#!/bin/bash
declare -a dirs=(~/path/to/dir1/* ~/path/to/dir2/*)
declare -a blacklist=("file number 1" "file number 2" "file number 3")
saveifs=$IFS
IFS=$'\n'
echo "Starting the autocleaner..."
for dirname in "${dirs[@]}"; do
for filename in $(ls "$dirname"); do
for excluded in ${blacklist[@]}; do
if [ -e $filename ]; then
if echo "$filename" | grep -q "$excluded"; then
# if [[ "$filename" == *"$excluded"* ]]; then
:
else
if test `find "$filename" -mtime +1`; then
# rm -f $filename
echo "File $filename removed."
else
echo "File $filename is up-to-date and doesn't need to be removed."
fi
fi
else
:
fi
done
done
done
IFS=$saveifs
How can I make the comparison actually work?
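Have you tried using single square brackets [ ... ] for the comparison line? Note, though, that the pattern match in your commented-out line (== *"$excluded"*) only works inside double square brackets [[ ... ]]. Reading about the difference between [ ... ] and [[ ... ]] may help you.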
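As a rough sketch of how the comparison can be made robust against spaces, the loop below iterates over a glob instead of the output of ls, quotes every expansion, and checks the blacklist in a small helper function. The directory, the file names, and the helper name is_blacklisted are made up for illustration, and the sketch only collects removal candidates rather than deleting anything.

```shell
#!/usr/bin/env bash
shopt -s nullglob

blacklist=("file number 1" "file number 2")   # hypothetical entries

# Return 0 if the given name contains any blacklisted substring.
is_blacklisted() {
    local name=$1 excluded
    for excluded in "${blacklist[@]}"; do
        [[ $name == *"$excluded"* ]] && return 0
    done
    return 1
}

workdir=$(mktemp -d)                          # stand-in for ~/path/to/dir1
touch "$workdir/file number 1.txt" "$workdir/some other file.txt"

kept=()
candidates=()
for path in "$workdir"/*; do                  # glob keeps names with spaces intact
    [[ -f $path ]] || continue
    if is_blacklisted "${path##*/}"; then
        kept+=("$path")
    elif [[ -n $(find "$path" -mtime +1) ]]; then
        candidates+=("$path")                 # old enough: would be removed
    else
        kept+=("$path")
    fi
done

printf 'blacklisted or recent: %s\n' "${kept[@]##*/}"
printf 'removal candidates: %d\n' "${#candidates[@]}"
rm -r "$workdir"
```

No IFS changes are needed, because nothing in the loop relies on word splitting.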

Script to group numbered files into folders

I have around a million files in one folder in the form xxxx_description.jpg, where xxxx is a number ranging from 100 to an unknown upper limit.
The list is similar to this:
146467_description1.jpg
146467_description2.jpg
146467_description3.jpg
146467_description4.jpg
14646_description1.jpg
14646_description2.jpg
14646_description3.jpg
146472_description1.jpg
146472_description2.jpg
146472_description3.jpg
146500_description1.jpg
146500_description2.jpg
146500_description3.jpg
146500_description4.jpg
146500_description5.jpg
146500_description6.jpg
To get the number of files in that folder down, I'd like to put them all into folders grouped by the number at the start.
i.e.:
146467/146467_description1.jpg
146467/146467_description2.jpg
146467/146467_description3.jpg
146467/146467_description4.jpg
14646/14646_description1.jpg
14646/14646_description2.jpg
14646/14646_description3.jpg
146472/146472_description1.jpg
146472/146472_description2.jpg
146472/146472_description3.jpg
146500/146500_description1.jpg
146500/146500_description2.jpg
146500/146500_description3.jpg
146500/146500_description4.jpg
146500/146500_description5.jpg
146500/146500_description6.jpg
I was thinking to try and use command line: find | awk {} | mv command or maybe write a script, but I'm not sure how to do this most efficiently.
If you really are dealing with millions of files, I suspect that a glob (*.jpg or [0-9]*_*.jpg) may fail because it makes a command line that's too long for the shell. If that's the case, you can still use find. Something like this might work:
find /path -name "[0-9]*_*.jpg" -exec sh -c 'f=$1; b=${f##*/}; mkdir -p "/target/${b%%_*}"; mv "$f" "/target/${b%%_*}/"' sh {} \;
Broken out for easier reading, this is what we're doing:
find /path - run find, with /path as a starting point,
-name "[0-9]*_*.jpg" - match files that match this filespec in all directories,
-exec sh -c - execute the following on each file...
'f=$1; - put the filename into a variable (it is passed as an argument rather than pasted into the script text, so unusual names cannot break the quoting)...
b=${f##*/}; - strip the leading directories so only the file name remains...
mkdir -p "/target/${b%%_*}"; - make a target directory named after the leading number (read mkdir's man page about the -p option)
mv "$f" "/target/${b%%_*}/"' - move the file into the directory.
sh {} - give the inline script its name ($0) and the found path ($1),
\; - end the -exec expression
On the up side, it can handle any number of files that find can handle (i.e. limited only by your OS). On the down side, it's launching a separate shell for each file to be handled.
Note that the above answer is for Bourne/POSIX/Bash. If you're using CSH or TCSH as your shell, the following might work instead:
#!/bin/tcsh
foreach f (*_*.jpg)
set split = ($f:as/_/ /)
mkdir -p "$split[1]"
mv "$f" "$split[1]/"
end
This assumes that the filespec will fit in tcsh's glob buffer. I've tested with 40000 files (894KB) on one command line and not had a problem using /bin/sh or /bin/csh in FreeBSD.
Like the Bourne/POSIX/Bash parameter expansion solution above, this avoids unnecessary calls to external programs. I haven't tested that, though, and would recommend the find solution even though it's slower.
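The cost of one shell per file can be reduced by ending -exec with + instead of \;, so that find hands many paths to each shell invocation. The following is only a sketch under a few assumptions: src is a hypothetical temporary directory with made-up sample files, the folders are created next to the files rather than under /target, and -maxdepth 1 (supported by GNU and BSD find, though not required by POSIX) keeps find from descending into the folders it has just created.

```shell
#!/usr/bin/env bash
src=$(mktemp -d)   # hypothetical stand-in for the real folder
touch "$src/146467_description1.jpg" "$src/146467_description2.jpg" \
      "$src/14646_description1.jpg"

find "$src" -maxdepth 1 -type f -name '[0-9]*_*.jpg' -exec sh -c '
    for f in "$@"; do
        dir=${f%/*}        # directory containing the file
        base=${f##*/}      # file name without the path
        num=${base%%_*}    # leading number, up to the first underscore
        mkdir -p "$dir/$num" && mv "$f" "$dir/$num/"
    done
' sh {} +

find "$src" -type f | sort   # show the resulting layout
```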
You can use this script:
for i in [0-9]*_*.jpg; do
p=`echo "$i" | sed 's/^\([0-9]*\)_.*/\1/'`
mkdir -p "$p"
mv "$i" "$p"
done
Using grep
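for file in *.jpg;
do
# extract the leading digits of the file name
dirName=$(echo "$file" | grep -oE '^[0-9]+')
# skip files that do not start with a number
[[ -n $dirName ]] || continue
[[ -d $dirName ]] || mkdir "$dirName"
mv "$file" "$dirName"
done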
grep -oE '^[0-9]+' extracts the starting digits in the filename as
146467
146467
146467
146467
14646
...
[[ -d $dirName ]] succeeds (exit status 0) if the directory exists
The || ensures that mkdir runs only if the test [[ -d $dirName ]] fails, that is, if the directory does not exist

Bash script - how to fill array?

Let's say I have this directory structure:
DIRECTORY:
.........a
.........b
.........c
.........d
What I want to do is: I want to store elements of a directory in an array
something like : array = ls /home/user/DIRECTORY
so that array[0] contains name of first file (that is 'a')
array[1] == 'b' etc.
Thanks for help
You can't simply do array = ls /home/user/DIRECTORY, because - even with proper syntax - it wouldn't give you an array, but a string that you would have to parse, and Parsing ls is punishable by law. You can, however, use built-in Bash constructs to achieve what you want :
#!/usr/bin/env bash
readonly YOUR_DIR="/home/daniel"
if [[ ! -d $YOUR_DIR ]]; then
echo >&2 "$YOUR_DIR does not exist or is not a directory"
exit 1
fi
OLD_PWD=$PWD
cd "$YOUR_DIR"
i=0
for file in *
do
if [[ -f $file ]]; then
array[$i]=$file
i=$(($i+1))
fi
done
cd "$OLD_PWD"
exit 0
This small script saves the names of all the regular files (which means no directories, sockets, and such; note that the -f test follows symbolic links, so links pointing to regular files are included) that can be found in $YOUR_DIR to the array called array.
Hope this helps.
Option 1, a manual loop:
dirtolist=/home/user/DIRECTORY
shopt -s nullglob # In case there aren't any files
contentsarray=()
for filepath in "$dirtolist"/*; do
contentsarray+=("$(basename "$filepath")")
done
shopt -u nullglob # Optional, restore default behavior for unmatched file globs
Option 2, using bash array trickery:
dirtolist=/home/user/DIRECTORY
shopt -s nullglob
contentspaths=("$dirtolist"/*) # This makes an array of paths to the files
contentsarray=("${contentspaths[@]##*/}") # This strips off the path portions, leaving just the filenames
shopt -u nullglob # Optional, restore default behavior for unmatched file globs
array=($(ls /home/user/DIRECTORY))
Then
echo "${array[0]}"
prints the name of the first file in that directory. Note that this approach breaks on names containing whitespace or glob characters.
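To make the difference between the ls-based approach and the glob-based one concrete, here is a small comparison. The temporary directory and file names are made up for illustration.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
touch "$demo/plain" "$demo/two words"

# Word splitting turns "two words" into two separate elements.
from_ls=($(ls "$demo"))

# The glob yields exactly one element per file.
shopt -s nullglob
paths=("$demo"/*)
from_glob=("${paths[@]##*/}")
shopt -u nullglob

printf 'ls:   %d elements\n' "${#from_ls[@]}"     # prints "ls:   3 elements"
printf 'glob: %d elements\n' "${#from_glob[@]}"   # prints "glob: 2 elements"
rm -r "$demo"
```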

Looping through an Array in bash

I am currently attempting to create a bash script that will check inside each user's Library/Mail folder to see if a folder named V2 exists. The script should create an array with each item in the array being a user and then iterate through each of these users, checking their home folder for that folder. This is what I have so far:
#!/bin/bash
cd /Users
array=($(ls))
for i in ${array[@]}
do
if [ -d /$i/Library/Mail/V2 ]
then
echo "$i mail has been upgraded."
else
echo "$i FAIL"
fi
done
Populating your array from the output of ls is going to make for serious problems when you have a username with spaces. Use a glob expression instead. Also, using [ -d $i/... ] will similarly break on names with spaces -- either use [[ -d $i/... ]] (the [[ ]] construct has its own syntax rules and doesn't require quoting) or [ -d "$i/..." ] (with the quotes).
Similarly, you need to double-quote "${array[@]}" to keep word splitting from breaking names with spaces in two, as follows:
cd /Users
array=(*)
for i in "${array[@]}"; do
if [[ -d $i/Library/Mail/V2 ]]; then
echo "$i mail has been upgraded."
else
echo "$i FAIL"
fi
done
That said, you don't really need an array here at all:
for i in *; do
...check for $i/Library/Mail/V2...
done
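A runnable sketch of that array-free loop could look like this. users_root is a hypothetical temporary directory standing in for /Users so the example is safe to try, and the two account names are invented. The trailing slash in the glob is an addition of mine that restricts the match to directories.

```shell
#!/usr/bin/env bash
users_root=$(mktemp -d)   # hypothetical stand-in for /Users
mkdir -p "$users_root/alice/Library/Mail/V2" "$users_root/bob smith/Library/Mail"

shopt -s nullglob
upgraded=()
pending=()
for home in "$users_root"/*/; do      # trailing slash: directories only
    user=${home%/}                    # drop the trailing slash
    user=${user##*/}                  # keep only the last path component
    if [[ -d ${home}Library/Mail/V2 ]]; then
        upgraded+=("$user")
    else
        pending+=("$user")
    fi
done
shopt -u nullglob

printf '%s mail has been upgraded.\n' "${upgraded[@]}"
printf '%s FAIL\n' "${pending[@]}"
rm -r "$users_root"
```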
