How to diff md5 sums of two filesystem states?

I'm collecting md5sum snapshots of the same filesystem at two different points in time (i.e., before and after an infection). I need to diff these two states to see which files changed between those two points in time.
To collect these states I might do the following (on macOS with SIP turned off):
sudo gfind / ! -path '*/dev/*' ! -path '*/Network/*' ! -path '*/Volumes/*' ! -path '*/.fseventsd/*' ! -path '*/.Spotlight-V100/*' -type f -exec md5sum {} \; > $(date "+%y%m%d%H%M%S").system_listing
The problem I'm having is that the resultant files are around 100 MB apiece, and using diff by itself seems to compare chunks instead of each individual file's md5sum in the output.
Is there an efficient way of using diff tools to do this, or is it necessary to write a script that compares the two files by filename path, effectively recreating diff with the path as the unique comparator and reporting based on the associated md5sum?

The order in which the directories appear can produce a lot of noisy diff output.
For example, I ran the following two commands, diffing two directories full of PDFs, one with a single file and the other with tens of files.
Swapping the directory order produces two diff lines, when what we want is for the diff to report no difference at all.
find books/ docs-pdf/ -type f -exec md5sum {} \; > snapshot1
find docs-pdf/ books/ -type f -exec md5sum {} \; > snapshot2
diff snapshot1 snapshot2
--- snapshot1
+++ snapshot2
@@ -1,4 +1,3 @@
-83322cb1aaa94f9c8e87925f9d2a695e books/ModSimPy.pdf
192e5d38e59d8295ec9ca715e784a6d0 docs-pdf/c-api.pdf
76c5bfb41bc6e5f9c8da1ab1f915e622 docs-pdf/distributing.pdf
0a630ec314653c68153f5bbc4446660c docs-pdf/extending.pdf
@@ -25,3 +24,4 @@
31e3dc3f78a12c59cdc0426d8e75ec99 docs-pdf/tutorial.pdf
4c59e969009b6c3372804efdfc99e2d9 docs-pdf/using.pdf
cf5330f4ed5ca5f63f300ccfa3057825 docs-pdf/whatsnew.pdf
+83322cb1aaa94f9c8e87925f9d2a695e books/ModSimPy.pdf
After sorting by the second column, diff correctly reports no difference:
sort -k2 snapshot1 >sorted.snapshot1
sort -k2 snapshot2 >sorted.snapshot2
diff sorted.snapshot1 sorted.snapshot2
If this does not remove all the noisy diff output, please post the pieces of the example output you do not want.
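If, beyond suppressing the ordering noise, you also want a per-path report of what changed between the two snapshots, a minimal sketch along these lines may help. It reuses the snapshot1/snapshot2 names from the example above and assumes paths contain no whitespace, since join and awk split on blanks:
# Sort both snapshots by path (field 2 onward)
sort -k2 snapshot1 > sorted.snapshot1
sort -k2 snapshot2 > sorted.snapshot2
# Paths present in both snapshots whose checksum differs
join -j 2 sorted.snapshot1 sorted.snapshot2 | awk '$2 != $3 {print "changed: " $1}'
# Paths present in only one of the snapshots
join -v 1 -j 2 sorted.snapshot1 sorted.snapshot2 | awk '{print "only in snapshot1: " $1}'
join -v 2 -j 2 sorted.snapshot1 sorted.snapshot2 | awk '{print "only in snapshot2: " $1}'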

Related

Putting files in directory into array variable

I'm writing bash code that will search for specific files in the directory it is run in and add them to an array variable. The problem I am having is formatting the results. I need to find all the compressed files in the current directory and display both the names and sizes of the files in order of last modification. I want to take the results of that command and put them into an array variable, with each element containing a file's name and corresponding size, but I don't know how to do that. I'm not sure whether I should be using find instead of ls, but here is what I have so far:
find_files="$(ls -1st --block-size=MB)"
arr=( ($find_files) )
I'm not sure exactly what format you want the array to be in, but here is a snippet that creates an associative array keyed by filename with the size as the value:
$ ls -l test.{zip,bz2}
-rw-rw-r-- 1 user group 0 Sep 10 13:27 test.bz2
-rw-rw-r-- 1 user group 0 Sep 10 13:26 test.zip
$ declare -A sizes; while read SIZE FILENAME ; do sizes["$FILENAME"]="$SIZE"; done < <(find * -prune -name '*.zip' -o -name '*.bz2' | xargs stat -c "%Y %s %N" | sort | cut -f 2,3 -d " ")
$ echo "${sizes[@]@A}"
declare -A sizes=(["'test.zip'"]="0" ["'test.bz2'"]="0" )
$
And if you just want an array of literally "filename size" entries, that's even easier:
$ while read SIZE FILENAME ; do sizes+=("$FILENAME $SIZE"); done < <(find * -prune -name '*.zip' -o -name '*.bz2' | xargs stat -c "%Y %s %N" | sort | cut -f 2,3 -d " ")
$ echo "${sizes[@]@A}"
declare -a sizes=([0]="'test.zip' 0" [1]="'test.bz2' 0")
$
Both of the solutions below work, and were tested via copy and paste from this post.
The first is fairly slow. One problem is external program invocations within the loop: date, for example, is invoked for every file. You could make it quicker by not including the date in the output array (see Notes below). Particularly for method 2, that would result in no external command invocations inside the while loop. But method 1 is really the problem - orders of magnitude slower.
Also, somebody probably knows how to convert an epoch date to another format in awk for example, which could be faster. Maybe you could do the sort in awk too. Perhaps just keep the epoch date?
These solutions are bash / GNU heavy and not portable to other environments (bash here strings, find -printf). OP tagged linux and bash though, so GNU can be assumed.
Solution 1 - capture any compressed file - using file to match (slow)
The criterion for 'compressed' is whether the file output contains the word compress
Reliable enough, but perhaps there is a conflict with some other file type description?
file -l | grep compress (file 5.38, Ubuntu 20.04, WSL) indicates for me there are no conflicts at all (all files listed are compression formats)
I couldn't find a way of classifying any compressed file other than this
I ran this on a directory containing 1664 files - time (real) was 40 seconds
#!/bin/bash
# Capture all files, recursively, in $TARGET, that are
# compressed files. In an indexed array. Using file name
# extensions to match.
# Initialise variables, and check the target is valid
declare -g c= compressed_files= path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1
# Make the array
# Process substitution (< <(...)) is used, to keep the array in the current shell environment
while IFS= read -r -d '' path; do
[[ "$(file --brief "${path%% *}")" == *compress* ]] &&
compressed_files[c++]="${path% *} $(date -d @${path##* })"
done < \
<(
find "$TARGET" -type f -printf '%p %s %T#\0' |
awk '{$2 = ($2 / 1024); print}' |
sort -n -k 3
)
# Print results - to test
printf '%s\n' "${compressed_files[@]}"
Solution 2 - use file extensions - orders of magnitude faster
If you know exactly what extensions you are looking for, you can
compose them in a find command
This is a lot faster
On the same directory as above, containing 1664 files - time (real) was 200 milliseconds
This example looks for .gz, .zip, and .7z (gzip, zip and 7zip respectively)
I'm not sure if -type f -and -regex '.*[.]\(gz\|zip\|7z\)' -and -printf might be faster still, now that I think of it. I started with globs because I assumed that would be quicker
That would also allow storing the extension list in a variable.
This method avoids a file analysis on every file in your target
It also makes the while loop shorter - you're only iterating matches
Note the repetition of -printf here; this is due to the logic that
find uses: -printf is always 'true'. If it were included by itself, it would
act as a 'match' and print all files
It has to be applied as the result of a name match being true (using -and)
Perhaps somebody has a better composition?
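One possible composition (a sketch only, not benchmarked against the script below) groups the -name tests so -printf appears just once and still fires only after a successful match:
find "$TARGET" -type f \
    \( -name '*.gz' -o -name '*.zip' -o -name '*.7z' \) \
    -printf '%p %s %T@\0'
This grouped expression could feed the same awk | sort pipeline used below, and it would also make it easier to build the extension list from a variable.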
#!/bin/bash
# Capture all files, recursively, in $TARGET, that are
# compressed files. In an indexed array. Using file name
# extensions to match.
# Initialise variables, and check the target is valid
declare -g c= compressed_files= path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1
while IFS= read -r -d '' path; do
compressed_files[c++]="${path% *} $(date -d @${path##* })"
done < \
<(
find "$TARGET" \
-type f -and -name '*.gz' -and -printf '%p %s %T@\0' -or \
-type f -and -name '*.zip' -and -printf '%p %s %T@\0' -or \
-type f -and -name '*.7z' -and -printf '%p %s %T@\0' |
awk '{$2 = ($2 / 1024); print}' |
sort -n -k 3
)
# Print results - for testing
printf '%s\n' "${compressed_files[@]}"
Sample output (of either method):
$ comp-find.bash /tmp
/tmp/comptest/websters_english_dictionary.tmp.tar.gz 265.148 Thu Sep 10 07:53:37 AEST 2020
/tmp/comptest/What_is_Systems_Architecture_PART_1.tar.gz 1357.06 Thu Sep 10 08:17:47 AEST 2020
Note:
You can add a literal K to indicate the block size / units (kilobytes)
If you want to print the path only from this array, you can use suffix removal: printf '%s\n' "${compressed_files[@]%% *}"
For no date in the array (it's used to sort, but then its job may be done), simply remove $(date -d @${path##* }) (incl. the space).
Kind of tangential, but to use different date formats, replace $(date -d @${path##* }) with:
$(date -I -d @${path##* }) ISO format - note that the short-opts style date -Id @[date] did not work for me
$(date -d @${path##* } +%Y-%m-%d_%H-%M-%S) like ISO, but w/ seconds
$(date -d @${path##* } +%Y-%m-%d_%H-%M-%S.%N) same again, but w/ nanoseconds (find gives you nanoseconds)
Sorry for the long post, hopefully it's informative.

How to cat similar named sequence files from different directories into single large fasta file

I am trying to get the following done. I have circa 40 directories of different species, each with hundreds of sequence files that contain orthologous sequences. The sequence files are named the same way in each of the species directories. I want to concatenate the identically named files from the 40 species directories into a single sequence file with the same name.
My data looks as follows, e.g.:
directories: Species1 Species2 Species3
Within directory (similar for all): sequenceA.fasta sequenceB.fasta sequenceC.fasta
I want to get single files named: sequenceA.fasta sequenceB.fasta sequenceC.fasta
where the content of the different files from the different species is concatenated.
I tried to solve this with a loop (but this never ends well with me!):
ls . | while read FILE; do cat ./*/"$FILE" >> ./final/"$FILE"; done
This resulted in empty files and errors. I did try to find a solution elsewhere, e.g.: (https://www.unix.com/unix-for-dummies-questions-and-answers/249952-cat-multiple-files-according-file-name.html, https://unix.stackexchange.com/questions/424204/how-to-combine-multiple-files-with-similar-names-in-different-folders-by-using-u) but I have been unable to adapt them to my case.
Could anyone give me some help here? Thanks!
In a root directory where your species directories reside, you should run the following:
$ mkdir output
$ find Species* -type f -name "*.fasta" -exec sh -c 'cat {} >> output/`basename {}`' \;
It traverses all the files recursively and merges the contents of files with an identical basename into one file under the output directory.
EDIT: even though this was an accepted answer, in a comment the OP mentioned that the real directories don't match a common pattern Species* as shown in the original question. In this case you can use this:
$ find -type f -not -path "./output/*" -name "*.fasta" -exec sh -c 'cat {} >> output/`basename {}`' \;
This way, we don't specify the search pattern but rather explicitly omit output directory to avoid duplicates of already processed data.
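If any of the filenames might contain spaces or other characters special to the shell, a slightly safer sketch of the same idea passes each path to the inner shell as a positional parameter instead of splicing {} into the command string:
mkdir -p output
find . -type f -not -path "./output/*" -name "*.fasta" \
    -exec sh -c 'cat "$1" >> "output/$(basename "$1")"' sh {} \;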

Rename all files found by extension in for loop

My task is to write a script that searches for all files without the .old extension within a given directory and renames them to this format: filename.old. I've tried this script:
#!/bin/bash
for i in $(grep "\.[^old]&" $1 | ls)
do
mv "$1/$i" "$1/$i.old"
done
but it gives a wrong output.
These files were in my directory: f1, f2.old, f3, f4.old.
Expected output: f1.old, f2.old, f3.old, f4.old.
My output (1st launch): f1.old, f2.old.old, f3.old, f4.old.old.
Each time I launch the script it keeps adding another .old extension, so it becomes like this:
My output (2nd launch): f1.old.old, f2.old.old.old, f3.old.old, f4.old.old.old.
How can this be improved?
You could use a one-liner like so:
find . -mindepth 1 ! -name '*.old' -exec mv {} {}.old \;
Example on GNU/Linux (Ubuntu 14.04 LTS):
mkdir so
cd so
touch f1 f2.old f3 f4.old
find . -mindepth 1 ! -name '*.old' -exec mv {} {}.old \;
ls
Result:
f1.old f2.old f3.old f4.old
Explanation:
find . means find in current directory
-mindepth 1 will return the files without returning the current directory . (see https://askubuntu.com/questions/153770/how-to-have-find-not-return-the-current-directory)
! -name '*.old' will skip any files ending with .old
-exec mv executes the mv (move) command on each returned file, denoted by {}, and adds an extension to it with {}.old, meaning whatever-filename-was-returned.old
You can modify your script like so to get similar result:
test.sh.old
#!/bin/bash
for i in $(find . -mindepth 1 ! -name '*.old'); do
mv "$i" "$i.old"
done
Execute with bash test.sh.old to get similar results.
You may have to try some test cases to see if the one-liner and the modified test.sh.old file pass those test conditions. I tested them with the sample you provided and they return the desired results.
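One caveat: the $(find ...) loop word-splits its output, so filenames containing spaces would break it. A plain-bash sketch that avoids this (assuming you only need the top level of the directory passed as $1, as in the original script) uses a glob instead:
#!/bin/bash
# Rename every regular file in "$1" that does not already end in .old
for f in "$1"/*; do
    [[ -f $f && $f != *.old ]] && mv -- "$f" "$f.old"
done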

sort elements read into array

While reading find results into an array I want them sorted at the same time (mp3's, so by track number, which is the first part of the file name), and thought something like this should do the trick:
mp3s=()
while read -r -d $'\0'; do
mp3s+=("$REPLY")
done < <(sort <(find "$mp3Dir" -type f -name '*.mp3' -print0))
but the elements in the array are never sorted correctly (by first part of file name which is mp3 track number: 01_..., 02_..., 03_..., etc.)
Although the following gets the job done, it seems unnecessarily awkward:
mp3s=()
while read -r -d $'\0'; do
mp3s+=("$REPLY")
done < <(find "$mp3Dir" -type f -name '*.mp3' -print0)
mp3s=( $(for f in "${mp3s[@]}" ; do
echo "$f"
done | sort) )
There must be a more streamlined way to get this done, along similar lines to what I was thinking in the first example, no? I have tried running the output through sort on both sides of the find command, using its numerous options for sorting (-n, -d, etc.), but without any luck (so far).
So, is there a more efficient way to incorporate a sort command while the array is initially being populated?
By default, sort assumes newline-separated records. The call to find, however, specifies nul-separated output. The solution is to add the -z flag to sort. This tells sort to expect nul-separated input and produce nul-separated output. Thus, try:
mp3s=()
while read -r -d $'\0'; do
mp3s+=("$REPLY")
done < <(sort -z <(find "$mp3Dir" -type f -name '*.mp3' -print0))
Example
Suppose that we have these mp3 files:
$ find "." -type f -name '*.mp3' -print0
./music1/d b2.mp3./music1/a b1.mp3./music1/a b2.mp3./music1/d b1.mp3./music1/a b3.mp3./music1/d b3.mp3
First, try sort:
$ sort <(find "." -type f -name '*.mp3' -print0)
./music1/d b2.mp3./music1/a b1.mp3./music1/a b2.mp3./music1/d b1.mp3./music1/a b3.mp3./music1/d b3.mp3
The files remain unordered.
Now, try sort -z:
$ sort -z <(find "." -type f -name '*.mp3' -print0)
./music1/a b1.mp3./music1/a b2.mp3./music1/a b3.mp3./music1/d b1.mp3./music1/d b2.mp3./music1/d b3.mp3
The files are now in order.
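If your bash is 4.4 or newer, the read loop itself can be dropped; this is a sketch of the same sort -z idea using mapfile with a NUL delimiter (mp3Dir is the variable from the question):
mp3s=()
# -d '' splits on NUL bytes, -t strips the delimiter from each element
mapfile -d '' -t mp3s < <(find "$mp3Dir" -type f -name '*.mp3' -print0 | sort -z)
printf '%q\n' "${mp3s[@]}"   # inspect the result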
One way to do the sorting internally to bash is to use an associative array and put your data in keys, rather than values.
declare -A mp3s=()
while IFS= read -r -d ''; do
mp3s[$REPLY]=1
done < <(find "$mp3Dir" -type f -name '*.mp3' -print0)
...then, to iterate over the values:
for mp3 in "${!mp3s[@]}"; do
printf '%q\n' "$mp3"
done
As associative arrays are a feature added in bash 4.0, note that this functionality isn't available in 3.2 (which is still in use in some circles, most particularly MacOS).

get a list of files that are different between two filetrees

I'm comparing two very large filesystems (to do with a migration) and diff -qr was great, but now, since the users have been using the new location, the files have changed. Is there a way to use diff, grep or anything else to compare only whether the file exists, i.e. ignore the fact that files differ? My latest diff has a lot of:
Only in /dir1/myFile
Only in /dir2/myFile
in it. Is there an easy way to use grep to show only the files that exist in dir1 but don't exist at all in dir2, or to do something similar with diff?
You can use:
tree /path/to/dir1 > out1
tree /path/to/dir2 > out2
diff out1 out2 | grep ">"
but I find Beyond Compare more suitable for a job like this.
You may also use the command comm
comm -1 -2 <(ls /path_to_dir-1/) <(ls /path_to_dir-2/)
Check the following link for additional info:
http://nixtricks.wordpress.com/2010/01/11/unix-compare-the-contents-of-two-directories-in-the-command-line/
The one-liners above are great, but this bash script allows you more customization: you can change -e to ! -e, or skip certain files.
for f in $(find old/ -type f)
do
    # strip the leading "old/" so the path is checked under new/
    if [ -e "new/${f#old/}" ]
    then
        echo "${f} exists"
    fi
done
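For the original question (files that exist under dir1 but not at all under dir2), a sketch based on comm and relative path listings may be the most direct route; /path/to/dir1 and /path/to/dir2 are placeholders:
# comm -23 prints lines unique to the first input; both listings must be sorted
comm -23 <(cd /path/to/dir1 && find . -type f | sort) \
         <(cd /path/to/dir2 && find . -type f | sort)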
