String matching between files in Linux

I have two files that I want to compare. The first is tab separated, the second is comma separated and both begin with an ID. I want to match those IDs and do two things. First, I want to print out all of the ones that match between the two files. Then (if possible) I want to print to a separate file all of those that do not match.
The files look like this:
(comma separated)
S-3DFSG,0,254654,3,e /// x, /// 5
S-8FGDG,6,464782,6,i /// n /// n /// e /// n, /// /
S-4SKDH,0,445676,3,n /// e /// p, /// /// F
(tab separated)
S-3DGSF DG 2 5 7 DF 2 2 4684648654
S-4GXBG DF 6 2 4 FD 7 1 2415244459
S-3DFST GA 0 8 4 CF 9 8 2
I tried
grep -F -wf file1 file2 > incommon.txt
(grep with -F for fixed strings, -w to match only whole words, and -f file1 to read the patterns from the first file.)
But I got no output...
Does anyone have any suggestions on how I can improve this? I did think about regex but I am not terribly proficient in its use. I wouldn't mind using it though.

analyze.py:
import re

# Read both files
with open('tab.txt', 'r') as f:
    data_tab = f.read()
with open('csv.txt', 'r') as f:
    data_csv = f.read()

# Grab the first field of every line: everything up to the first tab / comma
matches_tab = re.findall(r'^([^\t]+)', data_tab, re.M)
matches_csv = re.findall(r'^([^,]+)', data_csv, re.M)

# IDs present in both files, and IDs present in only one of them
common = set(matches_tab) & set(matches_csv)
not_common = set(matches_tab) ^ set(matches_csv)

with open('common.txt', 'w') as f:
    for el in common:
        f.write(el + '\n')

with open('not_common.txt', 'w') as f:
    for el in not_common:
        f.write(el + '\n')
Save this in a file called analyze.py and run the script by using:
python analyze.py
Change tab.txt to your tabbed filename, csv.txt to your comma separated filename, and your lists should be dumped in the working directory.
Let me know if you have any problems.

If you still want to do it in the shell, for the "in common" you may use:
sed 's/\([^,]*\),.*/\1/' commed.txt > __ids.txt
grep -F -f __ids.txt tabbed.txt
rm -f __ids.txt
and for the "not in common" (checking both directions):
sed 's/\([^,]*\),.*/\1/' commed.txt > __ids.txt
grep -F -v -f __ids.txt tabbed.txt
sed 's/\([^\t]*\)\t.*/\1/' tabbed.txt > __ids.txt
grep -F -v -f __ids.txt commed.txt
rm -f __ids.txt
Where "commed.txt" is the file comma-separated and "tabbed.txt" is the file tab-separated.
This might fail if the ID can occur elsewhere in the second file! A more robust solution with grep is possible if the ID cannot be mistaken for a regexp (no '.', '\', '*', etc.).
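For what it's worth, here is a field-anchored awk sketch (my own assumption: the ID is always the entire first field in both files; commed.txt and tabbed.txt named as above) that sidesteps that risk:
# build a set of IDs from the first comma-separated field of commed.txt,
# then test the first tab-separated field of tabbed.txt against it
awk -F'\t' 'NR==FNR { split($0, a, ","); ids[a[1]]; next }
            ($1 in ids)'  commed.txt tabbed.txt > incommon.txt
awk -F'\t' 'NR==FNR { split($0, a, ","); ids[a[1]]; next }
            !($1 in ids)' commed.txt tabbed.txt > not_incommon.txt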

Related

Issue using diff with an array and quoted values (shell) [duplicate]

This question already has answers here:
How can I store the "find" command results as an array in Bash
Hi guys, I'm having an issue while using diff.
In my script I'm trying to compare all the files in one directory with all the files in two other directories, using diff to check whether the files are the same.
Here is my script:
#!/bin/bash
files1=()
files2=()
# Directories to compare. Adding quotes at the beginning and at the end of each file found in content1 & content3
content2=$(find /data/logs -name "*.log" -type f)
content1=$(find /data/other/logs1 -type f | sed 's/^/"/g' | sed 's/$/"/g')
content3=$(find /data/other/logs2 -type f | sed 's/^/"/g' | sed 's/$/"/g')
# ADDING CONTENT INTO FILES1 & FILES2 ARRAYS
while read -r line; do
    files1+=("$line")
done <<< "$content1"
# content1 and content3 go into the same array
while read -r line3; do
    files1+=("$line3")
done <<< "$content3"
while read -r line2; do
    files2+=("$line2")
done <<< "$content2"
# Here I'm trying to compare the files in files2 one by one against all of files1
for ((i=0; i<${#files2[@]}; i++))
do
    for ((j=0; j<${#files1[@]}; j++))
    do
        if [[ -n ${files2[$i]} ]]; then
            diff -s "${files2[$i]}" "${files1[$j]}" > /dev/null
            if [[ $? == 0 ]]; then
                echo ${files1[$j]} "est identique a" ${files2[$i]}
                unset 'files2[$i]'
                break
            fi
        fi
    done
done
# SHOW THE FILES THAT DIDN'T MATCH
echo ${files2[@]}
I'm getting the following issue when I try to diff:
diff: "/data/content3/other/log2/perso log/somelog.log": No such file or directory
But when I do
ll "/data/content3/other/log2/perso log/somelog.log"
-rw-rw-r-- 2 lopom lopom 551M 30 oct. 18:53 '/data/content3/other/logs2/perso log/somelog.log'
So the file exists.
I need those quotes because sometimes there are spaces in the paths.
Does someone know how to fix that?
Thanks.
I already tried changing the double quotes to single quotes, but that didn't fix it.
First, don't do this -
content2=$(find /data/logs -name "*.log" -type f)
content1=$(find /data/other/logs1 -type f | sed 's/^/"/g' | sed 's/$/"/g')
content3=$(find /data/other/logs2 -type f | sed 's/^/"/g' | sed 's/$/"/g')
don't stack all these into single vars. This is asking for ten kinds of obscure trouble. More importantly, those sed calls are embedding the quotation marks into the data as part of the filenames, which is probably what's making diff fail, because there are no actual files with quotes in their names.
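For reference, a minimal sketch of the NUL-safe way to read find results into an array (no quote-wrapping needed; directory name taken from the question):
files1=()
while IFS= read -r -d '' f; do
    files1+=( "$f" )     # names with spaces arrive intact; no added quotes
done < <(find /data/other/logs1 -type f -print0)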
Also, if you are throwing away the output and just using diff to check whether the files are identical, try cmp instead. The -s makes it silent, and it's a lot faster since it exits at the first differing byte instead of reading the rest of both files and generating a report. If there are a lot of files, this will add up.
If the logs are the only things in the directories, and you don't have to scan subdirectories, and a filename can't appear in both /data/other/logs1 AND /data/other/logs2 (but you're pretty sure it will be in at least one of them), then simplify:
for f in /data/logs/*.log # I'll assume these are all files...
do t=/data/other/logs[12]/"${f#/data/logs/}" # always just one?
if cmp -s "$f" "$t" # cmp -s *has* no output
then echo "$t est identique a $f" # files are same
elif [[ -e "$t" ]] # check t exists
then echo "$t diffère de $f" # maybe ls -l "$f" "$t" ?
else echo "$t n'existe pas" # report it does not
fi
done
This needs no arrays, no find, no sed calls, etc.
If you do need to read subdirectories, use shopt to handle it with globs so that you don't have to worry about parsing odd characters with read. (c.f. https://mywiki.wooledge.org/ParsingLs for some reasons.)
shopt -s globstar
for f in /data/logs/**/*.log # globstar makes ** match at arbitrary depth
do for t in /data/other/logs[12]/**/"${f#/data/logs/}" # if >1 possible hit
do if cmp -s "$f" "$t"
then echo "$t est identique a $f"
elif [[ -e "$t" ]]
then echo "$t diffère de $f"
else echo "$t n'existe pas" # $t will be the glob, one iteration
fi
done
done

Sed: Better way to address the n-th line where n are elements of an array

We know that the sed command loops over each line of a file and, for each line, runs the given list of commands. But when the file is extremely large, the time and resource cost of repeating this can be terrible.
Suppose I have an array of line numbers that I want to use as addresses to delete or print with a sed command (e.g. A=(20000 30000 50000 90000)), and a VERY LARGE file to operate on.
The easiest way may be:
(Remark by #John1024: be careful about the line number changes for each loop)
( for NL in ${A[@]}; do sed "$NL d" $very_large_file; done; ) > .temp_file;
cp .temp_file $very_large_file; rm .temp_file
The problem with the code above is that, for each line number in the array, it loops over the whole file.
To avoid this, one can:
#COMM=`echo "${A[@]}" | sed 's/\s/d;/g;s/$/d/'`;
#sed -i "$COMM" $very_large_file;
#Edited: Better with direct parameter expansion:
sed -i "${A[*]/%/d;}" $very_large_file;
It first prints the array and replaces each SPACE and the END_OF_LINE with the d command of sed, so that the string looks like "20000d;30000d;50000d;90000d"; on the second line, we use this string as the command list for sed. The result is that this code only loops over the file once.
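To see what that expansion produces, you can echo it first (just a quick sanity check):
A=(20000 30000 50000 90000)
echo "${A[*]/%/d;}"    # -> 20000d; 30000d; 50000d; 90000d;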
Moreover, for in-place operation (the -i argument), one cannot quit early with q even once the greatest line number of interest has passed, because if one did, the lines after that line (e.g. 90001+) would disappear (the in-place operation just overwrites the file with stdout).
Better ideas?
(Reply to #user unknown:) I think it could be even more efficient if we managed to "quit" the loop once all indexed lines have passed. We can't with sed -i, for the aforementioned reasons. Printing each line to a file costs more time than copying a file (e.g. cat file1 > file2 vs cp file1 file2). We may be able to benefit from this idea with other methods or tools; that is what I am hoping for.
PS: The points of this question are "line location" and "efficiency"; the "delete lines" operation is just an example. Real tasks involve much more: appending/inserting/substituting, field splitting, case judgements followed by reads from/writes to files, calculations, etc.
In other words, it may involve all kinds of operations, creating sub-shells or not, caring about variable passing, ... so the tools used should let me do line processing, and the problem is how to get onto the lines of interest and do all kinds of operations there.
Any comments are appreciated.
First make a copy to a testfile for checking the results.
You want to sort the line numbers, highest first.
echo "${a[#]}" | sed 's/\s/\n/g' | sort -rn
You can feed commands into ed using printf:
printf "%s\n" "command1" "command2" w q testfile | ed -s testfile
Combine these
printf "%s\n" $(echo "${a[#]}" | sed 's/\s/\n/g' | sort -rn | sed 's/$/d/') w q |
ed -s testfile
Edit (tx #Ed_Morton):
This can be written in fewer steps with
printf "%s\n" $(printf '%sd\n' "${a[#]}" | sort -rn ) w q | ed -s testfile
I cannot remove the sort: each delete renumbers the remaining lines starting from 1, so deleting from the highest line number downwards keeps the other target line numbers valid.
I tried to find a command for editing the file without redirecting to another one, but I started with the remark that you should make a copy anyway. I have no choice; I have to upvote the straightforward awk solution that doesn't need a sort.
sed is for doing s/old/new, that is all, and when you add a shell loop to the mix you've really gone off the rails (see https://unix.stackexchange.com/q/169716/133219). To delete lines whose numbers are stored in an array, do this (using seq to generate input since no sample input/output was provided in the question):
$ a=( 3 7 8 )
$ seq 10 |
awk -v a="${a[*]}" 'BEGIN{split(a,tmp); for (i in tmp) nrs[tmp[i]]} !(NR in nrs)'
1
2
4
5
6
9
10
and if you wanted to stop processing with awk once the last target line has been deleted and let tail finish the job then you could figure out the max value in the array up front and then do awk on just the part up to that last target line:
max=$( printf '%s\n' "${a[@]}" | sort -rn | head -1 )
head -"$max" file | awk '...' > out
tail -n +"$((max+1))" file >> out
idk if that'd really be any faster than just letting awk process the whole file since awk is very efficient, especially when you're not referencing any fields and so it doesn't do any field splitting, but you could give it a try.
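Putting those two pieces together (only a sketch, reusing the array a and the awk program from above; file and out are placeholder names):
a=( 3 7 8 )
max=$( printf '%s\n' "${a[@]}" | sort -rn | head -1 )
# let awk see only the lines up to the last target line, then append the rest untouched
head -n "$max" file |
  awk -v a="${a[*]}" 'BEGIN{split(a,tmp); for (i in tmp) nrs[tmp[i]]} !(NR in nrs)' > out
tail -n +"$((max+1))" file >> out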
You could generate an intermediate sed command file from your lines:
printf '%s\n' "${A[@]}" | sort -n > lines_to_delete
max=$(tail -1 lines_to_delete)
# turn each line number into a sed delete command, e.g. "20000d"
sed -i 's/$/d/' lines_to_delete
# run sed only over the part of the file that can still contain target lines,
# then append the untouched remainder
head -n "$max" input | sed -f lines_to_delete > output
tail -n +"$((max+1))" input >> output
mv output input

How to store the list of subdirectories into an array and access them by their index in Bash?

Assume that we have a directory named "A" with 4 subdirectories (aa, bb, cc, dd); some of the subdirectories also have subdirectories of their own, so assume a schematic like below:
A
  aa
    aaa
  bb
    bbb
    bbbb
  cc
  dd
I tried to list the subdirectories (aa, bb, cc, dd) in an array and then use them in my script by their array index.
I used the script below to copy dd to the parent directory:
while IFS= read -d '' file; do
    A+=( "$file" )
done < <(find . -type d -print0 | LC_ALL=C sort -z)
cp -r "`pwd`/${A[4]}" "`pwd`/.."
But the problem is that the script makes an array of all of the subdirectories at every depth (plus . itself), i.e. [. aa aaa bb bbb bbbb cc dd],
so ${A[4]} = bbb and not dd.
Any idea how to fix it?
You can restrict find to just look at the top-level directory, with the -maxdepth option:
find . -maxdepth 1 -type d -print0 | LC_ALL=C sort -z
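Plugged into the read loop from the question, that would look something like this (a sketch; note that find also prints . itself, which sorts to index 0):
A=()
while IFS= read -r -d '' dir; do
    A+=( "$dir" )
done < <(find . -maxdepth 1 -type d -print0 | LC_ALL=C sort -z)
# A is now (. ./aa ./bb ./cc ./dd), so ${A[4]} is ./dd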
You can achieve the same thing in a simpler way using a glob:
dirs=(*/) # store all top level directories into the dirs array
dirs=("${dirs[@]%/}") # strip trailing / from each element of the array
and then
cp -r "$PWD/${dirs[4]}" "$PWD/.."
Double quotes are needed to prevent word splitting and globbing
pwd in backquotes can simply be written as $PWD, which doesn't need to create a subshell

Bash: rm with an array of filenames

So I'm working on making an advanced delete script. The idea is that the user inputs a grep regex for what needs to be deleted, and the script does an rm operation on all of it. Basically it eliminates the need to write out the whole command line each time.
Here is my script so far:
#!/bin/bash
# Script to delete files passed to it
if [ $# -ne 1 ]; then
    echo "Error! Script needs to be run with a single argument that is the regex for the files to delete"
    exit 1
fi

IFS=$'\n'
files=$(ls -a | grep $1 | awk '{print "\"" $0 "\"" }')
## TODO ensure directory support

echo "This script will delete the following files:"
for f in $files; do
    echo " $f"
done

valid=false
while ! $valid ; do
    read -p "Do you want to proceed? (y/n): "
    case $REPLY in
        y)
            valid=true
            echo "Deleting, please wait"
            echo $files
            rm ${files}
            ;;
        n)
            valid=true
            ;;
        *)
            echo "Invalid input, please try again"
            ;;
    esac
done
exit 0
My problem is when I actually do the "rm" operation. I keep getting errors saying No such file or directory.
This is the directory I'm working with:
drwxr-xr-x 6 user staff 204 May 9 11:39 .
drwx------+ 51 user staff 1734 May 9 09:38 ..
-rw-r--r-- 1 user staff 10 May 9 11:39 temp two.txt
-rw-r--r-- 1 user staff 6 May 9 11:38 temp1.txt
-rw-r--r-- 1 user staff 6 May 9 11:38 temp2.txt
-rw-r--r-- 1 user staff 10 May 9 11:38 temp3.txt
I'm calling the script like this: easydelete.sh '^tem'
Here is the output:
This script will delete the following files:
"temp two.txt"
"temp1.txt"
"temp2.txt"
"temp3.txt"
Do you want to proceed? (y/n): y
Deleting, please wait
"temp two.txt" "temp1.txt" "temp2.txt" "temp3.txt"
rm: "temp two.txt": No such file or directory
rm: "temp1.txt": No such file or directory
rm: "temp2.txt": No such file or directory
rm: "temp3.txt": No such file or directory
If I try to delete one of these files directly, it works fine. Even if I pass the whole string that gets printed just before the "rm" call, it works fine. But when I do it with the array, it fails.
I know I'm handling the array wrong, just not sure exactly what I'm doing wrong. Any help would be appreciated. Thanks.
Consider instead:
# put all filenames containing $1 as literal text in an array
#files=( *"$1"* )

# ...or, use a grep with GNU extensions to filter contents into an array:
# this passes filenames around with NUL delimiters for safety
#files=( )
#while IFS= read -r -d '' f; do
#    files+=( "$f" )
#done < <(printf '%s\0' * | egrep --null --null-data -e "$1")

# ...or, evaluate all files against $1, as regex, and add them to the array if they match:
files=( )
for f in *; do
    [[ $f =~ $1 ]] && files+=( "$f" )
done

# check that the first entry in that array actually exists
[[ -e $files || -L $files ]] || {
    echo "No files containing $1 found; exiting" >&2
    exit 1
}

# warn the user
echo "This script will delete the following files:" >&2
printf ' %q\n' "${files[@]}" >&2

# prompt the user
valid=0
while (( ! valid )); do
    read -p "Do you want to proceed? (y/n): "
    case $REPLY in
        y) valid=1; echo "Deleting; please wait" >&2; rm -f "${files[@]}" ;;
        n) valid=1 ;;
    esac
done
I'll go into the details below:
files has to be explicitly created as an array to actually be an array -- otherwise, it's just a string with a bunch of files in it.
This is an array:
files=( "first file" "second file" )
This is not an array (and, in fact, could be a single filename):
files='"first file" "second file"'
A proper bash array is expanded with "${arrayname[@]}" to get all contents, or "$arrayname" to get only the first entry.
[[ -e $files || -L $files ]]
...thus checks the existence (whether as a file or a symlink) of the first entry in the array -- which is sufficient to tell if the glob expression did in fact expand, or if it matched nothing.
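A quick illustration of that difference (hypothetical filenames):
files=( "first file" "second file" )
printf '%s\n' "${files[@]}"   # prints both elements, one per line
printf '%s\n' "$files"        # prints only the first element: first file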
A boolean is better represented with numeric values than with a string containing true or false: running if $valid has the potential to perform arbitrary activity if the contents of valid could ever be set to a user-controlled value, whereas if (( valid )) -- checking whether $valid is a positive numeric value (true) or otherwise (false) -- leaves far less room for side effects in the presence of bugs elsewhere.
There's no need to loop over array entries to print them in a list: printf "$format_string" "${array[@]}" will reuse the format string as many extra times as needed whenever it has more arguments (from the array expansion) than the format string consumes. Moreover, using %q in your format string will quote nonprintable values, whitespace, newlines, &c. in a format that's consumable by both human readers and the shell -- whereas otherwise a file created with touch $'evil\n - hiding' will appear to be two list entries, when in fact it is only one.
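For example (reusing a filename from the question plus the hypothetical evil one):
files=( "temp two.txt" $'evil\n - hiding' )
printf ' %q\n' "${files[@]}"
# output:
#  temp\ two.txt
#  $'evil\n - hiding'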

Store grep output in an array

I need to search for a pattern in a directory and save the names of the files that contain it in an array.
Searching for pattern:
grep -HR "pattern" . | cut -d: -f1
This prints me all filenames that contain "pattern".
If I try:
targets=$(grep -HR "pattern" . | cut -d: -f1)
length=${#targets[@]}
for ((i = 0; i != length; i++)); do
echo "target $i: '${targets[i]}'"
done
This prints only one element, containing a single string with all the filenames.
output: target 0: 'file0 file1 .. fileN'
But I need:
output: target 0: 'file0'
output: target 1: 'file1'
.....
output: target N: 'fileN'
How can I achieve the result without doing a boring split operation on targets?
You can use:
targets=($(grep -HRl "pattern" .))
Note the use of (...) for array creation in BASH.
Also, you can use grep -l to get only file names in grep's output (as shown in my command).
The above answer (written 7 years ago) assumed that the output filenames won't contain special characters like whitespace or glob characters. Here is a safe way to read those special filenames into an array (it will also work with older bash versions):
while IFS= read -rd ''; do
targets+=("$REPLY")
done < <(grep --null -HRl "pattern" .)
# check content of array
declare -p targets
On bash 4.4+ you can use readarray instead of a loop:
readarray -d '' -t targets < <(grep --null -HRl "pattern" .)
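Either way, the resulting array can be iterated exactly as the question wanted (a small usage sketch):
for i in "${!targets[@]}"; do
    echo "target $i: '${targets[$i]}'"
done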
