Comparing two files using awk or sed

I have two files...
Lookup is 1285 lines long:
cat Lookup.txt
abc
def
ghi
jkl
main is 4,838,869 lines long:
cat main.txt
abc, USA
pqr, UK
xyz, SA
I need to compare Lookup.txt and main.txt, and then output the matching lines in main.txt to final.txt.

You don't need awk or sed here, just grep, assuming I am reading your requirements correctly:
% grep -f Lookup.txt main.txt > final.txt
% cat final.txt
abc, USA
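Since the lookup entries are fixed strings rather than regular expressions, it may be worth adding -F (fixed-string matching, noticeably faster on a file of nearly five million lines) and -w (whole-word matching, so abc does not also match abcdef). A sketch, assuming whole-word matches are what you want:
% grep -Fwf Lookup.txt main.txt > final.txt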


Edit a string in shell script and display it as an array

Input:
1234-A1;1235-A2;2345-B1;5678-C2;2346-D5
Expected Output:
1234
1235
2345
5678
2346
The input shown is user input. I want to store it in an array and perform some operations to display it as shown in 'Expected Output'.
I have done this in Perl, but want to achieve it in a shell script. Please help me achieve this.
To split an input text into an array you can follow this technique (note that IFS is a set of separator characters, not a regex, so the text is split on both ";" and "-"):
IFS=";-" read -r -a arr <<< "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
printf '%s\n' "${arr[@]}"
1234
A1
1235
A2
2345
B1
5678
C2
2346
D5
If you want to keep only 1234, 1235, etc. as per your expected output, you can either use the even-indexed array elements (0, 2, 4, etc.) or do something like this:
a="1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
IFS="[;]" read -r -a arr <<< "${a//-[A-Z][0-9]/}" #or more generally <<< "${a//-??/}"
declare -p arr #This asks bash to print the array for us
#Output
declare -a arr='([0]="1234" [1]="1235" [2]="2345" [3]="5678" [4]="2346")'
# Array can now be printed or used elsewhere in your script. Array counting starts from zero
@Yash: try:
echo "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5" | awk '{gsub(/-[[:alnum:]]+/,"");gsub(/;/,RS);print}'
This substitutes every "-" followed by alphanumeric characters with nothing, then substitutes each semicolon with RS (the record separator), which is a newline by default.
Thanks @George and @Vipin.
Based on your inputs, the solution that best suits my environment is as follows:
i=0
a="1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
IFS="[;]" read -r -a arr <<< "${a//-??/}"
#declare -p arr
for var in "${arr[@]}"
do
echo " var $((i++)) is : $var"
done
Output:
var 0 is : 1234
var 1 is : 1235
var 2 is : 2345
var 3 is : 5678
var 4 is : 2346
Try this -
awk -F'[-;]' '{for(i=1;i<=NF;i++) if(i%2!=0) {print $i}}' f
1234
1235
2345
5678
2346
OR
echo "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"|tr ';' '\n'|cut -d'-' -f1
OR
As @George Vasiliou suggested -
awk -F'[-;]' '{for(i=1;i<=NF;i+=2) {print $i}}' f
If the data needs to be stored in an array and you are using gawk, try the below -
awk -F'[;-]' -v k=1 '{for(i=1;i<=NF;i++) if($i !~ /[[:alpha:]]/) {a[k++]=$i}} END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for(k in a) print k,a[k]}' f
1 1234
2 1235
3 2345
4 5678
5 2346
PROCINFO["sorted_in"] = "@ind_str_asc" is used to print the data in sorted index order.
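Since the indexes here are numeric, gawk's numeric index ordering is arguably the better fit. A minimal variant of the same END block, assuming gawk 4.0 or later:
PROCINFO["sorted_in"] = "@ind_num_asc"   # sort indexes numerically (1, 2, ..., 10) rather than as strings (1, 10, 2, ...)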

Use bash variable as array in awk and filter input file by comparing with array

I have bash variable like this:
val="abc jkl pqr"
And I have a file that looks smth like this:
abc 4 5
abc 8 8
def 43 4
def 7 51
jkl 4 0
mno 32 2
mno 9 2
pqr 12 1
I want to throw away the rows from the file whose first field isn't present in val:
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
My awk solution doesn't work at all and I have no idea why:
awk -v var="${val}" 'BEGIN{split(var, arr)}$1 in arr{print $0}' file
Just slice the variable into array indexes:
awk -v var="${val}" 'BEGIN{split(var, arr)
    for (i in arr)
        names[arr[i]]
}
$1 in names' file
As commented in the linked question, when you call split() the words become the array's values, while what the in operator checks are the indexes. The trick is to generate another array whose indexes are those values.
As you can see, $1 in names suffices; you don't have to spell out the action {print $0}, since printing the record is the default when the condition holds.
As a one-liner:
$ awk -v var="${val}" 'BEGIN{split(var, arr); for (i in arr) names[arr[i]]} $1 in names' file
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
grep -E "$( echo "${val}"| sed 's/ /|/g' )" YourFile
# or
awk -v val="${val}" 'BEGIN{gsub(/ /, "|",val)} $1 ~ val' YourFile
Grep:
it uses a regex (the extended version, via option -E) that keeps every line containing one of the values. The regex is built on the fly in a command substitution, with sed replacing each space separator by a |, meaning OR
Awk:
uses the same principle as the grep, but everything is done inside awk (so no subshell)
the awk variable val is assigned from the shell variable of the same name
at the start of the script (before the first line is read) the spaces in val are changed to | with BEGIN{gsub(/ /, "|",val)}
then every line whose first field matches is printed ($1 ~ val is a filter whose default action is to print; awk's default field separator is whitespace, so the first field is the letter group)
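Note that both of these match substrings: with val containing abc, a first field of abcd would also pass. If exact field matches are wanted, one sketch is to anchor the alternation (the same awk idea, just a stricter regex):
awk -v val="${val}" 'BEGIN{gsub(/ /, "|", val); val = "^(" val ")$"}  # anchor so abc does not match abcd
$1 ~ val' YourFile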

Print duplicate entries in a file using Linux commands

I have a file called foo.txt, which consists of:
abc
zaa
asd
dess
zaa
abc
aaa
zaa
I want the output to be stored in another file as:
this text abc appears 2 times
this text zaa appears 3 times
I have tried the following command, but this just writes duplicate entries and their number.
sort foo.txt | uniq --count --repeated > sample.txt
Example of output of above command:
abc 2
zaa 3
How do I add the line "this text appears x times"?
Awk is your friend:
sort foo.txt | uniq --count --repeated | awk '{print "this text "$2" appears "$1" times"}'
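To store the result in another file, as the question asks, redirect the pipeline as before (-cd is the short form of --count --repeated):
sort foo.txt | uniq -cd | awk '{print "this text "$2" appears "$1" times"}' > sample.txt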

Print the middle line of any file UNIX

I have to print the middle line of any text file, without sed or awk.
For example, the following file.txt:
line 1
line 2
line 3
line 4
line 5
I need something like:
$ command -flags file.txt
line 3
Is there any command?
Thanks.
Not the most efficient, but it works in bash.
Use wc -l to count the lines and divide by two. Then use tail -n +N | head -n 1 to print just the Nth line (where N starts at 1).
$ cat input.txt
A
B
C
D
E
$ tail -n +$(((`cat input.txt | wc -l` / 2) + 1)) input.txt | head -n 1
C
Note that a file with an even number of lines has no single "middle line".
I cat-ed the file to wc -l so it wouldn't print the filename.
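An equivalent sketch with head first and tail second; the (n + 1) / 2 rounding picks line 3 of a 5-line file, and wc -l < file avoids the filename in the count:
head -n $(( ($(wc -l < input.txt) + 1) / 2 )) input.txt | tail -n 1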
sed -n $(((`cat input.txt | wc -l` / 2) + 1))p input.txt

Awk: extract different columns from many different files

I have between 3 and 10 files with:
- a different number of columns
- the same number of rows
- inconsistent spacing (sometimes one space, sometimes tabs, sometimes many spaces) within the very same file, like the below:
0 55.4 9.556E+09 33
1 1.3 5.345E+03 1
........
33 134.4 5.345E+04 932
........
I need to get column (say) 1 from file1, column 3 from file2, column 7 from file3 and column 1 from file4 and combine them into a single file, side by side.
Trial 1: not working
paste <(cut -d[see below] -f1 file1) <(cut -d[see below] -f3 file2) [...]
where the delimiter was ' ' or empty.
Trial 2: working with 2 files but not with many files
awk '{
    a1=$1; b1=$4;
    getline <"D2/file1.txt";
    print a1,$1,b1,$4
}' D1/file1.txt > D3/file1.txt
Now more general question:
How can I extract different columns from many different files?
In your paste / cut attempt, replace cut by awk:
$ paste <(awk '{print $1}' file1 ) <(awk '{print $3}' file2 ) <(awk '{print $7}' file3) <(awk '{print $1}' file4)
Assuming each of your files has the same number of rows, here's one way using GNU awk. Run like:
awk -f script.awk file1.txt file2.txt file3.txt file4.txt
Contents of script.awk:
FILENAME == ARGV[1] { one[FNR]=$1 }
FILENAME == ARGV[2] { two[FNR]=$3 }
FILENAME == ARGV[3] { three[FNR]=$7 }
FILENAME == ARGV[4] { four[FNR]=$1 }
END {
    for (i=1; i<=length(one); i++) {
        print one[i], two[i], three[i], four[i]
    }
}
Note:
By default, awk separates columns on whitespace, which covers tabs and spaces in any amount; this makes awk ideal for files with inconsistent spacing. You can also expand the above code to include more files if you wish; a more general sketch follows below.
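A more general version, as a sketch: it assumes GNU awk (for the ARGIND variable, the index of the file currently being read) and that every file has the same number of rows. The cols list here is a hypothetical parameter naming which column to take from each file, in argument order:
awk -v cols="1 3 7 1" '
BEGIN { n = split(cols, want) }                  # want[k] = column wanted from the k-th file
{ col = want[ARGIND]; val[FNR, ARGIND] = $col }  # ARGIND is gawk-specific
END {
    for (i = 1; i <= FNR; i++) {                 # FNR in END = row count of the last file read
        line = val[i, 1]
        for (j = 2; j <= n; j++) line = line OFS val[i, j]
        print line
    }
}' file1 file2 file3 file4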
The combination of cut and paste should work:
$ cat f1
foo
bar
baz
$ cat f2
1 2 3
4 5 6
7 8 9
$ cat f3
a b c d
e f g h
i j k l
$ paste -d' ' <(cut -f1 f1) <(cut -d' ' -f2 f2) <(cut -d' ' -f3 f3)
foo 2 c
bar 5 g
baz 8 k
Edit: This works with tabs, too:
$ cat f4
a b c d
e f g h
i j k l
$ paste -d' ' <(cut -f1 f1) <(cut -d' ' -f2 f2) <(cut -f3 f4)
foo 2 c
bar 5 g
baz 8 k
