I'm trying to compare tablespace sizes between 2 databases. I already extracted the fields to compare, as shown below:
STAT-TBS-DB-SOURCE.lst (column 1: TBS name, column 2: real size):
TBS001 12
TBS002 50
TBS003 20
TBS004 45
STAT-TBS-DBTARGET.lst (column 1: TBS name, column 2: max size):
TBS001 10
TBS002 50
TBS003 20
TBS004 40
I need to compare the second column (c2) of the two files (f1, f2): if f2.c2 < f1.c2, then print "increase Tablespace f1.c1 by (f1.c2 - f2.c2) MB".
What solution do you have for me? I tried with awk, but I cannot get the value of f1.c2.
Thanks
kent$ awk 'NR==FNR{a[$1]=$2;next}$1 in a && $2<a[$1]{
printf "increase Tablespace %s by %d MB\n",$1,(a[$1]-$2)}' f1 f2
increase Tablespace TBS001 by 2 MB
increase Tablespace TBS004 by 5 MB
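For reference, here is the same logic with comments added and the actual file names from the question (a sketch equivalent to the one-liner above):

awk 'NR==FNR { a[$1] = $2; next }        # first file (source): remember the real size per TBS name
     $1 in a && $2 < a[$1] {             # second file (target): max size smaller than real size?
         printf "increase Tablespace %s by %d MB\n", $1, a[$1] - $2
     }' STAT-TBS-DB-SOURCE.lst STAT-TBS-DBTARGET.lst

NR==FNR is true only while awk reads the first file, so the first block builds the lookup array and next skips the rest of the script for those lines.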
I have two files with fields separated by tabs:
File1 has 13 columns and 90 million lines (~5 GB). The number of lines in file1 is always smaller than the number of lines in file2.
1 1 27 0 2 0 0 1 0 0 0 1 false
1 2 33 0 3 0 0 0 0 0 0 1 false
1 5 84 3 0 0 0 0 0 0 0 2 false
1 6 41 0 1 0 0 0 0 0 0 1 false
1 7 8 4 0 0 0 0 0 0 0 1 false
File2 has 2 columns and 100 million lines (1.3 GB):
1 1
1 2
1 3
1 4
1 5
What I want to achieve:
When the pair of columns $1/$2 in file2 is identical to the pair $1/$2 in file1, I would like to print $1 and $2 from file2 and $3 from file1 into an output file. In addition, if the pair $1/$2 of file2 has no match in file1, print $1/$2 in the output and leave the 3rd column empty. Thus, the output keeps the same structure (number of lines) as file2.
If relevant: the pairs $1/$2 are unique in both file1 and file2, and both files are sorted by $1 first and then by $2.
Output file:
1 1 27
1 2 33
1 3 45
1 4
1 5 84
What I have done so far:
awk -F"\t" 'NR == FNR {a[$1 "\t" $2] = $3; next } { print $0 "\t" a[$1 "\t" $2] }' file1 file2 > output
The command runs for a few minutes and unexpectedly stops without any additional information. When I open the output file, the first 5 to 6 million lines have been processed correctly (I can see that the 3rd column was correctly added), but the rest of the output file does not have a 3rd column. I am running this command on a 3.2 GHz Intel Core i5 with 32 GB 1600 MHz DDR3. Any ideas why the command stops? Thanks for your help.
You are close.
I would do something like this:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
key in seen {print key, seen[key]}
' file1 file2
Or, since file1 is bigger, reverse which file is held in memory:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]; next}
key in seen {print key, $3}
' file2 file1
You could also use join which will likely handle files much larger than memory. This is BSD join that can use multiple fields for the join:
join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2
join requires the files to be sorted, as your examples are. If not sorted, you could do:
join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 <(sort -n file1) <(sort -n file2)
Or, if your join can only use a single field, you can temporarily use ' ' as the field separator between fields 2 and 3 and set join to use that as the delimiter:
join -1 1 -2 1 -t $' ' -o 1.1,2.2 <(sort -k1n -k2n file2) <(awk '{printf("%s\t%s %s\n",$1,$2,$3)}' file1 | sort -k1n -k2n) | sed 's/[ ]/\t/'
Either awk or join prints:
1 1 27
1 2 33
1 3 45
1 4 7
1 5 84
Your comment:
After additional investigation, the suggested solutions did not work because my question was not properly asked (my mistake). The suggested solutions printed lines only when matches between pairs ($1/$2) were found in both file1 and file2. Thus, the resulting output file always has the number of lines of file1 (which is always smaller than file2). I want the output file to keep the same structure as file2, that is, the same number of lines (for further comparison). The question has been refined accordingly.
If your computer can handle the file sizes:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
{if (key in seen)
print key, seen[key]
else
print key
}
' file1 file2
Otherwise you can filter file1 so that only the matches are fed to awk, and then file2 dictates the final output structure:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
{if (key in seen)
print key, seen[key]
else
print key
}
' <(join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2) file2
If you still need something more memory efficient, I would break out ruby for a line-by-line solution:
ruby -e 'f1=File.open(ARGV[0]); f2=File.open(ARGV[1])
l1=f1.gets
f2.each { |l2|
l1a=l1.chomp.split(/\t/)[0..2].map(&:to_i)
l2a=l2.chomp.split(/\t/).map(&:to_i)
while((tst=l1a[0..1]<=>l2a)<0 && !f1.eof?)
l1=f1.gets
l1a=l1.chomp.split(/\t/)[0..2].map(&:to_i)
end
if tst==0
l2a << l1a[2]
end
puts l2a.join("\t")
}
' file1 file2
Issues with OP's current awk code:
testing shows loading file1 into memory (a[$1 "\t" $2] = $3) requires ~290 bytes per entry; for 90 million rows this works out to ~26 GBytes; this amount of memory usage should not be an issue in OP's system (max of 32 GBytes) ... assuming all other processes are not consuming 6+ GBytes; having said that ...
in the 2nd half of OP's script (ie, file2 processing) the print/a[$1 "\t" $2] will actually create a new array entry if one doesn't already exist (ie, if file2 key not found in file1 then create a new array entry); since we know this situation can occur we have to take into consideration the amount of memory required to store an entry from file2 in the a[] array ...
testing shows that adding file2 entries to the a[] array requires ~190 bytes per entry; for 100 million rows this works out to ~19 GBytes; of course we won't be loading all of file2 into the a[] array, so the total additional memory will be less than 19 GBytes; then again, it only takes about 26 million rows from file2 (26 million * 190 bytes ≈ 5 GBytes) to use up the remaining memory
OP has mentioned that processing 'stops' after generating 5-6 million rows of output; this symptom (a 'stopped' process) can occur when the system runs out of memory and/or goes into heavy swapping (also a memory issue); with 100 million rows in file2 and only 5-6 million rows in the output, that leaves 94-95 million rows from file2 unaccounted for, which in turn is considerably more than the 26 million rows it would take to use up the remaining ~5 GBytes of memory ...
net result: OP's current awk script is likely hanging/stopped due to running out of memory
net result: we need to look at solutions that keep us from running out of memory; better would be solutions that use considerably less memory than the current awk code; even better would be solutions that use little (effectively 'no') memory at all ...
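One way to sanity-check these per-entry estimates on your own system before choosing an alternative (a sketch; it assumes GNU awk for length() on an array and GNU time installed as /usr/bin/time, and sample1 is just a hypothetical 1-million-line extract):

head -n 1000000 file1 > sample1                      # 1-million-line sample of file1
/usr/bin/time -v gawk -F"\t" '{a[$1 "\t" $2] = $3}
                              END {print length(a), "entries"}' sample1 > /dev/null

Divide the reported "Maximum resident set size" by the number of entries to get an approximate bytes-per-entry figure for your awk build.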
Assumptions/understandings:
both input files have already been sorted by the 1st and 2nd columns
within a given file the combination of the 1st and 2nd columns represents a unique key (ie, there are no duplicate keys in a file)
all lines from file2 are to be written to stdout while an optional 3rd column will be derived from file1 (when the key exists in both files)
General approach for a 'merge join' operation:
read from both files in parallel
compare the 1st/2nd columns from both files to determine what to print
memory usage should be minimal since we never have more than 2 lines (one from each file) loaded in memory
One awk idea for implementing a 'merge join' operation:
awk -v lookup="file1" ' # assign awk variable "lookup" the name of the file where we will obtain the optional 3rd column from
function get_lookup() { # function to read a new line from the "lookup" file
rc=getline line < lookup # read next line from "lookup" file into variable "line"; rc==1 if successful; rc==0 if reached end of file
if (rc) split(line,f) # if successful (rc==1) then split "line" into array f[]
}
BEGIN { FS=OFS="\t" # define input/output field delimiters
get_lookup() # read first line from "lookup" file
}
{ c3="" # set optional 3rd column to blank
while (rc) { # while we have a valid line from the "lookup" file look for a matching line ...
if ($1< f[1] ) { break }
else if ($1==f[1] && $2< f[2]) { break }
else if ($1==f[1] && $2==f[2]) { c3= OFS f[3]; get_lookup(); break }
# else if ($1==f[1] && $2> f[2]) { get_lookup() }
# else if ($1> f[1] ) { get_lookup() }
else { get_lookup() }
}
print $0 c3 # print current line plus optional 3rd column
}
' file2
This generates:
1 1 27
1 2 33
1 3
1 4
1 5 84
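For completeness, a similarly memory-light result can also be had from join's left-join options by building a composite key from the first two columns (a sketch, not the merge-join above; it assumes GNU join/sort and that '|' never occurs in the data, and the output comes back ordered by the lexicographically sorted composite key rather than in file2's original numeric order):

LC_ALL=C join -t $'\t' -a 2 -e '' -o 0,1.2 \
    <(awk 'BEGIN{FS=OFS="\t"} {print $1"|"$2, $3}' file1 | LC_ALL=C sort -t $'\t' -k1,1) \
    <(awk 'BEGIN{FS=OFS="\t"} {print $1"|"$2}' file2 | LC_ALL=C sort -t $'\t' -k1,1) | tr '|' '\t'

Here -a 2 keeps every line from file2 (paired or not) and -e '' fills the missing third column with an empty string.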
I'm trying to use a lookup table to do a search and replace for two specific columns and keep getting a blank column as output. I've followed the syntax for several examples of lookup tables that I've found on stack, but no joy. Here is a snippet from each of the files.
Sample lookup table -- I want to search for instances of its column 1 values in my data file and replace them with the corresponding value in column 2 (the first row is a header):
#xyz type
N 400
C13 401
13A 402
13B 402
13C 402
C14 405
The source file to be substituted has the following format:
1 N 0.293000 2.545000 16.605000 0 2 6 10 14
2 C13 0.197000 2.816000 15.141000 0 1
3 13A 1.173000 2.887000 14.676000 0
4 13B -0.319000 3.756000 14.937000 0
5 13C -0.351000 1.998000 14.678000 0
6 C14 0.749000 3.776000 17.277000 0 1
The corresponding values in column 2 of the lookup table will replace the values in column 6 of my source file (currently all zeroes). Here's the awk one-liner that I thought should work:
awk -v OFS='\t' 'NR==1 { next } FNR==NR { a[$1]=$2; next } $2 in a { $6=a[$1] }1' lookup.txt source.txt
But my output essentially deletes the entire entry for column 6:
1 N 0.293000 2.545000 16.605000 2 6 10 14
2 C13 0.197000 2.816000 15.141000 1
3 13A 1.173000 2.887000 14.676000
4 13B -0.319000 3.756000 14.937000
5 13C -0.351000 1.998000 14.678000
6 C14 0.749000 3.776000 17.277000 1
The sixth column should be 400 to 405. I considered using sed, but I have duplicate values in the source and output columns of my lookup table, so that won't work in this case. What's frustrating is that I had this one-liner working on almost the exact same source file the other week, but now I can only get this behavior. I'd love to be able to modify my awk call to do lookups of two different columns simultaneously, but I wanted to start simple for now. Thanks!
You have $6=a[$1] instead of $6=a[$2] in your script.
$ awk -v OFS='\t' 'NR==FNR{map[$1]=$2; next} {$6=map[$2]} 1' file1 file2
1 N 0.293000 2.545000 16.605000 400 2 6 10 14
2 C13 0.197000 2.816000 15.141000 401 1
3 13A 1.173000 2.887000 14.676000 402
4 13B -0.319000 3.756000 14.937000 402
5 13C -0.351000 1.998000 14.678000 402
6 C14 0.749000 3.776000 17.277000 405 1
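If some values in column 2 might have no entry in the lookup table, a variant that keeps the original "$2 in a" guard (so unmatched rows keep their existing column 6) and also skips the lookup file's header row could look like this (a sketch along the same lines):

awk -v OFS='\t' '
    NR==FNR { if (FNR > 1) map[$1] = $2; next }   # skip the "#xyz type" header line, build the map
    $2 in map { $6 = map[$2] }                    # only rewrite column 6 when a mapping exists
    1                                             # print every source line
' lookup.txt source.txt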
How to delete the first matching row in a file using a second one?
I use Talend DI 7.2, and I need to delete some rows in one delimited file using a second file containing the rows to delete. My first file contains multiple rows matching the second one, but for each row in my second file I need to delete only the first matching row in the first file.
For example:

File A:
Code | Amount
1 | 45
1 | 45
2 | 50
2 | 60
3 | 70
3 | 70
3 | 70
3 | 70

File B:
Code | Amount
1 | 45
3 | 70
3 | 70

At the end, I need to obtain:
File A:
Code | Amount
1 | 45
2 | 50
2 | 60
3 | 70
3 | 70
For each row in file B, only the first matching row in file A has been removed.
I tried with tMap and tFilterRow, but it matches all rows, not only the first one.
Edit to the example: I can have the same code/amount pair many times in file B, and I need to remove that same number of rows from file A.
You can do this by using variables within the tMap. I created 3:
v_match - returns "match" if the code and amount are in lookup file B.
v_count - adds to the count if it's a repeating value; otherwise resets to 0.
v_last_row - set to the value of v_match before comparing again; this way we can compare the current row to the last row and keep counts.
Then add an Expression filter to remove any first match.
This will give the desired results.
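For reference, the same "consume one deletion per matching row" idea can be sketched outside Talend in a few lines of awk (an illustration of the counting logic only, not a Talend job; fileA_filtered is just a placeholder name):

awk '
    FNR == 1    { if (NR != FNR) print; next }   # keep the File A header, skip the File B header
    NR == FNR   { del[$0]++; next }              # File B: count how many copies of each row to drop
    del[$0] > 0 { del[$0]--; next }              # File A: drop a row while deletions remain for it
                { print }                        # otherwise keep the row
' fileB fileA > fileA_filtered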
You can't delete rows from a file, so you'll have to generate a new file containing only the rows you want.
Here's a simple solution.
First, join your files using a left join, with A as the main flow and B as the lookup.
In the tMap, using an output filter, you only write to the output file the rows from A that don't match anything in B (row2.code == null) or those which have a match, but not a first match.
The trick is to use a Numeric.sequence with the code as the id of the sequence; if the sequence returns a value other than 1, you know you've already seen that line previously. If it's the first occurrence of the code, the sequence starts at 1 and returns 1, so the row is filtered out.
I have a database in production and need to reduce its on-disk size. I followed the instructions to shrink the file, but the result was a bit surprising.
Sorry for all the numbers here, but I do not know how to express the problem any better.
The database contains only one table, with 11,634,966 rows.
The table structure is as follows (I just changed the column names):
id bigint not null 8 -- the primary key (clustered index)
F1 int not null 4
F2 int not null 4
F3 int not null 4
F4 datetime not null 8
F5 int not null 4
F6 int not null 4
F7 int not null 4
F8 xml ?
F9 uniqueidentifier 16
F10 int 4
F11 datetime not null 8
Excluding the XML field (F8), I calculate the fixed data size as 8+4+4+4+8+4+4+4+16+4+8 = 68 bytes per row.
I ran a query against the database to find the min, max, and avg size of the XML field F8, which showed the following:
min : 625 bytes
max : 81782 bytes
avg : 5321 bytes
The on-disk file is 108 GB after shrinking the database.
This translates to the following:
108G / 11.6M records = 9283 bytes per row
- 5321 bytes per row (Avg of XML)
= 3962 bytes per row
- 68 (data size of other fields in row)
= 3894 bytes per row. (must be overhead)
But this means the overhead is 41.948%.
Is this to be expected, and is there anything I can do to reduce the 108 GB disk size?
BTW there is only one clustered index on the table.
And I am using SQL Server 2008 (SP3)
I'm quite new to awk, which I am using more and more to process the output files from a model I am running. Right now, I am stuck on a multiplication issue.
I would like to calculate the relative change as a percentage.
Example:
A B
1 150 0
2 210 10
3 380 1000
...
I would like to calculate New_Ax = (Ax - A1)/A1 * 100 for each row x.
Output:
New_A B
1 0 0
2 10 40
3 1000 153.33
...
I can multiply columns together, but I don't know how to fix a value at a given position in the text file (i.e. row 1, column 1).
Thank you.
Assuming your actual file does not have the "A B" header and the row numbers in it:
$ cat file
150 0
210 10
380 1000
$ awk 'NR==1 {a1=$1} {printf "%s %.1f\n", $2, ($1-a1)/a1*100}' file | column -t
0 0.0
10 40.0
1000 153.3
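If the real file does keep the "A B" header line and the leading row numbers shown in the question, a small variant of the same idea works (a sketch; the output column order follows the answer above, i.e. B first, then the recomputed A):

awk 'NR==1 {next}                               # skip the "A B" header line
     NR==2 {a1=$2}                              # remember A1 from the first data row
     {printf "%s %.1f\n", $3, ($2-a1)/a1*100}' file | column -t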