I have two files in Unix Box, both have around 10 million rows.
File1 (Only one column)
ASD123
AFG234
File2 (Only one column)
ASD456
AFG234
Now I want to compare the records from File1 with File2 and output those that are also present in File2. How can I achieve this?
I have tried a while loop with grep, but it is far too slow; any ideas will be appreciated.
If you want to find all the rows from file A which are also in file B, you can use grep's built-in -f option:
grep -Ff fileA.txt fileB.txt
This should be faster than putting it inside any kind of loop (although given the size of your files, it may still take some time).
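If exact whole-line matches are all you need, a sort-and-merge comparison with comm is also worth trying on files this size. A sketch, assuming the files are named File1 and File2 and that there is temporary space for sorting:
comm -12 <(sort File1) <(sort File2)    # lines present in both files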
Related
I need to find the difference between two files in terms of lines.
Let's say I have 2 files: file1.txt and file2.txt.
File1 has the lines:
LINE A
LINE B
LINE C
File2 has the lines:
LINE C
LINE D
LINE E
Now let's say I want to find the difference between file2 and file1 (i.e. (file2) - (file1)). Then as my result I should be able to get:
The lines that appeared in file2 but not in file1. In my example this would be: LINE D and LINE E.
The lines that appeared in file1 but not in file2. In my example this would be: LINE A and LINE B.
It is easy to implement this requirement using the diff command on Linux.
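For reference, both results can be sketched in the shell with comm as well (assuming the files are sorted first):
comm -13 <(sort file1.txt) <(sort file2.txt)    # lines only in file2, i.e. (file2) - (file1)
comm -23 <(sort file1.txt) <(sort file2.txt)    # lines only in file1, i.e. (file1) - (file2)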
But now I need to implement this using Java. While it is easy to implement a solution using a brute-force comparison of the lines of both files, that approach is extremely bad in terms of both time and space complexity.
As my files can have as many as millions of lines, such an implementation is not scalable. So I am looking for a Java API that provides this functionality as well as scalability when it comes to such big files.
Assume we have two files named file1 and file2.
File1:
a=b
c=d
e=f
File2:
a=p
c=o
e=f
g=h
i=j
Here file2 has the same keys as file1 but different values, apart from some extra key-value pairs of its own.
I want to compare the keys of the two files and replace file2's value with file1's value wherever the keys match, while retaining the new entries in file2.
So, my final output should be:
File2:
a=b
c=d
e=f
g=h
i=j
Thanks in advance.
The quickest way without using scripts is to use the tool called "meld".
I can give one way of approaching the problem (though not the best):
1. Read the first file line by line.
2. Split each line on the "=" sign.
3. Store the two parts as a key and a value, building an array of all key-value pairs.
4. Read the second file and repeat the procedure.
5. Compare the two arrays and save only the values that are not in the first array.
In this specific case you can use the "cut" command in the shell to choose fields.
I personally prefer a Perl script for file operations like this :)
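If a shell one-liner is acceptable, the same merge can be sketched with awk (an illustration, assuming the files are named file1 and file2 as above):
# the first pass (NR==FNR) loads file1's pairs; the second pass rewrites file2
awk -F'=' 'NR==FNR { want[$1] = $2; next }
           $1 in want { print $1 "=" want[$1]; next }
           { print }' file1 file2 > file2.new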
I'm on Linux. I have a list of files and I'd like to overwrite them with zeros and remove them. I tried using
srm file1 file2 file3 ...
but it's too slow (I have to overwrite and remove ~50 GB of data) and I don't need that kind of security (I know that srm does a lot of passes instead of a single pass with zeros).
I know I could overwrite every single file using the command
cat /dev/zero > file1
and then remove it with rm, but I can't do that manually for every single file.
Is there a command like srm that does a single pass of zeros, or maybe a script that can do cat /dev/zero on a list of files instead of on a single one? Thank you.
Something like this, using stat to get the correct size to write, and dd to overwrite the file, might be what you need:
# note: this simple loop assumes the filenames contain no whitespace
for f in $(<list_of_files.txt)
do
    # allocated block count and block size, as reported by stat
    read blocks blocksize < <(stat -c "%b %B" "${f}")
    # overwrite the whole allocation with zeros, without truncating first
    dd if=/dev/zero bs="${blocksize}" count="${blocks}" of="${f}" conv=notrunc
    rm "${f}"
done
Use /dev/urandom instead of /dev/zero for (slightly) better erasure semantics.
Edit: added the conv=notrunc option to the dd invocation to avoid truncating the file when it's opened for writing, which would cause the associated storage to be released before it's overwritten.
I use shred for doing this.
The following are the options that I generally use.
shred -n 3 -z <filename> - This will make 3 passes overwriting the file with random data, and then a final pass overwriting it with zeros. The file will remain on disk, but it will contain only zeros.
shred -n 3 -z -u <filename> - Similar to the above, but it also unlinks (i.e. deletes) the file. The default option for deleting is wipesync, which is the most secure but also the slowest. Check the man pages for more options.
Note: -n controls the number of passes used to overwrite the file with random data. Increasing this number makes the shred operation take longer but shreds more thoroughly. I think 3 is enough, but I may be wrong.
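To apply this to your whole list of files with just a single pass of zeros, something along these lines should work (a sketch, assuming GNU xargs and shred, and that the list is in list_of_files.txt with one name per line):
# -n 0 skips the random passes, -z adds one final pass of zeros, -u deletes each file
xargs -a list_of_files.txt -d '\n' shred -n 0 -z -u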
The purpose of srm is to destroy the data in the file before releasing its blocks.
cat /dev/null > file is not at all equivalent to srm because
it does not destroy the data in the file: the blocks will be released with the original data intact.
Using /dev/zero instead of /dev/null does not even work because /dev/zero never ends.
Redirecting the output of a program to the file will never work for the same reason given for cat /dev/null.
You need a special-purpose program that opens the given file for writing, writes zeros over all bytes of the file, and then removes the file. That's what srm does.
Is there a command like srm that does a single pass of zeros,
Yes. SRM does this with the correct parameters. From man srm:
srm -llz
-l lessens the security. Only two passes are written: one mode with 0xff and a final mode with random values.
-l -l for a second time lessens the security even more: only one random pass is written.
-z wipes the last write with zeros instead of random data
srm -llzr will do the same recursively if wiping a directory.
You can even use srm -llz [file1] [file2] [file3] to wipe multiple files in this way with a single command.
I used the ls | wc -l command to count the number of files in a directory. Is there a command to count the number of different file types? Say the directory has 2 text files and one jpeg; the output should be 2 (text and jpeg are the different file types).
Any help is much appreciated. Thanks!
There is no single command (although you can certainly create one!) to do what you want, but it is quite simple to get your result. Decide exactly how you want to distinguish file type (filename extension, file content, name, etc.), then use common tools to count the result. If you are happy with the results printed by the file command, perhaps something as simple as:
file * | awk '{$1=""}1' | sort -u | wc -l
The awk filters out the first column of output (the filename) and the remaining commands in the pipeline count the results. This is fragile and will break if any of your filenames contain whitespace, so you might instead want to use : as the field separator (in which case the solution will fail if any filename contains a colon).
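For example, a colon-separated variant (an illustration; it tolerates whitespace in filenames but will still misbehave if a filename contains a colon):
file * | cut -d: -f2- | sort -u | wc -l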
Use file to find out the file types. Pipe that through grep to filter out things like images etc. and then do a wc -l.
I am working with a text file, which contains a list of processes under my programs control, along with relevant data.
At some point, one of the processes will finish, and thus will need to be removed from the file (as it's no longer under control).
Here is a sample of the file contents (which has entries added "randomly"):
PID=25729 IDLE=0.200000 BUSY=0.300000 USER=-10.000000
PID=26416 IDLE=0.100000 BUSY=0.800000 USER=-20.000000
PID=26522 IDLE=0.400000 BUSY=0.700000 USER=-30.000000
So for example, if I wanted to remove the line that says PID=26416.... how could I do that, without writing the file over again?
I can use external unix commands, however I am not very familiar with them so please if that is your suggestion, give an example.
Thanks!
You could keep the contents of the file in memory and then rewrite the file. Or you could have a separate file for each PID with the relevant information in it; then you simply delete that file when the process is no longer running. Or you could use a database for this instead.
As others have already pointed out, your only real choice is to rewrite the file.
The obvious way to do that with "external UNIX commands" would be grep -v "PID=26416" (or whatever PID you want to remove, obviously).
Edit: It is probably worth mentioning that if the lines are all the same length (as you've shown here) and order doesn't matter, you could delete a line more efficiently by copying the last line into the space being vacated, then shortening the file to eliminate what had been the last line. This will only work if they really are all the same length though (e.g., if you got a PID of '1', you'd need to pad it to the same length as the others in the file).
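A rough shell sketch of that trick (an illustration only, assuming the file is called proclist.txt, every line is exactly the same length, and GNU dd/truncate are available):
target=$(grep -n 'PID=26416' proclist.txt | cut -d: -f1)    # 1-based number of the line to remove
len=$(head -n 1 proclist.txt | wc -c)                       # line length, including the newline
total=$(wc -l < proclist.txt)
# copy the last line over the line being removed, then drop the now-duplicated last line
tail -n 1 proclist.txt | dd of=proclist.txt bs="$len" seek=$((target - 1)) conv=notrunc 2>/dev/null
truncate -s $(( (total - 1) * len )) proclist.txt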
The only way is by copying each character that comes after the deleted line down over the characters that are deleted.
It is far more efficient to simply rewrite the file.
how could I do that, without writing the file over again?
You cannot. Filesystems (perhaps aside from more esoteric record-based ones) do not support insertion or deletion in the middle of a file.
So you'll have to write the lines to a temporary file up to the line you want to delete, skip over that line, and write the rest of the lines to the temporary file. When done, rename/copy the temp file to the original filename.
Why are you maintaining these in a text file? That's not the best model for such a task. But, if you're stuck with it ... if these lines are guaranteed to all be the same length (it appears that way from the sample), and if the order of the lines in the file doesn't matter, then you can write the last line over the line for the process that has died and then shorten the file by one line with the (f)truncate() call if you're on a POSIX system: see Jonathan Leffler's answer in How to truncate a file in C?
But note carefully netrom's answer, which gives three different, better ways to maintain this info.
Also, if you stick with a text file (preferably written from scratch each time from data structures you maintain, as per netrom's first suggestion), and you want to be sure that the file is always well formed, then write the new data into a temp file on the same device (putting it in the same directory is easiest) and then do a rename() call, which is an atomic operation.
You can use sed:
sed -i.bak -e '/PID=26416/d' test
-i is for editing in place. The .bak suffix makes it also create a backup file of the original with that extension.
-e is for specifying the pattern. The /d indicates all lines matching the pattern should be deleted.
test is the filename
The unix command for it is:
grep -v "PID=26416" myfile > myfile.tmp
mv myfile.tmp myfile
The grep -v part outputs the file without the rows with the search term.
The > myfile.tmp part creates a new temp file for this output.
The mv part renames the temp file to the original file.
Note that we are rewriting the file here and, moreover, we can lose data if someone writes something to the file between the two commands.