How to find the passwords common to two given 20GB files? - c

I have two files of 20 GB each, and I have to remove the passwords common to both from one of the files.
I sorted the second file with the UNIX sort command. After that I split the sorted file into many files, so that each piece could fit in RAM, using the split command. Having split it into n files, I used a structure array of size n to store the first password of each split file along with its file name.
Then, for each key of the first file, I apply a binary search over that structure array (on the first password stored in each structure) to get the index of the corresponding split file, and then apply a binary search within that split file.
I assumed 20 characters as the maximum length of a password.
This program is not yet efficient. Please help me make it efficient if possible, and please give me some advice on how to sort that 20 GB file efficiently.
I am on a 64-bit system with 8 GB of RAM and an i3 quad-core processor.
I tested my program with two files of size 10 MB; it took about 2.66 hours without using any optimization options. According to my program, it will take about 7-8 hours to check every password of the 20 GB file after the splitting, sorting and binary searching.
Can I improve its time complexity? In other words, can I make it run faster?

Check out external sorting. See http://www.umbrant.com/blog/2011/external_sorting.html which does have code at the end of the page (https://github.com/umbrant/extsort).
The idea behind this kind of external sorting is to select and sort equidistant samples from the file, then partition the file at the sample points, sort the partitions, and merge the results.
example numbers = [1, 100, 2, 400, 60, 5, 0, 4]
example samples (distance 4) = 1, 60
chunks = {0, 1, 2, 5, 4}, {60, 100, 400}
Also, I don't think splitting the file is a good idea because you need to write 20GB to disk to split them. You might as well create the structure on the fly by seeking within the file.

For a previous SE question, "What algorithm to use to delete duplicates?" I described an algorithm for a probably-similar problem except with 50GB files instead of 20GB. The method is faster than sorting the big files in that problem.
Here is an adaptation of the method to your problem. Let's call the original two files A and B, and suppose A is larger than B. I don't understand from your problem description what is supposed to happen if or when a duplicate is detected, but in the following I assume you want to leave file A unchanged, and remove from B any items that also are in A. I also assume that entries within A are specified to be unique within A at the outset, and similarly for B. If that is not the case, the method needs more adapting and about twice as much I/O.
Suppose you can fit 1/k'th of file A into memory and still have room for the other required data structures. The whole file B can then be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting either file, depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 2G lines, ln(n) ~ 21.4, likely to be about 4 times as bad as O(k n) if k=5. (Algorithm constants still can change the situation either way, but with a fast hash function the method has good constants.)
Process:
Let Q = B (ie rename or copy B to Q). Then:
1. Allocate a few gigabytes for a work buffer W, and a gigabyte or so for a hash table H. Open input files A and Q, output file O, and temp file T. Go to step 2.
2. Fill work buffer W by reading from file A.
3. For each line L in W, hash L into H, such that H[hash[L]] indexes line L.
4. Read all of Q, using H to detect duplicates, writing non-duplicates to temp file T.
5. Close and delete Q, rename T to Q, open new temp file T.
6. If EOF(A), rename Q to B and quit, else go to step 2.
Note that after each pass (ie at start of step 6) none of the lines in Q are duplicates of what has been read from A so far. Thus, 1/k'th of the original file is processed per pass, and processing takes k passes. Also note that although processing will be I/O bound you can read and write several times faster with big buffers (eg 8MB) than line-by-line.
The algorithm as stated above does not include buffering details or how to deal with partial lines in big buffers.
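As a concrete illustration, here is a minimal C sketch of what one pass (steps 2 through 5) could look like, under assumptions that are not part of the original description: lines fit in 256 bytes, the work buffer W is represented as strdup'ed copies held in a fixed-size open-addressing hash set, and max_lines stays well below the table size. The driver loop that deletes Q, renames T to Q, frees the copies, and repeats until A is exhausted is omitted.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE (1u << 22)   /* assumed: ~4M slots; keep max_lines well below this */

static char *table[TABLE_SIZE]; /* open-addressing hash set of lines from A */

static unsigned long hash_line(const char *s)
{
    unsigned long h = 5381;             /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static void table_insert(char *line)
{
    unsigned long i = hash_line(line) % TABLE_SIZE;
    while (table[i] != NULL)            /* linear probing */
        i = (i + 1) % TABLE_SIZE;
    table[i] = line;
}

static int table_contains(const char *line)
{
    unsigned long i = hash_line(line) % TABLE_SIZE;
    while (table[i] != NULL) {
        if (strcmp(table[i], line) == 0)
            return 1;
        i = (i + 1) % TABLE_SIZE;
    }
    return 0;
}

/* One pass: load up to max_lines lines of A into the hash set, then copy
   every line of Q that is not in the set to T.  Returns the number of
   lines consumed from A (0 once A is exhausted). */
static size_t one_pass(FILE *A, FILE *Q, FILE *T, size_t max_lines)
{
    char buf[256];
    size_t n = 0;

    memset(table, 0, sizeof table);
    while (n < max_lines && fgets(buf, sizeof buf, A)) {
        buf[strcspn(buf, "\n")] = '\0';
        table_insert(strdup(buf));      /* the work buffer W, one copy per line */
        n++;
    }
    while (fgets(buf, sizeof buf, Q)) {
        buf[strcspn(buf, "\n")] = '\0';
        if (!table_contains(buf))
            fprintf(T, "%s\n", buf);    /* non-duplicate: keep it */
    }
    return n;
}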
Here is a simple performance example: Suppose A, B both are 20GB files, that each has about 2G passwords in it, and that duplicates are quite rare. Also suppose 8GB RAM is enough for work buffer W to be 4GB in size leaving enough room for hash table H to have say .6G 4-byte entries. Each pass (steps 2-5) reads 20% of A and reads and writes almost all of B, at each pass weeding out any password already seen in A. I/O is about 120GB read (1*A+5*B), 100GB written (5*B).
Here is a more involved performance example: Suppose about 1G randomly distributed passwords in B are duplicated in A, with all else as in previous example. Then I/O is about 100GB read and 70GB written (20+20+18+16+14+12 and 18+16+14+12+10, respectively).

Searching in external files is going to be painfully slow, even using binary search. You might speed it up by putting the data in an actual database designed for fast lookups. You could also sort both text files once and then do a single linear scan to filter out words. Something like the following pseudocode:
sort the files using any suitable sorting utility
open files A and B for reading
read wordA from A
read wordB from B
while (A not EOF and B not EOF)
{
    if (wordA < wordB)
        write wordA to output
        read wordA from A
    else if (wordA > wordB)
        read wordB from B
    else
        /* match found, don't output wordA */
        read wordA from A
}
while (A not EOF)    /* output remaining words */
{
    write wordA to output
    read wordA from A
}

Like so:
Concatenate the two files.
Use sort to sort the total result.
Use uniq to remove duplicates from the sorted total.

If C++ is an option for you, the ready-to-use STXXL library should be able to handle your dataset.
Anyway, if you use an external sort in C, as suggested in another answer, I think you should sort both files and then scan both sequentially. The scan should be fast, and the sort can be done in parallel.

Related

How do I search for the most common words in a very big file (over 1 GB) while using 1 KB or less of memory?

I have a very big text file, with dozens of millions of words, one word per line. I need to find the top 10 most common words in that file. There are some restrictions: only the standard library may be used, and no more than 1 KB of memory.
It is guaranteed that any 10 words in that file are short enough to fit into the stated memory limit, and that there will be enough memory for some other variables such as counters, etc.
The only solution I have come up with is to use another text file as additional memory and a buffer, but that seems to be a bad and slow way to deal with the problem.
Are there any better and efficient solutions?
You can first sort this file (it is possible with limited memory, but will of course require disk I/O - see How do I sort very large files as a starter).
Then you will be able to read the sorted file line by line and calculate the frequency of each word one run at a time. Store the best candidates in a 10-entry array; whenever a word's frequency is higher than the smallest one stored in the array, add it and drop the least frequent entry. That way you keep only the 10 most frequent words in memory during this stage.
As @John Bollinger mentioned, if the requirement is to print all top-10 words (for example, if every word in the file has the same frequency, i.e. they are all "top"), then this approach will not work; you would need to calculate the frequency of each word, store it in a file, sort that, and then print the top 10 including every word tied with the 10th one.
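Here is a minimal C sketch of that counting stage, assuming the file has already been sorted so that equal words are adjacent, that words fit in 32 bytes, and using a placeholder file name (stdio's own buffers are not counted toward the 1 KB limit):
#include <stdio.h>
#include <string.h>

#define TOP 10
#define MAX_WORD 32   /* assumed maximum word length */

struct entry { char word[MAX_WORD]; long count; };

/* Replace the least frequent of the stored words if this run beats it. */
static void offer(struct entry top[], const char *word, long count)
{
    int i, min = 0;
    for (i = 1; i < TOP; i++)
        if (top[i].count < top[min].count)
            min = i;
    if (count > top[min].count) {
        strncpy(top[min].word, word, MAX_WORD - 1);
        top[min].word[MAX_WORD - 1] = '\0';
        top[min].count = count;
    }
}

int main(void)
{
    static struct entry top[TOP];           /* zero-initialized */
    FILE *f = fopen("sorted_words.txt", "r");
    char cur[MAX_WORD] = "", line[MAX_WORD];
    long run = 0;
    int i;

    if (!f)
        return 1;
    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\n")] = '\0';
        if (strcmp(line, cur) == 0) {
            run++;                          /* same word: extend the current run */
        } else {
            if (run > 0)
                offer(top, cur, run);       /* a run just ended */
            strcpy(cur, line);
            run = 1;
        }
    }
    if (run > 0)
        offer(top, cur, run);               /* flush the final run */
    fclose(f);

    for (i = 0; i < TOP; i++)
        if (top[i].count > 0)
            printf("%s %ld\n", top[i].word, top[i].count);
    return 0;
}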
If you can create a new file however big, you can create a simple disk-based tree database holding each word and its frequency so far. This will cost you O(log n) each time, with n from 1 to N words, plus the final scan of the whole N-sized tree, which adds up to O(N log N).
If you cannot create a new file, you'll need to perform an in-place sort of the whole file, which will cost about O(N^2). That's closer to O((N/k)^2), I think, with k the average number of words you can keep in memory for the simplest bubble sort; but that is (1/k^2) * O(N^2) = K * O(N^2), which is still O(N^2). At that point you can rescan the file one final time, and after each run of each word you'll know whether that word can enter your top ten, and at which position. So you need to fit just twelve words in memory (the top ten, the current word, and the word just read from the file). 1 KB should be enough.
So, the auxiliary file is actually the fastest option.

Read a file of length n in log n time

To do any read/write operation on a FILE in any programming language, you have to navigate to the exact location in the file and then handle it in read/write mode. Consider a file of N lines: reading it requires a loop over each line, so the loop repeats N times and the complexity of reading the file turns out to be O(N).
Is there any algorithm to read a file of N lines in O(log N) time?
It definitely depends on filesystem capabilities; for example, in FAT32 filesystem, the numbers of disk sectors which form the file are stored as a linked list, so it would require linear time (albeit with a rather small constant) to even arrive at the end of a file.
Other than that, it depends on whether we know the lengths of the lines in advance. Otherwise, it is unlikely to be possible in the general case: each position in the file which we don't read can either contain a line break or not contain it.
On the other hand, if, for example, all lines have the same length known in advance, and the filesystem allows accessing an arbitrary position in the file in sublinear time, the whole problem is solvable in that sublinear time, too.
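In that fixed-length case, reading line i is one seek plus one read, so a binary search over the whole file needs only O(log N) reads. A minimal sketch, assuming every line is exactly LINE_LEN bytes (newline included), offsets fit in a long, and a placeholder file name:
#include <stdio.h>

#define LINE_LEN 64   /* assumed fixed line length, '\n' included */

/* Read line i (0-based) into buf, which must hold LINE_LEN bytes. */
int read_line(FILE *f, long i, char *buf)
{
    if (fseek(f, i * (long)LINE_LEN, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, LINE_LEN, f) != LINE_LEN)
        return -1;
    buf[LINE_LEN - 1] = '\0';   /* overwrite the trailing newline */
    return 0;
}

int main(void)
{
    FILE *f = fopen("fixed_length_lines.txt", "rb");
    char buf[LINE_LEN];

    if (f && read_line(f, 1234, buf) == 0)  /* jump straight to line 1234 */
        puts(buf);
    if (f)
        fclose(f);
    return 0;
}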

File Browser in C for POSIX OS

I have created a file browsing UI for an embedded device. On the embedded side, I am able to get all files in a directory off the hard disk and return stats such as name, size, modified, etc. This is done using opendir and closedir and a while loop that goes through every file until no files are left.
This is cool until file counts reach large quantities. I need to implement pagination and sorting. Suppose I have 10,000 files in a directory - how can I possibly go through that many files and sort them based on size, name, etc., without easily busting the RAM (about 1 MB of RAM...!)? Perhaps something already exists within the hard drive OS or drivers?
Here are two suggestions, both of which have a small memory footprint. The first uses no more memory than the number of results you wish to return for the request. It is constant, O(1), memory (it depends only on the size of the result set) but is ultimately quadratic time (or worse) if the user really does page through all results:
You are only looking for a small paged result (e.g. the r = 25 entries to display). You can generate these by scanning through all filenames and maintaining a sorted list of the items you will return, using an insertion sort of length r and, for each file inserted, retaining only the first r results. (In practice you would not even insert a file F if it sorts after the current rth entry.)
How would you generate the 2nd page of results? You already know the 25th file from the previous request - so during the scan ignore all entries that are before that. (You'll need to work harder if sorting on fields with duplicates)
The upside is the minimum memory required - the memory needed is not much larger than the r results you wish to return (and can even be less if you don't cache the names). The downside is generating the complete result will be quadratic in time in terms of the number of total files you have. In practice people don't sort results then page through all pages, so this may be acceptable.
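Here is a rough C sketch of that first suggestion, keeping one sorted page of R names while streaming the directory once; R, the directory path and the fixed name length are assumptions, and the refinement of skipping everything before the previous page's last entry is left out:
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define R 25              /* page size */
#define NAME_LEN 256      /* assumed maximum name length */

int main(void)
{
    char page[R][NAME_LEN];     /* the first R names, kept sorted */
    int count = 0, i;
    DIR *d = opendir("/some/dir");
    struct dirent *e;

    if (!d)
        return 1;
    while ((e = readdir(d)) != NULL) {
        int pos;
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        /* If the page is full and this name sorts after the last slot, skip it. */
        if (count == R && strcmp(e->d_name, page[R - 1]) >= 0)
            continue;
        /* Insertion sort: shift the tail down, dropping the old last entry. */
        for (pos = (count < R ? count : R - 1);
             pos > 0 && strcmp(e->d_name, page[pos - 1]) < 0; pos--)
            strcpy(page[pos], page[pos - 1]);
        strncpy(page[pos], e->d_name, NAME_LEN - 1);
        page[pos][NAME_LEN - 1] = '\0';
        if (count < R)
            count++;
    }
    closedir(d);

    for (i = 0; i < count; i++)
        puts(page[i]);
    return 0;
}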
If your memory budget is larger (e.g. fewer than 10,000 files) but you still don't have enough space to perform a simple in-memory sort of all 10,000 filenames, then seekdir/telldir is your friend. That is, create an array of longs by streaming readdir and using telldir to capture the position of each entry (you might even be able to compress the delta between successive telldir values into a 2-byte short). As a minimal implementation you can then sort them all with the C library's qsort, writing your own comparison callback to convert a location into a comparable value. Your callback will use seekdir twice to read the two filenames.
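And a rough sketch of that seekdir/telldir variant, which stores only one long per entry and re-reads names through the directory stream whenever the comparator needs them (the path and buffer sizes are placeholders, and most error handling is omitted):
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static DIR *dirp;               /* shared with the qsort comparator */

/* Re-read the name of the entry at directory position pos. */
static void name_at(long pos, char *buf, size_t len)
{
    struct dirent *e;
    seekdir(dirp, pos);
    e = readdir(dirp);
    strncpy(buf, e ? e->d_name : "", len - 1);
    buf[len - 1] = '\0';
}

static int cmp_pos(const void *a, const void *b)
{
    char na[256], nb[256];
    name_at(*(const long *)a, na, sizeof na);
    name_at(*(const long *)b, nb, sizeof nb);
    return strcmp(na, nb);
}

int main(void)
{
    long *pos = NULL, p;
    size_t n = 0, cap = 0, i;

    dirp = opendir("/some/dir");
    if (!dirp)
        return 1;

    /* Record the telldir position of every entry instead of its name. */
    for (p = telldir(dirp); readdir(dirp) != NULL; p = telldir(dirp)) {
        if (n == cap) {
            cap = cap ? cap * 2 : 64;
            pos = realloc(pos, cap * sizeof *pos);  /* error check omitted */
        }
        pos[n++] = p;
    }

    qsort(pos, n, sizeof *pos, cmp_pos);    /* names fetched on demand */

    for (i = 0; i < n; i++) {
        char name[256];
        name_at(pos[i], name, sizeof name);
        puts(name);
    }
    closedir(dirp);
    return 0;
}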
The above approach is overkill - you just sorted all entries when you only needed one page of ~25 - so for fun, why not read up on Hoare's quickselect algorithm and use a version of it to identify the results within the required range? You can recursively ignore all entries outside the required range and sort only the entries between the first and last entry of the result page.
What you want is an external sort, that's a sort done with external resources, usually on disk. The Unix sort command does this. Typically this is done with an external merge sort.
The algorithm is basically this. Let's assume you want to dedicate 100k of memory to this (the more you dedicate, the fewer disk operations, the faster it will go).
1. Read 100k of data into memory (ie. call readdir a bunch).
2. Sort that 100k hunk in-memory.
3. Write the hunk of sorted data to its own file on disk. (You can also use offsets in a single file.)
4. GOTO 1 until all hunks are sorted.
Now you have X hunks of 100k on disk and each of them is sorted. Let's say you have 9 hunks. To keep within the 100k memory limit, we'll divide the work up into the number of hunks + 1. 9 hunks, plus 1, is 10. 100k / 10 is 10k. So now we're working in blocks of 10k.
Read the first 10k of each hunk into memory.
Allocate another 10k (or more) as a buffer.
Do an K-way merge on the hunks.
Write the smallest in any hunk to the buffer. Repeat.
When the buffer fills, append it to a file on disk.
When a hunk empties, read the next 10k from that hunk.
When all hunks are empty, read the resulting sorted file.
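A compact sketch of that k-way merge step, assuming the sorted hunks were written as text files with one name per line; the hunk file names, K and the line-length limit are placeholders, and stdio's internal buffering stands in for the explicit 10k blocks:
#include <stdio.h>
#include <string.h>

#define K 9
#define MAX_LEN 256

int main(void)
{
    FILE *in[K], *out = fopen("sorted.txt", "w");
    char line[K][MAX_LEN];
    int alive[K], i;

    if (!out)
        return 1;
    for (i = 0; i < K; i++) {               /* open each hunk, prime its head line */
        char name[64];
        snprintf(name, sizeof name, "hunk%d.txt", i);
        in[i] = fopen(name, "r");
        alive[i] = in[i] && fgets(line[i], MAX_LEN, in[i]) != NULL;
    }
    for (;;) {
        int min = -1;
        for (i = 0; i < K; i++)             /* pick the smallest head line */
            if (alive[i] && (min < 0 || strcmp(line[i], line[min]) < 0))
                min = i;
        if (min < 0)
            break;                          /* every hunk is drained */
        fputs(line[min], out);
        alive[min] = fgets(line[min], MAX_LEN, in[min]) != NULL;
    }
    for (i = 0; i < K; i++)
        if (in[i])
            fclose(in[i]);
    fclose(out);
    return 0;
}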
You might be able to find a library to perform this for you.
Since this is obviously overkill for the normal case of small lists of files, only use it if there are more files in the directory than you care to have in memory.
Both in-memory sort and external sort begin with the same step: start calling readdir and writing to a fixed-sized array. If you run out of files before running out of space, just do an in-memory quicksort on what you've read. If you run out of space, this is now the first hunk of an external sort.

Binary Search on Large Disk File in C - Problems

This question recurs frequently on StackOverflow, but I have read all the previous relevant answers, and have a slight twist on the question.
I have a 23Gb file containing 475 million lines of equal size, with each line consisting of a 40-character hash code followed by an identifier (an integer).
I have a stream of incoming hash codes - billions of them in total - and for each incoming hash code I need to locate it in the file and print out the corresponding identifier. This job, while large, only needs to be done once.
The file is too large for me to read into memory and so I have been trying to use mmap in the following way:
codes = (char *) mmap(0,statbuf.st_size,PROT_READ,MAP_SHARED,codefile,0);
Then I just do a binary search using address arithmetic based on the address in codes.
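(Presumably the address-arithmetic search looks roughly like the sketch below; the exact line layout - 61 bytes: 40-character hash, one separator, the identifier and a newline - is an assumption, since the question only states the hash length.)
#include <stdlib.h>
#include <string.h>

#define LINE_LEN 61   /* assumed: 40-char hash, one separator, the id, '\n' */
#define HASH_LEN 40

/* Binary search the mapped file for target; return its identifier or -1. */
long find_id(const char *codes, long nlines, const char *target)
{
    long lo = 0, hi = nlines - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        const char *line = codes + mid * (long)LINE_LEN;
        int cmp = memcmp(target, line, HASH_LEN);
        if (cmp == 0)
            return strtol(line + HASH_LEN + 1, NULL, 10);  /* parse the id */
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}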
This seems to start working beautifully and produces a few million identifiers in a few seconds, using 100% of the cpu, but then after some, seemingly random, amount of time it slows down to a crawl. When I look at the process using ps, it has changed from status "R" using 100% of the cpu, to status "D" (diskbound) using 1% of the cpu.
This is not repeatable - I can start the process off again on the same data, and it might run for 5 seconds or 10 seconds before the "slow to crawl" happens. Once last night, I got nearly a minute out of it before this happened.
Everything is read only, I am not attempting any writes to the file, and I have stopped all other processes (that I control) on the machine. It is a modern Red Hat Enterprise Linux 64-bit machine.
Does anyone know why the process becomes disk-bound and how to stop it?
UPDATE:
Thanks to everyone for answering, and for your ideas; I had not tried all the various improvements before because I was wondering if I was somehow using mmap incorrectly. But the gist of the answers seemed to be that unless I could squeeze everything into memory, I would inevitably run into problems. So I squashed the hash codes down to the length of the leading prefix that did not create any duplicates - the first 15 characters were enough. Then I pulled the resulting file into memory, and ran the incoming hash codes in batches of about 2 billion each.
The first thing to do is split the file.
Make one file with the hash codes and another with the integer ids. Since the rows are in the same order, everything will line up fine once the result is found. Also you can try an approach that puts every nth hash into another file and then stores the index.
For example, put every 1000th hash key into a new file along with its index and load that into memory, then binary search that instead. This will tell you the range of 1000 entries that needs to be scanned further in the big file. That will do fine, and you can probably use a much smaller stride; every 20th record or so would divide the file size down by roughly 20, if I am thinking about it correctly.
In other words, after scanning you only need to touch a few kilobytes of the file on disk.
Another option is to split the file and put it in memory on multiple machines. Then just binary scan each file. This will yield the absolute fastest possible search with zero disk access...
Have you considered hacking a PATRICIA trie algorithm up? It seems to me that if you can build a PATRICIA tree representation of your data file, which refers to the file for the hash and integer values, then you might be able to reduce each item to node pointers (2*64 bits?), bit test offsets (1 byte in this scenario) and file offsets (uint64_t, which might need to correspond to multiple fseek()s).
Does anyone know why the process becomes disk-bound and how to stop it?
Binary search requires a lot of seeking within the file. In the case where the whole file doesn't fit in memory, the page cache doesn't handle the big seeks very well, resulting in the behaviour you're seeing.
The best way to deal with this is to reduce/prevent the big seeks and make the page cache work for you.
Three ideas for you:
If you can sort the input stream, you can search the file in chunks, using something like the following algorithm:
code_block <- mmap the first N entries of the file, where N entries fit in memory
max_code <- code_block[N - 1]
while(input codes remain) {
    input_code <- next input code
    while(input_code > max_code) {
        code_block <- mmap the next N entries of the file
        max_code <- code_block[N - 1]
    }
    binary search for input_code in code_block
}
If you can't sort the input stream, you could reduce your disk seeks by building an in-memory index of the data. Pass over the large file, and make a table that is:
record_hash, offset into file where this record starts
Don't store all records in this table - store only every Kth record. Pick a large K, but small enough that this fits in memory.
To search the large file for a given target hash, do a binary search in the in-memory table to find the biggest hash in the table that is smaller than the target hash. Say this is table[h]. Then, mmap the segment starting at table[h].offset and ending at table[h+1].offset, and do a final binary search. This will dramatically reduce the number of disk seeks.
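A sketch of the index-building and table-lookup halves of that idea, assuming fixed-length lines as in the question; LINE_LEN, K and the field layout are assumptions, and the final step of mmap'ing or pread'ing the small segment that starts at the returned offset and binary searching it is left out:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_LEN 61        /* assumed record size: 40-char hash + id + '\n' */
#define HASH_LEN 40
#define K 1024             /* index every 1024th record */

struct idx { char hash[HASH_LEN + 1]; long offset; };

/* Build the sparse index in one sequential pass over the big file. */
static struct idx *build_index(FILE *f, size_t *out_n)
{
    struct idx *tab = NULL;
    size_t n = 0, cap = 0;
    char line[LINE_LEN + 1];
    long rec = 0;

    while (fgets(line, sizeof line, f)) {
        if (rec % K == 0) {
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                tab = realloc(tab, cap * sizeof *tab);  /* error check omitted */
            }
            memcpy(tab[n].hash, line, HASH_LEN);
            tab[n].hash[HASH_LEN] = '\0';
            tab[n].offset = rec * (long)LINE_LEN;
            n++;
        }
        rec++;
    }
    *out_n = n;
    return tab;
}

/* Binary search the in-memory table for the last entry <= target and
   return the file offset where the K-record segment to search begins. */
static long segment_start(const struct idx *tab, size_t n, const char *target)
{
    size_t lo = 0, hi = n;

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (strncmp(tab[mid].hash, target, HASH_LEN) <= 0)
            lo = mid;
        else
            hi = mid;
    }
    return tab[lo].offset;
}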
If this isn't enough, you can have multiple layers of indexes:
record_hash, offset into index where the next index starts
Of course, you'll need to know ahead of time how many layers of index there are.
Lastly, if you have extra money available you can always buy more than 23 GB of RAM and make this a memory-bound problem again (I just looked at Dell's website - you can pick up a new low-end workstation with 32 GB of RAM for just under $1,400 Australian dollars). Of course, it will take a while to read that much data in from disk, but once it's there, you'll be set.
Instead of using mmap, consider just using plain old lseek+read. You can define some helper functions to read a hash value or its corresponding integer:
/* fd and line_len are assumed to have been set up elsewhere. */
void read_hash(int line, char *hashbuf) {
    /* Seek to the start of the requested line and read its 40-character hash. */
    lseek64(fd, ((uint64_t)line) * line_len, SEEK_SET);
    read(fd, hashbuf, 40);
}

int read_int(int line) {
    /* Seek past the hash on the requested line and read the identifier. */
    lseek64(fd, ((uint64_t)line) * line_len + 40, SEEK_SET);
    int ret;
    read(fd, &ret, sizeof(int));
    return ret;
}
then just do your binary search as usual. It might be a bit slower, but it won't start chewing up your virtual memory.
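The binary search on top of those helpers would then be the textbook version, something along these lines (num_lines is assumed to be known, for example from the file size divided by line_len, and memcmp needs string.h):
/* Return the identifier for target (40 hash characters), or -1 if absent. */
int find_identifier(const char *target, long num_lines)
{
    char hash[40];
    long lo = 0, hi = num_lines - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        int cmp;
        read_hash((int)mid, hash);              /* one seek + one small read */
        cmp = memcmp(target, hash, 40);
        if (cmp == 0)
            return read_int((int)mid);          /* found: fetch the id */
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}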
We don't know the back story. So it is hard to give you definitive advice. How much memory do you have? How sophisticated is your hard drive? Is this a learning project? Who's paying for your time? 32GB of ram doesn't seem so expensive compared to two days of work of person that makes $50/h. How fast does this need to run? How far outside the box are you willing to go? Does your solution need to use advanced OS concepts? Are you married to a program in C? How about making Postgres handle this?
Here is a low-risk alternative. This option isn't as intellectually appealing as the other suggestions but has the potential to give you significant gains. Separate the file into 3 chunks of 8 GB or 6 chunks of 4 GB (depending on the machines you have around; it needs to fit comfortably in memory). On each machine run the same software, but with the chunk held in memory, and put an RPC stub around each. Then write an RPC caller that asks each of your 3 or 6 workers for the integer associated with a given hash code.

Efficient algorithm to sort file records

I have a file which contains a number of records of varying length. What would be an efficient algorithm to sort these records?
Record sample:
000000000000dc01 t error_handling 44
0000000dfa01a000 t fun 44
Total records: > 5000
Programming language: C
I would like to know which algorithm is suitable for sorting this file based on the address, and what would be an efficient way to read these records?
If the file is too large to fit into memory, then your only reasonable choice is a file-based merge sort, which involves two passes.
In the first pass, read blocks of N records (where N is defined as the number of records that will fit into memory), sort them, and write them to a temporary file. When this pass is done, you either have a number (call it M) of temporary files, each with some varying number of records that are sorted, or you have a single temporary file that contains blocks of sorted records.
The second pass is an M-way merge.
I wrote an article some time back about how to do this with a text file. See Sorting a Large Text File. It's fairly straightforward to extend that so that it will sort other types of records that you define.
For more information, see External sorting.
Since the records are of varying length, an efficient method would be:
Read and parse the file into an array of pointers to records
Sort array of pointers
Write the results
Random access into the file will be slow, as the newlines have to be counted to find a specific record.
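A minimal sketch of this approach, assuming the whole file fits in memory, each record is one line of at most 255 characters, and the leading field is the hexadecimal address to sort on (the file name is a placeholder):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Compare two records by their leading hexadecimal address field. */
static int cmp_by_address(const void *a, const void *b)
{
    unsigned long long x = strtoull(*(char *const *)a, NULL, 16);
    unsigned long long y = strtoull(*(char *const *)b, NULL, 16);
    return (x > y) - (x < y);
}

int main(void)
{
    FILE *f = fopen("records.txt", "r");
    char **recs = NULL;
    size_t n = 0, cap = 0, i;
    char line[256];

    if (!f)
        return 1;
    while (fgets(line, sizeof line, f)) {       /* read and parse */
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            recs = realloc(recs, cap * sizeof *recs);   /* error check omitted */
        }
        recs[n++] = strdup(line);
    }
    fclose(f);

    qsort(recs, n, sizeof *recs, cmp_by_address);   /* sort the pointers */

    for (i = 0; i < n; i++) {                       /* write the results */
        fputs(recs[i], stdout);
        free(recs[i]);
    }
    free(recs);
    return 0;
}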
If you've got a really big file, adapt the process to:
for each n records
    read and parse
    sort
    write to temporary file
mergesort temporary files
In-place quicksort is one of the best generic sorting algorithms. Faster sorting is possible (such as bucket sort), but it depends on some properties of the data you're sorting.
