Sorting multiple vectors with "slices" distributed across many files - arrays

I'm dealing with a big data problem: I've got some large number of arrays (~1M) that are distributed across a large number of files (~1k). The data is organized so that the ith file contains the ith entry of all arrays. If the overall cost of my algorithm is determined by the number of files that I need to open (and assuming only one file can be opened at a time), is there a strategy to simultaneously sort all of the arrays in-place so as to minimize the overall cost?
Note that the data is far too large for everything to be stored in memory, but there should be no problem storing ~10 entries from all arrays in memory (i.e. 10x1M values).

This question lacks some information: there is no mention of whether the individual arrays are already sorted. I am going to answer assuming they are not.
The data is organized so that the ith file contains the ith entry of
all arrays.
From this, I can assume this -
file i
------------
arr1[i]
arr2[i]
arr3[i]
...
...
arrN[i] # N = ~1M
You mentioned that the number of arrays is ~1M and the number of files is ~1K, so no array will contain more than 1K elements; otherwise more files would be required.
Each file contains 1M elements.
....but there should be no problem storing ~10 entries from all arrays
in memory (i.e. 10x1M values).
So we should be able to load all the elements of a single file into memory, since that is no more than 1M values.
So load each file into memory and sort its elements.
Then apply the K-way merge algorithm, using a min-heap, across the 1K files of sorted elements. This step only needs about c * 1M elements in memory at a time, where c is a small constant (c < 3).
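For concreteness, here is a minimal sketch of that merge step in C, assuming each run file simply holds sorted binary int values; the file format, names and error handling are simplified for illustration:

```c
/* Minimal K-way merge sketch: merges K run files of sorted binary ints
 * into one output file using a min-heap holding one element per run. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int value;   /* current smallest unconsumed value of this run */
    FILE *fp;    /* the run file it came from */
} HeapItem;

static void swap_item(HeapItem *a, HeapItem *b) { HeapItem t = *a; *a = *b; *b = t; }

static void sift_down(HeapItem *h, int n, int i) {
    for (;;) {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l].value < h[smallest].value) smallest = l;
        if (r < n && h[r].value < h[smallest].value) smallest = r;
        if (smallest == i) break;
        swap_item(&h[i], &h[smallest]);
        i = smallest;
    }
}

/* Merge the runs named in 'paths' (each already sorted) into 'out_path'. */
int kway_merge(const char **paths, int k, const char *out_path) {
    HeapItem *heap = malloc(k * sizeof *heap);
    FILE *out = fopen(out_path, "wb");
    int n = 0;
    if (!heap || !out) return -1;

    /* Prime the heap with the first element of every run. */
    for (int i = 0; i < k; i++) {
        FILE *fp = fopen(paths[i], "rb");
        int v;
        if (fp && fread(&v, sizeof v, 1, fp) == 1) {
            heap[n].value = v;
            heap[n].fp = fp;
            n++;
        } else if (fp) {
            fclose(fp);
        }
    }
    for (int i = n / 2 - 1; i >= 0; i--) sift_down(heap, n, i);

    /* Repeatedly emit the global minimum and refill from the same run. */
    while (n > 0) {
        fwrite(&heap[0].value, sizeof heap[0].value, 1, out);
        if (fread(&heap[0].value, sizeof heap[0].value, 1, heap[0].fp) != 1) {
            fclose(heap[0].fp);      /* run exhausted: drop it from the heap */
            heap[0] = heap[--n];
        }
        sift_down(heap, n, 0);
    }
    fclose(out);
    free(heap);
    return 0;
}
```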
Let me know if you have any trouble understanding K-way merging.
Hope this helps!

Related

Appropriate data structure for counting frequency of string

I have a task of counting the frequency of strings (words) in a text file. What data structure do you think is appropriate (based on implementation difficulty, memory usage and time complexity of the algorithm)? I have a hash table, binary search tree and heap in mind, but I don't know which one to choose. Also, if there is any better data structure than the ones I mentioned, that would be great too. Thanks in advance.
N.B. the text file could be extremely large.
Because you say the file could be extremely large, I assumed you can't keep all the words in memory simultaneously.
Note that if the file had all words sorted, finding the frequencies would only require keeping a counter and the last two words in memory at a time, to compare them. As long as the same word as before is read, increment the counter. When you hit a different word, save the previous word and its count to another file holding the frequencies and start counting over for the new word.
So the question is how to sort the words in a file. For that purpose you can use merge sort. Note that when merging subarrays, you only need to keep two words in memory, one per subarray. Additionally, you will need to create one extra file, playing the role of the extra array in an in-memory merge sort, and keep track of positions within the files. If you write to the original and extra files alternately in the recursive calls, these two will be enough.
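As a concrete illustration of that counting pass, here is a small sketch in C that assumes the words have already been sorted, one per line; the word-length limit and file handling are simplified:

```c
/* Counting pass over an already-sorted word file (one word per line):
 * only the previous word and a counter are kept in memory. */
#include <stdio.h>
#include <string.h>

int count_frequencies(const char *sorted_path, const char *out_path) {
    FILE *in = fopen(sorted_path, "r");
    FILE *out = fopen(out_path, "w");
    char prev[256] = "", cur[256];
    long count = 0;

    if (!in || !out) return -1;
    while (fscanf(in, "%255s", cur) == 1) {
        if (count > 0 && strcmp(cur, prev) == 0) {
            count++;                                    /* same word as before */
        } else {
            if (count > 0)
                fprintf(out, "%s %ld\n", prev, count);  /* flush the finished run */
            strcpy(prev, cur);
            count = 1;
        }
    }
    if (count > 0)
        fprintf(out, "%s %ld\n", prev, count);
    fclose(in);
    fclose(out);
    return 0;
}
```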

How do I search most common words in very big file (over 1 Gb) wit using 1 Kb or less memory?

I have a very big text file with dozens of millions of words, one word per line. I need to find the top 10 most common words in that file. There are some restrictions: use only the standard library and use less than 1 KB of memory.
It is guaranteed that any 10 words in that file are short enough to fit into the stated memory limit and that there will be enough memory for some other variables such as counters, etc.
The only solution I have come up with is to use another text file as additional memory and a buffer. But that seems like a bad and slow way to deal with the problem.
Are there any better, more efficient solutions?
You can first sort this file (it is possible with limited memory, but it will require disk IO of course - see How do I sort very large files as a starter).
Then you will be able to read the sorted file line by line and calculate the frequency of each word, one at a time; store the words, and after 10 words, whenever a word's frequency is higher than all those stored in your array, add it and remove the least frequent one. This way you keep only the 10 most frequent words in memory during this stage.
As @John Bollinger mentioned, if your requirement is to print all top-10 words - for example, if all words in the file have the same frequency, so they are all "top" - then this approach will not work. In that case you need to calculate the frequency of each word, store it in a file, sort that, and then print the top 10, including all words with the same frequency as the 10th one.
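A sketch of that bookkeeping in C, assuming the words are already sorted one per line; the word-length cap and the fixed table of ten entries are illustrative, and the stdio buffers are not counted against the 1 KB limit:

```c
/* Keep only the ten most frequent words while scanning a sorted word file. */
#include <stdio.h>
#include <string.h>

#define TOP 10

struct entry { char word[64]; long count; };

static void offer(struct entry top[TOP], const char *word, long count) {
    int min = 0;
    for (int i = 1; i < TOP; i++)
        if (top[i].count < top[min].count) min = i;
    if (count > top[min].count) {               /* evict the least frequent entry */
        strncpy(top[min].word, word, sizeof top[min].word - 1);
        top[min].word[sizeof top[min].word - 1] = '\0';
        top[min].count = count;
    }
}

int top_ten(const char *sorted_path) {
    FILE *in = fopen(sorted_path, "r");
    struct entry top[TOP] = {{"", 0}};          /* rest zero-initialized */
    char prev[64] = "", cur[64];
    long count = 0;

    if (!in) return -1;
    while (fscanf(in, "%63s", cur) == 1) {
        if (count > 0 && strcmp(cur, prev) == 0) {
            count++;
        } else {
            if (count > 0) offer(top, prev, count);   /* a word's run just ended */
            strcpy(prev, cur);
            count = 1;
        }
    }
    if (count > 0) offer(top, prev, count);
    for (int i = 0; i < TOP; i++)                /* print survivors (unordered) */
        if (top[i].count > 0) printf("%s %ld\n", top[i].word, top[i].count);
    fclose(in);
    return 0;
}
```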
If you can create a new file however big, you can create a simple disk-based tree database holding each word and its frequency so far. This will cost you O(log n) each time, with n from 1 to N words, plus the final scan of the whole N-sized tree, which adds up to O(N log N).
If you cannot create a new file, you'll need to perform an in-place sort of the whole file, which will cost about O(N²). More precisely, it is closer to O((N/k)²), I think, with k the average number of words you can keep in memory for the simplest bubble sort - but that is (1/k²)·O(N²) = K·O(N²), which is still O(N²). At that point you can rescan the file one final time, and after each run of a given word you'll know whether that word can enter your top ten, and at which position. So you only need to fit twelve words in memory (the top ten, the current word, and the word just read from the file); 1 KB should be enough.
So, the auxiliary file is actually the fastest option.

File Browser in C for POSIX OS

I have created a file browsing UI for an embedded device. On the embedded side, I am able to get all files in a directory off the hard disk and return stats such as name, size, modified, etc. This is done using opendir and closedir and a while loop that goes through every file until no files are left.
This is cool, until file counts reach large quantities. I need to implement pagination and sorting. Suppose I have 10,000 files in a directory - how can I possibly go through this many files and sort based on size, name, etc., without easily busting the RAM (about 1 MB of RAM...!)? Perhaps something already exists within the hard drive's OS or drivers?
Here are two suggestions, both of which have a small memory footprint. The first uses no more memory than the number of results you wish to return for the request. It is O(1) in memory - it depends only on the size of the result set - but it is ultimately quadratic in time (or worse) if the user really does page through all the results:
You are only looking for a small paged result (e.g. the r = 25 entries of one page). You can generate these by scanning through all filenames and maintaining a sorted list of the items you will return, using an insertion sort bounded at length r; for each file scanned, insert it and retain only the first r results. (In practice you would not even insert a file F if it sorts after the rth entry.)
How would you generate the 2nd page of results? You already know the 25th file from the previous request, so during the scan ignore all entries that sort before it. (You'll need to work harder if sorting on fields with duplicates.)
The upside is the minimal memory required: the memory needed is not much larger than the r results you wish to return (and can even be less if you don't cache the names). The downside is that generating the complete result set is quadratic in the total number of files. In practice people rarely sort results and then page through every page, so this may be acceptable.
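A sketch of this first suggestion in C; PAGE_SIZE, the name-length cap and the plain strcmp ordering are assumptions for illustration:

```c
/* Scan the directory once and keep only the first page (the r smallest
 * names) via a bounded insertion sort. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE    25
#define NAME_MAX_LEN 256

int first_page(const char *dirpath, char page[PAGE_SIZE][NAME_MAX_LEN]) {
    DIR *d = opendir(dirpath);
    struct dirent *e;
    int held = 0;

    if (!d) return -1;
    while ((e = readdir(d)) != NULL) {
        /* Skip the entry if the page is full and it sorts after the last slot. */
        if (held == PAGE_SIZE && strcmp(e->d_name, page[held - 1]) >= 0)
            continue;
        /* Insertion sort into the bounded page, evicting the old last entry. */
        int i = (held < PAGE_SIZE) ? held : PAGE_SIZE - 1;
        while (i > 0 && strcmp(e->d_name, page[i - 1]) < 0) {
            strcpy(page[i], page[i - 1]);
            i--;
        }
        strncpy(page[i], e->d_name, NAME_MAX_LEN - 1);
        page[i][NAME_MAX_LEN - 1] = '\0';
        if (held < PAGE_SIZE) held++;
    }
    closedir(d);
    return held;   /* number of entries actually filled */
}
```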
If your memory budget is larger (say, fewer than 10,000 files) but you still don't have enough space to perform a simple in-memory sort of all 10,000 filenames, then seekdir/telldir are your friends: create an array of longs by streaming readdir and using telldir to capture the position of each entry. (You might even be able to compress the delta between successive telldir values into a 2-byte short.) As a minimal implementation you can then sort them all with the C library's qsort, writing your own comparison callback to turn a position into a comparable value; the callback uses seekdir twice to read the two filenames.
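A sketch of that idea in C; the global DIR pointer is needed only because qsort's comparator takes no user context, and error handling is trimmed:

```c
/* Store only directory positions; the comparator re-reads the two names
 * it is comparing via seekdir/readdir. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static DIR *g_dir;   /* directory being sorted; shared with the comparator */

static int by_name(const void *pa, const void *pb) {
    char a[256], b[256];
    struct dirent *e;

    seekdir(g_dir, *(const long *)pa);
    e = readdir(g_dir);
    snprintf(a, sizeof a, "%s", e ? e->d_name : "");

    seekdir(g_dir, *(const long *)pb);
    e = readdir(g_dir);
    snprintf(b, sizeof b, "%s", e ? e->d_name : "");

    return strcmp(a, b);
}

/* Returns a malloc'd array of telldir positions sorted by entry name;
 * the caller can seekdir to each position to fetch one page of names. */
long *sorted_positions(const char *path, size_t *count) {
    size_t cap = 1024, n = 0;
    long *pos = malloc(cap * sizeof *pos);

    g_dir = opendir(path);
    if (!g_dir || !pos) return NULL;
    for (;;) {
        long here = telldir(g_dir);            /* position *before* the entry */
        if (readdir(g_dir) == NULL) break;
        if (n == cap) pos = realloc(pos, (cap *= 2) * sizeof *pos);
        pos[n++] = here;
    }
    qsort(pos, n, sizeof *pos, by_name);
    *count = n;
    return pos;
}
```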
The above approach is overkill - you just sorted all entries when you only needed one page of ~25. So, for fun, why not read up on Hoare's QuickSelect algorithm and use a version of it to identify just the results within the required range: you can recursively ignore all partitions that fall outside the required range and only sort the entries between the first and last entry of the page.
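For example, a range-limited quicksort in that spirit might look like this (a sketch; the names and the strcmp ordering are illustrative):

```c
/* Partition as in quicksort, but only recurse into halves that overlap
 * the requested page [first, last]. */
#include <string.h>

static void swap_name(char **a, char **b) { char *t = *a; *a = *b; *b = t; }

/* Sort only enough of names[lo..hi] so that indices [first..last] hold the
 * values they would have after a full sort. */
static void page_sort(char **names, int lo, int hi, int first, int last) {
    if (lo >= hi) return;
    char *pivot = names[(lo + hi) / 2];
    int i = lo, j = hi;
    while (i <= j) {                            /* Hoare-style partition */
        while (strcmp(names[i], pivot) < 0) i++;
        while (strcmp(names[j], pivot) > 0) j--;
        if (i <= j) swap_name(&names[i++], &names[j--]);
    }
    /* Recurse only where the requested page overlaps. */
    if (first <= j) page_sort(names, lo, j, first, last);
    if (last >= i)  page_sort(names, i, hi, first, last);
}
```

Calling page_sort(names, 0, n - 1, 25, 49) would then leave just the second page of 25 names in their final sorted positions.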
What you want is an external sort, that is, a sort done with external resources, usually on disk. The Unix sort command does this. Typically it is done with an external merge sort.
The algorithm is basically this. Let's assume you want to dedicate 100k of memory to this (the more you dedicate, the fewer disk operations, the faster it will go).
Read 100k of data into memory (i.e. call readdir a bunch of times).
Sort that 100k hunk in-memory.
Write the hunk of sorted data to its own file on disk.
You can also use offsets in a single file.
GOTO 1 until all hunks are sorted.
Now you have X hunks of 100k on disk, and each of them is sorted. Let's say you have 9 hunks. To keep within the 100k memory limit, we'll divide the work up into the number of hunks + 1. 9 hunks, plus 1, is 10. 100k / 10 is 10k. So now we're working in blocks of 10k.
Read the first 10k of each hunk into memory.
Allocate another 10k (or more) as a buffer.
Do a K-way merge on the hunks.
Write the smallest in any hunk to the buffer. Repeat.
When the buffer fills, append it to a file on disk.
When a hunk empties, read the next 10k from that hunk.
When all hunks are empty, read the resulting sorted file.
You might be able to find a library to perform this for you.
Since this is obviously overkill for the normal case of small lists of files, only use it if there are more files in the directory than you care to have in memory.
Both in-memory sort and external sort begin with the same step: start calling readdir and writing to a fixed-sized array. If you run out of files before running out of space, just do an in-memory quicksort on what you've read. If you run out of space, this is now the first hunk of an external sort.
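A sketch of that shared first step in C, assuming we are sorting filenames by name; the hunk size and temporary-file naming are arbitrary choices for illustration:

```c
/* Fill a fixed-size array from readdir, quicksort it, and either keep it
 * (the directory fit in memory) or spill sorted hunks for an external sort. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HUNK_ENTRIES 4096
#define NAME_LEN     256

static int cmp_name(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

/* Returns the number of sorted hunk files written, or 0 if everything fit
 * in 'names' (in which case 'names' holds the fully sorted listing). */
int make_runs(const char *path, char names[HUNK_ENTRIES][NAME_LEN],
              int *in_memory_count) {
    DIR *d = opendir(path);
    struct dirent *e;
    int n = 0, hunks = 0;

    if (!d) return -1;
    while ((e = readdir(d)) != NULL) {
        snprintf(names[n++], NAME_LEN, "%s", e->d_name);
        if (n == HUNK_ENTRIES) {                /* array full: spill a sorted hunk */
            char tmp[64];
            qsort(names, n, NAME_LEN, cmp_name);
            snprintf(tmp, sizeof tmp, "hunk%04d.tmp", hunks++);
            FILE *out = fopen(tmp, "w");
            for (int i = 0; i < n; i++) fprintf(out, "%s\n", names[i]);
            fclose(out);
            n = 0;
        }
    }
    closedir(d);
    qsort(names, n, NAME_LEN, cmp_name);
    if (hunks == 0) {                           /* everything fit: in-memory case */
        *in_memory_count = n;
        return 0;
    }
    if (n > 0) {                                /* spill the final partial hunk */
        char tmp[64];
        snprintf(tmp, sizeof tmp, "hunk%04d.tmp", hunks++);
        FILE *out = fopen(tmp, "w");
        for (int i = 0; i < n; i++) fprintf(out, "%s\n", names[i]);
        fclose(out);
    }
    *in_memory_count = 0;
    return hunks;                               /* hunks now await a K-way merge */
}
```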

How to sort the runs in external sorting using merge sort

I am trying to implement (in C) an external sorting algorithm of a database using merge sort for a college assignment. The available memory is buffSize blocks. I found this link very helpful:
http://web.eecs.utk.edu/~huangj/CS302S04/notes/external-sorting2.html
but my problem is about this line of the pseudo-code, in phase one of the algorithm:
sort array a using an in-memory algorithm like quicksort
If I am not allowed to use any memory other than my buffSize space, so that I cannot allocate the array a from the link, how can I sort the records contained in those blocks (and then store them in a temporary run file) using an in-memory sorting procedure (e.g. quicksort)? In that case my records would not be in a contiguous array but rather in non-contiguous memory blocks, and I can't directly apply qsort. Any hints?
The general approach for an external sort is:
Read as much data as will fit into an array in memory.
Sort it.
Write it out to a temporary file (keeping track of name and size and largest record, etc).
Go back to step 1 until you reach the end of the data.
Set up a merge tree for the files written so that you do the minimum of merges.
Read a line from each of the first (only?) merge phase input files.
Write the smallest (for an ascending sort) to the next temporary file (or the final file).
Get a new record to replace the one just written.
Go back to step 7 until there is no more data to read.
Go back to step 6 to continue the merge until you're done.
You've not detailed what buffSize blocks of memory means, but there's an array a that can be sorted in memory. So, you read the data into the array. You sort the array using quicksort. Then you write the array out to disk. Repeat reading, sorting, writing until there's no more input data. Then do the merging...
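One common way to reconcile "the records live in buffSize separate blocks" with "sort them with quicksort" is to qsort an array of pointers into the blocks rather than the records themselves; here is a sketch under the assumption that each record is a NUL-terminated string inside a fixed-size block (the struct layout is made up for illustration):

```c
/* Keep the blocks where they are and sort an array of pointers into them. */
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct block { char data[BLOCK_SIZE]; size_t used; };

static int cmp_rec(const void *a, const void *b) {
    const char *ra = *(const char *const *)a;
    const char *rb = *(const char *const *)b;
    return strcmp(ra, rb);
}

/* Collect a pointer to every record in every block, then sort the pointers.
 * Returns the sorted pointer array (caller frees) and its length in *nrec. */
char **sort_block_records(struct block *blocks, size_t nblocks, size_t *nrec) {
    size_t cap = 1024, n = 0;
    char **recs = malloc(cap * sizeof *recs);

    if (!recs) return NULL;
    for (size_t b = 0; b < nblocks; b++) {
        size_t off = 0;
        while (off < blocks[b].used) {               /* walk the records in this block */
            if (n == cap) recs = realloc(recs, (cap *= 2) * sizeof *recs);
            recs[n++] = blocks[b].data + off;
            off += strlen(blocks[b].data + off) + 1; /* next record after the NUL */
        }
    }
    qsort(recs, n, sizeof *recs, cmp_rec);           /* records stay put; pointers move */
    *nrec = n;
    return recs;
}
```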

Efficient algorithm to sort file records

I have a file which contains a number of records of varying length. What would be an efficient algorithm to sort these records?
Record sample:
000000000000dc01 t error_handling 44
0000000dfa01a000 t fun 44
Total records: >5000
Programming language: C
I would like to know which algorithm is suitable for sorting this file based on the address, and what would be an efficient way to read these records.
If the file is too large to fit into memory, then your only reasonable choice is a file-based merge sort, which involves two passes.
In the first pass, read blocks of N records (where N is defined as the number of records that will fit into memory), sort them, and write them to a temporary file. When this pass is done, you either have a number (call it M) of temporary files, each with some varying number of records that are sorted, or you have a single temporary file that contains blocks of sorted records.
The second pass is an M-way merge.
I wrote an article some time back about how to do this with a text file. See Sorting a Large Text File. It's fairly straightforward to extend that so that it will sort other types of records that you define.
For more information, see External sorting.
Since the records are of varying length, an efficient method would be:
Read and parse the file into an array of pointers to records
Sort array of pointers
Write the results
Random access to the file would be slow, as newlines have to be counted to find a specific record.
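A sketch of this pointer-based approach in C for the sample records, which begin with a hexadecimal address; the buffer sizes and record-count cap are assumptions sized generously for a file of a few thousand short records:

```c
/* Read the whole file, collect a pointer to each line, and qsort the
 * pointers by the parsed leading hex address. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int by_address(const void *a, const void *b) {
    unsigned long long x = strtoull(*(char *const *)a, NULL, 16);
    unsigned long long y = strtoull(*(char *const *)b, NULL, 16);
    return (x > y) - (x < y);
}

int sort_records(const char *in_path, const char *out_path) {
    FILE *in = fopen(in_path, "r");
    FILE *out = fopen(out_path, "w");
    static char buf[16 * 1024 * 1024];    /* whole file; a few thousand records fit easily */
    static char *lines[65536];            /* one pointer per record */
    size_t n = 0, len;

    if (!in || !out) return -1;
    len = fread(buf, 1, sizeof buf - 1, in);
    buf[len] = '\0';
    fclose(in);

    /* Split into lines; each pointer marks the start of one record. */
    for (char *p = strtok(buf, "\n"); p && n < 65536; p = strtok(NULL, "\n"))
        lines[n++] = p;

    qsort(lines, n, sizeof *lines, by_address);
    for (size_t i = 0; i < n; i++)
        fprintf(out, "%s\n", lines[i]);
    fclose(out);
    return 0;
}
```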
If you've got a really big file, adapt the process to:
for each n records
read and parse
sort
write to temporary file
mergesort temporary files
In-place quicksort is one of the best generic sorting algorithms. Faster sorting is possible (such as bucket sort), but it depends on certain properties of the data you're sorting.
