Appropriate data structure for counting frequency of string - c

I have a task of counting the frequency of strings (words) in a text file. What data structure do you think is appropriate (based on implementation difficulty, memory usage and time complexity of the algorithm)? I have a hash table, binary search tree and heap in mind, but I don't know which one to choose. Also, if there is any better data structure than the ones I mentioned, that would be great too. Thanks in advance.
N.B. the text file could be extremely large.

Because you say the file could be extremely large, I assumed you can't keep all the words in memory simultaneously.
Note that if the file had all its words sorted, finding the frequencies would require keeping only a counter and the last two words in memory at a time, to compare them. As long as the same word as before is read, increment the counter. When you hit a different word, save the previous word and its count to another file with the frequencies and start counting over for the new word.
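For illustration, here is a minimal sketch of that counting pass in C, assuming the sorted words arrive one per line; the file names sorted.txt and freq.txt are placeholders, not anything from the question:

    #include <stdio.h>
    #include <string.h>

    #define MAXWORD 256

    int main(void)
    {
        FILE *in = fopen("sorted.txt", "r");   /* placeholder input: one sorted word per line */
        FILE *out = fopen("freq.txt", "w");    /* placeholder output: word and its count */
        char prev[MAXWORD] = "", word[MAXWORD];
        long count = 0;

        if (!in || !out)
            return 1;

        while (fscanf(in, "%255s", word) == 1) {
            if (strcmp(word, prev) == 0) {
                count++;                       /* same word as before: just bump the counter */
            } else {
                if (count > 0)
                    fprintf(out, "%s %ld\n", prev, count);
                strcpy(prev, word);            /* start counting the new word */
                count = 1;
            }
        }
        if (count > 0)                         /* flush the last word */
            fprintf(out, "%s %ld\n", prev, count);

        fclose(in);
        fclose(out);
        return 0;
    }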
So the question is how to sort the words in a file. For that purpose, you can use merge sort. Note that when merging subarrays, you need to keep only two words in memory, one per subarray. Additionally, you will need to create an extra file, like the extra array in an in-memory merge sort, and play with positions in the files. If you write to the original and extra files alternately in the recursive calls, these two will be enough.
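The merge step of such a file-based merge sort might look like the sketch below. It keeps just one word per run in memory and writes the smaller of the two to the output; the file paths and the fixed maximum word length are assumptions for the example:

    #include <stdio.h>
    #include <string.h>

    #define MAXWORD 256

    /* Merge two files of sorted words (one per line) into a sorted output file. */
    static void merge_files(const char *a_path, const char *b_path, const char *out_path)
    {
        FILE *a = fopen(a_path, "r"), *b = fopen(b_path, "r"), *out = fopen(out_path, "w");
        char wa[MAXWORD], wb[MAXWORD];
        int has_a, has_b;

        if (!a || !b || !out)
            return;

        has_a = fscanf(a, "%255s", wa) == 1;
        has_b = fscanf(b, "%255s", wb) == 1;

        while (has_a && has_b) {
            if (strcmp(wa, wb) <= 0) {         /* emit the smaller word, refill from its run */
                fprintf(out, "%s\n", wa);
                has_a = fscanf(a, "%255s", wa) == 1;
            } else {
                fprintf(out, "%s\n", wb);
                has_b = fscanf(b, "%255s", wb) == 1;
            }
        }
        while (has_a) { fprintf(out, "%s\n", wa); has_a = fscanf(a, "%255s", wa) == 1; }
        while (has_b) { fprintf(out, "%s\n", wb); has_b = fscanf(b, "%255s", wb) == 1; }

        fclose(a); fclose(b); fclose(out);
    }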

Related

How do I search for the most common words in a very big file (over 1 GB) using 1 KB or less memory?

I have a very big text file, with dozens of millions of words, one word per line. I need to find the top 10 most common words in that file. There are some restrictions: only the standard library may be used, and no more than 1 KB of memory.
It is guaranteed that any 10 words in that file are short enough to fit into said memory limit, and that there will be enough memory for some other variables such as counters, etc.
The only solution I came up with is to use another text file as additional memory and buffer. But that seems to be a bad and slow way to deal with the problem.
Are there any better and efficient solutions?
You can first sort this file (it is possible with limited memory, but it will of course require disk IO - see How do I sort very large files as a starter).
Then you will be able to read the sorted file line by line and calculate the frequency of each word one by one. Keep an array of the 10 most frequent words seen so far; whenever a word's frequency is higher than the lowest one stored in your array, add it and remove the least frequent entry. That way you keep only the 10 most frequent words in memory during this stage.
As @John Bollinger mentioned, if the requirement is to print all top-10 words - if, for example, all words in the file have the same frequency, i.e. they are all "top" - then this approach will not work. In that case you need to calculate the frequency of each word, store it in a file, sort it, and then print the top 10 including all words with the same frequency as the 10th one.
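As a rough sketch of how that top-10 array could be maintained in C (the struct, the offer() helper and the sizes are made up for illustration, not part of the original answer):

    #include <string.h>

    #define TOP 10
    #define MAXWORD 64

    struct entry { char word[MAXWORD]; long count; };

    /* Replace the least frequent stored entry if the new word beats it. */
    static void offer(struct entry top[TOP], const char *word, long count)
    {
        int min = 0;
        for (int i = 1; i < TOP; i++)
            if (top[i].count < top[min].count)
                min = i;
        if (count > top[min].count) {
            strncpy(top[min].word, word, MAXWORD - 1);
            top[min].word[MAXWORD - 1] = '\0';
            top[min].count = count;
        }
    }

The scan over the sorted file would call offer() once per completed (word, count) pair, starting from a zero-initialized array; after the scan, top[] holds the ten most frequent words.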
If you can create a new file, however big, you can create a simple disk-based tree database holding each word and its frequency so far. This will cost you O(log n) each time, with n going from 1 to N words, plus the final scan of the whole N-sized tree, which adds up to O(N log N).
If you cannot create a new file, you'll need to perform an in-place sort of the whole file, which will cost about O(N^2). That's closer to O((N/k)^2), I think, with k the average number of words you can keep in memory for the simplest bubble sort - but that is O(1/k^2) * O(N^2) = K * O(N^2), which is still O(N^2). At that point you can rescan the file one final time, and after each run of each word you'll know whether that word can enter your top ten, and at which position. So you need to fit just twelve words in memory (the top ten, the current word, and the word just read from the file). 1 KB should be enough.
So, the auxiliary file is actually the fastest option.

storing strings in an array in a compact way [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
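To make the greedy heuristic concrete, here is a small helper one might write to measure the overlap between two strings (a sketch only; repeatedly merging the best-overlapping pair is left out for brevity):

    #include <string.h>

    /* Length of the longest suffix of a that is also a prefix of b. */
    static size_t overlap(const char *a, const char *b)
    {
        size_t la = strlen(a), lb = strlen(b);
        size_t max = la < lb ? la : lb;
        for (size_t k = max; k > 0; k--)
            if (strncmp(a + la - k, b, k) == 0)
                return k;
        return 0;
    }

The greedy algorithm repeatedly finds the pair of strings with the largest overlap(), merges them into one string, and repeats until a single superstring remains.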
I think you can use a Radix Tree. It costs some memory because of the pointers to leaves and parents, but it makes matching up strings easy: O(k), where k is the length of the longest string.
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings, then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is not a "definitive" algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into runs of identical letters (hint: there are mathematical shortcuts for obtaining the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
Refine step 3.
Look through the current list and see whether any word in the list starts with a suffix of the current word. (You might want to require the suffix to be longer than some length - longer than 1, for example.)
If yes, prepend the remaining prefix of the current word (the part before that suffix) to the existing word, and adjust all existing references appropriately (slow!).
If no, add the word to the end of the list as in the current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. An enormous amount of effort has already gone into compression algorithms, so why not take one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what you want to do.
Do you want a data structure that lets you store the strings in a memory-conscious manner while keeping operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress them like this:
0aaa
3b
2sd
1baco
3d
The number is the length of the largest common prefix with the preceding string.
You can tweak that scheme, for example by forcing a "restart" of the common prefix after every K words, for fast reconstruction.
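A minimal sketch of that front-coding idea in C, assuming the words are already sorted (the function names are invented for the example):

    #include <stdio.h>

    /* Length of the common prefix of two strings. */
    static size_t common_prefix(const char *a, const char *b)
    {
        size_t n = 0;
        while (a[n] && a[n] == b[n])
            n++;
        return n;
    }

    /* Print each word as <shared-prefix-length><suffix>, relative to the previous word. */
    static void front_encode(const char *words[], size_t count)
    {
        const char *prev = "";
        for (size_t i = 0; i < count; i++) {
            size_t p = common_prefix(prev, words[i]);
            printf("%zu%s\n", p, words[i] + p);
            prev = words[i];
        }
    }

Calling front_encode() on the five example words reproduces the compressed listing above.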

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (say, for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that would hold the character data of the file. I would also make an array whose indices are line numbers and whose contents are the addresses corresponding to the beginning of each line. This array would be reallocated as needed.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned about are source files, so their size should be manageable, they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only access, get the file size, read the file into memory with fread(), and then maintain a dynamic array that maps line numbers (indices) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass, create an array of pointers or offsets pointing to the beginnings of the lines (hint: one past each '\n') in that memory.
Now you can index the array to access a specific line.
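A sketch of that two-pass approach, assuming the whole file fits in memory; the file name is a placeholder and error handling is kept to a minimum:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *f = fopen("source.c", "rb");      /* placeholder file name */
        long size;
        size_t got, nlines = 0, cap = 128;
        char *buf, **lines;

        if (!f)
            return 1;
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        rewind(f);

        buf = malloc(size + 1);                 /* 1) whole file in one chunk */
        got = fread(buf, 1, size, f);
        buf[got] = '\0';
        fclose(f);

        lines = malloc(cap * sizeof *lines);    /* 2) index of line starts */
        lines[nlines++] = buf;
        for (size_t i = 0; i < got; i++) {
            if (buf[i] == '\n' && i + 1 < got) {
                if (nlines == cap)
                    lines = realloc(lines, (cap *= 2) * sizeof *lines);
                lines[nlines++] = buf + i + 1;  /* a line starts one past '\n' */
            }
        }

        /* lines[k] points at the start of 0-based line k; column access is lines[k] + col */
        printf("indexed %zu lines\n", nlines);
        free(lines);
        free(buf);
        return 0;
    }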
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
One solution: if the lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your search O(1).
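With fixed-width lines, that O(1) lookup is just arithmetic; a tiny sketch, where the line width is an assumed constant:

    #include <stdio.h>

    #define LINE_WIDTH 80   /* assumed fixed record width, including the newline */

    /* Position the stream at the start of 0-based line number n. */
    static int seek_to_line(FILE *f, long n)
    {
        return fseek(f, n * (long)LINE_WIDTH, SEEK_SET);
    }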
If the lines are sorted, then you can perform binary search, making your search O(log(nLines)).
If neither, you can store the indexes of the line beginnings. But then you have a problem if you modify the file a lot, because if you insert, say, X characters somewhere, you have to calculate which line that is and then add X to the offsets of all following lines. It is similar with deletion. You essentially get O(nLines), and the code gets ugly.
If you want to store the whole file in memory, just create an array of lines, char *[]. You then get a line with the first dereference and a character with the second.
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
If you build suitable helper functions for allocation and deletion, and make it so the structs can be freed either automatically or manually, then the above combination won't get you O(1) reads (which isn't possible unless the file has a static format), but it will get you good performance.
If you know the file's static length (at least line-wise), i.e. no bigger than 256 chars per line, then all you need is the DynamicListOfArrays: write directly to the array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file could go from 0 bytes to thousands of bytes; like any text file, it depends on the number of characters it contains.

Efficient algorithm to sort file records

I have a file which contains a number of records of varying length. What would be an efficient algorithm to sort these records?
Record sample:
000000000000dc01 t error_handling 44
0000000dfa01a000 t fun 44
Total records: > 5000
Programming language c
I would like to know which algorithm is suitable to sort this file based on address and what would be the efficient way to read these records?
If the file is too large to fit into memory, then your only reasonable choice is a file-based merge sort, which involves two passes.
In the first pass, read blocks of N records (where N is defined as the number of records that will fit into memory), sort them, and write them to a temporary file. When this pass is done, you either have a number (call it M) of temporary files, each with some varying number of records that are sorted, or you have a single temporary file that contains blocks of sorted records.
The second pass is an M-way merge.
I wrote an article some time back about how to do this with a text file. See Sorting a Large Text File. It's fairly straightforward to extend that so that it will sort other types of records that you define.
For more information, see External sorting.
Since the records are of varying length, an efficient method would be:
Read and parse the file into an array of pointers to records
Sort array of pointers
Write the results
Random access to the file would be slow, as the newlines have to be counted to find a specific record.
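A sketch of this in-memory approach, using qsort() with a comparator that parses the leading hexadecimal address; the file name, the record limit and the line length are arbitrary assumptions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXREC 8192

    /* Compare two record lines by their leading hexadecimal address field. */
    static int cmp_records(const void *a, const void *b)
    {
        unsigned long long x = strtoull(*(char *const *)a, NULL, 16);
        unsigned long long y = strtoull(*(char *const *)b, NULL, 16);
        return (x > y) - (x < y);
    }

    int main(void)
    {
        static char *records[MAXREC];
        char line[512];
        size_t n = 0;
        FILE *f = fopen("records.txt", "r");   /* placeholder input file */

        if (!f)
            return 1;
        while (n < MAXREC && fgets(line, sizeof line, f))
            records[n++] = strdup(line);       /* keep a pointer to a copy of each record */
        fclose(f);

        qsort(records, n, sizeof records[0], cmp_records);

        for (size_t i = 0; i < n; i++) {       /* write the sorted records */
            fputs(records[i], stdout);
            free(records[i]);
        }
        return 0;
    }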
If you've got a really big file, adapt the process to:
for each block of n records:
    read and parse
    sort
    write to a temporary file
merge-sort the temporary files
In-place quicksort is one of the best generic sorting algorithms. Faster sorting is possible (such as bucket sort), but it depends on properties of the data you're sorting.

Creating a binary search of an alphabetically ordered .txt file in C

I'm working on creating a binary search algorithm in C that searches for a string in a .txt file. Each line is a string representing a stock ticker. Not being familiar with C, this is taking far too long. I have a few questions:
1.) Once I have opened a file using fopen, does it make more sense in terms of efficiency for the algorithm to step through the file using some function provided in the C library for scanning files, doing the compare directly from the file, or should I copy each line into an array and have the algorithm search the array?
2.) If I should compare directly from the file, what is the best way to step through it? Assume I have the number of lines in the file, is there some way to go directly to the middle line, scan the string and do the compare?
I'm sorry if this is too vague. Not too sure how to better explain. Thanks for your time
Unless your file is exceedingly big (> 2 GB), loading it into memory prior to searching it is the way to go. If you cannot load the file into memory, you could hold the offset of each line in an int[] or (if the file contains too many lines...) create another binary file and write the offset of each line to it as an integer...
Having everything in memory is by far preferable, though.
You cannot binary search the lines of a text file without knowing the length of each line in advance, so you'll most likely want to read each line into memory first (unless the file is very big).
But if your goal is only to search for a single given line as quickly as possible, you might as well just do a linear search directly on the file. There's no point in getting O(log n) lookups at the cost of an O(n) setup if the search is only done once.
Reading it all in with a bulk read and walking through it with pointers (to memory) is very fast. Avoid doing multiple I/O calls if you can.
I should also mention that memory mapped files can be very suitable for something like this. See mmap() if on Unix. This is definitely your best bet for really large files.
This is a great question!
The challenge of binary search is that its benefits come from being able to skip past half the remaining elements at each step in O(1). This guarantees that, since you only do O(lg n) probes, the runtime is O(lg n). This is why, for example, you can do a fast binary search on an array but not on a linked list - in a linked list, finding the halfway point of the elements takes linear time, which dominates the time for the search.
When doing binary search on a file you are in a similar position. Since all the lines in the file might not have the same length, you can't easily jump to the nth line in the file given some number n. Consequently, implementing a good, fast binary search on a file will be a bit tricky. Somehow, you will need to know where each line starts and stops so that you can efficiently jump around in the file.
There are many ways you can do this. First, you could load all the strings from the file into an array, as you've suggested. This takes linear time, but once you have the array of strings in memory all future binary searches will be very fast. The catch is that if you have a very large file, this may take up a lot of memory and could be prohibitively expensive. Consequently, another alternative might be to store not the actual strings in the array, but rather the offsets into the file at which each string occurs. This would let you do the binary search quickly - you could seek the file to the proper offset when doing a comparison - and for large strings it can be much more space-efficient than the above. And, if all the strings are roughly the same length, you could just pad every line to some fixed size to allow for direct computation of the start position of each line.
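For the first option (everything loaded into memory), the search itself is an ordinary binary search over the array of strings. A sketch, assuming the array of newline-stripped lines has been filled and sorted elsewhere:

    #include <string.h>

    /* Return the index of target in the sorted array of lines, or -1 if absent. */
    static long find_ticker(char **lines, long nlines, const char *target)
    {
        long lo = 0, hi = nlines - 1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            int cmp = strcmp(target, lines[mid]);
            if (cmp == 0)
                return mid;
            if (cmp < 0)
                hi = mid - 1;   /* target sorts before the middle line */
            else
                lo = mid + 1;   /* target sorts after the middle line */
        }
        return -1;
    }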
If you're willing to expend some time implementing more complex solutions, you might want to consider preprocessing the file so that, instead of having one string per line, you have at the top of the file a list of fixed-width integers containing the offset of each string in the file. This essentially does the above work, but then stores the result back in the file to make future binary searches much faster. I have some experience with this sort of file structure, and it can be quite fast.
If you're REALLY up for a challenge, you could alternatively store the strings in the file using a B-tree, which would give you incredibly fast lookup times for each string by minimizing the number of disk reads you need to do.
Hope this helps!
I don't see how you can do the comparison directly from the file. You will have to have a buffer to store data read from disk and use that buffer, so it doesn't make sense; it is just impossible.
You cannot jump to a particular line in the file. Not unless you know the offset in bytes of the beginning of that line relative to the beginning of the file.
I'd recommend using mmap to map the file directly into memory and working with it as with a character array. The operating system will make working with the file (seeking, reading, writing) transparent to you, and you can treat it just like a buffer in memory. Note that mmap is limited to about 4 GB on 32-bit systems. But if the file is bigger than that, you should probably ask why on earth someone has such a big file that is not in an indexed database.
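A minimal mmap() sketch along those lines (POSIX only; the file name is a placeholder and error handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("tickers.txt", O_RDONLY);        /* placeholder file */
        struct stat st;
        char *data;

        if (fd < 0 || fstat(fd, &st) < 0)
            return 1;

        /* Map the whole file read-only; it now looks like an ordinary char array. */
        data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED)
            return 1;

        /* ... scan data[0 .. st.st_size - 1] as needed; here, print the first bytes ... */
        fwrite(data, 1, st.st_size < 80 ? st.st_size : 80, stdout);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }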
