Fast binary search/indexing in C

I have a dataset composed of n elements of a fixed size (24 bytes). I want to create an index so that I can search for an arbitrary 24-byte element in this dataset as fast as possible. What algorithm should I use? Do you know a C library implementing this?
Fast read access/search speed is the priority. Memory usage and insertion speed are not a problem; there will be barely any write access after initialization.
EDIT: The dataset will be stored in memory (RAM), no disk access.

If there's a logical ordering between the elements then a quicksort of the data is a fast way to order it. Once it's ordered you can then use a binary search algorithm to look for elements. This is an O(log N) search, and you'll be hard pressed to get anything faster!
std::sort can be used to sort the data, and std::binary_search can be used to search the data.
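Since the question asks about C, here is a minimal sketch of the same idea using qsort and bsearch from the standard library (the rec24 type and the memcmp-based ordering are assumptions, since the question doesn't say what the 24 bytes contain):

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical 24-byte record type; any fixed-size element works. */
    typedef struct { unsigned char bytes[24]; } rec24;

    /* Order records by their raw bytes; any total order will do. */
    static int cmp_rec24(const void *a, const void *b) {
        return memcmp(a, b, sizeof(rec24));
    }

    /* Sort once at initialization (writes are rare, so this cost is paid once)... */
    void build_index(rec24 *data, size_t n) {
        qsort(data, n, sizeof(rec24), cmp_rec24);
    }

    /* ...then every lookup is an O(log n) binary search. */
    const rec24 *lookup(const rec24 *data, size_t n, const rec24 *key) {
        return bsearch(key, data, n, sizeof(rec24), cmp_rec24);
    }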

Use a hash table, available as std::unordered_map in the STL. It will beat a binary search (my bet).
Alternatively, a (compressed) trie (http://en.wikipedia.org/wiki/Trie). This is really the fastest if you can afford the memory space.
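For the hash-table suggestion, a rough sketch of what the lookup could look like in plain C for the 24-byte keys (the FNV-1a hash, open addressing with linear probing, and the table_find name are all illustrative assumptions, not a specific library):

    #include <stdint.h>
    #include <string.h>

    typedef struct { unsigned char bytes[24]; } rec24;

    /* FNV-1a over the 24 key bytes. */
    static uint64_t hash24(const rec24 *k) {
        uint64_t h = 1469598103934665603ULL;
        for (int i = 0; i < 24; i++) { h ^= k->bytes[i]; h *= 1099511628211ULL; }
        return h;
    }

    /* Open-addressing table: cap is a power of two, empty slots are all zero
       (this assumes the all-zero key never occurs in the data). */
    typedef struct { rec24 *slots; size_t cap; } table;

    const rec24 *table_find(const table *t, const rec24 *k) {
        static const rec24 empty = {{0}};
        size_t i = hash24(k) & (t->cap - 1);
        while (memcmp(&t->slots[i], &empty, sizeof empty) != 0) {
            if (memcmp(&t->slots[i], k, sizeof *k) == 0) return &t->slots[i];
            i = (i + 1) & (t->cap - 1);           /* linear probing */
        }
        return NULL;                              /* not present */
    }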

Related

Which of the following methods is more efficient

Problem statement: Given an array of integers and an integer k, print all the pairs in the array whose sum is k.
Method 1:
Sort the array, maintain two pointers low and high, and start iterating...
Time complexity: O(n log n)
Space complexity: O(1)
Method 2:
Put all the elements in a dictionary (hash map) and, for each element, look up k minus that element.
Time complexity: O(n)
Space complexity: O(n)
Now, of the above two approaches, which one is more efficient, and on what basis should I compare the efficiency (time or space in this case), given that the two approaches differ in both?
I've left my comment above for reference.
It was hasty. You do allow O(n log n) time for the Method 1 sort (I now think I understand?) and that's fair (apologies;-).
What happens next? If the input array must be used again, then you need a sorted copy (the sort would not be in-place) which adds an O(n) space requirement.
The "iterating" part of Method 1 also costs ~O(n) time.
But loading up the dictionary in Method 2 is also ~O(n) time (presumably a throw-away data structure?) and dictionary access - although ~O(1) - is slower (than array indexing).
Bottom line: O-notation is helpful if it can identify an "overpowering cost" (rendering others negligible by comparison), but without a hint at use-cases (typical and boundary, details like data quantities and available system resources etc), questions like this (seeking a "generalised ideal" answer) can't benefit from it.
Often some simple proof-of-concept code and performance tests on representative data can make "the right choice obvious" (more easily and often more correctly than speculative theorising).
Finally, in the absence of a clear performance winner, there is always "code readability" to help decide;-)
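For reference, a minimal sketch of Method 1's two-pointer scan in C, assuming the array has already been sorted (and ignoring duplicate handling for brevity):

    #include <stdio.h>

    /* a[] must already be sorted ascending (the O(n log n) step). */
    void print_pairs_with_sum(const int *a, int n, int k) {
        int low = 0, high = n - 1;
        while (low < high) {
            int sum = a[low] + a[high];
            if (sum == k) {
                printf("(%d, %d)\n", a[low], a[high]);
                low++; high--;                 /* move both ends inward */
            } else if (sum < k) {
                low++;                         /* need a bigger sum */
            } else {
                high--;                        /* need a smaller sum */
            }
        }
    }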

How to profile sort algorithms?

I have coded a few sorting methods in C and I would like to find the input size at which each program is optimal, i.e. profile each algorithm. But how do I do this? I know how to time each method, but I don't know how I can find the size at which it is 'optimal'.
It depends on some factors:
Data behaviour: is your data already partially sorted, or is it very random?
Data size: for a big input (say a thousand elements or more) you can be confident that O(N^2) sorting methods will lose to O(N log N) methods.
Data structure holding the data: is it an array, a list, or something else? Sorting methods that need non-sequential access to the data will be slower on something like a linked list.
So the answer is to empirically run your program on real data of the kind you will actually handle, while varying the input size.
When a slower method (say O(N^2)) starts being beaten by an asymptotically faster one (say O(N log N)) once the input size exceeds some X, you can say the slower method is 'empirically optimal' for input sizes <= X (the value of X depends on the characteristics of the input data).
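A small timing harness along these lines, assuming sort functions with a void xxx_sort_ints(int *, unsigned int) shape (the function names are placeholders for your own implementations; clock() is coarse, so repeat runs for small sizes):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Your own implementations, assumed to exist elsewhere (names are hypothetical). */
    void insertion_sort_ints(int *array, unsigned int length);
    void quick_sort_ints(int *array, unsigned int length);

    /* Time one sort function on a freshly generated random array of n ints. */
    static double time_sort(void (*sort)(int *, unsigned int), unsigned int n) {
        int *a = malloc(n * sizeof *a);
        for (unsigned int i = 0; i < n; i++) a[i] = rand();
        clock_t start = clock();
        sort(a, n);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        free(a);
        return secs;
    }

    int main(void) {
        /* Grow the input size and watch where one method overtakes the other. */
        for (unsigned int n = 1000; n <= 1000000; n *= 10)
            printf("n=%u insertion=%.4fs quick=%.4fs\n", n,
                   time_sort(insertion_sort_ints, n),
                   time_sort(quick_sort_ints, n));
        return 0;
    }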
Sort algorithms do not have a single number at which they are optimal.
For pure execution time, almost every sort algorithm will be fastest on a set of 2 numbers, but that is not useful in most cases.
Some sort algorithms may work more efficiently on smaller data sets, but that does not mean they are 'optimal' at that size.
Some sorts may also work better on other characteristics of the data. There are sorts that can be extremely efficient if the data is almost sorted already, but may be very slow if it is not. Others will run the same on any set of a given size.
It is more useful to look at the Big O of the sort (such as O(n^2), O(n log n) etc) and any special properties the sort has, such as operating on nearly sorted data.
To find the input size at which the program is optimal (by which I assume you mean the fastest, or for which the sorting algorithm requires the fewest comparisons) you will have to test it against various inputs and graph the independent axis (input size) against the dependent axis (runtime) and find the minimum.

How to implement a set?

I want to implement a set in C.
Is it OK to use a linked list when creating the set, or should I use another approach?
How do you usually implement your own set (if needed)?
NOTE:
If I use the linked list approach, I will probably have the following complexities for my set operations:
init : O(1);
destroy: O(n);
insert: O(n);
remove: O(n);
union: O(n*m);
intersection: O(n*m);
difference: O(n*m);
ismember: O(n);
issubset: O(n*m);
setisequal: O(n*m);
O(n*m) seems like it may be a little too big, especially for huge data... Is there a way to implement my set more efficiently?
Sets are typically implemented either as red-black trees (which require the elements to have a total order) or as an automatically resizing hashtable (which requires a hash function).
The latter is typically implemented by having the hashtable double in size and reinserting all elements when a certain capacity threshold (75% works well) is exceeded. This means that individual insert operations can be O(n), but when amortized over many operations, it's actually O(1).
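A sketch of that doubling scheme in C for a set of ints (the hs_* names, the multiplicative hash, and linear probing are illustrative choices, not a particular library):

    #include <stdlib.h>

    typedef struct {
        int *slots;
        unsigned char *used;
        size_t cap, len;          /* cap is kept as a power of two */
    } hashset;

    static size_t hs_hash(int x) { return (size_t)(unsigned)x * 2654435761u; }

    static void hs_init(hashset *s, size_t cap) {
        s->cap = cap; s->len = 0;
        s->slots = malloc(cap * sizeof *s->slots);
        s->used  = calloc(cap, 1);
    }

    static void hs_insert(hashset *s, int x);   /* needed by hs_grow below */

    static void hs_grow(hashset *s) {
        hashset bigger;
        hs_init(&bigger, s->cap * 2);            /* double the table... */
        for (size_t i = 0; i < s->cap; i++)      /* ...and reinsert everything */
            if (s->used[i]) hs_insert(&bigger, s->slots[i]);
        free(s->slots); free(s->used);
        *s = bigger;
    }

    static void hs_insert(hashset *s, int x) {
        if (4 * (s->len + 1) > 3 * s->cap) hs_grow(s);   /* 75% load threshold */
        size_t i = hs_hash(x) & (s->cap - 1);
        while (s->used[i]) {
            if (s->slots[i] == x) return;                /* already a member */
            i = (i + 1) & (s->cap - 1);                  /* linear probing */
        }
        s->used[i] = 1; s->slots[i] = x; s->len++;
    }

    static int hs_member(const hashset *s, int x) {
        size_t i = hs_hash(x) & (s->cap - 1);
        while (s->used[i]) {
            if (s->slots[i] == x) return 1;
            i = (i + 1) & (s->cap - 1);
        }
        return 0;
    }

Start it with hs_init(&s, 16) (any power-of-two capacity); hs_member is then an expected O(1) lookup, with the occasional O(n) insert amortizing away as described above.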
std::set is often implemented as a red-black tree: http://en.wikipedia.org/wiki/Red-black_tree
This approach will give you much better complexity on all the listed operations.
I have used Red-Black trees in the past to build sets.
Here are the time complexities from the Wikipedia article.
Space O(n)
Search O(log n)
Insert O(log n)
Delete O(log n)
There are a lot of ways to implement a set. Here are some of them. Besides, MSDN has a very good article on it.
Since you already have a linked list implemented, the easiest is a skip list. If you want to use balanced trees, the easiest in my opinion is a treap. These are randomized data structures, but generally they are just as efficient as their deterministic counterparts, if not more (and a skip list can be made deterministic).

C: Sorting Methods Analysis

I have a lot of different sorting algorithms which all have the following signature:
void <METHOD>_sort_ints(int * array, const unsigned int ARRAY_LENGTH);
Are there any testing suites for sorting which I could use for the purpose of making empirical comparisons?
This detailed discussion, as well as linking to a large number of related web pages you are likely to find useful, also describes a useful set of input data for testing sorting algorithms (see the linked page for reasons). Summarising:
Completely randomly reshuffled array
Already sorted array
Already sorted in reverse order array
Chainsaw array
Array of identical elements
Already sorted array with N permutations (with N from 0.1 to 10% of the size)
Already sorted in reverse order array with N permutations
Data that have normal distribution with duplicate (or close) keys (for stable sorting only)
Pseudorandom data (daily values of the S&P 500 or another index for a decade might be a good test set here; they are available from Yahoo.com)
The definitive study of sorting is Bob Sedgewick's doctoral dissertation. But there's a lot of good information in his algorithms textbooks, and those are the first two places I would look for test suite and methodology. If you've had a recent course you'll know more than I do; last time I had a course, the best method was to use quicksort down to partitions of size 12, then run insertion sort on the entire array. But the answers change as quickly as the hardware.
Jon Bentley's Programming Pearls books have some other info on sorting.
You can quickly whip up a test suite containing
Random integers
Sorted integers
Reverse sorted integers
Sorted integers, mildly perturbed
If memory serves, these are the most important cases for a sort algorithm.
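A sketch of generators for those four cases in C (the 1% perturbation rate and the simple rand()-based shuffle are arbitrary choices):

    #include <stdlib.h>

    /* Ascending 0..n-1. */
    void fill_sorted(int *a, int n) {
        for (int i = 0; i < n; i++) a[i] = i;
    }

    /* Descending n-1..0. */
    void fill_reverse_sorted(int *a, int n) {
        for (int i = 0; i < n; i++) a[i] = n - 1 - i;
    }

    /* Fisher-Yates shuffle of an already-filled array. */
    void shuffle(int *a, int n) {
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        }
    }

    /* Random permutation of 0..n-1. */
    void fill_random(int *a, int n) {
        fill_sorted(a, n);
        shuffle(a, n);
    }

    /* Sorted input with roughly 1% of positions swapped ("mildly perturbed"). */
    void fill_nearly_sorted(int *a, int n) {
        fill_sorted(a, n);
        for (int s = 0; s < n / 100; s++) {
            int i = rand() % n, j = rand() % n;
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        }
    }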
If you're looking to sort arrays that won't fit in cache, you'll need to measure cache effects. valgrind is effective if slow.
sortperf.py has a well-selected suite of benchmark test cases and was used to support the essay found here and make timsort THE sort in Python lo these many years ago. Note that, at long last, Java may be moving to timsort too, thanks to Josh Bloch (see here), so I imagine they have written their own version of the benchmark test cases -- however, I can't easily find a reference to it. (timsort, a stable, adaptive, iterative natural mergesort variant, is especially suited to languages with reference-to-object semantics like Python and Java, where "data movement" is relatively cheap [[since all that's ever being moved is references aka pointers, not blobs of unbounded size;-)]], but comparisons can be relatively costly [[since there is no upper bound to the complexity of a comparison function -- but then this holds for any language where sorting may be customized via a custom comparison or key-extraction function]]).
This site shows the various sorting algorithms using four groups:
http://www.sorting-algorithms.com/
In addition to the four groups in Norman's answer, you want to check the sorting algorithms with collections of numbers that have certain similarities among the numbers:
All integers are unique
The same integer in the whole collection
Few Unique Keys
Changing the number of elements in the collection is also good practice: check each algorithm with 1K, 1M, 1G, etc. elements to see what the memory implications of that algorithm are.

Data structure for finding nearby keys with similar bitvalues

I have some data, between a million and a billion records, each of which is represented by a bitfield, about 64 bits per key. The bits are independent; you can imagine them basically as random bits.
If I have a test key and I want to find all values in my data with the same key, a hash table will spit those out very easily, in O(1).
What algorithm/data structure would efficiently find all records most similar to the query key? Here similar means that most bits are identical, but a minimal number are allowed to be wrong. This is traditionally measured by Hamming distance, which just counts the number of mismatched bits.
There are two ways this query might be made: one is by specifying a mismatch rate, like "give me a list of all existing keys which have fewer than 6 bits that differ from my query"; the other is by simply asking for the best matches, like "give me a list of the 10,000 keys which have the lowest number of differing bits from my query."
You might be tempted to run to k-nearest-neighbour algorithms, but here we're talking about independent bits, so it doesn't seem likely that structures like quadtrees are useful.
The problem can be solved by simple brute force, testing a hash table for low numbers of differing bits. If we want to find all keys that differ by one bit from our query, for example, we can enumerate all 64 such keys and test them all. But this explodes quickly: if we wanted to allow two bits of difference, we'd have to probe 64*63/2 = 2016 times. It gets exponentially worse for higher numbers of bits.
So is there another data structure or strategy that makes this kind of query more efficient?
The database/structure can be preprocessed as much as you like, it's the query speed that should be optimized.
What you want is a BK-Tree. It's a tree that's ideally suited to indexing metric spaces (your problem is one), and supports both nearest-neighbour and distance queries. I wrote an article about it a while ago.
BK-Trees are generally described with reference to text, using Levenshtein distance to build the tree, but it's straightforward to write one in terms of binary strings and Hamming distance.
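A minimal sketch of such a BK-tree in C over 64-bit keys (the node layout and function names are my own, and __builtin_popcountll assumes GCC or Clang):

    #include <stdint.h>
    #include <stdlib.h>

    /* One node per stored key; children are indexed by their exact
       Hamming distance to this node's key (1..64). */
    typedef struct bknode {
        uint64_t key;
        struct bknode *child[65];
    } bknode;

    static int hamming(uint64_t a, uint64_t b) {
        return __builtin_popcountll(a ^ b);
    }

    static bknode *bk_new(uint64_t key) {
        bknode *n = calloc(1, sizeof *n);
        n->key = key;
        return n;
    }

    static void bk_insert(bknode *root, uint64_t key) {
        for (;;) {
            int d = hamming(key, root->key);
            if (d == 0) return;                      /* already present */
            if (!root->child[d]) { root->child[d] = bk_new(key); return; }
            root = root->child[d];
        }
    }

    /* Report every stored key within radius bits of query. */
    static void bk_query(const bknode *n, uint64_t query, int radius,
                         void (*report)(uint64_t)) {
        if (!n) return;
        int d = hamming(query, n->key);
        if (d <= radius) report(n->key);
        /* Triangle inequality: only subtrees whose edge distance lies in
           [d - radius, d + radius] can contain matches. */
        int lo = d - radius < 1 ? 1 : d - radius;
        int hi = d + radius > 64 ? 64 : d + radius;
        for (int i = lo; i <= hi; i++)
            bk_query(n->child[i], query, radius, report);
    }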
This sounds like a good fit for an S-Tree, which is like a hierarchical inverted file. Good resources on this topic include the following papers:
Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Attributes.
Improved Methods for Signature-Tree Construction (2000)
Quote from the first one:
The hierarchical bitmap index efficiently supports different classes of queries, including subset, superset and similarity queries. Our experiments show that the hierarchical bitmap index outperforms other set indexing techniques significantly.
These papers include references to other research that you might find useful, such as M-Trees.
Create a binary tree (specifically a trie) representing each key in your start set in the following way: The root node is the empty word, moving down the tree to the left appends a 0 and moving down the right appends a 1. The tree will only have as many leaves as your start set has elements, so the size should stay manageable.
Now you can do a recursive traversal of this tree, allowing at most n "deviations" from the query key in each recursive line of execution, until you have found all of the nodes in the start set which are within that number of deviations.
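A compact sketch of that traversal in C, assuming the trie is stored as a plain binary tree over 64-bit keys, highest bit first (the node layout and names are illustrative; this version just counts the matching keys, but collecting them works the same way):

    #include <stdint.h>

    /* Binary trie node: child[0] for a 0 bit, child[1] for a 1 bit.
       A stored key's leaf sits 64 levels below the root. */
    typedef struct tnode {
        struct tnode *child[2];
    } tnode;

    /* Count keys in the trie within budget bit deviations of query,
       examining bit positions 63 down to 0. Call with bit = 63. */
    long trie_near(const tnode *n, uint64_t query, int bit, int budget) {
        if (!n || budget < 0) return 0;
        if (bit < 0) return 1;                       /* reached a stored key */
        int b = (int)((query >> bit) & 1);
        /* The matching branch is free; the other branch costs one deviation. */
        return trie_near(n->child[b],     query, bit - 1, budget)
             + trie_near(n->child[1 - b], query, bit - 1, budget - 1);
    }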
I'd go with an inverted index, like a search engine. You've basically got a fixed vocabulary of 64 words. Then similarity is measured by hamming distance, instead of cosine similarity like a search engine would want to use. Constructing the index will be slow, but you ought to be able to query it with normal search enginey speeds.
The book Introduction to Information Retrieval covers the efficient construction, storage, compression and querying of inverted indexes.
"Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", from 2008, seems to be the best result as of then. I won't try to summarize since I read it over a year ago and it's hairy. That's from a page on locality-sensitive hashing, along with an implementation of an earlier version of the scheme. For more general pointers, read up on nearest neighbor search.
This kind of question has been asked before: Fastest way to find most similar string to an input?
"The database/structure can be preprocessed as much as you like."
Well... IF that is true, then all you need is a similarity matrix of your Hamming distances. Make the matrix sparse by pruning out large distances. It doesn't get any faster, and it's not that much of a memory hog.
Well, you could insert all of the neighbor keys along with the original key. That would mean that you store (64 choose k) times as much data, for k differing bits, and it will require that you decide k beforehand. Though you could always extend k by brute force querying neighbors, and this will automatically query the neighbors of your neighbors that you inserted. This also gives you a time-space tradeoff: for example, if you accept a 64 x data blowup and 64 times slower you can get two bits of distance.
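For k = 1, precomputing the neighborhood is just 65 inserts per key; table_insert below is a stand-in for whatever hash-table insert you already use (a hypothetical name):

    #include <stdint.h>

    /* Stand-in for your existing hash table insert (hypothetical). */
    void table_insert(void *table, uint64_t key);

    /* Store the key plus all 64 keys at Hamming distance 1 from it. */
    void insert_with_neighbors(void *table, uint64_t key) {
        table_insert(table, key);
        for (int i = 0; i < 64; i++)
            table_insert(table, key ^ (1ULL << i));
    }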
I haven't completely thought this through, but I have an idea of where I'd start.
You could divide the search space up into a number of buckets where each bucket has a bucket key and the keys in the bucket are the keys that are more similar to this bucket key than any other bucket key. To create the bucket keys, you could randomly generate 64 bit keys and discard any that are too close to any previously created bucket key, or you could work out some algorithm that generates keys that are all dissimilar enough. To find the closest key to a test key, first find the bucket key that is closest, and then test each key in the bucket. (Actually, it's possible, but not likely, for the closest key to be in another bucket - do you need to find the closest key, or would a very close key be good enough?)
If you're OK with doing it probabilistically, I think there's a good way to solve question 2. I assume you have 2^30 data points and a distance cutoff, and you want to find all points within the cutoff distance of test.
One_Try()
1. Generate randomly a 20-bit subset S of 64 bits
2. Ask for a list of elements that agree with test on S (about 2^10 elements)
3. Sort that list by Hamming distance from test
4. Discard the part of list after cutoff
You repeat One_Try as much as you need while merging the lists. The more tries you have, the more points you find. For example, if x is within 5 bits, you'll find it in one try with about (2/3)^5 = 13% probability. Therefore if you repeat 100 tries you find all but roughly 10^{-6} of such x. Total time: 100*(1000*log 1000).
The main advantage of this is that you're able to output answers to question 2 as you proceed, since after the first few tries you'll certainly find everything within distance not more than 3 bits, etc.
If you have many computers, you give each of them several tries, since they are perfectly parallelizable: each computer saves some hash tables in advance.
Data structures for large sets are described here: Detecting Near-Duplicates for Web Crawling
or
an in-memory trie: Judy arrays at sourceforge.net
Assuming you have to visit each row to test its value (or each index entry, if you index on the bitfield), you can write the actual test quite efficiently using
A xor B
to find the differing bits, then bit-count the result using a technique like this.
This effectively gives you the Hamming distance.
Since this can compile down to tens of instructions per test, this can run pretty fast.
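In C the whole test is essentially one XOR plus a population count; a sketch (the GCC/Clang builtin is an assumption, with a portable fallback):

    #include <stdint.h>

    /* Hamming distance between two 64-bit keys: XOR, then count the set bits. */
    static inline int hamming64(uint64_t a, uint64_t b) {
    #if defined(__GNUC__) || defined(__clang__)
        return __builtin_popcountll(a ^ b);
    #else
        uint64_t x = a ^ b;
        int count = 0;
        while (x) { x &= x - 1; count++; }   /* clears the lowest set bit each pass */
        return count;
    #endif
    }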
If you are okay with a randomized algorithm (Monte Carlo in this case), you can use MinHash.
If the data weren't so sparse, a graph with keys as the vertices and edges linking 'adjacent' (Hamming distance = 1) nodes would probably be very efficient time-wise. The space would be very large though, so in your case, I don't think it would be a worthwhile tradeoff.
