How to profile sort algorithms? - c

I have coded a few sorting methods in C and I would like to find the input size at which each algorithm is optimal, i.e. profile each algorithm. But how do I do this? I know how to time each method, but I don't know how to find the size at which it is 'optimal'.

It depends on several factors:
Data behaviour: is your data already partially sorted, or is it essentially random?
Data size: for a large input (say a thousand elements or more) you can be fairly sure that O(N^2) sorting methods will lose to O(N log N) methods.
Data structure: is the data in an array, a linked list, or something else? Sorting methods with non-sequential access to data will be slower on something like a linked list.
So the answer is found empirically: run your program on the kind of real data you are likely to handle, while varying the input size.
When a slower method (like O(N^2)) gets beaten by some faster method (like O(N log N)) once the input size exceeds some X, you can say the slower method is 'empirically optimal' for input sizes <= X (the value of X depends on the characteristics of the input data).

Sort algorithms do not have a single number at which they are optimal.
For pure execution time, almost every sort algorithm will be fastest on a set of 2 numbers, but that is not useful in most cases.
Some sort algorithms may work more efficiently on smaller data sets, but that does not mean they are 'optimal' at that size.
Some sorts may also work better on other characteristics of the data. There are sorts that can be extremely efficient if the data is almost sorted already, but may be very slow if it is not. Others will run the same on any set of a given size.
It is more useful to look at the Big O of the sort (such as O(n^2), O(n log n) etc) and any special properties the sort has, such as operating on nearly sorted data.

To find the input size at which the program is optimal (by which I assume you mean fastest, or requiring the fewest comparisons), you will have to test it against various inputs, graph the independent axis (input size) against the dependent axis (runtime), and find the minimum.

Related

What does worst case Big-Omega(n) mean?

If Big-Omega is the lower bound, then what does it mean to have a worst case time complexity of Big-Omega(n)?
From the book "Data Structures and Algorithms in Python" by Michael T. Goodrich:
consider a dynamic array that doubles its size when the number of elements reaches its capacity.
this is from the book:
"we fully explored the append method. In the worst case, it requires
Ω(n) time because the underlying array is resized, but it uses O(1) time in the amortized sense"
The parameterized version, pop(k), removes the element that is at index k < n
of a list, shifting all subsequent elements leftward to fill the gap that results from
the removal. The efficiency of this operation is O(n−k), as the amount of shifting
depends upon the choice of index k. Note well that this
implies that pop(0) is the most expensive call, using Ω(n) time.
How does "Ω(n)" describe the most expensive case?
The number inside the parentheses is the number of operations you must do to actually carry out the operation, always expressed as a function of the number of items you are dealing with. You never worry about how hard each of those operations is, only the total number of them.
If the array is full and has to be resized you need to copy all the elements into the new array. One operation per item in the array, thus an O(n) runtime. However, most of the time you just do one operation for an O(1) runtime.
Common values are:
O(1): One operation only, such as adding it to the list when the list isn't full.
O(log n): This typically occurs when you have a binary search or the like to find your target. Note that the base of the log isn't specified as the difference is just a constant and you always ignore constants.
O(n): One operation per item in your dataset. For example, unsorted search.
O(n log n): Commonly seen in good sort routines where you have to process every item but can divide and conquer as you go.
O(n^2): Usually encountered when you must consider every interaction of two items in your dataset and have no way to organize it. For example, a routine I wrote long ago to find near-duplicate pictures. (Exact duplicates would be handled by making a dictionary of hashes and testing whether the hash existed, and thus be O(n) -- the two passes are a constant factor and discarded; you wouldn't say O(2n).)
O(n^3): By the time you're getting this high you consider it very carefully. Now you're looking at three-way interactions of items in your dataset.
Higher orders can exist, but you need to consider carefully what it's going to do. I have shipped production code that was O(n^8), but with very heavy pruning of paths, and even then it took 12 hours to run. Had the nature of the data not been conducive to such pruning I wouldn't have written it at all -- the code would still be running.
You will occasionally encounter even nastier stuff which needs careful consideration of whether it's going to be tolerable or not. For large datasets they're impossible:
O(2^n): Real world example: attempting to prune paths so as to retain a minimum spanning tree -- I computed all possible trees and kept the cheapest. Several experiments showed n never going above 10, so I thought I was ok -- until a different seed produced n = 22. I rewrote the routine for a not-always-perfect answer that was O(n^2) instead.
O(n!): I don't know any examples. It blows up horribly fast.

Which of the following methods is more efficient

Problem Statement:- Given an array of integers and an integer k, print all the pairs in the array whose sum is k
Method 1:
Sort the array, maintain two pointers low and high, and iterate inward.
Time complexity: O(n log n)
Space complexity: O(1)
Method 2:
Put all the elements in a dictionary and look up the complements.
Time complexity: O(n)
Space complexity: O(n)
Now, out of the above two approaches, which one is more efficient, and on what basis do I compare the efficiency, time or space, given that the two approaches differ on both?
I've left my comment above for reference.
It was hasty. You do allow O(n log n) time for the Method 1 sort (I now think I understand?) and that's fair (apologies ;-).
What happens next? If the input array must be used again, then you need a sorted copy (the sort would not be in-place) which adds an O(n) space requirement.
The "iterating" part of Method 1 also costs ~O(n) time.
But loading up the dictionary in Method 2 is also ~O(n) time (presumably a throw-away data structure?) and dictionary access - although ~O(1) - is slower (than array indexing).
Bottom line: O-notation is helpful if it can identify an "overpowering cost" (rendering others negligible by comparison), but without a hint at use-cases (typical and boundary, details like data quantities and available system resources etc), questions like this (seeking a "generalised ideal" answer) can't benefit from it.
Often some simple proof-of-concept code and performance tests on representative data can make "the right choice obvious" (more easily and often more correctly than speculative theorising).
Finally, in the absence of a clear performance winner, there is always "code readability" to help decide;-)

Fast binary search/indexing in C

I have a dataset composed of n Elements of a fixed size (24 bytes). I want to create an index to be able to search as fast as possible a random element of 24 bytes in this dataset. What algorithm should I use ? Do you know a C library implementing this ?
Fast read access/search speed is the priority. Memory usage and insertion speed are not a problem; there will be barely any write access after the initialization.
EDIT: The dataset will be stored in memory (RAM) no disk access.
If there's a logical ordering between the elements, then a quicksort of the data is a fast way to order it. Once it's ordered you can then use a binary search algorithm to look for elements. This is an O(log N) search, and you'll be hard pressed to get anything faster!
std::sort can be used to sort the data, and std::binary_search can be used to search the data.
Use a hash table, available as a std::unordered_map in STL. Will beat a binary search (my bet).
Alternatively, a (compressed) trie (http://en.wikipedia.org/wiki/Trie). This is really the fastest if you can afford the memory space.

Fast spatial data structure for nearest neighbor search amongst non-uniformly sized hyperspheres

Given a k-dimensional continuous (euclidean) space filled with rather unpredictably moving/growing/shrinking hyperspheres, I need to repeatedly find the hypersphere whose surface is nearest to a given coordinate. If some hyperspheres are of the same distance to my coordinate, then the biggest hypersphere wins. (The total count of hyperspheres is guaranteed to stay the same over time.)
My first thought was to use a KDTree but it won't take the hyperspheres' non-uniform volumes into account.
So I looked further and found BVH (Bounding Volume Hierarchies) and BIH (Bounding Interval Hierarchies), which seem to do the trick. At least in 2-/3-dimensional space. However while finding quite a bit of info and visualizations on BVHs I could barely find anything on BIHs.
My basic requirement is a k-dimensional spatial data structure that takes volume into account and is either super fast to build (off-line) or dynamic with barely any unbalancing.
Given my requirements above, which data structure would you go with? Any other ones I didn't even mention?
Edit 1: Forgot to mention: hyperspheres are allowed (actually highly expected) to overlap!
Edit 2: Looks like instead of "distance" (and "negative distance" in particular) my described metric matches the power of a point much better.
I'd expect a QuadTree/Octree/generalized to 2^K-tree for your dimensionality of K would do the trick; these recursively partition space, and presumably you can stop when a K-subcube (or K-rectangular brick if the splits aren't even) does not contain a hypersphere, or contains one or more hyperspheres such that partitioning doesn't separate any, or alternatively contains the center of just a single hypersphere (probably easier).
Inserting and deleting entities in such trees is fast, so a hypersphere changing size just causes a delete/insert pair of operations. (I suspect you can optimize this if your sphere size changes by local additional recursive partition if the sphere gets smaller, or local K-block merging if it grows).
I haven't worked with them, but you might also consider binary space partitions. These let you use binary trees instead of k-trees to partition your space. I understand that KDTrees are a special case of this.
But in any case I thought the insertion/deletion algorithms for 2^K trees and/or BSP/KDTrees were well understood and fast. So hypersphere size changes cause deletion/insertion operations, but those are fast. So I don't understand your objection to KD-trees.
I think the performance of all these are asymptotically the same.
I would use the R*Tree extension for SQLite. A table would normally have 1 or 2 dimensional data. SQL queries can combine multiple tables to search in higher dimensions.
The formulation with negative distance is a little weird. Distance is positive in geometry, so there may not be much helpful theory to use.
A different formulation that uses only positive distances may be helpful. Read about hyperbolic spaces. This might help to provide ideas for other ways to describe distance.

C: Sorting Methods Analysis

I have a lot of different sorting algorithms which all have the following signature:
void <METHOD>_sort_ints(int * array, const unsigned int ARRAY_LENGTH);
Are there any testing suites for sorting which I could use for the purpose of making empirical comparisons?
This detailed discussion, which also links to a large number of related web pages you are likely to find useful, describes a useful set of input data for testing sorting algorithms (see the linked page for reasons). Summarising:
Completely randomly reshuffled array
Already sorted array
Already sorted in reverse order array
Chainsaw array
Array of identical elements
Already sorted array with N permutations (with N from 0.1 to 10% of the size)
Already sorted array in reverse order array with N permutations
Data that have normal distribution with duplicate (or close) keys (for stable sorting only)
Pseudorandom data (daily values of the S&P 500 or another index for a decade might be a good test set here; they are available from Yahoo.com)
The definitive study of sorting is Bob Sedgewick's doctoral dissertation. But there's a lot of good information in his algorithms textbooks, and those are the first two places I would look for test suite and methodology. If you've had a recent course you'll know more than I do; last time I had a course, the best method was to use quicksort down to partitions of size 12, then run insertion sort on the entire array. But the answers change as quickly as the hardware.
Jon Bentley's Programming Pearls books have some other info on sorting.
You can quickly whip up a test suite containing
Random integers
Sorted integers
Reverse sorted integers
Sorted integers, mildly perturbed
If memory serves, these are the most important cases for a sort algorithm.
If you're looking to sort arrays that won't fit in cache, you'll need to measure cache effects. valgrind is effective if slow.
sortperf.py has a well selected suite of benchmark test cases and was used to support the essay found here and make timsort THE sort in Python lo that many years ago. Note that, at long last, Java may be moving to timsort too, thanks to Josh Block (see here), so I imagine they have written their own version of the benchmark test cases -- however, I can't easily find a reference to it. (timsort, a stable, adaptive, iterative natural mergesort variant, is especially suited to languages with reference-to-object semantics like Python and Java, where "data movement" is relatively cheap [[since all that's ever being moved is references aka pointers, not blobs of unbounded size;-)]], but comparisons can be relatively costly [[since there is no upper bound to the complexity of a comparison function -- but then this holds for any language where sorting may be customized via a custom comparison or key-extraction function]]).
This site shows the various sorting algorithms using four groups:
http://www.sorting-algorithms.com/
In addition to the four groups in Norman's answer, you want to check the sorting algorithms with collections of numbers that share certain characteristics:
All integers are unique
The same integer in the whole collection
Few Unique Keys
Changing the number of elements in the collection is also good practice: check each algorithm with 1K, 1M, 1G, etc. elements to see what the memory implications of that algorithm are.
