How to find an almost sorted array? - arrays

I'm trying to create a program that will select the fastest sorting algorithm for a particular array of integers. I'm trying to check off the condition "is almost sorted," and was wondering what common practice to find this in the industry is.
Assume that there is a sorted array available to the coder. The two possible solutions I can think of are:
Loop through both lists simultaneously. Compare values at the index, find the percentage of correctly placed values. I understand that this is pretty quick (just O(N)), but it can be wildly inaccurate... what if everything is shifted by one space? This algorithm will give 0, but insertion sort will take a single run to do order this.
Find how far something is shifted from it's correct position in either direction (w/ wraparound). This seems to be a better solution, but could be pretty slow (O(N^2), since we might have to loop through a sorted list for every unsorted object, which could be corrected A BIT by comparing the value in a while loop).
Are there others? If not, which do I pick?
Thanks!

Related

efficient way to remove duplicate pairs of integers in C

I've found answers to similar problems, but none of them exactly described my problem.
so on the risk of being down-voted to hell I was wondering if there is a standard method to solve my problem. Further, there's a chance that I'm asking the wrong question. Maybe the problem can be solved more efficiently another way.
So here's some background:
I'm looping through a list of particles. Each particle has a list of it's neighboring particles. Now I need to create a list of unique particle pairs of mutual neightbours.
Each particle can be identified by an integer number.
Should I just build a list of all the pair's including duplicates and use some kind of sort & comparator to eliminate duplicates or should I try to avoid adding duplicates into my list in the first place?
Performance is really important to me. I guess most of the loops may be vectorized and threaded. On average each particle has around 15 neighbours and I expect, that there will be 1e6 particles at most.
I do have some ideas, but I'm not an experienced coder and I don't want to waste 1 week to test every single method by benchmarking different situations just to find out that there's already a standard meyjod for my problem.
Any suggestions?
BTW: I'm using C.
Some pseudo-code
for i in nparticles
particle=particles[i]; //just an array containing the "index" of each particle
//each particle has a neightbor-list
for k in neighlist[i] //looping through all the neighbors
//k represent the index of the neighbor of particle "i"
if the pair (i,k) or (k,i) is not already in the pair-list, add it. otherwise don't
Sorting the elements each iteration is not a good idea since comparison sort is O(n log n) complex.
The next best thing would be to store the items in a search tree, better yet binary search tree, and better yet self equalizing binary search tree, you can find implementations on GitHub.
Even better solution would give an access time of O(1), you can achieve this in 2 different ways one is a simple identity array, where at each slot you would save say a pointer to item if there is on at this id or some flag defining that current id is empty. This is very fast but wasteful. You'll need O(N) memory.
The best solution in my opinion would be to use a set or a has-map. Which are basically the same because sets can be implemented using hash-map.
Here is a github project with c hash-map implementation.
And stack overflow answer to a similar question.

Algorithm - What is the best algorithm for detecting duplicate numbers in small array?

What is the best algorithm for detecting duplicate numbers in array, the best in speed, memory and avoiving overhead.
Small Array like [5,9,13,3,2,5,6,7,1] Note that 5 i dublicate.
After searching and reading about sorting algorithms, I realized that I will use one of these algorithms, Quick Sort, Insertion Sort or Merge Sort.
But actually I am really confused about what to use in my case which is a small array.
Thanks in advance.
To be honest, with that size of array, you may as well choose the O(n2) solution (checking every element against every other element).
You'll generally only need to worry about performance if/when the array gets larger. For small data sets like this, you could well have found the duplicate with an 'inefficient' solution before the sort phase of an efficient solution will have finished :-)
In other words, you can use something like (pseudo-code):
for idx1 = 0 to nums.len - 2 inclusive:
for idx2 = idx1 + 1 to nums.len - 1 inclusive:
if nums[idx1] == nums[idx2]:
return nums[idx1]
return no dups found
This finds the first value in the array which has a duplicate.
If you want an exhaustive list of duplicates, then just add the duplicate value to another (initially empty) array (once only per value) and keep going.
You can sort it using any half-decent algorithm though, for a data set of the size you're discussing, even a bubble sort would probably be adequate. Then you just process the sorted items sequentially, looking for runs of values but it's probably overkill in your case.
Two good approaches depend on the fact that you know or not the range from which numbers are picked up.
Case 1: the range is known.
Suppose you know that all numbers are in the range [a, b[, thus the length of the range is l=b-a.
You can create an array A the length of which is l and fill it with 0s, thus iterate over the original array and for each element e increment the value of A[e-a] (here we are actually mapping the range in [0,l[).
Once finished, you can iterate over A and find the duplicate numbers. In fact, if there exists i such that A[i] is greater than 1, it implies that i+a is a repeated number.
The same idea is behind counting sort, and it works fine also for your problem.
Case 2: the range is not known.
Quite simple. Slightly modify the approach above mentioned, instead of an array use a map where the keys are the number from your original array and the values are the times you find them. At the end, iterate over the set of keys and search those that have been found more then once.
Note.
In both the cases above mentioned, the complexity should be O(N) and you cannot do better, for you have at least to visit all the stored values.
Look at the first example: we iterate over two arrays, the lengths of which are N and l<=N, thus the complexity is at max 2*N, that is O(N).
The second example is indeed a bit more complex and dependent on the implementation of the map, but for the sake of simplicity we can safely assume that it is O(N).
In memory, you are constructing data structures the sizes of which are proportional to the number of different values contained in the original array.
As it usually happens, memory occupancy and performance are the keys of your choice. Greater the former, better the latter and vice versa. As suggested in another response, if you know that the array is small, you can safely rely on an algorithm the complexity of which is O(N^2), but that does not require memory at all.
Which is the best choice? Well, it depends on your problem, we cannot say.

Dynamic threshold determination in array of dynamic numerical values

I have an array of ~1000 objects that are float values which evolve over time (in a manner which cannot be predetermined; assume it is a black box). At every fixed time interval, I want to set a threshold value that separates the top 5-15% of values, making the cut wherever a distinction can be made most "naturally," in the sense that there are the largest gaps between data points in the array.
What is the best way for me to implement such an algorithm? Obviously (I think) the first step to take at the end of each time interval is to sort the array, but then after that I am not sure what the most efficient way to resolve this problem is. I have a feeling that it is not necessary to tabulate all of the gaps between consecutive data points in the region of interest in the sorted array, and that there is a much faster way than brute-force to solve this, but I am not sure what it is. Any ideas?
You could write your own quicksort/select routine that doesn't issue recursive calls for subarrays lying entirely outside of the 5%-15%ile range. For only 1,000 items, though, I'm not sure if it would be worth the trouble.
Another possibility would be to use fancy data structures to track the largest gaps online as the values evolve (e.g., a binary search tree decorated with subtree counts (for fast indexing) and largest subtree gaps). It's definitely not clear if this would be worth the trouble.

How to know if an array is sorted?

I already read this post but the answer didn't satisfied me Check if Array is sorted in Log(N).
Imagine I have a serious big array over 1,000,000 double numbers (positive and/or negative) and I want to know if the array is "sorted" trying to avoid the max numbers of comparisons because comparing doubles and floats take too much time. Is it possible to use statistics on It?, and if It was:
It is well seen by real-programmers?
Should I take samples?
How many samples should I take
Should they be random, or in a sequence?
How much is the %error permitted to say "the array sorted"?
Thanks.
That depends on your requirements. If you can say that if 100 random samples out of 1.000.000 is enough the assume it's sorted - then so it is. But to be absolutely sure, you will always have to go through every single entry. Only you can answer this question since only you know how certain you need to be about it being sorted.
This is a classic probability problem taught in high school. Consider this question:
What is the probability that the batch will be rejected?
In a batch of 8,000, clocks 7% are defective. A random sample of 10 (without replacement) from the 8,000 is selected and tested. If at least one is defective the entire batch will be rejected.
So you can take a number of random samples from your large array and see if it's sorted, but you must note that you need to know the probability that the sample is out of order. Since you don't have that information, a probabilistic approach wouldn't work efficiently here.
(However, you can check 50% of the array and naively conclude that there is a 50% chance that it is sorted correctly.)
If you run a divide and conquer algorithm using multiprocessing (real parallelism, so only for multi-core CPUs) you can check whether an array is sorted or not in Log(N).
If you have GPU multiprocessing you can achieve Log(N) very easily since modern graphics card are able to run few thousands processes in parallel.
Your question 5 is the question that you need to answer to determine the other answers. To ensure the array is perfectly sorted you must go through every element, because any one of them could be the one out of place.
The maximum number of comparisons to decide whether the array is sorted is N-1, because there are N-1 adjacent number pairs to compare. But for simplicity, we'll say N as it does not matter if we look at N or N+1 numbers.
Furthermore, it is unimportant where you start, so let's just start at the beginning.
Comparison #1 (A[0] vs. A[1]). If it fails, the array is unsorted. If it succeeds, good.
As we only compare, we can reduce this to the neighbors and whether the left one is smaller or equal (1) or not (0). So we can treat the array as a sequence of 0's and 1's, indicating whether two adjacent numbers are in order or not.
Calculating the error rate or the propability (correct spelling?) we will have to look at all combinations of our 0/1 sequence.
I would look at it like this: We have 2^n combinations of an array (i.e. the order of the pairs, of which only one is sorted (all elements are 1 indicating that each A[i] is less or equal to A[i+1]).
Now this seems to be simple:
initially the error is 1/2^N. After the first comparison half of the possible combinations (all unsorted) get eliminated. So the error rate should be 1/2^n + 1/2^(n-1).
I'm not a mathematician, but it should be quite easy to calculate how many elements are needed to reach the error rate (find x such that ERROR >= sum of 1/2^n + 1/2^(n-1)... 1/^(2-x) )
Sorry for the confusing english. I come from germany..
Since every single element can be the one element that is out-of-line, you have to run through all of them, hence your algorithm has runtime O(n).
If your understanding of "sorted" is less strict, you need to specify what exaclty you mean by "sorted". Usually, "sorted" means that adjacent elements meet a less or less-or-equal condition.
Like everyone else says, the only way to be 100% sure that it is sorted is to run through every single element, which is O(N).
However, it seems to me that if you're so worried about it being sorted, then maybe having it sorted to begin with is more important than the array elements being stored in a contiguous portion in memory?
What I'm getting at is, you could use a map whose elements by definition follow a strict weak ordering. In other words, the elements in a map are always sorted. You could also use a set to achieve the same effect.
For example: std::map<int,double> collectoin; would allow you to almost use it like an array: collection[0]=3.0; std::cout<<collection[0]<<std:;endl;. There are differences, of course, but if the sorting is so important then an array is the wrong choice for storing the data.
The old fashion way.Print it out and see if there in order. Really if your sort is wrong you would probably see it soon. It's more unlikely that you would only see a few misorders if you were sorting like 100+ things. When ever I deal with it my whole thing is completely off or it works.
As an example that you probably should not use but demonstrates sampling size:
Statistically valid sample size can give you a reasonable estimate of sortedness. If you want to be 95% certain eerything is sorted you can do that by creating a list of truly random points to sample, perhaps ~1500.
Essentially this is completely pointless if the list of values being out of order in one single place will break subsequent algorithms or data requirements.
If this is a problem, preprocess the list before your code runs, or use a really fast sort package in your code. Most sort packages also have a validation mode, where it simply tells you yes, the list meets your sort criteria - or not. Other suggestions like parallelization of your check with threads are great ideas.

How to search a big array for an object?

I had an interview today, I was asked how search for a number inside an array, I said binarysearch, he asked me how about a big array that has thousands of bjects (for example Stocks) searching for example by price of the stocks, I said binarysearch again, he said sorting an array of thousands will take lot of time before applying binarysearch.
Can you please bear with me and teach me how to approach this problem ?
thanks
your help is appreciated.
I was asked a similar question.The twist was to search in sorted and then an unsorted array .These were my answers all unaccepted
For sorted I suggested we can find the center and do a linear search .Binary search will also work here
For unsorted I suggested linear again .
Then I suggested Binary which is kind of wrong.
Suggested storing the array in a hashset and utilize hashing . (Not accepted since high space complexcity)
I suggested Tree Set which is a Red Black tree quite good for lookup.(Not accepted since high space complexcity)
Copying into Arraylist etch were also considered overhead.
In the end I got a negative feedback.
Though we may think that one of the above is solution but surely there is something special in linear searching which I am missing.
To be noted sorting before searching is also an overhead especially if you are utilizing any extra data structures in between.
Any comments welcomed.
I am not sure what he had in mind.
If you just want to find the number one time, and you have no guarantees about whether the array is sorted, then I don't think you can beat linear search. On average you will need to seek halfway through the array before you find the value, i.e. expected running time O(N); when sorting you have to touch every single value at least once and probably more than that, i.e. expected running time O(N log N).
But if you need to find multiple values then the time spent sorting it pays off quickly. With a sorted array, you can binary search in O(log N) time, so for sure by the third search you are ahead if you invested the time to sort.
You can do even better if you are allowed to build different data structures to help with the problem. You could build some sort of index, such as a hash table; but the champion data structure for this sort of problem probably would be some sort of tree structure. Then you can insert new values into the tree faster than you could append new values and re-sort the array, and the lookup is still going to be O(log N) to find any value. There are different sorts of trees available: binary tree, B-tree, trie, etc.
But as #Hot Licks said, a hash table is often used for this sort of thing, and it's pretty cheap to update: you just append a value on the main array, and update the hash table to point to the new value. And a hash table is very close to O(1) time, which you can't beat. (A hash table is O(1) if there are no hash collisions; assuming a good hash algorithm and a big enough hash table there will be almost no collisions. I think you could say that a hash table is O(N) where N is the average number of hash collisions per "bucket". If I'm wrong about that I expect to be corrected very quickly; this is StackOverflow!)
I think the interviewer wants you to analyze under different case about the array initial state, what algorithm will you use. Of cause , you must know you can build a hash table and then O(1) can find the number, or when the array is sorted (time spent on sorting maybe concerned) , you can use binarysearch, or use some other data structures to finish the job.

Resources