I use Core Data to store a JSON file. The Core Data store is an array of dictionaries, [TestMO], and one attribute of each dictionary is an array of keywords (not a fixed count; maybe 3, 5, 7, etc.). What I am trying to do is compare the entire database with itself to find the objects, TestMO's, which have similar (or > 50%) matching keywords. I tried a loop inside a loop, but it is just too time consuming and a terrible user experience. Any ideas how I can achieve this efficiently? Thank you.
Use your knowledge to reduce the complexity
If your array has n elements and you want to compare each element with every other element, you end up with n*(n-1)/2 comparisons. For n=10 you get 45 comparisons, for n=100 you get 4,950, for n=1,000 about half a million, and for n=1,000,000 about half a trillion. Your complexity grows quadratically, with O(n^2).
You will need to use your knowledge of your data and of how your analysis is used to beat this complexity. For instance, if your n is relatively small and you need to run your analysis only once, don't bother optimizing and just let it run overnight.
If you want to run your analysis every time a user adds another element, you may want to compare just this new element to all the other n elements, at a complexity of only O(n).
To further optimize, you may want to establish an index, say a dictionary that associates a set of elements with each keyword. Establishing the index will still be time- and memory-intensive, on the order of O(n*m) if every keyword occurs on average in m of the existing elements. Depending on what you want to analyse, comparing a new element with k keywords, you might be able to get a complexity on the order of O(m*k) for adding that new element. If m and k are much smaller than n, that might reduce your waiting time significantly.
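As a rough illustration of such an index (a minimal sketch in Python rather than Swift/Core Data; the objects dictionary of id to keyword list, the build_index/similar_to names and the 50%-of-the-smaller-list threshold are all assumptions for the example, and the same shape translates directly to Swift dictionaries and sets):

from collections import defaultdict

def build_index(objects):
    # Map each keyword to the set of object ids that contain it.
    index = defaultdict(set)
    for obj_id, keywords in objects.items():
        for kw in keywords:
            index[kw].add(obj_id)
    return index

def similar_to(new_id, new_keywords, objects, index, threshold=0.5):
    # Count keyword overlap only against objects that share at least one keyword.
    overlap = defaultdict(int)
    for kw in new_keywords:
        for other_id in index.get(kw, ()):
            if other_id != new_id:
                overlap[other_id] += 1
    # One possible reading of "> 50% matching": shared keywords relative to
    # the smaller of the two keyword lists.
    return [oid for oid, shared in overlap.items()
            if shared / min(len(new_keywords), len(objects[oid])) > threshold]

objects = {1: ["alpha", "beta", "gamma"], 2: ["beta", "gamma", "delta"], 3: ["x", "y"]}
idx = build_index(objects)
print(similar_to(1, objects[1], objects, idx))   # [2]

Building the index is the O(n*m) step; after that, each new element only touches the index buckets for its own keywords.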
This question has nothing to do with Swift or Core Data, but with computational complexity, and time complexity in particular. Please add the latter tag to your question.
Related
I sometimes get confused with the time complexity analysis for the code that includes arrays.
For example:
ans = [0] * n
for x in range(1, n):
    ans[x] = ans[x-1] + 1
I thought the for-loop had a time complexity of O(n^2), because it accesses elements in an array with n elements and repeats that n times.
However, I've seen some explanations saying it takes just O(n); thus, my question is: when we analyze the time complexity of a program that accesses elements in an array (not necessarily the first or the last element), should we include the time to access those array elements, or is it often ignored?
Indexed access is usually a constant-time operation, due to the availability of random access memory in most practical cases. If you were to run this, e.g. in Python, and measure the time it takes for different values of n, you would find that this is the case.
Therefore, your code only performs one loop from 1 to n and all other operations are constant-time, so you get a time complexity of O(n).
Your thinking is otherwise right: if this were a linked list and you had to iterate through it to find each value, then it would be O(n^2).
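For instance, a rough (and deliberately unscientific) timing sketch along the lines suggested above, in Python:

import time

# Rough timing of the loop from the question for a few sizes; the time per
# element stays roughly constant, which is consistent with O(1) indexed access.
for n in (10_000, 100_000, 1_000_000):
    ans = [0] * n
    start = time.perf_counter()
    for x in range(1, n):
        ans[x] = ans[x - 1] + 1
    print(n, (time.perf_counter() - start) / n)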
See also: time complexity, Big-O cheat sheet.
What is the best algorithm for detecting duplicate numbers in an array: the best in speed, memory, and avoiding overhead?
A small array like [5, 9, 13, 3, 2, 5, 6, 7, 1]. Note that 5 is a duplicate.
After searching and reading about sorting algorithms, I realized that I will use one of these algorithms, Quick Sort, Insertion Sort or Merge Sort.
But actually I am really confused about what to use in my case which is a small array.
Thanks in advance.
To be honest, with that size of array, you may as well choose the O(n^2) solution (checking every element against every other element).
You'll generally only need to worry about performance if/when the array gets larger. For small data sets like this, you could well have found the duplicate with an 'inefficient' solution before the sort phase of an efficient solution will have finished :-)
In other words, you can use something like this (Python):

def first_duplicate(nums):
    # check every element against every later element
    for idx1 in range(len(nums) - 1):
        for idx2 in range(idx1 + 1, len(nums)):
            if nums[idx1] == nums[idx2]:
                return nums[idx1]
    return None  # no duplicates found
This finds the first value in the array which has a duplicate.
If you want an exhaustive list of duplicates, then just add the duplicate value to another (initially empty) array (once only per value) and keep going.
You can sort it using any half-decent algorithm though; for a data set of the size you're discussing, even a bubble sort would probably be adequate. Then you just process the sorted items sequentially, looking for runs of values, but that's probably overkill in your case.
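If you do go the sorting route anyway, a minimal sketch of the sort-then-scan idea (the function name and the use of Python's built-in sorted are just for illustration):

def first_duplicate_sorted(nums):
    s = sorted(nums)                 # duplicates end up adjacent after sorting
    for i in range(1, len(s)):
        if s[i] == s[i - 1]:
            return s[i]
    return None                      # no duplicates found

print(first_duplicate_sorted([5, 9, 13, 3, 2, 5, 6, 7, 1]))  # 5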
Two good approaches depend on whether or not you know the range from which the numbers are drawn.
Case 1: the range is known.
Suppose you know that all numbers are in the range [a, b[, thus the length of the range is l=b-a.
You can create an array A whose length is l, fill it with 0s, then iterate over the original array and, for each element e, increment the value of A[e-a] (here we are actually mapping the range to [0, l[).
Once finished, you can iterate over A and find the duplicate numbers. In fact, if there exists i such that A[i] is greater than 1, it implies that i+a is a repeated number.
The same idea is behind counting sort, and it also works fine for your problem.
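A minimal sketch of Case 1 in Python (the function name is illustrative, and the bounds a and b of the known range [a, b[ are passed in as assumptions):

def duplicates_in_known_range(nums, a, b):
    counts = [0] * (b - a)            # the array A of length l = b - a
    for e in nums:
        counts[e - a] += 1            # map e into [0, l[
    # any index i with a count above 1 means i + a is a repeated number
    return [i + a for i, c in enumerate(counts) if c > 1]

print(duplicates_in_known_range([5, 9, 13, 3, 2, 5, 6, 7, 1], 1, 14))  # [5]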
Case 2: the range is not known.
Quite simple. Slightly modify the approach mentioned above: instead of an array, use a map where the keys are the numbers from your original array and the values are the number of times you find them. At the end, iterate over the set of keys and report those that have been found more than once.
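And Case 2 as a sketch, with collections.Counter standing in for the map (again, names are illustrative):

from collections import Counter

def duplicates_unknown_range(nums):
    counts = Counter(nums)            # key: number, value: how many times it appears
    return [n for n, c in counts.items() if c > 1]

print(duplicates_unknown_range([5, 9, 13, 3, 2, 5, 6, 7, 1]))  # [5]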
Note.
In both the cases mentioned above, the complexity should be O(N), and you cannot do better, since you have to visit all the stored values at least once.
Look at the first example: we iterate over two arrays whose lengths are N and l <= N, so the complexity is at most 2*N, that is, O(N).
The second example is indeed a bit more complex and dependent on the implementation of the map, but for the sake of simplicity we can safely assume that it is O(N).
In memory, you are constructing data structures the sizes of which are proportional to the number of different values contained in the original array.
As usually happens, memory occupancy and performance are the keys to your choice: the greater the former, the better the latter, and vice versa. As suggested in another response, if you know that the array is small, you can safely rely on an algorithm whose complexity is O(N^2) but which requires no extra memory at all.
Which is the best choice? Well, it depends on your problem, we cannot say.
This problem is 4-11 of Skiena. The solution for finding a majority element (one repeated more than half the time) is the majority algorithm. Can we use this to find all numbers repeated n/4 times?
Misra and Gries describe a couple of approaches. I don't entirely understand their paper, but a key idea is to use a bag.
Boyer and Moore's original majority algorithm paper has a lot of incomprehensible proofs and discussion of formal verification of FORTRAN code, but it has a very good start of an explanation of how the majority algorithm works. The key concept starts with the idea that if the majority of the elements are A and you remove, one at a time, a copy of A and a copy of something else, then in the end you will have only copies of A. Next, it should be clear that removing two different items, neither of which is A, can only increase the majority that A holds. Therefore it's safe to remove any pair of items, as long as they're different. This idea can then be made concrete. Take the first item out of the list and stick it in a box. Take the next item out and stick it in the box. If they're the same, let them both sit there. If the new one is different, throw it away, along with an item from the box. Repeat until all items are either in the box or in the trash. Since the box is only allowed to have one kind of item at a time, it can be represented very efficiently as a pair (item type, count).
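That (item type, count) box translates almost line for line into code; here is a minimal Python sketch (the final counting pass is an addition to verify the candidate, since without a true majority the surviving candidate is arbitrary):

def majority_element(items):
    candidate, count = None, 0
    for x in items:
        if count == 0:
            candidate, count = x, 1   # box is empty: put x in it
        elif x == candidate:
            count += 1                # same kind: let it sit in the box
        else:
            count -= 1                # different kind: throw both away
    # verify that the candidate really is a majority
    if candidate is not None and items.count(candidate) > len(items) // 2:
        return candidate
    return None

print(majority_element([3, 3, 4, 2, 4, 4, 2, 4, 4]))  # 4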
The generalization to find all items that may occur more than n/k times is simple, but explaining why it works is a little harder. The basic idea is that we can find and destroy groups of k distinct elements without changing anything. Why? If w > n/k then w-1 > (n-k)/k. That is, if we take away one of the popular elements, and we also take away k-1 other elements, then the popular element remains popular!
Implementation: instead of only allowing one kind of item in the box, allow k-1 of them. Whenever you see a group of k different items show up (that is, there are k-1 types in the box, and the one arriving doesn't match any of them), you throw one of each type in the trash, including the one that just arrived. What data structure should we use for this "box"? Well, a bag, of course! As Misra and Gries explain, if the elements can be ordered, a tree-based bag with O(log k) basic operations will give the whole algorithm a complexity of O(n log k). One point to note is that the operation of removing one of each element is a bit expensive (O(k) for a typical implementation), but that cost is amortized over the arrivals of those elements, so it's no big deal. Of course, if your elements are hashable rather than orderable, you can use a hash-based bag instead, which under certain common assumptions will give even better asymptotic performance (but it's not guaranteed). If your elements are drawn from a small finite set, you can guarantee that. If they can only be compared for equality, then your bag gets much more expensive and I'm pretty sure you end up with something like O(nk) instead.
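A hedged sketch of that generalization, with a plain dict playing the role of the hash-based bag of at most k-1 candidates and a second pass verifying them (the function name and example input are just illustrations):

from collections import Counter

def heavy_hitters(items, k):
    # Misra-Gries: keep at most k - 1 candidate counters (the "bag").
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # k distinct items seen: throw one of each kind in the trash
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # Second pass: check which candidates really occur more than n/k times.
    exact = Counter(items)
    return [x for x in counters if exact[x] > len(items) // k]

print(heavy_hitters([3, 1, 2, 2, 1, 2, 3, 3], 4))  # the elements 3 and 2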
Find the majority element that appears n/2 times by Moore-Voting Algorithm
See method 3 of the given link for Moore's Voting Algo (http://www.geeksforgeeks.org/majority-element/).
Time:O(n)
Now, after finding the majority element, scan the array again and remove the majority element or mark it as -1.
Time:O(n)
Now apply Moore's Voting Algorithm to the remaining elements of the array (but ignore -1 now, as it has already been included earlier). The new majority element appears n/4 times.
Time:O(n)
Total Time:O(n)
Extra Space:O(1)
You can do this for elements appearing more than n/8, n/16, ... times.
EDIT:
There may exist a case when there is no majority element in the array:
For example, if the input array is {3, 1, 2, 2, 1, 2, 3, 3}, then the output should be [2, 3].
Given an array of size n and a number k, find all elements that appear more than n/k times.
See this link for the answer:
https://stackoverflow.com/a/24642388/3714537
References:
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/
See this paper for a solution that uses constant memory and runs in linear time, which will find 3 candidates for elements that occur more than n/4 times. Note that if you assume that your data is given as a stream that you can only go through once, this is the best you can do -- you have to go through the stream one more time to test each of the 3 candidates to see if it occurs more than n/4 times in the stream. However, if you assume a priori that there are 3 elements that occur more than n/4 times then you only need to go through the stream once so you get a linear time online algorithm (only goes through the stream once) that only requires constant storage.
As you didn't mention space complexity, one possible solution is to use a hash table that maps elements to counts; then you can just increment the count whenever an element is found again.
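A short sketch of that hash-table approach (collections.Counter is simply the convenient built-in; the function name and the n/k threshold from the question are the only assumptions):

from collections import Counter

def appears_more_than_n_over_k(items, k):
    counts = Counter(items)                       # element -> number of occurrences
    return [x for x, c in counts.items() if c > len(items) / k]

print(appears_more_than_n_over_k([3, 1, 2, 2, 1, 2, 3, 3], 4))  # the elements 3 and 2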
I already read this post but the answer didn't satisfy me: Check if Array is sorted in Log(N).
Imagine I have a seriously big array of over 1,000,000 double numbers (positive and/or negative) and I want to know if the array is "sorted", trying to avoid the maximum number of comparisons, because comparing doubles and floats takes too much time. Is it possible to use statistics on it? And if so:
1. Is it well regarded by real programmers?
2. Should I take samples?
3. How many samples should I take?
4. Should they be random, or in a sequence?
5. What error percentage is permitted to say "the array is sorted"?
Thanks.
That depends on your requirements. If you can say that 100 random samples out of 1,000,000 are enough to assume it's sorted, then so it is. But to be absolutely sure, you will always have to go through every single entry. Only you can answer this question, since only you know how certain you need to be about it being sorted.
This is a classic probability problem taught in high school. Consider this question:
In a batch of 8,000 clocks, 7% are defective. A random sample of 10 (without replacement) from the 8,000 is selected and tested. If at least one is defective, the entire batch will be rejected.
What is the probability that the batch will be rejected?
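For what it's worth, the arithmetic for that textbook example in Python (exact hypergeometric calculation, with the simpler binomial approximation alongside):

from math import comb

total, defective, sample = 8000, 560, 10          # 7% of 8,000 clocks are defective
p_no_defects = comb(total - defective, sample) / comb(total, sample)
print(1 - p_no_defects)       # ~0.516: probability the batch is rejected
print(1 - 0.93 ** sample)     # ~0.516: binomial approximation, nearly identical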
So you can take a number of random samples from your large array and see if it's sorted, but you must note that you need to know the probability that the sample is out of order. Since you don't have that information, a probabilistic approach wouldn't work efficiently here.
(However, you can check 50% of the array and naively conclude that there is a 50% chance that it is sorted correctly.)
If you run a divide-and-conquer algorithm using multiprocessing (real parallelism, so only for multi-core CPUs), you can check whether an array is sorted in O(log N).
If you have GPU multiprocessing you can achieve O(log N) very easily, since modern graphics cards are able to run a few thousand processes in parallel.
Your question 5 is the question that you need to answer to determine the other answers. To ensure the array is perfectly sorted you must go through every element, because any one of them could be the one out of place.
The maximum number of comparisons to decide whether the array is sorted is N-1, because there are N-1 adjacent number pairs to compare. But for simplicity, we'll say N as it does not matter if we look at N or N+1 numbers.
Furthermore, it is unimportant where you start, so let's just start at the beginning.
Comparison #1 (A[0] vs. A[1]). If it fails, the array is unsorted. If it succeeds, good.
As we only compare, we can reduce this to the neighbors and whether the left one is smaller or equal (1) or not (0). So we can treat the array as a sequence of 0's and 1's, indicating whether two adjacent numbers are in order or not.
To calculate the error rate, or the probability, we will have to look at all combinations of our 0/1 sequence.
I would look at it like this: we have 2^n combinations of the 0/1 sequence (i.e. the order of the pairs), of which only one is sorted (all elements are 1, indicating that each A[i] is less than or equal to A[i+1]).
Now this seems to be simple:
Initially the error is 1/2^n. After the first comparison, half of the possible combinations (all unsorted) get eliminated, so the error rate should be 1/2^n + 1/2^(n-1).
I'm not a mathematician, but it should be quite easy to calculate how many elements are needed to reach a given error rate (find x such that ERROR >= 1/2^n + 1/2^(n-1) + ... + 1/2^(n-x)).
Sorry for the confusing English. I come from Germany.
Since every single element can be the one element that is out-of-line, you have to run through all of them, hence your algorithm has runtime O(n).
If your understanding of "sorted" is less strict, you need to specify what exactly you mean by "sorted". Usually, "sorted" means that adjacent elements meet a less-than or less-than-or-equal condition.
Like everyone else says, the only way to be 100% sure that it is sorted is to run through every single element, which is O(N).
However, it seems to me that if you're so worried about it being sorted, then maybe having it sorted to begin with is more important than the array elements being stored in a contiguous portion in memory?
What I'm getting at is, you could use a map whose elements by definition follow a strict weak ordering. In other words, the elements in a map are always sorted. You could also use a set to achieve the same effect.
For example, std::map<int,double> collection; would allow you to almost use it like an array: collection[0]=3.0; std::cout<<collection[0]<<std::endl;. There are differences, of course, but if the sorting is so important then an array is the wrong choice for storing the data.
The old-fashioned way: print it out and see if it's in order. Really, if your sort is wrong you would probably see it soon. It's unlikely that you would see only a few misorders if you were sorting 100+ things. Whenever I deal with it, the whole thing is either completely off or it works.
As an example that you probably should not use, but which demonstrates sampling size: a statistically valid sample size can give you a reasonable estimate of sortedness. If you want to be 95% certain everything is sorted, you can do that by creating a list of truly random points to sample, perhaps ~1,500.
Essentially this is completely pointless if a list that is out of order in one single place will break subsequent algorithms or data requirements.
If this is a problem, preprocess the list before your code runs, or use a really fast sort package in your code. Most sort packages also have a validation mode, where it simply tells you yes, the list meets your sort criteria - or not. Other suggestions like parallelization of your check with threads are great ideas.
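For completeness, a hedged sketch of the random spot-check in Python (the ~1,500 sample count and the adjacent-pair test are taken from the discussion above; a pass only ever means "probably sorted"):

import random

def probably_sorted(a, samples=1500):
    # Check randomly chosen adjacent pairs: one failure proves the array is
    # unsorted, while passing every check only suggests that it is sorted.
    for _ in range(min(samples, max(len(a) - 1, 0))):
        i = random.randrange(len(a) - 1)
        if a[i] > a[i + 1]:
            return False
    return True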
Algorithm for finding the nth smallest/largest element in an array using the self-balancing binary search tree data structure.
Read the post: Find kth smallest element in a binary search tree in Optimum way. But the correct answer is not clear, as I am not able to figure it out for an example that I tried. A bit more explanation would be appreciated.
C.A.R. Hoare's select algorithm is designed for precisely this purpose. It executes in [expected] linear time, with logarithmic extra storage.
Edit: the obvious alternative of sorting, then picking the right element has O(N log N) complexity instead of O(N). Storing the i largest elements in sorted order requires O(i) auxiliary storage, and roughly O(N * i log i) complexity. This can be a win if i is known a priori to be quite small (e.g. 1 or 2). For more general use, select is usually better.
Edit 2: offhand, I don't have a good reference for it, but I described the idea in a previous answer.
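For reference, a minimal quickselect sketch in the spirit of Hoare's select (recursive and not in-place, so it favors clarity over the constant factors and the logarithmic extra storage mentioned above; expected O(N) time, and the names are illustrative):

import random

def quickselect(items, k):
    # Return the k-th smallest element (k is 1-based).
    pivot = random.choice(items)
    lows   = [x for x in items if x < pivot]
    pivots = [x for x in items if x == pivot]
    highs  = [x for x in items if x > pivot]
    if k <= len(lows):
        return quickselect(lows, k)
    if k <= len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))

print(quickselect([5, 9, 13, 3, 2, 6, 7, 1], 3))  # 3rd smallest -> 3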
First sort the array descending, then take the ith element.
Create a sorted data structure to hold i elements and set the initial count to 0.
Process each element in the source array, adding it to that new structure until the new structure is full.
Then process the rest of the source array. For each one that is larger than the smallest in the sorted data structure, remove the smallest from that structure and put the new one in.
Once you've processed all elements in the source array, your structure will hold the i greatest elements. Just grab the last of these and you have your i'th greatest element.
Voila!
Alternatively, sort it then just grab the i'th element directly.
That's a fitting task for heaps, which feature very low insert and delete-min costs, e.g. pairing heaps. It would have worst-case O(n*log(n)) performance. But since they are non-trivial to implement, better to first check the selection algorithms suggested elsewhere.
There are many strategies available for your task (if you don't focus on the self-balancing tree to begin with).
It's usually a tradeoff speed / memory. Most algorithms require either to modify the array in place or O(N) additional storage.
The solution with a self-balancing tree is in the latter category, but it's not the right choice here. The issue is that building the tree itself takes O(N*log N), which will dominate the later search term and give a final complexity of O(N*log N). Therefore you're no better off than simply sorting the array, while using a more complex data structure...
In general, the issue largely depends on the magnitude of i relative to N. If you think about it for a minute, for i == 1 it's trivial, right? It's called finding the maximum.
Well, the same strategy obviously works for i == 2 (carrying the 2 maximum elements around) in linear time. And it's also trivially symmetric: i.e. if you need to find the (N-1)th element, then just carry around the 2 minimum elements.
However, it loses efficiency when i is about N/2 or N/4. Carrying the i maximum elements around then means sorting an array of size i... and thus we fall back onto the N*log N wall.
Jerry Coffin pointed out a simple solution, which works well for this case. Here is the reference on Wikipedia. The full article also describes the Median of Median method: it's more reliable, but involves more work and is thus generally slower.
Create an empty list L
For each element x in the original list:
    add x in sorted position to L
    if L has more than i elements:
        pop the smallest one off L
if L has i elements:
    return the i-th element
else:
    return failure
This should take O(N log(i)). If i is assumed to be a constant, then it is O(N).
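A runnable version of that pseudocode, with a min-heap standing in for the sorted structure so each insert/pop is O(log i) (heapq is Python's built-in heap module; the function name is illustrative and it returns the i-th greatest element, as in the earlier answer):

import heapq

def ith_greatest(items, i):
    heap = []                           # min-heap of the i greatest elements so far
    for x in items:
        if len(heap) < i:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # pop the smallest, push x
    if len(heap) < i:
        return None                     # failure: fewer than i elements
    return heap[0]                      # smallest of the i greatest = i-th greatest

print(ith_greatest([5, 9, 13, 3, 2, 6, 7, 1], 3))  # 7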
Build a heap from the elements and call MIN i times.