Minimal change between two arrays

I'm trying to apply only the minimal number of changes when a table's data is updated (it's an iOS app and the table view is a UITableView, of course, but I don't think that's relevant here). Those changes include adding new items, removing old ones and also moving some existing ones to a different position without updating their content. I know there are similar questions on SO, but most of them only take the adds and removes into account; existing items are either ignored or simply reloaded.
The moves usually involve no more than a few existing elements, and the table can have up to 500 elements.
Items in the arrays are unique.
I can easily get the added items by subtracting the set of items in the old array from the set of items in the new array. The opposite operation yields the set of deleted items.
So the problem comes down to finding the minimal differences between two arrays having the same elements.
[one, two, three, four]
[one, three, four, two]
Diffing those arrays should result in just a move from index 1 to 3.
The algorithm can't assume there is only one such move. The change could just as well be:
[one, two, three, four, five]
[one, four, five, three, two]
Which should result in moving index 1 to 4 and index 2 to 3, not in moving items 3 and 4 two indexes to the left, because the latter could mean moving 300 items when the actual change is much simpler, at least in terms of applying the visual change to the view. Moves may require recalculating cell heights, performing lots of animations and other related operations, and I would like to avoid that. As an example: marking an item as favorite, which moves that item to the top of a list of 300 items, takes about 400 milliseconds. That's because with the algorithm I'm using currently, roughly 100 items are moved one index up, one is moved to index 0, and the other 199 are left untouched. If I unmark it, one item is moved 100 indices down, which is great, but that is the perfect, and very rare, case.
I have tried finding each item's index in the old array and checking whether it changed in the new array. If it had changed, I moved the item from the new index back to the old one, recorded the opposite change, and compared the arrays again until they were equal in terms of element order. But that sometimes results in moving huge chunks of items that actually were not changed, depending on those items' positions.
So the question is: what can I do?
Any ideas or pointers? Maybe a modified Levenshtein distance algorithm? Could the unmodified one even work for this? If so, I'll probably have to implement it in one form or another.
Rubber duck talked:
I'm thinking about finding all unchanged sequences of items and moving around all the other items. Could that be the right direction?

I have an idea, though I don't know if it would work; just my two cents. How about implementing an algorithm similar to longest common subsequence on your array items?
The idea would be to find large "substrings" of data that have kept the initial sequence, the largest ones first. Once you've covered a certain threshold percentage of items in 'long sequences', apply a more trivial algorithm to solve the remaining problems.
Sorry for being rather vague, it's just meant to be a suggestion. Hope you solve your problem.
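
To make that a bit more concrete, here is a rough Swift sketch of the LCS idea (just an illustration, not the algorithm from the question). It assumes both arrays hold the same unique, Hashable elements, uses the textbook O(n*m) dynamic-programming table (fine for a few hundred items), and reports everything outside the LCS as a move; movesViaLCS is a made-up name.

func movesViaLCS<T: Hashable>(from old: [T], to new: [T]) -> [(from: Int, to: Int)] {
    let n = old.count, m = new.count
    // Classic O(n*m) LCS table: lcs[i][j] = length of the LCS of old[i...] and new[j...].
    var lcs = Array(repeating: Array(repeating: 0, count: m + 1), count: n + 1)
    for i in stride(from: n - 1, through: 0, by: -1) {
        for j in stride(from: m - 1, through: 0, by: -1) {
            if old[i] == new[j] {
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            } else {
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
            }
        }
    }
    // Walk the table once to collect the elements that keep their relative order.
    var kept = Set<T>()
    var i = 0, j = 0
    while i < n && j < m {
        if old[i] == new[j] { kept.insert(old[i]); i += 1; j += 1 }
        else if lcs[i + 1][j] >= lcs[i][j + 1] { i += 1 }
        else { j += 1 }
    }
    // Everything outside the LCS is reported as a move from its old index to its new one.
    let oldIndex = Dictionary(uniqueKeysWithValues: old.enumerated().map { ($0.element, $0.offset) })
    return new.enumerated().compactMap { pair -> (from: Int, to: Int)? in
        guard !kept.contains(pair.element), let from = oldIndex[pair.element] else { return nil }
        return (from: from, to: pair.offset)
    }
}

// movesViaLCS(from: ["one", "two", "three", "four", "five"],
//             to:   ["one", "four", "five", "three", "two"])  // moves 2 -> 3 and 1 -> 4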

Related

Algorithm in "random" elimination array

So we have this "homework" which seems to be one of those things found in Roswell, I guess. Looking for someone to help me out with this - literally any insight is priceless.
The code picks six objects; each one of them contains two numbers. It then eliminates one of the objects, and then eliminates another one from the remaining five, leaving only four of them at the end.
I have an array containing thirty-nine rounds, each with six objects and the result: which one got deleted first and which one got deleted second. The 40th row contains only the objects' values, and we need to find a pattern and predict which two objects will be deleted in the 40th round.
Here's the link to the array in PDF
and in Excel format
Is that even possible? Any idea on how to get started is ultra valuable.
Thank you for your time.

Returning multiple adjacent cell results from a min array which may include multiple duplicate values

I'm trying to set up a formula that will return the contents of a related cell (my related cell is on another sheet) from the smallest 2 results in an array. This is what I'm using right now.
=INDEX('Sheet1'!$A$40:'Sheet1'!$A$167,MATCH(SMALL(F1:F128,1),F1:F128,0),1)
And
=INDEX('Sheet1'!$A$40:'Sheet1'!$A$167,MATCH(SMALL(F1:F128,2),F1:F128,0),1)
The problem I've run into is twofold.
First, if there are multiple lowest results I get whichever one appears first in the array for both entries.
Second, if the second lowest result is duplicated but the first is not I get whichever one shows up on the list first, but any subsequent duplicates are ignored. I would like to be able to display the names associated with the duplicated scores.
You will have to adjust the k parameter of the SMALL function according to the number of duplicates. The COUNTIF function should be sufficient for this. Once all occurrences of the top two scores are retrieved, standard 'lookup multiple values' formulas can be applied. Retrieving successive row positions with the AGGREGATE¹ function and passing those into an INDEX of the names works well.
    
The formulas in H2:I2 are,
=IF(SMALL(F$40:F$167, ROW(1:1))<=SMALL(F$40:F$167, 1+COUNTIF(F$40:F$167, MIN(F$40:F$167))), SMALL(F$40:F$167, ROW(1:1)), "") '◄ H2
=IF(LEN(H40), INDEX(A$40:A$167, AGGREGATE(15, 6, ROW($1:$128)/(F$40:F$167=H40), COUNTIF(H$40:H40, H40))), "") '◄ I2
Fill down as necessary. The scores are designed to terminate after the last second-place entry, so it would be a good idea to fill down several rows more than is immediately necessary to allow for future duplicates.
¹ The AGGREGATE function was introduced with Excel 2010². It is not available in earlier versions.
² Related article for pre-xl2010 functions - see Multiple Ranked Returns from INDEX().
The following formula will do what I think you want:
=IF(OR(ROW(1:1)=1,COUNTIF($E$1:$E1,INDEX(Sheet1!$A$40:$A$167,MATCH(SMALL($F$1:$F$128,ROW(1:1)),$F$1:$F$128,0)))>0,ROW(1:1)=2),INDEX(Sheet1!$A$40:$A$167,MATCH(1,INDEX(($F$1:$F$128=SMALL($F$1:$F$128,ROW(1:1)))*(COUNTIF($E$1:$E1,Sheet1!$A$40:$A$167)=0),),0)),"")
NOTE:
This is an array formula and must be confirmed with Ctrl-Shift-Enter.
There are two references to $E$1:$E1. This formula assumes that it will be entered in E2 and copied down. If it is going in a different column, change these two references. It must go in the second row or it will throw a circular reference.
What it will do
If there is a tie for first place it will only list those teams that are tied for first.
If there is only one first place but multiple tied for second places it will list all those in second.
So make sure you copy the formula down far enough to cover all possible ties. It will put "" in any that do not fill, so err on the high side.
To get the scores, use this simple formula; I put mine in column F:
=IF(E2<>"",SMALL($F$1:$F$128,ROW(1:1)),"")
Again change the E reference to the column you use for the output.
I did a small test:

Array vs Dictionary - fight! What's the fastest way to search through a large dataset?

Calling all computer scientists - I need your expert advice :)
Here's my problem:
I have a mapping application, and I've divided the world into 10 million possible squares of fixed size (latitude/longitude, i.e. double/double data type). Let's call that data set D1.
A second set of data, call it D2, is around 20,000 squares of the same size (latitude/longitude, or double/double data type), and represents locations of interest in my app.
When the user zooms in far enough, I want to display all the squares of interest that are in the present view, but not the ones outside the view, because that's way too many for the app to handle (generating overlays, etc.) without getting completely bogged down.
So rather than submitting 20,000 overlay squares for rendering and letting the Mapkit framework manage what gets shown (it completely chokes on that much data), here are a few things I've tried to optimize performance:
1) Put D2 in an array. Iterate through every possible visible square on my view, and for each possible square do a lookup in D2 (using Swift's find() function) to see if the corresponding element exists in that array. If it exists, display it. This is really slow -> if my view has an area of 4000 squares viewable, I have to check 4000 squares * 20000 points in the array = up to 80 million lookups = SLOW..
2) Put D2 in an array. Iterate through D2 and for each element in D2, check if that element is within the bounds of my view. If it is, display it. This is better than #1 (only takes 10% of the time of #1) but still on the slow side
3) Put D2 in an array. Iterate through D2 and create a new array D3 which filters out (using Swift's array.filter() method with a closure) all datapoints outside the view, then submit just those points for rendering. This is fastest (about 2% of original time of #1) but still too slow (depending on the data pattern, can still take several seconds of processing on an iphone4).
I have been thinking about dictionaries or other data structures. Would I expect a dictionary populated with key = (latitude, longitude) and value = (true/false) to be faster to look up than an array? I'm thinking that, for a map view with bounds y2, y1, x2, x1, I could do a simple for loop to find all the dictionary entries in those bounds with value = true (or even no value; all I'd really need is something like dictionarydata.exists(x,y), unless a value is absolutely required to build a dictionary). This would be much faster, but again it depends on how fast a dictionary is compared to an array.
Long story short: is searching through a large dictionary for a key a lot faster than searching through an array? I contemplated sorting my array and building a binary search as a test, but figured dictionaries might be more efficient. Since my array D2 will be built dynamically over time, I'd rather spend much more time/resources per addition (additions happen one at a time) in order to absolutely maximize lookup performance later (lookups have orders of magnitude more data to sift through).
Appreciate any/all advice - are dictionaries the way to go? Any other suggestions?
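
For what it's worth, here is a minimal Swift sketch of the dictionary idea from the question, assuming the squares snap to a fixed grid so an integer coordinate pair can act as the key; a Set is enough, since no value is really needed. GridKey, squaresOfInterest and visibleSquares are made-up names for illustration.

struct GridKey: Hashable {
    let x: Int   // longitude bucket index
    let y: Int   // latitude bucket index
}

// Built once and updated incrementally as squares of interest are added; lookups are O(1) on average.
var squaresOfInterest = Set<GridKey>()

// Probe only the visible buckets (about 4,000 of them), so the cost no longer
// depends on the 20,000 entries in D2.
func visibleSquares(x1: Int, x2: Int, y1: Int, y2: Int) -> [GridKey] {
    var result: [GridKey] = []
    for x in x1...x2 {
        for y in y1...y2 {
            let key = GridKey(x: x, y: y)
            if squaresOfInterest.contains(key) {
                result.append(key)
            }
        }
    }
    return result
}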
Riffing off of the examples in the comments here: create a two-dimensional array sorted on both the X and Y axes. Then do a binary search to find the NW-corner and SE-corner elements. All the elements in the box formed by those two corners are the ones that need to be displayed.
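
A rough Swift sketch of a simplified variant of that answer: keep D2 sorted by latitude only, binary-search the start of the latitude band, then filter that much smaller slice by longitude. Square, lowerBound and squaresInView are made-up names; sorting on both axes as suggested above would narrow the candidates even further.

struct Square {
    let latitude: Double
    let longitude: Double
}

// Index of the first element with latitude >= target, in an array sorted by latitude ascending.
func lowerBound(_ squares: [Square], _ target: Double) -> Int {
    var lo = 0, hi = squares.count
    while lo < hi {
        let mid = (lo + hi) / 2
        if squares[mid].latitude < target { lo = mid + 1 } else { hi = mid }
    }
    return lo
}

func squaresInView(_ sortedByLatitude: [Square],
                   minLat: Double, maxLat: Double,
                   minLon: Double, maxLon: Double) -> [Square] {
    var index = lowerBound(sortedByLatitude, minLat)
    var inView: [Square] = []
    // Walk the latitude band only; everything outside [minLat, maxLat] is never touched.
    while index < sortedByLatitude.count && sortedByLatitude[index].latitude <= maxLat {
        let square = sortedByLatitude[index]
        if square.longitude >= minLon && square.longitude <= maxLon {
            inView.append(square)
        }
        index += 1
    }
    return inView
}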

Find all elements that appear more than n/4 times in linear time

This problem is 4-11 of Skiena. The standard solution for finding a majority element (one repeated more than n/2 times) is the majority-vote algorithm. Can we use it to find all numbers repeated more than n/4 times?
Misra and Gries describe a couple of approaches. I don't entirely understand their paper, but a key idea is to use a bag.
Boyer and Moore's original majority algorithm paper has a lot of incomprehensible proofs and discussion of formal verification of FORTRAN code, but it has a very good start of an explanation of how the majority algorithm works. The key concept starts with the idea that if the majority of the elements are A and you remove, one at a time, a copy of A and a copy of something else, then in the end you will have only copies of A. Next, it should be clear that removing two different items, neither of which is A, can only increase the majority that A holds. Therefore it's safe to remove any pair of items, as long as they're different. This idea can then be made concrete. Take the first item out of the list and stick it in a box. Take the next item out and stick it in the box. If they're the same, let them both sit there. If the new one is different, throw it away, along with an item from the box. Repeat until all items are either in the box or in the trash. Since the box is only allowed to have one kind of item at a time, it can be represented very efficiently as a pair (item type, count).
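
A minimal Swift sketch of that box, assuming Equatable elements; note the result is only a candidate and still has to be verified with a second pass, since a true majority may not exist.

func majorityCandidate<T: Equatable>(_ items: [T]) -> T? {
    var candidate: T? = nil   // the kind of item currently in the box
    var count = 0             // how many copies of it the box holds
    for item in items {
        if count == 0 {
            candidate = item  // box was empty: this item sets the type
            count = 1
        } else if item == candidate {
            count += 1        // same type: let it sit in the box
        } else {
            count -= 1        // different type: throw it away along with one item from the box
        }
    }
    return candidate          // only a candidate; verify with a second counting pass
}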
The generalization to find all items that may occur more than n/k times is simple, but explaining why it works is a little harder. The basic idea is that we can find and destroy groups of k distinct elements without changing anything. Why? If w > n/k then w-1 > (n-k)/k. That is, if we take away one of the popular elements, and we also take away k-1 other elements, then the popular element remains popular!
Implementation: instead of only allowing one kind of item in the box, allow k-1 of them. Whenever you see a group of k different items show up (that is, there are k-1 types in the box, and the one arriving doesn't match any of them), you throw one of each type in the trash, including the one that just arrived. What data structure should we use for this "box"? Well, a bag, of course! As Misra and Gries explain, if the elements can be ordered, a tree-based bag with O(log k) basic operations will give the whole algorithm a complexity of O(n log k). One point to note is that the operation of removing one of each element is a bit expensive (O(k) for a typical implementation), but that cost is amortized over the arrivals of those elements, so it's no big deal. Of course, if your elements are hashable rather than orderable, you can use a hash-based bag instead, which under certain common assumptions will give even better asymptotic performance (but it's not guaranteed). If your elements are drawn from a small finite set, you can guarantee that. If they can only be compared for equality, then your bag gets much more expensive and I'm pretty sure you end up with something like O(nk) instead.
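
Here is a rough Swift sketch of that generalization, using a hash-based bag (a Dictionary) so the elements only need to be Hashable; frequentCandidates is a made-up name, and a verification pass is included at the end because the first pass only narrows things down to at most k-1 candidates.

func frequentCandidates<T: Hashable>(_ items: [T], k: Int) -> [T] {
    var box: [T: Int] = [:]                 // the bag: at most k-1 kinds of items at a time
    for item in items {
        if let current = box[item] {
            box[item] = current + 1         // already in the box: add another copy
        } else if box.count < k - 1 {
            box[item] = 1                   // room for a new kind of item
        } else {
            // k distinct items showed up: throw one of each in the trash, including the new one.
            for key in Array(box.keys) {
                if box[key] == 1 { box.removeValue(forKey: key) } else { box[key]! -= 1 }
            }
        }
    }
    // Second pass: keep only the candidates that really occur more than n/k times.
    var counts: [T: Int] = [:]
    for item in items where box[item] != nil {
        counts[item, default: 0] += 1
    }
    return counts.filter { $0.value > items.count / k }.map { $0.key }
}

// frequentCandidates([3, 1, 2, 2, 1, 2, 3, 3], k: 4)   // [2, 3] in some order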
Find the majority element (one that appears more than n/2 times) with Moore's voting algorithm.
See method 3 of the given link for Moore's voting algorithm (http://www.geeksforgeeks.org/majority-element/).
Time: O(n)
Now, after finding the majority element, scan the array again and remove it (or mark it as -1).
Time: O(n)
Now apply Moore's voting algorithm to the remaining elements of the array (ignoring the -1 entries, as that element has already been accounted for). The winner of this second vote is the candidate for an element appearing more than n/4 times; verify it with one more scan.
Time: O(n)
Total time: O(n)
Extra space: O(1)
You can repeat this for elements appearing more than n/8, n/16, ... times.
EDIT:
There may be a case where there is no majority element in the array.
For example, if the input array is {3, 1, 2, 2, 1, 2, 3, 3}, then the output should be [2, 3].
Given an array of size n and a number k, find all elements that appear more than n/k times
See this link for the answer:
https://stackoverflow.com/a/24642388/3714537
References:
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/
See this paper for a solution that uses constant memory and runs in linear time, which will find 3 candidates for elements that occur more than n/4 times. Note that if you assume that your data is given as a stream that you can only go through once, this is the best you can do -- you have to go through the stream one more time to test each of the 3 candidates to see if it occurs more than n/4 times in the stream. However, if you assume a priori that there are 3 elements that occur more than n/4 times then you only need to go through the stream once so you get a linear time online algorithm (only goes through the stream once) that only requires constant storage.
As you didn't mention space complexity, one possible solution is to use a hash table that maps each element to its count; then you just increment the count each time the element is found.
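A quick Swift sketch of that counting approach, assuming Hashable elements: one pass to build the counts, one pass over the table to keep anything above the n/4 threshold.

func elementsAboveQuarter<T: Hashable>(_ items: [T]) -> [T] {
    var counts: [T: Int] = [:]
    for item in items {
        counts[item, default: 0] += 1     // one pass: element -> number of occurrences
    }
    return counts.filter { $0.value > items.count / 4 }.map { $0.key }
}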

Sorting n sets of data into one

I have n arrays of data, each of these arrays is sorted by the same criteria.
The number of arrays will, in almost all cases, not exceed 10, so it is a relatively small number. Each array, however, can contain a large number of objects, which should be treated as unbounded for the algorithm I am looking for.
I now want to treat these arrays as if they were one array. However, I need a way to retrieve objects in a given range as fast as possible, without touching all the objects before the range and/or all the objects after the range. Therefore it is not an option to iterate over all objects and store them in one single array. Fetches with low start values are also more likely than fetches with a high start value. So, e.g., fetching objects [20, 40) is much more likely than fetching objects [1000, 1020), but the latter could still happen.
The range itself will be pretty small, around 20 objects, or can be increased, if relevant for the performance, as long as this does not hit the limits of memory. So I would guess a couple of hundred objects would be fine as well.
Example:
3 arrays, each containing a couple of thousand entries. I now want to get the overall objects in the range [60, 80) without touching the first 60 objects in each array or all the objects that come after object 80.
I am thinking about some sort of combined, modified binary search. My current idea is something like the following (note that this is not fully thought through yet; it is just an idea):
get object 60 of each array - the beginning of the range cannot be after that, as every single array on its own would already meet the requirement
use these objects as the maximum value for the binary search in every array
from one of the arrays, get the centered object (e.g. 30)
with a binary search in all the other arrays, try to find the object in each array that would be before, but as close as possible to, the picked object.
we now have 3 objects, e.g. object 15, 10 and 20. The sum of these objects would be 45. So there are 42 objects in front, which is more than the beginning of the range we are looking for (30). We continue our binary search in the remaining left half of one of the arrays
if we instead get a value where the sum is smaller than the beginning of the range we are looking for, we continue our search on the right.
at some point we will hit object 30. From there on, we can simply add the objects from each array, one by one, with an insertion sort until we hit the range length.
My questions are:
Is there any name for this kind of algorithm I described here?
Are there other algorithms or ideas for this problem, that might be better suited for this issue?
Thanks in advance for any idea or help!
People usually call this problem something like "selection in the union of multiple sorted arrays". One of the questions in the sidebar is about the special case of two sorted arrays, and this question is about the general case. Several comparison-based approaches appear in the combined answers; they more or less have to determine where the lower endpoint in each individual array is. Your binary search answer is one of the better approaches; there's an asymptotically faster algorithm due to Frederickson and Johnson, but it's complicated and not obviously an improvement for small ranks.
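
For the sizes described here (at most around 10 arrays, ranges of a few dozen objects, and usually small start ranks), a plain k-way merge that only advances to the end of the requested range may already be fast enough. This is not the selection algorithm mentioned above, just a simple Swift sketch for comparison; it still inspects the heads of the arrays for the first start ranks, so it only pays off when the start of the range is modest, which the question says is the common case.

func slice<T: Comparable>(of arrays: [[T]], start: Int, count: Int) -> [T] {
    var heads = Array(repeating: 0, count: arrays.count)   // next unread index in each array
    var result: [T] = []
    var rank = 0
    while rank < start + count {
        // Pick the array whose next element is the smallest.
        var best = -1
        for i in arrays.indices where heads[i] < arrays[i].count {
            if best == -1 || arrays[i][heads[i]] < arrays[best][heads[best]] {
                best = i
            }
        }
        if best == -1 { break }                             // all arrays exhausted
        if rank >= start {
            result.append(arrays[best][heads[best]])        // inside the requested range
        }
        heads[best] += 1
        rank += 1
    }
    return result
}

// slice(of: [[1, 4, 7], [2, 5, 8], [3, 6, 9]], start: 2, count: 4)   // [3, 4, 5, 6]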
