Comparing Two Datasets - Slow - arrays

I have a set of data in the following format, although very simplified:
DealerName, AccountCode, Value
Dealer1, A-1, 5
Dealer2, A-1, 10
Dealer1, A-2, 20
Dealer2, A-2, 15
Dealer3, A-3, 5
I am trying to achieve an end result that gives me the data summed by AccountCode, so the following in the case of the above data:
AccountCode, Value
A-1, 15
A-2, 35
A-3, 5
I have done this by creating an array of distinct account codes named OutputData, then going through the data, comparing the account code to the same field in SelectedDealerData and adding the value to the existing total:
For i = 0 To UBound(SelectedDealerData)
    For j = 0 To UBound(OutputData)
        If SelectedDealerData(i).AccountNumber = OutputData(j).AccountNumber And SelectedDealerData(i).Year = OutputData(j).Year Then
            OutputData(j).Units = OutputData(j).Units + SelectedDealerData(i).Units
            Exit For
        End If
    Next j
Next i
There are around 10,000 dealers and 600-1000 account codes for each, so this means a lot of unnecessary looping.
Can someone point me in the direction of a more efficient solution? I am thinking some kind of Dictionary-based compare might work, but I am unsure how to implement it.

Add a reference to Microsoft Scripting Runtime for a Dictionary:
Dim aggregated As Dictionary
Dim i As Long
Dim Key As Variant

Set aggregated = New Dictionary
For i = 0 To UBound(SelectedDealerData)
    With SelectedDealerData(i)
        If aggregated.Exists(.AccountCode) Then
            aggregated(.AccountCode) = aggregated(.AccountCode) + .Value
        Else
            aggregated(.AccountCode) = .Value
        End If
    End With
Next

For Each Key In aggregated.Keys
    Debug.Print Key, aggregated(Key)
Next

The code is slow because there are about 10 million comparisons and assignment operations going on here (10,000 x 1,000).
Also, looping through these collections element by element is not very efficient, but nothing can be done about that, since the design is already set and maintained the way it is.
There are two ways you can make this more efficient (time your code right now so you can see the percentage savings after these steps).
First, there are two condition checks joined with an And. VBA evaluates both even if the first one is false (there is no short-circuiting), so use nested If statements instead: if the first condition fails, you never evaluate the second one. Also, put the condition that is more likely to fail in the outer If statement, so it fails fast and moves on to the next element. At best you get a minor speed bump here; at worst you are no worse off.
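For example, here is a sketch of your existing inner loop restructured with nested Ifs, assuming the account-number match is the test more likely to fail:
For i = 0 To UBound(SelectedDealerData)
    For j = 0 To UBound(OutputData)
        'Only evaluate the year comparison when the account number already matches
        If SelectedDealerData(i).AccountNumber = OutputData(j).AccountNumber Then
            If SelectedDealerData(i).Year = OutputData(j).Year Then
                OutputData(j).Units = OutputData(j).Units + SelectedDealerData(i).Units
                Exit For
            End If
        End If
    Next j
Next i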
Second, there are too many comparisons happening. It is too late to change the overall design, but you can avoid most of those comparisons if you can sort your collections, or build an index which maintains their sort order (save that index array on your spreadsheet if you like), and then search them along the lines of the pseudocode below. The sorting should be done on a composite field called Account_Number_Year (just concatenate the two values).
You can also use this concatenated field in the dictionary structure suggested by Alex K., so you can look up the joint field in that dictionary and then do whatever operations you need.
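As a rough sketch of that combination (variable names and the "|" delimiter are illustrative; pick any delimiter that cannot appear in an account number):
Dim aggregated As Dictionary
Dim i As Long
Dim compositeKey As String

Set aggregated = New Dictionary
For i = 0 To UBound(SelectedDealerData)
    'Composite key: account number and year concatenated with a delimiter
    compositeKey = SelectedDealerData(i).AccountNumber & "|" & SelectedDealerData(i).Year
    If aggregated.Exists(compositeKey) Then
        aggregated(compositeKey) = aggregated(compositeKey) + SelectedDealerData(i).Units
    Else
        aggregated(compositeKey) = SelectedDealerData(i).Units
    End If
Next i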
Pseudocode to start from if you implement it fully in VBA:
'Assuming both arrays are sorted on the concatenated AccountNumberYear field
For i = 0 To UBound(SelectedDealerData)
    MatchingIndex = _
        BinarySearchForAccNumberYear(SelectedDealerData(i).AccountNumberYear)
    'update OutputData(MatchingIndex) here
Next i
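A minimal sketch of the search routine itself, assuming OutputData is a module-level array sorted ascending on a string AccountNumberYear field (the names mirror the pseudocode above and are otherwise illustrative):
Function BinarySearchForAccNumberYear(searchKey As String) As Long
    Dim low As Long, high As Long, middle As Long
    low = 0
    high = UBound(OutputData)
    Do While low <= high
        middle = (low + high) \ 2
        If OutputData(middle).AccountNumberYear = searchKey Then
            BinarySearchForAccNumberYear = middle 'found the matching element
            Exit Function
        ElseIf OutputData(middle).AccountNumberYear < searchKey Then
            low = middle + 1 'search the upper half
        Else
            high = middle - 1 'search the lower half
        End If
    Loop
    BinarySearchForAccNumberYear = -1 'no match found
End Function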
You can read up on binary search if you are not familiar with it.
This will reduce your time complexity from O(n^2) to O(n log n) and your code will run an order of magnitude faster.

Related

What is an efficient way to split an array into a training and testing set in Julia?

I am running a machine learning algorithm in Julia with limited spare memory on my machine, and I have noticed a rather large bottleneck in the code I am using from the repository. Splitting the array (randomly) takes even longer than reading the file from disk, which seems to highlight the code's inefficiency. Any tricks to speed up this function would be greatly appreciated. Since it's a short function, I'll post it below.
# Split a list of ratings into a training and test set, with at most
# target_percentage * length(ratings) in the test set. The property we want to
# preserve is: any user in some rating in the original set of ratings is also
# in the training set and any item in some rating in the original set of ratings
# is also in the training set. We preserve this property by iterating through
# the ratings in random order, adding an item to the test set only if we
# haven't already hit target_percentage and we've already seen both the user
# and the item in some other ratings.
function split_ratings(ratings::Array{Rating,1},
                       target_percentage=0.10)
    seen_users = Set()
    seen_items = Set()
    training_set = (Rating)[]
    test_set = (Rating)[]
    shuffled = shuffle(ratings)
    for rating in shuffled
        if in(rating.user, seen_users) && in(rating.item, seen_items) && length(test_set) < target_percentage * length(shuffled)
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end
As previously stated, any way I can push this data through faster would be greatly appreciated. I will also note that I do not really need to retain the ability to remove duplicates, but it would be a nice feature. Also, if this is already implemented in a Julia library I would be grateful to know about it. Bonus points for any solutions that leverage Julia's parallelism abilities!
This is the most efficient code I could come up with in terms of memory.
function splitratings(ratings::Array{Rating,1}, target_percentage=0.10)
    N = length(ratings)
    splitindex = round(Integer, target_percentage * N)
    shuffle!(ratings) # This shuffles in place, which avoids the allocation of another array!
    return sub(ratings, splitindex+1:N), sub(ratings, 1:splitindex) # This makes subarrays instead of copying the original array!
end
However, Julia's incredibly slow file IO is now the bottleneck. This algorithm takes about 20 seconds to run on an array of 170 million elements, so I'd say it's rather performant.

Optimization of array function that calculates products

I have the following array formula that calculates the returns on a particular stock in a particular year:
=IF(AND(NOT(E2=E3),H2=H3),PRODUCT(IF($E$2:E2=E1,$O$2:O2,""))-1,"")
But since I have 500,000 row entries, as soon as I hit row 50,000 I get an error from Excel stating that my machine does not have enough resources to compute the values.
How shall I optimize the function so that it actually works?
Column E refers to a counter that checks the years and ticker values of the stocks. If the year is different from the previous value, the function outputs 1. It also outputs 1 when the name of the stock has changed: for example, you may have values for year 1993 and the next value is 1993 too, but the name of the stock is different, so clearly the return should be calculated anew, and I use 1 as an indication of that.
Then I have another column that keeps a cumulative count of those 1s. When a new 1 is encountered in that previous column, I add 1 to the running total and keep printing the same number until I observe a new one. This is what makes the array function possible: if the column that contains the running-total values (column E) has a next value that is different from the previous one, I use my twist on SUMIF, but with PRODUCT and IF. This returns the product of all the column O values corresponding to that running-total group.
The source of the inefficiency, I believe, is that the number of cells that must be examined in order to evaluate each successive array formula grows steadily with the row number. In row 50,000, for example, your formula must examine cells in all the rows above it.
I'm a big fan of array formulas, so it pains me to say this, but I wouldn't do it this way. Instead, use additional columns to compute, in each row, the pieces of your formula that are needed to return the desired result. By taking that approach, you're exploiting Excel's very efficient recalculation engine to compute only what's needed.
As for the final product, compute that from a cumulative running product in an auxiliary column, one that resets to the value in column O whenever column P in the row above contains a number. This approach is much more "local" and avoids formulas that depend on large numbers of cells.
I realize that text is not the best language for describing this, and my poor writing skills might be adding to the challenge, so please let me know if more detail is needed.
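For concreteness, here is one possible layout of that idea, assuming the data starts in row 2, the results stay in column P, and column Q is free for the auxiliary running product (cell references are illustrative only):
Q2: =O2
Q3: =IF(ISNUMBER(P2),O3,Q2*O3)   (fill down; restarts the product after a row that produced a result)
P2: =IF(AND(NOT(E2=E3),H2=H3),Q2-1,"")   (fill down)
Alternatively, the running product could reset whenever the group counter in column E changes, e.g. Q3: =IF(E3=E2,Q2*O3,O3).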
Interesting problem, thanks.
Could I suggest a really quick and [very] dirty VBA approach? Something like the below. Obviously, have a backup of your file before running this. It assumes you want to start calculating from row 13.
Sub calculateP()
    'start on row 13, column P:
    Cells(13, 16).Select
    'loop through every row as long as column A is populated:
    Do
        If ActiveCell(1, -14).Value = "" Then Exit Do 'column A not populated so exit loop
        'enter formula:
        Selection.FormulaR1C1 = _
            "=IF(AND(NOT(RC[-11]=R[1]C[-11]),RC[-8]=R[1]C[-8]),PRODUCT(IF(R[-11]C5:RC[-11]=R[-1]C[-11],R2C15:RC[-1],""""))-1,"""")"
        'convert cell value to value only (remove formula):
        ActiveCell.Value = ActiveCell.Value
        'select next row:
        ActiveCell(2, 1).Select
    Loop
End Sub
Sorry, this is definitely not a great answer for you... in fact, even this method could be achieved more elegantly using Range... but the quick and dirty approach may help you in the interim.

Finding arrays that contain a subset of another array without using #> with postgreSQL

I have a table with 1.5 MM records. Each record has a row number and an array with between 1 and 1,000 elements in the array. I am trying to find all of the arrays that are a subset of the larger arrays.
When I use the code below, I get ERROR: statement requires more resources than resource queue allows (possibly because there are over a trillion possible combinations):
select
a.array as dup
from
table a
left join
table b
on
b.array #> a.array
and a.row_number <> b.row_number
Is there a more efficient way to identify which arrays are subsets of the other arrays and mark them for removal other than using #>?
Your example code suggests that you are only interested in finding arrays that are subsets of any other array in another row of the table.
However, your query with a JOIN returns all combinations, possibly multiplying results.
Try an EXISTS semi-join instead, returning qualifying rows only once:
SELECT a.array as dup
FROM   table a
WHERE  EXISTS (
    SELECT 1
    FROM   table b
    WHERE  a.array <# b.array
    AND    a.row_number <> b.row_number
);
With this form, Postgres can stop iterating rows as soon as the first match is found. If this won't go through either, try partitioning your query. Add a clause like
AND table_id BETWEEN 0 AND 10000
and iterate through the table. Should be valid for this case.
Aside: it's a pity that your derivative (Greenplum) doesn't seem to support GIN indexes, which would make this operation much faster. (The index itself would be big, though.)
Well, I don't see how to do this efficiently in a single declarative SQL statement without appropriate support from an index. I don't know how well this would work with a GIN index, but using a GIN index would certainly avoid the need to compare every possible pair of rows.
The first thing I would do is to carefully investigate the types of indexes you do have at your disposal, and try creating one as necessary.
If that doesn't work, the first thing that comes to my mind, procedurally speaking, would be to sort all the arrays, then sort the rows into a graded lexicographic order on the arrays. Then start with the shortest arrays and work upwards as follows: e.g. for [1,4,9], check all the arrays with length <= 3 that start with 1 to see whether they are subsets, then check all arrays with length <= 2 that start with 4, and then check all the arrays of length <= 1 that start with 9, removing any found subsets from consideration as you go so that you don't keep re-checking the same rows over and over again.
I'm sure you could tune this algorithm a bit, especially depending on the particular nature of the data involved. I wouldn't be surprised if there was a much better algorithm; this is just the first thing that I thought of. You may be able to work from this algorithm backwards to the SQL you want, or you may have to dump the table for client-side processing, or some hybrid thereof.

Loop that creates doubles and insert them into sorted array, in C

Suppose I have a set of sorted doubles.
{ 0.124, 4.567, 12.3 }
A positive, non-zero double is created by another part of the code, and needs to be inserted into this set while keeping it sorted. For example, if the created double is 7.56, the final result is,
{ 0.124, 4.567, 7.56, 12.3 }
In my code, this "create double and insert in sorted set" process is then repeated a great number of times. Possibly 500k to 1 million times. I don't know how many doubles will be created in total exactly, but I know the upper bound.
Attempt
My naive first approach was to create an array with length = upper bound and fill it with zeros, then add the initial set of doubles to it ("add" = replace a 0-valued entry with the double). Whenever a double is created, I add it to the array and do an insertion sort, which I read is good for sorting nearly ordered arrays.
Question
I have a feeling that running 500k to 1 million insertion sorts will be a serious performance issue (or am I wrong?). Is there a more efficient data structure and/or algorithm for doing this in C?
Edit:
The reason why I want to keep the set sorted is because after every "create double and insert in sorted set" process, I need to be able to look up the smallest element in that set (and possibly remove it by replacing it with a 0). I thought the best way to do this would be to keep the set sorted.
But if that is not the case, perhaps there is an alternative?
Since all you want to do is pull out the minimum element in every iteration, use a min-heap instead. You can implement them to have O(1) amortized insertion, O(1) find-min, and O(1) decrease-key operations (though note that removing the minimum element always takes O(log n) time). For what you are doing, a heap will be substantially faster.
Rather than running an insertion sort, you could use binary search to find the insertion point, and then insert the value there. But this is slow, because you may need to shift a lot of data many times (think what happens if the random data comes in sorted in reverse of what you need; the timing would be O(N^2)).
The fastest approach is to insert first, and then sort everything at once. If this is not possible, consider replacing your array with a self-balancing ordered tree structure, such as an RB-Tree.

CS theory problem: evaluating each element in an array only once and choosing the largest value

We covered this problem in a theory class back in college.
The setup is this:
You're presented with an array of N values. You know the length of the array, but not the range of values. You are presented the elements one at a time. Each value can be examined only once, and if the value is not chosen when presented it is discarded. The goal is to choose the maximum value.
There is an algorithm that gives a better than 1/N chance of choosing the maximum, but I can't for the life of me recall what it is.
It's called the secretary problem.
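The classic strategy is to let roughly the first N/e values pass while remembering the best of them, then take the first later value that beats that benchmark; it picks the true maximum with probability about 1/e (roughly 37%), far better than 1/N. A rough VB-style sketch, with illustrative names (values() stands in for the stream of presented values):
Function SecretaryPick(values() As Double) As Double
    Dim n As Long, cutoff As Long, i As Long
    Dim bestSeen As Double

    n = UBound(values) - LBound(values) + 1
    cutoff = Int(n / Exp(1)) 'observe roughly the first n/e values without choosing any
    If cutoff < 1 Then cutoff = 1

    bestSeen = values(LBound(values))
    For i = LBound(values) + 1 To LBound(values) + cutoff - 1
        If values(i) > bestSeen Then bestSeen = values(i)
    Next i

    'take the first later value that beats everything observed so far
    For i = LBound(values) + cutoff To UBound(values)
        If values(i) > bestSeen Then
            SecretaryPick = values(i)
            Exit Function
        End If
    Next i

    SecretaryPick = values(UBound(values)) 'forced to accept the last value
End Function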
The simplest way I know of is to iterate through the array. You create a variable and store the first array value in it; then, as you iterate through each value in the array, you compare the stored value to the current value, and if the current value is larger than the stored value you put the current value into the stored variable, replacing the lower value.
The syntax will change based on what programming language you use but a basic representation would be the following:
Dim MaxValue As Integer
Dim ArrayOfN As Variant
Dim N As Integer
Dim i As Integer

ArrayOfN = Array(0, 10, 87, 6, 59, 1200, 5) 'In this case there are 7 values in the array
N = UBound(ArrayOfN) + 1 'Get the size of the array programmatically
MaxValue = ArrayOfN(0)
For i = 0 To N - 1 'Assumes iteration of 1
    If MaxValue < ArrayOfN(i) Then
        MaxValue = ArrayOfN(i)
    End If
Next
I largely used VB-style syntax for the coding example, so the exact syntax may need adjusting for whatever language you use. When coding something like this you can definitively choose the largest value; there isn't any uncertainty. There are more efficient ways of doing this as well: for example, if you have an array.sort feature available, you can get the max value in just two lines of code. However, since you did not state any programming language, and after seeing yi_H's answer, I am starting to think that you did not want to know how to program a solution but simply wanted the name of the problem.
See, it is a bit of a mess to use such a strategy to retrieve the maximum, and you could look at more advanced algorithms such as Fibonacci search (which requires sorted data) or others. In any case, here we are certainly checking the elements one by one, so to make it happen, keep a variable holding the maximum available so far and iterate linearly along the length of the array. It may be resolved like this:
Arr[n] = { }
max = Arr[0]
for x in range 1 to n-1
    if Arr[x] > max
        max = Arr[x]
    [END IF]
get(max)
So finally we will get the maximum, but at the cost of linear time complexity. You could also try a divide and conquer kind of algorithm...
