Fast way to count duplicates in 30000 rows (Libreoffice Calc) - arrays

Actually, I already have a partial answer: conditional formatting with "Cell value is" -> "duplicate".
This way a check is performed on each new entry by the user in "real time".
I need to check whether duplicate entries exist in the 30000 rows of a column (any value, but not blanks!). I would like to keep track of how many duplicates there are during the filling process.
OK, conditional formatting is a very effective visual indication and fast enough for my needs, but I am not able to loop over the cells and check their color (I found some people are against this approach; it would be so easy!), so I need an alternative way to count the duplicates (as a whole; there is no need to identify how many for each case).
I tried the formula:
=SUMPRODUCT((COUNTIF(F2:F30001;$F$2:$F$30001)>1))
It works, but it takes two minutes to finish.
If you want to replicate my case: my 30000 entries are formatted as the letter "A" followed by a number between 100000 and 999999, e.g., A354125, A214547, etc. Copy the result of "=CONCATENATE("A";RANDBETWEEN(100000;999999))" as text to save time.
Thanks!
PS: Does anybody know the algorithm used to find the duplicates in conditional formatting (it is fast)?
A macro solution is not the best, but is acceptable! ;)

The formula =SUMPRODUCT((COUNTIF(F2:F30001;$F$2:$F$30001)>1)) has to do the following: count how often $F$2 occurs in F2:F30001, then how often $F$3 occurs in F2:F30001, ..., then how often $F$30001 occurs in F2:F30001. So it must loop over the whole array F2:F30001 once for every single item, which is roughly 30000 times 30000 = 900 million comparisons.
The fastest way to count duplicates in an array is to avoid looping over the whole array for every single item. One way is to sort first; there are very fast quicksort methods. Another is to use collections, which by definition can only hold unique keys.
The following code uses the second way. The keys of a Collection must be unique, so adding an item with a duplicate key fails.
Public Function countDuplicates(vArray As Variant, Optional inclusive As Boolean) As Variant
    On Error Goto wrong
    If IsMissing(inclusive) Then inclusive = False
    ' Collection keys must be unique, so each value is stored at most once per collection.
    oDuplicatesCollection = new Collection
    oUniqueCollection = new Collection
    lCountAll = 0
    For Each vValue In vArray
        If contains(oUniqueCollection, CStr(vValue)) Then
            ' Value seen before: record it as a duplicate.
            ' Adding the same key again fails, and that error is ignored on purpose.
            On Error Resume Next
            oDuplicatesCollection.Add 42, CStr(vValue)
            On Error Goto 0
        Else
            ' First occurrence of this value.
            oUniqueCollection.Add 42, CStr(vValue)
        End If
        lCountAll = lCountAll + 1
    Next
    ' Extra occurrences, plus one per duplicated value when counting inclusively.
    countDuplicates = lCountAll - oUniqueCollection.Count + IIF(inclusive, oDuplicatesCollection.Count, 0)
    Exit Function
wrong:
    'xray vArray
    countDuplicates = CVErr(123)
End Function

' Returns True if sKey is already a key of oCollection.
Function contains(oCollection As Collection, sKey As String)
    On Error Goto notContains
    oCollection.Item(sKey)
    contains = True
    Exit Function
notContains:
    contains = False
End Function
The function can be called:
=COUNTDUPLICATES(F2:F30001, TRUE())
This should return the same result as your
=SUMPRODUCT((COUNTIF(F2:F30001,$F$2:$F$30001)>1))
The optional second parameter inclusive means the count includes every occurrence of the values that are present multiple times. For example, {A1, A2, A2, A2, A3} contains A2 three times. Counting inclusively, the result is 3. Counting non-inclusively, the result is 2, because only the second and third A2 count as duplicates of the first.
As you can see, the function gathers much more information than just the count of the duplicates. The oDuplicatesCollection contains each duplicated item once. The oUniqueCollection contains each distinct item once. So this code could also be used to get all unique items or all duplicate items.
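The same single-pass idea can be sketched outside of Basic. The Python below is only an illustration, assuming the column has already been read into a list; `count_duplicates` and its `inclusive` flag are hypothetical names that mirror the macro above, and two sets stand in for the two Collections. Unlike the macro, this sketch skips blank entries, which is an assumption based on the question's "not blanks" remark.

```python
def count_duplicates(values, inclusive=False):
    """Count duplicate entries in a single pass (blanks are skipped)."""
    seen = set()        # every distinct value met so far
    duplicated = set()  # distinct values met more than once
    total = 0
    for value in values:
        if value is None or value == "":
            continue  # assumption: blank cells are not counted
        total += 1
        if value in seen:
            duplicated.add(value)
        else:
            seen.add(value)
    extra = total - len(seen)  # occurrences beyond the first of each value
    return extra + len(duplicated) if inclusive else extra


sample = ["A1", "A2", "A2", "A2", "A3"]
print(count_duplicates(sample))                  # 2
print(count_duplicates(sample, inclusive=True))  # 3
```

Set membership tests take roughly constant time on average, so the whole count needs one pass over the 30000 values instead of one pass per value.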

Related

Why is the MATCH function not working how anticipated in Excel with array formulas? (example attached)

File Located Here (using Google Drive)
Explanation and background: I have a list of animals and their color in column G (animal) and H (color). I have a list of the unique colors in column A, and a list of unique animals in column D. In column B, next to the list of unique colors, I need to know of all the animals, which animal has that color most often (raw number, not proportion.) I cannot use any additional helper cells.
I have established the max by animal for each color with the formula {=MAX(COUNTIFS(H:H,A2,G:G,$D$2:$D$20))} with the result being 7, but that's as far as I can get. I can set the COUNTIFS statement to the MAX statement like so: COUNTIFS(H:H,A2,G:G,E1:E19)=MAX(COUNTIFS(H:H,A2,G:G,E1:E19)) which can be used as a TRUE/FALSE array in an array formula. Finally I try to use MATCH with the previous formula as my array, and look for a TRUE value to attempt to retrieve the position of the only TRUE value in the array, but it seems to not be able to find it and instead gives me 19, which is the length of the entire array.
When I step through the formula, here's the step right before resulting in 19:
(Screenshot: step before final result)
Why does this MATCH not work?
Put this in B2:
=CHOOSE(IF(MAX(COUNTIFS(G:G,$E$1,H:H,A2),COUNTIFS(G:G,$E$2,H:H,A2),COUNTIFS(G:G,$E$3,H:H,A2),COUNTIFS(G:G,$E$4,H:H,A2),COUNTIFS(G:G,$E$5,H:H,A2),COUNTIFS(G:G,$E$6,H:H,A2),COUNTIFS(G:G,$E$7,H:H,A2),COUNTIFS(G:G,$E$8,H:H,A2),COUNTIFS(G:G,$E$9,H:H,A2))>MAX(COUNTIFS(G:G,$E$10,H:H,A2),COUNTIFS(G:G,$E$11,H:H,A2),COUNTIFS(G:G,$E$12,H:H,A2),COUNTIFS(G:G,$E$13,H:H,A2),COUNTIFS(G:G,$E$14,H:H,A2),COUNTIFS(G:G,$E$15,H:H,A2),COUNTIFS(G:G,$E$16,H:H,A2),COUNTIFS(G:G,$E$17,H:H,A2),COUNTIFS(G:G,$E$18,H:H,A2),COUNTIFS(G:G,$E$19,H:H,A2)),1,2),CHOOSE(IF(MAX(COUNTIFS(G:G,$E$1,H:H,A2),COUNTIFS(G:G,$E$2,H:H,A2),COUNTIFS(G:G,$E$3,H:H,A2),COUNTIFS(G:G,$E$4,H:H,A2))>MAX(COUNTIFS(G:G,$E$5,H:H,A2),COUNTIFS(G:G,$E$6,H:H,A2),COUNTIFS(G:G,$E$7,H:H,A2),COUNTIFS(G:G,$E$8,H:H,A2),COUNTIFS(G:G,$E$9,H:H,A2)),1,2),CHOOSE(IF(MAX(COUNTIFS(G:G,$E$1,H:H,A2),COUNTIFS(G:G,$E$2,H:H,A2))>MAX(COUNTIFS(G:G,$E$3,H:H,A2),COUNTIFS(G:G,$E$4,H:H,A2)),1,2),IF(COUNTIFS(G:G,$E$1,H:H,A2)>COUNTIFS(G:G,$E$2,H:H,A2),"bat","raccoon"),IF(COUNTIFS(G:G,$E$3,H:H,A2)>COUNTIFS(G:G,$E$4,H:H,A2),"bear","goat")),CHOOSE(IF(MAX(COUNTIFS(G:G,$E$5,H:H,A2),COUNTIFS(G:G,$E$6,H:H,A2))>MAX(COUNTIFS(G:G,$E$7,H:H,A2),COUNTIFS(G:G,$E$8,H:H,A2),COUNTIFS(G:G,$E$9,H:H,A2)),1,2),IF(COUNTIFS(G:G,$E$5,H:H,A2)>COUNTIFS(G:G,$E$6,H:H,A2),"moose","turtle"),CHOOSE(IF(COUNTIFS(G:G,$E$7,H:H,A2)>MAX(COUNTIFS(G:G,$E$8,H:H,A2),COUNTIFS(G:G,$E$9,H:H,A2)),1,2),"squirrel",IF(COUNTIFS(G:G,$E$8,H:H,A2)>COUNTIFS(G:G,$E$9,H:H,A2),"snake","bird")))),CHOOSE(IF(MAX(COUNTIFS(G:G,$E$10,H:H,A2),COUNTIFS(G:G,$E$11,H:H,A2),COUNTIFS(G:G,$E$12,H:H,A2),COUNTIFS(G:G,$E$13,H:H,A2),COUNTIFS(G:G,$E$14,H:H,A2))>MAX(COUNTIFS(G:G,$E$15,H:H,A2),COUNTIFS(G:G,$E$16,H:H,A2),COUNTIFS(G:G,$E$17,H:H,A2),COUNTIFS(G:G,$E$18,H:H,A2),COUNTIFS(G:G,$E$19,H:H,A2)),1,2),CHOOSE(IF(MAX(COUNTIFS(G:G,$E$10,H:H,A2),COUNTIFS(G:G,$E$11,H:H,A2))>MAX(COUNTIFS(G:G,$E$12,H:H,A2),COUNTIFS(G:G,$E$13,H:H,A2),COUNTIFS(G:G,$E$14,H:H,A2)),1,2),IF(COUNTIFS(G:G,$E$10,H:H,A2)>COUNTIFS(G:G,$E$11,H:H,A2),"cat","dog"),CHO
OSE(IF(COUNTIFS(G:G,$E$12,H:H,A2)>MAX(COUNTIFS(G:G,$E$13,H:H,A2),COUNTIFS(G:G,$E$14,H:H,A2)),1,2),"rabbit",IF(COUNTIFS(G:G,$E$13,H:H,A2)>COUNTIFS(G:G,$E$14,H:H,A2),"sheep","cow"))),CHOOSE(IF(MAX(COUNTIFS(G:G,$E$15,H:H,A2),COUNTIFS(G:G,$E$16,H:H,A2))>MAX(COUNTIFS(G:G,$E$17,H:H,A2),COUNTIFS(G:G,$E$18,H:H,A2),COUNTIFS(G:G,$E$19,H:H,A2)),1,2),IF(COUNTIFS(G:G,$E$15,H:H,A2)>COUNTIFS(G:G,$E$16,H:H,A2),"chicken","llama"),CHOOSE(IF(COUNTIFS(G:G,$E$17,H:H,A2)>MAX(COUNTIFS(G:G,$E$18,H:H,A2),COUNTIFS(G:G,$E$19,H:H,A2)),1,2),"pig",IF(COUNTIFS(G:G,$E$18,H:H,A2)>COUNTIFS(G:G,$E$19,H:H,A2),"horse","deer")))))
Then drag it down to B10.
Key things that I learnt here:
CHOOSE() is a good alternative to heavily nested IF(). It helped me avoid getting lost while breaking down the formula.
Cascaded binary evaluation is a good way to break up a long list of repeated evaluations.
"I cannot use any additional helper cells." <-- If the OP doesn't mind, a helper sheet could always be used instead. This requirement really pushes the limits of Excel formulas. The first version of my solution needed more than 10000 characters, which exceeds Excel's 8192-character formula limit.
Try this array formula in B2 then copy to B3:B10:
= INDEX( $G$1:$G$510,
MATCH( MAX(
IF( $H$1:$H$510 = A2,
COUNTIFS( $G$1:$G$510, $G$1:$G$510, $H$1:$H$510, $H$1:$H$510), "" ) ),
IF( $H$1:$H$510 = A2,
COUNTIFS( $G$1:$G$510, $G$1:$G$510, $H$1:$H$510, $H$1:$H$510), "" ), 0 ) )
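What both formulas compute can be stated compactly outside the spreadsheet: for each color, count the animals and keep the most frequent one. The Python below is a minimal sketch of that logic, not a replacement for the formulas; `top_animal_by_color` and the sample rows are hypothetical, with each row standing for one (animal, color) pair from columns G and H.

```python
from collections import Counter, defaultdict


def top_animal_by_color(rows):
    """Map each color to the animal that appears with it most often."""
    counts = defaultdict(Counter)  # color -> Counter of animals
    for animal, color in rows:
        counts[color][animal] += 1
    # most_common(1) yields the (animal, count) pair with the highest count;
    # on a tie, the animal encountered first wins.
    return {color: c.most_common(1)[0][0] for color, c in counts.items()}


rows = [
    ("cat", "black"),
    ("cat", "black"),
    ("dog", "black"),
    ("dog", "brown"),
]
print(top_animal_by_color(rows))  # {'black': 'cat', 'brown': 'dog'}
```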

Excel VBA: How to concatenate variant array elements (row numbers) into a range object?

I did research this question but could not find the specific answer I was looking for and am actually even more confused at present.
I created a macro that would run through rows on a sheet and run boolean checks on a number of cells in each row that looked for the presence or absence of specific values, or calculated the outcome of a specific inequality. On the basis of those checks, the macro may or may not pass the row number into a specific array. That part is working fine.
My issue is, now that I have the row numbers (stored in variant arrays) - I cannot figure out how to properly concatenate that data into a range and then take a bulk excel action on those items. What I'd like to do is create a range of those values and then delete all of those rows at once rather than looping through.
My macro is on my work computer, but here's something I wrote that should explain what I'm trying to do:
Sub Test()
    Dim Str As String
    Dim r As Range
    Dim i, a As Integer
    Dim Count As Integer
    Dim RngArray()
    Count = ThisWorkbook.Sheets("Sheet1").Cells(Rows.Count, "A:A").End(xlUp).Row
    ReDim RngArray(Count)
    a = 0
    For i = 1 To Count
        If Not i = Count Then
            RngArray(a) = i
            Str = Str & RngArray(a) & ":" & RngArray(a) & ", "
            a = a + 1
        ElseIf i = Count Then
            RngArray(a) = i
            Str = Str & RngArray(a) & ":" & RngArray(a)
            a = a + 1
        Else: End If
    Next i
    Set r = Range(Str) 'Error can appear here depending on my concatenation technique
    Range(Str).EntireRow.Delete 'error will always appear here
End Sub
I've combined a few steps here and left out any Boolean checks; in my actual macro the values in the arrays are already stored and I loop from LBound to UBound and concatenate those values into a string of the form ("1:1, 2:2, 3:3, ...., n:n")
The reason why I did this is that the rows are all over the sheet and I wanted to get to a point where I could pass the argument
Range("1:1, 2:2, 3:3, ..., n:n").EntireRow.Delete
I think it's clear that I'm just not understanding how to pass the correct information to the range object. When I try to run this I get a "Method Range of Object Global" error message.
My short term fix is to just loop through and clear the rows and then remove all of the blank rows (the macro keeps track of absolute positions of the rows, not the rows after an iterative delete) - but I'd like to figure out HOW to do this my way and why it's not working.
I'm open to other solutions as well, but I'd like to understand what I'm doing wrong here. I should also mention that I used the Join() to try to find a workaround and still received the same type of error.
Thank you.
After some experimentation with my dataset for the macro above, I discovered that it worked on small sets of data in A:A but not larger sets.
I ran Debug.Print Len(Str) while tweaking the set size and macro and found that it appears Range() can only accept a maximum of 240 characters. I still don't understand why this is or the reason for the specific error message I received, but the macro will work if Len(Str) < 240.
I'll have to loop backwards through my array to delete these rows if I want to use my present method...or I may just try something else.
Thanks to Andrew for his attention to this!
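One way around the length limit observed above is to split the row list into several address strings that each stay under it, and to process the highest rows first so that deleting one batch never shifts the rows still waiting. The Python below only sketches the chunking logic; `chunk_row_addresses` is a hypothetical helper, and the default `max_len=239` is taken from the `Len(Str) < 240` observation in this post, not from any documented limit.

```python
def chunk_row_addresses(rows, max_len=239):
    """Split row numbers into "r:r,r:r" strings no longer than max_len.

    Rows are emitted in descending order, so every chunk only contains
    rows below those of the previous chunk.
    """
    chunks = []
    current = ""
    for row in sorted(set(rows), reverse=True):
        piece = f"{row}:{row}"
        candidate = piece if not current else current + "," + piece
        if current and len(candidate) > max_len:
            chunks.append(current)  # current chunk is full, start a new one
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_row_addresses([3, 1, 2]))  # ['3:3,2:2,1:1']
for address in chunk_row_addresses(range(1, 201)):
    assert len(address) <= 239
```

In the macro, each resulting string would be passed to Range(...).EntireRow.Delete in turn, in the order returned.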

Excel how to return an array that meets a certain condition?

If I have data in cell range A1:A6 which is:
Apple
Banana
Cherry
Grape
Orange
Watermelon
Is there a way to return an array (in a single cell, for an intermediate step for a larger formula) which returns the above array except only those entries that satisfy a certain condition?
For example, if I wanted a formula to return an array of only those cells that contain the letter n, it would return this:
Banana
Orange
Watermelon
Is there a way to accomplish this?
Note I do not want to return an array of the same size, just with blank entries, i.e. I do not want:
""
Banana
""
""
Orange
Watermelon
Yes.
Here is the array formula (line break added for readability):
= INDEX(A1:A6,N(IF({1},MODE.MULT(IF(ISNUMBER(SEARCH("n",A1:A6)),
(ROW(A1:A6)-ROW(A1)+1)*{1,1})))))
Note, this is an array formula, meaning you must press Ctrl+Shift+Enter after typing the formula instead of just Enter.
There are some particularly odd things about this formula, so I thought I would explain what is going on below if you are interested. Some of what I explain is probably obvious, but I am just being thorough.
To return a result from a list based on a single index, use this:
= INDEX(A1:A6,2)
This would return Banana.
To return results from a list based on multiple indices, you would think to use something like this:
= INDEX(A1:A6,{2;5;6})
This would ideally return {Banana;Orange;Watermelon}.
However, this does not return an array. Based on a recent question that I asked, a very clever workaround to this problem was given:
= INDEX(A1:A6,N(IF({1},{2;5;6})))
This will return the desired result of {Banana;Orange;Watermelon}.
This I consider part 1 of the explanation.
Part 2 of the explanation is how the following returns {2;5;6}:
= MODE.MULT(IF(ISNUMBER(SEARCH("n",A1:A6)),(ROW(A1:A6)-ROW(A1)+1)*{1,1}))
MODE.MULT is a function that returns the data in a set that appears most frequently. There are a few caveats, however:
Data must appear at least twice to be returned by MODE.MULT. If there is no duplicate data, it will return an error. For example, MODE.MULT({1;2;3}) would return an error because there is no repeating data in the array {1;2;3}. Another example: MODE.MULT({1;1;2}) would return 1 because 1 appears most often in the data.
If there is a "tie" in terms of what data appears the most, MODE.MULT returns an array of all tied entries. For example MODE.MULT({1;1;2;2}) would return an array of {1;2}.
Most importantly, and probably the most peculiar but also most useful behavior of MODE.MULT, MODE.MULT completely ignores logical values (TRUE and FALSE values) when determining the mode of the data, even if they appear more often than the non-logical values in the data.
We can exploit these properties of MODE.MULT to get the desired array.
ISNUMBER(SEARCH("n",A1:A6)) returns an array of TRUE/FALSE values where the data contains an n. Something like this:
FALSE
TRUE
FALSE
FALSE
TRUE
TRUE
(ROW(A1:A6)-ROW(A1)+1) returns an array starting at 1 and increasing by 1 to however large the original array is:
1
2
3
4
5
6
(ROW(A1:A6)-ROW(A1)+1)*{1,1} effectively just copies this column:
1 1
2 2
3 3
4 4
5 5
6 6
The IF statement is used to return the number in the array above if TRUE, and FALSE otherwise. (Since the IF statement contains no "else" clause, FALSE is the default value given.)
In this example, the IF statement would return this:
FALSE FALSE
2 2
FALSE FALSE
FALSE FALSE
5 5
6 6
Taking MODE.MULT of the above formula will return {2;5;6} because as mentioned, MODE.MULT conveniently ignores the FALSE values in the array above when considering the mode.
It is necessary to consider the above array inside MODE.MULT instead of simply:
FALSE
2
FALSE
FALSE
5
6
This is because, as mentioned previously, MODE.MULT requires at least two matching entries in the data before it returns a value.
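The intermediate arrays described above can be reproduced step by step in Python, which may make the data flow easier to follow. This is only an analogy to the spreadsheet formula, not how Excel evaluates it; `filter_contains` is a hypothetical helper name. The match is made case-insensitive because SEARCH is case-insensitive.

```python
def filter_contains(items, needle):
    """Return (mask, positions, result) for items containing needle."""
    # Step 1: TRUE/FALSE mask, like ISNUMBER(SEARCH("n", A1:A6)).
    mask = [needle.lower() in item.lower() for item in items]
    # Step 2: 1-based positions of the TRUE entries, like the {2;5;6}
    # array that MODE.MULT produces in the formula.
    positions = [i + 1 for i, hit in enumerate(mask) if hit]
    # Step 3: look up each position, like INDEX(A1:A6, {2;5;6}).
    result = [items[p - 1] for p in positions]
    return mask, positions, result


fruits = ["Apple", "Banana", "Cherry", "Grape", "Orange", "Watermelon"]
mask, positions, result = filter_contains(fruits, "n")
print(mask)       # [False, True, False, False, True, True]
print(positions)  # [2, 5, 6]
print(result)     # ['Banana', 'Orange', 'Watermelon']
```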
=INDEX($A$2:$A$7,MATCH(1,(COUNTIF($C$1:C1,$A$2:$A$7)=0)*(FIND("n",$A$2:$A$7)>0),0))
This is an array formula, so it will need to be entered with Ctrl+Shift+Enter.
(Screenshot: input and output)
The INDEX returns an element from a range
The MATCH allows us to find the position of the row(s) to return
The COUNTIF is to make sure we only select unique values in the results
The FIND returns only rows where there is an n present in the string
Try the following User Defined Function:
Public Function ContainsN(rng As Range) As String
    Dim r As Range
    ContainsN = ""
    For Each r In rng
        If InStr(1, UCase(r.Value), "N") > 0 Then ContainsN = ContainsN & Chr(10) & r.Value
    Next r
    ContainsN = Mid(ContainsN, 2)
End Function
I came across this post while searching for a solution to a similar problem: return an array of expiration dates for all batches of a specific SKU.
I transposed a filter function so that my array of Batch Exp dates would appear on the same line as my Inventory SKU.
=IFERROR(TRANSPOSE(FILTER('Batch EXP'!$A$2:$A$1000,('Batch EXP'!$B$2:$B$1000=Inventory!$R4))),"No Expiration")
Added a concatenate column onto my filter range where I could combine quantities (F4) associated with each expiration date (I4).
=CONCATENATE(TEXT(I4,"MM-DD-YY")," x ",TEXT(F4,"#,#"), " units")
It works great and returns an array for every SKU that matches the criteria I set out. It could even be taken one step further to only return batches above X %, a given quantity, etc. Here is an example of the output for a SKU:
11-15-22 x 40 units 11-01-22 x 5,216 units

Odd "Check_VIOLATION" failed test case in Eiffel

The main issue, shown in the picture below, is that as soon as a "check Result end" statement is added, the test fails and the debugger displays a "CHECK_VIOLATION" error.
Also, the HASH_TABLE doesn't store all the items given to it, but I fixed that by switching to HASH_TABLE[G, INTEGER] instead of the current HASH_TABLE[INTEGER, G].
My main question is: why does it always throw CHECK_VIOLATION and fail whenever a "check Result end" statement is reached? Maybe the HAS[...] function is bad?
Currently, any test case feature with "check Result end" fails and throws CHECK_VIOLATION.
code:
class
    MY_BAG[G -> {HASHABLE, COMPARABLE}]
inherit
    ADT_BAG[G]
create
    make_empty, make_from_tupled_array
convert
    make_from_tupled_array ({ARRAY [TUPLE [G, INTEGER]]})
feature{NONE} -- creation
    make_empty
        do
            create table.make(1)
        end
    make_from_tupled_array (a_array: ARRAY [TUPLE [x: G; y: INTEGER]])
        require else
            non_empty: a_array.count >= 0
            nonnegative: is_nonnegative(a_array)
        do
            create table.make(a_array.count)
            across a_array as a
            loop
                table.force (a.item.y, a.item.x)
            end
        end
feature -- attributes
    table: HASH_TABLE[INTEGER, G]
    counter: INTEGER
testing code:
t6: BOOLEAN
    local
        bag: MY_BAG [STRING]
    do
        comment ("t6:repeated elements in contruction")
        bag := <<["foo",4], ["bar",3], ["foo",2], ["bar",0]>> -- test passes
        Result := bag ["foo"] = 5 -- test passes
        check Result end -- test fails (really weird but as soon as check statement comes it fails)
        Result := bag ["bar"] = 3
        check Result end
        Result := bag ["baz"] = 0
    end
Most probably ADT_BAG stands for an abstraction of a multiset (also called a bag), which keeps items and can tell how many items equal to a given one it holds (unlike a set, where at most one such item may be present). If so, it is correct to use HASH_TABLE [INTEGER, G] as storage: its keys are the elements and its items are the element counts.
So, if we add the same element multiple times, its count should be increased. In the initialization line we add 4 elements of "foo", 3 elements of "bar", 2 elements of "foo" again, and 0 elements of "bar" again. As a result we should have a bag with 6 elements of "foo" and 3 elements of "bar". Also there are no elements "baz".
According to this analysis, either initialization is incorrect (numbers for "foo" should be different) or the comparison should be done for 6 instead of 5.
As to the implementation of the class MY_BAG the idea would be to have a feature add (or whatever name specified in the interface of ADT_BAG) that
Checks if there are items with the given key.
If there are none, adds the new element with the given count.
Otherwise, replaces the current element count with the sum of the current element count and the given element count.
For simplicity the initialization procedure would use this feature to add new items instead of storing items in the hash table directly to process repeated items correctly.
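The three steps above can be sketched as follows. This is a Python illustration of the counting logic only, not Eiffel and not the ADT_BAG interface; the `Bag` class and its `add` method are hypothetical names. It also shows why storing each pair directly (as `table.force` does in the question, which overwrites an existing entry) would lose the earlier counts.

```python
class Bag:
    """Minimal multiset: maps each element to how many times it occurs."""

    def __init__(self, pairs=()):
        self._counts = {}
        # Route construction through add() so repeated elements accumulate.
        for item, count in pairs:
            self.add(item, count)

    def add(self, item, count):
        if count < 0:
            raise ValueError("count must be non-negative")
        # Missing key: start from 0. Existing key: sum with the stored count.
        self._counts[item] = self._counts.get(item, 0) + count

    def __getitem__(self, item):
        # Elements never added occur zero times.
        return self._counts.get(item, 0)


bag = Bag([("foo", 4), ("bar", 3), ("foo", 2), ("bar", 0)])
print(bag["foo"], bag["bar"], bag["baz"])  # 6 3 0

# Overwriting instead of summing keeps only the last pair per key:
overwritten = dict([("foo", 4), ("bar", 3), ("foo", 2), ("bar", 0)])
print(overwritten)  # {'foo': 2, 'bar': 0}
```

With overwriting, "foo" ends up as 2 and "bar" as 0, so neither the expected 5 nor the analysed 6 would be found, which is consistent with the check failing.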

Use of SumIf function with range object or arrays

I am trying to optimize a sub that uses Excel's SumIf function, since it takes a long time to finish.
The specific line (contained in a For loop) is this one:
Cupones = Application.WorksheetFunction.SumIf(Range("Test_FecFinCup"), Arr_FecFlujos(i), Range("Test_MtoCup"))
Here the ranges are named ranges in the workbook, and Arr_FecFlujos() is an array of dates.
That code works fine, except that it takes too much time to finish.
I am trying these two approaches:
Arrays:
Declare my arrays
With Test
Fluj = .Range(Range("Test_Emision").Cells(2, 1), Range("Test_Emision").Cells(2, 1).End(xlDown)).Rows.Count
Arr_FecFinCup = .Range("Test_FecFinCup")
Arr_MtoCup = .Range("Test_MtoCup")
End With
Cupones = Application.WorksheetFunction.SumIf(Arr_FecFinCup, Arr_FecFlujos(i), Arr_MtoCup)
Error tells me I need to work with Range Objects, so I changed to:
With Test
Set Rango1 = .Range("Test_FecIniCup")
Set Rango2 = .Range("Test_MtoCup")
End With
Cupones = Application.WorksheetFunction.SumIf(Rango1, Arr_FecFlujos(i), Rango2)
That one doesn't show any error messages, but the sum is incorrect.
Can anybody tell me what's working wrong with these methods and perhaps point me in the correct direction?
It seems that you are trying to sum a range of numbers using a range of criteria:
WorksheetFunction.SumIf(Arr_FecFinCup, Arr_FecFlujos(i), Arr_MtoCup)
As far as I know, if the criteria parameter is given a range, Excel doesn't iterate over that range but instead looks for the one value in the criteria range that lies in the same row as the cell it is calculating.
For example
Range("D3") = WorksheetFunction.SumIf(Range("A1:A10"),Range("B1:B10"))
Excel will actually calculate it as follows:
Range("D3") = WorksheetFunction.SumIf(Range("A1:A10"),Range("B3"))
If there is no cell in the matching row, the result is 0.
For example
Range("D7") = WorksheetFunction.SumIf(Range("A1:A10"),Range("B1:B5"))
Then D7 is always 0, because row 7 lies outside [B1:B5], so there is no [B7] to look up.
Therefore, to do a sum with multiple criteria, the correct way is to use SUMIFS, as suggested by #mrtiq.
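On the speed side of the question, a different technique from the one discussed in this answer is to stop calling SumIf once per date and instead total the amounts per date in a single pass, then look each date up. The Python below sketches that idea under the assumption that both named ranges have been read into equal-length lists; `totals_by_key` and the sample values are hypothetical.

```python
from collections import defaultdict


def totals_by_key(keys, amounts):
    """Sum amounts per key in one pass over both lists."""
    totals = defaultdict(float)
    for key, amount in zip(keys, amounts):
        totals[key] += amount
    return dict(totals)


# Hypothetical stand-ins for Test_FecFinCup (dates) and Test_MtoCup (amounts).
end_dates = ["2024-01-15", "2024-02-15", "2024-01-15", "2024-03-15"]
amounts = [100.0, 50.0, 25.0, 10.0]

totals = totals_by_key(end_dates, amounts)
# Each lookup replaces one SumIf call; unknown dates fall back to 0.
for flow_date in ["2024-01-15", "2024-04-15"]:
    print(flow_date, totals.get(flow_date, 0.0))
```

In VBA the same pattern would typically read both ranges into Variant arrays and accumulate the totals in a Scripting.Dictionary.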
