So, I've got a word analyzing program in Excel with which I hope to be able to import over 30 million words.
At first, I created a separate object for each of these words so that each word has a...
.value '(string), the actual word itself
.bool1 '(boolean)
.bool2 '(boolean)
.bool3 '(boolean)
.isUsed '(boolean)
.cancel '(boolean)
When I found out I may have 30 million of these objects (all stored in a single collection), I thought that this could be a monster to compile. And so I decided that all my words would be strings, and that I would stick them into an array.
So my array idea is to prefix each of the 30 million strings with 5 characters (one per bool), where a space represents a False value and a '*' represents True, e.g.,
If Mid$(arr(n), 3, 1) = " " Then
'my 3rd bool val is false.
ElseIf Mid$(arr(n), 3, 1) = "*" Then '(I'll insert a '*' to denote true)
'my third bool val is true.
End If
Anyway, what do you guys think? Which way (collection or array) should I go about this (for optimization specifically)?
(I wanted to make this a comment but it became too long)
An answer would depend on how you want to access and process the words, once stored.
There are three candidates, each with distinct advantages:
Arrays are very efficient to populate and retrieve all items at once (e.g. range to array and array back to range), but much slower at resizing and inserting items in the middle. Each ReDim allocates a new memory block, and if Preserve is used, all existing values are copied over as well. This may translate to perceived slowness for every operation (in a potential application).
More details (arrays vs collections) here (VB specific but it applies to VBA as well)
Collections are linked lists with hash tables: quite slow to populate, but after that you get fast access by key to any element in the collection, and they are just as fast at reordering (sorting) and resizing. This can translate into a slow-opening file, but the other operations are fast. Other aspects:
Items can be other collections, arrays, objects
While keys must be unique, they are also optional
An item can be returned in reference to its key, or in reference to its index value
Keys are always strings, and always case insensitive
Items are accessible and retrievable, but its keys are not
Cannot remove all items at once (either remove them one by one, or destroy and then recreate the Collection)
Enumerating with For...Each...Next lists all items
More info here and here
Dictionaries: same as collections but with the extra benefit of the .Exists() method which, in some scenarios, makes them much faster than collections. Other aspects:
Keys are mandatory and always unique to that Dictionary
An item can only be returned in reference to its key
The key can take any data type; for string keys, by default a Dictionary is case sensitive
Exists() method to test for the existence of a particular key (and item)
Collections have no similar test; instead, you must attempt to retrieve a value from the Collection, and handle the resulting error if the key is not found
Items AND keys are always accessible and retrievable to the developer
Item property is read/write, so it allows changing the item associated with a particular key
Allows you to remove all items in a single step without destroying the Dictionary itself
Using For...Each...Next, dictionaries will enumerate the keys
A Dictionary supports implicit adding of an item using the Item property.
In Collections, items must be added explicitly
More details here
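The difference between testing for a key and trapping a failed retrieval can be sketched in a few lines. This is only an illustration in Python, not VBA: a Python dict stands in for the Scripting.Dictionary, `in` plays the role of .Exists(), and the try/except branch mimics what a Collection forces you to do. The function names and sample words are made up for the example.

```python
def get_with_exists(d, key, default=None):
    # Dictionary style: test for the key first, like .Exists() in VBA.
    if key in d:
        return d[key]
    return default


def get_with_error_handling(d, key, default=None):
    # Collection style: attempt the retrieval and trap the failure.
    try:
        return d[key]
    except KeyError:
        return default


words = {"apple": 3, "pear": 1}
words["plum"] = 7    # implicit add through item assignment
words["apple"] = 4   # read/write item: replaces the value for an existing key

# A for-each loop over the dictionary enumerates the keys.
keys_seen = [key for key in words]
```

Both helpers return the same results; the second one simply pays for an exception on every miss, which is the cost the answer above attributes to Collections.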
Other links: optimizing loops and optimizing strings (same site)
Problem:
I'm looking for an efficient data structure in VBA which allows me to look up a value in one 'column' and find the corresponding value in another column. All columns have the same fixed length.
Background
Essentially I have 2 Enums, each with n items, and an array of n strings; I'd like to pass the ith value from any of these sets, and return the ith value from another specified set.
One option would be a Collection of Arrays; the Collection would have keys corresponding to the type of list (e.g. Enum1, Enum2, StringList), and I would be able to make a function which takes two list keys and a lookup value as argument, and returns the corresponding value in the second column with a loop:
Function findCorresponding(dataTable As Collection, header1 As String, header2 As String, lookupVal As Variant) As Variant
    Dim array1 As Variant, i As Long
    array1 = dataTable(header1) 'pick out array from collection (no Set: arrays are not objects)
    For i = LBound(array1) To UBound(array1) 'loop through to find lookup val
        If array1(i) = lookupVal Then
            findCorresponding = dataTable(header2)(i) 'return corresponding val
            Exit Function
        End If
    Next i
    findCorresponding = Empty 'lookup val not found
End Function
And sure, I could replace the lookup arrays with un-Keyed Collections to avoid looping. But that doesn't seem like the most efficient way (I believe dictionaries Hash rather than loop, so would be faster on that front, but have a lot of extra baggage compared to an array)
What I really want is something like a Scripting.Dictionary, where you can access both values and keys, and use one to get the other. But with a third parameter that can be found using either of the other two, and can be used to find either of the other two.
If something extends to n columns that would also be useful
I am creating a database storage engine (for fun).
I know it uses B-trees (and stuff), but all the basic B-tree examples show keys being sorted and then stored for indexing, and they only use integers as keys.
I can understand sorting integers, but how do I do it for strings, if I have a string as the key for indexing?
Ex: I want to index all email addresses in a B-tree, how would I do that?
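It does not matter what type of data you are indexing. For a B-tree you only need a comparator that defines a consistent ordering of the keys. The first value you put into your db goes into the root node. Each later value is compared with the keys in a node to decide which child to descend into (smaller keys to the left, larger to the right) until it reaches a leaf. Inserting new values often requires restructuring the tree, for example splitting a full node.
A comparator for a string could compare alphabetically, or order by the length of the string, or by the number of dots after the at-sign in an email. Note that keys the comparator treats as equal are indistinguishable to the index, so alphabetical comparison is the usual choice for unique keys such as email addresses.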
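As a rough illustration of the comparator idea (in Python, and not tied to any particular storage engine), the sketch below defines two comparators for email strings and a binary search that uses a comparator to find a position inside one sorted node. The function names, the domain-first ordering and the sample addresses are all made up for the example.

```python
from functools import cmp_to_key


def compare_alphabetically(a, b):
    # Negative if a sorts first, positive if b sorts first, 0 if equal.
    return (a > b) - (a < b)


def compare_by_domain(a, b):
    # Hypothetical alternative ordering: by domain, then by local part.
    a_local, _, a_domain = a.partition("@")
    b_local, _, b_domain = b.partition("@")
    return compare_alphabetically((a_domain, a_local), (b_domain, b_local))


def find_slot(sorted_keys, key, compare):
    # Binary search inside one node: returns the index of the first key that
    # is not smaller than `key`. That is where the key sits if it is present,
    # and otherwise which child to descend into.
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if compare(sorted_keys[mid], key) < 0:
            lo = mid + 1
        else:
            hi = mid
    return lo


emails = ["zed@a.org", "amy@b.com", "bob@a.org"]
by_alpha = sorted(emails, key=cmp_to_key(compare_alphabetically))
by_domain = sorted(emails, key=cmp_to_key(compare_by_domain))
slot = find_slot(by_alpha, "carl@c.net", compare_alphabetically)
```

The tree logic never looks inside the keys; swapping the comparator changes the ordering of the index without touching `find_slot`.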
I am new to Swift Lang, have seen lots of tutorials, but it's not clear – my question is what's the main difference between the Array, Set and Dictionary collection type?
Here are the practical differences between the different types:
Arrays are effectively ordered lists and are used to store lists of information in cases where order is important.
For example, posts in a social network app being displayed in a tableView may be stored in an array.
Sets are different in that they have no defined order, so they are used in cases where order does not matter.
Sets are especially useful when you need to ensure that an item only appears once in the set.
Dictionaries are used to store key, value pairs and are used when you want to easily find a value using a key, just like in a dictionary.
For example, you could store a list of items and links to more information about these items in a dictionary.
Hope this helps :)
(For more information and to find Apple's own definitions, check out Apple's guides at https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/CollectionTypes.html)
Detailed documentation can be found in Apple's guide. Below are some quick definitions extracted from there:
Array
An array stores values of the same type in an ordered list. The same value can appear in an array multiple times at different positions.
Set
A set stores distinct values of the same type in a collection with no defined ordering. You can use a set instead of an array when the order of items is not important, or when you need to ensure that an item only appears once.
Dictionary
A dictionary stores associations between keys of the same type and values of the same type in a collection with no defined ordering. Each value is associated with a unique key, which acts as an identifier for that value within the dictionary. Unlike items in an array, items in a dictionary do not have a specified order. You use a dictionary when you need to look up values based on their identifier, in much the same way that a real-world dictionary is used to look up the definition for a particular word.
Old thread, yet it is worth talking about performance.
Given N elements inside an array or a dictionary, it is worth considering the cost of accessing elements and of adding or removing objects.
Arrays
Accessing a random element costs the same as accessing the first or the last, because elements are stored sequentially and are addressed directly. It will cost you 1 cycle.
Inserting an element is costly. If you insert at the beginning, every existing element has to be shifted, which costs N cycles. When inserting into the middle, the remainder needs to be shifted. It can cost you as much as N cycles in the worst case (N/2 cycles on average). If you append to the end and you have enough room in the array it will cost you 1 cycle. Otherwise the whole array will be copied, which will cost you N cycles. This is why it is important to reserve enough space for the array at the beginning of the operation.
Deleting from the end will cost you 1. Deleting from the beginning or the middle requires a shift; on average it is N/2, and N when deleting the first element.
Finding an element with a given property will cost you N/2 cycles on average.
So be very cautious with huge arrays.
Dictionaries
While Dictionaries are unordered, they can bring you some benefits here. As keys are hashed and stored in a hash table, any given operation by key will cost you 1 cycle on average. The only exception is finding an element with a given property that is not the key. It can cost you N cycles in the worst case (N/2 on average). With clever design, however, you can use the property values as dictionary keys, so the lookup will cost you 1 cycle no matter how many elements are inside.
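The "use the property as the key" design can be sketched as follows. This is an illustration in Python rather than Swift, with a list standing in for the array and a dict for the dictionary; the record layout, the `email` property and the function names are invented for the example, and the index assumes the property value is unique per element.

```python
people = [
    {"id": 1, "email": "amy@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": "cat@example.com"},
]


def find_by_scan(items, email):
    # Linear search: inspects elements one by one, up to N comparisons.
    for item in items:
        if item["email"] == email:
            return item
    return None


# Build the index once (N steps); each later lookup is a single hash probe
# on average. Assumes the property value is unique per element.
by_email = {item["email"]: item for item in people}


def find_by_key(index, email):
    # Returns None when the key is absent instead of raising.
    return index.get(email)
```

The index costs extra memory and has to be kept in sync when elements change, which is the trade-off for the constant-time lookup.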
Swift Collections - Array, Dictionary, Set
Every collection is dynamic, which is why it has some extra steps for expanding and shrinking. An Array has to allocate more memory and copy the old data into the new storage; a Dictionary additionally has to recalculate the bucket index for every object inside.
Big O (O) notation describes how the cost of an operation grows with the number of elements.
Array - ArrayList - a dynamic array of objects. It is based on an ordinary array. It is used for tasks where you very often need access by index.
get by index - O(1)
find element - O(n) - in the worst case the element you are looking for is the last one
insert/delete - O(n) - every time a tail of array is copied/pasted
Dictionary - HashTable, HashMap - stores key/value pairs. It contains buckets (an array structure, accessed by index), where each bucket contains another structure (array list, linked list, tree). Collisions are resolved here by separate chaining. The main idea is:
calculate the key's hash code (Hashable), and from this hash code calculate the index of the bucket (for example by using modulo (mod)).
Since the Hashable function returns an Int, it cannot guarantee that two different objects will have different hash codes. Moreover, the number of buckets is not equal to Int.max. When two different objects have the same hash code, or two objects with different hash codes land in the same bucket, it is a collision. That is why, once we know the index of the bucket, we have to check whether any key already there is equal to our key, and Equatable comes to the rescue. If the two keys are equal, the key/value pair is replaced; otherwise a new key/value pair is added.
find element - O(1) to O(n)
insert/delete - O(1) to O(n)
O(n) - when the hash code is the same for every object, so everything ends up in a single bucket. That is why the hash function should distribute the elements evenly.
As you can see, a HashMap doesn't support access by index, but in the other cases it has better performance.
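The bucket lookup described above can be sketched as a minimal separate-chaining map. This is written in Python for illustration and is not how any particular standard library implements its dictionary; the class name `ChainedHashMap` is hypothetical, `hash()` and `==` stand in for Hashable and Equatable, and resizing is deliberately left out.

```python
class ChainedHashMap:
    """Minimal separate-chaining hash map (no resizing, for illustration)."""

    def __init__(self, bucket_count=8):
        # Each bucket is a list of (key, value) pairs: the "chain".
        self._buckets = [[] for _ in range(bucket_count)]
        self._size = 0

    def _bucket_for(self, key):
        # Hash code -> bucket index via modulo.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:          # equal key: replace the pair
                bucket[i] = (key, value)
                return
        bucket.append((key, value))          # new key: extend the chain
        self._size += 1

    def get(self, key, default=None):
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        return default

    def __len__(self):
        return self._size


# With a single bucket every key collides, which is the O(n) worst case.
worst_case = ChainedHashMap(bucket_count=1)
worst_case.put("a", 1)
worst_case.put("b", 2)
worst_case.put("a", 3)
```

With one bucket the map still returns correct results, it just degrades to a linear scan of the chain.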
Set - hash Set. Is based on HashTable without value
*Also you are able to implement a kind of Java TreeMap/TreeSet which is sorted structure but with O(log(n)) complexity to access an element
I'm having a hard time wrapping my head around this concept of Alternative 1 vs 2/3 for data entry into an index. Here is an excerpt from some notes:
Alternative 1: Actual data record (with key value k)
– If this is used, index structure is a file organization for data records (like Heap files or sorted files).
– At most one index on a given collection of data records can use Alternative 1.
– This alternative saves pointer lookups but can be expensive to maintain with insertions and deletions.
Alternative 2: (k, rid of matching data record) and
Alternative 3: (k, list of rids of matching data records)
– Easier to maintain than Alt 1.
– If more than one index is required on a given file, at most one index can use Alternative 1; rest must use Alternatives 2 or 3.
– Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.
– Even worse, for large rid lists the data entry would have to span multiple blocks!
Can someone help me understand this by providing some concrete examples?
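One possible way to make the three alternatives concrete is a toy in-memory model. It is a sketch in Python, not a real storage engine: the list `heap_file` stands in for the data file, a record's position in it stands in for its rid, and the names, ages and helper functions are invented sample data indexed on `age`.

```python
# Hypothetical heap file: the rid (record id) is the position in this list.
heap_file = [
    {"name": "Dana", "age": 31},
    {"name": "Eli", "age": 25},
    {"name": "Fay", "age": 31},
    {"name": "Gus", "age": 25},
]

# Alternative 1: the data entries ARE the records, stored in key order.
# The file itself is organised by the index, so only one such index can exist.
alt1 = sorted(heap_file, key=lambda record: record["age"])

# Alternative 2: one (k, rid) pair per matching record.
alt2 = sorted((record["age"], rid) for rid, record in enumerate(heap_file))

# Alternative 3: one (k, list of rids) entry per distinct key value.
alt3 = {}
for age, rid in alt2:
    alt3.setdefault(age, []).append(rid)


def names_via_alt2(age):
    # Each match costs an extra hop from the rid to the record.
    return [heap_file[rid]["name"] for k, rid in alt2 if k == age]


def names_via_alt3(age):
    # Same hop, but the key is stored once however many records match.
    return [heap_file[rid]["name"] for rid in alt3.get(age, [])]
```

In this model Alternative 1 needs no rid hop because the records sit in the index itself, while inserting a record means re-ordering the records rather than just adding a small entry. Alternative 3 stores each key once, at the price of entries whose size depends on how many records share the key.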
Many people use arrays extensively in Excel/VBA to store a list of data. However, there is the collection object, which in my view is MUCH MUCH more convenient (mainly: you don't need to define or redefine the length of the list).
So, I am sincerely asking myself if I am missing something? Why do other people still use arrays to store a list of data? Is it simply a hangover of the past?
Several reasons to use arrays instead of collections (or dictionaries):
you can easily transfer an array to a range (and vice versa) with Range("A1:B12") = MyArray
collections can store only unique keys whereas arrays can store any value
collections have to store a couple (key, value) whereas you can store whatever in an array
See Chip Pearson's article about arrays for a better understanding
A better question would rather be why people would use collections over dictionaries (ok, collections are standard VBA whereas you have to import dictionaries)
@CharlesWilliams's answer is correct: looping through all the values of an array is faster than iterating a Collection or Dictionary, so much so that I always use the Keys() or Items() method of a dictionary when I need to do that; both methods return a vector array.
A note: I use the Dictionary class far more than I use collections, the Exists() method is just too useful.
There are, of course, drawbacks to collections and dictionaries. One of them is that arrays can be 2- or even 3-Dimensional - a much better data structure for tabulated data. You can store arrays as members of a collection, but there are some downsides to that: one of them is that you might not be getting a reference to the item - unless you use arrItem = MyDictionary(strKey) you will almost certainly get a 'ByVal' copy of the array; that's bad if your data is dynamic, and subject to change by multiple processes. It's also slow: lots of allocation and deallocation.
Worst of all, I don't quite trust VBA to deallocate the memory if I have a collection or dictionary with arrays (or objects!) as members: not on out-of-scope, not by Set objCollection = Nothing, not even by objDictionary.RemoveAll - it's difficult to prove that the problem exists with the limited testing toolkit available in the VBE, but I've seen enough memory leaks in applications that used arrays in dictionaries to know that you need to be cautious. That being said, I never use an array without an Erase command somewhere.
@JMax has explained the other big plus for arrays: you can populate an array in a single 'hit' to the worksheet, and write back your work in a single 'hit'.
You can, of course, get the best of both worlds by constructing an Indexed Array class: a 2-dimensional array with associated collection or dictionary objects storing some kind of row identifier as the keys, and the row ordinals as the data items.
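A rough sketch of that Indexed Array idea, in Python rather than as a VBA class module, might look like the following. The class name `IndexedArray`, its methods and the sample rows are all hypothetical; the dict plays the role of the dictionary that maps row identifiers to row ordinals, and the sketch assumes the identifiers are unique.

```python
class IndexedArray:
    """2-D rows plus a key -> row ordinal lookup (hypothetical helper)."""

    def __init__(self, rows, key_column=0):
        self.rows = [list(row) for row in rows]   # the 2-D array
        self.key_column = key_column
        # Row identifier -> row ordinal; assumes identifiers are unique.
        self.ordinals = {row[key_column]: n for n, row in enumerate(self.rows)}

    def row(self, key):
        # One hash lookup for the ordinal, then direct array access.
        return self.rows[self.ordinals[key]]

    def set_cell(self, key, column, value):
        if column == self.key_column:
            raise ValueError("changing the key column would stale the index")
        self.rows[self.ordinals[key]][column] = value


table = IndexedArray([
    ["alpha", 10, 1.5],
    ["beta", 20, 2.5],
    ["gamma", 30, 3.5],
])
table.set_cell("beta", 1, 25)
```

The 2-D data stays in one block that can be written back in a single step, while keyed access goes through the small lookup instead of a loop.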
Collections that auto-resize are slower (theoretically speaking, different implementations will obviously have their own mileage). If you know you have a set number of entries and you only need to access them in a linear fashion then a traditional array is the correct approach.