I have this code, which works fine, but is slow on large datasets.
I'd like to hear from the experts if this code could benefit from using Linq, or another method, and if so, how?
Dim array_of_strings As String()
' now I add strings to my array, these come from external file(s).
' This does not take long
' Throughout the execution of my program, I need to validate millions
' of other strings.
Dim search_string As String
Dim indx As Integer
' So we get million of situation like this, where I need to find out
' where in the array I can find a duplicate of this exact string
search_string = "the_string_search_for"
indx = array_of_strings.ToList().IndexOf(search_string)
Each of the strings in my array is unique; there are no duplicates.
This works pretty well, but like I said, it is too slow for larger datasets. I am running this query millions of times. Currently it takes about 1 minute for a million queries, but that is too slow for my liking.
There's no need to use LINQ.
If you used a keyed data structure like a Dictionary, each lookup would be O(1) on average (or O(log n) for a sorted structure), at the cost of a slightly longer one-time step of filling the structure. But you fill it once and then do a million searches, so you come out ahead.
See the description of Dictionary at this site:
https://msdn.microsoft.com/en-us/library/7y3x785f(v=vs.110).aspx
Since (I think) you're talking about a collection that is its own key, you could save some memory by using SortedSet<T>
https://msdn.microsoft.com/en-us/library/dd412070(v=vs.110).aspx
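As a rough sketch of that idea, using the variable names from the question (untested against your data): build a Dictionary that maps each string to its position once, then each lookup is a hash lookup instead of a linear scan.
Dim positions As New Dictionary(Of String, Integer)()
For i As Integer = 0 To array_of_strings.Length - 1
    positions.Add(array_of_strings(i), i)   ' strings are unique, so Add will not throw
Next
' Later, for each of the million lookups:
If Not positions.TryGetValue(search_string, indx) Then indx = -1   ' -1 mirrors IndexOf's "not found" result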
No, I don't think it can benefit from LINQ.
LINQ queries are relatively slow.
You might try to multithread it, however.
Related
I think it's probably a simple answer, but I thought I'd quickly check...
Let's say I'm adding Ints to an array at various points in my code, and then later I want to find out whether the array contains a certain Int...
var array = [Int]()
array.append(2)
array.append(4)
array.append(5)
array.append(7)
if array.contains(7) { print("There's a 7 alright") }
Is this heavier, performance-wise, than if I created a dictionary?
var dictionary = [Int:Int]()
dictionary[7] = 7
if dictionary[7] != nil { print("There's a value for key 7")}
Obviously there are other reasons to prefer one over the other, like wanting to eliminate the possibility of duplicate entries of the same number... but I could also do that with a Set. I'm mainly just wondering about the performance of dictionary[key] vs array.contains(value).
Thanks for your time
Generally speaking, Dictionaries provide constant-time, i.e. O(1), access on average, which means checking whether a value exists and updating it are faster than with an Array, where those operations can be O(n). If those are things that you need to optimize for, then a Dictionary is a good choice. However, since dictionaries enforce uniqueness of keys, you cannot insert multiple values under the same key.
Based on the question, I would recommend for you to read Ray Wenderlich's Collection Data Structures to get a more holistic understanding of data structures than I can provide here.
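If all you need are fast membership tests and you don't have real key/value pairs, a Set gives you the same average O(1) lookup as a Dictionary without the dummy values. A minimal sketch (values taken from the question):
var seen = Set<Int>()
seen.insert(2)
seen.insert(4)
seen.insert(5)
seen.insert(7)
if seen.contains(7) { print("There's a 7 alright") }   // hashed lookup, O(1) on average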
I did some sampling!
I edited your code so that the print statements are empty.
I ran the code 1,000,000 times. Every time I measured how long it takes to access the dictionary and the array separately. Then I subtracted dictTime from arrTime (arrTime - dictTime) and saved this number each time.
Once it finished, I took the average of the results.
The result is 23150, meaning that over 1,000,000 tries the array was faster to access by 23150 nanoseconds on average.
The max difference was 2426737 and the min was -5711121.
Here are the results on a graph (graph not included here):
Introduction
My collection has more than 1 million documents. Each document's structure is identical and looks like this:
{_id: "LiTC4psuoLWokMPmY", number: "12345", letter: "A", extra: [{eid:"jAHBSzCeK4SS9bShT", value: "Some text"}]}
So, as you can see, my extra field is an array that contains small objects. I'm trying to insert as many of these objects as possible (until I get close to the 16 MB document limit). These objects are usually present in the extra array of most documents in the collection, so I usually have hundreds of thousands of copies of the same objects.
I have an index on eid key in the extra array. I created this index by using this:
db.collectionName.createIndex({"extra.eid":1})
Problem
I want to count how many documents contain a given extra object. I'm doing it like this:
db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}).count()
In the beginning, the query above is very fast. But whenever the extra array gets a little bigger (more than 20 objects), it gets really slow.
With 3-4 objects it takes less than 100 milliseconds, but as the array grows it takes a lot more time. With 50 objects, it takes 6238 milliseconds.
Questions
Why is this happening?
How can I make this process faster?
Is there any other way that does this process but faster?
I ran into a similar problem. I bet your query isn't hitting your index.
You can do an explain (run db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}).explain() in the Mongo shell) to know for sure.
The reason is that in Mongo, db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}) is not the same as db.collectionName.find({"extra.eid": "jAHBSzCeK4SS9bShT"}). The first form asks for array elements that are exactly equal to the document {eid: "jAHBSzCeK4SS9bShT"} (nothing more, nothing less) and won't use your index on extra.eid, while the second form queries the eid field inside each array element and can use that index.
In my own case, I didn't find any solution except indexing the entire subdocument.
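For reference, here is roughly how the dot-notation query and the plan check would look in the shell (values copied from the question; treat this as a sketch rather than something tested against your data):
// Query the eid field inside the array elements so the multikey index can be considered
db.collectionName.find({"extra.eid": "jAHBSzCeK4SS9bShT"}).count()
// Confirm the winning plan is an index scan (IXSCAN) rather than a collection scan (COLLSCAN)
db.collectionName.find({"extra.eid": "jAHBSzCeK4SS9bShT"}).explain("executionStats")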
I am having a difficult time producing a script that makes all substrings (words) within a string upper case if they contain no vowels.
For example:
'Hammer Products Llc' should be: 'Hammer Products LLC'.
Or:
'49 Ways Ltd' should be: '49 Ways LTD'.
As an application developer, I am still working on the concept of T-SQL set-based processing and avoiding iteration whenever possible. So aside from the task of identifying those rows that have words within them that are missing vowels... for the life of me I cannot think of a set-based way to identify and then update those substrings other than iterating through them.
So far I am working on the first part: trying to identify those rows that have substrings missing vowels. My only thought is to take each string and Split() it with a custom function... then take each of the split words and test whether it is all consonants, and if so, update that word.
My major concern is that this approach will be very expensive to process. This is a real brain-twister, and any help in the right direction would be greatly appreciated!
You can readily find which strings have a word with no vowels with something like:
where ' ' + lower(str) + ' ' not like '% %[aeiou]% %'
Do note that this doesn't take punctuation into account. That gets a bit harder, because SQL Server doesn't support regular expressions.
Modifying a part of a string to be upper case is much, much harder. Your idea of using split() is definitely one approach. Another is to write a user-defined function.
My recommendation, though, is to do this work in another tool. If you are learning SQL, try using it for tasks it is more suited for.
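That said, if you want to stay in T-SQL, here is a rough set-based sketch of the update side. It assumes SQL Server 2022+ (for the ordinal argument of STRING_SPLIT) and a hypothetical table dbo.Companies(Id, CompanyName), and, like the LIKE test above, it ignores punctuation:
SELECT c.Id,
       STRING_AGG(CASE WHEN s.value NOT LIKE '%[aeiouAEIOU]%'
                       THEN UPPER(s.value)     -- word has no vowels: upper-case it
                       ELSE s.value
                  END, ' ') WITHIN GROUP (ORDER BY s.ordinal) AS FixedName
FROM dbo.Companies AS c
CROSS APPLY STRING_SPLIT(c.CompanyName, ' ', 1) AS s   -- 1 = include the ordinal column
GROUP BY c.Id;
On older versions you would need your own ordered splitter, which is where the custom Split() function you mention comes in.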
I've been using VBA for about a month now, and this forum has been a great resource for my first "programming" language. As I've started to get more comfortable with VBA arrays, I've begun to wonder what the best way to store variables is, and I'm sure someone here knows the answer to what's probably a programming newb question:
Is there any difference, say, between having 10 String variables used independently of each other and an array of String variables used independently of each other (by independent I mean their position in the array doesn't matter for their use in the program)? There are bits of code I use where I might have around 9 public variables. Is there any advantage to declaring them as an array, despite the fact that I don't need to preserve their order vis-à-vis one another? e.g. I could have
Public x As String
Public y As String
Public v As String
Public w As String
Or
Public arr(1 to 4) As String
arr(1) = x
arr(2) = y
arr(3) = v
arr(4) = w
In terms of what I need to do with the code, these two versions are functionally equivalent. But is there a reason to use one rather than the other?
Connected to this, I can transpose an array into an Excel range and use xlUp and xlDown to move around the various values in the array. But I can also move through arrays in similar ways by looking for elements with a particular value or position in an array held "abstractly."* Sometimes I find it easier to manipulate array values once they have been transposed into a worksheet, using xlUp and xlDown. Apart from having to set aside dedicated worksheet space to do this, is this worse (time, processing power, reliability, etc.) than looping through an "abstract"* array (with Application.ScreenUpdating = False)?
*This may mean something technical to mathematicians/ serious programmers - I'm trying to say an array that doesn't use the visual display of the worksheet grid.
EDIT:
Thank you for your interesting answers. I'm not sure whether the second part of my question counts as an entirely separate question (and I'm therefore breaking a rule of the forum) or whether it is connected, but I would be very happy to accept the answer that also considers it.
Unless you need to refer to them sequentially or by index dynamically, do not use an array as a grouping of scratch variables. It is harder to read.
Memory-wise they should be nearly identical, with slightly more overhead for the array.
As others have noted, there's no need to use arrays for variables which are not related or part of a "set" of values. If, however, you find yourself doing this:
Dim email1 as String, email2 as String, email3 as String, _
email4 as String, email5 as String
then you should consider whether an array would be a better approach.
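For instance, a rough sketch of the array version (the range is just an example):
Dim emails(1 To 5) As String
Dim i As Long
For i = 1 To 5
    emails(i) = Cells(i, 1).Value   ' or wherever the addresses come from
Next i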
To the second part of your question: if you're using arrays in your VBA, it would be preferable to work with them directly in memory rather than dumping them to a worksheet and navigating them from there.
Keeping everything in-memory is going to be faster, and removes dependencies such as having to ensure there's a "scratch" worksheet around: such dependencies make your code less re-usable and more brittle.
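As a minimal illustration of the in-memory approach (the sheet name, range, and search value are just placeholders):
Dim data As Variant
Dim r As Long
data = Worksheets("Data").Range("A1:A1000").Value   ' one read from the sheet into memory
For r = 1 To UBound(data, 1)
    If data(r, 1) = "target value" Then
        Debug.Print "Found at row " & r
        Exit For
    End If
Next r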
A question on Variants. I'm aware that Variants in Excel VBA are both the default data type and also inefficient (at least when overused in large apps). However, I regularly use them for storing data in arrays that have multiple data types. A current project I am working on is essentially a task that requires massive optimisation of very poor code (c. 7000 lines), and it got me thinking: is there a way around this?
To explain: the code frequently stores data in array variables. So consider a dataset of 10 columns by 10,000 rows. The columns are of multiple different data types (string, double, integer, date, etc.). Assuming I want to store these in an array, I would usually write:
Dim myDataSet(10, 10000) As Variant
But my understanding is that this will be really inefficient, with the code evaluating each item to determine what data type it is (when in practice I know what I'm expecting). Plus, I lose the control that dimensioning individual data types gives me. So (assuming, for ease of explanation, the first 6 columns are strings and the next 4 are doubles), I could write:
Dim myDSstrings(6, 10000) As String
Dim myDSdoubles(4, 10000) As Double
This gives me back the control and efficiency, but it is also a bit clunky (in practice the types are mixed and interleaved, so I end up with an odd number of elements in each array and have to assign them individually in the code rather than en masse). So it's a case of:
myDSstrings(1,r) = cells(r,1)
myDSdoubles(2,r) = cells(r,2)
myDSstrings(2,r) = cells(r,3)
myDSstrings(3,r) = cells(r,4)
myDSdoubles(3,r) = cells(r,5)
..etc...
Which is a lot uglier than:
myDataSet(c,r) = cells(r,c)
So it got me thinking: I must be missing something here. What is the optimal way to store an array of different data types? Or, assuming there is no way of doing it, what would be best coding practice for storing an array of mixed data types?
Never optimize your code without measuring first. You might be surprised where the code is slowest. I use the PerfMon utility from Professional Excel Development, but you can also roll your own.
Reading and writing to and from Excel Ranges is a big time sink. Even though Variants can waste a lot of memory, this
Dim vaRange As Variant
vaRange = Sheet1.Range("A1:E10000").Value
'do something to the array
Sheet1.Range("A1:E10000").Value = vaRange
is generally faster than looping through rows and cells.
My preferred method for using arrays with multiple data types is to not use arrays at all. Rather, I'll use a custom class module and create properties for the elements. That's not necessarily a performance boost, but it makes the code much easier to write and read.
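As a rough sketch of that pattern (the class name, property names, and field types here are made up for illustration): a class module, say CRecord, exposes one typed property per column, and a Collection holds one CRecord per row.
' Class module: CRecord
Private mName As String
Private mAmount As Double

Public Property Get Name() As String
    Name = mName
End Property
Public Property Let Name(ByVal value As String)
    mName = value
End Property

Public Property Get Amount() As Double
    Amount = mAmount
End Property
Public Property Let Amount(ByVal value As Double)
    mAmount = value
End Property

' In a normal module: one typed object per row, kept in a Collection
Dim rec As CRecord
Dim records As New Collection
Set rec = New CRecord
rec.Name = "Widget"
rec.Amount = 9.99
records.Add rec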
I'm not sure your bottleneck comes from the Variant typing of your array.
By the way, to set values from an array into an Excel range, you should use (in Excel 8, i.e. Excel 97, or higher):
Range("A1:B2") = myArray
On previous versions, you should use the following code:
Sub SuperBlastArrayToSheet(TheArray As Variant, TheRange As Range)
    With TheRange.Parent.Parent 'the workbook the range is in
        .Names.Add Name:="wstempdata", RefersToR1C1:=TheArray
        With TheRange
            .FormulaArray = "=wstempdata"
            .Copy
            .PasteSpecial Paste:=xlValues
        End With
        .Names("wstempdata").Delete
    End With
End Sub
This code comes from this source, which you should read for VBA optimization.
Yet, you should profile your app to see where your bottlenecks are. See this question from Issun to help you benchmark your code.