Optimizing GAE Datastore querying repeated fields - arrays

In one of my NDB models, I have one field whose type is ndb.JsonProperty(repeated=True, indexed=False).
This field is an array that in the average case holds about 800 elements, each of which looks like {ac: "i", e: [0, 3], ls: ["a"], s: [0, 2], sn: 9, ts: "2018-06-25T22:35:04.855Z"}. These represent edits made in a text editor.
In some parts of my application just a summary of these elements is needed. For example: how many elements have the ac property equal to 'x' or how many seconds elapsed between the first and the last element of the array.
My problem is that sometimes I need to fetch several models, each of them with an array containing about 800 of the elements described above. So, say I need to process 40 models * 800 elements = 32,000 elements – this already causes my application to use about 150MB of memory, while GAE only allows 128MB.
I tried an optimization: I summarize the data once and store the result in a separate property, because at some point the array stops changing. This lets me use less memory by not touching the array at all. But the summary may become stale when the array is updated – this can be detected by taking the timestamp of the last element in the array and comparing it to the summary timestamp: if the last element is more recent, the summary is stale.
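Roughly, the idea looks like this (a minimal sketch; the model and property names are illustrative, not my real code):

from google.appengine.ext import ndb

class Doc(ndb.Model):
    # ~800 small dicts like {"ac": "i", ..., "ts": "2018-06-25T22:35:04.855Z"}
    edits = ndb.JsonProperty(repeated=True, indexed=False)
    summary = ndb.JsonProperty(indexed=False)        # cached aggregate
    summary_ts = ndb.StringProperty(indexed=False)   # ts of the last summarized edit

    def get_summary(self):
        # The summary is fresh as long as it covers the last edit's timestamp.
        if self.summary and self.edits and self.summary_ts == self.edits[-1]['ts']:
            return self.summary
        edits = self.edits or []
        self.summary = {
            'x_count': sum(1 for e in edits if e.get('ac') == 'x'),
            'first_ts': edits[0]['ts'] if edits else None,
            'last_ts': edits[-1]['ts'] if edits else None,
        }
        self.summary_ts = self.summary['last_ts']
        self.put()
        return self.summary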
My question is: in Python 2.7 – more specifically when using NDB – will the whole array be retrieved when I access the last element, as in elements[-1]? Or is there a way to lazily load the list from the last to the first element? Or is there a way to optimize the data, so that I don't use so much memory when I need to process it?

Related

Are there Erlang arrays "with a defined representation"?

Context:
Erlang programs running on heterogeneous nodes, retrieving and storing data from Mnesia databases. These database entries are meant to be used for a long time (e.g. across multiple Erlang version releases) and remain in the form of Erlang objects (i.e. no serialization). Among the information stored, there are
currently two uses for arrays:
Large (up to 16384 elements) arrays. Fast access to an element
using its index was the basis for choosing this type of collection.
Once the array has been created, the elements are never modified.
Small (up to 64 elements) arrays. Accesses are mostly done using indices, but there are also some iterations (foldl/foldr). Both reading and replacement of elements are done frequently. The size of the collection remains constant.
Problem:
Erlang's documentation on arrays states that "The representation is not
documented and is subject to change without notice." Clearly, arrays should not be used in my context: database entries containing arrays may be
interpreted differently depending on the node executing the program and
unannounced changes to how arrays are implemented would make them unusable.
I have noticed that Erlang features "ordsets"/"orddict" to address a similar
issue with "sets"/"dict", and am thus looking for the "array" equivalent. Do you know of any? If none exists, my strategy is likely going to be using lists of lists to replace my large arrays, and orddict (with the index as key) to replace the smaller ones. Is there a better solution?
An array is a tuple of nested tuples and integers, with each tuple having a fixed size of 10 and representing a segment of cells. Where a segment is not currently in use, an integer (10) acts as a placeholder. This, without the abstraction, is I suppose the closest equivalent. You could indeed copy the array module from OTP into your own app, and it would then have a stable representation.
As for what you should use instead of array, it depends on the data and what you will do with it. If the data that would go into your array is fixed, then a tuple makes sense: it has constant access time for reads/lookups. Otherwise a list sounds like a winner, be it a list of lists, a list of tuples, etc. However, once again, that's a shot in the dark, because I don't know your data or how you use it.
See the implementation here: https://github.com/erlang/otp/blob/master/lib/stdlib/src/array.erl
Also see Robert Virding's answer on the implementation of array here: Arrays implementation in erlang
And what Fred Hebert says about the array in A Short Visit to Common Data Structures
An example showing the structure of an array:
1> A1 = array:new(30).
{array,30,0,undefined,100}
2> A2 = array:set(0, true, A1).
{array,30,0,undefined,
{{true,undefined,undefined,undefined,undefined,undefined,
undefined,undefined,undefined,undefined},
10,10,10,10,10,10,10,10,10,10}}
3> A3 = array:set(19, true, A2).
{array,30,0,undefined,
{{true,undefined,undefined,undefined,undefined,undefined,
undefined,undefined,undefined,undefined},
{undefined,undefined,undefined,undefined,undefined,
undefined,undefined,undefined,undefined,true},
10,10,10,10,10,10,10,10,10}}
4>

Difference between Array, Set and Dictionary in Swift

I am new to Swift and have seen lots of tutorials, but it's still not clear to me: what is the main difference between the Array, Set and Dictionary collection types?
Here are the practical differences between the different types:
Arrays are effectively ordered lists and are used to store lists of information in cases where order is important.
For example, posts in a social network app being displayed in a tableView may be stored in an array.
Sets are different in the sense that order does not matter and these will be used in cases where order does not matter.
Sets are especially useful when you need to ensure that an item only appears once in the set.
Dictionaries store key-value pairs and are used when you want to easily find a value using a key, just like in a real dictionary.
For example, you could store a list of items and links to more information about these items in a dictionary.
Hope this helps :)
(For more information and to find Apple's own definitions, check out Apple's guides at https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/CollectionTypes.html)
Detailed documentation can be found here in Apple's guide. Below are some quick definitions extracted from there:
Array
An array stores values of the same type in an ordered list. The same value can appear in an array multiple times at different positions.
Set
A set stores distinct values of the same type in a collection with no defined ordering. You can use a set instead of an array when the order of items is not important, or when you need to ensure that an item only appears once.
Dictionary
A dictionary stores associations between keys of the same type and values of the same type in a collection with no defined ordering. Each value is associated with a unique key, which acts as an identifier for that value within the dictionary. Unlike items in an array, items in a dictionary do not have a specified order. You use a dictionary when you need to look up values based on their identifier, in much the same way that a real-world dictionary is used to look up the definition for a particular word.
Old thread, yet it is worth talking about performance.
With N elements inside an array or a dictionary, it is worth considering the performance when you access elements or add or remove objects.
Arrays
Accessing a random element costs the same as accessing the first or the last, as elements follow each other sequentially and are addressed directly. It will cost you 1 cycle.
Inserting an element is costly. Inserting at the beginning or in the middle means the rest of the elements need to be shifted, which can cost as much as N cycles in the worst case (N/2 cycles on average). If you append to the end and there is enough room in the array, it will cost you 1 cycle; otherwise the whole array is copied, which costs N cycles. This is why it is important to reserve enough space for the array at the beginning of the operation.
Deleting from the end costs 1 cycle. Deleting from the beginning or the middle requires a shift of the remainder, on average N/2 cycles.
Finding an element with a given property will cost you N/2 cycles on average.
So be very cautious with huge arrays.
Dictionaries
While dictionaries are unordered, they can bring you some benefits here. As keys are hashed and stored in a hash table, any operation by key will cost you 1 cycle. The only exception is finding an element by some other property: that can cost up to N cycles in the worst case (N/2 on average). With clever design, however, you can use those property values as the dictionary keys, so the lookup will cost you only 1 cycle no matter how many elements are inside.
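To illustrate that last point with a small sketch (Python here just for brevity; the idea is identical in Swift): build a dictionary keyed by the property once, and every later lookup by that property becomes a single hash lookup instead of a scan.

# Hypothetical example: looking up records by their id property.
users = [{'id': 7, 'name': 'Ann'}, {'id': 42, 'name': 'Bo'}]

# Scanning the array: on average N/2 comparisons per lookup.
found = next(u for u in users if u['id'] == 42)

# Index the property as the dictionary key once; lookups then cost 1 "cycle".
by_id = {u['id']: u for u in users}
found = by_id[42]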
Swift Collections - Array, Dictionary, Set
Every collection is dynamic, which is why there are extra steps for expanding and collapsing: an Array has to allocate more memory and copy the old data into the new storage, and a Dictionary additionally has to recalculate the bucket indexes for every object inside.
Big O notation describes the performance of an operation.
Array - ArrayList - a dynamic array of objects. It is based on a plain array and is used for tasks where you very often need access by index.
get by index - O(1)
find element - O(n) - in the worst case you are looking for the last element
insert/delete - O(n) - the tail of the array has to be shifted/copied every time
Dictionary - HashTable, HashMap - stores key/value pairs. It contains buckets (an array structure, accessed by index), each of which holds another structure (array list, linked list, tree). Collisions are resolved by separate chaining. The main idea is:
calculate the key's hash code (Hashable), and from this hash code compute the index of the bucket (for example using the modulo operation).
Since the hash function returns an Int, it cannot guarantee that two different objects have different hash codes; moreover, the number of buckets is much smaller than Int.max. When two different objects have the same hash code, or two objects with different hash codes end up in the same bucket, that is a collision. That is why, once we know the bucket index, we have to check whether anything in it has the same key as ours, and Equatable comes to the rescue: if the keys are equal, the key/value entry is replaced; otherwise a new key/value entry is added (a toy sketch of this scheme follows below).
find element - O(1) to O(n)
insert/delete - O(1) to O(n)
O(n) - in the case where every object has the same hash code, so everything ends up in a single bucket; the hash function should therefore distribute the elements evenly
As you can see, a HashMap does not support access by index, but for the other operations it has better performance
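To make the separate-chaining idea concrete, here is a toy sketch (Python for brevity; this shows the general scheme, not Swift's actual Dictionary implementation):

class ToyHashMap(object):
    """Minimal separate-chaining hash table: an array of buckets,
    each bucket being a list of (key, value) pairs."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # hash code -> bucket index, here simply via modulo
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                   # equality resolves collisions
                bucket[i] = (key, value)   # same key: replace the entry
                return
        bucket.append((key, value))        # new key: add a new entry

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None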
Set - HashSet - based on a HashTable, but without values.
*You can also implement something like Java's TreeMap/TreeSet, which is a sorted structure, but with O(log(n)) complexity to access an element.

MongoDB is extremely slow when using count() on array

Introduction
My collection has more than 1 million documents. Each document's structure is identical and looks like this:
{_id: "LiTC4psuoLWokMPmY", number: "12345", letter: "A", extra: [{eid:"jAHBSzCeK4SS9bShT", value: "Some text"}]}
So, as you can see, my extra field is an array that contains small objects. I'm trying to insert as many of these objects as possible (until I get close to the 16MB document limit). These objects are usually present in the extra array of most documents in the collection, so I usually have hundreds of thousands of the same objects.
I have an index on eid key in the extra array. I created this index by using this:
db.collectionName.createIndex({"extra.eid":1})
Problem
I want to count how many of these extra field objects are present in the collection. I'm doing it by using this:
db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}).count()
In the beginning, the query above is very fast. But whenever the extra array gets a little bigger (more than 20 objects), it gets really slow.
With 3-4 objects it takes less than 100 milliseconds, but when the array gets bigger it takes a lot more time. With 50 objects, it takes 6238 milliseconds.
Questions
Why is this happening?
How can I make this process faster?
Is there another way to do this that is faster?
I ran into a similar problem. I bet your query isn't hitting your index.
You can do an explain (run db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}).explain() in the Mongo shell) to know for sure.
The reason is that in Mongo db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}) is not the same as db.collectionName.find({"extra.eid": "jAHBSzCeK4SS9bShT"}). The first form won't use your index, while the second form will (I give it only as an example; it wouldn't work as-is in your case, because your subdocument is actually an array). Not sure why, but this seems to be a quirk of Mongo's query builder.
I didn't find any solution except for indexing the entire subdocument.
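For reference, here is a rough sketch of the same checks from Python with pymongo (collection and field names are taken from the question; purely illustrative):

from pymongo import MongoClient, ASCENDING

coll = MongoClient()['test']['collectionName']   # hypothetical connection and db name

# Multikey index on the array field, like db.collectionName.createIndex({"extra.eid": 1})
coll.create_index([('extra.eid', ASCENDING)])

# Dot notation matches any element of the extra array and can use that index.
n = coll.count_documents({'extra.eid': 'jAHBSzCeK4SS9bShT'})

# Inspect the winning plan: you want an IXSCAN here, not a COLLSCAN.
plan = coll.find({'extra.eid': 'jAHBSzCeK4SS9bShT'}).explain()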

algorithm/data structure for this "enumerating all possibilities" task (combinatorial objects)

This is probably a common question that arises in search/store situations and there is a standard answer. I'm trying to do this from intuition and am somewhat out of my comfort zone.
I'm attempting to generate all of a certain kind of combinatorial object. Each object of size n can be generated from an object of size n-1, usually in multiple ways. From the single object of size 2, my search generates 6 objects of size 3, about 140 objects of size 4, and about 29,000 objects of size 5. As I generate the objects, I store them in a globally declared array. Before storing each object, I have to check all the previous ones stored for that size, to make sure I didn't generate it already from an earlier (n-1)-object. I currently do this in a naive way, which is just that I go through all the objects currently sitting in the array and compare them to the one currently being generated. Only if it's different from every single one there do I add it to the array and increment the number of objects currently in there. The new object is just added as the most recent object in the array, it is not sorted, and so this is obviously inefficient, and I can't hope to generate the objects of size 6 in this way.
(To give an idea of the problem of the growth of the array: the first couple of 4-objects, from among the 140 or so, give rise to over 2000 new 5-objects in a fraction of a second. By the time I've gotten to the last few 4-objects, with over 25,000 5-objects already stored, each 4-object generates only a handful of previously unseen 5-objects, but takes several seconds for the process for each 4-object. There is very little correlation between the order I generate new objects in, and their eventual position as a consequence of the comparison function I'm using.)
Obviously if I had a sorted array of objects, it would be much more efficient to find out whether I'm looking at a new object: using a binary midpoint search strategy I'd only have to look at roughly log_2(n) of the n objects currently stored, instead of all n of them. But placing the newly generated object at the right place in an array means moving half of the existing ones, on average, to make room for it. (I would implement this with an array of pointers pointing to the unsorted array of object structs, so that I only had to move pointers instead of moving data, but it still seems like a lot of pointers to have to repoint at each insert.)
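To make the sorted-array option concrete, here is a hypothetical sketch (Python for brevity; the objects stand in for my structs and are compared with the same comparison function):

import bisect

store = []   # kept sorted at all times

def add_if_new(obj):
    """Binary search for obj: O(log n) comparisons to locate it,
    but the insert itself still shifts O(n) entries on average."""
    i = bisect.bisect_left(store, obj)
    if i < len(store) and store[i] == obj:
        return False          # duplicate: already generated earlier
    store.insert(i, obj)      # the costly shift discussed above
    return True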
The other option would be to place the objects in a linked list, as insertion is very cheap in that situation. But then I wouldn't have random access to the elements in the linked list--you can only find the right place to insert the newly generated object (if it's actually new) by traversing the list node by node and comparing. On average you'd have to traverse half the list before finding the right insertion point, which doesn't sound any better than repointing half the pointers.
Is there a third choice I'm missing? I could accomplish this very easily if I had both random access to stored elements so I could find the insertion point quickly (in log_2(n) steps), and I could insert new objects very cheaply, like in a linked list. Am I dreaming?
To summarise: I need to be able to determine whether an object is new or duplicates an existing one, and I need to be able to insert an object at the right place. I don't ever need to delete an object. Thank you.

Sorting n sets of data into one

I have n arrays of data, each of these arrays is sorted by the same criteria.
The number of arrays will, in almost all cases, not exceed 10, so it is a relatively small number. Each array, however, can contain a large number of objects, which should be treated as unbounded for the algorithm I am looking for.
I now want to treat these arrays as if they were one array. However, I need a way to retrieve the objects in a given range as fast as possible, without touching all objects before the range and/or all objects after it. Therefore it is not an option to iterate over all objects and store them in one single array. Fetches with low start values are also more likely than fetches with high start values. So e.g. fetching objects [20,40) is much more likely than fetching objects [1000,1020), but the latter can happen.
The range itself will be pretty small, around 20 objects, or can be increased, if relevant for the performance, as long as this does not hit the limits of memory. So I would guess a couple of hundred objects would be fine as well.
Example:
3 arrays, each containing a couple of thousand entries. I now want to get the overall objects in the range [60, 80) without touching either the first 60 objects of each array or the objects that come after position 80.
I am thinking about some sort of combined, modified binary search. My current idea is something like the following (note, that this is not fully thought through yet, it is just an idea):
get object 60 of each array - the beginning of the range cannot lie after that position, as every single array on its own already has 60 objects before it
use these objects as the maximum value for the binary search in every array
from one of the arrays, get the centered object (e.g. 30)
with a binary search in all the other arrays, try to find the object in each array, that would be before, but as close as possible to the picked object.
we now have 3 objects, e.g. at positions 15, 10 and 20. The sum of these positions is 45, so there are 42 objects in front of them, which is more than the beginning of the range we are looking for (30). We continue our binary search in the remaining left half of one of the arrays
if we instead get a value where the sum is smaller than the beginning of the range we are looking for, we continue our search on the right.
at some point we will hit object 30. From there on, we can simply add the objects from each array, one by one, with an insertion sort until we hit the range length.
My questions are:
Is there any name for this kind of algorithm I described here?
Are there other algorithms or ideas for this problem, that might be better suited for this issue?
Thanks in advance for any idea or help!
People usually call this problem something like "selection in the union of multiple sorted arrays". One of the questions in the sidebar is about the special case of two sorted arrays, and this question is about the general case. Several comparison-based approaches appear in the combined answers; they more or less have to determine where the lower endpoint in each individual array is. Your binary search answer is one of the better approaches; there's an asymptotically faster algorithm due to Frederickson and Johnson, but it's complicated and not obviously an improvement for small ranks.
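To give a rough picture of the binary-search flavour of approach (not Frederickson-Johnson), here is a sketch in Python: it skips whole blocks of elements that lie below the requested start rank by probing one position per array, then lazily merges only the requested window. The helper name and the exact probing scheme are mine, meant purely as an illustration.

from heapq import merge
from itertools import islice

def select_range(arrays, start, length):
    """Return the elements with ranks [start, start+length) in the sorted
    union of the sorted input arrays, without merging everything before
    the range element by element. Elements must be mutually comparable."""
    if not arrays or length <= 0:
        return []
    lo = [0] * len(arrays)   # index of the first not-yet-skipped element per array
    k = start                # number of elements still to skip
    while k > 0:
        step = max(1, k // len(arrays))
        best = None          # (probe value, array index, block size)
        for i, a in enumerate(arrays):
            remaining = len(a) - lo[i]
            if remaining == 0:
                continue
            t = min(step, remaining)
            probe = a[lo[i] + t - 1]
            if best is None or probe < best[0]:
                best = (probe, i, t)
        if best is None:     # fewer than `start` elements exist in total
            return []
        _, i, t = best
        lo[i] += t           # these t elements all belong before the requested range
        k -= t
    # Lazily merge the remaining fronts and take the next `length` elements.
    tails = [islice(a, lo[i], None) for i, a in enumerate(arrays)]
    return list(islice(merge(*tails), length))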

Resources