Best way to store a lot of geopoints in Firestore

Let's say I have thousands and thousands of GeoPoints from all around the world to store in Firestore. I will query the ones within a certain radius, then display them on a map.
Is it a bad idea to put every object that contains a GeoPoint into one big collection? Or will it get significantly slower the more data I add?
And if it is, what would be a better database structure?

The number of documents in a collection has no impact on the performance of a Firestore query; that is pretty much Firestore's main performance guarantee.
If you retrieve 10 documents from a collection of 1,000 documents, the performance will be the same as retrieving those 10 documents when the collection holds 1,000,000 documents.
While other limitations and behaviors of Firestore may affect the data model you decide on, query performance typically isn't one of them.
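Firestore has no native radius queries, so a common pattern (the one the geofire-style helper libraries use) is to store a geohash string next to each GeoPoint and query a lexicographic range of geohashes covering the radius, then drop the false positives by exact distance on the client. A minimal sketch with the google-cloud-firestore Python client; the collection name, field names, and the geohash value are placeholders, not part of any particular library:

    from google.cloud import firestore

    db = firestore.Client()

    def save_place(doc_id, lat, lng, geohash):
        # One big collection is fine: query cost depends on the documents matched,
        # not on how many documents the collection holds.
        db.collection("places").document(doc_id).set({
            "location": firestore.GeoPoint(lat, lng),
            "geohash": geohash,  # computed with any geohash library
        })

    def places_in_geohash_range(hash_start, hash_end):
        # Range query over the geohash field; the caller still filters out
        # points outside the exact radius, since a geohash box only
        # approximates a circle.
        query = (db.collection("places")
                   .where("geohash", ">=", hash_start)
                   .where("geohash", "<=", hash_end))
        return [doc.to_dict() for doc in query.stream()]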

Related

MongoDb - Does it help performance if all insertMany documents write to a single shard?

I have a very large collection that is sharded on hashed(user_id).
I have a script that pulls a file containing thousands of JSON lines, each one being an individual document, and writes them to the collection via insertMany(ordered: false).
I'm dealing with very large amounts of data and still running into CPU usage problems and slower than desired write speeds.
I know that Mongo suggests pre-splitting chunks, but only for an empty collection, and of course ours will already be populated after the initial load.
Would it help boost performance if we pre-"bucketed" documents into a group keyed on user_id, and then executed an insertMany with all docs having the same user_id, since they would all go to the same shard, as opposed to mixed together? Or does Mongo still need to examine and "manually" balance each individual document in the insertMany even if they do all have the same user_id?
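One way to test the pre-bucketing idea from the question is to group the parsed JSON documents by user_id before calling insertMany and measure the two variants against each other. A rough pymongo sketch; the connection string and collection names are placeholders:

    from collections import defaultdict
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    coll = client["mydb"]["events"]  # hypothetical collection sharded on hashed(user_id)

    def insert_bucketed(docs):
        # Group documents by shard key value so each insert_many targets one shard,
        # then compare throughput against a single mixed insert_many.
        buckets = defaultdict(list)
        for doc in docs:
            buckets[doc["user_id"]].append(doc)
        for group in buckets.values():
            coll.insert_many(group, ordered=False)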

Google Cloud Datastore queries too slow when fetching all records

I am experiencing extremely slow performance of Google Cloud Datastore queries.
My entity structure is very simple:
calendarId, levelId, levelName, levelValue
And there are only about 1400 records, yet the query takes 500 ms to 1.2 s to return the data. Another query on a different entity also takes 300-400 ms for just 313 records.
I am wondering what might be causing such delay. Can anyone please give some pointers regarding how to debug this issue or what factors to inspect?
Thanks.
You are experiencing expected behavior. You shouldn't need to fetch that many entities when presenting a page to a user. Gmail doesn't show you 1000 emails; it shows you 25-100 based on your settings. You should fetch a smaller number (e.g., the first 100) and implement some kind of paging to let users see the other entities.
If this is backend processing, then you will simply need that much time to process entities, and you'll need to take that into account.
Note that you generally want to fetch your entities in large batches, and not one by one, but I assume you are already doing that based on the numbers in your question.
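As a sketch of that paging idea with ndb - the kind name below is hypothetical and its fields just mirror the question:

    from google.appengine.ext import ndb
    from google.appengine.datastore.datastore_query import Cursor

    PAGE_SIZE = 100

    class CalendarLevel(ndb.Model):  # hypothetical kind with the question's fields
        calendarId = ndb.StringProperty()
        levelId = ndb.StringProperty()
        levelName = ndb.StringProperty()
        levelValue = ndb.IntegerProperty()

    def get_page(cursor_token=None):
        # Fetch one page of 100 entities instead of all ~1400 in a single query.
        cursor = Cursor(urlsafe=cursor_token) if cursor_token else None
        entities, next_cursor, more = CalendarLevel.query().fetch_page(
            PAGE_SIZE, start_cursor=cursor)
        token = next_cursor.urlsafe() if (more and next_cursor) else None
        return entities, token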
Not sure if this will help but you could try packing more data into a single entity by using embedded entities. Embedded entities are not true entities, they are just properties that allow for nested data. So instead of having 4 properties per entity, create an array property on the entity that stores a list of embedded entities each with those 4 properties. The max size an entity can have is 1MB, so you'll want to pack the array to get as close to that 1MB limit as possible.
This will lower the number of true entities and I suspect this will also reduce overall fetch time.
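A rough ndb sketch of that packing approach; the names are placeholders, and whether the nested records go in a StructuredProperty (queryable sub-fields) or a LocalStructuredProperty (opaque blob) depends on whether you still need to filter on them:

    from google.appengine.ext import ndb

    class LevelRecord(ndb.Model):
        # Never stored on its own; only used as the nested/embedded structure.
        calendarId = ndb.StringProperty()
        levelId = ndb.StringProperty()
        levelName = ndb.StringProperty()
        levelValue = ndb.IntegerProperty()

    class LevelBatch(ndb.Model):
        # One "true" entity holding many records; keep its total size under 1 MB.
        records = ndb.LocalStructuredProperty(LevelRecord, repeated=True)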

Using QuerySplitter in Google Datastore to load chunks of a known size

I'd like to load lots of data from a Google Datastore table. For performance, I'd like to run a few queries in parallel, each loading a lot of objects. Cursors are not suitable for parallel execution.
QuerySplitter is. However, for QuerySplitter you have to tell it how many splits you want, whereas what I care about is loading a certain number of objects. The number is chosen for the needs of my application - large but not too large, say 800 objects. It's OK if the number of objects returned by each query is only very roughly the same; nothing worse would happen than different threads running for different amounts of time.
How do I do this? I could query all objects keys-only in order to count them, and divide by 800. Is there a better way?
Querying all your entities (even keys-only) might not scale so well, but you could run your query (or queries) periodically and save the counts in the datastore or memcache, depending on how frequently you need to run your job.
However, to get the number of entities of a given kind you can use the Datastore Statistics API, which should be a lot quicker. I don't know how frequently the stats are updated, but it's probably the same as the stats in the console.
If you are going to need more frequent counts, or figures for filtered queries, you might consider sharded counters. Since you only need an approximate number, you could update them asynchronously on each new put.
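For example, ndb exposes the kind-level statistics as models, so a split count for the question's target of roughly 800 objects per query could be derived like this (a sketch; the helper name is made up):

    from google.appengine.ext.ndb import stats

    TARGET_PER_SPLIT = 800  # roughly how many objects each parallel query should load

    def choose_num_splits(kind_name):
        # Kind statistics are refreshed periodically by the datastore, so the
        # count is approximate - good enough for picking a number of splits.
        kind_stat = stats.KindStat.query(stats.KindStat.kind_name == kind_name).get()
        if kind_stat is None:
            return 1
        return max(1, kind_stat.count // TARGET_PER_SPLIT)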

Should I normalize an entity to a one-to-one relationship model in GAE

I have a Student entity which already has about 12 fields. Now I want to add 12 more fields (all related to academic details). Should I normalize (as one-to-one) and store them in a different entity, or should I keep adding the information to the Student entity only?
I am using gaesession to store the logged-in user in memory:
    from gaesessions import get_current_session  # gae-sessions library
    session = get_current_session()
    session['user'] = user  # keep the logged-in user in the session
Will this affect the read and write performance/cost of the app? Is the cost of storing an entity in memcache (on an FE instance) related to the number of attributes stored in the entity?
Generally the costs of either writing two entities or fetching two entities will be greater than the cost of writing or fetching a single entity.
Write costs are associated with the number of indexed fields. If you're adding indexed fields, that would increase the write cost whenever those fields are modified. If an indexed field is not modified and the index doesn't need to be updated, you do not incur the cost of updating that index. You're also not charged for the size of the entity, so from a cost perspective, sticking with a single entity will be cheaper.
Performance is a bit more complicated. Performance will be affected by 1) query overhead and 2) the size of the entities you are fetching.
If you have two entities, you're going to suffer double the query overhead, since you'll likely have to query/fetch the base student entity and then issue a second query/fetch for the second entity. There may be ways around this if you are able to fetch both entities by id asynchronously. If you need to query, though, your perf is likely going to suffer whenever you need to query for the 2nd entity.
On the flip side, perf scales negatively with entity size. Fetching 100 1 MB entities will take significantly longer than fetching 100 500-byte entities. If your extra data is large, and you typically query for many student entities at once, then storing the extra data in a separate entity, so that the basic student entity stays small, can increase performance significantly for the cases where you don't need the 2nd entity.
Overall, for performance, you should consider your data access patterns and try to minimize extraneous data fetching for the common case. That is, if you tend to only fetch one student at a time, and you almost always need all the data for that student, then keeping everything in one entity and loading it all won't affect your cost.
However, if you generally pull lists of many students, and rarely use the full data for a single student, and the data is large, you may want to split the entities.
Also, that comment by @CarterMaslan is wrong: you can support transactional updates. It will actually be more complicated to keep the data synchronized if you have parts of it in separate entities. In that case you'll need to make sure you have a common ancestor between the two entities to do a transactional operation, as in the sketch below.
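If you do end up splitting the data, a minimal ndb sketch of that layout: the academic details entity is created as a child of the Student key, so a single (non cross-group) transaction can update both. The field names here are placeholders.

    from google.appengine.ext import ndb

    class Student(ndb.Model):
        name = ndb.StringProperty()
        email = ndb.StringProperty()    # stands in for the original ~12 fields

    class AcademicDetails(ndb.Model):
        # Create instances with parent=student_key so both share one entity group.
        gpa = ndb.FloatProperty()
        major = ndb.StringProperty()    # stands in for the extra academic fields

    @ndb.transactional
    def update_student_and_details(student_key, new_email, new_gpa):
        # Both reads and both writes happen inside one ancestor transaction.
        student = student_key.get()
        details = AcademicDetails.query(ancestor=student_key).get()
        student.email = new_email
        details.gpa = new_gpa
        ndb.put_multi([student, details])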
It depends on how often these two "sets" of data need to be retrieved from the datastore. As a general principle in GAE, you should denormalize your data, so in your case store all properties in the same model. This will result in more write operations when you store an entity, but will reduce the get and query operations.
Memcache is not billable, so you don't have to worry about memcache costs. Also, if you use ndb (and I recommend you do so), caching in memcache is handled automatically.

How does the performance of datastore batch gets compare to ancestor-only queries?

I would expect batch gets to be one of the fastest ways to retrieve data from the datastore. How does it compare to a query to get all the entities of a kind that are below an ancestor? Of course, this query does not have any filters or sort orders.
I would expect this query to be as fast as a batch get, because I would think that it does NOT require an index scan and only requires retrieving the entities directly from the entities bigtable. Also, assuming that all the entities in this table are sorted by their keys, the results would be sitting one next to the other, all sequentially arranged - which is not guaranteed in a batch get.
Considering both operations retrieve the same amount of entities, in terms of cost, the query would have only +1 read operation when compared to a batch get.
Do my assumptions make any sense? Have you experienced anything that could confirm or deny these assumptions?
I am planning to make heavy use of these queries if I can confirm my expectations. I would organize my models in a hierarchy, and would avoid storing refs to other entities in a list (for batch gets) - I would not have the list size limitation, and I could also avoid retrieving a large entity (with a lot of multi-valued properties) in situations not requiring the batch get.
I would really appreciate any comments on that.
Thank you in advance.
I doubt that any performance difference you observe between ancestor and non-ancestor queries is anything other than coincidental. But sure, set something up to do the measurement - that's a good practice to follow.
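If it helps, a rough ndb harness for that measurement, assuming the same entities can be reached both through a list of keys and through an ancestor query:

    import time
    from google.appengine.ext import ndb

    def time_batch_get(keys):
        # Batch get: fetch entities directly by key.
        start = time.time()
        entities = ndb.get_multi(keys)
        return len(entities), time.time() - start

    def time_ancestor_query(model_class, parent_key):
        # Ancestor-only query: no filters, no sort orders.
        start = time.time()
        entities = model_class.query(ancestor=parent_key).fetch()
        return len(entities), time.time() - start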
