Size/cost of query in GAE datastore

If I have a query that's structured like this:
q = Questions.all()
q.order('-votes')
results = q.run(limit=25)
And votes is just an IntegerProperty in a Questions db model, does the size/cost (basically what counts towards my quota) of the query depend on the number of entities?
Basically, if I'm trying to order 1000 Questions, is it more expensive than ordering only 10 Questions?

Short answer: no.
There are read costs and write costs.
Write costs occur when you write an entity, and the big influence is the number of indexed properties per entity.
Read costs are based on the number of entities returned in a query.
If you sort on votes, you need to make sure the votes property is indexed. That's 1-2 additional writes per entity written.
Read costs vary by the number of entities returned. The filter and sort order don't affect the cost on read.
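As a rough illustration of that point, the sketch below models a kind as a plain Python list. The names run_query and reads_billed are hypothetical and not part of the App Engine API, and the billing rule (one read per entity returned) is taken from the answer above rather than from a pricing table.

```python
def run_query(votes, limit):
    """Return the top `limit` vote counts, like order('-votes') with a limit.

    The real datastore walks a pre-built index instead of sorting at query time;
    sorting here only stands in for that index.
    """
    return sorted(votes, reverse=True)[:limit]


def reads_billed(results):
    # Assumption from the answer above: one read per entity returned.
    return len(results)


small_kind = list(range(30))      # 30 Questions
large_kind = list(range(1000))    # 1000 Questions

# Both queries return 25 entities, so both are billed the same number of reads.
print(reads_billed(run_query(small_kind, limit=25)))
print(reads_billed(run_query(large_kind, limit=25)))
```

With fewer than 25 entities in the kind, the query simply returns (and bills for) however many exist.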

Related

Choosing proper database in AWS when all items must be read from the table

I have an AWS application where DynamoDB is used for most data storage and it works well for most cases. I would like to ask you about one particular case where I feel DynamoDB might not be the best option.
There is a simple table with customers. Each customer can collect virtual coins so each customer has a balance attribute. The balance is managed by 3rd party service keeping up-to-date value and the balance attribute in my table is just a cached version of it. The 3rd party service requires its own id of the customer as an input so customers table contains also this externalId attribute which is used to query the balance.
I need to run the following process once per day:
Update the balance attribute for all customers in a database.
Find all customers with the balance greater than some specified constant value. They need to be sorted by the balance.
Perform some processing for all of the customers - the processing must be performed in proper order - starting from the customer with the greatest balance in descending order (by balance).
Question: which database is the most suitable one for this use case?
My analysis:
In terms of costs it looks to be quite similar, i.e. paying for Compute Units in case of DynamoDB vs paying for hours of micro instances in case of RDS. Not sure though if micro RDS instance is enough for this purpose - I'm going to check it but I guess it should be enough.
In terms of performance - I'm not sure here. It's something I will need to check but wanted to ask you here beforehand. Some analysis from my side:
It involves two scan operations in the case of DynamoDB, which looks like something I really don't want to have. The first scan can be limited to externalId attribute, then balances are queried from 3rd party service and updated in the table. The second scan requires a range key defined for balance attribute to return customers sorted by the balance.
I'm not convinced that any kind of indexes can help here. Basically, there won't be too many read operations of the balance - sometimes it will need to be queried for a single customer using its primary key. The number of reads won't be much greater than number of writes so indexes may slow the process down.
Additional assumptions in case they matter:
There are ca. 500 000 customers in the database, the average size of a single customer is 200 bytes. So the total size of the customers in the database is 100 MB.
I need to repeat step 1 from the above procedure (update the balance of all customers) several times during the day (ca. 20-30 times per day) but the necessity to retrieve sorted data is only once per day.
There is only one application (and one instance of the application) performing the above procedure. Besides that, I need to handle simple CRUD which can read/update other attributes of the customers.

I think people are overly afraid of DynamoDB scan operations. They're bad if used for regular queries but for once-in-a-while bulk operations they're not so bad.
How much does it cost to scan a 100 MB table? That's 25,000 4 KB blocks. With eventually consistent reads that's 12,500 read units. If we assume the cost is $0.25 per million (On Demand mode), that's 12,500/1,000,000*$0.25 = $0.003 per full table scan. Want to do it 30 times per day? It costs you less than a dime a day.
The thing to consider is the cost of updating every item in the database. That's 500,000 write units, which if in On Demand at $1.25 per million will be about $0.63 per full table update.
If you can go Provisioned for that duration it'll be cheaper.
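The arithmetic above can be checked with a few lines of Python. The prices are the On Demand figures assumed in this answer, not values looked up from a current price list, and the table size and item count come from the question.

```python
TABLE_SIZE_KB = 100_000          # 100 MB table, from the question
BLOCK_KB = 4                     # one read unit covers up to 4 KB
ITEM_COUNT = 500_000             # customers, ~200 bytes each (under 1 KB)
READ_PRICE_PER_MILLION = 0.25    # USD, price assumed in the answer
WRITE_PRICE_PER_MILLION = 1.25   # USD, price assumed in the answer

blocks = TABLE_SIZE_KB // BLOCK_KB
read_units = blocks // 2         # eventually consistent reads cost half
scan_cost = read_units / 1_000_000 * READ_PRICE_PER_MILLION

write_units = ITEM_COUNT         # one write unit per item of up to 1 KB
update_cost = write_units / 1_000_000 * WRITE_PRICE_PER_MILLION

print("blocks:", blocks)
print("read units per scan:", read_units)
print("cost per scan (USD):", scan_cost)
print("cost of 30 scans (USD):", 30 * scan_cost)
print("cost per full update (USD):", update_cost)
```

The update cost dominates: one full-table update costs roughly 200 times as much as one full-table scan under these assumptions.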
Regarding performance, DynamoDB can scan a full table faster than any server-oriented database, because it's supported by potentially thousands of back-end servers operating in parallel. For example, you can do a parallel scan with up to a million segments, each with a client thread reading data in 1 MB chunks. If you write a single-threaded client doing a scan it won't be as fast. It's definitely possible to scan slowly, but it's also possible to scan at speeds that seem ludicrous.
If your table is 100 MB, was created in On Demand mode, has never hit a high water mark to auto-increase capacity (just the starter capacity), and you use a multi-threaded pull with 4+ segments, I predict you'll be done in low single digit seconds.
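A minimal sketch of such a multi-threaded scan follows. It assumes a client object exposing a scan method with the same keyword arguments as the boto3 DynamoDB client (TableName, Segment, TotalSegments, ExclusiveStartKey); the function names and the table name are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(client, table_name, segment, total_segments):
    """Read every item in one segment, following pagination to the end."""
    items = []
    kwargs = {
        "TableName": table_name,
        "Segment": segment,
        "TotalSegments": total_segments,
    }
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Resume this segment where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def parallel_scan(client, table_name, total_segments=4):
    """Scan all segments concurrently, one worker thread per segment."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(scan_segment, client, table_name, seg, total_segments)
            for seg in range(total_segments)
        ]
        return [item for future in futures for item in future.result()]
```

With boto3 installed and credentials configured, this would be called as parallel_scan(boto3.client("dynamodb"), "customers"). The results come back grouped by segment, not sorted, so any ordering by balance still has to happen on the client or through an index.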

Entity Group - deciding on how to group

I've read throughout the Internet that the Datastore has a limit of 1 write per second for an Entity Group. Most of what I read indicate a "write to an entity", which I would understand as an update. Does the 1 write per second also apply to adding entities into the group?
A simple case would be a Thread where multiple posts can be added by different users. The way I see it, it's logical to have the Thread be the ancestor of the Posts. Thus, forming a wide entity group. If the answer to my question above is yes, a "trending" thread would be devastated by the write limit.
That said, would it make sense to get rid of the ancestry altogether or should I switch to the user as the ancestor? What I'd like to avoid is having the user be confused when they don't see the post due to eventual consistency.

A quick clarification to start with
1 write per second doesn't mean 1 entity per second. You can batch writes together, up to a maximum of 500 entities (transactions also have a 10 MiB limit). So if you can batch posts, you can improve your write rate.
Note: you can technically go higher than 1 write per second, but the longer you exceed that limit, the greater your risk of contention errors and the longer the system takes to become consistent.
You can read more on the limits here.
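As a small illustration of grouping writes, the sketch below splits a list of pending posts into groups of at most 500. The helper name batches is hypothetical, and the actual datastore write call is left out because it depends on which client library is in use.

```python
def batches(entities, batch_size=500):
    """Yield consecutive slices of at most `batch_size` entities."""
    for start in range(0, len(entities), batch_size):
        yield entities[start:start + batch_size]


posts = ["post-%d" % i for i in range(1200)]
for batch in batches(posts):
    # In a real app, each batch would be sent in a single datastore write.
    print(len(batch))
```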
Client-side sharding
If you need to use ancestor queries for strong consistency AND 1 write per second is not enough, you could implement client-side sharding. This essentially means that you write the posts to up to N different entity groups using a known key scheme. For example:
Primary parent: "AncestorA"
Optional shard 1: "AncestorA-1"
Optional shard N: "AncestorA-(N-1)"
To query for your posts, issue N ancestor queries. Naturally, you'll need to merge these results on the client-side to display it in the correct order.
This will allow you to do N writes per second.
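The scheme above might look like this in outline. All names are hypothetical; the post dictionaries and their created field stand in for whatever the N ancestor queries would return, each list already sorted by creation time.

```python
import heapq
import random


def shard_names(base, n):
    """Return the N entity-group key names: base, base-1, ..., base-(N-1)."""
    return [base] + ["%s-%d" % (base, i) for i in range(1, n)]


def pick_shard(base, n, rng=random):
    """Choose one shard at random to spread writes across the groups."""
    return rng.choice(shard_names(base, n))


def merge_results(per_shard_results):
    """Merge per-shard result lists that are each sorted by `created`."""
    return list(heapq.merge(*per_shard_results, key=lambda post: post["created"]))


shards = shard_names("AncestorA", 3)
results = [
    [{"created": 1, "text": "a"}, {"created": 4, "text": "d"}],  # from AncestorA
    [{"created": 2, "text": "b"}],                               # from AncestorA-1
    [{"created": 3, "text": "c"}],                               # from AncestorA-2
]
print(shards)
print([post["text"] for post in merge_results(results)])
```

Picking the shard at random keeps the writer stateless; hashing the author's id instead would keep one user's posts in one group, at the cost of uneven load.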

Should I normalize an entity to one-to-one relationship model in gae

I have a student entity which already has about 12 fields. Now I want to add 12 more fields (all related to the student's academic details). Should I normalize (as one-to-one) and store them in a different entity, or should I keep adding the information to the Student entity?
I am using gaesession to store the logged-in user in memory:
session = get_current_session()
session['user'] = user
Will this affect the read and write performance/cost of the app? Is the cost of storing an entity in memcache (FE instance) related to the number of attributes stored in the entity?

Generally the costs of either writing two entities or fetching two entities will be greater than the cost of writing or fetching a single entity.
Write costs are associated with the number of indexed fields. If you're adding indexed fields, that would increase the write cost whenever those fields are modified. If an indexed field is not modified and the index doesn't need to be updated, you do not incur the cost of updating that index. You're also not charged for the size of the entity, so from a cost perspective, sticking with a single entity will be cheaper.
Performance is a bit more complicated. Performance will be affected by 1) query overhead and 2) the size of the entities you are fetching.
If you have two entities, you're going to suffer double the query overhead, since you'll likely have to query/fetch the base student entity and then issue a second query/fetch for the second entity. There may be certain ways around this if you are able to fetch both entities by id asynchronously. If you need to query though, your perf is likely going to suffer whenever you need to query for the 2nd entity.
On the flip side, perf scales negatively with entity size. Fetching 100 1 MB entities will take significantly longer than fetching 100 500-byte entities. If your extra data is large, and you typically query for many student entities at once, then by storing the extra data in a separate entity so that the basic student entity stays small, you can increase performance significantly for the cases where you don't need the 2nd entity.
Overall, for performance, you should consider your data access patterns, and try to minimize extraneous data fetching for the common fetching situation, i.e. if you tend to only fetch one student at a time, and you almost always need all the data for that student, then it won't affect your cost to load all the data.
However, if you generally pull lists of many students, and rarely use the full data for a single student, and the data is large, you may want to split the entities.
Also, that comment by @CarterMaslan is wrong. You can support transactional updates. It'll actually be more complicated to synchronize if you have parts of your data in separate entities. In that case you'll need to make sure you have a common ancestor between the two entities to do a transactional operation.

It depends on how often these two "sets" of data need to be retrieved from the datastore. As a general principle in GAE, you should de-normalize your data, so in your case store all properties in the same model. This will result in more write operations when you store an entity but will reduce the get and query operations.
Memcache is not billable, so you don't have to worry about memcache costs. Also, if you use ndb (and I recommend that you do), caching in memcache is handled automatically.

GAE — Performance of queries on indexed properties

If I had an entity with an indexed property, say "name," what would the performance of == queries on that property be like?
Of course, I understand that no exact answers are possible, but how does the performance correlate with the total number of entities for which name == x for some x, the total number of entities in the datastore, etc.?
How much slower would a query on name == x be if I had 1000 entities with name equalling x, versus 100 entities? Has any sort of benchmarking been done on this?

Some not very strenuous testing on my part indicated response times increased roughly linearly with the number of results returned. Note that even if you have 1000 entities, if you add a limit=100 to your query, it'll perform the same as if you only had 100 entities.
This is in line with the documentation which indicates that perf varies with the number of entities returned.
When I say not very strenuous, I mean that the response times were all over the place, and it was a very very rough estimate to draw a line through. I'd often see an order of magnitude difference in perf on the same request.

AppEngine does queries in a very optimized way, so it is virtually irrelevant from a performance stand-point whether you do a query on the name property vs. just doing a batch-get with the keys only. Either will be linear in the number of entities returned. The total number of entities stored in your database does not make a difference. What does make a tiny difference, though, is the number of different values for "name" that occur in your database (so, 1000 entities returned will be pretty much exactly 10 times slower than 100 entities returned).
The way this is done is via the indices (or indexes as preferred) stored along with your data. An index for the "name" property consists of a table that has all names sorted in alphabetical order (and a second one sorted in reverse alphabetical order, if you use descending order in any of your queries) and a query will then simply find the first occurrence of the name you are querying in the table and start returning results in order. This is called a "scan".
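A toy version of that index scan, using a sorted Python list in place of the index table, could look like this. The function names and the entity layout are hypothetical and only model the idea; this is not how the datastore is implemented internally.

```python
import bisect


def build_index(entities):
    """Return sorted (name, key) pairs, standing in for the index table."""
    return sorted((e["name"], e["key"]) for e in entities)


def query_name_equals(index, name, limit=None):
    """Seek to the first row for `name`, then read rows until the name changes."""
    # (name,) sorts before every (name, key) pair, so this finds the first match.
    pos = bisect.bisect_left(index, (name,))
    keys = []
    while pos < len(index) and index[pos][0] == name:
        if limit is not None and len(keys) >= limit:
            break
        keys.append(index[pos][1])
        pos += 1
    return keys


names = ["bob", "alice", "bob", "carol", "bob"]
entities = [{"key": i, "name": n} for i, n in enumerate(names)]
index = build_index(entities)
print(query_name_equals(index, "bob"))           # [0, 2, 4]
print(query_name_equals(index, "bob", limit=2))  # [0, 2]
```

The seek is logarithmic in the index size and the loop runs once per returned row, which matches the claim that cost is linear in the number of results and nearly independent of the total number of entities.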
This video is a bit technical, but it explains in detail how all this works and if you're concerned about coding for maximum performance, might be a good time investment:
Google I/O 2008: Under the Covers of the Google App Engine Datastore
(the video quality is fairly bad, but they also have the slides online (see link above video))

What counts towards the GAE datastore quotas?

The docs say that there are 50,000 free Read, Write, and Small Operations to the datastore.
I guess read and write are obvious, but what falls under the "small ops" category? If I ran a query would it be a small or read operation?

Here's the documentation on this. As I understand it, queries by key are considered "small operations," so a query for entities, a query for keys, and creating a new key all deplete the small operations quota.
A query is both a small and read operation: it costs 1 read + 1 small per entity retrieved.
