Suppose we run
select * from Products where category = "electronics"
versus
select name, price from Products where category = "electronics"
What impact will this have on Datastore pricing?
The Google Cloud Platform documentation says:
Projection queries that do not use the distinct on clause. This type of query is counted as a single entity read for the query itself. The individual results are counted as small operations.
There are two things to trade off with respect to pricing. The projection query will need a composite index which will cause an increase in storage costs, but the queries will be cheaper by O(num results * read price).
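As a rough illustration of the read side of that trade-off, here is a minimal sketch that counts billed entity reads for the two queries. The function name is hypothetical, and it only encodes the rule quoted above (a projection query is one entity read, with its results counted as small operations); edge cases such as empty result sets are ignored.

```python
def billed_entity_reads(num_results, projection=False):
    """Estimate entity reads billed for one query (illustrative only).

    Assumes a projection query without DISTINCT ON costs a single entity
    read, and a regular query costs one entity read per returned result.
    """
    if projection:
        return 1
    return num_results


# 10 matching products: full entities vs. a (name, price) projection
full_reads = billed_entity_reads(10)
projection_reads = billed_entity_reads(10, projection=True)
```

The difference between the two values grows linearly with the number of results, which is the saving that has to be weighed against the storage cost of the extra composite index.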
I use an NDB query to retrieve entities by a list of keys with the IN operator and then filter by date. As the docs describe, the query is broken into several subqueries, and these run in sequence instead of in parallel.
from google.appengine.ext import ndb

class Post(ndb.Model):
    modified_date = ndb.DateTimeProperty()

# query: keys, start_date and end_date are defined elsewhere
posts = Post.query(
    Post.key.IN(keys),
    Post.modified_date >= start_date,
    Post.modified_date <= end_date).fetch()
The query profiling graph shows the subqueries running sequentially. It takes about 0.3 seconds for 25 keys, and the query latency grows linearly with the number of keys.
Is there any way to optimize the query, and what is the best practice to retrieve entities by keys and filter by date range?
The problem is with the IN operator.
For each key in keys, GAE will perform an individual query.
From the GAE docs:
The IN operator also performs multiple queries: one for each item in the specified list, with all other filters unchanged and the IN filter replaced with an equality (=) filter. The results are merged in order of the items in the list. If a query has more than one IN filter, it is performed as multiple queries, one for each possible combination of values in the IN lists.
A single query containing not-equal (!=) or IN operators is limited to no more than 30 subqueries.
https://cloud.google.com/appengine/docs/standard/python/datastore/queries
The modified_date filter is fine: that property is indexed, so the filter is efficient.
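One possible way around the per-key subqueries, which the answer above does not spell out, is to fetch the entities by key in a single batch and apply the date range in memory. The sketch below assumes that approach; `FakePost` is a hypothetical stand-in for the `Post` model so the filtering logic can run outside App Engine, and `filter_by_date` is an illustrative helper name.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for an ndb entity; only the field used here.
FakePost = namedtuple("FakePost", ["id", "modified_date"])


def filter_by_date(entities, start_date, end_date):
    """Keep entities whose modified_date lies in [start_date, end_date].

    None entries are skipped, since a batch get returns None for keys
    that do not exist.
    """
    return [
        e for e in entities
        if e is not None and start_date <= e.modified_date <= end_date
    ]


# On App Engine the input would come from one batch call, for example:
#   entities = ndb.get_multi(keys)
entities = [
    FakePost(1, datetime(2017, 1, 5)),
    None,  # a key with no entity
    FakePost(2, datetime(2017, 3, 1)),
]
recent = filter_by_date(entities, datetime(2017, 1, 1), datetime(2017, 1, 31))
```

This trades a small amount of in-memory work for a single round trip, which is usually reasonable when the key list is short, as in the 25-key case described above.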
I know that Datastore pricing for any query is based on the number of entities retrieved. Now, suppose I use Objectify to write a query like this one:
Car car = ofy().load().type(Car.class).filter("vin >", "123456789").first().now();
Do I pay for every entity that has vin > 123456789 and is matched by the query, or only for the first one, which is the one I actually retrieve?
The datastore documentation on indexes says this:
Identifies the index corresponding to the query's kind, filter properties, filter operators, and sort orders.
Scans from the beginning of the index to the first entity that meets all of the query's filter conditions.
Continues scanning the index, returning each entity in turn, until it
encounters an entity that does not meet the filter conditions, or
reaches the end of the index, or
has collected the maximum number of results requested by the query.
(source documentation)
Since the maximum number of results requested by your query is 1, the index scan stops after a single entity, and that one read is what you are billed for.
Note that indexes are ordered, so this is a very short index scan and a really small operation.
On the other hand, you do not specify an order in the query, so technically the result could be any entity that matches it. Usually you want the largest or smallest value within the qualifying range. Since indexes are ordered, you should get the first matching entity in the index, which depends on the index order (ascending or descending).
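The scan described above can be sketched with a sorted list standing in for the index. This is only a toy model of the documented behaviour, not how Datastore is implemented; `scan_index` and the sample VIN strings are made up for illustration.

```python
import bisect


def scan_index(sorted_values, lower_bound, limit):
    """Return up to `limit` values strictly greater than `lower_bound`.

    `sorted_values` plays the role of an ascending index. The scan jumps
    to the first qualifying entry and stops once `limit` results have been
    collected or the index ends.
    """
    start = bisect.bisect_right(sorted_values, lower_bound)
    return sorted_values[start:start + limit]


# Ascending index on the string property "vin" (hypothetical values)
vin_index = sorted(["100", "123456789", "123456790", "2", "9"])
first_match = scan_index(vin_index, "123456789", limit=1)
```

With `limit=1` only one index entry past the lower bound is touched, no matter how many other entries would also satisfy the filter.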
Google Datastore will have new pricing effective July 1st (https://cloud.google.com/datastore/docs/pricing), and I am having trouble understanding how the changes will affect me.
My kind does have a structure to it. It is called MESSAGES, and every entity looks like this:
ID
FROM
TO
MESSAGE
DATE_CREATED
MISC1
MISC2
I have an index on ID, FROM, TO, DATE_CREATED, MISC1, and MISC2. With the new pricing:
What will be the cost of inserting a new entity into this kind?
If I run a query to get all the attributes and it returns 10 entities, what is the cost of the query?
If I run a projection query to get all the attributes except MISC1 and MISC2 and it returns 10 entities, what is the cost of the query?
If I update an entity with all these indexes, what will be the cost?
The old pricing is based primarily on how many indexes you have, but the new prices do not seem to be based on indexes at all. All the documentation on understanding the costs of reads and writes is explained in terms of indexes, so it is confusing how it applies when indexes are not part of the pricing model. I would like to know how much these 4 types of operations would cost in terms of read/write/small ops.
Writing a new Entity
In the current pricing model, inserting a new entity costs 2 write operations for the entity + 2 write operations per indexed property.
So in your example, with 6 indexed properties, it would be:
2 + 2 * 6 = 14 write operations
The effective price would be (14 * $0.06) per 100K entities written
Summary current: $0.84/100K
The new pricing just counts the entities written:
Summary new: $0.18/100K
Regular Queries
In the current model you are charged for the number of entities returned + 1:
11 read operations at $0.06/100K
In the new pricing model, you are only charged for the number of entities returned:
10 entity reads at $0.06/100K
Projection Queries
Projection results count as 'Small Ops' and are free. The query itself still costs 1 read, though. This is the same in both the current and the new pricing models.
Updating Entities
In the current pricing model, updating an existing entity costs 1 write operation for the entity + 4 write operations per indexed property.
So in your example, with 6 indexed properties, it would be:
1 + 4 * 6 = 25 write operations
The effective price would be (25 * $0.06) per 100K entities written
Summary current: $1.50/100K
The new pricing just counts the entities written:
Summary new: $0.18/100K
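The write arithmetic above can be checked with a small sketch. The formulas and rates are the ones quoted in this answer (2 + 2 per indexed property for an insert, 1 + 4 per indexed property for an update, $0.06 per 100K operations, $0.18 per 100K entity writes under the new model); the function and constant names are made up for illustration.

```python
OLD_PRICE_PER_100K_OPS = 0.06            # old model: price per 100K write operations
NEW_PRICE_PER_100K_ENTITY_WRITES = 0.18  # new model: price per 100K entities written


def old_insert_write_ops(indexed_properties):
    """Old model: 2 writes for the entity + 2 per indexed property."""
    return 2 + 2 * indexed_properties


def old_update_write_ops(indexed_properties):
    """Old model: 1 write for the entity + 4 per indexed property."""
    return 1 + 4 * indexed_properties


def old_cost_per_100k_entities(write_ops):
    """Old-model dollars per 100K entities written, rounded to cents."""
    return round(write_ops * OLD_PRICE_PER_100K_OPS, 2)


insert_ops = old_insert_write_ops(6)   # 2 + 2 * 6
update_ops = old_update_write_ops(6)   # 1 + 4 * 6
insert_cost = old_cost_per_100k_entities(insert_ops)
update_cost = old_cost_per_100k_entities(update_ops)
```

Under the new model both cases cost `NEW_PRICE_PER_100K_ENTITY_WRITES` regardless of the number of indexed properties, which is why the index count drops out of the comparison.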
Isn't the new one simpler? It is based only on the number of entities and ignores the indexes. You can see the numbers and an explanation here: https://cloudplatform.googleblog.com/2016/03/Google-Cloud-Datastore-simplifies-pricing-cuts-cost-dramatically-for-most-use-cases.html.
Let's say I have an ordering system which has a table size of around 50,000 rows and grows by about 100 rows a day. Also, say once an order is placed, I need to store metrics about that order for the next 30 days and report on those metrics on a daily basis (i.e. on day 2, this order had X activations and Y deactivations).
In a normalized model, I would have:
1 table called products, which holds the details of the product listing
1 table called orders, which holds the order data and product id
1 table called metrics, which holds a date field, and order id, and metrics associated.
If I modeled this in a star schema format, I'd design like this:
FactOrders table, which has 30 days * X orders rows and stores all metadata around the orders, product id, and metrics (each row represents the metrics of a product on a particular day).
DimProducts table, which stores the product metadata
Does the performance gain from a huge FactOrders table that needs only one join to get all relevant information outweigh the fact that I increased my table size by 30x and have an incredible amount of repeated data, compared with the truly normalized model that has one extra join but much smaller tables? Or am I designing this incorrectly for a star schema format?
Do not denormalize something this small to get rid of joins. Index properly instead. Joins are not bad; joins are good. Databases are designed to use them.
Denormalizing is risky for data integrity and may not even be faster, because the tables become much wider. In tables this tiny, it is very unlikely that denormalizing would help.
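To make "index properly" concrete, here is a minimal sketch of the normalized three-table layout with indexes on the join and filter columns, using SQLite from Python's standard library. The column names, index names, and sample rows are assumptions for illustration; the question does not give the real schema.

```python
import sqlite3


def build_demo_db():
    """Create the normalized schema in memory with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            product_id INTEGER REFERENCES products(id)
        );
        CREATE TABLE metrics (
            order_id INTEGER REFERENCES orders(id),
            metric_date TEXT,
            activations INTEGER,
            deactivations INTEGER
        );
        -- Indexes on the columns used for joining and filtering
        CREATE INDEX idx_orders_product ON orders(product_id);
        CREATE INDEX idx_metrics_date_order ON metrics(metric_date, order_id);

        INSERT INTO products VALUES (1, 'widget');
        INSERT INTO orders VALUES (10, 1);
        INSERT INTO metrics VALUES (10, '2020-01-01', 5, 0);
        INSERT INTO metrics VALUES (10, '2020-01-02', 3, 1);
    """)
    return conn


def daily_report(conn, metric_date):
    """Return (product name, activations, deactivations) rows for one day."""
    return conn.execute(
        """
        SELECT p.name, m.activations, m.deactivations
        FROM metrics AS m
        JOIN orders AS o ON o.id = m.order_id
        JOIN products AS p ON p.id = o.product_id
        WHERE m.metric_date = ?
        ORDER BY o.id
        """,
        (metric_date,),
    ).fetchall()


conn = build_demo_db()
report = daily_report(conn, "2020-01-02")
```

The daily report still needs two joins, but each one is resolved through a primary key or an index, and no order or product data is repeated per day.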
I am trying to do some simple reporting in the datastore viewer in GAE. Using GQL I want to show just a few fields of a record. Is this possible?
How do I take an entity with fields:
f1 f2 f3 f4 f5 f6
and show
f1 f3 f5 f6
This is not possible. From the GQL Reference documentation:
Every GQL query always begins with either SELECT * or SELECT __key__.
And from the Differences with SQL section of the datastore overview:
When querying the datastore, it is not currently possible to return
only a subset of kind properties. The App Engine datastore can either
return entire entities or only entity keys from a query.
As for why this limitation exists, the article How Entities and Indexes are Stored gives good insight into the technical side of Google's Bigtable, the distributed database system that powers App Engine's datastore (and other Google products).
According to the article, datastore entities are stored in several different Bigtables. An Entity Bigtable stores all the properties of each entity, and several Index Bigtables store the entity keys sorted according to the entity's indexes.
When we perform a query, two steps happen. First, the query is executed against the Index Bigtables, producing the set of entity keys that match it. Second, that set of keys is used to fetch the whole entities from the Entity Bigtable.
Therefore, when you execute a query starting with SELECT __key__, the datastore only needs to do the first step and can immediately return the set of keys. When you execute a query starting with SELECT *, the datastore does both steps and returns the set of entities.
Now, to see why queries like SELECT f1, f3, f5, f6 are not supported by the datastore, we need to look in more detail at what happens during the second step. The article states that in the Entity Bigtable:
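The two steps can be sketched with plain Python structures. This is a toy model with invented names (`index_table`, `entity_table`, `query_keys`, `query_entities`), not the real storage layout: a sorted list of (value, key) pairs stands in for an Index Bigtable and a dict stands in for the Entity Bigtable.

```python
# Toy "Index Bigtable": (indexed value, entity key) pairs kept in sorted order
index_table = sorted([
    ("blue", "k2"),
    ("red", "k1"),
    ("red", "k3"),
])

# Toy "Entity Bigtable": entity key -> full entity
entity_table = {
    "k1": {"color": "red", "size": 1},
    "k2": {"color": "blue", "size": 2},
    "k3": {"color": "red", "size": 3},
}


def query_keys(index, value):
    """Step 1 only (like SELECT __key__): scan the index for matching keys."""
    return [key for indexed_value, key in index if indexed_value == value]


def query_entities(index, entities, value):
    """Steps 1 and 2 (like SELECT *): look up keys, then fetch whole entities."""
    return [entities[key] for key in query_keys(index, value)]


red_keys = query_keys(index_table, "red")
red_entities = query_entities(index_table, entity_table, "red")
```

A keys-only query never touches `entity_table`, which is why it can return as soon as the index scan finishes.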
Instead of storing each entity property as an individual column in the corresponding Bigtable row, a single column is used which contains a binary-encoded protocol buffer containing the names and values for every property of a given entity.
Since the low-level protocol buffer stores all of an entity's properties as a single serialized blob, returning only a subset of the properties would require an extra post-processing step: decoding each result and keeping only the requested properties. This would degrade the datastore's performance, and is probably why it is not supported by Google at the moment.
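The post-processing step described above can be illustrated as follows. JSON is used here purely as a stand-in for the binary protocol buffer encoding, and `project` is a hypothetical helper; the point is only that the whole blob has to be decoded before any single property can be picked out.

```python
import json


def project(serialized_entity, wanted_properties):
    """Decode a whole serialized entity, then keep only the wanted properties.

    JSON stands in for the protocol buffer encoding; in both cases the
    full blob must be decoded before individual properties are accessible.
    """
    entity = json.loads(serialized_entity)
    return {name: entity[name] for name in wanted_properties if name in entity}


# One "row" of the entity table: a single column holding every property
stored_blob = json.dumps(
    {"f1": 1, "f2": 2, "f3": 3, "f4": 4, "f5": 5, "f6": 6}
)
subset = project(stored_blob, ["f1", "f3", "f5", "f6"])
```

The cost of decoding is paid for all six properties even though only four are returned, so the subset query saves nothing on the storage side.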