Google Datastore will have new pricing effective July 1st (https://cloud.google.com/datastore/docs/pricing), and I'm having trouble understanding how the changes will affect me.
My kind does have a structure to it. It is called MESSAGES, and every entity looks like this:
ID
FROM
TO
MESSAGE
DATE_CREATED
MISC1
MISC2
I have an index on ID, FROM, TO, DATE_CREATED, MISC1, and MISC2. With the new pricing:
What will be the cost of inserting a new entity into this kind?
If I run a query to get all the attributes and it returns 10 entities, what is the cost of the query?
If I run a projection query to get all the attributes except MISC1 and MISC2 and it returns 10 entities, what is the cost of the query?
If I update an entity with all these indexes, what will be the cost?
The old pricing is based primarily on how many indexes you have, but it seems the new prices are not based on indexes at all. All the documentation on understanding the cost of reads and writes is explained in terms of indexes, so it is confusing how it applies to a pricing model without them. I would like to know how much these 4 types of operations would cost in terms of read/write/small ops.
Writing a new Entity
In the current pricing model, inserting a new entity costs 2 write operations for the entity + 2 write operations per index.
So in your example, with 6 indexed properties, it would be:
2 + 2 * 6 = 8 write operations
The effective price would be (8 * $0.06) per 100K entities written
Summary current: $0.48/100K
The new pricing just counts the entities written:
Summary new: $0.18/100K
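As a rough sanity check, here is that write arithmetic as a throwaway Python snippet (the rates are the ones quoted above; everything else is just illustration):

    OLD_RATE_PER_100K = 0.06   # $ per 100K write operations (current pricing)
    NEW_RATE_PER_100K = 0.18   # $ per 100K entity writes (new pricing)

    indexed_properties = 6
    insert_write_ops = 2 + 2 * indexed_properties      # 8 write ops per new entity
    print(insert_write_ops * OLD_RATE_PER_100K)        # 0.48 -> $0.48 per 100K inserts
    print(NEW_RATE_PER_100K)                           # 0.18 -> $0.18 per 100K inserts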
Regular Queries
In the current model you are charged for the number of entities returned + 1:
11 read operations @ $0.06/100K
In the new pricing model, you are only charged for the number of entities:
10 entity reads @ $0.06/100K
Projection Queries
Reading projections counts as 'small ops', which are free. The query itself still costs 1 read, though - this stays the same in both the current and new pricing models.
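For illustration, a rough sketch of what that looks like with the Python ndb client (the client library and model definition are my assumptions, not from the question):

    from google.appengine.ext import ndb

    class Messages(ndb.Model):
        FROM = ndb.StringProperty()
        TO = ndb.StringProperty()
        DATE_CREATED = ndb.DateTimeProperty()

    # Regular fetch: each of the 10 returned entities is an entity read.
    full = Messages.query().fetch(10)

    # Projection fetch: results are assembled from index entries only,
    # so they bill as small ops (plus the single read for the query itself).
    partial = Messages.query().fetch(10, projection=[Messages.FROM])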
Updating Entities
In the current pricing model, updating an existing entity costs 1 write operation for the entity + 4 write operations per index.
So in your example, with 6 indexed properties, it would be:
1 + 4 * 6 = 25 write operations
The effective price would be (25 * $0.06) per 100K entities written
Summary current: $1.50/100K
The new pricing just counts the entities written:
Summary new: $0.18/100K
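And the same illustrative Python check for updates, using the quoted rates:

    OLD_RATE_PER_100K = 0.06   # $ per 100K write operations (current pricing)
    NEW_RATE_PER_100K = 0.18   # $ per 100K entity writes (new pricing)

    indexed_properties = 6
    update_write_ops = 1 + 4 * indexed_properties      # 25 write ops per updated entity
    print(update_write_ops * OLD_RATE_PER_100K)        # 1.5  -> $1.50 per 100K updates
    print(NEW_RATE_PER_100K)                           # 0.18 -> $0.18 per 100K updates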
Isn't the new one simpler? It's based only on the number of entities, ignoring the indexes. You can see the numbers and explanation here: https://cloudplatform.googleblog.com/2016/03/Google-Cloud-Datastore-simplifies-pricing-cuts-cost-dramatically-for-most-use-cases.html.
Related
I'm using GCP/App Engine to build a Feed that returns posts for a given user in descending order of the post's score (a modified timestamp). Posts that are not 'seen' are returned first, followed by posts where 'seen' = true.
When a user creates a post, a Feed entity is created for each one of their followers (i.e. a fan-out inbox model).
Will my current index model result in an exploding index and/or contention on the 'score' index if many users load their feed simultaneously?
index.yaml
indexes:
- kind: "Feed"
  properties:
  - name: "seen"    # Boolean
  - name: "uid"     # The user this feed belongs to
  - name: "score"   # Int timestamp
    direction: desc
# Other entity fields include: authorUid, postId, postType
A user's feed is fetched by:
SELECT postId FROM Feed WHERE uid = abc123 AND seen = false ORDER BY score DESC
Would I be better off prefixing the 'score' with the user id? Would this improve the performance of the score index? e.g. score="{alphanumeric user id}-{unix timestamp}"
From the docs:
You can improve performance with "sharded queries", that prepend a fixed length string to the expiration timestamp. The index is sorted on the full string, so that entities at the same timestamp will be located throughout the key range of the index. You run multiple queries in parallel to fetch results from each shard.
With just 4 entities I'm seeing 44 built-in index entries, which seems excessive.
You do not have an exploding indexes problem; that problem is specific to entities with repeated properties (i.e. properties with multiple values) when those properties are used in composite indexes. From Index limits:
The situation becomes worse in the case of entities with multiple properties, each of which can take on multiple values. To accommodate such an entity, the index must include an entry for every possible combination of property values. Custom indexes that refer to multiple properties, each with multiple values, can "explode" combinatorially, requiring large numbers of entries for an entity with only a relatively small number of possible property values. Such exploding indexes can dramatically increase the storage size of an entity in Cloud Datastore, because of the large number of index entries that must be stored. Exploding indexes also can easily cause the entity to exceed the index entry count or size limit.
The 44 built-in index entries are nothing more than the entries created for the indexed properties of your 4 entities (your entity model probably has about 11 indexed properties), which is normal. You can reduce the number by reviewing how your model is used and marking as unindexed all properties that you do not plan to use in queries.
You do, however, have the potential problem of a high number of index updates in a short time: when a user with many followers creates a post, all of those index updates fall in a narrow range - hotspots, which is what the article you referenced applies to. Prepending the score with the follower's user ID (not the post creator's ID, which won't help, since the same number of updates would land on the same index range for one user's posting event regardless of whether sharding is used) should help. The impact of followers reading the post (when the score property is updated) is less severe, since it is unlikely that all followers will read the post at exactly the same time.
Unfortunately, prepending the follower ID doesn't help with the query you intend to run, as the results will be ordered by follower ID first, not by timestamp.
What I'd do:
combine the functionality of the seen and score properties into one: a score value of 0 can be used to indicate that a post was not yet seen, and any other value would be the timestamp when it was seen. Fewer indexes, fewer index updates, less storage space (see the sketch after this list).
I wouldn't bother with sharding in this particular case:
reading a post takes a bit of time, so one follower reading multiple posts typically won't happen fast enough for the index updates for that particular follower to be a serious problem. In the rare worst case an already-read post may appear as unread - IMHO not bad enough to justify sharding
delays in updating the indexes for all the followers are, again, IMHO not a big problem - it may just take a bit longer for the post to appear in a follower's feed
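A minimal sketch of the combined seen/score idea, assuming the Python ndb client (the question doesn't name a client library, so the model and helper below are purely illustrative):

    from google.appengine.ext import ndb

    class Feed(ndb.Model):
        uid = ndb.StringProperty()          # follower this feed entry belongs to
        postId = ndb.StringProperty()
        score = ndb.IntegerProperty()       # 0 = not yet seen, otherwise the seen timestamp
        authorUid = ndb.StringProperty(indexed=False)  # not queried -> no index entries
        postType = ndb.StringProperty(indexed=False)

    def fetch_feed(uid, limit=20):
        # Unseen entries first (score == 0), then seen ones, newest first.
        unseen = Feed.query(Feed.uid == uid, Feed.score == 0).fetch(limit)
        seen = (Feed.query(Feed.uid == uid, Feed.score > 0)
                    .order(-Feed.score)
                    .fetch(limit))
        return (unseen + seen)[:limit]

With seen gone there is one less indexed property (and one less column in the composite index) to update on every fan-out write, which is the "fewer index updates" part above.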
Looking for some advice.
I have a table where the average record size is 2.3 KB. Each user averages 60,000 records per month, for a total of 180,000 to 200,000 per month. The table has 9 different indexes. All data is stored in the same table, separated by a FilingID.
Each month, 3 users import their data and prepare it under unique FilingIDs. Once each individual has completed their process, the data needs to be combined under a single FilingID to be submitted. For example:
User A = FilingID 1
User B = FilingID 2
User C = FilingID 3
Combined = FilingID 4
Each month has new FilingIDs, and the previous months' data is retained.
As I see it, I have 2 options (both sketched below):
1.) When all users have finished their prep, copy the data from FilingIDs 1-3 to FilingID 4. When FilingID 4 has been filed successfully, delete the data from FilingIDs 1-3.
2.) When all users have finished their prep, update the FilingID on the rows for FilingIDs 1-3 to FilingID 4.
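For concreteness, here are both options in plain SQL, run here through Python's sqlite3 purely as a stand-in (the Filings table and Payload column are placeholder names; the real schema obviously differs):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE Filings (FilingID INTEGER, Payload TEXT)")
    cur.executemany("INSERT INTO Filings VALUES (?, ?)",
                    [(1, "user A data"), (2, "user B data"), (3, "user C data")])

    # Option 1: copy rows from FilingIDs 1-3 under the combined FilingID 4,
    # then delete the originals once FilingID 4 has been filed successfully.
    cur.execute("INSERT INTO Filings (FilingID, Payload) "
                "SELECT 4, Payload FROM Filings WHERE FilingID IN (1, 2, 3)")
    cur.execute("DELETE FROM Filings WHERE FilingID IN (1, 2, 3)")

    # Option 2: re-key the existing rows in place (no copy, no delete):
    # cur.execute("UPDATE Filings SET FilingID = 4 WHERE FilingID IN (1, 2, 3)")

    conn.commit()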
I prefer option 1 for a number of reasons; however, I am concerned about what this will do to the size of my database (bloat, fragmented indexes, etc.). I don’t understand the inner workings of the storage engine that well and would appreciate any insight anyone can provide.
NOTE: I do not control the table schema and don’t have the option to use a different table, as this is part of a larger application.
I am working on a project where I have the following scenario:
Sales-rep A may deal with customers x, y and z for products 1, 2 and 3
Sales-rep B may deal with customers w, y and z for products 1, 5 and 6
To deal with the above, I have the following table structures:
SalesReps: a table with the details of each sales rep, uniquely identified by SRID
Products: a table with all the products with unique key PRID
Customers: a table with customer details with unique key CID
Finally, I link them with a table:
RepProductCustomer: with columns SRID, PRID and CID
It all generally works fine, but when dealing with a company that has a lot of products, customers and sales reps, the number of rows gets really large and takes a very long time to add using Entity Framework.
What's the most efficient way to add these entries? Can I use some sort of stored procedure or SQL command to speed this up? I have tried various EF optimisations, but there is just too much data to be inserted on some occasions and the user has to wait for a very long time for the request to complete.
Any help would be appreciated.
Unfortunately, the EF optimisations you have tried will not work, since the real issue in your scenario is the number of database round trips performed (one per record), which makes the SaveChanges method very slow.
Disclaimer: I'm the owner of the project Entity Framework Extensions
BulkSaveChanges allows saving using bulk operations and reduces database round trips.
All associations and inheritance are supported.
// Upgrade SaveChanges performance with BulkSaveChanges
var context = new CustomerContext();
// ... context code ...
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(operation => operation.BatchSize = 1000);
Let's say I have an ordering system which has a table size of around 50,000 rows and grows by about 100 rows a day. Also, say once an order is placed, I need to store metrics about that order for the next 30 days and report on those metrics on a daily basis (i.e. on day 2, this order had X activations and Y deactivations).
1 table called products, which holds the details of the product listing
1 table called orders, which holds the order data and product id
1 table called metrics, which holds a date field, an order id, and the associated metrics.
If I modeled this in a star schema format, I'd design it like this:
FactOrders table, which has 30 days * X orders rows and stores all metadata around the orders, product id, and metrics (each row represents the metrics of a product on a particular day).
DimProducts table, which stores the product metadata
Does my performance gain from a huge FactOrders table only needing one join to get all relevant information outweigh the fact that I increased my table size by 30x and have an incredible amount of repeated data, vs. the truly normalized model that has one extra join but much smaller tables? Or am I designing this incorrectly for a star schema format?
Do not denormalize something this small to get rid of joins. Index properly instead. Joins are not bad, joins are good. Databases are designed to use them.
Denormalizing is risky for data integrity and may not even be faster due to the much wider rows. In tables this tiny, it is very unlikely that denormalizing would help.
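A small sketch of what "index properly" can mean for the normalized model (sqlite3 as a stand-in; the column names are assumptions based on the tables described in the question):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, product_id INTEGER, placed_on DATE);
    CREATE TABLE metrics  (order_id INTEGER, metric_date DATE,
                           activations INTEGER, deactivations INTEGER);

    -- Index the join/filter columns instead of denormalizing.
    CREATE INDEX idx_orders_product    ON orders (product_id);
    CREATE INDEX idx_metrics_order_day ON metrics (order_id, metric_date);
    """)

    # Daily report: two indexed joins over small tables.
    cur.execute("""
        SELECT o.order_id, p.name, m.metric_date, m.activations, m.deactivations
        FROM orders o
        JOIN products p ON p.product_id = o.product_id
        JOIN metrics  m ON m.order_id = o.order_id
        WHERE m.metric_date = ?
    """, ("2016-07-02",))
    print(cur.fetchall())   # empty here, but shows the shape of the query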
When I say a "small" entity, I mean an entity having just 1-2 fields; when I say a "big" entity, I mean one that has many fields and/or has an EmbeddedEntity with many fields on it.
So my question is: is there a difference in storing (put) and retrieving (get) them:
Both in put time and get time
In cost per put and get
Put/get time is related to how long it takes to serialize your entity, as well as how long it takes to transmit it over the network. This generally depends more on the size of your entity in bytes than on the number of fields. An entity with one 900KB field will take longer to process than an entity with 100 4-byte fields.
Cost for puts/gets are described in the GAE pricing page. The get costs don't depend on entity size. The put costs depend on the number of indexes being updated - not the total number of fields or total size. Unindexed fields don't affect the cost, so you could have a huge entity with many unindexed fields and one indexed field - it'll cost the same to put as an entity with a single 4-byte indexed field.
Note also that only indexes that require updating affect your cost. If you update an entity with many indexed fields, but the fields haven't changed and the index doesn't require an update, you don't get charged for those.
Don't forget about storage costs for large entities though.
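As a small illustration of the indexing point above, here's a hypothetical ndb model (assuming Python on App Engine; none of this comes from the question):

    from google.appengine.ext import ndb

    class Message(ndb.Model):
        sender = ndb.StringProperty()              # indexed -> index writes on every put
        body = ndb.TextProperty()                  # TextProperty is never indexed
        blob = ndb.BlobProperty()                  # unindexed; only adds to storage cost
        misc = ndb.StringProperty(indexed=False)   # explicitly unindexed

    # Putting this entity only updates the index entries for `sender`,
    # no matter how large `body` or `blob` are.
    Message(sender="alice", body="x" * 500000).put()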