I am a little confused about the difference between regular queries and projection queries (this is with regard to the new pricing which takes effect July 1st). Say I have a Kind like this:
**POSTS:**
post_id -index
author -index
post_message -index
created -index
If I want to query all posts by an author, I will retrieve N posts, N being the number of posts the author has written. So if he wrote 100 posts, I will eat up 100 read requests for this query. Can I instead create a dummy property and then turn the query into a projection query? So I add a property named dummy, and then I do a query but select only post_id, post_message, and created (author I would already know, since I am filtering by it). This way it would only cost me 1 read to get all these entities. Is this possible to do? And if so, why wouldn't everyone do this to avoid query costs?
Projection queries return values from indexes rather than from the entity itself, so there are some limitations.
In your example, you would need a composite index covering author plus the projected properties (post_id, post_message, created), since you are filtering on author. And if you wanted to retrieve properties such as Text or Blob, you would need to fetch the full entity, because those property types cannot be indexed.
You may also find that if you add properties, or change which properties you want to project, you will need to build new indexes. So while projections may save you some entity reads, you make some sacrifices too.
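To make that concrete, here is a minimal sketch of a projection query using the Python ndb client on App Engine; the Post model mirrors the properties in the question and is otherwise hypothetical:

from google.appengine.ext import ndb

class Post(ndb.Model):
    author = ndb.StringProperty()
    post_message = ndb.StringProperty()
    created = ndb.DateTimeProperty()

# A projection query reads the requested values straight out of the index,
# so each result is billed as a small operation rather than a full entity read.
recent = Post.query(Post.author == 'some-author').order(-Post.created).fetch(
    100, projection=[Post.post_message, Post.created])

Note that this filter-plus-projection combination still requires the composite index described above.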
New to this community. I need some help designing an Amazon DynamoDB table for my personal project.
Overview: this is a simple photo gallery application with the following attributes.
UserID
PostID
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key, PostID as the sort key, and Likes and UploadTime as local secondary indexes, I can solve the first two queries.
I'm confused about how to perform the query operations for 3 and 4 (the newsfeed). I know that without a partition key I cannot query, and a scan is not an effective solution. Any workaround for operations 3 and 4?
Any idea on how I should design my DB?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) that aggregates posts by their creation time. For example, you could add an attribute named GSI1PK to your main table, assign it the value POSTS, and use the upload_time field as the sort key of the secondary index (I've named the index GSI1).
This would allow you to query the index for POSTS and sort by upload_time. This is a great start. However, your POSTS partition will grow quite large over time. Instead of using a single POSTS value as the partition key for your secondary index, consider using a truncated timestamp to group posts by date, for example by the month they were created.
Storing posts under a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps at week/day/hour/etc. granularity, whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
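As a rough illustration, here is what that query could look like with boto3; the table name (PhotoGallery) and the index/attribute names (GSI1, GSI1PK) follow the naming above and are otherwise assumptions:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('PhotoGallery')  # hypothetical table name

def fetch_recent_posts(month, limit):
    # Query one monthly partition of the GSI, newest first (sort key = upload_time).
    resp = table.query(
        IndexName='GSI1',
        KeyConditionExpression=Key('GSI1PK').eq('POSTS#' + month),
        ScanIndexForward=False,
        Limit=limit,
    )
    return resp['Items']

# e.g. fetch_recent_posts('2021-01-00', 10), then fall back to '2020-12-00' if needed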
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend to introduce a date range on the number of likes (e.g. most popular posts this week/month/year/etc.), you could utilize the truncated timestamp approach I outlined for the previous access pattern.
When you find yourself with "fetch most recent" access patterns, you may also want to check out KSUIDs. A KSUID, or K-Sortable Unique Identifier, is a unique identifier that sorts by its creation date/time; think of it as a UUID and a timestamp combined into one attribute. This could be useful in supporting your first access pattern, where you are fetching the most recent posts for a user.
If you were to use KSUIDs in place of the Post IDs in your table, the IDs would be unique yet sortable by the time they were created, so you could support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
You could add two Global Secondary Indexes.
For 3):
Create a static attribute named type with the value post, which serves as the partition key for the GSI, and use the attribute UploadTime as the sort key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another global secondary index with the aforementioned type attribute as the partition key and Likes as the sort key. You can then query it in a similar way as above. Note that GSIs are eventually consistent, so it may take a moment until updated like counters are reflected in the index.
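A rough sketch of that second query with boto3, assuming the likes index is called GSI2 and the attributes are named type and Likes as above:

from boto3.dynamodb.conditions import Key

def fetch_most_liked_posts(table, limit):
    # Query the single "post" partition of the likes index, highest Likes first.
    resp = table.query(
        IndexName='GSI2',
        KeyConditionExpression=Key('type').eq('post'),
        ScanIndexForward=False,
        Limit=limit,
    )
    return resp['Items']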
Explanation and additional info
Using this approach, you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to project only a subset of attributes into the index.
If you have more than 10 GB of post data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a single-table design, I'd recommend using generic names for the index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK. You can then duplicate the relevant attribute values into those fields. This makes it less confusing when you store different entity types in the same table. Adding a type attribute that holds the entity type is also common.
Our product is using Google Datastore as the application database. Most of the entities use IDs of type Long and some of type String. I noticed that the IDs of type Long are not in consecutive order.
Now we are exporting some big tables, with around 30-40 million entries, to JSON files for some business purposes. Initially we expected that a simple query like ofy().load().type(ENTITY.class).startAt(cursor).limit(BATCH_LIMIT).iterator() would let us iterate through the entire content of a specific table, starting from the first entry and ending with the most recently created one. We are working in batches and storing the cursor after every batch, so that the next task can load the next batch and resume.
But after noticing that an entity created a few minutes ago can have an ID smaller than the ID of another entity created a week ago, we are wondering if we should consider a content freeze during this export period. On one hand it's critical to make a complete export and not miss older data up to a specific date; on the other hand, a content freeze longer than one day is a problem for our customers.
What do you advise us to do?
Thanks,
Cristian.
I do not think you need to worry about the uniqueness of your IDs. Datastore is built on top of Bigtable, using six tables:
the first table stores entities
the second stores entities by kind
the third stores indexes for property values in ascending order
the fourth stores indexes for property values in descending order
the fifth stores indexes for multiple properties together
the sixth keeps track of the next unique ID for each Kind
The key format is something like this:
[application ID]-[namespace]-[Kind]-[ID]
This guarantees the uniqueness of each entity.
Yes, the format of the key in that table is [Application ID]-[Kind Name], and the value is the next ID. Say you have a kind named products; that table will contain a row like |key(yourapp-products), Next ID(3)|. Now, when you create a new entity of kind products, it will be assigned ID 3, and the row in that table will get the new value |key(yourapp-products), Next ID(4)|. Also note that the table has only one such row here, since we have only one kind, products.
Do you specify the ID yourself or let Datastore generate it? It sounds like you have a "pre-allocating IDs" issue. Just speculating, but for every batch you would need to call something like Kind.allocate_ids(size=blah); that way you can keep the IDs in sequence.
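For reference, here is a minimal sketch of ID pre-allocation using the Python ndb client (the Objectify equivalent is the factory's allocateIds method); the model name is hypothetical:

from google.appengine.ext import ndb

class EntityKind(ndb.Model):
    pass

# Reserve a contiguous block of IDs up front; entities created with keys from
# this range will have sequential IDs instead of scattered auto-assigned ones.
first_key, last_key = EntityKind.allocate_ids(size=1000)
print(first_key.id(), last_key.id())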
I have a system where users can vote on entities, indicating whether they like or hate them. It will be bazillions of votes and trazillions of records, hopefully, some time in the future :)
At the moment I store a vote in an Entity like this:
UserRecordVote: recordId, userId, hateOrLike
And when I want to get every Record the user liked, I do a query like this:
I query the UserRecordVote table for all the "likes", then I take the recordIds from that result set, create keys from that property, and get the records from the Record table.
Then I aggregate all that in a list and return it.
Here's the question:
I came up with a different approach, and I want to find out 1. whether it is faster and 2. how big the difference in cost is.
I would create an Entity whose kind name would be userId + "likes" and whose key would be the record id:
new Entity(userId + "likes", recordId)
So when I wanted to query for all the likes, I could simply query for everything, no filters needed. AND I could just grab the entity keys, which would be much cheaper if I remember the App Engine documentation right (I can't find the pricing page anymore). Then I could take the Iterable of keys and do a single get(Iterable keys). OK, so I guess this approach is faster and cheaper, right? But what if I want to grab all the votes of a user or, better said, all the records a user hasn't voted on yet?
Here's the real question:
I want to load all the records a user hasn't voted on yet.
So I would have entities like this:
new Entity(userId+"likes", recordId);
and
new Entity(userId+"hates", recordId);
I would query both vote tables for all entity keys and query the record table for all entity keys. Then I would remove all the record entity keys matching one of the vote entity keys, and with the result I would do a get(Iterable keys) to fetch the full entities, leaving me with all the record entities that are not in either of the two voting tables.
Is that a useful approach? Is that the fastest and most cost-efficient way to do a datastore query? Or am I totally wrong, and should I store the information as list properties instead?
EDIT:
With that approach I would have 2 entity groups for each user, which would result in millions of different entity groups. How would the GAE Datastore handle that? At the least, the Datastore Viewer's entity select box would probably crash :)
To answer the real question: you probably want your hateOrLike field to store an integer that indicates hated/liked/not voted. Then you can filter on hateOrLike = notVoted.
The other solutions you propose with the dynamically named entities make it impossible to query on other aspects of your entities, since you don't know their names.
The other thing is, since you expect this to be huge, you likely want to keep a running counter of your votes rather than tabulating them every time you pull up a UserRecord. Querying all the votes and then calculating them on each view is very slow, especially since App Engine will only return 1000 results per query; if you have more than 1000 votes, you'll have to keep making repeated queries to get all the results.
If you think people will vote quickly, you should look into using a sharded counter for performance. There are examples of that, with code, available if you do a Google search.
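For illustration, a minimal sharded-counter sketch using Python ndb, adapted from the standard App Engine recipe; the model name and shard count are hypothetical:

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20

class VoteCounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(counter_name):
    # Write to a random shard so concurrent votes rarely contend on one entity.
    shard_id = '%s-%d' % (counter_name, random.randint(0, NUM_SHARDS - 1))
    shard = VoteCounterShard.get_by_id(shard_id) or VoteCounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def get_count(counter_name):
    # Sum across all shards for the total.
    keys = [ndb.Key(VoteCounterShard, '%s-%d' % (counter_name, i)) for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s)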
Consider serializing a user's hate/like votes in two separate TextProperties inside one entity, using the userId as the key_name:
rec = UserRecordVote.get_by_key_name(userId)
hates = len(rec.hates.split('_'))
etc.
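Filling that idea out a little, here is a minimal sketch using the old Python db API; the '_' delimiter and the get_by_key_name lookup come from the snippet above, the rest is assumed:

from google.appengine.ext import db

class UserRecordVote(db.Model):
    likes = db.TextProperty(default='')   # '_'-separated record IDs the user liked
    hates = db.TextProperty(default='')   # '_'-separated record IDs the user hated

def liked_record_ids(user_id):
    # A single get by key name instead of a query over individual vote entities.
    rec = UserRecordVote.get_by_key_name(user_id)
    return rec.likes.split('_') if rec and rec.likes else []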
For my website, I want to make something that works a bit like the tags on Stack Overflow: some fields will have an autocompleter, and the autocompleter will display the number of times each suggested value has been selected by other users. I suppose I'd have a database structure like this:
Articles
ArticleID
Content
TagId
Tags
TagId
TagName
Occurances
With the idea being that Occurances represents the number of times each TagId is referenced from the Articles table.
What is the best way to implement this? I could add/subtract from the Occurances column in each of the stored procedures that update the Articles table, but I might miss one, and anyway there are some difficulties with this if a user removes a tag from something (it's easy to add 1 to the field for the newly added tag, but harder to work out which tag is being replaced).
There is a lot I don't understand about SQL Server. Is there a more robust way of counting occurrences like this, one that the database system will handle itself? It would be OK if the data was only refreshed once a day or so.
To be able to have more than one tag attached to an article, you will have to add another table that connects the article table to the tag table. This is called a 'many-to-many' relation.
article
article_id
content
article_tag
article_id
tag_id
tag
tag_id
tagname
Doing it like this, article 1 can be attached to tag 2, the next row can be 1 and 3, and so on, so one article points to many tags. To count a certain tag, you join the article_tag and tag tables and count the rows in article_tag where tag.tagname = 'mysql', for example.
You can create an indexed view that aggregates all the counts you need and is automatically maintained:
create view TagCounts
with schemabinding
as select TagId, count_big(*) as Occurances
from dbo.ArticleTags
group by TagId;
go
create unique clustered index cdxTagCounts on TagCounts (TagId);
go
Now the TagCounts.Occurances field is automatically maintained by SQL Server whenever you insert/delete/update the underlying ArticleTags table. You can query it like this:
select Occurances from dbo.TagCounts with (noexpand) where TagId = ...;
And you can cache the result with LinqToCache, as such a query matches the restrictions of Query Notifications.
The trade-off of using a pre-aggregated indexed view is scalability: since updating any article also updates the Occurances count for that article's tags, an exclusive lock is required to update the count. This implies that only one transaction can touch a given TagId at any moment. Depending on your traffic and on other elements of your design, this restriction may or may not be acceptable.
The other alternative is a table of counts. The front ends (your ASP.NET farm) read these counts and then update an in-memory count for each operation, keeping track of the delta from the counts in the table. Periodically the front ends merge their deltas into the table (e.g. every 5 minutes) and refresh the in-memory copy. This way the front ends see a stale version of the truth, but a user sees immediate feedback for his own actions: because of session stickiness, his HTTP requests are processed by the same front end, so he immediately sees his own article updates triggering modifications to the tag counts. Users do not, however, immediately see updates from other users that are load-balanced to another front end. And because a crash of a front end (or a process recycle...) will lose the deltas kept so far, the count table will drift away from the truth over time and will have to be periodically corrected against the true count in the database.
If you need even more accuracy (all users see the true count immediately), then you can do something based on fast in-memory key-value stores, which would be basically the same as my first proposal but with much higher throughput/lower latency, perhaps something based on memcached + Redis. I'm not acquainted with SO's architecture, but I believe they may be doing something similar.
You could use this query to get the number of occurances by tag:
SELECT Tags.TagId, COUNT(Articles.TagId) as Occurances
FROM Articles
JOIN Tags ON Articles.TagId = Tags.TagId
GROUP BY Tags.TagId
It could be used in a view or stored procedure, and you can set up your website's cache to requery it as often as required.
If you are using a relational database, the correct way to handle this problem is NOT to store the occurrence count in the table itself, but rather to query the number of occurrences from the Articles table dynamically.
If you don't do it this way, you're stuck coding update queries every time you add/delete a row... generally not nice. If you query dynamically, you won't have an occurrences column in the table, but will instead compute that information in, e.g., your presentation/model layer code.
Use:
SELECT COUNT(*) FROM ARTICLES WHERE TagId = 'xxx' ;
This query would be run as part of your iterating code.
I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, price, color, etc. Users need to be able to search and filter results based on any of those product attributes. I also expect to have a large number of users. My initial thought was to have one big table with every product in it, with a column for each piece of information and an index on anything I need to be able to search by, but I think this might be inefficient with a lot of users pounding on this one table. My other thought was to organize the database to promote tree-like navigation of tables, but because you can search by anything, I'm not sure how I would organize the tables.
Any thoughts on some good practices?
One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at all the data you have and determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of the data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All of this will inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables; others (price) will be simple scalars. With appropriate indexing you'll be fine. For advanced analytics, consider a dimensionally modeled star schema, but perhaps not for your live transaction system; it depends what your data flow/workflow/transactions are. Or consider adopting some of its principles in your transactional database. Ralph Kimball is a good source of information on dimensional modeling.
I don't see any need for a tree structure here. You can do this with a single table.
If you insist on a tree structure with a hierarchy, here is an example to get you started.
For text-based search, and for ease of startup and design, I strongly recommend Apache SOLR. The SOLR API is easy to use (especially JSON). Databases do text search poorly, so I would instead recommend that you just make sure they respond properly to primary/unique key queries, and those are the fields you should index.
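As a taste of how simple the JSON interface is, here is a rough sketch of a query against a hypothetical local Solr core named products (the URL, core name, and field names are all assumptions):

import requests

resp = requests.get(
    'http://localhost:8983/solr/products/select',
    params={'q': 'name:shoes', 'wt': 'json', 'rows': 10},
)
for doc in resp.json()['response']['docs']:
    print(doc['id'], doc.get('name'))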
One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index the top 5 or 10 columns you think users are most likely to search on, unless it's really possible for a user to search on ANY column. In that case you might want to look at building a data warehouse. Maybe you'll also want to look into data cubes to see if those would help...?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID   PARENT_ID   NAME
1    NULL        ROOT
2    1           SOCKS
3    1           HELICOPTER PARTS
4    2           ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purposes.