Handling Complex Queries in Firestore

I have a database consisting of reviews, follows, and users, where users following other users is a many-to-many relationship modeled by the follow collection. In total my schema looks as follows:
follow (collection) - key: fid
    following (uid)
    follower (uid)
review (collection) - key: rid
    title (string)
    author (uid)
    posted (timestamp)
user (collection) - key: uid
    created (timestamp)
    email (string)
I want to run a query to get the T most recent reviews where the user is following the author. In a SQL environment I would do this with two joins and a where clause.
Let us consider a user following n people, where each person they're following has m reviews. I was considering fetching all reviews for all of the n people one is following, then discarding all but the T most recent, but recognize the number of reads will be n*m. As we can easily expect n > 100 and m > 1000, this is not a viable solution. I recognize there is probably no great way to do this in Firestore. Any suggestions?
UPDATE: The top answer to a similar question gives an n*k solution for the number of reads (where k is an arbitrary per-author limit). It is also answering an easier question: "get the T most recent reviews for each person one is following," not "get the T most recent reviews across all people one is following." That answer suggests keeping an updated copy of all followers on every review, then using a whereArrayContains clause to find reviews by authors one is following. But if user A follows a user B who has B_m reviews, we will perform B_m writes for each follow or unfollow. We will also be massively denormalizing our database, storing and updating the same information in thousands of locations.
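For concreteness, that whereArrayContains approach would look roughly like this with the Python client (a sketch only; the denormalized followers array on each review is the extra field that approach requires, and the query also needs a composite index on followers + posted):

# Sketch of the denormalized-followers approach (Python client, illustrative only).
from google.cloud import firestore

db = firestore.Client()

def recent_reviews_from_followed_authors(uid, t):
    # Requires every review document to carry a "followers" array kept in
    # sync on follow/unfollow -- the very writes questioned above.
    query = (
        db.collection("review")
        .where("followers", "array_contains", uid)
        .order_by("posted", direction=firestore.Query.DESCENDING)
        .limit(t)
    )
    return list(query.stream())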

Your data seems highly relational, so the best option for you is to switch to a relational database. The only way to get this to work in Firestore is to either completely denormalize your data or chain a ton of queries together to get all of the data you need; neither is ideal in my opinion.

There's an approach that worked for me: creating a composite index. How? Go to Firestore => Indexes and add an index. With it you can link the fields you want to query on, and then you will be able to run this query in your code.
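For example (a sketch with the Python client, using the collection and field names from the question above), a query that filters on one field and orders by another needs such a composite index:

# Sketch: filtering on "author" while ordering by "posted" requires a composite
# index on (author, posted); Firestore rejects the query with a link to create
# the index if it doesn't exist yet.
from google.cloud import firestore

db = firestore.Client()
some_uid = "uid123"  # placeholder

recent_by_author = (
    db.collection("review")
    .where("author", "==", some_uid)
    .order_by("posted", direction=firestore.Query.DESCENDING)
    .limit(10)
    .stream()
)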

Related

Amazon DynamoDB Single Table Design For Blog Application

New to this community. I need some help in designing the Amazon DynamoDB table for my personal projects.
Overview: this is a simple photo gallery application with the following attributes.
UserID
PostID
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key, PostID as the sort key, and Likes and UploadTime as local secondary indexes, I can solve the first two queries.
I'm confused on how to perform the query operation for 3 and 4 (Newsfeed). I know that without a partition key I cannot query, and a scan is not an effective solution. Any workaround for operations 3 and 4?
Any idea on how should I design my DB ?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) that aggregates posts by their creation time. For example, you could create an attribute named GSI1PK on your main table, assign it a value of POSTS, and use the upload_time field as the sort key.
Viewing the secondary index (I've named it GSI1), all posts would then appear in a single POSTS partition, sorted by upload_time.
This would allow you to query for Posts and sort by upload_time. This is a great start. However, your POSTS partition will grow quite large over time. Instead of choosing POSTS as the partition key for your secondary index, consider using a truncated timestamp to group posts by date. For example, here's how you could store posts by the month they were created:
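For illustration, a post item carrying the extra GSI attribute might look like this (a sketch only; all values are invented, and the truncated-timestamp format mirrors the example further below):

# Hypothetical items; GSI1PK buckets posts by month, and the existing
# UploadTime attribute serves as the index sort key.
post_jan = {
    "UserID": "user#123",                 # table partition key
    "PostID": "post#987",                 # table sort key
    "S3URL": "s3://my-bucket/img1.jpg",
    "Caption": "sunset",
    "Likes": 12,
    "UploadTime": "2021-01-14T09:30:00Z",
    "GSI1PK": "POSTS#2021-01-00",         # truncated (monthly) timestamp
}
post_dec = {
    "UserID": "user#456",
    "PostID": "post#654",
    "S3URL": "s3://my-bucket/img2.jpg",
    "Caption": "coffee",
    "Likes": 3,
    "UploadTime": "2020-12-30T18:05:00Z",
    "GSI1PK": "POSTS#2020-12-00",
}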
Storing posts using a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps for a week/day/hour/etc. Whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
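A boto3 sketch of that month-by-month fallback might look like this (table name, index name, and key attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("photo-gallery")  # placeholder name

def most_recent_posts(n, month_buckets):
    # month_buckets: newest first, e.g. ["POSTS#2021-01-00", "POSTS#2020-12-00"]
    posts = []
    for bucket in month_buckets:
        resp = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSI1PK").eq(bucket),
            ScanIndexForward=False,        # newest first on the sort key
            Limit=n - len(posts),
        )
        posts.extend(resp["Items"])
        if len(posts) >= n:
            break
    return posts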
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend on introducing a date range on the number of likes (e.g. most popular posts this week/month/year/etc.), you could utilize the truncated timestamp approach I outlined for the previous access pattern.
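A minimal query sketch for that index (here named GSI2 with a GSI2PK attribute; both names, and the table name, are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("photo-gallery")  # placeholder name

resp = table.query(
    IndexName="GSI2",
    KeyConditionExpression=Key("GSI2PK").eq("LIKES"),
    ScanIndexForward=False,   # highest like count first
    Limit=10,
)
most_liked = resp["Items"]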
When you find yourself with "fetch most recent" access patterns, you may want to check out KSUIDs. KSUIDs, or K-Sortable Unique Identifiers, are unique identifiers that are sortable by their creation date/time. Think of them as UUIDs and timestamps combined into one attribute. This could be useful in supporting your first access pattern, where you are fetching the most recent posts for a user. If you were to use a KSUID for the Post ID, the Post IDs themselves would sort by creation time.
Because KSUIDs are unique and sortable by the time they were created, you are able to support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
You could add two Global Secondary Indexes.
For 3):
Create a static attribute type with the value post, which serves as the Partition Key for the GSI and use the attribute UploadTime as the Sort Key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another Global Secondary Index with the aforementioned item type as the partition key and Likes as the sort key. You can then query in a similar way as above. Note that GSIs are eventually consistent, so it may take time until your like counters are updated.
Explanation and additional info
Using this approach you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to only project a subset of attributes into the index.
If you have more than 10GB of post-data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a Single Table Design, I'd recommend using generic names for the index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK. You then duplicate the relevant attribute values into these generic attributes. This makes it less confusing if you store different entities in the table. Adding a type column that holds the entity type is also common.
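As a rough sketch of what that looks like when creating the table with boto3 (table name, index names, attribute types, and billing mode are all assumptions):

# Sketch: single-table design with generic key names (PK/SK, GSI1PK/GSI1SK,
# GSI2PK/GSI2SK), as recommended above.
import boto3

boto3.client("dynamodb").create_table(
    TableName="app-table",                # placeholder name
    BillingMode="PAY_PER_REQUEST",
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
        {"AttributeName": "GSI2PK", "AttributeType": "S"},
        {"AttributeName": "GSI2SK", "AttributeType": "N"},   # e.g. like count
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},
        {"AttributeName": "SK", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        },
        {
            "IndexName": "GSI2",
            "KeySchema": [
                {"AttributeName": "GSI2PK", "KeyType": "HASH"},
                {"AttributeName": "GSI2SK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        },
    ],
)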

Adapting one-to-many relationship to DynamoDB (NoSQL)

Introduction
Hello, I'm moving to AWS because of stability, performance, etc. I will be using DynamoDB because of the always free tier that allows me to reduce my bills a lot. I was using MySQL until now. I will make the attributes simple for this example (to show the actual places where I need help and make the question shorter).
My actual DB has less than 5k rows and I expect it to grow to 20-30k in 2 years. Each user (without any group/order data) is around 600B. I don't know how this will translate to a NoSQL DB but I expect it to be less than 10MB.
What data will I have?
User:
username
password
is_group_member
Group:
capacity
access_level
Order:
oid
status
prod_id
Relationships:
User has many orders.
Group has many users.
How will I access the data and what will I get?
I will access the user by username (I won't know the group he is in). I will need to get the user's data, the group he belongs to and its data.
I will access the users that belong to a certain group. I will need to get the users' data and the group data.
I will access an order by its oid. I will need to get the user it belongs to and its data.
What I tried
I watched a series of videos by Gary Jennings, read answers on SO and also read alexdebrie's article about one-to-many relationships. My problem is that I can't seem to find an alternative that suits all the ways I will access the data.
For example:
Denormalization: it will leave me with a lot of duplicated data thus increasing the cost.
Composite primary key: I will be able to access the users by their group, but how will I access the user and the group's data without knowing the group beforehand? I would need to use 2 requests, making it inefficient and increasing the costs.
Secondary index + the Query API action: Again I would need to use 2 requests making it inefficient and increasing the costs.
Final questions
Did I misunderstand the alternatives? I started this question because my knowledge is not "big enough" to actually know if there is a better alternative that I can't think of, so maybe I got the explanations wrong.
Is there a better alternative for this case?
If there wasn't a better alternative, what would you do in my case? Would you duplicate the group's data (thus increasing the used space and making it need only 1 request)? or would you use one of the other 2 alternatives and use 2 requests?
You're off to a great start by articulating your access patterns ahead of time.
Let's start by addressing some of your comments about data modeling in DynamoDB:
Denormalization: it will leave me with a lot of duplicated data thus increasing the cost.
When first learning DynamoDB data modeling, prior SQL Database knowledge can really get in the way. Normalizing your data is a common practice when working with SQL databases. However, denormalizing your data is a key data modeling strategy in DynamoDB.
One BIG reason you want to denormalize your data: DynamoDB doesn't have joins. Because DDB doesn't have joins, you'll be well served to pre-join your data so it can be fetched in a single query.
This blog post does a good job of explaining why denormalization is important in DDB.
Keep in mind, storage is cheap. Denormalizing your data makes for faster data access at a relatively low cost. With the size of your database, you will likely be well under the free tier threshold. Don't stress about the duplicate data!
Composite primary key: I will be able to access the users by its group but how will I access the user and the group's data without knowing the group beforehand. I would need to use 2 requests making it inefficient and increasing the costs.
Denormalizing your data will help solve this problem (e.g. store the group info with the user). I'll give you an example of this below.
Secondary index + the Query API action: Again I would need to use 2 requests making it inefficient and increasing the costs.
You didn't share your primary key structure, so I'm not sure what scenario will require two requests. However, I will say that there may be certain situations where making two requests to DDB is a reasonable approach. Making two efficient query operations is not the end of the world.
OK, on to an example of modeling your relationships! Keep in mind that there are many ways to model data in DynamoDB. This example is not THE way. Rather, it's an example meant to demonstrate a few strategies that might help.
Here's one take on your data model.
With this arrangement, you can support the following access patterns:
Fetch user information - PK = USER#[username] SK = USER#[username]
Fetch user group - PK = USER#[username] SK begins_with GROUP#. Notice I denormalized user data in the group item. The reason for this will be apparent shortly :)
Fetch user orders - PK = USER#[username] SK begins_with ORDER#
Fetch all user data - PK = USER#[username]
To support your remaining access patterns, I created a secondary index in which the partition key and sort key of the base table are swapped. This pattern is called an inverted index.
This secondary index supports the following access patterns:
Fetch Group users - PK = GROUP#[groupId]
Fetch Order by oid - PK = ORDER#[oid]
You can see that I denormalized the User and Group relationship by repeating user data in the item representing the Group. This helps me with the "fetch group users" access pattern.
Again, this is just one way you can achieve the access patterns you described. There are many strategies, but many will require that you abandon some of the best practices you learned working with SQL databases!
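To make the access patterns above concrete, here is a hedged boto3 sketch (the table name, index name, and example key values like USER#alice are made up; the PK/SK layout follows the answer):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")   # placeholder name

# Fetch all user data (the user item, group item, and order items share one PK)
all_user_data = table.query(
    KeyConditionExpression=Key("PK").eq("USER#alice")
)["Items"]

# Fetch only the user's orders
user_orders = table.query(
    KeyConditionExpression=Key("PK").eq("USER#alice") & Key("SK").begins_with("ORDER#")
)["Items"]

# Fetch all users in a group via the inverted index (base-table SK is its partition key)
group_users = table.query(
    IndexName="InvertedIndex",            # placeholder index name
    KeyConditionExpression=Key("SK").eq("GROUP#admins"),
)["Items"]

# Fetch an order (and the user it belongs to) by its oid via the inverted index
order_items = table.query(
    IndexName="InvertedIndex",
    KeyConditionExpression=Key("SK").eq("ORDER#12345"),
)["Items"]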

Storing Likes in a Non-Relational Database

Gist
I implemented a like button in my application. Let's imagine users are able to like other users' products.
Issue
I am now wondering which of the following is the most effective and robust method to store those likes in a non-relational Database (in my case MongoDB). It's important that no user can like a product twice.
Possible Solutions
(1) Store the user ids of those who liked on the product itself, and keep track of the number of likes via likes.length
// Product in database
{
  likes: [
    'userId1',
    'userId2',
    'userId3',
    ...
  ],
  ...
}
(2) Store all products that a user liked on the user itself, and keep track of the number of likes through a counter on the product
// User in database
{
  likedProducts: [
    'productId1',
    'productId2',
    'productId3',
    ...
  ],
  ...
}
// Product in database
{
  numberOfLikes: 42,
  ...
}
(3) Maybe there is even a better solution for this?
Either way, if the product has many likes or the user liked many products, there is a large amount of data that has to be loaded just to show the likes and to check whether the user has already liked the product.
Which approach to use, (1) or (2), depends on your use case; specifically, you should think about which data you will need to access more often: all products liked by a particular user (2), or all users who liked a particular product (1). It looks more likely that (1) is the more frequent case - that way you would easily know whether the user already liked the product, and the number of likes for the product is simply the array length.
I would argue that any further improvement would likely be a premature optimization - it's better to optimize with a problem in hand.
If showing number of likes, for example, appears to be a bottleneck, you can denormalize your data further by storing array length as a separate key-value. That way displaying the product list wouldn't require receiving array of likes with userIds from the database.
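For example, a pymongo sketch of that idea: push the user id and bump a denormalized counter in one atomic update, skipping users who have already liked the product (database and collection names are placeholders):

# Sketch (pymongo): add a like only if this user hasn't liked the product yet,
# and keep a denormalized numberOfLikes counter in the same atomic update.
from pymongo import MongoClient

products = MongoClient().shop.products   # placeholder database/collection names

def like_product(product_id, user_id):
    result = products.update_one(
        {"_id": product_id, "likes": {"$ne": user_id}},   # skip if already liked
        {"$push": {"likes": user_id}, "$inc": {"numberOfLikes": 1}},
    )
    return result.modified_count == 1    # False if the user had already liked it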
Even more unlikely, with millions of likes on a single product, you'll find significant slowdown from looping through the likes array to check if the userId is already in it. You can, of course, use something like a sorted array to keep likes sorted, but database communication would still be slow (slower than looping through an array in memory anyway). It's better to use the database's indexing for binary search and, instead of storing the likes as an array embedded in the product (or user), store likes in a separate collection:
{
  _id: $oid1,
  productId: $oid2,
  userId: $oid3
}
That, assuming the product has a key with the number of likes, should be the fastest way of accessing likes if all 3 keys are indexed.
You can also be creative and use the concatenation of $oid2+$oid3 as $oid1, which would automatically enforce uniqueness of the user-product like pairs. Then you'd just try saving it and ignore the database error (this might lead to subtle bugs, so it'd be safer to check whether the like already exists when a save fails).
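A pymongo sketch of that separate-collection idea, using a unique compound index rather than a concatenated _id to enforce the one-like-per-user rule (collection names are placeholders):

# Sketch (pymongo): separate likes collection, uniqueness enforced by a
# compound unique index on (productId, userId).
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

db = MongoClient().shop                  # placeholder database name
db.likes.create_index([("productId", 1), ("userId", 1)], unique=True)

def like(product_id, user_id):
    try:
        db.likes.insert_one({"productId": product_id, "userId": user_id})
        db.products.update_one({"_id": product_id}, {"$inc": {"numberOfLikes": 1}})
        return True
    except DuplicateKeyError:
        return False                     # the user already liked this product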
Why not simply amend the requirements and use either a relational database or an RDBMS-like solution? Basically, use the right tool for the right job:
Create another table Likes that keeps pair of your productId and userId as unique key. For example:
userId1 - productId2
userId2 - productId3
userId2 - productId2
userId1 - productId5
userId3 - productId2
Then you can query by userId and get number of likes per user or query by productId and get number of likes per product.
Moreover, the unique key userId_productId will guarantee that a user can like a given product only once.
Additionally, you can keep in another column(s) extra information like timestamp when user liked the product etc.
You might also need to consider the document size: storing a user id on each product, or a product id on each user, might hit document size limits and won't scale very well.
An RDBMS will be a better solution for this problem.

App Engine Datastore: entity design and query optimization

I have a system where users can vote on entities, indicating whether they like or hate them. There will be bazillions of votes and trazillions of records, hopefully, some time in the future :)
At the moment i store a vote in an Entity like this:
UserRecordVote: recordId, userId, hateOrLike
And when I want to get every Record the user liked, I do a query like this:
I query the "UserRecordVote" table for all the "likes", then I take the recordIds from that result set, create keys from them, and get the records from the Record table.
Then I aggregate all that in a list and return it.
Here's the question:
I came up with a different approach and I want to find out 1. if it is faster and 2. how big the difference in cost is.
I would create an Entity whose kind name would be userId + "likes" and whose key would be the record id:
new Entity(userId + "likes", recordId)
So when I would do a query to get all the likes, I could simply query for all of them, no filters needed. AND I could just grab the entity keys, which would be much cheaper if I remember the App Engine documentation right (can't find the pricing page anymore). Then I could take the Iterable of keys and do a single get(Iterable keys). OK, so I guess this approach is faster and cheaper, right? But what if I want to grab all the votes of a user, or better said, all the records a user didn't vote on yet?
Here's the real question:
I want to load all the records a user didn't vote on yet:
So I would have entities like this:
new Entity(userId+"likes", recordId);
and
new Entity(userId+"hates", recordId);
I would query both vote tables for all entity keys and query the record table for all entity keys. Then I would remove all the record entity keys matching one of the vote entity keys, and with the result I would get(Iterable keys) the full entities, leaving me with all the record entities which are not in either of the two voting tables.
Is that a useful approach? Is that the fastest and most cost-efficient way to do a datastore query? Am I totally wrong and should I store the information as list properties instead?
EDIT:
With that approach I would have 2 entity kinds for each user, which would result in millions of different kinds. How would the GAE Datastore handle that? At least the Datastore Viewer entity select box would probably crash :) ?
To answer the Real Question, you probably want to have your hateOrLike field store an integer that indicates either hated/liked/notvoted. Then you can filter on hateOrLike=notVoted.
The other solutions you propose with the dynamically named entities make it impossible to query on other aspects of your entities, since you don't know their names.
The other thing is that, since you expect this to be huge, you likely want to keep a running counter of your votes rather than tabulating them every time you pull up a UserRecord - querying all the votes and then calculating them on each view is very slow, especially since App Engine will only return 1000 results per query, and if you have more than 1000 votes, you'll have to keep making repeated queries to get all the results.
If you think people will vote quickly, you should look into using a sharded counter for performance. There are examples of that, with code, available if you do a Google search.
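A rough sketch of the "notVoted" filter with the old google.appengine.ext.db API (the model definition and the integer codes are assumptions):

# Sketch: store an integer vote state and filter on "not voted yet".
from google.appengine.ext import db

NOT_VOTED, LIKE, HATE = 0, 1, 2

class UserRecordVote(db.Model):
    recordId = db.StringProperty()
    userId = db.StringProperty()
    hateOrLike = db.IntegerProperty(default=NOT_VOTED)

some_user_id = 'user123'  # placeholder

# All records a given user has not voted on yet:
not_voted = (UserRecordVote.all()
             .filter('userId =', some_user_id)
             .filter('hateOrLike =', NOT_VOTED)
             .fetch(1000))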
Consider serializing user hate/like votes in two separate TextProperties inside the entity. Use the userId as key_name.
rec = UserRecordVote.get_by_key_name(userId)
hates = len(rec.hates.split('_'))
etc.

What could be cassandra schema to serve this query?

Assume a social application that has a few million users and around 200-300 topics. Users can make posts, which can be tagged with up to 5 topics. I have 2 kinds of queries on this data:
find posts by a certain user
find all recent posts tagged with a specific topic.
For the 1st query I can easily create the schema using superColumns in the User column family (in this supercolumn, I can store the postIds of all posts by the user as columns).
My question is how should I design the schema to serve 2nd query in Cassandra?
Although Justice's answer would work, I don't like it because it requires an OrderPreservingPartitioner to perform the range scan. OPP has a lot of problems associated with it. See the article that I've been linking to constantly for details.
Instead, I would recommend this:
topic|YYMMDDHH: {TimeUUID: postID, TimeUUID: postID, etc... }
where "topic|YYMMDDHH" is the row key, each column name is a TimeUUID, and the column values are postIDs.
To get the latest posts for any topic, you get a slice off the end of the most recent row for that topic. If that row didn't have enough columns, you go to the previous one in time, etc.
This has a few nice properties. First, if you don't care about really old posts on a topic, only relatively recent ones, you can purge old rows on a regular basis and save yourself some space; this could even be done with column TTLs so that you don't have to do any extra work. Second, your rows will be bounded in size because they are split every hour. Third, you don't need OPP :)
One downside to this is that if there's a really hot topic, one node may receive higher traffic than the others for an hour at a time.
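For reference, roughly the same layout expressed in modern CQL via the Python driver (the original answer predates CQL; keyspace, table, and sample values are made up):

# Sketch: topic + time bucket as a composite partition key, TimeUUID clustering.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("social")   # placeholder keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS posts_by_topic (
        topic   text,
        bucket  text,      -- e.g. '2021011509' (YYMMDDHH-style time bucket)
        created timeuuid,
        post_id text,
        PRIMARY KEY ((topic, bucket), created)
    ) WITH CLUSTERING ORDER BY (created DESC)
""")

# Latest posts for a topic in the current hour bucket; if there aren't enough,
# the application falls back to the previous bucket, as described above.
rows = session.execute(
    "SELECT post_id FROM posts_by_topic WHERE topic = %s AND bucket = %s LIMIT %s",
    ("cassandra", "2021011509", 20),
)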
For the second query, build a secondary-index column family whose keys are #{topic}:#{unix_timestamp}. Rows would have a single column with the post ID. You can then do a range scan.
