Data Modeling and uuid on Cassandra - database

I am trying to build a movie database for educational purposes, using Cassandra as the backend. Queries on the database will mainly be made by movie title. The data I currently have fits the following model:
movie title | imdb rating | year of release | actors
Reading the CQL documentation, I found the music playlist example, which uses the following structure:
CREATE TABLE playlists (
id uuid,
song_order int,
song_id uuid,
title text,
album text,
artist text,
PRIMARY KEY (id, song_order ) );
My question is: why is a separate id column necessary? Can't the title column be used as a primary key? What are the advantages and disadvantages of not using a separate uuid field?
The statement I have designed for my model is:
CREATE TABLE movies (
title text,
imdb_rating double,
year int,
actors text,
PRIMARY KEY (title, imdb_rating ) );
Here I believe that in my model title is the PRIMARY KEY and the PARTITION KEY and imdb_rating is the CLUSTERING KEY (for arranging output in ascending order). Is there anything wrong with my model, how will it affect the distribution of the data, and why should or shouldn't I use a uuid? I am planning to use a replication_factor of 2 because I only have 3 nodes.
Also according to the documentation
Do not use an index in these situations:
......
•On a frequently updated or deleted column
In my database the most frequently updated column is imdb_rating, so I am not building a secondary index on it.

Can't the title column be used as a primary key?
If the movie title is unique (which is not necessarily true), you could use title as the primary key.
What are the advantages and disadvantages of not using a separate uuid field?
A UUID is useful when you need an identifier that is globally unique without having to check for its uniqueness. If you can find a set of columns whose combination is guaranteed to be unique, you don't need a UUID (assuming you don't need an id to refer to the row).
But it all depends on your query pattern. If you are going to look up a movie by its id (probably coming from another table), use the UUID as the primary key. If you want to find movies with a specific title, use title as the primary key.
In your case, since title is not unique, use a combination of title and UUID as a composite key, given that you would search by title.
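As a rough sketch of the title-plus-UUID key described above (the table name movies_by_title, the column name movie_id, and the sample values are assumptions for illustration, not part of the original question): title stays the partition key so lookups by title hit a single partition, and the UUID is a clustering column that keeps two movies with the same title apart. The rating is kept as a regular column in this sketch because primary key columns cannot be changed with UPDATE.

```sql
-- Hypothetical table: title is the partition key, movie_id disambiguates duplicates.
CREATE TABLE IF NOT EXISTS movies_by_title (
    title text,
    movie_id uuid,
    year int,
    imdb_rating double,   -- regular column, so it can be updated in place
    actors text,
    PRIMARY KEY ((title), movie_id)
);

-- uuid() generates a random (version 4) UUID on the server; the values are placeholders.
INSERT INTO movies_by_title (title, movie_id, year, imdb_rating, actors)
VALUES ('Example Movie', uuid(), 2000, 7.5, 'Actor One, Actor Two');

-- Returns every movie that shares this title; no ALLOW FILTERING is needed.
SELECT movie_id, year, imdb_rating, actors
FROM movies_by_title
WHERE title = 'Example Movie';
```

Updating the rating of one specific movie would then need both title and movie_id in the WHERE clause.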
Here I believe in my model title is the PRIMARY KEY and the PARTITION KEY and imdb_rating is the CLUSTERING KEY(for arranging output in ascending order). Is there anything wrong in my model and how will it affect distribution of the data and why should I/should not use uuid?
In this case you have to include both the rating and a UUID in the primary key, but your queries will then need ALLOW FILTERING.

Related

Cassandra static columns

I am new to Cassandra. I have a resources table that is part of an e-learning platform I am designing. The table has courseId as its partition key, sectionId as the first clustering key, and resourceId as the second clustering key, followed by the rest of the data.
I now want to add a section_name column without copying its value into every resource under a given sectionId. I think this is similar to a static column, except that instead of being static per partition key (courseId in our case) it should be static per sectionId. Is there a feature that does this, or any other way to achieve it?
One way that comes to my mind now is to make a table for the course sections and another table for the resources that has the sectionId as its partition key, so that section_name can be a static column. Another solution is to keep the table as it is and create another one that has courseId as the partition key and sectionId as a clustering key, and put section_name under that primary key.
I don't want section_name copied into every resource because an update to the section name (which I think is unlikely to happen often) would require updating all the resources in the section.
Side note: I am using microservices, so the resources table is my boundary. Sorry if the title isn't expressive enough.
Unfortunately, no. All rows of a partition share the same value in the static column.
This isn't the right way to model data in Cassandra:
One way that comes to my mind now is to make a table for the course sections and another table for the resources that has the sectionId as its partition key ...
Data is denormalised in Cassandra so you don't do joins or foreign lookups.
For each app query, you need to design a table with all the data required to respond to that app query. Cheers!
Looking at your question, I think the table design below should solve your problem:
CREATE TABLE IF NOT EXISTS resource_by_courseId_sectionId
(
courseId text,
sectionId text,
resourceId text,
section_name text static,
primary key ((courseId,sectionId),resourceId)
)WITH CLUSTERING ORDER BY (resourceId DESC);
Because the partition key is the composite (courseId,sectionId), the static column section_name is stored once per section and shared by every resource in that section. Note that queries against this table must then supply both courseId and sectionId.
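A short usage sketch, assuming the table above has been created under the name resource_by_courseId_sectionId (the ids and names used here are made-up sample values): a static column can be written with only the partition key, so renaming a section is a single write no matter how many resources it contains.

```sql
-- Set the section name once for the whole (courseId, sectionId) partition.
INSERT INTO resource_by_courseId_sectionId (courseId, sectionId, section_name)
VALUES ('course-1', 'section-1', 'Intro');

-- Add resources without repeating the section name.
INSERT INTO resource_by_courseId_sectionId (courseId, sectionId, resourceId)
VALUES ('course-1', 'section-1', 'resource-1');
INSERT INTO resource_by_courseId_sectionId (courseId, sectionId, resourceId)
VALUES ('course-1', 'section-1', 'resource-2');

-- Rename the section: one write, identified by the partition key only.
UPDATE resource_by_courseId_sectionId
SET section_name = 'Introduction'
WHERE courseId = 'course-1' AND sectionId = 'section-1';

-- Every resource row in the partition reports the shared static value.
SELECT resourceId, section_name
FROM resource_by_courseId_sectionId
WHERE courseId = 'course-1' AND sectionId = 'section-1';
```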

How to create unique key using cassandra database

I am a beginner with Cassandra and I want to create a unique key, like in Oracle.
I have searched many sites but could not find a relevant answer.
Is it possible to create a unique key in Cassandra?
In Cassandra, the PRIMARY KEY definition of your table is used for uniqueness. For example:
CREATE TABLE users (
userid uuid,
firstname text,
lastname text,
email text,
created_date timestamp,
PRIMARY KEY (userid)
);
Here, the userid column is the unique identifier. You can, of course, have multiple columns as part of your PRIMARY KEY definition as well. But a few things to keep in mind:
• Primary key columns have other implications in Cassandra as well (beyond uniqueness). You'll want to read up on Partition Keys and Clustering Columns and how Cassandra uses them to organize data around the cluster and on disk.
• Cassandra doesn't have or enforce constraints (for example, no foreign keys).
• Cassandra doesn't do a read before a write (unless you're using the Lightweight Transactions feature), so an INSERT and an UPDATE are functionally equivalent (i.e. an "upsert") and will overwrite data that already exists.
If you're looking for a feature like a "unique constraint" or "unique index" in Oracle, you won't find it in Cassandra. There's a simple data modeling example available in the CQL docs and I also recommend checking out the data modeling course it links to if you're just getting started with Cassandra. Good luck!
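To illustrate the upsert behaviour and the Lightweight Transactions alternative mentioned above, here is a minimal sketch against the users table. The UUID and the names are sample values, and the users_by_email table is a hypothetical addition showing one way to approximate a "unique email" rule.

```sql
-- Plain INSERT: if a row with this userid already exists, it is silently overwritten.
INSERT INTO users (userid, firstname, lastname, email, created_date)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada', 'Lovelace',
        'ada@example.com', '2024-01-01 12:00:00+0000');

-- Lightweight transaction: the write happens only if no row with this key exists.
-- The result contains an [applied] column that is false when the key was already taken.
INSERT INTO users (userid, firstname, lastname, email, created_date)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada', 'Lovelace',
        'ada@example.com', '2024-01-01 12:00:00+0000')
IF NOT EXISTS;

-- Hypothetical lookup table keyed by email, so that claiming an email is a
-- primary-key check rather than a constraint.
CREATE TABLE IF NOT EXISTS users_by_email (
    email text,
    userid uuid,
    PRIMARY KEY (email)
);

INSERT INTO users_by_email (email, userid)
VALUES ('ada@example.com', 62c36092-82a1-3a00-93d1-46196ee77204)
IF NOT EXISTS;
```

Lightweight transactions cost more than ordinary writes, so they are usually reserved for the few writes that really need the check.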

Social media's like and unlike data model in Cassandra

Imagine a social network with the following table for storing likes (favorites); an unlike deletes the row from this table:
CREATE TABLE IF NOT EXISTS post_likes(
post_id timeuuid,
liker_id uuid, //liker user_id
like_time timestamp,
PRIMARY KEY ((post_id) ,liker_id, like_time)
) WITH CLUSTERING ORDER BY (like_time DESC);
The table above is a problem in Cassandra: because liker_id is the first clustering key, we can't sort by the second clustering key, like_time.
We need to sort the table's data by like_time: when a user wants to see who liked a post, we show the list of people who liked it, sorted by time (like_time DESC).
We also need to delete (unlike), and for that we need post_id and liker_id.
What do you suggest? How can we sort this table by like_time?
After more research, I found this solution:
Picking the right data model is the hardest part of using Cassandra, and here is the solution we found for the likes tables. First of all, Cassandra's read and write paths are very fast, so you don't need to worry about writing to several tables. Model around your queries, and remember that data duplication is okay: many of your tables may repeat the same data. Also remember to spread data evenly around the cluster and to minimize the number of partitions read.
Since Cassandra is a NoSQL database, one of the rules is denormalization: denormalize the data and think only about the queries you want to run. For the likes data model we will have two tables, each designed around one of those queries:
CREATE TABLE IF NOT EXISTS post_likes(
post_id timeuuid,
liker_id uuid, //liker user_id
like_time timestamp,
PRIMARY KEY ((post_id) ,liker_id)
);
CREATE TABLE IF NOT EXISTS post_likes_by_time(
post_id timeuuid,
liker_id uuid, //liker user_id
like_time timestamp,
PRIMARY KEY ((post_id), like_time, liker_id)
) WITH CLUSTERING ORDER BY (like_time DESC);
When a user likes a post, we insert into both of the tables above.
Why do we have the post_likes_by_time table?
In a social network you show the list of users who liked a post, and it is common to sort that list by like_time DESC. To sort likes by time, like_time has to be a clustering key.
Then why do we have the post_likes table too?
In post_likes_by_time the first clustering key is like_time, but we also need to remove a single like. We can't do that directly in a table clustered by like_time, because we usually don't know the like_time. That is the reason we also have the post_likes table.
Why can't you have just one table and do both sorting and removing on it?
To delete one like from post_likes we need to provide post_id and user_id (here liker_id) together. In post_likes_by_time we need to sort by like_time, so like_time has to be the first clustering key, and liker_id can be the second. And here is the point: because like_time is the first clustering key, selecting or deleting by liker_id also requires like_time, which you usually do not have.
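The like and unlike flow described above could be sketched roughly as follows. The UUIDs and the timestamp are sample literals, and using a logged batch to keep the two tables in step is an assumption, not something the original answer specifies. The unlike takes two steps: read like_time from post_likes, then delete from both tables.

```sql
-- Like: write the same row to both tables.
BEGIN BATCH
  INSERT INTO post_likes (post_id, liker_id, like_time)
  VALUES (50554d6e-29bb-11e5-b345-feff819cdc9f,
          62c36092-82a1-3a00-93d1-46196ee77204,
          '2024-01-01 12:00:00+0000');
  INSERT INTO post_likes_by_time (post_id, liker_id, like_time)
  VALUES (50554d6e-29bb-11e5-b345-feff819cdc9f,
          62c36092-82a1-3a00-93d1-46196ee77204,
          '2024-01-01 12:00:00+0000');
APPLY BATCH;

-- Show who liked the post, newest first (the table's clustering order).
SELECT liker_id, like_time
FROM post_likes_by_time
WHERE post_id = 50554d6e-29bb-11e5-b345-feff819cdc9f
LIMIT 20;

-- Unlike, step 1: look up the like_time using only post_id and liker_id.
SELECT like_time
FROM post_likes
WHERE post_id = 50554d6e-29bb-11e5-b345-feff819cdc9f
  AND liker_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- Unlike, step 2: delete from both tables, using the like_time read in step 1.
BEGIN BATCH
  DELETE FROM post_likes
  WHERE post_id = 50554d6e-29bb-11e5-b345-feff819cdc9f
    AND liker_id = 62c36092-82a1-3a00-93d1-46196ee77204;
  DELETE FROM post_likes_by_time
  WHERE post_id = 50554d6e-29bb-11e5-b345-feff819cdc9f
    AND like_time = '2024-01-01 12:00:00+0000'
    AND liker_id = 62c36092-82a1-3a00-93d1-46196ee77204;
APPLY BATCH;
```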

Postgresql - index on individual array elements or index on keys in an hstore

I have a table of users, with a column uuid and a column tags:
| uuid varchar PRIMARY KEY | tags ????? |
I am not sure what type the column tags should be, hstore or varchar[]. I want it to contain a list of interests or categories, like 'burgers' or 'vegetables', such that I can query for all users who have any tags in a specified array (i.e. "Which users like any of 'burgers' 'vegetables' 'hotdogs'?") For this query to be fast, I imagine I should index on the individual categories however they are stored. I expect most users to have a small number of tags (0-5) but they could potentially have up to 100 or so. And there are many different options of tags (could be 1000+).
I believe I can index on keys in an hstore so that I know hstore type is an option. Is it possible to index on individual varchar elements of arrays? (I've seen posts about this but they were inconclusive.)
Postgres version 9.3.5
I would recommend separate tables for tags.
You already have a users table with uuid, let's say:
CREATE TABLE users (
uuid serial primary key,
user_name text
);
Now the tags:
CREATE TABLE tags (
tag_id serial primary key,
tag_name text
);
CREATE TABLE users_tags (
uuid integer references users,
tag_id integer references tags,
primary key (uuid, tag_id)
);
Now you can easily query, for example:
SELECT * FROM users
JOIN users_tags USING (uuid)
JOIN tags USING (tag_id)
WHERE tag_name = 'Burgers';
You can now easily add an index on tag_name. You can also easily enforce uniqueness of the tag name, or create a unique index on lower(tag_name), which would eliminate problems with capitalization in tag names (Burgers vs. BurgerS).
A simpler solution would be to leave the tags table out and just create:
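As a sketch of the indexing idea above for the three-table layout (the index names are made up, and the extra index on users_tags(tag_id) is an assumption added because the primary key of that table leads with uuid), the "any of these tags" query from the question could look like this:

```sql
-- Case-insensitive uniqueness of tag names; this index also serves lookups on lower(tag_name).
CREATE UNIQUE INDEX tags_tag_name_lower_idx ON tags (lower(tag_name));

-- Supports going from a tag to its users; the primary key (uuid, tag_id) does not.
CREATE INDEX users_tags_tag_id_idx ON users_tags (tag_id);

-- "Which users like any of 'burgers', 'vegetables', 'hotdogs'?"
-- DISTINCT keeps users who match several tags from being listed more than once.
SELECT DISTINCT uuid, user_name
FROM users
JOIN users_tags USING (uuid)
JOIN tags USING (tag_id)
WHERE lower(tag_name) IN ('burgers', 'vegetables', 'hotdogs');
```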
CREATE TABLE users_tags (
uuid integer references users,
tag_name text,
primary key (uuid, tag_name)
);
Whether you create a separate table for tags or just use the users_tags table mostly depends on how tags are used. A separate table is needed if you have a (mostly) defined set of tags and you may want to attach information to a specific tag later. The query "which users like 'hotdogs'" suggests a separate table, where the tag 'hotdog' has a specific ID. If users can freely add arbitrary tags and no information will be attached to them later, leave the separate table out.

SQL multiple primary keys - localization

I am trying to implement some localization in my database.
It looks something like this (prefixes only for clarification)
tbl-Categories
ID
Language
Name
tbl-Articles
ID
CategoryID
Now, in my tbl-Categories, I want to have primary keys spanning ID and language, so that every combination of ID and language is unique. In tbl-Articles I would like a foreign key to reference ID in categories, but not Language, since I do not want to bind an article to a certain language, only category.
Of course, I cannot add a foreign key to part of the primary key. I also cannot have the primary key only on the ID of categories, since then there can only be one language. Having no primary keys disables foreign keys altogether, and that is also not a great solution.
Do you have any ideas how I can solve this in an elegant fashion?
Thanks.
Given this scenario, you need a one-to-many relationship between Category and Language. Create 3 tables:
• Category, with CategoryID and Name as columns
• Language, with LanguageID and Name as columns
• CategoryLanguage, with CategoryLanguageId, CategoryID and LanguageID (create a composite primary key on CategoryID and LanguageID, which establishes uniqueness)
You don't have to change anything in the Articles table, since ID and CategoryID establish that an article belongs to one category without depending on language.
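A minimal DDL sketch of the three-table layout described above, in generic SQL. The column types and the LocalizedName column are assumptions (the answer does not say where the translated name lives), and the CategoryLanguageId surrogate column is left out here because the composite key already guarantees uniqueness.

```sql
CREATE TABLE Category (
    CategoryID INT NOT NULL,
    Name VARCHAR(100) NOT NULL,
    PRIMARY KEY (CategoryID)
);

CREATE TABLE Language (
    LanguageID INT NOT NULL,
    Name VARCHAR(100) NOT NULL,
    PRIMARY KEY (LanguageID)
);

-- One row per (category, language) pair; the composite key prevents duplicates.
CREATE TABLE CategoryLanguage (
    CategoryID INT NOT NULL,
    LanguageID INT NOT NULL,
    LocalizedName VARCHAR(100) NOT NULL,  -- assumption: the translated category name
    PRIMARY KEY (CategoryID, LanguageID),
    FOREIGN KEY (CategoryID) REFERENCES Category (CategoryID),
    FOREIGN KEY (LanguageID) REFERENCES Language (LanguageID)
);

-- Articles reference the language-independent category only.
CREATE TABLE Articles (
    ID INT NOT NULL,
    CategoryID INT NOT NULL,
    PRIMARY KEY (ID),
    FOREIGN KEY (CategoryID) REFERENCES Category (CategoryID)
);
```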
HTH
