Social media's like and unlike data model in Cassandra

Social media's like and unlike data model in Cassandra - database

Imagine there is a social network and here is a table for storing the like (favorite) action and unlike that is deleting from this table:
CREATE TABLE IF NOT EXISTS post_likes(
post_id timeuuid,
liker_id uuid, //liker user_id
like_time timestamp,
PRIMARY KEY ((post_id) ,liker_id, like_time)
) WITH CLUSTERING ORDER BY (like_time DESC);
The above table has problem in Cassandra because when liker_id is the first clustering_key, we can't sort by the second clustering key which is like_time.
We need to sort our tables data by like_time, we use it when a user wants to see who liked this post and we show list of people who liked that post that sorted by time (like_time DESC)
and we also need to delete (unlike) and we again need to have post_id and liker_id
What is your suggestion? How we can sort this table by like_time?

After more researches, I found out this solution:
Picking the right data model is the hardest part of using Cassandra and here is the solution we found for likes tables in Cassandra, first of all, I have to say Cassandra's read and write path is amazingly fast and you don't need to be worry about writing on your Cassandra's tables, you need to model around your queries and remember, data duplication is okay. Many of your tables may repeat the same data. and do not forget to spread data evenly around the cluster and minimize the number of partitions read
Since we are using Cassandra which is NoSQL, we know one of the rules in NoSQLs is denormalization and we have to denormalize data and just think about the queries you want to have; Here for the like table data modeling we will have two tables, these tables have mainly focused on the easy read or easier to say we have focused on queries we want to have:
CREATE TABLE IF NOT EXISTS post_likes(
post_id timeuuid,
liker_id uuid, //liker user_id
like_time timestamp,
PRIMARY KEY ((post_id) ,liker_id)
);
CREATE TABLE IF NOT EXISTS post_likes_by_time(
post_id timeuuid,
liker_id uuid, //liker user_id
like_time timestamp,
PRIMARY KEY ((post_id), like_time, liker_id)
) WITH CLUSTERING ORDER BY (like_time DESC);
When a user like a post, we just insert into both above tables.
why do we have post_likes_by_time table?
In a social network, you should show list of users who liked a post, it is common that you have to sort likes by the like_time DESC and since you are going to sort likes by like_time you need to have like_time as clustering key to be able to sort likes by time.
Then why do we have post_likes table too?
In the post_likes_by_time, our clustering key is like_time, we also need to remove one like! We can't do that when we sorted data in our table when clustering key is like_time. That is the reason we also have post_likes table
Why you could not only have one table and do both actions, sorting and removing on it?
To delete one like from post_likes table we need to provide user_id (here liker_id) and post_id (together) and in post_likes_by_time we have like_time as clustering key and we need to sort table by like_time, then it should be the first clustering key and the second clustering key could be liker_id, and here is the point! like_time is the first clustering key then for selecting or deleting by liker_id you also need to provide like_time, but you do not have like_time most of the times.

Related

How to create unique key using cassandra database

I am beginner of cassandra DB I want to create unique key like oracle in cassandra.
I searched a lot site but not able to get relevant answer.
is it possible to create unique key using cassandra ?

In Cassandra, the PRIMARY KEY definition of your table is used for uniqueness. For example:
CREATE TABLE users (
userid uuid,
firstname text,
lastname text,
email text,
created_date timestamp,
PRIMARY KEY (userid)
);
Here, the userid column is the unique identifier. You can, of course, have multiple columns as part of your PRIMARY KEY definition as well. But a few things to keep in mind:
Primary key columns have other implications in Cassandra as well (beyond uniqueness). You'll want to read up on Partition Keys and Clustering Columns and how Cassandra uses it to organize data around the cluster and on disk.
Cassandra doesn't have or enforce constraints (for example, no foreign keys)
Cassandra doesn't do a read before a write (unless you're using the Lightweight Transactions feature), and so doing an INSERT or an UPDATE are functionally equivalent (i.e. an "upsert") and will overwrite data that already exists
If you're looking for a feature like a "unique constraint" or "unique index" in Oracle, you won't find it in Cassandra. There's a simple data modeling example available in the CQL docs and I also recommend checking out the data modeling course it links to if you're just getting started with Cassandra. Good luck!

In a junction table, should I use a Primary key and a unique constraint, or a compound/composite key?

I have read through handfuls of what would seem to make this a duplicate question. But reading through all of these has left me uncertain. I'm hoping to get an answer based on the absolute example below, as many questions/answers trail off into debates back and forth.
If I have:
dbo.Book
--------
BookID PK int identity(1,1)
dbo.Author
----------
AuthorID PK int identity(1,1)
Now I have two choices for a simple junction table:
dbo.BookAuthor
--------------
BookID CPK and FK
AuthorID CPK and FK
The above would be a compound/composite key on both FKs, as well as set up the FK relationships for both columns - also using Cascade on delete.
OR
dbo.BookAuthor
--------------
RecordID PK int identity(1,1)
BookID FK
AuthorID FK
Foreign key relationships on BookID and AuthorID, along with Cascade on delete. Also set up a unique constraint on BookID and AuthorID.
I'm looking for a simple answer as to why one method is better than another in the ABOVE particular example. The answers that I'm reading are very detailed, and I was just about to settle on a compound key, but then watched a video where the example used an Identity column like my first example.
It seems this topic is slightly torn in half, but my gut is telling me that I should just use a composite key.
What's more efficient for querying? It seems having a PK identity column along with setting up a unique constraint on the two columns, AND the FK relationships would be more costly, even if a little.

This is something I've always remembered from my database course way back in college. We were covering the section from the textbook on "Entity Design" and it was talking about junction tables... we called them intersect tables or intersection relations. I was actually paying attention in class that day. The professor said, in his experience, a many-to-many junction table almost always indicates an unidentified missing entity. These entities almost always end up with data of their own.
We were given an example of Student and Course entities. For a student to take a course, you need to junction between those two. What you actually have as a result is a new entity: an Enrollment. The additional data in this case would be things like Credit Type (audit vs regular) or Final Grade.
I remember that advice to this day... but I don't always follow it. What I will do in this situation is stop, and make sure to go back to the stakeholders on the issue and work with them on what data points we might still be missing in this junction. If we really can't find anything, then I'll use the compound key. When we do find data, we think of a better name and it gets a surrogate key.
Update in 2020
I still have the textbook, and by amazing coincidence both it and this question were brought to my attention within a few hours of each other. So for the curious, it was Chapter 5, section 6, of the 7th edition of this book:
https://www.amazon.com/Database-Processing-Fundamentals-Design-Implementation-dp-9332549958/dp/9332549958/

As a staunch proponent of, and proselytizer for, the benefits of surrogate keys, I none-the-less make an exception for all-key join tables such as your first example. One of the benefits of surrogate keys is that engines are generally optimized for joining on single integer fields, as the default and most common circumstance.
Your first proposal still obtains this benefit, but also has a 50% greater fan-put on each index level, reducing both the overall size and height of the indices on the join table. Although the performance benefits of this are likely negligible for anything smaller than a massive table it is best practice and comes at no cost.
When I might opt for the other design is if the relation were to accrue additional columns. At that point it is no longer strictly a join table.

I prefer the first design, using Composite Keys. Having an identity column on the junction table does not give you an advantage even if the parent tables have them. You won't be querying the BookAuthor using the identity column, instead you would query it using the BookID and AuthorID.
Also, adding an identity would allow for duplicate BookID-AuthorID combination, unless you put a constraint.
Additionally, if your primary key is (BookID, AuthorID), you need to an index on AuthorID, BookID). This will help if you want to query the the books written by an author.

Using composite key would be my choice too. Here's why:
Less storage overhead
Let's say you would use a surrogate key. Since you'd probably gonna want to query all authors for a specific book and vica versa you'd need indexes starting with both BookId and AuthorId. For performance reasons you should include the other column in both indexes to prevent a clustered key lookup. You'd probably would want to make one of them a unique to make sure no duplicate BookId/AuthorId combinations are added to the table.
So as a net result:
The data is stored 3 times instead of 2 times
2 unique constraints are to be validated instead of 1
Querying a junction table referencing table
Even if you'd add a table like Contributions (AuthorId, BookId, ...) referencing the junction table. Most queries won't require the junction table to be touched at all. E.g.: to find all contribution of a specific author would only involve the author and contributions tables.

Depending on the amount of data in the junction table, a compound key might end up causing poor performance over an auto generated sequential primary key.
The primary key is the clustered index for the table, which means that it determines the order in which rows are stored on disc. If the primary key's values are not generated sequentially (e.g. it is a composite key comprised of foreign keys from tables where rows do not fall in the same order as the junction table's rows, or it is a GUID or other random key) then each time a row is added to the junction table a reshuffle of the junction table's rows will be necessary.

You probably should use the compound/composite key. This way you are fully relational - one author can write many books and one book can have multiple authors.

Having a table with just the fields of the primary key is a conceptual error?

I'm designing a database which will store information about some artists. These artists can belongs to one or more organizations. From these organizations i just want to store their names and i'm thinking in create a table with these organizations which just have the names as primary key and nothing else. Is the fact of having a table with just the fieds of the primary key a conceptual error? In this case, i will appreciate some suggestions to solve that.

Is the fact of having a table with just the fieds of the primary key a conceptual error?
Not by itself. There are perfectly legitimate situations where all fields comprise a PK.
In this particular case, the organization name is a key, but that doesn't necessarily mean it should be primary key - you could "invent" another key that is smaller (typically integer) and easier to maintain and make it primary, like this:
The organizarion_id is called a "surrogate key", and some pros of doing that include:
Child FKs will be slimmer (since only the integer is migrated to the child, not the whole string).
You can update the organization_name without updating the organization_id, and consequently without cascading this update to children.
A small integer surrogate may be friendlier to ORMs than a more complex natural key.
Cons:
May require more JOINing.
Requires one more index, and each additional index brings overhead (even in heap-based tables, but especially in clustered tables).
As you can see, it's a matter of balance and you are the only one who has enough domain knowledge to make the right decision.
NOTE: Order of fields in organization_artist matters. Use the order shown above if you need to efficiently query for artists of a given organization and reverse it if you need organizations of a given artist. If you need both directions, you'll need another composite index on these two fields (beside the index underneath PK), but in opposite order. If you can live with only one index, consider clustering this table (if your DBMS supports it).

You want an OrganizationId, to handle the situations where the Organization name changes.
You might also have situations where different organizations seem to have the same name. How many "Museum of Modern Art"s are there? (Well, to a New Yorker, only one ;-)
Your organization table might expand over time, with columns such as shortname, address, contact person, prefered language, and so on. So, the table should look like:
create table Organizations (
OrganizationId int not null identity(1,1),
Name varchar(255),
CreatedBy varchar(255) default system_user,
CreatedAt datetime default getdate()
)
In a mature database, you would even recognize that organizations change names, merge, and sometimes split. You can handle this by adding effective dates and end dates to the records.

The standard practice for something like this would be to have 1 table for the artists, 1 table for the organizations, and 1 association table to associate the artist with 1 or more organization.
ARTIST (id, firstName, lastName)
ORGANIZATION (id, name)
ARTIST_ORGANIZATION(artist_id, org_id)
Even though the organization name may/should be unique, it's good to have a numeric id as the primary key so you can do associations. And querying the association with id is better performance than searching for a string.

Should each and every table have a primary key?

I'm creating a database table and I don't have a logical primary key assigned to it. Should each and every table have a primary key?

Short answer: yes.
Long answer:
You need your table to be joinable on something
If you want your table to be clustered, you need some kind of a primary key.
If your table design does not need a primary key, rethink your design: most probably, you are missing something. Why keep identical records?
In MySQL, the InnoDB storage engine always creates a primary key if you didn't specify it explicitly, thus making an extra column you don't have access to.
Note that a primary key can be composite.
If you have a many-to-many link table, you create the primary key on all fields involved in the link. Thus you ensure that you don't have two or more records describing one link.
Besides the logical consistency issues, most RDBMS engines will benefit from including these fields in a unique index.
And since any primary key involves creating a unique index, you should declare it and get both logical consistency and performance.
See this article in my blog for why you should always create a unique index on unique data:
Making an index UNIQUE
P.S. There are some very, very special cases where you don't need a primary key.
Mostly they include log tables which don't have any indexes for performance reasons.

Always best to have a primary key. This way it meets first normal form and allows you to continue along the database normalization path.
As stated by others, there are some reasons not to have a primary key, but most will not be harmed if there is a primary key

Disagree with the suggested answer. The short answer is: NO.
The purpose of the primary key is to uniquely identify a row on the table in order to form a relationship with another table. Traditionally, an auto-incremented integer value is used for this purpose, but there are variations to this.
There are cases though, for example logging time-series data, where the existence of a such key is simply not needed and just takes up memory. Making a row unique is simply ...not required!
A small example:
Table A: LogData
Columns: DateAndTime, UserId, AttribA, AttribB, AttribC etc...
No Primary Key needed.
Table B: User
Columns: Id, FirstName, LastName etc.
Primary Key (Id) needed in order to be used as a "foreign key" to LogData table.

Pretty much any time I've created a table without a primary key, thinking I wouldn't need one, I've ended up going back and adding one. I now create even my join tables with an auto-generated identity field that I use as the primary key.

Except for a few very rare cases (possibly a many-to-many relationship table, or a table you temporarily use for bulk-loading huge amounts of data), I would go with the saying:
If it doesn't have a primary key, it's not a table!
Marc

Just add it, you will be sorry later when you didn't (selecting, deleting. linking, etc)

Will you ever need to join this table to other tables? Do you need a way to uniquely identify a record? If the answer is yes, you need a primary key. Assume your data is something like a customer table that has the names of the people who are customers. There may be no natural key because you need the addresses, emails, phone numbers, etc. to determine if this Sally Smith is different from that Sally Smith and you will be storing that information in related tables as the person can have mulitple phones, addesses, emails, etc. Suppose Sally Smith marries John Jones and becomes Sally Jones. If you don't have an artifical key onthe table, when you update the name, you just changed 7 Sally Smiths to Sally Jones even though only one of them got married and changed her name. And of course in this case withouth an artificial key how do you know which Sally Smith lives in Chicago and which one lives in LA?
You say you have no natural key, therefore you don't have any combinations of field to make unique either, this makes the artficial key critical.
I have found anytime I don't have a natural key, an artifical key is an absolute must for maintaining data integrity. If you do have a natural key, you can use that as the key field instead. But personally unless the natural key is one field, I still prefer an artifical key and unique index on the natural key. You will regret it later if you don't put one in.

It is a good practice to have a PK on every table, but it's not a MUST. Most probably you will need a unique index, and/or a clustered index (which is PK or not) depending on your need.
Check out the Primary Keys and Clustered Indexes sections on Books Online (for SQL Server)
"PRIMARY KEY constraints identify the column or set of columns that have values that uniquely identify a row in a table. No two rows in a table can have the same primary key value. You cannot enter NULL for any column in a primary key. We recommend using a small, integer column as a primary key. Each table should have a primary key. A column or combination of columns that qualify as a primary key value is referred to as a candidate key."
But then check this out also: http://www.aisintl.com/case/primary_and_foreign_key.html

To make it future proof you really should. If you want to replicate it you'll need one. If you want to join it to another table your life (and that of the poor fools who have to maintain it next year) will be so much easier.

I am in the role of maintaining application created by offshore development team. Now I am having all kinds of issues in the application because original database schema did not contain PRIMARY KEYS on some tables. So please dont let other people suffer because of your poor design. It is always good idea to have primary keys on tables.

Late to the party but I wanted to add my two cents:
Should each and every table have a primary key?
If you are talking about "Relational Albegra", the answer is Yes. Modelling data this way requires the entities and tables to have a primary key. The problem with relational algebra (apart from the fact there are like 20 different, mismatching flavors of it), is that it only exists on paper. You can't build real world applications using relational algebra.
Now, if you are talking about databases from real world apps, they partially/mostly adhere to the relational algebra, by taking the best of it and by overlooking other parts of it. Also, database engines offer massive non-relational functionality nowadays (it's 2020 now). So in this case the answer is No. In any case, 99.9% of my real world tables have a primary key, but there are justifiable exceptions. Case in point: event/log tables (multiple indexes, but not a single key in sight).
Bottom line, in transactional applications that follow the entity/relationship model it makes a lot of sense to have primary keys for almost (if not) all of the tables. If you ever decide to skip the primary key of a table, make sure you have a good reason for it, and you are prepared to defend your decision.

I know that in order to use certain features of the gridview in .NET, you need a primary key in order for the gridview to know which row needs updating/deleting. General practice should be to have a primary key or primary key cluster. I personally prefer the former.

I'd like to find something official like this - 15.6.2.1 Clustered and Secondary Indexes - MySQL.
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
So, why not create primary key or something like it by yourself? Besides, ORM cannot identify this hidden ID, meaning that you cannot use ID in your code.

I always have a primary key, even if in the beginning I don't have a purpose in mind yet for it. There have been a few times when I eventually need a PK in a table that doesn't have one and it's always more trouble to put it in later. I think there is more of an upside to always including one.

If you are using Hibernate its not possible to create an Entity without a primary key. This issues can create problem if you are working with an existing database which was created with plain sql/ddl scripts, and no primary key was added

In short, no. However, you need to keep in mind that certain client access CRUD operations require it. For future proofing, I tend to always utilize primary keys.

One or Two Primary Keys in Many-to-Many Table?

I have the following tables in my database that have a many-to-many relationship, which is expressed by a connecting table that has foreign keys to the primary keys of each of the main tables:
Widget: WidgetID (PK), Title, Price
User: UserID (PK), FirstName, LastName
Assume that each User-Widget combination is unique. I can see two options for how to structure the connecting table that defines the data relationship:
UserWidgets1: UserWidgetID (PK), WidgetID (FK), UserID (FK)
UserWidgets2: WidgetID (PK, FK), UserID (PK, FK)
Option 1 has a single column for the Primary Key. However, this seems unnecessary since the only data being stored in the table is the relationship between the two primary tables, and this relationship itself can form a unique key. Thus leading to option 2, which has a two-column primary key, but loses the one-column unique identifier that option 1 has. I could also optionally add a two-column unique index (WidgetID, UserID) to the first table.
Is there any real difference between the two performance-wise, or any reason to prefer one approach over the other for structuring the UserWidgets many-to-many table?

You only have one primary key in either case. The second one is what's called a compound key. There's no good reason for introducing a new column. In practise, you will have to keep a unique index on all candidate keys. Adding a new column buys you nothing but maintenance overhead.
Go with option 2.

Personally, I would have the synthetic/surrogate key column in many-to-many tables for the following reasons:
If you've used numeric synthetic keys in your entity tables then having the same on the relationship tables maintains consistency in design and naming convention.
It may be the case in the future that the many-to-many table itself becomes a parent entity to a subordinate entity that needs a unique reference to an individual row.
It's not really going to use that much additional disk space.
The synthetic key is not a replacement to the natural/compound key nor becomes the PRIMARY KEY for that table just because it's the first column in the table, so I partially agree with the Josh Berkus article. However, I don't agree that natural keys are always good candidates for PRIMARY KEY's and certainly should not be used if they are to be used as foreign keys in other tables.

Option 2 uses a simple compund key, option 1 uses a surrogate key. Option 2 is preferred in most scenarios and is close to the relational model in that it is a good candidate key.
There are situations where you may want to use a surrogate key (Option 1)
You are not certain that the compound key is a good candidate key over time. Particularly with temporal data (data that changes over time). What if you wanted to add another row to the UserWidget table with the same UserId and WidgetId? Think of Employment(EmployeeId,EmployeeId) - it would work in most cases except if someone went back to work for the same employer at a later date
If you are creating messages/business transactions or something similar that requires an easier key to use for integration. Replication maybe?
If you want to create your own auditing mechanisms (or similar) and don't want keys to get too long.
As a rule of thumb, when modeling data you will find that most associative entities (many to many) are the result of an event. Person takes up employment, item is added to basket etc. Most events have a temporal dependency on the event, where the date or time is relevant - in which case a surrogate key may be the best alternative.
So, take option 2, but make sure that you have the complete model.

I agree with the previous answers but I have one remark to add.
If you want to add more information to the relation and allow more relations between the same two entities you need option one.
For example if you want to track all the times user 1 has used widget 664 in the userwidget table the userid and widgetid isn't unique anymore.

What is the benefit of a primary key in this scenario? Consider the option of no primary key:
UserWidgets3: WidgetID (FK), UserID (FK)
If you want uniqueness then use either the compound key (UserWidgets2) or a uniqueness constraint.
The usual performance advantage of having a primary key is that you often query the table by the primary key, which is fast. In the case of many-to-many tables you don't usually query by the primary key so there is no performance benefit. Many-to-many tables are queried by their foreign keys, so you should consider adding indexes on WidgetID and UserID.

Option 2 is the correct answer, unless you have a really good reason to add a surrogate numeric key (which you have done in option 1).
Surrogate numeric key columns are not 'primary keys'. Primary keys are technically one of the combination of columns that uniquely identify a record within a table.
Anyone building a database should read this article http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-i-7327 by Josh Berkus to understand the difference between surrogate numeric key columns and primary keys.
In my experience the only real reason to add a surrogate numeric key to your table is if your primary key is a compound key and needs to be used as a foreign key reference in another table. Only then should you even think to add an extra column to the table.
Whenever I see a database structure where every table has an 'id' column the chances are it has been designed by someone who doesn't appreciate the relational model and it will invariably display one or more of the problems identified in Josh's article.

I would go with both.
Hear me out:
The compound key is obviously the nice, correct way to go in so far as reflecting the meaning of your data goes. No question.
However: I have had all sorts of trouble making hibernate work properly unless you use a single generated primary key - a surrogate key.
So I would use a logical and physical data model. The logical one has the compound key. The physical model - which implements the logical model - has the surrogate key and foreign keys.

Since each User-Widget combination is unique, you should represent that in your table by making the combination unique. In other words, go with option 2. Otherwise you may have two entries with the same widget and user IDs but different user-widget IDs.

The userwidgetid in the first table is not needed, as like you said the uniqueness comes from the combination of the widgetid and the userid.
I would use the second table, keep the foriegn keys and add a unique index on widgetid and userid.
So:
userwidgets( widgetid(fk), userid(fk),
unique_index(widgetid, userid)
)
There is some preformance gain in not having the extra primary key, as the database would not need to calculate the index for the key. In the above model though this index (through the unique_index) is still calculated, but I believe that this is easier to understand.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight