Cassandra DB structure suggestion (two tables vs one) - database

I am new to Cassandra's DB, and I'm creating a database structure and I wonder whether my pick is optimal for my requirements.
I need to get information on unique users, and each unique user will have multiple page views.
My two options are de-normalizing my data into one big table, or create two different tables.
My most used queries would be:
Searching for all of the pages with a certain cookie_id.
Searching for all of the pages with a certain cookie_id and a client_id. If a cookie doesn't have a client, it would be marked client_id=0 (that would be most of the data).
Find the first cookie_id with extra data (for example data_type_1 + data_type_2).
My two suggested schemas are these:
Two tables - One for users and one for visited pages.
This would allow me to save a new user on a different table, and keep every recurring request in another table.
CREATE TABLE user_tracker.pages (
cookie_id uuid,
created_uuid timeuuid,
data_type_3 text,
data_type_4 text,
PRIMARY KEY (cookie_id, created_uuid)
);
CREATE TABLE user_tracker.users (
cookie_id uuid,
client_id id,
data_type_1 text,
data_type_2 text,
created_uuid timeuuid,
PRIMARY KEY (cookie_id, client_id, created_uuid)
);
This data is normalized as I don't enter the user's data for every request.
One table - For all of the data saved, and the first request as the key. First request would have data_type_1 and data_type_2.
I could also save "data_type_1" and "data_type_2" as a hashmap, as they represent multiple columns and they will always be in a single data set (is it considered to be better?).
The column "first" would be a secondary index.
CREATE TABLE user_tracker.users_pages (
cookie_id uuid,
client_id id,
data_type_1 text,
data_type_2 text,
data_type_3 text,
data_type_4 text,
first boolean,
created_uuid timeuuid,
PRIMARY KEY (cookie_id, client_id, created_uuid)
);
In reality we have more columns than 4, this was written briefly.
As far as I understand Cassandra's best practices, I'm leaning into option #2, but please do enlighten me :-)

Data modelling in Cassandra is done based on type of queries you will be making. You should be able to query using partition key.
For following option 2 mentioned by you is okay.
1.Searching for all of the pages with a certain cookie_id
Searching for all of the pages with a certain cookie_id and a client_id. If a cookie doesn't have a client, it would be marked client_id=0 (that would be most of the data).
For third query
Find the first cookie_id with extra data (for example data_type_1 + data_type_2).
Not sure what do you mean by first cookie_id with extra data. In Cassandra all data is stored by partition key and are not sorted. So all your data will be stored using cookie_id as parition key and all future instances with the same cookie_id will keep adding to this row.

Related

Design scenario of a DynamoDB table

I am new to DynamoDB and after reading several docs, there is a scenario in which I am not sure which would be the best approach for designing a table.
Consider that we have some JobOffers and we should support the following data access:
get a specific JobOffer
get all JobOffers from a specific Company sorted by different criteria (newest/oldest/wage)
get JobOffers from a specific Company filtered by a specific city sorted by different criteria (newest/oldest/wage)
get all JobOffers (regardless of any Company !!!) sorted by different criteria (newest/oldest/wage)
get JobOffers (regardless of any Company !!!) filtered by a specific city sorted by different criteria (newest/oldest/wage)
Since we need to support sorting, my understanding is that we should use Query instead of Scan. In order to use Query, we need to use a primary key. Because we need to support a search like "get all JobOffers without any filters sorted somehow", which would be a good candidate for partition key?
As a workaround, I was thinking to use a new attribute "country" which can be used as the partition key, but since all JobOffers are specified in one country, all these items fall under the same partition, so it might be a bit redundant until we will have support for JobOffers from different countries.
Any suggestion on how to design JobOffer table (with PK and GSI/LSI) for such a scenario?
Design of a Dynamodb table is best done with an Access approach - that is - how are you going to be accessing the data in here. You have information X, you need Y.
Also remember that a dynamo is NOT an sql, and it is not restricted that every item has to be the same - consider each entry a document, with its PK/SK as directory/item structure in a file system almost.
So for example:
You have user data. You have things like : Avatar data (image name, image size, image type) Login data (salt/pepper hashes, email address, username), Post history (post title, identifier, content, replies). Each user will only have 1 Avatar item and 1 Login item, but have many Post items
You know that from the user page you are always going to have the user ID. 100% of the time. This should then be your PK - your Hash Key, PartitionKey. Then you have the rest of the things you need inform your sort key/range key.
PK
USER#123456
SK:
AVATAR - Attributes: (image name, image size, image type)
PK
USER#123456
SK:
LOGIN - Attributes: (salt/pepper hashes, email address, username)
PK
USER#123456
SK:
POST#123 - Attributes: (post title, identifier, content, replies)
PK
USER#123456
SK:
POST#125 - Attributes: (post title, identifier, content, replies)
PK
USER#123456
SK:
POST#193 - Attributes: (post title, identifier, content, replies)
This way you can do a query with the User ID and get ALL the user data back. Or if you are on a page that just displays their post history, you can do a query against User ID # SK Begins With POST and get back all t heir posts.
You can put in an inverted index (SK -> PK and vice versa) and then do a query on POST#193 and get back the user ID. Or if you have other PK types with POST#193 as the SK, you get more information there (like a REPLIES#193 PK or something)
The idea here is that you have X bits of information, and you need to craft your dynamo to be able to retrieve as much as possible with just that information, and using prefix's on your SKs you can then narrow the fields a little.
Note!
Sometimes this means a duplication of information! That you may have the same information under two sets of keys. This is ok and kind of expected when you start getting into really complex relationships. You can mitigate it somewhat with index's, but you should aim to avoid them where possible as they do introduce a bit of lag in terms of data propagation (its tiny, but it can add up)
So you have your list of things you want to get for your dynamo. What will you always have to tie them together? What piece of data do you have that will work?
You can do the first 3 with a company PK identifier and a reverse index. That will let you look up and get all a companies jobs, or using the reverse index all a specific job. Or if you can always know the company when looking up a specific job, then it uses the general first index.
Company# - Job# - data data data
You then do the sorting on your own, OR you add some sort of sort valuye to the Job# key - Sort Keys are inherently sorted after all. Company# - Job#1234#UNITED_STATES
of course this will only work for one sort at a time. You can make more than one index, but again - data sync lag is a real possibility.
But how to do this regardless of Company? Well you can have another index with your searchable attribute (Country for example) as the PK then you can query that.
Or do you have another set of data that can tie this all together? Do you have another thing that can reach it all?
If not, you may just have two items in your dynamo:
Company#1234 - Job#321 - details
Company#1234 - Country#United_states - job#321, job#456, job#1234
Company#1234 - Country#England - job#992, job#123, job#19231
your reverse index here would apply - you could do a query on PK: Contry#UnitedStates and you'd get back:
Country#United_states - Company#1234 - job#321, job #456, job31234
Country#United_states - Company#4556
Country#United_States - Comapny#8322
this isnt a relational database however! So either you have to do one of two things - use t hose job#s to then query that company and get the filter the jobs by what you want (bad - trying to avoid multiple queries!) or each job# is an attribute on country sk's, and it contains a copy of that relevant data in a map format {job title, job#, country, company, salary}. Then when they click on that job to go to the details, it makes a direct call straight to the job query, gets the details to display,and its good.
Again, it all comes down to access patterns. What do you have, and how can you arrange it in a way that lets you get what you need fast

Oracle APEX - Data Modeling & Primary Keys

I'm creating a rather large APEX application which allows managers to go in and record statistics for associates in the company. Currently we have a database in oracle with data from AD which hold all the associates information. Name, Manager, Employee ID, etc.
Now I'm responsible for creating and modeling a table that will house all their stats for each employee. The table I have created has over 90+ columns in it. Some contain data such as:
Documents Processed
Calls Received
Amount of Doc 1 Processed
Amount of Doc 2 Processed
and the list goes on for well over 90 attributes. So here is my question:
When creating this table in my application with so many different columns how would I go about choosing a primary key that's appropriate? Should I link it to our employee table using the employees identification which is unique (each have a associate number)?
Secondly, how can I create these tables (and possibly form) to allow me to associate the statistic I am entering for an individual to the actual individual?
I have ordered two books from amazon on data modeling since I am new to APEX and DBA design. Not a fresh chicken, but new enough to need some guidance. An additional problem I am running into is that each form can have only 60 fields to it. So I had thought about creating tables for different functions out of my 90+ I have.
Thanks
4.2 allows for 200 items per page.
oracle apex component limits
A couple of questions come to mind:
Are you sure that the employee Ids are not recyclable? If these ids are unique and not recycled.. you've found yourself a good primary key.
What do you plan on doing when you decide to add a new metric? Seems like you might have to add a new column to your rather large and likely not normalized table.
I'd recommend a vertical table for your metrics.. you can use oracle's pivot function to make your data appear more like a horizontal table.
If you went this route you would store your employee Id in one column, your metric key in another, and value...
I'd recommend that you create a metric table consisting of a primary key, a metric label, an active indicator, creation timestamp, creation user id, modified timestamp, modified user id.
This metric table will allow you to add new metrics, change the name of the metric, deactivate a metric, and determine who changed what and when.
This would be a much more flexible approach in my opinion. You may also want to think about audit logs.

cassandra - how to perform table query?

I am trying to perform a query using 2 tables:
CREATE TABLE users(
id_ UUID PRIMARY KEY,
username text,
email text,
);
CREATE TABLE users_by_email(
id UUID,
email text PRIMARY KEY
)
In this cas, how to perform a query by email?
I am assuming in the case above you are specifically trying to retrieve the username by the email.
Short Answer:
There is no way in Cassandra that you are going to be able to get the username from the email in a single query using the table structure you have defined. You would need to query users_by_email to get the id and then query users to get the username. A better option would be to add the username column to the users_by_email table.
Long Answer:
Due to the underlying mechanisms by which Cassandra stores data on disk the only available parameters you may use in a where clause have to be in the Primary Key. The Primary Key is made up of 2 different types of keys. First is the partition key which is used to physically separate files on disk and between nodes in the cluster. Second are the cluster keys which are used to organize data stored in a partition and aid in efficient retrieval of data. One other critical part to note is that if you use a WHERE clause in your query it must contain all of the partition keys in it for each call. This is to allow for efficient retrieval of the data. If you want to get some more detailed information on the working of the WHERE clause take a look at this link:
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
Now that you know what the limitations of the WHERE clause are the question is how do we get around them. First thing you need to know is that Cassandra is not a RDBMS and you cannot perform JOIN's against tables. This means that we need to forget all the rules that we have learned for so many years about how to properly normalize data in a database and begin thinking differently about the problem. In general Cassandra is designed for a table-per-query pattern. This means that for each data access pattern (i.e. query) you are going to run against there is an associated table that contains the data for that query and has the proper keys to allow the data to be filtered appropriately. I am not going to be able to go into all the nitty gritty details of how to properly data model your data but I suggest you take the free Datastax Academy Data modeling course avaliable here:
https://academy.datastax.com/courses/ds220-data-modeling
So as I understand your particular need I think that you can modify your users table to look like this:
CREATE TABLE users_by_email(
email text,
username text,
id_ UUID,
PRIMARY KEY (email, username)
);
This table setup will allow you to select the username by email using a query like:
SELECT username FROM users_by_email WHERE email=XXXXX;
I am assuming that you also want username returned in the query. You cannot JOIN tables in Cassandra. So to do that, you will have to add that column to your users_by_email table:
CREATE TABLE users_by_email(
id UUID,
email text PRIMARY KEY,
username text,
);
Then, simply query that table by email address.
> SELECT id, email, username FROM users_by_email WHERE email='mreynolds#serenity.com';
id | email | username
--------------------------------------+------------------------+----------
d8e57eb4-c837-4bd7-9fd7-855497861faf | mreynolds#serenity.com | Mal
(1 rows)

Organizing database tables - large number of properties

I have a database that stores some users in it. Each user has its account settings, privacy settings and lots of other properties to set. The number of those properties started to grow and I could end up with 30 properties or so.
Till now, I used to keep it in "UserInfo" table having User and UserInfo related as One-To-Many (keeping a log of all changes). Putting it in a single "UserInfo" table doesn't sound nice and, at least in the database model, it would look messy. What's the solution?
Separating privacy settings, account settings and other "groups" of settings in separate tables and have 1-1 relations between UserInfo and each group of settings table is one solution, but would that be too slow (or much slower) when retrieving the data? I guess all data would not be presented on a single page at the same moment. So maybe having one-to-many relationships to each table is a solution too (keeping log of each group separately)?
If it's only 30 properties, I'd recommend just creating 30 columns. That's not too much for a modern database to handle.
But I would guess that if you ahve 30 properties today, you will continue to invent new properties as time goes on, and the number of columns will keep growing. Restructuring your table to add columns every day may become time-consuming as you get lots of rows.
For an alternative solution check out this blog for a nifty solution for storing lots of dynamic attributes in a "schemaless" way: How FriendFeed Uses MySQL.
Basically, collect all the properties into some format and store it in a single TEXT column. The format is semi-structured, that is your application can separate the properties if needed but you can also add more at any time, or even have different properties per row. XML or YAML or JSON are example formats, or some object serialization format supported by your application code language.
CREATE TABLE Users (
user_id SERIAL PRIMARY KEY,
user_proerties TEXT
);
This makes it hard to search for a given value in a given property. So in addition to the TEXT column, create an auxiliary table for each property you want to be searchable, with two columns: values of the given property, and a foreign key back to the main table where that particular value is found. Now you have can index the column so lookups are quick.
CREATE TABLE UserBirthdate (
user_id BIGINT UNSIGNED PRIMARY KEY,
birthdate DATE NOT NULL,
FOREIGN KEY (user_id) REFERENCES Users(user_id),
KEY (birthdate)
);
SELECT u.* FROM Users AS u INNER JOIN UserBirthdate b USING (user_id)
WHERE b.birthdate = '2001-01-01';
This means as you insert or update a row in Users, you also need to insert or update into each of your auxiliary tables, to keep it in sync with your data. This could grow into a complex chore as you add more auxiliary tables.

how are viewing permissions usually implemented in a relational database?

What's the standard relational database idiom for setting permissions for items?
Answers should be general; however, they should be able to be applied to example below. Anything flies: adding columns, adding another table—whatever as long as it works well.
Application / Example
Assume the Twitter database is extremely simple: we have one User table, which contains a login and user id; we have a Tweet table, which contains a tweet id, tweet text, and creator id; and we have a Follower table, which contains the id of the person being followed and the follower.
Now, assume Twitter wants to enable advanced privacy settings (viewing permissions), so that users can pick exactly which followers can view tweets. The settings can be:
Everyone on Twitter
Only current followers (which would of course have to be approved by the user, this doesn't really matter though) EDIT: Current as in, I get a new follower, he sees it; I remove a follower, he stops seeing it.
Specific followers (e.g., user id 5, 10, 234, and 1)
Only the owner
Under these circumstances, what's the best way to represent viewing permissions? The priorities, in order, are speed of lookup (you want to be able to figure out what tweets to display to a user quickly), speed of creation (you don't want to take forever to post a tweet), and efficient use of space (every time I post a tweet to everyone on my followers' list, I shouldn't have to add a row for each and every follower I have to some table.)
Looks like a typical many-to-many relationship -- I don't see any restrictions on what you desire that would allow space savings wrt the typical relational DB idiom for those, i.e., a table with two columns (both foreign keys, one into users and one into tweets)... since the current followers can and do change all the time, posting a tweet to all the followers that are current at the instant of posting (I assume that's what you mean?) does mean adding that many (extremely short) rows to that relationship table (the alternative of keeping a timestamped history of follower sets so you can reconstruct who was a follower at any given tweet-posting time appears definitely worse in time and not substantially better in space).
If, on the other hand, you want to check followers at the time of viewing (rather than at the time of posting), then you could make a special userid artificially meaning "all followers of the current user" (just like you'll have one meaning "all users on Twitter"); the needed SQL to make the lookup fast, in that case, looks hairy but feasible (a UNION or OR with "all tweets for which I'm a follower of the author and the tweet is readable by [the artificial userid representing] all followers"). I'm not getting deep into that maze of SQL until and unless you confirm that it is this peculiar meaning that you have in mind (rather than the simple one which seems more natural to me but doesn't allow any space savings on the relationship table for the action of "post tweet to all followers").
Edit: the OP has clarified they mean the approach I mention in the second paragraph.
Then, assume userid is the primary key of the Users table, the Tweets table has a primary key tweetid and a foreign key author for the userid of each tweet's author, the Followers table is a typical many-to-many relationship table with the two columns (both foreign keys into Users) follower and followee, and the Canread table a not-so-typical many-to-many relationship table, still with two column -- foreign key into Users is column reader, foreign key into Tweets is column tweet (phew;-). Two special users #everybody and #allfollowers are defined with the above meanings (so that posting to everybody, all followers, or "just myself", all add only one row to Canread -- only selective posting to a specific list of N people adds N rows).
So the SQL for the set of tweet IDs a user #me can read is, I think, something like:
SELECT Tweets.tweetid
FROM Tweets
JOIN Canread ON(Tweets.tweetid=Canread.tweet)
WHERE Canread.reader IN (#me, #everybody)
UNION
SELECT Tweets.tweetid
FROM Tweets
JOIN Canread ON(Tweets.tweetid=Canread.tweet)
JOIN Followers ON(Tweets.author=Followers.followee)
WHERE Canread.reader=#allfollowers
AND Followers.follower=#me

Resources