Cassandra data model - column-family - database

I checked some questions here, like Understanding Cassandra Data Model and Column-family concept and data model, and some articles about Cassandra, but I'm still not clear on what its data model is.
Cassandra follows a column-family data model, which is similar to a key-value data model. In a column family you have data in rows and columns, so a two-dimensional structure, and on top of that you have a grouping into column families. I suppose it is organized into column families to be able to partition the database across several nodes?
How are rows and columns grouped into column families? Why do we have column families?
For example let's say we have database of messages, as rows:
id: 123, message: {author: 'A', recipient: 'X', text: 'asd'}
id: 124, message: {author: 'B', recipient: 'X', text: 'asdf'}
id: 125, message: {author: 'C', recipient: 'Y', text: 'a'}
How and why would we organize this around column-family data model?
NOTE: Please correct or expand on example if necessary.

That's kind of the wrong question. Instead of modeling around your data, model around how you're going to query the data. What do you want to read? You create your data model around that, since the storage is strict about how you can access data. Most likely the id is not the key; if you want the author or recipient on reads, use that as the partition key, with the unique id (use a UUID, not an auto-increment) as a clustering column. i.e.:
CREATE TABLE message_by_recipient (
author text,
recipient text,
id timeuuid,
data text,
PRIMARY KEY (recipient, id)
) WITH CLUSTERING ORDER BY (id DESC)
Then, to see the five newest emails to "bob":
select * from message_by_recipient where recipient = 'bob' limit 5
Using timeuuid for the id will guarantee uniqueness without an auto-increment bottleneck and also provide sorting by time. You may duplicate writes on a new message, writing to multiple tables so that each read is a single lookup. If the data can get large, you may want to replace it with a UUID (type 4) and store the payload in a blob store or distributed file system (e.g. S3) keyed by it. That would reduce the impact on Cassandra and also reduce the cost of the denormalization.
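The fan-out-on-write idea the answer describes can be sketched in plain Python, with dicts standing in for the tables (the second table, message_by_author, is a hypothetical addition to show the duplication; uuid1 is used as a rough stand-in for Cassandra's timeuuid):

```python
import uuid
from collections import defaultdict

# Two "tables", each keyed by its own partition key, standing in for
# message_by_recipient and a hypothetical message_by_author table.
message_by_recipient = defaultdict(list)
message_by_author = defaultdict(list)

def send_message(author, recipient, text):
    # uuid1 embeds a timestamp, so sorting by it approximates time
    # ordering (Cassandra's timeuuid clustering does this properly).
    msg = {"id": uuid.uuid1(), "author": author, "recipient": recipient, "data": text}
    # Denormalize: write the same message to every table it is read from.
    message_by_recipient[recipient].append(msg)
    message_by_author[author].append(msg)
    return msg

def newest_for_recipient(recipient, limit=5):
    # Equivalent to: SELECT * FROM message_by_recipient
    #                WHERE recipient = ? LIMIT ?  (CLUSTERING ORDER BY id DESC)
    rows = sorted(message_by_recipient[recipient], key=lambda m: m["id"].time, reverse=True)
    return rows[:limit]

for i in range(7):
    send_message("A", "bob", f"msg {i}")

newest = newest_for_recipient("bob")
print([m["data"] for m in newest])  # the five newest messages to "bob"
```

Each write costs twice as much, but every read pattern becomes a single partition lookup, which is the trade Cassandra modeling usually makes.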


Design scenario of a DynamoDB table

I am new to DynamoDB and after reading several docs, there is a scenario in which I am not sure which would be the best approach for designing a table.
Consider that we have some JobOffers and we should support the following data access:
get a specific JobOffer
get all JobOffers from a specific Company sorted by different criteria (newest/oldest/wage)
get JobOffers from a specific Company filtered by a specific city sorted by different criteria (newest/oldest/wage)
get all JobOffers (regardless of any Company !!!) sorted by different criteria (newest/oldest/wage)
get JobOffers (regardless of any Company !!!) filtered by a specific city sorted by different criteria (newest/oldest/wage)
Since we need to support sorting, my understanding is that we should use Query instead of Scan. In order to use Query, we need to use a primary key. Because we need to support a search like "get all JobOffers without any filters sorted somehow", which would be a good candidate for partition key?
As a workaround, I was thinking of using a new attribute "country" as the partition key, but since all JobOffers are in one country, all these items fall under the same partition, so it might be a bit redundant until we have support for JobOffers from different countries.
Any suggestion on how to design JobOffer table (with PK and GSI/LSI) for such a scenario?
Design of a DynamoDB table is best done with an access approach - that is, how are you going to be accessing the data in here? You have information X, you need Y.
Also remember that DynamoDB is NOT a SQL database, and it is not required that every item have the same shape. Consider each entry a document, with its PK/SK almost like a directory/file structure in a file system.
So for example:
You have user data: things like Avatar data (image name, image size, image type), Login data (salt/pepper hashes, email address, username), and Post history (post title, identifier, content, replies). Each user will only have one Avatar item and one Login item, but many Post items.
You know that from the user page you are always going to have the user ID, 100% of the time. That should then be your PK - your hash key, partition key. The rest of the things you need then inform your sort key / range key.
PK            SK         Attributes
USER#123456   AVATAR     image name, image size, image type
USER#123456   LOGIN      salt/pepper hashes, email address, username
USER#123456   POST#123   post title, identifier, content, replies
USER#123456   POST#125   post title, identifier, content, replies
USER#123456   POST#193   post title, identifier, content, replies
This way you can do a query with just the user ID and get ALL the user data back. Or, if you are on a page that just displays their post history, you can do a query against the user ID with SK begins_with POST and get back all their posts.
You can put in an inverted index (SK -> PK and vice versa) and then do a query on POST#193 and get back the user ID. Or, if you have other PK types with POST#193 as the SK, you get more information there (like a REPLIES#193 PK or something).
The idea here is that you have X bits of information, and you need to craft your table to be able to retrieve as much as possible with just that information; using prefixes on your SKs, you can then narrow the field a little.
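The PK-equality plus SK-prefix query above can be simulated with a few lines of Python (the item shapes mirror the table above; all names are illustrative):

```python
# Single-table items as (PK, SK, attributes) tuples.
items = [
    ("USER#123456", "AVATAR",   {"image_name": "me.png"}),
    ("USER#123456", "LOGIN",    {"email": "user@example.com"}),
    ("USER#123456", "POST#123", {"title": "first post"}),
    ("USER#123456", "POST#125", {"title": "second post"}),
    ("USER#123456", "POST#193", {"title": "third post"}),
]

def query(pk, sk_begins_with=""):
    # Mirrors a DynamoDB Query: exact match on the PK,
    # begins_with condition on the SK.
    return [it for it in items if it[0] == pk and it[1].startswith(sk_begins_with)]

everything = query("USER#123456")       # all of the user's data in one query
posts = query("USER#123456", "POST#")   # just the post history
print(len(everything), len(posts))
```

One partition key, several SK prefixes, several access patterns - no second table needed.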
Note!
Sometimes this means duplication of information: you may have the same information under two sets of keys. This is OK, and kind of expected once you get into really complex relationships. You can mitigate it somewhat with indexes, but you should aim to avoid them where possible, as they do introduce a bit of lag in terms of data propagation (it's tiny, but it can add up).
So you have your list of things you want to get for your dynamo. What will you always have to tie them together? What piece of data do you have that will work?
You can do the first three with a company PK identifier and a reverse index. That will let you look up all of a company's jobs, or, using the reverse index, a specific job. Or, if you always know the company when looking up a specific job, the general first index covers it.
Company# - Job# - data data data
You then do the sorting on your own, OR you add some sort of sort value to the Job# key; sort keys are inherently sorted, after all: Company# - Job#1234#UNITED_STATES
Of course, this will only work for one sort order at a time. You can make more than one index, but again, data-sync lag is a real possibility.
But how to do this regardless of company? Well, you can have another index with your searchable attribute (Country, for example) as the PK; then you can query that.
Or do you have another set of data that can tie this all together? Do you have another thing that can reach it all?
If not, you may just have two items in your dynamo:
Company#1234 - Job#321 - details
Company#1234 - Country#United_states - job#321, job#456, job#1234
Company#1234 - Country#England - job#992, job#123, job#19231
Your reverse index here would apply: you could do a query on PK: Country#United_States and you'd get back:
Country#United_States - Company#1234 - job#321, job#456, job#1234
Country#United_States - Company#4556
Country#United_States - Company#8322
This isn't a relational database, however! So you have to do one of two things: use those job#s to then query each company and filter the jobs by what you want (bad - we're trying to avoid multiple queries!), or make each job# an attribute on the country SKs, containing a copy of the relevant data in map format {job title, job#, country, company, salary}. Then, when the user clicks on a job to see its details, it makes a direct call straight to the job query, gets the details to display, and it's good.
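The second option - denormalized job summaries hanging off the country items - can be sketched like this (item shapes, titles, and salaries are all made up for illustration):

```python
# Items in a hypothetical country index; each job# attribute carries a
# denormalized summary map, so the list view needs no second query.
index = [
    {"PK": "Country#United_States", "SK": "Company#1234",
     "jobs": {"job#321": {"title": "Engineer", "salary": 90000},
              "job#456": {"title": "Analyst",  "salary": 70000}}},
    {"PK": "Country#United_States", "SK": "Company#4556",
     "jobs": {"job#777": {"title": "Designer", "salary": 80000}}},
    {"PK": "Country#England",       "SK": "Company#1234",
     "jobs": {"job#992": {"title": "Manager",  "salary": 85000}}},
]

def jobs_in_country(country_pk):
    # One query on the country PK returns every company's job summaries there.
    summaries = []
    for item in index:
        if item["PK"] == country_pk:
            summaries.extend(item["jobs"].values())
    return summaries

us_jobs = jobs_in_country("Country#United_States")
# Sorting by wage happens client-side on the denormalized summaries.
by_wage = sorted(us_jobs, key=lambda j: j["salary"], reverse=True)
print([j["title"] for j in by_wage])
```

The duplication buys you a single-query list page; the full job item is only fetched when someone clicks through.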
Again, it all comes down to access patterns. What do you have, and how can you arrange it in a way that lets you get what you need, fast?

Cassandra DB structure suggestion (two tables vs one)

I am new to Cassandra, and I'm creating a database structure and wonder whether my choice is optimal for my requirements.
I need to get information on unique users, and each unique user will have multiple page views.
My two options are de-normalizing my data into one big table, or create two different tables.
My most used queries would be:
Searching for all of the pages with a certain cookie_id.
Searching for all of the pages with a certain cookie_id and a client_id. If a cookie doesn't have a client, it would be marked client_id=0 (that would be most of the data).
Find the first cookie_id with extra data (for example data_type_1 + data_type_2).
My two suggested schemas are these:
Two tables - One for users and one for visited pages.
This would allow me to save a new user on a different table, and keep every recurring request in another table.
CREATE TABLE user_tracker.pages (
cookie_id uuid,
created_uuid timeuuid,
data_type_3 text,
data_type_4 text,
PRIMARY KEY (cookie_id, created_uuid)
);
CREATE TABLE user_tracker.users (
cookie_id uuid,
client_id int,
data_type_1 text,
data_type_2 text,
created_uuid timeuuid,
PRIMARY KEY (cookie_id, client_id, created_uuid)
);
This data is normalized as I don't enter the user's data for every request.
One table - For all of the data saved, and the first request as the key. First request would have data_type_1 and data_type_2.
I could also save "data_type_1" and "data_type_2" as a hashmap, as they represent multiple columns and they will always be in a single data set (is it considered to be better?).
The column "first" would be a secondary index.
CREATE TABLE user_tracker.users_pages (
cookie_id uuid,
client_id int,
data_type_1 text,
data_type_2 text,
data_type_3 text,
data_type_4 text,
first boolean,
created_uuid timeuuid,
PRIMARY KEY (cookie_id, client_id, created_uuid)
);
In reality we have more than four columns; this was written briefly.
As far as I understand Cassandra's best practices, I'm leaning into option #2, but please do enlighten me :-)
Data modelling in Cassandra is done based on the type of queries you will be making. You should be able to query using the partition key.
For the following two queries, option 2 as you describe it is okay:
1. Searching for all of the pages with a certain cookie_id.
2. Searching for all of the pages with a certain cookie_id and a client_id. If a cookie doesn't have a client, it would be marked client_id=0 (that would be most of the data).
For the third query:
Find the first cookie_id with extra data (for example data_type_1 + data_type_2).
I'm not sure what you mean by the first cookie_id with extra data. In Cassandra, data is distributed by partition key, and partitions are not sorted across the cluster; within a partition, rows are ordered by the clustering columns. So all your data will be stored using cookie_id as the partition key, and all future rows with the same cookie_id will keep adding to that partition.
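To make the storage model concrete, here is a small Python simulation (illustrative only): rows accumulate under their partition key, and within a partition they are kept in clustering-column order, which is why ordering guarantees exist only inside a partition:

```python
from collections import defaultdict

# Partition key -> list of rows, kept sorted by the clustering columns
# (client_id, created), mimicking PRIMARY KEY (cookie_id, client_id, created_uuid).
partitions = defaultdict(list)

def insert(cookie_id, client_id, created, page):
    partitions[cookie_id].append({"client_id": client_id, "created": created, "page": page})
    partitions[cookie_id].sort(key=lambda r: (r["client_id"], r["created"]))

insert("cookie-1", 0, 3, "/home")
insert("cookie-1", 0, 1, "/landing")
insert("cookie-1", 7, 2, "/checkout")

# Querying by partition key returns rows in clustering order, not insert order.
rows = partitions["cookie-1"]
print([(r["client_id"], r["created"]) for r in rows])
```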

DynamoDB - Design 1 to Many relationship

I'm new at DynamoDB technologies but not at NoSQL (I've already done some project using Firebase).
Having read that a DynamoDB best practice is one table per application, I've been having a hard time designing my 1-to-N relationship.
I have this entity (pseudo-json):
{
machineId: 'HASH_ID'
machineConfig: /* a lot of fields */
}
A machineConfig is unique for each machine and changes rarely, and only by an administrator (no consistency issue here).
The issue is that I have to manage a log of data from the sensors of each machine. The log is described as:
{
machineId: 'HASH_ID',
sensorsData: [
/* Huge list of: */
{ timestamp: ..., data: /* lot of fields */ },
...
]
}
I want to keep my machineConfig in one place. The log list can't be embedded in the machine entity because it's a continuous stream of data taken over time.
Furthermore, I don't understand what the composite key could be. The partition key is obviously the machineId, but what about the sort key?
How to design this relationship taking into account the potential dimensions of data?
You could do this with 1 table. The primary key could be (machineId, sortKey) where machineId is the partition key and sortKey is a string attribute that is going to be used to cover the 2 cases. You could probably come up with a better name.
To store the machineConfig you would insert an item with primary key (machineId, "CONFIG"). The sortKey attribute would have the constant value CONFIG.
To store the sensorsData you could use the timestamp as the sortKey value. You would insert a new item for each piece of sensor data and store the timestamp as a string (time since the epoch, ISO 8601, etc.).
Then to query everything about a machine you would run a Dynamo query specifying just the machineId partition key - this would return many items including the machineConfig and the sensor data.
To query just the machineConfig you would run a Dynamo query specifying the machineId partition key and the constant CONFIG as the sortKey value.
To query the sensor data you could specify an exact timestamp or a timestamp range for the sortKey. If you need to query the sensor data by other values then this design might not work as well.
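The three queries above can be sketched against an in-memory stand-in for the table (machine ids, attributes, and timestamps are made up; in real DynamoDB the SK condition would be a KeyConditionExpression):

```python
# Single-table layout: (machineId, sortKey); CONFIG is the constant sort key.
CONFIG = "CONFIG"
table = [
    {"machineId": "m-1", "sortKey": CONFIG, "maxTemp": 90},
    {"machineId": "m-1", "sortKey": "2023-01-01T00:00:00Z", "temp": 40},
    {"machineId": "m-1", "sortKey": "2023-01-02T00:00:00Z", "temp": 55},
    {"machineId": "m-2", "sortKey": CONFIG, "maxTemp": 70},
]

def query(machine_id, sort_key=None, sort_between=None):
    # Mirrors a DynamoDB Query: PK equality plus an optional SK condition.
    rows = [r for r in table if r["machineId"] == machine_id]
    if sort_key is not None:
        rows = [r for r in rows if r["sortKey"] == sort_key]
    if sort_between is not None:
        lo, hi = sort_between
        rows = [r for r in rows if lo <= r["sortKey"] <= hi]
    return rows

everything = query("m-1")                # config plus all sensor data
config = query("m-1", sort_key=CONFIG)   # just the config item
jan1 = query("m-1", sort_between=("2023-01-01T00:00:00Z", "2023-01-01T23:59:59Z"))
print(len(everything), config[0]["maxTemp"], len(jan1))
```

Because ISO 8601 strings sort lexicographically in time order, the BETWEEN condition on the sort key doubles as a time-range query.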
Editing to answer follow up question:
You would have to resort to a scan with a filter to return all machines with their machineId and machineConfig. If you end up inserting a lot of sensor data then this will be a very expensive operation to perform as Dynamo will look at every item in the table. If you need to do this you have a couple of options.
If there are not a lot of machines you could insert an item with a primary key like ("MACHINES", "ALL") and a list of all the machineIds. You would query on that key to get the list of machineIds, then you would do a bunch of queries (or a batch get) to retrieve all the related machineConfigs. However since the max Dynamo item size is 400KB you might not be able to fit them all.
If there are too many machines to fit in one item you could alter the above approach a bit and have ("MACHINES", $machineIdSubstring) as a primary key and store chunks of machineIds under each sort key. For example, all machineIds that start with 0 go in ("MACHINES", "0"). Then you would query by each primary key 0-9, build a list of all machineIds and query each machine as above.
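The chunking scheme can be sketched as follows (the machine ids and the first-character bucketing rule are illustrative):

```python
from collections import defaultdict

machine_ids = ["0a1", "0b2", "3c4", "9d5", "9e6"]

# Spread the id list across sort keys under a constant "MACHINES" partition
# key, bucketed by the id's first character, so no item nears the 400KB cap.
chunks = defaultdict(list)
for mid in machine_ids:
    chunks[("MACHINES", mid[0])].append(mid)

# Rebuilding the full list means one query per chunk key (0-9 here).
all_ids = []
for digit in "0123456789":
    all_ids.extend(chunks.get(("MACHINES", digit), []))
print(all_ids)
```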
Alternatively, you don't have to put everything in 1 table - it is just a guideline that fits a lot of use cases. If there are too many machines to fit in less than 400KB but there aren't tens of thousands and you aren't trying to query all of them all the time, you could have a separate table of machineId and machineConfig that you resort to scanning when necessary.

Table design Cassandra

I am persisting data from a machine which lets say has different sensors.
CREATE TABLE raw_data (
device_id uuid,
time timestamp,
id uuid,
unit text,
value double,
PRIMARY KEY ((device_id, unit), time)
)
I need to know which sensor was in use when the data was sent. I could add a field "sensor_id" and store sensor-related data in another table. The problem with this approach is that I'd have to store the location of the sensor (A, B, C), which can change. Changing the location in the sensor table would invalidate old data.
I have a feeling that I'm still thinking too much in the relational way. How would you suggest solving this?
Given your table description, I would say that device_id is the identifier (or PK) of the device,
but this is apparently not what you are thinking...
And IMHO this is the root of your problem.
I don't want to sound pedantic, but I often see that people forget (or do not know) that in the relational model, a relation is not (or not only) a relation between tables, but a relation between attributes, i.e. values taken from "domain values", including the PK with the PK (cf. Codd's definition of the relational model, which you can easily find on the net).
In relational model a table is a relation, a query (a SELECT in SQL, including joins) is also a relation.
Even with NoSQL, entities should (IMHO) follow at least the first three normal forms (atomicity and dependence on the PK, for short), which are more or less minimal common-sense modeling.
About PKs: in the relational model, there are flame debates about natural versus surrogate (unnatural, calculated) primary keys.
I would tend to natural, and often composite, keys, but this is just an opinion, and of course it depends on context.
In your data model, unit should not (IMHO) be part of the PK: it does not identify the device; it is a characteristic of the device.
The PK must uniquely identify the device. It is not a position or location, a unit, or any other characteristic of the device. It is a unique id, a serial number, or a combination of other characteristics which is unique for the device and does not change over time or along any other dimension.
For example, in the case of cars with embedded devices, you have the choice between giving each embedded device an opaque UUID PK, with a reference table to retrieve additional information about the device, and a composite PK, which could be given by: car maker, car serial number (sno), device type, device id,
like for example :
CREATE TABLE raw_data (
car_maker text,
car_sno text,
device_type text,
device_id text,
time timestamp,
id uuid,
unit text,
value double,
PRIMARY KEY ((car_maker, car_sno, device_type, device_id), time)
)
example data (columns in schema order; the id column is omitted for brevity and the timestamps are elided):
( 'bmw', '1256387A1AA43', 'tyrep', 'tyre1', 150056709xxx, 'bar',   2.4  ),
( 'bmw', '1256387A1AA43', 'tyrec', 'tyre1', 150056709xxx, 'tempC', 150  ),
( 'bmw', '1256387A1AA43', 'tyrep', 'tyre2', 150056709xxx, 'bar',   2.45 ),
( 'bmw', '1256387A1AA43', 'tyrec', 'tyre2', 150056709xxx, 'tempC', 160  ),
( 'bmw', '1256387A1AA43', 'tyrep', 'tyre3', 150056709xxx, 'bar',   2.5  ),
( 'bmw', '1256387A1AA43', 'tyrec', 'tyre3', 150056709xxx, 'tempC', 150  ),
( 'bmw', '1256387A1AA43', 'tyrep', 'tyre4', 150056709xxx, 'bar',   2.42 ),
( 'bmw', '1256387A1AA43', 'tyrec', 'tyre4', 150056709xxx, 'tempC', 150  ),
This is general guidance and must be adapted to your problem. Sometimes UUIDs and calculated keys are best.
With Cassandra, the difficulty is that you have to design your model around your queries, because the first part of the PK is the partition key, and you cannot query across multiple partitions (or it is difficult: you have to paginate or use another system like Spark).
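That point can be made concrete: the partition key is hashed as a unit to pick the node that owns the row, so an efficient query must supply the whole partition key. A rough simulation (the hash function and node count are illustrative stand-ins; Cassandra actually uses Murmur3 over a token ring):

```python
import hashlib

NUM_NODES = 4

def owning_node(car_maker, car_sno, device_type, device_id):
    # The composite partition key is hashed as a single unit;
    # a stable md5-based hash stands in for Cassandra's partitioner.
    key = f"{car_maker}|{car_sno}|{device_type}|{device_id}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % NUM_NODES

# All rows for one device land on one node, so a query that supplies the
# full partition key touches a single partition...
n1 = owning_node("bmw", "1256387A1AA43", "tyrep", "tyre1")
n2 = owning_node("bmw", "1256387A1AA43", "tyrep", "tyre1")
assert n1 == n2
# ...while rows for other devices may live elsewhere, which is why
# querying "across many devices" means touching many partitions.
print(n1)
```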
Don't think relational too much, don't be afraid to duplicate.
And I would suggest that you also have a look at Chebotko diagrams for Cassandra, which can help you design your Cassandra schema around queries.
best,
Alain

What is 'document data store' and 'key-value data store'?

What is document data store? What is key-value data store?
Please, describe in very simple and general words the mechanisms which stand behind each of them.
In a document data store each record has multiple fields, similar to a relational database. It also has secondary indexes.
Example record:
"id" => 12345,
"name" => "Fred",
"age" => 20,
"email" => "fred#example.com"
Then you could query by id, name, age, or email.
A key/value store is more like a big hash table than a traditional database: each key corresponds with a value and looking things up by that one key is the only way to access a record. This means it's much simpler and often faster, but it's difficult to use for complex data.
Example record:
12345 => "Fred,fred#example.com,20"
You can only use 12345 for your query criteria. You can't query for name, email, or age.
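A key-value store behaves much like a plain hash map with opaque values, which a few lines of Python can illustrate (the record encoding is the made-up CSV string from the example above):

```python
# A key-value store is essentially a big hash table: values are opaque blobs.
store = {
    12345: "Fred,fred@example.com,20",
    12346: "Alice,alice@example.com,31",
}

# Lookup by key is the only supported access path, and it is fast.
record = store[12345]

# Finding "the record where name == Alice" means scanning every value and
# parsing it yourself; the store offers no secondary index.
matches = [k for k, v in store.items() if v.split(",")[0] == "Alice"]
print(record, matches)
```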
Here's a description of a few common data models:
Relational systems are the databases we've been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
Key-value systems basically support get, put, and delete operations based on a primary key.
Column-oriented systems still use tables but have no joins (joins must be handled within your application). Obviously, they store data by column as opposed to traditional row-oriented databases. This makes aggregations much easier.
Document-oriented systems store structured "documents" such as JSON or XML but have no joins (joins must be handled within your application). It's very easy to map data from object-oriented software to these systems.
From this blog post I wrote: Visual Guide to NoSQL Systems.
From wikipedia:
Document data store: As opposed to relational databases, document-based databases do not store data in tables with uniform sized fields for each record. Instead, each record is stored as a document that has certain characteristics. Any number of fields of any length can be added to a document. Fields can also contain multiple pieces of data.
Key Value: An associative array (also associative container, map, mapping, dictionary, finite map, and in query-processing an index or index file) is an abstract data type composed of a collection of unique keys and a collection of values, where each key is associated with one value (or set of values). The operation of finding the value associated with a key is called a lookup or indexing, and this is the most important operation supported by an associative array. The relationship between a key and its value is sometimes called a mapping or binding. For example, if the value associated with the key "bob" is 7, we say that our array maps "bob" to 7.
More examples at NoSQL.
