Table design Cassandra - database

I am persisting data from a machine which, let's say, has different sensors.
CREATE TABLE raw_data (
device_id uuid,
time timestamp,
id uuid,
unit text,
value double,
PRIMARY KEY ((device_id, unit), time)
)
I need to know which sensor was in use when the data was sent. I could add a field "sensor_id" and store sensor-related data in another table. The problem with this approach is that I have to store the location of the sensor (A, B, C), which can change. Changing the location in the sensor table would invalidate old data.
I have a feeling that I'm still thinking too much in the relational way. How would you suggest solving this?

Given your table description, I would say that device_id is the identifier (or PK) of the device,
but this is not what you apparently are thinking...
And IMHO this is the root of your problem.
I don't want to sound pedantic, but I often see that people forget (or do not know) that in the relational model, a relation is not (or not only) a relation between tables, but a relation between attributes, i.e. values taken from "value domains", including the PK (cf. Codd's definition of the relational model, which you can easily find on the net).
In relational model a table is a relation, a query (a SELECT in SQL, including joins) is also a relation.
Even with NoSQL, entities should (IMHO) follow at least the first 3 normal forms (atomicity and dependence on pk for short) which are more or less minimal common sense modeling.
About the PK: in the relational model, there are heated debates on natural versus surrogate (artificial, generated) primary keys.
I would tend to natural, and often composite, keys, but this is just an opinion, and of course it depends on context.
In your data model, unit should not (IMHO) be part of the PK: it does not identify the device, it is a characteristic of the device.
The PK must uniquely identify the device. It is not a position or location, a unit or any other characteristic of the device. It is a unique id, a serial number, or a combination of other characteristics which is unique for the device and does not change over time or along any other dimension.
For example, in the case of cars with embedded devices, you have the choice between giving each embedded device an opaque uuid PK, with a reference table to retrieve additional information about the device, and a composite PK made of: car maker, car serial number (sno), device type, device id.
For example:
CREATE TABLE raw_data (
car_maker text,
car_sno text,
device_type text,
device_id text,
time timestamp,
id uuid,
unit text,
value double,
PRIMARY KEY ((car_maker, car_sno, device_type, device_id), time)
)
example data (car_maker, car_sno, device_type, device_id, unit, time, value):
( 'bmw', '1256387A1AA43', 'tyrep', 'tyre1', 'bar', 150056709xxx, 2.4 ),
( 'bmw', '1256387A1AA43', 'tyrec', 'tyre1', 'tempC', 150056709xxx, 150 ),
( 'bmw', '1256387A1AA43', 'tyrep', 'tyre2', 'bar', 150056709xxx, 2.45 ),
( 'bmw', '1256387A1AA43', 'tyrec', 'tyre2', 'tempC', 150056709xxx, 160 ),
( 'bmw', '1256387A1AA43', 'tyrep', 'tyre3', 'bar', 150056709xxx, 2.5 ),
( 'bmw', '1256387A1AA43', 'tyrec', 'tyre3', 'tempC', 150056709xxx, 150 ),
( 'bmw', '1256387A1AA43', 'tyrep', 'tyre4', 'bar', 150056709xxx, 2.42 ),
( 'bmw', '1256387A1AA43', 'tyrec', 'tyre4', 'tempC', 150056709xxx, 150 ),
This is a general thought and must align to your problem. Sometimes, uuids and calculated keys are best.
With Cassandra the difficulty is that you have to design your model around your queries, because the first part of the PK is the partition key, and querying across multiple partitions is either not possible or difficult (you have to paginate or use another system such as Spark).
Don't think relational too much, don't be afraid to duplicate.
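One way to read the "don't be afraid to duplicate" advice for the sensor-location problem in the question is to copy the sensor's current location into every reading at write time, so that later location changes do not rewrite history. The sketch below is plain Python rather than CQL, and the names `Reading`, `sensor_location` and `record` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    device_id: str
    sensor_id: str
    location: str      # duplicated from the sensor at write time
    time: datetime
    unit: str
    value: float


# Hypothetical lookup of where each sensor is mounted right now.
sensor_location = {"sensor-1": "A"}


def record(device_id, sensor_id, unit, value, when):
    """Build a reading that carries its own copy of the sensor location."""
    return Reading(device_id, sensor_id, sensor_location[sensor_id], when, unit, value)


old = record("dev-1", "sensor-1", "bar", 2.4, datetime(2017, 7, 1, tzinfo=timezone.utc))
sensor_location["sensor-1"] = "B"   # the sensor is moved later on
new = record("dev-1", "sensor-1", "bar", 2.5, datetime(2017, 7, 2, tzinfo=timezone.utc))
```

The old reading still reports location "A" after the move, at the cost of storing the location once per row.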
I would also suggest that you have a look at Chebotko diagrams for Cassandra, which can help you design your Cassandra schema around your queries.
best,
Alain

Related

Cassandra data model - column-family

I checked some questions here like Understanding Cassandra Data Model and Column-family concept and data model, and some articles about Cassandra, but I'm still not clear on what its data model is.
Cassandra follows a column-family data model, which is similar to key-value data model. In column-family you have data in rows and columns, so 2 dimensional structure and on top of that you have a grouping in column families? I suppose this is organized in column families to be able to partition the database across several nodes?
How are rows and columns grouped into column families? Why do we have column families?
For example let's say we have database of messages, as rows:
id: 123, message: {author: 'A', recipient: 'X', text: 'asd'}
id: 124, message: {author: 'B', recipient: 'X', text: 'asdf'}
id: 125, message: {author: 'C', recipient: 'Y', text: 'a'}
How and why would we organize this around column-family data model?
NOTE: Please correct or expand on example if necessary.
That's kind of the wrong question. Instead of modeling around the data, model around how you're going to query the data. What do you want to read? You create your data model around that, since the storage is strict about how you can access data. Most likely the id is not the key: if you want to read by author or recipient, use that as the partition key, with the unique id (use a uuid, not auto-increment) as a clustering column, i.e.:
CREATE TABLE message_by_recipient (
author text,
recipient text,
id timeuuid,
data text,
PRIMARY KEY (recipient, id)
) WITH CLUSTERING ORDER BY (id DESC)
Then to see the five newest emails to "bob"
select * from message_by_recipient where recipient = 'bob' limit 5
Using timeuuid for id guarantees uniqueness without an auto-increment bottleneck and also provides sorting by time. You may duplicate writes on a new message, writing to multiple tables so that each read is a single lookup. If data can get large, you may want to replace it with a uuid (type 4) and store the content in a blob store or distributed file system (e.g. S3) keyed by that uuid. That would reduce the impact on C* and also reduce the cost of the denormalization.
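To make the access pattern concrete, here is a rough in-memory imitation of that table in Python. It is not how Cassandra stores data internally, only a sketch of the idea: the partition key selects one bucket, and rows inside the bucket are kept in clustering order. An integer `sent_at` stands in for the timeuuid, and the function names are made up for this example.

```python
from collections import defaultdict

# partition key (recipient) -> rows kept in clustering order, newest first
message_by_recipient = defaultdict(list)


def write_message(recipient, sent_at, author, data):
    """Append to the recipient's partition and keep it sorted by sent_at DESC."""
    rows = message_by_recipient[recipient]
    rows.append((sent_at, author, data))
    rows.sort(key=lambda row: row[0], reverse=True)


def newest(recipient, limit=5):
    """Equivalent in spirit to: SELECT * ... WHERE recipient = ? LIMIT ?"""
    return message_by_recipient[recipient][:limit]


for i in range(7):
    write_message("bob", i, "alice", f"message {i}")
write_message("carol", 3, "alice", "hi carol")

latest = newest("bob")
```

Reading the five newest messages for "bob" touches only bob's bucket and needs no sorting at read time, which is the property the table definition above is designed for.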

Postgresql inheritance based database design

I'm developing a simple babysitter application that has 2 types of users: a 'Parent' and the 'Babysitter'. I'm using postgresql as my database but I'm having trouble working out my database design.
The 'Parent' and the 'Babysitter' entities have attributes that can be generalized, for example: username, password, email, ... Those attributes could be placed into a parent entity called 'User'. They both also have their own attributes, for example: Babysitter -> age.
In terms of OOP things are very clear to me: just extend the user class and you are good to go. But in DB design things are different.
Before posting this question I roamed around the internet for a good week looking for insight into this 'issue'. I did find a lot of information, but it seemed to me that there was a lot of disagreement. Here are some of the posts I've read:
How do you effectively model inheritance in a database?: Table-Per-Type (TPT), Table-Per-Hierarchy (TPH) and Table-Per-Concrete (TPC) VS 'Forcing the RDb into a class-based requirements is simply incorrect.'
https://dba.stackexchange.com/questions/75792/multiple-user-types-db-design-advice:
Table: `users`; contains all similar fields as well as a `user_type_id` column (a foreign key on `id` in `user_types`)
Table: `user_types`; contains an `id` and a `type` (Student, Instructor, etc.)
Table: `students`; contains fields only related to students as well as a `user_id` column (a foreign key of `id` on `users`)
Table: `instructors`; contains fields only related to instructors as well as a `user_id` column (a foreign key of `id` on `users`)
etc. for all `user_types`
https://dba.stackexchange.com/questions/36573/how-to-model-inheritance-of-two-tables-mysql/36577#36577
When to use inherited tables in PostgreSQL?: Inheritance in postgresql does not work as expected for me and a bunch of other users as the original poster points out.
I am really confused about which approach I should take. Class-table-inheritance (https://stackoverflow.com/tags/class-table-inheritance/info) seems like the most correct in my OOP mindset, but I would very much appreciate an updated, DB-minded opinion.
The way that I think of inheritance in the database world is "can only be one kind of." No other relational modeling technique works for that specific case; even with check constraints, with a strict relational model, you have the problem of putting the wrong "kind of" person into the wrong table. So, in your example, a user can be a parent or a babysitter, but not both. If a user can be more than one kind-of user, then inheritance is not the best tool to use.
The instructor/student relationship really only works well in the case where students cannot be instructors or vice-versa. If you have a TA, for example, it's better to model that using a strict relational design.
So, back to the parent-babysitter, your table design might look like this:
-- "user" is a reserved word in PostgreSQL, so the table name must be double-quoted
CREATE TABLE "user" (
id SERIAL,
full_name TEXT,
email TEXT,
phone_number TEXT
);
CREATE TABLE parent (
preferred_payment_method TEXT,
alternate_contact_info TEXT,
PRIMARY KEY(id)
) INHERITS("user");
CREATE TABLE babysitter (
age INT,
min_child_age INT,
preferred_payment_method TEXT,
PRIMARY KEY(id)
) INHERITS("user");
CREATE TABLE parent_babysitter (
parent_id INT REFERENCES parent(id),
babysitter_id INT REFERENCES babysitter(id),
PRIMARY KEY(parent_id, babysitter_id)
);
This model allows users to be "only one kind of" user: a parent or a babysitter. Notice how the primary key definitions are left to the child tables. In this model, you can have duplicated IDs between parent and babysitter, though this may not be a problem depending on how you write your code. (Note: Postgres is the only ORDBMS I know of with this restriction. Informix and Oracle, for example, have inherited keys on inherited tables.)
Also see how we mixed the relational model in - we have a many-to-many relationship between parents and babysitters. That way we keep the entities separated, but we can still model a relationship without weird self-referencing keys.
All the options can be roughly represented by the following cases:
(1) base table + table for each class (class-table inheritance, Table-Per-Type, suggestions from the dba.stackexchange)
(2) single table inheritance (Table-Per-Hierarchy) - just put everything into the single table
(3) create independent tables for each class (Table-Per-Concrete)
I usually prefer option (1), because (2) and (3) are not completely correct in terms of DB design.
With (2) you will have unused columns for some rows (like "age" will be empty for Parent). And with (3) you may have duplicated data.
But you also need to think in terms of data access. With option (1) you will have the data spread over a few tables, so to get a Parent, you will need to use join operations to select data from both the User and Parent tables.
I think that's the reason why options (2) and (3) exist - they are easier to use in terms of SQL queries (no joins are needed, you just select the data you need from one table).
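The join cost of the base-table-plus-class-table layout can be seen in a small sketch. The example below uses Python's built-in sqlite3 module rather than PostgreSQL so that it runs anywhere; the table and column names (`users`, `user_id`, and so on) are assumptions chosen for illustration, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL,
    email TEXT NOT NULL
);
-- one table per subtype, sharing the primary key of the base table
CREATE TABLE babysitter (
    user_id INTEGER PRIMARY KEY REFERENCES users(id),
    age INTEGER NOT NULL
);
CREATE TABLE parent (
    user_id INTEGER PRIMARY KEY REFERENCES users(id),
    preferred_payment_method TEXT
);
""")

conn.execute("INSERT INTO users (id, full_name, email) VALUES (1, 'Ann', 'ann@example.com')")
conn.execute("INSERT INTO babysitter (user_id, age) VALUES (1, 19)")
conn.execute("INSERT INTO users (id, full_name, email) VALUES (2, 'Bob', 'bob@example.com')")
conn.execute("INSERT INTO parent (user_id, preferred_payment_method) VALUES (2, 'cash')")

# Reading a complete babysitter needs a join between base and subtype table.
babysitters = conn.execute(
    "SELECT u.full_name, b.age FROM users u JOIN babysitter b ON b.user_id = u.id"
).fetchall()
```

No column is ever unused here, but every read of a full Parent or Babysitter pays for one join.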

Are guaranteed null fields an indication of poor database design?

I am working on a batch processing application that allows users to submit requests for information about particular vehicles. Users can submit a request either using a VIN or a License Plate/State combination. I proposed the following table structure:
VehiclesToBeProcessed
vehicle_id(fk)|user_id(fk)|status|start_time
Vehicles
vehicle_id|VIN|plate|state
My colleague argued that this was poor design, because every record in Vehicles would either have a null VIN field, or null plate and state fields. Instead, they proposed the following:
VehiclesToBeProcessed
vehicle_id(fk)|user_id(fk)|status|start_time
Vehicles
vehicle_id(pk)|field|value
where an entry in Vehicles would either consist of one row for a vin:
1|"vin"|"123"
or two rows for a plate/state:
2|"plate"|"abc 123"
2|"state"|"NY"
I thought that the first solution would be much easier to query without having any significant downside. Which design should be preferred? Are guaranteed null fields really an indicator of bad design?
What your colleague proposed is about the ultimate antipattern in database design.
Google for Bill Karwin's "antipatterns" book and for "EAV".
Ask your colleague how he proposes to enforce that "plate" and "state" values always appear in pairs in his database. If he points to the application code, ask him how he proposes to enforce that the database will only ever get updated through his application.
Your solution is a thousand times better than his. Still "better" (from the perspective of relational purity which involves avoiding all nulls) is to give each type of request its own table :
VehicleQueriesByVIN
user_id(fk)|status|start_time|VIN
VehicleQueriesByPlate
user_id(fk)|status|start_time|plate|state
If historical trace is to be kept of the statuses over time for each query, that stuff has to be singled out in its own table.
In a word: no. This is a case of misplaced optimization. His schema will actually take up more space on average, due to storing the strings; and of course the more complex code and queries will have worse performance.
Think of it as multiple ways to identify a vehicle. Your vehicle has one or many identities. The local police may identify your vehicle using its LPN, while the parking authority may use permit badges or active/passive transponders; furthermore, the DMV probably relies on VRNs.
If you really want to build a flexible way to bind a vehicle to multiple identities I would use an Identity type table so a vehicle can have one or many identities.
VehicleIdentity
  VehicleIdentity PK
  VehicleID FK
  IdentityValue
  IdentityType (Type)
  StateID??
Vehicle
  VehicleID PK
I updated the answer by removing a table that I see would be of no use :)
Nulls are fine. They are particularly useful for Single Table Inheritance, and if your system needs "Draft" entities.
If you use a quality database like Postgres, the storage penalty for nulls is negligible (a null is recorded in the row's null bitmap rather than stored as a column value).
Anyways, if the problem is "we need A OR B, and A and B are pretty darn similar" then the answer is almost always Table Inheritance. If you want to move fast then use Single Table Inheritance. If NULLs make you sad, then use Class Table Inheritance.
--STI:
create table vehicle_identifiers (
id int primary key,
type text not null check (type in ( 'VIN', 'STATE_N_PLATE' ) ),
vin text null,
state char(2) null,
plate text null,
check ( ( type='VIN' and vin is not null ) or ( type='STATE_N_PLATE' and state is not null and plate is not null ) )
);
--CTI:
create table vehicle_identifiers (
id int primary key
);
create table vehicle_identifiers_vin (
id int primary key references vehicle_identifiers(id),
vin text not null
);
create table vehicle_identifiers_state_n_plate (
id int primary key references vehicle_identifiers(id),
state text not null,
plate text not null
);
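As a quick check that the single-table variant really lets the database enforce "plate and state always come as a pair", the sketch below recreates the STI table in SQLite through Python's sqlite3 module. SQLite is used here only as a convenient stand-in; `try_insert` is a helper invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE vehicle_identifiers (
    id INTEGER PRIMARY KEY,
    type TEXT NOT NULL CHECK (type IN ('VIN', 'STATE_N_PLATE')),
    vin TEXT,
    state TEXT,
    plate TEXT,
    CHECK (
        (type = 'VIN' AND vin IS NOT NULL)
        OR (type = 'STATE_N_PLATE' AND state IS NOT NULL AND plate IS NOT NULL)
    )
)
""")


def try_insert(row):
    """Return True if the database accepts the row, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO vehicle_identifiers (id, type, vin, state, plate) "
            "VALUES (?, ?, ?, ?, ?)",
            row,
        )
        return True
    except sqlite3.IntegrityError:
        return False


ok_vin = try_insert((1, "VIN", "123", None, None))
ok_plate = try_insert((2, "STATE_N_PLATE", None, "NY", "abc 123"))
plate_without_state = try_insert((3, "STATE_N_PLATE", None, None, "abc 123"))
```

The third insert is rejected by the database itself, regardless of which application issued it.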

Ensuring referential integrity between Columns in a Table

Currently I have a table model with the following format:
criteria (criteria_id, criteria_name)
criteria_data(criteria_id, value)
I intend to store date-related information in these tables: one criteria (the Date criteria) would just contain a single date in the criteria_data table, while another criteria_data row would hold the price of the stock for that date. (Another complication: the name of the stock is also a criteria.)
My problem:
How can I ensure that only one price (a single-row criteria) can be entered into the table for a particular date and stock name (two other separate criteria and rows)?
I really don't want to enforce this in the App layer so I am mainly looking for DB Layer solutions , if available.
I am also open to being told to scrap my entire table model, if a more suitable Data model is suggested.
EDIT
After being informed of my folly ( see dPortas post below), I accept that this is not the smart way to go. I thought of a new model:
criteria_data(stockName,price, high,low,price,change)
While this is what it looks like, I am thinking the actual column names would be an identifier containing the criteria_id . For example, the stockname field could be col_1 and high could be col_3 but this would ensure that I could enforce integrity on the various columns.
What are people's thoughts on this?
Your table design looks suspiciously like a case of EAV. Among the disadvantages of that anti-pattern are that you can't accurately store the right datatypes or apply constraints to it. I suggest you reconsider the design.
Suggested redesign: criteria (criteria_id, criteria_name, date, stock_name, price) key: (stock_name, date)
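With real columns, the "only one price per stock and date" rule becomes an ordinary composite key. The sketch below shows this with Python's sqlite3 module; the table name `stock_price`, the column `price_date` and the sample values are assumptions made for the example, and only the key columns from the suggested redesign are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE stock_price (
    stock_name TEXT NOT NULL,
    price_date TEXT NOT NULL,
    price REAL NOT NULL,
    PRIMARY KEY (stock_name, price_date)  -- one price per stock per date
)
""")

conn.execute("INSERT INTO stock_price VALUES ('ACME', '2020-01-02', 10.5)")

try:
    # Same stock and date again: the database rejects it, no app-layer check needed.
    conn.execute("INSERT INTO stock_price VALUES ('ACME', '2020-01-02', 11.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A different date for the same stock is fine.
conn.execute("INSERT INTO stock_price VALUES ('ACME', '2020-01-03', 11.0)")
row_count = conn.execute("SELECT COUNT(*) FROM stock_price").fetchone()[0]
```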

What is 'document data store' and 'key-value data store'?

What is document data store? What is key-value data store?
Please, describe in very simple and general words the mechanisms which stand behind each of them.
In a document data store each record has multiple fields, similar to a relational database. It also has secondary indexes.
Example record:
"id" => 12345,
"name" => "Fred",
"age" => 20,
"email" => "fred@example.com"
Then you could query by id, name, age, or email.
A key/value store is more like a big hash table than a traditional database: each key corresponds with a value and looking things up by that one key is the only way to access a record. This means it's much simpler and often faster, but it's difficult to use for complex data.
Example record:
12345 => "Fred,fred@example.com,20"
You can only use 12345 for your query criteria. You can't query for name, email, or age.
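The difference can be mimicked in a few lines of Python. This is only a toy model, assuming a dict as the key/value store and a hand-built secondary index for the document store; `name_index` and `find_by_name` are names made up for the sketch, and real products implement indexing very differently.

```python
# Key/value store: the value is opaque, the key is the only way in.
kv_store = {12345: "Fred,fred@example.com,20"}

# Document store: the record has named fields the store knows about...
documents = {
    12345: {"id": 12345, "name": "Fred", "age": 20, "email": "fred@example.com"},
}

# ...so it can maintain secondary indexes on them (here: on "name").
name_index = {}
for doc_id, doc in documents.items():
    name_index.setdefault(doc["name"], []).append(doc_id)


def find_by_name(name):
    """Look documents up through the secondary index instead of the primary key."""
    return [documents[doc_id] for doc_id in name_index.get(name, [])]


by_key = kv_store[12345]        # the only query a key/value store offers
by_name = find_by_name("Fred")  # possible because fields are indexed
```

With the key/value store, finding "Fred" without knowing 12345 would mean scanning and parsing every value in the application.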
Here's a description of a few common data models:
Relational systems are the databases we've been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
Key-value systems basically support get, put, and delete operations based on a primary key.
Column-oriented systems still use tables but have no joins (joins must be handled within your application). Obviously, they store data by column as opposed to traditional row-oriented databases. This makes aggregations much easier.
Document-oriented systems store structured "documents" such as JSON or XML but have no joins (joins must be handled within your application). It's very easy to map data from object-oriented software to these systems.
From this blog post I wrote: Visual Guide to NoSQL Systems.
From wikipedia:
Document data store: As opposed to relational databases, document-based databases do not store data in tables with uniform sized fields for each record. Instead, each record is stored as a document that has certain characteristics. Any number of fields of any length can be added to a document. Fields can also contain multiple pieces of data.
Key Value: An associative array (also associative container, map, mapping, dictionary, finite map, and in query-processing an index or index file) is an abstract data type composed of a collection of unique keys and a collection of values, where each key is associated with one value (or set of values). The operation of finding the value associated with a key is called a lookup or indexing, and this is the most important operation supported by an associative array. The relationship between a key and its value is sometimes called a mapping or binding. For example, if the value associated with the key "bob" is 7, we say that our array maps "bob" to 7.
More examples at NoSQL.

Resources