CassandraDB table with multiple key-value pairs

I am a new CassandraDB user. I am trying to create a table which has 3 static columns, for example "name", "city" and "age", and then I was thinking of adding two more columns, "key" and "value", since my table could receive a lot of inputs. How can I define this table? I am trying to achieve something scalable, e.g.:
Table columns --> "Name", "City", "Age", "Key", "Value"
Name: Mark
City: Liverpool
Age: 26
Key: Car
Value: Audi A3
Key: Job
Value: Computer Engineer
Key: Main hobby
Value: Football
I am looking for the TABLE DEFINITION.. Any help? Thank you so so much in advance.

If I understand correctly, you want to create a key-value store, grouped by "name", "city" and "age". There are a few ways to do this.
First, by using STATIC columns:
create table record_by_id(
recordId text,
name text static,
city text static,
age int static,
key text,
value text,
primary key (recordId, key)
);
With this table design, name, city and age remain constant for the same recordId. You can add any number of key-value pairs for the same recordId.
The second approach would be:
create table record_by_id(
name text ,
city text ,
age int ,
key text,
value text,
primary key ((name,city,age),key)
);
In this design, name, city and age are all part of the partition key. The key column is the clustering key.
Both approaches are scalable, but the first is easier to maintain.

table which has 3 static columns
So by "static" I assume you're not referring to Cassandra's definition of static columns. Which is cool, I know what you mean. But the mention did give me an idea of how to approach this:
trying to create the table definition
I see two ways to go about this.
CREATE TABLE user_properties (
name TEXT,
city TEXT STATIC,
age INT STATIC,
key TEXT,
value TEXT,
PRIMARY KEY (name,key));
Because we have static columns (only stored w/ the partition key name) adding more key/values is just a matter of adding more keys to the same name, so INSERTing data looks like this:
INSERT INTO user_properties (name,city,age,key,value)
VALUES ('Mark','Liverpool',26,'Car','Audi A3');
INSERT INTO user_properties (name,key,value)
VALUES ('Mark','Job','Computer Engineer');
INSERT INTO user_properties (name,key,value)
VALUES ('Mark','Main hobby','Football');
Querying looks like this:
> SELECT * FROM user_properties WHERE name='Mark';
 name | key        | age | city      | value
------+------------+-----+-----------+-------------------
 Mark | Car        |  26 | Liverpool | Audi A3
 Mark | Job        |  26 | Liverpool | Computer Engineer
 Mark | Main hobby |  26 | Liverpool | Football
(3 rows)
This is the "simple" way to go about it.
Or
CREATE TABLE user_properties_map (
name TEXT,
city TEXT,
age INT,
kv MAP<TEXT,TEXT>,
PRIMARY KEY (name));
With a single partition key as the PRIMARY KEY, we can INSERT everything in one shot:
INSERT INTO user_properties_map (name,city,age,kv)
VALUES ('Mark','Liverpool',26,{'Car':'Audi A3',
'Job':'Computer Engineer',
'Main hobby':'Football'});
And querying looks like this:
> SELECT * FROM user_properties_map WHERE name='Mark';
 name | age | city      | kv
------+-----+-----------+--------------------------------------------------------------------------
 Mark |  26 | Liverpool | {'Car': 'Audi A3', 'Job': 'Computer Engineer', 'Main hobby': 'Football'}
(1 rows)
This has the added benefit of putting the properties into a map, which might be helpful if that's the way you're intending to work with it on the application side. The drawbacks are that Cassandra collections are best kept under 100 items, the writes are a little more complicated, and you can't filter on individual entries of the map without adding a secondary index.
But by keying on name (might want to also include last name or something else to help with uniqueness), data should scale fine. And partition growth won't be a problem, unless you're planning on thousands of key/value pairs.
Basically, choose the structure based on the standard Cassandra advice of considering how you'd query the data, and then build the table to suit it.

Related

Does taking advantage of dynamic columns in Cassandra require duplicated data in each row?

I've been trying to understand how one would model time series data in Cassandra, like shown in the below image from a popular System Design Interview video, where counts of views are stored hourly.
While I would think the schema for this time series data would be something like the below, I don't believe this would lead to data actually being stored in the way the screenshot shows.
CREATE TABLE views_data (
video_id uuid,
channel_name varchar,
video_name varchar,
viewed_at timestamp,
count int,
PRIMARY KEY (video_id, viewed_at)
);
Instead, I'm assuming it would lead to something like this (inspired by datastax), where technically there is a single row for each video_id, but the other columns seem like they would all be duplicated, such as channel_name, video_name, etc.. within the row for each unique viewed_at.
[cassandra-cli]
list views_data;
RowKey: A
=> (channel_name='System Design Interview', video_name='Distributed Cache', count=2, viewed_at=1370463146717000)
=> (channel_name='System Design Interview', video_name='Distributed Cache', count=3, viewed_at=1370463282090000)
=> (channel_name='System Design Interview', video_name='Distributed Cache', count=8, viewed_at=1370463282093000)
-------------------
RowKey: B
=> (channel_name='Some other channel', video_name='Some video', count=4, viewed_at=1370463282093000)
I assume this is still considered dynamic wide row, as we're able to expand the row for each unique (video_id, viewed_at) combination. But it seems less than ideal that we need to duplicate the extra information such as channel_name and video_name.
Is the screenshot of modeling time series data misleading or is it actually possible to have dynamic columns where certain columns in the row do not need to be duplicated?
If I was upserting time series data to this row, I wouldn't want to have to provide the channel_name and video_name for every single upsert, I would just want to provide the count.
No, it is not necessary to duplicate the values of columns within the rows of a partition. It is possible to model your table to accommodate your use case.
In Cassandra, there is a concept of "static columns" -- columns which have the same value for all rows within a partition.
Here's the schema of an example table that contains two static columns, colour and item:
CREATE TABLE statictbl (
pk int,
ck text,
c int,
colour text static,
item text static,
PRIMARY KEY (pk, ck)
);
In this table, all rows of a partition share the same colour and item. For example, partition pk=1 has the same colour='red' and item='apple' for all rows:
pk | ck | colour | item | c
----+----+--------+--------+----
1 | a | red | apple | 12
1 | b | red | apple | 23
1 | c | red | apple | 34
If I insert a new partition pk=2:
INSERT INTO statictbl (pk, ck, colour, item, c) VALUES (2, 'd', 'yellow', 'banana', 45);
we get:
pk | ck | colour | item | c
----+----+--------+--------+----
2 | d | yellow | banana | 45
If I then insert another row withOUT specifying a colour and item:
INSERT INTO statictbl (pk, ck, c) VALUES (2, 'e', 56);
the new row with ck='e' still has the colour and item populated even though I didn't insert a value for them:
pk | ck | colour | item | c
----+----+--------+--------+----
2 | d | yellow | banana | 45
2 | e | yellow | banana | 56
In your case, both the channel and video names will share the same value for all rows in a given partition if you declare them as static and you only ever need to insert them once. Note that when you update the value of static columns, ALL the rows for that partition will reflect the updated value.
For details, see Sharing a static column in Cassandra. Cheers!
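To make the behaviour above concrete, here is a toy Python model of a single partition with static columns. It only mimics the read/write semantics described in this answer; the Partition class and its method names are invented for illustration and say nothing about how Cassandra actually stores data.

```python
class Partition:
    """Toy model of one partition with static columns (illustrative only)."""

    def __init__(self):
        self.static = {}  # values shared by every row in the partition
        self.rows = {}    # clustering key -> regular (non-static) column values

    def insert(self, ck, static=None, **regular):
        # Static values live once per partition, so later inserts may omit them.
        if static:
            self.static.update(static)
        self.rows.setdefault(ck, {}).update(regular)

    def select(self):
        # Every returned row shows the partition's current static values.
        return [
            {"ck": ck, **self.static, **cols}
            for ck, cols in sorted(self.rows.items())
        ]


# Mirrors the pk=2 example: the second insert supplies no colour or item.
p2 = Partition()
p2.insert("d", static={"colour": "yellow", "item": "banana"}, c=45)
p2.insert("e", c=56)
rows = p2.select()
```

In this model, updating a static value later changes what every row of the partition returns, which matches the note above about updates to static columns.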

How can I associate a single record with one or more PKs

If I had a single record that represented, say, a sellable item:
ItemID | Name
-------------
101 | Chips
102 | Candy bar
103 | Beer
I need to create a relationship between these items and one or more different types of PKs. For instance, a company might have an inventory that included chips; a store might have an inventory that includes chips and a candy bar, and the night shift might carry chips, candy bars, and beer. The issue is that we have different kinds of IDs: CompanyID, StoreID, ShiftID respectively.
My first thought was "Oh just create link tables that link Company to inventory items, Stores to inventory items, and shifts to inventory items" and that way if I needed to look up the inventory collection for any of those entities, I could query them explicitly. However, the UI shows that I should be able to compile a list arbitrarily (e.g. show me all inventory items for company a, all west valley stores and Team BrewHa who is at an east valley store) and then display them grouped by their respective entity:
Company A
---------
- Chips
West Valley 1
-------------
- Chips
- Candy Bar
West Valley 2
-------------
- Chips
BrewHa (East Valley 6)
--------------------
- Chips
- Candy Bar
- Beer
So again, my first thought was to base the query on the provided information (what kinds of IDs did they give me) and then just union them together with some extra info for grouping (candidate keys like IDType+ID) so that the result looked kind of like this:
IDType | ID | InventoryItemID
------------------------------
1 |100 | 1
2 |200 | 1
2 |200 | 2
2 |201 | 1
3 |300 | 1
3 |300 | 2
3 |300 | 3
I guess this would work, but it seems incredibly inefficient and contrived to me; I'm not even sure how the parameters of that sproc would work... So my question to everyone is: is this even the right approach? Can anyone explain alternative or better approaches to solve the problem of creating and managing these relationships?
It's hard to ascertain what you want as I don't know the purpose/use of this data. I'm not well-versed in normalization, but perhaps a star schema might work for you. Please keep in mind, I'm using my best guess for the terminology. What I was thinking would look like this:
tbl_Current_Inventory(Fact Table) records current Inventory
InventoryID INT NOT NULL FOREIGN KEY REFERENCES tbl_Inventory(ID),
CompanyID INT NULL FOREIGN KEY REFERENCES tbl_Company(ID),
StoreID INT NULL FOREIGN KEY REFERENCES tbl_Store(ID),
ShiftID INT NULL FOREIGN KEY REFERENCES tbl_Shift(ID),
Shipped_Date DATE, --not really sure, just an example
CONSTRAINT clustered_unique UNIQUE CLUSTERED (InventoryID,CompanyID,StoreID,ShiftID)
tbl_Inventory(Fact Table 2)
ID INT NOT NULL,
ProductID INT NOT NULL FOREIGN KEY REFERENCES tbl_Product(ID),
PRIMARY KEY(ID,ProductID)
tbl_Store(Fact Table 3)
ID INT PRIMARY KEY,
CompanyID INT FOREIGN KEY REFERENCES tbl_Company(ID),
RegionID INT FOREIGN KEY REFERENCES tbl_Region(ID)
tbl_Product(Dimension Table)
ID INT PRIMARY KEY,
Product_Name VARCHAR(25)
tbl_Company(Dimension Table)
ID INT PRIMARY KEY,
Company_Name VARCHAR(25)
tbl_Region(Dimension Table)
ID INT PRIMARY KEY,
Region_Name VARCHAR(25)
tbl_Shift(Dimension Table)
ID INT PRIMARY KEY,
Shift_Name VARCHAR(25),
Start_Time TIME,
End_Time TIME
So a little explanation. Each dimension table holds only distinct values. For example, tbl_Region lists each region's name once, along with an ID.
Now for tbl_Current_Inventory, that will hold all the columns. I have CompanyID and StoreID both in there for a reason: this table can hold company inventory information (NULL StoreID and NULL ShiftID) and it can also hold store inventory information.
Then as for querying this, I would create a view that joins each table, then simply query the view. Then of course there are indexes, but I don't think you asked about those. Also notice I only had about one column per dimension table. My guess is that you'll probably have more columns than just the name of something.
Overall, this helps eliminate a lot of duplicate data, and it strikes a good balance between performance and a data structure that is not overly complicated. Really though, if you slap a view on it and query the view, it should perform quite well, especially if you add some good indexes.
This may not be a perfect solution or even the one you need, but hopefully it at least gives you some ideas or some direction.
If you need any more explanation or anything else, just let me know.
In a normalized database, you implement a many-to-many relationship by creating a table that defines the relationships between entities just as you thought initially. It might seem contrived, but it gives you the functionality you need. In your case I would create a table for the relationship called something like "Carries" with the primary key of (ProductId, StoreId, ShiftId). Sometimes you can break normalization rules for performance, but it comes with side effects.
I recommend picking up a good book on designing relational databases. Here's a starter on a few topics:
http://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model
http://en.wikipedia.org/wiki/Database_normalization
You need to break it down to inventory belongs to a store and a shift
Inventory does not belong to a company - a store belongs to a company
If the company holds inventory directly then I would create a store named warehouse
A store belongs to a region
Don't design for the UI - put the data in 3NF
Tables:
Company ID, name
Store ID, name
Region ID, name
Product ID, name
Shift ID, name
CompanyToStore CompanyID, StoreID (composite PK)
RegionToStore RegionID, StoreID (composite PK)
Inventory StoreID, ShiftID, ProductID (composite PK)
The composite PKs are not just efficient, they also prevent duplicates
The join tables should not have their own ID as PK
Let the relationships they are managing be the PK
If you want to report by company across all shifts you would have a query like this
select distinct Store.Name, Product.Name
from Inventory
join Store
on Inventory.StoreID = Store.ID
join Product
on Inventory.ProductID = Product.ID
join CompanyToStore
on Store.ID = CompanyToStore.StoreID
and CompanyToStore.CompanyID = X
Store count in a region:
select Region.Name, count(*)
from RegionToStore
join Region
on Region.ID = RegionToStore.RegionID
group by Region.Name
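As a rough, runnable illustration of the 3NF layout above, the following sketch builds a cut-down version of the tables in an in-memory SQLite database through Python's sqlite3 module. The sample rows, the shift names and the products_by_company helper are assumptions made up for the demo, and the Product join is included so that product names can be selected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Company (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Store   (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Product (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Shift   (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE CompanyToStore (
        CompanyID INTEGER REFERENCES Company(ID),
        StoreID   INTEGER REFERENCES Store(ID),
        PRIMARY KEY (CompanyID, StoreID)
    );
    -- The relationship itself is the key, so the same item cannot be listed twice.
    CREATE TABLE Inventory (
        StoreID   INTEGER REFERENCES Store(ID),
        ShiftID   INTEGER REFERENCES Shift(ID),
        ProductID INTEGER REFERENCES Product(ID),
        PRIMARY KEY (StoreID, ShiftID, ProductID)
    );
    INSERT INTO Company VALUES (1, 'Company A');
    INSERT INTO Store   VALUES (200, 'West Valley 1'), (201, 'West Valley 2');
    INSERT INTO Product VALUES (1, 'Chips'), (2, 'Candy bar'), (3, 'Beer');
    INSERT INTO Shift   VALUES (1, 'Day'), (2, 'Night');
    INSERT INTO CompanyToStore VALUES (1, 200), (1, 201);
    INSERT INTO Inventory VALUES (200, 1, 1), (200, 1, 2), (200, 2, 1), (201, 1, 1);
""")


def products_by_company(conn, company_id):
    """List distinct (store, product) pairs for one company across all shifts."""
    return conn.execute(
        """
        SELECT DISTINCT Store.Name, Product.Name
        FROM Inventory
        JOIN Store   ON Inventory.StoreID = Store.ID
        JOIN Product ON Inventory.ProductID = Product.ID
        JOIN CompanyToStore ON Store.ID = CompanyToStore.StoreID
        WHERE CompanyToStore.CompanyID = ?
        ORDER BY Store.Name, Product.Name
        """,
        (company_id,),
    ).fetchall()


report = products_by_company(conn, 1)
```

Inserting an existing (StoreID, ShiftID, ProductID) combination again raises sqlite3.IntegrityError, which is the duplicate protection the composite key gives you.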

References in a table

I have a table like this, that contains items that are added to the database.
Catalog table example
id | element | catalog
0 | mazda | car
1 | penguin | animal
2 | zebra | animal
etc....
And then I have a table where the user selects items from that table, and I keep a reference of what has been selected like this
User table example
id | name | age | itemsSelected
0 | john | 18 | 2;3;7;9
So what I am trying to say is that I keep a reference to what the user has selected as a string of IDs, but this seems a tad troublesome.
Because when I do a query to get information about a user, all I get is the string 2;3;7;9, when what I really want is an array of the items corresponding to those IDs.
Right now I get the IDs, split the string, and then run another query to find the elements the IDs correspond to.
Is there any easier ways to do this, if my question is understandable?
Yes, there is a way to do this. You create a third table which maps A to B. This is called a many-to-many relationship, implemented with foreign keys.
You have your Catalogue table (int, varchar(MAX), varchar(MAX)) or similar.
You have your User table (int, varchar(MAX), varchar(MAX), varchar(MAX)) or similar, essentially, remove the last column and then create another table:
You create a UserCatalogue table: (int UserId, int CatalogueId) with a Primary Key on both columns. Then the UserId column gets a Foreign-Key to User.Id, and the CatalogueId column gets a Foreign-Key to Catalogue.Id. This preserves the relationship and eases queries. It also means that if Catalogue.Id number 22 does not exist, you cannot accidentally insert it as a relation between the two. This is called referential integrity: if you declare that a column must reference another table, SQL Server enforces that relationship.
After you create this, for each itemsSelected you add an entry: I.e.
UserId | CatalogueId
0 | 2
0 | 3
0 | 7
0 | 9
This also allows you to use JOINs on the tables for faster queries.
Additionally, and unrelated to the question, you can also optimize the Catalogue table you have a bit, and create another table for CatalogueGroup, which contains your last column there (catalog: car, animal) which is referenced via a Foreign-Key Relationship in the current Catalogue table definition you have. This will also save storage space and speed up SQL Server work, as it no longer has to read a string column if you only want the element value.
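A minimal sketch of the junction-table idea, using Python's built-in sqlite3 module rather than SQL Server so it can be run anywhere. The table and column names follow the answer above; the sample selections and the selected_items helper are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE Catalogue (Id INTEGER PRIMARY KEY, element TEXT, catalog TEXT);
    CREATE TABLE User (Id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE UserCatalogue (
        UserId      INTEGER NOT NULL REFERENCES User(Id),
        CatalogueId INTEGER NOT NULL REFERENCES Catalogue(Id),
        PRIMARY KEY (UserId, CatalogueId)
    );
    INSERT INTO Catalogue VALUES
        (0, 'mazda', 'car'), (1, 'penguin', 'animal'), (2, 'zebra', 'animal');
    INSERT INTO User VALUES (0, 'john', 18);
    INSERT INTO UserCatalogue VALUES (0, 1), (0, 2);
""")


def selected_items(conn, user_id):
    """Resolve a user's selections to catalogue elements with a single JOIN."""
    rows = conn.execute(
        """
        SELECT c.element
        FROM UserCatalogue AS uc
        JOIN Catalogue AS c ON c.Id = uc.CatalogueId
        WHERE uc.UserId = ?
        ORDER BY c.Id
        """,
        (user_id,),
    ).fetchall()
    return [element for (element,) in rows]


johns_items = selected_items(conn, 0)
```

There is no string splitting and no second query: the JOIN returns the elements directly, and an insert that points at a non-existent Catalogue.Id is rejected by the foreign key.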

MS Access - How do i use a table for a data type

Hi, I am creating a contacts database and I want to create a Cities table that I can use for the City field in the People table. How do I do this?
City table:
ID | City
--------------
1 | Wellington
2 | Auckland
3 | Christchurch
People Table Design
Field Name: City
Data Type: Short Text
Display Control: Combobox
Row Source Type: Table/Query
Row Source: City
This is my table design for the City field, but the combobox only shows the ID numbers.
I really am against the concept of Lookups in table. So I would suggest you to have a read of "The Evils of Lookup" before you proceed.
The problem is that you have used a table name as the Row Source. You need to modify some of the properties of the field. In the Lookup tab, change the Column Count to 2 and the Column Widths to 0cm;2.04cm, and probably set the Row Source to:
SELECT ID, City FROM City;

Relational Design: Column Attributes

I have a system that allows a person to select a form type that they want to fill out from a drop down box. From this, the rest of the fields for that particular form are shown, the user fills them out, and submits the entry.
Form Table:
| form_id | age_enabled | profession_enabled | salary_enabled | name_enabled |
This describes the metadata of a form so the system will know how to draw it. So each _enabled column is a boolean true if the form should include a field to be filled out for this column.
Entry Table:
| entry_id | form_id | age | profession | salary | name | country |
This stores a submitted form. Where age, profession, etc stores the actual value filled out in the form (or null if it didn't exist in the form)
Users can add new forms to the system on the fly.
Now the main question: I would like to add the ability for a user designing a new form to be able to include a list of possible values for an attribute (e.g. profession is a drop down list of say 20 professions instead of just a text box when filling out the form). I can't simply store a global list of possible values for each column because each form will have a different list of values to pick from.
The only solution I can come up with is to include another set of columns in Form table like profession_values and then store the values in a character delimited format. I am concerned that a column may one day have a large number of possible values and this column will get out of control.
Note that new columns can be added later to Form if necessary (and thus Entry in turn), but 90% of forms have the same base set of columns, so I think this design is better than an EAV design. Thoughts?
I have never seen a relational design for such a system (as a whole) and I can't seem to figure out a decent way to do this.
Create a new table to contain groups of values:
CREATE TABLE "values" ( -- "values" and "group" are reserved words in SQL, hence the quotes
id SERIAL,
"group" INT NOT NULL,
value TEXT NOT NULL,
label TEXT NOT NULL,
PRIMARY KEY (id),
UNIQUE ("group", value)
);
For example:
INSERT INTO "values" ("group", value, label) VALUES (1, 'NY', 'New York');
INSERT INTO "values" ("group", value, label) VALUES (1, 'CA', 'California');
INSERT INTO "values" ("group", value, label) VALUES (1, 'FL', 'Florida');
So, group 1 contains three possible values for your drop-down selector. Then, your form table can reference what group a particular column uses.
Note also that you should add fields to a form via rows, not columns. I.e., your app shouldn't be adjusting the schema when you add new forms, it should only create new rows. So, make each field its own row:
CREATE TABLE form (
id SERIAL,
name TEXT NOT NULL,
PRIMARY KEY (id)
);
CREATE TABLE form_fields (
id SERIAL,
form_id INT NOT NULL REFERENCES form(id),
field_label TEXT NOT NULL,
field_type TEXT NOT NULL, -- 'text' or 'select'
field_select INT, -- group number in the values table; NULL for plain text fields
PRIMARY KEY (id)
);
INSERT INTO form (name) VALUES ('new form');
$id = last_insert_id()
INSERT INTO form_fields (form_id, field_label, field_type) VALUES ($id, 'age', 'text');
INSERT INTO form_fields (form_id, field_label, field_type) VALUES ($id, 'profession', 'text');
INSERT INTO form_fields (form_id, field_label, field_type) VALUES ($id, 'salary', 'text');
INSERT INTO form_fields (form_id, field_label, field_type, field_select) VALUES ($id, 'state', 'select', 1);
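Here is one way the rows-not-columns design above could be exercised end to end, sketched with Python's sqlite3 module. The names field_values and group_id are stand-ins chosen to sidestep reserved words, and describe_form is a hypothetical helper; treat it as an illustration of the idea rather than a finished schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE field_values (
        id       INTEGER PRIMARY KEY,
        group_id INTEGER NOT NULL,
        value    TEXT NOT NULL,
        label    TEXT NOT NULL,
        UNIQUE (group_id, value)
    );
    CREATE TABLE form (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE form_fields (
        id           INTEGER PRIMARY KEY,
        form_id      INTEGER NOT NULL REFERENCES form(id),
        field_label  TEXT NOT NULL,
        field_type   TEXT NOT NULL,
        field_select INTEGER          -- group of allowed values, NULL for free text
    );
    INSERT INTO field_values (group_id, value, label) VALUES
        (1, 'NY', 'New York'), (1, 'CA', 'California'), (1, 'FL', 'Florida');
""")

# Adding a form only inserts rows; the schema never changes.
form_id = conn.execute("INSERT INTO form (name) VALUES ('new form')").lastrowid
conn.executemany(
    "INSERT INTO form_fields (form_id, field_label, field_type, field_select) "
    "VALUES (?, ?, ?, ?)",
    [(form_id, "age", "text", None), (form_id, "state", "select", 1)],
)


def describe_form(conn, form_id):
    """Return (label, type, options) per field; options is empty for text fields."""
    fields = conn.execute(
        "SELECT field_label, field_type, field_select FROM form_fields "
        "WHERE form_id = ? ORDER BY id",
        (form_id,),
    ).fetchall()
    described = []
    for label, ftype, group_id in fields:
        options = conn.execute(
            "SELECT value, label FROM field_values WHERE group_id = ? ORDER BY id",
            (group_id,),
        ).fetchall() if group_id is not None else []
        described.append((label, ftype, options))
    return described


layout = describe_form(conn, form_id)
```

Each form can point its select fields at a different group, so two forms can offer different lists for the same kind of field without any delimited strings.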
I think you are starting from the wrong place entirely.
| form_id | age_enabled | profession_enabled | salary_enabled | name_enabled |
Are you just going to keep adding to this table for every single form field you can ever have? In general, the list could be endless.
How will your application code display a form if all the fields are in columns in this table?
What about a form table like this:
| form_id | form description |
Then another table, formAttributes with one row per entry on the form:
| attribute_id | form_id | position | name | type |
Then a third table, formAttributeValidValues, with one row per valid value of an attribute:
| attribute_id | value_id | value |
This may seem like more work to begin with, but it really isn't. Think about how easy it is to add or remove an attribute or value on a form. Also think about how your application will render the form:
for form_element in (select name, attribute_id, type
from formAttributes
where form_id = :bind
order by position asc) loop
render_form_element
if form_element.type = 'list of values' then
render_values with 'select ... from formAttributeValidValues'
end if
end loop;
The dilemma then becomes how to store the form results. Ideally you would store them with one row per form element in a table that looks something like:
| completed_form_id | form_id | attribute_id | value |
If you only ever work on one form at a time, then this model will work well. If you want to do aggregations over lots of forms, then the resulting queries become more difficult; however, that is reporting, which can run in a different process from the online form entry. You can start to think about pivot queries that transform the rows into columns, or materialized views that pull together forms of the same type, etc.
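The pivot idea mentioned above can be sketched with conditional aggregation. The example below uses Python's sqlite3 module; the completedFormValues table name, the sample attributes and the pivot_form helper are assumptions for illustration, and a real reporting query would probably generate the CASE expressions from formAttributes instead of hard-coding them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE formAttributes (
        attribute_id INTEGER PRIMARY KEY,
        form_id      INTEGER NOT NULL,
        position     INTEGER NOT NULL,
        name         TEXT NOT NULL,
        type         TEXT NOT NULL
    );
    -- One row per answered form element (hypothetical table name).
    CREATE TABLE completedFormValues (
        completed_form_id INTEGER NOT NULL,
        form_id           INTEGER NOT NULL,
        attribute_id      INTEGER NOT NULL REFERENCES formAttributes(attribute_id),
        value             TEXT,
        PRIMARY KEY (completed_form_id, attribute_id)
    );
    INSERT INTO formAttributes VALUES
        (10, 1, 1, 'age', 'text'), (11, 1, 2, 'profession', 'list of values');
    INSERT INTO completedFormValues VALUES
        (100, 1, 10, '26'), (100, 1, 11, 'Engineer'), (101, 1, 10, '41');
""")


def pivot_form(conn, form_id):
    """Turn the row-per-element results into one row per completed form."""
    return conn.execute(
        """
        SELECT v.completed_form_id,
               MAX(CASE WHEN a.name = 'age' THEN v.value END)        AS age,
               MAX(CASE WHEN a.name = 'profession' THEN v.value END) AS profession
        FROM completedFormValues AS v
        JOIN formAttributes AS a ON a.attribute_id = v.attribute_id
        WHERE v.form_id = ?
        GROUP BY v.completed_form_id
        ORDER BY v.completed_form_id
        """,
        (form_id,),
    ).fetchall()


pivoted = pivot_form(conn, 1)
```

A form element that was never filled in simply has no row, so it shows up as NULL (None in Python) in the pivoted result.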

Resources