Dynamic schema changes in Cassandra

I have lots of users (150-200 million). Each user has N (30-100) attributes. An attribute can be of type integer, text or timestamp. The attributes are not known in advance, so I want to add them dynamically, on the fly.
Solution 1 - Add new column by altering the table
CREATE TABLE USER_PROFILE(
UID uuid PRIMARY KEY,
LAST_UPDATE_DATE TIMESTAMP,
CREATION_DATE TIMESTAMP
);
For each new attribute:
ALTER TABLE USER_PROFILE ADD AGE INT;
INSERT INTO USER_PROFILE (UID, LAST_UPDATE_DATE, CREATION_DATE, AGE) VALUES (01f63e8b-db53-44ef-924e-7a3ccfaeec28, '2021-01-12 07:34:19.121', '2021-01-12 07:34:19.121', 27);
Solution 2 - Fixed schema:
CREATE TABLE USER_PROFILE(
UID uuid,
ATTRIBUTE_NAME TEXT,
ATTRIBUTE_VALUE_TEXT TEXT,
ATTRIBUTE_VALUE_TIMESTAMP TIMESTAMP,
ATTRIBUTE_VALUE_INT INT,
LAST_UPDATE_DATE TIMESTAMP,
CREATION_DATE TIMESTAMP,
PRIMARY KEY (UID, ATTRIBUTE_NAME)
);
For each new attribute:
INSERT INTO USER_PROFILE (UID, ATTRIBUTE_NAME, ATTRIBUTE_VALUE_INT, LAST_UPDATE_DATE, CREATION_DATE) VALUES (01f63e8b-db53-44ef-924e-7a3ccfaeec28, 'age', 27, '2021-01-12 07:34:19.121', '2021-01-12 07:34:19.121');
Which is the best solution in terms of performance?

I would personally go with the 2nd solution - having columns for each data type that is used, and using the attribute name as the last component of the primary key. See examples in my previous answers on that topic:
Cassandra dynamic column family
How to handle Dynamic columns in Cassandra
How to understand the 'Flexible schema' in Cassandra?
The first solution has the following problems:
If you do schema modification from the code, then you need to coordinate these changes; otherwise you will get a schema disagreement that must be resolved by admins restarting the nodes. And a coordinated change will either slow down data insertion or create a single point of failure.
The existence of many columns has a significant performance impact. For example, per this very good analysis by The Last Pickle, having 100 columns instead of 10 increases read latency by more than 10 times.
You can't change an attribute's type if you need to. In the solution with the attribute as a clustering column, you can simply start writing the attribute as another type. If the attribute is a column, you can't do that, because Cassandra doesn't allow changing a column's type (don't try to drop the column and add it back with the new type - you'll corrupt your existing data). So you would need to create a completely new column for that attribute.
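To make the last point concrete, here is a minimal CQL sketch against the solution-2 table above (the UUID and timestamp values are illustrative): reading all attributes of a user is a single-partition query, and "changing" an attribute's type just means writing the value into a different typed column for the same clustering key.
-- Read every attribute of one user (single partition):
SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE_TEXT, ATTRIBUTE_VALUE_INT, ATTRIBUTE_VALUE_TIMESTAMP
FROM USER_PROFILE
WHERE UID = 01f63e8b-db53-44ef-924e-7a3ccfaeec28;

-- Later store 'age' as text instead of int: same row, different value column
-- (optionally null out ATTRIBUTE_VALUE_INT at the same time):
INSERT INTO USER_PROFILE (UID, ATTRIBUTE_NAME, ATTRIBUTE_VALUE_TEXT, LAST_UPDATE_DATE)
VALUES (01f63e8b-db53-44ef-924e-7a3ccfaeec28, 'age', 'twenty-seven', '2021-01-13 10:00:00');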

Related

Bad design to compare to computed columns?

Using SQL Server I have a table with a computed column. That column concatenates 60 columns:
CREATE TABLE foo
(
    Id INT NOT NULL,
    PartNumber NVARCHAR(100),
    field_1 INT NULL,
    field_2 INT NULL,
    -- and so forth up to field_60
    field_60 INT NULL
);

ALTER TABLE foo
ADD RecordKey AS CONCAT(field_1, '-', field_2, '-', -- and so on up to field_60
                        field_60) PERSISTED;

CREATE INDEX ix_foo_RecordKey ON dbo.foo (RecordKey);
Why I used a persisted column:
Not having the need to index 60 columns
To test to see if a current record exists by checking just one column
This table will contain no fewer than 20 million records. Adds/Inserts/updates happen a lot, and some binaries do tens of thousands of inserts/updates/deletes per run, and we want these to be quick and live.
Currently we have C# code that manages records in table foo. It has a function which concatenates the same fields, in the same order, as the computed column. If a record with that same concatenated key already exists, we might not insert, or we might insert but call other functions that we normally would not.
Is this a bad design? The big danger I see is if the code for any reason doesn't match the concatenation order of the computed column (if one is edited but not the other).
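For illustration, the duplicate check described above might look roughly like this on the SQL side (a sketch, not the actual C# code; the variable names and values are made up, and only three of the 60 fields are shown):
DECLARE @field_1 INT = 1, @field_2 INT = 2, @field_60 INT = 60;  -- illustrative values

-- Build the candidate key the same way the computed column does,
-- then probe the single index on RecordKey instead of comparing 60 columns.
DECLARE @candidateKey NVARCHAR(4000) = CONCAT(@field_1, '-', @field_2, '-', @field_60);

IF EXISTS (SELECT 1 FROM dbo.foo WHERE RecordKey = @candidateKey)
    PRINT 'record already exists - skip or take the alternate code path';
ELSE
    PRINT 'no match - safe to insert';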
Rules/Requirements
We want to show records in JQGrid. We already have C# that can do so if the records come from a single table or view
We need the ability to check two records to verify if they both have the same values for all of the 60 columns
A better table design would be
parts table
-----------
id
partnumber
other_common_attributes_for_all_parts
attributes table
----------------
id
attribute_name
attribute_unit (if needed)
part_attributes table
---------------------
part_id (foreign key to parts)
attribute_id (foreign key to attributes)
attribute_value
It looks complicated, but with proper indexing this is super fast even if part_attributes contains billions of records!
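A minimal T-SQL sketch of that layout (the exact types, index names and IDENTITY columns are assumptions for illustration, not part of the original answer):
CREATE TABLE parts
(
    id         INT IDENTITY(1,1) PRIMARY KEY,
    partnumber NVARCHAR(100) NOT NULL
    -- other common attributes for all parts
);

CREATE TABLE attributes
(
    id             INT IDENTITY(1,1) PRIMARY KEY,
    attribute_name NVARCHAR(100) NOT NULL,
    attribute_unit NVARCHAR(50) NULL
);

CREATE TABLE part_attributes
(
    part_id         INT NOT NULL REFERENCES parts (id),
    attribute_id    INT NOT NULL REFERENCES attributes (id),
    attribute_value NVARCHAR(100) NULL,
    CONSTRAINT PK_part_attributes PRIMARY KEY (part_id, attribute_id)
);

-- Supports "find parts by attribute value" as well as "list attributes of a part":
CREATE INDEX ix_part_attributes_value ON part_attributes (attribute_id, attribute_value);
With this layout, a "do these two parts have identical attributes" check becomes a join/aggregate over part_attributes rather than a 60-column comparison.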

Cassandra Ordering Query (or Auto Update Record) Without Primary Key

I have a Cassandra table containing information about ship identities. It has the columns imo_number, mmsi_number, ship_name, and timestamp.
I have 2 requirements for this table:
I want this table to only keep the last updated record for each combination of imo_number, mmsi_number, and ship_name. So if a new record has exactly the same values for those columns, it should be updated (upserted). As far as I know, I need to define those columns as the primary key in Cassandra.
create table keyspace.table (
mmsi_number text,
imo_number text,
ship_name text,
timestamp timestamp,
primary key ((mmsi_number), imo_number, ship_name)
);
I want to be able to load data from the table with mmsi_number in the WHERE clause, sorted by timestamp, so I can find the latest record or find records within a date range. As far as I know, the CQL schema would be:
create table keyspace.table (
mmsi_number text,
imo_number text,
ship_name text,
timestamp timestamp,
primary key ((mmsi_number), timestamp, imo_number, ship_name)
) with clustering order by (timestamp desc);
The problem is that when I use my second schema, my first requirement is no longer fulfilled: because every new record has a different timestamp, it gets inserted instead of updated.
How do I fulfil both requirements above? Or maybe I did something wrong? Help appreciated.
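For illustration, a minimal sketch of the conflict with the second schema (table name and values are made up; "ships" stands in for the table above): two writes for the same ship identity but with different timestamps end up as two clustering rows instead of one upserted row.
-- Two reports for the same ship identity, received at different times:
INSERT INTO ships (mmsi_number, imo_number, ship_name, timestamp)
VALUES ('123456789', '7654321', 'SHIP A', '2021-01-12 07:00:00');

INSERT INTO ships (mmsi_number, imo_number, ship_name, timestamp)
VALUES ('123456789', '7654321', 'SHIP A', '2021-01-12 08:00:00');

-- With timestamp as the first clustering column this returns two rows,
-- so requirement 1 (keep only the latest row per identity) is broken:
SELECT * FROM ships WHERE mmsi_number = '123456789';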

Primary Key Constraint, migration of table from local db to snowflake, recommended data types for json column?

What order can I copy data into two different tables to comply with the table constraints I created locally?
I created an example from the documentation, but was hoping to get recommendations on how to optimize the data stored by selecting the right types.
I created two tables: one is a list of names and the second is a list of names with a date they did something.
create or replace table name_key (
id integer not null,
id_sub integer not null,
constraint pkey_1 primary key (id, id_sub) not enforced,
name varchar
);
create or replace table recipts (
col_a integer not null,
col_b integer not null,
constraint fkey_1 foreign key (col_a, col_b) references name_key (id, id_sub) not enforced,
recipt_date datetime,
did_stuff variant
);
Insert into name_key values (0, 0, 'Geinie'), (1, 1, 'Greg'), (2,2, 'Alex'), (3,3, 'Willow');
Insert into recipts (col_a, col_b, recipt_date) values (0, 0, Current_date()), (1, 1, Current_date()), (2, 2, Current_date()), (3, 3, Current_date());
Select * from name_key;
Select * from recipts;
Select * from name_key
join recipts on name_key.id = recipts.col_a
where id = 0 or col_b = 2;
I read: https://docs.snowflake.net/manuals/user-guide/table-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure where it recommends to change timestamps from strings to a variant. I did not include the fourth column; I left it blank for future use. Essentially it captures data in JSON format, so I made it a variant. Would it be better to rethink this table structure to flatten the variant column?
Also, I would like to change the key to AUTO_INCREMENT; is there something like this in Snowflake?
What order can I copy data into two different tables to comply with the table constraints I created locally?
You need to give more context about your constraints, but you can control the order of copy statements. For foreign keys generally you want to load the table that is referenced before the table that does the referencing.
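For example, with the two tables above, a load order that respects the foreign key would look something like this (the stage and file names are made-up assumptions):
-- Load the referenced table first...
COPY INTO name_key FROM @my_stage/name_key.csv FILE_FORMAT = (TYPE = CSV);

-- ...then the table that references it.
COPY INTO recipts FROM @my_stage/recipts.csv FILE_FORMAT = (TYPE = CSV);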
where it recommends to change timestamps from strings to a variant.
I think you misread that documentation. It recommends extracting values from a variant column into their own separate columns (in this case a timestamp column), ESPECIALLY if those columns are dates and times, arrays, and numbers within strings.
Converting a timestamp column to a variant is exactly what it is recommending against.
Would it be better to rethink this table structure to flatten the variant column?
It's definitely good to think carefully about, and do performance tests on, situations where you are using semi-structured data, but without more information on your specific situation and data, it's hard to say.
Also, I would like to change the key to AUTO_INCREMENT; is there something like this in Snowflake?
Yes, Snowflake has an AUTOINCREMENT feature, although I've heard it has some issues when working with COPY INTO statements.
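A sketch of what that could look like for the name_key table above (the start/increment values just spell out the defaults; this is an assumption, not the original poster's code):
create or replace table name_key (
    id integer autoincrement start 1 increment 1 not null,
    id_sub integer not null,
    constraint pkey_1 primary key (id, id_sub) not enforced,
    name varchar
);

-- Omit the autoincrement column and let Snowflake generate it:
insert into name_key (id_sub, name) values (0, 'Geinie'), (1, 'Greg');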

How to do transaction.insert_or_update on secondary index and not the primary index?

I have a table in Google Cloud Spanner.
CREATE TABLE test_id (
Id STRING(MAX) NOT NULL,
KeyColumn STRING(MAX) NOT NULL,
parent_id INT64 NOT NULL,
Updated TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (Id)
And, I am trying to perform transaction.insert_or_update through a python script.
For each row in a pandas dataframe, I am doing:
transaction.insert_or_update(
'test_id', columns=['Id','KeyColumn', 'parent_id', 'Updated'],
values=[(uuid.uuid4().hex, row["KeyColumn"], row["parent_id"], spanner.COMMIT_TIMESTAMP)],
)
What I want is that if the row["KeyColumn"] is already present in KeyColumn of the table, update its parent_id column, otherwise insert a new row in the Spanner table corresponding to that KeyColumn.
But since my primary key is Id, which is generated randomly by uuid.uuid4().hex, it inserts a new row every time.
If I understand you correctly, the following is the situation:
ID is the primary key of your table.
There is a unique index defined for the table on the column KeyColumn.
You want to insert_or_update a row using KeyColumn as the column that should be used to determine whether the row already exists.
That is unfortunately not possible. insert_or_update will always use the primary key of the table to determine whether the row exists. I can think of three possible solutions to this problem, but they all have their drawbacks:
You could change the table definition and make KeyColumn the primary key and set a unique index on the Id column. The problem with this is of course that any other code that depends on Id being the primary key also needs to change. It is also a rather cumbersome change, because Cloud Spanner does not allow you to change the primary key of a table, so you would have to create a copy of the test_id table and then drop the old table.
You could fetch the row from Cloud Spanner before updating it by reading it using the KeyColumn value that you have. The big problem with this is obviously performance. You will need to do a read for each row that you want to update.
You could use a DML statement (UPDATE test_id SET parent_id=@parent WHERE KeyColumn=@key) to execute the update and check whether it actually updated a row by checking the returned update count. If it did not update anything, you could then execute the insert. This will obviously also be slower than an insert_or_update mutation.
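A rough sketch of option 3 as parameterized Spanner DML (the @-parameters are illustrative; the client code would check the update count from the first statement and only run the insert when it is 0):
-- Try the update first; the client inspects the returned row count.
UPDATE test_id
SET parent_id = @parent,
    Updated = PENDING_COMMIT_TIMESTAMP()
WHERE KeyColumn = @key;

-- Fallback when the update touched 0 rows (Id still generated client-side):
INSERT INTO test_id (Id, KeyColumn, parent_id, Updated)
VALUES (@id, @key, @parent, PENDING_COMMIT_TIMESTAMP());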
There is also a way to query Cloud Spanner with a specific index.
You would use something like this in the FROM clause of your query: FROM test_id@{FORCE_INDEX=KeyColumnIndex}.
Even though this is the way to execute queries on secondary indexes, and so answers the question in the title, I don't know how well it applies to your use case.

what is the correct way to design a 'table to row' relationship?

I am trying to model the following in a postgres db.
I have N number of 'datasets'. These datasets are things like survey results, national statistics, aggregated data, etc. They each have a name, a source institution, a method, etc. This is the metadata of a dataset, and I have tables created for this, plus tables for codifying the research methods etc. The 'root' metadata table is called 'Datasets'. Each row represents one dataset.
I then need to store and access the actual data associated with this dataset. So I need to create a table that contains that data. How do I represent the relationship between this table and its corresponding row in the 'Datasets' table?
An example:
'hea' is a set of survey responses. It is unaggregated, so each row is one survey response. I create a table called 'HeaData' that contains this data.
'cso' is a set of aggregated employment data. Each row is an economic sector. I create a table called 'CsoData' that contains this data.
I create a row for each of these in the 'Datasets' table with the relevant metadata for each, and they have ids of 1 & 2 respectively.
What is the best way to relate 1 to the HeaData table and 2 to the CsoData table?
I will eventually be accessing this data with Scala Slick, so if the database design could just 'plug and play' with Slick, that would be ideal.
Add a column to the Datasets table which designates which type of dataset it represents. Then a 1 may mean HEA and 2 may mean CSO. A check constraint would limit the field to one of the two values. If new types of datasets are added later, the only change needed is to change the constraint. If it is defined as a foreign key to a "type of dataset" table, you just need to add the new type of dataset there.
Form a unique index on the PK and the new field.
Add the same field to each of the subtables, but with a check constraint that limits the value to 'HEA' in the HEA table and to 'CSO' in the CSO table. Then make the subtable's ID field plus the new field a foreign key to the Datasets table.
This limits the ID value to only one of the subtables and it must be the one defined in the Datasets table. That is, if you define a HEA dataset entry with an ID value of 1000 and the HEA type value, the only subtable that can contain an ID value of 1000 is the HEA table.
create table Datasets(
    ID int generated always as identity,
    DSType char( 3 ) check( DSType in( 'HEA', 'CSO' ) ),
    [everything else],
    constraint PK_Datasets primary key( ID ),
    constraint UQ_Dataset_Type unique( ID, DSType ) -- needed for references
);
create table HEA(
    ID int not null,
    DSType char( 3 ) check( DSType = 'HEA' ), -- making this a constant value
    [other HEA data],
    constraint PK_HEA primary key( ID ),
    constraint FK_HEA_Dataset_PK foreign key( ID )
        references Datasets( ID ),
    constraint FK_HEA_Dataset_Type foreign key( ID, DSType )
        references Datasets( ID, DSType )
);
The same idea applies to the CSO subtable.
I would recommend an HEA and CSO view that would show the complete dataset rows, metadata and type-specific data, joined together. With triggers on those views, they can be the DML points for the application code. Then the apps don't have to keep track of how that data is laid out in the database, making it a lot easier to make improvements should the opportunity present itself.
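A minimal Postgres sketch of such a view for the HEA subtable (the view name is an assumption, and the columns elided by the placeholders above are left as a comment):
-- Joined view: dataset metadata plus the HEA-specific data.
create view hea_view as
select d.ID,
       d.DSType
       -- , d.<other Datasets metadata columns>, h.<other HEA data columns>
from Datasets d
join HEA h
  on h.ID = d.ID
 and h.DSType = d.DSType;
An INSTEAD OF trigger on such a view can then route inserts and updates to the Datasets and HEA tables, which is what makes the view usable as the DML point described above.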
