How can I version my data with Cassandra? - database

I want to use the key-value pair feature of Cassandra. Until now, I have been using Kyotocabinet but it does not support multiple writes and hence, I want to use Cassandra for versioning my tabular data.
Roll No, Name, Age, Sex
14BCE1008, Aviral, 22, Male
14BCE1007, Shantanu, 22, Male
The above data is tabular(csv). It's version 1.
Next is version 2:
Roll No, Name, Age, Sex
14BCE1008, Aviral, 22, Male
14BCE1007, Shantanu, 22, Male
14BCE1209, Piyush, 22, Male
Hence, I would call the above version as version 2 with the following diff:
insert_patch: 14BCE1209 as key(PK) and 14BCE1209, Piyush, 22, Male as value.
I am familiar with the creation of the table but unable to figure out the versioning part.

To have multiple versions of data in your table if you use composite primary key instead of primary key consisting of one field.
So table definition could look as following (if you "know" the version number prior inserting the data):
create table test(
id text,
version int,
payload text,
primary key (id, version)
) with clustering order by (version desc);
and inserting data as:
insert into test (id, version, payload) values ('14BCE1209', 1, '....');
insert into test (id, version, payload) values ('14BCE1209', 2, '....');
to select the latest value for given key you can use LIMIT 1:
SELECT * from test where id = '14BCE1209' LIMIT 1;
and to select latest versions for all partitions (not recommended, just for example - need a special approach for effective processing):
SELECT * from test PER PARTITION LIMIT 1;
But this will work only for cases when you know version in advance. If you don't know, then you can use timeuuid type for version instead of the int:
create table test(
id text,
version timeuuid,
payload text,
primary key (id, version)
) with clustering order by (version desc);
and inserting data as (instead of now() you can use current timestamp generated from your code):
insert into test (id, version, payload) values ('14BCE1209', now(), '....');
insert into test (id, version, payload) values ('14BCE1209', now(), '....');
and select will work the same as above...

Related

Snowflake - how to do multiple DML operations on same primary key in a specific order?

I am trying to set up continuous data replication in Snowflake. I get the transactions happened in source system and I need to perform them in Snowflake in the same order as source system. I am trying to use MERGE for this, but when there are multiple operations on same key in source system, MERGE is not working correctly. It either misses an operation or returns duplicate row detected during DML operation error.
Please note that the transactions need to happen in exact order and it is not possible to take the latest transaction for a key and just do it (like if a record has been INSERTED and UPDATED, in Snowflake too it needs to be inserted first and then updated even though insert is only transient state) .
Here is the example:
create or replace table employee_source (
id int,
first_name varchar(255),
last_name varchar(255),
operation_name varchar(255),
binlogkey integer
)
create or replace table employee_destination ( id int, first_name varchar(255), last_name varchar(255) );
insert into employee_source values (1,'Wayne','Bells','INSERT',11);
insert into employee_source values (1,'Wayne','BellsT','UPDATE',12);
insert into employee_source values (2,'Anthony','Allen','INSERT',13);
insert into employee_source values (3,'Eric','Henderson','INSERT',14);
insert into employee_source values (4,'Jimmy','Smith','INSERT',15);
insert into employee_source values (1,'Wayne','Bellsa','UPDATE',16);
insert into employee_source values (1,'Wayner','Bellsat','UPDATE',17);
insert into employee_source values (2,'Anthony','Allen','DELETE',18);
MERGE into employee_destination as T using (select * from employee_source order by binlogkey)
AS S
ON T.id = s.id
when not matched
And S.operation_name = 'INSERT' THEN
INSERT (id,
first_name,
last_name)
VALUES (
S.id,
S.first_name,
S.last_name)
when matched AND S.operation_name = 'UPDATE'
THEN
update set T.first_name = S.first_name, T.last_name = S.last_name
When matched
And S.operation_name = 'DELETE' THEN DELETE;
I am expecting to see - Bellsat - as last name for employee id 1 in the employee_destination table after all rows get processed. Same way, I should not see emp id 2 in the employee_destination table.
Is there any other alternative to MERGE to achieve this? Basically to go over every single DML in the same order (using binlogkey column for ordering) .
thanks.
You need to manipulate your source data to ensure that you only have one record per key/operation otherwise the join will be non-deterministic and will (dpending on your settings) either error or will update using a random one of the applicable source records. This is covered in the documentation here https://docs.snowflake.com/en/sql-reference/sql/merge.html#duplicate-join-behavior.
In any case, why would you want to update a record only for it to be overwritten by another update - this would be incredibly inefficient?
Since your updates appear to include the new values for all rows, you can use a window function to get to just the latest incoming change, and then merge those results into the target table. For example, the select for that merge (with the window function to get only the latest change) would look like this:
with SOURCE_DATA as
(
select COLUMN1::int ID
,COLUMN2::string FIRST_NAME
,COLUMN3::string LAST_NAME
,COLUMN4::string OPERATION_NAME
,COLUMN5::int PROCESSING_ORDER
from values
(1,'Wayne','Bells','INSERT',11),
(1,'Wayne','BellsT','UPDATE',12),
(2,'Anthony','Allen','INSERT',13),
(3,'Eric','Henderson','INSERT',14),
(4,'Jimmy','Smith','INSERT',15),
(1,'Wayne','Bellsa','UPDATE',16),
(1,'Wayne','Bellsat','UPDATE',17),
(2,'Anthony','Allen','DELETE',18)
)
select * from SOURCE_DATA
qualify row_number() over (partition by ID order by PROCESSING_ORDER desc) = 1
That will produce a result set that has only the changes required to merge into the target table:
ID
FIRST_NAME
LAST_NAME
OPERATION_NAME
PROCESSING_ORDER
1
Wayne
Bellsat
UPDATE
17
2
Anthony
Allen
DELETE
18
3
Eric
Henderson
INSERT
14
4
Jimmy
Smith
INSERT
15
You can then change the when not matched to remove the operation_name. If it's listed as an update and it's not in the target table, it's because it was inserted in a previous operation in the new changes.
For the when matched clause, you can use the operation_name to determine if the row should be updated or deleted.

Creating a unique ID using first 3 character of last name and a sequence number

I have this employee table which I want every employee to have a unique ID using the 3 first letters of their name plus a sequence number in SQL Server.
I don't remember at all how to do this I haven't used SQL in a year and kinda forgot everything.
Can anyone refresh my mind on how to do this. Google has been of no help on this matter. Thanks
Firstly, I suggest that the numeric portion of your identifier be unique in and of itself, in case the employee gets married and changes their last name. The prefix can still appear to the left of it, but should not be necessary to be unique.
If you agree with this design, then you can simply use a numeric identity column on the Employee table and combine that with the last name when retrieving the data, using a computed column. I suggest you seed the identity with a value that has enough digits to keep your identifier lengths consistent, so for example to support 90,000 employees you can use a seed of 10,000 which ensures all identifiers are 8 characters long (three letters of the name plus five numeric).
Simple example:
CREATE TABLE Employee
(
EmployeeNo int IDENTITY(10000,1) PRIMARY KEY,
LastName VarChar(64),
EmployeeID AS SUBSTRING(UPPER(LastName), 1, 3) + RIGHT('0000' + CONVERT(char(5), EmployeeNo), 5)
)
INSERT Employee (LastName) VALUES ('Smith')
SELECT * FROM Employee
Results:
EmployeeNo LastName EmployeeID
10000 Smith SMI10000
For the purposes of your SQL and table design, your tables should all use EmployeeNo as foreign key, since it is compact and unique. Apply the three-letter prefix during data retrieval and only for customer-facing purposes.
#John Wu is right. However, if you don't want to rely on the Employee no then you use NewID() function, which will create a unique number always. Below is the code.
CREATE TABLE EmployeeDetails
(
EmployeeCode int IDENTITY(1,1),
FirstName varchar(50),
LastName Varchar(50),
Empid as left(Lastname,3) + convert(varchar(500), newid())
)
INSERT EmployeeDetails VALUES ('Atul', 'Jain')
`

How to implement many-to-many-to-many database relationship?

I am building a SQLite database and am not sure how to proceed with this scenario.
I'll use a real-world example to explain what I need:
I have a list products that are sold by many stores in various states. Not every Store sells a particular Product at all, and those that do, may only sell it in one State or another. Most stores sell a product in most states, but not all.
For example, let's say I am trying to buy a vacuum cleaner in Hawaii. Joe's Hardware sells vacuums in 18 states, but not in Hawaii. Walmart sells vacuums in Hawaii, but not microwaves. Burger King does not sell vacuums at all, but will give me a Whopper anywhere in the US.
So if I am in Hawaii and search for a vacuum, I should only get Walmart as a result. While other stores may sell vacuums, and may sell in Hawaii, they don't do both but Walmart does.
How do I efficiently create this type of relationship in a relational database (specifically, I am currently using SQLite, but need to be able to convert to MySQL in the future).
Obviously, I would need tables for Product, Store, and State, but I am at a loss on how to create and query the appropriate join tables...
If I, for example, query a certain Product, how would I determine which Store would sell it in a particular State, keeping in mind that Walmart may not sell vacuums in Hawaii, but they do sell tea there?
I understand the basics of 1:1, 1:n, and M:n relationships in RD, but I am not sure how to handle this complexity where there is a many-to-many-to-many situation.
If you could show some SQL statements (or DDL) that demonstrates this, I would be very grateful. Thank you!
An accepted and common way is the utilisation of a table that has a column for referencing the product and another for the store. There's many names for such a table reference table, associative table mapping table to name some.
You want these to be efficient so therefore try to reference by a number which of course has to uniquely identify what it is referencing. With SQLite by default a table has a special column, normally hidden, that is such a unique number. It's the rowid and is typically the most efficient way of accessing rows as SQLite has been designed this common usage in mind.
SQLite allows you to create a column per table that is an alias of the rowid you simple provide the column followed by INTEGER PRIMARY KEY and typically you'd name the column id.
So utilising these the reference table would have a column for the product's id and another for the store's id catering for every combination of product/store.
As an example three tables are created (stores products and a reference/mapping table) the former being populated using :-
CREATE TABLE IF NOT EXISTS _products(id INTEGER PRIMARY KEY, productname TEXT, productcost REAL);
CREATE TABLE IF NOT EXISTS _stores (id INTEGER PRIMARY KEY, storename TEXT);
CREATE TABLE IF NOT EXISTS _product_store_relationships (storereference INTEGER, productreference INTEGER);
INSERT INTO _products (productname,productcost) VALUES
('thingummy',25.30),
('Sky Hook',56.90),
('Tartan Paint',100.34),
('Spirit Level Bubbles - Large', 10.43),
('Spirit Level bubbles - Small',7.77)
;
INSERT INTO _stores (storename) VALUES
('Acme'),
('Shops-R-Them'),
('Harrods'),
('X-Mart')
;
The resultant tables being :-
_product_store_relationships would be empty
Placing products into stores (for example) could be done using :-
-- Build some relationships/references/mappings
INSERT INTO _product_store_relationships VALUES
(2,2), -- Sky Hooks are in Shops-R-Them
(2,4), -- Sky Hooks in x-Mart
(1,3), -- thingummys in Harrods
(1,1), -- and Acme
(1,2), -- and Shops-R-Them
(4,4), -- Spirit Level Bubbles Large in X-Mart
(5,4), -- Spiirit Level Bubble Small in X-Mart
(3,3) -- Tartn paint in Harrods
;
The _product_store_relationships would then be :-
A query such as the following would list the products in stores sorted by store and then product :-
SELECT storename, productname, productcost FROM _stores
JOIN _product_store_relationships ON _stores.id = storereference
JOIN _products ON _product_store_relationships.productreference = _products.id
ORDER BY storename, productname
;
The resultant output being :-
This query will only list stores that have a product name that contains an s or S (as like is typically case sensitive) the output being sorted according to productcost in ASCending order, then storename, then productname:-
SELECT storename, productname, productcost FROM _stores
JOIN _product_store_relationships ON _stores.id = storereference
JOIN _products ON _product_store_relationships.productreference = _products.id
WHERE productname LIKE '%s%'
ORDER BY productcost,storename, productname
;
Output :-
Expanding the above to consider states.
2 new tables states and store_state_reference
Although no real need for a reference table (a store would only be in one state unless you consider a chain of stores to be a store, in which case this would also cope)
The SQL could be :-
CREATE TABLE IF NOT EXISTS _states (id INTEGER PRIMARY KEY, statename TEXT);
INSERT INTO _states (statename) VALUES
('Texas'),
('Ohio'),
('Alabama'),
('Queensland'),
('New South Wales')
;
CREATE TABLE IF NOT EXISTS _store_state_references (storereference, statereference);
INSERT INTO _store_state_references VALUES
(1,1),
(2,5),
(3,1),
(4,3)
;
If the following query were run :-
SELECT storename,productname,productcost,statename
FROM _stores
JOIN _store_state_references ON _stores.id = _store_state_references.storereference
JOIN _states ON _store_state_references.statereference =_states.id
JOIN _product_store_relationships ON _stores.id = _product_store_relationships.storereference
JOIN _products ON _product_store_relationships.productreference = _products.id
WHERE statename = 'Texas' AND productname = 'Sky Hook'
;
The output would be :-
Without the WHERE clause :-
make Stores-R-Them have a presence in all states :-
The following would make Stores-R-Them have a presence in all states :-
INSERT INTO _store_state_references VALUES
(2,1),(2,2),(2,3),(2,4)
;
Now the Sky Hook's in Texas results in :-
Note This just covers the basics of the topic.
You will need to create combine mapping table of product, states and stores as tbl_product_states_stores which will store mapping of products, state and store. The columns will be id, product_id, state_id, stores_id.

Cloning a row in a table with indexes, autonumber and default fields

Building a new system on MsSQL from an old MySQL setup, we come to find that several times, a row is clonned in DRAFT version before it gets PUBLISHED, short of data preparation.
This is done in several tables that have autoincrement fields (ID) and defaults (timestamp, and so on)
As today, this is done with something like
Insert into MyTable (Status, CustomerID, CName, Csurname, <all other fields>
SELECT 'DRAFT', CustomerID, CName, Csurname, <all other fields>
FROM MyTable
WHERE CustomerID = 1234
For a table as:
MyTable : ID, timestamp, Status, CustomerID, CName, Csurname, <all other fields>
the problem, each time a new column is added to the table, SPs need to be rewritten
would be any better strategy??
We tough on an function, that given a table name, and the "Key fields" (the where) and the Changed fields with the new values (the Status field in the example), it goes trough the table definition, and build the SQL sentence, looking for all fields except those autonumber (ID) and those that have a default (timestamp).
But before come into such a weird code, I would like some feedback form the community.

Best way to use compound Index to query with multiple combination of query parameters?

I am building a functionality to estimate Inventory for my Ads serve platform.The fields on which I am trying to estimate with their cardinality is as below:
FIELD: CARDINALITY
location: 10000 (bengaluru, chennai etc..)
n/w speed : 6 (w, 4G, 3G, 2G, G, NA)
priceRange : 10 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
users: contains number of users falling under any of the above combination.
Ex. {'location':'bengaluru', 'n/w':'4G', priceRange:8, users: 1000}
means 1000 users are from bengaluru having 4G and priceRange = 8
So total combination can be 10000 * 6 * 10 = 600000 and in future more fields can be added to around 29(currently it is 3 location, n/w, priceRange) and total combination can reach the order of 10mn. Now I want to estimate how many users fall under
Now queries I will need are as follows:
1) find all users who are from location:bengaluru , n/w:3G, priceRange: 6
2) find all users from bengaluru
3) Find all users falling under n/w: 3G and priceRange: 8
What is the best possible way to approach to this?
Which database can be best suited for this requirement.What indexes I need to build. Will compound index help? If yes then How ? Any help is appreciated.
Here's my final answer:
Create table Attribute(
ID int,
Name varchar(50));
Create table AttributeValue(
ID int,
AttributeID int,
Value varchar(50));
Create table userAttributeValue(
userID int,
AttributeID varchar(20),
AttributeValue varchar(50));
Create table User(
ID int);
Insert into user (ID) values (1),(2),(3),(4),(5);
Insert into Attribute (ID,Name) Values (1,'Location'),(2,'nwSpeed'),(3,'PriceRange');
Insert into AttributeValue values
(1,1,'bengaluru'),(2,1,'chennai'),
(3,2, 'w'), (4, 2,'4G'), (5,2,'3G'), (6,2,'2G'), (7,2,'G'), (8,2,'NA'),
(9,3,'1'), (10,3,'2'), (11,3,'3'), (12,3,'4'), (13,3,'5'), (14,3,'6'), (15,3,'7'), (16,3,'8'), (17,3,'9'), (18,3,'10');
Insert into UserAttributeValue (userID, AttributeID, AttributeValue) values
(1,1,1),
(1,2,5),
(1,3,9),
(2,1,1),
(2,2,4),
(3,2,6),
(2,3,13),
(4,1,1),
(4,2,4),
(4,3,13),
(5,1,1),
(5,2,5),
(5,3,13);
Select USERID
from UserAttributeValue
where (AttributeID,AttributeValue) in ((1,1),(2,4))
GROUP BY USERID
having count(distinct concat(AttributeID,AttributeValue))=2
Now if you need a count wrap userID in count and divide by the attributes passed in as each user will have 1 record per attribute and to get the "count of users" you'd need to divide by the number of attributes.
This allows for N growth of Attributes and the AttributeValues per user without changes to UI or database if UI is designed correctly.
By treating each datapoint as an attribute and storing them in once place we can enforce database integrity.
Attribute and AttributeValue tables becomes lookups for UserAttributevalue so you can translate the IDs back to attribute name and the value.
This also means we only have 4 tables user, attribute, attributeValue, and UserAttributeValue.
Technically you don't have to store attributeID on the userAttributeValue, but for performance reasons on later joins/reporting I think you'll find it beneficial.
You need to add proper Primary Key's, Foreign keys, and indexes to the tables. They should be fairly self explanatory. On UserAttributeValue I would have a few Composite indexes each with a different order of the unique key. Just depends on the type of reporting/analysis you'll be doing but adding keys as performance tuning is needed is commonplace.
Assumptions:
You're ok with all datavalues being varchar data in all cases.
If needed you could add a datatype, precision, and scale on the attribute table and allow the UI to cast the attribute value as needed. but since they are all in the same field in the database they all have to be the same datatype. and of the same precision/scale.
Pivot tables to display the data across will likely be needed and you know how to handle those (and engine supports them!)
Gotta say I loved the metal exercise; but still would appreciate feedback from others on SO. I've used this approach in 1 systems I've developed and it's been in two I've supported. There are some challenges but it does follow 3rd normal form db design (except for the replicated attributeID in userAttributevalue but that's there for performance gain in reporting/filtering.

Resources