I'm trying to move an RDBMS model over to Cassandra, and I'm having a hard time creating the schema. Here is my data model:
CREATE TABLE Domain (
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
DomainName NVARCHAR(74) NOT NULL,
HasBadWords BIT,
...
);
INSERT INTO Domain (DomainName, HasBadWords) VALUES ('domain1.com', 0);
INSERT INTO Domain (DomainName, HasBadWords) VALUES ('domain2.com', 0);
CREATE TABLE ZoneFile (
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
DomainID INT NOT NULL,
Available BIT NOT NULL,
Nameservers NVARCHAR(MAX),
Timestamp DATETIME NOT NULL
);
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (1, 0, 'ns1', '2010-01-01');
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (2, 0, 'ns1', '2010-01-01');
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (1, 1, 'ns2', '2011-01-01');
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (2, 1, 'ns2', '2011-01-01');
CREATE TABLE Backlinks (
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
DomainID INT NOT NULL,
Backlinks INT NOT NULL,
Indexed INT NOT NULL,
Timestamp DATETIME NOT NULL
);
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (1, 100, 200, '2010-01-01');
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (2, 300, 600, '2010-01-01');
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (1, 500, 1000, '2010-01-01');
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (2, 600, 1200, '2010-01-01');
From this, I've deduced that I can probably have one keyspace: DomainData. In this keyspace, I can have a column family called "Domain", which is like my Domain table in SQL:
"Domain" : { //ColumnFamily
"domain1.com" : { "HasBadWords" : 0 }, //SuperColumn
"domain2.com" : { "HasBadWords" : 0 } //SuperColumn
}
The next tables are where I start getting confused. ZoneFile and Backlinks are essentially supposed to store a history of results from looking up these values for each domain. So, one Domain to many ZoneFile records. For querying purposes, I want to be able to easily get the 'newest' ZoneFile record for a given Domain. I will need to do the same for Backlinks.
I was considering something like this, and doing a range lookup on the key for the domain, and then getting the 'last' record which should be the newest timestamp...
"ZoneFiles" : { //ColumnFamily
"domain1.com:2010-01-01 12:00:00.000" : { "Available" : 0, "Nameservers" : "ns1" }, //SuperColumn
"domain1.com:2011-01-01 12:00:00.000" : { "Available" : 1, "Nameservers" : "ns2" }, //SuperColumn
"domain2.com:2010-01-01 12:00:00.000" : { "Available" : 0, "Nameservers" : "ns1" }, //SuperColumn
"domain2.com:2011-01-01 12:00:00.000" : { "Available" : 1, "Nameservers" : "ns2" } //SuperColumn
}
I'm not convinced this is the right answer; combining a string domain and a string datetime in a key feels wrong. Could someone point me in the right direction?
EDIT:
Assuming I use:
"ZoneFiles" : {
"domain1.com" : {
timestamp1 : "{\"available\":1,\"nameservers\":\"ns1\"}",
timestamp2 : "{\"available\":1,\"nameservers\":\"ns1\"}",
}
}
How would I query a list of domain rows where the newest timestamp is older than a given date?
If I understand your question correctly, the only query you want to do on this model is "please get me the latest zonefile or backlinks for a given domain"?
If that's the case, I would store the latest values for these in the "Domain" column family, under the domain's row key, in separate columns. I would also store when this latest value was updated (the timestamp). Every time you get new values for the info in zonefile and backlinks, I would just overwrite the value in the "Domain" column family and update the timestamp.
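In present-day CQL terms, that "keep the latest values on the domain row" idea might look roughly like this (the table and column names below are my own sketch, not part of the original model):

CREATE TABLE domain (
    domain_name   text PRIMARY KEY,
    has_bad_words boolean,
    available     boolean,    -- latest zonefile result
    nameservers   text,
    backlinks     int,        -- latest backlinks result
    indexed       int,
    last_updated  timestamp   -- when the latest values were written
);

-- Each new lookup simply overwrites the columns and bumps last_updated:
UPDATE domain
SET available = true, nameservers = 'ns2', last_updated = '2011-01-01'
WHERE domain_name = 'domain1.com';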
I assume you are also keeping this historical data so you can query it, and I assume the kind of query will be "show me all the updates for a given domain between two times" (is this correct?). If so, I wouldn't manually construct a composite row key like that, since it will require you to use the Order Preserving Partitioner to get the correct results from get_range_slices. And as you probably know, load balancing with the OPP can be a difficult task.
Instead, I would have the row key be the domain id, and the column key be the timestamp of the update. Then you can either pack your updates into a single value (eg using json), use super columns, or use the new composite keys in 0.8. If done like this, you can use a get_slice to satisfy your query, and it will behave correctly with the Random Partitioner, making load balancing much easier.
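In present-day CQL, that history layout maps to a table partitioned by domain and clustered by update time (again, a sketch with names of my own choosing):

CREATE TABLE zonefile_history (
    domain      text,
    updated_at  timestamp,
    available   boolean,
    nameservers text,
    PRIMARY KEY (domain, updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);

-- Newest zonefile record for a given domain:
SELECT * FROM zonefile_history WHERE domain = 'domain1.com' LIMIT 1;

-- All updates for a domain between two times:
SELECT * FROM zonefile_history
WHERE domain = 'domain1.com'
  AND updated_at >= '2010-01-01' AND updated_at < '2011-06-01';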
Tom Wilkie | Acunu | www.acunu.com | #tom_wilkie
Reply to comment: "How would I query a list of domains whose most recent zonefile timestamp is older than a given timestamp?"
You could do that by inserting into another column family:
row key: day (or hour, or some other reasonable 'bucketing')
column key: timestamp of update
value: domain
...every time you update the zonefile. Then, to get the most recently updated domains since t, do:
result = []
for i in day(t) ... day(now):
result.extend(get_slice(i, range(t, '')))
This would require you to remove repeat entries from result, so it works best when t is fairly recent. You also have to consider load balancing for the writes: this scheme focuses all the write load on a single server (since, at any one time, you are inserting into only one row).
If these trade-offs aren't acceptable, then you could look at the Hadoop integrations and use those to perform this query. Or you could make other trade-offs (use the OPP, or do a read before each write to remove the duplicates, which would be very slow).
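For reference, a rough CQL sketch of the day-bucketed index described above (table and column names are mine; the duplicate-entry and hot-row caveats still apply):

CREATE TABLE zonefile_updates_by_day (
    day        text,        -- bucket key, e.g. '2011-01-01'
    updated_at timestamp,
    domain     text,
    PRIMARY KEY (day, updated_at, domain)
) WITH CLUSTERING ORDER BY (updated_at ASC, domain ASC);

-- Domains updated since t within one bucket; repeat for each day from day(t) to today:
SELECT domain FROM zonefile_updates_by_day
WHERE day = '2011-01-01' AND updated_at >= '2011-01-01 12:00:00';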
It looks like the optimizer is doing a table scan instead of an index scan if I select columns that are not part of the index!
I created the following table and index
CREATE TABLE customer_lastseen_products (
customer_ref_value STRING(50) NOT NULL,
customer_ref_type STRING(20) NOT NULL,
sku_config STRING(40) NOT NULL,
mp_code STRING(20) NOT NULL,
is_added_to_cart BOOL NOT NULL,
is_purchased BOOL NOT NULL,
last_visit_time TIMESTAMP OPTIONS (allow_commit_timestamp = true)
) PRIMARY KEY (customer_ref_value, sku_config),
ROW DELETION POLICY (OLDER_THAN(last_visit_time, INTERVAL 30 DAY))
and index
CREATE INDEX customercodeIndex3 ON customer_lastseen_products(customer_ref_value, customer_ref_type, mp_code, last_visit_time DESC);
But this query is doing a full table scan:
SELECT
sku_config , is_added_to_cart, is_purchased
FROM customer_lastseen_products
WHERE(customer_ref_value, customer_ref_type) in (('0f2e9ed9-2d5e-4c78-b03f-0c6dd3f65598', 'customer_code'), ('', 'visitor_id'))
AND mp_code = "mp"
AND last_visit_time between '2020-10-03T12:35:59' and '2022-10-03T12:35:59'
order by last_visit_time desc
According to internal documentation, it is encouraged to use the FORCE_INDEX directive in order to make query performance more consistent by specifying the index you would like the query to use. To add to that, I have observed that Cloud Spanner's query optimizer may take up to 3 days to start using an index after its creation, as it requires time to collect the database's statistics.
Please find the documentation article here.
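For illustration, forcing the index from the question would look something like this (note that the selected columns are not stored in the index, so Spanner still joins back to the base table for them):

SELECT sku_config, is_added_to_cart, is_purchased
FROM customer_lastseen_products@{FORCE_INDEX=customercodeIndex3}
WHERE (customer_ref_value, customer_ref_type) IN (('0f2e9ed9-2d5e-4c78-b03f-0c6dd3f65598', 'customer_code'), ('', 'visitor_id'))
  AND mp_code = "mp"
  AND last_visit_time BETWEEN '2020-10-03T12:35:59' AND '2022-10-03T12:35:59'
ORDER BY last_visit_time DESC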
I want to use the key-value pair feature of Cassandra. Until now I have been using Kyotocabinet, but it does not support multiple writers, and hence I want to use Cassandra for versioning my tabular data.
Roll No, Name, Age, Sex
14BCE1008, Aviral, 22, Male
14BCE1007, Shantanu, 22, Male
The above data is tabular (CSV). It's version 1.
Next is version 2:
Roll No, Name, Age, Sex
14BCE1008, Aviral, 22, Male
14BCE1007, Shantanu, 22, Male
14BCE1209, Piyush, 22, Male
Hence, I would call the above version 2, with the following diff:
insert_patch: 14BCE1209 as key(PK) and 14BCE1209, Piyush, 22, Male as value.
I am familiar with the creation of the table but unable to figure out the versioning part.
You can have multiple versions of data in your table if you use a composite primary key instead of a primary key consisting of one field.
So the table definition could look like the following (if you "know" the version number prior to inserting the data):
create table test(
id text,
version int,
payload text,
primary key (id, version)
) with clustering order by (version desc);
and inserting data as:
insert into test (id, version, payload) values ('14BCE1209', 1, '....');
insert into test (id, version, payload) values ('14BCE1209', 2, '....');
To select the latest value for a given key you can use LIMIT 1:
SELECT * from test where id = '14BCE1209' LIMIT 1;
and to select the latest versions for all partitions (not recommended, just as an example - it needs a special approach for efficient processing):
SELECT * from test PER PARTITION LIMIT 1;
But this will work only when you know the version in advance. If you don't, then you can use the timeuuid type for the version instead of int:
create table test(
id text,
version timeuuid,
payload text,
primary key (id, version)
) with clustering order by (version desc);
and inserting data as follows (instead of now() you can use a timeuuid generated from the current timestamp in your code):
insert into test (id, version, payload) values ('14BCE1209', now(), '....');
insert into test (id, version, payload) values ('14BCE1209', now(), '....');
and the selects will work the same as above.
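For example, to read the latest version back with a human-readable time, you can convert the timeuuid (toTimestamp is available from Cassandra 2.2 on; older versions use dateOf):

SELECT id, toTimestamp(version) AS version_time, payload
FROM test
WHERE id = '14BCE1209'
LIMIT 1;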
I am building functionality to estimate inventory for my ad-serving platform. The fields on which I am trying to estimate, with their cardinalities, are as below:
FIELD: CARDINALITY
location: 10000 (bengaluru, chennai etc..)
n/w speed : 6 (w, 4G, 3G, 2G, G, NA)
priceRange : 10 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
users: contains number of users falling under any of the above combination.
Ex. {'location':'bengaluru', 'n/w':'4G', priceRange:8, users: 1000}
means 1000 users are from bengaluru having 4G and priceRange = 8
So the total number of combinations can be 10000 * 6 * 10 = 600,000, and in the future more fields can be added, up to around 29 (currently there are 3: location, n/w, priceRange), so the total number of combinations can reach the order of 10 million. Now I want to estimate how many users fall under any given combination.
Now queries I will need are as follows:
1) find all users who are from location:bengaluru , n/w:3G, priceRange: 6
2) find all users from bengaluru
3) Find all users falling under n/w: 3G and priceRange: 8
What is the best possible way to approach to this?
Which database is best suited for this requirement? What indexes do I need to build? Will a compound index help? If yes, then how? Any help is appreciated.
Here's my final answer:
Create table Attribute(
ID int,
Name varchar(50));
Create table AttributeValue(
ID int,
AttributeID int,
Value varchar(50));
Create table userAttributeValue(
userID int,
AttributeID int,
AttributeValue varchar(50));
Create table User(
ID int);
Insert into user (ID) values (1),(2),(3),(4),(5);
Insert into Attribute (ID,Name) Values (1,'Location'),(2,'nwSpeed'),(3,'PriceRange');
Insert into AttributeValue values
(1,1,'bengaluru'),(2,1,'chennai'),
(3,2, 'w'), (4, 2,'4G'), (5,2,'3G'), (6,2,'2G'), (7,2,'G'), (8,2,'NA'),
(9,3,'1'), (10,3,'2'), (11,3,'3'), (12,3,'4'), (13,3,'5'), (14,3,'6'), (15,3,'7'), (16,3,'8'), (17,3,'9'), (18,3,'10');
Insert into UserAttributeValue (userID, AttributeID, AttributeValue) values
(1,1,1),
(1,2,5),
(1,3,9),
(2,1,1),
(2,2,4),
(3,2,6),
(2,3,13),
(4,1,1),
(4,2,4),
(4,3,13),
(5,1,1),
(5,2,5),
(5,3,13);
Select USERID
from UserAttributeValue
where (AttributeID,AttributeValue) in ((1,1),(2,4))
GROUP BY USERID
having count(distinct concat(AttributeID,AttributeValue))=2
Now if you need a count, wrap userID in COUNT and divide by the number of attributes passed in: each user will have one record per attribute, so to get the "count of users" you need to divide by the number of attributes.
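As an alternative to dividing by the attribute count, a sketch that wraps the grouped query in a subquery and counts the matching users directly:

Select count(*) as user_count
from (
    Select USERID
    from UserAttributeValue
    where (AttributeID, AttributeValue) in ((1,1),(2,4))
    GROUP BY USERID
    having count(distinct concat(AttributeID, AttributeValue)) = 2
) matched_users;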
This allows the number of Attributes and AttributeValues per user to grow without changes to the UI or database, provided the UI is designed correctly.
By treating each data point as an attribute and storing them all in one place, we can enforce database integrity.
The Attribute and AttributeValue tables become lookups for UserAttributeValue, so you can translate the IDs back to the attribute name and value.
This also means we only have four tables: User, Attribute, AttributeValue, and UserAttributeValue.
Technically you don't have to store AttributeID on UserAttributeValue, but for performance reasons on later joins/reporting I think you'll find it beneficial.
You need to add proper primary keys, foreign keys, and indexes to the tables; they should be fairly self-explanatory. On UserAttributeValue I would have a few composite indexes, each with a different order of the unique key. It depends on the type of reporting/analysis you'll be doing, but adding keys as performance tuning is needed is commonplace.
Assumptions:
You're OK with all data values being varchar data in all cases.
If needed, you could add a data type, precision, and scale on the Attribute table and allow the UI to cast the attribute value as needed; but since they are all stored in the same field in the database, they all have to be the same data type, and of the same precision/scale.
Pivoting the data across columns for display will likely be needed, and I assume you know how to handle that (and that your engine supports it!) - see the sketch after this list.
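A conditional-aggregation pivot over the example schema could look like the following sketch (note that in the sample data UserAttributeValue.AttributeValue holds AttributeValue IDs, hence the second join):

Select uav.userID,
       max(case when a.Name = 'Location'   then av.Value end) as Location,
       max(case when a.Name = 'nwSpeed'    then av.Value end) as nwSpeed,
       max(case when a.Name = 'PriceRange' then av.Value end) as PriceRange
from UserAttributeValue uav
join Attribute a       on a.ID = uav.AttributeID
join AttributeValue av on av.ID = uav.AttributeValue
group by uav.userID;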
Gotta say I loved the mental exercise, but I would still appreciate feedback from others on SO. I've used this approach in one system I've developed, and it's been in two I've supported. There are some challenges, but it does follow third normal form DB design (except for the replicated AttributeID in UserAttributeValue, which is there for a performance gain in reporting/filtering).
Let's say that we have to store information of different types of product in a database. However, these products have different specifications. For example:
Phone: cpu, ram, storage...
TV: size, resolution...
We want to store each specification in a column of a table, and all the products (whatever the type) must have a different ID.
To comply with that, I currently have one general table named Products (with an auto-increment ID) and one subordinate table for each type of product (ProductsPhones, ProductsTV...) holding the specifications and linked to the main table with a foreign key.
I find this solution inefficient, since the Products table has only one column (the auto-incremented ID).
I would like to know if there is a better approach to solve this problem using relational databases.
The short answer is no. The relational model is a first-order logical model, meaning predicates can vary over entities but not over other predicates. That means dependent types and EAV models aren't supported.
EAV models are possible in SQL databases, but they don't qualify as relational since the domain of the value field in an EAV row depends on the value of the attribute field (and sometimes on the value of the entity field as well). Practically, EAV models tend to be inefficient to query and maintain.
PostgreSQL supports shared sequences, which allow you to ensure unique auto-incremented IDs without a common supertype table. However, the supertype table may still be a good idea for FK constraints.
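A minimal PostgreSQL sketch of the shared-sequence approach (the subtype columns below are just placeholders taken from the question):

CREATE SEQUENCE product_id_seq;

CREATE TABLE ProductsPhones (
    id      bigint PRIMARY KEY DEFAULT nextval('product_id_seq'),
    cpu     text,
    ram     text,
    storage text
);

CREATE TABLE ProductsTV (
    id         bigint PRIMARY KEY DEFAULT nextval('product_id_seq'),
    size       text,
    resolution text
);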
You may find some use for your Products table later to hold common attributes like Type, Serial number, Cost, Warranty duration, Number in stock, Warehouse, Supplier, etc...
Having a Products table is fine. You can put all the columns common across all types there, like product name, description, cost, and price, just to name some. So it's not just an auto-increment ID. Having an internal ID of type int or bigint as the primary key is recommended. You may also add another field, "code" or whatever you want to call it, for a user-entered or user-friendly identifier, which is common with product management systems. Make sure you index it if it is used in searching or query criteria.
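As a rough sketch of what that shared table could look like (column names and types here are illustrative only):

CREATE TABLE Products (
    ID          INT AUTO_INCREMENT PRIMARY KEY,  -- internal ID
    Code        VARCHAR(50),                     -- user-entered / user-friendly code
    Name        VARCHAR(100),
    Description TEXT,
    Cost        DECIMAL(10,2),
    Price       DECIMAL(10,2)
);

-- index the code if it is used in searches or query criteria
CREATE INDEX idx_products_code ON Products (Code);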
HTH
While this can't be done completely relationally, you can still normalize your tables some and make it a little easier to code around.
You can have these tables:
-- what are the products?
Products (Id, ProductTypeId, Name)
-- what kind of product is it?
ProductTypes (Id, Name)
-- what attributes can a product have?
Attributes (Id, Name, ValueType)
-- what are the attributes that come with a specific product type?
ProductTypeAttributes (Id, ProductTypeId, AttributeId)
-- what are the values of the attributes for each product?
ProductAttributes (ProductId, ProductTypeAttributeId, Value)
So for a Phone and TV:
ProductTypes (1, Phone) -- a phone type of product
ProductTypes (2, TV) -- a tv type of product
Attributes (1, ScreenSize, integer) -- how big is the screen
Attributes (2, Has4G, boolean) -- does it get 4g?
Attributes (3, HasCoaxInput, boolean) -- does it have an input for coaxial cable?
ProductTypeAttributes (1, 1, 1) -- a phone has a screen size
ProductTypeAttributes (2, 1, 2) -- a phone can have 4g
-- a phone does not have coaxial input
ProductTypeAttributes (3, 2, 1) -- a tv has a screen size
ProductTypeAttributes (4, 2, 3) -- a tv can have coaxial input
-- a tv does not have 4g (simple example)
Products (1, 1, CoolPhone) -- product 1 is a phone called coolphone
Products (2, 1, AwesomePhone) -- prod 2 is a phone called awesomephone
Products (3, 2, CoolTV) -- prod 3 is a tv called cooltv
Products (4, 2, AwesomeTV) -- prod 4 is a tv called awesometv
ProductAttributes (1, 1, 6) -- coolphone has a 6 inch screen
ProductAttributes (1, 2, True) -- coolphone has 4g
ProductAttributes (2, 1, 4) -- awesomephone has a 4 inch screen
ProductAttributes (2, 2, False) -- awesomephone has NO 4g
ProductAttributes (3, 3, 70) -- cooltv has a 70 inch screen
ProductAttributes (3, 4, True) -- cooltv has coax input
ProductAttributes (4, 3, 19) -- awesometv has a 19 inch screen
ProductAttributes (4, 4, False) -- awesometv has NO coax input
The reason this is not fully relational is that you'll still need to evaluate the value type (bool, int, etc) of the attribute before you can use it in a meaningful way in your code.
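For example, a query to pull back one product with its attribute names and values (a sketch against the tables above; the value-type caveat still applies):

SELECT p.Name AS product,
       a.Name AS attribute,
       pa.Value
FROM Products p
JOIN ProductAttributes pa      ON pa.ProductId = p.Id
JOIN ProductTypeAttributes pta ON pta.Id = pa.ProductTypeAttributeId
JOIN Attributes a              ON a.Id = pta.AttributeId
WHERE p.Id = 1;  -- CoolPhone: ScreenSize 6, Has4G True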
I have several different tables in my database and I'm trying to use Sphinx to do fast full-text searches. For ease of discussion, let's say the main records of interest are packing slips, one of which is included when an order ships. How do I use Sphinx to execute complex queries across all of these tables without completely denormalizing the database?
Each packing slip lists the order number, shipper, recipient, and the tracking number of each box included with the shipment. A separate table contains information about the order items. An additional table contains the customer address information. So, orders contain boxes and boxes contain items. (Example schema listed at the bottom of this question).
I would like to be able to query Sphinx to get answers to questions like:
How many people who live on a street named "Maple" ordered an item with "large" in the description?
Which orders include the word "blue" in either the box description or the order items' descriptions?
To answer these types of questions, I need to refer to several tables. Since Sphinx doesn't have JOINs, one option is to denormalize the database. Denormalizing using a view, so that each row represents an order item plus all of the data of its parent box and order, would result in billions of very wide rows. So I've been creating a separate index for each table instead. But that doesn't allow me to query across tables as a SQL JOIN would. Is there another solution?
Example database
CREATE TABLE orders (
id integer PRIMARY KEY,
date_ordered date,
customer_po varchar
);
INSERT INTO orders VALUES (1, '2012-12-13', NULL);
INSERT INTO orders VALUES (2, '2012-12-14', 'DF312442');
CREATE TABLE parties (
id integer PRIMARY KEY,
order_id integer NOT NULL REFERENCES orders(id),
party_type varchar,
company varchar,
city varchar,
state char(2)
);
INSERT INTO parties VALUES (1, 1, 'shipper', 'ACME, Inc.', 'New York', 'NY');
INSERT INTO parties VALUES (2, 1, 'recipient', 'Wylie Coyote Corp.', 'Flagstaff', 'AZ');
INSERT INTO parties VALUES (3, 2, 'shipper', 'Cyberdyne', 'Las Vegas', 'NV');
-- Please disregard the fact that this design permits multiple shippers and multiple recipients
-- per order. This is a vastly simplified version of the system I'm working on.
CREATE TABLE boxes (
id integer PRIMARY KEY,
order_id integer NOT NULL REFERENCES orders(id),
tracking_num varchar NOT NULL,
description varchar NOT NULL
);
INSERT INTO boxes VALUES (1, 1, '1234567890', 'household goods');
INSERT INTO boxes VALUES (2, 1, '0987654321', 'kitchen appliances');
INSERT INTO boxes VALUES (3, 2, 'ABCDE12345', 'audio equipment');
CREATE TABLE box_contents (
id integer PRIMARY KEY,
order_id integer NOT NULL REFERENCES orders(id),
box integer NOT NULL REFERENCES boxes(id),
qty_units integer,
description varchar
);
INSERT INTO box_contents VALUES (1, 1, 1, 4, 'cookbook');
INSERT INTO box_contents VALUES (2, 1, 1, 2, 'baby bottle');
INSERT INTO box_contents VALUES (3, 1, 2, 1, 'television');
INSERT INTO box_contents VALUES (4, 2, 3, 2, 'lamp');
You put the JOIN in the sql_query that builds the index. The tables remain normalized, but you denormalize when building the index.
It's only a basic example, but your query would be something like:
sql_query = SELECT o.id,customer_po,UNIX_TIMESTAMP(date_ordered) AS date_ordered, \
GROUP_CONCAT(DISTINCT party_type) AS party_type, \
GROUP_CONCAT(DISTINCT company) AS company, \
GROUP_CONCAT(DISTINCT city) AS city, \
GROUP_CONCAT(DISTINCT description) AS description \
FROM orders o \
INNER JOIN parties p ON (o.id = p.order_id) \
INNER JOIN box_contents b ON (o.id = b.order_id) \
GROUP BY o.id \
ORDER BY NULL
Update: alternatively, you can use sql_joined_field to do the same but avoid actual joins in sql_query; Sphinx then does the join process for you.
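If you go the sql_joined_field route, the config could look something like this sketch (field names are mine; check the Sphinx docs for your version - each joined query must return document IDs in ascending order):

sql_query = SELECT id, customer_po, UNIX_TIMESTAMP(date_ordered) AS date_ordered FROM orders

sql_joined_field = party_text from query; \
    SELECT order_id, CONCAT_WS(' ', party_type, company, city) FROM parties ORDER BY order_id ASC

sql_joined_field = contents_text from query; \
    SELECT order_id, description FROM box_contents ORDER BY order_id ASC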