Quite new to Cassandra here. As the title says, I am trying to create a table with the following schema for a chat use case:
chat_id - uuid primary key
sender - text
topic - text
content - map<text,text>
timestamp - timestamp
status_id - int
These are the queries I would be using:
INSERT .....
Filter data by topic - SELECT
Update content based on topic - UPDATE
I have tried defining the primary keys in different ways, but I was not able to execute these queries, and so I ended up with several different CREATE TABLE attempts.
I would like to know the right CREATE TABLE statement that supports these queries.
Thanks
...ended up with several different CREATE TABLE attempts...
What did you end up with?
The main thing to understand here is that you need to figure out the access patterns in advance, primarily the SELECT queries with their equality & range predicates, so that you can design the data model around them.
Filter data by topic - SELECT
needs to be elaborated. See this data modeling by example course & the CQL data modeling concepts to understand this in depth.
More here:
Understanding the primary key
Deep look into the WHERE clause
Cassandra Fundamentals
Without knowing the in-depth access patterns required in your application, here is a brute-force table schema that satisfies the queries below:
CREATE TABLE IF NOT EXISTS <<your_keyspace_name>>.chat_app_tbl (
sender TEXT,
topic TEXT,
chat_id UUID,
status_id INT,
col_timestamp TIMESTAMP,
content MAP<TEXT,TEXT>,
PRIMARY KEY((topic,sender),chat_id,col_timestamp)
) WITH CLUSTERING ORDER BY (chat_id DESC, col_timestamp DESC);
It can satisfy queries such as the following:
Inserting data:
INSERT INTO chat_app_tbl (sender, topic, chat_id, status_id, col_timestamp, content) VALUES ('sender', 'sports', now(), 0, toTimestamp(now()), {'cricket': 'India'});
Querying data based on a given topic and sender:
token@cqlsh:curatedns> select * From chat_app_tbl where topic = 'sports' and sender = 'sender';
topic | sender | chat_id | col_timestamp | content | status_id
--------+--------+--------------------------------------+---------------------------------+----------------------+-----------
sports | sender | 3f72b6d0-ad43-11ed-be31-7952b3260293 | 2023-02-15 15:13:05.725000+0000 | {'cricket': 'India'} | 0
(1 rows)
You will not be able to update the data without providing all the other primary key (partition + clustering) columns; you will get the following error if you try to do so:
token@cqlsh:curatedns> UPDATE chat_app_tbl SET status_id = 1 WHERE topic = 'sports' AND sender = 'sender';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Some clustering keys are missing: chat_id, col_timestamp"
You can, however, update the data using queries such as:
UPDATE chat_app_tbl SET status_id = 5 WHERE topic = 'sports' AND sender = 'sender' AND chat_id = 3f72b6d0-ad43-11ed-be31-7952b3260293 AND col_timestamp = '2023-02-15 15:13:05.725';
which would give us,
token@cqlsh:curatedns> select * From chat_app_tbl where topic = 'sports' and sender = 'sender';
topic | sender | chat_id | col_timestamp | content | status_id
--------+--------+--------------------------------------+---------------------------------+----------------------+-----------
sports | sender | 3f72b6d0-ad43-11ed-be31-7952b3260293 | 2023-02-15 15:13:05.725000+0000 | {'cricket': 'India'} | 5
The statements below, on the other hand, will insert a new row by performing an upsert, because the clustering column values do not exactly match an existing row:
UPDATE chat_app_tbl SET status_id = 1 WHERE topic = 'sports' AND sender = 'sender' AND chat_id = 3f72b6d0-ad43-11ed-be31-7952b3260293 AND col_timestamp = '2023-02-15T15:13:05'; <<<-- remember this would add a new row by performing an upsert
(or)
UPDATE chat_app_tbl SET status_id = 1 WHERE topic = 'sports' AND sender = 'sender' AND chat_id = 3f72b6d0-ad43-11ed-be31-7952b3260293 AND col_timestamp = '2023-02-15'; <<<-- remember this would add a new row by performing an upsert
Is the above a good data model? The answer is "it depends". Why?
Having just the topic as the partition key would definitely be a bad idea, because for a given topic (i.e. sports in our example) there could be a huge variety of discussions and you'll end up with a fat/huge partition. See this blog or this learning exercise to understand how partitioning works under the covers.
The general guideline for a partition in Cassandra is to keep it under roughly 100MB for scale & throughput.
Even the current (topic, sender) partition key may not be sufficient to keep the partition size under that limit in some cases.
You could experiment with different partitioning strategies, such as:
PRIMARY KEY((topic, sender), chat_id, col_timestamp) (or)
PRIMARY KEY((topic, sender), chat_id) (or)
etc.,
to find out which would suit your case and offer the required scale. There is a great tool called NoSQLBench with which you could test/benchmark your table & queries, emulating real-world application workloads against your data model. More details here.
Data modelling in Cassandra means that you need to design a table for each application query so that reads are optimised.
If your app needs to retrieve conversations by topic then the table needs to be partitioned by topic. Additionally, given it is a chat application, I imagine you would want to sort the messages in reverse chronological order based on when they were posted, so the table would look something like this:
CREATE TABLE messages_by_topic (
topic text,
posted_tstamp timestamp,
sender text,
message text,
PRIMARY KEY (topic, posted_tstamp)
) WITH CLUSTERING ORDER BY (posted_tstamp DESC)
To retrieve the last 20 messages for a topic:
SELECT sender, message, posted_tstamp
FROM messages_by_topic
WHERE topic = ?
LIMIT 20
To post a new message to a topic:
INSERT INTO messages_by_topic (topic, posted_tstamp, sender, message)
VALUES (?, ?, ?, ?)
In your case using the chat ID as the partition key is not going to be helpful since your app does not query based on it. Cheers!
Originally I had a cassandra table like this:
CREATE TABLE table (
open_time timestamp,
open double,
close double,
high double,
low double,
volume bigint,
PRIMARY KEY(open_time));
open_time | close | high | low | open | volume
---------------------------------+--------+--------+-------+--------+--------
2020-08-05 06:00:00.000000+0000 | 181.53 | 184.32 | 181.1 | 184.32 | 100
2020-08-04 06:00:00.000000+0000 | 181.53 | 184.32 | 181.1 | 184.32 | 100
I need to perform a query to get the latest open_time. After noticing that queries like
SELECT open_time FROM table ORDER BY open_time DESC LIMIT 1;
are not allowed, I wonder what's the best practice here.
My idea is to add an id column so that I can use open_time as the clustering column. Something like:
CREATE TABLE table (
id int,
open_time timestamp,
open double,
close double,
high double,
low double,
volume bigint,
PRIMARY KEY(id, open_time)
)
WITH CLUSTERING ORDER BY (open_time DESC);
Is this a valid solution to get the job done, or are there better ways, e.g. something without an extra id column, since I would never query by the id itself?
Most queries would be something like:
SELECT * FROM table WHERE open_time >= '2013-01-01 00:00:00+0200' AND open_time <= '2013-08-13 23:59:00+0200';
Thanks!
CLUSTERING ORDER enforces the on-disk sort order within each partition. So ordering by the same key that you're partitioning on isn't possible. Partitioning by id will face a similar challenge, in that the CLUSTERING ORDER BY open_time will only be enforced within each id.
I wonder what's the best practice here.
Models like these are usually solved by time bucketing, as I mentioned in an answer to a similar question earlier today. To select the best "bucket," you'll need to understand your business case like number of entries per day, as well as the query requirements.
For the sake of example, let's say that month would work the best. If each row contained a value of 'YEAR-MONTH', the PK definition would look like this:
PRIMARY KEY (month_bucket,open_time))
WITH CLUSTERING ORDER BY (open_time DESC);
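For completeness, a full table definition under that bucketing scheme might look like this (the table name is illustrative, and month_bucket holds 'YEAR-MONTH' strings):
CREATE TABLE trades_by_month (
month_bucket text, -- e.g. '2020-08'
open_time timestamp,
open double,
close double,
high double,
low double,
volume bigint,
PRIMARY KEY (month_bucket, open_time)
) WITH CLUSTERING ORDER BY (open_time DESC);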
Then, you could support a query like this:
SELECT * FROM table
WHERE month_bucket = '2013-08'
AND open_time >= '2013-08-01 00:00:00+0200' AND open_time <= '2013-08-13 23:59:00+0200';
Likewise, querying the most recent entry would only require the most recent (current?) month as a parameter:
SELECT * FROM table
WHERE month_bucket = '2020-08'
LIMIT 1;
As the results are stored within each month_bucket sorted by open_time in descending order, that query would return the most-recent entry.
I wrote an article on this for DataStax (several years ago) which is relevant to this problem. It's been moved to a new part of their site, which hosed the formatting, but the content is definitely there. Give it a read; hope it helps: We Shall Have Order!
If id is part of the primary key, it must be included in the WHERE clause; otherwise the query would need ALLOW FILTERING.
You can try querying with "SELECT max(open_time) ....", or you can use id as above, incremented with every record, so that the row with the highest id always holds the latest record.
I'm currently designing my tables. I have three types of user: pyd, ppp and ppk. Which is better: inserting the data in one row or in multiple rows?
Which is better?
or
Or any suggestion? Thanks!
I would go for 3 tables:
user_type
typeID | typeDescription
Main_table
id_main_table | id_user | id_type
table_bhg_i
id_bhg_i | id_main_table | data1 | data2 | data3
Although I see you are inserting IDs for each user, I don't quite understand how you are going to differentiate between the users. Had I designed this DB, I would have gone for tables like:
tableName: UserTypes
This table would contain two fields: the first would be the ID and the second would be the type of user, like
UsertypeID | UserType
The UserTypeID is a primary key and can be auto-increment, while UserType would be your user types pyd, ppk and so on. Designing it this way gives you the flexibility to add data later without changing the schema of the table.
Next you can add a table for the users of each type. This table would reference the UserTypeID of the previous table, which makes adding a new user easy and removes redundancy.
tableName: Users
This table would contain the user's ID and name, plus the UserTypeID referencing the previous table:
UserId | UserName | UserTypeID
The next thing you can do is make a table to hold the data; let the table be called DataTable.
tableName: DataTable
This table will contain the data of the users and references them easily:
DataTabID | DataFields (can be any number) | UserID (references Users table)
These tables would be more than sufficient. If you have doubts, ask me in the chat.
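A minimal DDL sketch of those three tables (column types, names and the number of data fields are my assumptions):
CREATE TABLE UserTypes (
UserTypeID INT PRIMARY KEY, -- auto-increment in practice
UserType VARCHAR(10) -- 'pyd', 'ppp', 'ppk', ...
);

CREATE TABLE Users (
UserID INT PRIMARY KEY,
UserName VARCHAR(50),
UserTypeID INT REFERENCES UserTypes(UserTypeID)
);

CREATE TABLE DataTable (
DataTabID INT PRIMARY KEY,
UserID INT REFERENCES Users(UserID),
Data1 VARCHAR(100), -- any number of data fields
Data2 VARCHAR(100),
Data3 VARCHAR(100)
);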
I am trying to learn Cassandra and always find the best way is to start with creating a very simple and small application. Hence I am creating a basic messaging application which will use Cassandra as the back-end. I would like to do the following:
User will create an account with a username, email, and password. The
email and the password can be changed at anytime.
The user can add another user as their contact. The user would add a
contact by searching their username or email. The contacts don't need
to be mutual meaning if I add a user they are my contact, I don't
need to wait for them to accept/approve anything like in Facebook.
A message is sent from one user to another user. The sender needs to
be able to see the messages they sent (ordered by time) and the
messages which were sent to them (ordered by time). When a user opens
the app I need to check the database for any new messages for that
user. I also need to mark if the message has been read.
As I come from the world of relational databases my relational database would look something like this:
UsersTable
username (text)
email (text)
password (text)
time_created (timestamp)
last_loggedIn (timestamp)
------------------------------------------------
ContactsTable
user_i_added (text)
user_added_me (text)
------------------------------------------------
MessagesTable
from_user (text)
to_user (text)
msg_body (text)
metadata (text)
has_been_read (boolean)
message_sent_time (timestamp)
Reading through a couple of Cassandra textbooks, I have an idea of how to model the database. My main concern is to model the database in a very efficient manner, hence I am trying to avoid things such as secondary indexes. This is my model so far:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
password text,
timeCreated timestamp,
last_loggedin timestamp
);
CREATE TABLE users_by_email (
email text PRIMARY KEY,
username text,
password text,
timeCreated timestamp,
last_loggedin timestamp
);
To spread data evenly and to read a minimal number of partitions (hopefully just one), I can look up a user quickly by either their username or email. The downside is obviously that I am doubling my data, but storage is quite cheap so I find it a good trade-off instead of using secondary indexes. Last logged in will also need to be written twice, but Cassandra is efficient at writes so I believe this is a good trade-off as well.
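For example, both writes could be wrapped in a logged batch to keep the two tables in sync (just a sketch; the values are placeholders):
BEGIN BATCH
UPDATE users_by_username SET last_loggedin = toTimestamp(now()) WHERE username = 'someuser';
UPDATE users_by_email SET last_loggedin = toTimestamp(now()) WHERE email = 'someuser@example.com';
APPLY BATCH;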
For the contacts I can't think of any other way to model this, so I modelled it very similarly to how I would in a relational database. This is quite a denormalized design, which I believe should be good for performance according to the books I have read:
CREATE TABLE "user_follows" (
follower_username text,
followed_username text,
timeCreated timestamp,
PRIMARY KEY ("follower_username", "followed_username")
);
CREATE TABLE "user_followedBy" (
followed_username text,
follower_username text,
timeCreated timestamp,
PRIMARY KEY ("followed_username", "follower_username")
);
I am stuck on how to create this next part. For messaging I was thinking of this table, as it creates wide rows which enable ordering of the messages.
I need messaging to answer two questions: it first needs to be able to show the user all the messages they have, and it also needs to show the user the messages which are new and unread. This is a basic model, but I am unsure how to make it more efficient:
CREATE TABLE messages (
message_id uuid,
from_user text,
to_user text,
body text,
hasRead boolean,
timeCreated timeuuid,
PRIMARY KEY ((to_user), timeCreated )
) WITH CLUSTERING ORDER BY (timeCreated ASC);
I was also looking at using things such as STATIC columns to 'glue' together the user and messages, as well as SETs to store contact relationships, but from my narrow understanding so far the way I presented is more efficient. I ask if there are any ideas to improve this model's efficiency, whether there are better practices for doing the things I am trying to do, or whether there are any hidden problems I could face with this design?
In conclusion, I am trying to model around the queries. If I were using relation databases these would be essentially the queries I am looking to answer:
To Login:
SELECT * FROM USERS WHERE (USERNAME = [MY_USERNAME] OR EMAIL = [MY_EMAIL]) AND PASSWORD = [MY_PASSWORD];
------------------------------------------------------------------------------------------------------------------------
Update user info:
UPDATE USERS SET password = [NEW_PASSWORD] WHERE username = [MY_USERNAME];
UPDATE USERS SET email = [NEW_EMAIL] WHERE username = [MY_USERNAME];
------------------------------------------------------------------------------------------------------------------------
To Add contact (If by username):
INSERT INTO followings(following,follower) VALUES([USERNAME_I_WANT_TO_FOLLOW],[MY_USERNAME]);
------------------------------------------------------------------------------------------------------------------------
To Add contact (If by email):
SELECT username FROM users where email = [CONTACTS_EMAIL];
Then application layer sends over another query with the username:
INSERT INTO followings(following,follower) VALUES([USERNAME_I_WANT_TO_FOLLOW],[MY_USERNAME]);
------------------------------------------------------------------------------------------------------------------------
To View contacts:
SELECT following FROM FOLLOWINGS WHERE follower = [MY_USERNAME];
------------------------------------------------------------------------------------------------------------------------
To Send Message:
INSERT INTO MESSAGES (MSG_ID, FROM, TO, MSG, IS_MSG_NEW) VALUES (uuid, [FROM_USERNAME], [TO_USERNAME], 'MY MSG', true);
------------------------------------------------------------------------------------------------------------------------
To View All Messages (some pagination-type technique that shows me the 10 most recent messages, yet marks which ones are unread):
SELECT * FROM MESSAGES WHERE TO = [MY_USERNAME] LIMIT 10;
------------------------------------------------------------------------------------------------------------------------
Once Message is read:
UPDATE MESSAGES SET IS_MSG_NEW = false WHERE TO = [MY_USERNAME] AND MSG_ID = [MSG_ID];
Cheers
Yes it's always a struggle to adapt to the limitations of Cassandra when coming from a relational database background. Since we don't yet have the luxury of doing joins in Cassandra, you often want to cram as much as you can into a single table. In your case that would be the users_by_username table.
There are a few features of Cassandra that should allow you to do that.
Since you are new to Cassandra, you could probably use Cassandra 3.0, which is currently in beta release. In 3.0 there is a nice feature called materialized views. This would allow you to have users_by_username as a base table, and create the users_by_email as a materialized view. Then Cassandra will update the view automatically whenever you update the base table.
Another feature that will help you is user defined types (in C* 2.1 and later). Instead of creating separate tables for followers and messages, you can create the structure of those as UDT's, and then in the user table keep lists of those types.
So a simplified view of your schema could be like this (I'm not showing some of the fields like timestamps to keep this simple, but those are easy to add).
First create your UDT's:
CREATE TYPE user_follows (
followed_username text,
street text
);
CREATE TYPE msg (
from_user text,
body text
);
Next we create your base table:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
password text,
follows list<frozen<user_follows>>,
followed_by list<frozen<user_follows>>,
new_messages list<frozen<msg>>,
old_messages list<frozen<msg>>
);
Now we create a materialized view partitioned by email:
CREATE MATERIALIZED VIEW users_by_email AS
SELECT username, password, follows, new_messages, old_messages FROM users_by_username
WHERE email IS NOT NULL AND username IS NOT NULL
PRIMARY KEY (email, username);
Now let's take it for a spin and see what it can do. Let's create a user:
INSERT INTO users_by_username (username , email , password )
VALUES ( 'someuser', 'someemail@abc.com', 'somepassword');
Let the user follow another user:
UPDATE users_by_username SET follows = [{followed_username: 'followme2', street: 'mystreet2'}] + follows
WHERE username = 'someuser';
Let's send the user a message:
UPDATE users_by_username SET new_messages = [{from_user: 'auser', body: 'hi someuser!'}] + new_messages
WHERE username = 'someuser';
Now let's see what's in the table:
SELECT * FROM users_by_username ;
username | email | followed_by | follows | new_messages | old_messages | password
----------+-------------------+-------------+---------------------------------------------------------+----------------------------------------------+--------------+--------------
someuser | someemail@abc.com | null | [{followed_username: 'followme2', street: 'mystreet2'}] | [{from_user: 'auser', body: 'hi someuser!'}] | null | somepassword
Now let's check that our materialized view is working:
SELECT new_messages, old_messages FROM users_by_email WHERE email='someemail@abc.com';
new_messages | old_messages
----------------------------------------------+--------------
[{from_user: 'auser', body: 'hi someuser!'}] | null
Now let's read the message and move it to the old messages:
BEGIN BATCH
DELETE new_messages[0] FROM users_by_username WHERE username='someuser'
UPDATE users_by_username SET old_messages = [{from_user: 'auser', body: 'hi someuser!'}] + old_messages where username = 'someuser'
APPLY BATCH;
SELECT new_messages, old_messages FROM users_by_email WHERE email='someemail@abc.com';
new_messages | old_messages
--------------+----------------------------------------------
null | [{from_user: 'auser', body: 'hi someuser!'}]
So hopefully that gives you some ideas you can use. Have a look at the documentation on collections (i.e. lists, maps, and sets), since those can really help you to keep more information in one table and are sort of like tables within a table.
For Cassandra or NoSQL data modelling beginners, there is a process involved in modelling your application's data:
1- Understand your data, design a concept diagram
2- List all your queries in detail
3- Map your queries using defined rules and patterns, best suitable for cassandra
4- Create a logical design, table with fields derived from queries
5- Now create a schema and test its acceptance.
If we model it well, then it is easy to handle issues such as new complex queries, data overloading, data consistency, etc.
After taking this free online data modelling training, you will get more clarity
https://academy.datastax.com/courses/ds220-data-modeling
Good Luck!
If I had a single record that represented, say, a sellable item:
ItemID | Name
-------------
101 | Chips
102 | Candy bar
103 | Beer
I need to create a relationship between these items and one or more different types of PKs. For instance, a company might have an inventory that includes chips; a store might have an inventory that includes chips and a candy bar; and the night shift might carry chips, candy bars, and beer. The catch is that we have different kinds of IDs: CompanyID, StoreID, ShiftID respectively.
My first thought was "Oh, just create link tables that link companies to inventory items, stores to inventory items, and shifts to inventory items", and that way if I needed to look up the inventory collection for any of those entities, I could query them explicitly. However, the UI shows that I should be able to compile a list arbitrarily (e.g. show me all inventory items for company A, all west valley stores, and Team BrewHa who is at an east valley store) and then display them grouped by their respective entity:
Company A
---------
- Chips
West Valley 1
-------------
- Chips
- Candy Bar
West Valley 2
-------------
- Chips
BrewHa (East Valley 6)
--------------------
- Chips
- Candy Bar
- Beer
So again, my first thought was to base the query on the provided information (what kinds of IDs they gave me) and then just union the results together with some extra info for grouping (candidate keys like IDType+ID), so that the result looked kind of like this:
IDType | ID | InventoryItemID
------------------------------
1 |100 | 1
2 |200 | 1
2 |200 | 2
2 |201 | 1
3 |300 | 1
3 |300 | 2
3 |300 | 3
I guess this would work, but it seems incredibly inefficient and contrived to me; I'm not even sure how the parameters of that sproc would work... So my question to everyone is: is this even the right approach? Can anyone explain alternative or better approaches to solve the problem of creating and managing these relationships?
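To make that concrete, the union I was imagining would look something like this (the table names are made up, and I have hand-waved how the ID lists get passed in):
SELECT 1 AS IDType, CompanyID AS ID, InventoryItemID
FROM CompanyInventory
WHERE CompanyID IN (/* requested company IDs */)
UNION ALL
SELECT 2, StoreID, InventoryItemID
FROM StoreInventory
WHERE StoreID IN (/* requested store IDs */)
UNION ALL
SELECT 3, ShiftID, InventoryItemID
FROM ShiftInventory
WHERE ShiftID IN (/* requested shift IDs */);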
It's hard to ascertain what you want as I don't know the purpose/use of this data. I'm not well-versed in normalization, but perhaps a star schema might work for you. Please keep in mind, I'm using my best guess for the terminology. What I was thinking would look like this:
tbl_Current_Inventory(Fact Table) records current Inventory
InventoryID INT NOT NULL FOREIGN KEY REFERENCES tbl_Inventory(ID),
CompanyID INT NULL FOREIGN KEY REFERENCES tbl_Company(ID),
StoreID INT NULL FOREIGN KEY REFERENCES tbl_Store(ID),
ShiftID INT NULL FOREIGN KEY REFERENCES tbl_Shift(ID),
Shipped_Date DATE --not really sure, just an example,
CONSTRAINT clustered_unique UNIQUE CLUSTERED (InventoryID, CompanyID, StoreID, ShiftID)
tbl_Inventory(Fact Table 2)
ID INT NOT NULL,
ProductID INT NOT NULL FOREIGN KEY REFERENCES tbl_Product(ID),
PRIMARY KEY(ID,ProductID)
tbl_Store(Fact Table 3)
ID INT PRIMARY KEY,
CompanyID INT FOREIGN KEY REFERENCES tbl_Company(ID),
RegionID INT FOREIGN KEY REFERENCES tbl_Region(ID)
tbl_Product(Dimension Table)
ID INT PRIMARY KEY,
Product_Name VARCHAR(25)
tbl_Company(Dimension Table)
ID INT PRIMARY KEY,
Company_Name VARCHAR(25)
tbl_Region(Dimension Table)
ID INT PRIMARY KEY,
Region_Name VARCHAR(25)
tbl_Shift(Dimension Table)
ID INT PRIMARY KEY,
Shift_Name VARCHAR(25)
Start_Time TIME,
End_Time TIME
So, a little explanation. Each dimension table holds only distinct values; tbl_Region, for example, lists each region's name once along with an ID.
Now for tbl_Current_Inventory: I have CompanyID and StoreID both in there for a reason, because this table can hold company inventory information (NULL StoreID and NULL ShiftID) AND it can hold store inventory information.
Then as for querying this, I would create a view that joins each table, then simply query the view. Then of course there's indexes, but I don't think you asked for that. Also notice I only had about one column per dimension table; my guess is that you'll probably have more columns than just the name of something.
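A sketch of such a view over the schema above (LEFT JOINs because CompanyID, StoreID and ShiftID are nullable; the view name is just an example):
CREATE VIEW v_Current_Inventory AS
SELECT p.Product_Name,
c.Company_Name,
s.ID AS StoreID,
r.Region_Name,
sh.Shift_Name,
ci.Shipped_Date
FROM tbl_Current_Inventory ci
JOIN tbl_Inventory i ON ci.InventoryID = i.ID
JOIN tbl_Product p ON i.ProductID = p.ID
LEFT JOIN tbl_Company c ON ci.CompanyID = c.ID
LEFT JOIN tbl_Store s ON ci.StoreID = s.ID
LEFT JOIN tbl_Region r ON s.RegionID = r.ID
LEFT JOIN tbl_Shift sh ON ci.ShiftID = sh.ID;
Queries would then just filter the view on whichever entity columns they care about.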
Overall, this helps eliminate a lot of duplicate data and strikes a good balance between performance and a not overly complicated data structure. Really though, if you slap a view on it and query the view, it should perform quite well, especially if you add some good indexes.
This may not be a perfect solution or even the one you need, but hopefully it at least gives you some ideas or some direction.
If you need any more explanation or anything else, just let me know.
In a normalized database, you implement a many-to-many relationship by creating a table that defines the relationships between entities just as you thought initially. It might seem contrived, but it gives you the functionality you need. In your case I would create a table for the relationship called something like "Carries" with the primary key of (ProductId, StoreId, ShiftId). Sometimes you can break normalization rules for performance, but it comes with side effects.
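For instance, a sketch of that relationship table (column types are assumed, as are Product/Store/Shift parent tables keyed this way):
CREATE TABLE Carries (
ProductId INT NOT NULL REFERENCES Product(ProductId),
StoreId INT NOT NULL REFERENCES Store(StoreId),
ShiftId INT NOT NULL REFERENCES Shift(ShiftId),
PRIMARY KEY (ProductId, StoreId, ShiftId)
);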
I recommend picking up a good book on designing relational databases. Here's a starter on a few topics:
http://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model
http://en.wikipedia.org/wiki/Database_normalization
You need to break it down: inventory belongs to a store and a shift.
Inventory does not belong to a company - a store belongs to a company.
If the company holds inventory directly then I would create a store named 'warehouse'.
A store belongs to a region.
Don't design for the UI - put the data in 3NF.
Tables:
Company ID, name
Store ID, name
Region ID, name
Product ID, name
Shift ID, name
CompanyToStore CompanyID, StoreID (composite PK)
RegionToStore RegionID, StoreID (composite PK)
Inventory StoreID, ShiftID, ProductID (composite PK)
The composite PKs are not just efficient; they prevent duplicates.
The join tables should not have their own ID as PK - let the relationships they are managing be the PK.
If you want to report by company across all shifts you would have a query like this:
select distinct Store.Name, Product.Name
from Inventory
join Store
on Inventory.StoreID = Store.ID
join Product
on Inventory.ProductID = Product.ID
join CompanyToStore
on Store.ID = CompanyToStore.StoreID
and CompanyToStore.CompanyID = X
Store count in a region:
select Region.Name, count(*)
from RegionToStore
join Region
on Region.ID = RegionToStore.RegionID
group by Region.Name
I am trying to store metadata about a document in SQL Server. The documents are stored in a document archive which returns an identifier, so I can get a document back by asking the archive for it by that identifier.
Our users would like to be able to search for these documents based on different metadata. The metadata could be 1 attribute or 5 depending on the document type, and the users should be able to create new document types from an admin site.
I can see two solutions here. One is that each documenttype gets its own metadata table, where all metadata attributes are predefined; if an attribute should be added, a new column needs to be created, and if a new documenttype is created, a new metadata table needs to be created. Our DBA will freak out with a solution like this, and I also see a problem with indexes: if a documenttype has 5 different metadata attributes, it needs to be searchable with anywhere from 1 to all 5 of them specified in the search. Then I would need to write indexes for all the different combinations of possible searches.
Here is an example (fictitious):
|documentId | Name | InsertDate | CustomerId | City
| 1 | John | 2014-01-01 | 2 | London
| 2 | John | 2014-01-20 | 5 | New York
| 3 | Able | 2014-01-01 | 10 | Paris
I could here say:
Give me all documents where Name = 'John'
Give me all documents where Name = 'John' And CustomerId = 5
Give me all documents where InsertDate = '2014-01-01' and City = 'London'
These would be 3 different indexes, and that still doesn't cover all possible combinations. This isn't practical.
So I am looking into the evil 'EAV' (anti)pattern.
So instead of having the metadata as columns, I can have them as rows.
|documentId | MetaAttribute | MetaValue
| 1 | Name | John
| 1 | InsertDate | 2014-01-01
| 1 | CustomerId | 2
| 1 | City | London
| 2 | Name | John
| 2 | InsertDate | 2014-01-20
| 2 | CustomerId | 5
| 2 | City | New York
| 3 | Name | Able
| 3 | InsertDate | 2014-01-01
| 3 | CustomerId | 10
| 3 | City | Paris
Here it's simple to create one index on MetaAttribute and MetaValue, and it's covered. If a new documenttype is created, new metadata can be registered for that documenttype in a MetaAttribute table (that contains all MetaAttributes for the different documenttypes). So there is no need to create new tables or columns when a new documenttype is added or a new attribute is added to a documenttype. The downside is that all MetaValues must be strings :( and the SQL query to find a document id is a bit more complicated.
This is what I figured out (in this example the MetaAttribute is a string, but it would be an ID referencing the MetaAttribute table):
SELECT * FROM [Document]
WHERE ID IN (SELECT documentId FROM [MetaData]
WHERE ((MetaAttribute = 'Name' AND MetaValue = 'John')
OR (MetaAttribute = 'CustomerId' AND MetaValue = '5'))
GROUP BY [documentId]
HAVING Count(1) = 2)
Here I need to check whether Name = 'John' and CustomerId = 5. I do that by finding all records that have Name = 'John' or CustomerId = '5', grouping them on documentId, and counting the number of items in each group. If I get 2, then both Name = 'John' and CustomerId = '5' are true for this search. I return the documentId and use that to retrieve information about the document, like the document archive storage id.
There should be a better SQL statement for this isn't there?
So my question is: is there a better approach than these two? Is the EAV pattern so bad that I should stick with the first approach, with a freaked-out DBA and "ten millions of indexes"?
We are talking about a system that will have around 10-20 million new records each month and will contain data for at least 3 years... so the tables will be pretty big, and good indexes are necessary for performance.
Best Regards
Magnus
The EAV model is appealing if you have unbounded attributes--that is, anyone can set up anything as an attribute. However, it sounds from your description that this is not the case--the possible document attributes come from a known and fairly limited set. If this is the case, routine normalization suggests the following:
-- One per document
CREATE TABLE Document
(
DocumentId -- primary key
,DocumentType
,<etc>
)
-- One per "type" of document
CREATE TABLE DocumentType
(
DocumentTypeId -- primary key
,Name
)
-- One per possible document attribute.
-- Note that multiple document types can reference the same attribute
CREATE TABLE DocumentAttributes
(
AttributeId -- primary key
,Name
)
-- This lists which attributes are used by a given type
CREATE TABLE DocumentTypeAttributes
(
DocumentTypeId
,AttributeId
-- compound primary key on both columns
-- foreign keys on both columns
)
-- This contains the final association of document and attributes
CREATE TABLE DocumentAttributeValues
(
DocumentId
,AttributeId
,Value
-- compound primary key on DocumentId, AttributeId
-- foreign keys on both columns to their respective parent tables
)
A tighter model with more robust keys could be implemented to ensure at the database level that an attribute cannot be assigned to a document with an “inappropriate” type.
Queries have to use joins, but (presumably) only the Document and DocumentAttributeValues tables will ever be large. An index on (AttributeId + Value) facilitates lookups by attribute type, and depending on cardinality, an index on (Value + AttributeId) could make searches for specific values quite efficient.
(Edit)
Ooh, clever, I created two tables with the same name. I've renamed the last one to DocumentAttributeValues. (Free advice is clearly worth what you paid for it!)
This shows how ugly these systems can get in SQL, as you have to “look up” both attributes separately. On the plus side you don’t have to worry about “does this type go with this document”, as those rules have (better had) been applied when the data was loaded. Two examples:
This one spells everything out in joins, and as such I think it might perform worse than the next:
-- Top-down
SELECT do.DocumentId
from Document do
inner join DocumentAttributes da1
on da1.Name = 'Name'
inner join DocumentAttributeValues dav1
on dav1.DocumentId = do.DocumentId
and dav1.AttributeId = da1.AttributeId
and dav1.Value = 'John'
inner join DocumentAttributes da2
on da2.Name = 'CustomerId'
inner join DocumentAttributeValues dav2
on dav2.DocumentId = do.DocumentId
and dav2.AttributeId = da2.AttributeId
and dav2.Value = '5'
This one picks out the attributes, then finds which documents have all of them. It might perform better, as there’s one less table to process:
-- Bottom-up
SELECT xx.DocumentId
from (-- All documents with name "John"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'Name'
and dav.Value = 'John'
-- This combines the two sets, with "all" keeping any duplicate entries
union all
-- All documents with CustomerId = "5"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'CustomerId'
and dav.Value = '5') xx -- Have to give the subquery an alias
group by xx.DocumentId
having count(*) = 2
While further refinements might be possible, the more attributes you're filtering on, the uglier the queries will be. Five attributes max might work OK in SQL, but if you've got tons of attributes, a NoSQL solution might be what you're looking for.
(Please note that, as with my original post, I have not tested this code, so there may be typos or subtle--or not so subtle--errors in here.)
SQL Server 2008+ offers three related features for dealing with such cases:
Sparse Columns which allow you to define hundreds of columns even if only a subset are used at a time
Column Sets allow you to group these columns and treat them as a group
Filtered indexes can index only the rows that actually have values in them.
These features allow you to work with more-or-less normal SQL statements to handle all metadata columns.
These features were specifically added to address the EAV/metadata scenario.
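A minimal sketch of the three features together (table and column names are illustrative, not from the question):
CREATE TABLE DocumentMeta (
DocumentId INT PRIMARY KEY,
CustomerId INT SPARSE NULL, -- sparse column
City NVARCHAR(50) SPARSE NULL, -- sparse column
AllMeta XML COLUMN_SET FOR ALL_SPARSE_COLUMNS -- column set
);

-- Filtered index: only the rows that actually have a City are indexed
CREATE NONCLUSTERED INDEX IX_DocumentMeta_City
ON DocumentMeta (City)
WHERE City IS NOT NULL;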
EDIT
If you have a limited set of attributes that are always filled, there is no need for Sparse Columns or the EAV anti-pattern either.
You can create your tables as you normally would and add indexes to optimize the real workload you encounter. Certain types of queries will occur far more often than others and SQL Server's Index tuning advisor can propose the indexes and statistics to use based on a trace captured using SQL Server's Profiler.
It's quite possible that only a subset of the columns will accelerate searches and the rest can be added as include columns in the index.
Full Text Search
A more powerful option is to use SQL Server's Full Text Search. This will allow you to execute queries using arbitrary attributes. This is another technique used by document/content management systems, ERPs and CRMs to handle arbitrary attributes.
With FTS you simply specify the columns to include in one FTS index and don't have to create separate indexes for each attribute.
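Setting one up might look like this (the catalog name is an assumption; the KEY INDEX must be an existing unique index, here the table's primary key index):
CREATE FULLTEXT CATALOG ftDocuments AS DEFAULT;

CREATE FULLTEXT INDEX ON Production.Product (Name)
KEY INDEX PK_Product_ProductID
ON ftDocuments;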
You can use FTS predicates in SELECT queries like this:
SELECT Name, ListPrice
FROM Production.Product
WHERE ListPrice = 80.99
AND CONTAINS(Name, 'Mountain')
This can result in much simpler queries (you just write a modified SELECT) and simpler administration (no worries about column order in indexes; only one FTS index to manage).