References in a table - database

I have a table like this, which contains items that are added to the database:
Catalog table example
id | element | catalog
0 | mazda | car
1 | penguin | animal
2 | zebra | animal
etc....
And then I have a table where the user selects items from that table, and I keep a reference to what has been selected, like this:
User table example
id | name | age | itemsSelected
0 | john | 18 | 2;3;7;9
So what I am trying to say is that I keep a reference to what the user has selected as a string of IDs, but this seems a tad troublesome.
Because when I do a query to get information about a user, all I get is the string 2;3;7;9, when what I really want is an array of the items corresponding to those IDs.
Right now I get the IDs, and I have to split the string and then run another query to find the elements the IDs correspond to.
Is there an easier way to do this, if my question is understandable?

Yes, there is a way to do this. You create a third table which contains a map of A/B. It's called a many-to-many relationship.
You have your Catalogue table (int, varchar(MAX), varchar(MAX)) or similar.
You have your User table (int, varchar(MAX), varchar(MAX), varchar(MAX)) or similar; essentially, remove the last column (itemsSelected) and then create another table:
You create a UserCatalogue table: (int UserId, int CatalogueId) with a primary key on both columns. Then the UserId column gets a foreign key to User.Id, and the CatalogueId column gets a foreign key to Catalogue.Id. This preserves the relationship and eases queries. It also means that if Catalogue.Id number 22 does not exist, you cannot accidentally insert it as a relation between the two. This is called referential integrity: once you declare that a column must reference another table, SQL Server enforces that relationship.
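A minimal sketch of that junction table in T-SQL (names follow the answer; the User and Catalogue tables are assumed to already exist with int primary keys):
CREATE TABLE UserCatalogue
(
    UserId int NOT NULL,
    CatalogueId int NOT NULL,
    CONSTRAINT PK_UserCatalogue PRIMARY KEY (UserId, CatalogueId),
    CONSTRAINT FK_UserCatalogue_User FOREIGN KEY (UserId) REFERENCES [User] (Id),
    CONSTRAINT FK_UserCatalogue_Catalogue FOREIGN KEY (CatalogueId) REFERENCES Catalogue (Id)
);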
After you create this, you add an entry for each of the itemsSelected, i.e.:
UserId | CatalogueId
0 | 2
0 | 3
0 | 7
0 | 9
This also allows you to use JOINs on the tables for faster queries.
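For example, a query like the following (column names taken from the tables above) returns the catalogue rows a given user selected in a single round trip:
SELECT c.element, c.catalog
FROM UserCatalogue uc
INNER JOIN Catalogue c
    ON c.id = uc.CatalogueId
WHERE uc.UserId = 0;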
Additionally, and unrelated to the question, you can also optimize the Catalogue table a bit: create another table, CatalogueGroup, which contains your last column there (catalog: car, animal) and is referenced via a foreign key from the Catalogue table. This will also save storage space and speed up SQL Server's work, as it no longer has to read a string column if you only want the element value.
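A sketch of that split, with assumed names; the group moves to its own table and the Catalogue table keeps only a reference to it:
CREATE TABLE CatalogueGroup
(
    Id int IDENTITY PRIMARY KEY,
    Name varchar(100) NOT NULL -- e.g. 'car', 'animal'
);
CREATE TABLE Catalogue
(
    Id int IDENTITY PRIMARY KEY,
    Element varchar(100) NOT NULL,
    CatalogueGroupId int NOT NULL REFERENCES CatalogueGroup (Id)
);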

Does taking advantage of dynamic columns in Cassandra require duplicated data in each row?

I've been trying to understand how one would model time series data in Cassandra, as shown in the image below from a popular System Design Interview video, where counts of views are stored hourly.
While I would think the schema for this time series data would be something like the below, I don't believe this would lead to data actually being stored in the way the screenshot shows.
CREATE TABLE views_data (
    video_id uuid,
    channel_name varchar,
    video_name varchar,
    viewed_at timestamp,
    count int,
    PRIMARY KEY (video_id, viewed_at)
);
Instead, I'm assuming it would lead to something like this (inspired by datastax), where technically there is a single row for each video_id, but the other columns, such as channel_name and video_name, seem like they would all be duplicated within the row for each unique viewed_at.
[cassandra-cli]
list views_data;
RowKey: A
=> (channel_name='System Design Interview', video_name='Distributed Cache', count=2, viewed_at=1370463146717000)
=> (channel_name='System Design Interview', video_name='Distributed Cache', count=3, viewed_at=1370463282090000)
=> (channel_name='System Design Interview', video_name='Distributed Cache', count=8, viewed_at=1370463282093000)
-------------------
RowKey: B
=> (channel_name='Some other channel', video_name='Some video', count=4, viewed_at=1370463282093000)
I assume this is still considered a dynamic wide row, as we're able to expand the row for each unique (video_id, viewed_at) combination. But it seems less than ideal that we need to duplicate the extra information such as channel_name and video_name.
Is the screenshot of modeling time series data misleading or is it actually possible to have dynamic columns where certain columns in the row do not need to be duplicated?
If I was upserting time series data to this row, I wouldn't want to have to provide the channel_name and video_name for every single upsert, I would just want to provide the count.
No, it is not necessary to duplicate the values of columns within the rows of a partition. It is possible to model your table to accommodate your use case.
In Cassandra, there is a concept of "static columns" -- columns which have the same value for all rows within a partition.
Here's the schema of an example table that contains two static columns, colour and item:
CREATE TABLE statictbl (
    pk int,
    ck text,
    c int,
    colour text static,
    item text static,
    PRIMARY KEY (pk, ck)
);
In this table, all rows within a partition share the same colour and item. For example, partition pk=1 has the same colour='red' and item='apple' for all rows:
pk | ck | colour | item | c
----+----+--------+--------+----
1 | a | red | apple | 12
1 | b | red | apple | 23
1 | c | red | apple | 34
If I insert a new partition pk=2:
INSERT INTO statictbl (pk, ck, colour, item, c) VALUES (2, 'd', 'yellow', 'banana', 45);
we get:
pk | ck | colour | item | c
----+----+--------+--------+----
2 | d | yellow | banana | 45
If I then insert another row withOUT specifying a colour and item:
INSERT INTO statictbl (pk, ck, c) VALUES (2, 'e', 56);
the new row with ck='e' still has the colour and item populated even though I didn't insert a value for them:
pk | ck | colour | item | c
----+----+--------+--------+----
2 | d | yellow | banana | 45
2 | e | yellow | banana | 56
In your case, both the channel and video names will share the same value for all rows in a given partition if you declare them as static, and you only ever need to insert them once. Note that when you update the value of a static column, ALL the rows in that partition will reflect the updated value.
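Applied to the schema in the question, a sketch might look like this (same keys as the original; only the two name columns change):
CREATE TABLE views_data (
    video_id uuid,
    channel_name varchar static,
    video_name varchar static,
    viewed_at timestamp,
    count int,
    PRIMARY KEY (video_id, viewed_at)
);
After the first insert for a given video_id, subsequent upserts can supply just video_id, viewed_at and count.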
For details, see Sharing a static column in Cassandra. Cheers!

Database normalization. Which is better, inserting in one row or multiple rows?

I'm currently designing my tables. I have three types of user: pyd, ppp and ppk. Which is better: inserting data in one row or in multiple rows?
[screenshots of the two candidate table layouts omitted]
Or any suggestions? Thanks.
I would go for 3 tables:
user_type
typeID | typeDescription
Main_table
id_main_table | id_user | id_type
table_bhg_i
id_bhg_i | id_main_table | data1 | data2 | data3
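A sketch of those three tables in SQL (data types are assumptions; adjust to your actual columns):
CREATE TABLE user_type
(
    typeID int IDENTITY PRIMARY KEY,
    typeDescription varchar(50) NOT NULL -- 'pyd', 'ppp' or 'ppk'
);
CREATE TABLE main_table
(
    id_main_table int IDENTITY PRIMARY KEY,
    id_user int NOT NULL,
    id_type int NOT NULL REFERENCES user_type (typeID)
);
CREATE TABLE table_bhg_i
(
    id_bhg_i int IDENTITY PRIMARY KEY,
    id_main_table int NOT NULL REFERENCES main_table (id_main_table),
    data1 varchar(100) NULL,
    data2 varchar(100) NULL,
    data3 varchar(100) NULL
);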
Although I see you are inserting IDs for each user, I don't quite understand how you are going to differentiate between the users. Had I designed this DB, I would have gone for tables like these:
tableName: UserTypes
This table would contain two fields: the first would be an ID and the second would be the type of user,
like
UsertypeID | UserType
UsertypeID is a primary key and can be auto-increment, while UserType would hold your user types pyd, ppk and so on. Designing it this way gives you the flexibility to add new user types later on without changing the schema of the table.
Next, you can create a table for generating multiple users of a particular type. This table would reference the UsertypeID of the previous table, which helps you add new users easily and removes redundancy:
tableName: Users
This table would again contain the user's ID and name, plus the UsertypeID:
UserId | UserName | UserTypeID
The next thing you can do is make a table to insert the data; let the table be called DataTable:
tableName: DataTable
This table will contain the data of the users and will reference them easily:
DataTabID | DataFields(can be any in number) | UserID(references Users table)
These tables should be more than sufficient. If you have doubts, ask me in the chat.

Single table column refers to multiple primary keys

I need to store multiple values in a single column.
For example, I am creating a table which holds the user preferences,
e.g.
| user_id | cities | countries |
|---------|------------|------------|
| 1 | 10, 11, 23 | 21, 34 |
Because I can't store them as an array (or prefer not to store them as an array even if it is available, due to maintenance and performance reasons and better RDBMS design), I have to create a mapping table like this:
| user_id | type | reference_id |
|---------|---------|--------------|
| 1 | CITY | 10 |
| 1 | CITY | 11 |
| 1 | CITY | 23 |
| 1 | COUNTRY | 21 |
| 1 | COUNTRY | 34 |
The reference_id column refers to the master tables like city, country, etc.
The problems I see here are:
I can't have an FK reference to the city or country table, because a single reference_id column may refer to a city or a country depending on the type.
As I can't have an FK, there is no guarantee that we can't have dirty data.
Is there any better approach?
Note:
I have given city/country as a sample, but I need around 20 columns which can have multiple values like city or country.
In the future I may introduce some boolean preference like "whether you like to travel", so I might want to store TYPE as "TRAVEL" and reference_id as 0 for yes, 1 for no, which definitely will not have any reference.
You could create a Location table {LocationId, LocationType (city/country)},
and then every time you add a new record to the city or country table, add it to the Location table first, then add it to the City (or Country) table as appropriate, with the same CityId (or CountryId) as was used as LocationId in the Location table.
Then create an FK between the preferences table and the Location table, and add a [zero or one]-to-one (0/1 - 1) FK relationship from the City and Country tables to the Location table. (Every record in the City and Country tables must be in the Location table, but not the other way around.)
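A minimal T-SQL sketch of that supertype/subtype arrangement (table and column names are illustrative):
CREATE TABLE Location
(
    LocationId int PRIMARY KEY,
    LocationType varchar(10) NOT NULL -- 'CITY' or 'COUNTRY'
);
CREATE TABLE City
(
    CityId int PRIMARY KEY REFERENCES Location (LocationId),
    Name varchar(100) NOT NULL
);
CREATE TABLE Country
(
    CountryId int PRIMARY KEY REFERENCES Location (LocationId),
    Name varchar(100) NOT NULL
);
CREATE TABLE UserPreference
(
    user_id int NOT NULL,
    LocationId int NOT NULL REFERENCES Location (LocationId),
    PRIMARY KEY (user_id, LocationId)
);
With this in place, reference_id in the preferences table is always a valid LocationId, regardless of whether it points at a city or a country.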
You're saying you want a table for generic data instead of 20 lookup tables enforcing RI? On a large system, the data would be stored in multiple tables instead of using a delimiter to separate the values and then exploding them out in another table, introducing the problem of enforcing RI. If you're storing values that are really generic, like code/description pairs, you just need a codeSetID field to identify which codes belong in which codesets.

Enforce uniqueness on column based on contents from another column

I have the typical Invoice/InvoiceItems master/detail tables (the ones every book and tutorial out there uses as examples). I also have a "Proforma" table which holds data similar to invoices and is sometimes linked to invoices. Each item in an invoice has a column optionally referencing a proforma, something like this:
id | id_invoice | id_proforma | amount ....... and a bunch of irrelevant stuff
-----------------------------------------------
1 | 1 | null | 100
2 | 1 | null | 40
3 | 2 | 3 | 1000
4 | 3 | 4 | 473
5 | 3 | 4 | 139
Basically, each item in an invoice can be linked to a proforma. There is also a business rule that says that each proforma can be used in only one invoice (it's OK to use it in many items within the same invoice).
Currently that rule is enforced on the application side but this has problems with concurrency, as 2 users could take the same proforma at the same time and the system would let it pass. My intention is to have the DB validate this in addition to some front-end visual clues, but so far I've failed to come with an approach for this particular case.
Filtered unique indexes could serve well, except that the same proforma can be used twice if it's for the same invoice, so my question is, how can I make the DB server enforce that rule?
The database engine can be SQL Server 2012 or later, any edition from Express to Enterprise.
You can create a user-defined scalar function that returns TRUE if the proforma id and invoice id combination are valid. Then put a check constraint on the table requiring the function to return true. Like this (tweak to fit your table name/needs):
-- Here's the function:
create function dbo.svfIsCombinationValid (
    @id_invoice int
    , @id_proforma int
)
returns bit
as
begin
    declare @return bit = 1;
    -- Invalid if the proforma is already used by a different invoice
    if exists (
        select 1
        from dbo.YourInvoiceProformaXRefTable
        where id_proforma = @id_proforma
        and id_invoice <> @id_invoice
    )
    begin
        set @return = 0;
    end;
    return @return;
end;
After that, you can alter the table and add the check constraint:
alter table dbo.YourInvoiceProformaXRefTable
add constraint CK_YourInvoiceProformaXRefTable_UniqueInvoiceProforma
check (dbo.svfIsCombinationValid(id_invoice,id_proforma)=1);
This is OK with NULLs (multiple id_invoice rows can have NULL id_proforma values), but if both values are not null, then the proforma must either be new or already belong to the same invoice.
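A quick sanity check, assuming the sample rows from the question are already in the table and the id column is auto-generated:
-- Succeeds: proforma 4 is already used by invoice 3, and reuse within the same invoice is allowed
insert into dbo.YourInvoiceProformaXRefTable (id_invoice, id_proforma, amount)
values (3, 4, 250);
-- Fails the check constraint: proforma 4 already belongs to invoice 3
insert into dbo.YourInvoiceProformaXRefTable (id_invoice, id_proforma, amount)
values (5, 4, 99);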

Metadata database design

I am trying to store metadata about a document in SQL Server. The documents are stored in a document archive, which returns an identifier, so I can get a document back by asking the archive for it by identifier.
Our users would like to be able to search for documents based on different metadata. The metadata could be 1 attribute or 5 depending on the document type, and users should be able to create new document types from an admin site.
I can see two solutions here. One is that each document type gets its own metadata table, where all metadata attributes are predefined; if one should be added, a new column needs to be created, and if a new document type is created, a new metadata table needs to be created. Our DBA will freak out with a solution like this, and I also see a problem with indexes: if the document type has 5 different metadata attributes, it needs to be searchable with 1 or 4 of them specified in the search, so I would need to write indexes for all the different combinations of possible searches.
Here is a (fictive) example:
|documentId | Name | InsertDate | CustomerId | City
| 1 | John | 2014-01-01 | 2 | London
| 2 | John | 2014-01-20 | 5 | New York
| 3 | Able | 2014-01-01 | 10 | Paris
Here I could say:
Give me all documents where Name = 'John'
Give me all documents where Name = 'John' and CustomerId = 5
Give me all documents where InsertDate = '2014-01-01' and City = 'London'
That would be 3 different indexes, and then I haven't covered all possible combinations. This isn't practical.
So I am looking into the evil 'EAV' (anti)pattern.
Instead of having the metadata as columns, I can have them as rows:
|documentId | MetaAttribute | MetaValue
| 1 | Name | John
| 1 | InsertDate | 2014-01-01
| 1 | CustomerId | 2
| 1 | City | London
| 2 | Name | John
| 2 | InsertDate | 2014-01-20
| 2 | CustomerId | 5
| 2 | City | New York
| 3 | Name | Able
| 3 | InsertDate | 2014-01-01
| 3 | CustomerId | 10
| 3 | City | Paris
Here it's simple to create one index on MetaAttribute and MetaValue, and it's covered. If a new document type is created, new metadata can be created for it in a MetaAttribute table (that contains all MetaAttributes for the different document types). So there is no need to create new tables or columns if a new document type is added or if a new attribute is added to a document type. Instead, all MetaValues must be strings :( and the SQL query to find the document id is a bit more complicated.
This is what I figured out. (In this example the MetaAttribute is a string, but would be an ID to the MetaAttribute Table)
SELECT * FROM [Document]
WHERE ID IN (SELECT documentId FROM [MetaData]
WHERE ((MetaAttribute = 'Name' AND MetaValue = 'John')
OR (MetaAttribute = 'CustomerId' and MetaValue = '5'))
GROUP BY [documentId]
HAVING Count(1) = 2)
Here I need to ask if Name = 'John' and CustomerId = 5. I do that by finding all records that have Name = 'John' or CustomerId = '5', grouping them on documentId, and counting the number of items in each group. If I get 2, then both Name = 'John' and CustomerId = '5' are true for this search. I return the documentId and use that to retrieve information about the document, like the document archive storage id.
There should be a better SQL statement for this, shouldn't there?
So my question is: is there a better approach than these 2? Is the EAV pattern so bad that I should stick with the first approach, a freaked-out DBA, and "ten millions of indexes"?
We are talking about a system that will get around 10-20 million new records each month and keep data for at least 3 years, so the tables will be pretty big, and good indexes are necessary for performance.
Best Regards
Magnus
The EAV model is appealing if you have unbounded attributes--that is, anyone can set up anything as an attribute. However, it sounds from your description that this is not the case--the possible document attributes come from a known and fairly limited set. If this is the case, routine normalization suggests the following:
-- One per document
CREATE TABLE Document
(
DocumentId -- primary key
,DocumentType
,<etc>
)
-- One per "type" of document
CREATE TABLE DocumentType
(
DocumentTypeId -- primary key
,Name
)
-- One per possible document attribute.
-- Note that multiple document types can reference the same attribute
CREATE TABLE DocumentAttributes
(
AttributeId -- primary key
,Name
)
-- This lists which attributes are used by a given type
CREATE TABLE DocumentTypeAttributes
(
DocumentTypeId
,AttributeId
-- compound primary key on both columns
-- foreign keys on both columns
)
-- This contains the final association of document and attributes
CREATE TABLE DocumentAttributeValues
(
DocumentId
,AttributeId
,Value
-- compound primary key on DocumentId, AttributeId
-- foreign keys on both columns to their respective parent tables
)
A tighter model with more robust keys could be implemented to ensure at the database level that an attribute cannot be assigned to a document with an “inappropriate” type.
Queries have to use joins, but (presumably) only the Document and DocumentAttributeValues tables will ever be large. An index on (AttributeId, Value) facilitates lookups by attribute type, and depending on cardinality an index on (Value, AttributeId) could make searches for specific attributes quite efficient.
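For instance (index names are illustrative):
CREATE INDEX IX_DAV_Attribute_Value
    ON DocumentAttributeValues (AttributeId, Value);
CREATE INDEX IX_DAV_Value_Attribute
    ON DocumentAttributeValues (Value, AttributeId);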
(Edit)
Ooh, clever, I created two tables with the same name. I've renamed the last one to DocumentAttributeValues. (Free advice is clearly worth what you paid for it!)
This shows how ugly these systems can get in SQL, as you have to “look up” both attributes separately. On the plus side you don’t have to worry about “does this type go with this document”, as those rules have (better had) been applied when the data was loaded. Two examples:
This one spells everything out in joins, and as such I think it might perform worse than the next:
-- Top-down
SELECT do.DocumentId
from Document do
inner join DocumentAttributes da1
    on da1.Name = 'Name'
inner join DocumentAttributeValues dav1
    on dav1.AttributeId = da1.AttributeId
    and dav1.DocumentId = do.DocumentId
    and dav1.Value = 'John'
inner join DocumentAttributes da2
    on da2.Name = 'CustomerId'
inner join DocumentAttributeValues dav2
    on dav2.AttributeId = da2.AttributeId
    and dav2.DocumentId = do.DocumentId
    and dav2.Value = '5'
This one picks out the attributes, then finds which documents have all of them. It might perform better, as there’s one less table to process:
-- Bottom-up
SELECT xx.DocumentId
from (-- All documents with name "John"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'Name'
and dav.Value = 'John'
-- This combines the two sets, with "all" keeping any duplicate entries
union all
-- All documents with CustomerId = "5"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'CustomerId'
and dav.Value = '5') xx -- Have to give the subquery an alias
group by xx.DocumentId
having count(*) = 2
While further refinements might be possible, the more attributes you're filtering on, the uglier the queries will be. Five attributes max might work OK in SQL, but if you've got tons of attributes, a NoSQL solution might be what you're looking for.
(Please note that, as with my original post, I have not tested this code, so there may be typos or subtle--or not so subtle--errors in here.)
SQL Server 2008+ offers three related features for dealing with such cases:
Sparse Columns which allow you to define hundreds of columns even if only a subset are used at a time
Column Sets allow you to group these columns and treat them as a group
Filtered indexes can index only the rows that actually have values in them.
These features allow you to work with more-or-less normal SQL statements to handle all metadata columns.
These features were specifically added to address the EAV/metadata scenario.
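A small sketch of how the three features can fit together (table and column names are borrowed from the example in the question, not from a real schema):
CREATE TABLE DocumentMeta
(
    DocumentId int PRIMARY KEY,
    Name varchar(100) SPARSE NULL,
    InsertDate date SPARSE NULL,
    CustomerId int SPARSE NULL,
    City varchar(100) SPARSE NULL,
    AllMeta xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
-- Filtered index: only rows that actually have a CustomerId are indexed
CREATE INDEX IX_DocumentMeta_CustomerId
    ON DocumentMeta (CustomerId)
    WHERE CustomerId IS NOT NULL;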
EDIT
If you have a limited set of attributes that are always filled, there is no need for Sparse Columns or the EAV anti-pattern either.
You can create your tables as you normally would and add indexes to optimize the real workload you encounter. Certain types of queries will occur far more often than others, and SQL Server's Database Engine Tuning Advisor can propose indexes and statistics based on a trace captured with SQL Server Profiler.
It's quite possible that only a subset of the columns will accelerate searches and the rest can be added as include columns in the index.
Full Text Search
A more powerful option is to use SQL Server's Full Text Search. This will allow you to execute queries using arbitrary attributes. This is another technique used by document/content management systems, ERPs and CRMs to handle arbitrary attributes.
With FTS you simply specify the columns to include in one FTS index and don't have to create separate indexes for each attribute.
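Creating such an index might look like this (a sketch; DocumentMeta and PK_DocumentMeta are assumed names for the metadata table and its unique key index):
CREATE FULLTEXT CATALOG DocumentCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.DocumentMeta (Name, City)
    KEY INDEX PK_DocumentMeta;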
You can use FTS predicates in SELECT queries like this:
SELECT Name, ListPrice
FROM Production.Product
WHERE ListPrice = 80.99
AND CONTAINS(Name, 'Mountain')
This can result in much simpler queries (you just write a modified SELECT) and simpler administration (no worries about column order in indexes; there is only one FTS index to manage).
