I have to create a database that combines 4 types of xls files, for example A, B, C and D. A new file of each type has been created every year, starting from 2004. A has 7 sheets with 800-1000 rows; B - D have one sheet each with at most 200 rows.
Everyone knows that people are lazy, so in the Excel files the address data is stored differently in each sheet. One of them, from 2008, has the address data stored in separate columns, but all the other sheets have it combined into one column.
Sooo, here is a question - how should I design a datatable? Something like this?
+---------+----------+----------+-------------+--------------------------------+
| Street | House Nr | City | Postal Code | Combined Address |
+---------+----------+----------+-------------+--------------------------------+
| Street1 | 20 | Somwhere | 00-000 | null |
| Street2 | 98 | Elswhere | 99-999 | null |
| null | null | null | null | Somwhere 00-000, street3 29 |
| null | null | null | null | st. Street2 65 12-345 Elswhere |
+---------+----------+----------+-------------+--------------------------------+
There will be a lot of nulls, so maybe the best solution would be 2 different tables?
The most important thing is that users will search using this data, and in the future add data to the database without Excel files.
There are at least two different angles of view here: Normalization and efficiency, leading to different results.
Normalization
If this is the most important criterion you would even make three tables. Obviously Combined Address needs a place of its own, but Postal Code and City also have to be stored in another table, because there is a dependency between them. Just one of the two, most probably Postal Code, will stay. (Yes, there is even something about Street and Postal Code too, but I'm clearly not going to be pedantic.)
Efficiency
Normalization as an end in itself doesn't necessarily make the best result. If you permit yourself to be a bit sloppy on that and leave it the way it is in the model you posted, things might become easier in coding. You could use a trigger to make sure Combined Address is never null or use a (materialized) view that pretends it is and just search in Combined Address for the time being.
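A minimal sketch of the view idea, using Python's built-in sqlite3 module only so that it runs anywhere. The table name `address`, the view name `address_search` and the way the parts are concatenated are my assumptions, not something from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    id INTEGER PRIMARY KEY,
    street TEXT,
    house_nr TEXT,
    city TEXT,
    postal_code TEXT,
    combined_address TEXT
);
-- The view builds a combined address whenever only the separate columns exist,
-- so searches can always target one column.
CREATE VIEW address_search AS
SELECT id,
       COALESCE(combined_address,
                street || ' ' || house_nr || ', ' || postal_code || ' ' || city)
           AS search_address
FROM address;
""")
conn.executemany(
    "INSERT INTO address (street, house_nr, city, postal_code, combined_address) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Street1", "20", "Somwhere", "00-000", None),
        (None, None, None, None, "Somwhere 00-000, street3 29"),
    ],
)
# Both storage styles are found by the same search.
rows = conn.execute(
    "SELECT id, search_address FROM address_search "
    "WHERE search_address LIKE ? ORDER BY id",
    ("%Somwhere%",),
).fetchall()
print(rows)
```

A plain view like this costs nothing to maintain; a materialized view or a trigger would only be worth it if the search gets slow.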
Imagine the effort if you use different tables and there is a need for referencing these addresses in other tables (Which table to use when? How to provide a unique id? Clearly a problem.).
So, decide on what's more important.
If I'm not mistaken we are talking about some 2000 rows, or some 8000 rows if it is actually '7 sheets with 800-1000 rows each'. Even if the latter applies, this isn't a number that makes data correction impracticable. If the number of different input patterns in the combined column is low, you might be able to do this (partly) automatically and just have someone proofread the result.
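As a rough sketch of what "partly automatically" could look like: pull out the one piece that has a recognizable shape and leave the rest for the proofreader. I'm assuming postal codes always look like NN-NNN, as in the sample rows; the function name `split_postal_code` is made up for the example.

```python
import re

# Assumption: postal codes follow the NN-NNN pattern seen in the sample rows.
POSTAL_CODE = re.compile(r"\b\d{2}-\d{3}\b")


def split_postal_code(combined):
    """Return (postal_code, remainder); postal_code is None if nothing matched."""
    match = POSTAL_CODE.search(combined)
    if match is None:
        return None, combined.strip()
    # Cut the postal code out and tidy up the whitespace and commas left behind.
    remainder = combined[:match.start()] + " " + combined[match.end():]
    remainder = re.sub(r"\s+", " ", remainder)
    remainder = re.sub(r"\s*,\s*", ", ", remainder)
    return match.group(), remainder.strip(" ,")


for sample in ("Somwhere 00-000, street3 29", "st. Street2 65 12-345 Elswhere"):
    print(split_postal_code(sample))
```

Rows where no postal code is found come back with `None`, which makes them easy to list for manual correction.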
So you might want to think about a future redesign as well and choose what's more convenient in this case.
I'm taking MySQL as the example DB for this (although I'm not restricted to relational flavours at this stage), with Java-style syntax for the model / DB interaction.
I'd like the ability to allow versioning of individual column values (and their corresponding types) as and when users edit objects. This is primarily in an attempt to drop the amount of storage required for frequent edits of complex objects.
A simple example might be
- Food (Table)
  - id (INT)
  - name (VARCHAR(255))
  - weight (DECIMAL)
So we could insert an object into the database that looks like...
Food banana = new Food("Banana",0.3);
giving us
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
if we then want to update the weight we might use
banana.weight = 0.4;
banana.save();
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.4 |
+----+--------+--------+
Obviously though this is going to overwrite the data.
I could add a revision column to this table, which could be incremented as items are saved, and set a composite key that combines id/version, but this would still mean storing ALL attributes of this object for every single revision
- Food (Table)
  - id (INT)
  - name (VARCHAR(255))
  - weight (DECIMAL)
  - revision (INT)
+----+--------+--------+----------+
| id | name | weight | revision |
+----+--------+--------+----------+
| 1 | Banana | 0.3 | 1 |
| 1 | Banana | 0.4 | 2 |
+----+--------+--------+----------+
But in this instance we're going to be storing every single piece of data about every single item. This isn't massively efficient if users are making minor revisions to larger objects where Text fields or even BLOB data may be part of the object.
What I'd really like is the ability to selectively store data discretely, so the weight could possibly be saved in a separate DB in its own right, which would be able to reference the table, row and column that it relates to.
This could then be smashed together with a VIEW of the table, that could sort of impose any later revisions of individual column data into the mix to create the latest version, but without the need to store ALL data for each small revision.
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
+-----+------------+-------------+-----------+-----------+----------+
| ID | TABLE_NAME | COLUMN_NAME | OBJECT_ID | BLOB_DATA | REVISION |
+-----+------------+-------------+-----------+-----------+----------+
| 456 | Food | weight | 1 | 0.4 | 2 |
+-----+------------+-------------+-----------+-----------+----------+
Not sure how successful storing any data as blob to then CAST back to original DTYPE might be, but thought since I was inventing functionality here, why not go nuts.
This method of storage would also be fairly dangerous, since table and column names are entirely subject to change, but hopefully this at least outlines the sort of behaviour I'm thinking of.
A table in 6NF has one CK (candidate key) (in SQL a PK) and at most one other column. Essentially 6NF allows the update time/version and value of each column of a pre-6NF table to be recorded in an anomaly-free way. You decompose a table by dropping a non-prime column while adding a table that holds that column plus the columns of an old CK. For temporal/versioning applications you further add a time/version column, and the new CK is the old one plus it.
Adding a column of time/whatever interval (in SQL, start time and end time columns) instead of a time to a CK allows a kind of data compression, by recording the longest uninterrupted stretches of time (or of another dimension) through which a column had the same value. One queries by an original CK plus the time whose value you want. You don't need this for your purposes, but the initial process of normalizing to 6NF and the addition of a time/whatever column should be explained in temporal tutorials.
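A minimal sketch of that decomposition for the Food example, run through Python's sqlite3 module so it is self-contained (the question mentions MySQL; the idea is the same). The table names `food_name`, `food_weight` and the view `food_current` are my own, and I use a plain revision number rather than a time interval:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE food (id INTEGER PRIMARY KEY);
-- One table per former column: old key + revision + that single value.
CREATE TABLE food_name (
    food_id INTEGER NOT NULL REFERENCES food(id),
    revision INTEGER NOT NULL,
    name TEXT NOT NULL,
    PRIMARY KEY (food_id, revision)
);
CREATE TABLE food_weight (
    food_id INTEGER NOT NULL REFERENCES food(id),
    revision INTEGER NOT NULL,
    weight REAL NOT NULL,
    PRIMARY KEY (food_id, revision)
);
-- The view reassembles the latest value of every column.
CREATE VIEW food_current AS
SELECT f.id,
       (SELECT n.name FROM food_name n
         WHERE n.food_id = f.id ORDER BY n.revision DESC LIMIT 1) AS name,
       (SELECT w.weight FROM food_weight w
         WHERE w.food_id = f.id ORDER BY w.revision DESC LIMIT 1) AS weight
FROM food f;
""")
conn.execute("INSERT INTO food (id) VALUES (1)")
conn.execute("INSERT INTO food_name VALUES (1, 1, 'Banana')")
conn.execute("INSERT INTO food_weight VALUES (1, 1, 0.3)")
# Editing the weight adds one small row; the name is not stored again.
conn.execute("INSERT INTO food_weight VALUES (1, 2, 0.4)")

current = conn.execute("SELECT id, name, weight FROM food_current").fetchall()
history = conn.execute(
    "SELECT revision, weight FROM food_weight WHERE food_id = 1 ORDER BY revision"
).fetchall()
print(current, history)
```

Each column keeps its declared type, so there is no casting back from a BLOB, and renaming a column is ordinary DDL rather than an update of stored table/column names.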
Read about temporal databases (which deal both with "valid" data that is times and time intervals but also "transaction" times/versions of database updates) and 6NF and its role in them. (Snodgrass/TSQL2 is bad, Date/Darwen/Lorentzos is good and SQL is problematic.)
Your final suggested table is an example of EAV. This is usually an anti-pattern. It encodes a database into one or more tables that are effectively metadata. But since the DBMS doesn't know that, you lose much of its functionality. EAV is not called for if DDL is sufficient to manage the tables and columns that you need. Just declare appropriate tables in each database. (Which is really one database, since you expect transactions affecting both.) From that link:
You are using a DBMS anti-pattern EAV. You are (trying to) build part of a DBMS into your program + database. The DBMS already exists to manage data and metadata. Use it.
Do not have a class/table of metadata. Just have attributes of movies be fields/columns of Movies.
The notion that one needs to use EAV "so every entity type can be extended with custom fields" is mistaken. Just implement via calls that update metadata tables sometimes instead of just updating regular tables: DDL instead of DML.
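To illustrate "DDL instead of DML", here is a small sketch using Python's sqlite3 module. The helper `add_custom_field`, the allowed type list and the `rating` column are all hypothetical; the point is only that a "custom field" becomes a real column that the DBMS knows about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.execute("INSERT INTO movies (title) VALUES ('Example')")

# Assumption: only a small whitelist of column types is offered to users.
ALLOWED_TYPES = {"TEXT", "INTEGER", "REAL"}


def add_custom_field(conn, table, column, sql_type):
    """Extend a table with a real column (DDL) instead of inserting an EAV row (DML)."""
    # Identifiers cannot be bound as parameters, so validate them before formatting.
    if not (table.isidentifier() and column.isidentifier()):
        raise ValueError("invalid identifier")
    if sql_type not in ALLOWED_TYPES:
        raise ValueError("unsupported type")
    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}" {sql_type}')


add_custom_field(conn, "movies", "rating", "REAL")
conn.execute("UPDATE movies SET rating = 7.5 WHERE id = 1")

columns = [row[1] for row in conn.execute("PRAGMA table_info(movies)")]
print(columns)
```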
Here is what I am trying to do: I want to store a list of values within a DB record, something like this:
| id | tags |
| 1 | 1,3,5 |
| 2 | 121,4,6 |
| 3 | 3,101,2 |
Most of the suggestions I have found so far recommend creating a separate join table to establish a many-to-many relationship, but in my case I don't think a separate table is suitable, because the tag values are just a list of numbers.
The best I can think of right now is to store the data as a CSV string and parse it when it is retrieved, but I'm still trying to find a way to get the values as an array when I retrieve them from the DB. Even better if I can restrict the number of elements in the list. Is there any better way to do this?
I haven't decided which database to use yet, most probably PostgreSQL, but I'm open to others if that helps me implement this better.
On PostgreSQL you can use an array type.
On MySQL you can use the SET type.
Then it depends on what you really need.
I am a newbie to PostgreSQL and have been trying it out.
I have created a simple table:
CREATE TABLE items_tags (
    ut_id SERIAL PRIMARY KEY,
    item_id integer,
    item_tags_weights text[]
);
where:
item_id - ID of the item these tags are associated with
item_tags_weights - Tags associated with the item, including their weights
Example entry:
--------------------
ut_id | item_id | item_tags_weights
---------+---------+-------------------------------------------------------------------------------------------------------------------------------
3 | 2 | {{D,1},{B,9},{W,3},{R,18},{F,9},{L,15},{G,12},{T,17},{0,3},{I,7},{E,14},{S,2},{O,5},{M,4},{V,3},{H,2},{X,14},{Q,9},{U,6},{P,16},{N,11},{J,1},{A,12},{Y,15},{C,15},{K,4},{Z,17}}
1000003 | 3 | {{Q,4},{T,19},{P,15},{M,14},{O,20},{S,3},{0,6},{Z,6},{F,4},{U,13},{E,18},{B,14},{V,14},{X,10},{K,18},{N,17},{R,14},{J,12},{L,15},{Y,3},{D,20},{I,18},{H,20},{W,15},{G,7},{A,11},{C,14}}
4 | 4 | {{Q,2},{W,7},{A,6},{T,19},{P,8},{E,10},{Y,19},{N,11},{Z,13},{U,19},{J,3},{O,1},{C,2},{L,7},{V,2},{H,12},{G,19},{K,15},{D,7},{B,4},{M,9},{X,6},{R,14},{0,9},{I,10},{F,12},{S,11}}
5 | 5 | {{M,9},{B,3},{I,6},{L,12},{J,2},{Y,7},{K,17},{W,6},{R,7},{V,1},{0,12},{N,13},{Q,2},{G,14},{C,2},{S,6},{O,19},{P,19},{F,4},{U,11},{Z,17},{T,3},{E,10},{D,2},{X,18},{H,2},{A,2}}
(4 rows)
where:
{D,1} - D = tag, 1 = tag weight
Well, I just want to list the item_ids where tag = 'U', ordered by tag weight.
One way is to select ALL the tags from the database, do the processing and sorting in a high-level language, and use the result set.
For this, I can do the following:
1) SELECT * FROM items_tags WHERE 'U' = ANY (item_tags_weights)
2) Extract and sort the information and display.
But considering that multiple items can be associated with a single tag, and assuming 10 million entries, this method will surely be sluggish.
Any idea how to produce the list as needed, with CREATE FUNCTION or similar?
Any pointers will be helpful.
Many thanks.
Have you considered normalization, i.e. moving the array field into another table? Apart from being easy to query and extend, it's likely to have better performance on larger databases.
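A minimal sketch of what that could look like, run through Python's sqlite3 module so it is self-contained; the same schema works in PostgreSQL. The table name `item_tag` and the index are my assumptions, and the sample weights for tag 'U' are taken from the rows in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per (item, tag) pair instead of one array per item.
CREATE TABLE item_tag (
    item_id INTEGER NOT NULL,
    tag TEXT NOT NULL,
    weight INTEGER NOT NULL,
    PRIMARY KEY (item_id, tag)
);
-- Lets the database find a tag and read its rows in weight order.
CREATE INDEX item_tag_by_tag_weight ON item_tag (tag, weight);
""")
conn.executemany(
    "INSERT INTO item_tag (item_id, tag, weight) VALUES (?, ?, ?)",
    [(2, "U", 6), (3, "U", 13), (4, "U", 19), (5, "U", 11), (2, "X", 14)],
)
# Items carrying tag 'U', heaviest first; no application-side sorting needed.
items = conn.execute(
    "SELECT item_id, weight FROM item_tag WHERE tag = ? ORDER BY weight DESC",
    ("U",),
).fetchall()
print(items)
```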
I have a data set which consists of an ID and a matrix (n x n) of data related to that ID.
Both the column names (A,B,C,D) and the Row names (1,2,3) are also important and need to be held for each individual ID, as well as the data (a1,b1,c1,d1,...)
for example:
ID | A | B | C | D |
1 | a1 | b1 | c1 | d1 |
2 | ... | ... | ... | ... |
3 | ... | ... | ... | ... |
I am trying to determine the best way of modelling this data set in a database, however, it seems like something that is difficult given the flat nature of RDBMS.
Am I better off holding the ID and an XML blob representing the data matrix, or am I overlooking a simpler solution here?
Thanks.
RDBMSes aren't flat. The R part sees to that. What you need is:
Table Entity
------------
ID
Table EntityData
----------------
EntityID
MatrixRow (1, 2, 3...)
MatrixColumn (A, B, C, D...)
Value
Entity:EntityData is a one-to-many relationship; each cell in the matrix has an EntityData row.
Now you have a schema that can be analyzed at the SQL level, instead of just being a data dump where you have to pull and extract everything at the application level in order to find out anything about it.
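For instance, a sketch of such SQL-level analysis with Python's sqlite3 module. The snake_case names and the numeric sample values are my assumptions; the question only shows placeholders like a1, b1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY);
CREATE TABLE entity_data (
    entity_id INTEGER NOT NULL REFERENCES entity(id),
    matrix_row INTEGER NOT NULL,
    matrix_column TEXT NOT NULL,
    value REAL,
    -- One row per cell: a cell is identified by entity, row and column.
    PRIMARY KEY (entity_id, matrix_row, matrix_column)
);
""")
conn.execute("INSERT INTO entity (id) VALUES (1)")
conn.executemany(
    "INSERT INTO entity_data VALUES (?, ?, ?, ?)",
    [(1, 1, "A", 1.0), (1, 1, "B", 2.0), (1, 2, "A", 3.0), (1, 2, "B", 4.0)],
)
# Per-column totals for one entity, computed entirely in SQL.
column_sums = conn.execute(
    "SELECT matrix_column, SUM(value) FROM entity_data "
    "WHERE entity_id = ? GROUP BY matrix_column ORDER BY matrix_column",
    (1,),
).fetchall()
print(column_sums)
```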
This is one of the reasons why PostgreSQL supports arrays as a data type. See
http://www.postgresql.org/docs/8.4/static/functions-array.html
and
http://www.postgresql.org/docs/8.4/static/arrays.html
Where it shows you can use syntax like ARRAY[[1,2,3],[4,5,6],[7,8,9]] to define the values of a 3x3 matrix or val integer[3][3] to declare a column type to be a 3x3 matrix.
Of course this is not at all standard SQL and is PostgreSQL specific. Other databases may have similar-but-slightly-different implementations.
If you want a truly relational solution:
Matrix
------
id
Matrix_Cell
-----------
matrix_id
row
col
value
But constraints to make sure you had valid data would be hideous.
I would consider a matrix as a single value as far as the DB is concerned and store it as CSV:
Matrix
------
id
cols
data
Which is somewhat lighter than XML.
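A rough Python sketch of the round trip, assuming `cols` holds the number of columns and `data` holds all cells as one flat CSV line (that reading of the two fields is my assumption). The function names are made up for the example.

```python
import csv
import io


def encode_matrix(grid):
    """Flatten a rectangular list of rows into (cols, data) for storage."""
    cols = len(grid[0])
    flat = [cell for row in grid for cell in row]
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(flat)
    return cols, buf.getvalue().rstrip("\n")


def decode_matrix(cols, data):
    """Rebuild the rows from the stored values; cells come back as strings."""
    flat = next(csv.reader([data]))
    return [flat[i:i + cols] for i in range(0, len(flat), cols)]


cols, data = encode_matrix([["a1", "b1"], ["a2", "b2"]])
print(cols, data)
print(decode_matrix(cols, data))
```

Using the csv module rather than a bare `split(",")` keeps cells that contain commas or quotes intact. Row and column names would need their own fields if they have to be preserved.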
I'd probably implement it like this:
Table MatrixData
----------------
id
rowName
columnName
datapoint
If all you're looking for is storing the data, this structure will hold any size matrix and allow you to reconstitute any matrix from the ID. You will need some post-processing to present it in "matrix format", but that's what the front-end code is for.
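The post-processing step could look roughly like this in Python. The input is assumed to be a list of (rowName, columnName, datapoint) tuples already fetched for one id; `to_matrix` is a made-up helper name, and sorting the names is just one possible ordering.

```python
def to_matrix(rows):
    """Pivot a list of (row_name, column_name, datapoint) tuples into a grid."""
    row_names = sorted({r for r, _, _ in rows})
    column_names = sorted({c for _, c, _ in rows})
    cells = {(r, c): value for r, c, value in rows}
    # Cells that were never stored come back as None.
    grid = [[cells.get((r, c)) for c in column_names] for r in row_names]
    return row_names, column_names, grid


fetched = [(1, "A", "a1"), (1, "B", "b1"), (2, "A", "a2"), (2, "B", "b2")]
row_names, column_names, grid = to_matrix(fetched)
print(row_names, column_names, grid)
```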
Can the data be thought of as "row data"? If so, then maybe you could store each row as an Object (or XML blob) with data A, B, C, D and then, in your "representation", use something like a LinkedHashMap (assuming Java) to get the objects with an ID key.
Also, it seems that by its very basic nature, a typical database table already does what you need doesn't it?
Or, even better, you can create a logical array-like structure.
Say you want to store an m x n array.
Create m attributes in the table.
In each attribute, store n elements separated by delimiters.
When retrieving the data, simply parse it in reverse to get the data back.