One large table or many small ones in database? - database

Say I want to create a typical todo web app using a database like PostgreSQL. A user should be able to create todo-lists, and on these lists they should be able to make the actual todo-entries.
I regard the todo-list as an object which has different properties like owner, name, etc., and of course the actual todo-entries, which have their own properties like content, priority, date and so on.
My idea was to create a table for all the todo-lists of all the users. In this table I would store all the attributes of each list. But the question which arises is how to store the todo-entries themselves. They clearly belong in an additional table, but should I rather:
1. Create one big table for all the entries and have a field storing the id of the todo-list they belong to, like so:
todo-list: id, owner, ...
todo-entries: list.id, content, ...
which would give 2 tables in total. The todo-entries table could get very large, although we know that entries expire, so the table only grows with more usage, not over time. We would then write something like SELECT * FROM todo-entries WHERE todo-list-id=id, where id is the id of the list we are trying to retrieve.
OR
2. Create a todo-entries table on a per user basis.
todo-list: id, owner, ...
todo-entries-owner: list.id, content, ...
The number of entries tables depends on the number of users in the system. The query would be something like SELECT * FROM todo-entries-owner, giving mid-sized tables whose size depends on how many entries users make in total.
OR
3. Create one todo-entries table for each todo-list and store the generated table name in a field of the list. For instance we could use the todo-list's unique id in the table name, like:
todo-list: id, owner, entries-list-name, ...
todo-entries-id: content, ... //the id part is the id from the todo-list id field.
In the third case we could potentially have quite a large number of tables, since a user might create many 'short' todo-lists. To retrieve a list we would then simply go along the lines of SELECT * FROM todo-entries-id, where todo-entries-id is either stored in a field of the todo-list or derived implicitly by concatenating 'todo-entries' with the todo-list's unique id. By the way: how do I do that, should this be done in JS or can it be done in PostgreSQL directly? And very related to this: in the SELECT * FROM <tablename> statement, is it possible to use the value of some field of some other table as <tablename>? Like SELECT * FROM todo-list(id).entries-list-name or so.
The three possibilities go from few large tables to many small ones. My personal feeling is that the second or third solution is better; I think they might scale better. But I'm not quite sure of that, and I would like to know what the 'typical' approach is.
I could go more in depth of what I think of each of the approaches, but to get to the point of my question:
Which of the three possibilities should I go for? (Or anything else entirely; does this have to do with normalization?)
Follow up:
What would the (PostgreSQL) statements then look like?

The only viable option is the first. It is far easier to manage and will very likely be faster than the other options.
Imagine you have 1 million users, with an average of 3 to-do lists each and an average of 5 entries per list.
Scenario 1
In the first scenario you have three tables:
todo_users: 1 million records
todo_lists: 3 million records
todo_entries: 15 million records
Such table sizes are no problem for PostgreSQL, and with the right indexes you will be able to retrieve any data in less than a second. That applies to simple queries; if your queries become more complex (like: get me the todo_entries for the longest todo_list of the top 15% of todo_users that have made less than 3 todo_lists in the 3-month period with the most todo_entries entered) they will obviously be slower, but that is true in the other scenarios as well. The queries are very straightforward:
-- Find user data based on username entered in the web site
-- An index on 'username' is essential here
SELECT * FROM todo_users WHERE username = ?;
-- Find to-do lists from a user whose userid has been retrieved with previous query
SELECT * FROM todo_lists WHERE userid = ?;
-- Find entries for a to-do list based on its todoid
SELECT * FROM todo_entries WHERE listid = ?;
You can also combine the three queries into one:
SELECT u.*, l.*, e.* -- or select appropriate columns from the three tables
FROM todo_users u
LEFT JOIN todo_lists l ON l.userid = u.id
LEFT JOIN todo_entries e ON e.listid = l.id
WHERE u.username = ?;
Use of the LEFT JOINs means that you will also get data for users without lists or lists without entries (but column values will be NULL).
Inserting, updating and deleting records can be done with very similar statements and similarly fast.
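For reference, a minimal sketch of what the three tables and their supporting indexes might look like in PostgreSQL (column names and types here are illustrative, not a prescription):
CREATE TABLE todo_users (
    id       serial PRIMARY KEY,
    username text NOT NULL UNIQUE  -- the unique constraint's index serves the username lookup
);
CREATE TABLE todo_lists (
    id     serial PRIMARY KEY,
    userid integer NOT NULL REFERENCES todo_users (id),
    name   text
);
CREATE TABLE todo_entries (
    id       serial PRIMARY KEY,
    listid   integer NOT NULL REFERENCES todo_lists (id),
    content  text,
    priority integer,
    due_date date
);
-- Indexes on the foreign keys used in the WHERE clauses above.
CREATE INDEX ON todo_lists (userid);
CREATE INDEX ON todo_entries (listid);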
PostgreSQL stores data in "pages" (typically 8 kB in size) and most pages will be well filled, which is a good thing because reading and writing a page are very slow compared to other operations.
Scenario 2
In this scenario you need only two tables per user (todo_lists and todo_entries) but you need some mechanism to identify which tables to query.
1 million todo_lists tables with a few records each
1 million todo_entries tables with a few dozen records each
The only practical solution to that is to construct the full table names from a "basename" related to the username or some other persistent authentication data from your web site. So something like this:
username = 'Jerry';
todo_list = username + '_lists';
todo_entries = username + '_entries';
And then you query with those table names. Most likely you will need a todo_users table anyway, to store personal data, usernames and passwords of your 1 million users.
In most cases the tables will be very small and PostgreSQL will not use any indexes (nor does it have to). It will have more trouble finding the appropriate tables, though, and you will most likely have to build your queries in code and then feed them to PostgreSQL, meaning that it cannot optimize a query plan. A bigger problem is creating the tables for new users (todo_list and todo_entries) or deleting obsolete lists or users. This typically requires behind-the-scenes housekeeping that you avoid with the previous scenario. And the biggest performance penalty is that most pages will have only a little content, so you waste disk space and lots of time reading and writing those partially filled pages.
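As an aside, this also answers the "table name from a field value" part of the question: a plain SELECT cannot take its table name from a column or parameter, so you would have to build the statement dynamically, either in the application or in a PL/pgSQL function. A rough sketch (the function and table names are hypothetical, and the per-user table must already exist):
CREATE OR REPLACE FUNCTION get_user_entries(p_username text)
RETURNS TABLE (content text) AS $$
BEGIN
    -- format() with %I quotes the computed identifier safely.
    RETURN QUERY EXECUTE format('SELECT content FROM %I', p_username || '_entries');
END;
$$ LANGUAGE plpgsql;
-- Usage: SELECT * FROM get_user_entries('Jerry');
This is exactly the kind of extra machinery that scenario 1 avoids.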
Scenario 3
This scenario is even worse than scenario 2. Don't do it, it's madness.
3 million todo_entries tables with a few records each
So...
Stick with option 1. It is your only real option.

Related

Storing processed results of connection in RDBMS

A CSV file contains the following two columns: admission_number and project_name.
The relationship between the two entities is many-to-many: a specific admission_number can work on multiple projects, and a specific project may have multiple admission_numbers.
The data looks as follows. Initially there are 1,000 million rows, and with daily updates the table will grow to about 1,300 million rows.
admission_number,project_name
1234567890,ABC1234567
1234567890,ABC1234568
1234567891,ABC1234569
1234567892,ABC1234569
1234567893,ABC1234570
1234567894,ABC1234567
1234567895,ABC1234567
For a specific admission number (let's say 1234567890), I want to know all the admission_numbers that are working on the same projects (ABC1234567, ABC1234568). The output of this query would be
1234567894, 1234567895.
Explanation: for admission number '1234567890' the project names are 'ABC1234567' and 'ABC1234568'. The other admission_numbers working on these two projects are '1234567894' and '1234567895'.
I came up with two solutions. To store the data, an RDBMS will be used.
Approach 1: use two retrieval queries. The first query returns all the project_names for a specific admission_number, and the second query returns all the admission_numbers for those project_names.
select admission_number from table where project_name IN (select project_name from table where admission_number = '1234567890');
Approach 2: before loading, I preprocess the results and store them directly in the database. I only store the connected admission_numbers.
E.g. for project_name 'ABC1234567', these 3 admission_numbers '1234567890', '1234567894', '1234567895' are working on it. I want to store all connected admission_numbers in a table with two columns (number, connected_number), like ('1234567890','1234567894'), ('1234567890','1234567895'), ('1234567894','1234567895'), and the query will work on both columns (number and connected_number).
But in this approach there will be many rows: if for a specific project_name 'p' there are n admission_numbers, then the total number of rows will be n(n-1)/2.
How can I store all the connected admission_numbers in an RDBMS? Loading of data can be slow, but retrieval should be fast.
Do not optimize the data structure. It would only cause problems.
Create a simple table with two columns for the two IDs, and create an index on each column.
The RDBMS will build and maintain the indexes on the column values, which enables fast lookups for specific records.
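A sketch of what that could look like (table and index names are illustrative); the self-join finds the projects of the given admission number and returns everyone else on those projects:
-- One row per (admission_number, project_name) pair.
CREATE TABLE admission_project (
    admission_number bigint      NOT NULL,
    project_name     varchar(20) NOT NULL,
    PRIMARY KEY (admission_number, project_name)
);
CREATE INDEX idx_admission_project_project ON admission_project (project_name);

-- All admission numbers sharing a project with 1234567890, excluding itself.
SELECT DISTINCT other.admission_number
FROM admission_project mine
JOIN admission_project other ON other.project_name = mine.project_name
WHERE mine.admission_number = 1234567890
  AND other.admission_number <> 1234567890;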

WHERE clause partly based on a field value from another table: is there a better way than exec with a dynamic string?

I have a table with about 50,000 records (a global index of corporate and government bonds).
I would like the user to be able to filter this master index firstly into a smaller subset index (based on permanent logic), and then apply further run time criteria that vary each time.
For example, let's say the user wanted to start from one of many subset indices of bonds, let's say of government bonds only, rather than government and corporate bonds, and also only wanted the US$ government bond index specifically. This would be a permanently defined subset of the master index, with a where clause something like "[Level1]='Government' AND [Currency]='USD' AND [CountryCode]='US'"
At run time, the user would additionally request additional criteria, say for example "AND [IssueSize] > 1,000,000,000 AND [Yield] > 0.0112".
I initially thought of having a separate table that stored the different criteria for these permanent sub-indices as where clauses, for example it might have columns "IndexCode, IndexLogic", and using the example above the values would be "UST", "[Level1]='Government' and [Currency]='USD' AND [CountryCode]='US'", and there would be dozens of rows in this table defining commonly used bond indices.
I had originally thought of creating a dynamic string at run time, where the user supplies their choice of sub-index code ('UST' in the example above), which then adds the relevant where conditions plus any additional criteria passed as separate parameters, and then doing an exec(@tsql) type command. I had also thought of perhaps having a where clause that was a function call, but this seems very inefficient.
Is the dynamic string method the best way of doing this, or is there a better way involving some kind of 'eval' function equivalent which can take a field value and use that as a where clause?
The problem here is you don't know in advance what the filtered index is.
A solution I have used in this instance, where the filtered index can often change is to grab the definition of the filter back into the client app, and use that to dynamically generate the SQL batch. You can also do this with dynamic SQL in a stored procedure:
SELECT ISNULL(
    (SELECT i.filter_definition
     FROM sys.indexes i
     WHERE i.object_id = OBJECT_ID(@tablename)
       AND i.name = @indexname
       AND i.has_filter = 1),
    '(1=1)');
You pass in the table name and the index name, and you get back the exact filter definition for the index. This has the benefit that if the index is dropped, the condition becomes (1=1), i.e. every row. You can change this to (1=0) to return nothing instead.
Then you concatenate this into your dynamic query like so:
SELECT *
FROM table
WHERE regular_conditions_here
AND concated_filter_here
If you are joining other tables, I would advise you to put the filter in a subquery, otherwise you may get column clashes since there are no aliases:
SELECT *
FROM (SELECT * FROM table WHERE concated_filter_here) table
JOIN othertables etc
WHERE regular_conditions_here
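Putting it together, a rough sketch of how the stored filter and the run-time criteria could be combined with sp_executesql (the table name dbo.BondIndex and the variable values are placeholders; only concatenate strings you control or have validated):
DECLARE @filter nvarchar(max) = N'(1=1)';  -- e.g. retrieved from sys.indexes as shown above
DECLARE @extra  nvarchar(max) = N'[IssueSize] > 1000000000 AND [Yield] > 0.0112';

DECLARE @sql nvarchar(max) =
      N'SELECT * FROM (SELECT * FROM dbo.BondIndex WHERE ' + @filter + N') b'
    + N' WHERE ' + @extra + N';';

EXEC sp_executesql @sql;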

Counting the number of occurrences of something in the database

For my website, I want to make something that works a bit like the tags on Stackoverflow - so some fields will have an autocompleter, and the autocompleter will display the number of times that other users have selected each suggested value. I suppose I'd have a database structure like this:
Articles
ArticleID
Content
TagId
Tags
TagId
TagName
Occurances
With the idea being that Occurances represents the number of times each TagId is referenced from the Articles table.
What is the best way to implement this? I could add to or subtract from the Occurances column in each of the stored procedures that update the article table, but I might miss one, and anyway, there are some difficulties with this if a user removes a tag from something (it's easy to add 1 to the field for the newly added tag, but harder to work out which tag is being replaced).
There is a lot I don't understand about SQL Server. Is there a more robust way of counting occurrences like this, one that the database system will deal with itself? It would be OK if the data was cached once a day or something.
To be able to have more than one tag attached to an article, you will have to add another table that connects the article table to the tag table. This is called a 'many-to-many' relation.
article
article_id
content
article_tag
article_id
tag_id
tag
tag_id
tagname
Doing it like this, article 1 can be attached to tag 2, the next row can be 1 and 3, and so on, so one article points to many tags. To count a certain tag, you join the article_tag and tag tables and count the rows in article_tag where tag.tagname = 'mysql', for example.
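For instance, counting how often a given tag is used might look like this (using the table and column names above):
SELECT COUNT(*) AS occurances
FROM article_tag
JOIN tag ON tag.tag_id = article_tag.tag_id
WHERE tag.tagname = 'mysql';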
You can create an indexed view that aggregates all the counts you need and is automatically maintained:
create view TagCounts
with schemabinding
as select TagId, count_big(*) as Occurances
from dbo.ArticleTags
group by TagId;
go
create unique clustered index cdxTagCounts on TagCounts (TagId);
go
Now the TagCounts.Occurances field is automatically maintained by SQL Server whenever you insert/delete/update the underlying ArticleTags table. You can query it like:
select Occurances from dbo.TagCounts with (noexpand) where TagId = ...;
And you can cache the result with LinqToCache, as such a query matches the restrictions of Query Notifications.
The trade-off of using a pre-aggregated indexed view is scalability: since any update to an article updates the Occurances count for that article's tags, an exclusive lock is required to update the count. This implies that only one transaction can use a TagId at any moment. Depending on your traffic and on other elements of your design this restriction may or may not be acceptable.
The other alternative is a table of counts. Front ends (your ASP.NET farm) read these counts and then update an in-memory count for each operation, keeping track of the delta from the counts in the table. Periodically the front ends merge their deltas into the table (e.g. every 5 minutes) and refresh the in-memory copy. This way front ends see a stale version of the truth, but a user sees immediate feedback for his own actions: because of session stickiness his HTTP requests are processed by the same front end, so he immediately sees his own article updates reflected in the tag counts. Users do not immediately see the updates from other users that are load-balanced to another front end, though. Because a crash of the front end (or a process recycle...) will lose the deltas kept so far, the count table will drift away from the truth over time and will have to be periodically corrected to the true counts in the database.
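A rough sketch of the counts table and the periodic merge a front end might run (names and parameters are hypothetical; the deltas themselves live in application memory):
-- The shared, slightly stale counts that every front end reads.
CREATE TABLE TagCountsCache (TagId int PRIMARY KEY, Occurances bigint NOT NULL);

-- Periodic merge of one front end's accumulated delta for a tag.
UPDATE TagCountsCache
SET Occurances = Occurances + @Delta
WHERE TagId = @TagId;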
If you want even more accuracy (all users see the true count immediately) then you can do something based on fast in-memory key-value stores, which would be basically the same as my first proposal but with much higher throughput/lower latency, perhaps something based on memcached + Redis. I'm not acquainted with SO's architecture, but I believe they may be doing something similar.
You could use this query to get the number of occurrences by tag:
SELECT Tags.TagId, COUNT(Articles.TagId) AS Occurances
FROM Articles
JOIN Tags ON Tags.TagId = Articles.TagId
GROUP BY Tags.TagId
It could be used in a view or stored procedure, and you can set up your website's cache to requery it as often as required.
If you are using a relational database, the correct way to handle this problem is to NOT store the occurrences on the table itself, but rather dynamically query the number of occurrences on the articles table.
If you don't do it this way, you're stuck coding update queries every time you add or delete a row... generally not nice. If you query dynamically, you won't have an occurrences column in the table, but will instead get that information in, for example, your presentation/model layer code.
Use:
SELECT COUNT(*) FROM ARTICLES WHERE TagId = 'xxx' ;
This line is part of iterating code.

Athletics Ranking Database - Number of Tables

I'm fairly new to this so you may have to bear with me. I'm developing a database for a website with athletics rankings on them and I was curious as to how many tables would be the most efficient way of achieving this.
I currently have 2 tables, a table called 'athletes' which holds the details of all my runners (potentially around 600 people/records) which contains the following fields:
mid (member id - primary key)
firstname
lastname
gender
birthday
nationality
And a second table, 'results', which holds all of their performances and has the following fields:
mid
eid (event id - primary key)
eventdate
eventcategory (road, track, field etc)
eventdescription (100m, 200m, 400m etc)
hours
minutes
seconds
distance
points
location
The second table already has around 2000 records in it and this will potentially quadruple over time, mainly because there are around 30 track events, 10 field, 10 road, plus cross country, relays, multi-events etc., and with 600 athletes in my first table that equates to a large number of records in my second table.
So what I was wondering is would it be cleaner/more efficient to have multiple tables to separate track, field, cross country etc?
I want to use the database to order people's results based on their performance. If you would like to understand better what I am trying to emulate, take a look at this website: http://thepowerof10.info
Changing the schema won't change the number of results. Even if you split the venue into a separate table, you'll still have one result per participant at each event.
The potential benefit of having a separate venue table would be better normalization. A runner can have many results, and a given venue can have many results on a given date. You won't have to repeat the venue information in every result record.
You'll want to pay attention to indexes. Every table must have a primary key. Add additional indexes for columns you use in WHERE clauses when you select.
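For example, with the tables described in the question you might add indexes like these (the index names are just illustrative):
-- Results are typically filtered by athlete, or by event type and date.
CREATE INDEX idx_results_mid   ON results (mid);
CREATE INDEX idx_results_event ON results (eventdescription, eventdate);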
Here's a discussion about normalization and what it can mean for you.
PS - Thousands of records won't be an issue. Large databases are on the order of giga- or tera-bytes.
My thought --
Don't break your events table into separate tables for each type (track, field, etc.). You'll have a much easier time querying the data back out if it's all there in the same table.
Otherwise, your two tables look fine -- it's a good start.

Storing Preferences/One-to-One Relationships in Database

What is the best way to store settings for certain objects in my database?
Method one: Using a single table
Table: Company {CompanyID, CompanyName, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
Method two: Using two tables
Table Company {CompanyID, CompanyName}
Table2 CompanySettings {CompanyID, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
I would take things a step further...
Table 1 - Company
CompanyID (int)
CompanyName (string)
Example
CompanyID 1
CompanyName "Swift Point"
Table 2 - Contact Types
ContactTypeID (int)
ContactType (string)
Example
ContactTypeID 1
ContactType "AutoEmail"
Table 3 Company Contact
CompanyID (int)
ContactTypeID (int)
Addressing (string)
Example
CompanyID 1
ContactTypeID 1
Addressing "name@address.blah"
This solution gives you extensibility as you won't need to add columns to cope with new contact types in the future.
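A sketch of those three tables (column types are illustrative), followed by the query that brings a company's contact settings back as rows:
CREATE TABLE Company (
    CompanyID   int PRIMARY KEY,
    CompanyName varchar(100) NOT NULL
);
CREATE TABLE ContactType (
    ContactTypeID int PRIMARY KEY,
    ContactType   varchar(50) NOT NULL
);
CREATE TABLE CompanyContact (
    CompanyID     int NOT NULL REFERENCES Company (CompanyID),
    ContactTypeID int NOT NULL REFERENCES ContactType (ContactTypeID),
    Addressing    varchar(255),
    PRIMARY KEY (CompanyID, ContactTypeID)
);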
SELECT
[company].CompanyID,
[company].CompanyName,
[contacttype].ContactTypeID,
[contacttype].ContactType,
[companycontact].Addressing
FROM
[company]
INNER JOIN
[companycontact] ON [companycontact].CompanyID = [company].CompanyID
INNER JOIN
[contacttype] ON [contacttype].ContactTypeID = [companycontact].ContactTypeID
This would give you multiple rows for each company. A row for "AutoEmail" a row for "AutoPrint" and maybe in the future a row for "ManualEmail", "AutoFax" or even "AutoTeleport".
Response to HLEM.
Yes, this is indeed the EAV model. It is useful where you want to have an extensible list of attributes with similar data. In this case, varying methods of contact with a string that represents the "address" of the contact.
If you didn't want to use the EAV model, you should next consider relational tables, rather than storing the data in flat tables. This is because this data will almost certainly extend.
Neither EAV model nor the relational model significantly slow queries. Joins are actually very fast, compared with (for example) a sort. Returning a record for a company with all of its associated contact types, or indeed a specific contact type would be very fast. I am working on a financial MS SQL database with millions of rows and similar data models and have no problem returning significant amounts of data in sub-second timings.
In terms of complexity, this isn't the most technical design in terms of database modelling and the concept of joining tables is most definitely below what I would consider to be "intermediate" level database development.
I would consider whether you need one or two tables based on the following criteria:
First, are you close to the record storage limit? Then two tables, definitely.
Second, will you usually be querying the information you plan to put in the second table most of the time you query the first table? Then one table might make more sense. If you usually do not need the extended information, a separate (and less wide) table should improve performance on the main data queries.
Third, how strong a possibility is it that you will ever need multiple values? If it is one-to-one now, but something like email address or phone number has a strong possibility of morphing into multiple rows, go ahead and make it a related table. If you know there is no chance, or only a small one, then it is OK to keep it in one table, assuming the table isn't too wide.
EAV tables look like they are nice and will save future work, but in reality they don't. Generally, if you need to add another type, you still need to do further work to adjust queries etc. Writing a script to add a column takes all of five minutes; the other work will be needed regardless of the structure. EAV tables are also very hard to query when you don't know how many records you will need to pull, because normally you want them on one line and will get the information by joining to the same table multiple times. This causes performance problems and locking, especially if this table is central to your design. Don't use this method.
It depends on whether you will ever need more information about a company. If you notice yourself adding fields like companyphonenumber1, companyphonenumber2, etc., then method 2 is better, as you would separate your entities and just reference a company id. If you do not plan to make these changes and you feel that this table will never change, then method 1 is fine.
Usually, if you don't have data duplication then a single table is fine.
In your case you don't so the first method is OK.
I use one table if I estimate the data from the "second" table will be used in more than 50% of my queries. Use two tables if I need multiple copies of the data (i.e. multiple phone numbers, email addresses, etc)
