Several tables vs one table with many column families - database

I am in the process of designing a Bigtable database.
I have two structures to persist:
"Data" (in a SQL database I would put it in one table): type, processed data, timestamp.
"Payload": raw, unprocessed Data information, timestamp.
Question: should I make two separate tables for Data and Payload? Or is it better to make one table (for example, Data) with two column families (processed and unprocessed) and the corresponding columns?
What are the advantages/disadvantages of using one table with two column families versus two tables?

It is advisable to have as few tables as possible. Sending requests to many different tables can increase back-end connection overhead, resulting in increased tail latency. Having multiple tables of different sizes can disrupt the behind-the-scenes load balancing that makes Bigtable function well.
Refer to this link for more guidance on designing your schema.

Related

SQLite performance advice for .net

I am using SQLite in my application. The scenario is that I have stock market data and each company is a database with one table. That table stores records which can range from a couple thousand to half a million.
Currently, when I update the data in real time, I open a connection, check whether that particular record exists, insert it if it does not, and close the connection. This is done in a loop, and each database (representing a company) is updated. The number of records inserted is low and is not the problem. But is this process okay?
An alternative way is to have one database with many tables (each company can be a table) and each table can have a lot of records. Is this better or not?
You can expect around 500 companies. I am coding in VS 2010; the language is VB.NET.
The optimal organization for your data is to make it properly normalized, i.e., put all data into a single table with a company column.
This is better for performance because the table- and database-related overhead is reduced.
Queries can be sped up with indexes, but what indexes you need depends on the actual queries.
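For illustration, a minimal sketch of that single-table layout in SQLite; the quotes table and its columns are assumptions, not taken from the question. The unique constraint lets INSERT OR IGNORE replace the per-record "check, then insert" round trip:

    -- Hypothetical single-table layout: one table with a company column.
    CREATE TABLE IF NOT EXISTS quotes (
        company    TEXT NOT NULL,   -- which company the record belongs to
        quote_time TEXT NOT NULL,   -- timestamp of the record
        price      REAL,
        volume     INTEGER,
        UNIQUE (company, quote_time)  -- also gives an index led by company
    );

    -- Replaces the "open, check if it exists, insert" sequence: rows that
    -- already exist are silently skipped.
    INSERT OR IGNORE INTO quotes (company, quote_time, price, volume)
    VALUES ('ACME', '2010-06-01 10:15:00', 12.34, 1000);

    -- Per-company queries use the index created by the unique constraint.
    SELECT price, volume FROM quotes
    WHERE company = 'ACME'
    ORDER BY quote_time;

One batch of inserts can then run inside a single transaction over one open connection, rather than one connection per company database.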
I did something similar, with similarly sized data in another field. It depends a lot on your indexes. Ultimately, separating each large table was best (one table per file, representing a cohesive unit, in your case one company). Plus you gain the advantage of each company's table having the same name, versus having many tables of different names that share the same schema (and no sanitizing of company names is required to create new tables).
Internally, other DBMSs often keep at least one file per table in their internal structure; SQL is thus just a layer of abstraction above that. SQLite (despite its creators' boasting) is meant for small projects, and querying larger data models will get more finicky to make work well.

Database design: one large table versus several smaller tables

I have to create a database to store information being sent and received to/from a 3rd party web service portal. There are about 150 fields of information to be sent, though I can remove about 50 of those fields by normalising (there are three sets of addresses that can be saved in an address table, for example). However, this still leaves a table that could potentially have 100 columns.
I've come up with two ways of handling this though I'm not sure which to use:
1. Have a table with 100 columns and three references to an address table.
2. Break it down into maybe 15-20 separate dedicated tables.
Option 1 seems the quickest as it involves the fewest joins, but the idea of a table with 100 columns doesn't feel right.
Option 2 feels better and would break things down into more manageable chunks, but it won't save any database space and will increase the number of joins. Pretty much all the columns in the database will have a value and I cannot normalise these columns any further.
My question is, in this situation is it acceptable to have a table with c.100 columns in it, or should I try to break it down over several tables for presentation?
Please note: the table structure will not change over the course of its usage; a new database would be created for a new version of the web service portal. I have no control over the web service data structure.
Edit: @Oded's answer below has made me think a bit more about how the data will be accessed; it will really only be accessed in whole and not in part. I wouldn't, for example, need to return columns 5-20 on a regular basis.
Answer: I accepted Oded's answer; the comments posted after it helped me make up my mind, and I decided to go with option 1. As the data is accessed in full, having one table seems the better solution. If, for example, I regularly wanted to access columns 5-20 rather than the full table row, then I'd see about breaking it up into separate tables for performance reasons.
Speaking from a relational purist point of view - first, there is nothing against having 100 columns in a table, if they are related. The point here is that if after normalizing you still have 100 columns, that's OK.
But you should normalize, and in the process you may very well end up with 15-20 separate dedicated tables, which most relational database professionals would agree is a better design (avoid data duplication with the update/delete issues associated, smaller data footprint etc...).
Pragmatically, however, if there is a measurable performance problem, it may be sensible to denormalize your design for performance benefit. The key here is measurable. Don't optimize before you have an actual problem.
In that respect, I'd say you should go with the set of 15-20 tables as an initial design.
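To make the normalisation concrete, here is a hedged sketch (SQL Server syntax, all names invented) of the address extraction the question already identifies; the same pattern extends to the other dedicated tables:

    -- Hypothetical address table, referenced three times from the main table.
    CREATE TABLE address (
        address_id INT IDENTITY PRIMARY KEY,
        line1      NVARCHAR(100),
        line2      NVARCHAR(100),
        city       NVARCHAR(50),
        postcode   NVARCHAR(10)
    );

    CREATE TABLE submission (
        submission_id       INT IDENTITY PRIMARY KEY,
        billing_address_id  INT REFERENCES address (address_id),
        delivery_address_id INT REFERENCES address (address_id),
        contact_address_id  INT REFERENCES address (address_id),
        -- ... the remaining web-service fields, or foreign keys to further
        -- dedicated tables if you go with the 15-20 table design ...
        field_001           NVARCHAR(255),
        field_002           NVARCHAR(255)
    );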
From MSDN: Maximum Capacity Specifications for SQL Server:
Columns per nonwide table: 1,024
Columns per wide table: 30,000
So I think 100 columns is OK in your case. You may also want to note (from the same link):
Columns per primary key: 16
Of course, this only applies if you need the data purely as a log for the service.
If, after reading from the service, you need to maintain the data, then normalising seems better...
If you find it easier to "manage" tables with fewer columns, however you happen to define manageability (e.g. less horizontal scrolling when looking at the table data in SSMS), you can break the table up into several tables with 1-to-1 relationships without violating the rules of normalization.
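A minimal sketch of such a 1-to-1 split, with invented table and column names; the second table simply shares the first table's primary key:

    -- Hypothetical 1-to-1 split purely for manageability.
    CREATE TABLE submission_core (
        submission_id INT PRIMARY KEY,
        field_001     NVARCHAR(255),
        field_002     NVARCHAR(255)
        -- ... the columns you look at most often ...
    );

    CREATE TABLE submission_extra (
        submission_id INT PRIMARY KEY
            REFERENCES submission_core (submission_id),
        field_050     NVARCHAR(255),
        field_051     NVARCHAR(255)
        -- ... the rest of the columns ...
    );

    -- Reassembling the full row is a single join on the shared key.
    SELECT c.*, e.*
    FROM submission_core AS c
    JOIN submission_extra AS e ON e.submission_id = c.submission_id;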

Is there a disadvantage to having large columns in your database?

My database stores user stats on a variety of questions. There is no table of question types, so instead of using a join table on the question types, I've just stored the user stats for each type of question the user has done in a serialized hash-map in the user table. Obviously this has led to some decently sized user rows: the serialized stats for my own user are around 950 characters, and I can imagine them easily growing to 5 kB on power users.
I have never read an example of a column this large in any book. Will performance be greatly hindered by having such large/variable columns in my table? Should I add in a table for question types, and make the user stats a separate table as well?
I am currently using PostgreSQL, if that's relevant.
I've seen this serialized approach on systems like ProcessMaker, which is a web workflow and BPM app and stores its data in a serialized fashion. It performs quite well, but building reports based on this data is really tricky.
You can (and should) normalize your database, which is OK if your information model doesn't change very often.
Otherwise, you may want to try non-relational databases like RavenDB, MongoDB, etc.
The big disadvantage has to do with what happens with a SELECT *. If you have a specific field list, you are not likely to have a big problem, but with SELECT * on a lot of TOASTed columns you get a lot of extra random disk I/O unless everything fits in memory. Selecting fewer columns makes things better.
In an object-relational database like PostgreSQL, database normalization poses different tradeoffs than in a purely relational model. In general it is still a good thing (as I say, push the relational model as far as it can comfortably go before doing OR stuff in your db), but it isn't the absolute necessity that you might think it is in a purely relational db. Additionally, you can add functions to process that data with regexps, extract elements from JSON, etc., and pull those back into your relational queries. So for data that cannot comfortably be normalized, big amorphous "docdb" fields are not that big of a problem.
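As a hedged illustration of that last point, with invented names and assuming a reasonably recent PostgreSQL with jsonb: the serialized stats can be queried, filtered, and even indexed in place:

    -- Hypothetical: user stats kept as a jsonb blob inside the user table.
    CREATE TABLE users (
        user_id BIGSERIAL PRIMARY KEY,
        name    TEXT NOT NULL,
        stats   JSONB NOT NULL DEFAULT '{}'::jsonb
    );

    -- Pull individual elements of the blob back into a relational query.
    SELECT user_id,
           name,
           (stats -> 'geography' ->> 'correct')::int AS geography_correct
    FROM   users
    WHERE  (stats -> 'geography' ->> 'correct')::int > 10;

    -- An expression index keeps such lookups from scanning every row.
    CREATE INDEX users_geo_correct_idx
        ON users (((stats -> 'geography' ->> 'correct')::int));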
Depends on the predominant queries you need:
If you need queries that select all (or most) of the columns, then this is the optimal design.
If, however, you select mostly on a subset of columns, then it might be worth trying to "vertically partition"1 the table, so you avoid I/O for the "unneeded" columns and increase the cache efficiency.2
Of course, all this is under assumption that the serialized data behaves as "black box" from the database perspective. If you need to search or constrain that data in some fashion, then just storing a dummy byte array would violate the principle of atomicity and therefore the 1NF, so you'd need to consider normalizing your data...
1 I.e. move the rarely used columns to a second table, which is in 1:1 relationship to the original table. If you are using BLOBs, similar effect could be achieved by declaring what portion of the BLOB should be kept "in-line" - the remainder of any BLOB that exceeds that limit will be stored to a set of pages separate from the table's "core" pages.
2 DBMSes typically implement caching at the page level, so the wider the rows, the less of them will fit into a single page on disk, and therefore into a single page in cache.
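To make footnote 1 concrete, a rough sketch (names invented, PostgreSQL syntax) of vertically partitioning that hypothetical user table, so the wide serialized column is read only when explicitly requested:

    -- Narrow, frequently scanned table.
    CREATE TABLE users (
        user_id BIGSERIAL PRIMARY KEY,
        name    TEXT NOT NULL
    );

    -- Rarely needed, wide column in a 1:1 companion table.
    CREATE TABLE user_stats (
        user_id BIGINT PRIMARY KEY REFERENCES users (user_id),
        stats   TEXT NOT NULL   -- the serialized hash-map
    );

    -- Hot queries touch only the narrow table...
    SELECT user_id, name FROM users WHERE name = 'alice';

    -- ...and the wide column is joined in only when actually needed.
    SELECT u.name, s.stats
    FROM   users u
    JOIN   user_stats s ON s.user_id = u.user_id
    WHERE  u.user_id = 42;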
You can't search in serialized arrays.

In what way does denormalization improve database performance?

I have heard a lot about denormalization, which is done to improve the performance of certain applications. But I've never tried to do anything related.
So I'm just curious: which places in a normalized DB make performance worse, or, in other words, what are the principles of denormalization?
How can I use this technique if I need to improve performance?
Denormalization is generally used to either:
Avoid a certain number of queries
Remove some joins
The basic idea of denormalization is that you'll add redundant data, or group some, to be able to get that data more easily, at a smaller cost, which is better for performance.
A quick example?
Consider a "Posts" and a "Comments" table, for a blog
For each Post, you'll have several lines in the "Comment" table
This means that to display a list of posts with the associated number of comments, you'll have to:
Do one query to list the posts
Do one query per post to count how many comments it has (Yes, those can be merged into only one, to get the number for all posts at once)
Which means several queries.
Now, if you add a "number of comments" field into the Posts table:
You only need one query to list the posts
And no need to query the Comments table: the number of comments is already denormalized into the Posts table.
And one query that returns one more field is better than several queries.
Now, there are some costs, yes:
First, this costs some space both on disk and in memory, as you have some redundant information:
The number of comments is stored in the Posts table
And you can also find that number by counting rows in the Comments table
Second, each time someone adds/removes a comment, you have to:
Save/delete the comment, of course
But also, update the corresponding number in the Posts table.
But, if your blog has a lot more people reading than writing comments, this is probably not so bad.
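To make that concrete, a minimal sketch (all table and column names invented) of the denormalized counter and its write-side cost:

    -- Hypothetical blog schema with a denormalized comment counter.
    CREATE TABLE posts (
        post_id       BIGSERIAL PRIMARY KEY,
        title         TEXT NOT NULL,
        comment_count INTEGER NOT NULL DEFAULT 0   -- redundant, denormalized
    );

    CREATE TABLE comments (
        comment_id BIGSERIAL PRIMARY KEY,
        post_id    BIGINT NOT NULL REFERENCES posts (post_id),
        body       TEXT NOT NULL
    );

    -- Listing posts with their comment counts is now a single-table query.
    SELECT post_id, title, comment_count FROM posts;

    -- The cost: every comment insert/delete must also touch posts,
    -- in application code or in a trigger, e.g.:
    UPDATE posts SET comment_count = comment_count + 1 WHERE post_id = 42;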
Denormalization is a time-space trade-off. Normalized data takes less space, but may require joins to construct the desired result set, hence more time. If it's denormalized, data is replicated in several places. It then takes more space, but the desired view of the data is readily available.
There are other time-space optimizations, such as
denormalized view
precomputed columns
As with any such approach, this improves reading data (because it is readily available), but updating data becomes more costly (because you need to update the replicated or precomputed data).
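A hedged illustration of both ideas (PostgreSQL syntax, invented names): a stored generated column as the precomputed value, and a materialized view as the denormalized view:

    -- Precomputed column: the line total is stored at write time instead of
    -- being recomputed by every query (PostgreSQL 12+ generated columns).
    CREATE TABLE order_lines (
        order_id   BIGINT  NOT NULL,
        product_id BIGINT  NOT NULL,
        quantity   INTEGER NOT NULL,
        unit_price NUMERIC(10, 2) NOT NULL,
        line_total NUMERIC(12, 2)
            GENERATED ALWAYS AS (quantity * unit_price) STORED
    );

    -- Denormalized (materialized) view: the join/aggregate is paid once at
    -- refresh time, not on every read.
    CREATE MATERIALIZED VIEW order_totals AS
    SELECT order_id, SUM(line_total) AS order_total
    FROM   order_lines
    GROUP  BY order_id;

    REFRESH MATERIALIZED VIEW order_totals;  -- reads go stale until refreshed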
The word "denormalizing" leads to confusion of the design issues. Trying to get a high performance database by denormalizing is like trying to get to your destination by driving away from New York. It doesn't tell you which way to go.
What you need is a good design discipline, one that produces a simple and sound design, even if that design sometimes conflicts with the rules of normalization.
One such design discipline is star schema. In a star schema, a single fact table serves as the hub of a star of tables. The other tables are called dimension tables, and they are at the rim of the schema. The dimensions are connected to the fact table by relationships that look like the spokes of a wheel. Star schema is basically a way of projecting multidimensional design onto an SQL implementation.
Closely related to star schema is snowflake schema, which is a little more complicated.
If you have a good star schema, you will be able to get a huge variety of combinations of your data with no more than a three-way join, involving two dimensions and one fact table. Not only that, but many OLAP tools will be able to decipher your star design automatically and give you point-and-click, drill-down, and graphical analysis access to your data with no further programming.
Star schema design occasionally violates second and third normal forms, but it results in more speed and flexibility for reports and extracts. It's most often used in data warehouses, data marts, and reporting databases. You'll generally have much better results from star schema or some other retrieval oriented design, than from just haphazard "denormalization".
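For concreteness, a minimal sketch of a star schema (an invented sales example): one fact table, two dimension tables, and a report query that needs no more than one join per dimension:

    CREATE TABLE dim_date (
        date_id  INTEGER PRIMARY KEY,
        the_date DATE    NOT NULL,
        year     INTEGER NOT NULL,
        month    INTEGER NOT NULL
    );

    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        category   TEXT NOT NULL
    );

    CREATE TABLE fact_sales (
        date_id    INTEGER NOT NULL REFERENCES dim_date (date_id),
        product_id INTEGER NOT NULL REFERENCES dim_product (product_id),
        units_sold INTEGER NOT NULL,
        revenue    NUMERIC(12, 2) NOT NULL
    );

    -- Revenue by year and product category: two dimensions, one fact table.
    SELECT d.year, p.category, SUM(f.revenue) AS revenue
    FROM   fact_sales f
    JOIN   dim_date    d ON d.date_id    = f.date_id
    JOIN   dim_product p ON p.product_id = f.product_id
    GROUP  BY d.year, p.category;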
The critical issues in denormalizing are:
Deciding what data to duplicate and why
Planning how to keep the data in synch
Refactoring the queries to use the denormalized fields.
One of the easiest types of denormalization is to copy an identity field into other tables to avoid a join. As identities should never change, the issue of keeping the data in sync rarely comes up. For instance, we populate our client id to several tables because we often need to query them by client and do not necessarily need, in the queries, any of the data in the tables that would sit between the client table and the table we are querying if the data were totally normalized. You still have to do one join to get the client name, but that is better than joining to six parent tables to get the client name when that is the only piece of data you need from outside the table you are querying.
However, there would be no benefit to this unless we were often doing queries where data from the intervening tables was needed.
Another common denormalization might be to add a name field to other tables. As names are inherently changeable, you need to ensure that the names stay in sync, using triggers. But if this means joining to two tables instead of five, it can be worth the cost of the slightly longer insert or update.
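As a rough illustration of keeping such a denormalized name in sync with a trigger (SQL Server syntax; the clients and orders tables and their columns are invented):

    -- Assumes clients(client_id, client_name) and a denormalized
    -- client_name column on orders(client_id, client_name, ...).
    CREATE TRIGGER trg_clients_sync_name
    ON clients
    AFTER UPDATE
    AS
    BEGIN
        IF UPDATE(client_name)
            UPDATE o
            SET    o.client_name = i.client_name
            FROM   orders AS o
            JOIN   inserted AS i ON i.client_id = o.client_id;
    END;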
If you have certain requirements, like reporting etc., it can help to denormalize your database in various ways:
introduce some data duplication to save yourself some JOINs (e.g. fill certain information into a table and be OK with the duplicated data, so that all the data is in that table and doesn't need to be found by joining another table)
you can pre-compute certain values and store them in a table column, instead of computing them on the fly every time you query the database. Of course, those computed values might get "stale" over time and you might need to re-compute them at some point, but just reading out a stored value is typically cheaper than computing something (e.g. counting child rows)
There are certainly more ways to denormalize a database schema to improve performance, but you just need to be aware that you do get yourself into a certain degree of trouble doing so. You need to carefully weigh the pros and cons - the performance benefits vs. the problems you get yourself into - when making those decisions.
Consider a database with a properly normalized parent-child relationship.
Let's say the cardinality is an average of 2:1.
You have two tables: Parent, with p rows, and Child, with 2p rows.
The join operation means that for p parent rows, 2p child rows must be read. The total number of rows read is p + 2p = 3p.
Consider denormalizing this into a single table containing only the child rows, 2p of them. The number of rows read is 2p.
Fewer rows == less physical I/O == faster.
As per the last section of this article,
https://technet.microsoft.com/en-us/library/aa224786%28v=sql.80%29.aspx
one could use Virtual Denormalization, where you create views with some denormalized data for running simpler SQL queries faster, while the underlying tables remain normalized for faster add/update operations (so long as you can get away with updating the views at regular intervals rather than in real time). I'm just taking a class on relational databases myself but, from what I've been reading, this approach seems logical to me.
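A hedged sketch of the idea (invented orders/clients/address tables): the base tables stay normalized and a view presents the pre-joined shape that reports want. A plain view like this is recomputed per query; the "refresh at intervals" variant described in the article corresponds to an indexed or materialized view:

    CREATE VIEW order_report AS
    SELECT o.order_id,
           o.order_date,
           c.client_name,
           a.city,
           a.postcode
    FROM   orders  o
    JOIN   clients c ON c.client_id  = o.client_id
    JOIN   address a ON a.address_id = o.ship_address_id;

    -- Reporting queries read from the view as if it were one wide table.
    SELECT client_name, COUNT(*) AS orders_placed
    FROM   order_report
    GROUP  BY client_name;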
Benefits of de-normalization over normalization
Basically, de-normalization is used for a DBMS, not for an RDBMS. As we know, an RDBMS works with normalization, which means data is not repeated again and again (though some data is still repeated when you use foreign keys).
When you use a DBMS, there is a need to remove normalization, which requires repetition of data. It still improves performance, because there is no relation among the tables and each table has an independent existence.

Database design--billions of records in one table?

Let's say you're creating a database to store messages for a chat room application. There's an infinite number of chat rooms (they're created at run-time on-demand), and all messages need to be stored in the database.
Would it be a mistake to create one giant table to store messages for all chat rooms, knowing that there could eventually be billions of records in that one table?
Would it be more prudent to dynamically create a table for each room created, and store that room's messages only in that table?
It would be proper to have a single table. When you have n tables that grow with application usage, you're describing using the database itself as a table of tables, which is not how an RDBMS is designed to work. Billions of records in a single table is trivial on a modern database. At that level, your only performance concerns are good indexes and how you do joins.
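For concreteness, a minimal sketch of such a single table (PostgreSQL syntax; the names and columns are assumed, not from the question). The composite index is what keeps "latest messages for one room" cheap even with billions of rows:

    CREATE TABLE messages (
        message_id BIGSERIAL   PRIMARY KEY,
        room_id    BIGINT      NOT NULL,
        user_id    BIGINT      NOT NULL,
        sent_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
        body       TEXT        NOT NULL
    );

    CREATE INDEX messages_room_sent_idx ON messages (room_id, sent_at);

    -- Typical access path: one room, newest first.
    SELECT user_id, sent_at, body
    FROM   messages
    WHERE  room_id = 42
    ORDER  BY sent_at DESC
    LIMIT  50;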
Billions of records?
Assuming you constantly have 1,000 active users each posting 1 message per minute, this results in about 1.5 million messages per day and approximately 500 million messages per year.
If you still need to store chat messages several years old (what for?), you could archive them into year-based tables.
I would definitely argue against dynamic creation of room-based tables.
Whilst a table per chat room could be used, each database has limits on the number of tables that may be created, so given an infinite number of chat rooms you would be required to create an infinite number of tables, which is not going to work.
You can, on the other hand, store billions of rows of data; storage is not normally the issue, given the space. Retrieval of the information within a sensible time frame is, however, and requires careful planning.
You could partition the messages by a date range, and if planned out, you can use LUN migration to move older data onto slower storage, whilst leaving more recent data on the faster storage.
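One possible way to do this is declarative range partitioning (PostgreSQL 10+ shown here; the table and partition bounds are invented for illustration). Older partitions can then be placed in a tablespace backed by slower storage:

    CREATE TABLE messages (
        message_id BIGINT      NOT NULL,
        room_id    BIGINT      NOT NULL,
        sent_at    TIMESTAMPTZ NOT NULL,
        body       TEXT        NOT NULL
    ) PARTITION BY RANGE (sent_at);

    CREATE TABLE messages_2023 PARTITION OF messages
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

    CREATE TABLE messages_2024 PARTITION OF messages
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

    -- Queries constrained on sent_at only touch the relevant partitions.
    SELECT COUNT(*) FROM messages
    WHERE  sent_at >= '2024-06-01' AND sent_at < '2024-07-01';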
Strictly speaking, your design is right: a single table. For fields with low entropy (e.g. 'userid'), you want to link to ID tables, i.e. follow normal database normalization patterns.
You might want to think about range-based partitioning, e.g. 'copies' of your table with a year prefix, or maybe even just a 'current' table and an archive table.
Both of these approaches mean that your query semantics are more complex (consider if someone did a multi-year search): you would have to query multiple tables.
However, the upside is that your 'current' table will remain at a roughly constant size, and archiving is more straightforward (you can just drop the 2005_Chat table when you want to archive the 2005 data).
-Ace
