What do they mean when a database is "flattened out" vs. normalized?
"Flattened out" typically refers to a database where you have a single (or few) very large tables.
"Normalized" refers to whether the data has been organized into well structured, related tables. This typically reduces duplication of values across rows in a table by pulling the values into a separate table, and relating to it by ID.
For details, see Database Normalization.
A normalized database is one that is organized to minimize redundancy of data, normally via small, well-structured, related tables. An example might be a customer and all his/her orders. In a normalized database, you would have at least two (and probably more) tables: a customer table and an orders table, joined together in some fashion. In a flattened structure, the customer and order data might be in a single table.
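As a rough sketch (table and column names here are invented for illustration), the two layouts might look like this:

    -- Flattened: customer data repeated on every order row
    CREATE TABLE customer_orders (
        order_id      INT PRIMARY KEY,
        customer_name VARCHAR(100),
        customer_city VARCHAR(100),
        order_date    DATE,
        order_total   DECIMAL(10,2)
    );

    -- Normalized: customer data stored once, referenced by ID
    CREATE TABLE customers (
        customer_id   INT PRIMARY KEY,
        customer_name VARCHAR(100),
        customer_city VARCHAR(100)
    );

    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT REFERENCES customers (customer_id),
        order_date  DATE,
        order_total DECIMAL(10,2)
    );

In the flattened version, a customer's name and city are repeated on every one of their orders; in the normalized version they are stored once and referenced by ID.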
Reporting databases tend to be denormalized to allow quicker retrieval of data (where many joins may otherwise be required), whereas production or transactional databases (OLTP) tend to be (or should be) more normalized, with foreign keys established between tables.
Related
I was of the impression that in OLAP, we try to store data in a denormalized fashion to reduce the number of joins and make query processing faster. Normalization, which avoids data redundancy, was more for OLTP systems.
But then again, 2 of the common modelling approaches (star and snowflake schema) are essentially normalized schemas.
Can you help me connect the dots?
Actually, that's very perceptive; the vast majority of people simply accept the claim that OLAP schemas are denormalized. The truth is that a star is only partially denormalized - the dimension tables are highly denormalized; they typically come from joining together a lot of related tables into one. A well designed fact table, however, is normalized - each record is a bunch of values identified by a single, unique primary key which is composed of the intersection of a set of foreign keys.
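To illustrate (the table and column names are hypothetical, and the dimension tables are assumed to already exist), a fact table along those lines might look like:

    -- Normalized fact table: the primary key is the intersection of FKs
    CREATE TABLE sales_fact (
        date_key    INT NOT NULL REFERENCES date_dim (date_key),
        product_key INT NOT NULL REFERENCES product_dim (product_key),
        store_key   INT NOT NULL REFERENCES store_dim (store_key),
        quantity    INT,
        amount      DECIMAL(10,2),
        PRIMARY KEY (date_key, product_key, store_key)
    );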
Snowflake schemas are, as you surmised, even more normalized. They effectively take the dimension tables and break them into small values that are all joined together when needed. While there are constant arguments over whether this is better or worse than a star, many folks believe that these are inexpensive joins and, depending on your thinking, may be worth it.
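As a sketch of the difference, using a hypothetical product dimension:

    -- Star: one denormalized dimension table
    CREATE TABLE product_dim (
        product_key   INT PRIMARY KEY,
        product_name  VARCHAR(100),
        category_name VARCHAR(100),  -- repeated for every product in the category
        department    VARCHAR(100)   -- repeated as well
    );

    -- Snowflake: the same dimension broken into joined pieces
    CREATE TABLE category_dim (
        category_key  INT PRIMARY KEY,
        category_name VARCHAR(100),
        department    VARCHAR(100)
    );

    CREATE TABLE product_dim_sf (
        product_key  INT PRIMARY KEY,
        product_name VARCHAR(100),
        category_key INT REFERENCES category_dim (category_key)
    );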
Initially, snowflakes were sold as a way of saving disk space, because they do take up less room than denormalized dimension tables, but disk space is rarely an issue nowadays.
I personally prefer a hybrid approach that allows me to build a few levels of dimension table that ultimately can provide referential integrity to both my atomic level data but also to my aggregate fact tables.
When people use the term "normalised" they normally mean something that is in, or at least close to, 3rd normal form (3NF).
Unless you mean something significantly different by the term normalised, then neither star nor snowflake schemas are normalised. Why do you think they are normalised?
On one of our client's database there are a few snapshot tables that summarize useful information from many other tables (e.g. what was the state of each customer in each period, etc).
The snapshot tables, however, contain mostly foreign keys to their original tables. Therefore, in order to obtain useful information about a snapshot, we have to join it multiple times to the corresponding tables, and these joins often take very long. Adding indexes to all FK columns (or at least to the columns in the WHERE clauses of our queries), on the other hand, slows down the database significantly.
So my question is: wouldn't it be better to have snapshot tables with real values instead of foreign keys? And if the answer is no, wouldn't it defeat the purpose of snapshot tables if the original tables are updated? (E.g. if an item was called 'Candle' and is now called 'Lamp', the snapshot of course remains consistent, but is it really a snapshot in this case?)
I'd lean towards storing the actual data rather than FK values, for the reason you mentioned. That said, a better solution might be to relocate this historical data along with the relevant attributes (i.e. dimensions) and restructure it for analysis. Data warehousing is certainly a solution for this, although these can be very large-scale projects, so you'd need to understand the value and scope it appropriately. However, even a lightweight star schema that targets the specific events you're trying to capture could be a better solution than a large historical table with relationships to transaction-based tables (especially if the query logic against the related tables is complex).
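As a minimal sketch of the first option (names are hypothetical), the snapshot copies the values as they stood at snapshot time, rather than pointing back at the live tables:

    CREATE TABLE customer_snapshot (
        snapshot_date DATE NOT NULL,
        customer_id   INT NOT NULL,   -- kept for traceability, not joined for reporting
        customer_name VARCHAR(100),   -- value as of snapshot_date
        status        VARCHAR(20),    -- value as of snapshot_date
        PRIMARY KEY (snapshot_date, customer_id)
    );

This way a later rename in the live customer table leaves the history untouched, and reporting queries need no joins.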
I have two tables in my database, one for logins and a second for user details (the database has more than just these two tables). The logins table has 12 columns (Id, Email, Password, PhoneNumber ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo ..). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure, because this may make the table very big.
So this leads to my question: how many columns make a table big? Is there a certain or approximate number beyond which we should stop adding columns to a table and create another one, or is it up to the programmer to decide?
The number of columns isn't realistically a problem. Any kind of performance issue you seem to be worried about can be attributed to the size of the DATA in the table, i.e. if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each separate row, etc.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
Point here is, we can't really give you any advice since just the number of tables and columns and relations isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing and statistics, is what really matters.
The constraint that makes us stop adding columns to an existing table in SQL is if we exceed the maximum number of columns that the database engine can support for a single table. As can be seen here, for SQL Server that is 1024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
There are a number of reasons why decomposing a table (splitting up by columns) might be advisable. One of the first reasons a beginner should learn is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
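A small illustration of that summary (hypothetical tables): below, dept_name is determined by dept_id rather than by the key emp_id, which violates 3rd normal form; splitting the department out fixes it.

    -- Violates 3NF: dept_name depends on dept_id, not on the key emp_id
    CREATE TABLE employees_bad (
        emp_id    INT PRIMARY KEY,
        emp_name  VARCHAR(100),
        dept_id   INT,
        dept_name VARCHAR(100)
    );

    -- 3NF: every non-key column depends on the key alone
    CREATE TABLE departments (
        dept_id   INT PRIMARY KEY,
        dept_name VARCHAR(100)
    );

    CREATE TABLE employees (
        emp_id   INT PRIMARY KEY,
        emp_name VARCHAR(100),
        dept_id  INT REFERENCES departments (dept_id)
    );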
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all this depends on how many databases you are going to build, and their size and scope.
I take it that the login table contains data that is only used when the user logs into your system. For all other purposes, the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is having the data in one table and separating them using covering indexes.
One aspect of an index no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to only the index rather than the entire row. So creating a "login" index and "details" index would achieve the same benefits as separate tables. With the added benefit that any operations that do need all the data would not have to perform a join of two tables.
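In SQL Server syntax, a sketch of that idea might look like this (the column names come from the question, but the table and index names are invented):

    -- "Login" index: a query touching only these columns can be answered
    -- from the index alone, without reading the full row
    CREATE NONCLUSTERED INDEX ix_users_login
        ON users (Email)
        INCLUDE (Password, PhoneNumber);

A second "details" index built the same way over the detail columns would give each workload its own narrow sub-table, while queries needing everything still avoid a join.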
I always see, in database articles and tutorials and just everywhere databases are used, a thing called relations. What instantly comes to my mind are those little boxes with lists of field names, one field connected to another field in another box by a line.
I'm not an expert on databases (as you can probably tell), but in the little bit I've used them, I never needed relations. They always seemed redundant, as I could always use JOIN to achieve what it seemed to me they are made for. Are they redundant, or is there anything you can do with relations that you cannot do with JOIN? Or am I just talking nonsense?
Relations are not just about joins for SQL queries. Relations provide many benefits:
Data integrity
Query convenience
Third party tool integration benefits
"Self-describing" data model to future DBAs/developers working with the database
Etc
Data integrity:
Relations help to ensure that your "order records" can't exist without a "customer record", for example. Simply by defining a relationship between customer and order, the database will ensure that this cannot happen. This helps to make sure that your database doesn't become a big pile of junk data.
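A minimal sketch (assuming customers and orders tables with the obvious keys):

    -- The rule above as a constraint: the database now rejects any order
    -- whose customer_id does not match an existing customer
    ALTER TABLE orders
        ADD CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id);

    INSERT INTO orders (order_id, customer_id)
    VALUES (1, 999);  -- fails if customer 999 does not exist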
Query convenience:
Relations can make it easier to do certain types of queries. Deleting a customer record can automatically have the customer's orders deleted at the same time, thanks to the relationship between customer and order.
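For example, declaring the foreign key with a cascading delete (a sketch, replacing the plain foreign key from the previous sketch):

    -- Deleting a customer automatically deletes that customer's orders
    ALTER TABLE orders
        ADD CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
        ON DELETE CASCADE;

    DELETE FROM customers
    WHERE customer_id = 42;  -- the orders for customer 42 are removed as well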
Third party tool integration benefits
Many third party tools (O/R tools come to mind) rely on relations in order to work properly
Really, the list could go on and on... you should use them; they're very beneficial. Even if you don't perceive the value today, if you're working on a database project that will continue to grow over a long period of time, it will be to your benefit to set relationships up from the beginning.
I think that they're not that critical for small projects/one-off data models...but for anything of substance, you're better off using them.
A RELATION is a subset of the cartesian product of a set of domains (http://mathworld.wolfram.com/Relation.html). In everyday terms a relation (or more specifically a relation variable) is the data structure that most people refer to as a table (although tables in SQL do not necessarily qualify as relations).
Relations are the basis of the relational database model.
Relationships are something different. A relationship is a semantic "association among things".
I think you are actually asking about referential integrity constraints (foreign keys). A foreign key is a data integrity rule that ensures the database is consistent by preventing inconsistent data from being added to it. Don't confuse foreign keys with relations because they are very different things.
I'm assuming when you are reading about relations it is probably referring to foreign keys. If that's true, relations and joins are not different solutions for the same problem. They are 2 tools that accomplish different things, and they are usually used together.
A join, as it sounds like you know, is part of a select query that lets you get rows from more than one table.
A relation is part of the database structure itself that defines a rule. For example, if you had a city table and a country table, you should have a relation pointing each row in the city table to a row in the country table. This would ensure the integrity of the data and not allow a city row to point to a country row that doesn't exist.
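The rule described above, written out (a minimal sketch):

    CREATE TABLE country (
        country_id INT PRIMARY KEY,
        name       VARCHAR(100)
    );

    CREATE TABLE city (
        city_id    INT PRIMARY KEY,
        name       VARCHAR(100),
        -- every city row must point to an existing country row
        country_id INT NOT NULL REFERENCES country (country_id)
    );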
Asking "Why use relations when you can use joins?" to me sounds like asking 'Why do variables have types when I could read them anyway?".
The theory behind databases is based on something called Relational Algebra. "Relation" is not a database-specific term; it is derived from Relational Algebra.
The result of a JOIN is itself a relation; there can be different kinds of relations. Refer to this wiki page to learn more about what a relation exactly is.
The relationships established in a RELATIONAL database are the very core of the relational database model. In a database, we model entities. We use relationships between entities to maintain data integrity, and ensure the records are organized properly. Relationships also create indexes between related tables.
If you are not using the relationships, and/or modelling your table structure based upon the relationships between discrete entities, then you are not harnessing the true power of your relational database. Yes, you can make queries work, and yes, you can get the DB to do some useful work. But can you ensure that, say, every Employee record is properly RELATED to the proper company? Can you ensure that there is only one record for that company, and that all the employees of that company are related to that record?
Without designing your database structure around entities and the relationships between them, you might as well use a spreadsheet, or one big, flat table. RELATIONSHIPS and NORMALIZATION form the basis of the modern relational database.
An SQL table is an approximation of a relational model relation. Tables/relations (bases, views & query results) represent relations/relationships/associations. These are boxes & diamonds on ER (Entity-Relationship) & pseudo-ER diagrams. Most lines on such diagrams correspond to FK (foreign key) constraints. They are frequently but wrongly called "relations" or "relationships" but they are not. They are facts. An SQL FK says that a table's subrows appear elsewhere where they are a PK (primary key) or UNIQUE. Equivalently, it says that an entity participating in a relation/relationship/association also participates once in another one. Table meanings are necessary & sufficient to query. Constraints--including PKs, UNIQUEs & FKs--are not needed to query. They are consequences of the table relation/relationship/association choices & what situations/states can arise. They are for integrity to be enforced by the DBMS.
When Ed Codd developed the relational model of data for use with large scale databases, he based his design on the mathematics of relational calculus and algebra. The results of this kind of mathematics is predictable with mathematical precision, and Ed Codd was able to forecast with near mathematical precision how relational databases would behave before the first one was ever built.
In mathematics, a relation is a mathematical abstraction. It's a subset of the cartesian product of two or more domains, as another responder said. If that's as clear as mud to you, maybe you're not a mathematician.
No matter. A good computer scientist can understand SQL tables fairly easily, and recognize and exploit the power of an SQL JOIN. This understanding will do in place of a mathematical understanding of relations for many purposes. An SQL table materializes a mathematical relation, approximately. If you are careful with table design, you can turn "approximately" into "exactly".
In Oracle, a table cluster is a group of tables that share common columns and store related data in the same blocks. When tables are clustered, a single data block can contain rows from multiple tables. For example, a block can store rows from both the employees and departments tables rather than from only a single table:
http://download.oracle.com/docs/cd/E11882_01/server.112/e10713/tablecls.htm#i25478
Can this be done in SQL Server?
On the one hand, this sounds very much like views. Data is stored in the table, and the views provide access to only those columns within the table specified by the view's definition. (Thus, your "common columns".)
On the other hand, this sounds like how the database engine stores data on the hard drive. In SQL Server, this is done via 8 KB pages. Assuming two completely separate table definitions, there is no way to store data from two such distinct tables in the same page. (If an Oracle block is more along the lines of OS files, then that translates to SQL Server files and filegroups, at which point the answer is "yes"... but I suspect this is not what blocks are about.)
Not based on what I am reading here. In SQL Server, each table's pages are independent of other tables' pages.
On the other hand, each table can have a choice of clustered index, which can influence performance greatly. In addition, I believe partitions will influence the execution plan, and if both tables have similar partition functions, this might boost performance, but the normal objective of partitioning is not performance.
Typically, optimization of JOINs involves index strategies (in my experience, preferably with covering non-clustered indexes).
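For example, a covering non-clustered index along these lines (table and column names invented for the sketch):

    -- Covers a join on customer_id plus the selected columns, so the
    -- query can be satisfied from the index without key lookups
    CREATE NONCLUSTERED INDEX ix_orders_customer
        ON orders (customer_id)
        INCLUDE (order_date, order_total);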