(with SQL Server 2008) I have a big table (~50M records) that is fully normalized. There are 4 primary key columns, and one of them has only three possible entries: A, B, and C. The issue is that there is often a lot of redundancy for this column. That is to say, there can be many records with value A, and then many repeated records that are identical in all respects except that they have value B (and/or C). This redundancy does not always happen, but it's frequent enough that it greatly increases the record count and I wish to be rid of it.
My idea is that instead of A, B, and C being possible values of a single column, I would create three bit columns titled A, B, and C. Then, in the case of the aforementioned redundancies, I don't have to create repeated records, but can instead keep just one record and flag the A, B, and/or C columns as necessary.
This seems unorthodox, so I thought I'd see what the experts think. One consequence is that there would be three different uniqueness constraints for this table, each including all the other primary key columns plus one of the three flag columns.
[EDIT] To clarify the meaning of "many repeated records": one of the other PKs is a date column. So, for example, there could be 1000 records with different dates and entry A, and then another 1000 records with the same dates (and all other columns identical) but with entry B. That is how, even with only three choices, there can still be a lot of redundancy.
You can't have "many repeated records that are identical in all respects" except for the 4th column in the PK that takes one of A, B, or C. To me this means you have at most 3 rows (over the other 3 PK columns) differentiated only by A, B, or C.
Because of this, one unique constraint is all you need.
I'd do nothing, based on this and also because:
a row with A is a different row from a row with C
it's only 50 million rows
it's simple (no extra tables or fancy bit columns)
no stated performance issues (until you add extra tables or fancy bit columns)
you have a clear, normalised schema
Edit:
Your redundancy isn't in the ABC column. The row multiplication is caused by the datetime.
Can you change the datetime to smalldatetime and suppress near-duplicates that way? E.g. resolve to the nearest minute rather than to 3.33 milliseconds? Or, since you're on SQL Server 2008, use datetime2 and pick your resolution.
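For example, a quick way to see what each resolution keeps (a sketch, not tied to any particular column):

SELECT CAST(SYSDATETIME() AS datetime2(0))  AS ToWholeSecond,   -- datetime2(0): fractional seconds dropped
       CAST(GETDATE()     AS smalldatetime) AS ToNearestMinute; -- smalldatetime: rounded to the minute

Rounding the date column to a coarser resolution like this would make near-duplicate rows collapse into genuine duplicates that a de-duplicating load (or a unique constraint) can then suppress.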
How about creating a separate table that stores these "flags", foreign-keyed back to your original table?
Table1 (original table)
----------------------
PriKey1 (PK for Table1)
Col1
Col2
Table2 (new table)
------------------
PriKey2 (PK for Table2)
PriKey1 (FK to Table1)
A
B
C
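In T-SQL that could look something like this (the column types are guesses, since the question doesn't state them):

CREATE TABLE Table1 (
    PriKey1 int         NOT NULL PRIMARY KEY,
    Col1    varchar(50) NULL,  -- placeholder type
    Col2    varchar(50) NULL   -- placeholder type
);

CREATE TABLE Table2 (
    PriKey2 int NOT NULL PRIMARY KEY,
    PriKey1 int NOT NULL FOREIGN KEY REFERENCES Table1 (PriKey1),
    A       bit NOT NULL DEFAULT 0,
    B       bit NOT NULL DEFAULT 0,
    C       bit NOT NULL DEFAULT 0
);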
I personally wouldn't do it that way; I would create another table that stores the A, B, or C value along with the RecordID.
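Roughly like this (the names are placeholders):

CREATE TABLE RecordFlag (
    RecordID  int     NOT NULL,  -- FK back to the original table's key
    FlagValue char(1) NOT NULL CHECK (FlagValue IN ('A', 'B', 'C')),
    PRIMARY KEY (RecordID, FlagValue)
);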
The only issue I can think of off the top of my head is that you will need to change your existing code and include all 3 fields if you want to get any use out of indexing on those bit columns.
Bit fields by their nature are not very selective. To get good selectivity you will need to create a covering index on all 3 fields, and then include all 3 in your WHERE clauses so you get optimum seeks.
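For example, keeping A, B, and C as in the question but with made-up table and key column names:

CREATE NONCLUSTERED INDEX IX_BigTable_ABC ON dbo.BigTable (A, B, C);

-- All three flags appear in the WHERE clause, so the index can be used for a seek:
SELECT Key1, Key2, EventDate
FROM dbo.BigTable
WHERE A = 1 AND B = 0 AND C = 0;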
Most databases will allocate a minimum of the most efficient processing unit per field in any case, so calling them bit fields would only be a metadata difference. And unpacking bits into words is just overhead anyway, so you might as well use ints. I'm also pretty sure SQL Server doesn't index bit fields - a cardinality of 2 doesn't help much.
50M records? A small number by most accounts.
Have you tried to quantify the overhead you're trying to reduce? If nothing else, you're going to add work because of the increased complexity.
I'd have to think a long time before increasing complexity.
Is this a really stable design otherwise, and you have some extra time?
Related
I have a legacy application which has the below tables with a 1-to-1 mapping:
customer (already has 40 columns)
customer_additional_attributes (has 20 columns)
My question: wouldn't it be a better design if the customer and customer_additional_attributes tables were combined, since it would save an extra join or query whenever data has to be fetched from customer_additional_attributes?
Is there any disadvantage to a single table (as in the above scenario) with a large number of columns?
The data format that you have is called "vertical partitioning". This is when rows of an entity are split across multiple tables. In a normalized structure, this is problematic, because inserts of rows (for instance) are not necessarily atomic -- they affect two tables.
But there are good reasons for doing this. The most obvious is when the rows are too wide. If the columns are too wide, they simply will not fit in one table, so they are spread through multiple tables.
Similarly, if some columns are much larger -- and rarely used -- then putting them in another table can be a big win on performance.
Before combining the tables, you should consider whether the data structure is intentional. It might simply be the result of "laziness": the first table was created, and then additional attributes came along, so they were put into another table. Or it could be quite intentional, and you would want to understand why.
Note that the join between the two tables should be pretty fast, particularly if the same primary key is used for both.
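For example, assuming both tables are keyed on a customer_id column (the attribute column names here are guesses):

SELECT c.customer_id, c.first_name, a.loyalty_tier
FROM customer AS c
JOIN customer_additional_attributes AS a
    ON a.customer_id = c.customer_id
WHERE c.customer_id = 42;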
If you have a many-to-many relationship, you may have to create an intermediate table: one table for customer, one for customer_additional_attributes, and an intermediate table containing the ids of the two.
What is the best way to model a database? I have many known channels with values. Is it better to create one table with many columns, one for each channel, or to create two tables, one for values and one for channels? Like this:
Table RAW_VALUES: SERIE_ID, CHANNEL_1, ..., CHANNEL_1000
or
Table RAW_VALUES: SERIE_ID, CHANNEL_ID, VALUE
Table CHANNELS: CHANNEL_ID, NAME, UNIT, ....
My question is about the performance of searching for data, and about saving database space.
Thanks.
Usually, one would want to know what types of queries you will run against the tables, as well as the data distribution etc., to choose between the two designs. However, I think there are more fundamental issues here to guide you.
The second alternative is certainly more flexible. Adding one more channel ("Channel_1001") can be done simply by inserting rows in the two tables (a simple DML operation), whereas if you use the first option, you need to add a column to the table (a DDL operation), and that will not be usable by any programs using this table unless you modify them.
That type of flexibility alone is probably a good reason to go with the second option.
Searching will also be better served with the second option. You may create one index on the raw_values table and support indexed searches on the Channel/Value columns. (I would avoid the name "value" for a column by the way.)
Now if you consider what column(s) to index under the first option, you will probably be stumped: you have 1001 columns there. If you want to support indexed searches on the values, would you index them all? Even if you were dealing with just 10 channels, you would still need to index those 10 columns under your first option; not a good idea in general to load a table with more than a few indexes.
As an aside, if I am not mistaken, the limit is 1000 columns per table these days, but a table with more than 255 columns will store a row in multiple row pieces, each storing up to 255 columns, and that would create a lot of avoidable I/O for each select you issue against this table.
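A sketch of the second option with an index to support channel/value searches (the types are assumptions, and the value column is renamed to avoid using "value" as a column name):

CREATE TABLE CHANNELS (
    CHANNEL_ID int          NOT NULL PRIMARY KEY,
    NAME       varchar(100) NOT NULL,
    UNIT       varchar(20)  NULL
);

CREATE TABLE RAW_VALUES (
    SERIE_ID   int   NOT NULL,
    CHANNEL_ID int   NOT NULL REFERENCES CHANNELS (CHANNEL_ID),
    CH_VALUE   float NULL,
    PRIMARY KEY (SERIE_ID, CHANNEL_ID)
);

CREATE INDEX IX_RAW_VALUES_CHANNEL ON RAW_VALUES (CHANNEL_ID, CH_VALUE);

Adding "channel 1001" is then just an INSERT into CHANNELS plus rows in RAW_VALUES, with no DDL change.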
I was wondering which approach is better for designing databases?
I currently have one big table (97 columns per row), with references to lookup tables where I could use them.
Wouldn't it be better for performance to group some columns into smaller tables and add key columns to them for referencing one whole row?
If you split up your table into several parts, you'll need additional joins to get all your columns for a single row - that will cost you time.
97 columns isn't much, really - I've seen way beyond 100.
It all depends on how your data is being used - if your row just has 97 columns, all the time, and needs all 97 columns - then it really hardly ever makes sense to split those up into various tables.
It might make sense if:
you can move some "large" columns (like XML, VARCHAR(MAX) etc.) into a separate table if you don't need them all the time -> in that case, your "basic" row becomes smaller and your basic table will perform better - as long as you don't need those extra-large columns
you can move some columns that aren't always present to a separate table, e.g. columns that might be "optional" and only populated for, say, 20% of the rows - in that case, you might save yourself some processing for the remaining 80% of cases where those columns aren't needed (see the sketch after this list)
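A sketch of that kind of split, with made-up names - the wide or optional columns move to a second table that shares the base table's key:

CREATE TABLE dbo.Document (
    DocumentID int          NOT NULL PRIMARY KEY,
    Title      varchar(200) NOT NULL,
    CreatedAt  datetime     NOT NULL
);

-- Large and rarely-needed columns, populated only for the rows that have them
CREATE TABLE dbo.DocumentBody (
    DocumentID int          NOT NULL PRIMARY KEY
               REFERENCES dbo.Document (DocumentID),
    BodyXml    xml          NULL,
    Notes      varchar(max) NULL
);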
It would be better to group relevant columns into different tables. This will improve the performance of your database as well as your ease of use as the programmer. You should first try to find all the different relationships between your columns, and then attempt to break everything into tables while keeping those relationships in mind (using primary keys, foreign keys, references and so forth). Try to create a diagram like this one: http://www.simple-talk.com/iwritefor/articlefiles/354-image008.gif and take it from there.
Unless your data is denormalized, it is likely best to keep all the columns in the same table. SQL Server reads pages into the buffer pool from individual tables, so you will pay the cost of the joins on every access, even if the pages accessed are already in the buffer pool.
If you access just a few rows of the data per query with a key, then an index will serve that query fine with all columns in the same table. Even if you will scan a large percentage of the rows (> 1% of a large table) but only a few of the 97 columns, you are still better off keeping the columns in the same table, as you can use a non-clustered index that covers the query.
However, if the data is heavily denormalized, then normalizing it - which by definition breaks it into many tables based upon the rules of normalization to eliminate redundancy - will result in much improved performance, and you will be able to write queries to access only the specific data elements you need.
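For instance, the covering non-clustered index mentioned above might look like this for a query that filters on one column and reads only two others (all names here are hypothetical):

CREATE NONCLUSTERED INDEX IX_WideTable_CustomerID
    ON dbo.WideTable (CustomerID)
    INCLUDE (OrderDate, Amount);

-- Served entirely from the index, without touching the 97-column base rows:
SELECT OrderDate, Amount
FROM dbo.WideTable
WHERE CustomerID = 12345;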
I want to add an integer column to a table with a large number of rows and many indexes (it's a data warehouse fact table).
To keep the row width as narrow as possible all the columns in this table are defined as not null. So I want the new column to be not null with a default of zero.
From experience, adding this column will take some time, presumably because the database will need to rewrite all the rows with the new column filled in with the default value. And this presumably will involve updating the clustered index and all the non-clustered indexes.
So should I drop all the indexes before adding the column and then recreate them all?
Or is there an easier way to do this?
Also, I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve re-writing the records with an extra is-null bit flipped for each row?
It will require updating the clustered index, yes - this IS the table data, after all.
But I don't see why any of the non-clustered indices would have to be updated - your new column won't be a member of any of the non-clustered indices.
Also, I don't see how dropping and recreating the indices would benefit you in this scenario. If you were bulk-loading several million existing rows from another table or database - yes, then it might be faster (due to the INSERTs being much faster) - but adding a column doesn't really suffer from any indices or constraints being around, I don't think.
Marc
SQL Server is a row oriented database. This is in contrast to a column oriented database. This means that in SQL Server, all of the data for a given row is stored together on the disk. Let's have an example:
Say you have a Customer table with 3 columns, FirstName, MiddleInitial, and LastName. Then, say you have 3 records in this table for Jabba T. Hutt, Dennis T. Menace, and George W. Bush.
In a row oriented database (like SQL Server), the records will be stored on disk as such:
Jabba, T, Hutt; Dennis, T, Menace; George, W, Bush;
In contrast, a column oriented database would store the records on disk like this:
Jabba, Dennis, George; T, T, W; Hutt, Menace, Bush;
Where columns are grouped together instead of rows.
Now, when you go to add a column to a table in a row-oriented database (SQL Server, for example), the new data for that column has to be inserted alongside the existing rows, shifting the rows and requiring a lot of read/write operations. So, if you were to insert a new column for the customer prefix that defaults to 'Mr', this is what you'd get:
Mr, Jabba, T, Hutt; Mr, Dennis, T, Menace; Mr, George, W, Bush;
As you can see, all of the original data has been shifted to the right. On the other hand, when you insert a new column that defaults to NULL, no new data has to be put into the existing rows. Thus, there is less shifting, requiring fewer disk read/write operations.
Of course, this is an oversimplification of what's actually going on on disk. There are other things to take into account when dealing with indexes, pages, etc. But it should help you get the picture.
For clarification, I'm not at all suggesting you move to a column-oriented database; I just put that info in there to help explain what row-oriented means.
"Also I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve re-writng the records with an extra Is Null bit flipped for each row."
Adding a nullable column merely changes the definition of the table. The individual records are not affected.
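To illustrate the two cases the question is comparing (the table and column names are made up; the behaviour described is for the SQL Server versions discussed here):

-- Must write the default into every existing row, so the clustered index gets rewritten:
ALTER TABLE dbo.FactSales
    ADD NewMeasure int NOT NULL DEFAULT (0);

-- Metadata-only change: existing rows are left untouched:
ALTER TABLE dbo.FactSales
    ADD NewMeasureNullable int NULL;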
Good day,
In SQL Server 2005, I have a table with numerous columns, including a few boolean (bit) columns. For example,
table 'Person' has a column ID and columns HasItem1, HasItem2, HasItem3, HasItem4. This table is kinda large, so I would like to create indexes to get faster search results.
I know that it is not a good idea to create an index on a bit column, so I thought about using an index with all of the bit columns. However, the thing is, each of these bit columns may or may not be in the query. Since the order of the indexed columns is important in an index, and I don't know which ones will be used in the query, how should I handle this?
BTW, there is already a clustered index that I can't remove.
I would suggest that this is probably not a good idea. Trying to index fields with very low cardinality will generally not make queries faster and you have the overhead of maintaining the index as well.
If you generally search for one of your bit fields with another field then a composite index on the two fields would probably benefit you.
If you were to create a composite index on the bit fields, then this would help, but only if the fields at the beginning of the index are provided. If you do not include the first column of the composite index, then the index will probably not be used at all.
If, as an example, bita was used in 90% of your queries, bitd in 70%, and bits b and c in 20%, then a composite index on (bita, bitd, bitb, bitc) would probably yield some benefit, but for at least 10% of your queries - and possibly even 40% - the index would most likely not be used.
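Using the hypothetical bit names from the previous paragraph (the table and key names are placeholders too), the composite index and the two situations would look roughly like this:

CREATE NONCLUSTERED INDEX IX_SomeTable_Bits
    ON dbo.SomeTable (bita, bitd, bitb, bitc);

-- The leading column bita is filtered, so the index can be used for a seek:
SELECT SomeKey FROM dbo.SomeTable WHERE bita = 1 AND bitb = 0;

-- bita is not filtered, so this query will most likely not use the index:
SELECT SomeKey FROM dbo.SomeTable WHERE bitd = 1;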
The best advice is probably to try it with the same data volumes and data cardinality and see what the Execution plan says.
I don't know a lot of specifics about SQL Server, but in general indexing a column that has non-unique data is not very effective. In some RDBMS systems, the optimizer will ignore indexes that are less than a certain percent unique anyway, so the index may as well not even exist.
Using a composite, or multi-column, index can help, but only in particular cases where the filter constraints are in the same order that the index was built in. If your index includes 'field1, field2' and you are searching for 'field2, field1' or some other combination, the index may not be used. You could add an index for each of the particular search cases that you want to optimize; that is really all I can think of that you could do. And in the case that your data is not very unique, even after considering all of the bit fields, the index may be ignored anyway.
For example, if you have 3 bit fields, you are only segmenting your data into 8 distinct groups. If you have a reasonable number of rows in the table, segmenting it by 8 isn't going to be very effective.
Odds are it will be easier for SQL Server to query the large table with person_id, item_id, and BitValue than it will be to search a single table with Item1, Item2, ... ItemN.
I don't know about 2005 but in SQL Server 2000 (From Books Online):
"Columns of type bit cannot have indexes on them."
How about using checksum?
Add an int column named mysum to your table and execute this:
UPDATE checksumtest SET mysum = CHECKSUM(hasitem1,hasitem2,hasitem3,hasitem4)
Now you have a value that represents the combination of bits.
Do the same checksum calc in your search query and match on mysum.
This may speed things up.
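A sketch of the matching search, reusing the checksumtest example above. The casts keep the argument types in line with the bit columns that mysum was computed over, and because CHECKSUM can produce collisions, the exact bit comparisons are repeated as well:

CREATE INDEX IX_checksumtest_mysum ON checksumtest (mysum);

SELECT *
FROM checksumtest
WHERE mysum = CHECKSUM(CAST(1 AS bit), CAST(0 AS bit), CAST(1 AS bit), CAST(0 AS bit))
  AND hasitem1 = 1 AND hasitem2 = 0
  AND hasitem3 = 1 AND hasitem4 = 0;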
You should revisit the design of your database. Instead of having a table with fields HasItem1 to HasItem#, you should create a bridge entity, and a master Items table if you don't have one. The bridge entity (table), person_items, would have (a minimum of) two fields: person_id and item_id.
Designing the database this way doesn't lock you in to a database that only handles N number of items based on column definitions. You can add as many items as you want to a master Items table, and associate as many of them as you need with as many people as you need.
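A minimal sketch of that structure, assuming Person.ID is its primary key (the Items columns are placeholders):

CREATE TABLE dbo.Items (
    item_id int          NOT NULL PRIMARY KEY,
    name    varchar(100) NOT NULL
);

CREATE TABLE dbo.person_items (
    person_id int NOT NULL REFERENCES dbo.Person (ID),
    item_id   int NOT NULL REFERENCES dbo.Items (item_id),
    PRIMARY KEY (person_id, item_id)
);

-- "Does this person have item 7?" becomes a simple indexed lookup instead of a HasItemN column:
SELECT 1 FROM dbo.person_items WHERE person_id = 42 AND item_id = 7;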