How can I combine three tables in a database, but not have a large increase in data storage needs at any point in the process?
I have inherited a database with three tables. Two have the same columns, and one drops one of the columns:
Table1:
name
info
longkey
shortkey
Table2:
name
info
longkey
shortkey
Table3:
name
info
longkey
I want to create a single table with name and info columns and have no need for the remnants of other tables or the key fields. There are likely high numbers of duplicate entries between the three tables - an entry in table 1 will likely appear in table 2 and/or table 3.
The big problem, and the reason the straightforward approach (creating the combined table first and only then dropping the originals) is not suitable, is that it leads to a large temporary increase in size, and I have limited space available.
What can I do, either in SQL or perhaps through Python scripting or other methods, that will not lead to a large increase in data storage at any time during the process?
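One pattern that keeps the peak footprint small is to move rows in small batches and free source space as you go. This is a rough sketch only, assuming the originals can be modified; the column types and batch predicates are placeholders:

    -- Target table: only the two columns needed; the composite key deduplicates entries.
    -- (Column types are placeholders; adjust to the real definitions.)
    CREATE TABLE combined (
        name VARCHAR(255) NOT NULL,
        info VARCHAR(255) NOT NULL,
        PRIMARY KEY (name, info)
    );

    -- For each source table, repeat in batches: copy a slice, then delete it,
    -- so only one slice of rows ever exists in two places at once.
    INSERT INTO combined (name, info)
    SELECT DISTINCT t.name, t.info
    FROM   table1 t
    WHERE  t.shortkey BETWEEN 1 AND 10000        -- placeholder batch predicate
      AND  NOT EXISTS (SELECT 1 FROM combined c
                       WHERE c.name = t.name AND c.info = t.info);

    DELETE FROM table1
    WHERE  shortkey BETWEEN 1 AND 10000;         -- same batch predicate

Whether the deletes actually return disk space right away depends on the engine (some need a VACUUM, OPTIMIZE TABLE, or shrink step), so check that before relying on this approach.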
Related
I have a legacy application with the tables below, which have a 1-to-1 mapping:
customer (already has 40 columns)
customer_additional_attributes (has 20 columns)
My question: wouldn't it be a better design if the customer and customer_additional_attributes tables were combined, since that would save the extra join or query sometimes needed to fetch data from customer_additional_attributes?
Is there any disadvantage to a single table (as in the above scenario) with a large number of columns?
The data format that you have is called "vertical partitioning". This is when rows of an entity are split across multiple tables. In a normalized structure, this is problematic, because inserts of rows (for instance) are not necessarily atomic -- they affect two tables.
But there are good reasons for doing this. The most obvious is when the rows are too wide: if the combined columns simply will not fit in one row, they are spread across multiple tables.
Similarly, if some columns are much larger -- and rarely used -- then putting them in another table can be a big win on performance.
Before combining the tables, you should work out whether the data structure is intentional. It might simply be the result of "laziness": the first table was created, and then additional attributes came along, so they were put into another table. Or it could be quite intentional, in which case you would want to understand why.
Note that the join between the two tables should be pretty fast, particularly if the same primary key is used for both.
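For instance (the column names here are hypothetical), the 1:1 lookup is a single join on the shared primary key, which both sides can satisfy from their PK indexes:

    SELECT c.customer_id,
           c.name,
           a.loyalty_tier              -- hypothetical attribute column
    FROM   customer c
    JOIN   customer_additional_attributes a
           ON a.customer_id = c.customer_id   -- PK-to-PK join
    WHERE  c.customer_id = 42;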
If you have a many-to-many relationship, you may have to create an intermediate table: one table for customer, one for customer_attributes, and an intermediate customer_additional_attributes table containing the ids of the two tables.
I have two tables with the same columns: one is used to save the bank's amounts and the other to save the cash desk's. Both might hold a lot of data, so I'm concerned about retrieval speed. I don't know whether it's better to combine them by adding an extra column to indicate the type of each record, or to keep a separate table for each one.
The main question you should be asking is: how am I querying the tables?
If there is no real logical connection between the two tables (you don't want to get their rows in the same query), use two different tables; the other way around, you will need to hold an extra column to tell you which type of row you are working on, and that will slow you down and make your queries more complex.
In addition, FKs might be a problem if the same column is an FK to two different places.
Locks might also be an issue: if you work on one type, you might block the other.
Conclusion: two tables, and not just for speed.
In theory you have one unique entity, so you need one table for your accounts and another for your account types. For better performance you could separate these account types onto different filegroups and partitions and create an index on the type FK in the account table. In this scenario you logically have one entity, governed by relational theory, while physically the data is separated, so data retrieval stays fast.
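A minimal sketch of that single-table design, with invented table and column names and generic SQL:

    -- One transactions table with a type discriminator instead of two identical tables.
    CREATE TABLE account_transaction (
        id          BIGINT        NOT NULL PRIMARY KEY,
        source_type CHAR(1)       NOT NULL,   -- 'B' = bank, 'C' = cash desk
        amount      DECIMAL(18,2) NOT NULL,
        created_at  TIMESTAMP     NOT NULL
    );

    -- Index leading on the discriminator so queries for one source type
    -- do not have to scan the other's rows.
    CREATE INDEX ix_account_transaction_type
        ON account_transaction (source_type, created_at);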
I am reading that an RDBMS stores table data on disk in some form of B-tree, and also that table indexes are stored in B-tree form.
I read that a primary-key index is created automatically for a defined primary key, but that it can also be dropped at any time. That implies the primary-key index is an additional structure alongside the B-tree used just for storing the table data.
Isn't that a waste of resources? Why wouldn't the whole table be kept in the primary-key index?
If that's not the case, what ordering is used for the B-tree that stores the table data?
Thanks for clarifying
The primary key index is an optimization for finding the place on disk where the row is held. As a structure, it contains simply the PK data, not the whole row.
On a database, performance is often gated by how many pages are read from disk vs. cache. Since the PK index is smaller than the whole table, it is more likely to be in cache, it causes fewer blocks to be read from disk, and fewer blocks of other tables are evicted from cache. It is therefore a major performance optimization.
Further, while modifying the table data, rows are locked. If the primary key were being scanned from the table data on disk, locked rows would slow access for all the other queries. By separating the index as a separate structure, the index can be used even while the row being pointed to is locked.
So overall, the separate PK structure is a classic space-for-time optimization.
EDIT: What is the order of the rows in the table? The following answer is for Oracle, but is applicable to many databases.
Short answer: rows are not ordered on disk which is why the PK index (and other indexes) are so important.
Long answer:
While the primary-key B-tree structure is necessarily sorted (it is a B-tree, after all), the rows of the table are scattered across the tablespace. To understand this we need to drill down through the various data structures.
First, the database is structured into logical entities called tablespaces. A tablespace is the space in one or more files on one or more disks. The files start empty. When the tablespace becomes full (technically, when the data in it reaches a threshold), it can be grown automatically. It can also be grown manually by enlarging a file (adding an 'extent') or by adding new files. Tablespaces can be clustered across multiple machines as well as disks.
Second: a tablespace is divided into segments, each segment for the use of a single table or index.
Third: The segment is divided into blocks, each block has space for one or more rows. These blocks are not the same as disk or OS blocks; Oracle blocks are one or more OS blocks. (This is for transportability, and for managing media with different block sizes).
On insert, the database will select space in a block from anywhere in the tablespace. Rows can be inserted sequentially (especially when bulk inserting into an empty table), but normally the database will also reuse space where rows have been deleted or moved due to some types of update. While the placement is theoretically somewhat predictable, in practice you should never rely on or expect a row to be placed in any specific block.
One interesting thing in Oracle is the ROWID. This is the reference stored in the index that allows the DB to look up the row:
An extended rowid has a four-piece format, OOOOOOFFFBBBBBBRRR:
The first 6 characters OOOOOO represent the data object number, using 32 bits.
The next 3 characters FFF represent the tablespace-relative datafile number, using 10 bits.
The next 6 characters BBBBBB represent the block number, using 22 bits.
The last 3 characters RRR represent the row number, using 16 bits.
For much more detail, see http://docs.oracle.com/cd/E11882_01/server.112/e25789/logical.htm#autoId0
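If you want to see this on a live Oracle instance, something along these lines should work (the table name is just a placeholder; DBMS_ROWID is Oracle-specific):

    -- Decode the pieces of each ROWID for a sample of rows from a placeholder table.
    SELECT ROWID                                AS raw_rowid,
           DBMS_ROWID.ROWID_OBJECT(ROWID)       AS data_object_no,
           DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) AS relative_file_no,
           DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) AS block_no,
           DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)   AS row_no
    FROM   my_table
    WHERE  ROWNUM <= 10;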
One other thought: there is a concept in the DB world called partitions, where a dataset is divided across different tablespaces (frequently different disks or nodes in a cluster) depending on some expression logic. For example, on a table of customers, a partition could be defined by the country of the person. That way you can ensure that the US customers are physically on one disk while the Australians are on another.
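A sketch of what that might look like in Oracle, with invented table, partition, and tablespace names:

    -- Customers list-partitioned by country, each partition stored in its own tablespace.
    CREATE TABLE customers (
        customer_id NUMBER        NOT NULL,
        name        VARCHAR2(200) NOT NULL,
        country     VARCHAR2(2)   NOT NULL
    )
    PARTITION BY LIST (country) (
        PARTITION p_us    VALUES ('US')    TABLESPACE ts_us,
        PARTITION p_au    VALUES ('AU')    TABLESPACE ts_au,
        PARTITION p_other VALUES (DEFAULT) TABLESPACE ts_other
    );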
I was wondering which approach is better for designing databases?
I currently have one big table (97 columns per row), with references to lookup tables where I could use them.
Wouldn't it be better for performance to group some columns into smaller tables and give them key columns that reference the original row?
If you split up your table into several parts, you'll need additional joins to get all your columns for a single row - that will cost you time.
97 columns isn't much, really - I've seen way beyond 100.
It all depends on how your data is being used - if your row just has 97 columns, all the time, and needs those 97 columns - then it really hardly ever makes sense to split them up into various tables.
It might make sense if:
you can move some "large" columns (like XML, VARCHAR(MAX) etc.) into a separate table, if you don't need those all the time -> in that case, your "basic" row becomes smaller and your basic table will perform better - as long as you don't need those extra-large columns (see the sketch after this list)
you can move away some columns to a separate table that aren't always present, e.g. columns that might be "optional" and only present for e.g. 20% of the rows - in that case, you might save yourself some processing for the remaining 80% of the cases where those columns aren't needed.
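As referenced above, a rough sketch of splitting rarely-read wide columns into a 1:1 side table; all names are invented, and the syntax is SQL Server-flavoured because of the VARCHAR(MAX)/XML example:

    -- "Hot", frequently-read columns stay in the narrow base table.
    CREATE TABLE product (
        product_id INT           NOT NULL PRIMARY KEY,
        name       VARCHAR(200)  NOT NULL,
        price      DECIMAL(10,2) NOT NULL
    );

    -- Wide, rarely-read columns move to a 1:1 side table sharing the same key.
    CREATE TABLE product_details (
        product_id  INT NOT NULL PRIMARY KEY
            REFERENCES product (product_id),
        description VARCHAR(MAX) NULL,
        spec_xml    XML          NULL
    );

    -- Most queries touch only the narrow table; the join is paid only when needed.
    SELECT p.name, d.description
    FROM   product p
    JOIN   product_details d ON d.product_id = p.product_id
    WHERE  p.product_id = 42;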
It would be better to group relevant columns into different tables. This will improve the performance of your database as well as your ease of use as the programmer. You should first try to find all the relationships between your columns, and then break everything into tables while keeping those relationships in mind (using primary keys, foreign keys, references and so forth). Try to create a diagram like this one, http://www.simple-talk.com/iwritefor/articlefiles/354-image008.gif, and take it from there.
Unless your data is denormalized, it is likely best to keep all the columns in the same table. SQL Server reads pages into the buffer pool from individual tables, so if you split the table you will pay the cost of the joins on every access, even when the pages accessed are already in the buffer pool.
If you access just a few rows per query with a key, an index will serve that query fine with all columns in the same table. Even if you will scan a large percentage of the rows (> 1% of a large table) but only a few of the 97 columns, you are still better off keeping the columns in the same table, because you can use a non-clustered index that covers the query.
However, if the data is heavily denormalized, then normalizing it - which by definition breaks it into many tables based on the rules of normalization to eliminate redundancy - will result in much improved performance, and you will be able to write queries that access only the specific data elements you need.
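For the "scan many rows but read few columns" case mentioned above, a covering non-clustered index might look like this (SQL Server-style syntax; the table and column names are placeholders):

    -- The hypothetical query filters on status and reads only two of the 97 columns,
    -- so the index alone can answer it without touching the base table.
    CREATE NONCLUSTERED INDEX ix_big_table_status_covering
        ON big_table (status)
        INCLUDE (name, created_at);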
I have a database which logs modification records into a table. This modification table contains foreign keys to other tables (the modification table only contains references to objects modified).
Objects in this modification table can be grouped into different populations. When a user accesses the service, he only queries the database for objects in his population.
I will have about 2 to 10 new populations each week.
This table is queried by smartphones very often and will contain about 500,000 to 1,000,000 records.
If I split the modification table into many tables, there is no table join to do to answer user requests.
If I change this single table into many tables, I guess it will speed up the response time.
But on the other hand, each insert into the modification table will first require the name of the target table (which implies another query). To do so, I plan to add a column to the "population" table with a varchar holding the name of the target modification table.
My question is a design-pattern / architecture one: should I go for a single huge table with three WHERE clauses for each request, or should I try many light tables with no WHERE clause at all?
The cleanest thing would be to use one table and partition it on the populations. Partitions are made for this.
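A sketch of that, using MySQL-flavoured syntax and invented table and column names:

    -- One modification table, hash-partitioned on the population id.
    CREATE TABLE modification (
        id            BIGINT   NOT NULL,
        object_id     BIGINT   NOT NULL,
        population_id INT      NOT NULL,
        modified_at   DATETIME NOT NULL,
        PRIMARY KEY (id, population_id)   -- MySQL requires the partition key in every unique key
    )
    PARTITION BY HASH (population_id)
    PARTITIONS 16;

    -- A query filtering on population_id is pruned to a single partition:
    SELECT object_id, modified_at
    FROM   modification
    WHERE  population_id = 7
      AND  modified_at > '2013-01-01';

Note that some engines restrict foreign keys on partitioned tables (MySQL's InnoDB, for one), which makes the foreign-key question in the next answer relevant here as well.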
500K - 1M records isn't trivial - but it certainly isn't huge either. What's your database platform? Most mainstream professional platforms (SQL Server, Oracle, MySQL, etc.) are more than capable of handling this.
If the table in question is narrow (has few columns), it's less likely to be an issue than "wide" tables with lots of columns.
Having lots of joins could be an issue (I just can't speak from experience). Depending on how you manage things (and how good your application code is), do you really need the foreign-key constraints?