I'm writing a schema to store Vehicle records. I want to store UP TO 62 pieces of information ("constraints") for each vehicle (year, make, model, aspiration, wheelbase, body style, number of doors, etc.). Most vehicles will only have 5-10 constraints populated.
I have about 12.5 million records to store. I've no option but to use a single database running on a single computer.
Each constraint is stored as an integer. Another provider, an industry standard, gives me labels for each of these values; for example, a make of 54 is "Ford". The labels can change, but it's more common for new ones to be added.
There are no mandatory fields. Some companies catalog by year + make + model, others by engine, others by transmission, and some by a mix of these.
I could make a single table with at least 62 columns. Each column would be indexed, as they are frequently used for joins or in a WHERE clause. Or I could make a vehicle table that contains an Id, then make a constraints table that has the vehicle Id as a foreign key and the information for a single constraint for a single vehicle.
A single table has the advantage of being able to retrieve all constraints for one vehicle without any joins, but has the downside of having a NULL value in most columns of every record and having a LOT of indexes.
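For concreteness, here is a rough sketch of the two designs I'm comparing (table and column names are just placeholders):

-- Design 1: one wide table, one column per constraint,
-- plus a non-clustered index on each constraint column.
CREATE TABLE dbo.VehicleWide
(
    VehicleId    int NOT NULL PRIMARY KEY,
    [Year]       int NULL,
    MakeId       int NULL,
    ModelId      int NULL,
    AspirationId int NULL
    -- ...up to 62 constraint columns, most of them NULL for any given row
);

-- Design 2: a narrow vehicle table plus one row per populated constraint.
CREATE TABLE dbo.Vehicle
(
    VehicleId int NOT NULL PRIMARY KEY
);

CREATE TABLE dbo.VehicleConstraint
(
    VehicleId        int     NOT NULL REFERENCES dbo.Vehicle (VehicleId),
    ConstraintTypeId tinyint NOT NULL,  -- e.g. 1 = year, 2 = make (ids are placeholders)
    ConstraintValue  int     NOT NULL,  -- e.g. a make value of 54 = "Ford"
    PRIMARY KEY (VehicleId, ConstraintTypeId)
);

-- With design 2, a search such as "1999 Fords" becomes one join per constraint:
SELECT v.VehicleId
FROM dbo.Vehicle v
JOIN dbo.VehicleConstraint y ON y.VehicleId = v.VehicleId
                            AND y.ConstraintTypeId = 1 AND y.ConstraintValue = 1999
JOIN dbo.VehicleConstraint m ON m.VehicleId = v.VehicleId
                            AND m.ConstraintTypeId = 2 AND m.ConstraintValue = 54;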
Is one design generally preferred over the other? I've done a little performance testing with my prototypes and don't see a huge difference in query times.
The vehicles are searched for and displayed hundreds of times per day from a website.
I'm currently prototyping in SQL Server 2008 R2, but could conceivably use 2012.
I am trying to migrate a table that is currently in a relational database to BigTable.
Let's assume that the table currently has the following structure:
Table: Messages
Columns:
Message_id
Message_text
Message_timestamp
How can I create a similar table in BigTable?
From what I can see in the documentation, BigTable uses ColumnFamily. Is ColumnFamily the equivalent of a column in a relational database?
BigTable is different from a relational database system in many ways.
Regarding database structures, BigTable should be considered a wide-column, NoSQL database.
Basically, every record is represented by a row and for this row you have the ability to provide an arbitrary number of name-value pairs.
This row has the following characteristics.
Row keys
Every row is identified uniquely by a row key. It is similar to a primary key in a relational database. This field is stored in lexicographic order by the system, and is the only information that will be indexed in a table.
In the construction of this key you can choose a single field or combine several, separated by # or any other delimiter.
The construction of this key is the most important aspect to take into account when designing your tables. You must think about how you will query the information. Among other things, keep in mind the following (always remember the lexicographic order):
Define prefixes by concatenating fields in a way that allows you to fetch information efficiently. BigTable allows you to scan information that starts with a certain prefix.
Relatedly, model your key in a way that stores common information (think, for example, of all the messages that come from a certain origin) together, so it can be fetched more efficiently.
At the same time, define keys in a way that maximizes dispersion and load balancing between the different nodes in your BigTable cluster.
Column families
The information associated with a row is organized in column families. It has no correspondence with any concept in a relational database.
A column family allows you to group several related fields (columns).
You need to define the column families beforehand.
Columns
A column will store the actual values. It is similar in a certain sense to a column in a relational database.
You can have different columns for different rows. BigTable stores the information sparsely: if you do not provide a value for a row, it consumes no space.
BigTable is a three-dimensional database: for every record, in addition to the actual value, a timestamp is stored as well.
In your use case, you can model your table like this (consider, for example, that you are able to identify the origin of the message as well, and that it is a valuable piece of information):
Row key = message_origin#message_timestamp (truncated to the half hour, hour, ...)¹#message_id
Column family = message_details
Columns = message_text, message_timestamp
This will generate row keys like the following (consider, for example, that the message was sent from a device with id MT43):
MT43#1330516800#1242635
Please, as @norbjd suggested, see the relevant documentation for an in-depth explanation of these concepts.
One important difference from a relational database to note: BigTable only offers atomic single-row transactions, and only when using single-cluster routing.
¹ See, for instance: How to round unix timestamp up and down to nearest half hour?
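As a quick illustration of that truncation (shown here in T-SQL, though any language with integer division works the same way), dividing by 1800 seconds and multiplying back floors a unix timestamp to the half hour:

SELECT (1330516920 / 1800) * 1800 AS half_hour_bucket;  -- returns 1330516800, as in the key above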
I know it is a big and general question. Let me describe what I am looking for.
In big projects, we have some entities with many properties (over 100 properties for a single entity). These properties have a one-to-one relation. As time goes by, these tables with many columns become a real problem for maintenance and further development.
As you might guess, these 90 columns were created over time, across many projects, not a single one. Requirements have therefore shaped the table design over a long period.
For example: there is a table that stores information about payments between banks globally.
Some columns are foreign keys to other tables (Customer, TransferType, etc.).
Some columns are parameters of the current payment (IsActive, IsLoaded, IsOurCustomer, etc.).
Some columns are fields of the payment (Information Bank, Receiver Bank, etc.).
and so on.
These fields keep growing in number, and now we have about 90 columns with a one-to-one relation.
What are the concerns when dividing a table into smaller tables? I know the normalization rules and I am not asking about them (duplicated columns are already normalized).
I am trying to find some patterns or rules for dividing a table whose columns have a one-to-one relation.
If all of the columns are only dependent on the primary table key and are not repeating (phone1, phone2), they should be part of the same table. If you split a table you will have to do joins when you need all the columns of the table. If many of the values are null you may want to investigate the use of sparse columns (which don't take up any room if they have a null value).
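As a rough illustration of that last point, a sketch of sparse columns in SQL Server, using column names borrowed from the payment example above. Sparse columns only pay off when a column is NULL in the large majority of rows, since populated sparse values cost slightly more space than normal ones:

CREATE TABLE dbo.Payment
(
    PaymentId       int            NOT NULL PRIMARY KEY,
    CustomerId      int            NOT NULL,       -- foreign-key style column, always populated
    TransferTypeId  int            NULL,
    IsActive        bit            SPARSE NULL,
    IsLoaded        bit            SPARSE NULL,
    IsOurCustomer   bit            SPARSE NULL,
    InformationBank nvarchar(100)  SPARSE NULL,
    ReceiverBank    nvarchar(100)  SPARSE NULL
    -- ...remaining rarely populated columns declared SPARSE as well
);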
I have two tables in my database, one for logins and a second for user details (the database is not only two tables). The Logins table has 12 columns (Id, Email, Password, PhoneNumber ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo ..). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure, because this may make the table big.
So this leads to my question: what number of columns makes a table big? Is there a certain or approximate number that makes a table big and makes us stop adding columns and create another table? Or is it up to the programmer to decide?
The number of columns isn't realistically a problem. Any kind of performance issue you seem to be worried about can be attributed to the size of the DATA in the table, e.g. if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each row.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
Point here is, we can't really give you any advice since just the number of tables and columns and relations isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing and statistics, is what really matters.
The constraint that makes us stop adding columns to an existing table in SQL is if we exceed the maximum number of columns that the database engine can support for a single table. For SQL Server, that is 1,024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
There are a number of reasons why decomposing a table (splitting up by columns) might be advisable. One of the first reasons a beginner should learn is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non-normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all this depends on how many databases you are going to build, and their size and scope.
I take it that the login table contains data that is only used when the user logs into your system. For all other purposes, the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is having the data in one table and separating them using covering indexes.
One aspect of an index no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to the index rather than the entire row. So creating a "login" index and a "details" index would achieve the same benefits as separate tables, with the added benefit that any operation that does need all the data would not have to perform a join of two tables.
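A rough sketch of that idea, assuming the two sets of columns were merged into a single table (the table name dbo.Users is illustrative; the columns come from the question). The INCLUDE list is what makes each index covering:

CREATE NONCLUSTERED INDEX IX_Users_Login
    ON dbo.Users (Email)
    INCLUDE (Password, PhoneNumber);   -- login queries can be answered from this index alone

CREATE NONCLUSTERED INDEX IX_Users_Details
    ON dbo.Users (Id)
    INCLUDE (Job, City, Gender, ContactInfo);  -- detail lookups read this narrow index, not the wide row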
I have a large amount of data, around 5M rows, stored in a very flat table which has 12 columns. This table contains aggregated data and it does not have any relationship with other tables. I want to run dynamic queries on this data for reporting purposes. The table contains fields like District, City, Year, Category, SubCategory, SaleAmount, etc.
I want to view reports such as sales between the years 2010 and 2013.
Sales of each product in various years, compared against each other.
Sales by a specific salesman in a year.
Sales by category, subcategory, etc.
I am using SQL Server 2008, but I am not a DBA, hence I do not know things like what type of indexes I should create, or which columns I should index in order to make my queries work.
If the amount of data were small I would not have bothered with all these questions and just proceeded, but knowing which columns to index and what type of indexes to create is vital in this case.
Kindly let me know the best way to ensure fast execution of queries.
Will it work if I create a clustered index on all my columns? Or will it hurt me?
Keep in mind that this table will not be updated very frequently maybe on monthly basis.
Given your very clear and specific requirements, I would suggest you create a non-clustered index for each field and leave it to the optimiser as a first step (i.e. you create 12 indexes). Place only a single field in each index. Don't index (or at least use caution with) any long text type fields. Also don't index a field such as M/F that has only 2 values and a 50/50 split. I am assuming you have predicates on each field, but don't bother indexing any fields that are never used for selection purposes.
If you still have problems after this, find the query analyser in SQL Server and use it to guide how queries are processed.
Multi-column indexes are sometimes better, but if your queries mostly restrict to a small subset of the table then single-field indexes will be fine.
You might have residual performance issues with queries that use "order by", but let's just leave that as a heads-up at this stage.
My reasoning is based on
You only have 12 columns, so we won't overload anything.
There are only 5M rows. This is quite easy for SQL Server to handle.
The growth in the data is small, so index updates shouldn't be too much of an issue.
The optimiser will love these queries combined with indexes.
We don't have typical query examples from which to specify multi-column indexes, and the question seems to imply highly variable queries.
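For concreteness, the starting point described above might look like this (the table name dbo.Sales is a guess; the column names are taken from the question):

CREATE NONCLUSTERED INDEX IX_Sales_District    ON dbo.Sales (District);
CREATE NONCLUSTERED INDEX IX_Sales_City        ON dbo.Sales (City);
CREATE NONCLUSTERED INDEX IX_Sales_Year        ON dbo.Sales ([Year]);
CREATE NONCLUSTERED INDEX IX_Sales_Category    ON dbo.Sales (Category);
CREATE NONCLUSTERED INDEX IX_Sales_SubCategory ON dbo.Sales (SubCategory);
-- ...one index per remaining selective column; skip long text fields and
-- low-cardinality flags, as noted above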
I have a database which logs modification records into a table. This modification table contains foreign keys to other tables (the modification table only contains references to objects modified).
Objects in this modification table can be grouped into different populations. When a user access the service he only requests the database for object on his population.
I will have about 2 to 10 new populations each week.
This table is queried by smartphones very, very often and will contain about 500,000 to 1,000,000 records.
If I split the modification table into many tables, there is no table join to do to answer user requests.
If I change this single table into many tables, I guess it will speed up the response time.
But on the other hand, each "insert" into the modification table will first require the name of the target table (which implies another request). To do so, I plan to have a varchar column in the "population" table holding the name of the target modification table.
My question is a design-pattern / architecture one: should I go for a single very large table with 3 "where" clauses for each request, or should I try many light tables with no "where" clauses at all?
The cleanest thing would be to use one table and partition it on the populations. Partitions are made for this.
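A rough sketch of that approach, assuming SQL Server (the question doesn't name a platform) and a hypothetical PopulationId column; note that table partitioning requires Enterprise edition on older SQL Server versions:

CREATE PARTITION FUNCTION pfPopulation (int)
    AS RANGE RIGHT FOR VALUES (1, 2, 3, 4);        -- one boundary per existing population

CREATE PARTITION SCHEME psPopulation
    AS PARTITION pfPopulation ALL TO ([PRIMARY]);

CREATE TABLE dbo.Modification
(
    ModificationId int       NOT NULL,
    PopulationId   int       NOT NULL,
    ObjectId       int       NOT NULL,             -- reference to the modified object
    ModifiedAt     datetime2 NOT NULL,
    CONSTRAINT PK_Modification
        PRIMARY KEY CLUSTERED (PopulationId, ModificationId)
) ON psPopulation (PopulationId);

-- New populations each week are handled with ALTER PARTITION FUNCTION ... SPLIT RANGE,
-- not with new tables.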
500K - 1M records isn't trivial - but it certainly isn't huge either. What's your database platform? Most mainstream professional platforms (SQL Server, Oracle, MySQL, etc.) are more than capable of handling this.
If the table in question is narrow (has few columns) then it's less likely to be an issue than "wide tables" with lots of columns.
Having lots of joins could be an issue (I just can't speak from experience). Depending on how you manage things (and how good your application code is), do you really need the foreign-key constraints?