Here is the problem: I have a sales information table with columns like (Primary Key ID, Product Name, Product ID, Store Name, Store ID, Sales Date). I want to do analysis like drill-up and drill-down on store/product/sales date.
There are two design options I am thinking about:
1. Create individual indexes on columns like Product Name, Product ID, Store Name, Store ID, and Sales Date.
2. Use a data warehouse snowflake model, treating the current sales information table as the fact table, and create product, store, and sales date dimension tables.
I have heard that the snowflake model gives better analysis performance. But why is it better than indexes on the related columns, from a database design perspective?
thanks in advance,
Lin
Knowing your app's usage patterns and what you want to optimize for is important. Here are a few reasons (among many) to choose one over the other.
Normalized Snowflake PROs:
Faster queries and lower disk and memory requirements. Because each normalized row stores only short keys rather than longer text fields, your primary fact table becomes much smaller. Even when an index is used (unless the query can be answered directly by the index itself), partial table scans are often required, and smaller data means fewer disk reads and faster access.
Easier modifications and better data integrity. Say a store changes its name. In the snowflake model, you change one row, whereas in a large denormalized table you have to change it every time it comes up, and you will often end up with spelling errors and multiple variations of the same name (see the schema sketch at the end of this answer).
Denormalized Wide Table PROs:
Faster single-record loads. When you most often load just a single record or a small number of records, having all your data together in one row will incur only a single cache miss or disk read, whereas in the snowflake model the DB might have to read from multiple tables in different disk locations. This is more like how NoSQL databases store their "objects" associated with a key.
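To make option 2 from the question concrete, here is a minimal SQL sketch of the dimensional design; all table and column names (and the sale_amount measure) are assumptions. With flat dimensions like these it is strictly a star schema; further normalizing a dimension (for example, splitting the store dimension into store and region tables) is what makes it a snowflake.

```sql
-- Dimension tables: descriptive text lives here, once per product/store/date
CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100)
);

CREATE TABLE dim_store (
    store_id   INT PRIMARY KEY,
    store_name VARCHAR(100)
);

CREATE TABLE dim_date (
    date_id        INT PRIMARY KEY,
    sales_date     DATE,
    calendar_year  INT,
    calendar_month INT
);

-- Fact table: only short surrogate keys and measures, so rows stay small
CREATE TABLE fact_sales (
    sales_id    INT PRIMARY KEY,
    product_id  INT REFERENCES dim_product (product_id),
    store_id    INT REFERENCES dim_store (store_id),
    date_id     INT REFERENCES dim_date (date_id),
    sale_amount DECIMAL(12, 2)
);

-- A store rename touches a single dimension row (the "easier modifications" point above)
UPDATE dim_store SET store_name = 'New Store Name' WHERE store_id = 42;
```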
This question is regarding partitioning in Hive/Delta tables.
Which column should we pick for partitioning a table if the table is always joined on a key that has only unique values?
Ex: we have a table Customer(id, name, otherDetails).
Which field would be suitable for partitioning this table?
Thanks,
Deepak
Good question. Below are the factors you need to consider while partitioning:
Requirement - partition when you have lots of data, a heavily used table with data frequently added to it, and you want to manage it better.
Distribution of data - choose a field or fields on which the data is evenly distributed. The most common choice is a date, month, or year, and transactional data is normally distributed somewhat evenly on these fields. You can also choose something like country or region, as long as the data is evenly distributed across them.
Loading strategy - you can load/insert/delete each partition separately, so choose columns that will help you decide on a better strategy. For example, if you delete old data based on date every time you load, choose the load date as the partition column (see the sketch after this list).
Reasonable number of partitions - make sure you do not have thousands of partitions; fewer than 500 is good (check your system's performance).
Do not choose a unique key or composite key as the partition key, because Hive creates a folder with data files for each partition, and it will be very difficult to manage thousands of partitions.
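A minimal HiveQL sketch of these points, using a hypothetical customer_sales table: it is partitioned by load date rather than by the unique customer id, so each load and each cleanup touches whole partitions.

```sql
-- Partition by load date, not by the unique id
CREATE TABLE customer_sales (
    id            BIGINT,
    name          STRING,
    other_details STRING
)
PARTITIONED BY (load_date STRING)
STORED AS PARQUET;

-- Each load writes exactly one partition (staging_customer_sales is hypothetical)
INSERT OVERWRITE TABLE customer_sales PARTITION (load_date = '2024-01-15')
SELECT id, name, other_details
FROM   staging_customer_sales;

-- Old data can be removed by dropping whole partitions
ALTER TABLE customer_sales DROP IF EXISTS PARTITION (load_date < '2023-01-01');
```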
I am trying to migrate a table that is currently in a relational database to BigTable.
Let's assume that the table currently has the following structure:
Table: Messages
Columns:
Message_id
Message_text
Message_timestamp
How can I create a similar table in BigTable?
From what I can see in the documentation, BigTable uses ColumnFamily. Is ColumnFamily the equivalent of a column in a relational database?
BigTable is different from a relational database system in many ways.
Regarding database structures, BigTable should be considered a wide-column, NoSQL database.
Basically, every record is represented by a row and for this row you have the ability to provide an arbitrary number of name-value pairs.
This row has the following characteristics.
Row keys
Every row is identified uniquely by a row key. It is similar to a primary key in a relational database. This field is stored in lexicographic order by the system, and is the only information that will be indexed in a table.
In the construction of this key you can choose a single field or combine several fields, separated by # or any other delimiter.
The construction of this key is the most important aspect to take into account when designing your tables. You must think about how you will query the information. Among other things, keep in mind the following (always remember the lexicographic order):
Define prefixes by concatenating fields in a way that allows you to fetch information efficiently. BigTable allows you to scan information that starts with a certain prefix.
Relatedly, model your key so that related information (think, for example, of all the messages that come from a certain origin) is stored together and can be fetched more efficiently.
At the same time, define keys in a way that maximizes dispersion and load balancing between the different nodes in your BigTable cluster.
Column families
The information associated with a row is organized in column families. There is no direct correspondence with any concept in a relational database.
A column family lets you group several related fields (columns).
You need to define the column families beforehand.
Columns
A column will store the actual values. It is similar in a certain sense to a column in a relational database.
You can have different columns for different rows. BigTable stores the information sparsely: if you do not provide a value for a row, it consumes no space.
BigTable is a three-dimensional database: for every record, in addition to the actual value, a timestamp is stored as well.
In your use case, you can model your table like this (consider, for example, that you are also able to identify the origin of the message, and that it is valuable information):
Row key = message_origin#message_timestamp (truncated to the half hour, the hour, ...¹)#message_id
Column family = message_details
Columns = message_text, message_timestamp
This will generate row keys like the following (consider, for example, that the message was sent from a device with id MT43):
MT43#1330516800#1242635
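The half-hour truncation in the key is just modular arithmetic on the Unix timestamp. A minimal sketch in SQL, assuming a hypothetical integer column message_ts that holds seconds since the epoch:

```sql
-- Round a Unix timestamp (in seconds) down to the nearest half hour (1800 s)
SELECT message_ts - (message_ts % 1800) AS half_hour_bucket
FROM   messages;
```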
Please, as @norbjd suggested, see the relevant documentation for an in-depth explanation of these concepts.
One important difference from a relational database to note: BigTable only offers atomic single-row transactions, and only when using single-cluster routing.
¹ See, for instance: How to round unix timestamp up and down to nearest half hour?
I have two tables in my database, one for logins and a second for user details (the database has more than just these two tables). The logins table has 12 columns (Id, Email, Password, PhoneNumber, ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo, ...). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure, because this may make the table big.
This leads to my question: what number of columns makes a table big? Is there a certain or approximate number of columns that makes a table big and makes us stop adding columns and create another table, or is it up to the programmer to decide?
The number of columns isn't realistically a problem. Any performance issues you seem to be worried about can be attributed to the size of the data in the table, i.e., if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each row, etc.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
The point here is, we can't really give you any advice, since just the number of tables, columns, and relations isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing, and statistics, is what really matters.
The hard constraint that makes us stop adding columns to an existing table in SQL is exceeding the maximum number of columns that the database engine can support for a single table. For SQL Server, that is 1,024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
There are a number of reasons why decomposing a table (splitting up by columns) might be advisable. One of the first reasons a beginner should learn is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
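As a minimal illustration of that summary, consider a hypothetical orders/products example (not the tables from the question): product_name depends on product_id rather than on the order key, so third normal form moves it into its own table.

```sql
-- Not in 3NF: product_name depends on product_id, not on the key order_id
CREATE TABLE orders_unnormalized (
    order_id     INT PRIMARY KEY,
    product_id   INT,
    product_name VARCHAR(100),
    quantity     INT
);

-- 3NF: every non-key column depends on the key, the whole key, and nothing but the key
CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100)
);

CREATE TABLE orders (
    order_id   INT PRIMARY KEY,
    product_id INT REFERENCES products (product_id),
    quantity   INT
);
```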
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non-normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all this depends on how many databases you are going to build, and their size and scope.
I take it that the logins table contains data that is only used when the user logs into your system. For all other purposes, the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is to keep the data in one table and separate it using covering indexes.
One aspect of an index that no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to the index alone rather than the entire row. So creating a "login" index and a "details" index would achieve the same benefits as separate tables, with the added benefit that any operations that do need all the data would not have to join two tables.
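Here is a minimal sketch of that idea in SQL Server syntax, assuming a single combined users table with hypothetical column names; the INCLUDE clause adds non-key columns to the index leaf so that a query touching only those columns can be answered from the index alone.

```sql
-- One wide table holding both login and detail columns (abbreviated, hypothetical)
CREATE TABLE users (
    id            INT PRIMARY KEY,
    email         NVARCHAR(256),
    password_hash VARBINARY(64),
    phone_number  NVARCHAR(32),
    job           NVARCHAR(100),
    city          NVARCHAR(100),
    gender        CHAR(1),
    contact_info  NVARCHAR(400)
);

-- "Login" covering index: a login query touching only these columns
-- reads just the index, not the full row
CREATE NONCLUSTERED INDEX ix_users_login
    ON users (email)
    INCLUDE (password_hash, phone_number);

-- "Details" covering index for profile-style lookups by id
CREATE NONCLUSTERED INDEX ix_users_details
    ON users (id)
    INCLUDE (job, city, gender, contact_info);
```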
I have a large amount of data, around 5M rows, stored in a very flat table with 12 columns. This table contains aggregated data and has no relationships with other tables. I want to run dynamic queries on this data for reporting purposes. The table contains fields like District, City, Year, Category, SubCategory, SaleAmount, etc.
I want to view reports such as:
Sales between the years 2010 and 2013.
Sales of each product in various years, compared against each other.
Sales by a specific salesman in a year.
Sales by category, subcategory, etc.
I am using SQL Server 2008, but I am not a DBA, so I do not know things like what type of indexes I should create or which columns I should index in order to make my queries fast.
If the amount of data were small, I would not have bothered with all these questions and just proceeded, but knowing which columns to index and what type of indexes to create is vital in this case.
Kindly let me know the best way to ensure fast execution of queries.
Will it work if I create a clustered index on all my columns, or will it hurt me?
Keep in mind that this table will not be updated very frequently, maybe on a monthly basis.
Given your very clear and specific requirements, as a first step I would suggest you create a non-clustered index for each field and leave it to the optimiser (i.e. you create 12 indexes). Place only a single field in each index. Don't index (or at least use caution with) any long text fields. Also don't index a field such as M/F that has only 2 values and a 50/50 split. I am assuming you have predicates on each field, but don't bother indexing any fields that are never used for selection purposes.
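A sketch of that first step in T-SQL; the table name dbo.SalesSummary and the exact column list are assumptions based on the fields named in the question.

```sql
-- One single-column non-clustered index per selective reporting column
CREATE NONCLUSTERED INDEX ix_sales_year        ON dbo.SalesSummary ([Year]);
CREATE NONCLUSTERED INDEX ix_sales_district    ON dbo.SalesSummary (District);
CREATE NONCLUSTERED INDEX ix_sales_city        ON dbo.SalesSummary (City);
CREATE NONCLUSTERED INDEX ix_sales_category    ON dbo.SalesSummary (Category);
CREATE NONCLUSTERED INDEX ix_sales_subcategory ON dbo.SalesSummary (SubCategory);
-- ...and so on for the remaining columns that appear in WHERE clauses
```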
If you still have problems after this, find the query analyzer in SQL Server and use it to see how queries are processed.
Multi-segment indexes are sometimes better, but if your queries mostly restrict to a small subset of the table then single-field indexes will be fine.
You might have residual performance issues with queries that use ORDER BY, but let's just leave that as a heads-up at this stage.
My reasoning is based on:
You only have 12 columns, so we won't overload anything.
There are only 5M rows. This is quite easy for SQL Server to handle.
The growth in the data is small, so index updates shouldn't be too much of an issue.
The optimiser will love these queries combined with indexes.
We don't have typical query examples from which to specify multi-segment indexes, and the question seems to imply highly variable queries.
I'm just starting to build a social site on DynamoDB.
I will have a fair amount of data that relates to a user, and I'm planning on putting it all into one table - e.g.:
userid
date of birth
hair
photos urls
specifics
etc - there could potentially be a few hundred attributes.
Question:
Is there anything wrong with putting this amount of data into one table?
How can I query that data? Could I do a query like "all members between this age, with this hair color, in this location, and logged on at this time", assuming all this data is contained in the table?
If the contents of the table are long and I'm running queries like the above on it, would the read I/O cost be high? There might be a lot of entries in the table in the long run...
Thanks
No, you can't query DynamoDB this way. You can only query the primary key (and optionally a single range key). Scanning tables in DynamoDB is slow and costly and will cause your other queries to hang.
If you have a small number of attributes, you can easily create index tables for these attributes. But if you have more than a few, it becomes too complex.
Main Table:
Primary Key (Type: Hash) - userid
Attributes - the rest of the attributes
Index Table for "hair":
Primary Key (Type: Hash and Range) - hair and userid
You can check out Amazon SimpleDB, which adds an index on the other attributes as well, therefore allowing queries like the one you wanted. But it is limited in its scale and its ability to support low latency.
You might also consider a combination of several data stores and tables, as your requirements differ between real time and reporting:
DynamoDB for quick real-time user lookups
SimpleDB/RDBMS (such as MySQL or Amazon RDS) for additional attribute filters and queries
An in-memory DB (such as Redis or Cassandra) for counters and tables such as leaderboards or cohorts
Activity logs that you can analyze to discover patterns and trends