What does "sparse" mean, and what is the purpose of a sparse table in Bigtable?

I came across some information that I don't understand:
Bigtable may be understood as a sparse table. Most cells contain null
values - too sparse to store them the way relational database systems do.
Bigtable instead implements a multi-dimensional sparse map.
Is this a special property, and what is the difference between a table and a sparse table?

A sparse table is one that does not need to store an entry in every (row, column) intersection, which may be referred to as a "cell"; instead, it only stores the ones that are explicitly written to.
For example, if you have a table with 500 rows and 30 columns, where every row has an entry in only one of the columns, then instead of storing 500 × 30 = 15,000 cells, most of which are the empty string or null, you only need to store 500 × 1 = 500 cells.
Since a Bigtable table may have billions, trillions, or more rows and hundreds or thousands of columns, this adds up to enormous savings in storage.
See also other related sparse data structures:
Sparse matrix
Sparse array
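If it helps to picture the difference, here is a minimal sketch in SQL (table and column names are invented for illustration) contrasting a dense layout, where every (row, column) cell has a slot, with a sparse layout that stores only the cells actually written:

    -- Dense layout: every row reserves a slot for every column,
    -- so most slots end up holding NULL.
    CREATE TABLE dense_profile (
        row_key VARCHAR(100) PRIMARY KEY,
        col_01  VARCHAR(100),
        col_02  VARCHAR(100),
        -- ... col_03 through col_29 omitted for brevity ...
        col_30  VARCHAR(100)
    );

    -- Sparse layout: only cells that were explicitly written exist at all.
    CREATE TABLE sparse_cells (
        row_key     VARCHAR(100),
        column_name VARCHAR(100),
        cell_value  VARCHAR(100),
        PRIMARY KEY (row_key, column_name)
    );

    -- The 500-rows-by-30-columns example above needs only 500 rows here,
    -- one per written cell, instead of 15,000 dense slots.
    INSERT INTO sparse_cells VALUES ('row-42', 'col_07', 'some value');

Bigtable's sparse map works on the same principle, except that it is keyed by (row key, column, timestamp) rather than by a fixed set of columns.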

Related

Indexing on a column of a huge database that can take only 3 possible values

I am trying to understand whether it is worthwhile, from a performance point of view, to create an index on a column of a huge table (about 90 million records in total).
What I am trying to achieve is fast filtering on the indexed column. The column to be indexed can have only 3 possible values, and as per my requirement I have to fetch data on a regular basis for two of those values. This comes out to about 45 million records (half the table).
Does it make any sense to create an index on a column that can have only 3 possible values when you need to retrieve data matching two of them? Also, will creating this index improve the performance of my query with a WHERE clause on that column?
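For concreteness, this is roughly the index and query being described (the table and column names here are hypothetical):

    -- A plain index on the 3-valued column in question.
    CREATE INDEX IX_BigTable_StatusCode ON dbo.BigTable (StatusCode);

    -- The regular query: filter on two of the three possible values,
    -- which matches roughly 45 million of the ~90 million rows.
    SELECT Id, StatusCode, CreatedAt
    FROM dbo.BigTable
    WHERE StatusCode IN ('A', 'B');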

How data is stored physically in Bigtable

Let's assume a table test:
                 cf:a        cf:b      yy:a      kk:cat
"com.cnn.news"   zubrava10   sobaka    foobar    -
"ch.main.users"  -           -         -         purrpurr
And the first cell ("zubrava") has 10 versions (10 timestamps) ("zubrava1", "zubrava2"...)
How will the data of this table be stored on disk?
I mean, is the primary index always
("row", "column_family:column", timestamp)?
So will the 10 versions of the same row for the 10 timestamps be stored together? How is the entire table stored?
Is a scan over all values of a given column as fast as in column-oriented models?
SELECT cf:a from test
So will the 10 versions of the same row for the 10 timestamps be stored together? How is the entire table stored?
Bigtable is a row-oriented database, so all data for a single row is stored together, organized by column family and then by column. Data is stored in reversed-timestamp order, which means it's easy and fast to ask for the latest value, but hard to ask for the oldest value.
Is a scan over all values of a given column as fast as in column-oriented models?
SELECT cf:a from test
No. A column-oriented storage model stores all the data for a single column together, across all rows. Thus, scanning one column across the whole table in a column-oriented system (such as Google BigQuery) is faster than in a row-oriented storage system, but a row-oriented system supports row-level mutations, including atomic row-level mutations, that a column-oriented storage system typically cannot offer.
On top of this, Bigtable provides a sorted order of all row keys in lexicographic order; column-oriented storage systems typically make no such guarantees.
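As a rough mental model only (Bigtable does not speak SQL; this merely mimics the sort order described above), the physical layout behaves as if each cell version were a row in a table clustered on (row key, column family, column qualifier, reversed timestamp):

    -- Conceptual sketch: one row per cell version, clustered so that all
    -- versions of a cell sit next to each other, newest first.
    CREATE TABLE bigtable_cells (
        row_key          VARCHAR(200)   NOT NULL,  -- e.g. 'com.cnn.news'
        column_family    VARCHAR(50)    NOT NULL,  -- e.g. 'cf'
        column_qualifier VARCHAR(50)    NOT NULL,  -- e.g. 'a'
        version_ts       BIGINT         NOT NULL,  -- version timestamp
        cell_value       VARBINARY(MAX) NULL,
        PRIMARY KEY CLUSTERED (row_key ASC, column_family ASC,
                               column_qualifier ASC, version_ts DESC)
    );

    -- The 10 versions of ("com.cnn.news", cf:a) are stored contiguously,
    -- newest ("zubrava10") first, which is why fetching the latest value
    -- is cheap while a column-wide scan still has to walk every row.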

Data warehouse design, multiple dimensions or one dimension with attributes?

I am working on a data warehouse and am looking for suggestions on having numerous dimensions versus one large dimension with attributes.
We currently have DimEntity, DimStation, DimZone, DimGroup, DimCompany and have multiple fact tables that contain the keys from each of the dimensions. Is this the best way or would it be better to have just one dimension, DimEntity and include station, zone, group and company as attributes of the entity?
We have already gone the route of separate dimensions with our ETL, so the work to populate and build out the star schema isn't an issue. Performance and maintainability are important. These dimensions do not change often, so I am looking for guidance on the best way to handle them.
Fact tables have over 100 million records. The entity dimension has around 1000 records and the others listed have under 200 each.
Without knowing your star schema table definitions, data cardinality, etc, it's tough to give a yes or no. It's going to be a balancing act.
For read performance, the fact table should be as skinny as possible and the dimension should be as short (low row count) as possible. Consolidating dimensions typically means that the fact table gets skinnier while the dimension record count increases.
If you can consolidate dimensions without adding a significant number of rows to the consolidated dimension, it may be worth looking into. It may be that you can combine the low cardinality dimensions into a junk dimension and achieve a nice balance. Dimensions with high cardinality attributes shouldn't be consolidated.
Here's a good Kimball University article on dimensional modeling. Look specifically where he addresses centipede fact tables and how he recommends using junk dimensions.
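If it helps to picture it, a junk dimension along the lines suggested above might look something like this (all names are invented; your actual attributes and grain will differ):

    -- The existing entity dimension (about 1000 rows in the question).
    CREATE TABLE DimEntity (
        EntityKey  INT IDENTITY(1,1) PRIMARY KEY,
        EntityName VARCHAR(100) NOT NULL
    );

    -- One consolidated junk dimension holding the low-cardinality
    -- attributes instead of four separate dimension tables.
    CREATE TABLE DimEntityProfile (
        EntityProfileKey INT IDENTITY(1,1) PRIMARY KEY,
        StationName      VARCHAR(50) NOT NULL,
        ZoneName         VARCHAR(50) NOT NULL,
        GroupName        VARCHAR(50) NOT NULL,
        CompanyName      VARCHAR(50) NOT NULL
    );

    -- The fact table then carries one surrogate key instead of four,
    -- keeping it skinnier at the cost of a slightly wider dimension.
    CREATE TABLE FactMeasurement (
        EntityKey        INT NOT NULL REFERENCES DimEntity (EntityKey),
        EntityProfileKey INT NOT NULL REFERENCES DimEntityProfile (EntityProfileKey),
        MeasureValue     DECIMAL(18, 4) NOT NULL
    );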

Main table with hundreds vs few smaller

I was wondering which approach is better for designing databases?
I currently have one big table (97 columns per row), with references to lookup tables where I could use them.
Wouldn't it be better for performance to group some of the columns into smaller tables and link them back to the main row with key columns?
If you split up your table into several parts, you'll need additional joins to get all your columns for a single row - that will cost you time.
97 columns isn't much, really - I've seen way beyond 100.
It all depends on how your data is being used - if your row really has 97 columns, all the time, and needs all 97 columns - then it hardly ever makes sense to split those up into various tables.
It might make sense if:
you can move some "large" columns (like XML, VARCHAR(MAX) etc.) into a separate table, if you don't need them all the time -> in that case, your "basic" row becomes smaller and your basic table will perform better - as long as you don't need those extra-large columns (see the sketch after this list)
you can move away some columns to a separate table that aren't always present, e.g. columns that might be "optional" and only present for e.g. 20% of the rows - in that case, you might save yourself some processing for the remaining 80% of the cases where those columns aren't needed.
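Here is a minimal sketch of that first option (names are made up): the wide, rarely needed columns move to a 1:1 side table that is joined in only when actually required:

    -- Core table keeps the columns that (nearly) every query needs.
    CREATE TABLE Product (
        ProductId INT PRIMARY KEY,
        Name      VARCHAR(200) NOT NULL,
        Price     DECIMAL(10, 2) NOT NULL
        -- ... the other frequently used columns ...
    );

    -- Rarely needed, wide columns live in a 1:1 side table.
    CREATE TABLE ProductDetails (
        ProductId     INT PRIMARY KEY REFERENCES Product (ProductId),
        Specification XML NULL,
        LongText      VARCHAR(MAX) NULL
    );

    -- Typical query: no join, the skinny core table stays fast.
    SELECT ProductId, Name, Price FROM Product;

    -- Occasional query: pay for the join only when the big columns matter.
    SELECT p.ProductId, p.Name, d.Specification
    FROM Product AS p
    JOIN ProductDetails AS d ON d.ProductId = p.ProductId;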
It would be better to group related columns into different tables. This will improve the performance of your database as well as your ease of use as the programmer. You should first try to find all the different relationships between your columns, and then attempt to break everything into tables while keeping those relationships in mind (using primary keys, foreign keys, references and so forth). Try to create a diagram like this one: http://www.simple-talk.com/iwritefor/articlefiles/354-image008.gif and take it from there.
Unless your data is denormalized, it is likely best to keep all the columns in the same table. SQL Server reads pages into the buffer pool from individual tables, so you will pay the cost of the joins on every access, even when the pages you need are already in the buffer pool. If you access just a few rows per query by key, an index will serve that query fine with all columns in the same table. Even if you scan a large percentage of the rows (more than about 1% of a large table) but only a few of the 97 columns, you are still better off keeping the columns in the same table, because you can use a non-clustered index that covers the query (illustrated below). However, if the data is heavily denormalized, then normalizing it - which by definition breaks it into many tables based on the rules of normalization to eliminate redundancy - will result in much improved performance, and you will be able to write queries that access only the specific data elements you need.
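As a concrete illustration of the covering-index point above (table and column names are hypothetical):

    -- A non-clustered index that "covers" a query touching only a few of
    -- the 97 columns: the key column drives the seek or scan, and the
    -- INCLUDEd columns are carried in the index so the base table is
    -- never touched.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
        ON dbo.Orders (CustomerId)
        INCLUDE (OrderDate, TotalAmount);

    -- This query can be answered entirely from the index above.
    SELECT CustomerId, OrderDate, TotalAmount
    FROM dbo.Orders
    WHERE CustomerId = 42;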

Adding a column efficiently in SQL Server

I want to add an integer column to a table with a large number of rows and many indexes (it's a data warehouse fact table).
To keep the row width as narrow as possible all the columns in this table are defined as not null. So I want the new column to be not null with a default of zero.
From experience, adding this column will take some time, presumably because the database will need to rewrite all the rows with the new column filled in. And this presumably involves updating the clustered index and all the non-clustered indexes.
So should I drop all the indexes before adding the column and then recreate them all?
Or is there an easier way to do this?
Also, I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve re-writing the records with an extra is-null bit flipped for each row?
It will require updating the clustered index, yes - this IS the table data, after all.
But I don't see why any of the non-clustered indices would have to be updated - your new column won't be a member of any of the non-clustered indices.
Also, I don't see how dropping and recreating the indices would benefit you in this scenario. If you were bulk-loading several million existing rows from another table or database - yes, then it might be faster (due to the INSERTs being much faster) - but adding a column doesn't really suffer from any indices or constraints being around, I don't think.
Marc
SQL Server is a row oriented database. This is in contrast to a column oriented database. This means that in SQL Server, all of the data for a given row is stored together on the disk. Let's have an example:
Say you have a Customer table with 3 columns, FirstName, MiddleInitial, and LastName. Then, say you have 3 records in this table for Jabba T. Hutt, Dennis T. Menace, and George W. Bush.
In a row oriented database (like SQL Server), the records will be stored on disk as such:
Jabba, T, Hutt; Dennis, T, Menace; George, W, Bush;
In contrast, a column oriented database would store the records on disk like this:
Jabba, Dennis, George; T, T, W; Hutt, Menace, Bush;
Where columns are grouped together instead of rows.
Now, when you go to add a column to a table in a row-oriented database (SQL Server, for example), the new value for each row has to be inserted alongside that row's existing data, shifting the rows and requiring a lot of read/write operations. So, if you were to insert a new column for the customer prefix that defaults to 'Mr', this is what you'd get:
Mr, Jabba, T, Hutt; Mr, Dennis, T, Menace; Mr, George, W, Bush;
As you can see, all of the original data has been shifted to the right. On the other hand, when you insert a new column that defaults to NULL, no new data has to be put into the existing rows. Thus, there is less shifting, requiring fewer disk read/write operations.
Of course, this is an oversimplification of what's actually going on on disk. There are other things to take into account when dealing with indexes, pages, etc. But it should help you get the picture.
For clarification, I'm not at all suggesting you move to a column-oriented database; I just put that info in there to help explain what row-oriented means.
"Also I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve re-writng the records with an extra Is Null bit flipped for each row."
Adding a nullable column merely changes the definition of the table. The individual records are not affected.
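To make the two cases concrete, these are the kinds of statements being compared (the table and column names are placeholders):

    -- The slow case discussed in the question: a NOT NULL column with a
    -- default, which historically forces every existing row to be
    -- rewritten with the value 0.
    ALTER TABLE dbo.FactSales
        ADD NewMeasure INT NOT NULL
        CONSTRAINT DF_FactSales_NewMeasure DEFAULT (0);

    -- The fast case: a nullable column with no default only changes the
    -- table's definition; existing rows are left untouched.
    ALTER TABLE dbo.FactSales
        ADD NewMeasureNullable INT NULL;

Worth noting: on newer SQL Server releases (2012 and later, Enterprise Edition), adding a NOT NULL column with a constant default can also be a metadata-only change, so the gap described here has narrowed on modern versions.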
