In Snowflake, table data is stored in micro-partitions. Suppose there is a table A whose data is stored in 1000 micro-partitions (P1, P2, P3, ..., Pn), and the requirement is to select data from one specific partition (for example, select * from A where partition = P1).
Is it possible to get data specific to a single micro-partition?
Snowflake's micro-partitions are not like partitions in other database technologies, where you can address a specific partition directly. What you might want to look at is the following documentation, which explains how micro-partitions work and how clustering helps Snowflake prune micro-partitions based on a clustering key (similar to a partition key in other systems, but not exactly the same).
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
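If pruning is what you are after, here is a minimal sketch, assuming table A has a date column event_date that your queries filter on (the column name is illustrative):

-- Define a clustering key so Snowflake co-locates rows with similar values:
ALTER TABLE A CLUSTER BY (event_date);

-- Queries that filter on the clustering key let Snowflake skip (prune) micro-partitions:
SELECT * FROM A WHERE event_date = '2020-01-15';

-- Inspect how well the table is clustered on that key:
SELECT SYSTEM$CLUSTERING_INFORMATION('A', '(event_date)');

Note that the pruning happens transparently: you never reference P1...Pn yourself.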
This question is regarding partitioning in Hive/Delta tables.
Which column should we pick for partitioning the table if the table is always joined on a key that has only unique values?
Ex: we have a table Customer(id, name, otherDetails).
Which field would be suitable for partitioning this table?
Thanks,
Deepak
Good question. Below are the factors you need to consider while partitioning:
Requirement: partition when you have lots of data in a heavily used table, with data frequently added to it, and you want to manage it better.
Distribution of data: choose a field or fields over which the data is evenly distributed. The most common choices are date, month, or year, since transactional data is normally distributed fairly evenly over these. You can also partition on something like country or region, provided the data is evenly distributed over it.
Loading strategy: you can load/insert/delete each partition separately, so choose columns that support a better loading strategy. For example, if you delete old data based on date every time you load, choose the load date as the partition key (see the sketch after this list).
Reasonable number of partitions: make sure you do not have thousands of partitions; fewer than 500 is good (check your system's performance).
Do not choose a unique key or composite key as the partition key, because Hive creates a folder of data files for each partition value, and thousands of partitions would be very difficult to manage.
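A minimal HiveQL sketch of that load-date strategy (the table and column names are illustrative):

-- Partition on the load date rather than the unique id:
CREATE TABLE customer (
    id            BIGINT,
    name          STRING,
    other_details STRING
)
PARTITIONED BY (load_date STRING);

-- Each load writes a single partition...
INSERT OVERWRITE TABLE customer PARTITION (load_date = '2020-01-15')
SELECT id, name, other_details FROM customer_staging;

-- ...and old data can be dropped a partition at a time:
ALTER TABLE customer DROP PARTITION (load_date < '2019-01-15');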
I am trying to migrate a table that is currently in a relational database to BigTable.
Let's assume that the table currently has the following structure:
Table: Messages
Columns:
Message_id
Message_text
Message_timestamp
How can I create a similar table in BigTable?
From what I can see in the documentation, BigTable uses ColumnFamily. Is ColumnFamily the equivalent of a column in a relational database?
BigTable is different from a relational database system in many ways.
Regarding database structures, BigTable should be considered a wide-column, NoSQL database.
Basically, every record is represented by a row, and for each row you can provide an arbitrary number of name-value pairs.
Rows have the following characteristics.
Row keys
Every row is identified uniquely by a row key, which is similar to a primary key in a relational database. The system stores rows in lexicographic order of this key, and it is the only information that is indexed in a table.
When constructing this key you can use a single field or combine several, separated by # or any other delimiter.
The construction of this key is the most important aspect to take into account when designing your tables. You must think about how you will query the information. Among other things, keep in mind the following (always remembering the lexicographic order):
Define prefixes by concatenating fields in a way that lets you fetch information efficiently. BigTable allows you to scan all rows whose keys start with a certain prefix.
Relatedly, model your key so that related information (think, for example, of all the messages that come from a certain origin) is stored together and can be fetched more efficiently.
At the same time, define keys in a way that maximizes dispersion and load balancing across the different nodes of your BigTable cluster.
Column families
The information associated with a row is organized in column families; these have no direct equivalent in a relational database.
A column family lets you group several related fields (columns).
You need to define the column families beforehand.
Columns
A column will store the actual values. It is similar in a certain sense to a column in a relational database.
You can have different columns for different rows. BigTable stores the information sparsely: if you do not provide a value for a row, it consumes no space.
BigTable is a three-dimensional database: for every value stored, a timestamp is recorded as well.
In your use case, you can model your table like this (consider, for example, that you are able to identify the origin of the message as well, and that it is valuable information):
Row key = message_origin#message_timestamp#message_id (with the timestamp truncated to the half hour, the hour, ...¹)
Column family = message_details
Columns = message_text, message_timestamp
This will generate row keys like the following (consider, for example, that the message was sent from a device with id MT43):
MT43#1330516800#1242635
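The half-hour truncation used in that key is plain modular arithmetic (1800 seconds = 30 minutes); for instance, in SQL terms:

-- 1330517500 falls inside the half hour that starts at 1330516800:
SELECT 1330517500 - (1330517500 % 1800) AS truncated_timestamp;  -- yields 1330516800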
Please, as @norbjd suggested, see the relevant documentation for an in-depth explanation of these concepts.
One important difference from a relational database to note: BigTable only offers atomic single-row transactions, and only when using single-cluster routing.
¹ See, for instance: How to round unix timestamp up and down to nearest half hour?
Let's assume a table test:

row              cf:a      cf:b     yy:a     kk:cat
"com.cnn.news"   zubrava   sobaka   foobar   -
"ch.main.users"  -         -        -        purrpurr

and that the first cell ("zubrava") has 10 versions (10 timestamps): "zubrava1", "zubrava2", ...
How will the data of this table be stored on disk?
I mean, is the primary index always
("row", "column_family:column", timestamp)?
So will the 10 versions of the same row for 10 timestamps be stored together? How is the entire table stored?
Is a scan for all values of a given column as fast as in column-oriented models?
SELECT cf:a from test
So will the 10 versions of the same row for 10 timestamps be stored together? How is the entire table stored?
Bigtable is a row-oriented database, so all data for a single row is stored together, organized by column family and then by column. Data is stored in reverse-timestamp order, which means it is easy and fast to ask for the latest value, but hard to ask for the oldest value.
Is a scan for all values of a given column as fast as in column-oriented models?
SELECT cf:a from test
No. A column-oriented storage model stores all the data for a single column together, across all rows. Thus a full-table scan of one column in a column-oriented system (such as Google BigQuery) is faster than in a row-oriented storage system, but a row-oriented system provides row-based mutations and row-based atomicity that a column-oriented storage system typically cannot.
On top of this, Bigtable keeps all row keys sorted in lexicographic order; column-oriented storage systems typically make no such guarantee.
My company just provided me with SQL Server 2005 Enterprise Edition and I wanted to partition some tables with large(r) amounts of data. I have about 5 or 6 tables which would be a good fit to partition by datetime.
There will be some queries that need two of these tables within the same query.
I was wondering if I should use the same partition scheme for all of these tables or if I should copy the partition scheme and put different tables on each one.
Thanks for any help in advance.
You should define your partitions by what makes sense for your domain, e.g. if you deal primarily in year quarters, create 5 partitions (4 quarters + 1 overspill).
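For instance, a minimal T-SQL sketch of that quarterly layout (the boundary dates are illustrative):

-- Four boundary values yield five partitions: one per quarter plus the overspill.
CREATE PARTITION FUNCTION pfQuarters (datetime)
AS RANGE RIGHT FOR VALUES ('2009-01-01', '2009-04-01', '2009-07-01', '2009-10-01');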
You should also take into account physical file placement. From the MSDN article:
The first step in partitioning tables and indexes is to define the data on which the partition is keyed. The partition key must exist as a single column in the table and must meet certain criteria. The partition function defines the data type on which the key (also known as the logical separation of data) is based. The function defines this key but not the physical placement of the data on disk. The placement of data is determined by the partition scheme. In other words, the scheme maps the data to one or more filegroups that map the data to specific file(s) and therefore disks. The scheme always uses a function to do this: if the function defines five partitions, then the scheme must use five filegroups. The filegroups do not need to be different; however, you will get better performance when you have multiple disks and, preferably, multiple CPUs. When the scheme is used with a table, you will define the column that is used as an argument for the partition function.
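To make the function/scheme split concrete, here is a sketch continuing the quarterly example above (the filegroup names are illustrative, and the filegroups must already exist):

-- The scheme maps the function's five partitions onto five filegroups:
CREATE PARTITION SCHEME psQuarters
AS PARTITION pfQuarters
TO (fg1, fg2, fg3, fg4, fg5);

-- The partitioning column is supplied when the table is created on the scheme:
CREATE TABLE dbo.Orders (
    OrderId   int      NOT NULL,
    OrderDate datetime NOT NULL
) ON psQuarters (OrderDate);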
These two articles may be useful:
Partitioned Tables in SQL Server 2005
Partitioned Tables and Indexes in SQL Server 2005
Microsoft, in its MSDN entry about altering SQL Server 2005 partitions, lists a few possible approaches:
Create a new partitioned table with the desired partition function, and then insert the data from the old table into the new table by using an INSERT INTO...SELECT FROM statement.
Create a partitioned clustered index on a heap
Drop and rebuild an existing partitioned index by using the Transact-SQL CREATE INDEX statement with the DROP EXISTING = ON clause.
Perform a sequence of ALTER PARTITION FUNCTION statements.
Any idea which of these will be the most efficient way for a large-scale DB (millions of records) with partitions based on the dates of the records (something like monthly partitions), where the data spreads over 1-2 years?
Also, if I mostly access (for reading) recent information, will it make sense to keep one partition for the last X days and put all the rest of the data in another partition? Or is it better to partition the rest of the data too (for any random access based on a date range)?
I'd recommend the first approach - creating a new partitioned table and inserting into it - because it gives you the luxury of comparing your old and new tables. You can test query plans against both styles of tables and see if your queries are indeed faster before cutting over to the new table design. You may find there's no improvement, or you may want to try several different partitioning functions/schemes before settling on your final result. You may want to partition on something other than date range - date isn't always effective.
I've done partitioning with 300-500m row tables with data spread over 6-7 years, and that table-insert approach was the one I found most useful.
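A minimal sketch of that table-insert approach, assuming a new table dbo.OrdersPartitioned has been created on a monthly partition scheme keyed on OrderDate (all names are illustrative):

-- Load the existing data into the new partitioned table.
-- TABLOCK takes a table-level lock up front, reducing locking overhead for the bulk load.
INSERT INTO dbo.OrdersPartitioned WITH (TABLOCK)
SELECT *
FROM dbo.Orders;

After verifying row counts and query plans against both tables, rename or drop the old one.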
You asked about how to partition - the best answer is to try to design your partitions so that your queries will hit a single partition. If you tend to concentrate queries on recent data, AND if you filter on that date field in your where clauses, then yes, have a separate partition for the most recent X days.
Be aware that you do have to specify the partitioning field in your WHERE clause. If you aren't specifying that field, then the query is probably going to hit every partition to get the data, and at that point you won't have any performance gains.
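For example, continuing the monthly sketch above, this query can be eliminated down to the single partition that holds March 2009, because it filters on the partitioning column:

SELECT *
FROM dbo.OrdersPartitioned
WHERE OrderDate >= '2009-03-01' AND OrderDate < '2009-04-01';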
Hope that helps! I've done a lot of partitioning, and if you want to post a few examples of table structures & queries, that'll help you get a better answer for your environment.