How partition pruning works on an integer column in a Snowflake table

I have a table in Snowflake with around 1000 columns, including an id column of integer type.
When I run a query like
select * from table where id=12
it scans all the micro-partitions. I expected Snowflake to maintain min/max metadata for the id column and, based on that, to scan only one partition rather than all of them.
The doc at https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html mentions that Snowflake maintains the min/max and distinct values of the columns in each micro-partition.
How can I take advantage of partition pruning in this scenario? Currently, even for a unique id, Snowflake scans all the partitions.

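It's a little more complicated than that, unfortunately. Pruning relies on the min/max range stored for each micro-partition, and a partition can be skipped only if the value you filter on falls outside its range. If ids are spread across partitions in load order, nearly every partition's range contains 12, so nothing can be skipped. Snowflake would only scan a single partition if your table was perfectly clustered by your id column, which it probably isn't, nor should it be. Snowflake is a data warehouse and isn't ideal for single-row lookups.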
You could always cluster your table by your id column but you usually don't want to do this in a data warehouse. I would recommend reading this document to understand how table clustering works.
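
To see why clustering matters, here is a minimal Python sketch of min/max pruning. It is only a simulation under assumed numbers (10,000 ids, 1,000 rows per partition, a lookup for id 5000) and does not reflect Snowflake's actual internals; the function names are made up for illustration.

```python
import random

def build_partitions(ids, rows_per_partition):
    """Split ids into fixed-size chunks and keep only (min, max) per chunk."""
    chunks = [ids[i:i + rows_per_partition]
              for i in range(0, len(ids), rows_per_partition)]
    return [(min(chunk), max(chunk)) for chunk in chunks]

def partitions_to_scan(metadata, target):
    """Count partitions whose [min, max] range could contain the target."""
    return sum(1 for low, high in metadata if low <= target <= high)

random.seed(0)
ids = list(range(1, 10001))

# Rows stored in id order: ranges do not overlap.
clustered = build_partitions(ids, 1000)

# Rows stored in arbitrary load order: ranges overlap heavily.
shuffled = ids[:]
random.shuffle(shuffled)
unclustered = build_partitions(shuffled, 1000)

clustered_scans = partitions_to_scan(clustered, 5000)
unclustered_scans = partitions_to_scan(unclustered, 5000)
print("clustered:", clustered_scans, "unclustered:", unclustered_scans)
```

With sorted data exactly one range contains the target, while with shuffled data the ranges overlap so much that the metadata rules out few, if any, partitions.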

Related

Querying a High Cardinality Field

I am designing a data model for our orders for our upcoming Cassandra migration. An order has an orderId (arcane UUID field) and an orderNumber (user-friendly number). A getOrder query can be done by using any of the two.
My partition key is the orderId, so getByOrderId is not a problem. But getByOrderNumber is: there's a one-to-one mapping between the orderId and the orderNumber (a high-cardinality field), so creating a local secondary index on each node would slow down my queries.
What I was wondering is whether I could create a new table with the orderNumber as the partition key and the orderId as the only column (a kind of secondary index, but maintained by me). Then a getByOrderNumber query can be resolved in two calls.
Bear with me if the above solution is egregiously wrong; I am extremely new to Cassandra. As I understand it, if I used local secondary indexes for such a column, Cassandra would have to query each node for a single order. So I thought, why not create another table that stores the mapping?
What would I be missing by managing this index myself? One thing I can see is that for every write, I'll now have to update two tables. Anything else?

I thought why not create another table that stores the mapping.
That's okay. From Cassandra documentation:
Do not use an index in these situations:
On high-cardinality columns because you then query a huge volume of
records for a small number of results. See Problems using a
high-cardinality column index below.
Problems using a high-cardinality column index
If you create an index on a high-cardinality column, which has many
distinct values, a query between the fields incurs many seeks for very
few results. In the table with a billion songs, looking up songs by
writer (a value that is typically unique for each song) instead of by
their recording artist is likely to be very inefficient.
It would probably be more efficient to manually maintain the table as
a form of an index instead of using the built-in index. For columns
containing unique data, it is sometimes fine performance-wise to use
an index for convenience, as long as the query volume to the table
having an indexed column is moderate and not under constant load.
Conversely, creating an index on an extremely low-cardinality column,
such as a boolean column, does not make sense. Each value in the index
becomes a single row in the index, resulting in a huge row for all the
false values, for example. Indexing a multitude of indexed columns
having foo = true and foo = false is not useful.
It's normal for Cassandra data modelling to have denormalized data.
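
The manually maintained mapping table can be sketched roughly as follows. This is a plain-Python illustration, not Cassandra driver code: two dicts stand in for the two tables, and the names orders_by_id, order_id_by_number, create_order and get_order_by_number are hypothetical.

```python
import uuid

# Two in-memory dicts stand in for two Cassandra tables (hypothetical names).
orders_by_id = {}        # partition key: order_id -> order row
order_id_by_number = {}  # partition key: order_number -> order_id

def create_order(order_number, payload):
    """Write the order and its lookup entry; every write touches both tables."""
    order_id = uuid.uuid4()
    orders_by_id[order_id] = {"order_number": order_number, **payload}
    order_id_by_number[order_number] = order_id
    return order_id

def get_order_by_number(order_number):
    """Resolve an order in two single-partition reads."""
    order_id = order_id_by_number.get(order_number)  # call 1: number -> id
    if order_id is None:
        return None
    return orders_by_id.get(order_id)                # call 2: id -> order

new_id = create_order("A-1001", {"total": 42})
print(get_order_by_number("A-1001"))
```

The main cost besides the extra write is that the two writes are not atomic by default, so the application has to handle the case where one succeeds and the other fails (in Cassandra, a logged batch is one commonly used option for this).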

SQL Server partitioning

I am working on a database with very large record sets in MS SQL Server 2016, so I want to use the table partitioning feature to improve speed.
As we know, partitioning works on a partition column of a table, say a [Date Column]. In our scenario, 5 to 7 tables have heavy record sets and need to be partitioned, but not every table has that [Date Column], and it is not possible to add that column to each table.
So is there any way I can use the partition column of another table, or something else?

The best option is to add a common column to all tables that you will then use to partition by.
You must already have a way of relating the different tables to each other so you can use this to tag each table with the correct Partition column.
This column could be as simple as an int with YYYYMM as values for monthly partitions.
You also need to make sure your queries are "Partition Aware".
This means that you should include this column in your WHERE Clause and also your JOIN Clauses for any queries.
Use Query Plans to make sure you are getting Partition Elimination on your queries.
If you can't change the model (but can add partitions???) then you could implement the partitioning with different columns in each table, provided each table has a single column that you can partition on named ranges. But if you have 1-many relationships, it is unlikely that the child tables' keys will be consecutive relative to the parent table. Note that this approach will make your "partition aware" queries more complex to craft.
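
As a small illustration of the YYYYMM idea, the sketch below derives the integer partition value from a date and lists the values a date range covers, which is what a "partition aware" WHERE clause would filter on. The helper names are hypothetical and the code runs outside the database.

```python
from datetime import date

def yyyymm_key(d):
    """Map a date to an int such as 201603 for monthly partitions."""
    return d.year * 100 + d.month

def month_keys_between(start, end):
    """List every YYYYMM value from start's month to end's month, inclusive."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys

print(yyyymm_key(date(2016, 3, 15)))
print(month_keys_between(date(2015, 11, 20), date(2016, 2, 1)))
```

Because the values sort in calendar order, a range predicate on this column lines up with monthly partition boundaries.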

Cassandra: Why do I not have to include all partition keys in query?

I am currently working with Cassandra.
A blog post I was reading says:
When issuing a CQL query, you must include all partition key columns,
at a minimum.
(https://shermandigital.com/blog/designing-a-cassandra-data-model/)
However, in my database it seems to be possible without including all partition keys. Here is the table:
CREATE TABLE usertable (
personid text,
name text,
"timestamp" timestamp,
active boolean,
PRIMARY KEY ((personid, name), timestamp)
) WITH
CLUSTERING ORDER BY ("timestamp" DESC)
AND comment=''
AND read_repair_chance=0
AND dclocal_read_repair_chance=0.1
AND gc_grace_seconds=864000
AND bloom_filter_fp_chance=0.01
AND compaction={ 'class':'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold':'32',
'min_threshold':'4' }
AND compression={ 'chunk_length_in_kb':'64',
'class':'org.apache.cassandra.io.compress.LZ4Compressor' }
AND caching={ 'keys':'ALL',
'rows_per_partition':'NONE' }
AND default_time_to_live=0
AND id='23ff16b0-c400-11e8-55c7-2b453518a213'
AND min_index_interval=128
AND max_index_interval=2048
AND memtable_flush_period_in_ms=0
AND speculative_retry='99PERCENTILE';
So I can do select * from usertable where personid = 'ABC-02';. However, according to the blog post, I have to include timestamp as well.
Can someone explain this?

In Cassandra, the partition key spreads data around the cluster. Cassandra computes the hash of the partition key and uses it to determine the location of the data in the cluster.
One exception: if you use ALLOW FILTERING or a secondary index, you are not required to include all partition key columns in the WHERE clause.
For further information, take a look at these blog posts:
The purpose of a partition key is to split the data into partitions
where an entire partition is stored on a single node in the cluster
(with each node storing many partitions). When data is read or written
from the cluster, a function called Partitioner is used to compute the
hash value of the partition key. This hash value is used to determine
the node/partition which contains that row. The clustering key is used
further to search for a row within a given partition.
Select queries in Apache Cassandra look a lot like select queries from
a relational database. However, they are significantly more
restricted. The attributes allowed in ‘where’ clause of Cassandra
query must include the full partition key and additional clauses may
only reference the clustering key columns or a secondary index of the
table being queried.
Requiring the partition key attributes in the ‘where’ helps Cassandra
to maintain constant result-set retrieval time as the cluster is
scaled-out by allowing Cassandra to determine the partition, and thus
the node (and even data files on disk), that the query must be
directed to.
If a query does not specify the values for all the columns from the
primary key in the ‘where’ clause, Cassandra will not execute it and
give the following warning :
‘InvalidRequest: Error from server: code=2200 [Invalid query]
message=”Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW
FILTERING” ‘
https://www.instaclustr.com/apache-cassandra-scalability-allow-filtering-partition-keys/
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause

According to your schema, your timestamp column is the clustering column (the sorting column), not part of the partition key. That's why it is not required.
(personid, name) are your partition key columns.
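
To illustrate why the full partition key matters, here is a rough Python sketch of hashing a composite key to pick a node. It is a simplification under loud assumptions: Cassandra's default partitioner is Murmur3 and it uses a token ring rather than a modulo, whereas this sketch uses MD5 and a hypothetical three-node list purely to show the idea.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def token(partition_key_values):
    """Hash all partition key columns together into one integer token."""
    raw = ":".join(partition_key_values).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big")

def owning_node(personid, name):
    """Pick a node from the token; both key columns are needed to compute it."""
    return NODES[token((personid, name)) % len(NODES)]

# The same personid with different names hashes to different tokens,
# so personid alone does not identify where the data lives.
print(owning_node("ABC-02", "alice"))
print(owning_node("ABC-02", "bob"))
```

The timestamp never enters the hash, which matches the point above: it only orders rows inside a partition.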

Are indexes needed in an Azure SQL Data Warehouse table to boost performance on a multi-table join query?

My understanding is that in an Azure SQL Data Warehouse table, each column is indexed in a columnstore table (maybe I'm wrong). If that is the case, why is there the ability to create additional indexes (the CREATE INDEX statement)? I was thinking it might be for composite indexes (multiple-field indexes).
But in a query with a single-field join, performance got much better after I created an index on that field.
Is there a general rule to go by when creating indexes in Azure SQL Data Warehouse?

Generally, you should create indexes on the columns that are used most often in your queries. Indexes are also a burden on a database because they take up disk space, so creating an index on every column of a table is not a good idea. Create indexes based on the queries you actually run.
Besides indexes, you can use partitions or table spaces to boost query performance.

What happens to a clustered index when a PK is created on two columns in SQL Server?

I just created a table with a primary key on TWO columns in SQL Server. One column is age, the other is ID number, and I set the option to create a CLUSTERED INDEX, so it automatically creates a clustered index on both columns. However, when I query the table, the results only seem to be sorted by ID and completely ignore AGE (the other PK and clustered index column). Why is this? Why is it only sorting based on the first clustered index column?

The query optimizer may decide to use the physical ordering of the rows in the table if there is no advantage in ordering any other way. So, when you select from the table using a simple query, it may be ordered this way. It is very easy to assume that the rows are physically stored in the order specified within the definition of your clustered index. But this turns out to be a false assumption.
Please view the following article for more details: Clustered Index do “NOT” guarantee Physically Ordering or Sorting of Rows