I have a table clustered on s_nation_key as below.
create or replace table t1
( S_SUPPKEY string,
S_NAME string,
S_NATIONKEY string,
S_ADDRESS string,
S_ACCTBAL string) cluster by (S_NATIONKEY);
Now i have added data to it
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=7
limit 50000;
When i check data distribution in underlying micro partition itlooks good .
>select system$clustering_information('t1','S_NATIONKEY');
{ "cluster_by_keys" : "LINEAR(S_NATIONKEY)", "total_partition_count" : 1, "total_constant_partition_count" : 0, "average_overlaps" : 0.0, "average_depth" : 1.0, "partition_depth_histogram" : {
"00000" : 0,
"00001" : 1,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0 } }
Again i have loaded few more record as below for particular s_nation_key set as below.
--batch load 2
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=3
LIMIT 50000;
--batch load 3
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=1
limit 50000;
--batch load 3
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=2
and S_ACCTBAL>0
limit 50000;
Now when i check clustering information again ,this also looks good . Now total 4 micro-partition and each distinct S_NATIONKEY value set is loaded into individual partition with no overlapping in range.So all micro-partition is having clustering depth 1.
>select system$clustering_information('t1','S_NATIONKEY');
{
"cluster_by_keys" : "LINEAR(S_NATIONKEY)",
"total_partition_count" : 4,
"total_constant_partition_count" : 4,
"average_overlaps" : 0.0,
"average_depth" : 1.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 4,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}
Now as per Snowflake documentation and concept of query pruning, when ever we search for records belong to one cluster_key value , it should scan only particular micro-partition which will be holding that cluster_key value (basing on min/max value range of each micro-partition). But in my case it is scanning all underlying micro partition(as below)
.
As per above query planning stats,it is scanning all the partitions, instead of scanning 1 .
Am i missing anything here ??What is the logic behind it ??
Please help me in understanding this scenario in Snowflake.
Thanks,
#Himanshu
The Autoclustering or the clustering keys are not intended for all tables. It is usually suggested for a very large table that runs into Terra bytes in size. We should not compare the cluster key to any index kind of object that is available in most of the RDBMS systems. Here we are grouping the data into the micro partitions in an orderly fashion which helps to avoid scanning the partitions which may not contain the requested data. In the case of small tables, the engine prefers to scan all the partitions if it estimates that this is not a costly operation.
Refer to the Attention Section of the documentation :
https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html#clustering-keys-clustered-tables.
Here the size of the table is not that big, that is why it is scanning all the partition rather one. Even if you check the total size scanned it is just 7.96 mb which is small hence SF scans all partitions
I want to order an array. The JSONata expression below has an incoming array as follows.
[{"id":"Air-1a",
"Controller":"ESP62",
"Cntr-TaskNo":10,
"Cntr-GPIO":13,
"name":"Air",
"valueName":"Humidity",
"Sensor":"DHT22",
(and many other key pairs)},
{next object}, ...]
I then transform the array with the following JSONata expression:
payload.(
{ "Controller" : $.Controller,
"Cntr-TaskNo": $.CntrDef.TaskNo,
"Cntr-GPIO" : $.CntrDef.GPIO,
"name" : $.name,
"valueName" : $.valueName,
"Sensor" : $.Sensor,
"id" : $.id
}
)
But now I want to - in the same JSONata expression, sort on firstly the Controller, and then the GPIO. To tried with the Controller only first.
I tried:
payload.(
{ $sort("Controller",function($l, $r){$l.Controller > $r.Controller}) : $.Controller ,
"Cntr-TaskNo": $.CntrDef.TaskNo,
"Cntr-GPIO" : $.CntrDef.GPIO,
"name" : $.name,
"valueName" : $.valueName,
"Sensor" : $.Sensor,
"id" : $.id
}
)
As well as trying to add the sort function at the end with the ~> chaining command. I also tried the order-by operator.
Could anyone point me in the right direction?
//----------
The new flow with the changed 'ESP62' to '-' that does not work:
[{"id":"874b0c77.f87418","type":"inject","z":"6f27a311.d135bc","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":200,"y":180,"wires":[["8c196590.c20638"]]},{"id":"8c196590.c20638","type":"change","z":"6f27a311.d135bc","name":"Dataset","rules":[{"t":"set","p":"payload","pt":"msg","to":"[{\"id\":\"Air-1a\",\"Controller\":\"ESP62\",\"CntrTaskNo\":10,\"CntrGPIO\":13,\"name\":\"Air\",\"valueName\":\"Humidity\",\"Sensor\":\"DHT22\",\"aaa\":\"111\",\"bbb\":\"222\",\"ccc\":\"333\"},{\"id\":\"Air-2a\",\"Controller\":\"ESP72\",\"CntrTaskNo\":11,\"CntrGPIO\":14,\"name\":\"Air\",\"valueName\":\"Humidity\",\"Sensor\":\"DHT22\",\"aaa\":\"444\",\"bbb\":\"555\",\"ccc\":\"666\"},{\"id\":\"Air-1a\",\"Controller\":\"ESP62\",\"CntrTaskNo\":2,\"CntrGPIO\":9,\"name\":\"Air\",\"valueName\":\"Humidity\",\"Sensor\":\"DHT22\",\"aaa\":\"777\",\"bbb\":\"888\",\"ccc\":\"999\"},{\"id\":\"Air-1a\",\"Controller\":\"-\",\"CntrTaskNo\":10,\"CntrGPIO\":12,\"name\":\"Air\",\"valueName\":\"Humidity\",\"Sensor\":\"DHT22\",\"aaa\":\"777\",\"bbb\":\"888\",\"ccc\":\"999\"}]","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":360,"y":180,"wires":[["13981162.14e28f"]]},{"id":"c8a256a5.a170c8","type":"debug","z":"6f27a311.d135bc","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":690,"y":180,"wires":[]},{"id":"13981162.14e28f","type":"change","z":"6f27a311.d135bc","name":"Jsonata $sort","rules":[{"t":"set","p":"payload","pt":"msg","to":"($sort(payload,function($l , $r){$l.Controller > $r.Controller}) ; \t$sort(payload,function($l , $r){$l.CntrGPIO > $r.CntrGPIO}))","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":520,"y":180,"wires":[["c8a256a5.a170c8"]]}]
I suggest first sorting the dataset and afterward transform the already sorted array of objects. The transformation is trivial and you want to know how to sort, so I show below one possible solution. It uses an expression with two concatenated $sort functions.
Edited after a better understanding of the requirement.
I tested successfully a Node-RED flow using this expression in a change node:
($a := $sort(payload,function($l , $r){$l.Controller > $r.Controller}) ; $sort($a,function($l , $r){(($l.Controller = $r.Controller) and ($l.CntrGPIO > $r.CntrGPIO))}))
Flow (contain dataset set hardcoded):
[{"id":"a7814b7e.3adeb8","type":"tab","label":"Flow 4","disabled":false,"info":""},{"id":"8bf10833.c71748","type":"inject","z":"a7814b7e.3adeb8","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":140,"y":140,"wires":[["9e365564.edca08"]]},{"id":"9e365564.edca08","type":"change","z":"a7814b7e.3adeb8","name":"Dataset","rules":[{"t":"set","p":"payload","pt":"msg","to":"[{\"id\":\"Air-1a\",\"Controller\":\"ESP62\",\"CntrTaskNo\":10,\"CntrGPIO\":13,\"name\":\"Air\",\"valueName\":\"Humidity\",\"Sensor\":\"DHT22\",\"aaa\":\"111\",\"bbb\":\"222\",\"ccc\":\"333\"},{\"id\":\"Air-2a\",\"Controller\":\"ESP72\",\"CntrTaskNo\":11,\"CntrGPIO\":14,\"name\":\"Air\",\"valueName\":\"Humidity\",\"Sensor\":\"DHT22\",\"aaa\":\"444\",\"bbb\":\"555\",\"ccc\":\"666\"},{\"id\":\"Air-1a\",\"Controller\":\"ESP62\",\"CntrTaskNo\":2,\"CntrGPIO\":9,\"name\":\"Air\",\"valueName\":\"Humidity\",\"Sensor\":\"DHT22\",\"aaa\":\"777\",\"bbb\":\"888\",\"ccc\":\"999\"},{\"id\":\"Air-1a\",\"Controller\":\"-\",\"CntrTaskNo\":10,\"CntrGPIO\":12,\"name\":\"Air\",\"valueName\":\"Humidity\",\"Sensor\":\"DHT22\",\"aaa\":\"777\",\"bbb\":\"888\",\"ccc\":\"999\"}]","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":300,"y":140,"wires":[["762f6421.074fec"]]},{"id":"f827bddb.c9acd","type":"debug","z":"a7814b7e.3adeb8","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":630,"y":140,"wires":[]},{"id":"762f6421.074fec","type":"change","z":"a7814b7e.3adeb8","name":"Jsonata $sort","rules":[{"t":"set","p":"payload","pt":"msg","to":"($a := $sort(payload,function($l , $r){$l.Controller > $r.Controller}) ; $sort($a,function($l , $r){(($l.Controller = $r.Controller) and ($l.CntrGPIO > $r.CntrGPIO))}))","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":460,"y":140,"wires":[["f827bddb.c9acd"]]}]
also tested in Jsonata exerciser: http://try.jsonata.org/S1IlT3y-E
You can sort the array using the following expression:
payload^(Controller, CntrDef.GPIO)
The order-by operator ^ will sort the array, first by increasing value of Controller, then by increasing value of CntrGPIO. You can then transform each object within that array
payload^(Controller, CntrDef.GPIO).{
"Controller" : Controller,
"Cntr-TaskNo": CntrDef.TaskNo,
"Cntr-GPIO" : CntrDef.GPIO,
"name" : name,
"valueName" : valueName,
"Sensor" : Sensor,
"id" : id
}