How does clustering help with query pruning in Snowflake? - snowflake-cloud-data-platform

I have a table clustered on S_NATIONKEY, as below.
create or replace table t1
( S_SUPPKEY string,
S_NAME string,
S_NATIONKEY string,
S_ADDRESS string,
S_ACCTBAL string) cluster by (S_NATIONKEY);
Now I have added data to it:
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=7
limit 50000;
When I check the data distribution in the underlying micro-partitions, it looks good.
>select system$clustering_information('t1','S_NATIONKEY');
{ "cluster_by_keys" : "LINEAR(S_NATIONKEY)", "total_partition_count" : 1, "total_constant_partition_count" : 0, "average_overlaps" : 0.0, "average_depth" : 1.0, "partition_depth_histogram" : {
"00000" : 0,
"00001" : 1,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0 } }
Then I loaded a few more batches, each for a particular S_NATIONKEY value, as below.
--batch load 2
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=3
LIMIT 50000;
--batch load 3
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=1
limit 50000;
--batch load 4
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=2
and S_ACCTBAL>0
limit 50000;
When I check the clustering information again, this also looks good. There are now 4 micro-partitions in total, and each distinct S_NATIONKEY value set is loaded into its own partition with no overlap in ranges, so every micro-partition has a clustering depth of 1.
>select system$clustering_information('t1','S_NATIONKEY');
{
"cluster_by_keys" : "LINEAR(S_NATIONKEY)",
"total_partition_count" : 4,
"total_constant_partition_count" : 4,
"average_overlaps" : 0.0,
"average_depth" : 1.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 4,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}
Now, as per the Snowflake documentation and the concept of query pruning, whenever we search for records belonging to one cluster-key value, it should scan only the particular micro-partitions holding that cluster-key value (based on the min/max value range of each micro-partition). But in my case it is scanning all the underlying micro-partitions, as below.
[Query Profile screenshot: all 4 micro-partitions scanned]
As per the above query profile stats, it is scanning all the partitions instead of scanning 1.
Am I missing anything here? What is the logic behind it?
Please help me understand this scenario in Snowflake.
Thanks,
#Himanshu

Auto-clustering and clustering keys are not intended for all tables. They are usually suggested for very large tables, running into terabytes in size. A cluster key should not be compared to the index-type objects available in most RDBMS systems. Here we are grouping the data into micro-partitions in an orderly fashion, which helps avoid scanning partitions that cannot contain the requested data. For small tables, the engine prefers to scan all the partitions if it estimates that this is not a costly operation.
Refer to the Attention section of the documentation:
https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html#clustering-keys-clustered-tables.
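If you want a numeric check outside the Query Profile UI, here is a rough sketch (an assumption on my part that your role can read the SNOWFLAKE.ACCOUNT_USAGE share, which lags by up to about 45 minutes): the QUERY_HISTORY view there records pruning counters per query, so you can compare how many partitions were scanned versus how many exist for your test statements on T1.
-- Sketch: inspect pruning counters for recent queries touching T1.
-- Assumes access to the SNOWFLAKE.ACCOUNT_USAGE share (results lag by up to ~45 minutes).
select query_text,
       partitions_scanned,
       partitions_total,
       bytes_scanned
from snowflake.account_usage.query_history
where query_text ilike '%from t1%'
order by start_time desc
limit 10;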

Here the size of the table is not that big; that is why it is scanning all the partitions rather than one. If you check the total size scanned, it is just 7.96 MB, which is small, hence Snowflake scans all partitions.
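To actually see pruning show up in the profile, one illustrative option (a sketch, not something from the original post) is to load far more rows so that each S_NATIONKEY value spans many micro-partitions, then re-run a point filter and compare "Partitions scanned" with "Partitions total" in the Query Profile. Note that freshly inserted data may still need reclustering before the value ranges separate cleanly.
-- Sketch: grow the table well beyond a handful of micro-partitions,
-- then check "Partitions scanned" vs "Partitions total" in the Query Profile.
insert into t1
select s_suppkey, s_name, s_nationkey, s_address, s_acctbal
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER";  -- no LIMIT this time

select count(*) from t1 where s_nationkey = '7';  -- point filter on the clustering key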

Related

Error in creating index in Mongo db using mongo console

I am using this query for creating the index:
db.CollectionName.createIndex({result: {$exists:true}, timestamp : {$gte: 1573890921898000}})
What I am trying to do here is create an index on timestamp > last month, and only for those documents where result exists, but I am getting this error:
{
"ok" : 0,
"errmsg" : "Values in v:2 index key pattern cannot be of type object. Only numbers > 0, numbers < 0, and strings are allowed.",
"code" : 67,
"codeName" : "CannotCreateIndex"
}
What am I doing wrong here?
Thanks to #prasad_ for the suggestion; I needed to use a partial index, so the query becomes:
db.CollectionName.createIndex({result: 1, timestamp: 1}, {partialFilterExpression :{result: {$exists:true}, timestamp : {$gte: 1573890921898000}}})

mongoDB aggregate find total number of employees group by each state

Find the total number of employees grouped by each state using aggregate.
I tried the following, shown in the screenshot link below, but the result is 0.
db.research.aggregate({$unwind:'$offices'},{"$match":
{'offices.country_code':"USA"}},{"$project": {'offices.state_code' : 1}},
{"$group" : {"_id":'$offices.state_code',"count" : {"$sum":'$number_of_employees'}}})
You need to $project number_of_employees so it can be counted in the next stage, or you can remove the $project stage entirely:
db.research.aggregate([
{$unwind:'$offices'},
{"$match": {'offices.country_code':"USA"}},
{"$project": {'offices.state_code' : 1, 'number_of_employees' : 1}}, //project number_of_employees
{"$group" : {"_id":'$offices.state_code',"count" : {"$sum":'$number_of_employees'}}}
])
or
db.research.aggregate([
{$unwind:'$offices'},
{"$match": {'offices.country_code':"USA"}},
{"$group" : {"_id":'$offices.state_code',"count" : {"$sum":'$number_of_employees'}}}
])

Adding child node values

Below is the Firebase database content of a child node for a particular user under the "users" node:
"L1Bczun2d5UTZC8g2LXchLJVXsh1" : {
"email" : "orabbz#yahoo.com",
"fullname" : "orabueze yea",
"teamname" : "orabbz team",
"total" : 0,
"userName" : "orabbz#yahoo.com",
"week1" : 0,
"week10" : 0,
"week11" : 0,
"week12" : 0,
"week2" : 0,
"week3" : 17,
"week4" : 0,
"week5" : 20,
"week6" : 0,
"week7" : 0,
"week8" : 0,
"week9" : 10
},
Is there a way to add up all the values of week1 through week12 and have the total sum in the total key?
I am currently thinking of bringing all the values of week1 - week12 into the AngularJS scope, adding up the values, and then posting the total back to the Firebase database key total. But this sounds too long-winded. Is there a shorter solution?
As far as I know, Firebase DB doesn't have any DB functions as you'd have in SQL. So the options you have are to get the data and calculate it in Angular as you say, or to update a counter when the weeks are added to the user (when writing to the DB) and then just read the counter later.

How to plot a Real time graph in Zeppelin?

I am trying to use Zeppelin to plot a real-time graph. I am doing a sentiment analysis on tweets per minute. I am able to query statically and plot a graph, but I would like this to be done dynamically. I am new to Zeppelin and do not have much knowledge of AngularJS. What would be the correct approach to this problem?
val final_score=uni_join.map{case((year,month,day,hour,minutes),(tweet_count,sentiment))=>(year, month, day, hour, minutes, (sentiment/tweet_count).ceil)}
final_score.saveToCassandra("twitter", "score",writeConf = WriteConf(ttl = TTLOption.constant(1000)))
final_score.foreachRDD(score => {
val rowRDD =score.map{case(year,month,day,hour,minutes,sentiment) =>(year,month,day,hour,minutes,sentiment) }
val tempDF = sqlContext.createDataFrame(rowRDD)
z.angularBindGlobal("stream", parsed) //to bind parsed to stream.
tempDF.registerTempTable("realTimeTable")
})
By querying the above table, I am able to get the graph. But I would like to dynamically update the graph every minute to keep it in sync with the sentiment score.
Thanks in advance.
[update] The Angular part of the Zeppelin notebook is as follows:
%angular
<div id="graph" style="height: 100%; width: 100%">
<canvas id="myChart" width="400" height="400"></canvas>
<div id="legendDiv"></div>
</div>
<script>
function initMap() {
var colorList = ["#fde577", "#ff6c40", "#c72a40", "#520833", "#a88399"]
var el = angular.element($('#stream'));
console.log("El is "+el) //returns el as object
angular.element(el).ready(function() {
console.log('Hello')
window.locationWatcher = el.$scope.$watch('stream', function(newValue, oldValue){
console.log('changed');}, true)})
}
</script>
But running this code keeps returning the following error.
vendor.js:29 jQuery.Deferred exception: Cannot read property '$watch' of undefined TypeError: Cannot read property '$watch' of undefined
The Spark version that I am using is 1.6, and Zeppelin is 0.6.
spark-highcharts supports Spark Structured Streaming since version 0.6.3.
For a structureDataFrame after aggregation, put the following code in one Zeppelin paragraph. The OutputMode can be either append or complete, depending on how the structureDataFrame is aggregated.
import com.knockdata.spark.highcharts._
import com.knockdata.spark.highcharts.model._
import org.apache.spark.sql.functions.col
val query = highcharts(
structureDataFrame.seriesCol("country")
.series("x" -> "year", "y" -> "stockpile")
.orderBy(col("year")), z, "append")
And the following code goes in the next paragraph. The chart in this paragraph will be updated when new data arrives in the structureDataFrame.
StreamingChart(z)
Run the following code to stop updating the chart.
query.stop()
Here is an example that generates the structureDataFrame.
spark.conf.set("spark.sql.streaming.checkpointLocation","/usr/zeppelin/checkpoint")
case class NuclearStockpile(country: String, stockpile: Int, year: Int)
val USA = Seq(0, 0, 0, 0, 0, 6, 11, 32, 110, 235, 369, 640,
1005, 1436, 2063, 3057, 4618, 6444, 9822, 15468, 20434, 24126,
27387, 29459, 31056, 31982, 32040, 31233, 29224, 27342, 26662,
26956, 27912, 28999, 28965, 27826, 25579, 25722, 24826, 24605,
24304, 23464, 23708, 24099, 24357, 24237, 24401, 24344, 23586,
22380, 21004, 17287, 14747, 13076, 12555, 12144, 11009, 10950,
10871, 10824, 10577, 10527, 10475, 10421, 10358, 10295, 10104).
zip(1940 to 2006).map(p => NuclearStockpile("USA", p._1, p._2))
val USSR = Seq(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
5, 25, 50, 120, 150, 200, 426, 660, 869, 1060, 1605, 2471, 3322,
4238, 5221, 6129, 7089, 8339, 9399, 10538, 11643, 13092, 14478,
15915, 17385, 19055, 21205, 23044, 25393, 27935, 30062, 32049,
33952, 35804, 37431, 39197, 45000, 43000, 41000, 39000, 37000,
35000, 33000, 31000, 29000, 27000, 25000, 24000, 23000, 22000,
21000, 20000, 19000, 18000, 18000, 17000, 16000).
zip(1940 to 2006).map(p => NuclearStockpile("USSR/Russia", p._1, p._2))
input.addData(USA.take(30) ++ USSR.take(30))
val structureDataFrame = input.toDF
And the following code can be run to simulate an update to the chart. The chart will be updated when this code runs.
input.addData(USA.drop(30) ++ USSR.drop(30))
NOTE: This example uses Zeppelin 0.6.2 and Spark 2.0.
NOTE: Please check the Highcharts license for commercial usage.

Extract values from an array in mongoDB to dataframe using rmongodb

I'm querying a database containing entries as displayed in the example. All entries contain the following values:
_id: unique id of overallitem and placed_items
name: the name of the overallitem
loc: location of the overallitem and placed_items
time_id: time the overallitem was stored
placed_items: array containing placed_items (can range from zero: placed_items : [], to an unlimited amount)
category_id: the category of the placed_items
full_id: the full id of the placed_items
I want to extract the name, full_id and category_id on a per placed_items level given a time_id and loc constraint
Example data:
{
"_id" : "5040",
"name" : "entry1",
"loc" : 1,
"time_id" : 20121001,
"placed_items" : [],
}
{
"_id" : "5041",
"name" : "entry2",
"loc" : 1,
"time_id" : 20121001,
"placed_items" : [
{
"_id" : "5043",
"category_id" : 101,
"full_id" : 901,
},
{
"_id" : "5044",
"category_id" : 102,
"full_id" : 902,
}
],
}
{
"_id" : "5042",
"name" : "entry3",
"loc" : 1,
"time_id" : 20121001,
"placed_items" : [
{
"_id" : "5045",
"category_id" : 101,
"full_id" : 903,
},
],
}
The expected outcome for this example would be:
"name" "full_id" "category_id"
"entry2" 901 101
"entry2" 902 102
"entry3" 903 101
So if placed_items is empty, do not put the entry in the dataframe, and if placed_items contains n entries, put n rows in the dataframe.
I tried to work out an RBlogger example to create the desired dataframe.
#Set up database
mongo <- mongo.create()
#Set up condition
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "loc", 1)
mongo.bson.buffer.start.object(buf, "time_id")
mongo.bson.buffer.append(buf, "$gte", 20120930)
mongo.bson.buffer.append(buf, "$lte", 20121002)
mongo.bson.buffer.finish.object(buf)
query <- mongo.bson.from.buffer(buf)
#Count
count <- mongo.count(mongo, "items_test.overallitem", query)
#Note that these counts don't work, since the count should be based on
#the number of placed_items in the array, and not the number of entries.
#Setup Cursor
cursor <- mongo.find(mongo, "items_test.overallitem", query)
#Create vectors, which will be filled by the while loop
name <- vector("character", count)
full_id<- vector("character", count)
category_id<- vector("character", count)
i <- 1
#Fill vectors
while (mongo.cursor.next(cursor)) {
b <- mongo.cursor.value(cursor)
name[i] <- mongo.bson.value(b, "name")
full_id[i] <- mongo.bson.value(b, "placed_items.full_id")
category_id[i] <- mongo.bson.value(b, "placed_items.category_id")
i <- i + 1
}
#Convert to dataframe
results <- as.data.frame(list(name=name, full_id=full_id, category_id=category_id))
The conditions work, and the code works if I want to extract values at the overallitem level (i.e. _id or name), but it fails to gather the information at the placed_items level. Furthermore, the dotted call for extracting full_id and category_id does not seem to work. Can anyone help?
