Performance Improvement for Real-Time Updates from RDS to Snowflake

As part of an hourly process, we source data from a table in RDS and ingest it into the raw data layer in Snowflake.
After ingestion, we have to update certain fields in the final mart table based on the newly ingested data.
The mart table is very wide, with more than 100 fields, and is around 10 GB in size.
As a result, these updates take 6-7 minutes.
Edit: Update Query and Explain Plan:
update <target_table> tgt
set email_id = src.email_id,
    first_name = (
        case
            when len(src.updated_first_name) < 2 then first_name
            when src.updated_first_name is null or src.updated_first_name = '' then first_name
            else src.updated_first_name
        end),
    last_name = (
        case
            when len(src.updated_last_name) < 2 then last_name
            when src.updated_last_name is null or src.updated_last_name = '' then last_name
            else src.updated_last_name
        end),
    link = src._link,
    title = src.updated_title,
    is_verified = 1,
    last_verified = src.verified_time,
    last_active_type_update = src.verified_time,
    active_type = 'ENGAGED',
    title_id = src.title_id,
    function = src.function,
    role = src.role
from <sourcetable> src
where lower(tgt.email_id) = lower(src.email_id)
  and src.demographic_disposition in ('Updated Last Name', 'Updated Title', 'Updated First Name', 'Verified', 'Updated Country', 'Partial Verify')
  and src.verified_time > '<last_runtime>';
Explain Plan:
{
"GlobalStats": {
"partitionsTotal": 728,
"partitionsAssigned": 486,
"bytesAssigned": 9182406144
},
"Operations": [
[
{
"id": 0,
"operation": "Result",
"expressions": [
"number of rows updated",
"number of multi-joined rows updated"
]
},
{
"id": 1,
"parent": 0,
"operation": "Update",
"objects": [
"<target_table>"
]
},
{
"id": 2,
"parent": 1,
"operation": "InnerJoin",
"expressions": [
"joinKey: (LOWER(src.EMAIL_ID) = LOWER(tgt.EMAIL_ID))"
]
},
{
"id": 3,
"parent": 2,
"operation": "Filter",
"expressions": [
"(src.DEMOGRAPHIC_DISPOSITION IN 'Updated Last Name' IN 'Updated Job Title' IN 'Updated First Name' IN 'Verified' IN 'Updated Country' IN 'Partial Verify') AND (src.VERIFIED_TIME > '<last_runtime>') AND (LOWER(src.EMAIL_ID) IS NOT NULL)"
]
},
{
"id": 4,
"parent": 3,
"operation": "TableScan",
"objects": [
"<src_table>"
],
"expressions": [
"EMAIL_ID",
"LINK",
"UPDATED_FIRST_NAME",
"UPDATED_LAST_NAME",
"UPDATED_TITLE",
"DEMOGRAPHIC_DISPOSITION",
"VERIFIED_TIME",
"FUNCTION",
"ROLE",
"TITLE_ID"
],
"alias": "BH",
"partitionsAssigned": 1,
"partitionsTotal": 243,
"bytesAssigned": 1040384
},
{
"id": 5,
"parent": 2,
"operation": "Filter",
"expressions": [
"LOWER(tgt.EMAIL_ID) IS NOT NULL"
]
},
{
"id": 6,
"parent": 5,
"operation": "JoinFilter",
"expressions": [
"joinKey: (LOWER(src.EMAIL_ID) = LOWER(tgt.EMAIL_ID))"
]
},
{
"id": 7,
"parent": 6,
"operation": "TableScan",
"objects": [
"<target_table>"
],
"expressions": [
"EMAIL_ID",
"FIRST_NAME",
"LAST_NAME"
],
"partitionsAssigned": 485,
"partitionsTotal": 485,
"bytesAssigned": 9181365760
}
]
]
}
The new requirement is that this update job runs every 5 minutes, so that we have near-real-time data in Snowflake.
However, we have tried every query optimisation we could think of and are unable to bring the execution time of the updates below 5 minutes.
We are using a Small warehouse in Snowflake, and due to budget constraints we cannot move to a larger one to speed up the update query.
Is there any other budget-friendly way to do near-real-time updates (after ingesting the data) in Snowflake?

Related

Solr's Complex Query for Nested Documents is not Working

I am working with the following dataset in Solr:
[
  {
    "id": "doc_1",
    "name": "Harpreet Chaggar",
    "_childDocuments_": [
      { "id": "child_doc_a", "number": 22, "created_at": "2020-03-20T00:00:00Z" },
      { "id": "child_doc_b", "number": 10, "created_at": "2021-05-28T00:00:00Z" }
    ]
  },
  {
    "id": "doc_2",
    "name": "Hardik Deshmukh",
    "_childDocuments_": [
      { "id": "child_doc_1", "number": 67, "created_at": "2022-03-20T00:00:00Z" },
      { "id": "child_doc_2", "number": 78, "created_at": "2022-05-28T00:00:00Z" }
    ]
  }
]
My objective is to build an exclusion query on a nested date field, combined with some parent conditions, and to get the parent document back in every case.
I am trying to fetch "id": "doc_2", "name": "Hardik Deshmukh" with the following query. Note: I need the parent document in the result.
q = {!parent which='(name:("Hardik" OR "Harpreet") AND id:"doc_1")'}-created_at:[2020-01-17T00:00:00Z TO 2021-12-17T00:00:00Z]
But I am not getting any results.
To check whether the date query works on its own, I ran the query below.
q = -created_at:[2020-01-17T00:00:00Z TO 2021-12-17T00:00:00Z]
It worked:
"response":{"numFound":4,"start":0,"numFoundExact":true,"docs":[
{
"id":"doc_1",
"name":["Harpreet Chaggar"],
"_version_":1746310602768252928},
{
"id":"child_doc_1",
"number":["67"],
"created_at":"2022-03-20T00:00:00Z",
"_version_":1746310602791321600},
{
"id":"child_doc_2",
"number":["78"],
"created_at":"2022-05-28T00:00:00Z",
"_version_":1746310602791321600},
{
"id":"doc_2",
"name":["Hardik Deshmukh"],
"_version_":1746310602791321600}]
}}
Field types:
For created_at
Field-Type:org.apache.solr.schema.DatePointField
For name
Field-Type:org.apache.solr.schema.TextField
And if I want to fetch "id": "doc_1", I am able to get it by executing the following query.
{!parent which='(name:("Hardik" OR "Harpreet") AND id:"doc_1")'} ( created_at:[2020-01-17T00:00:00Z TO 2021-12-17T00:00:00Z] )
It returns the desired result:
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id":"doc_1",
"name":["Harpreet Chaggar"],
"_version_":1746310602768252928}]
}}

PostgreSQL jsonb_set multiple elements in array

I have the following jsonb structure in the column recipients of a table called mailing:
[
{
"text": "Text1",
"smsId": 1,
"value": "123456",
"status": "Sent"
},
{
"text": "Text1",
"smsId": 2,
"value": "23456",
"status": "Sent"
},
{
"text": "Text1",
"smsId": 3,
"value": "345678",
"status": "Sent"
}]
I need to update one field in multiple elements, so the outcome should look like this:
[
{
"text": "Text1",
"smsId": 1,
"value": "123456",
"status": "Delivered"
},
{
"text": "Text1",
"smsId": 2,
"value": "23456",
"status": "Delivered"
},
{
"text": "Text1",
"smsId": 3,
"value": "345678",
"status": "Delivered"
}]
The closest I got to a solution is this:
WITH item AS (SELECT mailing_id, ('{' || INDEX-1 || ',status}')::text[] AS PATH
FROM mailing, jsonb_array_elements(recipients) WITH ORDINALITY arr(recipient, INDEX)
WHERE recipient->>'smsId' = any(array['1', '2', '3']))
UPDATE mailing m
SET recipients = jsonb_set(recipients, item.path, '"Delivered"',FALSE)
FROM item
WHERE m.mailing_id = item.mailing_id;
But this solution updates only the first element, and I am not sure whether I should somehow loop over it or try a different approach.
You need to aggregate modified array elements with jsonb_agg():
with new_data as (
    select
        mailing_id,
        jsonb_agg(
            case when value->>'smsId' = any('{1,2,3}') then value || '{"status": "Delivered"}'
                 else value
            end) as recipients
    from mailing
    cross join jsonb_array_elements(recipients)
    group by mailing_id
)
update mailing m
set recipients = n.recipients
from new_data n
where m.mailing_id = n.mailing_id;
Test it in db<>fiddle.
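To make the per-element logic of that aggregation easier to follow, here is a minimal Python sketch of what happens to a single row's recipients array. The function name mark_delivered and the sample list (with one non-matching smsId of 4) are illustrative assumptions, not part of the original answer; the SQL remains the actual solution.

```python
import json


def mark_delivered(recipients, sms_ids):
    """Return a new list where matching elements get status 'Delivered'.

    Mirrors the CASE inside jsonb_agg(): matching elements are merged with
    {"status": "Delivered"}, all others are passed through unchanged.
    """
    wanted = {str(i) for i in sms_ids}  # value->>'smsId' is compared as text
    return [
        {**item, "status": "Delivered"} if str(item.get("smsId")) in wanted else item
        for item in recipients
    ]


# Hypothetical sample: the last element does not match and must stay "Sent".
recipients = json.loads("""[
  {"text": "Text1", "smsId": 1, "value": "123456", "status": "Sent"},
  {"text": "Text1", "smsId": 2, "value": "23456", "status": "Sent"},
  {"text": "Text1", "smsId": 4, "value": "345678", "status": "Sent"}
]""")

updated = mark_delivered(recipients, [1, 2, 3])
print(json.dumps(updated, indent=2))
```

The key point is that the whole array is rebuilt once per row, rather than applying one path-based update per matching element.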

Rails how to extract data from nested JSON data

I am trying to parse out the values I need from this nested JSON data. From quotas I need responders, and from qualified I need service_id and codes.
I first tried to get just the quotas, but kept getting this error: []': no implicit conversion of String into Integer
hash = JSON::parse(response.body)
hash.each do |data|
p data["quotas"]
end
JSON data:
{
"id": 14706,
"relationships" : [
{
"id": 538
}
]
"quotas": [
{
"id": 48894,
"name": "Test",
"responders": 6,
"qualified": [
{
"service_id": 12,
"codes": [
1,
2,
3,
6,
]
},
{
"service_id": 23,
"pre_codes": [
1,
2
]
}
]
}
]
}
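I had to turn your example into valid JSON first. The error occurs because calling each on a Hash yields [key, value] pairs, so data is an Array, and data["quotas"] tries to index that Array with a String. Index the hash directly instead, then loop over the quotas and output the values:
hash = JSON::parse(data.to_json)
hash['quotas'].each do |quota|
  p quota["responders"]
  quota["qualified"].each do |qualified|
    p qualified['service_id']
    p qualified['codes']
  end
end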
Hash in data variable (needed for the sample code to work):
require "json"
data = {
"id": 14706,
"relationships": [
{
"id": 538
}
],
"quotas": [
{
"id": 48894,
"name": "Test",
"responders": 6,
"qualified": [
{
"service_id": 12,
"codes": [
1,
2,
3,
6,
]
},
{
"service_id": 23,
"pre_codes": [
1,
2
]
}
]
}
]
}

Array within Element within Array in Variant

How can I get the data out of this array stored in a variant column in Snowflake? I don't care if it's a new table, a view or a query. There is a second column of type varchar(256) that contains a unique ID.
If you can just help me read the "confirmed" data and the "editorIds" data I can probably take it from there. Many thanks!
An example of the output would be:
UniqueID  ConfirmationID  EditorID
u3kd9     xxxx-436a-a2d7  nupd
u3kd9     xxxx-436a-a2d7  9l34c
R3nDo     xxxx-436a-a3e4  5rnj
yP48a     xxxx-436a-a477  jTpz8
yP48a     xxxx-436a-a477  nupd
[
{
"confirmed": {
"Confirmation": "Entry ID=xxxx-436a-a2d7-3525158332f0: Confirmed order submitted.",
"ConfirmationID": "xxxx-436a-a2d7-3525158332f0",
"ConfirmedOrders": 1,
"Received": "8/29/2019 4:31:11 PM Central Time"
},
"editorIds": [
"xxsJYgWDENLoX",
"JR9bWcGwbaymm3a8v",
"JxncJrdpeFJeWsTbT"
] ,
"id": "xxxxx5AvGgeSHy8Ms6Ytyc-1",
"messages": [],
"orderJson": {
"EntryID": "xxxxx5AvGgeSHy8Ms6Ytyc-1",
"Orders": [
{
"DropShipFlag": 1,
"FromAddressValue": 1,
"OrderAttributes": [
{
"AttributeUID": 548
},
{
"AttributeUID": 553
},
{
"AttributeUID": 2418
}
],
"OrderItems": [
{
"EditorId": "aC3f5HsJYgWDENLoX",
"ItemAssets": [
{
"AssetPath": "https://xxxx573043eac521.png",
"DP2NodeID": "10000",
"ImageHash": "000000000000000FFFFFFFFFFFFFFFFF",
"ImageRotation": 0,
"OffsetX": 50,
"OffsetY": 50,
"PrintedFileName": "aC3f5HsJYgWDENLoX-10000",
"X": 50,
"Y": 52.03909266409266,
"ZoomX": 100,
"ZoomY": 93.75
}
],
"ItemAttributes": [
{
"AttributeUID": 2105
},
{
"AttributeUID": 125
}
],
"ItemBookAttribute": null,
"ProductUID": 52,
"Quantity": 1
}
],
"SendNotificationEmailToAccount": true,
"SequenceNumber": 1,
"ShipToAddress": {
"Addr1": "Addr1",
"Addr2": "0",
"City": "City",
"Country": "US",
"Name": "Name",
"State": "ST",
"Zip": "00000"
}
}
]
},
"orderNumber": null,
"status": "order_placed",
"submitted": {
"Account": "350000",
"ConfirmationID": "xxxxx-436a-a2d7-3525158332f0",
"EntryID": "xxxxx-5AvGgeSHy8Ms6Ytyc-1",
"Key": "D83590AFF0CC0000B54B",
"NumberOfOrders": 1,
"Orders": [
{
"LineItems": [],
"Note": "",
"Products": [
{
"Price": "00.30",
"ProductDescription": "xxxxxint 8x10",
"Quantity": 1
},
{
"Price": "00.40",
"ProductDescription": "xxxxxut Black 8x10",
"Quantity": 1
},
{
"Price": "00.50",
"ProductDescription": "xxxxx"
},
{
"Price": "00.50",
"ProductDescription": "xxxscount",
"Quantity": 1
}
],
"SequenceNumber": "1",
"SubTotal": "00.70",
"Tax": "1.01",
"Total": "00.71"
}
],
"Received": "8/29/2019 4:31:10 PM Central Time"
},
"tracking": null,
"updatedOn": 1.598736670503000e+12
}
]
So, this is how I'd query that exact JSON assuming the data is in column var in table x:
SELECT x.var[0]:confirmed:ConfirmationID::varchar as ConfirmationID,
f.value::varchar as EditorID
FROM x,
LATERAL FLATTEN(input => var[0]:editorIds) f
;
Since your sample output doesn't match the JSON that you provided, I will assume that this is what you need.
Also, note that your JSON is wrapped in outer [ ], which means the whole document sits inside an array. That is why my query uses var[0]. If the array can contain multiple records, var[0] only reads the first one, so you would need to remove the [0] and flatten the outer array as well. In general, it is better to strip the outer brackets and load each record into the table as its own row. I wasn't sure whether you could make that change, so I just wanted to point it out.
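As a rough illustration of what the flatten does, and of how the multi-record case differs from var[0], here is a small Python sketch. The function name flatten_editor_ids and the trimmed sample record are assumptions made for this example; it only mimics the row-per-array-element behaviour and is not Snowflake code.

```python
def flatten_editor_ids(unique_id, var):
    """Yield (UniqueID, ConfirmationID, EditorID) rows for every record in var."""
    for record in var:  # iterate the outer array instead of hard-coding var[0]
        confirmation_id = (record.get("confirmed") or {}).get("ConfirmationID")
        # One output row per editor ID; records without editor IDs yield no rows.
        for editor_id in record.get("editorIds") or []:
            yield (unique_id, confirmation_id, editor_id)


# Trimmed version of the sample document; the unique ID is a made-up value.
var = [
    {
        "confirmed": {"ConfirmationID": "xxxx-436a-a2d7-3525158332f0"},
        "editorIds": ["xxsJYgWDENLoX", "JR9bWcGwbaymm3a8v", "JxncJrdpeFJeWsTbT"],
    }
]

rows = list(flatten_editor_ids("u3kd9", var))
for row in rows:
    print(row)
```

Each editor ID becomes its own row, with the confirmation ID repeated alongside it, which is the shape of the output example in the question.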

jsonb query in postgres

I have a table in Postgres named op_user_event_data, which has a column named data where I store jsonb. At the moment the JSON looks like this:
{
"aisles": [],
"taskGroups": [
{
"index": 0,
"tasks": [
{
"index": 1,
"mandatory": false,
"name": "Dados de Linear",
"structuresType": null,
"lines": [
{
"sku": {
"skuId": 1,
"skuName": "Limiano Bola",
"marketId": [
1,
3,
10,
17
],
"productId": 15,
"brandId": [
38,
44
]
},
"taskLineFields": [
{
"tcv": {
"value": "2126474"
},
"columnType": "skuLocalCode",
"columnId": 99
},
{
"tcv": {
"value": null
},
"columnType": "face",
"columnId": 29
},
]
},
{
"sku": {
"skuId": 3,
"skuName": "Limiano Bolinha",
"marketId": [
1,
3,
10,
17
],
"productId": 15,
"brandId": [
38,
44
]
},
"taskLineFields": [
{
"tcv": {
"value": "2545842"
},
"columnType": "skuLocalCode",
"columnId": 99
},
{
"tcv": {
"value": null
},
"columnType": "face",
"columnId": 29
},
]
},
{
"sku": {
"skuId": 5,
"skuName": "Limiano Bola 1/2",
"marketId": [
1,
3,
10,
17
],
"productId": 15,
"brandId": [
38,
44
]
},
"taskLineFields": [
{
"tcv": {
"value": "5127450"
},
"columnType": "skuLocalCode",
"columnId": 99
},
{
"tcv": {
"value": "5.89"
},
"columnType": "rsp",
"columnId": 33
}
]
}
Basically, I have an object that contains:
aisles [],
taskGroups,
id and name.
Inside taskGroups, as shown in the JSON, one of the attributes is tasks, which is an array. Each task has an array called lines, and each line holds a sku object and a taskLineFields array.
In short:
taskGroups -> tasks -> lines -> sku or taskLineFields.
I have tried different queries to get the sku, but whenever I try to go further than 'lines' the result comes back blank or, in other attempts, I get 'cannot call elements from scalar'.
Can anyone help me with this issue? Note that this is just a sample JSON.
Does anyone know how to make this work?
I want all lines where lines->taskLineFields->columnType = 'offer'.
All I have is this, but it throws the scalar error:
SELECT lines->'sku' Produto, lines->'taskLineFields'->'tcv'->>'value' ValorOferta
FROM sd_bel.op_user_event_data,
jsonb_array_elements(data->'taskGroups') taskgroups,
jsonb_array_elements(taskgroups->'tasks') tasks,
jsonb_array_elements(tasks->'columns') columns,
jsonb_array_elements(tasks->'lines') lines
WHERE created_by = 'belteste'
AND lines->'taskLineFields'->>'columnType' = 'offer'
Say your data is in some json_column in your table:
-- The column in the question is jsonb, so the jsonb_* function is used here.
with t as (
    select json_column as xyz from your_table
),
tg as (select jsonb_array_elements(xyz->'taskGroups') taskgroups from t),
tsk as (select jsonb_array_elements(taskgroups->'tasks') tasks from tg)
select jsonb_array_elements(tasks->'lines') -> 'sku' as sku from tsk;
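The filter on columnType needs one more level of unnesting, because in the sample taskLineFields is itself an array. The traversal can be sketched in Python as below. The function name lines_with_column_type and the trimmed sample data are assumptions for illustration only; each nested loop corresponds to one array-unnesting step in SQL.

```python
def lines_with_column_type(data, column_type):
    """Yield (sku, value) for each line that has a field of the given columnType."""
    for group in data.get("taskGroups", []):
        for task in group.get("tasks", []):
            for line in task.get("lines", []):
                # taskLineFields is an array, so it needs its own loop.
                for field in line.get("taskLineFields", []):
                    if field.get("columnType") == column_type:
                        yield line.get("sku"), (field.get("tcv") or {}).get("value")


# Trimmed version of the sample document from the question.
data = {
    "aisles": [],
    "taskGroups": [{"index": 0, "tasks": [{"index": 1, "lines": [
        {"sku": {"skuId": 1, "skuName": "Limiano Bola"},
         "taskLineFields": [
             {"tcv": {"value": "2126474"}, "columnType": "skuLocalCode", "columnId": 99},
             {"tcv": {"value": None}, "columnType": "face", "columnId": 29}]},
        {"sku": {"skuId": 5, "skuName": "Limiano Bola 1/2"},
         "taskLineFields": [
             {"tcv": {"value": "5127450"}, "columnType": "skuLocalCode", "columnId": 99},
             {"tcv": {"value": "5.89"}, "columnType": "rsp", "columnId": 33}]},
    ]}]}],
}

matches = list(lines_with_column_type(data, "rsp"))
for sku, value in matches:
    print(sku["skuName"], value)
```

The sample contains no 'offer' fields, so the sketch filters on 'rsp' instead; passing 'offer' returns no rows for this data.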
