Merge query in Snowflake gives incorrect count of data

I am downloading two files, uploading them to a staging area, copying them into a temp table, and then merging them into the main table.
But I am getting an incorrect count of data:
def load_snowflake_table(self, directory):
    try:
        self.snowflake_client.run("ALTER SESSION SET TIMEZONE = 'UTC';")
        self.stage_name = f"{self.sf_temp_table}_{self.snowflake_client.generate_random_string()}"
        create_stage = f"CREATE TEMPORARY STAGE {self.stage_name} COMMENT = 'TEMPORARY STAGE FOR {self.sf_temp_table} DATA LOAD'"
        self.snowflake_client.run(create_stage)
        logger.info("Temporary stage created.")
        self.snowflake_client.run(f"PUT file://{directory}/* @{self.stage_name} PARALLEL=4")
        logger.info("Successfully uploaded all files to the staging area.")
        self.snowflake_client.run(
            f"COPY INTO {self.sf_temp_table} FROM @{self.stage_name} PURGE = TRUE MATCH_BY_COLUMN_NAME = "
            f"'CASE_INSENSITIVE' FILE_FORMAT "
            f"= (TYPE = 'AVRO')")
        logger.info(f"Successfully copied the file data into temporary table {self.sf_temp_table}")
        sf_query = f"SELECT COUNT(*) FROM {self.sf_temp_table}"
        logger.info(sf_query)
        sf_count = self.snowflake_client.run(sf_query).fetchall()
        sf_count = sf_count[0][0]
        print("=====================")
        print(sf_count)
        merge_query = self.form_merge_query(self.COLUMNS)
        logger.info(f"Executing merge query: {merge_query}")
        self.snowflake_client.run(merge_query)
        logger.info("Truncating temporary table")
        self.snowflake_client.run(f"TRUNCATE TABLE IF EXISTS {self.sf_temp_table}")
    except Exception as e:
        logger.error("Error loading to Snowflake: {e}".format(e=e))
        raise e
    finally:
        self.snowflake_client.run(f"DROP STAGE IF EXISTS {self.stage_name}")
        logger.info(f"Dropped temporary stage {self.stage_name}")
The count in the first file was 135,839 and the temporary table showed the same count; the second file had 135,687 rows and again the temporary table matched. Hence the count in the final table should be 271,526, but it is coming out lower.
All the records are unique on the primary key combination.
My merge query is:
MERGE INTO MERGETEST USING
(SELECT $1 CHANNELGROUPING, $2 CLIENTID, $3 CUSTOMDIMENSIONS, $4 DATE, $5 DEVICE, $6 FULLVISITORID,
$7 GEONETWORK, $8 SOCIALENGAGEMENTTYPE, $9 TOTALS,
$10 TRAFFICSOURCE, $11 VISITID, $12 VISITNUMBER, $13 VISITSTARTTIME, $14 APPINFO, $15 CONTENTGROUP,
$16 HITS_CUSTOMDIMENSIONS, $17 CUSTOMMETRICS,$18 CUSTOMVARIABLES, $19 DATASOURCE, $20 ECOMMERCEACTION,
$21 EVENTINFO, $22 EXCEPTIONINFO, $23 EXPERIMENT, $24 HITNUMBER
FROM T_SESSIONS)
as TEST_SESSIONS
ON MERGETEST.FULLVISITORID = TEST_SESSIONS.FULLVISITORID and MERGETEST.VISITID = TEST_SESSIONS.VISITID and
MERGETEST.VISITSTARTTIME = TEST_SESSIONS.VISITSTARTTIME and MERGETEST.HITNUMBER = TEST_SESSIONS.HITNUMBER
WHEN NOT MATCHED THEN
INSERT(CHANNELGROUPING,CLIENTID,CUSTOMDIMENSIONS,DATE,DEVICE,FULLVISITORID,GEONETWORK,SOCIALENGAGEMENTTYPE,TOTALS,
TRAFFICSOURCE,VISITID,VISITNUMBER,VISITSTARTTIME,APPINFO,CONTENTGROUP,HITS_CUSTOMDIMENSIONS,CUSTOMMETRICS,
CUSTOMVARIABLES,DATASOURCE,ECOMMERCEACTION,EVENTINFO,EXCEPTIONINFO,EXPERIMENT,HITNUMBER)
VALUES
(TEST_SESSIONS.CHANNELGROUPING,TEST_SESSIONS.CLIENTID,TEST_SESSIONS.CUSTOMDIMENSIONS,TEST_SESSIONS.DATE,
TEST_SESSIONS.DEVICE,TEST_SESSIONS.FULLVISITORID,TEST_SESSIONS.GEONETWORK,TEST_SESSIONS.SOCIALENGAGEMENTTYPE,
TEST_SESSIONS.TOTALS,TEST_SESSIONS.TRAFFICSOURCE,TEST_SESSIONS.VISITID,TEST_SESSIONS.VISITNUMBER,
TEST_SESSIONS.VISITSTARTTIME,TEST_SESSIONS.APPINFO,TEST_SESSIONS.CONTENTGROUP,TEST_SESSIONS.HITS_CUSTOMDIMENSIONS,
TEST_SESSIONS.CUSTOMMETRICS,TEST_SESSIONS.CUSTOMVARIABLES,TEST_SESSIONS.DATASOURCE,TEST_SESSIONS.ECOMMERCEACTION,
TEST_SESSIONS.EVENTINFO,TEST_SESSIONS.EXCEPTIONINFO,TEST_SESSIONS.EXPERIMENT,TEST_SESSIONS.HITNUMBER);
All my rows are unique on the FULLVISITORID, VISITID, VISITSTARTTIME and HITNUMBER combination, but I am still not getting the expected number of rows.
Up to the temporary table everything matched. Is there something wrong with the approach or with my merge query?

File/table one: 135,839
File/table two: 135,687
Expected: 135,839 + 135,687 = 271,526
The assumption that the merged result has to be the sum of the two counts does not hold when the two tables share rows with the same key values.
Sample scenario (two rows + one row)
(FULLVISITORID, VISITID, VISITSTARTTIME, HITNUMBER, ...)
1, 1, 2021-01-01, 1, ...
2, 2, 2021-01-01, 2, ...
Other table:
(FULLVISITORID, VISITID, VISITSTARTTIME, HITNUMBER, ...)
1, 1, 2021-01-01, 1, ...
After a merge that performs INSERT only, the output will be 2 rows, not 3.
(FULLVISITORID, VISITID, VISITSTARTTIME, HITNUMBER, ...)
1, 1, 2021-01-01, 1, ...
2, 2, 2021-01-01, 2, ...
To find the overlap, INTERSECT can be used:
SELECT $6 FULLVISITORID, $11 VISITID, $13 VISITSTARTTIME, $24 HITNUMBER
FROM T_SESSIONS
INTERSECT
SELECT FULLVISITORID, VISITID, VISITSTARTTIME, HITNUMBER
FROM MERGETEST;
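If that returns rows, those key combinations already exist in MERGETEST and are skipped by an insert-only merge. A minimal sketch for counting them (assuming the same column positions as in the merge query above), to be run while T_SESSIONS still holds the second file's data:
SELECT COUNT(*) AS overlapping_keys
FROM (
    SELECT $6 FULLVISITORID, $11 VISITID, $13 VISITSTARTTIME, $24 HITNUMBER
    FROM T_SESSIONS
    INTERSECT
    SELECT FULLVISITORID, VISITID, VISITSTARTTIME, HITNUMBER
    FROM MERGETEST
);
The gap between the expected 271,526 and the actual row count of MERGETEST should correspond to this number.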

Related

Count nested JSON array elements over all result rows

I have a SQL query that I am running in order to get results, where one of the columns contains a JSON array.
I want to count the total number of JSON elements across all returned rows.
I.e. if 2 rows were returned, where one row had 3 JSON array items in the metadata column and the second row had 4 JSON array items in the metadata column, I'd like to see 7 as the returned count.
Is this possible?
This is my current SQL query:
WITH _result AS (
SELECT lo.*
FROM laser.laser_checks la
JOIN laser.laser_brands lo ON la.id = lo.brand_id
WHERE lo.type not in (1)
AND la.source in (1,4,5)
AND la.prod_id in (1, 17, 19, 22, 27, 29)
)
SELECT ovr.json -> 'id' AS object_uuid,
ovr.json -> 'username' AS username,
image.KEY AS image_uuid,
image.value AS metadata,
user_id as user_uuid
FROM _result ovr,
jsonb_array_elements(ovr."json" -> 'images') elem,
jsonb_each(elem) image
Unpack the arrays and count the elements:
WITH q AS (/* your query */)
SELECT object_uuid,
username,
image_uuid,
metadata,
user_uuid,
sum(elemcount) OVER () AS total_array_elements
FROM (SELECT q.object_uuid,
q.username,
q.image_uuid,
q.metadata,
q.user_uuid,
count(a.e) AS elemcount
FROM q
LEFT JOIN LATERAL jsonb_array_elements(q.metadata) AS a(e)
ON TRUE
GROUP BY q.object_uuid,
q.username,
q.image_uuid,
q.metadata,
q.user_uuid
) AS p;
An elephant managed to slip everybody's attention in this room: jsonb_array_length().
(Or json_array_length() for json.)
The manual:
Returns the number of elements in the top-level JSON array.
After you have already unnested the JSON array to your level of interest, you can apply the function to the (now) top level. Wrap it in a window function to get total counts for every result row.
Your query should work like this:
SELECT lo.json -> 'id' AS object_uuid
, lo.json -> 'username' AS username
, image.key AS image_uuid
, image.value AS metadata
, lo.user_id AS user_uuid
, sum(jsonb_array_length(image.value)) OVER () AS total_array_elements -- !!!
FROM laser.laser_checks la
JOIN laser.laser_brands lo ON la.id = lo.brand_id
, jsonb_array_elements(lo."json" -> 'images') elem
, jsonb_each(elem) image
WHERE lo.type NOT IN (1)
AND la.source IN (1,4,5)
AND la.prod_id IN (1, 17, 19, 22, 27, 29);
No need for a LATERAL subquery, aggregation, nor even for a CTE, really.
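As a minimal, self-contained illustration of the idea (hypothetical two-row input mirroring the 3 + 4 = 7 example from the question):
SELECT t.id,
       jsonb_array_length(t.metadata)              AS elements_in_row,
       sum(jsonb_array_length(t.metadata)) OVER () AS total_array_elements
FROM (VALUES (1, '[1,2,3]'::jsonb),
             (2, '[4,5,6,7]'::jsonb)) AS t(id, metadata);
Both rows report total_array_elements = 7.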
Related:
Sort by length of nested JSON array

Snowflake merge query in batch manner

I have a lot of data in the form of a list of dictionaries. I want to insert all of the data into a Snowflake table.
The primary key on the table is ID; I can receive new data for which an ID is already present, in which case I need to update that row. What I have done so far: since the data is large, I insert the batched data into a temporary table, and then from the temporary table I use a merge query to update/insert into the main table.
def batch_data(data, chunk_size):
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

def upsert_user_data(self, user_data):
    columns = ["\"" + x + "\"" for x in user_data[0].keys()]
    values = ['?' for _ in user_data[0].keys()]
    for chunk in batch_data(user_data, 1000):
        sql = f"INSERT INTO TEMP ({','.join(columns)}) VALUES ({','.join(values)});"
        print(sql)
        data_to_load = [[x for x in i.values()] for i in chunk]
        snowflake_client.run(sql, tuple(data_to_load))
    sql = "MERGE INTO USER USING (SELECT ID AS TID, NAME AS TNAME, STATUS AS TSTATUS FROM TEMP) AS TEMPTABLE " \
          "ON USER.ID = TEMPTABLE.TID WHEN MATCHED THEN UPDATE SET USER.NAME = TEMPTABLE.TNAME, USER.STATUS = TEMPTABLE.TSTATUS " \
          "WHEN NOT MATCHED THEN INSERT (ID, NAME, STATUS) VALUES (TEMPTABLE.TID, TEMPTABLE.TNAME, TEMPTABLE.TSTATUS);"
    snowflake_client.run(sql)
Is there any way I can remove the temporary table and run only the merge query in a batched way?
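One possible direction (a sketch only, not tested against your setup): merge directly against an inline VALUES list, so each chunk is bound straight into the USING clause and the temporary table disappears. column1, column2, column3 are Snowflake's default column names for a VALUES clause:
MERGE INTO USER USING (
    SELECT column1 AS TID, column2 AS TNAME, column3 AS TSTATUS
    FROM VALUES (?, ?, ?), (?, ?, ?), (?, ?, ?)  -- one (?, ?, ?) group per row in the chunk
) AS TEMPTABLE
ON USER.ID = TEMPTABLE.TID
WHEN MATCHED THEN UPDATE SET USER.NAME = TEMPTABLE.TNAME, USER.STATUS = TEMPTABLE.TSTATUS
WHEN NOT MATCHED THEN INSERT (ID, NAME, STATUS) VALUES (TEMPTABLE.TID, TEMPTABLE.TNAME, TEMPTABLE.TSTATUS);
The Python side would generate one (?, ?, ?) group per row in the chunk and pass the flattened values to the execute call. Note that duplicate IDs within a single chunk make the MERGE nondeterministic, just as they would with the temp-table approach.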

SQL Server : nested looping over two Selects

I have the following two queries that produce the results I need. I would normally build the final output I want in Python after the results are returned, but unfortunately only SQL can be used here.
Query A:
SELECT *
FROM openquery(PROD, 'SELECT `status`, computer_name, device_type
FROM assets
WHERE (device_type="SERVER")
AND (status="ACTIVE")')
Query B:
SELECT *
FROM openquery(AppMap, 'SELECT `t1`.`uaid` AS `uaid`, `t3`.`computer_name`
FROM ((`applications` `t1`
JOIN `app_infrastructure` `t2` ON (((`t1`.`uaid` = `t2`.`uaid`))))
JOIN `infrastructure` `t3` ON ((`t2`.`infrastructure_id` = `t3`.`infrastructure_id`)));')
How I would want to process the results:
if a computer_name is in both A and B:
final_row = ['computer_name', 1]
elseif a computer_name is in A but not B:
final_row = ['computer_name', 0]
elseif a computer_name is in B but not A:
final_row = ['computer_name', 2]
So my final query results need to look like those rows, does that make sense?
In a stored procedure, use both queries to load table variables.
Then do a FULL OUTER JOIN query, joining the two table variables on computer_name, and use a CASE expression to get your final_row value for each computer name.
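A rough sketch of that approach (column sizes assumed; the remote queries are carried over from the question):
DECLARE @A TABLE (computer_name NVARCHAR(255));
DECLARE @B TABLE (computer_name NVARCHAR(255));

INSERT INTO @A (computer_name)
SELECT computer_name
FROM openquery(PROD, 'SELECT computer_name FROM assets
                      WHERE device_type="SERVER" AND status="ACTIVE"');

INSERT INTO @B (computer_name)
SELECT computer_name
FROM openquery(AppMap, 'SELECT `t3`.`computer_name`
                        FROM `applications` `t1`
                        JOIN `app_infrastructure` `t2` ON `t1`.`uaid` = `t2`.`uaid`
                        JOIN `infrastructure` `t3` ON `t2`.`infrastructure_id` = `t3`.`infrastructure_id`');

SELECT COALESCE(a.computer_name, b.computer_name) AS computer_name,
       CASE WHEN a.computer_name IS NOT NULL AND b.computer_name IS NOT NULL THEN 1
            WHEN a.computer_name IS NOT NULL THEN 0
            ELSE 2
       END AS final_row
FROM @A a
FULL OUTER JOIN @B b ON a.computer_name = b.computer_name;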

Creating multiple rows in databases

I'm not sure why I can't test my function. My desired output is ID, then Room, but if there are multiple rooms for the same ID, each extra room should go in a new row, turning
ID Room
1 SW128 SW143
into
ID Room
1 SW128
1 SW143
This is some of the data in the file.
1,SW128,SW143
2,SW309
3,AA205
4,AA112,SY110
5,AC223
6,AA112,AA206
but I can't even test my function. Can anyone please help me fix this?
def create_location_table(db, loc_file):
    '''Location table has format ID, Room'''
    con = sqlite3.connect(db)
    cur = con.cursor()
    cur.execute('''DROP TABLE IF EXISTS Locations''')
    # create the table
    cur.execute('''CREATE TABLE Locations (id TEXT, Room TEXT)''')
    # Add the rows
    loc_file = open('locations.csv', 'r')
    loc_file.readline()
    for line in loc_file:
        d = {}
        data = line.split(',')
        ID = data[0]
        Room = data[1:]
        for (ID, Room) in d.items():
            if Room not in d:
                d[ID] = [Room]
        for i in Rooms:
            cur.execute(''' INSERT INTO Locations VALUES(?, ?)''', (ID, Room))
    # commit and close cursor and connection
    con.commit()
    cur.close()
    con.close()
The problem is that d is always an empty dict, so the for (ID, Room) in d.items() loop never does anything. What you need to do is loop over Room, and you don't need the d dict at all.
import sqlite3

def create_location_table(db, loc_file):
    '''Location table has format ID, Room'''
    con = sqlite3.connect(db)
    cur = con.cursor()
    cur.execute('''DROP TABLE IF EXISTS Locations''')
    # create the table
    cur.execute('''CREATE TABLE Locations (id TEXT, Room TEXT)''')
    # open the CSV
    csv_content = open(loc_file, 'r')
    for line in csv_content:
        data = line.strip().split(',')
        # made lowercase following PEP 8, but 'id' is a built-in name in Python
        idx = data[0]
        rooms = data[1:]
        # loop through the rooms of this line and insert one row per room
        for room in rooms:
            cur.execute(''' INSERT INTO Locations VALUES(?, ?)''', (idx, room))
            # for debug purposes only
            print('INSERT INTO Locations VALUES(%s, %s)' % (idx, room))
    # commit and close cursor and connection
    con.commit()
    cur.close()
    con.close()

# call the method
create_location_table('db.sqlite3', 'locations.csv')
Note: Following PEP 8 I made your variables lowercase.
EDIT: post full code example, use loc_file parameter

DBD::SQLite, how to pass array in query via placeholder?

Let's have a table:
sqlite> create table foo (foo int, bar int);
sqlite> insert into foo (foo, bar) values (1,1);
sqlite> insert into foo (foo, bar) values (1,2);
sqlite> insert into foo (foo, bar) values (1,3);
Then SELECT some data:
sqlite> select * from foo where foo = 1 and bar in (1,2,3);
1|1
1|2
1|3
Works all right. Now I'm trying to use DBD::SQLite 1.29:
my $sth = $dbh->prepare('select * from foo where foo = $1 and bar in ($2)');
$sth->execute(1,[1,2,3]);
And this gives me no results. A DBI trace shows that the 2nd placeholder is bound to the array all right, but no luck. If I join the array values into a string and pass that, no result either. If I flatten the array, I get the predictable error of being "called with N placeholders instead of 2".
I'm kind of at a loss. What else is there to try?
Upd: All right, here's one bona fide example taken from a real-world application.
First, the setup: I have several tables filled with statistical data; the number of columns varies from 10 to 700+. The queries I'm talking about select a subset of that data for reporting purposes. Different reports consider different aspects and therefore run different queries, one or more per request. There are more than 200 reports, i.e. 200-300 queries. This approach was developed for Postgres, and now I need to scale it down and make it work with SQLite. Considering that all this works well with Postgres, I can't justify going over all the queries and rewriting them; bad for maintenance. I can and do use in-place query adjustments, like replacing = ANY () with IN (), but these are minor aspects.
So, here's my example: two queries run in succession for one report:
SELECT SPLIT, syn(SPLIT),
(SELECT COUNT(*) FROM cagent WHERE ACD = $1 AND SPLIT = $2 AND
LOC_ID = ANY ($3) AND LOGID IS NOT NULL AND WORKMODE = 40),
(SELECT COUNT(*) FROM cagent WHERE ACD = $1 AND SPLIT = $2 AND
LOC_ID = ANY ($3) AND LOGID IS NOT NULL AND WORKMODE = 30),
(SELECT COUNT(*) FROM cagent WHERE ACD = $1 AND SPLIT = $2 AND
LOC_ID = ANY ($3) AND LOGID IS NOT NULL AND WORKMODE = 50),
(SELECT COUNT(*) FROM cagent WHERE ACD = $1 AND SPLIT = $2 AND
LOC_ID = ANY ($3) AND LOGID IS NOT NULL AND WORKMODE = 220),
(SELECT COUNT(*) FROM cagent WHERE ACD = $1 AND SPLIT = $2 AND
LOC_ID = ANY ($3) AND LOGID IS NOT NULL),
(SELECT COUNT(*) FROM cagent WHERE ACD = $1 AND SPLIT = $2 AND
LOC_ID = ANY ($3) AND LOGID IS NOT NULL AND WORKMODE = 20),
(SELECT COUNT(*) FROM cagent WHERE ACD = $1 AND SPLIT = $2 AND
LOC_ID = ANY ($3) AND LOGID IS NOT NULL AND WORKMODE = 80)
FROM csplit WHERE ACD = $1 AND SPLIT = $2
SELECT syn(LOGID), syn(LOC_ID), LOGID, EXTENSION, syn(ROLE), PERCENT,
syn(AUXREASON), syn(AWORKMODE), syn(DIRECTION), WORKSKILL, syn(WORKSKLEVEL),
AGTIME FROM cagent WHERE ACD = $1 AND SPLIT = $2 AND LOC_ID = ANY ($3) AND
LOGID IS NOT NULL
This is not the most complex example, as there can be any number of input parameters used and reused in different places in the query; replacing them with generic ? placeholders is not a trivial task. The code that runs queries against Postgres looks like this (after input cleansing et al.):
sub run_select {
    my ($class, $dbh, $sql, @bind_values) = @_;
    my $sth;
    eval {
        $sth = $dbh->prepare_cached($sql);
        $sth->execute(@bind_values);
    };
    $@ and die "Error executing query: $@";
    my %types;
    {
        my $dbt = $dbh->type_info_all;
        @types{ map { $_->[1] } @$dbt[1..$#$dbt] } =
            map { $_->[0] } @$dbt[1..$#$dbt];
    };
    my @result;
    while (my $row = $sth->fetchrow_arrayref) {
        my $i = 0;
        push @result, [ map { [ $types{${$sth->{TYPE}}[$i++]}, $_ ] } @$row ];
    };
    return \@result;
};
I can rewrite queries and inject values directly; SQL injection is not much of a threat because all input is untainted through regex patterns long before it can hit the SQL engine. I don't want to rewrite queries dynamically for two reasons: a) it can potentially lead to problems with value quotation, and b) it kind of kills the whole point of prepare_cached: the SQL engine can't reuse a cached prepared statement if the statement text changes every time.
Now as I said, the code above works well with Postgres. Since the SQLite engine itself obviously can work with data sets, I thought it was a deficiency in the DBD::SQLite implementation. So the real question is: is there any way to pass a data set in a placeholder with DBD::SQLite? Not necessarily an array, though that would be most logical.
Try this:
my $sth = $dbh->prepare("select * from foo where foo = ? and bar in (?,?,?)");
$sth->execute(1,1,2,3);
You can use the x repetition operator to generate the required number of ?s:
my $sql = sprintf "select ... and bar in (%s)", join ",", ('?') x @values;
Use SQL::Abstract, like this:
use strict;
use warnings;
use SQL::Abstract;

my $sqla = SQL::Abstract->new;
my %where = (
    foo => 1,
    bar => { -in => [1,2,3] }
);

my ($sql, @params) = $sqla->select('foo', '*', \%where);

my $sth = $dbh->prepare($sql);
$sth->execute(@params);
