Transform SQL Cursor using PySpark in Databricks - loops

We have a cursor in DB2 that, in each loop iteration, reads data from two tables. At the end of each iteration, after inserting the data into a target table, we update the records related to that iteration in those two tables before moving on to the next one. An indicative example is below:
OPEN CUR1;
FETCH CUR1 INTO V_A1, V_A2, V_C1, V_C3, V_M1, V_M2;
WHILE ..... DO                     -- e.g. while the last FETCH found a row
    SELECT M1 INTO V_M1
    FROM TABLE_1
    WHERE A1 = V_A1;

    SELECT M2 INTO V_M2
    FROM TABLE_2
    WHERE C1 = V_C1;

    IF ..... THEN
        SET V_B1 = V_M1 - V_M2;
    ELSE
        ....
    END IF;

    INSERT INTO TARGET
    ...
    VALUES
    (V_A1, V_A2, ...);

    UPDATE TABLE_1
    SET M1 = M1 - V_B1
    WHERE A1 = V_A1;

    UPDATE TABLE_2
    SET M2 = M2 - V_B1
    WHERE C1 = V_C1;

    FETCH CUR1 INTO V_A1, V_A2, V_C1, V_C3, V_M1, V_M2;
END WHILE;
CLOSE CUR1;
Just to note that A1 and C1 are not unique across the data.
Could you please suggest a way to transform this using PySpark? Performance also matters, as we are dealing with a large amount of data. I saw that RDDs are immutable, in case we were considering the RDD map option.
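One way to approach this in Databricks is to drop the row-by-row loop entirely and express each step as a set-based operation, which is also why the RDD map route fights immutability. Below is a minimal, hedged sketch in Spark SQL (the same logic can be written with DataFrame joins): CURSOR_SRC stands in for the cursor's SELECT, the M1 > M2 condition and the ELSE 0 branch stand in for the real IF logic, and TABLE_1/TABLE_2/TARGET are assumed to be Delta tables so that MERGE INTO is available. It deliberately treats each row independently; because A1 and C1 are not unique and the cursor re-reads M1/M2 after every update, any true iteration-to-iteration feedback would need to be modelled explicitly, e.g. with a window function (a running SUM of B1 per key over a defined ordering), before this is equivalent.

CREATE OR REPLACE TEMP VIEW calc AS
SELECT
    A1, A2, C1, C3, M1, M2,
    CASE WHEN M1 > M2              -- placeholder for the cursor's IF condition
         THEN M1 - M2
         ELSE 0                    -- placeholder for the ELSE branch
    END AS B1
FROM CURSOR_SRC;                   -- placeholder view over the cursor's source query

-- One bulk insert instead of one INSERT per loop iteration
INSERT INTO TARGET                 -- adjust the select list to the real TARGET columns
SELECT A1, A2, C1, C3, B1 FROM calc;

-- One MERGE per table instead of one UPDATE per loop iteration:
-- subtract the total B1 accumulated for each key
MERGE INTO TABLE_1 t
USING (SELECT A1, SUM(B1) AS total_b1 FROM calc GROUP BY A1) s
ON t.A1 = s.A1
WHEN MATCHED THEN UPDATE SET t.M1 = t.M1 - s.total_b1;

MERGE INTO TABLE_2 t
USING (SELECT C1, SUM(B1) AS total_b1 FROM calc GROUP BY C1) s
ON t.C1 = s.C1
WHEN MATCHED THEN UPDATE SET t.M2 = t.M2 - s.total_b1;

Splitting the work into a derive-once view plus bulk INSERT/MERGE is usually what makes this scale; anything that genuinely needs per-row state is better expressed as a window function than as a mutable loop.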

Related

How to generate an excel file (.xlsx) from SQL Server

I have this query:
WITH InfoNeg AS
(
SELECT DISTINCT
n.idcliente,
CASE
WHEN DATEDIFF(MONTH, MAX(n.fechanegociacion), GETDATE()) <= 2
THEN 'Negociado 6 meses'
ELSE NULL
END AS TipoNeg
FROM
SAB2NewExports.dbo.negociaciones AS n
WHERE
Aprobacion = 'Si'
AND cerrado = 'Si'
GROUP BY
n.idcliente
), Multi AS
(
SELECT DISTINCT
idcliente, COUNT(distinct idportafolio) AS NumPorts
FROM
orangerefi.wfm.wf_master_HIST
WHERE
YEAR(Fecha_BKP) = 2021
AND MONTH(Fecha_BKP) = 08
GROUP BY
idcliente
)
SELECT DISTINCT
m.IdCliente, c.Nombre1
FROM
orangerefi.wfm.wf_master_HIST as m
LEFT JOIN
InfoNeg ON m.idcliente = InfoNeg.idcliente
LEFT JOIN
Multi ON m.IdCliente = Multi.idcliente
LEFT JOIN
SAB2NewExports.dbo.Clientes AS c ON m.IdCliente = c.IdCliente
WHERE
CanalTrabajo = 'Callcenter - Outbound' -- Change here
AND YEAR (Fecha_BKP) = 2021
AND MONTH(Fecha_BKP) = 08
AND GrupoTrabajo IN ('Alto') -- Change here
AND Bucket IN (1, 2) -- Change here
AND Multi.NumPorts > 1
AND Infoneg.TipoNeg IS NULL
When I run it, I get 30 thousand rows, and the columns returned by the query are ClientID and Name. I would like the result to be saved to an Excel file when I run it; I don't know if that's possible.
Another question: is it possible to create a variable that stores text?
I used CONCAT(), but with the text being so long it's a bit cumbersome; I don't know if there is an alternative.
If you can help me, I'd appreciate it.
To declare a variable:
DECLARE @string VARCHAR(MAX)
SET @string = CONCAT(...)
then insert whatever you are concatenating.
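For example, a minimal sketch (the variable name and the text fragments are just placeholders):

DECLARE @string VARCHAR(MAX);

-- Build one long text value from several shorter pieces
SET @string = CONCAT('This is a long piece of text ',
                     'split across several literals ',
                     'so it stays readable in the script.');

SELECT @string AS FullText;   -- use the variable wherever the long text is needed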
Here is an answer given by carusyte
Export SQL query data to Excel
I don't know if this is what you're looking for, but you can export the results to Excel like this:
In the results pane, click the top-left cell to highlight all the records, and then right-click the top-left cell and click "Save Results As". One of the export options is CSV.
You might give this a shot too:
INSERT INTO OPENROWSET
('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;Database=c:\Test.xls;','SELECT productid, price FROM dbo.product')
Lastly, you can look into using SSIS (replaced DTS) for data exports. Here is a link to a tutorial:
http://www.accelebrate.com/sql_training/ssis_2008_tutorial.htm
== Update #1 ==
To save the result as CSV file with column headers, one can follow the steps shown below:
Go to Tools->Options
Query Results->SQL Server->Results to Grid
Check “Include column headers when copying or saving results”
Click OK.
Note that the new settings won’t affect any existing Query tabs — you’ll need to open new ones and/or restart SSMS.

Streams + tasks missing inserts?

We've set up a stream on a table that is continuously loaded via Snowpipe.
We consume this data with a task that runs every minute and merges into another table. Duplicate keys are possible, so we use a ROW_NUMBER() window function ordered by the file-created timestamp descending and keep only row_num = 1; this way we always get the latest insert.
Initially we used a standard task with the MERGE statement, but we noticed that in some instances, since Snowpipe does not guarantee loading files in the order they were staged, we were updating rows with older data. So in the WHEN MATCHED section we added a condition to update the row only when the incoming file-created timestamp is greater than the existing one.
However, since we did that, reconciliation checks show that some new inserts are missing. I don't see why changing the MATCHED clause would interfere with the NOT MATCHED clause.
My theory was that the extra clause added a bit of time to the task run, so that some runs were skipped or the next run started almost immediately after the last one completed; the idea being that the missing rows got caught in the middle and the offset moved on before they could be consumed.
So we changed the task to call a stored procedure that uses an explicit transaction, because the docs seem to suggest that using a transaction will lock the stream. However, even with this we can see that new inserts are still missing. We're talking very small numbers, e.g. 8 out of hundreds of thousands.
Any ideas what might be happening?
Example task code below (not the stored-procedure version):
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
I am not fully sure what is going wrong here, but the recommendation about file-created time comes from Snowflake: the file-created timestamp is computed in the cloud services layer and may be a bit different from what you expect. There is another recommendation related to Snowpipe and data ingestion: the queue service can take up to a minute to consume data from the pipe, and if a lot of data flows in within that minute you can end up with this issue. Review your implementation, check whether pushing data at one-minute intervals solves the problem, and don't rely on the file-created time.
The condition "AND ms.file_created >= pd.file_created" seems to be added as a mechanism to avoid updating the same row multiple times.
An alternative approach could be to use IS DISTINCT FROM to compare the source columns against the target columns (except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating a row when nothing has changed.

Snowflake Stream NOT Purging

I created a stream on a table in Snowflake and created a task to move the data to another table. Even after the task completes, the data in the stream is not purging. Because of that, the task is not skipped and keeps re-inserting data from the stream into the table, so the final table keeps growing. What can be the reason? It was working yesterday, but since today the stream is not purging even after the task executes DML that uses the stream.
create or replace stream test_stream on table test_table_raw APPEND_ONLY = TRUE;
create or replace task test_task_task warehouse = test_warehouse
schedule = '1 minute'
when system$stream_has_data('test_stream')
as insert into test_table
SELECT
level1.FILE_NAME,
level1.FILE_ROWNUMBER,
GET(lvl, '#id')::string as app_id
FROM (SELECT FILE_NAME,FILE_ROWNUMBER,src:"$" as lvl FROM test_table_raw) level1,
lateral FLATTEN(LVL:"$") level2
where level2.value like '%<test %';
alter task test_task resume;
select
(select count(*) from test_table) table_count,
(select count(*) from test_stream) stream_count;
TABLE_COUNT STREAM_COUNT
500 1
Is the transaction committing? I.e., do you see the inserts (or whatever DML the task that uses the stream is supposed to perform) actually happening?
Any chance you can post the SQL?
The stream offset changes when a transaction in which the stream is used commits. There is really no "purge"; the stream offset just moves forward so you don't see the same rows again.
Dinesh Kulkarni
(PM, Snowflake)
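To make the offset behaviour concrete, here is a minimal sketch (demo_src, demo_tgt and demo_stream are hypothetical names, not from the question): rows are never deleted anywhere; the stream simply stops returning them once a transaction that read from it commits.

CREATE OR REPLACE TABLE demo_src (id INT);
CREATE OR REPLACE TABLE demo_tgt (id INT);
CREATE OR REPLACE STREAM demo_stream ON TABLE demo_src;

INSERT INTO demo_src VALUES (1), (2);
SELECT COUNT(*) FROM demo_stream;      -- 2: the offset is still behind the new rows

BEGIN TRANSACTION;                     -- consume the stream inside a transaction
INSERT INTO demo_tgt SELECT id FROM demo_stream;
COMMIT;                                -- the offset advances here, on commit

SELECT COUNT(*) FROM demo_stream;      -- 0: nothing was purged, the offset moved forward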
My bad! I am using the base table in the task instead of the stream.
create or replace task test_task_task warehouse = test_warehouse
schedule = '1 minute'
when system$stream_has_data('test_stream')
as insert into test_table
SELECT
level1.FILE_NAME,
level1.FILE_ROWNUMBER,
GET(lvl, '#id')::string as app_id
FROM (SELECT FILE_NAME,FILE_ROWNUMBER,src:"$" as lvl FROM test_stream) level1,
lateral FLATTEN(LVL:"$") level2
where level2.value like '%<test %';

Oracle cursor variables

I have this Oracle code that I need to convert to SQL Server, but I need help understanding what exactly it is doing. Since I have always avoided cursors, how they are used is still a mystery to me. Please see the code below; it is placed in an insert trigger.
CURSOR c1(table_name1 IN VARCHAR2)
IS
SELECT
a.begin, a.end, a.isnotactive, a.isactive, MIN(g.age) minage
FROM alltables a
LEFT OUTER JOIN people g ON (g.ageid = a.ageid)
WHERE table_name = table_name1
c1x_rec c1%ROWTYPE;
c1_rec c1%ROWTYPE;
I am particularly unsure about the following 3 lines. What exactly are they doing? Where does table_name1 get its value from?
WHERE table_name = table_name1
c1x_rec c1%ROWTYPE;
c1y_rec c1%ROWTYPE;
OPEN c1(table_name1) - this will open the cursor
FETCH c1 INTO variable - this will fetch data into the variable
You need to create a variable that matches the SELECT statement of your cursor in order to fetch the data. For this you use %ROWTYPE: that attribute provides a record type that represents a row. For example, your variables c1x_rec and c1y_rec have the fields begin, end, isnotactive, isactive and minage, with each field declared with the same type as the corresponding column in the alltables or people table.
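As a minimal sketch of how such a parameterised cursor is typically consumed (the 'SOME_TABLE' literal is a made-up value, a GROUP BY is added so the MIN() aggregate is valid, and the begin/end columns are left out because those are Oracle reserved words):

DECLARE
  -- Same shape as the cursor in the question
  CURSOR c1(table_name1 IN VARCHAR2) IS
    SELECT a.isnotactive, a.isactive, MIN(g.age) AS minage
      FROM alltables a
      LEFT OUTER JOIN people g ON (g.ageid = a.ageid)
     WHERE a.table_name = table_name1
     GROUP BY a.isnotactive, a.isactive;

  c1_rec c1%ROWTYPE;                 -- one field per column in c1's SELECT list
BEGIN
  OPEN c1('SOME_TABLE');             -- the parameter value is supplied when the cursor is opened
  LOOP
    FETCH c1 INTO c1_rec;            -- fills every field of the record at once
    EXIT WHEN c1%NOTFOUND;
    -- each row is now available as c1_rec.isactive, c1_rec.minage, ...
    NULL;
  END LOOP;
  CLOSE c1;
END;
/

So table_name1 gets its value at OPEN time, from whatever value the surrounding trigger code passes into OPEN c1(...).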

How to do Sql Server CE table update from another table

I have this sql:
UPDATE JOBMAKE SET WIP_STATUS='10sched1'
WHERE JBT_TYPE IN (SELECT JBT_TYPE FROM JOBVISIT WHERE JVST_ID = 21)
AND JOB_NUMBER IN (SELECT JOB_NUMBER FROM JOBVISIT WHERE JVST_ID = 21)
It works until I turn it into a parameterised query:
UPDATE JOBMAKE SET WIP_STATUS='10sched1'
WHERE JBT_TYPE IN (SELECT JBT_TYPE FROM JOBVISIT WHERE JVST_ID = #jvst_id)
AND JOB_NUMBER IN (SELECT JOB_NUMBER FROM JOBVISIT WHERE JVST_ID = #jvst_id)
Duplicated parameter names are not allowed. [ Parameter name = #jvst_id ]
I tried this (which I think would work in SQL Server 2005, although I haven't tried it):
UPDATE JOBMAKE
SET WIP_STATUS='10sched1'
FROM JOBMAKE JM,JOBVISIT JV
WHERE JM.JOB_NUMBER = JV.JOB_NUMBER
AND JM.JBT_TYPE = JV.JBT_TYPE
AND JV.JVST_ID = 21
There was an error parsing the query. [ Token line number = 3,Token line offset = 1,Token in error = FROM ]
So, I can write dynamic SQL instead of using parameters, or I can pass in two parameters with the same value, but does someone know a better way to do this?
Colin
Your second attempt doesn't work because, based on the Books Online entry for UPDATE, SQL CE doesn't allow a FROM clause in an update statement.
I don't have SQL Compact Edition to test it on, but this might work:
UPDATE JOBMAKE
SET WIP_STATUS = '10sched1'
WHERE EXISTS (SELECT 1
FROM JOBVISIT AS JV
WHERE JV.JBT_TYPE = JOBMAKE.JBT_TYPE
AND JV.JOB_NUMBER = JOBMAKE.JOB_NUMBER
AND JV.JVST_ID = #jvst_id
)
It may be that you can alias JOBMAKE as JM to make the query slightly shorter.
EDIT
I'm not 100% sure of the limitations of SQL CE as they relate to the question raised in the comments (how to update a value in JOBMAKE using a value from JOBVISIT). Attempting to refer to the contents of the EXISTS clause in the outer query is unsupported in any SQL dialect I've come across, but there is another method you can try. This is untested but may work, since it looks like SQL CE supports correlated subqueries:
UPDATE JOBMAKE
SET WIP_STATUS = (SELECT JV.RES_CODE
FROM JOBVISIT AS JV
WHERE JV.JBT_TYPE = JOBMAKE.JBT_TYPE
AND JV.JOB_NUMBER = JOBMAKE.JOB_NUMBER
AND JV.JVST_ID = 20
)
There is a limitation, however: this query will fail if more than one row in JOBVISIT is returned for each row in JOBMAKE.
If this doesn't work (or you cannot straightforwardly limit the inner query to a single row per outer row), it would be possible to carry out a row-by-row update using a cursor.
