U-SQL cursor in Azure Data Lake - loops

I need to process data in Azure Data Lake.
My flow is as follows:
I select a list of IDs from the database for further processing. This part is done.
I need to iterate through those IDs (from the first step) and successively export the data into separate files (partitioned by ID).
The problem is the following statement:
U-SQL’s procedures do not provide any imperative code-flow constructs
such as for or while loops.
Any idea how to process the data in a way similar to a cursor?
I didn't find any documentation regarding cursors in U-SQL.
Thank you!

There are no cursors in U-SQL, because of the statement you reference above.
U-SQL does not provide any imperative code-flow constructs because they would impede the optimizer's ability to globally optimize your script.
You should think of approaching your problem declaratively. For example, if you have a list of IDs (in a table, a SqlArray, or even a file), use a declarative join. Say you want to add 42 to every value whose key is in a list of existing keys:
// Two options for providing the "looping data"

// Option 1: Array variable
DECLARE @keys_var SqlArray<string> = new SqlArray<string>{"k1", "k2", "k3"};

// Option 2: Rowset (e.g. from an EXTRACT of a file, a table, or another place)
@keys = SELECT * FROM (VALUES("k1"), ("k2"), ("k3")) AS T(key);

// This is the data you want to iterate over to add 42 to the value column for every matching key
@inputdata = SELECT * FROM (VALUES (1, "k1"), (2, "k1"), (3, "k2"), (6, "k5")) AS T(value, key);

// Option 1:
@res = SELECT value + 42 AS newval, key FROM @inputdata WHERE @keys_var.Contains(key);
OUTPUT @res TO "/output/opt1.csv" USING Outputters.Csv();

// Option 2:
@res = SELECT value + 42 AS newval, i.key
       FROM @inputdata AS i INNER JOIN @keys AS k
       ON i.key == k.key;
OUTPUT @res TO "/output/opt2.csv" USING Outputters.Csv();
Now, in your case, you want data-driven output file sets. That feature is currently being worked on (it is one of our top asks). Until then, you would have to write a script that generates the script (I will provide an example on your other question).
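To illustrate the interim approach, here is a minimal PowerShell sketch that builds a U-SQL script with one OUTPUT statement per ID and submits it as a single job. The input path, the /output/data_<id>.csv naming scheme, and the $DLAnalyticsName variable are illustrative assumptions, not part of the original answer:
# Sketch only: generate one OUTPUT per ID, then submit the whole script once.
$ids = @("k1", "k2", "k3")
$script = "@data = EXTRACT id string, value int FROM `"/input/data.csv`" USING Extractors.Csv();`r`n"
foreach ($id in $ids)
{
    # $id is interpolated into the generated U-SQL text
    $script += "@out_$id = SELECT * FROM @data WHERE id == `"$id`";`r`n"
    $script += "OUTPUT @out_$id TO `"/output/data_$id.csv`" USING Outputters.Csv();`r`n"
}
$job = Submit-AzureRmDataLakeAnalyticsJob -Name "PartitionedOutput" -AccountName $DLAnalyticsName -Script $script
Generating one script and submitting it once keeps this a single job, unlike the per-iteration submissions in the PowerShell answer below.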

If you really want iterative behaviour, you will need to call the U-SQL from PowerShell.
For example:
ForEach ($Date in $Dates)
{
    $USQLProcCall = '[dbo].[usp_OutputDailyAvgSpeed]("' + $Date + '");'
    $JobName = 'Output daily avg dataset for ' + $Date

    Write-Host $USQLProcCall

    $job = Submit-AzureRmDataLakeAnalyticsJob `
        -Name $JobName `
        -AccountName $DLAnalyticsName `
        -Script $USQLProcCall `
        -DegreeOfParallelism $DLAnalyticsDoP

    Write-Host "Job submitted for " $Date
}
Source: https://www.purplefrogsystems.com/paul/2017/05/recursive-u-sql-with-powershell-u-sql-looping/

Related

Kotlin Exposed batch insert not working as documented

I am trying to batch insert records into an SQL table using Kotlin Exposed. I have set up the code as per the Exposed documentation; however, the SQL statements being executed are individual INSERT statements rather than one batch INSERT statement.
The documentation located here: https://github.com/JetBrains/Exposed/wiki/DSL
has the following on Batch Inserting:
Batch Insert
Batch Insert allows mapping a list of entities into DB rows in one SQL statement. It is more efficient than inserting one by one as it initiates only one statement. Here is an example:
val cityNames = listOf("Paris", "Moscow", "Helsinki")
val allCitiesID = cities.batchInsert(cityNames) { name ->
    this[cities.name] = name
}
My code is as follows:
val mappings: List<Triple<String, String, String>> = listOf(triple1, triple2, triple3)
transaction {
    TableName.batchInsert(mappings) {
        this[TableName.value1] = it.first
        this[TableName.value2] = it.second
        this[TableName.value3] = it.third
    }
}
What I expect to see printed out is one batch INSERT statement that follows this syntax:
INSERT INTO TableName (value1, value2, value3) values
(triple1value1, triple1value2, triple1value3),
(triple2value1, triple2value2, triple2value3),
(triple3value1, triple3value2, triple3value3), ...
but instead it prints three individual INSERT statements with the following syntax:
INSERT INTO TableName (value1, value2, value3) values (triple1value1, triple1value2, triple1value3)
INSERT INTO TableName (value1, value2, value3) values (triple2value1, triple2value2, triple2value3)
INSERT INTO TableName (value1, value2, value3) values (triple3value1, triple3value2, triple3value3)
As this seems like the documented correct way to batch insert, what am I doing incorrectly here?
The docs explain:
NOTE: The batchInsert function will still create multiple INSERT
statements when interacting with your database. You most likely want
to couple this with the rewriteBatchedInserts=true (or
rewriteBatchedStatements=true) option of your relevant JDBC driver,
which will convert those into a single bulkInsert. You can find the
documentation for this option for MySQL here and PostgreSQL here.
https://github.com/JetBrains/Exposed/wiki/DSL#batch-insert
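For instance, with MySQL Connector/J the flag goes on the JDBC URL passed to Exposed's Database.connect. A minimal sketch, where the URL, driver, and credentials are placeholders (and note that PostgreSQL spells its flag reWriteBatchedInserts):
import org.jetbrains.exposed.sql.Database

// Placeholder URL/credentials; the query-string flag asks the MySQL driver
// to rewrite the queued single-row INSERTs into one multi-row INSERT.
val db = Database.connect(
    url = "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true",
    driver = "com.mysql.cj.jdbc.Driver",
    user = "app",
    password = "secret"
)
With that flag set, the individual statements Exposed issues via batchInsert are sent to the server as a single multi-row INSERT.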

MSSQL Data type conversion

I have a pair of databases (one MSSQL and one Oracle), run by different teams. Some data are now being synchronized regularly by a stored procedure in the MSSQL database. This stored procedure calls a very large
MERGE [mssqltable].[Mytable] as s
USING THEORACLETABLE.BLA as t
ON t.[R_ID] = s.[R_ID]
WHEN MATCHED THEN UPDATE SET [Field1] = s.[Field1], ..., [Brokenfield] = s.[BrokenField]
WHEN NOT MATCHED BY TARGET THEN
... another big statement
The field Brokenfield was numeric until today, and could take the values NULL, 0, 1, ..., 24.
Now the Oracle team has introduced a breaking change for some reason and changed the type of the column to string, so it now holds the values NULL, "", "ALFA", "BRAVO", ... Of course, the sync broke.
What is the easiest way to fix the sync here? I (MSSQL team lead, a frontend expert but not so much in databases) would usually hand this to one of our database experts, but all of them are ill right now, and the fix must go live today....
I thought of a stored procedure like CONVERT_BROKENFIELD_INT_TO_STRING or so, based on some switch-case, which could be called in that MERGE statement, but I'm not sure how to do that.
Edit/Clarification:
What I need is a way to make a chunk of SQL code (stored procedure), taking an input of "ALFA" and returning 1, "BRAVO" -> 2, etc., that can be reused, to avoid writing huge IFs in more than one place.
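(For reference, one way to write such a reusable chunk is a scalar function wrapping a CASE expression; unlike a stored procedure, a function can be called inline in the MERGE. A minimal sketch with an illustrative name and only the first two mappings filled in:)
CREATE FUNCTION dbo.ufn_BrokenFieldToInt (@BrokenField varchar(32))
RETURNS int
AS
BEGIN
    -- extend with one WHEN per known value
    RETURN CASE @BrokenField
        WHEN 'ALFA'  THEN 1
        WHEN 'BRAVO' THEN 2
        ELSE NULL
    END;
END;
It could then be used in the MERGE's update clause, e.g. [Brokenfield] = dbo.ufn_BrokenFieldToInt(s.[BrokenField]).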
If you cannot simplify the logic for correct values the way @RichardHansell described, you can create a crosswalk table mapping BrokenField to the correct values. Then you can use a common table expression or subquery with a left join to that crosswalk in the merge.
create table dbo.BrokenField_Crosswalk (
BrokenField varchar(32) not null primary key
, CorrectedValue int
);
insert into dbo.BrokenField_Crosswalk (BrokenField,CorrectedValue) values
('ALFA', 1)
, ('ALPHA', 1)
, ('BRAVO', 2)
...
go
And your code for the merge would look something like this:
;with cte as (
    select o.R_ID
        , o.Field1
        , BrokenField = cast(isnull(c.CorrectedValue, o.BrokenField) as int)
    ....
    from oracle_table.bla as o
        left join dbo.BrokenField_Crosswalk as c
            on o.BrokenField = c.BrokenField
)
merge into [mssqltable].[Mytable] t
using cte as s
on t.[R_ID] = s.[R_ID]
when matched
    then update set
        [Field1] = s.[Field1]
        , ...
        , [Brokenfield] = s.[BrokenField]
when not matched by target
    then ...
If they are using names with a letter at the start that goes in a sequence:
A = 1
B = 2
C = 3
etc.
Then you could do something like this:
MERGE [mssqltable].[Mytable] as s
USING THEORACLETABLE.BLA as t
ON ASCII(LEFT(t.[R_ID], 1)) - ASCII('A') + 1 = s.[R_ID]
WHEN MATCHED THEN UPDATE SET [Field1] = s.[Field1], ..., [Brokenfield] = s.[BrokenField]
WHEN NOT MATCHED BY TARGET THEN
... another big statement
Edit: but actually I re-read your question and you are talking about [Brokenfield] being the problem column, so my solution wouldn't work.
I don't really understand now, as it seems as though the MERGE statement is updating the Oracle table with numbers, so surely you need the mapping to work the other way, i.e. 1 -> ALFA, 2 -> BRAVO, etc.?

R RODBCext and Parameterizing IN statement?

I've been working to parameterize a SQL statement that uses IN in the WHERE clause. I'm using the RODBCext library for parameterizing, but it seems to lack expansion of a list.
I was hoping to write code such as
sqlExecute("SELECT * FROM table WHERE name IN (?)", c("paul","ringo","john", "george")
I'm using the following code but wondered if there's an easier way.
library(RODBC)
library(RODBCext)
# Search inputs
names <- c("paul", "ringo", "john", "george")
# Build SQL statement
qmarks <- replicate(length(names), "?")
stringmarks <- paste(qmarks, collapse = ",")
sql <- paste("SELECT * FROM tableA WHERE name IN (", stringmarks, ")")
# expand to Columns - seems to be the magic step required
bindnames <- rbind(names)
# Execute SQL statement
dbhandle <- RODBC::odbcDriverConnect(connectionString)
result <- RODBCext::sqlExecute(dbhandle, sql, bindnames, fetch = TRUE)
RODBC::odbcClose(dbhandle)
It works, but I feel I'm using R to expand the strings in the wrong way (I'm a bit new to R; there are so many ways to do the same thing wrong). Somebody will probably say "that creates factors - never do that" :-)
I found this article, which suggests I'm on the right track, but it doesn't discuss having to expand the "?" and turn the list into columns of a data.frame:
R RODBC putting list of numbers into an IN() statement
Thank you.
UPDATE: As Benjamin shows below, the sqlExecute function can handle a list() of inputs. However, upon inspecting the resulting SQL, I discovered that it uses cursors to roll up the results. This significantly increases the CPU and I/O over the sample code I show above.
While the library can indeed solve this for you, for large results it may be too expensive. There are two answers and it depends upon your needs.
Since your only parameter in the query is the collection for IN, you could get away with:
sqlExecute(dbhandle,
           "SELECT * FROM table WHERE name IN (?)",
           list(c("paul", "ringo", "john", "george")),
           fetch = TRUE)
sqlExecute will bind the values in the list to the question mark. Here, it will actually repeat the query four times, once for each value in the vector. It may seem kind of silly to do it this way, but when trying to pass strings, it's a lot safer in many ways to let the binding take care of setting up the appropriate quote structure rather than trying to paste it in yourself. You will generate fewer errors this way and avoid a lot of database security concerns.
What if you declare a table variable in a character object and then concatenate it with the query?
library(RODBC)
library(RODBCext)
# Search inputs
names <- c("paul", "ringo", "john", "george")
# Build SQL statement
sql_top <- paste0("SET NOCOUNT ON \r\n DECLARE @LST_NAMES TABLE (ID NVARCHAR(20)) \r\n INSERT INTO @LST_NAMES VALUES ('", paste(names, collapse = "'), ('"), "')")
sql_body <- paste("SELECT * FROM tableA WHERE name IN (SELECT id FROM @LST_NAMES)")
sql <- paste0(sql_top, "\r\n", sql_body)
# Execute SQL statement
dbhandle <- RODBC::odbcDriverConnect(connectionString)
result <- RODBCext::sqlExecute(dbhandle, sql, fetch = TRUE)  # no "?" placeholders here, so no bind data
RODBC::odbcClose(dbhandle)
The query will be as follows (the SET NOCOUNT ON is important to retrieve the results):
SET NOCOUNT ON
DECLARE @LST_NAMES TABLE (ID NVARCHAR(20))
INSERT INTO @LST_NAMES VALUES ('paul'), ('ringo'), ('john'), ('george')
SELECT * FROM tableA WHERE name IN (SELECT id FROM @LST_NAMES)

SQL - How can I return a value from a different table based on a parameter?

First time poster, long time reader:
I am using a custom Excel function that allows me to pass parameters and build a SQL string that returns a value. This works fine. However, I would like to choose among various tables based on the parameters that are passed.
At the moment I have two working functions whose SQL statements look like this:
_______FUNCTION ONE________
<SQLText>
SELECT PRODDTA.TABLE1.T1DESC as DESCRIPTION
FROM PRODDTA.TABLE1
WHERE PRODDTA.TABLE1.T1KEY = '&PARM02'</SQLText>
_______FUNCTION TWO________
<SQLText>
SELECT PRODDTA.TABLE2.T2DESC as DESCRIPTION
FROM PRODDTA.TABLE2
WHERE PRODDTA.TABLE2.T2KEY = '&PARM02'</SQLText>
So I am using IF logic in Excel to check the first parameter and decide which function to use.
It would be much better if I could do a single SQL statement that could pick the right table based on the 1st parameter. Logically something like this:
_______FUNCTIONS COMBINED________
IF '&PARM02' = “A” THEN
SELECT PRODDTA.TABLE1.T1DESC as DESCRIPTION
FROM PRODDTA.TABLE1
WHERE PRODDTA.TABLE1.T1KEY = '&PARM02'
ELSE IF '&PARM02' = “B” THEN
SELECT PRODDTA.TABLE2.T2DESC as DESCRIPTION
FROM PRODDTA.TABLE2
WHERE PRODDTA.TABLE2.T2KEY = '&PARM02'
ELSE
DESCRIPTION = “”
Based on another post, Querying different table based on a parameter, I tried this exact syntax with no success:
<SQLText>
IF'&PARM02'= "A"
BEGIN
SELECT PRODDTA.F0101.ABALPH as DESCRIPTION
FROM PRODDTA.F0101
WHERE PRODDTA.F0101.ABAN8 = '&PARM02'
END ELSE
BEGIN
SELECT PRODDTA.F4801.WADL01 as DESCRIPTION
FROM PRODDTA.F4801
WHERE PRODDTA.F4801.WADOCO = '&PARM02'
END</SQLText>
You could try using a JOIN statement.
http://www.sqlfiddle.com/#!9/23461d/1
Here is a fiddle showing two tables.
The following code snippet will give you the values from both tables, using the key as the matching logic.
SELECT Table1.description, Table1.key, Table2.description
from Table1
Join Table2 on Table1.key = Table2.key
Here's one way to do it. If PARM03 = 'Use Table1', then the top half of the union will return records, and vice versa. This won't necessarily produce good performance, though. You should consider why you are storing data in this way; it looks like you are partitioning data across different tables, which is a bad idea.
SELECT PRODDTA.TABLE1.T1DESC as DESCRIPTION
FROM PRODDTA.TABLE1
WHERE PRODDTA.TABLE1.T1KEY = '&PARM02'
AND &PARM03='Use Table1'
UNION ALL
SELECT PRODDTA.TABLE2.T2DESC as DESCRIPTION
FROM PRODDTA.TABLE2
WHERE PRODDTA.TABLE2.T2KEY = '&PARM02'
AND &PARM03='Use Table2'

perl-dbi #temp table created using dbh handle not accessible when accessed via ->do() with the same handle

I am facing the problem described in perl DBD::ODBC rollback ineffective with AutoCommit enabled, and while looking into it I found that a very basic thing fails with Perl DBI using DBD::ODBC on SQL Server. I am not sure whether this would also happen with other drivers.
The problem is that when I create a #temp table using $dbh->do, and then try to access the same #temp table using another $dbh->do, I get the error below. This does not happen all the time, only intermittently.
Invalid object name '#temp'
$dbh->do("SELECT ... INTO #temp FROM ...");
$dbh->do("INSERT INTO ... SELECT ... FROM #temp");
The second do fails with 'Invalid object name '#temp''
Kindly help me with the problem.
Not that it answers your question but it might help. The following works for me.
#
# To access temporary tables in MS SQL Server they need to be created via
# SQLExecDirect
#
use strict;
use warnings;
use DBI;
my $h = DBI->connect();  # connection details come from the DBI_DSN/DBI_USER/DBI_PASS env vars
eval {
$h->do(q{drop table martin});
$h->do(q{drop table martin2});
};
$h->do(q{create table martin (a int)});
$h->do(q{create table martin2 (a int)});
$h->do('insert into martin values(1)');
my $s;
# this long-winded way works (odbc_exec_direct makes the create happen via SQLExecDirect):
#$s = $h->prepare('select * into #tmp from martin',
#                 { odbc_exec_direct => 1});
#$s->execute;
#print "NUM_OF_FIELDS: " . DBI::neat($s->{NUM_OF_FIELDS}), "\n";
# and this works too:
$h->do('select * into #tmp from martin');
# but a prepare without odbc_exec_direct would not work
$s = $h->selectall_arrayref(q{select * from #tmp});
use Data::Dumper;
print Dumper($s), "\n";
$h->do(q/insert into martin2 select * from #tmp/);
$s = $h->selectall_arrayref(q{select * from martin2});
print Dumper($s), "\n";
I was having this problem as well. I tried all of the above, but it didn't matter. I stumbled upon this http://bytes.com/topic/sql-server/answers/80443-creating-temporary-table-select-into, which solved my problem.
What's happening is that ADO is opening a second connection behind
your back. This has really not anything to do with how you created the
table.
The reason that ADO opens an extra connection, is because there are
rows waiting to be fetched on the first connection, so ADO cannot
submit a query on that connection.
I assume that Perl DBI is doing the same, so based on this assumption, here's what I did and it worked perfectly fine:
my $sth = $dbh->prepare('Select name into #temp from NameTable');
$sth->execute();
$sth->fetchall_arrayref();
$sth = $dbh->prepare('Select a.name, b.age from #temp a, AgeTable b where a.name = b.name');
$sth->execute();
my ($name, $age);
$sth->bind_columns(\$name, \$age);
while ($sth->fetch())
{
    # processing
}
