Flink: How to get the value after union with another DataStream? - apache-flink

E.g. I have two DataStream<Tuple4<String, String, Date, String>> streams named ds1 and ds2, and DataStream ds3 = ds1.union(ds2). I want to know how I can get the value of ds1.f2 and ds2.f2 from ds3.
Thanks.

Stream union in Flink is the same as the union operation on multisets -- you just get a bigger stream with all of the elements from the two input streams.
So, in other words, a Union is not a Join. ds3.f2 is a value that previously was either ds1.f2 or ds2.f2 for some Tuple in one of those streams.
Depending on what you are trying to accomplish, you could add a fifth element to each Tuple so you would know its origin, as sketched below. Or you might rather use some sort of Join operation to combine the two streams. See the documentation for window joins, table joins, SQL joins, and low-level joins.
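A minimal sketch of the tagging approach, assuming the Tuple4 layout above (the "ds1"/"ds2" tag values and variable names are illustrative):
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.streaming.api.datastream.DataStream;
import java.util.Date;

// Widen each element to a Tuple5 whose last field records which stream it came from.
DataStream<Tuple5<String, String, Date, String, String>> tagged1 =
    ds1.map(new MapFunction<Tuple4<String, String, Date, String>,
                            Tuple5<String, String, Date, String, String>>() {
        @Override
        public Tuple5<String, String, Date, String, String> map(
                Tuple4<String, String, Date, String> t) {
            return Tuple5.of(t.f0, t.f1, t.f2, t.f3, "ds1"); // origin tag
        }
    });
DataStream<Tuple5<String, String, Date, String, String>> tagged2 =
    ds2.map(new MapFunction<Tuple4<String, String, Date, String>,
                            Tuple5<String, String, Date, String, String>>() {
        @Override
        public Tuple5<String, String, Date, String, String> map(
                Tuple4<String, String, Date, String> t) {
            return Tuple5.of(t.f0, t.f1, t.f2, t.f3, "ds2"); // origin tag
        }
    });

// After the union, f4 tells you whether f2 originated in ds1 or ds2.
DataStream<Tuple5<String, String, Date, String, String>> ds3 = tagged1.union(tagged2);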

Related

query and update a stream in flink stream sql

I am looking for a solution based on Flink. The situation is that I have a trans stream and some rules which can be expressed as SQL, and I want to update the stream after the query (if it matched ruleSql1, set this transEvent's respCode = 01; if it matched ruleSql2, set respCode = 02; respCode has priority).
The questions are:
Using Flink SQL I can get a result, but how do I feed the result back into the original stream? The output I expect is the original stream with the different respCode values.
I have a lot of rules; how do I merge the results?
Flink's operators have streams coming in, and transformed streams coming out. It's not clear exactly what you want -- but whether you want to modify each event to add a field with the response code, or something else, it's easily done. If you are using SQL, simply describe the output you want in the SELECT clause.
You can use split/select to make n copies of your stream, and then apply one of your rules (expressed as a SQL query) to each of these parallel copies. Then you can use union to merge them back together (provided they are all of the same type).
You'll find the documentation on split, select, and union in this section of the docs.
The Flink training site has a sequence of hands-on exercises that you may find helpful in learning how the pieces of the API fit together, though none that use split/select/union.
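A minimal sketch of that split/select/union pattern (TransEvent, transEvents, applyRule1, and applyRule2 are hypothetical names standing in for the per-branch rule logic, which could equally be a SQL query on each copy; note that split/select was deprecated in later Flink releases in favor of side outputs):
import java.util.Arrays;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SplitStream;

// Route every event to every branch, so each rule sees a full copy of the stream.
SplitStream<TransEvent> branches = transEvents.split(new OutputSelector<TransEvent>() {
    @Override
    public Iterable<String> select(TransEvent event) {
        return Arrays.asList("rule1", "rule2");
    }
});

// Apply one rule per branch; each map sets the respCode its rule calls for.
DataStream<TransEvent> r1 = branches.select("rule1").map(e -> applyRule1(e));
DataStream<TransEvent> r2 = branches.select("rule2").map(e -> applyRule2(e));

// Both branches carry the same type, so they can be merged back into one stream.
DataStream<TransEvent> merged = r1.union(r2);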

Comparing SQL tables

I am still new to SQL. I currently have two tables in SQL Server, and I would like to check whether one specific column in table 1 is equal to a similar column in table 2. I have had a certain level of success with it, but I would also like to see the rows from table 1 which don't match table 2 (e.g. it could return a null value). Below is an example which might help to clarify my point:
select tb1.models, tb1.year, tb1.series, tb2.model, tb2.price
from tb1, tb2
where tb1.year = '2014' and tb1.models = tb2.model
And here comes the part where I have tried all kinds of combinations like <> etc., but unfortunately haven't got to a solution. The point is that in table 1 I have a certain number of models, and in table 2 I have quite a huge list which sometimes does not include the same ones from table 1. Because of that, I want to see exactly what is not matching, so I can try to check and analyse it.
The example I've shown above returns only the rows which are equal; I can see that there are, for example, 30 more models in table 1 which are not in table 2, but I have no visibility into which ones exactly.
Thank you in advance!
Btw: do not use '2014' if this value (and the column tb1.year) is numeric (probably INT). Rather use tb1.year = 2014. Implicit casts are expensive and can have various side effects...
This sounds like a plain join:
select tb1.models
, tb1.year
, tb1.series
, tb2.model
, tb2.price
from tb1
INNER JOIN tb2 ON tb1.models = tb2.model
where tb1.year = 2014
But your model*s* vs. model might point to trouble with non-normalized data... If this does not help, please provide sample data and expected output!
UPDATE
Use LEFT JOIN to find all rows from tb1 (rows without a corresponding row in tb2 get NULLs).
Use RIGHT JOIN for the opposite.
Use FULL OUTER JOIN to return all rows of both tables, with NULLs on either side where there is no corresponding row.
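To see the models from table 1 that are missing from table 2 (the 30 or so the question mentions), a sketch using the LEFT JOIN variant, with the same table and column names as above:
select tb1.models, tb1.year, tb1.series
from tb1
LEFT JOIN tb2 ON tb1.models = tb2.model
where tb1.year = 2014
and tb2.model IS NULL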

Sql Server aggregate concatenate CLR returning different sequence of strings based on number of records

I have a CLR aggregate concatenation function, similar to https://gist.github.com/FilipDeVos/5b7b4addea1812067b09. When the number of rows is small, the sequence of concatenated strings follows the input data set. When the number of rows is larger (dozens and more), the sequence seems indeterminate. There is a difference in the execution plan, but I'm not that familiar with the optimizer and what hints to apply (I've tried MAXDOP 1, without success). From a different test than the example below, with similar results, here's what seems to be the difference in the plan - the separate sorts, then a merge join. The row count where it tipped over here was 60.
[Execution plan screenshots omitted: the first yielded expected results, the second yielded unexpected results.]
Below is the query that demonstrates the issue in the AdventureWorks2014 sample database with the above CLR aggregate (renamed to TestConcatenate). The intended result is a dataset with a row for each order and a column with a delimited list of products for that order, in quantity sequence.
;with cte_ordered_steps AS (
SELECT top 100000 sd.SalesOrderID, SalesOrderDetailID, OrderQty
FROM [Sales].[SalesOrderDetail] sd
--WHERE sd.SalesOrderID IN (53598, 53595)
ORDER BY sd.SalesOrderID, OrderQty
)
select
sd.SalesOrderID,
dbo.TestConcatenate(' QTY: ' + CAST(sd.OrderQty AS VARCHAR(9)) + ': ' + IsNull(p.Name, ''))
FROM [Sales].[SalesOrderDetail] sd
JOIN [Production].[Product] p ON p.ProductID = sd.ProductID
JOIN cte_ordered_steps r ON r.SalesOrderID = sd.SalesOrderID AND r.SalesOrderDetailID = sd.SalesOrderDetailID
where sd.SalesOrderID IN (53598, 53595)
GROUP BY sd.SalesOrderID
When the SalesOrderID is constrained in the cte for 53598, 53595, the sequence is correct (top set); when it's constrained in the main select for 53598, 53595, the sequence is not (bottom set).
So what's my question? How can I build the query, with hints or other changes, so that it returns consistently (and correctly) sequenced concatenated values independent of the number of rows?
Just like a normal query, if there isn't an ORDER BY clause, return order isn't guaranteed. If I recall correctly, the SQL-92 spec allows an ORDER BY clause to be passed in to an aggregate via an OVER clause, but SQL Server doesn't implement it. So there's no way to guarantee ordering in your CLR function (unless you implement it yourself by collecting everything in the Accumulate and Merge methods into some sort of collection and then sorting the list in the Terminate method before returning it). But you'll pay a cost in terms of memory grants, as you now need to serialize the collection.
As to why you're seeing different behavior based on the size of your result set, I notice that a different join operator is being used between the two. A loop join and a merge join walk through the two sets being joined differently, and so that might account for the difference you're seeing.
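If upgrading is an option, SQL Server 2017 and later ship STRING_AGG, which accepts a WITHIN GROUP (ORDER BY ...) clause that does guarantee concatenation order without a CLR aggregate; a sketch against the query above:
-- Requires SQL Server 2017+; WITHIN GROUP fixes the concatenation order.
SELECT sd.SalesOrderID,
STRING_AGG(' QTY: ' + CAST(sd.OrderQty AS VARCHAR(9)) + ': ' + ISNULL(p.Name, ''), '')
WITHIN GROUP (ORDER BY sd.OrderQty) AS Products
FROM Sales.SalesOrderDetail sd
JOIN Production.Product p ON p.ProductID = sd.ProductID
WHERE sd.SalesOrderID IN (53598, 53595)
GROUP BY sd.SalesOrderID;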
Why not try the aggregate dbo.GROUP_CONCAT_S available at http://groupconcat.codeplex.com? The S is for Sorted output. It does exactly what you want.
While this answer doesn't have a solution, the additional information that Ben and Orlando provided (thanks!) has given me what I need to move on. I'll take the approach that Orlando pointed to, which was also my plan B, i.e. sorting in the CLR.

How to populate a CTE with a list of values in Sqlite

I am working with SQLite and straight C. I have a C array of ids of length N that I am compiling into a string with the following format:
'id1', 'id2', 'id3', ... 'id[N]'
I need to build queries to do several operations that contain comparisons to this list of ids, an example of which might be...
SELECT id FROM tableX WHERE id NOT IN (%s);
... where %s is replaced by the string representation of my array of ids. For complicated queries and high values of N, this obviously produces some very ungainly queries, and I would like to clean them up using common table expressions. I have tried the following:
WITH id_list(id) AS
(VALUES(%s))
SELECT * FROM id_list;
This doesn't work because SQLite expects a column for each value in my string. My other failed attempt was
WITH id_list(id) AS
(SELECT (%s))
SELECT * FROM id_list;
but this throws a syntax error at the comma. Does SQLite syntax exist to accomplish what I'm trying to do?
SQLite supports VALUES clauses with multiple rows, so you can write:
WITH id_list(id) AS (VALUES ('id1'), ('id2'), ('id3'), ...
However, this is not any more efficient than just listing the IDs in IN.
You could write all the IDs into a temporary table, but this would not make sense unless you have measured the performance improvement.
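Put together with the query from the question (illustrative IDs), the multi-row VALUES form looks like this:
WITH id_list(id) AS
(VALUES ('id1'), ('id2'), ('id3'))
SELECT id FROM tableX WHERE id NOT IN (SELECT id FROM id_list);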
One solution I have found is to reformat my string as follows:
VALUES('id1') UNION VALUES('id2') ... UNION VALUES('id[N]')
Then, the following query achieves the desired result:
WITH id_list(id) AS
(%s)
SELECT * FROM id_list;
However, I am not totally satisfied with this solution. It seems inefficient.

Ordering numbers that are stored as strings in the database

I have a bunch of records in several tables in a database that have a "process number" field. It's basically a number, but I have to store it as a string, both because of some legacy data that has stuff like "89a" as a number and because of a numbering system that requires that process numbers be represented as number/year.
The problem arises when I try to order the processes by number. I get stuff like:
1
10
11
12
And the other problem is when I need to add a new process. The new process' number should be the biggest existing number incremented by one, and for that I would need a way to order the existing records by number.
Any suggestions?
Maybe this will help. Essentially, adding + 0 coerces the string to a number for the purposes of ordering:
SELECT process_order FROM your_table ORDER BY process_order + 0 ASC
Can you store the numbers as zero padded values? That is, 01, 10, 11, 12?
I would suggest creating a new numeric field used only for ordering, and updating it from a trigger.
Can you split the data into two fields?
Store the 'process number' as an int and the 'process subtype' as a string.
That way:
• you can easily get the MAX processNumber - and increment it when you need to generate a new number
• you can ORDER BY processNumber ASC, processSubtype ASC - to get the correct order, even if multiple records have the same base number with different years/letters appended
• when you need the 'full' number you can just concatenate the two fields
Would that do what you need? A sketch of the scheme follows below.
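A minimal sketch of the two-field scheme (hypothetical table and column names):
-- Hypothetical schema: the numeric part and the subtype stored separately.
CREATE TABLE process (
processNumber INT NOT NULL,
processSubtype VARCHAR(10) NOT NULL DEFAULT ''
);

-- Generating the next number is a simple MAX + 1.
SELECT MAX(processNumber) + 1 FROM process;

-- Correct ordering falls out of the column types.
SELECT processNumber, processSubtype
FROM process
ORDER BY processNumber ASC, processSubtype ASC;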
Given that your process numbers don't seem to follow any fixed patterns (from your question and comments), can you construct/maintain a process number table that has two fields:
create table process_ordering ( processNumber varchar(N), processOrder int )
Then select all the process numbers from your tables and insert into the process number table. Set the ordering however you want based on the (varying) process number formats. Join on this table, order by processOrder and select all fields from the other table. Index this table on processNumber to make the join fast.
select my_processes.*
from my_processes
inner join process_ordering on my_processes.processNumber = process_ordering.processNumber
order by process_ordering.processOrder
It seems to me that you have two tasks here.
• Convert the strings to numbers by legacy format/strip off the junk
• Order the numbers
If you have a practical way of introducing string-parsing regular expressions into your process (and your issue has enough volume to be worth the effort), then I'd
• Create a reference table such as
CREATE TABLE tblLegacyFormatRegularExpressionMaster(
LegacyFormatId int,
LegacyFormatName varchar(50),
RegularExpression varchar(max)
)
• Then, with a way of invoking the regular expressions, such as the CLR integration in SQL Server 2005 and above (the .NET Common Language Runtime integration that allows calls to compiled .NET methods from within SQL Server as ordinary (Microsoft-extended) T-SQL), you should be able to solve your problem.
• See http://www.codeproject.com/KB/string/SqlRegEx.aspx
I apologize if this is way too much overhead for your problem at hand.
Suggestion:
• Make your column a fixed width text (i.e. CHAR rather than VARCHAR).
• Pad the existing values with enough leading zeros to fill each column, and trailing space(s) where the values do not end in 'a' (or whatever).
• Add a CHECK constraint (or equivalent) to ensure new values conform to the pattern e.g. something like
CHECK (process_number LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][ab ]')
• In your insert/update stored procedures (or equivalent), pad any incoming values to fit the pattern.
• Remove the leading/trailing zeros/spaces as appropriate when displaying the values to humans.
Another advantage of this approach is that the incoming values '1', '01', '001', etc would all be considered to be the same value and could be covered by a simple unique constraint in the DBMS.
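A sketch of the padding step (SQL Server syntax; six digits plus one letter position, matching the CHECK pattern above; the variable names are illustrative):
-- Split a raw value like '89a' or '123' into digits and an optional suffix.
DECLARE @raw VARCHAR(10) = '89a';
DECLARE @digits VARCHAR(10) = @raw;
DECLARE @suffix CHAR(1) = ' ';

IF @raw LIKE '%[a-z]'
BEGIN
SET @suffix = RIGHT(@raw, 1);
SET @digits = LEFT(@raw, LEN(@raw) - 1);
END;

-- Left-pad the digits so '89a' becomes '000089a' and '123' becomes '000123 '.
SELECT RIGHT('000000' + @digits, 6) + @suffix AS process_number;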
BTW I like the idea of splitting the trailing 'a' (or whatever) into a separate column; however, I got the impression the data element in question is an identifier, and therefore it would not be appropriate to split it.
You need to cast your field when you order by it. I'm basing this syntax on MySQL - but the idea's the same:
select * from table order by cast(field AS UNSIGNED);
Of course UNSIGNED could be SIGNED if required.
