Results changed after sorting a table - apache-flink

I have the following scenario:
I created a batch job using the SQL API.
final TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.newInstance().inBatchMode().build());
I load the data from CSV files and convert/aggregate it using the SQL API.
At some stage I have a table:
CREATE VIEW ohlc_current_day AS
SELECT
    CAST(transact_time AS DATE) AS `day`,
    instrument_id,
    first_value(price) AS `open`,
    min(price) AS `low`,
    max(price) AS `high`,
    last_value(price) AS `close`,
    count(*) AS `count`,
    sum(quantity) AS volume,
    sum(quantity * price) AS turnover
FROM trades  -- table loaded from CSV
GROUP BY CAST(transact_time AS DATE), instrument_id
Now, when I check the results:
select * from ohlc_current_day where instrument_id=14
+------------+---------------+--------+--------+--------+--------+-------+----------+-------------+
| day        | instrument_id | open   | low    | high   | close  | count | volume   | turnover    |
+------------+---------------+--------+--------+--------+--------+-------+----------+-------------+
| 2021-04-11 | 14            | 1723.0 | 1709.0 | 1743.0 | 1728.0 | 679   | 487470.0 | 8.4114803E8 |
+------------+---------------+--------+--------+--------+--------+-------+----------+-------------+
The results are repeatable and correct (checked against a reference).
Then, for further processing, I need the OHLC values from the previous day, which are already stored in a database:
CREATE TABLE ohlc_database (
    `day` TIMESTAMP,
    instrument_id INT,
    `open` FLOAT,
    `low` FLOAT,
    `high` FLOAT,
    `close` FLOAT,
    `count` BIGINT,
    volume FLOAT,
    turnover FLOAT
) WITH (
    'connector' = 'jdbc',
    'url' = 'url',
    'table-name' = 'ohlc',
    'username' = 'user',
    'password' = 'password'
)
Let's now merge ohlc_current_day with ohlc_database:
CREATE VIEW ohlc_raw AS
SELECT * FROM ohlc_current_day
UNION ALL
SELECT
    CAST(`day` AS DATE) AS `day`,
    instrument_id,
    `open`,
    `low`,
    `high`,
    `close`,
    `count`,
    volume,
    turnover
FROM ohlc_database
WHERE `day` = '2021-04-10'  -- hardcoded previous day's date
And check the results:
select * from ohlc_raw where instrument_id=14
+------------+---------------+--------+--------+--------+--------+-------+-----------+--------------+
| day        | instrument_id | open   | low    | high   | close  | count | volume    | turnover     |
+------------+---------------+--------+--------+--------+--------+-------+-----------+--------------+
| 2021-04-10 | 14            | 1696.0 | 1654.0 | 1703.0 | 1691.0 | 936   | 1040888.0 | 1.74619264E9 |
| 2021-04-11 | 14            | 1723.0 | 1709.0 | 1743.0 | 1728.0 | 679   | 487470.0  | 8.4114829E8  |
+------------+---------------+--------+--------+--------+--------+-------+-----------+--------------+
The results are OK; the values are the same as in the previous select query.
Now let's order by day:
CREATE VIEW ohlc AS
SELECT * FROM ohlc_raw ORDER BY `day`
Check the results:
select * from ohlc where instrument_id=14
+------------+---------------+--------+--------+--------+--------+-------+-----------+--------------+
| day        | instrument_id | open   | low    | high   | close  | count | volume    | turnover     |
+------------+---------------+--------+--------+--------+--------+-------+-----------+--------------+
| 2021-04-10 | 14            | 1696.0 | 1654.0 | 1703.0 | 1691.0 | 936   | 1040888.0 | 1.74619264E9 |
| 2021-04-11 | 14            | 1729.0 | 1709.0 | 1743.0 | 1732.0 | 679   | 487470.0  | 8.4114854E8  |
+------------+---------------+--------+--------+--------+--------+-------+-----------+--------------+
The open and close values are wrong compared to the previous results. They are calculated with the first_value() and last_value() functions, which depend on the order of the rows. So my guess is that the ORDER BY in the last query changed the row order, and that is why the results differ.
Is my understanding correct? How can I fix it?

I think first_value() and last_value() themselves rely on a sort order. When you ORDER BY in the last query, the order that first_value()/last_value() depended on gets disturbed. Maybe you can output the result into the database right after the UNION ALL, and then ORDER BY date after extracting the data from the database, if possible.
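If materializing the intermediate result is not an option, the order dependence can also be removed at the source. Below is a minimal sketch (my own rewrite, not from the post above) that derives open and close from the rows carrying the earliest and latest transact_time; it assumes transact_time is unique per day and instrument, since ties would produce duplicate rows:

-- Deterministic OHLC: open/close are tied to MIN/MAX transact_time instead
-- of whatever order rows happen to reach first_value()/last_value().
CREATE VIEW ohlc_current_day AS
SELECT
    agg.`day`,
    agg.instrument_id,
    o.price AS `open`,
    agg.`low`,
    agg.`high`,
    c.price AS `close`,
    agg.`count`,
    agg.volume,
    agg.turnover
FROM (
    SELECT
        CAST(transact_time AS DATE) AS `day`,
        instrument_id,
        MIN(transact_time) AS first_time,
        MAX(transact_time) AS last_time,
        MIN(price) AS `low`,
        MAX(price) AS `high`,
        COUNT(*) AS `count`,
        SUM(quantity) AS volume,
        SUM(quantity * price) AS turnover
    FROM trades
    GROUP BY CAST(transact_time AS DATE), instrument_id
) agg
JOIN trades o ON o.instrument_id = agg.instrument_id
             AND o.transact_time = agg.first_time
JOIN trades c ON c.instrument_id = agg.instrument_id
             AND c.transact_time = agg.last_time

With open and close pinned to explicit timestamps, a downstream ORDER BY can no longer change their values.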

Related

Extract into multiple columns from JSON with PostgreSQL

I have a column item_id that contains data in a JSON-like structure.
+-------+---------------------------------------------------------------------------------------------------------------------------------------+
| id    | item_id                                                                                                                               |
+-------+---------------------------------------------------------------------------------------------------------------------------------------+
| 56711 | {itemID":["0530#2#1974","0538\/2#2#1974","0538\/3#2#1974","0538\/18#2#1974","0539#2#1974"]}"                                          |
| 56712 | {itemID":["0138528#2#4221","0138529#2#4221","0138530#2#4221","0138539#2#4221","0118623\/2#2#4220"]}"                                  |
| 56721 | {itemID":["2704\/1#1#1356"]}"                                                                                                         |
| 56722 | {itemID":["0825\/2#2#3349","0840#2#3349","0844\/10#2#3349","0844\/11#2#3349","0844\/13#2#3349","0844\/14#2#3349","0844\/15#2#3349"]}" |
| 57638 | {itemID":["0161\/1#2#3364","0162\/1#2#3364","0163\/2#2#3364"]}"                                                                       |
| 57638 | {itemID":["109#1#3364","110\/1#1#3364"]}"                                                                                             |
+-------+---------------------------------------------------------------------------------------------------------------------------------------+
I need the last four digits before every comma (if there is one), deduplicated and separated into individual columns.
The distinct should also apply across rows with the same id, so only one result row with id 57638 is permitted.
Here is a fiddle with a code draft that is not giving the right answer.
The desired result should look like this:
+-------+-----------+-----------+
| id    | item_id_1 | item_id_2 |
+-------+-----------+-----------+
| 56711 | 1974      |           |
| 56712 | 4220      | 4221      |
| 56721 | 1356      |           |
| 56722 | 3349      |           |
| 57638 | 3364      | 3365      |
+-------+-----------+-----------+
There can be quite a lot of 'item_id_%' columns in the results.
with the_table (id, item_id) as (
values
(56711, '{"itemID":["0530#2#1974","0538\/2#2#1974","0538\/3#2#1974","0538\/18#2#1974","0539#2#1974"]}'),
(56712, '{"itemID":["0138528#2#4221","0138529#2#4221","0138530#2#4221","0138539#2#4221","0118623\/2#2#4220"]}'),
(56721, '{"itemID":["2704\/1#1#1356"]}'),
(56722, '{"itemID":["0825\/2#2#3349","0840#2#3349","0844\/10#2#3349","0844\/11#2#3349","0844\/13#2#3349","0844\/14#2#3349","0844\/15#2#3349"]}'),
(57638, '{"itemID":["0161\/1#2#3364","0162\/1#2#3364","0163\/2#2#3364"]}'),
(57638, '{"itemID":["109#1#3365","110\/1#1#3365"]}')
)
select id
      ,(array_agg(itemid))[1] itemid_1
      ,(array_agg(itemid))[2] itemid_2
from (
    select distinct id
          ,split_part(replace(json_array_elements(item_id::json -> 'itemID')::text, '"', ''), '#', 3)::int itemid
    from the_table
    order by 1, 2
) t
group by id
DEMO
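A note on the query above: relying on the subquery's ORDER BY surviving into array_agg() is fragile, since PostgreSQL does not guarantee that ordering is preserved through aggregation. The ordering can be attached to the aggregate itself (same the_table data as above):

select id
      ,(array_agg(itemid order by itemid))[1] itemid_1
      ,(array_agg(itemid order by itemid))[2] itemid_2
from (
    -- distinct collapses duplicate (id, itemid) pairs before aggregation
    select distinct id
          ,split_part(replace(json_array_elements(item_id::json -> 'itemID')::text, '"', ''), '#', 3)::int itemid
    from the_table
) t
group by id;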
You can unnest the json array, get the last 4 characters of each element as a number, then do conditional aggregation:
select
    id,
    max(val) filter(where rn = 1) item_id_1,
    max(val) filter(where rn = 2) item_id_2
from (
    select
        id,
        right(val, 4)::int val,
        dense_rank() over(partition by id order by right(val, 4)::int) rn
    from mytable t
    cross join lateral jsonb_array_elements_text(t.item_id -> 'itemID') as x(val)
) t
group by id
You can add more conditional max()s to the outer query to handle more possible values.
Demo on DB Fiddle:
   id | item_id_1 | item_id_2
----: | --------: | --------:
56711 |      1974 |      null
56712 |      4220 |      4221
56721 |      1356 |      null
56722 |      3349 |      null
57638 |      3364 |      3365

Delete partial duplicate rows - sql

I am having some trouble deleting partially duplicate rows.
The structure is like this:
+----+--------+-----------+------+
| id | userid | location  | week |
+----+--------+-----------+------+
| 1  | 001    | amsterdam | 11   |
| 2  | 001    | amsterdam | 23   |
| 3  | 002    | berlin    | 28   |
| 4  | 002    | berlin    | 22   |
| 5  | 003    | paris     | 19   |
| 6  | 003    | paris     | 35   |
+----+--------+-----------+------+
I only need to keep one row for each userid; it doesn't matter which week number it has.
Thanks,
Maxcim
This should work across most databases:
DELETE
FROM yourTable
WHERE id <> (SELECT MIN(t.id)
             FROM yourTable t
             WHERE t.userid = yourTable.userid)
This query would delete from each userid group all records except for the record having the lowest id for that group. I assume that id is a unique column.
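One caveat: MySQL rejects subqueries that read the table being deleted from (error 1093), so the query above fails there. A multi-table delete expressing the same idea works instead (a sketch, assuming the same yourTable layout):

-- Delete every row that has a lower id within the same userid group,
-- which keeps exactly the row with the lowest id per userid.
DELETE t
FROM yourTable t
INNER JOIN yourTable keep
    ON keep.userid = t.userid
   AND keep.id < t.id;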
This method is tested, try it.
We compute a row number for each record within its group, and then delete only the rows whose row number is greater than 1, keeping the original one.
BEGIN TRANSACTION

-- Number each row within its (UserID, Location) group, carrying the
-- unique id so the delete can target exact rows.
SELECT id, UserID, Location,
       RN = ROW_NUMBER() OVER (PARTITION BY UserID, Location ORDER BY id)
INTO #test1
FROM dbo.MyTbl

DELETE MyTbl
FROM dbo.MyTbl
INNER JOIN #test1
    ON #test1.id = MyTbl.id
WHERE #test1.RN > 1

IF @@ERROR <> 0 GOTO ErrLbl
COMMIT TRANSACTION
RETURN
ErrLbl:
ROLLBACK TRANSACTION
GO
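On SQL Server 2005 or newer, the same cleanup can be written without the temp table by deleting through a CTE. This is an alternative sketch, not the answerer's method:

-- Rows are numbered per (UserID, Location) group; deleting from the CTE
-- deletes the underlying rows whose row number is greater than 1.
;WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY UserID, Location ORDER BY id) AS RN
    FROM dbo.MyTbl
)
DELETE FROM numbered
WHERE RN > 1;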

Selecting grouped rows after first two rows SQL Server

This is a bit of a tricky question/situation, and my search-fu failed me.
Let's say I have the following data:
| UID | SharedID | Type | Date |
|-----|----------|------|-----------|
| 1 | 1 | foo | 2/4/2016 |
| 2 | 1 | foo | 2/5/2016 |
| 3 | 1 | foo | 2/8/2016 |
| 4 | 1 | foo | 2/11/2016 |
| 5 | 2 | bar | 1/11/2016 |
| 6 | 2 | bar | 2/11/2016 |
| 7 | 3 | baz | 2/1/2016 |
| 8 | 3 | baz | 2/3/2016 |
| 9 | 3 | baz | 2/11/2016 |
And I would like to omit a variable number of leading rows (those with the most recent dates in this case); let's say that number is 2 in this example. The resulting table would be something like this:
| UID | SharedID | Type | Date |
|-----|----------|------|-----------|
| 1 | 1 | foo | 2/4/2016 |
| 2 | 1 | foo | 2/5/2016 |
| 7 | 3 | baz | 2/1/2016 |
Is this possible in SQL? Essentially, I want to skip an unknown number of rows per group, using the date column as the ORDER BY. The goal is to get the oldest rows of each type, and a list of UIDs in the process.
Sure, it's possible. Use a ROW_NUMBER function to assign a value to each row, partitioning by the SharedID column so that the count restarts every time that ID changes, and select those rows with a value greater than your limit.
WITH cteNumberedRows AS (
    SELECT UID, SharedID, Type, Date,
           ROW_NUMBER() OVER(PARTITION BY SharedID ORDER BY Date DESC) AS RowNum
    FROM YourTable
)
SELECT UID, SharedID, Type, Date
FROM cteNumberedRows
WHERE RowNum > 2;
Not sure if I understand what you mean but something like this?
SELECT *
FROM MyTable t1
WHERE t1.UID NOT IN (
    SELECT TOP 2 UID
    FROM MyTable
    WHERE SharedID = t1.SharedID
    ORDER BY [Date] DESC
)

Rearranging and deduplicating SQL columns based on column data

Sorry, I know that's a rubbish title, but I couldn't think of a more concise way of describing the issue.
I have an MSSQL 2008 table that contains telephone numbers:
| CustomerID | Tel1     | Tel2      | Tel3     | Tel4     | Tel5      | Tel6   |
| Cust001    | 01222222 | 012333333 | 07111111 | 07222222 | 01222222  | NULL   |
| Cust002    | 07444444 | 015333333 | 07555555 | 07555555 | NULL      | NULL   |
| Cust003    | 01333333 | 017777777 | 07888888 | 07011111 | 016666666 | 013333 |
I'd like to:
Remove any duplicate phone numbers
Rearrange the telephone numbers so that anything beginning with "07" is the first phone number. If there are multiple 07's, they should be in the first fields. The order of the numbers apart from that doesn't really matter.
So, for example, after processing, the table would look like:
| CustomerID | Tel1     | Tel2     | Tel3      | Tel4      | Tel5     | Tel6      |
| Cust001    | 07111111 | 07222222 | 01222222  | 012333333 | NULL     | NULL      |
| Cust002    | 07444444 | 07555555 | 015333333 | NULL      | NULL     | NULL      |
| Cust003    | 07888888 | 07011111 | 016666666 | 013333    | 01333333 | 017777777 |
I'm struggling to figure out how to efficiently achieve my goal (there are 600,000+ records in the table). Can anyone help?
I've created a fiddle if it'll help anyone play around with the scenario.
You can break up the numbers into individual rows using UNPIVOT, deduplicate them, reorder them based on the occurrence of the '07' prefix using ROW_NUMBER(), and finally recombine them using PIVOT to end up with the six Tel columns again.
select *
FROM
(
    select CustomerID, Col, Tel
    FROM
    (
        -- number each customer's distinct numbers so the 07s come first
        select *, Col = 'Tel' + RIGHT(
            row_number() over (partition by CustomerID
                               order by case when Tel like '07%' then 1
                                             else 2 end), 10)
        from
        (
            -- deduplicate the numbers before renumbering
            select distinct CustomerID, Tel
            from phonenumbers
            UNPIVOT (Tel for Seq in (Tel1, Tel2, Tel3, Tel4, Tel5, Tel6)) seqs
        ) dedup
    ) U
) P
PIVOT (MAX(Tel) for Col IN (Tel1, Tel2, Tel3, Tel4, Tel5, Tel6)) V;
SQL Fiddle
Perhaps use a cursor to collect each customer's numbers and sort the fields procedurally, the traditional sorting technique we used to do in school C++. I'd like to know if any other method is possible.
If you don't find one, this is the last resort; it will certainly take a long time to execute.

Finding the max and min date values in SQL Server tables

I have two tables:
A lookup table (tabOne):
KEY | Group | Name | Desc | Val_Key
----------------------------------------
1 | a | NameA | DescA | 10
2 | b | NameB | DescB | 20
3 | c | NameC | DescC | 30
4 | d | NameD | DescD | 40
5 | e | NameE | DescE | 50
6 | f | NameF | DescF | 60
A second table containing readings (tabTwo):
KEY | Date | Reading | Val_Key
----------------------------------------
1 | Date | Read | 10
2 | Date | Read | 20
3 | Date | Read | 40
4 | Date | Read | 40
5 | Date | Read | 30
6 | Date | Read | 20
7 | Date | Read | 40
8 | Date | Read | 20
9 | Date | Read | 10
10 | Date | Read | 20
11 | Date | Read | 50
12 | Date | Read | 60
What I need to do is join tabTwo with tabOne and create a column with the newest reading and a column with the oldest reading for each item in the Group column of tabOne.
At the end of the day, I want a table that looks as follows:
KEY | Group | Name | Desc | Val_Key | LastReading | FirstReading |
-------------------------------------------------------------------------
1 | a | NameA | DescA | 10 | | |
2 | b | NameB | DescB | 20 | | |
3 | c | NameC | DescC | 30 | | |
4 | d | NameD | DescD | 40 | | |
5 | e | NameE | DescE | 50 | | |
6 | f | NameF | DescF | 60 | | |
Thanks!
Freddie
If this is Sql Server 2005 or newer, outer apply will help:
select TabOne.*,
       last.Reading LastReading,
       first.Reading FirstReading
from TabOne
outer apply
(
    select top 1 Reading
    from TabTwo
    where TabTwo.Val_Key = TabOne.Val_Key
    order by TabTwo.Date desc
) last
outer apply
(
    select top 1 Reading
    from TabTwo
    where TabTwo.Val_Key = TabOne.Val_Key
    order by TabTwo.Date asc
) first
Live test is at SQL Fiddle.
@Nikola Markovinović's solution can be made more universally applicable if the subqueries are moved directly into the main query's SELECT clause, which is possible because each of them retrieves only one value and is therefore valid as a scalar expression:
SELECT
    t1.[KEY],
    t1.[Group],
    t1.Name,
    t1.[Desc],
    t1.Val_Key,
    (
        SELECT TOP 1 Reading
        FROM TabTwo
        WHERE Val_Key = t1.Val_Key
        ORDER BY Date DESC
    ) AS LastReading,
    (
        SELECT TOP 1 Reading
        FROM TabTwo
        WHERE Val_Key = t1.Val_Key
        ORDER BY Date ASC
    ) AS FirstReading
FROM TabOne t1
If you needed e.g. the dates along the way, you would probably have to stick with Nikola's solution. There is an alternative to it, but it's more cumbersome (albeit more standard too): group TabTwo's data by Val_Key to get the earliest/latest date per Val_Key, join back to TabTwo to reach the full rows matching those dates and pull the necessary columns, and finally join both result sets to TabOne to get the final column set. A sketch follows below.
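Here is a minimal sketch of that approach (hypothetical, written from the description above, and assuming Date has no ties per Val_Key, since ties would produce extra rows):

-- Earliest/latest date per Val_Key, then join back to fetch the matching readings.
SELECT
    t1.[KEY], t1.[Group], t1.Name, t1.[Desc], t1.Val_Key,
    lastr.Reading AS LastReading,
    firstr.Reading AS FirstReading
FROM TabOne t1
LEFT JOIN (
    SELECT Val_Key, MIN([Date]) AS FirstDate, MAX([Date]) AS LastDate
    FROM TabTwo
    GROUP BY Val_Key
) d ON d.Val_Key = t1.Val_Key
LEFT JOIN TabTwo firstr
    ON firstr.Val_Key = t1.Val_Key AND firstr.[Date] = d.FirstDate
LEFT JOIN TabTwo lastr
    ON lastr.Val_Key = t1.Val_Key AND lastr.[Date] = d.LastDate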
