What is the best approach to join a Dynamic Table with a Versioned Table? - apache-flink

I have two streams that I want to join LEFT JOIN style: I only want to enrich my left-side stream from the right one. Let's say my left stream is car_traffic and my right stream is car_electronics. The license_plate_number field is common to both streams.
From car_electronics I only keep license_plate_number and gps_mac_addr, because the MAC address changes constantly and not every car is equipped with a GPS module. I filter out NULL values and then transform the stream into a versioned table view (see the sketch after the sample data below).
The main idea is to keep on the right side a reference table of cars with their gps_mac_addr, and to enrich the left side with all known MAC addresses while keeping the rows that have no match.
The two streams move at different speeds.
My questions:
What is the best approach to join these streams?
Which join type should I use?
We are using Flink 1.14.6.
Each stream sends around 1.6-2 billion records per day.
Sample data:
car_traffic
+------------------------+--------------------------+----------------------+
| license_plate_number | eventTime | ... |
+------------------------+--------------------------+----------------------+
| AA-123-BB | 2022-11-29 ... | ... |
| AA-456-CC | 2022-11-29 ... | ... |
| EE-935-JJ | 2022-11-29 ... | ... |
car_electronics
+----+----------------------+-------------------+--------------------------+
| op | license_plate_number | gps_mac_addr | eventTime |
+----+----------------------+-------------------+--------------------------+
| +I | AA-123-BB | AA | 2022-11-28 ... |
| -U | AA-123-BB | AA | 2022-11-29 ... |
| +U | AA-123-BB | FFFF0A0FBBC6 | 2022-11-29 ... |
| +I | AA-456-CC | FFFF0A0F00F0 | 2022-11-29 ... |
The result I want:
+------------------------+------------------------+------------------------+
| license_plate_number | gps_mac_addr | eventTime |
+------------------------+------------------------+------------------------+
| AA-123-BB | FFFF0A0FBBC6 | ... |
| AA-456-CC | FFFF0A0F00F0 | ... |
| EE-935-JJ | (NULL) | ... |
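
A simplified sketch of the kind of versioned view I mean (names are illustrative, not my exact code); it keeps only the latest non-NULL gps_mac_addr per license plate, and Flink treats such a deduplication query over an event-time attribute as a versioned table:

-- Sketch only: assumes car_electronics is declared with a watermark on eventTime.
CREATE VIEW car_electronics_versioned AS
SELECT license_plate_number, gps_mac_addr, eventTime
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY license_plate_number
                              ORDER BY eventTime DESC) AS rownum
    FROM car_electronics
    WHERE gps_mac_addr IS NOT NULL
)
WHERE rownum = 1;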

If car_electronics is a database table, you could use the Flink CDC project to capture its changelog as a Flink source.
You can implement this with either the DataStream API or SQL.
For SQL:
CREATE TABLE car_electronics (
    license_plate_number STRING,
    gps_mac_addr STRING,
    eventTime TIMESTAMP(3)
) WITH (
    'connector' = 'mysql-cdc',
    ......
);

SELECT *
FROM car_traffic c1
LEFT JOIN car_electronics c2 ON c1.license_plate_number = c2.license_plate_number
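
For the join type: since the right side is maintained as a versioned table, an event-time temporal join usually fits this kind of enrichment better than a regular streaming join. A minimal sketch, assuming a versioned view such as car_electronics_versioned (as sketched in the question) with license_plate_number as its key and watermarks on eventTime on both sides:

-- Sketch of an event-time temporal (versioned) left join: each traffic event is
-- enriched with the MAC that was valid at its eventTime; rows without a match
-- keep NULL for gps_mac_addr.
SELECT t.license_plate_number,
       e.gps_mac_addr,
       t.eventTime
FROM car_traffic AS t
LEFT JOIN car_electronics_versioned FOR SYSTEM_TIME AS OF t.eventTime AS e
    ON t.license_plate_number = e.license_plate_number;

Unlike the plain left join above, which has to retain state for both sides indefinitely (unless state TTL is configured), the temporal join roughly only retains the versions of the right side that can still be matched, which matters at billions of records per day.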

Related

Postgres filter on array column - removing subarrays

I have a PostgreSQL table with a column that contains arrays:
| col |
| --- |
| {1,5,6} |
| {5,6,7} |
| {5,6} |
| {5,7} |
| {6,7} |
| {1} |
| {2} |
| {3} |
| {4} |
| {5} |
| {6} |
| {7} |
I want to find all arrays in col which are not wholly contained by another array. The output should look like this:
| col |
| --- |
| {1,5,6} |
| {5,6,7} |
| {2} |
| {3} |
| {4} |
I am hoping there is a PostgreSQL way of doing this. Can anyone help please?
You can use a NOT EXISTS condition to find those arrays:
select t1.*
from the_table t1
where not exists (select *
                  from the_table t2
                  where t1.col <@ t2.col
                    and t1.ctid <> t2.ctid)
ctid is an internal unique value for each row and is used here to avoid comparing a row with itself. If your table has a proper primary key column, you should use that instead.
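With a primary key, the same idea would look like this (a sketch; the id column is hypothetical):

select t1.*
from the_table t1
where not exists (select 1
                  from the_table t2
                  where t1.col <@ t2.col
                    and t1.id <> t2.id);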

Left Join Not Returning Expected Results

I have a table of Vendors (Vendors):
+-----+-------------+--+
| ID | Vendor | |
+-----+-------------+--+
| 1 | ABC Company | |
| 2 | DEF Company | |
| 3 | GHI Company | |
| ... | ... | |
+-----+-------------+--+
and a table of services (AllServices):
+-----+------------+--+
| ID | Service | |
+-----+------------+--+
| 1 | Automotive | |
| 2 | Medical | |
| 3 | Financial | |
| ... | ... | |
+-----+------------+--+
and a table that links the two (VendorServices):
+-----------+-----------+
| Vendor ID | ServiceID |
+-----------+-----------+
| 1 | 1 |
| 1 | 3 |
| 3 | 2 |
| ... | ... |
+-----------+-----------+
Note that one company may provide multiple services while some companies may not provide any of the listed services.
The query results I want would be, for a given Vendor:
+------------+----------+
| Service ID | Provided |
+------------+----------+
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| ... | ... |
+------------+----------+
Where ALL of the services are listed and the ones that the given vendor provides would have a 1 in the Provided column, otherwise a zero.
Here's what I've got so far:
SELECT
VendorServices.ServiceID,
<Some Function> AS Provided
FROM
AllServices LEFT JOIN VendorServices ON AllServices.ID = VendorServices.ServiceID
WHERE
VendorServices.VendorID = #VendorID
ORDER BY
Service
I have two unknowns:
The above query does not return every entry in the AllServices table; and
I don't know how to write the function for the Provided column.
You need a LEFT join of AllServices to VendorServices and a CASE expression to get the column provided. The vendor filter must go in the ON clause rather than the WHERE clause: filtering the right table's column in WHERE discards the unmatched (NULL) rows and effectively turns the LEFT join into an INNER join.
select s.id,
       case when v.serviceid is null then 0 else 1 end as provided
from AllServices s
left join VendorServices v
       on v.serviceid = s.id and v.vendorid = #VendorID

TSQL query parser in TSQL

I would like to have something like a procedure that takes a query definition as input and outputs a set of tables containing the individual elements of the query.
Searching the internet for this yields numerous results in various programming languages, but not in T-SQL itself. Is there such a resource around?
An example to illustrate what I mean by "parser":
Input example (any query, really):
'select t1.col1, t2.col2
from table1 t1
inner join table2 t2
on t1.t2ref = t2.key'
The output, of course, will be a multitude of data. I mentioned tables, but it could be in any form, e.g. XML. Here is a VERY SIMPLISTIC and arbitrary example of decomposition for the query above:
tables_used:
+----+-----------+--------+------------+
| id | object_id | name | alias used |
+----+-----------+--------+------------+
| 1 | 43252345 | table1 | t1 |
| 2 | 6542625 | table2 | t2 |
+----+-----------+--------+------------+
columns_used:
+----------+-------------+
| table_id | column name |
+----------+-------------+
| 1 | col1 |
| 1 | t2ref |
| 2 | key |
| 2 | col2 |
+----------+-------------+
joins_used:
+-----+-----+-------+-----------------+
| tb1 | tb2 | type | on |
+-----+-----+-------+-----------------+
| 1 | 2 | inner | t1.t2ref=t2.key |
+-----+-----+-------+-----------------+

MariaDB Partitioning

There is a table named history in the Zabbix database, and I have created partitions on this table.
The partition type is RANGE and the partitioning expression uses UNIX_TIMESTAMP values (the clock column).
After the date changes, the Zabbix service no longer inserts data into the related partition.
What is the problem?
How do I display all partitions?
Could you please help me understand how to write data to the related partitions?
Sample partition creation statement:
.
.
.
ALTER TABLE zabbix.history_test PARTITION BY RANGE (clock) (
    PARTITION p28082021 VALUES LESS THAN (UNIX_TIMESTAMP("2021-08-28 00:00:00")) ENGINE = InnoDB
);
Server version: 10.1.31-MariaDB MariaDB Server
EXPLAIN PARTITIONS SELECT * FROM zabbix.history;
+------+-------------+---------+------------+------+---------------+------+---------+------+----------+-------+
| id   | select_type | table   | partitions | type | possible_keys | key  | key_len | ref  | rows     | Extra |
+------+-------------+---------+------------+------+---------------+------+---------+------+----------+-------+
|    1 | SIMPLE      | history | p28082021  | ALL  | NULL          | NULL | NULL    | NULL | 18956757 |       |
+------+-------------+---------+------------+------+---------------+------+---------+------+----------+-------+
SELECT DISTINCT PARTITION_EXPRESSION FROM
INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME='history' AND
TABLE_SCHEMA='zabbix';
+----------------------+
| PARTITION_EXPRESSION |
+----------------------+
| clock |
+----------------------+
MariaDB [(none)]> SELECT PARTITION_ORDINAL_POSITION, TABLE_ROWS, PARTITION_METHOD
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = 'zabbix' AND TABLE_NAME = 'history';
+----------------------------+------------+------------------+
| PARTITION_ORDINAL_POSITION | TABLE_ROWS | PARTITION_METHOD |
+----------------------------+------------+------------------+
| 1 | 18851132 | RANGE |
+----------------------------+------------+------------------+
SELECT FROM_UNIXTIME(MAX(clock)) FROM zabbix.history;
+---------------------------+
| FROM_UNIXTIME(MAX(clock)) |
+---------------------------+
| 2018-04-07 23:06:06 |
+---------------------------+
SELECT FROM_UNIXTIME(MIN(clock)) FROM zabbix.history;
+---------------------------+
| FROM_UNIXTIME(MIN(clock)) |
+---------------------------+
| 2018-04-06 01:06:23 |
+---------------------------+
This document helped me create the partition on the clock column. It contains stored procedures that create partitions; you can check it:
https://www.zabbix.org/wiki/Docs/howto/mysql_partition
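As a hedged sketch of the two pieces asked about: with RANGE partitioning, every row must fall below the boundary of some partition, so a partition covering the new day has to exist (or be added) before rows for that day arrive, otherwise the INSERT fails. The partition names and dates below are illustrative only:

-- Add a partition for the next day before its data arrives.
ALTER TABLE zabbix.history_test
    ADD PARTITION (PARTITION p29082021
                   VALUES LESS THAN (UNIX_TIMESTAMP('2021-08-29 00:00:00')));

-- Show all partitions of the table with their boundaries and row counts.
SELECT PARTITION_NAME, PARTITION_DESCRIPTION, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = 'zabbix'
  AND TABLE_NAME = 'history_test';

The stored procedures on the linked wiki page automate this kind of maintenance by adding and dropping daily partitions on a schedule.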

Join two Select Statements into a single row when one select has n amount of entries?

Is it possible in SQL Server to take two SELECT statements and combine them into a single row without knowing how many rows one of the SELECT statements returns?
I've been looking at various join solutions, but they all seem to assume that the number of columns is predetermined. I have a case where one table (t1) has a fixed set of columns and the other table (t2) has an unknown number of rows, each carrying a key that matches one row in t1.
+----+------+-----+
| id | name | ... |
+----+------+-----+
| 1 | John | ... |
+----+------+-----+
And
+-------------+----------------+
| activity_id | account_number |
+-------------+----------------+
| 1 | 12345467879 |
| 1 | 98765432515 |
| ... | ... |
| ... | ... |
+-------------+----------------+
The number of account numbers belonging to the first query is unknown.
After the query it would become:
+----+------+-----+----------------+------------------+-----+------------------+
| id | name | ... | account_number | account_number_2 | ... | account_number_n |
+----+------+-----+----------------+------------------+-----+------------------+
| 1 | John | ... | 12345467879 | 98765432515 | ... | ... |
+----+------+-----+----------------+------------------+-----+------------------+
So I don't know how many account numbers could be associated with the id beforehand.
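
A hedged sketch of one way to do this with a dynamic PIVOT: number the account rows per key with ROW_NUMBER(), build the column list at runtime, and execute the pivot with sp_executesql. The table names t1/t2, the column names id, name, activity_id, account_number, and the join activity_id = id are assumptions taken from the sample data above:

DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- Build the target column list account_number_1, account_number_2, ...
-- sized by the largest number of accounts attached to any single id.
SELECT @cols = STUFF((
        SELECT ',' + QUOTENAME('account_number_' + CAST(rn AS VARCHAR(10)))
        FROM (SELECT ROW_NUMBER() OVER (PARTITION BY activity_id
                                        ORDER BY account_number) AS rn
              FROM t2) x
        GROUP BY rn
        ORDER BY rn
        FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1, '');

SET @sql = N'
SELECT t1.id, t1.name, pvt.*
FROM t1
LEFT JOIN (
    SELECT activity_id, ' + @cols + N'
    FROM (SELECT activity_id, account_number,
                 ''account_number_''
                   + CAST(ROW_NUMBER() OVER (PARTITION BY activity_id
                                             ORDER BY account_number) AS VARCHAR(10)) AS col
          FROM t2) src
    PIVOT (MAX(account_number) FOR col IN (' + @cols + N')) AS p
) AS pvt ON pvt.activity_id = t1.id;';

EXEC sp_executesql @sql;

The LEFT JOIN keeps rows from t1 that have no account numbers at all; their account_number_n columns simply come back as NULL.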
