Left join in InfluxDB

I am new to InfluxDB and need to migrate a MySQL database to it. I chose InfluxDB because it supports SQL-like queries, but I could not find a left join in it. I have a series called statistics that contains a browser ID, and another series that contains the browser list. How can I join these two series the way I would join tables in a relational database?
I wrote this query, but it returns no results:
select * from statistics as s inner join browsers as b where s.browser_type_id = b.id

You cannot join series in InfluxDB on arbitrary columns. InfluxDB only supports joining time series on the time column. This is a special kind of join, unlike the ones you're used to in relational databases: a time join in InfluxDB tries to correlate points from different time series that happened at approximately the same time. You can read more about joins in InfluxDB in the docs.
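Since the lookup then has to happen outside the query, one possible workaround is to fetch both series and combine them in application code. The following is only a minimal sketch, assuming both series have already been fetched into lists of dicts; the sample rows and the `left_join` helper are hypothetical and are not part of any InfluxDB API.

```python
# Hypothetical sample data standing in for the two series from the question.
statistics = [
    {"time": 1, "browser_type_id": 1, "hits": 10},
    {"time": 2, "browser_type_id": 2, "hits": 7},
    {"time": 3, "browser_type_id": 9, "hits": 3},  # no matching browser
]
browsers = [
    {"id": 1, "name": "Firefox"},
    {"id": 2, "name": "Chrome"},
]


def left_join(points, lookup_rows):
    """Attach the browser name to each statistics point (None if unknown)."""
    by_id = {row["id"]: row["name"] for row in lookup_rows}
    return [
        {**point, "browser_name": by_id.get(point["browser_type_id"])}
        for point in points
    ]


joined = left_join(statistics, browsers)
for row in joined:
    print(row)
```

Every statistics point is kept, and points without a matching browser get `None`, which mirrors what a relational left join would produce.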

It seems this is possible now. Check the documentation again: https://docs.influxdata.com/influxdb/v0.8/api/query_language/#joining-series
select hosta.value + hostb.value
from cpu_load as hosta
inner join cpu_load as hostb
where hosta.host = 'hosta.influxdb.org' and hostb.host = 'hostb.influxdb.org';

Related

Should Full Outer Joins be avoided in Power Query?

I created a full outer join in Power Query that joins 2 views from SQL Server, both of which run quickly on their own. The query sometimes takes a while (a few minutes) to run and, despite the views being READ UNCOMMITTED, has blocked other users. Now I need to create another full outer join, but I am nervous about doing so. Full outer joins in Power Query also seem to break query folding, although in this new case that's irrelevant: the join is not to another query/view in SQL Server, so it can't be folded anyway.
Should Full Outer Joins in Power Query be avoided?

How to compare data (1 billion records) between two Kafka streams or database tables

We are sending data from DB2 (table-1) via CDC to a Kafka topic (topic-1).
We need to reconcile the DB2 data against the Kafka topic.
We have two options:
a) Load all of the Kafka topic data into DB2 (as table-1-copy), then do a left outer join between table-1 and table-1-copy to find the non-matching records, create the delta, and push it back into Kafka.
Problem: scalability. Our data set is about a billion records, and I am not sure the DB2 DBA will let us run such a huge join (it could easily run for more than 15-20 minutes).
b) Push the DB2 data into a parallel Kafka topic (topic-1-copy), then use a Kafka Streams based solution to do a left outer join between topic-1 and topic-1-copy. I am still wrapping my head around Kafka Streams and left outer joins.
I am not sure whether, using the windowing system in Kafka Streams, I will be able to compare the ENTIRE contents of topic-1 with topic-1-copy.
To make matters worse, topic-1 is a compacted topic,
so when we push the data from DB2 into topic-1-copy, we cannot deterministically kick off the compaction cycle to make sure both topic-1 and topic-1-copy are fully compacted before running any sort of compare operation on them.
c) Is there any other framework we could consider for this?
The ideal solution has to scale to any data size.
I see no reason why you couldn't do this in either Kafka Streams or KSQL. Both support table-table joins. That's assuming the format of the data is supported.
Key compaction won't affect the results as both Streams and KSQL will build the correct final state of joining the two tables. If compaction has run the amount of data that needs processing may be less, but the result will be the same.
For example, in ksqlDB you could import both topics as tables and perform a join and then filter by the topic-1 table being null to find the list of missing rows.
-- example using ksqlDB 0.9, assuming an INT primary key:
-- create table from main topic:
CREATE TABLE TABLE_1
(ROWKEY INT PRIMARY KEY, <other column defs>)
WITH (kafka_topic='topic-1', value_format='?');
-- create table from second topic:
CREATE TABLE TABLE_2
(ROWKEY INT PRIMARY KEY, <other column defs>)
WITH (kafka_topic='topic-1-copy', value_format='?');
-- create a table containing only the missing keys:
CREATE TABLE MISSING AS
SELECT T2.* FROM TABLE_2 T2 LEFT JOIN TABLE_1 T1 ON T2.ROWKEY = T1.ROWKEY
WHERE T1.ROWKEY IS NULL;
The benefit of this approach is that the MISSING table updates automatically: as you extract the missing rows from your source DB2 instance and produce them to topic-1, the corresponding rows in the MISSING table are deleted, i.e. you'd see tombstones being produced to the MISSING topic.
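The left-join-and-filter pattern can be tried on a small scale before building it in ksqlDB. The sketch below uses Python's built-in sqlite3 module rather than ksqlDB, so it only illustrates the join logic, not streaming behaviour; the table contents are made up for illustration. It also shows that, in SQLite at least, the null check has to be written as `IS NULL`, because a comparison with `= NULL` never matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# table_1 stands in for topic-1, table_2 for topic-1-copy (the DB2 contents).
conn.executescript("""
    CREATE TABLE table_1 (rowkey INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE table_2 (rowkey INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO table_1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table_2 VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

ANTI_JOIN = """
    SELECT t2.rowkey, t2.payload
    FROM table_2 t2
    LEFT JOIN table_1 t1 ON t2.rowkey = t1.rowkey
    WHERE t1.rowkey {null_check}
    ORDER BY t2.rowkey
"""

# Rows present in table_2 that have no partner in table_1.
missing = conn.execute(ANTI_JOIN.format(null_check="IS NULL")).fetchall()

# The same query with "= NULL" returns nothing, since NULL = NULL is not true.
with_equals = conn.execute(ANTI_JOIN.format(null_check="= NULL")).fetchall()

print("missing:", missing)
print("with '= NULL':", with_equals)
```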
You can even extend this approach to find rows that exist in topic-1 that are no longer in the source db:
-- using the same DDL statements for TABLE_1 and TABLE_2 from above
-- perform the join:
CREATE TABLE JOINED AS
SELECT * FROM TABLE_2 T2 FULL OUTER JOIN TABLE_1 T1 ON T2.ROWKEY = T1.ROWKEY;
-- detect rows in the DB that aren't in the topic:
CREATE TABLE MISSING AS
SELECT * FROM JOINED
WHERE T1_ROWKEY IS NULL;
-- detect rows in the topic that aren't in the DB:
CREATE TABLE EXTRA AS
SELECT * FROM JOINED
WHERE T2_ROWKEY IS NULL;
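To make the two directions of the comparison concrete, here is a small in-memory sketch in plain Python. It is not ksqlDB code: the two dicts are hypothetical snapshots of the tables keyed by primary key, and `full_outer_join` is an illustrative helper. It assumes stored values are never `None`, so that `None` can mark an absent side.

```python
# Hypothetical snapshots keyed by primary key.
topic_rows = {1: "a", 2: "b", 4: "stale"}  # topic-1 (TABLE_1)
db_rows = {1: "a", 2: "b", 3: "c"}         # topic-1-copy, i.e. DB2 (TABLE_2)


def full_outer_join(left, right):
    """Map every key seen on either side to (left_value, right_value).

    A side that lacks the key contributes None (assumes real values
    are never None).
    """
    all_keys = sorted(left.keys() | right.keys())
    return {key: (left.get(key), right.get(key)) for key in all_keys}


joined = full_outer_join(db_rows, topic_rows)

# In the DB but not in the topic.
missing = [key for key, (db, topic) in joined.items() if topic is None]
# In the topic but no longer in the DB.
extra = [key for key, (db, topic) in joined.items() if db is None]

print("missing:", missing)
print("extra:", extra)
```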
Of course, you'll need to size your cluster accordingly. The bigger your ksqlDB cluster the quicker it will process the data. It'll also need on-disk capacity to materialize the table.
The maximum parallelism you can get is set by the number of partitions on the topics. If you have only 1 partition, the data will be processed sequentially. With 100 partitions, you can process the data using 100 CPU cores, assuming you run enough ksqlDB instances. (By default, each ksqlDB node creates 4 stream-processing threads per query, though you can increase this if the server has more cores.)
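As a rough back-of-the-envelope model of that sizing rule, the following sketch assumes each partition is handled by at most one stream thread at a time and uses the default of 4 threads per node mentioned above. The function names are hypothetical and the model ignores everything else that affects throughput.

```python
import math


def effective_parallelism(partitions, nodes, threads_per_node=4):
    """Threads that can actually do work: capped by the partition count."""
    return min(partitions, nodes * threads_per_node)


def nodes_needed(partitions, threads_per_node=4):
    """Nodes required so that every partition gets its own thread."""
    return math.ceil(partitions / threads_per_node)


print(effective_parallelism(partitions=1, nodes=10))    # single partition
print(effective_parallelism(partitions=100, nodes=10))  # thread-bound
print(nodes_needed(partitions=100))
```

Under these assumptions, adding nodes beyond `nodes_needed` gives no further parallelism; only adding partitions would.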

Execution of TSQL statement

I am aware of the order in which SQL statements are executed, but I still want to confirm a few things with the SQL experts here. I have a big SQL query that returns thousands of rows. Here is a minimized version of the query, which I wrote and believe to be correct.
Select *
from property p
inner join tenant t on (t.hproperty = p.hmy and p.hmy = 7)
inner join commtenant ct on ct.htenant = t.hmyperson
where 1=1
My colleague says the query above is equivalent, performance-wise, to the query below (he is very confident about it).
Select *
from property p
inner join tenant t on (t.hproperty = p.hmy)
inner join commtenant ct on ct.htenant = t.hmyperson
where p.hmy = 7
Could anybody explain why the two queries are or are not equivalent? Thanks.
If you want to know if two queries are equivalent, learn how to look at the execution plans in SQL Server Management Studio. You can put the two queries in different windows, look at the estimated execution plans, and see for yourself if they are the same.
In this case, they probably are the same. SQL is intended to be a descriptive language, not a procedural language. That is, it describes the output you want, but the SQL engine is allowed to rewrite the query to be as efficient as possible. The two forms you have describe the same output. Do note that if there were a left outer join instead of an inner join, then the queries would be different.
In all likelihood, the engine will read the table and filter the records during the read or use an index for the read. The key idea, though, is that the output is the same and SQL Server can recognize this.
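The inner-versus-outer distinction is easy to check on toy data. The sketch below uses Python's built-in sqlite3 module instead of SQL Server, so it says nothing about SQL Server's execution plans; it only compares result sets. The rows are made up, and the `run` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE property (hmy INTEGER PRIMARY KEY);
    CREATE TABLE tenant (hmyperson INTEGER PRIMARY KEY, hproperty INTEGER);
    INSERT INTO property VALUES (7), (8);
    INSERT INTO tenant VALUES (1, 7), (2, 7), (3, 8);
""")


def run(join_kind, on_extra="", where="1=1"):
    """Run the join with the filter placed either in ON or in WHERE."""
    sql = f"""
        SELECT p.hmy, t.hmyperson
        FROM property p
        {join_kind} JOIN tenant t ON (t.hproperty = p.hmy {on_extra})
        WHERE {where}
        ORDER BY p.hmy, t.hmyperson
    """
    return conn.execute(sql).fetchall()


inner_on = run("INNER", on_extra="AND p.hmy = 7")
inner_where = run("INNER", where="p.hmy = 7")
left_on = run("LEFT", on_extra="AND p.hmy = 7")
left_where = run("LEFT", where="p.hmy = 7")

print("inner, filter in ON:   ", inner_on)
print("inner, filter in WHERE:", inner_where)
print("left,  filter in ON:   ", left_on)
print("left,  filter in WHERE:", left_where)
```

With the inner join both placements return the same rows. With the left join, the ON version keeps property 8 with a NULL tenant, while the WHERE version drops it.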
"p.hmy = 7" is not a join condition, as it relates only to a single table. As such, it doesn't really belong in the ON clause of the join. Since you are not adding any information by placing the condition in the ON clause, having it in the WHERE clause (in which it really belongs) will not make any difference to the query plan generated. If in doubt, look at the query plans.

SQL Server performance - Subselect or Inner Join?

I've been wondering which of these two statements performs better (and why):
select * from formelement
where formid = (select id from form where name = 'Test')
or
select *
from formelement fe
inner join form f on fe.formid = f.id
where f.name = 'Test'
One form contains several form elements, one form element is always part of one form.
Thanks,
Dennis
Look at the execution plan; most likely it will be the same if you add the filtering to the join. That said, the join will return everything from both tables, whereas the subselect will not.
I actually prefer EXISTS over those two
select * from formelement fe
where exists (select 1 from form f
where f.name='Test'
and fe.formid =f.id)
The performance depends on the query plan chosen by the SQL Server engine. The query plan depends on a lot of factors, including (but not limited to) the SQL, the exact table structure, the statistics of the tables, available indexes, etc.
Since your two queries are quite simple, my guess would be that they result in the same (or a very similar) execution plan, thus yielding comparable performance.
(For large, complicated queries, the exact wording of the SQL can make a difference, the book SQL Tuning by Dan Tow gives a lot of great advice on that.)
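To see how the three formulations relate, here is a small sketch using Python's built-in sqlite3 module. SQLite is not SQL Server, so this shows only that the queries select the same form elements, not how SQL Server would plan them; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE form (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE formelement (id INTEGER PRIMARY KEY, formid INTEGER, label TEXT);
    INSERT INTO form VALUES (1, 'Test'), (2, 'Other');
    INSERT INTO formelement VALUES (10, 1, 'a'), (11, 1, 'b'), (12, 2, 'c');
""")

queries = {
    "subselect": """
        SELECT fe.id FROM formelement fe
        WHERE fe.formid = (SELECT id FROM form WHERE name = 'Test')
        ORDER BY fe.id""",
    "join": """
        SELECT fe.id FROM formelement fe
        INNER JOIN form f ON fe.formid = f.id
        WHERE f.name = 'Test'
        ORDER BY fe.id""",
    "exists": """
        SELECT fe.id FROM formelement fe
        WHERE EXISTS (SELECT 1 FROM form f
                      WHERE f.name = 'Test' AND fe.formid = f.id)
        ORDER BY fe.id""",
}

results = {name: [row[0] for row in conn.execute(sql)]
           for name, sql in queries.items()}

# SELECT * on the join also returns the columns of form, not just formelement.
star = conn.execute(
    "SELECT * FROM formelement fe INNER JOIN form f ON fe.formid = f.id")
star_width = len(star.description)

print(results)
print("columns returned by SELECT * on the join:", star_width)
```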

SQL Server Views performances

If I have defined a view in SQL Server like this:
CREATE View V1
AS
SELECT *
FROM t1
INNER JOIN t2 ON t1.f1 = t2.f2
ORDER BY t1.f1
Should I expect performance differences between
SELECT * FROM V1 WHERE V1.f1 = 100
and just avoiding view, like this
SELECT *
FROM t1
INNER JOIN t2 ON t1.f1 = t2.f2
WHERE t1.f1 = 100
ORDER BY t1.f1
?
We don't have any reason to use views except the need to centralize complex queries.
Thanks
There should be no performance penalty.
Simplifying complex queries is what views are for.
If performance is a concern, read about indexed views in SQL Server:
Indexed views provide additional performance benefits that cannot be achieved using standard indexes. Indexed views can increase query performance in the following ways:
Aggregations can be precomputed and stored in the index to minimize expensive computations during query execution.
Tables can be prejoined and the resulting data set stored.
Combinations of joins or aggregations can be stored.
Generally you shouldn't expect performance differences but check the execution plans for your queries.
If you are joining views onto views, the execution plans can be suboptimal and contain repeated accesses to the same table that could have been consolidated. There can also be issues with views and predicate pushing.
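The comparison can be rehearsed on a small scale. The sketch below uses Python's built-in sqlite3 module, not SQL Server, so the plans it prints are SQLite's and are shown only as an illustration of the "check the plans" habit. The tables are toy data, and the view lists its columns explicitly and omits the ORDER BY, which is an assumption made for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (f1 INTEGER, a TEXT);
    CREATE TABLE t2 (f2 INTEGER, b TEXT);
    INSERT INTO t1 VALUES (100, 'x'), (200, 'y');
    INSERT INTO t2 VALUES (100, 'p'), (100, 'q'), (200, 'r');
    CREATE VIEW v1 AS
        SELECT t1.f1, t1.a, t2.b
        FROM t1 INNER JOIN t2 ON t1.f1 = t2.f2;
""")

VIA_VIEW = "SELECT f1, a, b FROM v1 WHERE f1 = 100 ORDER BY f1, b"
DIRECT = """
    SELECT t1.f1, t1.a, t2.b
    FROM t1 INNER JOIN t2 ON t1.f1 = t2.f2
    WHERE t1.f1 = 100
    ORDER BY t1.f1, t2.b
"""

via_view = conn.execute(VIA_VIEW).fetchall()
direct = conn.execute(DIRECT).fetchall()
print("same rows:", via_view == direct)

# Print SQLite's plan for each form so they can be compared by eye.
for label, sql in (("view", VIA_VIEW), ("direct", DIRECT)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("  ", step)
```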
