Citus: can I use joins if I use Citus? - database

Can I use joins if I use Citus?
And if yes, are they performant?

Can I use joins if I use Citus?
Yes, you can use joins with Citus.
https://docs.citusdata.com/en/v10.0/develop/reference_sql.html#joins
Most joins between the different Citus table types are supported out of the box, but some of them require enabling repartition joins.
https://docs.citusdata.com/en/v10.0/develop/api_guc.html?highlight=enable_repartition_joins#citus-enable-repartition-joins-boolean
As of Citus 10, joins between local tables and distributed tables are also supported.
https://docs.citusdata.com/en/v10.0/develop/api_guc.html?highlight=enable_repartition_joins#citus-local-table-join-policy-enum
And if yes, are they performant?
Citus joins tables very efficiently when tables are co-located.
https://docs.citusdata.com/en/v10.0/develop/reference_sql.html#co-located-joins

Related

Postgresql: Arrays instead of extra table in a many-to-many relationship?

I was wondering whether it is cleaner or faster to use arrays of integers instead of creating an extra table for an m-to-n relationship and using foreign keys to ensure data integrity.
What do you think? Do you say using arrays instead is a no-go?
Arrays to implement an m-to-n relationship are a no-go.
Write down the join between the two tables. You will notice that instead of two joins with an = as join condition, you now have a single join with @> as join condition (or a LATERAL unnest of the array, which is worse). That means that you are reduced to nested loop joins, which will be slow if the tables are big (even if you create a GIN index to support the @>).
You cannot have referential integrity (foreign key constraints) that way.
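For comparison, here is a minimal sketch of the junction-table design the answer recommends. It runs on SQLite through Python's sqlite3 module rather than on PostgreSQL, and the student, course and enrollment tables and their rows are hypothetical names invented for illustration. It shows the two plain = joins and the foreign key check that an array column could not give you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: two entity tables plus a junction table for the m-to-n link.
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL REFERENCES student(id),
        course_id  INTEGER NOT NULL REFERENCES course(id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO course  VALUES (10, 'SQL'), (20, 'Go');
    INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")

# Two joins, each with a plain equality condition.
rows = conn.execute("""
    SELECT s.name, c.title
    FROM student s
    JOIN enrollment e ON e.student_id = s.id
    JOIN course c     ON c.id = e.course_id
    ORDER BY s.name, c.title
""").fetchall()

# The foreign key rejects a link to a course that does not exist.
try:
    conn.execute("INSERT INTO enrollment VALUES (1, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(rows)
print(fk_rejected)
```

The composite primary key on the junction table also prevents the same pair from being stored twice, which an integer array would not do on its own.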

Using joins on non-distributed columns in Citus distributed tables

I have successfully created a Citus cluster with a couple of worker nodes. I run into problems when running a query that uses distributed tables.
This is a very simplified version of an indicative query I want to run:
SELECT a.origin, a.destination, b.label AS origin_label, c.label AS destination_label
FROM commuting_data a
LEFT JOIN geography_data b ON a.origin=b.sequence_id
LEFT JOIN geography_data c ON a.destination=c.sequence_id
ORDER BY a.origin, a.destination
commuting_data consists of about 20 million rows and among other information each row contains an origin and destination code.
geography_data consists of about 100 thousand rows and contains information about each geography code included in the commuting_data table.
The sequence_id column in geography_data matches the origin and destination columns and provides labels and other information about these spatial points. I want to join the commuting_data to the geography_data table on both origin and destination columns. This is easily done in a single node PostgreSQL database, but it is slow to complete the query.
My problem is that Citus only supports one distribution column per table. Therefore, if I set the distribution column to be either 'origin' or 'destination' for table commuting_data I get the error 'complex joins are only supported when all distributed tables are co-located and joined on their distribution columns'.
How can I overcome this problem and produce the results given from the query above utilising all the power of database sharding that Citus can offer?
I am using PostgreSQL 12.4 with Citus 9.4 on Ubuntu 20.04.
Citus cannot execute your query in a distributed fashion because it does a LEFT JOIN on a non-distribution column.
If the extra cost is acceptable to you, you could wrap the non-distribution column of geography_data in a CTE and use that CTE in your query.

Just curious about SQL Joins

I'm just curious here. If I have two tables, let's say Clients and Orders.
Clients has a unique primary key, ID_Client. Orders also has an ID_Client field, with a relation to the Clients table on ID_Client to maintain integrity.
So when I want to join both tables I do:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients ON Clients.ID_Client = Orders.ID_Client
So if I took the trouble to create the primary key and the relation between the tables, is there a reason why I need to explicitly include the joined columns in the ON clause?
Why can't I do something like:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients
So SQL should know which columns relate both tables...
I had this same question once and found a great explanation for it on Database Administrators Stack Exchange. The answer below was the one I found to be the best, but you can refer to the link for additional explanations as well.
A foreign key is meant to constrain the data, i.e. enforce referential integrity. That's it. Nothing else.
1. You can have multiple foreign keys to the same table. Consider the following where a shipment has a starting point, and an ending point.
   table: USA_States
     StateID
     StateName
   table: Shipment
     ShipmentID
     PickupStateID Foreign key
     DeliveryStateID Foreign key
   You may want to join based on the pickup state. Maybe you want to join on the delivery state. Maybe you want to perform 2 joins for both! The sql engine has no way of knowing what you want.
2. You'll often cross join scalar values. Although scalars are usually the result of intermediate calculations, sometimes you'll have a special purpose table with exactly 1 record. If the engine tried to detect a foreign key for the join, it wouldn't make sense, because cross joins never match up a column.
3. In some special cases you'll join on columns where neither is unique. Therefore the presence of a PK/FK on those columns is impossible.
4. You may think points 2 and 3 above are not relevant since your question is about when there IS a single PK/FK relationship between tables. However, the presence of a single PK/FK between the tables does not mean you can't have other fields to join on in addition to the PK/FK. The sql engine would not know which fields you want to join on.
5. Let's say you have a table "USA_States", and 5 other tables with a FK to the states. The "five" tables also have a few foreign keys to each other. Should the sql engine automatically join the "five" tables with "USA_States"? Or should it join the "five" to each other? Both? You could set up the relationships so that the sql engine enters an infinite loop trying to join stuff together. In this situation it's impossible for the sql engine to guess what you want.
In summary: PK/FK has nothing to do with table joins. They are separate unrelated things. It's just an accident of nature that you often join on the PK/FK columns.
Would you want the sql engine to guess if it's a full, left, right, or inner join? I don't think so. Although that would arguably be a lesser sin than guessing the columns to join on.
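The shipment schema quoted above can be turned into a small runnable sketch. It uses SQLite through Python's sqlite3 module, and the two states and the single shipment row are made-up sample data. With two foreign keys pointing at USA_States, the query has to join that table twice under different aliases, and only the ON clauses say which key each join follows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema from the quoted answer; the rows are hypothetical sample data.
conn.executescript("""
    CREATE TABLE USA_States (StateID INTEGER PRIMARY KEY, StateName TEXT);
    CREATE TABLE Shipment (
        ShipmentID      INTEGER PRIMARY KEY,
        PickupStateID   INTEGER REFERENCES USA_States(StateID),
        DeliveryStateID INTEGER REFERENCES USA_States(StateID)
    );
    INSERT INTO USA_States VALUES (1, 'Ohio'), (2, 'Texas');
    INSERT INTO Shipment VALUES (100, 1, 2);
""")

# Same table joined twice: once per foreign key, each with its own alias.
shipments = conn.execute("""
    SELECT s.ShipmentID, p.StateName AS pickup, d.StateName AS delivery
    FROM Shipment s
    JOIN USA_States p ON p.StateID = s.PickupStateID
    JOIN USA_States d ON d.StateID = s.DeliveryStateID
""").fetchall()

print(shipments)
```

An engine that tried to infer the join from the constraints alone would have no basis for choosing between PickupStateID and DeliveryStateID here.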
If you don't explicitly give the field names in the query, SQL doesn't know which fields to use. You won't always have fields that are named the same and you won't always be joining on the primary key. For example, a relationship could be between two foreign key fields named "Client_Address" and "Delivery_Address". In that case, you can easily see how you would need to give the field name.
As an example:
SELECT o.*, c.Name
FROM Clients c
INNER JOIN Orders o
ON o.Delivery_Address = c.Client_Address
Is there a reason why I need to explicitly include the joined fields in the ON clause?
Yes, because you still need to tell the database server what you want. "Do what I mean" is not within the capabilities of any software system so far.
Foreign keys are tools for enforcing data integrity. They do not dictate how you can join tables. You can join on any condition that is expressible through an SQL expression.
In other words, a join clause relates two tables to each other by a freely definable condition that needs to evaluate to true given the two rows from left hand side and the right hand side of the join. It does not have to be the foreign key, it can be any condition.
Want to find people that have last names equal to products you sell?
SELECT
Products.Name,
Clients.LastName
FROM
Products
INNER JOIN Clients ON Products.Name = Clients.LastName
There isn't even a foreign key between Products and Clients, still the whole thing works.
It's like that. :)
The SQL standard says that you have to say which columns to join on. The constraints are just for referential integrity. MySQL also supports the form "join table using (column1, column2)", but then those columns have to be present in both tables.
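As a rough illustration of the USING form, the sketch below runs the Clients and Orders join from the question both ways. It uses SQLite through Python's sqlite3 module instead of MySQL (SQLite accepts the same USING syntax), and the ID_Order column and all rows are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Clients/Orders as in the question; ID_Order and the rows are hypothetical.
conn.executescript("""
    CREATE TABLE Clients (ID_Client INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (
        ID_Order  INTEGER PRIMARY KEY,
        ID_Client INTEGER REFERENCES Clients(ID_Client)
    );
    INSERT INTO Clients VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO Orders VALUES (10, 1), (11, 2), (12, 1);
""")

# Explicit ON condition.
with_on = conn.execute("""
    SELECT Orders.ID_Order, Clients.Name
    FROM Orders
    INNER JOIN Clients ON Clients.ID_Client = Orders.ID_Client
    ORDER BY Orders.ID_Order
""").fetchall()

# USING names the shared column once; it must exist in both tables.
with_using = conn.execute("""
    SELECT Orders.ID_Order, Clients.Name
    FROM Orders
    INNER JOIN Clients USING (ID_Client)
    ORDER BY Orders.ID_Order
""").fetchall()

print(with_on)
print(with_using)
```

USING is shorter, but the query still names the join column itself. The foreign key plays no part in choosing it.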
Reasons why this behaviour is not default
Because one Table can have multiple columns referencing back to one column in another table.
In a lot of legacy databases there are no Foreign key constraints but yet the columns are “Supposed to be” referencing some column in some other table.
The join conditions are not always as simple as A.Column = B.Column, and the list goes on.
Microsoft developers were intelligent enough to let us make this decision rather than them guessing that it will always be A.Column = B.Column

Renaming table results in different Query Plan

Probably more information is needed, but this is really odd. Using SQL 2005 I am executing an Inner Join on two tables. If I rename one of the tables (using Alter Table), the resulting Query Plan is significantly longer. There are views on the table, but the Inner Join is using the base table, not any of the views. Does this make sense? Is this to be expected?

Composite indexes for columns in joins and where

If I have a query:
Select *
From tableA
Inner Join tableB ON tableA.bId = tableB.id
Inner Join tableC ON tableA.cId = tableC.id
where
tableA.someColumn = ?
Do I get any performance benefit from creating a composite index on (bId, cId, someColumn)?
I'm using DB2 Database for this activity.
Indexing joins depends on the join algorithm used by the database. You'll see that in the execution plan.
You will probably need an index on tableA that starts with someColumn for the where clause. Everything else depends on the join algorithm and join order.
You will probably get a more specific answer if you post the execution plan. You can also read the chapter "The Join Operation" on my site about sql indexing and try yourself.
If there are no indexes now, I'd guess that the composite index might be used in one or both inner joins. I doubt that it would be used in the WHERE clause.
But I've been doing this stuff for a long time. Guessing, like hoping, doesn't scale well.
Instead of guessing, you're better off learning how to use DB2's explain and design advisor utilities. Expect to test things like indexing first on a development computer. Building a three-column index on a 500 million row table that's in production will not make you popular.
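To make the "test instead of guess" advice concrete, here is a small sketch that asks a database for its plan. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module purely as a stand-in, because DB2's explain tooling needs a running DB2 instance. The index name idx_a_some_b_c and the column types are assumptions, and the index starts with someColumn as one of the answers above suggests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Tables from the question; column types and the index name are assumptions.
conn.executescript("""
    CREATE TABLE tableB (id INTEGER PRIMARY KEY);
    CREATE TABLE tableC (id INTEGER PRIMARY KEY);
    CREATE TABLE tableA (
        id INTEGER PRIMARY KEY,
        bId INTEGER,
        cId INTEGER,
        someColumn TEXT
    );
    -- Composite index that leads with the WHERE column.
    CREATE INDEX idx_a_some_b_c ON tableA (someColumn, bId, cId);
""")

# Ask the planner how it would run the query instead of guessing.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT *
    FROM tableA
    INNER JOIN tableB ON tableA.bId = tableB.id
    INNER JOIN tableC ON tableA.cId = tableC.id
    WHERE tableA.someColumn = ?
""", ("x",)).fetchall()

# The last field of each plan row is the human-readable step description.
plan_steps = [row[3] for row in plan]
for step in plan_steps:
    print(step)
```

SQLite's planner is not DB2's, so this only shows the workflow: create the candidate index on a development copy, read the plan, then decide.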
