I have successfully created a Citus cluster with a couple of worker nodes, but I run into issues when running a query that uses distributed tables.
This is a very simplified version of an indicative query I want to run:
SELECT a.origin, a.destination, b.label AS origin_label, c.label AS destination_label
FROM commuting_data a
LEFT JOIN geography_data b ON a.origin=b.sequence_id
LEFT JOIN geography_data c ON a.destination=c.sequence_id
ORDER BY a.origin, a.destination
commuting_data consists of about 20 million rows; among other information, each row contains an origin and a destination code.
geography_data consists of about 100 thousand rows and contains information about each geography code included in the commuting_data table.
The sequence_id column in geography_data matches the origin and destination columns and provides labels and other information about these spatial points. I want to join the commuting_data table to the geography_data table on both the origin and destination columns. This is easily done in a single-node PostgreSQL database, but the query is slow to complete.
My problem is that Citus only supports one distribution column per table. Therefore, if I set the distribution column to be either 'origin' or 'destination' for table commuting_data I get the error 'complex joins are only supported when all distributed tables are co-located and joined on their distribution columns'.
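For illustration, distributing commuting_data on 'origin' looks like this (a sketch using Citus's create_distributed_table function; distributing on 'destination' would be analogous):

SELECT create_distributed_table('commuting_data', 'origin');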
How can I overcome this problem and produce the results given from the query above utilising all the power of database sharding that Citus can offer?
I am using PostgreSQL 12.4 with Citus 9.4 on Ubuntu 20.04.
Citus cannot execute your query in a distributed fashion, as your query does a LEFT JOIN on a non-distribution column.
If it wouldn't cost too much for you, you could wrap geography_data in a CTE and use that in your query.
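A minimal sketch of that workaround, applied to the query from the question. The idea is that Citus materializes the CTE on the coordinator and ships the rows to the workers as an intermediate result, so this only makes sense while geography_data stays reasonably small. On PostgreSQL 12, the MATERIALIZED keyword prevents the CTE from being inlined back into the outer query:

WITH geo AS MATERIALIZED (
    SELECT sequence_id, label FROM geography_data
)
SELECT a.origin, a.destination, b.label AS origin_label, c.label AS destination_label
FROM commuting_data a
LEFT JOIN geo b ON a.origin = b.sequence_id
LEFT JOIN geo c ON a.destination = c.sequence_id
ORDER BY a.origin, a.destination;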
Related
I have a source table in a Sybase database (ORDERS) and a source table in an MSSQL database (DOCUMENTS). I need to query the Sybase database and, for each row found in the ORDERS table, get the matching row(s) by order number from the DOCUMENTS table.
I originally wrote the SSIS package using a Lookup transformation; simple, except that it could be a one-to-many relationship, where one order number exists in the ORDERS table but more than one document could exist in the DOCUMENTS table. The SSIS Lookup will only match on the first.
My second attempt will be to stage the rows from the ORDERS table into a staging table in MSSQL and then loop through the rows in this table using a FOR EACH LOOP CONTAINER, getting the matching rows from the DOCUMENTS table and inserting them into another staging table. After all rows from ORDERS have been processed, I will write a query to join the two staging tables to give me my result, something like the sketch below. A concern with this method is that I will be opening and closing the DOCUMENTS database connection many times, which will not be very efficient (although there will probably be fewer than 200 records).
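A sketch of that final join (staging table and column names are made up):

SELECT o.OrderNumber, o.OrderDate, d.DocumentID, d.DocumentName
FROM stg.ORDERS o
LEFT JOIN stg.DOCUMENTS d ON d.OrderNumber = o.OrderNumber;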
Or could you let me know of any other way of doing this?
I have a Lookup Transformation on a table with 30 columns, but I am only using two columns: the ID column for the join and the Update column as output.
On the connection should I enter a query Select ID, Update From T1 or Use Table in the drop down?
Using table in Drop down, would this be like doing Select * From T1, or is SSIS clever enough to know I only need 2 columns?
I'm thinking I should go with the Query Select ID, Update From T1.
On the connection should I enter a query Select ID, Update From T1 or Use Table in the drop down?
It is best to specify which columns you want.
Using table in Drop down, would this be like doing Select * From T1
Yes, it is a SELECT *.
or is SSIS clever enough to know I only need 2 columns?
Nope.
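A minimal example of the explicit query, using the names from the question (note that Update is a reserved word in T-SQL, so bracket it; the dbo schema is assumed):

SELECT ID, [Update] FROM dbo.T1;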
Keep in mind that Lookups are good for pulling data from Dimension Tables where the row count and record set is small. If you are dealing with large amounts of unique data, then it will be better to perform a MERGE JOIN, instead. The performance difference can be substantial. For example, when using a Lookup on 20K rows of data, you could experience run times in the tens of minutes. A MERGE JOIN, however, would run within seconds.
Lookups have the drawback of behaving like correlated sub-queries in that they fire off a query to the server for every row passing through them. You can have the Lookup cache the data, which means SSIS will store the results in memory and then check the memory before going to the server for all subsequent rows passing through the Lookup. As a result, this is only effective if there is a large number of matching records for a small cache set. In other words, Lookups are not optimal when there is a large number of distinct IDs to look up; at that point, caching the data is almost pointless.
This is where you would switch over to using a MERGE JOIN. Note: you will need to perform a SORT on both of the data flows before the MERGE JOIN because the MERGE JOIN component requires the incoming rows to be sorted.
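A minimal sketch of feeding a MERGE JOIN (table and column names are illustrative): sort each source in the database engine rather than with SSIS Sort components, then mark each source's output as sorted in the Advanced Editor (IsSorted = True on the output, SortKeyPosition = 1 on the join column):

SELECT ID, [Update] FROM dbo.T1 ORDER BY ID;
SELECT ID, SomeColumn FROM dbo.SourceTable ORDER BY ID;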
When handled incorrectly, a single poorly placed Lookup can bring an entire package to its knees - lookups can be huge performance bottlenecks. Though, handled correctly, a Lookup can simplify the design of the dataflow and speed development by removing the extra development required to MERGE JOIN data flows.
The bottom line to all of this is that you want the Lookup performing the fewest number of queries against the server.
If you need only two columns from the lookup table, then it is better to use a SELECT query than to pick the table from the drop-down list, but the columns specified must contain the primary key (ID), because reading all columns will consume more resources, even if the effect is not meaningful on small tables.
You can refer to the following answer on database administrators community for more information:
SSIS OLE DB Source Editor Data Access Mode: “SQL command” vs “Table or view”
Note that what @JWeezy mentioned about lookups on large tables is right. Lookups are not designed for large tables; I would use SQL JOINs instead.
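A sketch of doing that join in the source query instead of a Lookup (names are illustrative):

SELECT s.ID, s.SomeColumn, t.[Update]
FROM dbo.SourceTable s
LEFT JOIN dbo.T1 t ON t.ID = s.ID;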
I am working on a database with heavy record sets in MS SQL 2016, so I want to use the table partitioning feature to improve speed.
As we know, partitioning works on a partition column of a table, say a [Date] column. In our scenario, 5 to 7 tables need partitioning because of their heavy record sets, but not every table has that [Date] column, and it is not possible to add that column to each table.
So is there any way I can use the partition column of another table, or something else?
The best option is to add a common column to all tables that you will then use to partition by.
You must already have a way of relating the different tables to each other so you can use this to tag each table with the correct Partition column.
This column could be as simple as an int with YYYYMM as values for monthly partitions.
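A minimal sketch of that setup in T-SQL (function, scheme, and table names are made up; three monthly boundaries shown):

CREATE PARTITION FUNCTION pf_YearMonth (int)
    AS RANGE RIGHT FOR VALUES (202301, 202302, 202303);
CREATE PARTITION SCHEME ps_YearMonth
    AS PARTITION pf_YearMonth ALL TO ([PRIMARY]);
CREATE TABLE dbo.BigTable (
    BigTableID bigint NOT NULL,
    YearMonth int NOT NULL,  -- the common partition column, YYYYMM
    CONSTRAINT PK_BigTable PRIMARY KEY (BigTableID, YearMonth)
) ON ps_YearMonth (YearMonth);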
You also need to make sure your queries are "Partition Aware".
This means that you should include this column in your WHERE Clause and also your JOIN Clauses for any queries.
Use Query Plans to make sure you are getting Partition Elimination on your queries.
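For example, a partition-aware version of a join between two of these tables repeats the partition column in both the WHERE and the JOIN (names are illustrative):

SELECT h.BigTableID, d.Amount
FROM dbo.BigTable h
JOIN dbo.BigTableDetail d
    ON d.BigTableID = h.BigTableID
   AND d.YearMonth = h.YearMonth
WHERE h.YearMonth = 202302;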
If you can't change the model (but can add partitions???), then you could implement the partitioning with different columns in each table, provided you have a single column in each table that you can partition on into named ranges. But if you have one-to-many relationships, it is unlikely that the child tables' keys will be consecutive relative to the parent table. Note that this approach will make your "partition aware" queries more complex to craft.
Probably more information is needed, but this is really odd. Using SQL 2005, I am executing an Inner Join on two tables. If I rename one of the tables (using Alter Table), the resulting query plan is significantly longer. There are views on the table, but the Inner Join is using the base table, not any of the views. Does this make sense? Is this to be expected?
As per the subject, I am looking for a fast way to count records in a table, without a table scan, with a WHERE condition.
There are different methods; the most reliable one is
Select count(*) from table_name
But other than that you can also use one of the followings
select sum(1) from table_name
select count(1) from table_name
select rows from sysindexes where object_name(id)='table_name' and indid<2
exec sp_spaceused 'table_name'
DBCC CHECKTABLE('table_name')
The last two need sysindexes to be updated; run the following to achieve this. If you don't update them, it is highly likely they'll give you wrong results, but as an approximation they might actually work.
DBCC UPDATEUSAGE ('database_name','table_name') WITH COUNT_ROWS
EDIT: Sorry, I did not read the part about counting with a certain clause. I agree with Cruachan: the solution to your problem is proper indexes.
The following page lists 4 methods of getting the number of rows in a table, with commentary on accuracy and speed.
http://blogs.msdn.com/b/martijnh/archive/2010/07/15/sql-server-how-to-quickly-retrieve-accurate-row-count-for-table.aspx
This is the one Management Studio uses:
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int)
AND p.index_id=idx.index_id
WHERE ((tbl.name=N'Transactions'
AND SCHEMA_NAME(tbl.schema_id)='dbo'))
Simply ensure that your table is correctly indexed for the WHERE condition.
If you're concerned over this sort of performance the approach is to create indexes which incorporate the field in question, for example if your table contains a primary key of foo, then fields bar, parrot and shrubbery and you know that you're going to need to pull back records regularly using a condition based on shrubbery that just needs data from this field you should set up a compound index of [shrubbery, foo]. This way the rdbms only has to query the index and not the table. Indexes, being tree structures, are far faster to query against than the table itself.
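For example (using the field names above; the table name is made up; SQL Server syntax):

CREATE INDEX IX_MyTable_shrubbery_foo ON dbo.MyTable (shrubbery, foo);
-- This count can now be answered from the index alone:
SELECT COUNT(*) FROM dbo.MyTable WHERE shrubbery = 'knee-high';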
How much actual activity the rdbms needs depends on the rdbms itself and precisely what information it puts into the index. For example, a SELECT COUNT(*) on an unindexed table with no WHERE condition will return instantly on some rdbms's, namely those that hold the record count at the table level, so no table scan is required. Analogous considerations may hold for index access.
Be aware that indexes do carry a maintenance overhead, in that if you update a field the rdbms has to update all indexes containing that field too. This may or may not be a critical consideration, but it's not uncommon to see tables where most activity is read and insert/update/delete activity is of lesser importance, and which are heavily indexed on various combinations of table fields so that most queries just use the indexes and never touch the actual table data itself.
ADDED: If you are using indexed access on a table that does have significant IUD activity, then just make sure you are scheduling regular maintenance. Tree structures, i.e. indexes, are most efficient when balanced, and with significant IUD activity periodic maintenance is needed to keep them that way.
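A sketch of the kind of scheduled maintenance meant here (SQL Server syntax; the table name is made up):

ALTER INDEX ALL ON dbo.MyTable REORGANIZE;  -- light defragmentation, stays online
ALTER INDEX ALL ON dbo.MyTable REBUILD;     -- heavier fix for badly fragmented indexes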