Redshift leader node using up 100% of disk

We have a 50-node Redshift cluster and we run VACUUM periodically. We are currently running a pipeline that moves some data to S3 and deletes it from Redshift.
After about 2 weeks of processing, disk usage on 49 nodes (all except the leader) came down from 95% to 80%, but disk usage on the leader went up and is now at 100%.
I tried rebooting the cluster in case transient files were holding the space, but that didn't help.
Any suggestions would be a great help at this point.
Thanks!

You might have some "skewed" tables, meaning tables whose data is not distributed evenly across the nodes. The following SQL lists your tables; based on the skew column, you might need to redistribute some of them.
select trim(pgn.nspname) as schema,
trim(a.name) as table, id as tableid,
decode(pgc.reldiststyle,0, 'even',1,det.distkey ,8,'all') as distkey, dist_ratio.ratio::decimal(10,4) as skew,
det.head_sort as "sortkey",
det.n_sortkeys as "#sks", b.mbytes,
decode(b.mbytes,0,0,((b.mbytes/part.total::decimal)*100)::decimal(5,2)) as pct_of_total,
decode(det.max_enc,0,'n','y') as enc, a.rows,
decode( det.n_sortkeys, 0, null, a.unsorted_rows ) as unsorted_rows ,
decode( det.n_sortkeys, 0, null, decode( a.rows,0,0, (a.unsorted_rows::decimal(32)/a.rows)*100) )::decimal(5,2) as pct_unsorted
from (select db_id, id, name, sum(rows) as rows,
sum(rows)-sum(sorted_rows) as unsorted_rows
from stv_tbl_perm a
group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
left outer join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
inner join (select attrelid,
min(case attisdistkey when 't' then attname else null end) as "distkey",
min(case attsortkeyord when 1 then attname else null end ) as head_sort ,
max(attsortkeyord) as n_sortkeys,
max(attencodingtype) as max_enc
from pg_attribute group by 1) as det
on det.attrelid = a.id
inner join ( select tbl, max(mbytes)::decimal(32)/min(mbytes) as ratio
from (select tbl, trim(name) as name, slice, count(*) as mbytes
from svv_diskusage group by tbl, name, slice )
group by tbl, name ) as dist_ratio on a.id = dist_ratio.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where mbytes is not null
order by mbytes desc
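
To make the skew column easier to read, here is a minimal sketch of the dist_ratio subquery on toy data. It uses Python's sqlite3 module rather than Redshift, and the diskusage table is a hypothetical stand-in for svv_diskusage (one row per 1 MB block, with only the tbl and slice columns). The ratio is the block count on the fullest slice divided by the block count on the emptiest slice that holds any blocks.

```python
import sqlite3

def skew_ratios(blocks):
    """Return (tbl, ratio) pairs, most skewed first.

    blocks: iterable of (tbl, slice) tuples, one tuple per stored block.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for svv_diskusage: one row per block.
    conn.execute("CREATE TABLE diskusage (tbl INTEGER, slice INTEGER)")
    conn.executemany("INSERT INTO diskusage VALUES (?, ?)", blocks)
    rows = conn.execute("""
        SELECT tbl, MAX(mbytes) * 1.0 / MIN(mbytes) AS ratio
        FROM (SELECT tbl, slice, COUNT(*) AS mbytes
              FROM diskusage
              GROUP BY tbl, slice)
        GROUP BY tbl
        ORDER BY ratio DESC
    """).fetchall()
    conn.close()
    return rows

# Table 100 is spread evenly over two slices; table 200 is not.
sample_blocks = (
    [(100, 0)] * 4 + [(100, 1)] * 4 +
    [(200, 0)] * 9 + [(200, 1)] * 1
)
print(skew_ratios(sample_blocks))
```

With this sample, table 200 gets a ratio of 9.0 and table 100 a ratio of 1.0. A slice that holds no blocks for a table does not appear in the inner query at all, so the ratio can understate the skew in that case.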

Related

Query runs slower when SELECT clause is used

I have a query in SQL Server with 6 JOINs and 1 LEFT JOIN to tables and views. It returns 16k records in about 1 second if the select clause is "SELECT *"
As soon as I specify even one column to display (SELECT ItemID, for example) the query slows down to about 70 seconds.
Query #1 (2s) - SELECT *:
SELECT *
FROM (SELECT LinkedToSet, LinkedToCopy, ',' + STRING_AGG(LocationID,',') + ',' Locs, count(1) OVER (PARTITION BY LinkedToSet) Copies
FROM Inventory.Locations WHERE LinkedToSet is not null AND (State & 4096)>0 GROUP BY LinkedToSet, LinkedToCopy) l
JOIN Bricklink_Set_Query bsq on l.LinkedToSet=bsq.Number
JOIN Bricklink.Set_Parts_Query bsp on l.LinkedToSet=bsp.SetNum AND bsp.Extra=0
JOIN Bricklink.Item_List i on bsp.ItemType=i.ItemType AND bsp.ItemID=i.Number
JOIN Bricklink.Category_List cat on i.Category_ID=cat.CatID
JOIN Bricklink.Color_List col on bsp.ColorID=col.ColorID
LEFT JOIN (SELECT LocationID, ItemType, ItemNum, ColorID, sum(QtyFound) as InvPcs
FROM Inventory.Item_History
GROUP BY LocationID, ItemType, ItemNum, ColorID) as h ON l.Locs like concat('%,',h.locationID,',%') AND h.ItemType=bsp.ItemType AND h.ItemNum=bsp.ItemID AND h.ColorID=bsp.ColorID
Actual Execution Plan: https://www.brentozar.com/pastetheplan/?id=SJD7Qemf_
Query #2 (81s) - SELECT a single column
SELECT bsp.ItemID
FROM (SELECT LinkedToSet, LinkedToCopy, ',' + STRING_AGG(LocationID,',') + ',' Locs, count(1) OVER (PARTITION BY LinkedToSet) Copies
FROM Inventory.Locations WHERE LinkedToSet is not null AND (State & 4096)>0 GROUP BY LinkedToSet, LinkedToCopy) l
JOIN Bricklink_Set_Query bsq on l.LinkedToSet=bsq.Number
JOIN Bricklink.Set_Parts_Query bsp on l.LinkedToSet=bsp.SetNum AND bsp.Extra=0
JOIN Bricklink.Item_List i on bsp.ItemType=i.ItemType AND bsp.ItemID=i.Number
JOIN Bricklink.Category_List cat on i.Category_ID=cat.CatID
JOIN Bricklink.Color_List col on bsp.ColorID=col.ColorID
LEFT JOIN (SELECT LocationID, ItemType, ItemNum, ColorID, sum(QtyFound) as InvPcs
FROM Inventory.Item_History
GROUP BY LocationID, ItemType, ItemNum, ColorID) as h ON l.Locs like concat('%,',h.locationID,',%') AND h.ItemType=bsp.ItemType AND h.ItemNum=bsp.ItemID AND h.ColorID=bsp.ColorID
Actual execution plan: https://www.brentozar.com/pastetheplan/?id=BJTr4x7Gu
The execution plans look totally different from each other and I'm not sure why. I've also tried wrapping the SELECT * and querying that, but some of these tables/views have the exact same field names, especially on the joins, so SQL Server throws an error:
This column 'foo' was specified multiple times.
How do I achieve the performance of SELECT * but limit which columns I display?
P.S. 2 Notes - 1) My desired select statement is obviously more complex than this and 2) Even using the full select statement, if I add a WHERE clause and restrict the query there, it runs in <1 second. If that plan would be useful I can post it as well.

How can I use outer join with subquery and groupby?

Tool : MySQL Workbench 6.3
Version : MySQL 5.7
SELECT *
FROM cars as a, battery_log as b
WHERE a.user_seq = 226 AND a.seq = b.car_seq
AND b.created = ( SELECT MAX(created) FROM battery_log WHERE car_seq = a.seq )
GROUP BY car_type
ORDER BY a.created DESC;
I want to turn this query into an outer join.
The query looks up cars by user_seq in the 'cars' table, and for each car I need the latest battery log value (cars to battery_log is a one-to-many relationship).
Sometimes the battery log has no row matching a car's seq, so that car is dropped when tables a and b are joined. How can I fix this?
SELECT a.*, b.battery
FROM cars as a
LEFT OUTER JOIN battery_log as b ON a.seq = b.car_seq
LEFT OUTER JOIN ( SELECT MAX(created) FROM battery_log WHERE a.seq = b.car_seq) as c
ON b.created = c.MAX(created)
WHERE a.user_seq = 226
GROUP BY car_type
ORDER BY a.created DESC
I tried to fix it this way, but I got the following error:
Error Code: 1054, Unknown column 'a.seq' in 'where clause'
I solved this problem like this.
SELECT *
FROM cars as a
LEFT OUTER JOIN battery_log as b ON a.seq = b.car_seq
AND b.created = (SELECT MAX(created) FROM battery_log WHERE car_seq = b.car_seq)
WHERE a.user_seq = 226
GROUP BY car_type
ORDER BY a.created DESC;
The extra condition is added with AND inside the LEFT OUTER JOIN ... ON clause instead of in WHERE, so cars without a matching battery log are still returned.
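
A minimal sketch of the same pattern on toy data, using Python's sqlite3 module instead of MySQL. The column types and sample rows are assumptions for illustration; only the table and column names come from the question. The GROUP BY car_type is left out so that every car shows up.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with assumed schemas and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cars (seq INTEGER PRIMARY KEY, user_seq INTEGER, car_type TEXT);
        CREATE TABLE battery_log (car_seq INTEGER, battery INTEGER, created TEXT);
        INSERT INTO cars VALUES (1, 226, 'sedan'), (2, 226, 'truck'), (3, 999, 'van');
        -- Car 1 has two log rows, car 2 has none.
        INSERT INTO battery_log VALUES
            (1, 80, '2021-01-01 10:00:00'),
            (1, 65, '2021-01-02 10:00:00');
    """)
    return conn

def latest_battery_per_car(conn, user_seq):
    """Return (car seq, latest battery or None) for one user's cars."""
    return conn.execute("""
        SELECT a.seq, b.battery
        FROM cars AS a
        LEFT OUTER JOIN battery_log AS b
          ON a.seq = b.car_seq
         -- The "latest row" filter lives in ON, not WHERE,
         -- so cars with no log rows survive the join.
         AND b.created = (SELECT MAX(created)
                          FROM battery_log
                          WHERE car_seq = a.seq)
        WHERE a.user_seq = ?
        ORDER BY a.seq
    """, (user_seq,)).fetchall()

print(latest_battery_per_car(build_demo_db(), 226))
```

Car 1 comes back with its most recent reading (65) and car 2 comes back with a NULL battery. Moving the MAX(created) condition into WHERE would filter car 2 out again.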

How to cross join tables from multiple servers?

I'm working on a project to create a table that pulls information from my local server and 2 online servers. Both online servers are linked to my local server. I only have read access to the online servers, and the data is too large for me to duplicate.
I built some code that works in Management Studio. However, when I put that code into SSRS, I get a message saying that one of my tables already exists. I tried putting a DROP TABLE statement in front of it, but then I got the same message for the next table down the line. And if I add a DROP statement for every table, I get a "Timeout expired" error when refreshing the fields.
SELECT s.SiteID, s.[StoreName], cf.CustomerID, cf.AccountNumber, cf.AccountStatus,
cf.Store_ID, cf.InstitutionID, cf.TransactionTime, cf.Comment
INTO #Report_Table1
FROM dbo.View_GetCustomerInfo cf
LEFT JOIN dbo.Store_Table s ON cf.Store_ID = s.Store_ID
;
SELECT t.*, cl.SaleAmount
INTO #Report_Table2
FROM #Report_Table1 t
LEFT JOIN OnlineServe01.Views.dbo.SaleUpdate cl
ON t.CustomerID = cl.CustomerID AND t.Store_ID = cl.Store_ID AND [Status] = 'A'
;
SELECT InstitutionID, Source_ID, BankName
INTO #Report_BankName
FROM OnlineServe01.Views.dbo.InstitutionInfo bn
WHERE InstitutionID IN (
SELECT InstitutionID FROM #Report_Table2)
;
SELECT df.*, bn.BankName
INTO #Report_Table3
FROM #Report_Table2 df
LEFT JOIN #Report_BankName bn ON df.InstitutionID = bn.InstitutionID AND df.Store_ID = bn.Store_ID
;
SELECT StoreName, SiteID, CustomerID, SaleAmount
, BankName, AccountNumber, AccountStatus, TransactionTime, Comment
INTO #Report_Table4
FROM #Report_Table3 t
;
SELECT *
INTO #Report_PlayerName
FROM (
SELECT DISTINCT CustomerID, FirstName, LastName,
Dense_Rank () OVER (Partition by CustomerID ORDER BY FirstName) AS Rnk
FROM OnlineServe02.CustomerManagement.dbo.CustomerName
WHERE PreferredName = 0
AND CustomerID IN (SELECT DISTINCT CustomerID FROM #Report_Table4)
) a
WHERE Rnk = 1
;
SELECT t.*, pn.LastName, pn.FirstName, ca.Deposited, ca.Used, ca.InTransit, ca.Available
FROM #Report_Table4 t
LEFT JOIN OnlineServe02.CustomerManagement.dbo.AccountActivity ca
ON t.CustomerID = ca.CustomerID AND t.SiteID = ca.SiteID
LEFT JOIN #Report_PlayerName pn ON t.CustomerID = pn.CustomerID
;

Identify Inter-Account Transfers in SQL

I have a bunch of bank transactions in a table in SQL.
Example: http://sqlfiddle.com/#!6/6b2c8/1/0
I need to identify the transactions that are made between these 2 linked accounts. The Accounts table (not shown) links these 2 accounts to the one source (user).
For example:
I have an everyday account, and a savings account. From time to time, I may transfer money from my everyday account, to my savings account (or vice-versa).
The transaction descriptions are usually similar (Transfer to xxx/transfer from xxx), usually on the same day, and obviously, the same dollar amount.
EDIT: I now have the following query (dumbed down), which works for some scenarios
Basically, I created 2 temp tables with all withdrawals and deposits that met certain criteria. I then join them together, based on a few requirements (same transaction amount, different account # etc). Then using the ROW_NUMBER function, I have ordered which ones are more likely to be inter-account transactions.
I now have an issue where if, for example:
$100 transferred from Account A to Account B
$100 Transferred from Account B to Account C
My query will match the transfer between Account A and C, then there is only one transaction for account B, and it will not be matched. So essentially, instead of receiving 2 rows back (2 deposits, lined up with 2 withdrawals), I only get 1 row (1 deposit, 1 withdrawal), for a transfer from A to B :(
INSERT INTO #Deposits
SELECT t.*
FROM dbo.Customer c
INNER JOIN dbo.Source src ON src.AppID = app.AppID
INNER JOIN dbo.Account acc ON acc.SourceID = src.SourceID
INNER JOIN dbo.Tran t ON t.AccountID = acc.AccountID
WHERE c.CustomerID = 123
AND t.Template = 'DEPOSIT'
INSERT INTO #Withdrawals
SELECT t.*
FROM dbo.Customer c
INNER JOIN dbo.Source src ON src.AppID = app.AppID
INNER JOIN dbo.Account acc ON acc.SourceID = src.SourceID
INNER JOIN dbo.Tran t ON t.AccountID = acc.AccountID
WHERE c.CustomerID = 123
AND t.Template = 'WITHDRAWAL'
;WITH cte
AS ( SELECT [...] ,
ROW_NUMBER() OVER ( PARTITION BY d.TranID ORDER BY SUM( CASE WHEN w.TranDate = d.TranDate THEN 2 ELSE 1 END), w.TranID ) AS DepRN,
ROW_NUMBER() OVER ( PARTITION BY w.TranID ORDER BY SUM( CASE WHEN w.TranDate = d.TranDate THEN 2 ELSE 1 END ), d.TranID ) AS WdlRN
FROM #Withdrawals w
INNER JOIN #Deposits d ON w.TranAmount = d.TranAmount -- Same transaction amount
AND w.AccountID <> d.AccountID -- Different accounts, same customer
AND w.TranDate BETWEEN d.TranDate AND DATEADD(DAY, 3, d.TranDate) -- Same day, or within 3 days
GROUP BY [...]
)
SELECT *
FROM cte
WHERE cte.DepRN = cte.WdlRN
Maybe this is a start? I don't think we have enough info to say whether this would be reliable or would cause a lot of "false positives".
select t1.TransactionID, t2.TransactionID
from dbo.Transactions as t1 inner join dbo.Transactions as t2
on t2.AccountID <> t1.AccountID -- the two sides of a transfer sit in different accounts
and t2.TransactionDate = t1.TransactionDate
and t2.TransactionAmount = t1.TransactionAmount
and t2.TransactionID - t1.TransactionID between 1 and 20 -- maybe??
and t1.TransactionDesc like 'Transfer from%'
and t2.TransactionDesc like 'Transfer to%'
and t2.TransactionID > t1.TransactionID
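
The self-join idea in this answer can be tried on toy data. The sketch below uses Python's sqlite3 module instead of SQL Server and makes several assumptions that are not confirmed by the question: the column types, the sample rows, amounts stored as positive numbers on both sides, and the two legs of a transfer sitting in different accounts. The TransactionID gap heuristic is left out to keep it short.

```python
import sqlite3

def build_demo_db():
    """In-memory Transactions table with assumed types and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Transactions (
            TransactionID INTEGER PRIMARY KEY,
            AccountID TEXT,
            TransactionDate TEXT,
            TransactionAmount REAL,
            TransactionDesc TEXT
        );
        INSERT INTO Transactions VALUES
            (1, 'A', '2021-03-01', 100, 'Transfer from savings'),
            (2, 'B', '2021-03-01', 100, 'Transfer to everyday'),
            (3, 'A', '2021-03-01', 100, 'Grocery store'),
            (4, 'B', '2021-03-02', 100, 'Transfer to everyday');
    """)
    return conn

def find_transfer_pairs(conn):
    """Pair 'Transfer from' rows with 'Transfer to' rows that share a
    date and amount but belong to different accounts."""
    return conn.execute("""
        SELECT t1.TransactionID, t2.TransactionID
        FROM Transactions AS t1
        INNER JOIN Transactions AS t2
           ON t2.AccountID <> t1.AccountID
          AND t2.TransactionDate = t1.TransactionDate
          AND t2.TransactionAmount = t1.TransactionAmount
          AND t1.TransactionDesc LIKE 'Transfer from%'
          AND t2.TransactionDesc LIKE 'Transfer to%'
        ORDER BY t1.TransactionID, t2.TransactionID
    """).fetchall()

print(find_transfer_pairs(build_demo_db()))
```

Only transactions 1 and 2 are paired here: transaction 3 has the wrong description and transaction 4 falls on a different date. Nothing in this sketch stops one row from being paired with several candidates, so the "false positives" concern still applies.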

Finding difference between 2 tables in MS Access or SQL Server

I have 2 Excel files which I imported into MS Access as two tables. These two tables are identical but imported on different dates.
Now, how can I find out what rows and what fields are updated on the later date? Any help would be highly appreciated.
Finding Inserted records is easy
select * from B where not exists (select 1 from A where A.pk=B.pk)
Finding Deleted records is just as easy
select * from A where not exists (select 1 from B where A.pk=B.pk)
Finding Updated records is a pain. The following rigorous query assumes you have nullable columns and it should work in all situations.
select B.*
from B
inner join A on B.pk=A.pk
where A.col1<>B.col1 or (IsNull(A.col1) and not IsNull(B.col1)) or (not IsNull(A.col1) and IsNull(B.col1))
or A.col2<>B.col2 or (IsNull(A.col2) and not IsNull(B.col2)) or (not IsNull(A.col2) and IsNull(B.col2))
or A.col3<>B.col3 or (IsNull(A.col3) and not IsNull(B.col3)) or (not IsNull(A.col3) and IsNull(B.col3))
etc...
If the columns are defined as NOT NULL then the query is much simpler, just remove all the NULL tests.
If the columns are nullable but you can identify a value that will never appear in the data, then use a simple comparison like:
Nz(A.col1,neverAppearingValue)<>Nz(B.col1,neverAppearingValue)
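
For a quick way to try the three cases on toy data, here is a sketch using Python's sqlite3 module. The schemas and rows are made up for illustration. SQLite's IS NOT operator compares values with NULLs treated as equal to each other, so it stands in for the explicit IsNull tests; that operator is a SQLite feature, not Access syntax.

```python
import sqlite3

def build_demo_db():
    """Tables A (earlier import) and B (later import), assumed schemas."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE A (pk INTEGER PRIMARY KEY, col1 TEXT, col2 INTEGER);
        CREATE TABLE B (pk INTEGER PRIMARY KEY, col1 TEXT, col2 INTEGER);
        INSERT INTO A VALUES (1, 'x', 1), (2, 'y', NULL), (3, 'z', 3), (4, NULL, 4);
        INSERT INTO B VALUES (1, 'x', 1), (2, 'y', 2), (4, NULL, 4), (5, 'w', 5);
    """)
    return conn

def diff_tables(conn):
    """Return (inserted, deleted, updated) lists of primary keys."""
    inserted = [r[0] for r in conn.execute(
        "SELECT pk FROM B WHERE NOT EXISTS "
        "(SELECT 1 FROM A WHERE A.pk = B.pk) ORDER BY pk")]
    deleted = [r[0] for r in conn.execute(
        "SELECT pk FROM A WHERE NOT EXISTS "
        "(SELECT 1 FROM B WHERE A.pk = B.pk) ORDER BY pk")]
    # IS NOT is true when the values differ, including NULL vs non-NULL,
    # and false when both sides are NULL.
    updated = [r[0] for r in conn.execute(
        "SELECT B.pk FROM B INNER JOIN A ON B.pk = A.pk "
        "WHERE A.col1 IS NOT B.col1 OR A.col2 IS NOT B.col2 "
        "ORDER BY B.pk")]
    return inserted, deleted, updated

print(diff_tables(build_demo_db()))
```

Row 5 is reported as inserted, row 3 as deleted, and row 2 as updated because col2 went from NULL to 2. Row 4, which has NULL in col1 on both sides, is correctly left out of the updated list.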
I believe this should be as simple as running a query like this:
SELECT *
FROM Table1
JOIN Table2
ON Table1.ID = Table2.ID AND Table1.Date != Table2.Date
One way to do this is by unpivoting both tables, so you get a new table with id, col, val. Note, though, that you have to take types into account.
For example, the following gets differences in fields:
with oldt as (select id, col, val
from <old table> t
unpivot (val for col in (<column list>)) unpvt
),
newt as (select id, col, val
from <new table> t
unpivot (val for col in (<column list>)) unpvt
)
select *
from oldt full outer join newt on oldt.id = newt.id and oldt.col = newt.col
where oldt.id is null or newt.id is null or oldt.val <> newt.val
The alternative way with a join is rather cumbersome. This version shows whether rows were added or deleted, and which columns changed, if any:
select *
from (select coalesce(oldt.id, newt.id) as id,
(case when oldt.id is null and newt.id is not null then 'ADDED'
when oldt.id is not null and newt.id is null then 'DELETED'
else 'SAME'
end) as stat,
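(case when oldt.col1 <> newt.col1 or (oldt.col1 is null and newt.col1 is not null) or (oldt.col1 is not null and newt.col1 is null)
then 1 else 0 end) as diff_col1,
(case when oldt.col2 <> newt.col2 or (oldt.col2 is null and newt.col2 is not null) or (oldt.col2 is not null and newt.col2 is null)
then 1 else 0 end) as diff_col2,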
...
from <old table> oldt full outer join <new table> newt on oldt.id = newt.id
) c
where stat in ('ADDED', 'DELETED') or
(diff_col1 + diff_col2 + ... ) > 0
It does have the advantage of working for any data types.
(Select * from OldTable Except Select * from NewTable)
Union All
(Select * from NewTable Except Select * from OldTable)
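
A small sketch of this EXCEPT / UNION ALL approach on toy data, using Python's sqlite3 module. The schemas and rows are assumptions for illustration. SQLite does not accept parenthesized compound selects, so each EXCEPT is wrapped in a subquery, and a side label (not part of the query above) is added to show which table each row came from.

```python
import sqlite3

def build_demo_db():
    """OldTable and NewTable with an assumed (id, name) schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE OldTable (id INTEGER, name TEXT);
        CREATE TABLE NewTable (id INTEGER, name TEXT);
        INSERT INTO OldTable VALUES (1, 'a'), (2, 'b'), (3, 'c');
        INSERT INTO NewTable VALUES (1, 'a'), (2, 'B'), (4, 'd');
    """)
    return conn

def symmetric_difference(conn):
    """Rows that appear in exactly one of the two tables, with a label."""
    rows = conn.execute("""
        SELECT 'old only' AS side, *
        FROM (SELECT * FROM OldTable EXCEPT SELECT * FROM NewTable)
        UNION ALL
        SELECT 'new only' AS side, *
        FROM (SELECT * FROM NewTable EXCEPT SELECT * FROM OldTable)
    """).fetchall()
    # Sort in Python so the output order does not depend on the engine.
    return sorted(rows)

for row in symmetric_difference(build_demo_db()):
    print(row)
```

An updated row (id 2 here) shows up twice, once with its old values and once with its new ones, while added and deleted rows show up once. This approach compares whole rows, so it tells you which rows differ but not which fields.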
