SELECT DISTINCT "myapp_profile"."user_id", "myapp_profile"."name",
"myapp_profile"."age", "auth_user"."id", "auth_user"."username",
"auth_user"."first_name", "auth_user"."last_name", "auth_user"."email",
"auth_user"."password", "auth_user"."is_staff", "auth_user"."is_active",
"auth_user"."is_superuser", "auth_user"."last_login", "auth_user"."date_joined"
FROM "myapp_profile"
INNER JOIN "auth_user" ON ("myapp_profile"."user_id" = "auth_user"."id")
LEFT OUTER JOIN "myapp_siterel" ON ("myapp_profile"."user_id" = "myapp_siterel"."profile_id")
LEFT OUTER JOIN "django_site" ON ("myapp_siterel"."site_id" = "django_site"."id")
WHERE ("auth_user"."is_superuser" = false
AND "auth_user"."is_staff" = false
AND ("django_site"."id" IS NULL OR "django_site"."id" IN (15, 16)))
ORDER BY "myapp_profile"."user_id"
DESC LIMIT 100
The above query takes about 100 seconds to run with 2 million users/profiles. I'm no DBA, and our DBAs are looking at the situation to see what can be done, but since I'll likely never get to see what changes (assuming it happens at the DB level), I'm curious how you would optimize this query. It obviously needs to run a lot faster than it does now, on the order of 5 seconds or less. If there is no way to optimize the SQL, is there an index or set of indexes you could add or change to make the query quicker, or is there something else I'm overlooking?
Postgres 9 is the DB, and Django's ORM is where this query came from.
Query Plan
Limit (cost=1374.35..1383.10 rows=100 width=106)
  -> Unique (cost=1374.35..1391.24 rows=193 width=106)
    -> Sort (cost=1374.35..1374.83 rows=193 width=106)
         Sort Key: myapp_profile.user_id, myapp_profile.name, myapp_profile.age, auth_user.username, auth_user.first_name, auth_user.last_name, auth_user.email, auth_user.password, auth_user.is_staff, auth_user.is_active, auth_user.is_superuser, auth_user.last_login, auth_user.date_joined
      -> Nested Loop (cost=453.99..1367.02 rows=193 width=106)
        -> Hash Left Join (cost=453.99..1302.53 rows=193 width=49)
             Hash Cond: (myapp_siterel.site_id = django_site.id)
             Filter: ((django_site.id IS NULL) OR (django_site.id = ANY ('{10080,10053}'::integer[])))
          -> Hash Left Join (cost=448.50..1053.27 rows=15001 width=53)
               Hash Cond: (myapp_profile.user_id = myapp_siterel.profile_id)
            -> Seq Scan on myapp_profile (cost=0.00..286.01 rows=15001 width=49)
            -> Hash (cost=261.00..261.00 rows=15000 width=8)
              -> Seq Scan on myapp_siterel (cost=0.00..261.00 rows=15000 width=8)
          -> Hash (cost=3.55..3.55 rows=155 width=4)
            -> Seq Scan on django_site (cost=0.00..3.55 rows=155 width=4)
        -> Index Scan using auth_user_pkey on auth_user (cost=0.00..0.32 rows=1 width=57)
             Index Cond: (auth_user.id = myapp_profile.user_id)
             Filter: ((NOT auth_user.is_superuser) AND (NOT auth_user.is_staff))
Thanks
I'm not that familiar with Postgres, so I'm not sure how good its query optimiser is, but it looks like everything you have in the WHERE clause could instead be join conditions. I'd hope Postgres is clever enough to work that out for itself; if it's not, it's going to fetch all 2 million users with their related records in the other 3 tables and only then filter them with your WHERE.
The indexes already mentioned should also work for you if they don't already exist. Again, I'm more of an MSSQL person, but does Postgres not have a statistics profile or query plan you can look at?
Something along these lines
SELECT DISTINCT
"myapp_profile"."user_id",
"myapp_profile"."name",
"myapp_profile"."age",
"auth_user"."id",
"auth_user"."username",
"auth_user"."first_name",
"auth_user"."last_name",
"auth_user"."email",
"auth_user"."password",
"auth_user"."is_staff",
"auth_user"."is_active",
"auth_user"."is_superuser",
"auth_user"."last_login",
"auth_user"."date_joined"
FROM "myapp_profile"
INNER JOIN "auth_user"
ON ("myapp_profile"."user_id" = "auth_user"."id")
AND "auth_user"."is_superuser" = false
AND "auth_user"."is_staff" = false
LEFT OUTER JOIN "myapp_siterel"
ON ("myapp_profile"."user_id" = "myapp_siterel"."profile_id")
LEFT OUTER JOIN "django_site"
ON ("myapp_siterel"."site_id" = "django_site"."id")
AND ("django_site"."id" IS NULL OR "django_site"."id" IN (15, 16))
ORDER BY "myapp_profile"."user_id" DESC
LIMIT 100
Also, do you need the distinct? That'll also slow it down somewhat.
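One caveat about the rewrite above: moving a filter from WHERE into the ON clause of an INNER JOIN does not change the result, but doing the same for a LEFT OUTER JOIN does, because unmatched rows are then kept rather than filtered out. A sketch that keeps the original filtering and drops the DISTINCT is to test the site relation with EXISTS instead of joining it. It assumes that myapp_profile.user_id is unique and that myapp_siterel.site_id is a non-null foreign key to django_site (so the join to django_site adds nothing); neither assumption is confirmed by the question.

```sql
-- Sketch only. Assumes myapp_profile.user_id is unique and
-- myapp_siterel.site_id is a NOT NULL foreign key to django_site.
SELECT p.user_id, p.name, p.age,
       u.id, u.username, u.first_name, u.last_name, u.email, u.password,
       u.is_staff, u.is_active, u.is_superuser, u.last_login, u.date_joined
FROM myapp_profile AS p
INNER JOIN auth_user AS u ON u.id = p.user_id
WHERE u.is_superuser = false
  AND u.is_staff = false
  AND (
        -- the profile has no site relation at all ...
        NOT EXISTS (SELECT 1 FROM myapp_siterel AS r
                    WHERE r.profile_id = p.user_id)
        -- ... or it is related to one of the wanted sites
        OR EXISTS (SELECT 1 FROM myapp_siterel AS r
                   WHERE r.profile_id = p.user_id
                     AND r.site_id IN (15, 16))
      )
ORDER BY p.user_id DESC
LIMIT 100;
```

Because the joins no longer multiply rows, the planner may be able to read myapp_profile in user_id order and stop after 100 matches instead of sorting and de-duplicating the whole result. Whether it actually does is something only EXPLAIN ANALYZE on the real data can confirm.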
For basics:
Make sure all the user ID fields are indexed.
It also looks like you would do well with an index on is_superuser and is_staff.
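As a sketch of what that could look like in Postgres: the index names below are made up, and primary-key columns such as auth_user.id are already indexed (the plan uses auth_user_pkey). A boolean column on its own is rarely selective enough to justify a plain index, so the example uses a partial index for the two flags and a composite index for the relation table. Treat both as candidates to test, not as a guaranteed fix.

```sql
-- Hypothetical index names; check which indexes already exist first.

-- Lets the join from a profile to its site relations be answered from the index.
CREATE INDEX CONCURRENTLY myapp_siterel_profile_site_idx
    ON myapp_siterel (profile_id, site_id);

-- Partial index covering only regular users (neither staff nor superuser).
CREATE INDEX CONCURRENTLY auth_user_regular_idx
    ON auth_user (id)
    WHERE is_superuser = false AND is_staff = false;

-- Refresh planner statistics so the new indexes are costed sensibly.
ANALYZE myapp_siterel;
ANALYZE auth_user;
```

CREATE INDEX CONCURRENTLY avoids blocking writes while the index is built, but it cannot run inside a transaction block.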
There's never a straightforward, silver-bullet solution for query optimization. However, the obvious first step is to index the columns you're searching on. In your case, those are:
"auth_user"."is_superuser"
"auth_user"."is_staff"
"django_site"."id"
"myapp_profile"."user_id"
Related
I have a simple DB table with ONLY 5 columns and no primary key, holding 7 billion+ (7,50,01,771) rows. Yes, you read that correctly. It has one clustered index.
DB table columns
Cluster index
If I write a simple SELECT query against it, it takes 7-8 minutes to return data. So my next question is: what techniques can I apply to this table so that I get data back in a reasonable time?
In the actual scenario, this table is joined with 2 temp tables that have already been filtered with WHERE clauses. Please find my query below for reference.
SELECT dt.ZipFrom, dt.ZipTo, dt.Total_time,
       sz.storelocation, sz.AcctShip, sz.Licensee, sz.Entity
FROM #Zips z
INNER JOIN DriveTime_ZIPtoZIP dt ON zipFrom = z.zip
INNER JOIN #storeZips sz ON ZipTo = sz.zip
ORDER BY z.zip DESC, total_time ASC
Thanks
You can index according to the WHERE conditions in the query. However, this comes at a cost: storage.
The ORDER BY clause is also important. If you have to use ORDER BY in your query, you can index accordingly.
But do not forget the cost of indexing ...
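As a rough sketch for the query in the question, assuming SQL Server (the #temp tables suggest it) and assuming the existing clustered index does not already lead with ZipFrom and ZipTo, a covering non-clustered index on the two join columns could look like this. The index name is hypothetical.

```sql
-- Hypothetical index name; column names are taken from the question's query.
CREATE NONCLUSTERED INDEX IX_DriveTime_ZipFrom_ZipTo
    ON DriveTime_ZIPtoZIP (ZipFrom, ZipTo)
    INCLUDE (Total_time);  -- lets the query be answered from the index alone
```

On a table of this size the index will take noticeable space and time to build, which is the storage cost mentioned above.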
I have a non-clustered columnstore index on all columns of a 40M-row, non-memory-optimized table on SQL Server 2016 Enterprise Edition.
A query forced to use the columnstore index performs significantly faster, but the optimizer continues to choose the clustered index and other non-clustered indexes. I have lots of available RAM and am using appropriate queries against a dimensional model.
Why won't the optimizer choose the columnstore index? And how can I encourage its use (without using a hint)?
Here is a sample query not using columnstore:
SELECT
COUNT(*),
SUM(TradeTurnover),
SUM(TradeVolume)
FROM DWH.FactEquityTrade e
--with (INDEX(FactEquityTradeNonClusteredColumnStoreIndex))
JOIN DWH.DimDate d
ON e.TradeDateId = d.DateId
JOIN DWH.DimInstrument i
ON i.instrumentid = e.instrumentid
WHERE d.DateId >= 20160201
AND i.instrumentid = 2
It takes 7 seconds without hint and a fraction of a second with the hint.
The query plan without the hint is here.
The query plan with the hint is here.
The create statement for the columnstore index is:
CREATE NONCLUSTERED COLUMNSTORE INDEX [FactEquityTradeNonClusteredColumnStoreIndex] ON [DWH].[FactEquityTrade]
(
[EquityTradeID],
[InstrumentID],
[TradingSysTransNo],
[TradeDateID],
[TradeTimeID],
[TradeTimestamp],
[UTCTradeTimeStamp],
[PublishDateID],
[PublishTimeID],
[PublishedDateTime],
[UTCPublishedDateTime],
[DelayedTradeYN],
[EquityTradeJunkID],
[BrokerID],
[TraderID],
[CurrencyID],
[TradePrice],
[BidPrice],
[OfferPrice],
[TradeVolume],
[TradeTurnover],
[TradeModificationTypeID],
[InColumnStore],
[TradeFileID],
[BatchID],
[CancelBatchID]
)
WHERE ([InColumnStore]=(1))
WITH (DROP_EXISTING = OFF, COMPRESSION_DELAY = 0) ON [PRIMARY]
GO
Update. Plan using Count(EquityTradeID) instead of Count(*)
and with hint included
You're asking SQL Server to choose a complicated query plan over a simple one. Note that when using the hint, SQL Server has to concatenate the columnstore index with a rowstore non-clustered index (IX_FactEquiteTradeInColumnStore). When using just the rowstore index, it can do a seek (I assume TradeDateId is the leading column on that index). It does still have to do a key lookup, but it's simpler.
I can see two options to get this behavior without a hint:
First, remove InColumnStore from the columnstore index definition and cover the entire table. That's what you're asking from the columnstore - to cover everything.
If that's not possible, you can use a UNION ALL to explicitly split the data:
WITH workaround
AS (
SELECT TradeDateId
, instrumentid
, TradeTurnover
, TradeVolume
FROM DWH.FactEquityTrade
WHERE InColumnStore = 1
UNION ALL
SELECT TradeDateId
, instrumentid
, TradeTurnover
, TradeVolume
FROM DWH.FactEquityTrade
WHERE InColumnStore = 0 -- Assuming this is a non-nullable BIT
)
SELECT COUNT(*)
, SUM(TradeTurnover)
, SUM(TradeVolume)
FROM workaround e
JOIN DWH.DimDate d
ON e.TradeDateId = d.DateId
JOIN DWH.DimInstrument i
ON i.instrumentid = e.instrumentid
WHERE d.DateId >= 20160201
AND i.instrumentid = 2;
Your index is a filtered index (it has a WHERE predicate).
The optimizer uses such an index only when the query's WHERE matches the index's WHERE. This is true for classic indexes and most likely true for columnstore indexes too. There can be other limitations that stop the optimizer from using a filtered index.
So, either add WHERE ([InColumnStore]=(1)) to your query, or remove it from the index definition.
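A minimal sketch of the first option, using the query from the question with the index's predicate repeated in the WHERE clause. It returns the same numbers as the original query only if every relevant row really has InColumnStore = 1, which is an assumption here.

```sql
SELECT COUNT(*),
       SUM(e.TradeTurnover),
       SUM(e.TradeVolume)
FROM DWH.FactEquityTrade e
JOIN DWH.DimDate d
    ON e.TradeDateId = d.DateId
JOIN DWH.DimInstrument i
    ON i.instrumentid = e.instrumentid
WHERE d.DateId >= 20160201
  AND i.instrumentid = 2
  AND e.InColumnStore = 1;  -- same predicate as the filtered columnstore index
```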
You said in the comments: "the InColumnStore filter is for efficiency when loading data. For all tests so far the filter covers 100% of all rows". Does "all rows" here mean "all rows of the whole table" or just "all rows of the result set"? Either way, the optimizer most likely doesn't know that (even though it could have derived it from statistics), which means that a plan using such an index has to do extra checks/lookups explicitly, and the optimizer considers those too expensive.
Here are a few articles on this topic:
Why isn’t my filtered index being used? by Rob Farley.
Optimizer Limitations with Filtered Indexes by Paul White.
An Unexpected Side-Effect of Adding a Filtered Index by Paul White.
How filtered indexes could be a more powerful feature by Aaron Bertrand, see the section Optimizer Limitations.
Try this one. Bridge your query through a temp table:
SELECT *
INTO #DimDate
FROM DWH.DimDate
WHERE DateId >= 20160201
SELECT COUNT(1), SUM(TradeTurnover), SUM(TradeVolume)
FROM DWH.FactEquityTrade e
INNER JOIN DWH.DimInstrument i ON i.instrumentid = e.instrumentid
AND i.instrumentid = 2
-- INNER JOIN (not LEFT JOIN) so the date filter in #DimDate still restricts the fact rows
INNER JOIN #DimDate d ON e.TradeDateId = d.DateId
How fast does this query run?
I need help or any hint. I have a Postgres 9.4 DB and one query that is SOMETIMES processed very slowly.
SELECT COUNT(*)
FROM "table_a"
INNER JOIN "table_b"
  ON "table_b"."id" = "table_a"."table_b_id"
  AND "table_b"."deleted_at" IS NULL
WHERE "table_a"."deleted_at" IS NULL
  AND "table_b"."company_id" = ?
  AND "table_a"."company_id" = ?
Query plan for this -
Aggregate (cost=308160.70..308160.71 rows=1 width=0)
  -> Hash Join (cost=284954.16..308160.65 rows=20 width=0)
       Hash Cond: ?
    -> Bitmap Heap Scan on table_a (cost=276092.39..299260.96 rows=6035 width=4)
         Recheck Cond: ?
         Filter: ?
      -> Bitmap Index Scan on index_table_a_on_created_at_and_company_id (cost=0.00..276090.89 rows=6751 width=0)
           Index Cond: ?
    -> Hash (cost=8821.52..8821.52 rows=3220 width=4)
      -> Bitmap Heap Scan on table_b (cost=106.04..8821.52 rows=3220 width=4)
           Recheck Cond: ?
           Filter: ?
        -> Bitmap Index Scan on index_table_b_on_company_id (cost=0.00..105.23 rows=3308 width=0)
             Index Cond: ?
But usually this query executes fast enough (about 69.7 ms). I don't understand why it is sometimes slow. In the performance logs for that period I saw that my RDS instance was consuming a lot of memory and that the rate of these queries reached about 100 per second. Any help, please: where should I look to solve this problem?
I am not sure if this will solve your problem or not :)
When this query returns very quickly, it is most likely being served from cache: PostgreSQL caches table and index pages (not query results), so the query still runs but does not have to read from disk.
First of all, check whether too many queries are being executed on these tables, especially inserts/updates/deletes. These kinds of queries take locks, and a SELECT may have to wait until a lock is released.
The query can also be slow because the join and WHERE comparisons between table_a and table_b are too expensive.
You can reduce the cost by adding indexes on the columns "table_b"."id", "table_a"."table_b_id", "table_a"."deleted_at", "table_b"."company_id", and "table_a"."company_id".
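A hedged sketch of such indexes: the plan in the question spends almost all of its estimated cost in the bitmap scan on index_table_a_on_created_at_and_company_id, whose name suggests that company_id is not its leading column. Partial indexes that lead with company_id and cover only non-deleted rows are one option. The index names are hypothetical, and whether they help should be checked with EXPLAIN (ANALYZE, BUFFERS) on the real data.

```sql
-- Hypothetical index names. Only rows with deleted_at IS NULL are indexed,
-- which matches the filter the query applies to both tables.
CREATE INDEX CONCURRENTLY index_table_a_live_on_company_id
    ON table_a (company_id, table_b_id)
    WHERE deleted_at IS NULL;

CREATE INDEX CONCURRENTLY index_table_b_live_on_company_id
    ON table_b (company_id, id)
    WHERE deleted_at IS NULL;

-- Refresh planner statistics after building the indexes.
ANALYZE table_a;
ANALYZE table_b;
```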
A materialized view can reduce the cost as well, since it stores its result (a plain view does not cache anything and is re-executed on every use).
One last thing: you can reduce the cost by using temporary tables as well. I have given an example below.
QUERIES:
-- Live rows of table_a for the company
CREATE TEMPORARY TABLE table_a_temp AS
SELECT "table_a"."table_b_id" FROM "table_a"
WHERE "table_a"."deleted_at" IS NULL AND "table_a"."company_id" = ?;
-- Live rows of table_b for the company
CREATE TEMPORARY TABLE table_b_temp AS
SELECT "table_b"."id" FROM "table_b"
WHERE "table_b"."deleted_at" IS NULL AND "table_b"."company_id" = ?;
-- Count the matching pairs
SELECT COUNT(*) FROM "table_a_temp" INNER JOIN "table_b_temp"
ON "table_b_temp"."id" = "table_a_temp"."table_b_id";
Is there any way to avoid doing two INNER JOINs to the same table in this case?
SELECT B.CostCatCd As CostCatCd,
F.CountryDesc AS SenderCountry,
B.SenderCompanycd AS SenderCompanyCd,
D.CountryDesc As ReceivingCountry,
B.BillCompanycd AS ReceivingCompanyCd,
SUM(B.BillAmt) as Amount
FROM Bill B
INNER JOIN Company C
ON B.FY = C.FY
AND B.CycleCd = C.CycleCd
AND B.BillCompanyCd = C.CompanyCd
INNER JOIN Country D
ON B.FY = D.FY
AND B.CycleCd = D.CycleCd
AND C.CountryCd = D.CountryCd
INNER JOIN Company E
ON B.FY = E.FY
AND B.CycleCd = E.CycleCd
AND B.SenderCompanyCd = E.CompanyCd
INNER JOIN Country F
ON B.FY = F.FY
AND B.CycleCd = F.CycleCd
AND E.CountryCd = F.CountryCd
I'm trying to improve the performance of an SP, and maybe this is something that can be improved. I have the same concern for both tables (Company & Country).
Thanks in advance!
Without more details it's not so simple to give suggestions, but you should look at the actual query plan and the STATISTICS IO output. Those give quite a good idea of what's going on with your SQL.
If the query is running slow, you should check the following things:
The table with the biggest logical reads
Scans in the query plan
Key lookups in the query plan when they happen for a large number of rows
Spools, sorts, and spills into tempdb
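To collect the numbers for the checks listed above, assuming SQL Server, the session-level statistics switches can be turned on around the statement being investigated. The SELECT shown is only a stand-in for the real query.

```sql
SET STATISTICS IO ON;    -- report logical and physical reads per table
SET STATISTICS TIME ON;  -- report CPU and elapsed time

-- Run the statement under investigation here; a trivial stand-in is shown.
SELECT COUNT(*) FROM Bill;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The read counts appear on the Messages tab in Management Studio, next to the actual execution plan if that option is enabled.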
For indexing, it looks like good candidates would be:
Company: CompanyCd, CycleCd, FY (+ CountryCd as included column)
Country: CountryCd, CycleCd, FY (+ CountryDesc as included column)
Everything of course depends on how often the rows are updated, since indexes will slow updates (slightly), but I'm guessing that companies and countries don't get many updates. I made a guess about the selectivity of the columns, and that's why the columns in the index are in that order.
Indexing Bill properly is a good idea too, but since the WHERE clause is missing it's not possible to give any suggestions.
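The two candidate indexes for Company and Country could be written as follows. This is a sketch assuming SQL Server and the default schema; the index names are hypothetical.

```sql
-- Hypothetical index names; key order follows the selectivity guess above.
CREATE NONCLUSTERED INDEX IX_Company_CompanyCd_CycleCd_FY
    ON Company (CompanyCd, CycleCd, FY)
    INCLUDE (CountryCd);    -- needed for the join to Country

CREATE NONCLUSTERED INDEX IX_Country_CountryCd_CycleCd_FY
    ON Country (CountryCd, CycleCd, FY)
    INCLUDE (CountryDesc);  -- the only Country column the query selects
```

With the included columns, both joins to each table can be served from the index without key lookups.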
I am having a problem fetching a count of records while joining tables. Please see the query below:
SELECT
H.EIN,
H.OUC,
(
SELECT
COUNT(1)
FROM
tbl_Checks C
INNER JOIN INFM_People_OR.dbo.tblHierarchy P
ON P.EIN = C.EIN_Checked
WHERE
(H.EIN IN (P.L1, P.L2)
OR H.EIN = C.EIN_Checked)
AND C.[Read] = 1
) AS [Read]
FROM
INFM_People_OR.dbo.tblHierarchy H
LEFT JOIN tbl_Checks C
ON H.EIN = C.EIN_Checked
WHERE
H.L1 = @EIN
GROUP BY
H.EIN,
H.OUC,
C.Check_Date
Even if there are just 100 records, this query takes far too long (around 1 min).
Please suggest a way to tune this query, as it is throwing an error in the front end.
Given just the query there are a few things that stick out as being non-optimal:
Any use of OR will be slower:
WHERE
(H.EIN IN (P.L1, P.L2)
OR H.EIN = C.EIN_Checked)
AND C.[Read] = 1
If there's any way to rework this based off of your data set so that both the IN and the OR are replaced with ANDs that would help.
Also, use of a local variable in the WHERE clause will not work well with the optimizer:
WHERE
H.L1 = @EIN
Finally, make sure you have indexes (and hopefully these are integer fields) on the columns used in your joins and GROUP BY (H.EIN, H.OUC, C.Check_Date).
The size of the result set (100 records) doesn't matter as much as the size of the joined tables and whether or not they have appropriate indexes.
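For the local-variable point, one commonly used option in SQL Server is to add OPTION (RECOMPILE), so that the statement is compiled with the variable's actual value rather than a generic estimate. A minimal sketch, assuming the value is held in a T-SQL variable named @EIN of type int (both the name and the type are assumptions) and showing only the outer filter:

```sql
DECLARE @EIN int = 12345;  -- hypothetical value and type

SELECT H.EIN, H.OUC
FROM INFM_People_OR.dbo.tblHierarchy H
WHERE H.L1 = @EIN
OPTION (RECOMPILE);  -- compile this statement using the current value of @EIN
```

Recompiling costs some CPU on every execution, so it is a trade-off that suits infrequent, expensive queries better than hot paths.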
The estimated number of rows affected, 1196880, is very high, which results in a long execution time. I have also tried joining the tables only once, but that gives different output.
Please suggest a solution other than creating indexes, as I have already created a non-clustered index on tbl_Checks and it doesn't make any difference.
Below is the SQL execution plan.