I want to join against a huge partitioned table. The planner probably assumes that the partitioned table is very cheap to scan.
I have the following query:
select *
from (
select * from users where age < 18 limit 10
) as users
join
clicks on users.id = clicks.userid
where
clicks.ts between '2015-01-01' and now();
The table clicks is the master table, with roughly 40 child tables that together contain about 40 million records.
This query performs very slowly. When I look at the plan, Postgres first performs a complete scan of the clicks table and then scans the users table.
However, when I limit the users subquery to 1, the planner first scans users and then clicks.
It seems as if the planner assumes that the clicks table is very lightweight. If I look at the stats in pg_class, the master table clicks has 0 tuples, which is true on the one hand because it is only a master table, but on the other hand the planner should treat it as containing the sum of all its child tables.
How can I force the planner to use the cheapest option first?
Edit: in simplifying the query I did indeed leave out an additional constraint on the date.
The partitioning constraints are on: clicks.ts and clicks.userid
I have indexes on users.age, users.id, clicks.userid and clicks.ts
Maybe I have to trust the planner. I am just a little unsure because I once had a case where Postgres showed some weird behavior with limits (PostgreSQL query very slow with limit 1).
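For reference, the kind of workaround I am considering (just a sketch; I am not sure it is the right approach) is refreshing the statistics on the hierarchy and pinning the join order for this one statement:

analyze clicks;  -- gather statistics for the master table and its children

begin;
set local join_collapse_limit = 1;  -- keep the join order exactly as written
select *
from (
  select * from users where age < 18 limit 10
) as users
join clicks on users.id = clicks.userid
where clicks.ts between '2015-01-01' and now();
commit;

I would still prefer not to have to pin the order by hand.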
I have a simple database with 4 tables:
Customer (cusId)
Newspaper (papId)
SubCost (subId)
Subscription (cusId, papId, subId)
Newspaper has a column to track number of subscribers which is updated via a trigger on the Subscription table. It also has a column to track annual revenue which should be based on the number of subscribers and the cost associated with the subscription (subId).
I am looking for a trigger to track annual revenue. There are 3 subscription types (subId) with differing weekly costs and a paper can have more than one type of subscription so it can't just be (cost * 52 * numSubs).
Can you help me with this logic?
Your best bet is not using such a column at all. Instead, use a view that computes the result, and index it if necessary:
CREATE OR ALTER VIEW dbo.vTotalSubs
WITH SCHEMABINDING AS
SELECT
    n.papid,
    TotalRevenue = SUM(sc.Cost * 52),
    TotalSubscriptions = COUNT_BIG(*) -- you MUST have this column here if aggregating with an index
FROM dbo.Newspaper n
JOIN dbo.Subscription s ON s.papid = n.papid
JOIN dbo.SubCost sc ON sc.subid = s.subid
GROUP BY
    n.papid;
GO
CREATE UNIQUE CLUSTERED INDEX CX_vTotalSubs ON dbo.vTotalSubs (papid);
If you decide to index the view, be aware there are many restrictions to indexed views, in particular:
Only INNER JOIN is allowed, no other join types, no subqueries
Must schema-bind, and specify schema on all tables.
If aggregating, you must have COUNT_BIG(*), and the only other aggregation allowed is SUM
Make sure to add the WITH (NOEXPAND) hint when querying, otherwise there may be performance impacts
The server will automatically maintain the index, you do not need to update it.
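For example, querying the view might look something like this (the papid value is just an example):

SELECT papid, TotalRevenue, TotalSubscriptions
FROM dbo.vTotalSubs WITH (NOEXPAND)
WHERE papid = 42;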
I am creating a query that takes data from multiple other databases through DB links.
One of the tables in the query, "ord", is immensely large, with more than 50 million rows.
Now I want to write the query so that it traverses the data and retrieves the required rows based on the partitions defined on ord.
That is, if ord has 50 partitions with 1 million records each, I want to run the whole query on the first partition, get the result, then move on to the 2nd, 3rd, and so on, up to the last partition.
How can I do that?
Please consider the sample query below, where from the local DB I access all the remote DBs using DB links.
This Query list all the orders which are active.
Select ord.order_no,
ord.customer_id,
ord.order_date,
cust.customer_id,
cust.cust_name,
adr.street,
adr.city,
adr.state,
ship.ship_addr_street,
ship.ship_addr_city,
ship.ship_addr_state,
ship.ship_date
from order@ordDB ord
inner join customer@custDB cust on cust.customer_id = ord.customer_id
inner join address@adrDB adr on adr.address_id = cust.address_id
inner join shipment@shipDB ship on ship.shipment_id = ord.shipment_id
where ord.active = 'true';
Now there is a field "partition_key" defined in this table, and each key value is associated with roughly 1 million rows. I want to restructure the query so that we take one partition from Order at a time, run the whole query on that partition, and then move to the next partition until the whole table has been processed.
Please help me to create a sample query.
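To illustrate what I mean, something along these lines is roughly what I imagine (just a sketch; it reuses the table and DB link names from the query above, and local_order_stage is a local staging table I would have to create):

BEGIN
  -- loop over the partition key values one at a time
  FOR p IN (SELECT DISTINCT partition_key FROM order@ordDB) LOOP
    INSERT INTO local_order_stage (order_no, customer_id, order_date, ship_date)
    SELECT ord.order_no,
           ord.customer_id,
           ord.order_date,
           ship.ship_date
    FROM order@ordDB ord
    INNER JOIN shipment@shipDB ship ON ship.shipment_id = ord.shipment_id
    WHERE ord.active = 'true'
      AND ord.partition_key = p.partition_key  -- only this partition's rows
    ;
    COMMIT;  -- keep each pass small
  END LOOP;
END;
/

Is that the right idea, or is there a better way to drive the query partition by partition?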
I have benefited from this website for a long time now. This is my first question on the site. It is regarding performance tuning a reporting query. Here it goes.
SELECT Count(b1.primkey)
from tableA b1 --WITH (NOLOCK)
join tableA b2 --WITH (NOLOCK)
on b1.email = b2.email
and DateDiff(day, b2.BookedDate , b1.BookedDate) > 1
tableA has around 7 million rows. Email is a varchar(100) field, BookedDate is a datetime field, and primkey is a primary key column that is an int.
My purpose in writing this query is to find the count of entries that have the same email id but came in a day late. This query takes about 45 minutes to run, and I really want to reduce the execution time.
Since this is for reporting, I tried in vain to use the WITH (NOLOCK) hint to improve the read time. I have a columnstore index on tableA and I know that it is being used by the SQL optimizer; I can see it in the execution plan. I am using SQL Server 2012.
Can someone tell me in such a case, what would be better? Using a nonclustered index on email or a nonclustered columnstore index on tableA?
Please help me.
Your query is relatively complex. You are essentially joining two tables that have 7 million records each on a column that is not unique.
How about the following query instead:
select Email
from TableA
group by Email
having MAX(BookedDate) > MIN(BookedDate) + 1
Also make sure you have an index with Email and BookedDate.
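If you don't have one already, something along these lines should cover it (the index name is just a suggestion):

CREATE NONCLUSTERED INDEX IX_TableA_Email_BookedDate
ON TableA (Email, BookedDate);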
Hope this helps.
You have 3 options here:

1. Create a clustered index on the email field, at least for the larger table. But I suppose there are other queries running on these tables, and the clustered index is needed on other fields.

2. Move emails to another table, and store email ids in TableA and TableB; a join on an int field would be much faster than a join on varchar fields.

3. Create indexes on the email fields with BookedDate as an included column (no need to include primkey; you can count on another field, or use count(*)). Code: create index idx_email on TableA (Email) include (BookedDate)

I think the third option is the one you should go with. There's not much work to be done, and there will be a great performance gain. The only problem is that an index on a varchar field will take a lot of space and impact insert/update operations; but you said this is a reporting DB, so I think you can allow that.
I have 2 tables in 2 databases. The schema for the tables is identical. There are no timestamps or last-updated information. Table A is a live table, that is, it's updated in "the" program. Update, insert and delete operations all happen in Table A. Table B is a backup made weekly. Is there a quick way to compare the 2 tables and give me results similar to:
I | 54
D | 55
U | 60
So record 54 in the live table is new, record 55 in the live table was deleted, record 60 in the live table was updated.
This needs to work in SQL Server 2008 and up.
Fields: id, first_name, last_name, phone, email, address_id, birth_date, last_visit, provider_id, comments
I have no control over the schema. I have read-only access to Table A, and read-write to Table B.
Would it be easier to store a hash of each of Table A's rows rather than a full copy of the table? Generally speaking, I need to know which rows have been updated, inserted or deleted without a built-in timestamp. I have the weekly backup table to look at, but I could create a hash table if needed.
This uses a full join plus a left join: the full join checks just for id existence, to identify inserts and deletes; the left join checks row equality, to identify updates.
In the example I have used checksum for simplicity, but I recommend you read up on the cons of using it and consider alternatives like hashbytes, or checking each column for equality.
Select id, checksum(*) hash
Into #live
From live.dbo.tbl;

Select id, checksum(*) hash
Into #archive
From archive.dbo.tbl;

Select Coalesce(l1.id, a1.id) id,
       Case when l1.id is null then 'D'
            when a1.id is null then 'I'
            when a2.id is null then 'U' end change_type
From #live l1
Full Join #archive a1 On a1.id = l1.id
Left Join #archive a2 On a2.id = l1.id
                     And a2.hash = l1.hash
Where a2.id is null; -- unchanged rows have an id and hash match, so they drop out here
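For example, with hashbytes the staging step could look something like this instead (a sketch only, assuming the column list from the question; the casts and the '|' separator are simplifications, and you would want to make sure the separator cannot occur in the data):

Select id,
       hashbytes('SHA1',
           isnull(first_name, '') + '|' +
           isnull(last_name, '') + '|' +
           isnull(phone, '') + '|' +
           isnull(email, '') + '|' +
           isnull(convert(varchar(20), address_id), '') + '|' +
           isnull(convert(varchar(30), birth_date, 126), '') + '|' +
           isnull(convert(varchar(30), last_visit, 126), '') + '|' +
           isnull(convert(varchar(20), provider_id), '') + '|' +
           isnull(comments, '')) hash
Into #live
From live.dbo.tbl;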
I'm going to recommend a tool, but it's not free, although it has a fully functioning 30 day trial period. If you're going to compare data in SQL Server tables, look at Red Gate's SQL Data Compare. It's not cheap, and it will pay for itself many times over. (If you need to compare schemas, their SQL Compare does that.)
Barring that, you could use a third table: write a compare query that selects the rows in one table and not the other (with a field indicating that), the rows in the other table and not the first, and then compares field by field to find the rows that differ. That should work too. It will take longer, but if it's just the one table, the time it takes to write that code should be less than what you'd pay for the Red Gate tools.
If there is a column or set of columns that can uniquely identify each row, then a series of sql statements could be written to identify the inserts, updates and deletes. If there isn't a unique row identifier or the unique identifier (for example, one of the columns that makes it unique) changes, then no.
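For example, with id as the unique identifier (and borrowing the live.dbo.tbl / archive.dbo.tbl names from the other answer), those statements could be sketched like this:

-- new rows (in live, not in the backup)
Select 'I' change_type, l.id
From live.dbo.tbl l
Where Not Exists (Select 1 From archive.dbo.tbl a Where a.id = l.id);

-- deleted rows (in the backup, no longer in live)
Select 'D' change_type, a.id
From archive.dbo.tbl a
Where Not Exists (Select 1 From live.dbo.tbl l Where l.id = a.id);

-- updated rows (same id, at least one column differs)
Select 'U' change_type, chg.id
From (Select * From live.dbo.tbl Except Select * From archive.dbo.tbl) chg
Where Exists (Select 1 From archive.dbo.tbl a Where a.id = chg.id);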
I use a high level of redundant, denormalized data in my DB designs to improve performance. I'll often store data that would normally need to be joined or calculated. For example, if I have a User table and a Task table, I would store the Username and UserDisplayName redundantly in every Task record. Another example of this is storing aggregates, such as storing the TaskCount in the User table.
User
    UserID
    Username
    UserDisplayName
    TaskCount

Task
    TaskID
    TaskName
    UserID
    UserName
    UserDisplayName
This is great for performance since the app has many more reads than insert, update or delete operations, and since some values like Username change rarely. However, the big drawback is that integrity has to be enforced via application code or triggers. This can be very cumbersome with updates.
My question is: can this be done automatically in SQL Server 2005/2010, maybe via a persisted/permanent view? Would anyone recommend another possible solution or technology? I've heard document-based DBs such as CouchDB and MongoDB can handle denormalized data more effectively.
You might want to first try an Indexed View before moving to a NoSQL solution:
http://msdn.microsoft.com/en-us/library/ms187864.aspx
and:
http://msdn.microsoft.com/en-us/library/ms191432.aspx
Using an Indexed View would allow you to keep your base data in properly normalized tables and maintain data-integrity while giving you the denormalized "view" of that data. I would not recommend this for highly transactional tables, but you said it was heavier on reads than writes so you might want to see if this works for you.
Based on your two example tables, one option is:
1) Add a column to the User table defined as:
TaskCount INT NOT NULL DEFAULT (0)
2) Add a Trigger on the Task table defined as:
CREATE TRIGGER UpdateUserTaskCount
ON dbo.Task
AFTER INSERT, DELETE
AS
;WITH added AS
(
SELECT ins.UserID, COUNT(*) AS [NumTasks]
FROM INSERTED ins
GROUP BY ins.UserID
)
UPDATE usr
SET usr.TaskCount = (usr.TaskCount + added.NumTasks)
FROM dbo.[User] usr
INNER JOIN added
ON added.UserID = usr.UserID
;WITH removed AS
(
SELECT del.UserID, COUNT(*) AS [NumTasks]
FROM DELETED del
GROUP BY del.UserID
)
UPDATE usr
SET usr.TaskCount = (usr.TaskCount - removed.NumTasks)
FROM dbo.[User] usr
INNER JOIN removed
ON removed.UserID = usr.UserID
GO
3) Then do a View that has:
SELECT u.UserID,
u.Username,
u.UserDisplayName,
u.TaskCount,
t.TaskID,
t.TaskName
FROM dbo.[User] u
INNER JOIN dbo.Task t
ON t.UserID = u.UserID
And then follow the recommendations from the links above (WITH SCHEMABINDING, Unique Clustered Index, etc.) to make it "persisted". While it is inefficient to do an aggregation in a subquery in the SELECT as shown above, this specific case is intended to be denormalized in a situation that has higher reads than writes. So doing the Indexed View will keep the entire structure, including the aggregation, physically stored so each read will not recalculate it.
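As a rough sketch of that "persisted" step for the view above (the view and index names here are made up):

CREATE VIEW dbo.vUserTasks
WITH SCHEMABINDING
AS
SELECT u.UserID,
       u.Username,
       u.UserDisplayName,
       u.TaskCount,
       t.TaskID,
       t.TaskName
FROM dbo.[User] u
INNER JOIN dbo.Task t
        ON t.UserID = u.UserID;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vUserTasks ON dbo.vUserTasks (TaskID);
GO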
Now, if a LEFT JOIN is needed because some Users do not have any Tasks, then the Indexed View will not work due to the 5000 restrictions on creating them. In that case, you can create a real table (UserTask) that is your denormalized structure and have it populated via either a Trigger on just the User table (assuming you do the Trigger I show above, which updates the User table based on changes in the Task table), or you can skip the TaskCount field in the User table and just have Triggers on both tables to populate the UserTask table. In the end, this is basically what an Indexed View does, just without you having to write the synchronization Trigger(s).