Conditional COUNT and GROUP BY with real-time data - sql-server

I was able to find similar, close-match questions regarding my query, but I want to know whether there are better ways of doing the same.
I have a database table that is populated by calls to a web-service end-point. The end-point is called when a certain event occurs, and a new record is inserted into the table. The events take place in real time, with no set frequency or pattern of occurrence.
The table looks like:
CREATE TABLE MyTbl(id int, type nvarchar(10), timestamp datetime, category nvarchar(50));
I am fetching data from the table as:
SELECT category,
COUNT(CASE WHEN type = 'sent' THEN 1 END) sent,
COUNT(CASE WHEN type = 'received' THEN 1 END) received,
COUNT(CASE WHEN type = 'blocked' THEN 1 END) blocked,
COUNT(CASE WHEN type = 'opened' THEN 1 END) opened
FROM MyTbl
WHERE timestamp >= '2013-01-01 00:00:00' AND timestamp < '2013-02-01 00:00:00'
GROUP BY category
The details about the database schema, sample data and select-query for the report are available here.
Given that the:
data is fed into the table in real-time
table will store a huge volume of data, approx. 1,000,000+ records
structure of the report-query will not change
data would be filtered on category, date-from and date-to (all are optional parameters)
time is not relevant, only date part is
Would it be a good idea to run a scheduled task periodically to update a new table with aggregated values from MyTbl? The new table would look similar to the report query:
Date | Category | Sent | Received | Blocked | Opened
and this table would be queried by applying category and date filters.
Negatives to this approach:
We still have to maintain the data on a daily basis
We still have to apply GROUP BY and SUM
More database operations may be required than in the original approach
We will not get the data that arrives in real time
Positives to this approach:
May speed up fetching records from the database, which can speed up display, sorting and paging
Is this a viable approach?
Are there any other ways to speed up the process? Please help!

Yes, this is a viable approach.
You may also want to consider reconfiguring your query as a PIVOT.
select *
from
(select cast([timestamp] as date) as [date], category, [type], id
from MyTbl
where [timestamp] >= '2013-01-01' and [timestamp] < '2013-02-01') d
pivot
(count(id) for [type] in ([sent],[blocked],[received],[opened])) p
(Note: the half-open date range from the question is kept here because BETWEEN is inclusive at both ends and would also count rows stamped exactly '2013-02-01 00:00:00'. Counting id rather than type sidesteps PIVOT's restriction that the aggregated column cannot also be the pivot column. The output has one row per date and category, matching the proposed summary-table layout.)
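For the scheduled-task idea from the question, a minimal sketch of the refresh job (the summary table MyTblDaily and the rebuild-today strategy are assumptions, not something the question specifies):
CREATE TABLE MyTblDaily(
    [Date] date NOT NULL,
    category nvarchar(50) NOT NULL,
    sent int NOT NULL,
    received int NOT NULL,
    blocked int NOT NULL,
    opened int NOT NULL,
    CONSTRAINT PK_MyTblDaily PRIMARY KEY ([Date], category)
);

-- Run periodically (e.g. from a SQL Agent job). Past days never change,
-- so it is enough to throw away and rebuild the current day's rows.
DELETE FROM MyTblDaily WHERE [Date] = CAST(GETDATE() AS date);

INSERT INTO MyTblDaily([Date], category, sent, received, blocked, opened)
SELECT CAST([timestamp] AS date),
       category,
       COUNT(CASE WHEN type = 'sent' THEN 1 END),
       COUNT(CASE WHEN type = 'received' THEN 1 END),
       COUNT(CASE WHEN type = 'blocked' THEN 1 END),
       COUNT(CASE WHEN type = 'opened' THEN 1 END)
FROM MyTbl
WHERE [timestamp] >= CAST(GETDATE() AS date)
GROUP BY CAST([timestamp] AS date), category;
The report query then hits MyTblDaily with the category and date filters, and only needs a SUM when the requested range spans more than one day.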

Related

How do you efficiently pull data from multiple records into 1 record

I currently have data in a table related to transactions. Each record has a purchase ID, a Transaction Number, and up to 5 purchases assigned to the transaction. Associated with each purchase there can be up to 10 transactions. For the first transaction of each purchase I need a field that is a string of each unique purchase concatenated. My solution was slow; I estimated it would take 40 days to complete. What would be the most effective way to do this?
What you are looking for can be achieved in 2 steps:
Step 1: Extracting the first transaction of each purchase
Depending upon your table configuration this can be done in a couple of different ways.
If your transaction IDs are sequential, you can use something like:
select a.*
from yourtable a
inner join
(select purchaseid, min(transactionid) as transactionid
from yourtable
group by purchaseid) b
on a.purchaseid = b.purchaseid and a.transactionid = b.transactionid
If there is a date variable driving the transaction sequence, then:
select a.*
from
(select *, row_number() over (partition by purchaseid order by [date]) as rownum
from yourtable) a
where a.rownum = 1
Step 2: Concatenating the purchase details
This can be done using the STRING_AGG function if you are on SQL Server 2017 or later. If not, the following link highlights a couple of different ways to do it:
Optimal way to concatenate/aggregate strings
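On SQL Server 2017 or later, that would look something like this sketch (purchasedetail is a hypothetical column standing in for whatever text needs concatenating):
-- Concatenate all detail values per purchase into one comma-separated string.
select purchaseid,
       string_agg(purchasedetail, ', ') as concatenated
from yourtable
group by purchaseid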
Hope this helps.

Update multiple running totals when past items change

I have a table (in SQL Server 2014) that includes multiple running totals (by different dates). Not an ideal design, but imagine a very large number of rows and users able to pick a specified time period; we don't want to calculate SUMs from the start of time to get the running total for that period every time.
I am looking for an elegant way to update those running totals when multiple rows are updated.
The actual scenario is an account reconciliation - the table stores money transactions for which we have the event date (e.g. when a thing was sold), the transaction date (e.g. the invoice date) and the payment date (when the invoice was paid). For each of these there is a running total, e.g. (much simplified)
CREATE TABLE MyTransaction (
Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
EventDate DATETIME NOT NULL,
TransactionDate DATETIME,
PaymentDate DATETIME,
Amount INT, -- assume whole numbers for sake of it
RunningTotalByEventDate INT,
RunningTotalByTransactionDate INT,
RunningTotalByPaymentDate INT,
IsCancelled BIT DEFAULT (0)
)
... with indexes on dates as needed, etc., and assume for the sake of the example that the date/times are unique (in practice there are uniqueifiers and other stuff).
Inserting a transaction is fine(ish). The best I have come up with is three separate queries, each updating the running total by the relevant date, or one query with logic. So after inserting a new row (with obviously-named variables passed into a stored proc)...
UPDATE MyTransaction SET RunningTotalByEventDate += @Amount
WHERE EventDate > @EventDate
and so on for the other two running totals, or a single query like...
UPDATE MyTransaction
SET RunningTotalByEventDate += CASE WHEN EventDate > @EventDate THEN @Amount ELSE 0 END,
RunningTotalByTransactionDate += CASE WHEN TransactionDate > @TransactionDate THEN @Amount ELSE 0 END,
RunningTotalByPaymentDate += CASE WHEN PaymentDate > @PaymentDate THEN @Amount ELSE 0 END
WHERE EventDate > @EventDate
OR TransactionDate > @TransactionDate
OR PaymentDate > @PaymentDate
Now I need to cancel transactions, e.g. an invoice is written off - the requirement is to leave the row in, but remove the effect - so the row stays with its Amount, but the cancelled flag is set and the row has no effect on the running totals. Unfortunately an invoice may have multiple transactions (e.g. several part payments), so there could be several transaction rows to update.
My best option so far for updating the multiple running totals is to loop/cursor around the (expected to be few) updated rows and reduce the subsequent running totals much as we increased them when adding a row - so for each time around the loop we have the three update queries (or one with logic) to update the three running totals.
A single UPDATE won't work, since it will only update a target row once (and if two part payments are being cancelled, we need to update it twice to take off each amount). I've played variously with windowed functions but cannot see a way to do this neatly with a single query set-wise.
So given a list of MyTransaction.Id values to cancel (e.g. in a table, table variable or CSV string list), what's the best way to update the various running totals?
Any ideas (and apologies for the rambling question) are very welcome.
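One set-based sketch (assuming the Ids to cancel arrive in a table variable @Cancel; this is an illustration of the idea, not a tested solution): aggregate the cancelled amounts per target row, so each row is written exactly once even when several cancellations affect it.
DECLARE @Cancel TABLE (Id INT PRIMARY KEY);
-- ... populate @Cancel with the MyTransaction.Id values to cancel ...

UPDATE t
SET RunningTotalByEventDate -= ISNULL(e.Amt, 0),
    RunningTotalByTransactionDate -= ISNULL(tr.Amt, 0),
    RunningTotalByPaymentDate -= ISNULL(p.Amt, 0)
FROM MyTransaction t
-- Sum, per target row, every cancelled amount that was previously added to it.
OUTER APPLY (SELECT SUM(c.Amount) AS Amt
             FROM MyTransaction c JOIN @Cancel x ON x.Id = c.Id
             WHERE c.EventDate < t.EventDate) e
OUTER APPLY (SELECT SUM(c.Amount) AS Amt
             FROM MyTransaction c JOIN @Cancel x ON x.Id = c.Id
             WHERE c.TransactionDate < t.TransactionDate) tr
OUTER APPLY (SELECT SUM(c.Amount) AS Amt
             FROM MyTransaction c JOIN @Cancel x ON x.Id = c.Id
             WHERE c.PaymentDate < t.PaymentDate) p
WHERE e.Amt IS NOT NULL OR tr.Amt IS NOT NULL OR p.Amt IS NOT NULL;

-- The cancelled rows keep their Amount, as required; only the flag changes.
UPDATE MyTransaction SET IsCancelled = 1
WHERE Id IN (SELECT Id FROM @Cancel);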

T-SQL Select where Subselect or Default

I have a SELECT that retrieves rows by comparing a DATETIME field to the highest available value in another table.
The two tables have the following structure:
DeletedRecords
- Id (Guid)
- RecordId (Guid)
- TableName (varchar)
- DeletionDate (datetime)
And another table which keeps track of synchronizations, with the following structure:
SynchronizationLog
- Id (Guid)
- SynchronizationDate (datetime)
In order to get all the RECORDS that have been deleted since the last synchronization, I run the following SELECT:
SELECT
[Id],[RecordId],[TableName],[DeletionDate]
FROM
[DeletedRecords]
WHERE
[TableName] = '[dbo].[Person]'
AND [DeletionDate] >
(SELECT TOP 1 [SynchronizationDate]
FROM [dbo].[SynchronizationLog]
ORDER BY [SynchronizationDate] DESC)
The problem occurs when there are no synchronizations available yet: the SELECT does not return any rows, while it should return all the rows, because there are no synchronization records yet.
Is there a T-SQL function like COALESCE that I can use with DateTime?
Your subquery should look something like this:
SELECT COALESCE(MAX([SynchronizationDate]), '0001-01-01')
FROM [dbo].[SynchronizationLog]
It says: get the last date, but if there is no record (or all values are NULL), then use '0001-01-01' as the start date.
NOTE: '0001-01-01' is for DATETIME2; if you are using the old DATETIME data type, it should be '1753-01-01'.
Also please note (from https://msdn.microsoft.com/en-us/library/ms187819(v=sql.100).aspx)
Use the time, date, datetime2 and datetimeoffset data types for new work. These types align with the SQL Standard. They are more portable. time, datetime2 and datetimeoffset provide more seconds precision. datetimeoffset provides time zone support for globally deployed applications.
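Plugged into the original query (using the DATETIME fallback from the note above):
SELECT
[Id],[RecordId],[TableName],[DeletionDate]
FROM
[DeletedRecords]
WHERE
[TableName] = '[dbo].[Person]'
AND [DeletionDate] >
(SELECT COALESCE(MAX([SynchronizationDate]), '1753-01-01')
FROM [dbo].[SynchronizationLog])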
EDIT
An alternative solution is to use NOT EXISTS (you have to test whether its performance is better or not):
SELECT
[Id],[RecordId],[TableName],[DeletionDate]
FROM
[DeletedRecords] DR
WHERE
[TableName] = '[dbo].[Person]'
AND NOT EXISTS (
SELECT 1
FROM [dbo].[SynchronizationLog] SL
WHERE DR.[DeletionDate] <= SL.[SynchronizationDate]
)

Filtering a complex SQL Query

Unit - hmy, scode, hProperty
InsurancePolicy - hmy, hUnit, dtEffective, sStatus
Select MAX(i2.dtEffective) as maxdate, u.hMy, MAX(i2.hmy) as InsuranceId,
i2.sStatus
from unit u
left join InsurancePolicy i2 on i2.hUnit = u.hMy
and i2.sStatus in ('Active', 'Cancelled', 'Expired')
where u.hProperty = 2
Group By u.hmy, i2.sStatus
order by u.hmy
This query returns values for the insurance policy with the latest effective date (MAX(dtEffective)). I added MAX(i2.hmy) so that if there is more than one insurance policy for the latest effective date, the query returns the one with the highest ID (i2.hmy) in the database.
Suppose there was a unit that had 3 insurance policies attached with the same latest effective date, all with different sStatus values.
The result would look like this:
maxdate UnitID InsuranceID sStatus
1/23/12 2949 1938 'Active'
1/23/12 2949 2343 'Cancelled'
1/23/12 2949 4323 'Expired'
How do I filter the results so that if there are multiple insurance policies with different statuses for the same unit and the same date, we choose the policy with the 'Active' status first; if that doesn't exist, choose 'Cancelled'; and if that doesn't exist, choose 'Expired'?
This seems to be a matter of properly ranking InsurancePolicy's rows and then joining Unit to the set of the former's top-ranked rows:
;WITH ranked AS (
SELECT
*,
rnk = ROW_NUMBER() OVER (
PARTITION BY hUnit
-- sStatus sorts ascending as 'Active' < 'Cancelled' < 'Expired',
-- which happens to match the required priority
ORDER BY dtEffective DESC, sStatus, hmy DESC
)
FROM InsurancePolicy
WHERE sStatus IN ('Active', 'Cancelled', 'Expired') -- mirrors the original join's filter
)
SELECT
i2.dtEffective AS maxdate,
u.hMy,
i2.hmy AS InsuranceId,
i2.sStatus
FROM Unit u
LEFT JOIN ranked i2 ON i2.hUnit = u.hMy AND i2.rnk = 1
You could make this work with one SQL statement, but it would be nearly unreadable to your everyday T-SQL developer. I would suggest breaking this query up into a few steps.
First, I would declare a table variable and place in it all the records that require no manipulation (i.e. units that do not have multiple statuses for the same date = good records).
Then, get a list of the records that need work done on them (multiple statuses on the same date for the same UnitID) and place them in a second table variable, adding a "rank" column computed with a CASE expression such as:
CASE WHEN sStatus = 'Active' THEN 1 WHEN sStatus = 'Cancelled' THEN 2 WHEN sStatus = 'Expired' THEN 3 END
Then delete the rank-2 and rank-3 rows where a rank-1 row exists for the same unit and date, and after that delete the rank-3 rows where a rank-2 row still exists (see the sketch below).
Finally, merge this updated table variable with your table variable containing your "good" records.
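Put together, the rank-and-delete steps might look like this sketch (the @work table variable and its column types are illustrative):
DECLARE @work TABLE (
    hUnit INT, dtEffective DATETIME, hmy INT, sStatus VARCHAR(20), rnk INT
);

INSERT INTO @work (hUnit, dtEffective, hmy, sStatus, rnk)
SELECT hUnit, dtEffective, hmy, sStatus,
       CASE sStatus WHEN 'Active' THEN 1
                    WHEN 'Cancelled' THEN 2
                    WHEN 'Expired' THEN 3 END
FROM InsurancePolicy
WHERE sStatus IN ('Active', 'Cancelled', 'Expired');

-- The two delete passes collapse into one test: remove any row that has a
-- better-ranked row for the same unit and effective date.
DELETE w
FROM @work w
WHERE EXISTS (SELECT 1 FROM @work b
              WHERE b.hUnit = w.hUnit
                AND b.dtEffective = w.dtEffective
                AND b.rnk < w.rnk);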
It is easy to get sucked into trying to do too much within one SQL statement. Break up the tasks to make development easier and the code more manageable in the future. If you have to edit this SQL in a few years' time you will be thanking yourself, not to mention any other developers who may have to take over your code.

Is there a quicker way of doing this type of query (finding inactive accounts)?

I have a very large table of wagering transactions. Let's say for the sake of the question I want to find the accounts of people who have wagered in the last year but not wagered in the last month, so I do something like this...
--query one
select accountnumber into #wageredrecently from activity
where _date >='2011-08-10' and transaction_type = 'Bet'
group by accountnumber
--query two
select accountnumber,firstname,lastname,email,sum(handle)
from activity a, customers c
where a.accountnumber = c.accountno
and transaction_type = 'Bet'
and _date >='2010-09-10'
and accountnumber not in (select * from #wageredrecently)
group by accountnumber,firstname,lastname,email
The problem is, this takes ages to run. Is there a quicker way to achieve the same in SQL?
Edit, just to be specific about the time: it takes just over 3 minutes, which is far too long for a query that is destined for a PHP intranet page.
Edit (11/09/2011): I've found that the problem is the customers table. It's actually a view. It previously performed well, but now all of a sudden its performance is terrible; a simple query on it takes almost as long as the query pair above. I have therefore switched to an alternative table of customer data (one that actually is a table, not a view), and the query pair now takes about 15 seconds.
You should try to join customers after you have found and aggregated the rows from activity (I assume that handle is a column in activity).
select c.accountno,
c.firstname,
c.lastname,
c.email,
a.sumhandle
from customers as c
inner join (
select accountnumber,
sum(handle) as sumhandle
from activity
where _date >= '2010-09-10' and
transaction_type = 'bet' and
accountnumber not in (
select accountnumber
from activity
where _date >= '2011-08-10' and
transaction_type = 'bet'
)
group by accountnumber
) as a
on c.accountno = a.accountnumber
I also included your first query as a sub-query instead. I'm not sure what that will do for performance. It could be better, it could be worse, you have to test on your data.
I don't know your exact business need, but rarely will someone need access to inactive accounts over several months at a moment's notice. Depending on when you purge data, this may get worse.
You could maintain a summary of the last transaction date for each account (note that this cannot be a SQL Server indexed view, because indexed views do not allow MAX; it would have to be kept current by a trigger or a scheduled job):
select accountnumber, max(_date) as RecentTransaction
from activity
group by accountnumber
If this table gets too large, it could be partitioned by year or month of the activity.
Have you considered adding an index on _date to the activity table? It's probably taking so long because it has to do a full table scan on that column when you're comparing the dates. Also, is transaction_type indexed as well? Otherwise, the other index wouldn't do you any good.
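For example, something along these lines (a sketch; the best key order and included columns depend on your data):
-- Cover the date filter and type predicate; include the columns the
-- aggregate needs so the query never touches the base table.
create index ix_activity_date_type
on activity (_date, transaction_type)
include (accountnumber, handle);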
Answering my own question, as the problem wasn't the structure of the query but one of the tables being used: it was a view, and its performance was terrible. I changed to an actual table containing customer data and reduced the execution time to about 15 seconds.
