Reduce table records based on minimum time difference - sql-server

I have a log table (MS SQL SERVER) with event entries (events are user actions like "user logged in", "user viewed entity A" etc).
Some events like "user viewed entity A" may occur multiple times within a short time frame. For instance if a user goes back and forward in his browser he may enter entity A's page multiple times within a minute, and multiple "user view" events will be logged.
For my analytics dashboard I would like to count how many times a user viewed entity A, but I would like to "debounce" the result. I want to consider multiple "user view" events close to one another as one "user view" event. Specifically, I want to consider a new "user view" event only if it is more than 30 minutes from the last one.
So having a table like this (last column is my comments for clarity):
timestamp  evt_type     user_id  entity_id  *time diff from previous event
---------------------------------------------------------------------------
15:30      ENTITY_VIEW  U1       E1         NULL (first view)
15:38      ENTITY_VIEW  U1       E1         8mins
16:05      ENTITY_VIEW  U1       E1         27mins
16:50      ENTITY_VIEW  U1       E1         45mins (this counts as new view)
17:15      ENTITY_VIEW  U1       E1         25mins
17:44      ENTITY_VIEW  U1       E1         29mins
18:30      ENTITY_VIEW  U1       E1         46mins (this counts as another view)
I would like to determine that the user "viewed" the entity 3 times.
What would be a query to determine this? I tried LEAD, LAG, PARTITION BY and other combinations, but I can't seem to find the correct way as I am not an SQL expert.

This should be a simple LAG() to grab the previous timestamp and check the difference. I will say that your [timestamp] column looks like an odd data type: what happens across different days? Is there a separate column for the date?
Return Records >30 Minutes from Previous Record
WITH cte_DeltaSinceLastView AS (
    SELECT *
        /*Grab previous record for each user_id/entity_id combo*/
        ,PrevTimestamp = LAG([timestamp]) OVER (PARTITION BY [user_id],[entity_id] ORDER BY [timestamp])
    FROM YourTable
)
SELECT *,MinutesSinceLastView = DATEDIFF(minute,PrevTimestamp,[timestamp])
FROM cte_DeltaSinceLastView
WHERE DATEDIFF(minute,PrevTimestamp,[timestamp]) > 30 /*Over 30 minutes between last view*/
    OR PrevTimestamp IS NULL /*First view will not have previous timestamp to compare against*/
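As a quick cross-check outside the database, here is a minimal Python sketch of the same previous-row comparison, run on the sample rows from the question. The function name and the calendar date are assumptions made for illustration only; they are not part of the original post.

```python
from datetime import datetime, timedelta

def count_debounced_views(timestamps, min_gap=timedelta(minutes=30)):
    """Count events whose gap from the previous event exceeds min_gap.

    The first event always counts, mirroring the PrevTimestamp IS NULL branch.
    """
    count = 0
    previous = None
    for current in sorted(timestamps):
        if previous is None or current - previous > min_gap:
            count += 1
        previous = current  # always compare against the previous row, like LAG()
    return count

# Sample rows from the question for user U1 / entity E1 (the date is assumed).
day = "2024-01-01 "
views = [
    datetime.strptime(day + t, "%Y-%m-%d %H:%M")
    for t in ("15:30", "15:38", "16:05", "16:50", "17:15", "17:44", "18:30")
]

print(count_debounced_views(views))  # 3
```

Note that the sketch compares exact elapsed time, whereas DATEDIFF(minute, ...) counts minute boundaries crossed, so results can differ slightly once timestamps carry seconds.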

Something you could try is a correlated subquery that disregards any row that is followed by another row within 30 minutes; the remaining rows are the ones that qualify (i.e. a gap of 30+ minutes follows them, or they are the last row). See if this works for you:
select Sum(vc) as ViewedCount
from (
    select case when exists (
        select * from t t2
        where t2.timestamp > t.timestamp
          and t2.evt_type = t.evt_type
          and t2.user_id = t.user_id
          and t2.entity_id = t.entity_id
          and DateDiff(minute, t.timestamp, t2.timestamp) < 30
    ) then 0 else 1 end vc
    from t
) b;
This assumes that Timestamp is a time data type. This won't work across day boundaries but the same concept should apply with a datetime type.
Demo as Fiddle

Related

trying to break down results of SQL query to show data for each month

I'm very new to SQL and have a problem I can't figure out.
I'm trying to replace an Excel spreadsheet with a Power BI report. Currently our team runs the following query to get the number of active users every month and types it into an Excel sheet, which then graphs the number of users each month to show the increase. Since I don't want to input data manually each month, my goal is to break this query down so that it gives the cumulative number of users as of each month.
Desired result would look something like this
dateCreated # of Users
----------------------
2008-10 295
2008-11 355
2008-12 470
2009-01 522
I was able to break it down enough to give me the amount created each month, but that doesn't give me the total amount each month. This is the query that I used and a sample of the results I got.
SELECT
FORMAT(USERADDR.DateCreated, 'yyyy-MM') AS 'dateCreated',
COUNT(s.UserId) AS "# of Users"
FROM
ER.dbo.ssUser s,
ER.dbo.ssUserAddress USERADDR,
ER.dbo.ssAddress ADDRESS
WHERE
s.UserId = USERADDR.UserId
AND USERADDR.AddressId = ADDRESS.AddressId
AND Isdefault = 1
AND Type = 'soldto'
GROUP BY
FORMAT(USERADDR.DateCreated, 'yyyy-MM')
result sample:
dateCreated # of Users
2008-10 295
2008-11 41
2008-12 22
2009-01 19
This is almost there, but I need a running total. I've tried a lot of different things including SUM, SUM OVER, COUNT OVER etc. My boss suggested a while loop. I can't get that to work either and everything I've read says that should be the last resort. Here is one example of my failed attempts
SELECT
FORMAT(USERADDR.DateCreated, 'yyyy-MM') as 'dateCreated',
COUNT(s.UserId)
OVER(
PARTITION BY Month(USERADDR.DateCreated)
GROUP BY FORMAT(USERADDR.DateCreated, 'yyyy-MM')
)
AS "# of Users"
FROM
ER.dbo.User s,
ER.dbo.UserAddress USERADDR,
ER.dbo.Address ADDRESS
WHERE
s.UserId = USERADDR.UserId
AND USERADDR.AddressId = ADDRESS.AddressId
AND Isdefault = 1
AND Type = 'soldto'
--original query which gives total number of users right now.
SELECT
count(s.UserId) AS "# of Users"
FROM
ER.dbo.User s,
ER.dbo.UserAddress USERADDR,
ER.dbo.Address ADDRESS
WHERE
s.UserId = USERADDR.UserId
AND USERADDR.AddressId = ADDRESS.AddressId
AND Isdefault = 1
AND Type = 'soldto'
You can do a window sum() on the aggregated count of users per month, like so:
SELECT
    FORMAT(USERADDR.DateCreated, 'yyyy-MM') [dateCreated],
    SUM(COUNT(s.UserId)) OVER(ORDER BY FORMAT(USERADDR.DateCreated, 'yyyy-MM')) [# of Users]
FROM
    ER.dbo.ssUser s
    INNER JOIN ER.dbo.ssUserAddress USERADDR
        ON s.UserId = USERADDR.UserId
    INNER JOIN ER.dbo.ssAddress ADDRESS
        ON USERADDR.AddressId = ADDRESS.AddressId
WHERE Isdefault = 1 AND Type = 'soldto'
GROUP BY FORMAT(USERADDR.DateCreated, 'yyyy-MM')
Notes:
always prefer proper, explicit join syntax (with the ON keyword) over implicit, old-school joins, which were superseded a long time ago - I modified your query accordingly
SQL Server uses square brackets for identifiers - you should avoid single quotes, as they are generally used for literal strings
you have unqualified column names in the WHERE clause: always qualify column names in your query, so it is easy to understand which table they belong to
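To make the SUM(COUNT(...)) OVER (ORDER BY ...) step concrete, here is a minimal Python sketch that applies the same running total to the per-month counts from the question's sample output. The function name is an assumption for illustration; the numbers are the asker's.

```python
from itertools import accumulate

# Per-month counts taken from the question's sample output.
monthly_new_users = [("2008-10", 295), ("2008-11", 41), ("2008-12", 22), ("2009-01", 19)]

def running_totals(rows):
    """Return (month, cumulative count) pairs, ordered by month."""
    ordered = sorted(rows)  # 'yyyy-MM' strings sort chronologically
    totals = accumulate(count for _, count in ordered)
    return [(month, total) for (month, _), total in zip(ordered, totals)]

for month, total in running_totals(monthly_new_users):
    print(month, total)
```

Each month's figure is its own count plus everything before it, which is what the window SUM over the grouped COUNT produces.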

Crystal Report Time Difference Formula With Conditions

I am trying to create a report for labour effectiveness in a manufacturing business that links to two distinct MS SQL databases.
"Database A" contains information about what employees were doing when they were on shift.
"Database B" contains information about what time an employee clocked in or out for payroll info.
One of the comparisons I want to report on, is the total time an employee was logged into a job vs the total time they were clocked in the building. The data is related on employee number. The main report is linked to database A and is grouped on "Employee Name", within the group footer, there is a sub report linked to Database B and the employee name is passed through as a parameter.
My problem is the way database B records clocking in. I am using a SQL command to collect the data:
SELECT
tws.date_and_time
, CONVERT(date, tws.date_and_time) AS 'Date'
,te.first_name
, te.last_name
, concat(te.first_name,' ', te.last_name) AS 'ConcatName'
, CASE WHEN tws.flag IN (1,3) THEN 1 ELSE 0 END as manual_adjustment
, ROW_NUMBER() OVER (PARTITION BY tw.date_and_time ORDER BY tws.date_and_time ASC) AS swipe_number
, ROW_NUMBER() OVER (PARTITION BY tw.date_and_time ORDER BY tws.date_and_time ASC) % 2 AS in_swipe
FROM temployee te
INNER JOIN twork tw
ON te.employee_id = tw.employee_id AND tw.type = 1000
INNER JOIN twork_swipe tws
on tw.work_id = tws.work_id
The "swipe_number" details what number swipe that record is in the time period (e.g. 1st, 2nd, 3rd etc.)
The "in_swipe" displays 1 if this is the employee clocking in or 0 if it is the employee clocking out.
I am grouping the sub report on date.
This is relatively straightforward if an employee clocks in and out once on the same day, but I am struggling to work out how to account for an employee clocking in and out multiple times during a shift (for a break, for example) or clocking in on one day and out on another (a night shift, for example).
I need to sum the total time an employee is clocked in, so I need to evaluate whether the 1st clock of the day is an "in swipe" (swipe_number = 1 AND in_swipe = 1). If it is not, the time should not be recorded as the difference between swipe_number 2 and 1; it should run from the start of the day (00:00:0000) to swipe_number = 1, as this indicates the employee has been there since midnight.
Likewise if the last (or only) "swipe_number" of the day is an "in swipe", the time should be recorded as between that time and 23:59:5999.
Outside of this, I need to find the time between the date time fields where swipe numbers = 2 & 1, 4 & 3, 6 & 5 etc. (no fixed number of times a swipe can occur).
Can this be handled dynamically in formulas?
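The pairing rules described above can be sketched in Python to pin down the intended arithmetic before translating it into formulas. This is only an illustration of the logic, not a Crystal Reports solution: the function name, the tuple layout and the sample swipes are assumptions, and the end of the day is taken as the next midnight rather than 23:59:59.

```python
from datetime import datetime, date, time, timedelta

def clocked_in_time(day, swipes):
    """Sum clocked-in time for one calendar day.

    swipes: list of (timestamp, is_in) tuples that fall on `day`.
    """
    day_start = datetime.combine(day, time.min)
    day_end = day_start + timedelta(days=1)  # assumption: next midnight
    total = timedelta(0)
    open_since = None
    for index, (stamp, is_in) in enumerate(sorted(swipes)):
        if is_in:
            if open_since is None:
                open_since = stamp
        elif open_since is not None:
            # Normal pair: out swipe closes the open in swipe.
            total += stamp - open_since
            open_since = None
        elif index == 0:
            # First swipe of the day is an out swipe: present since midnight.
            total += stamp - day_start
    if open_since is not None:
        # Last swipe of the day was an in swipe: count up to the end of the day.
        total += day_end - open_since
    return total

# Hypothetical night-shift day: out at 06:00, back in at 22:00.
night = [(datetime(2024, 1, 1, 6, 0), False), (datetime(2024, 1, 1, 22, 0), True)]
print(clocked_in_time(date(2024, 1, 1), night))  # 8:00:00
```

Any number of in/out pairs during the day is handled by the same loop, so no fixed swipe count is assumed.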

Running concurrent /parallel update statements (T-SQL)

I have a table that is basically records of items, with columns for each day of the month, so each row is ITEM, Day1, Day2, Day3, .... I have to run update statements that trawl through each row day by day, with the current day's calculation requiring some info from the previous day.
Basically, we have required daily quantities. Because the order goes out in boxes (which are a fixed size) and the calculated quantities are in pieces, the system has to calculate the next largest number of boxes. Any "extra quantity" is carried over to the next day to reduce boxes.
For example, for ONE of those records in the table described earlier (the box size is 100)
My current code is basically getting the record, calculate the requirements for that day, increment by one and repeat. I have to do this for each record. It's very inefficient especially since it's being run sequentially for each record.
Is there any way to parallelize this on SQL Server Standard? I'm thinking of something like a buffer where I submit each row as a job and the system manages the resources and runs the query.
If the buffer idea is not feasible, is there any way to 'chunk' these rows and run the chunks in parallel?
Not sure if this helps, but I played around with your data and was able to calculate the figures without row-by-row handling as such. I transposed the figures with unpivot and calculated the values using running total + lag, so this requires SQL Server 2012 or newer:
declare @BOX int = 100
; with C1 as (
    SELECT
        Day, Quantity
    FROM
        (SELECT * from Table1 where Type = 'Quantity') T1
    UNPIVOT
        (Quantity FOR Day IN (Day1, Day2, Day3, Day4)) AS up
),
C2 as (
    select Day, Quantity,
        /* running total of the leftover pieces, wrapped at the box size */
        sum(ceiling(convert(numeric(5,2), Quantity) / @BOX) * @BOX - Quantity)
            over (order by Day asc) % @BOX as Extra
    from C1
),
C3 as (
    select
        Day, Quantity,
        /* today's need minus what was left over from the previous day */
        Quantity - isnull(Lag(Extra) over (order by Day asc),0) as Required,
        Extra
    from C2
)
select
    Day, Quantity, Required,
    ceiling(convert(numeric(5,2), Required) / @BOX) as Boxes, Extra
from C3
Example in SQL Fiddle
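For comparison, the carry-over rule from the question can be written as a plain sequential loop. The Python sketch below is only a reference for checking the set-based query's output; the function name and the daily quantities are made up for illustration, and the box size of 100 comes from the question.

```python
def plan_boxes(daily_quantities, box_size=100):
    """Return (required, boxes, extra) per day, carrying extras forward."""
    plan = []
    extra = 0
    for quantity in daily_quantities:
        required = quantity - extra           # yesterday's surplus reduces today's need
        boxes = -(-required // box_size)      # integer ceiling division
        extra = boxes * box_size - required   # pieces left over after shipping whole boxes
        plan.append((required, boxes, extra))
    return plan

# Hypothetical quantities for Day1..Day4 of one item.
for row in plan_boxes([150, 130, 10, 250]):
    print(row)
```

Because each day depends only on the previous day's extra, items are independent of one another, which is what lets the query above handle all days in one pass.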

MSSQL Comparing rows same table

Hi, I'm looking to compare several rows and check whether a certain condition is true or false.
The table has several columns; the ones I'm interested in are:
Events.Badgeno
Events.Name
Events.Date
Events.Time
Events.Region_id
Events.Data
The region ID can either be 1 or 2.
I want to check whether the same badgeno registers with a different region within a specified date/time difference, say 10 mins (could be 10 mins before or 10 mins after).
I'm looking to show the records which don't have a record against the 2 regions.
As a further note, it should only apply to the first and last records of that badge per day.
Normally each record should have a region 1 and a region 2 record at the start and end, but there may be multiple region 1 records throughout the day.
Any suggestions for the best method?
Id Date Time Name Badgeid Region
3385033 27/02/2014 08:16:11 FirstName Surname 5304 2
I think something like this would work
SELECT e.Badgeno, e.Name, e.Date, e.Time, e.Region_id, e.Data
FROM events e
LEFT JOIN events e1 /* LEFT JOIN so that unmatched rows survive for the IS NULL check */
    ON e1.Badgeno = e.Badgeno
    AND e1.Region_id <> e.Region_id
    AND DATEDIFF(minute, e1.date + e1.time, e.date + e.time) > -10
    AND DATEDIFF(minute, e1.date + e1.time, e.date + e.time) < 10
WHERE e1.Region_id IS NULL
You should provide sample data.
This query is not complete, but you can try something with row_number/rank/dense_rank and PARTITION BY, then check the generated number column.
select *,
row_number() over (partition by Badgeno, Region_id order by Date, Time) as rn
from Events
-- where <date/time condition>
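The matching rule from the question (same badge, other region, within 10 minutes either side) can be sketched in Python to clarify which rows should come back. The function name, tuple layout and sample events are assumptions for illustration, and the sketch leaves out the "first and last record per day" restriction.

```python
from datetime import datetime, timedelta

def unmatched_events(events, window=timedelta(minutes=10)):
    """Return events with no event for the same badge in another region within `window`.

    events: list of (badge_no, timestamp, region_id) tuples.
    """
    result = []
    for badge, stamp, region in events:
        has_partner = any(
            other_badge == badge
            and other_region != region
            and abs(other_stamp - stamp) < window
            for other_badge, other_stamp, other_region in events
        )
        if not has_partner:
            result.append((badge, stamp, region))
    return result

# Hypothetical events for badge 5304 on one day.
events = [
    (5304, datetime(2014, 2, 27, 8, 16, 11), 2),
    (5304, datetime(2014, 2, 27, 8, 20, 0), 1),
    (5304, datetime(2014, 2, 27, 17, 0, 0), 1),  # no region 2 partner nearby
]
print(unmatched_events(events))
```

This is the same anti-join idea as a LEFT JOIN with an IS NULL filter: keep a row only when no partner row satisfies the join condition.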

Is there a quicker way of doing this type of query (finding inactive accounts)?

I have a very large table of wagering transactions. Let's say for the sake of the question I want to find the accounts of people who have wagered in the last year but not wagered in the last month, so I do something like this...
--query one
select accountnumber into #wageredrecently from activity
where _date >='2011-08-10' and transaction_type = 'Bet'
group by accountnumber
--query two
select accountnumber,firstname,lastname,email,sum(handle)
from activity a, customers c
where a.accountnumber = c.accountno
and transaction_type = 'Bet'
and _date >='2010-09-10'
and accountnumber not in (select * from #wageredrecently)
group by accountnumber,firstname,lastname,email
The problem is, this takes ages to get the data. Is there a quicker way to achieve the same in SQL?
Edit, just to be specific about the time: it takes just over 3 minutes, which is far too long for a query that is destined for a PHP intranet page.
Edit (11/09/2011): I've found out that the problem is the customers table. It's actually a view. It previously had good performance but now all of a sudden its performance is terrible, a simple query on it takes almost as long as the above query pair. I have therefore chosen an alternative table of customer data (that actually is a table, and not a view) and now the query pair takes about 15 seconds.
You should try to join customers after you have found and aggregated the rows from activity (I assume that handle is a column in activity).
select c.accountno,
c.firstname,
c.lastname,
c.email,
a.sumhandle
from customers as c
inner join (
select accountnumber,
sum(handle) as sumhandle
from activity
where _date >= '2010-09-10' and
transaction_type = 'bet' and
accountnumber not in (
select accountnumber
from activity
where _date >= '2011-08-10' and
transaction_type = 'bet'
)
group by accountnumber
) as a
on c.accountno = a.accountnumber
I also included your first query as a sub-query instead. I'm not sure what that will do for performance. It could be better, it could be worse, you have to test on your data.
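The logic of that query (sum the handle of accounts that bet within the last year, excluding any account that bet within the last month) can be sketched in Python as a reference for checking results on a small sample. The function name, tuple layout and sample rows are assumptions; the two cut-off dates are the ones used in the question.

```python
from collections import defaultdict
from datetime import date

def lapsed_accounts(activity, year_start, month_start):
    """Sum handle per account that bet since year_start but not since month_start.

    activity: list of (account_number, bet_date, handle) tuples for 'Bet' rows.
    """
    # Equivalent of the NOT IN sub-query: accounts with a recent bet.
    recent = {account for account, day, _ in activity if day >= month_start}
    totals = defaultdict(int)
    for account, day, handle in activity:
        if day >= year_start and account not in recent:
            totals[account] += handle
    return dict(totals)

# Hypothetical activity rows.
activity = [
    (1, date(2011, 1, 5), 50),
    (1, date(2011, 3, 9), 25),
    (2, date(2011, 2, 1), 10),
    (2, date(2011, 8, 20), 40),  # recent bet, so account 2 is excluded
    (3, date(2010, 1, 1), 99),   # older than a year, so ignored
]
print(lapsed_accounts(activity, date(2010, 9, 10), date(2011, 8, 10)))
```

Aggregating first and joining customer details afterwards, as the query above does, keeps the join to one row per lapsed account.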
I don't know your exact business need, but rarely will someone need access to inactive accounts over several months at a moment's notice. Depending on when you purge data, this may get worse.
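You could create a view that contains the last transaction date for each account (note that SQL Server indexed views do not allow MAX(), so this would have to be a plain view or a separately maintained summary table):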
max(_date) as RecentTransaction
If this table gets too large, it could be partitioned by year or month of the activity.
Have you considered adding an index on _date to the activity table? It's probably taking so long because it has to do a full table scan on that column when you're comparing the dates. Also, is transaction_type indexed as well? Otherwise, the other index wouldn't do you any good.
Answering my own question, as the problem wasn't the structure of the query but one of the tables being used. It was a view and its performance was terrible. I changed to an actual table of customer data, which reduced the execution time to about 15 seconds.
