No doubt a similar question has come up before, but I haven't been able to locate it by searching...
I have a raw dataset with time series data including 'from' and 'to' date fields.
The problem is, when data is loaded, new records have been created ('to' date added to old record, new record 'from' load date) even where no values have changed.
I want to convert this to a table which just shows a row for each genuine change - and the from/ to dates reflecting this.
By way of example, the source data looks like this:
ID      Col1   Col2   Col3   From         To
------  -----  -----  -----  -----------  -----------
Test1   1      1      1      01/01/2020   31/12/9999
Test2   1      2      3      01/01/2020   30/06/2020
Test2   1      2      3      01/07/2020   30/09/2020
Test2   3      2      1      01/10/2020   31/12/9999
The first two Test2 records (rows 2 and 3) are essentially the same - there was no change when the second row was loaded on 01/07/2020. I want a single row for the period 01/01/2020 - 30/09/2020, during which there was no change:
ID      Col1   Col2   Col3   From         To
------  -----  -----  -----  -----------  -----------
Test1   1      1      1      01/01/2020   31/12/9999
Test2   1      2      3      01/01/2020   30/09/2020
Test2   3      2      1      01/10/2020   31/12/9999
For this simplified example, I can achieve that by grouping by every column (apart from the dates) and taking the MIN From date / MAX To date (From and To are reserved words, so they need to be bracketed):
SELECT
    ID, Col1, Col2, Col3, MIN([From]) AS [From], MAX([To]) AS [To]
FROM MyTable
GROUP BY ID, Col1, Col2, Col3
However, this won't work if a value changes and then subsequently changes back to what it was before, e.g.:
ID      Col1   Col2   Col3   From         To
------  -----  -----  -----  -----------  -----------
Test1   1      1      1      01/01/2020   31/12/9999
Test2   1      2      3      01/01/2020   30/04/2020
Test2   1      2      3      01/05/2020   30/06/2020
Test2   3      2      1      01/07/2020   30/10/2020
Test2   1      2      3      01/11/2020   31/12/9999
Simply using MIN/ MAX in the code above would return this - so it looks like both sets of values were valid for the period from 01/07/2020 - 30/10/2020:
ID      Col1   Col2   Col3   From         To
------  -----  -----  -----  -----------  -----------
Test1   1      1      1      01/01/2020   31/12/9999
Test2   1      2      3      01/01/2020   31/12/9999
Test2   3      2      1      01/07/2020   30/10/2020
Whereas actually the first set of values was valid before and after that period, but not during it. The query should instead return a single row for the period 01/01/2020 - 30/06/2020, when there were no changes for this ID, then another row for the period when the values were different, and then another row where they reverted to the initial values, but with a new From date:
ID      Col1   Col2   Col3   From         To
------  -----  -----  -----  -----------  -----------
Test1   1      1      1      01/01/2020   31/12/9999
Test2   1      2      3      01/01/2020   30/06/2020
Test2   3      2      1      01/07/2020   30/10/2020
Test2   1      2      3      01/11/2020   31/12/9999
I'm struggling to conceptualise how to approach this.
I'm guessing I need to use LAG somehow, but I'm not sure how to apply it - e.g. rank everything in a staging table first, then use LAG to compare a concatenation of the whole row?
I'm sure I could find a fudged way eventually, but I've no doubt this problem has been solved many times before, so I'm hoping somebody can point me to a simpler/neater solution than I'd inevitably come up with...
Advanced Gaps and Islands
I believe this is an advanced "gaps and islands" problem. Use that as a search term and you'll find plenty of literature on the subject. The only difference is that normally only one column is being tracked, whereas you have three.
No Gaps Assumption
One major assumption of this script is that there are no gaps between the date ranges; in other words, it assumes the previous row's ToDate equals the current row's FromDate minus one day.
I'm not sure whether you need to account for gaps; if so, it would be simple to add a criterion to IsChanged that checks for that.
Multi-Column Gaps and Islands Solution
DROP TABLE IF EXISTS #Grouping
DROP TABLE IF EXISTS #Test

CREATE TABLE #Test (
    ID INT IDENTITY(1,1),
    TestName VARCHAR(10),
    Col1 INT,
    Col2 INT,
    Col3 INT,
    FromDate DATE,
    ToDate DATE
)

INSERT INTO #Test
VALUES ('Test1',1,1,1,'2020-01-01','9999-12-31')
      ,('Test2',1,2,3,'2020-01-01','2020-04-30')
      ,('Test2',1,2,3,'2020-05-01','2020-06-30')
      ,('Test2',3,2,1,'2020-07-01','2020-10-30')
      ,('Test2',1,2,3,'2020-11-01','9999-12-31')

;WITH cte_Prev AS (
    SELECT *
        ,PrevCol1 = LAG(Col1) OVER (PARTITION BY TestName ORDER BY FromDate)
        ,PrevCol2 = LAG(Col2) OVER (PARTITION BY TestName ORDER BY FromDate)
        ,PrevCol3 = LAG(Col3) OVER (PARTITION BY TestName ORDER BY FromDate)
    FROM #Test
), cte_Compare AS (
    SELECT *
        ,IsChanged = CASE
            WHEN Col1 = PrevCol1
                AND Col2 = PrevCol2
                AND Col3 = PrevCol3
            THEN 0 /*No change*/
            ELSE 1 /*Iterate so new group created*/
        END
    FROM cte_Prev
)
SELECT *, GroupID = SUM(IsChanged) OVER (PARTITION BY TestName ORDER BY ID)
INTO #Grouping
FROM cte_Compare

/*Raw unformatted data so you can see how it works*/
SELECT *
FROM #Grouping

/*Aggregated results*/
SELECT GroupID, TestName, Col1, Col2, Col3
    ,FromDate = MIN(FromDate)
    ,ToDate = MAX(ToDate)
    ,NumberOfRowsCollapsedIntoOneRow = COUNT(*)
FROM #Grouping
GROUP BY GroupID, TestName, Col1, Col2, Col3
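Since this is plain window-function SQL, the technique can be sanity-checked outside SQL Server too. Below is a minimal sketch using Python's built-in sqlite3 (SQLite 3.25+ supports LAG and windowed SUM), with the gap check mentioned above folded into IsChanged; the `date(..., '+1 day')` arithmetic is SQLite-specific, and the table/column names just follow the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Test (ID INTEGER PRIMARY KEY, TestName TEXT,
                   Col1 INT, Col2 INT, Col3 INT, FromDate TEXT, ToDate TEXT);
INSERT INTO Test (TestName, Col1, Col2, Col3, FromDate, ToDate) VALUES
 ('Test1',1,1,1,'2020-01-01','9999-12-31'),
 ('Test2',1,2,3,'2020-01-01','2020-04-30'),
 ('Test2',1,2,3,'2020-05-01','2020-06-30'),
 ('Test2',3,2,1,'2020-07-01','2020-10-30'),
 ('Test2',1,2,3,'2020-11-01','9999-12-31');
""")

rows = conn.execute("""
WITH Prev AS (
    SELECT *,
           LAG(Col1)   OVER w AS PrevCol1,
           LAG(Col2)   OVER w AS PrevCol2,
           LAG(Col3)   OVER w AS PrevCol3,
           LAG(ToDate) OVER w AS PrevToDate
    FROM Test
    WINDOW w AS (PARTITION BY TestName ORDER BY FromDate)
), Compare AS (
    SELECT *,
           CASE WHEN Col1 = PrevCol1 AND Col2 = PrevCol2 AND Col3 = PrevCol3
                     AND date(PrevToDate, '+1 day') = FromDate  -- also require no gap
                THEN 0 ELSE 1 END AS IsChanged
    FROM Prev
), Grouped AS (
    -- Running total of changes gives each "island" its own GroupID
    SELECT *, SUM(IsChanged) OVER (PARTITION BY TestName ORDER BY ID) AS GroupID
    FROM Compare
)
SELECT TestName, Col1, Col2, Col3,
       MIN(FromDate) AS FromDate, MAX(ToDate) AS ToDate
FROM Grouped
GROUP BY TestName, GroupID, Col1, Col2, Col3
ORDER BY TestName, FromDate
""").fetchall()

for r in rows:
    print(r)
```

Running this collapses the two identical consecutive Test2 rows into one period while keeping the later reversion to (1,2,3) as a separate row.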
SELECT Col1, Col2, Col3
FROM Table
Results Set: ( Sample table )
Col1 Col2 col3
----------- ---------- -----------
Value Value Value
Value Value Value
Value Value Value
Value Value Value
Value Value Value
Show Col3 as NULL/empty (assume Col3 supports NULL if needed) EXCEPT for the rows where a condition on Col1 is true; keep the values in Col2 for all rows.
The condition could be anything, e.g. WHERE Col1 > 2 or WHERE CHARINDEX('x', Col1) > 0, etc.
Table end results: ( The conditions in here are true for rows 2 and 5 )
Row         Col1       Col2       Col3
----------- ---------- ---------- -----------
1           Value      Value      NULL
2           Value      Value      Value
3           Value      Value      NULL
4           Value      Value      NULL
5           Value      Value      Value
A more concrete way of expressing the question:
SELECT EmployeeID, FirstName, LastName
FROM Employees
Results set:
EmployeeID FirstName LastName
----------- ---------- --------------------
1 Nancy Davolio
2 Andrew Fuller
3 Janet Leverling
4 Margaret Peacock
5 Steven Buchanan
6 Michael Suyama
7 Robert King
Let's say that in the code above there's a condition on EmployeeID, and EmployeeIDs 3 and 6 were true for that condition.
I'm looking to achieve:
EmployeeID FirstName LastName
----------- ---------- --------------------
1 Nancy NULL
2 Andrew NULL
3 Janet Leverling
4 Margaret NULL
5 Steven NULL
6 Michael Suyama
7 Robert NULL
What condition(s), and how should they be written, would achieve this result set?
You may change the "base code" completely.
You don't know the values in the 'LastName' column (or Col3).
You must keep all rows, and all other column values, for the rows where the condition is false.
The table is big.
Another way of putting the question (based on the first paragraph above):
For a row where the value in Col1 satisfies a condition, show the value in Col3; if not, show NULL/empty, and keep the Col2 values for all rows.
I'm not sure I've understood exactly what you want to do, but could a CASE WHEN solve your problem?
Here's an example:
SELECT
    Col1,
    Col2,
    CASE WHEN (Condition) THEN NULL ELSE Col3 END AS Col3
FROM Table
It would return this output:
Col1 Col2 col3
----------- ---------- -----------
5 Value Value
21 Value NULL
7 Value Value
8 Value Value
40 Value NULL
This way, you conditionally select either the data from the column or NULL.
EDIT: for an explanation of CASE...WHEN, you can find explanations and examples here:
https://www.w3schools.com/sql/sql_case.asp
Try this:
SELECT col1, col2,
    CASE WHEN col1 (condition) THEN col3 ELSE NULL END AS col3
FROM Table
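To make this concrete, here's a minimal runnable sketch of the CASE WHEN approach using Python's sqlite3, with an assumed condition EmployeeID IN (3, 6) standing in for whatever the real condition is (names follow the employee example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (EmployeeID INT, FirstName TEXT, LastName TEXT);
INSERT INTO Employees VALUES
 (1,'Nancy','Davolio'), (2,'Andrew','Fuller'), (3,'Janet','Leverling'),
 (4,'Margaret','Peacock'), (5,'Steven','Buchanan'), (6,'Michael','Suyama'),
 (7,'Robert','King');
""")

# Keep LastName only where the (assumed) condition holds; NULL elsewhere.
rows = conn.execute("""
SELECT EmployeeID, FirstName,
       CASE WHEN EmployeeID IN (3, 6) THEN LastName ELSE NULL END AS LastName
FROM Employees
ORDER BY EmployeeID
""").fetchall()

for r in rows:
    print(r)
```

All rows are kept, and only the conditionally-blanked column changes, which matches the "keep all rows and the other column values" requirement.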
I have tried passing the dtype parameter with read_csv as dtype={n: pandas.Categorical} but this does not work properly (the result is an Object). The manual is unclear.
Since version 0.19.0 you can use the parameter dtype='category' in read_csv (note: pd.compat.StringIO has since been removed; use io.StringIO instead):
import pandas as pd
from io import StringIO

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
df = pd.read_csv(StringIO(data), dtype='category')
print (df)
  col1 col2 col3
0    a    b    1
1    a    b    2
2    c    d    3
print (df.dtypes)
col1    category
col2    category
col3    category
dtype: object
If you want to specify only certain columns as category, use dtype with a dictionary:
df = pd.read_csv(StringIO(data), dtype={'col1':'category'})
print (df)
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
print (df.dtypes)
col1 category
col2 object
col3 int64
dtype: object
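In newer pandas versions you can also pass a CategoricalDtype instead of the string 'category' if you want to fix the set (or order) of categories up front. A small sketch, with a made-up category list:

```python
import io
import pandas as pd

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

# CategoricalDtype lets you declare the allowed categories explicitly;
# values outside the list would become NaN.
dtype = pd.CategoricalDtype(categories=['a', 'c'])
df = pd.read_csv(io.StringIO(data), dtype={'col1': dtype})

print(df.dtypes)
print(df['col1'].cat.categories)
```

This is useful when you want every chunk or file you read to share the same category set, rather than inferring it from the data.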
I'm trying to create a query which aggregates some data by one of several columns. (Only ever expecting one, but very interested in hearing multi solutions too!).
I've created the following query, but it doesn't work: it complains that each of the columns "is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."
I'm very interested in hearing why this is the case, as when the query is run the correct column would always be grouped. Obviously I'm also interested in solutions to my problem too, but will press on with figuring it out myself.
DECLARE @AggregateColumn int = 1

SELECT
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE 'ERROR'
    END AS ResultName,
    SUM(a.ResultValue) AS ResultValue
FROM
    #Results
GROUP BY
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE null
    END
Thank you for your time.
Edit: Thanks for the responses... I'd added an extra typo to my question. Basically I've fixed this issue myself. SQL Server returns a very unhelpful error message if you do what I did above: ELSE 'ERROR' needed to be ELSE null, so that the expression in the SELECT list matches the expression in the GROUP BY clause exactly. You'd never realise that was the problem from the error message, though!
Remove the alias in the group by:
GROUP BY
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE null
    END -- AS ResultName
Basically I've fixed this issue myself. SQL Server returns a very unhelpful error message if you've done what I did: ELSE 'ERROR' needed to be ELSE null, so that the SELECT expression matches the GROUP BY expression exactly. You'd never realise that was the problem from the error message, though - it complains about all the other columns.
DECLARE @AggregateColumn int = 1

SELECT
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE null
    END AS ResultName,
    SUM(ResultValue) AS ResultValue
FROM
    #Results
GROUP BY
    ResultID,
    CASE
        WHEN @AggregateColumn = 1 THEN Col1
        WHEN @AggregateColumn = 2 THEN Col2
        WHEN @AggregateColumn = 3 THEN Col3
        WHEN @AggregateColumn = 4 THEN Col4
        WHEN @AggregateColumn = 5 THEN Col5
        ELSE null
    END
I have a table:
col1 col2
---- ----
a rrr
a fff
b ccc
b zzz
b xxx
I want a query that returns the number of occurrences of col1 against each row, like:
rows col1 col2
---- ---- ----
2 a rrr
2 a fff
3 b ccc
3 b zzz
3 b xxx
As a is repeated 2 times and b is repeated 3 times.
You can try COUNT with an OVER (PARTITION BY ...) clause, which divides the result set produced by the FROM clause into partitions to which the function is applied.
Without a PARTITION BY clause, the function would treat all rows of the query result set as a single group.
Try this:
SELECT COUNT(col1) OVER (PARTITION BY col1) AS [rows], col1, col2
FROM tablename
You can use the OVER clause with aggregate functions like COUNT:
SELECT rows = COUNT(*) OVER(PARTITION BY col1),
col1,
col2
FROM dbo.TableName
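Here's the same idea as a quick runnable sketch using Python's sqlite3 (SQLite has supported window functions since 3.25; the table name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablename (col1 TEXT, col2 TEXT);
INSERT INTO tablename VALUES
 ('a','rrr'), ('a','fff'), ('b','ccc'), ('b','zzz'), ('b','xxx');
""")

# Each row carries the count of all rows sharing its col1 value.
rows = conn.execute("""
SELECT COUNT(*) OVER (PARTITION BY col1) AS n, col1, col2
FROM tablename
ORDER BY col1, col2
""").fetchall()

for r in rows:
    print(r)
```

Unlike a GROUP BY, the OVER clause keeps every row in the output, so col2 survives alongside the per-group count.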