How do I produce the expected table? - sql-server

I'm trying to get the averages of the min and max rates for each category in the position column, alongside the per-state values. This record set is only a sample; the real table has many more records.
Given:
state  position         minrate  maxrate
ny     admin assistant  12.5000  14.5000
ny     office manager   20.5000  25.5000
ca     admin assistant  13.5000  15.5000
ca     office manager   21.5000  26.5000
al     admin assistant  11.5000  13.5000
al     office manager   19.5000  24.5000
Expected:
position         ny_min   ny_max   ca_min   ca_max   al_min   al_max   avg_min  avg_max
admin assistant  12.5000  14.5000  13.5000  15.5000  11.5000  13.5000  12.5000  14.5000
office manager   20.5000  25.5000  21.5000  26.5000  19.5000  24.5000  20.5000  25.5000
Code:
declare @jobs table (
    [state] nvarchar(25),
    [position] nvarchar(25),
    [minrate] decimal(18,4),
    [maxrate] decimal(18,4)
)
insert @jobs
values
    ('ny','admin assistant',12.5, 14.5),
    ('ny','office manager',20.5, 25.5),
    ('ca','admin assistant',13.5, 15.5),
    ('ca','office manager',21.5, 26.5),
    ('al','admin assistant',11.5, 13.5),
    ('al','office manager',19.5, 24.5)
select * from @jobs

In order to dynamically create field names, you will need to utilize dynamic SQL. To pair that with the additional aggregates that you need (the total avg min/max), you will need to perform an additional query across all rows and combine them.
Dynamic SQL runs in its own batch, where a table variable declared by the caller is not visible, so for the purpose of your example I have swapped your table variable @jobs for a global temp table ##tmpjobs. Assuming that you are actually pulling this from a database table in the real world, you can simply swap the global temp table ##tmpjobs for your real table.
I accomplish this in the below example by unpivoting all state min/max values, adding unpivoted values for the total avg min/max, and then performing a single (fairly standard) PIVOT command.
/*get list of columns that we want for our pivot*/
DECLARE @ColumnList nvarchar(max) = CONCAT((
    SELECT
        STRING_AGG(state_list.min_max_title,N',') WITHIN GROUP (ORDER BY state_list.min_max_title DESC) AS ColumnList
    FROM
        (SELECT
            CONCAT(j.[state],N'_min,',j.[state],N'_max') AS min_max_title
        FROM
            ##tmpjobs AS j
        GROUP BY
            j.[state]) AS state_list
    ),N',avg_min,avg_max');
/*build pivot query*/
DECLARE @Sql nvarchar(max) = CONCAT(
N'SELECT
    pvt.*
FROM
    /*subquery to unpivot data for min/max values for each state and the two totals*/
    (/*add in min for each state*/
    SELECT
        CONCAT(j.[state],N''_min'') AS ColumnName
        ,j.position AS position
        ,j.minrate AS Amount
    FROM
        ##tmpjobs AS j
    UNION ALL
    /*add max for each state*/
    SELECT
        CONCAT(j.[state],N''_max'') AS ColumnName
        ,j.position AS position
        ,j.maxrate AS Amount
    FROM
        ##tmpjobs AS j
    UNION ALL
    /*add total min/max rows*/
    SELECT
        CASE /*conditionally return max/min column name*/
            WHEN row_mult.RowId = 1 THEN ''avg_min''
            WHEN row_mult.RowId = 2 THEN ''avg_max''
        END AS ColumnName
        ,total_avgs.position
        ,CASE /*conditionally return max/min value*/
            WHEN row_mult.RowId = 1 THEN total_avgs.AvgMinRate
            WHEN row_mult.RowId = 2 THEN total_avgs.AvgMaxRate
        END
    FROM
        /*subquery to calculate the total for all states for each position*/
        (SELECT
            j.position
            ,AVG(j.minrate) AS AvgMinRate
            ,AVG(j.maxrate) AS AvgMaxRate
        FROM
            ##tmpjobs AS j
        GROUP BY
            j.position) AS total_avgs
        /*generate an extra row per position*/
        OUTER APPLY (SELECT 1 AS RowId
                     UNION ALL
                     SELECT 2 AS RowId) AS row_mult) AS src
PIVOT
    (MAX(Amount) FOR ColumnName IN (',@ColumnList,N')) AS pvt');
/*now run query*/
EXEC sys.sp_executesql @stmt = @Sql;
The only thing to really note here is that I currently have the state columns in reverse alphabetical order (Z to A) to match your expected output. You can change that to A to Z by changing DESC to ASC in the WITHIN GROUP clause, or to any other order you please by changing what the @ColumnList variable outputs.
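For comparison, when the list of states is fixed and known in advance, the same shape can be produced without dynamic SQL by using conditional aggregation. The following is only a sketch under that assumption: the three state codes are hard-coded from the sample data, and it reads from a table variable rather than the global temp table used above.

```sql
-- Sample data from the question, held in a table variable.
DECLARE @jobs table (
    [state]    nvarchar(25),
    [position] nvarchar(25),
    minrate    decimal(18,4),
    maxrate    decimal(18,4)
);
INSERT @jobs
VALUES
    ('ny','admin assistant',12.5, 14.5),
    ('ny','office manager',20.5, 25.5),
    ('ca','admin assistant',13.5, 15.5),
    ('ca','office manager',21.5, 26.5),
    ('al','admin assistant',11.5, 13.5),
    ('al','office manager',19.5, 24.5);

-- One CASE per output column: each picks out the rate for a single state,
-- and MAX collapses the one non-NULL value per position.
-- The state codes below are hard-coded, which is the assumption of this sketch.
SELECT
    j.position,
    MAX(CASE WHEN j.[state] = N'ny' THEN j.minrate END) AS ny_min,
    MAX(CASE WHEN j.[state] = N'ny' THEN j.maxrate END) AS ny_max,
    MAX(CASE WHEN j.[state] = N'ca' THEN j.minrate END) AS ca_min,
    MAX(CASE WHEN j.[state] = N'ca' THEN j.maxrate END) AS ca_max,
    MAX(CASE WHEN j.[state] = N'al' THEN j.minrate END) AS al_min,
    MAX(CASE WHEN j.[state] = N'al' THEN j.maxrate END) AS al_max,
    AVG(j.minrate) AS avg_min,   -- average across all states
    AVG(j.maxrate) AS avg_max
FROM @jobs AS j
GROUP BY j.position
ORDER BY j.position;
```

This only works while the set of states is stable; once new states can appear, the column list has to be generated, which is what the dynamic version is for.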

Related

How can I access a specific field in a named subquery when the field name might not be unique?

I am trying to create a routine that can accept an SQL query as a string and the [table].[primaryKey] of the primary record in the returned dataset, then wrap that original query to implement pagination (return records 40-49 when requesting page 4 and 10 records per page).
The dataset returned by the original queries will frequently contain multiple instances of the primary record, one for each occurrence of supporting records. For the example provided, if a customer has three phone numbers on record the results for that customer in the original query would look like:
{5; John Smith; 205 W. Fort St; 17; Home; 123-123-4587}
{5; John Smith; 205 W. Fort St; 18; Work; 123-123-8547}
{5; John Smith; 205 W. Fort St; 19; Mobile; 123-123-1147}
I'm almost there, I think, with the following query:
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;
WITH OriginalQuery AS (
    SELECT [Customer].[Id],
           [Customer].[Name],
           [Customer].[Address],
           [Phone].[Id],
           [Phone].[Type],
           [Phone].[Number]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT [WrappedQuery].[RowNumber], [OriginalQuery].* FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) [RowNumber], *
    FROM (
        SELECT DISTINCT [OriginalQuery].[{Customer.Id}] [PrimaryKey]
        FROM [OriginalQuery]
    ) [RowNumberQuery]
) [WrappedQuery]
INNER JOIN [OriginalQuery] ON [WrappedQuery].[PrimaryKey] = [OriginalQuery].[{Customer.Id}]
WHERE [WrappedQuery].[RowNumber] >= @PageNumber
  AND [WrappedQuery].[RowNumber] < @PageNumber + @RecordsPerPage
This solution performs a SELECT DISTINCT on the primary key for the Primary (Customer) record and uses the SQL routine Row_Number() then joins the result with the results of the original query such that each unique primary (customer) record is numbered 1 - {end of file}, and I can pull only the RowNumber counts that I want.
But because OriginalQuery may have multiple fields named Id (from different tables), I can't figure out how to properly access [Customer].[Id] in my SELECT DISTINCT clause of [RowNumberQuery] or in the INNER JOIN.
Is there a better way to implement pagination at the SQL level, or a more direct method of accessing the field I need from within the subquery based on the table to which it belongs?
EDIT:
I've caused confusion in the pagination I am looking for. I am using Dapper in C# to compile the resulting dataset into individual complex objects, so the goal in the example would be to retrieve customers 31-40 in the list regardless of how many individual records exist for each customer. If Customer 31 had five phone records, Customer 32 had three phone records, Customer 33 had 1 phone record, and the remaining seven customers had two phone records each, I would expect the resulting dataset to contain 23 records total, but only 10 distinct customers.
SOLUTION
Thank you for all of the assistance, and I apologize for those areas I should have clarified sooner. I am creating a toolset that will allow C# Data Access Libraries to implement a set of standard parameters. If I have an option to implement the pagination in an internal function that can accept the SQL statement, I can defer to the toolset and not have to remember (or count on others to remember) to add the appropriate text each time. I'll set it up to return the finished objects, but if I were going to just modify the original query string it would look like:
public static string AddPagination(string sql, string primaryKey, Parameter requestParameters)
{
    return $"WITH OriginalQuery AS ({sql.Replace("SELECT ", $"SELECT DENSE_RANK() OVER (ORDER BY {primaryKey}) AS PrimaryRecordCount, ",StringComparison.OrdinalIgnoreCase)}) " +
        $"SELECT TOP ({requestParameters.MaxRecords}) * " +
        $"FROM OriginalQuery " +
        $"WHERE PrimaryRecordCount >= 1 + (({requestParameters.PageNumber - 1}) * {requestParameters.RecordsPerPage})" +
        $" AND PrimaryRecordCount <= {requestParameters.PageNumber} * {requestParameters.RecordsPerPage}";
}
Just give the columns distinct aliases in your original query, e.g. [Customer].[Id] AS CustomerId, [Phone].[Id] AS PhoneId..., then you can reference OriginalQuery.CustomerId or OriginalQuery.PhoneId.
e.g.
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;
WITH OriginalQuery AS (
    SELECT [Customer].[Id] AS CustomerId,
           [Customer].[Name],
           [Customer].[Address],
           [Phone].[Id] AS PhoneId,
           [Phone].[Type],
           [Phone].[Number]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT [WrappedQuery].[RowNumber], [OriginalQuery].* FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) [RowNumber], *
    FROM (
        SELECT DISTINCT [OriginalQuery].[CustomerId] [PrimaryKey]
        FROM [OriginalQuery]
    ) [RowNumberQuery]
) [WrappedQuery]
INNER JOIN [OriginalQuery] ON [WrappedQuery].[PrimaryKey] = [OriginalQuery].[CustomerId]
WHERE [WrappedQuery].[RowNumber] >= @PageNumber
  AND [WrappedQuery].[RowNumber] < @PageNumber + @RecordsPerPage
It's worth noting that your paging logic is wrong too. Currently you are using the page number itself as the first row and adding the number of records per page to it, so you are searching for:
Page 1: Customers 1 - 10
Page 2: Customers 2 - 11
Page 3: Customers 3 - 12
Your logic should be:
WHERE [WrappedQuery].[RowNumber] >= 1 + ((@PageNumber - 1) * @RecordsPerPage)
  AND [WrappedQuery].[RowNumber] <= (@PageNumber * @RecordsPerPage)
Page 1: Customers 1 - 10
Page 2: Customers 11 - 20
Page 3: Customers 21 - 30
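As a quick check of that arithmetic, here is a small sketch that evaluates both bounds for a few page numbers. The derived table of page numbers is only there for illustration.

```sql
DECLARE @RecordsPerPage int = 10;

-- FirstRow and LastRow are the inclusive bounds used in the corrected WHERE clause.
SELECT p.PageNumber,
       1 + ((p.PageNumber - 1) * @RecordsPerPage) AS FirstRow,
       p.PageNumber * @RecordsPerPage             AS LastRow
FROM (VALUES (1), (2), (3), (4)) AS p(PageNumber)
ORDER BY p.PageNumber;
-- PageNumber 1 -> 1 to 10, 2 -> 11 to 20, 3 -> 21 to 30, 4 -> 31 to 40
```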
With that being said, you could just use DENSE_RANK() rather than ROW_NUMBER(), which would simplify everything. I think this would give you the same result:
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;
WITH OriginalQuery AS (
    SELECT c.Id AS CustomerId,
           c.Name,
           c.Address,
           p.Id AS PhoneId,
           p.Type,
           p.Number,
           DENSE_RANK() OVER(ORDER BY c.Id) AS RowNumber
    FROM Customer AS c INNER JOIN Phone AS p ON c.Id = p.CustomerId
)
SELECT oq.CustomerId, oq.Name, oq.Address, oq.PhoneId, oq.Type, oq.Number
FROM OriginalQuery AS oq
WHERE oq.RowNumber >= 1 + ((@PageNumber - 1) * @RecordsPerPage)
  AND oq.RowNumber <= (@PageNumber * @RecordsPerPage);
I've added table aliases to try and make the code a bit cleaner, and also removed all the unnecessary square brackets. This is not necessary, but I personally find them quite hard on the eye, and only use them to escape key words.
Another difference is that by adding ORDER BY c.Id you ensure consistent results for your paging. Using ORDER BY (SELECT NULL) implies that you don't care about the order, but you should if you are using it for paging.
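To see why DENSE_RANK() fits here, this small sketch numbers the same rows both ways. Customer 5 and its three phone ids come from the question's example; customer 6 and phone id 20 are made up for illustration.

```sql
DECLARE @rows table (CustomerId int, PhoneId int);
INSERT @rows VALUES (5, 17), (5, 18), (5, 19), (6, 20);

SELECT CustomerId,
       PhoneId,
       ROW_NUMBER() OVER (ORDER BY CustomerId, PhoneId) AS RowNumber,    -- 1, 2, 3, 4
       DENSE_RANK() OVER (ORDER BY CustomerId)          AS CustomerRank  -- 1, 1, 1, 2
FROM @rows
ORDER BY CustomerId, PhoneId;
```

ROW_NUMBER() counts joined rows, so a page boundary can fall in the middle of a customer. DENSE_RANK() gives every row of the same customer the same number, with no gaps, so filtering on it pages by customer.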
There are many concerns with what you are trying to do and you might be better off explaining why you are trying to make this process.
SQL query as a string
You are receiving a SQL query as a string, so how are you parsing that string into the OriginalQuery CTE? This raises concerns about SQL injection, and about global temp tables if you are using those.
Secondly, your example isn't doing pagination as it is commonly understood. If someone were to request page 1 with 10 records per page, the calling application would expect to receive the first 10 records of the result set, but your example will return all records for the first 10 customers. This means the result could contain 40+ rows if each customer had 4 phone numbers.
You should take a look at OFFSET and FETCH NEXT, as well as why this requirement to parse an arbitrary SQL string. There is probably a better way to do that.
Here is a rough example using OFFSET and FETCH NEXT from a static query, and returning only #RecordsPerPage number of records.
DECLARE @PageNumber int = 1;
DECLARE @RecordsPerPage int = 10;
SELECT [Customer].[Id],
       [Customer].[Name],
       [Customer].[Address],
       [Phone].[Id],
       [Phone].[Type],
       [Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
ORDER BY [Customer].[Id]
OFFSET (@PageNumber-1)*@RecordsPerPage ROWS
FETCH NEXT @RecordsPerPage ROWS ONLY
If you wanted to return all records for a page of RecordsPerPage customers that have a corresponding phone number, then it would be something like...
DECLARE @PageNumber int = 1;
DECLARE @RecordsPerPage int = 10;
SELECT [Customer].[Id],
       [Customer].[Name],
       [Customer].[Address],
       [Phone].[Id],
       [Phone].[Type],
       [Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
WHERE Customer.ID IN (
    SELECT DISTINCT Customer.ID FROM Customer INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
    ORDER BY [Customer].[Id]
    OFFSET (@PageNumber-1)*@RecordsPerPage ROWS
    FETCH NEXT @RecordsPerPage ROWS ONLY
)
This does leave a question, what is the point of this query when the calling application can just use their own OFFSET and FETCH NEXT? They already have the SQL to generate the initial dataset, all they need to do is add OFFSET / FETCH NEXT to the end of it and they have their own pagination without trying to wrap it in a procedure of some sort.
As a comparison, would you create a stored procedure that accepts a SQL string and then filters specific fields by specific values? Or would the people calling that stored procedure just add a WHERE clause to their own queries instead?
You can use an alias name for each duplicated column.
For example:
WITH OriginalQuery AS (
SELECT [Customer].[Id] as CustomerID,
[Customer].[Name],
[Customer].[Address],
[Phone].[Id] as PhoneID,
[Phone].[Type],
[Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
Now you can use the two ids by their alias names in the next query.

SQL stored procedure for picking a random sample based on multiple criteria

I am new to SQL. I looked all over the internet for a solution that matches my problem but couldn't find one. I have a table named 'tblItemReviewItems' in SQL Server 2012.
tblItemReviewItems
Information:
1. ItemReviewId column is the PK.
2. Deleted column will have only "Yes" and "No" value.
3. Audited column will have only "Yes" and "No" value.
I want to create a stored procedure to do the following:
Pick a random sample of 10% of all ItemReviewId for each distinct 'UserId' and distinct 'ReviewDate' in a given date range. The 10% sample should consist of 5% of the total population taken from Deleted (No) and 5% of the total population taken from Deleted (Yes). Records with Audited = "Yes" are excluded from the sample.
For example – A user has 118 records. Out of the 118 records, 17 records have Deleted column value "No" and 101 records have Deleted column value "Yes". We need to pick a random sample of 12 records. Out of those 12 records, 6 should have Deleted column value "No" and 6 should have Deleted column value "Yes".
Update Audited column value to "Check" for the picked sample.
How can I achieve this?
This is the stored procedure I used to pick a sample of 5% of Deleted column value "No" and 5% of Deleted column value "Yes". Now the situation is different.
ALTER PROC [dbo].[spItemReviewQcPickSample]
(
    @StartDate Datetime
    ,@EndDate Datetime
)
AS
BEGIN
    WITH CTE
    AS (SELECT ItemReviewId
              ,100.0
               *row_number() OVER(PARTITION BY UserId
                                              ,ReviewDate
                                              ,Deleted
                                  ORDER BY newid()
                                 )
               /count(*) OVER(PARTITION BY UserId
                                          ,ReviewDate
                                          ,Deleted
                             )
               AS pct
        FROM tblItemReviewItems
        WHERE ReviewDate BETWEEN @StartDate AND @EndDate
          AND Deleted IN ('Yes','No')
          AND Audited='No'
       )
    SELECT a.*
    FROM tblItemReviewItems AS a
    INNER JOIN cte AS b
        ON b.ItemReviewId=a.ItemReviewId
       AND b.pct<=6
    ;
    WITH CTE
    AS (SELECT ItemReviewId
              ,100.00
               *row_number() OVER(PARTITION BY UserId
                                              ,ReviewDate
                                              ,Deleted
                                  ORDER BY newid()
                                 )
               /count(*) OVER(PARTITION BY UserId
                                          ,ReviewDate
                                          ,Deleted
                             )
               AS pct
        FROM tblItemReviewItems
        WHERE ReviewDate BETWEEN @StartDate AND @EndDate
          AND Deleted IN ('Yes','No')
          AND Audited='No'
       )
    UPDATE a
    SET Audited='Check'
    FROM tblItemReviewItems AS a
    INNER JOIN cte AS b
        ON b.ItemReviewId=a.ItemReviewId
       AND b.pct<=6
    ;
END
Any help would be highly appreciated. Thanks in advance.
This may assist you in getting started. My idea is that you create the temp tables you need and load the specific data into them (deleted, not deleted, etc.). You then run something along the lines of:
IF OBJECT_ID('tempdb..#tmpTest') IS NOT NULL DROP TABLE #tmpTest
GO
CREATE TABLE #tmpTest
(
ID INT ,
Random_Order INT
)
INSERT INTO #tmpTest
(
ID
)
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5 UNION ALL
SELECT 6 UNION ALL
SELECT 7 UNION ALL
SELECT 8 UNION ALL
SELECT 9 UNION ALL
SELECT 10 UNION ALL
SELECT 11 UNION ALL
SELECT 12 UNION ALL
SELECT 13 UNION ALL
SELECT 14 UNION ALL
SELECT 15 UNION ALL
SELECT 16;
DECLARE @intMinID INT ,
        @intMaxID INT;
SELECT @intMinID = MIN(ID)
FROM #tmpTest;
SELECT @intMaxID = MAX(ID)
FROM #tmpTest;
-- walk the IDs one at a time and give each row a random number between 10 and 30
WHILE @intMinID <= @intMaxID
BEGIN
    UPDATE #tmpTest
    SET Random_Order = 10 + CONVERT(INT, (30-10+1)*RAND())
    WHERE ID = @intMinID;
    SELECT @intMinID = @intMinID + 1;
END
SELECT TOP 5 *
FROM #tmpTest
ORDER BY Random_Order;
This assigns a random number to a column, that you then use in conjunction with a TOP 5 clause, to get a random top 5 selection.
I appreciate that a loop may not be efficient, but you may be able to assign the random numbers without one, and the same principle would apply. Hope that gives you some ideas.
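One loop-free way to apply the same principle is to order by NEWID(), which produces a different value for every row in a single statement. The sketch below applies that to the 5% + 5% split from the question. It is only an illustration under stated assumptions: #Items is a hypothetical stand-in for tblItemReviewItems, filled with the question's example of 118 rows for one user and date (17 with Deleted = 'No', 101 with Deleted = 'Yes'), and the sample size per group is taken as 5% of the combined total, rounded up.

```sql
-- Hypothetical stand-in for tblItemReviewItems.
CREATE TABLE #Items (
    ItemReviewId int IDENTITY PRIMARY KEY,
    UserId       int,
    ReviewDate   date,
    Deleted      varchar(3),
    Audited      varchar(5)
);

-- 118 rows for one user and date: the first 17 are Deleted = 'No', the rest 'Yes'.
INSERT INTO #Items (UserId, ReviewDate, Deleted, Audited)
SELECT 1, '20200101', CASE WHEN t.n <= 17 THEN 'No' ELSE 'Yes' END, 'No'
FROM (SELECT TOP (118) ROW_NUMBER() OVER (ORDER BY object_id) AS n
      FROM sys.all_objects
      ORDER BY n) AS t;

-- Rank rows randomly inside each (UserId, ReviewDate, Deleted) group and
-- count the whole population per (UserId, ReviewDate).
WITH Ranked AS (
    SELECT ItemReviewId,
           ROW_NUMBER() OVER (PARTITION BY UserId, ReviewDate, Deleted
                              ORDER BY NEWID())             AS rn,
           COUNT(*) OVER (PARTITION BY UserId, ReviewDate) AS total
    FROM #Items
    WHERE Audited = 'No'
)
UPDATE i
SET Audited = 'Check'
FROM #Items AS i
INNER JOIN Ranked AS r ON r.ItemReviewId = i.ItemReviewId
WHERE r.rn <= CEILING(0.05 * r.total);   -- 5% of 118 = 5.9, rounded up to 6 per group

-- Shows how many rows were picked from each Deleted group.
SELECT Deleted, COUNT(*) AS Picked
FROM #Items
WHERE Audited = 'Check'
GROUP BY Deleted;
```

Because the picked rows are marked in the same statement that ranks them, the rows reported afterwards are exactly the rows that were updated. A date-range filter like the one in the question's procedure would go in the WHERE clause of the CTE.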

Pagination in SQL Server 2012 stored procedure with top distinct(x) records

I want to create a stored procedure that paginates over a capped subset of a table. For example, in a table (e.g. an employee table) with more than 3,000,000 records, I want to take the top 100,000 records and paginate over those. I'm able to paginate using the script below, but I don't know how to restrict it to the top 100,000 records first.
DECLARE @currentPageNo int, @takeData int
SET @currentPageNo = 1
SET @takeData = 10
SELECT DISTINCT emp.empid, emp.name, s.Salary
FROM Employee emp
LEFT OUTER JOIN salary S ON emp.empid=S.empid
WHERE emp.empid=12
ORDER BY emp.empid DESC
OFFSET (@currentPageNo - 1) * @takeData ROWS
FETCH NEXT @takeData ROWS ONLY
I need some suggestions on how this can be achieved. I'm using SQL Server 2012.
In order to perform pagination properly you will need to fix the order of the result set and assign row numbers; for this you can use ROW_NUMBER() OVER().
You will also need to know the total count of the entire result set, which tells you how many pages there are in total. You can get this in a secondary query or add another column for the total rows using COUNT() OVER().
Here's an example using your variables:
DECLARE @currentPageNo int, @takeData int
SET @currentPageNo = 0 -- page numbers are zero-based in this version
SET @takeData = 10
SELECT TOP (@takeData) * FROM
(
    SELECT ROW_NUMBER() OVER(ORDER BY empid DESC) RowNumber, *
    FROM (
        SELECT DISTINCT emp.empid, emp.name, s.Salary, COUNT(emp.empid) OVER() TotalRows
        FROM Employee emp
        LEFT OUTER JOIN salary S ON emp.empid=S.empid
        WHERE emp.empid=12
    ) Q1
) Q2
WHERE (@takeData <> 0 AND RowNumber BETWEEN ((@currentPageNo * @takeData) + 1) AND ((@currentPageNo + 1) * @takeData))
ORDER BY RowNumber
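To restrict the paging to the top N rows first, one option is to cap the rows in a CTE with TOP ... ORDER BY and then page over the CTE with OFFSET / FETCH, both of which are available in SQL Server 2012. This is a minimal sketch, not a tested procedure: #Employee is a hypothetical stand-in for the employee table, and the cap is set to 5 (with a page size of 2) only so that a seven-row sample shows the effect. For the question it would be 100000.

```sql
-- Hypothetical stand-in for the Employee table.
CREATE TABLE #Employee (empid int PRIMARY KEY, name varchar(50));
INSERT INTO #Employee (empid, name)
VALUES (1,'a'), (2,'b'), (3,'c'), (4,'d'), (5,'e'), (6,'f'), (7,'g');

DECLARE @currentPageNo int = 3,   -- one-based page number
        @takeData      int = 2,   -- rows per page
        @maxRows       int = 5;   -- cap; 100000 in the question

WITH Capped AS (
    -- Only the first @maxRows rows in the chosen order are visible to the paging query.
    SELECT TOP (@maxRows) emp.empid, emp.name
    FROM #Employee AS emp
    ORDER BY emp.empid DESC
)
SELECT c.empid, c.name
FROM Capped AS c
ORDER BY c.empid DESC
OFFSET (@currentPageNo - 1) * @takeData ROWS
FETCH NEXT @takeData ROWS ONLY;
-- The capped set is empid 7, 6, 5, 4, 3, so page 3 of size 2 returns only empid 3.
```

The ORDER BY inside the CTE decides which rows count as the "top" ones, so it should match the order used for paging.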

TSQL - Need to compare values of two most recent rows in logging table

I have a table INDICATORS that stores details and current scores of performance indicators. I have another table IND_HISTORIES that stores historical values of the indicator scores. Data are stored from INDICATORS to IND_HISTORIES at set periods (ie quarterly), to establish score / rating trends.
IND_HISTORIES has a column structure similar to this-
pk_IndHistId fk_IndId Score DateSaved
Rating levels are also defined, meaning a score value of 1 to 3 is Low, 4 to 6 is Avg, and 7 to 9 is High.
I am trying to build an alert feature, whereby a record will be returned if its most recent rating level (based on the most recent score in IND_HISTORIES) is greater than its second-most recent rating level (based on the second-most recent score in IND_HISTORIES).
I am using code like below to build a temp table that translates score values to rating level thresholds...
-- opt_IND_ScoreValues = 1;2;3;4;5;6;7;8;9
DECLARE @tblScores TABLE (idx int identity, val int not null)
INSERT INTO @tblScores (val) SELECT IntValue FROM dbo.fn_getSettingList('opt_IND_ScoreValues')
-- opt_IND_RatingLevels = Low;Low;Low;Avg;Avg;Avg;High;High;High
DECLARE @tblRatings TABLE (idx int identity, txt nvarchar(128))
INSERT INTO @tblRatings (txt) SELECT TxtValue FROM dbo.fn_getSettingList('opt_IND_RatingLevels')
-- combine two tables above using a common index
DECLARE @tblRatingScores TABLE (val int, txt nvarchar(128))
INSERT INTO @tblRatingScores SELECT s.val, r.txt FROM @tblScores s JOIN @tblRatings r ON s.idx = r.idx
-- reduce table rows above to find score thresholds for each rating level
DECLARE @tblRatingBands TABLE (idx int identity, score int not null, rating nvarchar(128))
INSERT INTO @tblRatingBands
SELECT rs.val, rs.txt FROM @tblRatingScores rs
INNER JOIN (SELECT MIN(val) as val, txt FROM @tblRatingScores GROUP BY txt) AS x ON rs.txt = x.txt AND rs.val = x.val
ORDER BY rs.val
QUESTION: Is there an elegant query I can run against the IND_HISTORIES table that will return records where the most recent rating level for an INDICATOR is above the second-most recent rating level?
UPDATE: To clarify, INDICATORS is not used in the calculation - it's a parent table that holds general information of the performance measure and current 'volatile' scores. Scores are saved to IND_HISTORY periodically - this provides point-in-time 'snapshots' of data, helping to establish score trends.
I'm looking to query the IND_HISTORY table, to find where the most recent 'snapshot' value of an indicator is higher than its second-most recent 'snapshot' value. (It would be ideal to also join the Rating Levels table, as described above, in the determination, so that rows are only returned if the score increase results in a Rating Level increase.)
Any solution should be compatible with SQL Server 2005.
I've implemented the below, which seems to work. But I'd be interested to hear any recommendations to optimize or consolidate.
First, I realize that I do not need the last temp table #tblRatingBands constructed above. Instead, I simply select matching text ratings from #tblRatingScores in my first query set below.
Then in the final query, I check if the score value has increased and if the rating text has changed -- this indicates the trend score has increased and resulted in a change to the rating level.
DECLARE @tblTrendScores TABLE (indId int not null, ih_date datetime, rowNo int, ih_score int, rating nvarchar(128));
-- number each indicator's history rows from newest (1) to oldest
WITH LastTwoScores AS (
    SELECT fk_IndId,
           DateSaved,
           ROW_NUMBER() OVER (PARTITION BY fk_IndId ORDER BY DateSaved DESC) AS RowNo,
           Score
    FROM Ind_History
)
INSERT INTO @tblTrendScores
SELECT *,
       (SELECT txt FROM @tblRatingScores WHERE val = Score)
FROM LastTwoScores
WHERE RowNo BETWEEN 1 AND 2
ORDER BY fk_IndId, RowNo;
-- compare the newest row (a) with the second-newest row (b) of the same indicator
SELECT a.indId,
       a.ih_date,
       CASE WHEN ((a.ih_score > IsNull(b.ih_score, 0)) AND (a.rating <> IsNull(b.rating, 'none'))) THEN 'Up'
            WHEN ((a.ih_score < IsNull(b.ih_score, 0)) AND (a.rating <> IsNull(b.rating, 'none'))) THEN 'Down'
            ELSE 'no-change'
       END AS TrendRatingChange
FROM @tblTrendScores a
JOIN @tblTrendScores b ON a.indId = b.indId AND b.rowNo = 2
WHERE a.rowNo = 1;
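As one possible consolidation, the intermediate table can be dropped by joining the ranked CTE to itself. The sketch below sticks to SQL Server 2005 syntax. Its assumptions are loud ones: @Ind_History is a hypothetical stand-in for the history table with made-up rows, and the rating band is computed as (Score - 1) / 3, which hard-codes the 1-3 / 4-6 / 7-9 levels described in the question instead of reading them from the settings tables.

```sql
-- Hypothetical stand-in for the history table, with made-up rows.
DECLARE @Ind_History TABLE (fk_IndId int, Score int, DateSaved datetime);
INSERT INTO @Ind_History (fk_IndId, Score, DateSaved)
SELECT 1, 3, '20200101' UNION ALL
SELECT 1, 4, '20200401' UNION ALL   -- Low -> Avg: rating level went up
SELECT 2, 4, '20200101' UNION ALL
SELECT 2, 5, '20200401' UNION ALL   -- Avg -> Avg: score up, same level
SELECT 3, 8, '20200101' UNION ALL
SELECT 3, 2, '20200401';            -- High -> Low: rating level went down

WITH Ranked AS (
    SELECT fk_IndId,
           DateSaved,
           Score,
           (Score - 1) / 3 AS Band,   -- integer division: 0 = Low, 1 = Avg, 2 = High (assumed bands)
           ROW_NUMBER() OVER (PARTITION BY fk_IndId ORDER BY DateSaved DESC) AS RowNo
    FROM @Ind_History
)
SELECT cur.fk_IndId,
       cur.DateSaved,
       prev.Score AS PreviousScore,
       cur.Score  AS CurrentScore
FROM Ranked AS cur
INNER JOIN Ranked AS prev
    ON prev.fk_IndId = cur.fk_IndId
   AND prev.RowNo = 2               -- second-most recent snapshot
WHERE cur.RowNo = 1                 -- most recent snapshot
  AND cur.Band > prev.Band;         -- only rows whose rating level increased
-- With the sample rows above, only indicator 1 is returned.
```

If the bands must stay configurable, the Band expression could be replaced by a join to the rating table built earlier.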

T-SQL Grouping Sets of Information

I have a problem which my limited SQL knowledge is keeping me from understanding.
First the problem:
I have a database which I need to run a report on; it contains the configurations of users' entitlements. The report needs to show a distinct list of these configurations and a count against each one.
So a line in my DB looks like this:
USER_ID SALE_ITEM_ID SALE_ITEM_NAME PRODUCT_NAME CURRENT_LINK_NUM PRICE_SHEET_ID
37715 547 CultFREE CultPlus 0 561
The above line is one row of a user's configuration; for every user ID there can be 1-5 of these lines. So the definition of a configuration is multiple rows of data sharing a common user ID with variable attributes.
I need to get a distinct list of these configurations across the whole table, leaving me with just one configuration set for every case where more than one user has that configuration, plus a count of the instances of that configuration.
Hope this is clear?
Any ideas?!?!
I have tried various group by's and unions, also the grouping sets function to no avail.
I will be very grateful if anyone can give me some pointers!
Ouch that hurt ...
Ok so problem:
a row represents a configurable line
users may be linked to more than 1 row of configuration
configuration rows when grouped together form a configuration set
we want to figure out all of the distinct configuration sets
we want to know what users are using them.
Solution (it's a bit messy but the idea is there; copy and paste into SQL Server Management Studio) ...
-- ok so i imported the data to a table named SampleData ...
-- 1. import the data
-- 2. add a new column
-- 3. select all the values of the config in to the new column (Configuration_id)
--UPDATE [dbo].[SampleData]
--SET [Configuration_ID] = SALE_ITEM_ID + SALE_ITEM_NAME + [PRODUCT_NAME] + [CURRENT_LINK_NUM] + [PRICE_SHEET_ID] + [Configuration_ID]
-- 4. i then selected just the distinct values of those and found 6 distinct Configuration_id's
--SELECT DISTINCT [Configuration_ID] FROM [dbo].[SampleData]
-- 5. to make them a bit easier to read and work with i gave them int values instead
-- for me it was easy to do this manually but you might wanna do some trickery here to autonumber them or something
-- basic idea is to run the step 4 statement but select into a new table then add a new primary key column and set identity spec on it
-- that will generate u a bunch of incremental numbers for your config id's so u can then do something like ...
--UPDATE [dbo].[SampleData] sd
--SET Configuration_ID = (SELECT ID FROM TempConfigTable WHERE Config_ID = sd.Configuration_ID)
-- at this point you have all your existing rows with a unique ident for the values combined in each row.
-- so for example in my dataset i have several rows where only the user_id has changed but all look like this ...
--SALE_ITEM_ID SALE_ITEM_NAME PRODUCT_NAME CURRENT_LINK_NUM PRICE_SHEET_ID Configuration_ID
--54101 TravelFREE TravelPlus 0 56101 1
-- now you have a config id you can start to work on building sets up ...
-- each user is now matched with 1 or more config id
-- 6. we use a CTE (common table expression) to link the possibles (keeps the join small) ...
--WITH Temp (ConfigID)
--AS
--(
-- SELECT DISTINCT SD.Configuration_Id --SD2.Configuration_Id, SD3.Configuration_Id, SD4.Configuration_Id, SD5.Configuration_Id,
-- FROM [dbo].[SampleData] SD
--)
-- this extracts all the possible combinations using the CTE
-- on the basis of what you told me, max rows per user is 6, in the result set i have i only have 5 distinct configs
-- meaning i gain nothing by doing a 6th join.
-- cross joins basically give you every combination of unique values from the 2 tables but we joined back on the same table
-- so its every possible combination of Temp + Temp (ConfigID + ConfigID) ... per cross join so with 5 joins its every combination of
-- Temp + Temp + Temp + Temp + Temp .. good job temp only has 1 column with 5 values in it
-- 7. uncomment both this and the CTE above ... need to use them together
--SELECT DISTINCT T.ConfigID C1, T2.ConfigID C2, T3.ConfigID C3, T4.ConfigID C4, T5.ConfigID C5
--INTO [SETS]
--FROM Temp T
--CROSS JOIN Temp T2
--CROSS JOIN Temp T3
--CROSS JOIN Temp T4
--CROSS JOIN Temp T5
-- notice the INTO clause ... this dumps me out a new [SETS] table in my db
-- if i go add a primary key to this and set its ident spec i now have unique set id's
-- for each row in the table.
--SELECT *
--FROM [dbo].[SETS]
-- now here's where it gets interesting ... row 1 defines a set as being config id 1 and nothing else
-- row 2 defines set 2 as being config 1 and config 2 and nothing else ... and so on ...
-- the problem here of course is that 1,2,1,1,1 is technically the same set as 1,1,1,2,1 from our point of view
-- ok lets assign a set to each userid ...
-- 8. first we pull the distinct id's out ...
--SELECT DISTINCT USER_ID usr, null SetID
--INTO UserSets
--FROM SampleData
-- now we need to do bit a of operating on these that's a bit much for a single update or select so ...
-- 9. process findings in a loop
DECLARE @currentUser int
DECLARE @set int
-- while theres a userid not linked to a set
WHILE EXISTS (SELECT 1 FROM UserSets WHERE SetId IS NULL)
BEGIN
    -- pick the next userid that has no set yet
    SELECT TOP 1 @currentUser = usr FROM UserSets WHERE SetId IS NULL
    -- figure out a set to link it to
    SET @set = (
        SELECT TOP 1 ID
        FROM [SETS]
        -- shouldn't really do this ... basically need to refactor in to a table variable then compare to that
        -- that way the table lookup on ur main data is only 1 per User_id
        WHERE C1 IN (SELECT DISTINCT Configuration_id FROM SampleData WHERE USER_ID = @currentUser)
        AND C2 IN (SELECT DISTINCT Configuration_id FROM SampleData WHERE USER_ID = @currentUser)
        AND C3 IN (SELECT DISTINCT Configuration_id FROM SampleData WHERE USER_ID = @currentUser)
        AND C4 IN (SELECT DISTINCT Configuration_id FROM SampleData WHERE USER_ID = @currentUser)
        AND C5 IN (SELECT DISTINCT Configuration_id FROM SampleData WHERE USER_ID = @currentUser)
    )
    -- hopefully that worked
    IF(@set IS NOT NULL)
    BEGIN
        -- tell the usersets table
        UPDATE UserSets SET SetId = @set WHERE usr = @currentUser
        SET @set = null
    END
    ELSE -- something went wrong ... set to 0 to prevent endless loop but any userid linked to set 0 is a problem u need to look at
        UPDATE UserSets SET SetId = 0 WHERE usr = @currentUser
    -- and round we go again ... until we are done
END
SELECT
    USER_ID,
    SALE_ITEM_ID, ETC...,
    COUNT(*) WhateverYouWantToNameCount
FROM TableNAme
GROUP BY USER_ID, SALE_ITEM_ID, ETC...
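A different, set-based way to get at the configuration sets is to collapse each user's rows into one ordered signature string and then group by that signature. This is a sketch using a swapped-in technique, not the method from either answer above: it relies on STRING_AGG, which needs SQL Server 2017 or later, #SampleData is a hypothetical cut-down table with made-up rows, and only two of the configuration columns are used to keep it short.

```sql
-- Hypothetical cut-down version of the data: two attributes per configuration row.
CREATE TABLE #SampleData (USER_ID int, SALE_ITEM_ID int, PRICE_SHEET_ID int);
INSERT INTO #SampleData (USER_ID, SALE_ITEM_ID, PRICE_SHEET_ID)
VALUES (1, 547, 561), (1, 548, 561),
       (2, 548, 561), (2, 547, 561),   -- same set as user 1, rows in a different order
       (3, 547, 561);

WITH UserSignature AS (
    -- One row per user: the user's configuration rows, sorted and joined into a single string.
    SELECT USER_ID,
           STRING_AGG(CONCAT(SALE_ITEM_ID, ':', PRICE_SHEET_ID), ',')
               WITHIN GROUP (ORDER BY SALE_ITEM_ID, PRICE_SHEET_ID) AS ConfigSet
    FROM #SampleData
    GROUP BY USER_ID
)
SELECT ConfigSet, COUNT(*) AS UserCount
FROM UserSignature
GROUP BY ConfigSet
ORDER BY UserCount DESC;
-- '547:561,548:561' is shared by users 1 and 2 (count 2); '547:561' belongs to user 3 (count 1).
```

Sorting inside the aggregate is what makes 1,2 and 2,1 count as the same set. On older versions the same signature could be built with FOR XML PATH instead.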
