Pivot Query using With Clause - sql-server

I'm using MySQL to try to manipulate some data for cancer research and machine learning. It seems an ideal problem for the PIVOT statement but I can't quite get it to work and would welcome any help. If there's a better tool, like maybe R, I'm all ears too.
Let's say I have three tables, patients, samples, and mutations:
patients table has unique rows, each with a unique patient_id.
samples table has unique rows, each with a unique sample_id, but also a patient_id that can be found in the patients table. There may be multiple rows in the samples table with the same patient_id.
mutations table has NON-unique rows. Each row in the mutations table contains just two columns: gene and sample_id.
I need to create a new table, call it summary, with patient_id in the first column, sample_id, followed by a column for every distinct gene in the mutations table.
Each row of the new summary table should contain
the patient_id from the patients table,
the sample_id from the samples table,
the number 1 in each following gene column if the mutations table has a row for that gene with this sample_id, or the number 0 if not.
New summary table looks sort of like this:
patient_id  sample_id  gene A  gene B  gene C  gene D  etc
12345678    54321      1       0       0       0
23456789    65432      0       1       1       0
34567890    76543      0       0       1       0
34567890    87654      0       1       0       1
etc
The new summary table must have an entry, either a 0 or a 1, for every distinct gene found in the mutations table even if there are no entries in the mutations table that have a sample_id belonging to the patient for a specific row.
Remember, there may be multiple samples belonging to the same patient, so the summary table could contain multiple rows for a given patient - each row for a different sample.
Here's my current non-working SQL:
SELECT cs.patient_id, g.*
FROM samples cs
INNER JOIN (
SELECT *
FROM
(WITH cp AS
(SELECT * FROM
(SELECT gene FROM mutations GROUP BY gene) c
CROSS JOIN (SELECT sample_id FROM samples GROUP BY sample_id) m)
SELECT cp.gene, cp.sample_id, IFNULL(m.id,0) id
FROM cp
LEFT JOIN (SELECT gene, sample_id, 1 id FROM mutations) m on m.gene=cp.gene and m.sample_id=cp.sample_id)
PIVOT ( MAX(id) for gene in ('BAP1','PDGFRA','KRAS','CDKN1B','IDH1','ARID1A','DOT1L','NOTCH4','ABL1',
'PBRM1','MLL3','TET2','SPEN','CCND2','DDR2','RICTOR','SMAD4','GLI1','RASA1',
'MAP2K1','CSF3R','HIST1H3D','DNMT3B','CEBPA','GATA2','ARID1B','BRCA2','EPHA7',
'CTNNB1','EPHA5','EP300','RAF1','NF1','EGFR','NBN','INHA','CARD11','ANKRD11',
'ERBB3','TERT','DNMT1','ATM','RIT1','PDCD1','SMARCA4','FOXP1','DICER1','TGFBR2',
'PTPRS','FANCC','APC','NCOA3','NTRK1','PTPRD','NSD1','GRIN2A','SMARCB1','PTCH1',
'KEAP1','KDR','IRS2','PIK3R3','SUFU','STAG2','MAP3K13','SOX9','SETD2','FAT1',
'ZFHX3','NRAS','MAP3K1','ERBB4','JAK3','NF2','PGR','KDM6A','RPTOR','TP53','CIC',
'MSH2','MAP2K4','AXIN2','PTEN','XPO1','ERCC4','AXL','RNF43','DNMT3A','ERG','NOTCH2',
'RFWD2','IGF1R','GATA1','SMAD3','TMPRSS2','MLL','BRAF','TET1','BCOR','YAP1','HLA-A',
'PLCG2','CBL','IRS1','PIK3CA','POLE','LATS2','MST1','H3F3B','IRF4','AR','B2M','NCOR1',
'FUBP1','NOTCH3','ATR','RPS6KB2','TSC2','PIK3CG','MDM2','ROS1','TCF3','TSC1','FGFR2',
'FBXW7','FOXA1','MEN1','CDKN2Ap16INK4A','EPHA3','PMS1','PAK1','E2F3','PIK3CD','PLK2',
'MPL','RHEB','RBM10','ASXL2','MSH6','RAD21','BRIP1','PTPRT','GNA11','CDKN1A','RAD50',
'BRD4','STK11','ARID2','RUNX1','MTOR','JAK1','TBX3','MALT1','RYBP','MLL2','PIK3CB',
'SMO','AXIN1','MAPK3','VHL','JUN','KDM5A','ARID5B','AMER1','PPM1D','ASXL1','MLH1',
'CASP8','BARD1','DAXX','CDH1','PALB2','AKT3','RECQL4','IGF2','MED12','FLT3','HIST3H3',
'MST1R','EIF4A2','CREBBP','STAT5B','PHOX2B','BRCA1','ERBB2','MITF','RB1','CD79A',
'TMEM127','MAPK1','CDKN2A','CDKN2Ap14ARF','CSF1R','FLT4','CENPA','RPS6KA4','SRC',
'ERCC3','NEGR1','RET','ACVR1','SYK','ICOSLG','FYN','SOX17','ETV6','NTRK3','HIST1H1C',
'IDH2','CHEK1','GNAS','PPP6C','EZH2','MYCL1','SDHA','MDC1','ARAF','RAC1','KDM5C','PARP1',
'NKX2-1','CXCR4','SMAD2','IL7R','TGFBR1','U2AF1','SF3B1','FGFR4','ERRFI1','SMARCD1','FGFR1',
'EPHB1','PDPK1','FLCN','RAD54L','MGA','PPP2R1A'))
) g on g.sample_id = cs.sample_id;
Sample data text files
patients - https://drive.google.com/open?id=1NhRkHvvydmZ5ilHJ4TwKE_AslNFvCOcS
samples - https://drive.google.com/open?id=1Txdaa7JKOVMS3TZ8g9tkQUzZPkNc2m24
mutations - https://drive.google.com/open?id=1-HXEszbpcrkPX7MomJnkcAsVCKuzl-rJ

You seem to be overcomplicating this query; it can be a lot simpler. Here's an example of how to get the first 3 gene columns. You would just need to copy, paste, and replace the gene name for the rest.
SELECT s.patient_id,
s.sample_id,
MAX( CASE WHEN m.gene = 'BAP1' THEN 1 ELSE 0 END) AS BAP1,
MAX( CASE WHEN m.gene = 'PDGFRA' THEN 1 ELSE 0 END) AS PDGFRA,
MAX( CASE WHEN m.gene = 'KRAS' THEN 1 ELSE 0 END) AS KRAS
FROM samples s
LEFT JOIN mutations m ON s.sample_id = m.sample_id
GROUP BY s.patient_id,
s.sample_id;
If you'd rather not write out every column by hand, you can build the query dynamically:
DECLARE @Columns NVARCHAR(MAX),
@SQL NVARCHAR(MAX);
SELECT @Columns = ( SELECT CHAR(10) + CHAR(9) + ',MAX( CASE WHEN m.gene = ' + QUOTENAME( gene, '''') + ' THEN 1 ELSE 0 END) AS ' + QUOTENAME(gene)
FROM mutations
GROUP BY gene
FOR XML PATH(''), TYPE).value('./text()[1]', 'nvarchar(max)');
SET @SQL = N'SELECT s.patient_id ' + NCHAR(10)
+ N' ,s.sample_id '
+ @Columns + NCHAR(10)
+ N'FROM samples s ' + NCHAR(10)
+ N'LEFT JOIN mutations m ON s.sample_id = m.sample_id ' + NCHAR(10)
+ N'GROUP BY s.patient_id, ' + NCHAR(10)
+ N' s.sample_id;' + NCHAR(10);
PRINT @SQL; --For debugging purposes
EXECUTE sp_executesql @SQL; --, @ParametersDefinition, @Param1, @Param2, ..., @ParamN
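For anyone who wants to try the same idea outside SQL Server, here is a minimal sketch of the dynamic conditional-aggregation pivot in Python with the standard-library sqlite3 module. It is not the T-SQL above: the in-memory database, the helper name pivot_mutations, and the sample rows are assumptions made for illustration. Only the table and column names (samples, mutations, patient_id, sample_id, gene) come from the question.

```python
import sqlite3

def pivot_mutations(conn):
    """Return (column_names, rows) with one 0/1 column per distinct gene."""
    genes = [g for (g,) in conn.execute(
        "SELECT DISTINCT gene FROM mutations ORDER BY gene")]
    # Gene values are bound as parameters; aliases are double-quoted identifiers.
    cols = ",\n".join(
        'MAX(CASE WHEN m.gene = ? THEN 1 ELSE 0 END) AS "{}"'.format(
            g.replace('"', '""'))
        for g in genes)
    sql = ("SELECT s.patient_id, s.sample_id"
           + (",\n" + cols if cols else "")
           + "\nFROM samples s"
             "\nLEFT JOIN mutations m ON m.sample_id = s.sample_id"
             "\nGROUP BY s.patient_id, s.sample_id"
             "\nORDER BY s.patient_id, s.sample_id")
    cur = conn.execute(sql, genes)
    names = [d[0] for d in cur.description]
    return names, cur.fetchall()

# Hypothetical sample data: note the duplicate KRAS row and a sample with no mutations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, patient_id INTEGER);
CREATE TABLE mutations (gene TEXT, sample_id INTEGER);
INSERT INTO samples VALUES (54321, 12345678), (65432, 23456789), (76543, 34567890);
INSERT INTO mutations VALUES ('BAP1', 54321), ('KRAS', 65432), ('KRAS', 65432), ('TP53', 65432);
""")
names, rows = pivot_mutations(conn)
```

Because the join is a LEFT JOIN and the CASE falls through to 0, a sample with no mutations still gets a row with 0 in every gene column, and duplicate mutation rows collapse to a single 1 through MAX.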

Related

How to pivot dynamic number of columns row result into a vertical result list

There is a single row of data in a table that will have a varying number of columns named C1, C2, C3, etc. I can locate that row, but I want to unpivot the values of that dynamic number of columns into a single-column result. I have researched a ton of pivot/unpivot material, but none of the examples I've found seem to handle a dynamic number of columns in the results.
Native Results:
Col 1 Col 2 Col 3 Col 4 Col 5.... Col X
Id Name DOB City State
Desired Results:
ColumnNameTBD:
Id
Name
DOB
City
State
Thank You!
Tim
After feedback on the question I was able to piece together the solution. I now have a way to physically instantiate structure from dynamic/inconsistent structure automatically.
DECLARE @sColumnNames As Varchar(5000)
DECLARE @sSQL AS Varchar(5000);
--Flatten List of Column Names
Select @sColumnNames = ' ' + (
Select '(' + COLUMN_NAME + '),' As 'data()'
From SMAR_STG.INFORMATION_SCHEMA.COLUMNS With (NoLock)
where TABLE_NAME = 'V_ETL_CTX_COREM_EXCEL_DTL_COLUMNS'
For XML PATH('')
) + ' '
--Get rid of trailing comma
Set @sColumnNames = substring(@sColumnNames,1,len(@sColumnNames) -1)
-- Assemble SQL Statement
Set @sSQL = 'SELECT Upivot AS X
FROM SMAR_STG.DBO.V_ETL_CTX_COREM_EXCEL_DTL_COLUMNS
CROSS apply (VALUES '
Set @sSQL = @sSQL + @sColumnNames
Set @sSQL = @sSQL + ') cs (upivot) '
-- Execute
EXEC (@sSQL)
This item was already resolved here
As you will see from the linked article, you will need to use the COALESCE function to achieve this.

Dynamic Pivot with varying columns

I have a POA Code dynamic pivot that pulls data from a DX temp table and inserts the data into a temp POA table.
The issue I'm having is that there is a possibility of up to 35 different columns that can be returned. Depending on the month there could be 15 columns (POA1...POA15) or there could be all 35 columns (POA1...POA35). I join this dynamic pivot temp table on another patient table. My problem is, I need to show all 35 columns even if some of the columns do not exist in the temp POA table.
--Pivot DX POA Codes
DECLARE @POAName VARCHAR(40)
SELECT @POAName = '##tmpPOA'
DECLARE @colsPOA NVARCHAR(2000)
SELECT @colsPOA = STUFF((SELECT DISTINCT TOP 100 PERCENT
'],[' + 'POA' + CAST(Dx.RowNum AS NVARCHAR)
FROM #tmpDX DX
ORDER BY '],[' + 'POA' + CAST(Dx.RowNum AS NVARCHAR)
FOR XML PATH ('')
),1,2,'') + ']'
DECLARE @queryPOA NVARCHAR(4000)
SET @queryPOA = N'
SELECT
EncObjID,
'+
@colsPOA
+' INTO ' + @POAName + '
FROM
(SELECT
Dx.EncObjID
,''POA'' + Dx.RowNum AS RowNum
,Dx.POAMne
FROM #tmpDx Dx
) p
PIVOT
(
MIN([POAMne])
FOR RowNum IN
( ' + @colsPOA + ' )
) AS pvt'
EXECUTE(@queryPOA)
I'm receiving an Invalid Column Name in my patient query because some of the columns don't exist in ##tmpPOA. I thought about creating a temp table called #tmpDxPOA and doing an insert (Insert Into #tmpDxPOA select * from ##tmpPOA), but that doesn't work (I receive a Column Name or number of supplied values does not match error).
Any thoughts on how to create all 35 columns even if there isn't any data? I don't care if they're null, I just need to have those place holders in the main patient query and it doesn't help that the number of columns returned varies every month.
With the help of @mxix I was able to come up with the following:
DECLARE @POASQL NVARCHAR(MAX)
SET @POASQL = N'INSERT INTO #tmpPOAFinal (EncObjID,'+@colsPOA+') SELECT * FROM ##tmpPOA'
EXECUTE(@POASQL)
I put this after the EXECUTE(@queryPOA) in my main query.
For this to work with dynamic SQL, each row/column value needs to exist at least once, whether for one patient or several. I would fan out all of the POA possibilities right off the bat and then LEFT OUTER JOIN to get the actual values back.
IF OBJECT_ID('tempdb..#tmpPOA') IS NOT NULL DROP TABLE #tmpPOA
CREATE TABLE #tmpPOA (POA varchar(10))
IF OBJECT_ID('tempdb..#tmpPatient') IS NOT NULL DROP TABLE #tmpPatient
CREATE TABLE #tmpPatient (Patient varchar(15))
INSERT INTO #tmpPatient VALUES ('ABC123'),('ABC456'),('ABC789')
DECLARE @POAFlag as INT = 1 --start at 1 so the loop produces POA1...POA35
WHILE @POAFlag < 36
BEGIN
INSERT INTO #tmpPOA
VALUES('POA' +CONVERT(varchar,@POAFlag))
SET @POAFlag = @POAFlag + 1
END
END
SELECT * FROM #tmpPOA
CROSS JOIN #tmpPatient
This should fan out all 35 possible DX codes for every patient, so you can join back to get each POA flag.
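The fan-out plus the LEFT OUTER JOIN can be tried end to end with a small sketch in Python using the standard-library sqlite3 module. The tmpDx layout (Patient, POA, POAMne) and its sample rows are assumptions for illustration, and SQLite has no # temp-table prefix, so plain table names stand in for the temp tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tmpPatient (Patient TEXT);
CREATE TABLE tmpPOA (POA TEXT);
CREATE TABLE tmpDx (Patient TEXT, POA TEXT, POAMne TEXT);  -- hypothetical layout
INSERT INTO tmpPatient VALUES ('ABC123'), ('ABC456'), ('ABC789');
INSERT INTO tmpDx VALUES ('ABC123', 'POA1', 'Y'), ('ABC123', 'POA2', 'N'),
                         ('ABC456', 'POA1', 'Y');
""")
# All 35 placeholders, POA1 through POA35, whether or not any data uses them.
conn.executemany("INSERT INTO tmpPOA VALUES (?)",
                 [("POA{}".format(i),) for i in range(1, 36)])

# Every patient paired with every POA slot; missing values come back as NULL.
fanned = conn.execute("""
    SELECT p.Patient, a.POA, d.POAMne
    FROM tmpPatient p
    CROSS JOIN tmpPOA a
    LEFT JOIN tmpDx d ON d.Patient = p.Patient AND d.POA = a.POA
""").fetchall()
lookup = {(patient, poa): mne for patient, poa, mne in fanned}
```

Pivoting this fanned-out set always yields the same 35 columns, because every POA name is present for every patient even when its value is NULL.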

consolidating and Adding data to new lines to a field in a column in mysql table

I have 3 columns in my table X:
Id  State  Type
1   NJ     Form1
1   NY     Form 2
1   TX     Form 3
I want to consolidate it to one column in table Y:
Id  FormTypes
1   NJ:Form1
    NY:Form2
    TX: Form3
Is this possible to achieve???
Currently I have worked out so much:
DECLARE @NewLine as char(2) = char(13) + char (10)
UPDATE tableY
SET FormTypes =
(
select substring(
(select ':'+ [State] + ':'+ Type+ @NewLine AS 'data()'
from tableX
for xml path(''))
,3, 255)
as "MyList" )
This is giving me garbage like this:
NJ:Form1'&#x0D'; NY:Form2'&#x0D'; TX:Form3'&#x0D';
The reason for getting it in this form is due to the way it is getting read in multiple files.
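SELECT
x2.id,
STUFF((SELECT char(10)+x1.State+':'+x1.Type FROM tableX x1 WHERE x1.id=x2.id for xml path(''),TYPE).value('.','nvarchar(max)'),1,1,'') as stype
FROM tableX x2
GROUP BY x2.id
This will give you the tableY format. Reading the result through TYPE and .value() returns plain text, so the line breaks are not turned into XML entities.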
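As a rough way to check the expected shape of the result, here is a sketch of the same consolidation in Python with the standard-library sqlite3 module. It swaps in SQLite's GROUP_CONCAT for the FOR XML PATH trick (SQL Server 2017 and later offer STRING_AGG for the same job); the second Id and the helper name consolidate are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableX (Id INTEGER, State TEXT, Type TEXT);
INSERT INTO tableX VALUES (1, 'NJ', 'Form1'), (1, 'NY', 'Form2'), (1, 'TX', 'Form3'),
                          (2, 'CA', 'Form1');  -- Id 2 is a hypothetical extra row
""")

def consolidate(conn):
    """Map each Id to its 'State:Type' entries joined by newline characters."""
    cur = conn.execute(
        "SELECT Id, GROUP_CONCAT(State || ':' || Type, char(10)) "
        "FROM tableX GROUP BY Id ORDER BY Id")
    return dict(cur.fetchall())

form_types = consolidate(conn)
```

SQLite does not promise the order of entries inside GROUP_CONCAT, so treat the order of lines within each cell as unspecified in this sketch.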

Paging, sorting and filtering in a stored procedure (SQL Server)

I was looking at different ways of writing a stored procedure to return a "page" of data. This was for use with the ASP ObjectDataSource, but it could be considered a more general problem.
The requirement is to return a subset of the data based on the usual paging parameters; startPageIndex and maximumRows, but also a sortBy parameter to allow the data to be sorted. Also there are some parameters passed in to filter the data on various conditions.
One common way to do this seems to be something like this:
[Method 1]
;WITH stuff AS (
SELECT
CASE
WHEN @SortBy = 'Name' THEN ROW_NUMBER() OVER (ORDER BY Name)
WHEN @SortBy = 'Name DESC' THEN ROW_NUMBER() OVER (ORDER BY Name DESC)
WHEN @SortBy = ...
ELSE ROW_NUMBER() OVER (ORDER BY whatever)
END AS Row,
.,
.,
.,
FROM Table1
INNER JOIN Table2 ...
LEFT JOIN Table3 ...
WHERE ... (lots of things to check)
)
SELECT *
FROM stuff
WHERE (Row > @startRowIndex)
AND (Row <= @startRowIndex + @maximumRows OR @maximumRows <= 0)
ORDER BY Row
One problem with this is that it doesn't give the total count and generally we need another stored procedure for that. This second stored procedure has to replicate the parameter list and the complex WHERE clause. Not nice.
One solution is to append an extra column to the final select list, (SELECT COUNT(*) FROM stuff) AS TotalRows. This gives us the total but repeats it for every row in the result set, which is not ideal.
[Method 2]
An interesting alternative is given here (https://web.archive.org/web/20211020111700/https://www.4guysfromrolla.com/articles/032206-1.aspx) using dynamic SQL. He reckons that the performance is better because the CASE statement in the first solution drags things down. Fair enough, and this solution makes it easy to get the totalRows and slap it into an output parameter. But I hate coding dynamic SQL. All that 'bit of SQL ' + STR(@parm1) +' bit more SQL' gubbins.
[Method 3]
The only way I can find to get what I want, without repeating code which would have to be synchronized, and keeping things reasonably readable is to go back to the "old way" of using a table variable:
DECLARE @stuff TABLE (Row INT, ...)
INSERT INTO @stuff
SELECT
CASE
WHEN @SortBy = 'Name' THEN ROW_NUMBER() OVER (ORDER BY Name)
WHEN @SortBy = 'Name DESC' THEN ROW_NUMBER() OVER (ORDER BY Name DESC)
WHEN @SortBy = ...
ELSE ROW_NUMBER() OVER (ORDER BY whatever)
END AS Row,
.,
.,
.,
FROM Table1
INNER JOIN Table2 ...
LEFT JOIN Table3 ...
WHERE ... (lots of things to check)
SELECT *
FROM @stuff
WHERE (Row > @startRowIndex)
AND (Row <= @startRowIndex + @maximumRows OR @maximumRows <= 0)
ORDER BY Row
(Or a similar method using an IDENTITY column on the table variable).
Here I can just add a SELECT COUNT on the table variable to get the totalRows and put it into an output parameter.
I did some tests, and with a fairly simple version of the query (no sortBy and no filter), method 1 seems to come out on top (almost twice as quick as the other two). Then I decided that to test properly I needed the complexity, and I needed the SQL to be in stored procedures. With this, method 1 takes nearly twice as long as the other two methods, which seems strange.
Is there any good reason why I shouldn't spurn CTEs and stick with method 3?
UPDATE - 15 March 2012
I tried adapting Method 1 to dump the page from the CTE into a temporary table so that I could extract the TotalRows and then select just the relevant columns for the resultset. This seemed to add significantly to the time (more than I expected). I should add that I'm running this on a laptop with SQL Server Express 2008 (all that I have available) but still the comparison should be valid.
I looked again at the dynamic SQL method. It turns out I wasn't really doing it properly (just concatenating strings together). I set it up as in the documentation for sp_executesql (with a parameter description string and parameter list) and it's much more readable. Also this method runs fastest in my environment. Why that should be still baffles me, but I guess the answer is hinted at in Hogan's comment.
I would most likely split the @SortBy argument into two, @SortColumn and @SortDirection, and use them like this:
…
ROW_NUMBER() OVER (
ORDER BY CASE @SortColumn
WHEN 'Name' THEN Name
WHEN 'OtherName' THEN OtherName
…
END *
CASE @SortDirection
WHEN 'DESC' THEN -1
ELSE 1
END
) AS Row
…
And this is how the TotalRows column could be defined (in the main select):
…
COUNT(*) OVER () AS TotalRows
…
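To see the COUNT(*) OVER () idea working together with the row-number filter, here is a small sketch in Python using the standard-library sqlite3 module. It assumes the bundled SQLite is 3.25 or newer (the first version with window functions); the people table, its rows, and the fetch_page helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (Name TEXT)")
conn.executemany("INSERT INTO people VALUES (?)",
                 [(n,) for n in ["A", "B", "C", "D", "E", "F", "G"]])

def fetch_page(conn, start_row_index, maximum_rows):
    """Return one page of (Name, TotalRows), filtering on the row number."""
    return conn.execute("""
        SELECT Name, TotalRows
        FROM (SELECT Name,
                     ROW_NUMBER() OVER (ORDER BY Name) AS rn,
                     COUNT(*) OVER () AS TotalRows   -- total before paging
              FROM people)
        WHERE rn > ? AND rn <= ? + ?
        ORDER BY rn
    """, (start_row_index, start_row_index, maximum_rows)).fetchall()

page = fetch_page(conn, 3, 3)  # skip the first 3 rows, return the next 3
```

The total is computed over the whole filtered set before the page filter is applied, so every row of the page carries the same TotalRows value and no second counting query is needed.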
I would definitely want to do a combination of a temp table and NTILE for this sort of approach.
The temp table will allow you to do your complicated series of conditions just once. Because you're only storing the pieces you care about, it also means that when you start doing selects against it further in the procedure, it should have a smaller overall memory usage than if you ran the condition multiple times.
I like NTILE() for this better than ROW_NUMBER() because it's doing the work you're trying to accomplish for you, rather than having additional where conditions to worry about.
The example below is one based off a similar query I'm using as part of a research query; I have an ID I can use that I know will be unique in the results. Using an ID that was an identity column would also be appropriate here, though.
--DECLARES here would be stored procedure parameters
declare @pagesize int, @sortby varchar(25), @page int = 1;
--Create temp with all relevant columns; ID here could be an identity PK to help with paging query below
create table #temp (id int not null primary key clustered, status varchar(50), lastname varchar(100), startdate datetime);
--Insert into #temp based off of your complex conditions, but with no attempt at paging
insert into #temp
(id, status, lastname, startdate)
select id, status, lastname, startdate
from Table1 ...etc.
where ...complicated conditions
SET @pagesize = 50;
SET @page = 5;--OR CAST(@startRowIndex/@pagesize as int)+1
SET @sortby = 'name';
--Only use the id and count to use NTILE
;with paging(id, pagenum, totalrows) as
(
select id,
NTILE((SELECT COUNT(*) cnt FROM #temp)/@pagesize) OVER(ORDER BY CASE WHEN @sortby = 'NAME' THEN lastname ELSE convert(varchar(10), startdate, 112) END),
cnt
FROM #temp
cross apply (SELECT COUNT(*) cnt FROM #temp) total
)
--Use the id to join back to main select
SELECT *
FROM paging
JOIN #temp ON paging.id = #temp.id
WHERE paging.pagenum = @page
--Don't need the drop in the procedure, included here for rerunnability
drop table #temp;
I generally prefer temp tables over table variables in this scenario, largely so that there are definite statistics on the result set you have. (Search for temp table vs table variable and you'll find plenty of examples as to why)
Dynamic SQL would be most useful for handling the sorting method. Using my example, you could do the main query in dynamic SQL and only pull the sort method you want to pull into the OVER().
The example above also returns the total in each row of the result set, which as you mentioned was not ideal. You could, instead, have a @totalrows output variable in your procedure and return it alongside the result set. That would save you the CROSS APPLY that I'm doing above in the paging CTE.
I would create one procedure to stage, sort, and paginate (using NTILE()) a staging table; and a second procedure to retrieve by page. This way you don't have to run the entire main query for each page.
This example queries AdventureWorks.HumanResources.Employee:
--------------------------------------------------------------------------
create procedure dbo.EmployeesByMartialStatus
@MaritalStatus nchar(1)
, @sort varchar(20)
as
-- Init staging table
if exists(
select 1 from sys.objects o
inner join sys.schemas s on s.schema_id=o.schema_id
and s.name='Staging'
and o.name='EmployeesByMartialStatus'
where type='U'
)
drop table Staging.EmployeesByMartialStatus;
-- Populate staging table with sort value
with s as (
select *
, sr=ROW_NUMBER()over(order by case @sort
when 'NationalIDNumber' then NationalIDNumber
when 'ManagerID' then ManagerID
-- plus any other sort conditions
else EmployeeID end)
from AdventureWorks.HumanResources.Employee
where MaritalStatus=@MaritalStatus
)
select *
into #temp
from s;
-- And now pages
declare @RowCount int; select @RowCount=COUNT(*) from #temp;
declare @PageCount int=ceiling(@RowCount/20.0); --assuming 20 lines/page; 20.0 avoids integer division
select *
, Page=NTILE(@PageCount)over(order by sr)
into Staging.EmployeesByMartialStatus
from #temp;
go
--------------------------------------------------------------------------
-- procedure to retrieve selected pages
create procedure EmployeesByMartialStatus_GetPage
@page int
as
declare @MaxPage int;
select @MaxPage=MAX(Page) from Staging.EmployeesByMartialStatus;
set @page=case when @page not between 1 and @MaxPage then 1 else @page end;
select EmployeeID,NationalIDNumber,ContactID,LoginID,ManagerID
, Title,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours
, CurrentFlag,rowguid,ModifiedDate
from Staging.EmployeesByMartialStatus
where Page=@page
GO
--------------------------------------------------------------------------
-- Usage
-- Load staging
exec dbo.EmployeesByMartialStatus 'M','NationalIDNumber';
-- Get pages 1 through n
exec dbo.EmployeesByMartialStatus_GetPage 1;
exec dbo.EmployeesByMartialStatus_GetPage 2;
-- ...etc (this would actually be a foreach loop, but that detail is omitted for brevity)
GO
I use this method of using EXEC():
-- SP parameters:
-- @query: Your query as an input parameter
-- @maximumRows: The number of rows per page
-- @startPageIndex: The page number to return
-- @sortBy: A field name or field names, optionally followed by the DESC keyword
DECLARE @query nvarchar(max) = 'SELECT * FROM sys.Objects',
@maximumRows int = 8,
@startPageIndex int = 3,
@sortBy as nvarchar(100) = 'name Desc'
SET @query = ';WITH CTE AS (' + @query + ')' +
'SELECT *, (dt.pagingRowNo - 1) / ' + CAST(@maximumRows as nvarchar(10)) + ' + 1 As pagingPageNo' +
', (pagingCountRow + ' + CAST(@maximumRows as nvarchar(10)) + ' - 1) / ' + CAST(@maximumRows as nvarchar(10)) + ' As pagingCountPage ' +
', (dt.pagingRowNo - 1) % ' + CAST(@maximumRows as nvarchar(10)) + ' + 1 As pagingRowInPage ' +
'FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY ' + @sortBy + ') As pagingRowNo, COUNT(*) OVER () AS pagingCountRow ' +
'FROM CTE) dt ' +
'WHERE (dt.pagingRowNo - 1) / ' + CAST(@maximumRows as nvarchar(10)) + ' + 1 = ' + CAST(@startPageIndex as nvarchar(10))
EXEC(@query)
The result set contains the columns of your query followed by some extra paging columns, which you can remove if you don't need them:
pagingRowNo : The row number
pagingCountRow : The total number of rows
pagingPageNo : The current page number
pagingCountPage : The total number of pages
pagingRowInPage : The row number that started with 1 in this page
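The arithmetic behind these paging columns is easy to check in isolation. Below is a minimal Python sketch; the function name paging_columns is hypothetical. Note that the page count is rounded up, because plain integer division would drop a partially filled last page.

```python
def paging_columns(row_no, total_rows, page_size):
    """Compute (page number, page count, row within page) for a 1-based row number."""
    page_no = (row_no - 1) // page_size + 1
    page_count = (total_rows + page_size - 1) // page_size  # ceiling division
    row_in_page = (row_no - 1) % page_size + 1
    return page_no, page_count, row_in_page

# Row 17 of 100 rows, 8 rows per page: first row of page 3, out of 13 pages.
first_on_page_3 = paging_columns(17, 100, 8)
last_on_page_3 = paging_columns(24, 100, 8)
```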

How do I easily find IDENTITY columns in danger of overflowing?

My database is getting old, and one of my biggest INT IDENTITY columns has a value around 1.3 billion. This will overflow around 2.1 billion. I plan on increasing its size, but I don't want to do it too soon because of the number of records in the database. I may replace my database hardware before I increase the column size, which could offset any performance problems this could cause. I also want to keep an eye on all the other columns in my databases that are more than 50% full. It's a lot of tables, and checking each one manually is not practical.
This is how I am getting the value now (I know the value returned may be slightly out of date, but it's good enough for my purposes):
PRINT IDENT_CURRENT('MyDatabase.dbo.MyTable')
Can I use the INFORMATION_SCHEMA to get this information?
You can consult the sys.identity_columns system catalog view:
SELECT
name,
seed_value, increment_value, last_value
FROM sys.identity_columns
This gives you the name, seed, increment and last value for each column. The view also contains the data type, so you can easily figure out which identity columns might be running out of numbers soonish...
I created a stored procedure to solve this problem. It uses the INFORMATION_SCHEMA to find the IDENTITY columns, and then uses IDENT_CURRENT and the column's DATA_TYPE to calculate the percent full. Specify the database as the first parameter, and then optionally the minimum percent and data type.
EXEC master.dbo.CheckIdentityColumns 'MyDatabase' --all
EXEC master.dbo.CheckIdentityColumns 'MyDatabase', 50 --columns 50% full or greater
EXEC master.dbo.CheckIdentityColumns 'MyDatabase', 50, 'int' --only int columns
Example output:
Table Column Type Percent Full Remaining
------------------------- ------------------ ------- ------------ ---------------
MyDatabase.dbo.Table1 Table1ID int 9 1,937,868,393
MyDatabase.dbo.Table2 Table2ID int 5 2,019,944,894
MyDatabase.dbo.Table3 Table3ID int 9 1,943,793,775
I created a reminder to check all my databases once per month, and I log this information in a spreadsheet.
CheckIdentityColumns Procedure
USE master
GO
CREATE PROCEDURE dbo.CheckIdentityColumns
(
@Database AS NVARCHAR(128),
@PercentFull AS TINYINT = 0,
@Type AS VARCHAR(8) = NULL
)
AS
--this procedure assumes you are not using negative numbers in your identity columns
DECLARE @Sql NVARCHAR(3000)
SET @Sql =
'USE ' + @Database + '
SELECT
[Column].TABLE_CATALOG + ''.'' +
[Column].TABLE_SCHEMA + ''.'' +
[Table].TABLE_NAME AS [Table],
[Column].COLUMN_NAME AS [Column],
[Column].DATA_TYPE AS [Type],
CAST((
CASE LOWER([Column].DATA_TYPE)
WHEN ''tinyint''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / 255)
WHEN ''smallint''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / 32767)
WHEN ''int''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / 2147483647)
WHEN ''bigint''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / 9223372036854775807)
WHEN ''decimal''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / (POWER(CAST(10 AS DECIMAL(38,0)), [Column].NUMERIC_PRECISION) - 1))
END * 100) AS INT) AS [Percent Full],
REPLACE(CONVERT(VARCHAR(19), CAST(
CASE LOWER([Column].DATA_TYPE)
WHEN ''tinyint''
THEN (255 - IDENT_CURRENT([Table].TABLE_NAME))
WHEN ''smallint''
THEN (32767 - IDENT_CURRENT([Table].TABLE_NAME))
WHEN ''int''
THEN (2147483647 - IDENT_CURRENT([Table].TABLE_NAME))
WHEN ''bigint''
THEN (9223372036854775807 - IDENT_CURRENT([Table].TABLE_NAME))
WHEN ''decimal''
THEN ((POWER(CAST(10 AS DECIMAL(38,0)), [Column].NUMERIC_PRECISION) - 1) - IDENT_CURRENT([Table].TABLE_NAME))
END
AS MONEY) , 1), ''.00'', '''') AS Remaining
FROM
INFORMATION_SCHEMA.COLUMNS AS [Column]
INNER JOIN
INFORMATION_SCHEMA.TABLES AS [Table]
ON [Table].TABLE_NAME = [Column].TABLE_NAME
WHERE
COLUMNPROPERTY(
OBJECT_ID([Column].TABLE_NAME),
[Column].COLUMN_NAME, ''IsIdentity'') = 1 --true
AND [Table].TABLE_TYPE = ''Base Table''
AND [Table].TABLE_NAME NOT LIKE ''dt%''
AND [Table].TABLE_NAME NOT LIKE ''MS%''
AND [Table].TABLE_NAME NOT LIKE ''syncobj_%''
AND CAST(
(
CASE LOWER([Column].DATA_TYPE)
WHEN ''tinyint''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / 255)
WHEN ''smallint''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / 32767)
WHEN ''int''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / 2147483647)
WHEN ''bigint''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / 9223372036854775807)
WHEN ''decimal''
THEN (IDENT_CURRENT([Table].TABLE_NAME) / (POWER(CAST(10 AS DECIMAL(38,0)), [Column].NUMERIC_PRECISION) - 1))
END * 100
) AS INT) >= ' + CAST(@PercentFull AS VARCHAR(4))
IF (@Type IS NOT NULL)
SET @Sql = @Sql + ' AND LOWER([Column].DATA_TYPE) = ''' + LOWER(@Type) + ''''
SET @Sql = @Sql + '
ORDER BY
[Column].TABLE_CATALOG + ''.'' +
[Column].TABLE_SCHEMA + ''.'' +
[Table].TABLE_NAME,
[Column].COLUMN_NAME'
EXECUTE sp_executesql @Sql
GO
Keith Walton has a very comprehensive query that is very good. Here's a little simpler one that is based on the assumption that the identity columns are all integers:
SELECT sys.tables.name AS [Table Name],
last_value AS [Last Value],
MAX_LENGTH,
CAST(cast(last_value as int) / 2147483647.0 * 100.0 AS DECIMAL(5,2))
AS [Percentage of ID's Used],
2147483647 - cast(last_value as int) AS Remaining
FROM sys.identity_columns
INNER JOIN sys.tables
ON sys.identity_columns.object_id = sys.tables.object_id
ORDER BY last_value DESC
The results will look like this:
Table Name Last Value MAX_LENGTH Percentage of ID's Used Remaining
My_Table 49181800 4 2.29 2098301847
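The percent-full and remaining figures in these answers come down to simple arithmetic against each type's maximum value. Here is a sketch in Python, assuming positive identity values only (as the stored procedure above does); the function names are hypothetical, and the decimal case assumes scale 0, so the maximum is 10 to the power of the precision, minus 1.

```python
# Largest positive value each SQL Server integer type can hold.
MAX_VALUES = {
    "tinyint": 255,
    "smallint": 32767,
    "int": 2**31 - 1,
    "bigint": 2**63 - 1,
}

def max_identity_value(data_type, precision=None):
    """Return the largest identity value for a type; decimal needs its precision."""
    data_type = data_type.lower()
    if data_type in ("decimal", "numeric"):
        if precision is None:
            raise ValueError("precision is required for decimal/numeric")
        return 10**precision - 1
    return MAX_VALUES[data_type]

def percent_full(last_value, data_type, precision=None):
    """Share of the positive range already used, as a percentage."""
    return 100.0 * last_value / max_identity_value(data_type, precision)

def remaining(last_value, data_type, precision=None):
    """How many identity values are left before overflow."""
    return max_identity_value(data_type, precision) - last_value

# The sample row above: last value 49181800 in an int column.
used = round(percent_full(49181800, "int"), 2)
left = remaining(49181800, "int")
```

With the last value from the sample output, this reproduces the 2.29 percent used and 2098301847 remaining shown in the results table.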
Checking Integer Identity Columns
While crafting a solution for this problem, we found this thread both informative and interesting (we also wrote a detailed post about this and described how our tool works).
In our solution we're querying the information_schema to acquire a list of
all columns. Then we wrote a program that would go through each of them and compute the maximum and minimum (we account for both overflow and underflow).
SELECT
b.COLUMN_NAME,
b.COLUMN_TYPE,
b.DATA_TYPE,
b.signed,
a.TABLE_NAME,
a.TABLE_SCHEMA
FROM (
-- get all tables
SELECT
TABLE_NAME, TABLE_SCHEMA
FROM information_schema.tables
WHERE
TABLE_TYPE IN ('BASE TABLE', 'VIEW') AND
TABLE_SCHEMA NOT IN ('mysql', 'performance_schema')
) a
JOIN (
-- get information about columns types
SELECT
TABLE_NAME,
COLUMN_NAME,
COLUMN_TYPE,
TABLE_SCHEMA,
DATA_TYPE,
(!(LOWER(COLUMN_TYPE) REGEXP '.*unsigned.*')) AS signed
FROM information_schema.columns
) b ON a.TABLE_NAME = b.TABLE_NAME AND a.TABLE_SCHEMA = b.TABLE_SCHEMA
ORDER BY a.TABLE_SCHEMA DESC;
