Select unique values from every column in every table - sql-server

Is there a way to get a count of distinct values from every table and column in SQL Server.
I've tried using a cursor for this, but that seems insufficient.

I've got to agree with Sean and say that this is going to be horrifically slow, but if you really want to do it, then I'm not going to stop you.
Something like this could be used as a starting point if you specifically don't want to use a cursor. This took just under a minute to look at a small database I've got with 10 tables in it. The largest table has just a few million rows in it. No matter what, you're going to be doing some sort of iteration, whether that's a cursor or explicitly reading against the table for each column.
Also, if you want to do something like this, you'll likely need to accommodate for things... like you're not going to be able to use COUNT on xml columns. Like I said, it's a starting point.
DECLARE #cmd VARCHAR(MAX)
SELECT #cmd =
STUFF (
(
SELECT
' union SELECT ''['+ SCHEMA_NAME(st.schema_id) + '].[' + st.name +']'' as [Object], ''[' + sc.name + ']'' as [Column], COUNT(distinct [' + sc.name + ']) as [Count] FROM [' + SCHEMA_NAME(st.schema_id) + '].[' + st.name + ']'
FROM sys.tables st
JOIN sys.columns sc
ON sc.object_id = st.object_id
JOIN sys.dm_db_partition_stats ddps
ON ddps.object_id = sc.object_id
WHERE
ddps.row_count > 0
FOR XML PATH('')
),1,6,''
)
EXECUTE (#cmd)

Related

Implementing geometry_columns view in MS SQL Server

(We're using MSSQL Server 2014 as far as I know)
I have never seen a good solution for maintaining a geometry_columns table in MSSQL Server. https://gis.stackexchange.com/questions/71558 never got figured out, and even if it did, the PostGIS approach of using a view (rather than a table) is a much better solution.
With that said, I can't seem to figure out how to implement the basics of how this might work.
The basic schema of the geometry_columns view - from PostGIS is:
(the DDL is a bit more complicated, but can be provided if need be)
MS SQL Server will allow you to query your information_schema table to show tables with a 'geometry' data type:
select *
FROM information_schema.columns
where data_type = 'geometry'
I'm imagining the geometry_columns view could be defined with something similar to the following, but I can't figure out how to get the information about the geometry columns to populate in the query:
SELECT
TABLE_CATALOG as f_table_catalog
, TABLE_SCHEMA as f_table_schema
, table_name as f_table_name
, COLUMN_NAME as f_geometry_column
/*how to deal with these in view?
, geometry_column.STDimension() as coord_dimension
, geometry_column.STSrid as srid
, geometry_column.STGeometryType() as type
*/
FROM information_schema.columns where data_type = 'geometry'
I'm hung up as to how the three ST operators can dynamically report the dimension, srid, and geometry type in the view when trying to query from the information_schema table. Perhaps this is a SQL problem more than anything, but I can't wrap my head around it for some reason.
Here's what the PostGIS geometry columns table looks like:
Also please let me know if this question a) could be asked differently because it is a general SQL question and/or b) it belongs on another forum (GIS.SE didn't have an answer, as I believe this is more on the database side than spatial/GIS)
Based on a little reading, it seems that PostGIS - as befits a dedicated GIS system - is a little more clever than SQL Server, when it comes to geometry columns. It looks like in PostGIS you can say that a particular geometry column will only ever contain, say, a POINT, or a LINESTRING. This is how the geometry_columns view can then be more specific about the columns it is describing.
I don't believe it is possible to readily constrain a SQL Server geometry in this way (triggers or constraints might allow, but would be messy). PostGIS can have a general geometry column with no further restriction. Let's suppose you're happy for your SQL Server geometry_columns view to return the dimension, SRID, and type based on an arbitrary row of data.
We can get the column metadata out of the catalog views, but I think the only way to do the necessary querying to also get the geometry metadata is with dynamic SQL. This rules out views and functions. I can do you a stored procedure though:
CREATE PROCEDURE GetGeometryColumns
AS
BEGIN
DECLARE #sql nvarchar(max);
SET #sql = ( SELECT
STUFF((
SELECT ' UNION ALL ' + Query
FROM
( SELECT
'SELECT ''' + s.name + ''' SchemaName'
+ ', ''' + t.name + ''' TableName'
+ ', ''' + c.name + ''' ColumnName'
+ ', ( SELECT TOP (1) ' + c.name + '.STDimension() FROM ' + s.name + '.' + t.name + ') Dimension'
+ ', ( SELECT TOP (1) ' + c.name + '.STSrid FROM ' + s.name + '.' + t.name + ') SRID'
+ ', ( SELECT TOP (1) ' + c.name + '.STGeometryType() FROM ' + s.name + '.' + t.name + ') GeometryType'
AS Query
FROM
sys.schemas s
INNER JOIN sys.tables t ON s.schema_id = t.schema_id
INNER JOIN sys.columns c on t.object_id = c.object_id
WHERE
system_type_id = 240
) GeometryColumn
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 10, '')
);
EXEC ( #sql );
END
This builds a SQL statement which is a UNION of SELECTs, one for each geometry column defined in the database. Note that I'm using the sys. catalog views, which for SQL Server are better than using INFORMATION_SCHEMA.
Each of the individual SELECTs that this builds will return the name of the column, plus metadata from the value in the first row (artibtarily picked).
The sproc then executes the statements its built, and returns.
To use:
CREATE TABLE T1 (
Id int NOT NULL PRIMARY KEY
, Region geometry
)
;
CREATE TABLE T2 (
Id int NOT NULL PRIMARY KEY
, Source geometry
, Destination geometry
)
;
INSERT T1 VALUES ( 1, geometry::STGeomFromText('POLYGON((1 1, 3 3, 3 1, 1 1))', 4236)) ;
INSERT T2 VALUES ( 10
, geometry::STGeomFromText('POINT(1.3 2.4)', 4236)
, geometry::STGeomFromText('POINT(2.6 2.5)', 4236)) ;
then simply
EXEC GetGeometryColumns;
to get
SchemaName TableName ColumnName Dimension SRID GeometryType
---------- --------- ----------- ----------- ----------- ----------------------
dbo T1 Region 2 4236 Polygon
dbo T2 Source 0 4236 Point
dbo T2 Destination 0 4236 Point
If you want the results in a table, you can for example:
DECLARE #geometryColumn TABLE
(
SchemaName sysname
, TableName sysname
, ColumnName sysname
, Dimension int
, SRID int
, GeometryType nvarchar(100)
);
INSERT #geometryColumn EXEC GetGeometryColumns
SELECT * FROM #geometryColumn
I'd be interested to see if anyone can get the necessary logic into an actual VIEW...

SQL Server - Select from a variable amount of tables depending on date range

I have a query that joins multiple Data Sources together, I need a query that will select from a variable amount of tables depending on the date range I send it.
Joining Query
SELECT I.SerialNumber as DataSource,Deployed,Removed
FROM InstrumentDeployments ID
INNER JOIN Instruments I On I.Id = ID.InstrumentId
INNER JOIN Points P On P.Id = ID.PointId
WHERE P.Id = 1
ORDER BY Deployed
Joining Query Result
So from the above query result, if I wanted to select all of the historical information, it would go through and get the data from the specific tables
(called DataSource in query above) dependant on the relevant date.
Final Query - Something like this but the variable tables from query result above.
SELECT * FROM (VariableTables) WHERE DateRange BETWEEN '2016-09-07' and '2018-07-28'
Thanks
Please note that this is completely untested as the sample data is an image (and I can't copy and paste text from an image). If this doesn't work, please provide your sample data as text.
Anyway, the only way you'll be able to achieve this is with Dynamic SQL. This also, like in the comments, assumes that every table has the exact same definition. if it doesn't you'll likely get a failure (perhaps a conversion error, or that for a UNION query all tables must have the same number of columns). If they don't, you'll need to explicitly define your columns.
Again, this is untested, however:
DECLARE #SQL nvarchar(MAX);
SET #SQL = STUFF((SELECT NCHAR(10) + N'UNION ALL' + NCHAR(10) +
N'SELECT *' + NCHAR(10) +
N'FROM ' + QUOTENAME(I.SerialNumber) + NCHAR(10) +
N'WHERE DateRange BETWEEN #dStart AND #dEnd'
FROM InstrumentDeployments ID
INNER JOIN Instruments I ON I.Id = ID.InstrumentId
INNER JOIN Points P ON P.Id = ID.PointId
WHERE P.Id = 1
ORDER BY Deployed
FOR XML PATH(N'')),1,11,N'') + N';';
PRINT #SQL; --Your best friend.
DECLARE #Start date, #End date;
SET #Start = '20160907';
SET #End = '20180728';
EXEC sp_executesql #SQL, N'#dStart date, #dEnd date', #dStart = #Start, #dEnd = #End;

Return all rows where at least one value in any of the columns is null

I have just completed the process of loading new tables with data. I'm currently trying to validate the data. The way I have designed my database there really shouldn't be any values anywhere that are NULL so i'm trying to find all rows with any NULL value.
Is there a quick and easy way to do this instead of writing a lengthy WHERE clause with OR statements checking each column?
UPDATE: A little more detail... NULL values are valid initially as sometimes the data is missing. It just helps me find out what data I need to hunt down elsewhere. Some of my tables have over 50 columns so writing out the whole WHERE clause is not convenient.
Write a query against Information_Schema.Columns (documentation) that outputs the SQL for your very long where clause.
Here's something to get you started:
select 'OR ([' + TABLE_NAME + '].[' + TABLE_SCHEMA + '].[' + COLUMN_NAME + '] IS NULL)'
from mydatabase.Information_Schema.Columns
order by TABLE_NAME, ORDINAL_POSITION
The short version answer, use SET CONCAT_NULL_YIELDS_NULL ON and bung the whole thing together as a string and check that for NULL (once). That way any null will propagate through to make the whole row comparison null.
Here's the silly sample code to demo the principal, up to you if you want to wrap that in an auto-generating schema script (to only check Nullable columns and do all the appropriate conversions). Efficient it ain't, but almost any way you cut it you will need to do a table scan anyway.
CREATE TABLE dbo.Example
(
PK INT PRIMARY KEY CLUSTERED IDENTITY(1,1),
A nchar(10) NULL,
B int NULL,
C nvarchar(50) NULL
) ON [PRIMARY]
GO
INSERT dbo.Example(A, B, C)
VALUES('Your Name', 1, 'Not blank'),
('My Name', 3, NULL),
('His Name', NULL, 'Not blank'),
(NULL, 5, 'It''s blank');
SET CONCAT_NULL_YIELDS_NULL ON
SELECT E.PK
FROM dbo.Example E
WHERE (E.A + CONVERT(VARCHAR(32), E.B) + E.C) IS NULL
SET CONCAT_NULL_YIELDS_NULL OFF
As mentioned in a comment, if you really expect columns to not be null, then put NOT NULL constraints on them. That said...
Here's a slightly different approach, using INFORMATION_SCHEMA:
DECLARE #sql NVARCHAR(max) = '';
SELECT #sql = #sql + 'UNION ALL SELECT ''' + cnull.TABLE_NAME + ''' as TableName, '''
+ cnull.COLUMN_NAME + ''' as NullColumnName, '''
+ pk.COLUMN_NAME + ''' as PkColumnName,' +
+ 'CAST(' + pk.COLUMN_NAME + ' AS VARCHAR(500)) as PkValue '
+ ' FROM ' + cnull.TABLE_SCHEMA + '.' + cnull.TABLE_NAME
+ ' WHERE ' +cnull.COLUMN_NAME + ' IS NULL '
FROM INFORMATION_SCHEMA.COLUMNS cnull
INNER JOIN (SELECT Col.Column_Name, col.TABLE_NAME, col.TABLE_SCHEMA
from INFORMATION_SCHEMA.TABLE_CONSTRAINTS Tab
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE Col
ON Col.Constraint_Name = Tab.Constraint_Name AND Col.Table_Name = Tab.Table_Name
WHERE CONSTRAINT_TYPE = 'PRIMARY KEY') pk
ON pk.TABLE_NAME = cnull.TABLE_NAME AND cnull.TABLE_SCHEMA = pk.TABLE_SCHEMA
WHERE cnull.IS_NULLABLE = 'YES'
set #sql = SUBSTRING(#sql, 11, LEN(#sql)) -- remove the initial 'UNION ALL '
exec(#sql)
Rather a huge where clause, this will tell you the primary key on the table where any field in that table is null. Note that I'm CASTing all primary key values to avoid operand clashes if you have some that are int/varchar/uniqueidentifier etc. If you have a PK that doesn't fit into a VARCHAR(500) you probably have other problems....
This would probably need some tweaking if you have any tables with composite primary keys - as it is, I'm pretty sure it would just output separate rows for each member of the key instead of concatenating them, and wouldn't necessarily group them together the way you'd want.
One other thought would be to just SELECT * from ever table and save the output to a format (Excel, plain text csv) you can easily search for the string NULL.

Forcing SQL Server to pre-cache entire database into memory

We have a client site with a 50Gb SQL 2012 database on a server with 100+ Gb of RAM.
As the application is used, SQL server does a great job of caching the db into memory but the performance increase from the caching occurs the SECOND time a query is run, not the first.
To try to maximize cache hits the first time queries are run, we wrote a proc that iterates through every index of every table within the entire DB, running this:
SELECT * INTO #Cache
FROM ' + #tablename + ' WITH (INDEX (' + #indexname + '))'
In an attempt to force a big, ugly, contrived read for as much data as possible.
We have it scheduled to run every 15 minutes, and it does a great job in general.
Without debating other bottlenecks, hardware specs, query plans, or query optimization, does anybody have any better ideas about how to accomplish this same task?
UPDATE
Thanks for the suggestions. Removed the "INTO #Cache". Tested & it didn't make a difference on filling the buffer.
Added: Instead of Select *, I'm selecting ONLY the keys from the Index. This (obviously) is more to-the-point and is much faster.
Added: Read & Cache Constraint Indexes also.
Here's the current code: (hope it's useful for somebody else)
CREATE VIEW _IndexView
as
-- Easy way to access sysobject and sysindex data
SELECT
so.name as tablename,
si.name as indexname,
CASE si.indid WHEN 1 THEN 1 ELSE 0 END as isClustered,
CASE WHEN (si.status & 2)<>0 then 1 else 0 end as isUnique,
dbo._GetIndexKeys(so.name, si.indid) as Keys,
CONVERT(bit,CASE WHEN EXISTS (SELECT * FROM sysconstraints sc WHERE object_name(sc.constid) = si.name) THEN 1 ELSE 0 END) as IsConstraintIndex
FROM sysobjects so
INNER JOIN sysindexes si ON so.id = si.id
WHERE (so.xtype = 'U')--User Table
AND ((si.status & 64) = 0) --Not statistics index
AND ( (si.indid = 0) AND (so.name <> si.name) --not a default clustered index
OR
(si.indid > 0)
)
AND si.indid <> 255 --is not a system index placeholder
UNION
SELECT
so.name as tablename,
si.name as indexname,
CASE si.indid WHEN 1 THEN 1 ELSE 0 END as isClustered,
CASE WHEN (si.status & 2)<>0 then 1 else 0 end as isUnique,
dbo._GetIndexKeys(so.name, si.indid) as Keys,
CONVERT(bit,0) as IsConstraintIndex
FROM sysobjects so
INNER JOIN sysindexes si ON so.id = si.id
WHERE (so.xtype = 'V')--View
AND ((si.status & 64) = 0) --Not statistics index
GO
CREATE PROCEDURE _CacheTableToSQLMemory
#tablename varchar(100)
AS
BEGIN
DECLARE #indexname varchar(100)
DECLARE #xtype varchar(10)
DECLARE #SQL varchar(MAX)
DECLARE #keys varchar(1000)
DECLARE #cur CURSOR
SET #cur = CURSOR FOR
SELECT v.IndexName, so.xtype, v.keys
FROM _IndexView v
INNER JOIN sysobjects so ON so.name = v.tablename
WHERE tablename = #tablename
PRINT 'Caching Table ' + #Tablename
OPEN #cur
FETCH NEXT FROM #cur INTO #indexname, #xtype, #keys
WHILE (##FETCH_STATUS = 0)
BEGIN
PRINT ' Index ' + #indexname
--BEGIN TRAN
IF #xtype = 'V'
SET #SQL = 'SELECT ' + #keys + ' FROM ' + #tablename + ' WITH (noexpand, INDEX (' + #indexname + '))' --
ELSE
SET #SQL = 'SELECT ' + #keys + ' FROM ' + #tablename + ' WITH (INDEX (' + #indexname + '))' --
EXEC(#SQL)
--ROLLBACK TRAN
FETCH NEXT FROM #cur INTO #indexname, #xtype, #keys
END
CLOSE #cur
DEALLOCATE #cur
END
GO
First of all, there is a setting called "Minumum Server Memory" that looks tempting. Ignore it. From MSDN:
The amount of memory acquired by the Database Engine is entirely dependent on the workload placed on the instance. A SQL Server instance that is not processing many requests may never reach min server memory.
This tells us that setting a larger minimum memory won't force or encourage any pre-caching. You may have other reasons to set this, but pre-filling the buffer pool isn't one of them.
So what can you do to pre-load data? It's easy. Just set up an agent job to do a select * from every table. You can schedule it to "Start automatically when Sql Agent Starts". In other words, what you're already doing is pretty close to the standard way to handle this.
However, I do need to suggest three changes:
Don't try to use a temporary table. Just select from the table. You don't need to do anything with the results to get Sql Server to load your buffer pool: all you need to do is the select. A temporary table could force sql server to copy the data from the buffer pool after loading... you'd end up (briefly) storing things twice.
Don't run this every 15 minutes. Just run it once at startup, and then leave it alone. Once allocated, it takes a lot to get Sql Server to release memory. It's just not needed to re-run this over and over.
Don't try to hint an index. Hints are just that: hints. Sql Server is free to ignore those hints, and it will do so for queries that have no clear use for the index. The best way to make sure the index is pre-loaded is to construct a query that obviously uses that index. One specific suggestion here is to order the results in the same order as the index. This will often help Sql Server use that index, because then it can "walk the index" to produce the results.
This is not an answer, but to supplement Joel Coehoorn's answer, you can look at the table data in the cache using this statement. Use this to determine whether all the pages are staying in the cache as you'd expect:
USE DBMaint
GO
SELECT COUNT(1) AS cached_pages_count, SUM(s.used_page_count)/COUNT(1) AS total_page_count,
name AS BaseTableName, IndexName,
IndexTypeDesc
FROM sys.dm_os_buffer_descriptors AS bd
INNER JOIN
(
SELECT s_obj.name, s_obj.index_id,
s_obj.allocation_unit_id, s_obj.OBJECT_ID,
i.name IndexName, i.type_desc IndexTypeDesc
FROM
(
SELECT OBJECT_NAME(OBJECT_ID) AS name,
index_id ,allocation_unit_id, OBJECT_ID
FROM sys.allocation_units AS au
INNER JOIN sys.partitions AS p
ON au.container_id = p.hobt_id
AND (au.type = 1 OR au.type = 3)
UNION ALL
SELECT OBJECT_NAME(OBJECT_ID) AS name,
index_id, allocation_unit_id, OBJECT_ID
FROM sys.allocation_units AS au
INNER JOIN sys.partitions AS p
ON au.container_id = p.partition_id
AND au.type = 2
) AS s_obj
LEFT JOIN sys.indexes i ON i.index_id = s_obj.index_id
AND i.OBJECT_ID = s_obj.OBJECT_ID ) AS obj
ON bd.allocation_unit_id = obj.allocation_unit_id
INNER JOIN sys.dm_db_partition_stats s ON s.index_id = obj.index_id AND s.object_id = obj.object_ID
WHERE database_id = DB_ID()
GROUP BY name, obj.index_id, IndexName, IndexTypeDesc
ORDER BY obj.name;
GO
Use this to replace function dbo._GetIndexKeys
(SELECT STRING_AGG(COL_NAME(ic.object_id,ic.column_id), ',') FROM sys.index_columns ic WHERE ic.object_id = so.id AND ic.index_id = si.indid) AS Keys,
--dbo._GetIndexKeys(so.name, si.indid) as Keys,

Help me writing this query

CREATE PROCEDURE [dbo].[sp_SelectRecipientsList4Test] --'6DBF9A01-C88F-414D-8DD9-696749258CEF','Emirates.Description','0','5'
--'6DBF9A01-C88F-414D-8DD9-696749258CEF',
--'121f8b91-a441-4fbf-8a4f-563f53fcc103'
(
#p_CreatedBy UNIQUEIDENTIFIER,
#p_SortExpression NVARCHAR(100),
#p_StartIndex INT,
#p_MaxRows INT
)
AS
SET NOCOUNT ON;
IF LEN(#p_SortExpression) = 0
SET #p_SortExpression = 'Users.Name Asc'
DECLARE #sql NVARCHAR(4000)
SET #sql='
DECLARE #p_CreatedBy UNIQUEIDENTIFIER
SELECT
Name,
POBox,
EmirateName,
TelephoneNo,
RecipientID,
CreatedBy,
CreatedDate,
ID
FROM
(
SELECT Users.Name, Users.POBox, Emirates.Description As EmirateName,
UserDetails.TelephoneNo, AddressBook.RecipientID,AddressBook.CreatedBy, AddressBook.CreatedDate,
AddressBook.ID,
ROW_NUMBER() OVER(ORDER BY '+ #p_SortExpression +') AS Indexing
FROM AddressBook INNER JOIN
Users ON AddressBook.RecipientID = Users.ID INNER JOIN
UserDetails ON Users.ID = UserDetails.UserID INNER JOIN
Emirates ON Users.EmiratesID = Emirates.ID
----WHERE (AddressBook.CreatedBy = #p_CreatedBy)
) AS NewDataTable
WHERE Indexing > '+ CONVERT(NVARCHAR(10), #p_StartIndex) +
' AND Indexing<=(' + CONVERT (NVARCHAR(10),#p_StartIndex ) + ' + '
+ CONVERT(NVARCHAR(10),#p_MaxRows)+') '
EXEC sp_executesql #sql
This query is not giving any error but also not giving any result
please help
Have you tried breaking down the statement, to check if intermediate results are as expected? That's what you do to debug a complex statement...
For example, there's a nested SELECT in there. If you commit that SELECT on its own, does it print the expected values?
Edit: There's a saying about teaching a man to fish. 'ck' and 'n8wrl' have given you fish to eat today, now please practice fishing to feed you tomorrow...
Well, a quick glance of this:
WHERE Indexing > '+ CONVERT(NVARCHAR(10), #p_StartIndex) + ' AND Indexing<=(' + CONVERT (NVARCHAR(10),#p_StartIndex ) +...
looks like you're looking for an impossible condition, not unlike this:
WHERE Indexing > 5 AND Indexing <= 5
So that might be why you're getting no rows, but this proc is ripe for SQL injection attacks too. Building SQL on the fly based on possibly-unvalidated parameters is very dangerous.
You are querying:
'WHERE Indexing > '+ CONVERT(NVARCHAR(10), #p_StartIndex) +
' AND Indexing<=(' + CONVERT (NVARCHAR(10),#p_StartIndex ) + ' + '
and then adding max rows as a string, you can do this much more easily like so:
'WHERE Indexing > '+ CONVERT(NVARCHAR(10), #p_StartIndex) +
' AND Indexing <='+ CONVERT(NVARCHAR(10),#p_StartIndex + #p_MaxRows)
EDIT
The problem with your inner WHERE is that you are passing in the parameter, you need to do
'WHERE (AddressBook.CreatedBy = ''' + CAST(#p_CreatedBy AS CHAR(36)) + ''')'
Are you sure all your joins should be inner joins?
Change sp_executesql to PRINT and see what gets generated (the poor man's debugger)
Besides what all the other people told you,
give me one good reason why you are using sp_executesql over exec? You are not using parameterized statements, you also are not protected from sql injections because you just execute the whole string
This will just bloat the procedure cache everytime this is run and some values change, you will get a new plan every time
Please take a look at Changing exec to sp_executesql doesn't provide any benefit if you are not using parameters correctly and Avoid Conversions In Execution Plans By Using sp_executesql Instead of Exec

Resources