I have an odd situation where I will have data coming from various sources (all flat files, none of which are in my control, and no matter how many times I ask for a standard format, I get different column headers and different column orders). We do not have the manpower to manually go through these files to determine which columns are important. Each flat file will have between two and six "identification" columns. However, some of the columns, individually, are not unique, but their combinations can form unique keys. All told, each flat file can have somewhere around one hundred columns.
So, initially, I planned to load the data into a temp table and ask the user to identify which columns contained which data. Once I know that, I can process the file without issue. I would have the two to six columns to identify matches with existing records and the additional data that I am supposed to gather (all identified by the user).
I was then asked to add in the ability for the system to "recommend" which data columns are which. For that, my plan was a count. I would count how many nonempty values each column has and then count how many of those nonempty values match each of the six possible columns. From there, I can take a simple ratio to determine the likelihood that the data contained is of that particular type. There would be some overvaluing of columns that are not unique, but in general, it is working nicely. The problem is that it is very slow.
I created a metadata table that I am calling UploadedTableColumn which contains every column header of the source file and which column it maps to in the database. Here is my stored procedure to update the counts:
CREATE PROCEDURE stored_Procedure
    @FileLoadID INT
AS
BEGIN
    DECLARE @SqlCommand NVARCHAR(MAX)

    DECLARE the_cursor CURSOR FAST_FORWARD FOR
    SELECT N'UPDATE UploadedTableColumn SET NumberNonemptyRows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
         + N'NumberID1Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N','''') IN (SELECT ID1 FROM ID1Table) AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
         + N'NumberID2Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N','''') IN (SELECT ID2 FROM ID2Table) AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
         + N'NumberIDDateRows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE IIF(ISDATE(' + DestinationColumnName + N')=1,IIF(CAST(' + DestinationColumnName + N' AS DATE) IN (SELECT IDDate FROM IDDateTable),1,0),0) = 1 AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
         + N'NumberID4Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N', '''') IN (SELECT ID4 FROM ID4Table) AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
         + N'NumberID5Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N', '''') IN (SELECT ID5 FROM ID5Table) AND ISNULL(' + DestinationColumnName + N','''') <> ''''),' + CHAR(13)
         + N'NumberID6Rows = (SELECT COUNT(*) FROM ' + DestinationTableName + N' WHERE ISNULL(' + DestinationColumnName + N', '''') IN (SELECT ID6 FROM ID6Table) AND ISNULL(' + DestinationColumnName + N','''') <> '''')' + CHAR(13)
         + N'WHERE DestinationTableName = ''' + DestinationTableName + N''' AND DestinationColumnName = ''' + DestinationColumnName + N''' AND FileLoadID = ' + CAST(@FileLoadID AS NVARCHAR(10)) + N';' + CHAR(13) AS SqlCommand
    FROM UploadedTableColumn
    WHERE FileLoadID = @FileLoadID

    OPEN the_cursor
    FETCH NEXT FROM the_cursor INTO @SqlCommand

    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXECUTE (@SqlCommand)
        FETCH NEXT FROM the_cursor INTO @SqlCommand
    END

    CLOSE the_cursor
    DEALLOCATE the_cursor
END
Is there a faster approach?
One small change that might help.
You say you're holding every column header of the source file in UploadedTableColumn. You don't need to do that; your cursor is looping through a lot of unnecessary columns, and you can eliminate most of them with a pre-emptive column-name match.
So get a combined list of all possible ID column names from your ID1Table, ID2Table, etc., and only pull into UploadedTableColumn the ones that actually match a column in DestinationTableName.
On the basis that probably no more than six columns in your source data have a matching ID column name, you're now only checking those rather than all 100+.
Of course, this doesn't help you if you've got people sending data without headers and no agreed format.
Pseudo code to get the desired columns:
SELECT name
FROM sys.columns
WHERE [object_id] = OBJECT_ID('DestinationTableName')
AND Name IN
(
SELECT ID1 AS IDColumn FROM ID1Table
UNION ALL
SELECT ID2 AS IDColumn FROM ID2Table
...
)
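To make that concrete, here is a minimal sketch of the same lookup applied to the metadata table (assuming the table and column names from the question; the remaining ID tables would join the UNION the same way):

-- List only the metadata rows whose destination column name matches a known
-- ID column name, so the cursor loops over a handful of columns instead of 100+.
DECLARE @FileLoadID INT = 1;  -- example value

SELECT utc.DestinationTableName, utc.DestinationColumnName
FROM UploadedTableColumn AS utc
WHERE utc.FileLoadID = @FileLoadID
  AND utc.DestinationColumnName IN
  (
      SELECT name
      FROM sys.columns
      WHERE [object_id] = OBJECT_ID(utc.DestinationTableName)
        AND name IN
        (
            SELECT ID1 FROM ID1Table
            UNION ALL
            SELECT ID2 FROM ID2Table
            -- ...and so on for the remaining ID tables
        )
  );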
Related
I am working on exporting data from one environment to another. I want to select the list of tables that have new records inserted or modified.
The database has around 200 tables, and if only 10 tables have had records impacted since yesterday, I want to filter down to just those tables. Some of these tables do not have a CreateDate column, so it is hard to identify the record differences with a plain SELECT against the table.
How do I find the list of tables with newly impacted records in SQL?
And, if possible, only those newly impacted records from the identified tables.
I tried the query below, but it does not return the actual tables.
select * from sysobjects where id in (
select object_id
FROM sys.dm_db_index_usage_stats
WHERE last_user_update > getdate() - 1 )
If you've not got a timestamp or something else that identifies newly changed records, such as auditing, triggers, or Change Data Capture enabled on those tables, it's quite impossible to do.
However, reading your scenario: is it not possible to ignore what has changed and simply export all 200 tables from one environment to the other, overwriting them at the destination?
If not, then you might really be interested in comparing data rather than identifying newly changed records, i.e. finding which tables do not match. You can do that using EXCEPT.
See the example below, which compares two databases with the same table names and schemas. It builds a dynamic SQL statement column using EXCEPT across both databases on the fly and runs each statement in a while loop, inserting the name of each table that differs into a temp table.
DECLARE @Counter AS INT
      , @Query AS NVARCHAR(MAX)

IF OBJECT_ID('tempdb..#CompareRecords') IS NOT NULL DROP TABLE #CompareRecords
IF OBJECT_ID('tempdb..#TablesNotMatched') IS NOT NULL DROP TABLE #TablesNotMatched

CREATE TABLE #TablesNotMatched (ObjectName NVARCHAR(200))

SELECT
    ROW_NUMBER() OVER( ORDER BY (SELECT 1)) AS RowNr
    , t.TABLE_CATALOG
    , t.TABLE_SCHEMA
    , t.TABLE_NAME
    , Query = 'IF' + CHAR(13)
        + '(' + CHAR(13)
        + '    SELECT' + CHAR(13)
        + '        COUNT(*) + 1' + CHAR(13)
        + '    FROM' + CHAR(13)
        + '    (' + CHAR(13)
        + '        SELECT ' + QUOTENAME(t.TABLE_NAME, '''') + ' AS TableName, * FROM ' + QUOTENAME(t.TABLE_CATALOG) + '.' + QUOTENAME(t.TABLE_SCHEMA) + '.' + QUOTENAME(t.TABLE_NAME) + CHAR(13)
        + '        EXCEPT' + CHAR(13)
        + '        SELECT ' + QUOTENAME(t.TABLE_NAME, '''') + ' AS TableName, * FROM ' + QUOTENAME(t2.TABLE_CATALOG) + '.' + QUOTENAME(t.TABLE_SCHEMA) + '.' + QUOTENAME(t.TABLE_NAME) + CHAR(13)
        + '    ) AS sq' + CHAR(13)
        + ') > 1' + CHAR(13)
        + 'SELECT ' + QUOTENAME(QUOTENAME(t.TABLE_CATALOG) + '.' + QUOTENAME(t.TABLE_SCHEMA) + '.' + QUOTENAME(t.TABLE_NAME), '''') + ' AS TableNameRecordsNotMatched'
INTO #CompareRecords
FROM <UAT_DATABASE>.INFORMATION_SCHEMA.TABLES AS t
LEFT JOIN <PROD_DATABASE>.INFORMATION_SCHEMA.TABLES AS t2 ON t.TABLE_SCHEMA = t2.TABLE_SCHEMA
    AND t.TABLE_NAME = t2.TABLE_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'

SET @Counter = (SELECT MAX(RowNr) FROM #CompareRecords)

WHILE @Counter > 0
BEGIN
    SET @Query = (SELECT cr.Query FROM #CompareRecords AS cr WHERE cr.RowNr = @Counter)

    INSERT INTO #TablesNotMatched
    EXECUTE sp_executesql @Query

    SET @Counter = @Counter - 1
END

SELECT
    *
FROM #TablesNotMatched
Note that when using EXCEPT, both tables must have exactly the same column count and compatible column types.
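As a tiny, self-contained illustration of what EXCEPT returns (hypothetical values):

-- Rows from the first set that do not appear in the second set
SELECT v FROM (VALUES (1), (2), (3)) AS a(v)
EXCEPT
SELECT v FROM (VALUES (1), (2)) AS b(v);
-- Returns 3; if the sets match, the result is empty, which is what the
-- generated IF ... > 1 test relies on.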
I hope this helps a little.
What I am trying to accomplish is comparing two rows and pointing out the differences between them. Each row has quite a few columns, and I want to make it easy to see which ones changed. The code below captures my idea; I know it won't work as written, but it's a start.
SELECT
(SELECT concat('Case WHEN T1.', column_name, ' <> T2.', column_name, ' THEN ''', column_name, ' Changed Values('' + CONVERT(varchar(100), T1.', column_name, ') + '', '' + CONVERT(varchar(100), T2.', column_name, ') + '')'' ELSE '''' END AS ', column_name)
FROM information_schema.columns
WHERE table_name = 'Table')
FROM
(
SELECT * FROM Table
WHERE ID = '13'
) AS T1
JOIN
(
SELECT * FROM Table
WHERE ID = '2006'
) AS T2
ON T1.CreateTimeStamp = T2.CreateTimeStamp
I got the idea because the query below works fine, but I would like this to be reusable code for other tables without having to type out tens or hundreds of columns each time.
SELECT
Case WHEN T1.R1<> T2.R1 THEN 'Changed Values(' + CONVERT(varchar(100),T1.R1) + ', ' + CONVERT(varchar(100),T2.R1) + ')' ELSE '' END AS R1,
Case WHEN T1.R2<> T2.R2 THEN 'Changed Values(' + CONVERT(varchar(100),T1.R2) + ', ' + CONVERT(varchar(100),T2.R2) + ')' ELSE '' END AS R2
FROM
(
SELECT * FROM Table
WHERE ID = '13'
) AS T1
JOIN
(
SELECT * FROM Table
WHERE ID = '2006'
) AS T2
ON T1.CreateTimeStamp = T2.CreateTimeStamp
For this example, please assume CreateTimeStamp always matches between the two rows.
You would need to create the whole query as dynamic SQL. Note that I'm using QUOTENAME() to prevent SQL injection from weirdly named columns. I'm also trying to keep the code consistently formatted, so I won't get headaches when debugging.
DECLARE @SQL NVARCHAR(MAX);

SELECT @SQL = N' SELECT ' + NCHAR(10)
    --Concatenate all columns except ID and CreateTimeStamp
    + STUFF(( SELECT REPLACE( CHAR(9) + ',CASE WHEN T1.<<ColumnName>> <> T2.<<ColumnName>> ' + CHAR(10)
                    + CHAR(9) + CHAR(9) + 'THEN ''Changed Values('' + CONVERT(varchar(100),T1.<<ColumnName>>) + '', '' + CONVERT(varchar(100),T2.<<ColumnName>>) + '')'' ' + CHAR(10)
                    + CHAR(9) + CHAR(9) + 'ELSE '''' END AS <<ColumnName>>', '<<ColumnName>>', QUOTENAME(COLUMN_NAME)) + NCHAR(10)
              FROM INFORMATION_SCHEMA.COLUMNS
              WHERE TABLE_NAME = 'Table'
              AND COLUMN_NAME NOT IN( 'ID', 'CreateTimeStamp')
              FOR XML PATH(''), TYPE).value('./text()[1]', 'nvarchar(max)'), 2, 1, '') + NCHAR(10)
    --Add the rest of the query
    + 'FROM Table AS T1 ' + NCHAR(10)
    + 'JOIN Table AS T2 ON T1.CreateTimeStamp = T2.CreateTimeStamp ' + NCHAR(10)
    + 'WHERE T1.ID = @ID1 ' + NCHAR(10)
    + 'AND T2.ID = @ID2;'

--PRINT for debugging purposes
PRINT @SQL;

--Execute the dynamically built code
EXECUTE sp_executesql @SQL,
        N'@ID1 int, @ID2 int',
        @ID1 = 13,
        @ID2 = 2006;
The concatenation method is explained in this article.
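In isolation, the concatenation pattern looks like this (a minimal, self-contained sketch against INFORMATION_SCHEMA.COLUMNS; on SQL Server 2017+, STRING_AGG would do the same job more directly):

-- Build a comma-separated, quoted column list for one table
SELECT STUFF((SELECT ',' + QUOTENAME(COLUMN_NAME)
              FROM INFORMATION_SCHEMA.COLUMNS
              WHERE TABLE_NAME = 'Table'
              FOR XML PATH(''), TYPE).value('./text()[1]', 'nvarchar(max)'),
             1, 1, '') AS ColumnList;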
I have 20 databases, each with the same table but different columns.
To make them uniform, we are creating a view on top of the table in each database that exposes all of the columns, since one application will access all of the databases.
The view's query has to be written in such a way that if I want to alter it and add an additional column for testing, I am able to do that.
In the query below, I am creating/altering the view so that it takes all of the columns the table has in that database and then appends the other columns that are not present in it.
I need to add a column that just concatenates some of the other columns.
ALTER VIEW [dbo].[AIV_PARKING]
AS
SELECT
*,
Cast(NULL AS [VARCHAR](20)) [ACTCODE],
Cast(NULL AS [VARCHAR](1)) [ACTIVATEINFO],
Cast(NULL AS [VARCHAR](20)) [VEHLICNOCHECK],
Cast(NULL AS [VARCHAR](40)) [ACTIVITY],
Cast(Isnull(vehlicnocheck, '') + '|' +
Isnull(officername, '') + '|' +
Isnull(locstreet, '') + '|' +
Isnull(locsideofstreet, '') + '|' +
Isnull(loccrossstreet1, '') + '|' +
Isnull(loccrossstreet2, '') + '|'
+ Isnull(locsuburb, '') + '|'
+ Isnull(locstate, '') + '|'
+ Isnull(locpostalcode, '') + '|'
+ Isnull(loclot, '') + '|'
+ Isnull(locpostalcode, '') + '|'
+ Isnull(Cast(officerid AS VARCHAR(20)), '')
+ Isnull(officername, '') + '|'
+ Isnull(Cast (issueno AS VARCHAR(100)), '') AS NVARCHAR(max)) AS SearchText
FROM
[dbo].parking
Here I added a column called SearchText that concatenates other columns, but I get an error:
Invalid column name 'VehLicNoCheck'
Is there any way I can add this column to this view?
I also tried something like the following, but I got the same error:
CAST(CASE
WHEN NOT EXISTS
(
Select 1 from INFORMATION_SCHEMA.COLUMNS
Where Column_name ='VehLicNoCheck'
and table_name='Parking'
)
THEN ''
ELSE ISNULL(VehLicNoCheck,'')
END as nvarchar(max)
)
You could create a view that normalizes the uncommon columns to rows, where the values for the common columns are just repeated, e.g.:
select id, col, value from parking
unpivot (value for col in (actcode, vehLicNoCheck, etc.)) x
The code to dynamically generate the view would be something like:
declare @sql varchar(max) = 'select id, col, value from parking unpivot (value for col in ('
select @sql += quotename(name) + ',' from sys.columns where object_id = object_id('parking') and name not in ('id')
set @sql = substring(@sql, 1, len(@sql) - 1) + '))x'
exec(@sql)
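One caveat with this sketch: UNPIVOT requires every column in the IN list to have exactly the same type and length, so on a real table with mixed column types you would likely need to cast first in a derived table, along these lines (column names are illustrative):

-- cast to a common type before unpivoting; note that UNPIVOT also drops NULL values
select id, col, value
from (
    select id,
           cast(actcode as nvarchar(4000)) as actcode,
           cast(vehlicnocheck as nvarchar(4000)) as vehlicnocheck
    from parking
) p
unpivot (value for col in (actcode, vehlicnocheck)) x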
This does not make sense at all.
[ACTCODE], [ACTIVATEINFO], and the other added columns are all NULL,
so SearchText is basically just the string '|||||'.
You might as well just do this:
SELECT *,
CAST( NULL AS varchar) [ACTCODE],
CAST( NULL AS varchar) [ACTIVATEINFO],
CAST( NULL AS varchar) [VEHLICNOCHECK],
CAST( NULL AS varchar) [ACTIVITY],
'||||||' as SearchText
FROM [dbo].PARKING
Maybe if you can explain what you are trying to achieve here, we can point you in the right direction.
EDIT :
You will need to use dynamic SQL, and you will need a list of all the required column names.
-- declare a table variable for all the columns that you require
declare @columns table
(
    id int identity,
    name varchar(100)
)

-- for example, these are the required columns
insert into @columns
values ('ACTCODE'), ('ACTIVATEINFO'), ('VEHLICNOCHECK'), ('ACTIVITY')

-- the query to create the view
declare @sql nvarchar(max)

select @sql = N'CREATE VIEW [AIV_PARKING] AS' + char(13) + 'SELECT' + char(13)

select @sql = @sql
            + case when t.name is not null
                   then quotename(c.name) + ','
                   else 'CAST (NULL AS VARCHAR(10)) AS ' + quotename(c.name) + ','
              end
            + char(13)
from @columns c
left join sys.columns t on c.name = t.name
                       and t.object_id = object_id('PARKING')
order by c.id

select @sql = @sql
            + case when t.name is not null
                   then 'ISNULL(' + quotename(c.name) + ', '''')'
                   else ''''''    -- an empty-string literal for columns missing from the table
              end
            + ' + ''|'''
            + ' + '
from @columns c
left join sys.columns t on c.name = t.name
                       and t.object_id = object_id('PARKING')
order by c.id

select @sql = left(@sql, len(@sql) - 8) + ' AS SearchText' + char(13)
            + 'FROM PARKING'

-- print out to view the complete CREATE VIEW statement
print @sql

-- execute it
exec sp_executesql @sql
There is a sporadic performance problem with a batch process (nested stored procedures) in SQL Server 2012. Sometimes it takes much longer than usual.
The process rebuilds certain tables based on input parameters. The RULE table has a statement column with 70+ different statements (involving 20+ different columns) to be used in the WHERE clause of dynamic SQL.
So the DELETE, at each process run, not only has different parameters but also different types and numbers of columns in the WHERE clause.
What would you recommend other than managing stats and index tuning? The dev team is open to code and schema changes.
SELECT rul.SQL_STATEMENT
FROM APP_RULE rul
LEFT JOIN APP_RULE_EXCEPTION exc
    ON rul.RULE_ID = exc.RULE_ID
WHERE rul.APP_ID = @AppId
  AND (exc.RULE_ID IS NULL OR exc.RULE_ID NOT IN (
      SELECT RULE_ID FROM APP_RULE_EXCEPTION ))

SET @SQLStatement =
    'DELETE FROM EMPLOYEE ' +
    'WHERE APP_ID = ' + CAST(@AppId AS VARCHAR(10)) +
    ' AND EMPLOYEE_ID NOT IN (' +
    'SELECT EMPLOYEE_ID FROM EMPLOYEE_NEW ' +
    'WHERE ' + @SQLStatement + ')'
EXEC (@SQLStatement)

SET @SQLStatement =
    'DELETE FROM ' + @TableName + ' ' +
    'WHERE EMPLOYEE_ID NOT IN (' +
    'SELECT EMPLOYEE_ID FROM EMPLOYEE_STG ' +
    'WHERE APP_ID = ' + CAST(@AppId AS VARCHAR(10)) + ')'
EXEC (@SQLStatement)

SET @SQLStatement =
    'INSERT INTO ' + @DestinationTable + ' ' + '( [APP_ID], ' + @AttributeList + ')' +
    'SELECT ' + CAST(@AppId AS VARCHAR(10)) + ',''' + @ExposedAttributeList +
    'FROM ' + @SourceTable + ' ' +
    'WHERE [DATE] = ''' + @TDate + ''''
EXEC (@SQLStatement)
The following are two sample DELETEs generated by the dynamic SQL.
DELETE FROM EMPLOYEE WHERE APP_ID = 103 AND APP_TYPE = 'IE' AND EMPLOYEE_ID NOT IN (
SELECT EMPLOYEE_ID FROM EMPLOYEE_NEW WHERE APP_TYPE = 'IE' )
DELETE FROM EMPLOYEE WHERE APP_ID = 103 AND APP_TYPE = 'IE' AND COUNTRY='USA' AND EMPLOYEE_ID NOT IN (
SELECT EMPLOYEE_ID FROM EMPLOYEE_NEW WHERE APP_TYPE = 'IE' AND EMPLOYEE_ID IN (
SELECT EMPLOYEE_ID FROM EMPLOYEE_EMAIL WHERE ISNULL(EMAIL_ADDRESS,'')<>'' and EType='OFFICE' ))
Thanks,
Kuzey
The first thing you should do is understand what the problem is. When something randomly runs for a long time, it's usually a case of getting a non-optimal query plan, assuming you have already checked that no blocking is happening at the same time.
You should look at the plan cache and see how the plans and the CPU and I/O measurements differ between the fast and slow cases.
Here's a short SQL query you can use to look at the plan cache:
select top 100
SUBSTRING(t.text, (s.statement_start_offset/2)+1,
((CASE s.statement_end_offset WHEN -1 THEN DATALENGTH(t.text) ELSE s.statement_end_offset END
- s.statement_start_offset)/2) + 1) as statement_text,
t.text,
s.total_logical_reads, s.total_logical_reads / s.execution_count as avg_logical_reads,
s.total_worker_time, s.total_worker_time / s.execution_count as avg_worker_time,
s.execution_count,
max_logical_reads,
creation_time,
last_execution_time
--,cast(p.query_plan as xml) as query_plan
from sys.dm_exec_query_stats s
cross apply sys.dm_exec_sql_text (sql_handle) t
--cross apply sys.dm_exec_text_query_plan (plan_handle, statement_start_offset, statement_end_offset) p
order by s.total_logical_reads desc
That will show you all the measurements collected for plans that are still in cache. When a plan is dropped, its measurements are deleted with it. The measurements cover all executions of the plan since it was created.
The commented-out part returns the plan for each statement; from there you can see the operators and estimated row counts. Don't look at the percentages in the plans; those are based on estimates and can be totally wrong.
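If you want to narrow the cache view down to the batch process itself, one option is to filter on the statement text; for example (the procedure name here is just a placeholder):

select s.execution_count, s.total_worker_time, s.total_logical_reads,
       s.creation_time, s.last_execution_time, t.text
from sys.dm_exec_query_stats s
cross apply sys.dm_exec_sql_text (s.sql_handle) t
where t.text like '%YourBatchProcName%' -- placeholder: name of the outermost procedure
order by s.last_execution_time desc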
I get the below error
An explicit value for the identity column in table 'c365online_script1.dbo.tProperty' can only be specified when a column list is used and IDENTITY_INSERT is ON.
The problem is with the statement that is dynamically constructed inside the two nested cursors. From what I gather, it should look something like:
INSERT INTO dbo.Table(col1, col2, ...., colN) VALUES(Val1, val2, ...., ValN)
I am, however, unsure how I would construct the INSERT statement below so that it resembles the one above.
EXEC('INSERT INTO ' + @Destination_Database_Name + '.dbo.' + @tablename + ' SELECT * FROM ' + @Source_Database_Name + '.dbo.' + @tablename + ' WHERE ' + @Source_Database_Name + '.dbo.' + @tablename + '.CompanyID = ' + @Company_Id)
SET @Counter = 1 -- set the counter to make sure we execute the loop only once
END
You need to specify the list of columns because you don't insert into all of them (you don't insert into the identity column). I'm guessing you're inserting from a table with the same structure in a different database; in that case, you need to specify all the source columns too.
Your query will be (edit the column names):
EXEC('INSERT INTO ' + @Destination_Database_Name + '.dbo.' + @tablename + ' (col1, col2, col3) SELECT col1, col2, col3 FROM ' + @Source_Database_Name + '.dbo.' + @tablename + ' WHERE ' + @Source_Database_Name + '.dbo.' + @tablename + '.CompanyID = ' + @Company_Id)
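If typing out the column list for every table is impractical, one possible sketch builds it from the catalog instead. This assumes the batch runs in the source database, and that @tablename, @Destination_Database_Name, @Source_Database_Name, and @Company_Id come from the surrounding cursor loop:

-- Build the column list from sys.columns, skipping identity and computed columns,
-- then reuse the same list on both sides of the INSERT ... SELECT.
DECLARE @Column_List NVARCHAR(MAX);

SELECT @Column_List = STUFF((
    SELECT ', ' + QUOTENAME(name)
    FROM sys.columns
    WHERE object_id = OBJECT_ID('dbo.' + @tablename) -- assumes current db = source db
      AND is_identity = 0
      AND is_computed = 0
    ORDER BY column_id
    FOR XML PATH(''), TYPE).value('./text()[1]', 'nvarchar(max)'), 1, 2, '');

EXEC('INSERT INTO ' + @Destination_Database_Name + '.dbo.' + @tablename
   + ' (' + @Column_List + ')'
   + ' SELECT ' + @Column_List
   + ' FROM ' + @Source_Database_Name + '.dbo.' + @tablename
   + ' WHERE CompanyID = ' + @Company_Id)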