TSQL "LIKE" or Regular Expressions? - sql-server

I have a bunch (750K) of records in one table that I have to check against another table to see whether they exist there. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .NET app that loads the source records, then runs through and searches with a LIKE (using a T-SQL function) to determine whether there are any matching records. If yes, the source table is updated with a positive; if not, the record is left alone.
The app processes about 1,000 records an hour. When I did this as a cursor sproc on SQL Server, I got pretty much the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?

What about doing it all in the DB, rather than pulling records into your .Net app:
UPDATE s SET some_field = 1
FROM source_table s
WHERE EXISTS
(
SELECT 1 FROM target_table t
WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This will reduce the total number of queries from 750k update queries down to 1 update.

First, I would redesign if at all possible. Better to add a column that contains the correct value and be able to join on it. If you still need the long one, you can use a trigger to extract the data into the new column at the time it is inserted.
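For example, a minimal sketch of such a trigger, assuming the target table has a recId key, the messy value lives in a title column, and a new CaseNo column has been added (all names hypothetical):
-- Sketch only: recId, title and CaseNo are assumed names, and the case number is
-- assumed to start at the first "nnnn-" pattern and end at the opening "(".
CREATE TRIGGER trg_target_extract_caseno
ON target_table
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET CaseNo = SUBSTRING(t.title,
                           PATINDEX('%[0-9][0-9][0-9][0-9]-%', t.title),
                           CHARINDEX('(', t.title + '(')
                               - PATINDEX('%[0-9][0-9][0-9][0-9]-%', t.title))
    FROM target_table t
    JOIN inserted i ON i.recId = t.recId
    WHERE PATINDEX('%[0-9][0-9][0-9][0-9]-%', t.title) > 0;
END
With that column populated (and indexed), the match becomes a plain equality join instead of a LIKE.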
If you have data you can match on, you need neither LIKE '%somestuff%' (which can't use indexes) nor a cursor, both of which are performance killers. This should be a set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using T-SQL and I would attempt the regular expression route. Not knowing how many different phrases there are and the structure of each, I cannot say if the regular expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW, if you are working with tables that large, I would resolve to never write another cursor. They are extremely bad for performance, especially when you start talking about that size of record. Learn to think in sets, not record-by-record processing.

One thing to be aware of with using a single update (mbeckish's answer) is that the transaction log (enabling a rollback if the query is cancelled) will be huge. This will drastically slow down your query. As such it is probably better to process them in blocks of 1,000 rows or so.
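A rough sketch of what that batching might look like, assuming a hypothetical nullable matched flag on the source table (other names as in the earlier answer):
-- Sketch only: "matched" is a hypothetical nullable bit column on source_table.
-- Each batch is its own transaction, which keeps the log growth bounded.
WHILE 1 = 1
BEGIN
    UPDATE TOP (1000) s
    SET matched = 1
    FROM source_table s
    WHERE s.matched IS NULL
      AND EXISTS (SELECT 1
                  FROM target_table t
                  WHERE t.target_join_field LIKE '%' + s.source_join_field + '%');

    IF @@ROWCOUNT < 1000 BREAK;
END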
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be just to strip off any text from the last '(' onwards (this assumes all the values follow that pattern). This would simplify the query condition to (b.field like '%' + a.field).
Still, an index wouldn't help here either, as the important characters are at the end. So, bizarrely, it could well be worthwhile storing the characters of both tables in reverse order. The index on your temporary table would then come into use.
It may seem very wasteful to spend that much time, but in this case a small benefit would yield a great reward. (A few hours' work to halve the comparisons from 750 billion to 375 billion, for example. And if you can get the index into play you could reduce this a thousand-fold, thanks to indexes being tree searches, not just ordered tables...)
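A sketch of the reversed-string idea, assuming the '(n pages)' suffix has already been stripped into a hypothetical title_stripped column:
-- Sketch only: title_stripped is a hypothetical column holding the target value
-- with the trailing "(n pages)" text already removed.
SELECT t.recId,
       REVERSE(t.title_stripped) AS title_reversed
INTO #target_reversed
FROM target_table t;

CREATE INDEX IX_target_reversed ON #target_reversed (title_reversed);

DECLARE @caseNo varchar(50);
SET @caseNo = '9999-A1B-1234X';

-- With the case number now at the front of the reversed string, the trailing
-- wildcard lets the index be used for a range seek instead of a full scan.
SELECT tr.recId
FROM #target_reversed tr
WHERE tr.title_reversed LIKE REVERSE(@caseNo) + '%';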
Second Idea
Assuming you do copy the target table into a temp table, you may get an extra benefit from processing them in blocks of 1,000 by also deleting the matching records from the target table. (This would only be worthwhile where you delete a meaningful amount from the target table, such that after all 750,000 records have been checked, the target table is now [for example] half the size it started at.)
EDIT:
Modified Second Idea
Put the whole target table in to a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even bring indexes into play.
Loop through each record from the source table one at a time. Use the following logic in your loop...
DELETE target WHERE field LIKE '%' + @source_field + '%'
IF (@@ROWCOUNT = 0)
[no matches]
ELSE
[matches]
The continuous deleting makes the query faster on each loop, and you're only using one query on the data (instead of one to find matches, and a second to delete the matches)
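A rough sketch of that loop, assuming the target has been copied into a #target_copy temp table and the source table carries a hypothetical matched flag:
-- Sketch only: #target_copy and the "matched" flag are hypothetical names.
DECLARE @id int, @source_field varchar(50);

DECLARE source_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT id, source_join_field FROM source_table;

OPEN source_cur;
FETCH NEXT FROM source_cur INTO @id, @source_field;

WHILE @@FETCH_STATUS = 0
BEGIN
    DELETE FROM #target_copy WHERE field LIKE '%' + @source_field + '%';

    IF @@ROWCOUNT = 0
        UPDATE source_table SET matched = 0 WHERE id = @id;   -- [no matches]
    ELSE
        UPDATE source_table SET matched = 1 WHERE id = @id;   -- [matches]

    FETCH NEXT FROM source_cur INTO @id, @source_field;
END

CLOSE source_cur;
DEALLOCATE source_cur;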

Try this --
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0

First thing is to make sure you have an index for that column on the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to see if you are not doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.

There are lots of ways to skin this cat - I would think that first it would be important to know whether this is a one-time operation or a task that needs to be completed regularly.
Not knowing all the details of your problem, if it were me, as this is a one-time (or infrequent) operation, which it sounds like it is, I'd probably extract just the pertinent fields from the two tables, including the primary key from the source table, and export them to a local machine as text files. The file sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like C/C++ or another "lightweight" language that has raw processing power, and write out a table of primary keys that "match", which I would then load back into SQL Server and use as the basis of an update query (i.e. update source table where id in select id from temp table).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in sql.
By the sounds of your SQL, you may be trying to do 750,000 table scans against a multi-million-record table.
Tell us more about the problem.

Holy smoke, what great responses!
The system is on a disconnected network, so I can't copy and paste, but here's the retype.
Current UDF:
Create function CountInTrim
(@caseNo varchar(255))
returns int
as
Begin
declare @recCount int
select @recCount = count(recId) from targettable where title like '%' + @caseNo + '%'
return @recCount
end
Basically, if there's a record count, then there's a match, and the .net app updates the record. The cursor based sproc had the same logic.
Also, this is a one-time process (determining which entries in a legacy record/case management system migrated successfully into the new system), so I can't redesign anything. Of course, the developers of either system are no longer available, and while I have some SQL experience, I am by no means an expert.
I parsed the case numbers out of the crazy format the old system used to build the source table, and that's the only thing in common with the new system: the case number format. I COULD attempt to parse out the case number in the new system, then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the clr if needed.
I'll update the results. And, since I've already processed more than half the records, that should help.

Try either Dan R's update query from above:
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField
from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is sql 2005 or later, then this would be a classic use for a calculated column using SQL CLR code with Regular Expressions - no need for a standalone app.

Related

Improving Merge Performance with Change Data Capture and Hash

Today I'm trying to tune the performance of an audit database. I have a legal reason for tracking changes to rows, and I've implemented a set of tables using the System Versioned tables method in SQL Server 2016.
My overall process lands "RAW" data into an initial table from a source system. From here, I then have a MERGE process that takes data from the RAW table and compares every column in the RAW table to what exists in the audit-able system versioned staging table and decides what has changed. System row versioning then tells me what has changed and what hasn't.
The trouble with this approach is that my tables are very wide. Some of them have 400 columns or more. Even tables that have 450,000 records take SQL server about 17 minutes to perform a MERGE operation. It's really slowing down the performance of our solution and it seems it would help things greatly if we could speed it up. We presently have hundreds of tables we need to do this for.
At the moment both the RAW and STAGE tables are indexed on an ID column.
I've read in several places that we might consider using a CHECKSUM or HASHBYTES function to record a value in the RAW extract. (What would you call this? GUID? UUID? Hash?) We'd then compare the calculated value to what exists in the STAGE table. But here's the rub: there are often quite a few NULL values across many columns. It's been suggested that we cast all the column types to be the same (nvarchar(max)?), and NULL values seem to cause the entire computation of the checksum to fall flat. So I'm also coding lots of ISNULL(,'UNKNOWN') statements into my code too.
So - Are there better methods for improving the performance of the merge here? I thought that I could use a row-updated timestamp column as a single value to compare instead of the checksum, but I am not certain that would pass legal muster/scrutiny. Legal is concerned that rows may be edited outside of an interface and the column wouldn't always be updated. I've seen approaches where developers use a concatenate function (shown below) to combine many column values together. This seems code-intensive and expensive, both to compute and to cast columns for.
So my questions are:
Given the situational reality, can I improve MERGE performance in any way here?
Should I use a checksum, or hashbytes, and why?
Which hashbytes method makes the most sense here? (I'm only comparing one RAW row to another STAGE row based on an ID match, right?)
Did I miss something with functions that might make this comparison faster or easier in the reading I have done? It seems odd there aren't better functions besides CONCAT available to do this in SQL Server.
I wrote the below code to show some of the ideas I am considering. Is there something better than what I wrote below?
DROP TABLE IF EXISTS MyTable;
CREATE TABLE MyTable
(C1 VARCHAR(10),
C2 VARCHAR(10),
C3 VARCHAR(10)
);
INSERT INTO MyTable
(C1,C2,C3)
VALUES
(NULL,NULL,NULL),
(NULL,NULL,3),
(NULL,2,3),
(1,2,3);
SELECT
HASHBYTES('SHA2_256',
CONCAT(C1,'-',
C2,'-',
C3)) AS HashbytesValueNoCastNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(CAST(C1 as varchar(max)),'-',
CAST(C2 as varchar(max)),'-',
CAST(C3 as varchar(max)))) AS HashbytesValueCastWithNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN'))) AS HashbytesValueWithCastWithNullCheck,
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN')) AS StringValue,
CONCAT(C1,'-',C2,'-',C3) AS ConcatString,
C1,
C2,
C3
FROM
MyTable;
Given the situational reality, can I improve MERGE performance in any way here?
You should test, but storing a hash for every row, computing the hash for the new rows, and comparing based on the (key,hash) should be cheaper than comparing every column.
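A rough sketch of that shape, assuming hypothetical RAW and STAGE tables keyed on Id, each carrying a precomputed RowHash column built the same way on both sides:
-- Sketch only: RAW, STAGE, Id, RowHash and C1..C3 are hypothetical names, and
-- RowHash is assumed to be computed identically on both sides (e.g. HASHBYTES
-- over the delimited, null-replaced concatenation shown elsewhere in this thread).
MERGE dbo.STAGE AS s
USING dbo.RAW AS r
    ON r.Id = s.Id
WHEN MATCHED AND r.RowHash <> s.RowHash THEN
    UPDATE SET s.C1 = r.C1,
               s.C2 = r.C2,
               s.C3 = r.C3,
               s.RowHash = r.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, C1, C2, C3, RowHash)
    VALUES (r.Id, r.C1, r.C2, r.C3, r.RowHash);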
Should I use a checksum, or hashbytes, and why?
HASHBYTES has a much lower probability of missing a change. Roughly, with CHECKSUM you'll probably eventually miss a change or two, with HASHBYTES you probably won't ever miss a change. See remarks here: BINARY_CHECKSUM.
Did I miss something with functions that might make this comparison faster or easier in the reading I have done?
No. There's no special way to compare multiple columns.
Is there something better than what I wrote below?
You definitely should replace nulls, else the rows (1, null, 'A') and (1, 'A', null) would get the same hash. Replace nulls, and delimit, with something that won't appear as a value in any column. And if you have Unicode text, converting to varchar may erase some changes, so it's safer to use nvarchar. E.g.:
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C2 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C3 as nvarchar(max)),N'~'))) AS HashbytesValueWithCastWithNullCheck
JSON in SQL Server is very fast. So you might try a pattern like:
select t.Id, z.RowJSON, hashbytes('SHA2_256', RowJSON) RowHash
from SomeTable t
cross apply (select t.* for json path) z(RowJSON)

Speeding up [select * into table] across linked servers

I'm trying to copy a table with ~50 million rows into another database on a linked server. It does not have any indexes (although I wouldn't think that should make a difference). I've used the following query:
select * into [db2].[schema].[table_name]
from
openquery([linked_server_name],
'select * from [db1].[schema].[table_name]')
This took approximately 7 minutes.
This seems suspiciously long for what I intended to be a simple copy and paste. Am I missing something?
I need to run this on a regular basis and would ideally like to keep it as automated as possible (no manual copying of tables across servers using SSIS would be ideal)
Any ideas would be highly appreciated!
Thanks a bunch
There could be lots of different reasons at play here each having a cumulative effect on the 'slowness'.
Looking at the wait states will be the first port of call.
Indexing (at least on the SELECT side) isn't an issue here; there is no predicate used (and, to a lesser extent, you are selecting all the columns), so how would you expect an index to be helpful?
I would say that the number of rows is not a helpful metric. How big in terms of MB/GB is the source data set? Use the "Include Client Statistics" option in SSMS to get an accurate number. Now, if that is 'big', how long does it take to drag a .zip file of the same size over the network?
Without indexes on the table, the SELECT query is bound to be slow. Indexes help improve the performance of the SELECT query; they slow down INSERT queries, as the data has to be added and indexed simultaneously, but they boost the SELECT side.
Also, is the data you are copying transactional in nature, or can you do away with dirty reads? If dirty reads are fine, then you can use WITH (NOLOCK) in the SELECT query to avoid any transaction/locking issues.
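For example, a sketch of passing the hint through to the remote query (only if dirty reads really are acceptable):
-- Sketch only: NOLOCK allows dirty reads, so uncommitted or inconsistent rows may be copied.
select * into [db2].[schema].[table_name]
from openquery([linked_server_name],
    'select * from [db1].[schema].[table_name] with (nolock)')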

Add DATE column to store when last read

We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive? (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own layer.
And in any case you will need an application server for this. You are not going to update the tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager that first runs the SELECT and then writes the tracking information. And what is the point of saving the tracking information together with the entity data by sending it back to the DB? Save it to a file on the application server.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table: the id of the record, datetime, userid (maybe IP address, browser version, etc.), just about anything else I could capture that was even possibly relevant. (For example, 6 months from now your manager decides not only does s/he want to know which records were used the most, s/he wants to know which users are using the most records, or what time of day that usage pattern occurs, etc.)
This type of information can be useful for things you've never even thought of down the road, and if it starts to grow large you can always roll up and prune the table to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never regret having it available down the road, and it would be impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure, that also issues the logging command, so that the client is not doing two roundtrips (one for the select, one for the update/insert).
Alternatively, if this is a web application, you could use an async AJAX call to issue the logging action, which wouldn't slow down the user's experience at all.
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major critical concerns in database server administration.
Instead, you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
For more information, search for database auditing for SELECT statements.
Use another table as a key/value pair with two columns (e.g. id_selected, times) for storing the ids of the records you select in your standard table, and increment the times value by 1 every time a record is selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query into the counting table. As a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1='somevalue';
INSERT INTO countTable(id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1='somevalue' # or just build a list of ids as values from your last result
ON DUPLICATE KEY
UPDATE times=times+1
The ON DUPLICATE KEY is right off the top of my head in MySQL. For conditionally inserting or updating in MSSQL you would need to use MERGE instead.
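A rough MSSQL equivalent using MERGE might look like this (following the hypothetical table and column names from the example above):
-- Sketch only: MSSQL upsert of the per-id counter via MERGE.
MERGE countTable AS c
USING (SELECT id FROM myTable WHERE stuff1 = 'somevalue') AS src
    ON c.id_selected = src.id
WHEN MATCHED THEN
    UPDATE SET c.times = c.times + 1
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id_selected, times) VALUES (src.id, 1);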

Problems with LIKE command in SQL Server

I'm trying to execute a query containing LIKE in a SQL Server 2008 database, but for some reason the query takes forever and times out.
The table contains around 47 million rows with aggregated log data, and I'm trying to find a log entry for a specific machine containing a specific application name. My query looks like this:
SELECT MgmtLogID,MgmtLogSeverity, MgmtLogSource, CAST(MgmtLogText as TEXT) as MgmtLogText, MgmtLogTime, MgmtLogHost
FROM [dbo].[MgmtLog]
--Fixed values
WHERE MgmtLogOrigin = 'EventLog' AND MgmtLogSeverity <= 3
--Values depending on what I'm searching for
AND MgmtLogHost = 'MY MACHINENAME' AND MgmtLogTime > 'MY START TIME' AND MgmtLogText LIKE '% KEYWORD TO SEARCH FOR %'
ORDER BY MgmtLogTime DESC
The query executes in around 1-2 seconds without the LIKE and returns around 10 rows. With the LIKE it should return 2 rows out of those ten, so it shouldn't be that taxing, but it times out. I'm guessing that it has something to do with the properties of MgmtLogText, but I'm not sure what. It is an ntext field that has length 16 and uses the Finnish_Swedish_CI_AS collation.
In the end I need to execute the query from a PHP script, since I need to find log records for an arbitrary number of machines and/or applications.
Depending on the field type of MgmtLogText, the indexes won't be used. Also, as mentioned by other commenters, the LIKE also prevents the index from being used.
Off the top of my head, I wonder if it would work if you use a subquery. The inside query should be the one without the LIKE which only returns 10 results. Then the outside query should be the one that uses the LIKE. That way the LIKE only has to search 10 rows instead of 47 million.
There's probably a more efficient way, but I was thinking something like this:
SELECT MgmtLogID,MgmtLogSeverity, MgmtLogSource, CAST(MgmtLogText as TEXT) as MgmtLogText, MgmtLogTime, MgmtLogHost
FROM [dbo].[MgmtLog]
WHERE MgmtLogID IN (
SELECT MgmtLogID
FROM [dbo].[MgmtLog]
WHERE MgmtLogOrigin = 'EventLog' AND MgmtLogSeverity <= 3
AND MgmtLogHost = 'MY MACHINENAME' AND MgmtLogTime > 'MY START TIME'
)
AND MgmtLogText LIKE '%some value%'
ORDER BY MgmtLogTime DESC
Query clauses like where colname like '%something%' are not able to take advantage of indexes and usually result in a full scan of the possible rows to ascertain which ones should be delivered
Although, as ChrisC points out in a comment, it's somewhat surprising that the more selective clauses aren't applied first to reduce the candidate row set down to a manageable size before trying the LIKE - perhaps the statistics for the table are not up to date enough for the optimizer to decide this - it's best to check the execution plan (SQL Server's equivalent of an explain query).
The reason your non-like query is so fast is because it almost certainly has an index on MgmtLogHost and/or MgmtLogTime which can be used to quickly cull unneeded rows.
One way you can fix this is to use something like insert/update triggers to process the MgmtLogText data only when changed, to extract the application names out and put them in a separate table which can be far better optimised.
Even just using such a trigger to keep a lowercased version of the column (in another column) would be an improvement. Using a case-insensitive collation means that selects run slower since they have to allow for XYZZY and xyzzy being classed as equal. If instead you maintain a lower-cased version in the table and ensure the check is done against lower case, that effort disappears as you only have one case to worry about.
And, by doing all this in the trigger, you ensure that it's only done when necessary (when the data is changed), not every time you want to select. This amortises the cost over many selects.
You can also use something like full text indexing if your DBMS supports it but I've often thought that was like trying to kill mosquitos with a thermo-nuclear warhead.
Yes, there are situations where you may need full text indexing but, in the vast majority of cases, you can gain efficiency by being a little more selective.
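For completeness, a sketch of the full-text route (catalog and index names are made up, the ntext column would usually be migrated to nvarchar(max) first, and a unique key index such as PK_MgmtLog is assumed to exist):
-- Sketch only: names are made up; full-text indexing must be installed and enabled.
CREATE FULLTEXT CATALOG MgmtLogCatalog;
CREATE FULLTEXT INDEX ON dbo.MgmtLog (MgmtLogText)
    KEY INDEX PK_MgmtLog ON MgmtLogCatalog;

SELECT MgmtLogID, MgmtLogTime, MgmtLogHost
FROM dbo.MgmtLog
WHERE CONTAINS(MgmtLogText, '"KEYWORD TO SEARCH FOR"')
    AND MgmtLogHost = 'MY MACHINENAME';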

What is a maintainable way to store large text fields without sacrificing performance?

I have been dancing around this issue for a while but it keeps coming up. We have a system where many of our tables start with a description that is originally stored as an NVARCHAR(150), and then we get a ticket asking to expand the field size to 250, then 1000, etc.
This cycle is repeated on every "note" and/or "description" field we add to most tables. Of course the concern for me is performance and breaking the 8k limit of the page. However, my other concern is making the system less maintainable by breaking these fields out of EVERY table in the system into a lazy-loaded reference.
So here I am faced with the same two options that have been staring me in the face (others are welcome); please lend me your opinions.
Change all my notes and/or descriptions to NVARCHAR(MAX) and make sure we exclude these fields in all listings. Basically never do a SELECT * FROM [TableName] unless it is only retrieving one record.
Remove all notes and/or description fields and replace them with a foreign key reference to a [Notes] table.
CREATE TABLE [dbo].[Notes] (
[NoteId] [int] NOT NULL,
[NoteText] [NVARCHAR](MAX) NOT NULL )
Obviously I would prefer to use option 1, because going with option 2 would change so much in our system. However, if option 2 is really the only good way to proceed, then at least I can say these changes are necessary and I have done the homework.
UPDATE:
I ran several tests on a sample database with 100,000 records in it. What I found is that, because of clustered index scans, the IO required for option 1 is roughly twice that of option 2. If I select a large number of records (1,000 or more) option 1 is twice as slow even if I do not include the large text field in the select. As I request fewer rows the lines blur more. In a web app where page sizes of 50 or so are the norm, option 1 will work, but I will be converting all instances to option 2 in the (very) near future for scalability.
Option 2 is better for several reasons:
When querying your tables, the large text fields fill up pages quickly, forcing the database to scan more pages to retrieve data. This is especially taxing when you don't actually need to return the text data.
As you mentioned, it gives you a clean break to change the data type in one swoop. Microsoft has deprecated TEXT in SQL Server 2008, so you should stick with VARCHAR/VARBINARY.
Separate filegroups. Having all your text data in a slower, cheaper storage location might be something you decide to pursue in the future. If not, no harm, no foul.
While Option 1 is easier for now, Option 2 will give you more flexibility in the long-term. My suggestion would be to implement a simple proof-of-concept with the "notes" information separated from the main table and perform some of your queries on both examples. Compare the execution plans, client statistics and logical I/O reads (SET STATISTICS IO ON) for some of your queries against these tables.
A quick note to those suggesting the use of a TEXT/NTEXT from MSDN:
This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature. Use varchar(max), nvarchar(max) and varbinary(max) data types instead. For more information, see Using Large-Value Data Types.
I'd go with Option 2.
You can create a view that joins the two tables to make the transition easier on everyone, and then go through a clean-up process that removes the view and uses the single table wherever possible.
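A minimal sketch of such a transition view, assuming the main table keeps a hypothetical NoteId foreign key to the Notes table defined earlier:
-- Sketch only: MyTable, SomeColumn and NoteId are hypothetical names.
CREATE VIEW dbo.MyTableWithNotes
AS
SELECT m.SomeColumn,        -- list the real columns of MyTable here
       n.NoteText
FROM dbo.MyTable m
LEFT JOIN dbo.Notes n ON n.NoteId = m.NoteId;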
You want to use a TEXT field. TEXT fields aren't stored directly in the row; instead, it stores a pointer to the text data. This is transparent to queries, though - if you ask for a TEXT field, it will return the actual text, not the pointer.
Essentially, using a TEXT field is somewhat between your two solutions. It keeps your table rows much smaller than using a varchar, but you'll still want to avoid asking for them in your queries if possible.
The TEXT/NTEXT data type has practically unlimited length while taking up next to nothing in your record.
It comes with a few strings attached, like special behavior with string functions, but for a secondary "notes/description" type of field these may be less of a problem.
Just to expand on Option 2
You could:
Rename existing MyTable to MyTable_V2
Move the Notes column into a joined Notes table (with 1:1 joining ID)
Create a VIEW called MyTable that joins MyTable_V2 and Notes tables
Create an INSTEAD OF trigger on the MyTable view which saves the Notes column into the Notes table (if NULL then delete any existing Notes row; if NOT NULL then insert if not found, otherwise update) and performs the appropriate action on the MyTable_V2 table (a rough sketch follows these steps)
Note: We've had trouble doing this where there is a Computed column in MyTable_V2 (I think that was the problem, either way we've hit snags when doing this with "unusual" tables)
All new Insert/Update/Delete code should be written to operate directly on MyTable_V2 and Notes tables
Optionally: Have the INSTEAD OF trigger on the MyTable view log the fact that it was called (it can do this minimally: UPDATE a pre-existing log table row with GetDate() only if the existing row's date is > 24 hours old, so it will only do an update once a day).
When you are no longer getting any log records you can drop the INSTEAD OF trigger on MyTable view and you are now fully MyTable_V2 compliant!
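A rough sketch of the trigger described in the steps above, covering only the UPDATE path and the Notes handling plus one placeholder base column (the Id join column, Notes column and SomeColumn are hypothetical):
-- Sketch only: a real trigger would also cover INSERT/DELETE and all MyTable_V2 columns.
CREATE TRIGGER trg_MyTable_InsteadOfUpdate
ON dbo.MyTable
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove notes that were set to NULL
    DELETE n
    FROM dbo.Notes n
    JOIN inserted i ON i.Id = n.Id
    WHERE i.Notes IS NULL;

    -- Update notes that already exist
    UPDATE n
    SET n.NoteText = i.Notes
    FROM dbo.Notes n
    JOIN inserted i ON i.Id = n.Id
    WHERE i.Notes IS NOT NULL;

    -- Insert notes that don't exist yet
    INSERT INTO dbo.Notes (Id, NoteText)
    SELECT i.Id, i.Notes
    FROM inserted i
    WHERE i.Notes IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM dbo.Notes n WHERE n.Id = i.Id);

    -- Apply the non-note columns to the base table
    UPDATE v2
    SET v2.SomeColumn = i.SomeColumn
    FROM dbo.MyTable_V2 v2
    JOIN inserted i ON i.Id = v2.Id;
END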
Huge amount of hassle to implement, as you surmised.
Alternatively trawl the code for all references to MyTable and change them to MyTable_V2, put a VIEW in place of MyTable for SELECT only, and not create the INSTEAD OF trigger.
My plan would be to fix all Insert/Update/Delete statements referencing the now-deprecated MyTable. For me this would be made somewhat easier because we use unique names for all tables and columns in the database, and we use the same names in all application code, so my confidence that I had found all instances with a simple FIND would be high.
P.S. Option 2 is also preferable if you have any SELECT * lying around. We have had clients whose application performance has gone downhill fast when they added large Text/Blob columns to existing tables, because of "lazy" SELECT * statements. Hopefully that isn't the case in your shop though!
