Parsing comma-separated values in MS-SQL (no csv or such) - sql-server

I use a closed-source commercial application that uses an MS-SQL database. I regularly have to query this database myself for various purposes. This means the table and database design is fixed, and I can't do anything about it at all; I just have to live with it. Now I have two tables with the following layouts (abstracted, not to discredit the software/database designer):
t1: ID (int), att1(varchar), att2(varchar), .... attx(varchar)
t2: ID (int), t1_ids(varchar)
Now the contents of t1_ids is (shudder) a comma-separated list of t1 IDs (for example 12, 456, 43, 675, 54). What I want to do is (you guessed it) join those two tables.
Fortunately for me, these are very small tables, and I don't care about performance in terms of complexity at all (could be O(n^m) as far as I care).
Ideally I would like to make a view that joins these two tables. I don't have any requirements for inserting or updating, just for select statements. What would be the easiest and clearest (in terms of maintainability) way to do this?

To match the first and last IDs in the list as well, wrap both sides in commas and put the pattern on the right-hand side of LIKE:
select *
from t1
join t2 on ',' + replace(t2.t1_ids, ' ', '') + ',' like '%,' + cast(t1.ID as varchar(20)) + ',%'
It doesn't matter whether T2.t1_ids starts or ends with a comma: wrapping the stored list in commas ensures every valid value is enclosed by commas on both sides. The replace() strips the spaces after the commas, and the cast is needed because t1.ID is an int.

EDIT: I realised after posting this answer that the PARSENAME function can only return one part of the parsed string, so it's not as useful as I thought it would be in your situation.
Searching SO for alternative solutions I came across an interesting answer to this question: Split String in SQL
If you can add triggers to your database then you could call the split-string function on INSERT, UPDATE and DELETE to maintain another table with the IDs separated out as rows. Then you can create your view using that table.
I know you said you don't mind about query speed, but at least that way you are not parsing all the strings every time you query the data.
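If the database happens to be on SQL Server 2016 or later (compatibility level 130+), another option is to split the list inline with STRING_SPLIT and wrap that in the view, which avoids the string-pattern join entirely. A minimal sketch, assuming that version and the t1/t2 layout from the question (the view name is made up):
-- Hypothetical view name; assumes SQL Server 2016+ for STRING_SPLIT,
-- and TRY_CAST to skip any malformed entries in the list.
CREATE VIEW dbo.v_t1_per_t2
AS
SELECT t2.ID AS t2_id, t1.*
FROM t2
CROSS APPLY STRING_SPLIT(t2.t1_ids, ',') AS s
JOIN t1
    ON t1.ID = TRY_CAST(LTRIM(RTRIM(s.value)) AS int);
Selecting from the view then behaves like an ordinary join, and the trim handles the spaces after the commas in the stored list.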

Related

Improving Merge Performance with Change Data Capture and Hash

Today I'm trying to tune the performance of an audit database. I have a legal reason for tracking changes to rows, and I've implemented a set of tables using the System Versioned tables method in SQL Server 2016.
My overall process lands "RAW" data into an initial table from a source system. From here, I have a MERGE process that takes data from the RAW table, compares every column in the RAW table to what exists in the auditable, system-versioned staging table, and decides what has changed. System row versioning then tells me what has changed and what hasn't.
The trouble with this approach is that my tables are very wide. Some of them have 400 columns or more. Even tables that have 450,000 records take SQL Server about 17 minutes to perform a MERGE operation. It's really slowing down the performance of our solution, and it seems it would help things greatly if we could speed it up. We presently have hundreds of tables we need to do this for.
At the moment both the RAW and STAGE tables are indexed on an ID column.
I've read in several places that we might consider using a CHECKSUM or HASHBYTES function to record a value computed over the RAW extract (what would you call this? A GUID? UUID? Hash?). We'd then compare the calculated value to what exists in the STAGE table. But here's the rub: there are often quite a few NULL values across many columns. It's been suggested that we cast all the columns to the same type (nvarchar(max)), and NULL values seem to cause the entire checksum computation to fall flat, so I'm also coding lots of ISNULL(,'UNKNOWN') statements into my code too.
So - are there better methods for improving the performance of the merge here? I thought that I could use a row-updated timestamp column as a single value to compare instead of the checksum, but I am not certain that would pass legal muster/scrutiny; legal is concerned that rows may be edited outside of an interface and the column wouldn't always be updated. I've seen approaches where developers use a concatenation function (shown below) to combine many column values together. This seems code-intensive and expensive in terms of computing and casting columns.
So my questions are:
Given the situational reality, can I improve MERGE performance in any way here?
Should I use a checksum, or hashbytes, and why?
Which HASHBYTES method makes the most sense here? (I'm only comparing one RAW row to another STAGE row based on an ID match, right?)
Did I miss something with functions that might make this comparison faster or easier in the reading I have done? It seems odd there aren't better functions besides CONCAT available to do this in SQL Server.
I wrote the below code to show some of the ideas I am considering. Is there something better than what I wrote below?
DROP TABLE IF EXISTS MyTable;

CREATE TABLE MyTable
(
    C1 VARCHAR(10),
    C2 VARCHAR(10),
    C3 VARCHAR(10)
);

INSERT INTO MyTable (C1, C2, C3)
VALUES
    (NULL, NULL, NULL),
    (NULL, NULL, 3),
    (NULL, 2, 3),
    (1, 2, 3);

SELECT
    HASHBYTES('SHA2_256',
        CONCAT(C1, '-',
               C2, '-',
               C3)) AS HashbytesValueNoCastNoNullCheck,
    HASHBYTES('SHA2_256',
        CONCAT(CAST(C1 AS varchar(max)), '-',
               CAST(C2 AS varchar(max)), '-',
               CAST(C3 AS varchar(max)))) AS HashbytesValueCastWithNoNullCheck,
    HASHBYTES('SHA2_256',
        CONCAT(ISNULL(CAST(C1 AS varchar(max)), 'UNKNOWN'), '-',
               ISNULL(CAST(C2 AS varchar(max)), 'UNKNOWN'), '-',
               ISNULL(CAST(C3 AS varchar(max)), 'UNKNOWN'))) AS HashbytesValueWithCastWithNullCheck,
    CONCAT(ISNULL(CAST(C1 AS varchar(max)), 'UNKNOWN'), '-',
           ISNULL(CAST(C2 AS varchar(max)), 'UNKNOWN'), '-',
           ISNULL(CAST(C3 AS varchar(max)), 'UNKNOWN')) AS StringValue,
    CONCAT(C1, '-', C2, '-', C3) AS ConcatString,
    C1,
    C2,
    C3
FROM MyTable;
Given the situational reality, can I improve MERGE performance in any way here?
You should test, but storing a hash for every row, computing the hash for the new rows, and comparing based on the (key,hash) should be cheaper than comparing every column.
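For illustration, a minimal sketch of that shape, reusing the three-column layout from the question plus an assumed ID key and a precomputed RowHash column (varbinary(32) for SHA2_256) on both sides; only rows whose hash differs get touched:
-- Sketch only: Raw, Stage, the ID key and the RowHash column are assumptions,
-- not the poster's real schema.
MERGE dbo.Stage AS s
USING dbo.Raw AS r
    ON s.ID = r.ID
WHEN MATCHED AND s.RowHash <> r.RowHash THEN
    UPDATE SET C1      = r.C1,
               C2      = r.C2,
               C3      = r.C3,
               RowHash = r.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, C1, C2, C3, RowHash)
    VALUES (r.ID, r.C1, r.C2, r.C3, r.RowHash);
The win is that the MATCHED branch compares two small fixed-width values instead of 400 nullable columns.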
Should I use a checksum, or hashbytes, and why?
HASHBYTES has a much lower probability of missing a change. Roughly speaking, with CHECKSUM you'll probably eventually miss a change or two; with HASHBYTES you almost certainly won't ever miss one. See the remarks here: BINARY_CHECKSUM.
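A quick, hedged illustration of why CHECKSUM is the weaker choice: it is only 32 bits and follows collation rules, so under the usual case-insensitive collation two strings differing only in case produce the same checksum, while HASHBYTES works on the raw bytes and distinguishes them:
-- Under a case-insensitive default collation the two CHECKSUM values match,
-- so a change in letter case alone would be missed; the SHA2_256 hashes differ.
SELECT CHECKSUM('SQL Server')              AS ChecksumMixedCase,
       CHECKSUM('sql server')              AS ChecksumLowerCase,
       HASHBYTES('SHA2_256', 'SQL Server') AS HashMixedCase,
       HASHBYTES('SHA2_256', 'sql server') AS HashLowerCase;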
Did I miss something with functions that might make this comparison faster or easier in the reading I have done?
No. There's no special way to compare multiple columns.
Is there something better than what I wrote below?
You definitely should replace NULLs, otherwise the rows (1, NULL, 'A') and (1, 'A', NULL) would get the same hash. The NULL replacement and the delimiter should both be something that won't appear as a value in any column. And if you have Unicode text, converting to varchar may erase some changes, so it's safer to use nvarchar, e.g.:
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C2 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C3 as nvarchar(max)),N'~'))) AS HashbytesValueWithCastWithNullCheck
JSON in SQL Server is very fast. So you might try a pattern like:
select t.Id, z.RowJSON, hashbytes('SHA2_256', RowJSON) RowHash
from SomeTable t
cross apply (select t.* for json path) z(RowJSON)

HANA SQL CTE WHERE CONDITION

I'm writing a scripted calculation view on HANA using SQL.
I'm looking for better-performing alternatives to the logic that I have implemented in a while loop. A simplified version of the code is below.
It tries to find similar-looking vendors in table B for the vendors from table A.
Please bear with me regarding the inaccurate syntax.
v = select vendor, vendorname from A;
while --set a counter here
vendorname = capture the record from v for row number represented by counter here
t = select vendor, vendorname from v where (read single vendor from counter row)
union all
select vendor, vendorname from B where contains(vendorname,:vendorname,fuzzy(0.3))
union all
select vendor, vendorname from t
endwhile
This query dies when there are thousands of records in both tables. So after reading a few blogs, I realized that I was going in the wrong direction by using a loop.
To make this a little faster, I came across something called a CTE.
When I tried to implement the same code using a CTE, I'm not allowed to do so.
The sample code I'm trying to write is below. Can anybody please help me get this right? The syntax is not accepted by the system.
t = with mytab ("Vendor", "VendorName")
AS ( select "Vendor", "VendorName" from "A" WHERE ( "Updated_Date" >= :From_Date AND "Updated_Date" <= :To_Date ) )
select * from "B" WHERE CONTAINS ("VendorName", mytab."VendorName",FUZZY(0.3));
The SQL error for this syntax is:
SQL: invalid identifier: MYTAB
I would like to know:
Whether such operation with CTE is allowed. If yes, what is the correct syntax in HANA SQL?
If No, how do I achieve the desired result without looping through one table?
Thanks,
Anup
CTEs are allowed in SAP HANA - you might want to check the HANA SQL reference if you're looking for the syntax.
But as you're in a SQLScript context anyhow, you might as well use table variables instead.
What I'm not sure about is what you are actually trying to do. Please provide a description of your usage scenario, if possible.
Ok, based on your comments, the following approach could work for you.
Note, in my example I use a copy of the USERS system table, so you will have to fit the query to your tables.
do
begin
    declare user_names nvarchar(5000);

    select string_agg(user_name, ' ') into user_names
    from cusers
    where user_name in ('SYS', 'SYSTEM');

    select *
    from cusers
    where contains (user_name, :user_names, fuzzy(0.3));
end;
What I do here is to get all the potential names for which I want to do a fuzzy lookup into a variable user_names (separated by a space). For this I use the STRING_AGG() aggregation function.
After the first statement is finished, :user_names contains SYSTEM SYS in my example.
Now, CONTAINS allows you to search multiple columns for multiple search terms at once (you may want to re-check the reference documentation for details here), so
CONTAINS (<column_name>, 'term1 term2 term3')
looks for all three terms in <column_name>.
With that we feed the string SYS SYSTEM into the second query and the CONTAINS clause.
That works fine for me, avoids a join and runs over the table to be searched only once.
BTW: no idea where you got that statement about table variables in read-only procedures from - it's wrong. Of course you can use table variables; in fact, it's recommended to make use of them.

Retrieve data from multiple tables and store them in one table, confronted with duplicates

I'm working on some critical data-retrieval report tasks and have run into difficulties. Basically, the data belongs to the medical area and is distributed across several tables, and I can't change the architecture of the database table design. In order to finish my report, I need the following steps:
1- Divide the whole report into several parts, and for each part retrieve data using several joins. (For example, part A can be retrieved by this: select a1.field1, a2.field2 from a1 left join a2 on a1.fieldA = a2.fieldA.) Then I have all the data for part A.
2- The same thing happens for part B: select b1.field1, b2.field2 from b1 left join b2 on b1.fieldB = b2.fieldB, and then I also have all the data for part B.
3- Same case for part C, part D... and so on.
The reason I divide them is that each part needs more than 8 joins (medical data is always complex), so I can't do all of them in a single query (more than 50 joins would be impossible to finish...).
After that, I run my Spring Batch program to insert all the data from part A, part B, part C... into one table as my final report table. The problem is that not every part will have the same number of rows: part A may return 10 rows while part B may return 20 rows. Since the time condition for each part is the same (1 day) and can't be changed, I'm just wondering how I can store all these different parts of data in one table with minimum overhead. I don't want to have too many duplicates. Thanks for the great help.
Lei
Looks to me like what you need are joins over the "data from part A", "data from part B" and "data from part C". Let's call them da, db and dc.
It's perfectly alright that the number of rows in da/db/dc differs. But since you're trying to put them all in a single table at the end, there is obviously some relation between them. Without a better description of that relation it's not possible to provide a more concrete answer, so I'll just write down my thoughts, which you might already know, but anyway...
Simplest way is to join results from your 3 [inner] queries in a higher level [outer] query.
select j.x, j.y, j.z
from (
    -- da join db join dc
) j;
If this is not possible (due to way too many joins as you said) then try one of these:
Create 3 separate materialized views (one each for da, db and dc) and perform the join on these views. Materialized is optional (i.e. you can use "normal" views too), but it should improve performance greatly if available in your DB.
First run the queries for da/db/dc, fetch the data and put it into intermediate tables, then run a join on those tables (a sketch of this follows below).
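A minimal sketch of that second option in SQL Server-style syntax, assuming made-up part tables and a shared key column (here called patient_id) that relates the parts; none of these names come from the question:
-- Stage each part once, keyed on whatever actually relates the parts.
SELECT a1.fieldA AS patient_id, a1.field1 AS a_field1, a2.field2 AS a_field2
INTO tmp_part_a
FROM a1 LEFT JOIN a2 ON a1.fieldA = a2.fieldA;

SELECT b1.fieldB AS patient_id, b1.field1 AS b_field1, b2.field2 AS b_field2
INTO tmp_part_b
FROM b1 LEFT JOIN b2 ON b1.fieldB = b2.fieldB;

-- Final report: join the staged parts on the shared key instead of piling
-- rows from each part on top of each other.
INSERT INTO report_table (patient_id, a_field1, a_field2, b_field1, b_field2)
SELECT a.patient_id, a.a_field1, a.a_field2, b.b_field1, b.b_field2
FROM tmp_part_a a
LEFT JOIN tmp_part_b b ON b.patient_id = a.patient_id;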
PS: If you want to run reports (many/frequent/large) on some data, then that data should be modelled appropriately for reporting, or you'll run into a heap of trouble in the future.
If you want something more concrete, please post the relationship between da/db/dc.

SQL Query taking forever

I have this web application tool which queries data and shows it in a grid. A lot of people use it, so it has to be quite performant.
The thing is, I needed to add a couple of extra fields through joins, and now the query takes forever to run.
If I in sql server run the following query:
select top 100 *
from bam_Prestatie_AllInstances p
join bam_Zending_AllRelationships r on p.ActivityID = r.ReferenceData
join bam_Zending_AllInstances z on r.ActivityID = z.ActivityID
where p.PrestatieZendingOntvangen >= '2010-01-26' and p.PrestatieZendingOntvangen < '2010-01-27'
This takes about 35-55 seconds, which is way too long, because this is only a small one.
If I remove one of the two date checks it takes only 1 second. If I remove the two joins it also takes only 1 second.
When I look at the query plan I can see that 100% of the time is spent on the indexing of the PrestatieZendingOntvangen field. If I set this field to be indexed, nothing changes.
Anybody have an idea what to do?
Because my clients are starting to complain about time-outs etc.
Thanks
Besides the obvious question of an index on the bam_Prestatie_AllInstances.PrestatieZendingOntvangen column, also check if you have indices for the foreign key columns:
p.ActivityID (table: bam_Prestatie_AllInstances)
r.ReferenceData (table: bam_Zending_AllRelationships)
r.ActivityID (table: bam_Zending_AllRelationships)
z.ActivityID (table: bam_Zending_AllInstances)
Indexing the foreign key fields can help speed up JOINs on those fields quite a bit!
Also, as has been mentioned already: try to limit your fields being selected by specifying a specific list of fields - rather than using SELECT * - especially if you join several tables, just the sheer number of columns you select (multiplied by the number of rows you select) can cause massive data transfer - and if you don't need all those columns, that's just wasted bandwidth!
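For illustration, a hedged sketch of the kind of indexes being suggested; the column choices are guesses based on the query in the question, not the actual schema, so test them against the real execution plan before deploying:
-- Date filter on the instances table; INCLUDE the join key so the seek covers it.
CREATE INDEX IX_Prestatie_ZendingOntvangen
    ON bam_Prestatie_AllInstances (PrestatieZendingOntvangen)
    INCLUDE (ActivityID);

-- Join column on the relationships table.
CREATE INDEX IX_Relationships_ReferenceData
    ON bam_Zending_AllRelationships (ReferenceData)
    INCLUDE (ActivityID);

-- Join column on the other instances table.
CREATE INDEX IX_Zending_ActivityID
    ON bam_Zending_AllInstances (ActivityID);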
Specify the fields you want to retrieve, rather than *
Specify either Inner Join or Outer Join
Try between?
where p.PrestatieZendingOntvangen
between '2010-01-26 00:00:00' and '2010-01-27 23:00:00'
Have you placed indexes on the date fields in your WHERE clause?
If not, I would create an index on those fields to see if it makes any difference to your timings.
Of course, indexes take up more disk space, so you will have to consider the impact of that extra index.
EDIT:
The others have also made good points about specifying which columns you require in the SELECT instead of * (wildcard), and placing more indexes on foreign keys etc.
Someone from a DB background can clear up my doubt on this.
I think you should specify the date in a style which the DB will be able to understand.
For example, assuming the date is stored in mm/dd/yyyy style inside the table and your query uses a different style of date for comparison (yyyy-mm-dd), the performance will go down.
Am I being too naive when I assume this?
How many columns do bam_Prestatie_AllInstances and the other tables have? It looks like you are pulling all columns, and that can definitely be a performance issue.
Have you tried to select specific columns from specific tables such as:
select top 100 p.column1, p.column2, p.column3
Instead of querying all columns as you are currently doing:
select top 100 *

TSQL "LIKE" or Regular Expressions?

I have a bunch (750K) of records in one table that I have to check to see whether they're in another table. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .NET app that loads the source records, then runs through and searches using a LIKE (via a T-SQL function) to determine whether there are any matching records. If yes, the source table is updated with a positive; if not, the record is left alone.
The app processes about 1000 records an hour. When I did this as a cursor sproc on SQL Server, I got pretty much the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?
What about doing it all in the DB, rather than pulling records into your .Net app:
UPDATE s
SET some_field = 1
FROM source_table s
WHERE EXISTS
(
    SELECT 1 FROM target_table t
    WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This will reduce the total number of queries from 750k update queries down to 1 update.
First I would redesign if at all possible. Better to add a column that contains the correct value and be able to join on it. If you still need the long one, you can use a trigger to extract the data into the column at the time it is inserted.
If you have data you can match on, you need neither LIKE '%somestuff%' (which can't use indexes) nor a cursor, both of which are performance killers. This should be a set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using T-SQL, and I would attempt the regular expression route. Not knowing how many different phrases there are and the structure of each, I cannot say if the regular expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW, if you are working with tables that large, I would resolve to never write another cursor. They are extremely bad for performance, especially when you start talking about that size of record. Learn to think in sets, not record-by-record processing.
One thing to be aware of with using a single update (mbeckish's answer) is that the transaction log (enabling a rollback if the query is cancelled) will be huge. This will drastically slow down your query. As such it is probably better to process the rows in blocks of 1,000 or so.
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut-feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be just to strip off any text from the last '(' onwards. (This assumes all the values follow that pattern.) This would simplify the query condition to (b.field like '%' + a.field).
Still, an index wouldn't help here either, as the important characters are at the end. So, bizarrely, it could well be worthwhile storing the characters of both tables in reverse order. The index on your temporary table would then come into use.
It may seem very wasteful to spend that much time, but in this case a small benefit would yield a great reward. (A few hours' work to halve the comparisons from 750 billion to 375 billion, for example. And if you can get the index into play you could reduce this a thousand-fold, thanks to indexes being tree searches, not just ordered tables...)
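A hedged sketch of that reverse-order trick, with made-up table and column names; it assumes the trailing '(n pages)' text has already been stripped into a cleaned column as described above, so the pattern can be anchored at the start of the reversed string:
-- Persisted computed column holding the reversed, already-cleaned value;
-- REVERSE is deterministic, so the column can be indexed.
ALTER TABLE target_table
    ADD target_field_rev AS REVERSE(target_field_clean) PERSISTED;

CREATE INDEX IX_target_field_rev
    ON target_table (target_field_rev);

-- A trailing match on the original value becomes a leading-prefix match here,
-- which gives the optimizer a chance to seek instead of scanning every row.
SELECT t.*
FROM target_table t
JOIN source_table s
    ON t.target_field_rev LIKE REVERSE(s.source_field) + '%';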
Second Idea
Assuming you do copy the target table into a temp table, you may benefit further from processing the source in blocks of 1000 by also deleting the matching records from the target table. (This would only be worthwhile where you delete a meaningful amount from the target table, such that after all 750,000 records have been checked, the target table is now [for example] half the size it started at.)
EDIT:
Modified Second Idea
Put the whole target table in to a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even bring indexes in to play.
Loop through each record from the source table one at a time. Use the following logic in your loop...
DELETE target WHERE field LIKE '%' + @source_field + '%'
IF (@@ROWCOUNT = 0)
[no matches]
ELSE
[matches]
The continuous deleting makes the query faster on each loop, and you're only using one query on the data (instead of one to find matches, and a second to delete the matches)
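For what it's worth, a hedged sketch of that loop in T-SQL, with assumed table and column names (#target is a working copy of the target table; source_table holds the 750K case numbers and a ContainsBit flag):
DECLARE @id int, @case varchar(255), @matched int;

-- One pass over the source rows; the working copy shrinks as matches are deleted.
DECLARE case_cursor CURSOR FAST_FORWARD FOR
    SELECT id, source_field FROM source_table;

OPEN case_cursor;
FETCH NEXT FROM case_cursor INTO @id, @case;

WHILE @@FETCH_STATUS = 0
BEGIN
    DELETE FROM #target WHERE target_field LIKE '%' + @case + '%';
    SET @matched = @@ROWCOUNT;

    UPDATE source_table
    SET ContainsBit = CASE WHEN @matched > 0 THEN 1 ELSE 0 END
    WHERE id = @id;

    FETCH NEXT FROM case_cursor INTO @id, @case;
END;

CLOSE case_cursor;
DEALLOCATE case_cursor;
Note the inherent caveat of this approach: once a target row is deleted it can no longer match a later source record, so it only works if matches are expected to be exclusive.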
Try this --
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable) t2
on charindex(t1.SourceField, t2.TargetField) > 0
First thing is to make sure you have an index for that column on the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to see if you are not doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.
There are lots of ways to skin the cat - I would think that first it would be important to know if this is a one-time operation or a task that needs to be completed regularly.
Not knowing all the details of your problem, if it were me, and this was a one-time (or infrequent) operation, which it sounds like it is, I'd probably extract just the pertinent fields from the two tables, including the primary key from the source table, and export them to a local machine as text files. The file sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like C/C++ or another "lightweight" language that has raw processing power, and write out a table of primary keys that "match", which I would then load back into SQL Server and use as the basis of an update query (i.e. update source table where id in select id from temp table).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in sql.
By the sounds of your SQL, you may be trying to do 750,000 table scans against a multi-million-record table.
Tell us more about the problem.
Holy smoke, what great responses!
The system is on a disconnected network, so I can't copy/paste, but here's a retype.
Current UDF:
Create function CountInTrim
(@caseno varchar(255))
returns int
as
Begin
    declare @reccount int
    select @reccount = count(recId) from targettable where title like '%' + @caseno + '%'
    return @reccount
End
Basically, if there's a record count, then there's a match, and the .NET app updates the record. The cursor-based sproc had the same logic.
Also, this is a one-time process, determining which entries in a legacy record/case management system migrated successfully into the new system, so I can't redesign anything. Of course, the developers of either system are no longer available, and while I have some SQL experience, I am by no means an expert.
I parsed the case numbers from the crazy way the old system had to make the source table, and that's the only thing in common with the new system, the case number format. I COULD attempt to parse out the case number in the new system, then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the clr if needed.
I'll update the results. And, since I've already processed more than half the records, that should help.
Try either Dan R's update query from above:
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField
      from dbo.TargetTable) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is SQL Server 2005 or later, then this would be a classic use for a computed column using SQL CLR code with regular expressions - no need for a standalone app.
