How to efficiently join records with separate string table - sql-server

I have a large table with a lot of duplicate string data. To save space, I have moved the string data to a separate table. My tables now look something like this:
MyRecords
RecordId (int) | FieldA (int) | FieldB (datetime) | FieldC (...) | MyString1Id (int) | MyString2Id (int) | MyString3Id (int) | ...
MyStrings
StringId (int) | StringValue (varchar)
The MyRecords table has about 10 foreign keys to the string table. I have a stored procedure GetMyRecords that retrieves a list of records with the actual string values. This sp now has 10 joins to the string table, one for each string relation:
SELECT [Field1], [Field2], [Field3], ..., [Strings1].[StringValue], [Strings2].[StringValue], ...
FROM MyRecords INNER JOIN
MyStrings AS Strings1 ON MyRecords.MyString1Id = Strings1.StringId INNER JOIN
MyStrings AS Strings2 ON MyRecords.MyString2Id = Strings2.StringId INNER JOIN
MyStrings AS Strings3 ON MyRecords.MyString3Id = Strings3.StringId INNER JOIN
(more joins)
WHERE [Field1] = @Field1 AND [Field2] = @Field2
GetMyRecords is considerably slower than I would want because of all the joins. How could I improve performance for this sp? Can I somehow turn this into a single join?
The strings table has a clustered primary key on StringId, and all the where fields are in a nonclustered index on the MyRecords table.

You should probably take one further step toward normalization and create a join table. Instead of having the MyStringNId columns in MyRecords, have a third table:
CREATE TABLE RecordsStrings (
RecordId [theDataType] NOT NULL REFERENCES MyRecords (RecordId),
StringId [theDataType] NOT NULL REFERENCES MyStrings (StringId)
)
It is not convenient then to have all the strings in the same row of the returned data from the SELECT (though maybe there's a way to do this with a pivot somehow), so it's probably better to restructure the calling code to deal with results returned from:
SELECT [StringValue]
FROM [MyStrings] s
INNER JOIN [RecordsStrings] rs ON rs.StringId = s.StringId
INNER JOIN [MyRecords] r ON rs.RecordId = r.RecordId
WHERE r.Field1 = @Field1 AND r.Field2 = @Field2
If you need the other fields from MyRecords, you can select those as well, though they will be repeated in every returned row. If you have multiple matches on Field1 and Field2, though, that repetition may actually be helpful.

Can I somehow turn this into a single join?
If it is common for the same combination of strings to occur on multiple rows of MyRecords then it would make sense to store those combinations in a separate table. Then you could do a single join.
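As a sketch of that combination approach (the table and column names below are hypothetical), MyRecords would carry a single CombinationId and GetMyRecords would need only one join:
CREATE TABLE StringCombinations (
    CombinationId int NOT NULL PRIMARY KEY,
    String1Value varchar(255) NOT NULL,
    String2Value varchar(255) NOT NULL,
    String3Value varchar(255) NOT NULL
    -- ... up to String10Value
);

SELECT r.[Field1], r.[Field2], r.[Field3],
       c.String1Value, c.String2Value, c.String3Value  -- etc.
FROM MyRecords r
INNER JOIN StringCombinations c ON r.CombinationId = c.CombinationId
WHERE r.[Field1] = @Field1 AND r.[Field2] = @Field2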
As long as you are only storing individual strings, it is not possible to do this in a single join, since each string has to be looked up separately.
You can make the queries easier to read and write by creating a view of the table that includes all of the joins. This will not improve performance, but it will make your queries look a lot better.
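For example, a sketch of such a view (the view name is made up; it simply packages the joins from the question):
CREATE VIEW MyRecordsWithStrings
AS
SELECT r.RecordId, r.[Field1], r.[Field2], r.[Field3],
       s1.StringValue AS String1Value,
       s2.StringValue AS String2Value,
       s3.StringValue AS String3Value
       -- ... remaining string value columns
FROM MyRecords r
INNER JOIN MyStrings AS s1 ON r.MyString1Id = s1.StringId
INNER JOIN MyStrings AS s2 ON r.MyString2Id = s2.StringId
INNER JOIN MyStrings AS s3 ON r.MyString3Id = s3.StringId
-- (more joins)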
How could I improve performance for this sp?
There are things you can do, depending on the form of the data.
If the strings in one field contain (mostly) different information from those in another field, then you could try putting them into different tables. There is a chance this could improve performance if the maximum length of one field is much smaller than the other, or if the number of distinct values for one field is much smaller than the other.
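A rough sketch of that split, assuming (hypothetically) that one field holds short codes and another holds long descriptions:
-- Short, low-cardinality values go in a narrow table...
CREATE TABLE MyShortStrings (
    StringId int NOT NULL PRIMARY KEY,
    StringValue varchar(20) NOT NULL
);
-- ...while long, wide values live in their own table.
CREATE TABLE MyLongStrings (
    StringId int NOT NULL PRIMARY KEY,
    StringValue varchar(400) NOT NULL
);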

The first step would be to run a performance analysis to see where the problems are.
Just on a lark, though, you can often pick up a bit of a performance gain by using (NOLOCK) on the joined tables.
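For example, a sketch of the hint applied to the joined string tables (with the usual caveat that NOLOCK allows dirty reads):
SELECT r.[Field1], r.[Field2], s1.StringValue, s2.StringValue
FROM MyRecords r
INNER JOIN MyStrings AS s1 WITH (NOLOCK) ON r.MyString1Id = s1.StringId
INNER JOIN MyStrings AS s2 WITH (NOLOCK) ON r.MyString2Id = s2.StringId
WHERE r.[Field1] = @Field1 AND r.[Field2] = @Field2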

Related

SQL Server index on optional columns

In my scenario I have a table with a lot of optional columns (20 columns in total, say from col00 to col19; every column contains a non-nullable integer).
When a column contains 0 it is considered empty; any other value has a meaning.
Any subset of those 20 columns could be queried, so I might query for col01 = int1 and col17 = int2.
I need to improve the performance of such queries, but I don't know how to create a representative index.
Surely I could monitor the table for a while and see which column subsets are searched most often, but that is not a satisfactory solution for me (the table is periodically regenerated every few months, and the "tags" encoded that way may change).
I think the best you'll be able to do is to index every column by itself, then use the set operator INTERSECT... in a subquery of your WHERE clause.
INTERSECT returns the distinct rows that are output by both the left and right input queries. So if you select the primary key of the table in the INTERSECT, you get a good subquery that can be used in a WHERE clause. This will require you to rewrite your queries, however.
Example:
SELECT *
FROM tablename
WHERE primary_key IN (
SELECT primary_key FROM tablename WHERE col01 = int1
INTERSECT
SELECT primary_key FROM tablename WHERE col17 = int2
)
That should be sargable, as long as col01 and col17 each have their own index.
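For completeness, the per-column indexes might be created like this (a sketch; the index names are made up):
CREATE INDEX IX_tablename_col01 ON tablename (col01);
CREATE INDEX IX_tablename_col17 ON tablename (col17);
-- ...and likewise for the other optional columns that get queried.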

Fastest Way to Match Column Values of Tables In SQL Server

I have two tables, Table1 and Table2.
Suppose Table1 have the following columns: T1c1, T1c2, T1c3
and Table2 have the following columns: T2c1, T2c2, T2c3
I need to add the values of Table2.T2c3 to Table1.T1c3 based on matching pairs between the two tables' other two columns, or on just one column when the other is NULL in one or both tables. That is, I need to match Table1.T1c1 with Table2.T2c1 and Table1.T1c2 with Table2.T2c2, or fall back to matching only Table1.T1c1 with Table2.T2c1 (and so on) when there is a NULL.
The problem is that my tables are very large; several hundred million rows. I need the fastest matching approach to fill in the Table1.T1c3 values.
I think you could abuse Coalesce() to do this in the ON clause of your join:
SELECT table2.c3 + table1.c3
FROM table1
INNER JOIN table2
ON COALESCE(table1.c1, table2.c1) = COALESCE(table2.c1, table1.c1)
AND COALESCE(table1.c2, table2.c2) = COALESCE(table2.c2, table1.c2);
That feels like it might be speedier than dropping CASE all over the place or hitting this with a bunch of OR conditions, but it might be six of one, half a dozen of the other...
This is assuming that there would never be a null in both c1 and c2 of either table at the same time. Otherwise you may end up with some funky cross-join shenanigans.

SQL query runs into a timeout on a sparse dataset

For sync purposes, I am trying to get a subset of the existing objects in a table.
The table has two fields, [Group] and Member, which are both stringified GUIDs.
All rows together may be too large to fit into a DataTable; I already encountered an OutOfMemory exception. But I have to check that everything I need right now is in the DataTable. So I take the GUIDs I want to check (they come in chunks of 1000), and query only for the related objects.
So, instead of filling my datatable once with all
SELECT * FROM Group_Membership
I am running the following SQL query against my SQL database to get related objects for one thousand Guids at a time:
SELECT *
FROM Group_Membership
WHERE
[Group] IN (@Guid0, @Guid1, @Guid2, @Guid3, @Guid4, @Guid5, ..., @Guid999)
The table in question now contains a total of 142 entries, and the query already times out (CommandTimeout = 30 seconds). On other tables, which are not as sparsely populated, similar queries don't time out.
Could someone shed some light on the logic of SQL Server and whether/how I could hint it into the right direction?
I already tried to add a nonclustered index on the column Group, but it didn't help.
I'm not sure that WHERE IN can make full use of an index on [Group], if it uses one at all. However, if you had a second table containing the GUID values, and furthermore if that column had an index, then a join might perform very fast.
Create a temporary table for the GUIDs and populate it:
CREATE TABLE #Guids (
Guid varchar(255)
)
INSERT INTO #Guids (Guid)
VALUES
(@Guid0), (@Guid1), (@Guid2), (@Guid3), (@Guid4), ...
CREATE INDEX Idx_Guid ON #Guids (Guid);
Now try rephrasing your current query using a join instead of a WHERE IN (...):
SELECT *
FROM Group_Membership t1
INNER JOIN #Guids t2
ON t1.[Group] = t2.Guid;
As a disclaimer, if this doesn't improve the performance, it could be because your table has low cardinality. In such a case, an index might not be very effective.

A big 'like' matching query

I've got 2 tables,
[Item] with field [name] nvarchar(255)
[Transaction] with field [short_description] nvarchar(3999)
And I need to do this:
Select [Transaction].id, [Item].id
From [Transaction] inner join [Item]
on [Transaction].[short_description] like ('%' + [Item].[name] + '%')
The above works if limited to a handful of items, but unfiltered it runs for over 20 minutes before I cancel it.
I have a NC index on [name], but I cannot index [short_description] due to its length.
[Transaction] has 320,000 rows
[Items] has 42,000.
That's 13,860,000,000 combinations.
Is there a better way to perform this query ?
I did poke at full-text, but I'm not really that familiar with it; the answer was not jumping out at me there.
Any advice appreciated !!
Starting a comparison string with a wildcard (% or _) will NEVER use an index, and will typically be disastrous for performance. Your query will need to scan indexes rather than seek through them, so indexing won't help.
Ideally, you should have a third table that would allow a many-to-many relationship between Transaction and Item based on IDs. The design is the issue here.
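As a sketch of what that might look like (the table and column names here are assumptions):
CREATE TABLE TransactionItems (
    TransactionId int NOT NULL REFERENCES [Transaction] (id),
    ItemId        int NOT NULL REFERENCES [Item] (id),
    PRIMARY KEY (TransactionId, ItemId)
);

-- The matching query then becomes a pair of index-friendly equality joins:
SELECT t.id, i.id
FROM [Transaction] t
INNER JOIN TransactionItems ti ON ti.TransactionId = t.id
INNER JOIN [Item] i ON i.id = ti.ItemId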
After some more sleuthing I have utilized some Fulltext features.
sp_fulltext_keymappings
gives me my transaction table id, along with the FT docID
(I found out that 'doc' = text field)
sys.dm_fts_index_keywords_by_document
gives me FT documentId along with the individual keywords within it
Once I had that, the rest was simple.
Although, I do have to look into the term 'keyword' a bit more... seems that definition can be variable.
This only works because the text I am searching for has no white space.
I believe that you could tweak the FTI configuration to work with other scenarios... but I couldn't promise.
I need to look more into Fulltext.
My current 'beta' code below.
CREATE TABLE #keyMap
(
docid INT PRIMARY KEY ,
[key] varchar(32) NOT NULL
);
DECLARE @db_id int = db_id(N'<database name>');
DECLARE @table_id int = OBJECT_ID(N'Transactions');
INSERT INTO #keyMap
EXEC sp_fulltext_keymappings @table_id;
select km.[key] as transaction_id, i.[id] as item_id
from
sys.dm_fts_index_keywords_by_document ( @db_id, @table_id ) kbd
INNER JOIN
#keyMap km ON km.[docid]=kbd.document_id
inner join [items] i
on kbd.[display_term] = i.name
;
My actual version of the code includes inserting the data into a final table.
Execution time is coming in at 30 seconds, which serves my needs for now.

Recommended approach to merging two tables

I have a database schema like this:
[Patients] [Referrals]
| |
[PatientInsuranceCarriers] [ReferralInsuranceCarriers]
\ /
[InsuranceCarriers]
PatientInsuranceCarriers and ReferralInsuranceCarriers are identical, except for the fact that they reference either Patients or Referrals. I would like to merge those two tables, so that it looks like this:
[Patients] [Referrals]
\ /
[PatientInsuranceCarriers]
|
[InsuranceCarriers]
I have two options here
either create two new columns - ID_PatientOrReferral + IsPatient (which will tell me which table to reference)
or create two different columns - ID_Patient and ID_Referral, both nullable.
Generally, I try to avoid nullable columns, because I consider them a bad practice (meaning, if you can live w/o nulls, then you don't really need a nullable column) and they are more difficult to work with in code (e.g., LINQ to SQL).
However, I am not sure the first option would be a good idea. I saw that it is possible to create two FKs on ID_PatientOrReferral (one for Patients and one for Referrals), but I can't set any update/delete behavior there for obvious reasons, and I don't know whether the constraint check on insert works that way either, so it looks like the FKs would be there only to mark that the relationships exist. Alternatively, I could skip the foreign keys and instead add the relationships in the DBML manually.
Is any of the approaches better and why?
To expand on my somewhat terse comment:
I would like to merge those two tables
I believe this would be a bad idea. At the moment you have two tables with good, clear relation predicates (briefly, what it means for a record to exist in the table) - and crucially, these relation predicates are different for the two tables:
A record exists in PatientInsuranceCarriers <=> that Patient is associated with that Insurance Carrier
A record exists in ReferralInsuranceCarriers <=> that Referral is associated with that Insurance Carrier
Sure, they are similar, but they are not the same. Consider now what would be the relation predicate of a combined table:
A record exists in ReferralAndPatientInsuranceCarriers <=> {(IsPatient is true and the Patient with ID ID_PatientOrReferral) or alternatively (IsPatient is false and the Referral with ID ID_PatientOrReferral)} is associated with that Insurance Carrier
or if you do it with NULLs
A record exists in ReferralAndPatientInsuranceCarriers <=> {(ID_Patient is not NULL and the Patient with ID ID_Patient) or alternatively (ID_Referral is not NULL and the Referral with ID ID_Referral)} is associated with that Insurance Carrier
Now, I'm not one to automatically suggest that more complicated relation predicates are necessarily worse; but I'm fairly sure that either of the two above is worse than those they would replace.
To address your concerns:
we now have two LINQ to SQL entities, separate controllers and views for each
In general I would agree with reducing duplication; however, only duplication of the same things! Here, is it not the case that all of the above are essentially 'boilerplate', whose construction and maintenance can be delegated to suitable development tools?
and have to merge them when preparing data for reports
If you were to create a VIEW containing a UNION for reporting purposes, you would keep the simplicity of the actual data and still have the ability to report on a combined list; e.g. (making assumptions about column names etc.):
CREATE VIEW InterestingInsuranceCarriers
AS
SELECT
IC.Name InsuranceCarrierName
, P.Name CounterpartyName
, 'Patient' CounterpartyType
FROM InsuranceCarriers IC
INNER JOIN PatientInsuranceCarriers PIC ON IC.ID = PIC.InsuranceCarrierID
INNER JOIN Patient P ON PIC.PatientId = P.ID
UNION
SELECT
IC.Name InsuranceCarrierName
, R.Name CounterpartyName
, 'Referral' CounterpartyType
FROM InsuranceCarriers IC
INNER JOIN ReferralInsuranceCarriers RIC ON IC.ID = RIC.InsuranceCarrierID
INNER JOIN Referral R ON RIC.ReferralId = R.ID
Copying my answer from this question
If you really need A_or_B_ID in TableZ, you have two similar options:
1) Add nullable A_ID and B_ID columns to table Z, make A_or_B_ID a computed column using ISNULL on these two columns, and add a CHECK constraint such that only one of A_ID or B_ID is not null.
2) Add a TableName column to table Z, constrained to contain either 'A' or 'B'. Now create A_ID and B_ID as computed columns, which are only non-null when their appropriate table is named (using a CASE expression). Make them persisted, too.
In both cases, you now have A_ID and B_ID columns which can have appropriate foreign keys to the base tables. The difference is in which columns are computed. Also, you don't need TableName in option 2 above if the domains of the two ID columns don't overlap - so long as your CASE expression can determine which domain A_or_B_ID falls into.
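A minimal sketch of option 1, with hypothetical table and column names:
CREATE TABLE TableZ (
    Z_ID      int NOT NULL PRIMARY KEY,
    A_ID      int NULL REFERENCES TableA (A_ID),
    B_ID      int NULL REFERENCES TableB (B_ID),
    -- the single combined ID, computed from whichever FK is populated
    A_or_B_ID AS ISNULL(A_ID, B_ID),
    -- exactly one of the two FKs must be set
    CONSTRAINT CK_TableZ_ExactlyOne CHECK (
        (A_ID IS NOT NULL AND B_ID IS NULL) OR
        (A_ID IS NULL AND B_ID IS NOT NULL)
    )
);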
