Best technique to search for matching substrings with Linq & SQL Server

Best technique to search for matching substrings with Linq & SQL Server - sql-server

I have a requirement to find rows in a table containing 200,000 entries. Some may not consider this 'large', but it is large enough to warrant performance considerations.
The table contains strings consisting of digits only. For instance, the user can enter something like '12340-0560-78', or portions of this, e.g. '0560', and I need to match values
12345678 and
123405678 and
0123456780 etc
These are NDCs, national drug codes, and in spite of the standard, manufacturers format them in a variety of ways by adding or omitting zeros in various positions in their barcodes.
I started out with letting Linq do the work, by removing zeros and non-numeric characters from the search string and using Contains() on the column after removing all zeros from it. This is too slow.
So I added a computed column to the table, containing the search column minus all zeros. This is better, but I am still doing a table scan due to the Contains().
I then created a full text index, but realized that with a full text index I cannot search for substrings, only for words, phrases, and prefixes. Strange, but it does not solve this problem.
Are there other options?

How about just creating a simple clustered index on the computed column. Is the performance ok then?
e.g.
CREATE TABLE [dbo].[foo](
[code] [varchar](20) NULL,
[ccol] AS (replace(replace([code],'-',''),' ','')
) ON [PRIMARY]
and
CREATE CLUSTERED INDEX [IX_foo] ON [dbo].[foo]
(
[ccol] ASC
)
...

I would still give full-text index a try, but you have to prepare text for indexing in a certain way.
The idea is to create a separate ndc_suffixes column and populate it with all the suffixes of ndc. That is, for ndc = '1234567890' the ndc_suffixes will be '1234567890 234567890 34567890 ... 890 90 0'. This can be a computed persisted column. Since NDCs are about 10 digits long, ndc_suffixes will take reasonable amount of storage (and it can be moved off to a separate table anyway).
Then a full-text prefix search over ndc_suffixes could be used to get substring matches. Also an additional check ndc like '%560%' would likely be necessary to filter out bogus matches, but this one would be run over a significantly reduced set of rows.

Related

Converting Varbinary to Char and again to Varbinary for Primary key

I came across a sql code, which creates primary keys with hashbytes function and md5 algorithm. The code looks like this:
SELECT
CONVERT(VARBINARY(32),
CONVERT( CHAR(32),
HASHBYTES('MD5',
(LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))
),
2)
)
FROM database.schema.table
I find it hard to understand for what is the result from hashbytes function is converted to char and then to varbinary, when we get directly varbinary from hashbytes function. Is there any good reason to do so?

Short Version
This code pads a hash with 0x20 bytes which is rather strange and most likely due to misunderstandings by the initial author. Using hashes as keys is a terrible idea anyway
Long Version
Hashes are completely inappropriate for generating primary keys. In fact, since the same hash can be generated from different original data, this code is guaranteed to produce duplicate values, causing collisions at best.
Worst case, you end up updating or deleting the wrong row, resulting in data loss. In fact, given that MD5 was broken over 20 years ago, one can calculate the values that would result in collisions. This has been used to hack systems in the past and even generate rogue CA certificates as far back as 2008.
And even worse, the concatenation expression :
(LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))
Will create the same initial string for multiple different column values.
On top of that, given the random nature of hash values, this results in table fragmentation and an index that can't be used for range queries. Primary keys most of the time are clustered keys as well, which means they specify the order rows are stored on disk. Using essentially random values for a PK means new rows can be added at the middle or even the start of a table's data pages.
This also harms caching, as data is loaded from disk in pages. With a meaningful clustered key, it's highly likely that loading a specific row will also load rows that will be needed very soon. Loading eg 50 rows while paging may only need to load a single page. With an essentially random key, you could end up loading 50 pages.
Using a GUID generated with NEWID() would provide a key value without collisions. Using NEWSEQUENTIALID() would generate sequential GUID values eliminating fragmentation and once again allowing range searches.
An even better solution would be to just create a PK from the two columns :
ALTER TABLE ThatTable ADD PRIMARY KEY (Column1,Column2);
Or just add an IDENTITY-generated ID column. A bigint is large enough to handle all scenarios :
Create ThatTable (
ID bigint NOT NULL IDENTITY(1,1) PRIMARY KEY,
...
)
If the intention was to ignore spaces in column values there are better options:
The easiest solution would be to clean up the values when inserting them.
A CHECK constraint can be added to each column to ensure the columns can't have leading or trailing spaces.
An INSTEAD OF trigger can be used to trim them.
Computed, persisted columns can be added that trim the originals, eg Column1_Cleaned as TRIM(Column1) PERSISTED. Persisted columns can be used in indexes and primary keys
As for what it does:
It generates deprecation warnings (MD5 is deprecated)
It pads the MD5 hash with 0x20 bytes. A rather ... unusual way of padding data. I suspect whoever first wrote this wanted to pad the hash to 32 bytes but used some copy-pasta code without understanding the implications.
You can check the results by hashing any value. The following queries
select hashbytes('md5','banana')
----------------------------------
0x72B302BF297A228A75730123EFEF7C41
select cast(hashbytes('md5','banana') as char(32))
--------------------------------
r³¿)z"Šus#ïï|A
A space in ASCII is the byte 0x20. Casting to binary replaces spaces with 0x20, not 0x00
select cast(cast(hashbytes('md5','banana') as char(32)) as varbinary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4120202020202020202020202020202020
If one wanted to pad a 16-byte value to 32 bytes, it would make more sense to use 0x00. The result is no better than the original though
select cast(hashbytes('md5','banana') as binary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4100000000000000000000000000000000
To get a real 32-byte hash, SHA2_256 can be used :
select hashbytes('sha2_256','banana')
------------------------------------------------------------------
0xB493D48364AFE44D11C0165CF470A4164D1E2609911EF998BE868D46ADE3DE4E

SQL Query on single table-valued parameter slow on large input

I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES User(Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUserId) INCLUDE (RelatedStory, CreationTime)
CREATE INDEX i_relatedstory
ON Related(RelatedStory) INCLUDE (RelatedUser, CreationTime)
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
#offset INT,
#limit INT,
#input UserIdInput READONLY
AS
BEGIN
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM #input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET #offset ROWS FETCH NEXT #limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows, and gets me good results when using few userids as input, but very bad (30+ seconds) results when providing hundreds or a couple thousand userids as input. The main problem seems to be that it uses 63% of the effort on sorting.
What index am I missing? this seems to be a pretty straight forward query on a single table.

What types of values do you have for RelatedUser / UID ? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules, which can be well worth it when working with strings, but unnecessarily expensive when working with codes, especially when using the typically default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR which will take up half the space and sort / compare faster. Also, still use a binary Collation.
Your best bet is to:
Add an INT IDENTITY column to the User table, named UseID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering in the "Code" values (meaning: cannot guarantee they will always use all upper-case or all lower-case), then best to add an AFTER INSERT, UPDATE Trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This will also mean that you need to make sure that all incoming queries using the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
One other thing to consider: the TVP is a table-variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables will cause the query optimizer to see the true row count. If that doesn't help any, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs) as those do maintain statistics and optimize better when you have more than a small number of rows in them.
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row, the INT IDENTITY column as the actual UserID (is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while being faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.

So I finally found a solution.
While #srutzky had good suggestions of normalizing the tables by changing the NVARCHAR UserId to an Integer to minimize comparison cost, this was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
#Paparazzi suggested I added an index for (RelatedStory, CreationTime), and that did not do what I needed either. The reason was, that I also needed to also index RelatedUser as that's the way the query goes, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly 1-second or a couple of seconds querytimes.
I think what gave me the revelation was #srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons,
only for covering.
which made me realize I needed all my groupby and orderby columns in the index.
So while I can't mark either of the above posters post as the Answer, I'd like to sincerely thank them for their time.

The main problem seems to be that it uses 63% of the effort on
sorting.
ORDER BY CreationTime DESC
I would suggest and index on CreationTime
Or try an index on RelatedStory, CreationTime

Indices on composite primary key

I googled this a lot many times but I didn't get the exact explanation for the same.
I am working on a complex database structures (in Oracle 10g) where I hardly have a primary key on one single column except for the static tables.
Now my question is consider a composite primary key ID (LXI, VCODE, IVID, GHID). Since it's a primary key, Oracle will provide a default index.
Will I get ONE (system generated) single index for the primary key itself or for its sub-columns also?
Asking this because I am retrieving data (around millions of records) based on individual columns as well. Now if system generates the indices for the individual columns as well. Why my query runs pretty faster than how it actually runs when I explicitly define indices for each individual column.
Please give a satisfactory answer
Thanks in advance

A primary key is a non-NULL unique key. In your case, the unique index has four columns, LXI, VCODE, IVID GHID in the order of declaration.
If you have a condition on VCODE but not on LXI, then most databases would not use the index. Oracle has a special type of index scan called the "skip scan", which allows for this very situation. It is described in the documentation.
I would expect an index skip scan to be a bit slower than an index range scan on individual columns. However, which is better might also depend on the complexity of the where clause. For instance, three equality conditions on VCODE, IVID and GHID connected by AND might be a great example for the skip scan. And, such an index would cover the WHERE clause -- a great efficiency -- and better than one-column indexes.
As a note: index skip scans were introduced in Oracle 9i, so they are available in Oracle 10.

It will not generate index for individual column. it will generate a composite index
first it will index on LXI
then next column like that it will be a tree structure.
if you search on 1st column of primary key it will use index to use index for second you have to combine it with the first column
ex : select where ...LXI=? will use index PK
select where LXI=? and VCODE=? alse use pk
but select where VCODE=? will not use it (without LXI)

Joining on a non-PK field, does length of varchar datatype determine query speed? SQL Server 2008

I was given a ragtag assortment of data to analyze and am running into a predicament. I've got a ~2 million row table with a non-unique identifier of datatype varchar(50). This identifier is unique to a personID. Until I figure out exactly how I need to normalize this junk I've got another question that might help me right now: If I change the datatype to a varchar(25) for instance, will that help queries run faster when they're joined on a non-PK field? All of the characters in the string are integers, but trying to convert them to an int would cause overflow. Or could I possibly somehow index the column for the time being to get some of the queries to run faster?
EDIT: The personID will be a foreign key to another table with demographic information about a person.

Technically, the length of a varchar specifies it's maximum length.
The actual length is variable (thus the name) so a lower maximum value won't change the evaluation because it will be made on the actual string.
For more information :
Check this MSDN article and this
Stack overflow Post

Varchar(50) to varchar(25) would certainly reduce the size of record in that table, thereby reducing the number of database pages that contain the table, improving the perfomance of queries (may be to a marginal extent), but such an ALTER TABLE statement might take a long time.
Alternatively, if you define index on the join columns, and if your retrieval list is small, you can include those columns also in the index definition (Covering index), that too would bring down the query execution times significantly.

What columns generally make good indexes?

As a follow up to "What are indexes and how can I use them to optimise queries in my database?" where I am attempting to learn about indexes, what columns are good index candidates? Specifically for an MS SQL database?
After some googling, everything I have read suggests that columns that are generally increasing and unique make a good index (things like MySQL's auto_increment), I understand this, but I am using MS SQL and I am using GUIDs for primary keys, so it seems that indexes would not benefit GUID columns...

Indexes can play an important role in query optimization and searching the results speedily from tables. The most important step is to select which columns are to be indexed. There are two major places where we can consider indexing: columns referenced in the WHERE clause and columns used in JOIN clauses. In short, such columns should be indexed against which you are required to search particular records. Suppose, we have a table named buyers where the SELECT query uses indexes like below:
SELECT
buyer_id /* no need to index */
FROM buyers
WHERE first_name='Tariq' /* consider indexing */
AND last_name='Iqbal' /* consider indexing */
Since "buyer_id" is referenced in the SELECT portion, MySQL will not use it to limit the chosen rows. Hence, there is no great need to index it. The below is another example little different from the above one:
SELECT
buyers.buyer_id, /* no need to index */
country.name /* no need to index */
FROM buyers LEFT JOIN country
ON buyers.country_id=country.country_id /* consider indexing */
WHERE
first_name='Tariq' /* consider indexing */
AND
last_name='Iqbal' /* consider indexing */
According to the above queries first_name, last_name columns can be indexed as they are located in the WHERE clause. Also an additional field, country_id from country table, can be considered for indexing because it is in a JOIN clause. So indexing can be considered on every field in the WHERE clause or a JOIN clause.
The following list also offers a few tips that you should always keep in mind when intend to create indexes into your tables:
Only index those columns that are required in WHERE and ORDER BY clauses. Indexing columns in abundance will result in some disadvantages.
Try to take benefit of "index prefix" or "multi-columns index" feature of MySQL. If you create an index such as INDEX(first_name, last_name), don’t create INDEX(first_name). However, "index prefix" or "multi-columns index" is not recommended in all search cases.
Use the NOT NULL attribute for those columns in which you consider the indexing, so that NULL values will never be stored.
Use the --log-long-format option to log queries that aren’t using indexes. In this way, you can examine this log file and adjust your queries accordingly.
The EXPLAIN statement helps you to reveal that how MySQL will execute a query. It shows how and in what order tables are joined. This can be much useful for determining how to write optimized queries, and whether the columns are needed to be indexed.
Update (23 Feb'15):
Any index (good/bad) increases insert and update time.
Depending on your indexes (number of indexes and type), result is searched. If your search time is gonna increase because of index then that's bad index.
Likely in any book, "Index Page" could have chapter start page, topic page number starts, also sub topic page starts. Some clarification in Index page helps but more detailed index might confuse you or scare you. Indexes are also having memory.
Index selection should be wise. Keep in mind not all columns would require index.

Some folks answered a similar question here: How do you know what a good index is?
Basically, it really depends on how you will be querying your data. You want an index that quickly identifies a small subset of your dataset that is relevant to a query. If you never query by datestamp, you don't need an index on it, even if it's mostly unique. If all you do is get events that happened in a certain date range, you definitely want one. In most cases, an index on gender is pointless -- but if all you do is get stats about all males, and separately, about all females, it might be worth your while to create one. Figure out what your query patterns will be, and access to which parameter narrows the search space the most, and that's your best index.
Also consider the kind of index you make -- B-trees are good for most things and allow range queries, but hash indexes get you straight to the point (but don't allow ranges). Other types of indexes have other pros and cons.
Good luck!

It all depends on what queries you expect to ask about the tables. If you ask for all rows with a certain value for column X, you will have to do a full table scan if an index can't be used.
Indexes will be useful if:
The column or columns have a high degree of uniqueness
You frequently need to look for a certain value or range of values for
the column.
They will not be useful if:
You are selecting a large % (>10-20%) of the rows in the table
The additional space usage is an issue
You want to maximize insert performance. Every index on a table reduces insert and update performance because they must be updated each time the data changes.
Primary key columns are typically great for indexing because they are unique and are often used to lookup rows.

Any column that is going to be regularly used to extract data from the table should be indexed.
This includes:
foreign keys -
select * from tblOrder where status_id=:v_outstanding
descriptive fields -
select * from tblCust where Surname like "O'Brian%"
The columns do not need to be unique. In fact you can get really good performance from a binary index when searching for exceptions.
select * from tblOrder where paidYN='N'

In general (I don't use mssql so can't comment specifically), primary keys make good indexes. They are unique and must have a value specified. (Also, primary keys make such good indexes that they normally have an index created automatically.)
An index is effectively a copy of the column which has been sorted to allow binary search (which is much faster than linear search). Database systems may use various tricks to speed up search even more, particularly if the data is more complex than a simple number.
My suggestion would be to not use any indexes initially and profile your queries. If a particular query (such as searching for people by surname, for example) is run very often, try creating an index over the relevate attributes and profile again. If there is a noticeable speed-up on queries and a negligible slow-down on insertions and updates, keep the index.
(Apologies if I'm repeating stuff mentioned in your other question, I hadn't come across it previously.)

It really depends on your queries. For example, if you almost only write to a table then it is best not to have any indexes, they just slow down the writes and never get used. Any column you are using to join with another table is a good candidate for an index.
Also, read about the Missing Indexes feature. It monitors the actual queries being used against your database and can tell you what indexes would have improved the performace.

Your primary key should always be an index. (I'd be surprised if it weren't automatically indexed by MS SQL, in fact.) You should also index columns you SELECT or ORDER by frequently; their purpose is both quick lookup of a single value and faster sorting.
The only real danger in indexing too many columns is slowing down changes to rows in large tables, as the indexes all need updating too. If you're really not sure what to index, just time your slowest queries, look at what columns are being used most often, and index them. Then see how much faster they are.

Numeric data types which are ordered in ascending or descending order are good indexes for multiple reasons. First, numbers are generally faster to evaluate than strings (varchar, char, nvarchar, etc). Second, if your values aren't ordered, rows and/or pages may need to be shuffled about to update your index. That's additional overhead.
If you're using SQL Server 2005 and set on using uniqueidentifiers (guids), and do NOT need them to be of a random nature, check out the sequential uniqueidentifier type.
Lastly, if you're talking about clustered indexes, you're talking about the sort of the physical data. If you have a string as your clustered index, that could get ugly.

A GUID column is not the best candidate for indexing. Indexes are best suited to columns with a data type that can be given some meaningful order, ie sorted (integer, date etc).
It does not matter if the data in a column is generally increasing. If you create an index on the column, the index will create it's own data structure that will simply reference the actual items in your table without concern for stored order (a non-clustered index). Then for example a binary search can be performed over your index data structure to provide fast retrieval.
It is also possible to create a "clustered index" that will physically reorder your data. However you can only have one of these per table, whereas you can have multiple non-clustered indexes.

The ol' rule of thumb was columns that are used a lot in WHERE, ORDER BY, and GROUP BY clauses, or any that seemed to be used in joins frequently. Keep in mind I'm referring to indexes, NOT Primary Key
Not to give a 'vanilla-ish' answer, but it truly depends on how you are accessing the data

It should be even faster if you are using a GUID.
Suppose you have the records
100
200
3000
....
If you have an index(binary search, you can find the physical location of the record you are looking for in O( lg n) time, instead of searching sequentially O(n) time. This is because you dont know what records you have in you table.

Best index depends on the contents of the table and what you are trying to accomplish.
Taken an example A member database with a Primary Key of the Members Social Security Numnber. We choose the S.S. because the application priamry referes to the individual in this way but you also want to create a search function that will utilize the members first and last name. I would then suggest creating a index over those two fields.
You should first find out what data you will be querying and then make the determination of which data you need indexed.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight