SQL Server -- Efficient String Contains Over Very Large Tables

SQL Server -- Efficient String Contains Over Very Large Tables - sql-server

I have a table that currently contains 10 million records.
One of the columns is SourceText of type nvarchar(4000).
I need a very efficient way to search the SourceText to see if it contains another string.
I have extreme flexibility will the table structures--I can modify the insert procedure and use other, better indexed tables to track things. One thought was to tokenize the SourceText by word and store the words in an indexed table, then use a mapping table to map to the main table. The problem is that the SourceText column can be any language, and there are always rules re:parantheses, etc. For example, in english if I tokenize using ' ' as the delimiter, I will still get things like (Where instead of Where, which is problematic.
Any ideas?

It would be a quite interesting and challenging project (and I think it's possible) to implement a fast full text search without the optional and very powerful full text search component of SQL Server ;-)

Related

Should I use LIKE or CONTAINS on a second column?

I have a table which has 2 columns (nvarchar(max) and varbinary(max). The binary column contains PDF documents and the catalog and index are setup to use this column.
The nvarchar column contains a list of id's (eg. "12","55","69", etc). This column can contain 100's of id's so that text would be quite long.
When building a search query, I always use CONTAINS, eg:
SELECT *
FROM mytable
WHERE CONTAINS(mybinarycolumn, 'keyword')
Depending on the search, I might or might not use the secondary column. So I was going to use IF to execute a second query, like this:
SELECT *
FROM mytable
WHERE CONTAINS(mybinarycolumn, 'keyword') AND
mytextcolumn LIKE '%"55"%'
Would I incur a performance hit if I use LIKE? Is it possible to combine CONTAINS and LIKE into one CONTAINS which might or might not use mytextcolumn in search? (If the text column must be used, it's always and AND with the binary column).

Assuming the normalization option isn't a good one for you...
I'm sure there will be a performance hit. LIKE is never a high performing operation, and you can't really build any indexes to help you out. If you are lucky, the SQL optimizer will do the CONTAINS part of the query first and apply the LIKE only to matching results. (Show execution plan will be your friend here.)
I can't think of a good way to combine the two columns into something that can be searched with a single CONTAINS; anything I've come up with looks like more work than the query as you have it.
You could try putting a full-text index on mytextcolumn and then use CONTAINS on that column as well. I'm not sure if that will help or not, but it may be worth a try.
I assume the values in mytextcolumn are well-delimited. If the column contains unquoted values, e.g. '12,23,45,67,777,890' instead of '"12","23","45","67","777","890"', your LIKE condition won't work the way you expect (because '%55%' would match both '11,22,55' and '11,22,555').
Good luck.

What is the best solution to store a volunteers availability data in access 2016 [duplicate]

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.

In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as #OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.

"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)

There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.

In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?

Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.

I needed a multi-value column, it could be implemented as an xml field
It could be converted to a comma delimited as necessary
querying an XML list in sql server using Xquery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.**
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.**
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited list AND can be converted to a delimited list as needed

Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.

Well I've been using a key/value pair tab separated list in a NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries but on the other hand, if you have a library that persists/derpersists the key value pair then it's not a that bad idea.

I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.

If you have a fixed number of boolean fields, you could use a INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR (0) (nullable) for each. You could also use a SET (I forget the exact syntax).

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a 750k record minimum table, of product sku's, sizing, material type, create date, etc;
Is anyone aware of a 'plugin' solution for Coldfusion to do this? I envision a google like single entry search where a customer can type in the part number, or the sizing, etc, and get hits on any or all relevant results.
Currently if I run a 'LIKE' comparison query, it seems to take ages (ok a few seconds, but still), and it is too long. At times making a user sit there and wait up to 10 seconds for queries & page loads.
Or are there any SQL formulas to help accomplish this? I want to use a proven method to search the data, not just a simple SQL like or = comparison operation.
So this is a multi-approach question, should I attack this at the SQL level (as it ultimately looks to be) or is there a plug in/module for ColdFusion that I can grab that will give me speedy, advanced search capability.

You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether even trying it would be worthwhile would depend a lot on how often you update the records you need to search. If you update them rarely, you could do an Verity Index update whenever you update them. If you update the records constantly, that's going to be a drag on the webserver, and certainly mitigate any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.

If your slowdown is specifically the search of textual fields (as I surmise from your mentioning of LIKE), the best solution is building an index table (not to be confiused with DB table indexes that are also part of the answer).
Build an index table mapping the unique ID of your records from main table to a set of words (1 word per row) of the textual field. If it matters, add the field of origin as a 3rd column in the index table, and if you want "relevance" features you may want to consider word count.
Populate the index table with either a trigger (using splitting) or from your app - the latter might be better, simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately drastically speed up textual search as it will no longer do "LIKE", AND will be able to use indexes on index table (no pun intended) without interfering with indexing on SKU and the like on the main table.
Also, ensure that all the relevant fields are indexed fully - not necessarily in the same compund index (SKU, sizing etc...), and any field that is searched as a range field (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximate order of that field's increase or you don't care about insert/update speed as much).
For anything mode detailed, you will need to post your table structure, existing indexes, the queries that are slow and the query plans you have now for those slow queries.
Another item is to enure that as little of the fields are textual as possible, especially ones that are "decodable" - your comment mentioned "is it boxed" in the text fields set. If so, I assume the values are "yes"/"no" or some other very limited data set. If so, simply store a numeric code for valid values and do en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement but still an improvement.

I've done this using SQL's full text indexes. This will require very application changes and no changes to the database schema except for the addition of the full text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
from containstable(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The column identified as the primary key of the full text search
Rank - A relative rank of the match (1 - 1000 with a higher ranking meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.

If you want a truly plug-in solution then you should just go with Google itself. It sounds like your doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), So you probably have a catalog of some kind with product pages. If you have consistent markup then you can configure a google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQl, little coding, it will not be dependent on your database, or even coldfusion. It will also be quite fast and familiar to customers.
I was able to do this with a coldfusion site in about 6 hours, done! The only thing to watch out for is that google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a users role or permissions or group, then it may not be the solution for you (although you can configure a permission service for Google to check with)

Because SQL Server is where your data is that is where your search performance is going to be a possible issue. Make sure you have indexes on the columns you are searching on and if using a like you can't use and index if you do this SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you do it like this SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key here is to allow as many of the first characters to not be wild cards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173

Sql Server String Comparision

Is there any information as to how SQL Server compares strings and handles searching in them (like statments)? I am trying to find out if there is a way to determine how efficient it is to store information as a large string and use sql server to do a bunch of comparisons on rows to determine which match. I know this is potentially going to be slow (the each string of information would be 2400 characters long), but I need something doucmenting how the string is compared, so I can show the efficency (or inefficency) of it.

each string of information would be 2400 characters long
Exactly 2400? So you've got fixed-width fields in there? Save your time and just split it into separate columns. You'll thank yourself later.
If you must have data, set up a test db and try it both ways. Then at least you'll have data that's specific to your system.

searching in them will be slow because you won't be able to create an index since an index can't be over 900 bytes long/wide
I would do what Joel Coehoorn suggests and split it up into columns
you also might want to split it up in more tables because you can only store 3 rows pr page with 2400 chars per row

There are full text search indexes that you can apply to sql server, which are often used for things like search engines. The full text indexes typically allow for boolean logic operators for the search.

Just additional information to the already mentioned. If you need to filter the large string with like, indices are also not used (except the wildcard % is only at the end of the search string). So it's best to avoid like and make the part you need to filter for available in an own field.

In the MSDN Article about Full-Text searches the following is called out regarding how the LIKE predicate uses character patterns.
Comparing LIKE to Full-Text Search
In contrast to full-text search, the LIKE Transact-SQL predicate works
on character patterns only. Also, you cannot use the LIKE predicate to
query formatted binary data. Furthermore, a LIKE query against a large
amount of unstructured text data is much slower than an equivalent
full-text query against the same data. A LIKE query against millions
of rows of text data can take minutes to return; whereas a full-text
query can take only seconds or less against the same data, depending
on the number of rows that are returned.

What's the best way to store a title in a database to allow sorting without the leading "The", "A"

I run (and am presently completely overhauling) a website that deals with theater (njtheater.com if you're interested).
When I query a list of plays from the database, I'd like "The Merchant of Venice" to sort under the "M"s. Of course, when I display the name of the play, I need the "The" in front.
What the best way of designing the database to handle this?
(I'm using MS-SQL 2000)

You are on the right track with two columns, but I would suggest storing the entire displayable title in one column, rather than concatenating columns. The other column is used purely for sorting. This gives you complete flexibility over sorting and display, rather than being stuck with a simple prefix.
This is a fairly common approach when searching (which is related to sorting). One column (with an index) is case-folded, de-punctuated, etc. In your case, you'd also apply the grammatical convention of removing leading articles to the values in this field. This column is then used as a comparison key for searching or sorting. The other column is not indexed, and preserves the original key for display.

Store the title in two fields: TITLE-PREFIX and TITLE-TEXT (or some such). Then sort on the second, but display the concatenation of the two, with a space between.

My own solution to the problem was to create three columns in the database.
article varchar(4)
sorttitle varchar(255)
title computed (article + sortitle)
"article" will only be either "The ", "A " "An " (note trailing space on each) or empty string (not null)
"sorttitle" will be the title with the leading article removed.
This way, I can sort on SORTTITLE and display TITLE. There's little actual processing going on the computed field (so it's fast), and there's only a little work to be done when inserting.

I agree with doofledorfer, but I would recommend storing spaces entered as part of the prefix instead of assuming it's a single space. It gives your users more flexibility. You may also be able to do some concatenation in your query itself, so you don't have to merge the fields as part of your business logic.

I don't know if this can be done in SQL Server. If you can create function based indexes you could create one that does a regex on the field or that uses your own function. This would take less space than an additional field, would be kept up to date by the database itself, and allows the complete title to be stored together.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight