fast search-within-text in SQL Server (fulltext not good enough)

I am using full-text indexing in SQL Server 2000 and 2012. It works great, except that users need to be able to search for '*term*' and not just 'term*' (that is, the search needs to return results that contain the search term anywhere, not just results that begin with it).
From what I have read and researched, SQL Server full-text search does not support this.
Since this is a requirement, and using the LIKE operator instead of full text is just too slow, I am thinking about breaking up each value into words, and creating a special table that contains each word twice - once normal, once reversed - and a foreign key to the relevant item.
This is the only way I can see to accomplish decent speeds.
Has anyone done this? Does anyone know of any other solution? Is it maybe possible to control the index itself, adding the reverse values to it, without actually creating a column that contains them?
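
The word table described above can be sketched in a few lines. The Python example below uses an in-memory SQLite database purely as a stand-in for SQL Server; the table name word_index, its columns, and both helper functions are hypothetical. It illustrates that storing each word reversed turns an "ends with" search into a prefix search, and also that a term sitting in the middle of a word is still not found, so true '*term*' matching would need more than the reversed copy (for example, storing every suffix of each word).

```python
import sqlite3


def build_word_index(conn, items):
    """Store every word of every item twice: as written and reversed."""
    conn.execute(
        "CREATE TABLE word_index (item_id INTEGER, word TEXT, word_reversed TEXT)"
    )
    conn.execute("CREATE INDEX ix_word ON word_index (word)")
    conn.execute("CREATE INDEX ix_word_reversed ON word_index (word_reversed)")
    for item_id, text in items.items():
        for word in text.lower().split():
            conn.execute(
                "INSERT INTO word_index VALUES (?, ?, ?)",
                (item_id, word, word[::-1]),
            )


def search(conn, term):
    """Return ids of items having a word that starts or ends with term."""
    term = term.lower()
    rows = conn.execute(
        "SELECT DISTINCT item_id FROM word_index "
        "WHERE word LIKE ? OR word_reversed LIKE ? "
        "ORDER BY item_id",
        # Both patterns are prefix patterns; the second one runs against
        # the reversed column, which makes it an "ends with" match.
        (term + "%", term[::-1] + "%"),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_word_index(
    conn, {1: "fast fulltext search", 2: "plain text", 3: "context switch"}
)
print(search(conn, "text"))   # words starting or ending with "text"
print(search(conn, "ullte"))  # middle-of-word term: this scheme finds nothing
```

The sample data and the search terms are made up for illustration; terms containing % or _ would need escaping before being used in a LIKE pattern.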

Related

How to avoid SQL Server error on ORDER BY with duplicate columns

Although this question references PHP, it is not actually PHP-specific, so I have not flagged it as such.
We have a PHP framework which supports multiple DB back-ends.
There is a generic function in our data object class, which allows you to get records from the underlying table, with a specified criteria and sort order.
It looks something like this:
function GetAll($Criteria, $OrderBy = "") {
    ...
    // Add primary key (column 1) to end of order by list,
    // so that returned order is predictable.
    if ($OrderBy != "") {
        $OrderBy .= ", ";
    }
    $OrderBy .= "1";
    ...
    // Build and run query, returning the result as an array.
}
If you specify an $OrderBy argument of StaffID on a Staff object, the resulting SQL looks something like the following:
SELECT * FROM adminStaff ORDER BY StaffID, 1;
This works fine on a MySQL back-end, and from my searching of the web it should also be fine on most other DB back-ends. However, when using SQL Server, we get the following error message:
A column has been specified more than once in the order by list.
Columns in the order by list must be unique.
This arises because SQL Server disallows the same column appearing multiple times in the ORDER BY clause. In this case StaffID is column 1 and therefore we have multiple instances of the same column.
Is there a way to disable this check in SQL Server? MySQL provides a lot of options to enable/disable strictness checks and incompatible features - does SQL Server provide anything of that nature that would allow the above query to run without errors?
If not, do you have any suggestions for how we could resolve this in our data-object layer? Bear in mind we need to maintain compatibility with existing projects which expect this behaviour, so it is not sufficient to only include the first column when $OrderBy is blank.
The situation is also slightly complicated in the fact that the field list is customisable elsewhere in the data object configuration, so we can't rely on * being used as the field list - it could contain pretty much anything that is valid in a normal SQL field list. However, if that is asking too much, a solution to the simpler case (as outlined above) would be a good start!

In SQL Server you can sort either by column name or by the ordinal position of a column in the SELECT list.
In your case the column StaffID is at ordinal position 1, so ORDER BY StaffID, 1 names the same column twice, which SQL Server rejects.
If you remove the 1 from your query, the problem will be solved.
Avoid using the ordinal position of the column for sorting.

The basic question - is it possible to suppress this SQL Server restriction on ORDER BY column duplication - was answered by Venu: No, it is not.
There are various suggestions (mostly from me) about how you could possibly code around this limitation in a generic manner. For any future readers, those answers are probably the most helpful if you are adapting an existing system. (If you are starting from scratch, just try and avoid this situation altogether.)
However, the actual solution that I came to was to add versioning to our internal API for our DBAL. The API version is now 2 but you can call setApiVersion(1) to instruct the back-end to use the old version of the API instead.
v2 is identical to v1* except it no longer automatically adds column 1 to the ORDER BY unless it is completely blank. Therefore, the SQL Server issue is resolved for new (v2) projects, whilst existing projects can be set to use the v1 API and therefore continue to work correctly (but without SQL Server compatibility).
(* Actually, I've taken this opportunity to make some other breaking changes in v2, but that is not relevant to this answer.)

I've come up with a couple of potential solutions at the framework level. All of them have performance implications which would need to be profiled, and in practice that may rule some or all of them out. However, in theory at least, these are ways that a generic solution could be implemented.
Omit the ORDER BY altogether, and do the sorting in code. Would involve parsing the provided ORDER BY string. Would be problematic if ORDER BY contained expressions, but I can't remember ever seeing that in our projects, so can probably be ignored. Probably the slowest solution.
Perform the query without the ORDER BY, limiting the results set to a single row. Use resulting column list to work out whether column 1 is already in the ORDER BY clause, and therefore whether to add it. Then run the full query. Would require parsing the provided ORDER BY string. Query caching may mean this won't add as much overhead as it appears.
Parse the field list to get the first column name and see if this appears in the ORDER BY clause. If field list contains * or table.* would require a schema lookup. May be too difficult if we need to deal with table aliases and wildcards in combination.
Parse ORDER BY string and see if it contains any primary key. If so it is already uniquely ordered and doesn't require the addition of an extra field. Would require a schema look-up.
Use a sub-select to give us a new instance of the column that we can sort on instead. Not sure whether SQL Server would still complain that this is the 'same' column, though.
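
The fourth idea above (only add the key column when the ORDER BY does not already contain it) could be sketched roughly as follows. Python is used for illustration rather than the framework's PHP; append_tiebreaker and its parameters are hypothetical, and the key column is assumed to be known by name from the schema. The naive comma split assumes plain column names, so expressions containing commas, table-qualified names, and ordinal positions are not handled.

```python
def append_tiebreaker(order_by, key_column):
    """Append key_column to an ORDER BY list unless it is already there."""
    # Split the caller's list into individual sort terms.
    terms = [term.strip() for term in order_by.split(",") if term.strip()]
    # Take the first token of each term (dropping ASC/DESC) and strip
    # common identifier quoting so the comparison is on bare names.
    names = {term.split()[0].strip('[]`"').lower() for term in terms}
    if key_column.lower() not in names:
        terms.append(key_column)
    return ", ".join(terms)


print(append_tiebreaker("StaffID", "StaffID"))       # StaffID
print(append_tiebreaker("Surname DESC", "StaffID"))  # Surname DESC, StaffID
print(append_tiebreaker("", "StaffID"))              # StaffID
```

Using the column name instead of the ordinal 1 also keeps the generated SQL independent of whatever field list is configured elsewhere.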

Could you just append '--' to your OrderBy parameter when working with SQL Server and just explicitly define the Order By fields where necessary?

SQL - Compare 2 text fields

I’m using software called FME Desktop. In this software we can issue SQL commands through an item called a transformer. I’m using a transformer called SQLExecutor that runs a very simple query to make a comparison. Below is an explanation of what I’m trying to do with this query, and of how it fails when comparing two text fields.
I believe my issue is a limitation of SQL when used in the SQLExecutor. Let's say I have a layer of data called TEST.LEASE and I want to compare it to a layer called EDIT.LEASE based on one unique ID field. Both of these layers are in the same database. We use SQL Server for our stored data. There is a TEXT field in both layers called GIS_ID. This is a unique ID field. So what happens is we get updates on our LEASE layer and they start off being loaded to TEST.LEASE. When we have done our QA/QC of the data and we are satisfied that they are ready to be uploaded to EDIT.LEASE we then run an FME job that serves as our promotion tool. What this promotion tool does is that it checks various fields in TEST.LEASE to make sure they qualify for being uploaded (this part works 100% without issue).
Right before they are promoted to EDIT.LEASE we need to know if this will be a completely new record, in which case we will do an INSERT with FME. If by chance the GIS_ID already exists then we need to do an UPDATE to those records. The tool we have works perfectly for determining if it is an INSERT or UPDATE, except for one seemingly small thing … IT ONLY WORKS IF THE TEXT FIELD CONTAINS A NUMBER THAT DOESN’T HAVE A LETTER IN IT.
FYI: Someone at our company decided to make the GIS_ID field a text field. In my opinion it should have been an integer field because comparisons would have been super easy. But I can't change that now, it has already been decided by people who make way more money than I do that it will be a text field.
As mentioned … The GIS_ID is a text field (in both layers and they are both the same size, there is no difference in the field in both layers). As you may know, SQL doesn't care if it is a TEXT field or an INTEGER field when all that is contained in that field is a number. It can still compare 202 to 202 to see if they are equal to each other. For my example let's say I have a record in both TEST.LEASE and EDIT.LEASE where both of their GIS_ID fields equal 09198760. When I run the query below it runs perfectly.
select OBJECTID
from TEST.LEASE_UPDATE_INSERT_WRITER
where GIS_ID = #Value(GIS_ID)
It runs perfectly, as I’ve mentioned, on the data if both GIS_ID text fields have only numbers in them. But if just one record contains an actual alpha, the SQL query will error out.
So if GIS_ID has 09198760a01 once the query reaches the “a” in GIS_ID a SQL error is returned. I’m not looking for a way for the job to continue and ignore those records, because I need ALL OF THE RECORDS to load. I need to know if anyone would know how to add to or rewrite the query above so that it loads both “number only text fields” and “numbers containing a letter fields.”
I hope that long explanation is clear. Please let me know if it isn’t. Thanks for any help you might be able to provide for me
Sincerely,
Tex

I am assuming that #Value is the function that is causing you problems. I briefly checked their docs, and it looks like you need to wrap it in single quotes, like so: '#Value(GIS_ID)'
http://fmepedia.safe.com/articles/How_To/Executing-a-Stored-Procedure-on-Microsoft-SQL-Server-with-FME

Jeff is right. As a generic answer for regular SQL users, and even for people using SQL in their application code: if you are comparing text as the OP described, you need to wrap the value in single quotes.
Where avalue = 'myvalue'
Otherwise SQL Server treats the value as a number, which is why it works when the value being passed in contains only digits. It's not always easy to tell what the problem is when you're passing in parameters, as in:
Where avalue = #myvalue
So you'll need to pay attention to that. Just wanted to mention this so maybe it helps someone else with a similar issue. I figured this out when we were getting errors from a field that held a concatenated list of ids, i.e. it worked when the value was 2, but not when it was 2,3 etc. Wrapping the parameter in single quotes easily fixed that, as we were truly only concerned with value = '2' in our case.
Hope this makes sense.
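
For readers building SQL in application code, the difference can be reproduced with a small Python sketch. It uses an in-memory SQLite database as a stand-in for SQL Server, and the lease table and both function names are hypothetical. Pasting the raw value into the statement text leaves the database to parse 09198760a01 as a malformed number, while a bound parameter (or, in FME's text substitution, the surrounding single quotes) keeps it a string.

```python
import sqlite3


def find_by_splicing(conn, gis_id):
    """Fragile: pastes the value into the SQL text without quotes."""
    sql = "SELECT OBJECTID FROM lease WHERE GIS_ID = " + gis_id
    return conn.execute(sql).fetchall()


def find_by_parameter(conn, gis_id):
    """Passes the value as a bound parameter, so it is compared as text."""
    sql = "SELECT OBJECTID FROM lease WHERE GIS_ID = ?"
    return conn.execute(sql, (gis_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lease (OBJECTID INTEGER, GIS_ID TEXT)")
conn.executemany(
    "INSERT INTO lease VALUES (?, ?)",
    [(1, "09198760"), (2, "09198760a01")],
)

# The bound parameter finds the alphanumeric id without any quoting tricks.
print(find_by_parameter(conn, "09198760a01"))

# The spliced, unquoted value is rejected by the SQL parser.
try:
    find_by_splicing(conn, "09198760a01")
except sqlite3.Error as exc:
    print("unquoted value rejected:", exc)
```

The exact error text differs between database engines; only the general behaviour is meant to carry over.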

Does a Full-Text Index work well for columns with embedded code values

Using SQL Server 2012, I've got a table that currently has several hundred-thousand rows, and will grow.
In this table, I've got a nvarchar(30) field that contains Medical Record Number (MRN) values. These values can be just about any alphanumeric value, but are not words.
For Example,
DR-345687
34568523
*45612345;T
My application allows the end user to enter a value, say '456' in the search field. The application would need to return all three of the example records.
Currently, I'm using Entity Framework 5.0, and asking for a field.Contains('456') type of search.
This always takes 3-5 seconds to return since it appears to do a table search.
My question is: Would creating a Full Text Index on this column help performance? I haven't tried it yet because the only copy of the database that I have with lots of data in it is currently in QA trials.
Looking at the documentation for the Full Text Indexes it appears that it is optimized around separate words in the field value, so I am hesitant to take the performance hit to create the index without knowing how it is likely to affect my query performance.

EF won't generate the T-SQL keywords needed to query a SQL Server full-text index (http://msdn.microsoft.com/en-us/library/ms142571.aspx#queries), so your solution won't fly without more work.
I think you would have to create a stored procedure that gets the data using the full-text index and then have EF call it. I have a similar issue and would be interested to know your results.
Andy

SQL server search

I'm going to perform a search in my SQL Server DB (ASP.NET, VS2010, C#). The user types a phrase and I should search for this phrase in several fields. How is this possible? Do we have functions such as CONTAINS() in SQL Server? Can I perform my search using normal queries, or should I work on my queries using C# functions?
For instance, I have 3 fields in my table which can contain the user's search phrase. Is it OK to write the following SQL command? (For instance, the user's search phrase is GAME.)
select * from myTable where columnA='GAME' or columnB='GAME' or columnC='GAME'
I have used AND between different conditions, but can I use OR? How can I search inside my table fields? If one of my fields contains the phrase GAME, how can I find it? columnA='GAME' finds only those fields that are exactly 'GAME', is that right?
I'm a bit confused about my search approach. Please help me, thanks guys.

OR works fine if you want at least one of the conditions to be true.
If you want to search inside your text strings you can use LIKE
select * from myTable where columnA like '%GAME%' or columnB like '%GAME%' or columnC like '%GAME%'
Note that % is the wildcard.
If you want to find everything that begins with 'GAME', use LIKE 'GAME%'; if 'GAME' may appear anywhere in the value, you need % at both ends.
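
When the search phrase comes from user input, the same query is better built with parameters than by pasting the phrase into the SQL text. Here is a minimal Python sketch, using an in-memory SQLite database as a stand-in for SQL Server; search_columns is a hypothetical helper, and the table and column names are assumed to come from trusted code, since only the phrase is passed as a parameter.

```python
import sqlite3


def search_columns(conn, table, columns, phrase):
    """Return rows where any of the given columns contains the phrase."""
    # One "column LIKE ?" test per column, combined with OR.
    where = " OR ".join(f"{column} LIKE ?" for column in columns)
    sql = f"SELECT * FROM {table} WHERE {where}"
    # The same %phrase% pattern is bound once for every column.
    return conn.execute(sql, [f"%{phrase}%"] * len(columns)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (columnA TEXT, columnB TEXT, columnC TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [("GAME", "x", "y"), ("a", "video game store", "c"), ("a", "b", "c")],
)
for row in search_columns(conn, "myTable", ["columnA", "columnB", "columnC"], "GAME"):
    print(row)
```

Whether the match is case-sensitive depends on the database and its collation settings, so that part should be checked against the real server.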

You can use LIKE instead of equals and then it can contain wildcard characters, so your example could be:
select * from myTable where columnA LIKE '%GAME%' or columnB LIKE '%GAME%' or columnC LIKE '%GAME%'
Further information may be found in MSDN
This is going to do some pretty heavy lifting in terms of what the database has to do though - I would suggest you consider something like full text search as I think it would more likely be suited to your scenario and provide faster results (of course, if you never have many records to search LIKE would probably do fine). Information on this is also in MSDN

Don't use LIKE with a leading wildcard, as suggested by other answers. A pattern such as '%GAME%' cannot use an index seek, so every row has to be scanned, which is slow to return and expensive to run. Instead, you have two options:
Option 1: Full-Text Indexes
"Do we have functions such as CONTAINS() in SQL Server?"
Yes! You can use the CONTAINS() function in SQL Server. You just have to set up a full-text index that covers each of the columns you need to search on.
Option 2: Lucene.Net
Lucene.Net is a popular .NET search library (a port of Apache Lucene). It builds and queries its own index, separate from SQL Server, so you would index the searchable text from your tables with it and run searches against that index from your application code.

SQL Server; index on TEXT column

I have a database table with several columns; most of them are VARCHAR(x) type columns, and some of these columns have an index on them so that I can search quickly for data inside it.
However, one of the columns is a TEXT column, because it contains a very large amount of data (23 KB of plain ASCII text, etc.). I want to be able to search in that column (... WHERE col1 LIKE '%search string%'... ), but currently the query takes forever to run. I know that the query is slow because of this column search: when I remove that criterion from the WHERE clause, the query completes in what I would consider an instant.
I can't add an index on this column because that option is grayed out for that column in the index builder / wizard in SQL Server Management Studio.
What are my options here, to speed up the query search in that column?
Thanks for your time...
Update
Ok, so I looked into the full text search and did all that stuff, and now I would like to run queries. However, when using "contains", it only accepts one word; what if I need an exact phrase? ... WHERE CONTAINS (col1, 'search phrase') ... throws an error.
Sorry, I'm new to SQL Server
Update 2
Sorry, just figured it out: use multiple "contains" clauses instead of one clause with multiple words. Actually, this still doesn't give me what I want (the exact phrase); it only makes sure that all the words in the phrase are present.

Searching TEXT fields is always pretty slow. Give Full Text Search a try and see if that works better for you.

If your queries look like LIKE '%string%' (i.e. you search for a string inside a TEXT field), then you'll need a FULLTEXT index.
If you search for a substring in the beginning of the field (LIKE 'string%') and use SQL Server 2005 or higher, then you can convert your TEXT into a VARCHAR(MAX), create a computed column and index this column.
See this article in my blog for performance details:
Indexing VARCHAR(MAX)

You should be looking at using Full Text Indexing on the column.

You can do complex boolean querying in FTS; like
contains(yourcol,'"My first string" or "my second string" and "my third string"')
Depending on your query, CONTAINSTABLE or FREETEXTTABLE might give better results.
If you are connecting through .Net you might want to look at A google full text search
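
On the exact-phrase problem from the question's updates: as far as I know, CONTAINS expects a multi-word phrase to be wrapped in double quotes inside the string literal, as in the example above, i.e. CONTAINS(col1, '"search phrase"'). A small Python sketch of building such a search condition from user input follows; both helper names are hypothetical, and dropping embedded double quotes and asterisks is a simplifying assumption rather than a complete escaping scheme.

```python
def phrase_condition(phrase):
    """Wrap a phrase in double quotes so it is matched as one unit."""
    # Drop embedded double quotes and collapse runs of whitespace.
    cleaned = " ".join(phrase.replace('"', " ").split())
    return '"' + cleaned + '"'


def prefix_condition(term):
    """Build a prefix term, which matches words starting with term."""
    cleaned = term.replace('"', "").replace("*", "").strip()
    return '"' + cleaned + '*"'


# The result is meant to be passed as a query parameter, for example:
#   SELECT ... FROM mytable WHERE CONTAINS(col1, @condition)
print(phrase_condition("search phrase"))  # "search phrase"
print(prefix_condition("sear"))           # "sear*"
```

Passing the condition as a parameter, rather than concatenating it into the statement, avoids having to escape single quotes in the SQL text as well.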

And since nobody has already said it (maybe because it's obvious) querying LIKE '%string%' bypasses your existing indexes - so it'll run slow.
Hence - why you need to use full text indexing. (which is what Quassnoi said).
Correction - I'm sure I learnt this, and always believed it - but after some investigating it (using wildcard at the start) seems OK? My old regex queries run better with likes!
