How to identify records that contain a phone number - sql-server

I need to identify records in my database that contain a phone number so that I can send them on to a contact team.
Example:
tblData
id
comment
dtCreate
Given this table structure, the query might be:
SELECT * FROM tblData WHERE comment [HeresWhereINeedHelp]
The comment might (and likely will) contain all sorts of other data. An example comment:
Yea, I had a terrible experience. I'd like for someone to call me at 111.222.3333. Thank you.
The record containing this comment should be pulled in the query because it contains a phone number.
I tried an extended SPROC that enabled regex searching, but the performance was terrible. The system is SQL Server 2012.
Many thanks for any direction.

You should look at the LIKE operator. In your case, you're probably going to go for something along the lines of
WHERE comment LIKE '%[0-9][0-9][0-9]_[0-9][0-9][0-9]_[0-9][0-9][0-9][0-9]%'
In that pattern, each _ matches any single character, so separators such as . and - are both accepted. Because the pattern can match anywhere in the string, this will also take a long time to process over a large dataset. Another option is to check the comment field for a phone number when it is entered and flag the row using a column like ContainsPhoneNumber (bit). That way, you can index that column and do faster lookups.
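As a rough illustration of the "flag it on entry" idea, here is a minimal Python sketch of the check an application could run before inserting a comment. The function name contains_phone_number is hypothetical, and the regular expression is only an approximation of the LIKE pattern above (three digits, any one character, three digits, any one character, four digits); it is not a full phone number validator.

```python
import re

# Approximates the LIKE pattern: [0-9]x3, any one char, [0-9]x3, any one char, [0-9]x4.
# re.DOTALL lets "." match a newline too, as "_" does in LIKE.
PHONE_SHAPE = re.compile(r"[0-9]{3}.[0-9]{3}.[0-9]{4}", re.DOTALL)


def contains_phone_number(comment: str) -> bool:
    """Return True if the comment contains something shaped like 111.222.3333.

    Hypothetical helper: the result would be stored in a bit column such as
    ContainsPhoneNumber at insert time.
    """
    return PHONE_SHAPE.search(comment) is not None


sample = "I'd like for someone to call me at 111.222.3333. Thank you."
flag = contains_phone_number(sample)  # value to store alongside the comment
```

Like the LIKE pattern, this misses numbers written without separators and will also accept any 12-digit run, so the shape may need tightening for real data.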

I would create a separate table for the phone numbers, stored as 10-digit integers (bigint in SQL Server, since a 10-digit value can exceed the int range).
Run Regex once to parse out the phone numbers.
Index that column and you will get index seek speed.
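The one-off parsing pass could look something like the following Python sketch. It assumes 10-digit numbers with an optional ".", "-" or space between the groups; the function name extract_phone_numbers and the accepted formats are assumptions, not something specified in the answer above.

```python
import re
from typing import List

# Assumed shape: 3-3-4 digits, optionally separated by ".", "-" or a space,
# and not embedded inside a longer run of digits.
PHONE_RE = re.compile(
    r"(?<![0-9])([0-9]{3})[.\- ]?([0-9]{3})[.\- ]?([0-9]{4})(?![0-9])"
)


def extract_phone_numbers(comment: str) -> List[int]:
    """Return every phone number found in the comment as a 10-digit integer."""
    return [int("".join(match.groups())) for match in PHONE_RE.finditer(comment)]


numbers = extract_phone_numbers("Call me at 111.222.3333 or 444-555-6666.")
```

Each returned value would become one row in the phone number table, keyed by the id of the comment it came from. Note that converting to an integer drops any leading zero.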

Related

Should I use LIKE or CONTAINS on a second column?

I have a table which has 2 columns (nvarchar(max) and varbinary(max)). The binary column contains PDF documents, and the catalog and index are set up to use this column.
The nvarchar column contains a list of IDs (e.g. "12","55","69", etc.). This column can contain hundreds of IDs, so that text would be quite long.
When building a search query, I always use CONTAINS, eg:
SELECT *
FROM mytable
WHERE CONTAINS(mybinarycolumn, 'keyword')
Depending on the search, I might or might not use the secondary column. So I was going to use IF to execute a second query, like this:
SELECT *
FROM mytable
WHERE CONTAINS(mybinarycolumn, 'keyword') AND
mytextcolumn LIKE '%"55"%'
Would I incur a performance hit if I use LIKE? Is it possible to combine CONTAINS and LIKE into one CONTAINS which might or might not use mytextcolumn in the search? (If the text column must be used, it's always an AND with the binary column.)
Assuming the normalization option isn't a good one for you...
I'm sure there will be a performance hit. LIKE is never a high performing operation, and you can't really build any indexes to help you out. If you are lucky, the SQL optimizer will do the CONTAINS part of the query first and apply the LIKE only to matching results. (Show execution plan will be your friend here.)
I can't think of a good way to combine the two columns into something that can be searched with a single CONTAINS; anything I've come up with looks like more work than the query as you have it.
You could try putting a full-text index on mytextcolumn and then use CONTAINS on that column as well. I'm not sure if that will help or not, but it may be worth a try.
I assume the values in mytextcolumn are well-delimited. If the column contains unquoted values, e.g. '12,23,45,67,777,890' instead of '"12","23","45","67","777","890"', your LIKE condition won't work the way you expect (because '%55%' would match both '11,22,55' and '11,22,555').
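The delimiter point can be checked with a small Python sketch. The helper like_contains is hypothetical and only mimics LIKE '%needle%' for a needle that contains no wildcard characters.

```python
def like_contains(haystack: str, needle: str) -> bool:
    """Mimic  haystack LIKE '%' + needle + '%'  for a needle without wildcards."""
    return needle in haystack


unquoted = "11,22,555"
quoted = '"11","22","555"'

# Without delimiters, searching for id 55 also hits id 555.
false_positive = like_contains(unquoted, "55")

# With the quotes included in the search term, 555 no longer matches 55.
correct_miss = like_contains(quoted, '"55"')
```

Here false_positive is True and correct_miss is False, which is why the quotes have to be part of both the stored value and the search term.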
Good luck.

Will indexing improve varchar(max) query performance, and how to create index

Firstly, I should point out I don't have much knowledge on SQL Server indexes.
My situation is that I have an SQL Server 2008 database table that has a varchar(max) column usually filled with a lot of text.
My ASP.NET web application has a search facility which queries this column for keyword searches, and depending on the number of keywords searched for there may be one or many LIKE '%keyword%' conditions in the SQL query.
My web application also allows searching by various other columns in this table, not just that one column. There are also a few joins to other tables.
My question is, is it worthwhile creating an index on this column to improve performance of these search queries? And if so, what type of index, and will just indexing the one column be enough or do I need to include other columns such as the primary key and other searchable columns?
The best analogy I've ever seen for why an index won't help '%wildcard%' searches:
Take two people. Hand each one the same phone book. Say to the person on your left:
Tell me how many people are in this phone book with the last name "Smith."
Now say to the person on your right:
Tell me how many people are in this phone book with the first name "Simon."
An index is like a phone book. Very easy to seek for the thing that is at the beginning. Very difficult to scan for the thing that is in the middle or at the end.
Every time I've repeated this in a session, I see light bulbs go on, so I thought it might be useful to share here.
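To make the analogy concrete, here is a small Python sketch in which a sorted list stands in for the index. The names and helper functions are made up for illustration; a real B-tree index is more involved, but the seek-versus-scan difference is the same.

```python
import bisect
from typing import List


def count_by_prefix(sorted_names: List[str], prefix: str) -> int:
    """Seek: two binary searches bound the block of entries starting with prefix."""
    start = bisect.bisect_left(sorted_names, prefix)
    # "\uffff" sorts after any character expected in a name (an assumption).
    end = bisect.bisect_left(sorted_names, prefix + "\uffff")
    return end - start


def count_containing(names: List[str], fragment: str) -> int:
    """Scan: sort order does not help, so every entry must be inspected."""
    return sum(1 for name in names if fragment in name)


phone_book = sorted(["Jones, Simon", "Smith, Alice", "Smith, Simon", "Smythe, Bob"])

smiths = count_by_prefix(phone_book, "Smith")    # like LIKE 'Smith%'
simons = count_containing(phone_book, "Simon")   # like LIKE '%Simon%'
```

The prefix count touches only a handful of entries however large the book is, while the "contains" count always reads all of them.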
You cannot create an index on a varchar(max) column. The maximum size of an index key is 900 bytes. If a varchar(n) column is declared wider than 900 bytes, you can still create the index, but any insert with more than 900 bytes will fail.
I suggest you read about full-text search. It should suit you in this case.
It's not worthwhile creating a regular index if you're doing LIKE '%keyword%' searches. The reason is that indexing works like searching a dictionary, where you start in the middle and then split the difference until you find the word. That wildcard query is like being asked to look up every word that contains the text "to": the only way to find matches is to scan the whole dictionary.
You might consider a full-text search, however, which is meant for this kind of scenario.
The best way to find out is to create a set of test queries that resemble real-life usage and run them against your DB with and without the index. In general, if you run many SELECT queries and few UPDATE/DELETE queries, an index might make your queries faster.
However, if you do a lot of updates, the index might hurt your performance, so you have to know what kind of queries your DB will have to deal with before you make this decision.

Sql Server Full Text: Human names which sound alike

I have a database with lots of customers in it. A user of the system wants to be able to look up a customer's account by name, amongst other things.
What I have done is create a new table called CustomerFullText, which just has a CustomerId and an nvarchar(max) field "CustomerFullText". In "CustomerFullText" I keep concatenated together all the text I have for the customer, e.g. First Name, Last Name, Address, etc, and I have a full-text index on that field, so that the user can just type into a single search box and gets matching results.
I found this gave better results than trying to search data stored in lots of different columns, although I suppose I'd be interested in hearing if this in itself is a terrible idea.
Many people have names which sound the same but are spelled differently: Katherine and Catherine and Catharine, and perhaps someone whose record in the database is Katherine but who introduces themselves as Kate. Also, McDonald vs MacDonald, Liz vs Elisabeth, and so on.
Therefore, what I'm doing is, whilst storing the original name correctly, making a series of replacements before I build the full text. So ALL of Katherine and Catherine and so on are replaced with "KATE" in the full text field. I do the same transform on my search parameter before I query the database, so someone who types "Catherine" into the search box will actually run a query for "KATE" against the full text index in the database, which will match Catherine AND Katherine and so on.
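The replacement scheme described above might be sketched in Python as follows. The variant table and the function name normalize_names are invented for illustration; the point is only that the same transform is applied to the stored text and to the search term.

```python
import re

# Hypothetical variant table: every spelling maps to one canonical token.
CANONICAL = {
    "katherine": "KATE",
    "catherine": "KATE",
    "catharine": "KATE",
    "kate": "KATE",
    "mcdonald": "MACDONALD",
    "macdonald": "MACDONALD",
}


def normalize_names(text: str) -> str:
    """Replace known name variants with their canonical token, word by word."""
    words = re.findall(r"[A-Za-z']+", text)
    return " ".join(CANONICAL.get(word.lower(), word.upper()) for word in words)


stored = normalize_names("Katherine MacDonald")  # written to the full-text column
query = normalize_names("Catherine McDonald")    # applied to the search box input
```

Both calls produce "KATE MACDONALD", so the search term matches the stored text even though the spellings differ.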
My question is: does this duplicate any part of existing SQL Server Full Text functionality? I've had a look, but I don't think that this is the same as a custom stemmer or word breaker or similar.
Rather than trying to phonetically normalize your data yourself, I would use the Double Metaphone algorithm, essentially a much better implementation of the basic SOUNDEX idea.
You can find an example implementation here: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574, and more are listed in the Wikipedia link above.
It will generate two normalized code versions of your word. You can then persist those in two additional columns and compare them against your search text, which you would convert to Double Metaphone on the fly.

Improving performance for string matching

I am working for a startup that is building an iPhone app, and I would like to ask a few questions about improving an algorithm we use for string matching.
We have a database that has a huge list of phone numbers along with the name of the user who owns each phone number. Let's say that the database looks like this:
name phonenum
hari 1234
abc 3873
....
This database has a large number of rows (around 1 million). When the user opens the app, the app gets the list of phone numbers from the person's phone contacts and matches it against the database. We return all the phone numbers that are present in the database. What we do right now is very inefficient: we send the phone numbers from the phone contacts in sets of 20 and match each set against the database. This leads to a cost of O(n) per contact, that is, (number of phone contacts) * O(n) overall.
I thought of some improvements like having the database rows sorted by phone numbers so that we can do binary search. In addition to that, we can have a hash table containing some 10,000 phone numbers in the cache memory and we can search against this cache memory initially. Only if there is a miss, we will access the database and search the database with complexity of O(log n) using binary search.
Also, there is the issue of sending phone numbers for matching. Do I send them as they are, or as hashed values? Will that matter in terms of improving performance?
Is there any other way of doing this thing?
I explained the whole scenario so that you can have a better understanding of my need
thanks
If you already have an SQL Server database, let it take care of that. Create an index on the phone number column (if you don't have it already). Send all the numbers in the contact list in one go (no need to split them by 20) and match them against the database. The SQL server probably uses much better indexing than anything you could come up with, so it's going to be pretty fast.
Alternatively, you can try to insert the numbers into a temporary table and query against that, but I have no idea whether that would be faster.
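A minimal sketch of the "one trip, indexed lookup" approach, using Python's built-in sqlite3 module purely as a stand-in for SQL Server. The table and index names are assumptions, and the temporary-table variant is shown because it is easy to express; a single IN list would serve the same purpose.

```python
import sqlite3
from typing import List, Tuple

# sqlite3 is only a stand-in here; the real system would be SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, phonenum TEXT)")
conn.execute("CREATE INDEX ix_users_phonenum ON users (phonenum)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("hari", "1234"), ("abc", "3873")])


def match_contacts(db: sqlite3.Connection, contact_numbers: List[str]) -> List[Tuple[str, str]]:
    """Load the whole contact list once, then let the database join on the index."""
    db.execute("CREATE TEMP TABLE IF NOT EXISTS contact_numbers (phonenum TEXT PRIMARY KEY)")
    db.execute("DELETE FROM contact_numbers")
    db.executemany(
        "INSERT OR IGNORE INTO contact_numbers VALUES (?)",
        [(number,) for number in contact_numbers],
    )
    return db.execute(
        "SELECT u.name, u.phonenum FROM users u "
        "JOIN contact_numbers c ON c.phonenum = u.phonenum "
        "ORDER BY u.phonenum"
    ).fetchall()


matches = match_contacts(conn, ["1234", "9999", "3873"])
```

The whole contact list goes to the database in one request, and each lookup is an index seek rather than a scan of the million-row table.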
If you can represent phone numbers as numeric values instead of strings, then you can put an index on your database field that will make lookup operations very fast. Even if you have to represent them as strings, an index on the database field will make looking up values fast enough to be a non-issue in the grand scheme of things.
Your biggest performance problem is going to be with all the round trips between the application and your database. That is a performance bottleneck with any web-enabled program. If you are unlikely to have a high rate of success (maybe 2% of the user's contacts are in your database), you'll probably be better off sending the whole list of phone numbers at once, since you'll just be getting data back for a few of them.
If the purpose is to update the user's contact data with the data found in your database, you could create a hash out of the appropriate fields and send that along with the phone number. Have the database keep a hash of those fields on its side and do a comparison. If the hash matches, then you don't have to send any data back because the local and remote versions are the same.
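The hash comparison could be sketched in Python like this. The choice of SHA-256, the field list, and the function names are all assumptions; any stable hash computed identically on both sides would do.

```python
import hashlib


def contact_fingerprint(name: str, phonenum: str) -> str:
    """Hash the fields that matter; computed the same way on phone and server."""
    # A separator that cannot appear in the fields avoids ambiguous concatenation.
    payload = "\x1f".join([name, phonenum]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_update(local_hash: str, server_hash: str) -> bool:
    """Data only has to be sent back when the two fingerprints differ."""
    return local_hash != server_hash


local = contact_fingerprint("hari", "1234")
remote = contact_fingerprint("Hari K.", "1234")
stale = needs_update(local, remote)
```

When the fingerprints match, the server can answer with nothing more than "unchanged" for that number.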
A successful caching strategy would require a good understanding of how the data will be used, so I can't provide much guidance based on the information given. For example, if 90% of the phones using your app will have all of the phone numbers matched in a small group of the numbers in the database, then by all means, put that small group into a Hashtable. But if users are likely to have any phone numbers that aren't in that small group, you're going to have to do a database round-trip. The key will be to construct a query that allows the database to return all of the data you need in one trip.
I'd split the phone number up into three parts
example 777.777.7777
Each section can be stored in an int and used as a hash key.
This would mean that your data store becomes a series of hash tables.
Or you could force the whole number into an int and then use that as your hash key. But for fast results you'd need more buckets.
Cheers

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a 750k record minimum table, of product sku's, sizing, material type, create date, etc;
Is anyone aware of a 'plugin' solution for Coldfusion to do this? I envision a google like single entry search where a customer can type in the part number, or the sizing, etc, and get hits on any or all relevant results.
Currently if I run a 'LIKE' comparison query, it seems to take ages (ok a few seconds, but still), and it is too long. At times making a user sit there and wait up to 10 seconds for queries & page loads.
Or are there any SQL formulas to help accomplish this? I want to use a proven method to search the data, not just a simple SQL like or = comparison operation.
So this is a multi-approach question, should I attack this at the SQL level (as it ultimately looks to be) or is there a plug in/module for ColdFusion that I can grab that will give me speedy, advanced search capability.
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether it's even worth trying depends a lot on how often you update the records you need to search. If you update them rarely, you could do a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the web server, and will certainly negate any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically the search of textual fields (as I surmise from your mention of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of your records from main table to a set of words (1 word per row) of the textual field. If it matters, add the field of origin as a 3rd column in the index table, and if you want "relevance" features you may want to consider word count.
Populate the index table with either a trigger (using splitting) or from your app - the latter might be better, simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately drastically speed up textual search as it will no longer do "LIKE", AND will be able to use indexes on index table (no pun intended) without interfering with indexing on SKU and the like on the main table.
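A minimal sketch of such an index table populated from the application, again using Python's sqlite3 module as a stand-in for SQL Server. The table names, the word-splitting rule, and the helper functions are assumptions made for illustration.

```python
import re
import sqlite3
from typing import List

# sqlite3 is only a stand-in; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT)")
conn.execute(
    "CREATE TABLE product_words ("
    " word TEXT, product_id INTEGER, PRIMARY KEY (word, product_id))"
)


def save_product(db: sqlite3.Connection, product_id: int, description: str) -> None:
    """Write the record and refresh its rows in the word index (one word per row)."""
    words = set(re.findall(r"[a-z0-9]+", description.lower()))
    db.execute("INSERT OR REPLACE INTO products VALUES (?, ?)", (product_id, description))
    db.execute("DELETE FROM product_words WHERE product_id = ?", (product_id,))
    db.executemany(
        "INSERT INTO product_words VALUES (?, ?)",
        [(word, product_id) for word in words],
    )


def search(db: sqlite3.Connection, word: str) -> List[int]:
    """Exact-word lookup: an equality seek on the index table, no LIKE involved."""
    rows = db.execute(
        "SELECT product_id FROM product_words WHERE word = ? ORDER BY product_id",
        (word.lower(),),
    )
    return [row[0] for row in rows]


save_product(conn, 1, "Steel bolt, boxed")
save_product(conn, 2, "Brass bolt")
hits = search(conn, "bolt")
```

The trade-off is that only whole words match; a search for "bol" finds nothing unless prefix matching is added on the word column.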
Also, ensure that all the relevant fields are indexed fully, not necessarily in the same compound index (SKU, sizing, etc.). Any field that is searched as a range (sizing or date) is a good candidate for a clustered index, as long as the records are inserted in approximately increasing order of that field or you don't care as much about insert/update speed.
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow and the query plans you have now for those slow queries.
Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable"; your comment mentioned "is it boxed" in the set of text fields. If so, I assume the values are "yes"/"no" or some other very limited data set. In that case, simply store a numeric code for the valid values, do the encoding and decoding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
I've done this using SQL Server's full-text indexes. This will require very few application changes and no changes to the database schema except for the addition of the full-text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
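The string-building step might look like this in Python. The function name to_fulltext_query is hypothetical, and stripping embedded double quotes is an added precaution that the answer above does not mention.

```python
def to_fulltext_query(user_input: str) -> str:
    """Turn 'steel bolt' into '"steel*" AND "bolt*"' for a full-text prefix search."""
    # Drop double quotes typed by the user so they cannot break the term delimiters.
    words = [word.replace('"', "") for word in user_input.split()]
    return " AND ".join('"{}*"'.format(word) for word in words if word)


search_condition = to_fulltext_query("steel bolt 10mm")
```

The resulting string is best passed to the query as a parameter rather than concatenated into the SQL text, and an empty result (no usable words) should skip the full-text search altogether.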
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
from containstable(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The column identified as the primary key of the full text search
Rank - A relative rank of the match (1 - 1000 with a higher ranking meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution, then you should just go with Google itself. It sounds like you're doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup, then you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding, and it will not depend on your database or even on ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you need to limit access based on a user's role, permissions or group, then it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data is, that is where search performance is likely to be an issue. Make sure you have indexes on the columns you are searching on. If you use LIKE, the query can't use an index when written like this: SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you write it like this: SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key is to keep as many of the leading characters as possible free of wildcards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173
