Regular Expression in Visual Studio Find & Replace - multiple spaces between search terms - sql-server

I require a regular expression for the Visual Studio Search and Replace functionality, as follows:
Search for the following term: sectorkey in (
There could be multiple spaces between each of the above 3 search terms, or even multiple line breaks/carriage returns.
The search term is looking for SQL statements that have hard-coded SectorKey values inside a SQL in statement. These need to be replaced with a SQL join statement - this will be done manually.

The little arrow to the right of the Find What box is your friend and can help you with the vagaries of the MS regex syntax.
Newline is represented by \n, so you can just do sectorkey( |\n)+in( |\n)+\( (You need to escape the open paren in your search expression, since that's used in grouping.)

I believe :Wh+ is what you want. The Visual Studio regex flavor is very strange; you'll tend to get better results if you consult the official reference. Expertise with "mainstream" regexes tends to be more of a handicap than a help when it comes to VS.

You can use \s+ to search for one or more adjacent whitespace characters (including tab, CR, LF etc), so your regex would presumably end up looking something like sectorkey\s+in\s+\(.
Edit...
As Joe points out in his comment, it seems that Visual Studio doesn't support \s in Find/Replace expressions, in which case you'll probably need to use something like [\n:b] instead. The regex would then become sectorkey[\n:b]+in[\n:b]+\(.

Related

Snowflake and Regular Expressions - issue when implementing known good expression in SF

I'm looking for some assistance in debugging a REGEXP_REPLACE() statement.
I have been using an online regular expressions editor to build expressions, and then the SF regexp_* functions to implement them. I've attempted to remain consistent with the SF regex implementation, but I'm seeing an inconsistency in the returned results that I'm hoping someone can explain :)
My intent is to replace commas within the text (excluding commas with double-quoted text) with a new delimiter (#^#).
Sample text string:
"Foreign Corporate Name Registration","99999","Valuation Research",,"Active Name",02/09/2020,"02/09/2020","NEVADA","UNITED STATES",,,"123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES","123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES",,,,,,,,,,,,
RegEx command and Substitution (working in regex101.com):
([("].*?["])*?(,)
\1#^#
regex101.com Result:
"Foreign Corporate Name Registration"#^#"99999"#^#"Valuation Research"#^##^#"Active Name"#^#02/09/2020#^#"02/09/2020"#^#"NEVADA"#^#"UNITED STATES"#^##^##^#"123 SOME STREET"#^##^#"MILWAUKEE"#^#"WI"#^#"53202"#^#"UNITED STATES"#^#"123 SOME STREET"#^##^#"MILWAUKEE"#^#"WI"#^#"53202"#^#"UNITED STATES"#^##^##^##^##^##^##^##^##^##^##^##^#
When I try and implement this same logic in SF using REGEXP_REPLACE(), I am using the following statement:
SELECT TOP 500
A.C1
,REGEXP_REPLACE((A."C1"),'([("].*?["])*?(,)','\\1#^#') AS BASE
FROM
"<Warehouse>"."<database>"."<table>" AS A
This statement returns the result for BASE:
"Foreign Corporate Name Registration","99999","Valuation Research",,"Active Name",02/09/2020,"02/09/2020","NEVADA","UNITED STATES",,,"123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES","123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES"#^##^##^##^##^##^##^##^##^##^##^##^#
As you can see when comparing the results, the SF result set is only replacing commas at the tail-end of the text.
Can anyone tell me why the results between regex101.com and SF are returning different results with the same statement? Is my expression non-compliant with the SF implementation of RegEx - and if yes, can you tell me why?
Many many thanks for your time and effort reading this far!
Happy Wednesday,
Casey.
The use of .*? to achieve lazy matching for regexing is limited to PCRE, which Snowflake does not support. To see this, in regex101.com, change your 'flavor" to be anything other than PCRE (PHP); you will see that your ([("].*?["])*?(,) regex no longer achieves what you are expecting.
I believe that this will work for your purposes:
REGEXP_REPLACE(A.C1,'("[^"]*")*,','\\1#^#')

Parsing CREATE TABLE sql with regex

I'm doing a simple language sql with regex, and i'm parsing the sentence CREATE TABLE, but does not work. I use this:
reti = regcomp(&regex, "CREATE TABLE [a-zA-Z]\\((.*)\\)", REG_EXTENDED);
It's something simple, i just do it to learn...
What is wrong with the regex?
Apart from only working on one letter SQL tables, you are either omitting whitespaces before the open parentheses or, depending on the SQL and regex engine/string syntax, over-escaping your expression parentheses.
Check:
[A-Za-z]+\\s*
if it doesn't work and the plus sign is not recognized,
[A-Za-z][A-Za-z]*\\s*
and whether it's \\\\( or just \\( (it should be the latter. But better be sure).
This supports names such as Antinoo and cUsToMeRs, but not Invoices_New or Suppliers2014. You might want to add numbers and underscores to your regex. Since table names probably won't start with numbers, you'll want
[A-Z_a-z][A-Z_a-z0-9]*\\s*\\(([^;]*)\\)

When naming columns in a SQL Server table, are there any names I should avoid using?

I remember when I was working with PHP several years back I could blow up my application by naming a MySQL column 'desc' or any other term that was used as an operator.
So, in general are there names I should avoid giving my table columns?
As long as you surround every column name with '[' and ']', it really doesn't matter what you use. Even a space works (try it: [ ]).
Edit: If you can't use '[' and ']' in every case, check the documentation for characters that are not allowable as well as keywords that are intrinsic to the system; those would be out of bounds. Off the top of my head, the characters allowed (for SqlServer) for an identifier are: a-z, A-Z, 0-9, _, $, #.
in general don't start with a number, don't use spaces, don't use reserved words and don't use non alphanumeric characters
however if you really want to you can still do it but you need to surround it with brackets
this will fail
create table 1abc (id int)
this will not fail
create table [1abc] (id int)
but now you need to use [] all the time, I would avoid names as the ones I mentioned above
Check the list of reserved keywords as indicated in other answers.
Also avoid using the "quoting" using quotes or square brackets for the sake of having a space or other special character in the object name. The reason is that when quoted the object name becomes case sensitive in some database engines (not sure about MSSQL though)
Some teams use the prefix for database objects (tables, views, columns) like T_PERSON, V_PERSON, C_NAME etc. I personally do not like this convention, but it does help avoiding keyword issues.
You should avoid any reserved SQL keywords (ex. SELECT) and from a best practices should avoid spaces.
Yes, and no.
Yes, because it's annoying and confusing to have names that match keywords, and that you have to escape in funny ways (when you're not consistently escaping)
and No, because it's possible to have any sequence of characters as an identifier, if you escape it properly :)
Use [square brackets] or "double quotes" to escape multi-word identifiers or keywords, or even names that have backslashes or any other slightly odd character, if you must.
Strictly speaking, there's nothing you can't name your columns. However, it will make your life easier if you avoid names with spaces, SQL reserved words, and reserved words in the language you're programming in.
You can use pretty much anything as long as you surround it with square brackets:
SELECT [value], [select], [insert] FROM SomeTable
I however like to avoid doing this, partly because typing square brackets everywhere is anoying and partyly because I dont generally find that column names like 'value' particularly descriptive! :-)
Just stay away from SQL keywords and anything which contains something other than letters and you shouldn't need to use those pesky square brackets.
You can surround a word in square brackets [] and basically use anything you'd like.
I prefer not to use the brackets, and in order to do so you just have to avoid reserved words.
MS SQL Server 2008 has these reserved words
Beware of using square brackets on updates, I had a problem using the following query:
UPDATE logs SET locked=1 WHERE [id] IN (SELECT [id] FROM ids)
This caused all records to be updated, however, this appears to work fine:
UPDATE logs SET locked=1 WHERE id IN (SELECT [id] FROM ids)
Note that this problem appears specific to updates, as the following returns only the rows expected (not all rows):
SELECT * FROM logs WHERE [id] IN (SELECT [id] FROM ids)
This was using MSDE 2000 SP3 and connecting to the database using MS SQL (2000) Query Analyzer V 8.00.194
Very odd, possibly related to this Knowledgebase bug http://support.microsoft.com/kb/140215
In the end I just removed all the unnecessary square brackets.

Full text catalog/index search for %book%

I'm trying to wrap my head around how to search for something that appears in the middle of a word / expression - something like searching for "LIKE %book% " - but in SQL Server (2005) full text catalog.
How can I do that? It almost appears as if both CONTAINS and FREETEXT really don't support wildcard at the beginning of a search expression - can that really be?
I would have imagined that FREETEXT(*, "book") would find anything with "book" inside, including "rebooked" or something like that.
unfortunately CONTAINS only supports prefix wildcards:
CONTAINS(*, '"book*"')
SQL Server Full Text Search is based on tokenizing text into words. There is no smaller unit as a word, so the smallest things you can look for are words.
You can use prefix searches to look for matches that start with certain characters, which is possible because word lists are kept in alphabetical order and all the Server has to do is scan through the list to find matches.
To do what you want a query with a LIKE '%book%' clause would probably be just as fast (or slow).
If you want to do some serious full text searching then I would (and have) use Lucene.Net. MS SQL Full Text search never seems to work that well for anything other than the basics.
Here's a suggestion that is a workaround for that wildcard limitation. You create a computed column that contains the same content but in reverse as the column(s) you are searching.
If, for example, you are searching on a column named 'ProductTitle', then create a column named ProductsRev. Then update that field's 'Computed Column Specification' value to be:
(reverse([ProductTitle]))
Include the 'ProductsRev' column in your search and you should now be able to return results that support a wildcard at the beginning of the word. Good luck!!
Full text has a table that lists all the words the engine has found. It should have orders-of-magnitude less rows than your full-text-indexed table. You could select from that table " where field like '%book%' " to get all the words that have 'book' in them. Then use that list to write a fulltext query. Its cumbersome, but it would work, and it would be ok in the speed department. HOWEVER, ultimately you are using fulltext wrong when you are doing this. It might actually be better to educate the source of these feature requests about what fulltext is doing. You want them to understand what it WANTS to do, so they can get high value from fulltext. Example, only use wild cards at the end of a word, which means think of the words in an ordered list.
why don't program an assembly in C# to compute all the non repeated sufixes. For example if you have the Text "eat the red meat" you can store in a field "eat at t the he e red ed d meat" (note that is not necesary to add eat at and t again) ind then in this field use full text search. A function for doing that can easily written in Csharp
x) I know it seems od... it's a workarround
x) I know I'm adding overhead in the insert / update .... only justified if this overhead is insignificant besides the improvement in the search function
x) I know there is also an overhead in the size of the stored data.
But I'm pretty conffident that will be quite fast

How do you get leading wildcard full-text searches to work in SQL Server?

Note: I am using SQL's Full-text search capabilities, CONTAINS clauses and all - the * is the wildcard in full-text, % is for LIKE clauses only.
I've read in several places now that "leading wildcard" searches (e.g. using "*overflow" to match "stackoverflow") is not supported in MS SQL. I'm considering using a CLR function to add regex matching, but I'm curious to see what other solutions people might have.
More Info: You can add the asterisk only at the end of the word or phrase. - along with my empirical experience: When matching "myvalue", "my*" works, but "(asterisk)value" returns no match, when doing a query as simple as:
SELECT * FROM TABLENAME WHERE CONTAINS(TextColumn, '"*searchterm"');
Thus, my need for a workaround. I'm only using search in my site on an actual search page - so it needs to work basically the same way that Google works (in the eyes on a Joe Sixpack-type user). Not nearly as complicated, but this sort of match really shouldn't fail.
Workaround only for leading wildcard:
store the text reversed in a different field (or in materialised view)
create a full text index on this column
find the reversed text with an *
SELECT *
FROM TABLENAME
WHERE CONTAINS(TextColumnREV, '"mrethcraes*"');
Of course there are many drawbacks, just for quick workaround...
Not to mention CONTAINSTABLE...
The problem with leading Wildcards: They cannot be indexed, hence you're doing a full table scan.
It is possible to use the wildcard "*" at the end of the word or phrase (prefix search).
For example, this query will find all "datab", "database", "databases" ...
SELECT * FROM SomeTable WHERE CONTAINS(ColumnName, '"datab*"')
But, unforutnately, it is not possible to search with leading wildcard.
For example, this query will not find "database"
SELECT * FROM SomeTable WHERE CONTAINS(ColumnName, '"*abase"')
To perhaps add clarity to this thread, from my testing on 2008 R2, Franjo is correct above. When dealing with full text searching, at least when using the CONTAINS phrase, you cannot use a leading , only a trailing functionally. * is the wildcard, not % in full text.
Some have suggested that * is ignored. That does not seem to be the case, my results seem to show that the trailing * functionality does work. I think leading * are ignored by the engine.
My added problem however is that the same query, with a trailing *, that uses full text with wildcards worked relatively fast on 2005(20 seconds), and slowed to 12 minutes after migrating the db to 2008 R2. It seems at least one other user had similar results and he started a forum post which I added to... FREETEXT works fast still, but something "seems" to have changed with the way 2008 processes trailing * in CONTAINS. They give all sorts of warnings in the Upgrade Advisor that they "improved" FULL TEXT so your code may break, but unfortunately they do not give you any specific warnings about certain deprecated code etc. ...just a disclaimer that they changed it, use at your own risk.
http://social.msdn.microsoft.com/Forums/ar-SA/sqlsearch/thread/7e45b7e4-2061-4c89-af68-febd668f346c
Maybe, this is the closest MS hit related to these issues... http://msdn.microsoft.com/en-us/library/ms143709.aspx
One thing worth keeping in mind is that leading wildcard queries come at a significant performance premium, compared to other wildcard usages.
Note: this was the answer I submitted for the original version #1 of the question before the CONTAINS keyword was introduced in revision #2. It's still factually accurate.
The wildcard character in SQL Server is the % sign and it works just fine, leading, trailing or otherwise.
That said, if you're going to be doing any kind of serious full text searching then I'd consider utilising the Full Text Index capabilities. Using % and _ wild cards will cause your database to take a serious performance hit.
Just FYI, Google does not do any substring searches or truncation, right or left. They have a wildcard character * to find unknown words in a phrase, but not a word.
Google, along with most full-text search engines, sets up an inverted index based on the alphabetical order of words, with links to their source documents. Binary search is wicked fast, even for huge indexes. But it's really really hard to do a left-truncation in this case, because it loses the advantage of the index.
As a parameter in a stored procedure you can use it as:
ALTER procedure [dbo].[uspLkp_DrugProductSelectAllByName]
(
#PROPRIETARY_NAME varchar(10)
)
as
set nocount on
declare #PROPRIETARY_NAME2 varchar(10) = '"' + #PROPRIETARY_NAME + '*"'
select ldp.*, lkp.DRUG_PKG_ID
from Lkp_DrugProduct ldp
left outer join Lkp_DrugPackage lkp on ldp.DRUG_PROD_ID = lkp.DRUG_PROD_ID
where contains(ldp.PROPRIETARY_NAME, #PROPRIETARY_NAME2)
When it comes to full-text searching, for my money nothing beats Lucene. There is a .Net port available that is compatible with indexes created with the Java version.
There's a little work involved in that you have to create/maintain the indexes, but the search speed is fantastic and you can create all sorts of interesting queries. Even indexing speed is pretty good - we just completely rebuild our indexes once a day and don't worry about updating them.
As an example, this search functionality is powered by Lucene.Net.
Perhaps the following link will provide the final answer to this use of wildcards: Performing FTS Wildcard Searches.
Note the passage that states: "However, if you specify “Chain” or “Chain”, you will not get the expected result. The asterisk will be considered as a normal punctuation mark not a wildcard character. "
If you have access to the list of words of the full text search engine, you could do a 'like' search on this list and match the database with the words found, e.g. a table 'words' with following words:
pie
applepie
spies
cherrypie
dog
cat
To match all words containing 'pie' in this database on a fts table 'full_text' with field 'text':
to-match <- SELECT word FROM words WHERE word LIKE '%pie%'
matcher = ""
a = ""
foreach(m, to-match) {
matcher += a
matcher += m
a = " OR "
}
SELECT text FROM full_text WHERE text MATCH matcher
% Matches any number of characters
_ Matches a single character
I've never used Full-Text indexing but you can accomplish rather complex and fast search queries with simply using the build in T-SQL string functions.
From SQL Server Books Online:
To write full-text queries in
Microsoft SQL Server 2005, you must
learn how to use the CONTAINS and
FREETEXT Transact-SQL predicates, and
the CONTAINSTABLE and FREETEXTTABLE
rowset-valued functions.
That means all of the queries written above with the % and _ are not valid full text queries.
Here is a sample of what a query looks like when calling the CONTAINSTABLE function.
SELECT RANK , * FROM TableName ,
CONTAINSTABLE (TableName, *, '
"*WildCard" ') searchTable WHERE
[KEY] = TableName.pk ORDER BY
searchTable.RANK DESC
In order for the CONTAINSTABLE function to know that I'm using a wildcard search, I have to wrap it in double quotes. I can use the wildcard character * at the beginning or ending. There are a lot of other things you can do when you're building the search string for the CONTAINSTABLE function. You can search for a word near another word, search for inflectional words (drive = drives, drove, driving, and driven), and search for synonym of another word (metal can have synonyms such as aluminum and steel).
I just created a table, put a full text index on the table and did a couple of test searches and didn't have a problem, so wildcard searching works as intended.
[Update]
I see that you've updated your question and know that you need to use one of the functions.
You can still search with the wildcard at the beginning, but if the word is not a full word following the wildcard, you have to add another wildcard at the end.
Example: "*ildcar" will look for a single word as long as it ends with "ildcar".
Example: "*ildcar*" will look for a single word with "ildcar" in the middle, which means it will match "wildcard". [Just noticed that Markdown removed the wildcard characters from the beginning and ending of my quoted string here.]
[Update #2]
Dave Ward - Using a wildcard with one of the functions shouldn't be a huge perf hit. If I created a search string with just "*", it will not return all rows, in my test case, it returned 0 records.

Resources