Matching Regular Expressions In SQL Server - sql-server

I am trying to extract the id of an Android app from its URL, but I am getting extra characters.
I am using the REPLACE function in SQL Server. Below are two sample URLs, each followed by the result I currently get:
https://play.google.com/store/apps/details?id=com.flipkart.android&hl=en com.flipkart.android
https://play.google.com/store/apps/details?hl=en_US&id=com.surveysampling.mobile.quickthoughts&referrer=mat_click_id%3Df1901cef59f79b1542d05a1fdfa67202-20150429-5128 en_US&id=com.surveysampling.mobile.quickthoughts&r
I am doing this right now:
SELECT
SUBSTRING(REPLACE(PREVIEW, '&hl=en',''), CHARINDEX('?', PREVIEW) + 4 , 50)
FROM OFFERS_TABLE;
For the 1st URL I get com.flipkart.android, which is correct, but for the 2nd I get en_US&id=com.surveysampling.mobile.quickthoughts&r.
I want to remove en_US&id= from the start of it and &r from its end.
Can someone point me to a post or URL I can refer to?

What you are actually trying to do is extract the string that follows id= up to the next &, which is the separator for variables in a URL. With that condition in mind, I came up with the following regex.
Regex: (?<=id=)[^&]*
Explanation: It uses a lookbehind assertion, so the match must be preceded by id=, and it runs until the first & is found.
Regex101 Demo
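For the T-SQL side, a rough equivalent of that lookbehind can be sketched with CHARINDEX and SUBSTRING. This assumes a SQL Server version without built-in regular expression functions and assumes every URL carries an id parameter. The inline VALUES rows stand in for OFFERS_TABLE, and the aliases q, p, s and start_pos are made up for illustration.

```sql
-- Sketch only: emulate "text after id= up to the next &" without regex.
SELECT SUBSTRING(q.s,
                 p.start_pos,
                 CHARINDEX('&', q.s, p.start_pos) - p.start_pos) AS app_id
FROM (VALUES
    ('https://play.google.com/store/apps/details?id=com.flipkart.android&hl=en'),
    ('https://play.google.com/store/apps/details?hl=en_US&id=com.surveysampling.mobile.quickthoughts&referrer=mat_click_id%3Df1901cef59f79b1542d05a1fdfa67202-20150429-5128')
) AS t(PREVIEW)
-- Turn '?' into '&' so the parameter is always introduced by '&id=',
-- and append '&' so an id at the very end of the URL is still terminated.
CROSS APPLY (SELECT REPLACE(t.PREVIEW, '?', '&') + '&' AS s) AS q
-- First character after '&id='.
CROSS APPLY (SELECT CHARINDEX('&id=', q.s) + 4 AS start_pos) AS p;
```

Searching for '&id=' rather than 'id=' avoids matching a parameter whose name merely ends in id. Rows without an id parameter would need an extra CASE or WHERE guard, which this sketch leaves out.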

It seems like you've made some assumptions about lengths. The &r appears because that is where the 50 characters end. You are also getting the en_US because you assumed 4 characters at the beginning, but your second string has more. Perhaps you can split on & and then look for the variable that begins with id=.
It seems like a function like this would help:
http://www.sqlservercentral.com/blogs/querying-microsoft-sql-server/2013/09/19/how-to-split-a-string-by-delimited-char-in-sql-server/
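As a minimal sketch of the split-on-& idea, assuming SQL Server 2016 or later (compatibility level 130 or higher), the built-in STRING_SPLIT can take the place of a hand-written split function. The VALUES rows again stand in for OFFERS_TABLE, and the alias names are illustrative.

```sql
-- Sketch only: split the query string on '&' and keep the piece starting with 'id='.
SELECT SUBSTRING(p.value, 4, LEN(p.value)) AS app_id   -- drop the leading 'id='
FROM (VALUES
    ('https://play.google.com/store/apps/details?id=com.flipkart.android&hl=en'),
    ('https://play.google.com/store/apps/details?hl=en_US&id=com.surveysampling.mobile.quickthoughts&referrer=mat_click_id%3Df1901cef59f79b1542d05a1fdfa67202-20150429-5128')
) AS t(PREVIEW)
CROSS APPLY STRING_SPLIT(
    -- everything after the '?' is the query string
    SUBSTRING(t.PREVIEW, CHARINDEX('?', t.PREVIEW) + 1, LEN(t.PREVIEW)),
    '&') AS p
WHERE p.value LIKE 'id=%';
```

Because each parameter becomes its own row, the position of id in the URL and the length of the other parameters no longer matter.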

Related

Replacing Unicode characters using TRANSLATE function

A customer asked me to create a custom character mapper function from specific names to ASCII in their SQL database.
Here is a simplified fragment that works (shortened for brevity):
select TRANSLATE(N'àáâãäåāąæậạả',
N'àáâãäåāąæậạả',
N'aaaaaaaaaaaa');
While analyzing the results on customer's dataset, I noticed one more unmapped symbol ă. So I added it to the mapper as follows:
select TRANSLATE(N'àáâãäåāąæậạảă',
N'àáâãäåāąæậạảă',
N'aaaaaaaaaaaaa');
Unexpectedly, it started failing with the message:
The second and third arguments of the TRANSLATE built-in function must contain an equal number of characters.
Obviously, TRANSLATE thinks that ă is special and consists of more than one character. Actually, even Notepad thinks the same (copy ă and try to delete it using Backspace key - something unusual will happen. Delete key works normally, though).
Then I thought - if TRANSLATE considers it a two-char symbol, let's add a two char mapping then:
select TRANSLATE(N'àáâãäåāąæậạảă',
N'àáâãäåāąæậạảă',
N'aaaaaaaaaaaaaa');
No errors this time, yay. But the input string was not processed correctly, ă was not replaced with a.
What is the correct (case-sensitive) way to replace such "double symbols"? Can it be done using TRANSLATE at all? I don't want to add a bunch of REPLACE for every such symbol I find.
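One way to check the hunch that the symbol is stored as two characters is to look at its length and code points. The sketch below assumes the symbol is a plain a followed by the combining breve U+0306 (code point 774); that is an assumption about the customer's data, not something the question confirms. The variable name @s is made up.

```sql
-- Assumption: the symbol is 'a' + U+0306 (combining breve), built explicitly here.
DECLARE @s nvarchar(10) = N'a' + NCHAR(774);

SELECT LEN(@s)                      AS char_count,         -- 2
       UNICODE(SUBSTRING(@s, 1, 1)) AS first_code_point,   -- 97
       UNICODE(SUBSTRING(@s, 2, 1)) AS second_code_point;  -- 774

-- TRANSLATE maps one character to one character, so a two-character
-- sequence needs REPLACE; a binary collation keeps the match exact.
SELECT REPLACE(@s COLLATE Latin1_General_100_BIN2,
               N'a' + NCHAR(774),
               N'a') AS replaced;
```

To test real data, paste the symbol from the dataset into the N'...' literal instead of building it with NCHAR.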

How to remove GA Page query parameters using Data Studio calculated fields?

I have Request URIs in my GA Page dimension that look like this:
/this/is/a/webpage.html?parameter=1
/forwarded/from?url=/webpage.html?parameter=1
/this/is/another/webpage.html
I would like to create a calculated field in Data Studio that extracts out the text prior to the first "?" and returns that value.
The ideal output based on the above input would be:
/this/is/a/webpage.html
/forwarded/from
/this/is/another/webpage.html
I tried this:
Calculated field formula:
REGEXP_EXTRACT(Page, '^(.+?)\?')
It returns no records.
This is me playing with the regex https://regex101.com/r/hkqOXA/1
The regex seems valid, Data Studio seems to be failing me here! Please advise on either a workaround or explanation as to why Data Studio doesn't process this as expected!
Thanks!
Try this calculated field:
REGEXP_REPLACE(Page, '\\?.+', '')
The double backslash escapes the question mark, so the pattern matches the question mark and everything after it, and the calculated field replaces all of that with an empty string ''.
Cheers,
Ben
You can also do it this way.
REGEXP_EXTRACT(Page, '([^?]*)\?.*')

mssql search a varchar field for invalid typo

I have a field with names in it. They can be last name, first name middle name/initial
Basically I want to find all names that aren't normal spellings so I can tell someone to fix their names in the system.
I don't want to select and find this guy
O'Leary-Smith, Timothy L.
But I would want to find this guy
<>[]}{##$%^&*()/?=+_!|";:~`1234567890
I can just keep coming up with special characters to search for but then I'm just making this huge query and having to bracket wildcards... it's like 50+ lines long just to say one thing.
Is there something (not some custom function)
that lets me say
where name not like
A-Z
a-z
,
.
'
-
possibly something that is
where name contains anything but these ascii characters
Hopefully this is a one-off fix-up; use a negated character class:
where patindex('%[^ A-Za-z,.''-]%', name) > 0
Although more letters than A-Z can appear in names ...
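A quick way to try the negated character class is against a couple of inline sample names. The second name below is invented for illustration, and the alias v is arbitrary.

```sql
-- Sketch only: return names containing any character outside the allowed set
-- (space, letters, comma, period, apostrophe, hyphen).
SELECT v.name
FROM (VALUES
    ('O''Leary-Smith, Timothy L.'),   -- only allowed characters, not returned
    ('Sm1th, J@ne')                   -- contains 1 and @, returned
) AS v(name)
WHERE PATINDEX('%[^ A-Za-z,.''-]%', v.name) > 0;
```

The hyphen is placed last inside the brackets so that it is read as a literal character rather than as a range.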
If it's just odd characters you're looking for:
WHERE name like '%[^A-Za-z]%'
The ^ acts as a NOT operator.

SQL Contains exact phrase

I am trying to implement a search mechanism with CONTAINS() on SQL Server 2014.
I've read here https://technet.microsoft.com/en-us/library/ms142538%28v=sql.105%29.aspx and in the book "Pro Full-Text Search in SQL Server 2008" that I need to use double quotes to search an exact phrase.
But, e.g., if I use CONTAINS(*, '"test"'), I also receive results containing words like "numerictest". If I try CONTAINS(*, '" test "'), it is the same. I've noticed that there are fewer results than if I search with CONTAINS(*, '*test*') for a prefix/suffix search, so there is definitely a difference between the searches.
I didn't expect the "numerictest" in the first statement. Is there an explanation for this behaviour?
I have been wracking my brain about a very similar problem and I recently found the solution.
In my case I was searching full-text fields for "#username", but using CONTAINS(body, '"#username"') returned plain "username" as well. I wanted it to strictly match with the # sign.
I could use LIKE '%#username%', but the query took over a minute, which was unacceptable, so I kept looking.
Some people in a chat room suggested using both CONTAINS and LIKE. So:
SELECT TOP 25 * FROM [table] WHERE
CONTAINS(body, '"#username"') AND body LIKE '%#username%';
This worked perfectly for me because the CONTAINS pulls both username and #username records, and then the LIKE keeps only the ones with the # sign. Queries take 2-3 seconds now.
I know this is an old question but I came across it in my searching so having the answer I thought I would post it. I hope this helps.
Contains(*,'"test"') will only match full words of "test" as you expect.
Contains(*,'" test "') same as above
Contains(*,'"*test*"') will actually do a PREFIX ONLY search, basically strips out any special characters at the start of word and only uses the 2nd *.
You cannot do POSTFIX searches using full text search.
My concern lies with the Contains(*) part: this will search every full-text cataloged column in that row. Without seeing the data it is hard to tell, but my guess is that another column in the row you think is bad is actually matching on "test" somewhere.
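To narrow down which column is matching, one option is to query a single column at a time instead of *. The sketch below assumes a hypothetical table dbo.Documents with columns id, title and body, each covered by a full-text index; none of these names come from the question.

```sql
-- Hypothetical table and columns; requires an existing full-text index.

-- Exact word, restricted to one column to see where the hit comes from.
SELECT id FROM dbo.Documents WHERE CONTAINS(body,  '"test"');
SELECT id FROM dbo.Documents WHERE CONTAINS(title, '"test"');

-- Prefix term: intended to match words that start with "test".
SELECT id FROM dbo.Documents WHERE CONTAINS(body, '"test*"');
```

If a row with "numerictest" in body only shows up in the title query, the exact-word search is behaving as described and the match is simply coming from another column.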

How to split a non delimited string in T-SQL

I need to split a string value that has no delimiter. I work in banking and I am selecting a GL account number and need to separate the account number from the account branch number. The issue is that both values are passed as one long string: 10 digits for the account number and 4 for the account branch. For example, 01234567891234 needs to be changed to 0123456789.1234.
Everything I find says to use CHARINDEX or SUBSTRING. From my understanding, both require a character to search for. If anyone can provide another function and some example code, that would be great. Thanks.
You can do something simple like
left(str, 10) + '.' + right(str, 4)
if you know it'll always be a 14 character string
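As a small sketch with the sample value from the question (the variable name @gl is made up), this shows the LEFT/RIGHT approach next to SUBSTRING, which works purely by position and does not need a character to search for:

```sql
DECLARE @gl varchar(14) = '01234567891234';

SELECT LEFT(@gl, 10) + '.' + RIGHT(@gl, 4) AS formatted,      -- 0123456789.1234
       SUBSTRING(@gl, 1, 10)               AS account_number, -- 0123456789
       SUBSTRING(@gl, 11, 4)               AS branch_number;  -- 1234
```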
You could also use STUFF function as below:
declare @accNo varchar(14) = '01234567891234'
select stuff(@accNo,11,0,'.')
SQL Fiddle
