Regexp_extract using Google re2 - google-data-studio

I need to extract the following text string using google/re2:
Audiences » Affinity Categories » Food & Dining » Fast Food Cravers
In short, I need whatever comes after the third ».
I would appreciate any help or guidance. Many thanks!
best,
yg

The following regular expression captures the text after the 3rd »:
(?:[^»]+?»){3}\s*(.+?)\s*[\nÂ]
You can test it interactively here.
I haven't actually used re2 before so keep in mind possible syntax differences.
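As a quick cross-check outside Data Studio, the same idea can be sketched in Python. This is only an illustration: Python's re module is not RE2, although the constructs used here (a non-capturing group, a counted repetition, a lazy quantifier) are ones RE2 also accepts. The function name text_after_third_separator is hypothetical, and the pattern is anchored with ^ and $ instead of ending in a character class, on the assumption that each value is a single line.

```python
import re

# Skip three "segment »" prefixes, then capture the rest of the line, trimmed.
PATTERN = re.compile(r"^(?:[^»]+»){3}\s*(.+?)\s*$")


def text_after_third_separator(value):
    """Return the text after the third », or None if there are fewer than three."""
    match = PATTERN.search(value)
    return match.group(1) if match else None


print(text_after_third_separator(
    "Audiences » Affinity Categories » Food & Dining » Fast Food Cravers"
))
```

If a value contains more than three separators, everything after the third one is returned, including any later » characters.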

Related

Snowflake and Regular Expressions - issue when implementing known good expression in SF

I'm looking for some assistance in debugging a REGEXP_REPLACE() statement.
I have been using an online regular expressions editor to build expressions, and then the SF regexp_* functions to implement them. I've attempted to remain consistent with the SF regex implementation, but I'm seeing an inconsistency in the returned results that I'm hoping someone can explain :)
My intent is to replace commas within the text (excluding commas within double-quoted text) with a new delimiter (#^#).
Sample text string:
"Foreign Corporate Name Registration","99999","Valuation Research",,"Active Name",02/09/2020,"02/09/2020","NEVADA","UNITED STATES",,,"123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES","123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES",,,,,,,,,,,,
RegEx command and Substitution (working in regex101.com):
([("].*?["])*?(,)
\1#^#
regex101.com Result:
"Foreign Corporate Name Registration"#^#"99999"#^#"Valuation Research"#^##^#"Active Name"#^#02/09/2020#^#"02/09/2020"#^#"NEVADA"#^#"UNITED STATES"#^##^##^#"123 SOME STREET"#^##^#"MILWAUKEE"#^#"WI"#^#"53202"#^#"UNITED STATES"#^#"123 SOME STREET"#^##^#"MILWAUKEE"#^#"WI"#^#"53202"#^#"UNITED STATES"#^##^##^##^##^##^##^##^##^##^##^##^#
When I try and implement this same logic in SF using REGEXP_REPLACE(), I am using the following statement:
SELECT TOP 500
A.C1
,REGEXP_REPLACE((A."C1"),'([("].*?["])*?(,)','\\1#^#') AS BASE
FROM
"<Warehouse>"."<database>"."<table>" AS A
This statement returns the result for BASE:
"Foreign Corporate Name Registration","99999","Valuation Research",,"Active Name",02/09/2020,"02/09/2020","NEVADA","UNITED STATES",,,"123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES","123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES"#^##^##^##^##^##^##^##^##^##^##^##^#
As you can see when comparing the results, the SF result set is only replacing commas at the tail-end of the text.
Can anyone tell me why regex101.com and SF return different results for the same statement? Is my expression non-compliant with the SF implementation of RegEx, and if so, can you tell me why?
Many many thanks for your time and effort reading this far!
Happy Wednesday,
Casey.
Lazy matching with .*? is a Perl-style (PCRE) feature rather than part of POSIX regular expression syntax, and Snowflake does not support PCRE. To see this, in regex101.com, change your "flavor" to anything other than PCRE (PHP); you will see that your ([("].*?["])*?(,) regex no longer does what you expect.
I believe that this will work for your purposes:
REGEXP_REPLACE(A.C1,'("[^"]*")*,','\\1#^#')
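To see why the pattern above needs no lazy quantifier, here is a minimal Python sketch of the same substitution. It only emulates the regex logic and says nothing about how Snowflake itself evaluates the expression. The function name swap_delimiter and the shortened sample line are made up for illustration.

```python
import re

# Zero or more complete quoted fields, then the delimiter comma.
# [^"]* cannot run past a closing quote, so greedy matching is safe here.
FIELD_THEN_COMMA = re.compile(r'("[^"]*")*,')


def swap_delimiter(line, new_delim="#^#"):
    """Replace field-separating commas, leaving commas inside quotes alone."""
    # group(1) is None when the comma follows an empty or unquoted field.
    return FIELD_THEN_COMMA.sub(lambda m: (m.group(1) or "") + new_delim, line)


sample = '"A","99",,02/09/2020,"B, C",,'
print(swap_delimiter(sample))
```

The quoted field "B, C" keeps its inner comma because the whole quoted field is consumed by the group before the delimiter comma is matched.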

SQL Contains exact phrase

I am trying to implement a search mechanism with "CONTAINS()" on SQL Server 2014.
I've read here https://technet.microsoft.com/en-us/library/ms142538%28v=sql.105%29.aspx and in the book "Pro Full-Text Search in SQL Server 2008" that I need to use double quotes to search an exact phrase.
But if, for example, I use CONTAINS(*, '"test"'), I also receive results containing words like "numerictest". CONTAINS(*, '" test "') behaves the same. I've noticed that there are fewer results than when I search with CONTAINS(*, '*test*') for a prefix/suffix search, so there is definitely a difference between the searches.
I didn't expect the "numerictest" in the first statement. Is there an explanation for this behaviour?
I have been wracking my brain about a very similar problem and I recently found the solution.
In my case I was searching full-text fields for "#username", but using CONTAINS(body, '"#username"') returned plain "username" as well. I wanted it to match strictly with the # sign.
I could use LIKE '%#username%', but the query took over a minute, which was unacceptable, so I kept looking.
Some people in a chat room suggested using both CONTAINS and LIKE. So:
SELECT TOP 25 * FROM table WHERE
CONTAINS(body, '"#username"') AND body LIKE '%#username%';
This worked perfectly for me: CONTAINS pulls both the username and #username records, and then LIKE filters out the ones without the # sign. Queries take 2-3 seconds now.
I know this is an old question but I came across it in my searching so having the answer I thought I would post it. I hope this helps.
Contains(*,'"test"') will only match full words of "test" as you expect.
Contains(*,'" test "') same as above
Contains(*,'"*test*"') will actually do a PREFIX ONLY search, basically strips out any special characters at the start of word and only uses the 2nd *.
You cannot do POSTFIX searches using full text search.
My concern lies with the Contains(*) part, this will search for any full text cataloged items in that entire row. Without seeing the data it is hard to tell but my guess is that another column in that row you think is bad is actually matching on "test" somewhere.

text file of all titles / topic titles in Freebase

I need a .txt file that contains the title of every topic (item), each on its own line.
How can I do this or make this if I have already downloaded a freebase rdf dump?
If possible, I also need a separate text file with each topic's (item's) description, each description on its own single line.
How can I do that?
I would greatly appreciate it if someone could help me make either of these files from a Freebase rdf dump.
Thanks in Advance!
Filter the RDF dump on the predicate/property ns:type.object.name. If you only want a particular language, also filter by that language e.g. #en.
EDIT: I missed the second part about descriptions being desired as well. Here's a three-part regex which will get you all the lines with:
English names
English descriptions
a type of /common/topic
Combining the three is left as an exercise for the reader.
zegrep $'\tns:(((type\\.object\\.name|common\\.topic\\.description)\t.*#en)|type\\.object\\.type\tns:common\\.topic)\\.$' freebase-rdf-2013-06-30-00-00.gz | gzip > freebase-rdf-2013-06-30-00-00-names-descriptions.gz
It seems to have a performance issue that I'll have to look at. A simple grep of the entire file takes ~11 min on my laptop, but this has been running several times as long. I'll have to look at it later though...
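The same filter can be sketched in Python if a script is easier to adapt than the one-liner. This is an untested-at-scale illustration, not a benchmarked replacement: it assumes the dump is gzip-compressed, tab-separated, one triple per line ending in a period, exactly as the zegrep expression implies. The function names and the lang_tag parameter are hypothetical; the default tag simply mirrors the one written in the expression above.

```python
import gzip
import re


def build_filter(lang_tag="#en"):
    """Compile a pattern mirroring the zegrep expression for one language tag."""
    tag = re.escape(lang_tag)
    return re.compile(
        r"\tns:(?:(?:type\.object\.name|common\.topic\.description)\t.*" + tag
        + r"|type\.object\.type\tns:common\.topic)\.$"
    )


def matching_lines(lines, pattern):
    """Yield only the lines (without trailing newline) that match the pattern."""
    for line in lines:
        line = line.rstrip("\n")
        if pattern.search(line):
            yield line


def filter_dump(src_path, dst_path, lang_tag="#en"):
    """Stream a gzipped dump and write the matching lines to a new gzip file."""
    pattern = build_filter(lang_tag)
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in matching_lines(src, pattern):
            dst.write(line + "\n")
```

Because the file is read line by line, memory use stays flat regardless of dump size. Splitting names and descriptions into two output files would only need a second pattern per predicate.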

How can I extract human-readable text from a code snippet?

I need to write a T-SQL query against a text column where some of the values are html or asp.net coding but include normal human-readable text. For example:
{\colortbl ;\red31\green73\blue125;\red0\green0\blue0;} \viewkind4\uc1\pard\ltrpar\lang1033\f0\fs22 All invoices to be emailed to Jack Jack.Marsman#brampton.ca
I don't need the markup; I need the real text. In this case I want to get just: All invoices to be emailed to Jack Jack.Marsman#brampton.ca
Any suggestions on how to go about extracting the text without getting the coding?
The short answer is that there is no easy standard way to do this. I'd try creating a CLR function, since this kind of parsing is easier in C# or VB.NET.
You can also try using a regex to strip out everything that's not human readable.
Is all of your data in a format similar to the sample you have shown? If so, it comes down to calling substring several times…
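The regex idea can be sketched as follows. The answer suggests C# or VB.NET for a CLR function; Python is used here purely to illustrate the two patterns, which would carry over to .NET's Regex class with little change. This is a crude sketch, not an RTF parser: it assumes fragments shaped like the sample in the question, and it would erase everything if the whole value were wrapped in a single outer {...} group. The function name strip_rtf_markup is hypothetical.

```python
import re

# Innermost {...} group, e.g. the color table in the sample.
GROUP = re.compile(r"\{[^{}]*\}")
# Control words such as \pard, \fs22 or \lang1033, plus one trailing space.
CONTROL_WORD = re.compile(r"\\[A-Za-z]+-?\d* ?")


def strip_rtf_markup(text):
    """Drop brace groups and control words, then collapse whitespace."""
    # Remove groups from the inside out until no more are found.
    while True:
        stripped = GROUP.sub("", text)
        if stripped == text:
            break
        text = stripped
    text = CONTROL_WORD.sub("", text)
    return " ".join(text.split())


sample = (r"{\colortbl ;\red31\green73\blue125;\red0\green0\blue0;} "
          r"\viewkind4\uc1\pard\ltrpar\lang1033\f0\fs22 "
          r"All invoices to be emailed to Jack Jack.Marsman#brampton.ca")
print(strip_rtf_markup(sample))
```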

MS SQL - WHERE substring matches are phonetic?

I'd like to make a search feature that searches based on "sounds like" match.
For instance, let's say I have a company list that looks like this (let's say we live in Bizarro world too):
Acme
Already allusion cite LTD
All ready illusion site INC
Apart assent
Assent sight
(Or something similar with names... George or Jeorge ? "Yah-way", or "ye-hova" ?)
When someone searches for something that "sounds like" the soundex("site") == S230, they should see results for "Sight" also.
As most people who've used Soundex before already know, normal substring matches obviously don't do this.
I'm trying to work out in my head how to make a WHERE clause that can match based on this, so instead of a typical WHERE company LIKE input, I'd like to run a soundex. Obviously if I run soundex on the whole company name, I won't be able to do substring searching (for example, a user searching "ALL" will never match a soundex of "All ready"). Soundex split on each word might not be worthwhile either, so I'm not sure running all combinations of a soundex is a good idea... or even if that's going to be computationally feasible in a database with more than 1000 records.
Basically the interaction I want to have is when (in an office or something) Tom says to Sally "That name was something like Rebekkah Schwartzkopff" and it can be searched phonetically for a fuzzy match.
Obviously we're going to run into issues with non-English named companies because of Soundex, but I'm willing to compromise on this one.
I'd like to do this without adding anything to the database, or a stored procedure.
If SOUNDEX is a good beginning for what you are doing, you can use DIFFERENCE.
e.g.:
SELECT *
FROM Person
WHERE DIFFERENCE(Person.FirstName, 'George') >= 3
Note that the DIFFERENCE function returns the difference between the SOUNDEX values of two strings using a value of 0-4; 4 meaning the strings are pretty close to the same and 0 meaning they are completely different (kind of a backwards scale to me, but I suppose it works).
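To get a feel for what a per-word Soundex comparison would and would not match before committing to a query design, here is a small Python sketch. It implements the standard American Soundex rules, which may differ from SQL Server's SOUNDEX in edge cases, and it does not emulate DIFFERENCE at all. The helper names soundex and sounds_like are hypothetical.

```python
# Letter-to-digit table of standard American Soundex.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (code + "000")[:4]


def sounds_like(query, name):
    """True if any word of name shares a Soundex code with any word of query."""
    wanted = {soundex(w) for w in query.split()} - {""}
    return any(soundex(w) in wanted for w in name.split())


print(soundex("site"), soundex("sight"))
print(sounds_like("Rebeka", "Rebekkah Schwartzkopff"))
```

Under these rules "site" encodes as S300 and "sight" as S230, so an exact code comparison would not pair them. That is worth knowing before relying on Soundex equality alone for this kind of search.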
Very interesting question. I did a little poking around and found this:
http://www.codeproject.com/KB/database/dmetaphone4.aspx
I haven't tested it myself but it seems like it would be worth checking out.
It would require you to add something to the database, but I don't see how you can implement the functionality you want with built-in SQL Server functionality...
