Does Snowflake support positive lookbehind in a regex? - snowflake-cloud-data-platform

I want to use a positive lookbehind as part of my regexp_substr expression.
I have the below:
regexp_substr(My_Data, '(?<=id:).*(?=;)', 1, 1)
which gives me the below error:
Invalid regular expression: '(?<=id:).*(?=;)', no argument for repetition operator: ?
I'm trying to split key value pairs where I have
id:1234;

Look-behind is not supported in Snowflake's regexp.
However, you can use regular regexp groups for what you're trying to achieve:
select regexp_substr('Something,id=12345;Somethng', 'id=([^;]+);',1, 1, 'e');
-----------------------------------------------------------------------+
REGEXP_SUBSTR('SOMETHING,ID=12345;SOMETHNG', 'ID=([^;]+);',1, 1, 'E') |
-----------------------------------------------------------------------+
12345 |
-----------------------------------------------------------------------+
Note the 'e' argument, which makes REGEXP_SUBSTR return the captured group rather than the whole match; see the documentation.
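For the asker's id:1234; format, the same capture-group idea can be sketched outside Snowflake. The Python below is only an illustration of the technique (Python's re module does support lookbehind, so it says nothing about Snowflake's engine), and extract_id is a hypothetical helper name, not anything from Snowflake.

```python
import re

def extract_id(text):
    """Return the value between 'id:' and the next ';', or None if absent."""
    # The parentheses capture only the value. The surrounding 'id:' and ';'
    # are matched but not returned, which is the job the lookarounds had.
    match = re.search(r"id:([^;]+);", text)
    return match.group(1) if match else None

print(extract_id("name:abc;id:1234;"))  # prints 1234
```

In Snowflake the equivalent would presumably be the pattern 'id:([^;]+);' with the 'e' parameter, as in the answer above.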

Related

Snowflake REGEX to identify if column contains any digit

I currently have select REGEXP_LIKE(col, '[0-9]+') which seems to return True only if all the characters in the string are numeric.
For example, it returns True for 12345 but False for something like 100 Apple St.
What is the necessary regex pattern to return True in both examples above?
REGEXP_LIKE implicitly anchors the pattern at both ends, so '[0-9]+' has to match the entire string. To check whether a column contains any digit, wrap the digit class in .* so that any number of characters may appear before or after the digit(s):
SELECT REGEXP_LIKE(col, '.*[0-9].*')
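The difference between a whole-string match and a "contains" match can be reproduced in Python, as a rough analogy only. Here re.fullmatch stands in for REGEXP_LIKE's implicit anchoring; like_anchored is a made-up helper name.

```python
import re

def like_anchored(pattern, value):
    """Mimic an implicitly anchored match: the whole value must match."""
    return re.fullmatch(pattern, value) is not None

# '[0-9]+' only succeeds when every character is a digit.
print(like_anchored(r"[0-9]+", "12345"))            # True
print(like_anchored(r"[0-9]+", "100 Apple St"))     # False

# Surrounding the digit class with .* allows anything around the digit.
print(like_anchored(r".*[0-9].*", "100 Apple St"))  # True
print(like_anchored(r".*[0-9].*", "Apple St"))      # False
```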

How do comparison operators like '<' and '>' work for string values?

I have seen that greater than and less than operators can be applied on string values in SQL Server but I haven't figured out yet what logic is being applied here behind the scenes to perform the comparison.
For example, the string value 'Gabriel' is greater than 'Cassandra':
SELECT 1 WHERE 'Gabriel' > 'Cassandra'
The query above returns 1, whereas an empty result set is returned if the comparison operator is changed to '<'.
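The comparison is alphabetical (character by character, according to the collation in effect), so 'B' is greater than 'A', 'C' is greater than 'B', and so on. 'Gabriel' is greater than 'Cassandra' because 'G' sorts after 'C'.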
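The character-by-character idea can be seen in a small Python sketch. This is an analogy, not SQL Server's implementation: Python compares strings by Unicode code point, while SQL Server follows the collation, so results may differ for case or accented characters. first_difference is a hypothetical helper.

```python
def first_difference(a, b):
    """Return the first pair of characters at which a and b differ, or None."""
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            return ch_a, ch_b
    return None

# The outcome is decided at the first differing position: 'G' versus 'C'.
print(first_difference("Gabriel", "Cassandra"))  # ('G', 'C')
print("Gabriel" > "Cassandra")                   # True
```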

Regular expression unexpected pattern matching

I am trying to create a syntax parser using C-Bison and Flex. In Flex I have a regular expression which matches integers based on the following:
Must start with any digit in range 1-9 and followed by any number of digits in range 0-9. (ex. Correct: 1,12,11024 | Incorrect: 012)
Can be signed (ex. +2,-5)
The number 0 must not be followed by any digit (0-9) and must not be signed. (ex. Correct: 0 | Incorrect: 012,+0,-0)
Here is the regex I have created to perform the matching:
[^+-]0[^0-9]|[+-]?[1-9][0-9]*
Here is the expression I am testing:
(1 + 1 + 10)
The matches:
1
1
10)
And here is my question, why does it match '10)'?
The reason I used the above expression, instead of the much simpler one, (0|[+-]?[1-9][0-9]*), is due to the inability of the parser to recognise incorrect expressions such as 012.
The problem seems to occur only when the digit '0' immediately precedes the ')'. However, if the '0' is preceded by two or more digits (ex. 100), then the ')' is not matched.
I know for a fact if I remove [^0-9] from the regex it doesn't match the ')'.
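It matches 10) because 1 matches [^+-], 0 matches 0 and ) matches [^0-9]. Flex always prefers the longest match, and that alternative consumes three characters where [+-]?[1-9][0-9]* consumes only two.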
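To see why 10) wins but 100) does not, here is a rough Python simulation of the longest-match rule. It is only a sketch: Flex compiles its rules into a DFA, whereas this tries each alternative separately and keeps the longest result. longest_match is a hypothetical helper.

```python
import re

# The two alternatives of the original pattern, tried independently.
ALTERNATIVES = [
    re.compile(r"[^+-]0[^0-9]"),
    re.compile(r"[+-]?[1-9][0-9]*"),
]

def longest_match(text, pos):
    """Return the longest string any alternative matches starting at pos."""
    best = ""
    for pattern in ALTERNATIVES:
        m = pattern.match(text, pos)
        if m and len(m.group()) > len(best):
            best = m.group()
    return best

expr = "(1 + 1 + 10)"
print(repr(longest_match(expr, expr.index("10"))))    # '10)'

# With 100 the first alternative fails (the third character is a digit),
# so only the plain integer alternative matches.
expr2 = "(1 + 1 + 100)"
print(repr(longest_match(expr2, expr2.index("100")))) # '100'
```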
The reason I used the above expression, instead of the much simpler one, (0|[+-]?[1-9][0-9]*) is due to inability of the parser to recognise incorrect expressions such as 012.
How so? Using the above regex, 012 would be recognized as two tokens: 0 and 12. Would this not cause an error in your parser?
Admittedly, this would not produce a very good error message, so a better approach might be to just use [0-9]+ as the regex and then use the action to check for a leading zero. That way 012 would be a single token and the lexer could produce an error or warning about the leading zero (I'm assuming here that you actually want to disallow leading zeros - not use them for octal literals).
Instead of a check in the action, you could also keep your regex and then add another one for integers with a leading zero (like 0[0-9]+ { warn("Leading zero"); return INT; }), but I'd go with the check in the action since it's an easy check and it keeps the regex short and simple.
PS: If you make - and + part of the integer token, something like 2+3 will be seen as the integer 2, followed by the integer +3, rather than the integers 2 and 3 with a + token in between. Therefore it is generally a better idea to not make the sign a part of the integer token and instead allow prefix + and - operators in the parser.
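The "check in the action" approach could look roughly like this, sketched in Python rather than as a Flex action. check_integer and the warning text are made up for illustration; in Flex the same test would sit in the C action attached to the [0-9]+ rule.

```python
import re

INTEGER = re.compile(r"[0-9]+")

def check_integer(lexeme):
    """Return (value, warning) for a lexeme already matched by [0-9]+."""
    # A lone "0" is fine; any longer lexeme starting with '0' gets flagged.
    if len(lexeme) > 1 and lexeme[0] == "0":
        return int(lexeme), "leading zero"
    return int(lexeme), None

# 012 is now a single token, so the lexer can report it precisely.
for lexeme in INTEGER.findall("012 + 7 + 0"):
    print(lexeme, check_integer(lexeme))
```

Signs are deliberately left out of the token, in line with the PS above.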

Solr FunctionQuery: gte, lte, eq functions

Solr FunctionQuery has a DIV(x,y) function. I need y to be treated as equal to x when y=0.
In other words, I need to represent the following logic with FunctionQuery:
if y == 0, return 1 /* i.e. DIV(x,x) */
else, return DIV(x,y)
Somehow, from the Solr doc, I cannot find any comparison function, e.g. EQ(x, value), etc. for me to use.
Will anyone be able to give me a hint to construct my desired logic using FunctionQuery?
Thanks!
To clean up this question and log my final solution, thanks to Srikanth Venugopalan's comment:
"Actually you need to switch the arguments. exposure_count = 0 is interpreted as false. So your condition would be {!boost b=if(exposure_count,div(1,exposure_count),1)}"
It seems the Lucidworks documentation has a mistake. The FunctionQuery parser does not take comparison operators such as ==, at least that is what I found by looking into the source code. Also, the argument separator for the IF() function should be , (comma) and not ; (semicolon).
The official Solr wiki is correct.
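The intended semantics of if(y,div(x,y),1) can be modelled in a few lines of Python. This is just a model of the logic described above (a zero value counts as false), not Solr code; safe_div is a hypothetical name.

```python
def safe_div(x, y):
    """Model of if(y, div(x, y), 1): a zero y is treated as false."""
    # When y is 0 the condition is false and the fallback 1 is returned,
    # which equals div(x, x) from the question.
    return x / y if y else 1

print(safe_div(6, 3))  # 2.0
print(safe_div(6, 0))  # 1
```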
For string terms this works:
if(termfreq(fieldname,"value"),2,1)
which yields 2 if "value" is contained in "fieldname" (termfreq will be >0)
You can use == for equality together with the if conditional, as below:
e.g. if(y == 0; 1; DIV(x,y))
See the example in the documentation:
if(color=="red"; 100; if(color=="green"; 50; 25))
This function checks the document field "color", and if it is "red" returns 100, if it is "green" returns 50, else returns 25.
Edit:
The Solr wiki entry for if uses , (comma) as the separator
e.g. if(exists(myField),100,0)

Posix regex capture group matching sequence

I have the following text string and regex pattern in a c program:
char text[] = "        identification     division. ";
char pattern[] = "^(.*)(identification *division)(.*)$";
Using regexec() library function, I got the following results:
String: identification division.
Pattern: ^(.*)(identification *division)(.*)$
Total number of subexpressions: 3
OK, pattern has matched ...
begin: 0, end: 37,match: identification division.
subexpression 1 begin: 0, end: 8, match:
subexpression 2 begin: 8, end: 35, match: identification division
subexpression 3 begin: 35, end: 37, match: .
I was wondering: since the regex engine matches in a greedy fashion and the first capture group (.*) matches any number of characters (except newline characters), why doesn't it match characters all the way to the end of the text string (up to '.'), as opposed to matching only the first 8 spaces?
Does each capture group have to be matched?
Are there any rules on how the capture group matches the text string?
Thanks.
Regexes are as greedy as possible, without being too greedy. Had the left group been as greedy as you expect, the group that matches "identification division" would have been unable to match, erroneously rejecting text which was clearly in the language.
Just as you said, if the greedy group (.*) had consumed the whole string, the rest of the regex wouldn't have anything to match which wouldn't make your regex match the string. So, yes, each capture group (and other pattern parts) needs to be matched. This is exactly what you specified in your regex.
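The reported offsets can be reproduced with Python's re module, as a sanity check rather than a statement about regexec() itself (Python uses a backtracking engine, POSIX specifies leftmost-longest, but for this pattern both agree). The input string is rebuilt from the offsets in the question, assuming 8 leading spaces and 5 spaces between the two words.

```python
import re

PATTERN = r"^(.*)(identification *division)(.*)$"

# 8 spaces + "identification" + 5 spaces + "division. " is 37 characters.
text = " " * 8 + "identification" + " " * 5 + "division. "

match = re.search(PATTERN, text)
# Group 1 gives back characters until group 2 is able to match.
print(match.span(1))  # (0, 8)
print(match.span(2))  # (8, 35)
print(match.span(3))  # (35, 37)
```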
Try the following string instead and run the code with both a reluctant and a greedy first group and you will see the difference.
char text[] = " identification division identification division. ";
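A quick way to see the difference is sketched below in Python, since POSIX extended regular expressions do not define reluctant quantifiers such as .*? and regexec() may not accept them. The string is taken as it appears above, with single spaces.

```python
import re

text = " identification division identification division. "

greedy = re.search(r"^(.*)(identification *division)(.*)$", text)
reluctant = re.search(r"^(.*?)(identification *division)(.*)$", text)

# Greedy: group 1 swallows the first occurrence, group 2 is the second one.
print(greedy.start(2))     # 25
# Reluctant: group 1 stops as early as possible, group 2 is the first one.
print(reluctant.start(2))  # 1
```

Both patterns match the whole string. Only the way the text is divided among the groups changes.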
