Special meaning of <> and anonymous arrays inside regex in Perl 6 - arrays

Outside a regex, <> behaves more or less like single quotes. My shallow understanding is that, inside a regex, <> allows evaluation and interpolation of code:
# Outside regex, <> acts like single quotes:
> my $x = <{"one"}>
{"one"}
> $x.WHAT
(Str)
# Inside regex, <> evaluates and interpolates:
> my $b="one";
one
> say "zonez" ~~ m/ <{$b}> / # Evaluates {$b} then quotes: m/ one /
「one」
> say "zonez" ~~ m/ <$b> / # Interpolates and quotes without {}
「one」
Because an array variable is allowed inside a regex, I suspect that the Perl 6 regex engine expands the array into ORs when there is <> inside a regex surrounding the array.
I also suspect that in a user-defined character class, <[ ]>, the array [] inside <> more or less works like an anonymous array in a way similar to @a below, and the contents of the array (chars in the character class) are expanded into ORs.
> my @a = $b, "two";
[one two]
> so "zonez" ~~ m/ @a /;
True
> say "ztwoz" ~~ m/ <{[$b, "two"]}> / # {} to eval array, then <> quotes
「two」
> say "ztwoz" ~~ m/ <{@a}> /
「two」
> say "ztwoz" ~~ m/ <@a> /
「two」
> say "ztwoz" ~~ m/ one || two / # expands @a into ORs: [||] @a;
# [||] is a reduction operator;
「two」
And char class expansion:
> say "ztwoz" ~~ m/ <[onetw]> / # like [||] [<o n e t w>];
「t」
> say "ztwoz" ~~ m/ o|n|e|t|w /
「t」
> my @m = < o n e t w >
[o n e t w]
> say "ztwoz" ~~ m/ @m /
「t」
I have not looked into the Rakudo source code, and my understanding is limited. I have not been able to construct anonymous arrays inside regex to prove that <> indeed constructs arrays inside regex.
So, is <> inside regex something special? Or should I study the Rakudo source code (which I really try not to do at this time)?

Outside of a regex <> acts like qw<>, that is it quotes and splits on spaces.
say <a b c>.perl;
# ("a", "b", "c")
It can be expanded to
q :w 'a b c'
Q :q :w 'a b c'
Q :single :words 'a b c'
I recommend reading Language: Quoting Constructs, as this is a broader topic than can be discussed here.
This has almost nothing to do with what <> does inside of a regex.
The way <> is used in regexes would not be useful in ordinary Perl 6 code, and qw is not that useful in regexes. So these characters are doing double duty, mainly because there are very few non-letter and non-number characters in ASCII. The only time <> acts like qw inside a regex is when the character immediately following < is a whitespace character.
Inside of a regex it can be thought of as injecting some code into the regex; sort of like a macro, or a function call.
/<{ split ';', 'a;b;c' }>/;
/ [ "a" | "b" | "c" ] /;
( Note that | tries all alternations at the same time while || tries the leftmost one first, followed by the next one, etc. That is || basically works the way | does in Perl 5 and PCRE. )
/<:Ll - [abc]>/
/ [d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z] / # plus other lowercase letters
/ <@a> /
/ [ "one" | "two" ] /
Note that / @a / also dissolves into the same construct.
/ <?{ 1 > 0 }> /
# null regex always succeeds
/ [ '' ] /
/ <?{ 1 == 0 }> /
# try to match a character after the end of the string
# (can never succeed)
/ [ $ : . ] /
Those last two aren't quite accurate, but may be a useful way to think about it.
It is also used to call regex "methods".
grammar Foo {
token TOP { <alpha> } # calls Grammar.alpha and uses it at that point
}
As you may have noticed, I always surrounded the substitution with [], because it always acts like an independent subexpression.
Technically none of these are implemented in the way I've shown them, it is just a theoretical model that is easier to explain.
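To make the "array dissolves into an alternation" model concrete outside Perl 6, here is a minimal sketch in Python using only the standard re module. It is an analogy, not how Rakudo works, and the helper name alternation is made up for illustration. Python's | tries alternatives left to right (like Perl 6's ||), so the sketch optionally sorts candidates longest-first to roughly approximate the behaviour of Perl 6's |.

```python
import re

def alternation(words, longest_first=True):
    """Build a non-capturing group that matches any one of the literal words."""
    if longest_first:
        # Python tries alternatives left to right, so put longer words first
        # to approximate longest-match semantics.
        words = sorted(words, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"

b = "one"
a = [b, "two"]

# The list is expanded into (?:one|two) before matching.
print(re.search(alternation(a), "ztwoz").group(0))  # two

# Leftmost-first versus longest-first on overlapping alternatives.
print(re.search(alternation(["a", "ab"], longest_first=False), "ab").group(0))  # a
print(re.search(alternation(["a", "ab"]), "ab").group(0))  # ab
```

The re.escape call matters here: it keeps each list element a literal string, which matches the quoted "one" | "two" form shown in the model above.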

Within regex <> are used for what I tend to call "generalized assertions". Whenever you match something with regex, you're making a series of assertions about what the string should look like. If all of the assertions are true, the entire regex matches. For example, / foo / asserts that the string "foo" appears within the string being matched; / f o* / asserts that the string should contain an "f" followed by zero or more "o", etc.
In any case, for generalized assertions, Rakudo Perl 6 uses the character immediately after the < to determine what kind of assertion is being made. If the character after < is alphabetic (e.g. <foo>), it is taken to mean a named subrule; if the character after < is {, it's an assertion that contains code that is to be interpolated into the pattern (e.g., <{ gen_some_regex(); }>); if the character after < is a [, it's a character class; if the character after < is a :, then it expects to match a Unicode property (e.g., <:Letter>); if the character after < is a ? or !, you get positive and negative zero-width assertions respectively; etc.
And finally, outside of regex, <> act as "quote words". If the character immediately following the < is a whitespace character, within regex, it will also act as a kind of "quote words":
> "I'm a bartender" ~~ / < foo bar > /
「bar」
This is matched as if it were an alternation, that is < foo bar > will match one of foo or bar as if you'd written foo | bar.
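The dispatch on the first character after < can be summarised as a small lookup. The sketch below is in Python and only restates the cases listed in this answer; the function name classify_assertion is hypothetical, and this is not how Rakudo implements the parsing.

```python
def classify_assertion(inner):
    """Classify the text between < and > by its first character."""
    if not inner:
        return "empty"
    first = inner[0]
    if first.isspace():
        return "quote words (alternation of the words)"
    if first.isalpha():
        return "named subrule"
    kinds = {
        "{": "code interpolated into the pattern",
        "[": "character class",
        ":": "Unicode property",
        "?": "positive zero-width assertion",
        "!": "negative zero-width assertion",
    }
    # Anything not listed in the answer is reported as unknown to this sketch.
    return kinds.get(first, "not covered by this sketch")

examples = ["foo", "{ gen_some_regex(); }", "[onetw]", ":Letter", "?{ 1 > 0 }", " foo bar "]
for inner in examples:
    print(f"<{inner}> -> {classify_assertion(inner)}")
```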

Related

ANTLR4 - What is the correct way to define an array type?

I am creating my own grammar, and so far I had only primitive types. However, now I would like to add a new type by reference, arrays, with a format similar to Java or C#, but I run into the problem that I am not able to make it work with ANTLR.
The code example I'm working with would be similar to this:
VariableDefinition
{
id1: string;
anotherId: bool;
arrayVariable: string[5];
anotherArray: bool[6];
}
MyMethod()
{
temp: string[3];
temp2: string;
temp2 = "Some text";
temp[0] = temp2;
temp2 = temp[0];
}
The Lexer contains:
BOOL: 'bool';
STRING: 'string';
fragment DIGIT: [0-9];
fragment LETTER: [[a-zA-Z\u0080-\u00FF_];
fragment ESCAPE : '\\"' | '\\\\' ; // Escape 2-char sequences: \" and \\
LITERAL_INT: DIGIT+;
LITERAL_STRING: '"' (ESCAPE|.)*? '"' ;
OPEN_BRACKET: '[';
CLOSE_BRACKET: ']';
COLON: ':';
SEMICOLON: ';';
ID: LETTER (LETTER|DIGIT)*;
And my Parser would be an extension of this (there are more rules and other expressions but I don't think that there is a relation with this scenario):
global_
: GLOBAL '{' globalVariables+=variableDefinition* '}'
;
variableDefinition
: name=ID ':' type=type_ ';'
;
type_
: referenceType # TypeReference
| primitiveType # TypePrimitive
;
primitiveType
: BOOL # TypeBool
| CHAR # TypeChar
| DOUBLE # TypeDouble
| INT # TypeInteger
| STRING # TypeString
;
referenceType
: primitiveType '[' LITERAL_INT ']' # TypeArray
;
expression_
: identifier=expression_ '[' position=expression_ ']' # AccessArrayExpression
| left=expression_ operator=( '*' | '/' | '%') right=expression_ # ArithmeticExpression
| left=expression_ operator=( '+' | '-' ) right=expression_ # ArithmeticExpression
| value=ID # LiteralID
I've tried:
Put spaces between the different lexemes in the example programme in case there was a problem with the lexer. (nothing changed).
Creating one rule in type_ called arrayType, and referencing type_ in arrayType (fails due to left recursion; ANTLR shows the following error: The following sets of rules are mutually left-recursive [type_, arrayType]).
Put primitive and reference types into a single rule.
type_
: BOOL # TypeBool
| CHAR # TypeChar
| DOUBLE # TypeDouble
| INT # TypeInteger
| STRING # TypeString
| type_ '[' LITERAL_INT ']' # TypeArray
;
Results:
· With whitespace separating the array (temp: string [5] ;).
line 23:25 missing ';' at '[5'
line 23:27 mismatched input ']' expecting {'[', ';'}
· Without whitespace (temp: string[5];).
line 23:18 mismatched input 'string[5' expecting {BOOL, 'char', 'double', INT, 'string'}
line 23:26 mismatched input ']' expecting ':'
EDIT 1: This is what the parse tree looks like when parsing the example I gave:
[image: parse tree inspector]
fragment LETTER: [[a-zA-Z\u0080-\u00FF_];
You're allowing [ as a letter (and thus as a character in identifiers), so in string[5], string[5 is interpreted as an identifier, which makes the parser think the subsequent ] has no matching [. Similarly in string [5], [5 is interpreted as an identifier, which makes the parser see two consecutive identifiers, which is also not allowed.
To fix this you should remove the [ from LETTER.
As a general tip, when getting parse errors that you don't understand, you should try to look at which tokens are being generated and whether they match what you expect.
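To illustrate what looking at the tokens reveals, here is a rough Python model of a longest-match lexer. It is not ANTLR, only a sketch: the rule names loosely mirror the question's lexer, the \u0080-\u00FF range is left out for brevity, and tokenize, BUGGY and FIXED are names invented for this example.

```python
import re

def tokenize(text, letter):
    """Tokenize with longest-match-wins; earlier rules win ties, whitespace is skipped."""
    rules = [
        ("ID", re.compile(f"[{letter}][{letter}0-9]*")),
        ("INT", re.compile(r"[0-9]+")),
        ("PUNCT", re.compile(r"[\[\]:;]")),
        ("WS", re.compile(r"\s+")),
    ]
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, rx in rules:
            m = rx.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"unexpected character at position {pos}")
        name, m = best
        if name != "WS":
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens

BUGGY = r"\[a-zA-Z_"  # '[' counts as a letter, as in the question's LETTER fragment
FIXED = "a-zA-Z_"     # '[' removed

print(tokenize("temp: string[5];", BUGGY))
# [('ID', 'temp'), ('PUNCT', ':'), ('ID', 'string[5'), ('PUNCT', ']'), ('PUNCT', ';')]
print(tokenize("temp: string[5];", FIXED))
# [('ID', 'temp'), ('PUNCT', ':'), ('ID', 'string'), ('PUNCT', '['), ('INT', '5'), ('PUNCT', ']'), ('PUNCT', ';')]
```

With the buggy class the token stream contains 'string[5', the same text that appears in the parser's error message, which is why dumping the tokens points straight at the lexer rule.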
It's common for languages that want to be flexible with whitespace to have a rule, something like this:
WS: [ \t\r\n]+ -> skip; // or channel(HIDDEN)
It should address your problem.
This shuttles Whitespace off to the side so you don't have to be concerned with it in your parser rules.
Without that sort of approach, you'd still need to define a whitespace rule (same pattern as above), but if you don't skip it (or send it to the HIDDEN channel), you'll have to include it everywhere you want to allow whitespace by inserting a WS?. Clearly this has the potential to become quite tedious (and adds a lot of "noise" to both your grammar and the resulting parse trees).

How to define a parser rule for an empty array in Antlr 4.0

I have the following set of rules in my Antlr grammar file:
//myidl.g4
grammar myidl;
integerLiteral:
value = DecimalLiteral # decimalNumberExpr
| value = HexadecimalLiteral # hexadecimalNumberExpr
;
DecimalLiteral: DIGIT+;
HexadecimalLiteral: ('0x' | '0X') HEXADECIMALDIGIT+;
fragment DIGIT: [0-9];
fragment HEXADECIMALDIGIT: [0-9a-fA-F];
array:
'[' ( integerLiteral ( ',' integerLiteral )* )* ']' // how to allow empty arrays like "[]"?
;
The resulting parser works fine for arrays with elements, for example "[0x00]".
But when defining an empty array "[]", I get the error:
no viable alternative at input '[]'
The funny thing is that when defining the empty array with a space, like "[ ]", the parser accepts it without error. Can somebody tell me what's wrong with my rules, and how to adjust them to allow an empty array definition without spaces, "[]"?
I use ANTLR Parser Generator Version 4.9.2
EDIT:
It turns out that I was using an old version of the parser due to configuration issue. |=( The above rules work just fine.
You probably have defined a token in your lexer grammar that matches []:
SOME_RULE
: '[]'
;
Remove that rule.
The grammar copied above looks correct to me.
As written, it should match [] but not [ ] (with a space).
As already suggested, check the tokens, and if you have imported another lexer, check that it does not define a token like '[]'.
The following grammar:
grammar myidl;
integerLiteral: DecimalLiteral | HexadecimalLiteral;
DecimalLiteral: DIGIT+;
HexadecimalLiteral: ('0x' | '0X') HEXADECIMALDIGIT+;
fragment DIGIT: [0-9];
fragment HEXADECIMALDIGIT: [0-9a-fA-F];
array:
'[' ( integerLiteral ( ',' integerLiteral )* )* ']';
WS : [ \t] -> skip ;
matches:
[0x00]
[]
[ ]

Improve code with checking element in array is digit

I want to check that each element in a String is a number made of digits. First I split the String into an Array using the regex [ ,]+, and then I check each element with forall and isDigit.
object Test extends App {
val st = "1, 434, 634, 8"
st.split("[ ,]+") match {
case arr if !arr.forall(_.forall(_.isDigit)) => println("not an array")
case arr if arr.isEmpty => println("bad delimiter")
case _ => println("success")
}
}
How can I improve this code, in particular the !arr.forall(_.forall(_.isDigit)) check?
Use matches that requires the string to fully match the pattern:
st.matches("""\d+(?:\s*,\s*\d+)*""")
See the Scala demo and the regex demo.
Details
In a triple quoted string literal, there is no need to double escape backslashes that are part of regex escapes
Anchors - ^ and $ - are implicit when the pattern is used with .matches
The regex means 1+ digits followed by 0 or more repetitions of a comma (optionally surrounded by whitespace) and then 1+ digits.
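For readers who want to try the pattern outside Scala, the same full-match idea can be sketched in Python, where re.fullmatch plays the role of .matches (anchors are implicit). The function name is_digit_list is made up for this illustration.

```python
import re

# Same pattern as above: digits, then zero or more ", digits" groups.
DIGIT_LIST = re.compile(r"\d+(?:\s*,\s*\d+)*")

def is_digit_list(s):
    """Return True only if the whole string is a comma-separated list of digit runs."""
    return DIGIT_LIST.fullmatch(s) is not None

print(is_digit_list("1, 434, 634, 8"))      # True
print(is_digit_list("1,434 , 634 , 8"))     # True
print(is_digit_list("1,,,434 , 634 2 , "))  # False
```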
I think it can be simplified while also making it a bit more robust.
val st = "1,434 , 634 , 8" //a little messier but still valid
st.split(",").forall(" *\\d+ *".r.matches) //res0: Boolean = true
I'm assuming strings like "1,,,434 , 634 2 , " should fail.
The regex can be put in a variable so that it is compiled only once.
val digits = " *\\d+ *".r
st.split(",").forall(digits.matches)

Multiple line strings in Apache Zeppelin

I have a very long string that must be broken into multiple lines. How can I do that in zeppelin?
The error is error: missing argument list for method + in class String:
Here is the more complete error message:
<console>:14: error: missing argument list for method + in class String
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `$plus _` or `$plus(_)` instead of `$plus`.
val q = "select count(distinct productId),count(distinct date),count(distinct instock_inStockPercent), count(distinct instock_totalOnHand)," +
In Scala (using Apache Zeppelin as well as otherwise), you can write expressions covering multiple lines by wrapping them in parentheses:
val text = ("line 1"
+ "line 2")
Using parentheses
As Theus mentioned, one way is to use parentheses.
val text = ("line 1" +
"line 2")
In fact, any expression that spans several lines can be wrapped in parentheses so that it is parsed as one expression. For example:
(object.function1()
.function2())
Using """
For a multiline string, we can use """, like this:
val s = """line 1
line2
line3"""
Any leading spaces before line2 and line3 will be included. If we don't want the leading spaces, we can write it like this:
val s = """line 1
|line2
|line3""".stripMargin
Or use a different margin character:
val s = """line 1
$line2
$line3""".stripMargin('$')
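As a rough illustration of what stripMargin does to each line, here is a small Python re-implementation. It is only a sketch of the behaviour described above (drop leading blanks up to and including the margin character, leave lines without a margin alone); strip_margin is a hypothetical helper, not part of any library.

```python
import re

def strip_margin(text, margin="|"):
    """Remove leading spaces/tabs plus the margin character from each line."""
    pattern = re.compile(r"^[ \t]*" + re.escape(margin), flags=re.MULTILINE)
    return pattern.sub("", text)

s = strip_margin("""line 1
    |line2
    |line3""")
print(s)
# line 1
# line2
# line3

# A different margin character, mirroring stripMargin('$').
print(strip_margin("line 1\n    $line2", margin="$"))
# line 1
# line2
```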

SML Case matching multiple conditions [duplicate]

In SML, is it possible for you to have multiple patterns in one case statement?
For example, I have 4 arithmetic operators expressed as strings, "+", "-", "*", "/", and I want to print "PLUS MINUS" if it is "+" or "-" and "MULT DIV" if it is "*" or "/".
TL;DR: Is there some way I can simplify the following to use fewer cases?
case str of
"+" => print("PLUS MINUS")
| "-" => print("PLUS MINUS")
| "*" => print("MULT DIV")
| "/" => print("MULT DIV")
Given that you've tagged your question with the smlnj tag, then yes, SML/NJ supports this kind of pattern. They call them or-patterns, and they look like this:
case str of
("+" | "-") => print "PLUS MINUS"
| ("*" | "/") => print "MULT DIV"
Notice the parentheses.
The master branch of MLton supports it too, as part of their Successor ML effort, but you'll have to compile MLton yourself.
val str = "+"
val _ =
case str of
"+" | "-" => print "PLUS MINUS"
| "*" | "/" => print "MULT DIV"
Note that MLton does not require parentheses. Now compile it using this command (unlike SML/NJ, you have to enable this feature explicitly in MLton):
mlton -default-ann 'allowOrPats true' or-patterns.sml
In Standard ML, no. In other dialects of ML, such as OCaml, yes. You may in some cases consider splitting pattern matching up into separate cases/functions, or skip pattern matching in favor of a shorter catch-all expression, e.g.
if str = "+" orelse str = "-" then "PLUS MINUS" else
if str = "*" orelse str = "/" then "MULT DIV" else ...
Expanding upon Ionuț's example, you can even use datatypes whose constructors carry other types, but the types (and the identifiers bound) in each alternative must match:
datatype mytype = COST of int | QUANTITY of int | PERSON of string | PET of string;
case item of
(COST n|QUANTITY n) => print (Int.toString n)
|(PERSON name|PET name) => print name
If the types or names don't match, it will get rejected:
case item of
(COST n|PERSON n) => (* fails because COST is int and PERSON is string *)
(COST n|QUANTITY q) => (* fails because the identifiers are different *)
And these patterns work in function definitions as well:
fun myfun (COST n|QUANTITY n) = print (Int.toString n)
|myfun (PERSON name|PET name) = print name
;
