I want to port the following regex from python:
HASH_REGEX = re.compile("([a-fA-F0-9]{32})")
if HASH_REGEX.match(target):
print "We have match"
to C with apr-utils apr_strmatch function:
pattern = apr_strmatch_precompile(pool, "([a-fA-F0-9]{32})", 0);
if (NULL != apr_strmatch(pattern, target, strlen(target))) {
printf("We have match!\n");
}
The problem is that I don't understand which regex syntax (or dialect) the apr-utils apr_strmatch function uses. Searching for documentation and examples turned up nothing.
Thanks in advance for your advice.
apr_strmatch doesn't do regular expression matching at all; it does ordinary substring search using the Boyer–Moore–Horspool algorithm (see source).
For RE matching in C, try PCRE.
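For this particular pattern a regex library may not be needed at all. The following is a minimal sketch in plain C, assuming the goal is the same as Python's re.match (anchored at the start of the string, nothing required after the 32 hex digits). The function name starts_with_hex32 is made up for illustration and is not part of APR.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if s begins with at least 32 hexadecimal digits, else 0.
   This mirrors Python's re.match(), which anchors at the start of the
   string but does not require the match to reach the end. */
int starts_with_hex32(const char *s)
{
    size_t i;

    if (s == NULL)
        return 0;
    for (i = 0; i < 32; i++) {
        /* isxdigit() expects an unsigned char value. The terminating
           '\0' fails the test, so a short string is rejected before
           the loop can read past its end. */
        if (!isxdigit((unsigned char)s[i]))
            return 0;
    }
    return 1;
}
```

A call such as starts_with_hex32(target) can then replace the apr_strmatch test in the if statement.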
I'm looking to extract text found exactly within an < and >, while also extracting things found between > and <.
For instance:
<html> would just return <html>
<title>This is a title</title> would return <title>, This is a title, </title>
This is a title would return This is a title
And finally <title>This is a weird use of < bracket</title> should return <title>, This is a weird use of < bracket, </title>. My current version recognises it as <title>, This is a weird use of, < bracket, </title>
I'd appreciate any snippets of code, or directions to head in to get to a solution.
tl;dr: grab substrings within <...> and >...< separately without being tripped up by a floating ...>... or ...<....
Edit: not using strtok anymore; I would appreciate any other help or similar problems you may know about. Anything to read would also be greatly beneficial. Note: we aren't trying to parse, simply lex the input string.
Can only use standard libraries for c.
Just trying to build a basic validator for a subset of valid HTML.
You can't, not even a basic one. You will have too many false positives and negatives. Here's a simple example.
<tag attribute=">" />
HTML has many features which do not allow simple parsing. It is...
Balanced, like <tag></tag> and also "quotes".
Nested, like <tag><tag></tag></tag>.
Escaped, like "escaped\"quote".
Has other languages embedded in it, like Javascript and CSS.
If this is an exercise in tokenization, you could define a very specific subset, but I'd suggest something simpler like JSON which has a well defined grammar. Those are typically parsed using a lexer and parser, but JSON is small enough to be written by hand.
My own solution so far, as suggested by @chqrlie:
void tokenize(char* stringPtr)
{
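char flag[2]; // Holds " " (outside a tag) or "<" (inside a tag); an uninitialized char* has no storage to copy into
strcpy(flag, " ");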
/*We build this up as we iterate the string.
Strtok was not suitable, build up tokens char by char */
char tempToken[tokenLength];
strcpy(tempToken, ""); // Init current token
// Traverse string catching stuff between <...> and >...< separately.
for(int i =0; i<strlen(stringPtr);i++)
{
if (stringPtr[i]=='<' )
{
if (strcmp(flag, " ")==0)
{
putToken(tempToken);
strcpy(tempToken,""); // Tag starting, everything before it is a token.
strcpy(flag,"<");
strcat(tempToken, flag);
}
else // Catches <...<
{
presentError(stringPtr);
}
}
else if (stringPtr[i]=='>')
{
if (strcmp(flag,"<")==0)
{
strcat(tempToken, ">");
strcpy(flag," ");
putToken(tempToken);
strcpy(tempToken,"");
}
else // Can't have a > unless we saw < already
{
presentError(stringPtr);
}
}
else // Manage non angle brackets
{
strncat(tempToken, &stringPtr[i],1 );
}
}
putToken(tempToken); // Catches a line ending in a value, not a tag
/* Notes
Floating <'s and >'s will be errored up
- Special case ....<...>..., which is incorrect
will cause floating tokens, can be identified
Unclosed tags i.e. </p will be tokenized verbatim,
thus can identify this mistake
Unopened tags i.e. p> will be errored
*/
}
Assume that presentError() terminates lexing.
Some improvements can be made, and I'm open to suggestions; this is a first working draft.
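One way to let a floating < stay inside a text token is to treat < as a tag opener only when a letter or / follows it. The sketch below works under that assumption; the names struct token, opens_tag and lex_tags are made up for illustration. It still does not handle a > inside a quoted attribute value, so it lexes only a restricted subset.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A token is a slice of the input: no copying, no fixed-size buffer. */
struct token {
    const char *start;
    size_t len;
};

/* Assumption of this sketch: '<' opens a tag only if a letter or '/'
   follows it. Any other '<' is kept as ordinary text. */
static int opens_tag(const char *p)
{
    return p[0] == '<' && (isalpha((unsigned char)p[1]) || p[1] == '/');
}

/* Splits input into <...> tokens and the text runs between them.
   Returns the number of tokens stored in out, or -1 if a tag is never
   closed or out has room for fewer tokens than the input contains. */
int lex_tags(const char *input, struct token *out, int max_tokens)
{
    const char *p = input;
    int count = 0;

    while (*p != '\0') {
        const char *start = p;

        if (opens_tag(p)) {
            const char *close = strchr(p, '>');
            if (close == NULL)
                return -1;          /* e.g. "</p" with no '>' */
            p = close + 1;          /* the token includes the '>' */
        } else {
            /* Consume text up to the next real tag opener. */
            do {
                p++;
            } while (*p != '\0' && !opens_tag(p));
        }
        if (count == max_tokens)
            return -1;
        out[count].start = start;
        out[count].len = (size_t)(p - start);
        count++;
    }
    return count;
}
```

Because the tokens point into the original string, a caller can print one with printf("%.*s\n", (int)t.len, t.start) without copying it first.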
I'm writing a C program that uses regular expressions to determine whether certain words read from a text file are valid or invalid. I've attached the code that does my regular expression check. I used an online regex checker, and according to it my regex is correct. I'm not sure why else it would be wrong.
The regex should accept a string in any of the formats AB1234, ABC1234, or ABCD1234.
//compile the regular expression
reti1 = regcomp(&regex1, "[A-Z]{2,4}\\d{4}", 0);
// does the actual regex test
status = regexec(&regex1,inputString,(size_t)0,NULL,0);
if (status==0)
printf("Matched (0 => Yes): %d\n\n",status);
else
printf(">>>NO MATCH<< \n\n");
You are using POSIX regular expressions, from regex.h. These don't support the syntax you are using, which is PCRE format, and is much more common these days. You are better off trying to use a library that will give you PCRE support. If you have to use POSIX expressions, I think this will work:
#include <regex.h>
#include <stdio.h>
int main(void) {
int status;
int reti1;
regex_t regex1;
char * inputString = "ABCD1234";
//compile the regular expression
reti1 = regcomp(&regex1, "^[[:upper:]]{2,4}[[:digit:]]{4}$", REG_EXTENDED);
// does the actual regex test
status = regexec(&regex1,inputString,(size_t)0,NULL,0);
if (status==0)
printf("Matched (0 => Yes): %d\n\n",status);
else
printf(">>>NO MATCH<< \n\n");
regfree (&regex1);
return 0;
}
(Note that my C is extremely rusty, so this code is probably horrible.)
I found some good resources on this answer.
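The same check can be wrapped in a small function that also handles a failed regcomp. This is only a sketch: the name is_valid_code is made up, and the pattern is the anchored POSIX extended one from the answer above.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s is 2 to 4 uppercase letters followed by exactly 4
   digits, 0 if it is not, and -1 if the pattern fails to compile. */
int is_valid_code(const char *s)
{
    regex_t re;
    int rc;

    /* REG_EXTENDED enables the {m,n} interval syntax used here;
       REG_NOSUB tells regcomp() that no match offsets are needed. */
    if (regcomp(&re, "^[[:upper:]]{2,4}[[:digit:]]{4}$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Compiling the pattern on every call is wasteful when many words are checked; in that case the regex_t could be compiled once and reused.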
Goal:
Find if a string contains a blank line. Whether it be '\n\n',
'\r\n\r\n', '\r\n\n', '\n\r\n'
Issues:
I don't think my current regex for finding '\n\n' is right. This is my first time really using regex beyond the simple use of * when removing files on the command line.
Is it possible to check for all of these cases (listed above) in one regex, or do I have to make 4 separate calls to compile_regex?
Code:
int checkForBlankLine(char *reader) {
regex_t r;
compile_regex(&r, "*\n\n");
match_regex(&r, reader);
return 0;
}
void compile_regex(regex_t *r, char *matchText) {
int status;
regcomp(r, matchText, 0);
}
int match_regex(regex_t *r, char *reader) {
regmatch_t match[1];
int nomatch = regexec(r, reader, 1, match, 0);
if (nomatch) {
printf("No matches.\n");
} else {
printf("MATCH!\n");
}
return 0;
}
Notes:
I only need to worry about finding one blank line, that's why my regmatch_t match[1] is only one item long
reader is the char array containing the text I am checking for a blank line.
I have seen other examples and tried to base the code off of those examples, but I still seem to be missing something.
Thank you kindly for the help/advice.
If anything needs to be clarified please let me know.
It seems that you have to compile the regex as extended:
regcomp(&re, "\r?\n\r?\n", REG_EXTENDED);
The first atom, \r? is probably unnecessary, because it doesn't add to the blank-line condition if you don't capture the result.
In the above, blank line really means empty line. If you want blank line to mean a line that has no characters except for white space, you can use:
regcomp(&re, "\r?\n[ \t]*\r?\n", REG_EXTENDED);
(I don't think you can use the space character pattern, \s here instead of [ \t], because that would include carriage return and new-line.)
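Put together, the empty-line test might look like the sketch below. It drops the leading \r? for the reason given above, and the function name has_empty_line is made up for illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text contains an empty line, 0 if not, -1 on a compile
   error. Covers \n\n, \r\n\r\n, \r\n\n and \n\r\n: each of them
   contains a \n, an optional \r, and another \n. */
int has_empty_line(const char *text)
{
    regex_t re;
    int rc;

    /* The C compiler turns \n and \r into the actual control characters
       before regcomp() sees the pattern, so no double backslashes. */
    if (regcomp(&re, "\n\r?\n", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Like the patterns above, this does not report an empty line at the very start of the text, because no newline precedes it.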
As others have already hinted, the "simple use of * in the command line" is not a regular expression. This wildcard matching is called file globbing and has different semantics.
Check what the * in a regex means. It's not like the wildcard "anything" on the command line. The * means that the previous component can appear any number of times, including zero. The wildcard in a regex is the dot (.). So if you want to match anything you can write .*, which means any character, any number of times.
So in your case you can do .*\n\n.* which would match anything that has \n\n.
Finally, you can use or in a regex and ( ) to group stuff. So you can do something like .*(\n\n|\r\n\r\n).* And that would match anything that has a \n\n or a \r\n\r\n.
Hope that helps.
Rather than looking only for \r or \n, why not look for anything that is not \r or \n?
Your regex would simply be
'[^\r\n]'
and a match result of false indicates a blank line to your specification.
I have some questions about ANTLR3 tree grammars with the C target.
I have almost finished my interpreter (functions, variables, boolean and math expressions all work) and I have kept the most difficult statements for the end (like if, switch, etc.).
1) I would like to interpret a simple loop statement:
repeat: ^(REPEAT DIGIT stmt);
I've seen many examples but nothing about the tree walker, only a topic here using the macros MARK() / REWIND(m) with @init / @after, which did not work for me (I get ANTLR errors: "unexpected node at offset 0"). How can I interpret this statement in C?
2) Same question with a simple if statement:
if: ^(IF condition stmt elseifstmt* elsestmt?);
The problem is to skip the statement if the condition is false and test the other elseif/else statements.
3) I have some statements which can stop the script (like "break" or "exit"). How can I interrupt the tree walker and skip the following tokens?
4) When a lexer or parser error is detected, ANTLR reports an error. I would like to produce my own error messages instead. How can I get the line number where the parser failed?
Ask me if you want more details.
Thank you very much (and I apologize for my poor English).
About the repeat statement, I think I've found a way to do it. On antlr.org, I found a complete interpreter for the C-- language, but written in Java.
Here is its while statement (a bit different, but the approach is the same):
whileStmt
scope{
Boolean breaked;
}
@after{
CommonTree stmtNode=(CommonTree)$whileStmt.start.getChild(1);
CommonTree exprNode=(CommonTree)$whileStmt.start.getChild(0);
int test;
$whileStmt::breaked=false;
while($whileStmt::breaked==false){
stream.push(stream.getNodeIndex(exprNode));
test=expr().value;
stream.pop();
if (test==0) break;
stream.push(stream.getNodeIndex(stmtNode));
stmt();
stream.pop();
}
}
: ^(WHILE . .)
;
I've tried to transform this code into C language:
repeat
scope {
int breaked;
int tours;
}
@after
{
int test;
pANTLR3_BASE_TREE repeatstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,1);
pANTLR3_BASE_TREE exprstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,0);
$repeat::breaked = 0;
test = 1;
while($repeat::breaked == 0)
{
TW_FOLLOWPUSH(exprstmt);
TW_FOLLOWPOP();
test++;
if(test == $repeat::tours)
break;
TW_FOLLOWPUSH(repeatstmt);
CTX->repeat(CTX);
TW_FOLLOWPOP();
}
}
: ^(REPEAT DIGIT stmt)
{
$repeat::tours = $DIGIT.text->toInt32($DIGIT.text);
}
But nothing happens (stmt is parsed just once).
Do you have any idea why?
About the custom error messages, I've found the macro GETLINE() in the lexer. It works when the tree walker fails, but ANTLR continues to display its own error messages for lexer or parser errors.
Thanks.
I'm trying to match the following items in the string pcode:
u followed by a 1 or 2 digit number
phaseu
phasep
x (surrounded by non-word chars)
y (surrounded by non-word chars)
z (surrounded by non-word chars)
I've tried to implement a regex match using the POSIX regex functions (shown below), but have two problems:
The compiled pattern seems to have no subpatterns (i.e. compiled.re_nsub == 0).
The pattern doesn't find matches in the string " u0", which it really should!
I'm confident that the regex string itself works (it works in Python and TextMate); my problem lies with the compilation, etc. in C. Any help with getting that working would be much appreciated.
Thanks in advance for your answers.
if(idata=tb_find(deftb,pdata)){
MESSAGE("Global variable!\n");
char pattern[80] = "((u[0-9]{1,2})|(phaseu)|(phasep)|[\\W]+([xyz])[\\W]+)";
MESSAGE("Pattern = \"%s\"\n",pattern);
regex_t compiled;
if(regcomp(&compiled, pattern, 0) == 0){
MESSAGE("Compiled regular expression \"%s\".\n", pattern);
}
int nsub = compiled.re_nsub;
MESSAGE("nsub = %d.\n",nsub);
regmatch_t matchptr[nsub];
int err;
if(err = regexec (&compiled, pcode, nsub, matchptr, 0)){
if(err == REG_NOMATCH){
MESSAGE("Regular expression did not match.\n");
}else if(err == REG_ESPACE){
MESSAGE("Ran out of memory.\n");
}
}
regfree(&compiled);
}
It seems you intend to use something resembling the "extended" POSIX regex syntax. POSIX defines two different regex syntaxes, a "basic" (read "obsolete") syntax and the "extended" syntax. To use the extended syntax, you need to add the REG_EXTENDED flag for regcomp:
...
if(regcomp(&compiled, pattern, REG_EXTENDED) == 0){
...
Without this flag, regcomp will use the "basic" regex syntax. There are some important differences, such as:
No support for the | operator
The brackets for submatches need to be escaped, \( and \)
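The effect of the flag on re_nsub can be checked directly. The helper below is a hedged sketch (the name count_subexpressions is made up): it compiles a pattern with the given flags and reports how many parenthesized subexpressions regcomp counted.

```c
#include <regex.h>
#include <stddef.h>

/* Compiles pattern with the given flags and returns the number of
   parenthesized subexpressions regcomp() found, or -1 on error. */
int count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    int nsub;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    nsub = (int)re.re_nsub;   /* filled in by a successful regcomp() */
    regfree(&re);
    return nsub;
}
```

With "(u[0-9])|(phaseu)" this should report 2 under REG_EXTENDED and 0 with flags 0, because in the basic syntax the unescaped parentheses are literal characters. That matches the re_nsub == 0 symptom in the question.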
It should also be noted that the POSIX extended regex syntax is not 1:1 compatible with Python's regex (don't know about TextMate). In particular, I'm afraid this part of your regexp does not work in POSIX, or at least is not portable:
[\\W]
The POSIX way to specify non-word characters (the equivalent of \W) is:
[^[:alnum:]_]
Your whole regexp for POSIX should then look like this in C:
char *pattern = "((u[0-9]{1,2})|(phaseu)|(phasep)|[^[:alnum:]_]+([xyz])[^[:alnum:]_]+)";
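One more detail matters when reading the results: regexec stores the whole match in element 0 and the groups after it, so the regmatch_t array needs re_nsub + 1 elements to hold every group. The sketch below uses a reduced pattern with only the u-number and phase alternatives, chosen here for illustration, and a made-up function name first_match.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Finds the first match in text and copies it into out.
   Returns the number of the group that matched (1 = u-number,
   2 = phaseu, 3 = phasep), 0 if nothing matched, -1 on error. */
int first_match(const char *text, char *out, size_t out_size)
{
    /* Reduced, illustrative pattern with three groups. */
    static const char pattern[] = "(u[0-9]{1,2})|(phaseu)|(phasep)";
    regex_t re;
    regmatch_t m[4];   /* m[0] = whole match, m[1..3] = the groups */
    size_t len;
    int rc, g;

    if (out == NULL || out_size == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 4, m, 0);
    regfree(&re);
    if (rc != 0)
        return rc == REG_NOMATCH ? 0 : -1;

    len = (size_t)(m[0].rm_eo - m[0].rm_so);
    if (len >= out_size)
        len = out_size - 1;             /* truncate to fit */
    memcpy(out, text + m[0].rm_so, len);
    out[len] = '\0';

    /* Groups that did not take part in the match have rm_so == -1. */
    for (g = 1; g <= 3; g++)
        if (m[g].rm_so != -1)
            return g;
    return -1;
}
```

In the question's code, the array is declared with re_nsub elements, which leaves no slot for the last group and becomes a zero-length array when re_nsub is 0.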