GCC regular expressions - c

How do I use regular expressions with GNU G++ / GCC for matching, searching and replacing substrings? For example, could you point me to a tutorial on regex_t and the related functions?
Over an hour of googling turned up no tutorial or manual I could understand.

I strongly suggest using the Boost C++ regex library. If you are developing serious C++, Boost is definitely something you must take into account.
The library supports both Perl and POSIX regular expression syntax. I personally prefer Perl regular expressions since I believe they are more intuitive and easier to get right.
http://www.boost.org/doc/libs/1_46_0/libs/regex/doc/html/boost_regex/syntax.html
But if you don't have any knowledge of this fine library, I suggest you start here:
http://www.boost.org/doc/libs/1_46_0/libs/regex/doc/html/index.html

I found the answer here:
#include <regex.h>
#include <stdio.h>
int main(void)
{
    int r;
    regex_t reg;
    /* \b and \w are GNU extensions; strict POSIX syntax does not define them */
    if ((r = regcomp(&reg, "\\b[A-Z]\\w*\\b", REG_NOSUB | REG_EXTENDED)) != 0)
    {
        char errbuf[1024];
        regerror(r, &reg, errbuf, sizeof(errbuf));
        printf("error: %s\n", errbuf);
        return 1;
    }
    const char *words[] = { "Moo", "foo", "OlOlo", "ZaooZA~!" };
    for (size_t i = 0; i < sizeof(words) / sizeof(words[0]); i++)
    {
        /* regexec() searches for the pattern anywhere in the string */
        if (regexec(&reg, words[i], 0, NULL, 0) == REG_NOMATCH)
            continue;
        printf("matched: %s\n", words[i]);
    }
    regfree(&reg); /* release the compiled pattern */
    return 0;
}
The code above prints:
matched: Moo
matched: OlOlo
matched: ZaooZA~!

Manuals should be easy enough to find: POSIX regular expression functions. If you don't understand that, I would really recommend trying to brush up on your C and C++ skills.
Note that actually replacing a substring once you have a match is a completely different problem, one that the regex functions won't help you with.

Related

Regular Expressions are not returning correct solution

I'm writing a C program that uses regular expressions to determine whether certain words read from a text file are valid or invalid. I've attached the code that does my regular expression check. I used an online regex checker, and according to it my regex is correct. I'm not sure why else it would be wrong.
The regex should accept a string in any of the formats AB1234, ABC1234 or ABCD1234.
//compile the regular expression
reti1 = regcomp(&regex1, "[A-Z]{2,4}\\d{4}", 0);
// does the actual regex test
status = regexec(&regex1,inputString,(size_t)0,NULL,0);
if (status==0)
printf("Matched (0 => Yes): %d\n\n",status);
else
printf(">>>NO MATCH<< \n\n");
You are using POSIX regular expressions, from regex.h. These don't support the syntax you are using, which is PCRE format, and is much more common these days. You are better off trying to use a library that will give you PCRE support. If you have to use POSIX expressions, I think this will work:
#include <regex.h>
#include <stdio.h>
int main(void) {
    int status;
    int reti1;
    regex_t regex1;
    const char *inputString = "ABCD1234";
    // compile the regular expression
    reti1 = regcomp(&regex1, "^[[:upper:]]{2,4}[[:digit:]]{4}$", REG_EXTENDED);
    if (reti1 != 0) {
        printf("Could not compile regex\n");
        return 1;
    }
    // run the actual regex test
    status = regexec(&regex1, inputString, (size_t)0, NULL, 0);
    if (status == 0)
        printf("Matched (0 => Yes): %d\n\n", status);
    else
        printf(">>>NO MATCH<< \n\n");
    regfree(&regex1);
    return 0;
}
(Note that my C is extremely rusty, so this code is probably horrible.)
I found some good resources on this answer.

usage of + in Posix Regex library

This should be pretty simple, but I am having trouble understanding the basic working of '+' in regex.h library in C. Not sure what is going wrong.
Pasting a sample code which doesn't work. I want to find a string which starts with B and ends with A, there can be more than one occurrence of B so I want to use B+
int main(int argc, const char * argv[])
{
regex_t regex;
int reti;
/* Compile regular expression */
reti = regcomp(&regex, "^B+A$", 0);
if( reti)
{
printf("Could not compile regex\n");
exit(1);
}
/* Execute regular expression */
reti = regexec(&regex, "BBBA", 0, NULL, 0);
if (!reti )
{
printf("Match\n");
}
else if( reti == REG_NOMATCH )
{
printf("No match\n");
}
else
{
printf("Regex match failed\n");
exit(1);
}
/* Free compiled regular expression if you want to use the regex_t again */
regfree(&regex);
return 0;
}
This does not find the match, but I am not able to understand why.
Usage of ^BB*A$ works fine, but that is not something I would want.
I also want to check for something like ^[BCD]+A$, which should match BBBA or CCCCA or DDDDA. Using ^[BCD][BCD]*A$ won't work for me, as that could match BCCCA, which is not the desired match.
Tried using parentheses and brackets in the expression but it doesn't seem to help.
Quick help is much appreciated.
By default regcomp() compiles a pattern as a so-called Basic Regular Expression; in such regular expressions the + operator is not available. The regex syntax you're trying to use is known as Extended Regular Expression syntax. In order to have regcomp() work with that more extended syntax you need to pass it the REG_EXTENDED flag.
By the way, this comment:
As I also want to check for something like ^[BCD]+A$ which should match BBBA or CCCCA or
DDDDA. Usage of ^[BCD][BCD]*A$ wont work for me as that could match BCCCA which is not the
desired match
is based on a misconception of how the quantifiers + and * work. The regular expressions ^[BCD]+A$ and ^[BCD][BCD]*A$ are exactly equivalent.

Optimising C code

Here is my C code..
void Read(int t,char* string1)
{
int j,i,p,row,count=0;
for(i=0;i<t;++i,string1=strchr(string1,')')+2)
{
sscanf(string1,"(%d,%d)",&p,&row);
CallFunction(p,row);
}
}
Here is how I have to call this function:
Read(2,"(3,5),(7,8)")
Is this a good way to handle this kind of input parameter? Is it time-consuming?
Is there a better (optimised) way of reading the same input parameters?
You could use the %n conversion specifier of sscanf(), which stores the number of characters consumed so far and so lets you omit the strchr() call. The speed improvement is probably marginal.
BTW: don't call a function "Read", not even if you can assume a case-sensitive compiler and linker.
#include <stdio.h>
#define CallFunction(a,b) fprintf(stderr, "p=%d row=%d\n", a, b)
void do_read(int cnt, const char *input)
{
    int i, err, p, row, res;
    for (i = 0; i < cnt; i++, input += res)
    {
        res = 0; /* stays 0 if sscanf() stops before it reaches %n */
        err = sscanf(input, "(%d,%d)%n", &p, &row, &res);
        if (err < 2 || res == 0) {
            fprintf(stderr, "%s:%d: input='%s', err=%d\n"
                , __FILE__, __LINE__, input, err);
            break;
        }
        CallFunction(p, row);
        if (input[res] == ',') res++; /* skip the separator between pairs */
    }
}
int main(void)
{
    do_read(2, "(3,5),(7,8)"); /* this should succeed */
    do_read(2, "(3,5)#(7,8)"); /* this must fail ... */
    return 0;
}
This code is reasonably fast. But how fast it needs to be depends on your constraints, which are unknown to me.
I hope that your input data is already checked, because string1=strchr(string1,')')+2 (and what follows) is not safe: strchr() returns NULL when no ')' is left, and adding 2 to NULL is undefined behaviour.
Reading your code makes me think that, if you really need bare-metal speed, you should drop the library calls and parse the string yourself.
But given the 'API' you have published, the real bottleneck may well lie above or below this code snippet.
Optimising then depends on the whole call chain: the whole will not run faster than the slowest function in the chain.
Sorry not to be more specific, but this is a broader question than the information you provide lets me answer (I don't have the whole picture).

How to make external Mathematica functions interruptible?

I had an earlier question about integrating Mathematica with functions written in C++.
This is a follow-up question:
If the computation takes too long I'd like to be able to abort it using Evaluation > Abort Evaluation. Which of the technologies suggested in the answers make it possible to have an interruptible C-based extension function? How can "interruptibility" be implemented on the C side?
I need to make my function interruptible in a way which will corrupt neither it, nor the Mathematica kernel (i.e. it should be possible to call the function again from Mathematica after it has been interrupted)
For MathLink - based functions, you will have to do two things (On Windows): use MLAbort to check for aborts, and call MLCallYieldFunction, to yield the processor temporarily. Both are described in the MathLink tutorial by Todd Gayley from way back, available here.
Using the bits from my previous answer, here is an example code to compute the prime numbers (in an inefficient manner, but this is what we need here for an illustration):
code =
"
#include <stdlib.h>
extern void primes(int n);
static void yield(){
MLCallYieldFunction(
MLYieldFunction(stdlink),
stdlink,
(MLYieldParameters)0 );
}
static void abort(){
MLPutFunction(stdlink,\" Abort \",0);
}
void primes(int n){
int i = 0, j=0,prime = 1, *d = (int *)malloc(n*sizeof(int)),ctr = 0;
if(!d) {
abort();
return;
}
for(i=2;!MLAbort && i<=n;i++){
j=2;
prime = 1;
while (!MLAbort && j*j <=i){
if(i % j == 0){
prime = 0;
break;
}
j++;
}
if(prime) d[ctr++] = i;
yield();
}
if(MLAbort){
abort();
goto R1;
}
MLPutFunction(stdlink,\"List\",ctr);
for(i=0; !MLAbort && i < ctr; i++ ){
MLPutInteger(stdlink,d[i]);
yield();
}
if(MLAbort) abort();
R1: free(d);
}
";
and the template:
template =
"
void primes P((int ));
:Begin:
:Function: primes
:Pattern: primes[n_Integer]
:Arguments: { n }
:ArgumentTypes: { Integer }
:ReturnType: Manual
:End:
";
Here is the code to create the program (taken from the previous answer, slightly modified):
Needs["CCompilerDriver`"];
fullCCode = makeMLinkCodeF[code];
projectDir = "C:\\Temp\\MLProject1";
If[! FileExistsQ[projectDir], CreateDirectory[projectDir]]
pname = "primes";
files = MapThread[
Export[FileNameJoin[{projectDir, pname <> #2}], #1,
"String"] &, {{fullCCode, template}, {".c", ".tm"}}];
Now, here we create it:
In[461]:= exe=CreateExecutable[files,pname];
Install[exe]
Out[462]= LinkObject["C:\Users\Archie\AppData\Roaming\Mathematica\SystemFiles\LibraryResources\
Windows-x86-64\primes.exe",161,10]
and use it:
In[464]:= primes[20]
Out[464]= {2,3,5,7,11,13,17,19}
In[465]:= primes[10000000]
Out[465]= $Aborted
In the latter case, I used Alt+"." to abort the computation. Note that this won't work correctly if you do not include a call to yield.
The general idea is that you have to check for MLAbort and call MLCallYieldFunction in every expensive computation, such as large loops. Doing that in inner loops, as I did above, is perhaps overkill though. One thing you could try is to factor the boilerplate code away using the C preprocessor (macros).
Without ever having tried it, it looks like the Expression Packet functionality might work in this way: if your C code periodically goes back and asks Mathematica for more work to do, then hopefully aborting execution on the Mathematica side will tell the C code that there is no more work to do.
If you are using LibraryLink to link external C code to the Mathematica kernel, you can use the Library callback function AbortQ to check if an abort is in progress.

Compiling/Matching POSIX Regular Expressions in C

I'm trying to match the following items in the string pcode:
u followed by a 1 or 2 digit number
phaseu
phasep
x (surrounded by non-word chars)
y (surrounded by non-word chars)
z (surrounded by non-word chars)
I've tried to implement a regex match using the POSIX regex functions (shown below), but have two problems:
The compiled pattern seems to have no subpatterns (i.e. compiled.re_nsub == 0).
The pattern doesn't find matches in the string " u0", which it really should!
I'm confident that the regex string itself is working—in that it works in python and TextMate—my problem lies with the compilation, etc. in C. Any help with getting that working would be much appreciated.
Thanks in advance for your answers.
if(idata=tb_find(deftb,pdata)){
MESSAGE("Global variable!\n");
char pattern[80] = "((u[0-9]{1,2})|(phaseu)|(phasep)|[\\W]+([xyz])[\\W]+)";
MESSAGE("Pattern = \"%s\"\n",pattern);
regex_t compiled;
if(regcomp(&compiled, pattern, 0) == 0){
MESSAGE("Compiled regular expression \"%s\".\n", pattern);
}
int nsub = compiled.re_nsub;
MESSAGE("nsub = %d.\n",nsub);
regmatch_t matchptr[nsub];
int err;
if(err = regexec (&compiled, pcode, nsub, matchptr, 0)){
if(err == REG_NOMATCH){
MESSAGE("Regular expression did not match.\n");
}else if(err == REG_ESPACE){
MESSAGE("Ran out of memory.\n");
}
}
regfree(&compiled);
}
It seems you intend to use something resembling the "extended" POSIX regex syntax. POSIX defines two different regex syntaxes, a "basic" (read "obsolete") syntax and the "extended" syntax. To use the extended syntax, you need to add the REG_EXTENDED flag for regcomp:
...
if(regcomp(&compiled, pattern, REG_EXTENDED) == 0){
...
Without this flag, regcomp will use the "basic" regex syntax. There are some important differences, such as:
No support for the | operator
The parentheses for submatches need to be escaped, \( and \)
The braces for repetition counts need to be escaped too, \{ and \}
It should also be noted that the POSIX extended regex syntax is not 1:1 compatible with Python's regex (don't know about TextMate). In particular, I'm afraid this part of your regexp does not work in POSIX, or at least is not portable:
[\\W]
The portable POSIX way to specify non-word characters (anything other than a letter, digit or underscore) is:
[^[:alnum:]_]
Your whole regexp for POSIX should then look like this in C:
char *pattern = "((u[0-9]{1,2})|(phaseu)|(phasep)|[^[:alnum:]_]+([xyz])[^[:alnum:]_]+)";
