Regex doesn't match in c but works with online interpreter - c

I have a regex that I have tested with https://regexr.com/, where it works correctly. But in c it doesn't find any match.
My code is below; I have removed everything unnecessary.
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
int main ()
{
char * str = "<sql db=../serverTcp/Testing.db query=SELECT * From BuyMarsians;\>";
char * regex = "<sql\s+db=(.+)\s+query=(.+;)\s*\\>";
regex_t regexCompiled;
if (regcomp(&regexCompiled,regex,REG_EXTENDED))
{
printf("Could not compile regular expression.\n");
fflush(stdout);
};
if (!regexec(&regexCompiled,str, 0, NULL, 0)) {
printf("matched");
fflush(stdout);
}
regfree(&regexCompiled);
return 0;
}

You need to escape your backslashes. Change
char * regex = "<sql\s+db=(.+)\s+query=(.+;)\s*\\>";
to
char * regex = "<sql\\s+db=(.+)\\s+query=(.+;)\\s*\\\\>";
Note that this is extremely inefficient. A much more efficient regex uses non-greedy quantification, with ?:
<sql\s+db=(.+?)\s+query=(.+;)\s*\\>
// ^ key change
That becomes:
char * regex = "<sql\\s+db=(.+?)\\s+query=(.+;)\\s*\\\\>";
Also note: Your string to be matched also includes \. You need to escape it there, too:
char * str = "<sql db=../serverTcp/Testing.db query=SELECT * From BuyMarsians;\\>";
Here's a working demo of your corrected code.

Related

Can't use regular expression with .*

I've been trying to use regular expressions (<regex.h>) in a C project I am developing.
According to regex101 the regex it is well written and identifies what I'm trying to identify but it doesn't work when I try to run it in C.
#include <stdio.h>
#include <regex.h>
int main() {
char pattern[] = "#include.*";
char line[] = "#include <stdio.h>";
regex_t string;
int regex_return = -1;
regex_return = regcomp(&string, line, 0);
regex_return += regexec(&string, pattern, 0, NULL, 0);
printf("%d", regex_return);
return 0;
}
This is a sample code I wrote to test the expression when I found out it didn't work.
It prints 1, when I expected 0.
It prints 0 if I change the line to "#include", which is just strange to me, because it's ignoring the .* at the end.
line and pattern are swapped.
regcomp takes the pattern and regexec takes the string to check.

pattern matching / extracting in c using regex.h

I need help extracting a substring from a string using regex.h in C.
In this example, I am trying to extract all occurrences of character 'e' from a string 'telephone'. Unfortunately, I get stuck identifying the offsets of those characters. I am listing code below:
#include <stdio.h>
#include <regex.h>
int main(void) {
const int size=10;
regex_t regex;
regmatch_t matchStruct[size];
char pattern[] = "(e)";
char str[] = "telephone";
int failure = regcomp(&regex, pattern, REG_EXTENDED);
if (failure) {
printf("Cannot compile");
}
int matchFailure = regexec(&regex, pattern, size, matchStruct, 0);
if (!matchFailure) {
printf("\nMatch!!");
} else {
printf("NO Match!!");
}
return 0;
}
So per GNU's manual, I should get all of the occurrences of 'e' when a character is parenthesized. However, I always get only the first occurrence.
Essentially, I want to be able to see something like:
matchStruct[1].rm_so = 1;
matchStruct[1].rm_so = 2;
matchStruct[2].rm_so = 4;
matchStruct[2].rm_so = 5;
matchStruct[3].rm_so = 7;
matchStruct[3].rm_so = 8;
or something along these lines. Any advice?
Please note that you are in fact not comparing your compiled regex against str ("telephone") but rather to your plain-text pattern. Check your second attribute to regexec. That fixed, proceed for instance to "regex in C language using functions regcomp and regexec toggles between first and second match" where the answer to your question is already given.

PCRE is not matching utf8 characters

I'm compiling a PCRE pattern with utf8 flag enabled and am trying to match a utf8 char* string against it, but it is not matching and pcre_exec returns negative. I'm passing the subject length as 65 to pcre_exec which is the number of characters in the string. I believe it expects the number of bytes so I have tried with increasing the argument till 70 but still get the same result. I don't know what else is making the match fail. Please help before I shoot myself.
(If I try without the flag PCRE_UTF8 however, it matches but the offset vector[1] is 30 which is index of the character just before a unicode character in my input string)
#include "stdafx.h"
#include "pcre.h"
#include <pcre.h> /* PCRE lib NONE */
#include <stdio.h> /* I/O lib C89 */
#include <stdlib.h> /* Standard Lib C89 */
#include <string.h> /* Strings C89 */
#include <iostream>
int main(int argc, char *argv[])
{
pcre *reCompiled;
int pcreExecRet;
int subStrVec[30];
const char *pcreErrorStr;
int pcreErrorOffset;
char* aStrRegex = "(\\?\\w+\\?\\s*=)?\\s*(call|exec|execute)\\s+(?<spName>\\w+)("
// params can be an empty pair of parenthesis or have parameters inside them as well.
"\\(\\s*(?<params>[?\\w,]+)\\s*\\)"
// paramList along with its parenthesis is optional below so a SP call can be just "exec sp_name" for a stored proc call without any parameters.
")?";
reCompiled = pcre_compile(aStrRegex, 0, &pcreErrorStr, &pcreErrorOffset, NULL);
if(reCompiled == NULL) {
printf("ERROR: Could not compile '%s': %s\n", aStrRegex, pcreErrorStr);
exit(1);
}
char* line = "?rt?=call SqlTxFunctionTesting(?înFîéld?,?outField?,?inOutField?)";
pcreExecRet = pcre_exec(reCompiled,
NULL,
line,
65, // length of string
0, // Start looking at this point
0, // OPTIONS
subStrVec,
30); // Length of subStrVec
printf("\nret=%d",pcreExecRet);
//int substrLen = pcre_get_substring(line, subStrVec, pcreExecRet, 1, &mantissa);
}
1)
char * q= "î";
printf("%d, %s", q[0], q);
Output:
63, ?
2) You must rebuild PCRE with PCRE_BUILD_PCRE16 (or 32) and PCRE_SUPPORT_UTF. And use pcre16.lib and/or pcre16.dll. Then you can try this code:
pcre16 *reCompiled;
int pcreExecRet;
int subStrVec[30];
const char *pcreErrorStr;
int pcreErrorOffset;
wchar_t* aStrRegex = L"(\\?\\w+\\?\\s*=)?\\s*(call|exec|execute)\\s+(?<spName>\\w+)("
// params can be an empty pair of paranthesis or have parameters inside them as well.
L"\\(\\s*(?<params>[?,\\w\\p{L}]+)\\s*\\)"
// paramList along with its paranthesis is optional below so a SP call can be just "exec sp_name" for a stored proc call without any parameters.
L")?";
reCompiled = pcre16_compile((PCRE_SPTR16)aStrRegex, PCRE_UTF8, &pcreErrorStr, &pcreErrorOffset, NULL);
if(reCompiled == NULL) {
printf("ERROR: Could not compile '%s': %s\n", aStrRegex, pcreErrorStr);
exit(1);
}
const wchar_t* line = L"?rt?=call SqlTxFunctionTesting( ?inField?,?outField?,?inOutField?,?fd? )";
const wchar_t* mantissa=new wchar_t[wcslen(line)];
pcreExecRet = pcre16_exec(reCompiled,
NULL,
(PCRE_SPTR16)line,
wcslen(line), // length of string
0, // Start looking at this point
0, // OPTIONS
subStrVec,
30); // Length of subStrVec
printf("\nret=%d",pcreExecRet);
for (int i=0;i<pcreExecRet;i++){
int substrLen = pcre16_get_substring((PCRE_SPTR16)line, subStrVec, pcreExecRet, i, (PCRE_SPTR16 *)&mantissa);
wprintf(L"\nret string=%s, length=%i\n",mantissa,substrLen);
}
3) \w = [0-9A-Z_a-z]. It doesn't contains unicode symbols.
4) This can really help: http://answers.oreilly.com/topic/215-how-to-use-unicode-code-points-properties-blocks-and-scripts-in-regular-expressions/
5) from PCRE 8.33 source (pcre_exec.c:2251)
/* Find out if the previous and current characters are "word" characters.
It takes a bit more work in UTF-8 mode. Characters > 255 are assumed to
be "non-word" characters. Remember the earliest consulted character for
partial matching. */

How to remove the path to get the filename

How does one remove the path of a filepath, leaving only the filename?
I want to extract only the filename from a fts_path and store this in a char *fileName.
Here's a function to remove the path on POSIX-style (/-separated) pathnames:
char *base_name(const char *pathname)
{
char *lastsep = strrchr(pathname, '/');
return lastsep ? lastsep+1 : pathname;
}
If you need to support legacy systems with odd path separators (like MacOS 9 or Windows), you might need to adapt the above to search for multiple possible separators. For example on Windows, both / and \ are path separators and any mix of them can be used.
You want basename(3).
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <libgen.h>
int main(void)
{
char * path = "/homes/mk08/Desktop/lala.c";
char * tmp = strdup(path);
if(tmp) {
printf("%s\n", basename(tmp));
free(tmp);
}
return EXIT_SUCCESS;
}
This will output:
lala.c
I'm sure there is a less roundabout way of doing this, but you could always search through the filepath (I assume it is stored as a char array?), get the position of the final '\', and then erase everything prior to that.
Edit: See R's comment.

How to extract filename from path

There should be something elegant in Linux API/POSIX to extract base file name from full path
See char *basename(char *path).
Or run the command "man 3 basename" on your target UNIX/POSIX system.
Use basename (which has odd corner case semantics) or do it yourself by calling strrchr(pathname, '/') and treating the whole string as a basename if it does not contain a '/' character.
Here's an example of a one-liner (given char * whoami) which illustrates the basic algorithm:
(whoami = strrchr(argv[0], '/')) ? ++whoami : (whoami = argv[0]);
an additional check is needed if NULL is a possibility. Also note that this just points into the original string -- a "strdup()" may be appropriate.
You could use strstr in case you are interested in the directory names too:
char *path ="ab/cde/fg.out";
char *ssc;
int l = 0;
ssc = strstr(path, "/");
do{
l = strlen(ssc) + 1;
path = &path[strlen(path)-l+2];
ssc = strstr(path, "/");
}while(ssc);
printf("%s\n", path);
The basename() function returns the last component of a path, which could be a folder name and not a file name. There are two versions of the basename() function: the GNU version and the POSIX version.
The GNU version can be found in string.h after you include #define _GNU_SOURCE:
#define _GNU_SOURCE
#include <string.h>
The GNU version uses const and does not modify the argument.
char * basename (const char *path)
This function is overridden by the XPG (POSIX) version if libgen.h is included.
char * basename (char *path)
This function may modify the argument by removing trailing '/' bytes. The result may be different from the GNU version in this case:
basename("foo/bar/")
will return the string "bar" if you use the XPG version and an empty string if you use the GNU version.
References:
basename (3) - Linux Man Pages
Function: char * basename (const char *filename), Finding Tokens in a String.
Of course if this is a Gnu/Linux only question then you could use the library functions.
https://linux.die.net/man/3/basename
And though some may disapprove these POSIX compliant Gnu Library functions do not use const. As library utility functions rarely do. If that is important to you I guess you will have to stick to your own functionality or maybe the following will be more to your taste?
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[])
{
char *fn;
char *input;
if (argc > 1)
input = argv[1];
else
input = argv[0];
/* handle trailing '/' e.g.
input == "/home/me/myprogram/" */
if (input[(strlen(input) - 1)] == '/')
input[(strlen(input) - 1)] = '\0';
(fn = strrchr(input, '/')) ? ++fn : (fn = input);
printf("%s\n", fn);
return 0;
}
template<typename charType>
charType* getFileNameFromPath( charType* path )
{
if( path == NULL )
return NULL;
charType * pFileName = path;
for( charType * pCur = path; *pCur != '\0'; pCur++)
{
if( *pCur == '/' || *pCur == '\\' )
pFileName = pCur+1;
}
return pFileName;
}
call:
wchar_t * fileName = getFileNameFromPath < wchar_t > ( filePath );
(this is a c++)
You can escape slashes to backslash and use this code:
#include <stdio.h>
#include <string.h>
int main(void)
{
char path[] = "C:\\etc\\passwd.c"; //string with escaped slashes
char temp[256]; //result here
char *ch; //define this
ch = strtok(path, "\\"); //first split
while (ch != NULL) {
strcpy(temp, ch);//copy result
printf("%s\n", ch);
ch = strtok(NULL, "\\");//next split
}
printf("last filename: %s", temp);//result filename
return 0;
}
I used a simpler way to get just the filename or last part in a path.
char * extract_file_name(char *path)
{
int len = strlen(path);
int flag=0;
printf("\nlength of %s : %d",path, len);
for(int i=len-1; i>0; i--)
{
if(path[i]=='\\' || path[i]=='//' || path[i]=='/' )
{
flag=1;
path = path+i+1;
break;
}
}
return path;
}
Input path = "C:/Users/me/Documents/somefile.txt"
Output = "somefile.txt"
#Nikolay Khilyuk offers the best solution except.
1) Go back to using char *, there is absolutely no good reason for using const.
2) This code is not portable and is likely to fail on none POSIX systems where the / is not the file system delimiter depending on the compiler implementation. For some windows compilers you might want to test for '\' instead of '/'. You might even test for the system and set the delimiter based on the results.
The function name is long but descriptive, no problem there. There is no way to ever be sure that a function will return a filename, you can only be sure that it can if the function is coded correctly, which you achieved. Though if someone uses it on a string that is not a path obviously it will fail. I would have probably named it basename, as it would convey to many programmers what its purpose was. That is just my preference though based on my bias your name is fine. As far as the length of the string this function will handle and why anyone thought that would be a point? You will unlikely deal with a path name longer than what this function can handle on an ANSI C compiler. As size_t is defined as a unsigned long int which has a range of 0 to 4,294,967,295.
I proofed your function with the following.
#include <stdio.h>
#include <string.h>
char* getFileNameFromPath(char* path);
int main(int argc, char *argv[])
{
char *fn;
fn = getFileNameFromPath(argv[0]);
printf("%s\n", fn);
return 0;
}
char* getFileNameFromPath(char* path)
{
for(size_t i = strlen(path) - 1; i; i--)
{
if (path[i] == '/')
{
return &path[i+1];
}
}
return path;
}
Worked great, though Daniel Kamil Kozar did find a 1 off error that I corrected above. The error would only show with a malformed absolute path but still the function should be able to handle bogus input. Do not listen to everyone that critiques you. Some people just like to have an opinion, even when it is not worth anything.
I do not like the strstr() solution as it will fail if filename is the same as a directory name in the path and yes that can and does happen especially on a POSIX system where executable files often do not have an extension, at least the first time which will mean you have to do multiple tests and searching the delimiter with strstr() is even more cumbersome as there is no way of knowing how many delimiters there might be. If you are wondering why a person would want the basename of an executable think busybox, egrep, fgrep etc...
strrchar() would be cumbersome to implement as it searches for characters not strings so I do not find it nearly as viable or succinct as this solution. I stand corrected by Rad Lexus this would not be as cumbersome as I thought as strrchar() has the side effect of returning the index of the string beyond the character found.
Take Care
My example (improved):
#include <string.h>
const char* getFileNameFromPath(const char* path, char separator = '/')
{
if(path != nullptr)
{
for(size_t i = strlen(path); i > 0; --i)
{
if (path[i-1] == separator)
{
return &path[i];
}
}
}
return path;
}

Resources