Posix regular expression non-greedy - c

Is there a way to use a non-greedy regular expression in C like one can use in Perl?
I tried several things, but it's actually not working.
I'm currently using this regex that matches an IP address and the corresponding HTTP request, but it's greedy although I'm using the *?:
([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1
In this example, it always matches the whole string:
#include <regex.h>
#include <stdio.h>
int main() {
int a, i;
regex_t re;
regmatch_t pm;
char *mpages = "TEST 127.0.0.1 GET /test.php HTTP/1.1\" 404 525 \"-\" \"Mozilla/5.0 (Windows NT HTTP/1.1 TEST";
a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1", REG_EXTENDED);
if(a!=0)
printf(" -> Error: Invalid Regex");
a = regexec(&re, &mpages[0], 1, &pm, REG_EXTENDED);
if(a==0) {
for(i = pm.rm_so; i < pm.rm_eo; i++)
printf("%c", mpages[i]);
printf("\n");
}
return 0;
}
$ ./regtest
127.0.0.1 GET /test.php HTTP/1.1" 404 525 "-" "Mozilla/5.0 (Windows NT HTTP/1.1

No, there are no non-greedy quantifiers in POSIX regular expressions. But there is a library that provides perl-like regular expressions for C: http://www.pcre.org/

As I said earlier in a comment, use grep -E to run tests with POSIX regexes; that way development time will be improved. Either way, it seems your problem is with the regular expression rather than with the missing feature.
I'm not quite clear about what you want to grab from the request. Supposing you just want the IP address, the HTTP verb and the resource, one could end up with the following regex.
regcomp(&re, "\\b(.?[0-9])+\\s+(GET|POST|PUT)\\s+([^ ]+)", REG_EXTENDED);
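Be aware that several assumptions have been made. For example, this regex assumes the IP address will be well formed, and it also assumes a request with one of the HTTP verbs GET, POST or PUT. Note that \b and \s are GNU extensions rather than part of POSIX ERE, so they may not work with every regcomp implementation. Edit according to your needs.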

The brute-force method of getting a regex to match up to the next occurrence of a word is:
"([^H]|H[^T]|HT[^T]|HTT[^P]|HTTP[^/]|HTTP/[^1]|HTTP/1[^.]|HTTP/1\\.[^1])*HTTP/1\\.1"
unless you can get smarter about your match -- which you can: HTTP requests are
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
and none of the nonterminals on the right match embedded spaces. So:
"[0-9]{1,3}(\\.[0-9]{1,3}){3} [^ ]* [^ ]* HTTP/1\\.1"
This version drops the parentheses, since you're only allocating space for the whole-expression match; put the parens back in to get the pieces.

a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1", REG_EXTENDED|REG_ENHANCED);
The REG_ENHANCED flag is an Apple extension (non-POSIX), and older SDKs do not define this macro:
#if __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_10_8 \
|| __IPHONE_OS_VERSION_MIN_REQUIRED >= __IPHONE_6_0
#define REG_ENHANCED 0400 /* Additional (non-POSIX) features */
#endif

In your code, pm should be an array of regmatch_t, and in your case, should have at least 2 to 4 elements, depending upon which () sub-expressions you want to capture.
You have only one element. The first element, pm[0], always gets whatever text matches your entire RE. That's the one you'll be getting. It is pm[1] that will get the text of the first () sub-expression (the IP address), and pm[3] that will get the text matching your (.*?) term.
But even so, as stated above (by Wumbley, W. Q.) the POSIX regex library may not support non-greedy quantifiers.
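The two points made in this thread (an array of regmatch_t, and a pattern that does not need a lazy quantifier) can be combined in a minimal sketch. The helper names match_request and copy_span are assumptions for illustration, not part of the original posts; the pattern is the space-delimited one suggested above with the parentheses put back in.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies the text of one match into dst, truncating
 * if the buffer is too small. */
static void copy_span(char *dst, size_t dstsz, const char *src, regmatch_t m)
{
    size_t n = (size_t)(m.rm_eo - m.rm_so);

    if (dstsz == 0)
        return;
    if (n >= dstsz)
        n = dstsz - 1;
    memcpy(dst, src + m.rm_so, n);
    dst[n] = '\0';
}

/* Hypothetical helper: extracts the IP address and the request URI from a
 * log line. Returns 0 on a match, nonzero otherwise. */
int match_request(const char *line, char *ip, size_t ipsz,
                  char *uri, size_t urisz)
{
    regex_t re;
    regmatch_t pm[4];   /* pm[0] = whole match, pm[1..3] = sub-expressions */
    int rc;

    /* Group 1: IP address, group 2: last ".ddd" repetition, group 3: URI.
     * [^ ]* cannot run past a space, so no lazy quantifier is needed. */
    rc = regcomp(&re,
                 "([0-9]{1,3}(\\.[0-9]{1,3}){3}) [^ ]* ([^ ]*) HTTP/1\\.1",
                 REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, line, 4, pm, 0);   /* eflags is 0 here, not REG_EXTENDED */
    if (rc == 0) {
        copy_span(ip, ipsz, line, pm[1]);
        copy_span(uri, urisz, line, pm[3]);
    }
    regfree(&re);
    return rc;
}
```

Called on the sample line from the question, this should fill ip with the address and uri with /test.php, stopping at the first HTTP/1.1 because the two [^ ]* fields cannot cross a space.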

Related

Parsing the payload of the AT commands from the full response string

I want to parse the actual payload from the output of AT commands.
For instance: in the example below, I'd want to read only "2021/11/16,11:12:14-32,0"
AT+QLTS=1 // command
+QLTS: "2021/11/16,11:12:14-32,0" // response
OK
In the following case, I'd need to only read 12345678.
AT+CIMI // command
12345678 // example response
So the point is: not all commands have the same format for the output. We can assume the response is stored in a string array.
I have GetAtCmdRsp() already implemented which stores the response in a char array.
void GetPayload()
{
char rsp[100] = {0};
GetAtCmdRsp("AT+QLTS=1", rsp);
// rsp now contains +QLTS: "2021/11/16,11:12:14-32,0"
// now, I need to parse "2021/11/16,11:12:14-32,0" out of the response
memset(rsp, 0, sizeof(rsp));
GetAtCmdRsp("AT+CIMI", rsp);
// rsp now contains 12345678
// no need to do additional parsing since the output already contains the value I need
}
I was thinking of doing char *start = strstr(rsp, ":") + 1; to get the start of the payload but some responses may only contain the payload as it's the case with AT+CIMI
Perhaps could regex be a good idea to determine the pattern +<COMMAND>: in a string?
In order to parse AT command responses a good starting point is understanding all the possible formats they can have. So, rather than implementing a command specific routine, I would discriminate commands by "type of response":
Commands with no payload in their answers, for example
AT
OK
Commands with no header in their answers, such as
AT+CIMI
12345678
OK
Commands with a single header in their answers
AT+QLTS=1
+QLTS: "2021/11/16,11:12:14-32,0"
OK
Commands with multi-line responses. Every line could be of the "single header" type, as with +CGDCONT:
AT+CGDCONT?
+CGDCONT: 1,"IP","epc.tmobile.com","0.0.0.0",0,0
+CGDCONT: 2,"IP","isp.cingular","0.0.0.0",0,0
+CGDCONT: 3,"IP","","0.0.0.0",0,0
OK
Or we could even have mixed types, as with +CMGL:
AT+CMGL="ALL"
+CMGL: 1,"REC READ","+XXXXXXXXXX","","21/11/25,10:20:00+00"
Good morning! How are you?
+CMGL: 2,"REC READ","+XXXXXXXXXX","","21/11/25,10:33:33+00"
I'll come a little late. See you. Bruce Wayne
OK
(please note how it could have also "empty" lines, that is \r\n).
At the moment I cannot think of any other scenario. In this way you'll be able to define an enum like
typedef enum
{
AT_RESPONSE_TYPE_NO_RESPONSE,
AT_RESPONSE_TYPE_NO_HEADER,
AT_RESPONSE_TYPE_SINGLE_HEADER,
AT_RESPONSE_TYPE_MULTILINE,
AT_RESPONSE_TYPE_MAX
} at_response_type_t; /* type name is illustrative */
and pass it to your GetAtCmdRsp() function in order to parse the response accordingly. Whether you implement the differentiation in that function, after it, or in an external function is your choice.
A solution without explicit categorization
Once you have clear all the scenarios that might ever occur, you can think about a general algorithm working for all of them:
Get the full response resp after the command echo and before the closing OK or ERROR. Make sure that the trailing \r\n\r\nOK is removed (or \r\nERROR, or \r\nNO CARRIER, or whatever the terminating message of the response might be). Also make sure to remove the command echo.
If strlen( resp ) == 0 we belong to the NO_RESPONSE category, and the job is done
If the response contains \r\ns in it, we have a MULTILINE answer. So, tokenize it and place every line into an array element resp_arr[i]. Make sure to remove trailing \r\n
For every line in the response (for every resp_arr[i] element), search for the +<CMD>: pattern (not only :, which might be contained in the payload as well!). Something like this:
size_t len = strlen( resp_cur_line );
char *payload;
// strstr() takes the string to search first, then the pattern
if( strstr( resp_cur_line, "+YOURCMD: " ) == NULL )
{
// We are in "NO_HEADER" case
payload = resp_cur_line;
}
else
{
// We are in "HEADER" case
payload = resp_cur_line + strlen( "+YOURCMD: " );
}
Now payload pointer points to the actual payload.
Please note how, in the case of a MULTILINE answer, after splitting the lines into array elements every loop iteration will also correctly handle mixed scenarios like the one in +CMGL, as you'll be able to distinguish the lines containing the header from those containing data (and from the empty lines, of course). For a deeper analysis of +CMGL response parsing have a look at this answer.
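The header check described above can be sketched as a small helper. The name at_line_payload and its parameters are assumptions for illustration; it anchors the header at the start of the line with strncmp rather than searching anywhere in the line, so a payload that happens to contain the header text is not misread.

```c
#include <string.h>

/* Hypothetical helper: returns a pointer to the payload inside one response
 * line. cmd is the command name without the "AT" prefix, e.g. "+QLTS". */
const char *at_line_payload(const char *line, const char *cmd)
{
    size_t n = strlen(cmd);

    /* HEADER case: the line starts with "<cmd>:" */
    if (strncmp(line, cmd, n) == 0 && line[n] == ':') {
        const char *p = line + n + 1;

        while (*p == ' ')      /* skip blanks after the colon */
            p++;
        return p;
    }
    /* NO_HEADER case: the whole line is the payload */
    return line;
}
```

For a MULTILINE answer the same helper would be called once per resp_arr[i] element; data lines such as the SMS text in a +CMGL response come back unchanged.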

Macro that resolves to first N characters of argument

I am working on a heavily resource-constrained embedded platform.
I want a macro that will capture function call errors and log them to a fixed-size buffer.
My wish is to be able to do something like
returnType retval;
CAPTURE_ERRORS(retval, function_name, argument1, moreArgsMaybe);
if (retval) { other_error_handling(); }
Where
#define N 12
#define CAPTURE_ERRORS(retval, func, ...) \
do { retval = func(__VA_ARGS__); \
if (retval!=0) write_log_entry(#func[0:N],(int)retval); \
} while (0)
Obviously, the Python slice syntax won't work. Is there any way to get the first N characters of a stringized macro argument?
(I don't want to do the truncation inside write_log_entry, because then the whole long function name will be stored in the executable image, only to be thrown away later.)
I am not aware of any way as a string. (Somebody who is aware, please enlighten me!)
Edit The easiest way I know is to make all your function names no more than N characters long! Think of all that Fortran code with N=6. :)
The second easiest way I know is to pass an additional parameter to CAPTURE_ERRORS:
#define N 12
/* vvvv */
#define CAPTURE_ERRORS(retval, func, tag, ...) \
do { retval = func(__VA_ARGS__); \
if (retval != 0) write_log_entry(#tag,(int)retval); \
} while (0) /* ^^^^ */
and
CAPTURE_ERRORS(retval, function_name, function_nam, argument1, moreArgsMaybe);
^^^^^^^^^^^^
This is a sufficiently restricted form that you could automatically stuff tag in your existing CAPTURE_ERRORS call with a Python (or even sed!) script that you run before compiling.
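A minimal sketch of the tag approach, to show that only the short tag is stringized. The logger below is a stand-in that keeps the last entry in memory, and initialise_temperature_sensor and demo_capture are hypothetical names; none of them come from the original post.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the real logger (assumption): remembers the last entry. */
char last_tag[16];
int  last_code;

void write_log_entry(const char *tag, int code)
{
    snprintf(last_tag, sizeof last_tag, "%s", tag);
    last_code = code;
}

#define CAPTURE_ERRORS(retval, func, tag, ...)                 \
    do { retval = func(__VA_ARGS__);                           \
         if (retval != 0) write_log_entry(#tag, (int)retval);  \
    } while (0)

/* Hypothetical function with a long name; fails with code 7 on a
 * negative channel. */
int initialise_temperature_sensor(int channel)
{
    return channel < 0 ? 7 : 0;
}

/* Only the 12-character tag is turned into a string literal here; the
 * full function name is used as an identifier, not as a string. */
int demo_capture(int channel)
{
    int retval;

    CAPTURE_ERRORS(retval, initialise_temperature_sensor, initialise_t, channel);
    return retval;
}
```

The cost is that the tag has to be kept in sync with the function name by hand or by a pre-build script, as described above.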
Edit
A discussion thread coming to the same conclusion — use an external tool.
In C++, you could likely do this at compile time with a template. :) Not unlike this question, but stopping at length N.

Checking for a blank line in C - Regex

Goal:
Find if a string contains a blank line. Whether it be '\n\n',
'\r\n\r\n', '\r\n\n', '\n\r\n'
Issues:
I don't think my current regex for finding '\n\n' is right. This is my first time really using regex outside of simple use of * when removing files in command line.
Is it possible to check for all of these cases (listed above) in one regex? Or do I have to make 4 separate calls to compile_regex?
Code:
int checkForBlankLine(char *reader) {
regex_t r;
compile_regex(&r, "*\n\n");
match_regex(&r, reader);
return 0;
}
void compile_regex(regex_t *r, char *matchText) {
int status;
regcomp(r, matchText, 0);
}
int match_regex(regex_t *r, char *reader) {
regmatch_t match[1];
int nomatch = regexec(r, reader, 1, match, 0);
if (nomatch) {
printf("No matches.\n");
} else {
printf("MATCH!\n");
}
return 0;
}
Notes:
I only need to worry about finding one blank line, that's why my regmatch_t match[1] is only one item long
reader is the char array containing the text I am checking for a blank line.
I have seen other examples and tried to base the code off of those examples, but I still seem to be missing something.
Thank you kindly for the help/advice.
If anything needs to be clarified please let me know.
It seems that you have to compile the regex as extended:
regcomp(&re, "\r?\n\r?\n", REG_EXTENDED);
The first atom, \r? is probably unnecessary, because it doesn't add to the blank-line condition if you don't capture the result.
In the above, blank line really means empty line. If you want blank line to mean a line that has no characters except for white space, you can use:
regcomp(&re, "\r?\n[ \t]*\r?\n", REG_EXTENDED);
(I don't think you can use the space character pattern, \s here instead of [ \t], because that would include carriage return and new-line.)
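As a minimal sketch of the empty-line check, the first pattern above can be wrapped in one function that covers all four newline combinations from the question. The name has_empty_line and the return convention are assumptions for illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text contains an empty line,
 * 0 if it does not, and -1 if the pattern fails to compile. */
int has_empty_line(const char *text)
{
    regex_t re;
    int rc;

    /* The C compiler turns \r and \n into real CR and LF bytes, which the
     * regex matches literally. The ? operator needs REG_EXTENDED, and
     * REG_NOSUB is enough because only a yes/no answer is wanted. */
    if (regcomp(&re, "\r?\n\r?\n", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Because each \r is optional, the single pattern accepts \n\n, \r\n\r\n, \r\n\n and \n\r\n, so four separate compile calls are not needed.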
As others have already hinted at, the "simple use of * in the command line" is not a regular expression. This wildcard matching is called file globbing and has different semantics.
Check what the * in a regex means. It's not like the wildcard "anything" in the command line. The * means that the previous component can appear any amount of times. The wildcard in regex is the .. So if you want to say match anything you can do .*, which would be anything, any amount of times.
So in your case you can do .*\n\n.* which would match anything that has \n\n.
Finally, you can use or in a regex and ( ) to group stuff. So you can do something like .*(\n\n|\r\n\r\n).* And that would match anything that has a \n\n or a \r\n\r\n.
Hope that helps.
Rather than looking for only \r or \n, look for not \r or \n?
Your regex would simply be
'[^\r\n]'
and a match result of false indicates a blank line to your specification.

Parse SIP packet in C

I am trying to parse a SIP packet and get some information out of it. To be more specific, the packet looks like this
REGISTER sip:open-ims.test SIP/2.0
Via: SIP/2.0/UDP 192.168.1.64:5060;rport;branch=z9hG4bK1489975971
From: <sip:alice#open-ims.test>;tag=1627897650
To: <sip:alice#open-ims.test>
Call-ID: 1097412971
CSeq: 1 REGISTER
Contact: <sip:alice#192.168.1.64:5060;line=5fc3b39f127158d>;+sip.instance="<urn:uuid:46f525fe-3f60-11e0-bec1-d965d1488cfa>"
Authorization: Digest username="alice#open-ims.test", realm="open-ims.test", nonce=" ", uri="sip:open-ims.test", response=" "
Max-Forwards: 70
User-Agent: UCT IMS Client
Expires: 600000
Supported: path
Supported: gruu
Content-Length: 0
Now, from that packet I need to extract the following :
The value after "From: " ( in this case <sip:alice#open-ims.test> )
The value after "Contact: " ( in this case <sip:alice#192.168.1.64 )
The value after "username" ( in this case alice#open-ims.test )
My code so far is this
char * tch;
char * saved;
tch = strtok (payload,"<>;");
while (tch != NULL)
{
int savenext = 0;
if (!strcmp(tch, "From: "))
{
savenext = 1;
}
tch = strtok (NULL, "<>;");
if (savenext == 1)
{
saved = tch;
}
}
printf ("### SIP Contact: %s ###\n", saved);
}
}
Where payload contains the packet as described above.
However, when I run my program, it will result in a segmentation fault. The weird thing is that if I use in strtok the characters "<>;: " and in strcmp the value "sip" the message will parse successfully and it will keep the saved value. But I need to parse all three of the upper values.
Would a sip library help me more with my problem ?
Thanks in advance
I think something like this could work
char * tch;
char * saved;
tch = strtok (payload,"<>;\n\"");
while (tch != NULL)
{
if (strncmp(tch, "From",4)==0)
{
tch = strtok (NULL, "<>;\n\"");
saved = tch;
printf (" SIP From: %s \n", saved);
}
else if (strncmp(tch, "Contact",7)==0)
{
tch = strtok (NULL, "<>;\n\"");
saved = tch;
printf (" SIP Cont: %s \n", saved);
}
else if (strncmp(tch, "Authorization",13)==0)
{
tch = strtok (NULL, "<>;\n\"");
saved = tch;
printf (" SIP User: %s \n", saved);
}
/* advance to the next token; stop if a header had no value */
if (tch != NULL)
tch = strtok (NULL, "<>;\n\"");
}
Echoing the comment provided by Rup, I too would recommend using a library as all the heavy lifting has been done for you and you can spend more time focusing on what you are attempting to accomplish with the parsed information.
The GNU oSIP library may be a good place to start.
From the online documentation:
SIP parser:
==========
The initial feature implemented in
osip is a SIP parser. There is not
much to say about it: it is capable of
parsing and reformating SIP requests
and answers.
The details of the parsing tools
available are listed below:
1 SIP request/answer
2 SIP uri
3 specific headers
4 Via
5 CSeq
6 Call-ID
7 To, From, Route, Record-Route...
8 Authentication related headers
9 Content related headers
10 Accept related headers
11 ...
12 Generic header
13 Attachement parser (should support mime)
14 SDP parser
Use a parser if you possibly can. SIP syntax has a grammar so complex that many ABNF parsers can't handle the RFC 3261 ABNF. If you're still thinking writing it yourself is a good idea, you should get familiar with RFC 4475, the SIP torture tests because you should use them if this is going to interact with other systems, and because it will show you why it's so hard to get right.
Read each line and search for each of your substrings ('From:', 'Contact:', 'username') using strstr().
When you encounter a line that contains one of your keywords, split it with strtok() and extract the piece you need accordingly.
I don't know if you need a full-blown SIP lib for extracting these three things, but if you might need to parse more of the packet in the future, it might not be a bad idea.
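A minimal sketch of the strstr() approach for the three values in the question. The helper name sip_extract and its parameters are assumptions for illustration; it uses strcspn() instead of strtok() for the split so that the packet buffer is left unmodified. It is not a conforming SIP parser: strstr() finds the key anywhere in the packet, not only at the start of a header line.

```c
#include <string.h>

/* Hypothetical helper: copies the text that follows key (e.g. "From: <")
 * up to, but not including, the first character found in stops.
 * Returns 0 on success, -1 if key is absent or the buffer is empty. */
int sip_extract(const char *packet, const char *key, const char *stops,
                char *out, size_t outsz)
{
    const char *p = strstr(packet, key);
    size_t n;

    if (p == NULL || outsz == 0)
        return -1;

    p += strlen(key);
    n = strcspn(p, stops);       /* length up to the first stop character */
    if (n >= outsz)
        n = outsz - 1;           /* truncate rather than overflow */
    memcpy(out, p, n);
    out[n] = '\0';
    return 0;
}
```

For the packet in the question, the three lookups would use the keys "From: <", "Contact: <" and "username=\"" with the stop sets ">", ";>" and "\"" respectively.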
For strtok use with "<>;", I'd expect your packet to be split into something like the following (newlines removed)
REGISTER sip:open-ims.test SIP/2.0Via: SIP/2.0/UDP 192.168.1.64:5060
rport
branch=z9hG4bK1489975971From:
sip:alice#open-ims.test
None of these will match
if (!strcmp(tch, "From: "))
You'd either need to modify your parser or search through the string returned by strtok for "From: ".
strtok doesn't have to use the same set of delimiters every time. You can use a colon when you are expecting a field label and leave it off when you are expecting the right hand side.
