I am trying to parse a SIP packet and get some information out of it. To be more specific, the packet looks like this
REGISTER sip:open-ims.test SIP/2.0
Via: SIP/2.0/UDP 192.168.1.64:5060;rport;branch=z9hG4bK1489975971
From: <sip:alice#open-ims.test>;tag=1627897650
To: <sip:alice#open-ims.test>
Call-ID: 1097412971
CSeq: 1 REGISTER
Contact: <sip:alice#192.168.1.64:5060;line=5fc3b39f127158d>;+sip.instance="<urn:uuid:46f525fe-3f60-11e0-bec1-d965d1488cfa>"
Authorization: Digest username="alice#open-ims.test", realm="open-ims.test", nonce=" ", uri="sip:open-ims.test", response=" "
Max-Forwards: 70
User-Agent: UCT IMS Client
Expires: 600000
Supported: path
Supported: gruu
Content-Length: 0
Now, from that packet I need to extract the following :
The value after "From: " ( in this case <sip:alice#open-ims.test> )
The value after "Contact: " ( in this case <sip:alice#192.168.1.64 )
The value after "username" ( in this case alice#open-ims.test )
My code so far is this
char * tch;
char * saved;
tch = strtok (payload,"<>;");
while (tch != NULL)
{
int savenext = 0;
if (!strcmp(tch, "From: "))
{
savenext = 1;
}
tch = strtok (NULL, "<>;");
if (savenext == 1)
{
saved = tch;
}
}
printf ("### SIP Contact: %s ###\n", saved);
}
}
Where payload contains the packet as described above.
However, when I run my program, it will result in a segmentation fault. The weird thing is that if I use in strtok the characters "<>;: " and in strcmp the value "sip" the message will parse successfully and it will keep the saved value. But I need to parse all three of the upper values.
Would a sip library help me more with my problem ?
Thanks in advance
I think something like this could work
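char * tch;
char * saved;
const char * delims = "<>;\n\"";
tch = strtok (payload, delims);
while (tch != NULL)
{
    /* When a token starts with a header label, the next token is the value we want. */
    if (strncmp(tch, "From",4)==0)
    {
        tch = strtok (NULL, delims);
        saved = tch;
        if (saved != NULL)
            printf (" SIP From: %s \n", saved);
    }
    else if (strncmp(tch, "Contact",7)==0)
    {
        tch = strtok (NULL, delims);
        saved = tch;
        if (saved != NULL)
            printf (" SIP Cont: %s \n", saved);
    }
    else if (strncmp(tch, "Authorization",13)==0)
    {
        tch = strtok (NULL, delims);
        saved = tch;
        if (saved != NULL)
            printf (" SIP User: %s \n", saved);
    }
    /* Advance to the next token unless the input is already exhausted. */
    if (tch != NULL)
        tch = strtok (NULL, delims);
}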
Echoing the comment provided by Rup, I too would recommend using a library as all the heavy lifting has been done for you and you can spend more time focusing on what you are attempting to accomplish with the parsed information.
The GNU oSIP library may be a good place to start.
From the online documentation:
SIP parser:
==========
The initial feature implemented in
osip is a SIP parser. There is not
much to say about it: it is capable of
parsing and reformating SIP requests
and answers.
The details of the parsing tools
available are listed below:
1 SIP request/answer
2 SIP uri
3 specific headers
4 Via
5 CSeq
6 Call-ID
7 To, From, Route, Record-Route...
8 Authentication related headers
9 Content related headers
10 Accept related headers
11 ...
12 Generic header
13 Attachement parser (should support mime)
14 SDP parser
Use a parser if you possibly can. SIP syntax has a grammar so complex that many ABNF parsers can't handle the RFC 3261 ABNF. If you're still thinking writing it yourself is a good idea, you should get familiar with RFC 4475, the SIP torture tests because you should use them if this is going to interact with other systems, and because it will show you why it's so hard to get right.
Read each line and search for each of your substrings ('From:', 'Contact:', 'username') using strstr().
When you encounter a line that contains one of your keywords, split it with strtok() and extract the piece you need accordingly.
I don't know if you need a full-blown SIP lib for extracting these three things, but if you might need to parse more of the packet in the future, it might not be a bad idea.
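A minimal sketch of the strstr() approach described above. The helper name value_after and its signature are made up for illustration; it copies whatever follows the keyword up to the end of that line, and leaves any further splitting (for example cutting at '>' or '"') to the caller. Note that strstr() finds the keyword anywhere in the packet, not only at the start of a line, so this is only reasonable for simple, trusted input.

```c
#include <string.h>

/* Hypothetical helper: copy the text that follows `key` (e.g. "From: ")
   up to the end of that line into `out`.
   Returns 0 on success, -1 if `key` is not found or `out` has no room. */
int value_after(const char *packet, const char *key, char *out, size_t outsz)
{
    const char *p = strstr(packet, key);
    if (p == NULL || outsz == 0)
        return -1;

    p += strlen(key);                 /* skip the keyword itself */
    size_t n = strcspn(p, "\r\n");    /* length up to CR or LF (or end of string) */
    if (n >= outsz)
        n = outsz - 1;                /* truncate rather than overflow */

    memcpy(out, p, n);
    out[n] = '\0';
    return 0;
}
```

Calling value_after(payload, "From: ", buf, sizeof buf) on the packet from the question would leave the whole remainder of the From line in buf, which can then be cut down with strtok() or strcspn().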
For strtok use with "<>;", I'd expect your packet to be split into something like the following (newlines removed)
REGISTER sip:open-ims.test SIP/2.0Via: SIP/2.0/UDP 192.168.1.64:5060
rport
branch=z9hG4bK1489975971From:
sip:alice#open-ims.test
None of these will match
if (!strcmp(tch, "From: "))
You'd either need to modify your parser or search through the string returned by strtok for "From: ".
strtok doesn't have to use the same set of delimiters every time. You can use a colon when you are expecting a field label and leave it off when you are expecting the right hand side.
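As a rough illustration of changing the delimiter set between calls, the sketch below splits a single header line: a colon ends the label, and then space and angle brackets delimit the value. The function name split_header is an assumption made for this example, and it only suits headers whose value is a single bracketed URI, such as From or Contact.

```c
#include <string.h>

/* Hypothetical helper: split one header line in place.
   `line` must be writable because strtok() overwrites delimiters.
   Returns 0 on success, -1 if the label or the value is missing. */
int split_header(char *line, char **label, char **value)
{
    /* First call: the label ends at the colon. */
    *label = strtok(line, ":");
    if (*label == NULL)
        return -1;

    /* Second call, different delimiters: skip the space and '<',
       then stop at '>' (or at the next space). */
    *value = strtok(NULL, " <>");
    return *value != NULL ? 0 : -1;
}
```

For a line such as the From header in the question, label would point at "From" and value at the URI between the angle brackets.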
Is there a way to use a non-greedy regular expression in C like one can use in Perl?
I tried several things, but it's actually not working.
I'm currently using this regex that matches an IP address and the corresponding HTTP request, but it's greedy although I'm using the *?:
([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1
In this example, it always matches the whole string:
#include <regex.h>
#include <stdio.h>
int main() {
int a, i;
regex_t re;
regmatch_t pm;
char *mpages = "TEST 127.0.0.1 GET /test.php HTTP/1.1\" 404 525 \"-\" \"Mozilla/5.0 (Windows NT HTTP/1.1 TEST";
a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1", REG_EXTENDED);
if(a!=0)
printf(" -> Error: Invalid Regex");
a = regexec(&re, &mpages[0], 1, &pm, REG_EXTENDED);
if(a==0) {
for(i = pm.rm_so; i < pm.rm_eo; i++)
printf("%c", mpages[i]);
printf("\n");
}
return 0;
}
$ ./regtest
127.0.0.1 GET /test.php HTTP/1.1" 404 525 "-" "Mozilla/5.0 (Windows NT HTTP/1.1
No, there are no non-greedy quantifiers in POSIX regular expressions. But there is a library that provides perl-like regular expressions for C: http://www.pcre.org/
As I said earlier in a comment, use grep -E to run tests with POSIX regexes; that way development goes faster. Either way, it seems your problem is with the regular expression itself rather than with the missing feature.
I'm not quite clear on what you want to grab from the request. Supposing you just want the IP address, the HTTP verb and the resource, one could end up with the following regex.
regcomp(&re, "\\b(.?[0-9])+\\s+(GET|POST|PUT)\\s+([^ ]+)", REG_EXTENDED);
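Be aware that several assumptions have been made. For example, this regex assumes the IP address is well formed, and it assumes the request uses one of the HTTP verbs GET, POST or PUT. It also relies on \\b and \\s, which are not part of POSIX extended regular expressions and are only accepted by some implementations as extensions. Edit according to your needs.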
The brute-force method of getting a regex to match up to the next occurrence of a word is:
"([^H]|H[^T]|HT[^T]|HTT[^P]|HTTP[^/]|HTTP/[^1]|HTTP/1[^.]|HTTP/1\\.[^1])*HTTP/1\\.1"
unless you can get smarter about your match -- which you can: HTTP requests are
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
and none of the nonterminals on the right match embedded spaces. So:
"[0-9]{1,3}(\\.[0-9]{1,3}){3} [^ ]* [^ ]* HTTP/1\\.1"
since you're only allocating space for the whole-expression match, or put the parens back in to get pieces.
a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1", REG_EXTENDED|REG_ENHANCED);
REG_ENHANCED is not part of POSIX, and older SDK versions did not have this macro. The header only defines it under this condition:
#if __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_10_8 \
|| __IPHONE_OS_VERSION_MIN_REQUIRED >= __IPHONE_6_0
#define REG_ENHANCED 0400 /* Additional (non-POSIX) features */
#endif
In your code, pm should be an array of regmatch_t, and in your case, should have at least 2 to 4 elements, depending upon which () sub-expressions you want to capture.
You have only one element. The first element, pm[0], always gets whatever text matches your entire RE. That's the one you'll be getting. It is pm[1] that will get the text of the first () sub-expression (the IP address), and pm[3] that will get the text matching your (.*?) term.
But even so, as stated above (by Wumbley, W. Q.) the POSIX regex library may not support non-greedy quantifiers.
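Putting the two points together (an array of regmatch_t and a pattern that avoids the non-greedy quantifier), a sketch could look like the following. The function names parse_request and copy_group are invented for this example. The pattern uses [^ ]* for the method and the URI instead of (.*?), so the capture groups are: 1 the IP address, 2 the repeated octet, 3 the method, 4 the URI.

```c
#include <regex.h>
#include <string.h>

/* Copy capture group `g` of a successful match into `out` (truncating if needed). */
static void copy_group(const char *s, const regmatch_t *pm, int g,
                       char *out, size_t outsz)
{
    size_t n = (size_t)(pm[g].rm_eo - pm[g].rm_so);
    if (n >= outsz)
        n = outsz - 1;
    memcpy(out, s + pm[g].rm_so, n);
    out[n] = '\0';
}

/* Hypothetical helper: returns 0 and fills ip/method/uri (each `sz` bytes)
   when the line matches, -1 otherwise. */
int parse_request(const char *line, char *ip, char *method, char *uri, size_t sz)
{
    regex_t re;
    regmatch_t pm[5];   /* pm[0] is the whole match, pm[1..4] the groups */
    int rc = -1;

    if (sz == 0)
        return -1;
    if (regcomp(&re,
                "([0-9]{1,3}(\\.[0-9]{1,3}){3}) ([^ ]*) ([^ ]*) HTTP/1\\.1",
                REG_EXTENDED) != 0)
        return -1;

    /* nmatch must equal the array size; eflags is 0, not REG_EXTENDED. */
    if (regexec(&re, line, 5, pm, 0) == 0) {
        copy_group(line, pm, 1, ip, sz);
        copy_group(line, pm, 3, method, sz);
        copy_group(line, pm, 4, uri, sz);
        rc = 0;
    }
    regfree(&re);
    return rc;
}
```

Because neither [^ ]* can cross a space, the match stops at the first HTTP/1.1 that follows the request URI, which is the effect the non-greedy quantifier was meant to achieve.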
I want to parse the actual payload from the output of AT commands.
For instance: in the example below, I'd want to read only "2021/11/16,11:12:14-32,0"
AT+QLTS=1 // command
+QLTS: "2021/11/16,11:12:14-32,0" // response
OK
In the following case, I'd need to only read 12345678.
AT+CIMI // command
12345678 // example response
So the point is: not all commands have the same format for the output. We can assume the response is stored in a string array.
I have GetAtCmdRsp() already implemented which stores the response in a char array.
void GetPayload()
{
char rsp[100] = {0};
GetAtCmdRsp("AT+QLTS=1", rsp);
// rsp now contains +QLTS: "2021/11/16,11:12:14-32,0"
// now, I need to parse "2021/11/16,11:12:14-32,0" out of the response
memset(rsp, 0, sizeof(rsp));
GetAtCmdRsp("AT+CIMI", rsp);
// rsp now contains 12345678
// no need to do additional parsing since the output already contains the value I need
}
I was thinking of doing char *start = strstr(rsp, ":") + 1; to get the start of the payload but some responses may only contain the payload as it's the case with AT+CIMI
Perhaps could regex be a good idea to determine the pattern +<COMMAND>: in a string?
In order to parse AT command responses a good starting point is understanding all the possible formats they can have. So, rather than implementing a command specific routine, I would discriminate commands by "type of response":
Commands with no payload in their answers, for example
AT
OK
Commands with no header in their answers, such as
AT+CIMI
12345678
OK
Commands with a single header in their answers
AT+QLTS=1
+QLTS: "2021/11/16,11:12:14-32,0"
OK
Commands with multi-line responses. Every line could be of "single header" type, like in +CGDCONT:
AT+CGDCONT?
+CGDCONT: 1,"IP","epc.tmobile.com","0.0.0.0",0,0
+CGDCONT: 2,"IP","isp.cingular","0.0.0.0",0,0
+CGDCONT: 3,"IP","","0.0.0.0",0,0
OK
Or we could even have mixed types, like in +CMGL:
AT+CMGL="ALL"
+CMGL: 1,"REC READ","+XXXXXXXXXX","","21/11/25,10:20:00+00"
Good morning! How are you?
+CMGL: 2,"REC READ","+XXXXXXXXXX","","21/11/25,10:33:33+00"
I'll come a little late. See you. Bruce Wayne
OK
(please note how it could have also "empty" lines, that is \r\n).
At the moment I cannot think of any other scenario. In this way you'll be able to define an enum like
typedef enum
{
AT_RESPONSE_TYPE_NO_RESPONSE,
AT_RESPONSE_TYPE_NO_HEADER,
AT_RESPONSE_TYPE_SINGLE_HEADER,
AT_RESPONSE_TYPE_MULTILINE,
AT_RESPONSE_TYPE_MAX
} AtResponseType; /* the type name is only illustrative */
and pass it to your GetAtCmdRsp( ) function in order to parse the response accordingly. Whether you implement the differentiation in that function, after it, or in an external function is your choice.
A solution without explicit categorization
Once you have clear all the scenarios that might ever occur, you can think about a general algorithm working for all of them:
Get the full response resp after the command echo and before the closing OK or ERROR. Make sure that the trailing \r\n\r\nOK is removed (or \r\nERROR, or \r\nNO CARRIER, or whatever the terminating message of the response might be). Also make sure to remove the command echo
If strlen( resp ) == 0 we belong to the NO_RESPONSE category, and the job is done
If the response contains \r\ns in it, we have a MULTILINE answer. So, tokenize it and place every line into an array element resp_arr[i]. Make sure to remove trailing \r\n
For every line in the response (for every resp_arr[i] element), search for the +<CMD>: pattern (not only :, which might be contained in the payload as well!). Something like this:
char *payload;
char *header = strstr( resp_cur_line, "+YOURCMD: " ); // haystack first, then needle
if( header == NULL )
{
// We are in "NO_HEADER" case
payload = resp_cur_line;
}
else
{
// We are in "HEADER" case: the payload starts right after the header
payload = header + strlen( "+YOURCMD: " );
}
Now payload pointer points to the actual payload.
Please note how, in the case of a MULTILINE answer, after splitting the lines into array elements every loop iteration will also handle the mixed scenarios correctly, like the one in +CMGL, as you'll be able to distinguish the lines containing the header from those containing data (and from the empty lines, of course). For a deeper analysis of +CMGL response parsing have a look at this answer.
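The steps above could be sketched roughly as follows, assuming the command echo and the final OK/ERROR have already been stripped from the response. The names line_payload and split_payloads are hypothetical. This variant checks for the header with strncmp() at the start of the line rather than with strstr(), so a header-like string inside message text is not mistaken for a header.

```c
#include <string.h>

/* Return the payload of one response line: skip `header`
   (e.g. "+QLTS: ") if the line starts with it, else return the line. */
const char *line_payload(const char *line, const char *header)
{
    size_t hlen = strlen(header);
    return strncmp(line, header, hlen) == 0 ? line + hlen : line;
}

/* Split `resp` in place on CR/LF and store the payload of every
   non-empty line in `out` (at most `max` entries).
   Returns the number of payloads stored; 0 means the NO_RESPONSE case. */
size_t split_payloads(char *resp, const char *header,
                      const char *out[], size_t max)
{
    size_t n = 0;
    /* strtok() treats runs of delimiters as one, so empty lines are skipped. */
    for (char *line = strtok(resp, "\r\n");
         line != NULL && n < max;
         line = strtok(NULL, "\r\n"))
    {
        out[n++] = line_payload(line, header);
    }
    return n;
}
```

With a +CMGL response, the header lines come back without the "+CMGL: " prefix and the message text lines come back unchanged, so the caller still has to pair them up.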
I'm trying to achieve the following without any success:
Removing the opening
message "
and trailing
"
while leaving the content in between, and saving it into my variable, using sscanf regular expressions.
I wrote the following code:
sscanf( buffer, "message \"%[^\"]", message)
Which works good when I have something like message "Hey there", but when I'm trying the following string, I get only the white space between the two quotation marks.
message " """ This is a Test """ "
The result for this should be """ This is a Test """
Is there a way to upgrade my expression so it will include this extreme event of message? I tried to look it up both in google and here, and couldn't find an elegant answer. I'm aware that it's possible using string manipulation with a lot lines of code, but I'm trying something more simple here.
P.S. The trailing " is the end of the expression, and is a must by the program, after that comes nothing.
Thanks in advance for the feedback!
If you're fine with not using regex for the whole thing:
Original version:
sscanf(buffer, "message \"%[^$]", message); // remove 'message "'
message[strlen(message) - 1] = '\0'; // remove trailing '"'
Safe, correct, and generic version:
char* buffer = ...;
const char* prefix = "message \"";
const char* suffix = "\"";
if (strstr(buffer, prefix) != buffer) {
// error, doesn't start with `prefix`
}
buffer += strlen(prefix);
char* suffixStart = strrchr(buffer, suffix[0]);
if (!suffixStart || strcmp(suffixStart, suffix) != 0) {
// error, doesn't end with `suffix`
}
*suffixStart = '\0'; // strip `suffix`
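The same idea can be wrapped in a function that does not modify the input. This is only a sketch under the assumptions stated in the question (the string starts with message " and the closing quote is the very last character); the name extract_message is made up for the example.

```c
#include <string.h>

/* Hypothetical helper: copy the text between the leading `message "`
   and the final `"` into `out`.
   Returns 0 on success, -1 on malformed input or if `out` is too small. */
int extract_message(const char *in, char *out, size_t outsz)
{
    static const char prefix[] = "message \"";
    size_t plen = sizeof prefix - 1;

    if (strncmp(in, prefix, plen) != 0)
        return -1;                      /* does not start with the prefix */

    const char *start = in + plen;
    const char *end = strrchr(start, '"');
    if (end == NULL || end[1] != '\0')
        return -1;                      /* last character is not the closing quote */

    size_t n = (size_t)(end - start);
    if (n >= outsz)
        return -1;                      /* not enough room for content plus terminator */

    memcpy(out, start, n);
    out[n] = '\0';
    return 0;
}
```

Quotes inside the content are kept as they are, and so are the spaces just inside the outer quotes; trim those separately if they are not wanted.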
I need to parse manually, without external libraries, a JSON message coming from a server, in C language.
The message coming from server would be like:
{[CR+LF]
"Tmg": "R",[CR+LF]
"STP": 72[CR+LF]
}[CR+LF]
or
{[CR+LF]
"Tmg": "R",[CR+LF]
"STP": 150[CR+LF]
}[CR+LF]
I need the number after STP:. The number is different in each message, so I need to get that number from the JSON structure. I can't use external libraries because this code is in an embedded system and external code is not allowed.
I tried the following:
int main (){
const char response_message[35] = "{\r\n\"Tmg\":\"R\",\r\n\"STP\":72,\r\n}";
const char needle[8] = "P\":";
char *ret;
ret = strstr(response_message, needle);
printf("The number is: %s\n", ret);
return 0;
}
But obviously, I am getting this result:
The number is: P":72,
}
So I need to only get the number, how can I get this?
Thanks
You can use a hacked solution. Use strstr () to find "STP": then find the following , or } and extract the digits in between.
And that's a hack. Not guaranteed to work. For something that's guaranteed to work, you use a JSON parser.
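For what it's worth, the hack could look something like this. The function name get_stp is hypothetical, and the code assumes the key appears exactly as "STP" and that its value is a plain integer. It lets strtol() stop at the first non-digit instead of searching for the following , or }. It is not a JSON parser and will be fooled by, for instance, the same text appearing inside a string value.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find "STP" and read the integer after the colon.
   Returns 0 and stores the number in *value, or -1 if it is not found. */
int get_stp(const char *json, long *value)
{
    static const char key[] = "\"STP\"";
    const char *p = strstr(json, key);
    if (p == NULL)
        return -1;

    p += sizeof key - 1;                    /* move past the quoted key */
    while (isspace((unsigned char)*p))      /* allow whitespace before ':' */
        p++;
    if (*p != ':')
        return -1;
    p++;

    char *end;
    long v = strtol(p, &end, 10);           /* skips leading whitespace itself */
    if (end == p)
        return -1;                          /* no digits after the colon */

    *value = v;
    return 0;
}
```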