Remove byte-order-mark in R/C - c

This SO post has an example of a server that generates json with a byte order mark. RFC7159 says:
Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
Currently yajl and hence jsonlite choke on the BOM. I would like to follow the RFC suggestion and ignore the BOM from the UTF8 string if present. What is an efficient way to do this? A naive implementation:
if(substr(json, 1, 1) == "\uFEFF"){
json <- substring(json, 2)
}
However substr is a bit slow for large strings, and I am not sure this is the correct way to do this. Is there a more efficient way in R or C to remove the BOM if present?

A simple solution:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
std::string stripBom(std::string x) {
if (x.size() < 3)
return x;
if (x[0] == '\xEF' && x[1] == '\xBB' && x[2] == '\xBF')
return x.substr(3);
return x;
}
/*** R
x <- "\uFEFFabcdef"
print(x)
print(stripBom(x))
identical(x, stripBom(x))
utf8ToInt(x)
utf8ToInt(stripBom(x))
*/
gives
> x <- "\uFEFFabcdef"
> print(x)
[1] "abcdef"
> print(stripBom(x))
[1] "abcdef"
> identical(x, stripBom(x))
[1] FALSE
> utf8ToInt(x)
[1] 65279 97 98 99 100 101 102
> utf8ToInt(stripBom(x))
[1] 97 98 99 100 101 102
EDIT: What might also be useful is seeing how R does it internally -- there are a number of situations where R strips BOM (e.g. for its scanners and file readers). See:
https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/scan.c#L455-L458
https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/connections.c#L3950-L3957

Based on Kevin's Rcpp example I used the following C function to check for the BOM:
SEXP R_parse(SEXP x) {
/* get data from R */
const char* json = translateCharUTF8(asChar(x));
/* ignore BOM as suggested by RFC */
if(json[0] == '\xEF' && json[1] == '\xBB' && json[2] == '\xBF'){
warning("JSON string contains UTF8 byte-order-mark!");
json = json + 3;
}
/* parse json */
char errbuf[1024];
yajl_val node = yajl_tree_parse(json, errbuf, sizeof(errbuf));
}
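The same check can also be pulled out into a small stand-alone C helper. This is only a sketch: the name skip_utf8_bom is hypothetical (it is not part of yajl, jsonlite, or R), and it assumes the input is a NUL-terminated UTF-8 string.

```c
#include <string.h>

/* Hypothetical helper: returns a pointer just past a leading UTF-8 BOM
 * (bytes EF BB BF), or the original pointer if there is none.
 * Nothing is copied, so the cost does not depend on the string length. */
const char *skip_utf8_bom(const char *s)
{
    /* strncmp stops at the terminating NUL, so inputs shorter than
     * three bytes are handled without reading past the end. */
    if (strncmp(s, "\xEF\xBB\xBF", 3) == 0)
        return s + 3;
    return s;
}
```

Because it only advances a pointer, the original buffer stays untouched and must outlive the returned pointer.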

Related

regex matches wrong strings

I'm trying to use regex.h with no success. I'm trying to match an IP address
#include <stdio.h>
#include <regex.h>
#define No_Regex_Flags 0
void check_RE(char * r, regex_t RE)
{
printf ("%s - %s\n", r, !regexec(&RE, r, 0, NULL, 0) ? "Match" : "No Match");
}
int main ()
{
regex_t regex;
int ret = regcomp(&regex, "[0-9]{1,3}.{3}[0-9]{1,3}", No_Regex_Flags);
if(ret)
printf("err1\n");
char RE_list[][32] =
{
"0.0.0.0",
"123.456.789.123",
"a.b.c.d",
"1.2.34.567",
"1111.1.1.1",
".1.1.1",
"1,1,1,1"
};
for(int i = 0; i < sizeof(RE_list) / sizeof(RE_list[0]); i++)
check_RE(RE_list[i], regex);
return 0;
}
However, the output I get is always a match:
0.0.0.0 - Match
123.456.789.123 - Match
a.b.c.d - Match
1.2.34.567 - Match
1111.1.1.1 - Match
.1.1.1 - Match
1,1,1,1 - Match
Why is that?
Use
int ret = regcomp(&regex, "^([0-9]{1,3}\\.){3}[0-9]{1,3}$", REG_EXTENDED);
Or, a more efficient one:
int ret = regcomp(&regex, "^[0-9]{1,3}(\\.[0-9]{1,3}){3}$", REG_EXTENDED);
See this regex demo, which also matches wrong IP addresses like 1.2.34.567 and 123.456.789.123. So, I'd suggest a more precise one (source: regular-expressions.info):
"^(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(\\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}$"
See this regex demo.
See the C demo. The output is
0.0.0.0 - Match
123.456.789.123 - No Match
a.b.c.d - No Match
1.2.34.567 - No Match
1111.1.1.1 - No Match
.1.1.1 - No Match
1,1,1,1 - No Match
POIs
The dot matches any char, so you must escape it with \\ in the pattern
If you have to repeat a sequence of patterns, you need to group them and quantify the group: [0-9]{1,3}\\.{3} => ([0-9]{1,3}\\.){3}
To match the whole string, you need anchors, ^ and $ around the pattern
You need to pass the REG_EXTENDED flag to regcomp if you plan to use ( ) for grouping and {3} as a quantifier without having to escape them. Else, you would have to follow the BRE POSIX specs and write the group as \( \) and a limiting quantifier like \{3\}
As [0-9]{1,3} matches any sequence of 1 to 3 digits, the original pattern is not really validating IP addresses, so you need to restrict the octet values to 0..255. Thus, an alternation group (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]) should be used to match one octet.
Here is the octet pattern explanation:
25[0-5] - 250 to 255
| - or
2[0-4][0-9] - 200 to 249
| - or
1[0-9][0-9] - 100 to 199
| - or
[1-9]?[0-9] - 0 to 99.
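Putting the points above together, a minimal sketch of a validation helper could look like this. The function name is_ipv4 is hypothetical; the pattern is the precise one quoted above, and REG_NOSUB is added only because no capture groups are needed.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s is a dotted-quad IPv4 address
 * with every octet in 0..255, and 0 otherwise (also if regcomp fails). */
int is_ipv4(const char *s)
{
    static const char pattern[] =
        "^(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
        "(\\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}$";
    regex_t re;
    int matched;

    /* REG_EXTENDED enables unescaped ( ), | and {3} */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re); /* release the memory held by the compiled pattern */
    return matched;
}
```

Compiling the pattern on every call keeps the sketch short; if the check runs often, compiling once and reusing the regex_t would be the obvious refinement.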

C String Checking

I'm new to C specifically and I'm trying to check some strings.
The following is my code, commented to indicate the issues that I don't understand:
if (strstr(recBuff, "GET / HTTP/1.0\r\n\r\n") != NULL)
//Send HTTP/1.0 200
//This gets recognised fine
else if (strstr(recBuff, "GET / HTTP/1.0\r\r") != NULL)
//Send HTTP/1.0 200
//This gets recognised fine
else if (strstr(recBuff, "GET / HTTP/1.0\r\n") != NULL)
//Do something else
//This never gets picked up, and instead goes to the final else...
else
//HTTP/1.0 404
//Etc
I guess my question is why is strstr picking up \r\n\r\n and acting on it, but just \r\n by itself goes through all the way until the final else? There's an else for \r\n\r\n that works, but the else for a single \r\n doesn't work for a single \r\n.
TL;DR "GET / HTTP/1.0\r\n\r\n" gets picked up, but "GET / HTTP/1.0\r\n" doesn't.
You've not reduced your code to an SSCCE (Short, Self-Contained, Correct Example), so we can't tell what you're doing wrong. It is most likely that the data you think has two carriage returns doesn't actually contain two adjacent carriage returns. Only some sort of hex dump or something similar will show that for sure.
Here's an SSCCE which shows that your code can work if given the correct data:
#include <stdio.h>
#include <string.h>
int main(void)
{
char *examples[] =
{
"YYYYGET / HTTP/1.0\r\nExample 1 Single CRLF",
"YYYYGET / HTTP/1.0\r\n\r\nExample 2 Double CRLF",
"YYYYGET / HTTP/1.0\r\r\nExample 3 Double CR",
"YYYYGET / HTTP/1.0\n\nExample 4 Double NL",
};
for (int i = 0; i < 4; i++)
{
char *recBuff = examples[i];
printf("Data:\n%s\n", recBuff);
if (strstr(recBuff, "GET / HTTP/1.0\r\n\r\n") != NULL)
printf("Option 1 - double CRLF\n");
else if (strstr(recBuff, "GET / HTTP/1.0\r\r") != NULL)
printf("Option 2 - double CR\n");
else if (strstr(recBuff, "GET / HTTP/1.0\r\n") != NULL)
printf("Option 3 - single CRLF\n");
else
printf("Option 4 - no match\n");
}
return 0;
}
Sample output
$ ./counter-example
Data:
YYYYGET / HTTP/1.0
Example 1 Single CRLF
Option 3 - single CRLF
Data:
YYYYGET / HTTP/1.0
Example 2 Double CRLF
Option 1 - double CRLF
Data:
YYYYGET / HTTP/1.0
Example 3 Double CR
Option 2 - double CR
Data:
YYYYGET / HTTP/1.0
Example 4 Double NL
Option 4 - no match
$
So, if you are not seeing something similar with your code, you aren't getting the data you thought you were getting.
The YYYY part is not necessary to the reproduction; neither is the Example n information. The trailing part makes sure the fairly difficult to discriminate strings are recognizable; the YYYY is arguably fluff since the HTTP protocol would not start with such garbage.
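As for the hex dump mentioned earlier in this answer, here is a rough sketch of one. The helper name hex_dump and its buffer-based interface are assumptions made for illustration; any hex viewer would do the same job.

```c
#include <stddef.h>

/* Hypothetical helper: renders len bytes of buf into out as two-digit
 * lowercase hex values separated by spaces, so CR shows up as 0d and
 * LF as 0a. Output is truncated if out_size is too small.
 * Returns the number of characters written, excluding the NUL. */
size_t hex_dump(const char *buf, size_t len, char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    size_t n = 0;

    /* each byte needs at most 3 characters, plus room for the final NUL */
    for (size_t i = 0; i < len && n + 3 < out_size; i++) {
        unsigned char c = (unsigned char)buf[i];
        if (i > 0)
            out[n++] = ' ';
        out[n++] = digits[c >> 4];
        out[n++] = digits[c & 0x0F];
    }
    if (out_size > 0)
        out[n] = '\0';
    return n;
}
```

Calling it on recBuff with the number of bytes actually received and printing the result shows whether the request really ends in 0d 0a, 0d 0a 0d 0a, or something else.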

Formal grammar of XML

I'm trying to build a small parser for XML files in C. I know I could find some finished solutions, but I need just some basic stuff for an embedded project. I'm trying to create a grammar for describing XML without attributes, just tags, but it seems it is not working and I was not able to figure out why.
Here is the grammar:
XML : FIRST_TAG NIZ
NIZ : VAL NIZ | eps
VAL : START VAL END
| STR
| eps
Here is the part of the C code that implements this grammar:
void check() {
getSymbol();
if( sym == FIRST_LINE )
{
niz();
}
else {
printf("FIRST_LINE EXPECTED");
exit(1);
}
}
void niz() {
getSymbol();
if( sym == ERROR )
return;
if( sym == START ) {
back = 1;
val();
niz();
}
printf(" EPS OR START EXPECTED\n");
}
void val() {
getSymbol();
if( sym == ERROR )
return;
if( sym == START ) {
back = 0;
val();
getSymbol();
if( sym != END ) {
printf("END EXPECTED");
exit(1);
}
return;
}
if( sym == EMPTY_TAG || sym == STR)
return;
printf("START, STR, EMPTY_TAG OR EPS EXPECTED\n");
exit(1);
}
void getSymbol() {
int pom;
if(back == 1) {
back = 0;
return;
}
sym = getNextToken(cmd + offset, &pom);
offset += pom + 1;
}
EDIT: Here is the example of XML file that does not satisfy this grammar:
<?xml version="1.0"?>
<VATCHANGES>
<DATE>15/08/2012</DATE>
<TIME>1452</TIME>
<EFDSERIAL>01KE000001</EFDSERIAL>
<CHANGENUM>1</CHANGENUM>
<VATRATE>A</VATRATE>
<FROMVALUE>16.00</FROMVALUE>
<TOVALUE>18.00</TOVALUE>
<VATRATE>B</VATRATE>
<FROMVALUE>2.00</FROMVALUE>
<TOVALUE>0.00</TOVALUE>
<VATRATE>C</VATRATE>
<FROMVALUE>5.00</FROMVALUE>
<TOVALUE>0.00</TOVALUE>
<DATE>25/05/2010</DATE>
<CHANGENUM>2</CHANGENUM>
<VATRATE>C</VATRATE>
<FROMVALUE>0.00</FROMVALUE>
<TOVALUE>4.00</TOVALUE>
</VATCHANGES>
It gives END EXPECTED at the output.
First, your grammar needs some work. Assuming the preamble is handled correctly, you have a basic error in the definition of NIZ.
NIZ : VAL NIZ | eps
VAL : START VAL END
| STR
| eps
So we enter NIZ and we look for VAL first. The problem is the eps at the end of both VAL's possible productions and NIZ's. If VAL produces nothing (i.e. eps), it consumes no tokens in the process (and it must not, since eps is the production), so NIZ reduces to:
NIZ: eps NIZ | eps
which isn't good.
Consider something more along these lines. I just spewed this out with no real foresight into having something beyond a purely basic construction.
XML: START_LINE ELEMENT
ELEMENT: OPENTAG BODY CLOSETAG
OPENTAG: lt id(n) gt
CLOSETAG: lt fs id(n) gt
BODY: ELEMENT | VALUE
VALUE: str | eps
This is super basic. Terminals include:
lt: '<'
gt: '>'
fs: '/'
str: any alphanumeric string excluding chars lt or gt.
id(n): any alphanumeric string excluding chars lt, gt, or fs.
I can almost feel the wrath of the XML purists raining down on me right now, but the point I'm trying to get across is that, when a grammar is well-defined, the RDP (recursive descent parser) will literally write itself. Obviously the lexer (i.e. the token engine) needs to handle the terminals accordingly. Note: the id(n) is an id-stack to ensure you properly close the innermost tag, and is an attribute of your parser in accordance with how it manages tag ids. It's not traditional, but it makes things MUCH easier.
This can/should clearly be expanded to include stand-alone element declarations and short-cut element closure. For example, this grammar allows for elements of this form:
<ElementName>...</ElementName>
but not of this form:
<ElementName/>
Nor does it account for short-cut termination such as:
<ElementName>...</>
Accounting for such additions will obviously complicate the grammar considerably, but also make the parser substantially more robust. Like I said, the sample above is basic with a capital B. If you're really going to embark on this, these are things you want to consider when designing your grammar, and consequently your RDP.
Anyway, just consider how a few reworks in your grammar can/will substantially make this easier on you.
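To illustrate how directly such a grammar maps to code, here is a rough recursive descent sketch. It is not the grammar above verbatim: it works on the raw string instead of lexer tokens, treats the start line as optional, and lets BODY hold any number of child elements and text runs (which the sample document in the question needs). All function names are hypothetical. The recursion plays the role of the id stack, since each call remembers its own tag name.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static const char *skip_ws(const char *p)
{
    while (*p && isspace((unsigned char)*p))
        p++;
    return p;
}

/* Length of a tag name: a run without '<', '>', '/' or whitespace. */
static size_t name_len(const char *p)
{
    size_t n = 0;
    while (p[n] && p[n] != '<' && p[n] != '>' && p[n] != '/' &&
           !isspace((unsigned char)p[n]))
        n++;
    return n;
}

/* ELEMENT: OPENTAG BODY CLOSETAG.
 * Returns a pointer past the element, or NULL on a syntax error. */
static const char *parse_element(const char *p)
{
    if (*p != '<')
        return NULL;
    const char *name = p + 1;
    size_t len = name_len(name);
    if (len == 0 || name[len] != '>')
        return NULL;                 /* OPENTAG must be <name> */
    p = name + len + 1;

    /* BODY: text runs and child elements until a close tag starts */
    for (;;) {
        if (*p == '\0')
            return NULL;             /* input ended before CLOSETAG */
        if (*p != '<') {
            p++;                     /* part of a text value */
            continue;
        }
        if (p[1] == '/')
            break;
        p = parse_element(p);        /* nested element */
        if (p == NULL)
            return NULL;
    }

    /* CLOSETAG: </name> must repeat the name of the innermost open tag */
    p += 2;
    if (strncmp(p, name, len) != 0 || p[len] != '>')
        return NULL;
    return p + len + 1;
}

/* XML: optional "<?xml ... ?>" start line, then exactly one element. */
int xml_is_well_formed(const char *p)
{
    p = skip_ws(p);
    if (strncmp(p, "<?xml", 5) == 0) {
        const char *end = strstr(p, "?>");
        if (end == NULL)
            return 0;
        p = skip_ws(end + 2);
    }
    p = parse_element(p);
    return p != NULL && *skip_ws(p) == '\0';
}
```

Attributes, <ElementName/> and comments are deliberately left out, just like in the grammar.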

C - Read in file according to a format

I am trying to read a file in a specific file format in C.
The file contains some data items. Every data item is separated by a flag.
The file should look like this:
file-header: "FIL0"
file-id: 0x1020304
flag : 0|1 : uint8_t
length : uint32_t
char[length] : in utf-8
so its: [File-Header] [FileID] [Flag | Length | Data ] [Flag | Length | Data] ...
--> "FIL0" | 0xFFFFFF | 0 or 1 | Data as char[] | 0 or 1 | ... (next data item) ....
My Problem occurs when reading in the file. My idea is to open the file and scan through it using some sscanf-magic.
FILE *fp;
fp = fopen("data.dat", "r");
/* scan file for data components */
while (fgets(buffer, sizeof buffer, fp) != NULL) /* read in file */
{
/* scan for sequence */
if (sscanf(buffer, "%5s", fil0_header) == 1) /* if the "FIL0" header is found */
{
printf("FIL0-header found: %s\n", buffer);
// proceed and scan for [FLAG] [LENGTH] [DATA]
// sscanf()
if (sscanf(buffer, "%u", node) == 1)
{
// doesnt seem to work
}
// read in length of string and extract stringdata
}
else
{
printf("FIL0-Header not found, found instead: %s\n", buffer);
// do something
}
}
My problem is that I have a hard time with my buffer and the varying data types in the file.
The comparison of the FIL0 header works alright, but:
how to read in the next hexadecimal number (sscanf using %D)
how to scan for the flag, which is 1 byte
how to extract the length, which is 4 bytes
A problem is that the check for the flag starts at the beginning of the buffer, but the pointer should be moved on after the FIL0 header is found.
I'd be grateful for any help!
Please help me find the proper sscanf() calls. I want to read the file in and retrieve its individual parts:
One single [File-Header]
and many {[FileID] [Flag | Length | Data ]} {...} items
Well, you could just read the file byte by byte using
line[0] = (char) fgetc(fp);
line[1] = (char) fgetc(fp);
and so on, or leave out the cast to retrieve an int value. That should do the trick for an easy left-to-right scan of the file (or line, as you say there aren't any line breaks).
You could probably use some standard parsing techniques, for instance a lexer and a recursive parser. You should define your input syntax in more detail. You could perhaps use a parser generator like ANTLR (but it might be overkill for your simple example).
I suggest you read a good textbook on parsing (and compiling); it will teach you a lot of useful stuff.
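Since the format is binary, reading it with fread/fgetc instead of fgets/sscanf is probably the simplest route. Below is a minimal sketch under several assumptions that the question does not settle: the layout is [File-Header] [FileID] followed by repeated [Flag | Length | Data] items, the 32-bit integers are stored big-endian, and the function names read_u32_be and read_fil0 are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reads a 32-bit unsigned integer stored big-endian (an assumption;
 * the question does not state the byte order). Returns 1 on success. */
static int read_u32_be(FILE *fp, uint32_t *out)
{
    unsigned char b[4];
    if (fread(b, 1, 4, fp) != 4)
        return 0;
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
    return 1;
}

/* Hypothetical reader: checks the "FIL0" header, stores the file id and
 * walks over all data items. Returns the number of items read,
 * or -1 if the input is malformed or truncated. */
long read_fil0(FILE *fp, uint32_t *file_id)
{
    char magic[4];
    long count = 0;

    if (fread(magic, 1, 4, fp) != 4 || memcmp(magic, "FIL0", 4) != 0)
        return -1;
    if (!read_u32_be(fp, file_id))
        return -1;

    for (;;) {
        uint32_t length;
        int flag = fgetc(fp);           /* 1-byte flag, or EOF when done */
        if (flag == EOF)
            break;
        if (flag != 0 && flag != 1)
            return -1;
        if (!read_u32_be(fp, &length))  /* 4-byte length */
            return -1;
        for (uint32_t i = 0; i < length; i++) {
            if (fgetc(fp) == EOF)       /* data byte; store or process it here */
                return -1;
        }
        count++;
    }
    return count;
}
```

The file should be opened in binary mode, e.g. fopen("data.dat", "rb"), so that no newline translation interferes with the byte counts.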

Is it possible to sort arrays using preprocessor?

I have a number of very long arrays. No run-time sort is possible. It is also time consuming to sort them manually. Moreover, new elements can be added in any order later, so I would like to sort them by value using the C preprocessor, or maybe there is a compiler flag (GCC) for this?
For example:
sometype S[] = {
{somevals, "BOB", someothervals},
{somevals, "ALICE", someothervals},
{somevals, "TIM", someothervals},
};
must be sorted so:
sometype S[] = {
{somevals, "ALICE", someothervals},
{somevals, "BOB", someothervals},
{somevals, "TIM", someothervals},
};
SOLVED
Ok, here is my solution:
Manually copy&paste each array into a temporary file called tobesorted.c
Sort it by 2nd column: sort -b -i --key=2 tobesorted.c
Copy&paste output back into original file.
Actually, it would be nice to have some possibility to call "sort" directly from the preprocessor (I had hoped that at least GCC would somehow support such a feature, but it seems that it doesn't).
Do this.
Put your giant array in a file.
Sort the file with the built-in sort
Write a small program to create C code from the file. A C program that writes C programs is fine. You can use Python and some of its cool template packages to make this job simpler.
Compile the remaining program consisting of the sorted file transformed into C code plus the rest of your program.
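A rough sketch of such a generator step in C follows. The function name emit_sorted_table is hypothetical, and somevals / someothervals are simply written out as the placeholders used in the question; a real generator would read the keys from the data file and emit the actual field values.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings */
static int cmp_names(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Hypothetical generator step: sorts the keys in place and writes one
 * initializer row per key to out. */
void emit_sorted_table(FILE *out, const char **names, size_t n)
{
    qsort(names, n, sizeof names[0], cmp_names);

    fputs("sometype S[] = {\n", out);
    for (size_t i = 0; i < n; i++)
        fprintf(out, "    {somevals, \"%s\", someothervals},\n", names[i]);
    fputs("};\n", out);
}
```

The build would run this generator first and redirect its output into a .c or .h file that the rest of the program includes.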
No, it is not possible. You cannot do string operations (other than concatenation) with the preprocessor. And you can't compare strings with template metaprogramming, either.
[edit] What you could do is put your datastructure in a file that is meant to be preprocessed by an external build script (e.g. the unix "sort" utility), and then modify your makefile/project so that at build time, you generate a C file with the (sorted) initialized arrays
I don't think you can do it in the gcc preprocessor, never seen something that could do what you are looking for.
But you could write your own "preprocessor" in your favourite scripting language (python, perl, sed etc...) that would sort those values before gcc kicks in.
I can think of no possibility to use the preprocessor, but you can use a combination of sort and #include to achieve the desired effect:
Put just the values into a separate file values.h with the sort key being in front (you will need to rearrange your struct sometype for this):
{"BOB", somevals, someothervals},
{"ALICE", somevals, someothervals},
{"TIM", somevals, someothervals},
In your Makefile, use the Unix command sort to sort that file into values_sorted.h:
sort < values.h > values_sorted.h
In your actual code, include the sorted file:
sometype S[] = {
#include "values_sorted.h"
};
The following worked for two and three elements:
// Experiment: static sort:
#define STATIC_SORT2(CMP, a, b) CMP(a,b) <= 0 ?(a):(b), CMP(a,b) <= 0 ? (b):(a),
#define STATIC_SORT3(CMP, a, b, c) \
(CMP(a,b) <= 0 && CMP(a,c) <= 0 ? (a) : \
CMP(b,a) <= 0 && CMP(b,c) <= 0 ? (b) : \
(c)), \
(CMP(a,b) <= 0 && CMP(a,c) <= 0 ? ( CMP(b,c) <= 0 ? (b) : (c) ) : \
CMP(b,a) <= 0 && CMP(b,c) <= 0 ? ( CMP(a,c) <= 0 ? (a) : (c) ) : \
(CMP(a,b) <= 0 ? (a) : (b))), \
(CMP(a,c) <= 0 && CMP(b,c) <= 0 ? (c) : \
CMP(a,b) <= 0 && CMP(c,b) <= 0 ? (b) : \
(a))
// Example:
// #define STATIC_INT_CMP(a,b) ((int)(a) - (int)(b))
// int sorted[] = { STATIC_SORT3(STATIC_INT_CMP, 2, 3, 1) }; // gives { 1, 2, 3 }
// #define STATIC_INT_COCMP(a,b) ((int)(b) - (int)(a))
// int cosorted[] = { STATIC_SORT3(STATIC_INT_COCMP, 2, 3, 1) }; // gives { 3, 2, 1 }
But I'd say that it is clear that this approach does not generalize to arbitrary sized arrays.
I suppose it is not possible, but I still don't have a formal proof for this conjecture.
