Regex in C to restrict Extended ASCII character set [closed]

I need a regular expression in C that matches any string of length greater than 0 made up of everything but the first 32 characters of extended ASCII. I thought the easiest way to do that would be a pattern like "^[^\\x00-\\x20]+$", but it's not working as I expected. For some reason it won't match any character from 48 to 92. Any ideas what's wrong with this pattern and how I can fix it?

The Posix regex library (i.e. the functions in regex.h, including regcomp and regexec) does not interpret standard C backslash sequences. It really doesn't need to, since C will do those expansions when you compile the character string literal. (This is something you have to think about if you accept regular expressions from user input.) The only use of \ in a regex is to escape a special character (in REG_EXTENDED mode), or to make a character special (in basic regex mode, which should be avoided.)
So if you want to exclude characters from \x01 to \x20, you would write:
"^[^\x01-\x20]+$"
Note that you must supply the REG_EXTENDED flag to regcomp for that to work.
As you might note, that does not exclude NUL (\x00). There's no way to insert a NUL into a regex pattern because NUL is not a valid character inside a C character string; it will terminate the string. For the same reason, it's pointless to try to exclude NUL characters from a C string, because there cannot be any. However, if it made you feel better, you could use:
"^[\x21-\xFF]+$"
Semantically, those two regex patterns are identical (at least, in the default "C" locale and assuming char is 8 bits).
The character class as you wrote it, [^\\x00-\\x20], contains everything but the character x and the range from 0 (48) to \ (92). (That range overlaps with the characters 0, 2 and \, which are named explicitly, some of them twice.)
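As a rough sketch of the approach described in this answer, the pattern can be compiled with regcomp and applied with regexec. The helper name has_only_visible_bytes is made up for illustration, and the sketch assumes the default "C" locale:

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s is non-empty and contains no byte in the range
 * \x01-\x20, 0 if it does not match, and -1 if the pattern fails to
 * compile. has_only_visible_bytes is a hypothetical helper name. */
static int has_only_visible_bytes(const char *s)
{
    regex_t re;
    int rc;

    /* The C compiler turns \x01 and \x20 into single bytes, so regcomp
     * sees a plain bracket expression. REG_EXTENDED makes + a
     * quantifier; REG_NOSUB skips match offsets we do not need. */
    if (regcomp(&re, "^[^\x01-\x20]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this sketch, a string such as "0AZ\\xyz" (digits, capitals and a backslash) should be accepted, while "two words" and the empty string should be rejected.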

I have never used regex in C. I would do it the following way, using unsigned char to cover extended ASCII:
/* Copies every byte of src with a value of 32 or above to dst. */
void match(const unsigned char *src, unsigned char *dst) {
    while (*src) {
        if (*src >= 32) {
            *dst++ = *src++;
        } else {
            src++;
        }
    }
    *dst = 0;
}

Related

Different behavior for \" in C

I have a strange problem when using string function in C.
Currently I have a function that sends string to UART port.
When I give to it a string like
char buf[32];
strcpy(buf, "AT+CPMS=\"SM");
strcat(buf, "\"");
uart0_putstr(buf);
//or
uart0_putstr("AT+CPMS=SM"); //not a valid AT command, but without quotes just for test
it works well and sends string to UART. But when I use such call:
char buf[32];
strcpy(buf, "AT+CPMS=\"SM\"");
uart0_putstr(buf);
//or
uart0_putstr("AT+CPMS=\"SM\"");
it doesn't print to UART anything.
Maybe you can explain me what the difference between strings in first and second/third cases?
First the C language part:
String literals: All C string literals include an implicit null byte at the end; the C string literal "123" defines a 4 byte array with the values 49,50,51,0. The null byte is always there even if it is never mentioned and enables strlen, strcat etc. to find the end of the string. The suggestion strcpy(buf, "AT+CPMS=\"SM\"\0"); is nonsensical: The character array produced by "AT+CPMS=\"SM\"\0" now ends in two consecutive zero bytes; strcpy will stop at the first one already. "" is a 1 byte array whose single element has the value 0. There is no need to append another 0 byte.
strcat, strcpy: Both functions always add a null byte at the end of the string. There is no need to add a second one.
Escaping: As you know, a C string literal consists of characters book-ended by double quotes: "abc". This makes it impossible to have simple double quotes as part of the string because that would end the string. We have to "escape" them. The C language uses the backslash to give certain characters a special meaning, or, in this case, suppress the special meaning. The entire combination of backslash and subsequent source code character is transformed into a single character, or byte, in the compiled program. The combination \n is transformed into a single byte with the value 10 (usually interpreted as a newline by output devices), \r is 13 (usually carriage return), and \" is transformed into the byte 34, usually printed as the " glyph. The string Between the arrows is a double quote: ->"<- must be coded as "Between the arrows is a double quote: ->\"<-" in C. The middle double quote doesn't end the string literal because it is "escaped".
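A small sketch can make the byte counts concrete. The names digits, quoted and count_byte are invented for this illustration, and the value 34 for the double quote assumes an ASCII-based execution character set:

```c
#include <stddef.h>
#include <string.h>

/* "123" occupies four bytes: '1', '2', '3' and the implicit null byte. */
static const char digits[] = "123";

/* Each escaped quote in the source becomes one byte with the value 34. */
static const char quoted[] = "->\"<-";

/* Counts how often the byte value c occurs in the string s. */
static size_t count_byte(const char *s, char c)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (*s == c)
            n++;
    return n;
}
```

Here sizeof digits is 4 although only three characters were typed, and strlen(quoted) is 5 because the backslash itself never reaches the compiled program.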
Then the UART part: The internet makes me believe that the command you want to send over the UART looks like AT+CPMS="SM", followed by a carriage return. The corresponding C string literal would be "AT+CPMS=\"SM\"\r".
The page I linked also inserts a delay between sending commands. Sending too quickly may cause errors that appear only sometimes.
The things to note are :
The AT command syntax probably demands that SM be surrounded by quotes on both sides.
Additionally, the protocol probably demands that a command end in a carriage return.
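One way to get both the quotes and the trailing carriage return right is to build the command with snprintf. This is only a sketch: build_cpms_command is a hypothetical helper, and whether the modem really wants exactly this framing is the assumption made in the answer above.

```c
#include <stdio.h>
#include <stddef.h>

/* Builds AT+CPMS="<storage>" followed by a carriage return in buf.
 * Returns 0 on success, -1 if the buffer is too small.
 * build_cpms_command is a hypothetical helper name. */
static int build_cpms_command(char *buf, size_t size, const char *storage)
{
    /* \" yields a literal double quote, \r a carriage return. */
    int n = snprintf(buf, size, "AT+CPMS=\"%s\"\r", storage);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Called with a 32-byte buffer and "SM", it should leave the 13 characters AT+CPMS="SM" plus carriage return in the buffer, ready to be passed to uart0_putstr.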
This ...
char buf[32];
strcpy(buf, "AT+CPMS=\"SM");
strcat(buf, "\"");
... produces the same contents in buf as this ...
char buf[32];
strcpy(buf, "AT+CPMS=\"SM\"");
... does, up to and including the string terminator at index 12. I fully expect an immediately following call to ...
uart0_putstr(buf);
... to have the same effect in each case unless uart0_putstr() looks at bytes past the terminator or its behavior is sensitive to factors other than its argument.
If it does look past the terminator, however, then that might explain not only a difference between those two, but also a difference with ...
uart0_putstr("AT+CPMS=\"SM\"");
... because in this last case, looking past the string terminator would overrun the bounds of the array, producing undefined behavior.
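The claim that both ways of filling buf give the same bytes can be checked directly. The following sketch (same_command_bytes is a made-up name) compares the two buffers up to and including the terminator:

```c
#include <string.h>

/* Returns 1 if building the command in two steps and in one step leaves
 * identical bytes in the buffers, up to and including the terminator. */
static int same_command_bytes(void)
{
    char two_step[32];
    char one_step[32];

    strcpy(two_step, "AT+CPMS=\"SM");
    strcat(two_step, "\"");
    strcpy(one_step, "AT+CPMS=\"SM\"");

    /* strlen + 1 includes the null byte at index 12 in the comparison. */
    return strlen(two_step) == 12 &&
           memcmp(two_step, one_step, strlen(one_step) + 1) == 0;
}
```

Bytes after index 12 are deliberately not compared, since they are uninitialized in both arrays and a well-behaved uart0_putstr should never read them.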
Thanks all. In the end it was resolved by adding a null character to the end of the string.

removing multi-char constants in C

Here's some code I found in a very old C library that's trying to eat whitespace from a file...
while(
(line_buf[++line_idx] != ' ') &&
(line_buf[ line_idx] != ' ') &&
(line_buf[ line_idx] != ',') &&
(line_buf[ line_idx] != '\0') )
{
This great thread explains what the problem is, but most of the answers are "just ignore it" or "you should never do this". What I don't see, however, is the canonical solution. Can anyone offer a way to code this test using the "proper way"?
UPDATE: to clarify, the question is "what is the proper way to test for the presence of a string of one or more characters at a given index in another string". Forgive me if I am using the wrong terminology.
Original question
There is no canonical or correct way. Multi-character constants have always been implementation defined. Look up the documentation for the compiler used when the code was written and figure out what was meant.
Updated question
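You can match multiple characters using strchr(). Because strchr() also finds the terminating null of its first argument, the '\0' case is covered too. Note the negation: the original loop keeps going while the character is not one of the delimiters.
while (!strchr( " ,", line_buf[++line_idx] ))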
{
Again, this does not account for that multi-char constant. You should figure out why that was there before simply removing it.
Also, strchr() does not handle Unicode. If you are dealing with a UTF-8 stream, for example, you will need a function capable of handling it.
Finally, if you are concerned about speed, profile. The compiler might get you better results using the three (or four) individual test expressions in the ‘while’ condition.
In other words, the multiple tests might be the best solution!
Beyond that, I smell some uncouth indexing: the way that line_idx is updated depends on the surrounding code to actuate the loop properly. Make sure that you don’t create an off-by-one error when you update stuff.
Good luck!
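For reference, a self-contained sketch of a strchr-based scan that stops at a space, a comma or the end of the string might look like this. next_delimiter is a hypothetical name, and the sketch does not try to reproduce whatever the multi-character constant in the old code was meant to match:

```c
#include <string.h>
#include <stddef.h>

/* Starting after line_idx, advances to the next space, comma or end of
 * string and returns that index. Like the original loop it
 * pre-increments, so line_buf[line_idx] must not be the terminator. */
static size_t next_delimiter(const char *line_buf, size_t line_idx)
{
    /* strchr also finds the terminating '\0' of " ,", so the negated
     * test is false at the end of line_buf and the loop stops there. */
    while (!strchr(" ,", line_buf[++line_idx]))
        ;
    return line_idx;
}
```

For the input "abc,def", a scan from index 0 should stop at the comma (index 3), and a scan from index 3 should stop at the terminator (index 7).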
UPDATE: to clarify, the question is "what is the proper way to test
for the presence of a string of one or more characters at a given
index in another string". Forgive me if I am using the wrong
terminology.
Well, there are a number of ways, but the standard way is using strspn which has the prototype:
size_t strspn(const char *s, const char *accept);
and it cleverly:
calculates the length (in bytes) of the initial segment of s
which consists entirely of bytes in accept.
This allows you to test for the "the presence of a string of one or more characters at a given index in another string" and tells you how many of the characters from that string were sequentially matched.
For example, if you had another string, say char *s = "somestring";, and wanted to know if it contained the letters r, s, t (say, in char *accept = "rst";) beginning at the 5th character, you could test:
size_t n;
if ((n = strspn (&s[4], accept)) > 0)
    printf ("matched %zu chars from '%s' at beginning of '%s'\n",
            n, accept, &s[4]);
To compare in order, you can use strncmp (&s[4], accept, strlen (accept));. You can also simply use nested loops to iterate over s with the characters in accept.
All of the ways are "proper", so long as they do not invoke Undefined Behavior (and are reasonably efficient).
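The two checks from this answer can be wrapped in small helpers. This is a sketch only; span_at and starts_at are hypothetical names, and both assume the caller passes an index no larger than strlen(s):

```c
#include <string.h>
#include <stddef.h>

/* Length of the run of characters from accept that starts at s[index].
 * The order of the characters in accept does not matter. */
static size_t span_at(const char *s, size_t index, const char *accept)
{
    return strspn(s + index, accept);
}

/* 1 if the exact sequence needle appears at s[index], else 0. */
static int starts_at(const char *s, size_t index, const char *needle)
{
    return strncmp(s + index, needle, strlen(needle)) == 0;
}
```

With s = "somestring", span_at(s, 4, "rst") should give 3 (the run "str"), while starts_at(s, 4, "rst") gives 0 because the characters are present but not in that order.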

How to strtok() string using the null character as the delimiter? [closed]

I tried inputting the null character into the list of delimiters but it would not accept it. I tried inputting "\0" but it wouldn't accept it. I even tried putting in the double quotes with escape characters but it still would not accept it.
Is there a way I could do this?
According to strtok(3), this function is used to isolate sequential tokens in a null-terminated string. So the answer is no, not using strtok, since that function cannot distinguish a \0 separator from the terminator. You will have to write your own function (which is trivial).
Also read the BUGS section in the strtok man page ... better avoid using it.
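A hand-written splitter along those lines could look like the sketch below. It assumes the caller knows the total length of the buffer and that the last byte is '\0'; split_on_nul is a hypothetical name, not a library function.

```c
#include <string.h>
#include <stddef.h>

/* Splits a buffer of len bytes into fields separated by '\0'.
 * Stores up to max field pointers in out and returns how many were
 * stored. Assumes buf[len - 1] is '\0'. */
static size_t split_on_nul(const char *buf, size_t len,
                           const char **out, size_t max)
{
    size_t count = 0;
    size_t pos = 0;

    while (pos < len && count < max) {
        out[count++] = buf + pos;
        pos += strlen(buf + pos) + 1;   /* skip the field and its '\0' */
    }
    return count;
}
```

Because the length is passed explicitly, empty fields (two consecutive null bytes) are reported as empty strings instead of ending the scan.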
If you are allowed to end your string with a double \0 as sentinel you can build your own function, something like:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char *s = "abc\0def\0ghi\0";
    char *p = s;

    while (*p) {
        puts(p);
        p = strchr(p, '\0');
        p++;
    }
    return 0;
}
Output:
abc
def
ghi
For better or worse, C decided that all strings are null terminated, so you can't parse based on that character.
One workaround would be to replace all nulls in your (non-string) binary buffer with something you know doesn't occur in the buffer, then use that as your separator treating the buffer as char*.
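A minimal sketch of that workaround, assuming the buffer length is known and that the chosen separator really never occurs in the data (replace_nuls is a made-up helper):

```c
#include <string.h>
#include <stddef.h>

/* Replaces every '\0' in the first len bytes of buf with sep.
 * The caller must pick a sep that never occurs in the data and must
 * make sure the buffer is terminated before handing it to strtok. */
static void replace_nuls(char *buf, size_t len, char sep)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\0')
            buf[i] = sep;
}
```

For a char array initialized from "abc\0def\0ghi", calling replace_nuls on all bytes except the final terminator with '|' as separator should leave "abc|def|ghi", which strtok can then split on "|".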
In C, strings are just char arrays that must end with a null character. If a string in memory is not terminated that way, the program will endlessly read memory until a null char is found and there is no way to determine where it will be since memory changes constantly. Your best bet is to create an array of char pointers (strings) and populate them using better organization.

What is the significance of '\0' in the character array? [closed]

I know that every string in C ends with '\0' character. It is very useful in cases when we need to know when the string ends. However, I am unable to comprehend its use in printing a string and printing a string without it. I have the following code:-
/* Printing out an array of characters */
#include<stdio.h>
#include<conio.h>
int main()
{
    char a[7]={'h','e','l','l','o','!','\0'};
    int i;
    /* Loop where we do not care about the '\0' */
    for(i=0;i<7;i++)
    {
        printf("%c",a[i]);
    }
    printf("\n");
    /* Part which prints the entire character array as string */
    printf("%s",a);
    printf("\n");
    /* Loop where we care about the '\0' */
    for(i=0;i<7&&a[i]!='\0';i++)
    {
        printf("%c",a[i]);
    }
}
The output is:-
hello!
hello!
hello!
I am unable to understand the difference. any explanations?
In this case:
for(i=0;i<7;i++)
{
printf("%c",a[i]);
}
You loop a fixed number of times (7) and then quit. That is the end condition of the loop: it terminates regardless of anything else.
In the other case, you also loop at most 7 times; you just added another condition, which really serves no purpose since you are already keeping a count. If you did the following:
int index = 0;
while (a[index] != '\0') { printf("%c", a[index]); index++; }
now you would depend on the zero termination character being there. If it wasn't in the string, your while loop would keep going until the program crashed or something forcibly terminated it, probably printing garbage on your screen.
\0 is not part of the data in a character string. It is the indicator of the end of the string. If the length of the string is not known, look for this indicator. With its help you can replace your loop:
for(i=0;i<7&&a[i]!='\0';i++) { ...
with:
for(int i=0; a[i]; ++i) { ...
So the for-loops and printf display the same string. The only difference is how you print it.
'\0' does not correspond to a displayable character; that's why the first and last versions appear to be the same.
The second version is the same because under the hood, printf is just iterating until it hits the '\0'.
When you want to print a string from its first character to its end, you do not need to know its length if the string ends with \0: just print characters until you reach \0. So you don't need an extra variable to store the length of the string.
In fact a string can have many different representations, but minimizing memory consumption (which was important to the designers of C) led them to define zero-terminated strings.
Each string representation has its trade-offs between speed, memory and flexibility. For example, you can define your string like a Pascal string, which stores the length of the string in the first element of the array. That limits the length of the string, but retrieving the length is faster than with zero-terminated strings (which requires counting each character until \0).
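To illustrate the trade-off, here is a sketch of a length-prefixed string in the style of classic Pascal, written in C. The type pstring and both functions are invented for this example and are not part of any standard library:

```c
#include <string.h>
#include <stddef.h>

/* A length-prefixed string: one length byte followed by up to 255 data
 * bytes. pstring is a hypothetical type for illustration. */
struct pstring {
    unsigned char len;
    char data[255];
};

/* Copies a C string into p. Returns 0 on success, -1 if it is too long. */
static int pstring_from_c(struct pstring *p, const char *s)
{
    size_t n = strlen(s);          /* O(n): scans for the '\0' */

    if (n > sizeof p->data)
        return -1;
    p->len = (unsigned char)n;
    memcpy(p->data, s, n);         /* no terminator is stored */
    return 0;
}

/* O(1): the length is read directly, no scan needed. */
static size_t pstring_length(const struct pstring *p)
{
    return p->len;
}
```

The single length byte is what caps the string at 255 characters here; a wider length field would lift that limit at the cost of more memory per string.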
I am unable to comprehend its use in printing a string and printing a string without it
Normally you don't print a string character by character like that. You print the whole string. In such cases, your C library will print until it finds a zero.
When printing a string of variable length, there has to be some 'signal' to indicate that you have reached the end. Generally, this is the '\0' character. Most C standard calls, like strcpy, strcat, printf, etc. depend on the string being zero-terminated, thus ending in a '\0' character. This corresponds to your second example.
The first example is printing a string of fixed length, which is simply a far less common occurrence.
The third example combines both: it looks for a zero-terminator ('\0' character) or 7 characters maximum. This corresponds to calls like strncpy, for example.
The purpose of the terminating zero character is to terminate the string, i.e. to indirectly encode the string length information in the string itself. If you somehow already know the length of your string, you can write code that works correctly without relying on that terminating zero character. That's basically all.
Now, in your code sample the first cycle does something that does not make much sense. It prints 7 characters from a string that actually has length 6. I.e. it attempts to print the terminating zero as well. Why it is doing that - I don't know. In other words, the first output generated by your code is formally different from the rest, since it includes the effect of printing a zero character right after the ! sign. On your platform that effect just happened to be "invisible" on the screen, which is why you probably assumed that the first output is the same as the other ones. However, if you redirect the output to a file, you will be able to see that it is actually quite different.
The other output methods in your code simply output the string up to (and not including) the terminating zero character. The last cycle has redundant condition checking, since you know that the cycle will stop at zero character, before i will have a chance to hit 7.
Other than that, I don't know what "difference" you might be asking about. Please, clarify your question, if this doesn't answer it.
In your loop, you actually print the nul character. Generally this has no effect since it is a non-printing, non-control character. However, printf("%s",a); will not output the nul at all; it uses it as a sentinel value. So your loop is not equivalent to %s formatted output.
If you try say:
char a[] = "123456" ;
char b[]={'h','e','l','l','o','!' } ; // No terminator
char c[] = "ABCDEF" ;
printf( "%s", a ) ;
printf( "%s", b ) ;
printf( "%s", c ) ;
You might clearly see why the nul terminator is essential. In my case it output:
123456
hello!╠╠╠╠╠╠╠╠╠╠123456
ABCDEF
Your mileage may vary, since the result is undefined behaviour. In this case the output runs through to the adjacent string, but the compiler has inserted some unused space between them with "junk" in it. I placed a string on either side of the unterminated string because there is no way of telling how a particular compiler orders data in memory. Incidentally, when I declared the strings static, the string b was output with no "run-on". Sometimes the surrounding "junk" may happen to already be zero.
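If an array really has no terminator, it can still be printed safely by giving printf an explicit precision. The sketch below uses snprintf so the result can be inspected; format_counted is a hypothetical helper name:

```c
#include <stdio.h>
#include <stddef.h>

/* Formats the first n bytes of a possibly unterminated array into out.
 * The precision passed through .* caps how many bytes are read, so src
 * needs no '\0' as long as n does not exceed its size. */
static int format_counted(char *out, size_t out_size,
                          const char *src, size_t n)
{
    return snprintf(out, out_size, "%.*s", (int)n, src);
}
```

Applied to the unterminated array b above with n set to sizeof b, this should produce exactly hello! and never touch the bytes that follow the array.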

Removing strings from C source code [closed]

Can anyone point me to a program that strips off strings from C source code? Example
#include <stdio.h>
static const char *place = "world";
char * multiline_str = "one \
two \
three\n";
int main(int argc, char *argv[])
{
printf("Hello %s\n", place);
printf("The previous line says \"Hello %s\"\n", place);
return 0;
}
becomes
#include <stdio.h>
static const char *place = ;
char * multiline_str = ;
int main(int argc, char *argv[])
{
printf(, place);
printf(, place);
return 0;
}
What I am looking for is a program very much like stripcmt
only that I want to strip strings and not comments.
The reason that I am looking for an already developed program and not just some handy regular expression is
because when you start considering all corner cases (quotes within strings, multi-line strings etc)
things typically start to be (much) more complex than it first appears. And
there are limits on what REs can achieve, I suspect it is not possible for this task.
If you do think you have an extremely robust regular expression feel free to submit, but please no naive sed 's/"[^"]*"//g' like suggestions.
(No need for special handling of (possibly un-ended) strings within comments, those will be removed first)
Support for multi-line strings with embedded newlines is not important (not legal C), but strings spanning multiple lines ending with \ at the end must be supported.
This is almost the same as the some other questions, but I found no reference to any tools.
All of the tokens in C (and most other programming languages) are "regular". That is, they can be matched by a regular expression.
A regular expression for C strings:
"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"
The regex isn't too hard to understand. Basically a string literal is a pair of double quotes surrounding a bunch of:
non-special (non-quote/backslash/newline) characters
escapes, which start with a backslash and then consist of one of:
a simple escape character
1 to 3 octal digits
x and 1 or more hex digits
This is based on sections 6.1.4 and 6.1.3.4 of the C89/C90 spec. If anything else crept in in C99, this won't catch that, but that shouldn't be hard to fix.
Here's a python script to filter a C source file removing string literals:
import re, sys

# Matches one complete C string literal, including escape sequences.
regex = re.compile(r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"''')
for line in sys.stdin:
    print(regex.sub('', line.rstrip('\n')))
EDIT:
It occurred to me after I posted the above that while it is true that all C tokens are regular, by not tokenizing everything we've got an opportunity for trouble. In particular, if a double quote shows up in what should be another token we can be led down the garden path. You mentioned that comments have already been stripped, so the only other thing we really need to worry about are character literals (though the approach I'm going to use can be easily extended to handle comments as well). Here's a more robust script that handles character literals:
import re, sys

str_re = r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""
regex = re.compile('|'.join([str_re, chr_re]))

def repl(m):
    # Keep character literals untouched, drop string literals.
    m = m.group(0)
    if m.startswith("'"):
        return m
    else:
        return ''

for line in sys.stdin:
    print(regex.sub(repl, line.rstrip('\n')))
Essentially we're finding string and character literal tokens, and then leaving char literals alone but stripping out string literals. The char literal regex is very similar to the string literal one.
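The same idea can also be sketched in C without a regex library, as a small hand-written scanner with three states. This is a swapped-in technique, not the script above: it treats a backslash plus any following character as one escape (which also covers a backslash-newline continuation) rather than validating escapes against the C grammar, and it assumes comments have already been removed. The name strip_string_literals is hypothetical.

```c
#include <stddef.h>

/* Copies src to dst, dropping string literals and keeping character
 * literals. dst must have room for strlen(src) + 1 bytes. */
static void strip_string_literals(const char *src, char *dst)
{
    enum { CODE, STRING, CHARLIT } state = CODE;

    while (*src != '\0') {
        char c = *src++;

        if (state == CODE) {
            if (c == '"') {
                state = STRING;        /* opening quote: start dropping */
                continue;
            }
            if (c == '\'')
                state = CHARLIT;
            *dst++ = c;
        } else {
            int keep = (state == CHARLIT);

            if (c == '\\' && *src != '\0') {
                /* An escape or a backslash-newline continuation:
                 * consume the next character too, so that \" and \'
                 * cannot end the literal. */
                if (keep) {
                    *dst++ = c;
                    *dst++ = *src;
                }
                src++;
                continue;
            }
            if (keep)
                *dst++ = c;
            if ((state == STRING && c == '"') ||
                (state == CHARLIT && c == '\''))
                state = CODE;          /* closing quote */
        }
    }
    *dst = '\0';
}
```

Given the line printf("Hello %s\n", place); as input, the sketch should produce printf(, place); and a character literal containing a double quote is copied through unchanged.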
You can download the source code to StripCmt (.tar.gz - 5kB). It's trivially small, and shouldn't be too difficult to adapt to stripping strings instead (it's released under the GPL).
You might also want to investigate the official lexical language rules for C strings. I found this very quickly, but it might not be definitive. It defines a string as:
stringcon ::= "{ch}", where ch denotes any printable ASCII character (as specified by isprint()) other than " (double quotes) and the newline character.
In ruby:
#!/usr/bin/ruby
f=open(ARGV[0],"r")
s=f.read
puts(s.gsub(/"(\\(.|\n)|[^\\"\n])*"/,""))
f.close
prints to the standard output
In Python using pyparsing:
from pyparsing import dblQuotedString
source = open(filename).read()
dblQuotedString.setParseAction(lambda : "")
print(dblQuotedString.transformString(source))
Also prints to stdout.
