Removing strings from C source code [closed] - c

Can anyone point me to a program that strips string literals from C source code? For example:
#include <stdio.h>
static const char *place = "world";
char * multiline_str = "one \
two \
three\n";
int main(int argc, char *argv[])
{
    printf("Hello %s\n", place);
    printf("The previous line says \"Hello %s\"\n", place);
    return 0;
}
becomes
#include <stdio.h>
static const char *place = ;
char * multiline_str = ;
int main(int argc, char *argv[])
{
    printf(, place);
    printf(, place);
    return 0;
}
What I am looking for is a program very much like stripcmt, except that I want to strip strings and not comments.
The reason I am looking for an already developed program and not just some handy regular expression is that when you start considering all the corner cases (quotes within strings, multi-line strings, etc.), things typically turn out to be (much) more complex than they first appear. There are also limits on what REs can achieve, and I suspect they are not sufficient for this task.
If you do think you have an extremely robust regular expression feel free to submit, but please no naive sed 's/"[^"]*"//g' like suggestions.
(No need for special handling of (possibly un-ended) strings within comments, those will be removed first)
Support for multi-line strings with embedded newlines is not important (not legal C), but strings spanning multiple lines ending with \ at the end must be supported.
This is almost the same as some other questions, but I found no reference to any tools.

All of the tokens in C (and most other programming languages) are "regular". That is, they can be matched by a regular expression.
A regular expression for C strings:
"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"
The regex isn't too hard to understand. Basically, a string literal is a pair of double quotes surrounding any number of:
- non-special (non-quote/backslash/newline) characters
- escapes, which start with a backslash and then consist of one of:
  - a simple escape character
  - 1 to 3 octal digits
  - x and 1 or more hex digits
This is based on sections 6.1.4 and 6.1.3.4 of the C89/C90 spec. C99 added universal character names (\u followed by four hex digits and \U followed by eight), which this regex does not match, but that shouldn't be hard to fix.
Here's a Python script to filter a C source file, removing string literals:
import re, sys
regex = re.compile(r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"''')
for line in sys.stdin:
    print(regex.sub('', line.rstrip('\n')))
EDIT:
It occurred to me after I posted the above that while it is true that all C tokens are regular, by not tokenizing everything we've got an opportunity for trouble. In particular, if a double quote shows up in what should be another token, we can be led down the garden path. You mentioned that comments have already been stripped, so the only other thing we really need to worry about is character literals (though the approach I'm going to use can easily be extended to handle comments as well). Here's a more robust script that handles character literals:
import re, sys
str_re = r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""
regex = re.compile('|'.join([str_re, chr_re]))
def repl(m):
    # Keep character literals as they are; drop string literals.
    text = m.group(0)
    if text.startswith("'"):
        return text
    else:
        return ''
for line in sys.stdin:
    print(regex.sub(repl, line.rstrip('\n')))
Essentially we're finding string and character literal tokens, and then leaving char literals alone but stripping out string literals. The char literal regex is very similar to the string literal one.
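The same tokenizing idea can also be written as a small hand-rolled scanner in C. The scripts above read one line at a time, so a literal continued with a trailing backslash would not be matched; the sketch below works on a whole buffer and treats backslash-newline like any other escape. The function name strip_strings is made up for illustration, and the code assumes comments have already been removed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of src with every double-quoted string
 * literal removed, or NULL if allocation fails. Character literals are
 * copied unchanged so that '"' does not start a string. Comments are
 * assumed to have been stripped already. The caller frees the result. */
char *strip_strings(const char *src)
{
    char *out, *dst;

    assert(src != NULL);
    out = dst = malloc(strlen(src) + 1);   /* output is never longer */
    if (out == NULL)
        return NULL;

    while (*src != '\0') {
        if (*src == '"') {
            src++;                          /* opening quote */
            while (*src != '\0' && *src != '"') {
                if (*src == '\\' && src[1] != '\0')
                    src++;                  /* skip the escaped character,
                                               including backslash-newline */
                src++;
            }
            if (*src == '"')
                src++;                      /* closing quote */
        } else if (*src == '\'') {
            *dst++ = *src++;                /* opening apostrophe */
            while (*src != '\0' && *src != '\'') {
                if (*src == '\\' && src[1] != '\0')
                    *dst++ = *src++;        /* keep the backslash */
                *dst++ = *src++;
            }
            if (*src == '\'')
                *dst++ = *src++;            /* closing apostrophe */
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return out;
}
```

Calling strip_strings on the text printf("Hello %s\n", place); should give printf(, place);. An unterminated literal simply swallows the rest of the input, which may or may not be what you want.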

You can download the source code to StripCmt (.tar.gz - 5kB). It's trivially small, and shouldn't be too difficult to adapt to stripping strings instead (it's released under the GPL).
You might also want to investigate the official lexical language rules for C strings. I found this very quickly, but it might not be definitive. It defines a string as:
stringcon ::= "{ch}", where ch denotes any printable ASCII character (as specified by isprint()) other than " (double quotes) and the newline character.

In ruby:
#!/usr/bin/ruby
f=open(ARGV[0],"r")
s=f.read
puts(s.gsub(/"(\\(.|\n)|[^\\"\n])*"/,""))
f.close
This prints the result to standard output.

In Python using pyparsing:
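import sys
from pyparsing import dblQuotedString
filename = sys.argv[1]  # assumption: the C file is given as the first argument
with open(filename) as f:
    source = f.read()
dblQuotedString.setParseAction(lambda: "")
print(dblQuotedString.transformString(source))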
Also prints to stdout.

Related

Regex in C to restrict Extended ASCII character set [closed]

I need a regular expression in C that matches any non-empty string containing none of the first 32 characters of extended ASCII. I thought the easiest way to do that would be a pattern like "^[^\\x00-\\x20]+$", but it's not working as I expected. For some reason it won't match any character from 48 to 92. Any ideas what's wrong with this pattern and how I can fix it?
The POSIX regex library (i.e. the functions in regex.h, including regcomp and regexec) does not interpret standard C backslash sequences. It really doesn't need to, since the C compiler does those expansions when it compiles the string literal. (This is something you have to think about if you accept regular expressions from user input.) The only use of \ in a regex is to escape a special character (in REG_EXTENDED mode), or to make a character special (in basic regex mode, which should be avoided).
So if you want to exclude characters from \x01 to \x20, you would write:
"^[^\x01-\x20]+$"
Note that you must supply the REG_EXTENDED flag to regcomp for that to work.
As you might note, that does not exclude NUL (\x00). There's no way to insert a NUL into a regex pattern because NUL is not a valid character inside a C character string; it will terminate the string. For the same reason, it's pointless to try to exclude NUL characters from a C string, because there cannot be any. However, if it made you feel better, you could use:
"^[\x21-\xFF]+$"
Semantically, those two regex patterns are identical (at least, in the default "C" locale and assuming char is 8 bits).
The character class as you wrote it, [^\\x00-\\x20], contains everything but the character x and the range from 0 (48) to \ (92). (That range overlaps with the characters 0, 2 and \, which are named explicitly, some of them twice.)
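To make the point about who expands the backslashes concrete, here is a minimal sketch that compiles the corrected pattern with regcomp and REG_EXTENDED. The wrapper name has_only_visible_bytes is hypothetical, and the sketch assumes the default "C" locale.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s is non-empty and contains no byte in \x01..\x20,
 * 0 if it does not match, and -1 if the pattern fails to compile.
 * Assumes the default "C" locale. */
int has_only_visible_bytes(const char *s)
{
    regex_t re;
    int rc;

    assert(s != NULL);
    /* The C compiler turns \x01 and \x20 into raw bytes here;
     * regcomp never sees a backslash. */
    if (regcomp(&re, "^[^\x01-\x20]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this pattern a string such as 0AZ\ matches, whereas the doubled-backslash version rejected every character from 48 to 92. Compiling the regex on every call keeps the sketch short; real code would probably compile it once.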
I have never used regex in C. I would do it the following way, using unsigned char to cover extended ASCII:
void match(const unsigned char *src, unsigned char *dst) {
    /* Copy every byte >= 32 to dst and drop the control characters. */
    while (*src) {
        if (*src >= 32) {
            *dst++ = *src++;
        } else {
            src++;
        }
    }
    *dst = 0;
}

removing multi-char constants in C

Here's some code I found in a very old C library that's trying to eat whitespace from a file...
while(
(line_buf[++line_idx] != ' ') &&
(line_buf[ line_idx] != ' ') &&
(line_buf[ line_idx] != ',') &&
(line_buf[ line_idx] != '\0') )
{
This great thread explains what the problem is, but most of the answers are "just ignore it" or "you should never do this". What I don't see, however, is the canonical solution. Can anyone offer a way to code this test using the "proper way"?
UPDATE: to clarify, the question is "what is the proper way to test for the presence of a string of one or more characters at a given index in another string". Forgive me if I am using the wrong terminology.
Original question
There is no canonical or correct way. Multi-character constants have always been implementation defined. Look up the documentation for the compiler used when the code was written and figure out what was meant.
Updated question
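You can test a character against a set of characters using strchr(). It returns a non-null pointer when the character is in the set, and it treats the terminating '\0' as part of the set, so negate the result to keep looping until a delimiter or the end of the string is reached:
while (!strchr(" ,", line_buf[++line_idx]))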
{
Again, this does not account for that multi-char constant. You should figure out why that was there before simply removing it.
Also, strchr() does not handle Unicode. If you are dealing with a UTF-8 stream, for example, you will need a function capable of handling it.
Finally, if you are concerned about speed, profile. The compiler might get you better results using the three (or four) individual test expressions in the ‘while’ condition.
In other words, the multiple tests might be the best solution!
Beyond that, I smell some uncouth indexing: the way that line_idx is updated depends on the surrounding code to actuate the loop properly. Make sure that you don’t create an off-by-one error when you update stuff.
Good luck!
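If the goal is only to find where the current token ends, a related standard function, strcspn, can replace the loop entirely: it returns the length of the initial segment that contains none of the listed characters. This is a different technique from the strchr() loop, shown as a minimal sketch; the helper name next_delimiter is made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Returns the index of the first space, comma, or terminating NUL
 * at or after start. start must not exceed strlen(line_buf). */
size_t next_delimiter(const char *line_buf, size_t start)
{
    assert(line_buf != NULL && start <= strlen(line_buf));
    return start + strcspn(line_buf + start, " ,");
}
```

The original loop pre-increments line_idx before testing, so the closest equivalent would be line_idx = next_delimiter(line_buf, line_idx + 1), provided line_idx + 1 is still inside the string.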
UPDATE: to clarify, the question is "what is the proper way to test
for the presence of a string of one or more characters at a given
index in another string". Forgive me if I am using the wrong
terminology.
Well, there are a number of ways, but the standard way is using strspn which has the prototype:
size_t strspn(const char *s, const char *accept);
and it cleverly:
calculates the length (in bytes) of the initial segment of s
which consists entirely of bytes in accept.
This allows you to test for "the presence of a string of one or more characters at a given index in another string" and tells you how many consecutive characters at that position belong to the accepted set.
For example, if you had another string, say char *s = "somestring";, and wanted to know whether it contained the letters r, s, t (given as char *accept = "rst";) beginning at the 5th character, you could test:
size_t n;
if ((n = strspn (&s[4], accept)) > 0)
printf ("matched %zu chars from '%s' at beginning of '%s'\n",
n, accept, &s[4]);
To compare in order, you can use strncmp (&s[4], accept, strlen (accept));, which returns 0 on a match. You can also simply use nested loops to iterate over s with the characters in accept.
All of the ways are "proper", so long as they do not invoke Undefined Behavior (and are reasonably efficient).
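The difference between the two calls is easy to miss: strspn treats accept as a set of characters, while strncmp compares an exact sequence. A small sketch with two hypothetical helpers, span_at and starts_with_at, shows both on the same input:

```c
#include <assert.h>
#include <string.h>

/* Number of leading characters of s + idx that belong to the set accept. */
size_t span_at(const char *s, size_t idx, const char *accept)
{
    assert(s != NULL && accept != NULL && idx <= strlen(s));
    return strspn(s + idx, accept);
}

/* Nonzero if the exact sequence needle starts at s + idx. */
int starts_with_at(const char *s, size_t idx, const char *needle)
{
    assert(s != NULL && needle != NULL && idx <= strlen(s));
    return strncmp(s + idx, needle, strlen(needle)) == 0;
}
```

For "somestring" at index 4, span_at with "rst" reports 3 because s, t and r are all in the set, while starts_with_at with "rst" is false because the text there actually reads "str".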

Why does adding a space between two strings concat the strings in c? [duplicate]

Sorry if I am going to ask a very basic question. I tried to search for it but I have been unable to find any answer.
When I run the following code:
#include <stdio.h>
int main() {
char *temp = "sai" "krishna";
printf("%s\n", temp);
return 0;
}
it prints saikrishna
Can you explain why this happens? Shouldn't we use strcat or other concatenation techniques?
Can you please refer to any documentation relating to it and where we can use this technique?
It's a language feature. C allows adjacent string literals to be concatenated at compile time. It can be handy when you have very long string literals stretching over several lines, or when you want to break up string literals containing hex escape sequences. (For example, puts("\x42AD") will translate to character 0x42AD, which is likely nonsense and unintended, as opposed to puts("\x42" "AD"), which will print BAD.)
strcat and strcpy are for string handling at run time. If you have two string literals, they are compile-time constants, and may as well get concatenated by the compiler in advance, to save execution time.
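Both uses can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the "BAD" result assumes an ASCII execution character set, where 0x42 is the letter B.

```c
#include <assert.h>
#include <string.h>

/* Adjacent literals are joined by the compiler, so this returns one
 * ordinary string with a single terminating NUL. */
const char *joined_name(void)
{
    return "sai" "krishna";
}

/* Splitting the literal ends the \x42 escape before the 'A'. */
const char *bad_word(void)
{
    return "\x42" "AD";
}

/* Self-check; the second assertion assumes ASCII. */
void check_literals(void)
{
    assert(strcmp(joined_name(), "saikrishna") == 0);
    assert(strcmp(bad_word(), "BAD") == 0);
}
```

No strcat call and no extra buffer are involved; the joined text already exists as a single literal in the compiled program.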

How to strtok() string using the null character as the delimiter? [closed]

I tried inputting the null character into the list of delimiters but it would not accept it. I tried inputting "\0" but it wouldn't accept it. I even tried putting in the double quotes with escape characters but it still would not accept it.
Is there a way I could do this?
According to strtok(3), this function is used to isolate sequential tokens in a null-terminated string. So the answer is no, not using strtok, since that function cannot distinguish a \0 separator from the terminator. You will have to write your own function (which is trivial).
Also read the BUGS section in the strtok man page ... better avoid using it.
If you are allowed to end your string with a double \0 as sentinel you can build your own function, something like:
#include <stdio.h>
#include <string.h>
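int main(void)
{
    /* The explicit trailing \0 plus the implicit terminator form the
       double-\0 sentinel that ends the loop. */
    const char *s = "abc\0def\0ghi\0";
    const char *p = s;
    while (*p) {
        puts(p);
        p = strchr(p, '\0');   /* jump to the end of this field */
        p++;                   /* step over the separator */
    }
    return 0;
}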
Output:
abc
def
ghi
For better or worse, C decided that all strings are null terminated, so you can't parse based on that character.
One workaround would be to replace all nulls in your (non-string) binary buffer with something you know doesn't occur in the buffer, then use that as your separator treating the buffer as char*.
In C, strings are just char arrays that must end with a null character. If a string in memory is not terminated that way, functions that read it will run past the end of the array until they happen to find a null byte, which is undefined behavior, and there is no way to predict where that byte will be. Your best bet is to create an array of char pointers (strings) and populate them in a more organized way.
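When the length of the buffer is known, the double-\0 sentinel is not needed at all. The following is a minimal sketch of such a hand-written splitter; split_on_nul and its parameters are hypothetical names, and it uses memchr so that it never reads past the stated length.

```c
#include <assert.h>
#include <string.h>

/* Counts the NUL-terminated fields in buf[0..len) and stores a pointer
 * to each of the first max_fields of them in fields[]. An explicit
 * length takes the place of a double-NUL sentinel. */
size_t split_on_nul(const char *buf, size_t len,
                    const char **fields, size_t max_fields)
{
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;

    assert(buf != NULL && fields != NULL);
    while (p < end) {
        const char *nul = memchr(p, '\0', (size_t)(end - p));
        if (nul == NULL)
            break;              /* trailing bytes without a NUL: stop */
        if (count < max_fields)
            fields[count] = p;
        count++;
        p = nul + 1;            /* continue after the separator */
    }
    return count;
}
```

Each stored pointer is an ordinary C string, so it can be passed straight to puts or strcmp. Unlike the sentinel approach, empty fields in the middle of the buffer are reported instead of ending the scan.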

Printing out backslash in C quine program

I'm trying to write a quine program for the following C source code:
#include<stdio.h>
char name[] = "Jacob Stinson";
int main(){
    char *c="#include<stdio.h> char name[] = \"Jacob Stinson\"; int main(){char *c=%c%s%c; printf(c,34,c,34);}";
    printf(c,34,c,34);
}
I need to include the backslash before the " in the string in order to properly print out line 3, however, when I print out *c, I want those backslashes to be present, as to correctly copy the source code. Currently it omits the backslashes from the output.
Wanted to see if anyone knows how to go about doing this.
Because the compiler interprets escape sequences in only one direction (unescaping them), I don't think it is possible to put an escape sequence in the code and have it appear unchanged in the output. The compiler always consumes one of the backslashes when it reads the source file, so the output differs from the source. The printf call uses %s so that the string can print itself (the recursive part of the problem), and, as you guessed correctly, the " delimiters have to be passed as integer character codes through %c. If escape sequences could be reproduced directly, there would be no need for %c to delimit the string. For the same reason I was not able to include any end-of-line character, so I wrote the same program on one line (without the #include <stdio.h> line and without a final end of line).
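The %c technique on its own can be sketched in a few lines. This is not a quine, only an illustration of passing characters as integer codes so that neither " nor \ has to appear escaped in the format string. It assumes ASCII (34 is the double quote, 92 is the backslash), and both helper names are made up.

```c
#include <assert.h>
#include <stdio.h>

/* Writes s wrapped in double quotes into out. The quotes are passed as
 * character code 34 (ASCII assumed), so the format needs no \" escape.
 * Returns the snprintf result. */
int quote_with_codes(char *out, size_t size, const char *s)
{
    assert(out != NULL && s != NULL);
    return snprintf(out, size, "%c%s%c", 34, s, 34);
}

/* Writes a backslash followed by a double quote using codes 92 and 34. */
int backslash_quote(char *out, size_t size)
{
    assert(out != NULL);
    return snprintf(out, size, "%c%c", 92, 34);
}
```

A quine built this way still has to supply every such code as a printf argument, which is why the one-line form with only %c and %s is the usual shape.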
