"Pattern matching" and extracting in C

I need to parse a lot of filenames (up to 250000 I guess), including the path, and extract some parts out of it.
Here is an example:
Original: /my/complete/path/to/80/01/a9/1d.pdf
Needed: 8001a91d
The "pattern" I am looking for will always begin with "/8". The parts I need to extract form a string of 8 hex digits.
My idea is the following (simplified for demonstration):
/* original argument */
char *path = "/my/complete/path/to/80/01/a9/1d.pdf";
/* pointer to substring */
char *begin = NULL;
/* final char array to be built */
char *hex = (char*)malloc(9);
/* find "pattern" */
begin = strstr(path, "/8");
if(begin == NULL)
return 1;
/* jump to first needed character */
begin++;
/* copy the needed characters to target char array */
strncpy(hex, begin, 2);
strncpy(hex+2, begin+3, 2);
strncpy(hex+4, begin+6, 2);
strncpy(hex+6, begin+9, 2);
strncpy(hex+8, "\0", 1);
/* print final char array */
printf("%s\n", hex);
This works. I just have the feeling it is not the most clever way. And that there might be some traps I don't see myself.
So, does someone have suggestions what could be dangerous with this pointer-shifting manner? What would be an improvement in your opinion?
Does C provide a functionality to do it like so s|/(8.)/(..)/(..)/(..)\.|\1\2\3\4| ? If I remember right some scripting languages have a feature like that; if you know what I mean.

C itself doesn't provide this, but you can use POSIX regex (regcomp() and regexec() from <regex.h>), which is a full-featured regular expression library. For a pattern as simple as yours, though, your current approach is probably the best way.
BTW, prefer memcpy to strncpy. Very few people know what strncpy is good for. And I'm not one of them.

/* original argument */
char *path = "/my/complete/path/to/80/01/a9/1d.pdf";
char *begin;
char hex[9];
size_t len;
/* find "pattern" */
begin = strstr(path, "/8");
if (!begin) return 1;
// sanity check
len = strlen(begin);
if (len < 12) return 2;
// more sanity
if (begin[3] != '/' || begin[6] != '/' || begin[9] != '/' ) return 3;
memcpy(hex, begin+1, 2);
memcpy(hex+2, begin+4, 2);
memcpy(hex+4, begin+7, 2);
memcpy(hex+6, begin+10, 2);
hex[8] = 0;
// For additional sanity, you could check for valid hex characters here
/* print final char array */
printf("%s\n", hex);

In the simple case of just matching /8./../../.. I'd personally go for the strstr() solution myself (no external dependency required). If the rules become more complex, though, you could try a lexer generator (flex and friends); they support regular expressions.
In your case something like this:
h2 [0-9A-Fa-f]{2}
mymatch ("/"{h2}){4}
could work. You'd have to set buffers to the match by side effect though as lexers typically return token identifiers.
Anyway, you'd gain the power of regexps without the dependencies but at the expense of generated (read: unreadable) code.

Related

I am trying to create a code polisher program in C

I am trying to create the function delete_comments(). The read_file() and main functions are given.
Implement function char *delete_comments(char *input) that removes C comments from program stored at input. input variable points to dynamically allocated memory. The function returns pointer to the polished program. You may allocate a new memory block for the output, or modify the content directly in the input buffer.
You’ll need to process two types of comments:
Traditional block comments delimited by /* and */. These comments may span multiple lines. You should remove only characters starting from /* and ending to */ and for example leave any following newlines untouched.
Line comments starting with // until the newline character. In this case, newline character must also be removed.
The function calling delete_comments() only handles return pointer from delete_comments(). It does not allocate memory for any pointers. One way to implement delete_comments() function is to allocate memory for destination string. However, if new memory is allocated then the original memory in input must be released after use.
I'm having trouble understanding why my current approach is wrong or what is the specific problem that I'm getting weird output. I'm approaching the problem by trying to create a new array where to copy the input string with the new rules.
#include "source.h"
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
/* Remove C comments from the program stored in memory block <input>.
* Returns pointer to code after removal of comments.
* Calling code is responsible of freeing only the memory block returned by
* the function.
*/
char *delete_comments(char *input)
{
input = malloc(strlen(input) * sizeof (char));
char *secondarray = malloc(strlen(input) * sizeof (char));
int x, y = 0;
for (x = 0, y = 0; input[x] != '\0'; x++) {
if ((input[x] == '/') && (input[x + 1] == '*')) {
int i = 0;
while ((input[x + i] != '*') && (input[x + i + 1] != '/')) {
y++;
i++;
}
}
else if ((input[x] == '/') && (input[x + 1] == '/')) {
int j = 0;
while (input[x + j] != '\n') {
y++;
j++;
}
}
else {
secondarray[x] = input[y];
y++;
}
}
return secondarray;
}
/* Read given file <filename> to dynamically allocated memory.
* Return pointer to the allocated memory with file content, or
* NULL on errors.
*/
char *read_file(const char *filename)
{
FILE *f = fopen(filename, "r");
if (!f)
return NULL;
char *buf = NULL;
unsigned int count = 0;
const unsigned int ReadBlock = 100;
unsigned int n;
do {
buf = realloc(buf, count + ReadBlock + 1);
n = fread(buf + count, 1, ReadBlock, f);
count += n;
} while (n == ReadBlock);
buf[count] = 0;
return buf;
}
int main(void)
{
char *code = read_file("testfile.c");
if (!code) {
printf("No code read");
return -1;
}
printf("-- Original:\n");
fputs(code, stdout);
code = delete_comments(code);
printf("-- Comments removed:\n");
fputs(code, stdout);
free(code);
}
Your program has fundamental issues.
It fails to tokenize the input. Comment start sequences can occur inside string literals, in which case they do not denote comments: "/* not a comment".
You have some basic bugs:
if ((input[x] == '/') && (input[x + 1] == '*')) {
int i = 0;
while ((input[x + i] != '*') && (input[x + i + 1] != '/')) {
y++;
i++;
}
}
Here, when we enter the loop, with i = 0, input + x is still pointing to the opening /. We did not skip over the opening * and are already looking for a closing *. This means that the sequence /*/ will be recognized as a complete comment, which it isn't.
This loop also assumes that every /* comment is properly closed. It's not checking for the null character which can terminate the input, so if the comment is not closed, it will march beyond the end of the buffer.
C has line continuations. In ISO C translation phase 2, all backslash-newline sequences are deleted, converting one or more physical lines into logical lines. What that means is that a // comment can span multiple physical lines:
// this is an \
extended comment
You can see, by the way, that StackOverflow's automatic language detector for syntax highlighting is getting this right!
Line continuations are independent of tokenization, which doesn't happen until translation phase 3. Which means:
/\
/\
this is an extended \
comment
That one has defeated StackOverflow's syntax highlighting.
Furthermore, a line continuation can happen in any token, possibly multiple times:
"\
this is a string literal\
"
If you really want to make this work 100% correctly, you need to parse the input. By "parse" I mean a more formal, rigorous detection routine that understands what it is reading, in the context it is reading it.
For example, there are many times where this code could be defeated.
printf("the answer is %d // %d\n", a, b);
would likely trip your // detection and strip the end of the printf.
There are two general approaches to the problem above:
Find every corner case where comment-like characters could be used, and write conditional statements to avoid them before stripping.
Fully parse the language, so you will know if you are within a string or some other context that's wrapping comment like characters, or if you are in the top level context where the characters really mean "this is a comment"
To learn about parsing, I generally recommend "The Dragon Book" but it is a hard read, unless you have studied a bit of Discrete Mathematics. It covers a lot of different parsing techniques, and in doing so it doesn't have many pages left for examples. This means that it's the kind of book where you have to read, think, and then program a mini-example. If you follow that path, there is no input you can't tackle.
If you are pragmatic in your solution, and it is not about learning parsing, but about stripping comments, I recommend that you find a well constructed parser for C, and then learn how to walk the Abstract Syntax Tree in an Emitter, which fails to emit the comments.
There are some projects that do this already, but I don't know if they have the right structure for easy modification. lint comes to mind, as well as other "pretty-printers". GCC certainly has the parsing code in there, but I've heard that GCC's Abstract Syntax Tree isn't easy to learn.
Your solution has several problems:
The worst issue
As the first instruction in delete_comments() you overwrite input with a new pointer returned by malloc(), which points to uninitialized memory with indeterminate contents.
As a consequence, the address of the real input is lost.
Oh, and please check the returned value whenever you call malloc().
Failing to increment the scanned position in comments correctly
You are scanning the input by the index x, but if you detect a comment, you don't change it.
You are actually advancing y but this is only used for the copying.
Think about lines like these:
int x; /* some /* weird /* comment */
///////////////////////////////
for (;;) { }
Ignoring character and string literals
Your solution should take character and string literals into account.
For example:
int c_plus_plus_comment_start = '//'; /* multi character constant */
const char* c_comment_start = "/*";
Note: There are more. Learn to use a debugger, or at least insert lots of printf()s in "interesting" places.

Replacing a whole word and not substrings in a string in C

I am trying to replace a whole word in a C array of characters while skipping substrings. I did some research but only found very complicated solutions, and I think I have a better idea if someone can give me a hand.
Let's say I have the string:
char sentence[100]= "apple tree house";
And I would like to replace tree with the number 12:
"apple 12 house"
I know that the words are delimited by spaces, so my idea is to:
1. Tokenize the string with white space as the delimiter.
2. In the while loop, check with the library function strcmp() whether the search string is equal to the token, and if it is, replace it.
The problem is that I couldn't work out how to actually replace the string.
void wordreplace(char string[], char search[], char replace[]) {
// Tokenize
char * token = strtok(string, " ");
while (token != NULL) {
if (strcmp(search, token) == 0) {
REPLACE SEARCH STRING WITH REPLACE STRING
}
token = strtok(NULL, " ");
}
printf("Sentence : %s", string);
}
Any suggestions on what I can use? I guess it might be really simple, but I am a beginner. Much appreciated :)
[EDIT]: Spaces are the only delimiters and usually the string to be replaced is not longer than the original.
I would avoid strtok in this case (because it will modify the string as a side effect of tokenizing it), and approach this by looking at the string essentially character-by-character and maintaining a "read" and "write" index. Because the output can never be longer than the input, the write index will never get ahead of the read one, and you can "write-back" and make the change within the same string.
To visualize this, I find it useful to write out the input in boxes and draw arrows to current read and write indexes and track through the process so you can verify that you have a system that will do what you want it to do and that your loops and indexes all work like you expect.
Here is one implementation that matches how my own mind tends to approach this sort of algorithm. It walks the string and looks ahead to try matching from the current character. If it finds a match, it copies the replace onto the current spot, and increments both indexes accordingly.
#include <assert.h>
#include <string.h>
void wordreplace(char * string, const char * search, const char * replace) {
// This is required to be true since we're going to do the replace
// in-place:
assert(strlen(replace) <= strlen(search));
// Get ourselves set up
int r = 0, w = 0;
int str_len = strlen(string);
int search_len = strlen(search);
int replace_len = strlen(replace);
// Walk through the input character by character.
while (r < str_len) {
// Is this character the start of a matching token? It is
// if we are at the start of a word (start of string or just
// after a space) and we see the search string followed by a
// space or end of string.
if ((r == 0 || string[r-1] == ' ') &&
strncmp(&string[r], search, search_len) == 0 &&
(string[r+search_len] == ' ' || string[r+search_len] == '\0')) {
// We matched the search token. Copy the replace token.
memcpy(&string[w], replace, replace_len);
// Update our indexes.
w += replace_len;
r += search_len;
} else {
// Otherwise just copy this character.
string[w++] = string[r++];
}
}
// Be sure to terminate the final version of the string.
string[w] = '\0';
}
(Note that I tweaked your function signature to use the more idiomatic pointer notation rather than char arrays, and per flu's comment below, I marked the search and replace tokens as "const" which is a way of the function advertising that it will not modify those strings.)
To do what you want to do becomes a little more involved because you need to handle the scenarios where:
replacement is shorter than original -- so you will need to move the remainder of line to follow the replacement text to avoid leaving empty space;
replacement is same length as original -- trivial case, just overwrite original with replacement; and finally
replacement is longer than original -- where you must validate the original string plus the replacement length difference will still fit in the storage for the original string, you must copy the end of line to a temporary buffer before making the replacement, and then add the rest of the line in the temporary buffer to the end.
strtok has some disadvantages here because it modifies the original string during the tokenizing process. (You can just make a copy, but if you want an in-place replacement, you need to look further.) A combination of strstr and strcspn allows you to operate on the original string in a more efficient manner when looking for a specific search string within the original.
strcspn can be used like strtok with the set of delimiters to provide the length of the current token found (to ensure strstr didn't match your search term as a lesser-included substring of a longer word, like tree in trees). Then it becomes a simple matter of looping with strstr, validating the length of the token with strcspn, and then applying one of the three cases above.
A short example implementation with comments included in-line to help you follow along could be:
#include <stdio.h>
#include <string.h>
#define MAXLIN 100
void wordreplace (char *str, const char *srch,
const char *repl, const char *delim)
{
char *p = str; /* pointer to str */
size_t lenword, /* length of word found */
lenstr = strlen (str), /* length of total string */
lensrch = strlen (srch), /* length of search word */
lenrepl = strlen (repl); /* length of replace word */
while ((p = strstr (p, srch))) { /* srch exist in rest of string? */
lenword = strcspn (p, delim); /* get length of word found */
if (lenword == lensrch) { /* word len match search len */
if (lenrepl == lensrch) /* if replace is same len */
memcpy (p, repl, lenrepl); /* just copy over */
else if (lenrepl > lensrch) { /* if replace is longer */
/* check that additional length will fit in str */
if (lenstr + lenrepl - lensrch > MAXLIN - 1) {
fputs ("error: replaced length would exceed size.\n",
stderr);
return;
}
if (!p[lenword]) { /* if no following char */
memcpy (p, repl, lenrepl); /* just copy replace */
p[lenrepl] = 0; /* and nul-terminate */
}
else { /* store rest of line in buffer, replace, add end */
char endbuf[MAXLIN]; /* temp buffer for end */
size_t lenend = strlen (p + lensrch); /* end length */
memcpy (endbuf, p + lensrch, lenend + 1); /* copy end */
memcpy (p, repl, lenrepl); /* make replacement */
memcpy (p + lenrepl, endbuf, lenend + 1); /* add end after, with nul */
}
}
else { /* otherwise replace is shorter than search */
size_t lenend = strlen (p + lenword); /* get end length */
memcpy (p, repl, lenrepl); /* copy replace */
/* move end to after replace */
memmove (p + lenrepl, p + lenword, lenend + 1);
}
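lenstr = lenstr + lenrepl - lensrch; /* keep total length current */
p += lenrepl; /* continue after the replacement */
}
else /* not a whole word: skip it so the loop makes progress */
p += lenword ? lenword : 1;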
}
}
int main (int argc, char **argv) {
char str[MAXLIN] = "apple tree house in the elm tree";
const char *search = argc > 1 ? argv[1] : "tree",
*replace = argc > 2 ? argv[2] : "12",
*delim = " \t\n";
wordreplace (str, search, replace, delim);
printf ("str: %s\n", str);
}
Example Use/Output
Your replace "tree" with "12" example in "apple tree house in the elm tree":
$ ./bin/wordrepl_strstr_strcspn
str: apple 12 house in the elm 12
A simple same-length replacement of "tree" with "core", e.g.
$ ./bin/wordrepl_strstr_strcspn tree core
str: apple core house in the elm core
The "longer than" replacement of "tree" with "bobbing":
$ ./bin/wordrepl_strstr_strcspn tree bobbing
str: apple bobbing house in the elm bobbing
There are many different ways you can approach this problem, so no one way is the right way. The key is to make it understandable and reasonably efficient. Look things over and let me know if you have further questions.

strcat() for formatted strings

I'm building a string piece by piece in my program and am currently using strcat() when I'm adding a simple string onto the end, but when I'm adding a formatted string I'm using sprintf(), e.g.:
int one = 1;
sprintf(instruction + strlen(instruction), " number %d", one);
Is it possible to concatenate a formatted string using strcat(), or what is the preferred method for this?
Your solution will work. Calling strlen is a bit awkward (particularly if the string gets quite long). sprintf() will return the length you have used [strcat won't], so one thing you can do is something like this:
char str[MAX_SIZE];
char *target = str;
target += sprintf(target, "%s", str_value);
target += sprintf(target, "somestuff %d", number);
if (something)
{
target += sprintf(target, "%s", str_value2);
}
else
{
target += sprintf(target, "%08x", num2);
}
I'm not sure strcat is much more efficient than sprintf() is when used in this way.
Edit: should write smaller examples...
No, it's not possible with strcat() alone, but you could use sprintf() for those simple strings as well and avoid calling strlen() every time:
len = 0;
len += sprintf(buf+len, "%s", str);
len += sprintf(buf+len, " number %d", one);
To answer the direct question, sure, it's possible to use strcat to append formatted strings. You just have to build the formatted string first, and then you can use strcat to append it:
#include <stdio.h>
#include <string.h>
int main(void) {
char s[100];
char s1[20];
char s2[30];
int n = 42;
double x = 22.0/7.0;
strcpy(s, "n = ");
sprintf(s1, "%d", n);
strcat(s, s1);
strcat(s, ", x = ");
sprintf(s2, "%.6f", x);
strcat(s, s2);
puts(s);
return 0;
}
Output:
n = 42, x = 3.142857
But this is not a particularly good approach.
sprintf works just as well writing to the end of an existing string. See Mats's answer and mux's answer for examples. The individual arrays used to hold individual fields are not necessary, at least not in this case.
And since this code doesn't keep track of the end of the string, the performance is likely to be poor. strcat(s1, s2) first has to scan s1 to find the terminating '\0', and then copy the contents of s2 into it. The other answers avoid this by advancing an index or a pointer to keep track of the end of the string without having to recompute it.
Also, the code makes no effort to avoid buffer overruns. strncat() can do this, but it just truncates the string; it doesn't tell you that it was truncated. snprintf() is a good choice; it returns the number of characters that it would have written (not counting the terminating '\0') if enough space were available. If this is greater than or equal to the size you specify, then the string was truncated.
/* other declarations as above */
int count;
count = snprintf(s, sizeof s, "n = %d, x = %.6f", n, x);
if (count < 0 || (size_t)count >= sizeof s) {
/* an encoding error occurred or the string was truncated */
}
And to append multiple strings (say, if some are appended conditionally or repeatedly), you can use the methods in the other answers to keep track of the end of the target string.
So yes, it's possible to append formatted strings with strcat(). It's just not likely to be a good idea.
What the preferred method is depends on what you are willing to use. Instead of doing all those manual (and potentially dangerous) string operations, I would use the GString data structure from GLib or GLib's g_strdup_printf function. For your problem, GString provides the g_string_append_printf function.
Write your own wrapper for your need.
A call to this would look like this :-
result = universal_concatenator(4,result,"numbers are %d %f\n",5,16.045);
result = universal_concatenator(2,result,"tail_string");
You could define one function that takes care of deciding whether sprintf() or strcat() is needed. This is what the function would look like:
/* you should pass the number of arguments
* make sure the second argument is a pointer to the result always
* if non formatted concatenation:
* call function with number_of_args = 2
* else
* call function with number of args according to format
* that is, if five inputs to sprintf(), then 5.
*
* NOTE : Here you make an assumption that result has been allocated enough memory to
* hold your concatenated string. This assumption holds true for strcat() or
* sprintf() of your previous implementation
*/
char* universal_concatenator(int number_of_args,...)
{
va_list args_list;
va_start(args_list,number_of_args);
int counter = number_of_args;
char *result = va_arg(args_list, char*);
char *format;
if(counter == 2) /* it is a non-formatted concatenation */
{
result = strcat(result,va_arg(args_list,char*));
va_end(args_list);
return result;
}
/* else part - here you perform formatted concatenation using sprintf*/
format = va_arg(args_list,char*);
vsprintf(result + strlen(result),format,args_list);
va_end(args_list);
return result;
}
/* dont forget to include the header
* <stdarg.h> #FOR-ANSI
* or <varargs.h> #FOR-UNIX
*/
The function first determines which of the two it should call (strcat or sprintf) and then makes the call, so you can concentrate on the actual logic of whatever you are working on!
Just ctrl+c code above and ctrl+v into your code base.
Note: Mats's answer is a good alternative for long strings. But for short string lengths (<250), this should do.

Best way to do binary arithmetic in C?

I am learning C and writing a simple program that will take 2 string values assumed to each be binary numbers and perform an arithmetic operation according to user selection:
Add the two values,
Subtract input 2 from input 1, or
Multiply the two values.
My implementation assumes each character in the string is a binary bit, e.g. char *bin5 = "0101";, but it seems too naive an approach to parse through the string a character at a time. Ideally, I would want to work with the binary values directly.
What is the most efficient way to do this in C? Is there a better way to treat the input as binary values rather than scanf() and get each bit from the string?
I did some research but I didn't find any approach that was obviously better from the perspective of a beginner. Any suggestions would be appreciated!
Advice:
There's not much that's obviously better than marching through the string a character at a time and making sure the user entered only ones and zeros. Keep in mind that even though you could write a really fast assembly routine if you assume everything is 1 or 0, you don't really want to do that. The user could enter anything, and you'd like to be able to tell them if they screwed up or not.
It's true that this seems mind-bogglingly slow compared to the couple cycles it probably takes to add the actual numbers, but does it really matter if you get your answer in a nanosecond or a millisecond? Humans can only detect 30 milliseconds of latency anyway.
Finally, it already takes far longer to get input from the user and write output to the screen than it does to parse the string or add the numbers, so your algorithm is hardly the bottleneck here. Save your fancy optimizations for things that are actually computationally intensive :-).
What you should focus on here is making the task less manpower-intensive. And, it turns out someone already did that for you.
Solution:
Take a look at the strtol() manpage:
long strtol(const char *nptr, char **endptr, int base);
This will let you convert a string (nptr) in any base from 2 to 36 to a long. It checks errors, too. Sample usage for converting a binary string:
#include <stdlib.h>
char buf[MAX_BUF];
get_some_input(buf);
char *err;
long number = strtol(buf, &err, 2);
if (err == buf || *err != '\0') {
// bad input (nothing was parsed, or something follows the digits): try again?
} else {
// number is now a long converted from a valid binary string.
}
Supplying base 2 tells strtol to convert binary literals.
First off, I recommend that you use something like strtol, as suggested by tgamblin;
it's better to use what the library gives you than to reinvent the wheel over and over again.
But since you are learning C, I wrote a little version without strtol.
It's neither fast nor safe, but I played a little with bit manipulation as an example.
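#include <stdio.h>
#include <string.h>
int main(void)
{
unsigned int data = 0;
unsigned int i = 0;
char str[] = "1001";
size_t n = strlen(str);
/* walk from the last character (least significant bit) to the first,
stopping at the start of the string or at a non-binary character */
while (n > 0 && (str[n-1] == '0' || str[n-1] == '1'))
{
data += (unsigned int)(str[n-1] - '0') << i;
i++;
n--;
}
printf("data %u\n", data);
return 0;
}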
In order to get the best performance, you need to distinguish between trusted and untrusted input to your functions.
For example, a function like getBinNum() which accepts input from the user should be checked for valid characters and compressed to remove leading zeroes. First, we'll show a general purpose in-place compression function:
// General purpose compression removes leading zeroes.
void compBinNum (char *num) {
char *src, *dst;
// Find first non-'0' and move chars if there are leading '0' chars.
for (src = dst = num; *src == '0'; src++);
if (src != dst) {
while (*src != '\0')
*dst++ = *src++;
*dst = '\0';
}
// Make zero if we removed the last zero.
if (*num == '\0')
strcpy (num, "0");
}
Then provide a checker function that returns either the passed in value, or NULL if it was invalid:
// Check untested number, return NULL if bad.
char *checkBinNum (char *num) {
char *ptr;
// Check for valid number.
for (ptr = num; *ptr != '\0'; ptr++)
if ((*ptr != '1') && (*ptr != '0'))
return NULL;
return num;
}
Then the input function itself:
#define MAXBIN 256
// Get number from (untrusted) user, return NULL if bad.
char *getBinNum (char *prompt) {
char *num, *ptr;
// Allocate space for the number.
if ((num = malloc (MAXBIN)) == NULL)
return NULL;
// Get the number from the user.
printf ("%s: ", prompt);
if (fgets (num, MAXBIN, stdin) == NULL) {
free (num);
return NULL;
}
// Remove newline if there.
if (num[strlen (num) - 1] == '\n')
num[strlen (num) - 1] = '\0';
// Check for valid number then compress.
if (checkBinNum (num) == NULL) {
free (num);
return NULL;
}
compBinNum (num);
return num;
}
Other functions to add or multiply should be written to assume the input is already valid since it will have been created by one of the functions in this library. I won't provide the code for them since it's not relevant to the question:
char *addBinNum (char *num1, char *num2) {...}
char *mulBinNum (char *num1, char *num2) {...}
If the user chooses to source their data from somewhere other than getBinNum(), you could allow them to call checkBinNum() to validate it.
If you were really paranoid, you could check every number passed in to your routines and act accordingly (return NULL), but that would require relatively expensive checks that aren't necessary.
Wouldn't it be easier to parse the strings into integers, and then perform your maths on the integers?
I'm assuming this is a school assignment, but I'm upvoting you because you appear to be giving it a good effort.
Assuming that a string is a binary number simply because it consists only of digits from the set {0,1} is dangerous. For example, when your input is "11", the user may have meant eleven in decimal, not three in binary. It is this kind of carelessness that gives rise to horrible bugs. Your input is ambiguously incomplete and you should really request that the user specifies the base too.

Is this a good substr for C?

See also C Tokenizer
Here is a quick substr() for C that I wrote (yes, the variable initializations need to be moved to the start of the function etc., but you get the idea).
I have seen many "smart" implementations of substr() that are simple one-line calls to strncpy()!
They are all wrong (strncpy does not guarantee null termination, and thus the call might NOT produce a correct substring!)
Here is something maybe better?
Bring out the bugs!
char* substr(const char* text, int nStartingPos, int nRun)
{
char* emptyString = strdup(""); /* C'mon! This cannot fail */
if(text == NULL) return emptyString;
int textLen = strlen(text);
--nStartingPos;
if((nStartingPos < 0) || (nRun <= 0) || (textLen == 0) || (textLen < nStartingPos)) return emptyString;
char* returnString = (char *)calloc((1 + nRun), sizeof(char));
if(returnString == NULL) return emptyString;
strncat(returnString, (nStartingPos + text), nRun);
/* We do not need emptyString anymore from this point onwards */
free(emptyString);
emptyString = NULL;
return returnString;
}
int main()
{
const char *text = "-2--4--6-7-8-9-10-11-";
char *p = substr(text, -1, 2);
printf("[*]'%s' (\")\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 1, 2);
printf("[*]'%s' (-2)\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 3, 2);
printf("[*]'%s' (--)\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 16, 2);
printf("[*]'%s' (10)\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 16, 20);
printf("[*]'%s' (10-11-)\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 100, 2);
printf("[*]'%s' (\")\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 1, 0);
printf("[*]'%s' (\")\n", ((p == NULL) ? "<NULL>" : p));
free(p);
return 0;
}
Output :
[*]'' (")
[*]'-2' (-2)
[*]'--' (--)
[*]'10' (10)
[*]'10-11-' (10-11-)
[*]'' (")
[*]'' (")
Your function seems very complicated for what should be a simple operation. Some problems are (not all of these are bugs):
strdup(), and other memory allocation functions, can fail, you should allow for all possible issues.
only allocate resources (memory in this case) if and when you need it.
you should be able to distinguish between errors and valid strings. At the moment, you don't know whether a malloc() failure in substr ("xxx",1,1) or a working substr ("xxx",1,0) produced an empty string.
you don't need to calloc() memory that you're going to overwrite anyway.
all invalid parameters should either cause an error or be coerced to a valid parameter (and your API should document which).
you don't need to set the local emptyString to NULL after freeing it - it will be lost on function return.
you don't need to use strncat() - you should know the sizes and the memory you have available before doing any copying so you can use the (most likely) faster memcpy().
your use of base-1 rather than base-0 for string offsets goes against the grain of C.
The following segment is what I'd do (I rather like the Python idiom of negative values to count from the end of the string but I've kept length rather than end position).
char *substr (const char *inpStr, int startPos, int strLen) {
char *buff;
/* Cannot do anything with NULL. */
if (inpStr == NULL) return NULL;
/* Allow negative positions to count from the end, but cannot
start before start of string, force to start. */
if (startPos < 0)
startPos = strlen (inpStr) + startPos;
if (startPos < 0)
startPos = 0;
/* Force negative lengths to zero and cannot
start after end of string, force to end. */
if (strLen < 0)
strLen = 0;
if (startPos >strlen (inpStr))
startPos = strlen (inpStr);
/* Adjust length if source string too short. */
if (strLen > strlen (&inpStr[startPos]))
strLen = strlen (&inpStr[startPos]);
/* Get long enough string from heap, return NULL if no go. */
if ((buff = malloc (strLen + 1)) == NULL)
return NULL;
/* Transfer string section and return it. */
memcpy (buff, &(inpStr[startPos]), strLen);
buff[strLen] = '\0';
return buff;
}
I would say return NULL if the input isn't valid rather than a malloc()ed empty string. That way you can test whether the function failed with if(p) rather than if(*p == 0).
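A minimal sketch of that convention, assuming a hypothetical substr_strict() with 0-based offsets (neither the name nor the exact validity rules come from the code under review). Invalid arguments and allocation failure both yield NULL, so a valid empty substring stays distinguishable from an error:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the original function: returns a malloc()ed
   copy of len characters of text starting at 0-based offset start, or
   NULL if the arguments are invalid or the allocation fails.
   The caller must free() the result. */
char *substr_strict(const char *text, size_t start, size_t len)
{
    size_t text_len;
    char *out;

    if (text == NULL)
        return NULL;
    text_len = strlen(text);
    /* Reject ranges that do not lie entirely inside the string. */
    if (start > text_len || len > text_len - start)
        return NULL;
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, text + start, len);
    out[len] = '\0';
    return out;
}
```

A caller can then write char *p = substr_strict(s, 3, 2); if (p) { ... free(p); } and a request for zero characters still comes back as a valid, non-NULL empty string.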
Also, I think your function leaks memory because emptyString is only free()d in one conditional. You should make sure you free() it unconditionally, i.e. right before the return.
As to your comment on strncpy() not NUL-terminating the string (which is true), if you use calloc() to allocate the string rather than malloc(), this won't be a problem if you allocate one byte more than you copy, since calloc() automatically sets all values (including, in this case, the end) to 0.
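Roughly, the calloc() approach looks like this. It is only a sketch, and copy_prefix() is a made-up name for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies at most n characters of src (which must be
   a valid string) into a fresh buffer of n + 1 bytes. calloc() zero-fills
   the buffer, so the byte after the copied characters is already '\0'
   even when strncpy() does not write a terminator itself. */
char *copy_prefix(const char *src, size_t n)
{
    char *out = calloc(n + 1, 1);
    if (out == NULL)
        return NULL;
    strncpy(out, src, n);
    return out;
}
```

The trade-off is that the whole buffer gets written twice: once by calloc() and once by the copy.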
I would give you more notes but I hate reading camelCase code. Not that there's anything wrong with it.
EDIT: With regards to your updates:
Be aware that the C standard defines sizeof(char) to be 1 regardless of your system. If you're using a computer that uses 9 bits in a byte (god forbid), sizeof(char) is still going to be 1. Not that there's anything wrong with saying sizeof(char) - it clearly shows your intention and provides symmetry with calls to calloc() or malloc() for other types. But sizeof(int) is actually useful (ints can be different sizes on 16-bit, 32-bit, and these newfangled 64-bit computers). The more you know.
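To illustrate, here is a small sketch (all function names are invented for the example). The two char allocators request the same number of bytes, while the int allocator genuinely depends on sizeof:

```c
#include <limits.h>
#include <stdlib.h>

/* sizeof(char) is 1 by definition, so both functions request n bytes. */
char *alloc_chars_explicit(size_t n) { return malloc(n * sizeof(char)); }
char *alloc_chars_plain(size_t n)    { return malloc(n); }

/* For other types the element size matters and varies by platform. */
int *alloc_ints(size_t n) { return malloc(n * sizeof(int)); }

/* Number of bits in a char: at least 8, but not necessarily exactly 8. */
int bits_per_char(void) { return CHAR_BIT; }
```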
I'd also like to reiterate that consistency with most other C code is to return NULL on an error rather than "". I know many functions (like strcmp()) will probably do bad things if you pass them NULL - this is to be expected. But the C standard library (and many other C APIs) take the approach of "It's the caller's responsibility to check for NULL, not the function's responsibility to baby him/her if (s)he doesn't." If you want to do it the other way, that's cool, but it's going against one of the stronger trends in C interface design.
Also, I would use strncpy() (or memcpy()) rather than strncat(). Using strncat() (and strcat()) obscures your intent - it makes someone looking at your code think you want to add to the end of the string (which you do, because after calloc(), the end is the beginning), when what you want to do is set the string. strncat() makes it look like you're adding to a string, while strcpy() (or another copy routine) would make it look more like what your intent is. The following three lines all do the same thing in this context - pick whichever one you think looks nicest:
strncat(returnString, text + nStartingPos, nRun);
strncpy(returnString, text + nStartingPos, nRun);
memcpy(returnString, text + nStartingPos, nRun);
Plus, strncpy() and memcpy() will probably be a (wee little) bit faster/more efficient than strncat().
text + nStartingPos is the same as nStartingPos + text - I would put the char * first, as I think that's clearer, but whatever order you want to put them in is up to you. Also, the parentheses around them are unnecessary (but nice), since + has higher precedence than ,.
EDIT 2: The three lines of code don't do the same thing, but in this context they will all produce the same result. Thanks for catching me on that.
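To make that caveat concrete, here is a small sketch (the helper names are made up for illustration). On a zero-filled buffer the three calls give the same string as long as the source holds at least n characters; on a buffer that already holds text, strncat() appends where strncpy() overwrites.

```c
#include <stdlib.h>
#include <string.h>

/* Each helper copies n characters of src into a zero-filled buffer of
   n + 1 bytes. Assumes strlen(src) >= n, since memcpy() would otherwise
   read past the end of src. The caller must free() the result. */
char *take_strncat(const char *src, size_t n)
{
    char *buf = calloc(n + 1, 1);
    if (buf != NULL)
        strncat(buf, src, n);  /* appends to "" and writes a terminator */
    return buf;
}

char *take_strncpy(const char *src, size_t n)
{
    char *buf = calloc(n + 1, 1);
    if (buf != NULL)
        strncpy(buf, src, n);  /* writes no terminator when src has n or more chars */
    return buf;
}

char *take_memcpy(const char *src, size_t n)
{
    char *buf = calloc(n + 1, 1);
    if (buf != NULL)
        memcpy(buf, src, n);   /* copies exactly n bytes, ignores '\0' */
    return buf;
}

/* On buffers that already contain "ab" the calls differ: cat_out ends up
   as "abcd", cpy_out as "cd". Each buffer needs room for at least 5 bytes. */
void nonempty_case(char *cat_out, char *cpy_out)
{
    strcpy(cat_out, "ab");
    strcpy(cpy_out, "ab");
    strncat(cat_out, "cdef", 2);  /* appends after the existing text */
    strncpy(cpy_out, "cdef", 2);  /* overwrites from the start; the old '\0' at index 2 remains */
}
```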
char* emptyString = strdup(""); /* C'mon! This cannot fail? */
You need to check for null. Remember that it still must allocate 1 byte for the null character.
strdup could fail (though it is very unlikely and not worth checking for, IMHO). It does have another problem however - it is not a Standard C function (it comes from POSIX and only entered the C standard with C23). It would be better to use malloc.
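For reference, a malloc()-based replacement that also performs the NULL check could look roughly like this. It is a sketch, and dup_string() is a hypothetical name rather than a library function:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strdup() built only on malloc() and memcpy().
   Returns NULL if s is NULL or the allocation fails; the caller must
   free() the result. */
char *dup_string(const char *s)
{
    size_t size;
    char *copy;

    if (s == NULL)
        return NULL;
    size = strlen(s) + 1;   /* include the terminating '\0' */
    copy = malloc(size);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, size);
    return copy;
}
```

Even dup_string("") needs one byte for the terminator, so its result has to be checked like any other allocation.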
You can also use the memmove function to return a substring from start to length.
Improving/adding another solution from paxdiablo's solution:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
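/* Returns a malloc()ed copy of slen characters of idata from 0-based
   offset start; slen == 0 means "up to the end of the string".
   Returns NULL on bad arguments or allocation failure; caller frees. */
char *splitstr(const char *idata, int start, int slen) {
    char *ret;
    int len;
    if (idata == NULL || start < 0 || slen < 0)
        return NULL;
    len = (int)strlen(idata);
    if (start > len)
        return NULL;
    if (slen == 0 || slen > len - start)
        slen = len - start;
    /* A local array cannot be returned, so the result lives on the heap. */
    ret = malloc(slen + 1);
    if (ret == NULL)
        return NULL;
    memmove(ret, idata + start, slen);
    ret[slen] = '\0';
    return ret;
}
/*
Usage:
char ostr[] = "Hello World!";
char *ores = splitstr(ostr, 0, 5);
if (ores != NULL) { puts(ores); free(ores); }
Outputs:
Hello
*/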
Hope it helps. Tested on Windows 7 Home Premium with the TCC C compiler.
