Is this a good substr for C?

See also C Tokenizer
Here is a quick substr() for C that I wrote (yes, the variable initializations need to be moved to the start of the function, etc., but you get the idea).
I have seen many "smart" implementations of substr() that are simple one-line calls to strncpy()!
They are all wrong (strncpy does not guarantee null termination and thus the call might NOT produce a correct substring!)
Here is something maybe better?
Bring out the bugs!
char* substr(const char* text, int nStartingPos, int nRun)
{
char* emptyString = strdup(""); /* C'mon! This cannot fail */
if(text == NULL) return emptyString;
int textLen = strlen(text);
--nStartingPos;
if((nStartingPos < 0) || (nRun <= 0) || (textLen == 0) || (textLen < nStartingPos)) return emptyString;
char* returnString = (char *)calloc((1 + nRun), sizeof(char));
if(returnString == NULL) return emptyString;
strncat(returnString, (nStartingPos + text), nRun);
/* We do not need emptyString anymore from this point onwards */
free(emptyString);
emptyString = NULL;
return returnString;
}
int main()
{
const char *text = "-2--4--6-7-8-9-10-11-";
char *p = substr(text, -1, 2);
printf("[*]'%s' (\")\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 1, 2);
printf("[*]'%s' (-2)\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 3, 2);
printf("[*]'%s' (--)\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 16, 2);
printf("[*]'%s' (10)\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 16, 20);
printf("[*]'%s' (10-11-)\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 100, 2);
printf("[*]'%s' (\")\n", ((p == NULL) ? "<NULL>" : p));
free(p);
p = substr(text, 1, 0);
printf("[*]'%s' (\")\n", ((p == NULL) ? "<NULL>" : p));
free(p);
return 0;
}
Output :
[*]'' (")
[*]'-2' (-2)
[*]'--' (--)
[*]'10' (10)
[*]'10-11-' (10-11-)
[*]'' (")
[*]'' (")

Your function seems very complicated for what should be a simple operation. Some problems are (not all of these are bugs):
strdup(), like other memory allocation functions, can fail; you should allow for every possible failure.
only allocate resources (memory in this case) if and when you need them.
you should be able to distinguish between errors and valid strings. At the moment, you can't tell whether an empty string came from a malloc() failure in substr ("xxx",1,1) or from a working substr ("xxx",1,0).
you don't need to calloc() memory that you're going to overwrite anyway.
all invalid parameters should either cause an error or be coerced to a valid parameter (and your API should document which).
you don't need to set the local emptyString to NULL after freeing it - it will be lost on function return.
you don't need to use strncat() - you should know the sizes and the memory you have available before doing any copying, so you can use the (most likely) faster memcpy().
your use of base-1 rather than base-0 for string offsets goes against the grain of C.
The following segment is what I'd do (I rather like the Python idiom of negative values to count from the end of the string but I've kept length rather than end position).
char *substr (const char *inpStr, int startPos, int strLen) {
char *buff; /* Result buffer, allocated below. */
/* Cannot do anything with NULL. */
if (inpStr == NULL) return NULL;
/* All negative positions to go from end, and cannot
start before start of string, force to start. */
if (startPos < 0)
startPos = strlen (inpStr) + startPos;
if (startPos < 0)
startPos = 0;
/* Force negative lengths to zero and cannot
start after end of string, force to end. */
if (strLen < 0)
strLen = 0;
if (startPos >strlen (inpStr))
startPos = strlen (inpStr);
/* Adjust length if source string too short. */
if (strLen > strlen (&inpStr[startPos]))
strLen = strlen (&inpStr[startPos]);
/* Get long enough string from heap, return NULL if no go. */
if ((buff = malloc (strLen + 1)) == NULL)
return NULL;
/* Transfer string section and return it. */
memcpy (buff, &(inpStr[startPos]), strLen);
buff[strLen] = '\0';
return buff;
}

I would return NULL if the input isn't valid rather than a malloc()ed empty string. That way you can test whether the function failed with if(p) rather than if(*p == 0).
Also, I think your function leaks memory because emptyString is only free()d in one conditional. You should make sure you free() it unconditionally, i.e. right before the return.
As to your comment on strncpy() not NUL-terminating the string (which is true), if you use calloc() to allocate the string rather than malloc(), this won't be a problem if you allocate one byte more than you copy, since calloc() automatically sets all values (including, in this case, the end) to 0.
I would give you more notes but I hate reading camelCase code. Not that there's anything wrong with it.
EDIT: With regards to your updates:
Be aware that the C standard defines sizeof(char) to be 1 regardless of your system. If you're using a computer that uses 9 bits in a byte (god forbid), sizeof(char) is still going to be 1. Not that there's anything wrong with saying sizeof(char) - it clearly shows your intention and provides symmetry with calls to calloc() or malloc() for other types. But sizeof(int) is actually useful (ints can be different sizes on 16- and 32- and these newfangled 64-bit computers). The more you know.
I'd also like to reiterate that consistency with most other C code is to return NULL on an error rather than "". I know many functions (like strcmp()) will probably do bad things if you pass them NULL - this is to be expected. But the C standard library (and many other C APIs) take the approach of "It's the caller's responsibility to check for NULL, not the function's responsibility to baby him/her if (s)he doesn't." If you want to do it the other way, that's cool, but it's going against one of the stronger trends in C interface design.
Also, I would use strncpy() (or memcpy()) rather than strncat(). Using strncat() (and strcat()) obscures your intent - it makes someone looking at your code think you want to add to the end of the string (which you do, because after calloc(), the end is the beginning), when what you want to do is set the string. strncat() makes it look like you're adding to a string, while strcpy() (or another copy routine) would make it look more like what your intent is. The following three lines all do the same thing in this context - pick whichever one you think looks nicest:
strncat(returnString, text + nStartingPos, nRun);
strncpy(returnString, text + nStartingPos, nRun);
memcpy(returnString, text + nStartingPos, nRun);
Plus, strncpy() and memcpy() will probably be a (wee little) bit faster/more efficient than strncat().
text + nStartingPos is the same as nStartingPos + text - I would put the char * first, as I think that's clearer, but whatever order you want to put them in is up to you. Also, the parentheses around them are unnecessary (but nice), since + has higher precedence than ,.
EDIT 2: The three lines of code don't do the same thing, but in this context they will all produce the same result. Thanks for catching me on that.

char* emptyString = strdup(""); /* C'mon! This cannot fail? */
You need to check for null. Remember that it still must allocate 1 byte for the null character.

strdup could fail (though it is very unlikely and not worth checking for, IMHO). It does have another problem however - it is not a Standard C function. It would be better to use malloc.

You can also use the memmove function to return a substring from start to length.
Improving/adding another solution from paxdiablo's solution:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
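char *splitstr(const char *idata, int start, int slen) {
char *ret;
/* A length of 0 means "everything from start to the end of the string". */
if(slen == 0) {
slen=strlen(idata)-start;
}
/* Allocate on the heap: a pointer to a local array is invalid after return. */
ret = malloc(slen + 1);
if(ret == NULL) return NULL;
memmove (ret,idata+start,slen);
ret[slen] = '\0'; /* memmove does not add the terminator */
return ret;
}
/*
Usage:
char ostr[]="Hello World!";
char *ores=splitstr(ostr, 0, 5);
puts(ores);
free(ores);
Outputs:
Hello
*/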
Hope it helps. Tested on Windows 7 Home Premium with the TCC C compiler.

Related

I am trying to create a code polisher program in C

I am trying to create the function delete_comments(). The read_file() and main functions are given.
Implement function char *delete_comments(char *input) that removes C comments from program stored at input. input variable points to dynamically allocated memory. The function returns pointer to the polished program. You may allocate a new memory block for the output, or modify the content directly in the input buffer.
You’ll need to process two types of comments:
Traditional block comments delimited by /* and */. These comments may span multiple lines. You should remove only characters starting from /* and ending to */ and for example leave any following newlines untouched.
Line comments starting with // until the newline character. In this case, newline character must also be removed.
The function calling delete_comments() only handles return pointer from delete_comments(). It does not allocate memory for any pointers. One way to implement delete_comments() function is to allocate memory for destination string. However, if new memory is allocated then the original memory in input must be released after use.
I'm having trouble understanding why my current approach is wrong, or what specific problem is causing the weird output I'm getting. I'm approaching the problem by creating a new array into which I copy the input string according to the new rules.
#include "source.h"
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
/* Remove C comments from the program stored in memory block <input>.
* Returns pointer to code after removal of comments.
* Calling code is responsible of freeing only the memory block returned by
* the function.
*/
char *delete_comments(char *input)
{
input = malloc(strlen(input) * sizeof (char));
char *secondarray = malloc(strlen(input) * sizeof (char));
int x, y = 0;
for (x = 0, y = 0; input[x] != '\0'; x++) {
if ((input[x] == '/') && (input[x + 1] == '*')) {
int i = 0;
while ((input[x + i] != '*') && (input[x + i + 1] != '/')) {
y++;
i++;
}
}
else if ((input[x] == '/') && (input[x + 1] == '/')) {
int j = 0;
while (input[x + j] != '\n') {
y++;
j++;
}
}
else {
secondarray[x] = input[y];
y++;
}
}
return secondarray;
}
/* Read given file <filename> to dynamically allocated memory.
* Return pointer to the allocated memory with file content, or
* NULL on errors.
*/
char *read_file(const char *filename)
{
FILE *f = fopen(filename, "r");
if (!f)
return NULL;
char *buf = NULL;
unsigned int count = 0;
const unsigned int ReadBlock = 100;
unsigned int n;
do {
buf = realloc(buf, count + ReadBlock + 1);
n = fread(buf + count, 1, ReadBlock, f);
count += n;
} while (n == ReadBlock);
buf[count] = 0;
return buf;
}
int main(void)
{
char *code = read_file("testfile.c");
if (!code) {
printf("No code read");
return -1;
}
printf("-- Original:\n");
fputs(code, stdout);
code = delete_comments(code);
printf("-- Comments removed:\n");
fputs(code, stdout);
free(code);
}

Your program has fundamental issues.
It fails to tokenize the input. Comment start sequences can occur inside string literals, in which case they do not denote comments: "/* not a comment".
You have some basic bugs:
if ((input[x] == '/') && (input[x + 1] == '*')) {
int i = 0;
while ((input[x + i] != '*') && (input[x + i + 1] != '/')) {
y++;
i++;
}
}
Here, when we enter the loop, with i = 0, input + x is still pointing to the opening /. We did not skip over the opening * and are already looking for a closing *. This means that the sequence /*/ will be recognized as a complete comment, which it isn't.
This loop also assumes that every /* comment is properly closed. It doesn't check for the null character that can terminate the input, so if the comment is not closed, it will march beyond the end of the buffer.
C has line continuations. In ISO C translation phase 2, all backslash-newline sequences are deleted, converting one or more physical lines into logical lines. What that means is that a // comment can span multiple physical lines:
// this is an \
extended comment
You can see, by the way, that StackOverflow's automatic language detector for syntax highlighting is getting this right!
Line continuations are independent of tokenization, which doesn't happen until translation phase 3. Which means:
/\
/\
this is an extended \
comment
That one has defeated StackOverflow's syntax highlighting.
Furthermore, a line continuation can happen in any token, possibly multiple times:
"\
this is a string literal\
"

If you really want to make this work 100% correctly, you need to parse the input. By "parse" I mean a more formal, rigorous detection routine that understands what it is reading, in the context it is reading it.
For example, there are many times where this code could be defeated.
printf("the answer is %d // %d\n", a, b);
would likely trip your // detection and strip the end of the printf.
There are two general approaches to the problem above:
Find every corner case where comment-like characters could be used, and write conditional statements to avoid them before stripping.
Fully parse the language, so you will know if you are within a string or some other context that's wrapping comment like characters, or if you are in the top level context where the characters really mean "this is a comment"
To learn about parsing, I generally recommend "The Dragon Book" but it is a hard read, unless you have studied a bit of Discrete Mathematics. It covers a lot of different parsing techniques, and in doing so it doesn't have many pages left for examples. This means that it's the kind of book where you have to read, think, and then program a mini-example. If you follow that path, there is no input you can't tackle.
If you are pragmatic in your solution, and it is not about learning parsing, but about stripping comments, I recommend that you find a well constructed parser for C, and then learn how to walk the Abstract Syntax Tree in an Emitter, which fails to emit the comments.
There are some projects that do this already, but I don't know if they have the right structure for easy modification. lint comes to mind, as well as other "pretty-printers". GCC certainly has the parsing code in there, but I've heard that GCC's Abstract Syntax Tree isn't easy to learn.

Your solution has several problems:
The worst issue
As the first instruction in delete_comments() you overwrite input with a new pointer returned by malloc(), which points to uninitialized memory.
In consequence the address to the real input is lost.
Oh, and please check the returned value, if you call malloc().
Failing to increment the scanned position in comments correctly
You are scanning the input by the index x, but if you detect a comment, you don't change it.
You are actually advancing y but this is only used for the copying.
Think about lines like these:
int x; /* some /* weird /* comment */
///////////////////////////////
for (;;) { }
Ignoring character and string literals
Your solution should take character and string literals into account.
For example:
int c_plus_plus_comment_start = '//'; /* multi character constant */
const char* c_comment_start = "/*";
Note: There are more. Learn to use a debugger, or at least insert lots of printf()s in "interesting" places.

Is using malloc within scanf compliant to c ansi standard

I want to read the input of a user and save it. What I have now does work, but I need to know whether it's legitimate (following the ANSI standard, C90) that scanf assigns the variable "length" before it allocates memory for the input,
or whether it's just a quirk of the compiler.
#include <stdio.h>
int main()
{
char* text;
int length = 0;
scanf("%s%n", text = malloc(length+1), &length);
printf("%s", text);
return 0;
}

This will not work as you expect.
At the time you call malloc, length still has the value 0, so you're only allocating one byte. length isn't updated until after scanf returns. So any non-empty string will write past the bounds of the allocated buffer, invoking undefined behavior.
While not exactly the same, what you can do is use getline, assuming you're running on a POSIX system such as Linux. This function reads a line of text (including the newline) and allocates space for that line.
char *text = NULL;
size_t n = 0;
ssize_t rval = getline(&text, &n, stdin);
if (rval == -1) {
perror("getline failed");
} else {
printf("%s", text);
}
free(text);

Apart from the obvious problem with misuse of scanf addressed in another answer, this doesn't follow any C standard either:
#include <stdio.h>
...
text = malloc(length+1)
Since you didn't include stdlib.h where malloc is found, C90 will assume that the function malloc has the form int malloc(int); which is of course nonsense.
And then when you try to assign an int (the result of malloc) to a char*, you have a constraint violation of C90 6.3.16.1, the rules of simple assignment.
Therefore your code is not allowed to compile cleanly, but the compiler must give a diagnostic message.
You can avoid this bug by upgrading to standard ISO C.

Issues well explained by others
I want to read the input of a user and save it
To add and meet OP's goal, similar code could do
int length = 255;
char* text = malloc(length+1);
if (text == NULL) {
Handle_OOM();
}
scanf("%255s%n", text, &length);
// Reallocate if length < 255 and/or
if (length < 255) {
char *t = realloc(text, length + 1);
if (t) text = t;
} else {
tbd(); // fail when length == 255 as all the "word" is not certainly read.
}
The above would be a simple approach if excessive input was deemed hostile.

Dynamically allocate user inputted string

I am trying to write a function that does the following things:
Start an input loop, printing '> ' each iteration.
Take whatever the user enters (unknown length) and read it into a character array, dynamically allocating the size of the array if necessary. The user-entered line will end at a newline character.
Add a null byte, '\0', to the end of the character array.
Loop terminates when the user enters a blank line: '\n'
This is what I've currently written:
void input_loop(){
char *str = NULL;
printf("> ");
while(printf("> ") && scanf("%a[^\n]%*c",&input) == 1){
/*Add null byte to the end of str*/
/*Do stuff to input, including traversing until the null byte is reached*/
free(str);
str = NULL;
}
free(str);
str = NULL;
}
Now, I'm not too sure how to go about adding the null byte to the end of the string. I was thinking something like this:
last_index = strlen(str);
str[last_index] = '\0';
But I'm not too sure if that would work though. I can't test if it would work because I'm encountering this error when I try to compile my code:
warning: ISO C does not support the 'a' scanf flag [-Wformat=]
So what can I do to make my code work?
EDIT: changing scanf("%a[^\n]%*c",&input) == 1 to scanf("%as[^\n]%*c",&input) == 1 gives me the same error.

First of all, scanf format strings do not use regular expressions, so I don't think something close to what you want will work. As for the error you get, according to my trusty manual, the %a conversion flag is for floating point numbers, but it only works on C99 (and your compiler is probably configured for C90)
But then you have a bigger problem. scanf expects that you pass it a previously allocated empty buffer for it to fill in with the read input. It does not malloc the string for you, so your attempts at initializing str to NULL and the corresponding frees will not work with scanf.
The simplest thing you can do is to give up on arbitrary-length strings. Create a large buffer and forbid inputs that are longer than that.
You can then use the fgets function to populate your buffer. To check if it managed to read the full line, check if your string ends with a "\n".
char str[256+1];
while(true){
printf("> ");
if(!fgets(str, sizeof str, stdin)){
//error or end of file
break;
}
size_t len = strlen(str);
if(len == 0 || str[len - 1] != '\n'){
//no trailing newline: user typed something too long (or input ended mid-line)
exit(1);
}
printf("user typed %s", str);
}
Another alternative is you can use a nonstandard library function. For example, in Linux there is the getline function that reads a full line of input using malloc behind the scenes.

No error checking, don't forget to free the pointer when you're done with it. If you use this code to read enormous lines, you deserve all the pain it will bring you.
#include <stdio.h>
#include <stdlib.h>
char *readInfiniteString() {
int l = 256;
char *buf = malloc(l);
int p = 0;
int ch; /* int, not char, so that EOF can be told apart from valid bytes */
ch = getchar();
while(ch != '\n' && ch != EOF) {
buf[p++] = ch;
if (p == l) {
l += 256;
buf = realloc(buf, l);
}
ch = getchar();
}
buf[p] = '\0';
return buf;
}
int main(int argc, char *argv[]) {
printf("> ");
char *buf = readInfiniteString();
printf("%s\n", buf);
free(buf);
}

If you are on a POSIX system such as Linux, you should have access to getline. It can be made to behave like fgets, but if you start with a null pointer and a zero length, it will take care of memory allocation for you.
You can use it in a loop like this:
#include <stdlib.h>
#include <stdio.h>
#include <string.h> // for strcmp
int main(void)
{
char *line = NULL;
size_t nline = 0;
for (;;) {
ssize_t n; // getline returns ssize_t
printf("> ");
// read line, allocating as necessary
n = getline(&line, &nline, stdin);
if (n < 0) break;
// remove trailing newline
if (n && line[n - 1] == '\n') line[n - 1] = '\0';
// do stuff
printf("'%s'\n", line);
if (strcmp("quit", line) == 0) break;
}
free(line);
printf("\nBye\n");
return 0;
}
The passed pointer and the length value must be consistent, so that getline can reallocate memory as required. (That means that you shouldn't change nline or the pointer line in the loop.) If the line fits, the same buffer is used in each pass through the loop, so that you have to free the line string only once, when you're done reading.

Some have mentioned that scanf is probably unsuitable for this purpose. I wouldn't suggest using fgets, either. Though it is slightly more suitable, there are problems that seem difficult to avoid, at least at first. Few C programmers manage to use fgets right the first time without reading the fgets manual in full. The parts most people manage to neglect entirely are:
what happens when the line is too large, and
what happens when EOF or an error is encountered.
The fgets() function shall read bytes from stream into the array pointed to by s, until n-1 bytes are read, or a <newline> is read and transferred to s, or an end-of-file condition is encountered. The string is then terminated with a null byte.
Upon successful completion, fgets() shall return s. If the stream is at end-of-file, the end-of-file indicator for the stream shall be set and fgets() shall return a null pointer. If a read error occurs, the error indicator for the stream shall be set, fgets() shall return a null pointer...
I don't feel I need to stress the importance of checking the return value too much, so I won't mention it again. Suffice to say, if your program doesn't check the return value your program won't know when EOF or an error occurs; your program will probably be caught in an infinite loop.
When no '\n' is present, the remaining bytes of the line have yet to be read. Thus, fgets will always parse the line at least once, internally. When you add extra logic to check for a '\n', you're parsing the data a second time.
This allows you to realloc the storage and call fgets again if you want to dynamically resize the storage, or discard the remainder of the line (warning the user of the truncation is a good idea), perhaps using something like fscanf(file, "%*[^\n]");.
hugomg mentioned using multiplication in the dynamic resize code to avoid quadratic runtime problems. Along this line, it would be a good idea to avoid parsing the same data over and over each iteration (thus introducing further quadratic runtime problems). This can be achieved by storing the number of bytes you've read (and parsed) somewhere. For example:
char *get_dynamic_line(FILE *f) {
size_t bytes_read = 0, alloc_size = 1;
char *bytes = NULL, *temp;
do {
alloc_size *= 2; /* Grow geometrically; the first pass allocates 2 bytes so fgets can read at least one */
temp = realloc(bytes, alloc_size);
if (temp == NULL) {
free(bytes);
return NULL;
}
bytes = temp;
bytes[bytes_read] = '\0'; /* Keep a valid string in case fgets reads nothing */
temp = fgets(bytes + bytes_read, alloc_size - bytes_read, f); /* Parsing data the first time */
bytes_read += strcspn(bytes + bytes_read, "\n"); /* Parsing data the second time */
} while (temp && bytes[bytes_read] != '\n');
bytes[bytes_read] = '\0';
return bytes;
}
Those who do manage to read the manual and come up with something correct (like this) may soon realise the complexity of an fgets solution is at least twice as poor as the same solution using fgetc. We can avoid parsing data the second time by using fgetc, so using fgetc might seem most appropriate. Alas most C programmers also manage to use fgetc incorrectly when neglecting the fgetc manual.
The most important detail is to realise that fgetc returns an int, not a char. It may return typically one of 256 distinct values, between 0 and UCHAR_MAX (inclusive). It may otherwise return EOF, meaning there are typically 257 distinct values that fgetc (or consequently, getchar) may return. Trying to store those values into a char or unsigned char results in loss of information, specifically the error modes. (Of course, this typical value of 257 will change if CHAR_BIT is greater than 8, and consequently UCHAR_MAX is greater than 255)
char *get_dynamic_line(FILE *f) {
size_t bytes_read = 0;
char *bytes = NULL;
do {
if ((bytes_read & (bytes_read + 1)) == 0) {
void *temp = realloc(bytes, bytes_read * 2 + 1);
if (temp == NULL) {
free(bytes);
return NULL;
}
bytes = temp;
}
int c = fgetc(f);
bytes[bytes_read] = c >= 0 && c != '\n'
? c
: '\0';
} while (bytes[bytes_read++]);
return bytes;
}

Converting a decimal number to binary in C

I was trying to convert a decimal from [0, 255] to a 8 bit binary number, where each bit will be separated by a comma and a space. I tried the following (eventually it worked, except for the last bit does not require any separator ):
#include <stdio.h>
#include <stdlib.h>
char* atobin(int input) {
char *str = malloc(0);
int index, count = 0;
if (input > 255| input < 0) {
printf ("Input out of range. Abort.\n");
exit(EXIT_FAILURE);
}
for (index = 7; index >= 0; index--) {
*(str + (count++)) = (input >> index & 1) ? '1' : '0';
*(str + (count++)) = ',';
*(str + (count++)) = ' ';
}
*(str + count) = '\0';
return str;
}
int main(int argc, char *argv[]) {
printf("%s\n", atobin(atoi(argv[1])));
return EXIT_SUCCESS;
}
Now I have a few questions:
I used malloc(0); as far as I am concerned, it will
allocate no memory from the heap. So, how/ why is it working?
Is the declaration *(str + count) = '\0'; necessary?
Is there any way to optimize this code?
UPDATE
To carry on this experiment, I have taken the atobin function in to a .h file. This time it creates some problems.
Now I add my last question:
What should be the minimum integer to be used for the parameter of malloc? Some trial-and-error makes me guess it should be 512. Any idea?

From a malloc doc:
If size is zero, the return value depends on the particular library implementation (it may or may not be a null pointer), but the returned pointer shall not be dereferenced.
That it works for you is just luck. Try some random malloc and free afterwards and you will have a high probability - but no guarantee - that you will get a crash.
This null-terminates the string. Depends on how you want to use the string if you need it. The printf in your example needs it, because that's the only way for it to know when the string ends.
I'm sorry, I don't have time to take a closer look.

malloc(0) is valid to use; what it returns is implementation-defined. But what happens if you access an object through the pointer returned by malloc(0) is undefined behaviour. You can read this related question.
In C, since strings are character arrays terminated by \0, it is better to use the *(str + count) = '\0'; statement to set the last character of the string. The code may appear to work without it if the byte there happens to be 0, since that is the same as terminating the string, but it's better to always set the end of the string explicitly.

"Pattern matching" and extracting in C

I need to parse a lot of filenames (up to 250000 I guess), including the path, and extract some parts out of it.
Here is an example:
Original: /my/complete/path/to/80/01/a9/1d.pdf
Needed: 8001a91d
The "pattern" I am looking for will always begin with "/8". The parts I need to extract form an 8 hex-digits string.
My idea is the following (simplified for demonstration):
/* original argument */
char *path = "/my/complete/path/to/80/01/a9/1d.pdf";
/* pointer to substring */
char *begin = NULL;
/* final char array to be built */
char *hex = (char*)malloc(9);
/* find "pattern" */
begin = strstr(path, "/8");
if(begin == NULL)
return 1;
/* jump to first needed character */
begin++;
/* copy the needed characters to target char array */
strncpy(hex, begin, 2);
strncpy(hex+2, begin+3, 2);
strncpy(hex+4, begin+6, 2);
strncpy(hex+6, begin+9, 2);
strncpy(hex+8, "\0", 1);
/* print final char array */
printf("%s\n", hex);
This works. I just have the feeling it is not the most clever way. And that there might be some traps I don't see myself.
So, does someone have suggestions what could be dangerous with this pointer-shifting manner? What would be an improvement in your opinion?
Does C provide a functionality to do it like so s|/(8.)/(..)/(..)/(..)\.|\1\2\3\4| ? If I remember right some scripting languages have a feature like that; if you know what I mean.

C itself doesn't provide this, but you can use POSIX regex. It's a full-featured regular expression library. But for a pattern as simple as yours, this probably is the best way.
BTW, prefer memcpy to strncpy. Very few people know what strncpy is good for. And I'm not one of them.

/* original argument */
char *path = "/my/complete/path/to/80/01/a9/1d.pdf";
char *begin;
char hex[9];
size_t len;
/* find "pattern" */
begin = strstr(path, "/8");
if (!begin) return 1;
// sanity check
len = strlen(begin);
if (len < 12) return 2;
// more sanity
if (begin[3] != '/' || begin[6] != '/' || begin[9] != '/' ) return 3;
memcpy(hex, begin+1, 2);
memcpy(hex+2, begin+4, 2);
memcpy(hex+4, begin+7, 2);
memcpy(hex+6, begin+10, 2);
hex[8] = 0;
// For additional sanity, you could check for valid hex characters here
/* print final char array */
printf("%s\n", hex);

In the simple case of just matching /8./../../.. I'd personally go for the strstr() solution myself (no external dependency required). If the rules become more complex, though, you could try a lexer (flex and friends); they support regular expressions.
In your case something like this:
h2 [0-9A-Fa-f]{2}
mymatch ("/"{h2}){4}
could work. You'd have to set buffers to the match by side effect though as lexers typically return token identifiers.
Anyway, you'd gain the power of regexps without the dependencies but at the expense of generated (read: unreadable) code.
