Split string into words in C

I have just written code to split a string into words.
If char *cmd = "Hello world baby", then I want argv[0] = "Hello", argv[1] = "world", argv[2] = "baby".
I cannot use the strdup function, so I want to implement this using malloc and strcpy.
My code is below.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define buf_size 128
int main() {
char *argv[16];
memset(argv, 0, sizeof(argv));
int words = 0;
char *cmd = "Hello world baby";
unsigned int len = strlen(cmd);
int start = 0;
for(unsigned int i = 0; i <= len; i++){
if(cmd[i] == ' ' | cmd[i] == '\0'){
++words;
char *w = (char *)malloc(sizeof(char)*(i-start) + 1);
strcpy(w, cmd + start);
w[i-start] = '\0';
argv[i] = w;
start = i + 1;
}
}
for(int i = 0; i < words; i++){
printf("%s\n", argv[i]);
free(argv[i]);
}
return 0;
}
I hoped that the printf function produces:
Hello
world
baby
However, when the printf() function is reached, the program triggers a segmentation fault.

Your primary problem, despite the banter in the comments about how to write your own version of strdup(), is that you really need strndup(). Eh?
You have the line:
strcpy(w, cmd + start);
Unfortunately, this copies the whole string from cmd + start into the allocated space, but you only wanted to copy (i - start) + 1 bytes including the null byte, because that's all the space you allocated. So, you have a buffer overflow (but not a stack overflow).
POSIX provides the function strndup()
with the signature:
extern char *strndup(const char *s, size_t size);
This allocates at most size + 1 bytes and copies at most size bytes from s and a null byte into the allocated space. You'd use:
argv[words - 1] = strndup(cmd + start, i - start); /* index by word count (words was just incremented), not by character position i */
to get the required result. If you don't have (or can't use) strndup(), you can write your own. That's easiest if you have strnlen(), but you can write your own version of that if necessary (you don't have it or can't use it):
char *my_strndup(const char *s, size_t len)
{
size_t nbytes = strnlen(s, len);
char *result = malloc(nbytes + 1);
if (result != NULL)
{
memmove(result, s, nbytes);
result[nbytes] = '\0';
}
return result;
}
This deals with the situation where the actual string is shorter than the maximum length by using the size from strnlen(). It's not clear that you are guaranteed to be able to access the memory at s + len - 1, so simply allocating and copying the maximum size is not appropriate.
Implementing strnlen():
size_t my_strnlen(const char *s, size_t size)
{
size_t count = 0;
while (count < size && *s++ != '\0')
count++;
return count;
}
"Official" versions of this are probably implemented in assembler and are more efficient, but I think that's a valid implementation in pure C.
Another alternative in your code is to use the knowledge of the length:
char *w = (char *)malloc(sizeof(char)*(i - start + 1));
memmove(w, cmd + start, i - start);
w[i-start] = '\0';
argv[words - 1] = w; /* index by word count (words was just incremented), not by character position i */
start = i + 1;
I note in passing that multiplying by sizeof(char) is a no-op since sizeof(char) == 1 by definition. You should include the + 1 in the multiplication in general (as I've reparenthesized the expression). If you were dealing with some structure and wanted N + 1 structures, you need to use (N + 1) * sizeof(struct WhatNot) and not N * sizeof(struct WhatNot) + 1. It's a good idea to head off bugs caused by sloppy coding practices while you're learning, even though there's no difference in the result here.
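To make the parenthesization point concrete, here is a minimal sketch. The structure layout and the helper name alloc_whatnots are hypothetical and exist only for illustration.

```c
#include <stdlib.h>

/* Hypothetical structure, used only to illustrate the size arithmetic. */
struct WhatNot {
    int id;
    double weight;
};

/* Allocates room for n + 1 structures (for example, n items plus a sentinel).
 * Writing n * sizeof(struct WhatNot) + 1 instead would allocate n structures
 * plus a single extra byte, which is too small for the last element. */
struct WhatNot *alloc_whatnots(size_t n)
{
    return malloc((n + 1) * sizeof(struct WhatNot));
}
```

With char the two spellings happen to give the same number, which is why the mistake goes unnoticed until the element type changes.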
There are those who excoriate the cast on the result of malloc(). I'm not one of them: I learned to program on a pre-standard C system where the cast was crucial because the char * address of an object was different from the address of the same memory location when referenced via a pointer to a type bigger than a char. That is, the short * address and char * address for the same memory location had different bit patterns. Not casting the result of malloc() led to crashes. (I observe that the primary excuse given for rejecting the cast is "it may hide errors if malloc() is not declared". That excuse went by the wayside when C99 mandated that functions must be declared before being used.)
Warning: no compiler was consulted about the validity of any of the code shown in this answer. Nor was the sanity of the overall algorithm tested.
You have:
if(cmd[i] == ' ' | cmd[i] == '\0'){
That | should be ||.
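Putting these points together, a minimal sketch of the splitting loop might look like the following. The helper name split_words and its max_words parameter are assumptions made for illustration, not part of the original code. Besides the bounded copy and the ||, it stores each word at argv[words] rather than argv[i]: in the original, i is a character position, so the words land at argv[5], argv[11] and argv[16] (the last one outside the 16-element array), and argv[0] is still NULL when it is printed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits cmd on single spaces and stores malloc'ed
 * copies of the words in argv (at most max_words of them).
 * Returns the number of words stored, or -1 if an allocation fails.
 * Like the original loop, consecutive spaces produce empty words. */
int split_words(const char *cmd, char *argv[], int max_words)
{
    int words = 0;
    size_t start = 0;
    size_t len = strlen(cmd);

    for (size_t i = 0; i <= len && words < max_words; i++) {
        if (cmd[i] == ' ' || cmd[i] == '\0') {
            size_t n = i - start;          /* length of this word */
            char *w = malloc(n + 1);
            if (w == NULL) {
                while (words > 0)          /* undo earlier allocations */
                    free(argv[--words]);
                return -1;
            }
            memcpy(w, cmd + start, n);     /* copy only this word */
            w[n] = '\0';
            argv[words++] = w;             /* index by word count, not by i */
            start = i + 1;
        }
    }
    return words;
}
```

The caller owns the returned strings and frees argv[0] through argv[words - 1] when done.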

Related

About one of the ways to handle a failure of a malloc()

I know several ways to handle a malloc() failure, and I prefer to use the result of malloc() as in the example code below. But I don't really understand how this expression works. Can anybody give me some advice about using it?
"if ((result = malloc(sizeof(char) * (strlen(s1) + strlen(s2) + 1))) != 0)"
char strjoin(char const *s1, char const *s2)
{
char *result;
int idx;
if ((result = malloc(sizeof(char) * (strlen(s1) + strlen(s2) + 1))) != 0)
{
idx = 0;
while (*s1)
result[idx++] = *s1++;
while (*s2)
result[idx++] = *s2++;
result[idx] = 0;
return (result);
}
return (0);
}
I found some example code that uses the same expression.

char strjoin... is a bug which prevents this code from compiling. It should obviously be char*.
malloc returns a pointer if successful, otherwise a null pointer. result is assigned whichever value it returns, and the comparison != 0 then evaluates to true (non-null) or false (null) accordingly.
Please note that assignment inside an if condition is regarded as poor and dangerous practice, so it should be avoided.
sizeof(char) * ... is nonsense. The very definition of sizeof is that sizeof(char) is always 1. Multiplying something with a constant guaranteed to be 1 is just useless bloat.
strlen(s1) + strlen(s2) + 1 sums up the length of two strings and then + 1 to make room for the null terminator.
Note that lines like result[idx++] = *s1++; have no less than 3 side effects in a single line, which is very bad practice. It is recommended to only have 1 side effect in an expression, to prevent sequencing/order of evaluation bugs.
Copying characters 1 by 1 is naive, memcpy is likely much faster than that.
There is no need for idx since we already know the length of each string from previous strlen calls. The programmer should have saved it down at that point. Instead they are making the program needlessly slow.
Consider a complete rewrite of this questionable code into something more readable and efficient:
#include <stdlib.h>
#include <string.h>
char* strjoin (const char *s1, const char* s2)
{
size_t length1 = strlen(s1);
size_t length2 = strlen(s2);
char* result = malloc(length1 + length2 + 1);
if(result == NULL)
return NULL;
memcpy(result, s1, length1);
memcpy(result+length1, s2, length2);
result[length1+length2] = '\0';
return result;
}

Manipulating a string and rewriting it by the function output

For some functions for string manipulation, I try to rewrite the function output onto the original string. I came up with the general scheme of
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *char_repeater(char *str, char ch)
{
int tmp_len = strlen(str) + 1; // initial size of tmp
char *tmp = (char *)malloc(tmp_len); // initial size of tmp
// the process is normally too complicated to calculate the final length here
int j = 0;
for (int i = 0; i < strlen(str); i++)
{
tmp[j] = str[i];
j++;
if (str[i] == ch)
{
tmp[j] = str[i];
j++;
}
if (j > tmp_len)
{
tmp_len *= 2; // growth factor
tmp = realloc(tmp, tmp_len);
}
}
tmp[j] = 0;
char *output = (char *)malloc(strlen(tmp) + 1);
// output matching the final string length
strncpy(output, tmp, strlen(tmp));
output[strlen(tmp)] = 0;
free(tmp); // Is it necessary?
return output;
}
int main()
{
char *str = "This is a test";
str = char_repeater(str, 'i');
puts(str);
free(str);
return 0;
}
Although it works on simple tests, I am not sure if I am on the right track.
Is this approach safe overall?
Of course, we do not re-write the string. We simply write new data (array of the characters) at the same pointer. If output is longer than str, it will rewrite the data previously written at str, but if output is shorter, the old data remains, and we would have a memory leak. How can we free(str) within the function before outputting to its pointer?

A pair of pointers can be used to iterate through the string.
When a matching character is found, increment the length.
Allocate output as needed.
Iterate through the string again and assign the characters.
This could be done in place if str was malloced in main.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *char_repeater(char *str, char ch)
{
int tmp_len = strlen(str) + 1; // initial size of tmp
char *find = str;
while ( *find) // not at terminating zero
{
if ( *find == ch) // match
{
tmp_len++; // add one
}
++find; // advance pointer
}
char *output = NULL;
if ( NULL == ( output = malloc(tmp_len)))
{
fprintf ( stderr, "malloc problem\n");
exit ( 1);
}
// output matching the final string length
char *store = output; // to advance through output
find = str; // reset pointer
while ( *find) // not at terminating zero
{
*store = *find; // assign
if ( *find == ch) // match
{
++store; // advance pointer
*store = ch; // assign
}
++store; // advance pointer
++find;
}
*store = 0; // terminate
return output;
}
int main()
{
char *str = "This is a test";
str = char_repeater(str, 'i');
puts(str);
free(str);
return 0;
}

For starters the function should be declared like
char * char_repeater( const char *s, char c );
because the function does not change the passed string.
Your function is unsafe and inefficient, at least because there are many dynamic memory allocations. You need to check that each dynamic memory allocation was successful. The function strlen is also called too often.
Also this code snippet
tmp[j] = str[i];
j++;
if (str[i] == ch)
{
tmp[j] = str[i];
j++;
}
if (j > tmp_len)
//...
can invoke undefined behavior. Imagine that the source string contains only one letter 'i'. In this case the variable tmp_len is equal to 2. So tmp[0] will be equal to 'i' and tmp[1] will also be equal to 'i'. In this case j, equal to 2, will not be greater than tmp_len, so no reallocation happens. As a result this statement
tmp[j] = 0;
will write outside the allocated memory.
And it is a bad idea to reassign the pointer str
char *str = "This is a test";
str = char_repeater(str, 'i');
As for your question whether you need to free the dynamically allocated array tmp
free(tmp); // Is it necessary?
then of course you need to free it because you allocated a new array for the result string
char *output = (char *)malloc(strlen(tmp) + 1);
And as for your other question
but if output is shorter, the old data remains, and we would have a
memory leak. How can we free(str) within the function before
outputting to its pointer?
then it does not make sense. The function dynamically creates a new character array that you need to free, and the address of the allocated array is assigned to the pointer str in main, which, as I already mentioned, is not a good idea.
You first need to count the length of the result array that will contain the duplicated characters, and after that allocate memory only once.
Here is a demonstration program.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char * char_repeater( const char *s, char c )
{
size_t n = 0;
for ( const char *p = s; ( p = strchr( p, c ) ) != NULL; ++p )
{
++n;
}
char *result = malloc( strlen( s ) + 1 + n );
if ( result != NULL )
{
if ( n == 0 )
{
strcpy( result, s );
}
else
{
char *p = result;
do
{
*p++ = *s;
if (*s == c ) *p++ = c;
} while ( *s++ );
}
}
return result;
}
int main( void )
{
const char *s = "This is a test";
puts( s );
char *result = char_repeater( s, 'i' );
if ( result != NULL ) puts( result );
free( result );
}
The program output is
This is a test
Thiis iis a test

My kneejerk reaction is to dislike the design. But I have reasons.
First, realloc() is actually quite efficient. If you are just allocating a few extra bytes every loop, then chances are that the standard library implementation simply increases the internal bytecount value associated with your memory. Caveats are:
Interleaving memory management. Your function here doesn’t have any, but should you start calling other routines then keeping track of all that becomes an issue. Anything that calls other memory management routines can lead to the next problem:
Fragmented memory. If at any time the available block is too small for your new request, then a much more expensive operation to obtain more memory and copy everything over becomes an issue.
Algorithmic issues are:
Mixing memory management in increases the complexity of your code.
Every occurrence of c invokes a function call with potential to be expensive. You cannot control when it is expensive and when it is not.
Worst-case options (char_repeater( "aaaaaaaaaa", 'a' )) trigger worst-case potentialities.
My recommendation is to simply make two passes.
This passes several smell tests:
Algorithmic complexity is broken down into two simpler parts:
counting space required, and
allocating and copying.
Worst-case scenarios for allocation/reallocation are reduced to a single call to malloc().
Issues with very large strings are reduced:
You need at most space for 2 large strings (not 3, possibly repeated)
Page fault / cache boundary issues are similar (or the same) for both methods
Considering there are no real downsides to using a two-pass approach, I think that using a simpler algorithm is reasonable. Here’s code:
#include <stdio.h>
#include <stdlib.h>
char * char_repeater( const char * s, char c )
{
// FIRST PASS
// (1) count occurrences of c in s
size_t number_of_c = 0;
const char * p = s;
while (*p) number_of_c += (*p++ == c);
// (2) get strlen s
size_t length_of_s = p - s;
// SECOND PASS
// (3) allocate space for the resulting string
char * dest = malloc( length_of_s + number_of_c + 1 );
// (4) copy s -> dest, duplicating every occurrence of c
if (dest)
{
char * d = dest;
while (*s)
if ((*d++ = *s++) == c)
*d++ = c;
*d = '\0';
}
return dest;
}
int main(void)
{
char * s = char_repeater( "Hello world!", 'o' );
puts( s );
free( s );
return 0;
}
As always, know your data
Whether or not a two-pass approach actually is better than a realloc() approach depends on more factors than what is evident in a posting on the internet.
Nevertheless, I would wager that for general purpose strings that this is a better choice.
But, even if it isn’t, I would argue that a simpler algorithm, splitting tasks into trivial sub-tasks, is far easier to read and maintain. You should start writing tricky algorithms only if you have use-case profiling saying you need to spend more attention on it.
Without that, readability and maintainability trumps all other concerns.
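For comparison, here is a rough sketch of what the realloc() approach looks like once the capacity check happens before each write and every allocation result is checked. The name char_repeater_grow and the growth rule are assumptions made for illustration; this is not the asker's code, only one way the growing-buffer variant could be written safely.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growing-buffer variant: duplicates every occurrence of c in s.
 * Returns a malloc'ed string the caller must free, or NULL on failure. */
char *char_repeater_grow(const char *s, char c)
{
    size_t cap = strlen(s) + 1;   /* bytes currently allocated */
    size_t len = 0;               /* characters written so far */
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (; *s != '\0'; s++) {
        /* Make room for up to two characters plus the terminator
         * before writing anything. */
        if (len + 3 > cap) {
            size_t new_cap = cap * 2 + 2;
            char *tmp = realloc(buf, new_cap);
            if (tmp == NULL) {    /* keep the old pointer so it can be freed */
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        buf[len++] = *s;
        if (*s == c)
            buf[len++] = c;
    }
    buf[len] = '\0';
    return buf;
}
```

Even in this tidied form, the bookkeeping for cap, len and the temporary pointer is interleaved with the copying logic, which is the extra complexity the two-pass version avoids.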

How to append a char at a defined position

I'm trying to add a character at a defined position. I've created a new function that allocates memory for one more char, saves the characters after the position, and then adds my character at the defined position. Now I don't know how to erase the characters after that position so that I can concatenate the saved string. Any solution?
Here is the beginning of my function:
void appendCharact(char *source, char carac, int position) {
source = realloc(source, strlen(source) * sizeof(char) + 1); //Get enough memory
char *temp = source.substr(position); //Save characters after my position
source[position] = carac; //Add the character
}
EDIT :
I'm trying to implement another "barbarous" solution. In debug mode I can see that I have approximately my new string, but it looks like I can't erase the old pointer...
void appendCharact(char *source, char carac, int position) {
char *temp = (char *)malloc((strlen(source) + 2) * sizeof(char));
int i;
for(i = 0; i < position; i++) {
temp[i] = source[i];
}
temp[position] = carac;
for (i = position; i < strlen(source); i++) {
temp[i + 1] = source[i];
}
temp[strlen(temp) + 1] = '\0';
free(source);
source = temp;
}

I mentioned that I could see five problems with the code as shown (copied here for reference)
void appendCharact(char * source, char carac , int position)
{
source = realloc(source, strlen(source) * sizeof(char) + 1); //Get enough memory
char * temp = source.substr(position); //Save characters after my position
source[position] = carac; //Add the character
}
The problems are (in no specific order):
strlen(source) * sizeof(char) + 1 is equal to (strlen(source) * sizeof(char)) + 1. It should have been (strlen(source) + 1) * sizeof(char). However, this works fine since sizeof(char) is defined in the C++ specification to always be equal to 1.
Related to the above: Simple char strings are really called null-terminated byte strings. As such they must be terminated by a "null" character ('\0'). This null character of course needs space in the allocated string, and is not counted by strlen. Therefore to add a character you need allocate strlen(source) + 2 characters.
Never assign back to the pointer you pass to realloc. If realloc fails, it will return a null pointer, making you lose the original memory, and that is a memory leak.
The realloc function return type is void*. In C++ you need to cast it to the correct pointer type for assignment.
You pass source by value, meaning inside the function you have a local copy of the pointer. When you assign to source you only assign to the local copy, the original pointer used in the call will not be modified.
Here are some other problems with the code, or its possible use:
Regarding the null-terminator, once you allocate enough memory for it you also need to add it to the string.
If the function is called with source being a literal string or an array or anything that wasn't returned by a previous call to malloc, calloc or realloc, then you can't pass that pointer to realloc.
You use source.substr(position) which is not possible since source isn't an object and therefore doesn't have member functions.

Your new solution is much closer to a working function but it still has some problems:
you do not check for malloc() failure.
you should avoid computing the length of the source string multiple times.
temp[strlen(temp) + 1] = '\0'; is incorrect as temp is not yet a proper C string and strlen(temp) + 1 would point beyond the allocated block anyway, you should just write temp[i + 1] = '\0';
the newly allocated string should be returned to the caller, either as the return value or via a char ** argument.
Here is a corrected version:
char *insertCharact(char *source, char carac, size_t position) {
size_t i, len;
char *temp;
len = source ? strlen(source) : 0;
temp = (char *)malloc(len + 2);
if (temp != NULL) {
/* sanitize position */
if (position > len)
position = len;
/* copy initial portion */
for (i = 0; i < position; i++) {
temp[i] = source[i];
}
/* insert new character */
temp[i] = carac;
/* copy remainder of the source string if any */
for (; i < len; i++) {
temp[i + 1] = source[i];
}
/* set the null terminator */
temp[i + 1] = '\0';
free(source);
}
return temp;
}

int pos = 1;
char toInsert = '-';
std::string text = "hallo";
std::stringstream buffer;
buffer << text.substr(0,pos);
buffer << toInsert;
buffer << text.substr(pos);
text = buffer.str();

Try using something like:
#include <string>
void appendCharAt(std::string& src, char c , int pos)
{
std::string front(src.begin(), src.begin() + pos); // characters before pos
std::string back(src.begin() + pos, src.end() ); // characters from pos onward
src = front + c + back; // concat together +-operator is overloaded for strings
}
With these ranges c ends up at index pos. This assumes 0 <= pos <= src.size(). Just try it out.

The C version of this will have to take care of the situation where realloc fails, in which case the original string is preserved. You should only overwrite the old pointer with the one returned from realloc upon success.
It might look something like this:
bool append_ch (char** str, char ch, size_t pos)
{
size_t prev_size = strlen(*str) + 1;
char* tmp = realloc(*str, prev_size+1);
if(tmp == NULL)
{
return false;
}
memmove(&tmp[pos+1], &tmp[pos], prev_size-pos);
tmp[pos] = ch;
*str = tmp;
return true;
}
Usage:
const char test[] = "hello word";
char* str = malloc(sizeof test);
memcpy(str, test, sizeof test);
puts(str);
bool ok = append_ch(&str, 'l', 9);
if(!ok)
asm ("HCF"); // error handling here
puts(str);
free(str);

Manipulating dynamic arrays in C

I am trying to solve StringMerge (PP0504B) problem from SPOJ (PL). Basically the problem is to write a function string_merge(char *a, char *b) that returns a pointer to an char array with string created from char arrays with subsequent chars chosen alternately (length of the array is the length of the shorter array provided as an argument).
The program I've created works well with test cases but it fails when I post it to SPOJ's judge. I'm posting my code here, as I believe the problem is related to memory allocation (I'm still learning this part of C). Could you take a look at my code?
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>
#define T_SIZE 1001
char* string_merge(char *a, char *b);
char* string_merge(char *a, char *b) {
int alen = strlen(a); int blen = strlen(b);
int len = (alen <= blen) ? alen : blen;
int i,j;
char *new_array = malloc (sizeof (char) * (len));
new_array[len] = '\0';
for(j=0,i=0;i<len;i++) {
new_array[j++] = a[i];
new_array[j++] = b[i];
}
return new_array;
}
int main() {
int n,c; scanf("%d", &n);
char word_a[T_SIZE];
char word_b[T_SIZE];
while(n--) {
scanf("%s %s", word_a, word_b);
char *x = string_merge(word_a, word_b);
printf("%s",x);
printf("\n");
memset(word_a, 0, T_SIZE);
memset(word_b, 0, T_SIZE);
memset(x,0,T_SIZE);
}
return 0;
}
Note: I'm compiling it with -std=c99 flag.

Off-by-one.
char *new_array = malloc (sizeof (char) * (len));
new_array[len] = '\0';
You're writing past the bounds of new_array. You must allocate space for len + 1 bytes:
char *new_array = malloc(len + 1);
Also, sizeof(char) is always 1, so spelling it out is superfluous, and so are the parentheses around len.
Woot, further errors!
So then you keep going and increment j twice within each iteration of the for loop. So essentially you end up writing (approximately) twice as many characters as you allocated space for.
Also, you're leaking memory by not free()ing the return value of string_merge() after use.
Furthermore, I don't see what the memsets are for; note that memset(x, 0, T_SIZE) writes T_SIZE bytes into a block that is smaller than that. I also suggest you use fgets() and strtok_r() for getting the two words instead of scanf() (which doesn't do what you think it does).

char *new_array = malloc (sizeof (char) * (len*2 + 1));
new_array[len*2] = '\0';
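Combining the fixes from the answers above, a minimal sketch of the merge function could look like this. The const qualifiers and the NULL check are additions made for illustration; the judge's expected prototype in the original task uses plain char * parameters.

```c
#include <stdlib.h>
#include <string.h>

/* Interleaves the first len characters of a and b, where len is the length
 * of the shorter string. Returns a malloc'ed string that the caller must
 * free, or NULL if the allocation fails. */
char *string_merge(const char *a, const char *b)
{
    size_t alen = strlen(a);
    size_t blen = strlen(b);
    size_t len = (alen <= blen) ? alen : blen;

    /* Two output characters per step, plus one byte for the terminator. */
    char *merged = malloc(2 * len + 1);
    if (merged == NULL)
        return NULL;

    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        merged[j++] = a[i];
        merged[j++] = b[i];
    }
    merged[j] = '\0';
    return merged;
}
```

In the calling loop, each result should be printed and then passed to free() before the next pair of words is read.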

What's wrong with this character buffer code?

For reasons that I promise exist, I'm reading input character by character, and if a character meets certain criteria, I'm writing it into a dynamically allocated buffer. This function adds the specified character to the "end" of the specified string. When reading out of the buffer, I read the first 'size' characters.
void append(char c, char *str, int size)
{
if(size + 1 > strlen(str))
str = (char*)realloc(str,sizeof(char)*(size + 1));
str[size] = c;
}
This function, through various iterations of development has produced such errors as "corrupted double-linked list", "double free or corruption". Below is a sample of how append is supposed to be used:
// buffer is a string
// bufSize is the number of non-garbage characters at the beginning of buffer
char *buft = buffer;
int bufLoc=0;
while((buft-buffer)/sizeof(char) < bufSize)
append(*(buft++),destination,bufLoc++);
It generally works for some seemingly arbitrary number of characters, and then aborts with error. If it's not clear what the second code snippet is doing, it's just copying from the buffer into some destination string. I know there's library methods for this, but I need a bit finer control of what exactly gets copied sometimes.
Thanks in advance for any insight. I'm stumped.

This function does not append a character to a buffer.
void append(char c, char *str, int size)
{
if(size + 1 > strlen(str))
str = realloc(str, size + 1);
str[size] = c;
}
First, what is strlen(str)? You can say "it's the length of str", but that's omitting some very important details. How does it compute the length? Easy -- str must be NUL-terminated, and strlen finds the offset of the first NUL byte in it. If your buffer doesn't have a NUL byte at the end, then you can't use strlen to find its length.
Typically, you will want to keep track of the buffer's length. In order to reduce the number of reallocations, keep track of the buffer size and the amount of data in it separately.
struct buf {
char *buf;
size_t buflen;
size_t bufalloc;
};
void buf_init(struct buf *b)
{
b->buf = NULL;
b->buflen = 0;
b->bufalloc = 0;
}
void buf_append(struct buf *b, int c)
{
if (b->buflen >= b->bufalloc) {
size_t newalloc = b->bufalloc ? b->bufalloc * 2 : 16;
char *newbuf = realloc(b->buf, newalloc);
if (!newbuf)
abort();
b->buf = newbuf;
b->bufalloc = newalloc;
}
b->buf[b->buflen++] = c;
}
Another problem
This code:
str = realloc(str, size + 1);
It only changes the value of str in append -- it doesn't change the value of str in the calling function. Function arguments are local to the function, and changing them doesn't affect anything outside of the function.
Minor quibbles
This is a bit strange:
// Weird
x = (char*)realloc(str,sizeof(char)*(size + 1));
The (char *) cast is not only unnecessary, but it can actually mask an error -- if you forget to include <stdlib.h>, the cast will allow the code to compile anyway. Bummer.
And sizeof(char) is 1, by definition. So don't bother.
// Fixed
x = realloc(str, size + 1);

When you do a:
str = (char*)realloc(str,sizeof(char)*(size + 1));
the changes in str will not be reflected in the calling function, in other words the changes are local to the function as the pointer is passed by value. To fix this you can either return the value of str:
char * append(char c, char *str, int size)
{
if(size + 1 > strlen(str))
str = (char*)realloc(str,sizeof(char)*(size + 1));
str[size] = c;
return str;
}
or you can pass the pointer by address:
void append(char c, char **str, int size)
{
if(size + 1 > strlen(*str))
*str = (char*)realloc(*str,sizeof(char)*(size + 1));
(*str)[size] = c;
}
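The two variants above still rely on strlen() on a buffer that may not be NUL-terminated, and they do not check the result of realloc(). A minimal sketch that passes the pointer by address, tracks the allocated size explicitly, and leaves the buffer untouched on failure could look like this. The name append_char and the cap parameter are assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical helper: writes c at index size of *str, growing the buffer
 * when needed. *cap holds the number of bytes currently allocated
 * (0 for a NULL buffer). Returns 0 on success, -1 if realloc fails,
 * in which case *str and *cap are unchanged. */
int append_char(char **str, size_t *cap, size_t size, char c)
{
    if (size >= *cap) {
        size_t new_cap = (*cap != 0) ? *cap * 2 : 16;
        while (new_cap <= size)        /* make sure index size fits */
            new_cap *= 2;
        char *tmp = realloc(*str, new_cap);
        if (tmp == NULL)
            return -1;
        *str = tmp;                    /* update the caller's pointer */
        *cap = new_cap;
    }
    (*str)[size] = c;
    return 0;
}
```

The caller starts with a NULL pointer and a capacity of 0, and appends a '\0' itself if the buffer is later to be used as a C string.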
