Parsing character array to words held in pointer array (C-programming)

I am trying to separate each word from a character array and put them into a pointer array, one word for each slot. Also, I am supposed to use isspace() to detect blanks. But if there is a better way, I am all ears. At the end of the code I want to print out the content of the parameter array.
Let's say the line is: "this is a sentence". What happens is that it prints out "sentence" (the last word in the line, and usually followed by some random character) 4 times (the number of words). Then I get "Segmentation fault (core dumped)".
Where am I going wrong?
int split_line(char line[120])
{
char *param[21]; // Here I want to put one word for each slot
char buffer[120]; // Word buffer
int i; // For characters in line
int j = 0; // For param words
int k = 0; // For buffer chars
for(i = 0; i < 120; i++)
{
if(line[i] == '\0')
break;
else if(!isspace(line[i]))
{
buffer[k] = line[i];
k++;
}
else if(isspace(line[i]))
{
buffer[k+1] = '\0';
param[j] = buffer; // Puts word into pointer array
j++;
k = 0;
}
else if(j == 21)
{
param[j] = NULL;
break;
}
}
i = 0;
while(param[i] != NULL)
{
printf("%s\n", param[i]);
i++;
}
return 0;
}

There are several small problems in this code:
param[j] = buffer; k = 0; : every param[j] points to the start of the same buffer, and resetting k makes each new word overwrite the previous one
if(!isspace(line[i])) ... else if(isspace(line[i])) ... else ... : isspace(line[i]) is either true or false, so one of the first two branches is always taken and the third (the j == 21 check, the only place that stores the NULL terminator) is never reached, which leaves the printing loop with no NULL to stop at
if (line[i] == '\0') : you forget to terminate the current word with a '\0'
if there are multiple consecutive white spaces, you currently (try to) add empty words to param
Here is a working version:
int split_line(char line[120])
{
char *param[21]; // Here I want to put one word for each slot
char buffer[120]; // Word buffer
int i; // For characters in line
int j = 0; // For param words
int k = 0; // For buffer chars
int inspace = 0;
param[j] = buffer;
for(i = 0; i < 120; i++) {
if(line[i] == '\0') {
param[j++][k] = '\0';
param[j] = NULL;
break;
}
else if(!isspace(line[i])) {
inspace = 0;
param[j][k++] = line[i];
}
else if (! inspace) {
inspace = 1;
param[j++][k] = '\0';
if(j == 20) { // param[20] is the last slot, keep it for the NULL terminator
param[j] = NULL;
break;
}
param[j] = &(param[j-1][k+1]);
k = 0;
}
}
i = 0;
while(param[i] != NULL)
{
printf("%s\n", param[i]);
i++;
}
return 0;
}
I only fixed the errors. I leave the following improvements to you as an exercise:
the split_line routine should not print the words itself but rather return an array of words - beware that you cannot return an automatic array, but that would be another question
you should not have magic constants in your code (120); you should at least use a #define and symbolic constants, or better, accept a line of any size - here again it is not simple because you will have to malloc and free at the appropriate places, and again that would be a different question
Anyway, good luck in learning that good old C :-)

This line does not seem right to me:
param[j] = buffer;
because you keep assigning the same value, buffer, to every param[j].
I would suggest you copy all the chars from line[120] to buffer[120], then point param[j] to buffer + Next_Word_Position, the offset at which each word starts.
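
The suggestion above can be sketched roughly as follows. This is a minimal sketch, not the original poster's code: the function name split_words, the caller-supplied buffer, and the max_words parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies line into buffer (which must hold at least
   strlen(line) + 1 bytes), overwrites each whitespace character in the copy
   with '\0', and stores a pointer to the start of every word in param.
   At most max_words - 1 words are stored; param is always NULL-terminated.
   Returns the number of words stored. */
size_t split_words(const char *line, char *buffer, char *param[], size_t max_words)
{
    assert(line != NULL && buffer != NULL && param != NULL && max_words > 0);

    size_t count = 0;
    int in_word = 0;

    strcpy(buffer, line);
    for (char *p = buffer; *p != '\0'; p++) {
        if (isspace((unsigned char)*p)) {
            *p = '\0';          /* terminate the word that just ended */
            in_word = 0;
        } else if (!in_word) {
            if (count + 1 >= max_words)
                break;          /* keep one slot free for the NULL terminator */
            param[count++] = p; /* a new word starts here */
            in_word = 1;
        }
    }
    param[count] = NULL;
    return count;
}
```

Because the pointers in param refer into buffer, the caller has to keep buffer alive for as long as the words are used. Runs of several spaces produce no empty words, since a word only starts at the first non-space character after whitespace.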

You may want to look at strtok in string.h. It sounds like this is what you are looking for, as it will separate words/tokens based on the delimiter you choose. To separate by spaces, simply use:
dest = strtok(src, " ");
Where src is the source string and dest is the destination for the first token on the source string. Looping through until dest == NULL will give you all of the separated words, and all you have to do is change dest each time based on your pointer array. It is also nice to note that passing NULL for the src argument will continue parsing from where strtok left off, so after an initial strtok outside of your loop, just use src = NULL inside. I hope that helps. Good luck!
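
A loop along those lines might look like the sketch below. The helper name split_with_strtok and the max_words parameter are assumptions for illustration; the delimiter set is widened to space, tab, and newline, which is also an assumption about the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits line in place with strtok and stores up to
   max_words - 1 token pointers in param, followed by NULL.
   strtok writes '\0' into line, so line must be a modifiable array,
   not a string literal. Returns the number of tokens stored. */
size_t split_with_strtok(char *line, char *param[], size_t max_words)
{
    assert(line != NULL && param != NULL && max_words > 0);

    size_t count = 0;
    /* The first call passes the string; later calls pass NULL to continue. */
    char *token = strtok(line, " \t\n");
    while (token != NULL && count + 1 < max_words) {
        param[count++] = token;
        token = strtok(NULL, " \t\n");
    }
    param[count] = NULL;
    return count;
}
```

Note that strtok keeps its position in hidden static state, so it cannot be used to tokenize two strings in an interleaved fashion.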

Related

Optimizing a do while loop/for loop in C

I was doing an exercise from LeetCode that consists of deleting adjacent duplicate characters from a string until no two adjacent characters are equal. With some help I wrote code that solves most test cases, but the string length can be up to 10^5, and on one test case it exceeds the time limit, so I need some tips on how to optimize it.
My code:
char res[100000]; //up to 10^5
char * removeDuplicates(char * s){
//int that verifies if any char from the string can be deleted
int ver = 0;
//do while loop that reiterates to eliminate the duplicates
do {
int lenght = strlen(s);
int j = 0;
ver = 0;
//for loop that if there are duplicates adds one to ver and deletes the duplicate
for (int i = 0; i < lenght ; i++){
if (s[i] == s[i + 1]){
i++;
j--;
ver++;
}
else {
res[j] = s[i];
}
j++;
}
//copying the res string into the s to redo the loop if necessary
strcpy(s,res);
//clar the res string
memset(res, '\0', sizeof res);
} while (ver > 0);
return s;
}
The code can't pass a speed test whose string is close to the length limit (10^5). I won't put it here because it's a really big text, but if you want to check it, it is test case 104 of the LeetCode Daily Problem.

If it was me doing something like that, I would basically do it like a simple naive string copy, but keep track of the last character copied and if the next character to copy is the same as the last then skip it.
Perhaps something like this:
char result[1000]; // Assumes no input string will be longer than this
unsigned source_index; // Index into the source string
unsigned dest_index; // Index into the destination (result) string
// Always copy the first character
result[0] = source_string[0];
// Start with 1 for source index, since we already copied the first character
for (source_index = 1, dest_index = 0; source_string[source_index] != '\0'; ++source_index)
{
if (source_string[source_index] != result[dest_index])
{
// Next character is not equal to last character copied
// That means we can copy this character
result[++dest_index] = source_string[source_index];
}
// Else: Current source character was equal to last copied character
}
// Terminate the destination string
result[dest_index + 1] = '\0';
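
The loop above keeps one character from each run of equal characters. If the task is instead the one the question's code attempts, where both characters of an adjacent equal pair are deleted and the process repeats until no such pair is left, a single pass that treats the kept prefix as a stack avoids the repeated rescans. A minimal sketch under that assumption (the name remove_adjacent_pairs is made up here):

```c
#include <assert.h>
#include <stddef.h>

/* Removes adjacent equal pairs in place, repeatedly, in one O(n) pass.
   The prefix s[0..top) acts as a stack of the characters kept so far. */
char *remove_adjacent_pairs(char *s)
{
    assert(s != NULL);

    size_t top = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (top > 0 && s[top - 1] == s[i])
            top--;              /* pair found: pop the previous character */
        else
            s[top++] = s[i];    /* no pair: push the current character */
    }
    s[top] = '\0';
    return s;
}
```

Each character is pushed at most once and popped at most once, so the work is linear in the string length, and no second buffer, strcpy, or memset is needed.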

C-Lang: Segmentation Fault when working with string in for-loop

Quite recently, at the university, we began to study strings in the C programming language, and as a homework, I was given the task of writing a program to remove extra words.
While writing a program, I faced an issue with iteration through a string that I could solve in a hacky way. However, I would like to deal with the problem with your help, since I cannot find the error myself.
The problem is that when I use the strlen(buffer) function as a for-loop condition, the code compiles easily and there are no errors at runtime, although when I use the __act_buffer_len variable, which is assigned a value of strlen(buffer) there will be a segmentation fault at runtime.
I tried many more ways to solve this problem, but the only one, which I already described, worked for me.
// deletes words with <= 2 letters
char* _delete_odd(const char* buffer, char delim)
{
int __act_buffer_len = strlen(buffer);
// for debugging purposes
printf("__actbuff: %d\n", __act_buffer_len);
printf("sizeof: %d\n", sizeof(buffer));
printf("strlen: %d\n", strlen(buffer));
char* _newbuff = malloc(__act_buffer_len + 1); // <- new buffer without words with less than 2 unique words
char* _tempbuff; // <- used to store current word
int beg_point = 0;
int curr_wlen = 0;
for (int i = 0; i < strlen(buffer); i++) // no errors at runtime, app runs well
// for (int i = 0; i < __act_buffer_len; i++) // <- segmentation fault when loop is reaching a space character
// for (int i = 0; buffer[i] != '\0'; i++) // <- also segmentation fault at the same spot
// for (size_t i = 0; i < strlen(buffer); i++) // <- even this gives a segmentation fault which is totally confusing for me
{
printf("strlen in loop %d\n", i);
if (buffer[i] == delim)
{
char* __cpy;
memcpy(__cpy, &buffer[beg_point], curr_wlen); // <- will copy a string starting from the beginning of the word til its end
// this may be commented for testing purposes
__uint32_t __letters = __get_letters(__cpy, curr_wlen); // <- will return number of unique letters in word
if (__letters > 2) // <- will remove all the words with less than 2 unique letters
{
strcat(_newbuff, __cpy);
strcat(_newbuff, " ");
}
beg_point = i + 1; // <- will point on the first letter of the word
curr_wlen = buffer[beg_point] == ' ' ? 0 : 1; // <- if the next symbol after space is another space, than word length should be 0
}
else curr_wlen++;
}
return _newbuff;
}
In short, the code above just finds delimiter character in string and counts the number of unique letters of the word before this delimiter.

My mistake was not initializing the __cpy variable: it was an uninitialized pointer, so memcpy wrote to an arbitrary address.
Also, as @n.1.8e9-where's-my-sharem. stated, I shouldn't name variables with two leading underscores.
The final code:
// deletes words with <= 2 letters
char* _delete_odd(const char* buffer, char delim)
{
size_t _act_buffer_len = strlen(buffer);
char* _newbuff = malloc(_act_buffer_len + 1); // <- new buffer without words with less than 2 unique words (+1 for '\0')
_newbuff[0] = '\0'; // strcat needs an initially empty, terminated string
int beg_point = 0;
int curr_wlen = 0;
for (size_t i = 0; i < _act_buffer_len; i++)
{
if (buffer[i] == delim)
{
char* _cpy = malloc(curr_wlen + 1);
memcpy(_cpy, &buffer[beg_point], curr_wlen); // <- will copy a string starting from the beginning of the word til its end
_cpy[curr_wlen] = '\0'; // memcpy does not terminate the copy, but strcat needs a terminator
// this may be commented for testing purposes
__uint32_t _letters = _get_letters(_cpy, curr_wlen); // <- will return number of unique letters in word
if (_letters > 2) // <- will remove all the words with less than 2 unique letters
strcat(_newbuff, _cpy);
beg_point = i + 1; // <- will point on the first letter of the word
curr_wlen = buffer[beg_point] == ' ' ? 0 : 1; // <- if the next symbol after space is another space, than word length should be 0
free(_cpy);
}
else curr_wlen++;
}
return _newbuff;
}
Thanks for helping me
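
For reference, memcpy needs a destination that points at real storage, and strcat needs both of its arguments to be null-terminated. A minimal sketch of a helper that satisfies both requirements; copy_word is a hypothetical name, not part of the original program:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a freshly allocated, null-terminated copy of
   the len characters starting at start, or NULL if allocation fails.
   The caller must free() the result. */
char *copy_word(const char *start, size_t len)
{
    assert(start != NULL);

    char *copy = malloc(len + 1);   /* +1 for the terminating '\0' */
    if (copy == NULL)
        return NULL;
    memcpy(copy, start, len);
    copy[len] = '\0';               /* memcpy does not add a terminator */
    return copy;
}
```

A string built this way can be passed safely to strcat or to any other function that expects a C string.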

buffer overrun while trying to link two strings together, why do I have this error?

(In C, using Visual Studio 2022 Preview.) I have to write a program that links two strings together. Here's what I did:
I wrote two for-loops to count the characters of the first string and the second string.
Inside the link function, I checked whether the pointers (first and second) are null. If either is null, then "return NULL".
I created "char *result". This is a new string, and it is the string to be returned. I allocated enough memory with malloc to store nprime, nsecond, and 1 more character (the zero terminator).
Then I checked whether result is null. If it is null, then "return NULL".
Then I wrote 2 for-loops to perform the linking between the first string and the second string. Here I got a compiler warning (I think it is raised at compile time, not while debugging): buffer overrun, the writable size is "nprime+nsecond+1" but 2 bytes might be written.
My theory is that the program is trying to write outside the result array, so there could be a loss of data. I tried to edit my code and wrote "nprime+nsecond+2" instead, but it doesn't work, and it keeps showing me the same buffer overrun warning.
#include <stdlib.h>
char* link( const char* first, const char* second) {
size_t nprime = 0;
size_t nsecond = 0;
if (first == NULL) {
return NULL;
}
if (second == NULL) {
return NULL;
}
for (size_t i = 0; first[i] < '\0'; i++) {
nprime++;
}
for (size_t i = 0; second[i] < '\0'; i++) {
nsecond++;
}
char* result = malloc(nprime + nsecond + 1);
if (result == NULL) {
return NULL;
}
for (size_t i = 0; i < nprime; i++) {
result[i] = first[i];
}
for (size_t i = 0; i < nsecond; i++) {
result[nprime + i] = second[i];
}
result[nprime + nsecond] = 0;
return result;
}
this is the main:
int main(void) {
char s1[] = "this is a general string ";
char s2[] = "this is a general test.";
char* s;
s = link(s1, s2);
return 0;
}

The warning is given due to the wrong conditions you defined in the first 2 for loops. The right loops should be as follows:
for (size_t i = 0; first[i] != '\0'; i++) {
nprime++;
}
for (size_t i = 0; second[i] != '\0'; i++) {
nsecond++;
}
With the conditions you defined (i.e. first[i] < '\0') you are just counting how many chars in the given string have an ASCII code lower than the ASCII code of \0 and exit the loop as soon as you find a char not fulfilling such condition.
Since '\0' has ASCII value 0, your nprime and nsecond are never incremented, leading to a malloc with insufficient room for the chars you actually need.
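
With the corrected conditions, the whole function can be written along the lines below. This is a sketch rather than the poster's final code: the name concat_strings is an assumption, and strlen and memcpy are swapped in for the hand-written counting and copying loops.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated string holding first followed by second,
   or NULL if either argument is NULL or allocation fails.
   The caller must free() the result. */
char *concat_strings(const char *first, const char *second)
{
    if (first == NULL || second == NULL)
        return NULL;

    size_t nfirst = strlen(first);
    size_t nsecond = strlen(second);

    char *result = malloc(nfirst + nsecond + 1);  /* +1 for '\0' */
    if (result == NULL)
        return NULL;

    memcpy(result, first, nfirst);
    memcpy(result + nfirst, second, nsecond + 1); /* copies the '\0' too */
    return result;
}
```

The main function in the question never frees the returned string; with a function like this, a call to free(s) before returning would release it.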

Returning the length of a char array in C

I am new to programming in C and am trying to write a simple function that will normalize a char array. At the end I want to return the length of the new char array. I am coming from Java, so I apologize if I'm making mistakes that seem simple. I have the following code:
/* The normalize procedure normalizes a character array of size len
according to the following rules:
1) turn all upper case letters into lower case ones
2) turn any white-space character into a space character and,
shrink any n>1 consecutive whitespace characters to exactly 1 whitespace
When the procedure returns, the character array buf contains the newly
normalized string and the return value is the new length of the normalized string.
*/
int
normalize(unsigned char *buf, /* The character array contains the string to be normalized*/
int len /* the size of the original character array */)
{
/* use a for loop to cycle through each character and the built in c functions to analyze it */
int i;
if(isspace(buf[0])){
buf[0] = "";
}
if(isspace(buf[len-1])){
buf[len-1] = "";
}
for(i = 0;i < len;i++){
if(isupper(buf[i])) {
buf[i]=tolower(buf[i]);
}
if(isspace(buf[i])) {
buf[i]=" ";
}
if(isspace(buf[i]) && isspace(buf[i+1])){
buf[i]="";
}
}
return strlen(*buf);
}
How can I return the length of the char array at the end? Also does my procedure properly do what I want it to?
EDIT: I have made some corrections to my program based on the comments. Is it correct now?
/* The normalize procedure normalizes a character array of size len
according to the following rules:
1) turn all upper case letters into lower case ones
2) turn any white-space character into a space character and,
shrink any n>1 consecutive whitespace characters to exactly 1 whitespace
When the procedure returns, the character array buf contains the newly
normalized string and the return value is the new length of the normalized string.
*/
int
normalize(unsigned char *buf, /* The character array contains the string to be normalized*/
int len /* the size of the original character array */)
{
/* use a for loop to cycle through each character and the built in c funstions to analyze it */
int i = 0;
int j = 0;
if(isspace(buf[0])){
//buf[0] = "";
i++;
}
if(isspace(buf[len-1])){
//buf[len-1] = "";
i++;
}
for(i;i < len;i++){
if(isupper(buf[i])) {
buf[j]=tolower(buf[i]);
j++;
}
if(isspace(buf[i])) {
buf[j]=' ';
j++;
}
if(isspace(buf[i]) && isspace(buf[i+1])){
//buf[i]="";
i++;
}
}
return strlen(buf);
}

The canonical way of doing something like this is to use two indices, one for reading, and one for writing. Like this:
int normalizeString(char* buf, int len) {
int readPosition, writePosition;
bool hadWhitespace = false;
for(readPosition = writePosition = 0; readPosition < len; readPosition++) {
if(isspace(buf[readPosition])) {
if(!hadWhitespace) buf[writePosition++] = ' ';
hadWhitespace = true;
} else if(...) {
...
}
}
return writePosition;
}
Warning: This handles the string according to the given length only. While using a buffer + length has the advantage of being able to handle any data, this is not the way C strings work. C-strings are terminated by a null byte at their end, and it is your job to ensure that the null byte is at the right position. The code you gave does not handle the null byte, nor does the buffer + length version I gave above. A correct C implementation of such a normalization function would look like this:
int normalizeString(char* string) { //No length is passed, it is implicit in the null byte.
char* in = string, *out = string;
bool hadWhitespace = false;
for(; *in; in++) { //loop until the zero byte is encountered
if(isspace(*in)) {
if(!hadWhitespace) *out++ = ' ';
hadWhitespace = true;
} else if(...) {
...
}
}
*out = 0; //add a new zero byte
return out - string; //use pointer arithmetic to retrieve the new length
}
In this code I replaced the indices by pointers simply because it was convenient to do so. This is simply a matter of style preference, I could have written the same thing with explicit indices. (And my style preference is not for pointer iterations, but for concise code.)

if(isspace(buf[i])) {
buf[i]=" ";
}
This should be buf[i] = ' ', not buf[i] = " ". You can't assign a string to a character.
if(isspace(buf[i]) && isspace(buf[i+1])){
buf[i]="";
}
This has two problems. One is that you're not checking whether i < len - 1, so buf[i + 1] could be off the end of the string. The other is that buf[i] = "" won't do what you want at all. To remove a character from a string, you need to use memmove to move the remaining contents of the string to the left.
return strlen(*buf);
This would be return strlen(buf). *buf is a character, not a string.
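
The memmove approach mentioned above could look like this minimal sketch; remove_char_at is a hypothetical helper name, and it assumes a null-terminated string:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: deletes the character at index pos from the
   null-terminated string s by shifting the tail one position left. */
void remove_char_at(char *s, size_t pos)
{
    assert(s != NULL);

    size_t len = strlen(s);
    if (pos >= len)
        return;                     /* nothing to remove */
    /* len - pos bytes: the tail after pos plus the terminating '\0'. */
    memmove(s + pos, s + pos + 1, len - pos);
}
```

memmove is used rather than memcpy because the source and destination ranges overlap. Each call shifts the rest of the string, so removing many characters this way costs O(n) per removal.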

The notations like:
buf[i]=" ";
buf[i]="";
do not do what you think/expect. You will probably need to create two indexes to step through the array — one for the current read position and one for the current write position, initially both zero. When you want to delete a character, you don't increment the write position.
Warning: untested code.
int i, j;
for (i = 0, j = 0; i < len; i++)
{
if (isupper(buf[i]))
buf[j++] = tolower(buf[i]);
else if (isspace(buf[i]))
{
buf[j++] = ' ';
while (i+1 < len && isspace(buf[i+1]))
i++;
}
else
buf[j++] = buf[i];
}
buf[j] = '\0'; // Null terminate
You replace the arbitrary white space with a plain space using:
buf[i] = ' ';
You return:
return strlen(buf);
or, with the code above:
return j;

Several mistakes in your code:
You cannot assign buf[i] with a string, such as "" or " ", because the type of buf[i] is char and the type of a string is char*.
You are reading from buf and writing into buf using index i. This poses a problem, as you want to eliminate consecutive white-spaces. So you should use one index for reading and another index for writing.
In C/C++, a native string is an array of characters that ends with 0. So in essence, you can simply iterate buf until you read 0 (you don't need to use the len variable at all). In addition, since you are "truncating" the input string, you should set the new last character to 0.
Here is one optional solution for the problem at hand:
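int normalize(char* buf)
{
char c;
int i = 0;
int j = 0;
while (buf[i] != 0)
{
c = buf[i++];
if (isspace((unsigned char)c))
{
buf[j++] = ' '; // a whole run of white-space becomes one space
while (isspace((unsigned char)buf[i]))
i++;
}
else
buf[j++] = tolower((unsigned char)c); // copies c, lower-cased if needed
}
buf[j] = 0;
return j;
}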

you should write:
return strlen(buf)
instead of:
return strlen(*buf)
The reason:
buf is of type char* - it's an address of a char somewhere in the memory (the one in the beginning of the string). The string is null terminated (or at least should be), and therefore the function strlen knows when to stop counting chars.
*buf dereferences the pointer, resulting in a char, which is not what strlen expects.

Not much different from the others, but this assumes the input is an array of unsigned char and not a C string.
tolower() does not itself need the isupper() test.
int normalize(unsigned char *buf, int len) {
int i = 0;
int j = 0;
int previous_is_space = 0;
while (i < len) {
if (isspace(buf[i])) {
if (!previous_is_space) {
buf[j++] = ' ';
}
previous_is_space = 1;
} else {
buf[j++] = tolower(buf[i]);
previous_is_space = 0;
}
i++;
}
return j;
}
@OP:
The posted code implies that leading and trailing spaces should either be shrunk to 1 char or eliminated entirely.
The above answer simply shrinks leading and trailing spaces to 1 ' '.
To eliminate trailing and leading spaces:
int i = 0;
int j = 0;
while (len > 0 && isspace(buf[len-1])) len--;
while (i < len && isspace(buf[i])) i++;
int previous_is_space = 0;
while (i < len) { ...

strtok disappearing when returning -1

So I'm writing code to put strings into arrays and it's working perfectly, however I want it to terminate the reading of the strings when I hit a ## in the file. I'm running a loop and parsing the strings line by line. Within my string parser I put a loop to check for the ##. It's at the very end of my parser function and it goes:
for (i = 0; i < strlen(line)); i++)
{
if ((buffer[i] == '#') && (buffer[i+1] == '#'))
{
return -1;
}
}
The problem is that when it hits the line with the ## at the end it doesn't parse the string into my array. It seems like it's just ignoring the code before this loop.
As additional information I'm using strtok to put the tokens in positions in my char* array before this for loop.
EDIT: Here's my parseString function:
int parseString(char* line, char*** inString)
{
char* buffer;
int Token, i;
buffer = (char*) malloc(strlen(line) * sizeof(char));
strcpy(buffer,line);
(*inString) = (char**) malloc(MAX_TOKS * sizeof(char**));
Token = 0;
(*inString)[Token++] = strtok(buffer, DELIMITERS);
while ((((*inString)[token] = strtok(NULL, DELIMITERS)) != NULL) && (Token < MAX_TOKS))
Token++;
for(i=0; i<strlen(line); i++)
{
if ((buffer[i] == '#') && (buffer[i+1] == '#'))
{
return -1;
}
}
return Token;
}

First of all, you are reading out of bounds on an array, because array[-1] is not good. Secondly, use a variable to hold the string length, as the way you do it causes the for loop to re-evaluate strlen(line) for each iteration.
Now, for your problem, it seems like you're putting it before the code that adds it to an array. If you could give us a bit more code, that would help.

Insufficient buffer allocation
// buffer = (char*) malloc(strlen(line) * sizeof(char));
buffer = malloc(strlen(line) + 1); // +1 for the \0
strcpy(buffer,line);
Memory Leak
The allocated 'buffer' may be lost. The *inString array may have a pointer to the beginning of 'buffer', allowing it to be freed in the calling routine, but that is iffy. Suggest using the first element of *inString to save that buffer explicitly.
Algorithm hole
(*inString)[token-1] == NULL should be asserted before for().
O(n*n) via strlen()
Suggestion:
// for(i=0; i<strlen(line); i++)
int length = strlen(line); // `length` should be used in `malloc()` too.
for(i=0; i<length; i++)
OP's early edit approach was almost OK
It just needed to start indexing at 1 rather than 0. There is no need to test every index i of line, only length-1 of them. So use (i = 1; i<length; i++) or (i = 0; i<length-1; i++).
// for (i = 0; i < strlen(line)); i++) {
int length = strlen(line);
for (i = 1; i<length; i++) { // start at 1
if ((buffer[i-1] == '#') && (buffer[i] == '#')) {
return -1;
}
}
For better assistance, recommend OP provide sample line, line with the ## at the end, MAX_TOKS and DELIMITERS.
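
One way to sidestep the index arithmetic is to search for the marker with strstr. A minimal sketch; has_end_marker is a hypothetical name, and the "##" marker is taken from the question:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if line contains the two-character end marker "##". */
bool has_end_marker(const char *line)
{
    assert(line != NULL);
    return strstr(line, "##") != NULL;
}
```

A caller could tokenize the line and store the tokens first, then call this on the original, unmodified line to decide whether to stop reading, so the tokens of the final line are not discarded by an early return.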