Parsing CSV file by splitting with strsep - c

I'm getting some unwanted output when attempting to parse a comma-separated value file with strsep(). It seems to be working for half of the file, where every number has only one digit (i.e. 0-9), but as soon as multi-digit numbers are added, for instance 512,
it will print 512 12 2 512 12 2 and so on. I'm not sure whether this is due to the way I'm looping.
int main() {
char line[1024];
FILE *fp;
int data[10][10];
int i = 0;
int j = 0;
fp = fopen("file.csv", "r");
while(fgets(line, 1024, fp)) {
char* tmp = strdup(line);
char* token;
char* idx;
while((token = strsep(&tmp, ","))) {
for (idx=token; *idx; idx++) {
data[i][j] = atoi(idx);
j++;
}
}
i++;
j=0;
free(tmp);
}
for(i = 0; i < 10; i++) {
for(j = 0; j < 10; j++) {
printf("%d ", data[i][j]);
}
printf("\n");
}
fclose(fp);
}

This happens because the loop below treats every character of the token returned by strsep() as the start of a new element, so the token 512 produces atoi("512"), atoi("12") and atoi("2"):
for (idx=token; *idx; idx++) {
data[i][j] = atoi(idx);
j++;
}
Stop doing that and create exactly one element per token instead:
while((token = strsep(&tmp, ","))) {
data[i][j] = atoi(token);
j++;
}
Also free(tmp); will do nothing because tmp will be set to NULL by strsep(). To free the buffer allocated via strdup(), keep the pointer in another variable and use it for freeing.
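A minimal sketch of both fixes together, assuming a platform that provides strsep() and strdup() (they are BSD/POSIX extensions, not ISO C). The helper name parse_int_row and its parameters are hypothetical and only serve the illustration:

```c
#define _DEFAULT_SOURCE /* assumption: glibc-style feature macro exposing strsep()/strdup() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse one comma-separated line of integers into out[],
 * storing at most max values. Returns the number of values stored, or -1 if
 * the working copy could not be allocated. */
int parse_int_row(const char *line, int *out, int max)
{
    assert(line != NULL && out != NULL);

    char *copy = strdup(line);   /* this is the pointer that must be freed */
    if (copy == NULL)
        return -1;

    char *cursor = copy;         /* strsep() advances this and finally sets it to NULL */
    char *token;
    int count = 0;

    while (count < max && (token = strsep(&cursor, ",")) != NULL) {
        out[count++] = atoi(token);   /* exactly one element per token */
    }

    free(copy);                  /* frees the real buffer, whatever cursor is by now */
    return count;
}
```

Calling parse_int_row(line, data[i], 10) once per line read by fgets() would replace the inner loops of the original program.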

Related

different string lengths using strtok

void redact_words(const char *text_filename, const char *redact_words_filename){
FILE *fp = fopen(text_filename,"r");
FILE *f2p = fopen(redact_words_filename,"r");
FILE *f3p = fopen("result.txt", "w"); ;
char buffer1[1000];
char buffer2[1000];
char *word;
char *redact;
char **the_words;
//if ((fgets(buffer1, 1000 ,fp) == NULL) || (fgets(buffer2,1000 ,f2p) == NULL))
fgets(buffer1,1000,fp);
fgets(buffer2,1000,f2p);
rewind(fp);
rewind(f2p);
int word_count = 0;
while (!feof(f2p)){
char c = fgetc(f2p);
if (c == ' '){
word_count += 1;
}
}
word_count += 1;
the_words = malloc(3 * sizeof(char*));
redact = strtok(buffer2, ", ");
for (int i = 0; i < word_count; i++){
the_words[i] = malloc(100);
the_words[i] = redact;
redact = strtok(NULL, ", ");
}
char result[256] = "";
word = strtok(buffer1, " ");
while (word != NULL){
for (int i = 0; i < word_count; i++){
if (strcasecmp(the_words[i],word) == 0){
for (int i = 0; i < strlen(word); i++){
strcat(result,"*");
}
strcat(result, " ");
break;
}
else{
if (i==(word_count-1)){
strcat(result, word);
strcat(result, " ");
}
}
}
word = strtok(NULL," ");
}
fputs(result, f3p);
fclose(fp);
fclose(f2p);
fclose(f3p);
free(the_words);
}
So this is my C code to replace words from the file called text_filename with asterisks if the word exists in a file called redact_words_filename. However, I noticed during the comparison of the 2 strings
if (strcasecmp(the_words[i],word) == 0){
for (int i = 0; i < strlen(word); i++){
strcat(result,"*");
}
that when I have the word quick for example in both text files, the_words[i] contains a string of length 6 while the one in word contains a string of length 5, both containing the value quick, and so it is not registering as the same string. Why is one of the strings longer than another?
(P.s I apologise for the bad code quality)
Edit 1: Ok so I found out it has to do with \n which is put in at the end of every line. Trying to find a way to solve this.
Edit 2: I managed to get rid of \n through a simple for loop
for (int i = 0; i < word_count; i++){
the_words[i] = malloc(100);
the_words[i] = redact;
for (int j = 0; j < strlen(redact); j++){
if (redact[j] == '\n'){
redact[j] = '\0';
}
}
redact = strtok(NULL, ", ");
}
the_words = malloc(3 * sizeof(char*));
redact = strtok(buffer2, ", ");
for (int i = 0; i < word_count; i++){
the_words[i] = malloc(100);
the_words[i] = redact;
redact = strtok(NULL, ", ");
}
Two obvious problems just here:
1) You allocate space for 3 pointers in the_words, but then you put word_count words into it. So if word_count > 3, you'll overflow the array and get undefined behavior.
2) For each word, you allocate 100 bytes and then throw away that allocation (leaking it), storing a pointer into buffer2 instead. The buffer currently contains the word, but that will change the next time you read into it. You should just use the_words[i] = strdup(redact); to both allocate the right amount of memory and copy the string into the allocated memory.
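As a rough sketch of those two points, the word list could be grown on demand and filled with strdup() copies. The helper names copy_words and free_words are hypothetical, and strdup() is assumed to be available (POSIX). Adding "\n" to the delimiter set is one way to drop the trailing newline mentioned in the question's edits:

```c
#define _POSIX_C_SOURCE 200809L /* assumption: POSIX system, needed for strdup() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split buffer on commas, spaces and newlines and return
 * a heap-allocated array of heap-allocated copies. *count receives the number
 * of words. Returns NULL if allocation fails or no word was found. */
char **copy_words(char *buffer, size_t *count)
{
    assert(buffer != NULL && count != NULL);

    char **words = NULL;
    size_t n = 0, capacity = 0;

    for (char *tok = strtok(buffer, ", \n"); tok != NULL; tok = strtok(NULL, ", \n")) {
        if (n == capacity) {
            /* grow the pointer array instead of hard-coding 3 slots */
            size_t new_capacity = capacity ? capacity * 2 : 4;
            char **grown = realloc(words, new_capacity * sizeof *grown);
            if (grown == NULL)
                goto fail;
            words = grown;
            capacity = new_capacity;
        }
        words[n] = strdup(tok);   /* exact-size copy that outlives buffer */
        if (words[n] == NULL)
            goto fail;
        n++;
    }
    *count = n;
    return words;

fail:
    while (n > 0)
        free(words[--n]);
    free(words);
    *count = 0;
    return NULL;
}

/* Releases everything copy_words() allocated. */
void free_words(char **words, size_t count)
{
    for (size_t i = 0; i < count; i++)
        free(words[i]);
    free(words);
}
```

Because every word is its own copy, buffer2 can be reused afterwards without changing the stored words.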

Parsing .csv file into 2D array in C

I have a .csv file that reads like:
SKU,Plant,Qty
40000,ca56,1245
40000,ca81,12553.3
40000,ca82,125.3
45000,ca62,0
45000,ca71,3
45000,ca78,54.9
Note: This is my example but in reality this has about 500,000 rows and 3 columns.
I am trying to convert these entries into a 2D array so that I can then manipulate the data. You'll notice that in my example I just set a small 10x10 matrix A to try and get this example to work before moving on to the real thing.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
const char *getfield(char *line, int num);
int main() {
FILE *stream = fopen("input/input.csv", "r");
char line[1000000];
int A[10][10];
int i, j = 0;
//Zero matrix
for (i = 0; i < 10; i++) {
for (j = 0; j < 10; j++) {
A[i][j] = 0;
}
}
for (i = 0; fgets(line, 1000000, stream); i++) {
while (j < 10) {
char *tmp = strdup(line);
A[i][j] = getfield(tmp, j);
free(tmp);
j++;
}
}
//print matrix
for (i = 0; i < 10; i++) {
for (j = 0; j < 10; j++) {
printf("%s\t", A[i][j]);
}
printf("\n");
}
}
const char *getfield(char *line, int num) {
const char *tok;
for (tok = strtok(line, ",");
tok && *tok;
tok = strtok(NULL, ",\n"))
{
if (!--num)
return tok;
}
return 0;
}
It prints only "null" errors, and it is my belief that I am making a mistake related to pointers on this line: A[i][j] = getfield(tmp, j). I'm just not really sure how to fix that.
This is work that is based almost entirely on this question: Read .CSV file in C . Any help in adapting this would be very much appreciated as it's been a couple years since I last touched C or external files.
It looks like commenters have already helped you find a few errors in your code. However, the problems are pretty entrenched. One of the biggest issues is that you're using strings. Strings are, of course, char arrays; that means that there's already a dimension in use.
It would probably be better to just use a struct like this:
struct csvTable
{
char sku[10];
char plant[10];
char qty[10];
};
That will also allow you to set your columns to the right data types (it looks like SKU could be an int, but I don't know the context).
Here's an example of that implementation. I apologize for the mess, it's adapted on the fly from something I was already working on.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// Based on your estimate
// You could make this adaptive or dynamic
#define rowNum 500000
struct csvTable
{
char sku[10];
char plant[10];
char qty[10];
};
// Declare table
struct csvTable table[rowNum];
int main()
{
// Load file
FILE* fp = fopen("demo.csv", "r");
if (fp == NULL)
{
printf("Couldn't open file\n");
return 0;
}
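for (int counter = 0; counter < rowNum; counter++)
{
char entry[100];
// Stop at end of file; the global table is zero-initialized,
// so the remaining rows stay as empty strings
if (fgets(entry, sizeof entry, fp) == NULL)
break;
char *sku = strtok(entry, ",");
char *plant = strtok(NULL, ",");
// Treat '\n' as a delimiter too, so qty does not keep the line ending
char *qty = strtok(NULL, ",\n");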
if (sku != NULL && plant != NULL && qty != NULL)
{
strcpy(table[counter].sku, sku);
strcpy(table[counter].plant, plant);
strcpy(table[counter].qty, qty);
}
else
{
strcpy(table[counter].sku, "\0");
strcpy(table[counter].plant, "\0");
strcpy(table[counter].qty, "\0");
}
}
// Prove that the process worked
for (int printCounter = 0; printCounter < rowNum; printCounter++)
{
printf("Row %d: column 1 = %s, column 2 = %s, column 3 = %s\n",
printCounter + 1, table[printCounter].sku,
table[printCounter].plant, table[printCounter].qty);
}
// Wait for keypress to exit
getchar();
}
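The remark above about giving each column its proper data type could look roughly like this. The struct name csvRow, the helper parse_row and the chosen field types (int, string, double) are assumptions based only on the sample rows in the question:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical typed row; the field types are guesses from the sample data */
struct csvRow {
    int    sku;
    char   plant[10];
    double qty;
};

/* Returns 1 if line held all three fields, 0 otherwise (for example the
 * "SKU,Plant,Qty" header line, which does not start with a number). */
int parse_row(const char *line, struct csvRow *row)
{
    assert(line != NULL && row != NULL);
    /* %9[^,] reads at most 9 non-comma characters, leaving room for '\0' */
    return sscanf(line, "%d,%9[^,],%lf", &row->sku, row->plant, &row->qty) == 3;
}
```

With numeric fields, the quantities can be summed or compared directly instead of being converted from strings each time.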
There are multiple problems in your code:
In the second loop, you do not stop reading the file after 10 lines, so you would try and store elements beyond the end of the A array.
You do not reset j to 0 at the start of the while (j < 10) loop. j happens to have the value 10 at the end of the initialization loop, so you effectively do not store anything into the matrix.
The matrix A should be a 2D array of char *, not int, or potentially an array of structures.
Here is a simpler version with an allocated array of structures:
#include <stdio.h>
#include <stdlib.h>
typedef struct item_t {
char SKU[20];
char Plant[20];
char Qty[20];
} item_t;
int main(void) {
FILE *stream = fopen("input/input.csv", "r");
char line[200];
int size = 0, len = 0, i;
char c; // %c stores a single char, so c must not be an int
item_t *A = NULL;
if (stream) {
while (fgets(line, sizeof(line), stream)) {
if (len == size) {
size = size ? size * 2 : 1000;
A = realloc(A, sizeof(*A) * size);
if (A == NULL) {
fprintf(stderr, "out of memory for %d items\n", size);
return 1;
}
}
if (sscanf(line, "%19[^,\n],%19[^,\n],%19[^,\n]%c",
A[len].SKU, A[len].Plant, A[len].Qty, &c) != 4
|| c != '\n') {
fprintf(stderr, "invalid format: %s\n", line);
} else {
len++;
}
}
fclose(stream);
//print matrix
for (i = 0; i < len; i++) {
printf("%s,%s,%s\n", A[i].SKU, A[i].Plant, A[i].Qty);
}
free(A);
}
return 0;
}

Import a CSV of strings into a multi-dimensional array in C

I am trying to input a text file, in a similar format to a CSV, into a multi-dimensional array, where every element of the array is an array of words for each line. Any help would be much appreciated!
For example, the file input.txt could contain:
Carrot, Potato, Beetroot, Courgette, Broccoli
Dad's oranges, Apple, Banana, Cherry
Pasta, Pizza, Bread, Butter
The structure of the outputted array I am hoping to get from that would be in the form:
[[Carrot, Potato, Beetroot, Courgette, Broccoli], [Dad's oranges, Apple, Banana, Cherry], [Pasta, Pizza, Bread, Butter]]
So the line:
printf("%s", inputArray[1][0]);
Would print:
Dad's oranges
I am not sure what the question here is. However, looking at your problem statement and code, I see a few issues (note that I did not run the code; it is meant to give you an idea):
Your varCount will start from 1, as you increment before you put the first word.
You are storing an inherently 2-dimensional data into a single dimensional array. That is normally fine, but you need to encode where a line starts, and where it ends. That is missing. If all works, you will get an array of words, with no knowledge of where lines start/end. One way to deal with it is to create a 2D array. Another, is to insert a pointer to known word between the lines. Below is a code snippet that shows inserting a separator
char *knownWord = "anyword";
...
while (fgets(line, maxLineLength, inputFile))
{
token = strtok(&line[0], ",");
while (token) {
// Copy the token: strtok() returns pointers into line, which the next fgets() overwrites
inputArray[varCount] = strdup(token);
varCount++;
token = strtok(NULL, ",");
}
inputArray[varCount] = knownWord;
varCount ++;
}
For this the print will happen with something like
bool atKnownWord = 0;
printf("[");
for (i = 0; i < maxWords; i++) {
if (inputArray[i] == NULL) {
break;
}
if (inputArray[i] == knownWord) {
atKnownWord = 1;
printf("]");
continue;
}
if (atKnownWord) {
atKnownWord = 0;
printf(", [");
}
printf("%s", inputArray[i]);
}
printf("]");
You have not allocated the memory to store the token.
Also you should increment varCount after storing the token in an array.
The code can be written as:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define maxLineLength 1000 //Maximum length of a line
#define wordsPerLine 200 //Maximum words in a line
#define maxLines 200 //Maximum lines in an input
int main(int argc, char *argv[])
{
char line[maxLineLength] = {0};
char *inputArray[maxLines][wordsPerLine] = {0}; // {0} is portable; empty braces need C23 or a compiler extension
char *ptr, *token;
int i, j, lines, maxWords = 0;
FILE *inputFile = fopen("input.txt", "r");
if (inputFile)
{
i = j = 0;
while (i < maxLines && fgets(line, maxLineLength, inputFile)) // never read more lines than inputArray can hold
{
token = strtok(&line[0], ",\n");
while(token)
{
if(ptr = malloc(sizeof(char) * (strlen(token)+1))) //whether malloc succeeded
{
if(token[0] == ' ')
strcpy(ptr, token+1);
else
strcpy(ptr, token);
inputArray[i][j++] = ptr;
token = strtok(NULL, ",\n");
}
else
{
printf("malloc failed!\n");
exit(1);
}
}
if(maxWords < j)
maxWords = j;
i++;
j = 0;
}
lines = i;
fclose(inputFile);
for(i = 0; i < lines; i++)
{
for(j = 0; (j < maxWords) && inputArray[i][j]; j++)
printf("%s | ", inputArray[i][j]);
printf("\n");
}
}
return 0;
}

questions regarding tokenisation in c

I am writing a tokenisation program. I want to get input from a file, then store it in an input pointer. I am using the strtok function but when I print my tokens[i] I get NULL.
int tokenise(char *input, int file_output)
{
int i = 0;
char *tokens[100];
for(i=0 ;i<=20;i++)
{
tokens[i]= (char*)malloc(sizeof(char*));
}
char delim[] = " ,.;#/";
printf("\n ------------- buffer data is %s",input);
tokens[i] = strtok(input , delim);
printf("tokens are %s",*tokens[0]);
int j=0;
while(NULL != tokens[i])
{
i++;
tokens[i] = strtok(NULL,delim);
}
for(j = i; j <= 0; j--)
{
write(file_output,tokens[i],strlen(tokens[i]));
}
for(i = 0; i <= 20; i++)
{
printf("%s \n",*tokens[i]);
}
return SUCCESS;
}
For some reason you allocate memory and write pointers to the first 21 elements of tokens[]. At the end of that loop, i is 21. You then parse the input string using strtok(), storing its results in continuing array elements, from tokens[21]. So two of your loops need rewriting:
for(j=21; j<i; j++)
write(file_output,tokens[j],strlen(tokens[j]));
for(j=21; j<i; j++)
printf("%s \n",tokens[j]); // pass the char* itself; *tokens[j] is a single char
But it would be better if you removed the first loop that allocates unnecessary memory. strtok() returns pointers to the original string, which it breaks into pieces by inserting '\0' terminators, so you only need to store the pointers in the array tokens[].
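A minimal sketch of that simpler approach, with no allocation at all. The helper name split_tokens and its parameters are hypothetical. The input string is modified in place, as strtok() always does:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split input in place and store up to max token
 * pointers in tokens[]. Returns the number of tokens stored. */
size_t split_tokens(char *input, const char *delim, char *tokens[], size_t max)
{
    assert(input != NULL && delim != NULL && tokens != NULL);

    size_t count = 0;
    for (char *tok = strtok(input, delim);
         tok != NULL && count < max;
         tok = strtok(NULL, delim)) {
        tokens[count++] = tok;   /* points into input, so there is nothing to free */
    }
    return count;
}
```

The caller would then loop from 0 to the returned count when writing or printing the tokens, so the tokens start at index 0 rather than 21.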

Getting every other line empty on output

I have a problem with getting every other line empty on output with this code. The desired output is: http://paste.ubuntu.com/1354365/
While I get: http://paste.ubuntu.com/1356669/
Does anyone have an idea of why I'm getting these empty lines on every other line?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
FILE *fp;
FILE *fw;
int main(int argc, char *argv[]){
char buffer[100];
char *fileName = malloc(10*sizeof(char));
char **output = calloc(10, sizeof(char*));
char **outputBuffer = calloc(10, sizeof(char*));
fw = fopen("calvin.txt", "w+");
for(int y = 0; y < 6; y++){
for(int i = 0; i < 10; i ++)
{
output[i] = malloc(100);
}
for(int x = 0; x < 12; x++){
sprintf(fileName,"part_%02d-%02d", x, y);
fp = fopen(fileName, "rb");
if(fp == NULL)
{
printf("Kan ikke åpne den filen(finnes ikke/rettigheter)\n");
}
else if(fp != NULL){
memset(buffer, 0, 100);
for(int i = 0; i < 10; i++){
outputBuffer[i] = malloc(100);
}
fread(buffer, 1, 100, fp);
for(int i = 0; i < 100; i++){
if(buffer[i] == '\0')
{
buffer[i] = ' ';
}
else if(buffer[i] == '\n')
{
buffer[i] = ' ';
}
}
for(int i = 0; i < 10; i++) {
strncpy(outputBuffer[i], buffer + i * 10, 10);
strncat(output[i], outputBuffer[i]+1, 11);
}
}
}
for(int i = 0; i < 10; i++){
printf("%s\n", output[i]);
}
}
fclose(fp);
free(fileName);
}
You are not reading correctly from the file. In the first image, at the beginning, you have:
o ""oo " o o o
on the second
""oo o o o
That does not make a lot of sense because it is the first line. It is not related to empty lines since we are talking about the first line.
It seems that your reads are shifted two characters to the left, so one " lands where the o should be, the other " where the ' ' should be, etc.
Try it this way; it may not be the most efficient solution:
int read(char *file)
{
FILE *fp = NULL;
int size = 0, pos = 0, i, c;
fp = fopen(file,"r");
if (!fp) return 0;
for(; ((getc(fp))!=EOF); size++); // Count the number of characters in the file
fclose(fp);
char buffer[size + 1]; // +1 so the array is never zero-length
fp = fopen(file,"r");
if (!fp) return 0;
// Read into an int so EOF can be told apart from a real character,
// and never store more than size characters
while(pos < size && (c = getc(fp)) != EOF)
buffer[pos++] = (char)c;
for(i = 0; i < pos; i++) // print them.
printf("%c",buffer[i]);
fclose(fp);
return 1;
}
This part seems problematic:
strncpy(outputBuffer[i], buffer + i * 10, 10);
strncat(output[i], outputBuffer[i]+1, 11);
1) Why is it necessary to use the extra outputBuffer step?
2) You know that strncpy() isn't guaranteed to null-terminate the string it copies.
3) More significantly, output[i] hasn't been initialized, so strncat() will concatenate the string after whatever junk is already in there. If you use calloc() instead of malloc() when creating each output[i], that might help. It's even possible that your output[i] variables are what hold your extra newline.
4) Even if initialized to an empty string, you could easily overflow output[i], since you're looping 12 times and writing up to 11 characters to it. 11 * 12 + 1 for the null terminator = 133 bytes written to a 100-byte array.
In general, unless this is a class assignment that requires use of malloc(), I don't understand why you aren't just declaring your variables once, at the start of the program and zeroing them out at the start of each loop:
char fileName[10];
char output[10][100];
char outputBuffer[10][100];
And, as stated by others, you're allocating a bunch of memory and never freeing it. Allocate it once outside of your loop, or just skip the allocation step and declare the arrays directly.
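To illustrate points 2) to 4), here is a sketch of an append that cannot overflow the destination and always leaves it null-terminated. The helper name append_bounded is hypothetical, and it assumes the destination already holds a valid (possibly empty) string, which a fixed array zeroed at the start of each loop would satisfy:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: append at most n bytes of src to the string in dst,
 * never writing past dst_size bytes. Returns the number of bytes appended. */
size_t append_bounded(char *dst, size_t dst_size, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL && dst_size > 0);

    size_t used = strlen(dst);            /* dst must already be a valid string */
    assert(used < dst_size);
    size_t room = dst_size - used - 1;    /* keep one byte for '\0' */

    size_t len = 0;
    while (len < n && src[len] != '\0')   /* bounded length of src, in standard C */
        len++;
    if (len > room)
        len = room;                       /* truncate instead of overflowing */

    memcpy(dst + used, src, len);
    dst[used + len] = '\0';
    return len;
}
```

A return value smaller than expected tells the caller that the row no longer fits, which would be the case here once 12 chunks are appended to a 100-byte row.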

Resources