Parsing comma-separated values with line breaks in C

I have a CSV data file that has the following data:
H1,H2,H3
a,"b
c
d",e
When I open it in Excel as a CSV file, it shows the sheet with column headings H1, H2, H3 and column values as: a for H1,
the multi-line value
b
c
d
for H2
and e for H3
I need to parse this file using a C program and have the values picked up like this.
But, my following code snippet will not work, as I have multi line values for a column:
char buff[200];
char tokens[10][30];
fgets(buff, 200, stdin);
char *ptok = buff; // for iterating
char *pch;
int i = 0;
while ((pch = strchr(ptok, ',')) != NULL) {
*pch = 0;
strcpy(tokens[i++], ptok);
ptok = pch+1;
}
strcpy(tokens[i++], ptok);
How can I modify this code snippet to accommodate multi-line column values?
Please don't be bothered by the hard-coded sizes of the string buffers; this is test code for a proof of concept.
Instead of using a third-party library, I would like to do it the hard way, from first principles.
Please help.

The main complication in parsing "well-formed" CSV in C is precisely the handling of variable-length strings and arrays which you are avoiding by using fixed-length strings and arrays. (The other complication is handling not well-formed CSV.)
Without those complications, the parsing is really quite simple:
(untested)
#include <stdio.h>
#include <stdbool.h>

/* Appends a non-quoted field to s and returns the delimiter */
int readSimpleField(struct String* s) {
    for (;;) {
        int ch = getchar();
        if (ch == ',' || ch == '\n' || ch == EOF) return ch;
        stringAppend(s, ch);
    }
}
/* Appends a quoted field to s and returns the delimiter.
* Assumes the open quote has already been read.
* If the field is not terminated, returns ERROR, which
* should be a value different from any character or EOF.
* The delimiter returned is the character after the closing quote
* (or EOF), which may not be a valid delimiter. Caller should check.
*/
int readQuotedField(struct String* s) {
    for (;;) {
        int ch = getchar();
        if (ch == EOF) return ERROR;
        if (ch == '"') {
            /* A doubled quote is a literal quote; anything else ends the field */
            ch = getchar();
            if (ch != '"') return ch;
        }
        stringAppend(s, ch);
    }
}
/* Reads a single field into s and returns the following delimiter,
* which might be invalid.
*/
int readField(struct String* s) {
    stringClear(s);
    int ch = getchar();
    if (ch == '"') return readQuotedField(s);
    /* An empty field is followed directly by its delimiter */
    if (ch == ',' || ch == '\n' || ch == EOF) return ch;
    stringAppend(s, ch);
    return readSimpleField(s);
}
/* Reads a single row into row and returns the following delimiter,
* which might be invalid.
*/
int readRow(struct Row* row) {
    struct String field = {0};
    rowClear(row);
    /* Make sure there is at least one field */
    int ch = getchar();
    if (ch != '\n' && ch != EOF) {
        ungetc(ch, stdin);
        do {
            ch = readField(&field);
            rowAppend(row, &field);
        } while (ch == ',');
    }
    return ch;
}
/* Reads an entire CSV file into table.
* Returns true if the parse was successful.
* If an error is encountered, returns false. If the end-of-file
* indicator is set, the error was an unterminated quoted field;
* otherwise, the next character read will be the one which
* triggered the error.
*/
bool readCSV(struct Table* table) {
    tableClear(table);
    struct Row row = {0};
    /* Make sure there is at least one row */
    int ch = getchar();
    if (ch != EOF) {
        ungetc(ch, stdin);
        do {
            ch = readRow(&row);
            tableAppend(table, &row);
        } while (ch == '\n');
    }
    return ch == EOF;
}
The above is "from first principles" -- it does not even use standard C library string functions. But it takes some effort to understand and verify. Personally, I would use (f)lex and maybe even yacc/bison (although it's a bit of overkill) to simplify the code and make the expected syntax more obvious. But handling variable-length structures in C will still need to be the first step.
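The parser above relies on struct String, stringClear and stringAppend without defining them. As a rough idea of what that "first step" might look like, here is a minimal sketch of a growable string. The field names, the doubling growth policy, the int return value of stringAppend and the extra stringFree helper are all assumptions made for illustration; the answer does not specify any of them.

```c
#include <stdlib.h>

/* Hypothetical growable string; the field names are assumptions. */
struct String {
    char  *data;   /* heap buffer, NUL-terminated once it is non-NULL */
    size_t len;    /* number of characters stored, excluding the NUL */
    size_t cap;    /* allocated size of data in bytes */
};

/* Empties the string but keeps the buffer for reuse. */
void stringClear(struct String *s) {
    s->len = 0;
    if (s->data) s->data[0] = '\0';
}

/* Appends one character, doubling the buffer when it is full.
 * Returns 0 on success and -1 if memory could not be allocated. */
int stringAppend(struct String *s, int ch) {
    if (s->len + 2 > s->cap) {              /* room for ch and the NUL */
        size_t newcap = s->cap ? s->cap * 2 : 16;
        char *p = realloc(s->data, newcap);
        if (p == NULL) return -1;
        s->data = p;
        s->cap = newcap;
    }
    s->data[s->len++] = (char)ch;
    s->data[s->len] = '\0';
    return 0;
}

/* Releases the buffer and returns the string to its zeroed state. */
void stringFree(struct String *s) {
    free(s->data);
    s->data = NULL;
    s->len = s->cap = 0;
}
```

A zero-initialized struct String (as in struct String field = {0};) is a valid empty string under this sketch. A Row and a Table could be built the same way, as growable arrays of strings and of rows.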

Related

how to stop my program from skipping characters before saving them

I am making a simple program to read from a file character by character, puts them into tmp and then puts tmp in input[i]. However, the program saves a character in tmp and then saves the next character in input[i]. How do I make it not skip that first character?
I've tried to read into input[i] right away but then I wasn't able to check for EOF flag.
FILE * file = fopen("input.txt", "r");
char tmp;
char input[5];
tmp= getc(file);
input[0]= tmp;
int i=0;
while((tmp != ' ') && (tmp != '\n') && (tmp != EOF)){
tmp= getc(file);
input[i]=tmp;
length++;
i++;
}
printf("%s",input);
It's supposed to print "ADD $02", but instead it prints "DD 02".
You are doing things in the wrong order in your code: The way your code is structured, reading and storing the first char is moved out of the loop. In the loop, that char is then overwritten, because i is still 0 when the next char is stored. If you keep that structure, start with i = 1.
Perhaps you want to read the first character anyway, but I guess you want to read everything up to the first space, which might be the first character. Then do this:
#include <stdio.h>
int main(void)
{
char input[80];
int i = 0;
int c = getchar();
while (c != ' ' && c != '\n' && c != EOF) {
if (i + 1 < sizeof(input)) { // store char if there is room
input[i++] = c;
}
c = getchar();
}
input[i] = '\0'; // null-terminate input
puts(input);
return 0;
}
Things to note:
The first character is read before the loop. The loop condition and the code that stores the char then use that char. Just before the end of the loop body, the next char is read, which will then be processed in the next iteration.
Your code does not guard against writing past the end of the char buffer input. This is dangerous, especially since your buffer is tiny.
When you construct a string char by char, you should null-terminate it by placing an explicit '\0' at the end. You have to make sure that there is space for that terminator. Nearly all system functions like puts or printf("%s", ...) expect the string to be null-terminated.
Make the result of getchar an int, so that you can distinguish between all valid character codes and the special value EOF.
The code above is useful if the first and subsequent calls to get the next item are different, for example when tokenizing a string with strtok. Here, you can also choose another approach:
while (1) { // "infinite loop"
int c = getchar(); // read a char first thing in a loop
if (c == ' ' || c == '\n' || c == EOF) break;
// explicit break when done
if (i + 1 < sizeof(input)) {
input[i++] = c;
}
}
This approach has the logic of processing the chars in the loop body only, but you must wrap it in an infinite loop and then use the explicit break.

Using fgetc to pass only part of a text file to a buffer

I have the following text file:
13.69 (s, 1H), 11.09 (s, 1H).
So far I can quite happily use either fgets or fgetc to pass all text to a buffer as follows:
char* data;
data = malloc(sizeof(char) * 100);
int c;
int n = 0;
FILE* inptr = NULL;
inptr = fopen("NMR", "r");
if(NULL == fopen("NMR", "r"))
{
printf("Error: could not open file\n");
return 1;
}
for (c = fgetc(inptr); c != EOF && c != '\n'; c = fgetc(inptr))
{
data[n++] = c;
}
for (int i = 0, n = 100; i < n; i++)
{
printf ("%c", data[i]);
}
printf("\n");
and then print the buffer to the screen afterwards. However, I am only looking to pass part of the textfile to the buffer, namely:
13.69 (s, 1H),
So this means I want fgetc to stop after ','. However, this means that the text will stop at 13.69 (s, and not 13.69 (s, 1H),
Is there a way around this? I have also experimented with fgets and then using strstr as follows:
char needle[4] = ")";
char* ret;
ret = strstr(data, needle);
printf("The substring is: %s\n", ret);
However, the output from this is:
), 11.09 (s, 1H)
thus giving me the rest of the string which I do not want. It's an interesting one and if anyone has any tips it would be much appreciated!
If you know that the closing parenthesis is the last character you want, you can use that as your stopping point in the fgetc() loop:
char data[100]; //No need to dynamically allocate if we know the size at compile time
int c;
int n = 0;
FILE* inptr = NULL;
inptr = fopen("NMR", "r");
if(inptr == NULL) //We want to check the value of the file we just opened
{ //and plan to use
printf("Error: could not open file\n");
return 1;
}
//We'll keep the original value guards (EOF and '\n') below and add two more
//to make sure we break from the loop
//We use n<98 below to make sure we can always create a null-terminated string,
//If we used 99, the 100th character might be a ')', then we have no room for a
//terminating null-char
for (c = fgetc(inptr); c != ')' && n < 98 && c != EOF && c != '\n'; c = fgetc(inptr))
{
data[n++] = c;
}
if(c != ')') //We hit EOF, \n, or ran out of space in data[]
{
printf("Error: no matching sequence found\n");
return 2;
}
data[n]=')'; //Could also write data[n]=c here, since we know it's a ')'
data[n+1]='\0'; //Add the terminating null character
printf("%s\n",data); //Since it's a properly formatted string, we can use %s
(Note that this example will handle null input characters differently from yours. If you expect null characters to be in the input stream (NMR file), then change the printf("%s",...) line back to the for loop you originally had.)
With only one example of the format you are trying to parse, it is not possible to give a definitive answer. However, if your input always looks like this, I would simply keep a counter and break after the second comma.
int comma = 0;
while (n < 99 && (c = fgetc(inptr)) != EOF && c != '\n')
{
    data[n++] = c;
    if (c == ',' && ++comma == 2)
        break; // stop right after the second comma
}
data[n] = '\0'; // null-terminate so data can be printed with %s
In case the characters inside the parentheses can be more complex, I would simply maintain a boolean state to know whether I am inside or outside a pair of parentheses, and break when I read a comma outside of them.
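The boolean-state idea could be sketched roughly as follows. The function name read_until_outer_comma and its signature are hypothetical, and the sketch assumes parentheses are never nested, so a single flag is enough.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copies characters from fp into buf up to and
 * including the first comma that is not inside parentheses.
 * Stops early at end of line, end of file, or when buf is full.
 * Returns the number of characters stored (excluding the NUL). */
size_t read_until_outer_comma(FILE *fp, char *buf, size_t size) {
    size_t n = 0;
    bool in_paren = false;   /* assumption: parentheses are not nested */
    int c;

    if (size == 0) return 0;
    while (n + 1 < size && (c = fgetc(fp)) != EOF && c != '\n') {
        buf[n++] = (char)c;
        if (c == '(') in_paren = true;
        else if (c == ')') in_paren = false;
        else if (c == ',' && !in_paren) break;   /* comma outside (...) */
    }
    buf[n] = '\0';
    return n;
}
```

With the sample line from the question, the first call would leave 13.69 (s, 1H), in buf, because the comma after s is inside the parentheses and is therefore not treated as a stopping point.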
Simply read the line using fgets and store the desired part in a char * using sscanf:
char *new_data;
new_data=malloc(100); // allocate memory
...
fgets(data,100,inptr); // read from file; check its return value
sscanf(data,"%98[^)]",new_data); // store everything up to ')' from data in new_data; the width leaves room for ")" and the terminator
strcat(new_data,")"); // concatenating new_data and ")"
printf("%s",new_data); // print new_data
...
free(new_data); // remember to free memory
You should also check the return value of malloc (not done in my example) and close the file you opened.

How to get input of string with space and put it in variables separate?

I am stuck on this. With an input such as "LEMON TREE", which I read by calling my getline function twice, I get the wrong output, "LEMON TRE" followed by "E" (explained in more detail below). How can I change my modified getline so that each word ends up in a separate variable? Here is my getline function, followed by an illustration of what I want.
I use a slightly modified function from the K&R ANSI C book:
int getline(char *line, int len)
{
int i,c;
for ( i =0;i<len-1 && (c = getchar()) != EOF && c!='\n';i++)
*line++ = c;
if ( c == '\n')
{
i++;
*line++ = '\n';
*line = '\0';
}
return i;
}
And this illustrates what I want:
char line_1[10];
char line_2[10];
getline(line_1, 10);
getline(line_2, 10);
printf("line_1: %s ", line_1);
printf("line_2: %s", line_2);
INPUT: LEMON TREE // This is the input, on one line as you can see.
OUTPUT: line_1: LEMON TRE line_2: E
What do I need to change in my code?
Also, when I declare getline in a header, the compiler says it was previously declared. This happens now that I use Ubuntu; before, on Windows, I declared it alongside stdio.h and it worked fine.
UPDATE: I want to change my function so that it works with more than two words (to make it universal).
You need to check for whitespace (e.g. a space) as well to determine where the word ends:
for ( i =0;i<len-1 && (c = getchar()) != EOF && c!='\n' && c!=' ';i++) {
*line++ = c;
}
/*EDIT: (thanks to BLUEPIXY's comment for reminding me) Always use a string (or other data type) large enough to contain the possible input/value assigned to it.
*/
As for making the function universal, you can either call getline in a loop and use its return value to determine whether the last word has already been read, or use a multidimensional array:
while (getline(line_1, 10)!= -1) {
printf("line_1: %s ", line_1);
}
Don't forget to change the return value of the function. (I usually return values that cannot occur any other way; negative values typically signal an error when counting something.)
if ((c == EOF) ||(c == '\n')) {
i = -1;
}
also
if ( c == '\n')
{
i++;
*line++ = '\n';
*line = '\0';
}
should be simply replaced with
*line = '\0'; //every string needs to be finalized
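As for the second question: according to http://www.cplusplus.com/reference/cstdio/, the standard C stdio.h does not contain a getline function. On POSIX systems such as Ubuntu, however, stdio.h declares its own getline, which would explain the "previously declared" error; renaming your function avoids the clash.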
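Putting these suggestions together, a word-at-a-time reader might look like the sketch below. The name get_word is hypothetical (chosen so it cannot clash with any getline the system headers may declare), and taking a FILE * parameter instead of always reading standard input is an assumption made so the function is easier to try out. Unlike the snippets above, it skips leading whitespace and returns -1 only when no word is left.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical replacement for the modified getline: reads the next
 * whitespace-delimited word from fp into word (at most len-1 chars).
 * Returns the stored length, or -1 if no word is left before EOF.
 * Characters that do not fit are read and discarded. */
int get_word(FILE *fp, char *word, int len) {
    int c, i = 0;

    if (len < 1) return -1;
    while ((c = getc(fp)) != EOF && isspace(c))
        ;                                   /* skip leading whitespace */
    if (c == EOF) return -1;
    do {
        if (i < len - 1) word[i++] = (char)c;
    } while ((c = getc(fp)) != EOF && !isspace(c));
    word[i] = '\0';                         /* every string needs a terminator */
    return i;
}
```

Calling it in a loop, for example while (get_word(stdin, word, sizeof word) != -1), would then handle any number of words, including several per line.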

fgetc to skip from point to new line

I am trying to get fgetc to read through a file and skip from a certain indicator until a new line. This seems like a simple question, but I can't find any documentation on it.
Here is an example of my question:
read this in ; skip from semicolon on to new line
My best guess at a solution would be to read in the entire file, and for each line use strtok to skip from ; to the end of the line. Obviously this is horribly inefficient. Any ideas?
*I need to use fgetc or something like fgetc that will parse the file character by character
The easiest thing to do is read the entire line in, then truncate it if there is a ;.
char buffer[1024], * p ;
if ( fgets(buffer, sizeof(buffer), fin) )
{
if (( p= strchr( buffer, ';' ))) { *p = '\0' ; } // chop off ; and anything after
for ( p= buffer ; ( * p ) ; ++ p )
{
char c= * p ;
// do what you want with each character c here.
}
}
When you do the read, buffer will initially contain:
"read this in ; skip from semicolon on to new line\n\0"
After you find the ; in the line and stick a '\0' there, the buffer looks like:
"read this in \0 skip from semicolon on to new line\n\0"
So the for loop starts at r and stops at the first \0.
//fgets-like function that reads up to the character specified by delimiter.
//The rest of the line, up to and including the newline, is read and discarded.
//s : buffer, n : buffer size
char *fgets_delim(char *s, int n, FILE *fp, char delimiter){
int i, ch=fgetc(fp);
if(EOF==ch)return NULL;
for(i=0;i<n-1;++i, ch=fgetc(fp)){
s[i] = ch;
if(ch == '\n'){
s[i+1]='\0';
break;
}
if(ch == EOF){
s[i]='\0';
break;
}
if(ch == delimiter){
s[i]='\0';//s[i]='\n';s[i+1]='\0'
while('\n'!=(ch = fgetc(fp)) && EOF !=ch);//skip
break;
}
}
if(i==n-1){
ungetc(ch, fp);//buffer full: push back the character the loop read ahead
s[i] = '\0';
}
return s;
}
Given a requirement to use fgetc(), then you are probably supposed to echo everything up to the first semicolon on the line, and suppress everything from the semicolon to the end of the line. I note in passing that getc() is functionally equivalent to fgetc() and since this code is about to read from standard input and write to standard output, it would be reasonable to use getchar() and putchar(). But rules are rules...
#include <stdio.h>
#include <stdbool.h>
int main(void)
{
int c;
bool read_semicolon = false;
while ((c = fgetc(stdin)) != EOF)
{
if (c == '\n')
{
putchar(c);
read_semicolon = false;
}
else if (c == ';')
read_semicolon = true;
else if (read_semicolon == false)
putchar(c);
/* else suppressed because read_semicolon is true */
}
return 0;
}
If you don't have C99 and <stdbool.h>, you can use int, 0 and 1 in place of bool, false and true respectively. You can use else if (!read_semicolon) if you prefer.

How to parse CSV with quotation marks delimiting fields in C?

Consider, this message:
N,8545,01/02/2011 09:15:01.815,"RASTA OPTSTK 24FEB2011 1,150.00 CE",S,8.80,250,0.00,0
This is just a sample. The idea is that this is one of the rows in a CSV file. Now, if I split it at commas, there will be a problem with the 1,150.00 figure.
The string inside the double quotes is of variable length, but can be ascertained as one "element"(if I may use the term)
The other elements are the ones separated by ,
How do I parse it? (other than Ragel parsing engine)
Soham
Break the string into fields separated by commas provided that the commas are not embedded in quoted strings.
A quick way to do this is to use a state machine.
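boolean inQuote = false;
StringBuffer buffer = new StringBuffer();
int ch;
// readchar() is to be implemented however you read a char; it returns -1 at end of input
while ((ch = readchar()) != -1) {
    switch (ch) {
    case ',':
        if (!inQuote) {
            // store the field in our parsedLine object for later processing.
            parsedLine.addField(buffer.toString());
            buffer.setLength(0);
        } else {
            // a comma inside quotes is part of the field
            buffer.append((char) ch);
        }
        break;
    case '"':
        inQuote = !inQuote;
        // fall through to the default target is deliberate: the quote is kept.
    default:
        buffer.append((char) ch);
    }
}
// store the last field, which is not followed by a comma
parsedLine.addField(buffer.toString());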
Note that while this provides an example, there is a bit more to CSV files which would have to be accounted for (like embedded quotes within quotes, or whether it is appropriate to strip outer quotes in your example).
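Since the question asks about C, the same state-machine idea might be sketched in C as below. The function name split_csv_line and its in-place splitting approach are assumptions for illustration, not part of the answer. Unlike the snippet above, this sketch strips the outer quotes and treats a doubled quote inside a quoted field as one literal quote; it handles a single line only, so embedded newlines are out of scope.

```c
#include <stddef.h>

/* Hypothetical helper: splits line in place at commas that are not inside
 * double quotes. Outer quotes are stripped and a doubled quote ("")
 * inside a quoted field becomes one quote character.
 * Stores up to max pointers in fields and returns how many were stored. */
size_t split_csv_line(char *line, char **fields, size_t max) {
    size_t count = 0;
    char *src = line;   /* next character to read */
    char *dst = line;   /* next position to write; never passes src */
    int in_quote = 0;

    if (max == 0) return 0;
    fields[count++] = dst;
    for (; *src != '\0' && *src != '\n'; src++) {
        if (*src == '"') {
            if (in_quote && src[1] == '"') {
                *dst++ = '"';         /* escaped quote */
                src++;
            } else {
                in_quote = !in_quote; /* opening or closing quote */
            }
        } else if (*src == ',' && !in_quote) {
            *dst++ = '\0';            /* end of the current field */
            if (count == max) return count;
            fields[count++] = dst;
        } else {
            *dst++ = *src;
        }
    }
    *dst = '\0';
    return count;
}
```

Applied to the sample row from the question, this would yield nine fields, with the fourth being RASTA OPTSTK 24FEB2011 1,150.00 CE. Because the line is modified in place, the caller needs a writable copy of the input.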
A quick and dirty solution if you don't want to add external libraries would be converting the double quotes to \0 (the end of string marker), then parsing the three strings separately using sscanf. Ugly but should work.
Assuming the input is well-formed (otherwise you'll have to add error handling):
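int pos = 0; // filled in by %n: number of characters sscanf consumed
for (i=0; str[i]; i++)
    if (str[i] == '"') str[i] = 0;
// sscanf returns the number of items assigned, not an offset, so use %n to find where it stopped
sscanf(str, "%c,%d,%d/%d/%d %d:%d:%d.%d,%n", &var1, &var2, ..., &var9, &pos);
var10 = str + pos + 1; // skip the \0 that replaced the opening quote
sscanf(var10 + strlen(var10) + 1, ",%c,%f,%d,%f,%d", &var11, &var12, ..., &var15);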
You will obviously have to make a copy of var10 if you want to free str immediately.
This is a function to get the next single CSV field from an input file supplied as a FILE *. It expects the file to be opened in text mode, and supports quoted fields with embedded quotes and newlines. Fields longer than the size of the supplied buffer are truncated.
int get_csv_field(FILE *f, char *buf, size_t size)
{
char *p = buf;
int c;
enum { QS_UNQUOTED, QS_QUOTED, QS_GOTQUOTE } quotestate = QS_UNQUOTED;
if (size < 1)
return EOF;
while ((c = getc(f)) != EOF)
{
if ((c == '\n' || c == ',') && quotestate != QS_QUOTED)
break;
if (c == '"')
{
if (quotestate == QS_UNQUOTED)
{
quotestate = QS_QUOTED;
continue;
}
if (quotestate == QS_QUOTED)
{
quotestate = QS_GOTQUOTE;
continue;
}
if (quotestate == QS_GOTQUOTE)
{
quotestate = QS_QUOTED;
}
}
if (quotestate == QS_GOTQUOTE)
{
quotestate = QS_UNQUOTED;
}
if (size > 1)
{
*p++ = c;
size--;
}
}
*p = '\0';
return c;
}
How about libcsv from our very own Robert Gamble?
