How to parse CSV with quotation marks delimiting fields in C? - c

Consider this message:
N,8545,01/02/2011 09:15:01.815,"RASTA OPTSTK 24FEB2011 1,150.00 CE",S,8.80,250,0.00,0
This is just a sample. The idea is that this is one of the rows in a CSV file. If I split it on commas, there will be a problem with the 1,150.00 figure, which contains a comma of its own.
The string inside the double quotes is of variable length, but can be treated as one "element" (if I may use the term).
The other elements are the ones separated by ,
How do I parse it? (other than Ragel parsing engine)
Soham

Break the string into fields separated by commas provided that the commas are not embedded in quoted strings.
A quick way to do this is to use a state machine.
boolean inQuote = false;
StringBuffer buffer = new StringBuffer();
int ch;
// readchar() is to be implemented however you read a char; it returns -1 at end of input
while ((ch = readchar()) != -1) {
    switch (ch) {
    case ',':
        if (!inQuote) {
            // store the field in our parsedLine object for later processing.
            parsedLine.addField(buffer.toString());
            buffer.setLength(0);
        } else {
            // a comma inside quotes is part of the field
            buffer.append((char) ch);
        }
        break;
    case '"':
        inQuote = !inQuote;
        // fall through to next target is deliberate.
    default:
        buffer.append((char) ch);
    }
}
// the last field has no trailing comma, so store it after the loop
parsedLine.addField(buffer.toString());
Note that while this provides an example, there is a bit more to CSV files which would have to be accounted for (like embedded quotes within quotes, or whether it is appropriate to strip outer quotes in your example).
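Since the question asks for C, the same state machine could be sketched in C roughly as follows. This is a minimal sketch, not a complete CSV parser: split_csv_line is a hypothetical helper name, it modifies the line in place, it drops the outer quotes, it treats a doubled quote inside a quoted field as one literal quote, and it assumes the whole record sits on a single line.

```c
/* Hypothetical helper: splits `line` in place into at most `max` fields.
 * Commas inside double quotes do not split the field, outer quotes are
 * dropped, and "" inside a quoted field becomes a single ".
 * Returns the number of fields stored in `fields`. */
static int split_csv_line(char *line, char *fields[], int max)
{
    int count = 0;
    int in_quote = 0;
    char *src = line;   /* read position */
    char *dst = line;   /* write position, never ahead of src */

    if (max < 1)
        return 0;
    fields[count++] = dst;
    for (; *src != '\0' && *src != '\n'; src++) {
        if (*src == '"') {
            if (in_quote && src[1] == '"') {
                *dst++ = '"';          /* escaped quote */
                src++;
            } else {
                in_quote = !in_quote;  /* opening or closing quote */
            }
        } else if (*src == ',' && !in_quote) {
            *dst++ = '\0';             /* terminate the current field */
            if (count == max)
                return count;          /* no room for more fields */
            fields[count++] = dst;
        } else {
            *dst++ = *src;
        }
    }
    *dst = '\0';
    return count;
}
```

Called on a writable copy of the sample row with an array of 16 pointers, this should yield nine fields, the fourth being RASTA OPTSTK 24FEB2011 1,150.00 CE without its quotes.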

A quick and dirty solution if you don't want to add external libraries would be converting the double quotes to \0 (the end of string marker), then parsing the three strings separately using sscanf. Ugly but should work.
Assuming the input is well-formed (otherwise you'll have to add error handling):
for (i=0; str[i]; i++)
    if (str[i] == '"') str[i] = 0;
// sscanf returns the number of items converted, not the number of characters
// read, so use %n to record how far the first scan got
int n = 0;
sscanf(str, "%c,%d,%d/%d/%d %d:%d:%d.%d,%n", &var1, &var2, ..., &var9, &n);
var10 = str + n + 1; // skip the \0 that replaced the opening quote
sscanf(var10 + strlen(var10) + 1, ",%c,%f,%d,%f,%d", &var11, &var12, ..., &var15);
You will obviously have to make a copy of var10 if you want to free str immediately.

This is a function to get the next single CSV field from an input file supplied as a FILE *. It expects the file to be opened in text mode, and supports quoted fields with embedded quotes and newlines. Fields longer than the size of the supplied buffer are truncated.
int get_csv_field(FILE *f, char *buf, size_t size)
{
char *p = buf;
int c;
enum { QS_UNQUOTED, QS_QUOTED, QS_GOTQUOTE } quotestate = QS_UNQUOTED;
if (size < 1)
return EOF;
while ((c = getc(f)) != EOF)
{
if ((c == '\n' || c == ',') && quotestate != QS_QUOTED)
break;
if (c == '"')
{
if (quotestate == QS_UNQUOTED)
{
quotestate = QS_QUOTED;
continue;
}
if (quotestate == QS_QUOTED)
{
quotestate = QS_GOTQUOTE;
continue;
}
if (quotestate == QS_GOTQUOTE)
{
quotestate = QS_QUOTED;
}
}
if (quotestate == QS_GOTQUOTE)
{
quotestate = QS_UNQUOTED;
}
if (size > 1)
{
*p++ = c;
size--;
}
}
*p = '\0';
return c;
}

How about libcsv from our very own Robert Gamble?

Related

Does anyone know what is the best way in C to check if a character is in a string?

I have a string for example "ABCDEFG.......", and I want to check if a certain character is in this string or not (the string also contains the newline character). I have the following code, but it doesn't seem to be working. Anyone have any better ideas?
Currently using the strchr to check if it comes out to be NULL, meaning the current char in the loop, is NOT present in the valid_characters variable.
bool check_bad_characters(FILE *inputFile)
{
int c;
char valid_characters[28] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ";
while ((c = fgetc(inputFile)) != EOF) {
char charC = c + '0';
if (strchr(valid_characters, c) == NULL && strncmp(&charC, "\n", 1) != 0)
{
// This means that there was a character in the input file
// that is not valid.
return false;
}
}
return true;
}
Your code considers \n to be a valid character, so put it in the list of valid characters instead of handling it separately. Your routine can be simply:
bool check_bad_characters(FILE *inputFile)
{
int c;
while ((c = fgetc(inputFile)) != EOF)
if (!strchr("ABCDEFGHIJKLMNOPQRSTUVWXYZ \n", c))
return false;
return true;
}
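One caveat with this approach: strchr also matches the terminating '\0' of the search string, so a NUL byte in the file would be accepted as valid. A minimal sketch that rejects it explicitly is shown below; only_valid_characters is a hypothetical name, and the set of valid characters is taken from the question.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns true if every byte in the stream is an uppercase letter,
 * a space or a newline. */
static bool only_valid_characters(FILE *in)
{
    static const char valid[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ \n";
    int c;

    while ((c = fgetc(in)) != EOF) {
        /* strchr(valid, 0) finds the terminator, so test for NUL first */
        if (c == '\0' || strchr(valid, c) == NULL)
            return false;
    }
    return true;
}
```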

C Reading a file of digits separated by commas

I am trying to read in a file that contains digits separated by commas and store them in an array without the commas present.
For example: processes.txt contains
0,1,3
1,0,5
2,9,8
3,10,6
And an array called numbers should look like:
0 1 3 1 0 5 2 9 8 3 10 6
The code I had so far is:
FILE *fp1;
char c; //declaration of characters
fp1=fopen(argv[1],"r"); //opening the file
int list[300];
c=fgetc(fp1); //taking character from fp1 pointer or file
int i=0,number,num=0;
while(c!=EOF){ //iterate until end of file
if (isdigit(c)){ //if it is digit
sscanf(&c,"%d",&number); //changing character to number (c)
num=(num*10)+number;
}
else if (c==',' || c=='\n') { //if it is new line or ,then it will store the number in list
list[i]=num;
num=0;
i++;
}
c=fgetc(fp1);
}
But this is having problems if it is a double digit. Does anyone have a better solution? Thank you!
For the data shown with no space before the commas, you could simply use:
while (fscanf(fp1, "%d,", &num) == 1 && i < 300)
list[i++] = num;
This will read the comma after the number if there is one, silently ignoring when there isn't one. If there might be white space before the commas in the data, add a blank before the comma in the format string. The test on i prevents you writing outside the bounds of the list array. The ++ operator comes into its own here.
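Wrapped in a function, the loop might look like the sketch below. read_numbers is a hypothetical name; the format string uses the variant with a blank before the comma, and the bound is tested first so that no value is read and then discarded once the array is full.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to `max` integers separated by commas
 * and/or white space from `fp` into `list`. Returns how many were stored. */
static size_t read_numbers(FILE *fp, int list[], size_t max)
{
    size_t count = 0;
    int num;

    /* "%d" skips leading white space (including newlines); " ," then
     * skips optional white space and an optional comma after the number. */
    while (count < max && fscanf(fp, "%d ,", &num) == 1)
        list[count++] = num;
    return count;
}
```

With the four-line processes.txt from the question, this should store the twelve values 0 1 3 1 0 5 2 9 8 3 10 6.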
First, fgetc returns an int, so c needs to be an int.
Other than that, I would use a slightly different approach. I admit that it is slightly overcomplicated. However, this approach may be usable if you have several different types of fields that require different actions, as in a parser. For your specific problem, I recommend Jonathan Leffler's answer.
int c=fgetc(f);
while(c!=EOF && i<300) {
if(isdigit(c)) {
fseek(f, -1, SEEK_CUR);
if(fscanf(f, "%d", &list[i++]) != 1) {
// Handle error
}
}
c=fgetc(f);
}
Here I don't care about commas and newlines. I take ANYTHING other than a digit as a separator. What I do is basically this:
read next byte
if byte is digit:
back one byte in the file
read number, regardless of length
else continue
The added condition i<300 is for security reasons: it keeps the loop from writing past the end of list. If you really want to check that nothing other than commas and newlines separates the numbers (I did not get the impression that you found that important), you could easily add an else if (c == ... to handle the error.
Note that you should always check the return value for functions like sscanf, fscanf, scanf etc. Actually, you should also do that for fseek. In this situation it's not as important since this code is very unlikely to fail for that reason, so I left it out for readability. But in production code you SHOULD check it.
My solution is to read the whole line first and then parse it with strtok_r with comma as a delimiter. If you want portable code you should use strtok instead.
A naive implementation of readline would be something like this:
static char *readline(FILE *file)
{
char *line = malloc(sizeof(char));
int index = 0;
int c = fgetc(file);
if (c == EOF) {
free(line);
return NULL;
}
while (c != EOF && c != '\n') {
line[index++] = c;
char *l = realloc(line, (index + 1) * sizeof(char));
if (l == NULL) {
free(line);
return NULL;
}
line = l;
c = fgetc(file);
}
line[index] = '\0';
return line;
}
Then you just need to parse the whole line with strtok_r, so you would end with something like this:
int main(int argc, char **argv)
{
FILE *file = fopen(argv[1], "re");
int list[300];
if (file == NULL) {
return 1;
}
char *line;
int numc = 0;
while((line = readline(file)) != NULL) {
char *saveptr;
// Get the first token
char *tok = strtok_r(line, ",", &saveptr);
// Now start parsing the whole line
while (tok != NULL && numc < 300) {
// Convert the token to a long if possible
errno = 0; // strtol only sets errno on failure, so clear it first
long num = strtol(tok, NULL, 0);
if (errno != 0) {
// Handle no value conversion
// ...
// ...
}
list[numc++] = (int) num;
// Get next token
tok = strtok_r(NULL, ",", &saveptr);
}
free(line);
}
fclose(file);
return 0;
}
And for printing the whole list just use a for loop:
for (int i = 0; i < numc; i++) {
printf("%d ", list[i]);
}
printf("\n");
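Note that errno alone does not catch a token with no digits at all, because strtol then simply returns 0. A stricter conversion could look like the following sketch; parse_int is a hypothetical helper, and it uses base 10 rather than the base 0 of the code above, which is an assumption about the input.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: converts `tok` to an int.
 * Returns false if there are no digits, if junk follows the number,
 * or if the value does not fit in an int. */
static bool parse_int(const char *tok, int *out)
{
    char *end;
    long value;

    errno = 0;                        /* strtol only sets errno on failure */
    value = strtol(tok, &end, 10);
    if (end == tok)                   /* nothing was converted */
        return false;
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return false;                 /* out of range */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;                        /* tolerate trailing white space */
    if (*end != '\0')                 /* trailing junk such as "12abc" */
        return false;
    *out = (int)value;
    return true;
}
```

In the token loop, list[numc++] would then only be assigned when parse_int(tok, &value) returns true.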

C parsing a comma-separated-values with line breaks

I have a CSV data file that has the following data:
H1,H2,H3
a,"b
c
d",e
When I open through Excel as CSV file, it is able to show the sheet with column headings as H1, H2, H3 and column values as: a for H1,
multi line value as
b
c
d
for H2
and e for H3
I need to parse this file using a C program and have the values picked up like this.
But, my following code snippet will not work, as I have multi line values for a column:
char buff[200];
char tokens[10][30];
fgets(buff, 200, stdin);
char *ptok = buff; // for iterating
char *pch;
int i = 0;
while ((pch = strchr(ptok, ',')) != NULL) {
*pch = 0;
strcpy(tokens[i++], ptok);
ptok = pch+1;
}
strcpy(tokens[i++], ptok);
How to modify this code snippet to accommodate multi-line values of columns?
Please don't get bothered by the hard-coded values for the string buffers, this is the test code as POC.
Instead of any 3rd party library, I would like to do it the hard way from first principle.
Please help.
The main complication in parsing "well-formed" CSV in C is precisely the handling of variable-length strings and arrays which you are avoiding by using fixed-length strings and arrays. (The other complication is handling not well-formed CSV.)
Without those complications, the parsing is really quite simple:
(untested)
/* Appends a non-quoted field to s and returns the delimiter */
int readSimpleField(struct String* s) {
for (;;) {
int ch = getc(stdin);
if (ch == ',' || ch == '\n' || ch == EOF) return ch;
stringAppend(s, ch);
}
}
/* Appends a quoted field to s and returns the delimiter.
* Assumes the open quote has already been read.
* If the field is not terminated, returns ERROR, which
* should be a value different from any character or EOF.
* The delimiter returned is the character after the closing quote
* (or EOF), which may not be a valid delimiter. Caller should check.
*/
int readQuotedField(struct String* s) {
for (;;) {
int ch = getc(stdin);
if (ch == EOF) return ERROR;
if (ch == '"') {
/* A doubled quote is a literal quote; anything else ends the field. */
ch = getc(stdin);
if (ch != '"') return ch;
}
stringAppend(s, ch);
}
}
/* Reads a single field into s and returns the following delimiter,
* which might be invalid.
*/
int readField(struct String* s) {
stringClear(s);
int ch = getc(stdin);
if (ch == '"') return readQuotedField(s);
if (ch == '\n' || ch == EOF) return ch;
stringAppend(s, ch);
return readSimpleField(s);
}
/* Reads a single row into row and returns the following delimiter,
* which might be invalid.
*/
int readRow(struct Row* row) {
struct String field = {0};
rowClear(row);
/* Make sure there is at least one field */
int ch = getc(stdin);
if (ch != '\n' && ch != EOF) {
ungetc(ch, stdin);
do {
ch = readField(&field);
rowAppend(row, &field);
} while (ch == ',');
}
return ch;
}
/* Reads an entire CSV file into table.
* Returns true if the parse was successful.
* If an error is encountered, returns false. If the end-of-file
* indicator is set, the error was an unterminated quoted field;
* otherwise, the next character read will be the one which
* triggered the error.
*/
bool readCSV(struct Table* table) {
tableClear(table);
struct Row row = {0};
/* Make sure there is at least one row */
int ch = getc(stdin);
if (ch != EOF) {
ungetc(ch, stdin);
do {
ch = readRow(&row);
tableAppend(table, &row);
} while (ch == '\n');
}
return ch == EOF;
}
The above is "from first principles" -- it does not even use standard C library string functions. But it takes some effort to understand and verify. Personally, I would use (f)lex and maybe even yacc/bison (although it's a bit of overkill) to simplify the code and make the expected syntax more obvious. But handling variable-length structures in C will still need to be the first step.
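For the fixed-size buffers of the question's proof of concept, a self-contained variant might look like the sketch below. It is an illustration under stated assumptions rather than a full parser: read_record, MAX_FIELDS and FIELD_LEN are hypothetical names (the sizes mirror char tokens[10][30]), over-long fields are truncated, extra fields are dropped, and the stream is assumed to be opened in text mode.

```c
#include <stdio.h>

#define MAX_FIELDS 10   /* assumed limits, as in the question's tokens[10][30] */
#define FIELD_LEN  30

/* Hypothetical helper: reads one CSV record from `in` into `tokens`.
 * Quoted fields may contain commas, newlines and doubled quotes.
 * Returns the number of fields stored, or 0 at end of input. */
static int read_record(FILE *in, char tokens[MAX_FIELDS][FIELD_LEN])
{
    int field = 0, len = 0, in_quote = 0, seen_any = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        seen_any = 1;
        if (in_quote) {
            if (c == '"') {
                int next = getc(in);
                if (next != '"') {        /* closing quote */
                    in_quote = 0;
                    if (next != EOF)
                        ungetc(next, in); /* reprocess as unquoted input */
                    continue;
                }
                /* doubled quote: store one '"' below */
            }
        } else if (c == '"') {
            in_quote = 1;                 /* opening quote is not stored */
            continue;
        } else if (c == ',') {
            if (field < MAX_FIELDS)
                tokens[field][len] = '\0';
            field++;
            len = 0;
            continue;
        } else if (c == '\n') {
            break;                        /* end of record */
        }
        if (field < MAX_FIELDS && len < FIELD_LEN - 1)
            tokens[field][len++] = (char)c;
    }
    if (!seen_any)
        return 0;
    if (field < MAX_FIELDS)
        tokens[field][len] = '\0';
    field++;
    return field < MAX_FIELDS ? field : MAX_FIELDS;
}
```

Calling read_record in a loop until it returns 0 should give three fields for the header row and three for the second record, with the middle field holding b, c and d separated by newlines.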

Parse HTTP Request Line In C

This is the problem that will never end. The task is to parse a request line in a web server -- of indeterminate length -- in C. I pulled the following off of the web as an example with which to work.
GET /path/script.cgi?field1=value1&field2=value2 HTTP/1.1
I must extract the absolute path: /path/script.cgi and the query: ?field1=value1&field2=value2. I'm told the following functions hold the key: strchr, strcpy, strncmp, strncpy, and/or strstr.
Here's what has happened so far: I've learned that using functions like strchr and strstr will absolutely allow me to truncate the request line at certain points, but will never allow me to get rid of portions of the request line I do not want, and it doesn't matter how I layer them.
For example, here's some code that gets me close to isolating the query, but I can't eliminate the HTTP version.
bool parse(const char* line)
{
// request line w/o method
const char ch = '/';
char* lineptr = strchr(line, ch);
// request line w/ query and HTTP version
char ch_1 = '?';
char* lineptr_1 = strchr(lineptr, ch_1);
// request line w/o query
char ch_2 = ' ';
char* lineptr_2 = strchr(lineptr_1, ch_2);
printf("%s\n", lineptr_2);
if (lineptr_2 != NULL)
return true;
else
return false;
}
Needless to say, I have a similar issue trying to isolate the absolute path (I can ditch the method, but not the ? or anything thereafter), and I see no occasion on which I can use the functions that require me to know a priori how many chars I'd like to copy from one location (usually an array) to another because, when this is run in real time, I will have no clue what the request line will look like in advance. If someone sees something that I am missing and could point me in the right direction, I would be most grateful!
A more elegant solution.
#include <stdio.h>
#include <string.h>
int parse(const char* line)
{
/* Find out where everything is, bailing out if a piece is missing */
const char *start_of_path = strchr(line, ' ');
if (start_of_path == NULL) return -1;
start_of_path++;
const char *start_of_query = strchr(start_of_path, '?');
if (start_of_query == NULL) return -1;
const char *end_of_query = strchr(start_of_query, ' ');
if (end_of_query == NULL) return -1;
/* Get the right amount of memory (+1 for the null terminator) */
char path[start_of_query - start_of_path + 1];
char query[end_of_query - start_of_query + 1];
/* Copy the strings into our memory */
strncpy(path, start_of_path, start_of_query - start_of_path);
strncpy(query, start_of_query, end_of_query - start_of_query);
/* Null terminators (because strncpy does not provide them) */
path[sizeof(path) - 1] = 0;
query[sizeof(query) - 1] = 0;
/* Print */
printf("%s\n", query);
printf("%s\n", path);
return 0;
}
int main(void)
{
parse("GET /path/script.cgi?field1=value1&field2=value2 HTTP/1.1");
return 0;
}
I wrote some functions in C a while back that manually parse c-strings up to a delimiter, similar to getline in C++.
// Trims all leading whitespace along with consecutive whitespace from provided cstring into destination char*. WARNING: ensure size <= sizeof(destination)
void Trim(char* destination, char* source, int size)
{
bool trim = true;
int index = 0;
int i;
for (i = 0; i < size; ++i)
{
if (source[i] == '\n' || source[i] == '\0')
{
destination[index++] = '\0';
break;
}
else if (source[i] != ' ' && source[i] != '\t')
{
destination[index++] = source[i];
trim = false;
}
else if (trim)
continue;
else
{
if (index > 0 && destination[index - 1] != ' ')
destination[index++] = ' ';
}
}
}
// Parses text up to the provided delimiter (or newline) into the destination char*. WARNING: ensure size <= sizeof(destination)
void ParseUpToSymbol(char* destination, char* source, int size, char delimiter)
{
int index = 0;
int i;
for (i = 0; i < size; ++i)
{
if (source[i] != delimiter && source[i] != '\n' && source[i] != '\0' && source[i] != ' ')
{
destination[index++] = source[i];
}
else
{
destination[index] = '\0';
break;
}
}
Trim(destination, destination, size);
}
Then you could parse your c-string with something along these lines:
char* buffer = (char*)malloc(64);
char* temp = (char*)malloc(256);
strcpy(temp, "GET /path/script.cgi?field1=value1&field2=value2 HTTP/1.1");
Trim(temp, temp, 256);
ParseUpToSymbol(buffer, temp, 64, '?');
temp = temp + strlen(buffer) + 1;
Trim(temp, temp, 256);
The code above trims any leading and trailing whitespace from the target string, in this case "GET /path/script.cgi?field1=value1&field2=value2 HTTP/1.1", and then stores the parsed value into the variable buffer. Running this the first time should put the word "GET" inside of buffer. When you do the "temp = temp + strlen(buffer) + 1" you are readjusting the temp char-pointer so you can call ParseUpToSymbol again with the remaining part of the string. If you were to call it again, you should get the absolute path leading up to the first question mark. You could repeat this to get each individual query string or change the delimiter to a space and get the entire query string portion of the URL. I think you get the idea. This is just one of many solutions of course.
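On the question's worry about not knowing the lengths in advance: the lengths can be measured at run time with strcspn and then used for bounded copies. The following is a minimal sketch under the assumption that the request line has the shape METHOD, space, target, space, version; parse_request_line and its buffer parameters are hypothetical, and unlike the code above it also accepts a target without a query.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: copies the absolute path and the query (including
 * its leading '?') of an HTTP request line into caller-supplied buffers.
 * Returns false if the line is malformed or a piece does not fit. */
static bool parse_request_line(const char *line,
                               char *path, size_t path_size,
                               char *query, size_t query_size)
{
    const char *target = strchr(line, ' ');   /* space after the method */
    if (target == NULL)
        return false;
    target++;

    size_t target_len = strcspn(target, " "); /* up to the space before the version */
    if (target[target_len] != ' ')
        return false;                         /* no HTTP version present */
    if (target[0] != '/')
        return false;                         /* not an absolute path */

    size_t path_len = strcspn(target, "? ");  /* path ends at '?' or ' ' */
    size_t query_len = target_len - path_len; /* 0 when there is no query */
    if (path_len >= path_size || query_len >= query_size)
        return false;                         /* would overflow a buffer */

    memcpy(path, target, path_len);
    path[path_len] = '\0';
    memcpy(query, target + path_len, query_len);
    query[query_len] = '\0';
    return true;
}
```

For the example request line this should produce /path/script.cgi and ?field1=value1&field2=value2.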

Using fgetc to pass only part of a text file to a buffer

I have the following text file:
13.69 (s, 1H), 11.09 (s, 1H).
So far I can quite happily use either fgets or fgetc to pass all text to a buffer as follows:
char* data;
data = malloc(sizeof(char) * 100);
int c;
int n = 0;
FILE* inptr = NULL;
inptr = fopen("NMR", "r");
if(NULL == fopen("NMR", "r"))
{
printf("Error: could not open file\n");
return 1;
}
for (c = fgetc(inptr); c != EOF && c != '\n'; c = fgetc(inptr))
{
data[n++] = c;
}
for (int i = 0, n = 100; i < n; i++)
{
printf ("%c", data[i]);
}
printf("\n");
and then print the buffer to the screen afterwards. However, I am only looking to pass part of the textfile to the buffer, namely:
13.69 (s, 1H),
So this means I want fgetc to stop after ','. However, this means that the text will stop at 13.69 (s, and not 13.69 (s, 1H),
Is there a way around this? I have also experimented with fgets and then using strstr as follows:
char needle[4] = ")";
char* ret;
ret = strstr(data, needle);
printf("The substring is: %s\n", ret);
However, the output from this is:
), 11.09 (s, 1H)
thus giving me the rest of the string which I do not want. It's an interesting one and if anyone has any tips it would be much appreciated!
If you know that the closing parenthesis is the last character you want, you can use that as your stopping point in the fgetc() loop:
char data[100]; //No need to dynamically allocate if we know the size at compile time
int c;
int n = 0;
FILE* inptr = NULL;
inptr = fopen("NMR", "r");
if(inptr == NULL) //We want to check the value of the file we just opened
{ //and plan to use
printf("Error: could not open file\n");
return 1;
}
//We'll keep the original value guards (EOF and '\n') below and add two more
//to make sure we break from the loop
//We use n<98 below to make sure we can always create a null-terminated string,
//If we used 99, the 100th character might be a ')', then we have no room for a
//terminating null-char
for (c = fgetc(inptr); c != ')' && n < 98 && c != EOF && c != '\n'; c = fgetc(inptr))
{
data[n++] = c;
}
if(c != ')') //We hit EOF, \n, or ran out of space in data[]
{
printf("Error: no matching sequence found\n");
return 2;
}
data[n]=')'; //Could also write data[n]=c here, since we know it's a ')'
data[n+1]='\0'; //Add the terminating null character
printf("%s\n",data); //Since it's a properly formatted string, we can use %s
(Note that this example will handle null input characters differently from yours. If you expect null characters to be in the input stream (NMR file), then change the printf("%s",...) line back to the for loop you originally had.)
Well with only one example of the format you are trying to parse it's not totally possible to give an answer, however if your input is always like this I would simply have a counter and break after the second comma.
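int comma = 0;
// n < 99 leaves room for the terminating null in the 100-byte data buffer
for (c = fgetc(inptr); c != EOF && c != '\n' && n < 99; c = fgetc(inptr))
{
    data[n++] = c;
    if (c == ',' && ++comma == 2) // stop once the second comma is stored
        break;
}
data[n] = '\0';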
In case the characters inside the parenthesis can be more complex I would simply maintain a boolean state to know if I am actually inside or outside a parenthesis and break when I read a comma outside of it.
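That idea could be sketched as follows. read_until_outer_comma is a hypothetical name, and the sketch uses a depth counter rather than a single boolean so that nested parentheses are also handled; it stops at a newline or end of file just like the loops above.

```c
#include <stdio.h>

/* Hypothetical helper: copies characters from `in` into `buf`, up to and
 * including the first comma that is not inside parentheses.
 * Returns 1 if such a comma was found, 0 otherwise. */
static int read_until_outer_comma(FILE *in, char *buf, size_t size)
{
    size_t n = 0;
    int depth = 0;   /* greater than 0 while inside parentheses */
    int found = 0;
    int c;

    if (size == 0)
        return 0;
    while (n + 1 < size && (c = fgetc(in)) != EOF && c != '\n') {
        buf[n++] = (char)c;
        if (c == '(') {
            depth++;
        } else if (c == ')' && depth > 0) {
            depth--;
        } else if (c == ',' && depth == 0) {
            found = 1;   /* comma outside parentheses: stop here */
            break;
        }
    }
    buf[n] = '\0';
    return found;
}
```

For the NMR line from the question, buf should end up holding 13.69 (s, 1H), including the trailing comma.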
Simply read the line using fgets and store the desired string in a char * using sscanf:
char *new_data;
new_data=malloc(100); // allocate memory
...
fgets(data,100,inptr); // read from file but check its return
sscanf(data,"%98[^)]",new_data); // store string until ')' in new_data; the width leaves room for ")" and the terminating null
strcat(new_data,")"); // concatenating new_data and ")"
printf("%s",new_data); // print new_data
...
free(new_data); // remember to free memory
You should also check the return value of malloc, which my example does not do, and close the file you opened.

Resources