Parsing a file in C

I need to parse a text file and do some processing on its contents. The data is variable length, of the form "PP1004181350D001002003..........". A PP marks a timestamp, so 1004181350 is 2010-04-18 13:50. A D marks a data point made of three separate values, each three digits long, so D001002003 has the three coordinates 001, 002 and 003.
I need to store each timestamp in an array and the corresponding data in arrays with one row per data point and three columns, one for each coordinate. The end result might look like
TimeStamp[1] = "135000", low[1] = "001", medium[1] = "002", high[1] = "003"
TimeStamp[2] = "135015", low[2] = "010", medium[2] = "012", high[2] = "013"
TimeStamp[3] = "135030", low[3] = "051", medium[3] = "052", high[3] = "043"
....
The question is how do I go about doing this in C? How do I go through this string looking for these patterns and storing the values in the corresponding arrays for further processing?
Note: The seconds value in the timestamp is added by us, as it is known that each data point comes 15 seconds after the previous one.

edit: updated to follow your specs.
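While your file seems to be variable length, your data isn't, so you could use fscanf and do something like this:
int timestamp, low, medium, high;
// fscanf returns the number of values stored (or EOF), so compare it with the expected count
while(fscanf(file,"PP%*6d%4d", &timestamp) == 1)
{
    for(int i = 0; fscanf(file, "D%3d%3d%3d", &low, &medium, &high) == 3; i++)
    {
        int stamp=timestamp*100+i*15; // HHMMSS; seconds are not carried into minutes once i*15 reaches 60
        //Do something with variables (e.g. convert to string, push into vector, ...)
    }
}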
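Note that this reads the data into integers (timestamp, low, medium and high are ints). A string version looks like this (here timestamp, low, medium and high are assumed to be zero-initialised arrays of char arrays, e.g. char timestamp[N][7] and char low[N][4], with N an assumed capacity):
char first[] = {'0', '1', '3', '4'};  // tens digit of the seconds: 00, 15, 30, 45
char second[] = {'0', '5'};           // units digit of the seconds
char hhmm[4];
int n = 0;                            // index of the next record; keep it below N
while(fscanf(file,"PP%*6d%4c", hhmm) == 1)
{
    for(int i = 0; fscanf(file, "D%3c%3c%3c", low[n], medium[n], high[n]) == 3; i++, n++)
    {
        memcpy(timestamp[n], hhmm, 4);  // needs <string.h>; %4c does not add a terminator
        timestamp[n][4]=first[i%4];
        timestamp[n][5]=second[i%2];
    }
}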
edit: some more explanation about the format string: %*6d means "read up to 6 digits and discard them" (the * means: do not store the result in a variable). %4d and %4c both consume the 4 digits here, and we do save those in the corresponding variables; %4d converts them to an int, while %4c copies the 4 characters as-is and does not append a terminating '\0'.
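To make the format strings above concrete, here is a minimal sketch of the integer variant wrapped in a function. The names struct sample and read_samples are made up for this example, and it assumes the seconds should be carried into minutes once a block holds more than four data points, which the question does not spell out.

```c
#include <stdio.h>

/* Hypothetical record type: one entry per "D" data point. */
struct sample {
    char timestamp[7];          /* "HHMMSS" plus terminator */
    int low, medium, high;
};

/* Reads "PP<YYMMDD><HHMM>" headers, each followed by any number of
 * "D<low><medium><high>" records, and fills out[] with at most max
 * entries. Returns the number of samples stored. */
size_t read_samples(FILE *file, struct sample *out, size_t max)
{
    size_t n = 0;
    int hhmm, low, medium, high;

    /* The leading space in the format skips newlines between blocks. */
    while (n < max && fscanf(file, " PP%*6d%4d", &hhmm) == 1) {
        for (unsigned i = 0; n < max &&
             fscanf(file, "D%3d%3d%3d", &low, &medium, &high) == 3; i++) {
            /* Data points are 15 s apart; work in seconds so that the
             * carry into minutes and hours comes out right. */
            unsigned total = (unsigned)(hhmm / 100) * 3600u
                           + (unsigned)(hhmm % 100) * 60u + i * 15u;
            snprintf(out[n].timestamp, sizeof out[n].timestamp,
                     "%02u%02u%02u",
                     (total / 3600u) % 24u, (total / 60u) % 60u, total % 60u);
            out[n].low = low;
            out[n].medium = medium;
            out[n].high = high;
            n++;
        }
    }
    return n;
}
```

Called on an open FILE * with an array of struct sample, it returns how many data points were stored; anything that does not match the expected layout simply ends the loop.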

As long as your patterns aren't variable length, you could simply use fscanf. If you need something more complex, you might try PCRE, but for this case I think sscanf will suffice.

I wouldn't recommend using fscanf directly on input data because it is very sensitive to the input: if one byte is wrong and the data suddenly doesn't match the format specifier, then in the worst case you could get a memory overwrite.
It is better to either use fgetc and parse the data as it comes in, or read it into a buffer (fread) and process it from there.
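As a rough illustration of the fgetc route, the sketch below reads one data record character by character and stops at the first byte that does not fit. The helper names read_digits and read_data_point are invented for this example; they are not from the question.

```c
#include <ctype.h>
#include <stdio.h>

/* Reads exactly `count` decimal digits from `in` with fgetc() and stores
 * their value in *value. Returns 1 on success, 0 if a non-digit or end
 * of file is met first (a non-digit is pushed back onto the stream). */
int read_digits(FILE *in, int count, long *value)
{
    long v = 0;
    for (int i = 0; i < count; i++) {
        int c = fgetc(in);
        if (c == EOF)
            return 0;
        if (!isdigit(c)) {
            ungetc(c, in);
            return 0;
        }
        v = v * 10 + (c - '0');
    }
    *value = v;
    return 1;
}

/* Reads one "D<low><medium><high>" record (three 3-digit fields).
 * Returns 1 on success, 0 on malformed input or end of file. */
int read_data_point(FILE *in, long *low, long *medium, long *high)
{
    int c = fgetc(in);
    if (c != 'D') {
        if (c != EOF)
            ungetc(c, in);
        return 0;
    }
    return read_digits(in, 3, low) &&
           read_digits(in, 3, medium) &&
           read_digits(in, 3, high);
}
```

Because every character is checked before it is used, a corrupted byte makes the function return 0 instead of writing anywhere unexpected.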

Simply Parsing? Here it is!!
UPDATE: Check out KillianDS's code above. That's even better!!
[STEP 1] Search for \n (or CR+LF).
[STEP 2] Starting from the first character on the line, you know the number of characters each data field occupies. Read that many characters from the file.
Use atoi() to convert the character data to int:
http://www.cplusplus.com/reference/clibrary/cstdlib/atoi/
Repeat for all fields.
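The steps above could be sketched like this, assuming the line is already in memory. field_to_int is a made-up helper name, and the caller has to make sure the field really contains width characters.

```c
#include <stdlib.h>
#include <string.h>

/* Converts the `width` characters starting at `field` to an int.
 * atoi() needs a terminated string, so the field is copied into a small
 * scratch buffer first. Returns -1 if the width does not fit the buffer. */
int field_to_int(const char *field, size_t width)
{
    char tmp[16];
    if (width >= sizeof tmp)
        return -1;
    memcpy(tmp, field, width);
    tmp[width] = '\0';
    return atoi(tmp);
}
```

For the line "PP1004181350D001002003", field_to_int(line + 8, 4) gives the HHMM part and field_to_int(line + 13, 3) the first coordinate. Note that atoi() cannot report conversion errors; strtol() is the usual choice when that matters.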

Related

importing an ascii grid file in C

Sorry, but I know how to do it in other languages, but C is rather new to me.
I need to import an ascii grid file in a C code. The structure of the file is as follows:
ncols 12
nrows 101
xllcorner 2.0830078125
yllcorner 39.35908297583665
cellsize 0.00439453125
nodata_value -2147483648
401.99 407.38 394.17 362.35 342.36 335.13 319.91 284.99 262.88 259.58 245.62 233.58
397.63 396.36 380.70 358.96 339.35 327.96 314.06 296.73 279.11 264.80 257.20 249.97
389.71 381.29 356.41 338.75 326.04 323.36 317.67 301.30 281.79 269.46 261.94 250.72
.....
I can read the bulk of values but I am struggling to properly import the first 6 lines in two arrays, a character one (namevar) and a double one (valvar).
My only partially working code is:
#define ny 101
#define nx 12
#define dime nx *ny
int main(void)
{
FILE *dem;
double Z[dime], varval[6];
char namevar[12];
int l = 1;
dem = fopen("buttami.txt", "r");
int i;
int j;
while (l < 7)
{
//
fscanf(dem, "%s %f\n", &namevar, &varval[l]);
printf("%s %.8f\n", namevar, varval[l]);
l++;
}
for (i = 1; i < dime; i++)
{
fscanf(dem, "%lf", &Z[i]);
printf("%.5f ", Z[i]);
printf("\n");
}
fclose(dem);
}

The comments address many of the issues; this answer focuses on the specific problem you mention...
"I am struggling to properly import the first 6 lines in two arrays, a character one (namevar) and a double one (valvar)"
First, the variable char namevar[12]; is too small to hold the longest name string it needs to contain: "nodata_value", as stored in the file, has 12 characters, so namevar must be created with a size of at least 13 to leave room for the null terminator (see the definition of a C string).
The top part of the input file could be thought of as a header section, and its content as tag/values. An array of struct is useful to store content of varying types into a single array, each containing a set of members to accommodate the various types, in this case one C string, and one double. For example:
#define NUM_HDR_FLDS 6 // to eliminate magic number '6' in code below
typedef struct {
char namevar[20];
double varval;
} header_s;
header_s header[NUM_HDR_FLDS] = {0};//array of NUM_HDR_FLDS elements, each contains two members,
//1 a char array with room for null terminator for field name
//2 a double to contain value
Then your fscanf() loop will look like this:
//note changes to format specifier and the
//string member needs no &
int l=0;//C uses zero base indexing
dem=fopen("buttami.txt", "r");
if(dem)//test for success before using
{
while(l<NUM_HDR_FLDS){//zero base indexing again (0-5)
if(fscanf(dem,"%19s %lf", header[l].namevar,&header[l].varval) == 2)//%19s keeps the name within namevar[20]
{
printf("%s %.8f\n",header[l].namevar,header[l].varval);
} //else handle error
l++;
}
fclose(dem);
}
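For completeness, here is one way the header loop could be combined with reading the cell values. This is only a sketch: read_grid and header_value are names invented here, and it assumes the header always has the six fields shown in the question, with ncols and nrows among them.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_HDR_FLDS 6

typedef struct {
    char namevar[20];
    double varval;
} header_s;

/* Returns the value of the header field called `name`, or `fallback`
 * if no such field was read. */
static double header_value(const header_s *hdr, const char *name, double fallback)
{
    for (int i = 0; i < NUM_HDR_FLDS; i++)
        if (strcmp(hdr[i].namevar, name) == 0)
            return hdr[i].varval;
    return fallback;
}

/* Reads the 6-line header and then ncols*nrows cell values from `dem`.
 * On success returns a malloc'ed array (the caller frees it) and stores
 * the number of cells in *count; on failure returns NULL. */
double *read_grid(FILE *dem, header_s *hdr, size_t *count)
{
    for (int i = 0; i < NUM_HDR_FLDS; i++)
        if (fscanf(dem, "%19s %lf", hdr[i].namevar, &hdr[i].varval) != 2)
            return NULL;

    double ncols = header_value(hdr, "ncols", 0);
    double nrows = header_value(hdr, "nrows", 0);
    if (ncols < 1 || nrows < 1)
        return NULL;

    size_t n = (size_t)ncols * (size_t)nrows;
    double *z = malloc(n * sizeof *z);
    if (z == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        if (fscanf(dem, "%lf", &z[i]) != 1) {
            free(z);
            return NULL;
        }
    }
    *count = n;
    return z;
}
```

Sizing the array from the header, instead of the fixed nx and ny defines, means the same code works for grids of other dimensions.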

From your example data, I guess it is the Arc/Info ASCII Grid format (see Wikipedia: https://en.wikipedia.org/wiki/Esri_grid).
For raster data file I/O, please try the GDAL library.
GDAL documentation for this format: https://gdal.org/drivers/raster/aaigrid.html
Code samples for opening and reading a raster file: https://gdal.org/tutorials/raster_api_tut.html

Run length encoding on binary files in C

I wrote this function that performs a slightly modified variation of run-length encoding on text files in C.
I'm trying to generalize it to binary files but I have no experience working with them. I understand that, while I can compare bytes of binary data much the same way I can compare chars from a text file, I am not sure how to go about printing the number of occurrences of a byte to the compressed version like I do in the code below.
A note on the type of RLE I'm using: bytes that occur more than once in a row are duplicated to signal the next-to-come number is in fact the number of occurrences vs just a number following the character in the file. For occurrences longer than one digit, they are broken down into runs that are 9 occurrences long.
For example, aaaaaaaaaaabccccc becomes aa9aa2bcc5.
Here's my code:
char* encode(char* str)
{
char* ret = calloc(2 * strlen(str) + 1, 1);
size_t retIdx = 0, inIdx = 0;
while (str[inIdx]) {
size_t count = 1;
size_t contIdx = inIdx;
while (str[inIdx] == str[++contIdx]) {
count++;
}
size_t tmpCount = count;
// break down counts with 2 or more digits into counts ≤ 9
while (tmpCount > 9) {
tmpCount -= 9;
ret[retIdx++] = str[inIdx];
ret[retIdx++] = str[inIdx];
ret[retIdx++] = '9';
}
char tmp[2];
ret[retIdx++] = str[inIdx];
if (tmpCount > 1) {
// repeat character (this tells the decompressor that the next digit
// is in fact the # of consecutive occurrences of this char)
ret[retIdx++] = str[inIdx];
// convert single-digit count to string
snprintf(tmp, 2, "%ld", tmpCount);
ret[retIdx++] = tmp[0];
}
inIdx += count;
}
return ret;
}
What changes are in order to adapt this to a binary stream? The first problem I see is with the snprintf call, since it operates on a text format. Something that rings a bell is also the way I'm handling the multiple-digit occurrence runs. We're not working in base 10 anymore, so that has to change; I'm just unsure how, having almost never worked with binary data.

A few ideas that can be useful to you:
One simple method to generalize RLE to binary data is to use bit-based compression. For example, the bit sequence 00000000011111100111 can be translated to the sequence 0 9623: the first bit value followed by the run lengths 9, 6, 2 and 3. Since the binary alphabet is composed of only two symbols, you only need to store the first bit value (this can be as simple as storing it in the very first bit) and then the lengths of the runs of equal values. Arbitrarily large integers can be stored in a binary format using Elias gamma coding. Extra padding can be added to fit the entire sequence into a whole number of bytes. Using this method, the above sequence can be encoded like this:
00000000011111100111 -> 0 0001001 00110 010 011
                        ^ ^       ^     ^   ^
                first bit 9       6     2   3
If you want to keep it byte-based, one idea is to treat every even-indexed byte as a frequency (interpreted as an unsigned char) and every odd-indexed byte as the value it applies to. If a byte occurs more than 255 times in a row, you can just repeat the pair. This can be very inefficient, but it is definitely simple to implement, and it might be good enough if you can make some assumptions about the input.
Also, you can consider moving away from RLE and implementing Huffman coding or other more sophisticated algorithms (e.g. LZW).
Implementation-wise, I think tucuxi already gave you some hints.
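To show how the run lengths in the bit-based example turn into codes, here is a small sketch of Elias gamma coding. It writes the bits as '0' and '1' characters purely for readability; a real encoder would pack them into bytes. The function name elias_gamma is chosen for this example.

```c
#include <stddef.h>

/* Writes the Elias gamma code of n (n >= 1) into `out` as a string of
 * '0' and '1' characters and returns its length. Returns 0 if n is 0 or
 * the buffer is too small. */
size_t elias_gamma(unsigned long n, char *out, size_t out_size)
{
    if (n == 0)
        return 0;

    /* Count the significant bits of n. */
    size_t nbits = 0;
    for (unsigned long t = n; t != 0; t >>= 1)
        nbits++;

    /* The code is (nbits - 1) zeros followed by the nbits bits of n. */
    size_t len = 2 * nbits - 1;
    if (len + 1 > out_size)
        return 0;

    size_t pos = 0;
    for (size_t i = 0; i + 1 < nbits; i++)
        out[pos++] = '0';
    for (size_t i = nbits; i-- > 0; )
        out[pos++] = ((n >> i) & 1UL) ? '1' : '0';
    out[pos] = '\0';
    return len;
}
```

For the run lengths 9, 6, 2 and 3 this yields 0001001, 00110, 010 and 011, matching the encoded sequence shown above.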

You only have to address 2 problems:
you cannot use any str-related functions, because C strings do not deal well with '\0'. So for example, strlen will return the index of the 1st 0x0 byte in a string. The length of the input must be passed in as an additional parameter: char *encode(char *start, size_t length)
your output cannot have an implicit length of strlen(ret), because there may be extra 0-bytes sprinkled about in the output. You again need an extra parameter: size_t encode(char *start, size_t length, char *output) (this version would require the output buffer to be reserved externally, with a size of at least length*2, and return the length of the encoded string)
The rest of the code, assuming it was working before, should continue to work correctly now. If you want to go beyond base-10, and for instance use base-256 for greater compression, you would only need to change the constant in the break-things-up loop (from 9 to 255), and replace the snprintf as follows:
// before
snprintf(tmp, 2, "%ld", tmpCount);
ret[retIdx++] = tmp[0];
// after: much easier
ret[retIdx++] = tmpCount;
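Putting both points together, a binary version might look roughly like the sketch below. It keeps the question's doubled-byte scheme, stores the count as one raw byte (2..255), and takes the output buffer from the caller as suggested above; treat it as an illustration under those assumptions rather than a drop-in replacement.

```c
#include <stddef.h>

/* Run-length encodes `length` bytes from `in` into `out`. A run of
 * 2..255 equal bytes becomes: byte, byte, count. A single byte is copied
 * as is. Longer runs are split into chunks of 255.
 * `out` must have room for at least 2 * length bytes.
 * Returns the number of bytes written. */
size_t encode(const unsigned char *in, size_t length, unsigned char *out)
{
    size_t inIdx = 0, outIdx = 0;

    while (inIdx < length) {
        unsigned char b = in[inIdx];
        size_t count = 1;
        while (inIdx + count < length && in[inIdx + count] == b)
            count++;

        size_t tmpCount = count;
        while (tmpCount > 255) {        /* split long runs into chunks */
            tmpCount -= 255;
            out[outIdx++] = b;
            out[outIdx++] = b;
            out[outIdx++] = 255;
        }
        out[outIdx++] = b;
        if (tmpCount > 1) {
            out[outIdx++] = b;          /* doubled byte: a count follows */
            out[outIdx++] = (unsigned char)tmpCount;
        }
        inIdx += count;
    }
    return outIdx;
}
```

Using unsigned char throughout avoids sign surprises with bytes above 0x7F, and the explicit length means zero bytes in the input are handled like any other value.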

C11: how to quickly convert a char array into ints, then modify ints and update char array

There are two parts of the problem that I don't know how to solve:
Input
The user can enter some inputs like 12,14y or 15m and I need to extract the two ints and the character. For now, I simply use:
char buffer[50];
scanf("%s", buffer);
switch (buffer[strlen(buffer)-1]) {
// ... I use this to read the last char
}
This can give me the information of how many ints I have to read:
one in the m,n case -> sscanf(buffer, "%d%c", int1, c)
two in the y,s,b case -> sscanf(buffer, "%d,%d%c", int1, int2, c)
I need these numbers for the core of my program, so I need int values not only the string.
The problem is that online I read about sscanf inefficiency and I need a good way to do this task quickly.
Output
My code has to modify these numbers in just one case (y) and keep a modified copy of the user input. For example, if the user's input is 1,12y then I have to change it to 1,10y and store it as a char array, so it's not only an input. The modification of int2 is quite long to explain; I can say that the new value will be less than the original one (in my example, from 12 I get 10). The only idea I have is about how to size the new char array: I can calculate the lengths of int1 and int2 by dividing them by increasing powers of 10 until I get a result between 1 and 9, e.g.:
int1 = 201:
201 no
20.1 no
2.01 yes
=> 3 tries, length = 3
Then I use a malloc. But then, how can I write my "output" in the new char array? e.g.:
input = "1,201y"
-> int1 = 1, int2 = 201
-> length(int1) = 1, length(int2) = 2
// if the core program sets int2 = 51, then
char *out = malloc(1+2+1);
// now I have to write "1,51y" in this char array
I've coded the "core" program already, but now I want a fast "translation" of the user input (because in the core program I need to know whether it's an int1m, int1n, int1,int2y, int1,int2s or int1,int2b command), and I don't know how to modify the user input and save it in a string (for strings I use dynamically allocated char arrays). Only the y command can modify int2.
I hope it's clear what I have to do.

The problem is that online I read about sscanf inefficiency
"Online" isn't a very trustworthy source. Inefficiency depends entirely on what you compare the function with.
If you compare with any plain C function then all of the stdio.h functions are very much inefficient. As is malloc for that matter. However, printing to the screen and waiting on the human user are by far the largest bottlenecks in this program, so you might want to re-consider why and what you are optimizing.
That being said, you can easily roll out a manual specialized version of the string to integer conversion, by calling strtol family of functions. Here's a version supporting exactly 1 or 2 integers in the input string (it can easily be rewritten to use a loop instead):
#include <stdlib.h>
int parse_input (const char* input, int* i1, int* i2, char* ch)
{
char* endptr=NULL;
const char* cptr=input;
int result;
result = strtol(cptr, &endptr, 10);
if(cptr==endptr)
{
return 0;
}
*i1 = result;
if(*endptr != ',')
{
*ch = *endptr;
return 1;
}
cptr=endptr+1;
result = strtol(cptr, &endptr, 10);
if(cptr==endptr)
{
return 0;
}
*i2 = result;
*ch = *endptr;
return 2;
}
Some extra error handling might be needed too. This gives around 50 instructions when compiled for x86_64, not counting the strtol calls, and some 20 of those instructions are related to parameter stacking and the calling convention.
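For the output half of the question (rebuilding the string once int2 has changed), one possible approach is to let snprintf work out the length, so there is no need to count digits by hand. The function name format_command is made up for this sketch, and it only covers the two-integer form.

```c
#include <stdio.h>
#include <stdlib.h>

/* Builds "<i1>,<i2><ch>" (for example "1,51y") in a newly allocated
 * string. Returns NULL on failure; the caller frees the result. */
char *format_command(int i1, int i2, char ch)
{
    /* With a NULL buffer and size 0, snprintf only reports how many
     * characters the output needs (not counting the terminator). */
    int len = snprintf(NULL, 0, "%d,%d%c", i1, i2, ch);
    if (len < 0)
        return NULL;

    char *out = malloc((size_t)len + 1);
    if (out == NULL)
        return NULL;

    snprintf(out, (size_t)len + 1, "%d,%d%c", i1, i2, ch);
    return out;
}
```

Note that the allocation has to include the comma, the command character and the terminating '\0', not just the digits of the two integers.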

How to extract last 3 digit of an array of a string and store it to different int array in C

I want to extract last 3 digit of a string suppose like :
char a[100][100] = ["17BIT0111" , "17BIT0222", ... n];
and I want to take last three digits and store in different array like
int b[100] =[111 , 222 , ... n];
I took reference from this, but I want it without using pointers or a linked list, as I am going to use it for a stack comparison.
C program to extract different substrings from array of strings

Something like this:
for (int i = 0; i < 100; ++i)
{
unsigned int value = 0;
sscanf(a[i], "17BIT%u", &value);
b[i] = (int) (value % 1000);
}
This doesn't check the return value of sscanf(); instead, value defaults to 0 in case the conversion fails.
This will convert a larger integer, so the % 1000 was added to make sure only the last three digits really matter in the conversion. The unsigned type keeps the result non-negative, which makes sense to me in cases like these; note, however, that %u follows strtoul and still accepts a leading sign, so it does not reject a dash by itself.
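If the prefix is not always 17BIT, an alternative is to look only at the last three characters of each string. This is a sketch with a made-up function name, last_three_digits, and it assumes a result of -1 is an acceptable error marker.

```c
#include <ctype.h>
#include <string.h>

/* Returns the numeric value of the last three characters of s, or -1 if
 * s is shorter than three characters or any of those characters is not
 * a decimal digit. */
int last_three_digits(const char *s)
{
    size_t len = strlen(s);
    if (len < 3)
        return -1;

    int value = 0;
    for (size_t i = len - 3; i < len; i++) {
        if (!isdigit((unsigned char)s[i]))
            return -1;
        value = value * 10 + (s[i] - '0');
    }
    return value;
}
```

In the loop from the answer this would be used as b[i] = last_three_digits(a[i]);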

How would I send this string one character at a time?

Here is the code that I have. I am unable to test it right now, so I'm just wondering if someone can verify that this will take the full str string and send each character individually to TXREG
char str [10];
int n;
n = sprintf(str, "%u%% %u\n", Duty_Cycle, DAC_Output);
transmit_uart(str); // Send the value
And here is the transmit_uart() method.
void transmit_uart(const char *value) {
for(int i = 0; value[i] != '\0'; i++) {
while(TXIF == 0) {}
TXREG = value[i];
}
}
So this should send something like
50% 128
Every time I call transmit_uart() with a string formatted the way I have it up there.
UPDATE: I was able to test it yesterday, and this did in fact work! Thanks for all the help!

Although I haven't personally loaded it onto an MCU and tested it, yes, that looks fine as long as TXIF does what it looks like.
You really should use snprintf or a larger buffer. This is one case where overflowing the integer (as in simply having too large a value, or any negative value) would cascade into buffer overflow.
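A minimal sketch of the snprintf variant, assuming the toolchain's C library provides snprintf (some embedded compilers ship a reduced stdio). The helper name format_status is invented here; the format string is the one from the question.

```c
#include <stdio.h>

/* Formats "<duty>% <dac>\n" into buf, never writing more than `size`
 * bytes. Returns 1 if the whole text fit, 0 if it was truncated or an
 * encoding error occurred. */
int format_status(char *buf, size_t size, unsigned duty_cycle, unsigned dac_output)
{
    int n = snprintf(buf, size, "%u%% %u\n", duty_cycle, dac_output);
    return n >= 0 && (size_t)n < size;
}
```

With char str[10], a call such as format_status(str, sizeof str, Duty_Cycle, DAC_Output) either fills str completely or reports that the values did not fit, so transmit_uart() is never handed an overflowed buffer.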

If you want to send a string of 5-9 characters:
1, 2 or 3 characters for the Duty_Cycle value,
1 character for the % symbol,
1 character for the space,
1, 2 or 3 characters for the DAC_Output value,
1 character for the \n symbol,
then you are doing it right.