PCRE binary files? - c

I would like to parse binary files with PCRE. My tactic until now was to use fgets to read a line of a file, then parse that line using pcre_exec.
This will not work for me now because the "lines" end with a null byte rather than a newline. I did not see a way to have fgets stop at a null byte rather than newline.
Edit
The functionality would be similar to running grep -az PATTERN FILE

Unfortunately there is no shortcut here: you need to read your binary file byte by byte and check for '\0'.
You can then store these bytes in a buffer and either:
Compare them against your PATTERN on the fly
or
If you want to keep the data for later processing, store the buffers in a linked list, for example (as long as the files are not huge).
Hope this helps.
Regards.
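A minimal sketch of the byte-by-byte approach described above, assuming each record fits in a caller-supplied buffer. The function name read_nul_record and its return codes are made up for illustration. The returned length could then be handed to pcre_exec as the subject length.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one NUL-terminated record from fp into buf.
   Returns the record length (without the NUL), -1 at end of file when
   nothing was read, or -2 if the record does not fit in cap bytes. */
long read_nul_record(FILE *fp, char *buf, size_t cap)
{
    assert(fp != NULL && buf != NULL && cap > 0);
    size_t len = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\0') {            /* end of record */
            buf[len] = '\0';
            return (long)len;
        }
        if (len + 1 >= cap)         /* keep room for the terminator */
            return -2;
        buf[len++] = (char)c;
    }
    if (len == 0)
        return -1;                  /* clean end of file */
    buf[len] = '\0';                /* last record had no trailing NUL */
    return (long)len;
}
```

fgetc goes through the stdio buffer, so this does not issue one system call per byte. The file should be opened in binary mode ("rb").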

Related

Reading a file using pread

The aim of the problem is to use only pread to read a file of integers.
I am trying to devise a generic solution where I can read integers of any length, but I think there must be a better solution than my current algorithm.
For the sake of explanation and to guide the algorithm, here is a sample input file. I have explicitly added \r\n to show that they exist in the file.
Input file:
23456\r\n
134\r\n
1\r\n
345678\r\n
Algorithm
1. Read a byte from the file
2. Check if it is a digit, i.e. '0' <= byte <= '9'
2.1 if yes, increment the offset and read the next byte
2.2 if not, is it \r?
2.2.1 if yes, read the next byte, which should be \n.
Here the line is finished and we can use strtol to convert the string to an int.
2.2.2 if not, error condition
I had to come up with this algorithm because I found out that pread treats the file as a string of bytes and just puts the requested number of bytes in the provided buffer.
Question:
Is there a better way of reading integers from the file using pread() instead of parsing each byte to determine the end-of-string and then converting to an integer?
Is there a better way of reading integers from the file using pread() instead of parsing each byte to determine the end-of-string and then converting to an integer?
Yes, read big chunks of data into memory and then do the parsing in memory. Use a big buffer (sized according to system memory). On a modern system where gigabytes of memory are available, you can go for a buffer in the megabyte range. I would probably start out with a 1 or 2 megabyte buffer and see how it performs.
This will be much more efficient than byte-by-byte reads.
note: your code needs to handle situations where a chunk from the file stops in the middle of an integer. That adds a little complexity to code but it's not that difficult to handle.
where I can read integers of any length
Well, if you actually mean integers greater than the largest integer of your system, it's much more complicated. Standard functions like strtol can't be used. Further, you'll need to define your own way of storing these values. Alternatively, you can fetch a public library that can handle such values.
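The chunked approach could look roughly like the sketch below. It assumes non-negative integers that fit in a long (no overflow check) and treats any non-digit byte, such as \r or \n, as a separator. The function name read_ints_pread and its parameters are invented for this example. Instead of calling strtol, it accumulates the value digit by digit, so a number split across two chunks needs no special handling.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: parses decimal integers from fd using only pread.
   chunk is a scratch buffer of chunk_size bytes; up to max_out values are
   stored in out. Returns the number of values stored. */
size_t read_ints_pread(int fd, char *chunk, size_t chunk_size,
                       long *out, size_t max_out)
{
    assert(chunk != NULL && out != NULL && chunk_size > 0);
    off_t offset = 0;
    size_t count = 0;
    long value = 0;       /* number being built, carried across chunks */
    int in_number = 0;
    ssize_t n = 0;

    while (count < max_out &&
           (n = pread(fd, chunk, chunk_size, offset)) > 0) {
        for (ssize_t i = 0; i < n && count < max_out; i++) {
            char c = chunk[i];
            if (c >= '0' && c <= '9') {
                value = value * 10 + (c - '0');
                in_number = 1;
            } else if (in_number) {   /* separator ends the number */
                out[count++] = value;
                value = 0;
                in_number = 0;
            }
        }
        offset += n;
    }
    if (n < 0)
        perror("pread");
    if (in_number && count < max_out)  /* file ended without a separator */
        out[count++] = value;
    return count;
}
```

In real use the chunk would be large (for instance 1 MB); a tiny chunk is only useful for testing the boundary handling.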

Text files edit C

I have a program which takes data (ints, floats and strings) given by the user and writes it to a text file. Now I have to update part of that written data.
For example:
At line 4 in file I want to change the first 2 words (there's an int and a float). How can I do that?
With the information I found out, fseek() and fputs() can be used but I don't know exactly how to get to a specific line.
(Explained code will be appreciated as I'm a starter in C)
You can't "insert" characters into a file. You will have to write a program that reads the whole file and writes a new one: first the part before the edit, then your edited text, then the rest of the file.
You really need to read all the file, and ignore what is not needed.
fseek is not really useful here: it positions the file at some byte offset (relative to the start or the end of the file) and knows nothing about line boundaries.
Actually, lines inside a file are an ill-defined concept. Often a line is a sequence of bytes (other than the newline character) ended by a newline ('\n'). Some operating systems (notably Windows) treat text files specially (e.g. the real file contains \r\n to end each line, but the C library gives you the illusion that you have read \n).
In practice, you probably want to use line input routines, notably getline (or perhaps fgets).
If you use getline, take care to free the line buffer.
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
If you wanted to absolutely use fseek (which is a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends) and a second time to fseek to the line start. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
And in practice, the most costly operation is the actual disk read. The cost of buffering (partly done by the kernel and <stdio.h> functions, and partly by you when you deal with lines) is negligible.
Of course, you cannot change a line in a file in place. If you need to do that, process the file as input, produce an output file (containing the modified input) and rename it when finished.
BTW, you might be interested in indexed files like GDBM, or even in databases like SQLite, MariaDB or MongoDB, and in standard textual serialization formats like JSON or YAML (both have many libraries, even for C, to deal with them).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution is to copy the lines before it to a temporary file, write the changes to the same temporary file, then skip the parts of the original file that you want to change and copy the rest to the temporary file. Finally, close the original file, copy the temporary file over it, and delete the temporary file.
With that said, I suggest that you learn more about random-access files. These are very useful when storing records that are all the same size. If you have control over creating the original file, these might be better for your current purpose.
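The copy-to-a-temporary-file approach from the answers above might be sketched as follows. The function name replace_line and the tmp_path parameter are hypothetical, and the sketch renames the temporary file over the original rather than copying it back. It assumes tmp_path is on the same filesystem as path and that rename may replace an existing file, which holds on POSIX systems but not with every C library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: replaces line number target (1-based) of path with
   new_text (no trailing newline). Returns 0 on success, -1 on failure. */
int replace_line(const char *path, const char *tmp_path,
                 long target, const char *new_text)
{
    assert(path && tmp_path && new_text && target >= 1);
    FILE *in = fopen(path, "r");
    if (!in) return -1;
    FILE *out = fopen(tmp_path, "w");
    if (!out) { fclose(in); return -1; }

    char buf[4096];
    long line = 1;
    int replaced = 0;
    while (fgets(buf, sizeof buf, in)) {
        size_t len = strlen(buf);
        /* A long line may arrive in several pieces; only the last ends in \n */
        int ends_line = (len > 0 && buf[len - 1] == '\n');
        if (line == target) {
            if (!replaced) { fputs(new_text, out); replaced = 1; }
            if (ends_line) fputc('\n', out);
        } else {
            fputs(buf, out);        /* copy every other line unchanged */
        }
        if (ends_line) line++;
    }

    int failed = ferror(in) || ferror(out);
    fclose(in);
    if (fclose(out) != 0) failed = 1;
    if (failed || !replaced) { remove(tmp_path); return -1; }
    return rename(tmp_path, path) == 0 ? 0 : -1;
}
```

For the question's case, the caller would format the new int and float with snprintf into a string and pass it as new_text along with the rest of line 4.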

How to read 1000 or more columns data from file using c/c++ language?

A data file has 10000 rows and 1000 columns. I want to save an entire line to an array, or each column to a variable.
There is a standard function fscanf in C. If I use this function, I need to write the format 1000 times.
fscanf(pFile, "%f,%f,%f,%f,%f,%f,......", &a[0], &a[1],...,&a[999]);
Writing it like this is almost impossible when programming in C.
But I have no idea how else to implement it in C.
Any suggestions or solutions?
And how do I read or extract only some of the columns?
Read the file line by line using fgets() into a suitably large buffer. Don't be afraid to use a buffer of 32 KB or something, just to be very sure all the fields fit.
Then parse the line in a loop, perhaps using strtok() or just plain old strtod(). Note that the latter returns a pointer to the first character that was not considered a number; this is where your parsing will continue for the next number. Perhaps you need to add an inner loop to "eat" whitespace (or whatever separators you have).
You could read the file line by line, and then extract the numbers in a loop.
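A small sketch of the strtod loop suggested above, assuming the separators are commas and whitespace. The function name parse_row is invented for this example. Each line read with fgets would be passed to it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: extracts up to max numbers from one text line.
   Returns how many values were stored in out. */
size_t parse_row(const char *line, double *out, size_t max)
{
    assert(line != NULL && out != NULL);
    size_t count = 0;
    const char *p = line;
    while (count < max) {
        /* "Eat" separators between fields */
        while (*p == ',' || *p == ' ' || *p == '\t' ||
               *p == '\r' || *p == '\n')
            p++;
        if (*p == '\0')
            break;                  /* end of line */
        char *end;
        double v = strtod(p, &end);
        if (end == p)
            break;                  /* not a number: stop parsing */
        out[count++] = v;
        p = end;                    /* continue after the parsed number */
    }
    return count;
}
```

To extract only some columns, the caller can parse the whole row into an array of 1000 doubles and then pick the indices it needs.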

Reading upto newline

Hi
My program reads a CSV file.
So I used fgets to read one line at a time.
But now the interface specification says that it is possible to find NUL characters in a few of the columns.
So I need to replace fgets with another function to read from the file
Any suggestions?
If your text stream has a NUL (ASCII 0) character, you will need to handle your file as a binary file and use fread to read the file. There are a few approaches to this.
Read the entire file into memory. The length of the file can be obtained by fseek(fp, 0, SEEK_END) and then calling ftell. You can then allocate enough memory for the whole file. Once in memory, parsing the file should be relatively easy. This approach is only really suitable for smallish files (probably less than 50 MB max). For bonus marks look at the mmap function.
Read the file byte by byte and add the characters to a buffer until a newline is found.
Read and parse bit by bit. Create a buffer that is bigger than your largest line and fill it with content from your file. You then parse and extract as many lines as you can. Move the remainder to the beginning of a new buffer and read the next bit. Using a bigger buffer will help minimize copying.
fgets works perfectly well with embedded null bytes. Pre-fill your buffer with \n (using memset) and then use memchr(buf, '\n', sizeof buf). If memchr returns NULL, your buffer was too small and you need to enlarge it to read the rest of the line. Otherwise, you can determine whether the newline you found is the end of the line or the padding you pre-filled the buffer with by inspecting the next byte. If the newline you found is at the end of the buffer or has another newline just after it, it's from padding, and the previous byte is the null terminator inserted by fgets (not a null from the file). Otherwise, the newline you found has a null byte after it (the terminator inserted by fgets), and it's the end-of-line newline.
Other approaches will be slow (repeated fgetc) or waste (and risk running out of) resources (loading the whole file into memory).
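The pre-fill trick described above might be sketched like this. The function name read_line_with_nuls and the got_newline flag are made up for illustration. The sketch relies on fgets leaving the bytes after its terminator untouched, which common implementations do but which should be treated as an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one fgets chunk into buf and returns the number
   of bytes read from the file (including the newline, if any), even when the
   data contains NUL bytes. Returns -1 at end of file or on error.
   *got_newline is set to 1 if the chunk ends with the line's newline. */
long read_line_with_nuls(char *buf, size_t cap, FILE *fp, int *got_newline)
{
    assert(buf && fp && got_newline && cap >= 2 && cap <= INT_MAX);
    memset(buf, '\n', cap);                 /* pre-fill with padding */
    if (fgets(buf, (int)cap, fp) == NULL)
        return -1;

    char *nl = memchr(buf, '\n', cap);
    if (nl == NULL) {                       /* buffer full, line continues */
        *got_newline = 0;
        return (long)(cap - 1);
    }
    size_t i = (size_t)(nl - buf);
    if (i == cap - 1 || buf[i + 1] == '\n') {
        /* Padding newline: the byte before it is the fgets terminator */
        *got_newline = 0;
        return (long)(i - 1);
    }
    *got_newline = 1;                       /* real newline, then terminator */
    return (long)(i + 1);
}
```

When got_newline is 0 and the end of file has not been reached, the caller would enlarge its buffer and call again to read the rest of the line.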
Use fread and then scan the block for the separator.
Check the function int T_fread(FILE *input) at http://www.mrx.net/c/source.html

overwriting a specific line on a text file?

How do I go about overwriting a specific line in a text file in C? I have values in multiple variables that need to be written to the file.
This only works when the new line has the same size as the old one:
Open the file in the mode r+ (in append mode a+, every write goes to the end of the file regardless of fseek)
fseek() to the start of the file
Before reading the next line, use ftell() to note the start of the line
Read the line
If it's the line you want, fseek() again with the result from ftell() and use fwrite() to overwrite it.
If the length of the line changes, you must copy the file.
Since files (from the point of view of C's standard library) are not line-oriented, but are just a sequence of characters (or bytes in binary mode), you can't expect to edit them at the line-level easily.
As Aaron described, you can of course replace the characters that make up the line if your replacement is the exact same character count.
You can also (perhaps) insert a shorter replacement by padding with whitespace at the end (before the line terminator). That's of course a bit crude.
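A rough sketch of the in-place overwrite, including the space-padding idea for a shorter replacement. The function name overwrite_line is hypothetical. It assumes every line is shorter than the 4096-byte buffer and opens the file in "r+b" (update) mode so that ftell offsets are plain byte positions and writes land at the seek position.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: overwrites line number target (1-based) in place.
   new_text must not be longer than the old line; a shorter text is padded
   with spaces. Returns 0 on success, -1 on failure. */
int overwrite_line(const char *path, long target, const char *new_text)
{
    assert(path && new_text && target >= 1);
    FILE *fp = fopen(path, "r+b");
    if (!fp) return -1;

    char buf[4096];
    size_t new_len = strlen(new_text);
    int rc = -1;
    for (long line = 1; ; line++) {
        long start = ftell(fp);             /* remember where the line starts */
        if (start < 0 || !fgets(buf, sizeof buf, fp))
            break;
        if (line != target)
            continue;
        size_t old_len = strcspn(buf, "\r\n");  /* length without line ending */
        if (new_len <= old_len && fseek(fp, start, SEEK_SET) == 0) {
            size_t written = fwrite(new_text, 1, new_len, fp);
            for (size_t i = new_len; i < old_len; i++)
                if (fputc(' ', fp) != EOF)
                    written++;              /* pad up to the old length */
            if (written == old_len)
                rc = 0;
        }
        break;
    }
    if (fclose(fp) != 0) rc = -1;
    return rc;
}
```

If the new text is longer than the old line, the function refuses and the file has to be copied instead.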
