Very large fseek forward equivalent for stdin?

Very large fseek forward equivalent for stdin? - c

I have a very large, known number of bytes on stdin, and wish to discard a large (also known) number of them before reading the portion of interest (in other words, I want to fseek forward by a large integer, but fseek isn't defined for pipes). The simplest way to achieve this seems to be a large number of calls to fgetc, and the first alternative is to use a single call to fread with a large scratch pointer allocated to store the result. The first is very slow, and the second uses a potentially unbounded amount of memory for no good reason. Making multiple smaller reads solves the unbounded memory usage issue, but introduces a free parameter (the chunk size) which probably has a different fastest value for every machine and OS combination.
Are there any alternatives that achieve this goal in a neat, efficient fashion? POSIX is assumed.

There is no way to "skip" data on a pipe - you have to read it.
If it's a very large block, you will want to use a medium-sized buffer (as a compromise between overhead and memory usage), something like this:
size_t dataToRead = some_large_number;
while(dataToRead)
{
char buffer[4096];
size_t toread = min(sizeof(buffer), dataToRead);
size_t nread = fread(buffer, 1, toread, stdin);
dataToRead -= nread;
}
The size, 4096 is a rather arbitrary choice - but it's big enough not to cause a HUGE number of reads to the input, and small enough to not use crazy amounts of stack-space. It is unlikely that you will gain/loose much from changing this size.

Related

Using larger values than size_t can represent with fwrite

I know that fwrite takes the following parameters:
fwrite ( const void * ptr, size_t size, size_t count, FILE * stream );
As far as I know, size_t is a typedef and nothing else than:
typedef unsigned long size_t;
Is it possible to use values greater than size_t for count and write?
And if it is not could I connect the written blocks somehow?

No, fwrite accepts only values that fit in the size_t type.
There may be implementation-specific ways to write more but, for standard C, the approach is generally just to do sequential fwrite calls. Each subsequent call will append to what you've already written.
And keep in mind that size_t is a distinct type. It may be defined as an unsigned long is some implementations but that's not guaranteed.

From an application programmers point of view a file is a contiguous series of bytes. Successive writes will position the data sequentially onto a file. (This comment is necessary because some will argue details NOT relevant to your question).
Thus:
fwrite(&user_record1, sizeof(user_record1), 1, fp);
fwrite(&user_record2, sizeof(user_record2), 1, fp);
Results in two user records, one immediately following the other, on the file.
If you have a very large record, then divide it into two smaller records, as:
fwrite(&user_record_parta, sizeof(user_record1), 1, fp);
fwrite(&user_record_partb, sizeof(user_record2), 1, fp);
However, I would question an application design that uses such large records. Perhaps what you are really doing in the application is writing an array of user records and that array grows really large. If this is the case, write each entry of the array, rather than the whole array.

In order to use fwrite, the entire object you're writing must be in the object pointed to by ptr. Unless you have a really messed-up C implementation, it's impossible to have an object larger than the maximum value of size_t, so trying to write more bytes than that would be a programming error, since the pointed-to object is not actually that large anyway.

If you use greater values than ULONG_MAX it will just wrap around to 0.
What you can do is write ULONG_MAX Bytes, seek to the end position and continue writing, then do that over in a loop, until you've written all your data.

Go to a certain point of a binary file in C (using fseek) and then reading from that location (using fread)

I am wondering if this is the best way to go about solving my problem.
I know the values for particular offsets of a binary file where the information I want is held...What I want to do is jump to the offsets and then read a certain amount of bytes, starting from that location.
After using google, I have come to the conclusion that my best bet is to use fseek() to move to the position of the offset, and then to use fread() to read an amount of bytes from that position.
Am I correct in thinking this? And if so, how is best to go about doing so? i.e. how to incorporate the two together.
If I am not correct, what would you suggest I do instead?
Many thanks in advance for your help.
Matt
Edit:
I followed a tutorial on fread() and adjusted it to the following:
`#include <stdio.h>
int main()
{
FILE *f;
char buffer[11];
if (f = fopen("comm_array2.img", "rt"))
{
fread(buffer, 1, 10, f);
buffer[10] = 0;
fclose(f);
printf("first 10 characters of the file:\n%s\n", buffer);
}
return 0;
}`
So I used the file 'comm_array2.img' and read the first 10 characters from the file.
But from what I understand of it, this goes from start-of-file, I want to go from some-place-in-file (offset)
Is this making more sense?
Edit Number 2:
It appears that I was being a bit dim, and all that is needed (it would seem from my attempt) is to put the fseek() before the fread() that I have in the code above, and it seeks to that location and then reads from there.

If you are using file streams instead of file descriptors, then you can write yourself a (simple) function analogous to the POSIX pread() system call.
You can easily emulate it using streams instead of file descriptors1. Perhaps you should write yourself a function such as this (which has a slightly different interface from the one I suggested in a comment):
size_t fpread(void *buffer, size_t size, size_t mitems, size_t offset, FILE *fp)
{
if (fseek(fp, offset, SEEK_SET) != 0)
return 0;
return fread(buffer, size, nitems, fp);
}
This is a reasonable compromise between the conventions of pread() and fread().
What would the syntax of the function call look like? For example, reading from the offset 732 and then again from offset 432 (both being from start of the file) and filestream called f.
Since you didn't say how many bytes to read, I'm going to assume 100 each time. I'm assuming that the target variables (buffers) are buffer1 and buffer2, and that they are both big enough.
if (fpread(buffer1, 100, 1, 732, f) != 1)
...error reading at offset 732...
if (fpread(buffer2, 100, 1, 432, f) != 1)
...error reading at offset 432...
The return count is the number of complete units of 100 bytes each; either 1 (got everything) or 0 (something went awry).
There are other ways of writing that code:
if (fpread(buffer1, sizeof(char), 100, 732, f) != 100)
...error reading at offset 732...
if (fpread(buffer2, sizeof(char), 100, 432, f) != 100)
...error reading at offset 432...
This reads 100 single bytes each time; the test ensures you got all 100 of them, as expected. If you capture the return value in this second example, you can know how much data you did get. It would be very surprising if the first read succeeded and the second failed; some other program (or thread) would have had to truncate the file between the two calls to fpread(), but funnier things have been known to happen.
1 The emulation won't be perfect; the pread() call provides guaranteed atomicity that the combination of fseek() and fread() will not provide. But that will seldom be a problem in practice, unless you have multiple processes or threads concurrently updating the file while you are trying to position and read from it.

It frequently depends on the distance between the parts you care about. If you're only skipping over/ignoring a few bytes between the parts you care about, it's often easier to just read that data and ignore what you read, rather than using fseek to skip past it. A typical way to do this is define a struct holding both the data you care about, and place-holders for the ones you don't care about, read in the struct, and then just use the parts you care about:
struct whatever {
long a;
long ignore;
short b;
} w;
fread(&w, 1, sizeof(w), some_file);
// use 'w.a' and 'w.b' here.
If there's any great distance between the parts you care about, though, chances are that your original idea of using fseek to get to the parts that matter will be simpler.

Your theory sounds correct. Open, seek, read, close.
Create a struct to for the data you want to read and pass a pointer to read() of struct's allocated memory. You'll likely need #pragma pack(1) or similar on the struct to prevent misalignment problems.

C using fread to read an unknown amount of data

I have a text file called test.txt
Inside it will be a single number, it may be any of the following:
1
2391
32131231
3123121412
I.e. it could be any size of number, from 1 digit up to x digits.
The file will only have 1 thing in it - this number.
I want a bit of code using fread() which will read that number of bytes from the file and put it into an appropriately sized variable.
This is to run on an embedded device; I am concerned about memory usage.
How to solve this problem?

You can simply use:
char buffer[4096];
size_t nbytes = fread(buffer, sizeof(char), sizeof(buffer), fp);
if (nbytes == 0)
...EOF or other error...
else
...process nbytes of data...
Or, in other words, provide yourself with a data space big enough for any valid data and then record how much data was actually read into the string. Note that the string will not be null terminated unless either buffer contained all zeroes before the fread() or the file contained a zero byte. You cannot rely on a local variable being zeroed before use.
It is not clear how you want to create the 'appropriately sized variable'. You might end up using dynamic memory allocation (malloc()) to provide the correct amount of space, and then return that allocated pointer from the function. Remember to check for a null return (out of memory) before using it.

If you want to avoid over-reading, fread is not the right function. You probably want fscanf with a conversion specifier along the lines of %100[0123456789]...

One way to achieve this is to use fseek to move your file stream location to the end of the file:
fseek(file, SEEK_END, SEEK_SET);
and then using ftell to get the position of the cursor in the file — this returns the position in bytes so you can then use this value to allocate a suitably large buffer and then read the file into that buffer.
I have seen warnings saying this may not always be 100% accurate but I've used it in several instances without a problem — I think the issues could be dependant on specific implementations of the functions on certain platforms.

Depending on how clever you need to be with the number conversion... If you do not need to be especially clever and fast, you can read it a character at a time with getc(). So,
- start with a variable initialized to 0.
- Read a character, multiply variable by 10 and add new digit.
- Then repeat until done.
Get a bigger sized variable as needed along the way or start with your largest sized variable and then copy it into the smallest size that fits after you finish.

Why does fwrite have both size and count parameters when just bytes to write would suffice? [duplicate]

We had a discussion here at work regarding why fread() and fwrite() take a size per member and count and return the number of members read/written rather than just taking a buffer and size. The only use for it we could come up with is if you want to read/write an array of structures which aren't evenly divisible by the platform alignment and hence have been padded but that can't be so common as to warrant this choice in design.
From fread(3):
The function fread() reads nmemb elements of data, each size bytes long,
from the stream pointed to by stream, storing them at the location given
by ptr.
The function fwrite() writes nmemb elements of data, each size bytes
long, to the stream pointed to by stream, obtaining them from the location
given by ptr.
fread() and fwrite() return the number of items successfully read or written
(i.e., not the number of characters). If an error occurs, or the
end-of-file is reached, the return value is a short item count (or zero).

The difference in fread(buf, 1000, 1, stream) and fread(buf, 1, 1000, stream) is, that in the first case you get only one chunk of 1000 bytes or nothing, if the file is smaller and in the second case you get everything in the file less than and up to 1000 bytes.

It's based on how fread is implemented.
The Single UNIX Specification says
For each object, size calls shall be
made to the fgetc() function and the
results stored, in the order read, in
an array of unsigned char exactly
overlaying the object.
fgetc also has this note:
Since fgetc() operates on bytes,
reading a character consisting of
multiple bytes (or "a multi-byte
character") may require multiple calls
to fgetc().
Of course, this predates fancy variable-byte character encodings like UTF-8.
The SUS notes that this is actually taken from the ISO C documents.

This is pure speculations, however back in the days(Some are still around) many filesystems were not simple byte streams on a hard drive.
Many file systems were record based, thus to satisfy such filesystems in an efficient manner, you'll have to specify the number of items ("records"), allowing fwrite/fread to operate on the storage as records, not just byte streams.

Here, let me fix those functions:
size_t fread_buf( void* ptr, size_t size, FILE* stream)
{
return fread( ptr, 1, size, stream);
}
size_t fwrite_buf( void const* ptr, size_t size, FILE* stream)
{
return fwrite( ptr, 1, size, stream);
}
As for a rationale for the parameters to fread()/fwrite(), I've lost my copy of K&R long ago so I can only guess. I think that a likely answer is that Kernighan and Ritchie may have simply thought that performing binary I/O would be most naturally done on arrays of objects. Also, they may have thought that block I/O would be faster/easier to implement or whatever on some architectures.
Even though the C standard specifies that fread() and fwrite() be implemented in terms of fgetc() and fputc(), remember that the standard came into existence long after C was defined by K&R and that things specified in the standard might not have been in the original designers ideas. It's even possible that things said in K&R's "The C Programming Language" might not be the same as when the language was first being designed.
Finally, here's what P.J. Plauger has to say about fread() in "The Standard C Library":
If the size (second) argument is greater than one, you cannot determine
whether the function also read up to size - 1 additional characters beyond what it reports.
As a rule, you are better off calling the function as fread(buf, 1, size * n, stream); instead of
fread(buf, size, n, stream);
Bascially, he's saying that fread()'s interface is broken. For fwrite() he notes that, "Write errors are generally rare, so this is not a major shortcoming" - a statement I wouldn't agree with.

Likely it goes back to the way that file I/O was implemented. (back in the day) It might have been faster to write / read to files in blocks then to write everything at once.

Having separate arguments for size and count could be advantageous on an implementation that can avoid reading any partial records. If one were to use single-byte reads from something like a pipe, even if one was using fixed-format data, one would have to allow for the possibility of a record getting split over two reads. If could instead requests e.g. a non-blocking read of up to 40 records of 10 bytes each when there are 293 bytes available, and have the system return 290 bytes (29 whole records) while leaving 3 bytes ready for the next read, that would be much more convenient.
I don't know to what extent implementations of fread can handle such semantics, but they could certainly be handy on implementations that could promise to support them.

I think it is because C lacks function overloading. If there was some, size would be redundant. But in C you can't determine a size of an array element, you have to specify one.
Consider this:
int intArray[10];
fwrite(intArray, sizeof(int), 10, fd);
If fwrite accepted number of bytes, you could write the following:
int intArray[10];
fwrite(intArray, sizeof(int)*10, fd);
But it is just inefficient. You will have sizeof(int) times more system calls.
Another point that should be taked into consideration is that you usually don't want a part of an array element be written to a file. You want the whole integer or nothing. fwrite returns a number of elements succesfully written. So if you discover that only 2 low bytes of an element is written what would you do?
On some systems (due to alignment) you can't access one byte of an integer without creating a copy and shifting.

Faster I/O in C

I have a problem which will take 1000000 lines of inputs like below from console.
0 1 23 4 5
1 3 5 2 56
12 2 3 33 5
...
...
I have used scanf, but it is very very slow. Is there anyway to get the input from console in a faster way? I could use read(), but I am not sure about the no of bytes in each line, so I can not as read() to read 'n' bytes.
Thanks,
Very obliged

Use fgets(...) to pull in a line at a time. Note that you should check for the '\n' at the end of the line, and if there is not one, you are either at EOF, or you need to read another buffer's worth, and concatenate the two together. Lather, rinse, repeat. Don't get caught with a buffer overflow.
THEN, you can parse each logical line in memory yourself. I like to use strspn(...) and strcspn(...) for this sort of thing, but your mileage may vary.
Parsing:
Define a delimiters string. Use strspn() to count "non data" chars that match the delimiters, and skip over them. Use strcspn() to count the "data" chars that DO NOT match the delimiters. If this count is 0, you are done (no more data in the line). Otherwise, copy out those N chars to hand to a parsing function such as atoi(...) or sscanf(...). Then, reset your pointer base to the end of this chunk and repeat the skip-delims, copy-data, convert-to-numeric process.

If your example is representative, that you indeed have a fixed format of five decimal numbers per line, I'd probably use a combination of fgets() to read the lines, then a loop calling strtol() to convert from string to integer.
That should be faster than scanf(), while still clearer and more high-level than doing the string to integer conversion on your own.
Something like this:
typedef struct {
int number[5];
} LineOfNumbers;
int getNumbers(FILE *in, LineOfNumbers *line)
{
char buf[128]; /* Should be large enough. */
if(fgets(buf, sizeof buf, in) != NULL)
{
int i;
char *ptr, *eptr;
ptr = buf;
for(i = 0; i < sizeof line->number / sizeof *line->number; i++)
{
line->number[i] = (int) strtol(ptr, &eptr, 10);
if(eptr == ptr)
return 0;
ptr = eptr;
}
return 1;
}
return 0;
}
Note: this is untested (even uncompiled!) browser-written code. But perhaps useful as a concrete example.

You use multiple reads with a fixed size buffer till you hit end of file.

Out of curiosity, what generates that many lines that fast in a console ?

Use binary I/O if you can. Text conversion can slow down the reading by several times. If you're using text I/O because it's easy to debug, consider again binary format, and use the od program (assuming you're on unix) to make it human-readable when needed.
Oh, another thing: there's AT&T's SFIO library, which stands for safer/faster file IO. You might also have some luck with that, but I doubt that you'll get the same kind of speedup as you will with binary format.

Read a line at a time (if buffer not big enough for a line, expand and continue with larger buffer).
Then use dedicated functions (e.g. atoi) rather than general for conversion.
But, most of all, set up a repeatable test harness with profiling to ensure changes really do speed things up.

fread will still return if you try to read more bytes than there are.
I have found on of the fastest ways to read file is like this:
/*seek end of file */
fseek(file,0,SEEK_END);
/*get size of file */
size = ftell(file);
/*seek start of file */
fseek(file,0,SEEK_SET);
/* make a buffer for the file */
buffer = malloc(1048576);
/*fread in 1MB at a time until you reach size bytes etc */
On modern computers put your ram to use and load the whole thing to ram, then you can easily work your way through the memory.
At the very least you should be using fread with block sizes as big as you can, and at least as big as the cache blocks or HDD sector size (4096 bytes minimum, I would use 1048576 as a minimum personally). You will find that with much bigger read requsts rfead is able to sequentially get a big stream in one operation. The suggestion here of some people to use 128 bytes is rediculous.... as you will end up with the drive having to seek all the time as the tiny delay between calls will cause the head to already be past the next sector which almost certainly has sequential data that you want.

You can greatly reduce the time of execution by taking input using fread() or fread_unlocked() (if your program is single-threaded). Locking/Unlocking the input stream just once takes negligible time, so ignore that.
Here is the code:
#include <iostream>
int maxio=1000000;
char buf[maxio], *s = buf + maxio;
inline char getc1(void)
{
if(s >= buf + maxio) { fread_unlocked(buf,sizeof(char),maxio,stdin); s = buf; }
return *(s++);
}
inline int input()
{
char t = getc1();
int n=1,res=0;
while(t!='-' && !isdigit(t)) t=getc1(); if(t=='-')
{
n=-1; t=getc1();
}
while(isdigit(t))
{
res = 10*res + (t&15);
t=getc1();
}
return res*n;
}
This is implemented in C++. In C, you won't need to include iostream, function isdigit() is implicitly available.
You can take input as a stream of chars by calling getc1() and take integer input by calling input().
The whole idea behind using fread() is to take all the input at once. Calling scanf()/printf(), repeatedly takes up valuable time in locking and unlocking streams which is completely redundant in a single-threaded program.
Also make sure that the value of maxio is such that all input can be taken in a few "roundtrips" only (ideally one, in this case). Tweak it as necessary.
Hope this helps!

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Very large fseek forward equivalent for stdin? - c

Related

Using larger values than size_t can represent with fwrite

Go to a certain point of a binary file in C (using fseek) and then reading from that location (using fread)

C using fread to read an unknown amount of data

Why does fwrite have both size and count parameters when just bytes to write would suffice? [duplicate]

Faster I/O in C

Categories

Resources