Parsing: load into memory or use stream - c

I'm writing a little parser and I would like to know the advantages and disadvantages of the different ways to load the data to be parsed. The two ways that I thought of are:
Load the file's contents into a string then parse the string (access the character at an array position)
Parse as reading the file stream (fgetc)
The former will allow me to have two functions, parse_from_file and parse_from_string; however, I believe this approach will take up more memory. The latter does not have that disadvantage.
Does anyone have any advice on the matter?

Reading the entire file in or memory mapping it will be faster, but may cause issues if you want your language to be able to #include other files as these would be memory mapped or read into memory as well.
The stdio functions would work well because they usually buffer data for you. They are general purpose, so they also have to cope with usage patterns other than reading a file from start to finish, but that shouldn't add much overhead.
A good balance is to have a large circular buffer (some multiple of 2 * 4096 bytes is a good size) which you fill with file data and have your tokenizer read from. Whenever a block's worth of data has been passed to your tokenizer (and you know that it is not going to be pushed back), you can refill that block with new data from the file and update the buffer position bookkeeping.
Another thing to consider is if there is any chance that the tokenizer would ever need to be able to be used to read from a pipe or from a person typing directly in some text. In these cases your reads may return less data than you asked for without it being at the end of the file, and the buffering method I mentioned above gets more complicated. The stdio buffering is good for this as it can easily be switched to/from line or block buffering (or no buffering).
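As a small sketch of that last point, assuming a hypothetical helper named configure_input: setvbuf selects the buffering mode, and it has to be called after the stream is opened but before any other operation on it. The 2 * 4096 size is only a hint borrowed from the buffer size suggested above.

```c
#include <stdio.h>

/* Hypothetical helper: pick a stdio buffering mode for the tokenizer's input.
 * Line buffering suits a terminal or pipe, full (block) buffering suits a
 * regular file. Passing NULL lets the library allocate the buffer itself.
 * Returns 0 on success, nonzero on failure (as setvbuf does). */
int configure_input(FILE *fp, int interactive) {
    return setvbuf(fp, NULL, interactive ? _IOLBF : _IOFBF, 2 * 4096);
}
```

Passing _IONBF instead would turn buffering off altogether.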
Using flex, the fast lexical analyzer generator (not the Adobe Flash thing), or a similar tool can greatly ease the trouble with all of this. You should look into using it to generate the C code for your tokenizer (lexical analysis).
Whatever you do, try to structure your code so that the functions for peeking at and consuming the next character can easily be swapped for a different implementation; then, if you change your mind, you won't have to start over.
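One way this could look, as a minimal sketch: the CharSource type and the cs_* function names are made up for illustration, and count_words stands in for a real tokenizer. The parser only calls cs_peek and cs_consume, so it does not care whether the input is a string or a stdio stream.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical character source: exactly one of fp / str is non-NULL. */
typedef struct {
    FILE *fp;          /* set when reading from a stream */
    const char *str;   /* set when reading from an in-memory string */
    size_t pos;        /* index of the next character in str */
} CharSource;

CharSource cs_from_string(const char *s) {
    CharSource cs = { NULL, s, 0 };
    return cs;
}

CharSource cs_from_file(FILE *fp) {
    CharSource cs = { fp, NULL, 0 };
    return cs;
}

/* Return the next character without consuming it, or EOF at end of input. */
int cs_peek(CharSource *cs) {
    if (cs->fp) {
        int c = getc(cs->fp);
        if (c != EOF)
            ungetc(c, cs->fp);   /* one character of pushback is guaranteed */
        return c;
    }
    return cs->str[cs->pos] ? (unsigned char)cs->str[cs->pos] : EOF;
}

/* Return the next character and advance past it, or EOF at end of input. */
int cs_consume(CharSource *cs) {
    if (cs->fp)
        return getc(cs->fp);
    return cs->str[cs->pos] ? (unsigned char)cs->str[cs->pos++] : EOF;
}

/* Stand-in for a tokenizer: count words separated by spaces or newlines. */
int count_words(CharSource *cs) {
    int words = 0;
    while (cs_peek(cs) != EOF) {
        while (cs_peek(cs) == ' ' || cs_peek(cs) == '\n')
            cs_consume(cs);
        if (cs_peek(cs) == EOF)
            break;
        words++;
        while (cs_peek(cs) != EOF && cs_peek(cs) != ' ' && cs_peek(cs) != '\n')
            cs_consume(cs);
    }
    return words;
}
```

With this shape, parse_from_file and parse_from_string become thin wrappers that build a CharSource and hand it to the same parser; a memory-mapped or circular-buffer backend could be added later without touching the parsing code.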

Consider using lex (and perhaps yacc, if the grammar of your language matches its capabilities). Lex will handle all the fiddly details of lexical analysis for you and produce efficient code. You can probably beat its memory footprint by a few bytes, but how much effort do you want to expend on that?

The most efficient approach on a POSIX system would probably be neither of the two (or a variant of the first, if you like): just map the file read-only with mmap and parse it in place. Modern systems are quite efficient at this: they prefetch data when they detect a streaming access pattern, multiple instances of your program that parse the same file share the same physical pages of memory, and so on. The interfaces are also relatively simple to handle, I think.
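A minimal sketch of that approach, assuming a POSIX system; the function name count_lines_mmap is made up, and counting newlines stands in for real parsing. Note that the mapped data is not NUL-terminated, so the parser has to work with an explicit length.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only and count its newline characters.
 * Returns the count, or -1 on error. An empty file is handled separately
 * because mmap rejects zero-length mappings. */
long count_lines_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    const char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    long lines = 0;
    for (size_t i = 0; i < len; i++)  /* "parse" directly from the mapping */
        if (data[i] == '\n')
            lines++;

    munmap((void *)data, len);
    return lines;
}
```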

Related

What is the purpose of using memory stream in the C standard library?

In the C standard library, what is the purpose of using a memory stream (as created for an array via fmemopen())? How does it compare to manipulating the array directly?
This is very similar to using the std::stringstream in C++, which allows you to write to a string (including '\0' characters) and then use the string the way you'd like.
The idea is that we have many functions at our disposal, such as fprintf(), which write data to a stream in a formatted way. All of those functions can be used with a memory-based file; the only change needed is replacing the fopen() call with fmemopen().
So if you want to create a string that requires many fprintf() calls, using fmemopen() to generate the string in memory is extremely useful. snprintf() could also be used if you just need one quick conversion.
Similarly, you can of course use fread(), fwrite() and the like. If you need to create a file that requires a lot of seeking and it is small enough to fit easily in memory, then building it in memory is going to be a lot faster. Once done, you can save the result to disk.

Fortran: How do I allocate arrays when reading a file of unknown size?

My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice. First to determine the size of the problem (to allocate arrays) and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
Read the file once to get the array size, allocate, read again.
Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
Always, always, work with files which contain metadata to tell an interested program how much data there is; for example a block header line telling you how many data elements are in the next block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project and so much wasted time and effort saved down the line. You don't have to jump on HDF5 or a similar heavyweight file design method, just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc) then HDF5 or something similar.
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
I would add to the previous answer:
If your data has a regular structure and the file can be opened as text in an editor, press Ctrl+End to jump to the last line, then subtract the number of header lines from the total row count to get the number of records. You may waste time opening the file if it's very large, though.

What is generally the best approach reading a file for a compiler?

I know this is a general question.
I'm going to program a compiler and I was wondering if it's better to take the tokens of the language while reading the file (i.e., first open the file, then extract tokens while reading, and finally close the file) or read the file first, close it and then work with the data in a variable. The pseudo-code for this would be something like:
file = open(filename);
textVariable = read(file);
close(file);
getTokens(textVariable);
The first option would be something like:
file = open(filename);
readWhileGeneratingTokens(file);
close(file);
I guess the first option looks better, since there isn't an additional cost in terms of main memory. However, I think there might be some benefits using the second option, for I minimize the time the file is going to be open.
I can't provide any hard data, but generally the amount of time a compiler spends tokenizing source code is rather small compared to the amount of time spent optimizing/generating target code. Because of this, wanting to minimize the amount of time the source file is open seems premature. Additionally, reading the entire source file into memory before tokenizing would prevent any sort of line-by-line execution (think interpreted language) or reading input from a non-file location (think of a stream like stdin). I think it is safe to say that the overhead in reading the entire source file into memory is not worth the computer's resources and will ultimately be detrimental to your project.
Compilers are carefully designed to be able to proceed on as little as one character at a time from the input. They don't read entire files prior to processing, or rather they have no need to do so: that would just add pointless latency. They don't even need to read entire lines before processing.

I/O methods in C

I am looking for various ways of reading/writing data from stdin/stdout. Currently I know about scanf/printf, getchar/putchar and gets/puts. Are there any other ways of doing this? I am also interested in knowing which one is most efficient in terms of memory and space.
Thanks in Advance
fgets()
fputs()
read()
write()
And others, details can be found here: http://www.cplusplus.com/reference/clibrary/cstdio/
As for your time question, take a look at this: http://en.wikipedia.org/wiki/I/O_bound
Stdio is designed to be fairly efficient no matter which way you prefer to read data. If you need to do character-by-character reads and writes, getc and putc usually expand to macros which just access the buffer, except when it's empty or full. For line-by-line text I/O, use puts/fputs and fgets. (But NEVER use gets, because there's no way to control how many bytes it will read!) The printf family (e.g. fprintf) is of course extremely useful for text because it allows you to skip constructing a temporary buffer in memory before writing (and thus lets you avoid thinking about all the memory allocation, overflow, etc. issues). fscanf tends to be much less useful, but mostly because it's difficult to use. If you study the documentation for fscanf well and learn how to use %[, %n, and the numeric specifiers, it can be very powerful!
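To give a flavour of %[ and %n, here is a small sketch that uses sscanf on a line already read with fgets (the same conversions work with fscanf). The parse_setting name and the "key = number" line format are made up for the example.

```c
#include <stdio.h>

/* Parse a line of the form "key = number".
 * Returns 1 if the whole line matched, 0 otherwise. */
int parse_setting(const char *line, char key[32], int *value) {
    int used = 0;
    /* %31[a-z_] reads at most 31 characters from the given set,
     * %d reads the number, and %n records how many input characters
     * have been consumed so far (it does not count as a conversion). */
    if (sscanf(line, " %31[a-z_] = %d %n", key, value, &used) != 2)
        return 0;
    return line[used] == '\0';   /* reject trailing garbage */
}
```

The width in %31[ is what keeps the scan from overflowing the 32-byte key buffer, which is the kind of control gets never offered.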
For large blocks of text (e.g. loading a whole file into memory) or binary data, you can also use the fread and fwrite functions. You should always pass 1 for the size argument and the number of bytes to read/write for the count argument; otherwise it's impossible to tell from the return value how much was successfully read or written.
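A sketch of loading a whole stream this way; the function name slurp and the 4096-byte starting capacity are arbitrary choices for the example. Because the size argument is 1, the return value of fread is exactly the number of bytes read.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read everything from fp into a malloc'd, NUL-terminated buffer.
 * The buffer grows by doubling, so the stream does not need to be seekable.
 * Returns NULL on allocation or read error; the caller frees the result. */
char *slurp(FILE *fp, size_t *out_len) {
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (!buf)
        return NULL;

    for (;;) {
        size_t want = cap - len - 1;            /* keep room for the '\0' */
        size_t got = fread(buf + len, 1, want, fp);
        len += got;
        if (got < want)                         /* short count: EOF or error */
            break;
        char *bigger = realloc(buf, cap * 2);
        if (!bigger) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        cap *= 2;
    }

    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (out_len)
        *out_len = len;
    return buf;
}
```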
If you're on a reasonably POSIX-like system (pretty much anything) you can also use the lower-level io functions open, read, write, etc. These are NOT part of the C standard but part of POSIX, and non-POSIX systems usually provide the same functions but possibly with slightly-different behavior (for example, file descriptors may not be numbered sequentially 0,1,2,... like POSIX would require).
If you're looking for immediate-mode type stuff don't forget about Curses (more applicable on the *NIX side but also available on Windows)

Truncate file at front

A problem I was working on recently got me to wishing that I could lop off the front of a file. Kind of like a “truncate at front,” if you will. Truncating a file at the back end is a common operation–something we do without even thinking much about it. But lopping off the front of a file? Sounds ridiculous at first, but only because we’ve been trained to think that it’s impossible. But a lop operation could be useful in some situations.
A simple example (certainly not the only or necessarily the best example) is a FIFO queue. You’re adding new items to the end of the file and pulling items out of the file from the front. The file grows over time and there’s a huge empty space at the front. With current file systems, there are several ways around this problem:
As each item is removed, copy the remaining items up to replace it, and truncate the file. Although it works, this solution is very expensive time-wise.
Monitor the size of the empty space at the front, and when it reaches a particular size or percentage of the entire file size, move everything up and truncate the file. This is much more efficient than the previous solution, but still costs time when items are moved in the file.
Implement a circular queue in the file, adding new items to the hole at the front of the file as items are removed. This can be quite efficient, especially if you don’t mind the possibility of things getting out of order in the queue. If you do care about order, there’s the potential of having to move items around. But in general, a circular queue is pretty easy to implement and manages disk space well.
But if there was a lop operation, removing an item from the queue would be as easy as updating the beginning-of-file marker. As easy, in fact, as truncating a file. Why, then, is there no such operation?
I understand a bit about file systems implementation, and don't see any particular reason this would be difficult. It looks to me like all it would require is another word (dword, perhaps?) per allocation entry to say where the file starts within the block. With 1 terabyte drives under $100 US, it seems like a pretty small price to pay for such functionality.
What other tasks would be made easier if you could lop off the front of a file as efficiently as you can truncate at the end?
Can you think of any technical reason this function couldn't be added to a modern file system? Other, non-technical reasons?
On file systems that support sparse files "punching" a hole and removing data at an arbitrary file position is very easy. The operating system just has to mark the corresponding blocks as "not allocated". Removing data from the beginning of a file is just a special case of this operation. The main thing that is required is a system call that will implement such an operation: ftruncate2(int fd, off_t offset, size_t count).
On Linux systems this is actually implemented with the fallocate system call by specifying the FALLOC_FL_PUNCH_HOLE flag to zero-out a range and the FALLOC_FL_COLLAPSE_RANGE flag to completely remove the data in that range. Note that there are restrictions on what ranges can be specified and that not all filesystems support these operations.
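A minimal, Linux-only sketch of the hole-punching variant, assuming glibc's fallocate wrapper; the function name punch_front is made up. It frees the disk blocks at the front of the file but leaves the file size and all later offsets unchanged.

```c
#define _GNU_SOURCE   /* exposes fallocate() and the FALLOC_FL_* flags in glibc */
#include <fcntl.h>
#include <sys/types.h>

/* Deallocate the first `len` bytes of an open file.
 * The punched range reads back as zeros afterwards.
 * Returns 0 on success, or -1 with errno set (for example EOPNOTSUPP
 * when the filesystem does not support hole punching). */
int punch_front(int fd, off_t len) {
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, len);
}
```

For the FIFO queue case the consumer would still have to remember where the live data starts, since punching a hole does not shift anything. FALLOC_FL_COLLAPSE_RANGE does shift the remaining data down, subject to the restrictions mentioned above.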
Truncating files at the front does not seem too hard to implement at the system level.
But there are issues.
The first one is at the programming level. When a file is opened for random access, the current paradigm is to use offsets from the beginning of the file to refer to places in it. If we truncate at the beginning of the file (or insert or remove data in the middle), those offsets are no longer stable, whereas appending or truncating at the end is not a problem.
In other words, truncating the beginning would change the only reference point, and that is bad.
At the system level, uses exist as you pointed out, but they are quite rare. I believe most uses of files are of the write-once, read-many kind, so even truncate is not a critical feature and we could probably do without it (some things would become more difficult, but nothing would become impossible).
If we want more complex access patterns (and there are indeed needs), we open files in random-access mode and add some internal data structure. This information can also be shared between several files. This leads us to the last issue I see, probably the most important.
In a sense, when we use random-access files with some internal structure, we are still using files but no longer using the file paradigm. Typical cases are databases, where we want to insert or remove records without caring at all about their physical location. Databases can use files as a low-level implementation, but for optimisation purposes some database vendors choose to bypass the filesystem completely (think about Oracle partitions).
I see no technical reason why we couldn't do everything that is currently done in an operating system with files using a database as the data storage layer. I have even heard that NTFS has a lot in common with databases in its internals. An operating system can (and probably will, in the not so distant future) use a paradigm other than files.
In summary, I believe this is not a technical problem at all, just a change of paradigm: removing the beginning is definitely not part of the current "files paradigm", and it is not a big or useful enough change to compel changing anything at all.
NTFS can do something like this with its sparse file support, but it's generally not that useful.
I think there's a bit of a chicken-and-egg problem in there: because filesystems have not supported this kind of behavior efficiently, people haven't written programs to use it, and because people haven't written programs to use it, there's little incentive for filesystems to support it.
You could always write your own filesystem to do this, or maybe modify an existing one (although filesystems used "in the wild" are probably pretty complicated, you might have an easier time starting from scratch). If people find it useful enough it might catch on ;-)
Actually there are record-based file systems: IBM has one, and I believe DEC VMS also had this facility. I seem to remember both allowed (allow? I guess they are still around) deleting and inserting at random positions in a file.
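There is also a Unix command called tail (the counterpart of head, which keeps only the first lines) -- so you could drop, say, the first 1000 lines by copying the rest to a new file:
tail -n +1001 file > file_truncated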
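On Linux you may be able to achieve this in two steps: copy the part you want to keep to the start of the file with sendfile, then truncate. This still copies the retained bytes.
// needs <sys/sendfile.h> and <unistd.h>
long fileLength; // total length of the file
long reserveLength; // number of bytes to keep, counted back from the end of the file
int fd; // file opened for reading and writing
off_t src = fileLength - reserveLength; // sendfile takes a pointer to the read offset
lseek(fd, 0, SEEK_SET); // sendfile writes at the current offset of the output descriptor
sendfile(fd, fd, &src, reserveLength); // may copy less than requested: check the return value and loop
ftruncate(fd, reserveLength);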

Resources