I'm writing the front-end part of an interpreter, and I initially disliked the idea of just dumping all the source files into memory and then referencing that text directly. So the tokenizer reads from a char buffer and builds the token stream.
However, I have reached the parsing side of things, and it hit me that I want to output nice errors and warnings that show the malformed line of source code, which means I need access to that text somehow. I guess I could put column numbers in the tokens, but then my error messages would be like getting directions by telephone: "It's in file X, on line Y, column Z, right next to the curly brace, you know the one. If you hit the semicolon, you've gone too far."
I seem to have put myself into a situation where I want to have my cake and eat it too. I want nice messages, but I don't want to hog memory.
Is there something I'm missing? Or is loading the source into memory the way to go?
When there's an error to report to the user, it hardly matters how long in milliseconds it takes to report it.
I'd keep your tokenized stream in memory to keep your interpreter fast. (Actually, you should switch to a threaded interpreter or even a crude one-pass compiler to improve execution speed.)
When you encounter an error, go to the disk, fetch the line(s) of interest, and show them to the user. If he doesn't make any errors, this costs you zero. If he makes a small number of errors, that may be a tiny bit inefficient, but the user won't notice. If he makes a large number of errors, the contents of the files containing errors will be read by the OS into its file cache, which is likely bigger than your program's memory anyway, so access will be faster than if the lines were actually coming from the disk each time.
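A minimal sketch of that fetch-on-error step, assuming a token carries only a file name and a 1-based line number (the function name here is made up for illustration):

#include <stdio.h>
#include <string.h>

/* Print line `lineNo` (1-based) of `path` as part of an error report.
   The file is only touched when an error is actually being reported. */
static void show_source_line(const char *path, unsigned lineNo) {
    FILE *f = fopen(path, "r");
    if (!f) return;                       /* file vanished; just skip the snippet */
    char buf[512];
    unsigned current = 1;
    while (fgets(buf, sizeof buf, f)) {
        if (current == lineNo) {
            fprintf(stderr, "  %s", buf); /* shows at most the first 511 chars */
            break;
        }
        if (strchr(buf, '\n'))            /* only a completed line advances the count */
            current++;
    }
    fclose(f);
}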
Better idea: mmap your sources in the first place, if you can. Fall back to slurping the whole file if you're reading from a pipe or something.
After parsing, you may want to call madvise(MADV_DONTNEED) (but only if the source was originally mmapped) to advise the kernel to drop it from the cache while still keeping it available for errors ... but this is probably not necessary, and may even be a bad idea, depending on your compiler design (e.g., do identifiers still point into the buffer, or are they interned into a single, separate allocation?).
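A sketch of the mmap-with-fallback idea on a POSIX system (error handling is mostly omitted; the slurp path is what you'd use for a pipe or anything else mmap can't handle):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns a pointer to the source text and its length: mmaps a regular file,
   otherwise falls back to reading everything into a malloc'd buffer. */
static const char *load_source(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); exit(1); }

    struct stat st;
    if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0) {
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            close(fd);                   /* the mapping stays valid after close */
            *len = (size_t)st.st_size;
            return p;
        }
    }

    /* Fallback: slurp the whole thing (pipes, odd files, failed mmap). */
    FILE *f = fdopen(fd, "r");
    size_t cap = 1 << 16, n = 0, r;
    char *buf = malloc(cap);
    while ((r = fread(buf + n, 1, cap - n, f)) > 0) {
        n += r;
        if (n == cap) buf = realloc(buf, cap *= 2);
    }
    fclose(f);
    *len = n;
    return buf;
}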
Related
I know this is a general question.
I'm going to program a compiler and I was wondering if it's better to take the tokens of the language while reading the file (i.e., first open the file, then extract tokens while reading, and finally close the file) or read the file first, close it and then work with the data in a variable. The pseudo-code for this would be something like:
file = open(filename);
textVariable = read(file);
close(file);
getTokens(textVariable);
The first option would be something like:
file = open(filename);
readWhileGeneratingTokens(file);
close(file);
I guess the first option looks better, since there isn't any additional cost in terms of main memory. However, I think there might be some benefit to the second option, since it minimizes the time the file is kept open.
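For concreteness, the two options might look roughly like this in C (names are illustrative only):

#include <stdio.h>
#include <stdlib.h>

/* Option 1: tokenize while reading -- the scanner pulls characters on demand. */
void tokenize_stream(FILE *file) {
    int c;
    while ((c = fgetc(file)) != EOF) {
        /* feed c to the tokenizer's state machine */
    }
}

/* Option 2: read everything first, then tokenize the in-memory text. */
char *slurp(FILE *file, size_t *len) {
    fseek(file, 0, SEEK_END);
    long size = ftell(file);
    rewind(file);
    char *text = malloc(size + 1);
    *len = fread(text, 1, size, file);
    text[*len] = '\0';          /* NUL-terminate so the tokenizer can scan it */
    return text;
}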
I can't provide any hard data, but generally the amount of time a compiler spends tokenizing source code is rather small compared to the amount of time spent optimizing/generating target code. Because of this, wanting to minimize the amount of time the source file is open seems premature. Additionally, reading the entire source file into memory before tokenizing would prevent any sort of line-by-line execution (think interpreted language) or reading input from a non-file location (think of a stream like stdin). I think it is safe to say that the overhead in reading the entire source file into memory is not worth the computer's resources and will ultimately be detrimental to your project.
Compilers are carefully designed to be able to proceed on as little as one character at a time from the input. They don't read entire files prior to processing, or rather they have no need to do so: that would just add pointless latency. They don't even need to read entire lines before processing.
I have written a C/C++ program for Windows 7 64-bit that works on very large files. In the final step it reads lines from an input file (10 GB+) and writes them to an output file. Access to the input file is random; the writing is sequential.
EDIT: The main reason for this approach is to reduce RAM usage.
What I basically do in the reading part is this: (Sorry, very shortened and maybe buggy)
void seekAndGetLine(char* line, size_t lineSize, off64_t pos, FILE* filePointer){
    fseeko64(filePointer, pos, SEEK_SET);   /* SEEK_SET, not ios_base::beg */
    fgets(line, (int)lineSize, filePointer);
}
Normally this code is fine, not to say fast, but under some very special conditions it gets very slow. The behaviour doesn't seem to be deterministic, since the performance drops occur on different machines, at different parts of the file, or sometimes not at all. It even goes so far that the program stops reading entirely, even though there are no disk operations in progress.
Another symptom seems to be the RAM used. My process keeps its RAM usage steady, but the RAM used by the system sometimes grows very large. After using some RAM tools I found out that the Windows "Mapped File" grows to several GB. This behaviour also seems to depend on the hardware, since it occurs on different machines at different parts of the process.
As far as I can tell, this problem doesn't exist on SSDs, so it definitely has something to do with the response time of the HDD.
My guess is that the Windows caching somehow gets "weird". The program is fast as long as the cache does its work. But when the caching goes wrong, the behaviour turns into either "stop reading" or "grow cache size", and sometimes both. Since I'm no expert on the Windows caching algorithms, I would be happy to hear an explanation. Also, is there any way to manipulate/stop/enforce the Windows caching from C/C++?
Since I've been hunting this problem for a while now, I've already tried some tricks that didn't work out:
filePointer = fopen(fileName, "rbR"); //Just fills the cache till the RAM is full
massive buffering of the reads/writes, to stop the two from getting in each other's way
Thanks in advance
Truly random access across a huge file is the worst possible case for any cache algorithm. It may be best to turn off as much caching as possible.
There are multiple levels of caching:
the CRT library (since you're using the f- functions)
the OS and filesystem
probably onboard the drive itself
If you replace your I/O calls through the CRT's f- functions with the comparable Windows API calls (e.g., CreateFile, ReadFile), you can eliminate the CRT's caching, which may be doing more harm than good. You can also tell the OS that you're going to be doing random accesses, which affects its caching strategy. See options like FILE_FLAG_RANDOM_ACCESS and possibly FILE_FLAG_NO_BUFFERING.
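A rough sketch of what that replacement might look like (the wrapper names are made up; FILE_FLAG_NO_BUFFERING is left out because it requires sector-aligned offsets and buffer sizes):

#include <windows.h>

/* Counterpart of seekAndGetLine using the Windows API directly; note that
   ReadFile reads a fixed number of bytes, it does not stop at a newline. */
HANDLE openForRandomAccess(const char *fileName) {
    return CreateFileA(fileName, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_FLAG_RANDOM_ACCESS, NULL);
}

BOOL readAt(HANDLE hFile, void *buffer, DWORD size, LONGLONG pos) {
    LARGE_INTEGER li;
    li.QuadPart = pos;
    if (!SetFilePointerEx(hFile, li, NULL, FILE_BEGIN))
        return FALSE;
    DWORD bytesRead = 0;
    return ReadFile(hFile, buffer, size, &bytesRead, NULL) && bytesRead > 0;
}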
You'll need to experiment and measure.
You might also have to reconsider how your algorithm works. Are the seeks truly random? Can you re-sequence them, perhaps in batches, so that they're in order? Can you limit access to a relatively small region of the file at a time? Can you break the huge file into smaller files and then work with one piece at a time? Have you checked the level of fragmentation on the drive and on the particular file?
Depending on the larger picture of what your application does, you could possibly take a different approach - maybe something like this:
decide which lines you need from the input file and store the line numbers in a list
sort the list of line numbers
read through the input file once, in order, and pull out the lines you need (better yet, seek to the next line and grab it, especially when there are big gaps)
if the list of lines you're grabbing is small enough, you can store them in memory for reordering before output; otherwise, stick them in a smaller temporary file and use that file as input for your current algorithm to reorder the lines for final output
It's definitely a more complex approach, but it would be much kinder to your caching subsystem, and as a result, could potentially perform significantly better.
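A sketch of that sort-then-sequential-pass idea (names are illustrative; it assumes the line numbers fit in memory and that lines are shorter than the buffer):

#include <stdio.h>
#include <stdlib.h>

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Writes the requested lines of `in` to `out` in file order.
   wanted[] holds 1-based line numbers; it is sorted in place. */
void extract_lines(FILE *in, FILE *out, long *wanted, size_t count) {
    qsort(wanted, count, sizeof *wanted, cmp_long);

    char buf[1 << 16];
    long currentLine = 0;
    size_t next = 0;
    while (next < count && fgets(buf, sizeof buf, in)) {
        currentLine++;
        while (next < count && wanted[next] == currentLine) {
            fputs(buf, out);   /* duplicates in wanted[] emit the line again */
            next++;
        }
    }
}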
Is there a way to read a file's data but continue reading the data on the hard drive past the end of file? For normal file I/O I could just use fread(), but, obviously, that will only read to the end of the file. And it might be beneficial if I add that I need this on a Windows computer.
All my Googling for a way to do this is instead coming up with results about unrelated topics concerning EOF, such as people having problems with normal I/O.
My reasoning for this is that I accidentally deleted part of the text in a text file I was working on, and it was an entire day's worth of work. I Googled a bunch of file recovery stuff, but it all seems to be about recovering deleted files, whereas my problem is that the file is still there but missing some of its content. I'm hoping some of that data still exists directly after the currently marked end of file, and is neither fragmented elsewhere nor already claimed or otherwise overwritten. Since I can't find a program that helps with this specifically, I'm hoping I can quickly put something together for it (I understand that, depending on what's involved, this might not be as feasible as just redoing the work, but I'm hoping that's not the case).
As far as I can foresee, though I might not be correct (not sure, which is why I'm asking for help), there are 3 possibilities.
Worst of the three: I have to look up Windows API functions that allow direct access to the entire hard drive (similar to its functions for raw memory access, perhaps? those I have experience with), scan the whole drive for the data I can still see in the file, and then keep looking at whatever comes after it.
Second: I can get some kind of handle to the file; I'd still need raw access to the hard drive, but at least I'd have a starting point for where the file sits on it?
Best of the three: just open the file for write access, seek to the end, then write a ways past EOF to claim more space, and hope that Windows doesn't clean the newly claimed space before handing it to me, so that the "garbage" I get back is the previous data in that spot, which is exactly what I'm looking for. This would be awesome if it were that simple, but I'm afraid to test it because I'd lose the data if it failed, so hopefully someone else already knows. The PC in question is running Vista Home Premium, if that matters to anyone who knows the gory details of Windows.
Do any of those three seem plausible? Either way, I'm also open (and eager) to other suggestions, especially ones better than my silly ideas, and especially if they come with pointers to the specific functions needed to get the job done.
Also, if anyone else actually has heard of a recovery program that doesn't just recover deleted files but which would actually work for a situation like this, and which is free and trustworthy, that works too.
Thanks in advance for any assistance.
You should get a utility for scanning the free space of a hard drive and recovering data from it, for example PhotoRec or foremost. Note however that if you've been using the machine much at all (even web browsing, which will create files in your cache), the data has likely already been overwritten. Do not save your recovery tools on the same hard drive, or even use the same PC to download them; get them from another computer and save them to a USB device, then run them from that device.
As for the conceptual content of your question, files are abstract objects. There is no such thing as data "past EOF", except (depending on the implementation) perhaps up to the next multiple of the filesystem/disk block size. Also, it's possible (very likely, in fact) that your editor "saved" the file by truncating it and rewriting everything from the beginning, meaning there's not necessarily any correspondence between the old and new storage.
Your question doesn't make a lot of sense -- by definition there is nothing in the file after the EOF. From your further description, it appears that you want to read whatever happens to be on the disk after the last byte used by the file, which might be random garbage (unused space) or might be part of some other file. But in either case, this isn't "data after the EOF", it's just data on the disk that's not part of the file. It's even possible that it's some other part of the same file, if the filesystem happens to lay out its data that way -- some filesystems scatter blocks in seemingly random ways across the disk, and figuring out which bytes belong to which files requires understanding the filesystem metadata.
I'm writing a little parser and I would like to know the advantages and disadvantages of the different ways to load the data to be parsed. The two ways that I thought of are:
Load the file's contents into a string then parse the string (access the character at an array position)
Parse as reading the file stream (fgetc)
The former would let me have two functions, parse_from_file and parse_from_string; however, I believe this approach will take up more memory. The latter doesn't have that disadvantage.
Does anyone have any advice on the matter?
Reading the entire file in or memory mapping it will be faster, but may cause issues if you want your language to be able to #include other files as these would be memory mapped or read into memory as well.
The stdio functions would work well because they usually buffer data for you. But they are also general-purpose, so they try to watch for usage patterns that differ from reading a file from start to finish; still, that shouldn't be too much overhead.
A good balance is to have a large circular buffer (x * 2 * 4096 is a good size) which you load with file data and then have your tokenizer read from. Whenever a block's worth of data has been passed to your tokenizer (and you know that it is not going to be pushed back) you can refill that block with new data from the file and update some buffer location info.
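A bare-bones sketch of that scheme (all names are made up for illustration; there is no push-back support, and the block just finished is refilled immediately, which is only safe if the tokenizer never backs up across a block boundary):

#include <stdio.h>

#define BLOCK_SIZE 4096
#define NUM_BLOCKS 8                     /* buffer holds NUM_BLOCKS * BLOCK_SIZE bytes */

typedef struct {
    FILE  *file;
    char   data[NUM_BLOCKS * BLOCK_SIZE];
    size_t valid[NUM_BLOCKS];            /* bytes actually loaded into each block */
    size_t pos;                          /* absolute position of the next character */
} RingReader;

static void fill_block(RingReader *r, size_t block) {
    r->valid[block] = fread(r->data + block * BLOCK_SIZE, 1, BLOCK_SIZE, r->file);
}

static void ring_init(RingReader *r, FILE *f) {
    r->file = f;
    r->pos = 0;
    for (size_t b = 0; b < NUM_BLOCKS; b++)
        fill_block(r, b);                /* prime the whole buffer */
}

/* Hand the tokenizer one character at a time; refill a block with data from
   further along in the file as soon as the read position leaves it. */
static int next_char(RingReader *r) {
    size_t block  = (r->pos / BLOCK_SIZE) % NUM_BLOCKS;
    size_t offset = r->pos % BLOCK_SIZE;
    if (offset >= r->valid[block])
        return EOF;
    int c = (unsigned char)r->data[block * BLOCK_SIZE + offset];
    if (++r->pos % BLOCK_SIZE == 0)      /* just consumed the last byte of this block */
        fill_block(r, block);
    return c;
}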
Another thing to consider is if there is any chance that the tokenizer would ever need to be able to be used to read from a pipe or from a person typing directly in some text. In these cases your reads may return less data than you asked for without it being at the end of the file, and the buffering method I mentioned above gets more complicated. The stdio buffering is good for this as it can easily be switched to/from line or block buffering (or no buffering).
Using flex (the fast lexical analyzer generator, not the Adobe Flash thing) or something similar can greatly ease the trouble with all of this. You should look into using it to generate the C code for your tokenizer (lexical analysis).
Whatever you do, try to make it so that your code can easily be switched to a different implementation of the next-character peek and consume functions, so that if you change your mind you won't have to start over.
Consider using lex (and perhaps yacc, if the language of your grammar matches its capabilities). Lex will handle all the fiddly details of lexical analysis for you and produce efficient code. You can probably beat its memory footprint by a few bytes, but how much effort do you want to expend into that?
The most efficient approach on a POSIX system would probably be neither of the two (or a variant of the first, if you like): just map the file read-only with mmap and parse it from there. Modern systems are quite efficient with that: they prefetch data when they detect streaming access, multiple instances of your program parsing the same file share the same physical pages of memory, and so on. And the interfaces are relatively simple to handle, I think.
A problem I was working on recently got me to wishing that I could lop off the front of a file. Kind of like a “truncate at front,” if you will. Truncating a file at the back end is a common operation–something we do without even thinking much about it. But lopping off the front of a file? Sounds ridiculous at first, but only because we’ve been trained to think that it’s impossible. But a lop operation could be useful in some situations.
A simple example (certainly not the only or necessarily the best example) is a FIFO queue. You’re adding new items to the end of the file and pulling items out of the file from the front. The file grows over time and there’s a huge empty space at the front. With current file systems, there are several ways around this problem:
As each item is removed, copy the remaining items up to replace it, and truncate the file. Although it works, this solution is very expensive time-wise.
Monitor the size of the empty space at the front, and when it reaches a particular size or percentage of the entire file size, move everything up and truncate the file. This is much more efficient than the previous solution, but still costs time when items are moved in the file.
Implement a circular queue in the file, adding new items to the hole at the front of the file as items are removed. This can be quite efficient, especially if you don't mind the possibility of things getting out of order in the queue. If you do care about order, there's the potential of having to move items around. But in general, a circular queue is pretty easy to implement and manages disk space well.
But if there was a lop operation, removing an item from the queue would be as easy as updating the beginning-of-file marker. As easy, in fact, as truncating a file. Why, then, is there no such operation?
I understand a bit about file system implementation, and I don't see any particular reason this would be difficult. It looks to me like all it would require is another word (a dword, perhaps?) per allocation entry to say where the file starts within the block. With 1-terabyte drives under $100 US, that seems like a pretty small price to pay for such functionality.
What other tasks would be made easier if you could lop off the front of a file as efficiently as you can truncate at the end?
Can you think of any technical reason this function couldn't be added to a modern file system? Other, non-technical reasons?
On file systems that support sparse files "punching" a hole and removing data at an arbitrary file position is very easy. The operating system just has to mark the corresponding blocks as "not allocated". Removing data from the beginning of a file is just a special case of this operation. The main thing that is required is a system call that will implement such an operation: ftruncate2(int fd, off_t offset, size_t count).
On Linux systems this is actually implemented with the fallocate system call by specifying the FALLOC_FL_PUNCH_HOLE flag to zero-out a range and the FALLOC_FL_COLLAPSE_RANGE flag to completely remove the data in that range. Note that there are restrictions on what ranges can be specified and that not all filesystems support these operations.
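As a rough illustration (the offset and length must be multiples of the filesystem block size, the filesystem must support it, e.g. ext4 or XFS on reasonably recent kernels, and on older systems FALLOC_FL_COLLAPSE_RANGE may live in <linux/falloc.h>):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Remove the first `len` bytes of the file; the remaining data slides down
   to offset 0. `len` must be a multiple of the filesystem block size. */
int lop_front(const char *path, off_t len) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    int rc = fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len);
    if (rc != 0) perror("fallocate");
    close(fd);
    return rc;
}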
Truncating files at the front doesn't seem too hard to implement at the system level.
But there are issues.
The first one is at the programming level. When opening a file for random access, the current paradigm is to use offsets from the beginning of the file to point to different places in it. If we truncate at the beginning of the file (or insert or remove data in the middle of the file), those offsets are no longer stable. (Appending to or truncating from the end is not a problem.)
In other words truncating the beginning would change the only reference point and that is bad.
At the system level, uses exist, as you pointed out, but they are quite rare. I believe most uses of files are of the write-once-read-many kind, so even truncation is not a critical feature and we could probably do without it (some things would become more difficult, but nothing would become impossible).
If we want more complex access patterns (and there are real needs for them), we open files in random mode and add some internal data structure. That structure can also be shared between several files. This leads us to the last issue I see, probably the most important one.
In a sense, when we use random-access files with some internal structure, we are still using files, but we are no longer using the file paradigm. The typical case is a database, where we want to insert or remove records without caring at all about their physical location. Databases can use files as the low-level implementation, but for performance some database vendors choose to bypass the filesystem completely (think of Oracle raw partitions).
I see no technical reason why we couldn't do everything that is currently done with files in an operating system using a database as the data storage layer. I've even heard that NTFS has a lot in common with databases in its internals. An operating system can (and probably will, in the not-so-distant future) use a paradigm other than files.
In summary, I believe this is not a technical problem at all, just a change of paradigm: removing the beginning of a file is simply not part of the current file paradigm, and it isn't a big or useful enough change to compel changing anything.
NTFS can do something like this with its sparse file support, but it's generally not that useful.
I think there's a bit of a chicken-and-egg problem in there: because filesystems have not supported this kind of behavior efficiently, people haven't written programs to use it, and because people haven't written programs to use it, there's little incentive for filesystems to support it.
You could always write your own filesystem to do this, or maybe modify an existing one (although filesystems used "in the wild" are probably pretty complicated, you might have an easier time starting from scratch). If people find it useful enough it might catch on ;-)
Actually, there are record-based file systems -- IBM has one, and I believe DEC VMS also had this facility. I seem to remember both allowed (allow? I guess they are still around) deleting and inserting records at arbitrary positions in a file.
There is also the unix tail command, so you could drop, say, the first 1000 lines by writing the rest to a new file:
tail -n +1001 file > file_truncated
You may be able to achieve this goal in two steps (note that sendfile takes a pointer for the source offset, copying a file onto itself like this needs Linux 2.6.33 or later, and a real implementation would loop until all reserveLength bytes are copied):
off_t fileLength;    /* total file length */
off_t reserveLength; /* length to keep, measured from the end of the file */
int fd;              /* file opened for read & write */

off_t srcOffset = fileLength - reserveLength;
lseek(fd, 0, SEEK_SET);                      /* write position: start of file */
sendfile(fd, fd, &srcOffset, reserveLength); /* copy the tail to the front */
ftruncate(fd, reserveLength);                /* cut off the leftover tail */