Can I create a FILE instance from byte[] data without writing a file? - C

Can I create a FILE instance (FILE*) from byte[] data that is already in memory, without writing a file?
(C, Linux)
I need to parse 'MiniSEED' format data with the official MiniSEED library.
The library supports parsing 'MiniSEED' packet data that has been written to a file, but I need to parse the 'MiniSEED' data from a byte[] array directly, without creating a real file.
(I receive the 'MiniSEED' data continuously from real-time TCP packets, and the library only seems to offer a way to parse data from a written file.)
So I am trying to solve the problem by creating a FILE instance directly from the byte[] data.
I think this is the easiest solution, and the best way to do it without changing the library.

You can create a FILE handle from in-memory data in Linux, because the Linux C libraries do support fmemopen() from POSIX.1-2008.
Calling fmemopen(buffer, size, "r") yields a read-only FILE handle to an in-memory object containing size bytes at buffer.
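For example, a minimal sketch of wrapping a received byte buffer in a read-only stream (the buffer contents here are just placeholders):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

int main(void)
{
    /* Pretend these bytes arrived over TCP. */
    unsigned char packet[] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05 };

    /* Wrap the buffer in a read-only FILE stream; no file is created. */
    FILE *fp = fmemopen(packet, sizeof packet, "r");
    if (fp == NULL)
        return 1;

    unsigned char copy[sizeof packet];
    size_t n = fread(copy, 1, sizeof copy, fp);
    printf("read %zu bytes from the in-memory stream\n", n);

    fclose(fp);
    return 0;
}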
However, I don't understand why you'd need such a thing.
The official Mini-SEED library does provide function msr_unpack() (and msr_unpack_data()) to parse Mini-SEED data records.
The functions you are probably looking at, ms_readmsr() and ms_readtraces() (or their thread-safe variants ms_readmsr_r() and ms_readtraces_r()), just read each record from the file and pass it to msr_unpack() (and, in the case of traces, to mst_addmsrtogroup() or mstl_addmsr()).
In other words, the library does support parsing in-memory data. Your assertion that it only supports parsing files is clearly incorrect.
The man pages describing the library functions do not seem to be available on the net, but if you download libmseed sources, you can read the library function man pages using man -l libmseed/doc/[function].3.

As a compromise, you might use mmap to create a direct mapping between the memory and the file. This will allow you to update the contents directly (by accessing the memory) and the library may access the same data through the file interface. Under Unix systems, depending upon the size of the data, the file may not actually need to be written to disk. It may reside in the kernel's cache structure for faster access (this happens by default: nothing extra you need to do).
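A rough sketch of that approach, assuming a scratch file on a tmpfs mount such as /dev/shm (the path and size here are placeholders): size the file with ftruncate(), map it with MAP_SHARED, copy the received bytes into the mapping, and let the library open the same path.

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 4096;   /* placeholder record size */

    /* tmpfs-backed, so the data stays in RAM. */
    int fd = open("/dev/shm/scratch.mseed", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 1;
    if (ftruncate(fd, (off_t)size) != 0)
        return 1;

    /* Changes made through this pointer are visible through the file too. */
    unsigned char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;

    memset(buf, 0, size);       /* copy your received bytes here instead */

    /* A file-based library can now open "/dev/shm/scratch.mseed". */

    munmap(buf, size);
    close(fd);
    return 0;
}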

No, there's no portable, standard way of creating a FILE * that represents an in-memory stream of bytes.
The typical solution is to instead make the read and write function(s) hookable, so that instead of hard-coding e.g. read() you make the library call an (optionally) application-supplied function.
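A sketch of what such a hook might look like (all of the names here are invented for illustration, not taken from any real library): the library calls through a function pointer that defaults to read(), and the application can point it at an in-memory reader instead.

#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical hook type: same shape as read(2). */
typedef ssize_t (*read_hook_fn)(void *ctx, void *buf, size_t len);

/* Default hook: pull bytes from a real file descriptor. */
static ssize_t read_from_fd(void *ctx, void *buf, size_t len)
{
    return read(*(int *)ctx, buf, len);
}

/* Application-supplied hook: pull bytes from an in-memory buffer. */
struct membuf { const unsigned char *data; size_t size, pos; };

static ssize_t read_from_mem(void *ctx, void *buf, size_t len)
{
    struct membuf *m = ctx;
    size_t avail = m->size - m->pos;
    if (len > avail)
        len = avail;
    memcpy(buf, m->data + m->pos, len);
    m->pos += len;
    return (ssize_t)len;
}

/* A hypothetical library entry point simply calls whatever hook it was given. */
static ssize_t next_chunk(read_hook_fn hook, void *ctx, void *buf, size_t len)
{
    return hook(ctx, buf, len);
}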

Related

What is the purpose of using memory stream in the C standard library?

In the C standard library, what is the purpose of using a memory stream (as created for an array via fmemopen())? How does it compare to manipulating the array directly?
This is very similar to using std::stringstream in C++, which allows you to write to a string (including '\0' characters) and then use the string the way you'd like.
The idea is that we have many functions at our disposal, such as fprintf(), which can write data to a stream in a formatted way. All of those functions can be used with a memory-based file without any change other than swapping the fopen() call for fmemopen().
So if you want to build a string that requires many fprintf() calls, generating it in memory this way is extremely useful. snprintf() could also be used if you just need one quick conversion.
Similarly, you can of course use fread() and fwrite() and the like. If you need to create a file that requires a lot of seeking and is small enough to fit comfortably in memory, this will be a lot faster. Once done, you can save the result to disk.
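A small sketch of that pattern, building formatted text in a fixed buffer via fmemopen() (the buffer size is arbitrary):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

int main(void)
{
    char buf[256];

    /* "w" mode writes into buf; a terminating NUL is added on fflush()/fclose(). */
    FILE *mem = fmemopen(buf, sizeof buf, "w");
    if (mem == NULL)
        return 1;

    fprintf(mem, "x = %d, y = %.2f\n", 42, 3.14);
    fprintf(mem, "name = %s\n", "example");
    fclose(mem);                /* buf now holds the formatted text */

    fputs(buf, stdout);
    return 0;
}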

Virtualizing fopen with some malloc-ed memory instead of using a file

I have a piece of code using a FILE* file with a fwrite:
test = fwrite(&object,sizeof(object),1,file);
I want to serialize an internal data structure that has a custom indexing structure (so I am using neither Google's Protobuf nor Cap'n Proto, since this is a custom data structure with specific indexing requirements). Inside my project I want to use Google Test to test the serialization, checking that whatever has been serialized can also be deserialized and retrieved easily. In the testing phase, I want to pass fwrite a FILE* that is not a real file but a handle to some allocated main memory, so that no file is produced and I can check main memory directly for the results of the serialization. Is it possible to virtualize the FILE* and write directly into main memory? I would like to keep fwrite for writing the data structures for performance reasons, without being forced to write two different serialization methods (sometimes I'm writing on the fly, with no extra memory used for transcoding). Thanks in advance.
One way is to create a dynamic library with all those fopen/fwrite functions (which would do something special for your magic filename and fall back to the original functions otherwise) and load it with LD_PRELOAD. To fall back to the originals, resolve them with dlsym() and RTLD_NEXT.
Another way is to include a special header at the top of the library/test with a statement like #define fopen my_fopen. Inside the file with the implementation of my_fopen you need to put #undef fopen before including the original stdio.h. This approach will only work for the source files that include the header; it will not hook the functions for the binary libraries that you link against.
fopencookie did the job I was looking for.
http://man7.org/linux/man-pages/man3/fopencookie.3.html
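For reference, a minimal sketch of fopencookie() with a growable memory buffer as the cookie (only the write hook is shown; a real version would also supply read, seek, and close functions as needed):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct membuf { char *data; size_t size; };

/* Write hook: append the bytes handed over by stdio to the in-memory buffer. */
static ssize_t mem_write(void *cookie, const char *buf, size_t len)
{
    struct membuf *m = cookie;
    char *p = realloc(m->data, m->size + len);
    if (p == NULL)
        return -1;
    memcpy(p + m->size, buf, len);
    m->data = p;
    m->size += len;
    return (ssize_t)len;
}

int main(void)
{
    struct membuf m = { NULL, 0 };
    cookie_io_functions_t io = { .write = mem_write };

    FILE *fp = fopencookie(&m, "w", io);
    if (fp == NULL)
        return 1;

    int object = 42;
    fwrite(&object, sizeof object, 1, fp);   /* ends up in m.data, not on disk */
    fclose(fp);                              /* flushes stdio's buffer into the hook */

    printf("captured %zu bytes in memory\n", m.size);
    free(m.data);
    return 0;
}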

Under what circumstances will fseek/ftell or fstat fail to get the size of a file?

I'm trying to access a file as a char array, either by memory-mapping it or by copying it into a buffer, but both approaches need the size of the file. Easy enough, I thought: just use fseek(file, 0, SEEK_END) followed by ftell().
However, according to C++ Reference, "Library implementations [of fseek] are allowed to not meaningfully support SEEK_END," meaning that I can't reliably get the size of a file using that method.
Next I tried fstat, which is less portable, but at least will produce a compile error rather than a runtime problem; however, The Open Group notes that fstat does not need to provide a meaningful value for st_size.
So: has anyone actually come across a system where these methods do not work?
The notes about files not having valid sizes reported are there because, in Linux, there are many "files" for which "file size" is not a meaningful concept.
There are two main cases:
The file is not a regular file. In particular, pipes, sockets, and character device files are streams of data where data is consumed on read, and not put on disk, so a size does not make much sense.
The file system that the file resides on does not provide the file size. This is especially common in "virtual" filesystems, where the file contents are generated when read and, again, have no disk backing.
To expand, filesystems do not necessarily keep file contents on disk. Since the filesystem API is a convenient API for expressing hierarchical data, and there are many tools for operating on files, it sometimes makes sense to expose data as a file hierarchy. For example, /proc/ contains information about processes (such as open files and used memory) and /sys/ contains driver-specific information and options (anything from sensor sampling rates to LED colors). With FUSE (Filesystem in Userspace), you can program a filesystem to do pretty much anything, from SSHing into a remote computer to exposing Twitter as a filesystem.
For a lot of these filesystems, "file size" may not make much sense. For example, an LED driver might expose three files: red, green, and blue. They can be read to get the current color or written to change the color. Now, is it really worth implementing a file size for them, given that they are merely settings in RAM, have no disk backing, and can't be removed? Not really.
In summary, files are not necessarily "things on disk". For many of the more advanced usages of files, "file size" either does not make sense or is not worth providing.
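In practice this means checking what kind of file you have before trusting st_size; a sketch, with minimal error handling and an arbitrary example path:

#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 and stores the size if path is a regular file, 0 otherwise. */
static int regular_file_size(const char *path, off_t *size_out)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;
    if (!S_ISREG(st.st_mode))
        return 0;              /* pipe, socket, device, /proc entry, ... */
    *size_out = st.st_size;
    return 1;
}

int main(void)
{
    off_t size;
    if (regular_file_size("/etc/hostname", &size))
        printf("%lld bytes\n", (long long)size);
    else
        puts("size is not meaningful for this file");
    return 0;
}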

Parsing: load into memory or use stream

I'm writing a little parser and I would like to know the advantages and disadvantages of the different ways to load the data to be parsed. The two ways that I thought of are:
Load the file's contents into a string then parse the string (access the character at an array position)
Parse as reading the file stream (fgetc)
The former would allow me to have two functions, parse_from_file and parse_from_string; however, I believe this approach will use more memory. The latter does not have that disadvantage.
Does anyone have any advice on the matter?
Reading the entire file in, or memory-mapping it, will be faster, but it may cause issues if you want your language to be able to #include other files, as these would be memory-mapped or read into memory as well.
The stdio functions would work well because they usually buffer data for you, but they are also general purpose, so they try to accommodate usage patterns that differ from reading a file from start to finish; still, that shouldn't be too much overhead.
A good balance is to have a large circular buffer (x * 2 * 4096 is a good size) which you load with file data and then have your tokenizer read from. Whenever a block's worth of data has been passed to your tokenizer (and you know that it is not going to be pushed back) you can refill that block with new data from the file and update some buffer location info.
Another thing to consider is if there is any chance that the tokenizer would ever need to be able to be used to read from a pipe or from a person typing directly in some text. In these cases your reads may return less data than you asked for without it being at the end of the file, and the buffering method I mentioned above gets more complicated. The stdio buffering is good for this as it can easily be switched to/from line or block buffering (or no buffering).
Using flex (the fast lexical analyzer generator, not the Adobe Flash thing) or something similar can greatly ease the trouble with all of this. You should look into using it to generate the C code for your tokenizer (lexical analysis).
Whatever you do you should try to make it so that your code can easily be changed to use a different form of next character peek and consume functions so that if you change your mind you won't have to start over.
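For that last point, here is a sketch of what a swappable peek/consume interface over stdio might look like (the names are illustrative only):

#include <stdio.h>

/* Minimal character-source abstraction for the tokenizer to read from. */
typedef struct {
    FILE *fp;
    int lookahead;   /* next character, or EOF */
} CharSource;

static void cs_init(CharSource *cs, FILE *fp)
{
    cs->fp = fp;
    cs->lookahead = fgetc(fp);
}

/* Look at the next character without consuming it. */
static int cs_peek(const CharSource *cs)
{
    return cs->lookahead;
}

/* Consume the current character and advance. */
static int cs_next(CharSource *cs)
{
    int c = cs->lookahead;
    cs->lookahead = fgetc(cs->fp);
    return c;
}

If you later switch to reading from a string or a memory-mapped buffer, only these three functions need to change.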
Consider using lex (and perhaps yacc, if the language of your grammar matches its capabilities). Lex will handle all the fiddly details of lexical analysis for you and produce efficient code. You can probably beat its memory footprint by a few bytes, but how much effort do you want to expend into that?
On a POSIX system, the most efficient approach would probably be neither of the two (or a variant of the first, if you like): just map the file read-only with mmap and parse it from there. Modern systems are quite efficient with that, in that they prefetch data when they detect streaming access, multiple instances of your program parsing the same file get the same physical pages of memory, and so on. And the interfaces are relatively simple to handle, I think.
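A sketch of that approach, with a placeholder file name and minimal error handling (note that mmap() fails for an empty file):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("input.txt", O_RDONLY);   /* placeholder file name */
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) != 0)
        return 1;

    /* Map the whole file read-only; it then behaves like a char array. */
    const char *text = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_PRIVATE, fd, 0);
    if (text == MAP_FAILED)
        return 1;

    size_t lines = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* trivial "parse": count lines */
        if (text[i] == '\n')
            lines++;
    printf("%zu lines\n", lines);

    munmap((void *)text, (size_t)st.st_size);
    close(fd);
    return 0;
}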

Writing more to a file than just plain text

I have always been able to read and write basic text files in C++, but so far no one has discussed much more than that.
My question is this:
If I develop a file type myself for use by an application I also create, how would I go about writing the data to a file while preserving the layout, formatting, etc.? Are there any standards, or does it just depend on the creativity of the programmer?
You basically have to come up with your own file format and write binary data.
You can also serialize your object model and write the output to a file, but that's usually less efficient.
Better to use an existing database, or use XML (or another existing format) for simple needs. If you want to write a file in a format that already exists, find a library that supports it.
You have to know the binary file format for the file you are trying to create. Consider Joel's post on this topic: the 97-2003 file format is a 349-page spec.
Nearly all the time, to do something like that, you use an API to avoid the grunt work. Be careful, however, because figuring out "what works" by trial and error can result in an upgrade of the program breaking your code. Plus, you have to take into account other operating systems, minor version differences, patches, etc.
There are a number of standards, of course. The likely one to use is some flavor of XML, since libraries and tools already exist to help you work with it, but nothing is stopping you from inventing your own.
Well you could store the data in a format you could read, but which maintained the integrity of your data (XML or JSON for instance).
Or (shudder) you could come up with your own proprietary binary format, and use that.
You would go at it exactly the same way as you would a text file: write your data byte by byte, encoded in such a way that when you read the file you know what you are reading.
For a spreadsheet application you could even use a text-based format (OOXML, OpenDocument) to store presentation and content information.
Or you could define binary data structures and write those directly to the file.
The choice between a text and a binary format depends on the application. For a configuration file you may prefer a text file which can be modified outside your app; for a database you will most likely choose a binary format for performance reasons.
See wotsit.org for information on file formats for various file types. Example: You can figure out exactly how to write out a .BMP file and how it is composed.
Writing to a database can be done by using a wrapper class in your language, mainly passing it SQL commands.
If you create a binary file, you can write any data to it. The only drawback is that you have to know exactly where it starts and where it ends.
Use XML (something open, descriptive, and validatable) and stick with text. There are standards for this sort of thing as well, including ODF.
You can open the file as binary instead of text (how you do this depends somewhat on the platform), and from there you can write the data directly out to disk. The only real caveat is endianness, which can become an issue when moving files from one architecture to another (x86 to PPC, for instance).
Writing binary data to disk is really no harder than writing text, and your creativity is the key to how you store the data.
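One common way to handle the endianness caveat is to pick a fixed byte order for the file and write multi-byte values one byte at a time; a sketch with an arbitrary file name:

#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value in little-endian order regardless of the host CPU. */
static int write_u32_le(FILE *fp, uint32_t v)
{
    unsigned char b[4] = {
        (unsigned char)(v),
        (unsigned char)(v >> 8),
        (unsigned char)(v >> 16),
        (unsigned char)(v >> 24),
    };
    return fwrite(b, 1, sizeof b, fp) == sizeof b ? 0 : -1;
}

int main(void)
{
    FILE *fp = fopen("example.bin", "wb");   /* placeholder file name */
    if (fp == NULL)
        return 1;
    write_u32_le(fp, 0xDEADBEEF);
    fclose(fp);
    return 0;
}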
The general problem is usually referred to as serialization of your application state; in your case the source/target is a file in whatever format makes sense for you. These days the preferred input/output format is XML, and you may want to look into the existing standards in this field. The problem then becomes how to map the state of your system to a particular schema. Boost has a serialization framework that you may want to check out.
/Allan
There are a variety of approaches you can take, but in general you'll want some sort of serialization library. Boost.Serialization and Google's Protocol Buffers are good examples. The basic idea is that you have memory structures (classes and objects) that represent your data, and you want to write that data to a file in a way that can be used to reconstruct those structures again.
If you're hesitant to use a library, you can do it all manually, but realize that you can end up writing a lot of redundant code, or developing your own library. See fopen, fread, fwrite and fclose for a starting point.
A typical binary file format for custom data is an "indexed file format" consisting of
-------
|index|
-------
|data |
-------
Where the index contains records "pointing" to the data.
The index consists of records containing an offset and a size. The offset tells you where in the file the data is stored and the size tells you the size of the data at that offset (i.e. the number of bytes to read).
typedef struct {
    size_t offset;    /* where in the file the data starts */
    size_t size;      /* number of bytes stored at that offset */
} Index;

typedef struct {
    int   ID;
    char  First[20];
    char  Last[20];
    char *RandomInfo;
} Data;
Suppose you want to store 50 records in the file you would create 50 indices and 50 data structures. The 50 index structures would be written to the file first, followed by the 50 data structures.
To read the file you would read in the 50 index structures, then from the data in the read-in index structures you could tell where to "seek" to read the data records.
Look up (fopen, fread, fwrite, fclose, ftell) for functions to read/write the data.
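A condensed sketch of writing such a file (the file name, record count, and payloads are placeholders; here the payloads are simple strings rather than the Data struct above):

#include <stdio.h>
#include <string.h>

#define NREC 3

typedef struct { size_t offset; size_t size; } Index;

int main(void)
{
    const char *payload[NREC] = { "first", "second record", "third" };
    Index idx[NREC];

    /* The data begins right after the index block. */
    size_t pos = NREC * sizeof(Index);
    for (int i = 0; i < NREC; i++) {
        idx[i].offset = pos;
        idx[i].size   = strlen(payload[i]);
        pos += idx[i].size;
    }

    FILE *fp = fopen("records.bin", "wb");   /* placeholder file name */
    if (fp == NULL)
        return 1;
    fwrite(idx, sizeof(Index), NREC, fp);    /* index block first... */
    for (int i = 0; i < NREC; i++)           /* ...then the data */
        fwrite(payload[i], 1, idx[i].size, fp);
    fclose(fp);
    return 0;
}

To read it back, fread() the index entries first, then fseek() to each entry's offset and fread() size bytes.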
You usually use a third-party library for these things. For example, you would link in a database library for, say, Oracle that would allow you to talk to the database. Because the underlying file types (e.g. Excel spreadsheet vs. OpenOffice, Oracle vs. MySQL, etc.) differ, these libraries abstract away your need to care how the file is constructed.
Hope that helps you find what you're looking for!
1985 called, and said they have some help IFF you are willing to read up. The Interchange File Format is still in use today and provides some basic structure and metadata for binary files, as in RIFF or WAV audio. (Unfortunately, TIFF is a false friend.) It allegedly even inspired PNG, so it can't be that bad.
