Redirect file descriptor into memory - c

I am working with a file API that only provides a const char* filename interface (accepting - for stdout) when writing files. I would instead like the output to be written into memory, so I can then pass the data elsewhere.
I can use dup2 to redirect stdout to an arbitrary file descriptor and something like fmemopen/open_memstream as my sink. However, the memory stream functions expect a size which, in my case, I don't know in advance and which could be arbitrarily large.
The data I need to access, however, does have a fixed length and offset within what's being produced (e.g., out of 1MB, I need the 64KB starting at 384KB, etc.). As such, is there a way to set up a circular buffer with fmemopen/open_memstream that just keeps being rewritten until reaching the offset in question? (I realise this is inefficient, but there's no ability to seek.)
Or is this the wrong approach? I've read a bit about memory mapped files and that seems to be similar to what I'm trying to achieve, but it's not something I know much about...
EDIT Just to be clear, I cannot write anything to disk.

Using dup2, redirect stdout to a pipe and call the API with - to instruct it to use standard output. Then read the data generated by the API from the pipe, filter it, and store it in a memory region.
If the pipe capacity is not enough, you will need two threads to make this approach work.
One will run the API call, generating data and putting it into the pipe.
The other thread will take the data from the pipe, check the offset, and store the data in memory once the target offset is reached. It should keep reading from the pipe until EOF, though, so that the first thread can complete the API call and finish gracefully.
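A minimal sketch of this two-thread approach, assuming a hypothetical api_write() standing in for the library call (passing - makes it write to standard output); error handling is trimmed for brevity.

/* Redirect stdout into a pipe and harvest a fixed window of the output. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define TARGET_OFFSET (384 * 1024)   /* start of the region we want */
#define TARGET_LEN    (64 * 1024)    /* length of the region we want */

extern void api_write(const char *filename);   /* hypothetical API */

static int pipe_fds[2];

static void *producer(void *arg)
{
    (void)arg;
    api_write("-");          /* the API writes to stdout, i.e. the pipe */
    fflush(stdout);          /* in case the API buffers through stdio */
    close(STDOUT_FILENO);    /* drop the write end: EOF for the reader */
    return NULL;
}

int main(void)
{
    static unsigned char window[TARGET_LEN];
    unsigned char buf[4096];
    size_t off = 0, stored = 0;
    ssize_t n;
    pthread_t tid;

    pipe(pipe_fds);
    dup2(pipe_fds[1], STDOUT_FILENO);   /* stdout now feeds the pipe */
    close(pipe_fds[1]);                 /* fd 1 is the only write end left */

    pthread_create(&tid, NULL, producer, NULL);

    /* Keep only the bytes in the target range, but drain the pipe to
     * EOF so the producer never blocks on a full pipe. */
    while ((n = read(pipe_fds[0], buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++, off++)
            if (off >= TARGET_OFFSET && stored < TARGET_LEN)
                window[stored++] = buf[i];

    pthread_join(tid, NULL);
    /* window now holds TARGET_LEN bytes from TARGET_OFFSET onwards */
    return 0;
}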

Related

If mmap is faster than legacy file access, where do we see the time saving?

I understand the usage of mmap. A simple read/write operation on a file involves opening the file, allocating a buffer, and calling read (which requires a context switch); the data is then available to the user in the buffer, and changes to the buffer are not reflected in the file unless they are explicitly written back.
If we use mmap instead, writing directly to the buffer is nothing but writing into the file.
The question:
The file is on the hard disk and is mmapped into the process. Each time I write into the mmapped memory, is it written directly to the file? In that case, doesn't it avoid any context switch, because the changes are made directly in the file itself? If mmap is faster than legacy file access, where do we see the time saving?
Kindly explain, and correct me if I'm wrong.
Updates to the file are not immediately visible on disk, but they are visible after an unmap or following an msync call. Hence there is no system call during the updates, and the kernel is not involved. However, since the file is lazily read page by page as needed, the OS may need to read in portions of the file as you cross page boundaries. The most obvious advantage of memory mapping is that it eliminates kernel-space to user-space data copies. There is also no need for system calls to seek to a specific position in the file.
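As a minimal sketch of the difference: updating the file through the mapping involves no system call at the moment of the write; only msync (or the eventual munmap) involves the kernel. Error handling is mostly omitted.

/* Modify a file through mmap instead of write(2). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    int fd = open(argv[1], O_RDWR);
    struct stat st;
    fstat(fd, &st);

    /* Map the file; no read() copies data into a user buffer. */
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 'X';                     /* a plain store: no system call here */
    msync(p, st.st_size, MS_SYNC);  /* now force the change to disk */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}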

Confused about node.js file system

I used to write files with Node.js in two steps:
1. First, judge whether the file exists or not, using the fs.exists function;
2. Then use fs.writeFile to write the file directly.
But now I have noticed there are more functions used for writing files, like fs.open and fs.close. Should I use these to open and close the file while writing?
Besides, I noticed there are fs.createReadStream and fs.createWriteStream functions. What are the differences between them and fs.writeFile and fs.readFile?
Here's how I would explain the differences:
Low-level:
fs.open and fs.close work on file descriptors. These are low-level functions that map directly to system calls such as open(2). Since you'll have a file descriptor, you'd use these with fs.read or fs.write.
Note that all of these are asynchronous, and there are synchronous versions as well: fs.openSync, fs.closeSync, fs.readSync, fs.writeSync, where you wouldn't use a callback. The difference between the asynchronous and synchronous versions is that fs.openSync only returns once the operation to open the file has completed, whereas fs.open returns straight away and you use the file descriptor in the callback.
These low-level functions give you full control, but will mean a lot more coding.
Mid-level:
fs.createReadStream and fs.createWriteStream create stream objects that you can wire up to events. Examples of these events are 'data' (when a chunk of data has been read, but that chunk is only part of the file) or 'close'. The advantage of streams is that you can read a file and process it as the data comes in, i.e. you don't have to read the whole file, keep it in memory, and then process it. This makes sense when dealing with large files, as you can get better performance processing bits in chunks than dealing with the whole file at once (e.g. a whole 1GB file in memory).
High-level:
fs.readFile and fs.writeFile operate on the whole file. So you'd call fs.readFile, node would read in the whole file, and then present you with all the data in your callback. The advantage is that you don't need to deal with differently sized chunks (as you do when using streams). When writing, node writes the whole file. The disadvantage of this approach is that when reading or writing, you have to hold the whole file in memory. For example, if you are transforming a log file, you may only need lines of data; using streams, you can do this without having to wait for the file to be read in completely before starting to write.
There are also fs.readFileSync and fs.writeFileSync, which do not use a callback but wait for the read/write to finish before returning. The advantage of using these is that for a small file you may not want to do anything until the file is read anyway; for big files, however, it means the CPU idles away while waiting for the file I/O to finish.
Hope that makes sense. In answer to your question: when using fs.writeFile, you don't need fs.open or fs.close.

How to see how much data is queued up in a named pipe?

On a Linux box, I have a couple of processes writing to a named pipe and another one reading from it. I suspect that my reader is not keeping up and that a lot of data is queued up in the pipe.
Could anyone please tell me whether there is a way to check/see how much data is queued up in the pipe? Any Linux command or C API?
Thank you for your time.
--KS
I don't think FIONREAD will work, as FIONREAD is determined by the value returned by i_size_read, which is stored in the inode as i_size, and i_size is not used with pipes (which is why stat always returns 0 for a pipe's size).
http://lxr.free-electrons.com/source/include/linux/fs.h#L848
It should be possible to get the size by summing the len property of the buffers (i_node.i_pipe.bufs). It doesn't look like this value is exposed by stat or ioctl, though.
https://github.com/mirrors/linux-2.6/blob/master/fs/pipe.c
You could try recv(..., MSG_PEEK) -- this should read from the pipe without removing the data from it (so the next read would return the same data). It won't necessarily tell you about all the data queued, just some of it.

What do I use as a replacement for GetFileSize() for pipes?

See title.
On the client side of a named pipe, I want to determine the size of the content to be read from the pipe, in order to allocate memory for a buffer to hold that content.
The MSDN help says:
You cannot use the GetFileSize function with a handle of a nonseeking device such as a pipe or a communications device. To determine the file type for hFile, use the GetFileType function.
Hmmm. Okay. But if I cannot use GetFileSize to determine the amount of data that can be read from a pipe, what shall I use instead? Currently, I do:
length = GetFileSize(pipehandle, 0);
while (length == 0) {
    Sleep(10);                            // wait a bit
    length = GetFileSize(pipehandle, 0);  // and try again
}
Sooner or later, length does become greater than zero, but the busy waiting seems a bit bad to me.
Background: I have a pipe server (roughly the Multithreaded Pipe Server from the MSDN example) that waits for the client to connect. Upon connection, the server reads the content of a file and passes that to the client using the pipe connection.
More Background: The overall reason why I want to do this is that I'm working with an external library that implements an XML parser. Originally, the parser opens a file, then CreateFileMapping is applied to that file, and finally MapViewOfFile is called to get at the file content.
Now the project rules have changed and we're no longer allowed to create files on disk, so we need another way to pass the information from App1 (the pipe server) to App2 (the pipe client). To change as little as possible, I decided to use pipes to pass the data, because at first sight opening a pipe is the same as opening any other file, and I assume I only have to make very few changes to read from pipes instead of files.
Currently, I determine the size of the data in the pipe (I know it is used only once, to pass the input file from App1 to App2), then malloc a buffer and read the whole content of the pipe into that buffer.
If I'm completely off track here, I'd also be open to any suggestions for doing things better.
Clearly you want PIPE_TYPE_BYTE in this case, since the amount of data is unpredictable. Treat it just like a regular file in the client, calling ReadFile() repeatedly with a small buffer of, say, 4096 bytes. If you need to store the data in an array, you could simply write an integer first so that the client knows how big to make the array.
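A minimal sketch of that length-prefix scheme on the client side, assuming the server writes a DWORD payload size before the payload itself (a made-up protocol; error paths just return NULL):

/* Read a length prefix, then the payload, from a byte-type pipe handle. */
#include <windows.h>
#include <stdlib.h>

char *read_pipe_payload(HANDLE pipe, DWORD *out_len)
{
    DWORD len = 0, got = 0, total = 0;

    /* The server sends the payload size first. */
    if (!ReadFile(pipe, &len, sizeof len, &got, NULL) || got != sizeof len)
        return NULL;

    char *buf = malloc(len);
    if (buf == NULL)
        return NULL;

    /* ReadFile may return fewer bytes than requested; loop until done. */
    while (total < len) {
        if (!ReadFile(pipe, buf + total, len - total, &got, NULL)) {
            free(buf);
            return NULL;
        }
        total += got;
    }

    *out_len = len;
    return buf;
}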
If you created your pipe with the PIPE_TYPE_MESSAGE type, you will be able to use the PeekNamedPipe function to retrieve a complete message from the pipe.
The main differences between PIPE_TYPE_MESSAGE and PIPE_TYPE_BYTE are:
in MESSAGE type, the system manages the length of the values sent into the pipe; just ask to read one message and you will get the whole message (useful for messages that are not too large, so as not to fill all the memory);
in BYTE type, you have to manage the length of the data you send through the pipe yourself. A TLV protocol could be a good way to know the size of your "messages" (the T/type part may sound useless here); you can then read the content in two parts: first read the initial bytes, which give you the size of the message, then read the message in parts if you don't want to overfill the memory.
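And a minimal sketch of sizing the buffer with PeekNamedPipe before reading, assuming an already connected pipe handle (the 10 ms polling interval is arbitrary):

/* Ask the pipe how many bytes are available, then read exactly that much. */
#include <windows.h>
#include <stdlib.h>

char *peek_then_read(HANDLE pipe, DWORD *out_len)
{
    DWORD avail = 0;

    /* Poll until the server has written something. */
    while (PeekNamedPipe(pipe, NULL, 0, NULL, &avail, NULL) && avail == 0)
        Sleep(10);
    if (avail == 0)
        return NULL;    /* PeekNamedPipe itself failed */

    char *buf = malloc(avail);
    if (buf != NULL && !ReadFile(pipe, buf, avail, out_len, NULL)) {
        free(buf);
        buf = NULL;
    }
    return buf;
}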

popen performance in C

I'm designing a program I plan to implement in C and I have a question about the best way (in terms of performance) to call external programs. The user is going to provide my program with a filename, and then my program is going to run another program with that file as input. My program is then going to process the output of the other program.
My typical approach would be to redirect the other program's output to a file and then have my program read that file when it's done. However, I understand I/O operations are quite expensive and I would like to make this program as efficient as possible.
I did a little bit of looking and I found the popen function for running system commands and grabbing the output. How does the performance of this approach compare to the performance of the approach I just described? Does popen simply write the external program's output to a temporary file, or does it keep the program output in memory?
Alternatively, is there another way to do this that will give better performance?
On Unix systems, popen will pass data through an in-memory pipe. Assuming the data isn't swapped out, it won't hit disk. This should give you just about as good performance as you can get without modifying the program being invoked.
popen does pretty much what you are asking for: it performs the pipe-fork-exec idiom and gives you a file pointer that you can read from and write to.
However, there is a limit on the size of the pipe buffer (~4K, IIRC), and if you aren't reading quickly enough, the other process could block.
Do you have access to shared memory as a mount point? (On Linux systems there is a /dev/shm mount point.)
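A minimal sketch of the popen approach; the command here is purely illustrative:

/* Run a command and process its output as it arrives. */
#include <stdio.h>

int main(void)
{
    char buf[4096];
    size_t n;

    FILE *fp = popen("ls -l /tmp", "r");
    if (fp == NULL) { perror("popen"); return 1; }

    /* The data flows through an in-memory pipe, chunk by chunk; process
     * each chunk here instead of waiting for the whole output. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        fwrite(buf, 1, n, stdout);

    return pclose(fp);   /* reap the child and collect its exit status */
}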
1) popen keeps the program output in memory. It actually uses pipes to transfer data between the processes.
2) popen looks, IMHO, like the best option for performance.
It also has an advantage over files in reducing latency, i.e. your program will be able to get the other program's output on the fly, while it is being produced. If this output is large, then you don't have to wait until the other program is finished to start processing its output.
The problem with having your subcommand redirect to a file is that it's potentially insecure, while popen communication can't be intercepted by another process. Plus, you need to make sure the filename is unique if you're running several instances of your master program (and thus of your subcommand). The popen solution doesn't suffer from these problems.
The performance of popen is just fine as long as you don't read/write one-byte chunks. Always read/write in multiples of 512 (like 4096); that applies to file operations as well. popen connects your process and the child process through pipes, so if you don't read, the pipe fills up and the child can't write, and vice versa. So all the exchanged data is in memory, but only small amounts at a time.
(Assuming Unix or Linux)
Writing to the temp file may be slow if the file is on a slow disk. It also means the entire output will have to fit on the disk.
popen connects to the other program using a pipe, which means that output will be sent to your program incrementally. As it is generated, it is copied to your program chunk-by-chunk.
