Open system call - c

I'm studying for my operating systems midterm and was wondering if I can get some help.
Can someone explain the checks and what the kernel does during the open() system call?
Thanks!

Very roughly, you can think of the following steps:
Translate the file name into an inode, which is the actual file system object describing the contents of the file, by traversing the filesystem data structures.
During this traversal, the kernel will check that you have sufficient access through the directory path to the file, and check access on the file itself. The precise checks depend on what modes were passed to open.
Create what's sometimes called an open file descriptor within the kernel. There is one of these objects for each file the kernel has opened on behalf of any process.
Allocate an unused index in the per-process file descriptor table, and point it at the open file descriptor.
Return this index from the system call as the file descriptor.
This description should be essentially correct for opening plain files and/or directories, but things are different for various sorts of special files, in particular for devices.
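The "allocate an unused index" step can be observed from user space. The following is a minimal sketch, assuming a POSIX system where open() hands out the lowest-numbered unused descriptor; the function name lowest_fd_is_reused is made up for this illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a descriptor number freed by close()
 * is handed out again by the next open(), 0 if it is not, and -1 if the
 * initial opens fail. */
int lowest_fd_is_reused(void)
{
    int first = open(".", O_RDONLY);
    if (first < 0)
        return -1;
    int second = open(".", O_RDONLY);
    if (second < 0) {
        close(first);
        return -1;
    }
    close(first);                       /* frees the table slot `first` used */
    int third = open(".", O_RDONLY);    /* lowest free slot is that one again */
    int reused = (third == first);
    if (third >= 0)
        close(third);
    close(second);
    return reused;
}
```

The descriptor is only an index; the kernel object it pointed to is gone after close(), which is why the same number can later refer to a different open file.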

I would go back to what the prof told you - there are a lot of things that happen during open(), depending on what you're opening (e.g. a device, a file, a directory), and unless you write what the professor's looking for, you'll lose points.
That being said, it mostly involves the checks to see if this open is valid (e.g. does this file exist, does the user have permission to read/write it, etc.), then an entry in the kernel handle table is allocated to keep track of the fd and its current file position (and of course, some other things).

Related

What is the fastest way to detect file size is not zero without knowing the file descriptor?

To explain briefly why I need this:
I am currently doing the detection with stat(2). I don't have control over the file descriptor (it may get used by some other thread, as my code is injected to replace syscalls), so I can't use fstat(2) (which is faster). I need to do this check many times, so is there a faster way to do the same thing?
I am checking the same file in different processes which do not have a parent child relation.
You should probably benchmark it for yourself.
I've measured
Call                          Real time    System time
lseek                         272.58 ns    170.11 ns
fstat                         366.44 ns    366.28 ns
stat("/etc/profile", &sb)     812.77 ns    711.69 ns
on my Linux laptop. It fluctuates a little between runs but lseek is usually a bunch of ns faster than fstat, but you also need a fd for it and opening is quite expensive at about 1.6µs, so stat is probably the best choice for your case.
As tom-karzes has noted, the cost of stat should depend on the number of directory components in the path. I tried it on a PATH_MAX long "/foo/foo/.../foo" directory and there I'm getting about 80µs.
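For reference, the stat-based check itself is short. This is a minimal sketch; the function name path_is_nonempty and its return convention are assumptions made for the example, not something from the question.

```c
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if the file at `path` has a non-zero
 * size, 0 if it is empty, and -1 if stat() fails (errno is set by stat). */
int path_is_nonempty(const char *path)
{
    struct stat sb;
    if (stat(path, &sb) != 0)
        return -1;
    return sb.st_size > 0 ? 1 : 0;
}
```

No descriptor is involved, so nothing another thread does to its own descriptors can interfere with the call.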
The most efficient approach, knowing the filesystem you are searching in, is to open the block device associated and search (block by block) the inode table, and check the actual size from the inodes there (open the block device, so you get the inodes from the in-memory images, and not from the disk). This allows you to get all the zero length inodes of a filesystem in a quick and dirty way. The drawback is that you first need to get the info of the filesystem, and then to access the block device directly, which is normally forbidden for a non-root process. After that, you have to search the filesystem to get the names of the files involved, just in case you need to do something on those files.
By the way, your assumption that you cannot use fstat(2) on a file descriptor shared with another thread is wrong: fstat operates on an already open file descriptor and does nothing to the file itself (it is nonblocking), and the system guarantees that the inode is locked while the stat structure is filled in.
The approach of using lseek(2) is not valid in this case, because it actually moves the file pointer to the end of the file and then back to the saved position. That takes two system calls to do and undo the move, and there are many race scenarios if another thread issues a system call (a write(2), say) between the two while the file pointer is somewhere else.
Unix (including all POSIX systems: Linux, BSD, etc.) guarantees that a nonblocking system call (as fstat(2) is) is atomic in nature, locking the inode of the file while the process (or thread) executes the system call. So no other thread can be using the file while your fstat(2) call is getting the data. Even on blocking calls, Unix guarantees that a different system call made on the same descriptor will be queued, and the process or thread will have to wait until the fstat(2) syscall ends.
The problem with stat(2) is that it has to resolve all the path elements until it gets to the final inode of the file (this is where the length of the file is stored), and this is done one element at a time. Until it gets to the final inode, no lock is taken on it (indeed, it is unknown until we get to it, so we cannot lock it until the namei() resolution finishes), and from there it proceeds like fstat(2).
CONCLUSION
Use fstat(2) on the other thread's file descriptor without fearing data corruption; it cannot happen. Don't hesitate to do this, as nothing is going to happen to the inode of the file while you are gathering the stat info.
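A minimal sketch of querying the size through a descriptor, assuming you can get hold of one: fstat(2) only reads metadata, so the file offset that another thread may be relying on is left where it was. The name fd_size is invented for the example.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the size of the file behind `fd`, or -1 if
 * fstat() fails. The file offset of `fd` is not read or modified. */
off_t fd_size(int fd)
{
    struct stat sb;
    if (fstat(fd, &sb) != 0)
        return -1;
    return sb.st_size;
}
```

Contrast this with the lseek(2) approach, which has to move the shared offset to the end and back.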

What does opening a file actually do?

In all programming languages (that I use at least), you must open a file before you can read or write to it.
But what does this open operation actually do?
Manual pages for typical functions don't actually tell you anything other than that it 'opens a file for reading/writing':
http://www.cplusplus.com/reference/cstdio/fopen/
https://docs.python.org/3/library/functions.html#open
Obviously, through usage of the function you can tell it involves creation of some kind of object which facilitates accessing a file.
Another way of putting this would be, if I were to implement an open function, what would it need to do on Linux?
In almost every high-level language, the function that opens a file is a wrapper around the corresponding kernel system call. It may do other fancy stuff as well, but in contemporary operating systems, opening a file must always go through the kernel.
This is why the arguments of the fopen library function, or Python's open closely resemble the arguments of the open(2) system call.
In addition to opening the file, these functions usually set up a buffer that will subsequently be used by the read/write operations. The purpose of this buffer is to ensure that whenever you want to read N bytes, the corresponding library call will return N bytes, regardless of whether the underlying system calls return fewer.
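The "return N bytes even if the system call returns fewer" behaviour can be sketched as a loop around read(2). This is a simplified illustration of the idea, not how any particular C library implements fread; the name read_full is an assumption for the example.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling read() until `count` bytes have
 * arrived, end of file is reached, or an error occurs. Returns the number
 * of bytes actually read, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            break;             /* end of file: return what we have */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

A real stdio implementation additionally reads ahead into its own buffer so that many small reads are served without a system call each.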
I am not actually interested in implementing my own function; just in understanding what the hell is going on...'beyond the language' if you like.
In Unix-like operating systems, a successful call to open returns a "file descriptor" which is merely an integer in the context of the user process. This descriptor is subsequently passed to any call that interacts with the opened file, and after calling close on it, the descriptor becomes invalid.
It is important to note that the call to open acts like a validation point at which various checks are made. If not all of the conditions are met, the call fails by returning -1 instead of the descriptor, and the kind of error is indicated in errno. The essential checks are:
Whether the file exists;
Whether the calling process is privileged to open this file in the specified mode. This is determined by matching the file permissions, owner ID and group ID to the respective IDs of the calling process.
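From the caller's side, these checks show up as different errno values. A minimal sketch, with probe_open as a made-up name: ENOENT corresponds to the "does the file exist" check and EACCES to the permission check.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: tries to open `path` read-only. Returns 0 on
 * success (the descriptor is closed again), or the errno value reported
 * by open() on failure, e.g. ENOENT or EACCES. */
int probe_open(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

A caller would typically compare the result against the specific codes it knows how to handle and treat everything else as a generic failure.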
In the context of the kernel, there has to be some kind of mapping between the process' file descriptors and the physically opened files. The internal data structure that is mapped to the descriptor may contain yet another buffer that deals with block-based devices, or an internal pointer that points to the current read/write position.
I'd suggest you take a look at this guide through a simplified version of the open() system call. It uses the following code snippet, which is representative of what happens behind the scenes when you open a file.
int sys_open(const char *filename, int flags, int mode) {
    char *tmp = getname(filename);
    int fd = get_unused_fd();
    struct file *f = filp_open(tmp, flags, mode);
    fd_install(fd, f);
    putname(tmp);
    return fd;
}
Briefly, here's what that code does, line by line:
Allocate a block of kernel-controlled memory and copy the filename into it from user-controlled memory.
Pick an unused file descriptor, which you can think of as an integer index into a growable list of currently open files. Each process has its own such list, though it's maintained by the kernel; your code can't access it directly. An entry in the list contains whatever information the underlying filesystem will use to pull bytes off the disk, such as inode number, process permissions, open flags, and so on.
The filp_open function has the implementation
struct file *filp_open(const char *filename, int flags, int mode) {
    struct nameidata nd;
    open_namei(filename, flags, mode, &nd);
    return dentry_open(nd.dentry, nd.mnt, flags);
}
which does two things:
Use the filesystem to look up the inode (or more generally, whatever sort of internal identifier the filesystem uses) corresponding to the filename or path that was passed in.
Create a struct file with the essential information about the inode and return it. This struct becomes the entry in that list of open files that I mentioned earlier.
Store ("install") the returned struct into the process's list of open files.
Free the allocated block of kernel-controlled memory.
Return the file descriptor, which can then be passed to file operation functions like read(), write(), and close(). Each of these will hand off control to the kernel, which can use the file descriptor to look up the corresponding file pointer in the process's list, and use the information in that file pointer to actually perform the reading, writing, or closing.
If you're feeling ambitious, you can compare this simplified example to the implementation of the open() system call in the Linux kernel, a function called do_sys_open(). You shouldn't have any trouble finding the similarities.
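The bookkeeping in the sys_open() snippet can be mimicked in user space with a toy table. This is only a sketch under loud assumptions: every toy_* name is invented, the table has a fixed size, and no real filesystem lookup happens; it only shows how "pick an unused index" and "install the struct" fit together.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_FDS 16

/* Toy stand-in for the kernel's struct file: a name and an offset. */
struct toy_file {
    char name[64];
    long offset;
};

/* One table per "process"; the index plays the role of the descriptor. */
static struct toy_file *toy_table[TOY_MAX_FDS];

/* Counterpart of get_unused_fd(): lowest free slot, or -1 if full. */
static int toy_get_unused_fd(void)
{
    for (int fd = 0; fd < TOY_MAX_FDS; fd++)
        if (toy_table[fd] == NULL)
            return fd;
    return -1;
}

/* Counterpart of sys_open(): allocate a slot, build the struct, install it. */
int toy_open(const char *name)
{
    int fd = toy_get_unused_fd();
    if (fd < 0)
        return -1;
    struct toy_file *f = calloc(1, sizeof *f);     /* stands in for filp_open() */
    if (f == NULL)
        return -1;
    strncpy(f->name, name, sizeof f->name - 1);    /* calloc keeps it terminated */
    toy_table[fd] = f;                             /* stands in for fd_install() */
    return fd;
}

/* Releases the slot so the index can be handed out again. */
int toy_close(int fd)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || toy_table[fd] == NULL)
        return -1;
    free(toy_table[fd]);
    toy_table[fd] = NULL;
    return 0;
}
```

Calling toy_open() twice yields indices 0 and 1; after toy_close(0), the next toy_open() yields 0 again, mirroring how real descriptors are recycled.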
Of course, this is only the "top layer" of what happens when you call open() - or more precisely, it's the highest-level piece of kernel code that gets invoked in the process of opening a file. A high-level programming language might add additional layers on top of this. There's a lot that goes on at lower levels. (Thanks to Ruslan and pjc50 for explaining.) Roughly, from top to bottom:
open_namei() and dentry_open() invoke filesystem code, which is also part of the kernel, to access metadata and content for files and directories. The filesystem reads raw bytes from the disk and interprets those byte patterns as a tree of files and directories.
The filesystem uses the block device layer, again part of the kernel, to obtain those raw bytes from the drive. (Fun fact: Linux lets you access raw data from the block device layer using /dev/sda and the like.)
The block device layer invokes a storage device driver, which is also kernel code, to translate from a medium-level instruction like "read sector X" to individual input/output instructions in machine code. There are several types of storage device drivers, including IDE, (S)ATA, SCSI, Firewire, and so on, corresponding to the different communication standards that a drive could use. (Note that the naming is a mess.)
The I/O instructions use the built-in capabilities of the processor chip and the motherboard controller to send and receive electrical signals on the wire going to the physical drive. This is hardware, not software.
On the other end of the wire, the disk's firmware (embedded control code) interprets the electrical signals to spin the platters and move the heads (HDD), or read a flash ROM cell (SSD), or whatever is necessary to access data on that type of storage device.
This may also be somewhat incorrect due to caching. :-P Seriously though, there are many details that I've left out - a person (not me) could write multiple books describing how this whole process works. But that should give you an idea.
Any file system or operating system you want to talk about is fine by me. Nice!
On a ZX Spectrum, initializing a LOAD command will put the system into a tight loop, reading the Audio In line.
Start-of-data is indicated by a constant tone, and after that a sequence of long/short pulses follow, where a short pulse is for a binary 0 and a longer one for a binary 1 (https://en.wikipedia.org/wiki/ZX_Spectrum_software). The tight load loop gathers bits until it fills a byte (8 bits), stores this into memory, increases the memory pointer, then loops back to scan for more bits.
Typically, the first thing a loader would read is a short, fixed format header, indicating at least the number of bytes to expect, and possibly additional information such as file name, file type and loading address. After reading this short header, the program could decide whether to continue loading the main bulk of the data, or exit the loading routine and display an appropriate message for the user.
An End-of-file state could be recognized by receiving as many bytes as expected (either a fixed number of bytes, hardwired in the software, or a variable number such as indicated in a header). An error was thrown if the loading loop did not receive a pulse in the expected frequency range for a certain amount of time.
A little background on this answer
The procedure described loads data from a regular audio tape - hence the need to scan Audio In (it connected with a standard plug to tape recorders). A LOAD command is technically the same as opening a file - but it's physically tied to actually loading the file. This is because the tape recorder is not controlled by the computer, and you cannot (successfully) open a file but not load it.
The "tight loop" is mentioned because (1) the CPU, a Z80-A (if memory serves), was really slow: 3.5 MHz, and (2) the Spectrum had no internal clock! That means that it had to accurately keep count of the T-states (instruction times) for every. single. instruction. inside that loop, just to maintain the accurate beep timing.
Fortunately, that low CPU speed had the distinct advantage that you could calculate the number of cycles on a piece of paper, and thus the real world time that they would take.
It depends on the operating system what exactly happens when you open a file. Below I describe what happens in Linux as it gives you an idea what happens when you open a file and you could check the source code if you are interested in more detail. I am not covering permissions as it would make this answer too long.
In Linux every file is represented by a structure called an inode. Each inode has a unique number within its filesystem, and every file has exactly one inode number. This structure stores metadata for a file, for example file size, file permissions, time stamps and pointers to disk blocks, but not the file name itself. Each directory contains entries that pair a file name with the inode number used for lookup. When you open a file, assuming you have the relevant permissions, a file descriptor is created using the inode number associated with the file name. As many names can point to the same file, the inode has a link field that holds the total count of links to the file. If a file is present in one directory, its link count is one; if it has an additional hard link, its link count is two. When a process opens the file, the kernel also takes a reference on the in-memory inode; that is a separate usage count and does not change the on-disk link count.
Bookkeeping, mostly. This includes various checks like "Does the file exist?" and "Do I have the permissions to open this file for writing?".
But that's all kernel stuff - unless you're implementing your own toy OS, there isn't much to delve into (if you are, have fun - it's a great learning experience). Of course, you should still learn all the possible error codes you can receive while opening a file, so that you can handle them properly - but those are usually nice little abstractions.
The most important part on the code level is that it gives you a handle to the open file, which you use for all of the other operations you do with a file. Couldn't you use the filename instead of this arbitrary handle? Well, sure - but using a handle gives you some advantages:
The system can keep track of all the files that are currently open, and prevent them from being deleted (for example).
Modern OSs are built around handles - there's tons of useful things you can do with handles, and all the different kinds of handles behave almost identically. For example, when an asynchronous I/O operation completes on a Windows file handle, the handle is signalled - this allows you to block on the handle until it's signalled, or to complete the operation entirely asynchronously. Waiting on a file handle is exactly the same as waiting on a thread handle (signalled e.g. when the thread ends), a process handle (again, signalled when the process ends), or a socket (when some asynchronous operation completes). Just as importantly, handles are owned by their respective processes, so when a process is terminated unexpectedly (or the application is poorly written), the OS knows what handles it can release.
Most operations are positional - you read from the last position in your file. By using a handle to identify a particular "opening" of a file, you can have multiple concurrent handles to the same file, each reading from their own places. In a way, the handle acts as a moveable window into the file (and a way to issue asynchronous I/O requests, which are very handy).
Handles are much smaller than file names. A handle is usually the size of a pointer, typically 4 or 8 bytes. On the other hand, filenames can have hundreds of bytes.
Handles allow the OS to move the file, even though applications have it open - the handle is still valid, and it still points to the same file, even though the file name has changed.
There's also some other tricks you can do (for example, share handles between processes to have a communication channel without using a physical file; on unix systems, files are also used for devices and various other virtual channels, so this isn't strictly necessary), but they aren't really tied to the open operation itself, so I'm not going to delve into that.
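The "each handle has its own position" point is easy to check on a POSIX system. A minimal sketch, with offsets_after_read as an invented name: the same path is opened twice, a few bytes are read through the first descriptor, and both offsets are reported.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: opens `path` twice, reads up to 4 bytes through the
 * first descriptor only, and stores the current offset of each descriptor.
 * Returns 0 on success, -1 on error. */
int offsets_after_read(const char *path, off_t *first_off, off_t *second_off)
{
    char buf[4];
    int rc = -1;
    int a = open(path, O_RDONLY);
    int b = open(path, O_RDONLY);
    if (a >= 0 && b >= 0 && read(a, buf, sizeof buf) >= 0) {
        *first_off = lseek(a, 0, SEEK_CUR);    /* advanced by the read */
        *second_off = lseek(b, 0, SEEK_CUR);   /* untouched by it */
        rc = 0;
    }
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return rc;
}
```

For a file of at least four bytes, the first offset ends up at 4 while the second stays at 0.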
At the core of it when opening for reading nothing fancy actually needs to happen. All it needs to do is check the file exists and the application has enough privileges to read it and create a handle on which you can issue read commands to the file.
It's on those commands that actual reading will get dispatched.
The OS will often get a head start on reading by starting a read operation to fill the buffer associated with the handle. Then when you actually do the read it can return the contents of the buffer immediately rather than needing to wait on disk I/O.
For opening a new file for writing, the OS will need to add an entry in the directory for the new (currently empty) file. And again a handle is created on which you can issue the write commands.
Basically, a call to open needs to find the file, and then record whatever it needs to so that later I/O operations can find it again. That's quite vague, but it will be true on all the operating systems I can immediately think of. The specifics vary from platform to platform. Many answers already on here talk about modern-day desktop operating systems. I've done a little programming on CP/M, so I will offer my knowledge about how it works on CP/M (MS-DOS probably works in the same way, but for security reasons, it is not normally done like this today).
On CP/M you have a thing called the FCB (as you mentioned C, you could call it a struct; it really is a 35-byte contiguous area in RAM containing various fields). The FCB has fields to write the file-name and a (4-bit) integer identifying the disk drive. Then, when you call the kernel's Open File, you pass a pointer to this struct by placing it in one of the CPU's registers. Some time later, the operating system returns with the struct slightly changed. Whatever I/O you do to this file, you pass a pointer to this struct to the system call.
What does CP/M do with this FCB? It reserves certain fields for its own use, and uses these to keep track of the file, so you had better not ever touch them from inside your program. The Open File operation searches through the table at the start of the disk for a file with the same name as what's in the FCB (the '?' wildcard character matches any character). If it finds a file, it copies some information into the FCB, including the file's physical location(s) on the disk, so that subsequent I/O calls ultimately call the BIOS which may pass these locations to the disk driver. At this level, specifics vary.
In simple terms, when you open a file you are actually requesting the operating system to load the desired file (copy the contents of the file) from secondary storage to RAM for processing. The reason for this (loading a file) is that you cannot process the file directly from the hard disk because of its extremely slow speed compared to RAM.
The open command will generate a system call which in turn copies the contents of the file from secondary storage (hard disk) to primary storage (RAM).
And we 'close' a file because the modified contents of the file have to be reflected to the original file on the hard disk. :)
Hope that helps.

Is it possible to create an unlinked file on a selected filesystem?

Basically, the same result as creating a temporary file in the desired file system, opening it, and then unlinking it.
Even better, though unlikely, if this could be done without creating an inode that is visible to other processes.
The ability to do so is OS-specific, since the relevant POSIX function calls all result in a link being generated. Linux in particular has allowed, since version 3.11, the use of O_TMPFILE in the flags argument of open(2) in order to create an anonymous file in a given directory.
There are several POSIX APIs at your disposal:
mkstemp - generates a unique temporary filename from template, creates and opens the file, and returns an open file descriptor for the file.
tmpfile - opens a unique temporary file in binary read/write (w+b) mode. The file will be automatically deleted when it is closed or the program terminates.
Both of these functions do create files on the filesystem. Creating an inode is unavoidable, if you want to use a real file.
The first provides you a file descriptor for making low-level system calls, like read and write. The second gives you a FILE* for all of the <stdio.h> APIs.
If you don't need/desire an actual file on disk, you should consider the memory stream APIs provided by POSIX.1-2008.
open_memstream() - opens a stream for writing to a buffer. The buffer is dynamically allocated (as with malloc(3)), and automatically grows as required.
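A minimal sketch of the memory stream approach, assuming a libc that provides the POSIX.1-2008 open_memstream(); the function name format_greeting is invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: formats a greeting into a heap buffer through a
 * memory stream. Returns the malloc'ed, NUL-terminated buffer (the caller
 * frees it), or NULL on error. No file is created anywhere. */
char *format_greeting(const char *name)
{
    char *buf = NULL;
    size_t len = 0;
    FILE *mem = open_memstream(&buf, &len);
    if (mem == NULL)
        return NULL;
    fprintf(mem, "hello, %s", name);
    if (fclose(mem) != 0) {    /* fclose flushes and finalizes buf and len */
        free(buf);
        return NULL;
    }
    return buf;
}
```

Any code that writes to a FILE * works unchanged with such a stream, which is the main attraction when no real file is needed.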
libtmpfilefd: "create a temporary unnamed file" seems to fulfill your requirements.
Looking at the source file, this function creates a temporary file with mkstemp and then unlinks the file right after.
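The mkstemp-then-unlink pattern can be sketched as follows. The helper name open_unlinked_temp is an assumption for this example, and note that the file does have a visible name for a brief moment between the two calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: creates a temporary file from `template_path`
 * (which must end in "XXXXXX" and is modified in place), then removes its
 * name right away. Returns a descriptor for the now-nameless file, or -1
 * on error. The file's storage is released when the descriptor is closed. */
int open_unlinked_temp(char *template_path)
{
    int fd = mkstemp(template_path);
    if (fd < 0)
        return -1;
    if (unlink(template_path) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Putting the template in a directory on the desired filesystem selects where the file lives; the descriptor stays fully usable for read and write after the name is gone.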

Difference between creating a duplicate file descriptor using dup() and creating a hard link?

I just tried out this program where I use dup to duplicate the file descriptor of an opened file.
I had made a hard link to this same file and I opened the same file to read the contents of the file in the program.
My question is what is the difference?
I understand that dup gives me a run-time abstraction of the file and that a hard link refers more to the filesystem implementation, but I do not understand the need to use one over the other.
What are the advantages of using one over the other?
Why can't we explicitly refer to the hard link if we want to refer to the same file locations instead of creating a file descriptor and vice versa?
I am using Linux and the standard C library.
Hard links work on i-nodes, dup works on opened file descriptors. These are different animals.
A file is mostly an inode, with directory entries pointing to that inode (so some files can have more than one name through hard links, and other files can have no name at all: a temporary file that is still open but unlinked has an i-node referred to by an open file descriptor, but no longer has any name). I-nodes exist for the duration of the file and are written to disk.
A file descriptor only exists in processes (in kernel memory only, not on disk) so it can't be written to disk (you could only write its number, which usually doesn't make any sense).
A file descriptor knows (inside the kernel) its inode, but also some more state, notably the current offset.
You could have two file descriptors working on the same file (the same inode, perhaps by open-ing two different hardlinked or symlinked paths to it) but having different state (e.g. different file position or offset).
If using dup(2) syscall, the two file descriptors share the same state (just after the dup) in particular share the same file offset or position.
If using link(2) syscall, the two directory entries point to the same inode. They need to be on the same filesystem.
And a symlink(2) syscall creates a new inode (and a new file) which refers to the symbolic name. Read other man pages about path_resolution(7) and symlink(7).
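The shared offset after dup(2) can be checked directly. A minimal sketch, with offset_seen_after_dup_seek as an invented name: the offset is moved through the duplicate and then read back through the original descriptor.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: duplicates `fd`, seeks to `target` through the
 * copy, and returns the offset as seen through the original descriptor.
 * Returns -1 on error. */
off_t offset_seen_after_dup_seek(int fd, off_t target)
{
    int copy = dup(fd);
    if (copy < 0)
        return -1;
    off_t moved = lseek(copy, target, SEEK_SET);
    off_t seen = (moved < 0) ? -1 : lseek(fd, 0, SEEK_CUR);
    close(copy);
    return seen;
}
```

Had the second descriptor come from a separate open() of the same path instead of dup(), the original descriptor's offset would not have moved.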
A hard link is just a way to have the same file in two different directories. It is useful for saving some disk space.
Using dup lets you have two different file descriptors in your program that point to the same file. It is useful if you want to duplicate some kind of logical object that wraps a file descriptor.
The main difference is that a hard link is persistent and a duplicated file descriptor only lasts as long as the process. Plus the reasons already given.
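The persistent side of the comparison can be sketched too: after link(2), two names resolve to one inode. The helper name link_shares_inode is an assumption for the example, and both paths must be on the same filesystem.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates a hard link `new_path` for `existing` and
 * checks that both names resolve to the same inode on the same device.
 * Returns 1 if they do, 0 if they do not, -1 on error. */
int link_shares_inode(const char *existing, const char *new_path)
{
    struct stat a, b;
    if (link(existing, new_path) != 0)
        return -1;
    if (stat(existing, &a) != 0 || stat(new_path, &b) != 0)
        return -1;
    return (a.st_dev == b.st_dev && a.st_ino == b.st_ino) ? 1 : 0;
}
```

The second name survives the process that created it, whereas a descriptor made with dup() disappears when the process exits.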

How do I open a directory at kernel level using the file descriptor for that directory?

I'm working on a project where I must open a directory and read the files/directories inside at kernel level. I'm basically trying to find out how ls is implemented at kernel level.
Right now I've figured out how to get a file descriptor for a directory using sys_open() and the O_DIRECTORY flag, but I don't know how to read the fd that I receive. If anyone has any tips or other suggestions I'd appreciate it. (Keep in mind this has to be done at kernel level).
Edit: To make a long story short, for a school project I am implementing file/directory attributes. I'm storing the attributes in a hidden folder at the same level as the file with a given attribute. (So a file in Desktop/MyFolder has an attributes folder called Desktop/MyFolder/.filename_attr). Trust me, I don't care to mess around in the kernel for funsies. But the reason I need to read a dir at kernel level is that it's a part of the project specs.
To add to caf's answer mentioning vfs_readdir(), reading and writing to files from within the kernel is considered unsafe (except for /proc, which acts as an interface to internal data structures in the kernel).
The reasons are well described in this linuxjournal article, although they also provide a hack to access files. I don't think their method could be easily modified to work for directories. A more correct approach is accessing the kernel's filesystem inode entries, which is what vfs_readdir does.
Inodes are filesystem objects such as regular files, directories, FIFOs and other
beasts. They live either on the disc (for block device filesystems)
or in the memory (for pseudo filesystems).
Notice that vfs_readdir() expects a file * parameter. To obtain a file structure pointer from a user space file descriptor, you should utilize the kernel's file descriptor table.
The kernel.org files documentation says the following on doing so safely:
To look up the file structure given an fd, a reader
must use either fcheck() or fcheck_files() APIs. These
take care of barrier requirements due to lock-free lookup.
An example :
rcu_read_lock();
file = fcheck_files(files, fd);
if (file) {
    // Handling of the file structures is special.
    // Since the look-up of the fd (fget() / fget_light())
    // are lock-free, it is possible that look-up may race with
    // the last put() operation on the file structure.
    // This is avoided using atomic_long_inc_not_zero() on ->f_count
    if (atomic_long_inc_not_zero(&file->f_count))
        *fput_needed = 1;
    else
        /* Didn't get the reference, someone's freed */
        file = NULL;
}
rcu_read_unlock();
....
return file;
atomic_long_inc_not_zero() detects if refcounts is already zero or
goes to zero during increment. If it does, we fail fget() / fget_light().
Finally, take a look at filldir_t, the second parameter type.
You probably want vfs_readdir() from fs/readdir.c.
In general though kernel code does not read directories, user code does.
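For comparison, this is roughly how user code reads a directory starting from a descriptor. It is a user-space sketch only, not kernel code, and it assumes a POSIX.1-2008 libc with fdopendir(); the name count_dir_entries is invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: opens `path` as a directory, wraps the descriptor
 * in a DIR stream, and counts the entries other than "." and "..".
 * Returns the count, or -1 on error. */
int count_dir_entries(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;
    DIR *dir = fdopendir(fd);    /* on success the stream owns fd */
    if (dir == NULL) {
        close(fd);
        return -1;
    }
    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0)
            count++;
    }
    closedir(dir);               /* also closes fd */
    return count;
}
```

This is essentially what ls does at user level: open the directory, then iterate over its entries through the C library.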
