Why is iget() hidden in xv6 - pwd

I'm playing a bit with xv6, a modern implementation of Unix version 6.
For my first hack, I wanted to implement the simple getcwd syscall, but I'm a bit lost as to which level of abstraction I should use.
Should I use the struct file interface?
Or maybe the struct inode interface?
For what matters, it seems it could even be implemented purely in userland.
I started implementing it with struct inode manipulations. My naive idea was to retrieve the proc->cwd, then readi() its second entry (..), scan it to retrieve my previous inum, and so on recursively until I hit the root.
It doesn't seem very efficient, but it will do for a first hack.
My problem though is that I need fs.c:iget() to retrieve a struct inode from the inums I get in the dirents. I've noticed that iget() is static in fs.c and not declared in defs.h which annoys me a bit, but I can't find the reason why.
So, this is my question. Why is it that iget() was deliberately hidden from the rest of the kernel?
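For what it's worth, the ".." walk described above can be tried out in userland before touching the kernel. This is only a minimal sketch, assuming a POSIX host rather than xv6 itself (stock xv6 reads directories with plain read() on struct dirent, as its ls.c does). The name walk_cwd is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: rebuild the absolute path of the current directory
 * by climbing ".." until a directory is its own parent (the root).
 * Returns 0 on success, -1 on error or if the result does not fit. */
int walk_cwd(char *out, size_t outlen)
{
    char up[PATH_MAX] = ".";   /* ".", "./..", "./../..", ... */
    char path[PATH_MAX] = "";  /* components found so far, e.g. "/b/c" */
    char tmp[PATH_MAX];
    struct stat cur, parent, st;
    struct dirent *e;
    DIR *d;
    int found;

    if (stat(up, &cur) < 0)
        return -1;
    for (;;) {
        if (strlen(up) + sizeof "/.." > sizeof up)
            return -1;
        strcat(up, "/..");
        if (stat(up, &parent) < 0)
            return -1;
        /* The root directory is its own parent. */
        if (parent.st_dev == cur.st_dev && parent.st_ino == cur.st_ino)
            break;
        /* Scan the parent for the entry that refers to the child. */
        if ((d = opendir(up)) == NULL)
            return -1;
        found = 0;
        while (!found && (e = readdir(d)) != NULL) {
            if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                continue;
            if (snprintf(tmp, sizeof tmp, "%s/%s", up, e->d_name) >= (int)sizeof tmp)
                continue;
            if (lstat(tmp, &st) < 0)
                continue;
            if (st.st_dev == cur.st_dev && st.st_ino == cur.st_ino) {
                /* Prepend "/name" to what we have so far. */
                if (snprintf(tmp, sizeof tmp, "/%s%s", e->d_name, path) >= (int)sizeof tmp) {
                    closedir(d);
                    return -1;
                }
                strcpy(path, tmp);
                found = 1;
            }
        }
        closedir(d);
        if (!found)
            return -1;
        cur = parent;
    }
    if (path[0] == '\0')
        strcpy(path, "/");
    if (strlen(path) >= outlen)
        return -1;
    strcpy(out, path);
    return 0;
}
```

Each level costs a full scan of the parent directory, which matches the "not very efficient" expectation in the question. Inside the kernel, the same loop would need a way to turn an inum back into a struct inode, which is exactly where iget() comes in.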

Seems to me they were just pragmatic.
iget is used only by the directory manipulation routines.
The directory manipulation routines are in fs.c.
As for the getcwd implementation, it would be much better to follow the chdir syscall code.
The path is there.
You just need to store it, probably in a new field in the proc structure.
Of course, if the path given is relative, you should append it to the current stored path.
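To illustrate the "append it to the stored path" step, here is a rough sketch in plain C. It is not xv6 code: cwd_update and MAXCWD are names invented for the example, and the handling is purely lexical (it assumes no symbolic links, so ".." can simply drop the last component). In a kernel you would presumably call something like this only after the lookup in the chdir syscall has succeeded.

```c
#include <stddef.h>
#include <string.h>

#define MAXCWD 256  /* hypothetical size of the stored path buffer */

/* Apply a chdir argument to a stored absolute path ("/" or "/a/b").
 * Returns 0 on success, -1 if the new path would not fit in MAXCWD. */
int cwd_update(char cwd[MAXCWD], const char *arg)
{
    char next[MAXCWD];
    size_t len, n;

    if (arg[0] == '/')
        strcpy(next, "/");      /* absolute: start over from the root */
    else
        strcpy(next, cwd);      /* relative: extend the stored path */
    len = strlen(next);

    for (;;) {
        while (*arg == '/')     /* skip repeated or trailing slashes */
            arg++;
        n = strcspn(arg, "/");  /* length of the next component */
        if (n == 0)
            break;
        if (n == 1 && arg[0] == '.') {
            /* "." leaves the path unchanged */
        } else if (n == 2 && arg[0] == '.' && arg[1] == '.') {
            /* ".." drops the last component, but never climbs above "/" */
            while (len > 1 && next[len - 1] != '/')
                len--;
            if (len > 1)
                len--;          /* drop the separating slash as well */
            next[len] = '\0';
        } else {
            int at_root = (len == 1);
            if (len + (at_root ? 0 : 1) + n + 1 > MAXCWD)
                return -1;
            if (!at_root)
                next[len++] = '/';
            memcpy(next + len, arg, n);
            len += n;
            next[len] = '\0';
        }
        arg += n;
    }
    strcpy(cwd, next);
    return 0;
}
```

Starting from "/", the arguments "usr/lib" and then "../bin/." would leave "/usr/bin" in the buffer. The work is done in a scratch copy so that a failure leaves the stored path untouched.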

In my opinion, the answer to your question is:
It was insecure and non-generic. (If you could access a file directly via its inode, without traversing dirents, how would security be maintained? You need permission for the file as well as execute permission on the parent directory.)
It was ambiguous to access a file by inode (a file with the same inode might be in several directories, and an inode number is unique only within a given FS).
But perhaps I misunderstood you? Regarding how to get the cwd, I'm pretty certain you'd prefer a pointer to a solution, so maybe this will help:
What is being kept in u.u_dent.u_name? (Take a closer look at user.h.)
You might want to take a look at these too:
http://lwn.net/Articles/254486/
Why can't files be manipulated by inode?
how to get directory name by inode value in c?

Related

Can I access and change my iNode values of a file?

I know that files in unix systems are represented by their inodes.
Can I as a user have access to these values and change them?
Say, replace the values between two adjacent blocks, and in this way change the file?
Can I overwrite only one block in the middle?
I'm asking this in the context of file manipulation in C (I want to write a program that appends to the beginning, or middle part of a file, and not just to the end).
The user has read access to some of this information using the stat() system call provided he has proper access rights to the directory containing the inode.
The information can only be changed indirectly (timestamps, for example, by accessing the file itself). There is no direct way to mess with this information.
Some file systems might give a bit more access possibilities by exposing some of the information in ioctl() calls. What might or might not be exposed is a decision of the driver/file system developer.
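As an illustration of the read-only access mentioned above, here is a small sketch that prints a few inode fields through stat(). The function name print_inode_info is made up for the example, and the 512-byte unit for st_blocks is what Linux uses; other systems may differ.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: print a few inode fields for the given path.
 * Returns 0 on success, -1 if stat() fails. */
int print_inode_info(const char *path)
{
    struct stat st;

    if (stat(path, &st) < 0) {
        perror(path);
        return -1;
    }
    printf("inode:  %llu\n", (unsigned long long)st.st_ino);
    printf("links:  %lu\n", (unsigned long)st.st_nlink);
    printf("size:   %lld bytes\n", (long long)st.st_size);
    /* Allocated blocks; counted in 512-byte units on Linux. */
    printf("blocks: %lld\n", (long long)st.st_blocks);
    return 0;
}
```

Note that none of this exposes the block pointers themselves, which is why you cannot swap two blocks of a file from user code this way.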

Are there any file systems that do not use file paths?

File paths are inherently dubious when working with data.
Let's say I have a hypothetical situation with a program called find_brca, and some data called my.genome, and both are in the /Users/Desktop/ directory.
find_brca takes a single argument, a genome, runs for about 4 hours, and returns the probability of that individual developing breast cancer in their lifetime. Some people, presented with a very high % probability, might then immediately have both of their breasts removed as a precaution.
Obviously, in this scenario, it is absolutely vital that /Users/Desktop/my.genome actually contains the genome we think it does. There are no do-overs. "oops we used an old version of the file from a previous backup" or any other technical issue will not be acceptable to the patient. How do we ensure we are analysing the file we think we are analysing?
To make matters trickier, let's also assert that we cannot modify find_brca itself, because we didn't write it, it's closed source, proprietary, whatever.
You might think MD5 or other cryptographic checksums might be able to come to the rescue, and while they do help to a degree, you can only MD5 the file before and/or after find_brca has run, but you can never know exactly what data find_brca used (without doing some serious low-level system probing with DTrace/ptrace, etc).
The root of the problem is that file paths do not have a 1:1 relationship with actual data. Only in a filesystem where files can only be requested by their checksum - and as soon as the data is modified its checksum is modified - can we ensure that when we feed find_brca the genome's file path 4fded1464736e77865df232cbcb4cd19, we are actually reading the correct genome.
Are there any filesystems that work like this? If I wanted to create such a filesystem because none currently exists, how would you recommend I go about doing it?
I have my doubts about the stability, but hashfs looks exactly like what you want: http://hashfs.readthedocs.io/en/latest/
HashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file’s hash. Typical use cases for this kind of system are ones where: Files are written once and never change (e.g. image storage). It’s desirable to have no duplicate files (e.g. user uploads). File metadata is stored elsewhere (e.g. in a database).
Note: Not to be confused with the hashfs that a student of mine did a couple of years ago: http://dl.acm.org/citation.cfm?id=1849837
I would say that the question is a little vague, however, there are several answers which can be given to parts of your questions.
First of all, not all filesystems lack path/data correspondence. On many (if not most) filesystems, the file is identified only by its path, not by any IDs.
Next, if you want to guarantee that the data is not changed while the application handles them, then the approach depends on the filesystem being used and the way this application works with the file (if it keeps it opened or opens and closes the file as needed).
Finally, if you are concerned about an attacker altering the data on the filesystem in some way while the file data are in use, then you probably have a bigger problem than just file paths, and that problem should be addressed first.
On a side note, you can implement a virtual file system (FUSE on Linux, our CBFS on Windows), which will feed your application with data taken from elsewhere, be it memory, a database or a cloud. This approach answers your question as well.
Update: if you want to get rid of file paths altogether and have the data addressed by hash, then a NoSQL database, where the hash is the key, would probably be your best bet.
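To make the "hash is the key" idea concrete, here is a toy sketch of deriving a storage name from a file's contents. It uses 64-bit FNV-1a purely as a stand-in because it fits in a few lines; FNV-1a is not a cryptographic hash, so a real content-addressed store would use something like SHA-256 instead. The function name content_name is invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: derive a 16-hex-digit name from the data itself.
 * Uses 64-bit FNV-1a, a non-cryptographic stand-in for a real digest. */
void content_name(const void *data, size_t len, char out[17])
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;            /* FNV-1a 64-bit prime */
    }
    snprintf(out, 17, "%016llx", (unsigned long long)h);
}
```

Because the name is a function of the bytes, any change to the data yields a different key, so a reader asking for a given key either gets the expected content or nothing.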

How to add (and use) binary data to compiled executable?

There are several questions dealing with some aspects of this problem, but none seems to answer it wholly. The whole problem can be summarized as follows:
You have an already compiled executable (obviously expecting the use of this technique).
You want to add an arbitrarily sized binary data to it (not necessarily by itself which would be another nasty problem to deal with).
You want the already compiled executable to be able to access this added binary data.
My particular use-case would be an interpreter, where I would like to make the user able to produce a single file executable out of an interpreter binary and the code he supplies (the interpreter binary being the executable which would have to be patched with the user supplied code as binary data).
A similar case is self-extracting archives, where a program (the archiving utility, such as zip) is capable of constructing an executable which contains a pre-built decompressor (the already compiled executable) and user-supplied data (the contents of the archive). Obviously no compiler or linker is involved in this process (Thanks, Mathias for the note and pointing out 7-zip).
Using existing questions a particular path of solution shows along the following examples:
appending data to an exe - This deals with the aspect of adding arbitrary data to arbitrary exes, without covering how to actually access it (basically simple append usually works, also true with Unix's ELF format).
Finding current executable's path without /proc/self/exe - Together with the above, this would allow getting a file name to use for opening the exe, to access the added data. There are many more questions of this kind, but none focuses on the problem of getting a path suitable for actually opening the binary as a file (a goal which alone might (?) be easier to accomplish: you don't even need the path, just the binary opened for reading).
There also may be other, probably more elegant ways around this problem than padding the binary and opening the file for reading it in. For example could the executable be made so that it becomes rather trivial to patch it later with the arbitrarily sized data so it appears "within" it being in some proper data segment? (I couldn't really find anything on this, for fixed size data it should be trivial though unless the executable has some hash)
Can this be done reasonably well with as little deviation from standard C as possible? Even more or less cross-platform? (At least from a maintenance standpoint.) Note that it would be preferred if the program performing the adding of the binary data didn't rely on compiler tools to do it (which the user might not have), but solutions necessitating those might also be useful.
Note the already compiled executable criteria (the first point in the above list), which requires a completely different approach than solutions described in questions like C/C++ with GCC: Statically add resource files to executable/library or SDL embed image inside program executable , which ask for embedding data compile-time.
Additional notes:
The problems with the obvious approach outlined above and suggested in some comments, which is to just append to the binary and use that, are as follows:
Opening the currently running program's binary doesn't seem something trivial (opening the executable for reading is, but not finding the path to supply to the file open call, at least not in a reasonably cross-platform manner).
The method of acquiring the path may provide an attack surface which probably wouldn't exist otherwise. This means that a potential attacker could trick the program into seeing binary data (provided by the attacker) that differs from what the executable actually contains, exposing any vulnerability which might reside in the parser of the data.
It depends on how you want other systems to see your binary.
Digitally signed in Windows
The exe format allows for verifying that the file has not been modified since publishing. This would allow you to:
Compile your file
Add your data packet
Sign your file and publish it.
The advantage of following this system, is that "everybody" agrees your file has not been modified since signing.
The easiest way to achieve this scheme, is to use a resource. Windows resources can be added post- linking. They are protected by the authenticode digital signature, and your program can extract the resource data from itself.
It used to be possible to extend the signature to include binary data. Unfortunately this has been banned: there were binaries which kept data in the signature section, and this was abused maliciously. Some details are on the MSDN blog.
Breaking the signature
If re-signing is not an option, then the result would be treated as insecure. It is worth noting here, that appended data is insecure, and can be modified without people being able to tell, but so is the code in your binary.
Appending data to a binary does break the digital signature, and also means the end-user can't tell if the code has been modified.
This means that any self-protection you add to your code to ensure the data blob is still secure, would not prevent your code from being modified to remove the check.
Running module
Windows GetModuleFileName allows the running path to be found.
Linux offers /proc/self or /proc/pid.
Unix does not seem to have a method which is reliable.
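For the Linux case, a minimal sketch might look like the following. It is Linux-specific by assumption (it relies on /proc being mounted), and the helper names are made up. Opening /proc/self/exe directly, rather than resolving a path first and opening that, avoids handing a separately obtained path to open(), which is the attack surface raised in the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (Linux only): copy the running binary's path to out.
 * Returns 0 on success, -1 on error or if the path may have been truncated. */
int self_exe_path(char *out, size_t cap)
{
    ssize_t n;

    if (cap < 2)
        return -1;
    n = readlink("/proc/self/exe", out, cap - 1);
    if (n < 0 || (size_t)n == cap - 1)  /* error, or possibly truncated */
        return -1;
    out[n] = '\0';                      /* readlink does not terminate */
    return 0;
}

/* Hypothetical helper (Linux only): open the running binary for reading
 * without going through a path string at all. Returns an fd or -1. */
int open_self_exe(void)
{
    return open("/proc/self/exe", O_RDONLY);
}
```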
Data reading
The approach of the zip format is to have a directory written at the end of the file. This means the data can be located from the end of the file, by reading backwards from there to find the start of the data. The advantage is that the data blob is signposted from the end of the file, rather than from its natural start.
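A stripped-down version of that "signpost at the end" idea could look like the sketch below. This is not the zip format: the 16-byte trailer (an 8-byte little-endian length followed by an 8-byte marker) and the names append_blob, read_blob and TRAILER_MAGIC are all invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TRAILER_MAGIC "BLOBv001"  /* hypothetical 8-byte marker */
#define TRAILER_SIZE  16          /* 8-byte length + 8-byte marker */

/* Append payload + trailer to an existing file. Returns 0 or -1. */
int append_blob(const char *path, const void *data, uint64_t len)
{
    unsigned char trailer[TRAILER_SIZE];
    FILE *f = fopen(path, "ab");
    int i, ok;

    if (!f)
        return -1;
    for (i = 0; i < 8; i++)                  /* little-endian length */
        trailer[i] = (unsigned char)(len >> (8 * i));
    memcpy(trailer + 8, TRAILER_MAGIC, 8);
    ok = (len == 0 || fwrite(data, 1, (size_t)len, f) == (size_t)len)
         && fwrite(trailer, 1, TRAILER_SIZE, f) == TRAILER_SIZE;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Find the payload by reading the trailer at the end of the file.
 * Returns a malloc'd buffer (caller frees) or NULL if no valid trailer. */
void *read_blob(const char *path, uint64_t *len_out)
{
    unsigned char trailer[TRAILER_SIZE];
    uint64_t len = 0;
    long trailer_pos;
    void *buf = NULL;
    int i;
    FILE *f = fopen(path, "rb");

    if (!f)
        return NULL;
    if (fseek(f, -(long)TRAILER_SIZE, SEEK_END) != 0)
        goto fail;
    trailer_pos = ftell(f);
    if (trailer_pos < 0 || fread(trailer, 1, TRAILER_SIZE, f) != TRAILER_SIZE)
        goto fail;
    if (memcmp(trailer + 8, TRAILER_MAGIC, 8) != 0)
        goto fail;
    for (i = 0; i < 8; i++)
        len |= (uint64_t)trailer[i] << (8 * i);
    if (len > (uint64_t)trailer_pos)         /* length must fit in the file */
        goto fail;
    if (fseek(f, trailer_pos - (long)len, SEEK_SET) != 0)
        goto fail;
    buf = malloc(len ? (size_t)len : 1);
    if (!buf || fread(buf, 1, (size_t)len, f) != (size_t)len)
        goto fail;
    fclose(f);
    *len_out = len;
    return buf;
fail:
    free(buf);
    fclose(f);
    return NULL;
}
```

The reader never needs to know where the executable proper ends; it only trusts the last 16 bytes. As noted above, nothing here protects the blob from modification.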

How to get a list of all files under a directory by type in C

I have to simulate the ls function of unix in C.
I have to create a program to "Get all files under any directory by type in C".
I googled and found programs which get lists of files, but they are alphabetically sorted; I want sorted by type of file. Please, can anyone help me?
You must store the files somewhere in memory. Since this looks like a school project, I'd suggest to load the file names into a linked list, and employ one of the algorithms for sorting linked lists. This might well be the purpose of the exercise itself.
For file type let's assume "extension", i.e., a .MP3 is type "Fraunhofer MPEG Layer 3" even if somebody might have renamed a .WMA file and called it .MP3. To detect "true" file type you'd need to employ something called a "magic file", and there is a libmagic out there, but is it worth it? (If it is a doctorate thesis or a commercial program the answer is 'hell yes'. If the program is to be graded by your average professor, then you judge whether to risk being considered 'too clever').
The linked list entry ought to be a struct containing the file name and a pointer to its extension; this last you can find by considering that the extension is "whatever follows the last dot in the file name", so you can leverage the strrchr function. Remember that some files will have no extension.
You judge whether to employ a memory-saving hack such as storing the pointer to the extension, or duplicating the extension with strdup. The former is faster and leaner, but you must remember that the first struct pointer (filename) MUST be freed, and the second ABSOLUTELY MUSTN'T. Having two pointers behave differently might be considered bad coding practice (it is) or a clever hack (it is), depending on whether you value maintenance time or speed/memory.
As for the retrieval of file names itself, there's no fully portable way, which has led to the development of libraries such as Boost. But for a school project, maybe you could restrict yourself to POSIX systems and use opendir, readdir, closedir and stat.
A good practice and a wise thing to do would be to separate the operations (directory retrieval, list sorting, list display) in different functions, so that you can test them separately and incrementally (e.g.: the folder retrieval does retrieve everything and you can display the files in unsorted order, etc.).
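To give an idea of how the pieces fit together, here is a rough sketch assuming a POSIX system. It deliberately swaps the linked list suggested above for a growable array sorted with qsort, only because that is shorter to show; the extension is found with strrchr as described. The names ext_of, by_ext and list_by_ext are invented for the example.

```c
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Extension = whatever follows the last dot; "" if there is none
 * (a leading dot, as in ".profile", does not count). */
static const char *ext_of(const char *name)
{
    const char *dot = strrchr(name, '.');
    return (dot && dot != name) ? dot + 1 : "";
}

/* qsort comparator: order by extension, then by full name. */
static int by_ext(const void *a, const void *b)
{
    const char *x = *(const char *const *)a;
    const char *y = *(const char *const *)b;
    int c = strcmp(ext_of(x), ext_of(y));
    return c ? c : strcmp(x, y);
}

/* Read dir into a malloc'd array of malloc'd names, sorted by extension.
 * Returns 0 on success (caller frees each name and the array), -1 on error. */
int list_by_ext(const char *dir, char ***out, size_t *count)
{
    size_t n = 0, cap = 16, len;
    char **names = malloc(cap * sizeof *names), **grown;
    struct dirent *e;
    DIR *d = opendir(dir);

    if (!names || !d)
        goto fail;
    while ((e = readdir(d)) != NULL) {
        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
            continue;
        if (n == cap) {                      /* grow the array when full */
            grown = realloc(names, 2 * cap * sizeof *names);
            if (!grown)
                goto fail;
            names = grown;
            cap *= 2;
        }
        len = strlen(e->d_name) + 1;
        if ((names[n] = malloc(len)) == NULL)
            goto fail;
        memcpy(names[n++], e->d_name, len);  /* copy: readdir reuses its buffer */
    }
    closedir(d);
    qsort(names, n, sizeof *names, by_ext);
    *out = names;
    *count = n;
    return 0;
fail:
    if (d)
        closedir(d);
    while (n > 0)
        free(names[--n]);
    free(names);
    return -1;
}
```

Retrieval and sorting stay in separate functions here, so the listing can be printed unsorted first just by skipping the qsort call while testing.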

Changing inode behaviour

I am trying to modify the ext3 file system. Basically I want to ensure that the inode for a file is saved in the same (or adjacent) block as the file that it stores metadata for. Hopefully this should help disk access performance.
I grabbed the kernel source, compiled it, read a bunch about inodes and looked at the inode.c file in the fs subdirectory. However, I am just not sure how I can ensure that any new file being created, and the inode for this file, can be saved in the same or adjacent blocks. Any help or pointers to further readings would be appreciated. Thanks!
Interesting idea.
I'm not deeply familiar with ext3, but I can give you some general pointers.
Currently ext3 stores inodes in predetermined places. Each block group has its own inode table, an array of inodes. So when you have an inode number (i.e., as the result of looking up a filename in a directory), you can find the corresponding inode on disk by using the inode number first to select the correct block group and then to index into that block group's inode table.
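The lookup just described boils down to a division and a remainder. Here is a small sketch of the arithmetic, not actual ext3 code: the struct and function names are invented, inode numbers are taken to start at 1, and inodes_per_group and inode_size stand for values that the real file system reads from its superblock.

```c
#include <stdint.h>

/* Hypothetical result type: where an inode lives on disk. */
struct inode_loc {
    uint32_t group;        /* which block group's inode table */
    uint32_t index;        /* slot within that table */
    uint64_t byte_offset;  /* offset of the slot from the table's start */
};

/* Map an inode number (numbered from 1) to its block group and slot. */
struct inode_loc locate_inode(uint32_t ino, uint32_t inodes_per_group,
                              uint32_t inode_size)
{
    struct inode_loc loc;

    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    loc.byte_offset = (uint64_t)loc.index * inode_size;
    return loc;
}
```

For example, with 8192 inodes per group and 128-byte inodes (illustrative figures only), inode 12 lands in group 0 at slot 11, which is 1408 bytes into that group's inode table. Any scheme that moves inodes next to their data has to replace this fixed mapping with something else.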
If you want to put the inodes next to the corresponding file data, you'll need a new scheme for finding an inode on disk. If you're willing to dedicate a block for each inode, then one possible scheme would be to allocate a new block every time you need an inode and then use the block number as the inode number. This might have the benefit that for small files you could store the data in that same block.
To make something like this happen, creating a new file (i.e., allocating an inode) would have to work very differently than in the current ext3 file system. Instead of using a bitmap to find an unused, pre-allocated and pre-initialized inode, you would have to allocate an empty block and initialize it yourself. So, you'll probably want to look at how the file system allocates blocks when it's writing to a file, then mimic that for allocating an inode.
An alternative scheme would be to store the inode inside the directory. You then save an I/O not because the inode is next to its data, but because when you look up the filename you also read the inode. This was done back in the 90s as an experiment in BSD's FFS file system, and was written up in an excellent USENIX paper. Those ideas never made it into FFS, or into any other mainstream file system that I'm aware of, so it might be interesting to see how they work in ext3.
Regardless of whether you pursue one of these schemes or come up with something of your own, you'll also have to modify mke2fs to initialize the file system on disk in a way that your new file system variant will understand.
Good luck! It sounds like a fun project.
Kudos for getting into file system design!
First, a bit of engineering advice before you get too deep into hacking: make a copy of the ext3 tree and rename the file system to something else. I've found that when introducing experimental changes into a file system, you really don't want it to be used for your main system. Your system should still boot even if you introduce a bug that randomly loses files (it will eventually happen). You'll also need to branch the ext3 userspace tools to work with your new system.
Second, go get a copy of Understanding the Linux Kernel, 3rd ed., by Bovet and Cesati. It presents an organized view of kernel subsystems, and I've found its explanations to be worthwhile. It's written for an older kernel (2.6.x for some x < 15; I forget exactly), but it's still accurate in many places. Read through its descriptions of file systems. I believe it covers ext3.
Third, about your actual project, you aren't proposing a simple modification to ext3. That file system has a pretty straightforward way of mapping an inode number to a disk block. You'll need to find a new way of doing this mapping. I would not anticipate any changes to the rest of ext3. Solving this challenge may be one of the key design points of your architecture. Note that keeping around a big array of inode -> disk block maps doesn't solve your problem: it's probably no better than existing ext3.
