API to set the timestamps on files & directories in btrfs - c

BTRFS files/directories contain these timestamps:
Creation (otime)
Modification (mtime)
Attribute modification (ctime)
Access (atime)
Is there some API where I could set all these timestamps for a file? I googled a bit but haven't found anything yet.
The programming language doesn't matter; I would expect there to be some C API, but Python is fine too and would be nicer.

From C, the mtime and atime can be set using utime(2) and its relatives. utime(2) itself gives you seconds precision, utimes(2) has microseconds, and utimensat(2) gives you nanoseconds. There are variants such as futimes(3) and futimens(3) if you have a file descriptor instead of a file name.
Python can provide the same via the os.utime function.
Traditionally it is not possible to arbitrarily modify the otime or ctime, other than by manually editing the raw filesystem. I am not aware that Linux has provided any kernel API to modify them. Of course, you can update the ctime to the current time by changing the file's status in some way (for example with chmod), and you can update the otime to the current time by deleting and recreating the file. In principle you can set them to a different time by changing the system clock first (if you are root), but this is likely to mess up lots of other stuff on the system and is probably a bad idea.
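For the Python side, a minimal sketch of setting atime and mtime with nanosecond precision through os.utime and its ns= keyword. The helper name set_times_ns is just an illustration, and the precision actually stored depends on the filesystem. As noted above, ctime is simply bumped to "now" as a side effect and otime is untouched.

```python
import os
import tempfile


def set_times_ns(path, atime_ns, mtime_ns):
    """Set access and modification times, given as integer nanoseconds since the epoch."""
    os.utime(path, ns=(atime_ns, mtime_ns))


if __name__ == "__main__":
    # Demonstrate on a throwaway file so nothing real is modified.
    fd, demo_path = tempfile.mkstemp()
    os.close(fd)
    set_times_ns(demo_path, 1_500_000_000 * 10**9, 1_400_000_000 * 10**9 + 123_456_789)
    st = os.stat(demo_path)
    print(st.st_atime_ns, st.st_mtime_ns, st.st_ctime_ns)
    os.remove(demo_path)
```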

Related

Are there any file systems that do not use file paths?

File paths are inherently dubious when working with data.
Let's say I have a hypothetical situation with a program called find_brca, and some data called my.genome, and both are in the /Users/Desktop/ directory.
find_brca takes a single argument, a genome, runs for about 4 hours, and returns the probability of that individual developing breast cancer in their lifetime. Some people, presented with a very high % probability, might then immediately have both of their breasts removed as a precaution.
Obviously, in this scenario, it is absolutely vital that /Users/Desktop/my.genome actually contains the genome we think it does. There are no do-overs. "oops we used an old version of the file from a previous backup" or any other technical issue will not be acceptable to the patient. How do we ensure we are analysing the file we think we are analysing?
To make matters trickier, let's also assert that we cannot modify find_brca itself, because we didn't write it, it's closed source, proprietary, whatever.
You might think MD5 or other cryptographic checksums might be able to come to the rescue, and while they do help to a degree, you can only MD5 the file before and/or after find_brca has run, but you can never know exactly what data find_brca used (without doing some serious low-level system probing with DTrace/ptrace, etc).
The root of the problem is that file paths do not have a 1:1 relationship with actual data. Only in a filesystem where files can only be requested by their checksum - and as soon as the data is modified its checksum is modified - can we ensure that when we feed find_brca the genome's file path 4fded1464736e77865df232cbcb4cd19, we are actually reading the correct genome.
Are there any filesystems that work like this? If I wanted to create such a filesystem because none currently exists, how would you recommend I go about doing it?
I have my doubts about the stability, but hashfs looks exactly like what you want: http://hashfs.readthedocs.io/en/latest/
HashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file’s hash. Typical use cases for this kind of system are ones where: Files are written once and never change (e.g. image storage). It’s desirable to have no duplicate files (e.g. user uploads). File metadata is stored elsewhere (e.g. in a database).
Note: Not to be confused with the hashfs that a student of mine wrote a couple of years ago: http://dl.acm.org/citation.cfm?id=1849837
I would say that the question is a little vague, however, there are several answers which can be given to parts of your questions.
First of all, not all filesystems lack path/data correspondence. On many (if not most) filesystems, the file is identified only by its path, not by any IDs.
Next, if you want to guarantee that the data is not changed while the application handles them, then the approach depends on the filesystem being used and the way this application works with the file (if it keeps it opened or opens and closes the file as needed).
Finally, if you are concerned about an attacker altering the data on the filesystem in some way while the file data are in use, then you probably have a bigger problem than just the file paths, and that problem should be addressed beforehand.
On a side note, you can implement a virtual file system (FUSE on Linux, our CBFS on Windows), which will feed your application with data taken from elsewhere, be it memory, a database or a cloud. This approach answers your question as well.
Update: if you want to get rid of file paths altogether and have the data addressed by hash, then probably a NoSQL database, where the hash is the key, would be your best bet.
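To make the "hash is the key" idea concrete, here is a minimal sketch of a content-addressed store in Python. It is a toy under stated assumptions, not how HashFS or any particular database does it: the class name ContentStore, the flat directory layout, and the choice of SHA-256 are all made up for illustration. The point is only that data is requested by its digest and verified again on every read.

```python
import hashlib
import os


class ContentStore:
    """Toy content-addressed store: the SHA-256 digest of the bytes is the only key."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store data and return its key (the hex digest)."""
        key = hashlib.sha256(data).hexdigest()
        path = os.path.join(self.root, key)
        if not os.path.exists(path):  # identical content is stored only once
            with open(path, "wb") as f:
                f.write(data)
        return key

    def get(self, key: str) -> bytes:
        """Return the data for key, refusing to return bytes that no longer match it."""
        with open(os.path.join(self.root, key), "rb") as f:
            data = f.read()
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored bytes do not match their key")
        return data
```

Note that this only helps if the consumer reads through get(); a closed-source program that opens a plain path itself still needs something like the virtual filesystem approach to sit in between.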

How to add (and use) binary data to compiled executable?

There are several questions dealing with some aspects of this problem, but none seems to answer it wholly. The whole problem can be summarized as follows:
You have an already compiled executable (obviously expecting the use of this technique).
You want to add arbitrarily sized binary data to it (not necessarily by itself, which would be another nasty problem to deal with).
You want the already compiled executable to be able to access this added binary data.
My particular use-case would be an interpreter, where I would like to make the user able to produce a single file executable out of an interpreter binary and the code he supplies (the interpreter binary being the executable which would have to be patched with the user supplied code as binary data).
A similar case is self-extracting archives, where a program (the archiving utility, such as zip) is capable of constructing an executable which contains a pre-built decompressor (the already compiled executable) and user-supplied data (the contents of the archive). Obviously no compiler or linker is involved in this process (Thanks, Mathias, for the note and pointing out 7-zip).
Using existing questions a particular path of solution shows along the following examples:
appending data to an exe - This deals with the aspect of adding arbitrary data to arbitrary exes, without covering how to actually access it (basically simple append usually works, also true with Unix's ELF format).
Finding current executable's path without /proc/self/exe - In combination with the above, this would allow getting a file name to use for opening the exe, to access the added data. There are many more questions of this kind; however, none focuses specifically on the problem of getting a path suitable for actually opening the binary as a file (a goal which alone might (?) be easier to accomplish: you don't really even need the path, just the binary opened for reading).
There also may be other, probably more elegant ways around this problem than padding the binary and opening the file for reading it in. For example could the executable be made so that it becomes rather trivial to patch it later with the arbitrarily sized data so it appears "within" it being in some proper data segment? (I couldn't really find anything on this, for fixed size data it should be trivial though unless the executable has some hash)
Can this be done reasonably well with as little deviation from standard C as possible? Even more or less cross-platform? (At least from a maintenance standpoint.) Note that it would be preferred if the program performing the adding of the binary data didn't rely on compiler tools to do it (which the user might not have), but solutions necessitating those might also be useful.
Note the already compiled executable criterion (the first point in the above list), which requires a completely different approach than the solutions described in questions like C/C++ with GCC: Statically add resource files to executable/library or SDL embed image inside program executable, which ask for embedding data at compile time.
Additional notes:
The problems with the obvious approach outlined above and suggested in some comments, which is to just append to the binary and use that, are as follows:
Opening the currently running program's binary doesn't seem something trivial (opening the executable for reading is, but not finding the path to supply to the file open call, at least not in a reasonably cross-platform manner).
The method of acquiring the path may provide an attack surface which probably wouldn't exist otherwise. This means that a potential attacker could trick the program into seeing different binary data (provided by him) than the executable actually has, exposing any vulnerability which might reside in the parser of the data.
It depends on how you want other systems to see your binary.
Digitally signed in Windows
The exe format allows for verifying that the file has not been modified since publishing. This would allow you to:
Compile your file
Add your data packet
Sign your file and publish it.
The advantage of following this system is that "everybody" agrees your file has not been modified since signing.
The easiest way to achieve this scheme is to use a resource. Windows resources can be added post-linking. They are protected by the Authenticode digital signature, and your program can extract the resource data from itself.
It used to be possible to increase the signature to include binary data. Unfortunately this has been banned. There were binaries which used data in the signature section. Unfortunately this was used maliciously. Some details here msdn blog
Breaking the signature
If re-signing is not an option, then the result would be treated as insecure. It is worth noting here, that appended data is insecure, and can be modified without people being able to tell, but so is the code in your binary.
Appending data to a binary does break the digital signature, and also means the end-user can't tell if the code has been modified.
This means that any self-protection you add to your code to ensure the data blob is still secure, would not prevent your code from being modified to remove the check.
Running module
Windows GetModuleFileName allows the running path to be found.
Linux offers /proc/self/exe or /proc/<pid>/exe.
Unix does not seem to have a method which is reliable.
Data reading
The approach of the zip format is to have a directory written at the end of the file. This means the data can be located by starting at the end of the file and then looking backwards for the start of the data. The advantage here is that the data blob is signposted from the end of the file, rather than from its natural start.
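The "signposted from the end" idea can be sketched in a few lines of Python. This is not the zip format itself: the footer layout (an 8-byte little-endian payload length followed by a 4-byte marker) and the marker value BLOB are assumptions invented for the example. The reader seeks to the end of the file, reads the fixed-size footer, and works backwards to the payload.

```python
import os
import struct

MAGIC = b"BLOB"                 # hypothetical marker identifying our footer
FOOTER = struct.Struct("<Q4s")  # payload length (unsigned 64-bit, little-endian) + marker


def append_payload(exe_path, payload: bytes):
    """Append payload plus a fixed-size footer to the end of an existing binary."""
    with open(exe_path, "ab") as f:
        f.write(payload)
        f.write(FOOTER.pack(len(payload), MAGIC))


def read_payload(exe_path):
    """Return the appended payload, or None if no valid footer is present."""
    with open(exe_path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size < FOOTER.size:
            return None
        f.seek(size - FOOTER.size)
        length, magic = FOOTER.unpack(f.read(FOOTER.size))
        if magic != MAGIC or length > size - FOOTER.size:
            return None
        f.seek(size - FOOTER.size - length)
        return f.read(length)
```

At run time the program would call read_payload on its own binary, using whatever path discovery the platform offers (see "Running module" above). None of this protects the payload from tampering, as discussed under "Breaking the signature".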

what's the difference between switch_root and run_init?

What's the difference between switch_root and run_init, besides switch_root being made by busybox while run_init is from klibc?
Thanks very much
They both perform exactly the same function, which is to switch to the "real" root and execv(3) the "real" init(8) program from an initramfs. They both assume that the filesystem that should become the root has been mounted on some directory, which they take as an argument.
(An initramfs is a (usually) temporary in-memory filesystem loaded by the bootloader. Its purpose is to do any setup that might be required before mounting the real root and switching to the real init program.)
Recent source code for run-init can be found here. run_init() is the entry point (called from run-init.c, which parses the arguments).
Recent source code for switch_root can be found here. switch_root_main() is the entry point.
The code is short for both implementations (though a bit tricky), which makes it easy to compare them by eye. The only difference seems to be that they perform slightly different sanity checks, and that recent versions of run-init have an extra option to drop selected capabilities(7) before execv()'ing the new init.

Is there any way to open a file using a diff without patching the original?

For example, I have a 40 MB file, and I want to make some minor changes to it, maybe 20 KB of changes.
I can create a diff between the resulting file and the original, simply enough, either by writing it manually with the application that is making the change, or by taking both the original file and the resulting file and generating the diff from that (using Rabin's polynomial fingerprinting algorithm for example)...
The issue is, in order to read the effective outcome of that diff (the new file), I have to apply the diff to the original, create the resulting new file, and read that... this creates two 40 MB files with only 20 KB of difference between them. It seems logical that one could use the initial file combined with the diff and parse (for reading anyway) the resulting final file without having to create a whole new copy of it.
I have looked through xdiff and it has the functions to create a diff given 2 files, or to apply a diff as a patch to a file, but none to get a simple file handle when provided with the original file and a diff file.
Does such a thing exist? It would be tremendously helpful for storage space savings on larger files, even if only for read-only (write operations could write to a new diff, possibly).
Examples in any language would be fine, although C, Python or PHP would be great if readily available.
Using TortoiseMerge to View Diffs:
You could use TortoiseMerge to view the diff without creating a patch.
Here's an overview of what that looks like. I am also attaching the guide and a download link. If that doesn't suit you, here is a great list of alternative diff tools.
Further Consideration:
Depending on how often you are making changes and your interest in file size savings you may want to consider using a version control system (perhaps you do already). Common options include SVN, Git, and Mercurial.
What you are describing is source code control with delta storage: you store many versions of a file, only the deltas are saved, and you can then request entire files, which are recomposed on the fly. You can choose to access them directly (for example with the appropriate library) or save them locally before access.
Search for how Subversion, Git, Mercurial and so on implement their delta storage and you'll have working examples. Git has a maintenance task that does this internally, using delta storage when it considers it profitable. Git is programmed in C.
That will give you an example of how to access this kind of file sequentially. Once you've got that, composing patches is relatively simple, and if the patch's command list can be accessed efficiently you can also build a random-access solution (as long as the literal part of the patch and the original are accessible).
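As a rough illustration of the random-access idea, here is a sketch in Python of a read-only view over "original plus delta" that never writes out the patched file. The delta representation is an assumption made up for this example (it is not xdiff's or Git's format): a list of ("copy", source_offset, length) operations that reference the original, and ("data", bytes) operations that carry the literal changes.

```python
import bisect


class PatchedView:
    """Read-only view of original + delta; the patched file is never materialised."""

    def __init__(self, original, ops):
        # original: a seekable binary file object holding the unmodified file
        # ops: list of ("copy", src_offset, length) or ("data", literal_bytes)
        self.original = original
        self.ops = ops
        self.starts = []  # output offset at which each op begins
        pos = 0
        for op in ops:
            self.starts.append(pos)
            pos += self._length(op)
        self.size = pos   # size of the virtual patched file

    @staticmethod
    def _length(op):
        return op[2] if op[0] == "copy" else len(op[1])

    def read(self, offset, size):
        """Return up to size bytes of the patched content starting at offset."""
        out = bytearray()
        end = min(offset + size, self.size)
        # Find the op that covers the requested offset.
        i = bisect.bisect_right(self.starts, offset) - 1
        while offset < end:
            op = self.ops[i]
            skip = offset - self.starts[i]
            take = min(self._length(op) - skip, end - offset)
            if op[0] == "copy":
                self.original.seek(op[1] + skip)
                out += self.original.read(take)
            else:
                out += op[1][skip:skip + take]
            offset += take
            i += 1
        return bytes(out)
```

Only the original and the small op list need to be stored. For a 40 MB file with 20 KB of changes, the literal data in the ops would be roughly those 20 KB.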

Order files by creation time to the millisecond in Bash

I need to create a list of files which are located on my hard disk in order of when they arrived on the hard disk. To do so, I have used the following:
ls -lat
which lists all the files in date/time order, however, it only orders them to the nearest second. The problem here is that there are thousands of files and every so often, a few of them come clumped together in the same second. I need the exact correct ordering. I'm guessing the easiest way to do this is to get the creation time to the milli (or perhaps nano) second. To do this, I have tried using the following:
stat $myfile
to look at the modification time, but it always shows hour:minute:second.00000000000.
Is there a way to do this?
Thanks,
Rik
The accuracy depends on the file system you are using, but even with a high-accuracy file system such as ext4, the classic st_mtime field returned by stat(2) is a time_t, which has 1-second resolution (newer POSIX systems additionally expose a nanosecond-resolution st_mtim field).
If you have access to the source of the program spitting out all those files, try setting a timestamp as part of the filename instead and then sort on the filename rather than the modification time.
You'll probably have to write your own stat command, using the stat(2) function.
I'm not sure this is possible. My reasoning:
If you look at the stat() function call, you see that it returns a struct containing information about a file. One of its members is this:
time_t st_mtime; /* time of last modification */
And if you look at the time_t type, well, Wikipedia says this:
Unix and POSIX-compliant systems implement time_t as an integer or real-floating type (typically a 32- or 64-bit integer) which represents the number of seconds since the start of the Unix epoch...
Which means that stat()'s time is in terms of seconds, not milliseconds. I haven't looked at how each inode stores file information, but it might not store info up to the millisecond.
An alternative might be to append the milli/microsecond value to the filename itself when the files are being created and order them that way?
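For what it's worth, where the filesystem and OS do record sub-second timestamps, Python exposes them as integer nanoseconds through the st_mtime_ns field of os.stat. Below is a minimal sketch of a "write your own stat" approach that sorts on that field. It uses the modification time, as the stat attempt in the question did, not a true creation time, and the function name files_by_mtime_ns is only illustrative. If the filesystem stores whole seconds, the extra digits will simply be zero.

```python
import os


def files_by_mtime_ns(directory):
    """Return (mtime_ns, name) pairs for regular files in directory, oldest first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                entries.append((st.st_mtime_ns, entry.name))
    return sorted(entries)


if __name__ == "__main__":
    for mtime_ns, name in files_by_mtime_ns("."):
        # Print seconds and the 9-digit nanosecond remainder separately.
        print(f"{mtime_ns // 10**9}.{mtime_ns % 10**9:09d}  {name}")
```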

Resources