Using modified date of file for comparison. Is it safe? - file

I want to make a procedure that does one-way synchronization of a folder between a server and a client. I was thinking of using the Modified Date as the criterion, provided that only the dates of the server files are used. The procedure will not use the Modified Dates of files on the client at all. It will read dates from the server and compare them with the dates read from the server the last time the procedure ran.
Do you think this is safe?
Is there any possibility that the Modified Date will not change when a file is edited, or that it will change without the contents of the file being touched (e.g. by some strange antivirus program)?

Don't count on the modification date of a file.
Strange programs (antiviruses and such) are not the problem so much as the fact that you simply can't count on the client and server clocks being synchronized.
Why not do a straightforward diff or hash calculation? You can't get a better comparison than that.
Taking performance considerations into account, you can use the following heuristic:
If the date hasn't changed, then the file is obviously the same.
If the date has changed, the file contents might have changed, or might not have (for example, the file was merely touched). In this case, to get a definitive answer, you must examine the file somehow.
Bottom line: the modification date can always give you a true negative (file not changed), but it may sometimes yield a false positive, and in that case you must verify.
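That heuristic can be sketched in C for a POSIX-style system. The `needs_sync` helper and the FNV-1a hash below are illustrative choices, not part of any standard sync API; a real tool would likely use a stronger hash such as SHA-256:

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/stat.h>

/* FNV-1a hash of a file's contents (illustrative; returns 0 on open error). */
static uint64_t file_hash(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    uint64_t h = 1469598103934665603ULL;
    int c;
    while ((c = fgetc(f)) != EOF) {
        h ^= (uint64_t)(unsigned char)c;
        h *= 1099511628211ULL;
    }
    fclose(f);
    return h;
}

/* Returns 1 if the file must be re-synced, 0 if it is certainly unchanged.
   Unchanged mtime = true negative; changed mtime = verify by hashing. */
int needs_sync(const char *path, time_t last_mtime, uint64_t last_hash) {
    struct stat st;
    if (stat(path, &st) != 0) return 1;      /* missing/unreadable: re-sync  */
    if (st.st_mtime == last_mtime) return 0; /* date unchanged: same file    */
    return file_hash(path) != last_hash;     /* date changed: check contents */
}
```

A "touched" file (new mtime, same bytes) falls through to the hash comparison and correctly reports no change.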

You didn't mention what OS you're on, but on UNIX platforms the modification time can be set by client code to any value it wants (see the utimes() API or the touch command). Therefore you shouldn't rely on modification times to tell you whether a file has changed. I imagine Windows is somewhat similar.
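For instance, on a POSIX system any process with write access can back-date a file with the related utime() call; a small sketch (the epoch value chosen is arbitrary):

```c
#include <stdio.h>
#include <sys/stat.h>
#include <utime.h>

/* Back-date a file so it claims it was last modified on 2000-01-01.
   Any process with appropriate permissions can do this, which is why
   modification times cannot be trusted. Returns 0 on success. */
int backdate(const char *path) {
    struct utimbuf t;
    t.actime  = 946684800;  /* access time:       2000-01-01 00:00:00 UTC */
    t.modtime = 946684800;  /* modification time: 2000-01-01 00:00:00 UTC */
    return utime(path, &t);
}
```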

Related

Automatic function signature and call changing and refactoring in C code base

I have found that in CLion it is possible to change a function's signature, e.g. the number and/or order of its parameters, along with all the places it is called from. However, as far as I understand, it allows this only through a dialog window and in a one-by-one manner. Is there a way to write a script which will do this for all signatures and calls taken from the script or some kind of input data? In other words, I want a way to quickly refactor the whole code base after slightly changing something in a "task" script. I mean to modify the refactoring from time to time in one place/file and run it against the whole code base. Otherwise it is a lot of manual work even for one refactoring concept - and what if it changes every day? Here is how it is done in CLion in a one-by-one manner:
https://www.jetbrains.com/help/rider/Refactorings__Change_Signature.html

API to set the timestamps on files & directories in btrfs

BTRFS files/directories contain these timestamps:
Creation (otime)
Modification (mtime)
Attribute modification (ctime)
Access (atime)
Is there some API through which I could set all these timestamps for a file? I googled a bit but haven't found anything yet.
The programming language doesn't matter; I would expect there to be some C API, but Python is fine too and would be nicer.
From C, the mtime and atime can be set using utime(2) and its relatives. utime(2) itself gives you seconds precision, utimes(2) has microsecond precision, and utimensat(2) gives you nanoseconds. There are variants like futimens(2) if you have a file descriptor instead of a file name.
Python can provide the same via the os.utime function.
Traditionally it has not been possible to arbitrarily modify the otime or ctime, other than by manually editing the raw filesystem; I am not aware of any kernel API Linux provides to modify them. Of course, you can update the ctime to the current time by changing the file's status in some way, and you can update the otime to the current time by deleting and recreating the file. In principle you can set them to a different time by changing the system clock first (if you are root), but this is likely to mess up lots of other things on the system and is probably a bad idea.

how to get added content of a file since last modification

I'm working on a project in Go that needs to index recently added file content (using a framework called bleve), and I'm looking for a solution to get the content of a file since its last modification. My current work-around is to record the last indexed position of each file, and during later indexing runs I only retrieve file content starting from the previously recorded position.
So I wonder if there's any library or built-in functionality for this? (doesn't need to be restricted to go, any language could work)
I'll really appreciate it if anyone has a better idea than my work-around as well!
Thanks
It depends on how the files change.
If the files are append-only, then you only need to record the last offset where you stopped indexing, and start from there.
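For the append-only case (and since the question says any language could work), a minimal sketch in C; `read_appended` is a hypothetical helper that returns the newly appended bytes and advances the recorded offset:

```c
#include <stdio.h>
#include <stdlib.h>

/* Return a malloc'd, NUL-terminated copy of everything appended to `path`
   since byte `*offset`, advancing *offset past what was read.
   Returns NULL (with *out_len == 0) if there is nothing new. */
char *read_appended(const char *path, long *offset, long *out_len) {
    *out_len = 0;
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long end = ftell(f);              /* current file size */
    long len = end - *offset;
    if (len <= 0) { fclose(f); return NULL; }
    char *buf = malloc((size_t)len + 1);
    fseek(f, *offset, SEEK_SET);      /* resume where the last pass stopped */
    len = (long)fread(buf, 1, (size_t)len, f);
    buf[len] = '\0';
    fclose(f);
    *offset += len;
    *out_len = len;
    return buf;
}
```

The caller persists `offset` between indexing runs, exactly as the question's work-around describes.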
If the changes can happen anywhere, and the changes mostly replace old bytes with new bytes (like changing pixels of an image), then perhaps you can consider computing a checksum for each small chunk, and only index those chunks whose checksums have changed.
You can check out the crypto package in the Go standard library for computing hashes.
If the changes are line insertion/deletion to text files (like changes to source code), then maybe a diff algorithm can help you find the differences. Something like https://github.com/octavore/delta.
If you're running on a Unix-like system, you could just use tail. If you tell it to follow the file (tail -f), the process will keep waiting after reaching the end of the file. You can invoke this from your program with os/exec and pipe its Stdout to your program, which can then read from it periodically or in a blocking fashion.
The only way I can think of to do this natively in Go is the way you described. There's also a library that tries to emulate tail in Go here: https://github.com/hpcloud/tail

Are there any file systems that do not use file paths?

File paths are inherently dubious when working with data.
Let's say I have a hypothetical situation with a program called find_brca and some data called my.genome, and both are in the /Users/Desktop/ directory.
find_brca takes a single argument, a genome, runs for about 4 hours, and returns the probability of that individual developing breast cancer in their lifetime. Some people, presented with a very high % probability, might then immediately have both of their breasts removed as a precaution.
Obviously, in this scenario, it is absolutely vital that /Users/Desktop/my.genome actually contains the genome we think it does. There are no do-overs. "oops we used an old version of the file from a previous backup" or any other technical issue will not be acceptable to the patient. How do we ensure we are analysing the file we think we are analysing?
To make matters trickier, let's also assert that we cannot modify find_brca itself, because we didn't write it; it's closed source, proprietary, whatever.
You might think MD5 or other cryptographic checksums could come to the rescue, and while they do help to a degree, you can only MD5 the file before and/or after find_brca has run; you can never know exactly what data find_brca actually used (without doing some serious low-level system probing with DTrace/ptrace, etc.).
The root of the problem is that file paths do not have a 1:1 relationship with the actual data. Only in a filesystem where files can only be requested by their checksum - and where the checksum changes as soon as the data is modified - can we ensure that when we feed find_brca the genome's file path 4fded1464736e77865df232cbcb4cd19, we are actually reading the correct genome.
Are there any filesystems that work like this? If I wanted to create such a filesystem because none currently exists, how would you recommend I go about doing it?
I have my doubts about the stability, but hashfs looks exactly like what you want: http://hashfs.readthedocs.io/en/latest/
HashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file’s hash. Typical use cases for this kind of system are ones where: Files are written once and never change (e.g. image storage). It’s desirable to have no duplicate files (e.g. user uploads). File metadata is stored elsewhere (e.g. in a database).
Note: not to be confused with the HashFS a student of mine did a couple of years ago: http://dl.acm.org/citation.cfm?id=1849837
I would say that the question is a little vague; however, several answers can be given to parts of it.
First of all, not all filesystems lack path/data correspondence. On many (if not most) filesystems, the file is identified only by its path, not by any IDs.
Next, if you want to guarantee that the data is not changed while the application handles them, then the approach depends on the filesystem being used and the way this application works with the file (if it keeps it opened or opens and closes the file as needed).
Finally, if you are concerned by the attacker altering the data on the filesystem in some way while the file data are used, then you probably have a bigger problem, than just the file paths, and that problem should be addressed beforehand.
On a side note, you can implement a virtual file system (FUSE on Linux, our CBFS on Windows), which will feed your application with data taken from elsewhere, be it memory, a database or a cloud. This approach answers your question as well.
Update: if you want to get rid of file paths at all and have the data addressed by hash, then probably a NoSQL database, where the hash is the key, would be your best bet.

How to avoid losing data when overwriting a file is interrupted in C

I've written code that saves progress in my game, but one of my biggest fears is the brief window of time during saving when that data might become corrupted should the computer crash or lose power.
Is there standard methodology using only C's standard I/O header, to ensure the previous save/file will be safe should the program crash while overwriting it, that doesn't leave behind temporary files?
A similar idea, incorporating the post's comments (@PC Luddite, @Thomas Padron-McCarthy, @Martin James):
The central issue is that either A) a gap in time exists during which no valid "state" file is present, or B) a small interval exists with 2 valid files. So a worst-case failure ends with 0 files using A and with 2 files using B. Clearly B is preferable.
Writing the state
Assume the code could crash just before, during, or just after any step except #1.
1. Assume the initial state (or progress) file exists: State.txt.
2. If earlier temp files exist, delete them. (House-keeping.)
3. Write the new state to a temporary file: State_tmp1.txt. To know this completed, some check code should be part of the "state". This step could be combined with the previous one as an overwrite.
4. Pedantic step - optional. Rename State_tmp1.txt to State_tmp2.txt to verify that the code has rename privileges.
5. Rename State.txt to State_tmp3.txt. This is the critical step. Up to now, any failure is inconsequential, as State.txt exists. Until the next step is complete, no State.txt may exist, yet at least 1 tmp file exists.
6. Rename State_tmp1.txt to State.txt (or State_tmp2.txt, if step 4 was used).
7. Delete the tmp files.
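The rename sequence above can be sketched using only <stdio.h> (file names as in the steps; the validity check code and the optional step 4 are omitted here). Note that on POSIX, rename() atomically replaces an existing target, while on Windows it fails if the target exists - which is exactly why the old file is parked under a temp name first:

```c
#include <stdio.h>

/* Crash-safe overwrite of State.txt using only standard I/O.
   At every instant, at least one of State.txt / State_tmp*.txt holds a
   complete old or new state. Returns 0 on success, -1 on error. */
int save_state(const char *data) {
    remove("State_tmp1.txt");                /* step 2: house-keeping      */
    remove("State_tmp3.txt");

    FILE *f = fopen("State_tmp1.txt", "w");  /* step 3: write new state    */
    if (!f) return -1;
    if (fputs(data, f) == EOF) { fclose(f); return -1; }
    if (fclose(f) == EOF) return -1;         /* make sure the write stuck  */

    rename("State.txt", "State_tmp3.txt");   /* step 5: park the old state */
    if (rename("State_tmp1.txt", "State.txt") != 0)  /* step 6 */
        return -1;
    remove("State_tmp3.txt");                /* step 7: drop the old copy  */
    return 0;
}
```

A crash at any point leaves either State.txt or a State_tmp*.txt intact, which the reading procedure below can recover from.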
Reading the state
1. Search for State.txt; if it is found and valid (passes the validity check), proceed to step 3.
2. Else, look for the tmp files and use the latest valid one. (More work to restore the state, not yet detailed.)
3. Clean up - delete the tmp files.
Algorithm
The code writes progress(t), t, and a verification code (checksum, CRC, etc.) in one file.
The next time, the code writes progress(t+1), t+1, and its verification code in another file.
Alternate between the two files.
To restore, read both files; at least one will certainly have a valid progress, t, and verification code. If both are good, use the later one (greater t).
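The two-file algorithm can be sketched as follows. The file names, line-based format, and toy checksum are illustrative assumptions (a real game would use a CRC and a sturdier format), and t is assumed non-negative:

```c
#include <stdio.h>
#include <stdint.h>

/* Toy verification code over the progress string and t (stand-in for a CRC). */
static uint32_t checksum(const char *s, long t) {
    uint32_t sum = (uint32_t)t;
    while (*s) sum = sum * 31 + (unsigned char)*s++;
    return sum;
}

/* Write progress(t) plus t and its verification code, alternating between
   state0.txt and state1.txt. Returns 0 on success. */
int write_progress(long t, const char *progress) {
    char name[16];
    sprintf(name, "state%ld.txt", t % 2);   /* alternate between 2 files */
    FILE *f = fopen(name, "w");
    if (!f) return -1;
    fprintf(f, "%ld\n%s\n%u\n", t, progress, checksum(progress, t));
    return fclose(f);
}

/* Restore the newest valid state; returns its t, or -1 if neither file
   passes verification. */
long read_progress(char *progress, size_t cap) {
    long best_t = -1;
    for (int i = 0; i < 2; i++) {
        char name[16], buf[256];
        long t; unsigned sum;
        sprintf(name, "state%d.txt", i);
        FILE *f = fopen(name, "r");
        if (!f) continue;
        if (fscanf(f, "%ld\n%255[^\n]\n%u", &t, buf, &sum) == 3 &&
            sum == checksum(buf, t) && t > best_t) {
            best_t = t;                     /* keep the later valid state */
            snprintf(progress, cap, "%s", buf);
        }
        fclose(f);
    }
    return best_t;
}
```

A crash mid-write corrupts at most the file being written; the other file still verifies, so the previous state survives.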
