Cross-platform way to determine if file has been edited? - c

I am writing a cross-platform (big 3 - Linux, Mac, Windows) backup program, so I need to know whether a file has been edited since last time. My plan is to save the last save time in a file and check the actual state of a folder against the data in the file to determine which files need to be backed up or updated.
I would like to avoid methods that require a lot of processing power (like diff, or counting bytes).
In this similar post, people suggested using fstat(), but that solution would be a last resort for me because I was hoping for a cross-platform solution in pure C. As far as I know, fstat(2) is a system call: it appears in section 2 of my man pages, which (to my understanding) means that it is a system function on Linux and isn't part of the standard C library. I have searched for fstat on Windows, but could only find some Android version.
Is there some other way to access file metadata? Is there some other solution to this? I am open to any suggestions and am ok if it sometimes false-flags, as long as it backs up data correctly and doesn't waste resources on backing up everything all the time.
Please help!
Thank you!

fstat is still the way to do this, but on Windows it's called _fstat. You can check for the _MSC_VER macro which will be defined if you're building with MSVC, and if so create a macro alias for fstat.
You can do the same for struct stat which MSVC calls struct _stat:
#ifdef _MSC_VER
/* MSVC prefixes its POSIX-compatibility names with an underscore. */
#define fstat(fd, buf) _fstat(fd, buf)
typedef struct _stat stat_struct;
#else
/* Linux, macOS and other POSIX systems. */
typedef struct stat stat_struct;
#endif
Then you can use fstat and pass it an argument of type stat_struct for the second argument.
I have a decently sized cross platform open source application that uses this technique.
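For illustration, here is a minimal sketch of reading a file's modification time this way (file_mtime is a made-up helper name, and error handling is kept minimal):

#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

#ifdef _MSC_VER
#define fstat(fd, buf) _fstat(fd, buf)
typedef struct _stat stat_struct;
#else
typedef struct stat stat_struct;
#endif

/* Return the last-modification time of an already-open file descriptor,
   or (time_t)-1 on failure. Comparing this against the timestamp saved in
   the backup index tells you whether the file changed. */
static time_t file_mtime(int fd)
{
    stat_struct sb;
    if (fstat(fd, &sb) != 0)
        return (time_t)-1;
    return sb.st_mtime;
}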

My plan is to save the last save time in a file and check the real situation of a folder against the data in the file to determine which files need to be backed up or updated.
Ok.
I was hoping for a cross-platform solution that can be solved with pure C.
If by "pure C" you mean relying on only language features and library functions defined by the C language specification, then I'm afraid you're out of luck. Pure C (in that sense) has no concept of persistent file metadata such as modification timestamps. All functions and data structures dealing with such things are extensions or third-party libraries.
You can rely on standard POSIX facilities (such as fstat()) for both Linux and Mac, but Windows does not provide that. At least, Windows does not provide it exactly. The Microsoft C library does provide some POSIX compatibility functions, but it somewhat maddeningly uses modified names for them. In particular, it offers several flavors of _fstat() (note leading underscore). With a little bit of macro glue, it should not be too hard to make your program use POSIX fstat() on Linux and Mac, and use one of the _fstat() flavors on Windows.
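For example, the macro glue for the 64-bit flavor might look like this (a sketch; MSVC also offers _fstat32() and _fstati64() variants with matching structure types):

#include <sys/types.h>
#include <sys/stat.h>

#ifdef _MSC_VER
/* MSVC flavor with 64-bit file sizes and 64-bit time_t. */
#define fstat(fd, buf) _fstat64(fd, buf)
typedef struct __stat64 stat_struct;
#else
/* Linux and Mac: plain POSIX fstat(). */
typedef struct stat stat_struct;
#endif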

Related

What is the purpose of libc_nonshared.a?

Why does libc_nonshared.a exist? What purpose does it serve? I haven't been able to find a good answer for its existence online.
As far as I can tell it provides certain symbols (stat, lstat, fstat, atexit, etc.). If someone uses one of these functions in their code, it will get linked into the final executable from this archive. These functions are part of the POSIX standard and are pretty common so I don't see why they wouldn't just be put in the shared or static libc.so.6 or libc.a, respectively.
It was a legacy mistake in glibc's implementing extensibility for the definition of struct stat before better mechanisms (symbol redirection or versioning) were thought of. The definitions of the stat-family functions in libc_nonshared.a cause the version of the structure to bind at link time, and the definitions there call the __xstat-family functions in the real shared libc, which take an extra argument indicating the desired structure version. This implementation is non-conforming to the standard, since each shared library ends up getting its own copy of the stat-family functions with their own addresses, breaking the requirement that pointers to the same function evaluate equal.
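A simplified sketch of that mechanism (not the actual glibc source; STAT_VERSION stands in for glibc's internal _STAT_VER constant):

#include <sys/stat.h>

#define STAT_VERSION 3   /* illustrative stand-in for glibc's _STAT_VER */

/* The versioned entry point exported by the shared libc. */
extern int __xstat(int ver, const char *path, struct stat *buf);

/* The kind of wrapper that lands in libc_nonshared.a: the structure version
   is fixed at link time and forwarded to the shared library. */
int stat(const char *path, struct stat *buf)
{
    return __xstat(STAT_VERSION, path, buf);
}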
Here's the problem. Long ago, members of the struct stat structure had different sizes than they have today. In particular:
uid_t was 2 bytes (though I think this one was fixed in the transition from libc5 to glibc)
gid_t was 2 bytes
off_t was 4 bytes
blkcnt_t was 4 bytes
time_t was 4 bytes
also, timespec wasn't used at all and there was no room for nanosecond precision.
So all of these had to change. The only real solution was to make different versions of the stat() system call and library function, so that you get the version you compiled against. That is, the .a file matches the header files. These things didn't all change at once, but I think we're done changing them now.
You can't really solve this by a macro because the structure name is the same as the function name; and inline wasn't mandated to exist in the beginning so glibc couldn't demand everybody use it.
I remember there used to be this thing O_LARGEFILE for saying you could handle files bigger than 4GB; otherwise things just wouldn't work. We also used to have to define things like _LARGEFILE_SOURCE and _LARGEFILE64_SOURCE, but it's all handled automatically now. Back in the day, if you weren't ready for large file support yet, you didn't define these, you didn't get the 64-bit version of the stat structure, and your program also worked on older kernel versions lacking the new system calls. I haven't checked; it's possible that 32-bit compilation still doesn't define these automatically, but 64-bit always does.
So you probably think: okay, fine, just don't franken-compile stuff? Just build everything that goes into the final executable with the same glibc version and largefile choice. Ever use plugins, such as browser plugins? Those are pretty much guaranteed to be compiled in different places with different compiler and glibc versions and options, and this didn't require you to upgrade your browser and replace all its plugins at the same time.
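For reference, the old-style large-file opt-in mentioned a couple of paragraphs up looked roughly like this (a sketch; the macros must precede every #include, and _FILE_OFFSET_BITS is the related switch that widens off_t to 64 bits on 32-bit glibc builds):

#define _LARGEFILE_SOURCE
#define _LARGEFILE64_SOURCE
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>

/* With these in effect, plain stat(), fstat() and fseeko() handle files
   past the old 2 GB / 4 GB limits on 32-bit builds. */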

Can I seek a position beyond 2GB in C using the standard library?

I am making a program that reads disk images in C. I am trying to make something portable, so I do not want to use too many OS-specific libraries. I am aware there are many disk images that are very large files but I am unsure how to support these files.
I have read up on fseek and it seems to use a long int, which is not guaranteed to support values over 2^31 - 1. fsetpos seems to support a larger value with fpos_t, but an absolute position cannot be specified. I have also thought about using several relative seeks with fseek, but am unsure if this is portable.
How can I portably support large files in C?
There is no portable way.
On Linux there is the fseeko()/ftello() pair (it needs some feature-test defines; check the ftello() man page).
On Windows, I believe you have to use _fseeki64() and _ftelli64().
#ifdef is your friend
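A sketch of that #ifdef approach (seek64/tell64 are made-up wrapper names; on 32-bit Linux you would also want the defines mentioned above, e.g. _FILE_OFFSET_BITS=64, so that off_t is 64 bits wide):

#include <stdio.h>
#include <stdint.h>

#ifdef _WIN32
/* The Windows CRT's 64-bit seek/tell pair. */
static int seek64(FILE *fp, int64_t offset, int whence) { return _fseeki64(fp, offset, whence); }
static int64_t tell64(FILE *fp) { return _ftelli64(fp); }
#else
#include <sys/types.h>
/* POSIX fseeko()/ftello() with a 64-bit off_t. */
static int seek64(FILE *fp, int64_t offset, int whence) { return fseeko(fp, (off_t)offset, whence); }
static int64_t tell64(FILE *fp) { return (int64_t)ftello(fp); }
#endif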
pread() works on any POSIX-compliant platform (OS X, Linux, BSD, etc.). It's missing on Windows but there are lots of standard things that Windows gets wrong; this won't be the only thing in your codebase that needs a Windows special case.
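A sketch of the pread() approach (read_at is a made-up helper name):

#define _FILE_OFFSET_BITS 64   /* so off_t can hold offsets past 2 GB on 32-bit builds */
#include <sys/types.h>
#include <unistd.h>

/* Read len bytes at an absolute offset without moving the file position. */
static ssize_t read_at(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}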
You can't do it with standard C. Even with relative seeks it's not possible on some architectures.
One approach would be to check the platform at compile time. You can just check the value of LONG_MAX and throw a compile error if it's not large enough. But even that doesn't guarantee that the underlying filesystem supports files larger than 2 or 4 GB.
A better way is to use the pre-processor macros supplied by your compiler to check the operating system that your code is being compiled for and write operating-system-specific code. The operating system should provide a way to check that the filesystem actually supports files larger than 2 GB or 4 GB.
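The compile-time check might look something like this (a sketch; it only verifies fseek's long offset, not any filesystem limit):

#include <limits.h>

/* Fail the build if long cannot represent offsets beyond 2 GB, i.e. if
   fseek()/ftell() top out at 2^31 - 1 bytes on this platform. */
#if LONG_MAX <= 0x7FFFFFFFL
#error "long is only 32 bits here; fseek cannot address files larger than 2 GB"
#endif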

How to deal with Unicode paths in a cross-platform C library?

I'm contributing to a C library. It has a function that takes a char* parameter for a file path name. The authors are mostly UNIX developers, and this works fine on unixes where char* mostly means UTF-8. (At least in GCC, the character set is configurable and UTF-8 is the default.)
However, char* means ANSI on Windows, which implies that it is currently impossible to use Unicode path names with this library on Windows, where wchar_t* should be used and only UTF-16 is supported. (A quick search on StackOverflow reveals that the ANSI Windows API functions can not be used with UTF-8.)
The question is, what is the right way to deal with this? We've come up with various ways to do it, but neither of us is a Windows expert, so we can't really decide how to do it properly. Our goal is that the users of the library should be able to write cross-platform code that would work on Unixes as well as Windows.
Under the hood, the library has #ifdefs in place to differentiate between operating systems so that it can use POSIX functions on UNIXes and Win32 APIs on Windows.
So far, we've come up with the following possibilities:
Offer a separate windows-only function that accepts a wchar_t*.
Require UTF-16 on Windows and #ifdef the library header in such a way that the function would accept wchar_t* on Windows.
Add a flag that would tell the function to cast the given char* to wchar_t* and call the widechar Windows APIs.
Create a variant of the function that takes a file descriptor (or file handle on Windows) instead of a file path.
Always require UTF-8 (even on Windows), and then inside the function, convert UTF-8 to UTF-16 and call the widechar Windows APIs.
The problem with options 1-4 is that they would require the user to consciously take care of portability themselves. Option 5 sounds good, but I'm not sure if this is the right way to go.
I'm also open to other suggestions or ideas that can solve this. :)
Since portability is an important goal for you, I think it is imperative for your function semantics to be precisely defined. Among other things, that means that the arguments' types and meanings don't vary across platforms. So, if you have a function that accepts regular char based paths then it should accept such paths on all systems, and the encoding expected of those paths should be well-defined (which does not necessarily mean "the same"). That rules out options (2) and (3).
Moreover, portability requires the same functions to be usable across all platforms; that rules out (1). Option (4) could be ok if a stream- and/or file descriptor-based approach were the only one provided by your library, but it yields portability only with respect to those functions, not with respect to the path-based ones. (And note that stream (FILE *) APIs are defined by C, whereas file descriptors are a POSIX concept, not native to C. In principle, therefore, streams are more portable than file descriptors.)
(5) could work, but it places stronger constraints than you actually need. It is not essential for the function to define the encoding expected (though it can); it suffices for it to define how that encoding is determined.
Additionally, you could add wchar_t-based functions that work everywhere (as opposed to Windows-only). Those might be more convenient for Windows users. Similar to alternative (4), however, that provides portability only with respect to those functions. Supposing that you don't want to drop the char-based ones, you would need to pair this alternative with some variation on (5).
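A sketch of what option (5) might look like at the implementation level (open_for_read is a hypothetical library function; real code would size the buffer dynamically and report conversion errors):

#include <stdio.h>

#ifdef _WIN32
#include <windows.h>

static FILE *open_for_read(const char *utf8_path)
{
    wchar_t wpath[MAX_PATH];
    /* Convert the caller's UTF-8 path to UTF-16 and use the wide-char API. */
    if (MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wpath, MAX_PATH) == 0)
        return NULL;
    return _wfopen(wpath, L"rb");
}
#else
static FILE *open_for_read(const char *utf8_path)
{
    /* On POSIX systems the UTF-8 bytes are passed through unchanged. */
    return fopen(utf8_path, "rb");
}
#endif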

How to walk a directory in C

I am using glib in my application, and I see there are convenience wrappers in glib for C's remove, unlink and rmdir. But these only work on a single file or directory at a time.
As far as I can see, neither the C standard nor glib include any sort of recursive directory walk functionality. Nor do I see any specific way to delete an entire directory tree at once, as with rm -rf.
For what I'm doing, I'm not worried about any complications like permissions, symlinks back up the tree (infinite recursion), or anything that would rule out a very naive implementation... so I am not averse to writing my own function for it.
However, I'm curious if this functionality is already out there somewhere in the standard libraries, GTK, or glib (or in some other easily reused C library) and I just haven't stumbled on it. Googling this topic generates a lot of false leads.
Otherwise my plan is to use this type of algorithm:
dir_walk(char *path, void (*callback)(char *)) {
    if (is_dir(path) && has_entries(path)) {
        entries = get_entries(path);
        for (entry in entries) { dir_walk(entry, callback); }
    }
    else { callback(path); }
}

dir_walk("/home/user/trash", remove);
Obviously I would build in some error handling and the like to abort the process as soon as a fatal error is encountered.
Have you looked at <dirent.h>? AFAIK this belongs to the POSIX specification, which should be part of the standard library of most, if not all C compilers. See e.g. this <dirent.h> reference (Single UNIX specification Version 2 by the Open Group).
P.S., before someone comments on this: no, this does not offer recursive directory traversal. But then I think this is best implemented by the developer; requirements can differ quite a lot, so a one-size-fits-all recursive traversal function would have to be very powerful. (E.g.: Are symlinks followed? Should recursion depth be limited? etc.)
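For instance, a naive sketch of the asker's dir_walk built on <dirent.h> (error handling and symlink checks are glossed over, the path buffer is fixed-size, and unlike the pseudocode above the directory itself is passed to the callback after its contents, so that remove() can delete the then-empty directory):

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static void dir_walk(const char *path, int (*callback)(const char *))
{
    struct stat sb;
    if (stat(path, &sb) == 0 && S_ISDIR(sb.st_mode)) {
        DIR *dir = opendir(path);
        if (dir != NULL) {
            struct dirent *ent;
            while ((ent = readdir(dir)) != NULL) {
                if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
                    continue;               /* skip self and parent entries */
                char child[4096];
                snprintf(child, sizeof child, "%s/%s", path, ent->d_name);
                dir_walk(child, callback);
            }
            closedir(dir);
        }
    }
    callback(path);   /* remove() handles both files and (now empty) directories */
}

/* Usage: dir_walk("/home/user/trash", remove); */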
You can use GFileEnumerator if you want to do it with glib.
Several platforms include ftw and nftw: "(new) file tree walk". Checking the man page on an iMac shows that these are legacy, and new users should prefer fts. Portability may be an issue with either of these choices.
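If nftw() is available, the whole delete-a-tree job is roughly this (a sketch; FTW_DEPTH visits a directory after its contents and FTW_PHYS avoids following symlinks):

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* Called for every entry in the tree; a nonzero return (a failed remove())
   stops the walk. */
static int rm_entry(const char *path, const struct stat *sb,
                    int typeflag, struct FTW *ftwbuf)
{
    (void)sb; (void)typeflag; (void)ftwbuf;
    return remove(path);
}

int main(void)
{
    return nftw("/home/user/trash", rm_entry, 16, FTW_DEPTH | FTW_PHYS);
}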
Standard C libraries are meant to provide primitive functionality. What you are talking about is composite behavior. You can easily implement it using the low level features present in your API of choice -- take a look at this tutorial.
Note that the "convenience wrappers" you mention for remove(), unlink() and rmdir(), assuming you mean the ones declared in <glib/gstdio.h>, are not really "convenience wrappers". What is the convenience in prefixing totally standard functions with a "g_"? (And note that I say this even though I am the one who introduced them in the first place.)
The only reason these wrappers exist is for file name issues on Windows, where these wrappers actually consist of real code; they take file name arguments in Unicode, encoded in UTF-8. The corresponding "unwrapped" Microsoft C library functions take file names in system codepage.
If you aren't specifically writing code intended to be portable to Windows, there is no reason to use the g_remove() etc wrappers.

What is the easiest way to look up function names of a C binary in a cross-platform manner?

I want to write a small utility to call arbitrary functions from a C shared library. The user should be able to list all the exported functions, similar to what objdump or nm does. I checked these utilities' source, but it is intimidating. I couldn't find enough information on Google about whether the dl library has this functionality either.
(Clarification edit: I don't want to just call a function which is known beforehand. I will appreciate an example fragment along your answer.)
This might be near to what you're looking for:
http://python.net/crew/theller/ctypes/
Well, I'll speak a little bit about Windows. The C functions exported from DLLs do not contain information about the types, names, or number of arguments -- nor do I believe you can determine what the calling convention is for a given function.
For comparison, take a look at National Instrument's LabVIEW programming environment. You can import functions from DLLs, but you have to manually type in the type and names of the arguments before you use a given function. If this limitation is OK, please edit your question to reflect that.
I don't know what is possible with *nix environments.
EDIT: Regarding your clarification. If you don't know what the function is ahead of time, you're pretty screwed on Windows because, in general, you won't be able to determine the number and types of arguments the functions take.
You could try ParaDyn's SymtabAPI. It lets you grab all the symbols in a shared library (or executable) and look at their types, offset, etc. It's all wrapped up in a reasonably nice C++ interface and runs on a lot of platforms. It also provides support for binary rewriting, which you could potentially use to do what you're talking about at runtime.
Webpage is here:
http://www.paradyn.org/html/symtab2.1-features.html
Documentation is here:
http://ftp.cs.wisc.edu/paradyn/releases/release5.2/doc/symtabProgGuide.21.pdf
A standard-ish API is the dlopen/dlsym API; AFAIK it's implemented by GNU libc on Linux and Mac OS X's standard C library (libSystem), and it might be implemented on Windows by MinGW or other compatibility packages.
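For completeness, a sketch of the dlopen/dlsym side (note that dlsym only retrieves a symbol whose name you already know, which is exactly the limitation the question is trying to get around; the library name is Linux-specific, and older glibc needs -ldl at link time):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *handle = dlopen("libm.so.6", RTLD_LAZY);   /* Linux name; differs per platform */
    if (handle == NULL) {
        fprintf(stderr, "%s\n", dlerror());
        return 1;
    }
    /* Look up a symbol by name and call it through a function pointer. */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (cosine != NULL)
        printf("cos(0) = %f\n", cosine(0.0));
    dlclose(handle);
    return 0;
}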
The only sensible solution (without reinventing the wheel) seems to be libbfd. The downsides are that its documentation is scarce and it is a bit bloated for my purposes.
The source code for nm and objdump is available. If you want to start from the specification, then ELF is what you want to look into.
I've written something like this in Perl. On Win32 it runs dumpbin /exports, on POSIX it runs nm -gP. Then, since it's Perl, the results are interpreted using regular expressions: / _(\S+)@\d+/ for Win32 (stdcall functions) and /^(\S+) T/ for POSIX.
Eek! You've touched on one of the very platform-dependent topics of programming. On Windows, you have DLLs; on Linux, you have ld.so and ld-linux.so; and on Mac OS X, dyld.
