I'm implementing a pthread replacement library which is extremely lightweight. There are several reasons why I want to disable __thread completely.
It's a waste of memory. If I create a thousand threads that have nothing to do with the code that declares a __thread variable, the program will still have allocated 1000 times the size of that data and never use it. It's simply not compatible with the mass-concurrency model. If we need extremely lightweight fibers with only 8K of stack, a TLS block of just 4K would be an overhead of 50% of the memory used by each thread. In some scenarios the TLS overhead would be enormous.
TLS is a complex standard and I simply don't have the time/resources to support it. It's too expensive. Personally I think the standard is poorly designed. It should have defined standard functions that had to be provided by the linker so thread libraries can take control over where TLS allocation takes place and insert the relevant offsets and addresses they require. Also the standard ELF implementation has been infected with pthread, expecting pthread-sized structs to calculate offsets, making it really hard to adapt to something else.
It's just a bad pattern. It encourages using globals and creating functions with static state and side effects. This is not a territory we want to be in if we're creating correct programs that are easy to analyze.
If we really need "thread context" for some magic that tracks thread state behind the scenes (like for allocation or cancellation tracking) why not just expose the magic that TLS uses to understand that context in the first place? Personally I'm just going to use the %fs register directly. This would not be possible in libraries for obvious reasons but why should they be thread aware to begin with? Why not just design them correctly so they get the context related data they need in the first place right in the argument list?
My question is simply: What is the simplest way to disable __thread support and make clang emit errors if you accidentally used it? How can I get errors if I load a dynamic library which happens to require TLS?
I believe the simplest way is adding something like this unconditionally to your CFLAGS (perhaps from clang's equivalent of the gcc specfile if you want it to be system-global):
-D__thread='^-^'
where the righthand side can be anything that's syntactically invalid (a constraint violation) at any point in a C program.
As for preventing loading of libraries with TLS, you'd have to patch the linker and/or dynamic linker to reject them. If you're just talking about dlopen, your program could first read the file and parse the ELF headers for TLS relocations, then reject the library (without passing it to dlopen) if it has any. This might even be possible with an LD_PRELOAD wrapper.
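A minimal sketch of such a pre-dlopen check, assuming a Linux system that provides <elf.h>, a 64-bit ELF file already read into memory, and a host whose byte order matches the file's. Instead of walking the relocation tables it looks for a PT_TLS program header, which marks a library that carries its own TLS block; a library that only references TLS symbols defined elsewhere would slip past this simpler test. The function name elf64_image_has_tls is made up for illustration.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the ELF64 image contains a PT_TLS program header,
 * 0 if it contains none, and -1 if the buffer does not look like a
 * well-formed ELF64 image. Assumes file and host byte order match. */
int elf64_image_has_tls(const unsigned char *img, size_t len)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    size_t i;

    assert(img != NULL);

    if (len < sizeof eh)
        return -1;
    memcpy(&eh, img, sizeof eh);

    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;

    if (eh.e_phnum != 0 &&
        (eh.e_phentsize < sizeof ph || eh.e_phoff > len))
        return -1;

    for (i = 0; i < eh.e_phnum; i++) {
        size_t off = (size_t)eh.e_phoff + i * eh.e_phentsize;

        /* Reject program headers that would run past the buffer. */
        if (off > len - sizeof ph)
            return -1;
        memcpy(&ph, img + off, sizeof ph);

        if (ph.p_type == PT_TLS)
            return 1;
    }
    return 0;
}
```

A wrapper around dlopen could read the file, call this function, and refuse to load the library when the result is anything other than 0.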
I agree with you that, especially in its current implementation, TLS is something whose use should generally be avoided, but may I ask if you've measured the costs? I think stamping it out completely will be fairly difficult on a system that's designed to use it, and there's much lower-hanging fruit for cutting bloat. Which libc are you using? If it's glibc, I'm pretty sure glibc has lots of TLS it uses internally itself these days... Of course if you're writing your own threads implementation, that's going to entail a lot of interaction with the rest of the standard library, so perhaps you're already patching it out...?
By the way (shameless plug follows), we have an extremely light-weight threads implementation in musl libc which presently does not have TLS. I don't think it would be easy to integrate with another libc (and I'm sure if you're writing your own you will find it difficult to integrate with glibc, especially glibc's dynamic linker, which expects TLS to be supported) but if you can use the whole library as-is, it may meet your needs for specific projects, or have useful code you can borrow (license is MIT).
Suppose that I have a C library that requires initialization and cleanup functions that aren’t thread-safe. Specifically, these functions may invoke other thread-unsafe functions in other libraries. I don’t know (in a default build) which libraries these will be.
Now consider the case of writing Java bindings to this library. Java spawns multiple threads before running any Java code. Worse, in the case of (say) an Eclipse plugin, there could be multiple threads running Java code by the time my code receives control. Some of the other threads could be using the aforementioned unsafe functions.
My current plan is to statically link the C library (in my case, libcurl) and all transitive dependencies – in my case, a TLS library (probably mbedTLS) and (on Windows platforms) the CRT. Fortunately, libcurl cleans up everything it has allocated, so problems related to allocating from one heap and freeing it on another should not arise. Because everything is statically linked, and won’t try to load any other shared libraries, I can then initialize libcurl from a static initializer.
Will this even work? Is there a better way?
Edit: The reason that serializing library calls won’t work, and that I believe that my solution might work, is that the global state is stored not only in libcurl itself, but also in libraries libcurl depends on. Some of these libraries (ex. OpenSSL) might be in use by other code when my code is loaded. So I would need to lock against the entire process.
The reason I believe that isolating the global state would work is that libcurl (and every library it depends on) is thread safe after initialization. I need to make sure that the initialization of libcurl doesn’t create race conditions. Afterwards I am fine.
[Updated and revised]
Your concern seems to be that you will have both direct and indirect bindings to some native library -- say mbedTLS --, that that native library requires one-time initialization that is not thread-safe, and that, beyond your ability to detect or control, different threads of the process may concurrently attempt to initialize that library, or perhaps may (unsafely) attempt to initialize it more than once. That certainly seems to be a worst-case scenario.
On the other hand, you postulate that you can successfully build a monolithic, dynamically-loadable library containing the native library you want along with the transitive closure of all its dependencies (outside the kernel), so as to ensure that this library does not share state with any other library loaded by the process. You assert that after a non-thread-safe initialization, the combined stack will be thread safe, at least as you intend to use it. You want to know about how to initialize the library.
Java promises that each class will be initialized by exactly one thread, and that afterward its initialized state will be visible to all threads. Although that does not explicitly address the question, it certainly implies that if the initialization of your native libraries is performed entirely as part of the initialization of a class -- e.g. via a static initializer, as you propose -- then the correct initialized state will be visible to all Java threads. That adequately addresses the problem as I understand it.
I remain dubious that building the monolithic library is necessary, but if you truly have to deal with the worst-case scenario you seem to anticipate then perhaps it is. Inasmuch as you cannot isolate the library from conflicting demands on the kernel, however, it is conceivable that the strategy will not be sufficient. That would be one of the few conceivable good reasons for a library to rely on the kind of shared state you postulate, and your strategy would thwart that particular purpose. I cannot judge how probable such an eventuality might be, but I doubt it's very likely.
I was looking at the NtDll export table on my Windows 10 computer, and I found that it exports standard C runtime functions, like memcpy, sprintf, strlen, etc.
Does that mean that I can call them dynamically at runtime through LoadLibrary and GetProcAddress? Is this guaranteed to be the case for every Windows version?
If so, is it possible to drop the C runtime library altogether (by just using the CRT functions from NtDll), thereby making my program smaller?
There is absolutely no reason to call these undocumented functions exported by NtDll. Windows exports all of the essential C runtime functions as documented wrappers from the standard system libraries, namely Kernel32. If you absolutely cannot link to the C Runtime Library*, then you should be calling these functions. For memory, you have the basic HeapAlloc and HeapFree (or perhaps VirtualAlloc and VirtualFree), ZeroMemory, FillMemory, MoveMemory, CopyMemory, etc. For string manipulation, the important CRT functions are all there, prefixed with an l: lstrlen, lstrcat, lstrcpy, lstrcmp, etc. The odd man out is wsprintf (and its brother wvsprintf), which not only has a different prefix but also doesn't support floating-point values (Windows itself had no floating-point code in the early days when these functions were first exported and documented.) There are a variety of other helper functions, too, that replicate functionality in the CRT, like IsCharLower, CharLower, CharLowerBuff, etc.
Here is an old knowledge base article that documents some of the Win32 Equivalents for C Run-Time Functions. There are likely other relevant Win32 functions that you would probably need if you were re-implementing the functionality of the CRT, but these are the direct, drop-in replacements.
Some of these are absolutely required by the infrastructure of the operating system, and would be called internally by any CRT implementation. This category includes things like HeapAlloc and HeapFree, which are the responsibility of the operating system. A runtime library only wraps those, providing a nice standard-C interface and some other niceties on top of the nitty-gritty OS-level details. Others, like the string manipulation functions, are just exported wrappers around an internal Windows version of the CRT (except that it's a really old version of the CRT, fixed back at some time in history, save for possibly major security holes that have gotten patched over the years). Still others are almost completely superfluous, or seem so, like ZeroMemory and MoveMemory, but are actually exported so that they can be used from environments where there is no C Runtime Library, like classic Visual Basic (VB 6).
It is also interesting to point out that many of the "simple" C Runtime Library functions are implemented by Microsoft's (and other vendors') compiler as intrinsic functions, with special handling. This means that they can be highly optimized. Basically, the relevant object code is emitted inline, directly in your application's binary, avoiding the need for a potentially expensive function call. Allowing the compiler to generate inlined code for something like strlen, that gets called all the time, will almost undoubtedly lead to better performance than having to pay the cost of a function call to one of the exported Windows APIs. There is no way for the compiler to "inline" lstrlen; it gets called just like any other function. This gets you back to the classic tradeoff between speed and size. Sometimes a smaller binary is faster, but sometimes it's not. Not having to link the CRT will produce a smaller binary, since it uses function calls rather than inline implementations, but probably won't produce faster code in the general case.
* However, you really should be linking to the C Runtime Library bundled with your compiler, for a variety of reasons, not the least of which is security updates that can be distributed to all versions of the operating system via updated versions of the runtime libraries. You have to have a really good reason not to use the CRT, such as if you are trying to build the world's smallest executable. And not having these functions available will only be the first of your hurdles. The CRT handles a lot of stuff for you that you don't normally even have to think about, like getting the process up and running, setting up a standard C or C++ environment, parsing the command line arguments, running static initializers, implementing constructors and destructors (if you're writing C++), supporting structured exception handling (SEH, which is used for C++ exceptions, too) and so on. I have gotten a simple C app to compile without a dependency on the CRT, but it took quite a bit of fiddling, and I certainly wouldn't recommend it for anything remotely serious. Matthew Wilson wrote an article a long time ago about Avoiding the Visual C++ Runtime Library. It is largely out of date, because it focuses on the Visual C++ 6 development environment, but a lot of the big picture stuff is still relevant. Matt Pietrek wrote an article about this in the Microsoft Journal a long while ago, too. The title was "Under the Hood: Reduce EXE and DLL Size with LIBCTINY.LIB". A copy can still be found on MSDN and, in case that becomes inaccessible during one of Microsoft's reorganizations, on the Wayback Machine. (Hat tip to IInspectable and Gertjan Brouwer for digging up the links!)
If your concern is just the need to distribute the C Runtime Library DLL(s) alongside your application, you can consider statically linking to the CRT. This embeds the code into your executable, and eliminates the requirement for the separate DLLs. Again, this bloats your executable, but does make it simpler to deploy without the need for an installer or even a ZIP file. The big caveat of this, naturally, is that you cannot benefit from incremental security updates to the CRT DLLs; you have to recompile and redistribute the application to get those fixes. For toy apps with no other dependencies, I often choose to statically link; otherwise, dynamically linking is still the recommended scenario.
There are some C runtime functions in NtDll. According to Windows Internals these are limited to string manipulation functions. There are other equivalents such as using HeapAlloc instead of malloc, so you may get away with it depending on your requirements.
Although these functions are acknowledged by Microsoft publications and have been used for many years by the kernel programmers, they are not part of the official Windows API and you should not use them for anything other than toy or demo programs, as their presence and function may change.
You may want to read a discussion of the option for doing this for the Rust language here.
Does that mean that I can call them dynamically at runtime through LoadLibrary and GetProcAddress?
Yes. Even more: why not use ntdll.lib (or ntdllp.lib) for static binding to ntdll? After that you can call these functions directly, without any GetProcAddress.
Is this guaranteed to be the case for every Windows version?
From NT4 to Windows 10, ntdll has exported many C runtime functions, but the set differs between versions; usually it grows from version to version. Some of them are less functional than their msvcrt.dll counterparts. For example, the printf-style functions in ntdll do not support floating-point formats. In general, though, the functionality is the same.
it is possible to drop the C runtime library altogether (by just using the CRT functions from NtDll), therefore making my program smaller?
yes, this is 100% possible.
I am a beginner C/C++ programmer first of all, but I am curious about it.
My question is more theoretical.
I heard that C does not have explicit multithreading (MT) support, however there are libraries which implement this. I found "process.h" header which has to be included for building MT programs, but the thing I don't understand is how the MT itself works.
I know there are threads in CPU (assume it's single core for simplicity) running and there is only one thread per moment. The CPU is switching between threads really fast so that user sees it as a simultaneous work (correct me if not).
But - what really happens when I write the following
_beginthread( Thread, 0, NULL ); //or whatever function/class method we use
keeping in mind that C does not have MT support. I mean, how does code tell the PC to run two functions multithreaded while it is not possible by the language explicit methods? I guess there is some "cheat" inside library related to "process.h", but what is that cheat, I can't just find on the web.
To be more specific - I am not asking about how to use MT, but how is it build?
Sorry if was answered earlier, or question is too complicated :)
UPD:
Imagine we have the C language. It has functions, variables, pointers etc. I don't know of any "special" function type that can run concurrently with another, unless there are calls to some other functions from it. But then the caller function stops and waits?
Is it so that when I run MT applications, there is a special "global" function that calls my f1() and f2() repeatedly which looks like they were simultaneously working?
First of all, C11 does actually add multithreading support to the standard, so the premise that C does not support multithreading is no longer entirely correct.
However, I'm assuming your question is more to do with how can multithreading be implemented by a C library when standard C does(/did) not provide the necessary tools. The answer lies in the word “standard” – compilers and platforms can provide additional functionality beyond that required by the standard. Using such extra features makes the program/library less portable (i.e., more is required than is specified in the C standard), but the language and function call semantics can still be C.
Perhaps it is helpful to consider a standard library function such as fopen – somewhere inside that function code must eventually be called which could not be written in standard C, i.e., the implementation of the standard library itself must also rely on platform-specific code to access operating system functionality such as the file system. Every implementation of the standard library must thus implement the non-portable parts in a way specific to that platform (this is kind of the point of having a standard library instead of all code being platform-specific). But likewise a multithreading library can be implemented with non-standard features provided by that platform, but using such a library makes the code portable only to the platforms for which the same (or compatible) multithreading library is available.
As for how multithreading itself works, it is certainly outside the scope of what can be answered here, but as a simplified conceptual model on a single processor core, you can imagine the operating system managing “concurrent” processes by running one process for a short time, interrupting it, saving its state (current instruction, registers, etc), loading the saved state of another process, and repeating this. This gives the illusion of concurrent execution though in actual fact it is switching rapidly between different processes. On multi-core systems the execution on different cores can actually be concurrent, but there are typically more processes than there are cores, so this kind of switching will still happen on individual cores. Things are further complicated by processes waiting for something (I/O, another process, a timer, etc). Perhaps it suffices to say that the scheduler is a piece of software managing all of this inside the operating system and the multithreading library communicates with it.
(Note that there are many different ways to implement multithreading and multitasking, and statements in the above paragraph do not apply to all of them.)
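The save-state, load-state cycle in that simplified model can be imitated in user space. The sketch below is an illustration only, assuming a system that provides <ucontext.h> (glibc on Linux, for example). It is cooperative: each task hands control back voluntarily, whereas a real operating system interrupts tasks with a timer. All names in it (task_body, run_two_tasks, the trace buffer) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

#define TASK_COUNT 2
#define TASK_STEPS 3
#define TASK_STACK_SIZE (64 * 1024)

static ucontext_t scheduler_ctx;             /* saved state of the "scheduler" */
static ucontext_t task_ctx[TASK_COUNT];      /* saved state of each task */
static char task_stack[TASK_COUNT][TASK_STACK_SIZE];
static char trace[TASK_COUNT * TASK_STEPS + 1];
static size_t trace_len;

/* Each task records its letter, then saves its own registers and stack
 * pointer and resumes the scheduler exactly where the scheduler left off. */
static void task_body(int id)
{
    int step;

    for (step = 0; step < TASK_STEPS; step++) {
        trace[trace_len++] = (char)('a' + id);
        swapcontext(&task_ctx[id], &scheduler_ctx);
    }
}

/* Runs the tasks round-robin and returns the order in which they ran. */
const char *run_two_tasks(void)
{
    int id, round;

    trace_len = 0;
    for (id = 0; id < TASK_COUNT; id++) {
        int rc = getcontext(&task_ctx[id]);
        assert(rc == 0);
        (void)rc;
        task_ctx[id].uc_stack.ss_sp = task_stack[id];
        task_ctx[id].uc_stack.ss_size = TASK_STACK_SIZE;
        task_ctx[id].uc_link = &scheduler_ctx;
        makecontext(&task_ctx[id], (void (*)(void))task_body, 1, id);
    }

    /* The "scheduler": resume each task in turn until all steps are done. */
    for (round = 0; round < TASK_STEPS; round++)
        for (id = 0; id < TASK_COUNT; id++)
            swapcontext(&scheduler_ctx, &task_ctx[id]);

    trace[trace_len] = '\0';
    return trace;
}
```

Calling run_two_tasks() yields the string "ababab": the two task functions interleave even though neither one calls the other, because each has its own stack and its own saved register state.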
It's platform specific. On Windows it eventually goes down to NtCreateThread which uses the assembly instruction syscall to call the operating system. So you can qualify it as a cheat.
On Linux it's the same, just the function with the syscall is called clone instead.
Purpose
I'm writing a small library for which portability is the biggest concern. It has been designed to assume only a mostly-compliant C90 (ISO/IEC 9899:1990) environment... nothing more. The set of functions provided by the library all operate (read/write) on an internal data structure. I've considered some other design alternatives, but nothing else seems feasible for what the library is trying to achieve.
Question
Are there any portable algorithms, techniques, or incantations which can be used to ensure thread-safety? I am not concerned with making the functions re-entrant. Moreover, I am not concerned with speed or (possibly) wasting resources if the algorithm/technique/incantation is portable. Ideally, I don't want to depend on any libraries (such as GNU Pth) or system-specific operations (like atomic test-and-set).
I have considered modifying Lamport's bakery algorithm, but I do not know how to alter it to work inside of the functions called by the threads instead of working in the threads themselves.
Any help is greatly appreciated.
Without OS/hardware support, at least an atomic CAS, there's nothing you can do that's practical. There are portable libraries that abstract various platforms into a common interface, though.
http://www.gnu.org/software/pth/related.html
Almost all systems (even Windows) can run libpthread these days.
Lamport's bakery algorithm would probably work; unfortunately, there are still practical problems with it. Specifically, many CPUs implement out-of-order memory operations: even if you've compiled your code into a perfectly correct instruction sequence, the CPU, when executing your code, may decide to reorder the instructions on the fly to achieve better performance. The only way to get around this is to use memory barriers, which are highly system- and CPU-specific.
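To make the ordering requirement concrete, here is a sketch of the bakery lock written with C11 <stdatomic.h>, whose default sequentially consistent operations supply the ordering described above. It therefore assumes a C11 compiler, which is exactly what a C90-only library cannot rely on. The slot count, the names, and the requirement that each caller pass its own fixed slot index are assumptions made for the example; ticket numbers also grow without bound while the lock stays contended.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BAKERY_THREADS 4   /* hypothetical fixed upper bound on callers */

static atomic_bool choosing[BAKERY_THREADS];
static atomic_uint ticket[BAKERY_THREADS];   /* 0 means "not waiting" */

/* 'self' is the caller's slot, unique per thread and below BAKERY_THREADS. */
void bakery_lock(unsigned self)
{
    unsigned i, mine, max = 0;

    assert(self < BAKERY_THREADS);

    /* Take a ticket one larger than any ticket currently visible. */
    atomic_store(&choosing[self], true);
    for (i = 0; i < BAKERY_THREADS; i++) {
        unsigned t = atomic_load(&ticket[i]);
        if (t > max)
            max = t;
    }
    mine = max + 1;
    atomic_store(&ticket[self], mine);
    atomic_store(&choosing[self], false);

    /* Wait for every slot holding a smaller (ticket, index) pair. */
    for (i = 0; i < BAKERY_THREADS; i++) {
        if (i == self)
            continue;
        while (atomic_load(&choosing[i]))
            ;   /* slot i is still picking its ticket */
        for (;;) {
            unsigned t = atomic_load(&ticket[i]);
            if (t == 0 || t > mine || (t == mine && i > self))
                break;
        }
    }
}

void bakery_unlock(unsigned self)
{
    assert(self < BAKERY_THREADS);
    atomic_store(&ticket[self], 0);
}
```

If the atomic loads and stores were replaced with plain ones, the compiler or CPU would be free to reorder them and mutual exclusion would no longer be guaranteed.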
You really only have two choices here: either (1) keep your library thread-unsafe and make your users aware of that in the documentation, or (2) use a platform-specific mutex. Option 2 can be made easier by using another library that implements mutexes for a large variety of platforms and provides you with a unified, abstract interface.
Functions either cannot be thread safe or are innately thread safe, depending on how you want to look at it. And threading/locking is innately platform specific. Really, it is up to users of your library to handle the threading issues.
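One way to leave the locking to users while keeping the library itself plain C90 is to let the caller register lock and unlock callbacks. This is a sketch of that idea under invented names: the counter_ functions stand in for the library's real API, and the stats_ hooks stand in for a platform mutex that a real caller would supply.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*counter_lock_fn)(void *ctx);

static counter_lock_fn lock_hook;
static counter_lock_fn unlock_hook;
static void *hook_ctx;
static long counter_value;   /* stands in for the internal data structure */

/* Register both hooks or neither. Call once, before any thread is started. */
void counter_set_lock_hooks(counter_lock_fn lock, counter_lock_fn unlock,
                            void *ctx)
{
    assert((lock == NULL) == (unlock == NULL));
    lock_hook = lock;
    unlock_hook = unlock;
    hook_ctx = ctx;
}

/* Every public function brackets its access to the shared structure. */
long counter_add(long delta)
{
    long result;

    if (lock_hook != NULL)
        lock_hook(hook_ctx);
    counter_value += delta;
    result = counter_value;
    if (unlock_hook != NULL)
        unlock_hook(hook_ctx);
    return result;
}

/* Stand-in hooks for illustration only: they count calls instead of
 * locking. A real caller would lock and unlock a platform mutex here. */
struct lock_stats {
    int depth;
    int acquisitions;
};

void stats_lock(void *ctx)
{
    struct lock_stats *s = ctx;
    s->depth++;
    s->acquisitions++;
}

void stats_unlock(void *ctx)
{
    struct lock_stats *s = ctx;
    s->depth--;
}
```

With no hooks registered the library behaves exactly as a single-threaded one, so callers that never use threads pay nothing for the feature.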
There are multiple sections in the manpages. Two of them are:
2 Unix and C system calls
3 C Library routines for C programs
For example there is getmntinfo(3) and getfsstat(2), both look like they do the same thing. When should one use which and what is the difference?
System calls are functions provided by the operating system. On UNIX, for example, the malloc() library function is built on top of the sbrk() system call (which resizes the process's memory space).
Libraries are just application code that's not part of the operating system and will often be available on more than one OS. They're basically the same as function calls within your own program.
The line can be a little blurry but just view system calls as kernel-level functionality.
Libraries of common functions are built on top of the system call interface, but applications are free to use both.
System calls are like authentication keys which have the access to use kernel resources.
Above image is from Advanced Linux programming and helps to understand how the user apps interact with kernel.
System calls are the interface between user-level code and the kernel. C Library routines are library calls like any other, they just happen to be really commonly provided (pretty much universally). A lot of standard library routines are wrappers (thin or otherwise) around system calls, which does tend to blur the line a bit.
As to which one to use, as a general rule, use the one that best suits your needs.
The calls described in section 2 of the manual are all relatively thin wrappers around actual calls to system services that trap to the kernel. The C standard library routines described in section 3 of the manual are client-side library functions that may or may not actually use system calls.
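A small sketch of how thin a section 2 wrapper can be, assuming Linux with glibc, where syscall() and the SYS_write constant are available. Both functions below end up in the same kernel entry point; the first goes through the C library's write() wrapper, the second issues the system call by number. The function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Goes through the C library's wrapper documented in write(2). */
ssize_t write_via_wrapper(int fd, const void *buf, size_t n)
{
    assert(buf != NULL);
    return write(fd, buf, n);
}

/* Traps to the kernel directly, naming the system call by number. */
ssize_t write_via_syscall(int fd, const void *buf, size_t n)
{
    assert(buf != NULL);
    return (ssize_t)syscall(SYS_write, fd, buf, n);
}
```

Writing "ab" with the first function and "cd" with the second to the same pipe leaves "abcd" in it, since neither path buffers anything in user space. A section 3 routine such as fputs() would instead collect the bytes in a stdio buffer and call write() later.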
This posting has a description of system calls and trapping to the kernel (in a slightly different context) and explains the underlying mechanism behind system calls with some references.
As a general rule, you should always use the C library version. The library functions often have wrappers that handle esoteric things like restarts on a signal (if you have requested that). This is especially true if you have already linked with the library. All rules have reasons to be broken. Reasons to use the direct calls:
You want to be libc agnostic; Maybe with an installer. Such code could run on Android (bionic), uClibc, and more traditional glibc/eglibc systems, regardless of the library used. Also, dynamic loading with wrappers to make a run-time glibc/bionic layer allowing a dual Android/Linux binary.
You need extreme performance, although this is probably rare and most likely misguided. Rethinking the problem will probably give better performance benefits, and not calling the system at all is often a performance win, which the libc can occasionally achieve.
You are writing some initramfs or init code without a library; to create a smaller image or boot faster.
You are testing a new kernel/platform and don't want to complicate life with a full blown file system; very similar to the initramfs.
You wish to do something very quickly on program startup, but eventually want to use the libc routines.
To avoid a known bug in the libc.
The functionality is not available through libc.
Sorry, most of the examples are Linux specific, but the rationales should apply to other Unix variants. The last item is quite common when new features are introduced into a kernel. For example, when kqueue or epoll were first introduced, there was no libc support for them. This may also happen if the system has an older library but a newer kernel and you wish to use this functionality.
If your process hasn't used the libc, then most likely something else in the system will have. By coding your own variants, you can negate the cache by providing two paths to the same end goal. Also, Unix will share the code pages between processes. Generally there is no reason not to use the libc version.
Other answers have already done a stellar job on the difference between libc and system calls.