How does memchr() work under the hood? - c

Background: I'm trying to create a pure D language implementation of functionality that's roughly equivalent to C's memchr but uses arrays and indices instead of pointers. The reason is so that std.string will work with compile time function evaluation. For those of you unfamiliar w/ D, functions can be evaluated at compile time if certain restrictions are met. One restriction is that they can't use pointers. Another is that they can't call C functions or use inline assembly language. Having the string library work at compile time is useful for some compile time code gen hacks.
Question: How does memchr work under the hood to perform as fast as it does? On Win32, anything that I've been able to create in pure D using simple loops is at least 2x slower even w/ obvious optimization techniques such as disabling bounds checking, loop unrolling, etc. What kinds of non-obvious tricks are available for something as simple as finding a character in a string?

I would suggest taking a look at GNU libc's source. As for most functions, it will contain both a generic optimized C version of the function, and optimized assembly language versions for as many supported architectures as possible, taking advantage of machine specific tricks.
The x86-64 SSE2 version combines the results from pcmpeqb on a whole cache-line of data at once (four 16B vectors), to amortize the overhead of the early-exit pmovmskb/test/jcc.
gcc and clang are currently incapable of auto-vectorizing loops with if() break early-exit conditions, so they make naive byte-at-a-time asm from the obvious C implementation.

This implementation of memchr from newlib is one example of someone's optimizing memchr:
it's reading and testing 4 bytes at a time (apart from memchr, other functions in the newlib library are here).
Incidentally, most of the source code for the MSVC run-time library is available as an optional part of the MSVC installation, so you could look at that too.
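To make the "4 bytes at a time" idea concrete, here is a minimal sketch of the general technique. It is not newlib's actual code; the name memchr_word and the fixed 32-bit word size are assumptions made for illustration. It aligns the pointer with a byte loop, then tests a whole word per iteration using the well-known "does this word contain a zero byte" bit trick.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; scans one 32-bit word per step once aligned. */
void *memchr_word(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    unsigned char ch = (unsigned char)c;

    /* Byte loop until p is 4-byte aligned (or the input runs out). */
    while (n > 0 && ((uintptr_t)p % sizeof(uint32_t)) != 0) {
        if (*p == ch)
            return (void *)p;
        p++;
        n--;
    }

    /* Broadcast the target byte into all four byte lanes. */
    uint32_t pattern = 0x01010101u * ch;

    while (n >= sizeof(uint32_t)) {
        uint32_t word;
        memcpy(&word, p, sizeof word);   /* alias-safe load */
        uint32_t x = word ^ pattern;     /* matching bytes become 0x00 */
        /* Nonzero iff some byte of x is zero, i.e. a match is in this word. */
        if (((x - 0x01010101u) & ~x & 0x80808080u) != 0)
            break;
        p += sizeof(uint32_t);
        n -= sizeof(uint32_t);
    }

    /* Locate the exact byte; this also handles the short tail. */
    while (n > 0) {
        if (*p == ch)
            return (void *)p;
        p++;
        n--;
    }
    return NULL;
}
```

The same structure carries over to array indices instead of pointers, as long as the language lets you load several bytes as one integer.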

Here is FreeBSD's (BSD-licensed) memchr() from memchr.c. FreeBSD's online source code browser is a good reference for time-tested, BSD-licensed code examples.
void *
memchr(s, c, n)
    const void *s;
    unsigned char c;
    size_t n;
{
    if (n != 0) {
        const unsigned char *p = s;
        do {
            if (*p++ == c)
                return ((void *)(p - 1));
        } while (--n != 0);
    }
    return (NULL);
}

memchr, like memset and memcpy, generally reduces to a fairly small amount of machine code. You are unlikely to be able to reproduce that kind of speed without inlining similar assembly code. One major issue to consider in an implementation is data alignment.
One generic technique you may be able to use is to insert a sentinel at the end of the string being searched, which guarantees that you will find it. It allows you to move the test for end of string from inside the loop, to after the loop.
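A minimal sketch of the sentinel idea, assuming the caller owns a writable buffer with one spare byte at buf[len]. The function name and the -1 "not found" convention are made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: buf must be writable and have room for one
   extra byte at buf[len]. Returns the index of c, or -1 if absent. */
ptrdiff_t find_with_sentinel(unsigned char *buf, size_t len, unsigned char c)
{
    unsigned char saved = buf[len];  /* remember the byte we overwrite */
    buf[len] = c;                    /* sentinel: the search must stop here */

    size_t i = 0;
    while (buf[i] != c)              /* no end-of-buffer test in the loop */
        i++;

    buf[len] = saved;                /* restore the caller's byte */
    return i < len ? (ptrdiff_t)i : -1;
}
```

The loop body now does a single comparison per byte; the "did we run off the end" check happens once, after the loop. The cost is that the buffer is briefly modified, so this is unsuitable for read-only data or data shared with another thread.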

GNU libc definitely uses the assembly version of memchr() (on any common Linux distro). This is why it is so unbelievably fast. For example, counting lines in an 11 GB file (as "wc -l" does) takes around 2.5 seconds with the assembly version of memchr() from GNU libc. But if we replace that call with, for example, the C implementation of memchr() from FreeBSD, the time increases to about 30 seconds. That is the same as replacing memchr() with a plain while loop that compares one char after another.

Related

Is it possible to generate ansi C functions with type information for a moving GC implementation?

I am wondering what methods there are to add typing information to generated C methods. I'm transpiling a higher-level programming language to C and I'd like to add a moving garbage collector. However to do that I need the method variables to have typing information, otherwise I could modify a primitive value that looks like a pointer.
An obvious approach would be to encapsulate all (primitive and non-primitive) variables in a struct that has an extra (enum) variable for typing information, however this would cause memory and performance overhead, the transpiled code is namely meant for embedded platforms. If I were to accept the memory overhead the obvious option would be to use a heap handle for all objects and then I'd be able to freely move heap blocks. However I'm wondering if there's a more efficient better approach.
I've come up with a potential solution, namely to predeclare and group variables based whether they're primitives or not (I can do that in the transpiler), and add an offset variable to each method at the end (I need to be able to find it accurately when scanning the stack area), that tells me where the non-primitive variables begin and where they end, so I can only scan those. This means that each method will use an additional 16/32-bit (depending on arch) of memory, however this should still be more memory efficient than the heap handle approach.
Example:
void my_func() {
    int i = 5;
    int z = 3;
    bool b = false;
    void* person;
    void* person_info = ...;
    .... // logic
    volatile int offset = 0x034;
}
My aim is for something that works universally across GCC compilers, thus my concerns are:
Can the compiler reorder the variables from how they're declared in the source code?
Can I force the compiler to put some data in the method's stack frame (using volatile)?
Can I find the offset accurately when scanning the stack?
I'd like to avoid assembly so this approach can work (by default) across multiple platforms, however I'm open for methods even if they involve assembly (if they're reliable).
Typing information could be somehow encoded in the C function name; this is done by C++ and other implementations and called name mangling.
Actually, you could decide, since all your C code is generated, to adopt a different convention: generate long C identifiers which are practically unique and sort-of random program-wide, such as tiziw_7oa7eIzzcxv03TmmZ and keep their typing information elsewhere (e.g. some database). On Linux, such an approach is friendly to both libbacktrace and dlsym(3) + dladdr(3) (and of course nm(1) or readelf(1) or gdb(1)), so used in both bismon and RefPerSys projects.
Typing information is practically tied to calling conventions and ABIs. For example, the x86-64 ABI for Linux mandates different processor registers for passing floating points or pointers.
Read the Garbage Collection handbook or at least P.Wilson Uniprocessor Garbage Collection Techniques survey. You could decide to use tagged integers instead of boxing them, and you could decide to have a conservative GC (e.g. Boehm's GC) instead of a precise one. In my old GCC MELT project I generated C or C++ code for a generational copying GC. Similar techniques are used both in Bismon and in RefPerSys.
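As an illustration of tagged integers, here is a minimal sketch. The encoding (low bit 1 for small integers, low bit 0 for pointers) and all names are assumptions for this example, not what any of the projects above use. It also assumes heap objects are at least 2-byte aligned and that right-shifting a negative signed value is an arithmetic shift, which GCC documents but ISO C leaves implementation-defined.

```c
#include <stdint.h>

/* Hypothetical encoding: low bit 1 = small integer, low bit 0 = pointer. */
typedef uintptr_t value_t;

static inline value_t make_int(intptr_t n)
{
    return ((uintptr_t)n << 1) | 1u;   /* loses the top bit of n */
}

static inline int is_int(value_t v)
{
    return (v & 1u) != 0;
}

static inline intptr_t int_of(value_t v)
{
    return (intptr_t)v >> 1;           /* assumes arithmetic shift */
}

static inline value_t make_ptr(void *p)
{
    return (uintptr_t)p;               /* aligned pointers end in bit 0 */
}

static inline void *ptr_of(value_t v)
{
    return (void *)v;
}
```

With such a scheme the collector can tell a primitive from a pointer by looking at one bit of the value itself, so no separate per-variable type field is needed.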
Since you are transpiling to C, consider also alternatives, such as libgccjit or LLVM. Look into libjit and asmjit.
Study also the implementation of other transpilers (compilers to C), including Chicken/Scheme and Bigloo.
Can the GCC compiler reorder the variables from how they're declared in the source code?
Of course yes, depending upon the optimizations you are asking for. Some variables won't even exist in the binary (e.g. those staying in registers).
Can I force the compiler to put some data in the method's stack frame (using volatile)?
Better generate a single struct variable containing all your language variables, and leave optimizations to the compiler. You will be surprised (see this draft report).
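One way to read that advice is sketched here, going a step beyond what the answer states: each generated function keeps its variables in one struct and links that struct into an explicit "shadow stack" that the collector can walk, so the real stack layout never matters. All names (gc_frame, gc_top, gc_walk_roots) are hypothetical, and the linking scheme is an assumption, not the answerer's stated design.

```c
#include <stddef.h>

/* Hypothetical header a transpiler would put in every frame struct. */
struct gc_frame {
    struct gc_frame *prev;   /* caller's frame, forming a linked list */
    size_t nroots;           /* number of pointer slots in this frame */
    void **roots;            /* address of the first pointer slot */
};

static struct gc_frame *gc_top = NULL;   /* head of the shadow stack */

/* Visit every registered pointer slot; a moving collector could
   rewrite *slot here. Returns the number of slots seen. */
size_t gc_walk_roots(void (*visit)(void **slot))
{
    size_t count = 0;
    for (struct gc_frame *f = gc_top; f != NULL; f = f->prev)
        for (size_t i = 0; i < f->nroots; i++, count++)
            if (visit)
                visit(&f->roots[i]);
    return count;
}

/* What generated code for my_func might look like. */
size_t my_func(void)
{
    struct {
        struct gc_frame hdr;
        void *ptrs[2];       /* person, person_info */
        int i, z;            /* primitives: never scanned */
    } fr = { .ptrs = { NULL, NULL }, .i = 5, .z = 3 };

    fr.hdr.prev = gc_top;
    fr.hdr.nroots = 2;
    fr.hdr.roots = fr.ptrs;
    gc_top = &fr.hdr;                    /* push this frame */

    size_t seen = gc_walk_roots(NULL);   /* stand-in for a collection */

    gc_top = fr.hdr.prev;                /* pop before returning */
    return seen;
}
```

Because the frame's address escapes into gc_top, the compiler must keep the pointer slots in memory, while the primitives remain free to live in registers.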
Can I find the offset accurately when scanning the stack?
This is the most difficult, and depends a lot on compiler optimizations (e.g. if you run gcc with -O1 or -O3 on the generated C code; in some cases a recent GCC, e.g. GCC 9 or GCC 10 on x86-64 for Linux, is capable of tail-call optimizations; check by compiling with gcc -O3 -S -fverbose-asm and then looking at the produced assembler code). If you accept some small target processor and compiler specific tricks, this is doable. Study the implementation of the OCaml compiler.
Send me (to basile#starynkevitch.net) an email for discussion. Please mention the URL of your question in it.
If you want to have an efficient generational copying GC with multi-threading, things become extremely tricky. The question is then how many years of development can you afford spending.
If you have exceptions in your language, take great care as well. You could, with great caution, generate calls to longjmp.
See of course this answer of mine.
With transpiling techniques, the evil is in the details
On Linux (specifically!) see also my manydl.c program. It demonstrates that on a Linux x86-64 laptop you could generate, in practice, hundreds of thousands of dlopen(3)-ed plugins. Then read How to Write Shared Libraries.
Study also the implementation of SBCL and of GNU Prolog, at least for inspiration.
PS. The dream of a totally architecture-neutral and operating-system independent transpiler is an illusion.

Why does glibc library use assembly

I am looking at this page: https://sys.readthedocs.io/en/latest/doc/01_introduction.html
that goes into explanation about how glibc does system calls. In one of the examples the code is examined and it is shown, that the last instruction glibc does to actually do a system call (meaning the interrupt to the cpu) is written in assembly.... So why is part of glibc in assembly? Is there some sort of advantage by writing that small part in assembly?
Also, the shared libraries during runtime are already compiled to machine code correct?
So why would there be any advantage using two different languages before compilation? Thank you.
The answer is super simple - since C doesn't cover system calls (because it doesn't cover any physical hardware in general, and prefers to express itself in terms of abstract machine), there is no C construct glibc can use to perform system call.
One could argue that the compiler could provide some sort of intrinsic to do that, but since on Linux glibc is effectively part of the compiler tool suite (it contains the CRT as well), there is really no need for it; glibc can do the job.
Also, last but not least, on modern CPUs a syscall is usually not an interrupt. Instead, it's a specific instruction (syscall on x86_64).
I want to address this piece of your question:
Also, the shared libraries during runtime are already compiled to machine code correct?
So why would there be any advantage using two different languages before compilation?
SergeyA correctly points out that there isn't any C construct (even with all of GCC's extensions) that will cause the compiler to emit a syscall instruction. That's not the only thing that the C library is supposed to do that simply can't be written purely in C: the implementations of setjmp and longjmp, makecontext and setcontext, the "entry point" code that calls main, the "trampoline" that you return to when you return from a signal handler, and several other low-level bits all require a little bit of hand-written assembly. (Exercise: what do they all have in common?)
But there's another reason to mix assembly language into a program mostly written in C. This is one of the several implementations of memcpy for x86-64 in glibc. It is 3100 lines of hand-written assembly language and preprocessor macros. What it does could be expressed in four lines of C. Why would anyone go to that much trouble? Speed. Compilers are always getting closer, but they haven't yet quite managed to beat the human brain when it comes to squeezing every last possible cycle out of a critical innermost loop. (It is worth mentioning that in early 2018 the glibc devs spent a bunch of time replacing hand-written assembly implementations of math.h functions with C, because the compilers have caught up on those, and C is ever so much more maintainable.)
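For comparison, this is roughly what the "four lines of C" version looks like. It is a sketch under a hypothetical name, not glibc's fallback code.

```c
#include <stddef.h>

/* Hypothetical name, to avoid clashing with the real memcpy. */
void *simple_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;   /* one byte per iteration */
    return dst;
}
```

Everything the assembly version adds on top of this (alignment handling, wide vector loads and stores, size-specific code paths) exists only to make the same copy go faster.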
And yet a third answer, which isn't particularly relevant to glibc but comes up a bunch elsewhere, is that maybe you have two different languages in your program because each of them is better at part of your problem. The statistical language R is mostly implemented in C, but a bunch of its mathematical primitives are (or were, I haven't checked in a while) written in FORTRAN, because FORTRAN is still the language that numerical computation wizards think in. Both C and FORTRAN get compiled to machine code, and in principle you could rewrite all the FORTRAN in C, but nobody wants to.

How can I check if I implement C library functions correctly?

Is there any source/database for basic C library functions (like strcmp, memset, etc)?
I want to implement basic C library functions but I can't verify if I'm doing it right or not.
I found several source code databases but they are far more complicated than they should be (e.g the implementation of strcpy is more than 30 lines, half of it isn't related to copying the strings, I think).
Check out the OpenBSD C library. E.g., here's its basic strcpy:
char *
strcpy(char *to, const char *from)
{
    char *save = to;
    for (; (*to = *from) != '\0'; ++from, ++to);
    return(save);
}
Documentation for the functions is included in the form of manpages.
(It also carries optimized versions of common routines, usually in assembler, so the C versions should really be regarded as reference implementations.)
The "basic" C library functions are also some of the most important for program performance and correctness, and so tend to have some complicated implementations.
I suggest you look at the code for Newlib. It's a basic C library intended for embedded systems (your TV might well run it) and it is also used in Cygwin. The license is also mostly compatible with "borrowing" source for your own purposes, but be careful because some bits of it (some files) are GPL.
There's a great book, The Standard C Library by P.J. Plauger. It's a bit dated (1992), but still a valuable resource if you want to implement libc and do it right. It contains the full code of the library. There is also musl libc. The code lives in a git repo. The implementation is not straightforward, but compared to other implementations it's really small and simple. And, as somebody else already mentioned, the C standard is something you want to look at.

Does the C Standard Allow for Self-Modifying Code?

Is self-modifying code possible in a portable manner in C?
The reason I ask is that, in a way, OOP relies on self-modifying code (because the code that executes at run-time is actually generated as data, e.g. in a v-table), and yet, it seems that, if this is taken too far, it would prevent most optimizations in a compiler.
For example:
void add(char *restrict p, char *restrict pAddend, int len)
{
    for (int i = 0; i < len; i++)
        p[i] += *pAddend;
}
An optimizing compiler could hoist the *pAddend out of the loop, because it wouldn't interfere with p. However, this is no longer a valid optimization in self-modifying code.
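Written out by hand, the hoisted form the compiler is allowed to produce would look something like this (add_hoisted is a hypothetical name for illustration):

```c
/* Hand-hoisted equivalent of add(); valid only because restrict
   promises that no p[i] aliases *pAddend. */
void add_hoisted(char *restrict p, char *restrict pAddend, int len)
{
    char addend = *pAddend;        /* loaded once, outside the loop */
    for (int i = 0; i < len; i++)
        p[i] += addend;
}
```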
In this way, it seems that C doesn't allow for self-modifying code, but at the same time, wouldn't that imply that you can't do some things like OOP in C? Does C really support self-modifying code?
Self-modifying code is not possible in C for many reasons, the most important of which are:
The code generated by the compiler is completely up to the compiler, and might not look anything like what the programmer trying to write code that modifies itself expects. This is a fundamental problem with doing SMC at all, not just a portability problem.
Function and data pointers are completely separate in C; the language provides no way to convert back and forth between them. This issue is not fundamental, since some implementations or higher-level standards (POSIX) guarantee that code and data pointers share a representation.
Aside from that, self-modifying code is just a really really bad idea. 20 years ago it might have had some uses, but nowadays it will result in nothing but bugs, atrocious performance, and portability failures. Note that on some ISAs, whether the instruction cache even sees changes that were made to cached code might be unspecified/unpredictable!
Finally, vtables have nothing to do with self-modifying code. It's purely a matter of modifying function pointers, which are data, not code.
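To illustrate that last point, here is a minimal sketch of a hand-rolled vtable in C. The shape types and function names are invented for the example. Dispatch changes by storing a different pointer in a data field; no instruction bytes are ever rewritten.

```c
struct shape;                               /* hypothetical example type */

struct shape_vtable {
    int (*area)(const struct shape *self);  /* function pointers are data */
};

struct shape {
    const struct shape_vtable *vt;
    int w, h;
};

static int rect_area(const struct shape *s)     { return s->w * s->h; }
static int triangle_area(const struct shape *s) { return s->w * s->h / 2; }

static const struct shape_vtable rect_vt     = { rect_area };
static const struct shape_vtable triangle_vt = { triangle_area };

/* Dynamic dispatch: which code runs depends on a pointer stored in data. */
int shape_area(const struct shape *s)
{
    return s->vt->area(s);
}
```

Pointing s.vt at triangle_vt instead of rect_vt changes which function shape_area calls, while the machine code of both functions stays untouched.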
Strictly speaking, self-modifying code cannot be implemented in a portable manner in C or C++ if I understood the standard correctly.
Self modifying code in C/C++ would mean something like this:
uint8_t code_buffer[FUNCTION_SIZE];
void call_function(void)
{
    /* ... modify code_buffer here to the machine code we'd like to run ... */
    ((void (*)(void))code_buffer)();
}
This is not legal and will crash on most modern architectures. This is impossible to implement on Harvard architectures as executable code is strictly read-only, so it cannot be part of any standard.
Most modern OSes do have a facility to be able to do this hackery, which is used by dynamic recompilers for one. mprotect() in Unix for example.

Which string manipulation functions should I use?

On my Windows/Visual C environment there's a wide number of alternatives for doing the same basic string manipulation tasks.
For example, for doing a string copy I could use:
strcpy, the ANSI C standard library function (CRT)
lstrcpy, the version included in kernel32.dll
StrCpy, from the Shell Lightweight Utility library
StringCchCopy/StringCbCopy, from a "safe string" library
strcpy_s, security enhanced version of CRT
While I understand that all these alternatives have an historical reason, can I just choose a consistent set of functions for new code? And which one? Or should I choose the most appropriate function case by case?
First of all, let's review pros and cons of each function set:
ANSI C standard library function (CRT)
Functions like strcpy are the one and only choice if you are developing portable C code. Even in a Windows-only project, it might be a wise thing to have a separation of portable vs. OS-dependent code.
These functions often have assembly-level optimizations and are therefore very fast.
There are some drawbacks:
they have many limitations and therefore often you still have to call functions from other libraries or provide your own versions
there are some archaisms like the infamous strncpy
Kernel32 string functions
Functions like lstrcpy are exported by kernel32 and should be used only when trying to avoid any dependency to the CRT. You might want to do that for two reasons:
avoiding the CRT payload for an ultra lightweight executable (unusual these days but not in the 90s!)
avoiding initialization issues (if you launch a thread with CreateThread instead of _beginthread).
Moreover, the kernel32 function could be more optimized than the CRT version: when your executable runs on Windows 12 optimized for a Core i13, kernel32 could use an assembly-optimized version.
Shell Lightweight Utility Functions
The same considerations made for the kernel32 functions apply here, with the added value of some more complex functions. However, I doubt that they are actively maintained and I would just skip them.
StrSafe Function
The StringCchCopy/StringCbCopy functions are usually my personal choice: they are very well designed, powerful, and surprisingly fast (I also remember a whitepaper that compared performance of these functions to the CRT equivalents).
Security-Enhanced CRT functions
These functions have the undoubted benefit of being very similar to ANSI C equivalents, so porting legacy code is a piece of cake. I especially like the template-based version (of course, available only when compiling as C++). I really hope that they will be eventually standardized. Unfortunately they have a number of drawbacks:
although a proposed standard, they have been basically rejected by the non-Windows community (probably just because they came from Microsoft)
when they fail, they don't just return an error code but execute an invalid parameter handler
Conclusions
While my personal favorite for Windows development is the StrSafe library, my advice is to use the ANSI C functions whenever possible, as portable code is always a good thing.
In real life, I developed a personalized portable library, with prototypes similar to the Security-Enhanced CRT functions (including the powerful template-based technique), that relies on the StrSafe library on Windows and on the ANSI C functions on other platforms.
My personal preference, for both new and existing projects, is the StringCchCopy/StringCbCopy versions from the safe string library. I find these functions to be overall very consistent and flexible. And they were designed from the ground up with safety / security in mind.
I'd answer this question slightly differently. Do you want to have portable code or not? If you want to be portable, you cannot rely on anything but strcpy, strncpy, or the standard wide-character "string" handling functions.
Then if your code just has to run under Windows you can use the "safe string" variants.
If you want to be portable and still want to have some extra safety, then you should check cross-platform libraries such as
glib or
libapr
or other "safe string libraries" like e.g:
SafeStrLibrary
I would suggest using functions from the standard library, or functions from cross-platform libraries.
I would stick to one, I would pick whichever one is in the most useful library in case you need to use more of it, and I would stay away from the kernel32.dll one as it's windows only.
But these are just tips, it's a subjective question.
Among those choices, I would simply use strcpy. At least strcpy_s and lstrcpy are cruft that should never be used. It's possibly worthwhile to investigate those independently written library functions, but I'd be hesitant to throw around nonstandard library code as a panacea for string safety.
If you're using strcpy, you need to be sure your string fits in the destination buffer. If you just allocated it with size at least strlen(source)+1, you're fine as long as the source string is not simultaneously subject to modification by another thread. Otherwise you need to test if it fits in the buffer. You can use interfaces like snprintf or strlcpy (nonstandard BSD function, but easy to copy an implementation) which will truncate strings that don't fit in your destination buffer, but then you really need to evaluate whether string truncation could lead to vulnerabilities in itself. I think a much better approach when testing whether the source string fits is to make a new allocation or return an error status rather than performing blind truncation.
If you'll be doing a lot of string concatenation/assembly, you really should write all your code to manage the length and current position as you go. Instead of:
strcpy(out, str1);
strcat(out, str2);
strcat(out, str3);
...
You should be doing something like:
size_t l, n = outsize;
char *s = out;
l = strlen(str1);
if (l>=n) goto error;
strcpy(s, str1);
s += l;
n -= l;
l = strlen(str2);
if (l>=n) goto error;
strcpy(s, str2);
s += l;
n -= l;
...
Alternatively you could avoid modifying the pointer by keeping a current index i of type size_t and using out+i, or you could avoid the use of size variables by keeping a pointer to the end of the buffer and doing things like if (l>=end-s) goto error;.
Note that, whichever approach you choose, the redundancy can be condensed by writing your own (simple) functions that take pointers to the position/size variable and call the standard library, for instance something like:
if (!my_strcpy(&s, &n, str1)) goto error;
Avoiding strcat also has performance benefits; see Schlemiel the Painter's algorithm.
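The my_strcpy helper called above is not shown in the answer. One possible shape for it, as a sketch with an assumed signature and an assumed return convention of 1 for success and 0 for "does not fit":

```c
#include <stddef.h>
#include <string.h>

/* Copies src to *pos if it fits in *remaining bytes (counting the NUL),
   then advances *pos and shrinks *remaining. Returns 1 on success and
   0 if src does not fit, in which case nothing is written. */
int my_strcpy(char **pos, size_t *remaining, const char *src)
{
    size_t l = strlen(src);
    if (l >= *remaining)
        return 0;
    memcpy(*pos, src, l + 1);   /* include the terminating NUL */
    *pos += l;                  /* next copy overwrites the NUL */
    *remaining -= l;
    return 1;
}
```

Each call does one strlen and one copy of the new piece only, so the output built so far is never rescanned the way strcat would rescan it.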
Finally, you should note that a good 75% of the string copying and assembly people perform in C is utterly useless. My theory is that the people doing it come from backgrounds in script languages where putting together strings is what you do all the time, but in C it's not useful that often. In many cases, you can get by with never copying strings at all, using the original copies instead, and get much better performance and simpler code at the same time. I'm reminded of a recent SO question where OP was using regexec to match a regular expression, then copying out the result just to print it, something like:
char *tmp = malloc(match.end-match.start+1);
memcpy(tmp, src+match.start, match.end-match.start);
tmp[match.end-match.start] = 0;
printf("%s\n", tmp);
free(tmp);
The same thing can be accomplished with:
printf("%.*s\n", (int)(match.end-match.start), src+match.start);
No allocations, no cleanup, no error cases (the original code crashed if malloc failed).
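The same precision trick works with snprintf when the slice has to go into a buffer instead of stdout. A small sketch follows; format_slice and its parameters are hypothetical names. Note the cast: the argument consumed by the * in %.*s must be an int.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats src[start..end) plus a newline into out
   without making a temporary copy of the slice. Returns what snprintf
   returns: the length the full output needs, excluding the NUL. */
int format_slice(char *out, size_t outsize,
                 const char *src, size_t start, size_t end)
{
    return snprintf(out, outsize, "%.*s\n", (int)(end - start), src + start);
}
```

For example, format_slice(buf, sizeof buf, "hello world", 6, 11) writes "world" followed by a newline into buf, provided buf is large enough.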

Resources