What are the benefits to using BIO_printf() instead of printf()? - c

I have been reviewing example code for using OpenSSL and in every example I locate, the creator has chosen to use BIO_printf() to write things to stdout instead of printf().
I have taken their code, removed the openssl/bio.h header declaration, and changed all calls to BIO_printf() to regular printf() statements. The programs ran with identical results.
The problem I'm grasping with is why these coders use BIO_printf() when it takes a lot more to setup than just using printf(). You have to include another header (which will increase program size), you need to set the file pointer to the stream you want to write to. Then you can print your message to stdout. It seems a lot more complicated than using printf().
When I do a search on BIO_printf() it lists possible man pages for BIO_printf (3), but none of the pages actually contain any information!
I decided to do a benchmark test on both methods. I looped printf("Hey\n"); 1,000,000 times. Then I did it for BIO_printf(fp, "Hey\n");. I only timed the BIO_printf() statement and not the setting up of the file pointer (which would have increased the time). The difference came out to printf() being ~4.7x faster than using BIO_printf().
Why are they using it? What is the benefit? It's my understanding that in programming you either want code to be simple or efficient, and in the case of BIO_printf() it's neither.

In general, a BIO might not be writing to stdout.
You can have a BIO that writes to a file, or null, or a socket, or a network drive, or another BIO, etc.
By using the BIO_printf family, the code can easily be changed to have its output sent to a different location or another BIO which might do some further filtering and then pass the output onto wherever else.

As pointed by others, BIO can be stacked contrary to FILE. snprintf() and vnsprintf() were added in C99. OpenSSL/SSLeay is older than this. Hence, the SSLeay developpers had to write their own implementation. Unfortunately, having a little used implementation leads to the performance issues described by the OP or to CVE-2016-0799.

Related

C logging framework compile time optimization

For a certain time now, I'm looking to build a logging framework in C (not C++!), but for small microcontrollers or devices with a small footprint of some sort. For this, I've had the idea of hashing the strings that are being logged to a certain value and just saving the hashed value with the timestamp instead of the complete ASCII string. The hash can then be correlated with a 'database' file that would be generated from an external process that parses the strings out of the C source files and saves the logged strings along with the hash value.
After doing a little bit of research, this idea is not new, but I do not find an implementation of this idea in C. In other languages, this idea has been worked out, but that is not the goal of my exercise. An example may be this talk where the same concept has been worked out in C++: youtube.com/watch?v=Dt0vx-7e_B0
Some of the requirements that I've set myself for this library are the following:
as portable C code as possible
COMPILE TIME optimization/hashing for the string hash conversion, it should be equivalent to just printf("%d\n", hashed_value) for a single log statement. (Assuming no parameters/arguments for this particular logging statement).
arguments can be passed to the logging statement similar to the printf function.
user can define their own output function (being console, file descriptor, sending the data directly over an UART connection,...)
fast to run!! fast to compile is nice to have, but it should not be terribly slow.
very easy to use, no very complicated API to use the library.
But to achieve this in C, what is a good approach? I've tried several things now, but do not seem to have found a good method of achieving this.
An overview of things I've tried so far, along with the drawbacks are:
Full pre-processor string hashing: did get it working, but the compile time is terribly slow. Also, this code does not feel to be very portable over multiple C compilers.
Semi pre-processor string hashing: The idea was to generate a hash for each string and make an external header file with the defines in of each string with their hash value. The problem here is that I cannot figure out a way of converting the string to the correct define preprocessor value.
Letting go of the default logging macro with a string pointer: Instead of working with the most used method of LOG_DEBUG("Some logging statement"), converting it with an external parser to /*LOG_DEBUG("Some logging statement") */ LOG_RAW(45). This solves the problem of hashing the string since the hash will be replaced by the external parser with the correct hash, but is not the cleanest to read since the original statement will be a comment.
Also expanding this idea to take care of arguments proved to be tricky. How to take care of multiple types of variables as efficiently as possible?
I've tried some other methods but all without success. Especially when I want to add arguments to log the value of a variable, for example, it gets very complicated, and I do not get the required result...

A function returning "ENOTDIR", "EBUSY", etc. as strings?

strerror() function returns a short error description, given error number as argument. For example, if the argument is ENOTDIR, it will return "Not a directory", if the argument is EBUSY, it will return "Device or resource busy", etc.
But, is there a function that returns "ENOTDIR" for argument equals ENOTDIR, "EBUSY" for argument EBUSY, etc.?
I just want to avoid writing a huge switch statement for this purpose.
No- there is no standard or commonly used nonstandard function that provides this functionality.
One approach would be to write a huge switch statement, but this might not be the best approach for you to take. Most values of errno are not specified by any standard, so their values may or may not be consistent across different operating systems or even different versions of the same operating system.
Plus it would be a pain in the rear end.
A more elegant approach, if some runtime overhead is acceptable, would be to write a function that looks up these errors codes when they occur, rather than hard-coding the values into a big table. GNU/Linux systems have a list of all possible errno values at:
/usr/include/asm-generic/errno-base.h
/usr/include/asm-generic/errno.h
These files provide a #define of each errno value along with their value and a short description in an adjacent comment. It'd be pretty trivial to search these files line-by-line and print out the matching error code. Even if not, these files would be the things to start with in your quest to write a huge switch statement.
Beware that the kernel might negate these values when they're passed to userspace.
As David points out above, the problem is there is no standard function that can provide the desired functionality. So thinking that this would be a neat problem to try to write a script for, I wrote a little something to automatically generate the switch function (if it should come to be necessary) and posted the code here. Seems to work alright on OS X, otherwise mileage may vary. A script such as this could be added to the build process to make sure that the values were defined correctly.

Using sysctl(3) to write safe, portable code: good idea?

When writing safe code in straight C, I'm sick and tired of coming up with arbitrary
numbers to represent limitations -- specifically, the maximum amount of
memory to allocate for a single line of text. I know I can always say
stuff like
#define MAX_LINE_LENGTH 1024
and then pass that macro to functions such as snprintf().
I work and code in NetBSD, which has a sysctl(3) variable called
"user.line_max" designed for this very purpose. So I don't need to come up
with an arbitrary number like MAX_LINE_LENGTH, above. I just read the
"user.line_max" sysctl variable, which by the way is settable by the user.
My question is whether this is the Right Thing in terms of safety and
portability. Perhaps different operating systems have a different name for
this sysctl, but I'm more interested in whether I should be using this
technique at all.
And for the record, "portability" excludes Microsoft Windows in this case.
Well the linux SYSCTL (2) man page has this to say in the Notes section:
Glibc does not provide a wrapper for this system call; call it using syscall(2).
Or rather... don't call it: use of this system call has long been discouraged, and it is so unloved that it is likely to disappear in a future kernel version. Remove it from your programs now; use the /proc/sys interface instead.
So that is one consideration.
Not a good idea. Even if it weren't for what Duck told you, relying on a system-wide setting that's runtime-variable is bad design and error-prone. If you're going to go to the trouble of having buffer size limits be variable (which typically requires dynamic allocation and checking for failure) then you should go the last step and make it configurable on a more local scope.
With your example of buffer size limits, opinions differ as to what's the best practice. Some people think you should always use dynamically-growing buffers with no hard limit. Others prefer fixed limits sufficiently large that reasonable data would not exceed them. Or, as you've noted, configurable limits are an option. In choosing what's right for your application, I would consider the user experience implications. Sure users don't like arbitrary limits, but they also don't like it when accidentally (or by somebody else's malice) reading data with no newlines in it causes your application to consume unbounded amounts of memory, start swapping, and/or eventually crash or bog down the whole system.
The nearest portable construct for this is "getconf LINE_MAX" or the equivalent C.
1) Check out the Single Unix Specification, keyword: "limits"
2) s/safety/security/

file and formatting alternative libs for c

I've done some searching and have not found anything that would boost the file and formatting functions in Visual Studio VS2010 C (not C++).
I've been able to address the raw i/o issues to some extent by using large buffers and a SSD drive, so the more pressing issue is a replacement for the family of printf functions.
Has anyone found something worthwhile?
As I understand it, part of the glacial speed issue with the printf functions is that they have to handle myriad types of arguments. Does anyone have experience with writing a datatype-specific version of printf; eg, one that only prints ints, or only prints doubles, etc?
First off, you should profile the code first before assuming it's printf.
But if you're sure it's printf and similar then you can do a few things to fix the issue.
1) print less. IE, don't call expensive operations as much if you can avoid it. Do you need all the output, for example?
2) manually replace the string concatenation with manually built routines that do all the pieces without having to parse the format specifier.
EG: printf("--%s--", "really cool");
Can become:
write(1, "--", 2);
write(1, "really cool", 11);
write(1, "--", 2);
That may be faster. But again, you won't know until you profile it. Don't spend energy on a solution till you can confirm it's the solution you need and be able to measure the success of your proposed solution.
#Wes is right, never assume you know what you need to fix until you have proof.
Where I differ is on the method of finding out.
I and others use random pausing which works for these reasons, and here's a short slide show demo & C++ code so you can see how it works, if you want.
The thing about printf (or any output) function is it spends A) a certain number of CPU cycles creating a buffer to be output, and then it spends B) a certain amount of time waiting while the system and/or auxiliary hardware actually moves the data out.
That's maybe a bit over-simplified, but if you randomly pause and examine the state, that's what you see.
What you've done by using large buffers and an SSD drive is reduce B, and that's good.
That means of the time remaining, A is a larger fraction.
You know that.
Now of the samples you find in A, you might get a hint of what's happening if you see what subordinate routines inside printf are showing up.
Usually printf calls something like vprintf to get rid of the variable argument list, which then cycles over the format string to figure out what to do, including things like parsing precision specifiers.
If it looks like that's what it's doing, then you know about how much time goes into parsing the format.
On the other hand, if you see it inside a routine that is copying a string, or formatting an integer (along with dealing with leading/trailing characters, etc.) then you know to concentrate on that.
On yet another hand, if you see it inside a routine that looks like it's formatting a floating point number (which is actually quite complicated), you know to concentrate on that.
Given all that, you want to know what I do?
First, I ask who is going to read this anyway?
If nobody really needs to read all this text, why not pump it out in binary? Or failing that, in hex?
If you simply write binary, A shrinks to nothing, and when you read it back in with another program, guess what?
No Lost Bits!

Parsing: load into memory or use stream

I'm writing a little parser and I would like to know the advantages and disadvantages of the different ways to load the data to be parsed. The two ways that I thought of are:
Load the file's contents into a string then parse the string (access the character at an array position)
Parse as reading the file stream (fgetc)
The former will allow me to have two functions: one for parse_from_file and parse_from_string, however I believe this mode will take up more memory. The latter will not have that disadvantage of using more memory.
Does anyone have any advice on the matter?
Reading the entire file in or memory mapping it will be faster, but may cause issues if you want your language to be able to #include other files as these would be memory mapped or read into memory as well.
The stdio functions would work well because they usually try to buffer up data for you, but they are also general purpose so they also try to look out for usage patterns which differ from reading a file from start to finish, but that shouldn't be too much overhead.
A good balance is to have a large circular buffer (x * 2 * 4096 is a good size) which you load with file data and then have your tokenizer read from. Whenever a block's worth of data has been passed to your tokenizer (and you know that it is not going to be pushed back) you can refill that block with new data from the file and update some buffer location info.
Another thing to consider is if there is any chance that the tokenizer would ever need to be able to be used to read from a pipe or from a person typing directly in some text. In these cases your reads may return less data than you asked for without it being at the end of the file, and the buffering method I mentioned above gets more complicated. The stdio buffering is good for this as it can easily be switched to/from line or block buffering (or no buffering).
Using gnu fast lex (flex, but not the Adobe Flash thing) or similar can greatly ease the trouble with all of this. You should look into using it to generate the C code for your tokenizer (lexical analysis).
Whatever you do you should try to make it so that your code can easily be changed to use a different form of next character peek and consume functions so that if you change your mind you won't have to start over.
Consider using lex (and perhaps yacc, if the language of your grammar matches its capabilities). Lex will handle all the fiddly details of lexical analysis for you and produce efficient code. You can probably beat its memory footprint by a few bytes, but how much effort do you want to expend into that?
The most efficient on a POSIX system would probably neither of the two (or a variant of the first if you like): just map the file read-only with mmap, and parse it then. Modern systems are quite efficient with that in that they prefetch data when they detect a streaming access etc., multiple instances of your program that parse the same file will get the same physical pages of memory etc. And the interfaces are relatively simple to handle, I think.

Resources