How do I align a number like this in C? - c

I need to align a series of numbers in C with printf() like this example:
-------1
-------5
------50
-----100
----1000
Of course, there are numbers between all those, but they're not relevant to the issue at hand... Oh, and consider the dashes as spaces; I used dashes so it was easier to understand what I want.
I'm only able to do this:
----1---
----5---
----50--
----100-
----1000
Or this:
---1
---5
--50
-100
1000
But neither of these is what I want, and I can't achieve what's shown in the first example using only printf(). Is it possible at all?
EDIT:
Sorry people, I was in a hurry and didn't explain myself well... My last example and all your suggestions (to use something like "%8d") don't work because, although the last number in my example is 1000, the series doesn't necessarily go all the way up to 1000, or even to 100 or 10 for that matter.
No matter how many digits are displayed, I want at most 4 leading spaces before the largest number. Let's say I have to display numbers from 1 to 1000 (A) and from 1 to 100 (B), and I use "%4d" for both; this would be the output:
A:
---1
....
1000
Which is the output I want...
B:
---1
....
-100
Which is not the output I want, I actually want this:
--1
...
100
But like I said, I don't know in advance how big the numbers I have to print are; they can have 1 digit, 2, 3 or more, and the function should be prepared for all of them. And I want four extra leading spaces, but that's not that relevant.
EDIT 2:
It seems that what I want, the way I need it, is not possible (check David Thornley's and Blank Xavier's answers and my comments). Thank you all for your time.

Why is printf("%8d\n", intval); not working for you? It should...
You did not show the format strings for any of your "not working" examples, so I'm not sure what else to tell you.
#include <stdio.h>

int
main(void)
{
    int i;

    for (i = 1; i <= 10000; i *= 10) {
        printf("[%8d]\n", i);
    }
    return (0);
}

$ ./printftest
[       1]
[      10]
[     100]
[    1000]
[   10000]
EDIT: response to clarification of question:
#include <math.h>
int maxval = 1000;
int width = (int)log10(maxval) + 1;   /* number of digits in maxval */
...
printf("%*d\n", width, intval);
The width calculation takes the integer part of log base 10 and adds 1, which gives the number of digits. The fancy * lets you supply the field width as a variable instead of hard-coding it into the format string.
You still have to know the maximum for any given run, but there's no way around that in any language, or even with pencil and paper.
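A runnable sketch of the same idea, with maxval hard-coded purely as an example (on many systems you will need to link with -lm; the digit count could also be done with integer division to avoid floating point):
#include <math.h>
#include <stdio.h>

int main(void)
{
    int maxval = 1000;                   /* example: the largest value to print */
    int width = (int)log10(maxval) + 1;  /* its number of digits */
    int i;

    for (i = 1; i <= maxval; i *= 10)
        printf("%*d\n", width, i);       /* right-aligned in a field 'width' wide */
    return 0;
}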

Looking this up in my handy Harbison & Steele....
Determine the maximum width of fields.
int max_width, value_to_print;
max_width = 8;
value_to_print = 1000;
printf("%*d\n", max_width, value_to_print);
Bear in mind that max_width must be of type int to work with the asterisk, and you'll have to calculate it based on how much space you're going to want to have. In your case, you'll have to calculate the maximum width of the largest number, and add 4.

printf("%8d\n",1);
printf("%8d\n",10);
printf("%8d\n",100);
printf("%8d\n",1000);

[I realize this question is a million years old, but there is a deeper question (or two) at its heart, about OP, the pedagogy of programming, and about assumption-making.]
A few people, including a mod, have suggested this is impossible. And, in some--including the most obvious--contexts, it is. But it's interesting to see that that wasn't immediately obvious to the OP.
The impossibility assumes that the context is running an executable compiled from C on a line-oriented text console (e.g., console+sh or X-term+csh or Terminal+bash), which is a very reasonable assumption. But the fact that the "right" answer ("%8d") wasn't good enough for OP while also being non-obvious suggests that there's a pretty big can of worms nearby...
Consider Curses (and its many variants). In it, you can navigate the "screen", and "move" the cursor around, and "repaint" portions (windows) of text-based output. In a Curses context, it absolutely would be possible to do; i.e., dynamically resize a "window" to accommodate a larger number. But even Curses is just a screen "painting" abstraction. No one suggested it, and probably rightfully so, because a Curses implementation in C doesn't mean it's "strictly C". Fine.
But what does this really mean? In order for the response: "it's impossible" to be correct, it would mean that we're saying something about the runtime system. In other words, this isn't theoretical, (as in, "How do I sort a statically-allocated array of ints?"), which can be explained as a "closed system" that totally ignores any aspect of the runtime.
But, in this case, we have I/O: specifically, the implementation of printf(). But that's where there's an opportunity to have said something more interesting in response (even though, admittedly, the asker was probably not digging quite this deep).
Suppose we use a different set of assumptions. Suppose OP is reasonably "clever" and understands that it would not be possible to edit previous lines on a line-oriented stream (how would you correct the horizontal position of a character output by a line-printer??). Suppose also that OP isn't just a kid working on a homework assignment and not realizing it was a "trick" question, intended to tease out an exploration of the meaning of "stream abstraction". Further, let's suppose OP was wondering: "Wait...If C's runtime environment supports the idea of STDOUT--and if STDOUT is just an abstraction--why isn't it just as reasonable to have a terminal abstraction that 1) can vertically scroll but 2) supports a positionable cursor? Both are moving text on a screen."
Because if that were the question we're trying to answer, then you'd only have to look as far as:
ANSI Escape Codes
to see that:
Almost all manufacturers of video terminals added vendor-specific escape sequences to perform operations such as placing the cursor at arbitrary positions on the screen. One example is the VT52 terminal, which allowed the cursor to be placed at an x,y location on the screen by sending the ESC character, a Y character, and then two characters representing numerical values equal to the x,y location plus 32 (thus starting at the ASCII space character and avoiding the control characters). The Hazeltine 1500 had a similar feature, invoked using ~, DC1 and then the X and Y positions separated with a comma. While the two terminals had identical functionality in this regard, different control sequences had to be used to invoke them.
The first popular video terminal to support these sequences was the Digital VT100, introduced in 1978. This model was very successful in the market, which sparked a variety of VT100 clones, among the earliest and most popular of which was the much more affordable Zenith Z-19 in 1979. Others included the Qume QVT-108, Televideo TVI-970, Wyse WY-99GT as well as optional "VT100" or "VT103" or "ANSI" modes with varying degrees of compatibility on many other brands. The popularity of these gradually led to more and more software (especially bulletin board systems and other online services) assuming the escape sequences worked, leading to almost all new terminals and emulator programs supporting them.
It has been possible since as early as 1978. C itself was "born" in 1972, and the K&R version was established in 1978. If "ANSI" escape sequences were around at that time, then there is an answer "in C" if we're willing to also stipulate: "Well, assuming your terminal is VT100-capable." Incidentally, the consoles which don't support ANSI escapes? You guessed it: Windows & DOS consoles. But on almost every other platform (Unices, Vaxen, Mac OS, Linux) you can expect them to work.
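To make that concrete, here is a sketch that leans on exactly that stipulation: it assumes a VT100/ANSI-capable terminal (the ESC [ n A "cursor up" and ESC [ 2 K "erase line" sequences), and re-prints earlier lines at a wider width whenever a longer number shows up. The sample values are just an illustration; nothing in C itself guarantees the escape sequences will be honored.
#include <stdio.h>

int main(void)
{
    int values[] = { 1, 5, 50, 100, 1000 };   /* example data */
    int count = sizeof values / sizeof values[0];
    int width = 1, printed = 0, i, j;

    for (i = 0; i < count; i++) {
        int w = snprintf(NULL, 0, "%d", values[i]);  /* digits this value needs */
        if (w > width) {
            if (printed > 0) {
                /* a wider number arrived: move the cursor up over what we have
                 * already printed, erase those lines, re-print them wider */
                printf("\033[%dA", printed);
                for (j = 0; j < printed; j++)
                    printf("\033[2K%*d\n", w, values[j]);
            }
            width = w;
        }
        printf("%*d\n", width, values[i]);
        printed++;
    }
    return 0;
}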
TL;DR - There is no reasonable answer that can be given without stating assumptions about the runtime environment. Since most runtimes (unless you're using the desktop-computer market share of the '80s and '90s to calculate 'most') have supported cursor positioning since the time of the VT-52, I don't think it's entirely justified to say that it's impossible--just that for it to be possible, it's an entirely different order of magnitude of work, and not as simple as %8d...which it kinda seemed like the OP knew about.
We just have to clarify the assumptions.
And lest one thinks that I/O is exceptional, i.e., the only time we need to think about the runtime, (or even the hardware), just dig into IEEE 754 Floating Point exception handling. For those interested:
Intel Floating Point Case Study
According to Professor William Kahan, University of California at Berkeley, a classic case occurred in June 1996. A satellite-lifting rocket named Ariane 5 turned cartwheels shortly after launch and scattered itself and a payload worth over half a billion dollars over a marsh in French Guiana. Kahan found the disaster could be blamed upon a programming language that disregarded the default exception-handling specifications in IEEE 754. Upon launch, sensors reported acceleration so strong that it caused a conversion-to-integer overflow in software intended for recalibration of the rocket's inertial guidance while on the launching pad.

So, you want an 8-character wide field with spaces as the padding? Try "%8d". Here's a reference.
EDIT: What you're trying to do is not something that can be handled by printf alone, because it will not know what the longest number you are writing is. You will need to calculate the largest number before doing any printfs, and then figure out how many digits to use as the width of your field. Then you can use snprintf or similar to make a printf format on the spot.
char format[20];
snprintf(format, sizeof format, "%%%dd\n", max_length);
while (got_output) {
    printf(format, number);
    got_output = still_got_output();
}

Try converting to a string and then use "%4.4s" as the format specifier. This makes it a fixed width format.

As far as I can tell from the question, the amount of padding you want will vary according to the data you have. Accordingly, the only solution to this is to scan the data before printing, to figure out the widest datum, and so find a width value you can pass to printf using the asterisk (*) width specifier, e.g.
loop over data - get correct padding, put into width
printf( "%*d\n", width, datum );

If you can't know the width in advance, then your only possible answer would depend on staging your output in a temporary buffer of some kind. For small reports, just collecting the data and deferring output until the input is bounded would be simplest.
For large reports, an intermediate file may be required if the collected data exceeds reasonable memory bounds.
Once you have the data, then it is simple to post-process it into a report using the idiom printf("%*d", width, value) for each value.
Alternatively if the output channel permits random access, you could just go ahead and write a draft of the report that assumes a (short) default width, and seek back and edit it any time your width assumption is violated. This also assumes that you can pad the report lines outside that field in some innocuous way, or that you are willing to replace the output so far by a read-modify-write process and abandon the draft file.
But unless you can predict the correct width in advance, it will not be possible to do what you want without some form of two-pass algorithm.
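A minimal sketch of the simplest variant of that two-pass approach: collect the values first (here a fixed example array stands in for the real input), find the widest, then print.
#include <stdio.h>

int main(void)
{
    int data[] = { 1, 5, 50, 100 };   /* stands in for the collected input */
    int count = sizeof data / sizeof data[0];
    int width = 1, i;

    /* pass 1: find the widest datum */
    for (i = 0; i < count; i++) {
        int w = snprintf(NULL, 0, "%d", data[i]);
        if (w > width)
            width = w;
    }

    /* pass 2: print everything right-aligned in that width */
    for (i = 0; i < count; i++)
        printf("%*d\n", width, data[i]);
    return 0;
}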

Looking at the edited question, you need to find the number of digits in the largest number to be presented, and then generate the printf() format using sprintf(), or use %*d with the number of digits passed as an int for the * and then the value. Once you've got the biggest number (and you have to determine that in advance), you can determine the number of digits with an 'integer logarithm' algorithm (how many times can you divide by 10 before you get to zero), or by using snprintf() with a buffer length of zero, the format %d, and a null pointer for the string; the return value tells you how many characters would have been formatted.
If you don't know and cannot determine the maximum number ahead of its appearance, you are snookered - there is nothing you can do.
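Both digit-counting methods mentioned above, as a small sketch (the value 1000 is just an example maximum):
#include <stdio.h>

/* 'integer logarithm': how many times can we divide by 10 before reaching zero */
static int count_digits(int value)
{
    int digits = 0;
    do {
        digits++;
        value /= 10;
    } while (value != 0);
    return digits;
}

int main(void)
{
    int biggest = 1000;   /* the largest number, determined in advance */

    /* alternative: snprintf with a zero-length buffer reports how many
     * characters it would have written */
    int via_snprintf = snprintf(NULL, 0, "%d", biggest);

    printf("%d %d\n", count_digits(biggest), via_snprintf);   /* prints: 4 4 */
    printf("%*d\n", count_digits(biggest), 7);                /* prints: "   7" */
    return 0;
}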

#include <stdio.h>

int main(void)
{
    int i, j, n, b;

    printf("Enter no of rows ");
    scanf("%d", &n);
    b = n;
    for (i = 1; i <= n; ++i) {
        for (j = 1; j <= i; j++) {
            printf("%*d", b, j);   /* only the first number of a row gets the full padding */
            b = 1;
        }
        b = n - i;                 /* the first number of the next row needs one less column */
        printf("\n");
    }
    return 0;
}

fp = fopen("RKdata.dat", "w");
fprintf(fp, "%5s %12s %20s %14s %15s %15s %15s\n",
        "j", "x", "integrated", "bessj2", "bessj3", "bessj4", "bessj5");
for (j = 1; j <= NSTEP; j += 1)
    fprintf(fp, "%5i\t %12.4f\t %14.6f\t %14.6f\t %14.6f\t %14.6f\t %14.6f\n",
            j, xx[j], y[6][j], bessj(2, xx[j]), bessj(3, xx[j]), bessj(4, xx[j]), bessj(5, xx[j]));
fclose(fp);

Related

What do files actually contain, and how are they "read"? What is a "format" and why should I worry about it?

As it becomes ever easier to use computers in general and get into programming in particular, an increasing fraction of beginners seem to lack certain fundamental understanding that was once taken for granted in programming circles. Meanwhile as technology advances, the details of that understanding have grown more complex (I personally was programming before Unicode existed, let alone, say, JSON or XML). So, for the sake of having a solid reference, it seems apropos to ask:
What exactly is in a file, anyway? What do we mean when we say that we "open" and "read" a file - what are we getting out of it? I know the term "data", but just giving a name to something is not a real explanation.
More importantly, how do we make sense of data? If I try simply reading some data from a file and outputting it to the console, why does it often look like garbage? Why do some other files appear to have some text scattered among that garbage, while yet others seem to be mostly or completely text? Why isn't it sufficient to ask the program to read, say, an image file, in order to display the image? Again, I know the term "format", but this doesn't explain the concept. If we say, for example, that we make sense of data according to its format, then that only raises two more questions - how do we determine the format, and how does it actually help?
Related: What exactly causes binary file "gibberish"?
Data, bits and bytes
Everyone who has had to buy hardware, or arrange a network connection, should have some familiarity with the concept of a "bit" and of a "byte". They're used to measure the capacity of storage devices and transfer rates. In short, they measure data: the amount of data that can be stored on a disk, or the amount of data transferred along a cable (or via a wireless connection) per second.
Data is essentially information - a record of some kind of knowledge. The bit is the fundamental unit of information, representing the smallest possible amount of knowledge: the answer to a yes-or-no question, a choice between two options, a record of a decision between two alternatives. (There would need to be at least two possibilities; with only one, there was no answering, choice or decision necessary, and thus nothing is learned by seeing that single possibility arise.)
A byte is simply a grouping of bits in a standard size. Almost everyone nowadays defines a byte to mean 8 bits, mainly because all contemporary consumer hardware is designed around that concept. In some very specific technical contexts (such as certain C or C++ language standard documents), "byte" may have a broader meaning, and octet is used to be precise about 8-bit groupings. We will stick with "byte" here, because we don't need to worry about ancient hardware or idiosyncratic compiler implementations for now.
Data storage devices - both permanent ones like HDDs and SSDs, and temporary ones like RAM - use a huge amount of individual components (depending on the device) to represent data, each of which can conceptually be in either of two states (we commonly use "on or off", "1 or 0" etc. as metaphors). Because there's a decision to be made between those two states, the component thus represents one bit of data. The data isn't a physical thing - it's not the component itself. It's the state of that component: the answer to the question "which of the two possible ways is this component configured right now?".
How data is made useful
It's clear to see how we can use a bit to represent a number, if there are only two possible numbers we are interested in. Suppose those numbers are 0 and 1; then we can ask, "is the number 1?", and according to the bit that tells us the answer to that question, we know which number is represented.
It turns out that in fact this is all we need in order to represent all kinds of numbers. For example, if we need to represent a number from {0, 1, 2, 3}, we can use two bits: one that tells us whether the represented number is in {0, 1} or {2, 3}, and one that tells us whether it's in {0, 2} or {1, 3}. If we can answer those two questions, we can identify the number. This technique generalizes, using base two arithmetic, to represent any integer: essentially, each bit corresponds to a value from the geometric sequence 1, 2, 4, 8, 16..., and then we just add up (implicitly) the values that were chosen by the bits. By tweaking this convention slightly, we can represent negative integers as well. If we let some bits correspond to binary fractions as well (1/2, 1/4, 1/8...), we can approximate real numbers (including the rationals) as closely as we want, depending on how many bits we use for the fractional part. Alternately, we can just use separate groups of bits to represent the numerator and denominator of a rational number - or, for that matter, the real and imaginary parts of a complex number.
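Purely as an illustration (in C, though the idea is language-independent), here is that "which powers of two were chosen?" view applied to a single byte:
#include <stdio.h>

int main(void)
{
    unsigned value = 13;   /* 13 = 8 + 4 + 1 */
    int bit;

    /* ask, for each power of two from 128 down to 1: is it included? */
    for (bit = 7; bit >= 0; bit--)
        putchar(((value >> bit) & 1u) ? '1' : '0');
    putchar('\n');         /* prints 00001101 */
    return 0;
}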
Furthermore, once we can represent numbers, we can represent all kinds of answers to questions. For example, we can agree on a sequence of symbols that are used in text; and then, implicitly, a number represents the symbol at that position in the sequence. So we can use some amount of bits to represent a symbol; and by representing individual symbols repeatedly, we can represent text.
Similarly, we can represent the height of a sound wave at a given instant in time; by repeating this process a few tens of thousands of times per second, we can represent sound audible to humans.
Similarly, having studied how the human eye works, we find that we can analyze colours as combinations of three intensity values (i.e., numbers) representing "components" of the colour. By describing colours at many points a small distance apart (like with the sound wave, but in a two-dimensional grid), we can represent images. By considering images across time (a few tens of times per second), we can represent animations.
And so on, and so on.
Choosing an interpretation
There's a problem, here, though. All of this simply talks about possibilities for what data could represent. How do we know what it does represent?
Plainly, the raw data stored by a computer doesn't inherently represent anything specific. Because it's all in the same regular, sequence-of-bits form, nothing stops us from taking any arbitrary chunk of data and interpreting it by any of the schemes described above.
It just... isn't likely to appear like anything meaningful, that way.
However, the choice of interpretations is a choice... which means it can be encoded and recorded in this raw-data form. We say that such data is metadata: data that tells us about the meaning of other data. This could take many forms: the names of our files and the folder structure (telling us how those files relate to each other, and how the user intends to keep track of them); extensions on file names, special data at the beginning of files or other notes made within the file system (telling us what type of file it is, corresponding to a file format - keep reading); documentation (something that humans can read in order to understand how another file is intended to work); and computer programs (data which tells the computer what steps to take, in order to present the file's contents to the user).
What is a (file) format?
Quite simply, a format is the set of rules that describes a way to interpret some data (typically, the contents of a file). When we say that a file is "in" a particular format, we mean that it a) has a valid interpretation according to that format (not every possible chunk of data will meet the requirements, in general) and b) is intended to be interpreted that way.
Put another way: a format is the meaning represented by some metadata.
A format can be a subset or refinement of some other format. For example, JSON documents are also text documents, using UTF-8 encoding. The JSON format adds additional meaning to the text that was represented, by describing how specific text sequences are used to represent structured data. A programming language can also be thought of as this kind of format: it gives additional meaning to text, by explaining how that text can be translated into instructions a computer can follow. (A computer's "machine code" is also a kind of format, that gets interpreted directly by the hardware rather than by a program.)
(Recall: we established that a computer program can be a kind of metadata, and that a programming language can be a kind of format, and that metadata represents a format. To close the loop: of course, one can have a computer program that implements a programming language - that's what a compiler is.)
A format can also involve multiple steps, explained by separate standards. For example, Unicode is the de facto standard text format, but it only describes how abstract numbers correspond to text symbols. It doesn't directly say how to convert the bits into numbers (and this does need to be specified; "treat each byte as a number from 0..255" a) would still be making a choice of many possible ways to do it; b) isn't really sufficient, because there are a lot more possible text symbols than that). To represent text, we also need an encoding, i.e. the rest of the rules for the data format, specifically to convert bits to numbers. UTF-8 is one such encoding, and has become dominant.
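As a small illustration of that two-step idea (Unicode assigns a number to the symbol; the encoding turns the number into bytes), here is the standard UTF-8 byte layout for code points up to U+FFFF; the helper function name is made up for this sketch:
#include <stdio.h>

/* Encode one Unicode code point (limited to U+FFFF here, for brevity) as UTF-8. */
static int utf8_encode(unsigned cp, unsigned char out[3])
{
    if (cp < 0x80) {                       /* 1 byte:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {               /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else {                               /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
}

int main(void)
{
    unsigned char bytes[3];
    int i, n = utf8_encode(0x20AC, bytes);        /* U+20AC, the euro sign */

    for (i = 0; i < n; i++)
        printf("%02X ", (unsigned)bytes[i]);      /* prints E2 82 AC */
    printf("\n");
    return 0;
}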
What actually happens when we read the file?
Raw data is transferred from the file on disk, into the program's memory.
That's it.
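A sketch of that raw transfer in C; the file name and buffer size are arbitrary examples:
#include <stdio.h>

int main(void)
{
    unsigned char buffer[4096];
    size_t got;
    FILE *f = fopen("example.bin", "rb");   /* any file; the name is made up */

    if (!f)
        return 1;
    got = fread(buffer, 1, sizeof buffer, f);
    fclose(f);

    /* 'buffer' now holds up to 4096 bytes of raw data: numbers from 0 to 255.
     * Nothing so far says whether they represent text, pixels, or sound. */
    printf("read %zu bytes\n", got);
    return 0;
}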
Some languages offer convenience functionality, for the common case of treating the data like text. This might mean doing some light processing on the data (because operating systems disagree about which text symbols, in what order represent "the end of a line"), and loading the data into the language's built-in "string" data structure, using some kind of encoding. (Yes, even if the encoding is "each byte represents a number from 0 to 255 inclusive, which represents the corresponding Unicode code point", that is an encoding - even if it doesn't represent all text and thus isn't a proper Unicode encoding - and it is being used even if the programmer did nothing to specify it; there is no such thing as "plain text", and ignoring this can have all kinds of strange consequences.)
But fundamentally, the reading is really just a transfer of data. Text conversion is often treated as special because, for a long time, programmers were sloppy about treating text properly as an interpretation of data; for decades there was an interpretation of data as text - one byte per text symbol (incidentally, "character" does not mean the same thing as a Unicode code point) - so well established that everyone started forgetting they were actually using it. Programmers forgot about this even though it only actually specifies what half the possible values of a byte mean and leaves the other half up to a local interpretation, and even though that scheme is still woefully inadequate for many world languages, such that programmers in many other countries came up with their own solutions. The solution - the Unicode standard, mentioned several times above - had its first release in 1991, but there are still a few programmers today blithely ignoring it.
But enough ranting.
How does interpreting a file work?
In order to display an image, render a web page, play sound or anything else from a file, we need to:
Have data that is actually intended to represent the corresponding thing;
Know the format that is used by the data to represent the thing;
Load the data (read the file, or read data from a network connection, or create the data by some other process);
Process the data according to the format.
This happens for even the simplest cases, and it can involve multiple programs. For example, a simple command-line program that inputs text from the user (from the "standard input stream") and outputs text back (to the "standard output stream"), generally, is not actually causing the text to appear on screen, or figuring out what keys were pressed on the keyboard. Instead: the operating system interprets signals from the keyboard, in order to create readable data; after the program writes out its response to the input, another program (the terminal) will translate the text into pixel colour values (getting help from the operating system to choose images from a font); then the operating system will arrange to send the appropriate data to the monitor (according to the terminal window's position etc.).

Using C I would like to format my output such that the output in the terminal stops once it hits the edge of the window

If you type ps aux into your terminal and make the window really small, the output of the command will not wrap and the format is still very clear.
When I use printf and output my 5 or 6 strings, sometimes the length of my output exceeds that of the terminal window and the strings wrap to the next line which totally screws up the format. How can I write my program such that the output continues to the edge of the window but no further?
I've tried searching for an answer to this question but I'm having trouble narrowing it down and thus my search results never have anything to do with it so it seems.
Thanks!
There are functions that can let you know information about the terminal window, and some others that will allow you to manipulate it. Look up the "ncurses" or the "termcap" library.
A simple approach for solving your problem will be to get the terminal window size (specially the width), and then format your output accordingly.
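Besides curses/termcap, on most Unix-like systems you can ask for the window size directly with the TIOCGWINSZ ioctl; a minimal sketch (not portable everywhere, e.g. not to plain Windows consoles):
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    struct winsize ws;

    /* ask the tty driver how big the terminal currently is */
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == 0)
        printf("%d columns x %d rows\n", ws.ws_col, ws.ws_row);
    else
        perror("ioctl");    /* e.g. stdout is not a terminal */
    return 0;
}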
There are two possible answers to fix your problem.
Turn off line wrapping in your terminal emulator(if it supports it).
Look into the Curses library. Applications like top or vim use the Curses library for screen formatting.
You can find, or at least guess, the width of the terminal using methods that other answers describe. That's only part of the problem however -- the tricky bit is formatting the output to fit the console. I don't believe there's any alternative to reading the text word by word, and moving the output to the next line when a word would overflow the width. You'll need to implement a method to detect where the white-space is, allowing for the fact that there could be multiple white spaces in a row. You'll need to decide how to handle line-breaking white-space, like CR/LF, if you have any. You'll need to decide whether you can break a word on punctuation (e.g, a hyphen). My approach is to use a simple finite-state machine, where the states are "At start of line", "in a word", "in whitespace", etc., and the characters (or, rather character classes) encountered are the events that change the state.
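A deliberately simplified sketch of that idea for plain ASCII text: greedy word-by-word wrapping, with none of the hyphenation or CR/LF subtleties described above.
#include <stdio.h>
#include <string.h>

/* Print 'text' word by word, breaking the line before any word that would
 * run past 'width' columns. ASCII only; runs of whitespace are collapsed. */
static void wrap(const char *text, int width)
{
    int col = 0;
    const char *p = text;

    while (*p) {
        while (*p == ' ' || *p == '\n')   /* skip whitespace between words */
            p++;
        if (!*p)
            break;
        int len = (int)strcspn(p, " \n"); /* length of the next word */
        if (col > 0 && col + 1 + len > width) {
            putchar('\n');                /* word would overflow: break the line */
            col = 0;
        }
        if (col > 0) {
            putchar(' ');
            col++;
        }
        fwrite(p, 1, (size_t)len, stdout);
        col += len;
        p += len;
    }
    putchar('\n');
}

int main(void)
{
    wrap("the quick brown fox jumps over the lazy dog again and again", 20);
    return 0;
}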
A particular complication when working in C is that there is little-to-no built-in support for multi-byte characters. That's fine for text which you are certain will only ever be in English, and use only the ASCII punctuation symbols, but with any kind of internationalization you need to be more careful. I've found that it's easiest to convert the text into some wide format, perhaps UTF-32, and then work with arrays of 32-bit integers to represent the characters. If your text is UTF-8, there are various tricks you can use to avoid having to do this conversion, but they are a bit ugly.
I have some code I could share, but I don't claim it is production quality, or even comprehensible. This simple-seeming problem is actually far more complicated than first impressions suggest. It's easy to do badly, but difficult to do well.

Standard C function names

Is there any rationale for the abbreviated way standard C functions are written? For example, malloc() is short for 'memory allocation'. sprintf() is 'string print formatted'. Neither of these names are very good at telling you what the function actually does. It never occurred to me how terrible some of these abbreviated function names are until recently when I had to teach a new intern many of these functions.
When the language was being developed, was there any reason malloc() was chosen over memAllocate() or something similar? My best guess would be that they more closely resemble UNIX commands, but that doesn't feel like the right answer.
Check out http://publications.gbdirect.co.uk/c_book/chapter2/keywords_and_identifiers.html -
The problem is that there was never any guarantee that more than a certain number of characters would be checked when names were compared for equality—in Old C this was eight characters, in Standard C this has changed to 31.
Basically, in the past (long while back) you could only count on the first eight characters for uniqueness in a function name. So, you end up with a bunch of short names for the core functions.
As Neal Stephenson wrote about Unix in In the Beginning Was the Command Line,
Note the obsessive use of abbreviations and avoidance of capital letters; this is a system invented by people to whom repetitive stress disorder is what black lung is to miners. Long names get worn down to three-letter nubbins, like stones smoothed by a river.
The first version of Unix and the first C compiler were written using versions of ed. Not vi, not emacs, not anything resembling an IDE, but the line-based ed. There comes a point where reducing the number of keystrokes really does increase the number of SLOC you can write per day, when you're inventing something brand-new and writing it for the first time.
The historical justification is of course that historically the C standard only required implementations to distinguish the initial 6 characters of external identifier names. This allowance was removed in C99. However, users of the C language generally:
Aim to write source code in such a way that it fits in a reasonable number of columns, usually 80 or fewer, which is difficult with long identifier names.
Type identifier names with a keyboard, which is difficult and a waste of time when the identifiers are long.
Tend to prefer high information density and signal-to-noise ratio in source code.

file and formatting alternative libs for c

I've done some searching and have not found anything that would boost the file and formatting functions in Visual Studio VS2010 C (not C++).
I've been able to address the raw i/o issues to some extent by using large buffers and a SSD drive, so the more pressing issue is a replacement for the family of printf functions.
Has anyone found something worthwhile?
As I understand it, part of the glacial speed issue with the printf functions is that they have to handle myriad types of arguments. Does anyone have experience with writing a datatype-specific version of printf; eg, one that only prints ints, or only prints doubles, etc?
First off, you should profile the code first before assuming it's printf.
But if you're sure it's printf and similar then you can do a few things to fix the issue.
1) Print less. I.e., don't call expensive operations as much if you can avoid it. Do you need all the output, for example?
2) Replace the format-string calls with manually built routines that write out each piece without having to parse the format specifier.
EG: printf("--%s--", "really cool");
Can become:
write(1, "--", 2);
write(1, "really cool", 11);
write(1, "--", 2);
That may be faster. But again, you won't know until you profile it. Don't spend energy on a solution till you can confirm it's the solution you need and be able to measure the success of your proposed solution.
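For the datatype-specific idea from the question, here is a sketch of an int-only writer that skips format parsing entirely; the function name is made up, and whether it actually beats printf is exactly what profiling has to decide:
#include <unistd.h>

/* print_int: write a decimal integer directly to fd, no format string involved */
static void print_int(int fd, int value)
{
    char buf[12];                      /* enough for -2147483648 */
    char *p = buf + sizeof buf;
    unsigned u = value < 0 ? 0u - (unsigned)value : (unsigned)value;

    do {
        *--p = (char)('0' + u % 10);   /* build the digits from the right */
        u /= 10;
    } while (u != 0);
    if (value < 0)
        *--p = '-';
    write(fd, p, (size_t)(buf + sizeof buf - p));
}

int main(void)
{
    print_int(1, 12345);
    write(1, "\n", 1);
    return 0;
}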
@Wes is right, never assume you know what you need to fix until you have proof.
Where I differ is on the method of finding out.
I and others use random pausing which works for these reasons, and here's a short slide show demo & C++ code so you can see how it works, if you want.
The thing about printf (or any output) function is it spends A) a certain number of CPU cycles creating a buffer to be output, and then it spends B) a certain amount of time waiting while the system and/or auxiliary hardware actually moves the data out.
That's maybe a bit over-simplified, but if you randomly pause and examine the state, that's what you see.
What you've done by using large buffers and an SSD drive is reduce B, and that's good.
That means of the time remaining, A is a larger fraction.
You know that.
Now of the samples you find in A, you might get a hint of what's happening if you see what subordinate routines inside printf are showing up.
Usually printf calls something like vprintf to get rid of the variable argument list, which then cycles over the format string to figure out what to do, including things like parsing precision specifiers.
If it looks like that's what it's doing, then you know about how much time goes into parsing the format.
On the other hand, if you see it inside a routine that is copying a string, or formatting an integer (along with dealing with leading/trailing characters, etc.) then you know to concentrate on that.
On yet another hand, if you see it inside a routine that looks like it's formatting a floating point number (which is actually quite complicated), you know to concentrate on that.
Given all that, you want to know what I do?
First, I ask who is going to read this anyway?
If nobody really needs to read all this text, why not pump it out in binary? Or failing that, in hex?
If you simply write binary, A shrinks to nothing, and when you read it back in with another program, guess what?
No Lost Bits!

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick brown! ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double-buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then, knowing that you are in a different state because you saw a $, you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well-formatted number followed by a ';' wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024}  { write_string(yytext); }
$[0-9]{1,3};  { write_char(atoi(yytext + 1)); }
[$]           { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
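If lex feels like overkill, the state machine described above can also be hand-rolled. The sketch below reads fixed-size chunks and keeps its state (and any partially read escape) across reads, which sidesteps the buffer-boundary problem without needing a circular buffer; all sizes here are illustrative.
#include <stdio.h>
#include <stdlib.h>

enum state { PLAIN, IN_ESCAPE };

int main(void)
{
    char chunk[1000];
    char pending[16];            /* holds "$123" while waiting for the ';' */
    size_t npending = 0;
    enum state st = PLAIN;
    size_t got, i;

    while ((got = fread(chunk, 1, sizeof chunk, stdin)) > 0) {
        for (i = 0; i < got; i++) {
            char c = chunk[i];
            if (st == PLAIN) {
                if (c == '$') {
                    pending[npending++] = c;
                    st = IN_ESCAPE;
                } else {
                    putchar(c);
                }
            } else if (c >= '0' && c <= '9' && npending < sizeof pending - 1) {
                pending[npending++] = c;      /* still collecting digits */
            } else if (c == ';' && npending > 1) {
                pending[npending] = '\0';
                putchar(atoi(pending + 1));   /* "$33;" becomes '!' */
                npending = 0;
                st = PLAIN;
            } else {
                /* not a valid escape ("$a", "$12a", too many digits...):
                 * emit what was buffered and treat this character normally */
                fwrite(pending, 1, npending, stdout);
                npending = 0;
                if (c == '$') {               /* it may start a new escape */
                    pending[npending++] = c;
                    st = IN_ESCAPE;
                } else {
                    putchar(c);
                    st = PLAIN;
                }
            }
        }
    }
    fwrite(pending, 1, npending, stdout);     /* flush a trailing partial escape */
    return 0;
}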
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.
