Program's Purpose: Rune Cipher
Note: I link to my own GitHub page below, only to show what the program is for and what I needed help with (and got help with; thanks once again to all of you!).
Final Edit:
Thanks to the extremely useful answers below, I have now completed the project I was working on, and I am providing the full code for future readers.
Again, this wouldn't have been possible without all the help I got from the people below. Thanks once again!
Original code on GitHub
Code
(Shortened down a bit)
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#define UNICODE_BLOCK_START 0x16A0
#define UUICODE_BLOCK_END 0x16F1
int main(){
setlocale(LC_ALL, "");
wchar_t SUBALPHA[]=L"ᛠᚣᚫᛞᛟᛝᛚᛗᛖᛒᛏᛋᛉᛈᛇᛂᛁᚾᚻᚹᚷᚳᚱᚩᚦᚢ";
wchar_t DATA[]=L"hello";
int lenofData=0;
int i=0;
while(DATA[i]!='\0'){
lenofData++; i++;
}
for(int i=0; i<lenofData; i++) {
printf("DATA[%d]=%lc",i,DATA[i]);
DATA[i]=SUBALPHA[i];
printf(" is now Replaced by %lc\n",DATA[i]);
} printf("%ls",DATA);
return 0;
}
Output:
DATA[0]=h is now Replaced by ᛠ
...
DATA[4]=o is now Replaced by ᛟ
ᛠᚣᚫᛞᛟ
Question continues below
(Note that it's solved, see Accepted answer!)
In Python3 it is easy to print runes:
for i in range(5794,5855):
print(chr(i))
outputs
ᚢ
ᚣ
(..)
ᛝ
ᛞ
How to do that in C ?
using variables (char, char arrays[], int, ...)
Is there a way to e.g print ᛘᛙᛚᛛᛜᛝᛞ as individual characters?
When I try it, I only get warnings about a multi-character character constant 'ᛟ'.
I have tried having them as an array of char, a "string" (e.g. char s1[] = "ᛟᛒᛓ";)
and then printing the first char (ᛟ) of s1: printf("%c", s1[0]); Now, this might seem very wrong to others.
One Example of how I thought of going with this:
Print a rune as "an individual character":
To print e.g 'A'
printf("%c", 65); // 'A'
How do I do that, (if possible) but with a Rune ?
I have also tried printing its numeric value as a char, which results in question marks and other "undefined" results.
As I do not remember exactly everything I have tried so far, I will do my best to formulate this post.
If someone spots a very easy (maybe, to them, even plainly obvious) solution (or trick/workaround),
I would be super happy if you could point it out! Thanks!
This has bugged me for quite some time.
It works in Python, and as far as I know it works in C if you print the literal directly (not through any variable), e.g. printf("ᛟ");. But as I said, I want to do the same thing through variables (like char runes[]="ᛋᛟ";) and then: printf("%c", runes[0]); // to get 'ᛋ' as the output
(Or similar; it does not need to be %c, and it does not need to be a char array/char variable.) I am just trying to understand how to do the above (hopefully this is not too unreadable).
I am on Linux, and using GCC.
External Links
Python3 Cyphers - At GitHub
Runes - At Unix&Linux SE
Junicode - At Sourceforge.io
To hold a character outside of the 8-bit range, you need a wchar_t (which isn't necessarily Unicode). Although wchar_t is a fundamental C type, you need to #include <wchar.h> to use it, and to use the wide character versions of string and I/O functions (such as putwc shown below).
You also need to ensure that you have activated a locale which supports wide characters, which should be the same locale as is being used by your terminal emulator (if you are writing to a terminal). Normally, that will be the default locale, selected with the string "".
Here's a simple equivalent to your Python code:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
int main(void) {
setlocale(LC_ALL, "");
/* As indicated in a comment, I should have checked the
* return value from `putwc`; if it returns EOF and errno
* is set to EILSEQ, then the current locale can't handle
* runic characters.
*/
for (wchar_t wc = 5794; wc < 5855; ++wc)
putwc(wc, stdout);
putwc(L'\n', stdout);
return 0;
}
(Live on ideone.)
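The comment in the code above mentions checking the return value of putwc. A minimal sketch of that check follows; put_wide_range is a hypothetical helper name, not something from the original answer, and it assumes the caller has already called setlocale.

```c
#include <errno.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: write the code points first..last (inclusive) to
 * `out`, one wide character at a time. Returns 0 on success, or -1 as
 * soon as putwc() fails (errno is EILSEQ when the active locale cannot
 * encode the character). */
int put_wide_range(wchar_t first, wchar_t last, FILE *out)
{
    for (wchar_t wc = first; wc <= last; ++wc) {
        errno = 0;
        if (putwc(wc, out) == WEOF)
            return -1;
    }
    return 0;
}
```

After setlocale(LC_ALL, ""), a call such as put_wide_range(0x16A0, 0x16F0, stdout) would print the runes; a return value of -1 with errno equal to EILSEQ suggests that the locale cannot represent them.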
Stored on the stack as a string of (wide) characters
If you want to add your runes (wchar_t) to a string then you can proceed the following way:
using wcsncpy: (overkill for char, thanks chqrlie for noticing)
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#define UNICODE_BLOCK_START 0x16A0 // see wikipedia link for the start
#define UUICODE_BLOCK_END 0x16F0 // end of the range printed here
int main(void) {
    setlocale(LC_ALL, "");
    // one slot per code point in the inclusive range, plus the terminator
    wchar_t buffer[UUICODE_BLOCK_END - UNICODE_BLOCK_START + 2];
    int i = 0;
    for (wchar_t wc = UNICODE_BLOCK_START; wc <= UUICODE_BLOCK_END; wc++)
        buffer[i++] = wc;
    buffer[i] = L'\0';
    printf("%ls\n", buffer);
    return 0;
}
About Wide Chars (and Unicode)
To understand a bit better what a wide char is, think of it as a value whose bits exceed the original range used for characters, which was 2^8 = 256 values (or, written with a left shift, 1 << 8).
That is enough when you just need to print what is on your keyboard, but it is no longer enough when you need to print Asian characters or other Unicode characters, and that is the reason why the Unicode standard was created. You can find more about the very different and exotic characters that exist, along with their ranges (named Unicode blocks), on Wikipedia; in your case, Runic.
Range U+16A0..U+16FF - Runic (86 characters), Common (3 characters)
NB: The 89 assigned characters end at 0x16F8, which is slightly before the end of the block at 0x16FF (0x16F9 to 0x16FF are not defined)
You can use the following function to print your wide char as bits:
void print_binary(unsigned int number)
{
    char buffer[36]; // 32 bits, 3 spaces and one \0
    unsigned int mask = 1u << 31; // highest of 32 bits (assumes a 32-bit unsigned int)
    int pos = 0;
    for (int i = 0; i < 32; i++) {
        if (i && !(i % 8))
            buffer[pos++] = ' '; // separate the bytes
        buffer[pos++] = '0' + !!(number & (mask >> i));
    }
    buffer[pos] = '\0';
    printf("%s\n", buffer);
}
That you call in your loop with:
print_binary((unsigned int)wc);
It will give you a better understanding of how your wide char is represented at the machine level:
ᛞ
00000000 00000000 00010110 11011110
NB: You will need to pay attention to detail: Do not forget the final L'\0' and you need to use %ls to get the output with printf.
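Applied to the cipher idea from the question, a substitution over a wchar_t string could look like the sketch below. It assumes the intended cipher maps each lowercase letter to the rune at the same alphabet position (the question's loop substitutes by position in the string instead), and rune_substitute is a hypothetical name.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: replace each lowercase ASCII letter in `text`
 * with the rune at the same alphabet position in `runes`, which must
 * hold at least 26 wide characters. Other characters are left as-is.
 * Assumes 'a'..'z' are contiguous, as they are in ASCII. */
void rune_substitute(wchar_t *text, const wchar_t *runes)
{
    for (size_t i = 0; text[i] != L'\0'; i++) {
        if (text[i] >= L'a' && text[i] <= L'z')
            text[i] = runes[text[i] - L'a'];
    }
}
```

With the question's SUBALPHA array as `runes` and DATA as `text`, the result can then be printed with printf("%ls\n", DATA) once setlocale has been called.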
Related
First, in this C project we have some restrictions on how we write code: I can't declare a variable and assign a value to it on the same line, and we are only allowed to use while loops. Also, I'm using Ubuntu, for reference.
I want to print the decimal ASCII value, character by character, of a string passed to the program. For example, if the input is "rose", the program correctly prints 114 111 115 101. But when I try to print the decimal value of a char like 'Ç', the first char of the extended ASCII table, the program weirdly prints -61 -121. Here is the code:
int main (int argc, char **argv)
{
int i;
i = 0;
if (argc == 2)
{
while (argv[1][i] != '\0')
{
printf ("%i ", argv[1][i]);
i++;
}
}
}
I did some research and found that I should try unsigned char for argv instead of char, like this:
int main (int argc, unsigned char **argv)
{
int i;
i = 0;
if (argc == 2)
{
while (argv[1][i] != '\0')
{
printf("%i ", argv[1][i]);
i++;
}
}
}
In this case, when I run the program with a 'Ç', the output is 195 135 (still wrong).
How can I make this program print the right decimal value of a char from the extended ASCII table? In this case, a "Ç" should be 128.
Thank you!!
Your platform is using UTF-8 Encoding.
Unicode Latin Capital Letter C with Cedilla (U+00C7) "Ç" encodes to 0xC3 0x87 in UTF-8.
In turn those bytes in decimal are 195 and 135 which you see in output.
Remember that UTF-8 is a multi-byte encoding for characters outside basic ASCII (0 through 127).
That character is code point 128 in some extended ASCII tables (code page 437, for example), but UTF-8 diverges from extended ASCII in that range.
You may find there are tools on your platform to convert that to extended ASCII, but I suspect you don't want to do that and should work with the encoding supported by your platform (which I am sure is UTF-8).
It's Unicode Code Point 199 so unless you have a specific application for Extended ASCII you'll probably just make things worse by converting to it. That's not least because it's a much smaller set of characters than Unicode.
Here's some information for Unicode Latin Capital Letter C with Cedilla including the UTF-8 Encoding: https://www.fileformat.info/info/unicode/char/00C7/index.htm
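To make the byte values concrete, here is a small sketch of how a code point turns into UTF-8 bytes. utf8_encode is a hypothetical helper written for illustration, not a standard library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8 into `out`
 * (room for at least 4 bytes). Returns the number of bytes written, or 0
 * for values that are not valid code points (surrogates, > U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* plain ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+00C7 this yields the two bytes 0xC3 0x87, that is 195 and 135, which are exactly the values the question's program printed.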
There are various ways of representing non-ASCII characters, such as Ç. Your question suggests you're familiar with 8-bit character sets such as ISO-8859, where in several of its variants Ç does indeed have code 199. (That is, if your computer were set up to use ISO-8859, your program probably would have worked, although it might have printed -57 instead of 199.)
But these days, more and more systems use Unicode, which they typically encode using a particular multibyte encoding, UTF-8.
In C, one way to extract wide characters from a multibyte character string is the function mbtowc. Here is a modification of your program, using this function:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>
int main (int argc, char **argv)
{
setlocale(LC_CTYPE, "");
if (argc == 2)
{
char *p = argv[1];
int n;
wchar_t wc;
while((n = mbtowc(&wc, p, strlen(p))) > 0)
{
printf ("%lc: %d (%d)\n", wc, wc, n);
p += n;
}
}
}
You give mbtowc a pointer to the multibyte encoding of one or more characters, and it converts one of them, returning it via its first argument (here, into the variable wc). It returns the number of bytes it consumed, or 0 if it encountered the end of the string.
When I run this program on the string abÇd, it prints
a: 97 (1)
b: 98 (1)
Ç: 199 (2)
d: 100 (1)
This shows that in Unicode (just like 8859-1), Ç has the code 199, but it takes two bytes to encode it.
Under Linux, at least, the C library supports potentially multiple multibyte encodings, not just UTF-8. It decides which encoding to use based on the current "locale", which is usually part of the environment, literally governed by an environment variable such as $LANG. That's what the call setlocale(LC_CTYPE, "") is for: it tells the C library to pay attention to the environment to select a locale for the program's functions, like mbtowc, to use.
Unicode is of course huge, encoding thousands and thousands of characters. Here's the output of the modified version of your program on the string "abΣ∫😊":
a: 97 (1)
b: 98 (1)
Σ: 931 (2)
∫: 8747 (3)
😊: 128522 (4)
Emoji like 😊 typically take four bytes to encode in UTF-8.
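For readers curious about what mbtowc does under a UTF-8 locale, the decoding can be sketched by hand. utf8_decode below is a hypothetical, simplified helper: it does not depend on the locale, and it does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the UTF-8 sequence starting at `s` into *cp.
 * Returns the number of bytes consumed (1 to 4), or 0 if the lead byte
 * or a continuation byte is malformed. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len;
    uint32_t value;

    /* The lead byte tells how many bytes the sequence has. */
    if (s[0] < 0x80)                { len = 1; value = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; value = s[0] & 0x07; }
    else return 0;

    /* Each continuation byte (10xxxxxx) contributes six more bits. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)   /* also stops at a premature '\0' */
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return len;
}
```

Decoding the two bytes of Ç gives 199 with a length of 2, and the four bytes of 😊 give 128522 with a length of 4, matching the output shown above.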
The following code has several issues, for which I'm kindly asking for your input, corrections and feedback. First, I can't get it to do what I'd like, namely pass it a "word" and have it output its elf_hash. The code for the elf_gnu_hash function (EDIT 3: this should have been elf_hash, hence my surprise at the wrong hash below) is presented in ELF_DT_HASH. What I'm trying to do is incorporate this function in a small standalone program, but I can't seem to get it right: the output printed is not the one expected (compared to the mentioned article).
Second, I'm sure the code exposes, obviously to anyone who programs in C (I am not on that list), misunderstandings of data types, conversions and so on, and I'd appreciate some clarifications or hints regarding common rookie misunderstandings.
Third, and most intriguing, each time I compile and run this program and enter the same string at the scanf prompt, it prints a different result!
There are quite a few warnings at compilation, but honestly I am not sure how to fix them.
Could you help me fix this issue and address any misunderstanding or misuse of C?
I'd also appreciate some input on input sanitization (i.e. avoiding buffer overflows and so on).
I am compiling it like so, in case it matters:
gcc -Wall elf_hash-calculaTor.c && ./a.out
Thanks
As a bonus: is this the algorithm used in Linux OS amd64 elf binary files, like f.e.
$ file hexedit
hexedit: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=d3f6cd413abaa25d5b7a2427f05598f3152e2efa, for GNU/Linux 3.2.0, stripped
, or is this one DT_GNU_HASH instead ?
#include <stdint.h>
/** The ELF symbol GNU hashing function */
static unsigned long elf_gnu_hash(const unsigned char*name) {
unsigned long h = 5381;
unsigned char c;
while ((c = *name++) != '\0') {
h = (h << 5) + h + c;
}
return h & 0xffffffff;
}
int main(int ac,char **av){
unsigned char in[10] = "\0";
printf("Type word for which to calculate the elf-hash:");
scanf("%hhu", &in);
printf("Typed in:");
printf("%hhu\n", in);
printf("processing elf-hash:\n");
//elf_gnu_hash(in);
// following nok: printf("%#lx\n", elf_gnu_hash("freelocal"));
printf("%#lx\n", elf_gnu_hash(in));
printf("%s", "Thanks guys!");
}
EDIT:
To answer your questions:
The crux of this code is outputting the correct elf_hash. For example, if I enter "freelocal", it should output "0x0bc334fc", as per the linked article ELF_DT_HASH
elf_hash("freelocal") = 0x0bc334fc
, which in my case it does not.
Is it because I enter a string (with scanf "%s"), but the function expects a char pointer, which can be (if my understanding is correct) a pointer to the first element of a char array? Or is this conversion from string to char array done automatically in C? The function then goes on to traverse all the chars in this array:
...
while ((c = *name++) != '\0') {
and does some bit-wise shifting.
The intermediate printf served just to echo what was typed in and to prove that I got the type conversion right (which I didn't).
regarding no.3 , the repeated runs with different result:
tox#tox-bookp:/var/tmp$ ./a.out
Type word for which to calculate the elf-hash:cucu
Typed in:126
processing elf-hash:
0x1505
tox#tox-bookp:/var/tmp$ ./a.out
Type word for which to calculate the elf-hash:cucu
Typed in:62
...
tox#tox-bookp:/var/tmp$ ./a.out
Type word for which to calculate the elf-hash:cucu
Typed in:174
...
EDIT 2
The code now looks like so (based on @chux's comments)
int main(int ac,char **av){
unsigned char in[10] = "";
printf("Type word for which to calculate the elf-hash:");
//scanf("%hhu", &in);
//printf("%hhu\n", in);
if (scanf("%9s", in) == 1) {
printf("Typed in:");
printf("<%s>\n", in);
printf("processing elf-hash:\n");
//elf_gnu_hash(in);
// following nok:
printf("%#lx\n", elf_gnu_hash("freelocal"));
printf("%#lx\n", elf_gnu_hash(in));
}
printf("%s\n", "Thanks guys!");
}
but still outputs wrong elf_hash :
Type-in item for which to calculate the elf-hash:freelocal
Typed in:<freelocal>
processing elf-hash:
0xe3364372 **<- is this ok, and if so, why ?**
0xe3364372 **<- incorrect**
Thanks guys!
pass it a string ("word") and have it output its elf_hash.
Use "%s" with a width limit to read user input and save as a string, not "%hhu" (which is useful to read numeric text in and save as a byte).
unsigned char in[10] = "\0"; // Can simplify to ""
// scanf("%hhu", &in);
// printf("%hhu\n", in);
if (scanf("%9s", in) == 1) {
printf("<%s>\n", in);
...
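With the input handling fixed, the remaining mismatch comes from the hash function itself: the question's function is the GNU-style hash, while the expected value 0x0bc334fc belongs to elf_hash. Below is a sketch of the classic System V ELF hash, assuming that is the function the linked article means; the name elf_sysv_hash and the use of uint32_t are choices made here.

```c
#include <stdint.h>

/* Classic System V ELF hash (the function behind DT_HASH). uint32_t keeps
 * the arithmetic at 32 bits even where unsigned long is 64 bits wide. */
uint32_t elf_sysv_hash(const unsigned char *name)
{
    uint32_t h = 0;
    while (*name != '\0') {
        h = (h << 4) + *name++;
        uint32_t g = h & 0xf0000000u;   /* top four bits */
        if (g != 0)
            h ^= g >> 24;               /* fold them back into the low bits */
        h &= ~g;                        /* and clear them */
    }
    return h;
}
```

Calling elf_sysv_hash((const unsigned char *)"freelocal") returns 0x0bc334fc, the value quoted in the question, while the 0xe3364372 seen in the question's output is what the GNU-style function produces.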
I want to print out a polynomial expression in C, but I don't know how to print x to the power of a number with printf.
It's far from trivial unfortunately. You cannot achieve what you want with printf. You need wprintf. Furthermore, it's not trivial to translate between normal and superscript. You would like a function like this:
wchar_t digit_to_superscript(int d) {
wchar_t table[] = { // Unicode values
0x2070,
0x00B9, // Note that 1, 2 and 3 does not follow the pattern
0x00B2, // That's because those three were common in various
0x00B3, // extended ascii tables. The rest did not exist
0x2074, // before unicode
0x2075,
0x2076,
0x2077,
0x2078,
0x2079,
};
return table[d];
}
This function could of course be changed to handle other characters too, as long as they are supported. And you could also write more complete functions operating on complete strings.
But as I said, it's not trivial, and it cannot be done with simple format strings to printf, and not even to wprintf.
Here is a somewhat working example. It's usable, but it's very short because I have omitted all error checking and such. Shortest possible to be able to use a negative float number as exponent.
#include <wchar.h>
#include <locale.h>
wchar_t char_to_superscript(wchar_t c) {
    wchar_t digit_table[] = {
        0x2070, 0x00B9, 0x00B2, 0x00B3, 0x2074,
        0x2075, 0x2076, 0x2077, 0x2078, 0x2079,
    };
    if(c >= '0' && c <= '9') return digit_table[c - '0'];
    switch(c) {
        case '.': return 0x22C5;
        case '-': return 0x207B;
    }
    return c; // no superscript form known: keep the character
}
void number_to_superscript(wchar_t *dest, wchar_t *src) {
    while(*src){
        *dest = char_to_superscript(*src);
        src++;
        dest++;
    }
    *dest = 0; // terminate right after the last converted character
}
And a main function to demonstrate:
int main(void) {
setlocale(LC_CTYPE, "");
double x = -3.5;
wchar_t wstr[100], a[100];
swprintf(a, 100, L"%f", x);
wprintf(L"Number as a string: %ls\n", a);
number_to_superscript(wstr, a);
wprintf(L"Number as exponent: x%ls\n", wstr);
}
Output:
Number as a string: -3.500000
Number as exponent: x⁻³⋅⁵⁰⁰⁰⁰⁰
In order to make a complete translator, you would need something like this:
size_t superscript_index(wchar_t c) {
// Code
}
wchar_t to_superscript(wchar_t c) {
    static wchar_t huge_table[] = {
        // Long list of values
    };
    return huge_table[superscript_index(c)];
}
Remember that this cannot be done for all characters. Only those whose counterpart exists as a superscript version.
Unfortunately, it is not possible to output formatted text with printf.
(Of course one could output HTML format, but this then would need to be fed into an interpreter first for correct display)
So you cannot print text in superscript format in the general case.
What you have found is the superscript 1 as a special character. However this is only possible with 1 and 2, if I remember correctly (and only for the right code-page, not in plain ASCII).
The common way to print "superscripts" is to use the x^2, x^3 syntax. This is commonly understood.
An alternative is provided by klutt's answer. If you switch to Unicode by using wprintf instead of printf, you could use all superscript characters from 0 to 9. I am not sure what multi-digit exponents look like in a fixed-width terminal, but it works in principle.
If you want to print superscript 1, you need to use unicode. You can combine unicode superscripts to write a multi-digit number.
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main() {
setlocale(LC_CTYPE, "");
wchar_t one = 0x00B9;
wchar_t two = 0x00B2;
wprintf(L"x%lc\n", one);
wprintf(L"x%lc%lc\n", one, two);
}
Output:
$ clang ~/lab/unicode.c
$ ./a.out
x¹
x¹²
Ref: https://www.compart.com/en/unicode/U+00B9
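Combining superscript digits for an arbitrary integer exponent could be sketched as follows. int_to_superscript is a hypothetical helper that reuses the code points from the table in the earlier answer.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: write `n` as Unicode superscript characters into
 * `dest` (capacity `cap` wide characters, including the terminator).
 * Returns the number of characters written, or 0 if `cap` is too small. */
size_t int_to_superscript(int n, wchar_t *dest, size_t cap)
{
    static const wchar_t digits[] = {
        0x2070, 0x00B9, 0x00B2, 0x00B3, 0x2074,
        0x2075, 0x2076, 0x2077, 0x2078, 0x2079,
    };
    wchar_t tmp[24];           /* digits, least significant first */
    size_t count = 0, out = 0;
    long long value = n;       /* wide enough to negate INT_MIN */
    int negative = value < 0;

    if (negative)
        value = -value;
    do {
        tmp[count++] = digits[value % 10];
        value /= 10;
    } while (value != 0);

    if (count + (size_t)negative + 1 > cap)
        return 0;
    if (negative)
        dest[out++] = 0x207B;  /* superscript minus */
    while (count > 0)
        dest[out++] = tmp[--count];
    dest[out] = L'\0';
    return out;
}
```

After setlocale(LC_CTYPE, ""), the result can be printed with wprintf(L"x%ls\n", buffer), in the same way as the examples above.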
So, here's my problem:
If someone wants to output visually aligned strings using printf, they'll obviously use %<n>s (where <n> is the minimum field width). And this works just fine, unless one of the strings contains unicode (UTF-8) characters.
Take this very basic example:
#include <stdio.h>
int main(void)
{
char* s1 = "\u03b1\u03b2\u03b3";
char* s2 = "abc";
printf("'%6s'\n", s1);
printf("'%6s'\n", s2);
return 0;
}
which will produce the following output:
'αβγ'
'   abc'
This isn't all that surprising, because printf of course doesn't know that \u03b1 (which takes two bytes in UTF-8) only produces a single glyph on the output device (assuming UTF-8 is supported).
Now assume that I generate s1 and s2, but have no control over the format string used to output those variables. My current understanding is that nothing I could possibly do to s1 would fix this, because I'd have to somehow fool printf into thinking that s1 is shorter than it actually is. However, since I also control s2, my current solution is to add a non-printing character to s2 for each Unicode character in s1, which would look something like this:
#include <stdio.h>
int main(void)
{
char* s1 = "\u03b1\u03b2\u03b3";
char* s2 = "abc\x06\x06\x06";
printf("'%6s'\n", s1);
printf("'%6s'\n", s2);
return 0;
}
This will produce the desired output (even though the actual width no longer corresponds to the specified field width, but i'm willing to accept that):
'αβγ'
'abc'
For context:
The example above is only to illustrate the unicode-problem, my actual code involves printing numbers with SI-prefixes, only one of which (µ) is a unicode character. Therefore i would generate strings containing only up to one normal or unicode character (which is why i can accept the resulting offset in the field-width).
So, my questions are:
Is there a better solution for this?
Is \x06 (ACK) a sensible choice (i.e. a character without undesired side-effects)?
Can you think of any problems with this approach?
Since the non-ASCII content is restricted to µ, I believe there is a solution. I've taken the value of µ to be \u00b5 (the micro sign); replace it if your strings use a different code point.
I've coded a small function myPrint which takes the string and the width n as input. You should be able to modify the code below to fit your needs.
The function searches for all occurrences of µ and widens the field by one for each of them.
#include <stdio.h>
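void myPrint(char* string, int n)
{
    char* valueOfMu = "\u00b5"; // two bytes, assuming a UTF-8 execution character set
    for(int i=0;string[i]!='\0';i++)
    {
        // each µ takes two bytes but only one column, so widen the field by one
        if(string[i]==valueOfMu[0] && string[i+1]==valueOfMu[1])
            n++;
    }
    printf("%*s",n,string);
}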
int main(void)
{
char* s1 = "ab\u00b5";
char* s2 = "abc";
myPrint(s1,6);
printf("\n");
myPrint(s2,6);
printf("\n");
return 0;
}
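The same idea can be generalized beyond µ by counting UTF-8 continuation bytes. The sketch below uses hypothetical helper names and assumes that every code point occupies exactly one column, which does not hold for combining marks or wide East Asian characters.

```c
#include <stdio.h>

/* Hypothetical helper: field width to pass to printf so that a UTF-8
 * string occupies `columns` character cells. Every continuation byte
 * (bit pattern 10xxxxxx) adds a byte without adding a visible character. */
int utf8_field_width(const char *s, int columns)
{
    int extra = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) == 0x80)
            extra++;
    }
    return columns + extra;
}

/* Print `s` right-aligned in `columns` cells, between quotes. */
void print_aligned(const char *s, int columns)
{
    printf("'%*s'\n", utf8_field_width(s, columns), s);
}
```

Calling print_aligned on "\u03b1\u03b2\u03b3" and on "abc" with a width of 6 should then pad both strings with three spaces.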
First of all let me ask for your forgiveness if this is too trivial, I am not a C developer, usually I program in Fortran.
I need to read some column-formatted text files. The problem I have is that some columns can be blank (no value filled in) or only partially filled.
Let me use a short example of the problem. Let's say I have a generator program like:
#include <stdio.h>
#include <stdlib.h>
int main(){
printf("xxxx%4d%4.2f\n",99,3.14);
}
When I execute this program I get:
$ ./t1
xxxx  993.14
If I get it into a text file and try to read using (e.g.) sscanf with the code:
#include <stdio.h>
#include <stdlib.h>
int main() {
char *fmt = "%*4c%4d%4f";
char *line = "xxxx  993.14";
int ival;
float fval;
sscanf(line,fmt,&ival,&fval);
printf(">>>>%d|%f\n",ival,fval);
}
The result is:
$ ./t2
>>>>993|0.140000
What is the problem here? sscanf seems to think that all white space is meaningless and should be discarded. So the "%*4c" does what it is meant to do: it counts 4 characters without skipping any blank space and discards them because of the "*". Next, the %4d starts by skipping all blank spaces and only begins counting the 4 characters of the field at the first valid character for the conversion. So the value meant to be 99 becomes 993, and the 3.14 becomes 0.14.
In Fortran the reading code would be:
program t3
implicit none
integer :: ival
real :: fval
character(len=30) :: fmt="(4x,i4,f4.0)"
character(len=30) :: line="xxxx  993.14"
read(line,fmt) ival, fval
write(*,"('>>>>',i4,'|',f4.2)") ival,fval
end program t3
and the result would be:
$ ./t3
>>>>  99|3.14
That is, the format specification states the field width and nothing is discarded in conversion, except if instructed to by the "nX" specification.
Some final remarks to help the helpers:
The format to be read is an international standard and there is no
way to change it.
The number of existing files is too big to think of intervention or
format change.
It is not a CSV or similar format.
The code has to be in C for integration in a free software package.
Sorry to be too long, trying to state the problem as completely as possible.
The question is: Is there a way to tell sscanf not to skip the blank spaces? If not, is there a simple way to do it in C, or will it be necessary to write a specialized parser for each record type?
Thank you in advance.
When reading fixed-length fields with sscanf, it is best to parse the values as character strings (which you could do a number of ways), and then perform independent conversion of each of the fields. This allows you to handle conversion/error detection on a per-field basis. For example, you could use a format string of:
char *fmt = "%*4c%4[0-9 ]%15s";
which would read/discard the 4 leading characters, then read the 4 characters of the integer field (digits or blanks) as a string, followed by the remainder of the line (or up until the next whitespace) as a string containing your float value.
To handle the storage and parsing of line as fixed length fields, you could use temporary character arrays to hold each of the strings and then use sscanf to fill them much as you have attempted to do with the integer and float directly. e.g.:
char istr[8] = {0};
char fstr[16] = {0};
...
sscanf (line,fmt,istr,fstr);
(note: you could use minimum storage of istr[5] and fstr[5] in this given case, adjust the storage length as required, but provide space for the nul-terminating character)
You can then use strtol and strtof to provide conversion with error checking on each value. For example:
errno = 0;
if ((ival = (int)strtol (istr, NULL, 10)) == 0 && errno)
    fprintf (stderr, "error: integer conversion failed.\n");
/* underflow/overflow checks omitted */
and
errno = 0;
if ((fval = strtof (fstr, NULL)) == 0 && errno)
    fprintf (stderr, "error: float conversion failed.\n");
/* nan and inf checks omitted */
Putting all the pieces together in your example, you could use something like:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
int main() {
    char *fmt = "%*4c%4[0-9 ]%15s";
    char *line = "xxxx  993.14";
    char istr[8] = {0};
    char fstr[16] = {0};
    int ival;
    float fval;
    sscanf (line,fmt,istr,fstr);
    errno = 0;
    if ((ival = (int)strtol (istr, NULL, 10)) == 0 && errno)
        fprintf (stderr, "error: integer conversion failed.\n");
    /* underflow/overflow checks omitted */
    errno = 0;
    if ((fval = strtof (fstr, NULL)) == 0 && errno)
        fprintf (stderr, "error: float conversion failed.\n");
    /* nan and inf checks omitted */
    printf(">>>>%d|%6.2f\n",ival,fval);
    return 0;
}
Example/Output
>>>>99|  3.14
*scanf() is not designed to handle fixed column width with non-intervening white-space.
With sscanf(), to not skip spaces, code must use "%c", "%n", "%[]" as all other specifiers skip leading white-space and those skipped characters do not contribute to a width limit.
To scan the printed line, which is now in buffer, take advantage of the fact that the only '\n' is at the end of the line.
char str_int[5];
char str_float[5];
int n = 0;
sscanf(buffer, "%*4c%4[^\n]%4[^\n]%n", str_int, str_float, &n);
if (n != 12 || buffer[n] != '\n') Fail();
// Now convert str_int, str_float as needed.
Another way to use sscanf() would be to parse buffer as
int ival;
float fval;
if (strlen(buffer) != 13) Fail();
if (sscanf(&buffer[8], "%f", &fval) != 1) Fail();
buffer[8] = '\0';
if (sscanf(&buffer[4], "%d", &ival) != 1) Fail();
Note: The 4s in the format below do not fix the output width at 4 characters. 4 is the minimum width to print.
printf("xxxx%4d%4.2f\n",ival, fval);
Code could use the following to detect problems.
if (13 != printf("xxxx%4d%4.2f\n",ival, fval)) Fail();
Watch out for
printf("xxxx%4d%4.2f\n",123, 9.995000001f); // "xxxx 12310.00\n"
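The length check above can be wrapped so that an overflowing column is caught before a record is written. This is a sketch using snprintf; format_record is a hypothetical helper, and the 13-character record length comes from the format in the question.

```c
#include <stdio.h>

/* Hypothetical helper: format one record as "xxxx" + 4-column integer +
 * 4-column float + newline into `buf` (capacity `cap`). Returns 0 when the
 * result is exactly 13 characters, -1 when a value overflowed its column
 * or the buffer was too small. */
int format_record(char *buf, size_t cap, int ival, float fval)
{
    int n = snprintf(buf, cap, "xxxx%4d%4.2f\n", ival, fval);
    if (n != 13 || (size_t)n >= cap)
        return -1;
    return 0;
}
```

A caller would write buf to the file only when the function returns 0, so a value such as 12345 or 10.5 is reported instead of silently shifting the columns that follow.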
First off, I dunno. There might be some way to wrangle sscanf into counting the whitespace toward your integer's field width. But I just don't think scanf was made with this sort of format in mind. The tool's trying to be smart or helpful and it's biting you in the ass.
But if it's column-formatted data and you know the position of the various fields, there's a really easy workaround. Just extract the field you want.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char** argv)
{
char line[] = "xxxx 893.14";
char tmp[100];
int thatDamnNumber;
float myfloatykins;
//Get that field
memcpy(tmp, line+4, 4);
tmp[4] = '\0'; //memcpy does not terminate the copy
sscanf(tmp, "%d", &thatDamnNumber);
//Kill that field so it doesn't goober-up the float
memset(line+4, ' ', 4);
sscanf(line, "%*4c%f", &myfloatykins);
printf("%d %f\n", thatDamnNumber, myfloatykins);
return 0;
}
If there is a lot of this, you could make some generalized functions: integerExtract(int positionStart, int sizeInCharacters), floatExtract(), etc.
If each element is of fixed width you don't really need scanf(), try this
char copy[5];
const char *line = "xxxx  993.14";
int ival;
float fval;
copy[0] = line[4];
copy[1] = line[5];
copy[2] = line[6];
copy[3] = line[7];
copy[4] = '\0'; // nul terminate for `atoi' to work
ival = atoi(copy);
fval = atof(&line[8]);
fprintf(stdout, "%d -- %f\n", ival, fval);
If you want (probably should) you can use strtol() instead of atoi() and strtof() instead of atof() to check for malformed data.
Both these functions take a parameter to store the unconverted/invalid characters, you can check the passed pointer in order to verify that there was a problem with conversion.
Or if you really want scanf() do the same, capture the integer + whitespaces to a char array and then convert it to int later, like this
char integer[5];
const char *line = "xxxx  993.14";
int ival;
float fval;
if (sscanf(line, "%*4c%4[0-9 ]%f", integer, &fval) != 2)
return -1;
ival = atoi(integer);
fprintf(stdout, "%d -- %f\n", ival, fval);
The format "%*4c%4[0-9 ]%f" will
Skip the first four characters including white spaces.
Scan the next four characters if they consist only of digits or white spaces.
Scan the rest of the input string searching for a matching float value.
I am posting what I think is a final conclusion from the answers I have got so far and from other sources.
What is a very trivial task in Fortran is not so trivial in other languages. I guess (not sure) that in some other languages the same task could be as easy as in Fortran. I think that in Cobol, Pascal, PL/I and others from the time of punched cards it probably would be trivial.
I think that most languages nowadays are more comfortable with different data structures and inherited their I/O model from C. I think that Java, Python, Perl(?) and others could serve as examples.
From what I saw in this thread, there are two main problems in reading and converting fixed-column-width text data with C.
The first problem is that, as Philip said in his answer: “The tool’s trying to be smart or helpful and it’s biting you in the ass.” Quite right! The point is that C text I/O seems to treat white space as something meaningless that should be thrown away, completely disregarding any information about where the field starts. The only exception seems to be %nc, which gets exactly n chars, even blanks.
The second problem is that the conversion specifier %nf skips leading blanks before it starts counting, so it keeps converting valid characters beyond the 4th column of the field even if you tell it to stop at 4 characters.
If we combine those two problems with a field completely filled with white space, then, depending on the conversion tool used, it either throws an error or keeps going madly, looking for something meaningful.
At the end of the day, it seems that the only way is to copy each field, by its length, to another memory area, dynamically allocated or not (we can have an area for each column length), and parse this separate area, taking into account the possibility of an all-white-space field in order to catch the error.
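That conclusion could be sketched as one small function that copies a column into its own buffer, converts it, and reports a blank field separately from a malformed one. parse_field and its return convention are assumptions made for this illustration.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse the fixed-width column
 * line[start .. start+width-1] as a double. Returns 1 on success, 0 if
 * the field is blank (a missing value, *out is set to 0), and -1 if the
 * line is too short, the field is too wide for the local buffer, or it
 * holds anything other than one number. */
int parse_field(const char *line, size_t start, size_t width, double *out)
{
    char field[64];
    char *end;

    if (width >= sizeof field || strlen(line) < start + width)
        return -1;
    memcpy(field, line + start, width);   /* copy only this column */
    field[width] = '\0';

    *out = strtod(field, &end);           /* skips leading blanks itself */
    if (end == field) {
        /* nothing was converted: either an all-blank field or garbage */
        while (isspace((unsigned char)*end))
            end++;
        return *end == '\0' ? 0 : -1;
    }
    while (isspace((unsigned char)*end))  /* allow trailing blanks */
        end++;
    return *end == '\0' ? 1 : -1;
}
```

For the line "xxxx  993.14", parse_field(line, 4, 4, &v) gives 99 and parse_field(line, 8, 4, &v) gives 3.14; an integer column can be read the same way and the result cast after a range check.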