UTF-8 Character Count - c

I'm programming something that counts the number of UTF-8 characters in a file. I've already written the base code, but now I'm stuck on the part where the characters are supposed to be counted. So far, this is what I have:
What's inside the text file:
黄埔炒蛋
你好
こんにちは
여보세요
What I've coded so far:
#include <stdio.h>
typedef unsigned char BYTE;
int main(int argc, char const *argv[])
{
FILE *file = fopen("file.txt", "r");
if (!file)
{
printf("Could not open file.\n");
return 1;
}
int count = 0;
while(1)
{
BYTE b;
fread(&b, 1, 1, file);
if (feof(file))
{
break;
}
count++;
}
printf("Number of characters: %i\n", count);
fclose(file);
return 0;
}
My question is, how would I code the part where the UTF-8 characters are being counted? I tried to look for inspirations in GitHub and YouTube but I haven't found anything that works well with my code yet.
Edit: Originally, this code prints that the text file has 48 characters. But considering UTF-8, it should only be 18 characters.

See: https://en.wikipedia.org/wiki/UTF-8#Encoding
Each UTF-8 sequence contains one starting byte and zero or more extra bytes.
Extra bytes always start with the bits 10, and the first byte of a sequence never starts with that bit pattern.
You can use that information to count only the first byte of each UTF-8 sequence.
if((b&0xC0) != 0x80) {
count++;
}
Keep in mind this will break, if file contains invalid UTF-8 sequences.
Also, "UTF-8 characters" might mean different things. For example "👩🏿" will be counted as two characters by this method.

In C, as in C++, there is no ready-made solution for counting UTF-8 characters. You can convert the UTF-8 string to a wide-character string with mbstowcs and use the wcslen function, but this is not the best way for performance (especially if you only need to count the number of characters and nothing else).
I think a good answer to your question is here: counting unicode characters in c++.
Example from the linked answer:
for (p; *p != 0; ++p)
count += ((*p & 0xc0) != 0x80);
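If you do want to try the mbstowcs()/wcslen() route, a rough sketch might look like the function below (my own illustration, not from the linked answer; it assumes the string is NUL-terminated and that an "en_US.UTF-8" locale is available):
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

size_t count_via_wcslen(const char *utf8)
{
    if (!setlocale(LC_CTYPE, "en_US.UTF-8"))
        return 0;                       /* locale not available */
    /* a wide string never has more elements than the byte string */
    size_t cap = strlen(utf8) + 1;
    wchar_t *wide = malloc(cap * sizeof *wide);
    if (!wide)
        return 0;
    size_t n = mbstowcs(wide, utf8, cap);
    size_t len = (n == (size_t)-1) ? 0 : wcslen(wide);
    free(wide);
    return len;
}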

You could look into the specs: https://www.rfc-editor.org/rfc/rfc3629.
Chapter 3 has this table in it:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
You could inspect the bytes and build the unicode characters.
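As a sketch of that idea (my own illustration, built from the table above; it trusts the input and does not validate the continuation bytes):
// Decode one UTF-8 sequence starting at s.
// Stores the code point in *cp and returns the number of bytes consumed,
// or -1 if s does not start with a valid lead byte.
static int decode_utf8(const unsigned char *s, unsigned long *cp)
{
    int extra;                                   /* number of 10xxxxxx bytes that follow */
    if (s[0] < 0x80)                { *cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; extra = 3; }
    else return -1;                              /* continuation or invalid lead byte */

    for (int i = 1; i <= extra; i++)
        *cp = (*cp << 6) | (s[i] & 0x3F);        /* append the low 6 bits of each extra byte */
    return extra + 1;
}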
A different point is whether you would count a base character and its accent (a combining mark, cf. https://en.wikipedia.org/wiki/Combining_character) as one character or as several; for example, "ã" can be encoded either as the single code point U+00E3 or as "a" followed by the combining tilde U+0303.

There are multiple options you may take:
you may depend on your system's implementation of wide and multibyte encodings
you may read the file as a wide stream and simply count the wide characters read, letting the system do the UTF-8 multibyte-to-wide conversion on its own (see main1 below)
you may read the file as bytes, convert the multibyte string into a wide string, and count the resulting wide characters (see main2 below)
you may use an external library that operates on UTF-8 strings and count the Unicode characters (see main3 below, which uses libunistring)
or roll your own utf8_strlen-ish solution that relies on a specific property of UTF-8 strings and checks the bytes yourself, as shown in other answers.
Here is an example program that has to be linked with -lunistring under Linux, with rudimentary error checking via assert:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <assert.h>
#include <stdlib.h>
void main1()
{
// read the file as wide characters
const char *l = setlocale(LC_ALL, "en_US.UTF-8");
assert(l);
FILE *file = fopen("file.txt", "r");
assert(file);
int count = 0;
while(fgetwc(file) != WEOF) {
count++;
}
fclose(file);
printf("Number of characters: %i\n", count);
}
// just a helper function cause i'm lazy
char *file_to_buf(const char *filename, size_t *strlen) {
FILE *file = fopen(filename, "r");
assert(file);
size_t n = 0;
char *ret = malloc(1);
assert(ret);
for (int c; (c = fgetc(file)) != EOF;) {
ret = realloc(ret, n + 2);
assert(ret);
ret[n++] = c;
}
ret[n] = '\0';
*strlen = n;
fclose(file);
return ret;
}
void main2() {
const char *l = setlocale(LC_ALL, "en_US.UTF-8");
assert(l);
size_t strlen = 0;
char *str = file_to_buf("file.txt", &strlen);
assert(str);
// convert multibyte string to wide string
// assuming multibytes are in UTF-8
// this may also be done in a streaming fashion when reading byte by byte from a file
// and calling with `mbtowc` and checking errno for EILSEQ and managing some buffer
mbstate_t ps = {0};
const char *tmp = str;
size_t count = mbsrtowcs(NULL, &tmp, 0, &ps);
assert(count != (size_t)-1);
printf("Number of characters: %zu\n", count);
free(str);
}
#include <unistr.h> // u8_mbsnlen from libunistring
void main3() {
size_t strlen = 0;
char *str = file_to_buf("file.txt", &strlen);
assert(str);
// for simplicity I am assuming uint8_t is the same as unsigned char
size_t count = u8_mbsnlen((const uint8_t *)str, strlen);
printf("Number of characters: %zu\n", count);
free(str);
}
int main() {
main1();
main2();
main3();
}

Related

How can I know which characters inside a string are compositions of a single accentuated character in C?

My native language is not English but Brazilian Portuguese, and we have accented characters (á, à, ã, õ, and so on).
So, my problem is: if I put one of these characters inside a string and iterate over each character in it, I find that two characters are needed to display "ã" on the screen.
For example, take the string "(Não Informado)", which means "Uninformed". The string should have a length of 15 if we count each character one by one. But if we call strlen("(Não Informado)");, the result is 16.
The code I used to print each character is this one:
void print_buffer (const char * buffer) {
int size = strlen(buffer);
printf("BUFFER: %s / %i\n", buffer, size);
for (int i = 0; buffer[i] != '\0'; ++i) {
printf("[%i]: %i\n", i, (unsigned char) buffer[i]);
}
}
So, in graphical applications, a buffer could display "ãbc", and inside the raw string we wouldn't have 3 characters, but actually 4.
So here's my question, is there a way to know which characters inside a string are a composition of those special characters? Is there a rule to design and restrict this occurrence? Is it always a composition of 2 characters? Could a special character be composed of 3 or 4, for example?
Thanks
is there a way to know which characters inside a string are a composition of those special characters?
Yes, there is: to check whether a certain byte is part of a multibyte character you just need a bitwise operation (c & 0x80):
#include <stdio.h>
int is_multibyte(int c)
{
return c & 0x80;
}
int main(void)
{
const char *str = "ãbc";
while (*str != 0)
{
printf(
"%c %s part of a multibyte\n",
*str, is_multibyte(*str) ? "is" : "is not"
);
str++;
}
return 0;
}
Output:
� is part of a multibyte
� is part of a multibyte
b is not part of a multibyte
c is not part of a multibyte
The string should have a length of 15 if we count each character one by one. But if we call strlen("(Não Informado)");, the result is 16.
It seems that you are interested in the number of code points instead of the number of bytes, isn't it?
In this case you want to mask with (c & 0xc0) != 0x80:
#include <stdio.h>
size_t mylength(const char *str)
{
size_t len = 0;
while (*str != 0)
{
if ((*str & 0xc0) != 0x80)
{
len++;
}
str++;
}
return len;
}
int main(void)
{
const char *str = "ãbc";
printf("Length of \"%s\" = %zu\n", str, mylength(str));
return 0;
}
Output:
Length of "ãbc" = 3
Could a special character be composed of 3 or 4, for example?
Yes, of course, the euro sign € is an example (3 bytes), from this nice answer:
Anything up to U+007F takes 1 byte: Basic Latin
Then up to U+07FF it takes 2 bytes: Greek, Arabic, Cyrillic, Hebrew, etc
Then up to U+FFFF it takes 3 bytes: Chinese, Japanese, Korean, Devanagari, etc
Beyond that it takes 4 bytes
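As a worked example of that table (not part of the quoted answer), the 3 bytes of the euro sign can be computed by hand: U+20AC is 0010 0000 1010 1100 in binary, and distributing those bits over the 1110xxxx 10xxxxxx 10xxxxxx pattern gives E2 82 AC:
#include <stdio.h>

int main(void)
{
    unsigned long cp = 0x20AC;               /* fits in the 0800-FFFF (3-byte) range */
    unsigned char b[3];
    b[0] = 0xE0 | (cp >> 12);                /* 1110xxxx */
    b[1] = 0x80 | ((cp >> 6) & 0x3F);        /* 10xxxxxx */
    b[2] = 0x80 | (cp & 0x3F);               /* 10xxxxxx */
    printf("%02X %02X %02X\n", b[0], b[1], b[2]);   /* prints: E2 82 AC */
    return 0;
}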
Is there a rule to design and restrict this occurrence?
If you mean being able to treat all characters with the same width, C has specialised libraries for wide characters:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
setlocale(LC_CTYPE, "");
const wchar_t *str = L"ãbc";
while (*str != 0)
{
printf("%lc\n", *str);
str++;
}
return 0;
}
Output:
ã
b
c
To get the length you can use wcslen:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
setlocale(LC_CTYPE, "");
const wchar_t *str = L"ãbc";
printf("Length of \"%ls\" = %zu\n", str, wcslen(str));
return 0;
}
Output:
Length of "ãbc" = 3
But if with "restrict" you mean "avoid" those multibyte characters, you can transliterate from UTF8 to plain ASCII. If posix is an option take a look to iconv, you have an example here
El cañón de María vale 1000 €
is converted to
El canon de Maria vale 1000 EUR
and in your case
Não Informado
is converted to
Nao Informado
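A minimal sketch of such a transliteration with POSIX iconv (note that the "//TRANSLIT" suffix is a glibc extension, so other iconv implementations may reject it or give different results):
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "Não Informado";
    char out[64] = {0};
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");
    iconv_close(cd);

    printf("%s\n", out);    /* expected: Nao Informado */
    return 0;
}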

how to cut a string that mixes Chinese and English words in C

I have a string that contains both Mandarin and English words in UTF-8:
char *str = "你a好测b试";
If you use strlen(str), it will return 14, because each Mandarin character uses three bytes, while each English character uses only one byte.
Now I want to copy the leftmost 4 characters ("你a好测"), and append "..." at the end, to give "你a好测...".
If the text were in a single-byte encoding, I could just write:
strncpy(buf, str, 4);
strcat(buf, "...");
But 4 characters in UTF-8 isn't necessarily 4 bytes. For this example, it will be 10 bytes: three each for 你, 好 and 测 and one for a. So, for this specific case, I would need
strncpy(buf, str, 10);
strcat(buf, "...");
If I had a wrong value for the length, I could produce a broken UTF-8 stream with an incomplete character. Obviously I want to avoid that.
How can I compute the right number of bytes to copy, corresponding to a given number of characters?
First you need to know your encoding. By the sound of it (3 byte Mandarin) your string is encoded with UTF-8.
What you need to do is convert the UTF-8 back to Unicode code points (integers). You can then have an array of integers rather than bytes, so each element of the array will be 1 character, regardless of the language.
You could also use a library of functions that already handle utf8 such as http://www.cprogramming.com/tutorial/utf8.c
http://www.cprogramming.com/tutorial/utf8.h
In particular this function: int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz); might be very useful: it will create an array of integers, with each integer being 1 character. You can then modify the array as you see fit, then convert it back again with int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz);
I would recommend dealing with this at a higher level of abstraction: either convert to wchar_t or use a UTF-8 library. But if you really want to do it at the byte level, you could count characters by skipping over the continuation bytes (which are of the form 10xxxxxx):
#include <stddef.h>
size_t count_bytes_for_chars(const char *s, int n)
{
const char *p = s;
n += 1; /* we're counting up to the start of the subsequent character */
while (*p && (n -= (*p & 0xc0) != 0x80))
++p;
return p-s;
}
Here's a demonstration of the above function:
#include <string.h>
#include <stdio.h>
int main()
{
const char *str = "你a好测b试";
char buf[50];
int truncate_at = 4;
size_t bytes = count_bytes_for_chars(str, truncate_at);
strncpy(buf, str, bytes);
strcpy(buf+bytes, "...");
printf("'%s' truncated to %d characters is '%s'\n", str, truncate_at, buf);
}
Output:
'你a好测b试' truncated to 4 characters is '你a好测...'
The Basic Multilingual Plane was designed to contain characters for almost all modern languages. In particular, it does contain Chinese.
So you just have to convert your UTF8 string to a UTF16 one to have each character using one single position. That means that you can just use a wchar_t array or even better a wstring to be allowed to use natively all string functions.
Starting with C++11, the <codecvt> header declares a dedicated converter std::codecvt_utf8 to specifically convert UTF8 narrow strings to wide Unicode ones. I must admit it is not very easy to use, but it should be enough here. Code could be like:
char str[] = "你a好测b试";
std::codecvt_utf8<wchar_t> cvt;
std::mbstate_t state = std::mbstate_t();
wchar_t wstr[sizeof(str)] = {0}; // there will be unused space at the end
const char *end;
wchar_t *wend;
auto cr = cvt.in(state, str, str+sizeof(str), end,
wstr, wstr+sizeof(str), wend);
*wend = 0;
Once you have the wstr wide string, you can convert it to a wstring and use all the C++ library tools, or if you prefer C strings you can use the ws... counterparts of the str... functions.
Pure C solution:
All UTF-8 multibyte sequences are made of chars with the most significant bit set to 1, with the leading bits of the first byte indicating how many bytes make up a code point.
The question is ambiguous with regard to the criterion used in cutting; either:
a fixed number of codepoints followed by three dots, which will require a variable size output buffer
a fixed size output buffer, which will impose "whatever you can fit inside"
Both the solutions will require a helper function telling how many chars make the next codepoint:
// Note: the function does NOT fully validate a
// UTF8 sequence, only looks at the first char in it
int codePointLen(const char* c) {
if(NULL==c) return -1;
if( (*c & 0xF8)==0xF0 ) return 4; // 4 ones and one 0
if( (*c & 0xF0)==0xE0 ) return 3; // 3 ones and one 0
if( (*c & 0xE0)==0xC0 ) return 2; // 2 ones and one 0
if( (*c & 0x7F)==*c ) return 1; // no ones on msb
return -2; // invalid UTF8 starting character
}
So, here is a solution for criterion 1 (fixed number of code points, variable output buffer size). It does not append the "..." to the destination, but you can ask "how many chars do I need" upfront and, if that is more than you can afford, reserve the extra space yourself.
// returns the number of chars written to the output
// If there is not enough space or dest is null, it copies nothing
// and just returns the length required for the output buffer
// Returns a negative value if the source is not valid UTF-8
int copyFirstCodepoints(
int codepointsCount, const char* src,
char* dest, int destSize
) {
if(NULL==src) {
return -1;
}
// do a cold run to see if size of the output buffer can fit
// as many codepoints as required
const char* walker=src;
for(int cnvCount=0; cnvCount<codepointsCount && *walker; cnvCount++) { // stop early if the source ends first
int chCount=codePointLen(walker);
if(chCount<0) {
return chCount; // err
}
walker+=chCount;
}
if(walker-src < destSize && NULL!=dest) {
// enough space at destination
strncpy(dest, src, walker-src);
dest[walker-src] = '\0';
}
// else do nothing
return walker-src;
}
Second criterion (limited buffer size): just use the first one with the number of codepoints returned by this one
// return negative if UTF encoding error
int howManyCodepointICanFitInOutputBufferOfLen(const char* src, int maxBufflen) {
if(NULL==src) {
return -1;
}
int ret=0;
const char* walker=src;
while(*walker) {
int advance=codePointLen(walker);
if(advance<0) {
return (int)(src-walker); // err because negative, indicating the err pos
}
// look at all the chars of this codepoint;
// if any is 0, we have a premature end of the source
for(int i=1; i<advance; i++) {
if(0==walker[i]) {
return (int)(src-(walker+i)); // err because negative, indicating the err pos
}
}
// stop once the next codepoint would no longer fit
// (strictly below maxBufflen, to leave room for a '\0')
if((walker-src)+advance >= maxBufflen) {
break;
}
walker+=advance;
ret++;
}
return ret;
}
// Note: the 0xA0 tests below suit double-byte encodings such as GB2312/GBK
// rather than UTF-8, so this only works if the text uses such an encoding.
static char *CutStringLength(char *lpszData, int nMaxLen)
{
if (NULL == lpszData || 0 >= nMaxLen)
{
return "";
}
int len = strlen(lpszData);
if(len <= nMaxLen)
{
return lpszData;
}
static char strTemp[1024] = {0}; /* static, so the pointer returned below remains valid */
strcpy(strTemp, lpszData);
char *p = strTemp;
p = p + (nMaxLen-1);
if ((unsigned char)(*p) < 0xA0)
{
*(++p) = '\0'; // if the last byte is Mandarin character
}
else if ((unsigned char)(*(--p)) < 0xA0)
{
*(++p) = '\0'; // if the last but one byte is Mandarin character
}
else if ((unsigned char)(*(--p)) < 0xA0)
{
*(++p) = '\0'; // if the last but two byte is Mandarin character
}
else
{
int i = 0;
p = strTemp;
while(*p != '\0' && i+2 <= nMaxLen)
{
if((unsigned char)(*p++) >= 0xA0 && (unsigned char)(*p) >= 0xA0)
{
p++;
i++;
}
i++;
}
*p = '\0';
}
printf("str = %s\n",strTemp);
return strTemp;
}

I need to format my output without ruining my encryption algorithm

I'm doing a railroad cipher (zigzag cipher), or however you may call it. I finally got my code to work properly and print the correct output, but unfortunately my teacher calls for the output to be printed 80 columns wide (80 characters per line). Unfortunately, the way my encryption is set up, I can not find a way to do this, since I build the encryption "rail by rail".
For the assignment we must read in the file, and strip it of all spaces and special characters, and to make all uppercase letters lower-case. Then encrypt the message. My issue is the printing portion in the encrypt function.
Since it's run from the command line, here are the files I used.
The first file is for the number of rails; the sample I used is: 9
The second file is the text; the sample I used is:
We shall not flag or fail. We shall go on to the end. We shall fight in France, we
shall fight on the seas and oceans, we shall fight with growing confidence and
growing strength in the air, we shall defend our island, whatever the cost may be, we
shall fight on the beaches, we shall fight on the landing grounds, we shall fight in
the fields and in the streets, we shall fight in the hills. we shall never surrender!
My output is correct according to my teacher's output, but unfortunately I get 30% off for not having it 80 characters per line. This is due in a few hours and I can't seem to figure it out. Any help is greatly appreciated.
I would show the output for reference but I don't know how to copy and paste from the command line, and it only runs from there.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
# define MAX 10000
void condense(char* str)
{
int original=0;
int newplace =0;
while (str[original] != '\0')
{
if(isalpha(str[original] ))
{
str[newplace]= tolower(str[original]);
++newplace;
}
++original;
}
str[newplace] = '\0';
}
char * deblank(char *str)
{
char *out = str, *put = str;
for(; *str != '\0'; ++str)
{
if(*str != ' ')
*put++ = *str;
}
*put = '\0';
return out;
}
void encrypt(int rail,char *plain)
{
char railfence[rail][MAX],buf[2];
int i;
int number=0,increment=1;
buf[1]='\0';
for(i=0;i<rail;i++)
railfence[i][0]='\0';
for(i=0;i<strlen(plain);i++)
{
if(number+increment==rail)
increment=-1;
else if(number+increment==-1)
increment=1;
buf[0]=plain[i];
strcat(railfence[number],buf);
number+=increment;
}
for(i=0;i<rail;i++)
printf("%s",railfence[i]);
}
int main(int argc, char *argv[])
{
int rail,mode;
char text[MAX];
FILE* fp1;
FILE* fp2;
fp1 = fopen(argv[1], "r");
fp2 = fopen(argv[2], "r");
int key;
fscanf(fp1, "%d", &key);
printf("key is %d", key);
char c;
int index = 0;
fgets(text, 10000, fp2);
printf("%s \n", text);
// text[index] = '0';
char nospace[MAX];
deblank(text);
printf("text deblanked: %s \n", text);
//printf("%s", deblank(text));
condense(text);
printf("\nthe text condensed is: %s", text);
printf("\n the text encrypted is \n");
encrypt(key,text);
return 0;
}
Simple. Instead of printing each rail as a whole, print each rail character by character, and count. In the example below I assume your instructor's 80 characters per line is 79 characters of ciphertext plus one newline character. I do not know whether you are expected to print a newline at the end of the ciphertext, but if so just add printf("\n"); at the end of encrypt (though you might want to check that there was at least one character of ciphertext before doing so).
void encrypt(int rail,char *plain)
{
char railfence[rail][MAX],buf[2];
int i, col = 0, j, len; // added col, j, and len
int number=0,increment=1;
buf[1]='\0';
for(i=0;i<rail;i++)
railfence[i][0]='\0';
for(i=0;i<strlen(plain);i++)
{
if(number+increment==rail)
increment=-1;
else if(number+increment==-1)
increment=1;
buf[0]=plain[i];
strcat(railfence[number],buf);
number+=increment;
}
for(i=0;i<rail;i++)
{
len = strlen(railfence[i]); // get the length of the rail
for(j=0;j<len;++j) // for each char in the rail
{
printf("%c", railfence[i][j]);
if (++col==79) {
col = 0;
printf("\n");
}
}
}
}
Other than that, I thoroughly recommend using more whitespace in your formatting, as well as checking things like whether the user passes in two arguments or not, whether your files were opened successfully or not, and also remembering to close any files you open.
As it stands, your program is hard to read, and it currently behaves badly if I do not provide both command-line arguments or if I give it non-existent files.
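For reference, a minimal sketch of the kind of argument and file checking meant above (the usage message and error handling shown are just one possible choice):
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s railfile textfile\n", argv[0]);
        return 1;
    }
    FILE *fp1 = fopen(argv[1], "r");
    if (!fp1) { perror(argv[1]); return 1; }
    FILE *fp2 = fopen(argv[2], "r");
    if (!fp2) { perror(argv[2]); fclose(fp1); return 1; }

    /* ... read the key, read the text, and encrypt as before ... */

    fclose(fp1);
    fclose(fp2);
    return 0;
}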

read from a file to an int buffer in C

I have a txt file with thousands of lines. Length of each line varies. The txt file mainly contains hex data in bytes. For example:
01 01 04 03 = 4 bytes.
The second line might contain 8 bytes, the third 40 bytes, and so on. There are thousands of such lines.
Now I want to read these bytes into an int buffer. At the moment I read into a char buffer, so the line is stored as the ASCII codes of the digit characters (3031 3031 3034 3033 in memory) and is treated as 8 bytes of data, which is not what I want. I want to convert that into the numeric values 01 01 04 03 instead.
Below is my piece of code
FILE *file;
char buffer[100] = { '\0' };
char line[100] = { '0' };
if(file!=NULL)
{
while(fgets(line, sizeof(line), file)!=NULL)
{
for(i = 0; (line[i] != '\r') ; i++)
{
buffer[i] = line[i];
}
}
}
I want to read line by line, not the entire file at once. In memory I want to see just 01 01 04 03. I guess using an int buffer will help. As soon as a line is read into the buffer, it is stored as chars. Any suggestions, please?
I would read in a line, then use strtol to convert the individual numbers in the input. strtol gives you a pointer to the character at which the conversion failed, which you can use as a starting point to find/convert the next number.
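A rough sketch of that strtol() approach (my own illustration; it assumes line holds one line of text such as "01 01 04 03\n" and that buffer is large enough):
#include <stdlib.h>
#include <stddef.h>

/* Parse every hex byte on one line into buffer; returns how many were stored. */
size_t parse_line(const char *line, unsigned char *buffer)
{
    size_t n = 0;
    const char *p = line;
    char *end;
    for (;;) {
        long v = strtol(p, &end, 16);   /* strtol skips leading whitespace itself */
        if (end == p)                   /* no more digits on this line */
            break;
        buffer[n++] = (unsigned char)v;
        p = end;                        /* continue right after the parsed number */
    }
    return n;
}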
You can convert small hex numbers:
#include <ctype.h>
#include <stdint.h>
uint8_t digits2hex(const char *digits) {
uint8_t r = 0;
while (isxdigit((unsigned char)*digits)) {
if (isdigit((unsigned char)*digits))
r = r * 16 + (*digits - '0');
else
r = r * 16 + (toupper((unsigned char)*digits) - 'A' + 10);
digits++;
/* check size? */
}
return r;
}
/* ... */
for (i = 0; line[i] != '\0' && line[i] != '\r' && line[i] != '\n'; )
{
/* skip whitespace between the hex numbers */
while (isspace((unsigned char)line[i]))
i++;
if (!isxdigit((unsigned char)line[i]))
break;
hexnumbers[hexcount++] = digits2hex(line + i);
/* step past the digits just converted */
while (isxdigit((unsigned char)line[i]))
i++;
}
You seem to be confusing the textual representation of a byte with the value of the byte (or alternately, expecting your compiler to do more than it does.)
When your program reads in "01", it is reading in two bytes whose values correspond to the ASCII codes for the characters "0" and "1". C doesn't do anything special with them, so you need to convert this sequence into a one-byte value. Note that a C char is one byte and so is the right size to hold this result. This is a coincidence and in any case is not true for Unicode and other wide character encodings.
There are several ways to do this conversion. You can do arithmetic on the bytes yourself like this:
unsigned char charToHex(char c) {
if (isdigit(c)) return c - '0';
return 10 + toupper(c) - 'A';
}
...
first = getc(fh);
second = getc(fh);
buffer[*end] = charToHex(first) << 4 | charToHex(second);
(Note that I'm using getc() to read the characters instead of fgets(). I'll go into that later.)
Note also that 'first' is the most significant half-byte of the input.
You can also (re)create a string from the two bytes and call strtol on it:
char hex[3];
hex[0] = first;
hex[1] = second;
hex[2] = 0; // null-terminator
buffer[*end] = (char)strtol(hex, NULL, 16);
Related to this, you'd probably have better luck using getc() to read in the file one character at a time, ignoring anything that isn't a hex digit. That way, you won't get a buffer overflow if an input line is longer than the buffer you pass to fgets(). It also makes it easier to tolerate garbage in the input file.
Here's a complete example of this. It uses isxdigit() to detect hex characters and ignores anything else including single hex digits:
// Given a single hex digit, return its numeric value
unsigned char charToHex(char c) {
if (isdigit(c)) return c - '0';
return 10 + toupper(c) - 'A';
}
// Read in file 'fh' and for each pair of hex digits found, append
// the corresponding value to 'buffer'. '*end' is incremented for each
// byte written, so on return it holds the number of bytes stored in
// 'buffer', which is assumed to have enough space. The caller should
// initialise '*end' (typically to 0) before the first call.
void readBuffer(FILE *fh, unsigned char buffer[], size_t *end) {
for (;;) {
// Advance to the next hex digit in the stream.
int first;
do {
first = getc(fh);
if (first == EOF) return;
} while (!isxdigit(first));
int second;
second = getc(fh);
// Ignore any single hex digits
if (!isxdigit(second)) continue;
// Compute the hex value and append it to the array.
buffer[*end] = charToHex(first) << 4 | charToHex(second);
(*end)++;
}
}
FILE *fp = ...;
int buffer[1024]; /* enough memory; assumes 4-byte int */
int r_pos = 0; /* read start position */
char line[128];
unsigned char tmp[4]; /* %hhx needs unsigned char */
char *cp;
int used;
if(fp) {
while(NULL!=fgets(line, sizeof(line), fp)) {
cp = line;
/* parse four hex bytes at a time; %n records how many chars were consumed */
while(sscanf(cp, "%hhx %hhx %hhx %hhx%n", &tmp[0], &tmp[1], &tmp[2], &tmp[3], &used)==4) {
memcpy(&buffer[r_pos++], tmp, sizeof tmp); /* needs <string.h>; or ntohl() the packed value */
cp += used;
}
}
}

Parsing text in C

I have a file like this:
...
words 13
more words 21
even more words 4
...
(General format is a string of non-digits, then a space, then any number of digits and a newline)
and I'd like to parse every line, putting the words into one field of the structure, and the number into the other. Right now I am using an ugly hack of reading the line while the chars are not numbers, then reading the rest. I believe there's a clearer way.
You could read the entire line into a buffer and then use:
char *pNum;
if (pNum = strrchr(buf, ' ')) {
pNum++;
}
to get a pointer to the number field.
Edit: You can use pNum-buf to get the length of the alphabetical part of the string, and use strncpy() to copy that into another buffer. Be sure to add a '\0' to the end of the destination buffer. I would insert this code before the pNum++:
int len = pNum-buf;
strncpy(newBuf, buf, len);
newBuf[len] = '\0';
fscanf(file, "%s %d", word, &value);
This gets the values directly into a string and an integer, and copes with variations in whitespace and numerical formats, etc.
Edit
Ooops, I forgot that you had spaces between the words.
In that case, I'd do the following. (Note that it truncates the original text in 'line')
// Scan to find the last space in the line
char *p = line;
char *lastSpace = NULL;
while(*p != '\0')
{
if (*p == ' ')
lastSpace = p;
p++;
}
if (lastSpace == NULL)
return("parse error");
// Replace the last space in the line with a NUL
*lastSpace = '\0';
// Advance past the NUL to the first character of the number field
lastSpace++;
char *word = line;
int number = atoi(lastSpace);
You can solve this using stdlib functions, but the above is likely to be more efficient as you're only searching for the characters you are interested in.
Given the description, I think I'd use a variant of this (now tested) C99 code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
struct word_number
{
char word[128];
long number;
};
int read_word_number(FILE *fp, struct word_number *wnp)
{
char buffer[140];
if (fgets(buffer, sizeof(buffer), fp) == 0)
return EOF;
size_t len = strlen(buffer);
if (buffer[len-1] != '\n') // Error if line too long to fit
return EOF;
buffer[--len] = '\0';
char *num = &buffer[len-1];
while (num > buffer && !isspace((unsigned char)*num))
num--;
if (num == buffer) // No space in input data
return EOF;
char *end;
wnp->number = strtol(num+1, &end, 0);
if (*end != '\0') // Invalid number as last word on line
return EOF;
*num = '\0';
if (num - buffer >= sizeof(wnp->word)) // Non-number part too long
return EOF;
memcpy(wnp->word, buffer, num - buffer + 1); // include the terminating '\0'
return(0);
}
int main(void)
{
struct word_number wn;
while (read_word_number(stdin, &wn) != EOF)
printf("Word <<%s>> Number %ld\n", wn.word, wn.number);
return(0);
}
You could improve the error reporting by returning different values for different problems.
You could make it work with dynamically allocated memory for the word portion of the lines.
You could make it work with longer lines than I allow.
You could scan backwards over digits instead of non-spaces - but this allows the user to write "abc 0x123" and the hex value is handled correctly.
You might prefer to ensure there are no digits in the word part; this code does not care.
You could try using strtok() to tokenize each line, and then check whether each token is a number or a word (a fairly trivial check once you have the token string - just look at the first character of the token).
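A small sketch of that strtok() idea (my own illustration; it assumes words never start with a digit and modifies the line in place, as strtok() always does):
#include <ctype.h>
#include <stdio.h>
#include <string.h>

void classify_line(char *line)
{
    for (char *tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n")) {
        if (isdigit((unsigned char)tok[0]))
            printf("number: %s\n", tok);
        else
            printf("word:   %s\n", tok);
    }
}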
Assuming that the number is immediately followed by '\n': you can read each line into a char buffer, use sscanf("%d") on the line to get the number, and then calculate the number of chars that this number takes up at the end of the text string.
Depending on how complex your strings become you may want to use the PCRE library. At least that way you can compile a perl'ish regular expression to split your lines. It may be overkill though.
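If you went the PCRE route, a sketch with the classic PCRE API (the older libpcre, linked with -lpcre, not PCRE2) might look like this; the pattern and buffer sizes are just illustrative:
#include <pcre.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *err;
    int erroff;
    /* group 1: everything before the last space; group 2: the trailing number */
    pcre *re = pcre_compile("^([^0-9]+) ([0-9]+)$", 0, &err, &erroff, NULL);
    if (!re) { fprintf(stderr, "pcre_compile: %s\n", err); return 1; }

    const char *line = "more words 21";
    int ov[9];   /* 3 ints per capture: whole match plus 2 groups */
    int rc = pcre_exec(re, NULL, line, (int)strlen(line), 0, 0, ov, 9);
    if (rc >= 3) {
        printf("words:  %.*s\n", ov[3] - ov[2], line + ov[2]);
        printf("number: %.*s\n", ov[5] - ov[4], line + ov[4]);
    }
    pcre_free(re);
    return 0;
}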
Given the description, here's what I'd do: read each line as a single string using fgets() (making sure the target buffer is large enough), then split the line using strtok(). To determine if each token is a word or a number, I'd use strtol() to attempt the conversion and check the error condition. Example:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
/* example limits (not part of the original answer); adjust as needed */
#define MAX_LINE_LENGTH 512
#define MAX_STR_SIZE 64
#define MAX_NUM_STRINGS 32
/**
* Read the next line from the file, splitting the tokens into
* multiple strings and a single integer. Assumes input lines
* never exceed MAX_LINE_LENGTH and each individual string never
* exceeds MAX_STR_SIZE. Otherwise things get a little more
* interesting. Also assumes that the integer is the last
* thing on each line.
*/
int getNextLine(FILE *in, char (*strs)[MAX_STR_SIZE], int *numStrings, int *value)
{
char buffer[MAX_LINE_LENGTH];
int rval = 1;
if (fgets(buffer, sizeof buffer, in))
{
char *token = strtok(buffer, " ");
*numStrings = 0;
while (token)
{
char *chk;
*value = (int) strtol(token, &chk, 10);
if (*chk != 0 && *chk != '\n')
{
strcpy(strs[(*numStrings)++], token);
}
token = strtok(NULL, " ");
}
}
else
{
/**
* fgets() hit either EOF or error; either way return 0
*/
rval = 0;
}
return rval;
}
/**
* sample main
*/
int main(void)
{
FILE *input;
char strings[MAX_NUM_STRINGS][MAX_STR_SIZE];
int numStrings;
int value;
input = fopen("datafile.txt", "r");
if (input)
{
while (getNextLine(input, strings, &numStrings, &value))
{
/**
* Do something with strings and value here
*/
}
fclose(input);
}
return 0;
}
