Trapping strange ASCII characters in C - c

I am writing a program that builds a word list for all words in a document, book or song. The book I started with was Charles Dickens's "A Tale of Two Cities". This was downloaded from the Project Gutenberg web site. It is formatted as a UTF-8 file.
I am running Ubuntu 22.04.
I have come across a number of seemingly innocuous characters in the text that clearly are not standard ASCII.
In the code below the offending characters are the left and right quotation marks around "Wo-ho" and the single quotation mark in you're.
If I go through the line character by character I can see that these are non-ASCII and produce strange codes of -30, -128, -100 for the left quote, -30, -128, -99 for the right quote and -30, -128, -103 for the single quote.
If someone could help me understand why this is so it would be appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
int main(int argc, char *argv[])
{
char *lineIn = "“Wo-ho!” said the coachman. “So, then! One more pull and you’re at the";
const char delim = ' ';
int strLen = strlen(lineIn);
int i = 0;
printf("Start - %s\n", lineIn);
printf("\n");
for (i = 0; i < strLen + 1; i++)
{
if (isalpha(lineIn[i])) {
printf("Alpha - (%c) ", lineIn[i]);
} else if (iscntrl(lineIn[i])) {
printf("Cntrl - %c\n", lineIn[i]);
} else if (isxdigit(lineIn[i])) {
printf("Hex - %c\n", lineIn[i]);
} else if (isascii(lineIn[i])) {
printf("Asc - %c\n", lineIn[i]);
} else if (lineIn[i] == delim) {
printf("\n");
} else {
printf("Unk - %d\n", lineIn[i]);
}
}
return 0;
}
Output from above :
Start - “Wo-ho!” said the coachman. “So, then! One more pull and you’re at the
Unk - -30
Unk - -128
Unk - -100
Alpha - (W) Alpha - (o) Asc - -
Alpha - (h) Alpha - (o) Asc - !
Unk - -30
Unk - -128
Unk - -99
Asc -
Alpha - (s) Alpha - (a) Alpha - (i) Alpha - (d) Asc -
Alpha - (t) Alpha - (h) Alpha - (e) Asc -
Alpha - (c) Alpha - (o) Alpha - (a) Alpha - (c) Alpha - (h) Alpha - (m) Alpha - (a) Alpha - (n) Asc - .
Asc -
Unk - -30
Unk - -128
Unk - -100
Alpha - (S) Alpha - (o) Asc - ,
Asc -
Alpha - (t) Alpha - (h) Alpha - (e) Alpha - (n) Asc - !
Asc -
Alpha - (O) Alpha - (n) Alpha - (e) Asc -
Alpha - (m) Alpha - (o) Alpha - (r) Alpha - (e) Asc -
Alpha - (p) Alpha - (u) Alpha - (l) Alpha - (l) Asc -
Alpha - (a) Alpha - (n) Alpha - (d) Asc -
Alpha - (y) Alpha - (o) Alpha - (u) Unk - -30
Unk - -128
Unk - -103
Alpha - (r) Alpha - (e) Asc -
Alpha - (a) Alpha - (t) Asc -
Alpha - (t) Alpha - (h) Alpha - (e) Cntrl -

You do not have ASCII. You have UTF-8.
-30 -128 -100 (E2.80.9C as hex) is the UTF-8 encoding of U+201C LEFT DOUBLE QUOTATION MARK (“)
-30 -128 -99 (E2.80.9D as hex) is the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK (”)
-30 -128 -103 (E2.80.99 as hex) is the UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK (’)
“, ” and ’ are not found in the ASCII character set.
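The negative numbers appear because char is signed on that platform, so any byte of 0x80 or above prints as a negative value with %d. A minimal sketch of viewing the same bytes as hex follows; the helper name bytes_to_hex is made up for illustration and is not part of the question's program.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format the bytes of `s` as space-separated hex pairs, e.g. "E2 80 9C".
 * Returns the number of bytes formatted, or -1 if `out` is too small.
 * (Hypothetical helper, for illustration only.) */
int bytes_to_hex(const char *s, char *out, size_t outsz)
{
    assert(s != NULL && out != NULL);
    size_t len = strlen(s);
    size_t used = 0;

    if (outsz == 0) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* Where char is signed, s[i] can read as -30; converted to
           unsigned char the same byte reads as 226 (0xE2). */
        unsigned char byte = (unsigned char)s[i];
        int n = snprintf(out + used, outsz - used, "%s%02X",
                         i ? " " : "", (unsigned)byte);
        if (n < 0 || (size_t)n >= outsz - used) return -1;
        used += (size_t)n;
    }
    return (int)len;
}
```

The same cast to unsigned char is also what the <ctype.h> functions such as isalpha expect; passing them a negative char value is undefined behavior.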

As you mentioned, you have a UTF-8 encoded text file. This means you can have characters that are up to 4 bytes long, though the first 128 characters in Unicode UTF-8 are a one-to-one mapping of ASCII (requiring only 1 byte). This means, if the text is English, you will probably be able to see all the letters as if they were ASCII.
You can read Wikipedia's description of how UTF-8 is encoded.
From the table on Encoding, you can see that if the byte you read is in one of the following ranges
First byte    Character size
0xxxxxxx      1 byte
110xxxxx      2 bytes
1110xxxx      3 bytes
11110xxx      4 bytes
the character will be 1, 2, 3, or 4 bytes respectively. This will come in handy if you decide to "translate" the character to some ASCII equivalent, or if you want to filter them out (remove) completely.
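As a rough sketch of that idea, the lead byte can be tested against the bit patterns in the table, and the result used to either translate or drop a character. The function names utf8_seq_len and utf8_to_ascii are invented for this example, and mapping only the four curly quotes is an assumption about what a word-list program would want.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence that starts with `lead`,
 * or 0 if `lead` is a continuation byte (10xxxxxx) or invalid. */
int utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Copy `in` to `out`, turning curly quotes (U+2018, U+2019, U+201C,
 * U+201D) into ' and ", and dropping every other non-ASCII character.
 * `out` must have room for strlen(in) + 1 bytes. */
void utf8_to_ascii(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    size_t remaining = strlen(in);

    while (remaining > 0) {
        const unsigned char *p = (const unsigned char *)in;
        int len = utf8_seq_len(p[0]);

        /* Stray continuation byte or truncated sequence: skip one byte. */
        if (len == 0 || (size_t)len > remaining) len = 1;

        if (len == 1 && p[0] < 0x80) {
            *out++ = (char)p[0];                 /* plain ASCII */
        } else if (len == 3 && p[0] == 0xE2 && p[1] == 0x80) {
            if (p[2] == 0x98 || p[2] == 0x99) *out++ = '\'';
            else if (p[2] == 0x9C || p[2] == 0x9D) *out++ = '"';
        }
        in += len;
        remaining -= (size_t)len;
    }
    *out = '\0';
}
```

With the sample line from the question, the quotes around Wo-ho! come out as plain double quotes and you’re becomes you're, so the usual ASCII word-splitting logic can run on the result.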

Related

Linear interpolation works on Windows but not on my PIC18F, trouble with 8-bit arch?

I've implemented linear interpolation on my PIC18F device where I'm interpolating ADC values with intervals stored in EEPROM. I tested the code and it doesn't calculate properly. So I put it onto my computer, where the calculations work well, so I don't really know where the problem might come from, as the two functions I use right now for testing purposes are the same on both platforms.
These functions are the same on both platforms:
uint16_t interpolate(uint16_t a, uint16_t b, uint16_t c, uint16_t d, uint16_t raw) {
// b+(x-a)*(d-b)/(c-a)
return b + (raw - a) * (d - b) / (c - a);
}
uint16_t mapRawToCal(uint16_t raw) {
uint16_t cal = 0;
if (raw < 150) {
cal = 0;
}
else if (raw >= 150 && raw <= 457) {
cal = interpolate(150, 0, 457, 800, raw);
}
else if (raw > 457 && raw <= 1235) {
cal = interpolate(457, 800, 1235, 1000, raw);
}
else if (raw > 1235 && raw <= 1988) {
cal = interpolate(1235, 1000, 1988, 1988, raw);
}
else if (raw > 1988 && raw <= 2700) {
cal = interpolate(1988, 1988, 2700, 3500, raw);
}
else if (raw > 2700 && raw <= 3560) {
cal = interpolate(2700, 3500, 3560, 3300, raw);
}
else if (raw > 3560 && raw <= 3987) {
cal = interpolate(3560, 3300, 3987, 4096, raw);
}
else if (raw > 3987) {
cal = 4096;
}
return cal;
}
This is the PIC18F device main function:
int main(void) {
// Setup
hardware_init();
ADC_Init();
// Loop
while(1) {
uint16_t ch1Data;
ch1Data = ADC_Read(1);
uint16_t *buf = usb_get_in_buffer(1);
buf[0] = ch1Data; // Rx Axis
buf[1] = mapRawToCal(ch1Data); // Ry Axis
usb_send_in_buffer(1, 4);
__delay_ms(200); //Delay
#ifndef USB_USE_INTERRUPTS
usb_service();
#endif
}
return 0;
}
And this is the main function on Windows:
int main() {
for (uint16_t raw = 0; raw <= 4096; raw++) {
printf("input=%d - output=%d\n", raw, mapRawToCal(raw));
}
return 0;
}
Here is a chunk of the Windows program's console output (which is accurate):
input=129 - output=0
input=130 - output=0
input=131 - output=0
input=132 - output=0
input=133 - output=0
input=134 - output=0
input=135 - output=0
input=136 - output=0
input=137 - output=0
input=138 - output=0
input=139 - output=0
input=140 - output=0
input=141 - output=0
input=142 - output=0
input=143 - output=0
input=144 - output=0
input=145 - output=0
input=146 - output=0
input=147 - output=0
input=148 - output=0
input=149 - output=0
input=150 - output=0
input=151 - output=2
input=152 - output=5
input=153 - output=7
input=154 - output=10
input=155 - output=13
input=156 - output=15
input=157 - output=18
input=158 - output=20
input=159 - output=23
input=160 - output=26
input=161 - output=28
input=162 - output=31
input=163 - output=33
input=164 - output=36
input=165 - output=39
input=166 - output=41
input=167 - output=44
input=168 - output=46
input=169 - output=49
input=170 - output=52
input=171 - output=54
input=172 - output=57
input=173 - output=59
input=174 - output=62
input=175 - output=65
input=176 - output=67
input=177 - output=70
input=178 - output=72
input=179 - output=75
input=180 - output=78
input=181 - output=80
input=182 - output=83
input=183 - output=85
input=184 - output=88
input=185 - output=91
input=186 - output=93
input=187 - output=96
input=188 - output=99
input=189 - output=101
input=190 - output=104
input=191 - output=106
input=192 - output=109
input=193 - output=112
input=194 - output=114
input=195 - output=117
input=196 - output=119
input=197 - output=122
input=198 - output=125
input=199 - output=127
input=200 - output=130
input=201 - output=132
input=202 - output=135
input=203 - output=138
input=204 - output=140
input=205 - output=143
input=206 - output=145
input=207 - output=148
input=208 - output=151
input=209 - output=153
input=210 - output=156
input=211 - output=158
input=212 - output=161
input=213 - output=164
input=214 - output=166
input=215 - output=169
input=216 - output=171
input=217 - output=174
input=218 - output=177
input=219 - output=179
input=220 - output=182
input=221 - output=185
input=222 - output=187
input=223 - output=190
input=224 - output=192
input=225 - output=195
input=226 - output=198
input=227 - output=200
input=228 - output=203
input=229 - output=205
input=230 - output=208
input=231 - output=211
input=232 - output=213
input=233 - output=216
input=234 - output=218
input=235 - output=221
input=236 - output=224
input=237 - output=226
input=238 - output=229
input=239 - output=231
input=240 - output=234
input=241 - output=237
input=242 - output=239
input=243 - output=242
input=244 - output=244
input=245 - output=247
input=246 - output=250
input=247 - output=252
input=248 - output=255
input=249 - output=257
input=250 - output=260
input=251 - output=263
input=252 - output=265
input=253 - output=268
input=254 - output=271
input=255 - output=273
input=256 - output=276
input=257 - output=278
input=258 - output=281
input=259 - output=284
input=260 - output=286
input=261 - output=289
input=262 - output=291
input=263 - output=294
input=264 - output=297
input=265 - output=299
input=266 - output=302
input=267 - output=304
input=268 - output=307
input=269 - output=310
input=270 - output=312
input=271 - output=315
input=272 - output=317
input=273 - output=320
input=274 - output=323
input=275 - output=325
input=276 - output=328
input=277 - output=330
input=278 - output=333
input=279 - output=336
input=280 - output=338
input=281 - output=341
input=282 - output=343
input=283 - output=346
input=284 - output=349
input=285 - output=351
input=286 - output=354
input=287 - output=357
input=288 - output=359
input=289 - output=362
input=290 - output=364
input=291 - output=367
input=292 - output=370
input=293 - output=372
input=294 - output=375
input=295 - output=377
input=296 - output=380
input=297 - output=383
input=298 - output=385
input=299 - output=388
input=300 - output=390
input=301 - output=393
input=302 - output=396
input=303 - output=398
input=304 - output=401
input=305 - output=403
input=306 - output=406
input=307 - output=409
input=308 - output=411
input=309 - output=414
input=310 - output=416
input=311 - output=419
input=312 - output=422
input=313 - output=424
input=314 - output=427
input=315 - output=429
input=316 - output=432
input=317 - output=435
input=318 - output=437
input=319 - output=440
input=320 - output=442
input=321 - output=445
input=322 - output=448
input=323 - output=450
input=324 - output=453
input=325 - output=456
input=326 - output=458
input=327 - output=461
input=328 - output=463
input=329 - output=466
input=330 - output=469
input=331 - output=471
input=332 - output=474
input=333 - output=476
input=334 - output=479
input=335 - output=482
input=336 - output=484
input=337 - output=487
input=338 - output=489
input=339 - output=492
input=340 - output=495
input=341 - output=497
input=342 - output=500
input=343 - output=502
input=344 - output=505
input=345 - output=508
input=346 - output=510
input=347 - output=513
input=348 - output=515
input=349 - output=518
input=350 - output=521
input=351 - output=523
input=352 - output=526
input=353 - output=528
input=354 - output=531
input=355 - output=534
input=356 - output=536
input=357 - output=539
input=358 - output=542
input=359 - output=544
input=360 - output=547
input=361 - output=549
input=362 - output=552
input=363 - output=555
input=364 - output=557
input=365 - output=560
input=366 - output=562
input=367 - output=565
input=368 - output=568
input=369 - output=570
input=370 - output=573
input=371 - output=575
input=372 - output=578
input=373 - output=581
input=374 - output=583
input=375 - output=586
input=376 - output=588
input=377 - output=591
input=378 - output=594
input=379 - output=596
input=380 - output=599
input=381 - output=601
input=382 - output=604
input=383 - output=607
input=384 - output=609
input=385 - output=612
input=386 - output=614
input=387 - output=617
input=388 - output=620
input=389 - output=622
input=390 - output=625
input=391 - output=628
input=392 - output=630
input=393 - output=633
input=394 - output=635
input=395 - output=638
input=396 - output=641
input=397 - output=643
input=398 - output=646
input=399 - output=648
input=400 - output=651
input=401 - output=654
input=402 - output=656
input=403 - output=659
input=404 - output=661
input=405 - output=664
input=406 - output=667
input=407 - output=669
input=408 - output=672
input=409 - output=674
input=410 - output=677
input=411 - output=680
input=412 - output=682
input=413 - output=685
input=414 - output=687
input=415 - output=690
input=416 - output=693
input=417 - output=695
input=418 - output=698
input=419 - output=700
input=420 - output=703
input=421 - output=706
input=422 - output=708
input=423 - output=711
input=424 - output=714
input=425 - output=716
input=426 - output=719
input=427 - output=721
input=428 - output=724
input=429 - output=727
input=430 - output=729
input=431 - output=732
input=432 - output=734
input=433 - output=737
input=434 - output=740
input=435 - output=742
input=436 - output=745
input=437 - output=747
input=438 - output=750
input=439 - output=753
input=440 - output=755
input=441 - output=758
input=442 - output=760
input=443 - output=763
input=444 - output=766
input=445 - output=768
input=446 - output=771
input=447 - output=773
input=448 - output=776
input=449 - output=779
input=450 - output=781
input=451 - output=784
input=452 - output=786
input=453 - output=789
input=454 - output=792
input=455 - output=794
input=456 - output=797
input=457 - output=800
input=458 - output=800
input=459 - output=800
input=460 - output=800
input=461 - output=801
input=462 - output=801
input=463 - output=801
input=464 - output=801
input=465 - output=802
input=466 - output=802
input=467 - output=802
input=468 - output=802
input=469 - output=803
input=470 - output=803
input=471 - output=803
input=472 - output=803
input=473 - output=804
input=474 - output=804
input=475 - output=804
input=476 - output=804
input=477 - output=805
input=478 - output=805
input=479 - output=805
input=480 - output=805
input=481 - output=806
input=482 - output=806
input=483 - output=806
input=484 - output=806
input=485 - output=807
input=486 - output=807
input=487 - output=807
input=488 - output=807
input=489 - output=808
input=490 - output=808
input=491 - output=808
input=492 - output=808
input=493 - output=809
input=494 - output=809
input=495 - output=809
input=496 - output=810
input=497 - output=810
input=498 - output=810
input=499 - output=810
input=500 - output=811
input=501 - output=811
input=502 - output=811
input=503 - output=811
input=504 - output=812
input=505 - output=812
input=506 - output=812
And here is a video of what my PIC outputs; the top chart is the raw data from the ADC, the bottom one is the calculated/interpolated data: https://youtu.be/Dxc60d8wDHw
Any help is highly appreciated! Thank you!
In function interpolate(), this multiplication:
(raw - a) * (d - b)
produces a result of type int or unsigned int, depending on the width of those types. For an 8-bit arch, they are probably 16 bits wide (the minimum), and the result will then have type unsigned int. Again assuming 16-bit width, that result will overflow for some of your input values.
On Windows, you will find that int and unsigned int are both 32 bits wide. The result of the same multiplication will have type int (though it doesn't make a difference in this case) and it will not overflow.
There are a couple of ways you could handle this. Supposing that you want to produce the same numeric results, you can cast one of the operands to unsigned long:
return b + (unsigned long)(raw - a) * (d - b) / (c - a);
That will probably come at a small computational cost.
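Alternatively, if you were willing to accept a loss of precision then you could perform the division before the multiplication, which I would write this way for improved clarity. Be aware that with integer division the loss is large, because (raw - a) / (c - a) truncates to 0 for every raw below c: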
return b + (d - b) * ((raw - a) / (c - a));
You are a "victim" of arithmetic overflow and of the integer promotion rules, combined with the fact that integers have different sizes on the PIC and on the PC.
Because of integer promotion, integer arithmetic is done in int whenever the values of the operands' types fit in an int.
Just an example with raw = 2488:
a is 1988
b is 1988
c is 2700
d is 3500
(raw - a) gives 500 and (d - b) gives 1512.
The product of (raw - a) by (d - b) should be 756000.
On the PC you get this result, because its int uses 32 bits and therefore has a range of -2147483648 to +2147483647. After the division and the following calculations the result finally fits into uint16_t again.
But your PIC uses only 16 bits for int and unsigned int, the latter being the base type of the alias type uint16_t. Since uint16_t does not fit into an int, no integer promotion is done. However, the result overflows the range of 0 to 65535, because the product is cut down to 16 bits. The calculated result is 35104, and gives you a wrong interpolation.
The solution is to use an intermediate result with a type of int32_t or uint32_t that provides a bigger range.
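A minimal sketch of that suggestion, assuming <stdint.h> is available on the PIC toolchain. The name interpolate32 is made up here. A signed 32-bit type is used rather than an unsigned one because one segment in the question falls from 3500 to 3300, so d - b has to be able to go negative.

```c
#include <assert.h>
#include <stdint.h>

/* Same formula as interpolate() in the question, b + (raw-a)*(d-b)/(c-a),
 * but with every intermediate widened to signed 32 bits.
 * (Hypothetical variant, for illustration only.) */
uint16_t interpolate32(uint16_t a, uint16_t b, uint16_t c, uint16_t d,
                       uint16_t raw)
{
    assert(c != a);                    /* avoid division by zero */

    int32_t dx   = (int32_t)raw - a;   /* distance into the segment */
    int32_t dy   = (int32_t)d - b;     /* may be negative (falling segment) */
    int32_t span = (int32_t)c - a;     /* width of the segment */

    /* For 12-bit inputs dx * dy is at most a few million in magnitude,
       which fits comfortably in 32 bits. */
    return (uint16_t)(b + dx * dy / span);
}
```

For raw = 2488 in the 1988..2700 segment the product is 756000, which now survives intact instead of being cut down to 35104.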
As the others have analyzed, you have an arithmetic overflow problem. One solution is to use 32-bit arithmetic. Another is to keep to 16-bit arithmetic but go right down to basics.
What is 3 * 5? It is 5 + 5 + 5.
What is 10 / 3? It is how many times 3 can be subtracted from 10 until the remainder is less than 3.
10 - 3 = 7 => 1 r 7
7 - 3 = 4 => 2 r 4
4 - 3 = 1 => 3 r 1
With that in mind, you can interleave the operations. This is basically how we used to do it in assembler without multiply or divide instructions or when the differences were not nice powers of 2. It is a lot more code than a simple multiply and divide but you do not need to resort to 32-bit arithmetic. This should work on 12-bit ADCs.
uint16_t interp(uint16_t xlo, uint16_t ylo, uint16_t xhi, uint16_t yhi, uint16_t x)
{
// ylo + (x - xlo) * (yhi - ylo) / (xhi - xlo)
// Assumes xlo <= x <= xhi, xlo < xhi and ylo <= yhi
uint16_t xdiff = xhi - xlo;
uint16_t ydiff = yhi - ylo;
uint16_t times = x - xlo;
// Do 16 bit interpolation without going into 32-bit arithmetic
uint16_t remain = 0;
uint16_t quotient = 0;
for (uint16_t tt = 0; tt < times; ++tt)
{
// The 'multiply' - assume that the addition will not
// take the result over 0xFFFF
remain += ydiff;
// The 'divide' - subtract for as long as the divisor still fits
while (remain >= xdiff)
{
remain -= xdiff;
quotient += 1;
}
}
// If we need to round up, compare the remainder with half the divisor
// uint16_t xhalf = xdiff >> 1;
// if (remain > xhalf) quotient += 1;
return ylo + quotient;
}

Is there a way to implement Morse code inside an array? [closed]

I'm making my microcontroller display letters of the alphabet in Morse code using LEDS (dot == short blink, dash == long blink).
I could make a switch statement and do it like this :
switch (input) {
case 'a': case 'A':
....
break;
case 'b': case 'B':
....
break;
}
but then I'd have a 27+ long switch statement, so I guess that's not very good?
I was thinking of making an array with all the Morse code inside but how do I implement this concept, that the first entry of the array equals a or A, ... ?
If we assume that your platform uses a character-encoding system in which the 26 Latin alphabet letters have consecutive values (the most common system used, ASCII, does, but it's not guaranteed by the C Standard), then we can define an array of strings for the Morse code for each letter and index into that array using the value of a given letter from which the value of 'A' has been subtracted.
We can also do a similar thing for digits (these are guaranteed by the Standard to have contiguous codes).
Here's some sample code (we convert all letters to uppercase before indexing into the Morse array):
#include <stdio.h>
#include <ctype.h>
int main()
{
const char* Alpha[26] = {
".-", // A
"-...", // B
"-.-.", // C
"-..", // D
".", // E
"..-.", // F
"--.", // G
"....", // H
"..", // I
".---", // J
"-.-", // K
".-..", // L
"--", // M
"-.", // N
"---", // O
".--.", // P
"--.-", // Q
".-.", // R
"...", // S
"-", // T
"..-", // U
"...-", // V
".--", // W
"-..-", // X
"-.--", // Y
"--.." // Z
};
const char* Digit[10] = {
"-----",// 0
".----",// 1
"..---",// 2
"...--",// 3
"....-",// 4
".....",// 5
"-....",// 6
"--...",// 7
"---..",// 8
"----.",// 9
};
char input[256];
do {
printf("Enter text ($ to quit): ");
scanf("%255s", input);
printf("Morse code...\n");
for (char *c = input; *c; ++c) {
if (isalpha((unsigned char)*c)) printf("%s", Alpha[toupper((unsigned char)*c) - 'A']);
else if (isdigit((unsigned char)*c)) printf("%s", Digit[*c - '0']);
else if (*c != '$') printf("<error>");
printf("\n");
}
} while (input[0] != '$');
return 0;
}
If we can not rely on contiguous codes for the letters (or choose not to, for a more robust implementation), we can determine the index by calling the strchr function (we must then #include <string.h>) to get our letter's position in a list of all letters:
//...
const char* Letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
//...
if (isalpha(*c)) printf("%s", Alpha[strchr(Letters,toupper(*c)) - Letters]);
Please feel free to ask for any further clarification and/or explanation.
An alternative to @Adrian's: just encode most of the ASCII character set, from space (32) to underscore (95). You'll get numbers, letters, punctuation, and pro-signs, and some we don't care about, but who cares, because it simplifies things so much. You're on an 8-bit microcontroller, so you don't have to worry about character encoding and OS preferences and all that trash.
Furthermore, skip the strings. They'll just take up room in your limited RAM, unless you do the F("asd") trick with Arduino to put them in flash. Instead of strings, use ones and zeros: 0b01100000 read from left to right to get dot, dash, sentinel stop bit, followed by padding zeros. Looks nasty but really isn't.
Here's what I have...
/*
* Bit packing: read left to right from MSB, 0=dit, 1=dah,
* with extra dah terminating sentinel, then zero filled on right.
* Example,
* letter D -.. "dahdidit"
* convert 100
* terminate 1001
* pad 10010000
* result 0b10010000
* To use, test the top bit, send dit or dah, shift left, end when result is $80.
*/
/*
* the character codes are in ASCII order,
* from space $20 to underscore $5F or 32-95 decimal.
* lower-case alpha should be converted to upper-case before lookup
*/
const byte charcodes[] PROGMEM = {
0, // $20 space
0b10101110, // ! exclamation point /KW
0b01001010, // " double quote /AF
0, // # hash mark, octothorpe
0b00010011, // $ dollar sign /SX
0, // % percent sign or /KA ?
0b01000100, // & ampersand /AS
0b01111010, // ' single quote /WG
0b10110100, // $28 ( left parenthesis /KN
0b10110110, // ) right parenthesis /KK EOW ? EOM
0b00010110, // * asterisk /SK
0b01010100, // + plus sign
0b11001110, // , comma
0b10000110, // - dash, hyphen /DU
0b01010110, // . period
0b10010100, // / slash, divide /DN
0b11111100, // $30 0
0b01111100, // 1
0b00111100, // 2
0b00011100, // 3
0b00001100, // 4
0b00000100, // 5
0b10000100, // 6
0b11000100, // 7
0b11100100, // $38 8
0b11110100, // 9
0b11100010, // : colon /OS
0b10101010, // ; semicolon /KR
0, // < less-than sign
0b10001100, // = equal sign /BT
0, // > greater-than sign
0b00110010, // ? question mark
0b01101010, // $40 @ commercial 'at' sign /AC
0b01100000, // A
0b10001000, // B
0b10101000, // C
0b10010000, // D
0b01000000, // E
0b00101000, // F
0b11010000, // G
0b00001000, // $48 H
0b00100000, // I
0b01111000, // J
0b10110000, // K
0b01001000, // L
0b11100000, // M
0b10100000, // N
0b11110000, // O
0b01101000, // $50 P
0b11011000, // Q
0b01010000, // R
0b00010000, // S
0b11000000, // T
0b00110000, // U
0b00011000, // V
0b01110000, // W
0b10011000, // $58 X
0b10111000, // Y
0b11001000, // Z
0, // [ left bracket
0, // \ back-slash
0, // ] right bracket
0, // ^
0b00110110, // $5F _ underscore /IQ
};
/*
* look up the code for an ASCII character
* return 0 for bad characters and characters out of range.
*/
byte getcode(char c)
{
// convert lower-case to upper-case
if ( c >= 'a' && c <= 'z' )
c -= 'a' - 'A';
// check for characters beyond our table
if ( c < ' ' || c > '_' )
return 0;
// else read the byte from flash
return pgm_read_byte(charcodes + c - ' ');
}
I'm sure the smart folks will find problems, and I'll learn from them. Have fun!
EDIT: and they did! @Clifford pointed out that I didn't show how to use the encoded byte. So here it is... you'll have to define your values for length of dit, dah, element gap (between dits and dahs), character gap (between characters) and word gap (between words):
#define TOPBIT 0x80
void key(byte c)
{
// while we're not done, start the tone, delay a bit, stop the tone
while ( c != TOPBIT && c != 0 ) {
tone(BUZZ, FREQ);
digitalWrite(LED, 1);
delay( (c & TOPBIT) ? dahLen : ditLen );
noTone(BUZZ);
digitalWrite(LED, 0);
delay(elemGap);
// shift left one place; test the next bit the next time around
c <<= 1;
}
// now generate a character space
delay(charGap - elemGap);
}
HTH! (Thanks @Clifford)
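To check table entries on a desktop machine before flashing, the packing scheme described in the comment block can be sketched in plain C. The helper names pack_morse and unpack_morse are invented for this example, and uint8_t stands in for the Arduino byte type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a pattern such as "-.." into one byte: 0 = dit, 1 = dah, read from
 * the MSB, followed by a 1 sentinel bit and zero padding on the right.
 * Returns 0 for an empty pattern or one longer than 7 elements. */
uint8_t pack_morse(const char *pattern)
{
    assert(pattern != NULL);
    size_t len = strlen(pattern);
    if (len == 0 || len > 7) return 0;

    uint8_t code = 0;
    for (size_t i = 0; i < len; i++)
        code = (uint8_t)((code << 1) | (pattern[i] == '-'));
    code = (uint8_t)((code << 1) | 1);       /* terminating sentinel */
    return (uint8_t)(code << (7 - len));     /* left-align in the byte */
}

/* Expand a packed byte back into '.' and '-' characters.
 * `out` needs room for 8 bytes: up to 7 elements plus the terminator. */
void unpack_morse(uint8_t code, char *out)
{
    assert(out != NULL);
    /* 0 means "no code"; 0x80 means only the sentinel bit is left. */
    while (code != 0x80 && code != 0) {
        *out++ = (code & 0x80) ? '-' : '.';
        code = (uint8_t)(code << 1);
    }
    *out = '\0';
}
```

The loop in unpack_morse follows the same test-top-bit, shift-left pattern as key(), only it writes characters instead of driving the LED and buzzer.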

mbrtowc returns -1 for non-ASCII characters on embedded device but not on Linux computer

Task
At the moment I am porting old DOS code for a device to Linux in pure C. The text is drawn on the surface with the help of bitfonts. I wrote a function which needs the Unicode codepoint to be passed and then draws the corresponding glyph (tested and works with different ASCII and non-ASCII characters). The old source code used DOS encoding but I am trying to use UTF-8 since multilanguage support is desired. I cannot use SDL_ttf or similar functions since the produced glyphs are not "precise" enough. Therefore I have to stick with bitfonts.
Issue
I wrote a small C test program to test the conversion of multibyte characters to their corresponding Unicode codepoint (inspired by http://en.cppreference.com/w/c/string/multibyte/mbrtowc).
#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <stdint.h>
int main(void)
{
size_t n = 0, x = 0;
setlocale(LC_CTYPE, "en_US.utf8");
mbstate_t state = {0};
char in[] = "!°水"; // or u8"zß水"
size_t in_sz = sizeof(in) / sizeof (*in);
printf("Processing %zu UTF-8 code units: [ ", in_sz);
for(n = 0; n < in_sz; ++n)
{
printf("%#x ", (unsigned char)in[n]);
}
puts("]");
wchar_t out[in_sz];
char* p_in = in, *end = in + in_sz;
wchar_t *p_out = out;
int rc = 0;
while((rc = mbrtowc(p_out, p_in, end - p_in, &state)) > 0)
{
p_in += rc;
p_out += 1;
}
size_t out_sz = p_out - out + 1;
printf("into %zu wchar_t units: [ ", out_sz);
for(x = 0; x < out_sz; ++x)
{
printf("%u ", (unsigned short)out[x]);
}
puts("]");
}
The output is as expected:
Processing 7 UTF-8 code units: [ 0x21 0xc2 0xb0 0xe6 0xb0 0xb4 0 ]
into 4 wchar_t units: [ 33 176 27700 0 ]
When I run this code on my embedded Linux device I get the following as output:
Processing 7 UTF-8 code units: [ 0x21 0xc2 0xb0 0xe6 0xb0 0xb4 0 ]
into 2 wchar_t units: [ 33 55264 ]
After the ! character, mbrtowc returns -1, which, according to the documentation, indicates an encoding error. I tested it with different characters and the error occurs only with non-ASCII ones. The error never occurred on the Linux computer.
Additional Information
I am using a PFM-540I Rev. B as pc on the embedded device. The Linux distribution is built using Buildroot.
You need to make sure that the en_US.utf8 locale is available on the embedded Linux build. By default, Buildroot limits the locales installed on the system in two ways:
Only specific locales are generated, as specified by the BR2_GENERATE_LOCALE configure option. By default, this list is empty, so you only get the C locale. Set this config option to en_US.UTF-8.
All locale data is removed at the end of the build, except the ones specified in BR2_ENABLE_LOCALE_WHITELIST. en_US is already in the default value, so probably you don't need to change this.
Note that if you change these configuration options, you need to make a completely clean build (with make clean; make) for the change to take effect.
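One way to confirm this diagnosis on the device is to check the return value of setlocale, which the test program in the question ignores. setlocale returns NULL when the requested locale cannot be selected, and the program then silently stays in the default "C" locale, where the bytes are not decoded as UTF-8. A minimal sketch, with select_ctype_locale as a made-up helper name:

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Try to select `name` for LC_CTYPE.
 * Returns 1 on success, 0 if the locale is not available.
 * (Hypothetical helper, for illustration only.) */
int select_ctype_locale(const char *name)
{
    assert(name != NULL);
    if (setlocale(LC_CTYPE, name) == NULL) {
        fprintf(stderr, "locale \"%s\" is not available\n", name);
        return 0;
    }
    return 1;
}
```

Calling select_ctype_locale("en_US.utf8") at the top of main and bailing out on 0 would make a missing locale visible immediately instead of surfacing later as a -1 from mbrtowc.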

Why is the 17th digit of version 4 GUIDs limited to only 4 possibilities?

I understand that this doesn't take a significant chunk off of the entropy involved, and that even if a whole nother character of the GUID was reserved (for any purpose), we still would have more than enough for every insect to have one, so I'm not worried, just curious.
As this great answer shows, the Version 4 algorithm for generating GUIDs has the following format:
xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
x is random
4 is constant, this represents the version number.
y is one of: 8, 9, A, or B
The RFC spec for UUIDs says that these bits must be set this way, but I don't see any reason given.
Why is the third bullet (the 17th digit) limited to only those four digits?
Bits, not hex
Focusing on hexadecimal digits is confusing you.
A UUID is not made of hex. A UUID is made of 128 bits.
Humans would resent reading a series of 128 bits presented as a long string of 1 and 0 characters. So for the benefit of reading and writing by humans, we present the 128-bits in hex.
Always keep in mind that when you see the series of 36 hex characters with hyphens, you are not looking at a UUID. You are looking at some text generated to represent the 128 bits that are actually in the UUID.
Version & Variant
The first special meaning you mention, the “version” of UUID, is recorded using 4 bits. See section 4.1.3 of your linked spec.
The second special meaning you indicate is the “variant”. This value takes 1-3 bits. See section 4.1.1 of your linked spec.
A hex character represents 4 bits (half an octet).
The Version number, being 4 bits, takes an entire hex character to itself.
Version 4 specifically uses the bits 01 00, which is 4 in hex and also in decimal (base 10).
The Variant, being 1-3 bits, does not take an entire hex character.
Outside the Microsoft world of GUIDs, the rest of the industry nowadays uses two bits: 10, for a decimal value of 2, as the variant. This pair of bits lands in the most significant bits of octet # 8 (counting from zero, so the ninth octet). That octet looks like this, where ‘n’ means 0 or 1: 10 nn nn nn. A pair of hex characters represents the two halves of that octet. So your 17th hex digit, the first half of that octet, 10 nn, can only have four possible values:
10 00 (hex 8)
10 01 (hex 9)
10 10 (hex A)
10 11 (hex B)
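The bit layout above can be checked directly against the text form. The sketch below assumes the canonical 36-character representation with hyphens; the function names are invented for this example. It reads the 13th hex digit as the version and the top two bits of the 17th hex digit as the variant.

```c
#include <assert.h>
#include <string.h>

/* Value of one hex digit, or -1 if `ch` is not a hex digit. */
static int hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/* Extract the version (4 bits) and the two most significant variant bits
 * from a UUID in xxxxxxxx-xxxx-Vxxx-yxxx-xxxxxxxxxxxx form.
 * Returns 0 on success, -1 if the text is malformed. */
int uuid_version_variant(const char *text, int *version, int *variant)
{
    assert(text != NULL && version != NULL && variant != NULL);
    if (strlen(text) != 36) return -1;

    int v = hex_value(text[14]);   /* 13th hex digit: the version */
    int y = hex_value(text[19]);   /* 17th hex digit: 'y' in the pattern */
    if (v < 0 || y < 0) return -1;

    *version = v;
    *variant = y >> 2;             /* keep the top two bits of the nibble */
    return 0;
}
```

For any y of 8, 9, A or B the shifted value is binary 10, which is why those are the only four digits that can appear in that position for this variant.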
Quoting the estimable Mr. Lippert
First off, what bits are we talking about when we say “the bits”? We already know that in a “random” GUID the first hex digit of the third section is always 4....there is additional version information stored in the GUID in the bits in the fourth section as well; you’ll note that a GUID almost always has 8, 9, a or b as the first hex digit of the fourth section. So in total we have six bits reserved for version information, leaving 122 bits that can be chosen at random.
(from https://ericlippert.com/2012/05/07/guid-guide-part-three/)
tl;dr - it's more version information. To get more specific than that I suspect you're going to have to track down the author of the spec.
Based upon what I've tried to learn today, I've attempted to put together a C#/.NET 'LINQPad' snippet/script, to (for some small part) breakdown the GUID/UUID (- in case it helps):
void Main()
{
var guid =
Guid.Parse(
//@"08c8fbdc-ff38-402e-b0fd-353392a407af" // v4 - Microsoft/.NET
@"7c2a81c7-37ce-4bae-ba7d-11123200d59a" // v4
//@"493f6528-d76a-11ec-9d64-0242ac120002" // v1
//@"5f0ad0df-99d4-5b63-a267-f0f32cf4c2a2" // v5
);
Console.WriteLine(
$"UUID = '{guid}' :");
Console.WriteLine();
var guidBytes =
guid.ToByteArray();
// Version # - 8th octet
const int timeHiAndVersionOctetIdx = 7;
var timeHiAndVersionOctet =
guidBytes[timeHiAndVersionOctetIdx];
var versionNum =
(timeHiAndVersionOctet & 0b11110000) >> 4; // 0xF0
// Variant # - 9th octet
const int clkSeqHiResOctetIdx = 8;
var clkSeqHiResOctet =
guidBytes[clkSeqHiResOctetIdx];
var msVariantNum =
(clkSeqHiResOctet & 0b11100000) >> 5; // 0xE0/3 bits
var variantNum =
(clkSeqHiResOctet & 0b11000000) >> 5; // 0xC0/2bits - 0x8/0x9/0xA/0xB
//
Console.WriteLine(
$"\tVariant # = '{variantNum}' ('0x{variantNum:X}') - '0b{((variantNum & 0b00000100) > 0 ? '1' : '0')}{((variantNum & 0b00000010) > 0 ? '1' : '0')}{((variantNum & 0b00000001) > 0 ? '1' : '0')}'");
Console.WriteLine();
if (variantNum < 4)
{
Console.WriteLine(
$"\t\t'0 x x' - \"Reserved, NCS backward compatibility\"");
}
else
{
if (variantNum == 4 ||
variantNum == 5)
{
Console.WriteLine(
$"\t\t'1 0 x' - \"The variant specified in this {{RFC4122}} document\"");
}
else
{
if (variantNum == 6)
{
Console.WriteLine(
$"\t\t'1 1 0' - \"Reserved, Microsoft Corporation backward compatibility\"");
}
else
{
if (variantNum == 7)
{
Console.WriteLine(
$"\t\t'1 1 1' - \"Reserved for future definition\"");
}
}
}
}
Console.WriteLine();
Console.WriteLine(
$"\tVersion # = '{versionNum}' ('0x{versionNum:X}') - '0b{((versionNum & 0b00001000) > 0 ? '1' : '0')}{((versionNum & 0b00000100) > 0 ? '1' : '0')}{((versionNum & 0b00000010) > 0 ? '1' : '0')}{((versionNum & 0b00000001) > 0 ? '1' : '0')}'");
Console.WriteLine();
string[] versionDescriptions =
new[]
{
@"The time-based version specified in this {{RFC4122}} document",
@"DCE Security version, with embedded POSIX UIDs",
@"The name-based version specified in this {{RFC4122}} document that uses MD5 hashing",
@"The randomly or pseudo-randomly generated version specified in this {{RFC4122}} document",
@"The name-based version specified in this {{RFC4122}} document that uses SHA-1 hashing"
};
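Console.WriteLine(
versionNum >= 1 && versionNum <= versionDescriptions.Length
? $"\t\t'{versionNum}' = \"{versionDescriptions[versionNum - 1]}\""
: $"\t\t'{versionNum}' = (not one of the five versions defined in RFC 4122)");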
Console.WriteLine();
Console.WriteLine(
$"'RFC4122' document - <https://datatracker.ietf.org/doc/html/rfc4122#section-4.1.1>");
Console.WriteLine();
}

How to round float numbers in text in C?

So I have this text
today average human lifespan in Western countries is approaching and exceeding 80 years.
1 Japan 82.6 123 19
2 Hong Kong 82.2 234 411
3 Iceland 81.8 345 26
4 Switzerland 81.7 456 7
5 Australia 81.2 567 500
...
80 Lithuania 73.0 800 2
...
194 Swaziland 39.6 142 212
195 133714 18.0 133 998
196 100110011 10.0 667 87351
I need to round the float numbers in this text to whole numbers. I've searched a lot of forums but I cannot seem to find what I need.
This text is given in a .txt file. My task: "Round real number(that have decimals) to whole numbers. The corrected text should be in a new .txt" That's it.
scanf and sscanf are your best friends for extracting values from a string.
floor is useful for dropping the decimals of a float, and it can also be used to round: floor(x + 0.5) gives the nearest whole number for non-negative x (include math.h to use it).
The format string describes the expected input format. The function returns the number of items it successfully converted.
Sample:
Initialization
int id = -1;
char country[160]; /* Warning: the country name must be shorter than 160 characters. */
/* A plain %s does not know the length of the array; if the array is too short, */
/* sscanf writes past its end and corrupts the memory that follows. */
/* That is why scanf is often criticized; the field widths below guard against it. */
float percent = 0.0;
char a_number_as_string[10];
int other_number = -1;
const char *Switzerland = "4 Switzerland 81.7654321 456 7";
Actual code
int ret = sscanf(Switzerland, "%d %159s %f %9s %d", &id, country,
&percent, a_number_as_string, &other_number);
if(ret == 5) {
printf("~~ id: %d\n\tcountry: %s\n\tpercent: %.2f\n\tand : "
"(%s, %d)\n", id, country, percent, a_number_as_string,
other_number);
/////// ROUND
printf("*** round\n");
printf("\twith printf %%.1f = %.1f\n", percent);
printf("\twith printf %%.2f = %.2f\n", percent);
printf("\twith printf %%.3f = %.3f\n", percent);
printf("\twith printf %%.4f = %.4f\n", percent);
printf("*** With floor (included in math.h)\n");
printf("\t1 decimal: %f\n", floor(percent*10)/10);
printf("\t2 decimal: %f\n", floor(percent*100)/100);
printf("\t3 decimal: %f\n", floor(percent*1000)/1000);
printf("\t4 decimal: %f\n", floor(percent*10000)/10000);
} else {
printf("--> ret = %d", ret);
}
output
~~ id: 4
country: Switzerland
percent: 81.77
and : (456, 7)
*** round
with printf %.1f = 81.8
with printf %.2f = 81.77
with printf %.3f = 81.765
with printf %.4f = 81.7654
*** With floor (included in math.h)
1 decimal: 81.700000
2 decimal: 81.760000
3 decimal: 81.765000
4 decimal: 81.765400
These functions are described in the man pages, available from a Unix, Linux or macOS terminal:
man scanf
man sscanf
man floor
You can also find copies of these man pages on the web, for example:
scanf
sscanf
floor
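The sample above prints fixed decimals but stops short of the actual task, so here is a rough sketch of the rounding step itself. It assumes every real number is a whitespace-delimited token of the form digits, a dot, digits (like 82.6), and it avoids `%s`-style parsing because country names such as "Hong Kong" contain spaces. The names `looks_like_real` and `round_reals_in_line` are invented for this example. Rounding is done with an integer cast (half away from zero) so nothing from math.h has to be linked; `round()` would do the same job.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* 1 if s is [sign]digits.digits and nothing else, otherwise 0. */
static int looks_like_real(const char *s)
{
    size_t before, after;

    if (*s == '+' || *s == '-')
        s++;
    before = strspn(s, "0123456789");
    if (before == 0 || s[before] != '.')
        return 0;
    after = strspn(s + before + 1, "0123456789");
    return after > 0 && s[before + 1 + after] == '\0';
}

/*
 * Hypothetical helper: copies `in` to `out`, replacing each token that
 * looks like a real number by the nearest whole number. Whitespace and
 * all other tokens are copied unchanged.
 * Returns 0 on success, -1 if `out` is too small.
 */
int round_reals_in_line(const char *in, char *out, size_t out_size)
{
    const char *blanks = " \t\r\n";
    size_t used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    while (*in != '\0') {
        size_t gap = strspn(in, blanks);        /* whitespace before the token */
        size_t len = strcspn(in + gap, blanks); /* length of the token itself */
        char piece[16]; /* longer tokens are copied as-is, which keeps the cast below in range */
        int is_real = 0;
        int n;

        if (len > 0 && len < sizeof piece) {
            memcpy(piece, in + gap, len);
            piece[len] = '\0';
            is_real = looks_like_real(piece);
        }
        if (is_real) {
            double value = strtod(piece, NULL);
            long long whole = (long long)(value < 0 ? value - 0.5 : value + 0.5);
            n = snprintf(out + used, out_size - used, "%.*s%lld", (int)gap, in, whole);
        } else {
            n = snprintf(out + used, out_size - used, "%.*s", (int)(gap + len), in);
        }
        if (n < 0 || (size_t)n >= out_size - used)
            return -1; /* output would not fit */
        used += (size_t)n;
        in += gap + len;
    }
    return 0;
}
```

To produce the new .txt file, one would read the input with fgets() line by line, pass each line through this function, and write the result with fputs(). A word that merely ends in a period, such as "years.", is left alone because it does not match the digits-dot-digits shape.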
Here is a program; I have tested it on a single line of input:
// LINK - http://stackoverflow.com/questions/40317323/how-to-extract-float-numbers-from-text-in-c#40317323
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
int main(void)
{
    char s[100];
    int i, j;
    double sum = 0.0, frac = 0.0;
    if (fgets(s, sizeof s, stdin) == NULL) // reads one line, up to '\n', without overflowing s (gets() could overflow and was removed in C11)
        return 1;
    for (i = 0; s[i] != '\0'; i++) // find the index of the decimal point (.)
    {
        if (s[i] == '.')
            break;
    }
    if (s[i] == '\0') // no decimal point on this line
        return 1;
    while (i > 0 && s[i - 1] != ' ') // move back to the first character after the space that precedes the number
        i--;
    for (; s[i] != '.'; i++)
    {
        sum = sum * 10 + (s[i] - 48); // integer part; 48 is the ASCII code of '0'
    }
    i++;
    for (j = 1; s[i] >= '0' && s[i] <= '9'; i++, j++)
    {
        frac = frac + (s[i] - 48) / pow(10, j); // fractional part: digit j has weight 10^(-j)
    }
    printf("\n\n%lf", sum + frac); // final answer: integer part + fractional part
    return 0;
}
Explanation:
Here is what I did:
Since scanf() with %s stops at the first space, I read the whole line at once, delimited by the newline.
A floating-point number in the line contains a decimal point (.), so we first find the index (call it key) of the decimal point in the array.
Between the space after the country name and the decimal point there are only digits, so we move key back to the start of the number, just after that space.
From key we move forward to the decimal point again, accumulating the integer part. Each digit's value comes from its ASCII code: the character '0' is 48 and '9' is 57, so subtracting 48 from a digit character yields its value.
From the decimal point to the next space are the fractional digits, which carry the weights 10^(-1), 10^(-2), 10^(-3) and so on; hence the counter j and the pow() function. That gives us the fractional part.
Finally, I add the integer and fractional parts inside the printf().
Try adding a printf() of all the variables inside each of the for loops to follow along, like a dry run.
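The same steps (find the decimal point, walk back to the start of the token, convert) can also be sketched with the standard strtod() instead of manual digit arithmetic. This is only an illustration under the same assumption as above, namely that the number is delimited by spaces. The function name `first_real_in_line` is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: stores the first space-delimited number that
 * contains a decimal point in *value.
 * Returns 0 if one was found, -1 otherwise (*value is then untouched).
 */
int first_real_in_line(const char *line, double *value)
{
    const char *dot = strchr(line, '.');

    while (dot != NULL) {
        const char *start = dot;
        char *end;
        double parsed;

        /* Walk back to the first character after the preceding space. */
        while (start > line && start[-1] != ' ')
            start--;

        parsed = strtod(start, &end);
        /* Accept only if the conversion ran past the dot and ended at a delimiter. */
        if (end > dot && (*end == ' ' || *end == '\n' || *end == '\0')) {
            *value = parsed;
            return 0;
        }
        dot = strchr(dot + 1, '.'); /* e.g. "years." is not a number, so keep looking */
    }
    return -1;
}
```

Compared with the hand-written loops, strtod() also reports where the conversion stopped, which makes it easy to reject tokens such as "years." that contain a dot but no number.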
You can parse all the values from each line using a regex in C#:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Data;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string[] input = {
"today average human lifespan in Western countries is approaching and exceeding 80 years.",
"1 Japan 82.6 123 19",
"2 Hong Kong 82.2 234 411",
"3 Iceland 81.8 345 26",
"4 Switzerland 81.7 456 7",
"5 Australia 81.2 567 500",
"80 Lithuania 73.0 800 2",
"194 Swaziland 39.6 142 212",
"195 133714 18.0 133 998",
"196 100110011 10.0 667 87351"
};
string pattern = @"(?'index'\d+)\s+(?'country'.*)\s+(?'float'[^ ]+)\s+(?'num1'\d+)\s+(?'num2'\d+)$";
for (int i = 1; i < input.Length; i++)
{
Match match = Regex.Match(input[i], pattern);
Console.WriteLine("index : '{0}', country : '{1}', float : '{2}', num1 : '{3}', num2 : '{4}'",
match.Groups["index"],
match.Groups["country"],
match.Groups["float"],
match.Groups["num1"],
match.Groups["num2"]
);
}
Console.ReadLine();
}
}
}

Resources