Get length of multibyte UTF-8 sequence - c

I am parsing some UTF-8 text but am only interested in characters in the ASCII range, i.e., I can just skip multibyte sequences.
I can easily detect the beginning of a sequence because the sign bit is set, so the char value is < 0. But how can I tell how many bytes are in the sequence so I can skip over it?
I do not need to perform any validation, i.e., I can assume the input is valid UTF-8.

Just strip out all bytes which are not valid ASCII; don't try to get cute and interpret bytes >127 at all. This works as long as you don't have any combining sequences with a base character in the ASCII range. For those you would need to interpret the code points themselves.
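As a rough sketch of that approach (the function name and driver below are mine, purely for illustration, not from any library), you can walk the string byte by byte and ignore anything with the high bit set:

#include <stdio.h>

/* Handle ASCII bytes, ignore everything else.
 * process_ascii_only is a hypothetical name. */
static void process_ascii_only(const char *s)
{
    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;
        if (b < 128)
            putchar(b);   /* plain ASCII: handle it */
        /* else: the byte belongs to a multibyte sequence, skip it */
    }
}

int main(void)
{
    process_ascii_only("caf\xC3\xA9 and na\xC3\xAFve");   /* prints "caf and nave" */
    putchar('\n');
    return 0;
}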

Although Deduplicator's answer is more appropriate to the specific purpose of skipping over multibyte sequences, if there is a need to get the length of each such character, pass the first byte to this function:
int getUTF8SequenceLength (unsigned char firstPoint) {
    firstPoint >>= 4;
    firstPoint &= 7;
    if (firstPoint == 4) return 2;
    return firstPoint - 3;
}
This returns the total length of the sequence, including the first byte. I'm using an unsigned char value as the firstPoint parameter here for clarity, but note this function will work exactly the same way if the parameter is a signed char.
To explain:
UTF-8 uses bits 5, 6, and 7 in the first byte of a sequence (counting from bit 1 at the least-significant end, so bit 8 is the leftmost) to indicate the remaining length. If all three are set, the sequence is 3 additional bytes. If the first of these from the left (the 7th bit) is set but the next (the 6th) is not, the sequence is 1 additional byte. If the first two from the left are set but not the third, the sequence is 2 additional bytes. Hence, we want to examine these three bits (the value here is just an example):
11110111
 ^^^
The value is shifted down by 4 and then ANDed with 7. This leaves the 1st, 2nd, and 3rd bits from the right as the only ones that can possibly be set. The values of these bits are 1, 2, and 4 respectively.
00000111
     ^^^
If the value is now 4, we know only the first bit from the left (of the three we are considering) is set, and we can return 2.
After this, the value is either 5, meaning the first and third from the left are set, which is still a 2-byte sequence; 6, meaning the first two from the left are set, so the sequence is 3 bytes in total; or 7, meaning all three bits are set, so the sequence is 4 bytes in total. In each of these cases firstPoint - 3 gives the correct total length.
This covers the range of valid Unicode characters expressed in UTF-8.
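For completeness, here is a short sketch (the driver is mine, not from the question) of how the function above can be used to skip multibyte sequences while scanning a valid UTF-8 string:

#include <stdio.h>

int getUTF8SequenceLength(unsigned char firstPoint);   /* the function above */

int main(void)
{
    const char *s = "na\xC3\xAFve \xE2\x82\xAC 5";      /* "naïve € 5" in UTF-8 */
    while (*s != '\0') {
        unsigned char b = (unsigned char)*s;
        if (b < 128) {                      /* ASCII byte: handle it */
            putchar(b);
            s += 1;
        } else {                            /* lead byte: skip the whole sequence */
            s += getUTF8SequenceLength(b);
        }
    }
    putchar('\n');                          /* prints "nave  5" */
    return 0;
}

Because the input is assumed to be valid UTF-8, the loop only ever sees a lead byte when b >= 128, so the returned length can be added to the pointer directly.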

Related

Convert int/string to byte array with length n

How can I convert a value like 5 or "Testing" to an array of type byte with a fixed length of n bytes?
Edit:
I want to represent the number 5 in bits. I know that it's 101, but I want it represented as an array with a length of, for example, 6 bytes, so 000000....
I'm not sure what you are trying to accomplish here, but assuming you simply want to represent characters in the binary form of their ASCII codes, you can pad the binary representation with zeros. For example, if the set number of digits you want is 10, then encoding the letter a (ASCII code 97) in binary gives 1100001, which padded to 10 digits becomes 0001100001; but that is a single character. The encoding of a string, which is made up of multiple characters, will be a sequence of these 10-digit binary codes, each representing the corresponding character in the ASCII table. The encoding of data is important so that the system knows how to interpret the binary data. Then there is also endianness, depending on the system architecture, but that's less of an issue these days, with many older and modern processors (such as the ARM processors) being bi-endian.
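To make the padding idea concrete, here is a small sketch (entirely my own, purely illustrative) that prints a character's ASCII code as a zero-padded binary string of a fixed width:

#include <stdio.h>

/* Print the code of c as a binary string padded to 'width' digits. */
static void print_padded_binary(unsigned char c, int width)
{
    for (int bit = width - 1; bit >= 0; bit--)
        putchar(((c >> bit) & 1) ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    print_padded_binary('a', 10);   /* 'a' is 97, so this prints 0001100001 */
    return 0;
}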
So forget about representing the number 5 and the string "WTF" using
the same number of bytes - it makes the brain hurt. Stop it.
A bit more reading on character encoding will be great.
Start here - https://en.wikipedia.org/wiki/ASCII
Then this - https://en.wikipedia.org/wiki/UTF-8
Then brain hurt - https://en.wikipedia.org/wiki/Endianness

size in bytes of a char number and int number

Please tell me if I am wrong: if a number is stored as a character, it will contain 1 byte per character of the number (not 4 bytes)?
For example, if I make an int variable of the number 8 and a char variable of '8', the int variable will have consumed more memory?
And if I create an int variable as the number 12345 and a character array of "12345", the character array will have consumed more memory?
And in text files, if numbers are stored, are they considered as integers or characters?
Thank you.
Yes, all of your assumptions are correct.
An int will always take up sizeof(int) bytes; 8 stored as an int (assuming a 32-bit int) will take 4 bytes, whereas '8' stored as a char will take up one byte.
The way to think about your last question, IMO, is that data is stored as bytes. char and int are ways of interpreting bytes, so in text files you write bytes, but if you want to write a human-readable "8" into a text file, you must write it in some encoding, such as ASCII, where bytes correspond to human-readable characters. So, to write "8" you would need to write the byte 0x38 (the ASCII code of '8').
So, in files you have data, not ints or chars.
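A short sketch of that point (the file name is mine, purely illustrative; the sizes in the comments assume a typical platform with a 32-bit int):

#include <stdio.h>

int main(void)
{
    int  n = 8;
    char c = '8';

    printf("sizeof n = %zu\n", sizeof n);    /* typically 4 */
    printf("sizeof c = %zu\n", sizeof c);    /* always 1 */

    /* Writing the human-readable character '8' to a text file
       stores the single byte 0x38, its ASCII code. */
    FILE *f = fopen("digit.txt", "w");
    if (f != NULL) {
        fputc(c, f);
        fclose(f);
    }
    return 0;
}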
When we consider the memory for an int or for a char, we think of it as a whole. Integers are commonly stored using a word of memory, which is 4 bytes or 32 bits, so unsigned integers from 0 up to 4,294,967,295 (2^32 - 1) can be stored in such an int variable. Since that is 32 bits in total, and 32/8 = 4, an int variable needs 4 bytes.
But to store an ASCII character we need only 7 bits: the ASCII table has 128 characters, with values from 0 through 127, so 7 bits are sufficient to represent a character in ASCII. However, most computers typically reserve one bit more (i.e. 8 bits, one byte) for a character.
And about your question:
and if I create an int variable as the number 12345 and a character array of "12345" the character array will have consumed more memory?
Yes, from the above it is true. In the first case (the int value) it needs just 4 bytes, while in the second case it needs 5 bytes in total. The reason is that in the first case 12345 is a single integer value, whereas in the second case "12345" is 5 ASCII characters. In fact, in the second case you actually need one more byte to hold the '\0' character that marks the end of the string.
When an int is defined, the memory allocated depends on the compiler and platform (it can be 4 to 8 bytes). The number assigned to the int is stored as is.
e.g. int a = 86;
The number 86 would be stored at the memory allocated for a.
When a char is defined, each character has a number assigned to it. When the character needs to be printed, the character itself is printed, but in memory it is stored as that number. These numbers are the ASCII codes (and there are other encodings as well).
The allocation is 1 byte, because with 1 byte you can represent 2^8 symbols.
if a number is stored as a character it will contain 1 byte per character of the number(not 4 bytes)? for example if I make an int variable of the number 8 and a char variable of '8' the int variable will have consumed more memory?
Yes, since it is guaranteed that (assuming 8-bit bytes):
sizeof(char) == 1
sizeof(int) >= 2
if I create an int variable as the number 12345 and a character array of "12345" the character array will have consumed more memory?
Correct. See the difference between:
strlen("12345") == 5
sizeof(12345) >= 2
Of course, for small numbers like 7, it is not true:
strlen("7") == 1
sizeof(7) >= 2
in text files if numbers are stored are they considered as integers or characters?
To read any data (be it in a file or in a clay tablet!) you need to know its encoding.
If it is a text file, then typically the numbers will be encoded using characters, possibly in their decimal representation.
If it is a binary file, then you may find them written as they are stored in memory for a particular computer.
In short, it depends.
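To make the "it depends" concrete, here is a minimal sketch (the file names are mine, purely illustrative) that stores the same number both ways:

#include <stdio.h>

int main(void)
{
    int n = 12345;

    /* Text file: the number is written as the characters '1' '2' '3' '4' '5',
       i.e. 5 bytes of ASCII. */
    FILE *t = fopen("number.txt", "w");
    if (t != NULL) {
        fprintf(t, "%d", n);
        fclose(t);
    }

    /* Binary file: the number is written exactly as it sits in memory,
       i.e. sizeof(int) bytes, in this machine's endianness. */
    FILE *b = fopen("number.bin", "wb");
    if (b != NULL) {
        fwrite(&n, sizeof n, 1, b);
        fclose(b);
    }
    return 0;
}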

Subtlety in conversion of characters to integers

Can someone explain clearly what these lines from K&R actually mean:
"When a char is converted to an int, can it ever produce a negative
integer? The answer varies from machine to machine. The definition of
C guarantees that any character in the machine's standard printing
character set will never be negative, but arbitrary bit patterns
stored in character variables may appear to be negative on some
machines, yet positive on others".
There are two more-or-less relevant parts to the standard — ISO/IEC 9899:2011.
6.2.5 Types
¶3 An object declared as type char is large enough to store any member of the basic
execution character set. If a member of the basic execution character set is stored in a
char object, its value is guaranteed to be nonnegative. If any other character is stored in
a char object, the resulting value is implementation-defined but shall be within the range
of values that can be represented in that type.
¶15 The three types char, signed char, and unsigned char are collectively called
the character types. The implementation shall define char to have the same range,
representation, and behavior as either signed char or unsigned char.45)
45) CHAR_MIN, defined in <limits.h>, will have one of the values 0 or SCHAR_MIN, and this can be
used to distinguish the two options. Irrespective of the choice made, char is a separate type from the
other two and is not compatible with either.
That defines what your quote from K&R states. The other relevant part defines what the basic execution character set is.
5.2.1 Character sets
¶1 Two sets of characters and their associated collating sequences shall be defined: the set in
which source files are written (the source character set), and the set interpreted in the
execution environment (the execution character set). Each set is further divided into a
basic character set, whose contents are given by this subclause, and a set of zero or more
locale-specific members (which are not members of the basic character set) called
extended characters. The combined set is also called the extended character set. The
values of the members of the execution character set are implementation-defined.
¶2 In a character constant or string literal, members of the execution character set shall be
represented by corresponding members of the source character set or by escape
sequences consisting of the backslash \ followed by one or more characters. A byte with
all bits set to 0, called the null character, shall exist in the basic execution character set; it
is used to terminate a character string.
¶3 Both the basic source and basic execution character sets shall have the following
members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and
form feed. The representation of each member of the source and execution basic
character sets shall fit in a byte. In both the source and execution basic character sets, the
value of each character after 0 in the above list of decimal digits shall be one greater than
the value of the previous. In source files, there shall be some way of indicating the end of
each line of text; this International Standard treats such an end-of-line indicator as if it
were a single new-line character. In the basic execution character set, there shall be
control characters representing alert, backspace, carriage return, and new line. If any
other characters are encountered in a source file (except in an identifier, a character
constant, a string literal, a header name, a comment, or a preprocessing token that is never
converted to a token), the behavior is undefined.
¶4 A letter is an uppercase letter or a lowercase letter as defined above; in this International
Standard the term does not include other characters that are letters in other alphabets.
¶5 The universal character name construct provides a way to name other characters.
One consequence of these rules is that if a machine uses 8-bit characters and EBCDIC encoding, then plain char must be an unsigned type, since the digits have codes 240..249 in EBCDIC.
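As footnote 45 suggests, a program can check which choice the implementation made by looking at CHAR_MIN; a minimal sketch:

#include <limits.h>
#include <stdio.h>

int main(void)
{
#if CHAR_MIN == 0
    puts("plain char is unsigned on this implementation");
#else
    puts("plain char is signed on this implementation");
#endif
    printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
    return 0;
}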
You need to understand several things first.
If I take an 8-bit value and extend it to a 16-bit value, normally you would imagine just adding a bunch of 0's on the left. For example, if I have the 8-bit value 23, in binary that's 00010111, so as a 16-bit number it's 0000000000010111, which is also 23.
It turns out that negative numbers always have a 1 in the high-order bit. (There might be weird machines for which this is not true, but it's true for any machine you're likely to use.) For example, the 8-bit value -40 is represented in binary as 11011000.
So when you convert a signed 8-bit value to a 16-bit value, if the high-order bit is 1 (that is, if the number is negative), you do not add a bunch of 0-s on the left, you add a bunch of 1's instead. For example, going back to -40, we would convert 11011000 to 1111111111011000, which is the 16-bit representation of -40.
There are also unsigned numbers, that are never negative. For example, the 8-bit unsigned number 216 is represented as 11011000. (You will notice that this is the same bit pattern as the signed number -40 had.) When you convert an unsigned 8-bit number to 16 bits, you add a bunch of 0's no matter what. For example, you would convert 11011000 to 0000000011011000, which is the 16-bit representation of 216.
So, putting this all together, if you're converting an 8-bit number to 16 (or more) bits, you have to look at two things. First, is the number signed or unsigned? If it's unsigned, just add a bunch of 0's on the left. But if it's signed, you have to look at the high-order bit of the 8-bit number. If it's 0 (if the number is positive), add a bunch of 0's on the left. But if it's 1 (if the number is negative), add a bunch of 1's on the left. (This whole process is known as sign extension.)
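A short sketch of the sign-extension rule just described, using the fixed-width types from <stdint.h> so the widths are explicit (this is only an illustration of the usual two's-complement behaviour):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int8_t  s = -40;   /* bit pattern 11011000 */
    uint8_t u = 216;   /* the same bit pattern 11011000 */

    int16_t ws = s;    /* sign-extended:  1111111111011000 */
    int16_t wu = u;    /* zero-extended:  0000000011011000 */

    printf("%d %d\n", ws, wu);   /* prints -40 216 */
    return 0;
}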
The ordinary ASCII characters (like 'A' and '1' and '$') all have values less than 128, which means that their high-order bit is always 0. But "special" characters from the "Latin-1" or UTF-8 character sets have values of 128 or greater. For this reason they're sometimes also called "high bit" or "eighth bit" characters. For example, the Latin-1 character Ø (O with a slash through it) has the value 216.
Finally, although type char in C is typically an 8-bit type, the C Standard does not specify whether it is signed or unsigned.
Putting this all together, what Kernighan and Ritchie are saying is that when we convert a char to a 16- or 32-bit integer, we don't quite know whether to apply the signed or the unsigned extension rule described above. If I'm on a machine where type char is unsigned, and I take the character Ø and convert it to an int, I'll probably get the value 216. But if I'm on a machine where type char is signed, I'll probably get the number -40.
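Putting that into code (a sketch only; '\330' is the octal escape for the bit pattern 11011000, which is Ø in Latin-1, and the assignment may draw a warning on some compilers):

#include <stdio.h>

int main(void)
{
    char c = '\330';                    /* bit pattern 11011000 */
    int  i = c;                         /* result depends on char's signedness */

    printf("%d\n", i);                  /* 216 if char is unsigned, -40 if signed */
    printf("%d\n", (unsigned char)c);   /* always 216 */
    return 0;
}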

How can a character be represented by a bit pattern containing three octal digits?

From Chapter 2(Sub section 2.3 named Constants) of K&R book on C programming language:
Certain characters can be represented in character and string
constants by escape sequences like \n (newline); these sequences look
like two characters, but represent only one. In addition, an arbitrary
byte-sized bit pattern can be specified by
'\ooo'
where ooo is one to three octal digits (0...7) or by
'\xhh'
where hh is one or more hexadecimal digits (0...9, a...f, A...F). So
we might write
#define VTAB '\013' /* ASCII vertical tab */
#define BELL '\007' /* ASCII bell character */
or, in hexadecimal,
#define VTAB '\xb' /* ASCII vertical tab */
#define BELL '\x7' /* ASCII bell character */
The part that confuses me is the following wording (emphasis mine): where ooo is one to three octal digits (0...7). If there are three octal digits, the number of bits required will be 9 (3 for each digit), which exceeds the byte length required for characters. Surely I am missing something here. What is it that I am missing?
\ooo (3 octal digits) does indeed allow the specification of 9-bit values, from 0 to 111111111 (binary), i.e. 511. Whether such a value is allowed depends on the char size.
Assignments such as the ones below generate a warning in many environments because a char is 8 bits in those environments. Typically the highest octal sequence allowed is \377. But a char need not be 8 bits, so OP's "9 ... exceeds the byte length required for characters" is incorrect.
char *s = "\777"; //warning "Octal sequence out of range"
char c = '\777'; //warning
int i = '\777'; //warning
The 3-octal-digit constant '\141' is the same as 'a' in a typical environment where ASCII is used. But in an alternative character set, 'a' could be different. Thus, if one wanted a portable assignment of the bit pattern 01100001, one could use '\141' instead of 'a'. One could accomplish the same by assigning '\x61'. In some contexts, an octal pattern may be preferred.
C11 6.4.4.4 ¶9: if no prefix is used, "The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the corresponding type: unsigned char"
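A small sketch of the equivalences being discussed (assuming an ASCII execution character set):

#include <stdio.h>

int main(void)
{
    /* On an ASCII system all three of these hold the same value. */
    char octal = '\141';
    char hex   = '\x61';
    char plain = 'a';

    printf("%d %d %d\n", octal, hex, plain);   /* 97 97 97 */
    printf("%d %d\n", '\013', '\007');         /* 11 7: vertical tab and bell */
    return 0;
}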
The range of code numbers of characters is not defined in K&R, as far as I can remember. In the early days, it was usually the ASCII range 0...127. Nowadays it is often an 8-bit range, 0...255, but it could be wider, too. In any case, the implementation-defined limits on the char data type imply restrictions on the escape notations, too.
For example, if the range is 0...127, then \177 is the largest allowed octal escape.
The first octal digit is only allowed to go to 3 (two bits), not 7 (three bits), if we're talking about eight bit bytes. If we're talking about ASCII (7 bit values), the first digit can only be zero or one.
If K&R says otherwise, their description is either incomplete or incorrect.

What does this line of code do?

for (nbyte = 0; nbyte < 6; nbyte++) {
    mac->byte[nbyte] = (char) (strtoul(string + nbyte*3, 0, 16) & 0xFF);
}
This is a small piece of code found in macchanger; string is a char pointer that points to a MAC address. What I don't know is why I must convert it to an unsigned long int, and why I must multiply by 3 and then AND it with 0xFF.
Most likely the string is a mac address in the form of
XX:YY:ZZ:AA:BB:CC
Doing nbyte*3 moves the "starting offset" pointer up 3 characters in the string each iteration, skipping over the :. Then strtoul reads the next two hexadecimal characters (base 16) and converts them to an unsigned long, which is then ANDed with 0xFF to strip off all but the lowest byte, which gets cast to a char.
It's parsing from a hexadecimal string. The third parameter of strtoul is the base of the conversion (16 in this case). The input is presumably in this form:
12:34:56:78:9a:bc
The pointer is incremented by 3 each time to start at each pair of hexadecimal digits, which are three apart including the colon.
I don't think the & 0xFF is strictly necessary here. It was presumably there to attempt to correctly handle the case where an input contains a number larger than 0xFF, but the algorithm will still fail for this case for other reasons.
string+nbyte*3
string is a pointer to char (as all C strings are). When you add an integer x to a pointer, you get the location pointer + x. By adding nbyte*3 you advance the pointer by 0 characters, then 3, then 6, then 9, and so on.
strtoul converts strings to integers. Specifically, by passing 16 here, you specify base 16 (hex) as the format of the string. By passing string + nbyte*3, the pointer points to the substring beginning at the 0th, 3rd, 6th, 9th, etc. character of string.
After the conversion at each location, the & 0xFF clears any bits beyond the 8 least-significant bits, and that value is then cast to a char.
The result is then stored in a location in the byte array.
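For reference, here is a self-contained sketch of the same idea (the structure and names are mine, not taken from macchanger):

#include <stdio.h>
#include <stdlib.h>

struct mac_addr {
    unsigned char byte[6];
};

/* Parse "12:34:56:78:9a:bc" into six bytes, as the loop above does. */
static void parse_mac(const char *string, struct mac_addr *mac)
{
    for (int nbyte = 0; nbyte < 6; nbyte++) {
        /* string + nbyte*3 points at each 2-digit hex pair; strtoul stops at ':' */
        mac->byte[nbyte] = (unsigned char)(strtoul(string + nbyte * 3, 0, 16) & 0xFF);
    }
}

int main(void)
{
    struct mac_addr mac;
    parse_mac("12:34:56:78:9a:bc", &mac);

    for (int i = 0; i < 6; i++)
        printf("%02x%s", mac.byte[i], i < 5 ? ":" : "\n");
    return 0;
}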

Resources