Recently I faced an issue with character encoding, and while digging into character sets and character encodings this doubt came to my mind. UTF-8 is the most popular encoding because of its backward compatibility with ASCII. Since UTF-8 is a variable-length encoding format, how does it differentiate single-byte and double-byte characters? For example, "Aݔ" is stored as "410754" (the Unicode code point for A is 41 and for the Arabic character it is 0754). How does the encoding identify 41 as one character and 0754 as another, two-byte character? Why is it not considered as 4107 as one double-byte character and 54 as a single-byte character?
For example, "Aݔ" is stored as "410754"
That’s not how UTF-8 works.
Characters U+0000 through U+007F (aka ASCII) are stored as single bytes. They are the only characters whose code points numerically match their UTF-8 representation. For example, U+0041 becomes 0x41, which is 01000001 in binary.
All other characters are represented with multiple bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.
Computers know where one character ends and the next one starts because UTF-8 was designed so that the single-byte values used for ASCII do not overlap with those used in multi-byte sequences. The bytes 0x00 through 0x7F are only used for ASCII and nothing else; the bytes above 0x7F are only used for multi-byte sequences and nothing else. Furthermore, the bytes that are used at the beginning of the multi-byte sequences also cannot occur in any other position in those sequences.
Because of that, the code points need to be encoded. Consider the following binary patterns:
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The number of leading ones in the first byte tells you how many bytes in total make up the character. All following (continuation) bytes of the sequence start with 10 in binary. To encode the character you convert its code point to binary and fill in the x's.
As an example: U+0754 is between U+0080 and U+07FF, so it needs two bytes. 0x0754 in binary is 11101010100, so you replace the x’s with those digits:
11011101 10010100
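To make the bit-filling concrete, here is a minimal C sketch (the function name encode_utf8 is mine, not anything standard) that encodes code points up to U+FFFF using the patterns above; feeding it U+0754 produces the two bytes 0xDD 0x94, i.e. 11011101 10010100:

#include <stdio.h>

/* Minimal sketch: encode a code point (up to U+FFFF here) into UTF-8 by
   filling the x bits of the patterns shown above. */
static int encode_utf8(unsigned int cp, unsigned char *out) {
    if (cp <= 0x7F) {                                    /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {                            /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else {                                             /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
}

int main(void) {
    unsigned char buf[4];
    int n = encode_utf8(0x0754, buf);        /* the Arabic character from the question */
    for (int i = 0; i < n; i++)
        printf("%02X ", (unsigned)buf[i]);   /* prints: DD 94 */
    printf("\n");
    return 0;
}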
Short answer:
UTF-8 is designed to be able to unambiguously identify the type of each byte in a text stream:
1-byte codes (all and only the ASCII characters) start with a 0
Leading bytes of 2-byte codes start with two 1s followed by a 0 (i.e. 110)
Leading bytes of 3-byte codes start with three 1s followed by a 0 (i.e. 1110)
Leading bytes of 4-byte codes start with four 1s followed by a 0 (i.e. 11110)
Continuation bytes (of all multi-byte codes) start with a single 1 followed by a 0 (i.e. 10)
Your example Aݔ, which consists of the Unicode code points U+0041 and U+0754, is encoded in UTF-8 as:
01000001 11011101 10010100
So, when decoding, UTF-8 knows that the first byte must be a 1-byte code, the second byte must be the leading byte of a 2-byte code, the third byte must be a continuation byte, and since the second byte is the leading byte of a 2-byte code, the second and third byte together must form this 2-byte code.
See here how UTF-8 encodes Unicode code points.
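For completeness, the decoding side can be sketched the same way in C (no validation, assuming well-formed input as described above): classify each byte by its leading bits, then shift the continuation bytes in six bits at a time. Running it on the bytes 41 DD 94 from the example prints U+0041 and U+0754:

#include <stdio.h>

int main(void) {
    const unsigned char input[] = { 0x41, 0xDD, 0x94 };  /* "A" followed by U+0754 */
    size_t i = 0;

    while (i < sizeof input) {
        unsigned char b = input[i];
        unsigned int cp;
        int extra;                                        /* continuation bytes to read */

        if (b < 0x80)      { cp = b;        extra = 0; }  /* 0xxxxxxx: ASCII */
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  /* 110xxxxx: 2-byte lead */
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  /* 1110xxxx: 3-byte lead */
        else               { cp = b & 0x07; extra = 3; }  /* 11110xxx: 4-byte lead */

        while (extra--)                                   /* 10xxxxxx continuation bytes */
            cp = (cp << 6) | (input[++i] & 0x3F);

        printf("U+%04X\n", cp);
        i++;
    }
    return 0;
}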
Just to clarify, ASCII means standard 7-bit ASCII and not extended 8-bit ASCII as commonly used in Europe.
Thus, code points whose numeric value still fits in a single byte (0x80 to 0xFF) already need the two-byte representation, and code points whose value uses the upper part of two bytes (0x0800 to 0xFFFF) take the full three-byte representation.
The four-byte representation only covers code points that fit numerically into the lowest three bytes, and only 1,114,111 of the 16,777,215 available possibilities are used, since Unicode stops at U+10FFFF.
You have an xls here
That means that interpreters must 'jump back' a NUL (0) byte when they find those binary patterns.
Hope this helps somebody!
I would like to know how the conversion from UTF-8 to UTF-16LE works.
For example:
input sequence: E3 81 82
output sequence: 42 30
What is the actual arithmetic operation in this conversion? (I do not want to call built-in libraries.)
Basically, Unicode is a way to represent as many symbols as possible in one continuous code space; the code of each symbol is usually called a "code point".
UTF-8 and UTF-16 are just ways to encode and represent those code points in one or more octets (UTF-8) or 16-bit words (UTF-16). The latter can be stored as pairs of octets in either little-endian ("least significant first", or "Intel byte order") or big-endian ("most significant first", or "Motorola byte order") sequence, which gives us two variants: UTF-16LE and UTF-16BE.
The first thing you need to do is extract the code point from the UTF-8 sequence.
UTF-8 is encoded as follows:
bytes 0x00...0x7F encode the symbol "as-is" and correspond to standard ASCII symbols
but if the most significant bit is set (i.e. 0x80...0xFF), it means this byte is part of a sequence of several bytes which together encode the code point
bytes from the range 0xC0...0xFF occupy the first position of such a sequence; in binary representation they will be:
0b110xxxxx - 1 more byte follows and xxxxx are 5 most significant bits of the code point
0b1110xxxx - 2 more bytes follow and xxxx are 4 most significant bits of the code point
0b11110xxx - 3 more bytes...
0b111110xx - 4 more bytes...
Unicode code points stop at U+10FFFF, so valid UTF-8 never actually needs more than 4 bytes; the longer patterns are left over from the original design.
the following bytes are from the range 0x80...0xBF (i.e. 0b10xxxxxx) and each encodes the next six bits (from most to least significant) of the code point value.
So, looking at your example: E3 81 82
0xE3 == 0b11100011 means 2 more bytes follow for this code point, and 0011 are its most significant bits
0x81 == 0b10000001 means this is not the first byte of the sequence and it encodes the next 6 bits: 000001
0x82 == 0b10000010 means this is not the first byte of the sequence and it encodes the next 6 bits: 000010
i.e. result will be 0011 000001 000010 == 0x3042
UTF-16 works in a similar way. Most code points are just encoded "as-is", but large values are packed into so-called "surrogate pairs", which are combinations of two 16-bit words:
values from the range 0xD800...0xDBFF represent the first word; its 10 lower bits encode the 10 most significant bits of the code point minus 0x10000.
values from the range 0xDC00...0xDFFF represent the second word; its 10 lower bits encode the 10 least significant bits of the code point minus 0x10000.
Surrogate pairs are required for values greater than 0xFFFF. The range 0xD800...0xDFFF itself is reserved in the Unicode standard for the surrogates, so no symbols may be assigned there.
So, in our example 0x3042 does not hit that range and therefore requires only one 16-bit word.
Since your example is the UTF-16LE (little-endian) variant, the least significant half of the word comes first in the byte sequence, i.e.
0x42 0x30
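Putting the two halves together, a compact C sketch of the whole conversion might look like this (the helper names decode_utf8 and emit_utf16le are mine, and no validation is done; the input is assumed to be well-formed UTF-8). For the input E3 81 82 it prints 42 30:

#include <stdio.h>

/* Decode one code point from a UTF-8 sequence, return pointer past it. */
static const unsigned char *decode_utf8(const unsigned char *p, unsigned int *cp) {
    unsigned char b = *p++;
    int extra;
    if      (b < 0x80) { *cp = b;        extra = 0; }
    else if (b < 0xE0) { *cp = b & 0x1F; extra = 1; }
    else if (b < 0xF0) { *cp = b & 0x0F; extra = 2; }
    else               { *cp = b & 0x07; extra = 3; }
    while (extra--) *cp = (*cp << 6) | (*p++ & 0x3F);
    return p;
}

/* Print the UTF-16LE bytes of one code point, using a surrogate pair
   (after subtracting 0x10000) when the value is above 0xFFFF. */
static void emit_utf16le(unsigned int cp) {
    if (cp > 0xFFFF) {
        unsigned int v  = cp - 0x10000;
        unsigned int hi = 0xD800 | (v >> 10);    /* 10 most significant bits  */
        unsigned int lo = 0xDC00 | (v & 0x3FF);  /* 10 least significant bits */
        printf("%02X %02X %02X %02X ", hi & 0xFF, hi >> 8, lo & 0xFF, lo >> 8);
    } else {
        printf("%02X %02X ", cp & 0xFF, cp >> 8); /* least significant byte first */
    }
}

int main(void) {
    const unsigned char in[] = { 0xE3, 0x81, 0x82 };
    const unsigned char *p = in;
    unsigned int cp;
    while (p < in + sizeof in) {
        p = decode_utf8(p, &cp);
        emit_utf16le(cp);                         /* prints: 42 30 */
    }
    printf("\n");
    return 0;
}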
How can I convert a value like 5 or "Testing" to an array of type byte with a fixed length of n bytes?
Edit:
I want to represent the number 5 in bits. I know that it's 101, but I want it represented as an array with a length of, for example, 6 bytes, so 000000 ....
I'm not sure what you are trying to accomplish here, but assuming you simply want to represent characters as the binary form of their ASCII codes, you can pad the binary representation with zeros. For example, if the set number of digits you want is 10, then encoding the letter a (ASCII code 97) in binary gives 1100001, padded to 10 digits: 0001100001. That is for a single character; the encoding of a string, which is made up of multiple characters, would be a sequence of these 10-digit binary codes, each representing the corresponding character in the ASCII table. The encoding of the data matters so that the system knows how to interpret the binary data. Then there is also endianness, depending on the system architecture, but that's less of an issue these days, with many older and modern processors (like ARM) being bi-endian.
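If the goal is really just the zero-padded binary text described above, a small illustrative C helper (print_binary_padded is my own name, just a sketch; the question says bytes, but the padding idea is the same for any width) could look like this. 'a' (97) padded to 10 digits comes out as 0001100001, and 5 padded to 6 digits as 000101:

#include <stdio.h>

/* Print a value as a zero-padded binary string of a fixed width. */
static void print_binary_padded(unsigned int value, int width) {
    int bit;
    for (bit = width - 1; bit >= 0; bit--)
        putchar(((value >> bit) & 1) ? '1' : '0');
    putchar('\n');
}

int main(void) {
    print_binary_padded('a', 10);   /* 0001100001 */
    print_binary_padded(5, 6);      /* 000101     */
    return 0;
}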
So forget about representing the number 5 and the string "WTF" using
the same number of bytes - it makes the brain hurt. Stop it.
A bit more reading on character encoding would be great.
Start here - https://en.wikipedia.org/wiki/ASCII
Then this - https://en.wikipedia.org/wiki/UTF-8
Then brain hurt - https://en.wikipedia.org/wiki/Endianness
Why is a char 1 byte long in C? Why is it not 2 bytes or 4 bytes long?
What is the basic logic behind it to keep it as 1 byte? I know in Java a char is 2 bytes long. Same question for it.
char is 1 byte in C because it is specified so in standards.
The most probable logic is: the (binary) representation of a char (in a standard character set) can fit into 1 byte. At the time of the primary development of C, the most commonly available standards were ASCII and EBCDIC, which needed 7-bit and 8-bit encodings, respectively. So 1 byte was sufficient to represent the whole character set.
OTOH, by the time Java came into the picture, the concepts of extended character sets and Unicode were around. So, to be future-proof and support extensibility, char was given 2 bytes, which can handle extended character set values.
Why would a char hold more than 1 byte? A char normally represents an ASCII character. Just have a look at an ASCII table: there are only 256 characters in the (extended) ASCII code. So you only need to represent numbers from 0 to 255, which comes down to 8 bits = 1 byte.
Have a look at an ASCII Table, e.g. here: http://www.asciitable.com/
That's for C. When Java was designed, they anticipated that 16 bits = 2 bytes would be enough to hold any character (including Unicode) in the future.
It is because the C language is 37 years old and there was no need to have more bytes for 1 char, as only the 128 ASCII characters were used (http://en.wikipedia.org/wiki/ASCII).
When C was developed (it first appeared around 1972), the two primary character encoding standards were ASCII and EBCDIC, which were 7- and 8-bit encodings for characters, respectively. Memory and disk space were both greater concerns at the time; C was popularized on machines with a 16-bit address space, and using more than a byte per character for strings would have been considered wasteful.
By the time Java came along (mid 1990s), some with vision were able to perceive that a language could make use of an international standard for character encoding, and so Unicode was chosen for its definition. Memory and disk space were less of a problem by then.
The C language standard defines a virtual machine where all objects occupy an integral number of abstract storage units made up of some fixed number of bits (specified by the CHAR_BIT macro in limits.h). Each storage unit must be uniquely addressable. A storage unit is defined as the amount of storage occupied by a single character from the basic character set [1]. Thus, by definition, the size of the char type is 1.
Eventually, these abstract storage units have to be mapped onto physical hardware. Most common architectures use individually addressable 8-bit bytes, so char objects usually map to a single 8-bit byte.
Usually.
Historically, native byte sizes have been anywhere from 6 to 9 bits wide. In C, the char type must be at least 8 bits wide in order to represent all the characters in the basic character set, so to support a machine with 6-bit bytes, a compiler may have to map a char object onto two native machine bytes, with CHAR_BIT being 12. sizeof (char) is still 1, so types with size N will map to 2 * N native bytes.
[1] The basic character set consists of all 26 English letters in both upper- and lowercase, 10 digits, punctuation and other graphic characters, and control characters such as newlines, tabs, form feeds, etc., all of which fit comfortably into 8 bits.
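A quick way to see what the answer above describes on your own platform is to print sizeof(char) and CHAR_BIT; the first is 1 by definition, the second is how many bits that one "byte" actually has (8 on most common architectures):

#include <stdio.h>
#include <limits.h>

int main(void) {
    printf("sizeof(char) = %zu\n", sizeof(char));   /* always 1 by definition */
    printf("CHAR_BIT     = %d\n", CHAR_BIT);        /* at least 8 */
    return 0;
}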
You don't need more than a byte to represent the whole ASCII table (128 characters).
But there are other C types which have more room to contain data, like the int type (commonly 4 bytes) or the long double type (for example, 12 bytes on 32-bit x86).
All of these contain numerical values (even chars! even if they're represented as "letters", they're "numbers"; you can compare them, add them...).
These are just different standard sizes, like cm and m for length.
I am parsing some UTF-8 text but am only interested in characters in the ASCII range, i.e., I can just skip multibyte sequences.
I can easily detect the beginning of a sequence because the sign bit is set, so the char value is < 0. But how can I tell how many bytes are in the sequence so I can skip over it?
I do not need to perform any validation, i.e., I can assume the input is valid UTF-8.
Just strip out all bytes which are not valid ASCII; don't try to get cute and interpret bytes > 127 at all. This works as long as you don't have any combining sequences with a base character in the ASCII range. For those you would need to interpret the code points themselves.
Although Deduplicator's answer is more appropriate to the specific purpose of skipping over multibyte sequences, if there is a need to get the length of each such character, pass the first byte to this function:
int getUTF8SequenceLength (unsigned char firstPoint) {
    firstPoint >>= 4;                /* keep only the top four bits of the lead byte */
    firstPoint &= 7;                 /* drop the topmost bit: 4 or 5 means a 2-byte lead, 6 a 3-byte lead, 7 a 4-byte lead */
    if (firstPoint == 4) return 2;   /* leads 0xC0..0xCF would otherwise yield 1 below */
    return firstPoint - 3;           /* 5 -> 2, 6 -> 3, 7 -> 4 */
}
This returns the total length of the sequence, including the first byte. I'm using an unsigned char value as the firstPoint parameter here for clarity, but note this function will work exactly the same way if the parameter is a signed char.
To explain:
UTF-8 uses bits 5, 6, and 7 in the first byte of a sequence to indicate the remaining length. If all three are set, the sequence is 3 additional bytes. If only the first of these from the left (the 7th bit) is set, the sequence is 1 additional byte. If the first two from the left are set, the sequence is 2 additional bytes. Hence, we want to examine these three bits (the value here is just an example):
11110111
 ^^^
The value is shifted down by 4, then AND'd with 7. This leaves the 1st, 2nd, and 3rd bits from the right as the only ones that can possibly be set. The values of these bits are 1, 2, and 4 respectively.
00000111
     ^^^
If the value is now 4, we know only the first bit from the left (of the three we are considering) is set and can return 2.
After this, the value is either 7, meaning all three bits are set, so the sequence is 4 bytes in total, or 6, meaning the first two from the left are set so the sequence is 3 bytes in total.
This covers the range of valid Unicode characters expressed in UTF-8.
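As a usage sketch for the original goal (keep ASCII, skip everything else), the function above can drive the loop directly; the input bytes below spell "abc", then the three-byte sequence E3 81 82, then "def", and the loop prints abcdef:

#include <stdio.h>

int getUTF8SequenceLength (unsigned char firstPoint) {   /* definition from above */
    firstPoint >>= 4;
    firstPoint &= 7;
    if (firstPoint == 4) return 2;
    return firstPoint - 3;
}

int main(void) {
    /* "abc", one 3-byte UTF-8 sequence (E3 81 82), then "def" */
    const unsigned char text[] = { 'a','b','c', 0xE3,0x81,0x82, 'd','e','f', 0 };
    size_t i = 0;

    while (text[i] != '\0') {
        if (text[i] < 0x80) {                     /* ASCII: keep the byte */
            putchar(text[i]);
            i++;
        } else {                                  /* multibyte: skip the whole sequence */
            i += (size_t)getUTF8SequenceLength(text[i]);
        }
    }
    putchar('\n');                                /* prints: abcdef */
    return 0;
}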
gcc (GCC) 4.8.1
c89
Hello,
I was reading a book about pointers, and it uses this code as a sample:
memset(buffer, 0, sizeof buffer);
Will fill the buffer with binary zero and not the character zero.
I am just wondering what is the difference between the binary and the character zero. I thought it was the same thing.
I know that textual data is human readable characters and binary data is non-printable characters. Correct me if I am wrong.
What would be a good example of binary data?
As an added example, if you want to write data to a file: if you are dealing with strings (textual data) you should use fprintf, and if you are dealing with binary data you should use fwrite.
Many thanks for any suggestions,
The quick answer is that the character '0' is represented in binary data by the ASCII number 48. That means, when you want the character '0', the file actually has these bits in it: 00110000. Similarly, the printable character '1' has a decimal value of 49, and is represented by the byte 00110001. ('A' is 65, and is represented as 01000001, while 'a' is 97, and is represented as 01100001.)
If you want the null terminator at the end of the string, '\0', that actually has a 0 decimal value, and so would be a byte of all zeroes: 00000000. This is truly a 0 value. To the compiler, there is no difference between
memset(buffer, 0, sizeof buffer);
and
memset(buffer, '\0', sizeof buffer);
The only difference is a semantic one to us. '\0' tells us that we're dealing with a character, while 0 simply tells us we're dealing with a number.
It would help you tremendously to check out an ASCII table.
fprintf outputs data using ASCII and outputs strings. fwrite writes pure binary data. If you fprintf(fp, "0"), it will put the value 48 in fp, while if you fwrite(fd, 0) it will put the actual value of 0 in the file. (Note, my usage of fprintf and fwrite here is obviously not proper, but it shows the point.)
Note: My answer refers to ASCII because it's one of the oldest, best known character sets, but as Eric Postpichil mentions in the comments, the C standard isn't bound to ASCII. (In fact, while it does occasionally give examples using ASCII, the standard seems to go out of its way to never assume that ASCII will be the character set used.). fprintf outputs using the execution character set of your compiled program.
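To show the same point with proper calls, here is a small sketch that writes both kinds of zero to a file (the file name out.bin is just for the demo); afterwards the file contains the two bytes 30 00:

#include <stdio.h>

int main(void) {
    FILE *fp = fopen("out.bin", "wb");
    if (fp == NULL) return 1;

    fprintf(fp, "0");                  /* writes one byte: 0x30, the printable digit '0' */

    {
        unsigned char zero = 0;
        fwrite(&zero, 1, 1, fp);       /* writes one byte: 0x00, binary zero ('\0') */
    }

    fclose(fp);
    return 0;                          /* out.bin now contains: 30 00 */
}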
If you are asking about the difference between '0' and 0, these two are completely different:
Binary zero corresponds to the non-printable character \0 (also called the null character), with a code of zero. This character serves as the null terminator in C strings:
5.2.1.2 A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
ASCII character zero '0' is printable (not surprisingly, producing a character zero when printed) and has a decimal code of 48.
Binary zero: 0
Character zero: '0', which in ASCII is 48.
binary data: the raw data that the CPU gets to play with, bit after bit, the stream of 0s and 1s (usually organized in groups of 8, aka bytes, or multiples of 8)
character data: bytes interpreted as characters. Conventions like ASCII give the rules for how a specific bit sequence should be displayed by a terminal, a printer, ...
for example, the binary data (bit sequence) 00110000 should be displayed as 0
if I remember correctly, the unsigned integer datatypes have a direct match between the binary value of the stored bits and the interpreted value (ignoring strangeness like endianness)
On a higher level, for example when talking about FTP transfer, the distinction is made between:
data that should be interpreted as (multi)byte characters, aka text (this includes non-character signs like a line break)
data that is a big bit/byte stream which can't be broken down into smaller human-readable pieces, for example an image or a compiled executable
In ASCII every character has a code, and the code of the zero character '0' is 0x30 (hex), i.e. 48 in decimal.
To fill the buffer with the zero character you must write:
memset(buffer, '0', sizeof buffer);   /* '0' == 0x30 */