Concept of converting from UTF-8 to UTF-16 LE: the math operation in C programming [closed]

I would like to know the concept of conversion from UTF-8 to UTF-16 LE.
For example:
input sequence: E3 81 82
output sequence: 42 30
What is the actual arithmetic operation in this conversion? (I do not want to call built-in libraries.)

Basically, Unicode is a way to represent as many symbols as possible in one continuous code space; the code of each symbol is usually called a "code point".
UTF-8 and UTF-16 are just ways to encode and represent those code points in one or more octets (UTF-8) or 16-bit words (UTF-16). The latter can be represented as pairs of octets in either little-endian ("least significant first", or "Intel byte order") or big-endian ("most significant first", or "Motorola byte order") order, which gives us two variants: UTF-16LE and UTF-16BE.
The first thing you need to do is extract the code point from the UTF-8 sequence.
UTF-8 is encoded as follows:
0x00...0x7F encode the symbol "as-is"; these correspond to the standard ASCII symbols
but if the most significant bit is set (i.e. 0x80...0xFF), it means this byte is part of a sequence of several bytes which together encode the code point
bytes from the range 0xC0...0xFF occupy the first position of such a sequence; in binary they look like:
0b110xxxxx - 1 more byte follows and xxxxx are 5 most significant bits of the code point
0b1110xxxx - 2 more bytes follow and xxxx are 4 most significant bits of the code point
0b11110xxx - 3 more bytes...
0b111110xx - 4 more bytes...
The original UTF-8 design allowed even longer sequences, but no code point defined in the current Unicode standard requires more than 4 UTF-8 bytes.
the following bytes are from the range 0x80...0xBF (i.e. 0b10xxxxxx) and each encodes the next six bits (from most to least significant) of the code point value.
So, looking at your example: E3 81 82
0xE3 == 0b11100011 means 2 more bytes follow for this code point, and 0011 are its most significant bits
0x81 == 0b10000001 means this is not the first byte of the sequence and it encodes the next 6 bits: 000001
0x82 == 0b10000010 means this is not the first byte of the sequence and it encodes the next 6 bits: 000010
i.e. the result is 0011 000001 000010 == 0x3042
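As a minimal sketch (assuming the input really is a well-formed 3-byte UTF-8 sequence and doing no validation; a real decoder must also handle the 1-, 2- and 4-byte forms and reject malformed input), that extraction looks like this in C:

#include <stdio.h>

int main(void)
{
    const unsigned char utf8[] = { 0xE3, 0x81, 0x82 };

    unsigned int cp = (utf8[0] & 0x0Fu) << 12   /* 4 bits from the lead byte */
                    | (utf8[1] & 0x3Fu) << 6    /* 6 bits from the 2nd byte  */
                    | (utf8[2] & 0x3Fu);        /* 6 bits from the 3rd byte  */

    printf("code point: U+%04X\n", cp);         /* prints U+3042 */
    return 0;
}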
UTF-16 works the same way. Most code points are just encoded "as-is" in a single 16-bit word, but some large values are packed into so-called "surrogate pairs", which are combinations of two 16-bit words. For such values, 0x10000 is first subtracted from the code point, leaving a 20-bit number; then:
values from the range 0xD800...0xDBFF represent the first (high) surrogate; its 10 lower bits encode the 10 most significant bits of that 20-bit number.
values from the range 0xDC00...0xDFFF represent the second (low) surrogate; its 10 lower bits encode the 10 least significant bits.
Surrogate pairs are required for values above 0xFFFF. The range 0xD800...0xDFFF itself is reserved in the Unicode standard for surrogates, so no symbols may be assigned there.
So, in our example, 0x3042 does not fall into that range and therefore needs only one 16-bit word.
Since your example uses the UTF-16LE (little-endian) variant, the least significant half of that word comes first in the byte sequence, i.e.:
0x42 0x30
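For completeness, here is a rough sketch of that last step, turning a code point into UTF-16LE bytes. The function name to_utf16le and the buffer handling are my own illustration, not a standard API, and the code assumes cp is a valid code point (<= 0x10FFFF and not a surrogate):

#include <stdio.h>

/* Write a code point as UTF-16LE bytes into out[].
   Returns the number of bytes written (2 or 4). */
static int to_utf16le(unsigned int cp, unsigned char *out)
{
    if (cp <= 0xFFFF) {                           /* one 16-bit word, "as-is"  */
        out[0] = (unsigned char)(cp & 0xFF);      /* low byte first (LE)       */
        out[1] = (unsigned char)(cp >> 8);
        return 2;
    } else {                                      /* surrogate pair            */
        unsigned int u  = cp - 0x10000;           /* 20 bits left              */
        unsigned int hi = 0xD800 + (u >> 10);     /* high surrogate            */
        unsigned int lo = 0xDC00 + (u & 0x3FF);   /* low surrogate             */
        out[0] = (unsigned char)(hi & 0xFF);
        out[1] = (unsigned char)(hi >> 8);
        out[2] = (unsigned char)(lo & 0xFF);
        out[3] = (unsigned char)(lo >> 8);
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int n = to_utf16le(0x3042, buf);
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);                  /* prints: 42 30 */
    printf("\n");
    return 0;
}

Fed the code point 0x3042 from above, it writes exactly the two bytes 42 30 from your example.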

Related

Curious about base36 encoding [closed]

I want to base36-encode a 128-bit hexadecimal number,
but 128 bits exceeds the range of the largest integer type supported by the C language.
So I can't get the base-36 digits by repeatedly taking the remainder and the quotient.
I'm curious about the internal algorithm used by base36 encoders that handle such long numbers.
I am wondering how to represent a number that cannot be expressed within the range of C's integer types.
Can you tell me about the algorithm for base36? Or I would like a book or site for reference.
I am wondering how to represent a number that cannot be expressed within the range of C's integer types.
Consider an array of bytes. Those bytes consist of bits. Now, assume that a byte consists of 8 bits. Next, consider an array of 16 bytes. There are a total of 128 bits in that array. You can use the lowest element to represent the first 8 bits of a 128-bit integer, the next element to represent bits 8...15, and so on.
That is how arbitrarily large integers can be represented in C: using arrays of smaller integers, each element representing one digit in a high radix. In the scheme that I described, the number is represented using a radix of 256. You don't necessarily need to use an array of bytes; typically, arbitrary-precision math uses elements of the CPU word size for efficiency. With 32-bit elements, for example, that would be a radix of 4,294,967,296.
In the base36 encoding, the radix, i.e. the base of the representation, is - you may have guessed this - 36. It is a textual representation, so the elements of the array are chars. Instead of using all 256 values, this representation uses only 36 of them; specifically the values that encode the uppercase Latin letters and the Arabic numerals.
Can you tell me about the algorithm for base36?
There are essentially two steps:
First convert the input data to radix-36.
Next, map those digits to text so that the digit 0 maps to the character '0', the digit 10 maps to 'A', and 35 maps to 'Z'. You can interpolate the mappings that I did not spell out.
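A rough sketch of both steps in C, assuming the 128-bit value is already stored as a 16-byte big-endian array (the helper names divmod36 and is_zero and the sample value are just for illustration):

#include <stdio.h>

/* Divide a big-endian byte array by 36 in place and return the remainder.
   Repeating this until the number becomes zero yields the base-36 digits,
   least significant first. Not production code, just a sketch. */
static unsigned divmod36(unsigned char *num, size_t len)
{
    unsigned rem = 0;
    for (size_t i = 0; i < len; i++) {           /* most significant byte first */
        unsigned cur = rem * 256 + num[i];       /* bring down the next "digit" */
        num[i] = (unsigned char)(cur / 36);
        rem = cur % 36;
    }
    return rem;
}

static int is_zero(const unsigned char *num, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (num[i]) return 0;
    return 1;
}

int main(void)
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    unsigned char n[16] = { 0 };                 /* 128-bit number, big-endian */
    n[15] = 255;                                 /* example value: 255         */

    char out[64];                                /* plenty for 128 bits        */
    int  pos = 0;
    do {
        out[pos++] = digits[divmod36(n, sizeof n)];
    } while (!is_zero(n, sizeof n));

    for (int i = pos - 1; i >= 0; i--)           /* digits came out LSD first  */
        putchar(out[i]);
    putchar('\n');                               /* 255 -> "73" in base 36     */
    return 0;
}

Each call to divmod36 performs one long-division step over the whole array, so the quotient stays in the array and the remainder is the next base-36 digit; that is how the "remainder and quotient" idea still works when the number does not fit in any built-in integer type.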

Does the computer convert every single ASCII digit (in binary) to its numerical equivalent (in binary)?

Does the computer convert every single ASCII digit (in binary) to its numerical equivalent (in binary)?
Let's say 9 is given as input; then its ASCII value will be 00111001, and we know that the binary form of 9 is 1001. So how will the computer convert the ASCII value of 9 to the binary value 9?
It is only when doing arithmetic that a bit pattern represents a numeric value to a digital computer. (It would be possible to create a digital computer that doesn't even do arithmetic.)
It is a human convenience to describe bit patterns as numbers. Hexadecimal is the most common form because it is compact, represents each bit in an easily discernable way and aligns well with storage widths (such as multiples of 8 bits).
How a bit pattern is interpreted depends on the context. That context is driven by programs following conventions and standards, the vast majority of which are beyond the scope of the computer hardware itself.
Some bit patterns are programs. Certain bits may identify an operation, some a register, some an instruction location, some a data location and only some a numeric value.
If you have a bit pattern that you intend represents the character '9' then it does that as long as it flows through a program where that interpretation is built-in or carried along. For convenience, we call the bit pattern for a character, a "character code".
You could write a program that converts the bit pattern for the character '9' to the bit pattern for a particular representation of the numeric value 9. What follows is one way of doing that.
C requires that certain characters are representable, including digits '0' to '9', and that the character codes for those characters, when interpreted as numbers, are consecutive and increasing.
Subtraction of two numbers on a number line is a measure of the distance between them. So, in C, subtracting the character code for '0' from the code for any decimal digit character gives the distance between that digit and '0', which is the numeric value of the digit.
'9' - '0'
equals 9 because of the requirements C places on the bit patterns for character codes and the bit patterns for integers.
Note: A binary representation is not very human-friendly in general. It is used when hexadecimal would obscure the details of the discussion.
Note: C does not require ASCII. ASCII is simply one character set and character encoding that satisfies C's requirements. There are many character sets that are supersets of and compatible with ASCII. You are probably using one of them.
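A minimal sketch of that subtraction in C (the variable names are just for illustration):

#include <stdio.h>

int main(void)
{
    char c = '9';
    int value = c - '0';        /* distance from '0' on the number line */
    printf("character %c has numeric value %d\n", c, value);   /* prints 9 */
    return 0;
}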
Try this sample program; it shows how ASCII input is converted to a binary integer and back again.
#include <stdio.h>

int main(void)
{
    int myInteger;

    printf("Enter an integer: ");
    scanf("%d", &myInteger);            /* scanf parses the ASCII digits into a binary int  */
    printf("Number = %d\n", myInteger); /* printf renders the binary int back as ASCII text */
    return 0;
}
It's a bit crude and doesn't handle invalid input in any way.

Convert int/string to byte array with length n

How can I convert a value like 5 or "Testing" to an array of type byte with a fixed length of n bytes?
Edit:
I want to represent the number 5 in bits. I know that it's 101, but I want it represented as an array with a length of, for example, 6 bytes, so 000000...
I'm not sure what you are trying to accomplish here, but assuming you simply want to represent characters in the binary form of their ASCII codes, you can pad the binary representation with zeros. For example, if the set width you want is 10, then encoding the letter a (ASCII code 97) in binary gives 1100001, which padded to 10 digits becomes 0001100001 - but that is for a single character. The encoding of a string, which is made up of multiple characters, is then a sequence of these 10-digit binary codes, each representing the corresponding character in the ASCII table.
The encoding of data is important so that the system knows how to interpret the binary data. Then there is also endianness, depending on the system architecture - but that is less of an issue these days, with many older and modern processors, such as ARM, being bi-endian.
So forget about representing the number 5 and the string "WTF" using
the same number of bytes - it makes the brain hurt. Stop it.
A bit more reading on character encoding will be great.
Start here - https://en.wikipedia.org/wiki/ASCII
Then this - https://en.wikipedia.org/wiki/UTF-8
Then brain hurt - https://en.wikipedia.org/wiki/Endianness
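If it helps, here is a small sketch of the zero-padding idea described above in C; the 10-bit width and the helper name print_binary are arbitrary choices for illustration:

#include <stdio.h>

/* Print a value as a fixed-width binary string, padding with leading zeros. */
static void print_binary(unsigned value, int width)
{
    for (int bit = width - 1; bit >= 0; bit--)
        putchar((value >> bit) & 1u ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    print_binary((unsigned char)'a', 10);   /* prints 0001100001 (ASCII 97) */
    return 0;
}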

How can I convert a 64-byte string to a 20-byte string? [closed]

Below is the problem statement that I have:
I am given a 64-byte string which can contain only digits [0-9].
I need to convert the string to a 20-byte representation and also be able to decode that representation.
The option that comes to my mind is to convert the input string (which is basically a number) to its base64 (or maybe higher??) representation, which will reduce the size.
But C's integer/double data types can't handle such big numbers, which is a bottleneck.
Also, I am really doubtful whether it is even possible. If we can't convert 64 bytes to 20 bytes, then what is the maximum number that can fit into 20 bytes?
I am assuming that each byte is 8 bits.
10^64 is greater than 2^(20*8) so it won't really fit.
A rough guide is that 10 bits (1024 combinations) can store 3 digits (1000 combinations).
You have 20*8 = 160 bits so you can store (slightly more than) 48 digits without loss of information.
If your 64 characters contain only digits from 0 to 9, you need about 213 bits in total, sup(log2(10^64)), which is 27 bytes, sup(213/8).
So no, you cannot compress that number into only 20 bytes of 8 bits each without losing some combinations.
Supposing you can use 27 bytes instead, you can split the number into blocks of 3 digits (123, 456, ...), write each block in binary, and concatenate the binary values.
You will use 21 blocks * 10 bits + 4 bits (for the last digit) in total, which is 27 bytes: sup(214/8) = 27.
P.S. By sup() I mean the number rounded up to the next integer.
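A rough sketch of that packing scheme in C, assuming the digit count is a multiple of 3 (the question's 64th digit would need the extra 4-bit field described above); append_bits and the 6-digit sample input are just for illustration:

#include <stdio.h>

/* Write nbits bits of value into buf at bit offset *pos, MSB first.
   This is an illustration of the bit concatenation, not a complete codec. */
static void append_bits(unsigned char *buf, unsigned *pos,
                        unsigned value, unsigned nbits)
{
    for (unsigned i = 0; i < nbits; i++, (*pos)++) {
        unsigned bit = (value >> (nbits - 1 - i)) & 1u;
        buf[*pos / 8] |= (unsigned char)(bit << (7 - *pos % 8));
    }
}

int main(void)
{
    const char *digits = "123456";        /* stand-in for the 64-digit string  */
    unsigned char packed[27] = { 0 };     /* 27 bytes suffice for 64 digits    */
    unsigned pos = 0;

    for (size_t i = 0; digits[i] != '\0'; i += 3) {
        unsigned block = (unsigned)((digits[i]   - '0') * 100 +
                                    (digits[i+1] - '0') * 10  +
                                    (digits[i+2] - '0'));
        append_bits(packed, &pos, block, 10);   /* 10 bits per 3-digit block */
    }

    printf("used %u bits, %u bytes\n", pos, (pos + 7) / 8);
    return 0;
}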
A rough heuristic is that each decimal digit takes about 3.3 bits to represent, i.e. 3.3 bits ~= 1 digit.
So roughly
10^64 (64 digits) ~= 211 bits ~= 26 bytes
and
20 bytes = 160 bits ~= 48 digits
Thus what the question asks is not possible.
This is (back of the envelope) close to the actual result reported by Timmy and can be done easily "in the head".
Note: by digit I mean base 10 digits.

Why is a char 1 byte in the C language?

Why is a char 1 byte long in C? Why is it not 2 bytes or 4 bytes long?
What is the basic logic behind it to keep it as 1 byte? I know in Java a char is 2 bytes long. Same question for it.
char is 1 byte in C because the standards specify it so.
The most probable logic is: the (binary) representation of a char (in the standard character set) can fit into 1 byte. At the time of the primary development of C, the most commonly available standards were ASCII and EBCDIC, which needed 7- and 8-bit encodings, respectively. So, 1 byte was sufficient to represent the whole character set.
OTOH, by the time Java came into the picture, the concepts of extended character sets and Unicode were present. So, to be future-proof and support extensibility, char was given 2 bytes, which is capable of handling extended character set values.
Why would a char hold more than 1 byte? A char normally represents an ASCII character. Just have a look at an ASCII table: there are only 256 characters in the (extended) ASCII code, so you only need to represent numbers from 0 to 255, which comes down to 8 bits = 1 byte.
Have a look at an ASCII Table, e.g. here: http://www.asciitable.com/
That's for C. When Java was designed, it was anticipated that in the future 16 bits = 2 bytes would be enough to hold any character (including Unicode).
It is because the C language is 37 years old and there was no need to have more bytes for one char, as only the 128 ASCII characters were used (http://en.wikipedia.org/wiki/ASCII).
When C was developed (the first book on it was published by its developers in 1978), the two primary character encoding standards were ASCII and EBCDIC, which were 7- and 8-bit encodings for characters, respectively. Memory and disk space were both greater concerns at the time; C was popularized on machines with a 16-bit address space, and using more than one byte per character in strings would have been considered wasteful.
By the time Java came along (mid 1990s), some with vision were able to perceive that a language could make use of an international standard for character encoding, and so Unicode was chosen for its definition. Memory and disk space were less of a problem by then.
The C language standard defines an abstract machine where all objects occupy an integral number of abstract storage units, each made up of some fixed number of bits (specified by the CHAR_BIT macro in limits.h). Each storage unit must be uniquely addressable. A storage unit is defined as the amount of storage occupied by a single character from the basic character set [1]. Thus, by definition, the size of the char type is 1.
Eventually, these abstract storage units have to be mapped onto physical hardware. Most common architectures use individually addressable 8-bit bytes, so char objects usually map to a single 8-bit byte.
Usually.
Historically, native byte sizes have been anywhere from 6 to 9 bits wide. In C, the char type must be at least 8 bits wide in order to represent all the characters in the basic character set, so to support a machine with 6-bit bytes, a compiler may have to map a char object onto two native machine bytes, with CHAR_BIT being 12. sizeof (char) is still 1, so types with size N will map to 2 * N native bytes.
[1] The basic character set consists of all 26 English letters in both upper- and lowercase, the 10 digits, punctuation and other graphic characters, and control characters such as newlines, tabs, form feeds, etc., all of which fit comfortably into 8 bits.
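A small sketch that prints these values on whatever implementation you compile it with (the exact numbers other than sizeof(char) are implementation-dependent, which is exactly the point):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    printf("CHAR_BIT     = %d\n", CHAR_BIT);       /* bits per byte, at least 8 */
    printf("sizeof(char) = %zu\n", sizeof(char));  /* always 1, by definition   */
    printf("sizeof(int)  = %zu\n", sizeof(int));   /* varies by platform        */
    return 0;
}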
You don't need more than a byte to represent the whole ASCII table (128 characters).
But there are other C types which have more room to contain data, like the int type (typically 4 bytes) or the long double type (often 12 or 16 bytes, depending on the platform).
All of these contain numerical values (even chars! Even if they're represented as "letters", they're "numbers": you can compare them, add them...).
These are just different standard sizes, like cm and m for length.
