Convert int/string to byte array with length n - arrays

How can I convert a value like 5 or "Testing" to an array of type byte with a fixed length of n bytes?
Edit:
I want to represent the number 5 in bits. I know that it's 101, but I want it represented as an array with a length of, for example, 6 bytes, so 000000 ....

I'm not sure what you are trying to accomplish here, but assuming you simply want to represent characters as the binary form of their ASCII codes, you can pad the binary representation with zeros. For example, if the fixed width you want is 10 digits, then the letter a (ASCII code 97) is 1100001 in binary, and padded to 10 digits it becomes 0001100001 - but that is the encoding of a single character. The encoding of a string, which is made up of multiple characters, is a sequence of these 10-digit binary codes, one per character in the ASCII table. The encoding of data matters because it tells the system how to interpret the binary data. Then there is also endianness, depending on the system architecture - but that is less of an issue these days, with many older and modern processors (such as ARM) being bi-endian. A small sketch of this padding idea follows the reading list below.
So forget about representing the number 5 and the string "WTF" using
the same number of bytes - it makes the brain hurt. Stop it.
A bit more reading on character encoding would be a good idea.
Start here - https://en.wikipedia.org/wiki/ASCII
Then this - https://en.wikipedia.org/wiki/UTF-8
Then brain hurt - https://en.wikipedia.org/wiki/Endianness
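Here is a minimal sketch in C of that padding idea, assuming all you want is to print a value (a character code or a small number) as a zero-padded binary string of a fixed width. The helper name print_padded_binary and the width of 10 are mine, for illustration only.

#include <stdio.h>

/* Print the value v as a binary string padded with zeros to `width` digits. */
static void print_padded_binary(unsigned int v, int width)
{
    for (int i = width - 1; i >= 0; i--)
        putchar(((v >> i) & 1u) ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    print_padded_binary('a', 10); /* prints 0001100001: ASCII 97 padded to 10 digits */
    print_padded_binary(5, 10);   /* prints 0000000101: the number 5 padded to 10 digits */
    return 0;
}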

Related

How does atof.c work? Subtracting an ASCII zero from an ASCII digit makes it an int? Am I missing something?

So as part of my C classes, for our first homework we are supposed to implement our own atof.c function, and then use it for some tasks. So, being the smart stay-at-home student I am, I decided to look at the atof.c source code and adapt it to meet my needs. I think I'm on board with most of the operations that this function does, like counting the digits before and after the decimal point; however, there is one line of code that I do not understand. I'm assuming this is the line that actually converts the ASCII digit into a digit of type int. Posting it here:
frac1 = 10*frac1 + (c - '0');
In the source code, c is the digit being processed, and frac1 is an int that accumulates the digits read so far from the incoming ASCII string. But why does c - '0' work? And as a follow-up, is there another way of achieving the same result?
There is no such thing as "text" in C. Just APIs that happen to treat integer values as text information. char is an integer type, and you can do math with it. Character literals are actually ints in C (in C++ they're char, but they're still usable as numeric values even there).
'0' is a nice way for humans to write "the ordinal value of the character for zero"; in ASCII, that's the number 48. Since the digits appear in order from 0 to 9 in all encodings I'm aware of, you can convert from the ordinal value in the encoding (e.g. ASCII) to actual numeric values by subtracting away '0' to get actual int values from 0 to 9.
You could just as easily subtract 48 directly (when compiled, it would be impossible to tell which option you used; 48 and ASCII '0' are indistinguishable), it would just be less obvious what you were doing to other people reading your source code.
In ASCII, '0' has the value 48 (it sits at the same position in code page 437, the IBM PC default character set, whose first 128 codes match ASCII). Similarly, '1' is 49, and so on. Subtracting '0' instead of a magic number such as 48 is much clearer as far as self-documentation goes.
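To see the idiom in context, here is a minimal sketch of a digit-string-to-int loop built on c - '0'. The helper name my_atoi is mine, and it handles no sign, whitespace or overflow.

#include <stdio.h>

/* Convert a string of decimal digits to an int using the c - '0' idea. */
static int my_atoi(const char *s)
{
    int value = 0;
    for (; *s >= '0' && *s <= '9'; s++)
        value = 10 * value + (*s - '0'); /* same pattern as frac1 = 10*frac1 + (c - '0') */
    return value;
}

int main(void)
{
    printf("%d\n", my_atoi("472")); /* prints 472 */
    return 0;
}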

concept of converting from UTF8 to UTF16 LE the math operation in c programming [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I would like to know the concept of conversion from UTF8 to UTF16 LE
For example:
input sequence E3 81 82
output sequence is 42 30
What is the actual arithmetic operation in this conversion? (I do not want to call built-in libraries.)
Basically, Unicode is a way to represent as many symbols as possible in one continuous code space; the code of each symbol is usually called a "code point".
UTF-8 and UTF-16 are just ways to encode and represent those code points in one or more octets (UTF-8) or 16-bit words (UTF-16). The latter can be written as a pair of octets in either little-endian ("least significant first", or "Intel byte order") or big-endian ("most significant first", or "Motorola byte order") order, which gives us two variants: UTF-16LE and UTF-16BE.
The first thing you need to do is extract the code point from the UTF-8 sequence.
UTF-8 is encoded as follows:
0x00...0x7F encode the symbol "as-is"; these correspond to the standard ASCII characters
but if the most significant bit is set (i.e. 0x80...0xFF), it means this byte belongs to a sequence of several bytes which, taken together, encode the code point
bytes in the range 0xC0...0xFF come first in such a sequence; in binary representation they look like:
0b110xxxxx - 1 more byte follows and xxxxx are 5 most significant bits of the code point
0b1110xxxx - 2 more bytes follow and xxxx are 4 most significant bits of the code point
0b11110xxx - 3 more bytes...
0b111110xx - 4 more bytes...
No code point currently defined in the Unicode standard requires more than 4 UTF-8 bytes, so the longer forms are unused in practice.
the following (continuation) bytes are in the range 0x80...0xBF (i.e. 0b10xxxxxx), and each encodes the next six bits (from most to least significant) of the code point value.
So, looking at your example: E3 81 82
0xE3 == 0b11100011 means 2 more bytes follow for this code point, and 0011 are its most significant bits
0x81 == 0b10000001 means this is not the first byte of the sequence and it encodes the next 6 bits: 000001
0x82 == 0b10000010 means this is not the first byte of the sequence and it encodes the next 6 bits: 000010
i.e. the result is 0011 000001 000010 == 0x3042
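Here is a sketch of that extraction step in C, assuming well-formed input (no validation of continuation bytes); the function name utf8_decode is mine.

#include <stdio.h>

/* Decode one UTF-8 sequence starting at s into a code point; *len gets the byte count. */
static unsigned long utf8_decode(const unsigned char *s, int *len)
{
    if (s[0] < 0x80) {                 /* 1-byte (ASCII) form */
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {       /* 0b110xxxxx: 2-byte form */
        *len = 2;
        return ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {       /* 0b1110xxxx: 3-byte form */
        *len = 3;
        return ((unsigned long)(s[0] & 0x0F) << 12) |
               ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    }
    *len = 4;                          /* 0b11110xxx: 4-byte form */
    return ((unsigned long)(s[0] & 0x07) << 18) | ((unsigned long)(s[1] & 0x3F) << 12) |
           ((unsigned long)(s[2] & 0x3F) << 6)  | (s[3] & 0x3F);
}

int main(void)
{
    const unsigned char in[] = { 0xE3, 0x81, 0x82 };
    int len;
    unsigned long cp = utf8_decode(in, &len);
    printf("U+%04lX (%d bytes)\n", cp, len); /* prints U+3042 (3 bytes) */
    return 0;
}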
UTF-16 works the same way. Most common code points are just encoded "as-is", but some large values are packed into so-called "surrogate pairs", which are combinations of two 16-bit words:
values in the range 0xD800...0xDBFF form the first (high) surrogate; its 10 low bits encode the 10 most significant bits of the code point minus 0x10000.
values in the range 0xDC00...0xDFFF form the second (low) surrogate; its 10 low bits encode the 10 least significant bits of that same value.
Surrogate pairs are required for code points above 0xFFFF. The range 0xD800...0xDFFF itself is reserved in the Unicode standard for the surrogates, so no characters are ever assigned there.
So, in our example 0x3042 does not hit that range and therefore requires only one 16-bit word.
Since your example asks for the UTF-16LE (little-endian) variant, the least significant byte of that word comes first in the byte sequence, i.e.
0x42 0x30
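To finish the example, here is a sketch of writing a code point as UTF-16LE bytes, including the surrogate-pair case for code points above 0xFFFF. The helper name utf16le_encode is mine, and it does not reject the reserved range 0xD800...0xDFFF.

#include <stdio.h>

/* Write code point cp as UTF-16LE bytes into out; returns the number of bytes written. */
static int utf16le_encode(unsigned long cp, unsigned char *out)
{
    if (cp <= 0xFFFF) {
        out[0] = cp & 0xFF;              /* least significant byte first (little-endian) */
        out[1] = (cp >> 8) & 0xFF;
        return 2;
    }
    cp -= 0x10000;                       /* 20 bits remain */
    unsigned int hi = 0xD800 | (unsigned int)(cp >> 10);   /* high surrogate: top 10 bits */
    unsigned int lo = 0xDC00 | (unsigned int)(cp & 0x3FF); /* low surrogate: bottom 10 bits */
    out[0] = hi & 0xFF;  out[1] = hi >> 8;
    out[2] = lo & 0xFF;  out[3] = lo >> 8;
    return 4;
}

int main(void)
{
    unsigned char out[4];
    int n = utf16le_encode(0x3042, out);
    for (int i = 0; i < n; i++)
        printf("%02X ", out[i]);         /* prints 42 30 */
    printf("\n");
    return 0;
}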

Does the computer convert every single ASCII digit (in binary) to its numerical equivalent (in binary)?

Does the computer convert every single ASCII digit (in binary) to its numerical equivalent (in binary)?
Let's say 9 is given as input; its ASCII value will be 00111001, and we know that the binary of 9 is 1001. So how will the computer convert the ASCII value of 9 to the binary value 9?
It is only when doing arithmetic that a bit pattern represents a numeric value to a digital computer. (It would be possible to create a digital computer that doesn't even do arithmetic.)
It is a human convenience to describe bit patterns as numbers. Hexadecimal is the most common form because it is compact, represents each bit in an easily discernable way and aligns well with storage widths (such as multiples of 8 bits).
How a bit pattern is interpreted depends on the context. That context is driven by programs following conventions and standards, the vast majority of which are beyond the scope of the computer hardware itself.
Some bit patterns are programs. Certain bits may identify an operation, some a register, some an instruction location, some a data location and only some a numeric value.
If you have a bit pattern that you intend represents the character '9' then it does that as long as it flows through a program where that interpretation is built-in or carried along. For convenience, we call the bit pattern for a character, a "character code".
You could write a program that converts the bit pattern for the character '9' to the bit pattern for a particular representation of the numeric value 9. What follows is one way of doing that.
C requires that certain characters are representable, including digits '0' to '9', and that the character codes for those characters, when interpreted as numbers, are consecutive and increasing.
Subtraction of two numbers on a number line measures the distance between them. So, in C, subtracting the character code for '0' from the character code for any decimal digit character gives the distance between that digit and '0', which is the numeric value of the digit.
'9' - '0'
equals 9 because of the requirements in C for the bit patterns for character codes and the bit patterns for integers.
Note: A binary representation is not very human-friendly in general. It is used when hexadecimal would obscure the details of the discussion.
Note: C does not require ASCII. ASCII is simply one character set and character encoding that satisfies C's requirements. There are many character sets that are supersets of and compatible with ASCII. You are probably using one of them.
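As a minimal sketch of the '9' - '0' point above:

#include <stdio.h>

int main(void)
{
    char digit = '9';                   /* the character code for '9' ('0' + 9) */
    int value = digit - '0';            /* distance from '0' on the number line: 9 */
    printf("%c -> %d\n", digit, value); /* prints: 9 -> 9 */
    return 0;
}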
Try this sample program; it shows how ASCII input is converted to a binary integer and back again.
#include <stdio.h>

int main(void)
{
    int myInteger;

    printf("Enter an integer: ");
    scanf("%d", &myInteger);            /* scanf parses the ASCII digits into a binary int */
    printf("Number = %d\n", myInteger); /* printf turns the binary int back into ASCII digits */
    return 0;
}
It's a bit crude and doesn't handle invalid input in any way.

Why char is of 1 byte in C language

Why is a char 1 byte long in C? Why is it not 2 bytes or 4 bytes long?
What is the basic logic behind it to keep it as 1 byte? I know in Java a char is 2 bytes long. Same question for it.
char is 1 byte in C because the standard specifies it that way.
The most probable logic is this: the (binary) representation of a char (in the standard character set) can fit into 1 byte. At the time of the primary development of C, the most commonly available standards were ASCII and EBCDIC, which needed 7- and 8-bit encodings, respectively. So, 1 byte was sufficient to represent the whole character set.
OTOH, by the time Java came into the picture, the concepts of extended character sets and Unicode were present. So, to be future-proof and support extensibility, char was given 2 bytes, which is capable of handling extended character set values.
Why would a char hold more than 1 byte? A char normally represents an ASCII character. Just have a look at an ASCII table: there are only 256 characters in the (extended) ASCII code, so you only need to represent the numbers from 0 to 255, which comes down to 8 bits = 1 byte.
Have a look at an ASCII Table, e.g. here: http://www.asciitable.com/
That's for C. When Java was designed, they anticipated that 16 bits = 2 bytes would be enough to hold any character (including Unicode) in the future.
It is because the C language is 37 years old and there was no need for more than one byte per char, as only the 128 ASCII characters were used (http://en.wikipedia.org/wiki/ASCII).
When C was developed (the first book on it was published by its developers in 1972), the two primary character encoding standards were ASCII and EBCDIC, which were 7 and 8 bit encodings for characters, respectively. And memory and disk space were both of greater concerns at the time; C was popularized on machines with a 16-bit address space, and using more than a byte for strings would have been considered wasteful.
By the time Java came along (mid 1990s), some with vision were able to perceive that a language could make use of an international standard for character encoding, and so Unicode was chosen for its definition. Memory and disk space were less of a problem by then.
The C language standard defines a virtual machine where all objects occupy an integral number of abstract storage units made up of some fixed number of bits (specified by the CHAR_BIT macro in limits.h). Each storage unit must be uniquely addressable. A storage unit is defined as the amount of storage occupied by a single character from the basic character set1. Thus, by definition, the size of the char type is 1.
Eventually, these abstract storage units have to be mapped onto physical hardware. Most common architectures use individually addressable 8-bit bytes, so char objects usually map to a single 8-bit byte.
Usually.
Historically, native byte sizes have been anywhere from 6 to 9 bits wide. In C, the char type must be at least 8 bits wide in order to represent all the characters in the basic character set, so to support a machine with 6-bit bytes, a compiler may have to map a char object onto two native machine bytes, with CHAR_BIT being 12. sizeof (char) is still 1, so types with size N will map to 2 * N native bytes.
1. The basic character set consists of all 26 English letters in both upper- and lowercase, 10 digits, punctuation and other graphic characters, and control characters such as newlines, tabs, form feeds, etc., all of which fit comfortably into 8 bits.
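A small sketch that lets you check these values on your own machine (the output comments assume a typical 8-bit-byte platform; sizes other than sizeof(char) are not guaranteed):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("CHAR_BIT     = %d\n", CHAR_BIT);      /* bits per byte, 8 on most hardware */
    printf("sizeof(char) = %zu\n", sizeof(char)); /* always 1, by definition */
    printf("sizeof(int)  = %zu\n", sizeof(int));  /* commonly 4, but not guaranteed */
    return 0;
}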
You don't need more than a byte to represent the whole ASCII table (128 characters).
But there are other C types which have more room to hold data, like the int type (typically 4 bytes) or the long double type (often 12 bytes).
All of these contain numerical values (even chars! even if they're represented as "letters", they're "numbers": you can compare them, add them...).
These are just different standard sizes, like cm and m for length.

Difference between binary zeros and ASCII character zero

gcc (GCC) 4.8.1
c89
Hello,
I was reading a book about pointers, and it used this code as a sample:
memset(buffer, 0, sizeof buffer);
It will fill the buffer with binary zeros and not the character zero.
I am just wondering what the difference is between the binary zero and the character zero. I thought they were the same thing.
I know that textual data is human readable characters and binary data is non-printable characters. Correct me if I am wrong.
What would be a good example of binary data?
For an added example: if you want to write data to a file, you should use fprintf if you are dealing with strings (textual data), and fwrite if you are dealing with binary data.
Many thanks for any suggestions,
The quick answer is that the character '0' is represented in binary data by the ASCII number 48. That means, when you want the character '0', the file actually has these bits in it: 00110000. Similarly, the printable character '1' has a decimal value of 49, and is represented by the byte 00110001. ('A' is 65, and is represented as 01000001, while 'a' is 97, and is represented as 01100001.)
If you want the null terminator at the end of the string, '\0', that actually has a 0 decimal value, and so would be a byte of all zeroes: 00000000. This is truly a 0 value. To the compiler, there is no difference between
memset(buffer, 0, sizeof buffer);
and
memset(buffer, '\0', sizeof buffer);
The only difference is a semantic one to us. '\0' tells us that we're dealing with a character, while 0 simply tells us we're dealing with a number.
It would help you tremendously to check out an ascii table.
fprintf outputs data using ASCII and outputs strings. fwrite writes pure binary data. If you fprintf(fp, "0"), it will put the value 48 in fp, while if you fwrite(fd, 0) it will put the actual value of 0 in the file. (Note, my usage of fprintf and fwrite was obviously not proper, but it shows the point.)
Note: My answer refers to ASCII because it's one of the oldest, best known character sets, but as Eric Postpichil mentions in the comments, the C standard isn't bound to ASCII. (In fact, while it does occasionally give examples using ASCII, the standard seems to go out of its way to never assume that ASCII will be the character set used.). fprintf outputs using the execution character set of your compiled program.
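Here is a small sketch contrasting the two calls with proper usage; the file name out.bin is made up. fprintf writes the character '0' (the byte 0x30), while fwrite here writes a single binary zero byte (0x00).

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("out.bin", "wb");
    if (fp == NULL)
        return 1;

    fprintf(fp, "0");             /* writes one byte: 0x30, the printable character zero */

    unsigned char zero = 0;
    fwrite(&zero, 1, 1, fp);      /* writes one byte: 0x00, binary zero */

    fclose(fp);                   /* the file now contains the two bytes 30 00 (hex) */
    return 0;
}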
If you are asking about the difference between '0' and 0, these two are completely different:
Binary zero corresponds to the non-printable character \0 (also called the null character), with a code of zero. This character serves as the null terminator in C strings:
5.2.1.2 A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
ASCII character zero '0' is printable (not surprisingly, producing a character zero when printed) and has a decimal code of 48.
Binary zero: 0
Character zero: '0', which in ASCII is 48.
binary data: the raw data that the cpu gets to play with, bit after bit, the stream of 0s and 1s (usually organized in groups of 8, aka Bytes, or multiples of 8)
character data: bytes interpreted as characters. Conventions like ASCII give the rules how a specific bit sequence should be displayed by a terminal, a printer, ...
for example, the binary data (bit sequence ) 00110000 should be displayed as 0
If I remember correctly, the unsigned integer data types have a direct match between the binary value of the stored bits and the interpreted value (ignoring strangeness like endianness ^^).
On a higher level, for example when talking about FTP transfer, the distinction is made between:
data that should be interpreted as (multi)byte characters, aka text (this includes non-character signs like a line break)
data that is one big bit/byte stream which can't be broken down into smaller human-readable pieces, for example an image or a compiled executable
In ASCII every character has a code, and the character zero '0' has the code 0x30 (hex), i.e. 48 decimal.
To fill the buffer with the character zero you must write:
memset(buffer, '0', sizeof buffer);
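A quick sketch of the difference in effect (the buffer size of 8 is arbitrary):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buffer[8];

    memset(buffer, '0', sizeof buffer); /* fills with 0x30, the printable character zero */
    printf("%.8s\n", buffer);           /* prints: 00000000 */

    memset(buffer, 0, sizeof buffer);   /* fills with 0x00, binary zero (null bytes) */
    printf("%zu\n", strlen(buffer));    /* prints: 0 - the buffer now reads as an empty string */
    return 0;
}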
