Decode table construction for base64 - c

I am reading the libb64 source code for encoding and decoding base64 data.
I know the encoding procedure, but I can't figure out how the following decoding table is constructed for fast lookup when decoding base64 characters. This is the table they are using:
static const char decoding[] = {62,-1,-1,-1,63,52,53,54,55,56,57,58,59,60,61,-1,-1,-1,-2,-1,-1,-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,-1,-1,-1,-1,-1,-1,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51};
Can someone explain to me how the values in this table are used for decoding?

It's a shifted and limited ASCII translation table. The keys of the table are ASCII values and the values are base64 decoded values. The table is shifted so that index 0 maps to the ASCII character '+', and further indices map the ASCII values after '+'. The first entry in the table, for the ASCII character '+', is the base64 value 62. Then three characters are skipped (ASCII ',', '-', '.') and the next character, ASCII '/', is mapped to the base64 value 63.
The rest will become obvious if you look at that table and the ASCII table.
Its usage is something like this:
int decode_base64(char ch) {
    if (ch < '+' || ch > 'z') {
        return SOME_INVALID_CH_ERROR;
    }
    /* shift the character into the decoding table's index range */
    ch -= '+';
    int base64_val = decoding[(unsigned char)ch];
    if (base64_val < 0) {
        /* -1 marks characters that are not base64 symbols, -2 marks '=' padding */
        return SOME_INVALID_CH_ERROR;
    }
    return base64_val;
}
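
For what it's worth, the table does not have to be written by hand; it can be generated from the encoding alphabet. Here is a minimal C sketch (my own illustration, not libb64's actual code) that prints exactly the table above:

#include <stdio.h>
#include <string.h>

static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int main(void) {
    signed char table['z' - '+' + 1];        /* covers ASCII '+' (43) .. 'z' (122): 80 entries */
    memset(table, -1, sizeof table);         /* everything is invalid by default */
    table['=' - '+'] = -2;                   /* the padding character gets its own marker */
    for (int i = 0; i < 64; i++)
        table[b64_alphabet[i] - '+'] = (signed char)i;   /* symbol -> its 6-bit value */

    for (size_t j = 0; j < sizeof table; j++)
        printf("%d%s", table[j], j + 1 < sizeof table ? "," : "\n");
    return 0;
}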

As you know, each byte has 8 bits, giving 256 possible combinations with 2 symbols (base 2).
With 2 symbols you need 8 characters to represent a byte, for example '01010011'.
With base 64 it is possible to represent 64 combinations with a single character...
So we have a base table:
A = 000000
B = 000001
C = 000010
...
If you have the word 'Man', you have the bytes:
01001101, 01100001, 01101110
and so the bit stream:
010011010110000101101110
Break it into groups of six bits: 010011 010110 000101 101110
010011 = T
010110 = W
000101 = F
101110 = u
So, 'Man' => base64 coded = 'TWFu'.
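In C, the bit shifting for one 3-byte group looks roughly like this (a sketch of the arithmetic above, not libb64's actual code):

#include <stdio.h>

static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode exactly three input bytes into four base64 characters. */
static void encode_group(const unsigned char in[3], char out[4]) {
    out[0] = b64_alphabet[in[0] >> 2];                            /* top 6 bits of byte 0 */
    out[1] = b64_alphabet[((in[0] & 0x03) << 4) | (in[1] >> 4)];  /* low 2 of byte 0 + top 4 of byte 1 */
    out[2] = b64_alphabet[((in[1] & 0x0F) << 2) | (in[2] >> 6)];  /* low 4 of byte 1 + top 2 of byte 2 */
    out[3] = b64_alphabet[in[2] & 0x3F];                          /* low 6 bits of byte 2 */
}

int main(void) {
    const unsigned char man[3] = { 'M', 'a', 'n' };
    char out[5] = { 0 };
    encode_group(man, out);
    printf("%s\n", out);   /* prints TWFu */
    return 0;
}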
As you can see, this works perfectly for streams whose bit length is a multiple of 6.
If you have a stream that isn't a multiple of 6, for example 'Ma', you have the stream:
010011 010110 0001
you need to pad it with zero bits to complete groups of 6:
010011 010110 000100
so you have the base64-coded result:
010011 = T
010110 = W
000100 = E
So, 'Ma' => 'TWE'
When decoding, you then need to round the bit length down to the nearest multiple of 8, removing the extra padding bits to obtain the original stream:
T = 010011
W = 010110
E = 000100
1) 010011 010110 000100
2) 01001101 01100001 00
3) 01001101 01100001 = 'Ma'
In practice, when we add those trailing 0 bits, we also mark the end of the Base64 string with '=' padding, one '=' for each missing byte in the final group of three ('Ma' ==> Base64 'TWE=').
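
A matching decoding sketch for one 4-character group, including the '=' handling (again my own illustration, with a hypothetical helper b64_value instead of the lookup table):

#include <stdio.h>
#include <string.h>

/* 6-bit value of one base64 character, or -1 if it is not a symbol. */
static int b64_value(char c) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const char *p = (c != '\0') ? strchr(alphabet, c) : NULL;
    return p ? (int)(p - alphabet) : -1;
}

int main(void) {
    const char *in = "TWE=";               /* encoded 'Ma' */
    unsigned char out[3];
    int n = 0;

    /* Decode one 4-character group: 4 x 6 bits -> up to 3 bytes. */
    int v0 = b64_value(in[0]), v1 = b64_value(in[1]);
    int v2 = b64_value(in[2]), v3 = b64_value(in[3]);

    out[n++] = (unsigned char)((v0 << 2) | (v1 >> 4));
    if (in[2] != '=') out[n++] = (unsigned char)(((v1 & 0x0F) << 4) | (v2 >> 2));
    if (in[3] != '=') out[n++] = (unsigned char)(((v2 & 0x03) << 6) | v3);

    printf("%.*s\n", n, (const char *)out);   /* prints Ma */
    return 0;
}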
See also: http://www.base64decode.org/
Representing images (or other binary data) in base 64 is a good option in many applications where it is hard to work directly with a raw binary stream. A raw binary stream is more compact (effectively base 256), but it is awkward to embed in HTML, for example; it is a trade-off between less traffic and easier handling as strings.
See the ASCII table too: the base64 characters lie in the ASCII range '+' to 'z', but some values between '+' and 'z' are not base64 symbols:
'+' = ASCII DEC 43
...
'z' = ASCII DEC 122
From DEC 43 to DEC 122 there are 80 values, but:
43 OK = '+'
44 isn't a base64 symbol, so its decoding entry is -1 (invalid symbol for base64)
45 ....
46 ...
...
122 OK = 'z'
The character to be decoded is decremented by 43 ('+') so that '+' becomes index 0 of the array, giving quick access by index: decoding[80] = {62, -1, -1 ........, 49, 50, 51};

Considering these 2 mapping tables:
static const char decodingTab[] = {62,-1,-1,-1,63,52,53,54,55,56,57,58,59,60,61,-1,-1,-1,-2,-1,-1,-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,-1,-1,-1,-1,-1,-1,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51};
static unsigned char encodingTab[64]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
decodingTab is the reverse mapping table of encodingTab.
For any valid base64 symbol, the looked-up value is in [0,63], so it is never -1.
However, only 64 entries correspond to valid symbols, while decodingTab has 80 entries (one for every ASCII code from '+' to 'z').
So, in decodingTab, the entries for characters that are not base64 symbols are set to -1 (an arbitrary value outside [0,63]).
char c;          /* a valid base64 symbol, '+' <= c <= 'z' */
unsigned char i; /* a 6-bit value, 0 <= i <= 63 */
...
/* the two tables are inverses of each other (note the '+' offset): */
encodingTab[decodingTab[c - '+']] == c
decodingTab[encodingTab[i] - '+'] == i
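
To convince yourself of that relationship, here is a quick self-check (my own sketch, not part of libb64) that iterates over the alphabet using the two tables exactly as declared above:

#include <assert.h>
#include <stdio.h>

static const char decodingTab[] = {62,-1,-1,-1,63,52,53,54,55,56,57,58,59,60,61,
    -1,-1,-1,-2,-1,-1,-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
    21,22,23,24,25,-1,-1,-1,-1,-1,-1,26,27,28,29,30,31,32,33,34,35,36,37,38,39,
    40,41,42,43,44,45,46,47,48,49,50,51};
static const unsigned char encodingTab[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int main(void) {
    for (int i = 0; i < 64; i++) {
        /* every symbol decodes back to its own 6-bit value */
        assert(decodingTab[encodingTab[i] - '+'] == i);
    }
    puts("round-trip OK");
    return 0;
}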
Hope it helps.

Related

Varbinary bytes in Snowflake

If the query below is executed in MSSQL, I get the following output:
Query:
select SubString(0x003800010102000500000000, 1, 2) as A
      ,SubString(0x003800010102000500000000, 6, 1) as B
      ,CAST(CAST(SubString(0x003800010102000500000000, 9,
            CAST(SubString(0x003800010102000500000000, 6, 1) As TinyInt)) As VarChar) As Float) as D
Reading Format: 0x 00 38 00 01 01 02 00 05 00 00 00 00
Output:
A B D
0x0038 0x02 0
In MSSQL, the SUBSTRING positions and lengths above count bytes, where each byte is displayed as two hex characters after the "0x" prefix.
Now I am trying to achieve the same output using Snowflake. Can someone please help, as I am having difficulty understanding how to split the value into bytes when writing a function.
CREATE OR REPLACE FUNCTION getFloat1 (p1 varchar) RETURNS Float as $$
Select Case
WHEN concat(substr(p1::varchar,1, 2),substr(p1::varchar,5, 4)) <> '0x3E00'
then 0::float
ELSE 1::float
//Else substr(p1::varchar, 9, substr(p1::varchar, 6, 1)):: float End as test1 $$ ;
Snowflake doesn't have a binary literal, so no notation automatically treats a value as a binary like the 0x notation in SQL Server. You always have to cast a value to the BINARY data type to treat the value as a binary.
Also, there are several differences around the BINARY data type handling between SQL Server and Snowflake:
SUBSTRING in Snowflake can handle only a string (VARCHAR)
... so its position and length arguments count characters of the hex string, not bytes of a binary
Snowflake supports a hex string as a representation of a binary, but the hex string must not include the 0x prefix
Snowflake has no way to convert a binary directly to a number, but it can convert a hex string by using the 'X' format string in TO_NUMBER
Based on the above differences, below is an example query achieving the same result as your SQL Server query:
select
substring('003800010102000500000000', 1, 4)::binary A,
substring('003800010102000500000000', 11, 2)::binary B,
to_number(
substring(
'003800010102000500000000',
17,
to_number(substring('003800010102000500000000', 11, 2), 'XX')*2
),
'XXXX'
)::float D
;
It returns the below result that is the same as your query:
/*
A B D
0038 02 0
*/
Explanation:
Since Snowflake doesn't have a binary literal and SUBSTRING only supports a string (VARCHAR), any binary manipulation has to be done with a VARCHAR hex string.
So, in the query, the first SUBSTRING starts from 1 and extracts 4 characters because 1 byte consists of 2 hex characters, then extracting 2 bytes is equivalent to extracting 4 hex characters.
The second SUBSTRING starts from 11 because starting from the 6th byte means ignoring 5 bytes (= 10 hex characters) and starting from the following hex character which is the first hex character of the 6th byte (10 + 1 = 11).
The third SUBSTRING is the same as the second one, starting from the 9th byte means ignoring 8 bytes (= 16 hex characters) and starting from the following hex character (16 + 1 = 17).
Also, to convert a hex string to a numeric data type, use the X character in the second "format" argument of TO_NUMBER to parse the string as a sequence of hex characters. A single X corresponds to a single hex character in the string being parsed. That's why I used 'XX' to parse a single byte (2 hex characters) and 'XXXX' to parse 2 bytes (4 hex characters).
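
The offset arithmetic itself is language-agnostic. Just to make the mapping explicit, here is a tiny C sketch (hypothetical helper names, purely for illustration) of how a 1-based byte position and length translate to hex-string positions:

#include <stdio.h>

/* Each byte is 2 hex characters, and both positions are 1-based. */
static int hex_start(int byte_start) { return 2 * (byte_start - 1) + 1; }
static int hex_len(int byte_len)     { return 2 * byte_len; }

int main(void) {
    printf("byte 1, len 2 -> chars %d..%d\n", hex_start(1), hex_start(1) + hex_len(2) - 1);  /* 1..4   */
    printf("byte 6, len 1 -> chars %d..%d\n", hex_start(6), hex_start(6) + hex_len(1) - 1);  /* 11..12 */
    printf("byte 9, len 2 -> chars %d..%d\n", hex_start(9), hex_start(9) + hex_len(2) - 1);  /* 17..20 */
    return 0;
}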

How to sort Hexadecimal Numbers (like 10 1 A B) in C?

I want to implement a sorting algorithm in C for sorting hexadecimal numbers like:
10 1 A B
to this:
1 A B 10
The problem I am facing is that I don't understand how A and B can be less than 10, since A = 10 and B = 11 in hexadecimal. I'm sorry if I am mistaken.
Thank you!
As mentioned in the previous comments, 10 is 0x10, so this sorting seems to be no problem: 0x1 < 0xA < 0xB < 0x10
In any base a number with two digits is always greater than a number with one digit.
In hexadecimal notation we have 6 more digits available than in decimal, but they still count as one "digit":
hexadecimal digit | value in decimal representation
A | 10
B | 11
C | 12
D | 13
E | 14
F | 15
When you get a number in hexadecimal notation, it might be that its digits happen to use none of the above extra digits, but just the well-known 0..9 digits. This can be confusing, as we still must treat them as hexadecimal. In particular, a digit in a multi-digit hexadecimal representation must be multiplied with a power of 16 (instead of 10) to be correctly interpreted. So when you get 10 as hexadecimal number, it has a value of one (1) time sixteen plus zero (0), so the hexadecimal number 10 has a (decimal) value of 16.
The hexadecimal numbers you gave should therefore be ordered as 1 < A < B < 10.
As a more elaborate example, the hexadecimal representation 1D6A can be converted to decimal like this:
1D6A
│││└─> 10 x 16⁰ = 10
││└──> 6 x 16¹ = 96
│└───> 13 x 16² = 3328
└────> 1 x 16³ = 4096
──── +
7530
Likewise
10
│└─> 0 x 16⁰ = 0
└──> 1 x 16¹ = 16
── +
16
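
Putting it together in C: one straightforward approach is to convert each string with strtoul using base 16 and sort by the resulting values. A minimal sketch (assuming the inputs are valid hex strings without a 0x prefix):

#include <stdio.h>
#include <stdlib.h>

/* Compare two hexadecimal strings by their numeric value. */
static int cmp_hex(const void *a, const void *b) {
    unsigned long x = strtoul(*(const char *const *)a, NULL, 16);
    unsigned long y = strtoul(*(const char *const *)b, NULL, 16);
    return (x > y) - (x < y);
}

int main(void) {
    const char *nums[] = { "10", "1", "A", "B" };
    size_t n = sizeof nums / sizeof nums[0];

    qsort(nums, n, sizeof nums[0], cmp_hex);

    for (size_t i = 0; i < n; i++)
        printf("%s ", nums[i]);   /* prints: 1 A B 10 */
    printf("\n");
    return 0;
}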

EF ADN specification in SIM/USIM

I am building an application to read SIM EF files. From 3G TS 31.102 I am trying to parse the EF ADN file.
According to spec for EF ADN,
Bytes         Description                           M/O   Length
1 to X        Alpha Identifier                      O     X bytes
X+1           Length of BCD number/SSC contents     M     1 byte
X+2           TON and NPI                           M     1 byte
X+3 to X+12   Dialling Number/SSC String            M     10 bytes
X+13          Capability/Configuration Identifier   M     1 byte
X+14          Extension1 Record Identifier          M     1 byte
I am not able to work out the coding for "Length of BCD number/SSC contents".
The spec says the coding is according to GSM 04.08, but I am not able to find it there.
There is a good utility class for BCD operations you can test with. Assuming you are asking how to get the length of the BCD digits of an Abbreviated Dialling Number: ADN numbers can be 3-4 digits; written as BCD they are 2 bytes long, because each BCD digit is a 4-bit nibble. After the TON/NPI byte you should read N bytes and convert them to a decimal value, for example:
byte[] bcds = DecToBCDArray(211);
System.out.println("BCD is "+ Hex.toHexString(bcds));
System.out.println("BCD length is "+ bcds.length);
System.out.println("To decimal "+ BCDtoString(bcds));

How to handle this in huffman coding?

The input characters and their frequencies for compression are:
A = 1
B = 2
C = 4
D = 8
E = 16
F = 32
G = 64
H = 128
I = 256
J = 512
K = 1024
L = 2048
M = 4096
N = 8192
The Huffman coding algorithm is:
First we pick the two characters with the lowest frequencies and build a tree, with the parent being the sum of those two frequencies.
Then we put 0 on the left child and 1 on the right child.
Finally we read off each character's code in binary form: starting from the root, we follow the path to the character, appending 0 for a left branch and 1 for a right branch.
This forms a tree that goes more than 8 levels deep. We can only write binary values in 8 bits, but for this input some codes exceed 8 bits.
What should we do here?
If you encode all 256 possible values, some will be represented by more than 8 bits; that's right. But your encoded string isn't interpreted as an array of bytes; it is a series of bits, which may occupy more than one byte, so it is fine for branches of your Huffman tree to go deeper than eight levels.
Say you have a Huffman tree that contains these encodings (among others):
E 000 # 3 bits
X 0100000001 # 10 bits
NUL 001 #3 bits
Now when you want to encode the string EEXEEEX (terminated by NUL), you get:
E E X E E E X NUL # original text
000 000 0100000001 000 000 000 0100000001 001 # encoded bits
You now organise this series of bits into blocks of 8, that is bytes:
eeeEEExx xxxxxxxx EEEeeeEE Exxxxxxx xxxNNN # orig
00000001 00000001 00000000 00100000 00100100 # bits
enc[0] enc[1] enc[2] enc[3] enc[4] # bytes
(The grouping into bytes is just for easy reading. The last two zero bits are padding.) The byte array enc is now your encoded string.
The compression comes from the fact that frequently used characters occupy less than a byte. For example, the first two Es fit into a single byte. Infrequent characters like X here have a longer encoding, which may even span several bytes.
You must, of course, extract the current bit from the current byte in order to traverse your Huffman tree. You'll need the bitwise operators for that.
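
To make the packing concrete, here is a minimal C sketch of a bit writer using the example codes above (my own illustration, not a full Huffman encoder):

#include <stdio.h>
#include <stdint.h>

/* A minimal bit writer: appends the 'len' lowest bits of 'code', MSB first. */
typedef struct { uint8_t buf[64]; int bitpos; } bitwriter;

static void put_bits(bitwriter *w, uint32_t code, int len) {
    for (int i = len - 1; i >= 0; i--) {
        if ((code >> i) & 1u)
            w->buf[w->bitpos / 8] |= (uint8_t)(0x80u >> (w->bitpos % 8));
        w->bitpos++;
    }
}

int main(void) {
    bitwriter w = { {0}, 0 };
    /* E = 000 (3 bits), X = 0100000001 (10 bits), NUL = 001 (3 bits) */
    uint32_t codes[] = { 0, 0, 0x101, 0, 0, 0, 0x101, 1 };
    int      lens[]  = { 3, 3, 10,    3, 3, 3, 10,    3 };

    for (int i = 0; i < 8; i++)
        put_bits(&w, codes[i], lens[i]);

    int nbytes = (w.bitpos + 7) / 8;
    for (int i = 0; i < nbytes; i++)
        printf("%02X ", w.buf[i]);   /* prints 01 01 00 20 24 */
    printf("\n");
    return 0;
}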

Transform BASE64 string to BASE16(HEX) string?

Hey, I'm trying to write a program to convert from a BASE64 string to a BASE16(HEX) string.
Here's an example:
BASE64: Ba7+Kj3N
HEXADECIMAL: 05 ae fe 2a 3d cd
BINARY: 00000101 10101110 11111110 00101010 00111101 11001101
DECIMAL: 5 174 254 42 61 205
What's the logic to convert from BASE64 to HEXADECIMAL?
Why is the decimal representation split up?
How come the binary representation is split into 6 sections?
I just want the math; the code I can handle, it's just this process that is confusing me. Thanks :)
Here's a function listing that will convert between any two bases: https://sites.google.com/site/computersciencesourcecode/conversion-algorithms/base-to-base
Edit (Hopefully to make this completely clear...)
You can find more information on this at the Wikipedia entry for Base 64.
The customary character set used for base 64, which is different from the character set you'll find in the link I provided prior to the edit, is:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
The character 'A' is the value 0, 'B' is the value 1, 'C' is the value 2, ...'8' is the value 60, '9' is the value 61, '+' is the value 62, and '/' is the value 63. This character set is very different from what we're used to using for binary, octal, base 10, and hexadecimal, where the first character is '0', which represents the value 0, etc.
Soju noted in the comments to this answer that each base 64 digit requires 6 bits to represent it in binary. Using the base 64 number provided in the original question and converting from base 64 to binary we get:
B a 7 + K j 3 N
000001 011010 111011 111110 001010 100011 110111 001101
Now we can push all the bits together (the spaces are only there to help humans read the number):
000001011010111011111110001010100011110111001101
Next, we can introduce new white-space delimiters every four bits starting with the Least Significant Bit:
0000 0101 1010 1110 1111 1110 0010 1010 0011 1101 1100 1101
It should now be very easy to see how this number is converted to base 16:
0000 0101 1010 1110 1111 1110 0010 1010 0011 1101 1100 1101
0 5 A E F E 2 A 3 D C D
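
Putting the math into code, here is a short C sketch (my own, not a hardened converter, and it ignores '=' padding) that decodes 6 bits per base64 character and regroups them into 4-bit hex digits:

#include <stdio.h>
#include <string.h>

static void base64_to_hex(const char *b64, char *hex) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    unsigned buffer = 0;   /* bit accumulator */
    int bits = 0;          /* number of not-yet-emitted bits in the accumulator */
    while (*b64) {
        const char *p = strchr(alphabet, *b64++);
        if (!p) break;                          /* stop at '=' or invalid input */
        buffer = (buffer << 6) | (unsigned)(p - alphabet);
        bits += 6;
        while (bits >= 4) {                     /* emit complete 4-bit nibbles */
            bits -= 4;
            *hex++ = "0123456789abcdef"[(buffer >> bits) & 0xF];
        }
    }
    *hex = '\0';
}

int main(void) {
    char hex[64];
    base64_to_hex("Ba7+Kj3N", hex);
    printf("%s\n", hex);   /* prints 05aefe2a3dcd */
    return 0;
}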
Think of base 64 as base 2^6.
So in order to get alignment with a hex nibble you need at least 2 base-64 digits...
With 2 base-64 digits you have a base-2^12 number, which can be represented by 3 base-2^4 (hex) digits...
(00)(01)(02)(03)(04)(05)---(06)(07)(08)(09)(10)(11) (base-64) maps directly to:
(00)(01)(02)(03)---(04)(05)(06)(07)---(08)(09)(10)(11) (base 16)
So you can either convert to contiguous binary, or use 4 operations. The operations could deal with binary values, or they could use a set of lookup tables (which could work on the char-encoded digits):
first base-64 digit to first hex digit
first base-64 digit to first half of second hex digit
second base-64 digit to second half of second hex digit
second base-64 digit to third hex digit.
The advantage of this is that you can work on the encoded bases without a binary conversion.
It is pretty easy to do in a stream of chars; I am not aware of this actually being implemented anywhere, though.
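
For completeness, those four operations can be written as plain bit twiddling on two 6-bit values (a sketch of the idea above, not a library):

#include <stdio.h>

/* Map two 6-bit base-64 values directly to three 4-bit hex values. */
static void pair_to_nibbles(unsigned d0, unsigned d1, unsigned hex[3]) {
    hex[0] = d0 >> 2;                         /* top 4 bits of the first digit     */
    hex[1] = ((d0 & 0x3) << 2) | (d1 >> 4);   /* bottom 2 of d0 + top 2 of d1      */
    hex[2] = d1 & 0xF;                        /* bottom 4 bits of the second digit */
}

int main(void) {
    /* 'B' = 1 and 'a' = 26 in the base-64 alphabet, i.e. bits 000001 011010 */
    unsigned hex[3];
    pair_to_nibbles(1, 26, hex);
    printf("%X %X %X\n", hex[0], hex[1], hex[2]);   /* prints 0 5 A */
    return 0;
}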
