Reading bits in 5-bit chunks in a "memory aligned" way - C

For example, I am trying to read 15 bits starting at bit address 11. This is memory aligned by my definition (15 % 5 == 0). How can I read or write these bits with memory-aligned accesses, with the bits laid out so that reading one 5-bit chunk requires reading only 1 byte? My thought is that a memory-aligned read should look like this:
MSB  x  x  x  x  x  x LSB : Byte read
  0  0  0 15 14 13 12 11  : 0
  0  0  0 20 19 18 17 16  : 1
  0  0  0 25 24 23 22 21  : 2
The three most significant bits are zero-padded in every byte.
Please share your thoughts, and help me write C code to read the bits in chunks of 5. Thanks.
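One possible reading of the layout above, as a rough sketch only: the source is a packed bit buffer, and each 5-bit chunk is copied into its own byte with the top three bits left at 0. The names get_bit and unpack5 are made up for this example, and the bit numbering (bit n lives in byte n / 8 at position n % 8) is an assumption, since the question does not specify it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return bit number `n` of a packed buffer.
 * Assumption: bit n lives in byte n / 8 at position n % 8 (LSB first). */
static unsigned get_bit(const uint8_t *buf, size_t n)
{
    return (buf[n / 8] >> (n % 8)) & 1u;
}

/* Unpack `nbits` bits starting at bit `start` into `out`, one 5-bit chunk
 * per byte. The lowest-numbered bit of each chunk lands in the LSB and the
 * top three bits of every output byte stay 0. Returns the chunk count. */
static size_t unpack5(const uint8_t *buf, size_t start, size_t nbits,
                      uint8_t *out)
{
    assert(nbits % 5 == 0);          /* only whole chunks are supported */
    size_t chunks = nbits / 5;
    for (size_t c = 0; c < chunks; c++) {
        uint8_t v = 0;
        for (size_t b = 0; b < 5; b++)
            v |= (uint8_t)(get_bit(buf, start + c * 5 + b) << b);
        out[c] = v;
    }
    return chunks;
}
```

Calling unpack5(buf, 11, 15, out) fills out[0] with bits 11..15, out[1] with bits 16..20 and out[2] with bits 21..25, so each chunk can afterwards be fetched with a single byte read.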

Related

Confusion about a memory alignment example

While reading some posts to learn about memory alignment, I came across a good answer by dan04 to "What is aligned memory allocation?" and have a question about it.
Reading the example he gives,
 0 1 2 3 4 5 6 7
|a|a|b|b|b|b|c|d| bytes
|       |       | words
The problem is that on some CPU architectures, the instruction to load a 4-byte integer from memory only works on word boundaries. So your program would have to fetch each half of b with separate instructions.
Why can't (Can it?) read the 4 bytes(a word, assume 32bits) directly that contains b?
For example, if I want b:
 0 1 2 3 4 5 6 7
|a|a|b|b|b|b|c|d| bytes
    |       | a word (assume it's 32 bits, get b directly)
Read 1 word starting at address 2.
If I want a:
 0 1 2 3 4 5 6 7
|a|a|b|b|b|b|c|d| bytes
|       | a word
Read 1 word starting at address 0, keep the first 2 bytes and discard the last 2 bytes.
If I want c and d:
 0 1 2 3 4 5 6 7
|a|a|b|b|b|b|c|d| bytes
        |       | a word
Read 1 word starting at address 4, keep the last 2 bytes and discard the first 2 bytes.
Then it seems alignment is not needed, which is definitely incorrect.
I must have misunderstood something or be missing some other knowledge. Please correct me.
"Why can't (Can it?) read the 4 bytes(a word, assume 32bits) directly that contains b?"
The answer is already in the text you quoted right above. The key is "on word boundaries", which is not the same as "in word size": those CPUs can read a word only from an address of exactly N*wordwidth, not from N*wordwidth+2.
A word boundary (only relevant on the platforms mentioned) is an exact multiple of the word width: 0, 4, 8, 12... but not 2, 6, 10...
To pick up your phrasing from the comments: yes.
Those CPUs can only read from address 0, 4, 8, 12, 16 and so on.
E.g. one word from addresses 0-3, one word from address 4-7.
(Note the added 12.)
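To make the boundary rule concrete, here is a small C sketch. It assumes a 4-byte word, and the helper names is_word_aligned and load_u32 are invented for illustration. On CPUs that only load words from word boundaries, a portable way to get b from address 2 is to copy its bytes out with memcpy() instead of dereferencing a misaligned pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* True when `addr` is a multiple of the 4-byte word size (0, 4, 8, 12, ...). */
static int is_word_aligned(uintptr_t addr)
{
    return addr % 4 == 0;
}

/* Read a 32-bit value from any byte offset. memcpy() has no alignment
 * requirement, so `buf + offset` does not need to sit on a word boundary. */
static uint32_t load_u32(const unsigned char *buf, size_t offset)
{
    uint32_t v;
    assert(buf != NULL);
    memcpy(&v, buf + offset, sizeof v);
    return v;
}
```

With the layout from the question, load_u32(mem, 2) returns b even though address 2 is not a word boundary. The compiler is free to turn the copy into two aligned loads or a single load, depending on what the target supports.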

What file size is data if it's 450KB base64 encoded?

Is it possible to compute the size of data if I know its size when it's base64 encoded?
I've a file that is 450KB in size when base64 encoded but what size is it decompressed?
Is there a method to find output size without decompressing the file first?
I've a file that is 450KB in size when base64 encoded but what size is it decompressed?
In fact, you don't "decompress", you decode. The result will be smaller than the encoded data.
As Base 64 encoding needs ~ 8 bits for each 6 bits of the original data (or 4 bytes to store 3), the math is simple:
Encoded Decoded
450KB / 4 * 3 = ~ 337KB
The overhead of Base64 over the decoded data is nearly constant, 33.33%. I say "nearly" only because of the padding bytes at the end (=), which make the string length a multiple of 4. See some examples (Len is the original length in bytes, B64 the number of Base64 characters without padding, Pad the number of = characters):
String              Encoded                    Len   B64   Pad   Space needed
A                   QQ==                         1     2     2   400.00%
AB                  QUI=                         2     3     1   200.00%
ABC                 QUJD                         3     4     0   133.33%
ABCD                QUJDRA==                     4     6     2   200.00%
ABCDEFGHIJKLMNOPQ   QUJDREVGR0hJSktMTU5PUFE=    17    23     1   141.18%
( 300 bytes )       ( 400 bytes )              300   400     0   133.33%
( 500 bytes )       ( 668 bytes )              500   667     1   133.60%
( 5000 bytes )      ( 6668 bytes )            5000  6667     1   133.36%
... tends to 133.33% ...
Calculating the space for unencoded data:
Let's get the value QUJDREVGR0hJSktMTU5PUFE= mentioned above.
1. There are 24 bytes in the encoded value.
2. Let's calculate 24 / 4 * 3 => the result is 18.
3. Let's count the number of =s on the end of encoded value: In this case, 1 (we need to check only the 2 last bytes of encoded data).
4. Getting 18 (obtained on step 2) - 1 (obtained on step 3) we get 17.
So, we need 17 bytes to store the data.
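The same steps can be sketched as a small C function. The name base64_decoded_size is made up here, and the sketch assumes padded Base64 with no line breaks or other whitespace in the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decoded size of a padded Base64 string: len / 4 * 3 minus the number of
 * trailing '=' characters. Assumes no line breaks or other whitespace. */
static size_t base64_decoded_size(const char *encoded)
{
    size_t len = strlen(encoded);
    assert(len % 4 == 0);            /* padded Base64 comes in groups of 4 */
    size_t size = len / 4 * 3;
    /* Only the last two characters can be padding. */
    if (len >= 1 && encoded[len - 1] == '=') size--;
    if (len >= 2 && encoded[len - 2] == '=') size--;
    return size;
}
```

For a file on disk you only need the file length and its last two bytes, so nothing has to be decoded to get the answer.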
base64 adds roughly a third to the original size, so your file should be more or less .75*450kb in size.

VTK Structured Point file

I am trying to parse a VTK file in C by extracting its point data and storing each point in a 3D array. However, the file I am working with has 9 shorts per point and I am having difficulty understanding what each number means.
I believe I understand most of the header information (please correct me if I have misunderstood):
ASCII: Type of file (ASCII or Binary)
DATASET: Type of dataset
DIMENSIONS: dims of voxels (x,y,z)
SPACING: Volume of each voxel (w,h,d)
ORIGIN: Unsure
POINT DATA: Total number of points/voxels (dimx.dimy.dimz)
I have looked at the documentation and I am still not getting an understanding on how to interpret the data. Could someone please help me understand or point me to some helpful resources
# vtk DataFile Version 3.0
vtk output
ASCII
DATASET STRUCTURED_POINTS
DIMENSIONS 256 256 130
SPACING 1 1 1.3
ORIGIN 86.6449 -133.929 116.786
POINT_DATA 8519680
SCALARS scalars short
LOOKUP_TABLE default
0 0 0 0 0 0 0 0 0
0 0 7 2 4 5 3 3 4
4 5 5 1 7 7 1 1 2
1 6 4 3 3 1 0 4 2
2 3 2 4 2 2 0 2 6
...
thanks.
You are correct regarding the meaning of fields in the header.
ORIGIN corresponds to the coordinates of the 0-0-0 corner of the grid.
An example of a DATASET STRUCTURED_POINTS can be found in the documentation.
Starting from this, here is a small file with 6 values per point (declared as char in this example). Each line represents a point.
# vtk DataFile Version 2.0
Volume example
ASCII
DATASET STRUCTURED_POINTS
DIMENSIONS 3 4 2
ASPECT_RATIO 1 1 1
ORIGIN 0 0 0
POINT_DATA 24
SCALARS volume_scalars char 6
LOOKUP_TABLE default
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
0 2 2 3 4 5
1 2 2 3 4 5
2 2 2 3 4 5
0 3 2 8 9 10
1 3 2 8 9 10
2 3 2 8 9 10
0 4 2 8 9 10
1 4 2 8 9 10
2 4 2 8 9 10
0 1 3 18 19 20
1 1 3 18 19 20
2 1 3 18 19 20
0 2 3 18 19 20
1 2 3 18 19 20
2 2 3 18 19 20
0 3 3 24 25 26
1 3 3 24 25 26
2 3 3 24 25 26
0 4 3 24 25 26
1 4 3 24 25 26
2 4 3 24 25 26
The first 3 fields can be printed to understand the data layout: in the file, x changes faster than y, which changes faster than z.
If you wish to store the data in an array int a[2][4][3][6], just read it in a loop:
for (k = 0; k < 2; k++) {             // z loop
    for (j = 0; j < 4; j++) {         // y loop: y changes faster than z
        for (i = 0; i < 3; i++) {     // x loop: x changes faster than y
            for (l = 0; l < 6; l++) {
                fscanf(file, "%d", &a[k][j][i][l]);   // %d requires a to hold int
            }
        }
    }
}
To read the header, fscanf() may be used as well:
int sizex, sizey, sizez;
char headerpart[100];
fscanf(file, "%99s", headerpart);   // the width limit keeps the read inside headerpart
if (strcmp(headerpart, "DIMENSIONS") == 0) {
    fscanf(file, "%d%d%d", &sizex, &sizey, &sizez);
}
Note that fscanf() needs a pointer to the data (&sizex, not sizex). headerpart is an array of char, which decays to a pointer to its first element when passed to a function, so "%s",headerpart works fine and can be replaced by "%s",&headerpart[0]. The function strcmp() (declared in string.h) compares two strings and returns 0 if they are identical.
As your grid seems large, smaller files can be obtained using the BINARY kind instead of ASCII, but watch for endianness as specified here.
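For a grid as large as yours (256 * 256 * 130 = 8519680 points, which matches POINT_DATA), a single heap-allocated block is usually more practical than a fixed multi-dimensional array. The sketch below shows how the x-fastest ordering described above maps to a flat offset. The function name vtk_offset and the ncomp parameter (values per point) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of component `c` of the point (i, j, k) in a STRUCTURED_POINTS
 * data section: x varies fastest, then y, then z. nx and ny are the first
 * two DIMENSIONS values; ncomp is the number of values stored per point. */
static size_t vtk_offset(size_t i, size_t j, size_t k, size_t c,
                         size_t nx, size_t ny, size_t ncomp)
{
    assert(i < nx && j < ny && c < ncomp);
    return ((k * ny + j) * nx + i) * ncomp + c;
}
```

With the small 3 x 4 x 2 example file and 6 values per point, vtk_offset(1, 2, 1, 0, 3, 4, 6) is 114, the first value on the line "1 3 3 24 25 26".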

Matrices - memory

Let's say that I have a matrix A=[];
I want to know if there is any way to represent it in a way where only the filled blocks must occupy memory and remaining must not, e.g.:
A = 1 0 0
0 1 0
0 0 1
Now, every block would take 1 bit of memory to store the matrix,
hence I would like to know is it possible to store matrix as:
A = 1
1
1
and the empty spaces must not occupy any memory at all. Is there any file format to represent a matrix in such a way?
No. You're dealing with bits. It would take MORE memory to store a list of the "filled" bits than it would to simply store the bits, e.g. for a simple 1x8 matrix:
     0 1 2 3 4 5 6 7   <--- bit-wise addresses
m = [0,1,0,0,0,1,1,1]
This could be stored as a SINGLE byte of memory, at a storage ratio of 1 bit per bit.
To store just the locations of the SET bits would take 4 bytes (one byte per location). If all of the bits were set, you'd need 8 bytes to store those locations. So now you've gone from a constant 1 byte requirement to a variable 0 -> 8 bytes.
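A short C sketch of the comparison, with made-up helper names pack_bits and index_list_bytes. It assumes entry i is stored in bit i % 8 of byte i / 8, which is just one possible convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack an array of 0/1 values into bytes, 8 entries per byte.
 * Entry i goes to bit i % 8 of byte i / 8. `out` must hold (n + 7) / 8 bytes. */
static void pack_bits(const int *m, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < (n + 7) / 8; i++)
        out[i] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(m[i] == 0 || m[i] == 1);
        if (m[i])
            out[i / 8] |= (uint8_t)(1u << (i % 8));
    }
}

/* Bytes needed by the alternative: one 1-byte index per set entry. */
static size_t index_list_bytes(const int *m, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (m[i] != 0);
    return count;
}
```

For m = [0,1,0,0,0,1,1,1] the packed form is one byte, while index_list_bytes() reports 4 bytes for the list of set positions.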
You could devise a scheme that stores the positions of the set entries in a list, but that would cost more memory than it saves. So, no.

XOR File Decryption

So I have to decrypt a .txt file that is encrypted with XOR using a repeated, unknown password, and the goal is to discover the message.
Here are the things that I already know because of the professor:
First I need to find the length of the unknown password
The message has been altered and it doesn't have spaces (this may add a bit more difficulty because the space character has the highest frequency in a message)
Any ideas on how to solve this?
thx in advanced :)
First you need to find out the length of the password. You do this with the Index of Coincidence, also known as the Kappa test. XOR the ciphertext with itself shifted 1 step and count the number of characters that are the same (value 0). You get the Kappa value by dividing that count by the total number of characters minus 1. Shift one more step and calculate the Kappa value again. Keep shifting the ciphertext until the password length becomes apparent. If the length is 4 you should see something similar to this:
Offset Hits
-------------------------
1 2.68695%
2 2.36399%
3 3.79009%
4 6.74012%
5 3.6953%
6 1.81582%
7 3.82744%
8 6.03504%
9 3.60273%
10 1.98052%
11 3.83241%
12 6.5627%
As you see the Kappa value is significantly higher on multiples of 4 (4, 8 and 12) than the others. This suggests that the length of the password is 4.
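A minimal C sketch of the Kappa calculation for a single shift, following the description above (hits divided by the total number of characters minus 1). The function name kappa is just for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Kappa value for one shift: count the positions where the ciphertext
 * equals itself moved `shift` bytes along (their XOR is 0), then divide
 * by len - 1 as described above. */
static double kappa(const unsigned char *ct, size_t len, size_t shift)
{
    assert(shift > 0 && shift < len);
    size_t hits = 0;
    for (size_t i = 0; i + shift < len; i++)
        if ((ct[i] ^ ct[i + shift]) == 0)
            hits++;
    return (double)hits / (double)(len - 1);
}
```

Call it for shift = 1, 2, 3, ... and look for the shift whose multiples stand out. For short texts, dividing by len - shift instead (the number of positions actually compared) may give a fairer comparison between shifts.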
Now that you have the password length you should again XOR the cipher text with itself but now you shift by multiples of the length. Why? Since the ciphertext looks like this:
THISISTHEPLAINTEXT <- Plaintext
PASSPASSPASSPASSPA <- Password
------------------
EJKELDOSOSKDOWQLAG <- Ciphertext
When two values which are the same are XOR:ed the result is 0:
EJKELDOSOSKDOWQLAG <- Ciphertext
EJKELDOSOSKDOWQLAG <- Ciphertext shifted 4.
Is in reality:
THISISTHEPLAINTEXT <- Plaintext
PASSPASSPASSPASSPA <- Password
THISISTHEPLAINTEXT <- Plaintext
PASSPASSPASSPASSPA <- Password
Which is:
THISISTHEPLAINTEXT <- Plaintext
THISISTHEPLAINTEXT <- Plaintext
As you see the password "disappears" and the plaintext is XOR:ed with itself.
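The cancellation is easy to check in C. This is only a sketch, and the helper names xor_repeat and xor_shifted are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Repeating-key XOR: out[i] = in[i] ^ key[i % keylen]. Running it twice with
 * the same key restores the input. */
static void xor_repeat(const unsigned char *in, size_t len,
                       const unsigned char *key, size_t keylen,
                       unsigned char *out)
{
    assert(keylen > 0);
    for (size_t i = 0; i < len; i++)
        out[i] = in[i] ^ key[i % keylen];
}

/* XOR a buffer with itself shifted by `shift`; writes len - shift bytes. */
static void xor_shifted(const unsigned char *buf, size_t len, size_t shift,
                        unsigned char *out)
{
    assert(shift <= len);
    for (size_t i = 0; i + shift < len; i++)
        out[i] = buf[i] ^ buf[i + shift];
}
```

If the shift is a multiple of the password length, xor_shifted() applied to the ciphertext gives exactly the same bytes as xor_shifted() applied to the plaintext, because key[i % keylen] and key[(i + shift) % keylen] are the same byte and cancel.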
So what can we do now then? You wrote that the spaces are removed. This makes it a bit harder to get the plaintext or password. But not at all impossible.
The following table shows the XOR value of every pair of uppercase English letters:
   A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
A  0
B  3  0
C  2  1  0
D  5  6  7  0
E  4  7  6  1  0
F  7  4  5  2  3  0
G  6  5  4  3  2  1  0
H  9 10 11 12 13 14 15  0
I  8 11 10 13 12 15 14  1  0
J 11  8  9 14 15 12 13  2  3  0
K 10  9  8 15 14 13 12  3  2  1  0
L 13 14 15  8  9 10 11  4  5  6  7  0
M 12 15 14  9  8 11 10  5  4  7  6  1  0
N 15 12 13 10 11  8  9  6  7  4  5  2  3  0
O 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
P 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  0
Q 16 19 18 21 20 23 22 25 24 27 26 29 28 31 30  1  0
R 19 16 17 22 23 20 21 26 27 24 25 30 31 28 29  2  3  0
S 18 17 16 23 22 21 20 27 26 25 24 31 30 29 28  3  2  1  0
T 21 22 23 16 17 18 19 28 29 30 31 24 25 26 27  4  5  6  7  0
U 20 23 22 17 16 19 18 29 28 31 30 25 24 27 26  5  4  7  6  1  0
V 23 20 21 18 19 16 17 30 31 28 29 26 27 24 25  6  7  4  5  2  3  0
W 22 21 20 19 18 17 16 31 30 29 28 27 26 25 24  7  6  5  4  3  2  1  0
X 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23  8  9 10 11 12 13 14 15  0
Y 24 27 26 29 28 31 30 17 16 19 18 21 20 23 22  9  8 11 10 13 12 15 14  1  0
Z 27 24 25 30 31 28 29 18 19 16 17 22 23 20 21 10 11  8  9 14 15 12 13  2  3  0
What does this mean then? If an A and a B is XOR:ed then the resulting value is 3. E and P will result in 21. Etc. OK but how will this help you?
Remember that the plaintext is XOR:ed with itself shifted by multiples of the password length. For each value you can check the above table and determine which combinations are possible at that position. Let's say the value is 25. Then the two characters that produced it must be one of the following combinations: (I-P), (H-Q), (K-R), (J-S), (M-T), (L-U), (O-V), (N-W), (A-X) or (C-Z). But which one? Now you do more shifts and look up the corresponding values in the table again for each position. Next time the value might be 7, and since you already have a list of possible character combinations you only check against them. At the next two shifts the values are 3 and 1. Now you can determine that the character is W, since that is the only character common to every shift: (N-W), (P-W), (T-W), (V-W). You can do this for most positions.
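Instead of reading the table by hand, the candidate pairs for a value can be generated. The following C sketch assumes the plaintext consists of uppercase ASCII letters only, as in the table, and the function name xor_pairs is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* List every pair of uppercase letters (lo < hi) whose XOR equals `value`.
 * pairs[n][0] and pairs[n][1] receive the letters; returns the pair count.
 * `pairs` needs room for 13 entries, the most that 26 letters can form. */
static size_t xor_pairs(unsigned value, char pairs[][2])
{
    assert(pairs != NULL);
    size_t n = 0;
    for (char lo = 'A'; lo <= 'Z'; lo++)
        for (char hi = (char)(lo + 1); hi <= 'Z'; hi++)
            if (((unsigned)lo ^ (unsigned)hi) == value) {
                pairs[n][0] = lo;
                pairs[n][1] = hi;
                n++;
            }
    return n;
}
```

For the value 25 this yields the ten pairs listed above. Intersecting the letters found at one position over several shifts narrows the candidates down in the same way as the manual lookup.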
You will not get all the plaintext, but you will get enough characters to discover the password. Take the known characters and XOR them with the ciphertext at the matching positions. This will yield the password. You need at least as many known characters as the password has characters, and they must fall on different positions relative to the password.
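A sketch of that last step in C. The function name recover_key and the convention of marking unknown plaintext positions with a 0 byte are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Fill in key bytes from partially known plaintext. `known[i]` is the
 * plaintext byte at position i, or 0 where it is still unknown (a convention
 * made up for this sketch). have[k] is set to 1 for each key position that
 * was recovered. Returns how many key positions were recovered. */
static size_t recover_key(const unsigned char *ct, const unsigned char *known,
                          size_t len, unsigned char *key, int *have,
                          size_t keylen)
{
    assert(keylen > 0);
    size_t found = 0;
    for (size_t k = 0; k < keylen; k++)
        have[k] = 0;
    for (size_t i = 0; i < len; i++) {
        if (known[i] == 0 || have[i % keylen])
            continue;
        key[i % keylen] = ct[i] ^ known[i];   /* c ^ p = k */
        have[i % keylen] = 1;
        found++;
    }
    return found;
}
```

If the return value equals keylen, the whole password is known and the rest of the message can be decrypted with it.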
Good luck!
You should look at cracking a Vigenère cipher, especially at autocorrelation. The latter will help you find the length of the password, and the rest is usually just brute-forcing based on the usual distribution of letters (where the most common one in English is the letter e).
Although spaces are the most common characters and make decryptions like this easy, the other characters also have different frequencies. For example, see this Wikipedia article. If you've got enough encrypted text and the password length isn't too large, it might be enough to find the most common bytes in the encrypted text. They will most likely be the encrypted versions of e, which has the highest frequency in English texts.
This alone won't give you the decrypted text, but it's very likely you can find out the password length and (part of) the password itself with it. For example, let's assume the most frequent encrypted bytes are
w x m z y
with almost the same frequency and there's a significant drop in frequency after the last one. This will tell you two things:
The password length most likely is 5, because statistically, all encrypted e will be equally likely. EDIT: OK, this isn't correct, it will be 5 or above because the password can contain the same character multiple times.
The password will be some permutation of (w x m z y XOR e e e e e) - you can use the byte offsets modulo the password length to get the correct permutation.
EDIT: The same character occurring in the password multiple times makes things a bit harder, but you'll most likely be able to identify those cases because, as I said, encrypted versions of e will cluster around a frequency f. If a character occurs n times in the password, its encrypted e will have a frequency near n*f.
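A variant of this idea, sketched in C under the assumption that the password length is already known: count byte frequencies separately for each offset modulo the password length and XOR the most frequent byte with the letter you expect to be most common. This sidesteps the permutation and repeated-character issues, but it is a guess that only works with enough ciphertext. The function name guess_key is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Guess each key byte from the most frequent ciphertext byte at that key
 * position, assuming it is the encryption of `likely` (for example 'e'). */
static void guess_key(const unsigned char *ct, size_t len, size_t keylen,
                      unsigned char likely, unsigned char *key)
{
    assert(keylen > 0);
    for (size_t k = 0; k < keylen; k++) {
        size_t counts[256] = {0};
        /* Only positions encrypted with key byte k. */
        for (size_t i = k; i < len; i += keylen)
            counts[ct[i]]++;
        unsigned best = 0;
        for (unsigned b = 1; b < 256; b++)
            if (counts[b] > counts[best])
                best = b;
        key[k] = (unsigned char)(best ^ likely);
    }
}
```

Key bytes that come out as non-printable characters are a hint that the assumed letter was wrong for that position, in which case the next most frequent letters are worth trying.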
The most common trigram in English (assuming the language is English) is "the". Place "the" at every possible position in your ciphertext to derive 3 possible characters of the key. Try each possible key fragment at all other possible positions in the ciphertext and see what you get. For example, "qzg" is unlikely to be correct, but "fen" could be. Look at the spacing between possible positions to derive the key length. With a key length and a key fragment you can place a lot more of the key.
As Lars said, look at ways of decrypting Vigenère, which is effectively what you have here.

Resources