I am currently enrolled in a CS107 class which makes the following assumptions:
sizeof(int) == 4
sizeof(short) == 2
sizeof(char) == 1
big endianness
My professor showed the following code:
int arr[5];
((short*)(((char*) (&arr[1])) + 8))[3] = 100;
Here are the 20 bytes representing arr:
|....|....|....|....|....|
My professor states that &arr[1] points here, which I agree with.
|....|....|....|....|....|
     x
I now understand that the (char*) cast makes the pointer arithmetic step by the width of a char (1 byte) instead of the width of an int (4 bytes).
What I don't understand is the + 8, which my professor says points here:
|....|....|....|....|....|
               x
But shouldn't it point here, since it is going forwards 8 times the size of a char (1 byte)?
|....|....|....|....|....|
             x
Let's take it step by step. Your expression can be decomposed like this:
((short*)(((char*) (&arr[1])) + 8))[3]
-----------------------------------------------------
char *base = (char *) &arr[1];
char *base_plus_offset = base + 8;
short *cast_into_short = (short *) base_plus_offset;
cast_into_short[3] = 100;
base_plus_offset points at byte offset 12 within the array. cast_into_short[3] refers to a short value at offset 12 + sizeof(short) * 3, which, in your case, is 18.
The expression sets the two-byte short located 18 bytes after the start of arr to the value 100.
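If you want to sanity-check that arithmetic yourself, here is a minimal sketch (assuming the class's 4-byte int / 2-byte short sizes) that asserts the final address is 18 bytes past the start of arr:

#include <assert.h>
#include <stddef.h>

int main(void) {
    int arr[5];
    char *base = (char *) &arr[1];               /* byte offset 4 */
    char *base_plus_offset = base + 8;           /* byte offset 12 */
    short *cast_into_short = (short *) base_plus_offset;
    /* [3] moves a further 3 * sizeof(short) bytes */
    assert((char *) &cast_into_short[3] - (char *) arr
           == 12 + 3 * (ptrdiff_t) sizeof(short));  /* 18 */
    return 0;
}

If the assert holds, the write lands at byte offsets 18 and 19.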
#include <stdio.h>

int main() {
    int arr[5] = {0};  /* zero-initialize so the untouched bytes print as 0 */

    ((short*)(((char*) (&arr[1])) + 8))[3] = 100;  /* the assignment under discussion */

    char* start = (char*)&arr;
    char* end = (char*)&((short*)(((char*) (&arr[1])) + 8))[3];

    printf("sizeof(int)=%zu\n", sizeof(int));
    printf("sizeof(short)=%zu\n", sizeof(short));
    printf("offset=%td <- THIS IS THE ANSWER\n", (end - start));
    printf("100=%04x (hex)\n", 100);

    for (size_t i = 0; i < 5; ++i) {
        printf("arr[%zu]=%d (%08x hex)\n", i, arr[i], arr[i]);
    }
}
Possible Output:
sizeof(int)=4
sizeof(short)=2
offset=18 <- THIS IS THE ANSWER
100=0064 (hex)
arr[0]=0 (00000000 hex)
arr[1]=0 (00000000 hex)
arr[2]=0 (00000000 hex)
arr[3]=0 (00000000 hex)
arr[4]=6553600 (00640000 hex)
In all your professor's shenanigans he's shifted you 1 integer, 8 chars/bytes, and 3 shorts: 4 + 8 + 6 = 18 bytes. Bingo.
Notice this output reveals the machine I ran this on to have 4-byte integers and 2-byte shorts (common), and to be little-endian, because bytes 18 and 19 of the array were set to 0x64 and 0x00 respectively.
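If you want to check which flavor your own machine is before reasoning about such output, a one-word probe is enough (a minimal sketch; any multi-byte unsigned type works):

#include <stdio.h>

int main(void) {
    unsigned int probe = 1;
    /* On a little-endian machine the least significant byte
       sits at the lowest address. */
    if (*(unsigned char *) &probe == 1)
        printf("little-endian\n");
    else
        printf("big-endian\n");
    return 0;
}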
I find your diagrams dreadfully confusing because it isn't very clear whether you mean the '|' to be addresses or not.
|....|....|....|....|....|
^    ^         ^       ^ ^
A    X         C       S B
Including the bars ('|'): A is the start of arr and B is 'one past the end' (a legal concept in C).
X is the address referred to by the expression &arr[1].
C by the expression (((char*) (&arr[1])) + 8).
S by the whole expression.
S and the byte following it are assigned to; what that means depends on the endianness of your platform.
I leave it as an exercise to determine what the output would be on a similar but big-endian platform. Anyone?
I notice from the comments you're big-endian and I'm little-endian (stop sniggering).
You only need to change one line of the output.
Here's some code that can show you which byte gets modified on your system, along with a breakdown of what is happening:
#include <stdio.h>

int main( int argc, char* argv[] )
{
    int arr[5];
    int i;

    for( i = 0; i < 5; i++ )
        arr[i] = 0;

    printf( "Before: " );
    for( i = 0; i < sizeof(int)*5; i++ )
        printf( "%2.2X ", ((unsigned char*)arr)[i] ); /* unsigned char avoids sign extension */
    printf( "\n" );

    ((short*)(((char*) (&arr[1])) + 8))[3] = 100;

    printf( "After: " );
    for( i = 0; i < sizeof(int)*5; i++ )
        printf( "%2.2X ", ((unsigned char*)arr)[i] );
    printf( "\n" );

    return 0;
}
Start from the innermost expression:
int pointer to (arr + 4)
&arr[1]
|...|...|...|...|...
    Xxxx
char pointer to (arr + 4)
(char*)(&arr[1])
|...|...|...|...|...
    X
char pointer to (arr + 4 + 8)
((char*)(&arr[1])) + 8
|...|...|...|...|...
            X
short pointer to (arr + 4 + 8)
(short*)(((char*)(&arr[1])) + 8)
|...|...|...|...|...
            Xx
short at (arr + 4 + 8 + (3 * 2)) (this is an array index)
((short*)(((char*)(&arr[1])) + 8))[3]
|...|...|...|...|...
                  Xx
Exactly which byte gets modified here depends on the endianness of your system. On my little-endian x86 I get the following output:
Before: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
After: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 64 00
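For comparison, on a big-endian machine the short value 100 (0x0064) is stored most significant byte first, so only the last line of the output changes:

After:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 64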
Good Luck with your course.
Thanks for the replies; everyone was useful in helping me understand how this works.
A friend sent me this piece of C code asking how it worked (he doesn't know either). I don't usually work with C, but this piqued my interest. I spent some time trying to understand what was going on but in the end I couldn't fully figure it out. Here's the code:
void knock_knock(char *s) {
    while (*s++ != '\0')
        printf("Bazinga\n");
}

int main() {
    int data[5] = { -1, -3, 256, -4, 0 };
    knock_knock((char *) data);
    return 0;
}
Initially I thought it was just a fancy way to print the data in the array (yeah, I know :\), but then I was surprised when I saw it didn't print 'Bazinga' 5 times, but 8. I searched stuff up and figured out it was working with pointers (total amateur when it comes to C), but I still couldn't figure out why 8.

I searched a bit more and found out that pointers are usually 8 bytes long in C, and I verified that by printing sizeof(s) before the loop; sure enough it was 8. I thought that was it: it was just iterating over the length of the pointer, so it would make sense that it printed Bazinga 8 times. It also was clear to me now why they'd use Bazinga as the string to print - the data in the array was meant to be just a distraction. So I tried adding more data to the array, and sure enough it kept printing 8 times.

Then I changed the first number of the array, -1, to check whether the data truly was meaningless or not, and this is where I got confused. It didn't print 8 times anymore, but just once. Surely the data in the array wasn't just a decoy, but for the life of me I couldn't figure out what was going on.
Using the following code
#include <stdio.h>

void knock_knock(char *s)
{
    while (*s++ != '\0')
        printf("Bazinga\n");
}

int main()
{
    int data[5] = { -1, -3, 256, -4, 0 };
    printf("%08X - %08X - %08X\n", data[0], data[1], data[2]);
    knock_knock((char *) data);
    return 0;
}
you can see that the hex values of the data array are
FFFFFFFF - FFFFFFFD - 00000100
The function knock_knock prints Bazinga until the pointed-to value is 0x00, due to
while (*s++ != '\0')
But the pointer here points to chars, so it advances a single byte each iteration, and on a little-endian machine the first 0x00 is reached when accessing the lowest-addressed byte of the third value of the array.
You need to look at the bytewise representation of data in the integer array data. Assuming an integer is 4 bytes, the representation below gives the numbers in hex:
-1 --> FF FF FF FF
-3 --> FF FF FF FD
256 --> 00 00 01 00
-4 --> FF FF FF FC
0 --> 00 00 00 00
The array data is these numbers stored in a little-endian format, i.e. the LS byte comes first. So,
data ={FF FF FF FF FD FF FF FF 00 01 00 00 FC FF FF FF 00 00 00 00};
The function knock_knock goes through this data bytewise and prints Bazinga for every non-zero. It stops at the first zero found, which will be after 8 bytes.
(Note: the size of an integer can also be 2 or 8 bytes, but given that your pointer size is 8 bytes, I am guessing that the size of an integer is 4 bytes.)
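To see that count come directly out of the same test knock_knock uses, you can walk the bytes yourself (a minimal sketch, assuming 4-byte int and the little-endian layout above):

#include <stdio.h>

int main(void) {
    int data[5] = { -1, -3, 256, -4, 0 };
    const unsigned char *p = (const unsigned char *) data;
    int count = 0;
    while (p[count] != '\0')   /* the same comparison the loop performs */
        count++;
    printf("%d\n", count);     /* prints 8 on such a machine */
    return 0;
}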
It is easy to understand what occurs here if you output the array in hex as a character array. Here is how to do this:
#include <stdio.h>

int main(void)
{
    int data[] = { -1, -3, 256, -4, 0 };
    const size_t N = sizeof( data ) / sizeof( *data );

    char *p = ( char * )data;

    for ( size_t i = 0; i < N * sizeof( int ); i++ )
    {
        /* p[i] is a (signed) char, so a byte such as 0xFF is
           sign-extended and prints as FFFFFFFF */
        printf( "%0X ", p[i] );
        if ( ( i + 1 ) % sizeof( int ) == 0 ) printf( "\n" );
    }

    return 0;
}
The program output is
FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
FFFFFFFD FFFFFFFF FFFFFFFF FFFFFFFF
0 1 0 0
FFFFFFFC FFFFFFFF FFFFFFFF FFFFFFFF
0 0 0 0
So the string "Bazinga" will be output as many times as there are non-zero bytes before the first zero byte in the representations of the integer numbers in the array. As can be seen, the first two negative numbers do not have zero bytes in their representations.
However, the number 256 in any case (on either endianness) has such a zero byte at the very beginning of its internal representation. So the string will be output exactly eight times, provided that sizeof( int ) is equal to 4.
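If you would rather see each byte as exactly two hex digits, cast to unsigned char before printing (a small variation on the loop above):

printf( "%02X ", ( unsigned char )p[i] );

With that change, -1 prints as FF FF FF FF instead of four sign-extended FFFFFFFF values.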
Consider this example in Java:
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

import com.google.common.io.CharStreams;

public final class Meh
{
    private static final String HELLO = "Hello world";
    private static final Charset UTF32 = Charset.forName("UTF-32");

    public static void main(final String... args)
        throws IOException
    {
        final Path tmpfile = Files.createTempFile("test", "txt");

        try (
            final Writer writer = Files.newBufferedWriter(tmpfile, UTF32);
        ) {
            writer.write(HELLO);
        }

        final String readBackFromFile;

        try (
            final Reader reader = Files.newBufferedReader(tmpfile, UTF32);
        ) {
            readBackFromFile = CharStreams.toString(reader);
        }

        Files.delete(tmpfile);

        System.out.println(HELLO.equals(readBackFromFile));
    }
}
This program prints true. Now, some notes:
a Charset in Java is a class wrapping a character encoding, both ways: you can get a CharsetDecoder to decode a stream of bytes into a stream of characters, or a CharsetEncoder to encode a stream of characters into a stream of bytes;
this is why Java has char vs byte;
for historical reasons, however, a char is only a 16-bit unsigned number: when Java was born, Unicode did not define code points outside of what is now known as the BMP (Basic Multilingual Plane; that is, any code point in the range U+0000-U+FFFF, inclusive).
With all this out of the way, the code above performs the following:
given some "text", represented here as a String, it first applies a transformation of this text into a byte sequence before writing it to a file;
then it reads back that file: it is only a sequence of bytes, but then it applies the reverse transformation to find back the "original text" stored in it;
note that CharStreams.toString() is not in the standard JDK; this is a class from Guava.
Now, as to C... My question is as follows:
discussing the matter in the C chat room, I have learned that the C11 standard, with <uchar.h>, has what seems to be an appropriate type to store a Unicode code point, regardless of the encoding;
however, there doesn't seem to be the equivalent of Java's Charset; another comment on the chat room is that with C you're SOL but that C++ has codecvt...
And yes, I'm aware that UTF-32 is endianness-dependent; with Java, that is BE by default.
But basically: how would I program the above in C? Let's say I want to program the writing side or reading side in C, how would I do it?
In C, you'd typically use a library like libiconv, libunistring, or ICU.
If you only want to process UTF-32, you can directly write and read an array of 32-bit integers containing the Unicode code points, either in little or big endian. Unlike UTF-8 or UTF-16, a UTF-32 string doesn't need any special encoding and decoding. You can use any 32-bit integer type. I'd prefer C99's uint32_t over C11's char32_t. For example:
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    // Could also contain non-ASCII code points.
    static const uint32_t hello[] = {
        'H', 'e', 'l', 'l', 'o', ' ',
        'w', 'o', 'r', 'l', 'd'
    };
    static size_t num_chars = sizeof(hello) / sizeof(uint32_t);

    const char *path = "test.txt";

    FILE *outstream = fopen(path, "wb");
    // Write big endian 32-bit integers
    for (size_t i = 0; i < num_chars; i++) {
        uint32_t code_point = hello[i];
        for (int j = 0; j < 4; j++) {
            int c = (code_point >> ((3 - j) * 8)) & 0xFF;
            fputc(c, outstream);
        }
    }
    fclose(outstream);

    FILE *instream = fopen(path, "rb");
    // Get file size.
    fseek(instream, 0, SEEK_END);
    long file_size = ftell(instream);
    rewind(instream);

    if (file_size % 4) {
        fprintf(stderr, "File contains partial UTF-32");
        exit(1);
    }
    if (file_size > SIZE_MAX) {
        fprintf(stderr, "File too large");
        exit(1);
    }

    size_t num_chars_in = file_size / sizeof(uint32_t);
    uint32_t *read_back = malloc(file_size);
    // Read big endian 32-bit integers
    for (size_t i = 0; i < num_chars_in; i++) {
        uint32_t code_point = 0;
        for (int j = 0; j < 4; j++) {
            int c = fgetc(instream);
            // Cast before shifting so the top byte isn't shifted
            // into the sign bit of an int.
            code_point |= (uint32_t)c << ((3 - j) * 8);
        }
        read_back[i] = code_point;
    }
    fclose(instream);

    bool equal = num_chars == num_chars_in
        && memcmp(hello, read_back, file_size) == 0;
    printf("%s\n", equal ? "true" : "false");
    free(read_back);
    return 0;
}
(Most error checks omitted for brevity.)
Compiling and running this program:
$ gcc -std=c99 -Wall so.c -o so
$ ./so
true
$ hexdump -C test.txt
00000000 00 00 00 48 00 00 00 65 00 00 00 6c 00 00 00 6c |...H...e...l...l|
00000010 00 00 00 6f 00 00 00 20 00 00 00 77 00 00 00 6f |...o... ...w...o|
00000020 00 00 00 72 00 00 00 6c 00 00 00 64 |...r...l...d|
0000002c
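If you need the output to interoperate with tools that expect a byte-order mark, you can write U+FEFF first using the same big-endian scheme (a sketch; the helper name write_be32 is mine, not part of any API):

#include <stdio.h>
#include <stdint.h>

/* Write one code point as four big-endian bytes. */
static void write_be32(FILE *out, uint32_t code_point) {
    for (int j = 0; j < 4; j++)
        fputc((int)((code_point >> ((3 - j) * 8)) & 0xFF), out);
}

Calling write_be32(outstream, 0xFEFF) before the writing loop above would emit the UTF-32BE BOM bytes 00 00 FE FF.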
The printf() function prints leading ffffff (technically I understand that the most significant bit carries the sign, so it gets carried all the way to where the data starts). But how do I get rid of them, and why is it happening?
int mem_display(Cmd *cp, char *arguments)
{
    int i;
    char *adr;

    if (!sscanf(arguments, "%x", &adr))
    {
        return 0;
    }
    printf("%#0.8s ", arguments);
    for (i = 0; i < 16; i++) {
        printf("%02.x ", (unsigned int)*(adr+i));
    }
    ...
the output:
% UNIX> md 10668
/* calling the function to show memory location 0x10668 */
OUT:
10668 ffffffbc 10 20 ffffffe0 ffffffa0 40 ffffffa2 ffffffa0 44 ffffff9c 23 ffffffa0 20
solved:
printf("%0.2x ",(unsigned int)*(adr+i));
output:
UNIX> md 10000
10000 7f 45 4c 46 01 02 01 00 00 00 00 00 00 00 00 00 .ELF............
Cast to unsigned char to have the system consider *(adr+i) as unsigned, so that no sign extension will be done.
for (i = 0; i < 16; i++) {
    printf("%02.x ", (unsigned char)*(adr+i));
}
Please read the printf() manual; this is the correct way:
printf("%02x ", (unsigned char) adr[i]);
Note: avoid the *(adr + i) notation, especially when you are casting it. It's not a bad way to dereference a pointer; it's just not appropriate in this particular situation.
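If your C library supports the C99 hh length modifier, you can also let printf do the narrowing itself (an alternative sketch):

printf("%02hhx ", adr[i]);  /* hh converts the argument to unsigned char */

Both forms print each byte as exactly two hex digits.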
I am trying to convert a char array into an integer, and then I have to increment that integer (both in little and big endian).
Example:
char ary[6] = { 01, 02, 03, 04, 05, 06 };
long int b=0; // 64 bits
This char array will be stored in memory as:
address   0  1  2  3  4  5
value    01 02 03 04 05 06   (big endian)
Edit: value 01 02 03 04 05 06 (little endian)
memcpy(&b, ary, 6); // will do the copy big-endian, L->R
This is how it can be stored in memory:
01 02 03 04 05 06 00 00 // big endian: increment at the MS byte
01 02 03 04 05 06 00 00 // little endian: increment at the LS byte
So if we increment the 64-bit integer, the expected value is 01 02 03 04 05 07. But endianness is a big problem here, since if we directly increment the value of the integer, it will produce a wrong number. For big endian we need to shift the value in b, then do the increment on it.
For little endian we CAN'T increment directly. (Edit: reverse, then increment.)
Can we do the copy with respect to endianness, so we don't need to worry about shift operations at all?
Any other solution for incrementing the char array values after copying them into an integer?
Is there any API in the Linux kernel to copy with respect to endianness?
Unless you want the byte array to represent a larger integer, which doesn't seem to be the case here, endianness does not matter. Endianness only applies to integer values of 16 bits or larger. If the character array is an array of 8-bit integers, endianness does not apply. So all your assumptions are incorrect; the char array will always be stored as
address   0  1  2  3  4  5
value    01 02 03 04 05 06
no matter the endianness.
However, if you memcpy the array into a uint64_t, endianess does apply. For a big endian machine, simply memcpy() and you'll get everything in the expected format. For little endian, you'll have to copy the array in reverse, for example:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main (void)
{
    uint8_t array[6] = {1,2,3,4,5,6};
    uint64_t x = 0;

    /* Walk the 6 array bytes (not all 8 bytes of the uint64_t,
       which would read past the end of the array). */
    for(size_t i = 0; i < sizeof(array); i++)
    {
        const uint8_t bit_shifts = ( sizeof(array) - 1 - i ) * 8;
        x |= (uint64_t)array[i] << bit_shifts;
    }

    printf("%.16" PRIX64 "\n", x);

    return 0;
}
This prints 0000010203040506, so incrementing x afterwards gives ...0507, as desired.
You need to read up on the documentation. This page lists the following:
__u64 le64_to_cpup(const __le64 *);
__le64 cpu_to_le64p(const __u64 *);
__u64 be64_to_cpup(const __be64 *);
__be64 cpu_to_be64p(const __u64 *);
I believe they are sufficient to do what you want to do. Convert the number to CPU format, increment it, then convert back.
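As a sketch of that flow (this is kernel-side code, so it won't build as an ordinary userspace program; the helper name increment_be64 is mine):

#include <asm/byteorder.h>   /* kernel byte-order helpers */
#include <linux/types.h>     /* __u64, __be64 */

/* buf points at 8 bytes holding a big-endian 64-bit value. */
static void increment_be64(__be64 *buf)
{
        __u64 host = be64_to_cpup(buf);   /* big-endian -> CPU byte order */

        host++;                           /* do the arithmetic in CPU order */

        *buf = cpu_to_be64(host);         /* CPU byte order -> big-endian */
}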
I find myself writing a simple program to extract data from a bmp file. I just got started and I am at one of those WTF moments.
When I run the program and supply this image: http://www.hack4fun.org/h4f/sites/default/files/bindump/lena.bmp
I get the output:
type: 19778
size: 12
res1: 0
res2: 54
offset: 2621440
The actual image size is 786,486 bytes. Why is my code reporting 12 bytes?
The header format specified in http://en.wikipedia.org/wiki/BMP_file_format matches my BMP_FILE_HEADER structure. So why is it getting filled with wrong information?
The image file doesn't appear to be corrupt and other images are giving equally wrong outputs. What am I missing?
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    unsigned short type;
    unsigned int size;
    unsigned short res1;
    unsigned short res2;
    unsigned int offset;
} BMP_FILE_HEADER;

int main (int args, char ** argv) {
    char *file_name = argv[1];
    FILE *fp = fopen(file_name, "rb");

    BMP_FILE_HEADER file_header;
    fread(&file_header, sizeof(BMP_FILE_HEADER), 1, fp);

    if (file_header.type != 'MB') {
        printf("ERROR: not a .bmp");
        return 1;
    }

    printf("type: %i\nsize: %i\nres1: %i\nres2: %i\noffset: %i\n", file_header.type, file_header.size, file_header.res1, file_header.res2, file_header.offset);

    fclose(fp);
    return 0;
}
Here the header in hex:
0000000 42 4d 36 00 0c 00 00 00 00 00 36 00 00 00 28 00
0000020 00 00 00 02 00 00 00 02 00 00 01 00 18 00 00 00
The length field is the bytes 36 00 0c 00, which are in Intel (little-endian) order; handled as a 32-bit value, it is 0x000c0036, or decimal 786,486 (which matches the saved file size).
Probably your C compiler is aligning each field to a 32-bit boundary. Enable a pack structure option, pragma, or directive.
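Alternatively, you can sidestep both struct padding and endianness by reading the header fields byte by byte; BMP header fields are stored little-endian. A minimal sketch (the helper names read_le16/read_le32 are mine; error handling omitted):

#include <stdio.h>

/* Assemble little-endian values byte by byte, independent of
   host endianness and struct layout. */
static unsigned read_le16(FILE *fp) {
    unsigned b0 = (unsigned) fgetc(fp);
    unsigned b1 = (unsigned) fgetc(fp);
    return b0 | (b1 << 8);
}

static unsigned long read_le32(FILE *fp) {
    unsigned long lo = read_le16(fp);
    unsigned long hi = read_le16(fp);
    return lo | (hi << 16);
}

With these, read_le16(fp) yields the type field and read_le32(fp) the size field, regardless of compiler padding.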
There are two mistakes I could find in your code.
First mistake: you have to pack the structure to 1 so that every field sits exactly where it is meant to, and the compiler doesn't insert padding for alignment. In your code, the compiler inserts 2 bytes of padding after the first short so that the following int lands on a 4-byte boundary, which shifts every later field. The trick is to use a compiler directive that packs the nearest struct:
#pragma pack(1)
typedef struct {
    unsigned short type;
    unsigned int size;
    unsigned short res1;
    unsigned short res2;
    unsigned int offset;
} BMP_FILE_HEADER;
Now it should be aligned properly.
The other mistake is in here:
if (file_header.type != 'MB')
You are trying to compare a short type, which is 2 bytes, against a character constant. 'MB' is a multi-character constant whose value is implementation-defined; the compiler is probably giving you a warning about it, since canonically single quotes contain just one character with a 1-byte size.
To work around this, you can split the 2 bytes into two 1-byte characters, which are known ('M' and 'B'), and put them together into a word. For example:
if (file_header.type != (('M' << 8) | 'B'))
If you look at this expression, this is what happens:
'M' (which is 0x4D in ASCII) shifted 8 bits to the left results in 0x4D00; now you can just OR the next character into the low zero byte: 0x4D00 | 0x42 = 0x4D42 (where 0x42 is 'B' in ASCII). Thinking like this, you could just write:
if (file_header.type != 0x4D42)
Then your code should work.
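Putting both fixes together, here is a compact sketch of the corrected reader (assuming a compiler that honors #pragma pack, as above):

#include <stdio.h>

#pragma pack(1)
typedef struct {
    unsigned short type;
    unsigned int size;
    unsigned short res1;
    unsigned short res2;
    unsigned int offset;
} BMP_FILE_HEADER;
#pragma pack()

int main (int argc, char ** argv) {
    if (argc < 2) return 1;
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) return 1;

    BMP_FILE_HEADER file_header;
    if (fread(&file_header, sizeof(BMP_FILE_HEADER), 1, fp) != 1) {
        fclose(fp);
        return 1;
    }

    /* 0x4D42 is "BM" read as a little-endian 16-bit value. */
    if (file_header.type != 0x4D42) {
        printf("ERROR: not a .bmp");
        fclose(fp);
        return 1;
    }

    printf("size: %u\noffset: %u\n", file_header.size, file_header.offset);
    fclose(fp);
    return 0;
}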