Let's suppose I have a function:
int f1(int x){
// some more or less complicated operations on x
return x;
}
And that I have another function
int f2(int x){
// we simply return x
return x;
}
I would like to be able to do something like the following:
char* _f1 = (char*)f1;
char* _f2 = (char*)f2;
int i;
for (i=0; i<FUN_LENGTH; ++i){
_f1[i] = _f2[i];
}
I.e. I would like to interpret f1 and f2 as raw byte arrays and "overwrite f1 byte by byte" and thus, replace it by f2.
I know that callable code is usually write-protected; however, in my particular situation the memory where f1 is located can simply be overwritten. That is, I can copy the bytes over onto f1, but afterwards, if I call f1, the whole thing crashes.
So, is my approach possible in principle? Or are there some machine/implementation/whatsoever-dependent issues I have to take into consideration?
It would be easier to replace the first few bytes of f1 with a machine jump instruction to the beginning of f2. That way, you won't have to deal with any possible code relocation issues.
Also, the information about how many bytes a function occupies (FUN_LENGTH in your question) is normally not available at runtime. Using a jump would avoid that problem too.
For x86, the relative jump instruction opcode you need is E9 (per the x86 opcode reference). This is a 32-bit relative jump, which means you need to calculate the relative offset between f2 and f1. This code might do it:
/* intptr_t is from <stdint.h>; a plain int would truncate pointers on 64-bit */
intptr_t offset = (intptr_t)f2 - ((intptr_t)f1 + 5); /* 5 bytes for the size of the JMP instruction */
unsigned char *pf1 = (unsigned char *)f1;
pf1[0] = 0xe9;
pf1[1] = offset & 0xff;
pf1[2] = (offset >> 8) & 0xff;
pf1[3] = (offset >> 16) & 0xff;
pf1[4] = (offset >> 24) & 0xff;
The offset is taken from the end of the JMP instruction, so that's why there is 5 added to the address of f1 in the offset calculation.
It's a good idea to step through the result with an assembly level debugger to make sure you're poking the correct bytes. Of course, this is all not standards compliant so if it breaks you get to keep both pieces.
Your approach is undefined behavior as far as the C standard is concerned.
And on many operating systems (e.g. Linux), your example will crash: the function code is inside the read-only .text segment (and section) of the ELF executable, and that segment is (sort of) mmap-ed read-only by execve (or by dlopen or by the dynamic linker), so you cannot write inside it.
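On such systems you can sometimes lift that protection explicitly. A minimal, non-portable sketch using POSIX mprotect (the helper name is made up, it is still undefined behavior as far as the C standard is concerned, and many systems refuse writable+executable mappings outright):
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* make the page(s) holding 'code'..'code+len' readable, writable and executable */
static int unprotect(void *code, size_t len)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)code & ~((uintptr_t)pagesize - 1);
    size_t span = ((uintptr_t)code + len) - start;
    return mprotect((void *)start, span, PROT_READ | PROT_WRITE | PROT_EXEC);
}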
Instead of trying to overwrite the function (which you've already found is fragile at best), I'd consider using a pointer to a function:
int complex_implementation(int x) {
// do complex stuff with x
return x;
}
int simple_implementation(int x) {
return x;
}
int (*f1)(int) = complex_implementation;
You'd use this something like:
for (int i=0; i<limit; i++) {
a = f1(a);
if (whatever_condition)
f1 = simple_implementation;
}
...and after the assignment, calling f1 would just return the input value.
Calling a function via a pointer does impose some overhead, but (thanks to that being common in OO languages) most compilers and CPUs do a pretty good job of minimizing that overhead.
Most memory architectures will stop you from writing over the function code; it will crash. On some embedded devices you can do this kind of thing, but it is dangerous unless you know there is enough space, that the call sequence will still work, that the stack will stay intact, and so on.
Most likely there is a WAY better way to solve the problem.
Related
I am starting to figure out some basic ideas about memory reading and writing (let's assume that the data we read or write have not been cached yet).
For the following code:
int a = 1;
It's definitely a write, since we write the value '1' to the memory location of variable 'a'.
But for the following code:
int a, b;
a = 1;
b = a;
When we execute the statement "b = a;", do we actually perform one read and one write?
To my understanding, it's one read and one write, since we have to load the value of 'a' first and then write that value to 'b'.
Not sure if my understandings are correct. Please help me to clarify these basic ideas.
Many thanks for the help.
Let's assume that the data we read or write have not been cached yet
I don't see how cache is pertinent to this.
When we execute the statement "b = a;", do we actually perform one read and one write?
Correct.
However, C is not assembly language: C statements don't map 1-to-1 to machine instructions. There is the as-if rule: basically, the compiler can generate whatever machine code it likes as long as the observable behavior of the program is preserved.
For instance:
int foo(void)
{
int a = 24;
int b = 11;
int c = a + b;
return c;
}
A C compiler is free to compile the above to:
foo():
mov eax, 35
ret
And compilers actually do this (with optimizations enabled). As you can see, there is no memory read or write, just a write to the eax register (where the return value of the function must be put), and the value is an immediate (35).
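If you want the read and write to actually happen, marking the variables volatile forbids the compiler from optimizing the accesses away. A minimal sketch (the generated code is, of course, still compiler- and target-specific):
volatile int a, b;   /* volatile: every access must really touch memory */

void copy(void)
{
    a = 1;   /* one store to a */
    b = a;   /* one load from a, then one store to b */
}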
I am using gcc version 4.7.2 on Ubuntu 12.10 x86_64.
First of all these are the sizes of data types on my terminal:
sizeof(char) = 1
sizeof(short) = 2
sizeof(int) = 4
sizeof(long) = 8
sizeof(long long) = 8
sizeof(float) = 4
sizeof(double) = 8
sizeof(long double) = 16
Now please have a look at this code snippet:
#include <stdio.h>

int main(void)
{
char c = 'a';
printf("&c = %p\n", &c);
return 0;
}
If I am not wrong, we can't predict anything about the address of c. But every run of this program prints some random hex address ending in f, so the next available location would be some hex value ending in 0.
I observed this pattern for other data types too. For an int the address was some hex value ending in c, for a double it was some random hex value ending in 8, and so on.
So I have 2 questions here.
1) Who governs this kind of memory allocation? Is it gcc or the C standard?
2) Whoever it is, why is it done this way? Why is the variable stored such that the next available memory location ends in 0? Any specific benefit?
Now please have a look at this code snippet:
#include <stdio.h>

int main(void)
{
double a = 10.2;
int b = 20;
char c = 30;
short d = 40;
printf("&a = %p\n", &a);
printf("&b = %p\n", &b);
printf("&c = %p\n", &c);
printf("&d = %p\n", &d);
return 0;
}
Now what I observed here was completely new to me. I thought the variables would be stored in the same order they are declared. But no! That's not the case. Here is the sample output of one random run:
&a = 0x7fff8686a698
&b = 0x7fff8686a694
&c = 0x7fff8686a691
&d = 0x7fff8686a692
It seems that the variables get sorted in increasing order of their sizes and are then stored in that order, while still following observation 1, i.e. the last (largest) variable is stored in such a way that the next available memory location is a hex value ending in 0.
Here are my questions:
3) Who is behind this? Is it gcc or the C standard?
4) Why spend time sorting the variables before allocating the memory instead of allocating it on a 'first come, first served' basis? Any specific benefit to sorting first and then allocating?
Now please have a look at this code snippet:
#include <stdio.h>

int main(void)
{
char array1[] = {1, 2};
int array2[] = {1, 2, 3};
printf("&array1[0] = %p\n", &array1[0]);
printf("&array1[1] = %p\n\n", &array1[1]);
printf("&array2[0] = %p\n", &array2[0]);
printf("&array2[1] = %p\n", &array2[1]);
printf("&array2[2] = %p\n", &array2[2]);
return 0;
}
Now this is also surprising to me. What I observed is that an array is always stored at some random hex value ending in '0' if it has 2 or more elements, while an array with fewer than 2 elements gets a memory location following observation 1.
So here are my questions:
5) Who is behind storing an array at some random hex value ending in 0? Is it gcc or the C standard?
6) And why waste the memory? I mean, array2 could have been stored immediately after array1 (and hence at a memory location ending in 2). Instead, array2 is stored at the next hex value ending in 0, leaving 14 memory locations unused in between. Any specific benefits?
The address at which the stack and the heap start is given to the process by the operating system. Everything else is decided by the compiler, using offsets that are known at compile time. Some of these things may follow an existing convention followed in your target architecture and some of these do not.
The C standard does not mandate anything regarding the order of the local variables inside the stack frame (as pointed out in a comment, it doesn't even mandate the use of a stack at all). The standard only bothers to define order when it comes to structs and, even then, it does not define specific offsets, only the fact that these offsets must be in increasing order. Usually, compilers try to align the variables in such a way that access to them takes as few CPU instructions as possible - and the standard permits that, without mandating it.
Part of the reasons are mandated by the application binary interface (ABI) specifications for your system & processor.
See the x86 calling conventions and the SVR4 x86-64 ABI supplement (the latest original is surprisingly hard to find on the Web).
Within a given call frame, the compiler could place variables in arbitrary stack slots. It may try (when optimizing) to reorganize the stack at will, e.g. by decreasing alignment constraints. You should not worry about that.
A compiler tries to put local variables at stack locations with suitable alignment. See the alignof extension of GCC. Where exactly the compiler puts these variables is not important; see my answer here. (If it is important to your code, you really should pack the variables in a single local struct, since each compiler, version, and set of optimization flags could do different things, so don't depend on the precise behavior of your particular compiler.)
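A minimal sketch of that last suggestion, rewriting the second snippet from the question (member names reused from the question; padding for alignment is still inserted, but the declaration order of the members is now guaranteed by the standard):
#include <stdio.h>

int main(void)
{
    struct {
        double a;   /* largest/most strictly aligned member first minimizes padding */
        int    b;
        short  d;
        char   c;
    } locals = { 10.2, 20, 40, 30 };

    printf("&a = %p\n", (void *)&locals.a);
    printf("&b = %p\n", (void *)&locals.b);
    printf("&d = %p\n", (void *)&locals.d);
    printf("&c = %p\n", (void *)&locals.c);
    return 0;
}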
I would like to make a self contained C function that prints a string. This would be part of an operating system, so I can't use stdio.h. How would I make a function that prints the string I pass to it without using stdio.h? Would I have to write it in assembly?
Assuming you're doing this on an x86 PC, you'll need to read/write directly to the text-mode video memory located at address 0xB8000. For color monitors you need to specify an ASCII character byte and an attribute byte, which can indicate color. It is common to use macros when accessing this memory:
#define VIDEO_BASE_ADDR 0xB8000
#define VIDEO_ADDR(x,y) (unsigned short *)(VIDEO_BASE_ADDR + 2 * ((y) * SCREEN_X_SIZE + (x)))
Then, you write your own IO routines around it. Below is a simple function I used to write from a screen buffer. I used this to help implement a crude scrolling ability.
/* Write character+attribute 'c' at window position (x, y), both to video
   memory and to a scroll-back buffer. win_offset, screen_buffer and
   BUFFER_ROWS belong to the surrounding scrolling code. */
void c_write_window(unsigned int x, unsigned int y, unsigned short c)
{
    if ((win_offset + y) >= BUFFER_ROWS) {   /* wrap around the buffer end */
        int overlap = ((win_offset + y) - BUFFER_ROWS);
        *VIDEO_ADDR(x,y) = screen_buffer[overlap][x] = c;
    } else {
        *VIDEO_ADDR(x,y) = screen_buffer[win_offset + y][x] = c;
    }
}
To learn more about this, and other osdev topics, see http://wiki.osdev.org/Printing_To_Screen
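For the string-printing function the question asks about, a minimal sketch built on the same macro might look like this (c_print_string is a made-up helper, 0x07 is the light-grey-on-black attribute byte, and SCREEN_X_SIZE is whatever your text mode's width is, typically 80):
void c_print_string(unsigned int x, unsigned int y, const char *s)
{
    while (*s) {
        *VIDEO_ADDR(x, y) = (unsigned short)((0x07 << 8) | (unsigned char)*s);
        s++;
        if (++x >= SCREEN_X_SIZE) {   /* naive wrap to the start of the next row */
            x = 0;
            y++;
        }
    }
}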
You will probably want to look at, or possibly just use, the source to the stdio functions in the FreeBSD C library, which is BSD-licensed.
To actually produce output, you'll need at least some function that can write characters to your output device. To do this, the stdio routines end up calling write, which performs a syscall into the kernel.
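On a hosted POSIX system that lowest layer is roughly the following (a minimal sketch; in your own kernel you would substitute whatever routine drives your output device for write):
#include <string.h>
#include <unistd.h>

/* puts-like helper built directly on the write(2) primitive */
void print(const char *s)
{
    write(STDOUT_FILENO, s, strlen(s));
}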
I have an array of ints and I'd like to set all values in the array to 'x' every time a function is called.
I've looked at memset but that would only work for an array of bytes I think.
I could do the obvious for loop, but I'm guessing there is a standard lib function out there that will accomplish this better. Anyone know?
Just loop it, pretty much. Or memset to 0, if you know the value is zero (similar for other values for which you have knowledge of the bit representation). There won't be a standard lib solution, since the standard lib can't know of particular user types.
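For reference, a minimal sketch of the plain loop (fill_int is a made-up helper name; modern compilers typically vectorize this at -O2 and above, so it is rarely worth hand-optimizing):
#include <stddef.h>

/* set every element of 'arr' to 'x' */
static void fill_int(int *arr, size_t n, int x)
{
    for (size_t i = 0; i < n; i++)
        arr[i] = x;
}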
If you're on an x86 system, you can use some assembly for this. For instance, in gcc:
void *p = array;               /* EDI: destination pointer */
size_t n = count;              /* ECX: number of bytes to fill */
__asm__ volatile ("rep stosb"
    : "+D"(p), "+c"(n)         /* EDI and ECX are read and updated */
    : "a"(x)                   /* AL: the fill byte */
    : "memory");
That should do the trick.
rep stosb takes the value in AL and stores it into consecutive bytes starting at ES:EDI; the number of bytes is specified in ECX. Note that it fills bytes, so it only sets every int to x when x's byte pattern is uniform (0, -1, and the like); for an arbitrary int value you would put it in EAX and use rep stosl instead.
As an aside, on recent processors Intel has made many efforts to improve the performance of MOVSB and STOSB, so this is a reasonable way to go about it.
In addition to memset and looping (which are both O(n) time), it can actually be done in O(1) - but at the cost of triple the amount of memory, and more expensive lookups later on.
This article describes how it can be done.
The idea is to maintain an additional stack (logically; implemented as an array plus an index to its top) and an additional array. The additional array records at which point each element was first written (a number from 0 to n), and the stack records which elements have already been written.
When you access array[i], it holds a real value only if additionalArray[i] < top && stack[additionalArray[i]] == i. Otherwise it is treated as still holding the "initialized" value.
When doing array[i] = x on an element that has not been written yet (checked as above), you set additionalArray[i] = top and stack[top] = i, and then increase top.
This gives O(1) initialization, but as said it requires additional memory and makes each access more expensive.
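A minimal sketch of that scheme in C (the names lazy_array, idx, and so on are made up for illustration; allocating and freeing the three arrays is left out):
#include <stddef.h>

typedef struct {
    int    *value;   /* the payload array */
    size_t *idx;     /* idx[i]: position in 'stack' that claims slot i */
    size_t *stack;   /* stack[k]: which slot was the k-th one written */
    size_t  top;     /* number of slots written so far */
    int     init;    /* the implicit initial value of every slot */
} lazy_array;

static int lazy_is_set(const lazy_array *a, size_t i)
{
    return a->idx[i] < a->top && a->stack[a->idx[i]] == i;
}

static int lazy_get(const lazy_array *a, size_t i)
{
    return lazy_is_set(a, i) ? a->value[i] : a->init;
}

static void lazy_set(lazy_array *a, size_t i, int x)
{
    if (!lazy_is_set(a, i)) {        /* first write: register the slot */
        a->idx[i] = a->top;
        a->stack[a->top++] = i;
    }
    a->value[i] = x;
}
Re-initializing the whole array to a new value is then just setting init and resetting top to 0.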
The logic below may help you:
...
int a[100] = {0};
int b = 5;
memset_ex(a, sizeof(a), &b, sizeof(int));
...
void memset_ex(void *buf, size_t buf_size, const void *value, size_t size_of_type)
{
    size_t i;   /* buf_size is in bytes; memcpy comes from <string.h> */
    for (i = 0; i + size_of_type <= buf_size; i += size_of_type)
    {
        memcpy((char *)buf + i, value, size_of_type);
    }
}
I was wondering if there is a really good (performant) solution for converting a whole file to lower case in C.
I use fgetc, convert the char to lower case, and write it into another temp file with fputc. At the end I remove the original and rename the temp file to the original's name. But I think there must be a better solution for it.
This doesn't really answer the question (community wiki), but here's an (over?)-optimized function to convert text to lowercase:
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
int fast_lowercase(FILE *in, FILE *out)
{
char buffer[65536];
size_t readlen, wrotelen;
char *p, *e;
char conversion_table[256];
int i;
for (i = 0; i < 256; i++)
conversion_table[i] = tolower(i);
for (;;) {
readlen = fread(buffer, 1, sizeof(buffer), in);
if (readlen == 0) {
if (ferror(in))
return 1;
assert(feof(in));
return 0;
}
for (p = buffer, e = buffer + readlen; p < e; p++)
*p = conversion_table[(unsigned char) *p];
wrotelen = fwrite(buffer, 1, readlen, out);
if (wrotelen != readlen)
return 1;
}
}
This isn't Unicode-aware, of course.
I benchmarked this on an Intel Core 2 T5500 (1.66GHz), using GCC 4.6.0 and i686 (32-bit) Linux. Some interesting observations:
It's about 75% as fast when buffer is allocated with malloc rather than on the stack.
It's about 65% as fast using a conditional rather than a conversion table.
I'd say you've hit the nail on the head. Using a temp file means that you don't delete the original until you're sure you're done processing it, so if anything goes wrong the original remains. I'd say that's the correct way of doing it.
As suggested by another answer, if the file size permits you can memory-map the file via the mmap function and have it readily available in memory. (There is no real performance difference if the file is smaller than a page, since it is probably going to be read into memory once you do the first read anyway.)
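A minimal sketch of that approach, assuming POSIX and converting the file in place (error handling kept to a bare minimum):
#include <ctype.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int lowercase_inplace(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return -1; }

    /* map the whole file read/write; changes are written back to the file */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    for (off_t i = 0; i < st.st_size; i++)
        p[i] = (unsigned char)tolower(p[i]);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}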
You can usually get a little faster on big inputs by using fread and fwrite to read and write big chunks of the input/output. Also, you should probably read a bigger chunk (the whole file, if possible) into memory and then write it all out at once.
edit: I just remembered one more thing. Sometimes programs can be faster if you select a prime number (at the very least, not a power of 2) as the buffer size. I seem to recall this has to do with specifics of the caching mechanism.
If you're processing big files (big as in, say, multi-megabyte) and this operation is absolutely speed-critical, then it might make sense to go beyond what you've inquired about. One thing to consider in particular is that a character-by-character operation will perform less well than one using SIMD instructions.
I.e. if you used SSE2, you could code a tolower_parallel like this (pseudocode):
for (cur_parallel_word = begin_of_block;
cur_parallel_word < end_of_block;
cur_parallel_word += parallel_word_width) {
/*
* in SSE2, parallel compares are either about 'greater' or 'equal'
* so '>=' and '<=' have to be constructed. This would use 'PCMPGTB'.
* The 'ALL' macro is supposed to replicate into all parallel bytes.
*/
mask1 = parallel_compare_greater_than(*cur_parallel_word, ALL('A' - 1));
mask2 = parallel_compare_greater_than(ALL('Z' + 1), *cur_parallel_word);
/*
* vector op - and all bytes in two vectors, 'PAND'
*/
mask = mask1 & mask2;
/*
* vector op - add a vector of bytes. Would use 'PADDB'.
*/
new = parallel_add(*cur_parallel_word, ALL('a' - 'A'));
/*
* vector op - zero bytes in the original vector that will be replaced
*/
*cur_parallel_word &= ~mask; // that'd become 'PANDN'
/*
* vector op - extract characters from new that replace old, then or in.
*/
*cur_parallel_word |= (new & mask); // PAND / POR
}
I.e. you'd use parallel comparisons to check which bytes are uppercase, and then mask both the original value and the 'lowercased' version (one with the inverse of the mask, the other with the mask) before you OR them together to form the result.
If you use mmap'ed file access, this could even be performed in-place, saving on the bounce buffer, and saving on many function and/or system calls.
There is a lot to optimize when your starting point is a character-by-character 'fgetc' / 'fputc' loop; even shell utilities are highly likely to perform better than that.
But I agree that if your need is very special-purpose (i.e. something as clear-cut as ASCII input to be converted to lowercase) then a handcrafted loop as above, using vector instruction sets (like SSE intrinsics/assembly, or ARM NEON, or PPC Altivec), is likely to make a significant speedup possible over existing general-purpose utilities.
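For the curious, here is a minimal sketch of the same idea using actual SSE2 intrinsics (it uses a masked add instead of the AND/ANDN/OR combination from the pseudocode, and assumes the length is a multiple of 16 for brevity):
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

static void tolower_sse2(unsigned char *buf, size_t len)
{
    const __m128i a_minus_1 = _mm_set1_epi8('A' - 1);
    const __m128i z_plus_1  = _mm_set1_epi8('Z' + 1);
    const __m128i delta     = _mm_set1_epi8('a' - 'A');

    for (size_t i = 0; i < len; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(buf + i));
        /* mask = (v > 'A'-1) & (v < 'Z'+1), i.e. v is an uppercase ASCII letter */
        __m128i m1 = _mm_cmpgt_epi8(v, a_minus_1);
        __m128i m2 = _mm_cmplt_epi8(v, z_plus_1);
        __m128i m  = _mm_and_si128(m1, m2);
        /* add 32 only to the bytes selected by the mask */
        v = _mm_add_epi8(v, _mm_and_si128(m, delta));
        _mm_storeu_si128((__m128i *)(buf + i), v);
    }
}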
Well, you can definitely speed this up a lot if you know what the character encoding is. Since you're using Linux and C, I'm going to go out on a limb here and assume that you're using ASCII.
In ASCII, we know A-Z and a-z are contiguous and always 32 apart. So what we can do is skip the safety and locale checks of the tolower() function and do something like this:
(pseudo code)
foreach char c in the file:
    c += 32   // 'a' - 'A' == 32
Or, if there may be a mix of upper- and lowercase letters (or other characters), do a check like
if (c > 64 && c < 91) // the uppercase ASCII range
and only add 32 when it passes, then write the result out to the file.
Also, batch writes are faster, so I would suggest first writing to an array and then writing the contents of the array to the file all at once.
This should be considerably faster.
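A minimal sketch of that conversion step, operating on a buffer that has already been read in (ascii_tolower_buffer is a made-up helper name, and it assumes ASCII input):
#include <stddef.h>

static void ascii_tolower_buffer(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c > 64 && c < 91)      /* 'A'..'Z' */
            buf[i] = c + 32;       /* 'a' - 'A' == 32 in ASCII */
    }
}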