Feasibility to Bypass Address randomization and Stack Smash Protection - buffer overflow attack - c

I have just been through the logic behind buffer overflow attacks and the protection mechanisms available on Linux kernel versions above 2.6 to prevent them (address randomization and stack smashing protection).
Each time, we disable address randomization (by assigning '0' to the kernel's address randomization setting) and stack smashing protection (by compiling with -fno-stack-protector) in order to analyze buffer overflow attacks.
I'm curious: is there any way to bypass these protections while they are still enforced, instead of disabling them as described above? If such a mechanism exists, I would be glad to hear about it.

The best way I know of to avoid buffer overflows is to write fully exhaustive unit tests for every function that deals with a buffer of any size and type. This is not always realistic, of course.
"Exhaustive" means that all possible cases are taken into account, whether or not your application would ever generate those specific cases at the time you first write your code.
There are tools out there that can help you in that arena. Some are quite well automated and will generate unit tests for you. I have never tried one of those, so I cannot vouch for any of them, but if you are in a time crunch, they could help.
Another way, which works to some extent, is to run a static analyzer against your code. Coverity is the one I have used in the past, but there are many others too. In most cases, static analysis will only catch problems where you declare fixed-size buffers on your stack, as in:
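As a rough illustration of what boundary-focused unit tests might look like, here is a minimal sketch. The helper bounded_copy and the test function are hypothetical names made up for this example; they are not taken from any tool mentioned above. The checks target the cases that usually break buffer code: empty input, the largest string that fits, and a string one byte too long.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst without ever writing past
 * dst[dst_size - 1]. Returns 0 on success, -1 if src does not fit. */
int bounded_copy(char *dst, size_t dst_size, const char *src)
{
    size_t len;

    if (dst == NULL || src == NULL || dst_size == 0)
        return -1;
    len = strlen(src);
    if (len >= dst_size)            /* no room for the terminating NUL */
        return -1;
    memcpy(dst, src, len + 1);      /* copies the NUL as well */
    return 0;
}

/* Boundary tests. A sentinel byte placed right after the buffer lets
 * the test notice a write past the end. Returns the failure count. */
int test_bounded_copy(void)
{
    struct { char buf[8]; char sentinel; } t;
    int failures = 0;

    t.sentinel = 0x5A;

    if (bounded_copy(t.buf, sizeof t.buf, "") != 0)          /* empty */
        failures++;
    if (bounded_copy(t.buf, sizeof t.buf, "1234567") != 0)   /* exact fit */
        failures++;
    if (strcmp(t.buf, "1234567") != 0)
        failures++;
    if (bounded_copy(t.buf, sizeof t.buf, "12345678") != -1) /* one too long */
        failures++;
    if (bounded_copy(t.buf, 0, "x") != -1)                   /* zero-size dst */
        failures++;
    if (bounded_copy(NULL, 8, "x") != -1)                    /* NULL dst */
        failures++;
    if (t.sentinel != 0x5A)                                  /* overrun? */
        failures++;

    return failures;
}
```

Calling test_bounded_copy() from a test runner and expecting 0 is the intended use. Truly exhaustive coverage would need many more cases than this.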
char buf[256];
...
char a = buf[256]; // <- bug here, although not too bad
buf[256] = a; // <- bug here, could be bad, you're writing to the stack!
Now... under Unix you have two problems with buffer overflows. In most cases an overflow will simply make your program crash. However, if the hacker has access to your code, they may be able to call a specific system function (a kernel function, to be clear). In that case, the danger is greatest if your process runs with elevated privileges (worst case scenario: root). At that point the hacker may have obtained permission to do more without your authorization. To reduce this risk you have two main solutions:
Use a chroot environment; this can be difficult to set up if you are new to Linux, but it works on virtually all Unices
Use a VirtualBox environment (or some other virtual machine system like QEMU); getting such an environment set up is generally pretty easy, although if you want to generate new environments automatically... there is an API, and it can be tedious.
There is one last way, but it can be slow. The CPU has an MMU. You can use the MMU to protect/unprotect memory and ensure that each read and write happens within a buffer that was allocated (in the case of the stack, the stack frame is used to make sure you are within the correct window). As you can imagine, every write (and possibly many reads) triggers an interrupt, and the handler is not small. It's a good tool/idea for debugging software that has many buffer overflows, but in general it is not usable in production.
Unfortunately, none of these options are part of the g++ suite by default.

Related

Custom handling of memory reads and writes in C

I am working on writing my own malloc and using the LD_PRELOAD trick to use it. I need to be able to perform custom functionality for every memory access to the heap, both reads and writes (performance is not a concern, functionality is the goal).
For example, for some code like
int x = A[5];
I would like to be able to trap the read from (A + 5) and instead of reading from that memory location, return my own custom value to store in x.
The ideas I have as of now are:
mprotect away, handling the resulting SIGSEGVs and doing what I need to in the handler. As far as I know, I can access the faulty address in void *si_addr, but I'm not sure how to distinguish between a read and a write - and even if I did manage to do so, I'm not sure how to handle writes since I wouldn't know the value to be written within the handler.
Tweak gcc to handle memory accesses specially. From what I have read, understanding gcc code takes a while, and unless its IR/abstract assembly conveniently isolates memory loads/stores, I'm not sure how practical this is.
Any suggestions are appreciated.
The simplest approach is via malloc (you might want to own mmap, munmap, mprotect, sigaction, signal, etc. for full coverage). Your malloc returns addresses that do not correspond to valid mappings; you capture SIGBUS and SIGSEGV, then interpret the siginfo structure to fix up your process. But this is somewhat limited to operating on the heap, a program can readily escape from it, and if you are trying to catch a misbehaving program, the program might corrupt your lookup tables.
For fuller coverage, you might want to take a look at gVisor, which is billed as a container runtime sandbox. Its technology is closer to a debugger, as it takes full control over the target, capturing its faults, system calls, etc., and manages its address space. It would likely be minor surgery to adapt it to your needs.
In either situation, when you take a fault, you have to either install the memory and restart the program or emulate the instruction. If you are dealing with a clean architecture like RISC-V or ARM, emulation isn't too bad, but for an over-indulgent one like x86, you pretty much need to integrate QEMU. If you take the gVisor-like approach, you can install the page and set the single-step flag, then remove the page on the single-step trap, which is a bit less cumbersome. There was a precursor to DTrace, called atrace, that used this approach to analyze cache and TLB access patterns.
Sounds like a fun project; I hope it goes well.
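The "install the memory and restart" path can be sketched roughly as follows, assuming Linux. All names here (trap_page, on_fault, trap_demo) are made up for the example. The handler only records si_addr and makes the page accessible; it does not distinguish reads from writes, and it does not re-protect the page afterwards, which is where single-stepping or instruction emulation would come in.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *trap_page;                 /* the page we keep inaccessible */
static size_t page_size;
static volatile sig_atomic_t fault_count;
static void *volatile last_fault_addr;

/* Fault handler: record the address, then make the page accessible so
 * the faulting instruction succeeds when it is restarted on return. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    char *addr = (char *)info->si_addr;

    (void)sig;
    (void)ctx;
    if (addr < trap_page || addr >= trap_page + page_size)
        _exit(1);                       /* not ours: a genuine crash */
    fault_count++;
    last_fault_addr = addr;
    /* mprotect is not on POSIX's async-signal-safe list; this sketch
     * assumes it is usable here, as it is a plain system call on Linux. */
    mprotect(trap_page, page_size, PROT_READ | PROT_WRITE);
}

/* Returns 0 if exactly one fault was trapped at &A[5] and the store
 * completed after the handler returned; -1 otherwise. */
int trap_demo(void)
{
    struct sigaction sa, old_segv, old_bus;
    volatile int *A;
    int x, ok;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    trap_page = mmap(NULL, page_size, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (trap_page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    fault_count = 0;
    A = (volatile int *)trap_page;
    A[5] = 42;                          /* traps once, then is restarted */
    x = A[5];                           /* page is accessible now: no trap */

    ok = fault_count == 1 && last_fault_addr == (void *)&A[5] && x == 42;

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    munmap(trap_page, page_size);
    return ok ? 0 : -1;
}
```

Substituting a custom value for a read, as the question asks, would additionally require decoding the faulting instruction or editing the saved register state in the third handler argument, which this sketch does not attempt.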

does buffer overflow still exist?

I was watching a university lecture about buffer overflow, and the professor ended up saying
even if we were able to fill the buffer with exploit code and jumped
into that code, we still can not execute it..
the reasons - he mentioned - are:
programmers avoid the use of functions that cause overflow.
randomized stack offsets: at start of program, allocate random amount of space on stack to make it difficult to predict the beginning of inserted code.
use techniques to detect stack corruption.
non-executable code segments: only allow code to execute from "text" sections of memory.
Now I wonder: do buffer overflow attacks still exist nowadays, or are they out of date?
A detailed answer would be much appreciated!
Not all of us. There's a bunch of new programmers every day. Does our collective knowledge that strcpy is bad get disseminated to them magically? I don't think so.
Difficult, yes. Impossible, no. Any vulnerability that can be turned into an arbitrary read can defeat such protections trivially.
Indeed we can detect stack corruption, under certain circumstances. Canaries, for instance, may be overwritten, their value is compiler dependent, and they might not protect against all kinds of stack corruption (e.g. GCC's -fstack-protector-strong protects against EIP overwrite, but not other kinds of overrun)
W^X memory is a reality, but how many OSes have adopted it for the stack? That'd be an interesting little research project for your weekend. :) Additionally, if you look into return-oriented programming (ROP) techniques (return-to-libc is an application of it), you'll see that it can also be bypassed.
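To make the canary point concrete, here is a hand-rolled toy model, not what a compiler actually emits. The struct, the constant CANARY_VALUE and all function names are invented for the illustration; real canaries are chosen at process start and placed by the compiler. The sketch shows a blind overrun being detected, and the same overrun going unnoticed once the writer already knows the canary value (the "arbitrary read" case mentioned above).

```c
#include <stddef.h>
#include <string.h>

#define CANARY_VALUE 0xC0FFEE5AUL   /* arbitrary constant for the demo */

struct guarded_buf {
    char buf[16];
    unsigned long canary;           /* sits right after the buffer */
};

void guarded_init(struct guarded_buf *g)
{
    memset(g->buf, 0, sizeof g->buf);
    g->canary = CANARY_VALUE;
}

/* Deliberately careless copy: it trusts n beyond the size of buf.
 * It writes through the bytes of the whole struct (capped at the
 * struct size), so the overrun stays inside one object and the demo
 * itself has defined behavior. */
void unsafe_fill(struct guarded_buf *g, const void *src, size_t n)
{
    if (n > sizeof *g)
        n = sizeof *g;
    memcpy(g, src, n);
}

int guarded_intact(const struct guarded_buf *g)
{
    return g->canary == CANARY_VALUE;
}

/* Returns 0 if all three scenarios behave as described. */
int canary_demo(void)
{
    struct guarded_buf g;
    unsigned char payload[sizeof g];
    unsigned long leaked = CANARY_VALUE;  /* pretend it was read out */

    memset(payload, 'A', sizeof payload);

    guarded_init(&g);
    unsafe_fill(&g, payload, sizeof g.buf);   /* fits: canary untouched */
    if (!guarded_intact(&g))
        return -1;

    unsafe_fill(&g, payload, sizeof g);       /* blind overrun */
    if (guarded_intact(&g))
        return -2;                            /* should have been caught */

    /* Overrun that rewrites the canary with its known value. */
    memcpy(payload + offsetof(struct guarded_buf, canary),
           &leaked, sizeof leaked);
    guarded_init(&g);
    unsafe_fill(&g, payload, sizeof g);
    if (!guarded_intact(&g))
        return -3;                            /* check is fooled, as expected */

    return 0;
}
```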

What are the good implementation practices to minimize RAM consumption

I run C code on an ARM-based Linux device that has very little RAM (16MB). My code is often killed (SIGKILL) by the kernel with an 'out of memory' message. I ran the program under Valgrind, and there does not appear to be a memory leak. I also ran it under gdb but could not identify any mistake in the code. I will keep trying to optimize my code by going through it several more times.
In general, what are good implementation practices for minimizing memory usage?
One might be to use functions as much as possible(?), but I guess gcc already optimizes the code to decrease resource usage.
Avoid dynamic memory allocations.
What else?
Be careful about the scope of objects. Make sure you deallocate memory once an object is no longer needed.
I'm not sure I understand your "use functions as much as possible(?)". Functions carry overhead: every call takes up a little extra memory, because a few pointers and a little information about the function have to be stored on the call stack. So, while functions may help keep your source code clean, they won't lower your memory usage (they will probably increase it). One way to get the best of both worlds in C is to use inline functions, which suggest to the compiler that it should not create an actual function but rather insert that block of code wherever it is used.
Keep in mind that efficient code usually has a more machine-level look to it (meaning repetition, pointers, and often developer-managed array indices) rather than relying on general-purpose, function-heavy objects. Thank goodness for smart compilers, so you don't have to know every optimization. However, since a lower-level language like C gives you so much ability to manipulate everything, you need to be careful not to make costly mistakes.
If you have this kind of problem on Linux, you can disable memory overcommit. This makes sure that all allocated memory is backed by real memory, so the kernel will be less likely to kill your program. Then be sure to test the result of every malloc, because allocations will fail at some point when you run out of memory. You can find more information here: http://www.etalabs.net/overcommit.html
You can also disable some programs on your embedded system to free memory. Maybe you don't use cron, or don't need six TTYs at startup.
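With overcommit disabled, a failing malloc becomes an ordinary event the program has to handle. The following is a minimal sketch of one way to do that, assuming the program can make do with a smaller working buffer; alloc_buffer and alloc_demo are hypothetical names introduced only for this example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Try to allocate `preferred` bytes; on failure keep halving the
 * request until it succeeds or drops below `minimum`. The size that
 * was actually obtained is stored in *got (0 on total failure). */
void *alloc_buffer(size_t preferred, size_t minimum, size_t *got)
{
    size_t size;

    for (size = preferred; size > 0 && size >= minimum; size /= 2) {
        void *p = malloc(size);
        if (p != NULL) {
            *got = size;
            return p;
        }
    }
    *got = 0;
    return NULL;                    /* caller must handle this case */
}

/* Example use: ask for 1 MiB but accept as little as 4 KiB.
 * Returns 0 if a buffer within that range was obtained. */
int alloc_demo(void)
{
    size_t got = 0;
    void *buf = alloc_buffer((size_t)1 << 20, 4096, &got);

    if (buf == NULL)
        return -1;                  /* degrade or exit cleanly here */
    free(buf);
    return (got >= 4096 && got <= ((size_t)1 << 20)) ? 0 : -1;
}
```

On a 16MB device the preferred and minimum sizes would of course be far smaller; the numbers here are placeholders.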

Allocating a new call stack

(I think there's a high chance of this question either being a duplicate or otherwise answered here already, but searching for the answer is hard thanks to interference from "stack allocation" and related terms.)
I have a toy compiler I've been working on for a scripting language. In order to be able to pause the execution of a script while it's in progress and return to the host program, it has its own stack: a simple block of memory with a "stack pointer" variable that gets incremented using the normal C code operations for that sort of thing and so on and so forth. Not interesting so far.
At the moment I compile to C. But I'm interested in investigating compiling to machine code as well - while keeping the secondary stack and the ability to return to the host program at predefined control points.
So... I figure it's not likely to be a problem to use the conventional stack registers within my own code, I assume what happens to registers there is my own business as long as everything is restored when it's done (do correct me if I'm wrong on this point). But... if I want the script code to call out to some other library code, is it safe to leave the program using this "virtual stack", or is it essential that it be given back the original stack for this purpose?
Answers like this one and this one indicate that the stack isn't a conventional block of memory, but that it relies on special, system specific behaviour to do with page faults and whatnot.
So:
is it safe to move the stack pointers into some other area of memory? Stack memory isn't "special"? I figure threading libraries must do something like this, as they create more stacks...
assuming any area of memory is safe to manipulate using the stack registers and instructions, I can think of no reason why it would be a problem to call any functions with a known call depth (i.e. no recursion, no function pointers) as long as that amount is available on the virtual stack. Right?
stack overflow is obviously a problem in normal code anyway, but would there be any extra-disastrous consequences to an overflow in such a system?
This is obviously not actually necessary, since simply returning the pointers to the real stack would be perfectly serviceable, or for that matter not abusing them in the first place and just putting up with fewer registers, and I probably shouldn't try to do it at all (not least due to being obviously out of my depth). But I'm still curious either way. Want to know how these sorts of things work.
EDIT: Sorry of course, should have said. I'm working on x86 (32-bit for my own machine), Windows and Ubuntu. Nothing exotic.
All of these answers are based on "common processor architectures", and since this involves generating assembler code, it has to be "target specific": if you decide to do this on processor X, which has some weird handling of the stack, the below is obviously not worth the screen surface it's written on [substitute for paper]. For x86 in general, the below holds unless otherwise stated.
is it safe to move the stack pointers into some other area of memory?
Stack memory isn't "special"? I figure threading libraries
must do something like this, as they create more stacks...
The memory as such is not special. This does, however, assume that you are not on an x86 system where the stack segment is used to limit stack usage. Whilst that is possible, it's rather rare to see in an implementation. I know that some years ago Nokia had a special operating system using segments in 32-bit mode. As far as I can recall, that's the only one I've had any contact with that uses the stack segment the way x86 segmentation describes.
Assuming any area of memory is safe to manipulate using the stack
registers and instructions, I can think of no reason why it would be a
problem to call any functions with a known call depth (i.e. no
recursion, no function pointers) as long as that amount is available
on the virtual stack. Right?
Correct. Just as long as you don't expect to be able to get back to some other function without switching back to the original stack. Limited level of recursion would also be acceptable, as long as the stack is deep enough [there are certain types of problems that are definitely hard to solve without recursion - binary tree search for example].
stack overflow is obviously a problem in normal code anyway,
but would there be any extra-disastrous consequences to an overflow in
such a system?
Indeed, it would be a tough bug to crack if you are a little unlucky.
I would suggest that you use a call to VirtualProtect() (Windows) or mprotect() (Linux etc) to mark the "end of the stack" as unreadable and unwriteable so that if your code accidentally walks off the stack, it crashes properly rather than some other more subtle undefined behaviour [because it's not guaranteed that the memory just below (lower address) is unavailable, so you could overwrite some other useful things if it does go off the stack, and that would cause some very hard to debug bugs].
Add a bit of code that occasionally checks the stack depth. You know where your stack starts and ends, so it shouldn't be hard to check whether a particular stack value is "outside the range", especially if you give yourself some "extra buffer space" between the top of the stack and the "we're dead" zone that you protected (a "crumple zone", as it would be called if this were a car in a crash). You can also fill the entire stack with a recognisable pattern, and check how much of that is "untouched".
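The pattern-fill idea can be sketched in a few lines. This assumes a downward-growing stack that lives in a plain byte array; the fill byte 0xA5 and every function name are arbitrary choices for the example.

```c
#include <stddef.h>
#include <string.h>

#define STACK_FILL 0xA5             /* arbitrary recognisable pattern */

/* Paint the whole stack area before it is first used. */
void stack_paint(unsigned char *base, size_t size)
{
    memset(base, STACK_FILL, size);
}

/* The stack is assumed to grow downward: usage starts at base + size
 * and moves toward base. Bytes still holding the pattern at the low
 * end have never been written. (A program that happens to store the
 * fill byte exactly at the boundary makes this undercount slightly.) */
size_t stack_untouched(const unsigned char *base, size_t size)
{
    size_t n = 0;

    while (n < size && base[n] == STACK_FILL)
        n++;
    return n;
}

/* Deepest usage seen so far, in bytes. */
size_t stack_high_water(const unsigned char *base, size_t size)
{
    return size - stack_untouched(base, size);
}

/* Demo: simulate 100 bytes of pushes from the top of a 4 KiB stack.
 * Returns 0 if the high-water mark is reported as 100. */
int paint_demo(void)
{
    static unsigned char vstack[4096];

    stack_paint(vstack, sizeof vstack);
    memset(vstack + sizeof vstack - 100, 0x11, 100);
    return stack_high_water(vstack, sizeof vstack) == 100 ? 0 : -1;
}
```

Checking stack_untouched() against a threshold at the predefined control points gives an early warning well before the protected page is reached.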
Typically, on x86, you can use the existing stack without any problems so long as:
you don't overflow it
you don't increment the stack pointer register (with pop or add esp, positive_value / sub esp, negative_value) beyond what your code starts with (if you do, interrupts or asynchronous callbacks (signals) or any other activity using the stack will trash its contents)
you don't cause any CPU exception (if you do, the exception handling code might not be able to unwind the stack to the nearest point where the exception can be handled)
The same applies to using a different block of memory for a temporary stack and pointing esp to its end.
The problem with exception handling and stack unwinding has to do with the fact that your compiled C and C++ code contains some exception-handling-related data structures like the ranges of eip with the links to their respective exception handlers (this tells where the closest exception handler is for every piece of code) and there's also some information related to identification of the calling function (i.e. where the return address is on the stack, etc), so you can bubble up exceptions. If you just plug in raw machine code into this "framework", you won't properly extend these exception-handling data structures to cover it, and if things go wrong, they'll likely go very wrong (the entire process may crash or become damaged, despite you having exception handlers around the generated code).
So, yeah, if you're careful, you can play with stacks.
You can use any region you like for the processor's stack (modulo the memory protections).
Essentially, you simply load the ESP register ("MOV ESP, ...") with a pointer to the new area, however you managed to allocate it.
You have to have enough for your program, and whatever it might call (e.g., a Windows OS API), and whatever funny behaviours the OS has. You might be able to figure out how much space your code needs; a good compiler can easily do that. Figuring how much is needed by Windows is harder; you can always allocate "way too much" which is what Windows programs tend to do.
If you decide to manage this space tightly, you'll probably have to switch stacks to call Windows functions. That won't be enough; you'll likely get burned by various Windows surprises. I describe one of them here: "Windows: avoid pushing full x86 context on stack". I have mediocre solutions, but not good solutions for this.
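On Linux with glibc, one way to experiment with a separately allocated stack without writing assembly is the ucontext API (getcontext, makecontext, swapcontext), which performs the stack pointer switch for you. This is a swapped-in technique for illustration, not what the answers above describe, and the names run_on_alt_stack and on_alt_stack are made up. The sketch runs a function on a malloc'ed block and checks that one of its locals really lived inside that block.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

#define ALT_STACK_SIZE (64 * 1024)  /* arbitrary size for the demo */

static ucontext_t main_ctx, alt_ctx;
static uintptr_t seen_local;        /* address of a local on the new stack */

/* Runs on the alternate stack. Returning from it resumes uc_link,
 * which is main_ctx in this sketch. */
static void on_alt_stack(void)
{
    volatile char marker = 0;

    seen_local = (uintptr_t)&marker;
}

/* Returns 0 if on_alt_stack executed with its locals inside the
 * malloc'ed block, -1 on any failure. */
int run_on_alt_stack(void)
{
    char *stack = malloc(ALT_STACK_SIZE);
    int inside;

    if (stack == NULL)
        return -1;
    if (getcontext(&alt_ctx) != 0) {
        free(stack);
        return -1;
    }
    alt_ctx.uc_stack.ss_sp = stack;
    alt_ctx.uc_stack.ss_size = ALT_STACK_SIZE;
    alt_ctx.uc_link = &main_ctx;    /* where to continue afterwards */
    makecontext(&alt_ctx, on_alt_stack, 0);

    /* Save the current context and switch to the new stack. */
    if (swapcontext(&main_ctx, &alt_ctx) != 0) {
        free(stack);
        return -1;
    }

    inside = seen_local >= (uintptr_t)stack &&
             seen_local < (uintptr_t)stack + ALT_STACK_SIZE;
    free(stack);
    return inside ? 0 : -1;
}
```

Library code called from on_alt_stack would run on the same block, so the size has to cover its needs as well, as discussed above.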

Catching stack overflow

What's the best way to catch stack overflow in C?
More specifically:
A C program contains an interpreter for a scripting language.
Scripts are not trusted, and may contain infinite recursion bugs. The interpreter has to be able to catch these and smoothly continue. (Obviously this can partly be handled by using a software stack, but performance is greatly improved if substantial chunks of library code can be written in C; at a minimum, this entails C functions running over recursive data structures created by scripts.)
The preferred form of catching a stack overflow would involve longjmp back to the main loop. (It's perfectly okay to discard all data that was held in stack frames below the main loop.)
The fallback portable solution is to use addresses of local variables to monitor the current stack depth, and for every recursive function to contain a call to a stack checking function that uses this method. Of course, this incurs some runtime overhead in the normal case; it also means if I forget to put the stack check call in one place, the interpreter will have a latent bug.
Is there a better way of doing it? Specifically, I'm not expecting a better portable solution, but if I had a system specific solution for Linux and another one for Windows, that would be okay.
I've seen references to something called structured exception handling on Windows, though the references I've seen have been about translating this into the C++ exception handling mechanism; can it be accessed from C, and if so is it useful for this scenario?
I understand Linux lets you catch a segmentation fault signal; is it possible to reliably turn this into a longjmp back to your main loop?
Java seems to support catching stack overflow exceptions on all platforms; how does it implement this?
Off the top of my head, one way to catch excessive stack growth is to check the relative difference in addresses of stack frames:
#define MAX_ROOM (64*1024*1024UL) // 64 MB
static char * first_stack = NULL;
void foo(...args...)
{
    char stack;
    // Compare addresses of stack frames
    if (first_stack == NULL)
        first_stack = &stack;
    if ((first_stack > &stack && first_stack - &stack > MAX_ROOM) ||
        (&stack > first_stack && &stack - first_stack > MAX_ROOM))
        printf("Stack is larger than %lu\n", (unsigned long)MAX_ROOM);
    ...code that recursively calls foo()...
}
This compares the address of the first stack frame for foo() to the current stack frame address, and if the difference exceeds MAX_ROOM it writes a message.
This assumes that you're on an architecture that uses a linear always-grows-down or always-grows-up stack, of course.
You don't have to do this check in every function, but often enough that excessively large stack growth is caught before you hit the limit you've chosen.
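Combining this address comparison with the longjmp back to the main loop that the question asks for might look like the sketch below. It assumes a linear stack, as stated above; STACK_BUDGET, check_stack, runaway and run_script are hypothetical names, and the depth cap inside runaway is only a safety net for the demo.

```c
#include <setjmp.h>
#include <stdint.h>

#define STACK_BUDGET (256u * 1024u) /* arbitrary limit for the demo */

static jmp_buf main_loop;           /* where to resume after an overflow */
static uintptr_t stack_base;        /* address recorded in the main loop */

/* Call this at the top of every recursive function. Works for stacks
 * that grow in either direction, as long as they are linear. */
void check_stack(void)
{
    volatile char here = 0;
    uintptr_t now = (uintptr_t)&here;
    uintptr_t used = now > stack_base ? now - stack_base
                                      : stack_base - now;

    if (used > STACK_BUDGET)
        longjmp(main_loop, 1);      /* abandon all frames below the loop */
}

/* Stand-in for a C routine recursing over script-built data with no
 * natural bound. */
unsigned long runaway(unsigned long depth)
{
    volatile char pad[128];         /* gives each frame some real size */

    pad[0] = (char)depth;
    check_stack();
    if (depth > 10000000UL)         /* safety net for this demo only */
        return depth;
    return runaway(depth + 1) + (unsigned long)pad[0];
}

/* Returns 1 if the runaway recursion was caught, 0 if it finished. */
int run_script(void)
{
    volatile char base = 0;

    stack_base = (uintptr_t)&base;
    if (setjmp(main_loop) != 0)
        return 1;                   /* we arrived here via longjmp */
    (void)runaway(0);
    return 0;
}
```

Resources acquired in the abandoned frames are not released by longjmp, so the interpreter would need its own cleanup bookkeeping.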
AFAIK, all mechanisms for detecting stack overflow will incur some runtime cost. You could let the CPU detect seg-faults, but that's already too late; you've probably already scribbled all over something important.
You say that you want your interpreter to call precompiled library code as much as possible. That's fine, but to maintain the notion of a sandbox, your interpreter engine should always be responsible for e.g. stack transitions and memory allocation (from the interpreted language's point of view); your library routines should probably be implemented as callbacks. The reason being that you need to be handling this sort of thing at a single point, for reasons that you've already pointed out (latent bugs).
Things like Java deal with this by generating machine code, so it's simply a case of generating code to check this at every stack transition.
(I won't bother with methods that depend on particular platforms for "better" solutions. They cause trouble by limiting the language design and usability, with little gain. For answers that "just work" on Linux and Windows, see above.)
First of all, in the sense of C, you can't do it in a portable way. In fact, ISO C mandates no "stack" at all. Pedantically, it even seems that when allocation of automatic objects fails, the behavior is literally undefined, as per Clause 4p2: there is simply no guarantee of what happens when calls nest too deep. You have to rely on additional assumptions about the implementation (the ISA or OS ABI) to do that, so you end up with C + something else, not only C. Runtime machine code generation is also not portable at the C level.
(BTW, ISO C++ has a notion of stack unwinding, but only in the context of exception handling. And there is still no guarantee of portable behavior on stack overflow; though it seems to be unspecified, not undefined.)
Apart from limiting the call depth, all approaches have some extra runtime cost. The cost would be quite easily observable unless there are hardware-assisted means to amortize it (like page table walking). Sadly, this is not the case now.
The only portable way I have found is to not rely on the native stack of the underlying machine architecture. In general this means allocating the activation record frames from the free store (on the heap), rather than on the native stack provided by the ISA. This works not only for interpreted language implementations but also for compiled ones, e.g. SML/NJ. Such a software stack approach does not always perform worse, because it allows higher-level abstractions in the object language, so programs may have more opportunities to be optimized, though this is unlikely in a naive interpreter.
You have several options to achieve this. One way is to write a virtual machine. You can allocate memory and build the stack in it.
Another way is to write sophisticated asynchronous-style code (e.g. trampolines, or CPS transformation) in your implementation instead, relying on as few native call frames as possible. It is generally difficult to get right, but it works. Additional capabilities enabled this way are easier tail call optimization and easier first-class continuation capture.
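A trampoline in C can be sketched as follows: each step returns a description of the next step instead of calling it, and a driver loop runs the steps, so the native stack never gets deeper than one call. The struct layout and the function names are invented for this illustration, and summing integers is just a stand-in for a real recursive computation.

```c
#include <stddef.h>

struct step;
typedef struct step (*step_fn)(unsigned long long n,
                               unsigned long long acc);

/* One unit of pending work. next == NULL means the result is in acc. */
struct step {
    step_fn next;
    unsigned long long n;
    unsigned long long acc;
};

/* One step of "sum 1..n", written in accumulator style. Instead of
 * calling itself it returns the next step to the driver. */
struct step sum_step(unsigned long long n, unsigned long long acc)
{
    struct step s;

    if (n == 0) {
        s.next = NULL;              /* finished */
        s.n = 0;
        s.acc = acc;
    } else {
        s.next = sum_step;          /* "tail call" handed to the driver */
        s.n = n - 1;
        s.acc = acc + n;
    }
    return s;
}

/* The trampoline: bounces until a step reports completion. */
unsigned long long trampoline(struct step s)
{
    while (s.next != NULL)
        s = s.next(s.n, s.acc);
    return s.acc;
}

unsigned long long sum_to(unsigned long long n)
{
    struct step start = { sum_step, n, 0 };

    return trampoline(start);
}
```

For instance, sum_to(1000000) runs a million steps at constant native stack depth, where the directly recursive version would need a million frames unless the compiler eliminated the tail call.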

Resources