After successfully implementing the Karatsuba algorithm, I decided to compare its running time with that of the schoolbook algorithm. The program needs to test up to 32768 digits. Unfortunately, it stops at 8192 digits (the digits are stored in an array). When running it with gdb I get the output: Program terminated with SIGKILL, Killed. So obviously I searched the web and found out that (since I'm on Linux) the kernel automatically killed the program because it consumed too many resources.
So my question is: Is there a way to keep it running?
Thanks in advance for any response
The most probable cause is memory exhaustion. You can roughly test this hypothesis by running top in a terminal.
If this is the case, valgrind is your friend. Look very carefully at every place you call malloc in your program and ensure that you call free for each array afterwards.
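For example, run valgrind --leak-check=full ./yourprogram and look for "definitely lost" blocks. The pattern to audit is this (a minimal sketch; the function name is hypothetical):

#include <stdlib.h>

void multiply(size_t ndigits)
{
    int *digits = malloc(ndigits * sizeof *digits);
    if (digits == NULL)
        return;               /* allocation failed */

    /* ... work with digits ... */

    free(digits);             /* without this, every call leaks */
}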
I see a number of things you should do before forcing Linux to keep your program running (if you could do that anyway).
Watch out for memory leaks (see jons34yp's answer).
Once all memory leaks are resolved, check the declarations of your variables; every allocated but unused byte is one too many. If a byte (unsigned char) is enough, don't use a short. If a short is enough, don't use a long. The same goes for floats and doubles. Also check any structs and unions for unused fields.
Also check your algorithm and the way you implement it; e.g. a sparse matrix can be represented in other ways than wasting entire arrays.
Keep in mind that C compilers tend to align data fields. This means that after, for instance, an array of 13 bytes, the compiler tends to align the next field on a 32-bit or 64-bit boundary, leaving unused bytes in between. The same thing can happen within structs, so check your compiler's alignment settings (a small illustration follows).
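For instance (a hedged sketch; exact sizes depend on your platform, but with 4-byte int alignment you typically see this):

#include <stdio.h>

struct loose { char a; int b; char c; };   /* typically 12 bytes: 6 bytes of padding */
struct tight { int b; char a; char c; };   /* typically 8 bytes: 2 bytes of padding */

int main(void)
{
    printf("loose: %zu bytes\n", sizeof(struct loose));
    printf("tight: %zu bytes\n", sizeof(struct tight));
    return 0;
}

Same members, different order: half of the first struct is padding.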
I hope this helps to find a solution.
Kind regards,
PB
strlen is a fairly simple function, and it is obviously O(n) to compute. However, I have seen a few approaches that operate on more than one character at a time. See example 5 here or this approach here. The basic way these work is by reinterpret-casting the char const* buffer to a uint32_t const* buffer and then checking four bytes at a time.
Personally, my gut reaction is that this is a segfault-waiting-to-happen, since I might dereference up to three bytes outside valid memory. However, this solution seems to hang around, and it seems curious to me that something so obviously broken has stood the test of time.
I think this comprises UB for two reasons:
Potential dereference outside valid memory
Potential dereference of unaligned pointer
(Note that there is not an aliasing issue; one might think the uint32_t reads alias the char data as an incompatible type, so that code after the strlen (such as code that might change the string) could be reordered relative to the strlen, but it turns out that char is an explicit exception to strict aliasing.)
But, how likely is it to fail in practice? At minimum, I think there needs to be 3 bytes padding after the string literal data section, malloc needs to be 4-byte or larger aligned (actually the case on most systems), and malloc needs to allocate 3 extra bytes. There are other criteria related to aliasing. This is all fine for compiler implementations, which create their own environments, but how frequently are these conditions met on modern hardware for user code?
The technique is valid, and you will not avoid it by calling your C library's strlen. If that library is, for instance, a recent version of the GNU C library (at least on certain targets), it does the same thing.
The key to making it work is to ensure that the pointer is properly aligned. If the pointer is aligned, the operation will read beyond the end of the string, sure enough, but not into an adjacent page. If the null terminating byte is within one word of the end of a page, then that last word will be accessed without touching the subsequent page.
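To make that concrete, here is a sketch of the word-at-a-time idea (my illustration, not glibc's actual code; it assumes 4-byte words and, as discussed below, is not well-defined C):

#include <stdint.h>
#include <stddef.h>

size_t wordwise_strlen(const char *s)
{
    const char *p = s;

    /* Walk byte-by-byte until the pointer is word-aligned; aligned
       word reads can never straddle a page boundary. */
    while ((uintptr_t)p % sizeof(uint32_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Scan a word at a time; the bit trick below is nonzero
       exactly when some byte of the word is zero. */
    const uint32_t *w = (const uint32_t *)p;
    while (!((*w - 0x01010101u) & ~*w & 0x80808080u))
        w++;

    /* Locate the exact zero byte inside the final word. */
    p = (const char *)w;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}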
It certainly isn't well-defined behavior in C, and so it carries the burden of careful validation when ported from one compiler to another. It also triggers false positives from out-of-bounds access detectors like Valgrind.
Valgrind had to be patched to work around Glibc doing this. Without the patches, you get nuisance errors such as this:
==13669== Invalid read of size 8
==13669== at 0x411D6D7: __wcslen_sse2 (wcslen-sse2.S:59)
==13669== by 0x806923F: length_str (lib.c:2410)
==13669== by 0x807E61A: string_out_put_string (stream.c:997)
==13669== by 0x8075853: obj_pprint (lib.c:7103)
==13669== by 0x8084318: vformat (stream.c:2033)
==13669== by 0x8081599: format (stream.c:2100)
==13669== by 0x408F4D2: (below main) (libc-start.c:226)
==13669== Address 0x43bcaf8 is 56 bytes inside a block of size 60 alloc'd
==13669== at 0x402BE68: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==13669== by 0x8063C4F: chk_malloc (lib.c:1763)
==13669== by 0x806CD79: sub_str (lib.c:2653)
==13669== by 0x804A7E2: sysroot_helper (txr.c:233)
==13669== by 0x408F4D2: (below main) (libc-start.c:226)
Glibc is using SSE instructions to calculate wcslen eight bytes at a time (instead of four, the width of wchar_t). In doing so, it accesses offset 56 in a block that is 60 bytes wide. However, note that this access can never straddle a page boundary: the address is divisible by 8.
If you're working in assembly language, you don't have to think twice about the technique.
In fact, the technique is used quite a bit in some optimized audio codecs that I work with (targeting ARM), which feature a lot of hand-written assembly language in the Neon instruction set.
I noticed it when running Valgrind on code which integrated these codecs, and contacted the vendor. They explained that it was just a harmless loop optimization technique; I went through the assembly language and convinced myself they were right.
(1) can definitely happen. There's nothing preventing you from taking the strlen of a string near the end of an allocated page, which could result in an access past the end of allocated memory and a nice big crash. As you note, this could be mitigated by padding all your allocations, but then you have to have any libraries do the same. Worse, you have to arrange for the linker and OS to always add this padding (remember the OS passes argv[] in a static memory buffer somewhere). The overhead of doing this isn't worth it.
(2) also definitely happens. Earlier versions of ARM processors generate data aborts on unaligned accesses, which either cause your program to die with a bus error (or halt the CPU if you're running bare-metal), or force a very expensive trap through the kernel to handle the unaligned access. These earlier ARM chips are still in wide use in older cellphones and embedded devices. Later ARM processors synthesize multiple word accesses to deal with unaligned accesses, but this will result in overall slower performance since you basically double the number of memory loads you need to do.
Many current ("modern") PICs and embedded microprocessors lack the logic to handle unaligned accesses, and may behave unpredictably or even nonsensically when given unaligned addresses (I've personally seen chips that will just mask off the bottom bits, which would give incorrect answers, and others that will just give garbage results with unaligned accesses).
So, this is ridiculously dangerous to use in anything that should be remotely portable. Please, please, please do not use this code; use the libc strlen. It will usually be faster (optimized for your platform properly) and will make your code portable. The last thing you want is for your code to subtly and unexpectedly break in some situation (string near the end of an allocation) or on some new processor.
Donald Knuth, a person who wrote 3+ volumes on clever algorithms, said: "Premature optimization is the root of all evil".
strlen() is used a lot, so it really should be fast. Riffing on wildplasser's remark, "I would trust the library function": what makes you think that the library function works a byte at a time? Or is slow?
The title may give folks the impression that the code you suggest is faster than the standard system library strlen(), but I think what you mean is that it is faster than a naive strlen() which probably doesn't get used, anyway.
I compiled a simple C program and looked at the implementation on my 64-bit system, which uses GNU's glibc. The code I saw was pretty sophisticated and looks fast, working with the register width rather than a byte at a time. The code I saw for strlen() is written in assembly language, so there probably aren't junk instructions as you might get if this were compiled C code. What I saw was rtld-strlen.S. This code also unrolls loops to reduce looping overhead.
Before you think you can do better on strlen, you should look at that code, or the corresponding code for your particular architecture, and register size.
And if you do write your own strlen, benchmark it against the existing implementation.
And obviously, if you use the system strlen, then it is probably correct and you don't have to worry about invalid memory references as a result of an optimization in the code.
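A minimal timing harness for that comparison might look like this (a sketch; the buffer size and iteration count are arbitrary, and you would swap your candidate in for strlen in the loop):

#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    static char buf[1 << 20];
    memset(buf, 'x', sizeof buf - 1);      /* one big NUL-terminated string */

    volatile size_t total = 0;             /* volatile so the calls aren't optimized away */
    clock_t t0 = clock();
    for (int i = 0; i < 1000; i++) {
        buf[i % 512] = 'y';                /* perturb the buffer so strlen isn't hoisted */
        total += strlen(buf);
    }
    clock_t t1 = clock();

    printf("libc strlen: %.3f s (checksum %zu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, (size_t)total);
    return 0;
}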
I agree it's a bletcherous technique, but I suspect it's likely to work most of the time. It's only a segfault if the string happens to be right up against the end of your data (or stack) segment. The vast majority of strings (whether statically or dynamically allocated) won't be.
But you're right, to guarantee it working you'd need some guarantee that all strings were padded somehow, and your list of shims looks about right.
If alignment is a problem, you could take care of that in the fast strlen implementation; you wouldn't have to run around trying to align all strings.
(But of course, if your problem is that you're spending too much time scanning strings, the right fix is not to desperately try to make string scanning faster, but to rig things up so that you don't have to scan so many strings in the first place...)
Platform: x86 Linux 3.2.0 (Debian 7.1)
Compiler: GCC 4.7.2 (Debian 4.7.2-5)
I am writing a function that generates a "random" integer by reading allocated portions of memory for "random" values. The idea is based on the fact that uninitialized variables have undefined values. My initial idea was to allocate an array using malloc() and then use its uninitialized elements to generate a random number. But malloc() tends to return NULL blocks of memory, so I cannot guarantee that there is anything to read. So I thought about reading a separate process's memory in order to almost guarantee values other than NULL. My current idea is somehow finding the first valid memory address and reading from there down, but I do not know how to do this. I tried initializing a pointer to NULL and then incrementing it by one, but if I attempt to print the referenced memory location a segmentation fault occurs. So my question is: how do I read a separate process's memory? I do not need to do anything with the memory other than read it.
The idea is based on the fact that uninitialized variables have undefined values.
No, you can't rely on that. They have garbage values, meaning whatever happens to be in that memory; they are not random.
You can't read a separate process's memory; the kernel protects you from doing that, because such access usually happens because of an error in setting up your pointers. Even if it were possible, you wouldn't be getting anything near a random integer. Why not read from /dev/random instead?
Random numbers have certain special properties. Computer memory in general doesn't satisfy those properties.
If I sampled computer memory, tons of it would be quite similar, and certain numbers would have such a low probability of existing, that they might not even be found within the entire memory of a computer.
That's not to mention that if I read a bit of memory that's outside of the memory allocated to a program, the OS will kill me dead with a SEGFAULT.
It's a bad idea, on many levels. Use a proper random number generator.
Generating random numbers in computers by software is HARD (there are hardware random number generators). Memory in a new program is a terrible source, especially early on, as the OS has zeroed all of memory before it starts the program. Any non-zeros you see are left over from initialization code leaving its dirt behind.
Assuming you want "do it yourself" numbers, the micro/nano-second digits of the time are an old-style solution... the theory is shown below... play with your own numbers. Taking the result modulo a large prime would be good. Just be sure to discard anything above 1/1000 of a second.
(long long)(nano * 1E10 ) % 1000
This assumes you are started by a manual command rather than a scheduled job.
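A runnable version of that idea might look like this (a sketch; 999983 is just a convenient large prime):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* nanosecond-resolution clock */

    /* Keep only the fast-changing sub-second digits, reduced
       modulo a large prime. Not cryptographic randomness. */
    long r = ts.tv_nsec % 999983;
    printf("%ld\n", r);
    return 0;
}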
If you are running on UNIX, look into reading a few bytes from /dev/urandom, or, with proper care, /dev/random (read the man page).
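For instance (a minimal sketch, with error handling abbreviated):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return 1;

    unsigned int value;
    size_t got = fread(&value, sizeof value, 1, f);
    fclose(f);
    if (got != 1)
        return 1;

    printf("random value: %u\n", value);
    return 0;
}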
Windows has its own API. In Perl,
new Win32::API "advapi$b32","CryptAcquireContextA",'PNNNN','N' ||
die "$^E\n"; # Use MS crypto or die
The serious work that good random number generators do to produce good numbers is beyond a quick response here; they usually rely on hardware, such as timestamping interrupts.
The idea is based on the fact that uninitialized variables have undefined values.
They are undefined insofar as you cannot predict what they contain. What they really contain is mostly OS-dependent.
Back in old DOS days, you could maybe rely on the fact that if you executed several programs in the current session, there was garbage in the memory. But even then the data wasn't a reliable source of randomness.
Nowadays, things are different.
If you have variables on the stack, and in the current program run you were never as deep on the stack as now, your local variables are 0. Otherwise, they contain the data from previous function calls.
If you malloc() and the libc takes the returned memory from the pool of already used memory, it might contain garbage as well. But if it newly gets it from the OS, it is zeroed.
My initial idea was to allocate an array using malloc() and then use its uninitialized elements to generate a random number. But malloc() tends to return NULL blocks of memory so I cannot guarantee that there is anything to read.
(Not NULL, but 0 or NUL.)
See my last point: it depends on the history of the malloc()ed area.
So I thought about reading a separate process's memory in order to almost guarantee values other than NULL.
You cannot, as processes are separated and shielded from each other.
As others said, there are better sources of randomness. /dev/random if you definitely need real entropy, /dev/urandom otherwise.
malloc is not guaranteed to return 0'ed memory. The conventional wisdom is not only that, but that the contents of the memory malloc returns are actually non-deterministic, e.g. openssl used them for extra randomness.
However, as far as I know, malloc is built on top of brk/sbrk, which do "return" 0'ed memory. I can see why the contents of what malloc returns may be non-0, e.g. from previously free'd memory, but why would they be non-deterministic in "normal" single-threaded software?
Is the conventional wisdom really true (assuming the same binary and libraries)?
If so, Why?
Edit: Several people answered explaining why the memory can be non-0, which I already explained in the question above. What I'm asking is why a program using the contents of what malloc returns may be non-deterministic, i.e. why it could have different behavior every time it's run (assuming the same binary and libraries). Non-deterministic behavior is not implied by non-0s. To put it differently: why the contents could be different every time the binary is run.
Malloc does not guarantee unpredictability... it just doesn't guarantee predictability.
E.g. consider that

return 0;

is a valid implementation of malloc.
The initial values of memory returned by malloc are unspecified, which means that the specifications of the C and C++ languages put no restrictions on what values can be handed back. This makes the language easier to implement on a variety of platforms. While it might be true that in Linux malloc is implemented with brk and sbrk and the memory should be zeroed (I'm not even sure that this is necessarily true, by the way), on other platforms, perhaps an embedded platform, there's no reason that this would have to be the case. For example, an embedded device might not want to zero the memory, since doing so costs CPU cycles and thus power and time. Also, in the interest of efficiency, for example, the memory allocator could recycle blocks that had previously been freed without zeroing them out first. This means that even if the memory from the OS is initially zeroed out, the memory from malloc needn't be.
The conventional wisdom that the values are nondeterministic is probably a good one because it forces you to realize that any memory you get back might have garbage data in it that could crash your program. That said, you should not assume that the values are truly random. You should, however, realize that the values handed back are not magically going to be what you want. You are responsible for setting them up correctly. Assuming the values are truly random is a Really Bad Idea, since there is nothing at all to suggest that they would be.
If you want memory that is guaranteed to be zeroed out, use calloc instead.
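For instance:

#include <stdlib.h>

int main(void)
{
    size_t n = 100;
    int *p = calloc(n, sizeof *p);   /* all n ints guaranteed zero */
    if (p == NULL)
        return 1;
    /* p[0] .. p[n-1] are all 0 here */
    free(p);
    return 0;
}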
Hope this helps!
malloc is defined on many systems that can be programmed in C/C++, including many non-UNIX systems and many systems that lack an operating system altogether. Requiring malloc to zero out the memory goes against C's philosophy of saving CPU cycles wherever possible.
The standard provides a zeroing call, calloc, that can be used if you need the memory zeroed out. But in cases where you are planning to initialize the memory yourself as soon as you get it, the CPU cycles spent making sure the block is zeroed out are a waste; the C standard aims to avoid this waste as much as possible, often at the expense of predictability.
Memory returned by malloc is not zeroed (or rather, is not guaranteed to be zeroed) because it does not need to be. There is no security risk in reusing uninitialized memory pulled from your own process's address space or page pool. You already know it's there, and you already know the contents. There is also no issue with the contents in a practical sense, because you're going to overwrite them anyway.
Incidentally, the memory returned by malloc is zeroed upon first allocation, because an operating system kernel cannot afford the risk of giving one process data that another process owned previously. Therefore, when the OS faults in a new page, it only ever provides one that has been zeroed. However, this is totally unrelated to malloc.
(Slightly off-topic: the Debian security incident you mentioned had a few more implications than using uninitialized memory for randomness. A packager who was not familiar with the inner workings of the code, and did not know the precise implications, patched out a couple of places that Valgrind had reported, presumably with good intent but to disastrous effect. Among these was the "random from uninitialized memory", but it was by far not the most severe one.)
I think that the assumption that it is non-deterministic is plain wrong, particularly as you ask about a non-threaded context. (In a threaded context, scheduling randomness could introduce some non-determinism.)
Just try it out (a sketch follows this list). Create a sequential, deterministic application that
does a whole bunch of allocations
fills the memory with some pattern, eg fill it with the value of a counter
free every second of these allocations
newly allocate the same amount
run through these new allocations and register the value of the first byte in a file (as textual numbers one per line)
Run this program twice and record the result in two different files. My expectation is that these files will be identical.
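A sketch of that experiment (block counts and sizes are arbitrary; note that reading the fresh allocations' bytes is exactly the indeterminate-value read under discussion):

#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 1000
#define BLKSIZE 64

int main(int argc, char **argv)
{
    FILE *out = fopen(argc > 1 ? argv[1] : "run.txt", "w");
    if (out == NULL)
        return 1;

    unsigned char *blocks[NBLOCKS];

    /* A whole bunch of allocations, filled with a counter pattern. */
    for (int i = 0; i < NBLOCKS; i++) {
        blocks[i] = malloc(BLKSIZE);
        if (blocks[i] == NULL)
            return 1;
        for (int j = 0; j < BLKSIZE; j++)
            blocks[i][j] = (unsigned char)i;
    }

    /* Free every second allocation, then allocate the same amount again. */
    for (int i = 0; i < NBLOCKS; i += 2)
        free(blocks[i]);
    for (int i = 0; i < NBLOCKS; i += 2)
        blocks[i] = malloc(BLKSIZE);

    /* Record the first byte of each new allocation, one number per line. */
    for (int i = 0; i < NBLOCKS; i += 2)
        fprintf(out, "%u\n", blocks[i] ? (unsigned)blocks[i][0] : 0u);

    fclose(out);
    return 0;
}

Run it twice with two different file names and diff the outputs to test the claim.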
Even in "normal" single-threaded programs, memory is freed and reallocated many times. Malloc will return to you memory that you had used before.
Even single-threaded code may do malloc then free then malloc and get back previously used, non-zero memory.
There is no guarantee that brk/sbrk return zeroed data; this is an implementation detail. It is generally a good idea for an OS to do that, to reduce the possibility that sensitive information from one process finds its way into another process, but nothing in the specification says it will be the case.
Also, the fact that malloc is implemented on top of brk/sbrk is also implementation-dependent, and can even vary based on the size of the allocation; for example, large allocations on Linux have traditionally used mmap on /dev/zero instead.
Basically, you can neither rely on malloc()ed regions containing garbage nor on it being all-0, and no program should assume one way or the other about it.
The simplest way I can think of putting the answer is like this:
If I am looking for wall space to paint a mural, I don't care whether it is white or covered with old graffiti, since I'm going to prime it and paint over it. I only care whether I have enough square footage to accommodate the picture, and I care that I'm not painting over an area that belongs to someone else.
That is how malloc thinks. Zeroing memory every time a process ends would be wasted computational effort. It would be like re-priming the wall every time you finish painting.
There is a whole ecosystem of programs living inside a computer's memory, and you cannot control the order in which mallocs and frees happen.
Imagine that the first time you run your application and malloc() something, it gives you an address with some garbage. Then your program shuts down, your OS marks that area as free. Another program takes it with another malloc(), writes a lot of stuff and then leaves. You run your program again, it might happen that malloc() gives you the same address, but now there's different garbage there, that the previous program might have written.
I don't actually know the implementation of malloc() in any system and I don't know if it implements any kind of security measure (like randomizing the returned address), but I don't think so.
It is very deterministic.
I need to have two buffers (A and B), and when either of the buffers is full it needs to write its contents to the "merged" buffer, C. Using memcpy seems to be too slow for this operation, as noted below in my question. Any insight?
I haven't tried it, but I've been told that memcpy will not work. This is an embedded system. 2 buffers, both of different sizes, and when they are full they dump to a common 'C' buffer which is bigger than the other two. Not sure why I got downvoted.
Edit: Buffer A and B will be written to prior to C being completely empty.
The memcpy is taking too long and the common buffer 'C' is getting overrun.
memcpy is pretty much the fastest way to copy memory. It's frequently a compiler intrinsic and is highly optimized. If it's too slow you're probably going to have to find another way to speed your program up.
I'd expect that copying memory faster is not the lowest hanging fruit in a program.
Some other opportunities could be to copy less memory or copy less often. See if you can profile your program to analyze its performance and find where the biggest opportunities are.
Edit: With your edit it sounds like the problem is that there's not enough time for you to deal with some data all at once between the time you notice that it needs to be handled and the time that more data comes in. A solution in this case could be, as one of the commenters noted, to have additional buffers that you can flip between. So you may then have time to handle the data in one while another is filled up.
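A sketch of that flipping ("ping-pong") arrangement, with illustrative names throughout (consume stands in for whatever drains a full buffer into C):

#include <stddef.h>
#include <string.h>

#define BUF_SIZE 256
#define C_SIZE   4096

static unsigned char buf_c[C_SIZE];
static size_t c_len = 0;

/* Drain a full buffer into the common buffer C (illustrative). */
static void consume(const unsigned char *data, size_t len)
{
    if (c_len + len <= C_SIZE) {
        memcpy(buf_c + c_len, data, len);
        c_len += len;
    }
}

static unsigned char bufs[2][BUF_SIZE];
static size_t fill = 0;
static int active = 0;   /* index of the buffer currently being filled */

/* Producer side, e.g. called from an ISR for each incoming byte. */
void put_byte(unsigned char b)
{
    bufs[active][fill++] = b;
    if (fill == BUF_SIZE) {
        consume(bufs[active], BUF_SIZE);  /* hand off the full buffer */
        active ^= 1;                      /* flip; the producer keeps going */
        fill = 0;
    }
}

int main(void)
{
    for (int i = 0; i < 1000; i++)
        put_byte((unsigned char)i);
    return 0;
}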
The only way you can merge two buffers without memcpy is by linking them, like a linked list of buffer fragments (or an array of fragments).
Consider that a buffer may not always have to be contiguous. I've done a lot of work with 600dpi images, which means very large buffers. If you can break them up into a sequence of smaller fragments, that helps reduce fragmentation as well as unnecessary copying due to buffer growth.
In some cases buffers must be contiguous, if your API / microcontroller mandates it. For example, Windows bitmap functions require contiguity. You could try the C realloc function, but it may internally work like the combination of malloc+memcpy+free. Either way, as others have said earlier, memcpy is supposed to be the fastest possible way of copying contiguous buffers.
If the buffer must be contiguous, you could reserve a large address space and commit it on demand. The implementation depends on the platform. For example, on Win32 the VirtualAlloc function can do that. This gives you a very large contiguous buffer, of which only a portion is allocated (committed). Later you can commit further pages as the buffer needs to grow. This trick requires the concept of virtual memory, which may not be available on a microcontroller.
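On Win32 the reserve-then-commit pattern looks roughly like this (a sketch; the sizes are arbitrary):

#include <windows.h>

int main(void)
{
    SIZE_T reserved  = 256 * 1024 * 1024;  /* 256 MiB of address space */
    SIZE_T committed = 64 * 1024;          /* back only 64 KiB with real pages */

    unsigned char *buf = VirtualAlloc(NULL, reserved,
                                      MEM_RESERVE, PAGE_NOACCESS);
    if (buf == NULL)
        return 1;

    if (VirtualAlloc(buf, committed, MEM_COMMIT, PAGE_READWRITE) == NULL)
        return 1;

    buf[0] = 42;   /* the committed part is usable; the rest is reserved only */

    /* To grow: VirtualAlloc(buf + committed, extra, MEM_COMMIT, PAGE_READWRITE); */

    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}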
You may think that this is a coincidence that the topic of my question is similar to the name of the forum but I actually got here by googling the term "stack overflow".
I use the OPNET network simulator, in which I program using C. I think I am having a problem with big array sizes. It seems that I am hitting some sort of memory allocation limit. It may have to do with OPNET, Windows, my laptop's memory, or, most likely, the C language. The problem occurs when I try to use nested arrays with a total number of elements coming to several thousand integers. I think I am exceeding an overall memory allocation limit and I am wondering if there is a way to raise this cap.
Here's the exact problem description:
I basically have a routing table. Let's call it routing_tbl[n], where n = 30, meaning I am supporting 30 nodes (routers). Now, for each node in this table, I keep info about the many (hundreds of) available paths, in an array called paths[p]. Again, for each path in this array, I keep the list of nodes that belong to it in an array called hops[h]. So I am using at least n*p*h integers' worth of memory, but this table contains other information as well. In the same function, I am also using another nested array that consumes almost 40,000 integers.
As soon as I run my simulation, it quits, complaining about a stack overflow. It works when I reduce the total size of the routing table.
What do you think causes the problem and how can it be solved?
Much appreciated
Ali
It may help if you post some code. Edit the question to include the problem function and the error.
Meanwhile, here's a very generic answer:
The two principal causes of a stack overflow are 1) a recursive function, or 2) the allocation of a large number of local variables.
Recursion
If your function calls itself, like this:

int recurse(int number) {
    return (recurse(number));
}

then, since local variables and function arguments are stored on the stack, it will fill the stack and cause a stack overflow.
Large local variables
If you try to allocate a large array of local variables then you can overflow the stack in one easy go. A function like this may cause the issue:
void hugeStack (void) {
    unsigned long long reallyBig[100000000][1000000000];
    ...
}
There is quite a detailed answer to this similar question.
Somehow you are using a lot of stack. Possible causes include that you're creating the routing table on the stack, you're passing it on the stack, or else you're generating lots of calls (eg by recursively processing the whole thing).
In the first two cases you should create it on the heap and pass around a pointer to it. In the third case you'll need to rewrite your algorithm in an iterative form.
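A hedged sketch of the first fix (the struct layout and dimensions are illustrative, not OPNET's actual types):

#include <stdlib.h>

#define NODES 30
#define PATHS 300
#define HOPS  30

struct route_info {
    int hops[PATHS][HOPS];   /* nodes on each available path */
    /* ... other per-node bookkeeping ... */
};

int main(void)
{
    /* ~30 * 300 * 30 ints would blow a default stack; the heap is fine. */
    struct route_info *routing_tbl = malloc(NODES * sizeof *routing_tbl);
    if (routing_tbl == NULL)
        return 1;

    /* ... pass routing_tbl (a pointer) to functions instead of an array ... */

    free(routing_tbl);
    return 0;
}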
Stack overflows can happen in C when the number of embedded recursive calls is too high. Perhaps you are calling a function from itself too many times?
This error may also be due to allocating too much memory in static declarations. You can switch to dynamic allocations through malloc() to fix this type of problem.
Is there a reason why you cannot use the debugger on this program?
It depends on where you have declared the variable.
A local variable (i.e. one declared on the stack) is limited by the maximum stack frame size. This is a limit of the compiler you are using (and can usually be adjusted with compiler flags).
A dynamically allocated object (i.e. one that is on the heap) is limited by the amount of available memory. This is a property of the OS (and can technically be larger than physical memory if you have a smart OS).
Many operating systems dynamically expand the stack as you use more of it. When you start writing to a memory address that's just beyond the stack, the OS assumes your stack has just grown a bit more and allocates it an extra page (usually 4096 bytes on x86, exactly 1024 ints).
The problem is, on the x86 (and some other architectures) the stack grows downwards but C arrays grow upwards. This means if you access the start of a large array, you'll be accessing memory that's more than a page away from the edge of the stack.
If you initialise your array to 0 starting from the end of the array (that's right, make a for loop to do it), the errors might go away. If they do, this is indeed the problem.
You might be able to find some OS API functions to force stack allocation, or compiler pragmas/flags. I'm not sure about how this can be done portably, except of course for using malloc() and free()!
You are unlikely to run into a stack overflow with unthreaded compiled C unless you do something particularly egregious like runaway recursion or a cosmic memory leak. However, your simulator probably has a threading package, which will impose stack size limits. When you start a new thread, it will allocate a chunk of memory for that thread's stack. Likely there is a parameter you can set somewhere that establishes the default stack size, or there may be a way to grow the stack dynamically. For example, pthreads has a function pthread_attr_setstacksize() which you call prior to starting a new thread to set its size. Your simulator may or may not be using pthreads. Consult your simulator's reference documentation.
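For reference, setting a bigger stack with pthreads looks like this (a sketch, assuming you control thread creation; the 16 MiB figure is arbitrary; compile with -pthread):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    /* ... run the stack-hungry simulation code here ... */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 16 * 1024 * 1024);  /* 16 MiB stack */

    pthread_t tid;
    if (pthread_create(&tid, &attr, worker, NULL) != 0) {
        perror("pthread_create");
        return 1;
    }
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}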