After reading the following paper https://people.freebsd.org/~lstewart/articles/cpumemory.pdf ("What every programmer should know about memory") I wanted to try one of the author's tests, namely measuring the effect of the TLB on the final execution time.
I am working on a Samsung Galaxy S3 that embeds a Cortex-A9.
According to the documentation:
we have two micro TLBs, one for instructions and one for data, at the L1 level (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddiifa.html)
The main TLB is located at the L2 level (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddiifa.html)
Data micro TLB has 32 entries (instruction micro TLB has either 32 or 64 entries)
L1 size == 32 KB
L1 cache line == 32 bytes
L2 size == 1 MB
I wrote a small program that allocates an array of structs with N entries. Each entry is 32 bytes, so it fits in a cache line.
I perform several read accesses and measure the execution time.
typedef struct {
int elmt; // sizeof(int) == 4 bytes
char padding[28]; // 4 + 28 = 32B == cache line size
}entry;
volatile entry ** entries = NULL;
//Allocate memory and init to 0
entries = calloc(NB_ENTRIES, sizeof(entry *));
if(entries == NULL) perror("calloc failed"); exit(1);
for(i = 0; i < NB_ENTRIES; i++)
{
entries[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);
}
entries[LAST_ELEMENT]->elmt = -1
//Randomly access and init with random values
n = -1;
i = 0;
while(++n < NB_ENTRIES -1)
{
//init with random value
entries[i]->elmt = rand() % NB_ENTRIES;
//loop till we reach the last element
while(entries[entries[i]->elmt]->elmt != -1)
{
entries[i]->elmt++;
if(entries[i]->elmt == NB_ENTRIES)
entries[i]->elmt = 0;
}
i = entries[i]->elmt;
}
gettimeofday(&tStart, NULL);
for(i = 0; i < NB_LOOPS; i++)
{
j = 0;
while(j != -1)
{
j = entries[j]->elmt
}
}
gettimeofday(&tEnd, NULL);
time = (tEnd.tv_sec - tStart.tv_sec);
time *= 1000000;
time += tEnd.tv_usec - tStart.tv_usec;
time *= 100000
time /= (NB_ENTRIES * NBLOOPS);
fprintf(stdout, "%d %3lld.%02lld\n", NB_ENTRIES, time / 100, time % 100);
I have an outer loop that makes NB_ENTRIES vary from 4 to 1024.
As one can see in the figure below, when NB_ENTRIES == 256, the execution time is longer.
When NB_ENTRIES == 404 I get an "out of memory" error (why? micro TLB exceeded? main TLB exceeded? page tables exceeded? virtual memory for the process exceeded?).
Can someone please explain what is really going on from 4 to 256 entries, and then from 257 to 404 entries?
EDIT 1
As suggested, I ran membench (src code); the results are below:
EDIT 2
In the following paper (page 3) they ran (I suppose) the same benchmark. However, the different steps are clearly visible in their plots, which is not the case for mine.
Right now, according to their results and explanations, I can only identify a few things.
The plots confirm that the L1 cache line size is 32 bytes because, as they said,
"once the array size exceeds the size of the data cache (32KB), the reads begin to generate misses [...] an inflection point occurs when every read generates a miss".
In my case the very first inflection point appears when stride == 32 bytes.
- The graph shows that we have a second-level (L2) cache. I think it is depicted by the yellow line (1MB == L2 size).
- Therefore the last two plots above the latter probably reflect the latency of accessing main memory (+ TLB?).
However from this benchmark, I am not able to identify:
the cache associativity. Normally the D-cache and I-cache are 4-way set associative (Cortex-A9 TRM).
The TLB effects. As they said,
in most systems, a secondary increase in latency is indicative of the TLB, which caches a limited number of virtual to physical translations.[..] The absence of a rise in latency attributable to TLB indicates that [...]"
large page sizes have probably been used/implemented.
EDIT 3
This link explains the TLB effects using another membench graph. One can actually observe the same effects in my graph.
On a 4KB page system, as you grow your strides, while they're still < 4K, you'll enjoy less and less utilization of each page [...] you'll have to access the 2nd level TLB on each access [...]
The Cortex-A9 supports a 4KB page mode.
Indeed, as one can see in my graph, latencies increase up to stride == 4K; then, when the stride reaches 4K,
you suddenly start benefiting again since you're actually skipping whole pages.
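For reference, the core of a membench-style stride test is just a timed walk over a buffer with a variable stride; below is a minimal sketch of that idea (the buffer size and stride are parameters, and none of this is taken from the actual membench source):

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/time.h>

    /* Walk 'size' bytes of 'buf' with the given stride and return the
     * average time per access in nanoseconds. */
    static double stride_read(volatile uint8_t *buf, size_t size, size_t stride)
    {
        struct timeval t0, t1;
        size_t i, accesses = 0;
        gettimeofday(&t0, NULL);
        for (i = 0; i < size; i += stride) {
            (void)buf[i];               /* volatile read, so it is not optimized away */
            accesses++;
        }
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        return accesses ? (us * 1000.0) / accesses : 0.0;
    }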
tl;dr -> Provide a proper MVCE.
This answer should be a comment but is too big to be posted as a comment, so I'm posting it as an answer instead:
I had to fix a bunch of syntax errors (missing semicolons) and declare undefined variables.
After fixing all those problems, the code did NOTHING (the program quit even before executing the first mmap). My tip is to use curly brackets all the time; here are your first and second errors caused by NOT doing so:
// after calloc:
if(entries == NULL) perror("calloc failed"); exit(1);
// after mmap
if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);
both lines just terminate your program regardless of the condition.
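With braces, the checks only exit on failure, for example:

    // after calloc:
    if (entries == NULL) {
        perror("calloc failed");
        exit(1);
    }
    // after mmap:
    if (entries[i] == MAP_FAILED) {
        perror("mmap failed");
        exit(1);
    }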
Here you got an endless loop (reformatted, added curly brackets but no other change):
//Randomly access and init with random values
n = -1;
i = 0;
while (++n < NB_ENTRIES -1) {
//init with random value
entries[i]->elmt = rand() % NB_ENTRIES;
//loop till we reach the last element
while (entries[entries[i]->elmt]->elmt != -1) {
entries[i]->elmt++;
if (entries[i]->elmt == NB_ENTRIES) {
entries[i]->elmt = 0;
}
}
i = entries[i]->elmt;
}
The first iteration starts by setting entries[0]->elmt to some random value, then the inner loop increments it until it reaches the index of LAST_ELEMENT. Then i is set to that index, and the second outer iteration overwrites the end marker (-1) with some other random value. From then on the inner loop keeps incrementing mod NB_ENTRIES forever, until you hit CTRL+C.
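If the goal is a random pointer chase over all entries that ends at the -1 marker, one way to build it is to decide the visiting order up front (a Fisher-Yates shuffle) and only then link the entries, so the end marker is written last and never overwritten. A minimal sketch of that idea (build_chain is a hypothetical helper, not part of the question's code; 'entry' and 'entries' are the question's types):

    /* Build a random pointer-chasing chain over all entries, ending in -1. */
    static void build_chain(volatile entry **entries, int nb_entries)
    {
        int *order = malloc(nb_entries * sizeof(int));
        int k;
        if (order == NULL) { perror("malloc failed"); exit(1); }
        for (k = 0; k < nb_entries; k++)
            order[k] = k;
        /* Fisher-Yates shuffle of positions 1..N-1; index 0 stays first
         * because the timing loop starts the chase at entries[0]. */
        for (k = nb_entries - 1; k > 1; k--) {
            int r = 1 + rand() % k;                 /* r in [1, k] */
            int tmp = order[k]; order[k] = order[r]; order[r] = tmp;
        }
        for (k = 0; k < nb_entries - 1; k++)
            entries[order[k]]->elmt = order[k + 1];
        entries[order[nb_entries - 1]]->elmt = -1;  /* end marker, written last */
        free(order);
    }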
Conclusion
If you want help, then post a Minimal, Complete, and Verifiable example and not something else.
Related
I am struggling with this problem.
If I check my application with the top command in Linux, I see that VIRT is always the same (it has been running for a couple of days) while RES increases a bit (between 4 bytes and 32 bytes) after an operation. I perform an operation once every 60 minutes.
An operation consists of reading some frames over SPI, adding them to several linked lists and, after a while, extracting them in another thread.
I executed Valgrind with the following options:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes -v ./application
Once I close it (my app would run forever if everything goes well) I check for leaks and I see nothing but the threads I haven't closed. Only "possibly lost" bytes in threads; no "definitely lost" and no "indirectly lost" bytes.
I do not know if this is normal. I have used Valgrind in the past to find some leaks and it always worked well. I am pretty sure it is working well right now too, but I can't explain the RES issue.
I have checked the linked lists to see if I was leaving some nodes without freeing them but, to me, they seem to be right.
I have 32 linked lists in one array. I have done this to make the push/pop operations easier without having 32 separate lists. I don't know if this could be causing the problem. If this is the case, I will split them:
typedef struct PR_LL_M_node {
uint8_t val[60];
struct PR_LL_M_node *next;
} PR_LL_M_node_t;
pthread_mutex_t PR_LL_M_lock[32];
PR_LL_M_node_t *PR_LL_M_head[32];
uint16_t PR_LL_M_counter[32];
int16_t LL_M_InitLinkedList(uint8_t LL_M_number) {
if (pthread_mutex_init(&PR_LL_M_lock[LL_M_number], NULL) != 0) {
printf("Mutex LL M %d init failed\n", LL_M_number);
return -1;
}
PR_LL_M_ready[LL_M_number] = 0;
PR_LL_M_counter[LL_M_number] = 0;
PR_LL_M_head[LL_M_number] = NULL;
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return PR_LL_M_counter[LL_M_number];
}
int16_t LL_M_Push(uint8_t LL_M_number, uint8_t *LL_M_frame, uint16_t LL_M_size) {
pthread_mutex_lock(&PR_LL_M_lock[LL_M_number]);
PR_LL_M_node_t *current = PR_LL_M_head[LL_M_number];
if (current != NULL) {
while (current->next != NULL) {
current = current->next;
}
/* now we can add a new variable */
current->next = malloc(sizeof(PR_LL_M_node_t));
memset(current->next->val, 0x00, 60);
/* Clean buffer before using it */
memcpy(current->next->val, LL_M_frame, LL_M_size);
current->next->next = NULL;
} else {
PR_LL_M_head[LL_M_number] = malloc(sizeof(PR_LL_M_node_t));
memcpy(PR_LL_M_head[LL_M_number]->val, LL_M_frame, LL_M_size);
PR_LL_M_head[LL_M_number]->next = NULL;
}
PR_LL_M_counter[LL_M_number]++;
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return PR_LL_M_counter[LL_M_number];
}
int16_t LL_M_Pop(uint8_t LL_M_number, uint8_t *LL_M_frame) {
PR_LL_M_node_t *next_node = NULL;
pthread_mutex_lock(&PR_LL_M_lock[LL_M_number]);
if ((PR_LL_M_head[LL_M_number] == NULL)) {
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return -1;
}
if ((PR_LL_M_counter[LL_M_number] == 0)) {
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return -1;
}
next_node = PR_LL_M_head[LL_M_number]->next;
memcpy(LL_M_frame, PR_LL_M_head[LL_M_number]->val, 60);
free(PR_LL_M_head[LL_M_number]);
PR_LL_M_counter[LL_M_number]--;
PR_LL_M_head[LL_M_number] = next_node;
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return PR_LL_M_counter[LL_M_number];
}
This way I pass the number of the linked list I want to manage and I operate on it. What do you think? Is RES a real problem? I think it could be related to other parts of the application, but I have commented out most of it and it always happens when the push/pop operations are used. If I leave push/pop out, RES keeps its value.
When I extract the values I use a do/while loop until I get -1 as the response of the pop operation.
Your observations do not seem to indicate a problem:
VIRT and RES are expressed in KiB (units of 1024 bytes). Depending on how virtual memory works on your system, the numbers should always be multiples of the page size, which is most likely 4KiB.
RES is the amount of resident memory, in other words the amount of RAM actually mapped for your program at a given time.
If the program goes to sleep for 60 minutes at a time, the system (Linux) may determine that some of its pages are good candidates for discarding or swapping should it need to map memory for other processes. RES will diminish accordingly. Note incidentally that top is one such process that needs memory and thus can disturb your process while observing it, a computing variant of Heisenberg's Principle.
When the process wakes up, whatever memory is accessed by the running thread is mapped back into memory, either from the executable file(s) if it was discarded or from the swap file, or from nowhere if the discarded page was all null bytes or an unused part of the stack. This causes RES to increase again.
There might be other problems in the code, especially in code that you did not post, but if VIRT does not change, you do not seem to have a memory leak.
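If you want to watch this from inside the program instead of through top, you can read the VmSize and VmRSS lines from /proc/self/status on Linux. A small sketch (not something your code needs, just a way to log the same numbers top shows):

    #include <stdio.h>
    #include <string.h>

    /* Print the current virtual size and resident set size of this process
     * (the kernel reports both values in kB). */
    static void print_mem_usage(void)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (f == NULL)
            return;
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
                fputs(line, stdout);
        }
        fclose(f);
    }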
So I'm doing some computation on 4 million nodes.
The very basic serial version just has a for loop which iterates 4 million times and does the computation. This takes roughly 1.2 seconds.
When I split the for loop into, say, 4 for loops, each doing 1/4 of the computation, the total time becomes 1.9 seconds.
I guess there is some overhead in creating for loops, and maybe it has to do with the CPU preferring to compute data in chunks.
What really bothers me is that when I put the 4 loops on 4 threads on an 8-core machine, each thread takes 0.9 seconds to finish.
I was expecting each of them to take only 1.9/4 seconds instead.
I don't think there is any race condition or synchronization issue, since all I do is use a for loop to create the 4 threads (which takes 200 microseconds) and then a for loop to join them.
The computation reads from a shared array and writes to a different shared array.
I am sure the threads are not writing to the same byte.
Where could the overhead come from?
main: ncores: number of cores. node_size: size of the graph (4 million nodes)
for(i = 0 ; i < ncores ; i++){
int *t = (int*)malloc(sizeof(int));
*t = i;
int iret = pthread_create( &thread[i], NULL, calculate_rank_p, (void*)(t));
}
for (i = 0; i < ncores; i++)
{
pthread_join(thread[i], NULL);
}
calculate_rank_p: vector is the rank vector for page rank calculation
void *calculate_rank_p(void *argument) {
int index = *(int*)argument;
for(i = index; i < node_size ; i+=ncores)
current_vector[i] = calc_r(i, vector);
return NULL;
}
calc_r: this is just a page rank calculation using compressed row format.
double calc_r(int i, double *vector){
double prank = 0;
int j;
for(j = row_ptr[i]; j < row_ptr[i+1]; j++){
prank += vector[col_ind[j]] * val[j];
}
return prank;
}
Everything that is not declared is a global variable.
The computation read from a shared array and write to a different shared array. I am sure they are not writing to the same byte.
It's impossible to be sure without seeing relevant code and having some more details, but this sounds like it could be due to false sharing, or ...
the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206
UPDATE
This looks like it could very well trigger false sharing, depending on the size of a vector (though there is still not enough information in the post to be sure, as we don't see how the various vectors are allocated).
for(i = index; i < node_size ; i+=ncores)
Instead of interleaving which core works on which data (i += ncores), give each of them a contiguous range of data to work on, as sketched below.
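A sketch of one way to do that for the loop in the question, assuming the same globals (node_size, ncores, current_vector, vector) and the same thread argument as in the original post:

    void *calculate_rank_p(void *argument)
    {
        int index = *(int *)argument;
        /* Give each thread one contiguous block instead of interleaving,
         * so two threads rarely write into the same cache line of current_vector. */
        int chunk = (node_size + ncores - 1) / ncores;
        int start = index * chunk;
        int end   = start + chunk;
        if (end > node_size)
            end = node_size;
        for (int i = start; i < end; i++)
            current_vector[i] = calc_r(i, vector);
        return NULL;
    }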
For me, the same surprise when built and run in Debug (other test code, though).
In Release, everything is as expected ;)
I'm trying to port liballoc to a small kernel that I'm writing for my thesis.
In order to do that, I need a function that scans a range of addresses to find free (and used) pages.
I wrote this function, which scans from an address (it should be page-table aligned) and prints whether each page is free or used:
uint32_t check_pages(uint32_t startAddr,uint32_t length){
pdirectory* dir = vmm_get_directory ();
pd_entry* pagedir = dir->m_entries;
int cfreepage = 0;
int cusedpage = 0;
uint32_t x = 0, y = 0;
for(x = startAddr; x < (startAddr+length) ; x+=4096*1024){ // check 1 pagetable at a time
if(pagedir[x>>22] != 0){ // check if pagetable exist
ptable* table =(ptable*) pagedir[x>>22];
for(y=x;;y+=4096){ // scan every single pages in the pagetable
pt_entry* page = (pt_entry*)table->m_entries [ PAGE_TABLE_INDEX (y) ];
if(((uint32_t)(page)>>22) != 0){ // check if a page is present FIXME this might be the problem
cusedpage++;
kernelPrintf("Found used page number: %d\n",PAGE_TABLE_INDEX (y));
}
else{
cfreepage++;
kernelPrintf("Found free page number: %d\n",PAGE_TABLE_INDEX (y));
}
if(PAGE_TABLE_INDEX (y)==1023) break;
}
}
else{ // if a pagetable doesn't exist add 1024 free pages to the counter
kernelPrintf("found free pagetable! (1024 free pages)\n");
cfreepage+=1024;
}
}
kernelPrintf("Used Pages Found: %d\n",cusedpage);
kernelPrintf("Free Pages Found: %d\n",cfreepage);
return 0;
}
This code works, but it has one issue: some pages that are used are reported as free.
I think that the problem is this if:
if(((uint32_t)(page)>>22) != 0)
There might be a better way to check whether a page is used or not.
Thanks for the help
if (x >> 22) checks whether any bit higher than the 21st is set. I have no clue why you shift by 22 (it looks like an arbitrary number - why the heck do you do it this way?). If you want to check whether an entry is present (in a paging structure of any level), check bit 0 of that entry. Note that checking the highest bits would only work if the entry was assigned a high address (it wouldn't catch, say, 0x100000).
Also note that if the present bit is 0, all the other bits are ignored by the hardware, hence the OS can store any values in them, which is information that might also come in handy one day.
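A sketch of that check, assuming 32-bit x86-style entries where bit 0 is the Present flag (as in the question's paging structures):

    #include <stdint.h>

    #define PAGE_PRESENT 0x1u   /* bit 0 of a PDE/PTE: set when the entry maps something */

    static inline int entry_present(uint32_t e)
    {
        return (e & PAGE_PRESENT) != 0;
    }

    /* In check_pages() the test could then become, for example:
     *   if (entry_present((uint32_t)table->m_entries[PAGE_TABLE_INDEX(y)])) cusedpage++;
     *   else cfreepage++;
     */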
This may not be what you want, but it might help.
I had once a similar task to do (memory allocator for an embedded system), here is what I did:
Define and align the allocable pages
Define elsewhere an array that references all the pages: I update the array[idx] value when I allocate/release a page, which makes the count easy (see the sketch below)
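A minimal sketch of that bookkeeping array, assuming a fixed pool of pages (the names and the pool size are made up for illustration, not taken from my project):

    #include <stdint.h>

    #define POOL_PAGES 1024                 /* illustration only */

    static uint8_t  page_used[POOL_PAGES];  /* 0 = free, 1 = allocated */
    static uint32_t used_count;

    static int alloc_page(void)             /* returns a page index, or -1 if full */
    {
        for (uint32_t i = 0; i < POOL_PAGES; i++) {
            if (!page_used[i]) {
                page_used[i] = 1;
                used_count++;
                return (int)i;
            }
        }
        return -1;
    }

    static void free_page(int idx)
    {
        if (idx >= 0 && idx < POOL_PAGES && page_used[idx]) {
            page_used[idx] = 0;
            used_count--;
        }
    }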
The project I'm working on has to test the data memory of a dsPIC30F chip before the program runs. Due to industry requirements, we cannot utilize any pre-defined libraries that C has to offer. That being said, here is my methodology for testing the RAM:
Step 1 - Write the word 0xAAAA to a specific location in memory (defined by a LoopIndex added to the START_OF_RAM address)
Step 2 - increment LoopIndex
Step 3 - Repeat Steps 1-2 until LoopIndex + START_OF_RAM >= END_OF_RAM
Step 4 - Reset LoopIndex = 0
Step 5 - Read memory at LoopIndex+START_OF_RAM
Step 6 - If memory = 0xAAAA, continue, else throw RAM_FAULT_HANDLER
Step 7 - increment LoopIndex
Step 8 - Repeat Step 5 - 7 until LoopIndex + START_OF_RAM >= END_OF_RAM
Now, the weird part is that I can step through the code, no problem. It will slowly loop through each memory address for as long as my little finger can press F8, but as soon as I set a breakpoint at Step 4, it jumps to a random, generic interrupt handler for no apparent reason. I thought it could be because the for() loop I use may exceed END_OF_RAM, but I've changed the bounds of the conditions and it still doesn't like to run.
Any insight would be helpful.
void PerformRAMTest()
{
// Locals
uint32_t LoopIndex = 0;
uint16_t *AddressUnderTest;
uint32_t RAMvar = 0;
uint16_t i = 0;
// Loop through RAM and write the first pattern (0xAA) - from the beginning to the first RESERVED block
for(LoopIndex = 0x0000; LoopIndex < C_RAM_END_ADDRESS; LoopIndex+= 2)
{
AddressUnderTest = (uint32_t*)(C_RAM_START_ADDRESS + LoopIndex);
*AddressUnderTest = 0xAAAA;
}// end for
for(LoopIndex = 0x0000; LoopIndex < C_RAM_END_ADDRESS; LoopIndex += 2)
{
AddressUnderTest = (uint32_t*)(C_RAM_START_ADDRESS + LoopIndex);
if(*AddressUnderTest != 0xAAAA)
{
// If what was read does not equal what was written, log the
// RAM fault in NVM and call the RAMFaultHandler()
RAMFaultHandler();
}// end if
}
// Loop through RAM and write then verify the second pattern (0x55)
// - from the beginning to the first RESERVED block
// for(LoopIndex = C_RAM_START_ADDRESS; LoopIndex < C_RAM_END_ADDRESS; LoopIndex++)
// {
// AddressUnderTest = (uint32_t*)(C_RAM_START_ADDRESS + LoopIndex);
// *AddressUnderTest = 0x5555;
// if(*AddressUnderTest != 0x5555)
// {
// // If what was read does not equal what was written, log the
// // RAM fault in NVM and call the RAMFaultHandler()
// RAMFaultHandler();
// }
// }
}// end PerformRAMTest
You can see that the second pass of the test writes 0x55. This was the original implementation that was given to me, but it never worked (at least as far as debugging/running; the same random interrupt was encountered with this method of writing then immediately reading the same address before moving on)
UPDATE: After a few Clean & Builds, the code will now run through until it hits the stack pointer (WREG15), skips over it, then errors out. Here is a new sample of the code in question:
if(AddressUnderTest >= &SPLIMIT && AddressUnderTest <= SPLIMIT)
{
// if true, set the Loop Index to point to the end of the stack
LoopIndex = (uint16_t)SPLIMIT;
}
else if(AddressUnderTest == &SPLIMIT) // checking to see if AddressUnderTest points directly to the stack [This works while the previous >= &SPLIMIT does not. It will increment into the stack, update, THEN say "oops, I just hit the stack" and error out.]
{
LoopIndex = &SPLIMIT;
}
else
{
*AddressUnderTest = 0xAAAA;
}
I think you actually want (C_RAM_START_ADDRESS + LoopIndex) < C_RAM_END_ADDRESS as your loop condition. Currently, you are looping from C_RAM_START_ADDRESS to C_RAM_START_ADDRESS + C_RAM_END_ADDRESS which I assume is writing past the end of the RAM.
You also should really factor out the repeated code into a separate function that takes the test pattern as a parameter (DRY).
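For example, the two passes could be factored into one helper that takes the test pattern as a parameter. A sketch, reusing the names from the question (C_RAM_START_ADDRESS, C_RAM_END_ADDRESS, RAMFaultHandler); note that, like the original, it still walks over the region used by the active stack, which the other answer discusses:

    static void TestRAMWithPattern(uint16_t pattern)
    {
        uint32_t LoopIndex;
        volatile uint16_t *AddressUnderTest;

        /* write pass */
        for (LoopIndex = 0; (C_RAM_START_ADDRESS + LoopIndex) < C_RAM_END_ADDRESS; LoopIndex += 2)
        {
            AddressUnderTest = (volatile uint16_t *)(C_RAM_START_ADDRESS + LoopIndex);
            *AddressUnderTest = pattern;
        }

        /* verify pass */
        for (LoopIndex = 0; (C_RAM_START_ADDRESS + LoopIndex) < C_RAM_END_ADDRESS; LoopIndex += 2)
        {
            AddressUnderTest = (volatile uint16_t *)(C_RAM_START_ADDRESS + LoopIndex);
            if (*AddressUnderTest != pattern)
            {
                RAMFaultHandler();
                return;
            }
        }
    }

    /* usage: TestRAMWithPattern(0xAAAA); TestRAMWithPattern(0x5555); */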
Okay, so there are a number of things that we can look at to get a better understanding of where your problem may be. There are some things that I would like to point out - and hopefully we can figure this out together. The first thing that I noticed that seems a little out of place is this comment:
"...total RAM goes to 0x17FFE..."
I looked up the data sheet for the dsPIC30F6012A. You can see in Figure 3-8 (pg. 33) that the SRAM space is 8K and runs from 0x0800 to 0x2800. Also, there is this little tidbit:
"All effective addresses are 16 bits wide and point to bytes within the data space"
So, you can use 16-bit values for your addresses. I am a little confused by your update as well. SPLIM is a register that you set the value for - and that value limits the size of your stack. I'm not sure what the value of your SPLIMIT is, but W15 is your actual stack pointer register, and the value that is stored there is the address of the top of your stack:
"There is a Stack Pointer Limit register (SPLIM) associated
with the Stack Pointer. SPLIM is uninitialized at
Reset. As is the case for the Stack Pointer, SPLIM<0>
is forced to ‘0’ because all stack operations must be
word aligned. Whenever an Effective Address (EA) is
generated using W15 as a source or destination
pointer, the address thus generated is compared with
the value in SPLIM. If the contents of the Stack Pointer
(W15) and the SPLIM register are equal and a push
operation is performed, a Stack Error Trap will not
occur."
Finally, the stack grows from the lowest available SRAM address value up to SPLIM. So I would propose setting the SPLIM value to something reasonable, let's say 512 bytes (although it would be best to test how much room you need for your stack).
Since this particular stack grows upwards, I would start at 0x0800 plus what we added for the stack limit and then test from there (which would be 0x0A00). This way you won't have to worry about your stack region.
Given the above, here is how I would go about doing this.
void PerformRAMTest (void)
{
#define SRAM_START_ADDRESS 0x0800
/* Stack size = 512 bytes. Assign STACK_LIMIT
to SPLIM register during configuration. */
#define STACK_SIZE 0x0200
/* -2, see pg 35 of dsPIC30F6012A datasheet. */
#define STACK_LIMIT ((SRAM_START_ADDRESS + STACK_SIZE) - 2)
#define SRAM_BEGIN_TEST_ADDRESS ((volatile uint16_t *)(STACK_LIMIT + 2))
#define SRAM_END_TEST_ADDRESS ((volatile uint16_t *)0x2800)
#define TEST_VALUE 0xAAAA
/* No need for 32 bit address values on this platform */
volatile uint16_t * AddressUnderTest = SRAM_BEGIN_TEST_ADDRESS;
/* Write to memory */
while (AddressUnderTest < SRAM_END_TEST_ADDRESS)
{
*AddressUnderTest = TEST_VALUE;
AddressUnderTest++;
}
AddressUnderTest = SRAM_BEGIN_TEST_ADDRESS;
/* Read from memory */
while (AddressUnderTest < SRAM_END_TEST_ADDRESS)
{
if (*AddressUnderTest != TEST_VALUE)
{
RAMFaultHandler();
break;
}
else
{
AddressUnderTest++;
}
}
}
My code was a bit rushed so I am sure there are probably some errors (feel free to edit), but hopefully this will help get you on the right track!
Any ideas why it works fine for values like 0, 1, 2, 3, 4... and seg faults for values like >15?
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void *fib(void *fibToFind);
main(){
pthread_t mainthread;
long fibToFind = 15;
long finalFib;
pthread_create(&mainthread,NULL,fib,(void*) fibToFind);
pthread_join(mainthread,(void*)&finalFib);
printf("The number is: %d\n",finalFib);
}
void *fib(void *fibToFind){
long retval;
long newFibToFind = ((long)fibToFind);
long returnMinusOne;
long returnMinustwo;
pthread_t minusone;
pthread_t minustwo;
if(newFibToFind == 0 || newFibToFind == 1)
return newFibToFind;
else{
long newFibToFind1 = ((long)fibToFind) - 1;
long newFibToFind2 = ((long)fibToFind) - 2;
pthread_create(&minusone,NULL,fib,(void*) newFibToFind1);
pthread_create(&minustwo,NULL,fib,(void*) newFibToFind2);
pthread_join(minusone,(void*)&returnMinusOne);
pthread_join(minustwo,(void*)&returnMinustwo);
return returnMinusOne + returnMinustwo;
}
}
It runs out of memory (out of space for stacks), or out of valid thread handles?
You're asking for an awful lot of threads, which require lots of stack/context.
Windows (and Linux) have a stupid "big [contiguous] stack" idea.
From the documentation on pthread_create:
"On Linux/x86-32, the default stack size for a new thread is 2 megabytes."
If you manufacture 10,000 threads, you need 20 GB of RAM.
I built a version of OP's program, and it bombed with some 3500 (p)threads
on Windows XP64.
See this SO thread for more details on why big stacks are a really bad idea:
Why are stack overflows still a problem?
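(For reference, pthreads does let you ask for a smaller per-thread stack, which postpones but does not remove the blow-up. A sketch of how the pthread_create call in the question could be changed; the 64 KB figure is arbitrary, just well above PTHREAD_STACK_MIN:)

    /* inside main, in place of the existing pthread_create call */
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);   /* 64 KB instead of the ~2 MB default */
    pthread_create(&mainthread, &attr, fib, (void *)fibToFind);
    pthread_attr_destroy(&attr);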
If you give up on big stacks, and implement a parallel language with heap allocation
for activation records
(our PARLANSE is
one of these) the problem goes away.
Here's the first (sequential) program we wrote in PARLANSE:
(define fibonacci_argument 45)
(define fibonacci
(lambda(function natural natural )function
`Given n, computes nth fibonacci number'
(ifthenelse (<= ? 1)
?
(+ (fibonacci (-- ?))
(fibonacci (- ? 2))
)+
)ifthenelse
)lambda
)define
Here's an execution run on an i7:
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonaccisequential
Starting Sequential Fibonacci(45)...Runtime: 33.752067 seconds
Result: 1134903170
Here's the second, which is parallel:
(define coarse_grain_threshold 30) ; technology constant: tune to amortize fork overhead across lots of work
(define parallel_fibonacci
(lambda (function natural natural )function
`Given n, computes nth fibonacci number'
(ifthenelse (<= ? coarse_grain_threshold)
(fibonacci ?)
(let (;; [n natural ] [m natural ] )
(value (|| (= m (parallel_fibonacci (-- ?)) )=
(= n (parallel_fibonacci (- ? 2)) )=
)||
(+ m n)
)value
)let
)ifthenelse
)lambda
)define
Making the parallelism explicit makes the programs a lot easier to write, too.
The parallel version we test by calling (parallel_fibonacci 45). Here
is the execution run on the same i7 (which arguably has 8 processors,
but it is really 4 processors hyperthreaded so it really isn't quite 8
equivalent CPUs):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallelcoarse
Parallel Coarse-grain Fibonacci(45) with cutoff 30...Runtime: 5.511126 seconds
Result: 1134903170
A speedup near 6+, not bad for not-quite-8 processors. One of the other
answers to this question ran the pthreads version; it took "a few seconds"
(to blow up) computing Fib(18), and this is 5.5 seconds for Fib(45).
This tells you pthreads
is a fundamentally bad way to do lots of fine grain parallelism, because
it has really, really high forking overhead. (PARLANSE is designed to
minimize that forking overhead).
Here's what happens if you set the technology constant to zero (forks on every call
to fib):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallel
Starting Parallel Fibonacci(45)...Runtime: 15.578779 seconds
Result: 1134903170
You can see that amortizing fork overhead is a good idea, even if you have fast forks.
Fib(45) produces a lot of grains. Heap allocation
of activation records solves the OP's first-order problem (thousands of pthreads each
with 1Mb of stack burns gigabytes of RAM).
But there's a second order problem: 2^45 PARLANSE "grains" will burn all your memory too
just keeping track of the grains even if your grain control block is tiny.
So it helps to have a scheduler that throttles forks once you have "a lot"
(for some definition of "a lot" significantly less than 2^45) grains to prevent the
explosion of parallelism from swamping the machine with "grain" tracking data structures.
It has to unthrottle forks when the number of grains falls below a threshold
too, to make sure there is always lots of logical, parallel work for the physical
CPUs to do.
You are not checking for errors - in particular, from pthread_create(). When pthread_create() fails, the pthread_t variable is left undefined, and the subsequent pthread_join() may crash.
If you do check for errors, you will find that pthread_create() is failing. This is because you are trying to generate almost 2000 threads - with default settings, this would require 16GB of thread stacks to be allocated alone.
You should revise your algorithm so that it does not generate so many threads.
I tried to run your code, and came across several surprises:
printf("The number is: %d\n", finalFib);
This line has a small error: %d means printf expects an int, but is passed a long int. On most platforms this is the same, or will have the same behavior anyways, but pedantically speaking (or if you just want to stop the warning from coming up, which is a very noble ideal too), you should use %ld instead, which will expect a long int.
Your fib function, on the other hand, seems non-functional. Testing it on my machine, it doesn't crash, but it yields 1047, which is not a Fibonacci number. Looking closer, it seems your program is incorrect on several aspects:
void *fib(void *fibToFind)
{
long retval; // retval is never used
long newFibToFind = ((long)fibToFind);
long returnMinusOne; // variable is read but never initialized
long returnMinustwo; // variable is read but never initialized
pthread_t minusone; // variable is never used (?)
pthread_t minustwo; // variable is never used
if(newFibToFind == 0 || newFibToFind == 1)
// you miss a cast here (but you really shouldn't do it this way)
return newFibToFind;
else{
long newFibToFind1 = ((long)fibToFind) - 1; // variable is never used
long newFibToFind2 = ((long)fibToFind) - 2; // variable is never used
// reading undefined variables (and missing a cast)
return returnMinusOne + returnMinustwo;
}
}
Always take care of compiler warnings: when you get one, usually, you really are doing something fishy.
Maybe you should revise the algorithm a little: right now, all your function does is return the sum of two undefined values, hence the 1047 I got earlier.
Implementing the Fibonacci sequence using a recursive algorithm means you need to call the function again. As others noted, it's quite an inefficient way of doing it, but it's easy, so I guess all computer science teachers use it as an example.
The regular recursive algorithm looks like this:
int fibonacci(int iteration)
{
if (iteration == 0 || iteration == 1)
return 1;
return fibonacci(iteration - 1) + fibonacci(iteration - 2);
}
I don't know to which extent you were supposed to use threads—just run the algorithm on a secondary thread, or create new threads for each call? Let's assume the first for now, since it's a lot more straightforward.
Casting integers to pointers and vice-versa is a bad practice because if you try to look at things at a higher level, they should be widely different. Integers do maths, and pointers resolve memory addresses. It happens to work because they're represented the same way, but really, you shouldn't do this. Instead, you might notice that the function called to run your new thread accepts a void* argument: we can use it to convey both where the input is, and where the output will be.
So building upon my previous fibonacci function, you could use this code as the thread main routine:
void* fibonacci_offshored(void* pointer)
{
int* pointer_to_number = pointer;
int input = *pointer_to_number;
*pointer_to_number = fibonacci(input);
return NULL;
}
It expects a pointer to an integer, takes its input from it, and then writes its output there.1 You would then create the thread like this:
int main()
{
int value = 15;
pthread_t thread;
// on input, value should contain the number of iterations;
// after the end of the function, it will contain the result of
// the fibonacci function
int result = pthread_create(&thread, NULL, fibonacci_offshored, &value);
// error checking is important! try to crash gracefully at the very least
if (result != 0)
{
perror("pthread_create");
return 1;
}
if (pthread_join(thread, NULL) != 0)
{
perror("pthread_join");
return 1;
}
// now, value contains the output of the fibonacci function
// (note that value is an int, so just %d is fine)
printf("The value is %d\n", value);
return 0;
}
If you need to call the Fibonacci function from new distinct threads (please note: that's not what I'd advise, and others seem to agree with me; it will just blow up for a sufficiently large number of iterations), you'll first need to merge the fibonacci function with the fibonacci_offshored function. It will bulk it up considerably, because dealing with threads is heavier than dealing with regular functions.
void* threaded_fibonacci(void* pointer)
{
int* pointer_to_number = pointer;
int input = *pointer_to_number;
if (input == 0 || input == 1)
{
*pointer_to_number = 1;
return NULL;
}
// we need one argument per thread
int minus_one_number = input - 1;
int minus_two_number = input - 2;
pthread_t minus_one;
pthread_t minus_two;
// don't forget to check! especially that in a recursive function where the
// recursion set actually grows instead of shrinking, you're bound to fail
// at some point
if (pthread_create(&minus_one, NULL, threaded_fibonacci, &minus_one_number) != 0)
{
perror("pthread_create");
*pointer_to_number = 0;
return NULL;
}
if (pthread_create(&minus_two, NULL, threaded_fibonacci, &minus_two_number) != 0)
{
perror("pthread_create");
*pointer_to_number = 0;
return NULL;
}
if (pthread_join(minus_one, NULL) != 0)
{
perror("pthread_join");
*pointer_to_number = 0;
return NULL;
}
if (pthread_join(minus_two, NULL) != 0)
{
perror("pthread_join");
*pointer_to_number = 0;
return NULL;
}
*pointer_to_number = minus_one_number + minus_two_number;
return NULL;
}
Now that you have this bulky function, adjustments to your main function are going to be quite easy: just change the reference to fibonacci_offshored to threaded_fibonacci.
int main()
{
int value = 15;
pthread_t thread;
int result = pthread_create(&thread, NULL, threaded_fibonacci, &value);
if (result != 0)
{
perror("pthread_create");
return 1;
}
pthread_join(thread, NULL);
printf("The value is %d\n", value);
return 0;
}
You might have been told that threads speed up parallel processes, but there's a limit somewhere where it's more expensive to set up the thread than run its contents. This is a very good example of such a situation: the threaded version of the program runs much, much slower than the non-threaded one.
For educational purposes, this program runs out of threads on my machine when the number of desired iterations is 18, and takes a few seconds to run. By comparison, using an iterative implementation, we never run out of threads, and we have our answer in a matter of milliseconds. It's also considerably simpler. This would be a great example of how using a better algorithm fixes many problems.
Also, out of curiosity, it would be interesting to see if it crashes on your machine, and where/how.
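For comparison, the iterative implementation mentioned above is just a loop carrying the last two values; a sketch, matching the base cases of the recursive fibonacci shown earlier:

    int fibonacci_iterative(int iteration)
    {
        int previous = 1, current = 1;   /* fibonacci(0) == fibonacci(1) == 1, as above */
        for (int i = 2; i <= iteration; i++)
        {
            int next = previous + current;
            previous = current;
            current = next;
        }
        return current;
    }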
1. Usually, you should try to avoid changing the meaning of a variable between its value on input and its value after the function returns. For instance, here, on input, the variable is the number of iterations we want; on output, it's the result of the function. Those are two very different meanings, and that's not really a good practice. I didn't feel like using dynamic allocations to return a value through the void* return value.