I have a combinatorics problem that I am trying to solve as efficiently as possible, but all I keep coming up with are ways to brute-force it. I was hoping to get some suggestions on where to look for more information, or on how to proceed.
Here is the information:
I have a list of disks with properties. These disks are categorized based on these properties. One of those properties (We'll call it Property_A) is used later on in a formula to generate a value (We'll call this Value_A).
From these disks, I have to make stacks of size N, where N is the number of disks in the stack, and each disk in the stack comes from a different category. The order of the disks in the stack matters. Disks in the same category can be swapped out for other disks with different values for Property_A to produce a new stack.
Example:
Stack #1 is made of the following disks
Disk #1 from Category A,
Disk #1 from Category B,
Disk #1 from Category C
I can then swap out the first Disk for Disk #2 from Category A to make a new stack (Stack #2)
Stack #2 is made of the following disks
Disk #2 from Category A,
Disk #1 from Category B,
Disk #1 from Category C
Once this stack is made, a formula is generated for the stack where we plug in Property_A to get Value_A. Formulas are assigned to stack types, where a stack type is defined by the disks' categories (in order).
Example: Only Stack #1 and Stack #2 would share the same formula, because they are made from disks whose categories are in the same order.
Stack #1 is made of the following disks
Disk #1 from Category A,
Disk #1 from Category B,
Disk #1 from Category C
Stack #2 is made of the following disks
Disk #2 from Category A,
Disk #1 from Category B,
Disk #1 from Category C
Stack #3 is made of the following disks
Disk #1 from Category B,
Disk #1 from Category A,
Disk #1 from Category C
Stack #4 is made of the following disks
Disk #1 from Category B,
Disk #1 from Category E,
Disk #1 from Category F
Here is the problem:
Given a target Value_A, N, and some number (we'll call it Number), find all possible stacks of size N whose Value_A is within Number of the target, i.e. in the range
(target Value_A) - Number <= (stack's Value_A) <= (target Value_A) + Number
I was thinking that I could use dynamic programming to help speed this up, but given that Property_A varies greatly from disk to disk, I felt that I would get more misses than hits and would just end up with a solution similar in speed to brute force.
I also considered giving Property_A an upper/lower bound but that would only help with finding stacks within the same stack type. Once the stack type changed, you'd have to reset the upper/lower bound and start over.
This still seems like the most promising way to go thus far, though. With this approach, I could find all possible stack-type combinations, then narrow down the disks in the stacks using upper/lower bounds.
These things aren't organized in any certain way right now, so feel free to make up assumptions about how data is organized.
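To make the idea concrete, here is a rough sketch of that approach, assuming (just for illustration) that disks are grouped per category, each category's Property_A values are kept sorted, and a stack's Value_A is simply the sum of its disks' Property_A values (the real formula depends on the stack type, so the bounds would need adjusting). All the names here (Category, search, stacks_in_range) are made up:

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double *prop_a;   /* Property_A values for this category, sorted ascending */
    int     count;
} Category;

/* Recursively fill one position of the stack at a time, pruning branches
   whose partial sum can no longer land inside [lo, hi] even with the best
   possible choices for the remaining positions. */
static void search(Category *type[], int n, int pos, double partial,
                   const double *min_rest, const double *max_rest,
                   double lo, double hi, int *picked)
{
    if (pos == n) {
        printf("stack:");
        for (int i = 0; i < n; i++) printf(" %d", picked[i]);
        printf("  Value_A = %g\n", partial);
        return;
    }
    for (int i = 0; i < type[pos]->count; i++) {
        double p = type[pos]->prop_a[i];
        if (partial + p + min_rest[pos + 1] > hi) break;    /* sorted: every later disk is bigger */
        if (partial + p + max_rest[pos + 1] < lo) continue; /* even the largest remainder is too small */
        picked[pos] = i;
        search(type, n, pos + 1, partial + p, min_rest, max_rest, lo, hi, picked);
    }
}

/* Enumerate every stack of one stack type (an ordered list of n categories)
   whose (assumed) Value_A lies within Number of the target.
   Call this once per stack-type combination. */
static void stacks_in_range(Category *type[], int n, double target, double number)
{
    double *min_rest = malloc((n + 1) * sizeof *min_rest);
    double *max_rest = malloc((n + 1) * sizeof *max_rest);
    int    *picked   = malloc(n * sizeof *picked);

    min_rest[n] = max_rest[n] = 0.0;
    for (int pos = n - 1; pos >= 0; pos--) {
        min_rest[pos] = min_rest[pos + 1] + type[pos]->prop_a[0];
        max_rest[pos] = max_rest[pos + 1] + type[pos]->prop_a[type[pos]->count - 1];
    }
    search(type, n, 0, 0.0, min_rest, max_rest, target - number, target + number, picked);

    free(min_rest);
    free(max_rest);
    free(picked);
}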
I was wondering what the advantages and disadvantages of linked lists are compared to contiguous arrays in C, so I read the Wikipedia article about linked lists.
https://en.wikipedia.org/wiki/Linked_list#Disadvantages
According to this article, the disadvantages are the following:
They use more memory than arrays because of the storage used by their pointers.
Nodes in a linked list must be read in order from the beginning as linked lists are inherently sequential access.
Difficulties arise in linked lists when it comes to reverse traversing. For instance, singly linked lists are cumbersome to navigate backwards, and while doubly linked lists are somewhat easier to read, memory is wasted in allocating space for a back-pointer.
Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
I understand the first 3 points but I am having a hard time with the last one:
Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
The article about CPU caches does not mention anything about non-contiguous memory. As far as I know, CPU caches just cache frequently used addresses, with something like a 10^-6 cache-miss rate.
Therefore, I do not understand why the CPU cache should be less efficient when it comes to non-contiguous memory.
CPU caches actually do two things.
The one you mentioned is caching recently used memory.
The other, however, is predicting which memory is going to be used in the near future. The algorithm is usually quite simple: it assumes that the program is processing a big array of data, and whenever it accesses some memory, it prefetches a few more bytes beyond it.
This doesn't work for a linked list, as the nodes may be placed randomly in memory.
Additionally, the CPU loads memory in bigger blocks (cache lines of 64 or 128 bytes). So for an int64 array, a single read brings in data for 8 or 16 elements. For a linked list, it reads one block and the rest of it may be wasted, since the next node can be in a completely different chunk of memory.
And last but not least, related to the previous point: a linked list takes more memory for its own management; the simplest version takes at least an additional sizeof(pointer) bytes per node for the pointer to the next node. But that's not so much about the CPU cache anymore.
The article is only scratching the surface, and gets some things wrong (or at least questionable), but the overall outcome is usually about the same: linked lists are much slower.
One thing to note is that "nodes are stored incontiguously [sic]" is an overly strong claim. It is true that in general nodes returned by, for example, malloc may be spread around in memory, especially if nodes are allocated at different times or from different threads. However, in practice, many nodes are often allocated on the same thread, at the same time, and these will often end up quite contiguous in memory, because good malloc implementations are, well, good! Furthermore, when performance is a concern, you may often use special allocators on a per-object basis, which allocate the fixed-size nodes from one or more contiguous chunks of memory, which will provide great spatial locality.
So you can assume that in at least some scenarios, linked lists will give you reasonable to good spatial locality. It largely depends on whether you are adding most or all of your list elements at once (linked lists do OK), or are constantly adding elements over a longer period of time (linked lists will have poor spatial locality).
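For illustration, a bare-bones fixed-size pool sketch might look like this (names such as NodePool and pool_alloc are made up): nodes are carved out of one contiguous slab, so nodes allocated back to back end up adjacent in memory, which is exactly the spatial locality a general-purpose allocator can't promise.

#include <stdlib.h>

struct Node {
    struct Node *next;
    int item;
};

struct NodePool {
    struct Node *slab;       /* one contiguous chunk of nodes */
    size_t       used;       /* slots handed out so far */
    size_t       capacity;
    struct Node *free_list;  /* recycled nodes, linked through their next field */
};

static int pool_init(struct NodePool *p, size_t capacity) {
    p->slab = malloc(capacity * sizeof *p->slab);
    p->used = 0;
    p->capacity = capacity;
    p->free_list = NULL;
    return p->slab != NULL;
}

static struct Node *pool_alloc(struct NodePool *p) {
    if (p->free_list) {                  /* reuse a freed slot first */
        struct Node *n = p->free_list;
        p->free_list = n->next;
        return n;
    }
    if (p->used < p->capacity)
        return &p->slab[p->used++];      /* adjacent to the previously allocated node */
    return NULL;                         /* a real pool would grow another slab here */
}

static void pool_free(struct NodePool *p, struct Node *n) {
    n->next = p->free_list;              /* push back onto the intrusive free list */
    p->free_list = n;
}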
Now, on the side of lists being slow, one of the main issues glossed over with linked lists is the large constant factors associated with some operations relative to the array variant. Everyone knows that accessing an element given its index is O(n) in a linked list and O(1) in an array, so you don't use the linked list if you are going to do a lot of accesses by index. Similarly, everyone knows that adding an element to the middle of a list takes O(1) time in a linked list (once you already have a pointer to the insertion point), and O(n) time in an array, so the former wins in that scenario.
What they don't address is that even operations that have the same algorithmic complexity can be much slower in practice in one implementation...
Let's take iterating over all the elements in a list (looking for a particular value, perhaps). That's an O(n) operation regardless of whether you use a linked or an array representation. So it's a tie, right?
Not so fast! The actual performance can vary a lot! Here is what typical find() implementations would look like when compiled at -O2 optimization level in x86 gcc, thanks to godbolt which makes this easy.
Array
C Code
int find_array(int val, int *array, unsigned int size) {
    for (unsigned int i = 0; i < size; i++) {
        if (array[i] == val)
            return i;
    }
    return -1;
}
Assembly (loop only)1
.L6:
add rsi, 4
cmp DWORD PTR [rsi-4], edi
je .done
add eax, 1
cmp edx, eax
jne .L6
Linked List
C Code
typedef struct Node {
    struct Node *next;
    int item;
} Node;

Node *find_list(int val, Node *listptr) {
    while (listptr) {
        if (listptr->item == val)
            return listptr;
        listptr = listptr->next;
    }
    return 0;
}
Assembly (loop only)
.L20:
cmp DWORD PTR [rax+8], edi
je .done
mov rax, QWORD PTR [rax]
test rax, rax
jne .L20
Just eyeballing the C code, both methods look competitive. The array method is going to have an increment of i, a couple of comparisons, and one memory access to read the value from the array. The linked list version is going to have a couple of (adjacent) memory accesses to read the Node.item and Node.next members, and a couple of comparisons.
The assembly seems to bear that out: the linked list version has 5 instructions and the array version2 has 6. All of the instructions are simple ones that have a throughput of 1 per cycle or more on modern hardware.
If you test it though, with both lists fully resident in L1, you'll find that the array version executes at about 1.5 cycles per iteration, while the linked list version takes about 4! That's because the linked list version is limited by its loop-carried dependency on listptr. The one line listptr = listptr->next boils down to one instruction, but that one instruction will never execute more than once every 4 cycles, because each execution depends on the completion of the prior one (you need to finish reading listptr->next before you can calculate listptr->next->next). Even though modern CPUs can execute something like 2 loads every cycle, these loads take ~4 cycles to complete, so you get a serial bottleneck here.
The array version also has loads, but the address doesn't depend on the prior load:
add rsi, 4
cmp DWORD PTR [rsi-4], edi
It depends only on rsi, which is simply calculated by adding 4 each iteration. An add has a latency of one cycle on modern hardware, so this doesn't create a bottleneck (unless you get below 1 cycle/iteration). So the array loop is able to use the full power of the CPU, executing many instructions in parallel. The linked list version is not.
This isn't unique to "find" - any linked-list operation that needs to iterate over many elements will have this pointer-chasing behavior, which is inherently slow on modern hardware.
1I omitted the epilogue and prologue for each assembly function because they really aren't doing anything interesting. Both versions had no epilogue at all really, and the prologue was very similar for both, peeling off the first iteration and jumping into the middle of the loop. The full code is available for inspection in any case.
2It's worth noting that gcc didn't really do as well as it could have here, since it maintains both rsi as the pointer into the array, and eax as the index i. This means two separate cmp instructions, and two increments. Better would have been to maintain only the pointer rsi in the loop, and to compare against (array + 4*size) as the "not found" condition. That would eliminate one increment. Additionally, you could eliminate one cmp by having rsi run from -4*size up to zero, and indexing into array using [rdi + rsi] where rdi is array + 4*size. It shows that even today, optimizing compilers don't get everything right!
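If you want to give the compiler a better shot at that tighter loop, one option (just a sketch, not guaranteed to change the generated code) is to write the source in the pointer-against-end style yourself, so there is no separate index to maintain:

int find_array_ptr(int val, int *array, unsigned int size) {
    int *end = array + size;
    for (int *p = array; p != end; p++) {
        if (*p == val)
            return (int)(p - array);   /* recover the index for the caller */
    }
    return -1;
}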
The CPU cache usually works on a page of a certain size, for example (the common one) 4096 bytes or 4 kB, and accesses the information it needs from there. Fetching a page consumes a considerable amount of time, let's say 1000 cycles. If we have a contiguous array of 4096 bytes, we will fetch one 4096-byte page into the cache and probably most of the data will be there. If not, maybe we need to fetch another page to get the rest of the data.
Example: we have 2 pages covering 0-8191 and the array lies between 2048 and 6144. We will fetch page #1 (0-4095) to get the first part of the array and then page #2 (4096-8191) to get the rest of the elements we want. This results in fetching 2 pages from memory into our cache to get our data.
What happens in a list though? In a list the data are non-contiguous which means that the elements are not in contiguous places in memory so they are probably scattered through various pages. This means that a CPU has to fetch a lot of pages from memory to the cache to get the desired data.
Example: Node #1 at mem_address = 1000, Node #2 at mem_address = 5000, Node #3 at mem_address = 18000. If the CPU works with 4k page sizes, it has to fetch 3 different pages from memory to find the data it wants.
Also, the hardware uses prefetch techniques to fetch blocks of memory before they are needed. With a small linked list, let's say A -> B -> C, the first pass will be slow because the prefetcher cannot predict the next block to fetch. On later passes, once the prefetcher is warmed up, it may start predicting the path of the linked list and fetch the correct blocks in time.
Summarizing: arrays are easily predictable by the hardware and sit in one place, so they are easy to fetch, while linked lists are unpredictable and scattered throughout memory, which makes the life of the prefetcher and the CPU harder.
BeeOnRope's answer is good and highlights the cycle-count overheads of traversing a linked list vs iterating through an array, but as he explicitly says, that's assuming "both lists fully resident in L1". However, it's far more likely that an array will fit better in L1 than a linked list, and the moment you start thrashing your cache the performance difference becomes huge. RAM can be more than 100x slower than L1, with L2 and L3 (if your CPU has any) being between 3x and 14x slower.
On a 64-bit architecture, each pointer takes 8 bytes, and a doubly linked list needs two of them, or 16 bytes of overhead. If you only want a single 4-byte uint32 per entry, that means you need 5x as much storage for the dlist as you need for an array. Arrays guarantee locality, and although malloc can do OK at locality if you allocate stuff together in the right order, you often can't. Let's approximate poor locality by saying it takes 2x the space, so a dlist uses 10x as much "locality space" as an array. That's enough to push you from fitting in L1 to overflowing into L3, or, even worse, from L2 into RAM.
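As a side note, the 16 bytes above counts only the two pointers; on a typical 64-bit ABI, alignment padding rounds the node up further, so the real ratio can be even worse than 5x. A quick check (the struct name is just illustrative):

#include <stdint.h>
#include <stdio.h>

struct DNode {
    struct DNode *prev;   /* 8 bytes */
    struct DNode *next;   /* 8 bytes */
    uint32_t value;       /* 4 bytes of actual payload */
};                        /* usually padded up to 24 bytes */

int main(void) {
    printf("array element: %zu bytes, dlist node: %zu bytes\n",
           sizeof(uint32_t), sizeof(struct DNode));
    return 0;
}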
I'm optimizing shared memory in a CUDA kernel, so I need to determine which variables are the best candidates (that is, most often accessed) to be stored in shared memory. I know I can page through the code and count the number of times each variable is accessed, but the kernel is rather complicated, so I'm hoping there's a way to automate this. Can I have GDB count the number of times each variable is accessed in general CPU code or specifically in cuda-gdb? Or are there other profiling/debugging tools that could be useful?
Thanks.
I'm not aware of any performance metric capable of counting the number of times a specific variable is accessed by the involved threads. Perhaps you should have a look at the disassembled microcode (via cuobjdump with the --dump-sass option) to see how many times this occurs.
Consider, though, that in modern architectures (e.g., Fermi and Kepler), shared memory can be seen as a "controlled L1 cache", so using it may not be necessary if the variables are not being evicted from L1. To get an idea of how frequently global memory variables are evicted from the L1 cache, you may have a look at some of nvprof's performance metrics if you do not want to count the accesses manually. For example, you may consider global_cache_replay_overhead, gld_efficiency, gst_efficiency, etc. You can find the full list of performance metrics in the Metrics Reference.
Finally, as suggested by @talonmies, you may wish to consider using registers instead of shared memory for some frequently used variables, for even faster access.
I am a newbie in C programming. I have an assignment to find the number of data cache levels in the CPU and also the hit time of each level. I am looking at "C Program to determine Levels & Size of Cache" but finding it difficult to interpret the results. How is the number of cache levels revealed?
Any pointers will be helpful.
Assuming you don't have a way to cheat (like some way of getting that information from the operating system or some CPU identification register):
The basic idea is that (by design), your L1 cache is faster than your L2 cache which is faster than your L3 cache... In any normal design, your L1 cache is also smaller than your L2 cache which is smaller than your L3 cache...
So you want to allocate a large-ish block of memory and then access (read and write) it sequentially[1] until you notice that the time taken to perform X accesses has risen sharply. Then keep going until you see the same thing again. You would need to allocate a memory block larger than the largest cache you are hoping to discover.
This requires access to some low-overhead timestamp counter for the actual measurement (as pointed out in the referred-to answer).
[1] or depending on whether you want to try to fool any clever prefetching that may skew the results, randomly within a sequentially progressing N-byte block.
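A rough sketch of that measurement loop, just to make it concrete (constants and names are illustrative; a serious version would use a low-overhead timestamp counter and, as noted in [1], a randomized access pattern to defeat the prefetcher):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 64          /* roughly one cache line */
#define ITERS  (1 << 24)   /* accesses per measurement */

static double ns_per_access(size_t bytes) {
    volatile char *buf = malloc(bytes);
    if (!buf) exit(1);
    for (size_t i = 0; i < bytes; i++) buf[i] = (char)i;   /* touch every page first */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (long i = 0; i < ITERS; i++) {
        buf[idx] += 1;                   /* read-modify-write one byte */
        idx = (idx + STRIDE) % bytes;    /* march through the buffer, wrapping around */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free((void *)buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS;
}

int main(void) {
    /* time per access should jump roughly where the buffer outgrows each cache level */
    for (size_t kb = 4; kb <= 64 * 1024; kb *= 2)
        printf("%8zu KB: %.2f ns/access\n", kb, ns_per_access(kb * 1024));
    return 0;
}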
I am looking to implement a heap allocation algorithm in C for a memory-constrained microcontroller. I have narrowed my search down to 2 options I'm aware of, however I am very open to suggestions, and I am looking for advice or comments from anyone with experience in this.
My Requirements:
-Speed definitely counts, but is a secondary concern.
-Timing determinism is not important - any part of the code requiring deterministic worst-case timing has its own allocation method.
-The MAIN requirement is fragmentation immunity. The device is running a lua script engine, which will require a range of allocation sizes (heavy on the 32 byte blocks). The main requirement is for this device to run for a long time without churning its heap into an unusable state.
Also Note:
-For reference, we are talking about Cortex-M and PIC32 parts, with memory ranging from 128K to 16MB (with a focus on the lower end).
-I don't want to use the compiler's heap because 1) I want consistent performance across all compilers and 2) their implementations are generally very simple and are the same or worse for fragmentation.
-Double-indirection options are out because of the huge Lua code base that I don't want to fundamentally change and revalidate.
My Favored Approaches Thus Far:
1) Have a binary buddy allocator, and sacrifice memory usage efficiency (rounding up to a power of 2 size).
-this would (as I understand) require a binary tree for each order/bin to store free nodes sorted by memory address for fast buddy-block lookup for rechaining.
2) Have two binary trees for free blocks, one sorted by size and one sorted by memory address. (all binary tree links are stored in the block itself)
-allocation would be best-fit using a lookup on the table by size, and then remove that block from the other tree by address
-deallocation would lookup adjacent blocks by address for rechaining
-Both algorithms would also require storing an allocation size before the start of the allocated block, and have blocks go out as a power of 2 minus 4 (or 8 depending on alignment); a small header sketch follows after this list. (Unless they store a binary tree elsewhere to track allocations sorted by memory address, which I don't consider a good option)
-Both algorithms require height-balanced binary tree code.
-Algorithm 2 does not have the requirement of wasting memory by rounding up to a power of two.
-In either case, I will probably have a fixed bank of 32-byte blocks allocated by nested bit fields to off-load blocks this size or smaller, which would be immune to external fragmentation.
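For concreteness, the size-prefix header mentioned above might look something like this (names made up; with a 4-byte header, blocks go out as a power of 2 minus 4, or minus 8 if the header is padded for alignment):

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t size;     /* total block size, including this header */
} BlockHeader;         /* 4 bytes; pad to 8 if your platform wants 8-byte alignment */

/* usable payload of a block handed out as a power of two minus the header */
static inline size_t payload_size(size_t block_size) {
    return block_size - sizeof(BlockHeader);
}

/* recover the header from the pointer the allocator gave back to the caller */
static inline BlockHeader *header_of(void *user_ptr) {
    return (BlockHeader *)((char *)user_ptr - sizeof(BlockHeader));
}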
My Questions:
-Is there any reason why approach 1 would be more immune to fragmentation than approach 2?
-Are there any alternatives that I am missing that might fit the requirements?
If block sizes are not rounded up to powers of two or some equivalent(*), certain sequences of allocation and deallocation will generate an essentially-unbounded amount of fragmentation even if the number of non-permanent small objects that exist at any given time is limited. A binary-buddy allocator will, of course, avoid that particular issue. Otherwise, if one is using a limited number of nicely-related object sizes but not using a "binary buddy" system, one may still have to use some judgment in deciding where to allocate new blocks.
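For reference, rounding a request up to the next power of two (which is effectively what a binary-buddy allocator does to every size) is the classic bit trick below, shown for 32-bit sizes between 1 and 2^31:

#include <stdint.h>

static uint32_t round_up_pow2(uint32_t x) {
    x--;               /* so exact powers of two are left unchanged */
    x |= x >> 1;       /* smear the highest set bit downwards... */
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x + 1;      /* ...then step up to the next power of two */
}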
Another approach to consider is having different allocation methods for things that are expected to be permanent, temporary, or semi-persistent. Fragmentation often causes the most trouble when temporary and permanent things get interleaved on the heap. Avoiding such interleaving may minimize fragmentation.
Finally, I know you don't really want to use double-indirect pointers, but allowing object relocation can greatly reduce fragmentation-related issues. Many Microsoft-derived microcomputer BASICs used a garbage-collected string heap; Microsoft's garbage collector was really horrible, but its string-heap approach can be used with a good one.
You can pick up a (never used for real) Buddy system allocator at http://www.mcdowella.demon.co.uk/buddy.html, with my blessing for any purpose you like. But I don't think you have a problem that is easily solved just by plugging in a memory allocator. The long-running high integrity systems I am familiar with have predictable resource usage, described in 30+ page documents for each resource (mostly cpu and I/O bus bandwidth - memory is easy because they tend to allocate the same amount at startup every time and then never again).
In your case none of the usual tricks (static allocation, free lists, allocation on the stack) can be shown to work because, at least as described to us, you have a Lua interpreter hovering in the background ready to do who knows what at run time. What if it just gets into a loop allocating memory until it runs out?
Could you separate the memory use into two sections - traditional code allocating almost all of what it needs on startup, and never again, and expendable code (e.g. Lua) allowed to allocate whatever it needs when it needs it, from whatever is left over after static allocation? Could you then trigger a restart or some sort of cleanup of the expendable code if it manages to use all of its area of memory, or fragments it, without bothering the traditional code?
I am trying to find the official (or a good enough) reason that the free store is commonly referred to as the heap.
Except for the fact that it grows from the end of the data segment, I can't really think of a good reason, especially since it has very little to do with the heap data structure.
Note: Quite a few people mentioned that it's just a whole bunch of things that are kind of unorganized. But to me the term heap means a bunch of things that are physically dependent on one another: you pull one out from underneath, and everything else collapses onto it. In other words, to me heap sounds loosely organized (e.g., the latest things are on top). This is not exactly how a heap actually works on most computers, though if you put stuff towards the beginning of the heap and then grew it, I guess it could work.
Knuth rejects the term "heap" used as a synonym for the free memory store.
Several authors began about 1975 to call the pool of available memory a "heap." But in the present series of books, we will use that word only in its more traditional sense related to priority queues. (Fundamental Algorithms, 3rd ed., p. 435)
For what it's worth, ALGOL68, which predated C, had an actual keyword heap that was used to allocate space for a variable from the "global heap", as opposed to loc which allocated it on the stack.
But I suspect the use may be simply because there's no real structure to it. By that, I mean that you're not guaranteed to get the best-fit block or the next block in memory, rather you'll take what you're given depending on the whims of the allocation strategy.
Like most names, it was probably thought of by some coder who just needed a name.
I've often heard of it referred to as an arena sometimes (an error message from many moons ago saying that the "memory arena was corrupted"). This brings up images of chunks of memory doing battle in gladiatorial style inside your address space (a la the movie Tron).
Bottom line, it's just a name for an area of memory, you could just as well call it the brk-pool or sbrk-pool (after the calls to modify it) or any of a dozen other names.
I remember when we were putting together comms protocol stacks even before the OSI 7-layer model was a twinkle in someone's eye, we used a layered approach and had to come up with names at each layer for the blocks.
We used blocks, segments, chunks, sections and various other names, all of which simply indicated a fixed-length thing. It may be that heap had a similar origin:
Carol: "Hey, Bob, what's a good name for a data structure that just doles out random bits of memory from a big area?"
Bob: "How about 'steaming pile of horse dung'?"
Carol: "Thanks, Bob, I'll just opt for 'heap', if that's okay with you. By the way, how are things going with the divorce?"
It's named a heap for the contrasting image it conjures up to that of a stack.
In a stack of items, items sit one on top of the other in the order they were placed there, and you can only remove the top one (without toppling the whole thing over).
In a heap, there is no particular order to the way items are placed. You can reach in and remove items in any order because there is no clear 'top' item.
It does a fairly good job of describing the two ways of allocating and freeing memory in a stack and a heap. Yum!
I don't know if it is the first but ALGOL 68 had keyword 'heap' for allocating memory from the free memory heap.
The first use of 'heap' is likely to be found somewhere between 1958, when John McCarthy invented garbage collection for Lisp, and the development of ALGOL 68.
It's unordered. A "heap" of storage.
It isn't an ordered heap data structure.
It's a large, tangled pile of disorganized junk, in which nothing can be found unless you know exactly where to look.
I always figured it was kind of a clever way of describing it in comparison to the stack. The heap is like an overgrown, disorganized stack.
Just like Java and JavaScript, heap (free store) and heap (data structure) are named as they are to ensure that our fraternity is impenetrable to outsiders, thus keeping our jobs secure.
Heap usually has these features:
You can't predict the address of a block malloc() returns.
You have no idea of how the address returned by the next malloc() is related to the address of the previous one.
You can malloc() blocks of variable sizes.
So you have something that can store a bunch of blocks of variable sizes and return them in some generally unpredictable order. What else would you call it if not a "heap"?
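A quick way to see that unpredictability for yourself (the output differs per run, platform, and allocator):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t sizes[] = { 16, 200, 24, 4096, 8 };
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        void *p = malloc(sizes[i]);
        printf("malloc(%4zu) -> %p\n", sizes[i], p);
        /* deliberately not freed here, to mimic a live, unpredictable heap */
    }
    return 0;
}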