How much time does it take to initialize an array to 0 in C?

I am confused about whether
int arr[n] = {0} takes constant time, i.e. O(1), or O(n)?

You should expect O(N) time, but there are caveats:
It is O(1) if the memory occupied by the array is no larger than the word size (and it may stay O(1) all the way up to the cache-line size on modern CPUs).
It is O(N) if the array fits within a single tier of the memory hierarchy.
It gets complicated if the array spans multiple tiers: all modern computers have several tiers (registers, L1/L2/L3 caches, NUMA nodes on multi-CPU machines, virtual memory mapping to swap, ...). If the array can't fit in one tier, there will be a severe performance penalty.
CPU cache architecture impacts the time needed to zero out memory quite severely. In practice, calling it O(N) is somewhat misleading, given that going from 100 to 101 elements may increase the time 10x if it crosses a cache boundary (a line, or the whole cache). It may be even more dramatic if swapping is involved. Beware of the tiered memory model.
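The easiest way to see these tier effects is to measure them. Below is a rough micro-benchmark sketch, assuming a POSIX system (for clock_gettime); the sizes at which the per-byte cost jumps depend entirely on your machine's cache hierarchy, and the printed numbers are only indicative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    /* Double the buffer size each pass and report the per-byte cost of
       zeroing it; the jumps usually line up with L1/L2/L3 and DRAM. */
    for (size_t n = 1 << 10; n <= (size_t)1 << 26; n <<= 1) {
        char *buf = malloc(n);
        if (!buf) return 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int rep = 0; rep < 10; rep++)
            memset(buf, 0, n);                     /* zero the whole block */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* Touch the buffer so the compiler cannot discard the memsets. */
        unsigned long check = 0;
        for (size_t i = 0; i < n; i += 4096) check += (unsigned char)buf[i];

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%10zu bytes: %.3f ns/byte (check %lu)\n", n, ns / (10.0 * n), check);
        free(buf);
    }
    return 0;
}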

Generally, initialization to zero of non-static storage is linear in the size of the storage. If you are reusing the array, it will need to be zeroed each time. There are computer architectures that try to make this free by maintaining bit masks on pages or cache lines and returning zeroes at some point in the cache fill or load unit machinery. These are somewhat rare, but the effect can often be replicated in software if performance is paramount.
Arguably zeroed static storage is free but in practice the pages to back it up will be faulted in and there will be some cost to zero them on most architectures.
(One can end up in situations where the cost of the faults to provide zero-filled pages is quite noticeable. E.g. repeated malloc/free of blocks bigger than some threshold can result in the address space backing the allocation being returned to the OS at each deallocation. The OS then has to zero it for security reasons, even though malloc isn't guaranteed to return zero-filled storage. Worst case, the program then writes zeroes into the same block after it is returned from malloc, so it ends up being twice the cost.)
For cases where large arrays of mostly zeroes will be accessed in a sparse fashion, the zero fill on demand behavior mentioned above can reduce the cost to linear in the number of pages actually used, not the total size. Arranging to do this usually requires using mmap or similar directly, not just allocating an array in C and zero initializing it.
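For illustration, here is a minimal sketch of that zero-fill-on-demand idea, assuming a POSIX-like system where mmap with MAP_ANONYMOUS hands out lazily zeroed pages; only the pages actually written to are ever faulted in and cleared.

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t bytes = (size_t)1 << 30;            /* 1 GiB of "zeroed" storage */
    int *arr = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arr == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pages are zero-filled on first touch, so the cost here is a handful of
       page faults, not a 1 GiB memset. */
    arr[0] = 42;
    arr[1000000] = 7;
    printf("%d %d %d\n", arr[0], arr[1000000], arr[123]);   /* arr[123] reads 0 */

    munmap(arr, bytes);
    return 0;
}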

Related

Scanning Binary Search Tree vs Array

In what ways would finding an element (traverse) in a BST be slower than linearly scanning for it within an array?
The answer supposedly has to do with caching. Can someone explain what exactly this means and why it holds true?
How exactly do you exploit the cache with an array, rather than with a BST?
Thanks
My guess is that using a BST doesn't give you any advantage, since, even if you're caching data (which implies some kind of locality, e.g. you may access the same element again later), the insert and find operations always cost O(h), where h is the height of the tree, and in the worst case even O(n).
Whereas with an array, the first access may be linear, but whenever you access the same element afterwards, spatial and temporal locality mean you are repeatedly touching the same chunk of contiguous memory, and because you already know its index, you get constant-time access.
I assume caching relates to CPU caches, which come with a prefetcher which predicts your next memory accesses. So if you search sequentially in an array your prefetcher recognizes the memory access pattern and loads the memory into the CPU cache before your CPU accesses it. When the CPU actually accesses the next memory element, it is already in the cache and can be accessed quickly.
Without caches & prefetchers your CPU would have to wait for the memory controller to fetch the data from the RAM, which is quite slow in comparison to the CPU cache.
In a BST you don't do sequential access. In the worst case the BST does not reside in contiguous memory; each node is at some arbitrary location in memory. Your prefetcher cannot predict this, so the CPU has to wait for each element to be fetched from memory.
Another point, even without prefetchers, concerns the cache line. On x86_64 a cache line is 64 bytes long. Each integer is 4 or 8 bytes, so you can scan 16 or 8 array entries per cache line. The first access to a memory location fetches the whole line, so you pay for the memory access only once per 8 (or 16) comparisons.
For the BST the same argument applies. The node memory is likely not on the same cache line, so you have to do a memory access for each comparison.
To summarize: A) a memory access takes significantly more time than a comparison; B) whether searching through an array or a BST is faster depends on the number of items.
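To make the argument concrete, here is a rough sketch (not a real BST) that contrasts a prefetch-friendly sequential scan with pointer chasing through individually allocated nodes. With a fresh allocator the nodes may still land close together; on a long-running program with a fragmented heap, the gap is typically much wider than this toy shows.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { int value; struct node *next; };

int main(void) {
    enum { N = 1000000 };
    int *arr = malloc(N * sizeof *arr);
    struct node *head = NULL;
    if (!arr) return 1;

    for (int i = N - 1; i >= 0; i--) {
        arr[i] = i;
        struct node *p = malloc(sizeof *p);    /* one allocation per node */
        if (!p) return 1;
        p->value = i;
        p->next = head;
        head = p;
    }

    clock_t t0 = clock();
    long sum1 = 0;
    for (int i = 0; i < N; i++)                /* sequential: prefetcher-friendly */
        sum1 += arr[i];

    clock_t t1 = clock();
    long sum2 = 0;
    for (struct node *p = head; p; p = p->next)   /* pointer chasing: cache-unfriendly */
        sum2 += p->value;
    clock_t t2 = clock();

    printf("array scan: %ld clocks, list walk: %ld clocks (sums %ld %ld)\n",
           (long)(t1 - t0), (long)(t2 - t1), sum1, sum2);
    return 0;
}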

Cache locality of an array and amortized complexity

An array is usually faster to access than a linked list.
This is primarily due to the cache locality of an array.
I have a few doubts:
On what factor does the amount of data brought into the cache depend? Is it equal to the full cache size of the system? How can we know how much memory this is?
The first access to an array is usually costlier, as the array has to be located and brought into memory. Subsequent operations are comparatively faster. How do we calculate the amortized complexity of the access operation?
What are cache misses? Does it mean (in the context of a linked list) that the required item (current pointer->next) has not been loaded into the cache, and hence memory has to be searched again for its address?
In reality, it is a bit more complex than the simple model you present in the question.
First, you may have multiple caching layers (L1,L2,L3), each of them with different characteristics. Especially, the replacement policy for each cache may use different algorithms as a tradeoff between efficiency and complexity (i.e. cost).
Then, all modern operating systems implement virtual memory mechanisms. It is not enough to cache the data and the instructions (which is what L1..L3 are used for); it is also necessary to cache the association between virtual and physical addresses (in the TLB, the translation lookaside buffer).
To understand the impact of locality, you need to consider all these mechanisms.
Question 1
The minimum unit exchanged between the memory and a cache is the cache line. Typically, it is 64 bytes (but it depends on the CPU model). Let's imagine the caches are empty.
If you iterate on an array, you will pay for a cache miss every 64 bytes. A smart CPU (and a smart program) could analyze the memory access pattern and decide to prefetch contiguous blocks of memory in the caches to increase the throughput.
If you iterate on a list, the access pattern will be random, and you will likely pay a cache miss for each item.
Question 2
The whole array is not searched and brought in the cache on first access. Only the first cache line is.
However, there is also another factor to consider: the TLB. The page size is system dependent; a typical value is 4 KB. The first time the array is accessed, an address translation will occur (and its result will be stored in the TLB). If the array is smaller than 4 KB (the page size), no other address translation will have to be done. If it is bigger, then one translation per page will be done.
Compare this to a list. The probability that multiple items fit in the same page (4 KB) is much lower than for the array. The probability that they fit in the same cache line (64 bytes) is extremely low.
I think it is difficult to calculate a complexity because there are probably other factors to consider. But in this complexity, you have to take into account the cache-line size (for cache misses) and the page size (for TLB misses).
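As a rough worked example (using the 64-byte line and 4 KB page figures above): scanning an array of 10,000 4-byte ints, about 40 KB, touches roughly 40,000 / 64 ≈ 625 cache lines and 40,000 / 4096 ≈ 10 pages, so on the order of 625 cache misses and 10 TLB misses in the worst case. A linked list holding the same 10,000 items, with nodes scattered across the heap, could in the worst case pay one cache miss and one TLB miss per node.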
Question 3
A cache miss is when a given cache line is not in the cache. It can happen at the L1, L2, or L3 level. The higher the level, the more expensive the miss.
A TLB miss occurs when the virtual address is not in the TLB. In that case, a conversion to a physical address is done using the page tables (costly) and the result is stored in the TLB.
So yes, with a linked list, you will likely pay for a higher number of cache and TLB misses than for an array.
Useful links:
Wikipedia article on caches in CPUs: https://en.wikipedia.org/wiki/CPU_cache
Another good article on this topic: http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
An oldish, but excellent article on the same topic: http://arstechnica.com/gadgets/2002/07/caching/
A gallery of various caching effects: http://igoro.com/archive/gallery-of-processor-cache-effects/

Does the size of an array in C affect the time to access or change a piece of data in the array?

Title says it all:
I create an array, say ARRAY[10000].
I set bounds where I only need to access data from 1-100 and 900-2000.
Will accessing and/or changing a piece of data take more time in this manner than if I had declared the array as ARRAY[2001]?
Will access and/or change time differ if I have an array that only holds the data from 1-100 and 900-2000?
I have seen some papers on this, but they have not been clear to me, and I was hoping I could get a more concise answer here.
If the array is accessed infrequently, then the size of the array probably won't make much difference because you'd be getting a cache miss anyway. In that case the time will depend on how quickly the CPU can do any "virtual address to physical address" conversion and fetch the data from RAM.
The more frequently you access something in the array, the more cache effects matter. These cache effects depend heavily on the number, sizes and speeds of different caches.
However, it also depends on your access patterns. For example, if you have a 1 GiB array and frequently access 5 bytes of it, then the size won't make much difference as the 5 bytes you're accessing frequently will be in cache. For another example, if you use a sequential access pattern (e.g. "for each element in array { ... }") then it's likely the CPU might do hardware pre-fetching and you won't pay the full cost of cache misses.
For a typical 80x86 system with a random access pattern and frequent accesses, there are about 5 different sizes that matter. The first (smallest) is the L1 data cache size: if the array fits in the L1 data cache, then it's going to be relatively fast regardless of whether the array is 20 bytes or 20 KiB. The next size is the L2 cache size: as the array gets larger, the ratio of L1 hits to L2 hits decreases and performance gets worse, until (at maybe twice the size of L1) the L1 hits become negligible. Then (for CPUs that do have L3 caches) the same happens with the L3 cache size, where as the array gets larger the ratio of L2 hits to L3 hits decreases.
Once you go larger than the largest cache, the ratio of "cache hits to cache misses" decreases.
The next size that can matter is TLB size. Most modern operating systems use paging, and most modern CPUs cache "virtual address to physical address conversions" in something typically called a Translation Look-aside Buffer. If the array is huge then you start getting TLB misses (in addition to cache misses) which makes performance worse (as the CPU can't avoid extra work when converting your virtual address into a physical address).
Finally, the last size that can matter is how much RAM you actually have. If you have a 10 TiB array, 4 GiB of RAM and 20 TiB of swap space; then the OS is going to be swapping data to/from disk and the overhead of disk IO is going to dominate performance.
Of course often you're creating an executable for many different computers (e.g. "64-bit 80x86, ranging from ancient Athlon to modern Haswell"). In that case you can't know most of the details that matter (like cache sizes) and it becomes a compromise between guesses (estimated overhead from cache misses due to "array too large" vs. estimated overhead from other things caused by "array too small").
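A rough, hedged way to see these thresholds on your own machine is a random-access sketch like the one below. The steps in the timings typically correspond to the L1/L2/L3 and TLB sizes described above, although the exact sizes and numbers are machine specific.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(void) {
    /* Quadruple the working set each pass and time the same number of
       random accesses into it. */
    for (size_t bytes = 1 << 12; bytes <= (size_t)1 << 28; bytes <<= 2) {
        size_t n = bytes / sizeof(int);
        int *a = calloc(n, sizeof(int));
        if (!a) return 1;

        uint64_t x = 88172645463325252ULL;      /* cheap pseudo-random index stream */
        long sum = 0;
        clock_t t0 = clock();
        for (long i = 0; i < 10000000; i++) {
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
            sum += a[x % n];                    /* random access pattern */
        }
        clock_t t1 = clock();

        printf("%10zu bytes: %ld clocks (sum %ld)\n", bytes, (long)(t1 - t0), sum);
        free(a);
    }
    return 0;
}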
No, at least normally the time to access any item in an array will be constant, regardless of the array's size.
This could change if (for example) you define an array larger than memory, so the OS (assuming you're using one) ends up doing some extra paging to support the larger array. In most cases, even that's unlikely to have much impact though.
Maybe yes, but it depends on the size.
Cache size
The size of the access range changes the latency.
Note that int ARRAY[10000] is about 40 KB (with 4-byte int), so it slightly exceeds a typical 32 KB L1 data cache.
A very small access range that fits in the L1 cache (32 KB) costs about 4 clocks on Haswell, while an L2 cache access costs about 12 clocks.
See details here:
http://www.7-cpu.com/cpu/Haswell.html
Cache conflict
If another CPU core modifies some data in the array, the local cache line holding it is moved to the Invalid state.
Accessing an invalidated cache line costs much more latency.
NUMA
Some environments with multiple CPU sockets on the motherboard are non-uniform memory access (NUMA) environments.
They may have a huge memory capacity, but some memory addresses are resident on CPU1's node while other addresses are resident on CPU2's node.
int huge_array[SOME_HUGE_SIZE]; // it spans DIMMs attached to multiple CPUs
// suppose that the entire huge_array is out of cache.
huge_array[ADDRESS_OF_CPU1] = 1; // maybe fast: the backing memory is attached to CPU1
huge_array[ADDRESS_OF_CPU2] = 1; // maybe slow: the backing memory is attached to CPU2
But I wonder whether a huge array really ends up striding multiple CPUs' memory.
Allocating such a huge array may simply fail.
It depends on the OS.
In theory, as was stated by others, array access is constant time, and thus does not cost more or less depending on the array's size. This question seems to be about real-life performance though, and there the array size definitely matters. How so is well explained by the accepted answer by @Brendan.
Things to consider in practice:
* How big are the elements of your array: bool[1000], MyStruct[1000] and MyStruct*[1000] may differ a lot in access performance.
* Try writing code both ways, once using the big array and once keeping the required data in a smaller array. Then run the code on your target hardware(s) and compare performance. You will often be surprised to see that optimization attempts make performance worse, and you learn a lot about hardware and its quirks in the process.
I don't believe it should.
When you access an element, you go to memory location base + (index * element size); therefore, regardless of the array's size, you reach that memory location in the same time.

Are stack float array ops faster than heap float ops on modern x86 systems?

For C float (or double) arrays small enough to fit in the L1 or L2 cache (about 16 KB), and whose size I know at compile time, is there generally a speed benefit to defining them within the function they are used in, so they are stack variables? If so, is it a large difference? I know that in the old days heap variables were much slower than stack ones, but nowadays, with the far more complicated structure of CPU addressing and caches, I don't know if this is true.
I need to do repeated runs of floating-point math over these arrays in 'chunks', over and over again over the same arrays (about 1000 times), and I wonder if I should define them locally. I imagine keeping them in the closest/fastest locations will let me iterate over them repeatedly much faster, but I don't understand the implications of caching in this scenario. Perhaps the compiler or CPU is clever enough to realize what I am doing and make these data arrays highly local on the hardware during the inner processing loops without my intervention, and perhaps it does a better job than I can.
Maybe I risk running out of stack space if I allocate large arrays this way? Or is stack space not hugely limited on modern systems? The array size can be defined at compile time, and I only need one array and one CPU, as I need to stick to a single thread for this work.
It is the allocation and deallocation speed that may make a difference.
Allocating on the stack is just subtracting the required size from the stack pointer, which is normally done for all local variables once upon function entry anyway, so it is essentially free (unless alloca is used). The same applies to deallocating memory on the stack.
Allocating on the heap requires calling malloc or new, which ends up executing an order of magnitude more instructions. The same applies to free and delete.
There should be no difference in the speed of access to the arrays once they have been allocated. However, the stack is more likely to be in the CPU cache already, because previous function calls have already used the same stack memory region for local variables.
If your architecture employs Non-uniform memory access (NUMA) there can be difference in access speed to different memory regions when your thread gets re-scheduled to run on a different CPU from the one that originally allocated memory.
For in-depth treatment of the subject have a read of What Every Programmer Should Know About Memory.
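As a minimal sketch of that point (the sizes and function names here are just placeholders, not anything from the question): the stack version pays nothing extra for allocation, while the heap version goes through malloc/free on every call.

#include <stdio.h>
#include <stdlib.h>

enum { N = 4096 };                         /* 16 KB of floats, as in the question */

/* Hypothetical stand-in for the real floating-point work. */
static float work(float *buf, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        buf[i] = (float)i * 0.5f;
        s += buf[i];
    }
    return s;
}

static float use_stack(void) {
    float a[N];                            /* allocation: one stack-pointer adjustment */
    return work(a, N);                     /* storage vanishes on return */
}

static float use_heap(void) {
    float *a = malloc(N * sizeof *a);      /* calls into the allocator: far more work */
    if (!a) return 0.0f;
    float s = work(a, N);
    free(a);                               /* and again to release it */
    return s;
}

int main(void) {
    float s = 0.0f;
    for (int i = 0; i < 1000; i++)         /* repeated passes, as described in the question */
        s += use_stack() + use_heap();
    printf("%f\n", s);
    return 0;
}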
The answer is: probably not.
On a modern processor such as the i7 the L1/L2/L3 cache sizes are 64K/1MB/8MB, shared across 4x2 cores. Your numbers are a bit off.
The biggest thing to worry about is parallelism. If you can get all 8 cores running 100% that's a good start.
There is no difference between heap and stack memory, it's just memory. Heap allocation is way slower than stack allocation, but hopefully you don't do much of that.
Cache coherency matters. Cache prefetching matters. The order of accessing things in memory matters. Good reading here: https://software.intel.com/en-us/blogs/2009/08/24/what-you-need-to-know-about-prefetching.
But all this is useless until you can benchmark. You can't improve what you can't measure.
Re comment: there is nothing special about stack memory. The thing that usually does matter is keeping all your data close together. If you access local variables a lot then allocating arrays next to them on the stack might work. If you have several blocks of heap memory then subdividing a single allocation might work better than separate allocations. You'll only know if you read the generated code and benchmark.
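For example, here is a hedged sketch of the "subdivide a single allocation" idea, with hypothetical buffer names: putting the buffers that are used together into one struct gives them one contiguous block instead of several scattered ones.

#include <stdlib.h>

enum { N = 4096 };

/* Hypothetical buffers that a hot loop uses together: one allocation keeps
   them adjacent in memory, whereas three separate mallocs may scatter them
   across the heap. */
typedef struct {
    float in[N];
    float tmp[N];
    float out[N];
} workspace;

int main(void) {
    workspace *ws = malloc(sizeof *ws);    /* one block instead of three */
    if (!ws) return 1;
    /* ... use ws->in, ws->tmp and ws->out together in the inner loops ... */
    free(ws);
    return 0;
}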
They are the same speed on average. Assuming the cache lines that the array occupies have not been touched by other code.
One thing to make sure of is that the array is at least 4-byte or 8-byte aligned (for float and double respectively) so that an array element will not cross cache-line boundaries. Cache lines are 64 bytes on x86.
Another important point is to make sure the compiler uses SSE instructions for scalar floating-point operations. This should be the default for modern compilers. The legacy floating point (a.k.a. x87, with its 80-bit register stack) is much slower and harder to optimize.
If this memory is frequently allocated and released, try to reduce calls to malloc/free by allocating it from a pool, globally, or on the stack.
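A small sketch of the alignment and reuse advice, using C11's aligned_alloc (the size must be a multiple of the alignment); posix_memalign would be an alternative on POSIX systems.

#include <stdlib.h>

int main(void) {
    enum { N = 4096 };
    /* 64-byte alignment matches the x86 cache-line size, so no element
       straddles a line; 4096 * 4 = 16384 bytes is a multiple of 64, as
       aligned_alloc requires. */
    float *a = aligned_alloc(64, N * sizeof(float));
    if (!a) return 1;

    for (int i = 0; i < N; i++) a[i] = 0.0f;
    /* ... reuse this one buffer across all the repeated passes instead of
       calling malloc/free each time ... */
    free(a);
    return 0;
}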

Fragmentation-resistant Microcontroller Heap Algorithm

I am looking to implement a heap allocation algorithm in C for a memory-constrained microcontroller. I have narrowed my search down to 2 options I'm aware of, however I am very open to suggestions, and I am looking for advice or comments from anyone with experience in this.
My Requirements:
-Speed definitely counts, but is a secondary concern.
-Timing determinism is not important - any part of the code requiring deterministic worst-case timing has its own allocation method.
-The MAIN requirement is fragmentation immunity. The device is running a lua script engine, which will require a range of allocation sizes (heavy on the 32 byte blocks). The main requirement is for this device to run for a long time without churning its heap into an unusable state.
Also Note:
-For reference, we are talking about Cortex-M and PIC32 parts, with memory ranging from 128 KB to 16 MB (with a focus on the lower end).
-I don't want to use the compiler's heap because 1) I want consistent performance across all compilers and 2) their implementations are generally very simple and are the same or worse for fragmentation.
-Double-indirection options are out because of the huge Lua code base that I don't want to fundamentally change and revalidate.
My Favored Approaches Thus Far:
1) Have a binary buddy allocator, and sacrifice memory usage efficiency (rounding up to a power of 2 size).
-this would (as I understand) require a binary tree for each order/bin to store free nodes sorted by memory address for fast buddy-block lookup for rechaining.
2) Have two binary trees for free blocks, one sorted by size and one sorted by memory address. (all binary tree links are stored in the block itself)
-allocation would be best-fit using a lookup on the table by size, and then remove that block from the other tree by address
-deallocation would lookup adjacent blocks by address for rechaining
-Both algorithms would also require storing an allocation size before the start of the allocated block, and have blocks go out as a power of 2 minus 4 (or 8 depending on alignment). (Unless they store a binary tree elsewhere to track allocations sorted by memory address, which I don't consider a good option)
-Both algorithms require height-balanced binary tree code.
-Algorithm 2 does not have the requirement of wasting memory by rounding up to a power of two.
-In either case, I will probably have a fixed bank of 32-byte blocks allocated by nested bit fields to off-load blocks this size or smaller, which would be immune to external fragmentation.
My Questions:
-Is there any reason why approach 1 would be more immune to fragmentation than approach 2?
-Are there any alternatives that I am missing that might fit the requirements?
If block sizes are not rounded up to powers of two or some equivalent(*), certain sequences of allocation and deallocation will generate an essentially-unbounded amount of fragmentation even if the number of non-permanent small objects that exist at any given time is limited. A binary-buddy allocator will, of course, avoid that particular issue. Otherwise, if one is using a limited number of nicely-related object sizes but not using a "binary buddy" system, one may still have to use some judgment in deciding where to allocate new blocks.
Another approach to consider is having different allocation methods for things that are expected to be permanent, temporary, or semi-persistent. Fragmentation often causes the most trouble when temporary and permanent things get interleaved on the heap. Avoiding such interleaving may minimize fragmentation.
Finally, I know you don't really want to use double-indirect pointers, but allowing object relocation can greatly reduce fragmentation-related issues. Many Microsoft-derived microcomputer BASICs used a garbage-collected string heap; Microsoft's garbage collector was really horrible, but its string-heap approach can be used with a good one.
You can pick up a (never used for real) Buddy system allocator at http://www.mcdowella.demon.co.uk/buddy.html, with my blessing for any purpose you like. But I don't think you have a problem that is easily solved just by plugging in a memory allocator. The long-running high integrity systems I am familiar with have predictable resource usage, described in 30+ page documents for each resource (mostly cpu and I/O bus bandwidth - memory is easy because they tend to allocate the same amount at startup every time and then never again).
In your case none of the usual tricks (static allocation, free lists, allocation on the stack) can be shown to work because, at least as described to us, you have a Lua interpreter hovering in the background, ready to do who knows what at run time. What if it just gets into a loop allocating memory until it runs out?
Could you separate the memory use into two sections - traditional code allocating almost all of what it needs on startup, and never again, and expendable code (e.g. Lua) allowed to allocate whatever it needs when it needs it, from whatever is left over after static allocation? Could you then trigger a restart or some sort of cleanup of the expendable code if it manages to use all of its area of memory, or fragments it, without bothering the traditional code?
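As a footnote, here is a minimal, hedged sketch of the "fixed bank of 32-byte blocks tracked by bit fields" idea mentioned in the question. Because every block is the same size, this bank cannot fragment externally; a hierarchical (nested) bitmap would only make finding a free block faster, not change the behavior. The pool size and names are placeholders.

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE   32
#define BLOCK_COUNT  1024                          /* 32 KB pool, for example */

static uint8_t  pool[BLOCK_COUNT][BLOCK_SIZE];
static uint32_t used[BLOCK_COUNT / 32];            /* one bit per block */

void *block_alloc(void) {
    for (size_t w = 0; w < BLOCK_COUNT / 32; w++) {
        if (used[w] == 0xFFFFFFFFu)                /* this word is fully allocated */
            continue;
        for (unsigned b = 0; b < 32; b++) {
            if (!(used[w] & (1u << b))) {
                used[w] |= 1u << b;                /* mark the block allocated */
                return pool[w * 32 + b];
            }
        }
    }
    return NULL;                                   /* pool exhausted */
}

void block_free(void *p) {
    size_t idx = ((uint8_t *)p - (uint8_t *)pool) / BLOCK_SIZE;
    used[idx / 32] &= ~(1u << (idx % 32));         /* clear the block's bit */
}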
