Live Indexing DSE - solr

I want to enable live indexing on my cluster. As per the steps here, I must increase the heap size to at least 20 GB. I have 3 nodes in my cluster, each having 32 GB of RAM. The current heap size configuration is shown below.
Is it recommendable for me to change my heap size to 20GB with the same amount of RAM? If not, what should the recommended heap size be for my RAM so that live indexing works correctly?

Search query performance depends on our ability to utilize the OS page cache effectively to keep search indexes hot. The smaller the size of your indexes, the easier it will be for the OS to maintain them in memory.
If you increase your heap, that's less RAM available for page cache and less likelihood that you'll be fitting those indexes in RAM. Check their size in the solr.data directory and see if you have enough RAM.
Also check out my post on decreasing index size:
http://www.sestevez.com/solr-space-saving-profile/

Related

linked lists vs arrays: which is more contiguous in physical memory?

Arrays are not necessarily contiguous in physical memory, though they are contiguous in virtual address space. But can it be said that the "tidiness" of arrays in physical memory is significantly higher compared to linked lists? So, which is a better option for a cache-friendly program?
There are two reasons why contiguous memory is more cache-friendly than non-contiguous memory:
1. If the data is stored contiguously, it will likely occupy fewer cache lines (which are 64-byte blocks on most platforms). In that case, there is a higher chance that all the data will fit in the cache and new cache lines will have to be loaded less often. If the data is not stored contiguously and is scattered across many random memory locations, then it is possible that only a small fraction of every cache line contains important data and the rest of the cache line contains unrelated data. In that case, more cache lines are required to cache all the important data, and if the cache is not large enough to hold all of these cache lines, cache efficiency decreases.
2. The hardware cache prefetcher will do a better job of predicting the next cache line to prefetch, because a sequential access pattern is easy to predict. Depending on whether the elements of the linked list are scattered or not, the access pattern to a linked list may be random and unpredictable, whereas the access pattern to an array is usually sequential.
You are right that even if an array is stored contiguously in the virtual address space, this does not necessarily mean that the array is also contiguous in the physical address space.
However, this is irrelevant with regard to my statement in #1 above, because a cache line cannot cross the boundary of a memory page. The content of a single memory page is always contiguous, both in the virtual address space and in the physical address space.
But you are right that it can be relevant with regard to my statement in #2. Assuming a memory page size of 4096 bytes (which is standard on the x64 platform) and a cache line size of 64 bytes, there are 64 cache lines per memory page. This means that every 64th cache line could be at the edge of a "jump" in the physical address space. As a result, every 64th cache line could be mispredicted by the hardware cache prefetcher. Also, the prefetcher may not be able to adapt itself immediately to this new situation, so it may fail to prefetch several cache lines before it is able to reliably predict the next cache lines again and preload them in time. However, as an application programmer, you should not have to worry about this. It is the responsibility of the operating system to arrange the mapping of the virtual memory space to the physical memory space in such a way that there are not too many "jumps" which could have a negative performance impact. If you want to read more on this topic, you might want to read this research paper: Analysis of hardware prefetching across virtual page boundaries
Generally, arrays are better than linked lists in terms of cache efficiency, because they are always contiguous (in the virtual address space).
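To see the difference in practice, here is a minimal sketch (the sizes, node layout and shuffling are illustrative assumptions, not from the answer above) that sums the same values once from a contiguous array and once from a linked list whose nodes were deliberately scattered; on most machines the array traversal is several times faster.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000   /* ten million ints, far larger than any cache */

struct node { int value; struct node *next; };

int main(void)
{
    /* Contiguous array. */
    int *arr = malloc(N * sizeof *arr);

    /* Linked list built from nodes taken in a shuffled order, so that
     * successive nodes are unlikely to share cache lines. */
    struct node *pool = malloc(N * sizeof *pool);
    int *order = malloc(N * sizeof *order);
    if (!arr || !pool || !order) return 1;

    for (int i = 0; i < N; i++) { arr[i] = i; order[i] = i; }
    srand(42);
    for (int i = N - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }
    struct node *head = NULL;
    for (int i = 0; i < N; i++) {
        struct node *n = &pool[order[i]];
        n->value = i;
        n->next = head;
        head = n;
    }

    long long sum = 0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++) sum += arr[i];                   /* sequential */
    clock_t t1 = clock();
    for (struct node *p = head; p; p = p->next) sum += p->value; /* pointer chasing */
    clock_t t2 = clock();

    printf("array: %.3fs  list: %.3fs  (sum=%lld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(arr); free(pool); free(order);
    return 0;
}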

Cache locality of an array and amortized complexity

An array is usually faster in access than a linked list.
This is primarily due to cache locality of an array.
I have a few doubts:
On what factor does the amount of data that is brought into the cache depend? Is it exactly equal to the cache size of the system? How can we know what amount of memory this is?
The first access to an array is usually costlier, as the array has to be located in memory and brought into the cache. The subsequent operations are comparatively faster. How do we calculate the amortized complexity of the access operation?
What are cache misses? Does it mean (in reference to a linked list) that the required item (current pointer->next) has not been loaded into the cache, and hence memory has to be searched again for its address?
In reality, it is a bit more complex than the simple model you present in the question.
First, you may have multiple caching layers (L1, L2, L3), each of them with different characteristics. In particular, the replacement policy for each cache may use different algorithms as a tradeoff between efficiency and complexity (i.e. cost).
Then, all modern operating systems implement virtual memory mechanisms. It is not enough to cache the data and the instructions (which is what L1..L3 are used for); it is also necessary to cache the association between virtual and physical addresses (in the TLB, the translation lookaside buffer).
To understand the impact of locality, you need to consider all these mechanisms.
Question 1
The minimum unit exchanged between the memory and a cache is the cache line. Typically, it is 64 bytes (but it depends on the CPU model). Let's imagine the caches are empty.
If you iterate on an array, you will pay for a cache miss every 64 bytes. A smart CPU (and a smart program) could analyze the memory access pattern and decide to prefetch contiguous blocks of memory in the caches to increase the throughput.
If you iterate on a list, the access pattern will be random, and you will likely pay a cache miss for each item.
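A small sketch of the "one miss per cache line" effect (the array size and stride are illustrative assumptions): touching every int and touching only one int per 64-byte line walk the same set of cache lines, so their run times are much closer than the 16x difference in arithmetic work would suggest.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 256 MB of ints, far larger than any cache */

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1;

    clock_t t0 = clock();
    for (long i = 0; i < N; i++) a[i] *= 3;        /* stride 1: every int */
    clock_t t1 = clock();
    for (long i = 0; i < N; i += 16) a[i] *= 3;    /* stride 16: one int per 64-byte line */
    clock_t t2 = clock();

    printf("stride 1:  %.3fs\nstride 16: %.3fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}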
Question 2
The whole array is not searched and brought into the cache on first access. Only the first cache line is.
However, there is also another factor to consider: the TLB. The page size is system dependent; a typical value is 4 KB. The first time the array is accessed, an address translation will occur (and its result will be stored in the TLB). If the array is smaller than 4 KB (the page size), no other address translation will have to be done. If it is bigger, then one translation per page will be done.
Compare this to a list. The probability that multiple items fit in the same page (4 KB) is much lower than for the array. The probability that they fit in the same cache line (64 bytes) is extremely low.
I think it is difficult to calculate a complexity because there are probably other factors to consider. But in this complexity, you have to take into account the cache line size (for cache misses) and the page size (for TLB misses).
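As a rough illustration of this arithmetic, the following sketch prints how many cache lines and pages an array of 100,000 ints spans, i.e. an upper bound on its cold cache misses and first-touch TLB translations. The sysconf() calls and constants are assumptions about a POSIX/glibc system, not something stated in the answer above.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);           /* usually 4096 bytes */
    long line = 64;                              /* typical cache line size */
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long l = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (l > 0) line = l;                         /* use the real value if the OS reports it */
#endif
    long n = 100000;                             /* array of 100,000 ints */
    long bytes = n * (long)sizeof(int);
    printf("%ld bytes -> %ld cache lines, %ld pages\n",
           bytes, (bytes + line - 1) / line, (bytes + page - 1) / page);
    return 0;
}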
Question 3
A cache miss is when a given cache line is not in the cache. It can happen at the L1, L2, or L3 level. The higher the level, the more expensive the miss.
A TLB miss occurs when the virtual address is not in the TLB. In that case, a conversion to a physical address is done using the page tables (costly) and the result is stored in the TLB.
So yes, with a linked list, you will likely pay for a higher number of cache and TLB misses than for an array.
Useful links:
Wikipedia article on caches in CPUs: https://en.wikipedia.org/wiki/CPU_cache
Another good article on this topic: http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
An oldish, but excellent article on the same topic: http://arstechnica.com/gadgets/2002/07/caching/
A gallery of various caching effects: http://igoro.com/archive/gallery-of-processor-cache-effects/

Does size of array in C affect time to access or change a piece of data in the array

Title says it all:
I create an array, say ARRAY[10000].
I set bounds where I only need to access data from 1-100 and 900-2000.
Will accessing and/or changing a piece of data take a greater amount of time in this manner than if I had declared the array as ARRAY[2001]?
Will access and/or change time be changed if I have an array that only has the data from 1-100 and 900-2000?
I have seen some papers on this but they have not been clear to me and was hoping I could get a more concise answer here.
If the array is accessed infrequently, then the size of the array probably won't make much difference because you'd be getting a cache miss anyway. In that case the time will depend on how quickly the CPU can do any "virtual address to physical address" conversion and fetch the data from RAM.
The more frequently you access something in the array, the more cache effects matter. These cache effects depend heavily on the number, sizes and speeds of different caches.
However, it also depends on your access patterns. For example, if you have a 1 GiB array and frequently access 5 bytes of it, then the size won't make much difference as the 5 bytes you're accessing frequently will be in cache. For another example, if you use a sequential access pattern (e.g. "for each element in array { ... }") then it's likely the CPU might do hardware pre-fetching and you won't pay the full cost of cache misses.
For a typical 80x86 system with a random access pattern and frequent accesses, there are about 5 different sizes that matter. The first (smallest) is the L1 data cache size - if the array fits in the L1 data cache, then it's going to be relatively fast regardless of whether the array is 20 bytes or 20 KiB. The next size is the L2 data cache size - as the array gets larger, the ratio of "L1 hits to L2 hits" decreases and performance gets worse until (at maybe "twice as large as L1") the L1 hits become negligible. Then (for CPUs that do have L3 caches) the same happens with the L3 cache size: as the array gets larger, the ratio of "L2 hits to L3 hits" decreases.
Once you go larger than the largest cache, the ratio of "cache hits to cache misses" decreases.
The next size that can matter is TLB size. Most modern operating systems use paging, and most modern CPUs cache "virtual address to physical address conversions" in something typically called a Translation Look-aside Buffer. If the array is huge then you start getting TLB misses (in addition to cache misses) which makes performance worse (as the CPU can't avoid extra work when converting your virtual address into a physical address).
Finally, the last size that can matter is how much RAM you actually have. If you have a 10 TiB array, 4 GiB of RAM and 20 TiB of swap space; then the OS is going to be swapping data to/from disk and the overhead of disk IO is going to dominate performance.
Of course often you're creating an executable for many different computers (e.g. "64-bit 80x86, ranging from ancient Athlon to modern Haswell"). In that case you can't know most of the details that matter (like cache sizes) and it becomes a compromise between guesses (estimated overhead from cache misses due to "array too large" vs. estimated overhead from other things caused by "array too small").
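If you want to observe those thresholds yourself, something like the following sketch can help (the working-set sizes and the access count are arbitrary assumptions): it performs the same number of pseudo-random accesses inside increasingly large arrays and reports the average time per access, which typically steps up as the array outgrows L1, L2 and L3.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ACCESSES 20000000

int main(void)
{
    size_t sizes_kb[] = { 16, 128, 1024, 8192, 65536 };  /* 16 KB .. 64 MB */
    for (size_t s = 0; s < sizeof sizes_kb / sizeof sizes_kb[0]; s++) {
        size_t n = sizes_kb[s] * 1024 / sizeof(int);
        int *a = malloc(n * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < n; i++) a[i] = (int)i;

        unsigned idx = 12345;
        long long sum = 0;
        clock_t t0 = clock();
        for (long i = 0; i < ACCESSES; i++) {
            idx = idx * 1103515245u + 12345u;    /* cheap pseudo-random index */
            sum += a[idx % n];
        }
        clock_t t1 = clock();
        printf("%6zu KB: %.2f ns/access (sum=%lld)\n", sizes_kb[s],
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / ACCESSES, sum);
        free(a);
    }
    return 0;
}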
No, at least normally the time to access any item in an array will be constant, regardless of the array's size.
This could change if (for example) you define an array larger than memory, so the OS (assuming you're using one) ends up doing some extra paging to support the larger array. In most cases, even that's unlikely to have much impact though.
Maybe yes, but it depends on the size.
Cache size
The size of the accessed range changes the latency.
int ARRAY[10000] is roughly 40 KB (assuming a 4-byte int), which is slightly larger than a typical 32 KB L1 data cache.
An access that hits the L1 cache (32 KB) costs about 4 clocks on Haswell, but an L2 cache access costs about 12 clocks.
See details here:
http://www.7-cpu.com/cpu/Haswell.html
Cache coherence
If another CPU core modifies some data in the array, the corresponding cache line in the local cache will be moved to the Invalid state.
Accessing an invalidated cache line costs much more latency.
NUMA
Some environments have multiple CPU sockets on the motherboard; this is called a non-uniform memory access (NUMA) environment.
Such a machine may have a huge memory capacity, but some memory addresses may be resident in memory attached to CPU1 while other addresses are resident in memory attached to CPU2.
int huge_array[SOME_HUGE_SIZE]; // it strides across multiple CPUs' DIMMs
// suppose that the entire huge_array is out of cache
huge_array[ADDRESS_OF_CPU1] = 1; // maybe fast for CPU1
huge_array[ADDRESS_OF_CPU2] = 1; // maybe slow for CPU2
But whether a huge array really strides across multiple CPUs' memory depends on the OS, and the allocation of such a huge array may simply fail.
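On Linux you can control where such an array lands with libnuma; this is an assumption about the environment, not something from the answer above. A minimal sketch (build with -lnuma):

#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        puts("NUMA not available on this system");
        return 1;
    }
    size_t bytes = 1UL << 30;                   /* 1 GiB, illustrative size */
    void *buf = numa_alloc_onnode(bytes, 0);    /* place the buffer on node 0 */
    if (!buf) {
        puts("allocation failed");
        return 1;
    }
    /* Threads scheduled on node 0 now access this buffer locally;
     * threads on other nodes pay the remote-access penalty. */
    numa_free(buf, bytes);
    return 0;
}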
In theory, as others have stated, array access is constant time and thus does not cost more or less depending on the array's size. This question seems to be about real-life performance though, and there the array size definitely matters. How so is well explained by the accepted answer by Brendan.
Things to consider in practice:
* How big are the elements of your array: bool[1000], MyStruct[1000] and MyStruct*[1000] may differ a lot in access performance
* Try writing code for both ways, once using the big array and once keeping the required data in a smaller array (see the sketch below). Then run the code on your target hardware(s) and compare performance. You will often be surprised to see that optimization attempts make performance worse, and you learn a lot about hardware and its quirks in the process.
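Following that suggestion, here is a minimal sketch of the two ways for the exact ranges in the question (the repetition count is an arbitrary assumption). Both working sets are tiny and fit in L1, so on most machines you will measure essentially no difference, which matches the other answers.

#include <stdio.h>
#include <time.h>

#define REPS 200000

static int big_array[10000];
static int small_array[2001];

int main(void)
{
    long long s1 = 0, s2 = 0;
    for (int i = 0; i < 10000; i++) big_array[i] = i;   /* fill with something */
    for (int i = 0; i <= 2000; i++) small_array[i] = i;

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++) {                    /* big array, same index ranges */
        for (int i = 1; i <= 100; i++)    s1 += big_array[i];
        for (int i = 900; i <= 2000; i++) s1 += big_array[i];
    }
    clock_t t1 = clock();
    for (int r = 0; r < REPS; r++) {                    /* smaller array, same index ranges */
        for (int i = 1; i <= 100; i++)    s2 += small_array[i];
        for (int i = 900; i <= 2000; i++) s2 += small_array[i];
    }
    clock_t t2 = clock();

    printf("ARRAY[10000]: %.3fs\nARRAY[2001]:  %.3fs\n(sums %lld %lld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}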
I don't believe it should.
When you access an element, you are going to memory location base address + (element offset); therefore, regardless of the array's size, you get to that memory location in the same amount of time.

What are the differences between (and reasons to choose) tcmalloc/jemalloc and memory pools?

tcmalloc/jemalloc are improved general-purpose memory allocators, and memory pools are another technique introduced for better memory allocation. So what are the differences between them, and how do I choose between them for my application?
It depends on the requirements of your program. If your program makes a lot of dynamic memory allocations, then you need to choose, from the available allocators, the one that gives the best performance for your program.
For good memory management you need to meet the following requirements at a minimum:
Check whether your system has enough memory to process the data.
Are you able to allocate from the available memory?
Return the used / deallocated memory to the pool (program or operating system).
The quality of a memory manager can be judged (at the bare minimum) by its efficiency in retrieving / allocating and returning / deallocating memory. (There are many more considerations, such as cache locality, management overhead, VM environments, small or large environments, threaded environments, etc.)
With respect to tcmalloc and jemalloc, many people have published comparisons. With reference to one of these comparisons:
http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/
tcmalloc scores over all the others in terms of CPU cycles per allocation when the number of threads is small.
jemalloc is very close to tcmalloc, but better than ptmalloc (the standard glibc implementation).
In terms of memory overhead jemalloc is the best, seconded by ptmalloc, followed by tcmalloc.
Overall it can be said that jemalloc scores over the others. You can also read more about jemalloc here:
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919
I have only quoted tests done and published by other people and have not tested them myself. I hope this is a good starting point for you to test and select the most suitable allocator for your application.
Summary from this doc
Tcmalloc
tcmalloc is a memory management library open sourced by Google as an alternative to glibc malloc. It has been used in well-known software such as Chrome and Safari. According to the official test report, ptmalloc takes about 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 machine (for small objects). The tcmalloc version takes about 50 nanoseconds for the same operation.
Small object allocation
tcmalloc allocates a thread-local ThreadCache for each thread. Small memory is allocated from ThreadCache. In addition, there is a central heap (CentralCache). When ThreadCache is not enough, it will get space from CentralCache and put it in ThreadCache.
Small objects (<= 32 KB) are allocated from the ThreadCache, and large objects are allocated from the CentralCache. The space allocated for large objects is aligned to 4 KB pages, and a run of pages can also be cut into multiple small objects and handed out to the ThreadCaches.
CentralCache allocation management
Large objects (> 32 KB) are first rounded up to a multiple of 4 KB and then allocated from the CentralCache.
When the best-fit page list has no free span, the search moves on to lists of ever larger page counts. If all 256 linked lists have been traversed and the allocation is still not successful, the space is allocated from the system using sbrk, mmap or /dev/mem.
A run of contiguous pages managed by the tcmalloc PageHeap is called a span. When a span is not allocated, it is an element of a linked list in the PageHeap.
Recycle
When an object is freed, its page number is calculated from its (aligned) address, and the corresponding span is then found through the central array.
If it is a small object, the span tells us its size class, and the object is then inserted into the ThreadCache of the current thread. If the ThreadCache now exceeds a budget value (2 MB by default), the garbage collection mechanism is used to move unused objects from the ThreadCache to the CentralCache's central free lists.
If it is a large object, the span tells us the range of pages [p, q] where the object is located. We first look up the spans containing pages p-1 and q+1; if these adjacent spans are also free, they are merged into the span covering [p, q], and that span is then returned to the PageHeap.
The central free lists of the CentralCache are similar to the FreeList of the ThreadCache, but they add one extra level of structure.
Jemalloc
jemalloc was written by Jason Evans and first shipped as FreeBSD's libc malloc; Facebook later adopted and extended it. At present, it is widely used in various components of Firefox and of Facebook's servers.
memory management
Similar to tcmalloc, each thread uses a lock-free thread-local cache for allocations smaller than 32 KB.
Jemalloc uses the following size-class classifications on 64bits systems: Small: [8], [16, 32, 48, …, 128], [192, 256, 320, …, 512], [768, 1024, 1280, …, 3840]
Large: [4 KiB, 8 KiB, 12 KiB, …, 4072 KiB]
Huge: [4 MiB, 8 MiB, 12 MiB, …]
Small/large objects need constant time to find metadata, and huge objects are searched in logarithmic time through the global red-black tree.
The virtual memory is logically divided into chunks (4 MB by default, i.e. 1024 4 KB pages), and at its first malloc each application thread is assigned an arena via a round-robin algorithm. Arenas are independent of one another and maintain their own chunks; a chunk's pages are cut into small/large objects. Memory passed to free() is always returned to the arena it belongs to, regardless of which thread calls free().
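One way to see size classes in action (malloc_usable_size() is a glibc/jemalloc extension, an assumption here rather than something from the text above) is to compare what you asked malloc for with what the allocator actually handed out:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

int main(void)
{
    size_t requests[] = { 1, 9, 17, 100, 200, 1000, 5000 };
    for (size_t i = 0; i < sizeof requests / sizeof requests[0]; i++) {
        void *p = malloc(requests[i]);
        /* The usable size shows the size class the request was rounded up to. */
        printf("requested %5zu bytes, usable %5zu bytes\n",
               requests[i], malloc_usable_size(p));
        free(p);
    }
    return 0;
}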
Compare
The biggest advantage of jemalloc is its powerful multi-core/multi-thread allocation capability: the more cores the CPU has and the more threads the program runs, the faster jemalloc allocates.
When allocating a lot of small objects, jemalloc's metadata overhead is slightly higher than tcmalloc's.
For large allocations, jemalloc also produces less memory fragmentation than tcmalloc.
jemalloc classifies allocation sizes at a finer granularity, which leads to less lock contention than ptmalloc.
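For contrast with these general-purpose allocators, here is a minimal fixed-size memory pool sketch (block size, count and names are illustrative): one upfront malloc, a free list threaded through the blocks, O(1) allocate and release, and no per-call locking or heap metadata. This is the trade-off a pool makes: it is simpler and often faster than a general allocator, but it only serves objects of one size and returns nothing to the OS until it is destroyed.

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE  64          /* every object handed out is 64 bytes */
#define BLOCK_COUNT 1024

struct pool {
    unsigned char *memory;      /* one contiguous slab */
    void *free_list;            /* singly linked list threaded through the free blocks */
};

static int pool_init(struct pool *p)
{
    p->memory = malloc((size_t)BLOCK_SIZE * BLOCK_COUNT);
    if (!p->memory) return -1;
    p->free_list = NULL;
    for (int i = 0; i < BLOCK_COUNT; i++) {          /* thread the free list */
        void *block = p->memory + (size_t)i * BLOCK_SIZE;
        *(void **)block = p->free_list;
        p->free_list = block;
    }
    return 0;
}

static void *pool_alloc(struct pool *p)
{
    void *block = p->free_list;
    if (block) p->free_list = *(void **)block;       /* pop the head block */
    return block;                                    /* NULL when exhausted */
}

static void pool_free(struct pool *p, void *block)
{
    *(void **)block = p->free_list;                  /* push it back on the list */
    p->free_list = block;
}

int main(void)
{
    struct pool p;
    if (pool_init(&p) != 0) return 1;
    void *a = pool_alloc(&p), *b = pool_alloc(&p);
    printf("two pool blocks: %p %p\n", a, b);
    pool_free(&p, a);
    pool_free(&p, b);
    free(p.memory);
    return 0;
}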

loading huge database into memory visual c

We have a somewhat unusual c app in that it is a database of about 120 gigabytes, all of which is loaded into memory for maximum performance. The machine it runs on has about a quarter terabyte of memory, so there is no issue with memory availability. The database is read-only.
Currently we are doing all the memory allocation dynamically, which is quite slow, but it is only done once so it is not an issue in terms of time.
We were thinking about whether it would be faster, either in startup or in runtime performance, if we were to use global data structures instead of dynamic allocation. But it appears that Visual Studio limits global data structures to a meager 4 GB, even if you set the linker heap commit and reserve size much larger.
Anyone know of a way around this?
One way to do this would be to have your database as a persistent memory-mapped file and then have the query part of your database access that instead of dynamically allocated structures. It could be worth a try; I don't think performance would suffer that much (though of course it will be somewhat slower).
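A minimal sketch of that approach with the Win32 file-mapping API (the file name is a placeholder, error handling is reduced to the bare minimum, and a 64-bit build is assumed so that a 120 GB view fits in the address space):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    /* Open the database file read-only; "database.bin" is a placeholder name. */
    HANDLE file = CreateFileA("database.bin", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    /* Create a read-only mapping covering the whole file. */
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) { CloseHandle(file); return 1; }

    /* Map the entire file; this pointer can be handed to the query code
     * in place of dynamically allocated structures. Pages are brought in
     * on demand and cached by the OS. */
    const unsigned char *data = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if (data) {
        printf("first byte: %d\n", data[0]);
        UnmapViewOfFile(data);
    }
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}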
How many regions of memory are you allocating? (1 x 120GB) or (120 Billion x 1 byte) etc.
I believe the work done when dynamically allocating memory is proportional to the number of allocated regions rather than their size.
Depending on your data and usage (elaborate and we can be more specific), you can allocate a large block of heap memory (e.g. 120 GB) once and then manage it yourself.
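A sketch of that idea (sizes and names are illustrative): a simple bump allocator carves pieces out of one large upfront allocation, so startup pays for a single malloc instead of millions, and because the database is read-only nothing ever needs to be freed individually.

#include <stdio.h>
#include <stdlib.h>

struct arena {
    unsigned char *base;        /* the one big block */
    size_t         used;
    size_t         capacity;
};

static void *arena_alloc(struct arena *a, size_t size)
{
    size = (size + 15) & ~(size_t)15;       /* keep 16-byte alignment */
    if (a->used + size > a->capacity) return NULL;
    void *p = a->base + a->used;            /* bump the pointer, no per-object bookkeeping */
    a->used += size;
    return p;
}

int main(void)
{
    struct arena a = { malloc((size_t)1 << 30), 0, (size_t)1 << 30 };  /* 1 GiB demo block */
    if (!a.base) return 1;

    /* Every record comes from the arena; no per-record malloc call. */
    double *record = arena_alloc(&a, 1000 * sizeof(double));
    char   *names  = arena_alloc(&a, 1 << 20);
    printf("used %zu of %zu bytes (%p %p)\n", a.used, a.capacity,
           (void *)record, (void *)names);

    free(a.base);                           /* one free at shutdown */
    return 0;
}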
Startup performance: If you're thinking of switching from dynamic to static global allocation, then I'd assume that you know how much you're allocating at compile time and that a fixed number of allocations is performed at runtime. I'd consider reducing the number of allocations performed; the actual call to new is the real bottleneck, not the allocation itself.
Runtime performance: No, it wouldn't improve runtime performance. Data structures of that size are going to end up on the heap, and subsequently in cache as they are read. To improve performance at runtime you should be aiming to improve locality of data, so that data required right after the data you've just used ends up on the same cache line, placed in cache with the data you just used.
Both of these techniques I've used to great effect: efficiently ordering voxel data in 'batches', improving the locality of data in a tree structure and reducing the number of calls to new greatly increased the performance of a realtime renderer I worked on in a previous position. We're talking ~40 GB voxel structures, possibly streaming off disk. Worked for us :).
Have you conducted an actual benchmark of your "in memory" solution versus having a well indexed read only table set on the solid state drives? Depending upon the overall solution it's entirely possible that your extra effort yields only small improvements to the end user. I happen to be aware of at least one solution approaching a half a petabyte of storage where the access pattern is completely random with an end user response time of less than 10 seconds with all data on disk.
