DSE + solr back pressure issue for index work pool - solr

restarting a node B in a three node cluster always ends in the following warn message on the node B:
Back pressure is active for Index work pool with total work queue size 764 and average processing time 9242
the queue size is constantly increasing.
I'm running DSE 4.5 SOLR on a ubuntu 12.04 machine with 20 GB RAM, 4 Cores per server. Monitoring the Systemresources seems to be "normal" , 12 GB RAM usage, very low CPU usage.
opscenter is showing me for this node:
a heap usage of 1,5 GB, a very low load (0.32), a compaction task of a keyspace that is at 100% for several hours now.
On node A a nodetool -pr always show the message:
INFO [HintedHandoff:2] 2015-01-05 17:24:19,570 HintedHandOffManager.java (line 466) Timed out replaying hints to /10.0.106.200; aborting (0 delivered)
where 10.0.106.200 is node B
Any idea what this Warning could cause?

The backpressure warning is just an indication of overloading, as the indexing process isn't able to keep up with the insertion rate according to your configured threshold; generally, it is not a serious issue unless it happens for a prolonged period of time, in which case you might want to scale either horizontally or vertically, or increase the backpressure threshold if you already have powerful machines.
As for the other problems, they look a bit more serious, but it's difficult to diagnose with so little information.

Related

Task queue of size 1 for serial processing

I want a serial queue where only one task may process at a time. This is my declaration in queue.xml:
<queue>
<name>pumpkin</name>
<rate>20/s</rate>
<bucket-size>1</bucket-size>
<max-concurrent-requests>1</max-concurrent-requests>
</queue>
Does the "rate" parameter have any effect in this setup?
I want tasks to queue up and only process one at a time.
Thanks
The queue rate and bucket size do not limit the number of tasks processes simultaneously. See this answer which captured a (now gone) official explanation of these configs better than the (current) official docs (you may want to revisit your configs accordingly): https://stackoverflow.com/a/3740846/4495081
However your max-concurrent-requests should ensure only one task is executed at a time, according to the docs:
You can avoid this possibility by setting max_concurrent_requests to a
lower value. For example, if you set max_concurrent_requests to 10,
our example queue maintains about 20 tasks/second when latency is 0.3
seconds. However, when the latency increases over 0.5 seconds, this
setting throttles the processing rate to ensure that no more than 10
tasks run simultaneously.

Regarding CPU utilization

Considering the below piece of C code, I expected the CPU utilization to go up to 100% as the processor would try to complete the job (endless in this case) given to it. On running the executable for 5 mins, I found the CPU to go up to a max. of 48%. I am running Mac OS X 10.5.8; processor: Intel Core 2 Duo; Compiler: GCC 4.1.
int i = 10;
while(1) {
i = i * 5;
}
Could someone please explain why the CPU usage does not go up to 100%? Does the OS limit the CPU from reaching 100%?
Please note that if I added a "printf()" inside the loop the CPU hits 88%. I understand that in this case, the processor also has to write to the standard output stream hence the sharp rise in usage.
Has this got something to do with the amount of job assigned to the processor per unit time?
Regards,
Ven.
You have a multicore processor and you are in a single thread scenario, so you will use only one core full throttle ... Why do you expect the overall processor use go to 100% in a similar context ?
Run two copies of your program at the same time. These will use both cores of your "Core 2 Duo" CPU and overall CPU usage will go to 100%
Edit
if I added a "printf()" inside the loop the CPU hits 88%.
The printf send some characters to the terminal/screen. Sending information, Display and Update is handeled by code outside your exe, this is likely to be executed on another thread. But displaying a few characters does not need 100% of such a thread. That is why you see 100% for Core 1 and 76% for Core 2 which results in the overal CPU usage of 88% what you see.

Solr data import memory usage

I am running Solr 3.2 and 4 GB memory. Whenever I start the Solr , it does full import of all the cores and after that every 30 minutes delta import happens. Among 5 cores, 2 cores are having data around 1.6M. Full import takes for those 2 cores more than 20 hours and it is taking all the memory. Because of less memory delta import is not happening for other cores. This leads to restart of the Solr whenever data is updated in DB.
As until commit happens, it won't release the memory, I have given autocommit interval to 5 minutes for those 2 cores. Even though memory is not reduced.
Is there any other configuration I can check?
Edit 1 My autocommit settings
<autoCommit>
<maxDocs>25000</maxDocs>
<maxTime>300000</maxTime>
</autoCommit>
Edit 2 jconsole
System information & values from jconsole
Operating System: Windows 7 6.1
Architecture: amd64
Number of processors: 1
Committed virtual memory: 2,618,508 kbytes
Total physical memory: 4,193,848 kbytes
Free physical memory: 669,472 kbytes
Total swap space: 9,317,672 kbytes
Free swap space: 2,074,072 kbytes
threads details from jconsole
Live threads: 201
Peak: 207
Daemon threads: 182
Total threads started: 2,770
Check https://wiki.apache.org/solr/SolrPerformanceProblems, especially "Slow startup" part. I think that's your case.

Why would one CPU core run slower than the others?

I was benchmarking a large scientific application, and found it would sometimes run 10% slower given the same inputs. After much searching, I found the the slowdown only occurred when it was running on core #2 of my quad core CPU (specifically, an Intel Q6600 running at 2.4 GHz). The application is a single-threaded and spends most of its time in CPU-intensive matrix math routines.
Now that I know one core is slower than the others, I can get accurate benchmark results by setting the processor affinity to the same core for all runs. However, I still want to know why one core is slower.
I tried several simple test cases to determine the slow part of the CPU, but the test cases ran with identical times, even on slow core #2. Only the complex application showed the slowdown. Here are the test cases that I tried:
Floating point multiplication and addition:
accumulator = accumulator*1.000001 + 0.0001;
Trigonometric functions:
accumulator = sin(accumulator);
accumulator = cos(accumulator);
Integer addition:
accumulator = accumulator + 1;
Memory copy while trying to make the L2 cache miss:
int stride = 4*1024*1024 + 37; // L2 cache size + small prime number
for(long iter=0; iter<iterations; ++iter) {
for(int offset=0; offset<stride; ++offset) {
for(i=offset; i<array_size; i += stride) {
array1[i] = array2[i];
}
}
}
The Question: Why would one CPU core be slower than the others, and what part of the CPU is causing that slowdown?
EDIT: More testing showed some Heisenbug behavior. When I explicitly set the processor affinity, then my application does not slow down on core #2. However, if it chooses to run on core #2 without an explicitly set processor affinity, then the application runs about 10% slower. That explains why my simple test cases did not show the same slowdown, as they all explicitly set the processor affinity. So, it looks like there is some process that likes to live on core #2, but it gets out of the way if the processor affinity is set.
Bottom Line: If you need to have an accurate benchmark of a single-threaded program on a multicore machine, then make sure to set the processor affinity.
You may have applications that have opted to be attached to the same processor(CPU Affinity).
Operating systems would often like to run on the same processor as they can have all their data cached on the same L1 cache. If you happen to run your process on the same core that your OS is doing a lot of its work on, you could experience the effect of a slowdown in your cpu performance.
It sounds like some process is wanting to stick to the same cpu. I doubt it's a hardware issue.
It doesn't necessarily have to be your OS doing the work, some other background daemon could be doing it.
Most modern cpu's have separate throttling of each cpu core due to overheating or power saving features. You may try to turn off power-saving or improve cooling. Or maybe your cpu is bad. On my i7 I get about 2-3 degrees different core temperatures of the 8 reported cores in "sensors". At full load there is still variation.
Another possibility is that the process is being migrated from one core to another while running. I'd suggest setting the CPU affinity to the 'slow' core and see if it's just as fast that way.
Years ago, before the days of multicore, I bought myself a dual-socket Athlon MP for 'web development'. Suddenly my Plone/Zope/Python web servers slowed to a crawl. A google search turned up that the CPython interpreter has a global interpreter lock, but Python threads are backed by OS threads. OS Threads were evenly distributed among the CPUs, but only one CPU can acquire the lock at a time, thus all the other processes had to wait.
Setting Zope's CPU affinity to any CPU fixed the problem.
I've observed something similar on my Haswel laptop. The system was quiet, no X running, just the terminal. Executing the same code with different numactl --physcpubin option gave exactly the same results on all cores, except one. I changed the frequency of the cores to Turbo, to other values, nothing helped. All cores were running with expected speed, except one which was always running slower than the others. That effect survived the reboot.
I rebooted the computer and turned off HyperThreading in the BIOS. When it came back online it was fine again. I then turned on HyperThreading and it is fine till now.
Bizzare. No idea what that could be.

CPU clock frequency and thus QueryPerformanceCounter wrong?

I am using QueryPerformanceCounter to time some code. I was shocked when the code starting reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed as the wall clock time of the running program did not change although the time reported using QPC did change (it dropped by 2.7/4.1).
MyComputer->Properties shows:
Intel(R)
Pentium(R)
4 CPU 2.80 GHz; 4.11 GHz;
1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets nor to I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and not affected by changes in the CPU clock. The strange thing in my situation is that the FREQUENCY reported by QPF changed to an incorrect value and this changed frequency was also reported in MyComputer -> Properties which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have those chipset. Additionally some dual core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and
QueryPerformanceFrequency() offer a
bit better resolution, but have
different issues. For example in
Windows XP, all AMD Athlon X2 dual
core CPUs return the PC of either of
the cores "randomly" (the PC sometimes
jumps a bit backwards), unless you
specially install AMD dual core driver
package to fix the issue. We haven't
noticed any other dual+ core CPUs
having similar issues (p4 dual, p4 ht,
core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected, it uses a RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) millisecond's worth of occasional phase adjustments, and are willing to perform some math in order to perform the said phase adjustment by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on RTC low-resolution time (you can do this on-demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs.)
You can try to use the Stopwatch class from .NET, it could help with your problem since it abstracts from all this low-lever stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it
does not matter which processor the
thread runs on. However, because of
bugs in the BIOS or the Hardware
Abstraction Layer (HAL), you can get
different timing results on different
processors. To specify processor
affinity for a thread, use the
ProcessThread..::.ProcessorAffinity
method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.

Resources