I'm working on an Apache Solr project (distributed in a cloud environment on Amazon EC2 instances).
I've noticed Solr does an excellent job of caching results. When I execute the same queries again, the response shows a Solr QTime of 0 or 1 millisecond.
I want to stress test the Solr system, but I only have a limited list of queries I can use (50 000 unique queries). The problem is that all of them end up cached!
When I stress test, after 5 minutes or so all my queries have been sent to Solr and executed. This makes the system sweat under the heavy load :) (which was the purpose).
But then, as I execute the same query set again, QTime is almost zero!
--> Solr has an easy time & isn't stressed.
My question:
How can you turn off ALL Solr caches (both Solr and Lucene caches)?
Or how can you limit the caches?
I've tried turning off all of Solr's internal caches (queryResultCache and fieldCache), but results are still cached.
Note: The config mentions that Lucene manages an internal cache - maybe that cache is the problem?
It's just weird that all 50 000 queries can be stored in the cache out of the box.
You can comment out the filterCache, queryResultCache and documentCache in your configuration. Lucene's FieldCache cannot be disabled.
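For reference, a sketch of the relevant entries in a stock Solr solrconfig.xml; wrapping them in comments (as below) disables the application-level caches. The sizes shown are the usual shipped defaults, not recommendations:

<!-- solrconfig.xml: comment out the cache declarations to disable them -->
<!--
<filterCache class="solr.FastLRUCache" size="512"
             initialSize="512" autowarmCount="0"/>

<queryResultCache class="solr.LRUCache" size="512"
                  initialSize="512" autowarmCount="0"/>

<documentCache class="solr.LRUCache" size="512"
               initialSize="512" autowarmCount="0"/>
-->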
Although it doesn't really make any sense to do so, even for benchmarking. Would you also disable disk caching in your operating system? CPU caches (all three levels)? The internal cache of each hard disk?
Caches are part of the system, if you disable them you won't accurately simulate what happens in production, thus rendering the benchmark useless.
Turning off caches is an excellent idea, at least those that are application specific. A benchmark in this case is intended, I gather, to find the response time/cost of a query that has not been seen before, as opposed to queries that stay popular within a cache's expiry window.
You sound like you want metrics that tell you how the search system performs; not the query cache.
The previous answer is really out of left field in suggesting that all benchmarks should measure the same thing: one particular definition of "real life" performance. That is not how engineering works.
As to the remark about "disk caches": there are no disk caches in Linux, only a page cache. Whether a page is persisted on disk, created and destroyed in memory, or pre-allocated by a smart file system, they are all pages.
There is benefit to benchmarking with caches... if you bother to measure the cache performance metrics.
BTW, between "-server" and "-XX:CompileThreshold" you want to make sure your first large set of queries is either random enough or specifically chosen to exercise as many code paths in Solr/Lucene as you can, so that the JIT is both active and somewhat settled.
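For example (a sketch only; the threshold value is an arbitrary illustration, not a recommendation):

java -server -XX:CompileThreshold=1500 -jar start.jar
# -server selects the server JIT; lowering -XX:CompileThreshold makes
# HotSpot compile hot methods sooner, so the JIT settles earlier in the run.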
We have observed a problem in PostgreSQL: it doesn't use multiple CPU cores for a single query. For example, I have 8 cores in my CPU, and we have 40 million rows in the stock.move table. When we run a massive reporting query over a single database connection and observe the backend, we see only one core at 100% while the other 7 are idle. Because of that, query execution takes much longer and our Odoo system is slow. The problem is inside the PostgreSQL core: if we could somehow share a query between two or more cores, we would get a performance boost in query execution.
I am sure that solving parallel query execution would make Odoo performance even faster. Does anyone have any suggestions regarding this?
----------- * Editing this question to show the answer from a PostgreSQL core contributor *-----------
Here I am posting the answer I got from one of the top contributors to PostgreSQL. (I hope this information will be useful.)
Hello Hiren,
It is expected behaviour. PostgreSQL doesn't support parallel CPU for a single query. This topic is under heavy development, and this feature will probably be in the planned release 9.6 (~September 2016). But a table with 40M rows isn't too big, so more CPUs would probably not help you much (there is some overhead in starting and coordinating a multi-CPU query). You have to use the usual tricks like materialized views, pre-aggregations, ... The main idea of these tricks: don't repeat the same calculation often. Check the health of PostgreSQL - indexes, vacuum processing, statistics, ... Check the hardware - I/O speed. Check the PostgreSQL configuration - shared_buffers, work_mem. Some queries can be slow due to bad estimates - check the EXPLAIN output of slow queries. There are some tools that can break a query into multiple queries and run them in parallel, but I haven't used them. https://launchpad.net/stado
http://www.pgpool.net/docs/latest/tutorial-en.html#parallel
Regards, Pavel Stehule
Well, I think you have your answer there -- PostgreSQL does not currently support parallel query execution. The general advice on performance is very apt, and you might also consider partitioning, which could let you truncate partitions instead of deleting parts of a table, or increasing memory allocation. It's impossible to give good advice on that without knowing more about the query.
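To make the materialized-view / pre-aggregation trick concrete, a minimal sketch (the column names on stock_move are assumptions for illustration):

-- Pre-aggregate once instead of scanning 40M rows per report query.
CREATE MATERIALIZED VIEW stock_move_daily AS
SELECT product_id,
       date_trunc('day', date) AS day,
       sum(product_qty)        AS total_qty
FROM stock_move
GROUP BY product_id, date_trunc('day', date);

-- Refresh after large data changes; reports then read the small view.
REFRESH MATERIALIZED VIEW stock_move_daily;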
Having had experience with this sort of issue on non-parallel query Oracle systems, I suggest that you also consider what hardware you're using.
The modern trend towards CPUs with very many cores is a great help for web servers or other multi-process systems with many short-lived transactions, but you have a data processing system with few, large transactions. You need the correct hardware to support that. CPUs with fewer, more powerful cores are a better choice, and you have to pay attention to bandwidth to memory and storage.
This is why engineered systems have been popular with big data and data warehousing.
My client does not have the budget to set up and maintain a SOLR server for use in their production environment. If I understand the Sitecore 7 Content Search API correctly, it is not a big deal to configure things to use Lucene instead. For the most part the configuration will be similar, the code will be the same, and a SOLR server can be swapped in later.
The site build has:
a faceted search page
listing components on landing pages and other pages that will leverage the Content Search API
buckets with custom facets
The site has around 5,000 pages and components, not including media library items. Are there any concerns about simply using Lucene?
The main question is: when, during your architecture or design phase, do you know that you should definitely choose SOLR over Lucene? What are the major signs that lead you to recommend it?
I think if you are dealing with a customer on a limited budget then Lucene will work perfectly well, and will perform excellently at the scale of things you are doing. All the things you mention are fully supported by the Lucene implementation.
In a Sitecore scenario I would begin to consider Solr if:
You need to index a large number of items - I'd say 50 thousand upwards. Lucene is happy with these sorts of numbers, but Solr has improved query caching and is designed for large numbers of items.
The resilience of the search tier is of maximum business importance (i.e. the site is purely driven by search) - Solr provides a more robust replication/sharding and failover system with SolrCloud.
Re-purposing of the search tier in other (non-Sitecore) applications is important - Solr is a search application, so it can be accessed over HTTP with XML/JSON etc., which makes integration with external systems easier.
You need some specific additional feature of Solr that Lucene doesn't have.
.. but as you say, if you want to swap out Lucene for Solr at a later phase, we have worked hard to make sure that the process is as simple as possible. A few points worth noting here:
While your LINQ queries will stay the same, your configuration will be slightly different and will need attention to port across.
Understanding how Solr works as an application and how the schema works is important, but there are some great books and a wealth of knowledge out there.
Solr has slightly different (newer) analyzers and scoring mechanisms, so your search results may be slightly different (sometimes customers can get alarmed by this :P)
.. but I think these are things you can build up to over time and assess with the customer. I'm sure there are more points, and others can chime in if they think of them. Hope this helps :)
Stephen pretty much covered the question - but I just wanted to add another scenario. You need to take into account the server setup in your production environment. If you are going to be using multiple content delivery servers behind a load balancer I would consider Solr from the start, as trying to make sure that the Lucene index on each delivery server is synchronized 100% of the time can be painful.
I would recommend planning an escape from Lucene as soon as you start thinking about multiple CDs, and here is why:
A) Each server has to maintain its own index copy:
Any unexpected restart might cause a few documents not to be added to the index on one box, making indexes differ from server to server.
That would lead to the same page showing different results on different CDs.
Each server must perform its own index updates - using CPU & disk space; the response rate drops after every publish operation =/
According to the security guide, CDs should have the Sitecore Shell UI removed, so the index cannot easily be rebuilt from the Control Panel =\
B) Lucene is not designed for large volumes of content. Each search operation does roughly the following:
Create an array with a size equal to the total number of documents in the index
If a document matches the search, set its flag in the array
While this works like a charm for small indexes (~10K elements), huge performance degradation sets in once the volume of content grows.
The allocated array ends up in the Large Object Heap, which is not compacted by default and thereby gets fragmented fast.
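To make the allocation pattern concrete, a rough C# sketch of the behaviour described above (simplified pseudo-logic, not actual Lucene.NET code):

// Simplified illustration: each search allocates a flag array sized
// to the whole index.
int totalDocsInIndex = 100000;                // e.g. 100K documents
bool[] matches = new bool[totalDocsInIndex];  // ~100 KB
matches[42] = true;                           // flag each matching document
// .NET allocates arrays larger than ~85,000 bytes on the Large Object
// Heap, which is not compacted by default and fragments over time.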
Scenario:
Perform a search over 100K documents -> a huge array is created in memory
Perform one more search in another thread -> one more huge array is created
Update the index -> now 100K + 10 documents
The first operation completes; the LOH has space for a 100K array
Search is triggered again -> a 100K+10 array is to be created; the freed memory 'hole' is not large enough, so more RAM is requested
The w3wp.exe process keeps consuming more and more RAM
This is the common case for Analytics Aggregation as an index is being populated by multiple threads at once.
You'll see a lot of RAM used after a while on the processing instance.
C) The last Lucene.NET release was 5 years ago, whereas SOLR is actively developed.
The sooner you make the switch to SOLR, the easier it will be.
For the last couple of weeks I have been working with Elasticsearch and Solr, trying to do OLTP processing in real time. What strikes me is that both (especially ES) claim to be real time, but the meaning of "real time" looks fuzzy to me.
If we dig into it, both ES and Solr define a refresh rate or soft-commit rate after which newly indexed documents become available for search, effectively providing only near-real-time capabilities.
It looks like calling it real time is either a marketing statement, or the word is made fuzzy by talking about Real Time Search as opposed to batch or analytical processing.
Am I correct, or is real-time search possible in a typical OLTP system, where every transaction has search visibility to the last document?
Elasticsearch is a near-real-time engine for search. It is real time for operations like Create, Update, Delete and Get.
By default, the refresh interval is 1 second. In some use cases, it can appear to be real time. For example, I was working for a French government service where we produced statistics per day, so for our use case it was effectively real time from our perspective.
For logs for example, 1 second is enough in most use cases.
You can modify this default value but it comes with a cost.
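For example, relaxing the default via the index settings API (a sketch; the index name and interval are assumptions):

curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index": { "refresh_interval": "5s" }
}'
# Longer intervals improve indexing throughput at the cost of freshness;
# "-1" disables periodic refresh entirely (e.g. during bulk loads).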
If you really need real time, then you probably want to use a SQL database.
My 2 cents.
Yes, DSE Search is indeed near-real-time and has not yet achieved the mythical goal of absolute zero latency. But... even traditional "real" real-time is not real-time once you factor in the time to do the actual database update, plus the fact that a lot of traditional database updates are batch-oriented - and even when the update operation itself is not batched, there is likely some human process that delays the start of the database update relative to the original source of the data change.
Also keep in mind that the latency of a database update needs to include maintaining the required (tunable) consistency for replicating data updates in the cluster.
Rather than pushing you back towards SQL if you want real time, I would challenge you to fully justify the true latency requirements of the app. For example, complex distributed applications need to be prepared for occasional resource outages, such as network delays, so it is usually much better to design a modern distributed application to be flexible and asynchronous than to use a traditional, synchronous, fragile (think HealthCare.gov) architecture that improperly depends on a perception of zero-latency distributed operations.
Finally, we are working on enhancements to reduce the actual latency of database updates, coupled with ongoing improvements in hardware performance that further shrink the update latency window.
But ultimately, all computing real-time measures will have some non-zero latency and modern distributed apps must be designed for at least some degree of decoupling between database updates and absolute dependency on those updates.
Worst case scenario, apps that need to synchronize with database updates may need to implement a polling strategy to wait for the update to complete.
Elasticsearch has real-time features for CRUD operations. On GET operations, it checks the transaction log for any uncommitted changes and returns the most recent version of the document.
The Percolator feature enables real-time behavior for search queries as well. It allows you to register queries (percolations) that are evaluated at indexing time, so the response can tell you which predefined queries match the new document.
This workflow looks like this:
Register specific query (percolation) in Elasticsearch
Index new content (passing a flag to trigger percolation)
The response to the indexing operation will contain the matched percolations
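A minimal sketch of that workflow using the ES 1.x percolate API (the index, id and field names are made up, and the exact endpoints vary by ES version):

# 1. Register a query (percolation) under the reserved .percolator type:
curl -XPUT 'localhost:9200/my_index/.percolator/alert-1' -d '{
  "query": { "match": { "body": "solr caching" } }
}'

# 2./3. Percolate a new document; the response lists matching query ids:
curl -XGET 'localhost:9200/my_index/my_type/_percolate' -d '{
  "doc": { "body": "a post about solr caching" }
}'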
A very good blog with a live example that explains the Percolator concept:
http://blog.qbox.io/elasticsesarch-percolator
My Solr 4 instance is slow and I don't know why.
I am attempting to modify the configuration of the JVM, Tomcat 6 and Solr 4 in order to optimize performance, with queries per second as the key metric.
Currently I am running on an EC2 small tier with Debian squeeze, but ready to switch to Ubuntu if needed.
There is nothing special about my use case. The index is small. Queries do include a moderate number of unions (e.g. 10), plus faceting, but I don't think that's unusual.
My understanding is that these areas could need tweaking:
Configuring the JVM Garbage collection schedule and memory allocation ("GC tuning is a precise art form", ref)
Other JVM settings
Solr's Query Result cache, Filter cache, Document cache settings
Solr's Auto-warming settings
There are a number of ways to monitor the performance of Solr:
SolrMeter
Sematext SPM
New Relic
But none of these methods indicate which settings need to be adjusted, and there's no guide that I know of that steps through an exhaustive list of settings that could possibly improve performance. I've reviewed the following pages (one, two, three, four), and gone through some rounds of trial and error so far without improvement.
Questions:
How do I tell the JVM to use all 2 GB of memory on the small EC2 instance?
How do I debug and optimize JVM garbage collection?
How do I know when I/O throttling, such as the new EBS IOPS pricing, is the issue?
Using figures like the New Relic examples below, how do I detect problematic behavior, and how do I approach solutions?
Answers:
I'm looking for link to good documentation for setting up and optimizing Solr 4, from a DevOps or server admin perspective (not index or application design).
I'm looking for the top trouble spots in catalina.sh, solrconfig.xml, solr.xml (other?) that are most likely causes of problems.
Or any tips you think address the questions.
First, you should not focus on switching your Linux distribution. A different distribution might bring some changes, but considering the information you gave, nothing proves that these changes would be significant.
You are mentioning lots of possible optimizations, and this can be overwhelming. You should consider a tweaking area only once you have proven that the problem lies in that particular part of your stack.
JVM Heap Sizing
You can use the parameter -Xmx1700m to give a maximum of 1.7GB of RAM to the JVM. Hotspot might not need it all, so don't be surprised if your heap capacity does not reach that number.
You should set the minimum heap size to a low value, so that Hotspot can optimise its memory usage. For instance, to set a minimum heap size of 128MB, use -Xms128m.
Garbage Collector
From what you say, you have limited hardware (a single core at 1.0-1.2 GHz, see this page):
M1 Small Instance
1.7 GiB memory
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
...
One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2
GHz 2007 Opteron or 2007 Xeon processor
Therefore, using the low-latency GC (CMS) won't do any good. It won't be able to run concurrently with your application since you have only one core. You should switch to the throughput GC using -XX:+UseParallelGC -XX:+UseParallelOldGC.
Is the GC really a problem ?
To answer that question, you need to turn on GC logging. It is the only way to see whether GC pauses are responsible for your application response time. You should turn these on with -Xloggc:gc.log -XX:+PrintGCDetails.
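Putting those flags together, a sketch of what this could look like in Tomcat's bin/setenv.sh (the file path, log path and heap values are illustrations, not recommendations):

JAVA_OPTS="$JAVA_OPTS -Xms128m -Xmx1700m"                        # heap sizing
JAVA_OPTS="$JAVA_OPTS -XX:+UseParallelGC -XX:+UseParallelOldGC"  # throughput GC
JAVA_OPTS="$JAVA_OPTS -Xloggc:gc.log -XX:+PrintGCDetails"        # GC logging
export JAVA_OPTS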
But I don't think the problem lies here.
Is it a hardware problem ?
To answer this question, you need to monitor resource utilization (disk I/O, network I/O, memory usage, CPU usage). You have a lot of tools to do that, including top, free, vmstat, iostat, mpstat, ifstat, ...
If you find that some of these resources are saturating, then you need a bigger EC2 instance.
Is it a software problem ?
In your stats, the document cache hit rate and the filter cache hit rate are healthy. However, the query result cache hit rate is pretty low, which implies that a lot of distinct queries are actually being executed.
You should monitor the query execution time. Depending on that value you may want to increase the cache size or tune the queries so that they take less time.
More links
JVM options reference : http://jvm-options.tech.xebia.fr/
A write-up I did of an application performance audit: http://www.pingtimeout.fr/2013/03/petclinic-performance-tuning-about.html
Hope that helps !
CouchDB is great, and I like its p2p replication functionality, but it's a bit large (because we have to install Erlang) and slow when used in a desktop application.
Testing on an Intel Core Duo CPU:
12 seconds to load 10,000 docs
10 seconds to insert 10,000 docs, but 20 more seconds to update the view, so 30 seconds in total
Is there any NoSQL implementation that has the same p2p replication functionality but is very small like SQLite, and is quite fast (1 second to load 10,000 docs)?
Have you tried using Hovercraft and/or the Erlang view server? I had a similar problem and found that staying within the Erlang VM (thereby avoiding excursions to SpiderMonkey) gave me the boost I needed. I did 3 things...
Boosting Queries: Porting your map/reduce functions from JS to "native" Erlang usually gives a tremendous performance boost when querying Couch (http://wiki.apache.org/couchdb/EnableErlangViews) - see the sketch after this list. Also, managing views is easier because you can call external libs or your own compiled modules (just add them to your ebin dir), reducing the number of uploads you need to do during development.
Boosting Inserts: Using Hovercraft for inserts gives up to a 100x increase in performance (https://github.com/jchris/hovercraft). This was mentioned in the CouchDB book (http://guide.couchdb.org/draft/performance.html).
Pre-Run Views: The last thing you can do for desktop apps is run your views during application startup (say, while the splash screen is showing). The first time views are run is always the slowest; subsequent runs are faster.
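For point 1, a minimal sketch of a native Erlang map function (the "type" and "name" fields are made-up examples; Emit/2 is provided by the Erlang view server, which must be enabled as per the wiki link above):

fun({Doc}) ->
    %% Doc is a proplist of the document's fields.
    case proplists:get_value(<<"type">>, Doc, null) of
        <<"customer">> ->
            Emit(proplists:get_value(<<"name">>, Doc, null), 1);
        _ ->
            ok
    end
end.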
These helped me a lot.
Edmond -
Unfortunately the question doesn't offer enough details about your app requirements, so it's kind of difficult to offer advice. Anyway, I'm not aware of any other storage solution offering similar or more advanced P2P replication.
A couple of questions/comments about your requirements:
what kind of desktop app requires 10000 inserts/second?
when you say size what exactly are you referring to?
You might want to take a look at:
Redis
RavenDB
Also check some of the other NoSQL-solutions listed on http://nosql.mypopescu.com against your app requirements.