Utilising the query cache of a read-write index in Elasticsearch

I have a large read-write index that I want to optimise for search speed.
As far as I understand, an index that is constantly being written to gets little benefit from the query cache, because part of the cache is invalidated every time the index is refreshed.
However, I don't need the documents that are being indexed to be immediately available for search, so I think the query cache could be put to better use.
So the question is:
Would increasing the refresh interval to about 6 hours help keep some of that cache around for longer? Is it a bad idea to use such a high value? I haven't seen any docs where such values are recommended or advised against. If I am not mistaken, persistence should still be guaranteed by the translog.
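For what it's worth, the refresh interval can be changed dynamically through the index settings API, so a long value like 6h is easy to try out and roll back. A minimal sketch using Java's built-in HttpClient (the host and index name are hypothetical):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RefreshIntervalExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster address and index name.
            String url = "http://localhost:9200/my_index/_settings";
            // With a 6h refresh interval, newly indexed documents only become searchable when a
            // refresh runs (at most every six hours, or when triggered manually); durability is
            // still covered by the translog. A value of "-1" disables periodic refresh entirely.
            String body = "{\"index\": {\"refresh_interval\": \"6h\"}}";

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }

A manual refresh (POST /my_index/_refresh) can still be issued on demand if a batch needs to become searchable sooner than the interval allows.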

Related

Couchbase retrieve data from vbucket

I'm new to Couchbase and I'm wondering whether there is any way to implement a parallel read from a bucket. A bucket contains 1024 vbuckets by default, so would it be possible to split a N1QL query such as select * from b1 into several queries, where one of those queries reads data only from vbucket1 through vbucket100? Since the partition key decides which node a value is persisted on, I think it should be possible to read a portion of the bucket according to a range of partition keys. Could someone help me out with this?
Thanks
I don't recommend proceeding down this route. If you are just starting out, you should be worrying about how to represent your data in JSON, how to write effective N1QL queries against it, and how to get a useful set of indexes that support those queries and let them run quickly. You should also make sure that your cluster is properly set up, and you have a proper mix of KV, N1QL, and indexing nodes, with none of them as an obvious bottleneck. And of course you should be measuring performance. Exotic strategies like query partitioning should come after that, if you are still unsatisfied with performance.
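To make the "useful set of indexes that support those queries" point concrete, here is a rough sketch with the Couchbase Java SDK (3.x assumed; the bucket name b1 comes from the question, while the credentials and the type field are hypothetical):

    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.query.QueryResult;

    public class N1qlIndexExample {
        public static void main(String[] args) {
            // Hypothetical connection string and credentials.
            Cluster cluster = Cluster.connect("couchbase://localhost", "Administrator", "password");

            // A secondary index that supports the query below; normally you would create this
            // once (e.g. from the query workbench), it is inlined here only to keep the sketch
            // self-contained.
            cluster.query("CREATE INDEX idx_b1_type ON b1(type)");

            // With the index in place the query only touches documents matching the predicate,
            // instead of scanning the whole bucket.
            QueryResult result = cluster.query("SELECT META().id, b1.* FROM b1 WHERE type = 'ad' LIMIT 10");
            result.rowsAsObject().forEach(row -> System.out.println(row));

            cluster.disconnect();
        }
    }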

When to optimize a Solr Index [duplicate]

I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I do it with optimize?
A little more detail on Commit/Optimize:
Commit: When you are indexing documents to Solr, none of the changes you make will appear until you run the commit command. So the timing of the commit command really depends on how quickly you want changes to appear on your site through the search engine. It is a heavy operation, however, so it should be done in batches, not after every update.
Optimize: This is similar to a defrag command on a hard drive. It will merge the index segments (increasing search speed) and remove any deleted (replaced) documents. Solr's index segments are write-once, so every time you re-index a document it marks the old version as deleted and creates a brand new document to replace it. Optimize removes these deleted documents. You can see the searchable vs. deleted document count by going to the Solr Statistics page and comparing the numDocs and maxDocs numbers; the difference between the two is the number of deleted (non-searchable) documents still in the index.
Also, Optimize builds a whole new index from the old one and then switches to it when complete, so the command requires double the space to run. You will therefore need to make sure that the size of your index does not exceed 50% of your available hard drive space. (This is a rule of thumb; it usually needs less than 50% because of the deleted documents.)
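To make the distinction concrete, here is a minimal SolrJ sketch of the batch-then-commit pattern, with optimize reserved for an occasional maintenance window (the core URL and field names are hypothetical; a reasonably recent SolrJ is assumed):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitOptimizeExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL.
            try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/classifieds").build()) {

                // Build a batch of ads and send them in one request, without committing per document.
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 1000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "ad-" + i);
                    doc.addField("title_t", "Sample ad " + i);
                    batch.add(doc);
                }
                solr.add(batch);

                // One commit per batch: this is the point where the new and updated
                // documents become visible to searches.
                solr.commit();

                // Optimize rarely (nightly at most): it merges segments and purges deleted
                // documents, but temporarily needs roughly double the index size on disk.
                solr.optimize();
            }
        }
    }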
Index Server / Search Server:
Paul Brown was right in that the best design for Solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the searching servers. You can set up the index server with multiple index endpoints.
e.g. http://solrindex01/index1; http://solrindex01/index2
And since the index server is not searching for content you can have it set up with different memory footprints and index warming commands etc.
Hope this is useful info for everyone.
Actually, committing often and optimizing makes things really slow; both are heavy operations.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while being optimized, and makes things really slow.
2- Committing after each add is NOT a good idea; it's better to commit a couple of times a day, and then run an optimize at most once a day.
3- Commit should be configured as "autoCommit" in the solrconfig.xml file, and tuned there according to your needs.
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.

Solr: How can I improve the performance of a filter query (for a specific value, not a range query) on a numeric field?

I have an index with somewhere between 60 and 100 million documents. We almost always query these documents (in addition to other filter queries, field queries, etc.) on a foreign key id, to scope the query to a specific parent object.
So, for example: /solr/select?q=*:*&fq=parent_id_s:42
Yes, that _s means this is currently a solr.StrField field type.
My question is: should I change this to a TrieIntField? Would that speed up performance? And if so, what would be the ideal precisionStep and positionIncrementGap values, given that I know that I will always be querying for a single specific value, and that the cardinality of that parent_id is in the 10,000-100,000 (maximum) order of magnitude?
Edit for additional detail (from a comment on an answer below):
The way our system is used, it turns out that we end up using that same fq for many queries in a row. And when the cache is populated, the system runs blazing fast. When the cache gets dumped because of a commit, this query (even a test case with ONLY this fq) can take up to 20 seconds. So I'm trying to figure out how to speed up that initial query that populates the cache.
Second edit:
I apologize; after further testing it turns out that the poor performance above only happens when facet fields are also being returned (e.g. &facet=true&facet.field=resolved_facet_facet). With a dozen or so of these fields, the query sometimes takes 20-30 seconds, but only with a fresh searcher. It's instant once the cache is populated. So maybe my problem is the facet fields, not the parent_id field.
TrieIntField with a precisionStep is optimized for range queries. As you're only searching for a specific value, your field type is already optimal.
Have you looked at autowarming queries? These run whenever a new IndexSearcher is being created (on startup, on an index commit for example), so that it becomes available with some cache already in place. Depending on your requirements, you can also set useColdSearcher flag to true, so that the new Searcher is only available when the cache has been warmed. For more details have a look here: https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig#QuerySettingsinSolrConfig-Query-RelatedListeners
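The listeners themselves are configured in solrconfig.xml (see the link above). Purely to illustrate what a warming query does, here is a client-side sketch that fires the expensive filter/facet combination right after a commit, so the new searcher's caches are populated before real users hit it (the core URL is hypothetical; the field names are taken from the question):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class WarmFilterCacheExample {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                // Run right after a commit: executing the common fq (plus the facets that were
                // slow) rebuilds the relevant caches on the freshly opened searcher.
                SolrQuery warm = new SolrQuery("*:*");
                warm.addFilterQuery("parent_id_s:42");
                warm.setRows(0);                        // only the side effect on the caches matters
                warm.setFacet(true);
                warm.addFacetField("resolved_facet_facet");
                solr.query(warm);
            }
        }
    }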
It sounds like you probably aren't getting much benefit from the caching of result sets from the filter. One of the more important features of filters is that they cache their result sets. This makes the first run of a certain filter take longer while a cache is built, but subsequent uses of the same filter are much faster.
With the cardinality you've described, you are probably just wasting cycles, and polluting the filter cache, by building caches without them ever being of use. You can turn off caching of a filter query like:
/solr/select?q=*:*&fq={!cache=false}parent_id_s:42
I also think a filter query doesn't help in this case.
q=parent_id_s:42 queries the index for the term 42 in the parent_id_s field and gets back a set of document ids. Since the postings (document ids) are indexed by term, and assuming you have enough memory to hold them (either in the JVM heap or the OS cache), this lookup should be pretty fast.
Assuming filter cache is already warmed up and you have 100% hit ratio, which one of the following is faster?
q=parent_id_s:42
fq=parent_id_s:42
I think they are very close, but I could be wrong. Does anyone know? Has anyone run a performance test for this?
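One rough way to check would be to compare the QTime Solr reports for both forms against the same core, repeating the queries so the caches settle (the core URL is hypothetical):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class QVersusFqExample {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                // Variant 1: the term restriction as the main query.
                SolrQuery asQ = new SolrQuery("parent_id_s:42");
                asQ.setRows(0);

                // Variant 2: match-all main query, term restriction as a (cacheable) filter query.
                SolrQuery asFq = new SolrQuery("*:*");
                asFq.addFilterQuery("parent_id_s:42");
                asFq.setRows(0);

                for (int i = 0; i < 5; i++) {
                    QueryResponse r1 = solr.query(asQ);
                    QueryResponse r2 = solr.query(asFq);
                    System.out.println("q= " + r1.getQTime() + " ms, fq= " + r2.getQTime() + " ms");
                }
            }
        }
    }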

solr indexing strategy

We have millions of documents in Mongo that we are looking to index in Solr. Obviously when we do this the first time we need to index all the documents.
But after that, we should only need to index documents as they change. What is the best way to do this? Should we call addDocument and then call commit() from cron? What do addDocument vs. commit vs. optimize actually do? (I am using Apache_Solr_Service.)
If you're using Solr 3.x you can forget the optimize, which merges all segments into one big segment. The commit makes changes visible to new IndexReaders; it's expensive, I wouldn't call it for each document you add. Instead of calling it through a cron, I'd use the autocommit in solrconfig.xml. You can tune the value depending on how much time you can wait to get new documents while searching.
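As a sketch of that add-without-explicit-commit pattern: the question mentions the PHP Apache_Solr_Service client, so this is just the same idea expressed in SolrJ, using commitWithin so visibility is bounded without a per-document commit (the core URL, field names, and the "changed since last sync" source are hypothetical):

    import java.util.List;
    import java.util.Map;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class DeltaIndexExample {
        // Hypothetical: documents that changed in the source database since the last sync.
        static List<Map<String, String>> changedSinceLastSync() {
            return List.of(Map.of("id", "doc-1", "title_t", "Updated title"));
        }

        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                for (Map<String, String> source : changedSinceLastSync()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    source.forEach(doc::addField);
                    // commitWithin (here 5 minutes, in milliseconds): Solr makes the change
                    // visible within that window without an explicit commit() per document.
                    solr.add(doc, 5 * 60 * 1000);
                }
                // No commit() and no optimize() here; commitWithin / autoCommit handle visibility.
            }
        }
    }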
The document won't actually be added to the index until you do commit() - it could be rolled back. optimize() will (ostensibly; I've not had particularly good luck with it) reduce the size of the index (documents that have been deleted still take up room unless the index is optimized).
If you set autocommit for your database, then you can be sure that any documents added via update have been committed once the autocommit interval has passed. I have used a 5-minute interval and it works fine even when a few thousand updates happen within those 5 minutes. After a full reindex is complete, I wait 5 minutes and then tell people that it is done. In fact, when people ask how quickly updates get into the db, I tell them that we poll for changes every minute, but that there are variables (such as a sudden big batch) and it is best not to expect things to be updated for 5 or 6 minutes. So far, nobody has really claimed a business need to have it update faster than that.
This is with a 350,000 record db totalling roughly 10G in RAM.

Ehcache, Hibernate, updating cache of very large table when a new entry is added?

I'm new to Ehcache and have been searching for how to do this, but I'm not quite sure whether this is a normal use case. I am working on an application that isn't a traditional web app; it's only used by a few people at a time and is for retrieving data from a very large dataset. Rather than making a call to the DB each time, I want to use caching to cache this large table. However, there is a chance that a new entry could be added to this table, and I need that reflected in the cache, but I don't want to reload the entire cache each time as it's quite large. Any advice on how to approach this / further resources is appreciated.
You should learn about the Hibernate query cache. In simple words: it works on top of the second-level cache (L2) and stores the results of queries. But it only stores the ids of the records that the query returns, not the entities themselves. This means that you need to have L2 working and fine-tuned.
In your scenario suppose you have 1M records in table T and a query that returns 1K by average. The first time you run this query it will miss the query cache and:
run the SQL
fetch 1K records
put all of them in L2
put 1K ids in query cache
The next time you execute the query it will hit the query cache and look up all the results from L2. The interesting part comes when you modify table T: Hibernate will figure out that the results in the query cache might be stale, and it will invalidate the whole query cache, but not L2. It will basically repeat steps 1-4, refreshing only the query cache (most of the entities from table T are already in L2).
In some scenarios it works great; in others it introduces N+1 problems at unpredictable moments. And this is just the tip of the iceberg - you should be really careful, as this mechanism is fragile and requires a good understanding of what it is doing.
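For completeness, here is a minimal sketch of what enabling the query cache looks like (Hibernate 5 with the hibernate-ehcache module is assumed; the entity, field, and query are hypothetical):

    import java.util.List;

    import javax.persistence.Cacheable;
    import javax.persistence.Entity;
    import javax.persistence.Id;

    import org.hibernate.Session;
    import org.hibernate.annotations.Cache;
    import org.hibernate.annotations.CacheConcurrencyStrategy;

    // The entities themselves must live in L2 for the query cache to pay off,
    // because the query cache only stores ids.
    @Entity
    @Cacheable
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
    class CatalogEntry {
        @Id
        Long id;
        String category;
    }

    public class QueryCacheExample {
        // Required settings (hibernate-ehcache assumed on the classpath):
        //   hibernate.cache.use_second_level_cache = true
        //   hibernate.cache.use_query_cache        = true
        //   hibernate.cache.region.factory_class   = org.hibernate.cache.ehcache.EhCacheRegionFactory

        static List<CatalogEntry> findByCategory(Session session, String category) {
            return session.createQuery("from CatalogEntry e where e.category = :cat", CatalogEntry.class)
                    .setParameter("cat", category)
                    .setCacheable(true)   // opts this query's id list into the query cache
                    .list();
        }
    }

setCacheable(true) is what puts the id list into the query cache; without the entity-level caching above, every query-cache hit would still go back to the database row by row to load the entities, which is exactly the N+1 risk mentioned.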
