Information about TieredMergePolicy - Solr

I would like to better understand Solr merge behaviour. I did some research on the different merge policies, and it seems that TieredMergePolicy is better than the older merge policies (LogByteSizeMergePolicy, etc.). That's why I use it, and it is also the default policy in recent Solr versions.
First, here are some interesting links that I've read to get a better idea of the merge process:
http://java.dzone.com/news/merge-policy-internals-solr
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
Based on the official Lucene documentation, I would like to ask several questions:
http://lucene.apache.org/core/3_2_0/api/all/org/apache/lucene/index/TieredMergePolicy.html
Questions
1 - In the official documentation there is a method called setExpungeDeletesPctAllowed(double v). I checked the TieredMergePolicy class in Solr 4.3.0 and didn't find this method, but there is another one that looks like it, called setForceMergeDeletesPctAllowed(double v). Is there any difference between the two methods?
2 - Are the two methods above only taken into account when you do an expungeDeletes or an optimize, or are they also applied during a normal merge?
3 - I've read that merges between segments take the percentage of deleted documents in a segment into account, and that by default this percentage is set to 10%. Is it possible to set this value to 0% to be sure that there are no deleted documents left in the index after merging?
I need to reduce the size of my index without calling the optimize() method if possible, so any information about the merge process would be helpful.
Thanks

You appear to be mixing up your documentation. If you are using Lucene 4.3.0, use the documentation for that version (see the TieredMergePolicy documentation for 4.3.0) rather than the documentation for version 3.2.0.
Anyway, on these particular questions (see LUCENE-3577):
1 - For all intents and purposes it is just a rename; the behaviour is the same.
2 - Firstly, IndexWriter.expungeDeletes no longer exists in 4.3.0. You can use IndexWriter.forceMergeDeletes() if you must, though it is strongly recommended against, as it is very, very costly. I believe this setting only impacts a forceMergeDeletes() call. If you want normal merges to favor reclaiming deletions, set that on the merge policy using TieredMergePolicy.setReclaimDeletesWeight.
3 - The percentage allowed is controlled by exactly the method you pointed out in your first question. Forcing all the deletions to be merged out when calling forceMergeDeletes() will make an already very expensive operation that much more expensive, though (see the sketch below).
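For reference, here is a minimal sketch of how these knobs can be set through the Lucene 4.3 Java API (the values and path are examples only, not recommendations; in Solr itself you would normally configure the equivalent merge-policy properties in solrconfig.xml rather than in Java code):

    // Minimal sketch against the Lucene 4.3 Java API; values and path are examples only.
    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TieredMergePolicySketch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/tmp/example-index"));

            TieredMergePolicy tmp = new TieredMergePolicy();
            // Segments whose delete percentage exceeds this threshold are merged
            // by forceMergeDeletes(); 0.0 makes every segment with deletes eligible.
            tmp.setForceMergeDeletesPctAllowed(0.0);
            // Bias normal merge selection towards segments with many deletes
            // (default is 2.0; higher values reclaim deletes more aggressively).
            tmp.setReclaimDeletesWeight(3.0);

            IndexWriterConfig iwc =
                new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
            iwc.setMergePolicy(tmp);

            try (IndexWriter writer = new IndexWriter(dir, iwc)) {
                // ... add and delete documents here ...
                // Very costly: explicitly merges segments to drop deleted documents.
                writer.forceMergeDeletes();
            }
        }
    }

Note that forceMergeDeletesPctAllowed only changes which segments forceMergeDeletes() considers; it does not make normal merges purge every delete.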
Just to venture a guess, if you need to save the disk space taken by your index, you'll likely have much more success looking more closely at how much data you are storing in the index. There isn't enough information to say for sure, of course, but it seems a likely avenue to consider.

Related

Accuracy Document Embedding in Apache Solr

I used BERT document embeddings to perform information retrieval on the CACM dataset and achieved a very low accuracy score of around 6%. However, when I used the traditional BM25 method, the result was a lot closer to 40%, which is close to the average accuracy found in the literature for this dataset. This is all being performed within Apache Solr.
I also attempted to perform information retrieval using Doc2Vec and achieved similarly poor results as with BERT. Is it not advisable to use document embeddings for IR tasks such as this one?
Many people find document embeddings work really well for their purposes!
If they're not working for you, possible reasons include:
insufficiency of training data
problems in your unshown process
different end goals than others (what is your idea of 'accuracy'?)
It's impossible to say what's affecting your process, and your perception of its usefulness, without far more detail on what you're aiming to achieve and what you're actually doing.
Most notably, if there is other published work that uses the same dataset and a similar definition of 'accuracy', and it claims far better results with the same methods that give worse results for you, then it's likely that there are errors in your implementation.
You'd have to name the results you're trying to match (ideally with links to the exact write-ups), and show the details of what your code does, for others to have any chance of guessing what's happening for you.

Solr near real time search: impact of reindexing frequently the same documents

We want to use Solr in a near-real-time scenario; say, for example, we want to filter/rank our results by number of views.
Solr soft commits were made for this use case, but:
In practice, the same few documents are updated very frequently (just the nb_view field) while most of the documents are untouched.
As far as I know, each update, even a partial one, is implemented in Lucene as a full delete followed by a full re-addition of the document.
It seems to me that having the same docs many times over in the tlog is inefficient and might also be problematic during the merge process (is the doc marked as deleted and re-added n times?).
Any advice / good practice?
Two things you could use to support this scenario:
In-place updates: only that field is updated, not the whole document. Check the conditions you need to meet to be able to use them (see the sketch at the end of this answer).
ExternalFileField: you keep the values in an external file.
If the scenario is critical, I would test both in real-world conditions if possible, and assess.
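As a rough illustration of the first option, here is a minimal SolrJ sketch (assuming SolrJ 8+, a collection named "posts", and a single-valued numeric docValues field nb_view; the names and URL are made up for the example):

    // Minimal SolrJ sketch of an atomic "inc" update on a counter field.
    import java.util.Collections;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IncrementViewCount {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-42");
                // "inc" is an atomic-update modifier. If nb_view meets the in-place
                // update conditions (single-valued, docValues, not indexed, not
                // stored, not a copyField target), Solr rewrites only the docValues
                // instead of deleting and re-adding the whole document.
                doc.addField("nb_view", Collections.singletonMap("inc", 1));
                client.add("posts", doc);
                // Visibility comes from soft autoCommit (NRT); avoid explicit hard commits here.
            }
        }
    }

If the in-place conditions are not met, the same request still behaves as a regular atomic update, but behind the scenes Solr deletes and re-adds the whole document, which is exactly the overhead described in the question.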

What is faster way to count relation?

I'm using Waterline, which is an amazing ORM for Node.js. I think there are two ways to count relations (associations).
The first way is to maintain the count whenever a related record is added or removed, e.g. when a comment is appended to a post, the post's comment-count field is incremented.
The second way is to use a 'count' query and count the relations whenever I need the number.
My worry is that the second way is easier but seems slower than the first one, since it can generate too many requests; the first way, however, needs more dirty code.
I really don't know which is the best way to count relations.
The answers to this question have to be a little opinionated, but I will give you my point of view.
I would go with the "count query" solution because it is the most reliable way to get this information. As you said, the other solution needs more dirty code and is more likely to be buggy. I always try to have a single way to retrieve a given piece of information.
If the query is too slow and/or too frequent and slows down your application, then you should consider caching the result. Depending on your infrastructure, you could cache the result of the query in a variable or in a fast cache backend like memcached or Redis. You will have to invalidate the cache when needed, and it is up to you to decide the lifetime of the cached value. It is worth defining a global caching strategy for your application so you can reuse it in other parts of the application.

How to handle frequently changing multivalue string fields in SOLR?

I have a Solr (or rather Heliosearch 0.07) core on a single EC2 instance. It contains about 20M documents and takes about 50GB on disk. The core is quite fixed/frozen and performs quite well once everything is warmed up.
The problem is a multi-valued string field: that field contains assigned categories, which change quite frequently for large parts of the 20M documents. After a commit, the warm-up takes far too long to be usable in production.
The field is used only for faceting and filtering. My idea was to store the categories outside Solr and to inject them somehow using custom code. I checked quite a few approaches in various JIRA issues and blogs, but I could not find a working solution. Item 2 of this issue suggests that there is a solution, but I don't understand what he's talking about.
I would appreciate any solution which allows me to update my category field without having to re-warm my caches afterwards.
I'm not sure that JIRA issue will help you: it is an advanced topic and, most importantly, it is still unresolved, so the feature is not yet available.
Partial document updates are not useful here because a) they require every field in your schema to be stored, and b) behind the scenes they re-index the whole document anyway.
From what you say it seems you have one monolithic index: have you considered splitting the index using sharding or SolrCloud? That way each "portion" would be smaller and the autowarm shouldn't be such a big problem.

What does the ndb.EVENTUAL_CONSISTENCY option mean?

The documentation of the ndb.Query class states that it accepts a read_policy option that can be set to EVENTUAL_CONSISTENCY to allow faster queries that might not be strongly consistent, which implies that not using this option would return strongly consistent results.
However, global queries are always eventually consistent. So what does this flag actually do?
You can choose to have an ancestor query, which would normally be strongly consistent, use the eventually consistent read policy instead, for the stated speed improvement.
The old 'db' module docs explain this.
(If you've only ever used NDB, then the DB docs are definitely worth reading - there is a lot more detail on how things work, and how best to make use of datastore.)
