What are the advantages of using synonyms at index time vs expanding at query time? In what case would you use both?
There's a very good write-up at http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/. I remember that one because I had the same question recently, and I found it via Google *wink wink*.
Basically, there's a huge difference between the two. Index-time expansion bakes the synonyms into the index itself: multi-word synonyms and phrase queries behave correctly and relevance stays sane, but any change to the synonym list forces a full reindex. Query-time expansion lets you edit synonyms without reindexing, but it has the classic problems that write-up describes: the query parser splits on whitespace before analysis, so multi-word synonyms never match, and the expanded terms skew IDF scoring. Depending on what you're trying to achieve, you may need to use both in the end, e.g. index-time for the stable, multi-word synonyms and query-time for the ones you tweak often.
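As a concrete illustration, here is a minimal sketch that registers an index-time synonym field type through Solr's Schema API; moving the synonym filter into queryAnalyzer instead would give query-time expansion. The host, collection, and field type names are placeholders, and it assumes a modern Solr with SynonymGraphFilterFactory:

```python
import requests

# Field type with synonyms applied at index time only; putting the
# SynonymGraphFilterFactory under "queryAnalyzer" instead would give
# query-time expansion. Names below are placeholders.
field_type = {
    "add-field-type": {
        "name": "text_syn_index",
        "class": "solr.TextField",
        "indexAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                {"class": "solr.LowerCaseFilterFactory"},
                {"class": "solr.SynonymGraphFilterFactory",
                 "synonyms": "synonyms.txt"},
                # graph output must be flattened before indexing
                {"class": "solr.FlattenGraphFilterFactory"},
            ],
        },
        "queryAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [{"class": "solr.LowerCaseFilterFactory"}],
        },
    }
}

resp = requests.post("http://localhost:8983/solr/docs/schema", json=field_type)
resp.raise_for_status()
```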
We want to use Solr in a near-real-time scenario. Say, for example, we want to filter/rank our results by number of views.
Solr's soft commits were made for this use case, but:
In practice, the same few documents are updated very frequently (just for the nb_view field) while most of the documents are untouched.
As far as I know, each update, even a partial one, is implemented in Lucene as a full delete plus a full re-add of the document.
It seems to me that having the same docs many times over in the tlog is inefficient, and it might also be problematic during the merge process (is the doc marked as deleted and re-added n times?).
Any advice / good practice?
Two things you could use for supporting this scenario:
In-place updates: only that field is updated, not the whole doc. Check out the conditions you need to meet to be able to use them: the field must be a single-valued, non-indexed, non-stored numeric docValues field (see the sketch after this list).
ExternalFileField: you keep the values in an external file next to the index instead of in the index itself.
If the scenario is critical, I would test both in real-world conditions if possible, and assess.
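A minimal sketch of the in-place update path, assuming a field defined roughly as <field name="nb_view" type="plong" indexed="false" stored="false" docValues="true"/> and a collection called docs; the host and names are placeholders, while the atomic-update JSON syntax is standard Solr:

```python
import requests

SOLR = "http://localhost:8983/solr/docs"  # placeholder host/collection

# An "inc" (or "set") on a single-valued, non-indexed, non-stored numeric
# docValues field is executed as an in-place update: Lucene rewrites only
# the docValues for that field instead of deleting and re-adding the doc.
resp = requests.post(
    f"{SOLR}/update",
    json=[{"id": "doc-42", "nb_view": {"inc": 1}}],
)
resp.raise_for_status()
```

The new value becomes visible on the next soft commit, and the tlog no longer accumulates full copies of otherwise untouched documents.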
I have a Solr instance with 200M+ documents. I would like to find an efficient way to iterate over all those documents.
I tried using the start parameter to formulate a list of queries:
http://ip:port/solr/docs/select?q=*:*&start=0&rows=1000000&fl=content&wt=python
http://ip:port/solr/docs/select?q=*:*&start=1000000&rows=1000000&fl=content&wt=python
...
But it is very slow when start gets too high.
I also tried using the cursorMark parameter with an initial query like this one:
http://ip:port/solr/docs/select?q=*:*&cursorMark=*&sort=id+asc&start=0&rows=1000000&fl=content&wt=python
which I believe tries to sort all the documents first and crashes the server. Sadly, I don't think it is possible to bypass the sort. What would be the proper way to do it?
Deep paging with start is a very well-known antipattern. You just need to use the cursorMark feature to go deep into a result set.
If cursorMark is not doable, then try the export handler (/export), which streams the whole result set but requires docValues on the exported and sorted fields.
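A minimal cursorMark loop, with a placeholder host/collection standing in for the ones in the question, and handle() standing in for whatever per-document processing you need; the cursorMark/nextCursorMark round trip itself is standard Solr:

```python
import requests

SOLR = "http://localhost:8983/solr/docs/select"  # placeholder host/collection

params = {
    "q": "*:*",
    "fl": "content",
    "rows": 1000,        # keep pages small; the cursor makes paging cheap
    "sort": "id asc",    # must end on the uniqueKey field for a stable order
    "cursorMark": "*",   # "*" marks the start of the result set
    "wt": "json",
}

while True:
    data = requests.get(SOLR, params=params).json()
    for doc in data["response"]["docs"]:
        handle(doc)  # hypothetical per-document callback
    next_cursor = data["nextCursorMark"]
    if next_cursor == params["cursorMark"]:
        break        # the cursor stopped advancing: result set exhausted
    params["cursorMark"] = next_cursor
```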
Okay, so I couldn't make it work with the cursor, even though it's probably me not knowing the tool well enough. If you are having the same problem as me, here are three tracks:
Track one: use cursor sorting on _docid_, as suggested by @femtoRgon. I couldn't make it work, but I didn't have a lot of time to allocate to it.
Track two: use the export handler, as suggested by @Persimmonium.
Track three (lazy track): what I did in the end is keep using incremental start values, but switch from wt=python to wt=csv, which is much faster and allows me to query in batches of 10M documents. This limits the number of queries, and the cost of using start instead of cursorMark is more or less amortized (see the sketch after this list).
Good luck, post your solutions if you find anything better.
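For what it's worth, a rough sketch of that lazy track, with the same placeholder host/collection as above; wt=csv streams plain rows, so each batch can be consumed line by line instead of being parsed as one huge response:

```python
import csv
import requests

SOLR = "http://localhost:8983/solr/docs/select"  # placeholder host/collection
BATCH = 10_000_000                               # 10M docs per request

start = 0
while True:
    resp = requests.get(
        SOLR,
        params={"q": "*:*", "fl": "content", "wt": "csv",
                "start": start, "rows": BATCH},
        stream=True,  # avoid buffering millions of rows in memory
    )
    lines = resp.iter_lines(decode_unicode=True)
    next(lines, None)             # first line of every batch is the CSV header
    count = 0
    for row in csv.reader(lines):
        count += 1                # replace with real per-row processing
    if count < BATCH:
        break                     # a short batch means the result set is done
    start += BATCH
```

The server still pays the deep-start cost on the later batches, but with only ~20 requests for 200M documents that cost is paid rarely.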
I have a SOLR (or rather Heliosearch 0.07) core on a single EC2 instance. It contains about 20M documents and takes about 50GB on disk. The core is quite fixed/frozen and performs quite well if everything is warmed up.
The problem is a multi-valued string field: that field contains assigned categories, which change quite frequently for large parts of the 20M documents. After a commit, the warm-up takes way too long to be usable in production.
The field is used only for faceting and filtering. My idea was to store the categories outside Solr and to inject them somehow using custom code. I looked into quite a few approaches in various JIRA issues and blog posts, but I could not find a working solution. Item 2 of this issue suggests that there is a solution, but I don't get what he's talking about.
I would appreciate any solution which allows me to update my category field without having to re-warm my caches afterwards.
I'm not sure that JIRA issue will help you: it covers an advanced topic and, most important, it is still unresolved, so the feature is not yet available.
Partial document updates are not useful here because a) they require every field in your schema to be stored, and b) behind the scenes they re-index the whole document anyway.
From what you say, it seems you have one monolithic index: have you considered splitting it into shards, e.g. with SolrCloud? That way each "portion" would be smaller and the autowarm shouldn't be a big problem.
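A minimal sketch of that split, assuming SolrCloud and the standard Collections API; the collection and config names are placeholders:

```python
import requests

# Create a 4-shard collection so each shard carries roughly a quarter of
# the 20M documents; smaller per-shard caches warm faster and in parallel.
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "CREATE",
        "name": "products",
        "numShards": 4,
        "replicationFactor": 1,
        "collection.configName": "products_conf",
    },
)
resp.raise_for_status()
print(resp.json())
```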
I tried using the Sitecore.Search namespace and it seems to handle the basic stuff. I am now evaluating the AdvancedDatabaseCrawler module by Alex Shyba. What are some of the advantages of using this module instead of writing my own crawler and search functions?
Thanks
Advantages:
You don't have to write anything.
It handles a lot of the code you would otherwise need to write just to query Sitecore, e.g. basic search, basic search with field-level sorting, field-level searches, relation searches (GUID matches for lookup fields), multi-field searches, numeric range and date range searches, etc.
It handles combined searches with logical operators.
You can access the code.
This video shows samples of the code and front-end running various search types.
Disadvantages:
None that I can think of, because if you find an issue or a way to extend it, you have full access to the code and can amend it per your needs. I've done this before by creating the GetHashCode() and Equals() methods for the SkinnyItem class.
First of all, the "old" way of accessing the Lucene index was very simple, but unfortunately it has been deprecated since Sitecore 6.5.
The "new" way of accessing the Lucene index is very complex as the possibilities are endless. Alex Shyba's implementation is the missing part that makes it sensible to use the "new" way.
Take a look at this blog post: http://briancaos.wordpress.com/2011/10/12/using-the-sitecore-open-source-advanceddatabasecrawler-lucene-indexer/
It's a three-part description of how to configure the AdvancedDatabaseCrawler, how to make a simple search, and how to make a multi-field search. Without Alex's AdvancedDatabaseCrawler, these tasks would take almost 100 lines of code. With it, they take only 7 lines.
So if you are in need of an index solution, this is the solution to use.
Can somebody explain how Stack Overflow search works? I would like to add the same features to a project I'm working on.
On SO, it's possible to filter questions by multiple tags (e.g. c#, java) and get results sorted/paged by date or number of votes.
I realize that an RDBMS with a full-text engine can be used to filter and sort the questions, but I'm not sure that's the best solution.
Is it possible to somehow get top N ordered results from a full-text index?
Maybe Lucene.NET or Redis or something similar is used?
As of April 2011, Stack Overflow uses Lucene.NET.
Source: (Jeff Atwood) https://blog.stackoverflow.com/2011/01/stack-overflow-search-now-81-less-crappy/
Their old method was homebrew code on top of SQL full-text search.
How to search by tags in Lucene
Top N with Lucene
Paging with Lucene
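Those three links boil down to three Lucene building blocks: a BooleanQuery with one MUST TermQuery per tag, a Sort on a per-document numeric value, and searchAfter-style paging. A minimal sketch of the first two, written in Python against PyLucene since Lucene.NET mirrors the same API under .NET naming; every field name and value here is made up:

```python
import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import (Document, Field, NumericDocValuesField,
                                        StringField, TextField)
from org.apache.lucene.index import (DirectoryReader, IndexWriter,
                                     IndexWriterConfig, Term)
from org.apache.lucene.search import (BooleanClause, BooleanQuery, IndexSearcher,
                                      Sort, SortField, TermQuery)
from org.apache.lucene.store import ByteBuffersDirectory

lucene.initVM()

# Build a tiny in-memory index: one StringField per tag (not tokenized),
# plus a docValues field so we can sort by votes without loading the doc.
directory = ByteBuffersDirectory()
writer = IndexWriter(directory, IndexWriterConfig(StandardAnalyzer()))
doc = Document()
doc.add(TextField("title", "How do I parse JSON in C#?", Field.Store.YES))
doc.add(StringField("tag", "c#", Field.Store.YES))
doc.add(StringField("tag", "json", Field.Store.YES))
doc.add(NumericDocValuesField("votes", 42))
writer.addDocument(doc)
writer.close()

searcher = IndexSearcher(DirectoryReader.open(directory))

# Multi-tag filter = one MUST clause per tag (AND semantics, like SO).
builder = BooleanQuery.Builder()
for tag in ["c#", "json"]:
    builder.add(TermQuery(Term("tag", tag)), BooleanClause.Occur.MUST)

# Top N ordered by votes, descending.
sort = Sort(SortField("votes", SortField.Type.LONG, True))
top = searcher.search(builder.build(), 10, sort)
for hit in top.scoreDocs:
    print(searcher.doc(hit.doc).get("title"))
```

For paging, searcher.searchAfter(last_hit, query, page_size, sort) resumes from the previous page's last ScoreDoc, which avoids the deep-offset cost of re-collecting all earlier pages.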