I have a Solr instance with 200M+ documents. I would like to find an efficient way to iterate over all those documents.
I tried using the start parameter to formulate a list of queries:
http://ip:port/solr/docs/select?q=*:*&start=0&rows=1000000&fl=content&wt=python
http://ip:port/solr/docs/select?q=*:*&start=1000000&rows=1000000&fl=content&wt=python
...
But it is very slow when start gets too high.
I also tried using the cursorMark parameter with an initial query like this one:
http://ip:port/solr/docs/select?q=*:*&cursorMark=*&sort=id+asc&start=0&rows=1000000&fl=content&wt=python
which I believe tries to sort all the documents first and crashes the server. Sadly, I don't think it is possible to bypass the sort. What would be the proper way to do it?
This is a very well-known antipattern. You just need to use the cursorMark feature to go deep into a result set.
If cursorMark is not doable, then try the export handler.
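Here is a minimal sketch of the cursorMark loop in Python (assuming id is the uniqueKey field, and switching to wt=json so the response can be parsed with a standard JSON parser; the batch size is deliberately modest, since huge rows values are what tend to cause the memory problems described above):

    import requests

    SOLR = "http://ip:port/solr/docs/select"  # same endpoint as in the question

    params = {
        "q": "*:*",
        "sort": "id asc",   # cursorMark requires a sort that includes the uniqueKey
        "rows": 10000,      # keep batches modest instead of asking for 1M at a time
        "fl": "content",
        "wt": "json",
        "cursorMark": "*",  # "*" marks the start of the result set
    }

    while True:
        resp = requests.get(SOLR, params=params).json()
        for doc in resp["response"]["docs"]:
            pass  # process each document here
        next_cursor = resp["nextCursorMark"]
        if next_cursor == params["cursorMark"]:
            break  # the cursor did not advance, so we have seen everything
        params["cursorMark"] = next_cursor

Each response carries a nextCursorMark that you feed into the next request; the loop ends when the cursor stops advancing. This avoids the cost of skipping over start documents on every query.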
Okay, so I couldn't make it work with the cursor, even though it's probably me not knowing the tool well enough. If you are having the same problem as me, here are three tracks:
Track one: use cursor sorting on _docid_, as suggested by @femtoRgon. I couldn't make it work, but I didn't have a lot of time to allocate to it.
Track two: use the export handler, as suggested by @Persimmonium (a sketch follows after this list).
Track three (the lazy track): what I did in the end is keep using incremental start values, but switch from wt=python to wt=csv, which is much faster and allows me to query in batches of 10M documents. This limits the number of queries, so the cost of using start instead of cursorMark is more or less amortized.
Good luck, and post your solutions if you find anything better.
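For track two, an export handler request would look something like the sketch below. Be aware that, as far as I know, /export requires every field in fl and sort to have docValues enabled, so a large stored text field like content may need to be reindexed with docValues before this works:

    http://ip:port/solr/docs/export?q=*:*&sort=id+asc&fl=id,content

The export handler streams the full sorted result set in one request, so there is no paging logic to manage at all.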
I have been browsing the internet for quite a few hours now and haven't come to a satisfactory answer for why one is better than the other. If this is situation-dependent, then what are the situations in which to use one over the other? It would be great if you could provide a solution, with an example if there can be one. I understand that since the aggregation operators came later, they are probably the better option, but I have still seen people using the find() + sort() method.
You shouldn't think of this as an issue of "which method is better?", but "what kind of query do I need to perform?"
The MongoDB aggregation pipeline exists to handle a different set of problems than a simple .find() query. Specifically, aggregation is meant to allow processing of data on the database end in order to reduce the workload on the application server. For example, you can use aggregation to generate a numerical analysis on all of the documents in a collection.
If all you want to do is retrieve some documents in sorted order, use find() and sort(). If you want to perform a lot of processing on the data before retrieving the results, then use aggregation with a $sort stage.
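A short illustration in Python with pymongo (the blog database, posts collection, and its fields are hypothetical):

    from pymongo import MongoClient

    db = MongoClient()["blog"]

    # Simple retrieval in sorted order: find() + sort() is all you need.
    top_posts = db.posts.find({"author": "alice"}).sort("score", -1).limit(10)

    # Server-side processing: aggregation computes the average score per author
    # on the database end, before anything is sent back to the application.
    pipeline = [
        {"$group": {"_id": "$author", "avgScore": {"$avg": "$score"}}},
        {"$sort": {"avgScore": -1}},
    ]
    for row in db.posts.aggregate(pipeline):
        print(row["_id"], row["avgScore"])

The second computation could not be expressed with find() alone; without the pipeline you would have to pull every document into the application just to average the scores.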
We want to use Solr in a near-real-time scenario. Say, for example, we want to filter/rank our results by number of views.
Solr's soft commit was made for this use case, but:
In practice, the same few documents are updated very frequently (just for the nb_view field) while most of the documents are untouched.
As far as I know, each update, even a partial one, is implemented as a full delete and re-addition of the document in Lucene.
It seems to me that having the same docs in the tlog many times over is inefficient, and it might also be problematic during the merge process (is the doc marked as deleted and re-added n times?).
Any advice / good practice?
Two things you could use to support this scenario:
In-place updates: only that field is updated, not the whole doc. Check out the conditions you need to meet to be able to use them (a sketch follows below).
ExternalFileField: you keep the values in an external file.
If the scenario is critical, I would test both in real-world conditions if possible, and assess.
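For the first option, here is a minimal sketch of an in-place update through the JSON update API, assuming nb_view is declared single-valued with docValues="true", indexed="false" and stored="false" (the conditions Solr requires for the update to actually happen in place), against the hypothetical docs collection from the earlier examples:

    import requests

    # "doc42" is a placeholder id; "inc" bumps the counter without
    # the document being deleted and re-added in the index
    update = [{"id": "doc42", "nb_view": {"inc": 1}}]
    requests.post("http://ip:port/solr/docs/update", json=update)

Combined with soft commits, this should keep the view counts searchable in near real time without rewriting the untouched fields of the document.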
I'm using Waterline, which is an amazing Node.js ORM. I think there are two ways to count relations (associations).
The first way is to maintain the count as relation records are added or removed, e.g. when a comment is appended to a post, the post's comment-count field is incremented.
The second way is to use a 'count' query and count the relations whenever I need to.
What worries me is that the second way is easier, but it seems slower than the first and can generate too many requests. On the other hand, the first way needs more dirty code.
I really don't know what the best way to count relations is.
The answers to this question have to be a little opinionated, but I will give you my point of view.
I would go with the "count query" solution because it is the most reliable way to get this information. As you said, the other solution needs more dirty code and is more prone to bugs. I always try to have a single way of retrieving a piece of information.
If the query is too slow and/or too frequent and slows down your application, then you should consider caching the result. Depending on your infrastructure, you could cache the result of the query in a variable or in a fast cache backend like memcached or Redis. You will have to invalidate the cache when needed, and it is up to you to decide the lifetime of the cache. You should define a global caching strategy for your application so you can reuse it in other parts of your application.
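Waterline itself is Node.js, but the caching pattern is language-agnostic; here is a minimal sketch of it with Redis in Python (the key layout, TTL, and the count_from_db callback are all placeholders):

    import redis

    r = redis.Redis()
    CACHE_TTL = 60  # seconds; pick a lifetime that matches how fresh counts must be

    def comment_count(post_id, count_from_db):
        """Return the cached count, falling back to the real count query."""
        key = f"post:{post_id}:comment_count"
        cached = r.get(key)
        if cached is not None:
            return int(cached)
        count = count_from_db(post_id)   # the actual count query against the database
        r.set(key, count, ex=CACHE_TTL)  # the entry expires on its own
        return count

    def invalidate_comment_count(post_id):
        """Call this whenever a comment is added or removed."""
        r.delete(f"post:{post_id}:comment_count")

With the TTL as a safety net, a missed invalidation only leaves the count stale for at most CACHE_TTL seconds.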
Sometimes I don't need just the top X results from a Solr query, but all results (running into millions). This is easily achievable by searching once with 0 rows as a request parameter, and then re-executing the search with the numFound from the result as the number of rows. (*)
Of course we can sort the results by e.g. "id asc" to remove relevancy ranking. However, I would like to be able to disable the entire scoring calculation for these queries, as it is probably quite computationally intensive and we just don't need it in these cases.
My question:
Is there a way to make Solr work in a boolean mode and effectively run faster on these often slow queries, when all we need is just all the results?
(*) In practice I usually do a paged query where a script walks through the pages (multi-threaded) to prevent timeouts on large result sets while keeping things as fast as possible, but this is not important for the question.
This looks like a related question, but apparently that user asked the wrong question and was only after retrieving all results: Solr remove ranking or modify ranking feature. That question is not answered there.
Use filters instead of queries; there is no score calculation for filters.
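In other words (a sketch; the collection name and the filter are placeholders), leave q as a match-all and move every constraint into fq, so no per-document relevancy score is computed against your terms:

    http://ip:port/solr/collection/select?q=*:*&fq=type:digital&rows=1000000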
There are a couple of things to be aware of:
Solr deep paging allows you to export a large number of results much more quickly.
Using an export format such as CSV can be faster than an XML format, simply because the formatting is more compact.
And, as already mentioned, if you are exporting everything, put your queries into a filter query with caching off (a sketch follows after this list).
For very complex queries, if you can split them into several steps, you can actually assign different costs to the filters and have them execute in sequence. This lets you use a cheap first filter that gets rid of most of the results, and only then apply more expensive, more precise filters.
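A sketch of the last two points combined, using Solr's local params (the filters themselves are placeholders). Filters marked cache=false are executed in order of increasing cost, and for query types that support it, a cost of 100 or more makes the filter run as a post-filter over only the documents that survived the cheaper ones:

    http://ip:port/solr/collection/select?q=*:*&fq={!cache=false cost=1}type:digital&fq={!cache=false cost=100}price:[100 TO 500]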
I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a query filter of, say, "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there's a way to run a single query, applying the first filter to get the first set of topDocs and then generating a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting re-sorts and mixes the two sets).
I experimented with partitioning the data into two different groups by marking a specific field during indexing, and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the performance of the two-query solution, or investigating some kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternate approaches:
Instead of filtering the results, use a variable boost so that all the results for type:digital come to the top and the rest of the documents follow. No need for separate queries. The boost can be changed depending on the type value.
The other approach is not to display the results for types other than digital, but still display facets for the other types, with counts, so users know whether other types exist for the search term. You can look into tagging and excluding filters. (Sketches of both approaches follow below.)
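Hedged sketches of both ideas (the collection name, boost value, and tag name are placeholders). The first uses an edismax boost query to float digital results to the top; the second filters to digital but tags the filter and excludes it when faceting, so counts for the other types are still returned:

    http://ip:port/solr/collection/select?defType=edismax&q=camera&bq=type:digital^100

    http://ip:port/solr/collection/select?q=camera&fq={!tag=typefq}type:digital&facet=true&facet.field={!ex=typefq}type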
Result grouping might give you what you want. Just group by that field and specify a sufficiently large number of top documents in each group.
But I would test whether its performance is any better than two queries, since the documentation mentions performance in its limitations section.
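For reference, a grouping request along those lines might look like this (the field name and per-group limit are placeholders):

    http://ip:port/solr/collection/select?q=camera&group=true&group.field=type&group.limit=100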