Mongoid - .limit(n) ignored

In Mongoid 7.0, .limit(n) seems to be partially ignored.
When querying data and limiting its size, .limit(n) works as expected only after converting the result to an array or iterating over the items. That's not ideal, but acceptable.
But the bigger problem appears when using the limit in combination with update_all.
Model.where(foo: 'bar').limit(100).update_all(foo: 'baz')
In this case the limit is ignored entirely: the update modifies all matching documents without taking the limit into account.
Sure, you could always update documents one by one, but that's probably something you don't want to do on bigger data sets (thousands of documents at a time in our case).
The only workaround I could think of is to select the document ids, collect them into an array, and then run the update against a query that matches only those ids:
ids = Model.where(foo: 'bar').only(:_id).limit(1000).map(&:_id)
Model.where(:_id.in => ids).update_all(foo: 'baz')
Is there any way to make the limit work without making an array of the data first?

MongoDB's update command does not provide for arbitrarily limiting the number of updates performed: you can update either a single document or all matching documents.
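For illustration, a minimal sketch of the two modes a Mongoid criteria exposes (Model and the field names are the question's placeholders):
Model.where(foo: 'bar').update(foo: 'baz')      # updates only the first matching document
Model.where(foo: 'bar').update_all(foo: 'baz')  # updates every matching document
Anything between those two extremes has to be orchestrated client-side, e.g. with the id-collecting workaround from the question.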

Related

How can I limit my Solr search to an arbitrary set of 100,000 documents?

I've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's PK. For some searches, we need to be able to limit the search to a subset of documents defined by a list of FLRID values. The list of FLRID values can change with every search, and it will be rare enough to call it "never" for any two searches to share the same set of FLRIDs.
What we're doing right now is, roughly:
q=title:dogs AND
(flrid:(123 125 139 .... 34823) OR
flrid:(34837 ... 59091) OR
... OR
flrid:(101294813 ... 103049934))
Each of those parenthesized groups can be 1,000 FLRIDs strung together. We have to subgroup like this to get past Solr's limit on the number of terms that can be ORed together.
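A hedged Ruby sketch of how such a chunked query string might be assembled app-side (flrids is an assumed in-memory array of IDs; the 1,000-per-group size matches the description above):
# Split the IDs into OR-able groups of 1,000 to stay under Solr's
# boolean-clause limit, then OR the groups together.
clauses = flrids.each_slice(1000).map { |chunk| "flrid:(#{chunk.join(' ')})" }
q = "title:dogs AND (#{clauses.join(' OR ')})"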
The problem with this approach (besides being clunky) is that it seems to perform in O(N^2) time or so. With 1,000 FLRIDs, the search comes back in 50ms or so. With 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps to about 75,000ms. We want it to be on the order of 1,000-2,000ms at most in all cases up to 100,000 FLRIDs.
How can we do this better?
Things we've tried or considered:
Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.
What we're hoping for:
An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
Have Solr do big ORs as a set operation not as (what we assume is) a naive one-at-a-time matching.
A way to create a match vector that gets passed to the query, because strings of fqs in the query seem to be a suboptimal way to do it.
I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.
solr search within subset defined by list of keys
Searching within a subset of data - Solr
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

SOLR index time boost depending on the field value

Is it possible to boost a document at indexing time depending on a field's value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the query-time equivalent of what I need at index time would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important because the hardware I run SOLR on is not too powerful, and trying to boost at query time ends in an OutOfMemory exception. (Even if I could work around that by increasing memory for Java, I'd prefer to be on the safe side and build the index in the most efficient way possible.)
Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.
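For example, with the legacy XML update format that was current at the time, the generating code might emit something like this (the id/text values and the 2.0 boost are illustrative assumptions; shorter text would get a larger boost):
<add>
  <doc boost="2.0">
    <field name="id">1</field>
    <field name="text">ABCD</field>
  </doc>
</add>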
You can check upon DIH Special Commands which has a $docBoost command
$docBoost : Boost the current doc. The value can be a number or the
toString of a number
However, there seems to be no $fieldBoost command.
In your case though, if you are using DefaultSimilarity, shorter fields already score higher than longer fields in the score calculation.
You can also implement your own Similarity class with the TF (term frequency) and lengthNorm calculations changed to suit your needs.
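If you go that route, the custom class would be wired in via schema.xml, e.g. (the class name here is hypothetical):
<similarity class="com.example.ShortFieldSimilarity"/>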

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I'm not sure I've understood what you want to achieve, but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you want both parts to be boosted on the same scale (I'm not sure whether this isn't already Solr's default behaviour), you could use dismax or edismax, and your query would change to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be a OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has effect when combined with the also-required debugQuery parameter.
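A hypothetical request shape (the main query and ids are placeholders):
…&q=main_query&debugQuery=true&explainOther=id:(12 OR 47 OR 93)&…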
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
Another option would be to make use of a pseudo-field calculated with the query() function, to report how any set of results scores against some other query or queries. So if, for example, the original document set was the top-N from query_A, and those are exactly the documents you also want to score against query_B, you would execute query_A again with a reporting field …&fl=bscore:query({!dismax v="query_B"})&…. The documents' scores against query_B would then be included in the output (as bscore).
Finally, the result-grouping functionality can be used to collect both the top-N for one query and the scores of lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B) and groups that satisfy both query_B and query_A (but again ranked by query_B). This can be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)
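Putting the last two pieces together, a single combined request might look like this (query_A and query_B stand in for real queries, and ascore is an arbitrary pseudo-field name):
…&q=query_B&fl=*,score,ascore:query({!dismax v="query_A"})&group=true&group.query=query_B&group.query=query_A&…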

How does appengine's data store query and index multi-value properties?

Let's say I have a Photo class containing a multi-valued tags property and a date field.
I would like to allow the user to perform a query based on tags (using only an AND operator when there is more than one tag).
For example, let's say a user searches for a rainy day:
Select * from Photo where tag='clouds' AND tag='rainy'
How does the zig-zag merge work? I know that two scans are performed and that a Photo is returned when the keys from both scans point to the same entity. Does this happen in parallel, though? E.g.: while Search 1 finds a photo tagged 'clouds', Search 2 is finding the first photo tagged 'rainy'. When both searches are done, they synchronize: Search 1 continues its scan until it hits the same key as Search 2. Then, while the keys for each search are the same, the photo is returned and the "cursor" is moved along one step for each search?
Secondly, does defining multiple indexes speed up these sorts of queries? E.g., if I wanted to allow up to 4 tags, I would need to define indexes such as:
Index(Photo)
Index(Photo, tag)
Index(Photo, tag, tag)
Index(Photo, tag, tag, tag)
Index(Photo, tag, tag, tag, tag)
Then, would performing the same query above be quicker?
Also, using our original query, let's say we have millions of photos tagged as cloudy but only two tagged as rainy. Does this mean zig-zag will perform relatively slowly, since one of the scans will keep searching for a match that barely exists? Even worse, suppose one million photos are tagged "rainy" and one million are tagged "cloudy", yet no single photo has both tags. Would defining the above indexes fix this issue?
Lastly, let's say a photo has 100 tags. Does that mean all the indexes above have to include EVERY combination of the 100 tags?
I know there are gotchas (such as an entity only being indexable 5,000 times, and a single multi-valued property only 1,000 times).
How does the zig-zag merge work?
You can check out the Google I/O video from 2009 on Building Scalable, Complex Apps on App Engine. Brett Slatkin explains how zig-zag merge works starting at 27 minutes. As he says, "I can't really explain it without showing how it works."
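For intuition only, here is a minimal Ruby sketch of the zig-zag idea (an illustration, not App Engine's actual implementation): each tag has a key-sorted index scan, and the lagging scan repeatedly seeks forward to the leading scan's key, emitting a result whenever the keys match. Integer keys are assumed for simplicity.
# One sorted index scan per tag; next_at_least(k) seeks to the first key >= k.
class SortedScan
  def initialize(keys)
    @keys = keys.sort
  end

  def next_at_least(k)
    return @keys.first if k.nil?
    @keys.bsearch { |x| x >= k }
  end
end

def zigzag_merge(scan_a, scan_b)
  results = []
  a = scan_a.next_at_least(nil)
  b = scan_b.next_at_least(nil)
  while a && b
    if a == b
      results << a                    # same key in both indexes: a match
      a = scan_a.next_at_least(a + 1) # step both scans past the match
      b = scan_b.next_at_least(b + 1)
    elsif a < b
      a = scan_a.next_at_least(b)     # seek the lagging scan forward
    else
      b = scan_b.next_at_least(a)
    end
  end
  results
end

# e.g. zigzag_merge(SortedScan.new(cloudy_keys), SortedScan.new(rainy_keys))
# returns the keys of photos carrying both tags.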

Solr returns Out of memory when I query for arbitrary rows

I'm finding that a query matching only one row crashes if I request an arbitrarily large number of rows.
The error thrown by the server is 500 - with an Out of memory exception message.
This crashes :
http://localhost:8983/solr/myIndex1/select?rows=100000&q=*%3A*&fq=group%3Aterm1_JAYUNIT100&fq=grid%3A75&wt=json&indent=on
This does not crash :
http://localhost:8983/solr/myIndex1/select?rows=1&q=*%3A*&fq=group%3Aterm1_JAYUNIT100&fq=grid%3A75&wt=json&indent=on
This is odd to me: I don't see why Solr would use extra memory for a query that only returns one row. Is there some sort of server-side pre-allocation of resources before a query is run, based on the value of the "rows" parameter?
SOLR caches the result of queries. In this case the result set is very large even though you filter it and only return one row.
First of all, SOLR needs RAM. It is an in-RAM index after all. Everything that makes SOLR fast, takes up RAM so please do not starve a SOLR server.
Secondly, your actual query is useless. There is no point in saying "select all records from the database, build a bitmap index, and then filter that set to select only the ones with certain field values." If your query sounds like this in natural language:
Records where XField is like so, AND YField is like that, AND ZField meets this condition
Then the right way to do it in SOLR is:
q=XField:so&fq=Yfield:that%20AND%20ZField:this
In fact, if you are sure that there are x records with XField:so, 3x records with YField:that, and .07x records with ZField:this, then rearrange your AND expression and put ZField in the q= part.
The q= part defines the result set. After collecting the records in the result set, SOLR applies bitmap-index techniques to quickly filter (narrow down) the results using set operations. So when you can, make the q= part return fewer records for fq= to operate on.
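Following that advice, the example above would be rearranged as (same hypothetical fields as before):
q=ZField:this&fq=XField:so&fq=YField:that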
