Inconsistent values for getNumberFound() in Search API - google-app-engine

I have a full-text search index containing 42 documents.
When I query the index for "" it correctly returns all 42 documents (good), but when I use the limit and offset options in the query, the value returned for the total number of matches found (results.getNumberFound()) varies from call to call. In short, making the same query with different offset values gives different values for results.getNumberFound()!
NOTE: This happens only on the production server, after I deploy the app. On the local development server everything works perfectly (i.e. for the same query, the total number of hits found is the same regardless of the offset value).
Query query = Query.newBuilder()
        .setOptions(QueryOptions.newBuilder()
                .setLimit(limit)
                .setOffset(offset)
                .build())
        .build(searchPhrase);
Results<ScoredDocument> results = INDEX.search(query);
LOG.warning("Phrase:'" + searchPhrase +
        "' limit:" + limit +
        " offset:" + offset +
        " num:" + results.getNumberFound());
So is there something wrong in what I'm doing, or is it a bug in the Search API? The weird thing is that the issue only happens on the production server, not the local one.

The Python docs say:
number_found
Returns an approximate number of documents matching the query. If the QueryOptions.number_found_accuracy parameter were set to 100, then number_found <= 100 is accurate.
Similar API components exist in Java. From your code it appears you haven't set an accuracy. See the Java QueryOptions: https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/search/QueryOptions
Having said that, I have seen many questions/discussions about the lack of accuracy in the number of found results.
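For example, a minimal sketch that reuses the query-building code from the question (the accuracy value of 1000 is only an illustration):
Query query = Query.newBuilder()
        .setOptions(QueryOptions.newBuilder()
                .setLimit(limit)
                .setOffset(offset)
                // Match counts up to this bound are exact; beyond it they may be estimates.
                .setNumberFoundAccuracy(1000)
                .build())
        .build(searchPhrase);
Results<ScoredDocument> results = INDEX.search(query);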

Surprisingly, this is working as intended (as Tim says).
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/search/QueryOptions.Builder#setNumberFoundAccuracy(int)
In its default state, the backend scans the minimal set of data needed to fulfill the request. It provides a very rough estimate of the match count by multiplying the ID range by the match rate observed so far (#keys that matched / #IDs scanned during the query).
For small data sets, set the accuracy value higher (500 or 1000) and call it a day. You can also improve the estimate by making sure key IDs are uniformly distributed and by fetching a higher limit on each call (though if you don't need the data, just use the accuracy parameter).
This might not be applicable here but this is a general workaround for larger data sets:
Use num_accuracy == 1000. When a query returns an estimate of < 1000, you can trust it. When it returns an estimate of > 1000, perform your own estimate using a second query:
Include an extra numeric field with your data whose value records a discrete probabilistic event (e.g. the number of 0s in a hash of some random-ish data). When the first query returns a large estimate, repeat it with an additional constraint (e.g. AND ZERO_COUNT == y), where y is chosen, based on the first query's estimate, to match < 1000 entities. That makes the second query's count exact, and you can extrapolate from it accurately. Since you don't need the results of this query, set limit to 1 and num_accuracy == 1000. A rough sketch follows below.
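A rough Java sketch of that idea, with everything illustrative and assumed rather than taken from the question (the zero_count field name, the 30-bit cap, and the way y is derived):
// At index time: store a value that is >= y with probability ~2^-y, e.g. the
// number of trailing zero bits in a hash of the document id, capped at 30.
int zeroCount = Integer.numberOfTrailingZeros(docId.hashCode() | (1 << 30));
Document doc = Document.newBuilder()
        .setId(docId)
        .addField(Field.newBuilder().setName("zero_count").setNumber(zeroCount).build())
        .build();
// At query time: if the first estimate exceeds 1000, add a constraint that
// thins the matches to fewer than 1000, count exactly, and scale back up.
long estimate = results.getNumberFound(); // from the first query
int y = (int) Math.ceil(Math.log(estimate / 1000.0) / Math.log(2));
Query countQuery = Query.newBuilder()
        .setOptions(QueryOptions.newBuilder()
                .setLimit(1) // we only need the count, not the documents
                .setNumberFoundAccuracy(1000)
                .build())
        .build(searchPhrase + " AND zero_count >= " + y);
long extrapolated = INDEX.search(countQuery).getNumberFound() << y; // multiply by 2^y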

Related

Count of no of record returned, without considering Limit cakePHP 3

I want the number of records available in the database for the current query, but without considering the LIMIT.
$this->Orders->find('all')
    ->where(['order_quantity' => 5])
    ->limit(5);
Let's say I have 50 records matching the query above. I just want the number of records available for the current query. I can't use count() because of the limit: it will always return a total of 5 or fewer. Is there any solution in CakePHP?
This page in the CakePHP 3 book explains EXACTLY the answer to your question, including how and why it works:
Returning the Total Count of Records
Using a single query object, it is possible to obtain the total number
of rows found for a set of conditions:
$total = $articles->find()->where(['is_active' => true])->count();
The count() method will ignore the limit, offset and page clauses,
thus the following will return the same result:
$total = $articles->find()->where(['is_active' => true])->limit(10)->count();
This is useful when you need to know the total result set size in
advance, without having to construct another Query object. Likewise,
all result formatting and map-reduce routines are ignored when using
the count() method.
Notice the bit about "... will ignore the limit, offset, and page clauses"
So try something like this:
$data = $articles->find()->where(['is_active' => true])->limit(10);
$count = $data->count();
I don't think you are familiar with how limit works in find, so I suggest you study the docs.
The limit in your query means that it will only return the first 5 records, even if the query actually matches 50.
So, to get the actual count, just remove the limit and call count():
$count = $this->Orders->find()->where(['order_quantity' => 5])->count();

Solr score boost - based on number of likes

I have added fs_votingapi_result to my Solr documents; it represents the number of likes.
I found the function below to boost the score based on fs_votingapi_result,
but I can't work out the logic behind it. What are the extra parameters $vote_steepness, $total, $total, $vote_boost?
bf=recip(rord(fs_votingapi_result),$vote_steepness,$total,$total)^$vote_boost
I am new to Solr and I couldn't find any documentation or articles that explain this.
This is in the Function Query documentation.
recip
A reciprocal function with recip(x,m,a,b) implementing a/(m*x+b). m,a,b are constants, x is any numeric field or arbitrarily complex function.
rord
The reversed ordinal of the indexed value. (In your case, rord(fs_votingapi_result) yields 1 for the record with the most votes, 2 for the second most, etc.)
So
recip(rord(fs_votingapi_result),$vote_steepness,$total,$total)
= $total / ($vote_steepness * rev-ordinal-of-vote-result + $total)
Then the result is boosted by $vote_boost to create the boost function (from bf param).
= ($total / ($vote_steepness * rev-ordinal-of-vote-result + $total)) * $vote_boost
Which is added to the document score from the rest of the query. (Then before scores are returned, they are normalized across all matching docs)
The $<var> values are either defined in solrconfig.xml or, more commonly, passed as separate HTTP query parameters.
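For instance, a hypothetical SolrJ sketch (the base query and the numeric values are placeholders; here the $variables are substituted client-side before the request is sent):
// Plug concrete values into the boost function and pass it as the bf parameter.
double voteSteepness = 0.1, total = 10000, voteBoost = 2.0; // placeholder values
SolrQuery q = new SolrQuery("ss_type:article"); // placeholder base query
q.set("defType", "dismax");
q.set("bf", String.format("recip(rord(fs_votingapi_result),%s,%s,%s)^%s",
        voteSteepness, total, total, voteBoost));
QueryResponse rsp = solrClient.query(q); // assumes an existing SolrClient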
Hope that gives you a starting point.

increase performance of a linq query using contains

I have a WinForms app with a Telerik dropdown checklist that lets the user select a group of state names.
I am using EF, and the database is stored in Azure SQL.
The code hits a database of about 17,000 records and filters the results to include only the states that are checked.
This works fine. Now I want to update a count on the screen whenever the list box changes.
This is the code, in the itemCheckChanged event:
var states = stateDropDownList.CheckedItems.Select(i => i.Value.ToString()).ToList();
var filteredStops = (from stop in aDb.Stop_address_details
                     where states.Contains(stop.Stop_state)
                     select stop).ToArray();
ExportInfo_tb.Text = "Current Stop Count: " + filteredStops.Count();
It works, but it is slow.
I tried loading everything into a memory variable and querying that instead of the database, but I can't figure out how to do it.
Any suggestions?
Improvement:
I picked up a noticeable improvement by limiting the amount of data coming down:
var filteredStops = (from stop in aDb.Stop_address_details
                     where states.Contains(stop.Stop_state)
                     select stop.Stop_state).ToList();
And better yet:
int count = (from stop in aDb.Stop_address_details
             where states.Contains(stop.Stop_state)
             select stop).Count();
ExportInfo_tb.Text = "Current Stop Count: " + count.ToString();
The performance of your query actually has nothing to do with Contains in this case; Contains is pretty performant. The problem, as you discovered in your third solution, is that you are pulling far more data over the network than required.
In your first solution you pull back all the rows from the server with a matching stop state and perform the count locally. This is the worst approach: you pull back data just to count it, and far more data than you need.
In your second solution you limited the data coming back to a single field, which is why the performance improved; the gain can be significant if your table is wide. The problem is that you are still pulling back all the rows just to count them locally.
In your third solution EF translates the .Count() call into a query that performs the count for you, so the count happens on the server and the only data returned is a single value: the result of the count. Since network latency can often be (but is not always) the longest step in a query, returning less data often yields significant gains in query speed.
The query translation of your final solution should look something like this:
SELECT COUNT(*) AS [value]
FROM [Stop_address_details] AS [t0]
WHERE [t0].[Stop_state] IN (@p0)

Neo4J / Cypher Query very slow with order by property

I have a graph database with 5M nodes and 10M relationships.
I'm on a MacBook Pro with 4 GB of RAM. I have already tried adjusting the Java heap size and Neo4j memory settings, without success.
My problem is that I have a simple Cypher query like this:
MATCH (pet:Pet {id:52163})-[:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r), LABELS(n), n.id
LIMIT 30;
This query takes 100 ms, which is impressive. But when I add an ORDER BY, the query takes a long time, around 8 s:
MATCH (pet:Pet {id:52163})-[:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r), LABELS(n), n.id
ORDER BY r.date DESC
LIMIT 30;
Does someone have an idea?
You might want to consider relationship indexes to speed up your query; the date property could be indexed that way. The ORDER BY clause will almost always make this query slower: without it the engine can stop after the first 30 rows, whereas with it the entire result set must be iterated and sorted before the LIMIT is applied.
Also consider using a single MATCH statement if that suits your needs (note the two relationships need distinct variables):
MATCH (pet:Pet {id:52163})-[:FOLLOWS]->(friend)-[r:POSTED]->(n)

How to get all results from solr query?

I executed a query like "Address:Jack*". It shows numFound = 5214 and displays 100 documents on the results page (I changed the default number of displayed results from 10 to 100).
How can I get all the documents?
I remember doing &rows=2147483647 myself.
2,147,483,647 is the maximum value of an integer. I recall once using a bigger number and getting a NumberFormatException, because it couldn't be parsed into an int. I don't know if they use Long nowadays, but 2 billion rows is normally more than enough.
A small note:
Be careful if you are planning to do this in production. If you run a query like *:* and your index is big, you could be transferring a couple of gigabytes with that one query.
If you know you won't have many docs, go ahead and use integer's max value.
On the other hand, if you are writing a one-time script that just needs to dump all results (for example, document IDs), this approach is valid, as long as you don't mind waiting 3-5 minutes for the query to return.
Don't use &rows=2147483647
Don't use Integer.MAX_VALUE (2147483647) as the value of rows in production. This will heavily slow down your query even for a small result set, because Solr preallocates a queue of this size. See https://issues.apache.org/jira/browse/SOLR-7580
I strongly suggest using Exporting Result Sets:
It’s possible to export fully sorted result sets using a special rank query parser and response writer specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
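For example, a request against a hypothetical collection named mycollection (the /export handler requires a sort parameter and an fl list restricted to docValues fields):
http://localhost:8983/solr/mycollection/export?q=*:*&sort=id+asc&fl=id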
Alternatively, I suggest using Deep Paging.
Simple pagination is easy when you have only a few documents to read: you just play with the start and rows parameters. But it is not feasible when you have many documents, I mean hundreds of thousands or even millions.
This is the kind of thing that can bring your Solr server to its knees.
For typical applications displaying search results to a human user,
this tends to not be much of an issue since most users don’t care
about drilling down past the first handful of pages of search results
— but for automated systems that want to crunch data about all of the
documents matching a query, it can be seriously prohibitive.
This means that if you have a website and are paging through search results, a real user will never go very deep; but consider what happens when a spider or a scraper tries to read all of the site's pages.
Now we are talking about deep paging.
I suggest reading this great post:
https://lucidworks.com/post/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
And take a look at this document page:
https://solr.apache.org/guide/pagination-of-results.html
And here is an example that tries to show how to paginate using cursors.
SolrQuery solrQuery = new SolrQuery();
solrQuery.setRows(500);
solrQuery.setQuery("*:*");
solrQuery.addSort("id", ORDER.asc); // a sort on the uniqueKey field is required for cursors
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
    solrQuery.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrClient.query(solrQuery);
    String nextCursorMark = rsp.getNextCursorMark();
    for (SolrDocument d : rsp.getResults()) {
        ...
    }
    if (cursorMark.equals(nextCursorMark)) {
        done = true; // the cursor did not advance: we have read everything
    }
    cursorMark = nextCursorMark;
}
Returning all the results is never a good option, as it would be very slow.
Can you mention your use case?
Also, the Solr rows parameter helps you tune the number of results to be returned.
However, I don't think there is a way to make rows return all results; it doesn't take -1 as a value.
So you would need to set a high value for all the results to be returned.
What you should do is first create a SolrQuery, as shown below, and set the number of documents you want to fetch per batch.
int lastResult = 0; // this is for processing the next batch
String query = "id:[" + lastResult + " TO *]"; // just considering id for the sake of simplicity
SolrQuery solrQuery = new SolrQuery(query).setRows(500); // setRows sets the batch size; change it to whatever size you want
SolrDocumentList results = solrClient.query(solrQuery).getResults(); // execute this statement
Here I am using search by id as an example; you can replace it with whatever field you want to search on.
"lastResult" is the variable you update after executing each batch of 500 records (500 is the batch size), setting it to the last id from the results.
This lets you execute the next batch starting from the last result of the previous one, as in the sketch below.
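A fuller sketch of that loop (assumptions: a batch size of 500, and string ids that contain no characters needing query escaping):
String lastId = null;
while (true) {
    // The first pass matches everything; later passes use an exclusive lower bound.
    String q = (lastId == null) ? "id:[* TO *]" : "id:{" + lastId + " TO *]";
    SolrQuery solrQuery = new SolrQuery(q).setRows(500);
    solrQuery.addSort("id", SolrQuery.ORDER.asc); // keep a stable order between batches
    SolrDocumentList results = solrClient.query(solrQuery).getResults();
    if (results.isEmpty()) {
        break; // no more documents
    }
    for (SolrDocument d : results) {
        lastId = (String) d.getFieldValue("id");
        // process d here
    }
}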
Hope this helps. Drop a comment below if you need any clarification.
For selecting all documents in dismax/edismax via the Solarium PHP client, the normal query syntax *:* does not work. To select all documents, set the default query value in the Solarium query to an empty string. This is required because the default query in Solarium is *:*. Also set the alternative query to *:*. The dismax/edismax normal query syntax does not support *:*, but the alternative query syntax does.
For more details, you can refer to the following book:
http://www.packtpub.com/apache-solr-php-integration/book
As the other answers pointed out, you can set rows to the maximum integer to get back all the results for a query.
I would recommend, though, using Solr's pagination and building a function that returns all the results using the cursorMark API. The gist of it: you set the cursorMark parameter to '*', you set the page size (the rows parameter), and each response gives you a cursorMark for the next page, so you execute the same query, only with the cursorMark from the previous result. This gives you more flexibility over how many of the results you want back, in a much more performant way.
The way I dealt with the problem is by running the query twice:
// Start with your (usually small) default page size
solrQuery.setRows(50);
QueryResponse response = solrResponse(query);
if (response.getResults().getNumFound() > 50) {
    // getNumFound() returns a long, so cast it down for setRows
    solrQuery.setRows((int) response.getResults().getNumFound());
    response = solrResponse(query);
}
It makes two calls to Solr, but gets you all the matching records... with a small performance penalty.
query.setRows(Integer.MAX_VALUE);
works for me!!
