Slow search response in Solr

I have a collection with 3 shards containing 5M records with 10 fields; the index size on disk is less than 1 GB. Each document has one long-valued field that has to be sorted on in every query.
All the queries are filter queries with one range-query filter, with sorting applied on the long value.
I am expected to get the response in under 50 milliseconds (including elapsed time). However, the actual QTime ranges from 50-100 ms while the elapsed time varies from 200-350 ms.
Note: I have used docValues for all the fields and configured newSearcher/firstSearcher warming. Still, I do not see any improvement in response times.
What could be the possible tuning options?

Try indexing those values. That may help.
I am not quite sure, but you can give it a try.
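As a hedged illustration of the query shape described above (the field names here are hypothetical, not from the question): putting the range filter in fq rather than q lets Solr reuse it from the filterCache across requests, and sorting on a docValues long field keeps the sort data out of the heap-based FieldCache:
q=*:*&fq=createdAt:[1609459200000 TO 1640995200000]&sort=createdAt desc&rows=10&fl=id,createdAt
Since QTime does not include writing and transferring the response to the client, a large gap between QTime and elapsed time often points at result serialization or network cost; limiting fl to the fields actually needed is a cheap thing to check.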

Related

Hasura timeouts and queries with inconsistent speed

I have a table with 2 million rows in it. Let's say I search for a user email ID (using _ilike): without an index it takes more than 15 to 20 seconds to respond (or sometimes I get a timeout error). With an index I get a response within a second, although there are still times when it takes 15 to 20 s (say 2 out of 10 requests have this delay).
Here are the questions I have:
There is a timeout most of the time when we search for an email ID that is not present in the table. Why is that? Is this expected behavior?
How much space and what DB configuration is Hasura expected to need for approximately 2 million rows?
Is a btree index better for _ilike searches, or is a GIN index the better solution?
Any more suggestions to improve the performance of the query other than indexing?
Even the basic query to get the row count is pretty slow; is there a way to improve it?
`userTable_aggregate {
  aggregate {
    count
  }
}`
Note: data is added to the userTable every hour (approximately 100 rows).
Thank you for taking the time to answer my questions.
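On the btree vs GIN question specifically: a plain btree index generally cannot be used for ILIKE pattern matching, while a trigram GIN index (via the pg_trgm extension) can serve arbitrary _ilike patterns. A minimal sketch, assuming the table and column are named userTable and email as in the question:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- table/column names ("userTable", email) are assumed from the question
CREATE INDEX idx_usertable_email_trgm ON "userTable" USING gin (email gin_trgm_ops);
-- ILIKE searches like this can then use the index instead of a sequential scan:
SELECT * FROM "userTable" WHERE email ILIKE '%example%';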

Solr doesn't return all values

I set up a Solr server for indexing my trending data (4 million records).
When I tested it, it was very fast (0.3 seconds). Then I found that the result looks different from the result I get from MySQL.
In MySQL I get about 4000 records that belong to a set of user IDs. Then I sort and group them using JavaScript.
Solr returns about 4000 records for the same set of IDs. After I sort and group them in JavaScript, I found some records were missing (about 10). I then added 'rows': 5000000 to the Solr query and got the same result as MySQL returns, but the time increased from 0.3 seconds to 1.7 seconds.
So I wonder: is adding 'rows': 5000000 the only way to make Solr return the data I need? If yes, am I doing it right? If not, what else should I use?
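If the full matching set really has to come back, one alternative to a huge rows value is Solr's cursor-based paging: sort on the uniqueKey field (a cursor requires it as a tie-breaker) and page with cursorMark. A hedged sketch with hypothetical field names:
q=user_id:(101 OR 102 OR 103)&sort=id asc&rows=1000&cursorMark=*
Each response includes a nextCursorMark value to pass as cursorMark on the next request; paging is done when the returned cursor stops changing. This keeps each request small instead of materializing millions of rows in one response.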

Lucene query performance

I have a Lucene index with close to 480M documents. The size of the index is 36 GB. I ran around 10,000 queries against the index. Each query is a boolean AND query with 3 term queries inside; that is, the query has 3 operands which MUST occur. Executing such 3-word queries gives the following latency percentiles:
50th = 16 ms
75th = 52 ms
90th = 121 ms
95th = 262 ms
99th = 76010 ms
99.9th = 76037 ms
Is the latency expected to degrade when the number of docs is as high as 480M? All the segments in the index are merged into one segment, and even when the segments are not merged, the latencies are not very different. Each document has 5-6 stored fields, but as mentioned above, these latencies are for boolean queries that don't access any stored fields and just do a posting-list lookup on 3 tokens.
Any ideas on what could be wrong here?

SOLR faceting slower than manual count?

I'm trying to get a SOLR range query working. I have a database with over 12 million documents, and I am filtering by a few parameters, for example:
product_category:"category1" AND product_group:"group1" AND product_manu:"manufacturer1"
The query itself returns about 700 documents and executes in two to three seconds on average.
But when I add a date range facet to that query (I want to see how many products were added each day for the past x years), it executes in 50 seconds or more. So it seems that it would be faster to just retrieve all matching documents and perform manual counting in Java.
So I guess I must be doing something wrong with faceting?
Here is an example faceted query:
start=0&rows=0&facet.query=productDate%3A[0999-12-26T23%3A36%3A00.000Z+TO+2012-05-22T15%3A58%3A05.232Z]&q=source%3A%22source1%22+AND+productCategory%3A%22category1%22+AND+type%3A%22type1%22&facet=true&facet.limit=-1&facet.sort=count&facet.range=productDate&facet.range.start=NOW%2FDAY-5000DAYS&facet.range.end=NOW%2FDAY%2B1DAY&facet.range.gap=%2B1DAY
My only explanation is that SOLR is counting fields over some larger document pool than the 700 documents matching the "q=" parameter. Or maybe I should filter documents in another way?
I have tried increasing the filterCache size and it helps, but it seems to be a waste of memory for queries like these. After all, aggregating over 700 documents should be very fast, shouldn't it?
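Two things worth checking, sketched here with the field names from the query above (treat the exact parameters as an illustration, not a drop-in fix): moving the restricting clauses from q into fq lets the filterCache reuse them across requests, and the bigger cost is usually the range facet itself, since a span of 5000 days with a +1DAY gap means 5000 buckets per request, each of which is effectively counted separately:
q=*:*&fq=source:"source1"&fq=productCategory:"category1"&fq=type:"type1"&rows=0&facet=true&facet.range=productDate&facet.range.start=NOW/DAY-5000DAYS&facet.range.end=NOW/DAY+1DAY&facet.range.gap=+1DAY
Narrowing facet.range.start to the period actually displayed, or widening the gap, cuts the number of buckets directly.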

Paginated searching... does performance degrade heavily after N records?

I just tried the following query on YouTube:
http://www.youtube.com/results?search_query=test&search=tag&page=100
and received the error message:
Sorry, YouTube does not serve more than 1000 results for any query.
(You asked for results starting from 2000.)
I also tried a Google search for "test", and although it said there were about 3.44 billion results, I was only able to get to page 82 (or about 820 results).
This leads me to wonder: does performance start to degrade with paginated searches after N records (I'm specifically wondering about ROW_NUMBER() in SQL Server or similar features in other DB systems), or are YouTube/Google doing this for other reasons? Granted, it's pretty unlikely that most people would need to go past the first 1000 results for a query, but I would imagine the limitation is put in place for some technical reason.
Then again Stack Overflow lets you page through 47k results: https://stackoverflow.com/questions/tagged/c?page=955&sort=newest&pagesize=50
Yes. High offsets are slow and inefficient.
The only way to find the records at an offset is to compute all records that came before and then discard them.
(I don't know ROW_NUMBER(), but it would be LIMIT in MySQL. So
SELECT * FROM table LIMIT 1999,20
)
... in the above example, the first ~2000 records have to be fetched and then discarded. Generally the database can't skip ahead or use indexes to jump right to the correct location in the data, because normally there would be a WHERE clause filtering the results.
It is possible to cache the results, which is probably what SO does, so it doesn't actually have to compute the large offsets each and every time. (Most of SO's searches are over a 'small' set of known tags, so it's quite feasible to cache. An arbitrary search query would have far too many variants to cache, making it impractical.)
(Alternatively it might be using some other implementation that does allow arbitrary offsets.)
Other places talking about similar things:
http://sphinxsearch.com/docs/current.html#conf-max-matches
Back-of-the-envelope test:
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 100999,3;
...
3 rows in set (11.32 sec)
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 3;
...
3 rows in set (4.59 sec)
(Arbitrary query chosen so as not to use indexes very well; if indexes can be used the difference is less pronounced and harder to see. But in a production system running lots of queries, even 1 or 2 ms of difference is huge.)
Update (to show an indexed query):
mysql> select gridimage_id from gridimage_search order by imagetaken limit 10;
...
10 rows in set (0.00 sec)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 100000,10;
...
10 rows in set (1.70 sec)
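A common way around the large-offset cost is keyset (a.k.a. seek) pagination: instead of an offset, the WHERE clause starts from the last sort value seen on the previous page, so the index can seek straight to it. A sketch against the same table; the literal date below is just a placeholder for whatever value the application carried over from the previous page:
-- the imagetaken cutoff is a placeholder for the last value on the previous page
select gridimage_id, imagetaken from gridimage_search where imagetaken > '2011-06-01 00:00:00' order by imagetaken limit 10;
This stays fast no matter how deep you page, at the cost of only supporting next/previous-style navigation rather than jumping to an arbitrary page number.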
It's a TOP clause designed to limit the number of physical reads that the database has to perform, which limits the amount of time the query takes. Imagine you have 82 billion links to stories about "Japan" in your database. What if someone queries "Japan"? Are all 82 billion results really going to be clicked? No. The user needs the top 1000 most relevant results. When the search is generic, like "test", there is no way to determine relevance. In this case, YouTube/Google have to limit the volume returned so other users aren't affected by generic searches. What's faster: returning 1,000 results or 82,000,000,000 results?
