I have a Lucene index with close to 480M documents. The index is 36 GB. I ran around 10,000 queries against it. Each query is a boolean AND query containing 3 term queries, that is, the query has 3 operands that MUST occur. Executing these 3-word queries gives the following latency percentiles:
50th = 16 ms
75th = 52 ms
90th = 121 ms
95th = 262 ms
99th = 76010 ms
99.9th = 76037 ms
Is the latency expected to degrade when the number of docs is as high as 480M? All the segments in the index are merged into one segment. Even when the segments are not merged, the latencies are not very different. Each document has 5-6 stored fields. However, the latencies above are for boolean queries that don't access any stored fields and only do a posting list lookup on 3 tokens.
Any ideas on what could be wrong here?
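For reference, a 3-term MUST query amounts to intersecting three sorted posting lists. The sketch below shows the idea in Python. It is only an illustration, not Lucene's actual implementation, and the function name `intersect` and the sample doc ids are made up.

```python
from bisect import bisect_left

def intersect(postings):
    """Intersect sorted doc-id lists, driving from the shortest list."""
    if not postings:
        return []
    lists = sorted(postings, key=len)
    lead, others = lists[0], lists[1:]
    positions = [0] * len(others)  # one cursor per non-lead list
    result = []
    for doc in lead:
        match = True
        for i, plist in enumerate(others):
            # Advance this cursor to the first doc id >= doc.
            positions[i] = bisect_left(plist, doc, positions[i])
            if positions[i] == len(plist) or plist[positions[i]] != doc:
                match = False
                break
        if match:
            result.append(doc)
    return result

# Hypothetical posting lists for three terms.
hits = intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9, 10], [0, 3, 9]])
```

The cost is driven by the length of the posting lists rather than by the number of hits, so three very common terms can be expensive even when few documents match all of them.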
According to official documentation: In sys.database_query_store_options we have options which can adjust Query Store workflow and performance.
From documentation:
"flush_interval_seconds - The period for regular flushing of Query Store data to disk in seconds. Default value is 900 (15 min)"
"interval_length_minutes - The statistics aggregation interval in minutes. Arbitrary values are not allowed. Use one of the following: 1, 5, 10, 15, 30, 60, and 1440 minutes. The default value is 60 minutes."
And now I have a problem:
If Query Store flushes data to disk every 15 minutes, why do I see a query in the QS tables within seconds of its execution?
As I understand it, the QS tables are 'permanent' and are stored in the database (on disk), so how does this parameter (flush_interval_seconds) work?
The same goes for interval_length_minutes: I saved the QS output 1 minute after the last query execution and again after 61 minutes, and the two were more or less the same. So what does this aggregation do?
flush_interval_seconds - The period for regular flushing of Query Store data to disk, in seconds. That means flushing from memory to disk so that the information is not lost after a server restart. Before the flush, you are simply reading the info from memory.
interval_length_minutes - This is the aggregation interval for query runtime statistics. The lower it is, the finer the granularity of the runtime statistics.
Neither option sets a period after which the info becomes available.
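To illustrate what an aggregation interval means, here is a rough Python sketch that groups executions into fixed-length time buckets. It only shows the idea of granularity and is not how SQL Server implements Query Store. The function name `aggregate` and the sample data are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def aggregate(executions, interval_minutes):
    """Group (timestamp, duration_ms) pairs into fixed-length intervals."""
    interval = timedelta(minutes=interval_minutes)
    epoch = datetime(1970, 1, 1)
    buckets = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for ts, duration_ms in executions:
        # Floor the timestamp to the start of its interval.
        start = epoch + ((ts - epoch) // interval) * interval
        buckets[start]["count"] += 1
        buckets[start]["total_ms"] += duration_ms
    return dict(buckets)

# Hypothetical executions of one query.
runs = [
    (datetime(2024, 1, 1, 10, 5), 10.0),
    (datetime(2024, 1, 1, 10, 20), 30.0),
    (datetime(2024, 1, 1, 11, 10), 50.0),
]
hourly = aggregate(runs, 60)  # two buckets: 10:00 and 11:00
fine = aggregate(runs, 15)    # three buckets: 10:00, 10:15 and 11:00
```

With a 60-minute interval the first two executions share one row of statistics. With a 15-minute interval each execution gets its own row. In both cases an execution is counted as soon as it happens.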
I'm trying to improve the performance of my Solr 6.0 index.
Originally we were indexing 45m rows using a select statement joining 7 tables, which took 7+ hours to index. This caused a "snapshot too old" error because the JDBC connection stays open for the entire duration of the indexing, causing our full index to fail.
We were able to archive about 10m rows and build an external table from the original 7-join select. This simplified the query Solr was using to a select * from 1 table.
We are now indexing 35m rows using a Select * from ONE_BIG_External-TABLE, and it takes ~4-5 hrs @ 2.3k docs/s +-250. Since we are using an external table, we shouldn't be getting the snapshot too old error from the UNDO stack.
We have 77 columns we are indexing.
So we found a solution for our initial issue but now I'm looking to increase our indexing speed when doing clean fulls.
Referencing SolrPerformanceFactors I have tried:
Batch Sizes:
2000 - no change
6000 - no change
4000 - no change
Example:
<dataSource jndiName="xxxxxx" batchSize="2000" type="JdbcDataSource"/>
Autocommit:
Every 1 hour - no change
MergeFactor:
20 vs 10 default - shed off 20 mins
Indexed Fields:
Cut out 11 indexed fields - nothing
EDIT: Adding some information per the questions below. I set auto-commit to every hour, which didn't help. I also set soft commit to every second. I copied these parameters from a much smaller Solr core we have here, where they are said to have been running well.
<autoCommit>
<maxTime>3600000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
Are there any gotchas that I'm missing, other than throwing hardware at this?
Let me know if you need more info. I'll try to answer questions as best as I'm allowed.
I have a collection with 3 shards containing 5M records with 10 fields each. The index size on disk is less than 1 GB. Each document has 1 long-valued field that every query must sort on.
All the queries are filter queries with one range query filter, with sorting applied on the long value.
I am expected to return the response in under 50 milliseconds (including elapsed time). However, the actual QTime ranges from 50-100 ms, while the elapsed time varies from 200-350 ms.
Note: I have used docValues for all the fields and configured newSearcher/firstSearcher. Still, I do not see any improvement in response time.
What could be the possible tuning options?
Try indexing those values. That may help.
I am not quite sure, but you can give it a try.
I'm trying to get a Solr range query working. I have a database with over 12 million documents, and I am filtering by a few parameters, for example:
product_category:"category1" AND product_group:"group1" AND product_manu:"manufacturer1"
The query itself returns about 700 documents and executes in two to three seconds on average.
But when I add a date range facet to that query (I want to see how many products were added each day for the past x years), it takes 50 seconds or more. So it seems that it would be faster to just retrieve all matching documents and do the counting manually in Java.
So I guess I must be doing something wrong with faceting?
Here is an example faceted query:
start=0&rows=0&facet.query=productDate%3A[0999-12-26T23%3A36%3A00.000Z+TO+2012-05-22T15%3A58%3A05.232Z]&q=source%3A%22source1%22+AND+productCategory%3A%22category1%22+AND+type%3A%22type1%22&facet=true&facet.limit=-1&facet.sort=count&facet.range=productDate&facet.range.start=NOW%2FDAY-5000DAYS&facet.range.end=NOW%2FDAY%2B1DAY&facet.range.gap=%2B1DAY
My only explanation is that Solr is counting fields over some larger document pool than the 700 documents resulting from the "q=" parameter. Or maybe I should filter documents in another way?
I have tried changing the filterCache size and it works, but it seems to be a waste of memory for queries like these. After all, aggregating over 700 documents should be very fast, shouldn't it?
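For comparison, the manual counting mentioned above is cheap for a few hundred documents. Here is a rough sketch in Python rather than Java, assuming the productDate values have already been fetched as ISO timestamp strings in the same format as in the query. The function name `count_per_day` and the sample dates are made up.

```python
from collections import Counter
from datetime import datetime

def count_per_day(timestamps):
    """Count documents per calendar day from ISO-8601 UTC timestamps."""
    counts = Counter()
    for ts in timestamps:
        day = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").date()
        counts[day.isoformat()] += 1
    return counts

# Hypothetical productDate values from the matching documents.
dates = [
    "2012-05-22T15:58:05.232Z",
    "2012-05-22T09:00:00.000Z",
    "2012-05-21T23:59:59.999Z",
]
per_day = count_per_day(dates)
```

Note that the range facet in the query above covers about 5,000 one-day buckets, however few documents match.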
I just tried the following query on YouTube:
http://www.youtube.com/results?search_query=test&search=tag&page=100
and received the error message:
Sorry, YouTube does not serve more than 1000 results for any query.
(You asked for results starting from 2000.)
I also tried Google search for "test", and although it said there were about 3.44 billion results, I was only able to get to page 82 (or about 820 results).
This leads me to wonder: does performance start to degrade with paginated searches after N records (I'm specifically wondering about ROW_NUMBER() in SQL Server or similar features in other DB systems), or are YouTube/Google doing this for other reasons? Granted, it's pretty unlikely that most people would need to go past the first 1000 results for a query, but I would imagine the limitation is specifically put in place for some technical reason.
Then again Stack Overflow lets you page through 47k results: https://stackoverflow.com/questions/tagged/c?page=955&sort=newest&pagesize=50
Yes. High offsets are slow and inefficient.
The only way to find the records at an offset is to compute all the records that come before them and then discard those.
(I don't know ROW_NUMBER(), but in MySQL the equivalent would be LIMIT. So
SELECT * FROM table LIMIT 1999,20
)
.. in the above example, the first 1999 records have to be fetched first and then discarded. Generally it can't skip ahead, or use indexes to jump right to the correct location in the data, because normally there would be a 'WHERE' clause filtering the results.
It is possible to cache the results, which is probably what SO does, so it doesn't actually have to compute the large offsets each and every time. (Most of SO's searches are over a 'small' set of known tags, so it's quite feasible to cache. An arbitrary search query will have far too many variations to cache, making it impractical.)
(Alternatively it might be using some other implementation that does allow arbitrary offsets)
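One such alternative is keyset (or "seek") pagination: instead of an offset, the client passes the last key it saw and the query seeks past it using an index. I don't know whether SO does this. The sketch below only shows the technique, using Python with an in-memory SQLite database. The table name `items` and both function names are made up.

```python
import sqlite3

def make_db(n=1000):
    """Build a small in-memory table with ids 1..n."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO items (id, title) VALUES (?, ?)",
        ((i, f"item {i}") for i in range(1, n + 1)),
    )
    return conn

def page_by_offset(conn, offset, size):
    # The engine has to step over `offset` rows before returning any.
    return conn.execute(
        "SELECT id, title FROM items ORDER BY id LIMIT ? OFFSET ?",
        (size, offset),
    ).fetchall()

def page_by_keyset(conn, last_seen_id, size):
    # The primary key index lets the engine seek straight past last_seen_id.
    return conn.execute(
        "SELECT id, title FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, size),
    ).fetchall()

conn = make_db()
offset_page = page_by_offset(conn, 900, 5)
keyset_page = page_by_keyset(conn, 900, 5)  # same rows, ids 901..905
```

The trade-off is that keyset pagination only supports "next page after this key", not jumping directly to an arbitrary page number, and it needs a unique, indexed sort key.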
Other places talking about similar things:
http://sphinxsearch.com/docs/current.html#conf-max-matches
Back-of-the-envelope test:
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 100999,3;
...
3 rows in set (11.32 sec)
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 3;
...
3 rows in set (4.59 sec)
(Arbitrary query chosen so as not to use indexes very well. If indexes can be used, the difference is less pronounced and harder to see. But in a production system running lots of queries, a 1 or 2 ms difference is huge.)
Update: (to show an indexed query)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 10;
...
10 rows in set (0.00 sec)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 100000,10;
...
10 rows in set (1.70 sec)
It's a TOP clause designed to limit the amount of physical reads that the database has to perform, which limits the amount of time that the query takes. Imagine you have 82 billion links to stories about "Japan" in your database. What if someone queries "Japan"? Are all 82 billion results really going to be clicked? No. The user needs the top 1000 most relevant results. When the search is generic, like "test", there is no way to determine relevance. In this case, YouTube/Google has to limit the volume returned so other users aren't affected by generic searches. What's faster, returning 1,000 results or 82,000,000,000 results?