SOLR Indexing Performance: Processing 2.3k Docs/s


I'm trying to improve the performance of my Solr 6.0 index.
Originally we were indexing 45m rows using a select statement that joined 7 tables and took 7+ hours to index. Because the JDBC connection stays open for the entire duration of the indexing, we got a "snapshot too old" error, which caused our full index to fail.
We were able to archive about 10m rows and build an external table from the original 7-join select. This simplified the query Solr was using to a select * from 1 table.
We are now indexing 35m rows using a Select * from ONE_BIG_External-TABLE, and it's taking ~4-5 hrs @ 2.3k docs/s +-250. Since we are using an external table, we shouldn't be getting the "snapshot too old" error caused by the UNDO stack.
We have 77 columns we are indexing.
So we found a solution for our initial issue but now I'm looking to increase our indexing speed when doing clean fulls.
Referencing SolrPerformanceFactors I have tried:
Batch Sizes:
2000 - no change
6000 - no change
4000 - no change
Example:
<dataSource jndiName="xxxxxx" batchSize="2000" type="JdbcDataSource"/>
Autocommit:
Every 1 hour - no change
MergeFactor:
20 vs the default of 10 - shaved off 20 mins
Indexed Fields:
Cut out 11 indexed fields - nothing
EDIT: Adding some information per the questions below. I set auto-commit to every hour, which didn't help. I also set a soft commit every second. I copied these parameters from a much smaller Solr core we have here, whose owners said they have been running well.
<autoCommit>
<maxTime>3600000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
Are there any gotchas that I'm missing, other than throwing hardware at this?
Let me know if you need more info; I'll try to answer questions as best as I'm allowed.
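As a rough sanity check on the figures quoted above, the sketch below converts the observed rate into wall-clock time and back. The row count and rate come from the question; the 1-hour target in the usage lines is purely a hypothetical goal, not something stated in the post.

```python
# Rough check of the numbers quoted in the question.
ROWS = 35_000_000      # rows in the external table (from the question)
DOCS_PER_SEC = 2_300   # observed indexing rate (from the question)


def hours_to_index(rows: int, docs_per_sec: float) -> float:
    """Return the wall-clock hours needed to index `rows` at a steady rate."""
    return rows / docs_per_sec / 3600


def rate_for_target(rows: int, target_hours: float) -> float:
    """Return the docs/s needed to finish `rows` within `target_hours`."""
    return rows / (target_hours * 3600)


if __name__ == "__main__":
    print(f"{hours_to_index(ROWS, DOCS_PER_SEC):.1f} h at {DOCS_PER_SEC} docs/s")
    # Hypothetical goal: a full index in one hour.
    print(f"{rate_for_target(ROWS, 1.0):.0f} docs/s needed for a 1 h full index")
```

At 2.3k docs/s this works out to a little over 4 hours, which matches the reported ~4-5 hrs, so the rate and duration in the question are consistent with each other.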

Related

Hasura Timeout, Queries with Inconsistent speed Issues

I have a table with 2 million rows in it. If I search by user email ID (using _ilike) without an index, it takes more than 15 to 20 seconds to respond (or sometimes I get a timeout error). With an index I get a response within a second, although it still sometimes takes 15 to 20 s (say 2 out of 10 times).
These are the questions I have:
There is a timeout most of the time when we search for an email ID that is not present in the table. Why is that? Is this expected behavior?
How much space, and what DB configuration, is Hasura expected to need for approximately 2 million rows?
Is btree indexing better for _ilike searches, or is gin indexing the better solution?
Are there any more suggestions to improve the performance of the query, other than indexing?
Even the basic query to get the count of the rows is pretty slow. Is there a way to improve it?
`userTable_aggregate {
aggregate {
count
}
}`
Note: Data is added to the userTable every hour (approximately 100 rows, let's say).
Thank you for taking time to answer my questions

Table Scan very high "actual rows" when filter placed on different table

I have a query, that I did not write, that takes 2.5 minutes to run. I am trying to optimize it without being able to modify the underlying tables, i.e. no new indexes can be added.
During my optimization troubleshooting I commented out a filter and all of a sudden my query ran in .5 seconds. I have screwed with the formatting and placing of that filter and if it is there the query takes 2.5 minutes, without it .5 seconds. The biggest problem is that the filter is not on the table that is being table-scanned (With over 300k records), it is on a table with 300 records.
The "Actual Execution Plan" of both the 0:0:0.5 vs 0:2:30 are identical down to the exact percentage costs of all steps:
Execution Plan
The only difference is that on the table-scanned table the "Actual Number of Rows" on the 2.5 min query shows 3.7 million rows. The table only has 300k rows. Where the .5 sec query shows Actual Number of Rows as 2,063. The filter is actually being placed on the FS_EDIPartner table that only has 300 rows.
With the filter I get the correct 51 records, but it takes 2.5 minutes to return. Without the filter I get duplication, so I get 2,796 rows, and it only takes half a second to return.
I cannot figure out why adding the filter to a table with 300 rows and a correct index is causing the Table scan of a different table to have such a significant difference in actual number of rows. I am even doing the "Table scan" table as a sub-query to filter its records down from 300k to 17k prior to doing the join. Here is the actual query in its current state, sorry the tables don't make a lot of sense, I could not reproduce this behavior in test data.
SELECT dbo.FS_ARInvoiceHeader.CustomerID
, dbo.FS_EDIPartner.PartnerID
, dbo.FS_ARInvoiceHeader.InvoiceNumber
, dbo.FS_ARInvoiceHeader.InvoiceDate
, dbo.FS_ARInvoiceHeader.InvoiceType
, dbo.FS_ARInvoiceHeader.CONumber
, dbo.FS_EDIPartner.InternalTransactionSetCode
, docs.DocumentName
, dbo.FS_ARInvoiceHeader.InvoiceStatus
FROM dbo.FS_ARInvoiceHeader
INNER JOIN dbo.FS_EDIPartner ON dbo.FS_ARInvoiceHeader.CustomerID = dbo.FS_EDIPartner.CustomerID
LEFT JOIN (Select DocumentName
FROM GentranDatabase.dbo.ZNW_Documents
WHERE DATEADD(SECOND,TimeCreated,'1970-1-1') > '2016-06-01'
AND TransactionSetID = '810') docs on dbo.FS_ARInvoiceHeader.InvoiceNumber = docs.DocumentName COLLATE Latin1_General_BIN
WHERE docs.DocumentName IS NULL
AND dbo.FS_ARInvoiceHeader.InvoiceType = 'I'
AND dbo.FS_ARInvoiceHeader.InvoiceStatus <> 'Y'
--AND (dbo.FS_EDIPartner.InternalTransactionSetCode = '810')
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'CB%'))
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'DM%'))
AND InvoiceDate > '2016-06-01'
The commented-out line in the WHERE clause is the culprit; uncommenting it causes the 2.5 minute run.

It could be that the table statistics have gotten out of whack. These include the number of records in each table, which is used to choose the best query plan. Try running this and then running the query again:
EXEC sp_updatestats

Using @jeremy's comment as a guideline, which pointed out that the Actual Number of Rows was not my problem but rather the number of executions, I figured out that the Hash Match plan took .5 seconds and the Nested Loop plan took 2.5 minutes. Trying to force the Hash Match using LEFT HASH JOIN was inconsistent depending on what the other filters were set to; changing dates sometimes took it from .5 seconds to 30 seconds. So forcing the hash (which is highly discouraged anyway) wasn't a good solution. Finally I resorted to moving the poorly performing view to a stored procedure and splitting out both of the tables that were related to the poor performance into table variables, then joining those table variables. This gave the most consistently good performance. On average the SP returns in less than 1 second, which is far better than the 2.5 minutes it started at.
@Jeremy gets the credit, but since his wasn't an answer, I thought I would document what was actually done in case someone else stumbles across this later.
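To illustrate why the number of executions matters more than the table size, here is a toy model of the two join strategies. It is only a sketch with made-up data and function names, not how SQL Server implements either operator; it simply counts how many inner-table rows each approach reads.

```python
from collections import defaultdict


def nested_loop_join(outer, inner, key):
    """Join by rescanning `inner` once per outer row; returns (matches, rows_read)."""
    matches, rows_read = [], 0
    for o in outer:
        for i in inner:  # one full scan of `inner` per outer row
            rows_read += 1
            if o[key] == i[key]:
                matches.append((o, i))
    return matches, rows_read


def hash_join(outer, inner, key):
    """Join by scanning `inner` once to build a hash table, then probing it."""
    table, rows_read = defaultdict(list), 0
    for i in inner:  # single scan of `inner`
        rows_read += 1
        table[i[key]].append(i)
    matches = [(o, i) for o in outer for i in table.get(o[key], [])]
    return matches, rows_read


if __name__ == "__main__":
    # Made-up data: 12 outer rows, 1,000 inner rows.
    outer = [{"doc": n} for n in range(12)]
    inner = [{"doc": n} for n in range(1000)]
    print(nested_loop_join(outer, inner, "doc")[1])  # inner rows read: 12 * 1000
    print(hash_join(outer, inner, "doc")[1])         # inner rows read: 1000
```

Both functions return the same matches, but the nested loop reads the inner rows once per outer row. A row count accumulated across repeated scans can therefore exceed the size of the table itself, which is consistent with seeing 3.7 million actual rows on a 300k-row table.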

What is the Solr/Lucene process to purge deleted documents in index?

What is the process to purge the index when you've got some deleted documents (after a delete by query) in the index?
I'm asking this question because I'm working on a project based on Solr, and I've noticed a strange behavior that I would like some information about.
My system has these features:
My documents are indexed continuously (1000 docs per second)
A purge is done every couple of seconds with this query:
<delete><query>timestamp_utc:[ * TO NOW-10MINUTES ]</query></delete>
So I have 600000 documents visible in my index at any time:
10 Minutes * 60 = 600 seconds
and speed = 1000docs/s so 600 * 1000 = 600000
But the size of my index increases over time. And I know that when you do a delete by query, the documents are marked with a "delete" label or something like that in the index.
I've seen and tried the attribute "expungeDeletes=true", but I didn't notice a considerable change in my index size.
Any information about the index purge process would be appreciated.
Thanks.
Edit
I know that an optimize can do this job, but it's a long operation and I want to avoid that.

You can create a new collection/core every 10 minutes, switch to it (plus the previous one), and delete the oldest collection/core (the one older than 10 minutes).
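The rotation described above can be sketched as a small planning function. This is only an illustration: the `docs_` prefix and the timestamp-based naming scheme are hypothetical, and the actual creation, aliasing and deletion of collections or cores (for example through Solr's admin APIs) is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=10)  # retention window from the question
PREFIX = "docs_"                # hypothetical collection name prefix


def bucket_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 10-minute bucket (UTC)."""
    seconds = int(WINDOW.total_seconds())
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % seconds, tz=timezone.utc)


def collection_for(ts: datetime) -> str:
    """Name of the collection that documents stamped `ts` are written to."""
    return PREFIX + bucket_start(ts).strftime("%Y%m%d%H%M")


def plan(now: datetime):
    """Return (write_to, search_in, drop) collection names at time `now`."""
    current = bucket_start(now)
    write_to = collection_for(current)
    previous = collection_for(current - WINDOW)
    expired = collection_for(current - 2 * WINDOW)
    return write_to, [write_to, previous], expired


if __name__ == "__main__":
    print(plan(datetime.now(timezone.utc)))
```

Searching the current and previous buckets together covers between 10 and 20 minutes of data, so a query-time filter on timestamp_utc would still be needed for an exact 10-minute view. The appeal of this design is that dropping a whole collection removes its files outright, rather than leaving deleted documents to be merged away later.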

SOLR faceting slower than manual count?

I'm trying to get a SOLR range query working. I have a database with over 12 million documents, and I am filtering by a few parameters, for example:
product_category:"category1" AND product_group:"group1" AND product_manu:"manufacturer1"
The query itself returns about 700 documents and executes in two-three seconds on average.
But when I add a date range facet to that query (I want to see how many products were added each day for the past x years), it executes in 50 seconds or more. So it seems that it would be faster to just retrieve all matching documents and perform manual counting in Java.
So I guess I must be doing something wrong with faceting?
Here is an example faceted query:
start=0&rows=0&facet.query=productDate%3A[0999-12-26T23%3A36%3A00.000Z+TO+2012-05-22T15%3A58%3A05.232Z]&q=source%3A%22source1%22+AND+productCategory%3A%22category1%22+AND+type%3A%22type1%22&facet=true&facet.limit=-1&facet.sort=count&facet.range=productDate&facet.range.start=NOW%2FDAY-5000DAYS&facet.range.end=NOW%2FDAY%2B1DAY&facet.range.gap=%2B1DAY
My only explanation is that SOLR is counting fields on some larger document pool than the 700 documents resulting from my "q=" parameter. Or maybe I should filter documents in another way?
I have tried changing the filterCache size and it works, but it seems to be a waste of memory for queries like these. After all, aggregating over 700 documents should be very fast, shouldn't it?
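For comparison, the manual counting mentioned above might look like the sketch below (in Python rather than Java, and with hypothetical function names). It assumes the matching documents have already been fetched and that their productDate values are ISO-8601 strings ending in "Z". The second helper just works out how many buckets the example facet request asks for; it does not explain the slowdown.

```python
from collections import Counter
from datetime import datetime, timezone


def count_per_day(timestamps):
    """Count documents per UTC calendar day from ISO-8601 date strings."""
    counts = Counter()
    for ts in timestamps:
        # Replace the trailing 'Z' so fromisoformat also accepts the string
        # on Python versions before 3.11.
        parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        counts[parsed.astimezone(timezone.utc).date()] += 1
    return counts


def range_bucket_count(days_back: int, days_forward: int, gap_days: int = 1) -> int:
    """Number of buckets between start (NOW-days_back) and end (NOW+days_forward)."""
    return (days_back + days_forward) // gap_days


if __name__ == "__main__":
    sample = ["2012-05-22T15:58:05.232Z", "2012-05-22T01:00:00Z", "2012-05-21T23:59:59Z"]
    print(count_per_day(sample))
    # facet.range.start=NOW/DAY-5000DAYS, end=NOW/DAY+1DAY, gap=+1DAY
    print(range_bucket_count(5000, 1))
```

With the parameters from the example query, the range facet spans 5,001 daily buckets, whereas the manual count only ever touches the roughly 700 matching documents.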

Paginated searching... does performance degrade heavily after N records?

I just tried the following query on YouTube:
http://www.youtube.com/results?search_query=test&search=tag&page=100
and received the error message:
Sorry, YouTube does not serve more than 1000 results for any query.
(You asked for results starting from 2000.)
I also tried Google search for "test", and although it said there were about 3.44 billion results, I was only able to get to page 82 (or about 820 results).
This leads me to wonder, does performance start to degrade with paginated searches after N records (specifically wondering about with ROW_NUMBER() in SQL Server or similar feature in other DB systems), or are YouTube/Google doing this for other reasons? Granted, it's pretty unlikely that most people would need to go past the first 1000 results for a query, but I would imagine the limitation is specifically put in place for some technical reason.
Then again Stack Overflow lets you page through 47k results: https://stackoverflow.com/questions/tagged/c?page=955&sort=newest&pagesize=50

Yes. High offsets are slow and inefficient.
The only way to find the records at an offset, is to compute all records that came before and then discard them.
(I don't know ROW_NUMBER(), but it would be LIMIT in MySQL. So
SELECT * FROM table LIMIT 1999,20
)
.. in the above example, the first 1999 records have to be fetched first and then discarded. Generally it can't skip ahead or use indexes to jump right to the correct location in the data, because normally there would be a 'WHERE' clause filtering the results.
It is possible to cache the results, which is probably what SO does, so it doesn't actually have to compute the large offsets each and every time. (Most of SO's searches are for a 'small' set of known tags, so it's quite feasible to cache. An arbitrary search query will have far too many variations to cache, making it impractical.)
(Alternatively it might be using some other implementation that does allow arbitrary offsets)
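One such alternative, which is not described in this answer and is offered here only as an illustration, is keyset (or "seek") pagination: instead of skipping N rows, the query asks for rows after the last key already shown. The sketch below uses Python's built-in sqlite3 module with a made-up `items` table to contrast the two query shapes.

```python
import sqlite3


def make_db(n: int = 1000) -> sqlite3.Connection:
    """Build an in-memory table with `n` rows; table and column names are made up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO items (id, title) VALUES (?, ?)",
        ((i, f"item {i}") for i in range(1, n + 1)),
    )
    return conn


def page_by_offset(conn, offset: int, size: int):
    """OFFSET paging: the engine walks past `offset` rows before returning any."""
    return conn.execute(
        "SELECT id, title FROM items ORDER BY id LIMIT ? OFFSET ?",
        (size, offset),
    ).fetchall()


def page_after(conn, last_id: int, size: int):
    """Keyset paging: seek to the rows after the last id already shown."""
    return conn.execute(
        "SELECT id, title FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, size),
    ).fetchall()


if __name__ == "__main__":
    conn = make_db()
    print(page_by_offset(conn, 500, 3))
    print(page_after(conn, 500, 3))
```

Both calls return the same rows here because the ids are contiguous. The trade-off is that keyset paging needs a stable, indexed sort key and only supports "next page" style navigation, so it cannot jump straight to an arbitrary page number the way an offset can.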
Other places talking about similar things:
http://sphinxsearch.com/docs/current.html#conf-max-matches
Back-of-the-envelope test:
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 100999,3;
...
3 rows in set (11.32 sec)
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 3;
...
3 rows in set (4.59 sec)
(Arbitrary query chosen so as not to use indexes very well; if indexes can be used, the difference is less pronounced and harder to see. But in a production system running lots of queries, a 1 or 2 ms difference is huge.)
Update: (to show an indexed query)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 10;
...
10 rows in set (0.00 sec)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 100000,10;
...
10 rows in set (1.70 sec)

It's a TOP clause designed to limit the amount of physical reads that the database has to perform, which limits the amount of time that the query takes. Imagine you have 82 billion links to stories about "Japan" in your database. What if someone queries "Japan"? Are all 82 billion results really going to be clicked? No. The user needs the top 1000 most relevant results. When the search is generic, like "test", there is no way to determine relevance. In this case, YouTube/Google has to limit the volume returned so other users aren't affected by generic searches. What's faster, returning 1,000 results or 82,000,000,000 results?
