What is the Solr/Lucene process to purge deleted documents from the index?

What is the process to purge the index when you have deleted documents in it (after a delete-by-query)?
I'm asking this question because I'm working on a project based on Solr, and I've noticed some strange behavior that I would like to understand.
My system has these characteristics:
My documents are indexed continuously (1000 docs per second)
A purge is done every couple of seconds with this query:
<delete><query>timestamp_utc:[ * TO NOW-10MINUTES ]</query></delete>
So roughly 600,000 documents are visible in my index at any time:
10 minutes * 60 = 600 seconds
and at 1000 docs/s: 600 * 1000 = 600,000
But the size of my index increases over time, and I know that a delete-by-query does not remove documents right away: they are only marked as deleted in the index.
I've seen and tried the attribute "expungeDeletes=true", but I didn't notice a considerable change in my index size.
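For reference, I'm applying it on the commit command like this (as far as I understand, expungeDeletes is an attribute of commit, and it only merges away segments that contain deletions):
<commit expungeDeletes="true"/>
The same flag can also be passed as a request parameter on the /update handler, e.g. update?commit=true&expungeDeletes=true.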
Any information about the index purge process would be appreciated.
Thanks.
Edit
I know that an optimize can do this job, but it's a long operation and I want to avoid it.

You can create a new collection/core every 10 minutes, switch queries to it (plus the previous one), and delete the oldest collection/core (older than 10 minutes).
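A rough sketch with the Collections API (collection names are illustrative, and queries would always target the alias; re-issuing CREATEALIAS repoints it atomically):
/admin/collections?action=CREATE&name=events_1010&numShards=1
/admin/collections?action=CREATEALIAS&name=live&collections=events_1010,events_1000
/admin/collections?action=DELETE&name=events_0950
Deleting a whole collection removes its files outright, so there are no lingering deleted documents left to purge.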

Related

Hasura Timeout, Queries with Inconsistent Speed

I have a table with 2 million rows in it. Let's say I search for a user's email ID (using _ilike): without an index it takes more than 15 to 20 seconds to respond (or sometimes I get a timeout error). With an index I get a response within a second, though there are still times when it takes 15 to 20 s (say 2 out of 10 queries have this delay).
Here are the questions I have:
There is a timeout most of the time when we search for a mail ID that is not present in the table. Why is that? Is this expected behavior?
How much DB space/configuration is Hasura expected to need for approximately 2 million rows?
Is btree indexing better for _ilike searches, or is GIN indexing the better solution? (See the index sketch at the end of this question.)
Any more suggestions to improve the performance of the query other than indexing?
Even the basic query to get the row count is pretty slow; is there a way to improve it?
`userTable_aggregate {
  aggregate {
    count
  }
}`
Note: data is added to the userTable every hour (approximately 100 rows).
Thank you for taking the time to answer my questions.
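On the btree vs GIN question above: a plain btree index generally cannot serve _ilike pattern searches (especially ones with a leading wildcard); the usual approach in Postgres, which Hasura runs on top of, is a trigram GIN index via the pg_trgm extension. A minimal sketch, assuming userTable has an email column (both names are illustrative):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- GIN + trigram ops can serve ILIKE '%...%' searches that a btree cannot.
CREATE INDEX user_email_trgm_idx ON "userTable" USING gin (email gin_trgm_ops);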

SOLR Indexing Performance: Processing 2.3k Docs/s

I'm trying to improve the performance of my Solr 6.0 index.
Originally we were indexing 45M rows using a select statement joining 7 tables, which took 7+ hours. Because the JDBC connection stays open for the entire duration of the indexing, we were getting a "snapshot too old" error, causing our full index to fail.
We were able to archive about 10M rows and build an external table from the original 7-table join. This simplified the query Solr was using to a select * from one table.
We are now indexing 35M rows with a select * from ONE_BIG_External-TABLE, and it's taking ~4-5 hrs at 2.3k docs/s ±250. Since we are using an external table, we shouldn't be getting "snapshot too old" from the UNDO stack.
We have 77 columns we are indexing.
So we found a solution for our initial issue, but now I'm looking to increase our indexing speed when doing clean fulls.
Referencing SolrPerformanceFactors I have tried:
Batch Sizes:
2000 - no change
6000 - no change
4000 - no change
Example:
<dataSource jndiName="xxxxxx" batchSize="2000" type="JdbcDataSource"/>
Autocommit:
Every 1 hour - no change
MergeFactor:
20 vs the default 10 - shaved off 20 mins
Indexed Fields:
Cut out 11 indexed fields - nothing
EDIT: Adding some information per the questions below. I set auto-commit to every hour, which didn't help at all, with a soft commit every second. I copied these parameters from a much smaller Solr core we have here, which has reportedly been running well:
<autoCommit>
  <maxTime>3600000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
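For completeness, the other knob I'm aware of is the indexConfig block in solrconfig.xml: a larger RAM buffer means fewer flushes during a clean full, and in Solr 6 the merge factor is expressed through the merge policy factory. The values below are illustrative, not tuned recommendations:
<indexConfig>
  <!-- Let more documents accumulate in memory before flushing a segment (MB). -->
  <ramBufferSizeMB>512</ramBufferSizeMB>
  <!-- The rough equivalent of mergeFactor=20 in TieredMergePolicy terms. -->
  <mergePolicyFactory class="solr.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">20</int>
    <int name="segmentsPerTier">20</int>
  </mergePolicyFactory>
</indexConfig>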
Are there any gotchas I'm missing, other than throwing hardware at this?
Let me know if you need more info; I'll try to answer questions as best as I'm allowed.

Slow search response in Solr

I have a collection with 3 shards containing 5M records with 10 fields; the index size on disk is less than 1 GB. Each document has one long-valued field that needs to be sorted on in every query.
All the queries are filter queries with one range-query filter, and sorting on the long value has to be applied.
I am expected to get the response in under 50 milliseconds (including elapsed time); however, the actual QTime ranges from 50-100 ms while elapsed time varies from 200-350 ms.
Note: I have used docValues for all the fields and configured newSearcher/firstSearcher warming. Still, I do not see any improvement in response time.
What could be the possible tuning options?
Try to index those values; that may help.
I am not quite sure, but you can give it a try.
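A minimal sketch of what "index those values" could look like in schema.xml (the field name is illustrative, and it assumes the schema already defines a long field type): docValues already serves the sort, while indexed="true" lets the range filter run against the index:
<!-- Hypothetical field; adjust the name and type to your schema. -->
<field name="sort_value" type="long" indexed="true" stored="false" docValues="true"/>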

SOLR faceting slower than manual count?

I'm trying to get a SOLR range query working. I have a database with over 12 million documents, and I am filtering by a few parameters, for example:
product_category:"category1" AND product_group:"group1" AND product_manu:"manufacturer1"
The query itself returns about 700 documents and executes in two to three seconds on average.
But when I want to add a date-range facet to that query (I want to see how many products were added each day for the past x years), it executes in 50 seconds or more. So it seems that it would be faster to just retrieve all matching documents and perform manual counting in Java.
So I guess I must be doing something wrong with faceting?
Here is an example faceted query:
start=0&rows=0&facet.query=productDate%3A[0999-12-26T23%3A36%3A00.000Z+TO+2012-05-22T15%3A58%3A05.232Z]&q=source%3A%22source1%22+AND+productCategory%3A%22category1%22+AND+type%3A%22type1%22&facet=true&facet.limit=-1&facet.sort=count&facet.range=productDate&facet.range.start=NOW%2FDAY-5000DAYS&facet.range.end=NOW%2FDAY%2B1DAY&facet.range.gap=%2B1DAY
My only explanation is that SOLR is counting fields on some larger document pool than the 700 documents resulting from the "q=" parameter. Or maybe I should filter documents in another way?
I have tried changing the filterCache size and it works, but it seems to be a waste of memory for queries like these. After all, aggregating over 700 documents should be very fast, shouldn't it?
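One standard restructuring worth trying: move the constant constraints from q into fq parameters so they are cached in the filterCache, and let the range facet run over just the matching set. A simplified, decoded version of the query above under that assumption (URL-encode before sending):
q=*:*
&fq=source:"source1"
&fq=productCategory:"category1"
&fq=type:"type1"
&rows=0
&facet=true
&facet.range=productDate
&facet.range.start=NOW/DAY-5000DAYS
&facet.range.end=NOW/DAY+1DAY
&facet.range.gap=+1DAY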

What is a viable local database for Windows Phone 7 right now?

I was wondering what a viable database solution for local storage on Windows Phone 7 is right now. Using search I stumbled upon these 2 threads, but they are over a few months old, and I was wondering if there have been some new developments in databases for WP7. Also, I didn't find any reviews of the databases mentioned in the links below.
windows phone 7 database
Local Sql database support for Windows phone 7
My requirements are:
It should be free for commercial use
Saving/updating a record should only save the actual record and not the entire database (unlike WinPhone7 DB)
It should be able to query a table with ~1000 records quickly using LINQ
It should also work in the emulator
EDIT:
Just tried Sterling using a simple test app: it looks good, but I have 2 issues.
Creating 1000 records takes 30 seconds using db.Save(myPerson). Person is a simple class with 5 properties.
Then I discovered there is a db.SaveAsync<Person>(IList) method. This is fine because it doesn't block the current thread anymore.
BUT my question is: is it safe to call db.Flush() immediately and run a query on the IList that is still being saved (since it takes up to 30 seconds to save the records in synchronous mode)? Or do I have to wait until the BackgroundWorker has finished saving?
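For now I'm using this defensive pattern. It assumes SaveAsync hands back the BackgroundWorker doing the save, which is how I read the API, so verify against Sterling's actual signature:
// Hypothetical sketch: wait for the background save before flushing/querying.
// If SaveAsync starts the worker immediately, make sure the handler is
// attached race-free (check Sterling's docs for a completion callback).
var worker = db.SaveAsync<Person>(people);
worker.RunWorkerCompleted += (s, e) =>
{
    db.Flush(); // flush once the background save completes
    // ...now query the saved records here...
};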
The first LINQ query on these 1000 records with a where clause takes up to 14 seconds, while they load into memory.
Is there a way to speed this up?
Here are some benchmark results (the unit tests were executed on an HTC Trophy):
-----------------------------
purging: 7,59 sec
creating 1000 records: 0,006 sec
saving 1000 records: 32,374 sec
flushing 1000 records: 0,07 sec
-----------------------------
//async
creating 1000 records: 0,04 sec
saving 1000 records: 0,004 sec
flushing 1000 records: 0 sec
-----------------------------
//get all keys
persons list count = 1000 (0,007)
-----------------------------
//get all persons with a where clause
persons list with query count = 26 (14,241)
-----------------------------
//update 1 property of 1 record + save
persons list with query count = 26 (0,003s)
db saved (0,072s)
You might want to take a look at Sterling - it should address most of your concerns and is very flexible.
http://sterling.codeplex.com/
(Full disclosure: my project)
Try Siaqodb. It is a commercial project and, unlike Sterling, it does not serialize objects and keep everything in memory for querying. Siaqodb can be queried via a LINQ provider that can efficiently pull even just field values from the database without creating any objects in memory, or load/construct only the objects that were requested.
Perst is free for non-commercial use.
You might also want to try Ninja Database Pro. It looks like it has more features than Sterling.
http://www.kellermansoftware.com/p-43-ninja-database-pro.aspx
