Solr doesn't return all values

I set up a Solr server for indexing my trending data (4 million records).
When I tested it, it was very fast (0.3 seconds), but then I noticed that the result looks different from the result I get from MySQL.
In MySQL the query returns about 4,000 records belonging to a set of user IDs, which I then sort and group in JavaScript.
Solr also returns about 4,000 records for the same set of IDs, but after sorting and grouping them in JavaScript I found that some records (about 10) are missing. I then added 'rows': 5000000 to the Solr query and got the same result as MySQL, but the query time increased from 0.3 seconds to 1.7 seconds.
So I wonder: is adding 'rows': 5000000 the only way to make Solr return all the data I need? If so, am I doing it right? If not, what should I use instead?
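If the shortfall comes from Solr's rows limit, two common options (a rough sketch, not verified against this setup; the field names and values are illustrative) are to set rows to a realistic upper bound for the expected match count, or to page through the full result with a cursor instead of asking for millions of rows at once:

# rows only needs to cover what the query can realistically match
q=userId:(101 OR 102 OR 103)&rows=10000

# or deep-page with cursorMark (requires a sort ending on the uniqueKey field)
q=userId:(101 OR 102 OR 103)&sort=id asc&rows=1000&cursorMark=*
# resend with cursorMark set to the returned nextCursorMark until it stops changing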

Related

apache solr max and min in single query

We are migrating from MySQL to Apache Solr since Solr is fast at searching. We have a scenario where we need to
find 1) the difference (max - min)
2) grouped by date(timeStamp)
Our MySQL table, PowerTable, has an eventTimeStamp column (datetime) and a numeric field column.
Our MySQL query is:
SELECT DATE(eventTimeStamp), MAX(field) - MIN(field) AS Energy FROM PowerTable GROUP BY DATE(eventTimeStamp);
which returns one row per day. So we have to calculate the difference per day, where the date column is in datetime format.
To reflect/migrate the above MySQL query in Apache Solr, we are using result grouping:
group=true&group.query=eventTimeStamp:[2019-12-11T00:00:00Z TO 2019-12-11T23:59:59Z]&group.query=eventTimeStamp:[2019-12-12T00:00:00Z TO 2019-12-12T23:59:59Z]
Using the Apache Solr stats component, we are able to calculate max and min for the whole result, but we need the max and min values on a per-day basis.
When we try to get the min and max per day, we are able to fetch either the min or the max using the following:
&group.sort=event1 desc or &group.sort=event1 asc
So how can we find both min and max in a single query (per group, not for the whole result)?
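For what it's worth, a sketch of how a per-day min and max can be expressed with Solr's JSON Facet API (assuming a Solr version that has it, 5.x and later; the collection name power and the date window are illustrative, the field names follow the query above):

curl http://localhost:8983/solr/power/query -d '
{
  "query": "*:*",
  "limit": 0,
  "facet": {
    "per_day": {
      "type": "range",
      "field": "eventTimeStamp",
      "start": "2019-12-11T00:00:00Z",
      "end": "2019-12-13T00:00:00Z",
      "gap": "+1DAY",
      "facet": {
        "min_field": "min(field)",
        "max_field": "max(field)"
      }
    }
  }
}'

Each day bucket then carries both aggregations, and the difference can be computed per bucket on the client side.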

SOLR Indexing Performance: Processing 2.3k Docs/s

I'm trying to improve the indexing performance of my Solr 6.0 index.
Originally we were indexing 45M rows using a SELECT statement joining 7 tables, which took 7+ hours. Because the JDBC connection stays open for the entire duration of the indexing, we got a "snapshot too old" error, which caused our full index to fail.
We were able to archive about 10M rows and build an external table from the original 7-join SELECT. This simplified the query Solr was using to a SELECT * from one table.
We are now indexing 35M rows using a SELECT * from ONE_BIG_External-TABLE, and it's taking ~4-5 hours at 2.3k docs/s ±250. Since we are using an external table, we shouldn't hit the "snapshot too old" error caused by the UNDO stack.
We have 77 columns we are indexing.
So we found a solution for our initial issue, but now I'm looking to increase our indexing speed when doing clean full imports.
Referencing SolrPerformanceFactors I have tried:
Batch Sizes:
2000 - no change
6000 - no change
4000 - no change
Example:
<dataSource jndiName="xxxxxx" batchSize="2000" type="JdbcDataSource"/>
Autocommit:
Every 1 hour - no change
MergeFactor:
20 vs the default 10 - shaved off 20 mins
Indexed Fields:
Cut out 11 indexed fields - no change
EDIT: Adding some information per the questions below. I set auto-commit to every hour, which didn't help. Soft commits happen every second. I copied these parameters from a much smaller Solr core we have here, and they said it has been running well.
<autoCommit>
  <maxTime>3600000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
Are there any gotchas I'm missing, other than throwing hardware at this?
Let me know if you need more info; I'll try to answer questions as best as I'm allowed.
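One knob not covered above that is often tuned for bulk full imports (a hedged sketch; the values are illustrative and not verified on this setup) is to give the index writer a larger RAM buffer and relax the one-second soft commit while the clean full runs, since opening a new searcher every second while 2.3k docs/s stream in can compete with indexing for I/O and CPU. In solrconfig.xml, something like:

<indexConfig>
  <!-- buffer more documents in memory before flushing a segment; illustrative value -->
  <ramBufferSizeMB>512</ramBufferSizeMB>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>600000</maxTime>       <!-- hard commit every 10 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>-1</maxTime>           <!-- disable soft commits for the bulk load -->
  </autoSoftCommit>
</updateHandler>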

Slower search response in solr

I have a collection with 3 shards containing 5M records with 10 fields; the index size on disk is less than 1 GB. Each document has one long-valued field that needs to be sorted on in every query.
All the queries are filter queries with one range-query filter, and sorting on the long value has to be applied.
I am expected to get the response in under 50 milliseconds (including elapsed time); however, the actual QTime ranges from 50-100 ms, while the elapsed time varies from 200-350 ms.
Note: I have used docValues for all the fields and configured newSearcher/firstSearcher warming. Still, I do not see any improvement in response time.
What are the possible tuning options?
Try indexing those values as well; that may help.
I am not quite sure, but you can give it a try.
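To make that concrete, a sketch of what the suggestion could look like in the schema (the field name is illustrative; plong is the point-based long type in Solr 7+, older versions use the Trie-based long type):

<!-- indexed="true" speeds up range filters; docValues="true" supports sorting without un-inverting the field -->
<field name="rank_value" type="plong" indexed="true" stored="false" docValues="true"/>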

What is the Solr/Lucene process to purge deleted documents in index?

What is the process to purge the index when you have deleted documents (after a delete-by-query) in the index?
I'm asking because I'm working on a project based on Solr, I've noticed some strange behavior, and I would like some information about it.
My system has these characteristics:
Documents are indexed continuously (1000 docs per second)
A purge is done every couple of seconds with this query:
<delete><query>timestamp_utc:[ * TO NOW-10MINUTES ]</query></delete>
So around 600,000 documents are visible in my index at any time:
10 minutes * 60 = 600 seconds
and speed = 1000 docs/s, so 600 * 1000 = 600,000
But the size of my index keeps growing over time. I know that when you do a delete-by-query, the documents are only flagged as deleted in the index rather than physically removed.
I've seen and tried the attribute "expungeDeletes=true", but I didn't notice a considerable change in my index size.
Any informations about the index purge process would be appreciated.
Thanks.
Edit
I know that an optimize can do this job, but it's a long operation and I want to avoid it.
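For reference, expungeDeletes is passed as a commit parameter, e.g. (the core name is illustrative):

curl 'http://localhost:8983/solr/mycore/update?commit=true&expungeDeletes=true'

With the default merge policy it only rewrites segments whose share of deleted documents crosses a threshold, which may be why it made little visible difference to the index size.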
You can create a new collection/core every 10 minutes, switch to it (keeping the previous one as well), and delete the oldest collection/core (older than 10 minutes).
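As a rough sketch of that rotation using the Collections API (the collection names and interval are illustrative; a read alias keeps query URLs stable):

# create the next collection
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=events_1210&numShards=1'
# point the alias at the newest collections
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=events&collections=events_1210,events_1200'
# drop the collection that has aged out
curl 'http://localhost:8983/solr/admin/collections?action=DELETE&name=events_1150'

Dropping a whole collection frees its disk space immediately, unlike delete-by-query, which only marks documents as deleted until segments are merged.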

SOLR faceting slower than manual count?

I'm trying to get a SOLR range query working. I have a database with over 12 million documents, and I am filtering by a few parameters, for example:
product_category:"category1" AND product_group:"group1" AND product_manu:"manufacturer1"
The query itself returns about 700 documents and executes in two to three seconds on average.
But when I want to add a date range facet to that query (I want to see how many products were added each day over the past x years), it executes in 50 seconds or more. So it seems that it would be faster to just retrieve all matching documents and count them manually in Java.
So I guess I must be doing something wrong with faceting?
Here is an example faceted query:
start=0&rows=0&facet.query=productDate%3A[0999-12-26T23%3A36%3A00.000Z+TO+2012-05-22T15%3A58%3A05.232Z]&q=source%3A%22source1%22+AND+productCategory%3A%22category1%22+AND+type%3A%22type1%22&facet=true&facet.limit=-1&facet.sort=count&facet.range=productDate&facet.range.start=NOW%2FDAY-5000DAYS&facet.range.end=NOW%2FDAY%2B1DAY&facet.range.gap=%2B1DAY
My only explanation is that SOLR is counting fields over some larger document pool than the 700 documents matched by the "q=" parameter. Or maybe I should filter the documents in another way?
I have tried changing the filterCache size and it does help, but it seems to be a waste of memory for queries like these. After all, aggregating over 700 documents should be very fast, shouldn't it?
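One restructuring that is commonly suggested for this pattern (a sketch, not a verified fix; values are shortened and shown un-encoded for readability): move the constant constraints from q into fq parameters so they are cached in the filterCache and reused across queries, and drop the facet.query that duplicates the range facet:

q=*:*
&fq=source:"source1"
&fq=productCategory:"category1"
&fq=type:"type1"
&rows=0
&facet=true
&facet.range=productDate
&facet.range.start=NOW/DAY-5000DAYS
&facet.range.end=NOW/DAY+1DAY
&facet.range.gap=+1DAY

The range facet still generates roughly 5,000 day-sized buckets, so if most of that history is empty, a coarser facet.range.gap (or a later facet.range.start) may also be worth trying.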
