Delete documents in Solr by query

I work on a rather complex Solr core with about 100 fields and about 6 million documents.
When I submit a query like
-ean:* and type:(-set and -series) and -title:* and -aggregatorId:*
I get about 20,000 results. But when I use the exact same query for deletion, none of these records gets deleted.
Since delete queries usually work on my system and I am using Solr 4.8, this is not the issue described in Solr index update by query.
I use a simple master slave configuration, no SolrCloud.
Is there something special about this query? Too many negations, perhaps? And what can I do to delete the records matched by this query?
And yes, I know that Solr 4 is not state of the art; but upgrading to Solr 6 is currently not an option.
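For reference, here is a minimal SolrJ sketch of how such a delete could be submitted; the core URL is a placeholder, and the rewritten query shows one commonly suggested workaround for purely negative clauses (anchoring each negation against *:*), offered as an assumption rather than a confirmed fix for this case.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DeleteByNegativeQuery {
    public static void main(String[] args) throws Exception {
        // Core URL is an assumption for illustration (SolrJ 4.x client).
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

        // Workaround often suggested for delete-by-query: give every purely
        // negative clause an explicit positive set (*:*) to subtract from.
        // Note that boolean operators must be upper-case AND for the default parser.
        String q = "(*:* -ean:*) AND (*:* -type:set) AND (*:* -type:series)"
                 + " AND (*:* -title:*) AND (*:* -aggregatorId:*)";

        solr.deleteByQuery(q);
        solr.commit();
    }
}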

Related

How to get top items viewed and downloaded for all time after sharding SOLR statistics in DSpace?

Previously, before I sharded my SOLR statistics, I was able to display the top 10 most viewed and most downloaded items of all time using this code: Add top 10 most downloaded items to /statistics-home
Our statistics only date back to 2011, so after sharding we have 10 statistics cores, from 2011 up to the present year, which is 2020.
My question now is how to get the most viewed/downloaded items across all the combined years, since I can no longer use the default SOLR statistics URL; using it will only get the current year. When I tail my solr.log while viewing the /statistics-home URL, I can see it querying every statistics core.
How did the URL /statistics-home manage to get the top item views even after sharding the SOLR statistics? Any tips on querying multiple SOLR statistics cores from XMLUI?
After tailing my solr.log, I found out that I just need to add the shards parameter to my original SOLR query.
So previously, when querying for the all-time most downloaded items, I used this query:
/solr/statistics/select?q=type:0+-isBot:true+statistics_type:view&indent=true&facet=true&facet.field=owningItem&fq=bundleName:ORIGINAL&facet.sort=count&facet.limit=10&rows=0&wt=json
Now, I just inserted the parameter &shards=localhost:8080/solr/statistics-2011,localhost:8080/solr/statistics-2012,localhost:8080/solr/statistics-n… with all my sharded statistics cores. It turned out that I don't need to join queries.
Regarding querying Solr statistics from XMLUI, I just created a variable for the shards parameter, listing all the sharded statistics cores I have. I don't know if there is a way other than manually adding each yearly statistics core, but right now this suffices for my needs.
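As an illustration only, here is roughly how the same sharded facet query could be issued from SolrJ; the host, port and the list of yearly cores below are assumptions (the real list has to name every sharded statistics core).

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AllTimeDownloads {
    public static void main(String[] args) throws Exception {
        // Host, port and core names are assumptions for illustration.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/statistics");

        SolrQuery q = new SolrQuery("type:0 -isBot:true statistics_type:view");
        q.addFilterQuery("bundleName:ORIGINAL");
        q.setRows(0);
        q.setFacet(true);
        q.addFacetField("owningItem");
        q.setFacetSort("count");
        q.setFacetLimit(10);

        // Fan the query out over every yearly statistics core.
        q.set("shards", "localhost:8080/solr/statistics-2011,"
                      + "localhost:8080/solr/statistics-2012,"
                      + "localhost:8080/solr/statistics-2013");

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getFacetField("owningItem").getValues());
    }
}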

Maximum number of docs for Solr with Cassandra

I currently have a Cassandra database with around 50,000 rows and ~5 columns. However, when I check numDocs/maxDocs in Core Admin in the Solr Admin UI, it only shows 10K for numDocs and maxDocs. Is there a maximum number of documents that Solr is able to index? If so, where can I edit that number, if possible? If not, am I doing something wrong when setting up Solr? I have every column indexed in my schema setup.
The maximum number of docs Solr can index is far, far larger.
You most probably have 5 Solr nodes, and each one is holding 10k different docs. Just run a query, sort by something, and verify it.
DSE does not use SolrCloud; it does its own clustering, so the information you see in the Solr dashboard is not fully equivalent to what you see in vanilla Solr.
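For example, one quick way to verify the real total is to ask any node for numFound on a match-all query instead of reading per-core numDocs; the host and core name below are assumptions (DSE typically names the core keyspace.table).

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CountAllDocs {
    public static void main(String[] args) throws Exception {
        // Host and core name are assumptions for illustration.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mykeyspace.mytable");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0); // we only care about numFound, not the documents themselves

        long total = solr.query(q).getResults().getNumFound();
        System.out.println("Total matching docs across the cluster: " + total);
    }
}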

Atomic Updates in Solr - multiple shards

We are currently working on the atomic update feature in Solr using SolrJ. Will Solr update the record correctly if the index is distributed across shards?
If the record is in shard2, will it be updated, or will a new record be created in shard1?
If you're handling the sharding yourself, you'll have to update the exact shard in question (as you're the one responsible for distributing documents).
If you're using Solr in SolrCloud mode, Solr will route the document to the correct shard for you, based on the document routing strategy.
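As a sketch of the SolrCloud case: an atomic update via SolrJ only needs the unique key and the modifier map, and Solr routes it to the owning shard. The ZooKeeper address, collection name and field names below are assumptions for illustration.

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble and collection name are assumptions.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181/solr");
        solr.setDefaultCollection("users");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "user-42"); // unique key used for routing

        // Atomic update: only the named field is modified, the rest of the
        // stored document is left intact.
        Map<String, Object> op = new HashMap<String, Object>();
        op.put("set", "gold");
        doc.addField("membership_level", op);

        solr.add(doc);    // routed to the owning shard in SolrCloud
        solr.commit();
    }
}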

Handling large number of ids in Solr

I need to perform an "online users" search in Solr, i.e. a user needs to find the list of users who are currently online and match particular criteria.
How I am handling this: we store the ids of online users in a table, and I send all the online user ids in the Solr request, like
&fq=-id:(id1 id2 id3 ............id5000)
The problem with this approach is that when the list of ids becomes large, Solr takes too long to resolve the query, and we have to transfer a large request over the network.
One solution could be using a join in Solr, but the online data changes regularly and I can't reindex that often (every 5-10 minutes; it should be at least an hour).
Another solution I can think of is firing this query internally from Solr based on a certain parameter in the URL. I don't have much idea about Solr internals, so I don't know how to proceed.
With Solr4's soft commits, committing has become cheap enough that it might be feasible to actually store the "online" flag directly in the user record, and just have &fq=online:true on your query. That reduces the overhead involved in sending 5000 ids over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set the commitWithin on the update. It's worth a shot, anyway.
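A hedged SolrJ sketch of what that could look like; the field names and the 10-second commitWithin window are assumptions.

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OnlineFlagUpdate {
    // Flip the "online" flag on a user document; field names are assumptions.
    public static void setOnline(HttpSolrServer solr, String userId, boolean online) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", userId);

        // Atomic "set" so the rest of the user record is left untouched.
        Map<String, Object> op = new HashMap<String, Object>();
        op.put("set", online);
        doc.addField("online", op);

        // commitWithin (ms) lets Solr batch the commit instead of forcing one per login/logout.
        solr.add(doc, 10000);
    }
}

The search side then only needs &fq=online:true, as described above.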
We worked around this issue by sharding the data ourselves.
Basically, without going heavily into code detail:
Write your own indexing code:
  - use consistent hashing to decide which ID goes to which Solr server (see the sketch after this list)
  - index each user's data to the relevant shard (it can be several machines)
  - make sure you have redundancy
Query the Solr shards:
  - do sharded queries in Solr using the shards parameter
  - start an EmbeddedSolr and use it to do a sharded query
  - Solr will query all the shards and merge the results; it also provides timeouts if you need to limit the query time for each shard
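The consistent-hashing step could look roughly like the following; the ring implementation, virtual-node count and shard URLs are assumptions, not the code actually used.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring for routing a user ID to a Solr shard URL.
public class ShardRing {
    private final TreeMap<Long, String> ring = new TreeMap<Long, String>();

    public ShardRing(String[] shardUrls, int virtualNodes) throws Exception {
        for (String shard : shardUrls) {
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(shard + "#" + i), shard);
            }
        }
    }

    // Route an ID to the first shard clockwise from its hash position.
    public String shardFor(String id) throws Exception {
        long h = hash(id);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
        // Use the first 8 bytes of the MD5 digest as a signed long.
        long h = 0;
        for (int i = 0; i < 8; i++) {
            h = (h << 8) | (d[i] & 0xff);
        }
        return h;
    }
}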
Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not really well suited for searches on indexes that are constantly changing, and if you mainly search by IDs then a search engine is not needed.
For our project we basically implemented all the index building, load balancing and query engine ourselves and used Solr mostly as storage. But we started using Solr back when its sharding was flaky and not performant; I am not sure what the state of it is today.
Last note: if I were building this system today from scratch, without all the work we did over the past 4 years, I would advise using a cache to store all the users that are currently online (say memcached or redis), and at request time I would simply iterate over all of them and filter according to the criteria. The filtering by criteria can be cached independently and updated incrementally, and iterating over 5000 records is not necessarily very time consuming if the matching logic is simple.
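A sketch of that cache-based approach, assuming Redis via the Jedis client, a hypothetical "online_users" set and a criteria callback; all of these names are illustrative only.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import redis.clients.jedis.Jedis;

public class OnlineUserFilter {
    // Hypothetical criteria abstraction; the real check depends on the application.
    public interface UserCriteria {
        boolean matches(String userId);
    }

    // Fetch the currently online users from Redis and filter them in application code.
    public static List<String> onlineUsersMatching(UserCriteria criteria) {
        Jedis jedis = new Jedis("localhost");
        try {
            Set<String> onlineIds = jedis.smembers("online_users");
            List<String> matches = new ArrayList<String>();
            for (String id : onlineIds) {
                if (criteria.matches(id)) { // cheap, possibly cached, per-user check
                    matches.add(id);
                }
            }
            return matches;
        } finally {
            jedis.close();
        }
    }
}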
Any robust solution will involve bringing your data close to SOLR (in batch) and using it internally, NOT sending a very large request at search time, which is a low-latency operation.
You should develop your own filter; the filter would cache the online users' data every once in a while (say, every minute). If the data changes VERY frequently, consider implementing a PostFilter.
You can find a good example of a filter implementation here:
http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
"One solution could be using a join in Solr, but the online data changes regularly and I can't reindex that often (every 5-10 minutes; it should be at least an hour)."
I think you could very well use Solr joins, but with a little bit of improvisation.
The solution I propose is as follows:
You can have 2 indexes (Solr cores):
1. Primary index (the one you have now)
2. Secondary index with only two fields, "ID" and "IS_ONLINE"
You can now update the secondary index frequently (on the order of seconds) and keep it in sync with the table you use for storing online users.
NOTE: Even if updated frequently, this secondary index would not degrade performance, provided we make the necessary tweaks, such as using appropriate queries during delta-import, etc.
You can then perform a Solr join on the ID field across these two indexes to achieve what you want. Here is the link on how to perform Solr joins between indexes/Solr cores.
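A hedged example of what that cross-core join could look like from SolrJ, using the two cores and the "ID"/"IS_ONLINE" fields described above; the core names and URL are assumptions, and both cores must live in the same Solr instance for fromIndex joins to work.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class OnlineJoinQuery {
    public static void main(String[] args) throws Exception {
        // Core names and host are assumptions for illustration.
        HttpSolrServer primary = new HttpSolrServer("http://localhost:8983/solr/primary");

        SolrQuery q = new SolrQuery("*:*"); // replace with the real search criteria
        // Cross-core join: keep only documents whose ID appears in the secondary
        // core with IS_ONLINE set to true.
        q.addFilterQuery("{!join from=ID to=ID fromIndex=secondary}IS_ONLINE:true");

        System.out.println(primary.query(q).getResults().getNumFound());
    }
}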

Solr 3.6 Distribution dynamically adding shards

I am looking at using distribution and sharding with Solr 3.6 vs Solr 4+ (SolrCloud)
I can see that 3.6 can have multiple shards set up; ideally each shard would sit on a different box. At large scale, once the boxes start to run low on memory, I would like to add new shards to the index. From what I have seen, this cannot be done / isn't documented.
Does this require a full re index of the data?
Can a 3-shard index be re-indexed into a 4-shard instance?
Can queries still be invoked on the index during a re-index?
What are the space overheads required to re-index?
The schema.xml (field names and types) would not be changed, just a new shard location added.
Self answer: From what I have seen, it would be best to stop filling the 3 existing shards and fill only the new 4th shard with data, then update the shards parameter to include the new shard in searches.
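For illustration, a distributed query in 3.6 simply lists every shard explicitly, so searching the new 4th shard only requires extending that list; the host names below are assumptions.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class FourShardQuery {
    public static void main(String[] args) throws Exception {
        // Host names are assumptions; any shard can act as the aggregator.
        HttpSolrServer solr = new HttpSolrServer("http://shard1:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        // After adding the 4th shard, list it here as well; Solr queries every
        // shard in the list and merges the results.
        q.set("shards", "shard1:8983/solr,shard2:8983/solr,shard3:8983/solr,shard4:8983/solr");

        System.out.println(solr.query(q).getResults().getNumFound());
    }
}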
