Maximum number of docs for Solr with Cassandra - solr

I currently have a Cassandra database with around 50,000 rows and ~5 columns. However, when I check my numDocs/maxDocs on Core Admin using the Solr Admin UI, it only finds 10K numDocs & maxDocs. Is there a maximum that Solr is able to index? If so, where can I edit that number if possible? If not, am I doing something wrong when setting up Solr? I have every column indexed in my schema set-up.

The maximum number of docs Solr can index is far, far larger; a single Lucene index can hold roughly 2 billion documents, and you can shard beyond that.
You most probably have 5 Solr nodes, each holding 10k different docs. Just run a query, sort by something, and verify it.
DSE does not use SolrCloud; it does its own clustering, so the info you see in the Solr dashboard is not fully equivalent to what you would see in vanilla Solr.
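If you want to double-check, a match-all query with rows=0 reports the total count in numFound, and in DSE the query is distributed across the search nodes, so something like the following (host and keyspace.table core name are placeholders) should report all ~50,000 rows:
http://<solr-node>:8983/solr/<keyspace>.<table>/select?q=*:*&rows=0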

Related

Delete documents in Solr by query

I work on a rather complex Solr core with about 100 fields and about 6 million documents.
When I submit a query like
-ean:* and type:(-set and -series) and -title:* and -aggregatorId:*
I get about 20,000 results. But when I use the exact same query for deletion, none of these records gets deleted.
Delete queries usually work on my system, and I use Solr 4.8, so it is not the issue described in Solr index update by query.
I use a simple master-slave configuration, no SolrCloud.
Is there something special about this query? Too many negations, perhaps? And what can I do to delete the records matched by this query?
And yes, I know that Solr 4 is not state of the art, but upgrading to Solr 6 is currently not an option.
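For reference, issuing such a deletion through SolrJ 4.x would look roughly like this (a minimal sketch; the URL and core name are placeholders, and deletions only become visible after a commit):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class DeleteByQueryExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and core name; point this at the master node.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
            // The same query used for searching, submitted as a delete-by-query.
            solr.deleteByQuery("-ean:* AND type:(-set AND -series) AND -title:* AND -aggregatorId:*");
            solr.commit();   // deletions are not visible until a commit
            solr.shutdown();
        }
    }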

DSE - com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex

My Cassandra table has secondary indexes of type 'com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex'. They were created automatically when I initialized the Solr core so they are probably used by Solr in some way.
What exactly is their purpose? What is the effect if I cancel the index build in my Cassandra node? (but not in my Solr nodes)
That index type doesn't cause any overhead on non-Solr nodes, so you don't have to (actually, you must not) drop them.
What exactly is their purpose?
Solr extends the search capabilities of Cassandra rows. The 'com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex' is what hooks writes into search: it causes rows to be automatically and asynchronously indexed in Solr whenever you insert or update them.
In addition, Solr provides an HTTP interface (http://<host>:8983/solr) that allows you to reload/rebuild the index, among other Solr admin tasks.
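For example, a plain core reload through the standard CoreAdmin API looks like the following (host and core name are placeholders; in DSE the core is named <keyspace>.<table>, and DSE adds its own parameters for triggering a full re-index, so check the DSE documentation for your version):
http://<host>:8983/solr/admin/cores?action=RELOAD&core=<keyspace>.<table>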

Handling large number of ids in Solr

I need to perform an online-users search in Solr, i.e. a user needs to find the list of users who are online and match particular criteria.
How I am handling this: we store the ids of online users in a table, and I send all the online user ids in the Solr request, like
&fq=-id:(id1 id2 id3 ............id5000)
The problem with this approach is that when the number of ids becomes large, Solr takes too much time to resolve the query, and we need to transfer a large request over the network.
One solution could be to use a join in Solr, but the online data changes regularly and I can't reindex that frequently (say every 5-10 minutes; it should be at least an hour between reindexes).
Another solution I thought of is firing this query internally from Solr based on a certain parameter in the URL. I don't know much about Solr internals, so I'm not sure how to proceed.
With Solr 4's soft commits, committing has become cheap enough that it might be feasible to actually store the "online" flag directly in the user record, and just have &fq=online:true on your query. That reduces the overhead involved in sending 5000 ids over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set the commitWithin on the update. It's worth a shot, anyway.
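A minimal SolrJ sketch of that idea (the URL, core, and the "id"/"online" field names are assumptions, not from the answer; atomic updates also require the update log and stored fields):

    import java.util.Collections;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class OnlineFlagExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/users");

            // Atomic update: flip the "online" flag when a user logs in or out,
            // and let Solr make it visible within ~1s via commitWithin.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "user42");
            doc.addField("online", Collections.singletonMap("set", Boolean.TRUE));
            solr.add(doc, 1000);  // commitWithin = 1000 ms

            // Query side: filter to online users instead of sending thousands of ids.
            SolrQuery query = new SolrQuery("name:john*");
            query.addFilterQuery("online:true");
            System.out.println(solr.query(query).getResults().getNumFound());

            solr.shutdown();
        }
    }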
We worked around this issue by implementing sharding of the data.
Basically, without going heavily into code detail:
Write your own indexing code
- use consistent hashing to decide which ID goes to which Solr server (see the sketch after this list)
- index each user's data to the relevant shard (it can be several machines)
- make sure you have redundancy
Query the Solr shards
- do sharded queries in Solr using the shards parameter, or
- start an EmbeddedSolr and use it to do a sharded query
Solr will query all the shards and merge the results; it also provides timeouts if you need to limit the query time for each shard.
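A minimal sketch of the consistent-hashing step (the shard URLs, virtual-node count, and hash choice are illustrative assumptions, not from the answer):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Minimal consistent-hash ring for routing user ids to Solr shards.
    public class ShardRing {
        private final SortedMap<Long, String> ring = new TreeMap<>();

        public ShardRing(String[] shardUrls, int virtualNodes) {
            for (String shard : shardUrls) {
                for (int i = 0; i < virtualNodes; i++) {
                    ring.put(hash(shard + "#" + i), shard);
                }
            }
        }

        /** Returns the shard responsible for the given user id. */
        public String shardFor(String userId) {
            long h = hash(userId);
            SortedMap<Long, String> tail = ring.tailMap(h);
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        private static long hash(String key) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                        .digest(key.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) {   // first 8 bytes of the MD5 digest
                    h = (h << 8) | (d[i] & 0xFF);
                }
                return h;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }

Usage would be something like new ShardRing(new String[]{"http://solr1:8983/solr/users", "http://solr2:8983/solr/users"}, 64).shardFor("user42"); adding a shard only remaps the keys adjacent to its ring positions.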
Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not really well suited to searches on indexes that are constantly changing, and if you mainly search by IDs then a search engine is not needed at all.
For our project we basically implemented all the index building, load balancing and query engine ourselves and used Solr mostly as storage. We started using Solr back when its own sharding was flaky and not performant; I am not sure what the state of it is today.
Last note: if I were building this system today from scratch, without all the work we did over the past 4 years, I would advise using a cache (say memcached or redis) to store all the users that are currently online, and at request time simply iterating over all of them and filtering according to the criteria. The filtering by criteria can be cached independently and updated incrementally, and iterating over 5000 records is not necessarily very time consuming if the matching logic is simple.
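A tiny sketch of that last idea (the in-memory set stands in for memcached/redis, and the criteria predicate is a placeholder):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;

    public class OnlineUserCache {
        // Stand-in for memcached/redis: the set of currently online user ids.
        private final Set<String> onlineUserIds = new HashSet<>();

        public void setOnline(String userId, boolean online) {
            if (online) onlineUserIds.add(userId); else onlineUserIds.remove(userId);
        }

        // At request time, iterate over the online users and apply the criteria.
        public List<String> findOnline(Predicate<String> criteria) {
            return onlineUserIds.stream().filter(criteria).collect(Collectors.toList());
        }
    }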
Any robust solution will involve bringing your data close to Solr (in batch) and using it internally, NOT sending a very large request at search time, which is supposed to be a low-latency operation.
You should develop your own filter. The filter would cache the online-users data every so often (say, every minute). If the data changes VERY frequently, consider implementing a PostFilter.
You can find a good example of filter implementation here:
http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
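A rough skeleton of such a PostFilter, using Solr 4.x-era APIs (the parser name, the cached online-users set, and the lookupUserId helper are placeholders, not from the answer):

    import java.io.IOException;
    import java.util.Collections;
    import java.util.Set;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.DelegatingCollector;
    import org.apache.solr.search.ExtendedQueryBase;
    import org.apache.solr.search.PostFilter;
    import org.apache.solr.search.QParser;
    import org.apache.solr.search.QParserPlugin;

    public class OnlineUsersQParserPlugin extends QParserPlugin {
        @Override
        public void init(NamedList args) {}

        @Override
        public QParser createParser(String qstr, SolrParams localParams,
                                    SolrParams params, SolrQueryRequest req) {
            return new QParser(qstr, localParams, params, req) {
                @Override
                public Query parse() {
                    return new OnlineUsersFilterQuery();
                }
            };
        }
    }

    class OnlineUsersFilterQuery extends ExtendedQueryBase implements PostFilter {
        // Stand-in for a cache refreshed every minute or so from your user table.
        private static volatile Set<String> onlineUserIds = Collections.emptySet();

        @Override
        public boolean getCache() { return false; }   // post filters must not be cached

        @Override
        public int getCost() { return Math.max(super.getCost(), 100); }  // cost >= 100 => runs after the main query

        @Override
        public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
            return new DelegatingCollector() {
                @Override
                public void collect(int doc) throws IOException {
                    // Only pass the doc on if its user is in the cached online set.
                    String userId = lookupUserId(doc);
                    if (userId != null && onlineUserIds.contains(userId)) {
                        super.collect(doc);
                    }
                }
            };
        }

        private String lookupUserId(int doc) {
            // Omitted: read the user id field for this (segment-local) doc,
            // e.g. via docValues on the current reader.
            return null;
        }
    }

It would be registered as a queryParser plugin in solrconfig.xml and applied as something like &fq={!onlineUsers cache=false cost=100} (the parser name is an assumption).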
one solution can be use of join in solr but online data change regularly and i cant index data everytime (say 5-10 min, it should be at-least an hr)
I think you could very well use Solr joins, but after a little bit of improvisation.
The solution I propose is as follows:
You can have 2 indexes (Solr cores):
1. Primary index (the one you have now)
2. Secondary index with only two fields, "ID" and "IS_ONLINE"
You could now update the secondary index frequently (on the order of seconds) and keep it in sync with the table you use for storing online users.
NOTE: This secondary index, even if updated frequently, would not degrade performance, provided you make the necessary tweaks, like using appropriate queries during delta-import, etc.
You could now perform a Solr join on the ID field between these two indexes to achieve what you want. Here is the link on how to perform Solr joins between indexes/Solr cores.
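For example (the secondary core name "online_users" and the primary id field "id" are assumptions; "ID" and "IS_ONLINE" follow the sketch above), the cross-core join filter would look something like:
&fq={!join fromIndex=online_users from=ID to=id}IS_ONLINE:true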

Solr 3.6 Distribution dynamically adding shards

I am looking at using distribution and sharding with Solr 3.6 vs Solr 4+ (SolrCloud).
I can see that 3.6 can have multiple shards set up; ideally each shard would sit on a different box. At large scale, once the boxes start to run low on memory, I would like to add new shards to the index. From what I have seen this cannot be done / isn't documented.
Does this require a full re-index of the data?
Can a 3-shard index be re-indexed into a 4-shard setup?
Can queries still be served from the index during a re-index?
What are the space overheads required to re-index?
The schema.xml (field names and types) would not be changed, just a new shard location added.
Self answer: From what I have seen, it would be best to stop filling the existing 3 shards and send new data only to the new 4th shard, then update the shards parameter to include the new shard when searching.
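For example (hosts and core name are placeholders), a distributed query that includes the new shard would look like:
http://shard1:8983/solr/mycore/select?q=*:*&shards=shard1:8983/solr/mycore,shard2:8983/solr/mycore,shard3:8983/solr/mycore,shard4:8983/solr/mycore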

Running solr index on hadoop

I have a huge amount of data that needs to be indexed, and it takes more than 10 hours to get the job done. Is there a way I can do this on Hadoop? Has anyone done this before? Thanks a lot!
You haven't explained where the 10 hours go. Is the time spent extracting the data, or just indexing it?
If the extraction is what takes long, then you can use Hadoop. Solr supports adding documents in bulk, so in your map function you could accumulate thousands of records and submit them to Solr for indexing in one shot. That will improve your performance a lot.
Also, what size is your data?
You could collect a large number of records in the reduce function of a map/reduce job. You have to generate suitable keys in your map so that a large number of records goes to a single reduce call. In your custom reducer class, initialize the Solr client object in the setup/configure method (depending on your Hadoop version) and close it in the cleanup method. You will have to build a document collection object (in SolrNet or SolrJ) and commit all of them in one single shot.
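A minimal sketch of such a reducer with SolrJ (the Solr URL, core, field names, id scheme, and batch size are illustrative assumptions):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Reducer that batches records and pushes them to Solr in large chunks.
    public class SolrIndexReducer extends Reducer<Text, Text, Text, Text> {

        private static final int BATCH_SIZE = 1000;
        private HttpSolrServer solr;
        private final List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        private long seq = 0;

        @Override
        protected void setup(Context context) {
            // One client per reducer; created in setup(), closed in cleanup().
            solr = new HttpSolrServer("http://solrhost:8983/solr/mycore");
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                SolrInputDocument doc = new SolrInputDocument();
                // Placeholder mapping: derive the unique id and fields from your record.
                doc.addField("id", key.toString() + "-" + (seq++));
                doc.addField("body", value.toString());
                batch.add(doc);
                if (batch.size() >= BATCH_SIZE) {
                    flush();
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            flush();
            try {
                solr.commit();      // single commit at the end of the reducer
            } catch (Exception e) {
                throw new IOException(e);
            }
            solr.shutdown();
        }

        private void flush() throws IOException {
            if (batch.isEmpty()) return;
            try {
                solr.add(batch);    // send the whole batch in one request
            } catch (Exception e) {
                throw new IOException(e);
            }
            batch.clear();
        }
    }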
If you are using Hadoop, there is another option called Katta. You can look at it as well.
You can write a MapReduce job over your Hadoop cluster which simply takes each record and sends it to Solr over HTTP for indexing. AFAIK Solr currently doesn't have distributed indexing over a cluster of machines, so it would be worth looking into Elasticsearch if you also want to distribute your index over multiple nodes.
There is a Solr Hadoop output format which creates a new index in each reducer, so you distribute your keys according to the indices you want and then copy the HDFS files into your Solr instance after the fact.
http://www.datasalt.com/2011/10/front-end-view-generation-with-hadoop/
