DSE - com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex - solr

My Cassandra table has secondary indexes of type 'com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex'. They were created automatically when I initialized the Solr core, so they are probably used by Solr in some way.
What exactly is their purpose? What is the effect if I cancel the index build on my Cassandra nodes (but not on my Solr nodes)?

Those indexes don't cause any overhead on non-Solr nodes, so you don't have to (and actually must not) drop them.

What exactly is their purpose?
Solr extends the search capabilities of Cassandra rows. The 'com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex' index is what lets Cassandra nodes have their rows automatically and asynchronously indexed in Solr whenever you insert or update a row.
In addition, Solr provides an HTTP interface (http://<host>:8983/solr) that allows you to reload/rebuild the index, among other Solr admin tasks.
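As a rough illustration (not part of the original answer), the standard Solr CoreAdmin API exposed on that interface can trigger a RELOAD over plain HTTP; in the sketch below the host localhost and the core name ks.table are assumptions, since DSE Search cores are typically named after the keyspace and table.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CoreReloadExample {
    public static void main(String[] args) throws Exception {
        // Assumed host and core name; adjust to your cluster.
        String url = "http://localhost:8983/solr/admin/cores?action=RELOAD&core=ks.table";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        System.out.println("HTTP status: " + conn.getResponseCode());

        // Print the status response returned by the CoreAdmin handler.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}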

Related

Maximum number of docs for Solr with Cassandra

I currently have a Cassandra database with around 50,000 rows and ~5 columns. However, when I check numDocs/maxDocs in Core Admin using the Solr Admin UI, it only reports 10K for numDocs and maxDocs. Is there a maximum that Solr is able to index? If so, where can I edit that number if possible? If not, am I doing something wrong when setting up Solr? I have every column indexed in my schema setup.
The maximum number of docs Solr can index is way, way larger than that (on the order of two billion documents per Lucene index).
You most probably have 5 Solr nodes and each one is holding 10k different docs. Just run a query, sort by something, and verify it.
DSE does not use SolrCloud; it does its own clustering, so the info you see in the Solr dashboard is not fully equivalent to what you see in vanilla Solr.
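As a quick way to do that verification (a sketch not taken from the original answer, using SolrJ; the node address and core name are placeholders), a match-all query with rows=0 returns the total document count for the whole distributed index in numFound:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CountCheckExample {
    public static void main(String[] args) throws Exception {
        // Placeholder DSE Search node and core name ("<keyspace>.<table>").
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://dse-node:8983/solr/ks.table").build()) {

            // Match everything, fetch no rows; numFound still reflects the
            // total hit count across the distributed index.
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(0);

            QueryResponse response = client.query(query);
            System.out.println("Total docs: " + response.getResults().getNumFound());
        }
    }
}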

SolrCloud Indexing/Querying without a Smart-Client

I'm having a bit of trouble understanding exactly how indexing and querying would work if I don't have a smart-client available. I'm using SolrNet with C#, which currently doesn't integrate with ZooKeeper.
As a basic example, let's say I have a single collection, split into two shards, replicated across two separate nodes/servers, and I have a standard HTTP load-balancer in front of the servers (a scenario mentioned here). If I use the standard compositeId router, I believe that indexing would work without issue and be replicated to both nodes by ZooKeeper behind the scenes. I wouldn't need to worry about which node received the "update" command -- ZooKeeper would handle document routing and replication automatically.
However, in this same scenario, would ZooKeeper handle query routing behind the scenes correctly? Given that I'm using built-in sharding and not custom sharding, would a query request to the load-balancer get routed to the correct shard, or would I have to include all known shards in the "shards" parameter (see here) to make sure I don't miss anything? Obviously this would be onerous to maintain as the number of shards grows.
It seems like custom sharding would provide the greatest efficiency across indexing and querying, although then you run the risk of wildly unequal shard sizes. Any thoughts on these matters would be appreciated.
Let's take the example of a two-shard collection, with each shard on a separate node/server.
10.x.x.100:8983/solr/ --> shard 1 / node 1
10.x.x.101:8983/solr/ --> shard 2 / node 2
Using default routing, you indexed 100 documents, which were split between these two servers, so they now hold 50 documents each.
If you query either of the two servers for documents, Solr will search both shards by default. You do not need to specify anything in the shards parameter.
So
10.x.x.100:8983/solr/collection/select?q=solr rocks
will also run the same query on 10.x.x.101:8983/solr/, and the results returned will be a combination of results from both shards, sorted and ranked by score.
The &shards parameter comes into the picture when you know which "group" of data is in which shard. For example, building on the setup above, say you have custom routing enabled and use the field "city" to route the documents. For the sake of the example, let's assume there can be only two values for the "city" field. Your documents will be routed to one of the shards based on this field.
On the application side, if you want to query specifically for documents belonging to one city, you can specify the &shards parameter, and all the results for the query will come only from that shard.
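Here is a hedged SolrJ sketch of both cases (not part of the original answer); it reuses the example shard addresses above, and the collection name "collection" and the city value are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ShardsParamExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://10.x.x.100:8983/solr/collection").build()) {

            // Default case: no shards parameter, so Solr fans the query
            // out to both shards and merges the ranked results.
            SolrQuery all = new SolrQuery("solr rocks");
            System.out.println("All shards: "
                    + client.query(all).getResults().getNumFound());

            // Custom-routing case: we already know which shard holds the
            // city we care about, so restrict the search to that shard.
            SolrQuery oneShard = new SolrQuery("city:Boston");
            oneShard.set("shards", "10.x.x.100:8983/solr/collection");
            System.out.println("Single shard: "
                    + client.query(oneShard).getResults().getNumFound());
        }
    }
}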

Solr Collection vs Cores

I struggle to understand the difference between collections and cores. If I understand it correctly, cores are multiple indexes. A collection consists of cores, so essentially they share the same logic of separation, i.e. separate cores and collections have separate endpoints.
I have the following scenario: I am building a backend cloud service for several online shops. Each shop has a set of products, to which customers can add reviews. I want to index static data (product information) separately from dynamic data (reviews) so I can improve performance.
How can I best separate these in Solr?
From the SolrCloud documentation:
Collection: A single search index.
Shard: A logical section of a single collection (also called a Slice). Sometimes people will talk about "Shard" in a physical sense (a manifestation of a logical shard).
Replica: A physical manifestation of a logical Shard, implemented as a single Lucene index on a SolrCore.
Leader: One Replica of every Shard will be designated as a Leader to coordinate indexing for that Shard.
SolrCore: Encapsulates a single physical index. One or more make up logical shards (or slices), which make up a collection.
Node: A single instance of Solr. A single Solr instance can have multiple SolrCores that can be part of any number of collections.
Cluster: All of the nodes you are using to host SolrCores.
So basically a Collection (a logical group) is made up of multiple cores (physical indexes).
Core
In Solr, a core is composed of a set of configuration files, Lucene index files, and Solr's transaction log. A Solr core is a uniquely named, managed, and configured index running in a Solr server; a Solr server can host one or more cores. A core is typically used to separate documents that have different schemas.
Collection
Solr also uses the term collection, which only has meaning in the context of a Solr cluster in which a single index is distributed across multiple servers. SolrCloud introduces the concept of a collection, which extends the concept of a uniquely named, managed, and configured index to one that is split into shards and distributed across multiple servers.
As per my understanding:
In distributed search,
Collection is a logical index spread across multiple servers.
A core is the part of a server that hosts one collection (or a slice of it).
In non-distributed search,
A single server running Solr can have multiple collections, and each of those collections is also a core. So collection and core are the same if the search is not distributed.
Summary
The part of a collection on one server is called a core.
A collection is the same as an index.
One Solr server can have many cores.
A collection is a logical index. (Example use of multiple collections: say two teams in the same group are not big enough to justify a full Solr server of their own, but they also do not want to mix their data in a single index. They can then create separate collections/indexes, which will keep their data separate.)
It's better to use a separate SolrCloud cluster rather than create more collections if the data for a collection is big enough (not sure about this; comments please).
Single instance
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
Solr Cloud
With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. We call all of the SolrCores that make up one logical index a collection.
A collection is essentially a single index that spans many SolrCores, both for index scaling and for redundancy. If you wanted to move your two-SolrCore Solr setup to SolrCloud, you would have two collections, each made up of multiple individual SolrCores.
From Solr Wiki:
Collections are made up of one or more shards. Shards have one or more replicas. Each replica is a core. A single collection represents a single logical index.
This explains the use of cores and collections.
Single instance
When dealing with a single Solr instance, you query cores.
The admin UI of a single Solr instance has no collection selector.
Solr Cloud
When dealing with SolrCloud, you query collections.
The collections are organized into different cores (replicas, shards) on different Solr instances.
The admin UI of a SolrCloud instance has both a collection and a core selector, but the cores there are technically instances.
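To make the endpoint difference concrete, here is a hedged SolrJ sketch (not from the original answers); the URLs, the core name "products", and the collection name "shop" are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CoreVsCollectionExample {
    public static void main(String[] args) throws Exception {
        // Standalone mode: the endpoint names a core.
        try (HttpSolrClient core = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build()) {
            System.out.println("Core hits: "
                    + core.query(new SolrQuery("*:*")).getResults().getNumFound());
        }

        // SolrCloud mode: the endpoint names a collection; the node that
        // receives the request routes it to the right shards/replicas.
        try (HttpSolrClient collection = new HttpSolrClient.Builder(
                "http://cloud-node:8983/solr/shop").build()) {
            System.out.println("Collection hits: "
                    + collection.query(new SolrQuery("*:*")).getResults().getNumFound());
        }
    }
}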
From the Solr docs:
Usage: solr create [-c name] [-d confdir] [-n configName] [-shards #] [-replicationFactor #] [-p port] [-V]
Create a core or collection depending on whether Solr is running in standalone (core) or SolrCloud mode (collection). In other words, this action detects which mode Solr is running in, and then takes the appropriate action (either create_core or create_collection).

Solr 3.6 Distribution dynamically adding shards

I am looking at using distribution and sharding with Solr 3.6 vs Solr 4+ (SolrCloud).
I can see that 3.6 can have multiple shards set up, ideally with each shard on a different box. At large scale, once the boxes start to run low on memory, I would like to add new shards to the index. From what I have seen, this cannot be done / isn't documented.
Does this require a full re index of the data?
Can a 3-shard index be re-indexed into a 4-shard setup?
Can queries still be invoked on the index during a re-index?
What are the space overheads required to re-index?
The schema.xml (field names and types) would not be changed, just a new shard location added.
Self answer: From what I have seen, it would be best to stop filling the 3 existing shards and fill only the new 4th shard with data, then update the shards parameter to include the new shard in searches.
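As a hedged illustration of that last step (not part of the original answer, and written against a current SolrJ client purely to show the shards parameter; the shard addresses and core name are placeholders), a query would list all four shards explicitly:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class FourShardQueryExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://shard1:8983/solr/core1").build()) {

            SolrQuery query = new SolrQuery("*:*");
            // After adding the 4th shard, list all four shards so searches
            // cover the old shards plus the newly filled one.
            query.set("shards",
                    "shard1:8983/solr/core1,"
                  + "shard2:8983/solr/core1,"
                  + "shard3:8983/solr/core1,"
                  + "shard4:8983/solr/core1");

            System.out.println("Total across 4 shards: "
                    + client.query(query).getResults().getNumFound());
        }
    }
}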

Running solr index on hadoop

I have a huge amount of data that needs to be indexed, and it took more than 10 hours to get the job done. Is there a way I can do this on Hadoop? Has anyone done this before? Thanks a lot!
You haven't explained where the 10 hours go. Is the time spent extracting the data, or just indexing it?
If the extraction is what takes long, then you may use Hadoop. Solr supports bulk inserts, so in your map function you could accumulate thousands of records and commit them to Solr for indexing in one shot. That will improve your performance a lot.
Also, what size is your data?
You could collect a large number of records in the reduce function of a MapReduce job. You have to generate proper keys in your map so that a large number of records go to a single reduce call. In your custom reduce class, initialize the Solr client object in the setup/configure method (depending on your Hadoop version) and close it in the cleanup method. You will have to build a collection of document objects (in SolrNet or SolrJ) and commit all of them in one shot, as sketched below.
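A hedged sketch of that reducer pattern with SolrJ (not part of the original answer); the Solr URL, field names, and the Text key/value types are assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Buffers records as Solr documents and pushes them to Solr in bulk,
// instead of doing one add/commit per record.
public class SolrIndexReducer extends Reducer<Text, Text, Text, Text> {

    private HttpSolrClient solr;
    private final List<SolrInputDocument> buffer = new ArrayList<>();

    @Override
    protected void setup(Context context) {
        // Placeholder Solr endpoint; open the client once per reducer.
        solr = new HttpSolrClient.Builder("http://solr-host:8983/solr/mycore").build();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", key.toString() + "-" + buffer.size()); // assumed unique-key field
            doc.addField("body_txt", value.toString());               // assumed text field
            buffer.add(doc);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        try {
            // One bulk add and a single commit for the whole reducer's output.
            if (!buffer.isEmpty()) {
                solr.add(buffer);
                solr.commit();
            }
            solr.close();
        } catch (SolrServerException e) {
            throw new IOException(e);
        }
    }
}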
If you are using Hadoop, there is another option called Katta. You can look it over as well.
You can write a MapReduce job over your Hadoop cluster which simply takes each record and sends it to Solr over HTTP for indexing. AFAIK Solr currently doesn't handle indexing over a cluster of machines, so it would be worth looking into Elasticsearch if you also want to distribute your index over multiple nodes.
There is a Solr Hadoop output format which creates a new index in each reducer, so you distribute your keys according to the indexes you want and then copy the HDFS files into your Solr instance after the fact.
http://www.datasalt.com/2011/10/front-end-view-generation-with-hadoop/

Resources