Under what conditions can I use a Solr core - database

I am using Solr version 3.0.1, and I am about to move to Solr 4.6.0.
Usually I just use Solr without defining a core (I think Solr 3.0.1 doesn't have cores yet).
Now that I want to upgrade to version 4.6.0, there is something new to deal with.
So I have 3 questions:
What exactly is a Solr core?
When should I use a Solr core?
Is it right that each Solr core is like a table in a (relational) database? That is, can I save different types of data in different cores?
Thanks in advance.

A core is basically an index with a given schema, and it holds a set of documents.
You should use different cores for different collections of documents, but that doesn't mean you have to store different kinds of documents in different indexes.
Some examples:
you could have the same documents in different languages stored in different cores and select the core based on the configured language;
you could have different types of documents stored in different cores to keep them physically separated;
but at the same time you could have different documents stored in the same index and differentiate them by a field value;
it really depends on your use case.
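The two layouts above can be sketched as request URLs. This is only an illustration: the host, the core names (`articles_en`, `articles`) and the `type` field are assumptions made up for the example, not anything from the question. `fq` is Solr's standard filter-query parameter.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, doc_type=None):
    """Build a Solr /select URL, optionally restricted to one document type."""
    params = [("q", query), ("wt", "json")]
    if doc_type is not None:
        # "type" is a hypothetical schema field used to tell document kinds apart.
        params.append(("fq", f"type:{doc_type}"))
    return f"{base_url}/{core}/select?{urlencode(params)}"


BASE = "http://localhost:8983/solr"  # assumed local Solr instance

# Layout 1: one core per language, so the core name carries the distinction.
url_per_core = build_select_url(BASE, "articles_en", "title:solr")

# Layout 2: one shared core, so a field value carries the distinction.
url_shared = build_select_url(BASE, "articles", "title:solr", doc_type="book")
```

Either layout answers the same query; the difference is whether the separation lives in the URL path or in a filter on the documents.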

You have to think up front about what type of queries you are going to execute against your Solr index. You then lay out the schema of a core, or of several cores, accordingly.
If, for example, you execute some JOIN queries on your relational DB, those won't be very efficient (if possible at all) with lots of documents in the Solr index, because this is the NoSQL world (read here as: non-relational). In such a case you might need to duplicate your data from several DB tables into one core's schema.
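A minimal sketch of that kind of duplication, assuming two hypothetical tables (`customers` and `orders`) already loaded as lists of dicts. All field names are invented for illustration; the point is only that the join happens once, at indexing time, instead of at query time.

```python
def flatten_orders(customers, orders):
    """Join orders with their customer and emit one flat document per order."""
    customers_by_id = {customer["id"]: customer for customer in customers}
    docs = []
    for order in orders:
        customer = customers_by_id[order["customer_id"]]
        docs.append({
            # Prefix the id so it stays unique if other kinds share the core.
            "id": f"order-{order['id']}",
            "item": order["item"],
            # Customer columns are copied into every order document.
            "customer_name": customer["name"],
            "customer_city": customer["city"],
        })
    return docs


customers = [{"id": 1, "name": "Ada", "city": "London"}]
orders = [
    {"id": 10, "customer_id": 1, "item": "keyboard"},
    {"id": 11, "customer_id": 1, "item": "monitor"},
]
flat_docs = flatten_orders(customers, orders)
```

The trade-off is that a change to a customer row means reindexing every document that carries a copy of it.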
As Francisco has already mentioned, a core is physically represented as an independent entity with its own schema, config and index data.
One caution with a multi-core setup: all the cores configured under the same container instance share the same JVM. This means you should be careful with the amount of data you store in those cores. Lucene, the indexing engine inside Solr, has really neat and fast (de)compression algorithms (in versions 4.x), so disk space lasts longer, but the JVM heap is something to keep an eye on.
The goodies of cores coupled with the Solr admin UI are things like:
core reload after schema / solrconfig changes
core hot swap (if you have a live core serving queries, you can hot swap it with a new core that has the same data and some modifications)
core index optimization
core renaming
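The same operations are also reachable over HTTP through Solr's CoreAdmin API, which can be handy for scripting. The sketch below only builds the request URLs; the host and the core names (`products`, `products_rebuild`, `products_old`) are assumptions for illustration.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"  # assumed local Solr instance


def core_admin_url(base_url, action, **params):
    """Build a CoreAdmin request URL for the given action and parameters."""
    query = urlencode([("action", action)] + sorted(params.items()))
    return f"{base_url}/admin/cores?{query}"


# Reload a core after editing its schema or solrconfig.
reload_url = core_admin_url(BASE, "RELOAD", core="products")

# Hot swap the live core with a freshly built one.
swap_url = core_admin_url(BASE, "SWAP", core="products", other="products_rebuild")

# Rename a core ("other" is the new name).
rename_url = core_admin_url(BASE, "RENAME", core="products_rebuild", other="products_old")

# Optimize is not a CoreAdmin action; it goes to the core's update handler.
optimize_url = f"{BASE}/products/update?optimize=true"
```

Issuing a GET against any of these URLs (for example with `urllib.request.urlopen`) triggers the operation, so they are left unexecuted here.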

Related

Solr - Efficient way to search across multiple cores?

I am building a user-facing search engine for movies, music and art where users perform free-text queries (like Google) and get the desired results. Right now, I have movies, music and art data indexed separately on different cores and they do not share a similar schema. For ease of maintenance, I would prefer having them in separate cores as it is now.
To date, I have been performing my queries individually on each core, but I want to expand this capability to perform a single query that runs across multiple cores/indexes. Say I run a query by the name of an artist, and the search engine returns all the relevant movies, music and artwork they have done. Things get tricky here.
Based on my research, I see that there are two options in this case.
Create a fourth core and add a shards attribute that points to my other three cores. Redirect all my queries to this core to return the required results.
Create a hybrid index merging all three schemas and perform queries on this index.
With the first option, the downside I see is that the keys need to be unique across my schemas for this to work. I am going to have the key artistName across all my cores, so this isn't going to help me.
I really prefer keeping my schema separately, so I do not want to delve into the second option as such. Is there a middle ground here? What would be considered best practice in this case?
Other SO questions that I referred to:
Best way to search multiple solr core
Solr Search Across Multiple Cores
Search multiple SOLR core's and return one result set
I am of the opinion that you should not be searching across multiple cores.
Solr and other NoSQL stores are not meant for it. These stores are preferred when we want faster responses than an RDBMS can deliver, because the RDBMS has to perform joins.
Joins in an RDBMS slow down your queries as the data grows in size.
To get faster responses, we convert the data into flat documents and store them in a NoSQL store such as MongoDB or Solr.
You should convert your data so that everything a query needs is part of a single document.
If the above option is not possible, create individual cores and retrieve the specific data from each core with multiple calls.
You can also look into creating parent-child (nested) documents in Solr.
Another option is SolrCloud with Solr streaming expressions.
Every option has its pros and cons. It all depends on your requirements and what you can compromise on.
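The "multiple calls" option can be sketched as below: one request per core, with the results kept grouped by core rather than merged into one ranked list, since the cores do not share a schema. The base URL, the core names and the `artistName` field are assumptions taken from the question's wording; `fetch` is injectable so the logic can be exercised without a running Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_docs(base_url, core, query, rows=5):
    """Run one /select request against a single core and return its docs."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    with urlopen(f"{base_url}/{core}/select?{params}") as response:
        return json.load(response)["response"]["docs"]


def search_all_cores(cores, query, fetch=fetch_docs,
                     base_url="http://localhost:8983/solr"):
    """Query each core separately and return the docs grouped by core name."""
    grouped = {}
    for core in cores:
        grouped[core] = fetch(base_url, core, query)
    return grouped


# Example call (needs a running Solr with these hypothetical cores):
# results = search_all_cores(["movies", "music", "art"], 'artistName:"David Bowie"')
```

Grouping by core sidesteps the unique-key requirement of the shards approach, at the cost of one round trip per core; the calls could be issued in parallel if latency matters.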

Using SolrCloud with RDBMS or using Solr as main data storage

I was wondering which scenario (or combination) would be better for my application in terms of performance, scalability and high availability.
Here is my application:
Suppose I am going to have more than 10m documents, and the number grows every day (in one year it will probably reach more than 100m docs). I want to use Solr as the tool for indexing these documents, but the problem is that I have some data fields that could change frequently (not too often, but they could change).
Scenarios:
1- Using SolrCloud as the database for all data (even the data that could change).
2- Using SolrCloud as the database for static data and an RDBMS (such as Oracle) for storing the dynamic fields.
3- Using the integration of SolrCloud and Hadoop (HDFS + MapReduce) for all data.
Best regards.
I'm not sure how SolrCloud works with DIH (you might face a situation where indexing happens on only one instance).
On the other hand, I would store the data in an RDBMS, because from time to time you will need to reindex Solr to add some new functionality to the index.
At the end of the day, I would use DB + Solr (all the fields) with either Hadoop (I have not used it yet) or some other piece of software to post data into SolrCloud.

Prefer multiple Solr applications or single application multicore setup?

What are the pros and cons of having multiple Solr applications for completely different searches compared to having a single Solr application with the different searches set up as separate cores?
What is Solr's preferred method? Is a single Solr application with a multicore setup (for the various search indexes) always the right way?
There is no preferred method. It depends on what you are trying to solve. Solr, by nature, can handle multiple cores on a single Solr instance, cores spread across Solr application servers, or collections (in SolrCloud).
Having said that, usually you go for
1) A single core on one Solr instance if your data is fairly small: a few million documents.
2) Multiple Solr instances with a single core on each if you want to shard your data, in case of billions of documents, and want better indexing and query performance.
3) Multiple cores on single or multiple Solr instances if you need multitenant separation, for example a core for each customer, or one core for the catalog and another core for SKUs.
It depends on your use case, the volume of data and query response times etc.

building in support for future Solr sharding

We are building an application. Right now we have one Solr server, but we would like to design the app so that it can support multiple Solr shards in the future if we outgrow our indexing capacity.
What are the key things to keep in mind when developing an application that can support multiple shards in the future?
We store the Solr URL /solr/ in a DB, and it is used to execute queries against Solr. There is one URL for updates and one URL for searches in the DB.
If we add shards to the Solr environment at a future date, will the process for using the shards be as simple as updating the URLs in the DB? Or are there other things that need to be updated? We are using SolrJ.
e.g. change the SolrSearchBaseURL in DB to:
https://solr2/solr/select?shards=solr1/solr,solr2/solr&indent=true&q={search_query}
And updating the SolrUpdateBaseURL in DB to
https://solr2/solr/
?
Basically, what you are describing has already been implemented in SolrCloud. There, ZooKeeper maintains the state of your search cluster (which shards are in which collections, shard replicas, leader and slave nodes, and more). SolrCloud can spread the load on both the indexing and querying sides by using hashing.
You could, in principle, get by (at least at the beginning of your cluster's growth) with the system you have developed. But think about replication, adding load balancers, and external cache servers (e.g. Varnish): in the long run you would end up implementing something like SolrCloud yourself.
Having said that, there are some caveats to hash-based indexing and hence searching. If you want to implement logical partitioning of your data (say, by date), at this point there is no way to do this other than writing custom code. There is some work planned around this, though.
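To make the contrast concrete, here is a simplified sketch of the two routing styles. It is not SolrCloud's actual routing implementation: the MD5-based hash, the shard count and the year-to-shard mapping are all assumptions chosen for illustration.

```python
import hashlib


def shard_by_hash(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id (even spread)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_by_year(date_str, shard_for_year):
    """Pick a shard from the year of an ISO date string (logical partition)."""
    year = int(date_str[:4])
    return shard_for_year[year]


# Hash routing: the same id always lands on the same shard, but documents
# from the same date range end up scattered across all shards.
hash_shard = shard_by_hash("doc-42", 4)

# Date routing: custom code decides the mapping, so a query limited to one
# year only has to visit one shard.
year_shard = shard_by_year("2013-11-05", {2012: "shard_2012", 2013: "shard_2013"})
```

With hash routing a date-restricted query still has to fan out to every shard, which is the caveat described above; the date-based mapping is the kind of custom code the answer refers to.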

what is SOLR multicore exactly

In SOLR, what is multicore?
Is it a way to create multiple tables (inside a single solr app) with their own set of schema files, or is it about creating different databases (inside a single solr app)?
If we want to create multiple tables (with their respective schema.xml files) for a Solr web app, what is the best way to do this, or how can we achieve this in Solr?
Solr multicore is basically a setup that allows Solr to host multiple cores.
These cores can host completely different sets of unrelated entities.
You can have a separate core for each table as well.
For example, if you have collections for Documents, People and Stocks, which are completely unrelated entities, you would want to host them in different cores.
A multicore setup allows you to
host unrelated entities separately so that they don't impact each other
have a different configuration for each core, with different behavior
perform activities on each core independently (update data, load, reload, replication)
keep the size of each core in check and configure caching accordingly

Resources