I am new to Solr, and am trying to figure out the best way to index and search our catalogs.
We have to index multiple manufactures and each manufacturer has a different catalog per country. Each catalog for each manufacture per country is about 8GB of data.
I was thinking it might be easier to have an index per manufacture per country and have some way to tell Solr in the URL which index to search from.
Is that the best way of doing this? If so, how would I do it? Where should I start looking? If not, what would be the best way?
I am using Solr 3.5
In general there are two ways of solving this:
Split each catalog into its own core, running a large multi core setup. This will keep each index physically separated from each other, and will allow you to use different properties (language, etc) and configuration for each core. This might be practical, but will require quite a bit of overhead if you plan on searching through all the core at the same time. It'll be easy to split the different cores into running on different servers later - simply spin the cores up on a different server.
Run everything in a single core - if all the attributes and properties of the different catalogs are the same, add two fields - one containing the manufacturer and one containing the country. Filter on these values when you need to limit the hits to a particular country or manufacturer. It'll allow you to easily search the complete index, and scalability can be implemented by replication or something like SolrCloud (coming in 4.0). If you need multilanguage support you'll have to have a field for each language with the settings you need for that language (such as stemming).
There are a few tidbits of information about this on the Solr wiki, but my suggestion is to simply try one of the methods and see if that solves your issue. Moving to the other solution shouldn't be too much work. The simplest implementation is to keep everything in the same index.
Related
I am building a user-facing search engine for movies, music and art where users perform free-text queries (like Google) and get the desired results. Right now, I have movies, music and art data indexed separately on different cores and they do not share a similar schema. For ease of maintenance, I would prefer having them in separate cores as it is now.
Till date, I have been performing my queries individually on each core, but I want to expand this capability to perform a single query that runs across multiple cores/indexes. Say I run a query by the name of the artist and the search engine returns me all the relevant movies, music and art work they have done. Things get tricky here.
Based on my research, I see that there are two options in this case.
Create a fourth core, add shards attribute that points to my other three cores. Redirect all my queries to this core to return required results.
Create a hybrid index merging all three schema's and perform queries on this index.
With the first option, the downside I see is that the keys need to be unique across my schema for this to work. I am going to have the key artistName across all my cores, this isn't going to help me.
I really prefer keeping my schema separately, so I do not want to delve into the second option as such. Is there a middle ground here? What would be considered best practice in this case?
Linking other SO questions that I referred here:
Best way to search multiple solr core
Solr Search Across Multiple Cores
Search multiple SOLR core's and return one result set
I am of the opinion that you should not be doing search across multiple core.
Solr or Nosql databases are not meant for it. These database are preferred when we want to achieve faster response which is not possible with the RDBMS as it involves the joins.
The joins in the RDBMS slower's the performance of your query as the data grows in size.
To achieve the faster response we try to convert the data into flat document and stores it in the NoSQL database like MongoDB, Solr etc..
You should covert your data into such a way that, it should be part of single document.
If above option is not possible then create individual cores and retrieve the specific data from specific core with multiple calls.
You can also check for creating parent child relation document in solr.
Use solr cloud option with solr streaming expression.
Every option has its pros and cons. It all depends on your requirement and what you can compromise.
I am new to Solr and using in my project where i have large number of products with a number of properties. So the indexing takes a whole lot of time. But if i don't index all the properties then the results will have to be populated via a separate db hit. But that kind of loses the significance of the Solr, doesn't it? Since we are hitting db anyways, doesn't that make the query slower? Kindly guide whats the right approach. Indexing all properties or getting the remaining properties from db?
An hybrid choice is not necessarily the evil. Basically that choice depends on what kind of search features and services you want to offer to your users. For instance
if you want to facet over a "category" field you need to put that field on Solr
if you want to have some data in real time (e.g. price) I would go with the database
In general you should experiment and try because all your thoughts make sense but, my suggestion, don't optimize things in advance. Write down your (search and view) requirements and on top of that try to get a good compromise between the two extremes (only solr / only database)
I have a number of documents quite evenly distributed among a number of languages (6 at the moment, perhaps 12 in the near future). There would be no need to guess the language of a document, as that information is available.
Furthermore, the use-cases for search are such that one search will always be in one language and search only for documents in that language.
Now, I want to apply proper language handling such as stemming to both index and queries. What would be the suggested way to go? From my yet limited Solr knowledge, I can imagine:
Just use one core per language. Keeps the indexes small, the queries match the language by core URL and the configuration is simple. However, it duplicates lots of the configuration.
Use one core and apply something like Solr: DIH for multilingual index & multiValued field?. The search for a specific language would than be via a field such as title_de:sehen
I sure one core per language is the best solution.
You can share all configuration except schema.xml between cores (using single conf folder) and specify schema.xml location per core (check http://wiki.apache.org/solr/CoreAdmin)
I went with a single core instead. The duplication of configuration was daunting. Now it’s all in a single core. A bit of Java magic, and it works perfectly.
I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.
At present, there's just a quick interim solution in place which is fine for flexibility, but speed was always going to be a problem. Modules register models (and relationships and columns) which they'd like to be searchable. Upon search, the search functionality queries data using those relationships and applies Levenshtein, removes stop words, does character replacements etc. Clearly this will slow down as the volume of data increases so it's not viable to keep as it is effectively select * from x,y,z and then mine through the data.
The benefit of the above is such that there is a direct relation to the model which found the data. For example, if Model_Product finds something, I know that in my code i can use Model_Product::url() to associate the result off to the relevant location or Model_Product::find(other data) to show say the image or description if the keyword had been found in the title for example.
Another benefit of the above is it's already database specific, and therefore can just be thrown up onto a virtualhost and it works.
I have read about the various options, and they all seem very similar so it's unlikely that people are going to be able to suggest the 'right' one without inciting discussion or debate, but for the record; from the following options, Solr seems to be the one I'm leaning toward. I'm not set in stone so if anyone has any advice they'd like to share or other options I could look at, that'd be great.
Sphinx
Lucene
Solr - appears to just run Lucene as a service?
Xapian
ElasticSearch
Looking through various tutorials and guides they all seem relatively easy to set up and configure. In the case above I can have modules register the path of config files/search index models and have the searcher run them all through search program x. This will build my indexes, and provide the means by which to query data. Fine.
What I don't understand is how any of these indexes related to my other code. If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost? This is something which I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server, with credentials x to database y.
Granted I haven't done as extensive reading on this as I would have typically because I'm a bit stuck in terms of direction at the moment and I'd rather not read everything about everything in favour of seeking some advice from those who know before I take a particular route.
Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.
DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.
If I index data, search and in turn find a result with say Solr, how
do I know how to get all of the other information related to the bit
it found?
You can store any extra data you want, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you to find relative information, e.g. if you run a DVD rent shop and user has misspelled a movie name, Lucene will figure this out for you and (unlike with DB) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc. And obviously you can provide your own if you need to.
Also is someone able to confirm whether or not I will need to have an
instance of any of the above per virtualhost?
Lucene is a low level library and will have to be present in every JVM you run. Solr (built on top of Lucene) is an HTTP server. You can call it from as many clients as you want. More scaling options explained here.
I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...
When you read about the one you can be often sure that each of the others tools is going to be mentioned.
I don't expect you to explain every tool to me - sure not. If you could help me to narrow this set for my particular scenario it would be great. So far I'm not sure which of the above will fit and it looks like (as always) there are more then one way of doing what's to be done.
The scenario is: 500GB - ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents stored in SQL db (sender, recipients, date, department etc.) Main source of documents will be ExchangeServer (emails and attachments), but not only. Now to the search: User needs to be able to do complex full-text searches over those documents. Basicaly he'll be presented with some search-config panel (java desktop application, not webapp) - he'll set date range, document types, senders/recipients, keywords etc. - fire the search and get the resulting list of the documents (and for each document info why its included in search results i.e. which keywords are found in document).
Which tools I should take into consideration and which not? The point is to develop such solution with only minimal required "glue"-code. I'm proficient in SQLdbs but quite uncomfortable with Apache-and-related technologies.
Basic workflow looks like this: ExchangeServer/other source -> conversion from doc/pdf/... -> deduplication -> Hadopp + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results
Thank you!
Going with solr is a good option. I have used it for similar scenario you described above. You can use solr for real huge data as its a distributed index server.
But to get the meta data about all of these documents formats you should be using some other tool. Basically your workflow will be this.
1) Use hadoop cluster to store data.
2) Extract data in hadoop cluster using map/redcue
3) Do document identification( identify document type)
4) Extract meta data from these document.
5) Index metadata in solr server, store other ingestion information in database
6) Solr server is distributed index server, so for each ingestion you could create a new shard or index.
7) When search is required search on all the indexs.
8) Solr supports all the complex searches , so you don't have to make your own search engine.
9) It also does paging for you as well.
We've done exactly this for some of our clients by using Solr as a "secondary indexer" to HBase. Updates to HBase are sent to Solr, and you can query against it. Typically folks start with HBase, and then graft search on. Sounds like you know from the get go that search is what you want, so you can probably embed the secondary indexing in from your pipeline that feeds HBase.
You may find though that just using Solr does everything you need.
Another project to look at is Lily, http://www.lilyproject.org/lily/index.html, which has already done the work of integrating Solr with a distributed database.
Also, I do not see why you would not want to use a browser for this application. You are describing exactly what faceted search is. While you certainly could set up a desktop app that communicates with the server (parses JSON) and displays the results in a thick client GUI, all of this work is already done for you in the browser. And, Solr comes with a free faceted search system out of the box: just follow along the tutorial.
Going with Solr (http://lucene.apache.org/solr) is a good solution, but be ready to have to deal with some non-obvious things. First is planning your indexes properly. Multiple terabytes of data will almost definitely need multiple shards on Solr for any level of reasonable performance and you'll be in charge of managing those yourself. It does provide distributed search (doing the queries off multiple shards), but that is only half the battle.
ElasticSearch (http://www.elasticsearch.org/) is another popular alternative, but i don't have much experience with it regarding scale. It uses the same Lucene engine so i'd expect the search feature-set to be similar.
Another type of solution is something like SenseiDB - open sourced from LinkedIn - which gives the full-text search functionality (also Lucene-based) as well as proven scale for large amounts of data:
http://senseidb.com
They've definitely done a lot of work on search over there and my casual use of it is pretty promising.
Assuming all your data is already in Hadoop, you could write some custom MR jobs that pull the data in a consistent schema-friendly format into SenseiDB. SenseiDB already provides a Hadoop MR indexer which you can look at.
The only caveat is it is a little more complex to setup, but will save you with the scaling issues many times over - especially around indexing performance and faceting functionality. It also provides clustering support if HA is important to you - which is still in Alpha for Solr (Solr 4.x is alpha atm).
Hope that helps and good luck!
Update:
I asked a friend who is more versed in ElasticSearch than me and it does have the advantage of clustering and rebalancing based on the # of machines and shards you have. This is a definite win over Solr - especially if you're dealing with TBs of data. The only downside is the current state of documentation on ElasticSearch leaves a lot to be desired.
As a side note, you can't say the documents are stored in Hadoop, they are stored in a distributed file system (most probably HDFS since you mentioned Hadoop).
Regarding searching/indexing: Lucene is the tool to use for your scenario. You can use it for both indexing and searching. It's a java library. There is also an associated project (called Solr) which allows you to access the indexing/searching system through WebServices. So you should also take a look at Solr as it allows the handling of different types of documents (Lucene puts the responsability of interpreting the document (PDF, Word, etc) on your shoulders but you, probably, can already do that)