Solr or Lucene as a single application - solr

Hello, I already have a working application for searching in a database. The database holds about 50M indexed documents. Is there a way to run everything together? I mean, I don't want Solr over HTTP. What should I do? Is it better to use Lucene or EmbeddedSolrServer? Or maybe you have another solution?
I already have something like the 1st diagram and I want to make this a single process.
And if I go with Lucene, can I use my indexes from Solr?
solr-5.2.1
Tomcat v8.0

It is not recommended to deploy both the application and Solr in one Tomcat.
If Solr crashes, there is a chance of downtime for the application, so it is always better to run Solr independently. Embedding Solr is also not recommended.
The simplest, safest, way to use Solr is via Solr's standard HTTP interfaces. Embedding Solr is less flexible, harder to support, not as well tested, and should be reserved for special circumstances.
For reference: http://wiki.apache.org/solr/EmbeddedSolr
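To make "via Solr's standard HTTP interfaces" concrete, here is a minimal sketch of how a client could build a request to the standard `/select` handler. The host, port, core name `documents`, and field `title` are assumptions for illustration, not taken from the question; the sketch only builds the URL and does not send it.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's standard /select request handler."""
    # q = the query, rows = page size, wt = response format
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core and field names.
url = solr_select_url("http://localhost:8983", "documents", "title:lucene")
print(url)
```

Any HTTP client can then fetch that URL and parse the JSON response, which is what keeps the application and Solr in separate processes.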

It depends. If you want parts of the Solr feature set (Solr adds quite a few features on top of Lucene), going with plain Lucene means reimplementing features that you would otherwise get for free.
You can use EmbeddedSolr to have Solr internal to your application, and then use the EmbeddedSolrServer client in SolrJ to talk to it - the rest of your application would still use Solr as if it were a remote instance.
The problem with EmbeddedSolr is that you'll run into scalability issues as the index size grows, since you'll have a harder time scaling onto multiple servers and to separate concerns.

Related

Advice for implementing a user-search form with React UI and Spring Boot server

I have a Spring Boot/React application. I have a list of users in my database I will have populated already from LDAP.
As part of a form, I need to allow users to specify a list of users. Since they could be searching from (and technically specifying as well), up to 400,000 users (most will be in the 10k or less range), I'm assuming I'd need to do this both client and server-side.
Does anyone have any recommendations on the approach or technologies?
I'm not using a small amount of data, but I don't want to over-engineer it either (tips are mostly for server-side, but any are welcome).
If you are using Hibernate as the ORM in your application, you may also check out Hibernate Search. This seems to serve your purpose, as I feel that searching through a list of users can be done using a normal text-based index. Hibernate Search leverages Lucene, which is suitable for text-based indexing and searching.
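To show what a "normal text-based index" means here, this is a toy inverted index over user names in plain Python. It is only a sketch of the idea and not how Hibernate Search or Lucene are implemented; the sample users and function names are made up for illustration.

```python
from collections import defaultdict


def build_index(users):
    """Map each lowercase name token to the set of user ids containing it."""
    index = defaultdict(set)
    for user_id, name in users.items():
        for token in name.lower().split():
            index[token].add(user_id)
    return index


def search(index, query):
    """Return the ids of users whose names contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample data.
users = {1: "Ada Lovelace", 2: "Alan Turing", 3: "Ada Yonath"}
index = build_index(users)
matches = search(index, "ada")
print(sorted(matches))
```

A real library adds analyzers, ranking and on-disk storage on top of this, which is why reaching for Hibernate Search is usually less work than rolling your own.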
While the other answer is good and works perfectly fine when you have a small set of data, be aware of a few design issues with it.
Lucene is not distributed, and you can't easily scale it horizontally across multiple machines without duplicating the whole index. That is perfectly fine when you have a small set of data, and in fact it's pretty fast, as there is no network call (with Elasticsearch there is one).
If you want to build a stateless application that is easy to scale horizontally, then going with Lucene will not help, because it is stateful: each newly spawned app server has to finish building its local Lucene index before it can serve searches.
Elasticsearch (ES) is REST-based, is written in Java, and has a very good Java client which you can easily use for simple to complex use cases.
Last but not least, please go through the Stack Overflow answer by none other than Shay Banon, creator of Elasticsearch, who explains why he created ES in the first place :) It will give you more trade-offs and insights for choosing the best solution for your use case.

Searching over documents stored in Hadoop - which tool to use?

I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...
When you read about one of them, you can often be sure that each of the other tools is going to be mentioned.
I don't expect you to explain every tool to me - surely not. If you could help me narrow this set down for my particular scenario, it would be great. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what needs to be done.
The scenario is: 500 GB - ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents is stored in a SQL db (sender, recipients, date, department etc.). The main source of documents will be ExchangeServer (emails and attachments), but not the only one. Now to the search: the user needs to be able to do complex full-text searches over those documents. Basically, he'll be presented with a search-config panel (Java desktop application, not a webapp) - he'll set date range, document types, senders/recipients, keywords etc. - fire the search and get the resulting list of documents (and, for each document, info on why it's included in the search results, i.e. which keywords were found in the document).
Which tools should I take into consideration and which not? The point is to develop such a solution with only the minimal required "glue" code. I'm proficient in SQL dbs but quite uncomfortable with Apache-and-related technologies.
Basic workflow looks like this: ExchangeServer/other source -> conversion from doc/pdf/... -> deduplication -> Hadoop + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results
Thank you!
Going with Solr is a good option. I have used it for a scenario similar to the one you described. You can use Solr for really huge data sets, as it is a distributed index server.
But to get the metadata about all of these document formats you should use some other tool. Basically, your workflow will be this:
1) Use a Hadoop cluster to store the data.
2) Extract the data in the Hadoop cluster using map/reduce.
3) Do document identification (identify the document type).
4) Extract metadata from these documents.
5) Index the metadata in the Solr server, and store other ingestion information in the database.
6) The Solr server is a distributed index server, so for each ingestion you could create a new shard or index.
7) When a search is required, search on all the indexes.
8) Solr supports all the complex searches, so you don't have to make your own search engine.
9) It also does paging for you.
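Steps 6 and 7 above (one shard per ingestion, then search across all of them) can be sketched with Solr's `shards` request parameter, which lists the cores a distributed query should fan out to. This is only an illustration under assumed names: the hosts `solr1`/`solr2`, the per-ingestion core names and the `body` field are hypothetical, and the code only builds the URL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def distributed_query_url(coordinator, shards, query, rows=10):
    """Build a /select URL that asks one Solr core to query several shards.

    The coordinator and each shard are written as host:port/solr/core
    (no scheme), which is the form the shards parameter takes.
    """
    params = {"q": query, "rows": rows, "wt": "json", "shards": ",".join(shards)}
    return f"http://{coordinator}/select?{urlencode(params)}"


# Hypothetical shard layout: one core per ingestion batch.
shards = ["solr1:8983/solr/ingest_2012_01", "solr2:8983/solr/ingest_2012_02"]
url = distributed_query_url(shards[0], shards, "body:budget")

# Decode the URL again to show what the coordinating core would receive.
sent_shards = parse_qs(urlsplit(url).query)["shards"][0]
print(sent_shards)
```

The `rows` parameter is also where the paging from step 9 comes in; combined with `start` it selects which page of the merged result to return.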
We've done exactly this for some of our clients by using Solr as a "secondary indexer" to HBase. Updates to HBase are sent to Solr, and you can query against it. Typically folks start with HBase, and then graft search on. It sounds like you know from the get-go that search is what you want, so you can probably embed the secondary indexing in the pipeline that feeds HBase.
You may find though that just using Solr does everything you need.
Another project to look at is Lily, http://www.lilyproject.org/lily/index.html, which has already done the work of integrating Solr with a distributed database.
Also, I do not see why you would not want to use a browser for this application. You are describing exactly what faceted search is. While you certainly could set up a desktop app that communicates with the server (parses JSON) and displays the results in a thick client GUI, all of this work is already done for you in the browser. And, Solr comes with a free faceted search system out of the box: just follow along the tutorial.
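For a rough idea of what such a faceted request looks like on the wire, here is a sketch that builds a Solr query with a date-range filter and facet counts, using Solr's `fq`, `facet` and `facet.field` parameters. The core name and the field names `date`, `sender` and `department` are assumptions borrowed from the metadata listed in the question, not a known schema.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def faceted_search_url(base_url, core, keywords, start, end, facet_fields):
    """Build a /select URL with a date-range filter and facet counts."""
    params = [
        ("q", keywords),
        ("fq", f"date:[{start} TO {end}]"),  # filter query: restrict by date
        ("facet", "true"),
    ]
    # facet.field may be repeated, once per field to count values for
    params += [("facet.field", field) for field in facet_fields]
    params.append(("wt", "json"))
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core and field names.
url = faceted_search_url(
    "http://localhost:8983",
    "documents",
    "contract renewal",
    "2012-01-01T00:00:00Z",
    "2012-12-31T23:59:59Z",
    ["sender", "department"],
)
query = parse_qs(urlsplit(url).query)
print(query["facet.field"])
```

A desktop client could issue the same request and render the facet counts itself; the browser route just saves you from writing that UI.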
Going with Solr (http://lucene.apache.org/solr) is a good solution, but be ready to have to deal with some non-obvious things. First is planning your indexes properly. Multiple terabytes of data will almost definitely need multiple shards on Solr for any level of reasonable performance and you'll be in charge of managing those yourself. It does provide distributed search (doing the queries off multiple shards), but that is only half the battle.
ElasticSearch (http://www.elasticsearch.org/) is another popular alternative, but I don't have much experience with it regarding scale. It uses the same Lucene engine, so I'd expect the search feature set to be similar.
Another type of solution is something like SenseiDB - open sourced from LinkedIn - which gives the full-text search functionality (also Lucene-based) as well as proven scale for large amounts of data:
http://senseidb.com
They've definitely done a lot of work on search over there and my casual use of it is pretty promising.
Assuming all your data is already in Hadoop, you could write some custom MR jobs that pull the data in a consistent schema-friendly format into SenseiDB. SenseiDB already provides a Hadoop MR indexer which you can look at.
The only caveat is that it is a little more complex to set up, but it will save you from scaling issues many times over - especially around indexing performance and faceting functionality. It also provides clustering support if HA is important to you - which is still in alpha for Solr (Solr 4.x is alpha atm).
Hope that helps and good luck!
Update:
I asked a friend who is more versed in ElasticSearch than me and it does have the advantage of clustering and rebalancing based on the # of machines and shards you have. This is a definite win over Solr - especially if you're dealing with TBs of data. The only downside is the current state of documentation on ElasticSearch leaves a lot to be desired.
As a side note, you can't say the documents are stored in Hadoop, they are stored in a distributed file system (most probably HDFS since you mentioned Hadoop).
Regarding searching/indexing: Lucene is the tool to use for your scenario. You can use it for both indexing and searching. It's a Java library. There is also an associated project (called Solr) which allows you to access the indexing/searching system through web services. So you should also take a look at Solr, as it allows the handling of different types of documents (Lucene puts the responsibility of interpreting the document (PDF, Word, etc.) on your shoulders, but you can probably already do that).

Solr as main search engine, Redis as autocomplete engine

I have an app with about 1+ million records.
I plan to use Solr to handle all searches.
I also have a feature for autocomplete.
I understand that Redis is very fast for autocomplete, but Solr also has its own autocomplete feature.
Question: Should I use Solr as the main search engine (for non-autocomplete tasks) and a separate Redis for autocomplete, or am I better off using just Solr for both tasks?
Notes:
Load-balancing is a concern too.
Using Rails by the way.
Thanks.
I think that you're just going to unnecessarily complicate things with Redis (I'm normally a big fan of Redis).
Solr has its own autocomplete, as you mentioned already.
I wouldn't say 1 million docs is a big index for a production environment. On the contrary, I'd say it's a rather small one.
So I wouldn't expect any problems with Solr's autocomplete.
Besides the one you suggested, here's a different approach for implementing it, written as a step-by-step tutorial.
You're right, Redis is great for large scale stuff, but since your whole index is going to grow, at some time you'll have to scale Solr anyway (not only for autocomplete).
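To give a feel for why autocomplete over about a million terms is not a heavy workload, here is a toy prefix lookup over a sorted list using binary search. It is a plain-Python illustration of the general idea only, not how Solr's or Redis's autocomplete is implemented; the term list and function name are made up.

```python
from bisect import bisect_left


def autocomplete(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix`.

    `sorted_terms` must be sorted; the binary search jumps to the first
    candidate, and all matches sit contiguously after it.
    """
    matches = []
    i = bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms) and len(matches) < limit:
        if not sorted_terms[i].startswith(prefix):
            break
        matches.append(sorted_terms[i])
        i += 1
    return matches


# Hypothetical vocabulary.
terms = sorted(["solr", "solar", "sole", "lucene", "redis", "search"])
suggestions = autocomplete(terms, "sol")
print(suggestions)
```

Each lookup costs a binary search plus at most `limit` comparisons, so the index size alone is unlikely to be what forces you onto a second system.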

Can I use Solr just to search an existing Lucene index?

I use Lucene locally to index documents. I know how to use Lucene pretty well. I never used Solr but I want to run a web search using a Lucene index so I'm now looking into it.
Can I install Solr on EC2, let's say, and then, instead of indexing documents using Solr, do the indexing locally using Lucene directly and just copy the Lucene index from my machine to EC2, where Solr will use it for search?
I'm assuming it's possible as long as I keep the index on disk but would like to be sure.
Thanks!
It's certainly possible; you would only need to make sure that you maintain exactly the same index structure (defined by the Solr schema). However, it would also mean that your configuration would be stored in two completely separate places -- e.g. each time you changed an analyzer in Lucene, you would need to synchronize this change in the Solr XML configuration. I'm not sure what benefit Solr would bring in such a use case.

Hadoop to create an Index and Add() it to distributed SOLR... is this possible? Should I use Nutch? ..Cloudera?

Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr?
I have a burst of information (logfiles and documents) that will be transported over the internet and stored in my datacenter (or Amazon). It needs to be parsed, indexed, and finally searchable by our replicated Solr installation.
Here is my proposed architecture:
Use a MapReduce framework (Cloudera, Hadoop, Nutch, even DryadLinq) to prepare those documents for indexing
Index those documents into a Lucene.NET / Lucene (java) compatible file format
Deploy that file to all my Solr instances
Activate that replicated index
If that above is possible, I need to choose a MapReduce framework. Since Cloudera is vendor supported and has a ton of patches not included in the Hadoop install, I think it may be worth looking at.
Once I choose the MapReduce framework, I need to tokenize the documents (PDF, DOCx, DOC, OLE, etc...), index them, copy the index to my Solr instances, and somehow "activate" them so they are searchable in the running instance. I believe this methodology is better than submitting documents via the REST interface to Solr.
The reason I bring .NET into the picture is because we are mostly a .NET shop. The only Unix / Java we will have is Solr and have a front end that leverages the REST interface via Solrnet.
Based on your experience, how does this architecture look? Do you see any issues/problems? What advice can you give?
What should I avoid doing so that I don't lose faceted search? After reading the Nutch documentation, I believe it said that it does not do faceting, but I may not have enough background in this software to understand what it's saying.
Generally, what you've described is almost exactly how Nutch works. Nutch is a crawling, indexing, index merging and query answering toolkit that's based on Hadoop core.
You shouldn't mix up Cloudera, Hadoop, Nutch and Lucene. You'll most likely end up using all of them:
Nutch is the name of the indexing / answering (like Solr) machinery.
Nutch itself runs on a Hadoop cluster (which heavily uses its own distributed file system, HDFS).
Nutch uses the Lucene index format.
Nutch includes a query answering frontend, which you can use, or you can attach a Solr frontend and use the Lucene indexes from there.
Finally, Cloudera Hadoop Distribution (or CDH) is just a Hadoop distribution with several dozens of patches applied to it, to make it more stable and backport some useful features from development branches. Yeah, you'd most likely want to use it, unless you have a reason not to (for example, if you want a bleeding edge Hadoop 0.22 trunk).
Generally, if you're just looking for a ready-made crawling / search engine solution, then Nutch is the way to go. Nutch already includes a lot of plugins to parse and index various crazy types of documents, including MS Word documents, PDFs, etc.
I personally don't see much point in using .NET technologies here, but if you feel comfortable with them, you can do the front ends in .NET. However, working with Unix technologies might feel fairly awkward for a Windows-centric team, so if I managed such a project, I'd consider alternatives, especially if your task of crawling & indexing is limited (i.e. you don't want to crawl the whole internet for some purpose).
Have you looked at Lucandra (https://github.com/tjake/Lucandra)? It is a Cassandra-based back end for Lucene/Solr, and you can use Hadoop to populate the Cassandra store with the index of your data.

Resources