I am new to development and I need your advice.
Our student team is going to develop an application for online restaurant booking, which will also include a search tool (searching restaurants and dishes).
We want to use a modern search tool like Lucene, but we are not sure it is what we really need.
From what I have read, it is aimed more at text search, with different kinds of indexes and so on, while our app will search in a database. BUT, if we want to add new features in the future, I guess we need a good search engine foundation today.
So, let me know whether Lucene is able to do "select" operations or something like them, or whether this technology is just for text searches.
Second question: what can you advise for implementing this feature? Where should we start?
Thank you in advance.
It all depends. You usually don't start with Lucene and Solr; you use them to attain a goal or implement a specific behavior you need. Usually Solr is your secondary storage, built from your primary database - i.e. you insert data into Solr to solve a specific need, for example proper full text search with relevancy scoring.
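To make that concrete, here is a minimal SolrJ sketch of pushing one row from your primary database into Solr. The core name "restaurants" and the field names are my own invention; your Solr schema would define the real ones:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RestaurantIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical core name and fields; your schema defines them.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/restaurants").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");                 // primary key from the RDBMS
            doc.addField("name", "Trattoria Roma");
            doc.addField("dishes", "pizza margherita, tiramisu");
            solr.add(doc);
            solr.commit(); // make the document searchable
        }
    }
}
```

The point is that Solr holds a denormalized, searchable copy, while the database remains the system of record.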
If you're just starting up, go with the technology you know - usually a regular RDBMS. You can then attach Solr if you need the features it's really good at, and hold off on introducing new technology until it's necessary. The need comes first, then the technology. Maybe Lucene/Solr won't be the right technology for what you end up needing when you get to that point.
One of the main tenets of modern development is "YAGNI" - You Aren't Gonna Need It. You implement features when you need them, not for some imagined behavior that may or may not show up down the road.
Related
I am looking into Neo4j as a stripped-down document store. A key aspect of document storage is search, and I know Neo4j includes full text search via legacy indices provided by Lucene.
I would be very interested in hearing the limitations of Neo4j search capabilities in a distributed environment. Does it provide a distributed index? In what ways is it inferior to Solr or ElasticSearch? How far can I take it before I must install Solr?
-- EDIT --
We are trying to integrate two distinct search efforts. The first is standard text content search. For instance, using the Enron emails, we want to search for every email that matches "bananas" or "going to the store" and get those document bodies in response. This is where people often turn to Solr.
The second case is more complicated: we have attached a great deal of metadata to each document. We may have decided that "these" emails were the result of late-night drunk-dialing. Now I want to search for all emails that may have been the result of late-night drunk-dialing. For this kind of metadata, we believe a graph database is in order.
In a perfect world, I could use one platform to perform both queries. I appreciate that neither Neo4j (nor OrientDB, Arango, etc.) is designed as a full text search database, but I'm trying to understand the limitations thereof.
In terms of volume, we are dealing at a very large scale with batch-style nightly updates. The data is content heavy, with some documents running into hundreds of pages of text, but mostly on the order of a page or two.
I once worked on a health social network where we needed search and connection-search functionality. We first went with Neo4j, and we were very impressed by the Cypher query language: we could express almost any request with it. However, once you throw billions of nodes in there, you start to pay the price, and we started considering another graph DB. This time we did a lot of research and tests, and OrientDB was clearly the winner. OrientDB is highly scalable, but the catch is that you have to code your "search algorithm" yourself if you want to do advanced things (e.g. "what do these two nodes have in common?"). Otherwise there is the SQL-like query language (I don't know/remember whether it has a name), and you can do some interesting stuff with it.
So, in conclusion, I would definitely go with OrientDB.
Neo4j can provide a "distributed index" in the sense that the high availability cluster can make your index available on more than one machine, but I'm pretty sure that's not what you're after. Related to this issue is a different answer I wrote about graph partitioning, and what it takes to distribute a really large number of nodes/relationships across multiple machines. (It's not terribly simple)
Solr and Lucene do two different things (although Solr is built on top of Lucene). I think Solr and Neo4j are not comparable, because they're trying to do completely different things. This site isn't about software recommendations, so I can't tell you what you should use, other than to say you should read up on Solr and Neo4j and figure out which set of functionality you want. As far as I know this is an exclusive decision, as I'm not aware of people integrating Solr with Neo4j.
Your question is very difficult to answer. I'd recommend expanding on what you are trying to do and what you have tried; you'll probably get better responses.
I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.
At present there's just a quick interim solution in place, which is fine for flexibility, but speed was always going to be a problem. Modules register models (plus relationships and columns) which they'd like to be searchable. Upon search, the search functionality queries data using those relationships and applies Levenshtein distance, removes stop words, does character replacements, etc. Clearly this will slow down as the volume of data increases: it is effectively a select * from x, y, z followed by mining through the data, so it's not viable to keep.
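To illustrate, the matching boils down to something like this per-row check (sketched in Java purely for illustration; the real code lives in the app):

```java
public class NaiveSearch {
    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // The bottleneck: every registered row gets compared against the query.
    static boolean matches(String query, String fieldValue) {
        return levenshtein(query.toLowerCase(), fieldValue.toLowerCase()) <= 2;
    }
}
```

Because every row is scored on every search, cost grows linearly with the data, which is exactly why it won't hold up.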
A benefit of this interim approach is that there is a direct relation to the model which found the data. For example, if Model_Product finds something, I know that in my code I can use Model_Product::url() to route the result to the relevant location, or Model_Product::find(other data) to show, say, the image or description if the keyword was found in the title.
Another benefit of the above is that it's already database-specific, and can therefore just be thrown up onto a virtualhost and it works.
I have read about the various options, and they all seem very similar, so it's unlikely that people will be able to suggest the 'right' one without inciting discussion or debate, but for the record: of the following options, Solr seems to be the one I'm leaning toward. I'm not set in stone, so if anyone has advice they'd like to share, or other options I could look at, that'd be great.
Sphinx
Lucene
Solr - appears to just run Lucene as a service?
Xapian
ElasticSearch
Looking through various tutorials and guides, they all seem relatively easy to set up and configure. In the case above, I can have modules register the paths of config files/search index models and have the searcher run them all through search program x. This will build my indexes and provide the means by which to query the data. Fine.
What I don't understand is how any of these indexes relates to my other code. If I index data, search, and in turn find a result with, say, Solr, how do I know how to get all of the other information related to the bit it found?
Also, is someone able to confirm whether or not I will need an instance of any of the above per virtualhost? This is something I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server with credentials x to database y.
Granted, I haven't done as extensive reading on this as I typically would, because I'm a bit stuck in terms of direction at the moment, and I'd rather seek some advice from those who know than read everything about everything before I take a particular route.
Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.
DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.
"If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?"
You can store any extra data you want, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you find related information: e.g. if you run a DVD rental shop and a user has misspelled a movie name, Lucene will figure this out for you and (unlike a DB) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc. And obviously you can provide your own if you need to.
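For example, here is a minimal Lucene sketch (in-memory index; the db_id field is just an arbitrary stored key I made up) showing both ideas: carrying a database key alongside the indexed text, and fuzzy-matching a misspelled title:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class DvdSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("title", "The Godfather", Field.Store.YES));
            doc.add(new StoredField("db_id", 42L)); // key back into your products table
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // User typo: "godfathar" still finds the right movie.
            FuzzyQuery q = new FuzzyQuery(new Term("title", "godfathar"));
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                Document found = searcher.doc(hit.doc);
                System.out.println(found.get("title") + " -> db_id " + found.get("db_id"));
            }
        }
    }
}
```

Once you have db_id back, you can hand it to your own model layer (e.g. Model_Product::find()) to pull the rest of the record.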
"Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost?"
Lucene is a low-level library and will have to be present in every JVM you run. Solr (built on top of Lucene) is an HTTP server; you can call it from as many clients as you want. More scaling options are explained here.
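In other words, you talk to Solr much like you talk to a shared DBMS. A minimal SolrJ sketch, assuming a single Solr instance on a host I'm calling search-host and a hypothetical products core:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class AnyClientCanQuery {
    public static void main(String[] args) throws Exception {
        // Any number of app servers / virtualhosts can point at the same
        // Solr URL, just like sharing one DBMS server.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://search-host:8983/solr/products").build()) {
            QueryResponse rsp = solr.query(new SolrQuery("title:widget~"));
            for (SolrDocument d : rsp.getResults()) {
                System.out.println(d.getFieldValue("db_id")); // key back into your models
            }
        }
    }
}
```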
I want to design a Semantic Search engine for my final year Master's degree. I have been doing a fair amount of reading both casually on the web and academic papers so I am not a total noob in this field.
My aim is to build a semantic search engine which parses HTML content into its equivalent RDF triples, stores the triples in a triplestore, and then tries to respond to queries using SPARQL. I want to do something out of the box, unlike the other students, so I decided to build a semantic search engine.
Right now I have a running search engine using Solr which performs keyword search; what I want to do is semantic search. I know some open-source tools for Web 3.0, but I am not sure whether they will be compatible with Solr or not.
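To make the plan concrete, here is a minimal sketch of the triples-plus-SPARQL part using Apache Jena (an in-memory model standing in for a real triplestore such as TDB/Fuseki; all URIs are made up):

```java
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

public class SemanticSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/";

        // A triple extracted from a page: <pageX> <mentions> <Lucene>.
        Resource page = model.createResource(ns + "pageX");
        page.addProperty(model.createProperty(ns, "mentions"),
                         model.createResource(ns + "Lucene"));

        // Answer the query with SPARQL instead of keywords.
        String sparql = "SELECT ?page WHERE { ?page <" + ns + "mentions> <"
                + ns + "Lucene> }";
        try (QueryExecution qe = QueryExecutionFactory.create(
                QueryFactory.create(sparql), model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getResource("page"));
            }
        }
    }
}
```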
So, can you please provide me some help with building this?
Thanks.
Regards
This may sound harsh, but you will not be able to capture everything.
You need a lot of data. Of course, there is already a lot of data arranged in formats like OWL and RDF which you may use (e.g. WordNet, YAGO, GeoNames, etc.), but although these datasets are huge, they each cover only a very small portion of a possible universe of discourse.
Developing a good semantic search takes a lot of resources and brain power. Projects like KompParse at the German Research Center for Artificial Intelligence, which focus only on a small part of human conversation (gossip or buying furniture), have been running for several years with several employees and are still just "OK".
Understanding semantics has already been implemented in different search engines; take Google, for example, or Wolfram Alpha. So this topic might not even be as "out of the box" as you think.
So I will go along with user723630 and strongly advise you to focus on a smaller topic. You will still achieve a lot, but you will not get frustrated.
Currently I am using Thinking Sphinx for search. Now I'm considering using Sunspot or Tire, because they automatically index new content.
Are there any performance differences between the two? Is there anything else I should be concerned with?
Obviously the first difference is that you want to decide which search engine you think is best for your purposes: SOLR or Elasticsearch. We're using SOLR via Sunspot right now, but we're thinking seriously about moving to Elasticsearch because it feels like a better match for the sorts of web app functionality we want. It was incredibly easy to set up Tire, install the attachments plugin, and get search operating against data both in the database and in PDF attachments, with highlighting (now working thanks to another answer here on SO). Also, from a development/debugging point of view being able to use curl to test queries and see results is just great.
From the point of view of coding in a Rails app, you're right that both Sunspot and Tire are very similar. They both use the idea of a searchable/mapping block that defines what fields to index and how, and then performing a search is quite similar. As far as performance goes, I might give a bit of advantage to Tire, partly because the way it paginates and indexes in bulk is pretty slick (via the rake tire:import task). The ability in tire to control the indexing contents via to_json is very flexible as well.
Ultimately I think probably Sunspot and Tire are close enough that the choice between SOLR vs Elasticsearch is where you'll really end up making your decision.
I am looking into mechanisms for better search capabilities against our database. It is currently a huge bottleneck (causing long-lasting queries that are hurting our database performance).
My boss wanted me to look into Solr, but on closer inspection, it seems we actually want some kind of DB integration mechanism with Lucene itself.
From the Lucene FAQ, they recommend Hibernate Search, Compass, and DBSight.
As a background of our current technology stack, we are using straight JSPs on Tomcat, no Hibernate, no other frameworks on top of it... just straight Java, JSP, and JDBC against a DB2 database.
Given that, it seems Hibernate Search might be a bit more difficult to integrate into our system, though it might be nice to have the option of using Hibernate after such an integration.
Does anyone have any experiences they can share with using one of these tools (or other similar Lucene based solutions) that might help in picking the right tool?
It needs to be a FOSS solution, and ideally it will manage updating Lucene with changes from the database automagically (though efficiently), without extra effort to notify the tool when changes have been made (otherwise, it seems rolling my own Lucene solution would be just as good). Also, we have multiple application servers against just one database (+ failover), so it would be nice if the solution were easy to use from all application servers seamlessly.
I am continuing to inspect the options now, but it would be really helpful to utilize other people's experiences.
When you say "search against a DB", what do you mean?
Relational databases and information retrieval systems use very different approaches for good reason. What kind of data are you searching? What kind of queries do you perform?
If I were going to implement an inverted index on top of a database, as Compass does, I would not use their approach, which is to implement Lucene's Directory abstraction with BLOBs. Rather, I'd implement Lucene's IndexReader abstraction.
Relational databases are quite capable of maintaining indexes. The value that Lucene brings in this context is its analysis capabilities, which are most useful for unstructured text records. A good approach would leverage the strengths of each tool.
As updates are made to the index, Lucene creates more segments (additional files or BLOBs), which degrade performance until a costly "optimize" procedure is run. Most databases amortize this cost over each index update, giving you more stable performance.
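For reference, that "optimize" step is a single call on the index writer (renamed forceMerge in later Lucene versions). A minimal sketch, assuming an on-disk index at a path I made up:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class Optimize {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            // Collapses all segments into one. It rewrites the whole index,
            // which is why it's expensive and best run off-peak.
            writer.forceMerge(1);
        }
    }
}
```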
I have had good experiences with Compass. It has really good integration with Hibernate and can mirror data changes made through Hibernate and JDBC directly to the Lucene indexes through its GPS devices: http://www.compass-project.org/docs/1.2.2/reference/html/gps-jdbc.html
Maintaining the Lucene indexes on all your application servers may be an issue. If you have multiple App servers updating the db, then you may hit some issues with keeping the index in sync with all the changes. Compass may have an alternate mechanism for handling this now.
The Alfresco project (CMS) also uses Lucene and has a mechanism for replicating Lucene index changes between servers, which may be useful in handling these issues.
We started using Compass before Hibernate Search was really off the ground so I cannot offer any comparison with it.
LuSql (http://code.google.com/p/lusql/) allows you to load the contents of a JDBC-accessible database into Lucene, making it searchable. It is highly optimized and multi-threaded. I am the author of LuSql and will be releasing a new version (re-architected with a new pluggable architecture) in the next month.
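In spirit, it automates the kind of JDBC-to-Lucene pipeline sketched below (a deliberately naive, single-threaded toy version; the connection URL, table, and column names are made up):

```java
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class JdbcToLucene {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:db2://dbhost:50000/MYDB", "user", "pass");
             IndexWriter writer = new IndexWriter(
                     FSDirectory.open(Paths.get("/var/lucene/articles")),
                     new IndexWriterConfig(new StandardAnalyzer()));
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, title, body FROM articles")) {
            while (rs.next()) {
                Document doc = new Document();
                doc.add(new StoredField("id", rs.getLong("id")));   // key back into the DB
                doc.add(new TextField("title", rs.getString("title"), Field.Store.YES));
                doc.add(new TextField("body", rs.getString("body"), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}
```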
For a pure performance boost in searching, Lucene will certainly help out a lot. Only index what you care about/need and you should be good. You could use Hibernate or some other layer if you like, but I don't think it is required.
Well, it seems DBSight doesn't meet the FOSS requirement, so unless it is an absolutely stellar solution, it is not an option for me right now...