I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.
At present, there's just a quick interim solution in place which is fine for flexibility, but speed was always going to be a problem. Modules register models (and relationships and columns) which they'd like to be searchable. Upon search, the search functionality queries data using those relationships and applies Levenshtein, removes stop words, does character replacements etc. Clearly this will slow down as the volume of data increases so it's not viable to keep as it is effectively select * from x,y,z and then mine through the data.
The benefit of the above is such that there is a direct relation to the model which found the data. For example, if Model_Product finds something, I know that in my code i can use Model_Product::url() to associate the result off to the relevant location or Model_Product::find(other data) to show say the image or description if the keyword had been found in the title for example.
Another benefit of the above is it's already database specific, and therefore can just be thrown up onto a virtualhost and it works.
I have read about the various options, and they all seem very similar so it's unlikely that people are going to be able to suggest the 'right' one without inciting discussion or debate, but for the record; from the following options, Solr seems to be the one I'm leaning toward. I'm not set in stone so if anyone has any advice they'd like to share or other options I could look at, that'd be great.
Sphinx
Lucene
Solr - appears to just run Lucene as a service?
Xapian
ElasticSearch
Looking through various tutorials and guides they all seem relatively easy to set up and configure. In the case above I can have modules register the path of config files/search index models and have the searcher run them all through search program x. This will build my indexes, and provide the means by which to query data. Fine.
What I don't understand is how any of these indexes related to my other code. If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost? This is something which I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server, with credentials x to database y.
Granted I haven't done as extensive reading on this as I would have typically because I'm a bit stuck in terms of direction at the moment and I'd rather not read everything about everything in favour of seeking some advice from those who know before I take a particular route.
Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.
DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.
If I index data, search and in turn find a result with say Solr, how
do I know how to get all of the other information related to the bit
it found?
You can store any extra data you want, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you to find relative information, e.g. if you run a DVD rent shop and user has misspelled a movie name, Lucene will figure this out for you and (unlike with DB) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc. And obviously you can provide your own if you need to.
Also is someone able to confirm whether or not I will need to have an
instance of any of the above per virtualhost?
Lucene is a low level library and will have to be present in every JVM you run. Solr (built on top of Lucene) is an HTTP server. You can call it from as many clients as you want. More scaling options explained here.
Related
For statistic purposes I have to save and analyse all search queries made to a server running Solr (version 8.3.1). Maybe it's just because I haven't worked with Solr until today, but I couldn't find a simpler way to access these queries except crawling the logs.
I've only found one article to help me, in which the following is stated:
I think that solr by him self doesn't store the queries (correct me if I'm wrong, about this) but you can accomplish what you want by processing the solr log (its the only way I think).
(source: https://lucene.472066.n3.nabble.com/is-it-possible-to-save-the-search-query-td4018925.html)
Is there any more convenient way to do this?
I actually found a good way to achieve this in another SO-Question. Well, at least kind of.
Note: It is only useful if you have enough resources on the same server or another server to properly handle a second Solr-Core.
Link to original answer
That SO-Question is about Elastic-Search but the methodology of it can also be applied to this case with a second Solr-Core that indexes the queries made. (One can also add additional fields like when it was last searched, total search count, ...)
The functionalities of a search auto-complete are also achievable with this solution.
In short:
The basic idea is to use a second Solr instance to provide the means necessary for quickly saving the queries (instead of a DB for instance).
Remark: I'm not going to accept this as the best answer because it is a rather special solution to the question I originally made. But I nonetheless felt it could be useful for any programmer trying to achieve this while also thinking about search auto-completion.
I am new in development and I need your advice.
Our student team is going to develop an application for online restaurant booking, where also will be search tool (restaurant and dishes search).
We want to use modern search tool like Lucene, but we are not sure if it is what we really need.
Due to knowledge information, this is more for text search with different kinds of indexes and so on, while our app will make search in database. BUT, if we want to add new features in future, I guess we need good search engine background today.
So, let me know if Lucene is able to do "select" operations or something like it, or this technology is just for text searches?
Sedond question, what can you advise in realisation of this feature? Where to start with?
Thank you in advance.
It all depends. You usually don't start with Lucene and Solr, you use it to attain a goal or implement a specific behavior you need. Usually Solr is your secondary storage, built from your primary database - i.e. you're inserting data into Solr to solve a specific need, for example proper full text search with relevancy scoring.
If you're just starting up, go with the technology you know - i.e. usually a regular RDBMS. You can then attach Solr if you need those features that they're really good at, and wait with introducing new technology until it's necessary. The need first, then the technology. Maybe Lucene/Solr isn't the right technology for what you end up needing when you get to that point.
One of the main tenants of modern development is "YAGNI" - You Ain't Gonna Need It. You implement features when you need them, not for some imagined behavior that may or may not show up down the road.
I'm using Alfresco 4.1.6 and SOLR 1.4.
For search, I use fts_alfresco_language and the searchService.query method.
And in my query I search by PATH, TYPE and some custom properties like direction, telephone, mail, or similar.
I have now over 2 millions of documents, and we can see how the performance of the searchs are worst than at the beginning.
I read that in version 1.4 of solr, using PATH on the query is a bad idea. And is better avoid it and only use TYPE and the property key and value.
But I have 2 questions...
Why the PATH increase the response time? It's not a help? I have over 1000 main folders at the root of the repository. If I specify the folder that solr may search, why this not filter the results and give me a worst time response than if I don't specify this? Or there are another way to say to solr the main folder to reduce results and then do the rest of the query?
When I find by custom properties, I use 3 or 4 properties, all indexed, to search. These merged lookups has a higher overhead than one? Maybe is better to search only by one property, and not by the 3? Or maybe use ORs and not ANDs to quickly results? How works SOLR?
Thanks!
First let me start with this, I'm not sure what you want of this question cause it's vague. You're not asking how to make your query better, your asking why a bad-practice(bad-performance) is working bad for you.
Do some research on how to structure your ECM system, first thing what makes your ECM any good is a proper Content Model. There are books out there which will help you.
If you're structuring your content with folders (Path) and these are important for you, than you need to add these as metadata to your content. If you haven't done that, then you should start with that.
A good Content Model will be able to find content wherever it's placed within your ECM system.
Sure it's easy to migrate a filesystem to an ECM system and just leave it there, but you've done only half the work.
The path queries are slow in general cause it uses a loop pattern and it's expensive. It has been greatly improved in the new SOLR, but it still isn't as fast as normal metadata querying.
I am looking into Neo4j as a stripped-down document store. A key aspect of document storage is search, and I know Neo4j includes full text search via legacy indices provided by Lucene.
I would be very interested in hearing the limitations of Neo4j search capabilities in a distributed environment. Does it provide a distributed index? In what ways is it inferior to Solr or ElasticSearch? How far can I take it before I must install Solr?
-- EDIT --
We are trying to integrate two distinct search efforts. The first is standard text content search. For instance, using the Enron emails, we want to search for every email that matches "bananas" or "going to the store" and get those document bodies in response. This is where people often turn to Solr.
The second case is more complicated, we have attached a great deal of meta-data to each document. We may have decided that "these" emails were the result of late-night drunk-dialing. Now I want to search for all emails that may have been the result of late-night drunk-dialing. For this kind of meta-data, we believe a graph database is in order.
In a perfect world, I can use one platform to perform both queries. I appreciate that Neo4j (nor OrientDB, Arango, etc) are designed as full text search databases, but I'm trying to understand the limitations thereof.
In terms of volume, we are dealing at a very large scale with batch-style nightly updates. The data is content heavy, with some documents running into hundreds of pages of text, but mostly on the order of a page or two.
I once worked on a health social network where we needed some sort of search and connection search functionalities we first went on neo4j we were very impressed by the cypher query language we could get and express any request however when you throw there billion of nodes you start to pay the price and we started considering another graph db, this time we've made a lot of research, tests and OrientDB was clearly the winner, OrientDB is highly scalable but the thing is that you have to code by yourself, your "search algorithm" if you want to do some advanced things (what is the common point between this two nodes) otherwise you have the SQL like query language (i don't know/remember if he has a name) but you can do some interesting stuff with it
So in conclusion i would definitely go on OrientDB
Neo4j can provide a "distributed index" in the sense that the high availability cluster can make your index available on more than one machine, but I'm pretty sure that's not what you're after. Related to this issue is a different answer I wrote about graph partitioning, and what it takes to distribute a really large number of nodes/relationships across multiple machines. (It's not terribly simple)
Solr and Lucene do two different things (although Solr is built on top of Lucene). I think solr and neo4j are not comparable because they're trying to do completely different things. This site isn't about software recommendations so I can't tell you what you should use other than to say you should read up on solr and neo4j, and figure out which set of functionality you want. As far as I know, this is an exclusive decision as I'm not aware of people integrating solr with neo4j.
Your question is very difficult to answer, I'd recommend expanding on what you are trying to do and what you have tried, you'll probably get better responses.
I am starting on a ASP.NET MVC 3 General Management System (Project Management being the first component). Now I have been reading up a bit on RavenDB and it sounds pretty interesting. One of the biggest things that I like about it is the fact I would not need any type on ORM to handle the data from the DB. This will make my code a lot cleaner and quicker. However coming from a background working exclusively with MySQL for the past 6+ years, I tend to think very relationally with my data. There are a few things that seems like NoSQL would not be good for. I want to throw these things out there and maybe these issues can be handle in a NoSQL solution and I am just think too relationally (then again, maybe this project should be done with MySQL). These are the issues I am thinking of:
Unique Idenifiers: I am going to want to be able to have unique identifiers for a lot of things. For stuff like projects, the name should be unique and could use that however when it come to tasks under a project, the title may not be unique and this is where I would use a quto-increment field but I can do that in RavenDB (from what I can tell)
Linking: Using for fields like status and type I would just use a linking with a foreign key. Now for one-to-many relationships, I can just use the text instead of trying to link a foreign key (which you don't have in NoSQL) but with many-to-many linking, that because a problem. For example, I intend to have a tagging system (like on here) where most items can have 1 to many tags attached to it and then I can perform searches on those tag for the items. Is there a way to do this in NoSQL?
Is a RDBMS really the best tool for the job here or am I just not properly think the "NoSQL" way and I can accomplish this with NoSQL (RavenDB)?
I know this is an old post. Perhaps the docs weren't as good when originally written. But for reference in case other stumble here:
Raven comes with a HiLo document id generation strategy by default. Storing a new document without specifying an id yourself will get an auto incrementing id such as "projects/1", "projects/2", etc. Read more here.
The best guidance on the different ways to handle document relationships is here in the documentation. For the situation you described, you don't really need a separate document at all. You can simply embed a string array of tag names into each item. Documents are not flat, they can be structured. And yes, you can still query on them.
Hopefully you've discovered this on your own since the original post.
Ayende wrote a post "Modeling reference data in RavenDB" which answers some of your questions re Linking. You will have copies of the data between the reference document and your other documents and that redundancy is "ok" for document databases. You can still build indexes or query based on the on either Id or text that you store.
I would favor SQL for a transaction system such as Accounts Receivable application where you need to perform ad hoc queries. With document database you really need to think through how you will be fetching your data and build indexes up front to answers those questions. With RavenDB there is also a dynamic indexing function that learns from and caches the queries that are fired at the database.
For project management where the majority of items would be tasks I would think a RavenDB would fit your needs.