How to implement full text search in a database

I understand that full text indexing and search for a database can be enabled by a lot of pre-packaged products. However, just out of academic curiosity, I wonder how those full text indexes are actually implemented. I have tried to google for answers with little success. Any feedback would be much appreciated.

Full text search is supported by quite a few database engines these days as a core feature.
As for implementation, I think your best bet is to check out Postgres full text search, as you can:
find a lot of material on how it is implemented
actually change and play with the parsers (for example, optimize for a certain domain)
There are further details and concepts explained on Wikipedia under full text indexes. You can also check out open source and free full text search engines, as you will normally find supporting documentation explaining their inner workings too (I have heard good things about Lucene/Solr from that list).
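For a concrete taste of the Postgres approach, here is a minimal sketch (assuming a local database and a hypothetical documents table with id and body columns) of building a full text index and querying it with relevance ranking:

```python
# Minimal sketch of Postgres full text search via psycopg2.
# The connection string and the "documents" table are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# to_tsvector normalizes text into lexemes; a GIN index over that
# expression is what makes lookups fast.
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_body_fts
    ON documents USING GIN (to_tsvector('english', body))
""")
conn.commit()

# to_tsquery parses the query, @@ matches it against the tsvector,
# and ts_rank orders the hits by relevance.
cur.execute("""
    SELECT id, ts_rank(to_tsvector('english', body), query) AS rank
    FROM documents, to_tsquery('english', 'full & text & search') AS query
    WHERE to_tsvector('english', body) @@ query
    ORDER BY rank DESC
    LIMIT 10
""")
for doc_id, rank in cur.fetchall():
    print(doc_id, rank)
```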

Probably by creating dictionaries of "words" and maybe a bit of lexical analysis. (Note that full text search matches whole words rather than parts of words, so indexing may be constrained to that.)
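At its core that dictionary is an inverted index: a map from each word to the set of documents containing it. A toy sketch (real engines add stemming, stop words, positions, and compressed posting lists):

```python
# Toy inverted index: word -> set of document ids.
import re
from collections import defaultdict

docs = {
    1: "Full text searches whole words",
    2: "Indexing builds a dictionary of words",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in re.findall(r"\w+", text.lower()):  # crude tokenizer
        index[word].add(doc_id)

def search(query):
    # AND semantics: intersect the posting lists of all query words.
    postings = [index.get(w, set()) for w in re.findall(r"\w+", query.lower())]
    return set.intersection(*postings) if postings else set()

print(search("words indexing"))  # -> {2}
```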

Related

Database that allows full text search in O(1)

I have a database of documents where searching quickly for keywords and patterns would be very useful.
I know of the Burrows–Wheeler transform/FM-index. I wonder if there are any programs or database systems based on BWT or similar methods that can search a corpus in O(1), and hopefully bring other advantages.
Any ideas?
There is a great book by Witten/Moffat/Bell (1994), Managing Gigabytes; it describes in detail everything you need to know about indexing and retrieval. I think their source code is also available, or has been made available in an information retrieval library.
However, it doesn't cover the Burrows–Wheeler transform, as that was only invented in the same year.
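For reference, the transform itself is short to write down. A naive sketch that sorts all rotations; a real FM-index builds a suffix array plus rank structures so the same last column can be searched without ever materializing the rotation matrix:

```python
# Naive Burrows-Wheeler transform: sort all rotations of the input
# and keep the last column. O(n^2 log n) here; real implementations
# derive it from a suffix array in O(n).
def bwt(s):
    s += "\0"  # unique sentinel marking the end of the string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("banana"))  # last column of the sorted rotation matrix
```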

How useful is Lucene/Solr in database search?

I am new to development and I need your advice.
Our student team is going to develop an application for online restaurant booking, which will also include a search tool (restaurant and dish search).
We want to use modern search tool like Lucene, but we are not sure if it is what we really need.
From what I have read, this is more of a tool for text search with different kinds of indexes and so on, while our app will search a database. BUT, if we want to add new features in the future, I guess we need a good search engine background today.
So, let me know whether Lucene is able to do "select" operations or something like them, or whether this technology is just for text search.
Second question: what can you advise for implementing this feature? Where should we start?
Thank you in advance.
It all depends. You usually don't start with Lucene and Solr; you use them to attain a goal or implement a specific behavior you need. Usually Solr is your secondary storage, built from your primary database - i.e. you're inserting data into Solr to solve a specific need, for example proper full text search with relevancy scoring.
If you're just starting up, go with the technology you know - i.e. usually a regular RDBMS. You can then attach Solr if you need those features that it is really good at, and wait to introduce new technology until it's necessary. The need first, then the technology. Maybe Lucene/Solr isn't the right technology for what you end up needing when you get to that point.
One of the main tenets of modern development is "YAGNI" - You Ain't Gonna Need It. You implement features when you need them, not for some imagined behavior that may or may not show up down the road.
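If you do reach the point of attaching Solr, the pattern looks roughly like this sketch (using the pysolr client; the core URL and field names are made up for illustration):

```python
# Sketch of "Solr as secondary storage": copy rows from the primary
# database into Solr, then query Solr for full text with scoring.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/restaurants", always_commit=True)

# Index documents built from database rows (hypothetical fields).
solr.add([
    {"id": "42", "name": "Trattoria Roma", "dishes": "pizza pasta tiramisu"},
    {"id": "43", "name": "Sushi Corner", "dishes": "nigiri maki ramen"},
])

# Each hit carries the database id, so the authoritative row can be
# fetched back from the RDBMS.
for hit in solr.search("dishes:pasta"):
    print(hit["id"], hit["name"])
```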

How can I relate search indexes to models in MVC?

I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.
At present, there's just a quick interim solution in place which is fine for flexibility, but speed was always going to be a problem. Modules register models (and relationships and columns) which they'd like to be searchable. Upon search, the search functionality queries data using those relationships and applies Levenshtein distance, removes stop words, does character replacements, etc. Clearly this will slow down as the volume of data increases, so it's not viable to keep: it is effectively select * from x,y,z and then mining through the data.
The benefit of the above is that there is a direct relation to the model which found the data. For example, if Model_Product finds something, I know that in my code I can use Model_Product::url() to route the result to the relevant location, or Model_Product::find(other data) to show, say, the image or description if the keyword was found in the title.
Another benefit of the above is it's already database specific, and therefore can just be thrown up onto a virtualhost and it works.
I have read about the various options, and they all seem very similar, so it's unlikely that people will be able to suggest the 'right' one without inciting discussion or debate. But for the record, from the following options, Solr seems to be the one I'm leaning toward. I'm not set in stone, so if anyone has any advice they'd like to share, or other options I could look at, that'd be great.
Sphinx
Lucene
Solr - appears to just run Lucene as a service?
Xapian
ElasticSearch
Looking through various tutorials and guides they all seem relatively easy to set up and configure. In the case above I can have modules register the path of config files/search index models and have the searcher run them all through search program x. This will build my indexes, and provide the means by which to query data. Fine.
What I don't understand is how any of these indexes relate to my other code. If I index data, search, and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost? This is something which I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server, with credentials x to database y.
Granted, I haven't done as much reading on this as I typically would, because I'm a bit stuck in terms of direction at the moment and I'd rather seek some advice from those who know than read everything about everything before taking a particular route.
Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.
DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.
If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?
You can store any extra data you want, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you find related information, e.g. if you run a DVD rental shop and a user has misspelled a movie name, Lucene will figure this out for you and (unlike a DB) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc. And obviously you can provide your own if you need to.
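A sketch of that pattern with pysolr (the model and pk field names are hypothetical): index the model class name and primary key next to the searchable text, then use them to route each hit back to the right model:

```python
# Store which model a document came from plus its primary key, so a
# search hit maps straight back to the MVC layer.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/site", always_commit=True)

solr.add([
    {"id": "Model_Product:7", "model": "Model_Product", "pk": "7",
     "title": "Blue widget", "body": "A very searchable widget."},
])

for hit in solr.search("widget"):
    # In application code this is where you would call e.g.
    # Model_Product::find(7) and Model_Product::url().
    print(hit["model"], hit["pk"], hit["title"])
```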
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost?
Lucene is a low-level library and will have to be present in every JVM you run. Solr (built on top of Lucene) is an HTTP server; you can call it from as many clients as you want. More scaling options are explained here.

Semantic Search Engine

I want to design a Semantic Search engine for my final year Master's degree. I have been doing a fair amount of reading both casually on the web and academic papers so I am not a total noob in this field.
My aim is to build a semantic search engine which parses HTML content into its equivalent RDF triples, stores the triples in a triplestore, and then tries to answer queries fired at it using SPARQL. I want to do something out of the box, unlike the other students. So, I decided to build a semantic search engine.
Right now I have a running search engine using Solr which performs keyword search; what I want to do is semantic search. I know some open source tools related to Web 3.0, but I am not sure whether they will be compatible with Solr.
So, can you please provide me some help with building this?
Thanks.
Regards
It may sound harsh, but you will not be able to capture everything.
You need a lot of data. Of course, there is already a lot of data arranged in formats like OWL and RDF which you may use (e.g. WordNet, YAGO, GeoNames, etc.), but although they are huge, they only cover very small portions of a possible discourse universe.
Developing a good semantic search takes a lot of resources and brain power. Projects like KompParse at the German Research Center for Artificial Intelligence, which focus on only a small part of human conversation (gossip or buying furniture), have been running for several years with several employees by now and are still just "ok".
Understanding semantics has already been implemented in different search engines; take Google, for example, or Wolfram Alpha. So this topic might not even be as "out of the box" as you think.
So I will go with user723630 and strongly advise you to focus on a smaller topic. You will still achieve a lot, but you will not get frustrated.
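To make the triplestore-plus-SPARQL part of the plan concrete, here is a minimal sketch using rdflib, with toy triples standing in for whatever an HTML-to-RDF extraction step would produce:

```python
# Load RDF triples and answer a SPARQL query over them with rdflib.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:pizzeria_roma ex:type ex:Restaurant ;
                 ex:serves ex:pizza .
""", format="turtle")

# Ask the triplestore which resources are restaurants serving pizza.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?r WHERE { ?r ex:type ex:Restaurant ; ex:serves ex:pizza . }
""")
for row in results:
    print(row.r)  # -> http://example.org/pizzeria_roma
```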

Is it advisable to use Lucene for this?

I have a huge XML file, about 2GB in size, containing resumes. There are thousands of resumes in this file, tagged properly. Right now I am using XPath to query it. So would it be advisable to use Lucene instead of XPath?
Depends upon what your requirements are. If you need full-text searching and all the other great features of a full-blown search engine, Lucene is the way to go. I would recommend Solr, which builds on top of Lucene and provides a much better API and abstraction.
Like everything else technology related, it depends.
What Lucene gives you that you're not getting with XPath is the power of a full-text engine that supports, among other things, ranking, phrase queries, wildcard queries, etc.
Based on your use case I would say that a full-text search engine makes sense. That's not to say that vanilla Lucene is the best way to go (there are, for example, other alternatives that build on Lucene).
2GB seems small enough that I would construct my own inverted index (a minimal one) :) That said, there is no problem in using Lucene/Solr; go ahead, it will help you once your records start doubling. At this scale (2GB), or even much larger, many real-life systems still run keyword searches against the database itself using the SQL LIKE keyword.
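Either way, a file that size is best streamed rather than loaded whole. A sketch of feeding it into Solr with a streaming parser (the resume, name, and text tag names are assumptions about the XML schema):

```python
# Stream a large resume XML file into Solr in batches, so the whole
# 2GB file never has to sit in memory at once.
import xml.etree.ElementTree as ET
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/resumes")

batch = []
for event, elem in ET.iterparse("resumes.xml", events=("end",)):
    if elem.tag == "resume":
        batch.append({
            "id": elem.get("id"),                      # assumed attribute
            "name": elem.findtext("name", default=""),
            "text": elem.findtext("text", default=""),
        })
        elem.clear()  # release memory held by processed elements
        if len(batch) >= 1000:
            solr.add(batch)
            batch = []
if batch:
    solr.add(batch)
solr.commit()
```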
