Lucene and SQL Server - best practice - sql-server

I am pretty new to Lucene, so would like to get some help from you guys :)
BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text/tag searches on those documents in SQL Server.
Q1) In this case, in order to do the keyword search on the documents, should I insert all of those documents into the Lucene index? Does this mean there will be data duplication (one copy in SQL Server and the other in the Lucene index)? That could be a problem since we have a massive amount of documents (about 100GB). Is it inevitable?
Q2) Also, each document has a set of tags (up to 3). Is Lucene also a good choice for the tag search? If so, how do I do it?
Thanks,

Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here, for a brief introduction. A typical implementation would be to index anything you wish to be able to search on, store only a unique identifier in the Lucene index, and pull any records found by a search from the database, based on the ID. If you want to reduce DB load, you can store some information in Lucene to display a list of search results, and only query the database in order to fetch the full document.
As for saving on space, there will be some measure of duplication. This is true even if you only use Lucene, though. Lucene stores the inverted index used for searching entirely separately from stored data. To save space, I'd recommend being very deliberate about what data you choose to index, and what you need to store and be able to retrieve later. What you store is particularly important for saving space in Lucene, since indexed-only values tend to be very space-efficient in most cases.
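To make the "index everything, store only the ID" idea concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: a hand-rolled map of terms stands in for the index and a HashMap stands in for SQL Server, and every class and method name is made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: not the Lucene API. All names here are hypothetical.
class IdOnlyIndexSketch {
    // term -> IDs of documents containing that term (indexed, not stored)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // id -> short title kept alongside the index for result listings (stored)
    private final Map<Integer, String> storedTitles = new HashMap<>();

    void index(int id, String title, String body) {
        storedTitles.put(id, title);
        for (String term : (title + " " + body).toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // Returns matching IDs only; the body text itself is never kept here.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    String title(int id) {
        return storedTitles.get(id);
    }

    public static void main(String[] args) {
        Map<Integer, String> database = new HashMap<>(); // stand-in for SQL Server
        database.put(1, "Full text of the widget manual ...");
        database.put(2, "Full text of the gadget brochure ...");

        IdOnlyIndexSketch index = new IdOnlyIndexSketch();
        index.index(1, "Widget manual", database.get(1));
        index.index(2, "Gadget brochure", database.get(2));

        for (int id : index.search("widget")) {
            System.out.println(id + ": " + index.title(id)); // result list from the index
            System.out.println(database.get(id));            // full document fetched by ID
        }
    }
}
```

The only per-document data held next to the index is the ID and a short title, which is the trade-off described above: the result list needs no database hit, while the full text lives in one place only.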
Lucene can certainly implement a tag search. The simple way to implement it would be to add each tag to a field of your choosing (I'll call it "tags", which seems to make sense), while building the document, such as:
document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));
I could then simply add a required term to any query to search only within a particular tag. For instance, if I were to search for "some stuff", but only with the tag "forkids", I could write a query like:
some stuff +tags:forkids
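As I understand that query, the tag clause is required while "some" and "stuff" are optional and only influence ranking. The plain-Java sketch below imitates those semantics without Lucene. The Doc class, the search method and the count-based score are all assumptions for illustration; real Lucene scoring is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: imitates "some stuff +tags:forkids" on in-memory documents.
class RequiredTagSketch {
    // Hypothetical document: free text plus a multi-valued "tags" field.
    static final class Doc {
        final int id;
        final String text;
        final Set<String> tags;

        Doc(int id, String text, String... tags) {
            this.id = id;
            this.text = text;
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    // The tag must be present; each optional term found in the text adds one
    // point. Results are ordered by score (highest first), then by ID.
    static List<Integer> search(List<Doc> docs, List<String> optionalTerms, String requiredTag) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (Doc d : docs) {
            if (!d.tags.contains(requiredTag)) {
                continue; // fails the required clause
            }
            Set<String> words = new HashSet<>(Arrays.asList(d.text.toLowerCase().split("\\W+")));
            int score = 0;
            for (String term : optionalTerms) {
                if (words.contains(term)) {
                    score++;
                }
            }
            scores.put(d.id, score);
        }
        List<Integer> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ids;
    }
}
```

Note that a document tagged "forkids" is returned even when it contains neither "some" nor "stuff"; it just ranks below the ones that do.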

Documents can also be stored in Lucene; you can retrieve and reference them using the document ID.
I would suggest using Solr http://lucene.apache.org/solr/ on top of Lucene. It is more user-friendly and has multiValued fields (for the tags) available by default.
http://wiki.apache.org/solr/SchemaXml

Related

How to help my Solr engine to understand related terms?

I have a big list of related terms (not synonyms) that I would like my solr engine to take into account when searching. For example:
Database --> PostgreSQL, Oracle, Derby, MySQL, MSSQL, RabbitMQ, MongoDB
For this kind of list, I would like Solr to take into account that if a user is searching for "postgresql configuration", the search might also bring back results related to "RabbitMQ" or "Oracle", but not as absolute synonyms. Just to boost results that have these keywords/terms.
What is the best approach to implement such connection? Thanks!
You've already discovered that these are synonyms - and that you want to use that metainformation as a boost (which is a good idea).
The key is then to define a field that does what you want - in addition to your regular field. Most of these cases are implemented by having a second field that does the "less accurate" version of the field, and apply a lower boost to matches in that field compared to the accurate version.
You define both fields - one with synonyms (for example content_synonyms) and one without (content), and then add a copyField instruction from the content field (this means that Solr will take anything submitted to the content field and "copy" it as the source text for the content_synonyms field as well).
Using edismax you can then use qf to query both fields and give a higher weight to the exact content field: qf=content^10 content_synonyms will score hits in content 10x higher than hits in content_synonyms, in effect using the synonym field for boosting content.
The exact weights will have to be adjusted to fit your use case, document profile and query profile.
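To show the effect of the two weights, here is a toy model in plain Java. It is not Solr's scoring formula: each field contributes a flat 0/1 match per query term, the weights 10 and 1 mirror qf=content^10 content_synonyms, and per term the better of the two fields is taken. The group list and all names are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: flat match scores instead of real relevance scoring.
class BoostedFieldsSketch {
    // Hypothetical groups of related (not synonymous) terms.
    static final List<Set<String>> RELATED_GROUPS = List.of(
        Set.of("postgresql", "oracle", "derby", "mysql", "mssql", "rabbitmq", "mongodb"));

    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // Emulates the content_synonyms field: the original tokens plus every
    // term from any group that the text touches.
    static Set<String> withRelated(Set<String> tokens) {
        Set<String> out = new HashSet<>(tokens);
        for (Set<String> group : RELATED_GROUPS) {
            if (!Collections.disjoint(group, tokens)) {
                out.addAll(group);
            }
        }
        return out;
    }

    static int score(String query, String content) {
        Set<String> exact = tokens(content);     // plays the role of "content"
        Set<String> loose = withRelated(exact);  // plays the role of "content_synonyms"
        int total = 0;
        for (String term : tokens(query)) {
            int contentScore = exact.contains(term) ? 10 : 0; // content^10
            int relatedScore = loose.contains(term) ? 1 : 0;  // content_synonyms
            total += Math.max(contentScore, relatedScore);
        }
        return total;
    }
}
```

With this model, "postgresql configuration" scores 20 against a PostgreSQL configuration page and 11 against an Oracle configuration page, so the related document still shows up but ranks lower.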

Difference between full text and free text search in solr (other search db)

New to search databases and working with one. What is the difference between full text and free text search/index?
They are essentially the same. More precisely, the two terms are used as synonyms.
They are techniques used by search engines to find results in a database.
Solr uses the Lucene project for its search engine. Full-text search is used when you have large documents to search and can't use LIKE queries in a normal RDBMS for performance reasons.
It mainly follows two stages: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms. In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.
Suppose you typed John and Ryan: the query will return all the documents that contain either "John" or "Ryan". Order and case don't matter.
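The two stages can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how Lucene is implemented, and the class and method names are made up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: a tiny inverted index with OR semantics.
class TwoStageSearchSketch {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Stage 1 (indexing): scan the text once and record which terms it contains.
    // Terms are lowercased, so case no longer matters at search time.
    void addDocument(int id, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // Stage 2 (searching): answer from the index alone. A document matches if it
    // contains any of the query terms, in any order.
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : query.toLowerCase().split("\\W+")) {
            hits.addAll(index.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        TwoStageSearchSketch engine = new TwoStageSearchSketch();
        engine.addDocument(1, "John went home");
        engine.addDocument(2, "ryan stayed late");
        engine.addDocument(3, "Nobody else was there");
        System.out.println(engine.search("RYAN john"));
    }
}
```

Note that the original text is not kept anywhere after addDocument returns; the search only ever touches the term map.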
In a nutshell, unless you are using the terms in some specific context, they are just different names for the same thing.
Call him Cristiano or CR7, they are the same :)

Solr queries stored within Solr field

I have a set of keywords defined by client requirements stored in a SOLR field. I also have a never ending stream of sentences entering the system.
By using the sentence as the query against the keywords I am able to find those sentences that match the keywords. This is working well and I am pleased. What I have essentially done is reverse the way in which SOLR is normally used by storing the query in Solr and passing the text in as the query.
Now I would like to be able to extend the idea of having just a keyword in a field to having a more fully formed SOLR query in a field. Doing so would allow proximity searching etc. But, of course, this is where life becomes awkward. Placing SOLR query operators into a field will not work as they need to be escaped.
Does anyone know if it might be possible to use the SOLR "query" function or perhaps write a java class that would enable such functionality? Or is the idea blowing just a bit too much against the SOLR winds?
Thanks in advance.
ES has percolate for this - for Solr you'll usually index the document as a single document in a memory based core / index and then run the queries against that (which is what ES at least used to do internally, IIRC).
I would check out the percolate api with ElasticSearch. It would sure be easier using this api than having to write your own in Solr.
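For what it's worth, the "stored queries, incoming documents" idea can be sketched in plain Java as below. This is not the Elasticsearch or Solr API. Each stored query is held as a predicate over the tokens of one sentence, which sidesteps the escaping problem from the question, and the helper names (term, near, percolate) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch only: reverse search over a stream of sentences.
class PercolatorSketch {
    // name -> stored query, kept in registration order
    private final Map<String, Predicate<List<String>>> queries = new LinkedHashMap<>();

    void register(String name, Predicate<List<String>> query) {
        queries.put(name, query);
    }

    // Simple keyword query.
    static Predicate<List<String>> term(String t) {
        return tokens -> tokens.contains(t);
    }

    // Proximity query: a and b occur at most maxDistance positions apart.
    static Predicate<List<String>> near(String a, String b, int maxDistance) {
        return tokens -> {
            for (int i = 0; i < tokens.size(); i++) {
                if (!tokens.get(i).equals(a)) {
                    continue;
                }
                for (int j = 0; j < tokens.size(); j++) {
                    if (tokens.get(j).equals(b) && Math.abs(i - j) <= maxDistance) {
                        return true;
                    }
                }
            }
            return false;
        };
    }

    // Tokenizes the incoming sentence once, then runs every stored query against it.
    List<String> percolate(String sentence) {
        List<String> tokens = Arrays.asList(sentence.toLowerCase().split("\\W+"));
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<List<String>>> entry : queries.entrySet()) {
            if (entry.getValue().test(tokens)) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }
}
```

Usage would look like registering term("solr") under one name and near("fast", "search", 2) under another, then calling percolate on each sentence as it arrives. Every stored query is evaluated per sentence, so this only scales to a moderate number of queries.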

Sorting by recent access in Lucene / Solr

In my Solr queries, I want to sort most recently accessed documents to the top ("accessed" meaning opened by user action). No other search criteria has weight for me: of the documents with text matching the query, I want them in order of recent use. I can only think of two ways to do this:
1) Include a 'last accessed' date field in each doc to have Solr sort upon. Trie Date fields can be sorted very quickly, I'm told. The problem of course is keeping the field up to date, which would require storing each document's text so I can delete and re-add any document with an updated 'last accessed' field. Mutable fields would obviate this, but Lucene/Solr still doesn't offer mutable fields.
2) Alternatively, store the mutable 'last accessed' dates and keep them updated in another db. This would require Solr to return the full list of matching documents, which could be upwards of hundreds of thousands of documents. This huge list of document ids would then be matched up against dates in the db and then sorted. It would work OK for uncommon search terms, but not for broad, common search terms.
So the trade off is between 1) index size plus a processing cost every time a document is accessed and 2) big query overhead, especially for unfocused search terms
Do I have any alternatives?
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles
http://blog.mikemccandless.com/2012/01/tochildblockjoinquery-in-lucene.html
You should be able to do this with the atomic update functionality.
http://wiki.apache.org/solr/Atomic_Updates
This functionality is available as of Solr 4.0. It allows you to update a single field in a document without having to reindex the entire document. I only know about this functionality from the documentation. I have not used it myself, so I can't say how well it works or if there are any pitfalls.
Definitely use option 1, using SOLR queries and updating the lastAccessed field as needed.
Since SOLR 4.0, partial document updates are supported in several flavours: https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
For your application it seems that a simple atomic update would be sufficient.
With respect to performance, this should work very well for large collections and fast document updates.
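As a rough illustration, an atomic update is just a small JSON body that names the document and uses the "set" modifier on the field to change. The sketch below only builds that body. The field name last_accessed and the helper names are assumptions, so substitute whatever your schema defines.

```java
import java.time.Instant;

// Sketch only: builds the JSON body for a Solr atomic update of one field.
class AtomicUpdateSketch {
    // "last_accessed" is a hypothetical date field, not something Solr predefines.
    static String lastAccessedUpdate(String docId, Instant accessedAt) {
        return "[{\"id\":\"" + escape(docId) + "\","
            + "\"last_accessed\":{\"set\":\"" + accessedAt + "\"}}]";
    }

    // Minimal escaping for backslashes and quotes inside a JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(lastAccessedUpdate("doc-42", Instant.parse("2013-05-01T12:00:00Z")));
    }
}
```

The body would then be POSTed to the core's update handler as JSON. One caveat, as far as I know: atomic updates rely on the document's other fields being stored so that Solr can rebuild it internally, so this saves you the delete and re-add logic but not the storage cost mentioned in option 1.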

How to implement an Enterprise Search

We are searching disparate data sources in our company. We have information in multiple databases that need to be searched from our Intranet. Initial experiments with Full Text Search (FTS) proved disappointing. We've implemented a custom search engine that works very well for our purposes. However, we want to make sure we are doing "the right thing" and aren't missing any great tools that would make our job easier.
What we need:
Column search
ability to search by column
we flag which columns in a table are searchable
Keep some relation between db column and data
we provide advanced filtering on the results
facilitates (amazon style) filtering
filter provided by grouping of results and allowing user to filter them via a checkbox
this is a great feature, users like it very much
Partial Word Match
we have a lot of unique identifiers (product id, etc).
the unique id's can have sub parts with meaning (location, etc)
or only a portion may be available (when the user is searching)
or (by a decidedly poor design decision) there may be white space in the id
this is a major feature that we've implemented now via CHARINDEX (MSSQL) and INSTR (ORACLE)
using the char index functions turned out to be equivalent performance(+/-) on MSSQL compared to full text
didn't test on Oracle
however searches against both types of db are very fast
We take advantage of Indexed (MSSQL) and Materialized (Oracle) views to increase speed
this is a huge win, Oracle Materialized views are better than MSSQL Indexed views
both provide speedups in read-only join situations (like a search combining company and product)
A search that matches user expectations of the paradigm CTRL-f -> enter text -> find matches
Full Text Search is not the best in this area (slow and inconsistent matching)
partial matching (see "Partial Word Match")
Nice to have:
Search database in real time
skip the indexing step, this is not a hard requirement
Spelling suggestion
Xapian has this http://xapian.org/docs/spelling.html
Similar to google's "Did you mean:"
What we don't need:
We don't need to index documents
at this point searching our data sources is the most important thing
even when we do search documents, we will be looking for partial word matching, etc
Ranking
Our own simple ranking algorithm has proven much better than an FTS equivalent.
Users understand it, we understand it, it's almost always relevant.
Stemming
Just don't need to get [run|ran|running]
Advanced search operators
phrase matching, or/and, etc
according to Jakob Nielsen http://www.useit.com/alertbox/20010513.html
most users are using simple search phrases
very few use advanced searches (when it's available)
also in Information Architecture 3rd edition Page 185
"few users take advantage of them [advanced search functions]"
http://oreilly.com/catalog/9780596000356
our Amazon like filtering allows better filtering anyway (via user testing)
Full Text Search
We've found that results don't always "make sense" to the user
Searching with FTS is hard to tune (which set of operators match the users expectations)
Advanced search operators are a no go
we don't need them because
users don't understand them
Performance has been very close (+/-) to the char index functions
but the results are sometimes just "weird"
The question:
Is there a solution that allows us to keep the key value pair "filtering feature", offers the column specific matching, partial word matching and the rest of the features, without the pain of full text search?
I'm open to any suggestion. I've wondered if a document/hash table nosql data store (MongoDB, et al) might be of use? ( http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo ). Any experience with these is appreciated.
Again, just making sure we aren't missing something with our in-house customized version. If there is something "off the shelf" I would be interested in it. Or if you've built something from some components, what components (search engines, data stores, etc) did you use and why?
You can also make your point for FTS. Just make sure it meets the requirements above before you say "just use Full Text Search because that's the only tool we have."
I ended up coding my own.
The results are fantastic. Users like it, it works well with our existing technologies.
It really wasn't that hard. Just took some time.
Features:
Faceted search (amazon, walmart, etc)
Partial word search (the real stuff not full text)
Search databases (oracle, sql server, etc) and non database sources
Integrates well with our existing environment
Maintains relations, so I can have an n-to-n search and display
--> this means I can display child records of a master record in search results
--> also I can search any child field and return the master record
It's really amazing what you can do with dictionaries and a lot of memory.
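The answer does not show its code, so the following is only a guess at the general shape: rows held in memory as dictionaries, a substring test for partial word matching (similar in spirit to CHARINDEX / INSTR), and value counts for the Amazon-style filters. All names and the sample columns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: in-memory partial matching with facet counts.
class FacetedPartialSearchSketch {
    // Each row maps column name -> value, as it might be loaded from any source.
    private final List<Map<String, String>> rows = new ArrayList<>();

    void add(Map<String, String> row) {
        rows.add(row);
    }

    // Partial word match: case-insensitive substring test on one column.
    // White space inside the value or the fragment is matched literally.
    List<Map<String, String>> search(String column, String fragment) {
        String needle = fragment.toLowerCase();
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> row : rows) {
            String value = row.get(column);
            if (value != null && value.toLowerCase().contains(needle)) {
                hits.add(row);
            }
        }
        return hits;
    }

    // Facet counts: how many hits carry each value of the given column.
    static Map<String, Integer> facet(List<Map<String, String>> hits, String column) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> row : hits) {
            String value = row.get(column);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Searching product_id for "ny 10" and then faceting the hits on company gives the checkbox list with counts. A linear scan like this is fine for data that fits in memory, which is presumably the point about "a lot of memory".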
I recommend looking into Solr, I believe it will meet your needs:
http://lucene.apache.org/solr/
For an off-the-shelf solution: Have you checked out the Google Search Appliance?
Quote from the Google Mini/GSA site:
... If direct database indexing is a requirement for you, we encourage you to consider the Google Search Appliance, which has direct database connectivity.
And of course it indexes everything else in the Googly manner you'd expect it to.
Apache Solr is a good way to start your project, and it is open source. You can also try Elastic Search, and there are a lot of off-the-shelf products which offer good customization abilities and search features, such as Coveo, SharePoint Fast, Google ...

Resources