How do I facet on text phrases in SOLR?

I know that SOLR can do free text search but what is the best practice for faceting on common terms inside SOLR text fields?
For example, we have a large blob of text (a description of a property) which contains useful text to facet on like 'private garage', 'private garden', 'private parking', 'underground parking', 'hardwood floors', 'two floors', ... dozens more like these.
I would like to create a view which lets users see the number of properties with each of these terms and allow the users to drill down to the relevant properties.
One obvious solution is to pre-process the data, parse the text, and create the facets for each of these key phrases with a boolean yes/no value.
I'd ideally like to automate this, and I imagine the SOLR free text search engine might allow it. For example, can I use the free text search engine to remove stop words and collect counts of common phrases, which we can then present to the user?
If pre-processing is the only way, is there a common/best practice approach to this or any open source libraries which perform this function?
What is the best practice for counting and grouping common phrases from a text field in SOLR?

The problem is that faceting on text fields (non-string fields) with a custom analysis chain is rather expensive. You may try using shingles, i.e. breaking your input into an array of overlapping bi-grams. If you are going to use Solr 4, make sure to set docValues=true on the text field definition. This may speed things up, or at least save you RAM.
The bi-gramming can be achieved using ShingleFilterFactory: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
Beware that it is still quite compute-intensive.
This may work if your data set isn't too large (what counts as 'too large' is a separate discussion) or if you can shard the data appropriately.
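As a rough sketch (the core name, host and the shingled field name are assumptions, not from the question): once the description text is indexed into a field whose analysis chain includes ShingleFilterFactory, you can facet on it like any other field and read back the phrase counts:

    import requests

    # Hypothetical core and field names; 'description_shingles' is assumed to be
    # a field whose analysis chain includes solr.ShingleFilterFactory.
    SOLR_SELECT = "http://localhost:8983/solr/properties/select"

    params = {
        "q": "*:*",
        "rows": 0,                               # only the facet counts are needed
        "facet": "true",
        "facet.field": "description_shingles",
        "facet.mincount": 10,                    # hide rare phrases
        "facet.limit": 50,                       # top 50 bi-grams
        "wt": "json",
    }

    response = requests.get(SOLR_SELECT, params=params).json()
    # Solr returns facet counts as a flat [term, count, term, count, ...] list.
    raw = response["facet_counts"]["facet_fields"]["description_shingles"]
    for phrase, count in zip(raw[::2], raw[1::2]):
        print(phrase, count)

Drilling down to the matching properties is then just a filter query, e.g. fq=description_shingles:"private garage".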

Related

How to associate subsection meta-data in SOLR searchable text

I'd like to make the text of a book searchable in SOLR, and I'd like to include the page number(s) where the matching text can be found in the original book.
I'm wondering what mechanisms SOLR might have to associate a page number with the searchable words of text? (To be clear, I'm talking about the page number of the original source text, not anything to do with SOLR result pagination.)
So in essence I need structured text, where each searchable word (ideally each letter, actually, because my real use case is more of a giant substring match that may start anywhere within a word) has some associated metadata. I could put this information in an external datastore if necessary, but I wondered if SOLR has a way to do it natively.
If not, is there another tool better suited to this purpose than SOLR?

Difference between full text and free text search in solr (other search db)

New to search databases and working with one. What is the difference between full text and free text search/index?
They are pretty much the same; more precisely, they are synonyms.
They are techniques used by search engines to find results in a database.
Solr uses the Lucene project for its search engine. It is used when you have a large number of documents to search and can't use LIKE queries in a normal RDBMS for performance reasons.
Mainly it follows two stages: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms. In the search stage, when performing a specific query, only the index is consulted, rather than the text of the original documents.
Suppose you typed John and Ryan; the query will return all the documents that contain either "John" or "Ryan". Order and case sensitivity don't matter.
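For instance, a minimal sketch of such a query over HTTP (the core name is made up, and the default OR operator is assumed, i.e. q.op has not been set to AND):

    import requests

    # Hypothetical core; returns every document whose 'name' field contains
    # "John" or "Ryan", regardless of order or (with a lowercasing analyzer) case.
    response = requests.get(
        "http://localhost:8983/solr/people/select",
        params={"q": "name:(John Ryan)", "wt": "json"},
    ).json()
    print(response["response"]["numFound"])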
In a nutshell, unless you are using the terms in some specific sense, they are just different names for the same thing.
Call him Cristiano or CR7, he's the same person :)

Is there a better way to represent provenance on a field level in SOLR

I have documents in SOLR which consist of fields where the values come from different source systems. The reason why I am doing this is because this document is what I want returned from the SOLR search, including functionality like hit highlighting. As far as I know, if I use join with multiple SOLR documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the dynamic field name so that the first part of the field name is an 8-character hash representing the source system. This way the sources can share common field names after the unique source hash, and I can easily clear out all fields that start with the source prefix if needed.
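To illustrate the naming scheme, here is a rough sketch (the hash function, the 8-character length and the helper names are all illustrative, not an established convention):

    import hashlib

    def source_prefix(source_id: str) -> str:
        # First 8 hex characters of a hash identifying the source system.
        return hashlib.sha1(source_id.encode("utf-8")).hexdigest()[:8]

    def source_field(source_id: str, name: str, suffix: str = "_s") -> str:
        # Build a dynamic field name like '<8-char-hash>_price_s'.
        return f"{source_prefix(source_id)}_{name}{suffix}"

    # Clearing one source's data then means removing (or nulling, via an atomic
    # update) every field whose name starts with source_prefix("systemA") + "_".
    print(source_field("systemA", "price"))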
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at reindexing time. Tracking field names and field removal tends to turn into a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to keep up when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can stay specific to the indexing process. Usually the subsystems will be performant enough (at least if you do batch retrieval based on which documents are being updated).
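As a sketch of that flow (the core URL, the fetch functions and the field values are placeholders): the indexing code simply rebuilds the whole document from every source each time and posts it, so there is nothing source-specific to track afterwards:

    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/entities/update"

    def fetch_from_system_a(entity_id):
        # Placeholder: pull the current values from source system A.
        return {"field_1_s": "value from A"}

    def fetch_from_system_b(entity_id):
        # Placeholder: pull the current values from source system B.
        return {"field_2_s": "value from B"}

    def reindex(entity_id, entity_type, name):
        # Rebuild the complete document from all sources at indexing time,
        # instead of patching individual per-source fields.
        doc = {"id": entity_id, "type": entity_type, "name": name}
        doc.update(fetch_from_system_a(entity_id))
        doc.update(fetch_from_system_b(entity_id))
        requests.post(SOLR_UPDATE, json=[doc], params={"commit": "true"})

    reindex("42", "property", "Example property")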

Find similar results with Lucene / SOLR index

We have an application for tagging user selections over a large corpus of MS Word documents. We tag these selections with one or more keyword tags, and usually a title tag. We want to add a feature where the selected text is instantly analyzed, and the tagger is presented with a list of most-likely keyword and title tags (based on the existing tagged text selections)
We are using a SOLR index. I have been told that we can simply issue the selected text as the query itself to return similar selections. However, the selected text could be anywhere between 200 and 6000 words long. A 6000 word query may be a problem in terms of memory usage!
I thought we could do some very aggressive stopword removal to significantly reduce the number of words in the queries, leaving only the very meaningful words. We have been working with this corpus for the last 10 years and we are very familiar with the subject matter and the vocabulary used, so this would be easy for us to do. But the problem is that we also use the same index for allowing the normal users to search the index, and if we remove too many common words, then their normal queries may not work properly (especially phrase queries).
We would also like to boost the results that contain the text of the query within a smaller range, rather than just spread arbitrarily throughout the document.
Another issue is that we allow nested selections. The outer selection may be more general in nature and be around 5000 words long, while the inner selections will be shorter and topically more specific. However, since both selections contain the same text, SOLR ranks them both highly, even though the outer selection may not be as relevant.
I have spent the last few days going through the SOLR query parser documentation, and it looks like this should be doable, but I'm still not sure exactly what I need to do to make this work. Any suggestions would be much appreciated.
Solr has a multi-core facility, so if you keep one core for your internal work and expose the other core to the public, it may solve your issue.
You can refer to this section:
http://wiki.apache.org/solr/Solr.xml%20(supported%20through%204.x)
or the 'Solr Cores and solr.xml' section in the Solr Reference Guide.
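In practice that just means the two audiences query different core URLs, each core with its own schema and analysis chain; a minimal sketch (the core names and the aggressive-stopword setup are assumptions):

    import requests

    # Two cores over the same content: 'tagging' could apply aggressive stopword
    # removal for the similarity queries, while 'public' keeps the standard
    # analysis so normal (phrase) queries keep working.
    TAGGING_CORE = "http://localhost:8983/solr/tagging/select"
    PUBLIC_CORE = "http://localhost:8983/solr/public/select"

    def similar_selections(selected_text, rows=10):
        # Issue the (already trimmed-down) selection text as the query itself.
        return requests.get(
            TAGGING_CORE, params={"q": selected_text, "rows": rows, "wt": "json"}
        ).json()

    def public_search(user_query, rows=10):
        return requests.get(
            PUBLIC_CORE, params={"q": user_query, "rows": rows, "wt": "json"}
        ).json()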

How do I index different sources in Solr?

How do I index text files, websites and a database in the same Solr schema? All three sources are a requirement, and I'm trying to figure out how to do it. I did some examples and they work fine while separate from each other; now I need them all in one schema, since the user will be searching across all three data sources.
How should I proceed?
You should sketch up a few notes for each of your content sources:
What meta-data is available
How is the information accessed
How do I want to present the information
Once that is done, determine which meta-data you want to make searchable. Some of it might be very specific to just one of the content sources (such as author on web pages, or any given field in a DB row), while others will be present in all sources (such as unique ID, title, text content). Use copy-fields to consolidate fields as needed.
Meta-data will vary greatly from project to project, but yes -- things like update date, filename, and any structured data you can parse out of the text files will surely help you improve relevance. Beyond that, it varies a lot from case to case. Maybe the file paths hint at a (possibly informal) taxonomy you can use as metadata. Maybe filenames contain metadata themselves (such as year, keyword, product names, etc).
Be prepared to use different fields for different sources when displaying results. A source field goes a long way in terms of creating result tiles -- and it might turn out to be your most used facet.
An alternative (and probably preferred) approach to using copy-fields extensively is to use the DisMax/eDisMax request handlers to facilitate searching in several fields.
Consider using a mix of copy-fields and (e)dismax. For instance, copy all fields into a catch-all text field that need not be stored, and include it in searches with a low boost value, while including highly weighted fields (such as title, headings, keywords, or filename) in the search. There are a lot of parameters to tweak in dismax, but it's definitely worth the effort.
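As a sketch of that mix (field names and boost values are illustrative only): everything is copied into a low-boosted, unstored catch-all field, while the important fields are listed with higher weights in qf:

    import requests

    params = {
        "q": "annual report 2012",
        "defType": "edismax",
        # Highly weighted fields first; the unstored catch-all 'text' field
        # last, with a deliberately low boost.
        "qf": "title^10 keywords^5 filename^3 author^2 text^0.2",
        "pf": "title^10 text^2",     # extra boost when the terms appear as a phrase
        "fl": "id,title,source",
        "facet": "true",
        "facet.field": "source",     # the per-source field mentioned above
        "wt": "json",
    }

    results = requests.get("http://localhost:8983/solr/docs/select", params=params).json()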
