How to associate subsection metadata with searchable text in SOLR

I'd like to make the text of a book searchable in SOLR, and I'd like to include the page number(s) where the matching text can be found in the original book.
I'm wondering what mechanisms SOLR might have to associate a page number with the searchable words of text? (To be clear, I'm talking about the page number of the original source text, not anything to do with SOLR result pagination.)
So in essence I need structured text, whereby each searchable word (ideally each letter, actually, because my real use case is more of a giant substring match that may start anywhere within a word) has some associated metadata. I could put this information in an external datastore if necessary, but I wondered if SOLR has a way to do it natively.
If not, is there another tool better suited to this purpose than SOLR?
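A minimal sketch of one common way to model this: index each page (or smaller subsection) as its own Solr document, so every hit already carries its page number. The core name books and the fields book_id, page, and text below are illustrative assumptions, not anything from the thread:

    import requests

    SOLR = "http://localhost:8983/solr/books"

    # Index each page as a separate document carrying its page number.
    pages = [
        {"id": "moby-dick-p1", "book_id": "moby-dick", "page": 1,
         "text": "Call me Ishmael. Some years ago..."},
        {"id": "moby-dick-p2", "book_id": "moby-dick", "page": 2,
         "text": "...never mind how long precisely..."},
    ]
    requests.post(f"{SOLR}/update?commit=true", json=pages).raise_for_status()

    # Every matching document already knows which page it came from.
    r = requests.get(f"{SOLR}/select",
                     params={"q": "text:Ishmael", "fl": "book_id,page"})
    for doc in r.json()["response"]["docs"]:
        print(doc["book_id"], "page", doc["page"])

Matches that span a page boundary would need extra handling (e.g. overlapping a few lines of each page into the next document), and substring-within-word matching would additionally need something like an n-gram analyzer on the text field.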

Related

On elasticsearch (and maybe with logstash), how do I create the simplest index to allow word searching in text files?

I have some text files about territory and land development. I need a global view of the topics and main words they contain.
I would like elasticsearch to create an index for simple keyword searches.
I don't know in advance (and I don't want to bother with) keywords that could describe each parsed text file more accurately: date, author, and title might exist, but they don't interest me yet.
I only need elasticsearch and/or logstash to gather the plain text of each file and index it as-is, to begin with.
Given such a sample file, how can I create the simplest index, either with a curl call to elasticsearch or by means of logstash?
(The file is markdown, but I want to parse it as plain text, without treating it as markdown.)
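For the "simplest index" part, here is a minimal sketch that bypasses logstash and talks to the Elasticsearch HTTP API directly from Python (requests); it assumes Elasticsearch on localhost:9200, and the index name texts is arbitrary:

    import requests

    # Read the markdown file as plain text, ignoring its markup.
    with open("sample.md", encoding="utf-8") as f:
        body = f.read()

    # Index it with no explicit mapping; Elasticsearch auto-creates the
    # "texts" index and treats "content" as a full-text field.
    requests.post("http://localhost:9200/texts/_doc?refresh=true",
                  json={"content": body}).raise_for_status()

    # Simple keyword search over the indexed text.
    r = requests.get("http://localhost:9200/texts/_search",
                     json={"query": {"match": {"content": "development"}}})
    print(r.json()["hits"]["total"])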

Difference between full text and free text search in solr (or other search DBs)

New to search databases and working with one. What is the difference between full text and free text search/index?
They are essentially the same; more precisely, they are synonyms.
They are techniques used by search engines to find results in a database.
Solr uses the Lucene project for its search engine. It is used when you have a large set of documents to search and LIKE queries against a normal RDBMS would be too slow.
Mainly it follows two stages: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms (an inverted index). In the search stage, when a specific query is performed, only the index is consulted, rather than the text of the original documents.
Suppose you typed John and Ryan: the query will return all documents that contain either "John" or "Ryan". Order and case sensitivity don't matter.
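A toy Python illustration of those two stages (purely illustrative, nothing like Lucene's actual implementation): the index is built once, and queries never rescan the documents:

    # Indexing stage: map each lowercased term to the ids of the
    # documents that contain it (a tiny inverted index).
    docs = {
        1: "John met Ryan at the conference",
        2: "Ryan wrote the report",
        3: "nothing relevant here",
    }
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)

    # Search stage: only the index is consulted. "John and Ryan" returns
    # documents containing either term, regardless of order or case.
    hits = index.get("john", set()) | index.get("ryan", set())
    print(sorted(hits))  # -> [1, 2]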
In a nutshell, unless you are using the terms in some specific, narrower sense, they are just different names for the same thing.
Call him Cristiano or CR7; he's the same person :)

Solr multilingual search

I'm currently working on a project where we have indexed text content in SOLR. Each piece of content is written in one specific language (we have 4 different European languages), but we would like to add a feature where, if the primary search (the text entered by the user) doesn't return many results, we also look for documents in the other languages. We would therefore somehow need to translate the query.
Our starting point is that we can maintain a mapping list of translated words commonly used in the project's domain.
One solution that came to me was to use the synonym search feature, but this might not provide the best results.
Do people have pointers to existing modules that could help us achieve this multilingual search feature? Or design ideas we could try to investigate?
Thanks
It seems like multi-lingual search is not a unique problem.
Please take a look
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and
Solr index and search multilingual data
Those two links suggest having dedicated fields for each language, but you can also have a field that states the language, and add a filter query (&fq=) for the language you have detected from the user query. This is a more scalable solution, I think.
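A sketch of that filter-query approach, assuming a core named docs with a language field (both names are made up for illustration):

    import requests

    detected = "fr"  # e.g. from a language detector run on the user query

    # Restrict the search to the detected language via the &fq= filter.
    r = requests.get("http://localhost:8983/solr/docs/select", params={
        "q": "text:maison",
        "fq": f"language:{detected}",
    })
    print(r.json()["response"]["numFound"])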
One option would be to translate your terms at index time. This could probably be done at the Solr level, or even before Solr at the application level, and then you store the translated texts in different fields, so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
And if you want to score primary-language matches higher, you could have a primary_language field and then boost documents where it matches the search language.
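A sketch of both ideas at query time, using the text_en/text_fi fields from the example above; the core name docs and the exact boost values are assumptions:

    import requests

    r = requests.get("http://localhost:8983/solr/docs/select", params={
        "q": "Hello",
        "defType": "edismax",
        # Search both translations, weighting the user's language higher.
        "qf": "text_en^3 text_fi",
        # Additionally boost documents whose primary language matches.
        "bq": "primary_language:en^2",
    })
    print(r.json()["response"]["numFound"])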

How do I facet on text phrases in SOLR?

I know that SOLR can do free text search but what is the best practice for faceting on common terms inside SOLR text fields?
For example, we have a large blob of text (a description of a property) which contains useful text to facet on like 'private garage', 'private garden', 'private parking', 'underground parking', 'hardwood floors', 'two floors', ... dozens more like these.
I would like to create a view which lets users see the number of properties with each of these terms and allow the users to drill down to the relevant properties.
One obvious solution is to pre-process the data, parse the text, and create the facets for each of these key phrases with a boolean yes/no value.
Ideally I'd like to automate this, so I imagine the SOLR free text search engine might allow it? E.g. can I use the free text search engine to remove stop words and collect counts of common phrases, which we can then present to the user?
If pre-processing is the only way, is there a common/best practice approach to this or any open source libraries which perform this function?
What is the best practice for counting and grouping common phrases from a text field in SOLR?
The problem is that faceting on text fields (non-string fields) with a custom analysis chain is rather expensive. You may try using shingles, i.e. breaking your input into an array of overlapping bi-grams. If you are going to use Solr 4, make sure to set docValues=true on the field definition; this may speed things up, or at least save you RAM.
The bi-gramming can be achieved using ShingleFilterFactory: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
Beware that it is still quite compute-intensive.
This may work if your data set isn't too large (subject to a separate definition) or if you can shard the data appropriately.
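A sketch of the shingle setup via the Schema API, followed by a facet over the shingled field; the core name properties and the field names are assumptions:

    import requests

    SOLR = "http://localhost:8983/solr/properties"

    # Define a field type that emits overlapping two-word shingles.
    requests.post(f"{SOLR}/schema", json={
        "add-field-type": {
            "name": "text_shingles",
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [
                    {"class": "solr.LowerCaseFilterFactory"},
                    {"class": "solr.ShingleFilterFactory",
                     "minShingleSize": "2", "maxShingleSize": "2",
                     "outputUnigrams": "false"},
                ],
            },
        },
    }).raise_for_status()

    # Add a shingled copy of the description text.
    requests.post(f"{SOLR}/schema", json={
        "add-field": {"name": "description_shingles",
                      "type": "text_shingles", "stored": False},
    }).raise_for_status()

    # Facet on the shingled field to count common two-word phrases.
    r = requests.get(f"{SOLR}/select", params={
        "q": "*:*", "rows": 0, "facet": "true",
        "facet.field": "description_shingles", "facet.limit": 20,
    })
    print(r.json()["facet_counts"]["facet_fields"]["description_shingles"])

Phrases like "private garage" then show up as single facet values, though stop-word removal and a curated whitelist are usually still needed to keep the facet list clean.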

How do I index different sources in Solr?

How do I index text files, web sites, and a database in the same Solr schema? All 3 sources are a requirement, and I'm trying to figure out how to do it. I did some examples and they work fine while separate from each other; now I need them all in one schema, since the user will be searching across all 3 data sources.
How should I proceed?
You should sketch out a few notes for each of your content sources:
What meta-data is available
How is the information accessed
How do I want to present the information
Once that is done, determine which meta-data you want to make searchable. Some of it might be very specific to just one of the content sources (such as author on web pages, or any given field in a DB row), while others will be present in all sources (such as unique ID, title, text content). Use copy-fields to consolidate fields as needed.
Meta-data will vary greatly from project to project, but yes -- things like update date, filename, and any structured data you can parse out of the text files will surely help you improve relevance. Beyond that, it varies a lot from case to case. Maybe the file paths hint at a (possibly informal) taxonomy you can use as metadata. Maybe filenames contain metadata themselves (such as year, keyword, product names, etc).
Be prepared to use different fields for different sources when displaying results. A source field goes a long way in terms of creating result tiles -- and it might turn out to be your most used facet.
An alternative (and probably preferred) approach to using copy-fields extensively is to use the DisMax/eDisMax request handlers, which facilitate searching across several fields.
Consider using a mix of copy-fields and (e)dismax. For instance, copy all fields into a catch-all text field (which need not be stored) and include it in searches with a low boost value, while also including highly weighted fields (such as title, headings, keywords, or filename) in the search. There are a lot of parameters to tweak in dismax, but it's definitely worth the effort. A sketch of this combination follows.
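This sketch sets up the copy-fields via the Schema API and then queries with edismax; the core name allsources and all field names are assumptions:

    import requests

    SOLR = "http://localhost:8983/solr/allsources"

    # A catch-all field that is searched but never stored or displayed.
    requests.post(f"{SOLR}/schema", json={
        "add-field": {"name": "text_all", "type": "text_general",
                      "stored": False, "multiValued": True},
    }).raise_for_status()

    # Copy the per-source fields into the catch-all.
    for src in ["title", "body", "filename"]:
        requests.post(f"{SOLR}/schema", json={
            "add-copy-field": {"source": src, "dest": "text_all"},
        }).raise_for_status()

    # edismax: high weights on the important fields, a low weight on the
    # catch-all, and a facet on "source" for result tiles / drill-down.
    r = requests.get(f"{SOLR}/select", params={
        "q": "annual report",
        "defType": "edismax",
        "qf": "title^5 filename^3 text_all^0.5",
        "facet": "true",
        "facet.field": "source",
    })
    print(r.json()["response"]["numFound"])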
