solr schema for article->paragraph structure - solr

I want to index some articles and show the paragraph number in the search result. So I guess the solr schema should look like this:
article_id, paragraph_number, paragraph_content
Therefore, I need to parse each article first, extract its paragraphs, and index them one by one.
I'm worried about the performance since one article can contain 100 paragraphs.
Any suggestion?

It is better to do the heavy lifting at index time rather than search time. So parsing the paragraphs out of the document when you index is probably the right way to go.
How many articles do you have? It really shouldn't be a problem to strip paragraphs (we do much more complex pre-processing than that).
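For illustration, here is a minimal sketch of that index-time split. The core name, update URL, and the blank-line paragraph heuristic are assumptions, not anything from the question:

import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/articles/update"  # hypothetical core

def index_article(article_id, text):
    # Naive split on blank lines; swap in whatever paragraph parser you use.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    docs = [
        {
            "id": "%s-%d" % (article_id, n),
            "article_id": article_id,
            "paragraph_number": n,
            "paragraph_content": p,
        }
        for n, p in enumerate(paragraphs, start=1)
    ]
    # One update request per article; a batch of 100 paragraph docs is
    # small by Solr standards.
    requests.post(SOLR_UPDATE_URL, json=docs,
                  params={"commit": "true"}).raise_for_status()

(Committing per article is only for the sketch; in a bulk load you would commit once at the end or rely on autoCommit.)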

If you only need to match individual paragraphs against the fulltext query (as opposed to filters etc.), you could also do this using highlighting -- split up the paragraphs, prefix each one with its paragraph number, and then index the paragraphs as multiple values in a single field in a single document. At search time, you'd do a highlight on the field with a full match (e.g. fragment size of -1) and no decoration of the highlight; so what you'd get back is the paragraph that matched the fulltext query, prefixed by its paragraph number (which you'd probably want to then pull back out).
Not sure if this fits your use case exactly but might be an interesting approach to try -- I do something similar to identify photos whose caption matches the fulltext query to display next to article search results.
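A sketch of what that query could look like, assuming the multivalued field is named paragraphs and each value was indexed with a numeric prefix such as "0042|" (both hypothetical):

?q=paragraphs:(your fulltext query)&hl=true&hl.fl=paragraphs&hl.fragsize=0&hl.simple.pre=&hl.simple.post=

hl.fragsize=0 asks for whole-value fragments and the empty pre/post parameters leave the highlight undecorated, so each snippet returned is a complete paragraph with its number prefix, which the client then splits back apart.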

Related

search results by distinct field - cloudsearch / solr / lucene

This is on Amazon Cloudsearch, but it probably holds true for any generic Lucene/Solr installation.
I am indexing a bunch of articles and comments on those articles to be searched. When I search for "Trump sucks", I want the ability to get back a list of comments that match, or a list of articles which have comments that match.
I know I can index them in 2 separate domains, but I wonder if there is an easier way to do a "distinct" on a field... in other words...
I have an indexed document for each comment, which also contains the article_id as a field, so:
id=1 {'article_id':10}
id=2 {'article_id':10}
Right now, if both of these comments match, I will get back 2 results (and yes, I can do a distinct on the client side, but it would mess up paging and such). I want to be able to just get back [10].
There is no way to do distinct in CloudSearch so you will need to come up with another solution.
The best I can offer is to concatenate all comments into a single text field on article records and add a type field to differentiate comments and articles (if you don't already have one). You can then query on type=Article while searching over the concatenated comments and article body and will only ever receive one result per article.
Even with thousands of comments concatenated into a single field on each article, I am sure CloudSearch will perform well (maybe even better than with tens of thousands of extra records to consider). However, your update process to concatenate all the comments might get heavy. If you do get thousands of comments, then a flag tracking whether an article's comments have already been concatenated, so you don't have to rebuild the field every time, will be helpful.
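In the same notation as the comment documents above, the article record would carry its own body plus the concatenated comments; the type and comments_text field names are hypothetical:

id=10 {'type':'Article', 'body':'...', 'comments_text':'text of comment 1 text of comment 2 ...'}

A search then pins the type while matching either field, e.g. with CloudSearch's structured parser something like:

q=(and type:'Article' (or body:'trump' comments_text:'trump'))&q.parser=structured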

Similarity/approximate queries in Solr

What is the simplest way to query Solr for the documents that contain text similar to a (longish) passage? This is similar to what ElasticSearch match queries do, or what probabilistic search engines like Indri do by default. It is something between an AND and an OR query: none of the terms is required, but you get documents that contain many of the terms. You can also just pass a passage of raw text to the engine and it returns documents with high term overlap with the passage, without having to parse or tokenize the text in the client. The best option I can see in the Solr query reference is to tokenize the query text myself, insert an OR between each pair of terms, and return the top N results. Is there a more concise way of doing it with Solr?
The short answer is MoreLikeThis (MLT). You can choose to find documents similar to another document in the index, similar to a given external URL, or similar to some given text. You can choose which field(s) to target and various other parameters. Here's the official Solr Reference Guide documentation page for MLT: https://cwiki.apache.org/confluence/display/solr/MoreLikeThis
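Two quick sketches, assuming a text field named content and the MoreLikeThisHandler mapped at /mlt (both assumptions about your setup):

Similar to a document already in the index, via the MLT search component:

?q=id:abc123&mlt=true&mlt.fl=content&mlt.mintf=1&mlt.mindf=1

Similar to a raw passage, passed to the handler as a content stream:

/mlt?mlt.fl=content&mlt.interestingTerms=list&stream.body=paste+the+long+passage+here

The second form is the direct answer to "pass a passage of raw text without tokenizing it in the client".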

Find similar results with Lucene / SOLR index

We have an application for tagging user selections over a large corpus of MS Word documents. We tag these selections with one or more keyword tags, and usually a title tag. We want to add a feature where the selected text is instantly analyzed, and the tagger is presented with a list of the most likely keyword and title tags (based on the existing tagged text selections).
We are using a SOLR index. I have been told that we can simply issue the selected text as the query itself to return similar selections. However, the selected text could be anywhere between 200 and 6000 words long. A 6000 word query may be a problem in terms of memory usage!
I thought we could do some very aggressive stopword removal to significantly reduce the number of words in the queries, leaving only the very meaningful words. We have been working with this corpus for the last 10 years and we are very familiar with the subject matter and the vocabulary used, so this would be easy for us to do. But the problem is that we also use the same index for allowing the normal users to search the index, and if we remove too many common words, then their normal queries may not work properly (especially phrase queries).
We would also like to boost the results that contain the text of the query within a smaller range, rather than just spread arbitrarily throughout the document.
Another issue is that we allow nested selections. The outer selection may be more general in nature and be around 5000 words long, and the inner selections will be shorter and topically more specific. However, since both selections contain the same text, SOLR ranks them both highly, when the outer selection may not be so relevant.
I have spent the last few days going through the SOLR query parser documentation, and it looks like this should be doable, but I'm still not sure exactly what I need to do to make this work. Any suggestions would be much appreciated.
Solr has a multi-core facility, so if you keep one core for your internal work and expose the other core publicly, that may solve your issue.
You can refer to this section:
http://wiki.apache.org/solr/Solr.xml%20(supported%20through%204.x)
or to the "Solr cores and solr.xml" section in the Solr reference manual.
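A minimal legacy-style solr.xml (the 4.x-and-earlier format the wiki page covers) along those lines, with hypothetical core names:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="public-search" instanceDir="public-search" />
    <core name="internal-tagging" instanceDir="internal-tagging" />
  </cores>
</solr>

The two cores can share field types but use different stopword lists, so aggressive stopword removal on the internal core cannot break the public users' phrase queries.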

Showing human readable most frequent indexed terms using a stemmed field with Solr faceted search

We are planning on using Solr to show the users the "n" most frequent terms from a field and we want to apply stemming so that similar terms get grouped.
Now, we need to show the terms to the users but the stemmed terms are not always human readable. Is there any way to get an example of the original terms that got stemmed so that those could be shown to the user?
The only solution we can think of is querying two different fields, one with stemming and one without, and then doing the matching ourselves. But we think that is going to be expensive (two queries) and may be error prone (the matching may produce errors).
Is there any other way to implement this on Solr? Thanks in advance.
Stemming is applied at both query time and index time so I don't think there is an easy way to accomplish what you're trying to do. However, it may be possible, depending on the number of results in your database, to do this by employing a combination of faceting and highlighting. The highlighted term will be the entire matching term rather than the stemmed term (so, for example, the stemmed term might be "associ" but the highlighted terms will be "associated", "association", "associations", etc.). Perhaps what you could do is the following:
?q=keyword&facet=true&facet.field=myfield&facet.limit=20&hl=true&hl.fl=myfield&hl.fragsize=0&rows=10
Getting 10 rows and examining the highlighted results (by default, these are highlighted using <em> </em> tags but you can change this by using hl.simple.pre and hl.simple.post -- for example, using &hl.simple.pre=[&hl.simple.post=] would wrap the matching terms in square brackets) should at least give a sample of the "original" matching terms. hl.fragsize=0 returns the entire field along with highlighting.
Hope this helps. You can read more about highlighting parameters here:
http://wiki.apache.org/solr/HighlightingParameters
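A small sketch of mining the "original" terms out of such a response, assuming the square-bracket markers described above and a core at the usual /solr/<core>/select URL (core and field names are hypothetical):

import re
from collections import Counter
import requests

params = {
    "q": "keyword", "rows": 50, "wt": "json",
    "facet": "true", "facet.field": "myfield", "facet.limit": 20,
    "hl": "true", "hl.fl": "myfield", "hl.fragsize": 0,
    "hl.simple.pre": "[", "hl.simple.post": "]",
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params).json()

# The stemmed facet terms come back as a flat [term, count, term, count, ...] list.
stemmed = resp["facet_counts"]["facet_fields"]["myfield"]

# Tally the unstemmed forms Solr actually highlighted, e.g. "associated",
# "associations" for the stemmed term "associ".
originals = Counter()
for fragments in resp["highlighting"].values():
    for fragment in fragments.get("myfield", []):
        originals.update(re.findall(r"\[([^\]]+)\]", fragment))
print(originals.most_common(10))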

Keyword to SQL search

Use Case
When a user goes to my website, they will be confronted with a search box much like SO's. They can search for results using plain text: ".net questions", "closed questions", ".net and java", etc. The search will function a bit differently than SO's, in that it will try to use as much of the schema of the database as possible rather than doing a straight fulltext search. So ".net questions" will only search for .net questions as opposed to .net answers (probably not applicable to the SO case, just an example here), "closed questions" will return questions that are closed, and ".net and java" will return questions that relate to .net and java and nothing else.
Problem
I'm not too familiar with the terminology, but I basically want to do a keyword-to-SQL driven search. I know the schema of the database, and I can also data-mine the database. I want to know what approaches already exist before I try to implement this. I guess this question is asking for a good design for the stated problem.
Proposed
My proposed solution so far looks something like this:
Clean the input. Just remove any special characters.
Parse the input into chunks of data. Break an input of "c# java" into c# and java. Also handle special cases like "'c# java' questions", which becomes 'c# java' and "questions".
Build a tree out of the input.
Bind the data to metadata. So convert stuff like closed questions and relate it to the isclosed column of a table.
Convert the tree into a SQL query (see the sketch below).
Thoughts/suggestions/links?
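For illustration, a minimal sketch of steps 1, 2, 4 and 5 (the metadata mapping, table and column names are all hypothetical, and step 3's tree is skipped in favour of a flat AND):

import re

# Step 4: hypothetical mapping from keywords to schema metadata.
METADATA = {
    "closed": "q.is_closed = 1",
    "open":   "q.is_closed = 0",
}

def keywords_to_sql(raw):
    # Steps 1-2: clean and chunk, keeping quoted phrases together.
    tokens = re.findall(r"'[^']+'|\S+", raw.lower())
    clauses, params = [], []
    for tok in tokens:
        tok = tok.strip("'")
        if tok in METADATA:
            clauses.append(METADATA[tok])
        elif tok not in ("questions", "and"):
            # Anything unmapped is treated as a tag term; values are bound
            # as parameters, never spliced in, to avoid SQL injection.
            clauses.append("EXISTS (SELECT 1 FROM question_tags t "
                           "WHERE t.question_id = q.id AND t.tag = ?)")
            params.append(tok)
    sql = "SELECT q.* FROM questions q"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

print(keywords_to_sql("closed '.net' questions"))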
I run a digital music store with a "single search" that weights keywords based on their occurrences and the schema in which Products appear, eg. with different columns like "Artist", "Title" or "Publisher".
Products are also related to albums and playlists, but for simpler explanation, I will only elaborate on the indexing and querying of Products' Keywords.
Database Schema
Keywords table - a weighted table for every word that could possibly be searched for (hence, it is referenced somewhere) with the following data for each record:
Keyword ID (not the word),
The Word itself,
A Soundex Alpha value for the Word
Weight
ProductKeywords table - a weighted table for every keyword referenced by any of a product's fields (or columns) with the following data for each record:
Product ID,
Keyword ID,
Weight
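A rough sqlite3 rendering of those two tables (names and types are guesses from the description above):

import sqlite3

con = sqlite3.connect("music.db")  # hypothetical database file
con.executescript("""
CREATE TABLE IF NOT EXISTS Keywords (
    keyword_id INTEGER PRIMARY KEY,
    word       TEXT NOT NULL UNIQUE,
    soundex    TEXT,              -- Soundex alpha value for the word
    weight     INTEGER DEFAULT 0  -- higher = more common = less distinctive
);
CREATE TABLE IF NOT EXISTS ProductKeywords (
    product_id INTEGER NOT NULL,
    keyword_id INTEGER NOT NULL REFERENCES Keywords(keyword_id),
    weight     INTEGER DEFAULT 0,
    PRIMARY KEY (product_id, keyword_id)
);
""")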
Keyword Weighting
The weighting value is an indication of how often the word occurs. Matching keywords with a lower weight are "more unique" and are more likely to be what is being searched for. In this way, frequently occurring words are automatically "down-weighted", eg. "the", "a" or "I". However, it is best to strip out atomic occurrences of those common words before indexing.
I used integers for weighting, but using a decimal value will offer more versatility, possibly with slightly slower sorting.
Indexing
Whenever any product field is updated, eg. Artist or Title (which does not happen that often), a database trigger re-indexes the product's keywords like so inside a transaction:
All product keywords are disassociated and deleted if no longer referenced.
Each indexed field (eg. Artist) value is stored/retrieved as a keyword in its entirety and related to the product in the ProductKeywords table for a direct match.
The keyword weight is then incremented by a value that depends on the importance of the field. You can add or subtract weight based on the importance of the field: if Artist is more important than Title, subtract 1 or 2 from its ProductKeyword weight adjustment.
Each indexed field value is stripped of any non-alphanumeric characters and split into separate word groups, eg. "Billy Joel" becomes "Billy" and "Joel".
Each separate word group for each field value is soundexed and stored/retrieved as a keyword and associated with the product in the same way as in step 2. If a keyword has already been associated with a product, its weight is simply adjusted.
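A sketch of the splitting and soundexing in steps 4 and 5, using a deliberately simplified Soundex (it ignores the H/W adjacency rule of the full algorithm):

import re

def soundex(word):
    # Simplified Soundex: first letter plus up to three digits, zero-padded.
    codes = {c: d for letters, d in
             [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6")]
             for c in letters}
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    out, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out += digit
        prev = digit
    return (out + "000")[:4]

def field_keywords(value):
    # Step 2: the whole field value is one keyword; steps 4-5: each word
    # group is also a keyword, along with its soundex form.
    words = re.findall(r"[A-Za-z0-9]+", value)
    return [value] + words + [soundex(w) for w in words]

print(field_keywords("Billy Joel"))  # ['Billy Joel', 'Billy', 'Joel', 'B400', 'J400']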
Querying
Take the input search string in its entirety and look for a directly matching keyword. Retrieve all ProductKeywords for that keyword into an in-memory table, along with the Keyword weight (different from the ProductKeyword weight).
Strip out all non-alphanumeric characters and split the query into keywords. Retrieve all existing keywords (only a few will match). Join the ProductKeywords for those matching keywords into the in-memory table, along with the Keyword weight, which is different from the ProductKeyword weight.
Repeat Step 2 but use soundex values instead, adjusting weights to be less relevant.
Join retrieved ProductKeywords to their related Products and retrieve each product's sales, which is a measure of popularity.
Sort results by Keyword weight, ProductKeyword weight and Sales. The final summing/sorting and/or weighting depends on your implementation.
Limit results and return product search results to client.
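Continuing the sqlite3 sketch above (and assuming those tables are populated), the core ranking query might look like this; the soundex fallback and the sales join are omitted for brevity:

import sqlite3

con = sqlite3.connect("music.db")  # tables from the earlier sketch

rows = con.execute("""
    SELECT pk.product_id,
           SUM(k.weight + pk.weight) AS score
    FROM   Keywords k
    JOIN   ProductKeywords pk ON pk.keyword_id = k.keyword_id
    WHERE  k.word IN (?, ?)
    GROUP  BY pk.product_id
    ORDER  BY score ASC   -- lower total weight = rarer, more distinctive match
    LIMIT  20
""", ("Billy", "Joel")).fetchall()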
What you are looking for is Natural Language Processing. Strangely enough, this used to be included free as English Query in SQL Server 2000 and prior, but it's gone now.
Some other sources are :
http://devtools.korzh.com/eq/dotnet/
http://www.easyask.com/products/business-intelligence/index.htm
The concept is a metadata dictionary mapping words to tables, columns, relationships etc., combined with an English sentence parser, to convert an English sentence (or just some keywords) into a real query.
Some people even used English Query with speech recognition for some really cool demos; never saw it used in anger though!
If you're using SQL Server, you can simply use its Full-Text Search feature, which is specifically designed to solve your problem.
You could use a hybrid approach: take the full-text search results and further filter them based on the metadata from step 4 of your proposal. For something more intelligent, you could create a simple supervised-learning solution by tracking which links the user clicks on after the search and storing that choice with the key search words in a decision tree. Searches would then be mined from this decision tree.
