How to search embedded Mongoid documents with Sunspot/Solr?

Does anyone know how to index and search embedded documents with sunspot_mongoid?
The question has been asked in the sunspot_mongoid issues, but has had no solution so far.

Just tried it. It's a hack, but it works for searching embedded documents and returning the parent document that holds them. Is that what you want? If so, do this: define a method that joins the embedded fields you want into a single text value, and then index that method.
Assuming you have a Company class with embedded departments.
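For concreteness, a minimal sketch of the models this assumes (the Department class, its name field, and the sunspot_mongoid include are illustrative, not something given in the question):

class Company
  include Mongoid::Document
  include Sunspot::Mongoid  # provided by the sunspot_mongoid gem
  embeds_many :departments
end

class Department
  include Mongoid::Document
  embedded_in :company
  field :name, :type => String
end

Then, inside Company: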
searchable do
  # your regular index
  # ...
  text :company_departments
end

def company_departments
  departments.map(&:name).join(" ")
end
Reindex and try a search.
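For example, a search might look like this (the query term and the use of Sunspot.index! are illustrative; sunspot_rails users can call Company.reindex instead):

Sunspot.index!(Company.all)  # push the existing records into Solr
search = Company.search do
  fulltext "marketing"       # also matches the indexed company_departments text
end
search.results               # the parent Company documents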

You can also include a block that returns the text you want to index right in the searchable block. For example:
searchable do
  text :innerdoc do
    innerdocs.map { |i| i.title + ' ' + i.description }
  end
end
That takes the title and description from an embedded array of "innerdocs" and adds it to the index for the main document.
The sunspot docs have the best info on the syntax for the "searchable" block:
http://outoftime.github.com/sunspot/docs/

Related

Searching wiki URLs using Solr

I am trying to index and search a wiki on our intranet using Solr. I have it more-or-less working using edismax but I'm having trouble getting main topic pages to show up first in the search results. For example, suppose I have some URLs in the database:
http://whizbang.com/wiki/Foo/Bar
http://whizbang.com/wiki/Foo/Bar/One
http://whizbang.com/wiki/Foo/Bar/Two
http://whizbang.com/wiki/Foo/Bar/Two/Two_point_one
I would like to be able to search for "foo bar" and have the first link returned as the top result because it is the main page for that particular topic in the wiki. I've tried boosting the title and URL field in the search but the fieldNorm value for the document keeps affecting the scores such that sub-pages score higher. In one particular case, the main topic page shows up on the 2nd results page.
Is there a way to make the first URL score significantly higher than the sub-pages so that it shows up in the top 5 search results?
One possible approach to try:
1. Create a copyField with your URL.
2. Extract the path only (so, no host, no wiki prefix).
3. Split on / and maybe on spaces.
4. Lowercase the tokens.
5. Boost on a phrase or bigram match, or something similar.
If you have a lot of levels, maybe you want a multivalued field, with different depths (starting from the end) getting separate entries. That way a perfect match will get a better score. Here, you should start experimenting with your real searches.
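To make the target of steps 2-4 concrete, here is a small Ruby sketch of the tokens such an analysis chain should produce (in Solr itself this would be done by your field type, not by client code):

require 'uri'

# Turn a wiki URL into lowercase path tokens: drop the host and the
# /wiki prefix, then split on slashes, whitespace and underscores.
def url_path_tokens(url)
  URI.parse(url).path
     .sub(%r{\A/wiki/}, '')
     .split(%r{[/\s_]+})
     .map(&:downcase)
end

url_path_tokens('http://whizbang.com/wiki/Foo/Bar/Two_point_one')
# => ["foo", "bar", "two", "point", "one"]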

Finding the most common terms in my Solr collection

I need to identify potential stopwords in my Solr collection. Is it possible to find those terms which have the highest document frequency in my collection (or at least in a given shard)?
Yes, use Lucene's HighFreqTerms, like:

import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;

// The 10 terms with the highest document frequency in "myContentField".
TermStats[] stats = HighFreqTerms.getHighFreqTerms(reader, 10, "myContentField",
    new HighFreqTerms.DocFreqComparator());
for (TermStats stat : stats) {
    System.out.println(stat.termtext.utf8ToString() + ", docfreq: " + stat.docFreq);
    // ...or whatever else you want to do with them.
}
Luke also prominently displays the most common terms.
Since you already have Solr set up, you can use the TermsComponent to get the term frequencies for any given field:
http://wiki.apache.org/solr/TermsComponent
If you have a default search field (the destination of your copyField rules), it should give you the frequencies across all the copied fields.
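For example, assuming the /terms handler from the stock example solrconfig.xml and a field named text (both are assumptions about your setup), the 20 most frequent terms would be:

http://localhost:8983/solr/terms?terms.fl=text&terms.limit=20&terms.sort=count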

Solr sort: I want to specify a particular document first

I want to specify that a particular document comes first in the Solr sort order.
For example, if the results are 5, 2, 3, 1, I want document 2 first and the others sorted according to the normal rules:
2, 1, 3, 5
How can I do this?
I know of two ways you can try to tackle this using Solr.
The first is to use the QueryElevationComponent. This lets you define the top results at index time. As suggested in the documentation, this is good for placing sponsored results or popular documents at the top of the search results. The potential downside is that you have to be able to identify those documents at index time and not at query time.
The other approach is to boost the desired documents at query time using the bq parameter. To boost document 435, you would do something like this:
...&bq=id:435^10
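A full dismax request using it might look like this (the query and qf values are illustrative):

q=ipod&defType=dismax&qf=name+description&bq=id:435^10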
Unfortunately, neither of these approaches gives you absolute control over the order of the results.
The solution provided by Riking would certainly do the job if you don't mind processing the results after performing the search. Another approach you could consider is to add a field to your Solr schema that defines a display order or priority. You can then sort on that field to get the desired sort order.
If you are using Solr 3.1 or later, you can sort by a function query. The map function is useful for this.
sort=map(field_name,5,5,0) asc
Here, field_name is the name of the numeric field you want to sort by, the two 5s are the field value of the document you want pushed to the front, and 0 must be replaced with some number that you know is less than all the other values in that field.
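Applied to the example in the question (push document 2 to the front, everything else ascending, assuming id is a numeric field), that would be:

sort=map(id,2,2,0) asc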
Call the built-in sort() function, then shift the desired element to the front.
For example, with a Java List, in case you do not have a built-in method to shift it to the front:

// Move `desired` to index 0, keeping the relative order of the rest.
int dIndex = list.indexOf(desired);
for (int i = dIndex; i > 0; i--) {
    list.set(i, list.get(i - 1));  // shift the preceding elements right
}
list.set(0, desired);
In case you use the standard query parser (not dismax), add "OR id:2^1000" to your query, like this:
q=(text:lalala AND author:Bob) OR id:2^1000
That will place the document with ID 2 at the top of the results.

Oracle Text - Index a BLOB Field (which contains PDF data)

Do any of you have any experience with using Oracle Text to search for content inside PDF files?
I have a table, with a field called FILEDATA(blob).
I would like to do the following query:
SELECT id FROM ttc.contract_attachment WHERE CONTAINS(filedata, 'EXAMPLE') > 0;
However, I'm not too sure about the type of index to add to it.
I found the following code:
begin
  ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-');
end;
/
create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT)
  indextype is ctxsys.context
  parameters ('lexer doc_lexer sync (on commit)');
Ref: http://www.devx.com/dbzone/Article/21563/1954
I have no idea what BASIC_LEXER is. I'm at a bit of a loss. I shall endeavour to continue searching for an answer. Any help would be great.
Thanks.
I've used Oracle Text to index not only PDFs but other data like XML structures. Oracle has the concept of lexers, which take content and parse, tokenize, and index the tokens. The basic lexer handles English words; there are other lexers for Chinese, Japanese, Korean, etc. The printjoins attribute allows you to index characters that are normally excluded, such as hyphens, quotes, etc.
The index you have defined above will work. Keep in mind that, by default, Oracle Text indexing is an asynchronous process: the commit occurs and then, sometime in the future, the document is indexed, so you would normally need to synchronize the index as part of a scheduled job or the like. With the option "sync (on commit)" on your index, the document is indexed as part of the transaction instead. This is noteworthy only if you are indexing sizable PDF documents.
I would recommend utilizing progressive relaxation for any search you may want to run, as it can begin with a restrictive search and expand out to a more generic search, thereby providing the user with results of decreasing relevancy. For instance:
<query>
  <textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
    <progression>
      <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
      <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
      <seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
    </progression>
  </textquery>
  <score datatype="INTEGER" algorithm="COUNT"/>
</query>
The above query tokenizes the search keywords "cat dog" and first attempts to find them as a phrase, then any document containing cat AND dog (not necessarily beside each other), then any document containing cat OR dog, with documents containing both words scored higher than documents containing just one. Furthermore, the structure automatically dedupes the results as it returns them.
All of that being said, you could simply define your index as:
create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT)
  indextype is ctxsys.context
  parameters ('sync (on commit)');
and it would probably work very well for your needs. You would only need to change the behavior of the lexer if you have a need for doing so. I hope this helps.

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming? My first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming, because wildcard expansion works by searching through the list of terms stored in the inverted index, and those terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
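A toy Ruby illustration of that mismatch (the stem "gatorad" is just the running example from above, not necessarily what your stemmer produces):

# The inverted index stores stems, not the original surface forms.
index_terms = %w[gatorad quench thirst]

index_terms.grep(/\Agatorade/)  # => []            "gatorade*" expands to nothing
index_terms.grep(/\Agatorad/)   # => ["gatorad"]   a shorter prefix would match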
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
So what I was looking for is to rewrite the search term 'gatorade' as 'gatorade OR gatorade*', which gives me all the matches I'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
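A trivial client-side way to build that query string before handing it to Solr (purely illustrative; the method name is made up):

# Match either the analyzed/stemmed form of the term or any word
# beginning with it.
def with_prefix_fallback(term)
  "+(#{term} #{term}*)"
end

with_prefix_fallback('gatorade')  # => "+(gatorade gatorade*)"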
Another alternative is to use n-grams via the token filter factories, specifically the EdgeNGramFilterFactory.
This creates index entries for n-grams, i.e. leading parts of words. The word "Documents", with a min gram size of 5 and a max gram size of 8, would be indexed as: Docum, Docume, Documen, Document.
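A quick Ruby sketch of what such a filter emits for a single token, just to make that concrete (the real work is done by EdgeNGramFilterFactory in your field type):

# Emit the leading substrings between min and max characters long.
def edge_ngrams(token, min, max)
  (min..[max, token.length].min).map { |n| token[0, n] }
end

edge_ngrams('documents', 5, 8)
# => ["docum", "docume", "documen", "document"]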
There is a bit of a tradeoff in index size and time. One of the Solr books quotes, as a rough guide: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms. However, the EdgeNGram filter will do better than that.
You do need to make sure that you don't submit wildcard characters in your queries: you aren't doing a wildcard search anymore, you are matching the search term against n-grams (parts of words).
My guess is the missing matches are "Gatorade" (with a capital G), and you have a lowercase filter on your field. The idea is that the filters in your schema.xml preprocess the input data, but wildcard queries do not go through them. See this post on how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling")
From what I've read, wildcards only matched words with additional characters after the search term: "Gatorade*" would match "Gatorades" but not "Gatorade" itself. It appears there's been an update in Solr 3.6 that takes this into account by using a 'multiterm' analyzer instead of the plain 'text' analysis.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution
