My team is attempting to implement Solr and Sunspot-Rails as a search provider for our application. Our story requires that certain string fields be searched in addition to the text fields. I have seen some people aggregate these fields into a consolidated indexing field (with a type of Text) with ActiveRecord callbacks. Is this my only hope or is there a wildcard argument that I am missing?
You can define the fields that will be indexed like this:
class Post < ActiveRecord::Base
  searchable do
    text :title, :body
    text :comments do
      comments.map { |comment| comment.body }
    end
    string :sort_title do
      title.downcase.gsub(/^(an?|the)/, '')
    end
  end
end
Then, when you search, you can specify the fields you want to search like this:
Post.search do
  fulltext 'pizza' do
    fields(:body, :title)
  end
end
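If you don't specify fields, the search applies to all indexed text fields (fulltext does not search string fields, so an attribute that must be searchable needs a text declaration as well):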
Post.search do
  fulltext 'best pizza'
end
Consider a user inputs this search string to a news search engine:
"Oops, Donald Trump Jr. Did It Again (Wikileaks Edition) :: Politics - Paste"
Imagine we have a database of News Titles, and a database of "Important People".
The goal here is: if a search string contains an important person, then return results containing this "substring" with a higher ranking than those results that do NOT contain it.
Using the Yahoo Vespa engine, how can I match a database full of people's names against long news title strings?
*I hope that made sense, sorry everyone, my english not so good :( Thank you !
During document processing/indexing of news titles you could extract named entities from the input text using the "important people" database. This process could be implemented in a custom document processor (see http://docs.vespa.ai/documentation/document-processing-overview.html).
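The extraction step itself can be illustrated independently of Vespa. The following is a minimal Ruby sketch, assuming a small in-memory list stands in for the "important people" database; the constant IMPORTANT_PEOPLE and the method extract_entities are hypothetical names, and the plain substring match is deliberately naive (no word boundaries, no aliases). A real document processor would be a Vespa container component.

```ruby
# Hypothetical stand-in for the "important people" database.
IMPORTANT_PEOPLE = ["Donald Trump Jr.", "Angela Merkel", "Barack Obama"].freeze

# Returns every name from `people` that occurs in `text`, ignoring case.
# Naive substring matching: an assumption of this sketch, not Vespa behaviour.
def extract_entities(text, people = IMPORTANT_PEOPLE)
  haystack = text.downcase
  people.select { |name| haystack.include?(name.downcase) }
end

title = "Oops, Donald Trump Jr. Did It Again (Wikileaks Edition) :: Politics - Paste"
puts extract_entities(title).inspect
```

The returned array is what would be written into the entities field at indexing time.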
A document definition for the news search could look something like this with a custom ranking function. The document processor reads the input title and populates the entities array.
search news {
  document news {
    field title type string {
      indexing: summary | index
    }
    field entities type array<string> {
      indexing: summary | index
      match: word
    }
  }
  rank-profile entity-ranking {
    first-phase {
      expression: nativeRank(title) + matches(entities)
    }
  }
}
At query time you'll need to do the same named entity extraction on the query input and build a Vespa query tree that searches the title (e.g. using OR or WeakAnd) and also searches the entities field for the possible named entities using the Vespa rank operator. E.g., given your query example, the actual query could look something like:
select * from sources * where rank(title contains "oops" or title contains "donald" or title contains "trump", entities contains "Donald Trump Jr.");
You can build the query tree in a custom searcher http://docs.vespa.ai/documentation/searcher-development.html using a shared named entity extraction component.
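To make the shape of that query concrete, here is a rough Ruby sketch that assembles the same YQL string from a query text and a list of already-extracted entities. The method names build_yql and escape_yql are hypothetical, and a production searcher would build the query tree programmatically in the container rather than concatenate strings.

```ruby
# Escapes backslashes and double quotes for use inside a YQL string literal.
def escape_yql(value)
  value.gsub(/["\\]/) { |c| "\\#{c}" }
end

# Title terms are OR-ed together and decide which documents match; the
# entities are passed as further rank() arguments, so they only add to the score.
def build_yql(query_text, entities)
  terms = query_text.downcase.scan(/[a-z0-9]+/).uniq
  raise ArgumentError, "query has no searchable terms" if terms.empty?

  title_part = terms.map { |t| %(title contains "#{t}") }.join(" or ")
  return "select * from sources * where #{title_part};" if entities.empty?

  entity_part = entities.map { |e| %(entities contains "#{escape_yql(e)}") }.join(", ")
  "select * from sources * where rank(#{title_part}, #{entity_part});"
end

puts build_yql("Oops, Donald Trump", ["Donald Trump Jr."])
```

For the inputs shown, the output is the same query string as in the example above.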
Some resources
Shared components & writing custom searchers/document processors (to implement the named entity extraction) http://docs.vespa.ai/documentation/jdisc/container-components.html
Ranking http://docs.vespa.ai/documentation/ranking.html
Query language http://docs.vespa.ai/documentation/query-language.html
I'd like to perform a partial text / phrase search against a Datastore record field using Ruby.
I've figured out how to do it with a range constraint (field >= term and field < term + "\ufffd"), but that only works from the beginning of the field.
This works; querying for "Ener" returns "Energizer AA Batteries" but querying for "AA" does not return the same.
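The reason the range trick only matches prefixes can be seen with plain string comparison. This is a small Ruby sketch in which Ruby's own string ordering stands in for the Datastore index order (an assumption of the sketch); prefix_range_match? is a hypothetical helper, not part of any client library.

```ruby
# True when `value` falls in the range [term, term + U+FFFD), which is the
# range that the two inequality filters select.
def prefix_range_match?(value, term)
  value >= term && value < term + "\ufffd"
end

name = "Energizer AA Batteries"
puts prefix_range_match?(name, "Ener") # the value starts with the term
puts prefix_range_match?(name, "AA")   # the term occurs only mid-string
```

A value that merely contains the term somewhere in the middle sorts outside the range, which is why "AA" finds nothing.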
In the docs for the Python Google Client's Search API, it documents the ability to manually create indexes which allow for both atomic and partial word searches.
https://cloud.google.com/appengine/docs/standard/python/search/ says:
Tokenizing string fields
When an HTML or text field is indexed, its contents are tokenized. The string is split into tokens wherever whitespace or special characters (punctuation marks, hash sign, etc.) appear. The index will include an entry for each token. This enables you to search for keywords and phrases comprising only part of a field's value. For instance, a search for "dark" will match a document with a text field containing the string "it was a dark and stormy night", and a search for "time" will match a document with a text field containing the string "this is a real-time system".
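The tokenization described in that passage can be approximated in a few lines of Ruby. This is only a sketch of the idea: the method name tokenize is hypothetical, the splitting rule (anything that is not a letter or digit) is an assumption, and the real Search API tokenizer may behave differently.

```ruby
# Splits text into lowercase tokens at whitespace and special characters,
# returning each distinct token once.
def tokenize(text)
  text.downcase.split(/[^a-z0-9]+/).reject(&:empty?).uniq
end

puts tokenize("it was a dark and stormy night").include?("dark")
puts tokenize("this is a real-time system").include?("time")
puts tokenize("Energizer AA Batteries").inspect
```

With one index entry per token, a query for "aa" can be answered by an equality lookup instead of a prefix range.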
In the docs for Ruby and PHP, I cannot find such an API reference to enable me to do the same thing. Is it possible to perform this type of query in Ruby / PHP with Cloud Datastore?
If so, can you point me to the docs, and if not, is there a workaround, for example, creating indexes with the Python Search API and then configuring the PHP/Ruby client to execute its queries against those indexes?
Suppose I have to store student academic details like...
College name -- text field searchable
Student Class -- text field searchable
Subjects -- multivalue field , text field searchable
How do I store/handle "Student class", given that students can search like this: "students of class 4th", "Students of class 4", "student of class fourth"?
How can I handle these variations (4th, 4, fourth)? What are elegant ways to do so?
Thanks
Amit Aggarwal
One way to solve this problem is to use a field type that supports query-time synonyms. Check out the "text_general" type in the example Solr schema.
In practice you would add rows like this to the synonyms.txt file in your core's conf directory:
# numbers
1,1st,first
2,2nd,second
3,3rd,third
4,4th,fourth
Now, let's suppose you had a document such as:
{ "college":"Princeton", "class":"1", "subjects":["CS 101", "introduction to full text search"]}
You could then retrieve that document if you do a query such as:
class:first
In this example the search query is directed at a single field, which may or may not be what you want. If you need number-synonym matching across multiple fields (i.e., a query with no field specifier, just the search term), you could copy the content of those fields into a single synonym-searchable field (using copyField), such as content_synonyms, and then run the query against this field by default.
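What query-time synonym expansion does with those rows can be mimicked in plain Ruby. The sketch below is an illustration under stated assumptions, not Solr's implementation: build_synonym_map, expand and class_matches? are hypothetical names, and real analysis chains also tokenize and lowercase according to the field type.

```ruby
SYNONYM_ROWS = <<~ROWS
  # numbers
  1,1st,first
  2,2nd,second
  3,3rd,third
  4,4th,fourth
ROWS

# Maps every term in a row to the full group of equivalent terms.
def build_synonym_map(rows)
  rows.each_line.with_object({}) do |line, map|
    line = line.strip
    next if line.empty? || line.start_with?("#") # skip blanks and comments
    group = line.split(",").map { |term| term.strip.downcase }
    group.each { |term| map[term] = group }
  end
end

SYNONYMS = build_synonym_map(SYNONYM_ROWS)

# Expands a query term into all of its synonyms (or just itself).
def expand(term)
  SYNONYMS.fetch(term.downcase, [term.downcase])
end

# True when the indexed value equals any expansion of the query term.
def class_matches?(indexed_value, query_term)
  expand(query_term).include?(indexed_value.downcase)
end

puts expand("fourth").inspect
puts class_matches?("1", "first")
```

The query term "first" expands to the group that contains "1", which is why the document indexed with class "1" is found.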
I'd like to boost by recency but not order by datetime/date field absolutely (since there are other factors at play).
How do I boost a date/datetime field in Sunspot Rails?
Sunspot does not have a built-in capability to boost a date/datetime field; however, it is possible to do so using Solr itself, as described in this article:
https://wiki.apache.org/solr/FunctionQuery#Date_Boosting
Fortunately, sunspot provides a way of manually adding params to the solr query. What you'll want to do here is make sure that the date field you are using to boost is contained in the searchable block on the model.rb:
searchable do
  time :datetime_field, stored: true, trie: true
end
Then, in the search block that is probably in your models_controller.rb, add the date boost function:
@search = Model.search do
  # perform search
  adjust_solr_params do |sunspot_params|
    sunspot_params[:boost] = 'recip(ms(NOW,datetime_field_dts),3.16e-11,1,1)'
  end
end
Note that sunspot adds '_dts' to the end of your time field, so you will need to include this at the end of the variable name in the query string.
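To get a feel for the numbers, the boost function can be evaluated outside Solr. Solr's recip(x,m,a,b) computes a / (m*x + b), and ms(NOW,field) is the document's age in milliseconds; the Ruby sketch below reproduces that arithmetic with the constants from the query above. The names MS_PER_YEAR and recency_boost are hypothetical.

```ruby
# Milliseconds in a year; 3.16e-11 is roughly 1 / MS_PER_YEAR.
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000

# Same definition as Solr's recip function: a / (m * x + b).
def recip(x, m, a, b)
  a / (m * x + b)
end

# Boost multiplier for a document that is `age_in_ms` milliseconds old.
def recency_boost(age_in_ms)
  recip(age_in_ms, 3.16e-11, 1.0, 1.0)
end

[0, 0.5, 1, 5].each do |years|
  puts format("%.1f years old: boost %.3f", years, recency_boost(years * MS_PER_YEAR))
end
```

A brand-new document keeps its full score, a one-year-old document is scaled to roughly half, and older documents decay further without ever reaching zero, so recency influences the ranking without replacing relevance.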
Does anyone know how to index and search embedded documents with sunspot_mongoid?
The question has been asked in the sunspot_mongoid issues, but has no solution, so far.
Just tried it. It's a hack, but it works for searching embedded documents and returning the parent document that holds them. Is that what you want? If so, define a method that returns the embedded field values you want (here joined into a single string), and then index the result.
Assuming you have a class Company with embedded departments:
searchable do
  # Your regular index
  # ...
  text :company_departments
end

def company_departments
  departments.map(&:name).join(" ")
end
Reindex and try to search.
You can also include a block that returns the text you want indexed right in the searchable block. For example:
searchable do
  text :innerdoc do
    innerdocs.map { |i| i.title + ' ' + i.description }
  end
end
That takes the title and description from an embedded array of "innerdocs" and adds it to the index for the main document.
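One caveat with string concatenation in such a block: i.title + ' ' + i.description raises an error as soon as either attribute is nil. The following plain-Ruby sketch shows a nil-tolerant variant of the same mapping; the InnerDoc struct and the innerdoc_text method are hypothetical stand-ins for the embedded Mongoid documents and the block body.

```ruby
# Hypothetical stand-in for an embedded document.
InnerDoc = Struct.new(:title, :description)

# Builds one text entry per embedded document, skipping nil attributes.
def innerdoc_text(innerdocs)
  innerdocs.map { |doc| [doc.title, doc.description].compact.join(" ") }
end

docs = [InnerDoc.new("Intro", "Basics of search"), InnerDoc.new("Notes", nil)]
puts innerdoc_text(docs).inspect
```

Inside a searchable block the same expression would replace the body of the map call.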
The sunspot docs have the best info on the syntax for the "searchable" block:
http://outoftime.github.com/sunspot/docs/