App Engine Search API (Document Search) - Multiple Languages - google-app-engine

I have Documents that I'd like to make searchable in 3 different languages. Since I can have multiple fields with the same name/type, the following Document structure works (this is a simplified example).
from google.appengine.api import search
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="en",
            value="dog"),
        search.TextField(
            name="name",
            language="es",
            value="perro"),
        search.TextField(
            name="name",
            language="fr",
            value="chien")
    ]
)
index = search.Index("my_index")
index.put(document)
Specifying the language helps Google tokenize the value of the TextField.
The following queries all work, each returning one result:
print(index.search("name: dog"))
print(index.search("name: perro"))
print(index.search("name: chien"))
Here is my question: Can I restrict a search to only target fields with a specific language?
The purpose is to avoid false positives. Since all three languages use the Latin alphabet, someone performing a full-text search in Spanish may see irrelevant English results.
Thank you.

You can use facets to attach metadata to a document, that is, values that are not part of its searchable fields. Here, the facets indicate which languages appear in the document.
Document insertion:
index = search.Index("my_index")
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="en",
            value="dog"),
        search.TextField(
            name="name",
            language="es",
            value="perro"),
        search.TextField(
            name="name",
            language="fr",
            value="chien")
    ],
    facets=[
        search.AtomFacet(name='lang', value='en'),
        search.AtomFacet(name='lang', value='es'),
        search.AtomFacet(name='lang', value='fr'),
    ],
)
index.put(document)
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="es",
            value="gato"),
        search.TextField(
            name="name",
            language="fr",
            value="chat")
    ],
    facets=[
        # No English in this document, so leave out lang='en'.
        search.AtomFacet(name='lang', value='es'),
        search.AtomFacet(name='lang', value='fr'),
    ],
)
index.put(document)
Query:
index = search.Index("my_index")
query = search.Query(
    '',  # query all documents, cats and dogs.
    # filter docs by language facet
    facet_refinements=[
        search.FacetRefinement('lang', value='en'),
    ])
results = index.search(query)
for doc in results:
    result = {}
    for f in doc.fields:
        # filter fields by language
        if f.language == 'en':
            result[f.name] = f.value
    print(result)
Should print {u'name': u'dog'}.
Note that although the facet refinement fetches only documents that contain English, those documents still carry fields in other languages. This is why we iterate through the fields and add only the English ones to result.
If you want to know more about the more general use case for faceted search, this answer gives a pretty good idea.
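The per-field filtering step can be factored into a small helper. The following is a minimal sketch, not part of the Search API: `Field` is a hypothetical stand-in for `search.TextField` (only `name`, `language`, and `value` are modeled) so that the example runs without the App Engine SDK, and `fields_for_language` is an illustrative name.

```python
from collections import namedtuple

# Hypothetical stand-in for search.TextField; only the attributes
# used by the filtering loop are modeled.
Field = namedtuple("Field", ["name", "language", "value"])


def fields_for_language(fields, lang):
    """Return {field name: value} for the fields tagged with `lang`."""
    return {f.name: f.value for f in fields if f.language == lang}


dog_fields = [
    Field("name", "en", "dog"),
    Field("name", "es", "perro"),
    Field("name", "fr", "chien"),
]

print(fields_for_language(dog_fields, "en"))
print(fields_for_language(dog_fields, "de"))  # no German fields: empty dict
```

In the query loop, the same helper could presumably be called as `fields_for_language(doc.fields, 'en')` in place of the inner `for` loop.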

You could use a separate index for each language.
Define a utility function for resolving the correct index for a given language:
def get_index(lang):
    return search.Index("my_index_{}".format(lang))
Insert documents:
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="en",
            value="dog"),
    ])
get_index('en').put(document)
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="fr",
            value="chien")
    ])
get_index('fr').put(document)
Query by language:
query = search.Query(
    'name: chien')
results = get_index('fr').search(query)
for doc in results:
    print(doc)
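When one logical record has several translations, each translation goes to its own index. The following is a minimal sketch of that routing. It only computes index names and groups values, so it runs without the App Engine SDK; `SUPPORTED_LANGUAGES`, `index_name`, and `route_translations` are illustrative names, not part of the Search API.

```python
SUPPORTED_LANGUAGES = ("en", "es", "fr")  # assumed set of languages


def index_name(lang, prefix="my_index"):
    """Build the per-language index name, rejecting unknown languages."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language: {!r}".format(lang))
    return "{}_{}".format(prefix, lang)


def route_translations(translations):
    """Map each index name to the value that should be indexed there."""
    return {index_name(lang): value for lang, value in translations.items()}


routes = route_translations({"en": "dog", "es": "perro", "fr": "chien"})
for name in sorted(routes):
    print(name, routes[name])
```

Each entry would then become one Document put into the index returned by get_index for that language. Rejecting unknown languages early avoids silently creating a stray index from a typo.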

Related

Can Azure Search do facets with one-to-many relationships

Suppose I have documents (let's say books) that I want to search, with a facet (let's say genre) for which a document can have many values. For example, a book could be "young adult", "fiction", and "sci-fi" at the same time.
Can Azure Search faceting handle this situation, and if so, can it do it from simple strings with a delimiter?
Define the genre field in your index as a string collection, Collection(Edm.String), and make it facetable. When indexing documents, pass the values for that field as a JSON array:
{
... other properties
"genre" : [ "young adult", "fiction", "sci-fi" ]
}
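For data that arrives as a delimited string, one option is to split it into a list before building the JSON payload. This is a minimal client-side sketch; the `|` delimiter, the `title` field, and the `to_index_document` helper are assumptions for illustration and not part of any Azure SDK.

```python
import json


def to_index_document(title, genres, delimiter="|"):
    """Split a delimited genre string into a list for a collection field."""
    values = [g.strip() for g in genres.split(delimiter) if g.strip()]
    return {"title": title, "genre": values}


doc = to_index_document("Example Book", "young adult| fiction |sci-fi")
print(json.dumps(doc, sort_keys=True))
```

Stripping whitespace and dropping empty pieces keeps stray delimiters from producing blank facet values.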

How can I search a multi-value field for values exclusively

I have a Solr index with a multi-valued field, let's call it mvfield. It can contain arbitrary values, even though currently it is a finite set of values.
I want to find documents which contain only certain values in this field. Example:
doc1: mvfield = [a,b,c,d,e,f]
doc2: mvfield = [e,f]
doc3: mvfield = [f]
doc4: mvfield = [a,b,c,e,f]
doc5: mvfield = [e]
I want to create a query which returns documents which contain only e or f in mvfield, so in this example it should give doc2, doc3 and doc5.
I found a crude workaround using ranges:
-mvfield:[* TO e} AND -mvfield:{e TO f} AND -mvfield:{f TO *]
but it seems very fragile. Is there any better way to do this?
Why not just use a filter query?
&fq=mvfield:e and so on.
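Note that a filter query on a single value matches every document that contains that value, including doc1 and doc4 from the question. To make the exclusive semantics concrete, here is a minimal client-side sketch in Python (not a Solr query): a document qualifies only if its set of values is a non-empty subset of the allowed values. The `only_allowed` helper is an illustrative name.

```python
docs = {
    "doc1": ["a", "b", "c", "d", "e", "f"],
    "doc2": ["e", "f"],
    "doc3": ["f"],
    "doc4": ["a", "b", "c", "e", "f"],
    "doc5": ["e"],
}


def only_allowed(docs, allowed):
    """Return ids of docs whose values are a non-empty subset of `allowed`."""
    allowed = set(allowed)
    return sorted(
        doc_id for doc_id, values in docs.items()
        if values and set(values) <= allowed
    )


print(only_allowed(docs, ["e", "f"]))  # ['doc2', 'doc3', 'doc5']
```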

Solr - how to return most frequent terms of a query

While performing the query name:*b* AND country:China (name contains 'b'), I want Solr to return the number of people (from China) matching each distinct term.
Documents (name is tokenized on whitespace):
[
{name: 'sponge bob'},
{name: 'billy chen'},
{name: 'abie white'}
]
Expected result:
[
{term: 'bob', matches: 100},
{term: 'billy', matches: 90},
{term: 'abie', matches: 80}
]
Attempt: facet search
I tried a query like q=name:*b*+%3AAND+%3Acountry:China&facet=on&facet.field=name
The result includes unrelated terms, such as:
[sponge,1, bob, 1, ...]
How can I get rid of unrelated terms like sponge?
I'm not sure if I understand your use case correctly, but the TermsComponent may fit your needs.
It "provides access to the indexed terms in a field and the number of documents that match each term" (from the docs).
After configuring the component in your solrconfig.xml the query should look like this:
terms=true&terms.fl=name&terms.regex=.*b.*
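To illustrate the shape of the result being asked for (terms that match the pattern, each with the number of documents containing it), here is a small Python simulation over the sample documents. It is only a sketch of the idea, not of how Solr computes it: it assumes whitespace tokenization and that the regex must match the whole term, and `matching_term_counts` is an illustrative name.

```python
import re
from collections import Counter


def matching_term_counts(names, pattern):
    """Count, per term matching `pattern`, the documents that contain it."""
    regex = re.compile(pattern)
    counts = Counter()
    for name in names:
        # set() so that a term repeated within one document counts once.
        for term in set(name.split()):
            if regex.fullmatch(term):
                counts[term] += 1
    return counts


names = ["sponge bob", "billy chen", "abie white"]
counts = matching_term_counts(names, r".*b.*")
print(sorted(counts.items()))  # [('abie', 1), ('billy', 1), ('bob', 1)]
```

Terms such as sponge, chen, and white are dropped because they do not match the pattern, which is the filtering the facet attempt was missing.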
In the end, I modified the facet search implementation based on this patch, https://issues.apache.org/jira/browse/SOLR-1387, and built my own Solr WAR.

Parameter bq modify facet counts using grouping

I am using solr trunk to search some documents and group them by their category, but I have to group them first by another field. More specifically I am using this schema:
component_id: string
category: string
name: text
And I have two documents:
component_id = register1, category = category1, name='foo bar'
component_id = register1, category = category2, name='foo bar zoo'
My query is (only relevant parameters):
{edismax qf=name}(foo bar)&group.field=component_id&group.truncate=true&facet.field=category&bq=category:category1^2
And the facet results are:
'category':
'category1', 1
'category2',1
But when I change the bq parameter, for example to bq=category:category1^20,
the facet results change:
'category':
'category1', 1
'category2', 0
Is that possible? Is it a bug? If I set group.truncate=false, everything is fine for this example, but it fails for the rest of the queries.
Thanks & regards
Answering my own question:
group.truncate is the correct option when your data is uniform or when your groups contain similar objects, but it causes problems when groups mix data from different categories.
If group.truncate=true, then |A ∪ B| ≠ |A| + |B| - |A ∩ B|
Nothing is wrong with the bq parameter.
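The effect can be reproduced with a small Python simulation. This is a sketch under the assumption that group.truncate=true computes facet counts from only the most relevant document in each group; the scores are made up to stand in for the effect of a strong bq boost on category1, and the function names are illustrative.

```python
from collections import Counter


def facet_counts(docs, facet_field):
    """Facet counts over all matching documents (no truncation)."""
    return dict(Counter(doc[facet_field] for doc in docs))


def truncated_facet_counts(docs, group_field, facet_field):
    """Facet counts using only the top-scoring document of each group."""
    top = {}
    for doc in docs:
        key = doc[group_field]
        if key not in top or doc["score"] > top[key]["score"]:
            top[key] = doc
    counts = {doc[facet_field]: 0 for doc in docs}
    for doc in top.values():
        counts[doc[facet_field]] += 1
    return counts


docs = [
    {"component_id": "register1", "category": "category1", "score": 2.0},
    {"component_id": "register1", "category": "category2", "score": 1.5},
]
print(facet_counts(docs, "category"))
print(truncated_facet_counts(docs, "component_id", "category"))
```

Because both documents share one component_id, only the higher-scoring one contributes to the truncated counts, so category2 drops to 0 once the boost decides the ranking.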

SOLR sort by IN Query

I was wondering whether it is possible to sort results by the order in which I request documents from Solr. I am running an IN-based query and would like Solr to return the documents in the order I ask for them.
IN (4,2,3,1) should return documents ordered 4,2,3,1.
Thanks.
You need sorting in Solr to order results by a field.
I assume that "IN-based query" means something like: fetch docs whose fieldx has values in (val1,val2). You can define a field as a multi-valued field and facet on that field. A facet query is an 'is in' search out of the box, so to speak, and it can do more sophisticated searches too.
Edited in response to the OP's question:
Updating a document with a multi-valued field in JSON is described here. See the line
"my_multivalued_field": [ "aaa", "bbb" ] /* use an array for a multi-valued field */
As for doing a facet query, check this.
You need to do one or more fq statements:
&fq=field1:[400 TO 500]
&fq=field2:johnson,thompson
Also read up (in the link above) on which fields can be faceted: Solr computes facets from indexed values, not stored ones.
You can easily apply sorting with QueryOptions and field sort (ExtraParams property - I am sorting by savedate field, descending):
var results = _solr.Query(textQuery,
    new QueryOptions
    {
        Highlight = new HighlightingParameters
        {
            Fields = new[] { "*" },
        },
        ExtraParams = new Dictionary<string, string>
        {
            {"fq", dateQuery},
            {"sort", "savedate desc"}
        }
    });
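If no indexed field encodes the requested order, one alternative is to reorder the results on the client after the query returns. The following is a minimal Python sketch of that idea; `reorder_by_request` and the `id` field name are assumptions for illustration, and the documents are plain dicts rather than any Solr client's result type.

```python
def reorder_by_request(requested_ids, docs, id_field="id"):
    """Return docs sorted by the position of their id in requested_ids."""
    position = {doc_id: i for i, doc_id in enumerate(requested_ids)}
    # Ignore any document whose id was not requested.
    found = [d for d in docs if d[id_field] in position]
    return sorted(found, key=lambda d: position[d[id_field]])


returned = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
ordered = reorder_by_request([4, 2, 3, 1], returned)
print([d["id"] for d in ordered])  # [4, 2, 3, 1]
```

Requested ids that are missing from the results are simply skipped, so the output may be shorter than the request.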
