Apache Solr Schema Configuration - solr

So I am pretty new to Apache Solr and have a situation I do not know how to handle. I come from an OO programming background, so first let me explain the object relationships:
Take an object called Movie that has two text fields, title and description. A movie can be associated with tags by a user. These tags are particular to the user, and are not visible to other users.
So an example Movie could have something like this:
"Movie Title", "Description of the Movie"
User1Tags: "tag1", "tag2"
User2Tags: "action", "somethingElse"
I need to design a schema/solr query so that when user1 is searching for movies, if they type "action", the movie above will not show up. This is because user2 has associated "action" with "Movie Title", not user1.
Things I have considered:
1) Filter queries - these do not seem to work as once the index per movie is built, I do not see how to avoid having all the user tags be tied to the movie's index.
2) A separate core for movie to tag associations and just doing two queries per search. I know I can do it this way, but making another core seems excessive to me.
Are there other options I am missing? Or is there a way to implement 1? Or is the simplest option just option 2 and that's how people who know what they are doing with Solr would do it?

How many users?
If not many, then you can have dynamic fields tag_user1, tag_user2 and modify the eDismax field list to match or not match against it, e.g. by using field name alias.
The other option is to prefix the values with the userid. So tags field would have: user1_tag1, user1_tag2, user2_action, user2_somethingElse. Then, you need a custom filter in the query chain that will prefix your search tokens with the user of the request and so only prefixed values would match.
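A minimal sketch of the prefixing idea, assuming the prefix is added in client code rather than in a custom Solr filter, and that tags are matched as exact, case-sensitive values. The function names are hypothetical and only illustrate the data flow.

```python
def prefix_tags(user_id: str, tags: list[str]) -> list[str]:
    """Index time: turn a user's tags into user-scoped values."""
    return [f"{user_id}_{tag}" for tag in tags]


def prefix_query_terms(user_id: str, query: str) -> list[str]:
    """Query time: scope every whitespace-separated token to the requesting user."""
    return [f"{user_id}_{token}" for token in query.split()]


def matches(user_id: str, query: str, indexed_tags: list[str]) -> bool:
    """True if any user-scoped query term equals an indexed tag value."""
    return any(term in indexed_tags for term in prefix_query_terms(user_id, query))


# Values that would go into the movie's single multi-valued tags field.
movie_tags = prefix_tags("user1", ["tag1", "tag2"]) + prefix_tags(
    "user2", ["action", "somethingElse"]
)
```

With this data, a search for "action" by user2 matches the movie, while the same search by user1 does not, because user1's token becomes user1_action.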

Related

Is there any way to sort on a nested value in Azure Cognitive Search?

My use case is that I have a database of songs that are associated with dances that one can dance to that song. Users can vote on the danceability of a dance to a song, so there is a numeric vote tally for each song/dance combination. A core part of the search functionality is to be able to do an arbitrary search and sort the results by the popularity of a particular dance.
I am currently modeling this by creating a new top level field with a decorated name (e.g. DNC_Salsa or DNC_Waltz) for each dance. This works. But aside from being clumsy, I can't associate other information with a dance. In addition, I have to dynamically add the dance fields, so I have to use the generic SearchDocument type in the C# library rather than using a POCO type.
I'd much prefer to model this with the dance fields as an array of subdocuments where the subdocuments contain a dance name, a vote count and the other information I'd like to associate with a dance.
A simplified example record would look something like this:
{
  "title": "Baby, It's Cold Outside",
  "artist": "Seth MacFarlane",
  "tempo": 119.1,
  "dances": [
    { "name": "cha cha", "votes": 1 },
    { "name": "foxtrot", "votes": 4 }
  ]
}
I gave this a try and received:
{"error":{"code":"OperationNotAllowed","message":"The request is invalid.","details":[{"code":"CannotEnableFieldForSorting","message":"The field 'Votes' cannot be enabled for sorting because it is directly or indirectly contained in a collection, which makes it a multi-valued field. Sorting is not allowed on multi-valued fields. Parameters: definition"}]}}
It looks like elastic search will do what I want:
Sort search results | Elasticsearch Guide [7.17] | Elastic
If I'm reading the Elasticsearch documentation correctly, you can basically say: sort on the dances subdocument by first filtering for name == "cha cha" and then sorting on the vote field.
Is there anything like this in Azure Cognitive Search? Or even something more restrictive? I don't need to do arbitrary sorting on anything in the subdocument. I would be happy to only ever sort on the vote count (although I'd have to be able to do that for any dance name).
It's not clear to me what your records or data model look like. However, from the error message you provided, it's clear that you are trying to sort on a multi-valued property. That is logically impossible.
Imagine a property Color that can contain colors like 'Red' or 'Blue'. If you sort by Color, you would get your red values before the blues. If you instead had 'Colors' that can contain multiple values like both 'Red' and 'Blue', how would you sort it? You can't.
So, if you actually want to sort by a property, that property has to contain a single value.
That said, I have a feeling you are really asking about ranking/boosting, not sorting. Have a look at the examples with boosting and scoring profiles for different genres of music. I believe those examples could help you solve your use case.
https://learn.microsoft.com/en-us/azure/search/index-add-scoring-profiles#extended-example
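For reference, the workaround the question describes (one top-level, single-valued field per dance, which is what makes sorting possible) can be sketched as a flattening step before indexing. This is only an illustration: the function name and the way the field name is derived from the dance name are assumptions, not part of any Azure SDK.

```python
def flatten_dances(song: dict) -> dict:
    """Copy a song record, replacing the nested dances array with
    one decorated top-level field per dance (e.g. DNC_Foxtrot)."""
    flat = {key: value for key, value in song.items() if key != "dances"}
    for dance in song.get("dances", []):
        # "cha cha" -> "ChaCha"; the naming rule is an assumption.
        field_name = "DNC_" + dance["name"].title().replace(" ", "")
        flat[field_name] = dance["votes"]
    return flat


song = {
    "title": "Baby, It's Cold Outside",
    "artist": "Seth MacFarlane",
    "tempo": 119.1,
    "dances": [
        {"name": "cha cha", "votes": 1},
        {"name": "foxtrot", "votes": 4},
    ],
}
flat_song = flatten_dances(song)
```

Each resulting DNC_ field holds a single value per document, so it avoids the multi-valued restriction from the error message, at the cost of losing any other per-dance information.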

ArangoDB getting all unique tags from results

I've a document in ArangoDB.
{ title: "title 12345", tags : ["tagx", "tagy", "tagz"] }
I have a full-text index on title and a hash index on tags[*].
I've a query where I want to use full text search on title and apply some filtering via tags and get skip x limit 10 in results. I'm able to achieve this. This will help me in pagination. In one API call, I can return the user 10 items.
However, I also want to get all the unique tags which are present in the result(without skip and limit constraint) without hitting all the documents present in the result. This will help me to show the tags which user can further select to narrow down the search.
We can assume that there will be a small number of unique tags(around 30-40) in the database. Is there an efficient way to achieve this in ArangoDB? Maybe, we can create some new indexes or change the schema to achieve this.
Assume a collection named test that contains the documents below. Just to be reproducible, the data that I am inserting is:
[{"_key":"342","_id":"test/342","_rev":"504","tags":["tagx","tagy","tagz"],"title":"title 12345"},{"_key":"564","_id":"test/564","_rev":"591","tags":["tagx","tagy","tagt"],"title":"title another"},{"_key":"510","_id":"test/510","_rev":"538","tags":["tagh","tagk","tagz"],"title":"title 56789"}]
If you run the following AQL query
let tags = (for i in test
return i.tags)[**]
return unique(tags)
This will return
[
[
"tagk",
"tagh",
"tagt",
"tagz",
"tagy",
"tagx"
]
]
As you can see, several tags are repeated across multiple documents, and the result is an array containing only the unique ones.
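To make clear what the query computes, here is a rough Python equivalent over the same sample data: flatten every document's tags array and keep the distinct values. It is only an illustration of the logic, not of how ArangoDB executes the query, and the result is sorted here for readability, whereas the AQL result order is not guaranteed.

```python
docs = [
    {"title": "title 12345", "tags": ["tagx", "tagy", "tagz"]},
    {"title": "title another", "tags": ["tagx", "tagy", "tagt"]},
    {"title": "title 56789", "tags": ["tagh", "tagk", "tagz"]},
]


def unique_tags(documents: list[dict]) -> list[str]:
    """Flatten all tags arrays and return the distinct tags, sorted."""
    seen = set()
    for doc in documents:
        seen.update(doc.get("tags", []))
    return sorted(seen)


all_tags = unique_tags(docs)
```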

Can I match important Keywords in a string?

Consider a user inputs this search string to a news search engine:
"Oops, Donald Trump Jr. Did It Again (Wikileaks Edition) :: Politics - Paste"
Imagine we have a database of News Titles, and a database of "Important People".
The goal here is: if a search string contains an important person, then return results containing this "substring" with a higher ranking than those results that do NOT contain it.
Using the Yahoo Vespa Engine, How can I match a database full of people names against long news title strings ?
*I hope that made sense, sorry everyone, my english not so good :( Thank you !
During document processing/indexing of news titles you could extract named entities from the input text using the "important people" database. This process could be implemented in a custom document processor. See http://docs.vespa.ai/documentation/document-processing-overview.html.
A document definition for the news search could look something like this with a custom ranking function. The document processor reads the input title and populates the entities array.
search news {
    document news {
        field title type string {
            indexing: summary | index
        }
        field entities type array<string> {
            indexing: summary | index
            match: word
        }
    }
    rank-profile entity-ranking {
        first-phase {
            expression: nativeRank(title) + matches(entities)
        }
    }
}
At query time you'll need to do the same named entity extraction on the query input and build a Vespa query tree which can search the title (e.g. using OR or WeakAnd) and also search the entities field for the possible named entities using the Vespa rank operator. For example, given your query example, the actual query could look something like:
select * from sources * where rank(title contains "oops" or title
contains "donald" or title contains "trump", entities contains "Donald Trump Jr.");
You can build the query tree in a custom searcher http://docs.vespa.ai/documentation/searcher-development.html using a shared named entity extraction component.
Some resources
Shared components & writing custom searchers/document processors (to implement the named entity extraction) http://docs.vespa.ai/documentation/jdisc/container-components.html
Ranking http://docs.vespa.ai/documentation/ranking.html
Query language http://docs.vespa.ai/documentation/query-language.html
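The query-time step described above can be sketched in Python as follows. This is a deliberately naive illustration, not Vespa API code: the entity "extraction" is a case-insensitive substring lookup against a hypothetical in-memory list, and the function only assembles a YQL string shaped like the example query. In a real deployment this logic would live in a custom searcher.

```python
import re

# Hypothetical stand-in for the "important people" database.
IMPORTANT_PEOPLE = ["Donald Trump Jr.", "Angela Merkel"]


def extract_entities(text: str, people: list[str] = IMPORTANT_PEOPLE) -> list[str]:
    """Return every known person whose name occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return [person for person in people if person.lower() in lowered]


def build_yql(query_text: str, people: list[str] = IMPORTANT_PEOPLE) -> str:
    """Build a YQL string that matches on title tokens and, when entities
    are found, uses rank() so entity matches only affect ranking."""
    tokens = re.findall(r"\w+", query_text.lower())
    if not tokens:
        raise ValueError("query text contains no searchable tokens")
    title_part = " or ".join(f'title contains "{token}"' for token in tokens)
    # Entity names come from the trusted list above, so they are not escaped here.
    entities = extract_entities(query_text, people)
    if not entities:
        return f"select * from sources * where {title_part};"
    entity_part = " or ".join(f'entities contains "{name}"' for name in entities)
    return f"select * from sources * where rank({title_part}, {entity_part});"


yql = build_yql("Oops, Donald Trump Jr. Did It Again")
```

Only the first argument of rank() decides which documents match; the entities clause contributes to the rank features, which is why a title hit without an entity is still returned, just with a lower score.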

Solr - How do I get the number of documents for each field containing the search term within that field in Solr?

Imagine an index like the following:
id partno name description
1 1000.001 Apple iPod iPod by Apple
2 1000.123 Apple iPhone The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain Doesn't Do This. If you absolutely need it, I'd try the multiple-requests solution and benchmark it. Solr tends to be a lot faster than what people put in front of it, so a couple of requests might not be that big of a deal.
You could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You also could implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain so you only will need a single query in the frontend.
If you want the term to be searched over the same fields every time, you have 2 options that do not break the "single query" requirement:
1) copyField: at index time you group all the fields that should match together. With just one copyField your problem doesn't exist; if you need more than one, you're in the same spot.
2) you could filter the query each time dynamically adding the "fq" parameter at the end
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
This works if you'll always be searching on the same two fields (or you can set them up before querying); otherwise you'll always need at least a second query.
Since I said "you have 2 options" but you actually have 3 (and I rushed my answer), here's the third:
3) The DisMax plugin, which its documentation describes like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
so, if you can use it, you may want to give it a look and start from the qf parameters (that is what the option number 2 wanted to be about, but i changed it in favor of fq... don't ask me why...)
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+title&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
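A small sketch of how a client might build that request and read the per-field counts back, assuming a single-word search term that needs no Solr query escaping. The function names are hypothetical; the parameters are the ones used in the request above.

```python
from urllib.parse import urlencode


def field_count_query(term: str, fields: list[str], rows: int = 10) -> str:
    """Build the query string: one edismax search over all fields plus
    one facet.query per field to get that field's hit count."""
    params = [
        ("q", term),
        ("defType", "edismax"),
        ("qf", " ".join(fields)),
        ("rows", str(rows)),
        ("facet", "true"),
        ("wt", "json"),
    ]
    params += [("facet.query", f"{field}:{term}") for field in fields]
    return urlencode(params)


def counts_per_field(response: dict) -> dict[str, int]:
    """Map each 'field:term' facet query in the response to {field: count}."""
    facet_queries = response["facet_counts"]["facet_queries"]
    return {key.split(":", 1)[0]: count for key, count in facet_queries.items()}


query_string = field_count_query("donkey", ["text", "title"])
sample_response = {
    "facet_counts": {"facet_queries": {"title:donkey": 127, "text:donkey": 4108}}
}
counts = counts_per_field(sample_response)
```

The counts can then drive the "Filter by field" section, for example by hiding fields whose count is zero.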

SOLR: Is it possible to index multiple timestamp:value pairs per document?

Is it possible in solr to index key-value pairs for a single document, like:
Document ID: 100
2011-05-01,20
2011-08-23,200
2011-08-30,1000
Document ID: 200
2011-04-23,10
2011-04-24,100
and then querying for documents with a specific value aggregation in a specific time range, i.e. "give me documents with sum(value) > 0 between 2011-08-01 and 2011-09-01" would return the document with id 100 in the example data above.
Here is a post from the Solr User Mailing List where a couple of approaches for dealing with fields as key/value pairs are discussed.
1) encode the "id" and the "label" in the field value; facet on it;
require clients to know how to decode. This works really well for simple
things where the id=>label mappings don't ever change, and are
easy to encode (ie "01234:Chris Hostetter"). This is a horrible approach
when id=>label mappings do change with any frequency.
2) have a separate type of "metadata" document, one per "thing" that you
are faceting on containing fields for id and the label (and probably a
doc_type field so you can tell it apart from your main docs) then once
you've done your main query and gotten the results back faceted on id,
you can query for those ids to get the corresponding labels. this works
really well if the labels ever change (just reindex the corresponding
metadata document) and has the added bonus that you can store additional
metadata in each of those docs, and in many use cases for presenting an
initial "browse" interface, you can sometimes get away with a cheap
search for all metadata docs (or all metadata docs meeting a certain
criteria) instead of an expensive facet query across all of your main
documents.
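Approach 1 from the quoted post can be sketched as a pair of helper functions. This is a minimal illustration that assumes ids never contain a colon; the function names are hypothetical and the sample facet counts are made up for the example.

```python
def encode_value(item_id: str, label: str) -> str:
    """Pack id and label into one field value, e.g. '01234:Chris Hostetter'."""
    if ":" in item_id:
        raise ValueError("ids must not contain ':' for this encoding")
    return f"{item_id}:{label}"


def decode_value(value: str) -> tuple[str, str]:
    """Split an encoded value back into (id, label) at the first colon."""
    item_id, label = value.split(":", 1)
    return item_id, label


# Made-up facet counts keyed by encoded value, as a client might receive them.
facet_counts = {encode_value("01234", "Chris Hostetter"): 7}
decoded = {decode_value(value): count for value, count in facet_counts.items()}
```

As the post notes, this only holds up while the id-to-label mapping is stable, because changing a label means reindexing every document that carries the encoded value.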

Resources