ComplexType: Querying based upon analysis of data within - azure-cognitive-search

I'd like to know whether what I want to do is possible and, if so, how.
The idea is that I store a bunch of products. Then for those products, I have sellers. The sellers have offers for the products. I'd like to offer the buyers some intelligence on the history of seller offers for those products.
So let's say I have a ComplexType field within my Azure Cognitive Search Index called "SellerOffers". That field would have data that looks like this:
{
"offers":[
{
"offerid":5,
"offerPrice":"15.00",
"offerDate":"9/23/2021"
},
{
"offerid":4,
"offerPrice":"12.50",
"offerDate":"8/10/2021"
},
{
"offerid":3,
"offerPrice":"13.50",
"offerDate":"7/15/2021"
},
{
"offerid":2,
"offerPrice":"10.00",
"offerDate":"6/01/2021"
},
{
"offerid":1,
"offerPrice":"12.50",
"offerDate":"5/23/2021"
}
]
}
In the data above, you can see that the price was only $10.00 on 6/1/21 and is currently $15.00, so maybe the seller would still accept $10 (or $12).
Is it possible to ask that question of Azure Cognitive Search?
That is, show me products that had a previous offer where the price was at least 20% lower than the current offer.
Thank you for any help, I couldn't find any documentation about how such a complex query might be written.
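To make the condition concrete, here is a minimal sketch of the check in plain Python over the sample data above. It is not an Azure Cognitive Search query. The function name is hypothetical, and it assumes that the offer with the latest offerDate is the current one and that "20% lower" means "at least 20% lower".

```python
from datetime import datetime
from decimal import Decimal

def cheaper_previous_offers(offers, min_drop=Decimal("0.20")):
    """Return earlier offers priced at least `min_drop` below the latest offer."""
    if not offers:
        return []
    # Order by date so the last element is treated as the current offer.
    ordered = sorted(offers, key=lambda o: datetime.strptime(o["offerDate"], "%m/%d/%Y"))
    current, previous = ordered[-1], ordered[:-1]
    threshold = Decimal(current["offerPrice"]) * (1 - min_drop)
    return [o for o in previous if Decimal(o["offerPrice"]) <= threshold]

seller_offers = [
    {"offerid": 5, "offerPrice": "15.00", "offerDate": "9/23/2021"},
    {"offerid": 4, "offerPrice": "12.50", "offerDate": "8/10/2021"},
    {"offerid": 3, "offerPrice": "13.50", "offerDate": "7/15/2021"},
    {"offerid": 2, "offerPrice": "10.00", "offerDate": "6/01/2021"},
    {"offerid": 1, "offerPrice": "12.50", "offerDate": "5/23/2021"},
]

matches = cheaper_previous_offers(seller_offers)
print([o["offerid"] for o in matches])
```

With a current price of $15.00 the threshold is $12.00, so only the $10.00 offer qualifies.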

The closest thing to what you are looking for is the Semantic Answers feature of Semantic Search, which adds cognitive capabilities to search queries and can return answers to queries that are formulated as questions: https://learn.microsoft.com/en-us/azure/search/semantic-answers#prerequisites
I don't know whether it is smart enough to understand your question, but I think it is worth a try. You need to sign up for the public preview first: https://learn.microsoft.com/en-us/azure/search/semantic-search-overview#availability-and-pricing

Related

Is there any way to sort on a nested value in Azure Cognitive Search?

My use case is that I have a database of songs that are associated with dances that one can dance to that song. Users can vote on the danceability of a dance to a song, so there is a numeric vote tally for each song/dance combination. A core part of the functionality for the search is to be able to do an arbitrary search and sort the results by the popularity of a particular dance.
I am currently modeling this by creating a new top level field with a decorated name (e.g. DNC_Salsa or DNC_Waltz) for each dance. This works. But aside from being clumsy, I can't associate other information with a dance. In addition, I have to dynamically add the dance fields, so I have to use the generic SearchDocument type in the C# library rather than using a POCO type.
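As a rough sketch of the decorated-field workaround described above, the flattening step could look like this in Python. The function name and the rule for turning a dance name into a field name are assumptions made for illustration.

```python
def flatten_dances(song):
    """Copy each dance's vote count into a top-level DNC_<Name> field."""
    flat = {key: value for key, value in song.items() if key != "dances"}
    for dance in song.get("dances", []):
        # Hypothetical naming rule: "cha cha" becomes "DNC_ChaCha".
        field = "DNC_" + "".join(word.capitalize() for word in dance["name"].split())
        flat[field] = dance["votes"]
    return flat

song = {
    "title": "Baby, It's Cold Outside",
    "dances": [
        {"name": "cha cha", "votes": 1},
        {"name": "foxtrot", "votes": 4},
    ],
}
flat_song = flatten_dances(song)
print(flat_song)
```

Each DNC_ field holds a single number per document, which is why it can be sorted on, but nothing else about the dance survives the flattening.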
I'd much prefer to model this with the dance fields as an array of subdocuments where the subdocuments contain a dance name, a vote count and the other information I'd like to associate with a dance.
A simplified example record would look something like this:
{
"title": "Baby, It's Cold Outside",
"artist": "Seth MacFarlane",
"tempo": 119.1,
"dances": [
{ "name": "cha cha", "votes": 1 },
{ "name": "foxtrot", "votes": 4 }
]
}
I gave this a try and received:
{"error":{"code":"OperationNotAllowed","message":"The request is invalid.","details":[{"code":"CannotEnableFieldForSorting","message":"The field 'Votes' cannot be enabled for sorting because it is directly or indirectly contained in a collection, which makes it a multi-valued field. Sorting is not allowed on multi-valued fields. Parameters: definition"}]}}
It looks like elastic search will do what I want:
Sort search results | Elasticsearch Guide [7.17] | Elastic
If I'm reading the Elasticsearch documentation correctly, you can basically say: I'd like to sort on the dances subdocument by first filtering for name == "cha cha" and then sorting on the vote field.
Is there anything like this in Azure Cognitive Search? Or even something more restrictive? I don't need to do arbitrary sorting on anything in the subdocument. I would be happy to only ever sort on the vote count (although I'd have to be able to do that for any dance name).
It's not clear to me what your records or data model look like. However, from the error message you provided, it's clear that you are trying to sort on a multi-valued property. That is logically impossible.
Imagine a property Color that can contain colors like 'Red' or 'Blue'. If you sort by Color, you would get your red values before the blues. If you instead had 'Colors' that can contain multiple values like both 'Red' and 'Blue', how would you sort it? You can't.
So, if you actually want to sort by a property, that property has to contain a single value.
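The ambiguity can be shown with a small Python sketch using the Colors example. The documents are made up for illustration. To sort at all, each document's list of values must first be reduced to a single value, and the choice of reduction changes the order.

```python
docs = [
    {"id": "a", "colors": ["Red", "Blue"]},
    {"id": "b", "colors": ["Green"]},
]

# Reduce each list to its alphabetically first value, then sort.
by_smallest = [d["id"] for d in sorted(docs, key=lambda d: min(d["colors"]))]

# Reduce each list to its alphabetically last value, then sort.
by_largest = [d["id"] for d in sorted(docs, key=lambda d: max(d["colors"]))]

print(by_smallest, by_largest)
```

The two orderings disagree, so "sort by Colors" has no single meaning until a rule for picking one value per document is chosen.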
That said, I have a feeling you are really asking about ranking/boosting, not sorting. Have a look at the examples with boosting and scoring profiles for different genres of music. I believe the approach in these examples could help with your use case.
https://learn.microsoft.com/en-us/azure/search/index-add-scoring-profiles#extended-example

How do I use Solr "relatedness()" function to measure relatedness of two sets of documents?

I'd like to use the new Semantic Knowledge Graph capability in Solr to answer this question:
Given a set of documents from several different publishers, compute a "relatedness" metric between a given publisher and every other publisher, based on the text content of their respective documents.
I've watched several of Trey Grainger's talks regarding the Semantic Knowledge Graph functionality in Solr (this is a great recent one: https://www.youtube.com/watch?v=lLjICpFwbjQ). I have a reasonably good understanding of Solr faceted search functionality, and I have a working Solr engine with my dataset indexed and searchable. So far I've been unable to construct a facet query that does what I want.
Here is an example curl command that I thought might get me what I want:
curl -sS -X POST http://localhost:8983/solr/plans/query -d '
{
  params: {
    fore: "publisher_url:life.church",
    back: "*:*"
  },
  query: "*:*",
  limit: 0,
  facet: {
    pub_type: {
      type: terms,
      field: "publisher_url",
      limit: 5,
      sort: { "r1": "desc" },
      facet: {
        r1: "relatedness($fore,$back)"
      }
    }
  }
}'
Below are the resulting facets. Notice that after the first bucket (which matches the foreground query), the others all have exactly the same relatedness, which leads me to believe that the relatedness is based only on the publisher_url field rather than on the entire text content of the documents.
{
"facets":{
"count":2152,
"pub_type":{
"buckets":[{
"val":"life.church",
"count":141,
"r1":{
"relatedness":0.38905,
"foreground_popularity":0.06552,
"background_popularity":0.06552}},
{
"val":"10ofthose.com/us/products/1039/colossians",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}},
{
"val":"14DAYMARRIAGECHALLENGE.COM",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}},
{
"val":"23blast.com",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}},
{
"val":"2911worship.com",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}}]}}}
I'm not very familiar with the relatedness function, but as far as I understand, the relatedness score for each facet bucket is generated from the similarity between your foreground and background sets of documents.
Since your foreground set contains only that single value (and none of the others), the first bucket is the only one that will generate a different score when you facet on the same field that you use for selecting documents.
I'm not sure relatedness() is a good match for your use case, as it indicates how single terms in a field are related between the two sets you're using, not a similarity score across a different field for the two sets being compared.
You probably want something more structured than a text field to generate relatedness() scores, as the function is usually more useful for finding single values that give statistical insight into the structure of your query set.
The More Like This functionality might actually be a better match for getting the most similar other sites instead.
Again, this is based on my understanding of the functionality at the moment, so someone else can hopefully add more details and correct me as necessary.
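As a quick arithmetic check on the response above, the popularity figures are consistent with each one being a document count divided by the 2152 documents matched by the background query. This is only a reading of the numbers in the question, not a statement of how Solr defines these fields.

```python
total_docs = 2152        # "count" at the top of the facet response
life_church_docs = 141   # bucket that matches the foreground query
single_doc_bucket = 1    # every other bucket shown has count 1

# Reported as 0.06552 for both popularities of the first bucket.
first_bucket_popularity = round(life_church_docs / total_docs, 5)

# Reported as 4.6E-4 (background_popularity) for the other buckets.
other_bucket_popularity = round(single_doc_bucket / total_docs, 5)

print(first_bucket_popularity, other_bucket_popularity)
```

The other buckets report a foreground_popularity of 0.0, which fits the explanation that none of their documents fall inside the foreground set.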

Foursquare API venues: only 60% matches our database

We have our own existing venue database we want to match with venues from the Foursquare Places API. The purpose of this is to retrieve and display certain info and content from Foursquare.
Currently we're having trouble matching the Foursquare venues with our own venues: only 60% match. We pass the following parameters:
ll: latitude, longitude of the venue
query: name of the venue
categoryId: "4bf58dd8d48988d1fa931735" - ID of venue category
intent: browse
locale: en
radius: 100
We also tried the intent=match parameter, but that gave us even fewer matches. Is there anything we can change that would improve our matching percentage, or is this the best it can get?
Remember that Foursquare's database is more or less user-created. You have to account for user error, category mismatches, or even the location data being off. Also, even when using the Foursquare app or Swarm, venues in the immediate area don't always show up in the results.
For example, searching for a Starbucks sometimes returns ones that may be far, far away. The order of the results is another story: it may not be by distance, since their search uses other factors relating to user preferences and popularity.
Sorry if this is not much of an answer but from using the apps and the api, I believe you'll have a hard time getting close to 100% without a lot of data manipulation and creative calls to their search.
To get a higher match rate use the venues/search api and only use parameters for ll and query.
Since Foursquare has such a deep category taxonomy, I would avoid using the categoryId parameter unless you are 100% certain it matches with what Foursquare has. Also, I wouldn't use intent=browse for any matching. By leaving out the intent param it will default to intent=checkin, which will be better for fuzzy matching.
If you need precision at the expense of match rate, you can set intent=match. This parameter is very sensitive but can accept things like phone number and address as well. This is great for when you don't have a lat/lng but since you do I wouldn't bother with this.
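A minimal Python sketch of the request suggested above, sending only ll and query and leaving out categoryId, intent, and radius. The endpoint URL, the function name, and the credential parameter names are assumptions, so check them against the API version you are using. The sketch only builds the URL and does not call the API.

```python
from urllib.parse import urlencode

# Assumed endpoint for venues/search; verify against your API version.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"

def build_match_url(lat, lng, name, credentials):
    """Build a venues/search URL that sends only ll and query plus credentials."""
    params = {"ll": f"{lat},{lng}", "query": name}
    # Credentials are passed through untouched (parameter names are assumed).
    params.update(credentials)
    return f"{SEARCH_URL}?{urlencode(params)}"

url = build_match_url(40.7, -74.0, "Joe's Pizza", {"client_id": "YOUR_ID"})
print(url)
```

Leaving intent out relies on the default described above rather than setting it explicitly.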

Solr Lucene - Not sure how to index data so documents scored properly

Here's my goal. A user has a list of skill+proficiency tuples.
We want to find users based on some skill/experience criteria:
java, novice *
php, expert *
mysql, advanced
Where the * skills are highly desired and all others are good to have. Users who meet or exceed the criteria (based on experience) would be ranked highest. But it should also degrade nicely: if no users have both java and php experience, users who have one of the highly desired skills should be ranked at the top. Users with only one of the optional skills may appear at the bottom.
An idea I had is to index a user's skills in fields like this:
skill_novice: java
skill_novice: php
skill_advanced: php
skill_expert: php
skill_novice: mysql
skill_advanced: mysql
...so that at a minimum I can run a Boolean query to find people who meet the highly desired skills:
(skill_novice:java AND skill_expert:php)
but this doesn't degrade nicely (if no matches found) nor does it find the optional skills. Perhaps instead I can do something like this:
skill_novice:java AND
(skill_novice:php^0.1 OR skill_advanced:php^0.2 OR skill_expert:php^0.3)
Is there a better way to accomplish this?
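The indexing idea in the question, where a user gets one field entry for every level up to and including their own, can be sketched in Python as follows. The function name and the level ordering (novice, then advanced, then expert) are assumptions inferred from the example fields.

```python
# Assumed ordering of proficiency levels, lowest to highest.
LEVELS = ["novice", "advanced", "expert"]

def expand_skills(skills):
    """Turn (skill, level) tuples into (field, value) pairs for indexing."""
    fields = []
    for skill, level in skills:
        # Emit one field per level the user has reached or passed.
        for reached in LEVELS[: LEVELS.index(level) + 1]:
            fields.append((f"skill_{reached}", skill))
    return fields

user_fields = expand_skills([("java", "novice"), ("php", "expert"), ("mysql", "advanced")])
for field, value in user_fields:
    print(f"{field}: {value}")
```

With this layout a query for skill_novice:php also matches advanced and expert users, which is what lets "meet or exceed" be written as a single term.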
I think you could boost the field with the different values at index time:
// mysql expert
Field mysqlf = new Field("skill", "mysql",
Field.Store.YES,
Field.Index.ANALYZED);
mysqlf.setBoost(10.0F);
// mysql beginner
mysqlf = new Field("skill", "mysql",
Field.Store.YES,
Field.Index.ANALYZED);
mysqlf.setBoost(1.0F);
You need to enable norms for this to work.

Solr: How can I implement timed discount availability in Solr

I'm attempting to use solr for a book store site.
Each book will have a price but on occasions this will be discounted. The discounted price exists for a defined time period but there may be many discount periods. Each discount will have a brief synopsis, start and end time.
A subset of the desired output would be as follows:
.......
"response":{"numFound":1,"start":0,"docs":[
{
"name":"The Book",
"price":"$9.99",
"discounts":[
{
"price":"$3.00",
"synopsis":"thanksgiving special",
"starts":"11-24-2011",
"ends":"11-25-2011"
},
{
"price":"$4.00",
"synopsis":"Canadian thanksgiving special",
"starts":"10-10-2011",
"ends":"10-11-2011"
}
]
},
.........
A requirement is to be able to search for just discounted publications. I think I could use date faceting for this (return publications that are within a discount window). When a discount search is performed, publications that are not currently discounted should not be returned.
My questions are:
Does Solr support this type of sub-document?
In the above example the discounts are the sub-documents. I know Solr is not a relational DB, but I would like to store (and index) the above representation in a single document if possible.
What is the best way to approach the above?
I can see in many examples that authors tend to denormalize to solve similar problems. This suggests that for each discount I would have to duplicate the book data or form a document association. Which method would you advise?
It would be nice if solr could return a response structured as above.
Much Thanks
You could probably achieve something close enough by using Solr's dynamic fields to get:
.......
"response":{"numFound":1,"start":0,"docs":[
{
"name":"The Book",
"price":"$9.99",
"discount_1_price":"$3.00",
"discount_1_synopsis":"thanksgiving special",
"discount_1_starts":"11-24-2011",
"discount_1_ends":"11-25-2011",
"discount_2_price":"$4.00",
"discount_2_synopsis":"Canadian thanksgiving special",
"discount_2_starts":"10-10-2011",
"discount_2_ends":"10-11-2011"
},
........
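A minimal Python sketch of the flattening this implies on the indexing side, turning the nested discounts into numbered fields before the document is sent to Solr. The function name is hypothetical, and the sketch assumes the schema has a dynamic field pattern such as discount_* that covers the generated names.

```python
def flatten_discounts(book):
    """Replace the nested "discounts" list with discount_<n>_<key> fields."""
    flat = {key: value for key, value in book.items() if key != "discounts"}
    # Number the discounts from 1 so the names match the response shown above.
    for index, discount in enumerate(book.get("discounts", []), start=1):
        for key, value in discount.items():
            flat[f"discount_{index}_{key}"] = value
    return flat

book = {
    "name": "The Book",
    "price": "$9.99",
    "discounts": [
        {"price": "$3.00", "synopsis": "thanksgiving special",
         "starts": "11-24-2011", "ends": "11-25-2011"},
        {"price": "$4.00", "synopsis": "Canadian thanksgiving special",
         "starts": "10-10-2011", "ends": "10-11-2011"},
    ],
}
solr_doc = flatten_discounts(book)
print(solr_doc)
```

The trade-off is that a query for "currently discounted" has to test each numbered pair of starts/ends fields, so the number of discounts per book needs a known upper bound.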

Resources