Azure Search Working with Complex Collections - azure-cognitive-search

Our data structure is similar to HotelId 1 example in the link https://learn.microsoft.com/en-us/azure/search/search-howto-complex-data-types
Our requirement is as follows:
Input: City = New York, StateProvince = NY, BaseRate = $100
Select fields: HotelId, HotelName, Description, Tags, Address, Rooms
Filter: Only rooms where BaseRate is less than or equal to the input rate, and where the Address City and StateProvince match the input values. In this example, the result should contain only the first room from Rooms, not all rooms.
Desired output:
{
"HotelId": "1",
"HotelName": "Secret Point Motel",
"Description": "Ideally located on the main commercial artery of the city in the heart of New York.",
"Tags": ["Free wifi", "on-site parking", "indoor pool", "continental breakfast"],
"Address": {
"StreetAddress": "677 5th Ave",
"City": "New York",
"StateProvince": "NY"
},
"Rooms": [
{
"Description": "Budget Room, 1 Queen Bed (Cityside)",
"RoomNumber": 1105,
"BaseRate": 96.99
}
]
}
Any help or direction on how to write a query for this or any pointers would be welcome.

The records in the hotels sample index are hotels, not rooms. Think of it as an index of Documents that contain Paragraphs: you can search for a Document (hotel) that has something within a Paragraph (room), but the result is always a list of Documents. As far as I know, there is no way to remove individual items of a complex collection from a record in the response.
By the way, the query that does what you ask (except for filtering out rooms) is:
search=Address/City:"New York" AND Address/StateProvince:"NY"&$select=HotelId,HotelName,Description,Tags,Address,Rooms&$count=true&searchMode=all&queryType=full&$filter=Rooms/any(room: room/BaseRate le 100.0)
Possible workarounds:
Design your index with rooms as records
Filter out rooms above the selected base rate in your frontend application.
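The second workaround can be sketched in JavaScript. This is only a minimal sketch, assuming the returned documents have the shape shown in the question; `filterRooms` and `maxRate` are hypothetical names (not part of any Azure SDK), and the second room in the sample data is invented for illustration.

```javascript
// Hypothetical helper: keep only the rooms at or below the requested rate.
// `hotels` is assumed to be the array of documents returned by the search.
function filterRooms(hotels, maxRate) {
  return hotels.map((hotel) => ({
    ...hotel,
    Rooms: (hotel.Rooms || []).filter((room) => room.BaseRate <= maxRate),
  }));
}

// Sample data: the first room comes from the question, the second is made up.
const hotels = [
  {
    HotelId: "1",
    HotelName: "Secret Point Motel",
    Rooms: [
      { Description: "Budget Room, 1 Queen Bed (Cityside)", RoomNumber: 1105, BaseRate: 96.99 },
      { Description: "Made-up pricier room", RoomNumber: 9999, BaseRate: 150.0 },
    ],
  },
];

const filtered = filterRooms(hotels, 100);
console.log(JSON.stringify(filtered, null, 2));
```

The hotel-level conditions (city, state, at least one matching room) stay in the search query; the helper only trims the Rooms array of each hotel that was returned.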

Related

Solr - How to index and boost results using several popularity fields?

Here is a sample document -
{
"id": "AIRPORT-LAS",
"RANK": 80.0,
"TYPE": "AIRPORT",
"COUNTRY_NAME": "United States",
"COUNTRY_CODE": "US",
"ISO_COUNTRY_CODE": "US",
"LATITUDE": "36.08047103880001",
"LONGITUDE": "-115.14331054699983",
"LATLON": [
"36.08047103880001,-115.14331054699983"
],
"CITY_CODE": "LAS",
"CITY_NAME": "Las Vegas",
"PROVINCE_CODE": "NV",
"PROVINCE_NAME": "NEVADA",
"AIRPORT_NAME": "McCarran Intl Airport",
"AIRPORT_CODE": "LAS"
}
Depending on where (geographically) the customer is searching from, each of the above documents will have several RANK values, one per State and Country combination.
For example -
For AIRPORT-LAS, I'll have the following -
USA - CA - 100
USA - NJ - 80
USA - NY - 75
.... rest of combinations
I am trying to understand the following -
What is the best way to index this new set of ranks to the existing documents? As a separate collection? Or as a nested data set?
How can I boost my results using the new set of ranks at search time? [so basically, if the user is searching from USA - CA, I should be using RANK=100, to boost my search results. I would know the State and Country at search time.]
Thank You!
If you want to integrate numeric document values directly into the score, use a boost function at query time. You may also use multiple document values here, but take care to choose an adequate boost factor.
bf=mul(RANK, 2)
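One way to apply this to per-region ranks can be sketched as follows. It assumes each document carries one numeric field per Country and State combination, named like `RANK_US_CA`; that naming scheme, the helper `buildBoostedQuery`, and the factor of 2 are assumptions for illustration, not something stated in the question. `bf` is a parameter of the dismax and edismax query parsers, so the sketch selects edismax.

```javascript
// Hypothetical helper: pick the rank field for the caller's country/state
// and pass it to the `bf` (boost function) parameter.
// Field names such as RANK_US_CA are an assumed naming scheme.
function buildBoostedQuery(userQuery, countryCode, stateCode, factor = 2) {
  const rankField = `RANK_${countryCode}_${stateCode}`;
  const params = new URLSearchParams({
    defType: "edismax", // bf is a (e)dismax parameter
    q: userQuery,
    bf: `mul(${rankField},${factor})`,
  });
  return params.toString();
}

const queryString = buildBoostedQuery("las vegas", "US", "CA");
console.log(queryString);
```

The returned string is URL-encoded and can be appended to the select handler URL of the collection.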

Loading an Array tag from a .JSON file into snowflake

I have a json file that I am loading into snowflake. One of the keys in the file has a value that is an array. The question is how do I load this tag into a separate column of type ARRAY in snowflake? It's already an array in json. Do I still need to use an array_construct(tag_name_here) function to load it up? What happens if in subsequent records, the 'industry' tag is missing altogether? Please advise.
Below is a sample of the json...
[
[
{
"title": "Avino Silver & Gold Mines Ltd. Fourth Quarter and Year End Results to be Released on....",
"pubDate": "Tue, 25 Feb 2020 00:49:00 +0000",
"description": " Avino Silver & Gold Mines Ltd. plans to announce its Fourth Quarter and Year End 2019 financial results after the market closes. In addition, the Company...",
"industry": [
"Mining & Metals ",
"Mining ",
"MNG",
"MIN"
],
"subject": [
"Conference Call Announcements ",
"Earnings "
]
}
]
]
Take a look at the examples here:
https://docs.snowflake.net/manuals/user-guide/querying-semistructured.html
Generally, you extract data rather than construct it (just use value:industry to fill your array column). If the tag is missing in some record, the column is simply filled with NULL.

Is it possible to change the scoring profile based on the number of tags?

I have a document with a collection of strings representing the number of times that document appears in a region (tags). For example:
[{
"id": "A",
// other properties
"regions": ["3", "3", "3", "2"] // Appears 3 times in region "3" and once in region "2"
},
{
"id": "B",
// other properties
"regions": ["3", "3", "1"] // Appears twice in region "3" and once in region "1"
}]
I tried using a custom scoring profile of type Tag, but I don't see how to give documents with more regions a better score. In other words, I want document A that appears 3 times in region 3 to show before document B that only appears twice in region 3.
FYI, the reason we chose to represent regions this way is that there are far too many regions and not all documents appear in all regions. More details here
Is this doable? This way or another way?
The tag scoring profile checks for the existence of a tag. If the tag appears multiple times, it has no effect on the score.
I've read your other post here. One solution you could consider (which is not exactly what you want) is to bucket the regions based on count. For example, you'd have a collection of regions where the document shows up less than 10 times, between 10 and 50, between 50 and 100 (pick the ranges in a way that makes sense for the distribution of region occurrences in your scenario). Your documents would look like this:
{
"id": "A",
"regions10": ["3", "2"], // Appears in region 3 and 2 less than 10 times
"regions50": ["1"] // Appears in region 1 between 10 and 50 times
}
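The bucketing step itself can be sketched in JavaScript as a pre-indexing transform. This is a minimal sketch under assumptions: `bucketRegions` and `BUCKETS` are hypothetical names, the bucket boundaries are treated as half-open (a count of exactly 10 goes into regions50), and counts of 100 or more are placed in the last bucket.

```javascript
// Hypothetical pre-indexing step: turn a raw list of region occurrences into
// bucketed fields. Field names and upper bounds follow the ranges suggested
// above; adjust them to your own distribution.
const BUCKETS = [
  { field: "regions10", below: 10 },
  { field: "regions50", below: 50 },
  { field: "regions100", below: 100 },
];

function bucketRegions(id, regions) {
  // Count how many times each region occurs.
  const counts = new Map();
  for (const region of regions) {
    counts.set(region, (counts.get(region) || 0) + 1);
  }
  const doc = { id };
  for (const { field } of BUCKETS) doc[field] = [];
  for (const [region, count] of counts) {
    // First bucket whose upper bound exceeds the count; counts of 100 or
    // more fall into the last bucket in this sketch.
    const bucket =
      BUCKETS.find((b) => count < b.below) || BUCKETS[BUCKETS.length - 1];
    doc[bucket.field].push(region);
  }
  return doc;
}

const bucketed = bucketRegions("A", ["3", "3", "3", "2"]);
console.log(bucketed);
```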
Then, you could use a Weights scoring profile to boost documents that matched in the higher count regions:
"scoringProfiles": [
{
"name": "boostRegions",
"text": {
"weights": {
"regions10": 1,
"regions50": 2,
"regions100": 3
}
}
}
]
This is not a good solution if you need strict ordering based on the region count, if you can't precompute the region counts, or if the entire range of values is large (say 0 to 2^31) while the individual buckets need to be small (you'd end up with too many fields).
The problem you have is a data modeling problem. You're trying to retrieve documents based on a property of the document, which is whether it contains a region in a set of regions, but score/boost the document based on the properties of the region, not the document. You'd have to have a document in the index for each document-region pair and a property with the number of times a given document appeared in that region.
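The document-region pair model can be sketched as a small transform applied before indexing. The function name `expandToPairs`, the field names, and the `id` key scheme are assumptions for illustration only.

```javascript
// Hypothetical pre-indexing step: one index record per document-region pair,
// with the occurrence count as a numeric property.
function expandToPairs(doc) {
  const counts = new Map();
  for (const region of doc.regions) {
    counts.set(region, (counts.get(region) || 0) + 1);
  }
  return [...counts].map(([region, count]) => ({
    id: `${doc.id}-${region}`, // assumed key scheme
    documentId: doc.id,
    region,
    count,
  }));
}

const pairs = expandToPairs({ id: "A", regions: ["3", "3", "3", "2"] });
console.log(pairs);
```

With records like these, a query restricted to region "3" could order or boost by `count`, so that A's record (count 3) ranks above B's (count 2).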

solr sorting intended only for two document

Please see my sample solr document below.
{
"title": "Apple"
},
{
"title": "Banana",
"popularity": 2
},
{
"title": "Mango",
"popularity": 3
},
{
"title": "Lemon",
"popularity": 1
}
By default the query is "title":*, so all of those Solr documents are returned, sorted by title in ascending order. The result looks like this:
Apple
Banana
Lemon
Mango
Now I want to add another sort, which is a bit tricky for me to implement. I want to sort by popularity in descending order, but only for documents whose popularity is 3 or 2, and by title in ascending order for the rest. The result should look like this:
Mango
Banana
Apple
Lemon
The question is what would be the query?
Thanks
You can sort it as follows:
sort=map(popularity,2,3, popularity,0) desc, title asc
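To see why this produces the requested order, the sort expression can be emulated in plain JavaScript. This is not Solr code; it only mirrors the logic of map(popularity,2,3,popularity,0), whose min and max bounds are inclusive, and it assumes that a document without a popularity value is treated as 0.

```javascript
// Emulates the sort expression above to show the resulting ordering.
function mappedPopularity(doc) {
  const p = doc.popularity || 0; // assumed: a missing value counts as 0
  return p >= 2 && p <= 3 ? p : 0; // values inside [2, 3] are kept, others become 0
}

const docs = [
  { title: "Apple" },
  { title: "Banana", popularity: 2 },
  { title: "Mango", popularity: 3 },
  { title: "Lemon", popularity: 1 },
];

const sorted = [...docs].sort(
  (a, b) =>
    mappedPopularity(b) - mappedPopularity(a) || // mapped value, descending
    a.title.localeCompare(b.title) // then title, ascending
);

console.log(sorted.map((d) => d.title).join(", "));
// prints "Mango, Banana, Apple, Lemon"
```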

Using map reduce in CouchDB to output fewer rows

Let's say you have two document types, customers and orders. A customer document contains basic information like name, address etc. and orders contain all the order information each time a customer orders something. When storing the documents, the type = order or the type = customer.
If I do a map function over a set of 10 customers and 30 orders it will output 40 rows. Some rows will be customers, some will be orders.
The question is, how do I write the reduce, so that the order information is "stuffed" inside of the rows that have the customer information? So it will return 10 rows (10 customers), but all the relevant orders for each customer.
Basically I don't want separate records on the output, I want to combine them (orders into one customer row) and I think reduce is the way?
This is called view collation and it is a very useful CouchDB technique.
Fortunately, you don't even need a reduce step. Just use map to get the customers and their orders "clumped" together.
Setup
The key is that you need a unique id for each customer, and it has to be known both from customer docs and from order docs.
Example customer:
{ "_id": "customer me#example.com"
, "type": "customer"
, "name": "Jason"
}
Example order:
{ "_id": "abcdef123456"
, "type": "order"
, "for_customer": "customer me#example.com"
}
I have conveniently used the customer ID as the document _id but the important thing is that both docs know the customer's identity.
Payoff
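The goal is a map query where, if you ask for a single customer, you get back (1) first, the customer info, and (2) any and all orders placed. Because the keys emitted below are arrays, the request takes a key range rather than a plain ?key=, for example ?startkey=["customer me#example.com"]&endkey=["customer me#example.com",{}].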
This map function would do that:
function(doc) {
var CUSTOMER_VAL = 1;
var ORDER_VAL = 2;
var key;
if(doc.type === "customer") {
key = [doc._id, CUSTOMER_VAL];
emit(key, doc);
}
if(doc.type === "order") {
key = [doc.for_customer, ORDER_VAL];
emit(key, doc);
}
}
All rows will sort primarily on the customer the document is about, and the "tiebreaker" sort is either the integer 1 or 2. That makes customer docs always sort above their corresponding order docs.
["customer me#example.com", 1], ...customer doc...
["customer me#example.com", 2], ...customer's order...
["customer me#example.com", 2], ...customer's other order.
... etc...
["customer another#customer.com", 1], ... different customer...
["customer another#customer.com", 2], ... different customer's order
P.S. If you follow all that: instead of 1 and 2 a better value might be null for the customer, then the order timestamp for the order. They will sort identically as before except now you have a chronological list of orders.
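If the client really needs one object per customer, the collated rows can be folded together after the view returns. The following is a minimal sketch, assuming the rows arrive in view order and use the [customerId, 1] and [customerId, 2] keys from the map function above; `groupRows` is a hypothetical helper, the second order id is invented, and the P.S. variant with null and timestamps would need a different check.

```javascript
// Hypothetical client-side step: fold collated view rows into one object
// per customer, with that customer's orders attached.
function groupRows(rows) {
  const customers = [];
  let current = null;
  for (const row of rows) {
    const [customerId, kind] = row.key;
    if (kind === 1) {
      // Customer row: start a new group.
      current = { ...row.value, orders: [] };
      customers.push(current);
    } else if (current && current._id === customerId) {
      // Order row that belongs to the customer seen just before it.
      current.orders.push(row.value);
    }
    // Orders without a preceding customer row are skipped in this sketch.
  }
  return customers;
}

const rows = [
  { key: ["customer me#example.com", 1],
    value: { _id: "customer me#example.com", type: "customer", name: "Jason" } },
  { key: ["customer me#example.com", 2],
    value: { _id: "abcdef123456", type: "order", for_customer: "customer me#example.com" } },
  { key: ["customer me#example.com", 2],
    value: { _id: "abcdef123457", type: "order", for_customer: "customer me#example.com" } },
];

const grouped = groupRows(rows);
console.log(JSON.stringify(grouped, null, 2));
```

This keeps the view itself reduce-free, as suggested above, and does the "stuffing" in a single pass over the sorted rows.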
