Here is a sample document -
{
"id": "AIRPORT-LAS",
"RANK": 80.0,
"TYPE": "AIRPORT",
"COUNTRY_NAME": "United States",
"COUNTRY_CODE": "US",
"ISO_COUNTRY_CODE": "US",
"LATITUDE": "36.08047103880001",
"LONGITUDE": "-115.14331054699983",
"LATLON": [
"36.08047103880001,-115.14331054699983"
],
"CITY_CODE": "LAS",
"CITY_NAME": "Las Vegas",
"PROVINCE_CODE": "NV",
"PROVINCE_NAME": "NEVADA",
"AIRPORT_NAME": "McCarran Intl Airport",
"AIRPORT_CODE": "LAS"
}
Now, depending on where (geographic location) the customer is searching from, I'll have several RANK values for each of the above documents, one per State and Country combination.
For example -
For AIRPORT-LAS, I'll have the following -
USA - CA - 100
USA - NJ - 80
USA - NY - 75
.... rest of combinations
I am trying to understand the following -
What is the best way to index this new set of ranks to the existing documents? As a separate collection? Or as a nested data set?
How can I boost my results using the new set of ranks at search time? [so basically, if the user is searching from USA - CA, I should be using RANK=100, to boost my search results. I would know the State and Country at search time.]
Thank You!
If you want to integrate numeric document values directly into the score, use a boost function at query time (the bf parameter of the dismax/edismax query parsers). You can also combine multiple document values here, but take care to choose an adequate boost factor.
bf=mul(RANK, 2)
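For the region-specific ranks in the question, one possible approach (a sketch under assumptions, not a confirmed setup) is to index each State/Country rank as its own numeric field on the airport document and pick the matching field when building the boost function. The field naming scheme RANK_<country>_<state>, the helper name build_boost_params, and the choice of edismax are all hypothetical; def() is used so that documents without a rank for that region fall back to the global RANK.

```python
from urllib.parse import urlencode

def build_boost_params(user_query, country, state, factor=2):
    """Build Solr query parameters that boost by a region-specific rank.

    Assumes (hypothetically) that ranks are indexed as numeric fields
    named RANK_<country>_<state>, for example RANK_US_CA.
    """
    rank_field = f"RANK_{country}_{state}"
    return {
        "q": user_query,
        "defType": "edismax",
        # def(field, fallback) uses the fallback when the field has no value
        "bf": f"mul(def({rank_field},RANK),{factor})",
    }

# The State and Country are known at search time
params = build_boost_params("las vegas", "US", "CA")
query_string = urlencode(params)
```

Keeping the ranks on the same document avoids a join at query time, at the cost of one field per State/Country combination.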
Our data structure is similar to the HotelId 1 example at https://learn.microsoft.com/en-us/azure/search/search-howto-complex-data-types
Our requirement is as follows:
Input: City = New York, StateProvince = NY, BaseRate = $100
Select fields: HotelId, HotelName, Description, Tags, Address, Rooms
Filter: Only rooms where BaseRate is less than or equal to the input rate, and only hotels whose Address City and StateProvince match the input values. In this example, it should select only the first room from Rooms, not all Rooms.
Desired output:
{
"HotelId": "1",
"HotelName": "Secret Point Motel",
"Description": "Ideally located on the main commercial artery of the city in the heart of New York.",
"Tags": ["Free wifi", "on-site parking", "indoor pool", "continental breakfast"],
"Address": {
"StreetAddress": "677 5th Ave",
"City": "New York",
"StateProvince": "NY"
},
"Rooms": [
{
"Description": "Budget Room, 1 Queen Bed (Cityside)",
"RoomNumber": 1105,
"BaseRate": 96.99
}
]
}
Any help or direction on how to write a query for this or any pointers would be welcome.
The records in the hotels sample index consist of hotels, not rooms. Think of it as an index with Documents and Paragraphs. You may search for a Document (hotel) which has something within a Paragraph (room). The result you get would always be a list of Documents. From what I know there is no way to remove certain complex types from a record in a response.
By the way, the query that does what you ask (except for filtering out rooms) is:
search=Address/City:"New York" AND Address/StateProvince:"NY"&$select=HotelId,HotelName,Description,Tags,Address,Rooms&$count=true&searchMode=all&queryType=full&$filter=Rooms/any(room: room/BaseRate le 100.0)
Possible workarounds:
Design your index with rooms as records
Filter out rooms above the selected base rate in your frontend application.
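The second workaround can be sketched in a few lines. This is a minimal illustration that assumes the search response has already been parsed into a dictionary; the function name filter_rooms and the second room in the sample data are made up for the example.

```python
def filter_rooms(hotel, max_rate):
    """Return a copy of the hotel with only rooms at or below max_rate."""
    kept = [room for room in hotel.get("Rooms", []) if room["BaseRate"] <= max_rate]
    return {**hotel, "Rooms": kept}

hotel = {
    "HotelId": "1",
    "HotelName": "Secret Point Motel",
    "Rooms": [
        {"Description": "Budget Room, 1 Queen Bed (Cityside)",
         "RoomNumber": 1105, "BaseRate": 96.99},
        # Hypothetical second room, added only to show the filter at work
        {"Description": "Deluxe Room (hypothetical)",
         "RoomNumber": 2201, "BaseRate": 131.99},
    ],
}

filtered = filter_rooms(hotel, 100.0)
```

The original hotel dictionary is left untouched, so the full result can still be cached or reused with a different rate.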
So, I'm designing the model for the documents that I'll insert in my database and I have a question about the design. Is it better to just insert more documents in my collections or fewer nested documents?
Example:
sale:{
store_id : "2",
vendor_id: "2",
points : 100
}
sale:{
store_id : "2",
vendor_id: "2",
points : 100
}
sale:{
store_id : "2",
vendor_id: "2",
points : 100
}
sale:{
store_id : "4",
vendor_id: "3",
points : 100
}
sale:{
store_id : "4",
vendor_id: "1",
points : 100
}
So, in this non-nested example, if I have N sales, I'll have N sale documents in my collection. But if I nest them, my example becomes:
stores: [
  {
    store_id: "2",
    vendor: [
      {
        vendor_id: "2",
        sales: [
          { points: 100 },
          { points: 100 },
          { points: 100 }
        ]
      }
    ]
  },
  {
    store_id: "4",
    vendor: [
      {
        vendor_id: "3",
        sales: [ { points: 100 } ]
      },
      {
        vendor_id: "1",
        sales: [ { points: 100 } ]
      }
    ]
  }
]
In this example, I nest all my sales.
So, my question is: to create reports and analyze data, which one is faster? If I want to see which store sold more for example, will it be faster to analyze nested documents or one line documents?
Thank you in advance.
The answer is pretty simple. If you know there are going to be a lot of sales and their number is not bounded, you should go for a separate collection for sales. MongoDB is designed to perform very fast even with a million documents in a collection, whereas nesting is going to cause you a lot of issues.
Also, there is a 16 MB document size limit in MongoDB, so after a while a single store document will reach that limit and things will get pretty ugly.
It's quite clear that you should go for a separate collection.
You can also read this blog post, which should clear things up for you:
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
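To make the "which store sold more" report concrete, here is a rough sketch over the flat sale documents from the question. The pipeline is written as plain data using the standard $group and $sort stages; with a driver it would be handed to the collection's aggregate call, which is not executed here. The function total_points_by_store is a hypothetical in-memory equivalent, included only so the example runs without a database.

```python
from collections import defaultdict

# Flat sale documents, one per sale, as in the question's first layout
sales = [
    {"store_id": "2", "vendor_id": "2", "points": 100},
    {"store_id": "2", "vendor_id": "2", "points": 100},
    {"store_id": "2", "vendor_id": "2", "points": 100},
    {"store_id": "4", "vendor_id": "3", "points": 100},
    {"store_id": "4", "vendor_id": "1", "points": 100},
]

# Aggregation pipeline expressed as plain data (not run in this sketch)
pipeline = [
    {"$group": {"_id": "$store_id", "total_points": {"$sum": "$points"}}},
    {"$sort": {"total_points": -1}},
]

def total_points_by_store(docs):
    """In-memory equivalent of the pipeline above, for illustration only."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc["store_id"]] += doc["points"]
    # Highest total first, mirroring the $sort stage
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = total_points_by_store(sales)
```

With the flat layout the grouping key is a top-level field, so no unwinding of nested arrays is needed before grouping.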
I have a document with a collection of strings representing the number of times that document appears in a region (tags). For example:
[{
"id": "A",
// other properties
"regions": ["3", "3", "3", "2"] // Appears 3 times in region "3" and once in region "2"
},
{
"id": "B",
// other properties
"regions": ["3", "3", "1"] // Appears twice in region "3" and once in region "1"
}]
I tried using a custom scoring profile of type Tag, but I don't see how to give documents with more regions a better score. In other words, I want document A that appears 3 times in region 3 to show before document B that only appears twice in region 3.
FYI, the reason we chose to represent regions this way is because there are way too many regions and not all documents appear in all regions. More details here
Is this doable? This way or another way?
The tag scoring profile checks for the existence of a tag. If the tag appears multiple times, it has no additional effect on the score.
I've read your other post here. One solution you could consider (which is not exactly what you want) is to bucket the regions based on count. For example, you'd have one collection of regions where the document shows up fewer than 10 times, one for between 10 and 50 times, and one for between 50 and 100 times (pick the ranges in a way that makes sense for the distribution of region occurrences in your scenario). Your documents would look like this:
{
"id": "A",
"regions10": ["3", "2"], // Appears in regions 3 and 2 fewer than 10 times
"regions50": ["1"] // Appears in region 1 between 10 and 50 times
}
Then, you could use a Weights scoring profile to boost documents that matched in the higher-count regions:
"scoringProfiles": [
{
"name": "boostRegions",
"text": {
"weights": {
"regions10": 1,
"regions50": 2,
"regions100": 3
}
}
}
]
This is not a good solution if you need strict ordering based on the region count, if you can't precompute the region counts, or if the entire range of values is large (say 0 to 2^31) while the individual buckets need to be small (you'd end up with too many fields).
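The bucketing step described above could be precomputed at indexing time along these lines. This is only a sketch: the bucket bounds and field names come from the example, while the function name bucket_regions, the sample document C, and the choice to put counts of 100 or more into the top bucket are assumptions.

```python
from collections import Counter

# Exclusive upper bounds and field names taken from the example above.
# Counts of 100 or more fall into the top bucket (an assumption).
BUCKETS = [(10, "regions10"), (50, "regions50"), (100, "regions100")]

def bucket_regions(doc_id, regions):
    """Turn a repeated-region list into one field per count bucket."""
    out = {"id": doc_id}
    for region, count in Counter(regions).items():
        field = BUCKETS[-1][1]          # default: top bucket
        for upper, name in BUCKETS:
            if count < upper:
                field = name
                break
        out.setdefault(field, []).append(region)
    return out

doc_a = bucket_regions("A", ["3", "3", "3", "2"])
# Hypothetical document: 12 occurrences in region "1", one in region "2"
doc_c = bucket_regions("C", ["1"] * 12 + ["2"])
```

Only buckets that actually contain a region become fields, so sparse documents stay small.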
The problem you have is a data modeling problem. You're trying to retrieve documents based on a property of the document (whether it contains a region in a set of regions), but score/boost the document based on a property of the region, not the document. You'd have to have a document in the index for each document-region pair, with a property holding the number of times the given document appeared in that region.
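As a rough illustration of that remodeling, the snippet below expands each source document into one record per document-region pair. The output field names (docId, region, regionCount), the composite key format, and the function name expand_by_region are hypothetical; the sort at the end stands in for what would be a filter plus an order-by on the count field in the index.

```python
from collections import Counter

def expand_by_region(doc):
    """Return one record per (document, region) pair with its count."""
    counts = Counter(doc["regions"])
    return [
        {
            "id": f'{doc["id"]}-{region}',  # hypothetical composite key
            "docId": doc["id"],
            "region": region,
            "regionCount": count,
        }
        for region, count in counts.items()
    ]

docs = [
    {"id": "A", "regions": ["3", "3", "3", "2"]},
    {"id": "B", "regions": ["3", "3", "1"]},
]
records = [rec for doc in docs for rec in expand_by_region(doc)]

# Keep only region "3" and put the highest count first
in_region_3 = sorted(
    (rec for rec in records if rec["region"] == "3"),
    key=lambda rec: rec["regionCount"],
    reverse=True,
)
```

This gives strict ordering by count, at the price of more records in the index and the other document properties being repeated on each pair.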
Please see my sample Solr documents below.
{
"title": "Apple"
},
{
"title": "Banana",
"popularity": 2
},
{
"title": "Mango",
"popularity": 3
},
{
"title": "Lemon",
"popularity": 1
}
By default the query is "title":*, so all of those Solr documents are returned as results, sorted by title in ascending order. It looks like this:
Apple
Banana
Lemon
Mango
Now, what I want is to add another sort, which is a bit tricky, at least for me, to implement :(. I want to sort by popularity in descending order, but only for documents whose popularity is 3 or 2, and then by title in ascending order. The result should look like this:
Mango
Banana
Apple
Lemon
The question is what would be the query?
Thanks
You can sort it as follows:
sort=map(popularity,2,3, popularity,0) desc, title asc
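To see why this produces the desired order, the sort can be emulated in plain Python. This is only a model of the sort expression, not Solr code; it assumes a document without a popularity value is treated as 0.

```python
docs = [
    {"title": "Apple"},
    {"title": "Banana", "popularity": 2},
    {"title": "Mango", "popularity": 3},
    {"title": "Lemon", "popularity": 1},
]

def sort_key(doc):
    # A missing popularity is treated as 0 here (assumption for the sketch)
    popularity = doc.get("popularity", 0)
    # map(popularity,2,3,popularity,0): keep values in [2, 3], otherwise 0
    mapped = popularity if 2 <= popularity <= 3 else 0
    # Negate for descending order, then title ascending as the tie-breaker
    return (-mapped, doc["title"])

ordered = [doc["title"] for doc in sorted(docs, key=sort_key)]
```

Apple and Lemon both map to 0, so the secondary title sort decides their relative order.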
I am trying to fit the following data in Solr to support flexible queries and would like to get your input on the same. I have data about users say:
contentID (assume uuid),
platform (eg. website, mobile etc),
softwareVersion (eg. sw1.1, sw2.5, ..etc),
regionId (eg. us144, uk123, etc..)
....
and few more other such fields. This data is partially pre aggregated (read Hadoop jobs): so let’s assume for "contentID = uuid123 and platform = mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in format:
timestamp pre-aggregated data [ uniques, total]
Jan 15 [ 12, 4]
Jan 14 [ 4, 3]
Jan 13 [ 8, 7]
... ...
And then I also have less granular data, say for "contentID = uuid123 and platform = mobile and softwareVersion = ANY and regionId = ANY" (these values will be larger than in the table above, since the granularity is reduced):
timestamp : pre-aggregated data [uniques, total]
Jan 15 [ 100, 40]
Jan 14 [ 45, 30]
... ...
I'll get queries like "contentID = uuid123 and platform = mobile" , give sum of 'uniques' for Jan15 - Jan13 or for "contentID=uuid123 and platform=mobile and softwareVersion=sw1.2", give sum of 'total' for Jan15 - Jan01.
I was thinking of simple schema where documents will be like (first example above):
{
"contentID": "uuid12349789",
"platform" : "mobile",
"softwareVersion": "sw1.2",
"regionId": "ANY",
"ts" : "2017-01-15T01:01:21Z",
"unique": 12,
"total": 4
}
second example from above:
{
"contentID": "uuid12349789",
"platform" : "mobile",
"softwareVersion": "ANY",
"regionId": "ANY",
"ts" : "2017-01-15T01:01:21Z",
"unique": 100,
"total": 40
}
Possible optimization:
{
"contentID": "uuid12349789",
"platform.mobile.softwareVersion.sw1.2.region.us12" : {
"unique": 12,
"total": 4
},
"platform.mobile.softwareVersion.sw1.2.region.ANY" : {
"unique": 100,
"total": 40
},
"ts" : "2017-01-15T01:01:21Z"
}
Challenges: The number of such rows is very large and will grow exponentially with every new field. For instance, if I go with the schema suggested above, I'll end up storing a new document for each combination of contentID, platform, softwareVersion, and regionId. If we add another field to this document, the number of combinations increases exponentially. I already have more than a billion such combination rows.
I am hoping to get advice from experts on whether:
Multiple such fields can be fit in same document for different 'ts' such that range queries are possible on it.
time range (ts) can be fit in same document as a list(?) (to reduce number of rows). I know multivalued fields don't support complex data types, but if anything else can be done with the data/schema to reduce query time and number of rows.
The number of these rows is very large, certainly more than 1 billion (if we go with the schema I was suggesting). What schema would you suggest that fits these query requirements?
FYI: All queries will be exact match on fields (no partial or tokenized), so no analysis on fields is necessary. And almost all queries are range queries.
You are trying to store the query-time results for every possible combination of attribute values. That's just too much duplicate data. Instead, store each observation together with its attributes as a single data point, just once. Then, if you have n observations and you add an additional attribute, the data grows additively, not exponentially. When you need data for a certain combination of attributes, you filter and aggregate at query time.
{
"contentID": "uuid12349789",
"ts" : "2017-01-15T01:01:21Z",
"observation": 10001,
"attr-platform" : "mobile",
"attr-softwareVersion": "sw1.2",
"attr-regionId": "US"
}
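The filter-and-aggregate step can be sketched in plain Python to show the idea. The sample observation values, the extra attribute values, and the function names parse_ts and sum_observations are made up for illustration; in practice the search engine would do this filtering and summing, not application code.

```python
from datetime import datetime, timezone

def parse_ts(value):
    """Parse timestamps in the 2017-01-15T01:01:21Z form used above."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

def sum_observations(docs, start, end, **attrs):
    """Sum 'observation' over docs matching all attrs with start <= ts <= end."""
    lo, hi = parse_ts(start), parse_ts(end)
    total = 0
    for doc in docs:
        if not lo <= parse_ts(doc["ts"]) <= hi:
            continue
        # Attributes not named in the query are simply not filtered on
        if all(doc.get(f"attr-{name}") == value for name, value in attrs.items()):
            total += doc["observation"]
    return total

# Sample data points; the values are invented for the sketch
docs = [
    {"contentID": "uuid12349789", "ts": "2017-01-15T01:01:21Z", "observation": 12,
     "attr-platform": "mobile", "attr-softwareVersion": "sw1.2", "attr-regionId": "US"},
    {"contentID": "uuid12349789", "ts": "2017-01-14T01:01:21Z", "observation": 4,
     "attr-platform": "mobile", "attr-softwareVersion": "sw2.5", "attr-regionId": "UK"},
    {"contentID": "uuid12349789", "ts": "2017-01-14T09:00:00Z", "observation": 8,
     "attr-platform": "website", "attr-softwareVersion": "sw1.2", "attr-regionId": "US"},
]

start, end = "2017-01-13T00:00:00Z", "2017-01-15T23:59:59Z"
mobile_total = sum_observations(docs, start, end, platform="mobile")
mobile_sw12 = sum_observations(docs, start, end,
                               platform="mobile", softwareVersion="sw1.2")
```

Leaving an attribute out of the query plays the role of the "ANY" rows in the original schema, so those rows no longer need to be stored.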