I wanted to find out how many records CouchDB holds for a database, so I called the API below. It returns JSON that includes the record count.
GET http://localhost:20984/mydb/
JSON
{
"db_name": "mydb",
"update_seq": "25577-g1AAAAFLeJ4mDJtXMoSQMrqYcpacSjLYwGSDA1ACqhyPljpJ7xKF0CU7gcrXYlX6QGI0vtgpWJ4lT6AKIW4tTYLACqwZ0c",
"sizes": {
"file": 20881199,
"external": 11977342,
"active": 16542736
},
"purge_seq": 0,
"other": {
"data_size": 11977342
},
"doc_del_count": 0,
"doc_count": 25569,
"disk_size": 20881199,
"disk_format_version": 6,
"data_size": 16542736,
"compact_running": false,
"cluster": {
"q": 8,
"n": 1,
"w": 1,
"r": 1
},
"instance_start_time": "0"
}
Here doc_count is 25569, so I assume there are 25569 records in total.
But when I set document_per_page to 100 and start paging through the records, it shows 100 records for the first page, 200 records after two pages, and so on. If I keep going this way, it shows me more than 35000 records.
Now my question is: if there are 25569 records in total, how can CouchDB show me more than 35000 records with pagination?
You're right about doc_count: it reports the number of documents in the specified database.
Presuming you're using Fauxton and switched "Documents per page" to 100 there, I was able to reproduce the described behavior. When I repeatedly pressed the next arrow at short intervals, the displayed data and the numbers on the paginator went out of sync. This therefore seems to be a bug in Fauxton.
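As a sanity check on the paginator, the number of pages it should be able to reach follows directly from doc_count. A minimal sketch, assuming the database info JSON from the question has already been fetched and parsed (the function name expected_pages is made up for illustration):

```python
import math


def expected_pages(db_info: dict, docs_per_page: int) -> int:
    """Return how many pages are needed to list every document once.

    db_info is the parsed JSON returned by GET /{db}; only doc_count is used.
    """
    return math.ceil(db_info["doc_count"] / docs_per_page)


# doc_count taken from the response shown in the question.
info = {"db_name": "mydb", "doc_count": 25569, "doc_del_count": 0}
print(expected_pages(info, 100))  # 256
```

With 100 documents per page the listing should end on page 256, so a counter that runs past 35000 cannot reflect the stored data.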
Related
Search example: query "hospital" with country set IN. The total number of results found is <= 100, and if you send the request again the total changes. This started happening recently; it was working fine a couple of weeks ago.
The result below is definitely incorrect, as totalResults should be in the thousands.
Example query
https://atlas.microsoft.com/search/poi/json?subscription-key={Your key here}&api-version=1.0&limit=100&ofs=0&countrySet=IN&query=hospital&maxfuzzylevel=3&minfuzzylevel=1
summary
"summary": {
"query": "hospital",
"queryType": "NON_NEAR",
"queryTime": 374,
"numResults": 95,
"offset": 0,
"totalResults": 95,
"fuzzyLevel": 1
},
Expected: totalResults in the thousands across India.
Our data structure is similar to HotelId 1 example in the link https://learn.microsoft.com/en-us/azure/search/search-howto-complex-data-types
Our requirement is as follows:
Input: City = New York, StateProvince = NY, BaseRate = $100
Select fields: HotelId, HotelName, Description, Tags, Address, Rooms
Filter: Only rooms where BaseRate is less than or equal to Input rate and Address City and State matches input values. In this example, it should only select the first room from Rooms, not all Rooms.
Desired output:
{
"HotelId": "1",
"HotelName": "Secret Point Motel",
"Description": "Ideally located on the main commercial artery of the city in the heart of New York.",
"Tags": ["Free wifi", "on-site parking", "indoor pool", "continental breakfast"],
"Address": {
"StreetAddress": "677 5th Ave",
"City": "New York",
"StateProvince": "NY"
},
"Rooms": [
{
"Description": "Budget Room, 1 Queen Bed (Cityside)",
"RoomNumber": 1105,
"BaseRate": 96.99
}
]
}
Any help or direction on how to write a query for this or any pointers would be welcome.
The records in the hotels sample index consist of hotels, not rooms. Think of it as an index with Documents and Paragraphs. You may search for a Document (hotel) which has something within a Paragraph (room). The result you get will always be a list of Documents. As far as I know, there is no way to remove certain complex types from a record in a response.
By the way, the query that does what you ask (except for filtering out rooms) is:
search=Address/City:"New York" AND Address/StateProvince:"NY"&$select=HotelId,HotelName,Description,Tags,Address,Rooms&$count=true&searchMode=all&queryType=full&$filter=Rooms/any(room: room/BaseRate le 100.0)
Possible workarounds:
Design your index with rooms as records
Filter out rooms above the selected base rate in your frontend application.
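The second workaround (filtering in the client) could look roughly like the sketch below. It assumes each hit has already been parsed into a dict shaped like the hotel document in the question; the helper name keep_affordable_rooms and the second room in the sample data are invented for illustration.

```python
import copy


def keep_affordable_rooms(hotel: dict, max_rate: float) -> dict:
    """Return a copy of the hotel whose Rooms have BaseRate <= max_rate."""
    result = copy.deepcopy(hotel)  # leave the original search hit untouched
    result["Rooms"] = [
        room for room in result.get("Rooms", []) if room["BaseRate"] <= max_rate
    ]
    return result


hotel = {
    "HotelId": "1",
    "HotelName": "Secret Point Motel",
    "Rooms": [
        {"Description": "Budget Room, 1 Queen Bed (Cityside)", "BaseRate": 96.99},
        # Hypothetical second room, added only to show the filter at work.
        {"Description": "Deluxe Room (example)", "BaseRate": 110.17},
    ],
}

filtered = keep_affordable_rooms(hotel, 100.0)
print([room["Description"] for room in filtered["Rooms"]])
```

The service-side filter still decides which hotels come back; this step only trims the Rooms array of each returned hotel.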
Here is a sample document -
{
"id": "AIRPORT-LAS",
"RANK": 80.0,
"TYPE": "AIRPORT",
"COUNTRY_NAME": "United States",
"COUNTRY_CODE": "US",
"ISO_COUNTRY_CODE": "US",
"LATITUDE": "36.08047103880001",
"LONGITUDE": "-115.14331054699983",
"LATLON": [
"36.08047103880001,-115.14331054699983"
],
"CITY_CODE": "LAS",
"CITY_NAME": "Las Vegas",
"PROVINCE_CODE": "NV",
"PROVINCE_NAME": "NEVADA",
"AIRPORT_NAME": "McCarran Intl Airport",
"AIRPORT_CODE": "LAS"
}
Now based on where (geographic location) the customer is searching, I'll be having several RANK(s) using State and Country combinations for each of the above documents.
For example -
For AIRPORT-LAS, I'll have the following -
USA - CA - 100
USA - NJ - 80
USA - NY - 75
.... rest of combinations
I am trying to understand the following -
What is the best way to index this new set of ranks to the existing documents? As a separate collection? Or as a nested data set?
How can I boost my results using the new set of ranks at search time? [so basically, if the user is searching from USA - CA, I should be using RANK=100, to boost my search results. I would know the State and Country at search time.]
Thank You!
If you want to integrate numeric document values directly into the score, use a boost function at query time. You may also use multiple document values here, but take care to choose an appropriate boost factor.
bf=mul(RANK, 2)
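A rough sketch of how such a request's parameters might be assembled on the client. It assumes the eDisMax query parser (bf is a DisMax/eDisMax parameter); the helper name boosted_query_params and the sample query text are made up, and the function is written without internal whitespace as a precaution.

```python
from urllib.parse import parse_qs, urlencode


def boosted_query_params(user_query: str, rank_field: str = "RANK", factor: int = 2) -> str:
    """Build a URL-encoded Solr parameter string with an additive boost function."""
    params = {
        "q": user_query,
        "defType": "edismax",                  # assumption: eDisMax parser is in use
        "bf": f"mul({rank_field},{factor})",   # boost by the stored rank times a factor
    }
    return urlencode(params)


query_string = boosted_query_params("las vegas airport")
# Decode again to show the boost function that would reach Solr.
print(parse_qs(query_string)["bf"])
```

If a separate rank were stored per state and country, the rank_field argument is where the field chosen at search time would go; how those fields are named is left open here.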
I have a document with a collection of strings representing the number of times that document appears in a region (tags). For example:
[{
"id": "A",
// other properties
"regions": ["3", "3", "3", "2"] // Appears 3 times in region "3" and once in region "2"
},
{
"id": "B",
// other properties
"regions": ["3", "3", "1"] // Appears twice in region "3" and once in region "1"
}]
I tried using a custom scoring profile of type Tag, but I don't see how to give documents with more regions a better score. In other words, I want document A that appears 3 times in region 3 to show before document B that only appears twice in region 3.
FYI, the reason we chose to represent regions this way is because there are way too many regions and not all documents appear in all regions. More details here
Is this doable? This way or another way?
The tag scoring profile checks for the existence of a tag. If the tag appears multiple times, it has no effect on the score.
I've read your other post here. One solution you could consider (which is not exactly what you want) is to bucket the regions based on count. For example, you'd have a collection of regions where the document shows up less than 10 times, between 10 and 50, between 50 and 100 (pick the ranges in a way that makes sense for the distribution of region occurrences in your scenario). Your documents would look like this:
{
"id": "A",
"regions10": ["3", "2"], // Appears in region 3 and 2 less than 10 times
"regions50": ["1"] // Appears in region 1 between 10 and 50 times
}
Then, you could use a Weights scoring profile to boost documents that matched in the higher count regions:
"scoringProfiles": [
{
"name": "boostRegions",
"text": {
"weights": {
"regions10": 1,
"regions50": 2,
"regions100": 3
}
}
}
]
This is not a good solution if you need strict ordering based on the region count, if you can't precompute the region counts, or if the entire range of values is large (say 0 to 2^31) while the individual buckets need to be small (you'd end up with too many fields).
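The bucketing step itself could be precomputed along these lines. This is only a sketch: the half-open ranges (fewer than 10, 10 to 49, 50 to 99) are one reading of the ranges described above, and the helper name bucket_regions is invented.

```python
from collections import Counter

# (field name, exclusive upper bound on the occurrence count); assumed ranges.
BUCKETS = [("regions10", 10), ("regions50", 50), ("regions100", 100)]


def bucket_regions(regions: list[str]) -> dict[str, list[str]]:
    """Group region ids into bucket fields by how often each region occurs."""
    fields: dict[str, list[str]] = {name: [] for name, _ in BUCKETS}
    for region, count in Counter(regions).items():
        for name, upper in BUCKETS:
            if count < upper:
                fields[name].append(region)
                break
        # Regions with 100 or more occurrences match no bucket and are
        # dropped in this sketch; a real index would need a field for them.
    return fields


print(bucket_regions(["3", "3", "3", "2"]))
```

For document A from the question, both regions occur fewer than 10 times, so both land in regions10.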
The problem you have is a data modeling problem. You're trying to retrieve documents based on a property of the document, which is whether it contains a region in a set of regions, but score/boost the document based on the properties of the region, not the document. You'd have to have a document in the index for each document-region pair and a property with the number of times a given document appeared in that region.
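A minimal sketch of that reshaping, turning one source document into one record per document-region pair. The output field names (docId, region, regionCount) and the composite id format are assumptions for illustration, not part of any existing index.

```python
from collections import Counter


def explode_by_region(doc: dict) -> list[dict]:
    """Return one record per (document, region) pair with its occurrence count."""
    counts = Counter(doc["regions"])
    return [
        {
            "id": f'{doc["id"]}-{region}',  # hypothetical composite key
            "docId": doc["id"],
            "region": region,
            "regionCount": count,
        }
        for region, count in counts.items()
    ]


for record in explode_by_region({"id": "A", "regions": ["3", "3", "3", "2"]}):
    print(record)
```

With records shaped like this, filtering on region and ordering or boosting by regionCount would put document A (count 3 in region "3") ahead of document B (count 2).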
I am trying to fit the following data in Solr to support flexible queries and would like to get your input on the same. I have data about users say:
contentID (assume uuid),
platform (eg. website, mobile etc),
softwareVersion (eg. sw1.1, sw2.5, ..etc),
regionId (eg. us144, uk123, etc..)
....
and few more other such fields. This data is partially pre aggregated (read Hadoop jobs): so let’s assume for "contentID = uuid123 and platform = mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in format:
timestamp pre-aggregated data [ uniques, total]
Jan 15 [ 12, 4]
Jan 14 [ 4, 3]
Jan 13 [ 8, 7]
... ...
And then I also have less granular data, say "contentID = uuid123 and platform = mobile and softwareVersion = ANY and regionId = ANY" (these values will be larger than in the table above, since granularity is reduced):
timestamp : pre-aggregated data [uniques, total]
Jan 15 [ 100, 40]
Jan 14 [ 45, 30]
... ...
I'll get queries like "contentID = uuid123 and platform = mobile" , give sum of 'uniques' for Jan15 - Jan13 or for "contentID=uuid123 and platform=mobile and softwareVersion=sw1.2", give sum of 'total' for Jan15 - Jan01.
I was thinking of simple schema where documents will be like (first example above):
{
"contentID": "uuid12349789",
"platform" : "mobile",
"softwareVersion": "sw1.2",
"regionId": "ANY",
"ts" : "2017-01-15T01:01:21Z",
"unique": 12,
"total": 4
}
second example from above:
{
"contentID": "uuid12349789",
"platform" : "mobile",
"softwareVersion": "ANY",
"regionId": "ANY",
"ts" : "2017-01-15T01:01:21Z",
"unique": 100,
"total": 40
}
Possible optimization:
{
"contentID": "uuid12349789",
"platform.mobile.softwareVersion.sw1.2.region.us12" : {
"unique": 12,
"total": 4
},
"platform.mobile.softwareVersion.sw1.2.region.ANY" : {
"unique": 100,
"total": 40
},
"ts" : "2017-01-15T01:01:21Z"
}
Challenges: The number of such rows is very large and it will grow exponentially with every new field. For instance, if I go with the schema suggested above, I'll end up storing a new document for each combination of contentID, platform, softwareVersion, regionId. If we then add another field to this document, the number of combinations increases exponentially. I already have more than a billion such combination rows.
I am hoping for expert advice on whether:
Multiple such fields can be fit in the same document for different 'ts' values such that range queries are possible on it.
The time range (ts) can be fit in the same document as a list(?) (to reduce the number of rows). I know multivalued fields don't support complex data types, but perhaps something else can be done with the data/schema to reduce query time and the number of rows.
The number of these rows is very large, for sure more than 1 billion (if we go with the schema I was suggesting). What schema would you suggest for this that will fit the query requirements?
FYI: All queries will be exact match on fields (no partial or tokenized), so no analysis on fields is necessary. And almost all queries are range queries.
You are trying to store query-time results for all possible combinations of attribute values. That's just too much duplicate data. Instead, store each observation and its attributes as a single data point, just once. If you had 'n' observations and added an additional attribute, the data would grow additively, not exponentially. And if you need data for a certain combination of attributes, you filter/aggregate at query time.
{
"contentID": "uuid12349789",
"ts" : "2017-01-15T01:01:21Z",
"observation": 10001,
"attr-platform" : "mobile",
"attr-softwareVersion": "sw1.2",
"attr-regionId": "US"
}
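To illustrate the filter-then-aggregate idea outside of Solr, here is a small in-memory sketch. It assumes the observation field holds the numeric value to be summed and that ts values are UTC ISO-8601 strings in one fixed format (so plain string comparison orders them correctly); the sample rows and the helper name sum_observations are invented.

```python
def sum_observations(rows: list[dict], filters: dict, start: str, end: str) -> int:
    """Sum 'observation' over rows inside [start, end] that match every filter."""
    return sum(
        row["observation"]
        for row in rows
        if start <= row["ts"] <= end
        and all(row.get(key) == value for key, value in filters.items())
    )


# Hypothetical data points, each stored exactly once with its attributes.
rows = [
    {"contentID": "uuid123", "ts": "2017-01-15T01:01:21Z", "observation": 4,
     "attr-platform": "mobile", "attr-softwareVersion": "sw1.2", "attr-regionId": "US"},
    {"contentID": "uuid123", "ts": "2017-01-14T09:30:00Z", "observation": 3,
     "attr-platform": "mobile", "attr-softwareVersion": "sw2.5", "attr-regionId": "US"},
    {"contentID": "uuid123", "ts": "2017-01-14T11:00:00Z", "observation": 7,
     "attr-platform": "website", "attr-softwareVersion": "sw1.2", "attr-regionId": "UK"},
    {"contentID": "uuid123", "ts": "2017-01-01T08:00:00Z", "observation": 5,
     "attr-platform": "mobile", "attr-softwareVersion": "sw1.2", "attr-regionId": "UK"},
]

mobile_total = sum_observations(
    rows,
    {"contentID": "uuid123", "attr-platform": "mobile"},
    "2017-01-13T00:00:00Z",
    "2017-01-15T23:59:59Z",
)
print(mobile_total)  # 7
```

Adding a new attribute only adds one more key to each row; the less granular totals ("softwareVersion = ANY") come from leaving that key out of the filter instead of being stored as extra documents.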