CouchDB: get by value, limit, and order by time

I'm new to CouchDB and I'm trying to get the last 50 most recent entries of a user in an app.
I created a view that pulls out the documents for the entries, and I can use the key parameter to get only the docs of a particular user and the limit parameter to get only 50 entries.
However, I'd like to order the docs by a "timestamp" field (which stores the value of new Date().getTime() at the moment the entry was made) to ensure that I only get the most recent entries. Is this possible in CouchDB, and if so, how?

You can probably achieve this (at least as long as you don't have any future dates in your data) by emitting a more complex key: an array of the form [username, timestamp].
Then query that view with a startkey such as ["johndoe", 1331388874195] (the timestamp should be the current time), descending=true, and limit=50. CouchDB's collation will make sure the results are ordered first by user and then by date, so you get that user's most recent entries first.
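A minimal sketch of that approach over the CouchDB HTTP API from Python; the database name entries, the field names username and timestamp, and the endkey used to stay within one user's keys are assumptions, not anything prescribed by the answer above.

import requests, time

DB = "http://localhost:5984/entries"        # hypothetical database URL

# Design document whose map function emits the compound [username, timestamp]
# key; the map body is JavaScript stored as a string, as CouchDB expects.
design = {
    "views": {
        "by_user_time": {
            "map": "function (doc) {"
                   "  if (doc.username && doc.timestamp) {"
                   "    emit([doc.username, doc.timestamp], null);"
                   "  }"
                   "}"
        }
    }
}
requests.put(f"{DB}/_design/entries", json=design)

# Last 50 entries for 'johndoe': start at [user, now] and walk backwards.
params = {
    "startkey": f'["johndoe", {int(time.time() * 1000)}]',
    "endkey": '["johndoe", 0]',             # don't spill into other users' keys
    "descending": "true",
    "limit": 50,
    "include_docs": "true",
}
rows = requests.get(f"{DB}/_design/entries/_view/by_user_time",
                    params=params).json()["rows"]

With descending=true the startkey is the high end of the range, which is why the current timestamp goes there and the endkey closes off the low end.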

Related

DynamoDB GSI

I am currently having trouble solving a DynamoDB querying problem.
My DynamoDB table keeps track of changes in data, so the partition key identifies which piece of data I am changing.
A single item looks something like this:
partition key: servicename#resource#resource_id#region
sortkey: current_time
changelogs: map of changelogs (basically an array of changelog entries)
changer: who changed it
It does a great job when requesting the changes of one specific resource; however, if I want to query, say, "the last 30 minutes of changes in this servicename#resource" without specifying a resource id, the only method I have at hand right now is a scan, and I can't use scan due to the large amount of data. I am open to all recommendations.
Your subject mentions the answer...even if your post doesn't.
Create a global secondary index (GSI) with either the date or part of the date as the hash key, and the time or the remaining date part plus the time as the sort key. If you have a large amount of data being created, you might want to include the hour in the hash key.
HK              SK
YYYY-MM-DD-HH   MM:SS.00000
YYYY-MM-DD      HH:MM:SS.00000
YYYY-MM         DD-HH:MM:SS.00000
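A rough boto3 sketch of the first split (hour bucket "YYYY-MM-DD-HH" as the GSI hash key, "MM:SS.00000" as its sort key); the table name changelog, the index name by-hour, and the attribute names hour_bucket, change_time, and pk are all hypothetical.

import boto3
from boto3.dynamodb.conditions import Attr, Key
from datetime import datetime, timedelta, timezone

table = boto3.resource("dynamodb").Table("changelog")    # hypothetical table

def recent_changes(prefix, minutes=30):
    # Query the GSI 'by-hour' whose hash key is the hour bucket
    # 'YYYY-MM-DD-HH' and whose sort key is the remaining 'MM:SS.00000'.
    now = datetime.now(timezone.utc)
    start = now - timedelta(minutes=minutes)
    items = []
    # A 30-minute window spans at most two hour buckets.
    for bucket in sorted({start.strftime("%Y-%m-%d-%H"),
                          now.strftime("%Y-%m-%d-%H")}):
        cond = Key("hour_bucket").eq(bucket)
        if bucket == start.strftime("%Y-%m-%d-%H"):
            # only the bucket containing the window start needs a lower bound
            cond = cond & Key("change_time").gte(start.strftime("%M:%S.00000"))
        resp = table.query(
            IndexName="by-hour",
            KeyConditionExpression=cond,
            # narrow to one servicename#resource without knowing resource_id
            FilterExpression=Attr("pk").begins_with(prefix),
        )
        items.extend(resp["Items"])      # pagination omitted for brevity
    return items

Calling recent_changes("servicename#resource") would then return the last 30 minutes of changes for that service and resource across all resource ids.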

How to prevent a user from inserting a date that falls within a range of dates?

I'm using CouchDB for syncing offline and online data. The problem is that since CouchDB has no layer between client and server, validation is rather limited. The only field that can be made unique is the _id.
For example, if you want to make sure the field "phone" is unique, you need to make sure the phone number is somewhere in the _id (555-666-777_username) and the field "phone" is 555-666-777.
My problem is: I have a calendar with events that cannot overlap each other. An event has a start time and an end time. How can I make sure that the user won't insert a date that falls between an existing event's start time and end time?
One idea is, instead of making one document with a startTime and an endTime, to create several documents whose dates cover the user's desired range.
Example: the user selects a range between 2018-09-02 and 2018-09-10, so I'll create 8 documents, each with an _id composed of {date}{username}.
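A minimal sketch of that idea over the CouchDB HTTP API, assuming a hypothetical database named calendar and relying on the 409 Conflict that CouchDB returns when an _id already exists as the uniqueness check:

import requests
from datetime import timedelta

DB = "http://localhost:5984/calendar"    # hypothetical CouchDB database URL

def reserve_range(username, start, end):
    # Claim one document per day in [start, end]; a 409 means that day is
    # already taken, so the whole range is rejected and rolled back.
    created = []                          # (doc_id, rev) pairs for rollback
    day = start
    while day <= end:
        doc_id = f"{day.isoformat()}{username}"      # _id = {date}{username}
        resp = requests.put(f"{DB}/{doc_id}",
                            json={"type": "booked-day", "user": username})
        if resp.status_code == 409:                  # date already reserved
            for doc, rev in created:                 # best-effort rollback
                requests.delete(f"{DB}/{doc}", params={"rev": rev})
            raise ValueError(f"{day} is already booked")
        created.append((doc_id, resp.json()["rev"]))
        day += timedelta(days=1)
    return created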
If you think CouchDB isn't suited to this kind of thing, I'm open to suggestions.

Range Key Querying on composed keys

Currently I have a collection which contains the following fields:
userId
otherUserId
date
status
For my DynamoDB table I used userId as the hash key, and for the range key I wanted to use date:otherUserId. By doing it like this I can retrieve all of a userId's entries sorted by date, which is good.
However, for my use case I shouldn't have any duplicates, meaning I shouldn't have the same userId-otherUserId pair in my table. This means I should first do a query to check whether that 'couple' exists, remove it if needed, and then do the insert, right?
EDIT:
Thanks for your help already :-)
The goal of my use case is to store when UserA visits the profile of UserB.
Now, the kinds of queries I would like to do are the following:
Retrieve all the UserBs that visited the profile of UserA, deduplicated (no double UserBs) and sorted by time.
Retrieve the visits between a particular UserA and UserB.
I think you have a lot of choices, but here is one that might work, based on the assumption that your application is time-aware, i.e. you want to query for interactions in the last N minutes, hours, days, etc.
hash_key = userA
range_key = iso8601_timestamp + userB + uuid
First, the uuid is there to avoid overwriting a record of an interaction between userA and userB happening at exactly the same time (which can occur depending on the granularity/precision of your clock). So insert-wise we are safe: no duplicates, no overwrites.
Query-wise, here is how things get done:
Retrieve all the UserBs that visited the profile of UserA, deduplicated (no double UserBs) and sorted by time.
query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))
where common_prefix = 2013-01 for all interactions in Jan 2013
This will retrieve all records in that time range, sorted by the range key (i.e. by timestamp). Then, in the application code, you filter them to retain only those concerning userB. Unfortunately, the DynamoDB API doesn't support a list of range key conditions (otherwise you could save some time by passing an additional CONTAINS userB condition).
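A hedged boto3 sketch of that time-oriented layout; the table name profile_visits, the attribute names user_id and visit_key, and the '#' separator are illustrative assumptions, not anything prescribed by DynamoDB:

import uuid
import boto3
from boto3.dynamodb.conditions import Key
from datetime import datetime, timezone

table = boto3.resource("dynamodb").Table("profile_visits")   # hypothetical

def record_visit(user_a, user_b):
    # Sort key = ISO-8601 timestamp + visited user + uuid, so two visits at
    # exactly the same instant never overwrite each other.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")
    table.put_item(Item={"user_id": user_a,
                         "visit_key": f"{ts}#{user_b}#{uuid.uuid4()}"})

def visitors_in(user_a, prefix):
    # Unique visitors of user_a in a period, e.g. prefix="2013-01" for
    # January 2013; userB is filtered out client-side as described above.
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq(user_a)
                               & Key("visit_key").begins_with(prefix),
        ScanIndexForward=False,          # newest first
    )
    seen = []
    for item in resp["Items"]:           # pagination omitted for brevity
        user_b = item["visit_key"].split("#")[1]
        if user_b not in seen:
            seen.append(user_b)
    return seen

Here visitors_in("userA", "2013-01") corresponds to the January-2013 query above.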
Retrieve the visits between a particular UserA and UserB.
query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))
where common_prefix could be much more precise if we can assume you know the timestamp of the interaction.
Of course, this design should be evaluated with respect to the properties of the data stream you will handle. If you can (most often) specify a meaningful time range for your queries, it will be fast and bounded by the number of interactions you have recorded in that time range for userA.
If your application is not so time-oriented - and we can assume a user most often has only a few interactions - you might switch to the following schema:
hash_key = userA
range_key = userB + iso8601_timestamp + uuid
This way you can query by user:
query(hash_key=userA, range_key_condition=BEGINS_WITH(userB))
This alternative will be fast and bounded by the number of userA - userB interactions over all time ranges, which could be meaningful depending on your application.
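Under the same hypothetical names as the sketch above (profile_visits, user_id, visit_key, '#' separator), the user-oriented query might look like this:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("profile_visits")   # hypothetical

def visits_between(user_a, user_b):
    # With visit_key = userB + timestamp + uuid, one begins_with query
    # returns every userA -> userB visit, already sorted by time.
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq(user_a)
                               & Key("visit_key").begins_with(f"{user_b}#"),
    )
    return resp["Items"]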
So basically you should look at example data and estimate which orientation is meaningful for your application. Both orientations (time or user) might also be sped up by manually creating and maintaining indexes in other tables - at the cost of more complex application code.
(Historical version: a trick to avoid overwriting records with time-based keys.)
A common trick in your case is to postfix the range key with a generated unique id (uuid). This way you can still do query calls with a BETWEEN condition to retrieve records that were inserted in a given time period, and you don't need to worry about key collisions at insertion time.

Solr changes a document's score when an arbitrary field's value is altered

I need to navigate back and forth in a Solr result set ordered by score, viewing documents one by one. To visualise that: first a list of document titles is presented to the user, who can then click one of the titles to see more details, and then needs to be able to move to the next document in the original list without going back and clicking another title.
While being viewed, documents get changed: a dynamic field is modified (or created if it does not exist yet) to mark that the document has already been viewed (it is used in another search).
The problem I face is that when the document is altered and re-indexed to keep those changes, sometimes (and not always, which is very disturbing) its place in the result set for the same query changes (in other words, its score changes; that doesn't happen when browsing results sorted by one of the document's fields). So, "Previous" / "Next" navigation doesn't work properly.
I'm not using any custom weighting or boosts on fields for score calculation. Also, the dynamic field changed during browsing doesn't participate in the query used to get the result set being browsed.
So, the questions are: can the modification of a document field that is not included in the query change the document's relevance score? And if it can, how can I control that?
UPDATE
I did some tests and can add the following:
The document changes its place in the result set even if no field is amended - just requesting the document and re-indexing it without any changes to its fields makes it take another place the next time the same query is executed over the same index.
That happens even if the result set is sorted explicitly ("first_name DESC"), so the score (which depends on the update date) is not involved. The document stays the same, the field the result set is sorted by stays the same, yet its position changes.
Still have no idea how to avoid that.
In Solr, if your field is "indexed", it will have an effect on the relevancy ranking ("stored" fields show up in search results but are not necessarily searchable). If the fields in question aren't marked as indexed then you are good to go. Note that "indexed" and "stored" are not necessarily the same, hence your confusion about result lists changing even though not all fields are shown (a field can be "indexed" and not "stored" as well).
In this case I think you want your "viewed" field to be "stored" but not "indexed". If you really want to control the query, you can use copyField to copy the relevant fields into a single searchable field. You can also boost terms or documents so that certain fields are "less important" to the search query.
If you want to see how the relevancy rankings are calculated, you can add "debugQuery=on" to the end of your Solr Query (see the Relevancy FAQ for more info).
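For example (the core name mycore and the query are placeholders; debugQuery=on itself is a standard Solr parameter):

import requests
# debugQuery=on adds a "debug" section with an "explain" entry per document,
# showing exactly how its relevancy score was computed.
resp = requests.get("http://localhost:8983/solr/mycore/select",
                    params={"q": "first_name:john", "debugQuery": "on",
                            "wt": "json"})
print(resp.json()["debug"]["explain"])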
However, all that being said, I would recommend you cache your search result query (at least the first page of your results), since you will always have results changing (documents added, removed by other users, etc.). Your best bet is to design a UI that anticipates this, or at least batches a user's query.
I've found the solution which doesn't eliminate the problem completely but makes it much less likely to happen.
So the problem happens when the documents are sorted by some field and a number of them have the same value in that field (e.g. the result set is sorted by first name, and there are 100 entries for "John").
This is when the index time gets involved - apparently Solr uses it to order documents when their main sorting fields are identical. To make this case much less probable, you need to add more sorting fields, e.g. "first_name desc" should become "first_name desc, last_name desc, register_date asc".
Also, adding the document's unique id as the last sorting field should remove the problem completely (the set of sorting field values will never be identical for any two documents in the index).
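A small sketch of such a fully deterministic sort, assuming the schema's uniqueKey field is called id and the core is named mycore:

import requests
# With the unique id as the final tie-breaker, re-indexing a document can no
# longer shuffle documents that share the same first_name/last_name/date.
params = {
    "q": "*:*",
    "sort": "first_name desc, last_name desc, register_date asc, id asc",
    "rows": 50,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
docs = resp.json()["response"]["docs"]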

Best way to create a Lucene index with a field that can be updated frequently, and filter results by this field

I use Lucene to index and search my documents. I currently have 800k documents indexed in Lucene. Those documents have the following fields:
Id: a numeric field that identifies the documents
Name: a textual field, stored and analyzed
Description: like Name
Availability: a numeric field used to filter results. This field can be updated frequently, every day.
My question is: what's the best way to create a filter for availability?
1 - Add this information to the index and use a Lucene filter.
With this approach I have to update the document (remove and re-add it, because Lucene 3.0.2 doesn't support in-place updates) every time the "availability" changes. What is the cost of reindexing?
2 - Don't add this information to the index, and filter the results with a DB select.
This approach would require a lot of selects, because I would need to look up every id in the database to check availability.
3 - Create a separate index with id and availability.
I don't know if it is a good solution, but I could create one index with the static information and another with the information that can be updated frequently. I think that is better than updating the whole document just because some fields changed.
I would stay away from 2; if you can handle the search entirely in Lucene, instead of searching in Lucene plus the DB, do it. I deal with this case (Lucene search + DB search) in my project, but only because there is no way around it.
The cost of an update is internally:
delete the doc
insert the new doc (with the new field).
I would just try approach number 1 (as it is the simplest); if the performance is good enough, stick with it, and if not, look for ways to optimize it or try 3.
Answer provided from the Lucene mailing list:
How often is "frequently"? How many updates do you expect to do in
a day? And how quickly must those updates be reflected in the search
results?
800K documents isn't all that many. I'd go with the simple approach first
and monitor the results, #then# go to a more complex solution if you
see a problem arising. Just update (delete/add) the documents when
the value changes.
Well, the cost to reindex is just about what the cost to index it originally
is. The old version of the document is marked deleted and the new one
is added. It's essentially the same cost as to index a new document.
This leaves some gaps in your index, that is the deleted docs are still in
there, but the next optimize will compact them.
From which you may infer that optimizing is the expensive part. I'd do that, say, once daily (or even weekly).
HTH
Erick
