Range Key Querying on composed keys - database

Currently I have a collection which contains the following fields:
userId
otherUserId
date
status
For my Dynamo collection I used userId as the hashKey, and for the rangeKey I wanted to use date:otherUserId. That way I can retrieve all entries for a userId sorted by date, which is good.
However, for my use case I shouldn't have any duplicates, meaning the same userId-otherUserId pair shouldn't appear twice in my collection. This means I should do a query first to check if that 'couple' exists, remove it if needed and then do the insert, right?
EDIT:
Thanks for your help already :-)
The goal of my usecase would be to store when userA visits the profile of userB.
Now, the kinds of queries I would like to run are the following:
Retrieve all the UserBs that visited the profile of UserA, unique (= no duplicate UserBs) and sorted by time.
Retrieve a particular pair visit of UserA and UserB

I think you have a lot of choices, but here is one that might work based on the assumption that your application is time-aware i.e. you want to query for interactions in the last N minutes, hours, days etc.
hash_key = userA
range_key = iso8601_timestamp+userB+uuid
First, the uuid trick is there to avoid overwriting a record of an interaction between userA and userB happening exactly at the same time (can occur depending on the granularity/precision of your clock). So insert-wise we are safe : no duplicates, no overwrites.
Query-wise, here is how things get done:
Retrieve all the UserBs that visited the profile of UserA, unique (= no duplicate UserBs) and sorted by time.
query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))
where common_prefix = 2013-01 for all interactions in Jan 2013
This will retrieve all records in the time range, sorted by range key (and therefore by time, since the timestamp is the key's prefix). Then in the application code you filter them to retain only those concerning userB. Unfortunately, the DynamoDB API doesn't support a list of range key conditions (otherwise you could save some time by passing an additional CONTAINS userB condition).
Retrieve a particular pair visit of UserA and UserB
query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))
where common_prefix could be much more precise if we can assume you know the timestamp of the interaction.
Of course, this design should be evaluated with respect to the properties of the data stream you will handle. If you can (most often) specify a meaningful time range for your queries, it will be fast and bounded by the number of interactions you have recorded in the time range for userA.
If your application is not so time-oriented, and we can assume a user most often has only a few interactions, you might switch to the following schema:
hash_key = userA
range_key = userB+iso8601_timestamp+uuid
This way you can query by user:
query(hash_key=userA, range_key_condition=BEGINS_WITH(userB))
This alternative will be fast and bounded by the number of userA - userB interactions over all time ranges, which could be meaningful depending on your application.
So basically you should check example data and estimate which orientation is meaningful for your application. Both orientations (time or user) might also be sped up by manually creating and maintaining indexes in other tables - at the cost of a more complex application code.
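To make the time-oriented layout concrete, here is a minimal Python sketch. It does not call DynamoDB at all: a plain list stands in for the range keys stored under one hash key, and the "#" separator, the function names and the in-memory prefix query are all assumptions made for illustration (user ids are assumed not to contain "#").

```python
import uuid
from datetime import datetime, timezone


def time_first_key(visitor_id, when):
    """Range key for the time-oriented layout: timestamp, visitor, then a uuid.

    The '#' separator is an assumption for this sketch, not a DynamoDB rule.
    """
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{stamp}#{visitor_id}#{uuid.uuid4()}"


def query_begins_with(range_keys, prefix):
    """In-memory stand-in for a range-key prefix query: matches in key order."""
    return sorted(key for key in range_keys if key.startswith(prefix))


def unique_visitors_latest_first(range_keys):
    """Application-side dedup: one entry per visitor, most recent visit first."""
    seen, result = set(), []
    for key in sorted(range_keys, reverse=True):
        visitor_id = key.split("#")[1]
        if visitor_id not in seen:
            seen.add(visitor_id)
            result.append(visitor_id)
    return result
```

Because the timestamp is the leading part of the key, a prefix such as 2013-01 selects a whole month, and the uniqueness requirement is handled after the query rather than at insert time.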
(historical version: trick to avoid overwriting records with time-based keys)
A common trick in your case is to postfix the range key with a generated unique id (uuid). This way you can still do query calls with BETWEEN condition to retrieve records that were inserted in a given time period, and you don't need to worry about key collision at insertion time.

Related

Firestore: handling concurrency with booking app

I've currently got a bookings and a bookable collection. Each document in bookings holds a date range (check-out and check-in) and an array of references to bookable documents.
I'm a bit stumped at how to guarantee two overlapping bookings for the same bookables aren't written at the same time. From what I understand I can't technically lock a collection via something like a transaction, so I'm wondering what my options are (perhaps restructuring how I'm storing data, etc).
Any pointers or advice would be much appreciated.
EDIT:
Say User A wants to make a booking for the same two items as User B does and for the same time range. They both load the booking UI at around the same time and confirm their selection.
Prior to creating a new document inside the bookings collection for each of their requests, the app would perform a get query to check for any overlaps and if none exist insert the new booking documents. That fraction of time between the app's check for overlaps across the booking collection and the creation of new documents is what seems to open up a window for inconsistencies (e.g. potentially allowing two documents with overlapping time ranges and items to be created).
Could a transaction help prevent a new document being written to a collection based on the existence of other documents in that collection that fit specific criteria?
To prevent users from accidentally overwriting each other's data, you'll want to use a transaction.
To prevent users from intentionally overwriting each other's data, you'll want to use security rules. Key to this is to use the information that you want to be unique as the ID of the documents.
So say you identify time slots by the date and start time, you could have a document ID "20210420T0900". If a user is trying to write to that document when it already exists, you can reject that write in the security rules of your database.
I am facing the exact same problem, and here is my best option at the moment....
I need a collection of all bookings (bookingscollection), regardless of date, time, resource booked, etc. This collection is used in many parts of my UI, for example where I list upcoming bookings.
I need to avoid writes being made to this collection, where there is an overlap.
I am considering adding an additional collection, where each document describes the bookings for a specific resource on a specific day (lockcollection). It could be a doc with the resource id, the date it covers and an array of start/stop times of bookings already made.
Then, when adding a new booking to my bookingscollection, I would run a transaction on the relevant document in the lockcollection, check whether there is an overlap (in which case I would fail), and if not, add the new interval to the lock document within the transaction.
Once this succeeds, I know that I can just plainly add the booking to the bookings collection, as the lock is already there....
Similar logic would be applied to the procedure of deleting or changing bookings.
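The check inside that transaction can be sketched in a few lines of Python. This is only the overlap logic on an in-memory dict that stands in for one lock document; the field names and function names are assumptions, and in a real app the read, check and write would have to run inside a database transaction on that document.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """Half-open intervals [start, end) overlap when each starts before the other ends."""
    return start_a < end_b and start_b < end_a


def try_reserve(lock_doc, start, end):
    """Add [start, end) to the lock document unless it clashes with an existing interval.

    lock_doc is a plain dict standing in for one resource/day lock document.
    Returns True when the interval was reserved, False on overlap.
    """
    if start >= end:
        raise ValueError("start must be before end")
    for existing_start, existing_end in lock_doc["intervals"]:
        if overlaps(start, end, existing_start, existing_end):
            return False
    lock_doc["intervals"].append((start, end))
    return True
```

With half-open intervals, a booking that ends at 11 and one that starts at 11 do not count as overlapping, which is usually what a booking UI wants.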
This idea is new to me, but I wanted to share it so I can hear your input.

DB design for sensor data (lots and LOTS of data)

I am writing an application for viewing and managing sensor data. I can have an unlimited number of sensors, and each sensor makes one reading every minute and records the values as (time, value, sensor_id, location_id, [a bunch of other doubles]).
As an example, I might have 1000 sensors and collect data every minute for each one of them, which ends up generating 525,600,000 rows after a year. Multiple users (say up to 20) can plot the data of any time period, zoom in and out in any range, and add annotations to the data of a sensor at a time. Users can also modify certain data points and I need to keep track of the raw data and modified one.
I'm not sure what the database for an application like this should look like! Should it be just one table SensorData, with indices for time, sensor_id and location_id? Should I partition this single table based on sensor_id? Should I save the data in files for each sensor each day (say .csv files) and load them into a temp table upon request? How should I manage annotations?
I have not decided on a DBMS yet (maybe MySQL or PostgreSQL). But my intention is to get an insight about data management in applications like this in general.
I am assuming the users cannot change the fields you show (time, value, sensor_id, location_id) but can change the other fields you imply.
In that case, I would suggest Version Normal Form. The fields you name are static, that is, once entered, they never change. However, the other fields are changeable by many users.
You don't state whether users see all users' changes or only their own. I will assume all changes are seen by all users. You should be able to make the appropriate changes if that assumption is wrong.
First, let's explain Version Normal Form. As you will see, it is just a special case of Second Normal Form.
Take a tuple of the fields you have named, rearranged to group the key values together:
R1( sensor_id(k), time(k), location_id, value )
As you can see, the location_id (assuming the sensors are movable) and value are dependent on the sensor that generated the value and the time the measurement was made. This tuple is in 2NF.
Now you want to add updatable fields:
R2( sensor_id(k), time(k), location_id, value, user_id, date_updated, ... )
But the updatable fields (represented by the ellipsis) are dependent not only on the original key fields but also on user_id and date_updated. The tuple is no longer in 2NF.
So we add the new fields not to the original tuple, but create a normalized tuple:
R1( sensor_id(k), time(k), location_id, value )
Rv( sensor_id(k), time(k), user_id(k), date_updated(k), ... )
This makes it possible to have a series of any number of versions for each original reading.
To query the latest update for a particular reading:
select R1.sensor_id, R1.time, R1.location_id, R1.value, R2.user_id, R2.date_updated, R2.[...]
from R1
left join Rv as R2
    on R2.sensor_id = R1.sensor_id
    and R2.time = R1.time
    and R2.date_updated = (
        select max( date_updated )
        from Rv
        where sensor_id = R2.sensor_id
        and time = R2.time )
where R1.sensor_id = :ThisSensor
and R1.time = :ThisTime;
To query the latest update for a particular reading made by a particular user, just add the user_id value to the filtering criteria of the main query and subquery. It should be easy to see how to get all the updates for a particular reading or just those made by a specific user.
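For anyone who wants to try this layout out, here is a small self-contained sketch using Python's built-in sqlite3 module. The table names reading and reading_version, the single modified value column and the sample rows are assumptions made for the example; the question leaves the updatable fields unspecified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reading (              -- R1: the raw, never-changing reading
    sensor_id INTEGER, time TEXT, location_id INTEGER, value REAL,
    PRIMARY KEY (sensor_id, time)
);
CREATE TABLE reading_version (      -- Rv: one row per user edit of a reading
    sensor_id INTEGER, time TEXT, user_id INTEGER, date_updated TEXT,
    value REAL,                     -- assumed updatable field
    PRIMARY KEY (sensor_id, time, user_id, date_updated)
);
""")
conn.execute("INSERT INTO reading VALUES (1, '2024-01-01T00:00', 7, 20.5)")
conn.executemany(
    "INSERT INTO reading_version VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2024-01-01T00:00", 42, "2024-01-02T09:00", 20.7),
        (1, "2024-01-01T00:00", 43, "2024-01-03T09:00", 20.9),
    ],
)

LATEST_VERSION_SQL = """
SELECT r.sensor_id, r.time, r.value, v.user_id, v.date_updated, v.value
FROM reading AS r
LEFT JOIN reading_version AS v
    ON v.sensor_id = r.sensor_id
   AND v.time = r.time
   AND v.date_updated = (SELECT MAX(date_updated)
                         FROM reading_version
                         WHERE sensor_id = v.sensor_id AND time = v.time)
WHERE r.sensor_id = ? AND r.time = ?
"""


def latest_version(connection, sensor_id, time):
    """Return the raw reading joined with its most recent edit (NULLs if never edited)."""
    return connection.execute(LATEST_VERSION_SQL, (sensor_id, time)).fetchone()
```

The left join means a reading that nobody has edited still comes back, with the version columns empty, so the raw data and the modified data can be plotted from the same query.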
This design is very flexible in how you can access the data and, because the key fields are also indexed, it is very fast even on Very Large Tables.
Looking for an answer I came across this thread. While it is not entirely the same as my case, it answers many of my questions; such as is using a relational database a reasonable way of doing this (to which the answer is "Yes"), and what to do about partitioning, maintenance, archiving, etc.
https://dba.stackexchange.com/questions/13882/database-redesign-opportunity-what-table-design-to-use-for-this-sensor-data-col

Finding unique products (never seen before by a user) in a datastore sorted by a dynamically changing value (i.e. product rating)

been trying to solve this problem for a week and couldn't come up with any solutions in all my research so I thought I'd ask you all.
I have a "Product" table and a "productSent" table, here's a quick scheme to help explain:
from google.appengine.ext import ndb

class Product(ndb.Model):
    name = ndb.StringProperty()
    rating = ndb.IntegerProperty()

class productSent(ndb.Model):  # the key name here is md5(Product Key+UUID)
    pId = ndb.KeyProperty(kind=Product)
    uuId = ndb.KeyProperty(kind=userData)  # userData model is defined elsewhere
    action = ndb.StringProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)
My goal is to show users the highest rated product that they've never seen before--fast. So to keep track of the products users have seen, I use the productSent table. I created this table instead of using Cursors because every time the rating order changes, there's a possibility that the cursor skips the new higher ranking product. An example: assume the user has seen products 1-24 in the db. Next, 5 users liked product #25, making it the #10 product in the database--I'm worried that the product will never be shown again to the user (and possibly mess things up on a higher scale).
The problem with the way I'm doing it right now is that, once the user has blown past the first 1,000 products, it really starts slowing down the query performance. Because I'm literally pulling 1,000+ results, checking if they've been sent by querying against the productSent table (doing a keyName lookup to speed things up) and going through the loop until 15 new ones have been detected.
One solution I thought of was to add a repeated property (listProperty) to the Product table of all the users who have seen a product. Or if I don't want to have inequality filters I could put a repeated property of all the users who haven't seen a product. That way when I query I can dynamically take those out. But I'm afraid of what happens when I have 1,000+ users:
a) I'll go through the roof on the limit of repeated properties in one entity.
b) The larger index will increase storage costs
Has anyone dealt with this problem before (I'm sure someone has!) Any tips on the best way to structure it?
update
Okay, so I had another idea. In order to minimize the changes that take place when a rating (number of likes) changes, I could have a secondary column that only has 3 possible values: positive, neutral, negative. And sort by that? Of course, items that have a rating of 0 and get a 'like' (making them a positive) would still have a chance of being out of order or skipped by the cursor, but it'd be less likely. What do y'all think?
Sounds like the inverse, productNotSent would work well here. Every time you add a new product, you would add a new productNotSent entity for each user. When the user wants to see the highest rated product they have not seen, you will only have to query over the productNotSent entities that match that user. If you put the rating directly on the productNotSent you could speed the query up even more, since you will only have to query against one Model.
Another idea would be to limit the number of productNotSent entities per user, so each user only has ~100 of these entities at a time. This would mean your query cost would be constant for each user, regardless of the number of products or users you have. The creation of new productNotSent entities would become more complex, though. You'd have to have a cron job or something that "tops up" a user's collection of productNotSent entities when they use some up. You also may want to double-check that products rated higher than those already within the user's set of productNotSent entities get pushed in there. These are a little more difficult and will require some design trade-offs.
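A rough Python sketch of the bounded "top up" idea, using plain dicts and sets instead of datastore entities. The function names, the capacity of 100 and the batch size of 15 are illustrative assumptions; rebuilding the queue from current ratings is just one simple way to make sure newly popular products get pushed in.

```python
import heapq


def build_queue(products, seen_ids, capacity=100):
    """Pick the highest-rated products the user has not seen, best first.

    products maps product id -> current rating; seen_ids is the set of ids
    already shown to this user. Re-running this after ratings change lets a
    product that has climbed the ranking enter the user's queue.
    """
    candidates = [(rating, pid) for pid, rating in products.items() if pid not in seen_ids]
    return [pid for _, pid in heapq.nlargest(capacity, candidates)]


def next_batch(queue, seen_ids, size=15):
    """Hand out the next products, mark them as seen, and return the remaining queue."""
    batch = queue[:size]
    seen_ids.update(batch)
    return batch, queue[size:]
```

The per-request work is bounded by the queue capacity rather than by the total number of products, which is the point of the suggestion; the cost moves to whatever job rebuilds the queues.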
Hope this helps!
I do not know your expected volumes and exact issues (only did a quick perusal of your question), but you may consider using Json TextProperty storage as part of your plan. Create dictionaries/lists and store them in records by json.dump()ing them to a TextProperty. When the client calls, simply send the TextProperties to the client, and figure everything out on the client side once you JSON.parse() them. We have done some very large array/object processing in JS this way, and it is very fast (particularly indexed arrays). When the user clicks on something, send a transaction back to update their record. Set up some pull or push queue processes to handle your overall product listing updates, major customer rec updates, etc.
One downside is higher bandwidth going out of your app, but I think this cost will be minimal given the potential processing savings on GAE. If you structure this right, you may be able to use get_by_id() to replace all or most of your planned indices and queries. We have found json.loads() and json.dumps() to be very fast inside the app, but we only use simple dictionary/list structures. This approach will be, though, a big, big quantum measure lower than your planned use of queries. The other potential issue is that very large objects may run into soft memory limits. Be sure that your JSON objects are fairly simple and lightweight to avoid this (e.g. do not include product descriptions, sub-objects, etc. in the JSON item, just the basics such as the product number). HTH, -stevep

Paged results when selecting data from 2 databases

Hi
I have one web service connected to one db that has a table called clients which has some data.
I have another web service connected to another db that has a table called clientdetails which has some other data.
I have to return a paged list of clients and every client object contains the information from both tables.
But I have a problem.
The search criteria has to be applied on both tables.
So basically in the clients table I can have the properties:
cprop1, cprop2
in the clientdetails table I can have cdprop1,cdprop2
and my search criteria can be cprop1=something, cdprop2 = somethingelse
I call the first web service and send it the criteria cprop1=something
It returns some info, and then I call the method in the second web service. But if I have to return, say, 10 items on a page, and the criteria of the second web service (cdprop2 = somethingelse) are applied to the 10 items selected by the first web service, then I may be left with 8 items or none at all.
So what do I do in this case?
How can I make sure I always get the right number of items(that is as much as the user says he wants on a page)?
Until you have both responses you don't know how many records you are going to have to display.
You don't say what kind of database access you are using, but you imply that you ask for "N records matching criterion X", where you have N set to 10. In some DB access mechanisms you can ask for all matching records and then advance a "cursor" through the set, hence you don't need to set any upper bound; we assume that the DB takes care of managing resources efficiently for such a query.
If you can't do that, then you need to be able to revisit the first database asking for the next 10 records, repeat until finally you have a page full or no more records can be found. This requires that you have some way to specify a query for "next 10".
You need the ability to get to all records matching the criteria in some efficient way, either by some cursor mechanism offered by your DB or by your own "paged" queries. Without that capability I don't see a way to guarantee an accurate result.
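The "keep asking for the next 10 until the page is full" loop could look roughly like this in Python. The two callables are hypothetical stand-ins for the two web services: fetch_candidates(offset, limit) returns client ids already filtered by the first service, and matches_details(client_id) applies the second service's criteria.

```python
def fetch_page(fetch_candidates, matches_details, page_size, start_offset=0, batch_size=10):
    """Collect page_size clients that satisfy both services' criteria.

    Returns (page, next_offset); pass next_offset as start_offset to get
    the following page without re-reading earlier candidates.
    """
    page, offset = [], start_offset
    while len(page) < page_size:
        batch = fetch_candidates(offset, batch_size)
        if not batch:
            break  # first service has no more matches
        for index, client_id in enumerate(batch):
            if matches_details(client_id):
                page.append(client_id)
                if len(page) == page_size:
                    return page, offset + index + 1
        offset += len(batch)
    return page, offset
```

Returning the offset into the first service's results, rather than a page number, is what lets the next page start exactly where this one stopped.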
I found that in instances like this it's better not to use identity primary keys in the second database, but primary keys whose values are generated in the first database.
As for searching you should search for the first 1000 items that fit your criteria from the first database, intersect them with the first 1000 that match the given criteria from the second database and return the needed amount of items from this intersection.
Your queries should never return an unlimited number of items anyway, so 1000 should do. The number could be bigger or smaller, of course.
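A minimal sketch of that intersection step in Python, assuming both services return lists of shared client ids (the function name and parameters are made up for the example):

```python
def intersect_page(ids_from_clients, ids_from_details, page_size, page_number=0):
    """Intersect two capped result lists and return one page of the result.

    The order of the first list is kept, so paging stays stable as long as
    the first service returns its matches in a consistent order.
    """
    allowed = set(ids_from_details)
    matched = [client_id for client_id in ids_from_clients if client_id in allowed]
    start = page_number * page_size
    return matched[start:start + page_size]
```

Note that with a cap such as 1000 on each side, a client that matches both criteria can still be missed if it falls outside either capped list, so the cap trades completeness for bounded work.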

Can I do transactions and locks in CouchDB?

I need to do transactions (begin, commit or rollback), locks (select for update).
How can I do it in a document model db?
Edit:
The case is this:
I want to run an auctions site.
And I am thinking about how to handle direct purchases as well.
In a direct purchase I have to decrement the quantity field in the item record, but only if the quantity is greater than zero. That is why I need locks and transactions.
I don't know how to address that without locks and/or transactions.
Can I solve this with CouchDB?
No. CouchDB uses an "optimistic concurrency" model. In the simplest terms, this just means that you send a document version along with your update, and CouchDB rejects the change if the current document version doesn't match what you've sent.
It's deceptively simple, really. You can reframe many normal transaction based scenarios for CouchDB. You do need to sort of throw out your RDBMS domain knowledge when learning CouchDB, though. It's helpful to approach problems from a higher level, rather than attempting to mold Couch to a SQL based world.
Keeping track of inventory
The problem you outlined is primarily an inventory issue. If you have a document describing an item, and it includes a field for "quantity available", you can handle concurrency issues like this:
Retrieve the document, take note of the _rev property that CouchDB sends along
Decrement the quantity field, if it's greater than zero
Send the updated document back, using the _rev property
If the _rev matches the currently stored number, be done!
If there's a conflict (when _rev doesn't match), retrieve the newest document version
In this instance, there are two possible failure scenarios to think about. If the most recent document version has a quantity of 0, you handle it just like you would in a RDBMS and alert the user that they can't actually buy what they wanted to purchase. If the most recent document version has a quantity greater than 0, you simply repeat the operation with the updated data, and start back at the beginning. This forces you to do a bit more work than an RDBMS would, and could get a little annoying if there are frequent, conflicting updates.
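The retrieve / decrement / send back / retry cycle above can be sketched in Python against a tiny in-memory store. This is not the CouchDB client API: RevStore, ConflictError and buy_one are names invented for the example, and revisions are simple integers here rather than CouchDB's real _rev strings.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale revision."""


class RevStore:
    """In-memory stand-in for a document store with revision checks."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        return dict(self._docs[doc_id])  # copy, like a fresh fetch

    def put(self, doc):
        current = self._docs.get(doc["_id"])
        if current is not None and current["_rev"] != doc.get("_rev"):
            raise ConflictError(doc["_id"])
        stored = dict(doc)
        stored["_rev"] = (current["_rev"] if current else 0) + 1
        self._docs[doc["_id"]] = stored
        return stored["_rev"]


def buy_one(store, doc_id, max_attempts=5):
    """Decrement quantity if it is above zero, retrying on revision conflicts."""
    for _ in range(max_attempts):
        doc = store.get(doc_id)
        if doc["quantity"] <= 0:
            return False  # sold out: tell the user
        doc["quantity"] -= 1
        try:
            store.put(doc)
            return True
        except ConflictError:
            continue  # someone else updated first: re-read and try again
    raise RuntimeError("too many conflicting updates")
```

The cap on attempts is a design choice for the sketch; it keeps a heavily contended item from spinning forever.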
Now, the answer I just gave presupposes that you're going to do things in CouchDB in much the same way that you would in an RDBMS. I might approach this problem a bit differently:
I'd start with a "master product" document that includes all the descriptor data (name, picture, description, price, etc). Then I'd add an "inventory ticket" document for each specific instance, with fields for product_key and claimed_by. If you're selling a model of hammer, and have 20 of them to sell, you might have documents with keys like hammer-1, hammer-2, etc, to represent each available hammer.
Then, I'd create a view that gives me a list of available hammers, with a reduce function that lets me see a "total". These are completely off the cuff, but should give you an idea of what a working view would look like.
Map
function(doc)
{
  // Emit one row per unclaimed ticket, keyed by product.
  if (doc.type == 'inventory_ticket' && doc.claimed_by == null) {
    emit(doc.product_key, { 'inventory_ticket': doc._id, '_rev': doc._rev });
  }
}
This gives me a list of available "tickets", by product key. I could grab a group of these when someone wants to buy a hammer, then iterate through sending updates (using the id and _rev) until I successfully claim one (previously claimed tickets will result in an update error).
Reduce
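function (keys, values, rereduce) {
  // On rereduce, values are partial counts, so add them up instead of counting them.
  if (rereduce) { return sum(values); }
  return values.length;
}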
This reduce function simply returns the total number of unclaimed inventory_ticket items, so you can tell how many "hammers" are available for purchase.
Caveats
This solution represents roughly 3.5 minutes of total thinking for the particular problem you've presented. There may be better ways of doing this! That said, it does substantially reduce conflicting updates, and cuts down on the need to respond to a conflict with a new update. Under this model, you won't have multiple users attempting to change data in primary product entry. At the very worst, you'll have multiple users attempting to claim a single ticket, and if you've grabbed several of those from your view, you simply move on to the next ticket and try again.
Reference: https://wiki.apache.org/couchdb/Frequently_asked_questions#How_do_I_use_transactions_with_CouchDB.3F
Expanding on MrKurt's answer: for lots of scenarios you don't need to have stock tickets redeemed in order. Instead of selecting the first ticket, you can select randomly from the remaining tickets. Given a large number of tickets and a large number of concurrent requests, you will get much less contention on those tickets than with everyone trying to get the first ticket.
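As a sketch of the random-ticket idea in Python: tickets is a plain dict of ticket id to owner, and try_claim is a hypothetical callable standing in for the conditional update that fails when someone else got there first. The in-memory claimer below only exists so the example is self-contained.

```python
import random


def make_in_memory_claimer(tickets):
    """Build a try_claim function over a dict; a real one would do a revision-checked update."""
    def try_claim(ticket_id, buyer):
        if tickets[ticket_id] is None:
            tickets[ticket_id] = buyer
            return True
        return False
    return try_claim


def claim_ticket(tickets, buyer, try_claim, rng=random):
    """Try the unclaimed tickets in random order; return the claimed id or None."""
    candidates = [ticket_id for ticket_id, owner in tickets.items() if owner is None]
    rng.shuffle(candidates)
    for ticket_id in candidates:
        if try_claim(ticket_id, buyer):
            return ticket_id
    return None  # everything we saw was taken in the meantime
```

Shuffling spreads concurrent buyers across different tickets, so fewer of them collide on the same document.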
A design pattern for RESTful transactions is to create a "tension" in the system. For the popular example use case of a bank account transfer, you must make sure to update the total for both involved accounts:
Create a transaction document "transfer USD 10 from account 11223 to account 88733". This creates the tension in the system.
To resolve any tension scan for all transaction documents and
If the source account is not updated yet update the source account (-10 USD)
If the source account was updated but the transaction document does not show this then update the transaction document (e.g. set flag "sourcedone" in the document)
If the target account is not updated yet update the target account (+10 USD)
If the target account was updated but the transaction document does not show this then update the transaction document
If both accounts have been updated you can delete the transaction document or keep it for auditing.
The scanning for tension should be done in a backend process for all "tension documents" to keep the times of tension in the system short. In the above example there will be a short, anticipated inconsistency when the first account has been updated but the second has not been updated yet. This must be taken into account the same way you deal with eventual consistency if your CouchDB is distributed.
Another possible implementation avoids the need for transactions completely: just store the tension documents and evaluate the state of your system by evaluating every involved tension document. In the example above this would mean that the total for a account is only determined as the sum values in the transaction documents where this account is involved. In Couchdb you can model this very nicely as a map/reduce view.
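In plain Python, deriving balances purely from the transfer documents might look like this. The field names source, target and amount are assumptions for the sketch; in CouchDB the same computation would live in a map/reduce view.

```python
from collections import defaultdict


def balances(transfers):
    """Compute each account's balance as the net sum of all transfers touching it."""
    totals = defaultdict(int)
    for transfer in transfers:
        totals[transfer["source"]] -= transfer["amount"]
        totals[transfer["target"]] += transfer["amount"]
    return dict(totals)
```

Because each transfer is a single document, both sides of the movement appear or disappear together, so the derived totals never show money that has left one account without arriving in the other.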
No, CouchDB is not generally suitable for transactional applications because it doesn't support atomic operations in a clustered/replicated environment.
CouchDB sacrificed transactional capability in favor of scalability. In order to have atomic operations you need a central coordination system, which limits your scalability.
If you can guarantee you only have one CouchDB instance or that everyone modifying a particular document connects to the same CouchDB instance then you could use the conflict detection system to create a sort of atomicity using methods described above but if you later scale up to a cluster or use a hosted service like Cloudant it will break down and you'll have to redo that part of the system.
So, my suggestion would be to use something other than CouchDB for your account balances, it will be much easier that way.
As a response to the OP's problem, Couch is probably not the best choice here. Using views is a great way to keep track of inventory, but clamping to 0 is more or less impossible. The problem being the race condition when you read the result of a view, decide you're ok to use a "hammer-1" item, and then write a doc to use it. The problem is that there's no atomic way to only write the doc to use the hammer if the result of the view is that there are > 0 hammer-1's. If 100 users all query the view at the same time and see 1 hammer-1, they can all write a doc to use a hammer 1, resulting in -99 hammer-1's. In practice, the race condition will be fairly small - really small if your DB is running localhost. But once you scale, and have an off site DB server or cluster, the problem will get much more noticeable. Regardless, it's unacceptable to have a race condition of that sort in a critical - money related system.
An update to MrKurt's response (it may just be dated, or he may have been unaware of some CouchDB features)
A view is a good way to handle things like balances / inventories in CouchDB.
You don't need to emit the docid and rev in a view. You get both of those for free when you retrieve view results. Emitting them - especially in a verbose format like a dictionary - will just grow your view unnecessarily large.
A simple view for tracking inventory balances should look more like this (also off the top of my head)
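function( doc )
{
  if( doc.InventoryChange != undefined ) {
    for( var product_key in doc.InventoryChange ) {
      // Emit the change itself so that _sum yields the running inventory.
      emit( product_key, doc.InventoryChange[product_key] );
    }
  }
}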
And the reduce function is even more simple
_sum
This uses a built in reduce function that just sums the values of all rows with matching keys.
In this view, any doc can have a member "InventoryChange" that maps product_key's to a change in the total inventory of them. ie.
{
"_id": "abc123",
"InventoryChange": {
"hammer_1234": 10,
"saw_4321": 25
}
}
Would add 10 hammer_1234's and 25 saw_4321's.
{
"_id": "def456",
"InventoryChange": {
"hammer_1234": -5
}
}
Would burn 5 hammers from the inventory.
With this model, you're never updating any data, only appending. This means there's no opportunity for update conflicts. All the transactional issues of updating data go away :)
Another nice thing about this model is that ANY document in the DB can both add and subtract items from the inventory. These documents can have all kinds of other data in them. You might have a "Shipment" document with a bunch of data about the date and time received, warehouse, receiving employee etc. and as long as that doc defines an InventoryChange, it'll update the inventory. As could a "Sale" doc, and a "DamagedItem" doc etc. Looking at each document, they read very clearly. And the view handles all the hard work.
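To see what the view computes, here is the same map-then-sum logic written as ordinary Python over a list of dicts. This is only a simulation of the view, not CouchDB code; the sample documents are the ones shown above plus one document without an InventoryChange member.

```python
from collections import Counter


def map_inventory(doc):
    """Mirror of the map function: one (product_key, change) pair per entry."""
    for product_key, change in doc.get("InventoryChange", {}).items():
        yield product_key, change


def inventory_totals(docs):
    """Mirror of the _sum reduce: add up the changes per product key."""
    totals = Counter()
    for doc in docs:
        for product_key, change in map_inventory(doc):
            totals[product_key] += change
    return dict(totals)
```

Documents that carry no InventoryChange simply contribute nothing, which is why any document type can take part in the inventory.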
Actually, you can in a way. Have a look at the HTTP Document API and scroll down to the heading "Modify Multiple Documents With a Single Request".
Basically you can create/update/delete a bunch of documents in a single post request to URI /{dbname}/_bulk_docs and they will either all succeed or all fail. The document does caution that this behaviour may change in the future, though.
EDIT: As predicted, from version 0.9 the bulk docs no longer works this way.
Just use a lightweight solution such as SQLite for the transactions; when a transaction completes successfully, replicate it and mark it as replicated in SQLite.
SQLite table
txn_id        txn_attribute1   txn_attribute2   ...   txn_status
dhwdhwu$sg1   x                y                ...   added/replicated
You can also delete the transactions which are replicated successfully.
