I am currently stuck on a DynamoDB querying problem.
My DynamoDB table keeps track of changes in data, so the partition key identifies which piece of data I am changing.
A single item looks something like this:
partition key: servicename#resource#resource_id#region
sort key: current_time
changelogs: map of changelog entries (basically an array of changelogs)
changer: who changed it
It does a great job when requesting the changes of one specific resource; however, if I want to query, say, "the last 30 minutes of changes in this servicename#resource" without specifying a resource_id, the only method I have at hand is a scan, and I can't use Scan due to the large amount of data. I am open to all recommendations.
Your subject mentions the answer...even if your post doesn't.
Create a global secondary index (GSI) with either the date or part of the date as the hash key, and the time (or the remaining date part plus the time) as the sort key. If you have a large amount of data being created, you might want to include the hour in the hash key.
HK              SK
YYYY-MM-DD-HH   :MM:SS.00000
YYYY-MM-DD      HH:MM:SS.00000
YYYY-MM         DD-HH:MM:SS.00000
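For what it's worth, here is a minimal boto3 sketch of querying such a GSI; the table name 'changelog', the index name 'by-date-hour', and the attribute names change_date_hour / change_time are assumptions, not taken from your post:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("changelog")

# Changes written between 09:30 and 10:00 on 2024-01-15, using the
# YYYY-MM-DD-HH hash key / :MM:SS.00000 sort key layout from the table above.
resp = table.query(
    IndexName="by-date-hour",
    KeyConditionExpression=Key("change_date_hour").eq("2024-01-15-09")
    & Key("change_time").between(":30:00.00000", ":59:59.99999"),
)
recent_changes = resp["Items"]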
I'm using CouchDB for syncing offline and online data. The problem is that since CouchDB doesn't have a layer between client and server, validation is rather poor. The only field that can be unique is the _id.
For example, if you want to make sure the field "phone" is unique, you need to make sure that the phone number is somewhere in the _id (555-666-777_username) while the field "phone" is 555-666-777.
My problem is: I have a calendar whose events cannot overlap each other. An event has a start time and an end time. How can I make sure that a user won't put a date between another event's start time and end time?
One idea, instead of making one document with a startTime and an endTime, is to make several documents whose dates fall within the user's desired range, as sketched below.
Example: the user selects a range between 2018-09-02 and 2018-09-10, so I'll create 8 documents, each with an _id composed of {date}{username}.
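A rough sketch of what I mean, assuming a local CouchDB with a database named 'calendar'; a 409 Conflict from the PUT would tell me that day is already taken:

from datetime import date, timedelta

import requests

def book_range(username, start, end):
    # One document per day in [start, end); the _id makes the day unique.
    day = start
    while day < end:
        doc_id = f"{day.isoformat()}{username}"
        resp = requests.put(
            f"http://localhost:5984/calendar/{doc_id}",
            json={"type": "event-day", "username": username, "date": day.isoformat()},
        )
        if resp.status_code == 409:
            # Another event already owns this day; a real version would also
            # clean up the days created so far.
            raise ValueError(f"{day.isoformat()} is already booked")
        day += timedelta(days=1)

book_range("username", date(2018, 9, 2), date(2018, 9, 10))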
If you think CouchDB isn't suited to this kind of thing, I'm open to suggestions.
I am writing an application for viewing and managing sensor data. I can have an unlimited number of sensors, and each sensor takes one reading every minute and records the values as (time, value, sensor_id, location_id, [a bunch of other doubles]).
As an example, I might have 1000 sensors and collect data every minute for each one of them, which ends up generating 525,600,000 rows after a year. Multiple users (say up to 20) can plot the data of any time period, zoom in and out over any range, and add annotations to the data of one sensor at a time. Users can also modify certain data points, and I need to keep track of both the raw data and the modified values.
I'm not sure what the database for an application like this should look like! Should it be just one table, SensorData, with indexes on time, sensor_id, and location_id? Should I partition this single table based on sensor_id? Should I save the data in files per sensor per day (say, .csv files) and load them into a temp table upon request? How should I manage annotations?
I have not decided on a DBMS yet (maybe MySQL or PostgreSQL), but my intention is to get insight into data management for applications like this in general.
I am assuming that the users cannot change the fields you show (time, value, sensor_id, location_id), only the other fields implied.
In that case, I would suggest Version Normal Form. The fields you name are static, that is, once entered, they never change. However, the other fields are changeable by many users.
You fail to state whether users see all users' changes or only their own. I will assume all changes are seen by all users. You should be able to make the appropriate adjustments if that assumption is wrong.
First, let's explain Version Normal Form. As you will see, it is just a special case of Second Normal Form.
Take a tuple of the fields you have named, rearranged to group the key values together:
R1( sensor_id(k), time(k), location_id, value )
As you can see, the location_id (assuming the sensors are movable) and the value are dependent on the sensor that generated the value and the time the measurement was made. This tuple is in 2NF.
Now you want to add updatable fields:
R2( sensor_id(k), time(k), location_id, value, user_id, date_updated, ... )
But the updatable fields (represented by the ellipsis) are dependent not only on the original key fields but also on user_id and date_updated. The tuple is no longer in 2NF.
So we add the new fields not to the original tuple, but create a normalized tuple:
R1( sensor_id(k), time(k), location_id, value )
Rv( sensor_id(k), time(k), user_id(k), date_updated(k), ... )
This makes it possible to have a series of any number of versions for each original reading.
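As a rough illustration only (SQLite syntax, with column types and the extra updatable columns being my assumptions; the table names simply mirror the tuples above), the two relations could be created like this:

import sqlite3

conn = sqlite3.connect("sensors.db")
conn.executescript("""
-- Original, immutable readings.
CREATE TABLE IF NOT EXISTS R1 (
    sensor_id    INTEGER NOT NULL,
    time         TEXT    NOT NULL,
    location_id  INTEGER,
    value        REAL,
    PRIMARY KEY (sensor_id, time)
);

-- One row per user-made version of a reading.
CREATE TABLE IF NOT EXISTS Rv (
    sensor_id    INTEGER NOT NULL,
    time         TEXT    NOT NULL,
    user_id      INTEGER NOT NULL,
    date_updated TEXT    NOT NULL,
    value        REAL,     -- an example updatable field (the corrected value)
    annotation   TEXT,     -- another example updatable field
    PRIMARY KEY (sensor_id, time, user_id, date_updated),
    FOREIGN KEY (sensor_id, time) REFERENCES R1 (sensor_id, time)
);
""")
conn.commit()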
To query the latest update for a particular reading:
select R1.sensor_id, R1.time, R1.location_id, R1.value, R2.user_id, R2.date_updated, R2.[...]
from R1
left join Rv as R2
on R2.sensor_id = R1.sensor_id
and R2.time = R1.time
and R2.date_updated =(
select max( date_updated )
from Rv
where sensor_id = R2.sensor_id
and time = R2.time )
where R1.sensor_id = :ThisSensor
and R1.time = :ThisTime;
To query the latest update for a particular reading made by a particular user, just add the user_id value to the filtering criteria of the main query and subquery. It should be easy to see how to get all the updates for a particular reading or just those made by a specific user.
This design is very flexible in how you can access the data and, because the key fields are also indexed, it is very fast even on Very Large Tables.
Looking for an answer, I came across this thread. While it is not entirely the same as my case, it answers many of my questions, such as whether using a relational database is a reasonable way of doing this (the answer is "Yes"), and what to do about partitioning, maintenance, archiving, etc.
https://dba.stackexchange.com/questions/13882/database-redesign-opportunity-what-table-design-to-use-for-this-sensor-data-col
Structure of database for the problem:
tv_shows
tv_show_id,
tv_show_title,
tv_show_description,
network,
status
episodes
episode_id,
tv_show_id,
episode_title,
episode_description,
air_date
Hi,
I am unsure of what's best here. To output the number of episodes of tv_show1, should I make this an attribute of the tv_shows table (i.e. "no_of_episodes")? Or should I simply count the rows in the episodes table that contain tv_show1's id with a query?
That's not all. Something else about this task is puzzling me: if I wanted the premiere date of tv_show1, would I also make this an attribute of the tv_shows table, or retrieve it by simply querying the earliest air_date among its episodes?
Both solutions work, but I have no idea which is correct.
The basic rule of thumb in database design: only store each piece of information in one place.
While your system might run slightly faster if you store each count, you will also have to adjust those counts each time you add or remove an episode. Consequently, it's better just to count the records each time you query.
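For example, both values can be derived on demand; this is a sketch assuming SQLite and the tables above, where the literal 1 stands in for tv_show1's id:

import sqlite3

conn = sqlite3.connect("tv.db")

# Number of episodes of a show, derived on demand instead of stored.
episode_count = conn.execute(
    "SELECT COUNT(*) FROM episodes WHERE tv_show_id = ?", (1,)
).fetchone()[0]

# Premiere date, derived as the earliest air_date rather than duplicated in tv_shows.
premiere_date = conn.execute(
    "SELECT MIN(air_date) FROM episodes WHERE tv_show_id = ?", (1,)
).fetchone()[0]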
Currently I have a collection which contains the following fields:
userId
otherUserId
date
status
For my DynamoDB table I used userId as the hash key, and for the range key I wanted to use date:otherUserId. By doing it like this I can retrieve all of a userId's entries sorted by date, which is good.
However, for my use case I shouldn't have any duplicates, meaning I shouldn't have the same userId-otherUserId combination in my table. This means I would have to do a query first to check whether that 'couple' exists, remove it if needed, and then do the insert, right?
EDIT:
Thanks for your help already :-)
The goal of my use case is to store when userA visits the profile of userB.
Now, the kinds of queries I would like to do are the following:
Retrieve all the userBs that visited the profile of userA, deduplicated (no double userBs) and sorted by time.
Retrieve a particular visit pair of userA and userB.
I think you have a lot of choices, but here is one that might work based on the assumption that your application is time-aware i.e. you want to query for interactions in the last N minutes, hours, days etc.
hash_key = userA
range_key = iso8601_timestamp + userB + uuid
First, the uuid trick is there to avoid overwriting a record of an interaction between userA and userB happening at exactly the same time (which can occur depending on the granularity/precision of your clock). So insert-wise we are safe: no duplicates, no overwrites.
Query-wise, here is how things get done:
Retrieve all the userBs that visited the profile of userA, deduplicated (no double userBs) and sorted by time.
query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))
where common_prefix = 2013-01 for all interactions in Jan 2013
This will retrieve all records in that time range, sorted by the range key. Then in the application code you filter them to retain only those concerning userB. Unfortunately, the DynamoDB API doesn't support a list of range key conditions (otherwise you could save some time by passing an additional CONTAINS userB condition).
Retrieve a particular visit pair of userA and userB.
query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))
where common_prefix could be much more precise if we can assume you know the timestamp of the interaction.
Of course, this design should be evaluated with respect to the properties of the data stream you will handle. If you can (most often) specify a meaningful time range for your queries, it will be fast and bounded by the number of interactions you have recorded in that time range for userA.
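As a minimal boto3 sketch of this time-oriented layout (the table name 'profile_visits', the '#' separators, and the literal user ids are assumptions, not part of your schema):

import uuid
from datetime import datetime, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("profile_visits")

# Record that userB visited userA's profile just now.
ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")
table.put_item(Item={
    "hash_key": "userA",
    "range_key": f"{ts}#userB#{uuid.uuid4()}",  # timestamp + visitor + uuid
})

# All visits to userA's profile in January 2013, already sorted by time;
# narrowing down to userB then happens in application code.
resp = table.query(
    KeyConditionExpression=Key("hash_key").eq("userA")
    & Key("range_key").begins_with("2013-01"),
)
visits = resp["Items"]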
If your application is not so time-oriented - and we can assume a user most often has only a few interactions - you might switch to the following schema:
hash_key = userA
range_key = userB + iso8601_timestamp + uuid
This way you can query by user:
query(hash_key=userA, range_key_condition=BEGINS_WITH(userB))
This alternative will be fast and bounded by the number of userA - userB interactions over all time ranges, which could be meaningful depending on your application.
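The user-oriented variant of the query, against the same assumed 'profile_visits' table but with the userB-first range key, would look like this:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("profile_visits")

# All of userB's visits to userA's profile, across all time ranges.
resp = table.query(
    KeyConditionExpression=Key("hash_key").eq("userA")
    & Key("range_key").begins_with("userB#"),
)
pair_visits = resp["Items"]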
So basically you should check example data and estimate which orientation is meaningful for your application. Both orientations (time or user) might also be sped up by manually creating and maintaining indexes in other tables - at the cost of a more complex application code.
(historical version: trick to avoid overwriting records with time-based keys)
A common trick in your case is to postfix the range key with a generated unique id (uuid). This way you can still do query calls with BETWEEN condition to retrieve records that were inserted in a given time period, and you don't need to worry about key collision at insertion time.
I'm new to CouchDB and I'm trying to get the last 50 most recent entries of a user in an app.
I created a view that pulls out the documents for the entries, and I can use the key parameter to get only the docs of a particular user and the limit parameter to get only 50 entries.
However, I'd like to order the docs by a "timestamp" field (which stores the new Date().getTime() of when the entry was made) in order to ensure that I only get the most recent entries. Is this possible in CouchDB, and if so how?
You can probably achieve this (at least in the case that you don't have any future dates in your data) by emitting a more complex key, like an array of the form [username, datetime].
Then make a view that pulls the documents with a startkey like, for example, ['johndoe', 1331388874195], in descending order, with a limit of 50. The date should obviously be the current one. CouchDB's collation will make sure the results are ordered first by user and then by date.
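For example, assuming a view whose map function emits [doc.username, doc.timestamp] and lives at _design/entries/_view/by_user_time in a database called myapp (both names made up), the query over the HTTP API could look like this:

import json
import time

import requests

now_ms = int(time.time() * 1000)  # same scale as new Date().getTime()
params = {
    "startkey": json.dumps(["johndoe", now_ms]),  # newest point to start from
    "endkey": json.dumps(["johndoe"]),            # stop before other users' rows
    "descending": "true",
    "limit": 50,
    "include_docs": "true",
}
resp = requests.get(
    "http://localhost:5984/myapp/_design/entries/_view/by_user_time",
    params=params,
)
latest_entries = resp.json()["rows"]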