Recommendation for a NoSQL data structure for an organizational application

I'm redesigning my data structure for an organizational application. The problem is coming up with the optimal structure, and it boils down to indexing and keeping the structure flexible. The data is JSON-based, and the question starts with whether to use a map of objects or an array of objects: [{}] vs {{}}. Should each top-level object be indexed by a key, or should the key live inside the object, with an index generated separately?
The app contains user tasks, appointments, events, and notes. I used localStorage on the client and MongoDB on the server. For the client, I'm changing to IndexedDB and will take this opportunity to also redesign my local JSON data structure.
When using the Google Calendar API, I noticed many of the results are just a flat list of calendar events: an array of objects containing the relevant event info. Granted, these are the result of a REST request, not the actual data storage structure itself, but it got me thinking. Previously my data was all key:value pairs, sometimes nested, but always starting with a key. {{}}
For example, keying by startDateTime, represented as an epoch number (it could also be an ISO dateTime string):
{{}}
"events": {
    (EPOCH NUMBER): {
        creationDate: (EPOCH NUMBER),
        UID: (STRING),
        summary: (STRING),
        endDateTime: (EPOCH NUMBER)
    }
    ...
}
vs
[{}]
"events": [
    {
        startDateTime: (EPOCH NUMBER),
        creationDate: (EPOCH NUMBER),
        UID: (STRING),
        summary: (STRING),
        endDateTime: (EPOCH NUMBER)
    }
    ...
]
In the first, I can easily get date ranges of events, test whether an event exists on a certain day, get all keys, etc. I can save to localStorage or MongoDB directly using my unique key. I also have a key generator which increments the key in case two keys would overlap (the JavaScript epoch is in milliseconds, so there are 1000 distinct values per second, and I'm not concerned about collisions). Problem: if I change an event's start time, I'd need to change my key or generate a new object with the right key. Overall it seems efficient, but it's a brittle approach.
In the second, on application initialization I could run an indexing function which orders entries by startDateTime and points each one at the associated object (a sketch follows below). Saving to storage would be a little more interesting, since I don't have an obvious key/value pair. I could save the array under the key "events", but I'm not sure how updates would work unless I also kept an index of all the array positions. This could be more flexible: I can easily change my startDateTime field, and I could have multiple indexes, which could also easily be changed.
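Roughly, the indexing function I have in mind for the [{}] variant would look something like this (the field names are just the hypothetical ones from my example above):

interface AppEvent {
    UID: string;
    summary: string;
    creationDate: number;   // epoch ms
    startDateTime: number;  // epoch ms
    endDateTime: number;    // epoch ms
}

// Build a secondary index on startDateTime: sorted entries pointing at array positions.
function buildStartIndex(events: AppEvent[]): Array<{ start: number; pos: number }> {
    return events
        .map((e, pos) => ({ start: e.startDateTime, pos }))
        .sort((a, b) => a.start - b.start);
}

// Range query via the index, e.g. "all events on a given day".
function eventsBetween(
    events: AppEvent[],
    index: Array<{ start: number; pos: number }>,
    from: number,
    to: number
): AppEvent[] {
    return index
        .filter((i) => i.start >= from && i.start < to)
        .map((i) => events[i.pos]);
}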
So, two questions. First, between the two options, {{}} and [{}], which is the more recommended approach for saving nested data that needs to be indexed? Second, since I'm saving all dateTime data as UTC (converting to the local timezone on the client when rendering), should I use the ISO dateTime string or just the epoch number?
Any recommendations or feedback greatly appreciated, I've been scribbling different scenarios and algorithms for days now. I really wanna get this right.
Thanks,
Paul

My first instinct is to basically create an object store of events. Give each event an auto-incremented id. For each event, ensure that you store a few basic properties like start-date, end-date, etc. Then, for the particular queries you want to run quickly, create indices on the properties involved.
The events will be sorted according to id when iterating over the store, but will be sorted by date or whatever when iterating over the index.
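A minimal sketch of that setup (the database and store names here are placeholders, not something prescribed):

// Auto-incremented id plus an index on startDateTime.
const openReq = indexedDB.open("appDB", 1);

openReq.onupgradeneeded = () => {
    const db = openReq.result;
    const store = db.createObjectStore("events", { keyPath: "id", autoIncrement: true });
    store.createIndex("byStart", "startDateTime"); // one index per query you want to be fast
};

openReq.onsuccess = () => {
    const db = openReq.result;
    // Range query: all events starting within [from, to), sorted by startDateTime.
    const from = Date.UTC(2021, 0, 1);
    const to = Date.UTC(2021, 0, 2);
    const idx = db.transaction("events").objectStore("events").index("byStart");
    const getReq = idx.getAll(IDBKeyRange.bound(from, to, false, true));
    getReq.onsuccess = () => console.log(getReq.result);
};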
If you want to export to json, you would export an object containing an array of event objects.
For NoSQL, it isn't important that each event has the same properties. Only the object type itself, and a minimal set of properties like the key path, are important. The rest of the properties are completely variable, and should be understood as just a 'bag' of misc. props.
If this doesn't help then I guess I misunderstood the question.

Related

How to get a collection of all the latest attribute values from DynamoDB?

I have one table where I store all of the sensor data.
Id is a Partition key, TimeEpoch is a sort key.
Example table looks like this:
Id                                   | TimeEpoch  | AirQuality | Temperature | WaterTemperature | LightLevel
b8a76d85-f1b1-4bec-abcf-c2bed2859285 | 1608208992 | 95         |             |                  |
3a6930c2-752a-4103-b6c7-d15e9e66a522 | 1608208993 |            |             | 23.4             |
cb44087d-77da-47ec-8264-faccc2a50b17 | 1608287992 |            | 5.6         |                  |
latest                               | 1608287992 | 95         | 5.6         | 23.4             | 1000
I need to get all the latest attribute values from the table.
For now I've used an additional item with Id = latest where I store all the latest values, but I know this is a hacky approach: it requires the sensor to write each reading twice, once under a new GUID as the Id and once under Id = latest.
The attributes are all known, and it's possible that one sensor, under one Id, stores AirQuality and Temperature at the same time.
NoSQL databases like DynamoDB are a tricky thing, because they don't offer the same query "patterns" as traditional relational databases.
Therefore, you often need non-traditional solutions to valid challenges like the one you present.
My proposal for one such solution would be to use a DynamoDB feature called DynamoDB Streams.
In short, DynamoDB Streams will be triggered every time an item in your table is created, modified or removed. Streams will then send the new (and old) version of that item to a "receiver" you specify. Typically, that would be a Lambda function.
The solution I would propose is to use streams to send new items to a Lambda. This Lambda could then read the attributes of the item that are not empty and write them to whatever datastore you like. Could be another DynamoDB table, could be S3 or whatever else you like. Obviously, the Lambda would need to make sure to overwrite previous values etc, but the detailed business logic is then up to you.
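As a rough sketch, such a Lambda could look something like this (the "LatestValues" table name and the attribute handling are assumptions, not a prescribed design; it also assumes each reading carries at least one measurement attribute):

import { DynamoDBStreamHandler } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";
import { unmarshall } from "@aws-sdk/util-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const handler: DynamoDBStreamHandler = async (event) => {
    for (const record of event.Records) {
        const image = record.dynamodb?.NewImage;
        if (!image) continue; // ignore deletes
        const { Id, TimeEpoch, ...attributes } = unmarshall(image as any);

        // Overwrite only the attributes this sensor reading actually contains.
        const names: Record<string, string> = {};
        const values: Record<string, any> = { ":t": TimeEpoch };
        const sets = ["TimeEpoch = :t"];
        for (const [key, value] of Object.entries(attributes)) {
            names[`#${key}`] = key;
            values[`:${key}`] = value;
            sets.push(`#${key} = :${key}`);
        }

        await ddb.send(new UpdateCommand({
            TableName: "LatestValues",   // assumption: a small side table holding the latest values
            Key: { Id: "latest" },
            UpdateExpression: `SET ${sets.join(", ")}`,
            ExpressionAttributeNames: names,
            ExpressionAttributeValues: values,
        }));
    }
};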
The upside of this approach is, that you could have some form of up-to-date version of all of those values that you can always read without any complicated logic to find the latest value of each attribute. So reading would be simplified.
The downside is that writing becomes a bit more complex, not least because you introduce more parts to your solution (DynamoDB Streams, Lambda, etc.). This will also increase your cost a bit, depending on how often your data changes; since you seem to be storing sensor data, that might be quite often, so keep an eye on the cost. This solution will also introduce more delay, so if delay is an issue, it might not be for you.
Lastly, I want to mention that it is recommended to have at most two "receivers" of a table's stream. For production I would therefore recommend having a single receiver Lambda and letting that Lambda publish an AWS EventBridge event (e.g. "item created", "item modified", "item removed"). This allows a lot more Lambdas etc. to "listen" to such events and process them, mitigating the stream's limitation. That makes it an event-driven solution. As before, this adds delay.
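A sketch of that single receiver Lambda forwarding stream records to EventBridge (the bus name, source, and detail-types are placeholders):

import { DynamoDBStreamHandler } from "aws-lambda";
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";

const bus = new EventBridgeClient({});

export const handler: DynamoDBStreamHandler = async (event) => {
    const entries = event.Records.map((record) => ({
        EventBusName: "sensor-events",    // placeholder bus
        Source: "sensors.table-stream",   // placeholder source
        DetailType:
            record.eventName === "REMOVE" ? "item removed"
            : record.eventName === "MODIFY" ? "item modified"
            : "item created",
        Detail: JSON.stringify(record.dynamodb ?? {}),
    }));
    // PutEvents accepts at most 10 entries per call, so send in batches.
    for (let i = 0; i < entries.length; i += 10) {
        await bus.send(new PutEventsCommand({ Entries: entries.slice(i, i + 10) }));
    }
};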

DDD, Databases and Lists of Data

I'm at the beginning of my first "real" software project, and I'd like to start off right. The concept of DDD seems like a very clean approach which separates the various software parts, however I'm having trouble implementing it in reality.
My software is a measurement tracker and essentially stores lists of measurement data, consisting of a timestamp and the data value.
My Domain Models
class MeasurementDM {
    string Name { get; set; }
    List<MeasurementPointDM> MeasurementPoints { get; set; }
}
class MeasurementPointDM {
    DateTime Time { get; set; }
    double Value { get; set; }
}
My Persistence Models:
class MeasurementPM {
    string Id { get; set; }            // Primary key
    string Name { get; set; }          // Data from the domain model to store
}
class MeasurementPointPM {
    string Id { get; set; }            // Primary key
    string MeasurementId { get; set; } // Key of the parent measurement
}
I now have the following issues:
1) Because I want to keep my Domain Models pure, I don't want or need the database keys inside those classes. This is no problem when building my Domain Models from the database, but I don't understand how to store them, as the Domain Model no longer knows the database Id. Should I be including this in the Domain Model anyway? Should I create a dictionary mapping Domain objects to database Ids when I retrieve them from the database?
2) The measurement points essentially have the same Id problem as the measurements themselves. Additionally, I'm not sure what the right way is to store the MeasurementPoints themselves. Above, each MeasurementPointPM knows which MeasurementPM it belongs to. When I query, I simply select MeasurementPoints based on their Measurement key. Is this a valid way to store such data? It seems like this will explode as more and more measurements are added. Would I be better off serializing my list of MeasurementPoints to a string and storing the whole list as an nvarchar? This would make adding and removing data points more difficult, as I'd always need to deserialize and reserialize the whole list.
I'm having difficulty finding a good example of DDD that handles these problems, and hopefully someone out there can help me out.
My software is a measurement tracker and essentially stores lists of measurement data, consisting of a timestamp and the data value.
You may want to have a careful think about whether you are describing a service or a database. If your primary use case is storing information that comes from somewhere else, then introducing a domain model into the mix may not make your life any better.
Domain models tend to be interesting when new information interacts with old information. So if all you have are data structures, it's going to be hard to discover a good model (because the critical element -- how the model entities change over time -- is missing).
That said....
I don't understand how to store them, as the Domain Model no longer knows the Database Id.
This isn't your fault. The literature sucks.
The most common answer is that people are allowing their models to be polluted with O/RM concerns. For instance, if you look at the Cargo entity from the Citerus sample application, you'll find these lines hidden at the bottom:
Cargo() {
    // Needed by Hibernate
}

// Auto-generated surrogate key
private Long id;
This is an indirect consequence of the fact that the "repository" pattern provides the illusion of an in-memory collection of objects that maintain their own state, when the reality under the covers is that you are copying values between memory and durable storage.
Which is to say, if you want a clean domain model, then you are going to need a separate in memory representation for your stored data, and functions to translate back and forth between the two.
Put another way, what you are running into is a violation of the Single Responsibility Principle -- if you are using the same types to model your domain that you use to manage your persistence, the result is going to be a mix of the two concerns.
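As a minimal sketch of that translation layer (shown in TypeScript rather than C# purely for brevity; the names mirror the models above, and the identity map is just one way to do it):

// The repository owns the mapping between the two representations,
// so the domain model never sees a database key.
interface MeasurementDM { name: string; points: { time: Date; value: number }[] }
interface MeasurementPM { id: string; name: string }

class MeasurementRepository {
    // Identity map: which row each in-memory aggregate came from.
    private ids = new Map<MeasurementDM, string>();

    toDomain(pm: MeasurementPM, points: { time: Date; value: number }[]): MeasurementDM {
        const dm: MeasurementDM = { name: pm.name, points };
        this.ids.set(dm, pm.id);
        return dm;
    }

    toPersistence(dm: MeasurementDM): MeasurementPM {
        const id = this.ids.get(dm) ?? crypto.randomUUID(); // new aggregate -> new key
        return { id, name: dm.name };
    }
}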
So essentially you would say that some minimal pollution of the domain model, for example an Id, is standard practice.
Less strong; I would say that it is a common practice. Fundamentally, a lot of people, particularly in the early stages of a project, don't value having a boundary between their domain model and their persistence plumbing.
Could it make sense to have every Domain Model inherit from a base class or implement an interface that forces the creation of Unique Id?
It could. There are a lot of examples on the web where domain entities extend some generic Entity or Aggregate pattern.
The really interesting questions are
What are the immediate costs and benefits of doing that?
What are the deferred costs and benefits of doing that?
In particular, does that make things easier or harder to change?

What's the best way to store event data in Redshift?

I'm new to Redshift and am looking at the best way to store event data. The data consists of an identifier, time and JSON metadata about the current state.
I'm considering three approaches:
Create a table for each event type with a column for each piece of data.
Create a single table for events and store metadata as a JSON field.
Create a single table with a column for every possible piece of data I might want to store.
The advantage of #1 is I can filter on all data fields and the solution is more flexible. The disadvantage is every time I want to add a new event I have to create a new table.
The advantage of #2 is I can put all types of events into a single table. The disadvantage is to filter on any of the data in the metadata I would need to use a JSON function on every row.
The advantage of #3 is I can easily access all the fields without running a function and don't have to create a new table for each type. The disadvantage is whoever is using the data needs to remember which columns to ignore.
Is one of these ways better than the others or am I missing something entirely?
This is a classic dilemma. After thinking for a while, in my company we ended up keeping the common properties of the events in separate columns and the unique properties in the JSON field. Examples of the common properties:
event type, timestamp (every event has it)
URL (this will be missing for backend events and mobile app events but is present for all frontend events, and is worth having in a separate column)
client properties: device, browser, OS (will be missing in backend but present in mobile app events and frontend events)
Examples of unique properties (no such properties in other events):
test name and variant in AB test event
product name or ID in purchase event
The borderline between a common and a unique property is your own judgement, based on how many events share the property and how often it will be used in analytics queries to filter or group data. If some property is just "nice-to-have" and is not involved in regular analysis use cases (yeah, we all love to store anything that is trackable, just in case), the burden of maintaining a separate column is overkill.
Also, if you have some unique property that you use extensively in the queries there is a hacky way to optimize. You can place this property at the beginning of your JSON column (yes, in Python JSON is not ordered but in Redshift it is a string, so the order of keys can be fixed if you want) and use LIKE with a wildcard only at the end of the field:
select *
from event_table
where event_type='Start experiment'
and event_json like '{"test_name":"my_awesome_test"%' -- instead of below
-- and json_extract_path_text(event_json,'test_name')='my_awesome_test'
LIKE used this way works much faster than the JSON lookup (2-3x faster), because it doesn't need to decode the JSON, find the key and check the value for every row; it just checks whether the string starts with a given prefix, which is a much cheaper operation.
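On the producer side, the key order can be pinned when the JSON is serialized; for example, if the events were written from TypeScript (JSON.stringify preserves insertion order for string keys), a sketch would be:

// Put the most-queried property first so the LIKE prefix trick works.
const eventJson = JSON.stringify({
    test_name: "my_awesome_test",  // hot key goes first
    variant: "B",                  // rarely-filtered properties follow
    exposure_ts: 1608208992,
});
// -> '{"test_name":"my_awesome_test","variant":"B","exposure_ts":1608208992}'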

Data storage: "grouping" entities by property value? (like a dictionary/map?)

Using AppEngine datastore, but this might be agnostic, no idea.
Assume a database entity called Comment. Each Comment belongs to a User. Every Comment has a date property, pretty standard so far.
I want something that will let me: specify a User and get back a dictionary-ish (coming from a Python background, pardon. Hash table, map, however it should be called in this context) data structure where:
keys: every date appearing in the User's comments
values: the Comments that were made on that date.
I guess I could just iterate over a range of dates and build a map like this myself, but I seriously doubt I need to "invent" my own solution here.
Is there a way/tool/technique to do this?
Datastore supports both references and list properties. This lets you build one-to-many relationships in two ways:
Parent (User) has a list property containing keys of Child entities (Comment).
Child has a key property pointing to Parent.
Since you need to limit Comments by date, you'd best go with option two. Then you could query Comments which have date=somedate (or date range) and where user=someuserkey.
There is no native grouping functionality in Datastore, so to also "group" by date, you can add a sort on date to the query. Then, as you iterate over the result and the date changes, you can use/store it as a grouping key.
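Sketched out, that grouping pass over the date-sorted result is just a small loop (the Comment shape here is hypothetical):

interface Comment { user: string; date: Date; text: string }

// Group a date-sorted query result into { "YYYY-MM-DD" -> comments } on the client.
function groupByDate(sorted: Comment[]): Map<string, Comment[]> {
    const groups = new Map<string, Comment[]>();
    for (const c of sorted) {
        const key = c.date.toISOString().slice(0, 10); // grouping key changes as the date changes
        const bucket = groups.get(key) ?? [];
        bucket.push(c);
        groups.set(key, bucket);
    }
    return groups;
}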
Update
Designing no-sql databases should be access-oriented (versus datamodel oriented in sql): for often-used operations you should be getting data out as cheaply (= as few operations) as possible.
So, as a rule of thumb you should, in one operation, only get the data that is needed at that moment (= shown on that page to the user). I'm not sure about your app's design, but I doubt you need all of a user's full comments (with text and everything) at one time.
I'd start by saying you shouldn't apologize for having a Python background; App Engine started out supporting only Python. Using the db module, you could have a User entity as the parent of several DailyCommentBatch entities, each the parent of a couple of Comment entities. IIRC, this will keep all related entities stored together (or close).
If you are using NDB (I love it), you may have to employ a StructuredProperty at either the User or DailyCommentBatch level.

Another database table or a json object

I have two tables: stores and users. Every user is assigned to a store. I thought, "What if I could just save all the users assigned to a store as a JSON object and save that JSON object in a field of the store?" In other words, the users' data would be stored in a field instead of its own table. There will be around 10 people per store. I would like to know which method will require the least amount of processing on the server.
Most databases are relational, meaning there's no reason to be putting multiple different fields in one column. Besides being more work for you having to put them together and take them apart, you'd be basically ignoring the strength of the database.
If you were ever to try to access the data from another app, you'd have to go through additional steps. It also limits sorting and greatly adds to your querying difficulties (i.e. you can't say WHERE field = value, because one field contains many values).
In your specific example, if the users at a store change, rather than being able to do a very efficient delete from the users table (or modify which store they're assigned to) you'd have to fetch the data and edit it, which would double your overhead.
Joins exist for a reason, and they are efficient. So, don't fear them!
