What's the best way to store event data in Redshift? - database

I'm new to Redshift and am looking at the best way to store event data. The data consists of an identifier, time and JSON metadata about the current state.
I'm considering three approaches:
1. Create a table for each event type with a column for each piece of data.
2. Create a single table for events and store metadata as a JSON field.
3. Create a single table with a column for every possible piece of data I might want to store.
The advantage of #1 is I can filter on all data fields and the solution is more flexible. The disadvantage is every time I want to add a new event I have to create a new table.
The advantage of #2 is I can put all types of events into a single table. The disadvantage is to filter on any of the data in the metadata I would need to use a JSON function on every row.
The advantage of #3 is I can easily access all the fields without running a function and don't have to create a new table for each type. The disadvantage is whoever is using the data needs to remember which columns to ignore.
Is one of these ways better than the others or am I missing something entirely?

This is a classic dilemma. After thinking for a while, in my company we ended up keeping the common properties of the events in separate columns and the unique properties in the JSON field. Examples of the common properties:
event type, timestamp (every event has it)
URL (this will be missing for backend events and mobile app events but is present for all frontend events and is worth having in a separate column)
client properties: device, browser, OS (these will be missing for backend events but present for mobile app and frontend events)
Examples of unique properties (no such properties in other events):
test name and variant in AB test event
product name or ID in purchase event
The borderline between a common and a unique property is your own judgement call, based on how many events share the property and how often it will be used in analytics queries to filter or group data. If some property is just "nice-to-have" and is not involved in regular analysis use cases (yeah, we all love to store anything that is trackable, just in case), the burden of maintaining a separate column is overkill.
Also, if you have some unique property that you use extensively in queries, there is a hacky way to optimize. You can place this property at the beginning of your JSON column (JSON keys are nominally unordered, but in Redshift the column is just a string, so the order of keys can be fixed on the producer side if you want) and use LIKE with a wildcard only at the end of the pattern:
select *
from event_table
where event_type='Start experiment'
and event_json like '{"test_name":"my_awesome_test"%' -- instead of below
-- and json_extract_path_text(event_json,'test_name')='my_awesome_test'
LIKE used this way works much faster than the JSON lookup (2-3x faster) because it doesn't need to decode the JSON, find the key and check its value for every row; it just checks whether the string starts with a given prefix, which is a much cheaper operation.
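Here is a minimal sketch of what the producer side of this trick might look like (the helper name and fields are just illustrative): since Python 3.7 dicts preserve insertion order, so building the payload with the hot key first and serializing with compact separators keeps the string matching the LIKE prefix above.

import json

def build_event_json(test_name, variant, extra):
    # Put the frequently-filtered key first; json.dumps keeps dict
    # insertion order, so the string will start with this key.
    payload = {"test_name": test_name, "variant": variant}
    payload.update(extra)
    # Compact separators produce '{"test_name":"..."' with no spaces,
    # matching the LIKE '{"test_name":"my_awesome_test"%' pattern.
    return json.dumps(payload, separators=(",", ":"))

# build_event_json("my_awesome_test", "B", {"page": "/checkout"})
# -> '{"test_name":"my_awesome_test","variant":"B","page":"/checkout"}'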

Related

Are documents removed from a couchbase view if the data changes?

My understanding is that Couchbase views are built incrementally, but I can't seem to find an answer to whether a document can exist in a view multiple times. For example, say I want to create a view based on an updatedAt timestamp, that is changed every time I update this document type.
If the view is built incrementally, that seems to imply that if document id "1234" is updated several times and that updatedAt timestamp changed each time, I'd end up with several entries in the view for the same document, when what I want is just one entry, for the latest value.
It does seem like Couchbase is limiting it to a single copy of any given document id within the view, but I can't find firm confirmation of that anywhere. I want to make sure I'm not designing something for a production system around a behavior that might not work the way it seems to on a small scale.
Yes. When a view index is refreshed, any documents modified since the last refresh have their associated rows removed from the view, and the map function is invoked again to emit the new row(s).
A single document can generate multiple view rows, but only if the view's map function calls emit multiple times.

How to get a collection of all latest attributes values from DynamoDB?

I have one table where I store all of the sensor data.
Id is the partition key, TimeEpoch is the sort key.
An example table looks like this:
Id                                   | TimeEpoch  | AirQuality | Temperature | WaterTemperature | LightLevel
b8a76d85-f1b1-4bec-abcf-c2bed2859285 | 1608208992 | 95         |             |                  |
3a6930c2-752a-4103-b6c7-d15e9e66a522 | 1608208993 |            | 23.4        |                  |
cb44087d-77da-47ec-8264-faccc2a50b17 | 1608287992 |            |             | 5.6              |
latest                               | 1608287992 | 95         | 23.4        | 5.6              | 1000
I need to get all the latest attributes values from the table.
For now I have used an additional item with Id = latest where I store all the latest values, but I know this is a hacky way that requires the sensor to write its data twice: once under a new GUID as the Id and once to the Id = latest item.
The attributes are all known and it's possible that one sensor under one Id can store AirQuality and Temperature at the same time.
NoSQL databases like DynamoDB can be tricky, because they don't offer the same query patterns as traditional relational databases.
Therefore, you often need non-traditional solutions to valid challenges like the one you present.
My proposal for one such solution would be to use a DynamoDB feature called DynamoDB Streams.
In short, DynamoDB Streams will be triggered every time an item in your table is created, modified or removed. Streams will then send the new (and old) version of that item to a "receiver" you specify. Typically, that would be a Lambda function.
The solution I would propose is to use streams to send new items to a Lambda. This Lambda could then read the attributes of the item that are not empty and write them to whatever datastore you like. Could be another DynamoDB table, could be S3 or whatever else you like. Obviously, the Lambda would need to make sure to overwrite previous values etc, but the detailed business logic is then up to you.
The upside of this approach is that you have an up-to-date copy of all of those values that you can always read without any complicated logic to find the latest value of each attribute. Reading becomes simple.
The downside is that writing becomes a bit more complex, not least because you introduce more parts to your solution (DynamoDB Streams, Lambda, etc.). This will also increase your cost a bit, depending on how often your data changes; since you are storing sensor data, that might be quite often, so keep an eye on the cost. This solution also introduces more latency, so if delay is an issue it might not be for you.
Lastly, I want to mention that it is recommended to have at most two "receivers" of a table's stream. For production I would therefore recommend a single receiver Lambda that publishes an AWS EventBridge event (e.g. "item created", "item modified", "item removed"). This allows a lot more Lambdas etc. to "listen" to such events and process them, working around the stream's consumer limit. That makes it an event-driven solution. As before, this adds delay.
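To make the idea concrete, here is a minimal sketch of such a receiver Lambda, assuming the stream is configured with the NEW_IMAGE view type and that the latest values are kept in a separate table called SensorLatest keyed only on Id (both of these are my assumptions, not part of the question):

import boto3
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
# Assumed target table holding a single "latest" item; not part of the original question.
latest_table = boto3.resource("dynamodb").Table("SensorLatest")

def handler(event, context):
    for record in event.get("Records", []):
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"]["NewImage"]  # present with the NEW_IMAGE view type
        # Keep only the measurement attributes present in this record.
        updates = {
            name: deserializer.deserialize(value)
            for name, value in new_image.items()
            if name not in ("Id", "TimeEpoch")
        }
        if not updates:
            continue
        # Overwrite just those attributes on the single "latest" item.
        latest_table.update_item(
            Key={"Id": "latest"},
            UpdateExpression="SET " + ", ".join(f"#{k} = :{k}" for k in updates),
            ExpressionAttributeNames={f"#{k}": k for k in updates},
            ExpressionAttributeValues={f":{k}": v for k, v in updates.items()},
        )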

Database design - should I use 30 columns or 1 column with all data in form of JSON/XML?

I am doing a project which needs to store 30 distinct fields for a piece of business logic, which will later be used to generate a report for each one.
The 30 distinct fields are not all written at one time; the business logic involves many transactions, something like:
Transaction 1, update field 1-4
Transaction 2, update field 3,5,9
Transaction 3, update field 8,12, 20-30
...
...
N.B. each transaction (all belong to one business logic) would update an arbitrary number of fields, and not in any particular order.
I am wondering which database design would be best:
1. Have 30 columns in the Postgres database representing those 30 distinct fields.
2. Store the 30 fields as XML or JSON in just one Postgres column.
Which one is better, 1 or 2?
If I choose 1:
I know it is easier from a programming perspective, because this way I don't need to read the whole XML/JSON, update a few fields and write it back to the database; I can update only the few columns I need for each transaction.
If I choose 2:
I can potentially reuse the table generically for something else, since what's inside the blob column is only XML. But is it wrong to use a generic table to store something totally unrelated in business logic just because it has a blob column storing XML? It does have the potential to save the effort of creating a few new tables. But is this kind of generic table reuse wrong in an RDBMS?
Also, by choosing 2 it seems I would be able to handle potential changes, like changing certain fields or adding more fields; at least I wouldn't need to change the database table. But I would still need to change the C++ & C# code to handle the change internally, so I'm not sure that is any advantage.
I am not experienced enough in database design to make this decision, so any input is appreciated.
N.B. there is a good chance I don't need to index or search on those 30 columns for now; a primary key will be created on an extra column if I choose 2. But I am not sure whether I will later be required to search based on any of those columns/fields.
Basically all my fields are predefined in the requirement documents; they are generally simple fields, like:
field1: value(max len 10)
field2: value(max len 20)
...
field20: value (max len 2)
No nested fields. Is it worth creating 20 columns, one for each of those fields (some are strings like date/time, some are plain strings, some are integers, etc.)?
For 2:
Is putting different business logic in a shared table a bad design idea if it is only shared because the data has the same structure, e.g. they all have a date/time column, a primary key and an XML column with different business logic inside? This way we save some effort of creating new tables... Is this saving of effort worth it?
Always store your XML/JSON fields as separate fields in a relational database. Doing so you will keep your database normalized, allowing the database to do its thing with queries/indices etc. And you will save other developers the headache of deciphering your XML/JSON field.
It will be more work up front to extract the fields from the XML/JSON and perhaps to maintain it if fields need to be added, but once you create a class or classes to do so that hurdle will be eliminated and it will more than make up for the cryptic blob field.
In general it's wise to split the JSON or XML document out and store it as individual columns. This gives you the ability to set up constraints on the columns for validation and checking, to index columns, to use appropriate data types for each field, and generally use the power of the database.
Mapping it to/from objects isn't generally too hard, as there are numerous tools for this. For example, Java offers JAXB and JPA.
The main time when splitting it out isn't such a great idea is when you don't know in advance what the fields of the JSON or XML document will be or how many of them there will be. In this case you really only have two choices - to use an EAV-like data model, or store the document directly as a database field.
In this case (and this case only) I would consider storing the document in the database directly. PostgreSQL's SQL/XML support means you can still create expression indexes on xpath expressions, and you can use triggers for some validation.
This isn't a good option, it's just that EAV is usually an even worse option.
If the document is "flat" - ie a single level of keys and values, with no nesting - the consider storing it as hstore instead, as the hstore data type is a lot more powerful.
(1) is more standard, for good reasons. For one thing, it enables the database to do the heavy lifting on things like search and indexing.
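As a rough illustration of option 1 from the application side (table and column names here are just placeholders), each transaction can update only the columns it touches, with no read-modify-write of a blob; a minimal sketch using psycopg2:

import psycopg2

def update_fields(conn, record_id, **fields):
    # Column names come from trusted application code; only values are
    # passed as query parameters.
    assignments = ", ".join(f"{name} = %s" for name in fields)
    with conn.cursor() as cur:
        cur.execute(
            f"UPDATE business_record SET {assignments} WHERE id = %s",
            list(fields.values()) + [record_id],
        )
    conn.commit()

# Example: transaction 2 touches only fields 3, 5 and 9.
# conn = psycopg2.connect("dbname=mydb")
# update_fields(conn, 42, field3="2021-01-01", field5="abc", field9=7)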

Data storage: "grouping" entities by property value? (like a dictionary/map?)

Using AppEngine datastore, but this might be agnostic, no idea.
Assume a database entity called Comment. Each Comment belongs to a User. Every Comment has a date property, pretty standard so far.
I want something that will let me: specify a User and get back a dictionary-ish (coming from a Python background, pardon. Hash table, map, however it should be called in this context) data structure where:
keys: every date appearing in the User's comment
values: Comments that were made on date.
I guess I could just iterate over a range of dates and build a map like this myself, but I seriously doubt I need to "invent" my own solution here.
Is there a way/tool/technique to do this?
Datastore supports both references and list properties. This lets you build one-to-many relationships in two ways:
Parent (User) has a list property containing keys of Child entities (Comment).
Child has a key property pointing to Parent.
Since you need to limit Comments by date, you'd best go with option two. Then you could query Comments which have date=somedate (or date range) and where user=someuserkey.
There is no native grouping functionality in Datastore, so to also "group" by date, you can add a sort on date to the query. Then, when you iterate over the result, each time the date changes you can use/store it as a grouping key.
Update
Designing NoSQL databases should be access-oriented (versus data-model-oriented in SQL): for often-used operations you should be getting data out as cheaply (= in as few operations) as possible.
So, as a rule of thumb, in one operation you should only get data that is needed at that moment (= shown on that page to the user). I'm not sure about your app's design, but I doubt you need all of a user's full comments (with text and everything) at one time.
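A rough sketch of the query-and-group approach described above, using the ndb client (the model and property names are assumptions; an equality filter on user plus a sort on date will also need a composite index):

from collections import defaultdict
from google.appengine.ext import ndb

class Comment(ndb.Model):
    user = ndb.KeyProperty(kind="User")  # child points at its parent User (option two)
    date = ndb.DateProperty()
    text = ndb.TextProperty()

def comments_by_date(user_key):
    # Query this user's comments sorted by date, then bucket them by date.
    grouped = defaultdict(list)
    for comment in Comment.query(Comment.user == user_key).order(Comment.date):
        grouped[comment.date].append(comment)
    return grouped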
I'd start by saying you shouldn't apologize for having a Python background; App Engine started out supporting only Python. Using the db module, you could have a User entity as the parent of several DailyCommentBatch entities, each the parent of a couple of Comment entities. IIRC, this will keep all related entities stored together (or close together).
If you are using NDB (I love it) you may employ a StructuredProperty at either the User or DailyCommentBatch level.

Dynamic Queries - Expando/Dynamic object type

I need to query tables that are neither known nor existing at compile time, publish them via OData and then make them available to a Silverlight client for CRUD.
It would be wonderful to use a PCO of type dynamic or ExpandoObject to achieve this, but that doesn't seem to work (as suspected).
I'm wondering if there are interfaces that would allow me to perform the type mapping and serializing at the row level, so I could dynamically take the data row and round-trip its values on the server side. Perhaps an interface for the PCO to "help", or dynamically created property getters/setters. I'm also toying with dynamically creating the context class at run time, but that's kind of ugly.
Then, on the client side, I need something to do the same thing with the OData feed. I have a solution here but it ain't pretty enough to share with the world.
EF doesn't offer any "dynamic" approach, nor any simple way to let you create a new table and add it to the mapping. Another question is how well WCF Data Services can work with a changing data structure - I believe that is not supported either.
If you want a dynamically changing structure (added tables, columns, etc.), use a metadata model instead of creating a new table for each entity. A metadata model usually has something like a table with the common properties and a related table with key/value pairs of attribute name and value. It can be further extended to more complex scenarios, but it is the only way to achieve this. Instead of mapping entity types in EF, treat your entity types as data.
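Independently of EF, the shape of such a metadata model is just two tables: one row for the entity's common properties and one row per attribute name/value pair. A minimal sketch of flattening an arbitrary record into that shape (all names here are hypothetical):

def to_metadata_rows(entity_id, entity_type, values):
    # "Entity" table row: the common properties every record shares.
    entity_row = {"id": entity_id, "type": entity_type}
    # "Attribute" table rows: one name/value pair per dynamic field.
    attribute_rows = [
        {"entity_id": entity_id, "name": name, "value": str(value)}
        for name, value in values.items()
    ]
    return entity_row, attribute_rows

# to_metadata_rows(1, "Order", {"customer": "ACME", "total": 99.5})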
