I have a system producing about 5TB of time-tagged numeric data every year. The fields tend to be different for each row, and to avoid having heaps of NULLs I'm thinking of using Postgres as a document store with JSONB.
However, GIN indexes on JSONB fields don't seem to be made for numerical and datetime data. There are no inequality or range operators for numbers and dates.
Here they suggest making special constructs with LATERAL to treat JSON values as normal numeric columns, and here someone proposes using a "sortable" string format for dates and filter string ranges.
These solutions sound a bit hacky and I wonder about their performance. Perhaps this is not a good application for JSONB?
An alternative I can think of using a relational DB is to use the 6th normal form, making one table for each (optional) field, of which however there would be hundreds. It sounds like a big JOIN mess, and new tables would have to be created on the fly any time a new field pops up. But maybe it's still better than a super-slow JSONB implementation.
Any guidance would be much appreciated.
More about the data
The data are mostly sensor readings, physical quantities and boolean flags. Which subset of these is present in each row is unpredictable. The index is an integer, and the only field that always exists is the corresponding date.
There would probably be one write for each value and almost no updates. Reads can be frequent and sliced based on any of the fields (some are more likely to be in a WHERE statement than others).
Related
I wanted to ask, if there may be a different and better approach than mine.
I have a model entity that can have an arbitrary amount of hyperparameters. Depending on the specific model I want to insert as row into the model table, I may have specific hyperparameters. I do not want to continuously add new columns to my model table for new hyperparameters that I encounter when trying out new models (+ I don't like having a lot of columns that are null for many rows). I also want to easily filter models on specific hyperparameter values, e.g. "select * from models where model.hyperparameter_x.value < 0.5". So, an n-to-n relationship to a hyperparameter table comes to mind. The issue is, that the datatype for hyperparameters can be different, so I cannot define a general value column on the relationship table, with a datatype, that's easily comparable across different models.
So my idea is, to define a json type "value" column in the relationship table to support different value datatypes (float, array, string, ...). What I don't like about that idea and what was legitimately critizised by colleagues is that this can result in chaos within the value column pretty fast, e.g. people inserting data with very different json structures for the same hyperparameters. To mitigate this issue, I would introduce a "json_regex_template" column in the hyperparameter table, so that on API level I can easily validate wheter the json for a value for hyperparameter x is correctly defined by the user. An additional "json_example" column in the hyperaparameter table would further help the user on the other side of the API make correct requests.
This solution would still not guarentee non-chaos on database request level (even though no User should directly insert data without using the API, so I don't think thats a very big deal). And the solution still feels a bit hacky. I would believe, that I'm not the first person with this problem and maybe there is a best practice to solve it?
Is my aversion against continuously adding columns reasonable? It's about probl. 3-5 new columns per month, may saturate at some point to a lower number, but thats speculative.
I'm aware of this post (Storing JSON in database vs. having a new column for each key), but it's pretty old, so my hope is that there may be new stuff I could use. The model-hyperparameter thing is of course just a small part of my full database model. Changing to a non-relational database is not an option.
Opinions are much appreciated
Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as #OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column, it could be implemented as an xml field
It could be converted to a comma delimited as necessary
querying an XML list in sql server using Xquery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.**
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.**
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited list AND can be converted to a delimited list as needed
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well I've been using a key/value pair tab separated list in a NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries but on the other hand, if you have a library that persists/derpersists the key value pair then it's not a that bad idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use a INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR (0) (nullable) for each. You could also use a SET (I forget the exact syntax).
We have have a table layout with property names in one table, and values in a second table, and items in a third. (Yes, we're re-implementing tables in SQL.)
We join all three to get a value of a property for a specific item.
Unfortunately the values can have multiple data types double, varchar, bit, etc. Currently the consensus is to stringly type all the values and store the type name in the column next to the value.
tblValues
DataTypeName nvarchar
Is there a better, cleaner way to do this?
Clarifications:
Our requirements state that we must add new "attributes" at run time without modifying the db schema
I would prefer not to use EAV, but that is the direction we are headed right now.
This system currently exists in SQL server using a traditional db design, but I can't see a way to fulfill our requirement of not modifying the db schema without moving to EAV.
There are really only two patterns for implementing an 'EAV model' (assuming that's what you want to do):
Implement it as you've described, where you explicitly store the property value type along with the value, and use that to convert the string values stored into the appropriate 'native' types in the application(s) that access the DB.
Add a separate column for each possible datatype you might store as a property value. You could also include a column that indicates the property value type, but it wouldn't be strictly necessary.
Solution 1 is a simpler design, but it incurs the overhead of converting the string values stored in the table into the appropriate data type as needed.
Solution 2 has the benefit of storing values as the appropriate native type, but it will necessarily require more, though not necessarily much more, space. This may be moot if there aren't a lot of rows in this table. You may want to add a check constraint that only allows one non-NULL value in the different value columns, or if you're including a type column (so as to avoid checking for non-NULL values in the different value columns), prevent mismatches between the value stored in the type column and which value column contains the non-NULL value.
As HLGEM states in her answer, this is less preferred than a standard relational design, but I'm more sympathetic to the use of EAV model table designs for data such as application option settings.
Well don't do that! You lose all the values of having datatypes if you do. You can't properly constrain them (and will, I guarantee it, get bad data eventually) and you have to cast them back to the proper type to use in mathematical or date calculations. All in all a performance loser.
Your whole design will not scale well. Read up on why you don't want to use EAV tables in a relational database. It is not only generally slower but unusually difficult to query especially for reporting.
Perhaps a noSQL database would better suit your needs or a proper relational design and NOT an EAV design. Is it really too hard to figure out what fields each table would really need or are your developers just lazy? Are you sacrificing performance for flexibility - a flexibility that most users will hate? Especially when it means bad performance? Have you ever used a database designed that way to try to do anything?
I need to find a relatively robust method of storing variable types data in a single column of a database table. The data may represent a single value or multiple values and may any of a long list of characters (too long to enumerate easily). I'm wondering what approaches might work in this process. I'd toyed with the ideas of adding some form of separator, but I'm worried that any simple separator or combination might occur naturally in the data. I'd also like to avoid XML or snippets since in fact the data could be XML. Arguably I could encode the XML, but that still seems fragile.
I realize this is naturally a bit of an opinion question, but I lack the mojo to make it community.
Edit for Clarification:
Background for the problem: the column will hold data that is then used to make a evaluation based on another column. Functionally it's the test criteria for a decision engine. Other columns hold the evaluation's nature and the source of the value to test. The data doesn't need to be searchable.
Does the data need to be searchable? If not, slap it in a varbinary(MAX) and have a field to assist in deserialization.
Incidentally, though; using the right XML API, there should be no trouble storing XML inside an XML node.
But my guess is there has to be a better way to do this... it seems... ugh!
JSON format, though I agree with djacobson, your question is like asking for the best way to saw a 2x4 in half with a teaspoon.
EDIT: The order in which data are stored in the JSON string is irrelevant; each datum is stored as a key-value pair.
There's not a "good" way to do this. There is a reason that data types exist in SQL.
The only conceivable way I can think of to make it close is to make your column a lookup column, which refers to a GUID or ID in another table, which itself has additional columns indicating which table and row have your data.
I've written some very basic tools for grouping, pivoting, unioning and subtotaling datasets sourced from non DB sources (eg: CSV, OLTP systems). The "group by" methods sit at the core of most of these.
However i'm sure lot of work has been done in making efficient algorithms for grouping data... and i'm sure i'm not using them. And my Google-fu has completely failed to turn anything up.
Are there any good online sources or books describing the better methods to create grouped data?
Or should i just start looking at the MySQL source or something similar?
One very handy way to "group by" some field (or set of fields and expressions, but I'll use "field" for simplicity!-) is when you can arrange to walk over the results before grouping (RBG) in a sorted way -- you actually don't care about the sorting (save in the common case in which an ORDER BY is also there and just happens to be on the same field as the GROUP BY!-), but rather about the "side effect" property of ordering -- that all rows in RBG with the same value for the grouping field come right after each other, so you can accumulate until the grouping field changes, then emit/yield the results accumulated so far, and proceed to reinitialize the accumulators with the new row (the one with a different value of the grouping field) -- make sure to "just initialize the accumulators" at the very start, AND "just emit/yield accumulated results" at the very end, of course.
If this doesn't work, maybe you can hash the grouping field and use a hash table for the results being accumulated for that group -- at each row in RBG, hash the grouping field, check if it was already present as a key in the hash table, if not put it there with accumulators suitably initialized from the RBG row, else update the accumulators per the RBG row. You just emit everything at the end. The problem of course is you're taking up more memory until the end!-)
These are the two fundamental approaches. Would you like pseudocode for each, BTW?
You should check out OLAP databases. OLAP allows you to create a database of aggregates meant to be analyzed in a "slice and dice" fashion.
Aggregate measures such as counts, averages, mins, maxs, sums and stdev's can be quickly analyzed by any number of dimensions using an OLAP database.
See this introduction to OLAP on MSDN.
Give an example CSV file and type of result wanted and I might be able to rustle up a a solution in Python for you.
Python has the CSV module and list/generator comprehensions that can help with this sort of thing.
Paddy.