I have a Grails application that is dependent upon an external dataset that gets sucked in periodically. There is no way to validate the data when it comes into the database, because the program that sucks it in isn't written by us.
Once in a while, we get a bad piece of data in the database. For example, a number is "5,5" instead of "5.5". The datatype on the column (since they defined the table) is VARCHAR even though the field should always contain a number. Our application has this column mapped to a FLOAT in the ORM layer, since that is what we expect it to be.
I want to make sure that the application doesn't crash when we get new data, but I am not sure how. Should I map the column to a VARCHAR and then do a conversion to a transient FLOAT column or something like that?
Short-term, yes, mapping the VARCHAR to String in the domain object instead of a numeric type would solve the problem.
In my opinion, a better solution would be to do the periodic import into a different table, and have a process that moves only valid rows to the domain table. You'd probably also want to put the 'bad' rows somewhere, and notify someone to manually fix them.
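As a sketch of that staging approach (hypothetical table and column names, PostgreSQL-style regex shown; adapt to your actual schema and DBMS):

-- the import job writes here, with the value kept as raw text
CREATE TABLE measurement_staging (
    id        INTEGER PRIMARY KEY,
    raw_value VARCHAR(50)
);

-- the table your Grails domain class maps, with the type you actually expect
CREATE TABLE measurement (
    id    INTEGER PRIMARY KEY,
    value FLOAT NOT NULL
);

-- rows that don't parse, kept for manual review
CREATE TABLE measurement_rejected (
    id        INTEGER PRIMARY KEY,
    raw_value VARCHAR(50)
);

-- promote only rows whose raw value looks like a number
INSERT INTO measurement (id, value)
SELECT id, CAST(raw_value AS FLOAT)
FROM measurement_staging
WHERE raw_value ~ '^-?[0-9]+(\.[0-9]+)?$';

-- everything else goes to the reject table so someone can fix it
INSERT INTO measurement_rejected (id, raw_value)
SELECT id, raw_value
FROM measurement_staging
WHERE raw_value !~ '^-?[0-9]+(\.[0-9]+)?$';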
I wanted to ask whether there might be a different and better approach than mine.
I have a model entity that can have an arbitrary number of hyperparameters. Depending on the specific model I want to insert as a row into the model table, I may have different hyperparameters. I do not want to continuously add new columns to my model table for new hyperparameters that I encounter when trying out new models (and I don't like having a lot of columns that are null for many rows). I also want to easily filter models on specific hyperparameter values, e.g. "select * from models where model.hyperparameter_x.value < 0.5". So an n-to-n relationship to a hyperparameter table comes to mind. The issue is that the datatype for hyperparameters can differ, so I cannot define a general value column on the relationship table with a datatype that is easily comparable across different models.
So my idea is to define a JSON "value" column in the relationship table to support different value datatypes (float, array, string, ...). What I don't like about that idea, and what was legitimately criticised by colleagues, is that this can result in chaos within the value column pretty fast, e.g. people inserting data with very different JSON structures for the same hyperparameter. To mitigate this issue, I would introduce a "json_regex_template" column in the hyperparameter table, so that at the API level I can easily validate whether the JSON value for hyperparameter x is correctly defined by the user. An additional "json_example" column in the hyperparameter table would further help the user on the other side of the API make correct requests.
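To make that concrete, here is a rough sketch of what I have in mind (names are placeholders, it assumes the existing model table has an integer id, and the filter uses PostgreSQL-style JSON operators):

CREATE TABLE hyperparameter (
    id                  INTEGER PRIMARY KEY,
    name                VARCHAR(100) UNIQUE NOT NULL,
    json_regex_template TEXT,   -- expected value structure, validated at the API level
    json_example        TEXT    -- example payload to guide API users
);

CREATE TABLE model_hyperparameter (
    model_id          INTEGER NOT NULL REFERENCES model (id),
    hyperparameter_id INTEGER NOT NULL REFERENCES hyperparameter (id),
    value             JSON NOT NULL,
    PRIMARY KEY (model_id, hyperparameter_id)
);

-- Filtering on a numeric hyperparameter value would then look like:
SELECT m.*
FROM model m
JOIN model_hyperparameter mh ON mh.model_id = m.id
JOIN hyperparameter h ON h.id = mh.hyperparameter_id
WHERE h.name = 'hyperparameter_x'
  AND (mh.value ->> 'value')::float < 0.5;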
This solution would still not guarantee non-chaos at the database request level (even though no user should directly insert data without using the API, so I don't think that's a very big deal). And the solution still feels a bit hacky. I would think I'm not the first person with this problem, and maybe there is a best practice for solving it?
Is my aversion to continuously adding columns reasonable? It's probably about 3-5 new columns per month; that may saturate at some point to a lower number, but that's speculative.
I'm aware of this post (Storing JSON in database vs. having a new column for each key), but it's pretty old, so my hope is that there may be new stuff I could use. The model-hyperparameter thing is of course just a small part of my full database model. Changing to a non-relational database is not an option.
Opinions are much appreciated
I am new to designing databases. I come from a front-end background. I am looking to design a database that stores performance metrics of wind turbines. I was given an Excel file. The list of metrics contains nearly 200 entries, and you can see the first few in this image.
I can't think of the best way to represent this data in a database. My first thought was to import this table as-is into a database table and add a turbine-id column to it. My second thought was to create a table for each metric and add a turbine-id column to each of those tables. What do you guys think? What is the best way for me to store the data that would set me up with a performant database? Thank you for your help and input.
One way to do it would be something like this:
CREATE TABLE TURBINE (
    ID_TURBINE INTEGER PRIMARY KEY,
    LATITUDE   DECIMAL(9, 6),
    LONGITUDE  DECIMAL(9, 6)
);

CREATE TABLE METRIC (
    ID_METRIC   INTEGER PRIMARY KEY,
    METRIC_NAME VARCHAR(100) UNIQUE,
    VALUE_TYPE  VARCHAR(20)
        CHECK (VALUE_TYPE IN ('BOOLEAN', 'PERCENTAGE', 'INTEGER', 'DOUBLE', 'STRING'))
);

CREATE TABLE TURBINE_METRIC (
    ID_TURBINE_METRIC INTEGER PRIMARY KEY,
    ID_TURBINE        INTEGER REFERENCES TURBINE (ID_TURBINE),
    METRIC_NAME       VARCHAR(100) REFERENCES METRIC (METRIC_NAME),
    BOOLEAN_VALUE     BOOLEAN,
    PERCENTAGE_VALUE  DOUBLE PRECISION,
    INTEGER_VALUE     INTEGER,
    DOUBLE_VALUE      DOUBLE PRECISION,
    STRING_VALUE      VARCHAR(255)
);
Flesh this out however you need it to be. I have no idea how long your VARCHAR fields should be, etc., but this allows you flexibility in terms of which metrics you store for each turbine. I suppose you could make LATITUDE and LONGITUDE metrics as well - I just added them to the TURBINE table to show that there may be fixed info which is best stored as part of the TURBINE table.
You want one table to represent turbines (things true of the turbine, like its location) and one or more of turbine metrics that arrive over time. If different groups of metrics arrive at different intervals, put them in different tables.
One goal I would have would be to minimize the number of nullable columns. Ideally, every column is defined NOT NULL, and invalid inputs are set aside for review. What is and is not nullable is controlled by promises made by the system supplying the metrics.
That's how it's done: every table has one or more keys that uniquely identify a row, and all non-key columns are information about the entity defined by the row.
It might seem tempting and "more flexible" to use one table of name-value pairs, so you never have to worry about new properties if the feed changes. That would be a mistake, though (a classic one, which is why I mention it). It's actually not more flexible, because changes upstream will require changes downstream, no matter what. Plus, if the upstream changes in ways that aren't detected by the DBMS, they can subtly corrupt your data and results.
By defining as tight a set of rules about the data as possible in SQL, you guard against missing, malformed, and erroneous inputs. Any verification done by the DBMS is verification that the application can skip, and that no application will be caught out by.
For example, you're given min/max values for wind speed and so on. These promises can form constraints in the database. If you get negative wind speed, something is wrong. It might be a sensor problem or (more likely) a data alignment error because a new column was introduced or the input was incorrectly parsed. Rather than put the wind direction mistakenly in the wind speed column, the DBMS rejects the input, and someone can look into what went wrong.
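As a sketch of what I mean (the table, column names, and limits here are made up; use the promises you are actually given):

CREATE TABLE turbine_reading (
    turbine_id     INTEGER NOT NULL,
    reading_time   TIMESTAMP NOT NULL,
    wind_speed_ms  DECIMAL(5, 2) NOT NULL CHECK (wind_speed_ms BETWEEN 0 AND 120),
    wind_direction DECIMAL(5, 2) NOT NULL CHECK (wind_direction >= 0 AND wind_direction < 360),
    PRIMARY KEY (turbine_id, reading_time)
);

A negative wind speed, or a direction that landed in the speed column, fails the CHECK and the row is rejected instead of quietly polluting your results.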
Don't forget to have fun. You have an opportunity to create a new database in a growing industry, and learn about database technology and theory at the same time. Doesn't happen every day!
Is delimiting data in a database field something that would be ok to do?
Something like
create table column_names (
id int identity (1,1) PRIMARY KEY,
column_name varchar(5000)
);
and then storing data in it as follows
INSERT INTO column_names (column_name) VALUES ('stocknum|name|price');
No, this is bad:
In order to create new queries you have to track down how things are stored.
Queries that join on price or name or stocknum are going to be nasty (see the sketch below).
The database can't assign data types to the data or validate it.
You can't create constraints on any of this data now.
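For example, just pulling the stocknum back out (T-SQL string functions shown, since the example above is SQL Server-style) already looks like this, and every query that filters or joins on it has to repeat the parsing:

SELECT id,
       SUBSTRING(column_name, 1, CHARINDEX('|', column_name) - 1) AS stocknum
FROM column_names
WHERE SUBSTRING(column_name, 1, CHARINDEX('|', column_name) - 1) = '12345';

With separate stocknum, name, and price columns, that would just be WHERE stocknum = '12345', and it could use an index.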
Basically you're subverting the RDBMS' scheme for handling things and making up your own, so you're limiting how much the RDBMS tools can help you and you've made the system harder to understand for new people.
The only possible advantage of this kind of system that I can think of is that it can serve as a workaround to avoid dealing with a totally impossible DBA who vetoes all schema changes regardless of merit. Which can happen, unfortunately.
Of course there's an exception to everything. I'm currently on a project with audit-logging requirements that are pretty stringent. The logging is done to a database, and we're using delimited fields to store the logged data because the application is never going to interact with it; it gets written once and left alone.
Almost certainly not.
It violates principles of normalization. The data stored in a particular row of a particular column should be atomic-- you shouldn't be able to parse the data into smaller component parts.
It makes it substantially more difficult to get acceptable performance. Every piece of code that queries this table will need to know how to parse the data which is generally going to mean that more data needs to be read off disk and potentially sent over the network to the client. Every query that has to parse this data is going to have to be more complex which tends to cause grief for the query optimizer. Concatenated data cannot generally be indexed effectively for searches-- you'd have to do something like a full-text index with custom delimiters rather than a nice standard index on a character string. And if you ever have to update one of the delimited values (i.e. because a product name changes), those updates are going to have to scan every row in the table, parse the data, decide whether to actually update the row, and then update a ton of rows.
It makes the application much more brittle. What happens when someone decides to include a | character in the name attribute, for example? Even if you specify an optional enclosure in the spec (i.e. | is allowed if the entire token is enclosed in double quotes), what fraction of the bits of code that actually parse this column are going to implement and test that correctly?
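To illustrate the brittleness with a hypothetical row (reusing the column_names table from the question):

INSERT INTO column_names (column_name) VALUES ('98765|Widget|Large|19.99');

A parser expecting stocknum|name|price now silently reads name = 'Widget' and price = 'Large', because the product name itself contained the delimiter, and nothing in the database stops the bad row from being stored.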
I need to find a relatively robust method of storing variable types of data in a single column of a database table. The data may represent a single value or multiple values and may contain any of a long list of characters (too long to enumerate easily). I'm wondering what approaches might work here. I'd toyed with the idea of adding some form of separator, but I'm worried that any simple separator or combination might occur naturally in the data. I'd also like to avoid XML snippets since, in fact, the data could be XML. Arguably I could encode the XML, but that still seems fragile.
I realize this is naturally a bit of an opinion question, but I lack the mojo to make it community.
Edit for Clarification:
Background for the problem: the column will hold data that is then used to make an evaluation based on another column. Functionally it's the test criteria for a decision engine. Other columns hold the evaluation's nature and the source of the value to test. The data doesn't need to be searchable.
Does the data need to be searchable? If not, slap it in a varbinary(MAX) and have a field to assist in deserialization.
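Something along these lines, for example (hypothetical table and column names, SQL Server syntax since varbinary(MAX) implies it):

CREATE TABLE decision_criteria (
    id         INT IDENTITY(1,1) PRIMARY KEY,
    value_data VARBINARY(MAX) NOT NULL,
    value_type VARCHAR(50) NOT NULL  -- tells the application how to deserialize, e.g. 'int', 'xml', 'csv-list'
);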
Incidentally, though; using the right XML API, there should be no trouble storing XML inside an XML node.
But my guess is there has to be a better way to do this... it seems... ugh!
JSON format, though I agree with djacobson; your question is like asking for the best way to saw a 2x4 in half with a teaspoon.
EDIT: The order in which data are stored in the JSON string is irrelevant; each datum is stored as a key-value pair.
There's not a "good" way to do this. There is a reason that data types exist in SQL.
The only conceivable way I can think of to make it close is to make your column a lookup column, which refers to a GUID or ID in another table, which itself has additional columns indicating which table and row have your data.
I have a situation where I need to store a general piece of data (could be an int, float, or string) in my database, but I don't know ahead of time which it will be. I need a table (or less preferably tables) to store this unknown typed data.
What I think I am going to do is have a column for each data type, only use one for each record and leave the others NULL. This requires some logic above the database, but this is not too much of a problem because I will be representing these records in models anyway.
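Concretely, something like this (names are just placeholders; a CHECK like the one below could also push the "exactly one column populated" rule into the database itself rather than leaving it entirely to my models):

CREATE TABLE generic_value (
    id           INTEGER PRIMARY KEY,
    int_value    INTEGER,
    float_value  DOUBLE PRECISION,
    string_value VARCHAR(255),
    CHECK (
        (CASE WHEN int_value    IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN float_value  IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN string_value IS NOT NULL THEN 1 ELSE 0 END) = 1
    )
);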
Basically, is there a best practice way to do something like this? I have not come up with anything that is less of a hack than this, but it seems like this is a somewhat common problem. Thanks in advance.
EDIT: Also, is this considered 3NF?
You could easily do that if you used SQLite as a database backend:
Any column in a version 3 database, except an INTEGER PRIMARY KEY column, may be used to store any type of value.
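A quick illustration of that (SQLite syntax; the column is simply declared with no type at all):

CREATE TABLE flexible (id INTEGER PRIMARY KEY, value);
INSERT INTO flexible (value) VALUES (42), (3.14), ('hello');
SELECT value, typeof(value) FROM flexible;  -- integer, real, text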
For other RDBMS systems, I would go with Philip's solution.
Note that in my line of software (business applications), I cannot think of any situation where this kind of requirement would be needed (a value with an unknown datatype). Unless the domain model was flawed, of course... I can imagine that other lines of software may incur different practices, but I suggest that you consider rethinking your overall design.
If your application can reliably convert datatypes, you might consider a single column solution based on a variable-length binary column, with a second column to track original data type. (I did a very small routine based on this once before, and it worked well enough.) Testing would show if conversion is more efficiently handled on the application or database side.
If I were to do this I would choose either your method, or I would cast everything to string and use only one column. Of course there would be another column with the type (which would probably be useful for the first method too).
For faster code I would probably go with your method.