How to handle numerous forms with different fields and data types in the same table?

I need to develop an application that handles more than 30 forms. The forms have different numbers of fields with different data types. After storage, I need to run advanced searches over the forms, and a full-text search may be needed on fields with a specific name shared among forms. Expected data size is ~50k forms with ~500k form fields. PostgreSQL is going to be used.
The solutions that I came up with:
1. Encoding the form fields into a JSON string
Problems: Performing a full-text search on data with a specific field name can be cumbersome. Also, whenever I need to read or update data in a form, I have to decode and re-encode the JSON.
2. Creating a table with as many columns as there are inputs in a form
Problems: Since the form mapping classes will already exist, I can map form fields to database columns for each of those forms. For search, though, I may need to write different rules for each form, as the field mapping changes radically between forms.
3. Keeping the fields in a separate table with a foreign key to the form table
Problems: Maintaining the form data is still an issue, and I don't know about speed; I expect it to run at least as fast as the previous option. Generating the search query from Java/Hibernate will be a bit harder than with the previous option.
As I don't have any experience in handling such a case, I need your help and suggestions.
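To make options 1 and 3 concrete, here is a rough PostgreSQL sketch; the table, column and field names (forms, fields, 'title', form_fields) are hypothetical and not taken from the question:

-- Option 1: each form row carries its fields in a jsonb column (hypothetical schema)
CREATE TABLE forms (
    id        bigserial PRIMARY KEY,
    form_type text NOT NULL,
    fields    jsonb NOT NULL
);

-- Expression index supporting full-text search on one field name shared among forms
CREATE INDEX forms_title_fts ON forms
    USING gin (to_tsvector('simple', fields->>'title'));

SELECT id
FROM forms
WHERE to_tsvector('simple', fields->>'title') @@ plainto_tsquery('simple', 'search words');

-- Option 3: one row per field, linked back to the form with a foreign key
CREATE TABLE form_fields (
    form_id     bigint NOT NULL REFERENCES forms(id),
    field_name  text   NOT NULL,
    field_value text
);

CREATE INDEX form_fields_fts ON form_fields
    USING gin (to_tsvector('simple', field_value));

At ~50k forms and ~500k fields, both layouts are small by PostgreSQL standards; the expression index (or the per-field index in option 3) is what keeps the full-text search on a named field cheap.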

Related

Prevent continuously adding columns by postgres json data type

I wanted to ask whether there might be a different and better approach than mine.
I have a model entity that can have an arbitrary amount of hyperparameters. Depending on the specific model I want to insert as row into the model table, I may have specific hyperparameters. I do not want to continuously add new columns to my model table for new hyperparameters that I encounter when trying out new models (+ I don't like having a lot of columns that are null for many rows). I also want to easily filter models on specific hyperparameter values, e.g. "select * from models where model.hyperparameter_x.value < 0.5". So, an n-to-n relationship to a hyperparameter table comes to mind. The issue is, that the datatype for hyperparameters can be different, so I cannot define a general value column on the relationship table, with a datatype, that's easily comparable across different models.
So my idea is to define a json-typed "value" column in the relationship table to support different value datatypes (float, array, string, ...). What I don't like about that idea, and what was legitimately criticized by colleagues, is that this can result in chaos within the value column pretty fast, e.g. people inserting data with very different JSON structures for the same hyperparameters. To mitigate this issue, I would introduce a "json_regex_template" column in the hyperparameter table, so that at the API level I can easily validate whether the JSON for a value for hyperparameter x is correctly defined by the user. An additional "json_example" column in the hyperparameter table would further help the user on the other side of the API make correct requests.
This solution would still not guarantee non-chaos at the database request level (even though no user should directly insert data without using the API, so I don't think that's a very big deal). And the solution still feels a bit hacky. I believe I'm not the first person with this problem, and maybe there is a best practice to solve it?
Is my aversion to continuously adding columns reasonable? It's probably about 3-5 new columns per month, which may saturate at a lower number at some point, but that's speculative.
I'm aware of this post (Storing JSON in database vs. having a new column for each key), but it's pretty old, so my hope is that there may be new stuff I could use. The model-hyperparameter thing is of course just a small part of my full database model. Changing to a non-relational database is not an option.
Opinions are much appreciated
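One possible sketch of the relationship-table idea in PostgreSQL, with hypothetical table and column names; a jsonb value column keeps mixed datatypes while still allowing the kind of filter mentioned above:

CREATE TABLE hyperparameters (
    id   bigserial PRIMARY KEY,
    name text NOT NULL UNIQUE
);

CREATE TABLE model_hyperparameters (
    model_id          bigint NOT NULL,
    hyperparameter_id bigint NOT NULL REFERENCES hyperparameters(id),
    value             jsonb  NOT NULL,   -- float, string, array, ... stored as JSON
    PRIMARY KEY (model_id, hyperparameter_id)
);

-- "select models where hyperparameter_x < 0.5", assuming scalar values are stored as plain JSON numbers
SELECT mh.model_id
FROM model_hyperparameters mh
JOIN hyperparameters h ON h.id = mh.hyperparameter_id
WHERE h.name = 'hyperparameter_x'
  AND jsonb_typeof(mh.value) = 'number'
  AND (mh.value #>> '{}')::numeric < 0.5;

Validation of the JSON structure per hyperparameter (the json_regex_template idea) would still live at the API layer or in a trigger; the schema above only covers the relational part.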

What is the best way to manage multiple instances of the same dataset class across multiple instances of the same form?

Apologies for the long-winded question - I am experienced with the basics but this is my first time working with datasets and databases (previous applications involved records, arrays and text files). My question is one of best practice and what is the best way to implement this, but to answer you will need to know what I am trying to achieve...
I am building an application that allows a telephone operator to take messages, write them into a form and save them to a DB. The main form contains a grid showing the messages already in the DB, and I have a 'message form' which is a VCL form containing edit, combo and checkbox controls for all the respective fields the operator must log; some are mandatory, some are optional. There are lots of little helpers and automations that run depending on the user input, but my question is related to the underlying data capture and storage.
There are two actions the user can perform:
Create new/blank message by clicking a button
Edit message by clicking on respective line in grid
Both of these actions cause an instance of the form to be created and initialized, and in the case of EDIT, the fields are then populated with the respective data from the database. In the case of NEW a blank dataset/record is loaded - read on...
The message form captures four distinct groups of information, so for each one I have defined a specific record structure in a separate unit (let's call them Group1, Group2, Group3 and Group4), each containing different numbers/types of elements. I have then defined a fifth record structure with four elements - each element being one of the previous four defined record structures - called TMessageDetails.
Unlike perhaps other applications, I am allowing the user to have up to 6 instances of the message form open at any one time - and each one can be in either NEW or EDIT mode. The only restriction is that two forms in EDIT mode cannot be editing the same message - the main form prevents this.
To manage these forms, I have another record (TFormDetails) with elements such as FormName (each form is given a unique name when created), an instance of TMessageDetails, FormTag and some other bits. I then have an array of TFormDetails with length 6. Each time a form is opened a spare 'slot' in this array is found, a new form created, the TMessageDetails record initialized (or data loaded into it from the DB) and a pointer to this record is given to the form. The form is then opened and it loads all the data from the TMessageDetails record into the respective controls. The pointer is there so that when the controls on the form make changes to the record elements, the original record is edited and I don't end up with a 'local' copy behind the form and out of sync with the original.
For the DB interaction I have four FDQuery components (one per group) on the message form, each pointing to a corresponding table (one per group) in an SQLite DB.
When loading a message I have a procedure that uses FDQuery1 to get a row of data from Table1, and then it copies the data to the Group1 record (field by field) in the respective TMessageDetails record (stored in the TFormDetails array) for that form. The same then happens with FDQuery2, 3, 4 etc...
Saving is basically the same but obviously in reverse.
The reason for the four FDQuery components is so that I can keep each dataset open after loading, which then gives me an open dataset to update and post to the DB when saving. The reason for copying to the records is mainly so that I can reference the respective fields elsewhere in the code with shorter names, and also because when a VCL control tries to change a field in the dataset the changes don't 'stick' (the data I try to save back to the DB is the same as what I loaded), whereas in a record they do. The reason for breaking the records down into groups is that there are places where the data in one of the groups may need to be copied to somewhere else, but not the whole message. It was also more natural to me to use records than datasets.
So my question is...
Is my use of the record structures, a global TFormDetails array, pointers, four FDQuery components per form (so up to 6 open forms means up to 24 open datasets), and copying between records and datasets on save/load a good way to implement what I am trying to achieve?
OR
Is there a way I can replace the records with datasets (surely making copying from FDQuery easier/shorter?) but still store them in a global 'message form' array so I can keep track of them? Should I also try to reduce the number of FDQuery instances and potentially open datasets by having, say, one FDQuery component and re-using it to load the tables into other global datasets?
My current implementation works just fine and there is no noticeable lag/hang when saving/loading; I just can't find much info on what is considered best practice for my needs (namely having multiple instances of the same form open - other examples refer to ShowModal and only having one dataset to worry about), so I'm not sure if I'm leaving myself open to problems like memory leaks (I understand the 'dangers' of using pointers), performance issues or just general bad practice.
Currently using RAD 10.3 and the latest version of SQLite.
I don't know if what I'll say is the "best" practice but it is how I would do it.
The form would have its own TFDQuery using a global FDConnection. When several instances of the form are created, you have an FDQuery for each instance.
You add an 'Execute' method to the form. That method will create the FDQuery, get all the data with one or more queries, populate the form and show it. The Execute method receives an argument (or a form property) such as the primary key so it can fetch the data. If the argument is empty, then it is a new record.
If the form has to update the grid on the fly, then an event will be used. The main form (containing the grid) will install an event handler and update the grid according to the data given by the form.
When the form is done, it will use the primary key to store data back to the database.
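As a sketch of the kind of queries each form instance's FDQuery could run in that Execute method - the table and column names below are made up, and :id is a FireDAC-style parameter:

-- EDIT mode: one query per group table, keyed by the message primary key
SELECT * FROM group1 WHERE message_id = :id;
SELECT * FROM group2 WHERE message_id = :id;
SELECT * FROM group3 WHERE message_id = :id;
SELECT * FROM group4 WHERE message_id = :id;
-- NEW mode: skip the selects and simply insert/post new rows when saving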

Is there a better way to represent provenance at the field level in SOLR

I have documents in SOLR which consist of fields whose values come from different source systems. The reason I am doing this is that this document is what I want returned from the SOLR search, including functionality like hit highlighting. As far as I know, if I use a join across multiple SOLR documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the dynamic field name with the first part of the field name being an 8-character hash representing the source system. This way they can have common field names outside of the unique source hash, and I can easily clear out all fields that start with the source prefix, if needed.
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to turn into a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to keep up when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can be specific to the indexing process. Usually the subsystems will be performant enough (at least when doing batch retrieval based on the documents being updated).
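A minimal sketch of what such a local cache could look like in SQL; the table and column names here are assumptions, not part of the answer:

CREATE TABLE reindex_cache (
    doc_id     text PRIMARY KEY,                     -- Solr document id
    source_a   jsonb,                                -- last payload fetched from system A
    source_b   jsonb,                                -- last payload fetched from system B
    fetched_at timestamptz NOT NULL DEFAULT now()
);

-- At indexing time, pull everything needed to rebuild a batch of documents in one query
SELECT doc_id, source_a, source_b
FROM reindex_cache
WHERE doc_id = ANY (ARRAY['doc-1', 'doc-2', 'doc-3']);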

Database design - should I use 30 columns or 1 column with all data in form of JSON/XML?

I am doing a project which needs to store 30 distinct fields for a piece of business logic, which will later be used to generate a report for each.
The 30 distinct fields are not written at one time; the business logic has many transactions, so it goes something like:
Transaction 1, update field 1-4
Transaction 2, update field 3,5,9
Transaction 3, update field 8,12, 20-30
...
...
N.B. each transaction (all belonging to one piece of business logic) updates an arbitrary number of fields, and not in any particular order.
I am wondering which database design would be best:
1. Have 30 columns in the postgres database representing those 30 distinct fields.
2. Have the 30 fields stored as XML or JSON in just one column of postgres.
Which one is better, 1 or 2?
If I choose 1>:
I know that from a programming perspective it is easier, because I don't need to read the whole XML/JSON, update a few fields and then write it back to the database; I can update just the few columns I need for each transaction.
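For illustration (the table and column names are hypothetical), a transaction under option 1> only touches the columns it changes, while the option 2> equivalent rewrites part of the JSON value:

-- Option 1: transaction 2 updates only fields 3, 5 and 9
UPDATE business_record
SET field3 = 'x', field5 = 42, field9 = 'y'
WHERE id = 123;

-- Option 2: same transaction if the 30 fields live in one jsonb column named data
UPDATE business_record
SET data = data || '{"field3": "x", "field5": 42, "field9": "y"}'::jsonb
WHERE id = 123;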
If I choose 2>:
I can potentially reuse the table generically for something else, since what's inside the blob column is only XML. But is it wrong to use a table generically to store something totally unrelated in business terms, just because it has a blob column storing XML? This does have the potential to save the effort of creating a few new tables. But is this kind of generic reuse of a table wrong in an RDBMS?
Also, by choosing 2> it seems I would be able to handle potential changes, like changing certain fields or adding more fields? At least it seems I wouldn't need to change the database table. But I would still need to change the C++ & C# code to handle the change internally, so I'm not sure if this is any advantage.
I am not experienced enough in database design to make this decision, so any input is appreciated.
N.B. there is a good chance I won't need to index or search on those 30 columns for now; a primary key will be created on an extra column if I choose 2>. But I am not sure if later I will be required to search based on any of those columns/fields.
Basically all my fields are predefined in the requirement documents; they are generally simple fields like:
field1: value(max len 10)
field2: value(max len 20)
...
field20: value(max len 2)
No nested fields. Is it worth creating 20 columns, one for each of those fields (some are strings like date/time, some are plain strings, some are integers, etc.)?
2>
Is putting data for different business logic in a shared table a bad design idea, if it is only put in a shared table because it shares the same structure? E.g. they all have a datetime column, a primary key & an XML column with different business content inside. This way we save some effort of creating new tables... Is this saving of effort worth it?
Always store your XML/JSON fields as separate fields in a relational database. Doing so you will keep your database normalized, allowing the database to do its thing with queries/indices etc. And you will save other developers the headache of deciphering your XML/JSON field.
It will be more work up front to extract the fields from the XML/JSON and perhaps to maintain it if fields need to be added, but once you create a class or classes to do so that hurdle will be eliminated and it will more than make up for the cryptic blob field.
In general it's wise to split the JSON or XML document out and store it as individual columns. This gives you the ability to set up constraints on the columns for validation and checking, to index columns, to use appropriate data types for each field, and generally use the power of the database.
Mapping it to/from objects isn't generally too hard, as there are numerous tools for this. For example, Java offers JAXB and JPA.
The main time when splitting it out isn't such a great idea is when you don't know in advance what the fields of the JSON or XML document will be or how many of them there will be. In this case you really only have two choices - to use an EAV-like data model, or store the document directly as a database field.
In this case (and this case only) I would consider storing the document in the database directly. PostgreSQL's SQL/XML support means you can still create expression indexes on xpath expressions, and you can use triggers for some validation.
This isn't a good option, it's just that EAV is usually an even worse option.
If the document is "flat" - i.e. a single level of keys and values, with no nesting - then consider storing it as hstore instead, as the hstore data type is a lot more powerful.
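A hedged sketch of both suggestions in PostgreSQL; the documents table, the /doc/field1 path and the key names are placeholders:

-- Expression index over one field inside an xml column
CREATE TABLE documents (
    id  bigserial PRIMARY KEY,
    doc xml NOT NULL
);

CREATE INDEX documents_field1_idx ON documents (
    ( ((xpath('/doc/field1/text()', doc))[1])::text )
);

SELECT id
FROM documents
WHERE ((xpath('/doc/field1/text()', doc))[1])::text = 'some value';

-- Flat key/value alternative using hstore
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE documents_kv (
    id     bigserial PRIMARY KEY,
    fields hstore NOT NULL
);

SELECT id FROM documents_kv WHERE fields -> 'field1' = 'some value';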
(1) is more standard, for good reasons. For one thing, it enables the database to do the heavy lifting on things like search and indexing.

Need advice on multilingual data storage

This is more of a question for experienced people who've worked a lot with multilingual websites and e-shops. It is NOT a database structure question or anything like that. This is a question about how to store a multilingual website, NOT how to store translations. A multilingual website can not only be translated into multiple languages, but can also have language-specific content. For instance, an English version of the website can have a completely different structure than the same website in Russian or any other language. I've thought up 2 storage schemas for such cases:
// NUMBER ONE
table contents // to store some HYPOTHETICAL content
id // content id
table contents_loc // to translate the content
content, // ID of content to translate
lang, // language to translate to
value, // translated content
online // availability flag, VERY IMPORTANT
ADVANTAGES:
- Content can be stored in multiple languages. This schema is pretty common, except maybe for the "online" flag in the "_loc" tables. About that below.
- Every content item can not only be translated into multiple languages, but you can also mark online=false for a single language and hide the content from that language. Alternatively, that record could be removed from the "_loc" table to achieve the same functionality as online=false, but then it would be permanent and couldn't be easily undone. For instance, we could create some sort of menu, but we don't want one or more items to appear in English - so we use online=false on those "translations".
DISADVANTAGES:
- Quickly gets pretty ugly with more complex table relations.
- More difficult queries.
// NUMBER 2
table contents // to store some HYPOTHETICAL content
id, // content id
online // content availability (not the same as in first example)
lang, // language of the content
value, // translated content
ADVANTAGES:
1. Less painful to implement
2. Shorter queries
DISADVANTAGES:
1. Every multilingual record would now have 3 different IDs. It would be bad for, e.g., products in an e-shop, since the first version would allow us to store different languages under the same ID, while this one would require 3 separate records to represent the same product.
The first storage option would seem like a great solution, since you could easily use it instead of the second one as well, but you couldn't easily do it the other way around.
The only problem is... the first structure seems a bit like overkill (except in cases like product storage).
So my question to you is:
Is it logical to implement the first storage option? In your experience, would anyone ever need such a solution?
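For concreteness, a rough PostgreSQL-style rendering of schema NUMBER ONE; the types are illustrative only:

CREATE TABLE contents (
    id bigserial PRIMARY KEY                -- content id
);

CREATE TABLE contents_loc (
    content bigint  NOT NULL REFERENCES contents(id),
    lang    text    NOT NULL,               -- language to translate to, e.g. 'en', 'ru'
    value   text    NOT NULL,               -- translated content
    online  boolean NOT NULL DEFAULT true,  -- availability flag
    PRIMARY KEY (content, lang)
);

-- Everything visible in one language
SELECT c.id, l.value
FROM contents c
JOIN contents_loc l ON l.content = c.id
WHERE l.lang = 'en'
  AND l.online;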
The question we ask ourselves is always:
Is the content the same for multiple languages and do they need a relation?
Translatable models
If the answer is yes, you need a translatable model: a model with multiple versions of the same record, so you need a language flag for each record.
PROS: It gives you a structure in which you can see for example which content has not yet been translated.
Separate records per language
But many times we see a different solution as the better one: just separate the languages totally. We mostly see this in CMS solutions. The story is not only translated but also different. For example, in country 1 they have a different menu structure, other news items, other products and other pages.
PROS: Total flexibility and no unexpected records from other languages.
Example
We see it like writing a magazine: you can write one, then translate it to another language. Yes, that's possible, but in the real world we see more and more that the content is structurally different. People don't like to be surprised, so you need lots of steps to make sure content is not visible in the wrong languages, pages don't get created in duplicate, etc.
Sharing logic
So what we do most of the time is: share the views, make the buttons, inputs etc. translatable, but keep the content separated, so that every admin can just work in their own area. If we need to confirm that some records are available in all languages, we can always handle that by creating a link (nicely relational) between them, but it is not the standard we use most of the time.
Really translatable records like products
Because we are flexible in creating models etc., we can just decide how to work with them based on the requirements. I would not try to look for a general solution that works for everything, because there is none. You need a solution based on your data.
Assuming that you need a translatable model, as it is described by Luc, I would suggest coming up with some sort of special-character-delimited key-value pair format for the value column of the content table. Example:
#en=English Term#de=German Term
You may use UDFs (User Defined Functions in T-SQL) to set/get the appropriate term based on the specified language.
For selecting :
select id, dbo.GetContentInLang(value, #lang)
from content
For updating:
update content
set value = dbo.SetContentInLang(value, #lang, new_content)
where id = #id
The UDFs:
a. do have a performance hit, but this is also the case for the join you would have to do between the content and content_loc tables, and
b. are somewhat difficult to implement, but are reusable practically throughout your database.
You can also do the above on the application/UI layer.
