I recently started working at this company, and they showed me the "framework" they use for the project. The main goal is to drive everything from the database, because that way it can be changed later without touching the code, only the database, which is "easier". It looks like the following:
|id |component_id |default |read_only|visible|form_id|
|---|-------------|---------|---------|-------|-------|
|1 |2 |now() + 4|false |true |3 |
|2 |1 |null |false |true |4 |
|3 |5 |null |true |true |1 |
The component_id points to a component table where they define each kind of field, like date pickers, inputs, selects, and so on. The form_id points to another table that lists the different forms in the app, such as register, login, add_order, and so on.
This is an over-simplification: they have more columns and more tables just to display the data in the UI, plus actions that trigger different things.
So my question is: is this good practice? The code looks very complicated just to allow this, and the database is a mess, with a lot of tables that only store logic. Or is this something people use often that I just haven't encountered before?
We're using Dart/Flutter on the front end and I love strongly typed languages, but this throws the strong typing away: we have to guess what each value coming from the db is, and we have a huge file of switch statements that check which component to render and then apply all the other values.
I think it is easier to just write the code when needed, since it is simpler and nicer to read than trying to figure out all this madness... Am I right?
This is a perfect example of over-engineering. There are numerous issues with this design. One of the main ones is the amount of risk that this introduces. Not only does it make development a nightmare, but it also allows developers to bypass any sort of risk-controls such as code scans. It also introduces a possible attack vector as it relies on an external mutable source for runtime behavior.
Data from your database should be just that, data. The business layer should be a stable set of logical instructions that manipulates that data. The less cross-over the better.
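To make the contrast concrete, here is a minimal sketch of the first row of the question's table expressed as typed code instead of a database row. It is written in Python only for illustration (the project itself uses Dart), and the names `DateField`, `ADD_ORDER_FORM` and `delivery_date` are hypothetical. It also assumes that `now() + 4` means "four days from now", which the question does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass(frozen=True)
class DateField:
    """A date picker whose behaviour lives in code, not in a table row."""
    name: str
    read_only: bool = False
    visible: bool = True
    # The default is a typed function rather than a string such as
    # "now() + 4" that has to be parsed and trusted at runtime.
    default: Optional[Callable[[datetime], datetime]] = None


# Hypothetical form definition; "+ 4" is assumed to mean four days.
ADD_ORDER_FORM = (
    DateField(name="delivery_date", default=lambda now: now + timedelta(days=4)),
)

field = ADD_ORDER_FORM[0]
default_value = field.default(datetime(2024, 1, 1))
print(default_value)
```

A definition like this goes through code review, static checks and version control together with the rest of the application, which is the point made above about keeping logic out of mutable data.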
This kind of design also introduces problems with what amounts to versioning your dataset as you would your codebase. Then you have to make sure they sync up together.
Unfortunately, you probably have the original architect of this nightmare still around where you work, or your development team has gotten so used to such lax risk controls that a transition to a proper design will be like pulling teeth. If you are aiming to eventually push them in the right direction, your best bet would be to present the issues as a matter of risk versus reward and have a solution ready to propose.
I was asked to create a table to store paid-hours data from multiple attendance systems from multiple geographies from multiple sub-companies. This table would be used for high level reporting so basically it is skipping the steps of creating tables for each system (which might exist) and moving directly to what the final product would be.
The request was to have a dimension for each type of hours or pay like this:
date | employee_id | type | hours | amount
2016-04-22 | abc123 | regular | 80 | 3500
2016-04-22 | abc123 | overtime | 6 | 200
2016-04-22 | abc123 | adjustment | 1 | 13
2016-04-22 | abc123 | paid time off | 24 | 100
2016-04-22 | abc123 | commission | | 600
2016-04-22 | abc123 | gross total | | 4413
There are multiple rows per employee, but the thought process is that this will allow us to capture new dimensions if they are added.
The data is coming from several sources and I was told not to worry about the ETL, but just design the ultimate table and make it work for any system. We would provide this format to other people for them to fill in.
I have only seen the raw data from one system, and it looks like this:
date | employee_id | gross_total_amount | regular_hours | regular_amount | OT_hours | OT_amount | classification | amount | hours
It is pretty messy. Multiple rows for employees and values like gross_total repeat each row. There is a classification column which has items like PTO (paid time off), adjustments, empty values, commission, etc. Because of repeating values, it is impossible to just simply sum the data up to make it equal the gross_total_amount.
Anyway, I would kind of prefer a column-based approach where each row describes an employee's paid hours for a cut-off. One problem is that I won't know all of the possible types of hours, so I can't necessarily make a table like:
date | employee_id | gross_total_amount | commission_amount | regular_hours | regular_amount | overtime_hours | overtime_amount | paid_time_off_hours | paid_time_off_amount | holiday_hours | holiday_amount
I am more used to data formatted that way, though. The concern is that you might not capture all of the necessary columns up front, or that something new gets added later. (For example, I know there is maternity leave, paternity leave, and bereavement leave; in other geographies there are labor laws about working at night; etc.)
Any advice? Is the table that my superior suggested a viable solution?
TAM makes lots of good points, and I have only two additional suggestions.
First, I would generate some fake data in the table as described above, and see if it can generate the required reports. Show your manager each of the reports based on the fake data, to check that they're OK. (It appears that the reports are the ultimate objective, so work back from there.)
Second, I would suggest that you get sample data from as many of the input systems as you can. This is to double check that what you're being asked to do is possible for all systems. It's not so you can design the ETL, or gather new requirements, just testing it all out on paper (do the ETL in your head). Use this to update the fake data, and generate fresh fake reports, and check the reports again.
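The fake-data check can be tried in a few lines. The sketch below loads the sample rows from the question into an in-memory SQLite table and produces one wide report row per employee and date. The table name `paid_hours` and the column name `pay_date` are assumptions, and the "gross total" row is deliberately left out on the assumption that the total should be derived from the detail rows rather than stored next to them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paid_hours ("
    " pay_date TEXT, employee_id TEXT, type TEXT, hours REAL, amount REAL)"
)

# Fake data copied from the example in the question (no "gross total" row).
fake_rows = [
    ("2016-04-22", "abc123", "regular", 80, 3500),
    ("2016-04-22", "abc123", "overtime", 6, 200),
    ("2016-04-22", "abc123", "adjustment", 1, 13),
    ("2016-04-22", "abc123", "paid time off", 24, 100),
    ("2016-04-22", "abc123", "commission", None, 600),
]
conn.executemany("INSERT INTO paid_hours VALUES (?, ?, ?, ?, ?)", fake_rows)

# Pivot the long format into one report row per employee and date.
REPORT_SQL = """
SELECT pay_date,
       employee_id,
       SUM(CASE WHEN type = 'regular'  THEN hours ELSE 0 END) AS regular_hours,
       SUM(CASE WHEN type = 'overtime' THEN hours ELSE 0 END) AS overtime_hours,
       SUM(amount)                                            AS gross_total_amount
FROM paid_hours
GROUP BY pay_date, employee_id
"""
report = conn.execute(REPORT_SQL).fetchall()
print(report)
```

If a stored "gross total" row were loaded as well, `SUM(amount)` would count the money twice, which is exactly the kind of issue this dry run is meant to surface before any ETL is built.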
Let me recapitulate what I understand to be the basic task.
You get data from different sources with different structures. Your task is to consolidate them in a single database so that questions can be answered across all of this data. I take the hint about "not to worry about the ETL, but just design the ultimate table" to mean that your consolidated database doesn't need to contain every detail that might be present in the original data, just enough information to fulfill the specific requirements placed on the consolidated database.
This sounds sensible as long as your superior is certain enough about these requirements. In that case, you will reduce the information coming from each source to the consolidated structure.
In any case, you'll have to capture the domain semantics of the data coming in from each source. Lacking access to your domain semantics, I can't clarify the mess of repeating values etc. for you. E.g., if there are detail records and gross total records, as in your example, it would be wrong to add the hours of all records, as this would always yield twice the hours actually worked. So someone will have to worry about ETL, namely interpreting each set of records (probably all entries for one employee and one working day), finding out what they mean, and transforming them to the consolidated structure.
I understand another part of the question to be about the usage of metadata. You can have different columns for notions like holiday leave and maternity leave, or you have a metadata table containing these notions as a key-value pair, and refer to the key from your main table. The metadata way is sometimes praised as being more flexible, as you can introduce a new type (like paternity leave) without redesigning your database. However, you will need to redesign the software filling and probably also querying your tables to make use of the new type. So you'll have to develop and deploy a new software release anyway, and adding a few columns to a table will just be part of that development effort.
There is one major difference between a broad table containing all notions as attributes and the metadata approach. If you want to make sure that, for a time period, either all or none of the values are present, that's easy with the broad table: Just make all attributes `not null`, and you're done. Ensuring this for the metadata solution would mean some rather complicated constraint that may or may not be available depending on the database system you use.
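As a small illustration of the broad-table guarantee, the sketch below uses SQLite with a hypothetical `paid_hours_wide` table holding only two value columns. A row that leaves one of them empty is rejected by the database itself, without any application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paid_hours_wide ("
    " pay_date       TEXT NOT NULL,"
    " employee_id    TEXT NOT NULL,"
    " regular_hours  REAL NOT NULL,"
    " overtime_hours REAL NOT NULL)"
)


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO paid_hours_wide VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


complete_ok = try_insert(("2016-04-22", "abc123", 80, 6))    # all values present
partial_ok = try_insert(("2016-04-22", "abc124", 80, None))  # overtime missing
print(complete_ok, partial_ok)
```

In the key-value layout, the missing value would simply be a row that was never inserted, so there is nothing for a column constraint to reject.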
If that's not a main requirement, I would go a pragmatic way and use different columns if I expect only a handful of those types, and a separate key-value table otherwise.
All these considerations relied on your superior's assertion (as I understand it) that your consolidated table will only need to fulfill the requirements known today, so you are free to throw original detail information away if it's not needed due to these requirements. I'm wary of that kind of assertion. Let's assume some of your information sources deliver additional information. Then it's quite probable that someday someone asks for a report also containing this information, where present. This won't be possible if your data structure only contains what's needed today.
There are two ways to handle this, i.e. to provide for future needs. You can, after knowing the data coming from each additional source, extend your consolidated database to cover all data structures coming from there. This requires some effort, as different sources might express the same concept using different data, and you would have to consolidate those to make the data comparable. Also, there is some probability that not all of your effort will be worth the trouble, as not all of the detail information you get will actually be needed for your consolidated database. Another more elegant way would therefore be to keep the original data that you import for each source, and only in case of a concrete new requirement, extend your database and reimport the data from the sources to cover the additional details. Prices of storage being low as they are, this might yield an optimal cost-benefit ratio.
I've looked around trying to figure out the best way to handle my database, and I have some questions.
Is it better to have a large table or separate tables? Is there any real difference as far as server load?
I paid a guy to put together a database and some php so I could break into it, and he made 2 separate tables for what, I think, should be one. Seems to be redundant, and I no likey repeating myself.
Basically, x_content is: ID | section | page | heading | content
and x_menu is ID | PARENT | LINK | DISPLAY | HASCHILD
(personally, it bugs me about the caps. I think I'll go Standard Case since everything else on the site is all-lowercase or in script camelCase)
Anyway, ID, (heading/DISPLAY), and (page/LINK) are more/less (can be) the same. Seems to me I'd be doing myself a favor by combining these, and adding the rest of what I want.
What I'd Like: ID | Category | Name | URL | Description | Keywords | Content | Theme
So- should I delete the x_menu and combine them?
*If I link all of the pages in my site right now, it would be something like 40+
If the menu is built dynamically in your application, I would suggest keeping a separate table for the menu, for easier maintenance.
However, I don't see any "menu" column in your combined table. I don't know whether you'd like to replace it with URL, but the difference I see is that every page has a URL, while a page may or may not be included in the menu.
If the menu is not built dynamically, I think more information is needed to decide whether the menu should be a separate entity.
I have a dataflow task with information that looks something like this:
Province | City | Population
-------------------------------
Ontario | Toronto | 7000000
Ontario | London | 300000
Quebec | Quebec | 300000
Quebec | Montreal| 6000000
How do I use the Aggregate transformation to get the city with the largest population in each province:
Province | City | Population
-------------------------------
Ontario | Toronto | 7000000
Quebec | Montreal| 6000000
If I set "Province" as the Group-By column and "Population" to the "Max" aggregate, what do I do with the City column?
Completely agree with @PaulStock that aggregates are best left to source systems. An aggregate in SSIS is a fully blocking component, much like a sort, and I've already made my argument on that point.
But there are times when doing those operations in the source system just isn't going to work. The best I've been able to come up with is to basically double process the data. Yes, ick, but I was never able to find a way to pass a column through unaffected. For Min/Max scenarios, I'd want that as an option, but obviously something like a Sum would make it hard for the component to know which "source" row it should tie to.
2005
A 2005 implementation would look like this. Your performance is not going to be good, in fact a few orders of magnitude from good as you'll have all these blocking transforms in there in addition to having to reprocess your source data.
Merge join
2008
In 2008, you have the option of using the Cache Connection Manager which would help eliminate the blocking transformations, at least where it matters, but you're still going to have to pay the cost of double processing your source data.
Drag two data flows onto the canvas. The first will populate the cache connection manager and should be where the aggregate takes place.
Now that the cache has the aggregated data in there, drop a lookup task in your main data flow and perform a lookup against the cache.
General lookup tab
Select the cache connection manager
Map the appropriate columns
Great success
Script task
The third approach I can think of, 2005 or 2008, is to write it your own self. As a general rule, I try to avoid the script tasks but this is a case where it probably makes sense. You will need to make it an asynchronous script transformation but simply handle your aggregations in there. More code to maintain but you can save yourself the trouble of reprocessing your source data.
Finally, as a general caveat, I'd investigate what impact ties will have on your solution. For this data set, I would not expect something like Guelph to suddenly swell and tie Toronto, but if it did, what should the package do? Right now, both will result in 2 rows for Ontario, but is that the intended behaviour? Script, of course, allows you to define what happens in the case of ties. You could probably stand the 2008 solution on its head by caching the "normal" data, using that as your lookup condition, and using the aggregates to pull back just one of the ties. 2005 can probably do the same just by putting the aggregate as the left source for the merge join.
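The logic of the script approach, including an explicit decision about ties, can be sketched as a single pass over the rows. This is only an illustration in Python; an actual SSIS script component would be written in a .NET language, and the function name `largest_city_per_province` and the `keep_ties` switch are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, int]  # (province, city, population)


def largest_city_per_province(rows: Iterable[Row], keep_ties: bool = False) -> List[Row]:
    """Single pass: remember the best row(s) seen so far for each province."""
    best: Dict[str, List[Row]] = {}
    for row in rows:
        province, _city, population = row
        current = best.get(province)
        if current is None or population > current[0][2]:
            best[province] = [row]      # new maximum replaces earlier rows
        elif keep_ties and population == current[0][2]:
            current.append(row)         # tie: keep both rows
        # otherwise: smaller value, or a tie while keep_ties is off (first row wins)
    return [row for winners in best.values() for row in winners]


rows = [
    ("Ontario", "Toronto", 7000000),
    ("Ontario", "London", 300000),
    ("Quebec", "Quebec", 300000),
    ("Quebec", "Montreal", 6000000),
]
result = largest_city_per_province(rows)
print(result)
```

Because the whole input has to be read before any row can be emitted, this is still a blocking step, but the source data is read only once and the City column travels with the winning row.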
Edits
Jason Horner had a good idea in his comment. A different approach would be to use a multicast transformation and perform the aggregation in one stream and bring it back together. I couldn't figure out how to make it work with a union all but we could use sorts and merge join much like in the above. This is probably a better approach as it saves us the trouble of reprocessing the source data.
Instead of using the Aggregate transformation, could you use a SQL query?
SELECT
    p.province,
    p.city,
    p.[population]
FROM
    temp_pop AS p
    JOIN (
        SELECT
            province,
            [population] = MAX([population])
        FROM
            temp_pop
        GROUP BY
            province
    ) AS m
        ON p.province = m.province
        AND p.[population] = m.[population]
I mean referring to specific database rows by their ID, from code, or specifying a class name in the database. Example:
You have a database table called SocialNetwork. It's a lookup table. The application doesn't write to or delete from it. It's mostly there for database integrity; let's say the whole shebang looks like this:
SocialNetwork table:
Id | Description
-----------------------------
1 | Facebook
2 | Twitter
SocialNetworkUserName table:
Id | SocialNetworkId | Name
---------------------------------------------------
1 | 2 | @seanssean
2 | 1 | SeanM
In your code, there's some special logic that needs to be carried out for Facebook users. What I usually do is make either an enum or some class constants in the code to easily refer to it, like:
if (socialNetwork.Id == SocialNetwork.FACEBOOK) // SocialNetwork.FACEBOOK = 1
{
    // special Facebook-specific functionality here
}
That's a hard-coded database ID. It's not a huge crime since it's just referencing a lookup table, but there's no longer a clean division between data and logic, and it bothers me.
The other option I can think of would be to specify the name of a class or delegate in the database, but that's even worse IMO because now you've not only broken the division between data and logic, but you've tied yourself to one language now.
Am I making much ado about nothing?
I don't see the problem.
At some point your code needs to do things. Facebook is a real social network, with its own real API, and you want it to do Facebook-specific things in your code. Unless your tasks are trivial, to put all of the Facebook-specific stuff in the database would mean a headache in your code. (What's the equivalent of "Like" in Twitter, for example?)
If the Facebook entry isn't in your database, then the Facebook-specific code won't be executed. You can do that much.
Yep, but with the caveat that "it depends." It's unlikely to change, but.
Storing the name of a class or delegate is probably bad, but storing a token used by a class or delegate factory isn't, because it's language-neutral--but you'll always have the problem of having to maintain the connection somewhere. Unless you have a table of language-specific things tied to that table, at which point I believe you'd be shot.
Rather than keep the constant comparison in mainline code, IMO this kind of situation is nice for a factory/etc. pattern, enum lookup, etc. to implement network-specific class lookup/behavior. The mainline code shouldn't have to care how it's implemented, which it does right now--that part is a genuine concern.
With the caveat that ultimately it may never matter. If it were me, I'd at least de-couple the mainline code, because stuff like that makes me twitchy.
How would I verify that a database is structured how my C++ program expects? Our source control was very weak in the past, so we have many production installs out there using databases that are missing columns and tables which are now required by the current version of the C++ program. I'd like to have my app check to make sure the database is structured the way it expects at startup time. Ideally this would work for SQL, Oracle, Access, and MySQL databases.
The difficulty seems to be in the cross-DBMS requirement. ODBC drivers provide most of the functionality you need across all databases.
In this situation I have used the ODBC functions SQLTables and SQLDescribeCol to extract a definition of all tables, columns and indexes in the database, and then compared that to the output of the same process run against a known good database.
This is easy enough if you just want to validate the structure. The code to repair such a database by adding columns and indexes followed logically from that, but got a little harder.
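The compare-against-a-known-good-definition idea can be sketched as follows. This is not ODBC code: it uses Python's built-in `sqlite3` module and SQLite's `PRAGMA table_info` as a stand-in for the catalog calls, and the `EXPECTED` schema with its `orders` and `customers` tables is a made-up example.

```python
import sqlite3
from typing import Dict, List, Set

# Hypothetical "known good" structure: table name -> required column names.
EXPECTED: Dict[str, Set[str]] = {
    "customers": {"id", "name"},
    "orders": {"id", "customer_id", "total"},
}


def read_schema(conn: sqlite3.Connection) -> Dict[str, Set[str]]:
    """Collect the tables and columns that actually exist in the database."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    schema: Dict[str, Set[str]] = {}
    for table in tables:
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = {col[1] for col in columns}  # col[1] is the column name
    return schema


def schema_problems(actual: Dict[str, Set[str]],
                    expected: Dict[str, Set[str]]) -> List[str]:
    """List every table or column that is expected but missing."""
    problems: List[str] = []
    for table, columns in sorted(expected.items()):
        if table not in actual:
            problems.append(f"missing table: {table}")
            continue
        for column in sorted(columns - actual[table]):
            problems.append(f"missing column: {table}.{column}")
    return problems


# An "old install" that lacks one table and one column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
problems = schema_problems(read_schema(conn), EXPECTED)
print(problems)
```

The same list of problems is also the natural input for a repair step that adds the missing tables and columns.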
Assuming you can query the database using SQL, you can use the DESCRIBE statement, where the DBMS supports it (MySQL does), to request a description of the table you are looking at, e.g.
DESCRIBE table1;
Then use your code to check through the description and analyse whether it is correct.
The statement returns a list of fields, their types, and other information, e.g.
Field | Type    | NULL | Key | Default | Extra
Col 1 | int(11) | NO   | PRI | NULL    | auto_increment
Col 2 | time    | NO   |     | NULL    |
You can then go through this table.
"I'd like to have my app check to make sure the database is structured the way it expects at startup time."
And what the ... do you think this is going to solve ?
If the database is not structured the way your program expects, then I'd say your program is almost absolutely certain to fail (and quite early on at that, probably).
Supposing you could do the check (and I'm quite certain you could go a hell of a long way to achieve that, even far beyond what has been suggested here), what else do you think there is left for you to do but to abort your program saying "cannot run, database not as expected" ?
The result is almost guaranteed to be the same in both situations : your program won't run. If you are experiencing problems with "databases not structured as expected", then you need to look at (and fix the faults in) the overall process. Software does not live "in its own world", and neither do databases.