I am building a data warehouse for my company's core ERP application, for a particular client.
In the source database, most of the information that feeds the warehouse dimensions is stored in an unpivoted manner, basically because the application is a product that gets customized on each client's request.
For the current client I am working with, I can unpivot and extract the data. But my concern is that if we reuse the data warehouse with other customers, then depending on the way they classify the fields, the data warehouse model will not be able to adjust and further customization would be required.
Do let me know whether there is a sound mechanism to overcome this design issue.
Following is an example of the way the products are classified in the source database (this applies to most of the other master data classifications too):
Product Code MasterClassification MasterClassificationValue
------------ -------------------- -------------------------
AAA          Brand                AA
AAA          Category             A
Same set of data pivoted:
Product Code Brand Category
------------ ----- --------
AAA          AA    A
Thanks in advance.
This is a classic and well-documented data problem. What you describe as 'unpivoted' is known as EAV (entity-attribute-value). I suggest you google 'EAV', perhaps together with 'reporting'. You are not alone!
It makes sense that the dimensional data in the source system is stored unpivoted -- it's a database, so it should be normalized. How you handle it in the data warehouse is another question.
In a previous job, we debated whether and how we should carry pivoted / denormalized / "wide and shallow" data. In our implementation, every table brought with it a view (containing the ETL logic) and a procedure (to load the table). That's a lot of infrastructure, so we thought twice before adding another table. Also, the requirement for pivoted data often came from the analytics team for use in Tableau, a tool that easily consumes unpivoted / "narrow and deep" data and pivots it -- so we often debated whether pivoted data was actually required.
Eventually we decided that we would occasionally carry pivoted data but only via a reporting view. (We had naming conventions to distinguish reporting views from ETL views.) I think this is an approach you should consider, for reasons you mentioned yourself: new categories could be added, rendering your pivoted design outdated. Also, if you have multiple clients using this data, each client could be interested in a different set of categories. You could cast a customized pivoted reporting view on top of this table for each client. That sounds like a lot of work, but I think it's less work than redoing a pivoted table every time you become aware that a new category has been added. Good luck!
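To make that concrete, here is a minimal sketch of what a per-client pivoted reporting view over the EAV data from the question could look like (the table and column names are assumed from the example above, and conditional aggregation is just one portable way to pivot):

-- Hypothetical reporting view for one client; the EAV source table is assumed
-- to be called ProductClassification with the columns shown in the question.
CREATE VIEW rpt_ProductClassification_ClientA AS
SELECT
    ProductCode,
    MAX(CASE WHEN MasterClassification = 'Brand'    THEN MasterClassificationValue END) AS Brand,
    MAX(CASE WHEN MasterClassification = 'Category' THEN MasterClassificationValue END) AS Category
FROM ProductClassification
GROUP BY ProductCode;

Each client then gets its own view listing only the classifications that client cares about, so a new classification means adjusting a view rather than rebuilding a table.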
I have a fairly large database in SQL Server. To illustrate my use case, suppose I have a mobile game and I want to report on user activity.
To start with I have a table that looks like:
userId  date        # Sessions  Total Session Duration
------  ----------  ----------  ----------------------
1       2021-01-01  3           55
1       2021-01-02  9           22
2       2021-01-01  6           43
I am trying to "add" information about each session to this data. The options I'm considering are:
Add the session data as a new column containing a JSON array with the data for each session
Create a table with all session data indexed by userId & date - and query this table as needed.
Is this possible in SQL Server? (my experience is coming from GCP's BigQuery)
Your question boils down to whether it is better to use nested data or to figure out a system of tables where every column of every table has a simple domain (text string, number, date, etc.).
It turns out that this question was being pondered by Ed Codd fifty years ago, when he was proposing the first database system based on the relational model. He decided that it was worthwhile restricting relational databases to Normal Form, later renamed First Normal Form. He proved to his own satisfaction that this restriction wouldn't reduce the expressive power of the relational model, and it would make it easier to build the first relational database manager.
Since then, just about every relational or SQL database has conformed to First Normal Form, although there are ways to get around the restriction by storing some form of structured data in a single column of a table. JSON is an example.
If you go the JSON route, you'll gain the flexibility JSON gives you, but you will lose the ability to specify the data you want to retrieve using the various clauses of the SELECT statement, clauses like INNER JOIN or WHERE, among others. This loss could be a deal killer.
If it were me, I would go with the added-table approach, and analyze the session data down into one or more tables with simple columns. But you may find that JSON decoders are just as powerful, and that full table scans are worth the time they take.
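To illustrate both routes (all table and column names below are made up for the sketch, not taken from your schema): the added-table approach is just a session detail table keyed by userId and date, while a JSON column can still be shredded at query time in SQL Server 2016+ with OPENJSON, typically at the cost of scanning the JSON.

-- Hypothetical detail table for the "added table" approach: one row per session,
-- keyed so it joins back to the daily summary on userId and date.
CREATE TABLE dbo.UserSessions (
    sessionId       bigint IDENTITY PRIMARY KEY,
    userId          int       NOT NULL,
    sessionDate     date      NOT NULL,
    startTime       datetime2 NOT NULL,
    durationMinutes int       NOT NULL
);
CREATE INDEX IX_UserSessions_User_Date ON dbo.UserSessions (userId, sessionDate);

-- If the JSON column is used instead (here an assumed sessionsJson column on an
-- assumed dbo.DailyActivity table), the array can be shredded at query time:
SELECT d.userId, d.[date], s.durationMinutes
FROM dbo.DailyActivity AS d
CROSS APPLY OPENJSON(d.sessionsJson)
     WITH (durationMinutes int '$.durationMinutes') AS s;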
I'm an Excel user trying to solve this one problem, and the only efficient way I can think of is to do it with a database. I use arrays in VBA/Python programming and I've queried databases before, but never really designed one. So I'm here to look for suggestions on how to structure this DB in Access.
Anyway, I currently maintain a sheet of ~50 economic indicators for ~100 countries. It's a very straightforward sheet, with
Column headers: GDP , Unemployment , Interest Rate, ... ... ... Population
And Rows:
Argentina
Australia
...
...
Yemen
Zambia
etc.
I want to take snapshots of this sheet so I can see trends and run some analysis in the future. I thought of just duplicating the worksheet in Excel each time, but that feels inefficient.
I've never designed a database before. My question is: what's the most efficient way to store this data for chronological snapshots? In the future I will probably do these things:
Pull up a snapshot for day mm-dd-yy in the past.
Take two different data points of a metric, for however many countries, and track the change/rate of change etc.
Once I can query them well enough I'll probably do some kind of statistical analysis, which just requires getting the right data set.
I feel like I need to create an individual table for each country and add a row to every country table every time I take a snapshot. I'll try to play with VBA to automate this.
I can't think of any other way to do this with fewer tables. What would you suggest? Is it recommended practice to use more than a dozen tables for this task?
There are a couple of ways of doing this:
Option 1
I'd suggest you probably only need a single table, something akin to:
Country, date_of_snapshot, columns 1-50 (GDP etc.)
Effectively you would add a new row for each day and each country.
Option 2
You could also use a table structured as below, though this would require more complex queries, which may be too much for Access:
Country, date_of_snapshot, factor, value
with each factor (GDP etc.) getting a row for each date and country
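As a rough sketch of both options (names and types are illustrative, and the DDL may need minor adjustment for Access):

-- Option 1: one wide row per country per snapshot.
CREATE TABLE CountrySnapshot (
    Country        VARCHAR(100) NOT NULL,
    DateOfSnapshot DATETIME     NOT NULL,
    GDP            DOUBLE,
    Unemployment   DOUBLE,
    InterestRate   DOUBLE,
    -- ... the remaining indicator columns ...
    CONSTRAINT PK_CountrySnapshot PRIMARY KEY (Country, DateOfSnapshot)
);

-- Option 2: one narrow row per country, snapshot date and indicator.
CREATE TABLE CountryIndicator (
    Country        VARCHAR(100) NOT NULL,
    DateOfSnapshot DATETIME     NOT NULL,
    Factor         VARCHAR(50)  NOT NULL,
    IndicatorValue DOUBLE,
    CONSTRAINT PK_CountryIndicator PRIMARY KEY (Country, DateOfSnapshot, Factor)
);

-- Change in GDP per country between two snapshots, using the Option 2 layout;
-- [FirstSnapshot] and [SecondSnapshot] would be prompted as parameters in Access.
SELECT a.Country, b.IndicatorValue - a.IndicatorValue AS GdpChange
FROM CountryIndicator AS a
INNER JOIN CountryIndicator AS b
        ON b.Country = a.Country AND b.Factor = a.Factor
WHERE a.Factor = 'GDP'
  AND a.DateOfSnapshot = [FirstSnapshot]
  AND b.DateOfSnapshot = [SecondSnapshot];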
We make software for managing participants in grants given to non-profits. (For example, if your family needs food stamps, then that office must somehow track your family and report to the state.)
Up until now we have been focused on one particularly complex grant. We are now wanting to expand to other grants. Our first goal was a fairly simplistic grant. The code for it was just piled onto the old application. Now we have decided the best course of action is to separate the two programs (because not all of our clients have both grants). This sounds easy in theory.
We can manage the code complexity brought about by this pretty easily with patches and SVN's merge functionality. The thing that is significantly harder is that the two applications share the same database. The two grants share a few tables and a few procedures, but this is a rather large legacy database (more than 40 tables, hundreds of stored procedures).
What exactly is the best way to keep these two databases separate, but still sharing their common elements? We are not concerned about conflicts between the two applications writing to the same DB (we have locks for that), but rather about schema conflicts in development, updating our clients' servers, and managing the complexity.
We have a few options we've thought of:
Using Schemas (shared, grant1, grant2)
Using prefixed names
The book SQL Antipatterns had a solution to this sort of thing. Sharing common data is fine; just move the extended data into a separate table, much like extending a class in OOP.
Persons Table
--------------------------
PersonID LastName FirstName
FoodStamps Table (From Application 1)
--------------------------
PersonID FoodStampAllotment
HousingGrant Table (From Application 2)
--------------------------
PersonID GrantAmount
You could add prefixes to the extended tables if you want.
This query will get me just people in the FoodStamps program:
SELECT * FROM Persons
JOIN FoodStamps
ON FoodStamps.PersonID = Persons.PersonID
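For illustration, the table definitions behind that example might look something like this (the column types are assumed); note that PersonID in each extension table is both its primary key and a foreign key back to Persons, which is what gives it the class-extension feel:

-- Sketch of the shared table plus the two per-application extension tables.
CREATE TABLE Persons (
    PersonID  INT          PRIMARY KEY,
    LastName  VARCHAR(100) NOT NULL,
    FirstName VARCHAR(100) NOT NULL
);

CREATE TABLE FoodStamps (      -- extension owned by application 1
    PersonID           INT PRIMARY KEY REFERENCES Persons (PersonID),
    FoodStampAllotment DECIMAL(10, 2)
);

CREATE TABLE HousingGrant (    -- extension owned by application 2
    PersonID    INT PRIMARY KEY REFERENCES Persons (PersonID),
    GrantAmount DECIMAL(10, 2)
);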
I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1) Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let's forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType)? Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
You might want to try a single Client table with four date columns to handle the two temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with, and I would try it and see how it scales before going with something more complicated.
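A minimal sketch of that single-table idea, with a point-in-time query (names, types and the @-placeholders are assumptions, not a finished design):

-- Both temporal dimensions live on the same row; the end dates are set to a
-- far-future value (e.g. '9999-12-31') while a row is still current.
CREATE TABLE Client (
    client_id           INT          NOT NULL,
    name                VARCHAR(200) NOT NULL,
    client_type         VARCHAR(50),
    relationship_type   VARCHAR(50),
    relationship_status VARCHAR(50),
    valid_dt_start      DATE     NOT NULL,  -- when the fact became true in the real world
    valid_dt_end        DATE     NOT NULL,
    audit_dt_start      DATETIME NOT NULL,  -- when the row was recorded
    audit_dt_end        DATETIME NOT NULL   -- when the row was superseded
);

-- "As we knew it at @audit_date, what was the client's state on @valid_date?"
SELECT *
FROM Client
WHERE client_id = @client_id
  AND @valid_date >= valid_dt_start AND @valid_date < valid_dt_end
  AND @audit_date >= audit_dt_start AND @audit_date < audit_dt_end;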
We have an SQL Server DB design-time scenario: we have to store data about different organizations in our database (i.e. Customer, Vendor, Distributor, ...). All the different organizations share (almost) the same type of information, like address details, etc., and they will be referred to in other tables (i.e. linked via OrgId, and we have to look up OrgName in many different places).
I see two options:
We create a table for each organization, like OrgCustomer, OrgDistributor, OrgVendor, etc. All the tables will have a similar structure, and some tables will have extra special fields, e.g. the customer has a field HomeAddress (which the other Org tables don't have), and vice versa.
We create a common OrgMaster table and store ALL the different Orgs in a single place. The table will have an OrgType field to distinguish among the different types of Orgs, and the special fields will be appended to the OrgMaster table (only the relevant Org records will have values in such fields; in other cases it'll be NULL).
Some Pros & Cons of #1:
PROS:
It helps distribute the load when accessing different types of Org data, so I believe this improves performance.
Provides full scope for customizing any particular Org table without affecting the other existing Org types.
Not sure if different indexes on different/distributed tables work better than a single big table.
CONS:
Replication of design. If I have to increase the size of the ZipCode field, I have to do it in ALL the tables.
Replication in the manipulation implementation (i.e. we've used stored procedures for CRUD operations, so the replication goes n-fold: 3-4 Insert SPs, 2-3 SELECT SPs, etc.)
Everything grows n-fold, right from DB constraints/indexing to SPs to the business objects in the application code.
A (common) change in one place has to be made in all the other places as well.
Some Pros & Cons of #2:
PROS:
N-fold becomes 1-fold :-)
Maintenance gets easy because we can try and implement single entry points for all the operations (i.e. a single SP to handle CRUD operations, etc..)
We only have to worry about maintaining a single table. Indexing and other optimizations are limited to a single table.
CONS:
Does it create a bottleneck? Can it be managed by implementing views and other optimized data access strategies?
The other side of a centralized implementation is that a single change has to be tested and verified in ALL the places. It isn't abstract.
The design might seem a little less 'organized/structured', esp. due to those few Orgs for which we need to add 'special' fields (which are irrelevant to the other tables).
I also have in mind an Option #3 - keep the Org tables separate but create a common OrgAddress table to store the common fields. But this puts me in the middle of #1 & #2 and is creating even more confusion!
To be honest, I'm an experienced programmer but not an equally experienced DBA, because that's not my mainstream job, so please help me work out the correct tradeoff between parameters like design complexity and performance.
Thanks in advance. Feel free to ask any technical queries; suggestions are welcome.
Hemant
I would say that your 2nd option is close, just a few points:
Customer, Distributor, Vendor are TYPES of organizations, so I would suggest:
A table [Organization] which has all columns common to all organizations and a primary key for the row.
Separate tables [Vendor], [Customer], [Distributor] with specific columns for each one and an FK to the [Organization] row's PK.
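A rough sketch of that layout (column names are illustrative, with HomeAddress borrowed from the question):

-- Supertype table holds everything common to all organizations.
CREATE TABLE Organization (
    OrgId   INT IDENTITY PRIMARY KEY,
    OrgName VARCHAR(200) NOT NULL,
    Address VARCHAR(500)
    -- ... other columns shared by every kind of organization ...
);

-- Subtype tables hold only the type-specific columns; the PK is also an FK.
CREATE TABLE Customer (
    OrgId       INT PRIMARY KEY REFERENCES Organization (OrgId),
    HomeAddress VARCHAR(500)
);

CREATE TABLE Vendor (
    OrgId INT PRIMARY KEY REFERENCES Organization (OrgId)
    -- vendor-specific columns
);

CREATE TABLE Distributor (
    OrgId INT PRIMARY KEY REFERENCES Organization (OrgId)
    -- distributor-specific columns
);

Other tables keep referencing OrgId on [Organization], so looking up OrgName stays in one place regardless of the organization's type.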
This sounds like a "supertype/subtype relationship".
I have worked on various applications that have implemented all of your options. To be honest, you probably need to take account of the way that your users work with the data, how many records you are expecting, commonality (same organisation having multiple functions), and what level of updating of the records you are expecting.
Option 1 worked well in an app where there was very little commonality. I have used what is effectively your option 3 in an app where there was more commonality, and didn't like it very much (there is more work involved in getting the data from different layers all of the time). A rewrite of this app is implementing your option 2 because of this.
HTH