Is it ok to have 2 dimensions that are the same but one is less deep? - data-modeling

I have a fact table with an account number and some associated numeric measures.
I have my DimAccount, which has a very long hierarchy: level1, sub-level2, … all the way down to sub-level20.
When reporting in Power BI this makes it very hard to navigate.
My requirement is to have a sort of different/new DimAccount that is less deep (it will be similar to DimAccount but with a different grouping).
So I want to create a different mapping. Where should this be done?
In the backend?
Having some sort of DimAccount2 with fewer levels, or
creating a new table? Perhaps a mapping table, where I just map sub-levels to a shallower hierarchy?
Or should this be handled in the cube/Power BI, creating DAX measures that do the mapping manually there?
I am not sure where/how to do it. My goal is to have a DimHighLevelAccount, but it is not as simple as just removing sub-levels; the mapping will also be different. Perhaps I group some categories from level5, 6 and 7 together, for example.

Power BI always has its own data model (called a "dataset" in Power BI docs), derived in this case from the data model in your data warehouse. And the Power BI data model has some modeling capabilities that your DW does not have.
So the Power BI data model should load/expose only the tables and columns from your data warehouse that are useful for the use case (you may have a handful of different Power BI datasets for the same DW tables), and then add additional modeling, like adding measures, hiding columns, and declaring hierarchies.
So in this case, keep a single Account dimension table, but when you bring it into Power BI, leave out the hierarchy levels that you don't want, add the remaining ones to a Hierarchy, and hide the individual levels from the report view, so the report developer sees a single hierarchical property.
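If you prefer to do the trimming before the data reaches the dataset (in the import query or a warehouse view), the same idea can be sketched in SQL. This is only an illustration; the actual DimAccount column names and regrouping rules are assumptions:

```sql
-- Hypothetical view (or import query) that exposes only a few levels of
-- DimAccount and regroups some Level5 categories into a shallower hierarchy.
CREATE VIEW dbo.DimAccountHighLevel AS
SELECT
    AccountKey,
    AccountNumber,
    Level1 AS AccountGroup,
    CASE
        WHEN Level5 IN ('CategoryA', 'CategoryB', 'CategoryC')  -- assumed values
            THEN 'CombinedCategory'        -- e.g. merge a few categories into one group
        ELSE Level5
    END AS AccountCategory
FROM dbo.DimAccount;
```

In Power BI you would then declare AccountGroup > AccountCategory as a Hierarchy and hide the individual columns.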

Related

Ideal data type / structure / model for storing device data with different parameters / attributes in snowflake

We are in the process of designing a dimensional data model in Snowflake to store data from different devices (from solar and wind plants) for reporting/analytical purposes. The data currently resides in InfluxDB as time series data. One of the challenges in designing the target DB model is that different devices emit data for different parameters (even though the devices share a superset of parameters, it can vary, and chances are that new parameters will be added to the superset).
One of the key asks is to not have any development effort (coding) when new parameters/devices are added, hence the model and design need the flexibility to store the data accordingly, with only configuration changes. Following are the options:
Option 1: Create wide fact tables with all the superset parameters and store NULLs for devices which do not send the data.
Pros: Less data volume compared to option 2.
Cons: a) Will require some effort when new parameters are added.
b) Depending on the reporting tool (which will be mostly custom built and not a BI tool), selecting data for different parameters might not be as straightforward as using a WHERE clause based on the needed parameters.
Option 2: Create narrow fact tables; the parameter becomes a dimension table alongside the other dimensions, referenced by an ID column, and the value is stored in a single column.
Pros: a) No effort/schema changes when new parameters are added.
b) Ease of selecting and filtering data based on the selected parameters.
Cons: a) Data volume - there are thousands of devices with multiple parameters under them, so it will reach approximately 90M records (~1GB) per day (the base data itself is huge, and the unpivot would increase the record count dramatically).
b) Performance considerations due to the increased data volume, especially while querying data.
Option 3: Use Snowflake's support for semi-structured data. The OBJECT data type seems to be a good fit; the parameter name and value can be stored as a key-value pair.
Pros: a) No effort/schema changes when new parameters are added.
b) Data volume is not increased.
c) Ease of selecting and filtering data using the functions provided by SQL - is this true? Based on the documentation, querying looks straightforward, especially for the OBJECT data type, but I need confirmation.
Cons: a) Performance considerations due to the usage of semi-structured data types - the documentation mentions that the VARIANT data type stores the data in columnar format wherever possible (data remains as JSON where it cannot be converted), but there is no mention of the OBJECT data type and how it is handled. So I want to understand whether this will have a considerable performance impact or not.
So, considering the above, what would be the ideal way to store this kind of data, where the structure changes dynamically based on the device?
Option 3 is my favorite for laziness, cost, and performance reasons:
Snowflake uses the same storage approach for OBJECTs and VARIANTs: storage will be optimized for columnar access as long as your underlying object/variant is well suited for it. This means good performance and compression.
Object/variant will need the least maintenance when adding new parameters.
But option 1 has some advantages too:
It's a good idea for governance to know all your columns and their purposes.
3rd party tools understand navigating columns much better than figuring out objects.
Then you could have a great mix of 3+1:
Store everything as object/variant.
Create a view that parses the object and names columns for 3rd party tools.
When new fields are added, you will just need to update the definition of the view.
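A minimal Snowflake sketch of that 3+1 mix (table, column, and parameter names here are assumptions, not part of the original design): raw readings land in a semi-structured column, and a view exposes the parameters downstream tools care about as named columns.

```sql
-- Readings table with a semi-structured column holding per-device parameters.
CREATE OR REPLACE TABLE device_readings (
    device_id  VARCHAR,
    reading_ts TIMESTAMP_NTZ,
    params     VARIANT   -- e.g. {"wind_speed": 7.2, "power_kw": 812.5}
);

-- View that parses the object into named columns for 3rd-party tools.
-- When a new parameter appears, only this view definition needs updating.
CREATE OR REPLACE VIEW device_readings_flat AS
SELECT
    device_id,
    reading_ts,
    params:wind_speed::FLOAT AS wind_speed,
    params:power_kw::FLOAT   AS power_kw
FROM device_readings;
```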

Using Doctrine 2 for large data models

I have a legacy in-house human resources web app that I'd like to rebuild using more modern technologies. Doctrine 2 is looking good. But I've not been able to find articles or documentation on how best to organise the Entities for a large-ish database (120 tables). Can you help?
My main problem is the Person table (of course! it's an HR system!). It currently has 70 columns. I want to refactor that to extract several subsets into one-to-one sub tables, which will leave me with about 30 columns. There are about 50 other supporting one-to-many tables called person_address, person_medical, person_status, person_travel, person_education, person_profession etc. More will be added later.
If I put all the Doctrine associations (http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/working-with-associations.html) in the Person entity class along with the set/get/add/remove methods for each, the original 30 columns and their methods, and some supporting utility functions, then the Person entity is going to be 1000+ lines long and a nightmare to test.
FWIW I plan to create a PersonRepository to handle the common bulk queries, a PersonProfessionRepository for the bulk queries/reports on that sub-table, etc., and Person*Services which will contain some of the more complex business logic where needed. So organising the rest of the app logic is fine: this is a question about how to correctly organise lots of sub-table Entities with Doctrine that all have relationships/associations back to one primary table. How do I avoid bloating out the Person entity class?
Identifying types of objects
It sounds like you have a nicely normalized database and I suggest you keep it that way. Removing columns from the people table to create separate tables for one-to-one relations isn't going to help in performance nor maintainability.
The fact that you recognize several groups of properties in the Person entity might indicate you have found cases for a Value Object. Even some of the one-to-many tables (like person_address) sound more like Value Objects than Entities.
Starting with Doctrine 2.5 (which is not yet stable at the time of this writing), it will support embedding single Value Objects. Unfortunately we will have to wait for a future version for support of collections of Value Objects.
Putting that aside, you can mimic embedding Value Objects; Ross Tuck has blogged about this.
Lasagna Code
Your plan of implementing an entity, repository, service (and maybe controller?) for Person, PersonProfession, etc sounds like a road to Lasagna Code.
Without extensive knowledge about your domain, I'd say you want to have an aggregate Person, of which the Person entity is the aggregate root. That aggregate needs a single repository. (But maybe I'm off here and being simplistic, as I said, I don't know your domain.)
Creating a service for Person (and other entities / value objects) indicates data-minded thinking. For services it's better to think of behavior. Think of what kinds of tasks you want to perform, and group coherent sets of tasks into services. I suspect that for an HR system you'll end up with many services that revolve around your Person aggregate.
Is Doctrine 2 suitable?
I would say: yes. Doctrine itself has no problems with large amounts of tables and large amounts of columns. But performance highly depends on how you use it.
OLTP vs OLAP
For OLTP systems an ORM can be very helpful. OLTP involves many short transactions, writing a single aggregate (or a short list of aggregates) to the database.
For OLAP systems an ORM is not well suited. OLAP involves many complex analytical queries, usually resulting in large object graphs. For these kinds of operations, native SQL is much more convenient.
Even in case of OLAP systems Doctrine 2 can be of help:
You can use DQL queries (instead of native SQL) to use the power of your mapping metadata, then use scalar or array hydration to fetch the data.
Doctrine also supports arbitrary joins, which means you can join entities that are not associated with each other according to the mapping metadata.
And you can make use of the NativeQuery object with which you can map the results to whatever you want.
I think a HR system is a perfect example of where you have both OLTP and OLAP. OLTP when it comes to adding a new Person to the system for example. OLAP when it comes to various reports and analytics.
So there's nothing wrong with using an ORM for transactional operations, while using plain SQL for analytical operations.
Choose wisely
I think the key is to carefully choose when to use what, on a case by case basis.
Hydrating entities is great for transactional operations. Make use of lazy loading associations which can prevent fetching data you're not going to use. But also choose to eager load certain associations (using DQL) where it makes sense.
Use scalar or array hydration when working with large data sets. Data sets usually grow where you're doing analytical operations, where you don't really need full blown entities anyway.
@Quicker makes a valid point by saying you can create specialized View objects: you can fetch only the data you need in specific cases and manually mold that data into objects. This goes together with his point about not bloating the user interface with options a user with a certain role doesn't need.
A technique you might want to look into is Command Query Responsibility Segregation (CQRS).
I understood that you have a fully normalized persons table and you are now asking how best to split it up.
As long as you do not hit any technical constraints (such as a 64 KB maximum), I find 70 columns definitely not overloaded for a persons table in an HR system. Do yourself a favour and do not segment that information, for the following reasons:
selects potentially become more complex
each extracted table needs one or more extra indexes, which increases your overall memory utilization -> this sounds like a minor issue as disk is cheap, but keep in mind that, via caching, the RAM-to-disk-space utilization ratio determines your performance to a huge extent
changes become more complex, as extra relations demand extra care
since any edit/update/read view can be restricted to deal only with slices of the physical data in the table, no "cosmetic" pressure arises from the end-user (or even admin) perspective
In summary, the table subsetting causes lots of issues and effort but adds little if any value.
By the way, databases are optimized for data storage. Millions of rows and some dozens of columns are no-brainers at that end.

Is it really a standard to always include a Fact table into creating an OLAP Cube using SSAS?

Can I create a Cube by just combining Dimension Tables? Let's say I don't have a Fact table from the data sources and I want to create an OLAP cube out of the Dimensions that I have in the Database Tables sources.
I am studying SSAS on my own, and I saw from these examples (links below) that building an OLAP cube requires a Fact table.
http://www.codeproject.com/Articles/658912/Create-First-OLAP-Cube-in-SQL-Server-Analysis-Serv
http://www.slideshare.net/PeterGfader/07-olap-5287936
The links above are very informative and helpful. However, I don't see any Fact table in my data sources that I could link to the Dimension tables.
As JaneD said in a comment, I have also never heard of a cube without a fact table. The fact table(s) should contain the numbers you want to play with/analyse/report on: your measures. The dimensions are used to slice and dice these measures, in order to see, for example, the number of sold items (a fact) in Europe (dimension) in 2012 (dimension).
If you only have regions and time, as in my example, you won't be able to do much with it. And if you're trying to learn SSAS you really should try to find/create a fact table so you can explore more of SSAS.
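To make that concrete, here is a minimal sketch (table and column names are made up) of the sold-items/region/year example as a star schema; the cube's measure comes from the fact table and the dimensions slice it:

```sql
-- Hypothetical minimal star schema for "number of sold items in Europe in 2012".
CREATE TABLE DimRegion (
    RegionKey  INT PRIMARY KEY,
    RegionName VARCHAR(50)        -- e.g. 'Europe'
);

CREATE TABLE DimDate (
    DateKey      INT PRIMARY KEY, -- e.g. 20120315
    CalendarYear INT              -- e.g. 2012
);

CREATE TABLE FactSales (
    RegionKey INT REFERENCES DimRegion (RegionKey),
    DateKey   INT REFERENCES DimDate (DateKey),
    ItemsSold INT                 -- the measure the cube aggregates
);
```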

Mutually exclusive facts. Should I create a new dimension in this case?

There is a star schema that contains 3 dimensions (Distributor, Brand, SaleDate) and a fact table with two fact columns: SalesAmountB measured in boxes as the integer type and SalesAmountH measured in hectolitres as the numeric type. The end user wants to select which fact to show in a report. The report is going to be presented via SharePoint 2010 PPS.
So help me please determine which variant is suitable for me the most:
1) Add a new dimension like "Units" with two values Boxes, Hectolitres and use the in-built filter for this dim. (The fact data types are incompatible though)
2) Make two separate tables for the two facts and build two cubes. Then select either as the datasource.
3) Leave the model as it is and use the PPS API in SharePoint to select the fact to show.
So any ideas?
I think the best way to implement this is to keep separate fields for SalesAmountB and SalesAmountH in the fact table, then create two separate measures in BIDS and control their visibility through MDX. By doing this, you avoid the complexity of duplicating the whole data set or even creating separate cubes.
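For illustration, a sketch of the fact table layout this answer keeps (the measure names come from the question; the key names are assumptions): both unit-of-measure columns stay in one fact table, and the choice of which measure to display is handled in the cube rather than in the schema.

```sql
-- One fact table, two measure columns; visibility is controlled in BIDS/MDX.
CREATE TABLE FactSales (
    DistributorKey INT            NOT NULL,
    BrandKey       INT            NOT NULL,
    SaleDateKey    INT            NOT NULL,
    SalesAmountB   INT            NULL,   -- boxes
    SalesAmountH   NUMERIC(18, 2) NULL    -- hectolitres
);
```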

SQL-Server DB design time scenario (distributed or centralized)

We have a SQL Server DB design-time scenario: we have to store data about different organizations in our database (i.e. Customer, Vendor, Distributor, ...). All the different organizations share (almost) the same type of information, like address details, etc., and they will be referenced in other tables (i.e. linked via OrgId, and we have to look up OrgName in many different places).
I see two options:
We create a table for each organization type, like OrgCustomer, OrgDistributor, OrgVendor, etc. All the tables will have a similar structure, and some tables will have extra special fields, e.g. the customer has a HomeAddress field (which the other Org tables don't have), and vice versa.
We create a common OrgMaster table and store ALL the different Orgs in a single place. The table will have an OrgType field to distinguish among the different types of Orgs, and the special fields will be appended to the OrgMaster table (only the relevant Org records will have values in such fields; in other cases they will be NULL).
Some Pros & Cons of #1:
PROS:
It helps distribute the load when accessing different types of Org data, so I believe this improves performance.
Provides full scope for customizing any particular Org table without affecting the other existing Org types.
Not sure if different indexes on different/distributed tables work better than a single big table.
CONS:
Replication of design. If I have to increase the size of the ZipCode field, I have to do it in ALL the tables.
Replication in the manipulation implementation (i.e. we've used stored procedures for CRUD operations, so the replication goes n-fold: 3-4 INSERT SPs, 2-3 SELECT SPs, etc.).
Everything grows n-fold right from DB constraints\indexing to SP to the Business objects in the application code.
Change(common) in one place has to be made at all the other places as well.
Some Pros & Cons of #2:
PROS:
N-fold becomes 1-fold :-)
Maintenance gets easy because we can try and implement single entry points for all the operations (i.e. a single SP to handle CRUD operations, etc..)
We only have to worry about maintaining a single table. Indexing and other optimizations are limited to a single table.
CONS:
Does it create a bottleneck? Can it be managed by implementing Views and other optimized data access strategy?
The other side of a centralized implementation is that a single change has to be tested and verified in ALL the places; the change isn't isolated.
The design might seem a little less 'organized/structured', especially due to those few Orgs for which we need to add 'special' fields (which are irrelevant to the other types).
I also got in mind an Option#3 - keep the Org tables separate but create a common OrgAddress table to store the common fields. But this gets me in the middle of #1 & #2 and it is creating even more confusion!
To be honest, I'm an experienced programmer but not an equally experienced DBA, because that's not my mainstream job, so please help me work out the right trade-off between parameters like design complexity and performance.
Thanks in advance. Feel free to ask for any technical queries & suggestions are welcome.
Hemant
I would say that your 2nd option is close; just a few points:
Customer, Distributor, Vendor are TYPES of organizations, so I would suggest:
Table [Organization] which has all columns common to all organizations and a primary key for the row.
Separate tables [Vendor], [Customer], [Distributor] with specific columns for each one and FK to the [Organization] row PK.
This sounds like a "supertype/subtype relationship".
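A minimal sketch of that supertype/subtype layout (OrgId, OrgName, and HomeAddress come from the question; the rest are assumptions): the common attributes live in [Organization], and each subtype table carries only its specific columns plus a FK back to the supertype's primary key.

```sql
-- Supertype: everything common to all organization types.
CREATE TABLE Organization (
    OrgId   INT IDENTITY PRIMARY KEY,
    OrgName VARCHAR(100) NOT NULL,
    Address VARCHAR(200) NULL
);

-- Subtypes: one row per organization of that type, keyed to the supertype.
CREATE TABLE Customer (
    OrgId       INT PRIMARY KEY REFERENCES Organization (OrgId),
    HomeAddress VARCHAR(200) NULL        -- example of a customer-only field
);

CREATE TABLE Vendor (
    OrgId INT PRIMARY KEY REFERENCES Organization (OrgId)
    -- vendor-specific columns go here
);

CREATE TABLE Distributor (
    OrgId INT PRIMARY KEY REFERENCES Organization (OrgId)
    -- distributor-specific columns go here
);
```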
I have worked on various applications that have implemented all of your options. To be honest, you probably need to take account of the way that your users work with the data, how many records you are expecting, commonality (same organisation having multiple functions), and what level of updating of the records you are expecting.
Option 1 worked well in an app where there was very little commonality. I have used what is effectively your option 3 in an app where there was more commonality, and didn't like it very much (there is more work involved in getting the data from different layers all of the time). A rewrite of this app is implementing your option 2 because of this.
HTH
