Data sets for data warehouse research

I am working on a data warehouse project for my final-year degree. My intention is to create a data warehouse within the Business Intelligence spectrum that incorporates customer information from a company's point-of-sale data, for example for a given year or location. However, my problem is finding the right raw data for this purpose, as I have been in contact with dozens of companies who refused to share their data!
Is there any source of large, free data sets I can use to fit the above scenario, i.e. for Business Intelligence purposes?
Any help would be greatly appreciated!

Related

How to handle Manual Input within your Data Warehouse

I recently joined an organisation where we have introduced a Data Warehouse solution (Snowflake) that integrates a large number of external systems (CRM, etc.). There are use cases that require manual data input on a weekly basis (e.g. sales targets). This is one area I am having trouble with.
In an ideal world, all systems would perfectly integrate and form the core data within the DW.
But the reality is that we will likely need to keep some manual data input to create a complete picture (at least until we can find a way around it long term).
So far I have thought of using Excel/Google Sheets for manual entry, feeding a backend service that populates DB tables on the staging server.
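To make this a bit more concrete, here is a minimal sketch of the kind of staging table I'm imagining in Snowflake (the table and column names are placeholders, not our real schema):

-- Hypothetical staging table for weekly manual input; all names are placeholders
CREATE TABLE IF NOT EXISTS staging.manual_sales_targets (
    target_week    DATE           NOT NULL,  -- week the target applies to
    region         VARCHAR(50)    NOT NULL,  -- business unit/region the target covers
    sales_target   NUMBER(18,2)   NOT NULL,  -- the manually entered figure
    entered_by     VARCHAR(100)   NOT NULL,  -- who keyed the value in (audit)
    entered_at     TIMESTAMP_NTZ  DEFAULT CURRENT_TIMESTAMP(),  -- when it was keyed in (audit)
    source_file    VARCHAR(255)              -- the sheet/file the row came from
);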
Does anyone here have experience with this scenario? How do users of a data platform typically handle it, and what are good practices for handling manual data entry into a Data Warehouse solution?
Any help you can provide here would be greatly appreciated.

When a data model becomes a data nightmare

I am revising a data model for an educational company. So not your straightforward retail or finance model.
The problem is that the company wants to use the data warehouse to produce listing reports, the kind of reports that should normally be produced by the Education Management System (EMS).
Reports like class lists, detailed learner information, guardian contact/work/financial information, and academic reports.
My argument thus far has been that a data warehouse houses an analytical data model used for data analytics, not a reporting model for an education management system with a far more complex relational database.
The current model has snowflaked completely out of control, in such a way that reporting and analytical tools are struggling to interpret it. The warehouse is starting to resemble the RDBMS model more and more, with so many relationships between dimensions just to keep related data together.
Some of the tables contain so many unnecessary attributes that have no analytical value and exist purely for a listing report.
I need some opinions/criticism (ideally with good references) regarding this approach so I can better explain the problem to a business that is oblivious to the concept of data modelling. I need to make them understand that butchering the DW to handle detailed reporting is going to end badly for them.

How to Organize a Messy Database

I know there is no easy answer to this question, but how do I clean up a database with no relationships, no foreign keys, and not a whole lot of structure?
I'm an amateur at SQL, and I've inherited a database that is a complete mess. We have no sort of referential integrity, and there's not a whole lot of logic to how the tables work together.
My database is all data that comes from a warehouse that builds servers.
To give you an idea of the type of data I'm working with:
EDI from customers
Raw output from server projects
Sales information
Site information
Parts lists
I have been prioritizing Raw output and EDI information, and generating reports with that information using SSRS. I have learned a lot about SQL Server and the BI Microsoft tools (SSIS and SSRS) in my short time doing this. However, I'm still an amateur and I want to build a solid database that flows well and can stand on its own.
It seems like a data warehouse model is the type of structure I should adopt.
My question is: how do I take my mess of a database and make something more organized before I drown in data?
Since your end goal appears to be business reporting, and you're dealing with data from multiple sources made up of "isolated" tables, I would advise you to start by consolidating all of that into a data model.
Personally, I would design a dimensional model to structure and store all that data, with the goal of being easy to understand (for reporting or ad hoc querying). The model should be focused on business entities and their transactions. In a dimensional model, the business entities will (almost always) be the dimensions and the transactions (the metrics) will be the facts. For example, without knowing your model, I'm guessing that the immediate entities would include Customer, Site, and Part, and the transactions would include ServerSale, SiteVisit, PartPurchase, PartRepair, PartOrder, etc.
There is plenty more written about dimensional modelling, but I suggest going straight to the source: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
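As an illustration only, here is a minimal sketch of what a couple of those tables might look like in SQL Server; every table and column name below is a guess based on your description, not your actual schema:

-- Hypothetical star-schema sketch: two dimensions and one fact table
CREATE TABLE DimCustomer (
    CustomerKey   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key owned by the warehouse
    CustomerCode  VARCHAR(50)  NOT NULL,          -- business key from the source system
    CustomerName  VARCHAR(200) NOT NULL,
    Region        VARCHAR(100) NULL
);

CREATE TABLE DimPart (
    PartKey     INT IDENTITY(1,1) PRIMARY KEY,
    PartNumber  VARCHAR(50)  NOT NULL,
    PartName    VARCHAR(200) NOT NULL,
    Category    VARCHAR(100) NULL
);

-- Grain: one row per part shipped on a server sale
CREATE TABLE FactServerSale (
    CustomerKey  INT NOT NULL REFERENCES DimCustomer (CustomerKey),
    PartKey      INT NOT NULL REFERENCES DimPart (PartKey),
    SaleDate     DATE NOT NULL,
    Quantity     INT NOT NULL,
    SaleAmount   DECIMAL(18,2) NOT NULL
);

The point is simply that the dimensions carry the descriptive attributes while the fact table carries the measurable events at a clearly stated grain.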
Once your model is designed (and implemented in a database such as SQL Server), you then load data into it by extracting it from the different source systems/databases and transforming it from its current structure into the structure defined by the model, typically using an ETL tool like MS Integration Services. For example, your customer data may be scattered across the "sales", "customer" and "site" tables, so you want to consolidate all that data and load it into a single Customer dimension table. It is during this ETL that you should check your data for the problems you already mentioned, loading correct rows into your data model and discarding incorrect rows into a file/log where they can later be checked and corrected (there are multiple ways to address this).
A straightforward tutorial to get started on doing ETL using SSIS can be found at https://technet.microsoft.com/en-us/library/jj720568(v=sql.110).aspx
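As a rough illustration of that "transform, load, and discard bad rows" step (again, every table and column name here is made up for the example), the SQL inside such a job might look something like this:

-- Hypothetical consolidation of customer data into the Customer dimension;
-- rows failing a basic quality rule go to a reject table for later review.
INSERT INTO DimCustomer (CustomerCode, CustomerName, Region)
SELECT DISTINCT
       c.cust_code,
       c.full_name,
       st.region
FROM   src_customer AS c
LEFT   JOIN src_site AS st ON st.cust_code = c.cust_code
WHERE  c.cust_code IS NOT NULL
  AND  c.full_name IS NOT NULL;           -- simple quality rules: key and name required

INSERT INTO etl_rejects (source_table, reject_reason, source_id)
SELECT 'src_customer', 'missing customer code or name', c.id
FROM   src_customer AS c
WHERE  c.cust_code IS NULL
   OR  c.full_name IS NULL;               -- keep the bad rows somewhere they can be fixed

In SSIS, each of those statements would typically become a data flow or an Execute SQL task, scheduled to run together as one package.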
So, to sum up, you should build a data mart:
design a dimensional model that represents the business facts and the context of the data you have. This will strongly facilitate both data understanding and reporting, because a dimensional model closely matches business users' terminology and mental models.
use an ETL tool to extract the data from its current sources, process it (e.g. check for data quality problems, join data from different sources) and load it into the dimensional model. This will get you close to an automated data integration job/pipeline with whatever quality checks you deem fit for the data.

How can a community organization organize data and streamline simple analyses?

I work for a research organization in India and have recently started working with a program extending immunizations among poor rural communities. They're a fairly large organization but don't really have any IT infrastructure. Data reports on vaccine coverage, logistical questions, meeting attendance, etc. come from hundreds of villages, going from pen-and-paper through several iterations of data entry and compilation, and finally arriving each month at the central office as HUNDREDS of messy Excel sheets. The organization generally needs nothing more than simple totals and proportions from a large series of indicators, but doctors and high-level professionals are left spending days summing the sheets by hand, introducing lots of error and generally wasting a ton of time. I threw in some formulas and at least automated the process within single sheets, but the compilation and cross-referencing are still an issue.
There's not much to be done at the point of data collection...obviously it would be great to implement some system at the point of entry, but that would involve training hundreds of officials and local health workers; not practical at the moment.
My question: what can be done with the stack of Excel sheets every month so we can analyze them individually and also holistically? Is there any type of management app or simple database we can build to upload and compile the data for easy analysis in R or even (gasp) Excel? What kind of tools could I implement and then pass on to some relative technophobes? Can we house it all online?
I'm by no means a programmer, but I'm an epidemiologist/stats analyst proficient in R, Google products, and the general tools of a not-so-tech-averse millennial. I'd be into using this as an opportunity to learn some MySQL or similar, but need some guidance. Any ideas are appreciated... there has to be a better way!
A step-by-step approach would be to first store the raw data from the Excel sheets and papers in a structured database. Once the data is maintained in a DB, you will have many tools to manipulate it later.
Use any database like MySQL to store the Excel data; Excel sheets or CSV files can be imported into the database directly.
Later, with simple database operations, you can manipulate the data; you can use reports, a web application, etc. to display and manage it.
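For example, once a month's sheets are saved as CSV, loading one into a MySQL table can be as simple as the following (the table, columns, and file path here are purely illustrative):

-- Illustrative table for the monthly figures
CREATE TABLE IF NOT EXISTS coverage_reports (
    report_month    DATE         NOT NULL,
    village         VARCHAR(100) NOT NULL,
    indicator       VARCHAR(100) NOT NULL,   -- e.g. 'measles_dose_1'
    reported_count  INT          NOT NULL
);

-- Load one exported CSV file into the table
LOAD DATA LOCAL INFILE '/path/to/monthly_report.csv'
INTO TABLE coverage_reports
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(report_month, village, indicator, reported_count);

-- A "simple total" then becomes a one-line query instead of a day of hand-summing
SELECT village, SUM(reported_count) AS total_doses
FROM   coverage_reports
WHERE  indicator = 'measles_dose_1'
GROUP  BY village;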
And keep up the good work!

Business Intelligence - analyzing events rather than aggregates? What's the right approach?

I currently analyze our customer data and trends with a number of SQL queries, and testing a hypothesis can be time-consuming.
For instance, we have a table of our customer info and a table of our customer service calls, indexed by customer. I'd like to find out if a particular cohort of customers had more CS issues than another, and whether there is any correlation between customer service calls and increased cancel rates.
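To give a sense of what that looks like today, the kind of query I end up writing by hand is roughly this (table and column names simplified for the example):

-- Cancel rate and CS call volume per ad campaign, hand-written T-SQL
SELECT c.ad_campaign,
       COUNT(DISTINCT c.customer_id)               AS customers,
       COUNT(s.call_id)                            AS cs_calls,
       COUNT(DISTINCT CASE WHEN c.cancelled = 1
                           THEN c.customer_id END) AS cancelled_customers
FROM   customers c
LEFT   JOIN cs_calls s ON s.customer_id = c.customer_id
GROUP  BY c.ad_campaign;

Each new hypothesis means another query like this, which is where the time goes.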
I was looking into MS's BI studio, as we're running MSSQL 2008 already; but most of what I've read focuses on carefully constructed MDX cubes that aggregate numerical data; so in the above model, I'd have to build a cube of facts (number of CS calls and types) and then use the customer data as dimensions. Fair enough, but in the time it'd take me to do that, I could just write the query manually in TSQL.
My DB is small enough that the speed gains from a separate datamart aren't necessary -- what I'm looking for is a flexible way of looking at my data, by creating a Customer 'Object' and tying all sorts of data, actions and numerical values to them. And I'd rather have the data extracted from my existing tables rather than having to ETL to a separate table.
Ideally at some point, I'd be able to use Data Mining tools for predictive analysis, but right now I'm going after low hanging fruit -- do customers from this ad campaign cancel more than the other one; etc.
Am I barking up the wrong tree with SQL Analysis Services/MDX cubes? Or does what I'm talking about not exist easily to begin with? Any advice, directions to products, or insight greatly appreciated.
It depends on who you want to do the analysis. If you are the one who is going to do the analysis, you know SQL, and you understand the structure of your data, then there's no real benefit to doing extra work to simply change the structure of the data. You want to use BI tools when you want to make that data available to others who don't know SQL, and don't necessarily know the relationships between different tables of data that are out there. You're in essence adding an abstraction layer to hide all this complexity from them, but still allow them to do the analysis. Of course the side effect of the abstraction is that you end up adding some limitations, but the trade-off is that the information is available to more people.
Don't waste your time with SSAS/cubes. Your dataset is small and the scope of your problem is narrow, so there's no need for you to build a cube. Instead, you should give the Excel Data Mining add-in a test run. It's pretty powerful and works well with small datasets. It is the low-hanging fruit I believe you are looking for. Plus, users feel comfortable using Excel.
SSAS is not necessary for creating data mining structures/models; it is only necessary if you want to automate the process.
Building a cube first only helps when you have a very large dataset: because of its speed, it will allow the data mining algorithms to run faster. Even if you use SSAS to build data mining structures/models, you still don't need a cube... you can build the structures/models off of relational tables.
If your database tables are designed correctly, you can build the mining structures/models directly against them.
