I have just started a new job where I have to maintain the data warehouse. The problem is that the company's data warehouse gets its data from eAutomate and some other software, and when I look at the tables in the warehouse I don't understand anything. What would be the best way to understand all the data in the warehouse for analytics purposes? It looks like the eAutomate company built all the tables for the company, and no one is here to train me on how the data gets in there. I feel bad for not being very productive, since it's only my second week on the job.
Honestly, there has to be a contact for eAutomate who set up the DB. Without any experience with eAutomate, you're at a real disadvantage.
I'd recommend getting an idea of which databases and tables exist, then getting permission to add or modify some test data in eAutomate. Start looking at the tables that seem to correspond to the area of eAutomate where you changed the data.
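If the eAutomate backend is SQL Server and you have permission to read the usage DMVs, one rough way to see which tables your test change touched is to compare the most recent write time per table before and after the change. This is only a sketch, nothing eAutomate-specific:

-- Run in the eAutomate database. Shows the last time each table was written to
-- (these statistics reset when the SQL Server instance restarts).
SELECT
    s.name AS SchemaName,
    t.name AS TableName,
    MAX(us.last_user_update) AS LastWrite
FROM sys.tables t
JOIN sys.schemas s
    ON s.schema_id = t.schema_id
LEFT JOIN sys.dm_db_index_usage_stats us
    ON us.object_id = t.object_id
   AND us.database_id = DB_ID()
GROUP BY s.name, t.name
ORDER BY LastWrite DESC;

Change one record in the eAutomate UI, re-run the query, and the tables that bubble to the top are your starting point.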
Your best bet is to get some domain knowledge before you dive into the DB.
Just my .02. Hope it helps!
I'm using SQL Server 2019. I have a "MasterDB" database that is exposed to GUI application.
I'm also going to have as many as 40 user databases like "User1DB", "User2DB", etc. All of these user databases will have the exact same schema and tables.
I have a requirement to copy table data (overwriting the target) from one user database (let's say "User1DB") to another (say "User2DB"). I need to do this from within the "MasterDB" database, since the GUI client app is only going to have access to that database. How do I handle this dynamic situation? I'd prefer static SQL (in the form of stored procedures) rather than dynamic SQL.
Any suggestion will greatly be appreciated. Thanks.
Check out this question here for transferring data from one database to another.
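For a single, known pair of databases on the same instance, the copy itself can be plain static SQL in MasterDB using three-part names. Here is a minimal sketch (database, table, and column names are placeholders, not your actual schema):

-- Lives in MasterDB; overwrites one table in the target user database with the
-- rows from the same table in the source user database.
CREATE OR ALTER PROCEDURE dbo.CopyCustomer_User1_To_User2
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    DELETE FROM User2DB.dbo.Customer;          -- overwrite the target

    INSERT INTO User2DB.dbo.Customer (CustomerId, CustomerName)
    SELECT CustomerId, CustomerName
    FROM User1DB.dbo.Customer;                 -- read from the source
    -- If CustomerId is an IDENTITY column you would also need
    -- SET IDENTITY_INSERT User2DB.dbo.Customer ON/OFF around the insert.

    COMMIT TRANSACTION;
END;

The catch is that with 40 user databases, purely static SQL means one procedure per source/target pair (or a big branching procedure). In practice, a small amount of dynamic SQL that builds the three-part names from parameters validated against sys.databases is usually the pragmatic compromise.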
Aside from that, I agree with #DaleK here. There is no real reason to have a database per user if we are making the assumption that a user is someone who is logging into your frontend app.
I can somewhat understand replicating your schema per customer if you are running some multi-billion record enterprise application where you physically have so much data per customer that it makes sense to split it up, but based on your question that doesn't seem to be the case.
So, if our assumptions are correct, you just need to have a user table, where your fields might be...
UserTable
UserId
FName
LName
EmailAddress
...
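As a minimal sketch in DDL (column types and constraints here are assumptions you would adjust):

CREATE TABLE dbo.UserTable (
    UserId       INT IDENTITY(1,1) PRIMARY KEY,
    FName        NVARCHAR(100) NOT NULL,
    LName        NVARCHAR(100) NOT NULL,
    EmailAddress NVARCHAR(256) NOT NULL UNIQUE
    -- ... any other per-user attributes
);

Every other table that today lives in a per-user database would instead carry a UserId foreign key, so one schema serves all users.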
Edit:
I see in the comments you are referring to "source control data"... I suggest you study up on how databases are meant to be designed and implemented, and on how data should be transacted. There are a ton of great articles and books out there; a simple Google search will turn them up.
If you are looking to replicate your data for backup purposes, look into some data warehouse design principles, perhaps creating a separate datastore in a different geographic region. That is a very complex subject that I can't cover in this answer, but it sounds like it goes far beyond your current needs. My suggestion is to backtrack and hash out the needs of your application while learning some of the fundamentals of databases (or of the different methods of storing data). Implement something, then see where it can be expanded upon or refactored.
Beyond that, I can't be more detailed than the original question you posted. Hope this helps.
I have a source flat file with about 20 columns of data and roughly 11K records. Each record (row) contains info such as:
PatientID, PatientSSN, PatientDOB, PatientSex, PatientName, PatientAddress, PatientPhone, PatientWorkPhone, PatientProvider, PatientReferrer, PatientPrimaryInsurance, PatientInsurancePolicyID.
My goal is to move this data into a SQL database.
I have created a database with the below data model.
I now want to do a bulk insert to move all the records, but I am unsure how to do that because, as you can see, there are (and have to be) constraints in place to ensure referential integrity. What should my approach be? Am I going about this all wrong? So far I have used SSIS to import the data into a single staging table, and now I must figure out how to write the 11K-plus records to the individual tables where they belong. Record 1 of the staging table will create one record in almost all of the tables, except perhaps those on the "one" side of one-to-many relationships, like Provider and Referrer: one provider will be linked to many patients, but one patient can only have one provider.
I hope I have explained this well enough. Please help!
As the question is generic, I'll approach the answer in a generic way as well - in an attempt to at least get you asking the right questions.
Your goal is to get flat-file data into a relational database. This is a very common operation and is at least a subset of the ETL process. So you might want to start your search by reading more on ETL.
Your fundamental problem, as I see it, is two-fold. First, you have a large amount of data to insert. Second, you are inserting into a relational database.
Starting with the second problem first: not all of your data can be inserted each time. For example, you have a Provider table that holds a 1:many relationship with Patient. That means that for each patient row in your flat table you have to ask whether the provider already exists or needs creating. Also, you have seeded (identity) IDs, meaning that in some cases you have to maintain the order of creation so that you can reference the ID of a created entry in the next created entry. What this means for you is that the effort will be more complex than a simple set of SQL inserts; you need logic associated with it. There are several ways to approach this.
Pure SQL/T-SQL: it can be done, but it would be a lot of work and hard to debug/troubleshoot.
Write a program: this gives you a lot of flexibility, but it means you have to know how to program and use programming tools for working with a database (such as an ORM).
Use an automated ETL tool
Use SQL Server's flat-file import abilities
Use an IDE with import capabilities - such as Toad, Datagrip, DBeaver, etc.
Each of these approaches will take some research and learning on your part -- this forum cannot teach you how to use them. The decision as to which one to use will depend somewhat on how automated the process needs to be.
Concerning your first issue -- large data inserts: SQL Server has a facility for bulk inserts (see the BULK INSERT docs), but you will have to condition your data first.
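To make the pure SQL/T-SQL route above a little more concrete, here is a minimal sketch. It assumes a staging table called PatientStaging that mirrors the flat file, plus simplified Provider and Patient target tables; every name and column here is illustrative, not your actual model:

-- 1) Load the flat file into the staging table. You have already done this via
--    SSIS; BULK INSERT is the raw T-SQL alternative.
BULK INSERT dbo.PatientStaging
FROM 'C:\import\patients.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- 2) Create any providers that don't exist yet (one row per distinct name).
INSERT INTO dbo.Provider (ProviderName)
SELECT DISTINCT s.PatientProvider
FROM dbo.PatientStaging s
WHERE s.PatientProvider IS NOT NULL
  AND NOT EXISTS (SELECT 1
                  FROM dbo.Provider p
                  WHERE p.ProviderName = s.PatientProvider);

-- 3) Insert patients, looking the provider's generated id back up by name so
--    the foreign key is satisfied.
INSERT INTO dbo.Patient (PatientSSN, PatientName, PatientDOB, ProviderId)
SELECT s.PatientSSN, s.PatientName, s.PatientDOB, p.ProviderId
FROM dbo.PatientStaging s
LEFT JOIN dbo.Provider p
    ON p.ProviderName = s.PatientProvider;

Referrer and the insurance tables follow the same pattern: insert the distinct parent rows first, then insert the child rows, joining back on the natural key to pick up the generated IDs.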
Personally (as per my comments), I am a .NET developer, but given this task I would still script it up in Python. The learning curve in Python is very kind, and it has lots of great tools for working with files and databases. .NET and EF carry a lot of overhead, in terms of what you need to know before getting started, that Python doesn't -- but that is just me.
Hope this helps get you started.
Steve you are a boss, thank you. Ed thanks to you as well!
I have taken everyone's guidance into consideration and concluded that I will not be able to get away with a simple solution for this.
There are bigger implications, so it makes sense to do this groundwork in a way that lets me reuse the effort for future projects. I will be proceeding with a simple .NET web app using EF to take care of the data model, and I will write a simple import procedure to pull the data in.
I have a notion of how I will accomplish this, but with the help of this board I'm sure success will follow! Thanks all. -Joey
For the record, the tools I plan on using (I agree with the complexity and learning-curve opinions, but I have an affinity for MS products):
Azure SQL Database (data store)
Visual Studio 2017 CE (ide)
C# (Lang)
.net MVC (project type)
EF 6 (orm)
Grace (cause I'm only human :-)
I have a question in regards to some performance monitoring.
I am going to connect to the Postgres database, and what I want to do is extract the relevant information from the Tableau Server repository database into our own database.
I am following a document at the moment that covers the steps needed to retrieve the performance information from Postgres, but what I really need to do is set up a data model for our own database.
I'm not a strong DBA, so I may need help designing the data model, but the requirement is:
We want a model in place so that we can see how long workbooks take to load, and if any of them take longer than, say, 5 seconds, we are alerted so we can go in and investigate.
My current idea for the data model in very basic terms is having the following tables:
Users – Projects – Workbooks – Views – Performance
Essentially, we have users who access various projects that contain their own workbooks. The Views table is simply for workbook views, so that we can see how many times a workbook has been viewed and when. Finally, the Performance table is needed for the load times.
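In rough SQL terms, this is the shape I have in mind; it is only a sketch, all names and types are placeholders, and the syntax (written here in SQL Server flavour) would need adjusting to whatever engine our own database ends up on:

CREATE TABLE Users (
    UserId   INT IDENTITY(1,1) PRIMARY KEY,
    UserName VARCHAR(200) NOT NULL
);

CREATE TABLE Projects (
    ProjectId   INT IDENTITY(1,1) PRIMARY KEY,
    ProjectName VARCHAR(200) NOT NULL
);

CREATE TABLE Workbooks (
    WorkbookId   INT IDENTITY(1,1) PRIMARY KEY,
    ProjectId    INT NOT NULL REFERENCES Projects(ProjectId),
    WorkbookName VARCHAR(200) NOT NULL
);

CREATE TABLE Views (
    ViewId     INT IDENTITY(1,1) PRIMARY KEY,
    WorkbookId INT NOT NULL REFERENCES Workbooks(WorkbookId),
    UserId     INT NOT NULL REFERENCES Users(UserId),
    ViewedAt   DATETIME NOT NULL
);

CREATE TABLE Performance (
    PerformanceId INT IDENTITY(1,1) PRIMARY KEY,
    WorkbookId    INT NOT NULL REFERENCES Workbooks(WorkbookId),
    LoadSeconds   DECIMAL(9,3) NOT NULL,
    MeasuredAt    DATETIME NOT NULL
);

-- The alerting requirement then becomes a simple query, e.g. workbooks that
-- took longer than 5 seconds to load in the last day:
SELECT w.WorkbookName, p.LoadSeconds, p.MeasuredAt
FROM Performance p
JOIN Workbooks w ON w.WorkbookId = p.WorkbookId
WHERE p.LoadSeconds > 5
  AND p.MeasuredAt >= DATEADD(DAY, -1, GETDATE());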
This is a very basic description of what we need, but my question is simply: is there anyone with knowledge of Tableau and data modelling who could help design a very basic model and schema for this? It will need to be scalable so that it can work across as many Tableau servers as needed.
Thank you very much,
I've found an article from the blog of a monitoring tool that I like, and maybe it could help you with your monitoring. I'm not an expert in PostgreSQL, but it's worth a look:
http://blog.pandorafms.org/how-to-monitor-postgress/
Hope this can help!
I am working on a project which stores large amounts of data on multiple industries.
I have been tasked with designing the database schema.
I need to make the database schema flexible so it can handle complex reporting on the data.
For example,
what products are trending in industry x
what other companies have a similar product to my company
how is my company website different to x company website
There could be all sorts of reports. Right now everything is vague. But I know for sure the reports need to be fast.
Am I right in thinking that my best path is to create as many association tables as I can? The idea being (for example) that if the product table is linked to the industry table, it will be relatively easy to get all products for a certain industry without having to go through joins on other tables to make the connection.
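For example, something like this is what I mean (SQL Server syntax, and the table and column names are purely illustrative):

CREATE TABLE Product (
    ProductId   INT IDENTITY(1,1) PRIMARY KEY,
    ProductName NVARCHAR(200) NOT NULL
);

CREATE TABLE Industry (
    IndustryId   INT IDENTITY(1,1) PRIMARY KEY,
    IndustryName NVARCHAR(200) NOT NULL
);

-- Association table linking products directly to industries.
CREATE TABLE ProductIndustry (
    ProductId  INT NOT NULL REFERENCES Product(ProductId),
    IndustryId INT NOT NULL REFERENCES Industry(IndustryId),
    PRIMARY KEY (ProductId, IndustryId)
);

-- "All products in industry X" is then a single hop through the link table:
SELECT p.ProductName
FROM Product p
JOIN ProductIndustry pi ON pi.ProductId = p.ProductId
JOIN Industry i         ON i.IndustryId = pi.IndustryId
WHERE i.IndustryName = 'Manufacturing';

...and then repeating that pattern for every pair of entities we might ever want to report on.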
This seems insane though. The schema will be so large and complex.
Please tell me if what I'm doing is correct or if there is some other known solution for this problem. Perhaps the solution is to hire a data scientist or DBA whose job is to do this sort of thing, rather than getting the programmer to do it.
Thank you.
I think getting these kinds of answers from a relational/operational database will be very difficult and the queries will be really slow.
The best approach, I think, will be to create multidimensional data structures (in other words, a data warehouse) where you will have flattened data that is easier to query than a normalized relational database. It will also hold historical data for trend analysis.
If there is a need for complex statistical or predictive analysis, then the data scientists can use the data warehouse as their source.
Adding to Amit's answer above: the problem is that what you need from your transactional database is a heavily normalized association of facts for operational purposes, while for the analytic side you want what are effectively tagged facts.
In other words what you want is a series of star schemas where you can add whatever associations you want.
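As a rough sketch of what one such star schema might look like (every name and measure here is invented for illustration, not a prescription for your model):

-- Dimension tables: the "tags" you slice by.
CREATE TABLE DimProduct (
    ProductKey  INT IDENTITY(1,1) PRIMARY KEY,
    ProductName NVARCHAR(200) NOT NULL
);

CREATE TABLE DimIndustry (
    IndustryKey  INT IDENTITY(1,1) PRIMARY KEY,
    IndustryName NVARCHAR(200) NOT NULL
);

CREATE TABLE DimDate (
    DateKey      INT PRIMARY KEY,   -- e.g. 20240131
    CalendarDate DATE NOT NULL
);

-- Fact table: one row per measured event, flattened for fast aggregation.
CREATE TABLE FactProductActivity (
    DateKey     INT NOT NULL REFERENCES DimDate(DateKey),
    ProductKey  INT NOT NULL REFERENCES DimProduct(ProductKey),
    IndustryKey INT NOT NULL REFERENCES DimIndustry(IndustryKey),
    UnitsSold   INT NOT NULL,
    SalesAmount DECIMAL(18,2) NOT NULL
);

-- "What products are trending in industry X" becomes an aggregate over the fact table:
SELECT TOP (10) dp.ProductName, SUM(f.UnitsSold) AS TotalUnits
FROM FactProductActivity f
JOIN DimProduct  dp ON dp.ProductKey  = f.ProductKey
JOIN DimIndustry di ON di.IndustryKey = f.IndustryKey
JOIN DimDate     dd ON dd.DateKey     = f.DateKey
WHERE di.IndustryName = 'Retail'
  AND dd.CalendarDate >= DATEADD(MONTH, -3, CAST(GETDATE() AS DATE))
GROUP BY dp.ProductName
ORDER BY TotalUnits DESC;

Each new reporting question then tends to become a new dimension on an existing fact table, or a new fact table, rather than yet another association table in the operational schema.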
I'm in the planning stages of building a SQL Server DataMart for mail/email/SMS contact info and history. Each piece of data is located in a different external system. Because of this, email addresses do not have account numbers and SMS phone numbers do not have email addresses, etc. In other words, there isn't a shared primary key. Some data overlaps, but there isn't much I can do except keep the most complete version when duplicates arise.
Is there a best practice for building a DataMart with this data? Would it be an acceptable practice to create a key table with a column for each external key? Then, a unique primary ID can be assigned to tie this to other DataMart tables.
Looking for ideas/suggestions on approaches I may not have yet thought of.
Thanks.
The email address or phone number itself sounds like a suitable business key. Typically a "staging" database is used to load the data from multiple sources and then assign surrogate keys and do other transformations.
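As a rough sketch of that pattern (the schema, table, and column names are made up for illustration, and a stg schema is assumed to exist): land everything in staging, then assign one surrogate key per distinct business key when you populate the mart's contact dimension.

-- Staging table loaded from the various source systems.
CREATE TABLE stg.Contact (
    SourceSystem VARCHAR(50)  NOT NULL,   -- 'mail', 'email', 'sms', ...
    SourceKey    VARCHAR(100) NULL,       -- whatever identifier that system has
    EmailAddress VARCHAR(256) NULL,
    PhoneNumber  VARCHAR(50)  NULL
);

-- Conformed dimension in the mart: one surrogate key per contact, with a
-- column for each external identifier you know about (your "key table" idea).
CREATE TABLE dbo.DimContact (
    ContactKey   INT IDENTITY(1,1) PRIMARY KEY,
    EmailAddress VARCHAR(256) NULL,
    PhoneNumber  VARCHAR(50)  NULL
);

-- Add contacts whose business key (email or phone) hasn't been seen before.
-- Real matching/de-duplication rules need more care, since not every source
-- row carries both keys.
INSERT INTO dbo.DimContact (EmailAddress, PhoneNumber)
SELECT DISTINCT s.EmailAddress, s.PhoneNumber
FROM stg.Contact s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DimContact d
                  WHERE d.EmailAddress = s.EmailAddress
                     OR d.PhoneNumber  = s.PhoneNumber);

Fact tables (mail sent, email sent, SMS sent) then reference ContactKey, regardless of which source system the activity came from.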
Are you familiar with data warehouse methods and design patterns? If you don't have previous knowledge or experience then consider hiring some help. BI / data warehouse projects have a very high failure rate and mistakes can be expensive.
Found more information here:
http://en.wikipedia.org/wiki/Extract,_transform,_load#Dealing_with_keys
Well, with no other information to tie the disparate pieces together, your datamart is going to be pretty rudimentary. You'll be able to get the types of data (sms, email, mail), metrics for each type over time ("this week/month/quarter/year we averaged 42.5 sms texts per day, and 8000 emails per month! w00t!"). With just phone numbers and email addresses, your "other datamarts" will likely have to be phone company names, or internet domains. I guess you could link from that into some sort of geographical information (internet provider locations?), or maybe financial information for the companies. Kind of a blur if you don't already know which direction you want to head.
To be honest, this sounds like someone high-up is having a knee-jerk reaction to the "datamart" buzzword coupled with hearing something about how important communication metrics are, so they sent orders on down the chain to "get us some datamarts to run stats on all our e-mails!"
You need to figure out what it is that you or your employer is expecting to get out of this project, and then figure out if the data you're currently collecting gives you a trail to follow to that information. Right now it sounds like you're doing it backwards ("I have this data, what's it good for?"). It's entirely possible that you don't currently have the data you need, which means you'll need to buy it (who knows if you could) or start collecting it, in which case you won't have nice looking graphs and trend-lines for upper-management to look at for some time... falling right in line with the warning dportas gave you in his second paragraph ;)