Metadata and End user data difference - database

So I'm having difficulty understanding the difference between metadata and end-user data. What is the difference between them when it comes to databases?

Related

How to Organize a Messy Database

I know there is no easy answer to this question, but how do I clean up a database with no relationships, no foreign keys, and not a whole lot of structure?
I'm an amateur at SQL, and I've inherited a database that is a complete mess. We have no referential integrity, and there's not a whole lot of logic to how the tables work.
My database is all data that comes from a warehouse that builds servers.
To give you an idea of the type of data I'm working with:
EDI from customers
Raw output from server projects
Sales information
Site information
Parts lists
I have been prioritizing Raw output and EDI information, and generating reports with that information using SSRS. I have learned a lot about SQL Server and the BI Microsoft tools (SSIS and SSRS) in my short time doing this. However, I'm still an amateur and I want to build a solid database that flows well and can stand on its own.
It seems like a data warehouse model is the type of structure I should adopt.
My question is: how do I take my mess of a database and make something more organized before I drown in data?
Since your end goal appears to be business reporting, and you're dealing with data from multiple sources made up of "isolated" tables, I would advise you to start by aggregating all of that into a data model.
Personally, I would design a dimensional model to structure and store all that data, with the goal of being easy to understand (for reporting or ad hoc querying). The model should be focused on business entities and their transactions. In a dimensional model, the business entities will (almost always) be the dimensions and the transactions (the metrics) will be the facts. For example, without knowing your model, I'm guessing that the immediate entities would include Customer, Site and Part, and transactions would include ServerSale, SiteVisit, PartPurchase, PartRepair, PartOrder, etc.
More information about dimensional modelling here and here, but I suggest going straight to the source: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
When your model is designed (and implemented in a database like SQL Server), you'll then load data into the model by extracting it from its different source systems/databases and transforming it from the current structure into the structure defined by the model, typically using an ETL tool like MS Integration Services. For example, your Customer data may be scattered across the "sales", "customer" and "site" tables, so you want to aggregate all that data and load it into a single Customer dimension table. It's when doing this ETL that you should check your data for the problems you already mentioned, loading correct rows into your data model and discarding incorrect rows into a file/log where they can later be checked and corrected (there are multiple ways to address this).
A straightforward tutorial to get started on doing ETL using SSIS can be found at https://technet.microsoft.com/en-us/library/jj720568(v=sql.110).aspx
So, to sum up, you should build a data mart:
design a dimensional model that represents the business facts and context in the data you have. This strongly facilitates both data understanding and reporting, because a dimensional model closely matches business users' terminology and mental models.
use an ETL tool to extract the data from its current sources, process it (e.g. check for data quality problems, join data from different sources), and load it into the dimensional model. This gets you close to an automated data integration pipeline with whatever quality checks you deem fit for the data.
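The extract/transform/load loop described above can be sketched in a few lines. This is a toy illustration with made-up source tables ("src_sales", "src_site") standing in for your scattered sources; a real SSIS package does the same thing with data flows and error outputs:

```python
import sqlite3

# Toy ETL: pull customer rows from two source tables, reject rows that
# fail a quality check, and load the rest into one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_sales   (cust_name TEXT, region TEXT);
CREATE TABLE src_site    (cust_name TEXT, site TEXT);
CREATE TABLE DimCustomer (customer_key INTEGER PRIMARY KEY,
                          cust_name TEXT, region TEXT, site TEXT);
INSERT INTO src_sales VALUES ('Acme', 'South'), (NULL, 'North');
INSERT INTO src_site  VALUES ('Acme', 'Dallas');
""")

rejects = []  # stands in for the reject file/log mentioned above

# Extract: join the scattered sources on the business key.
rows = conn.execute("""
    SELECT s.cust_name, s.region, t.site
    FROM src_sales s LEFT JOIN src_site t ON t.cust_name = s.cust_name
""").fetchall()

for name, region, site in rows:
    if name is None:                       # Transform: quality check
        rejects.append((name, region, site))
        continue
    conn.execute(                          # Load: into the dimension
        "INSERT INTO DimCustomer (cust_name, region, site) VALUES (?,?,?)",
        (name, region, site))

loaded = conn.execute(
    "SELECT cust_name, region, site FROM DimCustomer").fetchall()
print(loaded)   # [('Acme', 'South', 'Dallas')]
print(rejects)  # [(None, 'North', None)]
```

The key design point is that bad rows are quarantined rather than silently dropped, so they can be corrected and re-loaded later.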

How is data stored in a database?

In my course we've learned about different database models and how data is stored theoretically, with trees, etc.
My question is more practical: I want to know how the data is actually stored on disk. Are there algorithms that distribute the data across the hard drives?
So in big databases, is the data spread randomly across the storage, or are the drives filled step by step until they're full before the next one gets data to save?
Storing data in a database is not the major problem. Instead, the problem lies in viewing and manipulating the stored data: how frequently you want to access the data, when you want to access it, which data you want to access, and so on. Once you are sure about how your data will be manipulated, you can design your tables accordingly and choose appropriate techniques to manipulate the data.
You can refer to any book on database management systems.
Advanced DBMS Tutorials
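To give the physical-storage question a concrete shape: most relational engines store rows inside fixed-size pages (e.g. 8 KB in PostgreSQL and SQL Server), each with a slot directory pointing at the rows packed inside it. This toy Python class illustrates the idea; it is not any real engine's on-disk format:

```python
import struct

PAGE_SIZE = 8192  # a common page size; real engines vary

class Page:
    """Toy slotted page: rows are length-prefixed and packed in order,
    while a slot list records each row's offset within the page."""
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []   # offset of each row inside the page
        self.free = 0     # next free byte

    def insert(self, row: bytes) -> int:
        record = struct.pack("<H", len(row)) + row  # 2-byte length prefix
        if self.free + len(record) > PAGE_SIZE:
            raise ValueError("page full -> the engine allocates a new page")
        self.data[self.free:self.free + len(record)] = record
        self.slots.append(self.free)
        self.free += len(record)
        return len(self.slots) - 1                  # slot id

    def read(self, slot: int) -> bytes:
        off = self.slots[slot]
        (length,) = struct.unpack_from("<H", self.data, off)
        return bytes(self.data[off + 2:off + 2 + length])

page = Page()
sid = page.insert(b"row-1: Alice")
print(page.read(sid))  # b'row-1: Alice'
```

Indexes (typically B-trees) then map key values to page-and-slot addresses, which is why data is not "spread randomly": the engine fills pages and lets the filesystem or storage layer decide physical placement on the drives.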

What is the difference between dataset and database?

What is the difference between a dataset and a database? If they are different, then how?
Why is huge data difficult to manage using databases today?
Please answer independently of any programming language.
In American English, database usually means "an organized collection of data". A database is usually under the control of a database management system, which is software that, among other things, manages multi-user access to the database. (Usually, but not necessarily. Some simple databases are just text files processed with interpreted languages like awk and Python.)
In the SQL world, which is what I'm most familiar with, a database includes things like tables, views, stored procedures, triggers, permissions, and data.
Again, in American English, dataset usually refers to data selected and arranged in rows and columns for processing by statistical software. The data might have come from a database, but it might not.
Database
The definitions of the two terms are not always clear. In general, a database is a set of data organized and made accessible by a database management system (DBMS). Databases are usually, but not always, composed of several linked tables, often accessed, modified and updated by various users, often simultaneously.
Cambridge dictionary:
A structured set of data held in a computer, especially one that is accessible in various ways.
Merriam-Webster:
a usually large collection of data organized especially for rapid search and retrieval (as by a computer)
Data set (or dataset)
A data set sometimes refers to the contents of a single database table, but this is quite a restrictive definition. In general, as the name suggests, it is a set (or collection) of data; hence there are datasets of images, like the Caltech-256 Object Category Dataset, or of videos, e.g. a large-scale benchmark dataset for event recognition in surveillance video. A data set is usually intended for analysis rather than continual updates from different users, and hence represents the end result of data collection, or a snapshot at a specific point in time.
Oxford dictionary:
A collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer. 'all hospitals must provide a standard data set of each patient's details'
Cambridge dictionary:
a collection of separate sets of information that is treated as a single unit by a computer
A dataset is the data... usually in a table, though it can be XML or other formats. Either way, it's only data; it doesn't really do anything.
And as you know a database is a container for the dataset usually with built in infrastructure around it to interact with it.
Huge data isn't hard to manage for what I do. I guess you're asking a study-related question?
A dataset is just a set of data (which may be related for one purpose and not for another), whereas a database is a software/hardware component that organizes and stores data or datasets. In practice, they are different things.
Huge data needs more infrastructure and components (hardware and software), i.e. more computing power and storage, for efficient storage and retrieval. More data means more components, hence more difficulty. Modern databases provide good infrastructure for processing huge amounts of data (both reads and writes); see, for example, data lake management by Microsoft, which handles relational data and datasets extensively.
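The dataset/database distinction above can be shown in a few lines: the dataset is just the data, and the database is the managed container you load it into. A minimal sketch with Python's built-in sqlite3 (table and column names are illustrative):

```python
import sqlite3

# A dataset: plain data, inert on its own.
dataset = [("Alice", 30), ("Bob", 25)]

# A database: a DBMS-managed container for that data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", dataset)

# The DBMS adds the infrastructure: querying, indexing,
# concurrent access, integrity constraints, and so on.
rows = conn.execute("SELECT name FROM people WHERE age > 26").fetchall()
print(rows)  # [('Alice',)]
```

The list of tuples on its own "doesn't really do anything"; the query capability only appears once the database engine is wrapped around it.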

Data Processing / Mining Question

I'm starting to work on a financial information website (somewhat like google finance or bloomberg).
My website needs to display live currency, commodity, and stock values. I know how to do this frontend-wise, but I have a backend data-storage question (I already have the data feed APIs):
How would you go about this? Would you set up your own database, save all the data in it with some kind of backend worker, and then plug your frontend into your database, or would you plug your frontend directly into the API and not mine the data?
Mining the data could be good for later reference (statistics and other things the API won't allow), but can such a big quantity of ever-growing information be stored in a database? Is this feasible? What other things should I be considering?
Thank you - any comment would be much appreciated!
First, I'd cleanly separate the front end from the code that reads the source APIs. Having done that, I could have the code that reads the source APIs feed the front end directly, feed a database, or both.
I'm a database guy. I'd lean toward feeding data from the APIs into a database, and connecting the front end to the database. But it really depends on the application's requirements.
Feeding a database makes it simple and cheap to change your mind. If you (or whoever) decides later to keep no historical data, just delete old data after storing new data. If you (or whoever) decides later to keep all historical data, just don't delete old data.
Feeding a database also gives you fine-grained control over who gets to see the data, relatively independent of their network operating system permissions. Depending on the application, this may or may not be a good thing.
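The separation described above (API reader feeds the database, frontend reads the database) can be sketched roughly as follows. `fetch_quote` is a hypothetical stand-in for whatever data-feed API you actually have:

```python
import sqlite3
import time

def fetch_quote(symbol):
    """Hypothetical stand-in for a real data-feed API call."""
    return {"symbol": symbol, "price": 101.25, "ts": time.time()}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (symbol TEXT, price REAL, ts REAL)")

def worker_tick(symbols):
    """Backend worker: poll the feed and persist each quote."""
    for s in symbols:
        q = fetch_quote(s)
        conn.execute("INSERT INTO quotes VALUES (?, ?, ?)",
                     (q["symbol"], q["price"], q["ts"]))

worker_tick(["EURUSD"])

# The frontend reads the latest stored quote from the database,
# never from the upstream API directly.
latest = conn.execute("""SELECT symbol, price FROM quotes
                         ORDER BY ts DESC LIMIT 1""").fetchone()
print(latest)  # ('EURUSD', 101.25)
```

Because old rows are simply kept or deleted, this layout makes the "keep history or not" decision cheap to change later, exactly as described above.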

Retrieving data from database. Retrieve only when needed or get everything?

I have a simple application to store Contacts. This application uses a simple relational database to store Contact information, like Name, Address and other data fields.
While designing it, a question came to my mind:
When designing programs that use databases, should I retrieve all database records and store them in objects in my program, so that I get very fast performance, or should I always fetch data only when required?
Of course, retrieving all the data is only possible if there isn't too much of it, but do you use this approach when you're sure the database will be small (< 300 records, for example)?
I once designed a similar application that fetched data only when needed, but it was slow (using an Access database).
Thanks for all help.
This depends a lot on the type of data, the state your application works in, transactions, multiple users, etc.
Generally you don't want to pull everything and operate on the data within your application because almost all of the above conditions will cause data to become non-synchronized. Imagine a user updating a contact while someone else is viewing that information from a cached version inside their application.
In your application, you should design the database queries such that they retrieve what is going to be displayed on the current screen. If the user is viewing a list of contacts, then the query would retrieve the entire contact table, or a portion of it if you are doing a paginated view. When they click on a contact, for example, for more information, then a new query would request the full details of that contact.
For strings and small pieces of data like what a contact list involves, you shouldn't have any speed issues working with a relational database like SQL Server, MySQL or Oracle.
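The fetch-what's-on-screen approach above can be sketched with two queries: one paginated list query and one detail query on click. A minimal illustration with SQLite (table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE contacts (
    id INTEGER PRIMARY KEY, name TEXT, address TEXT)""")
conn.executemany(
    "INSERT INTO contacts (name, address) VALUES (?, ?)",
    [(f"Contact {i}", f"{i} Main St") for i in range(1, 301)])

PAGE_SIZE = 25

def contact_page(page):
    """List view: fetch only the names shown on one screen."""
    return conn.execute(
        "SELECT id, name FROM contacts ORDER BY name LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE)).fetchall()

def contact_details(contact_id):
    """Detail view: fetch the full record only when it's clicked."""
    return conn.execute(
        "SELECT * FROM contacts WHERE id = ?", (contact_id,)).fetchone()

first = contact_page(0)
print(len(first))  # 25
print(contact_details(first[0][0]))
```

Because each query returns only what the current screen needs, the cached-copy staleness problem described above (one user viewing while another updates) is limited to a single page of data at a time.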
I think it's best to retrieve data when needed; retrieving all the records and storing them in objects can be an overhead. And since you say you have a small database, retrieving records when needed should not be an issue at all.