What is the difference between dataset and database? - database

What is the difference between a dataset and a database ? If they are different then how ?
Why is huge data difficult to be manageusing databases today?!
Please answer independent of any programming language.

In American English, database usually means "an organized collection of data". A database is usually under the control of a database management system, which is software that, among other things, manages multi-user access to the database. (Usually, but not necessarily. Some simple databases are just text files processed with interpreted languages like awk and Python.)
In the SQL world, which is what I'm most familiar with, a database includes things like tables, views, stored procedures, triggers, permissions, and data.
Again, in American English, dataset usually refers to data selected and arranged in rows and columns for processing by statistical software. The data might have come from a database, but it might not.

Database
The definition of the two terms is not always clear. In general a database is a set of data organized and accessible using a database management system (DBMS). Databases usually, but not always, are composed of several tables linked together often accessed, modified and updated by various users often simultaneously.
Cambridge dictionary:
A structured set of data held in a computer, especially one that is
accessible in various ways.
Merriam-webster
a usually large collection of data organized especially for rapid
search and retrieval (as by a computer)
Data set (or dataset)
A data set sometimes refer to the contents of a single database table, but this is quite a restrictive definition. In general, as the name suggests, is a set (or collection) of data hence there are datasets of images like Caltech-256 Object Category Dataset or videos e.g. A large-scale benchmark dataset for event recognition in surveillance video. A data set purpose is usually designed for the analysis rather to a continual update form different users, hence represent the end of a collection of data or a snapshot of a specific time.
Oxford dictionary:
A collection of related sets of information that is composed of
separate elements but can be manipulated as a unit by a computer.
‘all hospitals must provide a standard data set of each patient's
details’
Cambridge dictionary
a collection of separate sets of information that is treated as a
single unit by a computer

A dataset is the data... usually in a table or can be XML or other types of data however it's only data... it doesn't really do anything.
And as you know a database is a container for the dataset usually with built in infrastructure around it to interact with it.
Huge data isn't hard to manage for what I do. I guess you're asking a study related question?

Dataset is just a set of data (maybe related to someone and may not be for others ) whereas Database is a software/hardware component that organizes and stores data or dataset. Both are different things practically.
Huge data needs more infrastructure and components (hardware & software) or computing power & storage for efficient storage or retrieval of data's . More huge data means more components hence difficult. Modern days database provides good infrastructure to handle huge data's processing (both read/write) , check datalake management by Microsoft which manages relational data or dataset extensively.

Related

How to Organize a Messy Database

I know there is no easy answer to this question, but how do I cleanup a database with no relationships, foreign keys, and not a whole lot of structure?
I'm an amateur to SQL, and I've inherited a database that is complete mess. We have no sort of referential integrity, and there's not a whole lot of logic to how tables are working.
My database is all data that comes from a warehouse that builds servers.
To give you an idea of the type of data I'm working with:
EDI from customers
Raw output from server projects
Sales information
Site information
Parts lists
I have been prioritizing Raw output and EDI information, and generating reports with that information using SSRS. I have learned a lot about SQL Server and the BI Microsoft tools (SSIS and SSRS) in my short time doing this. However, I'm still an amateur and I want to build a solid database that flows well and can stand on its own.
It seems like a data warehouse model is the type of structure I should adapt.
My question how do I take my mess of a database and make something more organized before I drown in data?
Since your end goal appears to be business reporting, and you're dealing with data from multiple sources made up from "isolated" tables, I would advise you to start by aggregating all that into a data model.
Personally, I would design a dimensional model to structure and store all that data, with the goal of being easy to understand (for reporting or adhoc querying). The model should be focused on business entities and their transactions. In a dimensional model, the business entities will (almost always) be the dimensions and the transactions (the metrics) will be the facts. For example, without knowing your model I'm guessing that the immediate entities would include Customer, Site, Part and transactions would include ServerSale, SiteVisit, PartPurchase, PartRepair, PartOrder, etc...
More information about dimensional modelling here and here, but I suggest going straight to the source: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
When your model is designed (and implemented in a database like SQL Server) you'll then be loading data into the model, by extracting it from its different source systems/databases and transforming it from the current structure into the structure defined by the model, namely by using an ETL tool like MS Integration Services. For example, your Customer data may be scattered across the "sales", "customer" and "site", so you want to aggregate all that data and load it into a single Customer dimension table. It's when doing this ETL that you should check your data for the problems you already mentioned, loading correct rows into you data model and discarding incorrect rows into a file/log where they can later be checked and corrected. (multiple ways to address this).
A straightforward tutorial to get started on doing ETL using SSIS can be found at https://technet.microsoft.com/en-us/library/jj720568(v=sql.110).aspx
So, to sum up, you should build a data mart:
design a dimensional model that represents the business facts and
context on the data you have. This will strongly facilitate both data understanding and reporting, because a dimensional model is closely matches business users terminology and mental models.
use an ETL tool to extract the data from its current source, process it (e.g. check for data quality problems, join data from different sources) and load it into the dimensional model and check it for problems. This will get you close to having an automated data integration job/pipeline with quality checks you deem fit for the data.

Best database and database design for this particular need?

I'm looking to store around 50-100 million documents in a database and be able to do queries at a very fast speed. A document would look something like this:
{
name: 'example',
value: '300,201,512'
}
The value column is always unique, name is not.
Now I want to be able to only check if there exists a document with a specific value using a query. What database would be the best choice and which design would be best to approach the fastest speed for a query like that?
NoSQL databases try to offer certain functionality that more traditional relational database management systems do not. Whether it is for holding simple key-value pairs for shorter lengths of time for caching purposes, or keeping unstructured collections (e.g. collections) of data that could not be easily dealt with using relational databases and the structured query language (SQL) – they are here to help.
In order to better understand the roles and underlying technology of each database management system, let's quickly go over these four operational models.
Key / Value Based
We will begin our NoSQL modeling journey with key / value based database management simply because they can be considered the most basic and backbone implementation of NoSQL.
These type of databases work by matching keys with values, similar to a dictionary. There is no structure nor relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g. the_answer_to_life) and provide a matching value (e.g. 42) which can later be retrieved the same way by supplying the key.
Key / value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic ones after performing, for example, a CPU and memory intensive computation. They are extremely performant, efficient and usually easily scalable.
Note: When it comes to computers, a dictionary usually refers to a special sort of data object. They constitutes of arrays of collections with individual keys matching values.
Column Based
Column based NoSQL database management systems work by advancing the simple nature of key / value based ones.
Despite their complicated-to-understand image on the internet, these databases work very simply by creating collections of one or more key / value pairs that match a record.
Unlike the traditional defines schemas of relational databases, column-based NoSQL solutions do not require a pre-structured table to work with the data. Each record comes with one or more columns containing the information and each column of each record can be different.
Basically, column-based NoSQL databases are two dimensional arrays whereby each key (i.e. row / record) has one or more key / value pairs attached to it and these management systems allow very large and un-structured data to be kept and used (e.g. a record with tons of information).
These databases are commonly used when simple key / value pairs are not enough, and storing very large numbers of records with very large numbers of information is a must. DBMS implementing column-based, schema-less models can scale extremely well.
Document Based
Document based NoSQL database management systems can be considered the latest craze that managed to take a lot of people by storm. These DBMS work in a similar fashion to column-based ones; however, they allow much deeper nesting and complex structures to be achieved (e.g. a document, within a document, within a document).
Documents overcome the constraints of one or two level of key / value nesting of columnar databases. Basically, any complex and arbitrary structure can form a document, which can be stored using these management systems.
Despite their powerful nature, and the ability to query records by individual keys, document based management systems have their own issues and downfalls compared to others. For example, retrieving a value of a record means getting the whole lot of it and same goes for updates, all of which affect the performance.
Graph Based
Finally, the very interesting flavour of NoSQL database management systems is the graph based ones.
The graph based DBMS models represent the data in a completely different way than the previous three models. They use tree-like structures (i.e. graphs) with nodes and edges connecting each other through relations.
Similarly to mathematics, certain operations are much simpler to perform using these type of models thanks to their nature of linking and grouping related pieces of information (e.g. connected people).
These databases are commonly used by applications whereby clear boundaries for connections are necessary to establish. For example, when you register to a social network of any sort, your friends' connection to you and their friends' friends' relation to you are much easier to work with using graph-based database management systems.
Fasted document based db
1) MongoDB
2) DynamoDB
Here is difference for your reference
I will give preference to DynamoDB
Currently, we are working on aws datalake, really fast performance
store data in s3 and get back via athena.
If you want to import data on to some database then try using MS SQL Server 2008 R2, because it is very user friendly and allow you to do your work more accurately and precisely. If you want to do that without any cost, then MySQL will be a better option to do so(better MySQL editor is SQLYog). I hope it would be beneficial for you.
Short Answer:
I think, 100 million documents in your mentioned structure and conditions is not BIG ENOUGH to use NoSQL. You can handle them with PostgreSQL and MySQL and etc.
Note that: for a long time Wikipedia used MySQL (not now). see Reference

What is the difference between a database and a normal file based system

What is the difference between a database and a normal file based system (like how we store data in our drives)?
Is it that data will be stored in form of tables in database? Can't we store files (mp3, mp4 etc.)?
From https://en.wikipedia.org/wiki/File_system:
In computing, a file system is used to control how
data is stored and retrieved. Without a file system, information
placed in a storage area would be one large body of data with no way
to tell where one piece of information stops and the next begins. By
separating the data into pieces and giving each piece a name, the
information is easily isolated and identified. Taking its name from
the way paper-based information systems are named, each group of data
is called a "file". The structure and logic rules used to manage the
groups of information and their names is called a "file system".
From https://en.m.wikipedia.org/wiki/Database:
A database is an organized collection of data. It is the collection
of schemas, tables, queries, reports, views and other objects. The
data are typically organized to model aspects of reality in a way that
supports processes requiring information, such as modelling the
availability of rooms in hotels in a way that supports finding a hotel
with vacancies.
So a file system is the basic organization, retrieval / storage mechanism of data.
A database sits on top of some sort of file system but also provides various insights into the data (like relating and querying data) and other operations optimized for data writing and retrieval.
These days database is widely used to mean database management system (DBMS) which provides the various tools and mechanisms that sort, allow querying, index, relate, etc.
In all cases (I'm sure I will be corrected if wrong) a file system is required to have a database but a database is not necessary to have a file system.
A text file could be considered a database.
A database is an organized collection of data. It is the collection of
schemas, tables, queries, reports, views and other objects. The data
are typically organized to model aspects of reality in a way that
supports processes requiring information, such as modelling the
availability of rooms in hotels in a way that supports finding a hotel
with vacancies.
- Wikipedia: Database
Facts regarding databases can be found in the link above. A file system is way different than a database. File systems store files, a database cannot store files.
I think you'll be best off reading the actual definition of a database, and you'll see it's far from the same as a file system.

Database alternatives?

I was wondering the trade-offs for using databases and what the other options were? Also, what problems are not well suited for databases?
I'm concerned with Relational Databases.
The concept of database is very broad. I will make some simplifications in what I present here.
For some tasks, the most common database is the relational database. It's a database based on the relational model. The relational model assumes that you describe your data in rows, belonging to tables where each table has a given and fixed number of columns. You submit data on a "per row" basis, meaning that you have to provide a row in a single shot containing the data relative to all columns of your table. Every submitted row normally gets an identifier which is unique at the table level, sometimes at the database level. You can create relationships between entities in the relational database, for example by saying that a given cell in your table must refer to another table's row, so to preserve the so called "referential integrity".
This model works fine, but it's not the only one out there. In some cases, data are better organized as a tree. The filesystem is a hierarchical database. starts at a root, and everything goes under this root, in a tree like structure. Another model is the key/value pair. Sleepycat BDB is basically a store of key/value entities.
LDAP is another database which has two advantages: stores rather generic data, it's distributed by design, and it's optimized for reading.
Graph databases and triplestores allow you to store a graph and perform isomorphism search. This is typically needed if you have a very generic dataset that can encompass a broad level of description of your entities, so broad that is basically unknown. This is in clear opposition to the relational model, where you create your tables with a very precise set of columns, and you know what each column is going to contain.
Some relational column-based databases exist as well. Instead of submitting data by row, you submit them by whole column.
So, to answer your question: a database is a method to store data. Technically, even a text file is a database, although not a particularly nice one. The choice of the model behind your database is mostly relative to what is the typical needs of your application.
Setting the answer as CW as I am probably saying something strictly not correct. Feel free to edit.
This is a rather broad question, but databases are well suited for managing relational data. Alternatives would almost always imply to design your own data storage and retrieval engine, which for most standard/small applications is not worth the effort.
A typical scenario that is not well suited for a database is the storage of large amounts of data which are organized as a relatively small amount of logical files, in this case a simple filesystem-like system can be enough.
Don't forget to take a look at NOSQL databases. It's pretty new technology and well suited for stuff that doesn't fit/scale in a relational database.
Use a database if you have data to store and query.
Technically, most things are suited for databases. Computers are made to process data and databases are made to store them.
The only thing to consider is cost. Cost of deployment, cost of maintenance, time investment, but it will usually be worth it.
If you only need to store very simple data, flat files would be an alternative (text files).
Note: you used the generic term 'database', but there are many many different types and implementations of these.
For search applications, full-text search engines (some of which are integrated to traditional DBMSes, but some of which are not), can be a good alternative, allowing both more features (various linguistic awareness, ability to have semi-structured data, ranking...) as well as better performance.
Also, I've seen applications where configuration data is stored in the database, and while this makes sense in some cases, using plain text files (or YAML, XML and such) and loading the underlying objects during initialization, may be preferable, due to the self-contained nature of such alternative, and to the ease of modifying and replicating such files.
A flat log file, can be a good alternative to logging to DBMS, depending on usage of course.
This said, in the last 10 years or so, the DBMS Systems, in general, have added many features, to help them handle different forms of data and different search capabilities (ex: FullText search a fore mentioned, XML, Smart storage/handling of BLOBs, powerful user-defined functions, etc.) which render them more versatile, and hence a fairly ubiquitous service. Their strength remain mainly with relational data however.

What is an example of a non-relational database? Where/how are they used?

I have been working with relational databases for sometime, but it only recently occurred to me that there must be other types of databases that are non-relational.
What are some examples of non-relational databases, and where/how are they used in the real world? Why would you choose to use a non-relational database over relational databases?
Edit: Two other similar questions have been mentioned in the answers:
Database system that is not relational.
Good reasons NOT to use a relational database?
An admittedly obscure but interesting alternative to the types of databases mentioned here is the associative database, such as Sentences, from LazySoft Technology. There is a free personal version you can download and try on your own. The Enterprise Edition is also free, but requires a request to the company.
Essentially, an associative database allows you to store information in much the same way as our brains do: as things and associations between those things. The name "Sentences" comes from the way this information can be represented in a subject-verb-object syntax:
Tom is brother to Laura
San Francisco is located in California
Mike has a credit limit of $10,000
A sentence may be the subject or object of another sentence:
(Bus 570 arrives at 8:15am) on Sundays
Mary says (the pie was baked by William)
So, everything can be boiled down to entities and associations.
There is, of course, much more to Sentences than what can be expressed here. I recommend that you take some time to read more about it in a white paper from LazySoft.
"The Associative Model of Data" is a book available in PDF format by Simon Williams, one of the creators of Sentences.
Flat file
CSV or other delimited data
spreadsheets
/etc/passwd
mbox mail files
Hierarchical
Windows Registry
Subversion using the file system, FSFS, instead of Berkley DB
A non-relational document oriented database we have been looking at is Apache CouchDB.
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.
Our interest was in providing a distributed access user preferences store that would be immune to shape changes to which we could serialize preference objects from Java and access those just as easily with Javascript from a XULRunner based client application.
Any database that claims to be a "Berkley style Database" or "Key/Value" Database is not relational.
These databases are usually based off complex hashing algorithms and provide a very fast lookup O(1) based off a key, but leave any form of relational goodness to end user.
For example, in a relational database, you would normalize your structure and join many tables together to create a single result set.
In a key/value database, you would denormalize as much as possible and then use a unique key to lookup data.
If you need to pull data from two sources, you would have to join the resulting set together by hand.
All databases were originally non-relational, it was only with the arrival of DB2 and Oracle in the mid 1980's that they became common. Before that most databases where either flat files or hierarchical.
Flat files are inherently boring, but hierarchical database are much less so, particularly as DB2 was actually implemented on top of an hierarchical implementation (namely VSAM) in the first instance. VSAM is I believe still around on mainframe systems and is of some considerable importance.
DB/1 (so obscure now I can't even find a wikipedia link) was IBM's predecessor prime-time database to DB2 (hence the name). This was hierarchical - basically you had a file which consisted of any number or 'root' records, generally directly accessible by a key. Each root record could then have any number of child records off it, each of which could in turn have it's own children. The net effect is a index file or root records with each root being the top of a potential tree-like structure. Accessing the child records could be tricky - there were limitations of direct access so usually you ended up traversing the tree looking for the record you needed. A 'database' could have any number of these files in it, usually related by keys.
This had major disadvantages - not least that actually doing anything demanded a full program written - basically the equivalent of a days work for what we can now do in SQL in a few minutes. However it really did score on execution speed, in those days a mainframe had about the processing power of your iPhone (albeit optimized for data I/O) and poor DB2 queries could kill a multi-million dollar installation dead. This was never an issue with DB/1 and in a world where programmers were less expensive than CPU time it made sense.
Google App Engine Datastore :
The App Engine datastore is not a relational database. While the datastore interface has many of the same features of traditional databases, the datastore's unique characteristics imply a different way of designing and managing data to take advantage of the ability to scale automatically.
The PI historical database from OSIsoft is non-relational. It's only made to archive timestamped data. It's used a lot by industry, especially as the back-end database for all those 'dashboards'.
There's no need to be relational in it, since there are no joins.
Other two types of databases that haven't come up yet:
Content Repositories are databases designed for content (i.e. files, documents, images, etc). They typically have additioan constructs such as a hierarchical way to browse content, search, transformation between different formats, versioning, and many other things. Examples - Alfresco, Documentum, JackRabbit, Day, OpenText, many other ECM vendors.
Directories, i.e. Active Directory, or LDAP Directories. These are databases designed for low-write / high read scenarios and highly distributed across high geographical distances / high latency connections. While mostly used for authentication / authorization, they don't have to be if your use case matches the requirements.
Dimensional Databases are great examples of non-relational databases. They are very commonly used for 'Business Dashboards'/'Business Intelligence' for KPIs and other types of aggregate or statistical data. They are usually populated from relational databases and can offer better performance in certain situations.
http://en.wikipedia.org/wiki/Dimensional_database
XML databases e.g. xindice
Object databases e.g. db4o
Be aware that the concept of relational databases is highly contentious. Purists such as C. J. Date would argue that many databases in common use (such as Oracle and SQL Server) do not comply sufficiently with the relational model to be termed 'relational'.
Non-Relational databases just do not meet Codd’s requirements.
Intersystems Caché seams a total re-write/re-design of the old Pick Operating system’s database. From the little I’ve read of Caché it appears to be a nicely done redesign.
It permits .net programs to access the database just like it would SQL. Caché’s run’s the Pick OS programs without requiring any changes. By importing your Pick files into Caché you can still run your old green screen applications with it, but also write new programs using .net so you can migrate to Windows Applications without abandoning the years of data design you’ve already invested in.
Here is some background on the Pick DB model . A Pick database uses totally variable length records, and fields. All table are keyed by a single unique key and are accessible without reading an index. Pick designed the system to use a Hashing algorithm that reads the item from disk on generally on the 1st physical read (assuming system maintenance was performed correctly). Fields in Pick are Non-Typed. All data is stored as string and Casting is up to the programmer. Nulls are stored as a empty string, thus a null does not take up disk space as it does in SQL. There is no need for Foreign Keys. In the ‘Relational world’ the DBA has to create and Order Header table, and an Order Line Item table. In the “Pick Model’ there is a single table. An example would be, ‘Order Date’ is a field that would store a # of days since ‘Dec 13, 1967’ (the data Pick OS was turned on for the first time). Pick programmers did not have Y2k problems. A 2nd column would be Customer Number. The big difference is when you get to the Product Number Column, it would be ‘Multi-Valued’ (the Codd Non-Conformance). In other words, the database can handle 1-32000 product#s in that column. Other columns like Quantity Ordered would be in a controlling/dependant relationship with the Product Number and would also be multi-valued. When you get to the Quantity Shipped, Pick would go to a third dimension and have Sub-Multi-Valued field. You would have a Shipment Number Column, and it would be Multi-Valued by Line Item and Sub-Multi-Valued containing The Shipment Quantity for that line for that shipment number. There are No Inner Joins needed. All data for that Order is stored in One Table and in a single record. No orphan rows ever!
Secondly the data definition is a bit different. Our dictionaries can contain definitions for data that is not in this table or is being manipulated. A couple of examples are, Customer Name. It would be defined as ‘Use the Customer number column and return the Name field from the Customer Table. Another Example is Line Item Extension would be defined as a calculation of Quantity*Price/PricePer.
I believe I read somewhere Caché claims to have over 100,000 installations.
I would think a flat-file database in Excel is non-relational and used by quite a few people.
It is really just a database table that can not be joined with other tables.
Object-Oriented Databases are one interesting type of non-relational database.
The trading sector sometimes uses OO Databases since each deal/contract can look sort of like others in that category but have unique attributes as well. VERY difficult to represent it relationally.
eXist-db is an xml database that has been around for a long time. It is particularly useful for xquery over tons of xml documents.
Any file or group of files that contains data but does not express relationships within that data is a non-relational database.
RRDtool is designed to store and aggregate log data. You configure a sampling interval and feed data into it, then it returns time-based results. It's optimized for fixed-size storage, and starts aggregating past results after a time. For example, suppose you have a round-robin database with a 5-minute time interval. Even if you send it temperature data once per second, it still only stores the results in 5-minute increments. After a week, it averages those results into hourly values. After a month, hourly results are averaged into daily numbers, and so on.
RRDtool is commonly used as the backend for tools like Cricket and MRTG to track network and environmental data for months and years at a stretch.
For a graph based dbms you have neo4j
For a hierarichal dbms you have any standard filesystem or with "schema" support any LDAP implementation.
There are many answers but they all end up being in one of two major categories:
Navigational. Includes Tree/Hierarchy databases and Graph databases.
Databases that break first normal form (multiple values). Includes Pick databases and Lotus Notes and its offspring like CouchDB.
EDIT: And of course key/value stores like BDB aren't relational, but that just goes without saying doesn't it? I mean, they're just key/value stores.
dBase. Although it was marketed as such, it doesn't meet the requirements.
As an OO database, Intersystems Caché comes to mind. Some medical and library systems are built on this.
In my company, www.smartsgroup.com, we have a proprietary database engine we call a "transaction log database". It is built on flat files, each file containing a sequence of "events" or "messages", in binary format, plus various indexes on this data and algorithms for reproducing the state of a stock exchange's orderbook. It is highly optimised for sequential updates and sequential access.
In scientific applications, it is also common to use proprietary database engines rather than RDBMS's. I also worked for a company that has the world's largest database of EEG brain recordings: www.brainresource.com . There we use a flat file database, and it worked well for us.
SmartsGroup also uses a temporal database, which is like a non-relational database table, except that we store a history of all changes to all fields so we can reproduce the state of a particular row on a particular date.
The Wiki page for Dimensional Databases linked to above seems to have disappeared.
Some OLAP Systems have are backed by Multidimensional databases (MOLAP) these are used often in financial analysis. They afford interactive clients that allow one to navigate through the data at different levels of aggregation.
At my university there is a group that researches deductive databases.

Resources