In my course we've learned about different database models, i.e., how data is theoretically stored using trees, etc.
My question is more practical: I want to know how the data actually ends up on storage. Are there algorithms that distribute the data across the hard drives?
So in big databases, is the data spread randomly across the storage, or are the drives filled one by one, each until it is full, before the next one receives data?
Storing data in a database is not the major problem. Rather, the problems lie in viewing and manipulating the stored data: how frequently you want to access the data, when you want to access it, which data you want to access, and so on, are the questions to be answered. If you are sure about the details of your data manipulation, you can design your tables accordingly and choose appropriate techniques to manipulate the data.
You can refer to any book on database management systems.
Advanced DBMS Tutorials
I am reading the book Big Data For Dummies.
Welcome to Big Data For Dummies. Big data is becoming one of the most
important technology trends that has the potential for dramatically
changing the way organizations use information to enhance the customer
experience and transform their business models.
Big data enables organizations to store, manage, and manipulate vast
amounts of data at the right speed and at the right time to gain the
right insights. The key to understanding big data is that data has to
be managed so that it can meet the business requirement a given
solution is designed to support. Most companies are at an early stage
with their big data journey.
I can understand that "store" means we have to store the data in a DBMS.
My questions on the above text:
What does the author mean by "manage vast amounts of data" in the above context? An example would be helpful.
What does the author mean by "organizations transform their business models" with big data? Again, an example would be helpful.
What does the author mean by "manipulate vast amounts of data" in the above context?
Following are the answers to your questions:
1. What does the author mean by "manage vast amounts of data" in the above context? An example would be helpful.
Ans. When we talk about big data, it is data at scale that we mean. "Vast amounts of data" in the above context hints at the volume of data that big data platforms can process: somewhere in the range of terabytes to petabytes, or even more. This volume of data is unmanageable for the age-old relational systems.
Example: Twitter, Facebook, Google, etc. handle petabytes of data on a daily basis.
2. What does the author mean by "organizations transform their business models" with big data? Again, an example would be helpful.
Ans. With the use of big data technologies, organizations can gain deep insights into their business models and accordingly make future strategies that help them win more business share in the market.
Example: The online retail giant Amazon thrives on user data that helps it understand users' online shopping patterns; it then creates more products and services that are likely to boost the business and put it well ahead of its competitors.
3. What does the author mean by "manipulate vast amounts of data" in the above context? An example would be helpful.
Ans. We can manage humongous amounts of data with big data platforms, but managing is not enough. So we use sophisticated tools that help us manipulate the data in such a way that it turns into business insights and, ultimately, money.
Example: Clickstream data. This data consists of users' clicks on websites, how much time they spent on a particular site or on a particular item, etc. All these things, when manipulated properly, result in greater business insight about the users and hence greater profit.
A vast amount of data means very large files: not megabytes or gigabytes, but possibly terabytes. For example, some social networking sites generate approximately 6 TB of data every day.
Organizations have been using traditional RDBMSs to handle their data, but they are now implementing Hadoop and Spark to manage big data more easily. So, day by day, they are changing their business tactics with the help of this new technology, easily getting a view of their customers through insight analysis.
Your assumption/understanding
"I can understand store means we have to store in DBMS"
reflects how things were long ago. I am addressing that aspect in this detailed answer, so you get the Big Data concept clear upfront. (I will answer your listed questions in a subsequent post.)
It's not just a DBMS/RDBMS any more. It's data storage in general, from file systems to data stores.
In the Big Data context, it refers to:
a) big data (the data itself);
b) a storage system: a distributed file system (highly available, scalable, and fault-tolerant, targeting high throughput and low latency) handling far larger volumes of data (not necessarily homogeneous or of one type) than a traditional DBMS can, in terms of both I/O and durable/consistent storage;
c) by extension, the Big Data ecosystem: the systems, frameworks, and projects that handle, interact with, or are based on the above two. Example: Apache Spark.
It can store virtually any file, including raw files as-is. The DBMS-equivalent data storage systems for Big Data also allow giving structure to data, or storing already structured data.
Just as you store data on an ordinary user device (a computer, a hard disk, or an external hard disk), you can think of a Big Data store as a cluster (a defined, configurable, networked collection of nodes) of commodity hardware and storage components, each with at least a configurable network IP (you usually need to mount or attach a storage device or disk to a computer or server so that it has an IP), that together provide a single, aggregated, distributed data/file store.
So data can be: structured (traditional DBMS equivalent), relational structured (RDBMS equivalent), unstructured (e.g., text files and more), and semi-structured files/data (CSV, JSON, XML, etc.).
With respect to Big Data, it can be flat files, text files, log files, image files, video files, or binary files.
There is also row-oriented and/or column-oriented data (when structured/semi-structured data is stored or treated as database/data-warehouse data. Example: Hive is a data warehouse on Hadoop that allows storing structured relational data, CSV files, etc., either in their as-is file format or in a specific one such as Parquet, Avro, or ORC).
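To make that Hive/format point concrete, here is a minimal PySpark sketch (assuming a Hive-enabled Spark build; the file path and table name are hypothetical) that reads a CSV file as-is and rewrites it as a Parquet-backed table:

```python
from pyspark.sql import SparkSession

# A Hive-enabled Spark build is assumed; the HDFS path and table name below
# are hypothetical, for illustration only.
spark = (SparkSession.builder
         .appName("csv-to-parquet-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read a plain CSV file from the distributed file system as-is.
df = spark.read.option("header", "true").csv("hdfs:///data/events.csv")

# Rewrite it as a Hive-managed table stored in the Parquet format.
df.write.mode("overwrite").format("parquet").saveAsTable("events_parquet")
```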
In terms of volume/size: individual files can be MBs, GBs, or sometimes TBs (KB-sized files are not recommended), aggregating to TBs and PBs (or more; there is no official limit as such) across the store/system at any point in time.
It can be batch data, discrete streams, or real-time streaming data and feeds.
(Wide Data goes beyond Big Data in terms of nature, size and volume etc.)
Book for Beginners:
1. In terms of books for beginners, "Big Data For Dummies" is not a bad option (I have not personally read it, but I got to know the series' style during my software engineering degree studies way back).
2. I suggest you go for the book "Hadoop: The Definitive Guide". Go for the latest edition, which happens to be the 4th Edition (2015). It is based on Hadoop 2.x. Though it has not been enhanced with the latest 2.x updates, you will find it a really good book to read.
Beyond:
Though Hadoop 3 is in the alpha phase, you need not worry about that just now.
Follow the Apache Hadoop site and documentation though. (ref: http://hadoop.apache.org/)
Know and learn the Hadoop Ecosystem as well.
(Big Data and Hadoop are almost synonymous nowadays, though Hadoop is based on the Big Data concept. Hadoop is an open-source Apache project, used in production.)
The file system I mentioned is HDFS (Hadoop Distributed File System) (and/or similar ones).
Otherwise, there are the cloud storage systems, including AWS S3, Google Cloud Storage, and Azure Blob Storage (object storage).
Big data can also be stored in NoSQL databases, which function as non-relational, flexible-schema data stores, but are not optimised for strictly relational data. If you store relational data in them, relational constraints are removed/broken by default. They are not inherently SQL-oriented, though SQL-like interfaces are provided. Examples include HBase (on top of HDFS, based on Bigtable), Cassandra, and MongoDB, chosen depending on the data type (or direct files) being stored and which of the CAP theorem's properties they favour.
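As a small illustration of the flexible-schema point, here is a minimal sketch using pymongo, MongoDB's Python client (the database and collection names are made up):

```python
from pymongo import MongoClient

# A local MongoDB instance is assumed; names below are hypothetical.
client = MongoClient("mongodb://localhost:27017")
items = client.bigdata_demo.items

# Two documents with different shapes can live in the same collection --
# there is no fixed relational schema enforcing columns or constraints.
items.insert_one({"type": "log", "host": "web-1", "lines": 5231})
items.insert_one({"type": "image", "path": "/data/img/001.png",
                  "tags": ["raw", "unprocessed"]})
print(items.count_documents({}))  # 2
```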
I have a very basic question: if we are talking about manipulating millions of records, why do we need to keep millions of records in memory? Whatever record we need, we can fetch from the database, manipulate in memory using some data structure, and write back to the database.
I'll give one example.
At work we deal with language learning, so we have enormous data sets of words and phrases (hundreds of thousands of words, though it could easily reach millions as time goes on).
Good use of data structures is crucial to a successful application. As @Juan Lopes said, keeping everything in databases is slow and impractical. What happens if I need to manipulate multiple values or run an algorithm on a data set? I need to retrieve that data from my database first in order to do this.
An argument can be made that algorithms can be added to the database to solve this problem. More often than not, however, you will not have ownership of the database, or you will be consuming data from a server whose code you do not have permission to modify.
Also, depending on which data structures you use, you can save large amounts of time! Take the map/dictionary: by doing one O(n) pass over the data to create a map, I can then access any of the data in O(1) time if I know the key I'm looking for. Running a database query will rarely produce such fast results. Moreover, in modern applications the database is often on a server far away from your program, so the time to retrieve the data is compounded by that of an HTTP request, which could easily take 10x as long as the query itself.
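A minimal Python sketch of that map/dictionary idea (the records are invented for illustration):

```python
# One O(n) pass builds the index; every lookup afterwards is O(1) on average.
records = [
    {"id": 1, "word": "hello", "language": "en"},
    {"id": 2, "word": "bonjour", "language": "fr"},
    {"id": 3, "word": "hola", "language": "es"},
]

# O(n): build a dictionary keyed by id.
by_id = {record["id"]: record for record in records}

# O(1): direct lookup by key -- no scan, no round trip to a database.
print(by_id[2])  # {'id': 2, 'word': 'bonjour', 'language': 'fr'}
```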
In the end, there is a good reason that data structures are a fundamental part of any good programmer's toolbox, and why they are taught so vigorously in universities.
What is the difference between a dataset and a database? If they are different, then how?
Why is huge data difficult to manage using databases today?
Please answer independently of any programming language.
In American English, database usually means "an organized collection of data". A database is usually under the control of a database management system, which is software that, among other things, manages multi-user access to the database. (Usually, but not necessarily. Some simple databases are just text files processed with interpreted languages like awk and Python.)
In the SQL world, which is what I'm most familiar with, a database includes things like tables, views, stored procedures, triggers, permissions, and data.
Again, in American English, dataset usually refers to data selected and arranged in rows and columns for processing by statistical software. The data might have come from a database, but it might not.
Database
The definition of the two terms is not always clear. In general, a database is a set of data organized and accessible using a database management system (DBMS). Databases usually, but not always, are composed of several tables linked together, often accessed, modified, and updated by various users, often simultaneously.
Cambridge dictionary:
A structured set of data held in a computer, especially one that is
accessible in various ways.
Merriam-Webster:
a usually large collection of data organized especially for rapid
search and retrieval (as by a computer)
Data set (or dataset)
A data set sometimes refers to the contents of a single database table, but this is quite a restrictive definition. In general, as the name suggests, it is a set (or collection) of data; hence there are datasets of images, like the Caltech-256 Object Category Dataset, or of videos, e.g., a large-scale benchmark dataset for event recognition in surveillance video. A data set is usually intended for analysis rather than for continual updates from different users; hence it represents the end result of a data collection, or a snapshot at a specific time.
Oxford dictionary:
A collection of related sets of information that is composed of
separate elements but can be manipulated as a unit by a computer.
‘all hospitals must provide a standard data set of each patient's
details’
Cambridge dictionary
a collection of separate sets of information that is treated as a
single unit by a computer
A dataset is the data: usually in a table, though it can be XML or other types of data. However, it's only data; it doesn't really do anything.
And, as you know, a database is a container for the dataset, usually with built-in infrastructure around it for interacting with the data.
Huge data isn't hard to manage for what I do. I guess you're asking a study-related question?
A dataset is just a set of data (which may be related for one purpose and not for another), whereas a database is a software/hardware component that organizes and stores data or datasets. Practically, they are different things.
Huge data needs more infrastructure and components (hardware and software), i.e., computing power and storage, for efficient storage and retrieval. More data means more components, and hence more difficulty. Modern databases provide good infrastructure to handle the processing (both reads and writes) of huge data; see, for example, data lake management by Microsoft, which manages relational data and datasets extensively.
The project I have been given is to store and retrieve unstructured data from a third-party. This could be HR information – User, Pictures, CV, Voice mail etc or factory related stuff – Work items, parts lists, time sheets etc. Basically almost any type of data.
Some of these items may be linked, so a User may have a picture, for example. I don't need to examine the content of the data, as my storage solution will receive the data as XML and send it out as XML. It's down to the recipient to convert the XML back into a picture or sound file, etc. The recipient may request all Users, so I need to be able to find User records and their related "child" items such as pictures, or the recipient may just want the pictures.
My database is MS SQL and I have to stick with that. My question is, are there any patterns or existing solutions for handling unstructured data in this way.
I’ve done a bit of Googling and have found some sites that talk about this kind of problem but they are more interested in drilling into the data to allow searches on their content. I don’t need to know the content just what type it is (picture, User, Job Sheet etc).
To those who have given their comments:
The problem I face is the linking of objects together. A User object may be added to the data store, and then at a later date the user's picture may be added. When the User is requested, I will need to return both the User object and its associated Picture. The user may update their picture, so you can see I need to keep relationships between objects. That is what I was trying to get across in the second paragraph. The problem I have is that my solution must be very generic, as I should be able to store anything and link these objects according to the end users' requirements, e.g., Users, Pictures, and emails, or Work Items, Parts Lists, etc. I see that Microsoft has developed Zentity, which looks like it may be useful, but I don't need to drill into the data contents, so it's probably overkill for what I need.
I have been using Microsoft Zentity since version 1, and whilst it is excellent at storing huge amounts of structured data and allowing (relatively) simple access to that data, if your data structure is likely to change then recreating the 'data model' (and the regression testing) would probably remove the benefits of using such a system.
Another point worth noting is that Zentity requires filestream storage so you would need to have the correct version of SQL Server installed (2008 I think) and filestream storage enabled.
Since you deal with XML, it's not unstructured data. Microsoft SQL Server 2005 and later have an XML column type that you can use.
Now, if you don't need to access the XML nodes, and you think you never will, go with plain varbinary(max). For your information, storing XML content in an XML-typed column lets you not only retrieve XML nodes directly through database queries, but also validate the XML data against schemas, which may be useful to ensure that the content you store is valid.
Don't forget to use FILESTREAM (SQL Server 2008 or later) if your XML data grows in size (2 MB+). This is probably your case, since voice mail or pictures can easily be larger than 2 MB, especially when they are Base64-encoded inside an XML file.
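As a minimal sketch of those options (all table, column, and connection details below are hypothetical; the DDL is executed from Python via pyodbc):

```python
import pyodbc

# Hypothetical connection string; adjust driver, server, and database.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=localhost;DATABASE=Store;Trusted_Connection=yes;")
cursor = conn.cursor()

# Option 1: a typed XML column -- nodes are queryable, and the content can
# optionally be validated against an XML schema collection.
cursor.execute("""
    CREATE TABLE StoredItems (
        ItemId   INT IDENTITY PRIMARY KEY,
        ItemType NVARCHAR(50) NOT NULL,  -- 'User', 'Picture', 'JobSheet', ...
        Payload  XML NOT NULL
    )
""")

# Option 2: if the nodes never need querying, declare Payload as
# VARBINARY(MAX) instead of XML and store the content opaquely.

# Option 3: for payloads regularly over ~2 MB, declare the column as
# VARBINARY(MAX) FILESTREAM (requires FILESTREAM enabled on the instance).
conn.commit()
```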
Since your data is quite freeform and changeable, your best bet is to put it on a plain old file system, not a relational database. By all means store some meta-information in SQL where it makes sense to search through structured data relationships, but if your main data content is not structured with data relationships, then you're doing yourself a disservice by using an SQL database.
The filesystem is blindingly fast to lookup files and stream them, especially if this is an intranet application. All you need to do is share a folder and apply sensible file permissions and a large chunk of unnecessary development disappears. If you need to deliver this over the web, consider using WebDAV with IIS.
A reasonably clever file and directory naming convention, with a small piece of software you write to help people get to the right path, will, hands down, always beat any SQL database for both access speed and sequential data streaming. Filesystem paths and file names will always beat any clever SQL index for data-location speed. And plain old files are the ultimate unstructured, flexible data store.
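A minimal sketch of such a naming convention (the directory layout below is just one assumption of how you might bucket items):

```python
from pathlib import Path

ROOT = Path("/srv/datastore")  # hypothetical shared folder

def item_path(item_type: str, item_id: str, extension: str) -> Path:
    """Bucket items by type and by the first two characters of their id,
    so no single directory grows unboundedly large."""
    return ROOT / item_type / item_id[:2] / f"{item_id}.{extension}"

print(item_path("users", "ab12345", "xml"))
# /srv/datastore/users/ab/ab12345.xml
```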
Use SQL for what it's good for. Use files for what they are good for. Best tools for the job and all that...
You don't really need any pattern for this implementation. Store all your data in a BLOB entry. Read from it when required and then send it out again.
You would probably need to investigate other infrastructure aspects, like periodically cleaning up the database to remove expired entries.
Maybe I'm not understanding the problem clearly.
Am I right to say that all you need to store is a blob of XML with whatever binary information is contained within? Why can't you have a users table and then a linked (foreign key) table of user objects, keyed by userId, as in the sketch below?
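A minimal, self-contained sketch of that shape, using Python's built-in sqlite3 purely so the example runs anywhere (the real target would be MS SQL; all names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        userId INTEGER PRIMARY KEY
    );
    CREATE TABLE user_objects (
        objectId   INTEGER PRIMARY KEY,
        userId     INTEGER NOT NULL REFERENCES users(userId),
        objectType TEXT NOT NULL,   -- 'Picture', 'CV', 'VoiceMail', ...
        payload    BLOB NOT NULL    -- the XML / binary content, stored opaquely
    );
""")

conn.execute("INSERT INTO users (userId) VALUES (1)")
conn.execute(
    "INSERT INTO user_objects (userId, objectType, payload) VALUES (?, ?, ?)",
    (1, "Picture", b"<xml>...</xml>"),
)

# Fetch a user together with all related child objects.
rows = conn.execute(
    "SELECT objectType, payload FROM user_objects WHERE userId = ?", (1,)
).fetchall()
print(rows)
```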
I have an interesting database problem. I have a DB that is 150GB in size. My memory buffer is 8GB.
Most of my data is rarely being retrieved, or mainly being retrieved by backend processes. I would very much prefer to keep them around because some features require them.
Some of it (namely some tables, and some identifiable parts of certain tables) is used very often in a user-facing manner.
How can I make sure that the latter is always being kept in memory? (there is more than enough space for these)
More info:
We are on Ruby on Rails. The database is MySQL, and our tables use the InnoDB storage engine. We are sharding the data across 2 partitions. Because we are sharding, we store most of our data as JSON blobs and index only the primary keys.
Update 2
The tricky thing is that the data is actually being used for both backend processes as well as user facing features. But they are accessed far less often for the latter
Update 3
Some people are commenting that 8 GB is a toy these days. I agree, but just increasing the memory is pure LAZINESS if there is a smarter, more efficient solution.
This is why we have Data Warehouses. Separate the two things into either (a) separate databases or (b) separate schema within one database.
Data that is current, for immediate access, being updated.
Data that is historical fact, for analysis, not being updated.
150 GB is not very big, and a single database can handle your little bit of live data and your big bit of history.
Use a "periodic" ETL process to get things out of the active database, denormalize them into a star schema, and load them into the historical data warehouse, as in the sketch below.
If the number of columns used in the customer-facing tables is small, you can create indexes containing all the columns used in the queries. This doesn't mean that all the data stays in memory, but it can make the queries much faster: it's trading space for response time.
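A minimal sketch of such a covering index (names are made up; sqlite3 is used only so the example is self-contained, but the same idea applies to MySQL/InnoDB):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        status TEXT,
        total REAL
    );
    -- Covers queries that filter on customer_id and read only status/total.
    CREATE INDEX idx_orders_covering ON orders (customer_id, status, total);
""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT status, total FROM orders WHERE customer_id = 42"
).fetchall()
print(plan)  # the plan should mention a COVERING INDEX scan
```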
This calls for memcached! I'd recommend using cache-money, a great ActiveRecord write-through caching library. The ngmoco branch has support for enabling caching per model, so you could cache only those things you know you want to keep in memory.
You could also do the caching by hand using $cache.set/get/expire calls in controller actions or model hooks.
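For a rough idea of what that hand-rolled caching looks like outside Ruby, here is a minimal cache-aside sketch using pymemcache, a Python memcached client (the key scheme and the load_user_from_db function are hypothetical stand-ins):

```python
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))  # local memcached assumed

def load_user_from_db(user_id):
    # Placeholder for the real database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: skip the database
    user = load_user_from_db(user_id)      # cache miss: fetch and store
    cache.set(key, json.dumps(user), expire=300)  # keep hot for 5 minutes
    return user
```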
With MySQL, proper use of the Query Cache will keep frequently queried data in memory. You can provide a hint to MySQL not to cache certain queries (e.g. from the backend processes) with the SQL_NO_CACHE keyword.
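A minimal sketch of the hint in use (note the query cache exists only in MySQL 5.7 and earlier; the table name and credentials below are hypothetical, run here via mysql-connector-python):

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="...", database="app")
cursor = conn.cursor()

# Backend/reporting query: tell MySQL not to store the result in the query
# cache, so it does not evict the user-facing hot data.
cursor.execute("SELECT SQL_NO_CACHE COUNT(*) FROM audit_log")
print(cursor.fetchone())
```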
If the backend processes are accessing historical data, or accessing data for reporting purposes, certainly follow S. Lott's suggestion to create a separate data warehouse and query that instead. If a data warehouse is too much to accomplish in the short term, you can replicate your transactional database to a different server and perform queries there (a Data Warehouse gives you MUCH more flexibility and capability, so go down that path if possible)
UPDATE:
See the documentation for SELECT and scroll down to SQL_NO_CACHE.
Read about the Query Cache.
Ensure query_cache_type is set appropriately for your needs.
UPDATE 2:
I confirmed with MySQL support that there is no mechanism to selectively cache certain tables, etc., in the InnoDB buffer pool.
So, what is the problem?
First, 150 GB is not very large today. It was 10 years ago.
Second, any non-total-crap database system will use your memory as a cache. If the cache is big enough (compared to the amount of data in use), it will be efficient. If not, the only thing you CAN do is get more memory (because, sorry, 8 GB of memory is VERY low for a modern server; it was low 2 years ago).
You should not have to do anything for the memory to be used efficiently. At least not on a commercial-grade database; maybe MySQL sucks, but I would not assume this.