The major operations we perform on any data structure are insertion, deletion, and searching, all of which can also be done with database queries. So what is the use of a data structure?
What makes a data structure unique from a database?
A data structure shows how the objects in a problem are modeled and organized.
For example,
Your shopping items are organized linearly in an array;
Your company's org chart is modeled as a tree;
Facebook connections are organized as a huge graph.
The problem a data structure solves is how to model real-world objects logically so that we can work with them computationally.
A database is about how the information is persisted. The data in a data structure may be persisted to a database if necessary, or it may not be.
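As a tiny R sketch of those three shapes (all values invented for illustration):

# Shopping items: a linear structure (a vector/array)
shopping <- c("milk", "eggs", "bread")

# An org chart: a tree, modeled here as nested lists
org <- list(name = "CEO",
            reports = list(list(name = "CTO", reports = list()),
                           list(name = "CFO", reports = list())))

# Social connections: a graph as an adjacency list
friends <- list(alice = c("bob", "carol"),
                bob   = "alice",
                carol = "alice")

Any of these could later be persisted to a database, or simply live and die in memory.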
At first this seems like a very silly question, but it really is not. The answer can turn your world upside down (maybe). So let's divide and conquer the question.
Data Structure
Loosely speaking,
“A data structure is a specialized format for organizing and storing data. Any data structure is designed to organize data to suit a specific purpose so that it can be used according to needs, and it is normally stored in RAM.” - Wikipedia
Database
“A database is an organized collection of data. It is a collection of database objects, normally stored on a hard disk.” - Wikipedia
The Misconceptions
Well, to be honest, it is a misconception that the term data structure applies only to structures that live in RAM. Not even close. Yes, data structures normally reside in RAM (a.k.a. main memory), but they can also live on a hard disk (a.k.a. secondary storage). Stop discriminating by storage medium!
The Similarities and the Differences
Well, a database is nothing more than a collection of database objects that are stored on a hard disk. A database management system uses a database for its own ends. To make myself perfectly clear, database objects here means schemas, tables, views, indexes, users, etc.
There are many kinds of database objects, and each object can be implemented using one or more data structures.
For example, a B-tree/B+-tree index is implemented using the B-tree data structure, and a hash-based index will obviously use a hash table to resolve a key to an address.
It is safe to say that a database is a collection of different data structures. The exact types can vary significantly based on the technology and the operating system.
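As a loose illustration of the hash-index idea, R's environments are hash tables under the hood. This sketch only shows key-to-address resolution, not how any real database engine implements it:

# An environment acting as a hash table that resolves a key to an address
index <- new.env(hash = TRUE)
assign("user:42", "page 7, slot 3", envir = index)   # made-up address

get("user:42", envir = index)   # constant-time-ish lookup by key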
A database is a collection of tables (and possibly stored procedures, functions, views, etc.).
Let's keep it simple for now, though. Each table has a table structure that defines what can be placed in it. With NoSQL databases this is different, as they are more loosey-goosey. Again, though, let's keep it simple for now.
A database might be named anything, such as "Platypus". It can contain many tables, such as "DuckbillsInTheWild" and "DuckbillsInCaptivity", etc.
One of these tables may have the structure:
Name              Data Type
---------------------------
ID                int
Name              VarChar
Weight            Float
PoisonToeLength   Float
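For instance, here is a hedged sketch of creating and querying such a table with DBI/RSQLite in R (the SQLite column types approximate those listed above, and the inserted row is invented):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbExecute(con, "
  CREATE TABLE DuckbillsInTheWild (
    ID              INTEGER PRIMARY KEY,
    Name            TEXT,
    Weight          REAL,
    PoisonToeLength REAL
  )")

dbExecute(con, "INSERT INTO DuckbillsInTheWild (Name, Weight, PoisonToeLength)
                VALUES ('Perry', 2.4, 1.2)")
dbGetQuery(con, "SELECT * FROM DuckbillsInTheWild")
dbDisconnect(con)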
A data structure lives in memory while a database lives in storage. Roughly speaking, a data structure uses volatile memory and a database uses non-volatile storage.
A data structure is a logical representation of how your data is organized, while a database is middleware that helps you store the data in a file system.
Suppose you are working at a social network company. Intuitively, the social network is a graph (a data structure). You can apply graph algorithms here to solve practical problems. Meanwhile, the data in this graph must be persisted somewhere so that you can read and write it. Instead of writing code to save it into files yourself, a graph database can help you there.
One is logical, the other is middleware (closely related to physical).
IMO, database is a term that refers to a collection of records, which could be anything: inventory data, user data, etc. How we store those records on a computer is a separate technical issue; I can use a linked list, array, tree, graph, etc. depending on the use case. Data structure is the technical term given to these structures. In short, database is more of a business or folklore term, while data structure is a technical term referring to how a collection of records is organized.
A database is the conceptual view of storing data, like Excel sheets or CSV files gathered into data sets, and the data it stores is permanent in nature. A data structure is the logical representation of data stored in memory, like an array or a graph, and the data it stores is temporary in nature.
Data structures are the basic requirement for implementing databases and data stores.
The main difference between a database and a data structure is that a database is a collection of data that is stored and managed in permanent storage, while a data structure is a way of storing and arranging data efficiently in temporary memory.
Overall, data itself is just raw, unprocessed facts.
Related
I'm building a mobile application that records information about items and then outputs an automatically generated report.
Each item may be of one of various types, and each type requires different information to be recorded. The user needs to be able to specify what is to be stored for each type.
Is there a "best" way to store this type of information in a relational database?
My current plan is to have a Type table that maps Types to Attributes that need to be recorded for that Type. Does this sound sensible? I imagine that it may get messy when I come to produce reports from this data.
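For concreteness, here is a minimal sketch of the Type-to-Attribute mapping I have in mind, using DBI/RSQLite in R; all table and column names are hypothetical:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# One row per item type
dbExecute(con, "CREATE TABLE ItemType (TypeId INTEGER PRIMARY KEY, TypeName TEXT)")

# One row per attribute that must be recorded for a given type
dbExecute(con, "
  CREATE TABLE TypeAttribute (
    AttributeId  INTEGER PRIMARY KEY,
    TypeId       INTEGER REFERENCES ItemType(TypeId),
    AttrName     TEXT,
    AttrDataType TEXT
  )")

# Recorded values live in a tall key-value table (the EAV pattern)
dbExecute(con, "
  CREATE TABLE ItemValue (
    ItemId      INTEGER,
    AttributeId INTEGER REFERENCES TypeAttribute(AttributeId),
    Value       TEXT
  )")
dbDisconnect(con)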
I guess I need a way of generalising the information that needs to be recorded?
I think I just need some pointers in the right direction.
Thanks!
Only a suggestion, might not be an answer... use JSON and go for a NoSQL database. Today it is more convenient to operate on and play around with data in a format that is not strictly relational.
That way you can define a model (or models), or create your own data structure as mentioned, and store it easily as a collection of documents of that model. NoSQL also allows structure changes without obligating you to define an entire "column" for all "rows" present there ;)
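For example, one item might be stored as a self-describing document; a sketch with jsonlite in R (field names invented):

library(jsonlite)

# Each item is a self-describing document; different item types can
# carry different fields without any schema change
item <- list(
  type     = "bicycle",
  recorded = list(frameSize = "54cm", gears = 21)
)
toJSON(item, auto_unbox = TRUE, pretty = TRUE)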
Check out this explanation of MongoDB and NoSQL.
There is also a beautiful post that I love about data modeling in NoSQL.
Usually I use this data structure to store information in my DWH fact tables:
Every month we get requests to add more metrics and dimensions to this fact table, so as the number of entries increases, it becomes more complex.
I was wondering about migrating these structures to another kind of data structure, something like this:
That way I would not have to change my table structure, and I could delegate the metric calculations to the business layer.
Does anyone know the canonical name of these structures, so that I can start looking for documentation?
Also, if someone knows the disadvantages of using this...
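In case the images above do not render, here is a rough sketch (in R, with invented names) of the kind of tall key-value layout I mean, where a new metric becomes a new row rather than a new column:

fact <- data.frame(
  date      = as.Date(c("2024-01-01", "2024-01-01")),
  entity_id = c(1L, 1L),
  metric    = c("revenue", "units_sold"),
  value     = c(1250.0, 42)
)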
I was exposed to the world of tables and data structures in R before RDBMS systems and other database systems. It is quite elegant in R/Python to create tables and lists from structured data (.csv or other formats) and then do data manipulations programmatically.
Last year I attended a course in database management and learnt all about structured and unstructured databases. I also noticed that it is the norm to feed data from multiple sources into databases rather than use the sources directly in R (for convenience and discipline?).
For research purposes, R seems to suffice for joining, appending, or even complicated data manipulations.
The question that keeps arising is:
When should one use R directly, via commands such as read.csv, and when should one create a database and query its tables through the R-SQL interface?
For instance, suppose I have multi-source data such as: (a) person-level information (age, gender, smoking habits), (b) outcome variables (such as surveys taken by them in real time), (c) covariate information (environment characteristics), (d) treatment input (occurrence of an event that modifies the outcome, i.e. the survey response), and (e) time and space information of participants taking the survey.
How should one approach data collection and processing in this case? There may be standard industry procedures, but I put this question forward here to understand the range of feasible and optimal approaches that individuals and small groups of researchers can adopt.
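For what it's worth, here is a sketch of the two access paths I am weighing, in R (file, table, and column names are made up):

# Path 1: read flat files straight into memory
people <- read.csv("person_level.csv")

# Path 2: stage the sources in a database and query subsets
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "study.db")
dbWriteTable(con, "person", people, overwrite = TRUE)
smokers <- dbGetQuery(con, "SELECT * FROM person WHERE smoker = 1")
dbDisconnect(con)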
What you're describing when you say "that it is the norm to feed data from multiple sources of data into databases" sounds more specifically like a data warehouse. Databases are used for many reasons, and in plenty of situations they will hold data from one source - for instance, a database used as the data store of a transactional system will often only hold the data needed to run that system, and the data produced by that system.
The process you're describing is commonly called Extract, Transform, Load (ETL), and you might find looking up information about ETL and data warehousing helpful if you decide to go in the direction of combining your data prior to working with it in R.
I can't tell you which you should choose, or the optimal way of accomplishing it, because it will vary in different situations and might even come down to opinion. What I can tell you are some of the reasons why people create data warehouses, and you can decide for yourself whether it might be useful in your situation:
A data warehouse can provide a central location to hold combined data. This means that people do not need to combine the data themselves each time they need to use that specific combination of data. Unlike something like a simple one-off report or extract of combined data, it should provide some flexibility, letting people obtain the combined set of data they need for a specific task. Very often, in enterprise situations, multiple things are then run on top of the same combined set of data: multidimensional data analysis tools (cubes), reports, data mining, etc.
Some of the benefits of this might include:
Individuals saving time when they otherwise would have needed to combine the data themselves.
If the data which needs to be combined is complex, or some people do not have proficiency at handling that part of the process, then there is less risk of data being combined incorrectly; you can be sure that different pieces of work have used the same source data.
If the data suffers from data quality issues, you resolve this once in the data warehouse, rather than working around it or resolving it repeatedly in code.
If new data is constantly being received, collection and integration of this into the data warehouse can be carried out automatically.
Like I say, I can't decide for you whether this is a useful direction or not - as with any decision of this kind you'll need to weigh up the costs of implementing such a solution against the benefits, and both will be specific to your individual case. But hopefully this answers your core question of why someone might choose to do this work in a database instead of in their code, and gives you a starting point to work from.
Suppose I have a list of R objects which are themselves lists. Each list has a defined structure: data, a model that fits the data, and some attributes identifying the data. One example would be time series of certain economic indicators in particular countries. So my list object has the following elements:
data - the historical time series for the economic indicator
country - the name of the country, USA for example
name - the indicator name, GDP for example
model - the ARIMA orders found by auto.arima, in a suitable format; this again may be a list.
This is just an example. As I said, suppose I have a number of such objects combined into a list. I would like to save it in some suitable format. The obvious solution is simply to use save, but this does not scale very well for a large number of objects: for example, if I only wanted to inspect a subset of the objects, I would need to load all of them into memory.
If my data were a data.frame, I could save it to a database. If I wanted to work with a particular subset of the data, I would use SELECT and rely on the database to deliver the required subset. SQLite has served me well in this regard. Is it possible to replicate this for my described list object with some fancy database like MongoDB? Or should I simply think about how to convert my list into several related tables?
My motivation is to be able to easily generate various reports on the fitted models. I can write a bunch of functions that produce a report on a given object and then just use lapply on my list of objects. Ideally I would like to parallelise this process, but that is another problem.
I think I explained the basics of this somewhere once before. The gist of it is that:
R has complete serialization and deserialization support built in, so you can in fact take any existing R object and turn it into either a binary or a textual serialization. My digest package uses that to turn the serialization into a hash, using different hash functions.
R has all the database connectivity you need.
Now, what a suitable format and database schema would be... will depend on your specifics. But there is (as usual) nothing in R stopping you :)
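To make that concrete, here is a minimal sketch along those lines; the object layout follows the question, while the key scheme and file naming are my own invention:

library(digest)

# One fitted-model object, shaped as described in the question
obj <- list(data    = ts(rnorm(24), frequency = 12),
            country = "USA",
            name    = "GDP",
            model   = list(order = c(1, 1, 0)))

# Built-in serialization: the same bytes could go into a database BLOB
payload <- serialize(obj, connection = NULL)

# A content hash makes a convenient key (filename, db key, ...)
key <- digest(obj)

# Simplest store: one RDS file per object, addressed by its key
saveRDS(obj, file = paste0(key, ".rds"))
restored <- readRDS(paste0(key, ".rds"))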
This question has been inactive for a long time. Since I had a similar concern recently, I want to add the pieces of information that I've found out. I recognise three demands in the question:
to have the data stored in a suitable structure
scalability in terms of size and access time
the possibility to efficiently read only subsets of the data
Besides the option to use a relational database, one can also use the HDF5 file format, which is designed to store a large number of possibly large objects. The choice depends on the type of data and the intended way of accessing it.
Relational databases should be favoured if:
the atomic data items are small-sized
the different data items possess the same structure
it cannot be anticipated which subsets of the data will be read out
convenient transfer of the data from one computer to another is not an issue, or the computers where the data is needed have access to the database.
The HDF5 format should be preferred if:
the atomic data items are themselves large objects (e.g. matrices)
the data items are heterogeneous, so it is not possible to combine them into a table-like representation
most of the time the data is read out in groups which are known in advance
moving the data from one computer to another should not require much effort
Furthermore, one can distinguish between relational and hierarchical relationships, where the latter is contained in the former. Within an HDF5 file, the chunks of information can be arranged in a hierarchical way, e.g.:
/Germany/GDP/model/...
/Germany/GNP/data
/Austria/GNP/model/...
/Austria/GDP/data
The rhdf5 package for handling HDF5 files is available on Bioconductor. General information on the HDF5 format is available here.
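A small sketch of that layout using rhdf5 (group and dataset names follow the example above; the data are placeholders):

library(rhdf5)   # Bioconductor package

h5createFile("indicators.h5")
h5createGroup("indicators.h5", "Germany")
h5createGroup("indicators.h5", "Germany/GDP")

# Write the historical series as a dataset inside the group
h5write(rnorm(40), "indicators.h5", "Germany/GDP/data")

# Later, read back only the subset that is needed
gdp <- h5read("indicators.h5", "Germany/GDP/data")
h5ls("indicators.h5")   # inspect the hierarchy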
Not sure if it is the same, but I had some good experience with time series objects using:
str()
Maybe you can look into that.
I'm currently speccing out a project that stores threaded comment trees.
For those of you unfamiliar with what I'm talking about, I'll explain: basically, every comment has a parent comment, rather than just belonging to a thread. Currently I'm working on a relational SQL Server model for storing this data, simply because it's what I'm used to. It looks like so:
Id              int            --PK
ThreadId        int            --FK
UserId          int            --FK
ParentCommentId int            --FK (relates back to Id)
Comment         nvarchar(max)
Time            datetime
What I do is select all of the comments by ThreadId, then recursively build out my object tree in code. I'm also doing a join to get things like the user's name.
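My real code isn't R, but as a sketch of the recursive build, starting from the flat rows returned by the ThreadId query (data invented):

# Flat comment rows, as returned by the ThreadId query
comments <- data.frame(
  Id       = c(1, 2, 3, 4),
  ParentId = c(NA, 1, 1, 2),
  Comment  = c("root", "reply A", "reply B", "reply to A")
)

# Recursively gather each comment's children into a nested list
build_tree <- function(parent_id) {
  rows <- if (is.na(parent_id)) {
    which(is.na(comments$ParentId))
  } else {
    which(comments$ParentId == parent_id)
  }
  lapply(rows, function(i)
    list(id       = comments$Id[i],
         comment  = comments$Comment[i],
         children = build_tree(comments$Id[i])))
}

tree <- build_tree(NA)   # the whole thread as a nested structure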
It just seems to me that maybe a document store like MongoDB, which is NoSQL, would be a better choice for this sort of model. But I don't know anything about it.
What would be the pitfalls if I do choose MongoDB?
If I'm storing it as a document in MongoDB, would I have to include the user's name on each comment to avoid having to pull up each user record by key, since it's not "relational"?
Do you have to aggressively cache "related" data on the objects you need them on when you're using MongoDB?
EDIT: I did find this article about storing trees of information in MongoDB. Given that one of my requirements is the ability to show a logged-in user a list of his recent comments, I'm now strongly leaning towards just using SQL Server, because I don't think I'll be able to do anything clever with MongoDB that will result in real performance benefits. But I could be wrong. I'm really hoping an expert (or two) on the matter will chime in with more information.
The main advantage of storing hierarchical data in Mongo (and other document databases) is the ability to store multiple copies of the data in ways that make queries more efficient for different use cases. In your case, it would be extremely fast to retrieve the whole thread if it were stored as a hierarchical nested document, but you'd probably also want to store each comment un-nested or possibly in an array under the user's record to satisfy your 2nd requirement. Because of the arbitrary nesting, I don't think that Mongo would be able to effectively index your hierarchy by user ID.
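For illustration, a whole thread stored as one nested document might look like this (sketched here with R's jsonlite; field names are invented):

library(jsonlite)

thread <- list(
  threadId = 17,
  comments = list(list(
    user    = "alice",
    text    = "First!",
    replies = list(list(user = "bob", text = "a nested reply", replies = list()))
  ))
)
toJSON(thread, auto_unbox = TRUE, pretty = TRUE)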
As with all NoSQL stores, you get more benefit by being able to scale out to lots of data nodes, allowing for many simultaneous readers and writers.
Hope that helps