I need to implement a simple graph database engine. What are the things I should consider? First, I am confused about which data structure to use, I mean a graph representation (like an adjacency matrix or adjacency list) or the actual graph itself? I need this to be scalable.
Later, how do I store the graph on the hard disk as files? After I store the graph data in the form of files, I would also need a way to selectively load only certain files into the graph, since I cannot load everything into RAM at once. Sorry for being vague, but I need someone to point me in the right direction. Also, please suggest a language I can use; can I use Python for this project? Thank you.
Depending on your needs you will implement a different interface to the database, i.e. an adjacency matrix or the graph itself.
Instead of using a file-based database, the important step forward you can take is to use a key/value store like bsddb, leveldb or wiredtiger (preferred). This will deal with caching often-accessed data, provide ACID semantics, and, in the case of wiredtiger, indices.
The storage layer built upon the key/value store can have several layouts; it depends on the final interface you need.
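For illustration, here is a minimal sketch of one possible layout in Python, assuming the plyvel binding for LevelDB; the key prefixes and JSON encoding are my own choices, not a standard. The prefix scan is what lets you load only one node's neighbourhood rather than the whole graph:

    import json
    import plyvel  # LevelDB binding: pip install plyvel

    db = plyvel.DB('/tmp/graphdb', create_if_missing=True)

    def add_node(node_id, properties):
        # one key per node: "node:<id>" -> JSON-encoded properties
        db.put(f'node:{node_id}'.encode(), json.dumps(properties).encode())

    def add_edge(src, dst, properties):
        # one key per edge: "edge:<src>:<dst>"; keys sharing the
        # "edge:<src>:" prefix sort together on disk
        db.put(f'edge:{src}:{dst}'.encode(), json.dumps(properties).encode())

    def neighbours(src):
        # prefix scan: reads only this node's edges, not the whole graph
        prefix = f'edge:{src}:'.encode()
        for key, value in db.iterator(prefix=prefix):
            yield key[len(prefix):].decode(), json.loads(value)

    add_node('a', {'label': 'start'})
    add_edge('a', 'b', {'weight': 1.5})
    print(list(neighbours('a')))  # [('b', {'weight': 1.5})]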
To get started with developing custom databases on top of key/value stores, I recommend you read SO questions and answers, mostly about leveldb and bsddb.
Like the following:
store list in key value database
How to give multiple values to a single key using a dictionary?
Use integer keys in Berkeley DB with python (using bsddb3)
Expressing multiple columns in berkeley db in python?
Related
I recently completed the course Algorithms, Part I and hence learned about a lot of data structures that are used to store data in appropriate ways.
I realize that today, for persisting data, we use databases rather than writing data structures ourselves. We can't change the data structures used by the database (or can we?). So when we go for permanent storage, do we always go for a database (won't that hit performance), or is there a way in which both of them are used in combination?
Databases themselves are comprehensively designed to perform efficiently under all sorts of conditions, and they are used to store large volumes of data compactly. If your needs are simpler and smaller, like those of a small program, plain data structures written to a flat file would work just as well.
Also, I feel you are comparing flat-file storage, where the data structures are chosen by you and directly visible, with a database, where the data structures used to store the data and perform the data-manipulation operations are not visible to the user.
But databases do use hashing, indexing and other data structures to implement their storage techniques. They are efficiently coded around internal data structures that we generally don't see from the outside, but they are built on the same techniques.
Data structures that you learnt (I suppose), like lists, maps, trees etc., are the core concepts behind modern relational databases.
For example:
The B-tree is used in many databases. The B+-tree is used in many well-known databases as well as in common filesystems like NTFS.
SQLite uses B+-trees.
SQL Server uses heaps or B-trees.
Majority of databases use a combination of these data structures, sometimes implementing a customized version. These are optimized for high performance.
For permanent storage you could use the file system and implement any data structure of your choice by hand, but that would be like re-inventing the wheel.
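For instance, in Python you already get a B-tree-backed, single-file store from the standard library's sqlite3 module, so there is rarely a reason to hand-roll one (the table and column names below are just for illustration):

    import sqlite3

    conn = sqlite3.connect('points.db')  # a single file on disk
    conn.execute(
        'CREATE TABLE IF NOT EXISTS points (id INTEGER PRIMARY KEY, x REAL, y REAL)')
    conn.executemany('INSERT INTO points (x, y) VALUES (?, ?)',
                     [(1.0, 2.0), (3.0, 4.0)])
    conn.commit()

    # this lookup is served by SQLite's internal B-tree, not a linear scan
    for row in conn.execute('SELECT x, y FROM points WHERE id = ?', (1,)):
        print(row)
    conn.close()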
I have been playing around with using graphs to analyze big data. It's been working great and really fun, but I'm wondering what to do as the data gets bigger and bigger.
Let me know if there's any other solution, but I thought of trying HBase because it scales horizontally and I can get Hadoop to run analytics on the graph (most of my code is already written in Java). But I'm unsure how to structure a graph in a NoSQL database: I know each node can be an entry in the database, but I'm not sure how to model edges and add properties to them (like names of nodes, attributes, PageRank, weights on edges, etc.).
Seeing how HBase/Hadoop is modeled after Bigtable and MapReduce, I suspect there is a way to do this, but I'm not sure how. Any suggestions?
Also, does what I'm trying to do make sense? Or are there better solutions for big-data graphs?
You can store an adjacency list in HBase/Accumulo in a column oriented fashion. I'm more familiar with Accumulo (HBase terminology might be slightly different) so you might use a schema similar to:
SrcNode(RowKey) EdgeType(CF):DestNode(CFQ) Edge/Node Properties(Value)
Where CF=ColumnFamily and CFQ=ColumnFamilyQualifier
You might also store node/vertex properties as separate rows using something like:
Node(RowKey) PropertyType(CF):PropertyValue(CFQ) PropertyValue(Value)
The PropertyValue could be either in the CFQ or the Value
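As a sketch of how that schema maps onto concrete cells (plain Python with invented helper names; a real Accumulo or HBase client would take the same fields in its mutation API):

    def edge_cell(src, edge_type, dst, properties):
        # SrcNode(RowKey)  EdgeType(CF):DestNode(CFQ)  Properties(Value)
        return (src, edge_type, dst, properties)

    def node_property_cell(node, prop_type, prop_value):
        # Node(RowKey)  PropertyType(CF):PropertyValue(CFQ)  empty Value
        return (node, prop_type, prop_value, '')

    cells = [
        edge_cell('user1', 'follows', 'user2', '{"since": "2012"}'),
        node_property_cell('user1', 'name', 'Alice'),
    ]

    def edges_of(cells, src, edge_type):
        # equivalent to a scan over row `src` restricted to column
        # family `edge_type`
        return [(cfq, value) for row, cf, cfq, value in cells
                if row == src and cf == edge_type]

    print(edges_of(cells, 'user1', 'follows'))  # [('user2', '{"since": "2012"}')]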
From a graph-processing perspective, as mentioned by @Arnon Rotem-Gal-Oz, you could look at Apache Giraph, which is an implementation of Google Pregel. Pregel is the method Google uses for large-scale graph processing.
Using HBase/Accumulo as input to Giraph was recently (7 Mar 2012) submitted as a new feature request to Giraph: HBase/Accumulo Input and Output formats (GIRAPH-153)
You can store the graph in HBase as an adjacency list, so, for example, each row would have columns for general properties (name, PageRank, etc.) and a list of keys of adjacent nodes (if it is a directed graph, then just the nodes you can reach from this node, or an additional column with the direction of each); see the sketch below.
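As a rough sketch of what such a row could look like through the happybase Python client (assuming a 'graph' table with 'prop' and 'adj' column families already created; all names here are placeholders):

    import happybase  # pip install happybase; needs the HBase Thrift server

    conn = happybase.Connection('localhost')
    table = conn.table('graph')

    # one row per node: general properties in 'prop', one 'adj' column
    # per outgoing edge, with the edge weight as the cell value
    table.put(b'node:alice', {
        b'prop:name': b'Alice',
        b'prop:pagerank': b'0.25',
        b'adj:node:bob': b'1.0',    # edge alice -> bob, weight 1.0
        b'adj:node:carol': b'2.5',  # edge alice -> carol, weight 2.5
    })

    # read back only the adjacency columns of a single node
    row = table.row(b'node:alice', columns=[b'adj'])
    for qualifier, weight in row.items():
        print(qualifier, weight)

One caveat: as a later answer here points out, HBase performs best with a single column family, so 'prop' and 'adj' could also be merged into one family with distinct qualifier prefixes.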
Take a look at Apache Giraph (you can also read a little more about it here); while this isn't about HBase, it is about handling graphs in Hadoop.
Also, you may want to look at Hadoop 0.23 (and up), as the YARN engine (aka MapReduce 2) is more open to non-MapReduce algorithms.
I would not use HBase in the way "Binary Nerd" recommended it as HBase does not perform very well when handling multiple column families.
Best performance is achieved with a single column family (a second one should only be used if you very often access the content of only one column family and the data stored in the other column family is very large).
There are graph databases built on top of HBase you could try and/or study.
Apache S2Graph
provides a REST API for storing and querying graph data represented as edges and vertices. There you can find a presentation where the construction of row/column keys is explained, along with an analysis of operation performance that influenced, or was influenced by, the design.
Titan
can use other storage backends besides HBase, and has integration with analytics frameworks. It is also designed with big data sets in mind.
How do graph databases store data to persistent storage?
I would expect that every implementation of a graph database uses a different approach.
To take one example, look at Neo4j's NeoStore class, and the other kinds of store it refers to. It seems that Neo4j uses multiple files, each containing fixed-length records; one for nodes, one for keys of properties of nodes, one for values of properties of nodes, etc. Records in each contain indexes to refer to records in the others. It seems overcomplicated to me, but it evidently seemed like a good idea to the guys who wrote it!
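To make the fixed-length-record idea concrete, here is a toy sketch in Python; the record layout is invented for illustration and is not Neo4j's actual format:

    import struct

    # each node record: a 4-byte id of its first property record and a
    # 4-byte id of its first relationship record; fixed size means
    # record N lives at byte offset N * size, so a lookup is a seek,
    # not a search
    NODE_RECORD = struct.Struct('<ii')

    def write_node(f, node_id, first_prop, first_rel):
        f.seek(node_id * NODE_RECORD.size)
        f.write(NODE_RECORD.pack(first_prop, first_rel))

    def read_node(f, node_id):
        f.seek(node_id * NODE_RECORD.size)
        return NODE_RECORD.unpack(f.read(NODE_RECORD.size))

    with open('nodes.store', 'w+b') as f:
        write_node(f, 0, first_prop=7, first_rel=3)
        print(read_node(f, 0))  # (7, 3): indexes into the other store files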
To know more about how OrientDB stores graphs look at: http://code.google.com/p/orient/wiki/Concepts#Storage
What is the difference between graph-based databases (http://neo4j.org/) and object-oriented databases (http://www.db4o.com/)?
I'd answer this differently: object and graph databases operate on two different levels of abstraction.
An object database's main data elements are objects, the way we know them from an object-oriented programming language.
A graph database's main data elements are nodes and edges.
An object database does not have the notion of a (bidirectional) edge between two things with automatic referential integrity etc. A graph database does not have the notion of a pointer that can be NULL. (Of course one can imagine hybrids.)
In terms of schema, an object database's schema is whatever the set of classes is in the application. A graph database's schema (whether implicit, by convention of what String labels mean, or explicit, by declaration as models as we do it in InfoGrid for example) is independent of the application. This makes it much simpler, for example, to write multiple applications against the same data using a graph database instead of an object database, because the schema is application-independent. On the other hand, using a graph database you can't simply take an arbitrary object and persist it.
Different tools for different jobs I would think.
Yes, the API seems like the major difference, but it is not really a superficial one. Conceptually, a set of objects forms a graph, and you could imagine an API that treats this graph in a uniform way. Conversely, you could in theory mine a generic graph structure for patterns and map them to objects exposed via some API. But the design of an actual product's API will generally have consequences for how data is actually stored and how it can be queried, so it would be far from trivial to, say, create a wrapper and make it look like something else. Also, an object-oriented database must offer some integrity guarantees and a typing structure that a graph database won't normally have. In fact, serious OO databases are far from "free form" :)
Take a look at HyperGraphDB - it is both a full object-oriented database (like db4o) and a very advanced graph database, both in terms of representational and querying capabilities. It is capable of storing generalized hypergraphs (where edges can point to more than one node and also to other edges), it has a fully extensible type system embedded as a graph etc.
Unlike other graph databases, in HyperGraphDB every object becomes a node or an edge in the graph, with none-to-minimal API intrusion and you have the choice of representing your objects as a graph or treating them in a way that is orthogonal to the graph structure (as "payload" values of your nodes or edges). You can do sophisticated traversals, customized indexing and querying.
For an explanation of why HyperGraphDB is in fact an ODMS, see the blog post Is HyperGraphDB an OO Database? on Kobrix's website.
As Will describes from another angle, a graphdb will keep your data separated from your application classes and objects. A graphdb also has more built-in functionality to deal with graphs, obviously - like shortest path or deep traversals.
Another important difference is that in a graphdb like neo4j you can traverse the graph based on relationship (edge) types and directions without loading the full nodes (including node properties/attributes). There's also the choice of using neo4j as the backend of an object db while still being able to use all the graphy stuff; see jo4neo. Another project has a different approach that could also count as an object db on top of neo4j: neo4j.rb. A new option is to use Spring Data Graph, which provides graphdb support through annotations.
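As a sketch of that kind of type- and direction-constrained traversal using the official neo4j Python driver (the connection details, label and relationship type here are placeholders):

    from neo4j import GraphDatabase  # pip install neo4j

    driver = GraphDatabase.driver('bolt://localhost:7687',
                                  auth=('neo4j', 'secret'))

    with driver.session() as session:
        # the traversal is constrained by relationship type (KNOWS) and
        # direction (outgoing), up to three hops deep
        result = session.run(
            'MATCH (a:Person {name: $name})-[:KNOWS*1..3]->(b) '
            'RETURN DISTINCT b.name AS name',
            name='Alice')
        for record in result:
            print(record['name'])
    driver.close()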
The same question was asked in the comments to this blogpost.
From a quick browse of both their websites:
The major difference is the way the APIs are structured, rather than the kind of free-form database you can build with them.
db4o uses an object mapping - you create a Java/C# class, and it uses reflection to persist it in the database.
neo4j has an explicit manipulation API.
Neo4j seemed, in my humble opinion, much nicer to interact with.
You might also consider a key-value store - you could make exactly the same free-form database with one of those.
The difference at a low level is not so huge. Both manage relationships as direct links without costly joins, and both have a way to traverse relationships with the query language, but a graph database has operators to recurse to the Nth level.
But the biggest difference is in the domain: in a graph database everything is based on two types, vertices and edges, even though you can usually define your own types as a sort of subtype of Vertex or Edge.
In the ODBMS you have no Vertex and Edge concepts, unless you write your own.
With graph databases, you have a slight semblance of a chance that it is based on mathematical graph theory. With Object-oriented databases, you have the certainty that it is based on nothing at all (and most certainly no mathematical theory at all).
I'm writing a CAD (Computer-Aided Design) application. I'll need to ship a library of 3d objects with this product. These are simple objects made up of nothing more than 3d coordinates and there are going to be no more than about 300 of them.
I'm considering using a relational database for this purpose. But given my simple needs, I don't want anything complicated. Till now, I'm leaning towards SQLite. It's small, runs within the client process and is claimed to be fast. Besides, I'm a poor guy and it's free.
But before I commit myself to SQLite, I just wish to ask your opinion whether it is a good choice given my requirements. Also is there any equivalent alternative that I should try as well before making a decision?
Edit:
I failed to mention earlier that the above-said CAD objects that I'll ship are not going to be immutable. I expect the user to edit them (change dimensions, colors etc.) and save back to the library. I also expect users to add their own newly-created objects. Kindly consider this in your answers.
(Thanks for the answers so far.)
The real thing to consider is what your program does with the data. Relational databases are designed to handle complex relationships between sets of data. However, they're not designed to perform complex calculations.
Also, the amount of data and relative simplicity of it suggests to me that you could simply use a flat file to store the coordinates and read them into memory when needed. This way you can design your data structures to more closely reflect how you're going to be using this data, rather than how you're going to store it.
Many languages provide a mechanism to write data structures to a file and read them back in again called serialization. Python's pickle is one such library, and I'm sure you can find one for whatever language you use. Basically, just design your classes or data structures as dictated by how they're used by your program and use one of these serialization libraries to populate the instances of that class or data structure.
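A minimal sketch of that approach in Python, with an invented Model3D class standing in for whatever your application actually uses:

    import pickle
    from dataclasses import dataclass, field

    @dataclass
    class Model3D:
        name: str
        vertices: list = field(default_factory=list)  # (x, y, z) tuples

    def save_library(models, path):
        with open(path, 'wb') as f:
            pickle.dump(models, f)

    def load_library(path):
        with open(path, 'rb') as f:
            return pickle.load(f)

    cube = Model3D('cube', [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)])
    save_library([cube], 'library.bin')
    print(load_library('library.bin')[0].name)  # 'cube'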
Edit: The requirement that the structures be mutable doesn't really change much with regard to my answer - I still think that serialization and deserialization are the best solution to this problem. The fact that users need to be able to modify and save the structures necessitates a bit of planning to ensure that the files are updated completely and correctly, but ultimately I think you'll end up spending less time and effort with this approach than trying to marshal SQLite or another embedded database into doing this job for you.
The only case in which a database would be better is if you have a system where multiple users are interacting with and updating a central data repository, and for a case like that you'd be looking at a database server like MySQL, PostgreSQL, or SQL Server for both speed and concurrency.
You also commented that you're going to be using C# as your language. .NET has support for serialization built in so you should be good to go.
I suggest you consider using H2; it's really lightweight and fast.
When you say you'll have a library of 300 3D objects, I'll assume you mean objects for your code, not models that users will create.
I've read that object databases are well suited to help with CAD problems, because they're perfect for chasing down long reference chains that are characteristic of complex models. Perhaps something like db4o would be useful in your context.
How many objects are you shipping? Can you define each of these objects and their coordinates in an XML file, basically using a distinct XML file for each object? You can place these XML files in a directory; this can be a simple structure.
I would not use an SQL database. You can easily describe every 3D object with an XML file. Put the files in a directory and pack (zip) them all. If you need easy access to the objects' metadata, you can generate an index file (containing only names or descriptions) so that not all objects must be parsed and loaded into memory (nice if you have something like a library manager).
There are quick and easy SAX parsers available, and you can easily write an XML writer (or find some free code to use for this).
Many similar applications use XML today. It's easy to parse and write, human-readable, and doesn't need much space when zipped.
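For example, using Python's standard xml.etree.ElementTree (the element and attribute names are my own invention):

    import xml.etree.ElementTree as ET

    # write one object to its own XML file
    root = ET.Element('object', name='cube')
    for x, y, z in [(0, 0, 0), (1, 0, 0), (1, 1, 0)]:
        ET.SubElement(root, 'vertex', x=str(x), y=str(y), z=str(z))
    ET.ElementTree(root).write('cube.xml')

    # read it back
    tree = ET.parse('cube.xml')
    points = [(float(v.get('x')), float(v.get('y')), float(v.get('z')))
              for v in tree.getroot().iter('vertex')]
    print(points)  # [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]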
I have used SQLite; it's easy to use and easy to integrate with your own objects. But I would prefer an SQL database like SQLite for applications where you need good search tools over a huge number of data records.
For the specific requirement, i.e. to provide a library of objects shipped with the application, a database system is probably not the right answer.
The first thing that springs to mind is that you probably want the file to be updatable, i.e. you need to be able to drop an updated file into the application without changing the rest of the application.
The second thing is that the data you're shipping is immutable; for this purpose, therefore, you don't need the capabilities of a relational db, just the ability to access a particular model with adequate efficiency.
For simplicity (sort of) an XML file would do nicely as you've got good structure. Using that as a basis you can then choose to compress it, encrypt it, embed it as a resource in an assembly (if one were playing in .NET) etc, etc.
SQLite stores its data in a single file per database, so if you have other reasons to need the capabilities of a db in your storage system then yes, but I'd want to think about the utility of the db to the app as a whole first.
SQL Server CE is free, has a small footprint (no service running), and is SQL Server compatible.