How can you migrate your data from one graph database (Neo4j, TigerGraph, etc.) to another?
Background:
I have to decide between the W3C standards (RDF(S), OWL) and
databases for property graphs (Neo4j, TigerGraph, etc.).
I know that all "triple stores" that support the W3C standards also make it possible to simply "pull out" the data
and import it into another triple store.
For relational databases there is also the SQL standard (and its dialects),
so with a little effort you can get the data from one relational database to another.
But I can't think of such a solution for graph databases.
As someone already mentioned, for property graphs there is no defined standard as of now. There are efforts underway to build such a standard, called GQL: https://www.gqlstandards.org/
As for importing data from RDF into property graphs: TigerGraph and Neo4j both provide options to load RDF data into their respective platforms. This might not provide complete switch-over capability from RDF to property graphs, but it can help in certain scenarios.
For interchanging data between property graphs, you might have to re-create the schema when you switch platforms. For data loading, most property graph databases provide an option to load from CSVs, as sketched below.
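For illustration, here is a minimal sketch of the CSV route for Neo4j, using its Java driver to run Cypher's LOAD CSV. The connection details, file name, label, and columns are all hypothetical, and TigerGraph has its own loading-job mechanism for the same task.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class CsvImport {
    public static void main(String[] args) {
        // Placeholder connection details and credentials.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // LOAD CSV reads a file exported from the source database;
            // the Person label and its columns are made up for the example.
            session.run("LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row "
                      + "MERGE (p:Person {id: row.id}) "
                      + "SET p.name = row.name");
        }
    }
}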
I am new to the Semantic Web and I have a very basic question about
the Jena RDF Dataset. I read in the documentation that a dataset is
a collection of graphs (or Models in the Java API). If I view a graph
(a Model) as the RDF alternative to a relational database's 'table', can I view
the dataset as a 'database'?
If so, then with TDB I should be able to create multiple
datasets. However, the documentation says 'Every dataset
obtained via TDBFactory.createDataset(Location) within a JVM is the same
dataset.' (http://jena.apache.org/documentation/tdb/datasets.html). I
also recall that the documentation says somewhere that TDB does not
support multiple JVMs for now. Does that mean that with TDB I can create ONLY ONE
dataset?
This is Andy's answer to my question on the jena-users mailing list. Thanks, Andy.
Hi, Everyone
I am new to the Semantic Web and I have a very basic question about
the Jena RDF Dataset. I read in the documentation that a dataset is
a collection of graphs (or Models in the Java API). If I view a graph
(a Model) as the RDF alternative to a relational database's 'table', can I view
the dataset as a 'database'?
yes - sort of.
If so, then with TDB I should be able to create multiple
datasets. However, the documentation says 'Every dataset
obtained via TDBFactory.createDataset(Location) within a JVM is the same
dataset.' (http://jena.apache.org/documentation/tdb/datasets.html).
... for the same "location" argument ...
TDBFactory.createDataset("DB1") ;
TDBFactory.createDataset("DB2") ;
are different datasets
I
also recall that the documentation says somewhere that TDB does not
support multiple JVMs for now. Does that mean that with TDB I can create ONLY ONE
dataset?
TDB is the core database engine; when used directly, you are using it
in a kind of embedded mode.
You can use Jena Fuseki for sharing a dataset between applications (just
like you might share an SQL database between apps, except it's HTTP, not
JDBC).
Andy
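A minimal sketch of what Andy describes, using TDB in embedded mode (the directory names and IRIs below are placeholders):

import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.tdb.TDBFactory;

public class TdbDatasets {
    public static void main(String[] args) {
        // Different locations: two independent, persistent datasets.
        Dataset db1 = TDBFactory.createDataset("DB1");
        Dataset db2 = TDBFactory.createDataset("DB2");

        // The same location within one JVM hands back the same dataset.
        Dataset db1Again = TDBFactory.createDataset("DB1");

        // Each dataset holds a default graph plus any number of named graphs.
        Model model = db1.getDefaultModel();
        model.createResource("http://example.org/s")
             .addProperty(model.createProperty("http://example.org/p"), "o");

        db1.close();
        db2.close();
    }
}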
Is there a way I can generate a database schema from an Erlang application, like I can with Hibernate?
I assume you mean Mnesia, and if that is the case, you don't really understand the nature of the Mnesia database: it is, by its very design and implementation, "schemaless". You could write some really messy, ugly code that walked a Mnesia database and tried to document the various records in it, but that would be a largely futile exercise. If you are storing records in Mnesia, you already have the "schema" in the .hrl files where those records are defined.
There's nothing like NHibernate for SQL databases in Erlang.
Check out SumoDB
Overview
sumo_db gives you a standard way to define your db schema, regardless of the db implementation (mongo, mysql, redis, elasticsearch, etc.).
Your entities encapsulate behavior in code (i.e. functions in a module) and state in a sumo:doc() implementation.
sumo is the main module. It translates your own state to and from sumo's internal records.
Each store is managed by a worker pool of processes, each one using a module that implements sumo_store and calls the actual db driver (e.g: sumo_store_mnesia).
Some native domain events are supported, which are dispatched through gen_event:notify/2 automatically when an entity is created, updated, or deleted; also when a schema is created and when all entities of a given type are deleted. Events are described in this article.
Full conditional logic support when using the find_by/2 and delete_by/2 functions. You can find more information about the syntax of these conditional logic operators here.
Support for sorting (asc or desc) on multiple fields using the find_by/5 and find_all/4 functions. For example, [{age, desc}, {name, asc}] will sort by age descending and then by name ascending.
Support for docs/models validations through sumo_changeset (check out the Changeset section).
If you are looking for a Java Hibernate-style object-to-SQL mapping framework in Erlang, you may have to write your own mapping module. One option is to map Erlang records to SQL. Any such framework has to take care of the type mapping. Here is the link to Erlang's ODBC type mapping: http://erlang.org/doc/apps/odbc/databases.html#type
Erlang's ETS, and Mnesia (which is an extension of ETS), are very flexible and efficient for managing records. If these two databases cannot be your choice, you might have to implement the record mapping yourself.
I have a number of objects, each with an arbitrary number of shared and distinct property-value pairs (more specifically: files and their related properties, such as width and height for images, or album/artist/length for music files). I'd like to be able to search for objects having specific property values (such as by album), group by property, etc.
What kind of database would you suggest for this scenario? Due to the need for modularity (the ability to add more properties on the fly), and the fact that common properties make up less than 20% of all properties, standard SQL with normalized tables wouldn't really cut it. I have already tried approaching the problem with a "skinny data model"; however, I ran into serious scalability issues.
Are there any specialized databases tuned for this scenario (BSD-licensed solutions preferred)? Or is there any alternative way to tweak standard RDBMSs for this?
Searching for objects by their properties makes me think of an RDF datastore. Have a look at an RDF API (see Jena, Sesame, Virtuoso).
Or BerkeleyDB?
What you're talking about is called the EAV model, or a triple store. The latter can be queried with SPARQL.
Pierre is right; a triplestore is what you want, and RDF is the standard for that. SPARQL is the standard language for querying it (a lot like SQL for RDBMSs).
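To sketch how this could look with Jena (the namespace and property names below are invented for the example), each file becomes a resource carrying only the properties that apply to it, and SPARQL handles the search-by-album part:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;

public class FileCatalog {
    public static void main(String[] args) {
        String ns = "http://example.org/files#";  // hypothetical namespace
        Model model = ModelFactory.createDefaultModel();
        Property album = model.createProperty(ns, "album");
        Property width = model.createProperty(ns, "width");

        // Each file gets only the properties that make sense for it.
        model.createResource(ns + "track1.mp3").addProperty(album, "Abbey Road");
        model.createResource(ns + "photo1.jpg").addLiteral(width, 1024L);

        // Search by album, much as SQL would query a table.
        String query = "PREFIX f: <" + ns + ">\n"
                     + "SELECT ?file WHERE { ?file f:album \"Abbey Road\" }";
        QueryExecution qe = QueryExecutionFactory.create(query, model);
        try {
            ResultSetFormatter.out(qe.execSelect());
        } finally {
            qe.close();
        }
    }
}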
Have a look at the databases offered by various Cloud services:
Google AppEngine Datastore (based on Bigtable)
Amazon SimpleDB
Microsoft SDS
Apache CouchDB
If Cloud databases aren't an option, BerkeleyDB could be a good choice.
What other types of database systems are out there? I've recently come across CouchDB, which handles data in a non-relational way. It got me thinking about what other models people are using.
So, I want to know what other types of data models are out there. (I'm not looking for specifics; I just want to see how other people are handling data storage. My interest is purely academic.)
The ones I already know are:
RDBMS (mysql,postgres etc..)
Document based approach (couchDB, lotus notes)
Key/value pair (BerkeleyDB)
db4o
Quote from the "about" page:
db4o is the open source object database that enables Java and .NET developers to store and retrieve any application object with only one line of code, eliminating the need to predefine or maintain a separate, rigid data model.
Older non-relational databases:
Network Database
Hierarchical Database
Both mostly went out of style when relational databases became feasible.
Column-oriented databases are also a bit of a different animal, though many of them do support standard relational SQL. These are generally used for data-warehouse-type applications.
The Semantic Web is also a non-relational data storage paradigm. There are no relations; all metadata is stored in the same way as data, and every entity potentially has its own unique set of attributes. Open-source projects that implement RDF, a Semantic Web standard, include Jena and Sesame.
Isn't Amazon's SimpleDB non-relational?
db4o, as mentioned by Eric, is an object-oriented database management system (OODBMS).
There are object-based databases (GemStone, for example). Google's Bigtable and Amazon's Simple Storage Service I am not sure how you would categorize, but both are map-reduce based.
A non-relational, document-oriented database we have been looking at is Apache CouchDB.
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.
Our interest was in providing a distributed user-preferences store that would be immune to shape changes, to which we could serialize preference objects from Java and access them just as easily with JavaScript from a XULRunner-based client application.
I'd like to expand on Bill Karwin's answer about the semantic web and triplestores, since it's what I am working on at the moment, and I have something to say about it.
The idea behind a triplestore is to store a graph-based database whose data model is rooted in RDF. With RDF, you describe nodes and the associations among nodes (in other words, edges). Data is organized in triples:
start node ----relation----> end node
(in RDF speak: subject --predicate--> object). With this very simple data model, any data network can be represented by adding more and more triples, provided you give a meaning to nodes and relations.
RDF is very general, and its graph-based data model is well suited to searching for all triples with a particular combination of subject, predicate, and object, in any combination. Eventually, through a query language called SPARQL, you can also perform more complex queries, an operation that boils down to a graph isomorphism search on the graph, both in terms of topology and in terms of node/edge meaning (we'll see this in a moment). SPARQL allows you only SELECT (and similar) queries: no DELETE, no INSERT, no UPDATE. The information you query for (e.g. the specific nodes you are interested in) is mapped into a table, which is what you get as the result of your query.
Now, topology in itself does not mean a lot. For this, a schema language was invented. Actually, more than one, and calling them schema languages is, in some cases, very limiting. The most famous and most used today are RDF Schema and OWL (Lite and Full), which descend from the now obsolete DAML+OIL. The point of these languages is, boiling things down, to give a meaning to nodes (by granting them a type, itself described via a triple) and to relationships (edges). Also, you can define the "range" and "domain" of these relationships, or, said differently, the type of the start node and the type of the end node: you can say, for example, that the property "numberOfWheels" can only connect a node of type Vehicle to a non-zero integer value.
ns:MyFiat --rdf:type--> ns:Vehicle
ns:MyFiat --ns:numberOfWheels--> 4
Now, you can use these ontologies in two directions: validation and inference. Validation is not that fancy today, but I've seen instances of its use. Inference is what is cool today, because it allows reasoning. Inference basically takes an RDF graph containing a set of triples, takes an ontology, mixes them in a triplestore database that contains an "inference engine", and, like magic, the inference engine derives new triples according to your ontological description. Example: suppose you just store this information in the database:
ns:MyFiat --ns:numberOfWheels--> 4
and nothing else. No type is specified for this node, but the inference engine will automatically add a triple saying that
ns:MyFiat --rdf:type--> ns:Vehicle
because you said in your ontology that only objects of type Vehicle can be described by the property numberOfWheels.
Conversely, you can use the inference engine to validate your data against the ontology and refuse non-compliant data (sort of like XML Schema for XML). In this case, you will need both triples for your data to be accepted by the triplestore.
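A hedged sketch of the numberOfWheels example using Jena's built-in RDFS reasoner (the namespace is invented, and a real setup would load the ontology from a file rather than build it in code):

import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class WheelsInference {
    public static void main(String[] args) {
        String ns = "http://example.org/cars#";  // hypothetical namespace

        // The ontology: numberOfWheels has domain Vehicle.
        Model schema = ModelFactory.createDefaultModel();
        Property numberOfWheels = schema.createProperty(ns, "numberOfWheels");
        Resource vehicle = schema.createResource(ns + "Vehicle");
        schema.add(numberOfWheels, RDFS.domain, vehicle);

        // The data: only the one triple about MyFiat, no type statement.
        Model data = ModelFactory.createDefaultModel();
        data.createResource(ns + "MyFiat")
            .addLiteral(data.createProperty(ns, "numberOfWheels"), 4L);

        // The inference engine combines them and adds the rdf:type triple.
        InfModel inf = ModelFactory.createRDFSModel(schema, data);
        System.out.println(inf.contains(
                inf.getResource(ns + "MyFiat"), RDF.type, vehicle));  // prints true
    }
}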
Additional characteristics of triplestores are formulas and context-aware storage. Formulas are statements (as usual, subject-predicate-object triples) that describe something hypothetical. I never used formulas, so I won't go into the details of something I don't know. Context awareness basically means subgraphs: the problem with storing triples is that you have no way to say where they come from. Suppose you have two dealers describing the price of the same component. One says the price is 5.99 and the other 4.99. If you just store both triples in a database, you no longer know who stated each piece of information. There are two ways to solve this problem.
One is reification. Reification means that you store additional triples to describe another triple. It's wasteful, and makes life hell, because you have to reify each and every triple you store. The alternative is context awareness. Having context-aware storage is like being able to box a bunch of triples into a container with a label on it (the context identifier). You can then use this identifier as the subject of additional statements, hence describing a bunch of triples in a single action.
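Jena, for example, exposes context-aware storage as named graphs in a Dataset; here is a minimal sketch (the dealer graph names and the price property are made up):

import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdf.model.Model;

public class DealerContexts {
    public static void main(String[] args) {
        String ns = "http://example.org/shop#";  // hypothetical namespace
        Dataset dataset = DatasetFactory.create();

        // Each dealer's claim goes into its own named graph (its context).
        Model dealerA = dataset.getNamedModel(ns + "dealerA");
        dealerA.createResource(ns + "component")
               .addLiteral(dealerA.createProperty(ns, "price"), 5.99);

        Model dealerB = dataset.getNamedModel(ns + "dealerB");
        dealerB.createResource(ns + "component")
               .addLiteral(dealerB.createProperty(ns, "price"), 4.99);

        // The graph name records who stated what, with no reification needed.
        dataset.listNames().forEachRemaining(System.out::println);
    }
}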
4. Navigational. Includes Tree/Hierarchy and Graph/Network.
File systems, the semantic web, XML, Object databases, CODASYL, and many others all fit into this category.
Those 4 are pretty much it.
There is also what is referred to as an "inverted index" or "inverted list" database. Software AG's Adabas product would be an example. As with hierarchical databases, these continue to be used in large corporate or university environments because of legacy considerations, or due to a performance advantage in certain situations (typically high-end transactional applications).
There are BASE systems (Basically Available, Soft state, Eventually consistent), which work well with simple data models holding vast volumes of data. Google's BigTable, Dojo's Persevere, Amazon's Dynamo, and Facebook's Cassandra are some examples.
The illuminate Correlation Database is a new, revolutionary non-relational database. The correlation database management system (CDBMS) is data-model independent and designed to efficiently handle unplanned, ad hoc queries in an analytical system environment. Unlike relational database management systems or column-oriented databases, a correlation database uses a value-based storage (VBS) architecture, in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values (data is 100% indexed). Queries are performed using natural language instead of SQL (NoSQL).
Learn more at: www.datainnovationsgroup.com