How can I get realtime streaming updates to a Gremlin query? - graph-databases

I fell in love with realtime streaming updates to a query when I was using Firebase, RethinkDB, and similar databases. Now that I am working with graph databases via Gremlin, I'm wondering how to get this behavior.
As a trivial example, if I specified a gremlin query like:
g.V().values('name')
I'd like to receive an update when a new vertex is added with a name property, or a name is changed on an existing vertex.
I am beginning to use JanusGraph, so the ideal solution would work there -- but this is such a killer feature that I could be swayed to other Gremlin-friendly graph databases.
Thanks!

You could use an EventStrategy with any TinkerPop-compatible graph database. Once you create the event strategy, you add it to your traversal: g = graph.traversal().withStrategies(strategy). You'll need to implement the MutationListener interface to do whatever you'd like on those events.
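The listener mechanism can be sketched in plain Python to show the shape of the pattern (a toy model only; `ToyGraph` and the event names here are invented, not the TinkerPop API):

```python
class ToyGraph:
    """Toy stand-in for a graph whose writes raise mutation events.

    Mimics the shape of TinkerPop's EventStrategy/MutationListener;
    every name in this sketch is invented for illustration.
    """

    def __init__(self):
        self.vertices = []   # each vertex is just a dict of properties
        self.listeners = []  # callables invoked with (event, element)

    def add_listener(self, listener):
        self.listeners.append(listener)

    def add_vertex(self, **props):
        self.vertices.append(props)
        for listener in self.listeners:
            listener("vertexAdded", props)


events = []
g = ToyGraph()
g.add_listener(lambda kind, element: events.append((kind, element.get("name"))))
g.add_vertex(name="marko")
# events now holds [("vertexAdded", "marko")]
```

One caveat worth knowing: EventStrategy only raises events for mutations made through that traversal source in the same JVM process, so changes made by other clients won't be seen.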

OrientDB has LiveQuery, though I don't know whether it integrates with Gremlin (https://orientdb.com/docs/2.1/Live-Query.html). That's the closest thing I know of to this kind of feature in any TinkerPop-enabled graph database.

Related

Is there any NoSQL database that can do search (like Lucene) on map/reduce views?

I'm using Cloudant, where I can use map/reduce to project views of the data, and it can also search documents with Lucene.
But these two features are separate and cannot be used together.
Suppose I make a game with user data like this:
{
name: "",
items: []
}
Each user has items. Then I want to let users find all swords with quality +10. With Cloudant I might project type and quality as the key and query with key=["sword",10].
But I cannot make the query any more complex than that, the way Lucene could. To use Lucene I would need to normalize all items into documents of their own and reference each one's owner.
I really wish I could run a Lucene search on the key of a data projection. I mean, instead of normalizing, I could store nested documents as I want and use map/reduce to project the data inside each document, so I could search for items directly.
P.S. If the database also supported partial updates by scripting and had transactional updates built in, that would be best.
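The projection described above can be sketched as a plain map function (a toy model of a Cloudant/CouchDB view; the document shape and `map_items` are illustrative only):

```python
def map_items(doc):
    """Emit (key, value) rows the way a Cloudant/CouchDB map function would:
    key = [item type, item quality], value = the owner's name."""
    for item in doc.get("items", []):
        yield ([item["type"], item["quality"]], doc["name"])


user = {"name": "alice",
        "items": [{"type": "sword", "quality": 10},
                  {"type": "shield", "quality": 3}]}
rows = list(map_items(user))

# Querying the view with key=["sword", 10] is then an exact-match lookup:
hits = [owner for key, owner in rows if key == ["sword", 10]]
# hits == ["alice"]
```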
I'd suggest trying out Elasticsearch.
It seems like your use case should be covered by the search API.
If you need to do more complex analytics, Elasticsearch supports aggregations.
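If the items stay nested inside each user document, the sword query could look something like this (a sketch of an Elasticsearch query body; it assumes the index mapping declares `items` as a `nested` field, with the field names taken from the question):

```python
# Query body for e.g. POST /users/_search. With a "nested" mapping on
# "items", both conditions must match within the SAME item, not merely
# somewhere in the document.
query = {
    "query": {
        "nested": {
            "path": "items",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"items.type": "sword"}},
                        {"term": {"items.quality": 10}},
                    ]
                }
            },
        }
    }
}
```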
I am not at all sure that I got the question right, but you may want to take a look at Riak. It offers a Solr-based search, which is quite well documented. I have used it in the past for distributed search over a distributed key-value index and it was quite fast.
If you use this, you will also need to look at the syntax of Solr queries, so I mention it here to save you some time. However, keep in mind that not all of those Solr query features were available in Riak (at least that was the case when I used it).
There are several solutions that would do the job. I can give my 2 cents and propose the well-established MongoDB. With MongoDB you can create a text index on a given field and then do a full-text search as explained here. The feature has been in MongoDB since version 2.4 and the syntax is well documented in the MongoDB docs.

Is there any graph database good for both updating the graph and data mining?

I am new to graph databases and am trying to find the right one for us, but I haven't yet. We need something good for both updating the graph and data mining.
A graph database like Neo4j can perform queries and updates really fast, and it performs very well when dealing with highly connected data. But it does not seem very useful for computations over the whole graph, which is what we need for data mining (to run PageRank, for example). GraphLab, Giraph, GraphX, Faunus, etc. are of that kind, but many of them are not good at even removing or updating parts of the graph. For example, deleting vertices and edges cannot be done explicitly in GraphLab.
Is there anything good for both updating the graph and running PageRank?
Titan is built with both OLTP and OLAP processing in mind, so it handles high-speed reads/writes at large scale:
http://thinkaurelius.com/2013/05/13/educating-the-planet-with-pearson/
http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/
You mentioned Faunus as something you looked at for graph analytics. Faunus is highly tuned to work with Titan. In fact, as of the most recent Titan release, 0.5.0, Faunus has been repackaged and distributed directly with Titan as titan-hadoop for even greater integration and support.
Looking forward to Titan 1.0, due in the coming months, Titan will support TinkerPop3, which does for OLAP what earlier versions did for OLTP, in that it generalizes graph analytics frameworks (it already integrates Giraph as the reference implementation).
Since you are in the exploring stage and familiar with Neo4j, I think looking at the TinkerPop3 documentation would be a good start, as it uses Neo4j for its reference implementation. Numerous vendors are preparing to support this latest version of TinkerPop, so developing your application against TinkerPop lets you get started without being tied to a particular graph database at this early stage. You can save that decision for later, once you have had more time to evaluate the different implementations available.
If you need to get to work with something right away, then start with Titan 0.5 and consider your migration path to 1.0 when it becomes available.

How do I use Blueprints to migrate away from Titan?

Currently I use Titan-specific methods like
TitanType name = graph.getType("name");
and
graph.makeKey("name").dataType(String.class).indexed(Vertex.class)
How can I replace this code with methods from Blueprints so it would work for non-Titan graph databases?
This post claims that it's not possible to translate createKeyIndex into makeKey. If that's the case, what is the solution here?
If you are writing code that should work with any Blueprints-enabled graph, then you are a bit stuck in this regard. The variety of indexing options available in Titan, Neo4j, OrientDB, etc. is too vast to generalize behind Blueprints. Blueprints only has the notion of key indices as a generalized approach, but that approach is generally not good enough for Titan users, and they must drop down to the Titan API.
Your best option in this situation is to work with createKeyIndex and, when that isn't possible, drop down to the API of the underlying graph instance for what you need done. That's a common practice, and going forward into TinkerPop3 it will be the only way to create indices and types.
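The "portable call first, vendor API as fallback" idea can be factored into a single helper (a plain-Python sketch; the class and method names are invented here, not the Blueprints or Titan APIs):

```python
def ensure_key_index(graph, key):
    """Create an index on `key`, preferring the portable call and falling
    back to a vendor-specific one. All names here are illustrative."""
    if hasattr(graph, "create_key_index"):    # generic, Blueprints-style
        graph.create_key_index(key)
    elif hasattr(graph, "make_key"):          # vendor-specific, Titan-style
        graph.make_key(key, data_type=str, indexed=True)
    else:
        raise NotImplementedError("no indexing API available")


class GenericGraph:
    def __init__(self):
        self.indexed = []

    def create_key_index(self, key):
        self.indexed.append(key)


class VendorGraph:
    def __init__(self):
        self.keys = {}

    def make_key(self, key, data_type, indexed):
        self.keys[key] = (data_type, indexed)


g1, g2 = GenericGraph(), VendorGraph()
ensure_key_index(g1, "name")   # takes the portable path
ensure_key_index(g2, "name")   # drops down to the vendor API
```

Keeping the drop-down behind one function like this confines the vendor-specific code to a single place, which makes a later migration much smaller.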

Is it possible to store graphs in HBase? If so, how do you model the database to support a graph structure?

I have been playing around with using graphs to analyze big data. It's been working great and is really fun, but I'm wondering what to do as the data gets bigger and bigger.
Let me know if there's any other solution, but I thought of trying HBase because it scales horizontally and I can use Hadoop to run analytics on the graph (most of my code is already written in Java). But I'm unsure how to structure a graph in a NoSQL database. I know each node can be an entry in the database, but I'm not sure how to model edges and add properties to them (like node names, attributes, PageRank, edge weights, etc.).
Seeing how HBase/Hadoop are modeled after Bigtable and MapReduce, I suspect there is a way to do this, but I'm not sure how. Any suggestions?
Also, does what I'm trying to do make sense? Or are there better solutions for big-data graphs?
You can store an adjacency list in HBase/Accumulo in a column-oriented fashion. I'm more familiar with Accumulo (HBase terminology might be slightly different), so you might use a schema similar to:
SrcNode(RowKey) EdgeType(CF):DestNode(CFQ) Edge/Node Properties(Value)
Where CF=ColumnFamily and CFQ=ColumnFamilyQualifier
You might also store node/vertex properties as separate rows using something like:
Node(RowKey) PropertyType(CF):PropertyValue(CFQ) PropertyValue(Value)
The PropertyValue could be either in the CFQ or the Value
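That cell layout can be modeled in a few lines of plain Python (toy code; the row keys, families, and values are illustrative):

```python
# Model the table as {(row_key, column_family, qualifier): value}, which
# is roughly how HBase/Accumulo cells are addressed.
table = {}

def put(row, cf, cfq, value):
    table[(row, cf, cfq)] = value

# Edge A --knows--> B with a property on it, per the schema above.
put("A", "knows", "B", {"weight": 0.8})
# A node property stored as its own row.
put("A", "name", "Alice", "")

def out_edges(row, edge_type):
    """All destination nodes for one edge type: a single-row, single-family scan."""
    return [cfq for (r, cf, cfq) in table if r == row and cf == edge_type]

# out_edges("A", "knows") == ["B"]
```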
From a graph-processing perspective, as mentioned by @Arnon Rotem-Gal-Oz, you could look at Apache Giraph, which is an implementation of Google's Pregel. Pregel is the method Google uses for large-scale graph processing.
Using HBase/Accumulo as input to Giraph was recently submitted (7 Mar 2012) as a new feature request: HBase/Accumulo Input and Output formats (GIRAPH-153).
You can store the graph in HBase as an adjacency list, so, for example, each row would have columns for general properties (name, PageRank, etc.) and a list of keys of adjacent nodes (if it's a directed graph, then just the nodes you can reach from this node, or an additional column with the direction of each edge).
Take a look at Apache Giraph (you can also read a little more about it here); while this isn't about HBase, it is about handling graphs in Hadoop.
Also, you may want to look at Hadoop 0.23 (and up), as the YARN engine (aka MapReduce 2) is more open to non-MapReduce algorithms.
I would not use HBase in the way "Binary Nerd" recommended, as HBase does not perform very well when handling multiple column families.
Best performance is achieved with a single column family (a second one should only be used if you very often access the content of only one column family and the data stored in the other column family is very large).
There are graph databases built on top of HBase you could try and/or study.
Apache S2Graph
provides a REST API for storing and querying graph data represented by edges and vertices. There you can find a presentation where the construction of row/column keys is explained. An analysis of the performance of operations that influenced, or were influenced by, the design is also given.
Titan
can use other storage backends besides HBase, and has integration with analytics frameworks. It is also designed with big data sets in mind.

Graph Database to Count Direct Relations

I'm trying to graph the linking structure of a web site so I can model how pages on a given domain link to each other. Note I'm not graphing links to sites not on the root domain.
Obviously this graph could be considerable in size. One of the main queries I want to perform is to count how many pages directly link into a given url. I want to run this against the whole graph (shudder) such that I end up with a list of urls and the count of incoming links to that url.
I know one popular way of doing this would be via some kind of map reduce - and I may still end up going that way - however I have a requirement to be able to view this report in (near) realtime which isn't generally map reduce friendly.
I've had a quick look at Neo4j and OrientDB. While both of these could model the relationships I want, it's not clear whether I could query them to generate the report I want. At this point I'm not committed to any particular technology.
Any help would be greatly appreciated.
Thanks,
Paul
Both OrientDB and Neo4j support Blueprints as a common API for graph operations like traversal, counting, etc.
If I've understood your use case correctly, your graph seems pretty simple: you have "URL" vertices that link to each other with one type of edge, "Links".
To execute operations against the graph, take a look at Gremlin.
You might have a look at structr. It is an open-source CMS running on top of Neo4j and has exactly those types of inter-page links.
To get the number of links pointing to a page, you just have to iterate over the incoming LINKS_TO links of the current page node.
What is the use case for your query? A popular-pages list? Would it just contain the top n pages? You might then try starting at random places in the graph, traversing incoming LINKS_TO relationships to your current node(s) in parallel, and putting them into a sorting structure, so you always start/continue with the 20 or so top page nodes that already have the highest number of incoming links (until they're finished).
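For the whole-graph report, the counting itself is just an in-degree tally; here is a plain-Python sketch over a toy adjacency list (the page URLs are made up):

```python
from collections import Counter

# links[src] = list of pages that src links to (the LINKS_TO edges)
links = {
    "/home":  ["/about", "/blog"],
    "/blog":  ["/about", "/home"],
    "/about": ["/home"],
}

incoming = Counter(dst for targets in links.values() for dst in targets)
top_pages = incoming.most_common(2)
# /about and /home each have two incoming links; /blog has one
```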
Marko Rodriguez has some similar "page-rank" examples in the Gremlin documentation. He's also got several blog posts where he talks about this.
Well, with Neo4j you won't be able to split the graph across servers to distribute the load. You could replicate the database to distribute the computation, but then updating will be slow (as you have to replicate the updates). I would attack the problem by keeping a count of inbound links as a property on each node and updating it as new relationships are added. Neo4j has excellent write performance. Of course, you don't need to persist this information, because direct relationships are cheap to retrieve (you don't get a collection of all related nodes, just an iterator).
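That incremental approach can be sketched like so (plain Python; the node and edge shapes are invented for illustration, not a Neo4j API):

```python
class ToyLinkGraph:
    """Toy graph that maintains each node's inbound-link count at write
    time, so the report becomes an O(1) property read per node."""

    def __init__(self):
        self.nodes = {}  # url -> {"inbound": int}

    def add_node(self, url):
        self.nodes.setdefault(url, {"inbound": 0})

    def add_link(self, src, dst):
        self.add_node(src)
        self.add_node(dst)
        self.nodes[dst]["inbound"] += 1  # updated on every new relationship


g = ToyLinkGraph()
g.add_link("/home", "/about")
g.add_link("/blog", "/about")
# g.nodes["/about"]["inbound"] == 2
```

The trade-off is a little extra work on every write in exchange for a near-realtime report, which matches the requirement in the question.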
You should also take a look at a highly scalable graph database product, such as InfiniteGraph. If you email their technical support I think they will be able to point you at some sample code that does a large part of what you've described here.