I'm trying to graph the linking structure of a web site so I can model how pages on a given domain link to each other. Note that I'm not graphing links to sites outside the root domain.
Obviously this graph could be considerable in size. One of the main queries I want to perform is to count how many pages directly link into a given URL. I want to run this against the whole graph (shudder), so that I end up with a list of URLs and the count of incoming links to each URL.
I know one popular way of doing this would be via some kind of MapReduce - and I may still end up going that way - however, I have a requirement to be able to view this report in (near) real time, which isn't generally MapReduce-friendly.
I've had a quick look at Neo4j and OrientDB. While both of these could model the relationship I want, it's not clear whether I could query them to generate the report I want. At this point I'm not committed to any particular technology.
Any help would be greatly appreciated.
Thanks,
Paul
Both OrientDB and Neo4j support Blueprints as a common API for graph operations like traversal, counting, etc.
If I've understood your use case correctly, your graph seems pretty simple: you have "URL" vertices that link to each other with one type of edge, "Links".
To execute operations against the graph, take a look at Gremlin.
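As a rough illustration (not from the original answer), here is a minimal sketch using the Blueprints 2.x Java API that counts the incoming "Links" edges for every vertex; the OrientGraph URL and the "url" property name are my assumptions:

```java
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

import java.util.HashMap;
import java.util.Map;

public class IncomingLinkCounts {
    public static void main(String[] args) {
        // Any Blueprints-compatible graph works here; OrientGraph is just one option.
        Graph graph = new OrientGraph("plocal:/tmp/linkgraph");

        // For every URL vertex, count the incoming "Links" edges.
        Map<Object, Long> counts = new HashMap<>();
        for (Vertex v : graph.getVertices()) {
            long incoming = 0;
            for (Edge ignored : v.getEdges(Direction.IN, "Links")) {
                incoming++;
            }
            counts.put(v.getProperty("url"), incoming);
        }

        counts.forEach((url, n) -> System.out.println(url + " <- " + n));
        graph.shutdown();
    }
}
```

The same traversal can be expressed more compactly in Gremlin, and Neo4j or OrientDB can be swapped in behind the same Graph interface.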
You might have a look at structr. It is an open-source CMS running on top of Neo4j and has exactly those types of inter-page links.
To get the number of links pointing to a page, you just have to iterate the incoming LINKS_TO relationships for the current page node.
What is the use case for your query? A popular-pages list, so it would just contain the top-n pages? You might then try starting at random places in the graph, traversing incoming LINKS_TO relationships to your current node(s) in parallel, and putting them into a sorted structure, so you always start/continue with the 20 or so page nodes that currently have the highest number of incoming links (until they're finished).
Marko Rodriguez has some similar "page-rank" examples in the Gremlin documentation. He's also got several blog posts where he talks about this.
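For illustration only (this is not how structr itself exposes it), a minimal sketch of reading that count with the Neo4j embedded Java API, assuming a LINKS_TO relationship type and a 3.x-style API:

```java
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class IncomingLinks {
    private static final RelationshipType LINKS_TO = RelationshipType.withName("LINKS_TO");

    // Number of pages that link directly into the given page node.
    static int incomingLinkCount(GraphDatabaseService db, Node page) {
        try (Transaction tx = db.beginTx()) {
            int degree = page.getDegree(LINKS_TO, Direction.INCOMING);
            tx.success();
            return degree;
        }
    }
}
```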
Well, with Neo4j you won't be able to split the graph across servers to distribute the load. You could replicate the database to distribute the computation, but then updating will be slow (as you have to replicate the updates). I would attack the problem by keeping a count of inbound links as a property on each node and updating it as new relationships are added. Neo4j has excellent write performance. Of course, you don't need to persist this information, because direct relationships are cheap to retrieve (you don't get a collection of all related nodes, just an iterator).
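A sketch of that count-on-write idea with the embedded Java API (the inboundLinks property and LINKS_TO relationship names are placeholders of mine, not anything from the answer above):

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class LinkWriter {
    private static final RelationshipType LINKS_TO = RelationshipType.withName("LINKS_TO");

    // Create the relationship and keep a denormalized inbound counter on the
    // target node, so the report later becomes a simple property read.
    static void addLink(GraphDatabaseService db, Node from, Node to) {
        try (Transaction tx = db.beginTx()) {
            from.createRelationshipTo(to, LINKS_TO);
            long count = (long) to.getProperty("inboundLinks", 0L);
            to.setProperty("inboundLinks", count + 1);
            tx.success();
        }
    }
}
```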
You should also take a look at a highly scalable graph database product, such as InfiniteGraph. If you email their technical support I think they will be able to point you at some sample code that does a large part of what you've described here.
Problem in a nutshell:
There's a huge amount of input data in JSON format. Right now it's about 1 TB, but it's going to grow. I was told that we're going to have a cluster.
I need to process this data, make a graph out of it, and store it in a database. So every time I get a new JSON document, I have to traverse the whole graph in the database to complete it.
Later I'm going to have a thin client in a browser, where I'm going to visualize some parts of the graph, search in it, traverse it, do some filtering, etc. So this system is not high load, just a lot of processing and data.
I have no experience in distributed systems, NoSQL databases and other "big data"-like stuff. During my little research I found out that there are too many of them and right now I'm just lost.
What I've got on my whiteboard at the moment:
1. Apache Spark's GraphX (GraphFrames) for distributed computing, on top of some storage (HDFS, Cassandra, HBase, ...) and resource manager (YARN, Mesos, Kubernetes, ...).
2. Some graph database. I think it's good to use a graph query language like Cypher in Neo4j or Gremlin in JanusGraph/TitanDB. Neo4j is good, but it has clustering only in EE and I need something open source. So now I'm thinking about the latter ones, which come with Gremlin + Cassandra + Elasticsearch by default.
3. Maybe I don't need any of these; just store the graph as an adjacency matrix in some RDBMS like Postgres and that's it.
I don't know whether I need Spark with option 2 or 3. Do I need it at all?
My boss told me to check out Elasticsearch, but I guess I can only use it as an additional full-text search engine.
Thanks for any reply!
Let's start with a couple of follow-up questions:
1) 1 TB is not a huge amount of data if that is also (close to) the total amount of data. Is it? How much new data are you expecting, and at what rate will it arrive?
2) Why would you have to traverse the whole graph if each JSON document merely refers to a small part of the graph? It's either new data or an update of existing data (which you should be able to pinpoint), isn't it?
Yes, that's how you use a graph database ...
The rest sort of depends on your answer to 1). If we're talking about IoT numbers of arriving events (tens of thousands per second ... sustained), you might need a big data solution. If not, your main problem is getting the initial load done, and it's easy sailing from there ;-).
Hope this helps.
Regards,
Tom
Although I am not using Neo4j but rather TitanDB (IBM Graph), since I am new to graph databases I have, for now, modelled a basic news feed using the schema suggested in the Neo4j documentation.
http://neo4j.com/docs/snapshot/cypher-cookbook-newsfeed.html
Having fully read all the documentation, I am aware of several key differences in the way these databases operate.
In the model described in the link, each of a user's posts is stored as a vertex, and the posts are connected to each other by edges, forming a long list of status updates emanating from each user vertex.
While this makes sense given Neo4j's capabilities, I am aware that TitanDB has vertex-centric indexing abilities, described in detail here:
http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html
Right now I am trying to ensure that querying for a given user's feed is optimal, for a large graph with lots of users and lots of permanently kept posts or status updates. Therefore, I would like to avoid having to traverse all the posts of all of a user's friends and then order and limit them, just to get the first 15 items of a user's feed.
As such, I am unsure if the model described in the Neo4j documentation is really the best one to use with TitanDB, so my question is as follows:
Is the model described in the Neo4j documentation optimal for fast news feed retrieval in TitanDB?
If so, what indexes would I need to create in order to retrieve a user's feed optimally?
If not, would I be better off connecting each post vertex directly to the user who posted it, and using a vertex-centric index on the time property of each posted edge?
I'm really after some general advice on modelling, indexing and retrieving a basic news feed in TitanDB. Thanks in advance.
The basic schema doesn't seem like a bad approach, though it's difficult to make a good judgement based on this one use case.
The simplest approach to solving your indexing problem is probably to denormalize a bit: store the user id as a property on the post vertex and create an index on the [user, timestamp] pair.
Vertex-centric indexes might help you, but not in the proposed model - you'd need to model a post as an edge, not a vertex, which may make other traversals rather awkward. Furthermore, IBM Graph does not support vertex-centric indexes as of its current release.
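For concreteness, a hedged sketch of both options with the Titan 1.0 management API (the names userId, postedAt, and POSTED, the config file, and the "search" index backend are my assumptions; IBM Graph may not expose all of this):

```java
import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.structure.Direction;
import org.apache.tinkerpop.gremlin.structure.Vertex;

import com.thinkaurelius.titan.core.EdgeLabel;
import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.TitanManagement;

public class FeedSchema {
    public static void main(String[] args) throws Exception {
        TitanGraph graph = TitanFactory.open("conf/titan-cassandra-es.properties");
        TitanManagement mgmt = graph.openManagement();

        PropertyKey userId = mgmt.makePropertyKey("userId").dataType(String.class).make();
        PropertyKey postedAt = mgmt.makePropertyKey("postedAt").dataType(Long.class).make();

        // Option A: denormalize userId onto the post vertex and index the pair,
        // backed by the external index ("search") so results can be ordered by time.
        mgmt.buildIndex("postsByUserTime", Vertex.class)
            .addKey(userId).addKey(postedAt)
            .buildMixedIndex("search");

        // Option B: model the post as a POSTED edge and add a vertex-centric index
        // on its timestamp, sorted newest first.
        EdgeLabel posted = mgmt.makeEdgeLabel("POSTED").make();
        mgmt.buildEdgeIndex(posted, "postedByTime", Direction.OUT, Order.decr, postedAt);

        mgmt.commit();
        graph.close();
    }
}
```

The mixed index lets you filter by user and order by postedAt, while the edge index keeps a user's POSTED edges sorted at the vertex itself.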
This week I read an interesting article which explains how the authors implemented an activity feed. Basically, they use two approaches to handle activities, which I'm adapting to my scenario. So, supposing we have a user foo who has a certain number (x) of followers:
if x<500, then the activity will be copied to every follower's feed
this means slow writes, fast reads
if x>500, only a link will be made between foo and his followers
in theory, fast writes but slow reads
So when a user accesses their activity feed, the server will fetch and merge all the data: fast lookups in their own copied activities, plus a query across the links. If a timeline has a limit of 20, then I fetch 10 of each and merge.
I'm trying to do this with Riak and its Linking feature, so my questions are: is linking faster than copying? Is my architecture good enough? Are there other solutions and/or technologies I should look at?
PS: I'm not implementing an activity feed for production; it's just to learn how to implement one that performs well, and to use Riak a bit.
Two thoughts.
1) No, Linking (in the sense of Riak Link Walking) is very likely not the right way to implement this. For one, each link is stored as a separate HTTP header, and there is a recommended limit in the HTTP spec on how many header fields you should send. (Although, to be fair, in tests you can use upwards of 1000 links in the header with Riak and it seems to work fine. But it's not recommended.) More importantly, querying those links via the Link Walking API actually uses MapReduce on the backend, and is fairly slow for the kind of usage you're intending.
This is not to say that you can't store JSON objects that are lists of links, sure, that's a valid approach. I'm just recommending against using Riak links for this.
2) As for how to properly implement it, that's a harder question, and depends on your traffic and use case. But your general approach is valid -- copy the feed for some X value of updates (whether X is 500 or much smaller should be determined in testing), and link when the number of updates is greater than X.
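To make that threshold idea concrete, here is a minimal, storage-agnostic sketch in plain Java (the in-memory maps stand in for whatever Riak buckets you end up using, and the 500 cut-off is just the figure from the article):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HybridFeed {
    static final int FANOUT_THRESHOLD = 500; // tune this in testing, as noted above

    // follower ids per author, and each user's materialized (copied) feed
    final Map<String, List<String>> followers = new HashMap<>();
    final Map<String, List<String>> copiedFeeds = new HashMap<>();
    // per-author activity lists, read lazily by followers of "big" authors
    final Map<String, List<String>> authoredActivities = new HashMap<>();

    void publish(String author, String activity) {
        authoredActivities.computeIfAbsent(author, a -> new ArrayList<>()).add(activity);
        List<String> fans = followers.getOrDefault(author, List.of());
        if (fans.size() < FANOUT_THRESHOLD) {
            // fan-out on write: copy into every follower's feed (slow writes, fast reads)
            for (String fan : fans) {
                copiedFeeds.computeIfAbsent(fan, f -> new ArrayList<>()).add(activity);
            }
        }
        // otherwise do nothing here; followers pull from the author's list at read time
    }

    // bigAuthorsFollowed is the set of followed users above the threshold (however you track it)
    List<String> readFeed(String user, List<String> bigAuthorsFollowed, int limit) {
        List<String> merged = new ArrayList<>(copiedFeeds.getOrDefault(user, List.of()));
        for (String author : bigAuthorsFollowed) {
            merged.addAll(authoredActivities.getOrDefault(author, List.of()));
        }
        return merged.subList(0, Math.min(limit, merged.size()));
    }
}
```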
How should you link? You have 3 choices, all with tradeoffs. 1) Use Secondary Indices (2i), 2) Use Search, or 3) Use links "manually", meaning, store JSON documents with URLs that you dereference manually (versus using link walking queries).
I highly recommend watching this video: http://vimeo.com/album/2258285/page:2/sort:preset/format:thumbnail (Building a Social Application on Riak), by the Clipboard engineers, to see how they solved this problem. (They used Search for linking, basically).
I have been playing around with using graphs to analyze big data. It's been working great and has been really fun, but I'm wondering what to do as the data gets bigger and bigger?
Let me know if there's any other solution, but I thought of trying HBase because it scales horizontally and I can get Hadoop to run analytics on the graph (most of my code is already written in Java). However, I'm unsure how to structure a graph in a NoSQL database. I know each node can be an entry in the database, but I'm not sure how to model edges and add properties to them (like names of nodes, attributes, PageRank, weights on edges, etc.).
Seeing how HBase/Hadoop are modeled after Bigtable and MapReduce, I suspect there is a way to do this, but I'm not sure how. Any suggestions?
Also, does what I'm trying to do make sense? Or are there better solutions for big data graphs?
You can store an adjacency list in HBase/Accumulo in a column-oriented fashion. I'm more familiar with Accumulo (HBase terminology might be slightly different), so you might use a schema similar to:
SrcNode(RowKey) EdgeType(CF):DestNode(CFQ) Edge/Node Properties(Value)
Where CF=ColumnFamily and CFQ=ColumnFamilyQualifier
You might also store node/vertex properties as separate rows using something like:
Node(RowKey) PropertyType(CF):PropertyValue(CFQ) PropertyValue(Value)
The PropertyValue could be either in the CFQ or the Value
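As a rough sketch of what writing that layout looks like with the HBase Java client (the table name "graph", the families "linksTo" and "prop", and the value encoding are placeholders; Accumulo's Mutation API differs in the details):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EdgeWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("graph"))) {

            // Row = source node, column family = edge type,
            // qualifier = destination node, value = edge properties (e.g. weight).
            Put edge = new Put(Bytes.toBytes("nodeA"));
            edge.addColumn(Bytes.toBytes("linksTo"), Bytes.toBytes("nodeB"), Bytes.toBytes("weight=0.8"));
            table.put(edge);

            // Node properties; stored here under a single "prop" family for brevity
            // (the schema above instead puts the property type in the family).
            Put props = new Put(Bytes.toBytes("nodeA"));
            props.addColumn(Bytes.toBytes("prop"), Bytes.toBytes("pagerank"), Bytes.toBytes("0.15"));
            table.put(props);
        }
    }
}
```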
From a graph processing perspective, as mentioned by @Arnon Rotem-Gal-Oz, you could look at Apache Giraph, which is an implementation of Google's Pregel. Pregel is the method Google uses for large graph processing.
Using HBase/Accumulo as input to Giraph has recently (7 Mar 2012) been submitted as a new feature request: HBase/Accumulo Input and Output formats (GIRAPH-153).
You can store the graph in HBase as an adjacency list: for example, each row would have columns for general properties (name, PageRank, etc.) and a list of keys of adjacent nodes (if it is a directed graph, then just the nodes you can reach from this node, or an additional column with the direction of each).
Take a look at Apache Giraph (you can also read a little more about it here); while this isn't about HBase, it is about handling graphs in Hadoop.
Also, you may want to look at Hadoop 0.23 (and up), as the YARN engine (aka MapReduce 2) is more open to non-MapReduce algorithms.
I would not use HBase in the way "Binary Nerd" recommended it as HBase does not perform very well when handling multiple column families.
Best performance is achieved with a single column family (a second one should only be used if you very often only access the content of one column family and the data stored in the other column family is very large)
There are graph databases built on top of HBase that you could try and/or study.
Apache S2Graph
provides a REST API for storing and querying graph data represented as edges and vertices. There you can find a presentation where the construction of row/column keys is explained, along with an analysis of how the performance of operations influenced (and was influenced by) the design.
Titan
can use other storage backends besides HBase, and has integration with analytics frameworks. It is also designed with big data sets in mind.
My aim is to write an intelligent chatbot. It should store known information in a way similar to the human brain.
That is why I am looking for a file type which stores data as a net of connected keywords. What file type or database system could achieve this?
Further information:
The information input will be Wikipedia, Google search, and facts taught by a human during a conversation.
I could give specific information about my requirements and wishes, but I don't know whether any approach to this even exists. Maybe there are more useful specifications than my own thoughts.
Just one example: the connections should have weights, and querying the information net should increase the weights of the connections that were used.
What I expect is that the chatbot could form real associations (or ideas) using the data net.
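To make the weighting idea concrete, here is a small in-memory sketch in Java (purely illustrative; a graph database would persist the same structure on disk):

```java
import java.util.HashMap;
import java.util.Map;

// A tiny weighted keyword net: edges gain weight each time they are used,
// so frequently traversed associations become stronger over time.
public class KeywordNet {
    private final Map<String, Map<String, Double>> edges = new HashMap<>();

    void connect(String a, String b, double initialWeight) {
        edges.computeIfAbsent(a, k -> new HashMap<>()).merge(b, initialWeight, Double::sum);
        edges.computeIfAbsent(b, k -> new HashMap<>()).merge(a, initialWeight, Double::sum);
    }

    // Look up the strongest association for a keyword and reinforce the edge used.
    String strongestAssociation(String keyword) {
        Map<String, Double> neighbours = edges.getOrDefault(keyword, Map.of());
        String best = neighbours.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
        if (best != null) {
            neighbours.merge(best, 1.0, Double::sum); // reinforcement on use
        }
        return best;
    }
}
```

Each lookup strengthens the connection that was followed, which is the behaviour described above.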
As an extension to my above comments:
A graph is definitely the way you want to go in terms of data representation... it maps perfectly onto your problem description.
What you seem to be asking is how you can [persistently] store this information on disk (rather than in memory). That completely depends on what constraints you have. There are "graph databases", which are more geared toward storing graphs than, say, relational or hierarchical databases, and which would perform far better than pushing your adjacency matrix or list to a flat file. Here's the Wikipedia entry:
http://en.wikipedia.org/wiki/Graph_database
Now, there is the issue of what happens when you have so many nodes and edges that you can't load them all into memory at once; unfortunately, if you have nodes that are connected to every other node, that can be a problem (because you won't be able to load the complete graph). I can't answer that right now, but I'm sure there are paradigms to address this problem. I will update my answer after some digging.
Edit: You'll probably have to consult someone who knows more about graph databases. It's possible that there are ways to load chunks of the graph from the database without loading the whole thing. If that's your issue, you may want to reframe it as a question about working with large graphs stored in graph databases and post it again in a more specific form, tagged with graphs, databases, algorithms, and the like.