I am looking for real-world applications where topological sorting is performed on large graphs.
Some fields where I imagine you could find such instances would be bioinformatics, dependency resolution, databases, hardware design, data warehousing... but I hope some of you may have encountered or heard of specific algorithms/projects/applications/datasets that require topsort.
Even if the data/project is not publicly accessible, any hints (and estimates of the order of magnitude of potential graph sizes) would be helpful.
Here are some examples I've seen so far for Topological Sorting:
When scheduling task graphs in a distributed system, it is usually necessary to sort the tasks topologically and then assign them to resources. I am aware of task graphs containing more than 100,000 tasks that need to be sorted in topological order. See this in this context. (A minimal sketch of this case appears at the end of this answer.)
Once upon a time I was working on a Document Management System. Each document in this system has some kind of precedence constraint with respect to a set of other documents, e.g. its content type or field references. The system then had to be able to generate an ordering of the documents that preserves this topological order. As far as I can remember, there were around 5,000,000 documents available two years ago!
In the field of social networking, there is a well-known query for the largest friendship distance in the network. This problem requires traversing the graph with a BFS approach, at a cost comparable to that of a topological sort. Consider the members of Facebook and you have your answer.
If you need more real examples, do not hesitate to ask me. I have worked on lots of projects involving large graphs.
P.S. For large DAG datasets, you may take a look at the Stanford Large Network Dataset Collection and the Graphics# Illinois page.
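As a concrete (if tiny) illustration of the scheduling case above, here is a minimal sketch using Python's standard-library graphlib; the task graph, dependency structure, and worker names are all made up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from itertools import cycle

# Hypothetical task graph: each task maps to the set of tasks it depends on.
deps = {
    "compile": set(),
    "test":    {"compile"},
    "package": {"compile", "test"},
    "deploy":  {"package"},
}

# static_order() yields the tasks so that dependencies always come first;
# it raises graphlib.CycleError if the graph is not a DAG.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['compile', 'test', 'package', 'deploy']

# Naive round-robin assignment of the sorted tasks to resources.
workers = cycle(["worker-1", "worker-2"])
schedule = [(task, next(workers)) for task in order]
print(schedule)
```

At 100,000+ tasks this in-memory approach still runs in linear time, although a real scheduler would of course also account for task durations and data locality.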
I'm not sure if this fits what you're looking for, but do you know the Bio4j project?
Not all of the content stored in the graph-based DB is suitable for topological sorting (there are directed cycles in an important part of the graph), but there are sub-graphs, such as Gene Ontology and Taxonomy, where this ordering makes sense.
TopoR is a commercial topological PCB router that works by first routing the PCB as a topological problem and then translating the topology into physical space. It supports up to 32 electrical layers, so it should be capable of handling many thousands of connections (say 10^4).
I suspect integrated circuits may use similar methods.
The company where I work manages a (proprietary) database of software vulnerabilities and patches. Patches are typically issued by a software vendor (like Microsoft, Adobe, etc.) at regular intervals, and "new and improved" patches "supersede" older ones, in the sense that if you apply the newer patch to a host, then the old patch is no longer needed.
This gives rise to a DAG where each software patch is a node with arcs pointing to a node for each "superseding" patch. There are currently close to 10K nodes in the graph, and new patches are added every week.
Topological sorting is useful in this context to verify that the graph contains no cycles; if a cycle does arise, it means there was either an error in the addition of a new DB record or corruption introduced by botched data replication between DB instances (a minimal cycle check is sketched below).
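For completeness, here is a minimal sketch of that check using Kahn's algorithm, which produces a topological order and detects cycles at the same time; the patch IDs and graph shape are made up.

```python
from collections import deque

def topological_order(graph):
    """Kahn's algorithm on a dict {node: iterable of successor nodes}.
    Returns a list in topological order, or raises ValueError on a cycle."""
    indegree = {}
    for node, successors in graph.items():
        indegree.setdefault(node, 0)
        for succ in successors:
            indegree[succ] = indegree.get(succ, 0) + 1

    # Start from nodes with no incoming arcs (patches nothing points to yet).
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for succ in graph.get(node, ()):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                queue.append(succ)

    if len(order) != len(indegree):
        raise ValueError("cycle detected: bad record or botched replication?")
    return order

# Hypothetical patch graph: an arc points to the patch that supersedes it.
patches = {"KB100": ["KB200"], "KB200": ["KB300"], "KB300": []}
print(topological_order(patches))  # ['KB100', 'KB200', 'KB300']
```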
I was wondering how databases like Dgraph and TigerGraph manage to shard a graph in order to support horizontal scaling without breaking the connections between nodes, while also supporting a lot of interesting algorithms.
They claim to be native graph solutions, so an approach like Facebook's or Twitter's, for example, does not apply here.
The only solution that comes to my mind is spreading the graph across many small databases, which leads to a lot of node duplication to maintain the relationships.
Any ideas?
Thanks in advance
Technically there are two principles to follow for graph sharding. The first is Edge-Cut, which cuts an edge into two parts (incoming and outgoing) and stores them on different servers; each vertex associated with the edge is assigned to a specific server in the cluster. Nebula Graph, a distributed graph database, follows this method. The second is Vertex-Cut, which cuts a vertex into N parts (depending on how many edges the vertex has) and stores them on different servers; each edge associated with the vertex is then assigned to a specific server in the cluster. GraphX does it this way.
However, graph sharding is an NP-hard problem, which makes it much harder than sharding in SQL, so some vendors do it differently than pure Edge-Cut or pure Vertex-Cut. For example, your idea of spreading subgraphs is somewhat like Neo4j Fabric. Some vendors place the whole graph structure, excluding the properties, in a single host's memory so that fetching subgraphs is very fast, while others adopt an adjacency list to separate nodes and edges in the graph without worrying much about locality.
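To make the two placement rules concrete, here is a minimal sketch using plain hash-based placement on a made-up four-server cluster; real systems such as Nebula Graph or GraphX add replication, balancing, and stable hashing on top of this.

```python
NUM_SERVERS = 4  # hypothetical cluster size

def edge_cut_placement(u, v):
    """Edge-Cut: every vertex lives on one server, so the edge (u, v) is stored
    twice -- an outgoing half next to u and an incoming half next to v."""
    return {"out_half_on": hash(u) % NUM_SERVERS,
            "in_half_on":  hash(v) % NUM_SERVERS}

def vertex_cut_placement(u, v):
    """Vertex-Cut: every edge lives on exactly one server, so a high-degree
    vertex is effectively split across all servers holding one of its edges."""
    return {"edge_on": hash((u, v)) % NUM_SERVERS}

# Note: Python's built-in hash() is randomized per process; a real database
# would use a stable hash so placement survives restarts.
for u, v in [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]:
    print((u, v), edge_cut_placement(u, v), vertex_cut_placement(u, v))
```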
This is a critical question, and a weakness of large graph databases. Most of them use read-only replicas or have issues with too many hops across the network.
But you asked about Dgraph which is actually completely different and does not break a graph up into disconnected or even overlapping sub-graphs on different servers. Instead, it stores entire predicates on individual servers, and this allows a given query to execute in a small number of network hops.
See the "developers in sf who use vim" example here: https://dgraph.io/blog/post/db-sharding/ .
I have five computers networked together. One of them is the master computer and the other four are slave computers.
Each slave computer has its own set of data (a very big integer matrix). I want to run four different clustering programs on the four slaves, then bring the results back to the master computer for further processing (such as visualization).
I initially thought of using Hadoop, but I cannot find a nice way to convert the above problem (specifically the output results) into the MapReduce framework.
Is there any nice open-source distributed computing framework with which I can perform the above task easily?
Thanks in advance.
You should use YARN to manage multiple clusters or resources.
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
Reference
It seems that you have already stored the data on each of the nodes, so you have already solved the "distributed storage" element of the problem.
Since each node's dataset is different, this isn't a parallel processing problem either.
It seems to me that you don't need Hadoop or any other big-data framework. However, you can embrace the philosophy of Hadoop by taking the code to the data: run the clustering algorithm on each node, then handle the results in whatever way you need (a minimal sketch follows below). One caveat is if you also have a problem loading the data or running the clustering algorithm on each node, but that is a different problem.
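If a plain scripting approach is acceptable, here is a minimal sketch of that "take the code to the data" idea, assuming password-less SSH to the slaves and a clustering script already installed on each of them; all hostnames and paths are made up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical hostnames and paths -- adjust to your setup.
SLAVES = ["slave1", "slave2", "slave3", "slave4"]
REMOTE_CMD = "python3 /opt/jobs/cluster_local_matrix.py"  # runs on each slave
REMOTE_RESULT = "/opt/jobs/result.csv"

def run_on_slave(host):
    # Take the code to the data: run the clustering where the matrix lives...
    subprocess.run(["ssh", host, REMOTE_CMD], check=True)
    # ...then pull only the (small) result back to the master.
    local_copy = f"{host}_result.csv"
    subprocess.run(["scp", f"{host}:{REMOTE_RESULT}", local_copy], check=True)
    return local_copy

# Run the four slaves in parallel and collect the result files on the master.
with ThreadPoolExecutor(max_workers=len(SLAVES)) as pool:
    result_files = list(pool.map(run_on_slave, SLAVES))

print("collected:", result_files)  # feed these into your visualization step
```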
I am trying to implement a storage system that supports tagging of data. A very simple application of this system would be something like questions on Stack Overflow, which are tagged with multiple tags, and a query may consist of multiple tags. It also resembles searching on Google with multiple keywords.
The data set maintained by this system will be very large, like several or tens of terabytes with billions of entries.
So what data structures and algorithms should I use in this system for maintaining and querying the data? The data may be stored across a cluster of machines.
Are there any guides or papers that describe this problem and its solutions?
You might want to read the two books below:
Collective Intelligence in Action
Satnam Alag (ISBN: 1933988312)
http://www.manning.com/alag/
"Capter 3. Extracting intelligence from tags" covers:
Three forms of tagging and the use of tags
A working example of how intelligence is extracted from tags
Database architecture for tagging
Developing tag clouds
Programming Collective Intelligence
Toby Segaran (ISBN: 978-0-596-52932-1)
http://shop.oreilly.com/product/9780596529321.do
"Chapter 4. Searching and Ranking" covers:
Basic concepts of algorithms for search engine index
Design of a click-tracking neural network
Hope it helps.
Your problem is very difficult, but there are plenty of related papers and books. The Amazon Dynamo paper, Yahoo!'s PNUTS, and this Hadoop paper are good examples.
So, first, you must decide how your data will be distributed across the cluster. Data must be evenly distributed across the network, without hot spots; consistent hashing is a good solution for this problem (a minimal sketch follows). Data must also be redundant: every entry needs to be stored in several places to tolerate failures of individual nodes.
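Here is a minimal sketch of consistent hashing with virtual nodes and N-way replication; the node names and parameters are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes and N-way replication."""

    def __init__(self, nodes, vnodes=100, replicas=3):
        self.replicas = min(replicas, len(set(nodes)))
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def nodes_for(self, key):
        """Return the distinct nodes that should store `key` (its first owner
        plus the next distinct nodes clockwise on the ring, for redundancy)."""
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        chosen = []
        while len(chosen) < self.replicas:
            node = self.ring[idx][1]
            if node not in chosen:
                chosen.append(node)
            idx = (idx + 1) % len(self.ring)
        return chosen

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.nodes_for("entry:42"))  # three distinct nodes holding this entry
```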
Next, you must decide how writes will occur in your system. Every write must be replicated across the nodes that hold the updated data entry. You might want to read about the CAP theorem and the concept of eventual consistency (Wikipedia has good articles on both). There is also a consistency/latency tradeoff (a toy quorum-write sketch follows). You can use different mechanisms for write replication: some kind of gossip protocol or state machine replication.
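And here is a toy sketch of the consistency/latency tradeoff on the write path: the write goes to every replica but is acknowledged as soon as W of them confirm. The in-memory replica class is just a stand-in for real storage nodes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class InMemoryReplica:
    """Stand-in for a real storage node."""
    def __init__(self):
        self.store = {}
    def put(self, key, value):
        self.store[key] = value
        return True

def quorum_write(key, value, replicas, w=2):
    """Send the write to every replica, return once `w` of them acknowledge.
    A smaller w means lower latency but weaker consistency."""
    acks = 0
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r.put, key, value) for r in replicas]
        for done in as_completed(futures):
            if done.result():
                acks += 1
            if acks >= w:
                return True
    return False

replicas = [InMemoryReplica() for _ in range(3)]
print(quorum_write("entry:42", {"tags": ["python", "hadoop"]}, replicas))  # True
```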
I don't know what kind of tagging you mean: are tags manually assigned to entries, or learned from the data? Either way, this is the field of information retrieval (IR). You might use some kind of inverted index to efficiently search entries by tags or keywords (see the sketch below), and you will also need some query-result ranking algorithm.
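A minimal in-memory sketch of the inverted-index idea for tag queries follows; at the scale you describe, each posting list would itself be partitioned across machines (for example with the consistent hashing above).

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal inverted index: tag -> set of entry ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, entry_id, tags):
        for tag in tags:
            self.postings[tag].add(entry_id)

    def query(self, tags):
        """Return entries carrying *all* of the given tags (AND query)."""
        sets = [self.postings.get(t, set()) for t in tags]
        return set.intersection(*sets) if sets else set()

idx = InvertedIndex()
idx.add(1, ["python", "hadoop"])
idx.add(2, ["python", "graphs"])
print(idx.query(["python", "hadoop"]))  # {1}
```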
In the context of designing a social network using a graph data structure, where you can perform a BFS to find a connection from one person to another, I have some questions about it.
If there are a million users, the topology will be much more complicated and interconnected than the graphs we normally design, and I am trying to understand how you could solve these problems.
In the real world, servers fail. How does this affect you?
How could you take advantage of caching?
Do you search until the end of the graph (infinite)? How do you decide when to give up?
In real life, some people have more friends of friends than others, and are therefore more likely to provide a path between you and someone else. How could you use this data to pick where to start traversing?
Your question seems interesting and curious :)
1) Well... of course, data is stored on disk, not in RAM.
Disks have systems that protect against failure, RAID 5 for example.
Redundancy is the key: if one system fails, there is another ready to take its place.
There is also redundancy combined with workload sharing: two computers work in parallel and share the load, but if one stops, the other keeps running and takes the full workload.
In places like Google or Facebook the redundancy factor is not 2, it is 1,200,000,000 :)
Also consider that the data is not in a single server farm; Google has several datacenters connected together, so if one building explodes, another one takes its place.
2) Not an easy question at all, but usually these systems also have big caches in front of their disk arrays, so reading and writing data on disk is faster than on our laptops :)
Data can be processed in parallel by several concurrent systems, and this is the key to the speed of services like Facebook.
3) The graph is not infinite, so exploring it fully is indeed possible with current technology.
The computational complexity of exploring all connections and all nodes on a graph is O(n + m) where n is the number of vertices and m the number of edges.
This means it is linear in the number of registered users and in the number of connections between them. And RAM these days is very cheap.
Because the growth is linear, it is easy to add resources when needed: just add more computers as you get richer :)
Also consider that no one will perform a real search across every node; everything in Facebook is quite "local". You can view the direct friends of one person, not the friends of friends of friends... it would not be useful.
Getting the vertices directly connected to a given vertex, if the data structure is well designed, is very easy and fast. In SQL it would be a simple SELECT, and if the tables are well indexed it will be very fast and not depend much on the total number of users (see the concept of hash tables). A depth-limited BFS along these lines is sketched below.
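To tie points 3 and 4 together, here is a minimal sketch of a depth-limited BFS over a friendship graph, which is one practical answer to "when do you give up": stop after a fixed number of hops. The toy graph is made up.

```python
from collections import deque

def bfs_within_k_hops(friends, start, target, max_hops=3):
    """Breadth-first search that gives up after `max_hops` hops.
    `friends` maps a person to an iterable of direct friends.
    Cost is O(n + m) over the part of the graph actually visited."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        if person == target:
            return dist
        if dist == max_hops:
            continue  # give up on this branch: too far away
        for friend in friends.get(person, ()):
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, dist + 1))
    return None  # no connection within max_hops

graph = {"you": ["ann", "bob"], "ann": ["carol"], "carol": ["dan"], "bob": []}
print(bfs_within_k_hops(graph, "you", "dan"))  # 3
```

A natural extension for point 4 is to expand the most-connected neighbours first, which turns the plain queue into a priority queue ordered by degree.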
I'm creating a simulator for a large-scale P2P system. In order to make the simulations as good as possible, I would like to use data from the real world. I'd like to use this data to simulate each node's behavior (primarily its availability). Is there any availability data recorded from large P2P systems (such as BitTorrent) available?
I'm not too sure about other P2P protocols, but here's a stab at answering the question for BitTorrent:
You should be able to glean some stats from a BitTorrent tracker log, in the case where the tracker was centralised (as opposed to a decentralised tracker, or where a distributed hash table is used).
To wrap your head around the logs, have a look at one of the many log analyzers, like BitTorrent Tracker Log Analyzer.
As for actual data, you can find them all over the web. There's a giant RedHat9 tracker log here ☆, for instance. I'd search Google for "bittorrent tracker log".
☆ The article Dissecting BitTorrent: Five Months in a Torrent's Lifetime on that page also looks interesting.
Another way of approaching this is to simulate availability mathematically. Availability will follow some power-law distribution, e.g. the vast majority of nodes are available very rarely and for short periods of time, while a very few nodes are available nearly always, over long periods.
Real-world networks will of course have many other types of patterns in the data, so this is not a perfect simulation, but I figure it's pretty good. A minimal sampling sketch follows.
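Here is a minimal sketch of that idea, drawing each node's availability from a power-law-shaped distribution; the exponent is an assumption you would want to fit against real trace data.

```python
import random

def sample_availabilities(num_nodes, alpha=0.5, seed=42):
    """Draw an availability in (0, 1] per node from a power-law-shaped
    distribution: smaller alpha skews the population toward nodes that are
    rarely online, leaving only a few that are online almost always.
    alpha is an assumed parameter -- fit it to real traces if you have them."""
    rng = random.Random(seed)
    # paretovariate() returns values >= 1; inverting maps them into (0, 1].
    return [1.0 / rng.paretovariate(alpha) for _ in range(num_nodes)]

availabilities = sample_availabilities(10)
print([round(a, 2) for a in availabilities])

def is_online(availability, rng=random):
    """At each simulation tick, a node is up with probability = availability."""
    return rng.random() < availability
```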
I've found two websites that have what I was looking for: http://p2pta.ewi.tudelft.nl/pmwiki/?n=Main.Home and http://www.cs.uiuc.edu/homes/pbg/availability/