Availability data - database

I'm creating a simulator for a large-scale P2P system. To make the simulations as realistic as possible, I would like to use data from the real world to model each node's behavior (primarily its availability). Is there any availability data recorded from large P2P systems (such as BitTorrent) that is publicly available?

I'm not too sure about other P2P protocols, but here's a stab at answering the question for BitTorrent:
You should be able to glean some stats from a BitTorrent tracker log, in the case where the tracker was centralised (as opposed to a decentralised tracker, or one using a distributed hash table).
To wrap your head around the logs, have a look at one of the many log analyzers, like BitTorrent Tracker Log Analyzer.
As for actual data, it can be found all over the web. There's a giant Red Hat 9 tracker log here ☆, for instance. I'd search Google for "bittorrent tracker log".
☆ The article Dissecting BitTorrent: Five Months in a Torrent's Lifetime on that page also looks interesting.

Another way of approaching this is to simulate availability mathematically. Availability will follow some power-law distribution: the vast majority of nodes are available only rarely and for short periods of time, while a very few nodes are available nearly always over long periods.
Real-world networks will of course exhibit many other patterns in the data, so this is not a perfect simulation, but I figure it's pretty good.
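For illustration, here's a minimal Python sketch of that idea, drawing each node's availability from a Pareto (power-law) distribution. The shape and scale parameters are made up; you'd want to tune them against real traces if you have them.

```python
import random

def sample_availabilities(num_nodes, alpha=1.5, scale=0.05, seed=42):
    """Draw a per-node availability fraction from a heavy-tailed (Pareto) distribution.

    alpha and scale are assumed parameters, not calibrated values.
    Most nodes end up with low availability; a few are online almost always.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(num_nodes):
        # random.paretovariate(alpha) returns values >= 1 with a power-law tail;
        # rescale and clamp so the result is a fraction of time online in [0, 1].
        out.append(min(1.0, scale * rng.paretovariate(alpha)))
    return out

if __name__ == "__main__":
    avail = sample_availabilities(10_000)
    print("median availability:", sorted(avail)[len(avail) // 2])
    print("nodes online >90% of the time:", sum(a > 0.9 for a in avail))
```

Each value can then drive a node's on/off schedule in the simulator (e.g. as the probability of being online in any given time slot).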

I've found two websites that have what I was looking for: http://p2pta.ewi.tudelft.nl/pmwiki/?n=Main.Home and http://www.cs.uiuc.edu/homes/pbg/availability/

Related

Consensus algorithm for Node.js

I'm trying to implement a collaborative canvas on which many people can draw freehand or with specific shape tools.
The server has been developed in Node.js and the client with AngularJS 1.x (and I am pretty new to both).
I need a consensus algorithm so that all users are always shown the same thing.
I'm seriously struggling with it, since I cannot find a proper tutorial on its use. I have been looking at and studying Paxos implementations, but it seems like Raft is more widely used in practice.
Any suggestions? I would really appreciate it.
Writing a distributed system is not an easy task[1], so I'd recommend using an existing strongly consistent one instead of implementing one from scratch. The usual suspects are ZooKeeper, Consul, etcd, and Atomix/Copycat. Some of them offer Node.js clients:
https://github.com/alexguan/node-zookeeper-client
https://www.npmjs.com/package/consul
https://github.com/stianeikeland/node-etcd
I've personally never used any of them with Node.js though, so I won't comment on the maturity of the clients.
If you insist on implementing consensus on your own, then Raft should be easier to understand — the paper is surprisingly accessible: https://raft.github.io/raft.pdf. There are also some Node.js implementations, but again, I haven't used them, so it is hard to recommend any particular one. The Gaggle readme contains an example, and Skiff has an integration test which documents its usage.
Taking a step back, I'm not sure distributed consensus is what you need here. It seems like you have multiple clients and a single server, so you can probably use a centralized data store. The problem domain is not really that distributed either: shapes can be overlaid one on top of the other in the order they are received by the server, i.e. FIFO (imagine multiple people writing on the same whiteboard, where the last one wins). The challenge is with concurrent modifications of existing shapes, but maybe you can fall back to last/first change wins or something like that.
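To make that concrete, here's a rough sketch (in Python for brevity, though your server is Node.js) of a single server acting as the ordering authority with last-write-wins per shape. The operation fields are hypothetical, not something from your codebase.

```python
from dataclasses import dataclass, field

@dataclass
class ShapeOp:
    shape_id: str   # hypothetical identifier chosen by the client
    payload: dict   # e.g. geometry, colour; whatever the client sends

@dataclass
class CanvasState:
    # The single server is the ordering authority: operations are applied
    # strictly in the order they arrive (FIFO), and for the same shape_id
    # the last arrival wins.
    shapes: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def apply(self, op: ShapeOp) -> None:
        self.log.append(op)                    # total order = arrival order
        self.shapes[op.shape_id] = op.payload  # last write wins

    def snapshot(self) -> dict:
        # Broadcast this (or the individual ops) to every connected client,
        # so everyone renders the same state without distributed consensus.
        return dict(self.shapes)
```

Since all clients talk to the same server, the server's arrival order is the single source of truth and no consensus protocol is needed.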
Another interesting avenue to explore here would be Conflict-free Replicated Data Types — CRDTs. The folks at GitHub used them to implement collaborative "pair" programming in Atom. See the Atom Teletype blog post; their implementation may also be useful, as collaborative editing seems to be exactly the problem you are trying to solve.
Hope this helps.
[1] Take a look at the Jepsen series https://jepsen.io/analyses where Kyle Kingsbury tests various failure conditions of distributed data stores.
Try reading Understanding Paxos. It's geared towards software developers rather than an academic audience. For this particular application you may also be interested in the Multi-Paxos Example Application referenced by the article. It's intended to help illustrate the concepts behind the consensus algorithm, and it sounds like it's almost exactly what you need for this application. Raft and most Multi-Paxos designs tend to get bogged down with an overabundance of accumulated history that generates a new set of problems to deal with beyond simple consistency. An initial prototype could easily send the full state of the drawing on each update and ignore the history issue entirely, which is what the example application does. Later optimizations could be made to reduce network overhead.

Algorithms and data structures for 'tag' or key word based query on large data set?

I am trying to implement a storage system that supports tagging on data. A very simple application of this system is something like questions on Stack Overflow, which are tagged with multiple tags, and a query may consist of multiple tags. This also looks like searching on Google with multiple keywords.
The data set maintained by this system will be very large, like several or tens of terabytes with billions of entries.
So what data structures and algorithms should I use in such a system for maintaining and querying data? Note that the data may be stored across a cluster of machines.
Are there any guides or papers that describe this kind of problem and its solutions?
You might want to read the two books below:
Collective Intelligence in Action
Satnam Alag (ISBN: 1933988312)
http://www.manning.com/alag/
"Capter 3. Extracting intelligence from tags" covers:
Three forms of tagging and the use of tags
A working example of how intelligence is extracted from tags
Database architecture for tagging
Developing tag clouds
Programming Collective Intelligence
Toby Segaran (ISBN: 978-0-596-52932-1)
http://shop.oreilly.com/product/9780596529321.do
 "Chapter 4. Searching and Ranking" covers:
Basic concepts of algorithms for search engine index
Design of a click-tracking neural network
Hope it helps.
Your problem is very difficult, but there are plenty of related papers and books. The Amazon Dynamo paper, Yahoo!'s PNUTS, and this Hadoop paper are good examples.
First, you must decide how your data will be distributed across the cluster. Data should be spread evenly across the network, without hot spots; consistent hashing is a good solution for this problem. Data must also be redundant: every entry needs to be stored in several places so the system tolerates failures of individual nodes.
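As a rough illustration of the consistent hashing idea (not production code), here is a tiny ring with virtual nodes in Python; the node names and replica count are arbitrary.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self._nodes = set(nodes)
        # Each physical node gets `vnodes` points on the ring to smooth the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def nodes_for(self, key, replicas=3):
        """Walk clockwise from the key's position and pick `replicas` distinct nodes."""
        idx = bisect.bisect(self._keys, self._hash(key))
        chosen = []
        while len(chosen) < min(replicas, len(self._nodes)):
            node = self._ring[idx % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
            idx += 1
        return chosen

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.nodes_for("entry:12345"))   # the replicas responsible for this entry
```

The nice property is that adding or removing a node only remaps the keys adjacent to its ring positions, not the whole data set.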
Next, you must decide how writes will occur in your system. Every write must be replicated across the nodes that contain the updated data entry. You might want to read about the CAP theorem and the concept of eventual consistency (Wikipedia has good articles on both). There is also a consistency/latency trade-off. You can use different mechanisms for replicating writes: some kind of gossip protocol or state machine replication.
I don't know what kind of tagging you mean — tags manually assigned to entries, or tags learned from the data. Either way, this is the field of information retrieval (IR). You can use some kind of inverted index to search entries efficiently by tags or keywords, and you will also need a query-result ranking algorithm.
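To make the inverted-index suggestion concrete, here is a toy single-machine sketch in Python. A real system at terabyte scale would shard and compress the postings lists across the cluster and add ranking, but the query path is the same idea.

```python
from collections import defaultdict

class TagIndex:
    """Toy inverted index: tag -> set of entry ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, entry_id, tags):
        for tag in tags:
            self._postings[tag.lower()].add(entry_id)

    def query(self, tags):
        """Entries that carry *all* the given tags (intersection of postings)."""
        postings = [self._postings.get(t.lower(), set()) for t in tags]
        if not postings:
            return set()
        # Intersect starting from the smallest list -- the usual IR trick.
        postings.sort(key=len)
        result = set(postings[0])
        for p in postings[1:]:
            result &= p
        return result

idx = TagIndex()
idx.add(1, ["python", "algorithms"])
idx.add(2, ["python", "databases"])
print(idx.query(["python", "databases"]))   # {2}
```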

Are there any algorithms for creating playlists that don't require a massive database (or that work with a publicly accessible one)?

Is there any algorithm with which I can automatically create a playlist of songs that go well with each other -- similar to services like iTunes Genius -- that a single developer can actually implement? It should either a) not require any sort of remote database of listening habits etc., or b) require such a database, but work with one that is freely available.
i did this, and i used the last.fm database as described by tomasz. i didn't use "related artist" directly, but instead constructed my own relationship graph by comparing tags associated with different artists (this is not the approach suggested by lcfseth btw - i have quite a large range of music and i wanted to explore "natural" connections that might not be common partners in "normal" playlists; also i wasn't sure how uniform the related artists were).
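(for a rough idea of what "comparing tags" can look like, here's an illustrative python sketch using jaccard similarity over per-artist tag sets — this is not the exact code from the project linked below, and the example artists/tags are made up:)

```python
def jaccard(tags_a, tags_b):
    """similarity between two artists' tag sets (0 = disjoint, 1 = identical)."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def build_similarity_graph(artist_tags, threshold=0.2):
    """artist_tags: dict of artist -> iterable of tags (e.g. cached from last.fm).
    returns dict of artist -> list of (neighbour, weight) edges above threshold."""
    graph = {artist: [] for artist in artist_tags}
    artists = list(artist_tags)
    for i, a in enumerate(artists):
        for b in artists[i + 1:]:
            w = jaccard(artist_tags[a], artist_tags[b])
            if w >= threshold:
                graph[a].append((b, w))
                graph[b].append((a, w))
    return graph

graph = build_similarity_graph({
    "Elvis Costello": {"new wave", "singer-songwriter", "rock"},
    "Nick Lowe": {"new wave", "power pop", "singer-songwriter"},
    "Miles Davis": {"jazz", "trumpet"},
})
print(graph["Elvis Costello"])   # connected to Nick Lowe, not Miles Davis
```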
i also used a local database to cache data from last.fm, because calls to the api are rate limited, and i experimented with using other parts of the api to improve / normalize the information i was reading from mp3 tags.
generating a useful graph of related artists was actually quite hard; largely because some nodes in the graph naturally tend to be more important than others. if you don't "even out" the graph then your playlist will keep returning to the "important" artists.
the final result did work well, in that the selection of music had a good balance between "central theme" and variation. but the implementation is not at all polished, the calculation of the graph can take a long time (many hours), the program takes up a fair amount of memory when running, and it still seems to play elvis costello a little more than expected ;o)
if you are interested, the code is at http://code.google.com/p/uykfe/
the best part of all, from my point of view as a user, is that it can update logitech media server (squeezeserver) playlists in "realtime", adding a new track whenever the list is empty. that works really well in continuing from whatever music you select "by hand". it can also generate one-off playlists, of course, and, finally, by tweaking parameters you can get a kind of "random walk" through your music collection - it will play related tunes but slowly drift from one style to another (in fact, this is really the "default" mode - to get it to stay on a single theme i needed extra logic that biased it towards whatever music it had played earlier).
ps also, the dump of the final graph to gephi was really cool - i had it printed out and it's now pinned to the wall...
pps i also experimented with the musicbrainz database, which in theory sounds like a fantastic resource. but in practice it is over-complex and poorly documented.
I don't know iTunes Genius, but I think the Last.fm database and API might be useful for you. Whenever you view a track there, it shows you a list of similar tracks based on other users' preferences. The same information can be obtained using the track.getSimilar API method.
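If it helps, here's a rough Python sketch of calling track.getSimilar. The endpoint and JSON field names follow the Last.fm API docs as I recall them, so double-check them (and mind the rate limits) before relying on this.

```python
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_LASTFM_API_KEY"   # obtain one from last.fm/api

def similar_tracks(artist, track, limit=10):
    """Call Last.fm's track.getSimilar method and return (artist, title) pairs.

    The response parsing below assumes the documented JSON shape
    ({"similartracks": {"track": [...]}}); verify against the current docs.
    """
    params = urllib.parse.urlencode({
        "method": "track.getsimilar",
        "artist": artist,
        "track": track,
        "api_key": API_KEY,
        "format": "json",
        "limit": limit,
    })
    with urllib.request.urlopen(f"https://ws.audioscrobbler.com/2.0/?{params}") as resp:
        data = json.load(resp)
    return [(t["artist"]["name"], t["name"])
            for t in data.get("similartracks", {}).get("track", [])]

# Example: seed the next playlist entry from the track that just finished.
# print(similar_tracks("Radiohead", "Karma Police"))
```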
The idea behind most of these databases is to see what other users listen to after they listen to a given song. The accuracy of these statistics depends on the number of users, therefore it is probably hard to build this locally. The algorithm itself is not that hard to implement.
The alternative would be to sort songs based on genre, artist and so on -- information that is usually, but not always, embedded in the songs. Winamp has this feature, but it won't work for old songs unless you manually set the information or use an online song database.

Multi-Agent system application idea

I need to implement a multi-agent system for an assignment. I have been brainstorming ideas as to what I should implement, but I have not come up with anything great yet. I do not want it to be a traffic simulation application, but I need something just as useful.
I once saw an application of multiagent systems for studying/simulating fire evacuation plans in large buildings. Imagine a large building with thousands of people; in case of fire, you want these people to follow some rules for evacuating the building. To evaluate the effectiveness of your evacuation plan rules, you may want to simulate various scenarios using a multiagent system. I think it's a useful and interesting application. If you search the Web, you will find papers and works in this area, from which you might get further inspiration.
A few come to mind:
Exploration and mapping: send a team of agents out into an environment to explore, then assimilate all of their observations into consistent maps (not an easy task!)
Elevator scheduling: how to service call requests during peak capacities considering the number and location of pending requests, car locations, and their capacities (not too far removed from traffic-light scheduling, though)
Air traffic control: consider landing priorities (i.e. fuel, number of passengers, emergency conditions, etc.), airplane position and speed, and landing conditions (i.e. number of runways, etc.). Then develop a set of rules so that each "agent" (i.e. airplane) assumes its place in a landing sequence. Note that this is a harder version of the flocking problem mentioned in another reply.
Not sure what you mean by "useful", but... you can always have a look at swarm-based AI (schools of fish, flocks of birds, etc.). Each agent (boid) is very simple in this case: make the individual agents follow each other, stay away from a predator, and so on.
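To give a feel for how simple the per-agent rules are, here's a bare-bones boids update in Python (the classic cohesion/alignment/separation rules); all the constants are arbitrary and purely illustrative.

```python
import random

class Boid:
    def __init__(self):
        self.x, self.y = random.uniform(0, 100), random.uniform(0, 100)
        self.vx, self.vy = random.uniform(-1, 1), random.uniform(-1, 1)

def step(boids, cohesion=0.01, alignment=0.05, separation=0.1, radius=10.0):
    """One update of the three classic boid rules; constants are arbitrary."""
    for b in boids:
        neighbours = [o for o in boids
                      if o is not b and (o.x - b.x) ** 2 + (o.y - b.y) ** 2 < radius ** 2]
        if not neighbours:
            continue
        n = len(neighbours)
        cx = sum(o.x for o in neighbours) / n
        cy = sum(o.y for o in neighbours) / n
        avx = sum(o.vx for o in neighbours) / n
        avy = sum(o.vy for o in neighbours) / n
        # cohesion: steer toward the local centre of mass
        b.vx += cohesion * (cx - b.x)
        b.vy += cohesion * (cy - b.y)
        # alignment: match the neighbours' average heading
        b.vx += alignment * (avx - b.vx)
        b.vy += alignment * (avy - b.vy)
        # separation: move away from neighbours that are too close
        for o in neighbours:
            if (o.x - b.x) ** 2 + (o.y - b.y) ** 2 < (radius / 3) ** 2:
                b.vx -= separation * (o.x - b.x)
                b.vy -= separation * (o.y - b.y)
    for b in boids:
        b.x += b.vx
        b.y += b.vy

flock = [Boid() for _ in range(50)]
for _ in range(100):
    step(flock)
```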
It's not quite multi-agent, but have you considered a variation on ant colony optimisation?
http://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms

Looking for an example of when screen scraping might be worthwhile

Screen scraping seems like a useful tool - you can go onto someone else's site and steal their data - how wonderful!
But I'm having a hard time seeing how useful this could be.
Most application data is pretty specific to that application even on the web. For example, let's say I scrape all of the questions and answers off of StackOverflow or all of the results off of Google (assuming this were possible) - I'm left with data that is not very useful unless I either have a competing question and answer site (in which case the stolen data will be immediately obvious) or a competing search engine (in which case, unless I have an algorithm of my own, my data is going to be stale pretty quickly).
So my question is, under what circumstances could the data from one app be useful to some external app? I'm looking for a practical example to illustrate the point.
It's useful when a site publicly provides data that is (still) not available as an XML service. I had a client who used scraping to pull flight tracking data into one of his company's intranet applications.
The technique is also used for research. I had a client who wanted to compare the contents of several online dictionaries by part of speech, and all of these sites had to be scraped.
It is not a technique for "stealing" data. All ordinary usage restrictions apply. Many sites implement CAPTCHA mechanisms to prevent scraping, and it is inappropriate to work around these.
A good example is StackOverflow - no need to scrape data as they've released it under a CC license. Already the community is crunching statistics and creating interesting graphs.
There's a whole bunch of popular mashup examples on ProgrammableWeb. You can even meet up with fellow mashupers (O_o) at events like BarCamps and Hack Days (take a sleeping bag). Have a look at the wealth of information available from Yahoo APIs (particularly Pipes) and see what developers are doing with it.
Don't steal and republish, build something even better with the data - new ways of understanding, searching or exploring it. Always cite your data sources and thank those who helped you. Use it to learn a new language or understand data or help promote the semantic web. Remember it's for fun not profit!
Hope that helps :)
If the site has data that would benefit from being accessible through an API (and it would be free and legal to do so), but they just haven't implemented one yet, screen scraping is a way of essentially creating that functionality for yourself.
Practical example -- screen scraping would allow you to create some sort of mashup that combines information from the entire SO family of sites, since there's currently no API.
Well, to collect data from a mainframe. That's one reason why some people use screen scraping. Mainframes are still in use in the financial world, and they often run software that was written in the previous century. The people who wrote it may already be retired, and since this software is very critical for these organizations, they really hate it when new code needs to be added. So screen scraping offers an easy interface to communicate with the mainframe: collect information from it, then send it onwards to any process that needs it.
Rewrite the mainframe application, you say? Well, software on mainframes can be very old. I've seen software on mainframes that was over 30 years old, written in COBOL. Often, those applications work just fine and companies don't want to risk rewriting parts because it might break some code that had been working for over 30 years! Don't fix things if they're not broken, please. Of course, additional code could be written but it takes a long time for mainframe code to be used in a production environment. And experienced mainframe developers are hard to find.
I myself had to use screen scraping in a software project too. This was a scheduling application which had to capture the console output of every child process it started. It's the simplest form of screen scraping, actually, and many people don't even realize that if you redirect the output of one application to the input of another, it's still a kind of screen scraping. :)
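As a trivial illustration of that last point, capturing a child process's console output takes only a few lines; this is a Python sketch and the child command is just a stand-in.

```python
import subprocess
import sys

# Capture everything a child process writes to its console -- the "simplest
# form of screen scraping" described above. The child here is just a stand-in.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from the child process')"],
    capture_output=True,
    text=True,
)
print("exit code:", result.returncode)
print("captured output:", result.stdout.strip())
```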
Basically, screen scraping allows you to connect one (web) application with another one. It's often a quick solution, used when other solutions would cost too much time. Everyone hates it, but the amount of time it saves still makes it very efficient.
Let's say you wanted to get scores from a popular sports site that did not make the information available through an XML feed or API.
For one project we found a (cheap) commercial vendor that offered translation services for a specific file format. The vendor didn't offer an API (it was, after all, a cheap vendor) and instead had a web form to upload and download from.
With hundreds of files a day, the only way to do this was to use WWW::Mechanize in Perl to screen scrape our way through the login and upload boxes, submit the file, and save the returned file. It's ugly and definitely fragile (if the vendor changes the site in the least, it could break the app), but it works. It's been working now for over a year.
One example from my experience.
I needed a list of major cities throughout the world with their latitude and longitude for an iPhone app I was building. The app would use that data along with the geolocation feature on the iPhone to show which major city each user of the app was closest to (so as not to show exact location), and plot them on a 3D globe of the earth.
I couldn't find an appropriate list in XML/Excel/CSV type format anywhere easily, but I did find this wikipedia page with (roughly) the info I needed. So I wrote up a quick script to scrape that page and load the data into a database.
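For anyone wanting to do something similar, here's roughly what such a quick script can look like in Python. The table layout and column indices are assumptions you'd adjust to the actual page, and requests/BeautifulSoup are third-party packages you'd need to install.

```python
import sqlite3

import requests                    # pip install requests
from bs4 import BeautifulSoup      # pip install beautifulsoup4

def scrape_city_table(url, lat_col=1, lon_col=2):
    """Pull rows out of the first HTML table on `url`.

    The column indices are assumptions -- inspect the actual page and adjust
    them to wherever the city name, latitude and longitude really live.
    """
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr")[1:]:          # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        if len(cells) > max(lat_col, lon_col):
            rows.append((cells[0], cells[lat_col], cells[lon_col]))
    return rows

def save(rows, db_path="cities.db"):
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS cities (name TEXT, lat TEXT, lon TEXT)")
        conn.executemany("INSERT INTO cities VALUES (?, ?, ?)", rows)

# save(scrape_city_table("https://en.wikipedia.org/wiki/..."))  # the page from the answer
```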
Any time you need a computer to read the data on a website. Screen scraping is useful in exactly the same instances that any website API is useful. Some websites, however, don't have the resources to create an API themselves; screen scraping is the developer's way around that.
For instance, in the earlier days of Stack Overflow, someone built a tool to track changes to your reputation over time, before Stack Overflow itself provided that feature. The only way to do that, since Stack Overflow has no API, was to screen scrape.
The obvious case is when a webservice doesn't offer reverse search. You can implement that reverse search over the same data set, but it requires scraping the entire dataset.
This may be fair use if the reverse search also requires significant pre-processing, e.g. because you need to support partial matching. The data source may not have the technical skills or computing resources to provide the reverse search option.
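A sketch of that kind of reverse lookup with partial matching, using a simple n-gram index in Python; the data and field names are hypothetical and just stand in for a scraped dataset.

```python
from collections import defaultdict

def ngrams(text, n=3):
    # Queries shorter than n characters won't match; fine for a sketch.
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

class ReverseIndex:
    """Map values back to keys, with n-gram buckets so partial matches work."""

    def __init__(self, forward):
        # forward: dict of key -> value built from the scraped dataset
        self._forward = forward
        self._buckets = defaultdict(set)
        for key, value in forward.items():
            for g in ngrams(value):
                self._buckets[g].add(key)

    def lookup(self, partial_value):
        """Keys whose value contains `partial_value` (candidates found via n-grams)."""
        grams = ngrams(partial_value)
        candidates = set.intersection(*(self._buckets.get(g, set()) for g in grams)) if grams else set()
        return {k for k in candidates if partial_value.lower() in self._forward[k].lower()}

idx = ReverseIndex({"id-1": "Alice Example", "id-2": "Bob Sample"})
print(idx.lookup("exam"))   # {'id-1'}
```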
I use screen scraping daily. I run some eCommerce sites and have screen-scraping scripts running every day to gather product lists automatically from my suppliers' wholesale sites. This allows me to have up-to-date information on all the products available to me from several suppliers and to flag non-economical margins due to price changes.
