Association Rule Mining on a FOAF dataset of social networks

I am working on a project called "Association rule discovery from social network data: Introducing Data Mining to the Semantic Web". Can anyone suggest a good source for an algorithm (and its code) to find association rules in a social network database? I have heard it can be implemented using Perl and also with R packages.
A snapshot of the database can be downloaded from the following link: https://docs.google.com/uc?id=0B0mXGRdRowo1MDZlY2Q0NDYtYjlhMi00MmNjLWFiMWEtOGQ0MjA3NjUyZTE5&export=download&hl=en_US
The full dataset is available at the following link: http://ebiquity.umbc.edu/get/a/resource/82.zip
I have searched a lot for this project but unfortunately haven't found anything useful yet. The following link seemed somewhat related:
Criminal data: http://www.computer.org/portal/web/csdl/doi/10.1109/CSE.2009.435
Your help will be highly appreciated.
Thank You,

Well, the most widely used implementations of the original Association Rules algorithm (originally developed at IBM's Almaden Research Center) are Apriori and Eclat, in particular the C implementations by Christian Borgelt.
(A brief summary for anyone not familiar with Association Rules, aka "Frequent Item Sets" or "Market Basket Analysis": the prototype application is analyzing consumer transactions, e.g. supermarket data: among shoppers who buy Polish sausage, what percentage also purchase black bread?)
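For reference, the two standard measures used to rank such rules are support and confidence; in the example above, the confidence is exactly the percentage being asked about. With X and Y itemsets and T the set of transactions:

    \mathrm{supp}(X) = \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|}, \qquad
    \mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}

Here X = {polish sausage} and Y = {black bread}: the rule's support is the fraction of all baskets containing both items, and its confidence is the fraction of sausage-containing baskets that also contain the bread.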
I would recommend the statistical platform R. It is free and open source, and its package repository contains (at least) four libraries devoted solely to Association Rules, all with excellent documentation: three of the four packages include a manual and a separate vignette (an informal prose document with code examples). Both the manuals and the vignettes contain numerous examples in R code.
I have used three of the four packages below and I can recommend those three highly. Among them are bindings for Eclat and Apriori. These libraries are distributed as R packages, which are available on CRAN, R's primary package repository. Basic installation and setup of R is trivial; there are binaries for Mac, Linux, and Windows available from CRAN. Likewise, package installation/integration is as simple as you would expect from an integrated platform (though not every one of the four packages listed below has binaries for every OS).
So on CRAN you will find these packages, all devoted solely to Association Rules:
arules
arulesNBMiner
arulesSequences
arulesViz
This set of four R packages comprises R bindings for four different Association Rules implementations, as well as a visualization library.
The first package, arules, includes the R bindings for Eclat and Apriori. The second, arulesNBMiner, is the binding for Michael Hahsler's NB-frequent-itemsets algorithm. The third, arulesSequences, is the binding for Mohammed Zaki's cSPADE.
The last of these, arulesViz, is particularly useful because it is a visualization library for plotting the output from any of the previous three packages. For your social network study, I suspect you will find the graph visualization, i.e., explicit visualization of the nodes (users in the data set) and edges (connections between them), especially helpful.
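To give you a concrete starting point, here is a minimal sketch of a first pass with arules and arulesViz. The file name foaf_baskets.csv, the "basket" layout (one row per user, listing that user's connections or interests), and the support/confidence thresholds are all placeholders I've assumed; you would first have to flatten the FOAF/RDF data into such a transactions format and then tune the thresholds:

    # Minimal sketch, not tested against your data: assumes the FOAF data has been
    # flattened into a "basket" file, one row per user, items separated by commas.
    install.packages(c("arules", "arulesViz"))   # one-time setup from CRAN
    library(arules)
    library(arulesViz)

    # Read the flattened data as a transactions object ("foaf_baskets.csv" is a
    # placeholder file name).
    txns <- read.transactions("foaf_baskets.csv", format = "basket", sep = ",")

    # Mine rules with Apriori; these thresholds are arbitrary starting points and
    # will need tuning for a sparse social-network dataset.
    rules <- apriori(txns, parameter = list(supp = 0.01, conf = 0.5))

    # Show the strongest rules and plot them as a graph of items and rules.
    top <- head(sort(rules, by = "lift"), 20)
    inspect(top)
    plot(top, method = "graph")

The same txns object can be fed to eclat() instead of apriori() if you only need frequent itemsets rather than rules.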

This is a bit broader than http://en.wikipedia.org/wiki/Association_rule_learning but hopefully useful.
Some earlier FOAF work that might be interesting (SVD/PCA etc):
http://stderr.org/~elw/foaf/
http://www.scribd.com/doc/353326/The-Social-Semantics-of-LiveJournal-FOAF-Structure-and-Change-from-2004-to-2005
http://datamining.sztaki.hu/files/snakdd.pdf
Also Ch.4 of http://www.amazon.com/Understanding-Complex-Datasets-Decompositions-Knowledge/dp/1584888326 is devoted to the application of matrix decomposition techniques against graph data structures; strongly recommended.
Finally, Apache Mahout is the natural choice for large-scale data mining, machine learning, etc.: https://cwiki.apache.org/MAHOUT/dimensional-reduction.html

If you want some Java code, you can check my website for the SPMF software. It provides source code for more than 45 algorithms for frequent itemset mining, association rule mining, sequential pattern mining, etc.
Moreover, it does not only provide the most popular algorithms. It also offers many variations, such as mining rare itemsets, high-utility itemsets, uncertain itemsets, non-redundant association rules, closed association rules, indirect association rules, top-k association rules, and much more.

Related

Database of scientific paper abstracts

I am trying to find a database of scientific papers which will allow me to:
1. Get metadata of papers by DOI (including abstracts);
2. Do this regularly (e.g. with daily updates);
3. Download the whole existing database.
I know about the Crossref API; however, only about 3% of the publications it covers have abstracts (and none of the biggest publishers, such as Springer or Elsevier, provide them). On the other hand, I see some projects, like Dimensions or Researcher, which have already implemented the functionality I mentioned. So the question is: does somebody know of such services (possibly not free) and have experience working with them?
Have you looked at Semantic Scholar (https://www.semanticscholar.org/)? They have an API that supports the first of your requirements (http://api.semanticscholar.org/) and also provide the "Open Research Corpus" (http://labs.semanticscholar.org/corpus/) which should satisfy your third requirement. It is a smaller database than what is provided by Scopus or Web of Science, but both of those require subscriptions to fully use their APIs and don't (as far as I know) have a real way for you to purchase a full download of the database.
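To give a feel for your first requirement, here is a rough sketch of a DOI lookup in R; the exact endpoint path, the example DOI, and the JSON field names are assumptions on my part, so check them against the API documentation linked above:

    # Hypothetical sketch: look up a single paper by DOI through the Semantic
    # Scholar API. The endpoint path and the field names ("title", "abstract")
    # are assumptions; verify them at api.semanticscholar.org.
    library(jsonlite)

    doi <- "10.1109/CSE.2009.435"                      # any DOI of interest
    url <- paste0("https://api.semanticscholar.org/v1/paper/", doi)

    paper <- fromJSON(url)                             # fetches and parses the JSON response
    cat(paper$title, "\n\n", paper$abstract, "\n")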

Yocto and a relational database: why not?

I've been working with Yocto for a while. The job involves adding a layer on top of meta-yogurt, and thus adding new recipes or modifying existing ones in this layer. It has been working out so far.
But while struggling with the recipe/rule syntax and the relationships between recipes, my gut told me that both the internal implementation of the build system and the way it interacts with an end developer like me would be much simpler if a relational database were involved, one that defines the entities, their attributes, and their relationships.
But I am no Yocto expert (and cannot spend time on becoming one, considering my project schedule), so I cannot evaluate a redesign of Yocto around a relational DB myself. I just wonder whether this makes sense to anybody else?
Speaking as one of the developers who has worked extensively on bitbake: whilst I can see why you might think this, it doesn't really make sense, for many reasons, not least performance. We once did try using sqlite as a data store backend within bitbake, and it was orders of magnitude too slow to be useful compared to our Python data store.
The flexibility in the metadata also comes from things like the ability to inject Python code, and with a database engine this wouldn't be easily possible.
So in summary, I doubt bitbake or the Yocto Project will ever use relational databases in this way.
I tried to post a comment on Richard Purdie's answer from Oct 7 '17, but the comment box didn't allow line breaks and long text...
I was not aware that there was already a Python data store behind the Yocto objects and rules (the mixture of variables, string literals, regular expressions, file types such as layer, bb, bbappend, conf, inc, etc., and the half-explicit, half-documented syntax and relationships among them :().
Of course, in a Python environment it makes the most sense to use the combination of its data store and language bindings, not another relational DB. But I imagine that with a relational DB everything, such as the data store and the rules, could be implemented in one place, querying and modifying would be an integrated part of it, and bindings to different languages/scripting should be possible. Was it actually the same thing, just done in the "Python way" I am not familiar with?
As for performance, we already need a powerful machine (Xeon E5, 16 GB RAM, 1 TB HDD) to do a clean build of one distro (without fetching sources from remote servers) in an hour, and I believe other people also use their best available PC for such a serious job as building a distro. So, without comparing the performance of different DB engines and programming languages, which can't differ by orders of magnitude, I would just set the performance factor aside.
Injecting code into a database engine might be difficult. But I guess that when using a database engine a different programming pattern does the job: the logic is implemented outside the DB engine, and that process contacts the DB engine when needed.
I actually haven't been working with Yocto for a while, because our Yocto build setup has been working great and producing new distros for us now and then with the previously prepared layer. Just some thoughts...

Any examples of production applications that use signature trees?

I've been reading a lot lately about signature trees, or S-Trees. For example, this paper. The literature speaks very highly of them, and evidence is provided for considerable performance gains over, for example, inverted files or B-Trees, for some applications.
Now, why is it that I don't see S-Trees used very much? Do you know of any prominent instances of such a data structure in a popular application? Are there DBMS implementations that offer signature-tree indexes?
Now, why is it that I don't see S-Trees used very much?
Including a new indexing or join method into a database is a very complex task.
MySQL, for instance, doesn't yet implement MERGE JOIN and HASH JOIN, which were invented by, like, the ancient Romans or Archimedes or around that time.
And the paper you referenced is dated 2006, yet this method isn't even mentioned in Wikipedia.
This means that it's either still unknown to developers or not worth using in an RDBMS (or both).
I've heard of something similar described as a "C-tree"; it was part of an object database, and I imagined that its indexing methods were similar to what the paper in the link describes. A company called InterSystems makes a database system called Caché that they describe as "post-relational" and that is very hierarchical... I don't know enough about the details of these different systems to be sure that they're all different names for the same functionality, but they have some overlapping fundamental concepts.

How to design and verify distributed systems?

I've been working on a project, which is a combination of an application server and an object database, and is currently running on a single machine only. Some time ago I read a paper which describes a distributed relational database, and got some ideas on how to apply the ideas in that paper to my project, so that I could make a high-availability version of it running on a cluster using a shared-nothing architecture.
My problem is that I don't have experience designing distributed systems and their protocols; I did not take the advanced CS courses on distributed systems at university. So I'm worried about being able to design a protocol that does not cause deadlock, starvation, split brain, and other problems.
Question: Where can I find good material about designing distributed systems? What methods are there for verifying that a distributed protocol works correctly? Recommendations of books, academic articles, and other resources are welcome.
I learned a lot by looking at what is published about really huge web-based platforms, and especially at how their systems evolved over time to meet their growth.
Here are some examples I found enlightening:
eBay Architecture: Nice history of their architecture and the issues they had. Obviously they can't use a lot of caching for the auctions and bids, so their story is different in that point from many others. As of 2006, they deployed 100,000 new lines of code every two weeks - and are able to roll back an ongoing deployment if issues arise.
Paper on Google File System: Nice analysis of what they needed, how they implemented it and how it performs in production use. After reading this, I found it less scary to build parts of the infrastructure myself to meet exactly my needs, if necessary, and that such a solution can and probably should be quite simple and straight-forward. There is also a lot of interesting stuff on the net (including YouTube videos) on BigTable and MapReduce, other important parts of Google's architecture.
Inside MySpace: One of the few really huge sites built on the Microsoft stack. You can learn a lot about what not to do with your data layer.
A great starting point for finding many more resources on this topic is the Real Life Architectures section on the "High Scalability" web site. For example, they have a good summary of Amazon's architecture.
Learning distributed computing isn't easy. It's really a very vast field covering areas such as communication, security, reliability, concurrency, etc., each of which would take years to master. Understanding will eventually come through a lot of reading and practical experience. You seem to have a challenging project to start with, so here's your chance :)
The two most popular books on distributed computing are, I believe:
1) Distributed Systems: Concepts and Design - George Coulouris et al.
2) Distributed Systems: Principles and Paradigms - A. S. Tanenbaum and M. Van Steen
Both of these books give a very good introduction to current approaches (including communication protocols) that are being used to build successful distributed systems. I've personally used the latter mostly, and I've found it to be an excellent text. If you think the reviews on Amazon aren't very good, it's because most readers compare this book to other books written by A. S. Tanenbaum (who IMO is one of the best authors in the field of Computer Science), which are quite frankly better written.
PS: I really question your need to design and verify a new protocol. If you are working with application servers and databases, what you need is probably already available.
I liked the book Distributed Systems: Principles and Paradigms by Andrew S. Tanenbaum and Maarten van Steen.
At a more abstract and formal level, Communicating and Mobile Systems: The Pi-Calculus by Robin Milner gives a calculus for verifying systems. There are variants of pi-calculus for verifying protocols, such as SPI-calculus (the wikipedia page for which has disappeared since I last looked), and implementations, some of which are also verification tools.
Where can I find good material about designing distributed systems?
I have never been able to finish the famous book by Nancy Lynch. However, I find that the book by Sukumar Ghosh, Distributed Systems: An Algorithmic Approach, is much easier to read, and it points to the original papers where needed.
It is nevertheless true that I haven't read the books by Gerard Tel and Nicola Santoro. Perhaps they are even easier to read...
What methods there are for verifying that a distributed protocol works right?
In order to survey the possibilities (and also to understand the question), I think it is useful to get an overview of the available tools from the book Software Specification Methods.
My final decision was to learn TLA+. Why? Even if other languages and tools might seem better, I really decided to try TLA+ because the person behind it is Leslie Lamport: not just a prominent figure in distributed systems, but also the author of LaTeX!
You can get the TLA+ book and several examples for free.
There are many classic papers written by Leslie Lamport (http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html) and by Edsger Dijkstra (http://www.cs.utexas.edu/users/EWD/).
On the database side, a major trend is the NoSQL movement; many projects are appearing in the market, including CouchDB (couchdb.apache.org), MongoDB, and Cassandra. These all promise scalability and manageability (replication, fault tolerance, high availability).
One good book is Birman's Reliable Distributed Systems, although it has its detractors.
If you want to formally verify your protocol you could look at some of the techniques in Lynch's Distributed Algorithms.
It is likely that whatever protocol you are trying to implement has been designed and analysed before. I'll just plug my own blog, which covers e.g. consensus algorithms.

Meta-model evolution in the Eclipse Modeling Framework

I am attempting to evaluate EMF for use within a project. One of the things I am looking at is some kind of versioning support at the metamodel (M2, or the .ecore model) level.
In terms of metamodel evolution, I have read certain discussions and have come across this paper. However, I wanted to know if there is anything concrete happening in this direction within EMF.
In general, what is the level of support for features involving versioning, such as merge and compare, evolution, migration, co-existence of multiple versions simultaneously, etc.? I realize that the actual versioning itself will be provided by the source control system that one would use to store these metamodels; however, semantic versioning capabilities (such as the ones I have mentioned above) should be provided by EMF itself, right?
I am aware of certain initiatives, such as EMF Compare and Temporality, which are meant for EMF models. I am not sure whether these work at the metamodel level.
I am working on metamodel evolution in my PhD thesis. To show the applicability of my ideas, I have developed tool support for metamodel evolution in EMF which is called COPE. On the website, you can access a number of publications about COPE as well as download the tool itself. In addition, I am currently proposing a project to contribute COPE to EMF.
In general, every tool which works with Ecore models will work with Ecore meta-models as well, since the meta-model of Ecore is Ecore. (Take some time to let this sink in, I know I had to...)
I've successfully used EMF Compare with my Ecore meta-model, don't know about the other tools you mentioned.
