Datasets for Apache Mahout - dataset

I am looking for datasets that can be used for implementing recommendation system usecase of Apache Mahout. I know of only MovieLens Data Sets from GroupLens Research group.
Anyone knows any other datasets that can be used for recommendation system implementation? I am particularly interested in item-based data sets though other datasets are most welcome.

this is Sebastian from Mahout.
There is a dataset from a czech dating website available that might be of interest to you: http://www.occamslab.com/petricek/data/
Btw the term item-based refers to a special collaborative filtering approach not to the dataset itself, which is usually in the common form of user-item-rating tripels that most collaborative filtering approaches work with.
We would love to hear from your experimentation results and experiences (if you wanna share them) on our user mailinglist at user#mahout.apache.org

While searching for data sets, I found few sites that list publicly available data sets which can used for data mining. Some of these can be used for Mahout too.
Bixo Labs
UCI Datasets
KDnuggets

You can look at iPinYou RTB Bidding Data Set
Quora : http://qr.ae/OrqgM
http://contest.ipinyou.com/data-release.html

Related

Pros/Cons of incorporating multiple database types into same project

I'm beginning to pursue my first online project that I am planning will need to scale as such I have opted for a NoSQL DB. Some reading into this and modeling of what my queries would look like and there are two databases I am considering. Cassandra seems like the right choice for item lookups by keyword but MongoDB sounds like the right choice for initially entering the data in as it can retain the account structure in document form.
This split decision has left me wondering: Are there any major companies that use multiple database types for storage of different items as in using both Cassandra and Mongo together?
I would think scaling up would be more difficult but are the added benefits (if there are any) worth the trouble? I'm not the expert on this. I'm hoping you are. Thanks in advance for sharing your experience.
Cassandra can handle both use cases so you can use the same database for your purposes.
Stargate (https://stargate.io/) is an open-source API platform which provides a data gateway to Cassandra with REST API, GraphQL API, Document API and even native CQL access.
The Document API lets you save and search schemaless JSON documents to/from Cassandra directly from your app.
You can try it out for free on Astra with no credit card required. In just a few clicks, you'll be able to launch a Cassandra cluster with Stargate pre-configured so you can use the Document API straight out-of-the box and build a proof-of-concept app immediately without having to worry about downloading/installing/configuring a Cassandra cluster.
There are even sample apps you can access straight from the Astra dashboard so you can see Stargate in action. For more info, see Using the Document API on Astra. Cheers!
Using multiple database technologies in the same project is somewhat common nowadays and it is called "Polyglot persistence".
Many people use this method to take advantage of multiple systems - and as you mentioned Cassandra is right for somethings and something else (maybe MongoDB) is best for something else, so using a combination can give the advantage of both worlds.
Scaling, Replication, Support can be more costly when you use multiple technologies because you need expertise in both to support.
So if you really have use cases where Cassandra wont be a good choice and you have some primary use cases where Cassandra is the best choice then yes, going with two databases can be the best option provided you are ready to take the trouble of supporting two systems.

Neo4j Drawbacks

So I am putting together a report on Neo4j and it's potential advantages over your average relational database. In my research, I've encountered a couple issues that might not make Neo4j the best choice at the present for a public server application. Namely:
Neo4j uses Apache Lucene, which treats all data as text. This makes purely integer data queries much slower than they need to be.
Neo4j has no user management built in. All security must be done at the application level.
My question is whether my research is outdated and these issues have solutions. I know that community and support for Neo4j and other graph databases is growing fast. Anyone with knowledge on the current state of Neo4j could really help me out.
Thanks in advance
Reply to your first bullet:
By using ValueContext you can tell Lucene to index it numerically and also to query by numeric value or numeric range. See for example https://github.com/neo4j/neo4j/blob/master/community/lucene-index/src/test/java/org/neo4j/index/impl/lucene/TestLuceneIndex.java#L622
As for the second bullet, it appears that version 2.2 (currently at M3) introduces true user management. See http://neo4j.com/blog/neo4j-2-2-milestone-1-release/ for a basic overview of the new features. Note that this is not a production ready release.

What type of NoSQL database is best suited to store hierarchical data?

What type of NoSQL database is best suited to store hierarchical data?
Say for example I want to store posts of a forum with a tree structure:
original post
+ re: original post
+ re: original post
+ re2: original post
+ re3: original post
+ re2: original post
MongoDB and CouchDB offer solutions, but not built in functionality. See this SO question on representing hierarchy in a relational database as most other NoSQL solutions I've seen are similar in this regard; where you have to write your own algorithms for recalculating that information as nodes are added, deleted and moved. Generally speaking you're making a decision between fast read times (e.g. nested set) or fast write times (adjacency list). See aforementioned SO question for more options along these lines - the flat table approach appears most aligned with your question.
One standard that does abstract away these considerations is the Java Content Repository (JCR), both Apache JackRabbit and JBoss eXo are implementations. Note, behind the scenes both are still doing some sort of algorithmic calculations to maintain hierarchy as described above. In addition, the JCR also handles permissions, file storage, and several other aspects - so it may be overkill for your project.
What you possibly need is a document-oriented database like MongoDB or CouchDB.
See examples of different techniques which allow you to store hierarchical data in MongoDB:
http://www.mongodb.org/display/DOCS/Trees+in+MongoDB
The most common one is IBM's IMS.There is also Cache Database
See this question posted on dba section of stackexchange.
Faced with the same issue, I decided to create my own (very simple) solution using Lua + Redis https://github.com/qbolec/Redis-Tree/
Exist-db implemented hierarchical data model for xml persistence
Graph databases would probably also solve this problem. If neo4j is not enough for you in terms of scaling, consider Titan, which is based on various storage back-ends including HBase and should scale very well. It is not as mature as neo4j, but it is a very promising project.
LDAP, obviously. OpenLDAP would make short work of it.
In mathematics, and, more specifically, in graph theory, a tree is an undirected graph in which any two vertices are connected by exactly one path. So any graph db will do the job for sure. BTW an ordinary graph like a tree can be simply mapped to any relational or non-relational DB. To store hierarchical data into a relational db take a look at this awesome presentation by Bill Karwin. There are also ORMs with facilities to store trees. For example TypeORM supports the Adjacency list and Closure table patterns for storing hierarchical structures.
TypeORM is used in TypeScript\Javascript development. Check popular ORMs to find a one supporting trees based on your environment.
The king of Non-relational DBs [IMHO] is Mongodb. Check out it's documentation. to find out how it stores trees. Trees are the most common kind of graphs and they are used everywhere. Any well-established DB solution should have a way to deal with trees.
Here's a non-answer for you. SQLServer 2008!!!! It's great for recursive queries. Or you can go the old fashioned route and store hierarchy data in a separate table to avoid recursion.
I think relational databases lend themselves very well to tree data. Both in query performance and ease of use. With one caveat.... you will be inserting into an indexed table, and probably several other indexed tables every time someone makes a post. Insert performance could be an issue on a facebook caliber forum.
Check out MarkLogic. You can download a demo copy from the website. It is a database for unstructured data and falls under the NoSQL classification of databases. I know unstructured data is a pretty loaded term but just think of it as data that does not fit well in the rows and columns of a RDBMS (like hierarchical data).
Just spent the weekend at a training course using MUMUPS db as a back-end for a full stack javascript browser application development framework. Great stuff! I'd recommend GT.M distro of MUMPS under GPL. Or try http://sourceforge.net/projects/mumps/?source=recommended for vanilla MUMPS. Check out http://robtweed.wordpress.com/ for ewd.js js framework and more info on MUMPS.
A NoSql storage service with native support for hierarchical data is Amazon Web Service's Simple Storage Service (AWS S3). The path based keys are hierarchical by nature, and the blob values may be typed using attributes (mime type, e.g. application/json, text/csv, etc.). Advantages of S3 include the ability to scale to both extremely large overall capacity, versioning, as well as nearly infinite concurrent writes. Disadvantages include no support for conditional writes (optimistic concurrency), or consistent reads (only for read-after write) and no support for references/relationships. It is also purely usage based so wide variations in demand do not require complex scaling infrastructure or over-provisioned capacity.
Clicknouse db has explicit support for hierarchical data

Any scalable OLAP database (web app scale)?

I have an application that requires analytics for different level of aggregation, and that's the OLAP workload. I want to update my database pretty frequently as well.
e.g., here is what my update looks like (schema looks like: time, dest, source ip, browser -> visits)
(15:00-1-2-2010, www.stackoverflow.com, 128.19.1.1, safari) --> 105
(15:00-1-2-2010, www.stackoverflow.com, 128.19.2.1, firefox) --> 110
...
(15:00-1-5-2010, www.cnn.com, 128.19.5.1, firefox) --> 110
And then I want to ask what is the total visit to www.stackoverflow.com from a firefox browser last month.
I understand Vertica system can do this in a relatively cheap way (performance and scalability wise, but not cost-wise probably). I have two questions here.
1) Is there an open-source product that I can build upon to solve this problem? In particular, how well does a Mondrian system work? (scalability, and performance)
2) Is there an HBase or Hypertable base solution (obviously, a naked HBase/Hypertable can't do this) for this? -- but if there is a project based on HBase/Hypertable, scalability probably won't be an issue IMO)?
Thanks!
You can download a free edition (the single node edition) of the greenplum database. I haven't tried it myself but I think/guess it is a powerful beast. Read here: http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/
Another option is MongoDB, it is fast and free and you can write MapReduce functions with JavaScript to do analytics.
My reputation here is to low to add a hyperlink to mongodb, so you have to google . I can add only one hyper link per post.
The zohmg project aims to solve this problem using Hadoop and HBase.
Facebook also built Hive on-top of Hadoop. Pretty simple to get going - reasonable query API too.
http://mirror.facebook.net/facebook/hive/
Is your data model more complex than that? If it isn't you might be beter of just writing custom code for it. Then you can really tune it to your data. Real products have to offer a lot of flexibility, need a lot of complexiy to achieve that, and suffer in speed as a result.
Your question is not clear in one aspect: when you talk about scalable, what do you mean by that? Are you collecting data from lots of sites but only have a limited amount of query users, or do you also have a lot of users? That situation leads to a significantly different model.

Is any group or foundation developing an algorithm for better storing massive amounts of data?

I've looked at several approaches to enterprise architecture for databases that store massive amounts of data, and it usually comes down to more hardware, database sharding, and storing JSON objects. Has any group been doing research, or does anyone have a more dynamic approach that processes the available data and tells you how to better store it, and then instructs you how to retrieve it given the new method of storage? I know it sounds a bit fanciful, but I figured I would ask anyway.
You might find this interesting:
http://en.wikipedia.org/wiki/BigTable
Very interesting question. It seems to me like the Semantic Web folks may have to deal with this issue before too long. It also seems to me that they've got some technologies that might provide at least part of the solution. Have a look at the OWL specs, for instance.

Resources