Ensemble classifiers (Random Forest, Bagging, Boosting, etc.) in SSAS - sql-server

I am using SSAS (SQL Server 2008 R2) to develop a classification model for a data set where 80% of values are missing. Ensemble classifiers based on trees are supposedly the best solution (Random Forest for example).
Is there any nice way of adding an ensemble classifier into SSAS? For example an AdaBoost or any other Bagging or Boosting classifier?
I know SSAS provides plug-in functionality, but I have not come across anyone doing any ensemble solutions... Not to mention anything that you can just download and start using.
If not, is there any efficient method to connect various classifiers in SSAS? I hope I am missing something obvious that is there.

I am not too familiar with the topic you are asking about, but technically, SSAS lets you register assemblies and call them from MDX. It could therefore be possible to code this in .NET and use the logic in SSAS. Have a look at the following MSDN page for more information if this sounds like something worth exploring.
http://technet.microsoft.com/en-us/library/ms175398.aspx
Additionally, have a look at the Data Mining features SSAS provides out of the box, as some classification objectives may be achievable with the included algorithms:
http://technet.microsoft.com/en-us/library/ms175595.aspx
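As for "connecting various classifiers": the simplest ensemble is a majority vote over the predictions of several independently trained models. The following is a minimal Python sketch of that idea only, not SSAS code. The three rule-based classifiers and the field names (age, income, visits) are hypothetical stand-ins for predictions you would fetch from separate mining models, and the same logic could presumably be ported to a .NET assembly. A classifier abstains (returns None) when the value it needs is missing, which is one way to cope with sparse data.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Sequence

Row = Dict[str, Optional[float]]
Classifier = Callable[[Row], Optional[str]]


def by_age(row: Row) -> Optional[str]:
    # Hypothetical weak rule; abstains when "age" is missing.
    age = row.get("age")
    if age is None:
        return None
    return "buyer" if age >= 30 else "non-buyer"


def by_income(row: Row) -> Optional[str]:
    # Hypothetical weak rule; abstains when "income" is missing.
    income = row.get("income")
    if income is None:
        return None
    return "buyer" if income >= 50 else "non-buyer"


def by_visits(row: Row) -> Optional[str]:
    # Hypothetical weak rule; abstains when "visits" is missing.
    visits = row.get("visits")
    if visits is None:
        return None
    return "buyer" if visits >= 3 else "non-buyer"


def majority_vote(classifiers: Sequence[Classifier], row: Row) -> Optional[str]:
    """Return the most common label among classifiers that did not abstain."""
    votes = []
    for clf in classifiers:
        label = clf(row)
        if label is not None:
            votes.append(label)
    if not votes:
        return None  # every classifier abstained
    return Counter(votes).most_common(1)[0][0]


ENSEMBLE = [by_age, by_income, by_visits]
```

Calling majority_vote(ENSEMBLE, {"age": 45, "visits": 5}) lets the two classifiers that have data decide, while the income rule abstains. Bagging and boosting add resampling and weighting on top of this voting step.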

What is the difference between the Titan and Neo4j graph databases?

I have worked with relational databases, but now I want to learn about graph databases. I came to know that these two are graph databases. What is the difference between them? Which one should we prefer?
One approach is to simply try to choose one database over the other. For example, you might quickly search around to find that Titan has been forked to JanusGraph where it is more actively maintained. In your research you may find that there are other open source graph databases as well like OrientDB, ChronoGraph, or Sqlg as well as commercial alternatives like Microsoft's Cosmos DB, DSE Graph or IBM Graph. How do you decide now?
There is a graph framework that ties together all of these graphs including Neo4j/Titan (and more than those listed here): Apache TinkerPop. TinkerPop provides an abstraction over different graph databases and graph processors allowing the same code to be used with different configurable backends. This pattern is quite similar to the one you find in SQL with JDBC which helps make your code vendor agnostic.
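To illustrate the vendor-agnostic pattern described above, here is a minimal Python sketch. It is not TinkerPop or Gremlin code: the GraphBackend interface, both backend classes, and the two_hops function are hypothetical names invented for this example. The point is only that the traversal is written once against an abstraction and runs unchanged on either backend.

```python
from collections import defaultdict
from typing import Iterable, List, Protocol, Set, Tuple


class GraphBackend(Protocol):
    """Hypothetical minimal interface every backend must provide."""

    def add_edge(self, src: str, dst: str) -> None: ...

    def neighbors(self, node: str) -> Iterable[str]: ...


class AdjacencyBackend:
    """Stores outgoing edges per node in a dictionary."""

    def __init__(self) -> None:
        self._adj = defaultdict(list)

    def add_edge(self, src: str, dst: str) -> None:
        self._adj[src].append(dst)

    def neighbors(self, node: str) -> List[str]:
        return list(self._adj.get(node, []))


class EdgeListBackend:
    """Stores all edges in one flat list and scans it on lookup."""

    def __init__(self) -> None:
        self._edges: List[Tuple[str, str]] = []

    def add_edge(self, src: str, dst: str) -> None:
        self._edges.append((src, dst))

    def neighbors(self, node: str) -> List[str]:
        return [dst for src, dst in self._edges if src == node]


def two_hops(graph: GraphBackend, start: str) -> Set[str]:
    """Backend-independent traversal: nodes exactly two edges away from start."""
    return {far for near in graph.neighbors(start) for far in graph.neighbors(near)}


def load_sample(graph: GraphBackend) -> GraphBackend:
    # Same sample data is loaded into whichever backend is passed in.
    for src, dst in [("alice", "bob"), ("bob", "carol"), ("bob", "dave")]:
        graph.add_edge(src, dst)
    return graph
```

Swapping AdjacencyBackend for EdgeListBackend changes how the data is stored but not the code in two_hops, which is roughly what a common API buys you when comparing databases.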
You can try all of the different supported graph databases before you make a choice, and you can do this type of prototyping/benchmarking fairly quickly with the Gremlin Console. You will then be able to make a well-informed choice about the best way to go for your project.
It occurs to me as I come to the end of this post that I haven't directly answered your question. If you are just getting started and are just interested in learning about graph databases, then I likely wouldn't recommend starting with Titan/JanusGraph as it requires a bit of configuration to get started (schemas, backend selection, etc). Start with TinkerGraph or Neo4j using the Gremlin Console to try out some simple graph traversals and go from there.
Titan was originally backed by Aurelius, which was bought by DataStax in 2015. This move was designed to give DataStax a jump-start into the Graph DB world, as they now offer their own "DSE Graph" enterprise product. Titan has since been forked (as previously mentioned) into JanusGraph.
The nice thing about Titan/Janus (IMO) is that it is "pluggable" with other existing back-end and search technologies. So it will "play nice" with things like Cassandra, HBase, Hadoop, Solr, and ElasticSearch.
The drawback is that the community support is tough. The Titan project has been effectively killed, and Janus scores a whopping 0.23 on DB-Engines. That makes it the 16th most-popular Graph DB (231st overall), which is pretty low.
Neo4j is backed by Neo Technology, and is regarded as the front-runner in the Graph DB community (score of 38.52 right now, 1st graph DB and 21st overall). It is open source, but controlled by Neo Technology, so they can dictate a difference in feature set between the open source and enterprise editions.
The nice thing about Neo4j is that they have a lot of tutorials and learning aids built right-in to the Neo4j Browser, which is a nice, user-friendly web interface. Their documentation is top-notch, easy to read and search through, and they have a pretty good following here on Stack Overflow.
Neo4j Browser screenshot:
The drawback of Neo4j is that some features (like clustering) are only available in the enterprise version. But if you work for a big company that doesn't mind shelling out money for an enterprise license, that may not be a big deal.
Consistency: Titan/Janus is a part of the "eventual consistency" crowd, while Neo4j aims to be strongly consistent (especially in a causal clustering scenario). Although consistency can be tuned with configuration in both, with Titan/Janus that can depend on your choice of pluggable backend (e.g., typically strongly consistent with HBase, but eventually consistent with Cassandra).
Recommendation:
If you're just starting to learn graph databases and modeling, you can't go wrong with Neo4j. Simply download/install the community edition, run it, and execute :play movies as your first command (a tutorial that walks you through loading, modeling, and querying movie relationships).
If you have some experience with graph, and you don't mind troubleshooting/googling to figure out things (like how to set the max frame size for Thrift), then you could probably do some really cool things with Titan.
Try each out, and see which one works for you.
There are far more than two graph databases - there are dozens. That being said, there are two with real market share: Neo4j and Titan/JanusGraph. The others each have interesting strengths for specific application spaces, but I wouldn't dig into all of the niche players to start with - learning the basic idea of graph databases can be done with one of the two lead players.
Neo4j is the most mature, with the most nicely packaged install and documentation, tons of reference code, and support from a wide range of partners.
Titan/JanusGraph is the next most popular, as it's free/open source and has very strong support (e.g. IBM, Google, Hortonworks, AWS, ...). There's a recent complexity in that the leaders of the Titan project were acquired, freezing the Titan project. But the community forked the project into JanusGraph. So while JanusGraph is a new project, it's literally the same Titan code, with even broader industry support than Titan had.
Related to the two is the language used to work with the graphs. Neo4j uses its proprietary language, Cypher, while nearly everyone else uses Gremlin, and the TinkerPop open source tool set (which is a part of the Apache set of open source projects). Nearly all graph databases, including Neo4j, support Gremlin and TinkerPop. So, for example, you can use either Cypher or Gremlin to query Neo4j, though Neo (and some other proprietary graph database vendors) support Gremlin as a second-class citizen, so to speak. For example, you can connect to Neo using Gremlin from the (external) Gremlin console, but you can't use Gremlin in the (very nice) Neo4j console.
Note that there are many graph databases that support Gremlin other than Titan/JanusGraph. One new entrant that's very interesting is Microsoft's Azure Cosmos DB, which is a managed graph database that's "cheap and easy" if you use Azure already. And there are several vendors that provide managed JanusGraph.
For personal learning, I'd say that Neo4j is the easiest to set up and learn - you download and run it, and open a web browser onto their web-based console, which only takes a few minutes. That being said, if you're comfortable on a command line, JanusGraph only took a half hour to install and get running for me, so it's not too hard.
For learning the concepts Neo4j is great. Neo4j's query language, Cypher, and JanusGraph's query language, Gremlin, are semantically identical, just spelled differently, so you'll learn the concepts either way.
For building a real system, either could work (and there are many successful systems following both approaches).
As for which you choose, you'll want to think about whether you want to be strategically tied to a single vendor (Neo4j) or to be in a broader standards-based community. There's a comfort level in picking the market leader with the most mature product - Neo4j. And there's a comfort level in picking open standards with strong industry support - JanusGraph. So IMO there's no "wrong" answer - people using either one are happy and successful. But since you have to pick, you'll need to think about which you're more comfortable with long-term.
Neo4j uses native graph technology.
Native graph technology stores data efficiently by writing nodes and relationships close to each other, which optimizes the graph DB.
With native graph technology, processing becomes faster because it uses index-free adjacency. That means each node directly references its adjacent nodes.
Titan (now JanusGraph) uses non-native graph technology.
In non-native graph technology, different storage backends such as Cassandra or HBase are used.
With non-native technology, processing becomes slower compared to native because the database uses many types of indexes to link nodes together.
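The difference between the two lookup styles can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how Neo4j or JanusGraph are actually implemented: the Node class, node_table, and edge_index are hypothetical names, and the sketch says nothing about real-world performance.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)
class Node:
    """Index-free adjacency: a node holds direct references to its neighbours."""
    name: str
    adjacent: List["Node"] = field(default_factory=list)


def native_neighbors(node: Node) -> List[str]:
    # Follow the references stored on the node itself; no global lookup.
    return [other.name for other in node.adjacent]


# Non-native style: nodes and edges live in generic key-value tables,
# and an index maps a node id to the ids of its neighbours.
node_table: Dict[int, str] = {1: "alice", 2: "bob", 3: "carol"}
edge_index: Dict[int, List[int]] = {1: [2, 3], 2: [3]}


def indexed_neighbors(node_id: int) -> List[str]:
    # One index lookup for the edge list, then one lookup per neighbour id.
    return [node_table[target] for target in edge_index.get(node_id, [])]


alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.adjacent.extend([bob, carol])
bob.adjacent.append(carol)
```

Both native_neighbors(alice) and indexed_neighbors(1) return the same neighbours; the second simply takes a detour through shared lookup structures, which is the overhead this answer refers to.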

Which community edition graph database supports a highly available cluster and has good online query performance?

I am currently building a knowledge graph for an e-commerce company, and it mainly consists of the product category hierarchies, properties, and relations among them. Besides the common relational queries, we care about the following points very much:
Master-slave cluster support. This graph database will be used for online search query processing, so high availability is crucial to us. The data volume won't reach millions of nodes, so we don't need a distributed cluster that spreads data across multiple machines. Rather, we need multiple machines that can be read simultaneously, so that the service won't go down even if one of the machines is offline.
Fast online query performance. Reasoning about relations can be done offline, so its performance is not that important. But we need to do a lot of online queries like "find the nodes whose property P equals value V", so we need good performance for online query processing. This database will be read-intensive and won't change much after its initialization.
Community and documentation. Since our team is new to graph databases, we expect user-friendly documentation for deployment and development and an active community for solving problems.
Based on the requirements above, I investigated some candidates:
Neo4j. We first tried Neo4j since it's the most popular one in the field. Actually, I liked it, especially the Cypher query language. But we are about to abandon it because the community edition does not support clustering, and currently we don't have the budget to pay for the enterprise edition.
OrientDB. OrientDB appears to be the second most popular one on the market, and it seems to support clustering in its community edition. I use the word "seems" because it is not clearly stated on its website. Can anyone clear this up? Besides, I found a negative article about OrientDB which makes me hesitate: http://orientdbleaks.blogspot.jp/2015/06/the-orientdb-issues-that-made-us-give-up.html
Titan. Titan is also great, but since its original company has been acquired and its original developers are developing a different product, its future development and maintenance are in doubt.
ArangoDB. This one seems to be very fast, according to the performance report (https://www.arangodb.com/2015/10/benchmark-postgresql-mongodb-arangodb/), but I don't know if its online query processing ability is good enough, and its cluster support is also unknown to me.
As for documentation and community, I really have no idea since these are the kind of things that you only get to know after you start doing it.
To sum up, based on my requirements, I think OrientDB and ArangoDB may be my candidates, but I don't know which one to choose because of the points I stated above. Or is there perhaps another suitable candidate that I'm missing?
Thanks.
Max working for ArangoDB here. ArangoDB does not only do online queries for graphs, but due to its multi-model nature you can mix graph queries with document queries (using secondary indexes), key lookups and joins. It has a sophisticated query engine with an optimizer that is fully aware of the ArangoDB cluster structure and can optimize and distribute query executions across all instances.
In a cluster, sharding, synchronous replication and self-healing are all fully automatic with configurable parameters. Deployment of an ArangoDB cluster is particularly simple (literally two clicks) on Apache Mesos or DC/OS, but is also relatively straightforward with other orchestration frameworks. ArangoDB on DC/OS additionally allows you to scale up and down via the graphical user interface or REST API calls, and failed tasks are automatically replaced.
As to performance, all our benchmarks show very good results, and the just-released Version 3.1 even has vertex-centric indexes, which are particularly important for graph queries.
We do our best to provide extensive documentation, which you can find at https://www.arangodb.com/documentation/ . We have a user manual, a manual for our query language AQL as well as one for the HTTP/REST API. Furthermore, we have tutorials, frequently asked questions, a "Cookbook" for standard tasks, and we try to answer questions on Stack Overflow and GitHub issues in a timely manner.
All of this is included in the Community Edition, which is available with the Apache 2.0 open source license.
If you have more questions, do not hesitate to reach out to our team or to me personally.
OrientDB Community Edition is free, open source software that is built by a community of developers and is constantly improving. Features such as horizontal scaling, fault tolerance, clustering, sharding and replication aren't disabled in the OrientDB Community Edition.
For more information about clustering, take a look at the official OrientDB guide: http://orientdb.com/docs/last/Tutorial-Clusters.html
Hope it helps.
Regards
Neo4j Enterprise edition can be used under the AGPL license. I am surprised a lot of people aren't aware of this. If you are using Neo4j Enterprise as a server and communicating with it via REST or the Bolt protocol (Apache licensed), then you don't have to worry about releasing the code of the system connecting to it under the AGPL.
If you are using it embedded, then you have to adhere to the AGPL, but then why would you need Neo4j Enterprise in that situation?
Remember to clone and compile Neo4j Enterprise from GitHub if you plan on using it under the AGPL; don't download the trial.
Neo Technology gives great support, and that is essentially what you are paying for with the enterprise subscription.

OLAP/multi-dimensional reporting, WPF client, agnostic data?

Can anyone give me general guidelines on how to approach multi-dimensional reporting where I'd like to support, at the very least, cubes generated from Oracle and SQL Server databases? I can imagine GemFire or Coherence being in the mix too.
I'm a little unsure where to start. If I work entirely in the Microsoft ecosystem I'm fine with SQL Server Analysis services, reporting services, MDX. Introduce the other data sources and I'm lost.
Thanks
The following vendors can all do what you need:
SAP Business Objects
IBM Cognos
MicroStrategy
Actuate
Oracle and Microsoft will both work great with only ONE of your datasources.
Try looking under the keywords "Business Intelligence" for Gartner Group papers and other useful whitepapers from sources like InformationWeek. There are MANY vendors in this space, so I encourage you to do a very deep slice prototype: they all look great in a demo, but might not work for you.
Also, the CUBE you mention (OLAP) is truly a performance booster. But you can do "multi-dimensional reporting" without Cubes. Maybe slower, but less intimidating and definitely less expensive.
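To show what "multi-dimensional reporting without cubes" can look like, here is a minimal Python sketch that aggregates a measure over any combination of dimensions directly from flat rows. The sample rows and the column names (region, year, product, amount) are made up for illustration; in practice the rows would come from a query against Oracle or SQL Server, and every report recomputes the totals, which is why this approach tends to be slower than a pre-aggregated cube.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical fact rows, as they might come back from a plain SQL query.
sales: List[dict] = [
    {"region": "EU", "year": 2009, "product": "hardware", "amount": 100.0},
    {"region": "EU", "year": 2009, "product": "software", "amount": 50.0},
    {"region": "EU", "year": 2010, "product": "hardware", "amount": 70.0},
    {"region": "US", "year": 2010, "product": "software", "amount": 30.0},
]


def aggregate(rows: List[dict], dims: Sequence[str], measure: str) -> Dict[Tuple, float]:
    """Sum `measure` for each distinct combination of the chosen dimensions."""
    totals: Dict[Tuple, float] = defaultdict(float)
    for row in rows:
        key = tuple(row[dim] for dim in dims)
        totals[key] += row[measure]
    return dict(totals)


# Slice by one dimension, then by two, using the same function.
by_region = aggregate(sales, ("region",), "amount")
by_region_year = aggregate(sales, ("region", "year"), "amount")
```

The same aggregate call serves any slice of the data (by region, by region and year, by product, and so on), much like a GROUP BY with a different column list.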
Regarding price, there are a number of free OLAP servers available; depending on your needs, any of them may be fine. Just look for the ones that follow the XMLA/MDX standards.
Among them are the classic Mondrian (ROLAP) and a newcomer, icCube (MOLAP).

Data Mining - Predictive Analysis

We are looking at acquiring Data Mining software to primarily run predictive analysis processes.
How does the SQL Server Data Mining solution compare to other solutions like SPSS from IBM?
Since SQL Server DM is included in the SQL Server Enterprise license, what would be the justification for spending an extra couple hundred thousand to buy separate software just to do DM?
I would look into open source options as well, including R, RapidMiner, and Weka.
I would recommend checking out the Rexer survey, as it shows popularity and satisfaction measures for a variety of data mining products:
http://www.kdnuggets.com/2010/03/f-annual-rexer-analytics-data-miner-survey-results.html
Depending on what you are looking to accomplish, and obviously your budget, there are certainly some great things being done in R. Check out Rattle for R and Revolution Computing.
I am a big fan of SPSS, and unfortunately have not used their Modeler package, but it seems like it may be worth considering. I have used SAS Enterprise Miner, and while it is powerful, I am not a big fan.
I haven't dabbled with Weka that much, but I found RapidMiner to have a steep learning curve, though it does have a lot of capability.
If you want to keep everything in the Microsoft stack check out www.predixionsoftware.com which is planning the release of a disruptive Excel add-in as an update to the current MS DM add-ins.
You might want to give KNIME a try before paying for something else. It works well with databases and is excellent for exploratory analysis.
I would suggest checking out open-source data mining software. Some very good open-source tools are free.
I would start by building some data mining models in SSAS using both Multidimensional and Tabular, and then get a Google Analytics account. I built a social networking website that members had to join, and used Google Analytics to start building reporting dashboards; I have probably built close to a thousand. It is a good starting point. R is good. Omniture used to be the top dog, but Adobe bought them. There are also ClickTracks, QlikView, Sisense, Tableau, and Actuate. However, I would wait and see how the product Microsoft releases turns out. Chances are it will set itself apart, as Microsoft has in the BI market and as it did when it shot up to 2nd in market share and 1st in growth in the database market.

What is the best way to use SQL Server Analysis Services data in a line of business application?

We'd like to see if we can get some improved performance for analysis and reporting by moving some of our key data into Analysis Services cubes. However, I haven't been able to find much in the way of good client front ends.
Our users have Office 2003. The move to 2007 is probably at least a year out and the Analysis Services add-in for Excel 2003 isn't great. I also considered just creating a winforms app, but I haven't had much luck finding robust controls for SSAS data. Meanwhile, Reporting Services seems to make you force your multi-dimensional data into a two dimensional dataset before it can be used in a report.
I hope that I'm missing something obvious and that there are some great client tools somewhere that will allow us to bring the multi-dimensional data from SSAS to a client application in a meaningful way. Performance Point seems like overkill, but maybe it's the best option.
Does anyone use SSAS data in line of business apps, or is it primarily used for ad hoc analysis? If you are using SSAS data for line of business apps, what technology(ies) are you using for the client front end?
I am on a project now that is using SSAS 2008. We were able to get all of our pilot users upgraded to Office 2007 and during the pilot of this project we used Excel 2007 as the front end tool. It turned out to be a great move (for us..YMMV) and I have been very impressed with how well the data features of Excel 2007 work. That being said it doesn't serve all our business needs and we're still going to use a reporting tool (MicroStrategy) as part of the client tool offerings to this project. This article (http://www.microsoft.com/downloads/details.aspx?FamilyId=2D779CD5-EEB2-43E9-BDFA-641ED89EDB6C&displaylang=en) was very helpful too.
Though you didn't ask directly I'll still say that the FE tools won't do much if the back-end design isn't right. I recommend googling Ralph Kimball and buying The Data Warehouse Toolkit book. There is even one tailored to SSAS 2005. Also search for the Microsoft Project Real whitepaper.
I've heard good things about this control.
http://www.datadynamics.com/Products/DDA/Default.aspx
I've used Dundas' control for OLAP. Very good and easy to use.
Also recommended is the DevExpress kit. The ASPxPivotGrid works directly against cubes, with some flexibility over what and how groups/dimensions/etc. get shown via properties. Good prices and products. I don't work there :)
Take a look at some of the Demos:
http://demos.devexpress.com/ASPXPIVOTGRIDDEMOS/OLAP/Browser.aspx
Also nice integration with their charting stuff.
