Zeppelin over Superset - apache-zeppelin

I have been using zeppelin for a couple of years, now superset is gaining more attention for better Visualization features etc. so I am trying to understand exact differences and also help if someone is looking to select a BI tool.
I have listed a few unique features based on initial reading on superset, it would be really appreciated if anyone can contribute more to the list .
Most big data cluster integration support (Spark, flink etc)
Inline code execution using paragraphs
Multi language supports
As I am not a proper user of superset,I would like to know more unique features of Zeppelin and these features are not possible or hard to do in Superset.
Also I got below details from apache wiki, but I don't see these can be unique factor except leveraging notebooks style
Apache Zeppelin is an indirect competitor, but it solves a different use case.
Apache Zeppelin is a web-based notebook that enables interactive data analytics. It enables the creation of beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. Although a user can create data visualizations using this project, it leverages a notebook style user interfaces and it is geared towards the Spark community where Scala and SQL co-exist

Fundamentally, Zeppelin and Superset take significantly different viewpoints on the data workflow.
Zeppelin is centered around the [computational notebook interface][1], which enables you to write code fragments, run them and internalize the output, and iterate & expand. Zeppelin notebooks then focus on working with 20+ programming [languages and interpreters][2]. Zeppelin can also query popular databases using the JDBC connector.
Superset is centered around the BI use case and ships with a SQL IDE and a no-code chart builder. The important difference here is that Superset can only query data from SQL speaking databases. Superset, unlike Zeppelin, doesn't enable you to run arbitrary code from a variety of programming languages.
The use cases, workflows, and design choices are very different between both of these tools. Superset wants to enable end-users & analysts and SQL ninjas to create dashboards (that others in an organization may consume). Zeppelin wants to level up data scientists & programmers to analyze data, and is less focused on building dashboards for the rest of the organization to consume.
[1]: https://en.wikipedia.org/wiki/Notebook_interface#:~:text=A%20notebook%20interface%20(also%20called,and%20text%20into%20separate%20sections.
[2]: https://zeppelin.apache.org/supported_interpreters.html

Related

Pattern discovery in aws sagemaker

How can i run "Pattern discovery" on my dataset using aws sagemaker?
And of there is a simmiler term to
"Pattern discovery" because i cant find a lot of info about it
The concept lies downstream of Data Wrangling.
As can be seen in the official guide 'Prepare ML Data with Amazon SageMaker Data Wrangler:
Amazon SageMaker Data Wrangler (Data Wrangler) is a feature of Amazon
SageMaker Studio that provides an end-to-end solution to import,
prepare, transform, featurize, and analyze data. You can integrate a
Data Wrangler data preparation flow into your machine learning (ML)
workflows to simplify and streamline data pre-processing and feature
engineering using little to no coding. You can also add your own
Python scripts and transformations to customize workflows.
Here you will find a few tools for pattern recognition/discovery (and general data analysis) that will be used in SageMaker Studio: "Analyze and Visualize".
The data scientist's advice I can give is not to use no-code tools if you don't fully understand the nature of the data you are dealing with. A good knowledge of the data is a prerequisite for any kind of targeted analysis. Try writing some custom code instead, to have maximum control over each operation.

What is the difference between Databricks and Spark?

I am trying to a clear picture of how they are interconnected and if the use of one always require the use of the other. If you could give a non-technical definition or explanation of each of them, I would appreciate it.
Please do not paste a technical definition of the two. I am not a software engineer or data analyst or data engineer.
These two paragraphs summarize the difference quite good (from this source)
Spark is a general-purpose cluster computing system that can be used for numerous purposes. Spark provides an interface similar to MapReduce, but allows for more complex operations like queries and iterative algorithms. Databricks is a tool that is built on top of Spark. It allows users to develop, run and share Spark-based applications.
Spark is a powerful tool that can be used to analyze and manipulate data. It is an open-source cluster computing framework that is used to process data in a much faster and efficient way. Databricks is a company that uses Apache Spark as a platform to help corporations and businesses accelerate their work. Databricks can be used to create a cluster, to run jobs and to create notebooks. It can be used to share datasets and it can be integrated with other tools and technologies. Databricks is a useful tool that can be used to get things done quickly and efficiently.
In simple words, Databricks has a 'tool' that is built on top of Apache Spark, but it wraps and manipulates it in an intuitive way which is easier for people to use.
This, in principle, is the same as difference between Hadoop and AWS EMR.

Pros and cons using Lucidworks Fusion instead of regular Solr

i wanna know what are the pros and cons using Fusion instead of regular Solr ? can you guys give some example (like some problem that can be solved easily using Fusion)?
First of all, I should disclose that I am the Product Manager for Lucidworks Fusion.
You seem to already be aware that Fusion works with Solr (or one or more Solr clusters or instances), using Solr for data storage and querying. The purpose of Fusion is to make it easier to use Solr, integrate Solr, and to build complex solutions that make use of Solr. Some of the things that Fusion provides that many people find helpful for this include:
Connectors and a connector framework. Bare Solr gives you a good API and the ability to push certain types of files at the command line. Fusion comes with several pre-built data source connectors that fetch data from various types of systems, process them as appropriate (including parsing, transformation, and field mapping), and sends the results to Solr. These connectors include common document stores (cloud and on-premise), relational databases, NoSQL data stores, HDFS, enterprise applications, and a very powerful and configurable web crawler.
Security integration. Solr does not have any authentication or authorizations (though as of version 5.2 this week, it does have a pluggable API and an basic implementation of Kerberos for authentication). Fusion wraps the Solr APIs with a secured version. Fusion has clean integrations into LDAP, Active Directory, and Kerberos for authentication. It also has a fine-grained authorizations model for mananging and configuring Fusion and Solr. And, the Fusion authorizations model can automatically link group memberships from LDAP/AD with access control lists from the Fusion Connectors data sources so that you get document-level access control mirrored from your source systems when you run search queries.
Pipelines processing model. Fusion provides a pipeline model with modular stages (in both API and GUI form) to make it easier to define and edit transformations of data and documents. It is analogous to unix shell pipes. For example, while indexing you can include stages to define mappings of fields, compute new fields, aggregate documents, pull in data from other sources, etc. before writing to Solr. When querying, you could do the same, along with transforming the query, running and returning the results of other analytics, and applying security filtering.
Admin GUI. Fusion has a web UI for viewing and configuring the above (as well as the base Solr config). We think this is convenient for people who want to use Solr, but don't use it regularly enough to remember how to use the APIs, config files, and command line tools.
Sophisticated search-based features: Using the pipelines model described above, Fusion includes (and make easy to use) some richer search-based components, including: Natural language processing and entity extraction modules; Real-time signals-driven relevancy adjustment. We intend to provide more of these in the future.
Analytics processing: Fusion includes and integrates Apache Spark for running deep analytics against data stored in Solr (or on its way in to Solr). While Solr implicitly includes certain data analytics capabilities, that is not its main purpose. We use Apache Spark to drive Fusion's signals extraction and relevancy tuning, and expect to expose APIs so users can easily run other processing there.
Other: many useful miscellaneous features like: dashboarding UI; basic search UI with manual relevancy tuning; easier monitoring; job management and scheduling; real-time alerting with email integration, and more.
A lot of the above can of course be built or written against Solr, without Fusion, but we think that providing these kinds of enterprise integrations will be valuable to many people.
Pros:
Connectors : Lucidworks provides you a wide range of connectors, with those you can connect to datasources and pull the data from there.
Reusability : In Lucidworks you can create pipelines for data ingestion and data retrieval. You can create pipelines with common logic so that these can be used in other pipelines.
Security : You can apply restrictions over data i.e Security Trimming data. Lucidworks provides in built query-pipeline stages for Security Trimming or you can write custom pipeline for your use case.
Troubleshooting : Lucidworks comes with discrete services i.e api, connectors, solr. You can troubleshoot any issue according the services, each service has its logs. Also you can configure JVM properties for each service
Support : Lucidworks support is available 24/7 for help. You can create support case according the severity and they schedule call for you.
Cons:
Not much, but it keeps you away from your normal development, you don't get much chance to open your IDE and start coding.

What are the approaches to the Big-Data problems? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
Let us consider the following problem. We have a system containing a huge amount of data (Big-Data). So, in fact we have a data base. As the first requirement we want to be able to write to and to read from the data base quickly. We also want to have a web-interface to the data-bases (so that different clients can write to and read from the data base remotely).
But the system that we want to have should be more than a data base. First, we want to be able to run different data-analysis algorithm on the data to find regularities, correlations, abnormalities and so on (as before we do care a lot about the performance). Second, we want to bind a machine learning machinery to the data-base. Which means that we want to run machine learning algorithms on the data to be able to learn "relations" present on the data and based on that predict the values of entries that are not yet in the data base.
Finally, we want to have a nice clicks based interface that visualize the data. So that the users can see the data in form of nice graphics, graphs and other interactive visualisation objects.
What are the standard and widely recognised approaches to the above described problem. What programming languages have to be used to deal with the described problems?
I will approach your question like this: I assume you are firmly interested in big data database use already and have a real need for one, so instead of repeating textbooks upon textbooks of information about them, I will highlight some that meet your 5 requirements - mainly Cassandra and Hadoop.
1) The first requirement we want to be able to write to and to read from the database quickly.
You'll want to explore NoSQL databases which are often used for storing “unstructured” Big Data. Some open-source databases include Hadoop and Cassandra. Regarding the Cassandra,
Facebook needed something fast and cheap to handle the billions of status updates, so it started this project and eventually moved it to Apache where it's found plenty of support in many communities (ref).
References:
Big Data and NoSQL: Five Key Insights
NoSQL standouts: New databases for new applications
Big data woes: Which database should I use?
Cassandra and Spark: A match made in big data heaven
List of NoSQL databases (currently 150)
2) We also want to have a web interface to the database
See the list of 150 NoSQL databases to see all the various interfaces available, including web interfaces.
Cassandra has a cluster admin, a web-based environment, a web-admin based on AngularJS, and even GUI clients.
References:
150 NoSQL databases
Cassandra Web
Cassandra Cluster Admin
3) We want to be able to run different data-analysis algorithm on the data
Cassandra, Hive, and Hadoop are well-suited for data analytics. For example, eBay uses Cassandra for managing time-series data.
References:
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
Cassandra at eBay - Cassandra Summit
An Introduction to Real-Time Analytics with Cassandra and Hadoop
4) We want to run machine learning algorithms on the data to be able to learn "relations"
Again, Cassandra and Hadoop are well-suited. Regarding Apache Spark + Cassandra,
Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in
2010, and became a top-level Apache project in February, 2014. It has
since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations (ref).
Regarding Hadoop,
With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets.
References:
Getting Started with Apache Spark and Cassandra
What is Apache Mahout?
Data Science with Apache Hadoop: Predicting Airline Delays
5) Finally, we want to have a nice clicks-based interface that visualize the data.
Visualization tools (paid) that work with the above databases include Pentaho, JasperReports, and Datameer Analytics Solutions. Alternatively, there are several open-source interactive visualization tools such as D3 and Dygraphs (for big data sets).
References:
Data Science Central - Resources
Big Data Visualization
Start looking at:
what kind of data you want to store in the Database?
what kind of relationship between data you got?
how this data will be accessed? (for instance you need to access a certain set of data quite often)
are they documents? text? something else?
Once you got an answer for all those questions, you can start looking at which NoSQL Database you could use that would give you the best results for your needs.
You can choose between 4 different types: Key-Value, Document, Column family stores, and graph databases.
Which one will be the best fit can be determined answering the question above.
There are ready to use stack that may really help to start with your project:
Elasticsearch that would be your Database (it has a REST API that you can use to write them to the DB and to make queries and analysis)
Kibana is a visualization tool, it will allows you to explore and visualize your data, it it quite powerful and will be more than enough for most of your needs
Logstash can centralize the data processing and help you process and save them in elasticsearch, it already support quite few sources of logs and events, and you can also write your own plugin as well.
Some people refers to them as the ELK stack.
I don't believe you should worry about the programming language you have to use at this point, try to select the tools first, sometimes the choices are limited by the tools you want to use and you can still use a mixture of languages and make the effort only if/when it make sense.
A common way to solve such a requirements is to use Amazon Redshift and the ecosystem around it.
Redshift is a peta-scale data warehouse (it can also start with giga-scale), that exposes Ansi SQL interface. As you can put as much data as you like into the DWH and you can run any type of SQL you wish against this data, this is a good infrastructure to build almost any agile and big data analytics system.
Redshift has many analytics functions, mainly using Window functions. You can calculate averages and medians, but also percentiles, dense rank etc.
You can connect almost every SQL client you want using JDBS/ODBC drivers. It can be from R, R studio, psql, but also from MS-Excel.
AWS added lately a new service for Machine Learning. Amazon ML is integrating nicely with Redshift. You can build predictive models based on data from Redshift, by simply giving an SQL query that is pulling the data needed to train the model, and Amazon ML will build a model that you can use both for batch prediction as well as for real-time predictions. You can check this blog post from AWS big data blog that shows such a scenario: http://blogs.aws.amazon.com/bigdata/post/TxGVITXN9DT5V6/Building-a-Binary-Classification-Model-with-Amazon-Machine-Learning-and-Amazon-R
Regarding visualization, there are plenty of great visualization tools that you can connect to Redshift. The most common ones are Tableau, QliView, Looker or YellowFin, especially if you don't have any existing DWH, where you might want to keep on using tools like JasperSoft or Oracle BI. Here is a link to a list of such partners that are providing free trial for their visualization on top of Redshift: http://aws.amazon.com/redshift/partners/
BTW, Redshift also provides a free trial for 2 months that you can quickly test and see if it fits your needs: http://aws.amazon.com/redshift/free-trial/
Big Data is a tough problem primarily because it isn't one single problem. First if your original database is a normal OLTP database that is handling business transactions throughout the day, you will not want to also do your big data analysis on this system since the data-analysis you will want to do will interfere with the normal business traffic.
Problem #1 is what type of database do you want to use for data-analysis? You have many choices ranging from RDBMS, Hadoop, MongoDB, and Spark. If you go with RDBMS then you will want to change the schema to be more compliant with data-analysis. You will want to create a data warehouse with a star schema. Doing this will make many tools available to you because this method of data analysis has been around for a very long time. All of the other "big data" and data analysis databases do not have the same level of tooling available, but they are quickly catching up. Each one of these will require research on which one you will want to use based on your problem set. If you have big batches of data RDBMS and Hadoop will be good. If you have streaming types of data then you will want to look at MongoDB and Spark. If you are a Java shop then RDBMS, Hadoop or Spark. If you are JavaScript MongoDB. If you are good with Scala then Spark.
Problem #2 is getting your data from your transactional database into your big data storage. You will need to find a programming language that has libraries to talk to both databases and you will have to decide when and where you will be moving this data. You can use Python, Java or Ruby to do this work.
Problem #3 is your UI. If you decide to go with RDBMS then you can use many of the available tools available or you can build your own. The other data storage solutions will have tool support but it isn't as mature is that available for the RDBMS. You are most likely going to build your own here anyway because your analysts will want to have the tools built to their specifications. Java works with all of these storage mechanisms but you can probably get Python to work too. You may want to provide a service layer built in Java that provides a RESTful interface and then put a web layer in front of that service layer. If you do this, then your web layer can be built in any language you prefer.
These three languages are most commonly used for machine learning and data mining on the Server side: R, Python, SQL. If you are aiming for heavy mathematical functions and graph generation, Haskell is very popular.

Dynamic Data Extract Tools

I've been searching around for a few weeks now for a tool that either is fully built or a direction of something I could build for dynamically extracting data via a web interface. Basically, what I'm looking for is a way to give users a list of all available data objects from our database and then let them pick ones from the list they'd like to view and set parameters then export the results to an excel file.
Right now we're doing it purely with SQL statements but we have hundreds of objects so as you might imagine, those statements are really complex and prone to errors. It would be great if there was a tool available to do this or if someone had an idea of an easy way to organize this. Any help would be greatly appreciated.
We've looked at BI tools like QlikView and Tableau but that is probably overkill for what we're trying to do. The open-source BI tools we've looked at seemed really primitive in their functionality. The other thing we looked at was MSAS (our DB is SQL Server) but I'd prefer something that was more database-agnostic and lived on a web server instead of on the database.
I think what you are describing is a typical BI reporting tool. I don't know what open-source BI tools you have been looking at but there are open source solutions which aren't "primitive" at all. The two main open-source reporting libraries are JasperReports and BIRT. You can design report templates within a graphical interface (NetBeans plugin called iReport for Jasper, Eclipse plugin for BIRT). A simple report template is basically an xml file which contains a parameterized SQL query and describes how to display the query results.
End-user typically connect to a web application (Java EE app which uses the reporting library) which executes the report templates : it asks the user to input parameters in a graphical way such as drop-down lists and checkboxes, and then retrieves the SQL query results from the database, and displays them according to the template (tables, charts, etc.). These results can then be exported in many formats including xls.
JasperReports developers provide a free open-source webapp designed to run Jasper reports, called JasperReports Server. Other open-source projects let you execute reports designed either with BIRT or Jasper, for instance ReportsServer which I haven't tested.
At my company we use SpagoBI, which is a fully-fledged free and open source Business Intelligence suite. This means that it has all the features of a commercial BI suite. The most useful is probably the ad-hoc query editor, which lets users with little technical knowledge design simple queries by dragging and dropping fields, and it performs the underlying table joins for them. It then lets users design simple reports such as pie charts or line charts from the data they just extracted. This sort of feature is part of the commercial editions of JasperReports Server and Actuate One (the BIRT equivalent of JasperReports Server which doesn't have a free version).
SpagoBI is a great, powerful tool and I recommend it, but it is also quite difficult to configure and to master. Maybe if your needs are only to execute pre-defined reports you had better go with one of the other solutions.
PowerPivot, Data Explorer, or Microsoft Query?
Sorry, didnt see that you wanted a web interface...
You can try get some export data functions from SQL Web Data Administrator - http://sqlwebadmin.codeplex.com/
Or you can install the web tool but restrict access for its web-page other then export data pages.
Cognos BI (specifically, the web-based Query Studio) fits this tab perfectly and is a great tool to deploy to non-technical web users.
It does require a pretty robust setup and is not cheap but it is an enterprise-class product. I've only worked with the full-scale deployment but they also have an Express product for small/midsize companies.
If you could clarify number of users, database size, expected query volume, and budget, we could refine the toolset further...

Resources