Apache Zeppelin 0.7.0-SNAPSHOT not working with external Spark

I am trying to use Zeppelin (0.7.0-SNAPSHOT, compiled with mvn clean package -Pcassandra-spark-1.6 -Dscala-2.11 -DskipTests)
with an external, standalone Spark of version 1.6.1.
I have tried to set this up by adding export MASTER=spark://mysparkurl:7077 to /zeppelin/conf/zeppelin-env.sh,
and I have also tried setting the master parameter to spark://mysparkurl:7077 under the %spark interpreter settings in the Zeppelin GUI.
So far, attempts to connect to Spark have been unsuccessful. Here is a piece of code I have used for testing Zeppelin with the external Spark, and the error I get with it:
%spark
val data = Array(1,2,3,4,5)
val distData = sc.parallelize(data)
val distData2 = distData.map(i => (i,1))
distData2.first
data: Array[Int] = Array(1, 2, 3, 4, 5)
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Zeppelin is running in a docker container, and Spark is running on host.
Am I missing something here? Is there something else that needs to be configured in order for Zeppelin to work with an external, standalone Spark?

As Cedric H. mentions, at that time you had to compile Apache Zeppelin with -Dscala-2.10.
A few bugs have been fixed since September, and Scala 2.11 support should be working well now; if not, please file an issue in the official project JIRA.
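For example, mirroring the build command from the question but swapping the Scala flag (the profile names here are the ones from the question; adjust them to your setup):
mvn clean package -Pcassandra-spark-1.6 -Dscala-2.10 -DskipTests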

Related

Unable to run PubSubSource on Flink cluster

I've written a minimal Flink application that tries to read data from PubSub.
// imports for the Scala DataStream API and the GCP PubSub connector
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.gcp.pubsub.PubSubSource

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10000L)
env.addSource(
  PubSubSource.newBuilder()
    .withDeserializationSchema(new SimpleStringSchema)
    .withProjectName("PROJECT")
    .withSubscriptionName("SUBSCRIPTION")
    .build())
  .print()
env.execute("job")
This program runs successfully when launched directly (sbt run), but if I submit it to a Flink cluster, I get the following error message:
java.lang.IllegalArgumentException: cannot find a NameResolver for pubsub.googleapis.com:443
I've tried to run clusters in different machines/environments, but none of them works.
OS: macOS Catalina / Ubuntu 18.04
Flink version: 1.13.1 / 1.12.2
Scala version: 2.12.13 / 2.11.12
JVM: Oracle 8&11, OpenJDK 8&11
Here is the gist with the code, build.sbt, and the full error message.
Thank you.
Found the solution.
As this post says, I need to keep the files under META-INF/services.
After I removed the following line from my sbt-assembly merge strategy, everything works fine:
case PathList("META-INF", xs @ _*) => MergeStrategy.discard

How to run the Zeppelin 0.8.0 interpreter on a different host?

I have two hosts, one with Zeppelin and another where I want to run the JDBC interpreter. The problem is that in 0.7 I could run it like this:
/opt/zeppelin/bin/interpreter.sh -d /opt/zeppelin/interpreter/jdbc -p xxxx
but as of 0.8 they added the new parameters CALLBACK_HOST and INTP_PORT, and the script throws errors:
Exception in thread "main"
java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.main(RemoteInterpreterServer.java:266).
Please advise how to start this interpreter with Zeppelin 0.8.0, and what exactly these parameters mean, because there is almost nothing about them on GitHub.
Although you can find callback_host in the configuration, that functionality is not available yet.
The JDBC interpreter can currently run only on the same host as the Zeppelin server.

Unable to run a python flink application on cluster

I am trying to run a Python Flink application on a standalone Flink cluster. The application works fine on a single-node cluster, but on a multi-node cluster it throws the following error:
java.lang.Exception: The user defined 'open()' method caused an exception: An error occurred while copying the file.
Please help me resolve this problem. Thank you.
The application I am trying to execute has the following code.
from flink.plan.Environment import get_environment
from flink.plan.Constants import INT, STRING, WriteMode
env = get_environment()
data = env.from_elements("Hello")
data.map(lambda x: list(x)).output()
env.execute()
You have to configure "python.dc.tmp.dir" in "flink-conf.yaml" to point to a distributed filesystem (like HDFS). This directory is used to distribute the Python scripts.
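For example (the HDFS path below is just an illustration; any path reachable from every node will do):
# flink-conf.yaml
python.dc.tmp.dir: hdfs:///tmp/flink_python_dc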

Error with module using Cloud Storage with Python and their tutorial

I'm trying out Google Cloud Storage to store images (I need this in an app that I'm developing) and I'm following the Bookshelf App tutorial from their website.
I'm using Python, and the problem is that although every package in requirements.txt installs fine, when I try to execute the code I see this error:
...sandbox.py", line 948, in load_module
raise ImportError('No module named %s' % fullname)
ImportError: No module named cryptography.hazmat.bindings._openssl
I have tried hundreds of possible solutions: reinstalling only the cryptography package, trying different versions of the same module, and installing other packages that contain it, but nothing resolved the problem.
The requirements contains this:
Flask==0.10.1
gcloud==0.9.0
gunicorn==19.4.5
oauth2client==1.5.2
Flask-SQLAlchemy==2.1
PyMySQL==0.7.1
Flask-PyMongo==0.4.0
PyMongo==3.2.1
six==1.10.0
I'm sure it is a simple error, but I can't find a way to solve it.
Any help would be welcome. Thanks.
EDIT:
When I do this from a standalone Python program, it works fine:
import os
from gcloud import storage

# point the client at the service account key, then upload a local image
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'key.json'
client = storage.Client(project='xxxxxxx')
bucket = client.get_bucket('yyyyyyy')
f = open('profile.jpg', 'rb')
blob = bucket.blob(f.name)
blob.upload_from_string(f.read(), 'image/jpeg')
url = blob.public_url
print url
Why can't I use the gcloud library without errors in a GAE app?
It seems you're following the Bookshelf tutorial, but this line in your stack trace:
...sandbox.py", line 948, in load_module
hints that you're running the code with dev_appserver.py. The dev_appserver sandbox only allows a whitelist of native (C extension) modules, which is why cryptography's _openssl binding fails to import. Running under dev_appserver isn't necessary for Managed VMs/Flexible unless you're using the compat runtime.
If this is the case, the tutorial provides correct instructions:
$ virtualenv env
$ source env/bin/activate
$ pip install -r requirements.txt
$ python main.py
(If this is not the case, please feel free to comment on this with more details about how you're running your application).

No interpreters available in Zeppelin

I have just installed the following on my Mac (Yosemite 10.10.3):
oracle java 1.8 update 45
scala 2.11.6
spark 1.4 (precompiled release: http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.6.tgz)
zeppelin from source (https://github.com/apache/incubator-zeppelin)
No additional config; I just created zeppelin-env.sh and zeppelin-site.xml from the templates, with no edits.
I followed the installation guidelines: https://zeppelin.incubator.apache.org/docs/install/install.html
I built Zeppelin without problems:
mvn clean install -DskipTests
Started it
./bin/zeppelin-daemon.sh start
Opened http://localhost:8080 and opened the Tutorial Notebook.
When I refresh the snippets they fail (screenshot omitted). Here is the exception for the md interpreter in the webapp logs:
ERROR [2015-06-19 11:44:37,410] ({WebSocketWorker-8} NotebookServer.java[runParagraph]:534) - Exception from run
org.apache.zeppelin.interpreter.InterpreterException: Interpreter md not found
at org.apache.zeppelin.notebook.Note.run(Note.java:269)
at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:531)
at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:119)
at org.java_websocket.server.WebSocketServer.onWebsocketMessage(WebSocketServer.java:469)
at org.java_websocket.WebSocketImpl.decodeFrames(WebSocketImpl.java:368)
at org.java_websocket.WebSocketImpl.decode(WebSocketImpl.java:157)
at org.java_websocket.server.WebSocketServer$WebSocketWorker.run(WebSocketServer.java:657)
Restarting the interpreter doesn't seem to cause errors.
OK, I have just found the answer: at the top of the tutorial notebook there is a section about interpreter binding. Click the Save button and everything starts working normally.
