Making Spark and Slick work at the same time - sql-server

I am writing a program in Scala and ran into this problem. I have been trying to solve it for days now, and it is really getting on my nerves. But who knows, maybe it is not even possible. I am building a GUI with ScalaFX that starts with a login window, where I want to use MS SQL Server for the login, and I decided to use Slick for that. It works fine if I bundle everything with sbt assembly into an executable jar and execute it with
java -jar target/scala-2.12/myjarname.jar
If I don't run sbt assembly, the program only works when started with sbt run.
Later on, in another scene, I want to start a SparkSession that receives data through Kafka. That part works fine on its own: if I comment out the Slick/SQL Server usage, spark-submit with the spark-kafka connector package added runs without problems. I am using sbt with Scala.
But as soon as I combine the two, I get this error:
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
Just to make sure, I checked that the driver class is present inside the assembled jar. Am I missing something with spark-submit? Or are the two simply unable to work together in one program? My spark-submit looks like this:
spark-submit --class Login --master local[*] --packages "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.3" target/scala-2.12/myjarname.jar
I don't know whether this is useful or not, but here is my application.conf:
sqlserver = {
  profile = "slick.jdbc.SQLServerProfile$"
  db {
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    host = localhost
    port = <myport>
    databaseName = <my database's name>
    url = "jdbc:sqlserver://"${sqlserver.db.host}":"${sqlserver.db.port}";databaseName="${sqlserver.db.databaseName}";encrypt=true;trustServerCertificate=true;"
    user = <username>
    password = <password>
  }
}
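A config block shaped like this is typically consumed with DatabaseConfig.forConfig, which is also where the db.driver class gets loaded; a minimal sketch (illustrative, not my exact code):
import slick.basic.DatabaseConfig
import slick.jdbc.JdbcProfile

// Reads the "sqlserver" block above; the db.driver class is loaded when the pool is created
val dbConfig = DatabaseConfig.forConfig[JdbcProfile]("sqlserver")
val db = dbConfig.db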
And some of my build.sbt:
scalaVersion := "2.12.10"
libraryDependencies += "org.scalafx" %% "scalafx" % "16.0.0-R24"
libraryDependencies ++= Seq(
  "com.typesafe.slick" %% "slick" % "3.2.1",
  "org.slf4j" % "slf4j-nop" % "2.0.1",
  "com.typesafe.slick" %% "slick-hikaricp" % "3.2.1",
  "com.microsoft.sqlserver" % "mssql-jdbc" % "10.2.1.jre8"
)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "3.2.0" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.1.0"
)
Update:
Found what was causing this. Adding a ShadeRule for HikariCP to the sbt-assembly configuration in build.sbt solved my problem.
I got my solution from this.
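Roughly, the rule looks something like this; a minimal sketch, assuming sbt-assembly 1.x key names (the shaded package name is just an example):
assembly / assemblyShadeRules := Seq(
  // Relocate HikariCP inside the fat jar so it cannot clash with the copy Spark ships
  ShadeRule.rename("com.zaxxer.hikari.**" -> "shaded.hikari.@1").inAll
)
The idea is that the connection-pool classes Slick loads no longer compete with the versions already on Spark's classpath.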

Related

Flink (1.11.2) - Can't find implementation for S3 despite correct plugins being set. Using JDK11 and Scala 2.12.11

I'm running a Flink standalone cluster with a single node using Docker in Linux. I've been running a previous version (Flink 1.10.0 with JDK8) in production for a while, and I was able to get S3 working properly there. Now I'm trying to update to a newer version, running Docker on my dev machine with a local S3 implementation. No matter what I try, this error keeps popping up:
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3'.
It would seem that the S3 scheme isn't being mapped to the appropriate classes. I'm positive that the right plugins are being picked up by Flink. I have the following dependencies:
val testDependencies = Seq(
  "org.scalatest" %% "scalatest" % "3.2.0" % "test"
)
val miscDependencies = Seq(
  "com.github.tototoshi" %% "scala-csv" % "1.3.6",
  "org.lz4" % "lz4-java" % "1.5.1",
  "org.json4s" %% "json4s-jackson" % "3.6.1",
  "org.apache.hadoop" % "hadoop-common" % "3.2.1",
  "redis.clients" % "jedis" % "2.9.0",
  "com.googlecode.plist" % "dd-plist" % "1.21",
  "com.couchbase.client" % "java-client" % "2.7.14",
  "org.apache.parquet" % "parquet-avro" % "1.11.1",
)
val flinkDependencies = Seq(
  "org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
  "org.apache.flink" % "flink-s3-fs-hadoop" % flinkVersion % "provided",
  "org.apache.flink" % "flink-metrics-dropwizard" % flinkVersion,
  "org.apache.flink" % "flink-formats" % flinkVersion pomOnly(),
  "org.apache.flink" % "flink-compress" % flinkVersion,
  "org.apache.flink" %% "flink-statebackend-rocksdb" % flinkVersion,
  "org.apache.flink" %% "flink-clients" % flinkVersion,
  "org.apache.flink" %% "flink-parquet" % flinkVersion
)
I can confirm that I'm following the documentation to the letter.
After struggling with this for a while I was able to solve the problem. I'm leaving my solution here in case anyone has the same issue.
Plugin classes, such as the S3 file system factories, are detected when the jobmanager and taskmanager start; however, they're not loaded. In my setup, the classes must be loaded dynamically once the job starts. You can find more information about how Flink loads its classes here.
As explained here, the cue to load a class is given by the existence of a file in META-INF/services inside the job's jar. For the S3 plugins to work, you need to have the file:
META-INF/services/org.apache.flink.core.fs.FileSystemFactory
which contains one line for each class that Flink should load dynamically as dependencies to your job. For example:
org.apache.flink.fs.s3hadoop.S3FileSystemFactory
org.apache.flink.fs.s3hadoop.S3AFileSystemFactory
I'm using sbt assembly to create a fat JAR with my job. In my project dependencies I was including flink-s3-fs-hadoop as a provided dependency, which prevented the correct services files from being included. Once I removed that qualifier, the correct services were created and everything worked.
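Not something I needed in the end, but if several of your dependencies ship their own META-INF/services entries, it is also worth making sure sbt-assembly merges those files instead of deduplicating them. A minimal sketch, assuming sbt-assembly 1.x key names:
assembly / assemblyMergeStrategy := {
  // Concatenate service registrations so every FileSystemFactory entry survives the merge
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}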

SBT - Invoke GCC call failing in SBT, but not when I manually execute it

I'm working on a library that is going to talk to the I2C bus on my Raspberry Pi from Scala. For this I need a little bit of JNI code that interfaces with the OS on the device.
I tried to make a build file for this, which right now looks like this:
name := "core"
organization := "nl.fizzylogic.reactivepi"
scalaVersion := "2.11.6"
val nativeClasses = List(
  "nl.fizzylogic.reactivepi.i2c.I2CDevice"
)
val nativeDeviceSources = List(
  "src/jni/nl_fizzylogic_reativepi_i2c_I2CDevice.c"
)
val nativeGenerateHeaders = taskKey[Int]("Generates JNI headers for the library")
val nativeCompile = taskKey[Int]("Compiles the native library")
nativeGenerateHeaders := {
  ("javah -classpath target/scala-2.11/classes -d src/jni " + nativeClasses.mkString(" ")) !
}
nativeCompile := {
  ("gcc -Wl -add-stdcall-alias -I$JAVA_HOME/include -I$JAVA_HOME/include/darwin -I$JAVA_HOME/include/linux -shared -o target/reactivepi.so " + nativeDeviceSources.mkString(" ")) !
}
So far the javah call is successful. But when I invoke the nativeCompile task, GCC tells me it can't find the jni.h file. However, when I copy and paste the command from the build file into my terminal and execute it, it succeeds.
It looks like the include paths are not being picked up when I execute gcc from my custom build task, but I have no idea what I'm doing wrong here.
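A likely explanation, sketched below: a command string executed with ! goes through scala.sys.process without a shell, so $JAVA_HOME is never expanded. Resolving the path in Scala sidesteps that; the flag list mirrors the task above and the error handling is illustrative:
nativeCompile := {
  // sys.process does not expand shell variables, so resolve JAVA_HOME ourselves
  val javaHome = sys.env.getOrElse("JAVA_HOME", sys.error("JAVA_HOME is not set"))
  val cmd = Seq(
    "gcc", "-Wl", "-add-stdcall-alias",
    s"-I$javaHome/include",
    s"-I$javaHome/include/darwin",
    s"-I$javaHome/include/linux",
    "-shared", "-o", "target/reactivepi.so"
  ) ++ nativeDeviceSources
  cmd.!  // exit code of gcc, matching taskKey[Int]
}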

error "can't expand macros compiled by previous versions of Scala". Using sbt and Scala Test,

I got the above error when I run 'test' under sbt.
Environment: ScalaTest, sbt version 0.13.8
In the build.sbt file I tried scalaVersion := "2.10.4" and the dependency definitions below (both options):
//libraryDependencies += "org.scalatest" % "scalatest_2.11" % "2.2.4" % "test"
libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0" % "test"
I refreshed my sbt projects after the above changes.
The error still exists. Can anybody shed some light on this?
After playing around, and with help from a colleague, it turns out the ScalaTest artifact ID was incorrect. The working version automatically picks up the right Scala version (i.e. use groupID %% artifactID % revision INSTEAD OF groupID % artifactID % revision).
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % Test //note 2.2.2 works too
For more details see: http://www.scala-sbt.org/0.13/tutorial/Library-Dependencies.html
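For reference, %% simply appends the project's Scala binary version to the artifact name, so with scalaVersion := "2.10.4" the two lines below resolve to the same artifact:
// %%: sbt fills in the Scala binary version
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % Test
// explicit equivalent for a 2.10.x build
libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.2.4" % Test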

How to list packages and dependencies that will be installed in Macports?

Is there a way to just list all new packages and their dependencies that port will install for a given command?
For instance, consider installing the SciPy stack with the suggested:
sudo port install py27-numpy py27-scipy py27-matplotlib py27-ipython +notebook py27-pandas py27-sympy py27-nose
That installs a ton of packages and dependencies not listed in the above command.
Also, some of them I already have.
I'm already aware of the -y switch but that gives a verbose output of everything, including packages I had already installed.
I'm interested in having port tell me which new packages (dependencies or not) will be installed.
Is there a known way, or do people just parse the -y output of the command, comparing each reported package against the already installed packages?
Cheers
P.S. I'm fairly new to MacPorts and Mac OS X (on Linux, apt-get always tells you which new packages will be installed).
You can use a port expression to print what will be installed:
port echo rdepof:$yourport and not installed
or for multiple ports
port echo \( rdepof:$yourport rdepof:$yourport2 ... \) and not installed
Due to the number of Portfiles involved in this and how the set operations are implemented, this will be rather slow. That being said, we're also working on improving this and providing feedback prior to installation like apt-get in a future MacPorts version.
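For example, with one of the ports from the question above (the output depends on what is already installed):
port echo rdepof:py27-scipy and not installed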
neverpanic already gave an answer, but it seems unable to handle variants (like +notebook) and command-line options (like configure.compiler=macports-clang-3.7). I have a separate solution. The following Python script can display the new dependencies recursively:
#!/usr/bin/env python
#coding: utf-8

import re
import sys
import subprocess

# Gets command output as a list of lines
def popen_readlines(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    p.wait()
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, cmd)
    else:
        return map(lambda line: line.rstrip('\n'), p.stdout.readlines())

# Gets the port name from a line like " gcc6 @6.1.0_0 (active)"
def get_port_name(port_line):
    return re.sub(r'^ (\S+).*', r'\1', port_line)

# Gets installed ports as a set
def get_installed():
    installed_ports_lines = popen_readlines(['port', 'installed'])[1:]
    installed_ports = set(map(get_port_name, installed_ports_lines))
    return installed_ports

# Gets port names from items that may contain version specifications,
# variants, or options
def get_ports(ports_and_specs):
    requested_ports = set()
    for item in ports_and_specs:
        if not (re.search(r'^[-+@]', item) or re.search(r'=', item)):
            requested_ports.add(item)
    return requested_ports

# Gets dependencies for the given port list (which may contain options
# etc.), as a list of tuples (combining with level), excluding items in
# ignored_ports
def get_deps(ports, ignored_ports, level):
    if ports == []:
        return []
    deps_raw = popen_readlines(['port', 'deps'] + ports)
    uninstalled_ports = []
    for line in deps_raw:
        if re.search(r'Dependencies:', line):
            deps = re.sub(r'.*Dependencies:\s*', '', line).split(', ')
            uninstalled_ports += [x for x in deps if x not in ignored_ports]
            ignored_ports |= set(deps)
    port_level_pairs = []
    for port in uninstalled_ports:
        port_level_pairs += [(port, level)]
        port_level_pairs += get_deps([port], ignored_ports, level + 1)
    return port_level_pairs

def main():
    if sys.argv[1:]:
        ports_and_specs = sys.argv[1:]
        ignored_ports = get_installed() | get_ports(ports_and_specs)
        uninstalled_ports = get_deps(ports_and_specs, ignored_ports, 0)
        for (port, level) in uninstalled_ports:
            print ' ' * (level * 2) + port

if __name__ == '__main__':
    main()
It can be invoked like port_rdeps.py libomp configure.compiler=macports-clang-3.7. As a bonus, it can show the uninstalled dependencies as a tree.

fabric for offline package installation

The project I'm working on uses Fabric for many of its build steps and requires an offline build as a fallback.
I'm currently stuck on installing Python packages provided as tarballs.
The thing is, I have trouble getting into the newly extracted directory and running setup.py install there.
import os
import tempfile

from fabric.api import task, env, lcd, cd, put, run, sudo
from fabric.contrib import files

@task
def deploy_artifacts():
    """Installs dependencies from local path, useful for offline builds"""
    # TODO: Handle downloading files and do something like this below
    tmpdir = tempfile.mkdtemp()
    artifacts_path = ''
    if not 'http' in env.artifacts_path:
        artifacts_path = env.artifacts_path
    with lcd(artifacts_path):
        for f in os.listdir(artifacts_path):
            if 'gz' in f:
                put(f, tmpdir)
                tar = os.path.join(tmpdir, f)
                # normalize() is a helper defined elsewhere in the fabfile
                target_dir = os.path.join(tempfile.gettempdir(), normalize(f))
                if not files.exists(target_dir):
                    run('mkdir %s' % target_dir)
                else:
                    run('rm -rf %s' % target_dir)
                    run('mkdir %s' % target_dir)
                run('tar xf %s -C %s' % (tar, target_dir))
                run('rm %s' % tar)
                with cd(target_dir):
                    sudo('python setup.py install')
I've just come back from reading the tar man page for the bazillionth time, and I'm nowhere near getting what I want.
Have any of you faced a situation like this? Is there some other (read: better) approach to this scenario?
There's nothing wrong (in principle) with what you're trying to do. Maybe just take smaller steps getting there. Rather than using temporary directories, it might make debugging easier if everything were put in a systematic location with known permissions that nothing else writes to by convention. At least that would let you use some combination of Fabric and manual intervention to check what is going wrong.
In the longer term, there are a few alternatives that I see. For simplicity you want the online and offline versions to work the same way, and that means fetching packages using easy_install / pip for both cases.
One way to do this is to build a mirror of PyPi. The right way to do this if you've got plenty of storage space (30Gb) is to use software that implements PEP381 (Mirroring Infrastructure for PyPI), there is already a client that does this (pep381client). A number of other projects are available that do similar things (basketweaver, djangopypi2, chishop).
An alternative is to consider a lighter-weight proxying scheme. I've been looking at pip2pi and pipli. I'm unsure whether they will work directly with easy_install, but it would be worth a try.
It's also worth noting that if you were using pip, you could have installed directly from the tarballs.
