How to sink message to InfluxDB using PyFlink? - apache-flink

I am trying to run the PyFlink walkthrough, but instead of sinking data to Elasticsearch, I want to use InfluxDB.
Note: the code in the walkthrough (link above) works as expected.
In order for this to work, we need to put an InfluxDB connector inside the Docker container.
The other Flink connectors are placed inside the container with these commands in the Dockerfile:
# Download connector libraries
RUN wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/${FLINK_VERSION}/flink-json-${FLINK_VERSION}.jar; \
wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.12/${FLINK_VERSION}/flink-sql-connector-kafka_2.12-${FLINK_VERSION}.jar; \
wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-elasticsearch7_2.12/${FLINK_VERSION}/flink-sql-connector-elasticsearch7_2.12-${FLINK_VERSION}.jar;
I need help to:
Put an InfluxDB connector into the container
Modify the CREATE TABLE statement below so it works for InfluxDB
CREATE TABLE es_sink (
    id VARCHAR,
    value DOUBLE
) WITH (
    'connector' = 'elasticsearch-7',
    'hosts' = 'http://elasticsearch:9200',
    'index' = 'platform_measurements_1',
    'format' = 'json'
)

From the documentation:
The Table and SQL APIs currently (14/06/2022) do not support InfluxDB - a SQL/Table connector does not exist.
Here are the known connectors that you can use:
From Maven Apache Flink
From Apache Bahir
You can:
Use the Flink streaming connector for InfluxDB from Apache Bahir (DataStream API only)
or
Implement your own sink (a sketch follows below)
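Since there is no Table/SQL connector, one workaround from PyFlink is to write to InfluxDB inside a DataStream map function using the influxdb-client Python package. This is only a minimal sketch, assuming InfluxDB 2.x reachable at http://influxdb:8086 and placeholder token/org/bucket names - it is not the Bahir connector:

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import MapFunction, RuntimeContext


class InfluxDBWriter(MapFunction):
    """Writes each (id, value) record to InfluxDB as a point."""

    def open(self, runtime_context: RuntimeContext):
        # Imported here so the dependency is only needed on the workers.
        from influxdb_client import InfluxDBClient
        from influxdb_client.client.write_api import SYNCHRONOUS
        self.client = InfluxDBClient(url="http://influxdb:8086",
                                     token="my-token", org="my-org")
        self.write_api = self.client.write_api(write_options=SYNCHRONOUS)

    def map(self, value):
        from influxdb_client import Point
        point = (Point("platform_measurements")
                 .tag("id", value[0])
                 .field("value", float(value[1])))
        self.write_api.write(bucket="my-bucket", record=point)
        return value

    def close(self):
        self.client.close()


env = StreamExecutionEnvironment.get_execution_environment()
records = env.from_collection(
    [("sensor_1", 10.5), ("sensor_2", 42.0)],
    type_info=Types.TUPLE([Types.STRING(), Types.DOUBLE()]))
records.map(InfluxDBWriter()).print()
env.execute("influxdb_sink_example")

Because the client is created in open(), each parallel task keeps a single InfluxDB connection instead of reconnecting per record.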

Related

How to get data from VerticaDB with Pyspark

I am trying to get data from VerticaDB with PySpark but I get a ClassNotFoundException.
Error: Py4JJavaError: An error occurred while calling o165.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.VerticaSource.
My code is here:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark import sql

# Create the spark session
spark = SparkSession \
    .builder \
    .appName("Vertica Connector Pyspark Example") \
    .getOrCreate()
spark_context = spark.sparkContext
sql_context = sql.SQLContext(spark_context)

# The name of our connector for Spark to look up
format = "com.vertica.spark.datasource.VerticaSource"

# Set connector options based on our Docker setup
table = "*****"
db = "*****"
user = "********"
password = "********"
host = "******"
part = "1"
staging_fs_url = "****"

#spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opt).load()
readDf = spark.read.load(
    # Spark format
    format=format,
    # Connector specific options
    host=host,
    user=user,
    password=password,
    db=db,
    table=table)

# Print the DataFrame contents
readDf.show()
Thanks
This is from the official documentation on how to enable Vertica as a data source in Spark:
The Vertica Connector for Apache Spark is packaged as a JAR file. You install this file on your Spark cluster to enable Spark and Vertica to exchange data. In addition to the Connector JAR file, you also need the Vertica JDBC client library. The Connector uses this library to connect to the Vertica database.
Both of these libraries are installed with the Vertica server and are available on all nodes in the Vertica cluster in the following locations:
The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib.
The JDBC client library is /opt/vertica/java/vertica-jdbc.jar.
Make sure the Vertica JDBC JAR is copied to the Spark library path.
Getting the Spark Connector
Deploying the Vertica Connector for Apache Spark
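As a minimal sketch of wiring those JARs into the session from PySpark, assuming the connector JAR under /opt/vertica/packages/SparkConnector/lib is named vertica-spark-connector.jar (check the actual file name in your installation) and using placeholder connection values:

from pyspark.sql import SparkSession

# Register the Vertica Spark connector and JDBC client JARs so that
# com.vertica.spark.datasource.VerticaSource can be resolved.
spark = SparkSession.builder \
    .appName("Vertica Connector Pyspark Example") \
    .config("spark.jars",
            "/opt/vertica/packages/SparkConnector/lib/vertica-spark-connector.jar,"
            "/opt/vertica/java/vertica-jdbc.jar") \
    .getOrCreate()

readDf = spark.read.format("com.vertica.spark.datasource.VerticaSource") \
    .option("host", "vertica-host") \
    .option("user", "dbadmin") \
    .option("password", "secret") \
    .option("db", "mydb") \
    .option("table", "mytable") \
    .load()
readDf.show()

Alternatively, pass the same two paths with spark-submit --jars; the important part is that both JARs are on the driver and executor classpaths. Depending on the connector version you may also need options such as staging_fs_url, as in the question.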

run apache beam on apache flink

I want to run Python code using Apache Beam on Apache Flink. The command that the Apache Beam site gives for launching Python code on Apache Flink is as follows:
docker run --net=host apachebeam/flink1.9_job_server:latest --flink-master=localhost:8081
The following is a discussion of different methods of executing code using Apache Beam on Apache Flink, but I haven't seen an example of launching it.
https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html
I want this code to run without Docker. What command should I use?
You can spin up the Flink job server directly from the Beam source code. Note that you'll need to install Java.
1) Clone the beam source code:
git clone https://github.com/apache/beam.git
2) Start the job server
cd beam
./gradlew -p runners/flink/1.8/job-server runShadow -PflinkMasterUrl=localhost:8081
Some helpful tips:
This is not Flink itself! You'll need to spin up Flink separately.
The Flink job service actually spins up a few services:
Expansion Service (port 8097): this service allows you to use ExternalTransforms within your pipeline that exist within the Java SDK. For example, the transforms found within the Python SDK apache_beam.io.external.* hit this expansion service.
Artifact Service (port 8098): this is where the pipeline uploads your Python artifacts (e.g. pickle files) to be used by the Flink taskmanager when it executes your Python code. From what I recall, you must share the artifact staging area (defaults to /tmp/beam-artifact-staging) between the Flink taskworker and this artifact service.
Job Service (port 8099): this is what you submit your pipeline to. It translates your pipeline into something for Flink and submits it.
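Once the job service is listening on port 8099, you can point a Python pipeline at it with the PortableRunner. A minimal sketch (LOOPBACK keeps the Python workers in the submitting process, which is convenient for local testing):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",  # the job service started above
    "--environment_type=LOOPBACK",    # run Python workers inside this process
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "beam", "on", "flink"])
     | "Print" >> beam.Map(print))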

Connecting to apache atlas + hbase + solr setup with gremlin cli

I am new to Atlas and JanusGraph. I have a local setup of Atlas with HBase and Solr as the backends, with dummy data.
I would like to use the Gremlin CLI + Gremlin server and connect to the existing data in HBase, i.e. view and traverse the dummy Atlas metadata objects.
This is what I have done so far:
Run Atlas server + HBase + Solr - inserted dummy entities
Run Gremlin server with the right configuration
I have set graph: { ConfigurationManagementGraph: ..} to janusgraph-hbase-solr.properties
Run Gremlin CLI and connect with :remote connect tinkerpop.server conf/remote.yaml session, which connects to the Gremlin server just fine
I do graph = JanusGraphFactory.open(..../janusgraph-hbase-solr.properties) and create g = graph.traversal()
I am able to create my own vertices and edges and list them, but I am not able to list anything related to Atlas, i.e. entities etc.
What am I missing?
I want to connect to the existing Atlas setup and traverse the graph with the Gremlin CLI.
Thanks
To be able to access Atlas artifacts from the Gremlin CLI, you will have to add the Atlas dependency JARs to JanusGraph's lib directory.
You can get the JARs from the Atlas Maven repo or from your local build.
$ cp atlas-* janusgraph-0.3.1-hadoop2/lib/
List of JARs:
atlas-common-1.1.0.jar
atlas-graphdb-api-1.1.0.jar
atlas-graphdb-common-1.1.0.jar
atlas-graphdb-janus-1.1.0.jar
atlas-intg-1.1.0.jar
atlas-repository-1.1.0.jar
A sample query could be:
gremlin> :> g.V().has('__typeName','hive_table').count()
==>10
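If you would rather run the same query from Python than from the Gremlin console, a minimal sketch with the gremlinpython driver (assuming the Gremlin server listens on the default ws://localhost:8182/gremlin endpoint and exposes a traversal source named g):

from gremlin_python.driver import client

# Connect to the Gremlin server that fronts the Atlas JanusGraph.
gremlin_client = client.Client("ws://localhost:8182/gremlin", "g")
try:
    result = gremlin_client.submit(
        "g.V().has('__typeName','hive_table').count()").all().result()
    print(result)  # e.g. [10]
finally:
    gremlin_client.close()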
As ThiagoAlvez mentioned, the Atlas Docker image can be used, since TinkerPop Gremlin support is now built into it and can easily be used to play with JanusGraph and Atlas artifacts from the Gremlin CLI:
Pull the image:
docker pull sburn/apache-atlas
Start Apache Atlas in a container exposing Web-UI port 21000:
docker run -d \
-p 21000:21000 \
--name atlas \
sburn/apache-atlas \
/opt/apache-atlas-2.1.0/bin/atlas_start.py
Install gremlin-server and gremlin-console into the container by running the included automation script:
docker exec -ti atlas /opt/gremlin/install-gremlin.sh
Start gremlin-server in the same container:
docker exec -d atlas /opt/gremlin/start-gremlin-server.sh
Finally, run gremlin-console interactively:
docker exec -ti atlas /opt/gremlin/run-gremlin-console.sh
I had this same issue when trying to connect to the Apache Atlas JanusGraph database (org.janusgraph.diskstorage.solr.Solr6Index).
I got it solved after moving the Atlas JARs to the JanusGraph lib folder, as anand said, and then configuring janusgraph-hbase-solr.properties.
These are the configurations that I set in janusgraph-hbase-solr.properties:
gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.hostname=localhost
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
index.search.backend=solr
index.search.solr.mode=http
index.search.solr.http-urls=http://localhost:9838/solr
index.search.solr.zookeeper-url=localhost:2181
index.search.solr.configset=_default
atlas.graph.storage.hbase.table=apache_atlas_janus
storage.hbase.table=apache_atlas_janus
I'm running Atlas using this docker image: https://github.com/sburn/docker-apache-atlas

How to "Run a single Flink job on YARN " by rest API?

From the official Flink documentation we know that we can "Run a single Flink job on YARN" with the command below. My question is: can we "Run a single Flink job on YARN" via the REST API, and get the application ID?
./bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar
See the (somewhat deceptively named) Monitoring REST API. You can use the /jars/upload request to send your (fat/uber) jar to the cluster. This returns an id that you can use with the /jars/:jarid/run request to start your job.
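As a concrete illustration of those two calls, here is a minimal Python sketch; the JobManager address, jar name and entry class are placeholders for your setup:

import requests

BASE = "http://jobmanager-host:8081"  # placeholder REST address of the JobManager

# 1. Upload the fat/uber jar.
with open("WordCount.jar", "rb") as jar:
    upload = requests.post(
        f"{BASE}/jars/upload",
        files={"jarfile": ("WordCount.jar", jar, "application/x-java-archive")})
upload.raise_for_status()
# The response contains the stored filename; its last path segment is the jar id.
jar_id = upload.json()["filename"].split("/")[-1]

# 2. Run the uploaded jar (entryClass can be omitted if the jar's manifest defines a main class).
run = requests.post(f"{BASE}/jars/{jar_id}/run",
                    json={"entryClass": "org.apache.flink.examples.java.wordcount.WordCount"})
run.raise_for_status()
print(run.json())  # contains the id of the started job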
If you also need to start up the cluster, then you're currently (AFAIK) going to need to write some Java code to start a cluster on YARN. There are two source files in Flink that do this same thing:
ProgramDeployer.java, used by the Flink Table API.
CliFrontend.java, used by the command line tool.

How to configure Apache Ignite in Redash

I need to configure Apache Ignite in Redash for a BI dashboard, but I couldn't figure out how to do this since there is no direct support for Ignite in Redash.
It is possible using Python query runner, which is available for stand-alone installs only. It allows you to run arbitrary Python code, in which you can query Apache Ignite via JDBC.
First, add the redash.query_runner.python query runner to settings.py, and install the Python JDBC bridge module together with its dependencies:
sudo apt-get install python-setuptools
sudo apt-get install python-jpype
sudo easy_install JayDeBeApi
Then, after a VM restart, you should add a Python data source (you might need to tweak the module path).
Then you can actually run the query (you will need to provide the Apache Ignite core JAR and a JDBC connection string as well):
import jaydebeapi

# Connect through the Ignite JDBC thin driver; the ignite-core JAR provides the driver class.
conn = jaydebeapi.connect('org.apache.ignite.IgniteJdbcThinDriver', 'jdbc:ignite:thin://localhost', {}, '/home/ubuntu/.m2/repository/org/apache/ignite/ignite-core/2.3.2/ignite-core-2.3.2.jar')
curs = conn.cursor()
curs.execute("select c.Id, c.CreDtTm from TABLE.Table c")
data = curs.fetchall()

# add_result_column / add_result_row are helpers provided by Redash's Python query runner.
result = {"columns": [], "rows": []}
add_result_column(result, "Id", "Identifier", "string")
add_result_column(result, "CreDtTm", "Create Date-Time", "integer")
for d in data:
    add_result_row(result, {"Id": d[0], "CreDtTm": d[1]})
Unfortunately there's no direct support for JDBC in Redash (that I'm aware of) so all that boilerplate is needed.
