How does Flink recognize hiveConfDir when running in a YARN cluster?

I have the following code to test Flink and Hive integration. I submit the application via flink run -m yarn-cluster ..... The hiveConfDir is a local directory that resides on the machine from which I submit the application. I would like to ask how Flink is able to read this local directory when the main class is running in the cluster (yarn-cluster). Thanks!
package org.example.app

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.table.catalog.hive.HiveCatalog
import org.apache.flink.types.Row

object FlinkBatchHiveTableIntegrationTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val tenv = StreamTableEnvironment.create(env)

    val name = "myHiveCatalog"
    val defaultDatabase = "default"
    // how is Flink able to read this local directory?
    val hiveConfDir = "/apache-hive-2.3.7-bin/conf"
    val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir)
    tenv.registerCatalog(name, hive)
    tenv.useCatalog(name)

    val sql =
      """
        select * from testdb.t1
      """.stripMargin(' ')
    val table = tenv.sqlQuery(sql)
    table.printSchema()
    table.toAppendStream[Row].print()
    env.execute("FlinkHiveIntegrationTest")
  }
}

It looks like I have found the answer. The application is submitted with flink run -m yarn-cluster. In this mode, the main method of the application runs on the client side, where Hive is installed, so the Hive conf dir can be read.
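To make that dependence on the client machine explicit, the catalog path can be taken from a program argument instead of being hard-coded. The sketch below is only an illustration (the object name, the argument handling, and the fallback path are assumptions, not part of the original code):

// Minimal sketch: hiveConfDir is resolved on the client, where main() runs
// when the job is submitted with "flink run -m yarn-cluster".
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.table.catalog.hive.HiveCatalog

object FlinkHiveCatalogSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical: take the Hive conf dir from the first program argument,
    // falling back to the client-local default used above.
    val hiveConfDir = if (args.nonEmpty) args(0) else "/apache-hive-2.3.7-bin/conf"
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tenv = StreamTableEnvironment.create(env)
    val hive = new HiveCatalog("myHiveCatalog", "default", hiveConfDir)
    tenv.registerCatalog("myHiveCatalog", hive)
    tenv.useCatalog("myHiveCatalog")
  }
}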

Related

Filesystem changes using watchdog in Python and sending the file to SQL Server

import os
import sys
import time

import pandas as pd
from sqlalchemy import create_engine
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
class EventHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        print("EVENT")
        print(event.event_type)
        print(event.src_path)
        print()

if __name__ == "__main__":
    path = 'input path here'
    event_handler = EventHandler()
    observer = Observer()
    observer.schedule(event_handler, path, recursive=True)
    print("Monitoring started")
    observer.start()
    try:
        while(True):
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
        observer.join()

    engine = create_engine('Database information')
    cursor = engine.raw_connection().cursor()
    for file in os.listdir('.'):
        file_basename, extension = file.split('.')
        if extension == 'xlsx':
            df = pd.read_excel(os.path.abspath(file))
            df.to_sql(file_basename, con=engine, if_exists='replace')
I am okay with the first part, up to where observer.join() ends. It is the next part, starting with engine = create_engine(""), that I am having trouble with: that part is supposed to send the file to the SQL server, but the code is not doing that. I have researched on the internet but haven't found anything. Any help is appreciated.

Ktor with Gradle run configuration "Could not resolve substitution to a value" from environment variables

I have set up a server in Ktor with a Postgres database in Docker, but figured it would be useful to be able to develop the server locally without rebuilding the docker container each time.
In application.conf I have
// ...
db {
    jdbcUrl = ${DATABASE_URL}
    dbDriver = "org.postgresql.Driver"
    dbDriver = ${?DATABASE_DRIVER}
    dbUser = ${DATABASE_USER}
    dbPassword = ${DATABASE_PASSWORD}
}
and in my DatabaseFactory I have
object DatabaseFactory {
    private val appConfig = HoconApplicationConfig(ConfigFactory.load())
    private val dbUrl = appConfig.property("db.jdbcUrl").getString()
    private val dbDriver = appConfig.property("db.dbDriver").getString()
    private val dbUser = appConfig.property("db.dbUser").getString()
    private val dbPassword = appConfig.property("db.dbPassword").getString()

    fun init() {
        Database.connect(hikari())
        transaction {
            val flyway = Flyway.configure().dataSource(dbUrl, dbUser, dbPassword).load()
            flyway.migrate()
        }
    }

    private fun hikari(): HikariDataSource {
        val config = HikariConfig()
        config.driverClassName = dbDriver
        config.jdbcUrl = dbUrl
        config.username = dbUser
        config.password = dbPassword
        config.maximumPoolSize = 3
        config.isAutoCommit = false
        config.transactionIsolation = "TRANSACTION_REPEATABLE_READ"
        config.validate()
        return HikariDataSource(config)
    }

    suspend fun <T> dbQuery(block: () -> T): T =
        withContext(Dispatchers.IO) {
            transaction { block() }
        }
}
I have edited the Gradle run configuration with the following environment config:
DATABASE_URL=jdbc:h2:mem:default;DATABASE_DRIVER=org.h2.Driver;DATABASE_USER=test;DATABASE_PASSWORD=password
When I run the task I get this error: Could not resolve substitution to a value: ${DATABASE_URL}, but if I set a breakpoint on the first line (private val appConfig) and evaluate System.getenv("DATABASE_URL") it is resolved to the correct value.
My questions are:
Why does this not work?
What is the best (or: a good) setup for developing the server without packing it in a container? Preferably without running the database in another container.
You need to specify:
appConfig.property("ktor.db.jdbcUrl").getString()
I found that setting environment variables for the task in build.gradle.kts works:
tasks {
    "run"(JavaExec::class) {
        environment("DATABASE_URL", "jdbc:postgresql://localhost:5432/test")
        environment("DATABASE_USER", "test")
        environment("DATABASE_PASSWORD", "password")
    }
}
(source: Setting environment variables in build.gradle.kts)
As for why my initial approach only works in debug mode, I have no idea.
As for question #2, I have a suspicion that H2 and Postgres could have some syntactic differences that will cause trouble. Running the database container in the background works fine for now.
As answered by @leonardo-freitas, we need to specify the ktor. prefix before accessing application.conf properties:
environment.config.property("ktor.db.jdbcUrl").getString()
This is missing from the official docs as well: https://ktor.io/docs/jwt.html#validate-payload
I experienced the same problem. I had to manually restart IntelliJ (not via the File menu); I simply closed the IDE and then opened it again. Also, check that your environment variable is permanent.

Where to find the output for a standalone cluster

I have the following Flink word count. When I run it in my IDE, it prints the word counts correctly, as follows:
(hi,2)
(are,1)
(you,1)
(how,1)
But when I run it in the cluster, I cannot find the output.
1. Start cluster using start-cluster.sh
2. Open the webui at http://localhost:8081
3. In the Submit New Job page, submit the jar, enter the entry class, and click the Submit button to submit the job
4. The job finishes successfully, but I can't find the output in the TaskManager or JobManager logs in the UI.
I would like to ask where I can find the output.
The word count application is:
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

/**
 * Wordcount example
 */
object WordCount {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val dataSet = env.fromCollection(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val grouped = mappedWords.groupBy(0)
    val sum = grouped.sum(1)
    sum.collect().foreach(println)
  }
}
In the log directory of each taskmanager machine you should find both *.log and *.out files. Whatever your job has printed will go to the .out files. This is what is displayed in the "stdout" tab for each taskmanager in the web UI -- though if this file is very large, the browser may struggle to fetch and display it.
Update: Apparently Flink's batch environment handles printing differently from the streaming one. When I use the CLI to submit this batch job, the output appears in the terminal, and not in the .out files as it would for a streaming job.
I suggest you change your example to do something like this at the end to collect the results in a file:
...
sum.writeAsText("/tmp/test")
env.execute()
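For completeness, here is a minimal sketch of the reworked job; the output path /tmp/wordcount, the overwrite mode, the sink parallelism of 1 (so the result lands in a single file), and the object name are illustrative choices, not part of the original answer:

import org.apache.flink.api.scala._
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.core.fs.FileSystem.WriteMode

object WordCountToFile {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val sum = env.fromCollection(data)
      .flatMap(value => value.split("\\s+"))
      .map(value => (value, 1))
      .groupBy(0)
      .sum(1)
    // With a file sink in the plan, env.execute() triggers the job and the
    // result is written where the sink task runs instead of to stdout.
    sum.writeAsText("/tmp/wordcount", WriteMode.OVERWRITE).setParallelism(1)
    env.execute("WordCountToFile")
  }
}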

Flink: How to set System Properties in TaskManager?

I have some code reading messages from Kafka, like below:
def main(args: Array[String]): Unit = {
  System.setProperty("java.security.auth.login.config", "someValue")
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val consumerProperties = new Properties()
  consumerProperties.setProperty("security.protocol", "SASL_PLAINTEXT")
  consumerProperties.setProperty("sasl.mechanism", "PLAIN")
  val kafkaConsumer = new FlinkKafkaConsumer011[ObjectNode](consumerProperties.getProperty("topic"), new JsonNodeDeserializationSchema, consumerProperties)
  val stream = env.addSource(kafkaConsumer)
}
When the source tries to read messages from Apache Kafka, the org.apache.kafka.common.security.JaasContext.defaultContext function loads the "java.security.auth.login.config" property.
But the property is only set on the JobManager, and when my job runs, the property cannot be loaded correctly on the TaskManager, so the source fails.
I tried to set extra JVM_OPTS like "-Dxxx=yyy", but the Flink cluster is deployed in standalone mode, and environment variables cannot be changed very often.
Is there any way to set the property on the TaskManagers?
The file bin/config.sh of a Flink standalone cluster holds a property named DEFAULT_ENV_JAVA_OPTS.
Also, if you export JVM_ARGS="your parameters", the file bin/config.sh will pick it up using these lines:
# Arguments for the JVM. Used for job and task manager JVMs.
# DO NOT USE FOR MEMORY SETTINGS! Use conf/flink-conf.yaml with keys
# KEY_JOBM_MEM_SIZE and KEY_TASKM_MEM_SIZE for that!
if [ -z "${JVM_ARGS}" ]; then
    JVM_ARGS=""
fi
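A different way around the problem (not mentioned in the answer above, and only a sketch): newer Kafka clients, including the 0.11 client used by FlinkKafkaConsumer011, accept the JAAS configuration as the sasl.jaas.config consumer property, so it travels with the job instead of having to be a JVM system property on every TaskManager. The login module, username, and password below are placeholders.

val consumerProperties = new java.util.Properties()
consumerProperties.setProperty("security.protocol", "SASL_PLAINTEXT")
consumerProperties.setProperty("sasl.mechanism", "PLAIN")
// Placeholder credentials; replace with the real JAAS entry for your cluster.
consumerProperties.setProperty(
  "sasl.jaas.config",
  "org.apache.kafka.common.security.plain.PlainLoginModule required " +
    "username=\"myUser\" password=\"myPassword\";")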

Playframework/RequireJs javascript files not being optimized

I'm new to Play Framework and I'm trying to get it to work with RequireJS. When I run my app in dev mode everything runs fine, but when I set application.mode=prod and start the server with play start, I run into problems.
The browser receives an HTTP 404 when attempting to load /assets/javascripts-min/home/main.js.
Here's my Build.scala file
import sbt._
import Keys._
import play.Project._
import com.google.javascript.jscomp._
import java.io.File

object MyBuild extends Build {
  val appDependencies = Seq(
    jdbc,
    anorm,
    cache
  )
  val appVersion = "0.0.1"
  val appName = "TodoList"

  // set Closure Compiler options so it won't choke on modern JS frameworks
  val root = new java.io.File(".")
  val defaultOptions = new CompilerOptions()
  defaultOptions.closurePass = true
  defaultOptions.setProcessCommonJSModules(true)
  defaultOptions.setCommonJSModulePathPrefix(root.getCanonicalPath + "/app/assets/javascripts/")
  defaultOptions.setLanguageIn(CompilerOptions.LanguageMode.ECMASCRIPT5)
  CompilationLevel.WHITESPACE_ONLY.setOptionsForCompilationLevel(defaultOptions)

  val main = play.Project(appName, appVersion, appDependencies).settings(
    (Seq(requireJs += "home/main.js", requireJsShim := "home/main.js") ++ closureCompilerSettings(defaultOptions)): _*
  )
}
It turns out that I had conflicting build.sbt (in the root directory) and Build.scala (in the project directory) files. Once I removed the sbt file, the RequireJS optimization began working as expected.
