How to load GraphDSL flow dynamically - akka-stream

I have a requirement to load a GraphDSL flow dynamically from a DB or a text file.
So instead of hard-coding the flow below, I want to load the DSL from a store:
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out
            bcast ~> f4 ~> merge
Any ideas on how to load a GraphDSL flow programmatically?
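As far as I know there is no built-in loader for this, so whatever format the flow is stored in (DB row, JSON, text), you end up parsing it yourself and issuing the equivalent GraphDSL builder calls. For reference, here is a minimal hard-coded sketch of the flow above (placeholder Int stages, Akka 2.6 assumed); a dynamic loader would have to produce the same Broadcast/Merge/~> wiring from the parsed description:

import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl._

object HardCodedGraph extends App {
  implicit val system: ActorSystem = ActorSystem("graph-demo")

  // Placeholder stages standing in for the question's in, f1..f4 and out.
  val in = Source(1 to 10)
  val out = Sink.foreach[Int](println)
  val f1, f2, f3, f4 = Flow[Int].map(_ + 1)

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val bcast = b.add(Broadcast[Int](2))
    val merge = b.add(Merge[Int](2))

    in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out
                bcast ~> f4 ~> merge
    ClosedShape
  })

  graph.run()
}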

Related

Flink: when to use timeWindowAll

I have a pipeline that consumes data with the following shape:
case class Foo(source: String, destination: String) { def key = source + destination }
I want to remove all source+destination duplicates that arrive in the same hour, and then count all calls that arrive for a destination in the same hour. I created a pipeline like this:
src ~> timewindow1(1 hour, keyBy: key) ~> timewindow2(1 hour, keyBy: destination) ~> ...
Should I use timeWindowAll in timewindow2?
You should only use timeWindowAll in cases where you don't want to have key-partitioned windowing. Since you are keying by destination, you should use timeWindow, not timeWindowAll.
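As a rough sketch of both hourly windows in keyed form (Flink's Scala DataStream API and the older timeWindow shorthand assumed, with toy data and processing-time semantics for brevity):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Foo(source: String, destination: String) { def key: String = source + destination }

object HourlyCounts extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  // Toy source standing in for the real pipeline's src.
  val src: DataStream[Foo] = env.fromElements(Foo("a", "x"), Foo("a", "x"), Foo("b", "x"))

  // timewindow1: drop source+destination duplicates within the hour (keyed window, not windowAll).
  val deduped: DataStream[Foo] = src
    .keyBy(_.key)
    .timeWindow(Time.hours(1))
    .reduce((first, _) => first)

  // timewindow2: count calls per destination within the hour, again keyed, so no timeWindowAll needed.
  val counts: DataStream[(String, Int)] = deduped
    .map(foo => (foo.destination, 1))
    .keyBy(_._1)
    .timeWindow(Time.hours(1))
    .sum(1)

  counts.print()
  env.execute("hourly-counts")
}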

How to indicate the database in SparkSQL over Hive in Spark 1.3

I have a simple piece of Scala code that retrieves data from the Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark that I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, and that is slow. Unfortunately I can't force it to use "spark".
I tried to use SQLContext instead, replacing the HiveContext with hc = new SQLContext(sc), to see if performance would improve. With this change, the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database is supported in later Spark versions:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to issue the statements as two separate spark.sql calls, like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The HiveContext gives you the ability to create a DataFrame using Hive's metastore. Spark only uses the metastore from Hive, and doesn't use Hive as a processing engine to retrieve the data. So when you create the DataFrame from your SQL query, it's really just asking Hive's metastore "Where is the data, and what's the format of the data?"
Spark takes that information and runs the query against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext instead, you remove the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
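A quick way to see that Spark, not Hive, executes the query is to look at the DataFrame's physical plan (a small sketch, reusing the query from the question with the Spark 1.3 HiveContext in spark-shell):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("use myDatabase")
val df = hc.sql("select PRODUCT_CODE, DATA_UNIT from account")
df.explain()        // prints a Spark physical plan (e.g. a HiveTableScan), not a MapReduce job
println(df.rdd.count())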
I have not been able to get the use database command to work, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME;
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()

Join under a Group By condition in Apache Spark

I have two DataFrames A and B which look like the following
A: [cust_no,feature1,feature2......]
B: [ID_number,cust_no]
I need to build a DataFrame C
C:[ID_number,feature1,feature2......]
(where A.cust_no does not match any B.cust_no) for each ID_number
How do I do this using Apache Spark DataFrame in Scala?
PS: I don't want to extract the ID_numbers and then loop over the list of ID_numbers, as Apache Spark doesn't support parallelism across iterations of a for loop.
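One way to express this without looping, assuming a Spark 2.x DataFrame API (crossJoin and "left_anti" joins), is to pair every ID_number with every customer and then anti-join away the pairs that already exist in B. A sketch with toy data:

import org.apache.spark.sql.SparkSession

object NotLinkedCustomers extends App {
  val spark = SparkSession.builder().master("local[*]").appName("anti-join").getOrCreate()
  import spark.implicits._

  // Toy stand-ins for A and B from the question.
  val A = Seq((1L, 0.1, 0.2), (2L, 0.3, 0.4)).toDF("cust_no", "feature1", "feature2")
  val B = Seq((100L, 1L)).toDF("ID_number", "cust_no")

  // For each ID_number, keep the customers it is NOT linked to in B.
  val C = B.select("ID_number").distinct()
    .crossJoin(A)                                      // every (ID_number, customer) pair
    .join(B, Seq("ID_number", "cust_no"), "left_anti") // drop pairs that appear in B
    .drop("cust_no")

  C.show()
}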

Single Solrj call for adding and deleting docs

I am using org.apache.solr.client.solrj.impl.HttpSolrServer for calling Solr.
For sequential delete and add operations, I am hitting Solr like this:
solr.addBeans(<solrDocs>);
solr.deleteByQuery(<Query>)
solr.commit();
Is there any way I can achieve the same in one Solr call, something like solr.execute(addBean, deleteByQuery1)?
I know that multiple commands may be contained in one message, as per the Solr wiki. I want to know how to achieve the same in SolrJ or any other Java library.
What do I want to achieve by this?
An atomic operation.
Let's take a case: there are two processes (or threads), P1 and P2. Each performs an Add (A1 and A2 respectively) and a Delete (D1 and D2) operation. Let the sequence be like this:
D1 (Deletion of docs by process P1)
D2 (Deletion of docs by process P2)
A2 (Addition of docs by process P2)
P2.commit -> (this will commit D1 in Solr too)
A1 (Addition of docs by process P1): now even if it fails, D1 is not going to roll back (because of P2.commit)
What I want is to roll back P1.D1.
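A single HTTP round trip is possible by bundling the delete and the add into one UpdateRequest (a sketch, SolrJ 4.x HttpSolrServer assumed; the URL, query and fields are made up, and SolrInputDocuments are used instead of beans). Note that this alone is still not a transaction: a commit issued by another client can still commit your pending delete, exactly as in the P1/P2 sequence above, and I would not rely on a particular ordering of the add and the delete inside one request.

import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.client.solrj.request.{AbstractUpdateRequest, UpdateRequest}
import org.apache.solr.common.SolrInputDocument

object CombinedUpdate extends App {
  val solr = new HttpSolrServer("http://localhost:8983/solr/collection1") // made-up URL

  val doc = new SolrInputDocument()
  doc.addField("id", "42")
  doc.addField("title", "example")

  val req = new UpdateRequest()
  req.deleteByQuery("title:obsolete")                 // the delete part
  req.add(java.util.Collections.singletonList(doc))   // the add part
  req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true) // commit in the same request
  req.process(solr)                                   // one HTTP call to Solr
}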

How do I loop this piece of code?

My partners and I have this piece of code where we extract tweets in R and put them in a database. What we would like to know is how to loop this piece of code so that it runs periodically, preferably every 30 minutes.
Here's our code:
#Load twitter package for R
library(twitteR)
#load MySQL package for R
library(RMySQL)
#Load authentication files for twitter
load(file="twitter_authentication.Rdata")
registerTwitterOAuth(cred)
#Search twitter for tweets e.g. #efteling
efteling <- searchTwitter("#efteling", n=100)
#Store the tweets into a dataframe
dataFrameEfteling <- do.call("rbind", lapply(efteling, as.data.frame))
#Set up the connection to the database
doConnect <- dbConnect(MySQL(), user="root", password="", dbname="portfolio", host="127.0.0.1")
dbWriteTable(doConnect, "tweetsEfteling", dataFrameEfteling)
eftelingResult <- dbSendQuery(doConnect, "select text from tweetsEfteling")
showResultEfteling <- fetch(eftelingResult, n=20)
Do you have access to crontab? If so, you can set it to run the script however frequently you like.
Here is a little information on crontab.
If your server is running linux, you can just type in
crontab -e
to pull up your personal crontab file. After that, you schedule your command.
For every 30 mins, you would use this command.
*/30 * * * * /path/to/script
Save and exit.
Have you considered using Twitter's streaming API vs REST? This would likely accomplish the same thing if you leave the connection open for an extended period of time. Plus it would cut down on API pulls. Try the streamR package.
If you still want to set it on a timer, http://statistics.ats.ucla.edu/stat/r/faq/timing_code.htm looks useful.
