elastic4s in Spark Application

I have been trying to use elastic4s in my Spark application, but every time it tries to send data to my Elasticsearch node I keep getting:
java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
at org.elasticsearch.threadpool.ThreadPool.<clinit>(ThreadPool.java:190)
at org.elasticsearch.client.transport.TransportClient$Builder.build(TransportClient.java:131)
at com.sksamuel.elastic4s.ElasticClient$.transport(ElasticClient.scala:111)
at com.sksamuel.elastic4s.ElasticClient$.remote(ElasticClient.scala:92)
I'm not sure where to even start debugging this error. The code is fairly simple:
val elasticAddress = getEnvirometalParameter("streaming_pipeline", "elastic_address")(0)._1
val uri = ElasticsearchClientUri("elasticsearch://" + elasticAddress)
val client = ElasticClient.remote(uri)

def elasticInsert(subject: String, predicate: String, obj: String, label: String) = {
  client.execute {
    update id (label + subject + predicate + obj) in "test" / "quad" docAsUpsert (
      "subject" -> subject,
      "predicate" -> predicate,
      "object" -> obj,
      "label" -> label
    )
  }
}

The issue is that Elasticsearch's TransportClient and Spark clash on their versions of Guava and Netty (among other shared dependencies). Spark puts an older Guava on the classpath that predates MoreExecutors.directExecutor(), so you get these kinds of NoSuchMethodError exceptions at runtime.
Since version 5.3 of Elastic4s, your best bet is to use the HttpClient, which has no dependency on things like Netty or Guava.
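A minimal sketch of what the switch could look like, assuming the elastic4s-http module (5.3+) and pointing the client at the REST port (9200 by default) rather than the transport port; the exact imports and DSL have shifted a little between elastic4s releases, so treat this as a starting point rather than the definitive API:

import com.sksamuel.elastic4s.ElasticsearchClientUri
import com.sksamuel.elastic4s.http.HttpClient
import com.sksamuel.elastic4s.http.ElasticDsl._

// The HTTP client talks to Elasticsearch over REST, so it does not drag in the
// transport-layer Netty/Guava that clashes with Spark's own dependencies.
val client = HttpClient(ElasticsearchClientUri("elasticsearch://" + elasticAddress))

def elasticInsert(subject: String, predicate: String, obj: String, label: String) =
  client.execute {
    update(label + subject + predicate + obj).in("test" / "quad").docAsUpsert(
      "subject" -> subject,
      "predicate" -> predicate,
      "object" -> obj,
      "label" -> label
    )
  }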

Related

Flink: TypeExtractor complains about protobuf class even though a ProtobufSerializer is registered for it

Using Flink 1.12.
My protobuf is
syntax = "proto3";
package flink.protobuf;
message TimestampedMessage {
int64 timeMs = 1;
string message = 2;
}
and I tried to use it like so:
final var env = StreamExecutionEnvironment.createLocalEnvironment();
env.getConfig().registerTypeWithKryoSerializer(TimestampedMessage.class, ProtobufSerializer.class);
env.fromCollection(new EventsIter(), TimestampedMessage.class)
...
But the logs show this
flink.protobuf.Test$TimestampedMessage does not contain a setter for field timeMs_
2021-08-12 06:38:19,940 INFO org.apache.flink.api.java.typeutils.TypeExtractor Class class
flink.protobuf.Test$TimestampedMessage cannot be used as a POJO type because not all fields are
valid POJO fields, and must be processed as GenericType. Please read the Flink
documentation on "Data Types & Serialization" for details of the effect on performance.
Seems like it is using the ProtobufSerializer despite the warning.

How to get and modify jobgraph in Flink?

I want to modify the JobGraph, e.g. via getVertices() and ColocationGroup(), and I tried the following:
val s = StreamExecutionEnvironment.getExecutionEnvironment
val p = new Param()
s.addSource(new MySource(p))
  .map(new MyMap(p))
  .addSink(new MySink(p))

val g = s.getStreamGraph.getJobGraph()
val v = g.getVertices()
s.executeAsync()
But in debug mode the vertices are not my MySource operator. I am also confused that getStreamGraph can accept a JobID (how could a job have an ID before it starts...?), so I thought maybe I should get the graph after executing, and changed the code to the following:
val s = StreamExecutionEnvironment.getExecutionEnvironment
val p = new Param()
s.addSource(new MySource(p))
  .map(new MyMap(p))
  .addSink(new MySink(p))

s.executeAsync()
val g = s.getStreamGraph.getJobGraph()
val v = g.getVertices()
But the code gets stuck in s.getStreamGraph.getJobGraph().
How do I actually get the JobGraph?...
It's not possible to modify the JobGraph. The various APIs construct the JobGraph, which then makes its way to the JobManager, which turns it into an execution graph, which is then scheduled and run in task slots provided by the task managers. There's no API that will allow you to modify the job's topology once it has been established. (It's not inconceivable that this could someday be supported, but it's not possible now.)
If you just want to see a representation of it, System.out.println(env.getExecutionPlan()) is interesting.
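With the Scala environment from the question, that looks roughly like the following sketch (getExecutionPlan returns a JSON description of the topology; it is read-only and does not let you change the graph):

val s = StreamExecutionEnvironment.getExecutionEnvironment
val p = new Param()
s.addSource(new MySource(p))
  .map(new MyMap(p))
  .addSink(new MySink(p))

// Print a JSON representation of the job topology before submitting the job.
println(s.getExecutionPlan)

s.executeAsync()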
If you are looking for more dynamism than you can get from a static DAG, the Stateful Functions API is much more flexible.

Dynamic file in request body

I am playing with Karate to test one resource that accepts a date that cannot be in the past.
Scenario: Schedule one
  Given path '/schedules'
  And request read('today_at_19h30.json')
  When method post
  Then status 201
I have created some JSON files (mostly duplicates but with subtle changes) for each scenario. But I cannot seriously imagine changing the date in all of them each time I want to run my tests.
Using a date in the far future is not a good idea because there is some manual verification and that will force us to click (on next) too many times.
Is there a way to include a variable or an expression in a file?
Thanks
There are multiple ways to "clobber" JSON data in Karate. One way is to just use a JS expression. For example:
* def foo = { a: 1 }
* foo.a = 2
* match foo == { a: 2 }
For your specific use case, I suspect embedded expressions will be the more elegant way to do it. The great thing about embedded expressions is that they work in combination with the read() API.
For example, where the contents of the file test.json are { "today": "#(today)" }:
Background:
* def getToday =
"""
function() {
  var SimpleDateFormat = Java.type('java.text.SimpleDateFormat');
  var sdf = new SimpleDateFormat('yyyy/MM/dd');
  var date = new java.util.Date();
  return sdf.format(date);
}
"""

Scenario:
* def today = getToday()
* def foo = read('test.json')
* print foo
Which results in:
Running com.intuit.karate.junit4.dev.TestRunner
20:19:20.957 [main] INFO com.intuit.karate - [print] {
"today": "2020/01/22"
}
By the way, if the getToday function has been defined, you can even do this: { "today": "#(getToday())" }, which may give you more ideas.

Evaluation Error while using the Hiera hash in puppet

I have the following values in my hiera yaml file:
test::config_php::php_modules :
-'soap'
-'mcrypt'
-'pdo'
-'mbstring'
-'php-process'
-'pecl-memcache'
-'devel'
-'php-gd'
-'pear'
-'mysql'
-'xml'
and the following is my test class:
class test::config_php (
  $php_version,
  $php_modules = hiera_hash('php_modules', {}),
  $module_name,
) {
  class { 'php':
    version => $php_version,
  }

  $php_modules.each |String $php_module| {
    php::module { $php_module: }
  }
}
While running my puppet manifests I get the following error:
Error: Evaluation Error: Error while evaluating a Function Call, create_resources(): second argument must be a hash at /tmp/vagrant-puppet/modules-f38a037289f9864906c44863800dbacf/ssh/manifests/init.pp:46:3 on node testdays-1a.vagrant.loc.vag
I am quite confused about what exactly I am doing wrong. My Puppet version is 3.6.2 and I also have parser = future set.
I would really appreciate any help here.
Looks like your YAML was slightly off.
You don't really need quotes in YAML.
Your indentation was two instead of one.
The colon on the first line was preceded by a space; this will throw a syntax error.
It should look more like this:
test::config_php::php_modules:
- soap
- mcrypt
- pdo
- mbstring
- php-process
- pecl-memcache
- devel
- php-gd
- pear
- mysql
- xml
In the future, try running your YAML through a parser like this one: link
The problem was with my Puppet version; somehow version 3.6 acts weird while creating resources. For instance, it was failing on the following line:
create_resources('::ssh::client::config::user', $fin_users_client_options)
The code snippet above is part of the ssh module from Puppet Labs, which I assume is thoroughly tested and shouldn't be the reason for the exception.
Further analysis showed that the exception was thrown when the parameter parser = future was set in the config file.
I cannot iterate using each without setting future as the parser, so I decided to change my source as follows:
I created a new defined type:
define test::install_modules {
  php::module { $name: }
}
and then I changed the config_php class to:
class test::config_php (
  $php_version,
  $php_modules = [],
) {
  class { 'php':
    version => $php_version,
  }

  install_modules { $php_modules: }
}
Everything seems to be much better now.

Manipulating DocList in SOLR

I have written a custom request handler in Solr to meet my business requirements. The handler involves getting data from two different SolrIndexSearchers. I want the returned DocLists from the two SolrIndexSearchers merged into one.
I tried iterating through one and adding doc by doc to the other, but all I could get was an "Unsupported Operation" exception. Is there any way to merge two DocLists?
[Edit 1] : Code snippet inside the overridden handleRequestBody method
SolrCore core = new SolrCore("Desired Directory 1", schema);
reader = IndexReader.open("Desired Directory 1");
searcher = new SolrIndexSearcher(core, schema, getName(), reader, false);
Sort lsort = null;
FilteredQuery filter = null;
DocList results1 = searcher.getDocList(query, filter, lsort, 0, 10);
reader.close();
searcher.close();
core.close();

core = new SolrCore("Desired Directory 2", schema);
reader = IndexReader.open("Desired Directory 2");
searcher = new SolrIndexSearcher(core, schema, getName(), reader, false);
DocList results2 = searcher.getDocList(query, filter, lsort, 0, 10);
reader.close();
searcher.close();
core.close();

rsp.add("response", results1);
rsp.add("response", results2);
Now that I have two DocLists results1 and results2, how do I merge them?
[Edit 2] : The problem is not an exception/stack trace. When I add the two responses, I get the results as two separate response sets on a single-machine search. When it is a distributed search, I only get the distribution between response 1 of machine 1 and response 1 of machine 2. In my understanding, only when I merge the responses into a single set will I get proper distribution. I hope that is understandable.
DocList is not supposed to be changed after you retrieve it from a search result.
I had a similar problem and, if I remember correctly, I used:
org.apache.solr.util.SolrPluginUtils.docListToSolrDocumentList
to convert DocList to SolrDocumentList, which extends ArrayList<SolrDocument>, which consequently supports add(SolrDocument).
So the worst-case scenario (from a performance standpoint) is to convert the first DocList to a SolrDocumentList and then loop through all the other DocLists, calling add on each doc.
You'll have to test how efficient this approach is. I'm not a Solr expert, but this is where I'd start testing.
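A rough sketch of that approach, written in Scala like the rest of this page and assuming hypothetical searcher1/searcher2 handles that are still open (DocList only holds internal Lucene doc ids, so the documents have to be materialised before the searchers are closed; the exact docListToSolrDocumentList signature also varies between Solr versions):

import org.apache.solr.common.SolrDocumentList
import org.apache.solr.util.SolrPluginUtils

// Materialise each DocList into a SolrDocumentList while its searcher is still open.
val merged: SolrDocumentList =
  SolrPluginUtils.docListToSolrDocumentList(results1, searcher1, null, null)
val second: SolrDocumentList =
  SolrPluginUtils.docListToSolrDocumentList(results2, searcher2, null, null)

// SolrDocumentList extends ArrayList[SolrDocument], so addAll merges the docs.
merged.addAll(second)
merged.setNumFound(merged.getNumFound + second.getNumFound)

// Add a single merged result set to the response instead of two separate ones.
rsp.add("response", merged)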
