Apache Flink: Unable to convert the Table object to DataSet object - apache-flink

I am using the Table API on Flink 1.4.0. I have some Table objects that need to be converted to a DataSet of type Row. The project was built using Maven and imported into IntelliJ. I have the following code, and the IDE cannot resolve the tableEnvironment.toDataSet() method. Please help me out. Thank you.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnvironment = TableEnvironment.getTableEnvironment(env);
...
tableEnvironment.registerTableSource("table1",csvSource);
Table table1 = tableEnvironment.scan("table1");
DataSet<Row> result = tableEnvironment.toDataSet(table1, Row.class);
The last statement causes an error
"Cannot resolve toDataSet() method"

You might not be importing the right BatchTableEnvironment.
Please check that you import org.apache.flink.table.api.java.BatchTableEnvironment instead of org.apache.flink.table.api.BatchTableEnvironment. The latter is the common base class for the Java and Scala variants, which is why the Java-specific toDataSet(Table, Class) method cannot be resolved on it.
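For reference, here is a minimal sketch of the working combination on Flink 1.4 (it only re-uses the names from your snippet, such as the registered table "table1"):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment; // the Java-specific variant
import org.apache.flink.types.Row;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnvironment = TableEnvironment.getTableEnvironment(env);

Table table1 = tableEnvironment.scan("table1");
// toDataSet(Table, Class) is defined on the Java BatchTableEnvironment,
// so with the import above the IDE should resolve it
DataSet<Row> result = tableEnvironment.toDataSet(table1, Row.class);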

If you want to read a DataSet from a CSV file, you can do it like the following:
DataSet<YourType> csvInput = env.readCsvFile("hdfs:///the/CSV/file") ...
More on this: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/batch/#data-sources
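For example, a short self-contained sketch along those lines (the delimiter and field types here are placeholders; adjust them to your file):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// read two columns of the CSV file as (String, Double) tuples
DataSet<Tuple2<String, Double>> csvInput = env
        .readCsvFile("hdfs:///the/CSV/file")
        .fieldDelimiter(",")
        .types(String.class, Double.class);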

Related

Add a new column and row to existing CSV data in Apache Camel

I am trying to add a new column and row/content to existing CSV data, but I am unable to achieve this in Apache Camel.
I have used the camel-csv component in the code; below is the snippet for the same.
<unmarshal>
<csv delimiter="|" useMaps="true" lazyLoad="true" />
</unmarshal>
When unmarshalling, I get "org.apache.camel.dataformat.csv.CsvUnmarshaller$CsvIterator" as the body's class name, but I am unable to get the exchange body or cast it to any type, since this class is abstract.
Please let me know whether the bean component can be used here, and how to add the column and content to the existing CSV data.
I can suggest an alternative solution. You can use BeanIO Data Format.
For example:
BeanIODataFormat dataFormat = new BeanIODataFormat("classpath:beanio/mappings.xml", "ContactsCSV");
from("direct:convert-to-csv")
.marshal(dataFormat)
.to("file:xxxx")
.end();
You can check the data format details in the docs; there is an examples file in there as well.
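To make that more concrete, here is a rough sketch of how the full route could look for your use case (the Contact bean, its setNewColumn setter, and the mapping file name are assumptions for illustration, not part of BeanIO itself):

import java.util.List;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.beanio.BeanIODataFormat;

public class AddColumnRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        BeanIODataFormat dataFormat =
                new BeanIODataFormat("classpath:beanio/mappings.xml", "ContactsCSV");

        from("file:input")
            .unmarshal(dataFormat)                // CSV rows -> List of Contact beans
            .process(exchange -> {
                List<Contact> contacts = exchange.getIn().getBody(List.class);
                for (Contact c : contacts) {
                    c.setNewColumn("some value"); // assumed setter for the extra column
                }
            })
            .marshal(dataFormat)                  // beans -> CSV, including the new column
            .to("file:output");
    }
}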

Flink: TypeExtractor complains about protobuf class even though a ProtobufSerializer is registered for it

Using Flink 1.12.
My protobuf is
syntax = "proto3";
package flink.protobuf;
message TimestampedMessage {
int64 timeMs = 1;
string message = 2;
}
and tried to use it like so
final var env = StreamExecutionEnvironment.createLocalEnvironment();
env.getConfig().registerTypeWithKryoSerializer(TimestampedMessage.class, ProtobufSerializer.class);
env.fromCollection(new EventsIter(), TimestampedMessage.class)
...
But the logs show this
flink.protobuf.Test$TimestampedMessage does not contain a setter for field timeMs_
2021-08-12 06:38:19,940 INFO org.apache.flink.api.java.typeutils.TypeExtractor Class class flink.protobuf.Test$TimestampedMessage cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance.
Seems like it is using the ProtobufSerializer despite the warning.
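One way to check that reading of the log (a sketch, not from the original post; it assumes TimestampedMessage is the class generated from the .proto above and that chill-protobuf's ProtobufSerializer is on the classpath): the INFO line only reports the fallback from POJO to GenericType, and the Kryo-based serializer built for that generic type is what picks up the registered ProtobufSerializer.

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.java.typeutils.GenericTypeInfo;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import com.twitter.chill.protobuf.ProtobufSerializer;
import flink.protobuf.Test.TimestampedMessage; // generated from the .proto above

public class SerializerCheck {
    public static void main(String[] args) {
        ExecutionConfig config = new ExecutionConfig();
        config.registerTypeWithKryoSerializer(TimestampedMessage.class, ProtobufSerializer.class);

        // Not a POJO, so the extractor falls back to GenericTypeInfo (what the INFO line reports)
        TypeInformation<TimestampedMessage> typeInfo = TypeExtractor.getForClass(TimestampedMessage.class);
        System.out.println(typeInfo instanceof GenericTypeInfo); // true

        // The serializer created for the generic type is Kryo-based and uses the registration above
        TypeSerializer<TimestampedMessage> serializer = typeInfo.createSerializer(config);
        System.out.println(serializer.getClass().getSimpleName()); // KryoSerializer
    }
}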

Parsing data from Kafka in Apache Flink

I am just getting started with Apache Flink (Scala API), and my issue is the following:
I am trying to stream data from Kafka into Apache Flink, based on an example from the Flink site:
val stream =
env.addSource(new FlinkKafkaConsumer09("testing", new SimpleStringSchema() , properties))
Everything works correctly, the stream.print() statement displays the following on the screen:
2018-05-16 10:22:44 AM|1|11|-71.16|40.27
I would like to use a case class in order to load the data. I've tried using
flatMap(p => p.split("|"))
but it only splits the data one character at a time.
Basically, the expected result is to be able to populate the 5 fields of the case class as follows:
field(0)=2018-05-16 10:22:44 AM
field(1)=1
field(2)=11
field(3)=-71.16
field(4)=40.27
but it's now doing:
field(0) = 2
field(1) = 0
field(3) = 1
field(4) = 8
etc...
Any advice would be greatly appreciated.
Thank you in advance
Frank
The problem is the usage of String.split. If you call it with a String, then the method expects it to be a regular expression. Thus, p.split("\\|") would be the correct regular expression for your input data. Alternatively, you can also call the split variant where you specify the separating character p.split('|'). Both solutions should give you the desired result.
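To illustrate the difference (shown in plain Java, since Scala's String is java.lang.String and split(String) behaves the same there):

String line = "2018-05-16 10:22:44 AM|1|11|-71.16|40.27";

// "|" is a regular expression that matches the empty string,
// so this splits between every character
String[] wrong = line.split("|");

// escaping the pipe splits on the literal '|' character
String[] right = line.split("\\|");

System.out.println(right.length); // 5
System.out.println(right[0]);     // 2018-05-16 10:22:44 AM
System.out.println(right[4]);     // 40.27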

Merge results of ExecuteSQL processor with Json content in nifi 6.0

I am dealing with json objects containing geo coordinate points. I would like to run these points against a postgis server I have locally to assess point in polygon matching.
I'm hoping to do this with preexisting processors - I am successfully extracting the lat/lon coordinates into attributes with an "EvaluateJsonPath" processor, and successfully issuing queries to my local postgis datastore with "ExecuteSQL". This leaves me with avro responses, which I can then convert to JSON with the "ConvertAvroToJSON" processor.
I'm having conceptual trouble with how to merge the results of the query back together with the original JSON object. As it is, I've got two flow files with the same fragment ID, which I could theoretically merge together with "mergecontent", but that gets me:
{"my":"original json", "coordinates":[47.38, 179.22]}{"polygon_match":"a123"}
Are there any suggested strategies for merging the results of the SQL query into the original json structure, so my result would be something like this instead:
{"my":"original json", "coordinates":[47.38, 179.22], "polygon_match":"a123"}
I am running NiFi 6.0, PostgreSQL 9.5.2, and PostGIS 2.2.1.
I saw some reference to using the ReplaceText processor in https://community.hortonworks.com/questions/22090/issue-merging-content-in-nifi.html - but that seems to be about merging content from an attribute into the body of the content. What I'm missing is how to merge the content of the original flow file with either the content of the SQL response, or with attributes extracted from the SQL response (leaving its content out).
Edit:
The following Groovy script appears to do what is needed. I am not a Groovy coder, so any improvements are welcome.
import org.apache.commons.io.IOUtils
import java.nio.charset.*
import groovy.json.JsonSlurper

def flowFile = session.get();
if (flowFile == null) {
    return;
}
def slurper = new JsonSlurper()
flowFile = session.write(flowFile,
    { inputStream, outputStream ->
        def text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def obj = slurper.parseText(text)
        def originaljsontext = flowFile.getAttribute('original.json')
        def originaljson = slurper.parseText(originaljsontext)
        originaljson.put("point_polygon_info", obj)
        outputStream.write(groovy.json.JsonOutput.toJson(originaljson).getBytes(StandardCharsets.UTF_8))
    } as StreamCallback)
session.transfer(flowFile, ExecuteScript.REL_SUCCESS)
If your original JSON is relatively small, a possible approach might be the following...
Use ExtractText before getting to ExecuteSQL to copy the original JSON into an attribute.
After ExecuteSQL, and after ConvertAvroToJSON, use an ExecuteScript processor to create a new JSON document that combines the original from the attribute with the results in the content.
I'm not exactly sure what needs to be done in the script, but I know others have had success using Groovy and JsonSlurper through the ExecuteScript processor.
http://groovy-lang.org/json.html
http://docs.groovy-lang.org/latest/html/gapi/groovy/json/JsonSlurper.html

How to perform this GqlQuery?

I have the datastore model as follows:
class Data(db.Model):
    project = db.StringProperty()
    project_languages = db.ListProperty(str, default=[])
When a user inputs a language (input_language), I want to output all the projects which contain that language in their language list (project_languages).
I tried to do it in the way below,
db.GqlQuery("SELECT * FROM Data WHERE input_language IN project_languages")
but got an error saying:
BadQueryError: Parse Error: Invalid WHERE Condition
What should be my query, if I want to get the data in the above mentioned way?
I'm not sure if you are using Python for the job. If so, I highly recommend you use the ndb library for datastore queries. The solution is as easy as Data.query(A.IN(B)), where A would be the repeated property (Data.project_languages) and B the list of values to match (e.g. [input_language]).
