How to read a JSON file in Apache Flink using Java?
I am not able to find any proper code to read a JSON file in Flink using Java and do some transformation on top of it.
Any suggestions or code would be highly appreciated.
For using Kafka with the DataStream API, see https://stackoverflow.com/a/62072265/2000823. The idea is to implement an appropriate DeserializationSchema, or KafkaDeserializationSchema. There's an example (and pointers to more) in the answer I've linked to above.
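For illustration, here is a minimal sketch of such a DeserializationSchema, assuming a hypothetical Event POJO that matches your JSON and Jackson for parsing (neither is prescribed by the linked answer):
import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

// Parses each Kafka record's JSON bytes into an Event POJO with Jackson.
public class EventDeserializationSchema implements DeserializationSchema<Event> {

    // DeserializationSchema instances are serialized and shipped to the cluster,
    // so the ObjectMapper is created lazily rather than serialized.
    private transient ObjectMapper mapper;

    @Override
    public Event deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        return mapper.readValue(message, Event.class);
    }

    @Override
    public boolean isEndOfStream(Event nextElement) {
        return false; // an unbounded Kafka stream never ends based on content
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeInformation.of(Event.class);
    }
}
You would then pass an instance of this schema to the Kafka consumer's constructor.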
Or if you want to use the Table API or SQL, it's easier. You can configure this with a bit of DDL. For example:
CREATE TABLE minute_stats (
  `minute` TIMESTAMP(3),
  `currency` STRING,
  `revenueSum` DOUBLE,
  `orderCnt` BIGINT,
  WATERMARK FOR `minute` AS `minute` - INTERVAL '10' SECOND
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'minute_stats',
  'connector.properties.zookeeper.connect' = 'not-needed',
  'connector.properties.bootstrap.servers' = 'kafka:9092',
  'connector.startup-mode' = 'earliest-offset',
  'format.type' = 'json'
);
For trying things out locally while reading from a file, you'll need to do things differently. Something like this:
DataStreamSource<String> rawInput = env.readFile(
    new TextInputFormat(new Path(fileLocation)), fileLocation);

DataStream<Event> events = rawInput.flatMap(new MyJSONTransformer());
where MyJSONTransformer might use a Jackson ObjectMapper to convert the JSON into some convenient Event type (a POJO).
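For example, a minimal sketch of what MyJSONTransformer might look like (Event is a hypothetical POJO, and silently dropping malformed lines is just one option):
import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

// Converts each JSON line into an Event POJO; malformed lines are skipped.
public class MyJSONTransformer implements FlatMapFunction<String, Event> {

    // ObjectMapper is thread-safe once configured; a static instance avoids
    // serializing it with the function
    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public void flatMap(String line, Collector<Event> out) {
        try {
            out.collect(MAPPER.readValue(line, Event.class));
        } catch (IOException e) {
            // skip malformed lines; alternatively, route them to a side output
        }
    }
}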
Related
I'm trying to write arrays into HBase using Hadoop's ArrayWritable class. I'm serializing a List<String> into an ArrayWritable and then using WritableUtils.toByteArray to get the bytes. While reading, I reconstruct the List from the ArrayWritable. I have added the code for these operations below. I'm able to correctly read/write the List from/to the HBase DB.
However, when we try to create an external table on top of the written data, we're not able to see the stored arrays in the expected format. The list of strings appears as one concatenated string without any separator, and certainly not as an array of strings.
PROBLEM: We need the HBase arrays to be visible in Hive in a usable format. We'll also be exporting this data to Redshift for querying, and hence need something workable. How can I modify my read/write approach so that it works for both the Java application and Hive/Redshift?
Method used for serialization and de-serialization:
public static ArrayWritable getWritableFromArray(List<String> stringList) {
    Writable[] writables = new Writable[stringList.size()];
    for (int i = 0; i < writables.length; i++) {
        writables[i] = new Text(stringList.get(i));
    }
    return new ArrayWritable(Text.class, writables);
}

public static List<String> getListFromWritable(ArrayWritable arrayWritable) {
    Writable[] writables = arrayWritable.get();
    List<String> list = new ArrayList<>(writables.length);
    for (Writable writable : writables) {
        list.add(((Text) writable).toString());
    }
    return list;
}
Method for creating Hive table:
create external table testdata(
  uid bigint,
  city array<string>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, d:city")
TBLPROPERTIES("hbase.table.name" = "testdata");
Querying from Hive table returns this:
select * from testdata;
OK
23821975838576221 ["\u0000\u0000\u0000\u0001\u0006raipur\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"]
808554262192775221 ["\u0000\u0000\u0000","\u0006indore\u0006raipur\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"]
2361689275801875221 ["\u0000\u0000\u0000","\u0006indore\u0006raipur\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"]
4897772799782875221 ["\u0000\u0000\u0000","\nchandigarh\u0006indore\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"]
When this data is exported to Redshift, the cities appear concatenated, as
indoreraipur
chandigarhindore
How can I fix this? Is trying to write a List directly into HBase a bad idea? Should I try to manually serialize and deserialize it and write it as a String instead of an Array type?
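As a hedged sketch of the "write it as a String" alternative mentioned above (not from the original post): one option is to join the list with Hive's default collection-item delimiter '\002' and store plain UTF-8 bytes, so that the array<string> mapping can split the cell back apart and the exported values remain readable.
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical replacement for the ArrayWritable-based serialization above.
// '\u0002' (Ctrl-B) is Hive's default collection-item delimiter.
public static byte[] toHiveArrayBytes(List<String> stringList) {
    return String.join("\u0002", stringList).getBytes(StandardCharsets.UTF_8);
}

public static List<String> fromHiveArrayBytes(byte[] bytes) {
    String joined = new String(bytes, StandardCharsets.UTF_8);
    return Arrays.asList(joined.split("\u0002"));
}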
Below is the pseudocode of my stream processing:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream stream = env.addSource(...)
    .map(/* map to java object */)
    .filter(/* filter for specific type of events */)
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor(Time.seconds(2)) {})
    .timeWindowAll(Time.seconds(10));

// collect all records of each window
DataStream windowedStream = stream.apply(new AllWindowFunction(...));
DataStream processedStream = windowedStream.keyBy(...).reduce(...);

String outputPath = "";
final StreamingFileSink sink = StreamingFileSink.forRowFormat(...).build();
processedStream.addSink(sink);
The above code flow creates multiple files, and each file seems to contain records from different windows. For example, the records in each file have timestamps that range over 30-40 seconds, whereas the window size is only 10 seconds.
My expected output is that each window's data is written to a separate file.
Any references or input on this would be of great help.
Take a look at the BucketAssigner interface. It should be flexible enough to meet your needs. You just need to make sure that your stream events contain enough information to determine the path you want them written to.
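For instance, here is a minimal sketch of a custom BucketAssigner keyed on the event timestamp (MyEvent and its getTimestamp() accessor, returning epoch milliseconds, are assumptions, not part of the question):
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

// Assigns every record to a bucket (output subdirectory) named after the
// start of the 10-second window its timestamp falls into.
public class WindowBucketAssigner implements BucketAssigner<MyEvent, String> {

    private static final long WINDOW_SIZE_MS = 10_000L;

    @Override
    public String getBucketId(MyEvent element, Context context) {
        long ts = element.getTimestamp();
        long windowStart = ts - (ts % WINDOW_SIZE_MS);
        return "window-" + windowStart;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}
Wired into a sink like the one in the pseudocode, that would look roughly like:
StreamingFileSink<MyEvent> sink = StreamingFileSink
        .forRowFormat(new Path(outputPath), new SimpleStringEncoder<MyEvent>("UTF-8"))
        .withBucketAssigner(new WindowBucketAssigner())
        .build();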
I am using the DataSet API with Flink and I am trying to partition Parquet files by a key in my POJO, e.g. date. The end goal is to write my files out using the following directory structure.
/output/
    20180901/
        file.parquet
    20180902/
        file.parquet
Flink provides a convenience class to wrap AvroParquetOutputFormat, as shown below, but I don't see any way to provide a partitioning key.
HadoopOutputFormat<Void, Pojo> outputFormat =
        new HadoopOutputFormat(new AvroParquetOutputFormat(), Job.getInstance());
I'm trying to figure out the best way to proceed. Do I need to write my own version of AvroParquetOutputFormat that extends Hadoop's MultipleOutputs type, or can I leverage the Flink APIs to do this for me?
The equivalent in Spark would be:
df.write.partitionBy('date').parquet('base path')
You can use the BucketingSink<T> sink to write data into partitions that you define by supplying an instance of the Bucketer interface. See the DateTimeBucketer for an example.
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/DateTimeBucketer.java
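For instance, a minimal sketch of a date-based Bucketer for the directory layout above (the getDate() accessor on the Pojo, returning a string like "20180901", is an assumption, not from the original answer):
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.hadoop.fs.Path;

// Routes each record into a subdirectory named after its date field,
// producing /output/20180901/..., /output/20180902/..., and so on.
public class DatePartitionBucketer implements Bucketer<Pojo> {

    @Override
    public Path getBucketPath(Clock clock, Path basePath, Pojo element) {
        return new Path(basePath, element.getDate());
    }
}
It would be attached to the sink roughly like this:
BucketingSink<Pojo> sink = new BucketingSink<>("/output");
sink.setBucketer(new DatePartitionBucketer());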
I was trying to use the Table API inside a flatMap by passing the Flink env object to the flatMap object, but I was getting a serialization exception saying that I have added some field which is not serializable.
Could you please shed some light on this?
Regards,
Sajeev
You cannot pass the ExecutionEnvironment into a Function. It would be like passing Flink into Flink.
The Table API is an abstraction on top of the DataSet/DataStream APIs. If you want to use both the Table API and the lower-level API, you can use TableEnvironment#toDataSet/fromDataSet to switch between the APIs, even between DataSet operators.
DataSet<Integer> ds = env.fromElements(1, 2, 3);
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
Table t = tEnv.fromDataSet(ds, "intCol"); // continue in Table API
Table t2 = t.select("intCol.cast(STRING)"); // do something with table
DataSet<String> ds2 = tEnv.toDataSet(t2, String.class); // continue in DataSet API
What are the possible ways of storing a large data file (CSV files around 1 GB) in a SQL database and streaming that data from the database over WCF to the client (without fetching the complete data into memory)?
I think there are a few issues to take into account here:
The size of the data you actually want to return
The structure of that data (or lack thereof)
The place to store that data behind your NLB
Returning that data to the consumer
From your question, it sounds like you want to store 1 GB of structured (CSV) data and stream it to the client. If you really are generating and then serving a 1 GB file (and don't have much metadata around it), I'd go for an FTP/SFTP server (or perhaps a network file share, which can certainly be secured in a variety of ways).
If you need to store metadata about the file that goes beyond its file name/create time/location, then SQL might be a good option, assuming you could do one of the following:
Store the CSV data in tabular format in the database
Use FILESTREAM and store the file itself
Here is a decent primer on FILESTREAM from SimpleTalk. You could then use the SqlFileStream to help stream the data from the file itself (and SQL Server will help maintain transactional consistency for you, which you may or may not want), an example of which is present in the documentation. The relevant section is here:
private static void ReadFilestream(SqlConnectionStringBuilder connStringBuilder)
{
    using (SqlConnection connection = new SqlConnection(connStringBuilder.ToString()))
    {
        connection.Open();

        SqlCommand command = new SqlCommand(
            "SELECT TOP(1) Photo.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() FROM employees",
            connection);

        SqlTransaction tran = connection.BeginTransaction(IsolationLevel.ReadCommitted);
        command.Transaction = tran;

        using (SqlDataReader reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                // Get the pointer for the file
                string path = reader.GetString(0);
                byte[] transactionContext = reader.GetSqlBytes(1).Buffer;

                // Create the SqlFileStream
                using (Stream fileStream = new SqlFileStream(path, transactionContext, FileAccess.Read, FileOptions.SequentialScan, allocationSize: 0))
                {
                    // Read the contents as bytes and write them to the console
                    for (long index = 0; index < fileStream.Length; index++)
                    {
                        Console.WriteLine(fileStream.ReadByte());
                    }
                }
            }
        }
        tran.Commit();
    }
}
Alternatively, if you do choose to store it in tabular format, you can use typical SqlDataReader methods, or perhaps some combination of bcp and .NET helpers.
You should be able to combine that last link with Microsoft's remarks on streaming large data over WCF to get the desired result.