Apache Flink: Why do reduce or groupReduce transformations not operate in parallel? - apache-flink

For example:
DataSet<Tuple1<Long>> input = env.fromElements(
        Tuple1.of(1L), Tuple1.of(2L), Tuple1.of(3L), Tuple1.of(4L), Tuple1.of(5L),
        Tuple1.of(6L), Tuple1.of(7L), Tuple1.of(8L), Tuple1.of(9L));
DataSet<Tuple1<Long>> sum = input.reduce(new ReduceFunction<Tuple1<Long>>() {
    public Tuple1<Long> reduce(Tuple1<Long> value1, Tuple1<Long> value2) {
        return new Tuple1<>(value1.f0 + value2.f0);
    }
});
If the above reduce transformation is not a parallel operation, do I need to add the two additional transformations 'partitionByHash' and 'mapPartition', as below?
DataSet<Tuple1<Long>> input = env.fromElements(
        Tuple1.of(1L), Tuple1.of(2L), Tuple1.of(3L), Tuple1.of(4L), Tuple1.of(5L),
        Tuple1.of(6L), Tuple1.of(7L), Tuple1.of(8L), Tuple1.of(9L));
DataSet<Tuple1<Long>> sum = input
    .partitionByHash(0)
    .mapPartition(new MapPartitionFunction<Tuple1<Long>, Tuple1<Long>>() {
        public void mapPartition(Iterable<Tuple1<Long>> values, Collector<Tuple1<Long>> out) {
            long sum = getSum(values);
            out.collect(new Tuple1<>(sum));
        }
    })
    .reduce(new ReduceFunction<Tuple1<Long>>() {
        public Tuple1<Long> reduce(Tuple1<Long> value1, Tuple1<Long> value2) {
            return new Tuple1<>(value1.f0 + value2.f0);
        }
    });
And why is the result of the reduce transformation still an instance of DataSet and not an instance of Tuple1<Long>?

Both reduce and reduceGroup are group-wise operations and are applied to groups of records. If you do not specify a grouping key using groupBy, all records of the data set belong to the same group. Therefore, there is only a single group and the final result of reduce and reduceGroup cannot be computed in parallel.
If the reduce transformation is combinable (which is true for any ReduceFunction and all combinable GroupReduceFunctions), Flink can apply combiners in parallel.
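For the GroupReduce case, one way to make the function combinable (a sketch assuming a DataSet API version that provides GroupCombineFunction; the class name SumGroupReduce is made up for illustration) is to implement GroupCombineFunction alongside GroupReduceFunction, so Flink can pre-aggregate partial sums in parallel before the final single-group reduce:
public static class SumGroupReduce
        implements GroupReduceFunction<Tuple1<Long>, Tuple1<Long>>,
                   GroupCombineFunction<Tuple1<Long>, Tuple1<Long>> {

    @Override
    public void reduce(Iterable<Tuple1<Long>> values, Collector<Tuple1<Long>> out) {
        long sum = 0;
        for (Tuple1<Long> value : values) {
            sum += value.f0;
        }
        out.collect(new Tuple1<>(sum));
    }

    @Override
    public void combine(Iterable<Tuple1<Long>> values, Collector<Tuple1<Long>> out) {
        // Runs on each parallel task; the partial sums are reduced once more afterwards.
        reduce(values, out);
    }
}
It would then be used as input.reduceGroup(new SumGroupReduce()).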

Two answers to your two questions:
(1) Why is reduce() not parallel?
Fabian gave a good explanation. The operations are parallel if applied per key; otherwise, only the pre-aggregation is parallel.
In your second example, you make it parallel by introducing a key. Instead of your complex workaround with "mapPartition()", you can also simply write (Java 8 style):
DataSet<Tuple1<Long>> input = ...;
input.groupBy(0).reduce((a, b) -> new Tuple1<>(a.f0 + b.f0));
Note however that your input data is so small that there will be only one parallel task anyway. You can see parallel pre-aggregation if you use a larger input, such as:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(10);
DataSet<Long> input = env.generateSequence(1, 100000000);
DataSet<Long> sum = input.reduce ( (a, b) -> a + b );
(2) Why is the result of a reduce() operation still a DataSet?
A DataSet is still a lazy representation of the result in the cluster. You can continue to use that data in the parallel program without triggering a computation and fetching the result data back from the distributed workers to the driver program. That allows you to write larger programs that run entirely on the distributed workers and are lazily executed. No data is ever fetched to the client and re-distributed to the parallel workers.
Especially in iterative programs, this is very powerful, as the entire loop runs without ever involving the client or needing to re-deploy operators.
You can always get the actual value by calling dataSet.collect().get(0), which makes it explicit that something should be executed and fetched.
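For example, continuing from the sum data set above, a small illustrative snippet (not part of the original answer) that triggers execution and brings the single result back to the client:
// collect() submits and runs the program, then ships the result to the client
List<Tuple1<Long>> result = sum.collect();
Long total = result.get(0).f0;   // the actual value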

Related

Apache Flink & Iceberg: Not able to process hundreds of RowData types

I have a Flink application that reads arbitrary Avro data, maps it to RowData and uses several FlinkSink instances to write the data into Iceberg tables. By arbitrary data I mean that I have 100 types of Avro messages, all of them with a common property "tableName" but containing different columns. I would like to write each of these types of messages into a separate Iceberg table.
To do this I'm using side outputs: once I have my data mapped to RowData, I use a ProcessFunction to write each message to a specific OutputTag.
Later on, with the datastream already processed, I loop over the different output tags, get the records using getSideOutput and create a specific IcebergSink for each of them. Something like:
final List<OutputTag<RowData>> tags = ... // list of all possible output tags

final SingleOutputStreamOperator<RowData> rowData = stream
        .map(new ToRowDataMap())                  // map custom Avro POJO into RowData
        .uid("map-row-data")
        .name("Map to RowData")
        .process(new ProcessRecordFunction(tags)) // route each element to its specific OutputTag
        .uid("id-process-record")
        .name("Process Input records");

CatalogLoader catalogLoader = ...
String upsertField = ...

tags.forEach(tag -> {
    DataStream<RowData> outputStream = rowData.getSideOutput(tag);
    TableIdentifier identifier = TableIdentifier.of("myDBName", tag.getId());
    FlinkSink.Builder builder = FlinkSink
            .forRowData(outputStream)
            .table(catalog.loadTable(identifier))
            .tableLoader(TableLoader.fromCatalog(catalogLoader, identifier))
            .set("upsert-enabled", "true")
            .uidPrefix("committer-sink-" + tag.getId())
            .equalityFieldColumns(Collections.singletonList(upsertField));
    builder.append();
});
It works very well when I'm dealing with a few tables. But when the number of tables scales up, Flink cannot acquire enough task resources, since each sink requires two different operators (because of the internals of https://iceberg.apache.org/javadoc/0.10.0/org/apache/iceberg/flink/sink/FlinkSink.html).
Is there any other, more efficient way of doing this? Or maybe some way of optimizing it?
Thanks in advance ! :)
Given your question, I assume that about half of your operators are IcebergStreamWriters, which are fully utilised, and the other half are IcebergFilesCommitters, which are rarely used.
You can optimise the resource usage of the servers by:
Increasing the number of slots on the TaskManagers (taskmanager.numberOfTaskSlots) [1], so the CPU not utilised by the idle IcebergFilesCommitter operators is then used by the other operators on the TaskManager
Increasing the resources provided to the TaskManagers (taskmanager.memory.process.size) [2]; this helps by distributing the JVM memory overhead between the operators running on this TaskManager (do not forget to increase the number of slots along with this change to start using the extra resources :) ) - see the example configuration below
The possible downside of adding more slots on the TaskManagers is that operators may compete for CPU, and the memory is still reserved for the "idle" tasks. [3]
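For reference, these are the two options as they appear in flink-conf.yaml; the values below are purely illustrative and need to be tuned to your cluster:
# illustrative values only - tune to your hardware and workload
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.process.size: 8g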
Maybe this Flink architecture could be useful too [4]
I hope this helps,
Peter

Reuse of a stream: is it a copy of the stream or not?

For example, there is a keyed stream:
val keyedStream: KeyedStream[event, Key] = env
.addSource(...)
.keyBy(...)
// several transformations on the same stream
keyedStream.map(....)
keyedStream.window(....)
keyedStream.split(....)
keyedStream...(....)
I think this is reusing the same stream in Flink. What I found is that when I reuse it, the content of the stream is not affected by the other transformations, so I think it is a copy of the same stream.
But I don't know whether that is right or not.
If yes, will this use a lot of resources (which resources?) to keep the copies?
A DataStream (or KeyedStream) on which multiple operators are applied replicates all outgoing messages. For instance, if you have a program such as:
val keyedStream: KeyedStream[event, Key] = env
.addSource(...)
.keyBy(...)
val stream1: DataStream = keyedStream.map(new MapFunc1)
val stream2: DataStream = keyedStream.map(new MapFunc2)
The program is executed as
         /-hash-> Map(MapFunc1) -> ...
Source >-<
         \-hash-> Map(MapFunc2) -> ...
The source replicates each record and sends it to both downstream operators (MapFunc1 and MapFunc2). The type of the operators (in our example Map) does not matter.
The cost of this is sending each record twice over the network. If all receiving operators have the same parallelism it could be optimized by sending each record once and duplicating it at the receiving task manager, but this is currently not done.
You can manually optimize the program by adding a single receiving operator (e.g., an identity Map operator) and another keyBy from which you fork to the multiple receivers. This will not result in a network shuffle because all records are already local. All operators must have the same parallelism, though.
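A rough Java sketch of that structure (the Event and Key types, the getKey() selector, the EventSource and the result types are placeholders for whatever your program actually uses):
// Ship each record over the network once, into a single identity operator.
DataStream<Event> forwarded = env
        .addSource(new EventSource())          // placeholder source
        .keyBy(event -> event.getKey())
        .map(event -> event);                  // identity "receiving" map

// Key again; the records are already partitioned by the same key and the
// parallelism is unchanged, so no additional network shuffle happens.
KeyedStream<Event, Key> rekeyed = forwarded.keyBy(event -> event.getKey());

// Fork locally to the actual consumers.
DataStream<ResultA> streamA = rekeyed.map(new MapFunc1());
DataStream<ResultB> streamB = rekeyed.map(new MapFunc2());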

Flink DataSet program runs several jobs

I have the following code in Apache Flink.
When I execute it, some parts of my code are run twice.
Could someone let me know why that is happening?
DataSet input1 = ...
DataSet input2 = ...
List mappedInput1 = input1
        .map(...)
        .collect();
DataSet data = input1
        .union(input1.filter(...))
        .mapPartition(...);
data = data.union(data2).distinct();
data.flatMap(new MapFunc1(data.collect()));
data
        .flatMap(new MapFunc2(input2.collect()))
        .groupBy(0)
        .sum(1)
        .print();
Every collect() and print() statement eagerly triggers an execution and fetches the result to the client code. Each such call traces the whole program back to the data sources.
Your code contains three collect() and one print() statement. Hence, four individual programs are submitted and executed. Instead of using collect(), you should have a look at broadcast variables. Broadcast variables distribute a DataSet to each parallel instance of an operator. The computation and distribution happen in the same program and are not routed through the client program. Instead, the data is exchanged directly between the workers running the operators.
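A minimal sketch of that pattern (the names smallSet and "broadcastSet", and the Tuple2<Long, Long> element type, are made up for illustration): the small DataSet is attached to an operator with withBroadcastSet and read in the rich function's open() method, so it never travels through the client.
DataSet<Long> smallSet = ...;            // the data you would otherwise collect()
DataSet<Tuple2<Long, Long>> data = ...;  // the main data set

data.flatMap(new RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
        private List<Long> broadcast;

        @Override
        public void open(Configuration parameters) {
            // Fetched once per parallel task instance, directly between workers.
            broadcast = getRuntimeContext().getBroadcastVariable("broadcastSet");
        }

        @Override
        public void flatMap(Tuple2<Long, Long> value, Collector<Tuple2<Long, Long>> out) {
            // Use the broadcast list here instead of a collect()-ed list.
            out.collect(value);
        }
    })
    .withBroadcastSet(smallSet, "broadcastSet")
    .print();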

Inconsistency in App Engine datastore vs what I know it should be from parsing the same data source locally

This may be a trivial question, but I was just hoping to get some practical experience from people who may know more about this than I do.
I wanted to generate a database in GAE from a very large series of XML files -- as a form of validation, I am calculating statistics on the GAE datastore, and I know there should be ~16,000 entities, but when I perform a count, I'm getting something on the order of 12,000.
The way I'm counting is basically: I perform a filter, fetch a page of 1,000 entities, and then spin up a task for each entity (using its key). Each task then adds "1" to a counter that I'm storing.
I think I may have pushed the datastore writes too hard; I set the rate of my task queues to 50/s. I did get some write errors, but not nearly enough to account for the 4,000 difference. Could it be that I was rushing the counting calls so much that it led to inconsistency? Would slowing the rate at which I process task queues to something like 5/s solve the problem? Thanks.
You can count your entities very easily (no tasks and almost for free):
int total = 0;
Query q = new Query("entity_kind").setKeysOnly();
// set your filter on this query
QueryResultList<Entity> results;
Cursor cursor = null;
FetchOptions queryOptions = FetchOptions.Builder.withLimit(1000).chunkSize(1000);
do {
    if (cursor != null) {
        queryOptions.startCursor(cursor);
    }
    results = datastore.prepare(q).asQueryResultList(queryOptions);
    total += results.size();
    cursor = results.getCursor();
} while (results.size() == 1000);
System.out.println("Total entities: " + total);
UPDATE:
If looping like I suggested takes too long, you can spin a task for every 100/500/1000 entities - it's definitely more efficient than creating a task for each entity. Even very complex calculations should take milliseconds in Java if done right.
For example, each task can retrieve a batch of entities, spin a new task (and pass a query cursor to this new task), and then proceed with your calculations.
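A hedged sketch of that chaining, reusing the query from the snippet above (the /tasks/count URL, the "cursor" parameter name and the 500-entity batch size are made-up details; Cursor, FetchOptions, QueueFactory and TaskOptions are the standard App Engine APIs):
// Runs inside a task handler; webSafeCursor comes from the task's "cursor" parameter.
Query q = new Query("entity_kind").setKeysOnly();
FetchOptions options = FetchOptions.Builder.withLimit(500);
if (webSafeCursor != null) {
    options.startCursor(Cursor.fromWebSafeString(webSafeCursor));
}

QueryResultList<Entity> batch = datastore.prepare(q).asQueryResultList(options);
// ... update the stored counter by batch.size() and run any other calculations here ...

if (batch.size() == 500) {
    // Chain the next batch by enqueueing a new task carrying the current cursor.
    QueueFactory.getDefaultQueue().add(TaskOptions.Builder
            .withUrl("/tasks/count")
            .param("cursor", batch.getCursor().toWebSafeString()));
}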

key-value store for time series data?

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more than a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using something like a key-value store such as MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mention of using such a database for time series data. Ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one data point per day (first, last, closest to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
It seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance... Does anyone have any experience building a system like this, with a key-value store, HDF5, or something else? Or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single database.)
How to do your three queries:
retrieve all the data for object XYZ between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one data point per day (first, last, closest to time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
retrieve all data for all objects for a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However, if your application requires the speed, it might be worth looking into.
There is an open source time series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas.
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
    file.UniqueIndexes = true;   // enforces index uniqueness
    file.InitializeNewFile();    // create file and write header
    file.AppendData(data);       // append data (stream of ArraySegment<>)
}

// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
    // Enumerate one item at a time, maximum 10 items, starting at 2011-1-1
    // (can also get one segment at a time with StreamSegments)
    foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
        Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1-minute bar format for the symbol in question in a space-delimited file which has roughly 80,000 1-minute bar readings. The code to load and parse from disk was under 1 ms. The code to calculate a 100-minute SMA for every period in the file was 530 ms. I can pull any slice I want from the SMA sequence in under 1 ms once it is calculated. I am just learning F#, so there are probably ways to optimize. Note this was after multiple test runs, so the data was already in the Windows cache, but even when loaded from disk it never adds more than 15 ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with a \n delimiter, and it generally takes less than 0.5 ms to load and parse when it is in the Windows file cache. Simple iteration across the full time series to return the set of records inside a date range is a sub-3 ms operation with a full year of 1-minute bars. I also keep the daily bars in a separate file, which loads even faster because of the lower data volumes.
I use the .NET 4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series, and with a couple of gigs of RAM dedicated to the cache I get nearly a 100% cache hit rate, so my access to any pre-computed indicator set for any symbol generally runs under 1 ms.
Pulling any slice of data I want from the indicator is typically less than 1 ms, so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1-minute bars in less than 20 ms.
// Parse a \n delimited file into RAM, then
// split each line on space into an
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
    System.IO.File.ReadAllLines(fname)
    |> Array.map (fun line -> line.Split [|' '|])

// Based on a two dimensional array,
// pull out a single column for bar
// close, convert every value
// for every row to a float,
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
    [| for aLine in tarr do
        //printfn "aLine=%A" aLine
        let closep = float(aLine.[5])
        yield closep
    |]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.
