Flink batch join performance - apache-flink

I've been testing a simple join with both the Table API and the DataStream API in batch mode. However, I've been getting pretty bad results, so I must be doing something wrong. The datasets being joined are ~900 GB and 3 GB. The test environment is EMR with 10 m5.xlarge worker nodes.
The Table API approach creates tables over the source S3 paths and runs an INSERT INTO statement into a table created over the destination S3 path. I tried tweaking task manager memory, numberOfTaskSlots, and parallelism, but I couldn't make it finish in an acceptable time (it takes at least 1.5 h).
When using the DataStream API in batch mode, I always run into YARN killing the task for using over 90% of disk space, so I'm not sure whether that's due to my code or whether Flink just needs much more disk space than Spark does.
Reading in the data streams:
val sourceStream: DataStream[SourceObj] = env
  .fromSource(source, WatermarkStrategy.noWatermarks(), "someSourceName")
  .map(x => SourceObj.convertFromString(x))
Joining:
val joinedStream = sourceStream.join(sourceStream2)
  .where(s => s.col1)   // key selectors take the element, not the stream
  .equalTo(c => c.col2)
  .window(GlobalWindows.create())
  .trigger(CountTrigger.of(1))
  .apply { (s, c) => JoinedObj(c.col1, s.col2, s.col3) }
Am I missing something, or do I just need to scale up the cluster?

In general you're better off implementing relational workloads with Flink's Table/SQL API, so that its optimizer has a chance to help out.
But if I'm reading this correctly, this particular join is going to be quite expensive to execute, because nothing is ever expired from state. Both tables will be fully materialized within Flink, because for this query every row of input remains relevant and could affect the result.
If you can convert this into some sort of join with a temporal constraint that can be used by the optimizer to free up rows that are no longer useful, then it will be much better behaved.
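To make that concrete, here is a minimal sketch of such a temporal constraint as an interval join with the Table API (Java for brevity; the table names, columns, the event-time attribute ts, and the one-hour bound are all assumptions, since the real schema isn't shown):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IntervalJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());
        // source1 and source2 are assumed to be already-registered tables with an
        // event-time attribute ts. The BETWEEN bound lets the planner expire rows
        // from state once they can no longer match, instead of keeping both sides
        // fully materialized.
        tEnv.executeSql(
            "SELECT s.col2, s.col3, c.col1 " +
            "FROM source1 s JOIN source2 c " +
            "  ON s.col1 = c.col2 " +
            " AND s.ts BETWEEN c.ts - INTERVAL '1' HOUR AND c.ts + INTERVAL '1' HOUR");
    }
}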

When you use the DataStream API in batch mode, it makes extensive use of managed memory in all shuffle/join/reduce operations. Also, as stated in the last paragraph here, Flink will spill to disk any data that it cannot fit in memory during a join.
So I assume this may be the reason for the disk-space shortage; I faced the same problem with my job.
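As a rough sketch of the usual knobs (the keys are real Flink options, but the values are guesses you'd have to tune, and on YARN you would normally set the same keys in flink-conf.yaml rather than in code): give the sort/join buffers a larger share of managed memory so less is spilled, and point the spill directories at volumes with enough free space.

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // default fraction is 0.4; a larger share means fewer spills during the join
        conf.setString("taskmanager.memory.managed.fraction", "0.6");
        // spill directories; pick volumes with headroom below YARN's 90% threshold
        conf.setString("io.tmp.dirs", "/mnt/disk1/tmp,/mnt/disk2/tmp");

        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
    }
}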

Related

How to best process large query results written to intermediate table in App Engine

We are running large query jobs where we hit the 128 MB response-size limit, and BigQuery raises the "Response too large to return. Consider setting allowLargeResults to true in your job configuration" error.
We are opting for the allowLargeResults approach to keep the already complex SQL unchanged (instead of chunking things at this level). The question is what is the best way to process the results written to the intermediate table:
Export the table to GCS, then queue tasks that process chunks of the response file using offsets into the GCS file. This introduces latency from GCS, GCS file maintenance (e.g. cleaning up files), and another point of failure (http errors/timeouts etc).
Query chunks from the intermediate tables also using queued tasks. The question here is what is the best way to chunk the rows (is there an efficient way to do this, e.g. is there an internal row number we can refer to?). We probably end up scanning the entire table for each chunk so this seems more costly than the export to GCS option.
Any experience in this area and/or recommendations?
Note that we are running on Google App Engine (Python).
Thanks!
I understand that https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list will let you read chunks of a table without running a query (and so without incurring data-processing charges).
This also lets you read the results of a query in parallel: every query is written to a temporary table id, which you can pass to this function while supplying different ranges (with startIndex, maxResults).
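The question's environment is Python on App Engine, but the ranged read is just two query parameters on the REST endpoint, so a sketch in Java shows the shape (project, dataset, table id, and page size are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TabledataListSketch {
    public static void main(String[] args) throws Exception {
        long startIndex = 0;       // first row of this worker's chunk
        long maxResults = 10_000;  // rows per page
        URL url = new URL("https://www.googleapis.com/bigquery/v2/projects/my-project"
            + "/datasets/my_dataset/tables/my_temp_table/data"
            + "?startIndex=" + startIndex + "&maxResults=" + maxResults);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Authorization", "Bearer " + System.getenv("ACCESS_TOKEN"));
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            in.lines().forEach(System.out::println); // JSON page of rows for this range
        }
    }
}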
Note that BigQuery is able to export data in chunks - and you can request as many chunks as workers you have.
From https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple:
If you ask to export to:
['gs://my-bucket/file-name.json']
you will get an export in one GCS file, as long as it's less than 1GB.
If you ask to export to:
['gs://my-bucket/file-name-*.json']
you will get several files with each having a chunk of the total export. Useful when exporting more than 1GB.
If you ask to export to:
['gs://my-bucket/file-name-1-*.json',
'gs://my-bucket/file-name-2-*.json',
'gs://my-bucket/file-name-3-*.json']
you will get exports optimized for 3 workers. Each of these patterns will receive a series of exported chunks, so each worker can focus on its own chunks.
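A sketch of the three-worker variant with the google-cloud-bigquery Java client (dataset, table, and bucket names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;
import java.util.Arrays;

public class ShardedExportSketch {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        ExtractJobConfiguration cfg = ExtractJobConfiguration.of(
            TableId.of("my_dataset", "my_temp_table"),
            Arrays.asList(
                "gs://my-bucket/file-name-1-*.json",
                "gs://my-bucket/file-name-2-*.json",
                "gs://my-bucket/file-name-3-*.json"),
            "NEWLINE_DELIMITED_JSON");
        Job job = bigquery.create(JobInfo.of(cfg));
        job.waitFor(); // afterwards, each worker reads only its own file-name-N-* chunks
    }
}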
It is not clear what exactly the reason for the "chunk processing" is. If you have some complex SQL logic that needs to run against your data and the result happens to be bigger than the current 128 MB limit, just do it with allowLargeResults anyway and then consume the result however you need to consume it.
Of course you most likely have a reason for chunking, but since it isn't stated, answering is problematic.
Another suggestion is not to ask many questions in one post; this makes answering very difficult, so you stand a good chance of not getting it answered.
Finally, my answer to the only question that is relatively clear (to me at least):
The question here is what is the best way to chunk the rows (is there an efficient way to do this, e.g. is there an internal row number we can refer to?). We probably end up scanning the entire table for each chunk so this seems more costly than the export to GCS option.
It depends on how and when your table was created!
If your table was loaded in one big load, I don't see a way of avoiding scanning the table again and again.
If the table was loaded in increments, and recently, you have a chance to take advantage of so-called table decorators (specifically, you should look at range decorators).
In the very early era of BigQuery there was an expectation of partitioned decorators, which would address a lot of users' needs, but they are still not available and I don't know what the plans for them are.
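For reference, the decorator syntax in legacy SQL looks like this (table name hypothetical; times are milliseconds since the epoch, and a leading minus means relative to now):

// Snapshot decorator: the table as it was at one point in time.
String snapshot = "SELECT * FROM [my_dataset.my_table@1445030400000]";
// Range decorator: only the data added in the last hour, avoiding a full table scan.
String lastHour = "SELECT * FROM [my_dataset.my_table@-3600000-]";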

Integration of non-parallelizable task with high memory demands in Flink pipeline

I am using Flink in a YARN cluster to process data using various sources and sinks. At some point in the topology, there is an operation that cannot be parallelized and furthermore needs access to a lot of memory. In fact, the API I am using for this step needs its input in array form. Right now, I have implemented it something like this:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Pojo> input = ...
List<Pojo> inputList = input.collect();
Pojo[] inputArray = inputList.toArray(new Pojo[0]); // plain toArray() would return Object[]
Pojo[] resultArray = costlyOperation(inputArray);
List<Pojo> resultList = Arrays.asList(resultArray);
DataSet<Pojo> result = env.fromCollection(resultList);
result.otherStuff();
This solution seems rather unnatural. Is there a straightforward way to incorporate this task into my Flink pipeline?
I have read in another thread that the collect() function should not be used for large datasets. I believe the fact that collecting the dataset into a list and then an array does not happen in parallel is not my biggest problem right now, but would you still prefer to write what I called input above to a file and build an array from that?
I have also seen the options to configure managed memory in Flink. In principle, it might be possible to tune this in a way that leaves enough heap for the expensive operation. On the other hand, I am afraid the performance of all the other operators in the topology might suffer. What is your opinion on this?
You could replace the collect() -> array -> costlyOperation -> array -> fromCollection() steps with a key-less reduce operation using a surrogate key that has the same value for all tuples, so that you get only a single partition. This would be the Flink-like way.
In your costly operation itself, implemented as a GroupReduceFunction, you will get an iterator over the data. If you do not need to access all the data "at once", you also save heap space, as you do not need to keep all the data in memory within the reduce (though this of course depends on what your costly operation computes).
As an alternative, you could also call reduce() without a preceding groupBy(). However, you do not get an iterator or an output collector, and can only compute partial aggregates. (See "Reduce" in https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations)
Using Flink-style operations has the advantage that the data is kept in the cluster. If you do collect(), the result is transferred to the client, the costly operation is executed in the client, and the result is transferred back to the cluster. Furthermore, if the input is large, Flink will automatically spill intermediate results to disk for you.
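A sketch of that key-less reduce against the DataSet API (Pojo and costlyOperation are the questioner's own types; the constant key 0 forces everything into a single group):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// ...
DataSet<Pojo> result = input
    // attach the same surrogate key to every element, so there is exactly one group
    .map(new MapFunction<Pojo, Tuple2<Integer, Pojo>>() {
        @Override
        public Tuple2<Integer, Pojo> map(Pojo p) {
            return new Tuple2<>(0, p);
        }
    })
    .groupBy(0)
    // the costly operation now runs once, inside the cluster, over the iterator
    .reduceGroup(new GroupReduceFunction<Tuple2<Integer, Pojo>, Pojo>() {
        @Override
        public void reduce(Iterable<Tuple2<Integer, Pojo>> values, Collector<Pojo> out) {
            List<Pojo> buffer = new ArrayList<>();
            for (Tuple2<Integer, Pojo> t : values) {
                buffer.add(t.f1);
            }
            for (Pojo r : costlyOperation(buffer.toArray(new Pojo[0]))) {
                out.collect(r);
            }
        }
    });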

Possible bottlenecks when inserting and updating BYTEA rows?

The project requires storing binary data in a PostgreSQL database (a project requirement). For that purpose we made a table with the following columns:
id : integer, primary key, generated by client
data : bytea, for storing client binary data
The client is a C++ program, running on Linux.
The rows must be inserted (initialized with a chunk of binary data) and after that updated (concatenating additional binary data to the data field).
Simple tests have shown that this yields better performance.
Depending on your inputs, we will make the client use concurrent threads to insert/update data (with different DB connections), or a single thread with only one DB connection.
We don't have much experience with PostgreSQL, so could you help us with some pointers concerning possible bottlenecks, and whether using multiple threads to insert data is better than using a single thread?
Thank you :)
Edit 1:
More detailed information:
there will be only one client accessing the database, using only one Linux process
database and client are on the same high-performance server, but this shouldn't matter; the client must be fast no matter the machine, without additional client configuration
we will get a new stream of data every 10 seconds; a stream provides 16,000 new bytes per 0.5 seconds (CBR, but we can use buffering and only do inserts every 4 seconds at most)
a stream will last anywhere between 10 seconds and 5 minutes
It makes very little sense that you would get better performance by inserting a row and then appending to it if you are using bytea.
PostgreSQL's MVCC design means that an UPDATE is logically equivalent to a DELETE plus an INSERT. When you insert the row and then update it, what happens is that the original tuple you inserted is marked as deleted and a new tuple is written that contains the concatenation of the old and added data.
I question your testing methodology - can you explain in more detail how you determined that insert-then-append was faster? It makes no sense.
Beyond that, I think this question is too broad as written to really say much of use. You've given no details or numbers; no estimates of binary data size, rowcount estimates, client count estimates, etc.
bytea insert performance is no different from any other insert performance tuning in PostgreSQL. All the same advice applies: batch work into transactions; use multiple concurrent sessions to insert data (but not too many; a rule of thumb is number_of_cpus + number_of_hard_drives); avoid having transactions use each other's data so you don't need UPDATE locks; use async commit and/or a commit_delay if you don't have a disk subsystem with a safe write-back cache, like a battery-backed RAID controller; etc.
Given the updated stats you provided in the main comments thread, the amount of data you want to consume sounds entirely practical with appropriate hardware and application design. Your peak load might be achievable even on a plain hard drive if you had to commit every block that came in, since it'd require about 60 transactions per second. You could use a commit_delay to achieve group commit and significantly lower fsync() overhead, or even use synchronous_commit = off if you can afford to lose a time window of transactions in case of a crash.
With a write-back caching storage device like a battery-backed cache RAID controller or an SSD with reliable power-loss-safe cache, this load should be easy to cope with.
I haven't benchmarked different scenarios for this, so I can only speak in general terms. If I were designing this myself, I'd be concerned about checkpoint stalls with PostgreSQL, and would want to make sure I could buffer a bit of data. It sounds like you can, so you should be OK.
Here's the first approach I'd test, benchmark and load-test, as it's in my view probably the most practical:
One connection per data stream, synchronous_commit = off + a commit_delay.
INSERT each 16 kB record into a staging table as it comes in (UNLOGGED or TEMPORARY, if you can afford to lose incomplete records) and let Pg synchronize and group up the commits. When each stream ends, read the byte arrays, concatenate them, and write the record to the final table.
For absolutely best speed with this approach, implement a bytea_agg aggregate function for bytea as an extension module (and submit it to PostgreSQL for inclusion in future versions). In reality it's likely you can get away with doing the bytea concatenation in your application by reading the data out, or with the rather inefficient and nonlinearly scaling:
CREATE AGGREGATE bytea_agg(bytea) (SFUNC=byteacat,STYPE=bytea);
INSERT INTO final_table SELECT stream_id, bytea_agg(data_block) FROM temp_stream_table;
You would want to be sure to tune your checkpointing behaviour, and if you were using an ordinary or UNLOGGED table rather than a TEMPORARY table to accumulate those 16kb records, you'd need to make sure it was being quite aggressively VACUUMed.
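Putting the above together, a minimal JDBC sketch of the staging-table approach (all table and column names are made up; bytea_agg is the aggregate defined above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamWriterSketch {
    // one connection per data stream, as recommended above
    static Connection open() throws SQLException {
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost/mydb", "user", "secret");
        try (Statement st = conn.createStatement()) {
            st.execute("SET synchronous_commit = off"); // fast, loses a window on crash
        }
        return conn;
    }

    // called for each 16 kB block as it arrives
    static void writeBlock(Connection conn, int streamId, int seq, byte[] block)
            throws SQLException {
        try (PreparedStatement ins = conn.prepareStatement(
                "INSERT INTO stream_staging (stream_id, seq, data_block) VALUES (?, ?, ?)")) {
            ins.setInt(1, streamId);
            ins.setInt(2, seq);
            ins.setBytes(3, block);
            ins.executeUpdate();
        }
    }

    // called once when the stream ends: concatenate the blocks into the final row
    static void finishStream(Connection conn, int streamId) throws SQLException {
        try (PreparedStatement agg = conn.prepareStatement(
                "INSERT INTO final_table (stream_id, data) " +
                "SELECT stream_id, bytea_agg(data_block ORDER BY seq) " +
                "FROM stream_staging WHERE stream_id = ? GROUP BY stream_id")) {
            agg.setInt(1, streamId);
            agg.executeUpdate();
        }
    }
}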
See also:
Whats the fastest way to do a bulk insert into Postgres?
How to speed up insertion performance in PostgreSQL

Need recommendations on pushing the envelope with SqlBulkCopy on SQL Server

I am designing an application, one aspect of which is that it is supposed to be able to receive massive amounts of data into a SQL database. I designed the database structure as a single table with a bigint identity, something like this one:
CREATE TABLE MainTable
(
_id bigint IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
field1, field2, ...
)
I will omit how I intend to perform queries, since that is irrelevant to the question I have.
I have written a prototype which inserts data into this table using SqlBulkCopy. It seemed to work very well in the lab. I was able to insert tens of millions of records at a rate of ~3K records/sec (the full record itself is rather large, ~4K). Since the only index on this table is an autoincrementing bigint, I did not see a slowdown even after a significant number of rows had been pushed.
Considering that the lab SQL Server was a virtual machine with a relatively weak configuration (4 GB RAM, disk subsystem shared with other VMs), I was expecting significantly better throughput on a physical machine, but it didn't happen, or let's say the performance increase was negligible. I could maybe get 25% faster inserts on the physical machine. Even after I configured a 3-drive RAID0, which performed 3 times faster than a single drive (measured by benchmarking software), I got no improvement. Basically: a faster drive subsystem, a dedicated physical CPU and double the RAM almost didn't translate into any performance gain.
I then repeated the test using biggest instance on Azure (8 cores, 16Gb), and I got the same result. So, adding more cores did not change insert speed.
At this time I have played around with the following software parameters without any significant performance gain:
Modifying SqlBulkInsert.BatchSize parameter
Inserting from multiple threads simultaneously, and adjusting # of threads
Using table lock option on SqlBulkInsert
Eliminating network latency by inserting from a local process using shared memory driver
I am trying to increase performance at least 2-3 times, and my original idea was that throwing more hardware at it would get things done, but so far it doesn't look that way.
So, can someone recommend:
What resource could be suspected as the bottleneck here? How do I confirm it?
Is there a methodology I could try to get reliably scalable bulk-insert improvement, considering there is a single SQL Server system?
UPDATE I am certain that the load app is not the problem. It creates records in a temporary queue in a separate thread, so when there is an insert it goes like this (simplified):
===> start logging time
int batchCount = (queue.Count - 1) / targetBatchSize + 1;
Enumerable.Range(0, batchCount).AsParallel()
    .WithDegreeOfParallelism(MAX_DEGREE_OF_PARALLELISM).ForAll(i =>
    {
        var batch = queue.Skip(i * targetBatchSize).Take(targetBatchSize);
        var data = MYRECORDTYPE.MakeDataTable(batch);
        var bcp = GetBulkCopy();
        bcp.WriteToServer(data);
    });
===> end logging time
Timings are logged, and the part that creates the queue never takes any significant chunk of time.
UPDATE2 I instrumented how long each operation in that cycle takes, and the breakdown is as follows:
queue.Skip().Take() - negligible
MakeDataTable(batch) - 10%
GetBulkCopy() - negligible
WriteToServer(data) - 90%
UPDATE3 I am designing for the Standard edition of SQL Server, so I cannot rely on partitioning, since it's only available in the Enterprise edition. But I tried a variant of a partitioning scheme:
created 16 filegroups (G0 to G15),
made 16 tables for insertion only (T0 to T15), each bound to its individual filegroup. The tables have no indexes at all, not even a clustered int identity.
threads that insert data cycle through all 16 tables. This makes it almost a guarantee that each bulk insert operation uses its own table.
That did yield ~20% improvement in bulk insert. CPU cores, LAN interface, and drive I/O were not maximized, running at around 25% of max capacity.
UPDATE4 I think it is now as good as it gets. I was able to push inserts to reasonable speeds using the following techniques:
Each bulk insert goes into its own table, then the results are merged into the main one (see the sketch after this list)
Tables are recreated fresh for every bulk insert, table locks are used
Used IDataReader implementation from here instead of DataTable.
Bulk inserts done from multiple clients
Each client is accessing SQL using individual gigabit VLAN
Side processes accessing the main table use NOLOCK option
I examined sys.dm_os_wait_stats, and sys.dm_os_latch_stats to eliminate contentions
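For anyone wiring up the per-load tables: the question uses .NET SqlBulkCopy, but the same knobs (fresh unindexed table, table lock, batch size, a streaming source instead of a DataTable) look roughly like this with the Microsoft JDBC driver's SQLServerBulkCopy. Treat it as a hedged sketch with made-up table names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;
import com.microsoft.sqlserver.jdbc.SQLServerBulkCopyOptions;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection("jdbc:sqlserver://...;databaseName=Staging");
             Connection dst = DriverManager.getConnection("jdbc:sqlserver://...;databaseName=Target")) {
            // recreate the per-load table fresh, with no indexes
            try (Statement st = dst.createStatement()) {
                st.execute("IF OBJECT_ID('T0') IS NOT NULL DROP TABLE T0");
                st.execute("SELECT TOP 0 * INTO T0 FROM MainTable");
            }
            SQLServerBulkCopyOptions opts = new SQLServerBulkCopyOptions();
            opts.setTableLock(true);   // enables the minimally-logged path
            opts.setBatchSize(10_000); // tune, per the measurements in UPDATE2
            try (Statement st = src.createStatement();
                 ResultSet rows = st.executeQuery("SELECT * FROM IncomingRows");
                 SQLServerBulkCopy bcp = new SQLServerBulkCopy(dst)) {
                bcp.setBulkCopyOptions(opts);
                bcp.setDestinationTableName("T0");
                bcp.writeToServer(rows); // streams rows instead of buffering a DataTable
            }
            // finally, merge T0 into MainTable as described above
        }
    }
}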
I have a hard time deciding at this point who gets credit for answering the question. Those of you who don't get an "answered", I apologize; it was a really tough decision, and I thank you all.
UPDATE5: The following item could use some optimization:
Used IDataReader implementation from here instead of DataTable.
Unless you run your program on a machine with a massive CPU core count, it could use some refactoring. Since it uses reflection to generate get/set methods, that becomes a major load on the CPUs. If performance is key, you gain a lot of performance by coding the IDataReader manually, so that it is compiled instead of using reflection.
For recommendations on tuning SQL Server for bulk loads, see the Data Loading and Performance Guide paper from MS, and also Guidelines for Optimising Bulk Import in Books Online. Although they focus on bulk loading from SQL Server, most of the advice applies to bulk loading using the client API. These papers apply to SQL Server 2008; you don't say which SQL Server version you're targeting.
Both have quite a lot of information which it's worth going through in detail. However, some highlights:
Minimally log the bulk operation. Use bulk-logged or simple recovery. (A sketch of these first two settings follows this list.)
You may need to enable trace flag 610 (but see the caveats on doing this)
Tune the batch size
Consider partitioning the target table
Consider dropping indexes during bulk load
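The first two items translate into something like this (hedged sketch; the database name is hypothetical, and trace flag 610 has the caveats the guide describes):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoadPrepSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlserver://...;databaseName=MyDb");
             Statement st = conn.createStatement()) {
            st.execute("ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED"); // minimal logging
            st.execute("DBCC TRACEON (610, -1)"); // minimally logged inserts into indexed tables
        }
    }
}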
This is nicely summarised in a flow chart in the Data Loading and Performance Guide.
As others have said, you need to get some performance counters to establish the source of the bottleneck, since your experiments suggest that IO might not be the limitation.
The Data Loading and Performance Guide includes a list of SQL wait types and performance counters to monitor (there are no anchors in the document to link to, but this is about 75% of the way through the document, in the section "Optimizing Bulk Load").
UPDATE
It took me a while to find the link, but this SQLBits talk by Thomas Kejser is also well worth watching; the slides are available if you don't have time to watch the whole thing. It repeats some of the material linked here, but also covers a couple of other suggestions for how to deal with high incidences of particular performance counters.
It seems you have done a lot; however, I am not sure whether you have had a chance to study Alberto Ferrari's SqlBulkCopy Performance Analysis report, which describes several factors affecting performance with SqlBulkCopy. I would say many of the things discussed in that paper are still worth trying, and would be good to try first.
I am not sure why you are not getting 100% utilization on CPU, IO or memory. But if you simply want to improve your bulk-load speeds, here is something to consider:
Partition your data file into different files. Or, if they are coming from different sources, simply create different data files.
Then run multiple bulk inserts simultaneously.
Depending on your situation the above may not be feasible; but if you can do it, then I am sure it should improve your load speeds.

SQL Server & update (or insert) parallelism

I've got a large conversion job: 299 GB of JPEG images, already in the database, to be converted into thumbnail equivalents for reporting and bandwidth purposes.
I've written a thread safe SQLCLR function to do the business of re-sampling the images, lovely job.
Problem is, when I execute it in an UPDATE statement (from the PhotoData field to the ThumbData field), this executes linearly to prevent race conditions, using only one processor to resample the images.
So, how would I best utilise the 12 cores and phat RAID setup this database machine has? Is it to use a subquery in the FROM clause of the update statement? Is that all that is required to enable parallelism on this kind of operation?
Anyway, the operation is split into batches of around 4000 images per batch (in a windowed query of about 391k images); this machine has plenty of resources to burn.
Please check the Maximum Degree of Parallelism (MAXDOP) configuration setting on your SQL Server; you can also change its value.
This link might be useful to you http://www.mssqltips.com/tip.asp?tip=1047
cheers
Could you not split the query into batches, and execute each batch separately on a separate connection? SQL Server only uses parallelism in a query when it feels like it, and although you can stop it, or even encourage it (a little) by changing the cost threshold for parallelism option to 0, I think it's pretty hit and miss.
One thing that's worth noting is that it only decides whether or not to use parallelism at the time the query is compiled. Also, if the query is compiled at a time when the CPU load is higher, SQL Server is less likely to consider parallelism.
I too recommend the "round-robin" methodology advocated by kragen2uk and onupdatecascade (I'm voting them up). I know I've read something irritating about CLR routines and SQL parallelism, but I forget what it was just now... I think they don't play well together.
The bit I've done in the past on similar tasks is to set up a table listing each batch of work to be done. For each connection you fire up, it goes to this table, gets the next batch, marks it as being processed, processes it, updates it as done, and repeats. This allows you to gauge performance, manage scaling, allow stops and restarts without having to start over, and gives you something to show how complete the task is (let alone show that it's actually doing anything).
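A sketch of the claim step for such a work table (all names hypothetical); the READPAST hint lets each connection skip batches another worker has already locked, so they never fight over a row:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BatchClaimSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlserver://...;databaseName=Photos");
             PreparedStatement claim = conn.prepareStatement(
                 "UPDATE TOP (1) BatchQueue WITH (ROWLOCK, READPAST) " +
                 "SET status = 'processing' " +
                 "OUTPUT inserted.batch_id, inserted.first_row, inserted.last_row " +
                 "WHERE status = 'pending'")) {
            try (ResultSet rs = claim.executeQuery()) { // OUTPUT makes the UPDATE return rows
                if (rs.next()) {
                    long batchId = rs.getLong("batch_id");
                    System.out.println("claimed batch " + batchId);
                    // run the thumbnail UPDATE for rows first_row..last_row here,
                    // then mark this batch 'done' and loop to claim the next one
                }
            }
        }
    }
}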
Find some criteria to break the set into distinct subsets of rows (1-100, 101-200, whatever), then call your update statement from multiple connections at the same time, where each connection handles one subset of rows in the table. All the connections should run in parallel.
