Cache locality in Flink

I have a stream of data, containing a key, that I need to mix and match with data associated with that key. Each key belongs to a partition, and each partition can be loaded from a database.
The data is quite big, and only a few hundred out of hundreds of thousands of partitions can fit in a single task manager.
My current approach is to use partitionCustom based on key.partition and to cache the partition data inside a RichMapFunction, so I can mix and match without reloading the data of the partitions multiple times.
When the message rate on a single partition gets too high, I hit a hot-spot/performance bottleneck.
What tools do I have in Flink to improve the throughput in this case?
Are there ways to customize the scheduling and to optimize job placement based on setup time on each machine and on the history of maximum processing times?
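A rough sketch of the setup described above, assuming the input is a Tuple2 of (DB partition id, payload) and using a simple LRU map as the per-task cache; the database read and the "mix and match" logic are placeholders, not anything from the original post:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

// Sketch: events arrive as Tuple2<dbPartitionId, payload> (routed via partitionCustom),
// and each parallel task keeps an LRU cache of partition data loaded from the database.
public class PartitionCachingMapper extends RichMapFunction<Tuple2<Integer, String>, String> {

    private static final int MAX_CACHED_PARTITIONS = 200;  // "a few hundred fit in a task manager"

    private transient Map<Integer, List<String>> cache;

    @Override
    public void open(Configuration parameters) {
        // Access-ordered LinkedHashMap as a simple LRU: evicts the least recently used partition.
        cache = new LinkedHashMap<Integer, List<String>>(MAX_CACHED_PARTITIONS, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, List<String>> eldest) {
                return size() > MAX_CACHED_PARTITIONS;
            }
        };
    }

    @Override
    public String map(Tuple2<Integer, String> event) {
        List<String> partitionData = cache.computeIfAbsent(event.f0, this::loadPartitionFromDb);
        // Placeholder "mix and match": combine the event payload with the cached partition rows.
        return event.f1 + " (matched against " + partitionData.size() + " cached rows)";
    }

    private List<String> loadPartitionFromDb(Integer partitionId) {
        // Placeholder for the real database read of one partition's data.
        return new ArrayList<>();
    }
}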

It sounds like (a) your DB-based data is also partitioned, and (b) you have skew in your keys, where one partition gets a lot more keys than other partitions.
Assuming the above is correct, and you've profiled your "mix and match" code to make it reasonably efficient, then you're left with manual optimizations. For example, if you know that keys in partition X are much more common, you can route all of those keys to their own downstream task, and then distribute the remaining keys amongst the other tasks.
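As a sketch of that manual skew handling, assuming the key exposes its DB partition id and that one partition id (HOT_PARTITION below, a made-up value) is known to be hot, a custom Partitioner can pin that partition to its own channel and spread the rest:

import org.apache.flink.api.common.functions.Partitioner;

// Sketch: pins one known-hot DB partition (HOT_PARTITION, hypothetical) to its own
// downstream channel and spreads the remaining partitions over the other channels.
// Assumes the downstream parallelism is greater than 1.
public class HotAwarePartitioner implements Partitioner<Integer> {

    private static final int HOT_PARTITION = 42;  // hypothetical hot partition id

    @Override
    public int partition(Integer dbPartitionId, int numChannels) {
        if (dbPartitionId == HOT_PARTITION) {
            return 0;  // channel 0 is reserved for the hot partition
        }
        return 1 + Math.floorMod(dbPartitionId, numChannels - 1);
    }
}

// Hypothetical usage, with e.getKey().getPartition() standing in for key.partition:
// stream.partitionCustom(new HotAwarePartitioner(), e -> e.getKey().getPartition());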
Another approach is to add a "batcher" operator, which puts up to N keys for the same partition into a group (typically this also needs a timeout to flush, so data doesn't get stuck). If you can batch enough keys, then it might not be so bad to load the DB data on demand for the partition associated with each batch of keys.
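A minimal sketch of such a batcher as a KeyedProcessFunction, keyed by the DB partition id. String stands in for the real event type, and BATCH_SIZE / FLUSH_TIMEOUT_MS are illustrative values, not from the original answer:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Buffers up to BATCH_SIZE events per DB partition id in keyed state and flushes
// either when the batch is full or when a processing-time timeout fires, so data
// for slow partitions doesn't get stuck.
public class PartitionBatcher extends KeyedProcessFunction<Integer, String, List<String>> {

    private static final int BATCH_SIZE = 500;            // illustrative threshold
    private static final long FLUSH_TIMEOUT_MS = 5_000;   // illustrative timeout

    private transient ListState<String> buffer;
    private transient ValueState<Integer> count;
    private transient ValueState<Long> timer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(new ListStateDescriptor<>("buffer", String.class));
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Integer.class));
        timer = getRuntimeContext().getState(new ValueStateDescriptor<>("timer", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<List<String>> out) throws Exception {
        buffer.add(value);
        int n = (count.value() == null ? 0 : count.value()) + 1;
        count.update(n);
        if (timer.value() == null) {
            long flushAt = ctx.timerService().currentProcessingTime() + FLUSH_TIMEOUT_MS;
            ctx.timerService().registerProcessingTimeTimer(flushAt);
            timer.update(flushAt);
        }
        if (n >= BATCH_SIZE) {
            flush(ctx, out);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<String>> out) throws Exception {
        flush(ctx, out);
    }

    private void flush(Context ctx, Collector<List<String>> out) throws Exception {
        List<String> batch = new ArrayList<>();
        for (String e : buffer.get()) {
            batch.add(e);
        }
        if (!batch.isEmpty()) {
            out.collect(batch);
        }
        buffer.clear();
        count.clear();
        Long pending = timer.value();
        if (pending != null) {
            ctx.timerService().deleteProcessingTimeTimer(pending);
        }
        timer.clear();
    }
}

Downstream, the operator that receives a batch only needs to load the DB data for that batch's partition once, trading a bit of latency for far fewer loads.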

Related

What decides the number of partitions in a DynamoDB table?

I'm a beginner with DynamoDB, my online instructor doesn't answer his Q/A section, and I've been confused about this.
I know that the partition key decides the partition in which the item will be placed.
I also know that the number of partitions is calculated based on throughput or storage using the famous formulas
So let's say a table has user_id as its partition key, with 200 user_ids. Does that automatically mean that we have 200 partitions? If so, why didn't we calculate the number of partitions based on the famous formulas?
Thanks
Let's establish 2 things.
A DynamoDB partition can support 3,000 read operations per second and 1,000 write operations per second, and it keeps those read and write allowances separate so they do not interfere with each other. If you had a table that was configured to support 18,000 reads and 6,000 writes, you'd have at least 12 partitions (18,000 / 3,000 + 6,000 / 1,000 = 6 + 6 = 12), but probably a few more for some headroom.
A provisioned capacity table has 1 partition by default, but an on-demand table has 4 partitions by default.
So, to answer your question directly: just because you have 200 distinct partition key values does not mean you have 200 partitions. It is very possible for those 200 items to be in just one partition if your table is in provisioned capacity mode. If the configuration of the table changes or it takes on more traffic, those items might move around to new partitions.
There are a few distinct times where DynamoDB will add partitions.
When a partition grows beyond 10 GB of storage. DynamoDB might see that you are taking on data and split proactively, but 10 GB is the cutoff.
When your table needs to support more operations per second than it currently does. This can happen manually, because you configured your table to support 20,000 reads/sec where before it only supported 2,000; DynamoDB would have to add partitions and move data to be able to handle those 20,000 reads/sec. Or it can happen automatically, because you configured floor and ceiling values in DynamoDB auto-scaling and DynamoDB senses your ops/sec climbing, so it adjusts the number of partitions in response to capacity exceptions.
When your table is in on-demand capacity mode, where DynamoDB attempts to automatically keep 2x your previous high-water mark of capacity. For example, say your table just reached 10,000 RCU for the first time. DynamoDB would see that this is past your previous high-water mark and start adding more partitions, as it tries to keep 2x that capacity at the ready in case you peak again like you just did.
When DynamoDB, which actively monitors your table, sees that one or more items are being hit particularly hard (hot keys) and sit in the same partition, which might create a hot partition. If that is happening, DynamoDB might split the partition to isolate those items and prevent or fix a hot-partition situation.
There are one or two other more rare edge cases, but you'd likely be talking to AWS Support if you encountered this.
Note: Once DynamoDB creates partitions, the number of partitions never shrinks and this is ok. Throughput dilution is no longer a thing in DynamoDB.
The partition key value is hashed to determine the actual partition to place the data item into.
Thus the number of distinct partition key values has zero effect on the number of physical partitions.
The only things that affect the physical number of partitions are RCUs/WCUs (throughput) and the amount of data stored.
Number of partitions by throughput: Pt = RCU / 3000 + WCU / 1000
Number of partitions by storage: Ps = GB / 10
Unless one of the above is more than 1.0, there will likely be only a single partition. But I'm sure the split happens as you approach the limits; exactly when is something only AWS knows.
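To make that concrete, here is a tiny sketch (mine, not from the answer) that applies the two formulas above and, following the usual rule of thumb, takes the larger of the two estimates; the exact split behaviour is internal to DynamoDB:

// Rough estimator combining the throughput-based (Pt) and storage-based (Ps) rules of thumb.
public class PartitionEstimate {

    static long estimatePartitions(double rcu, double wcu, double storageGb) {
        double byThroughput = rcu / 3000.0 + wcu / 1000.0;  // Pt
        double byStorage = storageGb / 10.0;                 // Ps
        return Math.max(1, (long) Math.ceil(Math.max(byThroughput, byStorage)));
    }

    public static void main(String[] args) {
        // The example from the answer above: 18,000 reads and 6,000 writes.
        System.out.println(estimatePartitions(18_000, 6_000, 5));   // 12
        // 200 small items in a modest provisioned table: still one partition.
        System.out.println(estimatePartitions(1_000, 500, 0.001));  // 1
    }
}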

Large partition count performance impact

I have tables that have millions of partitions.
Should I reduce the partition count for performance?
In my experience with Spark applications and Hive queries, too many partitions was bad for performance.
If you do not have auto clustering on the table, it will not be auto defragmented. So if you write to the table frequently with small row counts, it will be in very bad shape.
Partition count impacts compile time badly, as every partition has metadata that is loaded to plan/optimize the query. I would suggest doing a rebuild test (select into a new transient table; a sketch is below) and running some comparable queries to see the difference in compile time.
We have a number of tables for which sorting (and thus auto clustering) does not make sense, as the usage pattern is always a full-table scan, so we just rebuild those tables on a schedule to keep the partition count down, and for us that rebuild cost is worth the performance gain.
As with everything Snowflake you should run a test, and see how it is for you. And monitor hot spots as they can and do change.
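A minimal sketch of that rebuild test over JDBC, with placeholder account, credential, and table names, and assuming the Snowflake JDBC driver is on the classpath; in practice you would compare compile times in the query profile for the original and rebuilt tables:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

// Sketch of the rebuild test: copy the fragmented table into a fresh transient table,
// then run the same representative query against both and compare compile times.
public class SnowflakeRebuildTest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "MY_USER");          // placeholders; use real credential handling
        props.put("password", "MY_PASSWORD");
        props.put("db", "MY_DB");
        props.put("schema", "MY_SCHEMA");
        props.put("warehouse", "MY_WH");

        String url = "jdbc:snowflake://my_account.snowflakecomputing.com/";
        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement()) {
            // A straight CTAS compacts the micro-partitions.
            stmt.execute("CREATE OR REPLACE TRANSIENT TABLE my_table_rebuilt AS SELECT * FROM my_table");
            // Run a comparable query against both tables and compare the timings.
            stmt.execute("SELECT COUNT(*) FROM my_table WHERE some_col = 'x'");
            stmt.execute("SELECT COUNT(*) FROM my_table_rebuilt WHERE some_col = 'x'");
        }
    }
}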
In Snowflake, there are micro-partitions, and they are managed automatically. Therefore you do not need to worry about the number of micro-partitions.
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#what-are-micro-partitions
It says:
Micro-partitioning is automatically performed on all Snowflake tables. Tables are transparently partitioned using the ordering of the data as it is inserted/loaded.
From this page, I understand that micro-partitions are managed by Snowflake, and you do not need to focus on reducing the partition count (this is the original question).
This should also help to understand the difference between clustering and micro-partitions:
https://docs.snowflake.com/en/user-guide/table-considerations.html#when-to-set-a-clustering-key
If you read the above link, you can see that it is not a must to define a clustering key even on large tables to get good query performance!
As for the original question about reducing the partition count, I also have to say that clustering does not always reduce the number of partitions, but that is another story.

Maintenance commands on Redshift

I find myself dealing with a Redshift cluster with 2 different types of tables: the ones that get fully replaced every day and the ones that receive a merge every day.
From what I understand so far, there are maintenance commands that should be given since all these tables have millions of rows. The 3 commands I've found so far are:
vacuum table_name;
vacuum reindex table_name;
analyze table_name;
Which of those commands should be applied on which circumstance? I'm planning on doing it daily after they load in the middle of the night. The reason to do it daily is because after running some of these manually, there is a huge performance improvement.
After reading the documentation, I feel it's not very clear what the standard procedure should be.
All the tables have interleaved sortkeys regardless of the load type.
A quick summary of the commands, from the VACUUM documentation:
VACUUM: Sorts the specified table (or all tables in the current database) and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations. A full vacuum doesn't perform a reindex for interleaved tables.
VACUUM REINDEX: Analyzes the distribution of the values in interleaved sort key columns, then performs a full VACUUM operation.
ANALYZE: Updates table statistics for use by the query planner.
It is good practice to perform an ANALYZE when significant quantities of data have been loaded into a table. In fact, Amazon Redshift will automatically skip the analysis if less than 10% of data has changed, so there is little harm in running ANALYZE.
You mention that some tables get fully replaced every day. This should be done either by dropping and recreating the table, or by using TRUNCATE. Emptying a table by deleting all of its rows with DELETE is less efficient and should not be used to empty a table.
VACUUM can take significant time. In situations where data is being appended in time-order and the table's SORTKEY is based on the time, it is not necessary to vacuum the table. This is because the table is effectively sorted already. This, however, does not apply to interleaved sorts.
Interleaved sorts are more tricky. From the sort key documentation:
An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
Basically, interleaved sorts use a fancy algorithm to sort the data so that queries based on any of the columns (individually or in combination) will minimize the number of data blocks that are required to be read from disk. Disk access always takes the most time in a database, so minimizing disk access is the best way to speed-up the database. Amazon Redshift uses Zone Maps to identify which blocks to read from disk and the best way to minimize such access is to sort data and then skip over as many blocks as possible when performing queries.
Interleaved sorts are less performant than normal sorts, but give the benefit that multiple fields are fairly well sorted. Only use interleaved sorts if you often query on many different fields. The overhead in maintaining an interleaved sort (via VACUUM REINDEX) is quite high and should only be done if the reindex effort is worth the result.
So, in summary:
ANALYZE after significant data changes
VACUUM regularly if you delete data from the table
VACUUM REINDEX if you use Interleaved Sorts and significant amounts of data have changed
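Put together, a nightly pass along those lines might look like the following sketch over JDBC. Table names and the cluster endpoint are placeholders, and since the tables in the question use interleaved sort keys, the merged tables get VACUUM REINDEX:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of a nightly maintenance pass following the summary above. Connection
// details and table names are placeholders; the Redshift JDBC driver is assumed.
public class RedshiftNightlyMaintenance {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb";
        try (Connection conn = DriverManager.getConnection(url, "my_user", "MY_PASSWORD");
             Statement stmt = conn.createStatement()) {

            // Fully-replaced tables: they were rebuilt tonight, so fresh statistics are all they need.
            stmt.execute("ANALYZE daily_snapshot_table");

            // Merged tables with interleaved sort keys: reindex (which also performs a full
            // vacuum) and then refresh statistics.
            stmt.execute("VACUUM REINDEX merged_fact_table");
            stmt.execute("ANALYZE merged_fact_table");
        }
    }
}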

Query optimization for hive table

We have a table which is 100 TB in size, and we have multiple customers using the same table (i.e. every customer uses different WHERE conditions). Now the problem statement is that every time a customer queries the table, it gets scanned from top to bottom.
This creates a lot of slowness for all the queries. We cannot even partition/bucket the table based on any business keys. Can someone provide a solution or point to similar problem statements and their resolutions?
You can provide your suggestions as well as alternative technologies, so that we can pick the most suitable one. Thanks.
My 2 cents: experiment with an ORC table with GZip compression (default) and clever partitioning / ordering...
every SELECT that uses a partition key in its WHERE clause will do "partition pruning" and thus avoid scanning everything [OK, OK, you said you had no good candidate in your specific case, but in general it can be done, so I had to mention it first]
then, within each ORC file in scope, the min/max counters will be checked for "stripe pruning", limiting the I/O further
With clever partitioning & clever ordering of the data at INSERT time, using the most-frequent filters, the pruning can be quite efficient.
Then you can look into optimizations such as using a non-default ORC stripe size, a non-default "bytes-per-reducer" threshold, etc.
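As an illustration of the ORC + partitioning + ordering recipe (acknowledging that the poster said partitioning on a business key may not be possible in their case), here is a sketch using the Hive JDBC driver against a HiveServer2 endpoint; all table, column, and host names are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: an ORC table partitioned on the most frequent filter column, loaded with
// the data ordered so that ORC stripe min/max statistics prune well at query time.
public class OrcLayoutExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict");

            stmt.execute(
                "CREATE TABLE events_orc (customer_id BIGINT, event_type STRING, payload STRING) " +
                "PARTITIONED BY (event_date STRING) " +
                "STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB')");

            // Ordering by a common filter column inside each partition keeps the ORC
            // stripe min/max statistics tight, which helps stripe pruning.
            stmt.execute(
                "INSERT OVERWRITE TABLE events_orc PARTITION (event_date) " +
                "SELECT customer_id, event_type, payload, event_date " +
                "FROM events_raw " +
                "DISTRIBUTE BY event_date SORT BY customer_id");
        }
    }
}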
Reference:
http://fr.slideshare.net/oom65/orc-andvectorizationhadoopsummit
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance
http://thinkbig.teradata.com/hadoop-performance-tuning-orc-snappy-heres-youre-missing/
One last thing: with 15 nodes for running queries and a replication factor of 3, each HDFS block is available "locally" on 3 of the nodes (20%) and "remotely" on the rest (80%). A higher replication factor may reduce I/O and network bottlenecks, at the cost of disk space, of course.

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0 GB), I get 60 to 70 writes per millisecond. After, say, 5 GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can see this taken to its extreme in database tables (and I've seen these in the wild) that index every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are only based on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumptions about your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
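Under those two assumptions, a minimal sketch of the flat-file idea described above: fixed-size records are appended to a data file, and a tiny side index of (minute, byte offset) pairs is written so a range read can seek close to its start. File names and the record layout (one long timestamp plus three ints) are illustrative:

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

// Appends fixed-size, time-ordered records to a data file and records the byte
// offset of the first record of each minute in a small side index.
public class TimeSeriesLog implements AutoCloseable {

    private final DataOutputStream data;
    private final PrintWriter index;
    private long bytesWritten;
    private long lastIndexedMinute = -1;

    public TimeSeriesLog(String dataPath, String indexPath) throws IOException {
        this.bytesWritten = new File(dataPath).length();
        this.data = new DataOutputStream(new FileOutputStream(dataPath, true));
        this.index = new PrintWriter(new FileOutputStream(indexPath, true), true);
    }

    public void append(long timestampMillis, int a, int b, int c) throws IOException {
        long minute = timestampMillis / 60_000;
        if (minute != lastIndexedMinute) {
            index.println(minute + "," + bytesWritten);  // minute -> offset of its first record
            lastIndexedMinute = minute;
        }
        data.writeLong(timestampMillis);
        data.writeInt(a);
        data.writeInt(b);
        data.writeInt(c);
        bytesWritten += 8 + 3 * 4;  // fixed record size: 20 bytes
    }

    @Override
    public void close() throws IOException {
        data.close();
        index.close();
    }
}

A range read then finds the largest indexed minute at or before the query start, seeks to that offset (for example with RandomAccessFile), and scans forward until it passes the end of the range; the worst case is scanning one extra minute of records.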
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with #Richard-t: it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, and to reapply them only when the update is complete.
The type of indexes applied also influences insertion performance. For example, with a SQL Server clustered index, update I/O is used for data distribution as well as for the index update, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project, the best advice is to measure with real datasets (skew, page distribution, tearing, etc.).
I think somewhere in the BDB docs they mention that page size greatly affects this behaviour in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.
