I understand that MapReduce is great for solving parallel problems on a huge data set. However, are there any examples of problems that, while in some sense parallelizable, are not a good fit for MapReduce?
A few observations:
We shouldn’t confuse Hadoop, and the early Google implementation of MapReduce that Hadoop copied (i.e. limited to key/value mapping only), with the general split & aggregate concept that MapReduce is based on.
The MapReduce idea (split & aggregate and divide & conquer are just a few other names for it) is about parallelizing processing by splitting a task into smaller sub-tasks that can be processed independently in parallel, and as such it can be applied to a wide variety of problems (data intensive, compute intensive or otherwise).
MapReduce, in general, has nothing to do with big data sets, or with data at all. It is successfully used for small data sets, and in computational MapReduce it is employed purely to parallelize processing.
To answer your question: MapReduce generally doesn’t work in cases where the original task cannot be split into a set of sub-tasks that can be processed independently in parallel. In real life, very few use cases fall into this category, as most non-obvious problems can be approximated for MapReduce-type processing.
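To make the split & aggregate idea concrete, here is a minimal Python sketch on a tiny, purely computational input (no big data involved). All function names are made up for illustration, and the thread pool is only a stand-in for "some set of workers"; a process pool or a cluster would play that role in practice. The hash chain at the end is one example of a task that does not split, because every step needs the previous step's output.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import hashlib


def split(items, n_chunks):
    """Split a list into n_chunks contiguous, nearly equal chunks."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Independent sub-task: it needs nothing outside its own chunk."""
    return sum(x * x for x in chunk)


def split_and_aggregate(items, n_chunks=4):
    """Split, process each chunk independently, then aggregate the partials."""
    chunks = split(items, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return reduce(lambda a, b: a + b, partials, 0)


def hash_chain(seed: bytes, steps: int) -> bytes:
    """Poor fit: step n cannot start before step n-1 has finished."""
    h = seed
    for _ in range(steps):
        h = hashlib.sha256(h).digest()
    return h


if __name__ == "__main__":
    print(split_and_aggregate(list(range(10))))  # sum of squares of 0..9
    print(hash_chain(b"seed", 1000).hex()[:16])
```

The first function parallelizes because addition lets the partial results be combined in any order; the second has no such structure to exploit.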
Yes and no. It really depends on how they are structured and written. There are certainly problems that MapReduce will parallelize poorly within a given data step or map-reduce function. Simultaneous equation solvers for symmetric matrices are one example. They do not parallelize well, for the obvious reason of simultaneity, if written as one single function (in many cases they may load onto a single node). A common workaround is to isolate the pre-matrix calculations in a separate processor, as they are trivially parallelizable. By breaking the work up this way, the map-reduce optimizer is able to pick up more nodes, and therefore more processing power, than it would otherwise.
Related
My question is about choosing a good parallelism for the operators in a Flink job in a fixed cluster setting. Suppose we have a Flink job DAG containing map and reduce type operators with pipelined edges between them (no blocking edge). An example DAG is as follows:
Scan -> Keyword Search -> Aggregation
Assume a fixed-size cluster of M machines with C cores each, and that the DAG is the only workflow to be run on the cluster. Flink allows the user to set the parallelism for individual operators. I usually set a parallelism of M*C for each operator. But is this the best choice from a performance perspective (e.g. execution time)? Can we leverage the properties of the operators to make a better choice? For example, if we know that aggregation is more expensive, should we assign a parallelism of M*C only to the aggregation operator and reduce the parallelism of the other operators? This will hopefully reduce the chances of backpressure too.
I am not looking for a formula that will give me the "best" parallelism. I am just looking for some intuition, guidelines or ideas that can be used to make a decision. Surprisingly, I could not find much literature to read on this topic.
Note: I am aware of the dynamic scaling reactive mode in recent Flink. But my question is about a fixed cluster with only one workflow running, which means that the dynamic scaling is not relevant. I looked at this question, but did not get an answer.
I think about this a little differently. From my perspective, there are two key questions to consider:
(1) Do I want to keep the slots uniform? Or in other words, will each slot have an instance of every task, or do I want to adjust the parallelism of specific tasks?
(2) How many cores per slot?
My answer to (1) defaults to "keep things uniform". I haven't seen very many situations where tuning the parallelism of individual operators (or tasks) has proven to be worthwhile.
Changing the parallelism is usually counterproductive if it means breaking an operator chain. Doing it where there's a shuffle anyway can make sense in unusual circumstances, but in general I don't see the point. Since some of the slots will have instances of every operator, and the slots are all uniform, why is it going to be helpful to have some slots with fewer tasks assigned to them? (Here I'm assuming you aren't interested in going to the trouble of setting up slot sharing groups, which of course one could do.) Going down this path can make things more complex from an operational perspective, and for little gain. Better, in my opinion, to optimize elsewhere (e.g., serialization).
As for cores per slot, many jobs benefit from having 2 cores per slot, and for some complex jobs with lots of tasks you'll want to go even higher. So I think in terms of an overall parallelism of M*C for simple ETL jobs, and M*C/2 (or lower) for jobs doing something more intense.
To illustrate the extremes:
A simple ETL job might be something like
source -> map -> sink
where all of the connections are forwarding connections. Since there is only one task, and because Flink only uses one thread per task, in this case we are only using one thread per slot. So allocating anything more than one core per slot is a complete waste. And the task is probably I/O-bound anyway.
At the other extreme, I've seen jobs that involve ~30 joins, the evaluation of one or more ML models, plus windowed aggregations, etc. You certainly want more than one CPU core handling each parallel slice of a job like that (and more than two, for that matter).
Typically most of the CPU effort goes into serialization and deserialization, especially with RocksDB. I would try to figure out, for every event, how many RocksDB state accesses, keyBy's, and rebalances are involved -- and provide enough cores that all of that ser/de can happen concurrently (if you care about maximizing throughput). For the simplest of jobs, one core can keep up. By the time you get to something like a windowed join you may already be pushing the limits of what one core can keep up with -- depending on how fast your sources and sinks can go, and how careful you are not to waste resources.
Example: imagine you are choosing between a parallelism of 50 with 2 cores per slot, or a parallelism of 100 with 1 core per slot. In both cases the same resources are available -- which will perform better?
I would expect fewer slots with more cores per slot to perform somewhat better, in general, provided there are enough tasks/threads per slot to keep both cores busy (if the whole pipeline fits into one task this might not be true, though deserializers can also run in their own thread). With fewer slots you'll have more keys and key groups per slot, which will help to avoid data skew, and with fewer tasks, checkpointing (if enabled) will be a bit better behaved. Inter-process communication is also a little more likely to be able to take an optimized (in-memory) path.
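The point about key groups per slot can be illustrated with some rough arithmetic. The sketch below is not Flink code: it just spreads a fixed number of key groups over the slots as evenly as possible and compares the two configurations from the example. The figure of 128 key groups is an assumption chosen for illustration (the real number is the job's configured max parallelism), and the helper names are made up.

```python
def key_groups_per_slot(num_key_groups, parallelism):
    """Spread key groups over slots as evenly as possible.

    Returns one count per slot; counts differ by at most one.
    """
    base, extra = divmod(num_key_groups, parallelism)
    return [base + 1 if i < extra else base for i in range(parallelism)]


def imbalance(counts):
    """Ratio between the most and the least loaded slot (1.0 = perfectly even)."""
    return max(counts) / min(counts)


NUM_KEY_GROUPS = 128  # assumed for illustration only

if __name__ == "__main__":
    for parallelism, cores_per_slot in [(50, 2), (100, 1)]:
        counts = key_groups_per_slot(NUM_KEY_GROUPS, parallelism)
        print(
            f"parallelism={parallelism} cores={parallelism * cores_per_slot} "
            f"key groups per slot: {min(counts)}..{max(counts)} "
            f"imbalance={imbalance(counts):.2f}"
        )
```

Under these assumptions both setups use 100 cores, but with 50 slots each slot holds 2 or 3 key groups (worst case 1.5x), while with 100 slots each holds 1 or 2 (worst case 2x). A higher max parallelism narrows the gap in both cases.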
I'm trying to understand what are the important features I need to take into consideration before submitting a Flink job.
My question is: what parallelism should I use, is there a (physical) upper bound, and how does the parallelism affect the performance of my job?
For example, I have a CEP Flink job that detects a pattern in an unkeyed stream. The parallelism will always be 1 unless I partition the data stream with the keyBy operator.
Please correct me if I'm wrong:
If I partition the data stream, then the parallelism can be as high as the number of distinct keys. The problem is that pattern matching is then done independently for each key, so I can't define a pattern that requires information from two partitions that have different keys.
It's not bad to use Flink with parallelism = 1. But it defeats the main purpose of using Flink (being able to scale).
In general, you should not have a higher parallelism than you have cores (whether physical or virtual depends on the use case), as you want to saturate your cores as much as possible. Anything over that will hurt performance, as it adds communication overhead and context switching. By scaling out, you can add cores from distributed compute nodes in a network, which is the main benefit of using big data technologies vs. writing applications by hand.
As you said you can only use the parallelism if you partition your data. If you have an algorithm that needs all data, you need to process it on one core eventually. However, usually you can do lots of preprocessing (filtering, transformation) and partial aggregations in parallel before combining the data at a final core. For example, think of simply counting all events. You can count the data of each partition and then simply sum up the partial counts in a final step, which scales almost perfectly.
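The counting example can be sketched in a few lines of Python. This is plain Python rather than Flink, and the partition contents and function names are invented for illustration: each partition is filtered and counted on its own, and only the small partial counts travel to the final step.

```python
def count_partition(events, keep=lambda event: True):
    """Runs independently per partition: filter locally, then count locally."""
    return sum(1 for event in events if keep(event))


def combine(partial_counts):
    """Final step on a single core: only one number per partition arrives here."""
    return sum(partial_counts)


if __name__ == "__main__":
    # Hypothetical input, already split into four partitions.
    partitions = [
        ["click", "view", "click"],
        ["view"],
        [],
        ["click", "click"],
    ]
    partials = [count_partition(p) for p in partitions]
    clicks = [count_partition(p, keep=lambda e: e == "click") for p in partitions]
    print(partials, combine(partials))
    print(clicks, combine(clicks))
```

The final step stays cheap no matter how large the partitions grow, which is why this pattern scales so well.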
If your algorithm cannot be split up, then your use case may not allow distributed processing. In that case, Flink is not a good fit. However, it's worth exploring whether alternative algorithms (sometimes approximate ones) would suffice for your use case. The art of data engineering lies in splitting monolithic algorithms into parallelizable sub-algorithms.
In Hadoop, there is a shuffle phase between map and reduce. I want to know whether there is such a stage in Flink, and how it works, because the websites I have read do not say much about it. Take a word count demo, which has a flatMap, a keyBy and a sum. Is there always a shuffle phase between two operators? And can I get the intermediate data between these operators?
A shuffle is not always performed; it depends on the specific operators involved. In your word count example, the keyBy step introduces a hash partitioner, which shuffles the data based on the key.
In other cases for example - if you want to just process and filter your data without some form of aggregation and then write somewhere, then each of your partitions would hold its own data and there wouldn't be any kind of shuffling involved.
So to answer your questions -
No, shuffling is not always involved between two operators; it depends on the operators.
If you are asking about intermediate files that you can access, as in Hadoop, then the answer is no. Flink is an in-memory processing engine and (in most cases) processes data that has been read into memory.
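To show where the shuffle happens in the word count case, here is a rough Python sketch. It is not Flink code and does not reproduce Flink's actual partitioner: `zlib.crc32` is just a stand-in for a deterministic hash, and all function names are hypothetical. The flatMap step works on each partition in place, and only the keyBy step moves records between partitions.

```python
import zlib
from collections import Counter


def partition_for(key: str, parallelism: int) -> int:
    """Hash partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def flat_map(lines_per_partition):
    """No shuffle: each partition splits its own lines into words."""
    return [[w for line in part for w in line.split()] for part in lines_per_partition]


def shuffle_by_key(words_per_partition, parallelism):
    """The shuffle: every word is routed to the partition that owns its key."""
    downstream = [[] for _ in range(parallelism)]
    for part in words_per_partition:
        for word in part:
            downstream[partition_for(word, parallelism)].append(word)
    return downstream


def word_count(lines_per_partition, parallelism=3):
    """flatMap (local) -> keyBy (shuffle) -> sum (local again, per key)."""
    words = flat_map(lines_per_partition)
    shuffled = shuffle_by_key(words, parallelism)
    return [Counter(part) for part in shuffled]


if __name__ == "__main__":
    lines = [["to be or not to be"], ["be quick"]]
    for i, counts in enumerate(word_count(lines)):
        print(i, dict(counts))
```

Because all occurrences of a word end up in one partition after the shuffle, each partition can compute final counts for its keys without talking to the others.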
I've been studying indexes and there are some questions that bother me and which I think are important.
If you can help or refer to sources, please feel free to do it.
Q1: B-tree indexes can favor a fast access to specific rows on a table. Considering an OLTP system, with many accesses, both Read and Write, simultaneously, do you think it can be a disadvantage having many B-tree indexes on this system? Why?
Q2: Why are B-Tree indexes not fully occupied (typically only 75% occupied, if I'm not mistaken)?
Q1: I've no administration experience with large indexing systems in practice, but the typical multiprocessing environment drawbacks apply to having multiple B-tree indexes on a system -- cost of context switching, cache invalidation and flushing, poor IO scheduling, and the list goes on. On the other hand, IO is something that inherently ought to be non-blocking for maximal use of resources, and it's hard to do that without some sort of concurrency, even if done in a cooperative manner. (For example, some people recommend event-based systems.) Also, you're going to need multiple index structures for many practical applications, especially if you're looking at OLTP. The biggest thing here is good IO scheduling, access patterns, and data caching depending on said access patterns.
Q2: Because splitting and re-balancing nodes is expensive. The naive methodology for speed is "only split when they're full." Given this, there are two extremes -- a node was just split and is half full, or a node is full and will be split on the next insert. The 'average' of the two cases (50% and 100%) is 75%. Yes, it's somewhat bad logic from a mathematics perspective, but it exposes the underlying reason why the 75% figure appears.
I am developing a simulation in which there can be millions of entities that can interact with each other. At the moment, all the entities are stored in a list. Would it be better to store the objects in a database like redis instead of a list?
Note: I assumed this was being implemented in Java (force of habit). My answer is not terribly useful if it is not Java.
Making lots of assumptions about your requirements, I'd consider Redis if:
You are running into unacceptable GC pauses as a result of your millions of objects OR
The entities you create can be reused across multiple simulation runs
Java apps with giant heaps and lots of long-lived objects can run into very long GC pauses, depending on workload: the old gen fills up with all these millions of objects and they're never eligible for collection. Regardless, periodically a full collect will happen (unless you're a GC tuning master) and have to scan these millions of objects in the old gen. This can take many seconds each time it happens, and you're frozen during that time. If this is happening and you don't like it, you could off-load all these long-lived objects to Redis, and pay the serialize/deserialize cost of accessing them rather than the GC pauses.
On the other point about reusing entities: if you're loading up a big Redis db and then dropping all its data when the simulation ends, it feels a bit wasteful. If you can re-use entities across simulation runs you might save yourself a bunch of time by persisting them in Redis.
The best choice depends on a number of factors, including how you access data, whether it will fit in memory, and what the distribution of accesses looks like. As a broad generalization, keeping data in memory is faster than keeping it on disk, and keeping it in-process is faster than keeping it elsewhere.
If your data fits in memory, is accessed in a manner that means you can use basic data structures like lists/arrays and hashtables efficiently, and all items are accessed roughly equally often, keeping your data in memory is probably the best option.
If your data fits in memory, but you need to access it in complex ways, you may be best off choosing a datastore like Redis that supports in-memory databases.
If your data doesn't fit in memory, or you have a very uneven access pattern such that evicting the least used data to disk might allow other things to be loaded, speeding up your task in general, a regular disk-based datastore may be a better choice.
A linked list (such as C++'s std::list) is not necessarily the best data structure unless "interaction" is limited to the respective next or previous element. Random access (by index) is very slow on a linked list.
Linked lists rock at inserting at the front and the end, at finding the next (or previous) element, and at inserting one in between. They totally blow at accessing element 164553 and then element 10657, being O(N) on random access. Thus "interact with each other" suggests that a list is a bad choice.
It very much depends on the access and allocation patterns, but a vector or deque will likely be much better suited than a list for your simulation.
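The cost difference is easy to see by counting pointer hops. The sketch below uses a hand-rolled singly linked list (the `Node` class and helper names are made up for this example) next to a Python `list`, which is array-backed and so plays the role of the vector here.

```python
class Node:
    """One cell of a singly linked list."""

    __slots__ = ("value", "next")

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def build_linked_list(values):
    """Build the list back to front so each node links to its successor."""
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head


def linked_get(head, index):
    """Return (value, hops): reaching position `index` costs `index` hops."""
    node, hops = head, 0
    while hops < index:
        node = node.next
        hops += 1
    return node.value, hops


if __name__ == "__main__":
    values = list(range(200_000))
    head = build_linked_list(values)
    for i in (164_553, 10_657):
        value, hops = linked_get(head, i)
        # The array-backed list reaches the same element in one indexing step.
        print(f"index {i}: linked list needed {hops} hops, array gives {values[i]} directly")
```

With millions of entities interacting at arbitrary positions, those hops add up on every single access.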
Redis is based on a hash table, which has (much!) better characteristics for random access, but it will most likely still be slower, because of the considerable overhead: you serialize the data, it goes through a socket, Redis deserializes and analyzes it and sends a reply, and you then parse that reply.