Is sort part of shuffle in MapReduce?

The process by which the system sorts the map output on the map side is known as the sort. Is this part of the shuffle? In other words, when does the shuffle start: after the map output has been written to disk, or after it has been written to the in-memory buffer?

The whole MapReduce process is explained in detail here: http://ercoppa.github.io/HadoopInternals/AnatomyMapReduceJob.html
To answer your question, the steps in a single map task are:
INIT phase: we set up the map task
EXECUTION phase: for each (key, value) tuple inside the map split we run the map() function
SPILLING phase: the map output is stored in an in-memory buffer; when this buffer is almost full we start (in parallel) the spilling phase in order to remove data from it
SHUFFLE phase: at the end of the spilling phase, we merge all the map outputs and package them for the reduce phase
The EXECUTION and SPILLING phases occur in parallel. So: data is written to a circular in-memory buffer -> sorted in memory -> when the buffer is 80% full -> written to local disk.
At the end of the EXECUTION phase, the SPILLING thread is triggered for the last time. In more detail, we:
sort and spill the remaining unspilled tuples
start the SHUFFLE phase
Notice that each time the buffer became almost full, we get one spill file (SpillRecord + output file). Each spill file contains several partitions (segments).
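To make the spill mechanism concrete, here is a minimal sketch in plain C (this is not Hadoop code; the buffer size, the collect/flush_final names, and the int key/value types are all illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch of sort-and-spill: map output collects in a fixed buffer;
     * at 80% occupancy it is sorted by key and flushed to a numbered
     * spill file. */
    #define BUF_CAP 1000
    #define SPILL_AT (BUF_CAP * 8 / 10)            /* the 80% threshold */

    typedef struct { int key; int value; } kv_t;

    static kv_t buf[BUF_CAP];
    static int used, spill_no;

    static int by_key(const void *a, const void *b) {
        int ka = ((const kv_t *)a)->key, kb = ((const kv_t *)b)->key;
        return (ka > kb) - (ka < kb);
    }

    static void spill(void) {
        char name[32];
        qsort(buf, used, sizeof buf[0], by_key);   /* sort in memory first */
        snprintf(name, sizeof name, "spill_%d.bin", spill_no++);
        FILE *f = fopen(name, "wb");
        if (f) { fwrite(buf, sizeof buf[0], used, f); fclose(f); }
        used = 0;
    }

    void collect(int key, int value) {             /* called per map() output */
        buf[used].key = key;
        buf[used].value = value;
        if (++used >= SPILL_AT) spill();
    }

    void flush_final(void) { if (used) spill(); }  /* last spill, then merge */

Note that a real map task spills on a background thread in parallel with execution, as described above; the sketch spills inline purely for brevity.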


How does a Flash Translation Layer store mapping data, unusable blocks, and the superblock?

Does the FTL have private storage space that is not flash?
If not, how does the FTL store that metadata without defeating wear leveling?
Actually, I don't know whether an FTL has a superblock, but if you want to locate the mapping data and the bad-block list, whose physical addresses change frequently, some fixed physical address may be needed. The content at that physical address must then change frequently, so how is wear on that one physical address avoided?
There are many possible solutions to this problem and it's very intertwined with the data representation that the drive uses to store its data, so I'm sure it differs a lot based on the drive / manufacturer. I'll just outline a general approach that could work.
Let's say you design an FTL that maintains several fixed-size, append-only "logs", and for simplicity we always have one "active" log that all writes are appended to. If the user is issuing random writes, the order of LBAs in the active log will be random too. When the active log fills all the space allocated to it, it gets "frozen" and we switch the active log to some empty log elsewhere in the flash. As the data in the frozen log becomes stale, we will eventually need to garbage collect it by copying any still-referenced blocks to a different log before erasing the original so that it can be reused for new writes.
Now, for each write to a log, nothing in our interface so far requires that the blocks be exactly 4KiB (or whatever), so you could append a small header to the data that tells you what its LBA is, and perhaps some other metadata -- write sequence number so you can tell if it's the most recent copy of a block, and maybe a checksum for read integrity checking. When a write finishes, you update an in-RAM copy of the map with the new location for the LBAs that were updated (RAM inside the SSD, not RAM for the main CPU of the computer obviously).
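As a hedged illustration, a per-write header and the map update applied during a log scan could look like this in C (the layout, field names, and sizes are assumptions, not any vendor's format):

    #include <stdint.h>

    /* Hypothetical per-write log header. Each write carries enough
     * metadata to rebuild the LBA -> physical map after a crash. */
    typedef struct {
        uint64_t lba;       /* logical block address of the payload */
        uint64_t seq;       /* monotonically increasing write sequence no. */
        uint32_t checksum;  /* integrity check over header + payload */
    } block_header_t;

    /* During a log scan, apply a header to the in-RAM map only if it is
     * newer than what the map already records for that LBA, so stale
     * copies of a block are ignored. */
    void replay_apply(uint64_t *map_phys, uint64_t *map_seq,
                      const block_header_t *h, uint64_t phys_addr) {
        if (h->seq >= map_seq[h->lba]) {
            map_phys[h->lba] = phys_addr;
            map_seq[h->lba]  = h->seq;
        }
    }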
If the FTL crashes or loses power, you can reconstruct the map by reading all headers from all the logs. The downside is that scanning the logs will scale O(number of logs * number of blocks per log), so you optimize that somehow:
you could write the headers to a separate part of the disk by themselves so that you can scan them without also reading the user data (same big-O runtime but a lot faster in practice)
you could periodically flush the in-RAM copy of the map to flash somewhere, along with the latest IO sequence number, so that you only have to read the parts of the logs that were written since the latest map flush
How do you find the portion of the log to start scanning from? Do a binary search on the IO sequence numbers in the log headers. So the boot runtime is now O(number of logs * (log_2(number of blocks per log) + number of blocks that need to be scanned)).
How do you know when to stop scanning? Either you recognize that all data in the block you read is 1s because that part of the log hasn't been written to yet, or you recognize that the checksum and data don't match.
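Here is a sketch of that boot-time search in C, reusing the hypothetical block_header_t above; it assumes sequence numbers increase with position within one append-only log, and that erased flash reads as all 1s (so an unwritten "header" has seq == UINT64_MAX):

    #include <stddef.h>
    #include <stdint.h>

    #define LOG_BLOCKS 4096

    /* Stand-in for reading headers from flash, for the sketch only. */
    static block_header_t log_headers[LOG_BLOCKS];

    static int header_valid(const block_header_t *h) {
        return h->seq != UINT64_MAX;    /* plus a checksum test in practice */
    }

    /* Returns the first log position newer than the last flushed map,
     * i.e., where replay must start: O(log2 LOG_BLOCKS) header reads. */
    size_t find_replay_start(uint64_t last_flushed_seq) {
        size_t lo = 0, hi = LOG_BLOCKS;             /* window [lo, hi) */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            const block_header_t *h = &log_headers[mid];
            if (header_valid(h) && h->seq <= last_flushed_seq)
                lo = mid + 1;           /* already covered by the map flush */
            else
                hi = mid;               /* unwritten, or newer than the flush */
        }
        return lo;
    }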
Minor optimization: during a clean shutdown, always write the map to flash, so that this binary search + scanning only needs to happen if there's a crash or unclean shutdown.
So far, this lowers how often you need to write the map by a lot, but it's still probably too often to overwrite it to a fixed location for a drive with a very long lifetime. To resolve that, we have to cycle where we write the map:
The simplest solution would be to designate a small set of X special logs to store all map data and write to them like a circular buffer, where X is chosen to make the map updates last the expected lifetime of the device. To find the most recent map in the log on boot, you'd do binary search within those logs to find the last one that was written. So boot = O(X * log_2(number of maps per log) + runtime to scan the other logs if unclean shutdown).
Probably a more optimal solution (but one that might be more complicated) would be to include the map writes directly in the logs where the updates are happening. Then you need some way to find where the maps are at boot time -- the most obvious way to do that would be to write the map at the beginning of each active log, or you could allow arbitrary map writes by adding backpointers to the block headers that point back to the latest map in their log.
Another aspect of this is that full map flushes could be expensive, which would add tail latency if it ever interferes with the performance of user IOs -- would it be better to allow incremental updates? That's when you start looking at using something like a log-structured merge (LSM) tree to store your map, so that each incremental write is pretty small and you can amortize the full map write cost.
Obviously there are a bunch of tiny details that this explanation leaves out, but hopefully that's enough to get you started. :-)

Is there an efficient way of storing the most recent part of a continuous stream in an array?

I have a never-ending stream of data coming into a program I'm writing. I would like to have a fixed-size buffer array which only stores the T most recent observations of that stream. However, it's not obvious to me how to implement that efficiently.
What I have done so far is to first allocate the buffer of length T and place incoming observations in consecutive order from the top as they arrive: data_0 -> index 0, data_1 -> index 1 … data_T -> index T.
That works fine until the buffer is full. But when observation data_T+1 arrives, index 0 needs to be removed from the buffer and all the remaining rows need to be moved up one step in the array/matrix in order to place the newest data point at index T.
That seems to be a very inefficient approach when the buffer is large and hundreds of thousands of elements need to be pushed one row up all the time.
How is this normally solved?
This kind of structure is called a FIFO queue; since the size is fixed and the oldest entry is discarded on overflow, the standard implementation is a circular (ring) buffer: java fifo queue
Look at that API; it has several code examples.
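For illustration, here is a minimal circular-buffer sketch in C (names are illustrative; error handling trimmed). Pushes are O(1) and nothing is ever shifted:

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct {
        double *data;
        size_t  cap;    /* T */
        size_t  head;   /* index of the next slot to overwrite */
        size_t  count;  /* number of valid entries, <= cap */
    } ring_t;

    ring_t *ring_new(size_t cap) {
        ring_t *r = malloc(sizeof *r);
        r->data  = malloc(cap * sizeof *r->data);
        r->cap   = cap;
        r->head  = 0;
        r->count = 0;
        return r;
    }

    void ring_push(ring_t *r, double x) {
        r->data[r->head] = x;                /* overwrite the oldest slot */
        r->head = (r->head + 1) % r->cap;
        if (r->count < r->cap) r->count++;
    }

    /* i = 0 returns the oldest stored observation. */
    double ring_get(const ring_t *r, size_t i) {
        size_t oldest = (r->count < r->cap) ? 0 : r->head;
        return r->data[(oldest + i) % r->cap];
    }

Once the buffer is full, each push simply overwrites the oldest slot; reading back in arrival order just means offsetting from head, as ring_get shows.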

CUDA threads appending variable amounts of data to common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.
All those output pairs need to be eventually stored in one array since all the keys in each bin will later be processed together. How to do this efficiently?
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to initialize a hash table as an array of pointers to some sort of per-bin storage. That looks slower.
I'm thinking of pre-reserving 2 pairs per input record in a block-shared array, then grabbing more space as needed (i.e., a reimplementation of the STL vector reserve operation), then having the last thread in each block copy the block-shared array to global memory.
However I'm not looking forward to implementing that. Help? Thanks.
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow.
You could increment bins of a global array instead of one entry at a time. In other words, you could have a large array, and each thread could start with 10 possible output entries. If a thread overflows its bin, it requests the next available bin from the global array. If you're worried about contention on the one atomic counter, you could use 10 atomic counters for 10 portions of the array and distribute the accesses. If one fills up, find another one.
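As a sketch of that idea in CUDA (all names are illustrative, and the bin computation is a placeholder for your hash):

    // Each thread grabs CHUNK output slots with a single atomicAdd and
    // only returns to the global counter when its private chunk runs out,
    // so the hot counter is hit roughly once per CHUNK pairs.
    #define CHUNK 10

    struct Pair { unsigned int bin; unsigned long long key; };

    __device__ unsigned int g_next;   // global allocation cursor; reset
                                      // from the host with cudaMemcpyToSymbol

    __device__ unsigned int reserve_slot(unsigned int *base, unsigned int *used) {
        if (*used == CHUNK) {                       // private chunk exhausted
            *base = atomicAdd(&g_next, CHUNK);      // reserve the next chunk
            *used = 0;
        }
        return *base + (*used)++;
    }

    __global__ void hash_kernel(const unsigned long long *in, int n, Pair *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned int base = atomicAdd(&g_next, CHUNK);   // first chunk
        unsigned int used = 0;
        // Placeholder bins; the real application derives 2..10+ per key.
        unsigned int b1 = (unsigned int)(in[i] % 1024);
        unsigned int b2 = (unsigned int)((in[i] >> 10) % 1024);
        Pair p1 = { b1, in[i] };  out[reserve_slot(&base, &used)] = p1;
        Pair p2 = { b2, in[i] };  out[reserve_slot(&base, &used)] = p2;
    }

Note that a thread reserves a whole chunk even if it only emits 2 pairs, so the output has gaps and will need a later compaction pass.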
I'm also considering processing the data twice: the 1st time just to
determine the number of output records for each input record. Then
allocate just enough space and finally process all the data again.
This is another valid method. The bottleneck is calculating the offset of each thread into the global array once you have the total number of results for each thread. I haven't figured out a reasonable parallel way to do that.
The last option I can think of would be to allocate a large array, distribute it based on blocks, and use a shared atomic int (which would help with slow global atomics). If you run out of space, mark that the block didn't finish, and mark where it left off. On your next iteration, complete the work that hasn't been finished.
The downside of distributed portions of global memory is, as talonmies said, that you need a gather or compaction pass to make the results dense.
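A sketch of that per-block variant, reusing the Pair struct from the sketch above (again, the names and the 2-pairs-per-key assumption are illustrative):

    // One global atomicAdd per block claims the block's slice of the
    // output; each thread reserves its slots from a cheap shared counter.
    __global__ void hash_kernel_block(const unsigned long long *in, int n,
                                      Pair *out, unsigned int *g_cursor) {
        __shared__ unsigned int s_count;  // pairs produced by this block
        __shared__ unsigned int s_base;   // start of this block's slice
        if (threadIdx.x == 0) s_count = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int my_off = 0;
        if (i < n)
            my_off = atomicAdd(&s_count, 2u);   // cheap block-shared atomic
        __syncthreads();

        if (threadIdx.x == 0)                   // one global atomic per block
            s_base = atomicAdd(g_cursor, s_count);
        __syncthreads();

        if (i < n) {
            unsigned long long k = in[i];
            Pair p1 = { (unsigned int)(k % 1024), k };         // placeholder
            Pair p2 = { (unsigned int)((k >> 10) % 1024), k }; // placeholder
            out[s_base + my_off]     = p1;
            out[s_base + my_off + 1] = p2;
        }
    }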
Good luck!

Alternatives to reduce the read time of a large number of binary files from hard disk

In my first prototype of the application, I have to read around 400,000 files (4 KB each, around 1.5 GB of data in total) from hard disk sequentially, do some operation on the data read from each file, and store the results in RAM. With that mechanism I was first doing the I/O for a file and then using the CPU for the operation, then moving on to the next file, and it was a very slow process.
To work around this, we now first read all the files and store their data in RAM, and only then do the operation (utilizing the CPU). That gave a significant improvement.
But in my second phase of development I have to read 20 GB of data, which I cannot store in RAM. And reading one file at a time and then processing it is a very time-consuming operation.
Can someone please suggest some method to work around this problem?
I am developing this application on Windows in C, with the Visual Studio compiler.
There's a technique called Asynchronous I/O (AIO) that lets you keep doing some processing with the CPU while a file is read in the background. You can use this to read the next few files at the same time as you're processing a file.
The various AIO calls are OS-specific. On Windows, Microsoft calls it "Overlapped I/O". See this Wikipedia page or this MSDN page for more info.
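As a minimal sketch of Overlapped I/O in C (error handling trimmed; the file name and buffer size are placeholders):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* FILE_FLAG_OVERLAPPED makes reads on this handle asynchronous. */
        HANDLE h = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        static char buf[4096];
        OVERLAPPED ov = { 0 };                       /* read at offset 0 */
        ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

        /* Start the read; ERROR_IO_PENDING means it is running in the
         * background, which is the normal case. */
        if (!ReadFile(h, buf, sizeof buf, NULL, &ov) &&
            GetLastError() != ERROR_IO_PENDING)
            return 1;                                /* failed outright */

        /* ... process the previously loaded file here while the read runs ... */

        DWORD got = 0;
        GetOverlappedResult(h, &ov, &got, TRUE);     /* TRUE = wait for it */
        printf("read %lu bytes\n", (unsigned long)got);

        CloseHandle(ov.hEvent);
        CloseHandle(h);
        return 0;
    }

In your case you would keep several such reads in flight, processing file i while files i+1 .. i+k are being read.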
To work around this, we now first read all the files and store their data in RAM, and only then do the operation (utilizing the CPU).
(Assuming files can be processed independently...)
You are half-way there. Instead of waiting until all files have been loaded to RAM, start processing as soon as any file is loaded. That would be a form of pipelining.
You'll need three components:
A thread [1] that reads files (the "producer").
A thread [2] that processes the files (the "consumer").
A message queue [3] between them.
The producer reads the files the way you are already doing it, but instead of processing them, it just enqueues them onto the message queue. The consumer thread waits until it can dequeue a file from the queue, processes it, then immediately frees the memory the file occupied and resumes waiting on the queue. (A minimal sketch of such a queue follows the notes below.)
In case you can process files by sequentially traversing them start-to-finish, you could even devise a more fine-grained "streaming", where files would be both read and processed in chunks, which could lower the peak memory consumption even more (e.g., if you have some extra-large files that would no longer need to be kept whole in memory).
[1] Or a set of threads to parallelize the I/O, if you anticipate reading from multiple physical disks.
[2] Or a set of threads to saturate the CPU cores, if processing a file is not cheaper than reading it.
[3] You don't need a fancy persistent distributed message queue for that. Just a straight in-memory queue, à la BlockingCollection in .NET (I'm sure you'll find something similar for pure C).
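For illustration, a minimal bounded in-memory queue in C using Win32 primitives (since you mentioned Visual Studio; names are illustrative and error handling is trimmed):

    #include <windows.h>

    #define QCAP 16

    typedef struct { void *slots[QCAP]; int head, tail, count; } queue_t;

    static queue_t q;
    static CRITICAL_SECTION q_lock;
    static CONDITION_VARIABLE not_empty, not_full;

    void q_init(void) {
        InitializeCriticalSection(&q_lock);
        InitializeConditionVariable(&not_empty);
        InitializeConditionVariable(&not_full);
    }

    void q_put(void *file_data) {            /* called by the reader thread */
        EnterCriticalSection(&q_lock);
        while (q.count == QCAP)              /* queue full: wait for consumer */
            SleepConditionVariableCS(&not_full, &q_lock, INFINITE);
        q.slots[q.tail] = file_data;
        q.tail = (q.tail + 1) % QCAP;
        q.count++;
        LeaveCriticalSection(&q_lock);
        WakeConditionVariable(&not_empty);
    }

    void *q_get(void) {                      /* called by the worker thread */
        EnterCriticalSection(&q_lock);
        while (q.count == 0)                 /* queue empty: wait for producer */
            SleepConditionVariableCS(&not_empty, &q_lock, INFINITE);
        void *data = q.slots[q.head];
        q.head = (q.head + 1) % QCAP;
        q.count--;
        LeaveCriticalSection(&q_lock);
        WakeConditionVariable(&not_full);
        return data;
    }

The bounded capacity also acts as backpressure: if the disk gets ahead of the CPU, the reader blocks instead of filling RAM with unprocessed files.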
Create threads (in a loop) which read files into RAM.
Work with the data in RAM in separate thread[s] and free the RAM after processing.
Keep limits and a pool of records about the files (read and processed) in a shared object protected by a mutex.
Use a semaphore to synchronize production and consumption of the resources (files in RAM).

Execute Large C Program By Generating Intermediate Stages

I have an algorithm that takes 7 days to run to completion (and a few more algorithms like it).
Problem: in order to run the program to completion I need a continuous power supply, and if there is a power loss in the middle, I have to restart it from the beginning.
So I would like to ask for a way to make my program execute in phases (say each phase generates results A, B, C, ...) so that in case of a power loss I can somehow use these intermediate results and resume the run from that point.
Problem 2: how do I prevent a file from being reopened every time a loop iterates? (fopen was placed in a loop that runs nearly a million times; this was needed because the file is changed on each iteration.)
You can separate it in some source files, and use make.
When each result phase is complete, branch off to a new universe. If the power fails in the new universe, destroy it and travel back in time to the point at which you branched. Repeat until all phases are finished, and then merge your results into the original universe via a transcendental wormhole.
Well, couple of options, I guess:
You split your algorithm along sensible lines, with a defined output from each phase that can be the input to the next phase. Then configure your algorithm as a workflow (ideally soft-configured through some declaration file).
You add logic to your algorithm by which it knows what it has successfully completed (committed). Then, on failure, you can restart the algorithm, and it bins all uncommitted data and restarts from the last commit point (see the checkpoint sketch after the note below).
Note that both these options may draw out your 7-day run time further!
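As a hedged sketch of the commit idea in C (the state layout, file names, and checkpoint interval are placeholders): state is written to a temporary file and then renamed, so a power cut can never leave a half-written checkpoint behind:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { long iteration; double partial_result; } state_t;

    int save_checkpoint(const state_t *s) {
        FILE *f = fopen("checkpoint.tmp", "wb");
        if (!f) return -1;
        fwrite(s, sizeof *s, 1, f);
        fflush(f);      /* pushes to the OS; for durability against power
                           loss also flush OS buffers (e.g., _commit on Windows) */
        fclose(f);
        remove("checkpoint.dat");   /* rename() on Windows won't overwrite */
        return rename("checkpoint.tmp", "checkpoint.dat");
    }

    int load_checkpoint(state_t *s) {
        FILE *f = fopen("checkpoint.dat", "rb");
        if (!f) return -1;                   /* no checkpoint: start fresh */
        size_t ok = fread(s, sizeof *s, 1, f);
        fclose(f);
        return ok == 1 ? 0 : -1;
    }

    int main(void) {
        state_t s = { 0, 0.0 };
        load_checkpoint(&s);                 /* resume if a checkpoint exists */
        for (; s.iteration < 1000000; s.iteration++) {
            s.partial_result += 1.0;         /* placeholder for real work */
            if (s.iteration % 10000 == 0)
                save_checkpoint(&s);         /* commit point */
        }
        return 0;
    }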
So, to improve the overall runtime, you could also separate your algorithm so that it has "worker" components that can work on "jobs" in parallel. This usually means drawing out some "dumb" but intensive logic (such as a computation) that can be parameterised. Then you have the option of running your algorithm on a grid/space/cloud/whatever, and at least you have options to reduce the run time. It doesn't even need to be a space... just use queues (IBM MQ Series has a C interface) and have listeners on other boxes listening to your job queue, processing the jobs, and persisting the results. You can still phase the algorithm as discussed above too.
Problem 2: Opening the file on each iteration of the loop because it's changed
I may not be best qualified to answer this, but doing fopen (and fclose) on each iteration certainly seems wasteful and slow. To answer properly, or to let anyone more qualified answer, I think we'd need to know more about your data.
For instance:
Is it text or binary?
Are you processing records or a stream of text? That is, is it a file of records or a stream of data? (you aren't cracking genes are you? :-)
I ask because, judging by your comment "because it's changed each iteration", you might be better off using a random-access file. By this, I'm guessing you're re-opening the file to fseek to a point that you may have already passed (in your stream of data) and making a change there. However, if you open the file in binary mode, you can seek anywhere in it using fseek and fsetpos; that is, you can "seek" backwards.
Additionally, if your data is record-based or somehow organised, you could also create an index for it. With this, you could use fsetpos to set the file pointer at the record you're interested in and traverse from there, saving time in finding the area of data to change. You could even persist your index in an accompanying index file.
Note that you can write plain text to a binary file. Perhaps worth investigating?
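To make that concrete for Problem 2, here is a sketch in C that opens the file once in update mode and seeks to the record to change (RECORD_SIZE and the record layout are assumptions for illustration):

    #include <stdio.h>

    #define RECORD_SIZE 64L

    /* Overwrite record `index` in place; the file stays open throughout. */
    int update_record(FILE *f, long index, const char rec[RECORD_SIZE]) {
        if (fseek(f, index * RECORD_SIZE, SEEK_SET) != 0) return -1;
        return fwrite(rec, RECORD_SIZE, 1, f) == 1 ? 0 : -1;
    }

    int main(void) {
        FILE *f = fopen("data.bin", "r+b");  /* read+write, binary, no truncation */
        if (!f) return 1;
        char rec[RECORD_SIZE] = { 0 };
        for (long i = 0; i < 1000000; i++) {
            /* ... fill rec for iteration i ... */
            update_record(f, i % 1000, rec); /* placeholder target index */
        }
        fclose(f);                           /* one open/close for the whole run */
        return 0;
    }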
Sounds like a classic batch-processing problem to me.
You will need to define checkpoints in your application and store the intermediate data until a checkpoint is reached.
Checkpoints could be the row number in a database, or the position inside a file.
Your processing might take longer than now, but it will be more reliable.
In general you should think about the bottleneck in your algorithm.
For problem 2, you should use two files; it might be that your application will be days faster if you call fopen a million times less...
