I am learning about big data in my class and right now we are learning about Hive. We learned about mappers and reducers today, but honestly it went way over my head. Could someone explain to me what the mapper and the reducer do in each step, or at least point me to some good readings? Thanks in advance.
Let's try to understand the MapReduce flow using the diagram above, which I downloaded from the internet.
We are going to discuss the word count problem in Hadoop, which is also known as the "Hello World" of Hadoop.
Word Count is a program where we find the number of occurrences of each word in a file.
Let's go through it step by step.
Step 1)
Input file: we need some data on which to run the word count program. To run it on the cluster, the first step is to put this file on HDFS, which can be done in various ways; the easiest is to use the Hadoop shell commands put or copyFromLocal, for example: hadoop fs -put <local file> <HDFS path>
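If you would rather do this from code than from the shell, Hadoop's FileSystem API can copy the file as well; a minimal sketch (the paths are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS (whatever fs.defaultFS points to)
        // copy a local file into HDFS; both paths are only examples
        fs.copyFromLocalFile(new Path("/tmp/wordcount.txt"), new Path("/input/wordcount.txt"));
        fs.close();
    }
}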
Step 2)
MapReduce talks in terms of key-value pairs: the mapper gets its input as key-value pairs, does the required processing, and produces an intermediate result, again as key-value pairs, which becomes the input for the reducer; the reducer works on that further and finally writes its own output as key-value pairs as well.
But we know the mapper executes just after the main driver program, so who provides the input to the mapper in the form of key-value pairs? The InputFormat does this for you.
InputFormat is the class which does two major things:
1) Input splitting: the number of mapper instances is driven by the number of input splits. With the default configuration, one split is equivalent to one block, but you can change the split size as per your need.
So if you are playing with, say, 512 MB of data and your block size is 64 MB, about 8 input splits will be created, and therefore 8 mapper instances will run (8 mappers will be used).
2) Breaking the data into key-value pairs (the RecordReader is the class which does this behind the scenes).
Now what would be key and value for a mapper , that would be driven by the file input format you use , for instance for TextInputFormat which is the mostly used inputformat. it sends longWritable(equivalent to long) as a key and Text(string) as a value to mapper
Your mapper class would work on 1 split , in class you have a map function which would work on a single line at a time so as we can see from the above diagram single line would go the map function
for example it send : "Apple orange Mango" to map function
3) Mapper
In the mapper we get a line as input, so now we need to write our logic.
We break the line into words based on a delimiter (whitespace), so now we have one word per record.
As we know, map works on key-value pairs, so we can take the word as the key and 1 as the value.
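A minimal sketch of such a mapper in Java, using the standard Hadoop MapReduce API (the class name is just for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value come from TextInputFormat: the byte offset of the line and the line itself.
// Output key/value are our choice: the word and a count of 1.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // break the line into words on whitespace and emit (word, 1) for each
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}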
Why have we taken the word as the key and not the other way round? Because of the next phase:
Shuffling and sorting phase: in this phase the framework groups the intermediate records by key, i.e. all the values belonging to the same key come together, and the keys are sorted.
Now let's recap: initially we had one file, which was sent to different mappers based on the input splits. Then, in the mapper class, the map function received one line at a time as input, so we built our logic with respect to one line; all the lines of a split are handled the same way by one mapper instance, and all the instances work in parallel like this.
Now let's say you have 10 mappers running. In MapReduce the number of reducers is typically much smaller than the number of mappers, so if 10 mappers were used, most likely 2-3 reducers would be used.
In the shuffling and sorting phase, as we have seen, all the values for the same key are clubbed together.
First of all, on what basis is it decided which mapper's data goes to which reducer?
In our case the output of 10 mappers has to be divided between 2 reducers, so on what basis is that decided?
There is a component called the Partitioner which decides which mapper output goes to which reducer: it hashes the key and applies the modulo operator with the number of reducers.
Since we are hashing the key, it is 100% sure that all records with the same key go to the same reducer.
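Under the hood this is exactly what Hadoop's default HashPartitioner does; an equivalent custom partitioner would look roughly like this (the class name is just for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask off the sign bit so the modulo never goes negative;
        // the same key always hashes to the same value, so it always lands on the same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}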
We don't have to bother about any of this, as the framework has been designed to do it efficiently. But since it is written in Java, we do have the flexibility to plug in our own components as per need, such as a custom key, a custom Partitioner, a custom comparator, and so on.
4) Reducer
Now the reducer will get a key and the list of its values as input, something like this:
Apple, <1,1,1,1>
Now in the reducer we write the logic for what exactly we want to do. In our case we want a word count, so we simply have to sum the values.
That is also the reason we took 1 as the value in the map phase: we simply had to count, i.e. add up the 1s.
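A matching reducer sketch in Java (again, the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each word the framework hands us the key plus the list of 1s emitted for it;
// summing that list gives the word count.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));   // e.g. (Apple, 4)
    }
}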
Output: the final output is written to HDFS by the reducer, again as key-value pairs.
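For completeness, a typical driver would wire these pieces together roughly as below; this is a sketch that assumes the illustrative class names above and made-up HDFS paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // the input file we put on HDFS in step 1 and an output directory that must not exist yet
        FileInputFormat.addInputPath(job, new Path("/input/wordcount.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}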
I wrote my model in the GUI and want to run it with repetitions in headless mode on a cluster.
The model has a go command that is repeated until we reach a specified step (at each go procedure, the year variable is incremented, and when we reach 2070 the model stops running). At the end of the go procedure, the world is exported (and analysed in R).
If I run multiple repetitions on parallel cores, how can I export the worlds so that they have different names?
So far, I export the world with the following lines (when running only one repetition):
let file-name (word scenario "_NETLOGO_" year ".csv")
export-world (file-name)
But if the model is run at the same time on several cores, there will be overlap and I would not know which file came from which repetition (assuming the name would just get an extra (1) appended). I thought about creating folders to save the worlds; is that possible? If so, how can I vary the folder name according to the repetition number?
I have Planner buckets (Tasks, In progress, Backlog) and I want to create a new bucket depending on elements in a SharePoint list.
But I can't save the bucket names to an array, add the missing value (e.g. "On hold") and then go through the array again; it always sets my array back to blank.
Maybe you can help me. Here is my Power Automate flow so far:
You could use a couple of Select actions to create two arrays which you can use for the comparison. With a Filter Array you can find the ones which are missing. The output of that Filter Array can be used to loop through and create the missing buckets.
Below is an example of that approach.
1. In both Select actions the Map field is switched to text mode so that you only get arrays of values (without keys)
2. In the Filter Array, this is used in the From field:
body('Select_-_SharePoint_Items')
3. And this is the expression used in the advanced mode
@not(contains(body('Select_-_Existing_Buckets'), item()))
I have records in a flat file that have what you might consider to be master records, with detail records following the master they relate to, until the next master record.
Here is an example:
Order123, Customer1, 1/1/2018
Item1, $1
Item2, $1
Order124, Customer2, 1/1/2018
Item1, $1
Item4, $2
The file has no line numbers or any kind of sequencing built in, nor does it use foreign keys to relate master to detail.
If I were to use SSIS to import the raw TXT data into a flexible table with columns designed to take various data types (e.g. nvarchar(255) or similar), I could iterate over the values after the import and relate the values in Line #2 and Line #3 with Order123, and consequently Lines #5 and #6 with Order124.
The table holding the raw data will use a simple RecordID identity column with an integer incrementing by one.
It doesn't really matter, but if you're curious, the actual data I'm referring to is the Retrosheet data event files. It's a collection of all Major League Baseball data. A real file could be downloaded from a link on this page:
https://www.retrosheet.org/game.htm
I seem to recall that you could not import TXT data into a table and expect the order of the rows to match the order of the TXT lines. When I do small tests of this, however, the records do appear in the same order as the source file. I suspect that my small test results were too good to be true and not a fail-safe prediction of how it will turn out.
In summary:
How do I use SSIS to import data, inserting SQL records in the same order as the original flat file?
The answer is yes, flat files are processed in order, as long as you aren't applying any kind of sorting.
I was able to process the Retrosheet files by creating a table in my DB that had an identity column, and a varchar column long enough to hold each line of the file (I chose 100). I then set up my flat file connection with Ragged Right formatting, defining the row delimiter as {CR}{LF}.
I just typed this out so there might be a few errors in syntax but it should get you close.
You will need to set up 2 different outputs.
Order of load will not matter as you are adding a foreign key to the detail table.
// Script Component (Source) with two outputs: Output0 = orders, Output1 = order items
public class ScriptMain : UserComponent
{
    private string orderNo;   // declared on the OUTSIDE of the loop so detail rows can pick up the current order

    public override void CreateNewOutputRows()
    {
        // [filename]: path to your flat file
        string[] lines = System.IO.File.ReadAllLines([filename]);

        foreach (var line in lines)
        {
            string[] cols = line.Split(',');

            if (cols.Length == 3)    // master record: OrderNo, Customer, OrderDate
            {
                orderNo = cols[0];

                Output0Buffer.AddRow();
                Output0Buffer.OrderNo = cols[0];
                Output0Buffer.Customer = cols[1];
                Output0Buffer.OrderDate = DateTime.Parse(cols[2].Trim());
            }
            else                     // detail record: Item, Amount
            {
                Output1Buffer.AddRow();
                Output1Buffer.OrderNo = orderNo;   // link back to the current master record
                Output1Buffer.Item = cols[0];
                Output1Buffer.Amt = cols[1];       // this needs to be parsed later
            }
        }
    }
}
FOLLOW UP:
I just reviewed the site you are trying to download from, and the file is more complicated than you let on in your question.
Split still seems safe to use, but you will have to trim some quote-wrapped strings (names). It looks like there are no quote-wrapped commas (at least in the examples); if there were, you would need to use a regex to split instead.
I would change the logic to a switch/case based on cols[0] being one of the 8 record types.
Save the ID on the outside and write to each of the 7 other possible datasets this creates, for linkage to the parent. You will have to use the same strategy for other records that need to be tied to a different parent (I think comment is an example).
GOOD LUCK with all this. Plays do not look easy to interpret!
I am working on my first SSIS package. I have a view with data that looks something like:
Loc Data
1 asd
1 qwe
2 zxc
3 jkl
And I need all of the rows to go to different files based on the Loc value. So all of the data rows where Loc = 1 should end up in the file named Loc1.txt, and the same for each other Loc.
It seems like this can be accomplished with a conditional split to flat file destinations, but that would require a destination for each Location. I have a lot of Locations, and they will all be handled the same way other than being split into different files.
Is there a built-in way to do this without creating a bunch of destination components? Or can I at least use a Script Component to act as the destination?
You should be able to set an expression on the flat file connection string using a variable. Define your path up to the directory and then append the variable that holds that column's value.
You'll need an Execute SQL Task that returns the set of locations as a result set, and a Foreach Loop container that loops over every row in that result set.
I don't have access at the moment to post screenshots, but this link should help outline the steps.
So when your package runs, the expression will look like:
"C:\\Documents\\MyPath\\location" + @[User::LocationColumn] + ".txt"
It should end up feeding your directory with one file per location.
Set User::LocationColumn equal to the Location column in your result set, and write your source query to group/order by Location so that all records for a Location are written to a single file.
I spent some time trying to complete this task using the method @Phoenix suggested, but stumbled upon this video along the way.
I ended up going with the method shown in the video. I was hoping I wouldn't have to separate it into multiple select statements, one for each location plus an extra one to grab the distinct locations, but I thought the SSIS implementation in the video was much cleaner than the alternative.
Change the connection manager's connection string to an expression that uses a variable; by varying the variable, the destination file also changes.
The connection string expression is:
"C:\\Documents\\ABC\\Files\\" + @[User::data] + ".txt"
I'm trying to create an SSIS package to import some dataset files. However, given that I seem to be hitting a brick wall every time I achieve a small part of the task, I need to take a step back and perform a sanity check on what I'm trying to achieve. If you good people can advise whether SSIS is the way to go about this, I would appreciate it.
These are my questions from this morning :-
debugging SSIS packages - debug.writeline
Changing an SSIS dts variables
What I'm trying to do is have a For Each container enumerate the files in a share on the SQL Server. For each file it finds, a Script Task runs to check various attributes of the filename, such as looking for a three-letter code, a date in CCYYMM format, the name of the data contained therein, and optionally some comments. For example:
ABC_201007_SalesData_[optional comment goes here].csv
I'm looking to parse the name using a regular expression and put the values 'ABC', '201007', and 'SalesData' into variables.
I then want to move the file to an error folder if it doesn't meet certain criteria :-
Three character code
Six character date
Dataset name (e.g. SalesData, in this example)
CSV extension
I then want to look up the character code, the date (or part thereof), and the dataset name against a lookup table to mark off a 'checklist' of received files from each client.
Then, based on the entry in the checklist, I want to kick off another SSIS package.
So, for example I may have a table called 'Checklist' with these columns :-
Client code Dataset SSIS_Package
ABC SalesData NorthSalesData.dtsx
DEF SalesData SouthSalesData.dtsx
If anyone has a better way of achieving this I am interested in hearing about it.
Thanks in advance
That's an interesting scenario, and should be relatively easy to handle.
First, your choice of the Foreach Loop is a good one. You'll be using the Foreach File Enumerator. You can restrict the files you iterate over to be just CSVs so that you don't have to "filter" for those later.
The Foreach File Enumerator puts the file name (full path or just the file name) into a variable - let's call that "FileName". There are (at least) two ways you can parse that - expressions or a Script Task; it depends which one you're more comfortable with. Either way, you'll need to create three variables to hold the "parts" of the file name - I'll call them "FileCode", "FileDate", and "FileDataset".
To do this with expressions, you need to set the EvaluateAsExpression property on FileCode, FileDate, and FileDataset to true. Then in the expressions, you need to use FINDSTRING and SUBSTRING to carve up FileName as you see fit. Expressions don't have Regex capability.
To do this in a Script Task, pass the FileName variable in as a ReadOnly variable, and the other three as ReadWrite. You can use the Regex capabilities of .Net, or just manually use IndexOf and Substring to get what you need.
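Just to make the parsing concrete, here is a rough sketch of the pattern (written in Java purely for illustration; inside the Script Task you would express the same thing with .NET's Regex class, and the pattern and variable names are only assumptions based on the example file name above):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileNameParser {
    // three-letter code, CCYYMM date, dataset name, optional comment, .csv extension
    private static final Pattern NAME =
            Pattern.compile("^([A-Za-z]{3})_(\\d{6})_([A-Za-z0-9]+)(?:_(.*))?\\.csv$");

    public static void main(String[] args) {
        Matcher m = NAME.matcher("ABC_201007_SalesData_optional comment goes here.csv");
        if (m.matches()) {
            String fileCode    = m.group(1);   // "ABC"
            String fileDate    = m.group(2);   // "201007"
            String fileDataset = m.group(3);   // "SalesData"
            System.out.println(fileCode + " / " + fileDate + " / " + fileDataset);
        } else {
            // the file name doesn't match the convention -> candidate for the error folder
            System.out.println("Invalid file name");
        }
    }
}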
Unfortunately, you have just missed the SQLLunch livemeeting on the ForEach loop: http://www.bidn.com/blogs/BradSchacht/ssis/812/sql-lunch-tomorrow
They are recording the session, however.