Using SUM() in PIG Latin - loops

I am just starting to write some scripts in Pig, and I am trying to SUM an int column. My script looks like this:
DATA = LOAD 'SomeFile' as (fingerPrint, size, str1, str2);
groupedChunks = GROUP DATA BY fingerPrint;
uniqueChunks = FILTER groupedChunks BY COUNT(DATA)==1;
sizes = FOREACH uniqueChunks GENERATE MAX($.size) as size;
Now I have a table with just one column, the size column. If I call DESCRIBE, it generates this output: sizes:{size: int}
Now I need help with this step: how do I get the SUM of all the sizes in this column?

Can you try this?
result = FOREACH (GROUP sizes ALL) GENERATE SUM(sizes);
DUMP result;
UPDATE: full code
input.txt
a 1 b c
d 2 e f
PigScript:
DATA = LOAD 'input.txt' as (fingerPrint, size, str1, str2);
groupedChunks = GROUP DATA BY fingerPrint;
uniqueChunks = FILTER groupedChunks BY COUNT(DATA)==1;
sizes = FOREACH uniqueChunks GENERATE MAX(DATA.size) as size;
result = FOREACH (GROUP sizes ALL) GENERATE SUM(sizes);
DUMP result;
Output:
(3.0)

V = GROUP DATA ALL;
result = FOREACH V GENERATE SUM(DATA.size);
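Note that this version groups the original DATA relation, so it sums size over every row, not just the chunks kept by the FILTER step. A minimal sketch of the same GROUP ... ALL pattern applied to the filtered sizes relation, with the field dereferenced explicitly (the alias names grpd and total are illustrative), would be:
-- illustrative sketch: sum only the sizes kept by the FILTER step
grpd = GROUP sizes ALL;
total = FOREACH grpd GENERATE SUM(sizes.size);
DUMP total;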

Related

Csv file to a Lua table and access the lines as new table or function()

Currently my code has simple tables containing the data needed for each object, like this:
infantry = {class = "army", type = "human", power = 2}
cavalry = {class = "panzer", type = "motorized", power = 12}
battleship = {class = "navy", type = "motorized", power = 256}
I use the table names as identifiers in various functions to have their values processed one by one, by simply calling a function that accesses the values.
Now I want to have this data stored in a spreadsheet (csv file) instead, which looks something like this:
Name class type power
Infantry army human 2
Cavalry panzer motorized 12
Battleship navy motorized 256
The spreadsheet will not have more than 50 lines, and I want to be able to add more columns in the future.
I tried a couple of approaches from similar situations I found here, but due to my lacking skills I failed to access any values from the nested table. I think this is because I don't fully understand what the table structure looks like after reading each line of the csv file into the table, and therefore I fail to print any values at all.
If there is a way to get the name, class, type and power from the table and use each line just like my old simple tables, I would appreciate an educational example. Another approach could be to declare new tables from the csv that behave exactly like my old simple tables, line by line from the csv file. I don't know if this is doable.
Using Lua 5.1
You can read the csv file in as a string. I will use a multi-line string here to represent the csv.
gmatch with pattern [^\n]+ will return each row of the csv.
gmatch with pattern [^,]+ will return the value of each column from a given row.
If more rows or columns are added, or if the columns are moved around, we will still reliably convert the information, as long as the first row has the header information.
The only column that cannot move is the first one, the Name column; if that is moved, it will change the key used to store the row in the table.
Using gmatch and the two patterns [^,]+ and [^\n]+, you can separate the string into each row and column of the csv. Comments in the following code:
local csv = [[
Name,class,type,power
Infantry,army,human,2
Cavalry,panzer,motorized,12
Battleship,navy,motorized,256
]]
local items = {} -- Store our values here
local headers = {} -- Store the column names from the header row here
local first = true
for line in csv:gmatch("[^\n]+") do
  if first then -- this is to handle the first line and capture our headers
    local count = 1
    for header in line:gmatch("[^,]+") do
      headers[count] = header
      count = count + 1
    end
    first = false -- set first to false to switch off the header block
  else
    local name
    local i = 2 -- we start at 2 because we do not increment for the Name header
    for field in line:gmatch("[^,]+") do
      name = name or field -- the first field on the line is the row's name
      if items[name] then -- if the name is already in the items table then this is a field
        items[name][headers[i]] = field -- assign our value at the header in the table with the given name
        i = i + 1
      else -- if the name is not in the table we create a new index for it
        items[name] = {}
      end
    end
  end
end
Here is how you can load a csv using the I/O library:
-- Example of how to load the csv.
path = "some\\path\\to\\file.csv"
local f = assert(io.open(path))
local csv = f:read("*all")
f:close()
Alternatively, you can use io.lines(path), which would take the place of csv:gmatch("[^\n]+") in the for loop above as well.
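For example, a minimal sketch of that substitution (assuming path holds the location of your csv file; the body of the loop stays the same as in the code above):
-- iterate over the file line by line instead of splitting a string
for line in io.lines(path) do
  -- same header/field handling as in the csv:gmatch("[^\n]+") loop above
end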
Here is an example of using the resulting table:
-- print table out
print("items = {")
for name, item in pairs(items) do
print(" " .. name .. " = { ")
for field, value in pairs(item) do
print(" " .. field .. " = ".. value .. ",")
end
print(" },")
end
print("}")
The output:
items = {
Infantry = {
type = human,
class = army,
power = 2,
},
Battleship = {
type = motorized,
class = navy,
power = 256,
},
Cavalry = {
type = motorized,
class = panzer,
power = 12,
},
}

Combine 2 rows of CSV file in SSIS

I have one CSV file where the information is spread over two lines.
Line 1 contains name and age.
Line 2 contains details like address, city, salary, occupation.
I want to combine the 2 rows to insert them into a database.
CSV file :
Raju, 42
12345 west andheri,Mumbai, 100000, service
In SQL Server I can do this by using a cursor, but I have to do it in SSIS.
For a similar case, I would read each line as one column and use a script component to fix the structure. You can follow my answer on the following question; it contains a step-by-step guide:
SSIS reading LF as terminator when its set as CRLF
I like using a script component in order to be able to store data from a different row in this case.
Read the file as a single-column CSV into Column1.
Add a script component and add a new Output called CorrectedOutput, and define all columns from both rows. Also, mark Column1 as read.
Create 2 variables outside of row processing to 'hold' the first row:
string name = string.Empty;
string age = string.Empty;
Use a split to determine line 1 or line 2
string[] str = Row.Column1.Split(',');
Use an if to determine row 1 or 2
if (str.Length == 2)
{
name = str[0];
age = str[1];
}
else
{
CorrectedOutputBuffer.AddRow();
CorrectedOutputBuffer.Name = name; //This uses the stored value from prior row
CorrectedOutputBuffer.Age = age; //This uses the stored value from prior row
CorrectedOutputBuffer.Address = str[0];
CorrectedOutputBuffer.City = str[1];
CorrectedOutputBuffer.Salary = str[2];
CorrectedOutputBuffer.Occupation = str[3];
}
The overall effect is this...
On Row 1, you just hold the data in variables
On Row 2, you write out the data to 1 new row.

Add datetime format to cell array in Matlab

I have a cell array that I call "Table", as in the code below (but my array has more lines). Column 1 contains dates in string format. I want to add an additional column that contains the dates in datetime format. I did the following, which works, but it is VERY slow. What are the alternatives?
% Table that I have:
Table{1,1} = 'Stringdate';
Table{2,1} = '01.01.1999';
Table{3,1} = '02.01.1999';
Table{4,1} = '03.01.1999';
Table{5,1} = '04.01.1999';
% What I want to add:
Table{1, size(Table,2)+1} = 'Datetime';
for index = 2:length(Table)
Table{index, size(Table,2)} = datetime(Table{index, 1});
end
You can apply datetime to all of them in one go, and use just num2cell and indexing to achieve the same result as that of your loop.
Table(2:end,2) = num2cell(datetime(Table(2:end,1)));
%You might need to specify the InputFormat as well i.e.
%Table(2:end,2) = num2cell(datetime(Table(2:end,1),'InputFormat','dd.MM.yyyy'));

Splitting column with XML data

I have a SQL column named "details" and it contains the following data:
<changes><RoundID><new>8394</new></RoundID><RoundLeg><new>JAYS CLOSE AL6 Odds(1 - 5)</new></RoundLeg><SortType><new>1</new></SortType><SortOrder><new>230</new></SortOrder><StartDate><new>01/01/2009</new></StartDate><EndDate><new>01/01/2021</new></EndDate><RoundLegTypeID><new>1</new></RoundLegTypeID></changes>
<changes><RoundID><new>8404</new></RoundID><RoundLeg><new>HOLLY AREA AL6 (1 - 9)</new></RoundLeg><SortType><new>1</new></SortType><SortOrder><new>730</new></SortOrder><StartDate><new>01/01/2009</new></StartDate><EndDate><new>01/01/2021</new></EndDate><RoundLegTypeID><new>1</new></RoundLegTypeID></changes>
<changes><RoundID><new>8379</new></RoundID><RoundLeg><new>PRI PARK AL6 (1 - 42)</new></RoundLeg><SortType><new>1</new></SortType><SortOrder><new>300</new></SortOrder><StartDate><new>01/01/2009</new></StartDate><EndDate><new>01/01/2021</new></EndDate><RoundLegTypeID><new>1</new></RoundLegTypeID></changes>
What is the easiest way to separate this data out into individual columns? (That is all in one column.)
Try this:
SELECT DATA.query('/changes/RoundID/new/text()') AS RoundID
,DATA.query('/changes/RoundLeg/new/text()') AS RoundLeg
,DATA.query('/changes/SortType/new/text()') AS SortType
-- And so on and so forth
FROM (SELECT CONVERT(XML, Details) AS DATA
FROM YourTable) AS T
Once you get your result set from the SQL (MySQL or whatever) you will probably have an array of strings. As I understand your question, you want to know how to extract each of the XML nodes contained in the string stored in the column in question. You could loop through the results from the SQL query and extract the data that you want. In PHP it would look like this:
// Set a counter variable for the first dimension of the array, this will
// number the result sets. So for each row in the table you will have a
// number identifier in the corresponding array.
$i = 0;
$output = array();
foreach($results as $result) {
$xml = simplexml_load_string($result);
// Here use simpleXML to extract the node data, just by using the names of the
// XML Nodes, and give it the same name in the array's second dimension.
$output[$i]['RoundID'] = $xml->RoundID->new;
$output[$i]['RoundLeg'] = $xml->RoundLeg->new;
// Simply create more array items here for each of the elements you want
$i++;
}
foreach ($output as $out) {
// Step through the created array do what you like with it.
echo $out['RoundID']."\n";
var_dump($out);
}

Matlab read csv string array

I have a comma-separated dataset called Book2.csv and I want to extract the contents. The contents are a 496024x1 array of strings (normal, neptune, smurf).
I tried:
[text_data] = xlsread('Book2.csv');
But it just output an empty text_data array.
When trying csvread:
M = csvread('Book2.csv')
??? Error using ==> dlmread at 145
Mismatch between file and format string.
Trouble reading number from file (row 1, field 1) ==>
norma
Error in ==> csvread at 54
m=dlmread(filename, ',', r, c);
I get this error. Can anyone help?
Off the top of my head this should get the job done, but it is possibly not the best way to do it.
fid = fopen('Book2.csv'); % open the file
% read all contents into data as a char array
% (don't forget the ' to make it a row rather than a column)
data = fread(fid, '*char')';
fclose(fid);
% This will return a cell array with the individual
% entries for each string you have between the commas.
entries = regexp(data, ',', 'split');
Try something like textread:
data = textread('data.csv', '', 'delimiter', ',', ...
'emptyvalue', NaN);
The easiest way for me is:
path = 'C:\folder1\folder2\';
data = 'data.csv';
data = dataset('xlsfile', fullfile(path, data));
Of course you could also do the following:
[data, path] = uigetfile('C:\folder1\folder2\*.csv');
data = dataset('xlsfile', fullfile(path, data));
Now you will have the data loaded as a dataset.
An easy way to get column 1, for example, is
double(data(1))
