Iterating through nested loops and calculating a total

I have to iterate through two arrays and calculate the salary total for each matching employee name. I have two arrays: empData: [Emp1, Emp2] and
salData: [[name: Emp1, sal: 1000], [name: Emp2, sal: 5000], [name: Emp1, sal: 6000], [name: Emp1, sal: 7000]]. I need to loop through empData and salData, calculate the sum of the salaries for each matching employee name, and then push both the name and the corresponding total into an array.
double total
empData.each { x ->
    salData.each { y ->
        if (y.name == x) {
            total = total + y.sal;
        }
    }
}
But I get the error: Cannot cast object 'null1000.0' with class 'java.lang.String' to class 'java.lang.Double'. If I declare total as a String, the result is the concatenation of the sal values.

You are not initializing total to anything. It is also possible that your salary values are actually strings; it's hard to tell from what you provided.
Initialize total to 0, and cast y.sal to a double if necessary.

Ugh, loops and loops... have a look at the Groovy collection methods.
// poor emp3 has no salaryData
def employeeNames = ['emp1', 'emp2', 'emp3']
def salaryData = [[name: 'emp1', sal: 1000], [name: 'emp2', sal: 5000], [name: 'emp1', sal: 6000], [name:'emp1', sal: 7000]]
// here's our output array variable
def output = []
// for each employee
// find all the salaryData elements where salaryData.name == employeeName
// using that list, collect just the salary value
// using that list, sum it, adding to an initial value of 0
// append a new entry in output containing the name, and total salary
employeeNames.each { employeeName ->
output << [name: employeeName, totalSalary: salaryData.findAll { sal -> sal.name == employeeName }.collect { sal -> sal.sal }.sum(0)]
}
println output
groovyconsole yields:
[[name:emp1, totalSalary:14000], [name:emp2, totalSalary:5000], [name:emp3, totalSalary:0]]
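For comparison, here is the same group-and-sum pipeline sketched in Python (the variable names are mine; the data mirrors the Groovy example above):

```python
# For each employee, sum the matching salary entries;
# an employee with no entries totals 0 (like emp3 above).
employee_names = ['emp1', 'emp2', 'emp3']
salary_data = [
    {'name': 'emp1', 'sal': 1000},
    {'name': 'emp2', 'sal': 5000},
    {'name': 'emp1', 'sal': 6000},
    {'name': 'emp1', 'sal': 7000},
]

output = [
    {'name': name,
     'totalSalary': sum(row['sal'] for row in salary_data if row['name'] == name)}
    for name in employee_names
]

print(output)
# [{'name': 'emp1', 'totalSalary': 14000}, {'name': 'emp2', 'totalSalary': 5000},
#  {'name': 'emp3', 'totalSalary': 0}]
```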

Local variable referenced before assignment

I have an array of data whose first column contains years; the other 3 columns hold data for 3 different groups, the last of which is carrots.
I am trying to find the year in which the carrot value is highest, by comparing each year's carrot value to the current highest and recording the year on which that value occurs.
I have used identical code for the other 2 columns (with just the word carrot replaced and the index in year[i] changed appropriately) and the code works, but for this column it throws the error "local variable 'carrot_maxyear' referenced before assignment"
def carrot(data):
    year0 = data[0]
    carrotmax = year0[3]
    for year in data:
        if year[3] > carrotmax:
            carrotmax = year[3]
            carrot_maxyear = year[0]
    return carrot_maxyear
Python's builtin max will make this easier:
def carrot(data):
    maxyear = max(data, key=lambda year: year[3])
    return maxyear[0]
This way you don't need the year0 and carrotmax initialization. We need to use the key argument to max because it looks like you want to return year[0] instead of the year[3] value used for the max calculation.
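For instance, with hypothetical sample rows of the form [year, group_a, group_b, carrot]:

```python
# Hypothetical sample data: [year, group_a, group_b, carrot]
data = [
    [2001, 5, 7, 12],
    [2002, 6, 2, 30],
    [2003, 4, 9, 18],
]

# max compares rows by the carrot value (index 3) but returns the
# whole row, so we can then pick out the year (index 0).
best = max(data, key=lambda year: year[3])
print(best[0])  # 2002
```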
Your original version with a fix would look like:
def carrot(data):
    year0 = data[0]
    carrotmax = year0[3]
    carrot_maxyear = year0[0]  # initialize carrot_maxyear outside the loop to avoid the error
    for year in data:
        if year[3] > carrotmax:
            carrotmax = year[3]
            carrot_maxyear = year[0]
    return carrot_maxyear
but IMO the version using max is clearer and more Pythonic.

Querying a CSV::Table to find the item with the most sales between two given dates in a plain old Ruby script

I am trying to find the highest sales between two given dates.
This is my ad_report.csv file, with headers:
date,impressions,clicks,sales,ad_spend,keyword_id,asin
2017-06-19,4451,1006,608,24.87,UVOLBWHILJ,63N02JK10S
2017-06-18,5283,3237,1233,85.06,UVOLBWHILJ,63N02JK10S
2017-06-17,0,0,0,21.77,UVOLBWHILJ,63N02JK10S
...
Below is all the working code I have; it returns the row with the highest value, but not restricted to the given dates.
require 'csv'
require 'date'
# get directory of the current file
LIB_DIR = File.dirname(__FILE__)
# get the absolute path of the ad_report & product_report CSV
# and set to a var
AD_CSV_PATH = File.expand_path('data/ad_report.csv', LIB_DIR)
PROD_CSV_PATH = File.expand_path('data/product_report.csv', LIB_DIR)
# create CSV::Table for ad_report and product_report CSV
ad_report_table = CSV.parse(File.read(AD_CSV_PATH), headers: true)
prod_report_table = CSV.parse(File.read(PROD_CSV_PATH), headers: true)
## finds the row with the highest sales
sales_row = ad_report_table.max_by { |row| row[3].to_i }
At this point I can get the row that has the greatest sale, and all the data from that row, but it is not in the expected range.
Below I am trying to use a range with the preset dates.
## range of date for items between
first_date = Date.new(2017, 05, 02)
last_date = Date.new(2017, 05, 31)
range = (first_date...last_date)
puts sales_row
Below is pseudocode for what I think I am supposed to do, but there is probably a better method.
## check for highest sales
## return sales if between date
## else reject col if
## loop this until it returns date between
## return result
You could do this by creating a range containing two dates and then using Range#cover? to test whether the date is in the range:
range = Date.new(2015, 1, 1)..Date.new(2020, 1, 1)
rows.select do |row|
  range.cover?(Date.parse(row[0]))
end.max_by { |row| row[3].to_i }
Although the Tin Man is completely right in that you should use a database instead.
You could obtain the desired value as follows. I have assumed that the field of interest ('sales') holds integer values. If not, change .to_i to .to_f below.
Code
require 'csv'
def greatest(fname, max_field, date_field, date_range)
  largest = nil
  CSV.foreach(fname, headers: true) do |csv|
    largest = { row: csv.to_a, value: csv[max_field].to_i } if
      date_range.cover?(csv[date_field]) &&
      (largest.nil? || csv[max_field].to_i > largest[:value])
  end
  largest.nil? ? nil : largest[:row].to_h
end
Examples
Let's first create a CSV file.
str =<<~END
date,impressions,clicks,sales,ad_spend,keyword_id,asin
2017-06-19,4451,1006,608,24.87,UVOLBWHILJ,63N02JK10S
2017-06-18,5283,3237,1233,85.06,UVOLBWHILJ,63N02JK10S
2017-06-17,0,0,0,21.77,UVOLBWHILJ,63N02JK10S
2017-06-20,4451,1006,200000,24.87,UVOLBWHILJ,63N02JK10S
END
fname = 't.csv'
File.write(fname, str)
#=> 263
Now find the record within a given date range for which the value of "sales" is greatest.
greatest(fname, 'sales', 'date', '2017-06-17'..'2017-06-19')
#=> {"date"=>"2017-06-18", "impressions"=>"5283", "clicks"=>"3237",
# "sales"=>"1233", "ad_spend"=>"85.06", "keyword_id"=>"UVOLBWHILJ",
# "asin"=>"63N02JK10S"}
greatest(fname, 'sales', 'date', '2017-06-17'..'2017-06-25')
#=> {"date"=>"2017-06-20", "impressions"=>"4451", "clicks"=>"1006",
# "sales"=>"200000", "ad_spend"=>"24.87", "keyword_id"=>"UVOLBWHILJ",
# "asin"=>"63N02JK10S"}
greatest(fname, 'sales', 'date', '2017-06-22'..'2017-06-25')
#=> nil
I read the file line by line (using CSV.foreach) to keep memory requirements to a minimum, which could be essential if the file is large.
Notice that, because the date is in "yyyy-mm-dd" format, it is not necessary to convert two dates to Date objects to compare them; that is, they can be compared as strings (e.g. '2017-06-17' <= '2017-06-18' #=> true).
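The lexicographic-comparison point can be checked quickly; the idea holds in any language with ordinary string comparison (shown here in Python):

```python
# ISO-8601 ("yyyy-mm-dd") dates compare correctly as plain strings,
# because the fields run most-significant-first and are zero-padded.
dates = ['2017-06-19', '2017-06-17', '2017-06-18']

assert '2017-06-17' <= '2017-06-18'
assert sorted(dates) == ['2017-06-17', '2017-06-18', '2017-06-19']

# A string range therefore behaves like a date range:
in_range = [d for d in dates if '2017-06-17' <= d <= '2017-06-18']
print(in_range)  # ['2017-06-17', '2017-06-18']
```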

Remove garbage (#, $) characters from strings, and drop records that contain only garbage, with multiple occurrences across multiple columns

I tried the code below to drop records that contain garbage values, with multiple occurrences across multiple columns. But I also want to remove the garbage characters from strings, with multiple occurrences across multiple columns.
Sample Code :-
filter_list = ['$', '#', '%', '#', '!', '^', '&', '*', 'null']

def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in filter_list]
                                  for elt in x]))
    return reduce(lambda x, y: x and y, remove_garbage, True)

filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
In this example "original" is my original dataframe and "compulsory_fields" is my array of column names.
Sample Input :-
id name salary
# Yogita 1000
2 Neha ##
3 #Jay$deep## 8000
4 Priya 40$00&
5 Bhavana $$%&^
6 $% $$&&
Sample Output :-
id name salary
3 Jaydeep 8000
4 priya 4000
Your requirements are not completely clear to me, but it seems you want to output the records that remain valid after removing the "garbage" characters. You can achieve this by adding a clean_special_characters UDF that removes the special characters before running your filter_udf:
import pyspark.sql.functions as f
from itertools import chain
from functools import reduce  # needed on Python 3, where reduce is no longer a builtin
from pyspark.sql.types import BooleanType, StringType

rdd = sc.parallelize((
    ('#', 'Yogita', '1000'),
    ('2', 'Neha', '##'),
    ('3', '#Jay$deep##', '8000'),
    ('4', 'Priya', '40$00&'),
    ('5', 'Bhavana', '$$%&^'),
    ('6', '$%', '$$&&'),
))
original = rdd.toDF(['id', 'name', 'salary'])

filter_list = ['$', '#', '%', '#', '!', '^', '&', '*', 'null']
compulsory_fields = ['id', 'name', 'salary']

def clean_special_characters(input_string):
    cleaned_input = input_string.translate({ord(c): None for c in filter_list if len(c) == 1})
    if cleaned_input == '':
        return 'null'
    return cleaned_input

clean_special_characters_udf = f.udf(clean_special_characters, StringType())
original = original.withColumn('name', clean_special_characters_udf(original.name))
original = original.withColumn('salary', clean_special_characters_udf(original.salary))

def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in filter_list]
                                  for elt in x]))
    return reduce(lambda x, y: x and y, remove_garbage, True)

filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
This outputs:
+---+-------+------+
| id| name|salary|
+---+-------+------+
| 3|Jaydeep| 8000|
| 4| Priya| 4000|
+---+-------+------+
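The cleaning step can be verified in plain Python, independent of Spark (the single-character entries of filter_list are mapped to None, which str.translate deletes; multi-character entries like 'null' are skipped):

```python
filter_list = ['$', '#', '%', '!', '^', '&', '*', 'null']

def clean_special_characters(input_string):
    # Map each single-character filter entry's ordinal to None,
    # which deletes it from the string.
    cleaned = input_string.translate({ord(c): None for c in filter_list if len(c) == 1})
    return 'null' if cleaned == '' else cleaned

print(clean_special_characters('#Jay$deep##'))  # Jaydeep
print(clean_special_characters('40$00&'))       # 4000
print(clean_special_characters('$$%&^'))        # null
```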

Max value within array of objects

I'm new to Ruby. I'm trying to do the following but haven't succeeded.
I've got an array of objects, let's call it objs. Each object has multiple properties, one of which is a variable that holds a number; let's call it val1. I want to iterate through the array of objects and determine the max value of val1 across all the objects.
I've tried the following:
def init(objs)
  @objs = objs
  @max = 0
  cal_max
end

def cal_max
  @max = @objs.find { |obj| obj.val1 >= max }
  # also tried
  @objs.each { |obj| @max = obj.val1 if obj.val1 >= @max }
end
As I said, I'm just learning about blocks.
Any suggestion is welcome.
Thanks
Let's say you have set up the following model:
class SomeObject
  attr_accessor :prop1, :prop2, :val1

  def initialize(prop1, prop2, val1)
    @prop1 = prop1
    @prop2 = prop2
    @val1 = val1
  end
end
# define your objects from the class above
david = SomeObject.new('David', 'Peters', 23)
steven = SomeObject.new('Steven', 'Peters', 26)
john = SomeObject.new('John', 'Peters', 33)
# define an array of the above objects
array = [david, steven, john]
Then use max_by, passing the comparison into its block as follows, to determine the object with the maximum val1 value. Finally, call val1 on it to get that object's value.
array.max_by {|e| e.val1 }.val1 #=> 33
You may also consider using hashes (negating the need to define a new class), like so:
david = {f_name: 'David', s_name: 'Peters', age: 23}
steven = {f_name: 'Steven', s_name: 'Peters', age: 26}
john = {f_name: 'John', s_name: 'Peters', age: 33}
array = [david, steven, john]
array.max_by { |hash| hash[:age] }[:age] #=> 33
@objs.map(&:val1).max
That will invoke the method on each object, create a new array of the results, and then find the max value. This is shorthand for:
@objs.map { |o| o.val1 }.max
Alternatively, you can select the object with the largest value and then operate on it (as properly recommended by Cary Swoveland below):
@objs.max_by(&:val1).val1
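For readers comparing languages, the same pattern translates directly to Python; this sketch mirrors the Ruby example above (the class and names are just the illustration data from that example):

```python
# Mirrors the Ruby SomeObject example: find the max val1 across objects.
class SomeObject:
    def __init__(self, prop1, prop2, val1):
        self.prop1 = prop1
        self.prop2 = prop2
        self.val1 = val1

objs = [
    SomeObject('David', 'Peters', 23),
    SomeObject('Steven', 'Peters', 26),
    SomeObject('John', 'Peters', 33),
]

# max with a key function is the analogue of Ruby's max_by.
print(max(objs, key=lambda o: o.val1).val1)  # 33

# The map-then-max form mirrors objs.map(&:val1).max:
print(max(o.val1 for o in objs))  # 33
```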

Matlab string manipulation

I need help using MATLAB's 'strtok' to find an ID in a text file and then read in or manipulate the rest of the row where that ID occurs. I also need this function to find (using strtok, preferably) all occurrences of the same ID and group them in some way so that I can compute averages. On to the sample code:
ID list being input:
(This is the KOIName variable)
010447529
010468501
010481335
010529637
010603247......etc.
File with data format:
(This is the StarData variable)
ID>>>>Values
002141865 3.867144e-03 742.000000 0.001121 16.155089 6.297494 0.001677
002141865 5.429278e-03 1940.000000 0.000477 16.583748 11.945627 0.001622
002141865 4.360715e-03 1897.000000 0.000667 16.863406 13.438383 0.001460
002141865 3.972467e-03 2127.000000 0.000459 16.103060 21.966853 0.001196
002141865 8.542932e-03 2094.000000 0.000421 17.452007 18.067214 0.002490
Do not be misled by the examples I posted: the first number is repeated for about 15 lines, then the ID changes, and that holds for an entire set of different IDs; then they are repeated as a whole group again (think [1,2,3],[1,2,3]). The main difference is the values trailing the ID, which I need to average in MATLAB.
My current code is:
function Avg_Koi
    N = evalin('base', 'KOIName');
    file_1 = evalin('base', 'StarData');
    global result;
    for i = 1:size(N)
        [id, values] = strtok(file_1);
        result = result(id);
        result = result(values)
    end
end
Thanks for any assistance.
You let us guess a lot, so I guess you want something like this:
load StarData.txt
IDs = { 010447529;
        010468501;
        010481335;
        010529637;
        010603247;
        002141865 };
L = numel(IDs);
values = cell(L,1);
% iterate through all IDs and create a cell array with a matrix for every ID
for ii = 1:L
    ID = IDs{ii};
    ID_first = find(StarData(:,1) == ID, 1, 'first');
    ID_last  = find(StarData(:,1) == ID, 1, 'last');
    values{ii} = StarData(ID_first:ID_last, 2:end);
end
When you now access the index ii = 6, addressing the ID 002141865,
MatrixOfCertainID6 = values{6};
you get:
0.0038671440 742 0.001121 16.155089 6.2974940 0.001677
0.0054292780 1940 0.000477 16.583748 11.945627 0.001622
0.0043607150 1897 0.000667 16.863406 13.438383 0.001460
0.0039724670 2127 0.000459 16.103060 21.966853 0.001196
0.0085429320 2094 0.000421 17.452007 18.067214 0.002490
... for further calculations.
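Since the original question also asked about averaging per ID, here is a rough sketch of that step (shown in Python purely for illustration; the row values are a small subset of the sample data above): group rows by their first field, then average the remaining columns.

```python
from collections import defaultdict

# Each row: [id, value1, value2]; a hypothetical subset of the sample data.
rows = [
    ['002141865', 3.867144e-03, 742.0],
    ['002141865', 5.429278e-03, 1940.0],
    ['010447529', 4.360715e-03, 1897.0],
]

# Group the value columns by ID.
groups = defaultdict(list)
for row in rows:
    groups[row[0]].append(row[1:])

# Column-wise average of each ID's rows.
averages = {
    star_id: [sum(col) / len(col) for col in zip(*values)]
    for star_id, values in groups.items()
}
print(averages['002141865'][1])  # 1341.0
```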
