Looping in R with tidyverse

I am trying to read multiple csv files with dates in the format dd-mm-yyyy. I want to convert the months into seasons, for which I have used the following code (for one csv file):
x <- data %>%
  dplyr::mutate(year  = lubridate::year(UDATE),
                month = lubridate::month(UDATE),
                day   = lubridate::day(UDATE))

x <- x %>%
  dplyr::mutate(season = dplyr::case_when(
    month %in% c(3, 4, 5, 6)   ~ 'Summer',
    month %in% c(7, 8, 9, 10)  ~ 'Monsoon',
    month %in% c(11, 12, 1, 2) ~ 'Winter'
  ))
Now I want to run this for multiple csv files and export the converted data frames, with month converted into seasons.
Can someone please suggest how to put this in a loop or function that handles multiple csv files?
Thank you

Without data to reproduce the problem it is a little hard to help, so I wrote some generic code that might need a few tweaks.
First, I would create a single function that imports, transforms, and exports the data.
prep_data <- function(file_name) {
  data <-
    read.csv2(file_name) %>%
    dplyr::mutate(
      year   = lubridate::year(UDATE),
      month  = lubridate::month(UDATE),
      day    = lubridate::day(UDATE),
      season = dplyr::case_when(
        month %in% c(3, 4, 5, 6)   ~ 'Summer',
        month %in% c(7, 8, 9, 10)  ~ 'Monsoon',
        month %in% c(11, 12, 1, 2) ~ 'Winter'
      )
    )

  # "new_" prefix so the original .csv is not overwritten
  write.csv2(data, file = paste0("new_", file_name))
}
Then I would create a vector with all the file names. Note that pattern takes a regular expression, not a glob:
all_files <- list.files(path = "your path with the csv files", pattern = "\\.csv$", full.names = TRUE)
Lastly, apply the function to all the files. Since prep_data is called for its side effect (writing a file), purrr::walk is a natural fit:
purrr::walk(.x = all_files, .f = prep_data)


P-values for glmer mixed effects logistic regression in Python

I have a dataset for one year for all employees with individual-level data (e.g. age, gender, promotions, etc.). Each employee is in a team of a certain manager. I have some variables on the team- and manager-levels as well (e.g. manager's tenure, team diversity, etc.). I want to explain the termination of employees (binary: left the company or not). I am running a multilevel logistic regression, where employees are grouped by their managers, therefore they share the same team- and manager-level characteristics.
So, my model looks like:
Termination ~ Age + TimeinCompany + Promotions + Manager_Tenure + PercentCompletedTrainingTeam, with employees grouped by Manager_ID
Dataset example:
import pandas as pd

data = {'Employee': ['ID1', 'ID2', 'ID3', 'ID4', 'ID5', 'ID6', 'ID7', 'ID8'],
        'Manager_ID': ['MID1', 'MID2', 'MID2', 'MID1', 'MID3', 'MID3', 'MID3', 'MID1'],
        'Termination': ['0', '0', '0', '0', '1', '1', '1', '0'],
        'Age': ['35', '40', '50', '24', '33', '46', '44', '31'],
        'TimeinCompany': ['1', '3', '10', '20', '4', '0', '4', '9'],
        'Promotions': ['1', '0', '0', '0', '1', '1', '1', '0'],
        'Manager_Tenure': ['10', '5', '5', '10', '8', '8', '8', '10'],
        'PercentCompletedTrainingTeam': ['40', '20', '20', '40', '49', '49', '49', '40']}
columns = ['Employee', 'Manager_ID', 'Termination', 'Age', 'TimeinCompany',
           'Promotions', 'Manager_Tenure', 'PercentCompletedTrainingTeam']
data = pd.DataFrame(data, columns=columns)
I managed to run a mixed effects logistic regression in Python using the lme4 package from R (via rpy2).
importr('lme4')
model1 = r.glmer(formula=Formula('Termination ~ Age + TimeinCompany + Promotions + Manager_Tenure + PercentCompletedTrainingTeam + (1 | Manager_ID)'),
data=data)
print(r.summary(model1))
I receive the following output for the full sample:
REML criterion at convergence: 54867.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.9075 -0.3502 -0.2172 -0.0929 3.9378
Random effects:
Groups Name Variance Std.Dev.
Manager_ID (Intercept) 0.005033 0.07094
Residual 0.072541 0.26933
Number of obs: 211974, groups: Manager_ID, 24316
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.14635573 0.00299341 48.893
Age -0.00112153 0.00008079 -13.882
TimeinCompany -0.00238352 0.00010314 -23.110
Promotions -0.01754085 0.00491545 -3.569
Manager_Tenure -0.00044373 0.00010834 -4.096
PercentCompletedTrainingTeam -0.00014393 0.00002598 -5.540
Correlation of Fixed Effects:
(Intr) Age TmnCmpny Promotions Mngr_Tenure
Age -0.817
TmnCmpny 0.370 -0.616
Promotions -0.011 -0.009 -0.033
Mngr_Tenure -0.279 0.013 -0.076 0.035
PrcntCmpltT -0.309 -0.077 -0.021 -0.042 0.052
But no p-values are displayed. I have read in many places that lme4 does not provide p-values for a number of reasons; however, I need them for a work presentation.
I tried several possible solutions that I found, but none of them worked:
importr('lmerTest')
importr('afex')
print(r.anova(model1))
does not display any output
print(r.anova(model1, ddf="Kenward-Roger"))
only displays npar, Sum Sq, Mean Sq, F value
print(r.summary(model1, ddf="merModLmerTest"))
Provides the same output as with just summary
print(r.anova(model1, "merModLmerTest"))
only displays npar, Sum Sq, Mean Sq, F value
Any ideas on how to get p-values are much appreciated.
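One detail worth checking first: the "REML criterion" line in the output indicates glmer() fell back to a linear mixed model, which happens when no family argument is given; refitting with family='binomial' makes summary() report Wald z values with Pr(>|z|) p-values directly. Even without refitting, a two-sided Wald p-value can be approximated from any estimate and its standard error using the standard normal distribution. A minimal pure-Python sketch (the numbers are the Age and Promotions rows from the output above; the normal approximation is an assumption):

```python
from math import erf, sqrt

def wald_p(estimate, std_error):
    """Two-sided Wald p-value under the standard normal approximation."""
    z = estimate / std_error
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); p = 2 * (1 - Phi(|z|))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Promotions row: estimate -0.01754085, std. error 0.00491545 (|z| ~ 3.57)
print(wald_p(-0.01754085, 0.00491545))  # ~0.00036

# Age row: estimate -0.00112153, std. error 0.00008079 (|z| ~ 13.9);
# at this magnitude the p-value underflows to 0.0 in double precision
print(wald_p(-0.00112153, 0.00008079))
```

These Wald p-values are only an approximation; for small samples, lmerTest's Satterthwaite or Kenward-Roger degrees of freedom (for linear mixed models) are more defensible.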

Ruby: the closest date to specific date

I have a problem with my Ruby script. I have an array
files = ["2020-09-14.access","2020-09-13.access","2020-09-11.access","2020-09-10.access","2020-09-09.access","2020-09-08.access","2020-09-07.access","2020-09-05.access","2020-09-04.access","2020-09-02.access","2020-09-01.access","2020-09-14.sale","2020-09-12.sale","2020-09-08.sale","2020-09-07.sale","2020-09-06.sale","2020-09-04.sale",]
that contains file names. There are two types of files: access and sale. Every file name contains the file's creation date. From each file type I want to get only the values with older dates, beginning from the file created two days ago. For the sale type there is no problem: today is 2020-09-14, so the file created two days ago is 2020-09-12.sale. But there is no access file created on 2020-09-12, so I want the file with the closest date to 2020-09-12, which means the value 2020-09-10.access, and that is where I'm stuck. In short, I want to get an array like this:
to_del_files = [["2020-09-10.access","2020-09-09.access","2020-09-08.access","2020-09-07.access","2020-09-05.access","2020-09-04.access","2020-09-02.access","2020-09-01.access"],["2020-09-12.sale","2020-09-08.sale","2020-09-07.sale","2020-09-06.sale","2020-09-04.sale"]]
My code is below:
require 'date'

files = ["2020-09-14.access","2020-09-13.access","2020-09-10.access","2020-09-09.access","2020-09-08.access","2020-09-07.access","2020-09-05.access","2020-09-04.access","2020-09-02.access","2020-09-01.access","2020-09-14.sale","2020-09-12.sale","2020-09-08.sale","2020-09-07.sale","2020-09-06.sale","2020-09-04.sale"]
names = files.map { |x| x.split('.')[1] }.uniq
puts names

date = Date.today
date2ago = date - 2
to_del_files = []

names.each do |item|
  tmp = files.select { |x| x =~ /#{item}/ }
  flag = tmp.select { |x| x =~ /#{date2ago}/ }
  if flag.size > 0
    index = tmp.find_index("#{flag[0]}")
    to_del_files << tmp[index..-1]
  else
    # what to do when there is no file with that date?
  end
end
puts to_del_files
Thanks for any help.
To get the files to delete:

require 'date'

def old_files(files, date)
  files.sort.filter { |file| Date.parse(file) < date }
end

And then you can use:

files = ["2020-09-14.access","2020-09-13.access","2020-09-10.access","2020-09-09.access","2020-09-08.access","2020-09-07.access","2020-09-05.access","2020-09-04.access","2020-09-02.access","2020-09-01.access","2020-09-14.sale","2020-09-12.sale","2020-09-08.sale","2020-09-07.sale","2020-09-06.sale","2020-09-04.sale"]
today = Date.today
date = today - 2
to_del_files = old_files(files, date)
I understand you wish to select elements from files corresponding to dates that are equal to or earlier than a given date. If that is correct you can do that as follows.
files = [
"2020-09-14.access", "2020-09-13.access", "2020-09-11.access",
"2020-09-10.access", "2020-09-09.access", "2020-09-08.access",
"2020-09-07.access", "2020-09-05.access", "2020-09-04.access",
"2020-09-02.access", "2020-09-01.access", "2020-09-14.sale",
"2020-09-12.sale", "2020-09-08.sale", "2020-09-07.sale",
"2020-09-06.sale", "2020-09-04.sale"
]
require 'date'
def files_on_or_before_date(arr)
  files_on_or_before_target_date(arr, Date.today - 2)
end

def files_on_or_before_target_date(arr, target_date)
  arr.select { |d| Date.strptime(d, '%Y-%m-%d') <= target_date }
end
files_on_or_before_target_date(files, Date.new(2020, 9, 12))
#=> ["2020-09-11.access", "2020-09-10.access", "2020-09-09.access",
# "2020-09-08.access", "2020-09-07.access", "2020-09-05.access",
# "2020-09-04.access", "2020-09-02.access", "2020-09-01.access",
# "2020-09-12.sale", "2020-09-08.sale", "2020-09-07.sale",
# "2020-09-06.sale", "2020-09-04.sale"]
files_on_or_before_target_date(files, Date.new(2020, 9, 10))
#=> ["2020-09-10.access", "2020-09-09.access", "2020-09-08.access",
# "2020-09-07.access", "2020-09-05.access", "2020-09-04.access",
# "2020-09-02.access", "2020-09-01.access", "2020-09-08.sale",
# "2020-09-07.sale", "2020-09-06.sale", "2020-09-04.sale"]
These return values can of course be added to an array.
See Date::strptime and DateTime#strftime, the latter for date formatting directives.
Date.strptime("2020-09-14.access", '%Y-%m-%d')
returns the same Date object as does
Date.strptime("2020-09-14", '%Y-%m-%d')
To guard against a possible future change in the implementation of Date::strptime, the argument d could be replaced with d[/[^.]+/] or d[0, d.index('.')], both of which evaluate to "2020-09-14" when d = "2020-09-14.access".
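Neither snippet above groups the results by file type, as the expected to_del_files output does. A sketch that also groups (this sketch assumes "closest" means the newest date not after the target, so selecting dates on or before the target and sorting newest-first yields everything from the closest file onward):

```ruby
require 'date'

files = ["2020-09-14.access", "2020-09-13.access", "2020-09-11.access",
         "2020-09-10.access", "2020-09-09.access", "2020-09-08.access",
         "2020-09-07.access", "2020-09-05.access", "2020-09-04.access",
         "2020-09-02.access", "2020-09-01.access", "2020-09-14.sale",
         "2020-09-12.sale", "2020-09-08.sale", "2020-09-07.sale",
         "2020-09-06.sale", "2020-09-04.sale"]

target = Date.new(2020, 9, 12)   # two days before 2020-09-14

# Group by the part after the dot, keep dates on or before the target,
# and sort newest-first so the closest date comes first in each group.
to_del_files = files.group_by { |f| f[/[^.]+\z/] }.map do |_type, group|
  group.select { |f| Date.strptime(f, '%Y-%m-%d') <= target }
       .sort
       .reverse
end
```

With this files array the access group starts at "2020-09-11.access"; with the array that omits 2020-09-11.access it starts at "2020-09-10.access", matching the expected output.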

querying a CSV::Table to find item with most sales between two given dates in plain old ruby script

I am trying to find the highest sales between two given dates.
This is my ad_report.csv file, with headers:
date,impressions,clicks,sales,ad_spend,keyword_id,asin
2017-06-19,4451,1006,608,24.87,UVOLBWHILJ,63N02JK10S
2017-06-18,5283,3237,1233,85.06,UVOLBWHILJ,63N02JK10S
2017-06-17,0,0,0,21.77,UVOLBWHILJ,63N02JK10S
...
Below is all the working code I have that returns the row with the highest value, but not between the given dates.
require 'csv'
require 'date'
# get directory of the current file
LIB_DIR = File.dirname(__FILE__)
# get the absolute path of the ad_report & product_report CSV
# and set to a var
AD_CSV_PATH = File.expand_path('data/ad_report.csv', LIB_DIR)
PROD_CSV_PATH = File.expand_path('data/product_report.csv', LIB_DIR)
# create CSV::Table for ad-ad_report and product_report CSV
ad_report_table = CSV.parse(File.read(AD_CSV_PATH), headers: true)
prod_report_table = CSV.parse(File.read(PROD_CSV_PATH), headers: true)
## finds the row with the highest sales
sales_row = ad_report_table.max_by { |row| row[3].to_i }
At this point I can get the row that has the greatest sale, and all the data from that row, but it is not restricted to the expected date range.
Below I am trying to use range with the preset dates.
## range of date for items between
first_date = Date.new(2017, 05, 02)
last_date = Date.new(2017, 05, 31)
range = (first_date...last_date)
puts sales_row
Below is pseudocode of what I think I am supposed to do, but there is probably a better method.
## check for highest sales
## return sales if between date
## else reject col if
## loop this until it returns date between
## return result
You could do this by creating a range containing two dates and then using Range#cover? to test whether each row's date is in the range:

range = Date.new(2015, 1, 1)..Date.new(2020, 1, 1)

rows.select do |row|
  range.cover?(Date.parse(row['date']))
end.max_by { |row| row['sales'].to_i }
Although the Tin Man is completely right in that you should use a database instead.
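For reference, a self-contained version of that idea (a sketch built from the three sample rows shown in the question; column access by header name works because the table is parsed with headers: true):

```ruby
require 'csv'
require 'date'

csv_text = <<~CSV
  date,impressions,clicks,sales,ad_spend,keyword_id,asin
  2017-06-19,4451,1006,608,24.87,UVOLBWHILJ,63N02JK10S
  2017-06-18,5283,3237,1233,85.06,UVOLBWHILJ,63N02JK10S
  2017-06-17,0,0,0,21.77,UVOLBWHILJ,63N02JK10S
CSV

rows  = CSV.parse(csv_text, headers: true)
range = Date.new(2017, 6, 17)..Date.new(2017, 6, 18)

# Restrict to rows whose date the range covers, then take the max by sales
best = rows.select { |row| range.cover?(Date.parse(row['date'])) }
           .max_by { |row| row['sales'].to_i }

puts best['date']   # 2017-06-18; the 2017-06-19 row (sales 608) is outside the range
```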
You could obtained the desired value as follows. I have assumed that the field of interest ('sales') represents integer values. If not, change .to_i to .to_f below.
Code
require 'csv'
def greatest(fname, max_field, date_field, date_range)
  largest = nil
  CSV.foreach(fname, headers: true) do |csv|
    largest = { row: csv.to_a, value: csv[max_field].to_i } if
      date_range.cover?(csv[date_field]) &&
      (largest.nil? || csv[max_field].to_i > largest[:value])
  end
  largest.nil? ? nil : largest[:row].to_h
end
Examples
Let's first create a CSV file.
str =<<~END
date,impressions,clicks,sales,ad_spend,keyword_id,asin
2017-06-19,4451,1006,608,24.87,UVOLBWHILJ,63N02JK10S
2017-06-18,5283,3237,1233,85.06,UVOLBWHILJ,63N02JK10S
2017-06-17,0,0,0,21.77,UVOLBWHILJ,63N02JK10S
2017-06-20,4451,1006,200000,24.87,UVOLBWHILJ,63N02JK10S
END
fname = 't.csv'
File.write(fname, str)
#=> 263
Now find the record within a given date range for which the value of "sales" is greatest.
greatest(fname, 'sales', 'date', '2017-06-17'..'2017-06-19')
#=> {"date"=>"2017-06-18", "impressions"=>"5283", "clicks"=>"3237",
# "sales"=>"1233", "ad_spend"=>"85.06", "keyword_id"=>"UVOLBWHILJ",
# "asin"=>"63N02JK10S"}
greatest(fname, 'sales', 'date', '2017-06-17'..'2017-06-25')
#=> {"date"=>"2017-06-20", "impressions"=>"4451", "clicks"=>"1006",
# "sales"=>"200000", "ad_spend"=>"24.87", "keyword_id"=>"UVOLBWHILJ",
# "asin"=>"63N02JK10S"}
greatest(fname, 'sales', 'date', '2017-06-22'..'2017-06-25')
#=> nil
I read the file line-by-line (using CSV#foreach) to keep memory requirements to a minimum, which could be essential if the file is large.
Notice that, because the date is in "yyyy-mm-dd" format, it is not necessary to convert two dates to Date objects to compare them; that is, they can be compared as strings (e.g. '2017-06-17' <= '2017-06-18' #=> true).

Ruby - Array of Hashes to CSV

I have collected some data and saved it to an array of hashes in the form of:
info = [
{'Name' => 'Alex', 'Tel' => 999, 'E-mail' => "bla#bla.com"},
{'Name' => 'Ali', 'Tel' => 995, 'E-mail' => "ali#bla.com"}
# ...
]
But not all information is always there:
{'Name' => 'GuyWithNoTelephone', 'E-mail' => "poor#bla.com"}
I want to turn this information into a CSV file. Here is what I tried:
def to_csv(info)
  CSV.open("sm-data.csv", "wb") do |csv|
    csv << ["Name", "Tel", "E-mail"]
    info.each do |person|
      csv << person.values
    end
  end
end

When I try this, the format of the table is not correct: if a telephone number is missing, that person's e-mail appears in the telephone column.
How can I selectively write this info to the CSV file, i.e how can I tell to skip columns if a value is missing?
But sadly when I try this, the format of the table is not correct, and say, if a telephone number is missing, then the e-mail of that person appears at telephone column.
That's because you are omitting the telephone number in that case, providing just 2 of the 3 values. To fix this, you simply have to provide all 3 values, even if they don't exist in the hash:

csv << ['Name', 'Tel', 'E-mail']
info.each do |person|
  csv << [person['Name'], person['Tel'], person['E-mail']]
end

or via Hash#values_at:

csv << ['Name', 'Tel', 'E-mail']
info.each do |person|
  csv << person.values_at('Name', 'Tel', 'E-mail')
end
For:
{'Name' => 'GuyWithNoTelephone', 'E-mail' => "poor#bla.com"}
this results in:
csv << ['GuyWithNoTelephone', nil, 'poor#bla.com']
which generates this output: (note the two commas, denoting an empty field in-between)
"GuyWithNoTelephone,,poor#bla.com\n"
Try this,
def to_csv(info, csv_filename = "sm-data.csv")
  # Get all unique keys into an array:
  keys = info.map(&:keys).inject(&:|)
  CSV.open(csv_filename, "wb") do |csv|
    csv << keys
    info.each do |hash|
      # fetch values at the key locations, inserting nil where a key is missing
      csv << hash.values_at(*keys)
    end
  end
end
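To check the alignment without writing a file, the same keys/values_at idea can be demonstrated with CSV.generate (a sketch using the sample data from the question):

```ruby
require 'csv'

info = [
  {'Name' => 'Alex', 'Tel' => 999, 'E-mail' => 'bla#bla.com'},
  {'Name' => 'GuyWithNoTelephone', 'E-mail' => 'poor#bla.com'}
]

# Union of all keys, in first-seen order
keys = info.map(&:keys).inject(&:|)

csv_text = CSV.generate do |csv|
  csv << keys
  info.each { |person| csv << person.values_at(*keys) }
end

puts csv_text
# Name,Tel,E-mail
# Alex,999,bla#bla.com
# GuyWithNoTelephone,,poor#bla.com
```

The missing 'Tel' becomes nil, which CSV renders as an empty field (the two consecutive commas in the last line).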
As simple as this:

path = "data/backtest-results.csv"

CSV.open(path, "wb") do |csv|
  csv << ["Asset", "Strategy", "Profit"]
  result_array.each do |p|
    csv << p.values_at("Asset", "Strategy", "Profit")
  end
end

Use File.file?(path) ? "ab" : "wb" instead of "wb" if you want to keep appending as new data comes in.

Append data to new cell arrays using Token

I have a problem I couldn't solve. I have a set of data present as lines (text organized into a number of sentences).
Example of my text in sentence:
1. Hello, world, It, is, beautiful, to, see, you, all
2. ,Wishing, you, happy, day, ahead
I am using strtok:

[token, remain] = strtok(remain, ', ');
% token = strtrim(token);
CellArray{NumberOFCells} = token(1:end);
NumberOFCells = NumberOFCells + 1;
I am using CellArray to store the tokens in cells. However, my code takes the first sentence and puts it into cells; then, when it iterates to the second sentence, it overwrites the pre-assigned cells with the tokens of the second sentence.
Expected Output
[ nxn ] [ nxn ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] ......
'Hello' 'world' 'It' 'is' 'beautiful' 'to' see' 'you' 'all' 'Wishing' 'you' 'happy' 'day' 'ahead'
The question is: how can I append the second sentence's strings to the cells without clearing the pre-filled ones?
Thank you, and looking forward to hearing from expert MATLAB programmers.
My code is below (ignore the commented lines); Retrieved is basically in the form shown after it.
[Index,Retrieved] = system(['wn ' keyword type ]);
Retrieved;
arrowSymbol = ' => ';
CommaSymbol= ', '
NumberOfSense= 'Sense ';
% let's look for the lines with '=> ' only?
senses = regexp(Retrieved, [arrowSymbol '[\w, ]*\n '], 'match');
SplitIntoCell = regexp(senses, [CommaSymbol '[\w, ]*\n'], 'match');
% now, we take out the '=> ' symbol
for i = 1:size(senses, 2)
    senses{i} = senses{i}(size(arrowSymbol, 2):end);
    SplitIntoCell{i} = SplitIntoCell{i}(size(CommaSymbol, 2):end);
    numberCommas = size(regexp(senses{i}, CommaSymbol), 2);
    remain = senses{i};
    RestWord = SplitIntoCell{i};
    NumberOFCells = 1;
    for j = 2:numberCommas + 1 + 1  % 1 for a word after the last comma, 1 because j starts at 2
        [token, remain] = strtok(remain, ', ');
        token = strtrim(token);
        CellArray{NumberOFCells} = token;
        NumberOFCells = NumberOFCells + 1;
    end
end
Retrieved=
cat, true cat
=> feline, felid
=> carnivore
=> placental, placental mammal, eutherian, eutherian mammal
=> mammal, mammalian
=> vertebrate, craniate
=> chordate
=> animal, animate being, beast, brute, creature, fauna
=> organism, being
=> living thing, animate thing
=> object, physical object
=> physical entity
=> entity
Your question is a little confusing, but after reading it (and the comments) a couple of times, I think I understand what you're asking.
Eitan T is correct about using regexp for this. When it comes to cell arrays, be careful of the difference in indexing/concatenation with [] and {}: see Combining Cell Arrays. Assuming you're using a loop to go through each sentence, you can do something like:
CellArray = [CellArray strsplit(next_sentence, ', ')];
Using regexp (or its case-insensitive alternative regexpi), try adding 'split' as another of the function options, for example:
[str,~] = regexp(next_sentence,'[^, \t]+', 'match', 'split');
CellArray = [CellArray str];
