I have an array of data which in the first column has years, and the other 3 columns has data for 3 different groups, the last of which is carrots.
I am trying to find the year in which the carrot value is the highest, by comparing the carrot value each year to the current highest, and then finding the year that the value takes place on.
I have used identical code for the other 2 columns with just the word carrot replaced and i in year[i] changed appropriately and the code works, but for this it throws up the error "local variable 'carrot_maxyear' referenced before assignment"
def carrot(data):
year0 = data[0]
carrotmax = year0[3]
for year in data:
if year[3] > carrotmax:
carrotmax = year[3]
carrot_maxyear = year[0]
return carrot_maxyear
Python's builtin max will make this easier:
def carrot(data):
maxyear = max(data, key=lambda year: year[3])
return maxyear[0]
This way you don't need the year0 and carrotmax initialization. We need to use the key argument to max because it looks like you want to return year[0] instead of the year[3] value used for the max calculation.
Your original version with a fix would look like:
def carrot(data):
year0 = data[0]
carrotmax = year0[3]
carrot_maxyear = 0 # initialize carrot_maxyear outside of loop to avoid error
for year in data:
if year[3] > carrotmax:
carrotmax = year[3]
carrot_maxyear = year[0]
return carrot_maxyear
but IMO the version utilizing max is more clear and Pythonic.
Related
So if I have two arrays in matlab. Let's call them locations1 and locations2
locations1
1123.44977625437 890.824688325172
1290.31273560851 5065.65794385883
1718.10632735926 2563.44895531365
1734.55379433782 4408.20631924691
2050.70084480064 1214.45353443990
2299.46239346717 3781.34694047196
4186.02801290113 4386.67818566045
5676.10649593031 4529.23023993815
locations2
7474.22619378039 3166.41503120846
8604.40241305284 5069.40744277799
9048.25231808890 2563.58997620248
9059.71923042408 4381.75034710351
9643.05902166767 3796.42822996919
11460.8617087264 4392.85930695209
And I want to make it so that any two entries of the second columns that match each other within 100.0 remain while any entry that has no match will get removed. So I want the output to look like
locations1
1290.31273560851 5065.65794385883
1718.10632735926 2563.44895531365
1734.55379433782 4408.20631924691
2299.46239346717 3781.34694047196
4186.02801290113 4386.67818566045
locations2
8604.40241305284 5069.40744277799
9048.25231808890 2563.58997620248
9059.71923042408 4381.75034710351
9643.05902166767 3796.42822996919
11460.8617087264 4392.85930695209
How would I do this? Preferably without loops. Here is what I've done, but it has loops
locround1=round(locations1/50)*50;
locround2=round(locations2/50)*50;
for i=1:size(locations1,1)
nodel1(i)=sum(locround1(i,2)== locround2(:,2))
end
nodel1=repmat(nodel1>0,[2,1]);
nodel1=nodel1';
locations1=nodel1.*locations1;
locations1( ~any(locations1,2), : ) = [];
for i=1:size(locations2,1)
nodel2(i)=sum(locround2(i,2)== locround1(:,2))
end
nodel2=repmat(nodel2>0,[2,1]);
nodel2=nodel2';
locations2=nodel2.*locations2;
locations2( ~any(locations2,2), : ) = [];
This is what I got. If your MATLAB version has set operators, you can do it with the following codes:
Li1 = ismembertol(locations1(:,2),locations2(:,2),100, 'DataScale', 1);
locations1_new = locations1 (Li1,:);
Li2 = ismembertol(locations2(:,2),locations1(:,2),100, 'DataScale', 1);
locations2_new = locations2 (Li2,:);
I tested it, it works.
Let the data be defined as
locations1 = [
1123.44977625437 890.824688325172
1290.31273560851 5065.65794385883
1718.10632735926 2563.44895531365
1734.55379433782 4408.20631924691
2050.70084480064 1214.45353443990
2299.46239346717 3781.34694047196
4186.02801290113 4386.67818566045
5676.10649593031 4529.23023993815
];
locations2 = [
7474.22619378039 3166.41503120846
8604.40241305284 5069.40744277799
9048.25231808890 2563.58997620248
9059.71923042408 4381.75034710351
9643.05902166767 3796.42822996919
11460.8617087264 4392.85930695209
];
threshold = 100;
Then:
m = abs(locations1(:,2)-locations2(:,2).')<=threshold;
result1 = locations1(any(m,2),:);
result2 = locations2(any(m,1),:);
How this works:
The first line computes a matrix with the distance between each value from the second column of locations1 and each value from the second column of locations2. The distances are then compared with threshold, so that the matrix entries become true or false.
This makes use of implicit expansion, introduced in R2016b. For Matlab versions before that, use bsxfun as follows:
m = abs(bsxfun(#minus, locations1(:,2), locations2(:,2).'))<=threshold;
Each row of the computed matrix, m, corresponds to a value from locations1; and each column corresponds to a value from locations2.
The second line uses logical indexing to select the rows of location1 that satisfy the criterion for some value of location2.
Similarly, the third line selects the rows of location2 that satisfy the criterion for some value of location1.
I'm trying to read an input file in Scala that I know the structure of, however I only need every 9th entry. So far I have managed to read the whole thing using:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
The issue, this leaves me with an array that is huge (we're talking 20GB of data). Not only have I seen myself forced to write some very ugly code in order to convert between RDD[Array[String]] and Array[String] but it's essentially made my code useless.
I've tried different approaches and mixes between using
.map()
.flatMap() and
.reduceByKey()
however nothing actually put my collected "cells" into the format that I need them to be.
Here's what is supposed to happen: Reading a folder of text files from our server, the code should read each "line" of text in the format:
*---------*
| NASDAQ: |
*---------*
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
and only keep a hold of the stock_symbol as that is the identifier I'm counting. So far my attempts have been to turn the entire thing into an array only collect every 9th index from the first one into a collected_cells var. Issue is, based on my calculations and real life results, that code would take 335 days to run (no joke).
Here's my current code for reference:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SparkNum {
def main(args: Array[String]) {
// Do some Scala voodoo
val sc = new SparkContext(new SparkConf().setAppName("Spark Numerical"))
// Set input file as per HDFS structure + input args
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
var collected_cells:Array[String] = new Array[String](0)
//println("[MESSAGE] Length of CC: " + collected_cells.length)
val divider:Long = 9
val array_length = fields.count / divider
val casted_length = array_length.toInt
val indexedFields = fields.zipWithIndex
val indexKey = indexedFields.map{case (k,v) => (v,k)}
println("[MESSAGE] Number of lines: " + array_length)
println("[MESSAGE] Casted lenght of: " + casted_length)
for( i <- 1 to casted_length ) {
println("[URGENT DEBUG] Processin line " + i + " of " + casted_length)
var index = 9 * i - 8
println("[URGENT DEBUG] Index defined to be " + index)
collected_cells :+ indexKey.lookup(index)
}
println("[MESSAGE] collected_cells size: " + collected_cells.length)
val single_cells = collected_cells.flatMap(collected_cells => collected_cells);
val counted_cells = single_cells.map(cell => (cell, 1).reduceByKey{case (x, y) => x + y})
// val result = counted_cells.reduceByKey((a,b) => (a+b))
// val inmem = counted_cells.persist()
//
// // Collect driver into file to be put into user archive
// inmem.saveAsTextFile("path to server location")
// ==> Not necessary to save the result as processing time is recorded, not output
}
}
The bottom part is currently commented out as I tried to debug it, but it acts as pseudo-code for me to know what I need done. I may want to point out that I am next to not at all familiar with Scala and hence things like the _ notation confuse the life out of me.
Thanks for your time.
There are some concepts that need clarification in the question:
When we execute this code:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
That does not result in a huge array of the size of the data. That expression represents a transformation of the base data. It can be further transformed until we reduce the data to the information set we desire.
In this case, we want the stock_symbol field of a record encoded a csv:
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
I'm also going to assume that the data file contains a banner like this:
*---------*
| NASDAQ: |
*---------*
The first thing we're going to do is to remove anything that looks like this banner. In fact, I'm going to assume that the first field is the name of a stock exchange that start with an alphanumeric character. We will do this before we do any splitting, resulting in:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields = validLines.map(line => line.split(","))
It helps to write the types of the variables, to have peace of mind that we have the data types that we expect. As we progress in our Scala skills that might become less important. Let's rewrite the expression above with types:
val lines: RDD[String] = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines: RDD[String] = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields: RDD[Array[String]] = validLines.map(line => line.split(","))
We are interested in the stock_symbol field, which positionally is the element #1 in a 0-based array:
val stockSymbols:RDD[String] = fields.map(record => record(1))
If we want to count the symbols, all that's left is to issue a count:
val totalSymbolCount = stockSymbols.count()
That's not very helpful because we have one entry for every record. Slightly more interesting questions would be:
How many different stock symbols we have?
val uniqueStockSymbols = stockSymbols.distinct.count()
How many records for each symbol do we have?
val countBySymbol = stockSymbols.map(s => (s,1)).reduceByKey(_+_)
In Spark 2.0, CSV support for Dataframes and Datasets is available out of the box
Given that our data does not have a header row with the field names (what's usual in large datasets), we will need to provide the column names:
val stockDF = sparkSession.read.csv("/tmp/quotes_clean.csv").toDF("exchange", "symbol", "date", "open", "close", "volume", "price")
We can answer our questions very easy now:
val uniqueSymbols = stockDF.select("symbol").distinct().count
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))
I used this simple script from: https://github.com/icambron/moment-countdown to make a simple countdown. The code below i'm using.
Used Code:
$interval(function(){
$scope.nextDate = moment().countdown($scope.nextDateGet,
countdown.DAYS|countdown.HOURS|countdown.MINUTES|countdown.SECONDS
);
$scope.daysCountdown = moment($scope.nextDate).format('dd');
$scope.hoursCountdown = moment($scope.nextDate).format('hh');
$scope.minutesCountdown = moment($scope.nextDate).format('mm');
$scope.secondsCountdown = moment($scope.nextDate).format('ss');
},1000,0);
This gives correct output
$scope.nextDate.toString();
But this contains one string with the remaining days,hours,minutes and seconds. So i decided i want to split this string into 4 strings by using this:
$scope.daysCountdown = moment($scope.nextDate).format('dd');
$scope.hoursCountdown = moment($scope.nextDate).format('hh');
$scope.minutesCountdown = moment($scope.nextDate).format('mm');
$scope.secondsCountdown = moment($scope.nextDate).format('ss');
Example for input
2016-10-15 10:00:00 // $scope.nextDateGet
Desired output is something like this:
0 // (days)
12 // (hours)
24 // (minutes)
30 // (seconds)
But i can't seem to format the remainings days, i get this output:
Fr // Shortcode for the day the item is scheduled => I need the remaining days in this case that would be 0. The other formatting is correct.
The following output was correct if remaining days was not 0:
$scope.daysCountdown = moment($scope.nextDate).format('D');
If remaining days was 0 it would set remaining days on 14 so this work around did the trick:
if(moment($scope.nextDate).isSame(moment(), 'day')){
$scope.daysCountdown = 0;
} else {
$scope.daysCountdown = moment($scope.nextDate).format('D');
}
Any suggestions to improve this code are always welcome.
I have this table named BondData which contains the following:
Settlement Maturity Price Coupon
8/27/2016 1/12/2017 106.901 9.250
8/27/2019 1/27/2017 104.79 7.000
8/28/2016 3/30/2017 106.144 7.500
8/28/2016 4/27/2017 105.847 7.000
8/29/2016 9/4/2017 110.779 9.125
For each day in this table, I am about to perform a certain task which is to assign several values to a variable and perform necessary computations. The logic is like:
do while Settlement is the same
m_settle=current_row_settlement_value
m_maturity=current_row_maturity_value
and so on...
my_computation_here...
end
It's like I wanted to loop through my settlement dates and perform task for as long as the date is the same.
EDIT: Just to clarify my issue, I am implementing Yield Curve fitting using Nelson-Siegel and Svensson models.Here are my codes so far:
function NS_SV_Models()
load bondsdata
BondData=table(Settlement,Maturity,Price,Coupon);
BondData.Settlement = categorical(BondData.Settlement);
Settlements = categories(BondData.Settlement); % get all unique Settlement
for k = 1:numel(Settlements)
rows = BondData.Settlement==Settlements(k);
Bonds.Settle = Settlements(k); % current_row_settlement_value
Bonds.Maturity = BondData.Maturity(rows); % current_row_maturity_value
Bonds.Prices=BondData.Price(rows);
Bonds.Coupon=BondData.Coupon(rows);
Settle = Bonds.Settle;
Maturity = Bonds.Maturity;
CleanPrice = Bonds.Prices;
CouponRate = Bonds.Coupon;
Instruments = [Settle Maturity CleanPrice CouponRate];
Yield = bndyield(CleanPrice,CouponRate,Settle,Maturity);
NSModel = IRFunctionCurve.fitNelsonSiegel('Zero',Settlements(k),Instruments);
SVModel = IRFunctionCurve.fitSvensson('Zero',Settlements(k),Instruments);
NSModel.Parameters
SVModel.Parameters
end
end
Again, my main objective is to get each model's parameters (beta0, beta1, beta2, etc.) on a per day basis. I am getting an error in Instruments = [Settle Maturity CleanPrice CouponRate]; because Settle contains only one record (8/27/2016), it's suppose to have two since there are two rows for this date. Also, I noticed that Maturity, CleanPrice and CouponRate contains all records. They should only contain respective data for each day.
Hope I made my issue clearer now. By the way, I am using MATLAB R2015a.
Use categorical array. Here is your function (without its' headline, and all rows I can't run are commented):
BondData = table(datetime(Settlement),datetime(Maturity),Price,Coupon,...
'VariableNames',{'Settlement','Maturity','Price','Coupon'});
BondData.Settlement = categorical(BondData.Settlement);
Settlements = categories(BondData.Settlement); % get all unique Settlement
for k = 1:numel(Settlements)
rows = BondData.Settlement==Settlements(k);
Settle = BondData.Settlement(rows); % current_row_settlement_value
Mature = BondData.Maturity(rows); % current_row_maturity_value
CleanPrice = BondData.Price(rows);
CouponRate = BondData.Coupon(rows);
Instruments = [datenum(char(Settle)) datenum(char(Mature))...
CleanPrice CouponRate];
% Yield = bndyield(CleanPrice,CouponRate,Settle,Mature);
%
% NSModel = IRFunctionCurve.fitNelsonSiegel('Zero',Settlements(k),Instruments);
% SVModel = IRFunctionCurve.fitSvensson('Zero',Settlements(k),Instruments);
%
% NSModel.Parameters
% SVModel.Parameters
end
Keep in mind the following:
You cannot concat different types of variables as you try to do in: Instruments = [Settle Maturity CleanPrice CouponRate];
There is no need in the structure Bond, you don't use it (e.g. Settle = Bonds.Settle;).
Use the relevant functions to convert between a datetime object and string or numbers. For instance, in the code above: datenum(char(Settle)). I don't know what kind of input you need to pass to the following functions.
I need help with matlab using 'strtok' to find an ID in a text file and then read in or manipulate the rest of the row that is contained where that ID is. I also need this function to find (using strtok preferably) all occurrences of that same ID and group them in some way so that I can find averages. On to the sample code:
ID list being input:
(This is the KOIName variable)
010447529
010468501
010481335
010529637
010603247......etc.
File with data format:
(This is the StarData variable)
ID>>>>Values
002141865 3.867144e-03 742.000000 0.001121 16.155089 6.297494 0.001677
002141865 5.429278e-03 1940.000000 0.000477 16.583748 11.945627 0.001622
002141865 4.360715e-03 1897.000000 0.000667 16.863406 13.438383 0.001460
002141865 3.972467e-03 2127.000000 0.000459 16.103060 21.966853 0.001196
002141865 8.542932e-03 2094.000000 0.000421 17.452007 18.067214 0.002490
Do not be mislead by the examples I posted, that first number is repeated for about 15 lines then the ID changes and that goes for an entire set of different ID's, then they are repeated as a whole group again, think [1,2,3],[1,2,3], the main difference is the values trailing the ID which I need to average out in matlab.
My current code is:
function Avg_Koi
N = evalin('base', 'KOIName');
file_1 = evalin('base', 'StarData');
global result;
for i=1:size(N)
[id, values] = strtok(file_1);
result = result(id);
result = result(values)
end
end
Thanks for any assistance.
You let us guess a lot, so I guess you want something like this:
load StarData.txt
IDs = { 010447529;
010468501;
010481335;
010529637;
010603247;
002141865}
L = numel(IDs);
values = cell(L,1);
% Iteration through all arrays and creating an cell array with matrices for every ID
for ii=1:L;
ID = IDs{ii};
ID_first = find(StarData(:,1) == ID,1,'first');
ID_last = find(StarData(:,1) == ID,1,'last');
values{ii} = StarData( ID_first:ID_last , 2:end );
end
When you now access the index ii=6 adressing the ID = 002141865
MatrixOfCertainID6 = values{6};
you get:
0.0038671440 742 0.001121 16.155089 6.2974940 0.001677
0.0054292780 1940 0.000477 16.583748 11.945627 0.001622
0.0043607150 1897 0.000667 16.863406 13.438383 0.001460
0.0039724670 2127 0.000459 16.103060 21.966853 0.001196
0.0085429320 2094 0.000421 17.452007 18.067214 0.002490
... for further calculations.