Matlab string manipulation - arrays

I need help in MATLAB using 'strtok' to find an ID in a text file and then read in or manipulate the rest of the row where that ID occurs. I also need this function to find (preferably using strtok) all occurrences of that same ID and group them in some way, so that I can compute averages. On to the sample code:
ID list being input:
(This is the KOIName variable)
010447529
010468501
010481335
010529637
010603247......etc.
File with data format:
(This is the StarData variable)
ID>>>>Values
002141865 3.867144e-03 742.000000 0.001121 16.155089 6.297494 0.001677
002141865 5.429278e-03 1940.000000 0.000477 16.583748 11.945627 0.001622
002141865 4.360715e-03 1897.000000 0.000667 16.863406 13.438383 0.001460
002141865 3.972467e-03 2127.000000 0.000459 16.103060 21.966853 0.001196
002141865 8.542932e-03 2094.000000 0.000421 17.452007 18.067214 0.002490
Do not be misled by the examples I posted: the first number is repeated for about 15 lines, then the ID changes, and the same goes for an entire set of different IDs; then they are repeated as a whole group again, think [1,2,3],[1,2,3]. The main difference is the values trailing the ID, which I need to average out in MATLAB.
My current code is:
function Avg_Koi
N = evalin('base', 'KOIName');
file_1 = evalin('base', 'StarData');
global result;
for i=1:size(N)
[id, values] = strtok(file_1);
result = result(id);
result = result(values)
end
end
Thanks for any assistance.

You let us guess a lot, so I guess you want something like this:
load StarData.txt
IDs = [010447529
       010468501
       010481335
       010529637
       010603247
       002141865];   % the leading zeros carry no meaning in a numeric comparison
L = numel(IDs);
values = cell(L,1);
% Iterate through all IDs and create a cell array with one matrix per ID
for ii = 1:L
    ID = IDs(ii);
    % logical indexing collects all rows with this ID, even when the
    % groups repeat and the rows are therefore not contiguous
    values{ii} = StarData(StarData(:,1) == ID, 2:end);
end
When you now access index ii = 6, addressing ID 002141865,
MatrixOfCertainID6 = values{6};
you get:
0.0038671440 742 0.001121 16.155089 6.2974940 0.001677
0.0054292780 1940 0.000477 16.583748 11.945627 0.001622
0.0043607150 1897 0.000667 16.863406 13.438383 0.001460
0.0039724670 2127 0.000459 16.103060 21.966853 0.001196
0.0085429320 2094 0.000421 17.452007 18.067214 0.002490
... for further calculations.
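Since your end goal is averaging, here is a minimal follow-up sketch (assuming you want the column-wise mean of each ID's value matrix):
avgValues = zeros(L, size(StarData,2)-1);
for ii = 1:L
    avgValues(ii,:) = mean(values{ii}, 1);   % one row of column averages per ID
end
avgValues(ii,:) then holds the averaged values belonging to IDs(ii).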

Related

How do I delete entries from two two-column tables such that their second columns match within a certain error

Say I have two arrays in MATLAB; let's call them locations1 and locations2:
locations1
1123.44977625437 890.824688325172
1290.31273560851 5065.65794385883
1718.10632735926 2563.44895531365
1734.55379433782 4408.20631924691
2050.70084480064 1214.45353443990
2299.46239346717 3781.34694047196
4186.02801290113 4386.67818566045
5676.10649593031 4529.23023993815
locations2
7474.22619378039 3166.41503120846
8604.40241305284 5069.40744277799
9048.25231808890 2563.58997620248
9059.71923042408 4381.75034710351
9643.05902166767 3796.42822996919
11460.8617087264 4392.85930695209
I want any two entries of the second columns that match each other within 100.0 to remain, while any entry that has no match gets removed. So I want the output to look like:
locations1
1290.31273560851 5065.65794385883
1718.10632735926 2563.44895531365
1734.55379433782 4408.20631924691
2299.46239346717 3781.34694047196
4186.02801290113 4386.67818566045
locations2
8604.40241305284 5069.40744277799
9048.25231808890 2563.58997620248
9059.71923042408 4381.75034710351
9643.05902166767 3796.42822996919
11460.8617087264 4392.85930695209
How would I do this, preferably without loops? Here is what I've done, but it has loops:
locround1=round(locations1/50)*50;
locround2=round(locations2/50)*50;
for i=1:size(locations1,1)
nodel1(i)=sum(locround1(i,2)== locround2(:,2))
end
nodel1=repmat(nodel1>0,[2,1]);
nodel1=nodel1';
locations1=nodel1.*locations1;
locations1( ~any(locations1,2), : ) = [];
for i=1:size(locations2,1)
nodel2(i)=sum(locround2(i,2)== locround1(:,2))
end
nodel2=repmat(nodel2>0,[2,1]);
nodel2=nodel2';
locations2=nodel2.*locations2;
locations2( ~any(locations2,2), : ) = [];
If your MATLAB version has ismembertol, you can do it with the following code:
Li1 = ismembertol(locations1(:,2), locations2(:,2), 100, 'DataScale', 1);
locations1_new = locations1(Li1,:);
Li2 = ismembertol(locations2(:,2), locations1(:,2), 100, 'DataScale', 1);
locations2_new = locations2(Li2,:);
I tested it, it works.
Let the data be defined as
locations1 = [
1123.44977625437 890.824688325172
1290.31273560851 5065.65794385883
1718.10632735926 2563.44895531365
1734.55379433782 4408.20631924691
2050.70084480064 1214.45353443990
2299.46239346717 3781.34694047196
4186.02801290113 4386.67818566045
5676.10649593031 4529.23023993815
];
locations2 = [
7474.22619378039 3166.41503120846
8604.40241305284 5069.40744277799
9048.25231808890 2563.58997620248
9059.71923042408 4381.75034710351
9643.05902166767 3796.42822996919
11460.8617087264 4392.85930695209
];
threshold = 100;
Then:
m = abs(locations1(:,2)-locations2(:,2).')<=threshold;
result1 = locations1(any(m,2),:);
result2 = locations2(any(m,1),:);
How this works:
The first line computes a matrix with the distance between each value from the second column of locations1 and each value from the second column of locations2. The distances are then compared with threshold, so that the matrix entries become true or false.
This makes use of implicit expansion, introduced in R2016b. For Matlab versions before that, use bsxfun as follows:
m = abs(bsxfun(@minus, locations1(:,2), locations2(:,2).')) <= threshold;
Each row of the computed matrix, m, corresponds to a value from locations1; and each column corresponds to a value from locations2.
The second line uses logical indexing to select the rows of locations1 that satisfy the criterion for some value of locations2.
Similarly, the third line selects the rows of locations2 that satisfy the criterion for some value of locations1.
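To make the orientation of m concrete, here is a toy example (values made up, threshold 2):
a = [1; 5; 9];         % plays the role of locations1(:,2), size 3x1
b = [2 8];             % plays the role of locations2(:,2).', size 1x2
m = abs(a - b) <= 2    % 3x2 logical: m(i,j) is true when a(i) and b(j) are within 2
% m = [ 1 0
%       0 0
%       0 1 ]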

How to speed up iteration through array in ruby

I have multiple CSV files that contain the name and the price of products. There may or may not be products that are in both files. I have to find the highest and the lowest price across these files for each product.
I joined products from both files into one array:
Dir["./*.csv"].each do |file|
CSV.foreach(file, headers:true) do |row|
tmpRow = row.to_s.chomp + "," + file #saving name of the input file
list.push(tmpRow.chomp.split(","))
end
end
The array list looks like this:
[["5893105","2.38", "weightOrSomethingIrrelevant", "./FIAT_2.csv"]]
This is the main algorithm:
while list[0] do
if list[0] != nil
tmpPart = list[0][0]
tmpParts = list.select{ |part, price| part == tmpPart}
tmpParts.each do |tp|
tmpPrices.push(tp[1])
end
list[0][2].to_f != 0.0 ? tmpWeight = list[0][2].to_s : tmpWeight = "Undefined"
tmpMaxPrice = tmpParts.select{|part, price| part == tmpPart && price == tmpPrices.max}
tmpMinPrice = tmpParts.select{|part, price| part == tmpPart && price == tmpPrices.min}
result.push([tmpPart, tmpWeight, tmpPrices.max, tmpMaxPrice[0].last, tmpPrices.min, tmpMinPrice[0].last])
tmpPart = ""
list = list - tmpParts
tmpParts = []
tmpPrices = []
tmpMaxPrice = []
tmpMinPrice = []
tmpWeight = ""
end
end
The input files are huge (over 200,000 rows), so I am having problems with the efficiency of my algorithm (it processes one row in half a second).
I am wondering if there is any better way to write this app.
I would split this into several parts:
1) I suggest you have a table which represents files (the file name, location, line number, etc.) and, connected to that, a product table (the row data from that file).
2) A script / function to ingest the files and store their rows as DB records.
3) A script / function to analyse the rows and find products by name, using the DB and pulling the price info out with MIN/MAX; a sketch of steps 2 and 3 follows below.
This could later be improved to deal with naming inconsistencies, products vs product occurrences, etc.
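Here is a minimal sketch of steps 2 and 3, assuming SQLite via the sqlite3 gem (the table and column names are illustrative, and the price is assumed to sit in the second CSV column):
require 'csv'
require 'sqlite3'

db = SQLite3::Database.new('products.db')
db.execute('CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL, source_file TEXT)')

# step 2: ingest every CSV row as a DB record (one transaction keeps it fast)
db.transaction do
  Dir['./*.csv'].each do |file|
    CSV.foreach(file, headers: true) do |row|
      db.execute('INSERT INTO products VALUES (?, ?, ?)', [row[0], row[1].to_f, file])
    end
  end
end

# step 3: let the DB do the grouping and the min/max work
db.execute('SELECT name, MIN(price), MAX(price) FROM products GROUP BY name') do |name, min, max|
  puts "#{name}: min #{min}, max #{max}"
end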

using lookup tables to plot a ggplot and table

I'm creating a shiny app and I'm letting the user choose what data should be displayed in a plot and a table. This choice is made through 3 different input variables that contain 14, 4 and 2 choices respectively.
ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(
    selectInput(inputId = "DataSource", label = "Data source",
                choices = c("Restoration plots", "all semi natural grasslands")),
    selectInput(inputId = "Variabel", label = "Variable",
                choices = choicesVariables),
    # choicesVariables definition is omitted here because it's very long,
    # but it contains 14 string values
    selectInput(inputId = "Factor", label = "Factor",
                choices = c("Company type", "Region and type of application",
                            "Approved or not approved applications", "Age group"))
  ),
  dashboardBody(
    plotOutput("thePlot"),
    tableOutput("theTable")
  )
)
This adds up to 73 choices (yes, I know the math doesn't add up there, but some choices are invalid). I would like to do this using a lookup table, so I created one with every valid combination of choices, like this:
rad1 <- c(rep("Company type", 20), rep("Region and type of application", 20),
          rep("Approved or not approved applications", 13), rep("Age group", 20))
rad2 <- choicesVariables[c(1:14, 1,4,5,9,10,11, 1:14, 1,4,5,9,10,11, 1:7,9:14,
                           1:14, 1,4,5,9,10,11)]
rad3 <- c(rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 27), rep("all semi natural grasslands", 6))
rad4 <- 1:73
letaLista <- data.frame(rad1, rad2, rad3, rad4)
colnames(letaLista) <- c("Factor", "Variabel", "rest_alla", "id")
Now it's easy to use subset to get only the choice the user made. But how do I use this information to produce the plot and table without a 73-line-long ifelse statement?
I tried to create some sort of multidimensional array that could hold all the tables (and one for the plots) but I couldn't make it work. My experience with this kind of array is limited and this might be a simple issue, but any hints would be helpful!
My dataset, the foundation for the plots and tables, consists of a dataframe with 23 variables, factors and numerical. The plots and tables are then created using the following code for all 73 combinations:
s_A1 <- summarySE(Samlad_info, measurevar = "Dist_brukcentrum",
                  groupvars = "Companytype")
s_A1 <- s_A1[2:6, ]
p_A1 <- ggplot(s_A1, aes(x = Companytype, y = Dist_brukcentrum)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = Dist_brukcentrum - se, ymax = Dist_brukcentrum + se),
                width = .2, position = position_dodge(.9)) +
  scale_y_continuous(name = "") + scale_x_discrete(name = "")
where summarySE is the following function, borrowed from Cookbook for R:
summarySE <- function(data = NULL, measurevar, groupvars = NULL, na.rm = TRUE,
                      conf.interval = .95, .drop = TRUE) {
  # New version of length which can handle NAs: if na.rm == TRUE, don't count them
  length2 <- function(x, na.rm = FALSE) {
    if (na.rm) sum(!is.na(x))
    else length(x)
  }
  # This does the summary. For each group's data frame, return a vector with
  # N, mean, and sd
  datac <- ddply(data, groupvars, .drop = .drop,
                 .fun = function(xx, col) {
                   c(N    = length2(xx[[col]], na.rm = na.rm),
                     mean = mean(xx[[col]], na.rm = na.rm),
                     sd   = sd(xx[[col]], na.rm = na.rm))
                 },
                 measurevar)
  # Rename the "mean" column
  datac <- rename(datac, c("mean" = measurevar))
  datac$se <- datac$sd / sqrt(datac$N)  # calculate standard error of the mean
  # Confidence interval multiplier for standard error
  # Calculate t-statistic for confidence interval:
  # e.g., if conf.interval is .95, use .975 (above/below), and use df = N-1
  ciMult <- qt(conf.interval/2 + .5, datac$N - 1)
  datac$ci <- datac$se * ciMult
  return(datac)
}
The code in its entirety is a bit too large to post, but I hope this clarifies what I'm trying to do.
Well, thanks to florian's comment I think I might have found a solution myself. I'll present it here but leave the question open, as there are probably far neater ways of doing it.
I rigged up the plots (which ggplot created as objects) and the tables into lists:
plotList <- list(p_A1, p_A2, p_A3...)
tableList <- list(s_A1, s_A2, s_A3...)
I then used subset on my lookup table to get the matching id and select the right plot and table:
output$thePlot <- renderPlot({
  plotValue <- subset(letaLista, letaLista$Factor == input$Factor &
                      letaLista$Variabel == input$Variabel &
                      letaLista$rest_alla == input$DataSource)
  plotList[[as.integer(plotValue$id[1])]]  # [[ ]] extracts the plot itself rather than a sub-list
})
output$theTable <- renderTable({
  tableValue <- subset(letaLista, letaLista$Factor == input$Factor &
                       letaLista$Variabel == input$Variabel &
                       letaLista$rest_alla == input$DataSource)
  tableList[[as.integer(tableValue$id[1])]]
})
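A slightly neater variant (just a sketch, assuming the same three inputs and that plotList is ordered like the rows of letaLista) is to skip the numeric id and key the list by the concatenated choices:
# name each list element by its combined choice string (names are illustrative)
key <- function(factor, variabel, source) paste(factor, variabel, source, sep = "|")
names(plotList) <- key(letaLista$Factor, letaLista$Variabel, letaLista$rest_alla)

output$thePlot <- renderPlot({
  plotList[[key(input$Factor, input$Variabel, input$DataSource)]]
})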

Scala read only certain parts of file

I'm trying to read an input file in Scala whose structure I know; however, I only need every 9th entry. So far I have managed to read the whole thing using:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
The issue: this leaves me with an array that is huge (we're talking 20 GB of data). Not only have I been forced to write some very ugly code to convert between RDD[Array[String]] and Array[String], it has essentially made my code useless.
I've tried different approaches and mixes between using
.map()
.flatMap() and
.reduceByKey()
however nothing actually puts my collected "cells" into the format I need them in.
Here's what is supposed to happen: Reading a folder of text files from our server, the code should read each "line" of text in the format:
*---------*
| NASDAQ: |
*---------*
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
and only keep hold of the stock_symbol, as that is the identifier I'm counting. So far my attempts have been to turn the entire thing into an array and collect every 9th index from the first one into a collected_cells var. The issue is, based on my calculations and real-life results, that code would take 335 days to run (no joke).
Here's my current code for reference:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SparkNum {
def main(args: Array[String]) {
// Do some Scala voodoo
val sc = new SparkContext(new SparkConf().setAppName("Spark Numerical"))
// Set input file as per HDFS structure + input args
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
var collected_cells:Array[String] = new Array[String](0)
//println("[MESSAGE] Length of CC: " + collected_cells.length)
val divider:Long = 9
val array_length = fields.count / divider
val casted_length = array_length.toInt
val indexedFields = fields.zipWithIndex
val indexKey = indexedFields.map{case (k,v) => (v,k)}
println("[MESSAGE] Number of lines: " + array_length)
println("[MESSAGE] Casted lenght of: " + casted_length)
for( i <- 1 to casted_length ) {
println("[URGENT DEBUG] Processin line " + i + " of " + casted_length)
var index = 9 * i - 8
println("[URGENT DEBUG] Index defined to be " + index)
collected_cells :+ indexKey.lookup(index)
}
println("[MESSAGE] collected_cells size: " + collected_cells.length)
val single_cells = collected_cells.flatMap(collected_cells => collected_cells);
val counted_cells = single_cells.map(cell => (cell, 1)).reduceByKey{case (x, y) => x + y}
// val result = counted_cells.reduceByKey((a,b) => (a+b))
// val inmem = counted_cells.persist()
//
// // Collect driver into file to be put into user archive
// inmem.saveAsTextFile("path to server location")
// ==> Not necessary to save the result as processing time is recorded, not output
}
}
The bottom part is currently commented out as I tried to debug it, but it acts as pseudo-code for me to know what I need done. I should point out that I am hardly familiar with Scala at all, and hence things like the _ notation confuse the life out of me.
Thanks for your time.
There are some concepts that need clarification in the question:
When we execute this code:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
That does not result in a huge array of the size of the data. That expression represents a transformation of the base data. It can be further transformed until we reduce the data to the information set we desire.
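A quick way to see this laziness for yourself (just a sketch; any small RDD will do):
val nums = sc.parallelize(1 to 10)
val mapped = nums.map { n => println(s"processing $n"); n * 2 } // nothing runs yet
mapped.count() // only now do the "processing ..." lines appear: the action triggers the work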
In this case, we want the stock_symbol field of a record encoded as CSV:
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
I'm also going to assume that the data file contains a banner like this:
*---------*
| NASDAQ: |
*---------*
The first thing we're going to do is remove anything that looks like this banner. In fact, I'm going to assume that the first field is the name of a stock exchange that starts with an alphanumeric character. We will do this before any splitting, resulting in:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields = validLines.map(line => line.split(","))
It helps to write out the types of the variables, to have peace of mind that we have the data types we expect. As we progress in our Scala skills, that might become less important. Let's rewrite the expression above with types:
val lines: RDD[String] = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines: RDD[String] = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields: RDD[Array[String]] = validLines.map(line => line.split(","))
We are interested in the stock_symbol field, which positionally is the element #1 in a 0-based array:
val stockSymbols:RDD[String] = fields.map(record => record(1))
If we want to count the symbols, all that's left is to issue a count:
val totalSymbolCount = stockSymbols.count()
That's not very helpful because we have one entry for every record. Slightly more interesting questions would be:
How many different stock symbols we have?
val uniqueStockSymbols = stockSymbols.distinct.count()
How many records for each symbol do we have?
val countBySymbol = stockSymbols.map(s => (s,1)).reduceByKey(_+_)
In Spark 2.0, CSV support for DataFrames and Datasets is available out of the box.
Given that our data does not have a header row with the field names (as is usual in large datasets), we will need to provide the column names:
val stockDF = sparkSession.read.csv("/tmp/quotes_clean.csv").toDF("exchange", "symbol", "date", "open", "high", "low", "close", "volume", "adj_close")
We can answer our questions very easily now:
val uniqueSymbols = stockDF.select("symbol").distinct().count
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))

How to loop through table based on unique date in MATLAB

I have this table named BondData which contains the following:
Settlement Maturity Price Coupon
8/27/2016 1/12/2017 106.901 9.250
8/27/2019 1/27/2017 104.79 7.000
8/28/2016 3/30/2017 106.144 7.500
8/28/2016 4/27/2017 105.847 7.000
8/29/2016 9/4/2017 110.779 9.125
For each day in this table, I want to perform a certain task: assign several values to variables and perform the necessary computations. The logic is like:
do while Settlement is the same
m_settle=current_row_settlement_value
m_maturity=current_row_maturity_value
and so on...
my_computation_here...
end
It's like I want to loop through my settlement dates and perform the task for as long as the date is the same.
EDIT: Just to clarify my issue, I am implementing yield curve fitting using the Nelson-Siegel and Svensson models. Here are my codes so far:
function NS_SV_Models()
load bondsdata
BondData=table(Settlement,Maturity,Price,Coupon);
BondData.Settlement = categorical(BondData.Settlement);
Settlements = categories(BondData.Settlement); % get all unique Settlement
for k = 1:numel(Settlements)
rows = BondData.Settlement==Settlements(k);
Bonds.Settle = Settlements(k); % current_row_settlement_value
Bonds.Maturity = BondData.Maturity(rows); % current_row_maturity_value
Bonds.Prices=BondData.Price(rows);
Bonds.Coupon=BondData.Coupon(rows);
Settle = Bonds.Settle;
Maturity = Bonds.Maturity;
CleanPrice = Bonds.Prices;
CouponRate = Bonds.Coupon;
Instruments = [Settle Maturity CleanPrice CouponRate];
Yield = bndyield(CleanPrice,CouponRate,Settle,Maturity);
NSModel = IRFunctionCurve.fitNelsonSiegel('Zero',Settlements(k),Instruments);
SVModel = IRFunctionCurve.fitSvensson('Zero',Settlements(k),Instruments);
NSModel.Parameters
SVModel.Parameters
end
end
Again, my main objective is to get each model's parameters (beta0, beta1, beta2, etc.) on a per-day basis. I am getting an error in Instruments = [Settle Maturity CleanPrice CouponRate]; because Settle contains only one record (8/27/2016); it's supposed to have two, since there are two rows for this date. Also, I noticed that Maturity, CleanPrice and CouponRate contain all records. They should only contain the respective data for each day.
I hope I made my issue clearer now. By the way, I am using MATLAB R2015a.
Use a categorical array. Here is your function (without its header line, and with all rows I can't run commented out):
BondData = table(datetime(Settlement), datetime(Maturity), Price, Coupon, ...
    'VariableNames', {'Settlement','Maturity','Price','Coupon'});
BondData.Settlement = categorical(BondData.Settlement);
Settlements = categories(BondData.Settlement); % get all unique Settlements
for k = 1:numel(Settlements)
    rows = BondData.Settlement == Settlements(k);
    Settle = BondData.Settlement(rows); % current_row_settlement_value
    Mature = BondData.Maturity(rows);   % current_row_maturity_value
    CleanPrice = BondData.Price(rows);
    CouponRate = BondData.Coupon(rows);
    Instruments = [datenum(char(Settle)) datenum(char(Mature)) ...
        CleanPrice CouponRate];
    % Yield = bndyield(CleanPrice,CouponRate,Settle,Mature);
    %
    % NSModel = IRFunctionCurve.fitNelsonSiegel('Zero',Settlements(k),Instruments);
    % SVModel = IRFunctionCurve.fitSvensson('Zero',Settlements(k),Instruments);
    %
    % NSModel.Parameters
    % SVModel.Parameters
end
Keep in mind the following:
You cannot concatenate different types of variables as you try to do in Instruments = [Settle Maturity CleanPrice CouponRate];
There is no need for the structure Bonds; you don't use it (e.g. Settle = Bonds.Settle;).
Use the relevant functions to convert between a datetime object and strings or numbers, for instance datenum(char(Settle)) in the code above; a small sketch of such conversions follows below. I don't know what kind of input you need to pass to the commented-out functions.
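For reference, a minimal sketch of those conversions (the date value is illustrative):
d = datetime('8/27/2016', 'InputFormat', 'M/d/yyyy'); % char -> datetime
n = datenum(d);                                       % datetime -> serial date number
s = datestr(d);                                       % datetime -> char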
