I have a project which outputs data into many files. These files look something like this when read with readtable:
ans =
8×8 table
Date_Time EZOO2Con___ EZOCO2Con_ppm_ SGPCO2Con_ppm_ SGPTVOC_ppb_ BMEHumidity___ BMEPressure_Pa_ BMETemp_DegC_
___________________ ___________ ______________ ______________ ____________ ______________ _______________ _____________
09/06/2022 11:55:17 19.16 419 400 0 48.5 95948 22.57
09/06/2022 11:55:18 19.16 419 400 0 48.89 99577 22.58
09/06/2022 11:55:19 19.16 419 400 0 48.89 99578 22.58
09/06/2022 11:55:20 19.15 420 400 0 48.84 99584 22.57
09/06/2022 11:55:21 19.15 420 400 0 48.95 99574 22.58
09/06/2022 11:55:22 19.15 421 400 0 49.15 99578 22.57
09/06/2022 11:55:23 19.15 421 400 0 48.9 99577 22.56
09/06/2022 11:55:24 19.15 422 400 0 48.9 99573 22.57
For my previous test, I have approx. 289 separate files of this format which I'd like to combine into 8 arrays for plotting.
The Date_Time column is a string with MM/DD/YYYY and HH:MM:SS separated by a space. Using the table2array command, I am able to convert the date/time column of each file's data into a datetime array. However, I am unable to use the cat or vertcat functions to append the date/time column to my "combined" array. Below is the code that's giving me trouble:
for k = 1:length(fileList)
    baseFileName = fileList(k).name;
    fullFileName = fullfile(fileList(k).folder, baseFileName);
    fprintf(1, 'Now reading %s\n', fullFileName);
    data = readtable(fullFileName);
    timecol = data(:,1);
    EZOCO2col = data(:,2);
    EZOO2col = data(:,3);
    SGP30CO2col = data(:,4);
    SGP30TVOCcol = data(:,5);
    BME280Humcol = data(:,6);
    BME280Presscol = data(:,7);
    BME280Tempcol = data(:,8);
    timecol_array = table2array(timecol);
    EZOCO2col_array = table2array(EZOCO2col);
    EZOO2col_array = table2array(EZOO2col);
    SGP30CO2col_array = table2array(SGP30CO2col);
    SGP30TVOCcol_array = table2array(SGP30TVOCcol);
    BME280Humcol_array = table2array(BME280Humcol);
    BME280Presscol_array = table2array(BME280Presscol);
    BME280Tempcol_array = table2array(BME280Tempcol);
    timecol_tot = cat(1, timecol_tot, timecol_array);
    EZOCO2col_tot = cat(1, EZOCO2col_tot, EZOCO2col_array);
    EZOO2col_tot = cat(1, EZOO2col_tot, EZOO2col_array);
    SGP30CO2col_tot = cat(1, SGP30CO2col_tot, SGP30CO2col_array);
    SGP30TVOCcol_tot = cat(1, SGP30TVOCcol_tot, SGP30TVOCcol_array);
    BME280Humcol_tot = cat(1, BME280Humcol_tot, BME280Humcol_array);
    BME280Presscol_tot = cat(1, BME280Presscol_tot, BME280Presscol_array);
    BME280Tempcol_tot = cat(1, BME280Tempcol_tot, BME280Tempcol_array);
end
I receive this error each time:
Error using datetime/cat (line 1376)
All inputs must be datetimes or date/time character vectors or date/time strings.
Error in Plot_attempt_9_6_22_1 (line 66)
timecol_tot = cat(1, timecol_tot, timecol_array);
As per How to preallocate a datetime array in matlab, I have tried:
timecol_tot = [];
timecol_tot = datetime([],[],[],[],[],[]);
timecol_tot = NaT(1,1);
to no avail.
Because the length of each of these files may vary, I didn't try pre-allocating the datetime array to the size of the incoming data, since that may not work across different datasets. However, it does work if I have only one file.
Is there a way to do this that would allow me to just initialize an empty array and concatenate datetime arrays to it without defining the size of the first datetime set to add?
I ended up fixing this problem by adding an if statement, and now, it looks something like this:
if k == 1
    timecol_tot = timecol_array;
    EZOCO2col_tot = cat(1, EZOCO2col_tot, EZOCO2col_double);
    EZOO2col_tot = cat(1, EZOO2col_tot, EZOO2col_double);
    SGP30CO2col_tot = cat(1, SGP30CO2col_tot, SGP30CO2col_array);
    SGP30TVOCcol_tot = cat(1, SGP30TVOCcol_tot, SGP30TVOCcol_array);
    BME280Humcol_tot = cat(1, BME280Humcol_tot, BME280Humcol_array);
    BME280Presscol_tot = cat(1, BME280Presscol_tot, BME280Presscol_array);
    BME280Tempcol_tot = cat(1, BME280Tempcol_tot, BME280Tempcol_array);
else
    timecol_tot = cat(1, timecol_tot, timecol_array);
    EZOCO2col_tot = cat(1, EZOCO2col_tot, EZOCO2col_double);
    EZOO2col_tot = cat(1, EZOO2col_tot, EZOO2col_double);
    SGP30CO2col_tot = cat(1, SGP30CO2col_tot, SGP30CO2col_array);
    SGP30TVOCcol_tot = cat(1, SGP30TVOCcol_tot, SGP30TVOCcol_array);
    BME280Humcol_tot = cat(1, BME280Humcol_tot, BME280Humcol_array);
    BME280Presscol_tot = cat(1, BME280Presscol_tot, BME280Presscol_array);
    BME280Tempcol_tot = cat(1, BME280Tempcol_tot, BME280Tempcol_array);
end
Another reason I was getting a type error was that some of my files were empty or corrupted. If you're having similar issues, check to make sure that you're not working with empty or corrupted CSVs!
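For what it's worth, the if statement can also be avoided. Two untested sketches, reusing the variable names from the question:
% Option 1: initialize with typed empty arrays, so cat sees matching types on both sides
timecol_tot = datetime.empty(0,1);   % or NaT(0,1)
EZOCO2col_tot = zeros(0,1);          % likewise for the other numeric columns
% Option 2: collect whole tables and concatenate once after the loop
% (assumes every file has the same eight columns)
allTables = cell(1, length(fileList));
for k = 1:length(fileList)
    fullFileName = fullfile(fileList(k).folder, fileList(k).name);
    allTables{k} = readtable(fullFileName);
end
combined = vertcat(allTables{:});   % one big table
timecol_tot = combined{:,1};        % datetime column
col2_tot = combined{:,2};           % and so on for the remaining numeric columns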
You are asking two questions; I am new here, but some moderators may tell you to break it into two separate questions.
I'll answer both anyway:
1.- The error doesn't seem to be coming from cat.
With the data you have provided I have reproduced the cat calls, and cat works fine; it concatenates datetime values when required and doubles when told to.
The problem has to be somewhere else; as long as you feed cat inputs of the same type, there shouldn't be any error.
The error is reported in a script called Plot_attempt_9_6_22_1, right?
Either
you are passing rows instead of columns to plot, so the plot input vectors contain mixed types (a missing transpose, or more than one, somewhere),
or
somewhere you are mixing date/time values with something that is not a date/time type.
2.- Do you really need to preallocate memory?
In any case, MATLAB suggests preallocating tables with the following expression:
T = table('Size',sz,'VariableTypes',varTypes)
not with the C/C++-style initialization mentioned in the question.
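For example, a minimal sketch of that preallocation (available in newer MATLAB releases; the sizes, types and names below are assumptions based on the eight columns shown in the question, not values taken from your files):
sz = [2312 8];   % e.g. 289 files x 8 rows each
varTypes = {'datetime','double','double','double','double','double','double','double'};
varNames = {'Date_Time','EZOO2Con','EZOCO2Con','SGPCO2Con','SGPTVOC','BMEHumidity','BMEPressure','BMETemp'};
T = table('Size',sz,'VariableTypes',varTypes,'VariableNames',varNames);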
import urllib2
import pandas as pd
from bs4 import BeautifulSoup

x = 0
i = 1
data = []
while (i < 13):
    soup = BeautifulSoup(urllib2.urlopen(
        'http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%d&seasonId=2018&startIndex=' % i, +str(x)).read(), 'html')
    tableStats = soup.find("table", ("class", "playerTableTable tableBody"))
    for row in tableStats.findAll('tr')[2:]:
        col = row.findAll('td')
        try:
            name = col[0].a.string.strip()
            opp = col[1].a.string.strip()
            rec = col[10].string.strip()
            yds = col[11].string.strip()
            dt = col[12].string.strip()
            pts = col[13].string.strip()
            data.append([name, opp, rec, yds, dt, pts])
        except Exception as e:
            pass
    df = pd.DataFrame(data=data, columns=[
        'PLAYER', 'OPP', 'REC', 'YDS', 'TD', 'PTS'])
    df
    i += 1
I have been working with a fantasy football program and I am trying to increment data over all weeks so I can create a dataframe for the top 40 players for each week.
I have been able to get it for any week of my choice by manually entering the week number in the PeriodId part of the URL, but I am trying to programmatically increment it over each week to make it easier. I have tried using PeriodId='+ i +' and PeriodId=%d, but I keep getting various errors about concatenating str and int and about bad operands. Any suggestions or tips?
Try removing the comma between %i and str(x) to concatenate the strings and see if that helps.
soup = BeautifulSoup(urllib2.urlopen('http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%d&seasonId=2018&startIndex='%i, +str(x)).read(), 'html')
should be:
soup = BeautifulSoup(urllib2.urlopen('http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%d&seasonId=2018&startIndex='%i +str(x)).read(), 'html')
If you have problems concatenating or formatting the URL, create a variable for it instead of writing it all on one line inside BeautifulSoup and urllib2.urlopen.
Use parentheses to format with multiple values, like "before %s is %s" % (1, 0):
url = 'http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%s&seasonId=2018&startIndex=%s' % (i, x)
# or
#url = 'http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%s&seasonId=2018&startIndex=0' % i
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
Restructuring the code like this will not affect performance.
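Putting it together, a rough sketch of the whole loop might look like this (Python 2 with urllib2, as in the question; the URL and the column indices come straight from the question and I haven't verified the page still serves this layout):
import urllib2
import pandas as pd
from bs4 import BeautifulSoup

base = ('http://games.espn.com/ffl/tools/projections?'
        '&slotCategoryId=4&scoringPeriodId=%d&seasonId=2018&startIndex=%d')

weekly_frames = []
for week in range(1, 13):                       # weeks 1 through 12
    html = urllib2.urlopen(base % (week, 0)).read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'playerTableTable tableBody'})
    rows = []
    for tr in table.findAll('tr')[2:]:          # skip the two header rows
        col = tr.findAll('td')
        try:
            rows.append([col[0].a.string.strip(), col[1].a.string.strip(),
                         col[10].string.strip(), col[11].string.strip(),
                         col[12].string.strip(), col[13].string.strip()])
        except Exception:
            pass
    df = pd.DataFrame(rows, columns=['PLAYER', 'OPP', 'REC', 'YDS', 'TD', 'PTS'])
    df['WEEK'] = week                           # keep track of the week
    weekly_frames.append(df)

all_weeks = pd.concat(weekly_frames, ignore_index=True)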
This question already has answers here:
Faster way to read fixed-width files
(4 answers)
Closed 4 years ago.
I have a huge dataset (14 GB, 200 Mn rows) consisting of a single character-vector column. I've fread it (which took >30 minutes on a 48-core, 128 GB server). Each string contains concatenated information on various fields. For instance, the first row of my table looks like:
2014120900000001091500bbbbcompany_name00032401
where the first 8 characters represent date in YYYYMMDD format, next 8 characters are id, next 6 the time in HHMMSS format and then next 16 are name (prefixed with b's) and the last 8 are price (2 decimal places).
I need to transfer the above 1 column data.table into 5 columns: date, id, time, name, price.
For the above character vector that will turn out to be: date = "2014-12-09", id = 1, time = "09:15:00", name = "company_name", price = 324.01
I am looking for a (very) fast and efficient dplyr / data.table solution. Right now I am doing it using substr:
date = as.Date(substr(d, 1, 8), "%Y%m%d");
and it's taking forever to execute!
Update: With readr::read_fwf I am able to read the file in 5-10 minutes. Apparently, reading this way is faster than fread. Below is the code:
f = "file_name";
num_cols = 5;
col_widths = c(8,8,6,16,8);
col_classes = "ciccn";
col_names = c("date", "id", "time", "name", "price");
# takes 5-10 mins
data = readr::read_fwf(file = f, col_positions = readr::fwf_widths(col_widths, col_names), col_types = col_classes, progress = T);
setDT(data);
# object.size(data) / 2^30; # 17.5 GB
A possible solution:
library(data.table)
library(stringi)
widths <- c(8,8,6,16,8)
sp <- c(1, cumsum(widths[-length(widths)]) + 1)
ep <- cumsum(widths)
DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
which gives:
V1 V2 V3 V4 V5
1: 20141209 00000001 091500 bbbbcompany_name 00032401
Including some additional processing to get the desired result:
DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))
][, .(date = as.Date(V1, "%Y%m%d"),
id = as.integer(V2),
time = as.ITime(V3, "%H%M%S"),
name = sub("^(bbbb)","",V4),
price = as.numeric(V5)/100)]
which gives:
date id time name price
1: 2014-12-09 1 09:15:00 company_name 324.01
But you are actually reading a fixed-width file, so you could also consider read.fwf from base R or read_fwf from readr, or write your own fread.fwf function like I did a while ago:
fread.fwf <- function(file, widths, enc = "UTF-8") {
sp <- c(1, cumsum(widths[-length(widths)]) + 1)
ep <- cumsum(widths)
fread(file = file, header = FALSE, sep = "\n", encoding = enc)[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
}
Used data:
DT <- data.table(V1 = "2014120900000001091500bbbbcompany_name00032401")
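Usage of that helper would then be along the lines of the following (the file name is just a placeholder):
library(data.table)
library(stringi)
DT2 <- fread.fwf("yourfile.txt", widths = c(8, 8, 6, 16, 8))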
Maybe your solution is not so bad.
I am using this data:
df <- data.table(text = rep("2014120900000001091500bbbbcompany_name00032401", 100000))
Your solution:
> system.time(df[, .(date = as.Date(substr(text, 1, 8), "%Y%m%d"),
+ id = as.integer(substr(text, 9, 16)),
+ time = substr(text, 17, 22),
+ name = substr(text, 23, 38),
+ price = as.numeric(substr(text, 39, 46))/100)])
user system elapsed
0.17 0.00 0.17
#Jaap solution:
> library(data.table)
> library(stringi)
>
> widths <- c(8,8,6,16,8)
> sp <- c(1, cumsum(widths[-length(widths)]) + 1)
> ep <- cumsum(widths)
>
> system.time(df[, lapply(seq_along(sp), function(i) stri_sub(text, sp[i], ep[i]))
+ ][, .(date = as.Date(V1, "%Y%m%d"),
+ id = as.integer(V2),
+ time = V3,
+ name = sub("^(bbbb)","",V4),
+ price = as.numeric(V5)/100)])
user system elapsed
0.20 0.00 0.21
An attempt with read.fwf:
> setClass("myDate")
> setAs("character","myDate", function(from) as.Date(from, format = "%Y%m%d"))
> setClass("myNumeric")
> setAs("character","myNumeric", function(from) as.numeric(from)/100)
>
> ff <- function(x) {
+ file <- textConnection(x)
+ read.fwf(file, c(8, 8, 6, 16, 8),
+ col.names = c("date", "id", "time", "name", "price"),
+ colClasses = c("myDate", "integer", "character", "character", "myNumeric"))
+ }
>
> system.time(df[, as.list(ff(text))])
user system elapsed
2.33 6.15 8.49
All outputs are the same.
Maybe try using a numeric matrix instead of a data.frame; aggregation should take less time.
I have this table named BondData which contains the following:
Settlement Maturity Price Coupon
8/27/2016 1/12/2017 106.901 9.250
8/27/2019 1/27/2017 104.79 7.000
8/28/2016 3/30/2017 106.144 7.500
8/28/2016 4/27/2017 105.847 7.000
8/29/2016 9/4/2017 110.779 9.125
For each day in this table, I want to perform a certain task: assign several values to a variable and perform the necessary computations. The logic is like:
do while Settlement is the same
m_settle=current_row_settlement_value
m_maturity=current_row_maturity_value
and so on...
my_computation_here...
end
It's like I wanted to loop through my settlement dates and perform task for as long as the date is the same.
EDIT: Just to clarify my issue, I am implementing yield curve fitting using the Nelson-Siegel and Svensson models. Here are my codes so far:
function NS_SV_Models()
    load bondsdata
    BondData = table(Settlement,Maturity,Price,Coupon);
    BondData.Settlement = categorical(BondData.Settlement);
    Settlements = categories(BondData.Settlement); % get all unique Settlement
    for k = 1:numel(Settlements)
        rows = BondData.Settlement==Settlements(k);
        Bonds.Settle = Settlements(k); % current_row_settlement_value
        Bonds.Maturity = BondData.Maturity(rows); % current_row_maturity_value
        Bonds.Prices = BondData.Price(rows);
        Bonds.Coupon = BondData.Coupon(rows);
        Settle = Bonds.Settle;
        Maturity = Bonds.Maturity;
        CleanPrice = Bonds.Prices;
        CouponRate = Bonds.Coupon;
        Instruments = [Settle Maturity CleanPrice CouponRate];
        Yield = bndyield(CleanPrice,CouponRate,Settle,Maturity);
        NSModel = IRFunctionCurve.fitNelsonSiegel('Zero',Settlements(k),Instruments);
        SVModel = IRFunctionCurve.fitSvensson('Zero',Settlements(k),Instruments);
        NSModel.Parameters
        SVModel.Parameters
    end
end
Again, my main objective is to get each model's parameters (beta0, beta1, beta2, etc.) on a per-day basis. I am getting an error in Instruments = [Settle Maturity CleanPrice CouponRate]; because Settle contains only one record (8/27/2016) when it's supposed to have two, since there are two rows for this date. Also, I noticed that Maturity, CleanPrice and CouponRate contain all records; they should only contain the data for each respective day.
Hope I made my issue clearer now. By the way, I am using MATLAB R2015a.
Use a categorical array. Here is your function (without its header line; all the rows I can't run are commented out):
BondData = table(datetime(Settlement),datetime(Maturity),Price,Coupon,...
    'VariableNames',{'Settlement','Maturity','Price','Coupon'});
BondData.Settlement = categorical(BondData.Settlement);
Settlements = categories(BondData.Settlement); % get all unique Settlement
for k = 1:numel(Settlements)
    rows = BondData.Settlement==Settlements(k);
    Settle = BondData.Settlement(rows); % current_row_settlement_value
    Mature = BondData.Maturity(rows); % current_row_maturity_value
    CleanPrice = BondData.Price(rows);
    CouponRate = BondData.Coupon(rows);
    Instruments = [datenum(char(Settle)) datenum(char(Mature))...
        CleanPrice CouponRate];
    % Yield = bndyield(CleanPrice,CouponRate,Settle,Mature);
    %
    % NSModel = IRFunctionCurve.fitNelsonSiegel('Zero',Settlements(k),Instruments);
    % SVModel = IRFunctionCurve.fitSvensson('Zero',Settlements(k),Instruments);
    %
    % NSModel.Parameters
    % SVModel.Parameters
end
Keep in mind the following:
You cannot concatenate different types of variables, as you try to do in Instruments = [Settle Maturity CleanPrice CouponRate];
There is no need for the structure Bonds; you don't use it (e.g. Settle = Bonds.Settle;).
Use the relevant functions to convert between a datetime object and string or numbers. For instance, in the code above: datenum(char(Settle)). I don't know what kind of input you need to pass to the following functions.
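As a side note, a minimal sketch of the same per-day grouping done with unique instead of categorical (this assumes Settlement is stored as a datetime column; the column names come from the question):
settleDates = unique(BondData.Settlement);          % one entry per settlement day
for k = 1:numel(settleDates)
    rows = BondData.Settlement == settleDates(k);   % all bonds settling on that day
    CleanPrice = BondData.Price(rows);
    CouponRate = BondData.Coupon(rows);
    % ... fit the Nelson-Siegel / Svensson models for this day here ...
end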
I have a list of xy points that I'm trying to sum together to identify a centroid, but the output only reflects the last row read. I'm trying to create a centroid for each state. Here's the code:
Total_X1 = 0
Total_Y1 = 0
TotalPop1 = 0
#Cat = "cali"
cntyName1 = "cnty"
stateName1 = "statename"
for row in cursor:
    #if row[0] >= : ### for condition that is met
    #if row[0]== []:
    TheStateName1 = row[0]
    thecntyName1 = row[4]
    idpoly1 = row[5]
    idobject1 = row[6]
    stateFIPS1 = row[7]
    countyFIPS1 = row[8]
    fips1 = row[9]
    fipSnum1 = row[10]
    fipsNumer1 = row[11]
    #totarea = row[12]
    XPoint = row[13]
    YPoint = row[14]
    #print Cat
    print TheStateName1
    print thecntyName1
    print row ### do something with that value!
    Total_X1 += row[2] * row[3]
    print Total_X1
    Total_Y1 += row[1] * row[3]
    print Total_Y1
    TotalPop1 += row[3]
    print TotalPop1
print ""
print "X is: ", Total_X1
print "POP is: ", TotalPop1
centroid_X1 = Total_X1/TotalPop1
print "your x centroid is: ", centroid_X1
print ""
#print Cat
print thecntyName1
print TheStateName1
Any suggestions? Thanks!
The cursor can only 'see' one row at a time; you have to pull info from that row and store it elsewhere.
loc_list = [(row[0], row[1]) for row in arcpy.da.SearchCursor(dataset, ['X_coord', 'Y_coord'])]
Will give you a list of X,Y tuples from your attribute table.
After that you've got multiple options for turning that list of tuples into a spatial dataset before calculating the mean - start by reading the ESRI documentation for arcpy.Point and all the related topics linked, and go from there. If you have 10.3 or above you can use Mean Center once you have a point layer.
You'll probably get a wrong answer if you just take the mean of the X and Y without projecting first, so don't.
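For the population-weighted centroid asked about in the question specifically, one way is to accumulate per-state sums while the cursor runs. A rough sketch, where the field names ('STATE_NAME', 'X_coord', 'Y_coord', 'POP') are placeholders for whatever your attribute table actually uses:
import arcpy

sums = {}  # state -> [sum(x*pop), sum(y*pop), sum(pop)]
fields = ['STATE_NAME', 'X_coord', 'Y_coord', 'POP']
with arcpy.da.SearchCursor(dataset, fields) as cursor:
    for state, x, y, pop in cursor:
        s = sums.setdefault(state, [0.0, 0.0, 0.0])
        s[0] += x * pop
        s[1] += y * pop
        s[2] += pop

for state, (sx, sy, spop) in sums.items():
    # population-weighted centroid per state
    print("{}: {}, {}".format(state, sx / spop, sy / spop))
The caveat above still applies: do this on projected coordinates, not raw lat/long.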
The code I'm using imports data from multiple files and saves it into a cell array; the code is as follows:
[FileName,PathName,FilterIndex] = uigetfile('*.txt*','MultiSelect','on');
numfiles = size(FileName,2);
FileData = cell(1,numfiles);
for ii = 1:numfiles
    FileName{ii};
    A = [];
    entirefile = fullfile(PathName,FileName{ii});
    fid = fopen(entirefile);
    tline = fgets(fid);
    while ischar(tline)
        parts = textscan(tline, '%f;');
        if numel(parts{1}) > 0
            A = [A; parts{:}'];
        end
        tline = fgets(fid);
    end
    fclose(fid);
    FileData{ii} = A;
    A = FileData{ii};
    X = A(:,1);
    Y = A(:,5);
    DataToUse = [X,Y];
end
Now my issue is that I want to use the first DataToUse created by the loop, which will be the data from the first file, separately from the other files, but I cannot isolate it. I have tried DataToUse(1), DataToUse(1,1) and DataToUse(:,[1,2]) but none are working for me. An example of the type of data would be:
DataToUse=
0.0762 0.0271
0.0763 0.2671
0.0764 0.4079
0.0765 0.0510
0.0766 0.0087
0.0767 0.0099
0.0768 0.0067
0.0769 0.0047
0.0770 0.0047
0.0771 0.0349
0.0772 0.2094
0.0773 0.2740
0.0774 0.0294
0.0775 0.0100
0.0776 0.0159
I have different amounts of this kind of data depending on how many files are selected, but I would like to use only the first initially and use the others later. Does anybody know how I can go about doing this? Many thanks in advance.
The solution is to use cell arrays, like so:
DataToUse{ii} = [X, Y]
To get the desired output, put this after your for-loop:
firstLoopXY = DataToUse{1}
Enjoy!
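For reference, a minimal version of the loop with the cell array in place might look like this (a sketch built from the question's own code, so the same file-reading logic is assumed to work):
[FileName,PathName] = uigetfile('*.txt*','MultiSelect','on');
numfiles = size(FileName,2);
DataToUse = cell(1,numfiles);
for ii = 1:numfiles
    A = [];
    fid = fopen(fullfile(PathName,FileName{ii}));
    tline = fgets(fid);
    while ischar(tline)
        parts = textscan(tline, '%f;');
        if numel(parts{1}) > 0
            A = [A; parts{:}'];
        end
        tline = fgets(fid);
    end
    fclose(fid);
    DataToUse{ii} = [A(:,1), A(:,5)];   % X and Y for this file
end
firstLoopXY = DataToUse{1};             % data from the first file only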