Manipulating character arrays quickly in R data.table [duplicate]

This question already has answers here:
Faster way to read fixed-width files
(4 answers)
Closed 4 years ago.
I have a huge dataset (14 GB, 200 million rows) consisting of a single character vector. I've fread it (took > 30 minutes on a 48-core, 128 GB server). Each string contains concatenated information on various fields. For instance, the first row of my table looks like:
2014120900000001091500bbbbcompany_name00032401
where the first 8 characters represent date in YYYYMMDD format, next 8 characters are id, next 6 the time in HHMMSS format and then next 16 are name (prefixed with b's) and the last 8 are price (2 decimal places).
I need to split the above one-column data.table into 5 columns: date, id, time, name, price.
For the above character vector that will turn out to be: date = "2014-12-09", id = 1, time = "09:15:00", name = "company_name", price = 324.01
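(Added for clarity: a quick check of the stated offsets against the sample record. Python is used here purely for illustration; it is not part of the question's R workflow.)
# Slice the sample record at the stated field widths: 8+8+6+16+8 = 46 chars.
rec = "2014120900000001091500bbbbcompany_name00032401"
widths = [8, 8, 6, 16, 8]
fields, pos = [], 0
for w in widths:
    fields.append(rec[pos:pos + w])
    pos += w
print(fields)
# ['20141209', '00000001', '091500', 'bbbbcompany_name', '00032401']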
I am looking for a (very) fast and efficient dplyr / data.table solution. Right now I am doing it using substr:
date = as.Date(substr(d, 1, 8), "%Y%m%d");
and it's taking forever to execute!
Update: with readr::read_fwf I am able to read the file in 5-10 minutes. Apparently, this way of reading is faster than fread. Below is the code:
f = "file_name";
num_cols = 5;
col_widths = c(8,8,6,16,8);
col_classes = "ciccn";
col_names = c("date", "id", "time", "name", "price");
# takes 5-10 mins
data = readr::read_fwf(file = f, col_positions = readr::fwf_widths(col_widths, col_names), col_types = col_classes, progress = T);
setDT(data);
# object.size(data) / 2^30; # 17.5 GB

A possible solution:
library(data.table)
library(stringi)
widths <- c(8,8,6,16,8)
sp <- c(1, cumsum(widths[-length(widths)]) + 1)
ep <- cumsum(widths)
DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
which gives:
V1 V2 V3 V4 V5
1: 20141209 00000001 091500 bbbbcompany_name 00032401
Including some additional processing to get the desired result:
DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))
   ][, .(date = as.Date(V1, "%Y%m%d"),
         id = as.integer(V2),
         time = as.ITime(V3, "%H%M%S"),
         name = sub("^(bbbb)", "", V4),
         price = as.numeric(V5)/100)]
which gives:
date id time name price
1: 2014-12-09 1 09:15:00 company_name 324.01
But you are actually reading a fixed-width file. So you could also consider read.fwf from base R or read_fwf from readr, or write your own fread.fwf function like I did a while ago:
fread.fwf <- function(file, widths, enc = "UTF-8") {
  sp <- c(1, cumsum(widths[-length(widths)]) + 1)
  ep <- cumsum(widths)
  fread(file = file, header = FALSE, sep = "\n", encoding = enc)[
    , lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
}
Used data:
DT <- data.table(V1 = "2014120900000001091500bbbbcompany_name00032401")

Maybe your solution is not so bad.
I am using this data:
df <- data.table(text = rep("2014120900000001091500bbbbcompany_name00032401", 100000))
Your solution:
> system.time(df[, .(date = as.Date(substr(text, 1, 8), "%Y%m%d"),
+ id = as.integer(substr(text, 9, 16)),
+ time = substr(text, 17, 22),
+ name = substr(text, 23, 38),
+ price = as.numeric(substr(text, 39, 46))/100)])
user system elapsed
0.17 0.00 0.17
# @Jaap's solution:
> library(data.table)
> library(stringi)
>
> widths <- c(8,8,6,16,8)
> sp <- c(1, cumsum(widths[-length(widths)]) + 1)
> ep <- cumsum(widths)
>
> system.time(df[, lapply(seq_along(sp), function(i) stri_sub(text, sp[i], ep[i]))
+ ][, .(date = as.Date(V1, "%Y%m%d"),
+ id = as.integer(V2),
+ time = V3,
+ name = sub("^(bbbb)","",V4),
+ price = as.numeric(V5)/100)])
user system elapsed
0.20 0.00 0.21
An attempt with read.fwf:
> setClass("myDate")
> setAs("character","myDate", function(from) as.Date(from, format = "%Y%m%d"))
> setClass("myNumeric")
> setAs("character","myNumeric", function(from) as.numeric(from)/100)
>
> ff <- function(x) {
+ file <- textConnection(x)
+ read.fwf(file, c(8, 8, 6, 16, 8),
+ col.names = c("date", "id", "time", "name", "price"),
+ colClasses = c("myDate", "integer", "character", "character", "myNumeric"))
+ }
>
> system.time(df[, as.list(ff(text))])
user system elapsed
2.33 6.15 8.49
All outputs are the same.

Maybe try using a numeric matrix instead of a data.frame; aggregation should take less time.

Related

Python: Replace a column in a dataframe with datetime values

I'm trying to replace one column of a 4-column array with datetime values that I have processed. The problem is that it's difficult to keep the same shape across the different formats (DataFrame, array, ...).
dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:,0,0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
time = ds.variables["time"][:]
time = pd.to_datetime(ds.variables["time"][:],origin=pd.Timestamp('1850-01-01'),unit='D')
#np.datetime64(ds.variables["time"][:],'D')
x2 = pd.DataFrame(np.zeros((len(dataw),4), float))
x = np.zeros((len(dataw),4), float)
x[:,0] = time
x[:,1] = long
x[:,2] = lat[:]
x[:,3] = dataw[:]*86400
x=pd.DataFrame(x)
x[:,0] = pd.to_datetime(time,origin=pd.Timestamp('1850-01-01'),unit='D')
If I put the transformed dates directly into the array, the result looks like: 1.32542e+18
I tried
time = ds.variables["time"][:]
included it in the array, and then used
x[:,0]=pd.to_datetime(x[:,0],origin=pd.Timestamp('1850-01-01'),unit='D')
I get the error:
TypeError: unhashable type: 'slice'
I also tried directly:
time=pd.to_datetime(time,origin=pd.Timestamp('1850-01-01'),unit='D')
x[:,0] = time[:]
TypeError: unhashable type: 'slice'
Try building the DataFrame column by column instead, so each column keeps its own dtype:
import numpy as np
import pandas as pd

dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:, 0, 0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
# Decode the time axis once, with the same origin/unit as in the question
time = pd.to_datetime(ds.variables["time"][:],
                      origin=pd.Timestamp('1850-01-01'), unit='D')
# Build the DataFrame from a dict of columns instead of a single float
# ndarray, so the datetime column is not coerced to numbers
df = pd.DataFrame({"Time": time,
                   "Longitude": long,
                   "Latitude": lat,
                   "Data": dataw * 86400})
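Two details explain the failures in the question (a quick sketch, not part of the answer's code):
import pandas as pd

# 1) Datetimes stored into a float ndarray decay to nanoseconds since
#    the epoch, which is where values like 1.32542e+18 came from:
print(pd.Timestamp('2012-01-01').value)  # 1325376000000000000

# 2) Positional indexing on a DataFrame needs .iloc; x[:, 0] on a
#    DataFrame is what raised "TypeError: unhashable type: 'slice'":
x = pd.DataFrame({'a': [1.0, 2.0]})
col = x.iloc[:, 0]  # works; x[:, 0] would raise the TypeError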
Xarray makes the netcdf->pandas workflow quite straightforward:
import xarray as xr
ds = xr.open_dataset('file.nc', engine='netcdf4')
df = ds.to_dataframe()
Presuming your time variable is using cf-conventions, Xarray will automatically decode it into datetime objects.
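A hedged follow-up sketch tying this to the question's 'pr' variable (it assumes the same file; decode_times is on by default, and the resulting columns follow the dataset's dimension names):
import xarray as xr

ds = xr.open_dataset('file.nc', engine='netcdf4')
# Flatten pr(time, lat, lon) into a tidy table and apply the same
# unit conversion used in the question.
df = ds['pr'].to_dataframe().reset_index()  # columns: time, lat, lon, pr
df['pr'] = df['pr'] * 86400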

How can I speed up my optimization with Gekko?

My program optimizes the charging and discharging of a home battery to minimize the cost of electricity at the end of the year. Electricity usage of homes is measured every 15 minutes, so I have 96 measurement points in 1 day. I want to optimize the charging and discharging of the battery over 2 days, so that day 1 takes the usage of day 2 into account. I wrote the following code and it works.
from gekko import GEKKO
import numpy as np
import pandas as pd
import time
import math
# ------------------------ Import and read input data ------------------------
file = r'D:\Bedrijfseconomie\MP Thuisbatterijen\Spyder - Gekko\Data Sim 1.xlsx'
data = pd.read_excel(file, sheet_name='Input', na_values='NaN')
dataRead = pd.DataFrame(data, columns= ['Timestep','Verbruik woning (kWh)','Prijs afname (€/kWh)',
'Capaciteit batterij (kW)','Capaciteit batterij (kWh)',
'Rendement (%)','Verbruikersprofiel'])
timestep = dataRead['Timestep'].to_numpy()
usage_home = dataRead['Verbruik woning (kWh)'].to_numpy()
price = dataRead['Prijs afname (€/kWh)'].to_numpy()
cap_batt_kW = dataRead['Capaciteit batterij (kW)'].iloc[0]
cap_batt_kWh = dataRead['Capaciteit batterij (kWh)'].iloc[0]
efficiency = dataRead['Rendement (%)'].iloc[0]
usersprofile = dataRead['Verbruikersprofiel'].iloc[0]
# ---------------------------- Optimization model ----------------------------
# Initialise model
m = GEKKO()
# Global options
m.options.SOLVER = 1
# Constants
snelheid_laden = cap_batt_kW/4
T = len(timestep)
loss_charging = m.Const(value = (1-efficiency)/2)
max_cap_batt = m.Const(value = cap_batt_kWh)
min_cap_batt = m.Const(value = 0)
max_charge = m.Const(value = snelheid_laden) # max battery can charge in 15min
max_decharge = m.Const(value = -snelheid_laden) # max battery can decharge in 15min
# Parameters
dummy = np.array(np.ones([T]))
# Variables
e_batt = m.Array(m.Var, (T), lb = min_cap_batt, ub = max_cap_batt) # energy in battery
usage_net = m.Array(m.Var, (T)) # usage home & charge/decharge battery
price_paid = m.Array(m.Var, (T)) # price paid each 15min
charging = m.Array(m.Var, (T), lb = max_decharge, ub = max_charge) # amount charge/decharge each 15min
# Intermediates
e_batt[0] = m.Intermediate(charging[0])
for t in range(T):
    e_batt[t] = m.Intermediate(m.sum([charging[i]*(1-loss_charging) for i in range(t)]))
usage_net = [m.Intermediate(usage_home[t] + charging[t]) for t in range(T)]
price_paid = [m.Intermediate(usage_net[t] * price[t] / 100) for t in range(T)]
total_price = m.Intermediate(m.sum([price_paid[t] for t in range(T)]))
# Equations (constraints)
m.Equation([min_cap_batt*dummy[t] <= e_batt[t] for t in range(T)])
m.Equation([max_cap_batt*dummy[t] >= e_batt[t] for t in range(T)])
m.Equation([max_charge*dummy[t] >= charging[t] for t in range(T)])
m.Equation([max_decharge*dummy[t] <= charging[t] for t in range(T)])
m.Equation([min_cap_batt*dummy[t] <= usage_net[t] for t in range(T)])
m.Equation([(-1*charging[t]) <= (1-loss_charging)*e_batt[t] for t in range(T)])
# Objective
m.Minimize(total_price)
# Solve problem
m.solve()
My code runs and it works, but although it reports a solution time of 10 seconds, the total time for it to run is around 8 minutes. Does anyone know a way I can speed it up?
There are a few ways to speed up the Gekko code:
Solve locally instead of on the public server. The option is m=GEKKO(remote=False). The public server can slow down with many jobs.
Use sum() instead of m.sum(). This can be faster for compiling the model. Otherwise, use m.integral(x) if you need the integral of x.
Many of the equations are repeated at each time horizon step. Gekko is more efficient using a single equation definition with IMODE=2 (for algebraic equation models) or IMODE=6 (for differential / algebraic equation models) and then it creates the equations over the time horizon. You may need to use m.vsum() instead of m.sum().
For additional diagnosis, try setting m.options.DIAGLEVEL=1 to get a detailed timing report of how long it takes to compile the model and perform each function, 1st derivative, and 2nd derivative calculation. It also gives a detailed view of the solver versus model time during the solution phase.
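As a minimal sketch of the first and last suggestions (only the GEKKO constructor and one option; the model body from the question is elided):
from gekko import GEKKO

# Solve locally instead of on the public server, and request the
# timing report for model compilation and derivative calculations.
m = GEKKO(remote=False)
m.options.DIAGLEVEL = 1
# ... define constants, parameters, variables, and equations as before ...
# m.solve()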
Update with Data File Testing
Thanks for sending the data file. The run directory shows that the model file is 58,682 lines long. It takes a while to compile a model that size. Here is the solution from the files you sent:
--------- APM Model Size ------------
Each time step contains
Objects : 193
Constants : 5
Variables : 20641
Intermediates: 578
Connections : 18721
Equations : 20259
Residuals : 19681
Number of state variables: 20641
Number of total equations: - 19873
Number of slack variables: - 1152
---------------------------------------
Degrees of freedom : -384
* Warning: DOF <= 0
----------------------------------------------
Steady State Optimization with APOPT Solver
----------------------------------------------
Iter Objective Convergence
0 3.37044E+01 5.00000E+00
1 2.81987E+01 1.00000E-10
2 2.81811E+01 5.22529E-12
3 2.81811E+01 2.10942E-15
4 2.81811E+01 2.10942E-15
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 10.5119999999879 sec
Objective : 28.1811214884047
Successful solution
---------------------------------------------------
Here is a version that uses IMODE=6 instead. You define the variables and equations once and let Gekko handle the time discretization. It makes a much more efficient model because there is no unnecessary duplication of equations.
from gekko import GEKKO
import numpy as np
import pandas as pd
import time
import math
# ------------------------ Import and read input data ------------------------
file = r'Data Sim 1.xlsx'
data = pd.read_excel(file, sheet_name='Input', na_values='NaN')
dataRead = pd.DataFrame(data, columns= ['Timestep','Verbruik woning (kWh)','Prijs afname (€/kWh)',
'Capaciteit batterij (kW)','Capaciteit batterij (kWh)',
'Rendement (%)','Verbruikersprofiel'])
timestep = dataRead['Timestep'].to_numpy()
usage_home = dataRead['Verbruik woning (kWh)'].to_numpy()
price = dataRead['Prijs afname (€/kWh)'].to_numpy()
cap_batt_kW = dataRead['Capaciteit batterij (kW)'].iloc[0]
cap_batt_kWh = dataRead['Capaciteit batterij (kWh)'].iloc[0]
efficiency = dataRead['Rendement (%)'].iloc[0]
usersprofile = dataRead['Verbruikersprofiel'].iloc[0]
# ---------------------------- Optimization model ----------------------------
# Initialise model
m = GEKKO()
m.open_folder()
# Global options
m.options.SOLVER = 1
m.options.IMODE = 6
# Constants
snelheid_laden = cap_batt_kW/4
m.time = timestep
loss_charging = m.Const(value = (1-efficiency)/2)
max_cap_batt = m.Const(value = cap_batt_kWh)
min_cap_batt = m.Const(value = 0)
max_charge = m.Const(value = snelheid_laden) # max battery can charge in 15min
max_decharge = m.Const(value = -snelheid_laden) # max battery can decharge in 15min
# Parameters
usage_home = m.Param(usage_home)
price = m.Param(price)
# Variables
e_batt = m.Var(value=0, lb = min_cap_batt, ub = max_cap_batt) # energy in battery
price_paid = m.Var() # price paid each 15min
charging = m.Var(lb = max_decharge, ub = max_charge) # amount charge/decharge each 15min
usage_net = m.Var(lb=min_cap_batt)
# Equations
m.Equation(e_batt==m.integral(charging*(1-loss_charging)))
m.Equation(usage_net==usage_home + charging)
price_paid = m.Intermediate(usage_net * price / 100)
m.Equation(-charging <= (1-loss_charging)*e_batt)
# Objective
m.Minimize(price_paid)
# Solve problem
m.solve()
import matplotlib.pyplot as plt
plt.plot(m.time,e_batt.value,label='Battery Charge')
plt.plot(m.time,charging.value,label='Charging')
plt.plot(m.time,price_paid.value,label='Price')
plt.plot(m.time,usage_net.value,label='Net Usage')
plt.xlabel('Time'); plt.grid(); plt.legend(); plt.show()
The model is only 31 lines long (see gk0_model.apm) and it solves much faster (a couple seconds total).
--------- APM Model Size ------------
Each time step contains
Objects : 0
Constants : 5
Variables : 8
Intermediates: 1
Connections : 0
Equations : 6
Residuals : 5
Number of state variables: 1337
Number of total equations: - 955
Number of slack variables: - 191
---------------------------------------
Degrees of freedom : 191
----------------------------------------------
Dynamic Control with APOPT Solver
----------------------------------------------
Iter Objective Convergence
0 3.46205E+01 3.00000E-01
1 3.30649E+01 4.41141E-10
2 3.12774E+01 1.98558E-11
3 3.03148E+01 1.77636E-15
4 2.96824E+01 3.99680E-15
5 2.82700E+01 8.88178E-16
6 2.82039E+01 1.77636E-15
7 2.81334E+01 8.88178E-16
8 2.81085E+01 1.33227E-15
9 2.81039E+01 8.88178E-16
Iter Objective Convergence
10 2.81005E+01 8.88178E-16
11 2.80999E+01 1.77636E-15
12 2.80996E+01 8.88178E-16
13 2.80996E+01 8.88178E-16
14 2.80996E+01 8.88178E-16
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 0.527499999996508 sec
Objective : 28.0995878585948
Successful solution
---------------------------------------------------
There is no long compile time. Also, the solver time is reduced from 10 sec to 0.5 sec. The objective function is nearly the same (28.18 versus 28.10).
Here is a complete version without the data file dependency (in case the data file isn't available in the future).
from gekko import GEKKO
import numpy as np
import pandas as pd
import time
import math
# ------------------------ Import and read input data ------------------------
timestep = np.arange(1,193)
usage_home = np.array([0.05,0.07,0.09,0.07,0.05,0.07,0.07,0.07,0.06,
0.05,0.07,0.07,0.09,0.07,0.06,0.07,0.07,
0.07,0.16,0.12,0.17,0.08,0.10,0.11,0.06,
0.06,0.06,0.06,0.06,0.07,0.07,0.07,0.08,
0.08,0.06,0.07,0.07,0.07,0.07,0.05,0.07,
0.07,0.07,0.07,0.21,0.08,0.07,0.08,0.27,
0.12,0.09,0.10,0.11,0.09,0.09,0.08,0.08,
0.12,0.15,0.08,0.10,0.08,0.10,0.09,0.10,
0.09,0.08,0.10,0.12,0.10,0.10,0.10,0.11,
0.10,0.10,0.11,0.13,0.21,0.12,0.10,0.10,
0.11,0.10,0.11,0.12,0.12,0.10,0.11,0.10,
0.10,0.10,0.11,0.10,0.10,0.09,0.08,0.12,
0.10,0.11,0.11,0.10,0.06,0.05,0.06,0.06,
0.06,0.07,0.06,0.06,0.05,0.06,0.05,0.06,
0.05,0.06,0.05,0.06,0.07,0.06,0.09,0.10,
0.10,0.22,0.08,0.06,0.05,0.06,0.08,0.08,
0.07,0.08,0.07,0.07,0.16,0.21,0.08,0.08,
0.09,0.09,0.10,0.09,0.09,0.08,0.12,0.24,
0.09,0.08,0.09,0.08,0.10,0.24,0.08,0.09,
0.09,0.08,0.08,0.07,0.06,0.05,0.06,0.07,
0.07,0.05,0.05,0.06,0.05,0.28,0.11,0.20,
0.10,0.09,0.28,0.10,0.15,0.09,0.10,0.18,
0.12,0.13,0.30,0.10,0.11,0.10,0.10,0.11,
0.10,0.21,0.10,0.10,0.12,0.10,0.08])
price = np.array([209.40,209.40,209.40,209.40,193.00,193.00,193.00,
193.00,182.75,182.75,182.75,182.75,161.60,161.60,
161.60,161.60,154.25,154.25,154.25,154.25,150.70,
150.70,150.70,150.70,150.85,150.85,150.85,150.85,
150.00,150.00,150.00,150.00,153.25,153.25,153.25,
153.25,153.25,153.25,153.25,153.25,151.35,151.35,
151.35,151.35,151.70,151.70,151.70,151.70,154.95,
154.95,154.95,154.95,150.20,150.20,150.20,150.20,
153.75,153.75,153.75,153.75,160.55,160.55,160.55,
160.55,179.90,179.90,179.90,179.90,202.00,202.00,
202.00,202.00,220.25,220.25,220.25,220.25,245.75,
245.75,245.75,245.75,222.90,222.90,222.90,222.90,
203.40,203.40,203.40,203.40,205.30,205.30,205.30,
205.30,192.80,192.80,192.80,192.80,177.00,177.00,
177.00,177.00,159.90,159.90,159.90,159.90,152.50,
152.50,152.50,152.50,143.95,143.95,143.95,143.95,
142.10,142.10,142.10,142.10,143.75,143.75,143.75,
143.75,170.80,170.80,170.80,170.80,210.35,210.35,
210.35,210.35,224.45,224.45,224.45,224.45,226.30,
226.30,226.30,226.30,227.85,227.85,227.85,227.85,
225.45,225.45,225.45,225.45,225.80,225.80,225.80,
225.80,224.50,224.50,224.50,224.50,220.30,220.30,
220.30,220.30,220.00,220.00,220.00,220.00,221.90,
221.90,221.90,221.90,230.25,230.25,230.25,230.25,
233.60,233.60,233.60,233.60,225.20,225.20,225.20,
225.20,179.85,179.85,179.85,179.85,171.85,171.85,
171.85,171.85,162.90,162.90,162.90,162.90,158.85,
158.85,158.85,158.85])
cap_batt_kW = 3.00
cap_batt_kWh = 5.00
efficiency = 0.95
usersprofile = 1
# ---------------------------- Optimization model ----------------------------
# Initialise model
m = GEKKO()
#m.open_folder()
# Global options
m.options.SOLVER = 1
m.options.IMODE = 6
# Constants
snelheid_laden = cap_batt_kW/4
m.time = timestep
loss_charging = m.Const(value = (1-efficiency)/2)
max_cap_batt = m.Const(value = cap_batt_kWh)
min_cap_batt = m.Const(value = 0)
max_charge = m.Const(value = snelheid_laden) # max battery can charge in 15min
max_decharge = m.Const(value = -snelheid_laden) # max battery can decharge in 15min
# Parameters
usage_home = m.Param(usage_home)
price = m.Param(price)
# Variables
e_batt = m.Var(value=0, lb = min_cap_batt, ub = max_cap_batt) # energy in battery
price_paid = m.Var() # price paid each 15min
charging = m.Var(lb = max_decharge, ub = max_charge) # amount charge/decharge each 15min
usage_net = m.Var(lb=min_cap_batt)
# Equations
m.Equation(e_batt==m.integral(charging*(1-loss_charging)))
m.Equation(usage_net==usage_home + charging)
price_paid = m.Intermediate(usage_net * price / 100)
m.Equation(-charging <= (1-loss_charging)*e_batt)
# Objective
m.Minimize(price_paid)
# Solve problem
m.solve()
import matplotlib.pyplot as plt
plt.plot(m.time,e_batt.value,label='Battery Charge')
plt.plot(m.time,charging.value,label='Charging')
plt.plot(m.time,price_paid.value,label='Price')
plt.plot(m.time,usage_net.value,label='Net Usage')
plt.xlabel('Time'); plt.grid(); plt.legend(); plt.show()

Apache Solr heatmap: how to convert the int2D array returned by a heatmap into usable coordinates

I'm using a heatmap facet on a spatial field, which returns a 2D array like this:
"counts_ints2D",
[
null,
null,
null,
null,
[
0,
8,
4,
0,
0,
0,
0,
0,
0,
...
I want to locate those clusters on the map, but the problem is that I don't know how to convert that 2D array into geo coordinates.
There's no documentation out there showing what to do with those integers.
Can somebody give some guidance?
Going with the data you gave for Glasgow, and using the formula given in the comments, let's explore the coordinates in a Python REPL:
# setup
>>> minX = -180
>>> maxX = 180
>>> minY = -53.4375
>>> maxY = 74.53125
>>> columns = 256
>>> rows = 91
# calculate widths
>>> bucket_width = (maxX - minX) / columns
>>> bucket_width
1.40625
>>> bucket_height = (maxY - minY) / rows
>>> bucket_height
1.40625
# calculate area for bucket in heatmap facet for x = 124, y = 13
# point in lower left coordinate
>>> lower_left = {
... 'lat': maxY - (13 + 1) * bucket_height,
... 'lon': minX + 124 * bucket_width,
... }
>>> lower_left
{'lat': 54.84375, 'lon': -5.625}
# point in upper right
>>> upper_right = {
... 'lat': maxY - (13 + 1) * bucket_height + bucket_height,
... 'lon': minX + 124 * bucket_width + bucket_width,
... }
>>> upper_right
{'lat': 56.25, 'lon': -4.21875}
Let's graph these points on a map, courtesy of OpenStreetMap. We generate a small CSV snippet we can import on uMap (select the up arrow, choose 'csv' as the type and enter the content into the text box) to get our coordinates to show:
>>> bbox = [
... "lat,lon,description",
... str(lower_left['lat']) + "," + str(lower_left['lon']) + ",ll",
... str(upper_right['lat']) + "," + str(lower_left['lon']) + ",ul",
... str(upper_right['lat']) + "," + str(upper_right['lon']) + ",uu",
... str(lower_left['lat']) + "," + str(upper_right['lon']) + ",lu",
... ]
>>> print("\n".join(bbox))
lat,lon,description
54.84375,-5.625,ll
56.25,-5.625,ul
56.25,-4.21875,uu
54.84375,-4.21875,lu
After pasting these points into the import box and creating the layer, we get a map (based on OpenStreetMap data through uMap) whose bounding box encloses Glasgow, as expected.
Here's some code that takes 180th meridian (date line) wrapping into account:
$columns = $heatmap['columns'];
$rows = $heatmap['rows'];
$minX = $heatmap['minX'];
$maxX = $heatmap['maxX'];
$minY = $heatmap['minY'];
$maxY = $heatmap['maxY'];
$counts = $heatmap['counts_ints2D'];
// If our min longitude is greater than max longitude, we're crossing
// the 180th meridian (date line).
$crosses_meridian = $minX > $maxX;
// Bucket width needs to be calculated differently when crossing the
// meridian since it wraps.
$bucket_width = $crosses_meridian
    ? (360 - abs($maxX - $minX)) / $columns
    : ($maxX - $minX) / $columns;
$bucket_height = ($maxY - $minY) / $rows;
$points = [];
foreach ($counts as $rowIndex => $row) {
    if (!$row) continue;
    foreach ($row as $columnIndex => $column) {
        if (!$column) continue;
        $point = [];
        $point['count'] = $column;
        // Put the count in the middle of the bucket (adding a half height and width).
        $point['lat'] = $maxY - (($rowIndex + 1) * $bucket_height) + ($bucket_height / 2);
        $point['lng'] = $minX + ($columnIndex * $bucket_width) + ($bucket_width / 2);
        // We crossed the meridian, so wrap back around to negative.
        if ($point['lng'] > 180) {
            $point['lng'] -= 360;
        }
        $points[] = $point;
    }
}
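For reference, here is a rough Python equivalent of the same conversion; this is a sketch that assumes the decoded heatmap response is a dict with the keys used above (columns, rows, minX, maxX, minY, maxY, counts_ints2D):
def heatmap_points(heatmap):
    """Turn a Solr heatmap facet response into bucket-center points."""
    min_x, max_x = heatmap['minX'], heatmap['maxX']
    min_y, max_y = heatmap['minY'], heatmap['maxY']
    # If min longitude exceeds max longitude, the box wraps the date line.
    crosses_meridian = min_x > max_x
    lon_span = (360 - abs(max_x - min_x)) if crosses_meridian else (max_x - min_x)
    bucket_width = lon_span / heatmap['columns']
    bucket_height = (max_y - min_y) / heatmap['rows']
    points = []
    for row_index, row in enumerate(heatmap['counts_ints2D'] or []):
        if not row:
            continue
        for col_index, count in enumerate(row):
            if not count:
                continue
            # Center of the bucket: half a height/width in from its corner.
            lat = max_y - (row_index + 1) * bucket_height + bucket_height / 2
            lng = min_x + col_index * bucket_width + bucket_width / 2
            if lng > 180:  # wrapped past the date line; fold back
                lng -= 360
            points.append({'count': count, 'lat': lat, 'lng': lng})
    return points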

R: how to properly create rx_forest_model object?

I'm trying to do a churn analysis with R and SQL Server 2016.
I have uploaded my dataset on my database in a local SQL Server and I did all the preliminary work on this dataset.
Well, now I have this function trainModel() which I would use to estimate my random model forest:
trainModel = function(sqlSettings, trainTable) {
  sqlConnString = sqlSettings$connString
  trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
                                  table = trainTable,
                                  colInfo = cdrColInfo)
  ## Create training formula
  labelVar = "churn"
  trainVars <- rxGetVarNames(trainDataSQL)
  trainVars <- trainVars[!trainVars %in% c(labelVar)]
  temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
  formula <- as.formula(temp)
  ## Train gradient tree boosting with mxFastTree on SQL data source
  library(RevoScaleR)
  rx_forest_model <- rxDForest(formula = formula,
                               data = trainDataSQL,
                               nTree = 8,
                               maxDepth = 16,
                               mTry = 2,
                               minBucket = 1,
                               replace = TRUE,
                               importance = TRUE,
                               seed = 8,
                               parms = list(loss = c(0, 4, 1, 0)))
  return(rx_forest_model)
}
But when I run the function I get the following output with a warning:
> system.time({
+ trainModel(sqlSettings, trainTable)
+ })
user system elapsed
0.29 0.07 58.18
Warning message:
In tempGetNumObs(numObs) :
Number of observations not available for this data source. 'numObs' set to 1e6.
And along with this warning message, the function trainModel() does not create the rx_forest_model object.
Does anyone have any suggestions on how to solve this problem?
After several attempts, I found the reason why the function trainModel() did not work properly.
It is not a connection string problem, nor a data source type issue.
The problem is in the syntax of the function trainModel().
It is enough to remove this statement from the body of the function:
return(rx_forest_model)
This way, the function returns the same warning message, but creates the rx_forest_model object correctly.
So, the correct function is:
trainModel = function(sqlSettings, trainTable) {
  sqlConnString = sqlSettings$connString
  trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
                                  table = trainTable,
                                  colInfo = cdrColInfo)
  ## Create training formula
  labelVar = "churn"
  trainVars <- rxGetVarNames(trainDataSQL)
  trainVars <- trainVars[!trainVars %in% c(labelVar)]
  temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
  formula <- as.formula(temp)
  ## Train gradient tree boosting with mxFastTree on SQL data source
  library(RevoScaleR)
  rx_forest_model <- rxDForest(formula = formula,
                               data = trainDataSQL,
                               nTree = 8,
                               maxDepth = 16,
                               mTry = 2,
                               minBucket = 1,
                               replace = TRUE,
                               importance = TRUE,
                               seed = 8,
                               parms = list(loss = c(0, 4, 1, 0)))
}

Restricted Boltzmann machine

I'm doing a college assignment on restricted Boltzmann machines (RBMs), but this code raises an error and I am confused about how to get it working. Can anybody help me with this error?
def reconstruct(self, v):
    h = sigmoid(np.dot(v, self.W) + self.hbias)
    reconstructed_v = sigmoid(np.dot(h, self.W.T) + self.vbias)
    return reconstructed_v

def test_rbm(learning_rate=0.1, k=1, training_epochs=10):
    data = datainput
    rng = np.random.RandomState(123)
    # construct RBM
    rbm = RBM(input=data, n_visible=40, n_hidden=20, np_rng=rng)
    # train
    for epoch in range(training_epochs):
        rbm.contrastive_divergence(lr=learning_rate, k=k)
        cost = rbm.get_reconstruction_cross_entropy()
        print('Training epoch %d, cost is ' % epoch, cost, file=sys.stderr)
    # test
    v = datatarget
    print(rbm.reconstruct(v))

if __name__ == "__main__":
    test_rbm()
ValueError: shapes (1979,1) and (40,20) not aligned: 1 (dim 1) != 40 (dim 0)
The error occurs at run time from print(rbm.reconstruct(v)), at the line h = sigmoid(np.dot(v, self.W) + self.hbias).
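The shapes in the traceback point at the cause: np.dot(v, self.W) needs v to have one column per visible unit, i.e. 40 columns, while datatarget apparently has shape (1979, 1). A minimal sketch of the shape requirement (array sizes taken from the error message; the variable names here are hypothetical):
import numpy as np

W = np.zeros((40, 20))        # self.W for n_visible=40, n_hidden=20
v_bad = np.zeros((1979, 1))   # what reconstruct() received: 1 column
v_ok = np.zeros((1979, 40))   # what np.dot(v, W) requires: 40 columns

# np.dot(v_bad, W)            # ValueError: shapes (1979,1) and (40,20) not aligned
print(np.dot(v_ok, W).shape)  # (1979, 20): one hidden-layer row per sample
So datatarget has to be reshaped or encoded into 40 features per sample before calling reconstruct().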
