Webscraping to a DataFrame - arrays

I am trying to get information from a website into a DataFrame, but I'm having some trouble.
I have extracted the data, but I'm struggling to merge the two resulting DataFrames and reshape them into one. Here is what I have:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.civilaviation.gov.in/'
resp = requests.get(url)
soup = BeautifulSoup(resp.content.decode(), 'html.parser')
div = soup.find('div', {'class':'airport-col vande-bharat-col'})
div2 = soup.find('div', {'class':'airport-col airport-widget'})
div['class'] = 'Domestic traffic'
div2['class'] = 'International traffic'
dom = div.get_text()
intl = div2.get_text()
def str2frame(estr, sep = '\n', lineterm = '\n\n\n\n\n', set_header = True):
    dat = [x.split(sep) for x in estr.split(lineterm)][0:-1]
    df = pd.DataFrame(dat)
    if set_header:
        df = df.T.set_index(0, drop = True).T # flip, set ix, flip back
    return df
df1 = str2frame(dom)
df2 = str2frame(intl)
df1.rename(columns={"अन्तर्देशीय यातायात Domestic traffic On 29 Jan 2023":"Domestic Traffic"}, inplace=True)
df2.rename(columns={"अंतर्राष्ट्रीय यातायात International traffic On 29 Jan 2023":"International Traffic"}, inplace=True)
So now I get two separate DataFrames with all the information I want, but not in the format I want. Each DataFrame has shape (6, 2), and one of the columns is blank. I need them merged into one DataFrame of shape (2, 6). Currently I get:
Domestic Traffic
1 Departing flights 2,967
2 Departing Pax 4,24,224
3 Arriving flights 2,960
4 Arriving Pax 4,18,697
5 Aircraft movements 5,927
6 Airport footfalls 8,42,921
I would like to see two rows, one for domestic and one for international traffic, and each column based on the given values. I apologize if my question or my coding is unclear. This is my first time asking a question on this forum. Thank you for your help.

Not sure if this is the expected result but you could transform and concat your data:
pd.concat([
df1.set_index(df1.columns[0]).T,
df2.set_index(df2.columns[0]).T
]).reset_index()
Output
  | 0 | Departing flights | Departing Pax | Arriving flights | Arriving Pax | Aircraft movements | Airport footfalls
0 | अन्तर्देशीय यातायात Domestic traffic On 30 Jan 2023 | 2,862 | 4,07,957 | 2,864 | 4,04,799 | 5,726 | 8,12,756
1 | अंतर्राष्ट्रीय यातायात International traffic On 30 Jan 2023 | 433 | 90,957 | 516 | 82,036 | 949 | 1,72,993
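The same reshaping can be tried without scraping the site. The sketch below is only an illustration: the two small frames are hand-built stand-ins for the scraped ones (a label column named after the header plus a blank-named value column, which is an assumption about their layout), the numbers are copied from this thread, and the helper name to_row is made up for the example. It also strips the Indian-style digit grouping so the values become integers.

```python
import pandas as pd

# Hand-built stand-ins for the scraped frames (assumed layout: label column + blank-named value column).
labels = ["Departing flights", "Departing Pax", "Arriving flights",
          "Arriving Pax", "Aircraft movements", "Airport footfalls"]
df1 = pd.DataFrame({"Domestic Traffic": labels,
                    "": ["2,967", "4,24,224", "2,960", "4,18,697", "5,927", "8,42,921"]})
df2 = pd.DataFrame({"International Traffic": labels,
                    "": ["433", "90,957", "516", "82,036", "949", "1,72,993"]})

def to_row(df):
    """Hypothetical helper: turn a (label, value) frame into one row named after its first column."""
    label_col = df.columns[0]
    row = df.set_index(label_col).iloc[:, 0]
    # Remove the digit-group commas ("4,24,224") before converting to int.
    row = row.str.replace(",", "", regex=False).astype(int)
    return row.rename(label_col)

# Each Series becomes one row; the Series names become the row labels.
combined = pd.DataFrame([to_row(df1), to_row(df2)])
combined.columns.name = None
print(combined)
```

The result has two rows (domestic and international) and six numeric columns, which is the (2, 6) shape asked for in the question.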


HOW TO FILTER DJANGO QUERYSETS WITH MULTIPLE AGGREGATIONS

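Let's say I have a Django model:
class Table(models.Model):
    # Field sizes below are placeholders; CharField and DecimalField require them
    name = models.CharField(max_length=100)
    date_created = models.DateTimeField()
    total_sales = models.DecimalField(max_digits=12, decimal_places=2)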
Some data for context:
Name | date-created | total-sales
a    | 2020-01-01   | 200
b    | 2020-02-01   | 300
c    | 2020-04-01   | 400
*    | **********   | ***
c    | 2020-12-01   | 1000
c    | 2020-12-12   | 500
Now I want to compute the following aggregates:
total_yearly_sales = 10500
total_monthly_sales = 1500 (the current month being December)
total_daily_sales = 500
I also want to group by name:
models.Table.objects.values('name').annotate(Sum('total_sales')).order_by()
I want to do all of this in one query (one database hit).
Hence the query should generate
total_yearly_sales
total_monthly_sales
total_daily_sales
total_sales_grouped_by_name ie {a:200, b:300, c:1900}
I know this is too much to ask. Hence let me express my immense gratitude and thanks for having a look at this.
cheers
I can generate each of the above queries individually like so:
today = timezone.now().date()
todays_sales = models.Table.objects.filter(date_created__date__gte=today, date_created__date__lte=today).aggregate(Sum('total_sales'))
=> 500
monthly_sales = models.Table.objects.filter(date_created__year=today.year, date_created__month=today.month).aggregate(Sum('total_sales'))  # this month
=> 1500
total_yearly_sales = models.Table.objects.filter(date_created__year=today.year).aggregate(Sum('total_sales'))
=> 10500
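One common way to get several differently filtered totals in a single database hit is conditional aggregation: each sum only counts the rows that fall inside its own window. The sketch below shows the idea with Python's built-in sqlite3 module rather than Django, so it runs without a project. The table name sales, the fixed "today" and the use of only the five visible rows are assumptions for illustration, which is why the yearly total here is 2400 rather than 10500. In Django the same idea is usually expressed by passing filter=Q(...) to each Sum inside one aggregate() call (supported since Django 2.0); the per-name grouping returns a different shape of result and is fetched separately here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring the model's three fields.
conn.execute("CREATE TABLE sales (name TEXT, date_created TEXT, total_sales INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("a", "2020-01-01", 200), ("b", "2020-02-01", 300), ("c", "2020-04-01", 400),
     ("c", "2020-12-01", 1000), ("c", "2020-12-12", 500)],
)

today = "2020-12-12"  # fixed date so the example is reproducible

# One SELECT, three conditional sums: each CASE keeps only the rows in its window.
yearly, monthly, daily = conn.execute(
    """
    SELECT
        SUM(CASE WHEN substr(date_created, 1, 4) = :year  THEN total_sales ELSE 0 END),
        SUM(CASE WHEN substr(date_created, 1, 7) = :month THEN total_sales ELSE 0 END),
        SUM(CASE WHEN date_created = :day THEN total_sales ELSE 0 END)
    FROM sales
    """,
    {"year": today[:4], "month": today[:7], "day": today},
).fetchone()

# The grouped totals have one row per name, so they come from a second query.
by_name = dict(conn.execute(
    "SELECT name, SUM(total_sales) FROM sales GROUP BY name ORDER BY name"))

print(yearly, monthly, daily, by_name)
```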

How can I speed up my optimization with Gekko?

My program optimizes the charging and discharging of a home battery to minimize the cost of electricity at the end of the year. The electricity usage of homes is measured every 15 minutes, so I have 96 measurement points in 1 day. I want to optimize the charging and discharging of the battery over 2 days, so that day 1 takes the usage of day 2 into account. I wrote the following code and it works.
from gekko import GEKKO
import numpy as np
import pandas as pd
import time
import math
# ------------------------ Import and read input data ------------------------
file = r'D:\Bedrijfseconomie\MP Thuisbatterijen\Spyder - Gekko\Data Sim 1.xlsx'
data = pd.read_excel(file, sheet_name='Input', na_values='NaN')
dataRead = pd.DataFrame(data, columns= ['Timestep','Verbruik woning (kWh)','Prijs afname (€/kWh)',
'Capaciteit batterij (kW)','Capaciteit batterij (kWh)',
'Rendement (%)','Verbruikersprofiel'])
timestep = dataRead['Timestep'].to_numpy()
usage_home = dataRead['Verbruik woning (kWh)'].to_numpy()
price = dataRead['Prijs afname (€/kWh)'].to_numpy()
cap_batt_kW = dataRead['Capaciteit batterij (kW)'].iloc[0]
cap_batt_kWh = dataRead['Capaciteit batterij (kWh)'].iloc[0]
efficiency = dataRead['Rendement (%)'].iloc[0]
usersprofile = dataRead['Verbruikersprofiel'].iloc[0]
# ---------------------------- Optimization model ----------------------------
# Initialise model
m = GEKKO()
# Global options
m.options.SOLVER = 1
# Constants
snelheid_laden = cap_batt_kW/4
T = len(timestep)
loss_charging = m.Const(value = (1-efficiency)/2)
max_cap_batt = m.Const(value = cap_batt_kWh)
min_cap_batt = m.Const(value = 0)
max_charge = m.Const(value = snelheid_laden) # max battery can charge in 15min
max_decharge = m.Const(value = -snelheid_laden) # max battery can decharge in 15min
# Parameters
dummy = np.array(np.ones([T]))
# Variables
e_batt = m.Array(m.Var, (T), lb = min_cap_batt, ub = max_cap_batt) # energy in battery
usage_net = m.Array(m.Var, (T)) # usage home & charge/decharge battery
price_paid = m.Array(m.Var, (T)) # price paid each 15min
charging = m.Array(m.Var, (T), lb = max_decharge, ub = max_charge) # amount charge/decharge each 15min
# Intermediates
e_batt[0] = m.Intermediate(charging[0])
for t in range(T):
    e_batt[t] = m.Intermediate(m.sum([charging[i]*(1-loss_charging) for i in range(t)]))
usage_net = [m.Intermediate(usage_home[t] + charging[t]) for t in range(T)]
price_paid = [m.Intermediate(usage_net[t] * price[t] / 100) for t in range(T)]
total_price = m.Intermediate(m.sum([price_paid[t] for t in range(T)]))
# Equations (constraints)
m.Equation([min_cap_batt*dummy[t] <= e_batt[t] for t in range(T)])
m.Equation([max_cap_batt*dummy[t] >= e_batt[t] for t in range(T)])
m.Equation([max_charge*dummy[t] >= charging[t] for t in range(T)])
m.Equation([max_decharge*dummy[t] <= charging[t] for t in range(T)])
m.Equation([min_cap_batt*dummy[t] <= usage_net[t] for t in range(T)])
m.Equation([(-1*charging[t]) <= (1-loss_charging)*e_batt[t] for t in range(T)])
# Objective
m.Minimize(total_price)
# Solve problem
m.solve()
My code runs and works, but although it reports a solution time of 10 seconds, the total run time is around 8 minutes. Does anyone know a way I can speed it up?
There are a few ways to speed up the Gekko code:
Solve locally instead of on the public server. The option is m=GEKKO(remote=False). The public server can slow down with many jobs.
Use sum() instead of m.sum(). This can be faster for compiling the model. Otherwise, use m.integral(x) if you need the integral of x.
Many of the equations are repeated at each time horizon step. Gekko is more efficient using a single equation definition with IMODE=2 (for algebraic equation models) or IMODE=6 (for differential / algebraic equation models) and then it creates the equations over the time horizon. You may need to use m.vsum() instead of m.sum().
For additional diagnosis, try setting m.options.DIAGLEVEL=1 to get a detailed timing report of how long it takes to compile the model and perform each function, 1st derivative, and 2nd derivative calculation. It also gives a detailed view of the solver versus model time during the solution phase.
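A rough way to see how much repetition the original formulation contains, using plain NumPy rather than Gekko: the loop in the question rebuilds the battery energy at step t as a fresh sum over all earlier charging values, so the number of written-out terms grows quadratically with the horizon. A running total gives the same numbers with one new term per step. The charging profile below is random test data, not an optimized result, and the variable names are made up for the illustration.

```python
import numpy as np

T = 192                      # two days of 15-minute steps, as in the question
loss = (1 - 0.95) / 2        # loss_charging with efficiency = 0.95
rng = np.random.default_rng(0)
charging = rng.uniform(-0.75, 0.75, size=T)   # arbitrary test profile

# Formulation from the question: every step re-sums all earlier steps.
nested = np.array([sum(charging[i] * (1 - loss) for i in range(t)) for t in range(T)])
nested_terms = sum(t for t in range(T))       # product terms written out in total

# Running total: each step adds only the previous step's contribution.
recursive = np.zeros(T)
for t in range(1, T):
    recursive[t] = recursive[t - 1] + charging[t - 1] * (1 - loss)
recursive_terms = T - 1

print(nested_terms, recursive_terms)
print(np.allclose(nested, recursive))
```

For T = 192 the nested version writes out 18336 terms against 191 for the running total, while both give the same energy trajectory.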
Update with Data File Testing
Thanks for sending the data file. The run directory shows that the model file is 58,682 lines long. It takes a while to compile a model that size. Here is the solution from the files you sent:
--------- APM Model Size ------------
Each time step contains
Objects : 193
Constants : 5
Variables : 20641
Intermediates: 578
Connections : 18721
Equations : 20259
Residuals : 19681
Number of state variables: 20641
Number of total equations: - 19873
Number of slack variables: - 1152
---------------------------------------
Degrees of freedom : -384
* Warning: DOF <= 0
----------------------------------------------
Steady State Optimization with APOPT Solver
----------------------------------------------
Iter Objective Convergence
0 3.37044E+01 5.00000E+00
1 2.81987E+01 1.00000E-10
2 2.81811E+01 5.22529E-12
3 2.81811E+01 2.10942E-15
4 2.81811E+01 2.10942E-15
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 10.5119999999879 sec
Objective : 28.1811214884047
Successful solution
---------------------------------------------------
Here is a version that uses IMODE=6 instead. You define the variables and equations once and let Gekko handle the time discretization. It makes a much more efficient model because there is no unnecessary duplication of equations.
from gekko import GEKKO
import numpy as np
import pandas as pd
import time
import math
# ------------------------ Import and read input data ------------------------
file = r'Data Sim 1.xlsx'
data = pd.read_excel(file, sheet_name='Input', na_values='NaN')
dataRead = pd.DataFrame(data, columns= ['Timestep','Verbruik woning (kWh)','Prijs afname (€/kWh)',
'Capaciteit batterij (kW)','Capaciteit batterij (kWh)',
'Rendement (%)','Verbruikersprofiel'])
timestep = dataRead['Timestep'].to_numpy()
usage_home = dataRead['Verbruik woning (kWh)'].to_numpy()
price = dataRead['Prijs afname (€/kWh)'].to_numpy()
cap_batt_kW = dataRead['Capaciteit batterij (kW)'].iloc[0]
cap_batt_kWh = dataRead['Capaciteit batterij (kWh)'].iloc[0]
efficiency = dataRead['Rendement (%)'].iloc[0]
usersprofile = dataRead['Verbruikersprofiel'].iloc[0]
# ---------------------------- Optimization model ----------------------------
# Initialise model
m = GEKKO()
m.open_folder()
# Global options
m.options.SOLVER = 1
m.options.IMODE = 6
# Constants
snelheid_laden = cap_batt_kW/4
m.time = timestep
loss_charging = m.Const(value = (1-efficiency)/2)
max_cap_batt = m.Const(value = cap_batt_kWh)
min_cap_batt = m.Const(value = 0)
max_charge = m.Const(value = snelheid_laden) # max battery can charge in 15min
max_decharge = m.Const(value = -snelheid_laden) # max battery can decharge in 15min
# Parameters
usage_home = m.Param(usage_home)
price = m.Param(price)
# Variables
e_batt = m.Var(value=0, lb = min_cap_batt, ub = max_cap_batt) # energy in battery
price_paid = m.Var() # price paid each 15min
charging = m.Var(lb = max_decharge, ub = max_charge) # amount charge/decharge each 15min
usage_net = m.Var(lb=min_cap_batt)
# Equations
m.Equation(e_batt==m.integral(charging*(1-loss_charging)))
m.Equation(usage_net==usage_home + charging)
price_paid = m.Intermediate(usage_net * price / 100)
m.Equation(-charging <= (1-loss_charging)*e_batt)
# Objective
m.Minimize(price_paid)
# Solve problem
m.solve()
import matplotlib.pyplot as plt
plt.plot(m.time,e_batt.value,label='Battery Charge')
plt.plot(m.time,charging.value,label='Charging')
plt.plot(m.time,price_paid.value,label='Price')
plt.plot(m.time,usage_net.value,label='Net Usage')
plt.xlabel('Time'); plt.grid(); plt.legend(); plt.show()
The model is only 31 lines long (see gk0_model.apm) and it solves much faster (a couple seconds total).
--------- APM Model Size ------------
Each time step contains
Objects : 0
Constants : 5
Variables : 8
Intermediates: 1
Connections : 0
Equations : 6
Residuals : 5
Number of state variables: 1337
Number of total equations: - 955
Number of slack variables: - 191
---------------------------------------
Degrees of freedom : 191
----------------------------------------------
Dynamic Control with APOPT Solver
----------------------------------------------
Iter Objective Convergence
0 3.46205E+01 3.00000E-01
1 3.30649E+01 4.41141E-10
2 3.12774E+01 1.98558E-11
3 3.03148E+01 1.77636E-15
4 2.96824E+01 3.99680E-15
5 2.82700E+01 8.88178E-16
6 2.82039E+01 1.77636E-15
7 2.81334E+01 8.88178E-16
8 2.81085E+01 1.33227E-15
9 2.81039E+01 8.88178E-16
Iter Objective Convergence
10 2.81005E+01 8.88178E-16
11 2.80999E+01 1.77636E-15
12 2.80996E+01 8.88178E-16
13 2.80996E+01 8.88178E-16
14 2.80996E+01 8.88178E-16
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 0.527499999996508 sec
Objective : 28.0995878585948
Successful solution
---------------------------------------------------
There is no long compile time. Also, the solver time is reduced from 10 sec to 0.5 sec. The objective function is nearly the same (28.18 versus 28.10).
Here is a complete version without the data file dependency (in case the data file isn't available in the future).
from gekko import GEKKO
import numpy as np
import pandas as pd
import time
import math
# ------------------------ Import and read input data ------------------------
timestep = np.arange(1,193)
usage_home = np.array([0.05,0.07,0.09,0.07,0.05,0.07,0.07,0.07,0.06,
0.05,0.07,0.07,0.09,0.07,0.06,0.07,0.07,
0.07,0.16,0.12,0.17,0.08,0.10,0.11,0.06,
0.06,0.06,0.06,0.06,0.07,0.07,0.07,0.08,
0.08,0.06,0.07,0.07,0.07,0.07,0.05,0.07,
0.07,0.07,0.07,0.21,0.08,0.07,0.08,0.27,
0.12,0.09,0.10,0.11,0.09,0.09,0.08,0.08,
0.12,0.15,0.08,0.10,0.08,0.10,0.09,0.10,
0.09,0.08,0.10,0.12,0.10,0.10,0.10,0.11,
0.10,0.10,0.11,0.13,0.21,0.12,0.10,0.10,
0.11,0.10,0.11,0.12,0.12,0.10,0.11,0.10,
0.10,0.10,0.11,0.10,0.10,0.09,0.08,0.12,
0.10,0.11,0.11,0.10,0.06,0.05,0.06,0.06,
0.06,0.07,0.06,0.06,0.05,0.06,0.05,0.06,
0.05,0.06,0.05,0.06,0.07,0.06,0.09,0.10,
0.10,0.22,0.08,0.06,0.05,0.06,0.08,0.08,
0.07,0.08,0.07,0.07,0.16,0.21,0.08,0.08,
0.09,0.09,0.10,0.09,0.09,0.08,0.12,0.24,
0.09,0.08,0.09,0.08,0.10,0.24,0.08,0.09,
0.09,0.08,0.08,0.07,0.06,0.05,0.06,0.07,
0.07,0.05,0.05,0.06,0.05,0.28,0.11,0.20,
0.10,0.09,0.28,0.10,0.15,0.09,0.10,0.18,
0.12,0.13,0.30,0.10,0.11,0.10,0.10,0.11,
0.10,0.21,0.10,0.10,0.12,0.10,0.08])
price = np.array([209.40,209.40,209.40,209.40,193.00,193.00,193.00,
193.00,182.75,182.75,182.75,182.75,161.60,161.60,
161.60,161.60,154.25,154.25,154.25,154.25,150.70,
150.70,150.70,150.70,150.85,150.85,150.85,150.85,
150.00,150.00,150.00,150.00,153.25,153.25,153.25,
153.25,153.25,153.25,153.25,153.25,151.35,151.35,
151.35,151.35,151.70,151.70,151.70,151.70,154.95,
154.95,154.95,154.95,150.20,150.20,150.20,150.20,
153.75,153.75,153.75,153.75,160.55,160.55,160.55,
160.55,179.90,179.90,179.90,179.90,202.00,202.00,
202.00,202.00,220.25,220.25,220.25,220.25,245.75,
245.75,245.75,245.75,222.90,222.90,222.90,222.90,
203.40,203.40,203.40,203.40,205.30,205.30,205.30,
205.30,192.80,192.80,192.80,192.80,177.00,177.00,
177.00,177.00,159.90,159.90,159.90,159.90,152.50,
152.50,152.50,152.50,143.95,143.95,143.95,143.95,
142.10,142.10,142.10,142.10,143.75,143.75,143.75,
143.75,170.80,170.80,170.80,170.80,210.35,210.35,
210.35,210.35,224.45,224.45,224.45,224.45,226.30,
226.30,226.30,226.30,227.85,227.85,227.85,227.85,
225.45,225.45,225.45,225.45,225.80,225.80,225.80,
225.80,224.50,224.50,224.50,224.50,220.30,220.30,
220.30,220.30,220.00,220.00,220.00,220.00,221.90,
221.90,221.90,221.90,230.25,230.25,230.25,230.25,
233.60,233.60,233.60,233.60,225.20,225.20,225.20,
225.20,179.85,179.85,179.85,179.85,171.85,171.85,
171.85,171.85,162.90,162.90,162.90,162.90,158.85,
158.85,158.85,158.85])
cap_batt_kW = 3.00
cap_batt_kWh = 5.00
efficiency = 0.95
usersprofile = 1
# ---------------------------- Optimization model ----------------------------
# Initialise model
m = GEKKO()
#m.open_folder()
# Global options
m.options.SOLVER = 1
m.options.IMODE = 6
# Constants
snelheid_laden = cap_batt_kW/4
m.time = timestep
loss_charging = m.Const(value = (1-efficiency)/2)
max_cap_batt = m.Const(value = cap_batt_kWh)
min_cap_batt = m.Const(value = 0)
max_charge = m.Const(value = snelheid_laden) # max battery can charge in 15min
max_decharge = m.Const(value = -snelheid_laden) # max battery can decharge in 15min
# Parameters
usage_home = m.Param(usage_home)
price = m.Param(price)
# Variables
e_batt = m.Var(value=0, lb = min_cap_batt, ub = max_cap_batt) # energy in battery
price_paid = m.Var() # price paid each 15min
charging = m.Var(lb = max_decharge, ub = max_charge) # amount charge/decharge each 15min
usage_net = m.Var(lb=min_cap_batt)
# Equations
m.Equation(e_batt==m.integral(charging*(1-loss_charging)))
m.Equation(usage_net==usage_home + charging)
price_paid = m.Intermediate(usage_net * price / 100)
m.Equation(-charging <= (1-loss_charging)*e_batt)
# Objective
m.Minimize(price_paid)
# Solve problem
m.solve()
import matplotlib.pyplot as plt
plt.plot(m.time,e_batt.value,label='Battery Charge')
plt.plot(m.time,charging.value,label='Charging')
plt.plot(m.time,price_paid.value,label='Price')
plt.plot(m.time,usage_net.value,label='Net Usage')
plt.xlabel('Time'); plt.grid(); plt.legend(); plt.show()

How to forecast unknown future target values with gluonts DeepAR?

I have a monthly time series from 1995-01-01 to 2021-10-01. How can I forecast values for the next 3 months (2021-11-01 to 2022-01-01)? Note that I don't have the target values for 2021-11-01, 2021-12-01 and 2022-01-01.
Many thanks!
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx import Trainer
import numpy as np
import mxnet as mx
np.random.seed(7)
mx.random.seed(7)
estimator = DeepAREstimator(
    prediction_length=12,
    context_length=120,
    freq='M',
    trainer=Trainer(
        epochs=5,
        learning_rate=1e-03,
        num_batches_per_epoch=50))
predictor = estimator.train(training_data=df_train)
# Forecasting
predictions = predictor.predict(df_test)
predictions = list(predictions)[0]
predictions = predictions.quantile(0.5)
print(predictions)
[163842.34 152805.08 161326.3 176823.97 127003.79 126937.78
139575.2 117121.67 115754.67 139211.28 122623.586 120102.65 ]
As I understand it, the predicted values are not for "2021-11-01", "2021-12-01" and "2022-01-01". How do I know which months these values refer to? How do I forecast values for the next 3 months: "2021-11-01", "2021-12-01" and "2022-01-01"?
Take a look at this code. It comes from "Advanced Forecasting with Python".
https://github.com/Apress/advanced-forecasting-python/blob/main/Chapter%2020%20-%20Amazon's%20DeepAR.ipynb
It does not seem to forecast unknown future values, since it compares the last 28 values of test_ds (Listing 20-5. R2 score and prediction graph) with the predictions made over this same dataset test_ds (Listing 20-4. Prediction).
How do I forecast unknown future values?
Many thanks!
Data source
https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting
# Listing 20-1. Importing the data
import pandas as pd
y = pd.read_csv('air_visit_data.csv.zip')
y = y.pivot(index='visit_date', columns='air_store_id')['visitors']
y = y.fillna(0)
y = pd.DataFrame(y.sum(axis=1))
y = y.reset_index(drop=False)
y.columns = ['date', 'y']
# Listing 20-2. Preparing the data format required by the gluonts library
from gluonts.dataset.common import ListDataset
start = pd.Timestamp("01-01-2016", freq="H")
# train dataset: cut the last window of length "prediction_length", add "target" and "start" fields
train_ds = ListDataset([{'target': y.loc[:450,'y'], 'start': start}], freq='H')
# test dataset: use the whole dataset, add "target" and "start" fields
test_ds = ListDataset([{'target': y['y'], 'start': start}],freq='H')
# Listing 20-3. Fitting the default DeepAR model
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer
import mxnet as mx
import numpy as np
np.random.seed(7)
mx.random.seed(7)
estimator = DeepAREstimator(
    prediction_length=28,
    context_length=100,
    freq='H',
    trainer=Trainer(ctx="gpu", # remove if running on windows
                    epochs=5,
                    learning_rate=1e-3,
                    num_batches_per_epoch=100
                    )
)
predictor = estimator.train(train_ds)
# Listing 20-4. Prediction
predictions = predictor.predict(test_ds)
predictions = list(predictions)[0]
predictions = predictions.quantile(0.5)
# Listing 20-5. R2 score and prediction graph
from sklearn.metrics import r2_score
print(r2_score( list(test_ds)[0]['target'][-28:], predictions))
import matplotlib.pyplot as plt
plt.plot(predictions)
plt.plot(list(test_ds)[0]['target'][-28:])
plt.legend(['predictions', 'actuals'])
plt.show()
In your case the context length is 120 and the prediction length is 12, so the model looks back over 120 data points to predict the next 12 data points.
I would recommend reducing the context length, perhaps to 10, and including the data from the past 10 months in the df_test table.
You can get the start of the forecast using:
list(predictor.predict(df_test))[0].start_date
Based on this, create a future table of 12 dates (12 being the prediction length).
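A minimal sketch of building that future table with pandas alone. It assumes the forecast starts at 2021-11-01, which would be the case if the series passed to predict ends at 2021-10-01; in practice the start would be taken from the forecast's start_date. The numbers are the median values printed in the question and serve only as placeholders, and the variable names are made up for the example.

```python
import pandas as pd

# Assumed forecast start; in practice read it from the forecast's start_date.
forecast_start = pd.Timestamp("2021-11-01")
# Median forecast values copied from the output printed in the question (placeholders).
median_forecast = [163842.34, 152805.08, 161326.3, 176823.97, 127003.79, 126937.78,
                   139575.2, 117121.67, 115754.67, 139211.28, 122623.586, 120102.65]

# One month-start date per predicted step ("MS" = month start frequency).
future_dates = pd.date_range(start=forecast_start, periods=len(median_forecast), freq="MS")
forecast = pd.Series(median_forecast, index=future_dates, name="median_forecast")

# Keep only the three months asked about.
next_three = forecast.loc["2021-11-01":"2022-01-01"]
print(next_three)
```

If only three months are needed, setting prediction_length=3 when training would avoid the slicing step; the slice is used here so the example matches the 12-step output in the question.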

How to have the xlim with seaborn automatically adjust based on dataframe date range

I am trying to loop through plots. Each "station" is a pandas DataFrame that holds a single water year of data (Oct 1 to Sept 29). The data is read in with this code:
sh_784_2020 = pd.read_csv("sh_784_WY2020.csv", parse_dates=['Date'])
sh_784_2020.columns = ["Index", "Date", "Temp_C","Precip_mm","SnowDepth_cm","SWE_mm","SM2","SM8","SM20"]
My plots loop through, but the x-axis always runs from the year 2000 to the current date, while my data is from 2006-2020. Is there a way to have the xlim adjust automatically to the date range in the DataFrame? Or is there a way to create this plot in matplotlib and not seaborn?
for station in stations:
    # Snow density in percent: SWE (mm) over snow depth (cm converted to mm)
    station['Density'] = station['SWE_mm']/(station['SnowDepth_cm']*10)*100
    station['Density range'] = pd.cut(station['Density'], [-np.inf, 25, 30, 35, 40, np.inf])
    Date = station.loc[:, 'Date'].values
    SWE_mm = station.loc[:, 'SWE_mm'].values
    Density = station.loc[:, 'Density'].values
    # Column names are passed as keywords so that data=station is used for x, y and hue
    sns.scatterplot(x='Date', y='SWE_mm', hue='Density range', data=station, edgecolor='none', palette=['grey', 'green', 'gold', 'orange', 'crimson'], alpha=1)
    plt.xlim()
    plt.show()
[Images: plot example 1, plot example 2]
If you upgrade to seaborn 0.11 you should find that the default autoscaling works better, but you can get a good result without upgrading by creating the Axes object before plotting and setting the units, e.g. something like
ax = plt.figure().subplots()
ax.xaxis.update_units(station["Date"])

Sort by a key, but value has more than one element using Scala

I'm very new to Scala on Spark and wondering how you might create key value pairs, with the key having more than one element. For example, I have this dataset for baby names:
Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45
I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?
I did accomplish this using a loop (see below), but I'm wondering if there is a shorter way to do this that makes better use of Spark and Scala (i.e. can I decrease computation time?).
val names = sc.textFile("names.csv").map(l => l.split(","))
val uniqueCounty = names.map(x => x(2)).distinct.collect
for (i <- 0 to uniqueCounty.length-1) {
  val county = uniqueCounty(i).toString;
  // Number is the fourth column (index 3); convert it so reduceByKey adds instead of concatenating
  val eachCounty = names.filter(x => x(2) == county).map(l => (l(1), l(3).trim.toInt)).reduceByKey((a,b) => a + b).sortBy(-_._2);
  println("County:" + county + eachCounty.first)
}
Here is the solution using RDD. I am assuming you need top occurring name per county.
val data = Array((2000, "JOHN", "KINGS", 50),(2000, "BOB", "KINGS", 40),(2000, "MARY", "NASSAU", 60),(2001, "JOHN", "KINGS", 14),(2001, "JANE", "KINGS", 30),(2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
//Reduce the uniq values for county/name as combo key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3,x._2),x._4)).reduceByKey(_+_)
// Group names per county.
val countyNameRdd = uniqNamePerCountyRdd.map(x=>(x._1._1,(x._1._2,x._2))).groupByKey()
// Sort descending by count and keep only the top name per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect
Output:
res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
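As a quick check of the two-step logic outside Spark, here is the same computation in plain Python on the six sample rows. This is only a sketch for verifying the expected numbers by hand, not a replacement for the RDD code: the first step mirrors reduceByKey on the (county, name) key, the second keeps the largest total per county.

```python
from collections import defaultdict

rows = [(2000, "JOHN", "KINGS", 50), (2000, "BOB", "KINGS", 40),
        (2000, "MARY", "NASSAU", 60), (2001, "JOHN", "KINGS", 14),
        (2001, "JANE", "KINGS", 30), (2001, "BOB", "NASSAU", 45)]

# Step 1: total per (county, name), like reduceByKey on a composite key.
totals = defaultdict(int)
for _year, name, county, number in rows:
    totals[(county, name)] += number

# Step 2: keep the name with the largest total in each county.
top = {}
for (county, name), total in totals.items():
    if county not in top or total > top[county][1]:
        top[county] = (name, total)

print(top)  # {'KINGS': ('JOHN', 64), 'NASSAU': ('MARY', 60)}
```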
You could use the spark-csv and the Dataframe API. If you are using the new version of Spark (2.0) it is slightly different. Spark 2.0 has a native csv data source based on spark-csv.
Use spark-csv to load your csv file into a Dataframe.
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)
df.show
Gives output:
+----+----+------+------+
|Year|Name|County|Number|
+----+----+------+------+
|2000|JOHN| KINGS| 50|
|2000| BOB| KINGS| 40|
|2000|MARY|NASSAU| 60|
|2001|JOHN| KINGS| 14|
|2001|JANE| KINGS| 30|
|2001| BOB|NASSAU| 45|
+----+----+------+------+
The DataFrame API provides a set of operations for structured data manipulation. You could use some basic operations to get your result.
import org.apache.spark.sql.functions._
df.select("County","Number").groupBy("County").agg(max("Number")).show
Gives output:
+------+-----------+
|County|max(Number)|
+------+-----------+
|NASSAU| 60|
| KINGS| 50|
+------+-----------+
Is this what you are trying to achieve?
Notice the import org.apache.spark.sql.functions._ which is needed for the agg() function.
More information about Dataframes API
EDIT
For correct output:
df.registerTempTable("names")
//there is probably a better query for this
sqlContext.sql("SELECT * FROM (SELECT Name, County,count(1) as Occurrence FROM names GROUP BY Name, County ORDER BY " +
"count(1) DESC) n").groupBy("County", "Name").max("Occurrence").limit(2).show
Gives output:
+------+----+---------------+
|County|Name|max(Occurrence)|
+------+----+---------------+
| KINGS|JOHN| 2|
|NASSAU|MARY| 1|
+------+----+---------------+
