Which geopandas datasets (maps) are available?

I just created a very simple geopandas example (see below). It works, but I realized that I need to be able to show a custom part of the world: sometimes Germany and sometimes only Berlin. (I also want to aggregate the data I have by areas which I define as polygons in a geopandas file, but I'll add this in another question.)
How can I get a different "base map" than
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
for visualizations?
Example
# 3rd party modules
import pandas as pd
import geopandas as gpd
import shapely
# needs 'descartes'
import matplotlib.pyplot as plt
df = pd.DataFrame({'city': ['Berlin', 'Paris', 'Munich'],
                   'latitude': [52.518611111111, 48.856666666667, 48.137222222222],
                   'longitude': [13.408333333333, 2.3516666666667, 11.575555555556]})
gdf = gpd.GeoDataFrame(df.drop(['latitude', 'longitude'], axis=1),
                       crs={'init': 'epsg:4326'},
                       geometry=[shapely.geometry.Point(xy)
                                 for xy in zip(df.longitude, df.latitude)])
print(gdf)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
base = world.plot(color='white', edgecolor='black')
gdf.plot(ax=base, marker='o', color='red', markersize=5)
plt.show()

As written in the geopandas.datasets.get_path(...) documentation, one has to execute
>>> geopandas.datasets.available
['naturalearth_lowres', 'naturalearth_cities', 'nybb']
Where
naturalearth_lowres: country outlines (low resolution)
naturalearth_cities: positions of cities
nybb: New York City borough boundaries
Other data sources
Searching for "germany shapefile" gave an arcgis.com URL which uses the "Bundesamt für Kartographie und Geodäsie" as its source. Using vg2500_geo84/vg2500_krs.shp gives a map of Germany divided into districts ("Kreise").
Source:
© Bundesamt für Kartographie und Geodäsie, Frankfurt am Main, 2011
Reproduction, distribution and making publicly available, including in extracts, are permitted with attribution of the source.
I also had to set base.set_aspect(1.4), otherwise it looked wrong. The value 1.4 was found by trial and error.
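Putting this together, a minimal sketch of swapping in the German district shapefile as the base map (the path and the aspect value come from the description above, gdf is the points GeoDataFrame from the example at the top, and the shapefile is assumed to be downloaded locally):
# Sketch: use the downloaded district shapefile instead of naturalearth_lowres.
import geopandas as gpd
import matplotlib.pyplot as plt
germany = gpd.read_file('vg2500_geo84/vg2500_krs.shp')  # district ("Kreis") polygons
base = germany.plot(color='white', edgecolor='black')
base.set_aspect(1.4)  # found by trial and error, see above
gdf.plot(ax=base, marker='o', color='red', markersize=5)  # gdf from the example above
plt.show()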
Another source for such data for Berlin is daten.berlin.de
When geopandas reads the shapefile, the result is a GeoDataFrame with the columns
['USE', 'RS', 'RS_ALT', 'GEN', 'SHAPE_LENG', 'SHAPE_AREA', 'geometry']
with:
USE=4 for all elements
RS is a string like 16077 or 01003
RS_ALT is a string like 160770000000 or 010030000000
GEN is a string like 'Saale-Holzland-Kreis' or 'Erlangen'
SHAPE_LENG is a float like 202986.1998816 or 248309.91235015
SHAPE_AREA is a float like 1.91013141e+08 or 1.47727769e+09
geometry is a shapely geometry - mostly POLYGON
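Since GEN holds the district names and Berlin itself appears to be a single district-level unit in this file (worth verifying with germany['GEN'].unique()), the base map can be restricted to Berlin by filtering, continuing from the germany GeoDataFrame in the sketch above:
# Sketch: restrict the base map to one district by its GEN value (assumed to be 'Berlin').
berlin = germany[germany['GEN'] == 'Berlin']
base = berlin.plot(color='white', edgecolor='black')
base.set_aspect(1.4)
gdf.plot(ax=base, marker='o', color='red', markersize=5)
plt.show()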

Related

Use the X from for-loop as part of variable name (Python)

I have a problem that may be obvious to a Pythonista, but I just can't google my way to a solution.
In short:
I want to use the x from for x in my_data_header as part of a variable name. For example, instead of hardcoding my_data.selected_column, I want to use my_data.x to loop through all columns.
In more detail:
I want to make boxplots from scientific data imported from a spreadsheet. One column holds the treatment designation by which I subset the dataset; the others are measurements I want to draw boxplots from. I need to loop through the measurement columns and export the boxplots, so the x of the for loop has to be used for:
selecting the column (within each treatment), titling the boxplot, naming the exported .png file, ...
I can perform the steps separately, but I couldn't compose the loop.
What is the recommended approach for looping through spreadsheet columns in a complex task where you have to refer to column titles? (I will add more information if needed.)
I am trying to switch from RStudio/Markdown/Knit to Python.
Thank you in advance!
It was a pd.DataFrame issue: the data-reading method I used before imported the data in a format that was not pandas-compatible. I paste my working code below in case someone finds it useful. If the SO community finds it irrelevant, just delete it altogether.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.clf() / plt.show()
data = pd.read_excel('/media/Data/my_file.xlsx', 0)
h = data.columns  # read the header line
d = pd.DataFrame(data)
print(type(d))  # check that it is <class 'pandas.core.frame.DataFrame'>
for x in h:
    yy = d[x]  # reference to the current column
    bp = sns.boxplot(x='Column_with_treatments', y=yy, data=data)  # make the graph
    fig = bp.get_figure()  # put the created graph into the object fig
    nn = yy.name + '_name_you_want.png'  # create the file name string
    print(nn)  # log the column name of the current graph
    fig.savefig(nn)  # save the graph image
    # plt.show()  # would show each image, but you need to close it to continue
    plt.clf()  # clear the current graph from memory
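One thing to watch in the loop above is that it also draws a boxplot of the treatment column against itself. A small variant that skips the treatment column (the file path and 'Column_with_treatments' are the same placeholders as above):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_excel('/media/Data/my_file.xlsx', 0)
# loop only over the measurement columns
for column in [c for c in data.columns if c != 'Column_with_treatments']:
    bp = sns.boxplot(x='Column_with_treatments', y=column, data=data)
    bp.get_figure().savefig(column + '_name_you_want.png')
    plt.clf()  # clear the figure before drawing the next boxplot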

How to save the results of an np.array for future use when using Google Colab

I am working on an Information Retrieval project in Google Colab. I am at the stage where I have computed some features ("input_features") and labels ("labels") with a for loop, which took about 4 hours to finish.
So at the end I have put the results into arrays:
input_features = np.array(input_features)
labels = np.array(labels)
So my question is:
Is it possible to save those results for future use in Google Colab?
I have found two options that might work, but I don't know where the files are created.
1) Save them as CSV files. My code would be:
from numpy import savetxt
# save to csv file
savetxt('input_features.csv', input_features, delimiter=',')
savetxt('labels.csv', labels, delimiter=',')
And in order to load them:
from numpy import loadtxt
# load array
input_features = loadtxt('input_features.csv', delimiter=',')
labels = loadtxt('labels.csv', delimiter=',')
# print the array
print(input_features)
print(labels)
But I still don't get anything back when I print.
2) Save the arrays by using pickle, where I followed these instructions from here:
https://colab.research.google.com/drive/1EAFQxQ68FfsThpVcNU7m8vqt4UZL0Le1#scrollTo=gZ7OTLo3pw8M
from google.colab import files
from io import BytesIO  # needed for BytesIO below
import pickle
def features_pickeled(input_features, results):
    input_features = input_features + '.txt'
    pickle.dump(results, open(input_features, 'wb'))
    files.download(input_features)
def labels_pickeled(labels, results):
    labels = labels + '.txt'
    pickle.dump(results, open(labels, 'wb'))
    files.download(labels)
And to load them back:
def load_features_from_local():
    loaded_features = {}
    uploaded = files.upload()
    for input_features in uploaded.keys():
        unpickeled_features = uploaded[input_features]
        loaded_features[input_features] = pickle.load(BytesIO(unpickeled_features))
    return loaded_features
def load_labels_from_local():
    loaded_labels = {}
    uploaded = files.upload()
    for labels in uploaded.keys():
        unpickeled_labels = uploaded[labels]
        loaded_labels[labels] = pickle.load(BytesIO(unpickeled_labels))
    return loaded_labels
#How do I print the pickled files to see if I have them ready for use???
When using plain Python I would do something like this with pickle:
# Create the pickle file
with open("name.pickle", "wb") as pickle_file:
    pickle.dump(name, pickle_file)
# Load the pickle file
with open("name.pickle", "rb") as name_pickled:
    name_b = pickle.load(name_pickled)
But the thing is that I don't see any files being created in my Google Drive.
Is my code correct, or am I missing some part of the code?
This is a long description, but hopefully it explains in detail what I want to do and what I have done about this issue.
Thank you in advance for your help.
Google Colaboratory notebook instances are never guaranteed to have access to the same resources when you disconnect and reconnect because they are run on virtual machines. Therefore, you can't "save" your data in Colab. Here are a few solutions:
Colab saves your code. If the for loop operation you referenced doesn't take an extreme amount of time to run, just leave the code and run it every time you connect your notebook.
Check out np.save. This function allows you to save an array to a binary file. Then, you could re-upload your binary file when you reconnect your notebook. Better yet, you could store the binary file on Google Drive, mount your Drive to your notebook, and reference it like that (a short np.save sketch follows the CSV example below).
# Mount driver to authenticate yourself to gdrive
from google.colab import drive
drive.mount('/content/gdrive')
#---
# Import necessary libraries
import numpy as np
from numpy import savetxt
import pandas as pd
#---
# Create array
arr = np.array([1, 2, 3, 4, 5])
# save to csv file
savetxt('arr.csv', arr, delimiter=',')  # You will see the file if you click the Files icon (left panel)
And then you can load it again by:
# You can copy the path when you find your file under the Files icon
arr = pd.read_csv('/content/arr.csv', sep=',', header=None)  # You could also save your result as a txt file
arr
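For completeness, here is a rough sketch of the np.save route mentioned above, writing the arrays into the mounted Drive so they survive a disconnect. The folder path is an assumption; adjust it to wherever you keep your project files:
# Save the arrays as .npy files in Google Drive (path is an assumption)
import numpy as np
from google.colab import drive
drive.mount('/content/gdrive')
np.save('/content/gdrive/My Drive/input_features.npy', input_features)
np.save('/content/gdrive/My Drive/labels.npy', labels)
# In a later session, mount Drive again and load them back
# (pass allow_pickle=True to np.load if the arrays have dtype=object)
input_features = np.load('/content/gdrive/My Drive/input_features.npy')
labels = np.load('/content/gdrive/My Drive/labels.npy')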

using CDO to extract dataset only for a specific region

I want to use cdo to extract data from a precipitation NetCDF dataset using another NetCDF file that covers South America.
I have tried multiple procedures, but I always get some error (such as the grid sizes not matching, "Unsupported generic coordinates", etc.).
The commands I have tried:
cdo mul chirps_2000-2015_annual_SA.nc Extract feature.nc output.nc
# Got grid error
cdo -f nc4 setctomiss,0 -gtc,0 -remapcon,r1440x720 Chirps_2000-2015_annual_SA.nc CHIRPS_era5_pev_2000-2015_annual_SA_masked.nc
# Got unsupported generic error
I am sure you could make or find a more elegant solution, but I just combined Python and the cdo shell executable to fulfill the task (calling subprocess is sometimes considered a bad habit).
#!/usr/bin/env ipython
import numpy as np
from netCDF4 import Dataset
import subprocess
# -------------------------------------------------
def nc_varget(filein, varname):
    ncin = Dataset(filein)
    vardata = ncin.variables[varname][:]
    ncin.close()
    return vardata
# -------------------------------------------------
gridfile = 'extract_feature.nc'
inputfile = 'precipitation_2000-2015_annual_SA.nc'
outputfile = 'selected_region.nc'
# -------------------------------------------------
# Detect the start/end indexes based on gridfile:
poutlon = nc_varget(gridfile, 'lon')
poutlat = nc_varget(gridfile, 'lat')
pinlon = nc_varget(inputfile, 'lon')
pinlat = nc_varget(inputfile, 'lat')
kkx = np.where((pinlon >= np.min(poutlon)) & (pinlon <= np.max(poutlon)))
kky = np.where((pinlat >= np.min(poutlat)) & (pinlat <= np.max(poutlat)))
# -------------------------------------------------
# Note: cdo's selindexbox indexes start at 1, so the 0-based numpy indexes
# may need a +1 here.
commandstr = ('cdo selindexbox,' + str(np.min(kkx)) + ',' + str(np.max(kkx)) + ','
              + str(np.min(kky)) + ',' + str(np.max(kky)) + ' ' + inputfile + ' ' + outputfile)
subprocess.call(commandstr, shell=True)
The problem with your data is that the file "precipitation_2000-2015_annual_SA.nc" does not specify the grid at the moment: the variables lon and lat are generic, and hence the grid is generic. Otherwise you could use other operators instead of selindexbox (see the sketch below). The file extract_feature.nc is closer to the standard, as its lon and lat variables also have name and unit attributes.
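For reference, if the precipitation file did carry proper lon/lat coordinate metadata, one of those other operators, sellonlatbox, could do the subsetting in a single call. A rough sketch in the same subprocess style, where the box values are placeholders rather than the real extent of extract_feature.nc:
# Sketch: select a lon/lat box directly once the grid is properly described.
import subprocess
subprocess.call('cdo sellonlatbox,-82,-34,-56,13 '
                'precipitation_2000-2015_annual_SA.nc selected_region.nc', shell=True)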

JSON array keyerror in Python

I'm fairly new to Python programming and am attempting to extract data from a JSON response. The code below results in an error for
js[jstring][jkeys]['5. volume'])
Any help would be much appreciated.
import urllib.request, urllib.parse, urllib.error
import json
def DailyData(symb):
    url = "https://www.alphavantage.co/queryfunction=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    try:
        js = json.loads(data)
    except:
        js = None
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])
print('volume', DailyData(symbol)[5])
It looks like the reason for the error is that the returned data from the URL is a bit more hierarchical than you may realize. To see that, print out js (I recommend using a Jupyter notebook):
import urllib.request, urllib.parse, urllib.error
import ssl
import json
import sqlite3
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
stockdata = urllib.request.urlopen(url)
data = stockdata.read().decode()
js = json.loads(data)
js
You can see that js (now a python dict) has a "Meta Data" key before the actual time series begins. You need to start operating on the dict at that key.
Having said that, to get the data into a table-like structure (for plotting, time series analysis, etc.), you can use the pandas package to read that dict key directly into a DataFrame. The pandas DataFrame constructor accepts a dict as input. In this case the data comes in transposed, so the .T at the end rotates it (try with and without the .T and you will see the difference).
import pandas as pd
df=pd.DataFrame(js['Time Series (Daily)']).T
df
Added edit... You could get the data into a dataframe with a single line of code:
import requests
import pandas as pd
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
data = pd.DataFrame(requests.get(url).json()['Time Series (Daily)']).T
DataFrame: the constructor from pandas that puts the data into a table-like structure
requests.get(): method from the requests library to fetch the data
.json(): converts the JSON response directly to a dict
['Time Series (Daily)']: pulls the key out of the dict that holds the time series
.T: transposes the rows and columns
Good luck!
The following code worked for me:
import urllib.request, urllib.parse, urllib.error
import json
def DailyData(symb):
    # Your code was missing the ? after query
    url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={}&apikey=demo".format(symb)
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    js = json.loads(data)
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])
# query multiple times, just to print one item?
print('open', DailyData('MSFT')[1])
print('high', DailyData('MSFT')[2])
print('low', DailyData('MSFT')[3])
print('close', DailyData('MSFT')[4])
print('volume', DailyData('MSFT')[5])
Output:
open 99.8850
high 101.4300
low 99.6700
close 101.1600
volume 19234627
Without seeing the error, it's hard to know what exact problem you were having.
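Since the original problem was a KeyError, it can also help to check what the API actually returned before indexing into it. A small sketch of that kind of guard, using requests and the same demo URL as above:
# Guard against a missing 'Time Series (Daily)' key before indexing into the response.
import requests
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
js = requests.get(url).json()
series = js.get('Time Series (Daily)')
if series is None:
    # The API returned something else (e.g. an error or rate-limit message); inspect it.
    print(js)
else:
    latest = sorted(series.keys())[-1]  # most recent trading day
    print('volume', series[latest]['5. volume'])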

How does one call external datasets into scikit-learn?

For example consider this dataset:
(1)
https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data
Or
(2)
http://data.worldbank.org/topic
How does one call such external datasets into scikit-learn to do anything with them?
The only kind of dataset calling that I have seen in scikit-learn is through a command like:
from sklearn.datasets import load_digits
digits = load_digits()
You need to learn a little pandas, which is a data frame implementation in Python. Then you can do:
import pandas
my_data_frame = pandas.read_csv("/path/to/my/data")
To create model matrices from your data frame, I recommend the patsy library, which implements a model specification language similar to R formulas:
import patsy
y, X = patsy.dmatrices("my_response ~ my_model_formula", my_data_frame)
Then X (and, if needed, y) can be passed into the various sklearn models.
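As a concrete sketch for the first link above: anneal.data appears to be a comma-separated file with no header row, '?' for missing values, and the class label in the last column (worth double-checking against the UCI description), so it can be read with pandas and handed to a scikit-learn estimator roughly like this:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data"
df = pd.read_csv(url, header=None, na_values='?')
X = df.iloc[:, :-1].select_dtypes(include='number').fillna(0)  # numeric columns only, for simplicity
y = df.iloc[:, -1]  # assumed class label column
clf = DecisionTreeClassifier().fit(X, y)
print(clf.score(X, y))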
If the dataset is one that scikit-learn itself knows how to download, you can use the corresponding fetch_* helper from sklearn.datasets, for example
import sklearn.datasets
data = sklearn.datasets.fetch_california_housing()
These helpers exist only for a fixed set of well-known datasets; for arbitrary external files like the ones linked above, loading them with pandas (as in the other answer) is the way to go.
