array - list format input - arrays

I have the following question: how can I change the format of curve2 (a list) so that it looks like curve?
curve = [0.0556, 0.0563]
curve2 = [[0.0159, 0.0178]]
Context: I'd like to apply a certain piece of code, but I don't get the result I expect since the input has a different format.
My code is something like:
import pandas as pd
import numpy as np
curve = [0.0556, 0.0563]
curve2 = [[0.0159, 0.0178]]
df = pd.DataFrame()

def SUM(curve):
    df['Col1'] = curve
    return df
print(SUM(curve))
P.S.: curve2 is a row extracted from an array (as a list):
[[ 0.01593353 0.01783041]
[ 0.00917833 0.00593893]
[ 0.00829569 0.02123637]
[-0.03057529 -0.04138836]
[ 0.05212978 0.03239212]]
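Since curve2 is a single row wrapped in an outer list, removing one level of nesting gives it the same shape as curve. A minimal sketch (not from the original post; variable and function names reuse the question's):

import numpy as np

curve2 = [[0.0159, 0.0178]]

# option 1: take the single inner row
flat = curve2[0]                              # [0.0159, 0.0178]

# option 2: flatten via numpy, which also works for rows sliced from a 2-D array
flat = np.asarray(curve2).ravel().tolist()    # [0.0159, 0.0178]

print(SUM(flat))

Either way the input now has the same shape as curve, so SUM builds the expected single column.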

Related

Slicing the first row of a Dataframe into an Array[String]

import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf,SparkContext}
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.expressions.Window
import scala.runtime.ScalaRunTime.{array_apply, array_update}
import scala.collection.mutable.Map
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local")
    val sc = new SparkContext(conf)
    val input = "file:///home/shahid/Desktop/sample1.csv"
    val hdfsOutput = "hdfs://localhost:9001/output.csv"
    val localOutput = "file:///home/shahid/Desktop/output"
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.format("com.databricks.spark.csv").load(input)
    var colLen = df.columns.length
    val df1 = df.filter(!(col("_c1") === ""))
I am capturing the top row into a val named headerArr.
    val headerArr = df1.head
I wanted this val to be Array[String].
    println("class = " + headerArr.getClass)
What can I do to either typecast this headerArr into an Array[String] or get this top row directly into an Array[String]?
    val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9001"), sc.hadoopConfiguration)
    fs.delete(new org.apache.hadoop.fs.Path("/output.csv"), true)
    df1.write.csv(hdfsOutput)
    val fileTemp = new File("/home/shahid/Desktop/output/")
    if (fileTemp.exists)
      FileUtils.deleteDirectory(fileTemp)
    df1.write.csv(localOutput)
    sc.stop()
  }
}
I have tried using df1.first as well, but both return the same type.
The result of the above code on the console is as follows:
class = class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
Help needed.
Thank you for your time.
Given the following dataframe:
val df = spark.createDataFrame(Seq(("a", "hello"), ("b", "world"))).toDF("id", "word")
df.show()
+---+-----+
| id| word|
+---+-----+
| a|hello|
| b|world|
+---+-----+
You can get the first row, as you already mentioned, and then turn this result into a Seq, which is actually backed by a subtype of Array and which you can then "cast" to an array without copying:
// returns: WrappedArray(a, hello)
df.first.toSeq.asInstanceOf[Array[_]]
Casting is usually not good practice in a language with very good static typing such as Scala, so you'd probably want to stick to the Seq unless you really need an Array.
Notice that thus far we always ended up not with an array of strings but with an array of objects, since the Row object in Spark has to accommodate various types. If you want to get to a collection of strings you can iterate over the fields and extract the strings:
// returns: Vector(a, hello)
for (i <- 0 until df.first.length) yield df.first.getString(i)
This of course will cause a ClassCastException if the Row contains non-strings. Depending on your needs, you may also want to consider using a Try to silently drop non-strings within the for-comprehension:
import scala.util.Try
// same return type as before
// non-string members will be filtered out of the end result
for {
  i <- 0 until df.first.length
  field <- Try(df.first.getString(i)).toOption
} yield field
Until now we returned an IndexedSeq, which is suitable for efficient random access (i.e. has constant access time to any item in the collection) and in particular a Vector. Again, you may really need to return an Array. To return an Array[String] you may want to call toArray on the Vector, which unfortunately copies the whole thing.
You can skip this step and directly output an Array[String] by explicitly using flatMap instead of relying on the for-comprehension and using collection.breakOut:
// returns: Array[String] -- silently keeping strings only
0.until(df.first.length).
  flatMap(i => Try(df.first.getString(i)).toOption)(collection.breakOut)
To learn more about builders and collection.breakOut you may want to have a read here.
Well, my problem wasn't solved in the best way, but I found a way out:
val headerArr = df1.first
var headerArray = new Array[String](colLen)
for (i <- 0 until colLen) {
  headerArray(i) = headerArr(i).toString
}
But I am still open to new suggestions. For now I am slicing the dataframe into a value of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema and then transferring the elements to an Array[String] with an iteration.

Using Pandas to store spotipy output in csv (python)

I wrote some code to store the output of spotipy in a pd.DataFrame:
import pandas as pd
import spotipy
sp = spotipy.Spotify()
from spotipy.oauth2 import SpotifyClientCredentials
cid ='XXXCIDXXX'
secret = 'XXXSECRETXXX'
client_credentials_manager = SpotifyClientCredentials(client_id=cid,
                                                      client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
sp.trace=False
playlist = sp.user_playlist_tracks('spotify', '37i9dQZF1DX5nwnRMcdReF')
songs = playlist['items']
df = pd.DataFrame(songs)
df.to_csv('Songs.csv', sep=';', encoding='utf-8', index=True)
But there's a lot of output there that I don't need, so I found code that outputs only the data I need:
for i, item in enumerate(playlist['items']):
    track = item['track']
    need = (i, track['artists'][0]['name'], track['name'], track['id'])
Now I can use print(need) to output exactly what I want, but I don't know how to store the data in the DataFrame.
If someone could help me that would be great.
Thank you.
Append the output you need to separate lists and then load those lists into a dataframe:
# initialize empty lists to collect the data
artist_name = []
track_name = []
track_id = []

# appending needed data to separate lists
for i, item in enumerate(playlist['items']):
    track = item['track']
    artist_name.append(track['artists'][0]['name'])
    track_name.append(track['name'])
    track_id.append(track['id'])

# loading lists into the dataframe
df = pd.DataFrame({'artist_name': artist_name, 'track_name': track_name, 'track_id': track_id})
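To finish the round trip to CSV, you can then reuse the same to_csv call from the question on the trimmed-down dataframe:

df.to_csv('Songs.csv', sep=';', encoding='utf-8', index=True)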

Keras input shape error - passing the whole array not each line

I am loading images from a csv file. The images are 300 x 300 pixels but flattened to 90000. I am getting an error for the input shape. I am using the TensorFlow backend. I have attached an image of my csv file as well as an image of the error. It looks like it's passing the whole list of arrays instead of passing each line.
"ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 arrays but instead got the following list of 380 arrays:[array([ 43., 45., 46., ..., 161., 152., 146.]), array([ 211., 222., 224., ..., 212., 213., 213.]), array([ 201., 201., "
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
import csv
import cv2
import re

loaded_images_train = []
loaded_labels_train = []
loaded_images_test = []
loaded_labels_test = []

with open('images_train.csv') as f:
    csvReader = csv.reader(f, lineterminator='\n')
    for row in csvReader:
        row = np.asarray(row, dtype='float')
        loaded_images_train.append(row)

with open('labels_train.csv') as f:
    csvReader = csv.reader(f, lineterminator='\n')
    for row in csvReader:
        row = str(row)
        row = row.strip(',')
        loaded_labels_train.append(row)

with open('images_test.csv') as f:
    csvReader = csv.reader(f, lineterminator='\n')
    for row in csvReader:
        row = np.asarray(row, dtype='float')
        loaded_images_test.append(row)

with open('labels_test.csv') as f:
    csvReader = csv.reader(f, lineterminator='\n')
    for row in csvReader:
        row = str(row)
        row = row.strip(',')
        loaded_labels_test.append(row)

# load data
x_train = loaded_images_train
y_train = loaded_labels_train
print("Loaded Training Data")

x_test = loaded_images_test
y_test = loaded_labels_test
print("Loaded Testing Data")

model = Sequential()
model.add(Dense(64, input_shape=(90000,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=20,
          batch_size=128)

score = model.evaluate(x_test, y_test, batch_size=128)
print(score)
The way you are converting each line with asarray and then feeding Keras a list of arrays is not working.
I've tested your code with a slightly different approach and it ran flawlessly for me with the csv you provided in the comments (changing input_size to 400).
Read all lines from the file into loaded_images_train. It will be a list of lists:
input_size = 90000

with open('images_train.csv') as f:
    csvReader = csv.reader(f, lineterminator='\n')
    for row in csvReader:
        assert len(row) == input_size
        loaded_images_train.append(row)
I've included the assertion following your feedback to my comment.
You can also assert len(row) == output_size for the labels.
On the other hand, if you are pretty sure about the sizes of the rows, you can replace the loop with a simple:
loaded_images_train = list(csvReader)
Whichever you choose, do the same for the test images.
Then do the conversion to numpy.ndarray when declaring x_train:
x_train = np.asarray(loaded_images_train, dtype=float) # you don't really need the quotes here
Finally, printing the shape of the loaded data can help you know that everything is ok. For example:
print("Loaded Training Data", x_train.shape)
The reason you met this problem is that the type of your dataset is list, but the only acceptable input type for a Keras model is a numpy array.
You need to convert the lists to numpy arrays with np.asarray(loaded_images_train) and make sure the shape of the data is (n, 90000).
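A minimal sketch of that conversion, using the variable names from the question (and assuming the label strings can be parsed as floats, which binary_crossentropy needs):

import numpy as np

# convert the loaded lists to numpy arrays before fitting
x_train = np.asarray(loaded_images_train, dtype=float)  # shape becomes (n, 90000)
y_train = np.asarray(loaded_labels_train, dtype=float)  # labels as a numeric array

print(x_train.shape)  # sanity check before calling model.fit
model.fit(x_train, y_train, epochs=20, batch_size=128)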

Non conformable array error when using rpart with rpy2

I'm using rpart with rpy2 (version 2.8.6) on Python 3.5, and want to train a decision tree for classification. My code snippet looks like this:
import rpy2.robjects.packages as rpackages
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri
from rpy2.robjects import pandas2ri
from rpy2.robjects import DataFrame, Formula
rpart = importr('rpart')
numpy2ri.activate()
pandas2ri.activate()
dataf = DataFrame({'responsev': owner_train_label,
                   'predictorv': owner_train_data})
formula = Formula('responsev ~.')
clf = rpart.rpart(formula = formula, data = dataf, method = "class", control=rpart.rpart_control(minsplit = 10, xval = 10))
where owner_train_label is a numpy float64 array of shape (12610,) and owner_train_data is a numpy float64 array of shape (12610, 88).
This is the error I'm getting when I run the last line of code to fit the data.
RRuntimeError: Error in ((xmiss %*% rep(1, ncol(xmiss))) < ncol(xmiss)) & !ymiss :
non-conformable arrays
I get that it is telling me the arrays are non-conformable, but I don't know why: with the same training data I can train sklearn's decision tree successfully.
Thanks for your help.
I got around this by creating the dataframe with pandas and passing the pandas dataframe to rpart, using rpy2's pandas2ri to convert it to an R dataframe.
import pandas as pd
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import Formula

rpart = importr('rpart')
pandas2ri.activate()
df = pd.DataFrame(data = owner_train_data)
df['l'] = owner_train_label
formula = Formula('l ~.')
clf = rpart.rpart(formula = formula, data = df, method = "class", control=rpart.rpart_control(minsplit = 10, xval = 10))
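To get class predictions back out of the fitted model, something like the following should work; this is a hedged sketch rather than part of the original answer, reaching R's generic predict() (which dispatches to predict.rpart) through robjects.r:

from rpy2.robjects import r

# type="class" asks predict.rpart for predicted class labels
# rather than class probabilities
pred = r['predict'](clf, df, type="class")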

Bokeh MultiSelect plotting in infinite loop, distorting plot

I'm trying to plot multiple lines on a graph based on a user's MultiSelect options. I read in two separate Excel files of data and plot their axes based on the user's request. I'm using Python 3.5 and running on a Mac.
1) As soon as I make a multi-selection, the figure gets distorted.
2) It seems the plot is running in an infinite loop.
3) The plot does not properly update when the user changes selections. It just adds more plots without removing the previous plot.
from os.path import dirname, join
from pandas import *
import numpy as np
import pandas.io.sql as psql
import sqlite3 as sql
import sys, os
from bokeh.plotting import figure
from bokeh.layouts import layout, widgetbox
from bokeh.models import ColumnDataSource, HoverTool, Div
from bokeh.models.widgets import Slider, Select, TextInput, MultiSelect
from bokeh.io import curdoc
import matplotlib.pyplot as plt

files = list()
path = os.getcwd()
for x in os.listdir(path):
    if x.endswith(".xlsx"):
        if x != 'template.xlsx':
            files.append(x)

axis_map = {
    "0% void": "0% void",
    "40% void": "40% void",
    "70% void": "70% void",
}

files_list = MultiSelect(title="Files", value=["dummy2.xlsx"],
                         options=open(join(dirname(__file__), 'files.txt')).read().split())
voids = MultiSelect(title="At what void[s]", value=["0% void"], options=sorted(axis_map.keys()))

p = figure(plot_height=600, plot_width=700, title="", toolbar_location=None)
pline = figure(plot_height=600, plot_width=700, title="")

path = os.getcwd()
data_dict = {}
for file in os.listdir(path):
    if file.endswith(".xlsx"):
        xls = ExcelFile(file)
        df = xls.parse(xls.sheet_names[0])
        data = df.to_dict()
        data_dict[file] = data

# converting dictionary to dataframe
newdict = {(k1, k2): v2 for k1, v1 in data_dict.items()
           for k2, v2 in data_dict[k1].items()}
xxs = DataFrame([newdict[i] for i in sorted(newdict)],
                index=MultiIndex.from_tuples([i for i in sorted(newdict.keys())]))
master_data = xxs.transpose()

def select_data():
    for vals in files_list.value:
        for vox in voids.value:
            pline.line(x=master_data[vals]['Burnup'], y=master_data[vals][vox])
            pline.circle(x=master_data[vals]['Burnup'], y=master_data[vals][vox])
    return

def update():
    select_data()

controls = [files_list, voids]
for control in controls:
    control.on_change('value', lambda attr, old, new: update())

sizing_mode = 'fixed'  # 'scale_width' also looks nice with this example
inputs = widgetbox(*controls, sizing_mode=sizing_mode)
l = layout([
    [inputs, pline],
], sizing_mode=sizing_mode)

update()
curdoc().add_root(l)
curdoc().title = "Calculations"
I am not 100% certain, since the code above is not self-contained and cannot be run and investigated, but there are some known issues (as of Bokeh 0.12.4) with adding new components to documents in some situations. These issues are high on the priority list for the next two point releases.
Are the data sizes reasonable such that you could create all the combinations up front? If so, I would recommend doing that, and then having the multi-select values toggle the visibility on/off appropriately. E.g., here's a similar example using a checkbox:
import numpy as np

from bokeh.io import curdoc
from bokeh.layouts import row
from bokeh.palettes import Viridis3
from bokeh.plotting import figure
from bokeh.models import CheckboxGroup

p = figure()
props = dict(line_width=4, line_alpha=0.7)
x = np.linspace(0, 4 * np.pi, 100)
l0 = p.line(x, np.sin(x), color=Viridis3[0], legend="Line 0", **props)
l1 = p.line(x, 4 * np.cos(x), color=Viridis3[1], legend="Line 1", **props)
l2 = p.line(x, np.tan(x), color=Viridis3[2], legend="Line 2", **props)

checkbox = CheckboxGroup(labels=["Line 0", "Line 1", "Line 2"], active=[0, 1, 2], width=100)

def update(attr, old, new):
    l0.visible = 0 in checkbox.active
    l1.visible = 1 in checkbox.active
    l2.visible = 2 in checkbox.active

checkbox.on_change('active', update)

layout = row(checkbox, p)
curdoc().add_root(layout)
If the data sizes are not such that you can create all the combinations up front, then I would suggest making an issue on the project issue tracker (https://github.com/bokeh/bokeh/issues) with complete, minimal, self-contained code that is runnable as-is and reproduces the problem (i.e. generates random or synthetic data but is otherwise identical). That is the number one thing that would help the core devs address the issue more promptly.
@bigreddot Thanks for your response.
I edited the code to make it self-contained.
1) The plot does not reset; each new selection plots over the previous plot.
2) When the user makes multiple selections (ctrl+shift) the plot axes get distorted and it seems to be running in an infinite loop.
from pandas import *
import numpy as np
import sys, os
from bokeh.plotting import figure
from bokeh.layouts import layout, widgetbox
from bokeh.models.widgets import MultiSelect
from bokeh.io import curdoc
from bokeh.plotting import reset_output
import math

axis_map = {
    "y1": "y3",
    "y2": "y2",
    "y3": "y1",
}

x1 = np.linspace(0, 20, 62)
y1 = [1.26 * math.cos(x) for x in np.linspace(-1, 1, 62)]
y2 = [1.26 * math.cos(x) for x in np.linspace(-0.95, .95, 62)]
y3 = [1.26 * math.cos(x) for x in np.linspace(-.9, .90, 62)]

TOOLS = "pan,wheel_zoom,box_zoom,reset,save,hover"
vars = MultiSelect(title="At what void[s]", value=["y1"], options=sorted(axis_map.keys()))

master_data = {'rate': x1,
               'y1': y1,
               'y2': y2,
               'y3': y3}

p = figure(plot_height=600, plot_width=700, title="", toolbar_location=None)
pline = figure(plot_height=600, plot_width=700, title="", tools=TOOLS)

def select_data():
    for vox in vars.value:
        pline.line(x=master_data['rate'], y=master_data[vox], line_width=2)
        pline.circle(x=master_data['rate'], y=master_data[vox], line_width=2)
    return

controls = [vars]
for control in controls:
    control.on_change('value', lambda attr, old, new: select_data())

sizing_mode = 'fixed'
inputs = widgetbox(*controls)
l = layout([
    [inputs, pline],
])

select_data()
curdoc().add_root(l)
curdoc().title = "Plot"
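Applying bigreddot's visibility-toggling suggestion to this self-contained version gives a minimal sketch (reusing the names from the snippet above; not part of the original thread): create every renderer up front and only flip visible in the callback, so nothing is re-added to the document on each selection:

# create one line/circle pair per series up front, keyed by series name
renderers = {}
for name in ['y1', 'y2', 'y3']:
    line = pline.line(x=master_data['rate'], y=master_data[name], line_width=2)
    circle = pline.circle(x=master_data['rate'], y=master_data[name])
    renderers[name] = (line, circle)

def update(attr, old, new):
    # toggle visibility instead of adding new glyphs on every change
    for name, (line, circle) in renderers.items():
        line.visible = circle.visible = name in vars.value

vars.on_change('value', update)
update(None, None, vars.value)  # initialize visibility from the default selection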
