Sorry for my annoying questions again.
I'm having a bit of trouble with some code. The task is to write a function that takes a nested list of currency conversions and turns it into a table (I've attached some pictures for clarification).
(I could only attach one image; this is the nested list it converts.)
[[10, 9.6, 7.5, 6.7, 4.96], [20, 19.2, 15.0, 13.4, 9.92], [30, 28.799999999999997, 22.5, 20.1, 14.879999999999999], [40, 38.4, 30.0, 26.8, 19.84], [50, 48.0, 37.5, 33.5, 24.8], [60, 57.599999999999994, 45.0, 40.2, 29.759999999999998], [70, 67.2, 52.5, 46.900000000000006, 34.72], [80, 76.8, 60.0, 53.6, 39.68], [90, 86.39999999999999, 67.5, 60.300000000000004, 44.64], [100, 96.0, 75.0, 67.0, 49.6]]
I've got the table's header working fine.
I'm having issues when iterating over each sublist of the nested list: I want to convert each value to a string (formatted to two decimal places) and join the entries with a tab.
The code I've got so far is:
def printTable(cur):
    list2 = makeTable(cur)
    lst1 = Extract(cur)
    lst1.insert(0, "$NZD")
    lne1 = "\t".join(lst1)
    print(lne1)
    list = map(str, list2)
    print(list2)
    for list in list2:
        for elem in list:
            linelem = "\t".join(elem)
            print(linelem)

printTable(cur)
(Note: The first function I call and assign to list2 is what generates the data/nested list)
I've tried playing around a bit, but I keep getting different error messages when trying to convert each sublist to a string.
Thank you all for your help!
Try using this method:
import csv
import pandas as pd

# `a` is the nested list of conversions from the question
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(a)

df = pd.read_csv("out.csv", names=["col1", "col2", "col3", "col4", "col5"])
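For completeness, the tab-joined, two-decimal formatting the question asks for doesn't need pandas at all. A minimal sketch, assuming the nested list shown above is passed in as rows and the header line (the one starting with "$NZD") has already been built as a list of strings called header:

def print_table(rows, header):
    # print the header line first
    print("\t".join(header))
    for row in rows:
        # format each number to two decimal places, then join with tabs
        print("\t".join("{:.2f}".format(value) for value in row))

Each printed row then lines up with the header, one tab-separated column per currency.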
I'm having trouble converting a script that creates and saves 15 pie charts separately into one that saves a single figure with 15 subplots instead. I have tried taking fig, ax = plt.subplots(5, 3, figsize=(7, 7)) out of the loop and specifying the number of rows and columns for the plot, but then I get this error: AttributeError: 'numpy.ndarray' object has no attribute 'pie'. The error doesn't occur if I leave that bit of code inside the loop, as in the script below. Any help with tweaking the code below to create a single figure with 15 subplots (one for each site) would be enormously appreciated.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel(path)
df_1 = df.groupby(['Site', 'group'])['Abundance'].sum().reset_index(name='site_count')
site = ['Ireland', 'England', 'France', 'Scotland', 'Italy', 'Spain',
        'Croatia', 'Sweden', 'Denmark', 'Germany', 'Belgium', 'Austria', 'Poland', 'Stearman', 'Hungary']

for i in site:
    df_1b = df_1.loc[df_1['Site'] == i]
    colors = {'Dog': 'orange', 'Cat': 'cyan', 'Pig': 'darkred', 'Horse': 'lightcoral', 'Bird': 'grey',
              'Rat': 'lightsteelblue', 'Whale': 'teal', 'Fish': 'plum', 'Shark': 'darkgreen'}
    wp = {'linewidth': 1, 'edgecolor': "black"}
    fig, ax = plt.subplots(figsize=(7, 7))
    texts, autotexts = ax.pie(df_1b['site_count'],
                              labels=None,
                              shadow=False,
                              colors=[colors[key] for key in labels],
                              startangle=90,
                              wedgeprops=wp,
                              textprops=dict(color="black"))
    plt.setp(autotexts, size=16)
    ax.set_title(site, size=16, weight="bold", y=0)
    plt.savefig('%s_group_diversity.png' % i, bbox_inches='tight', pad_inches=0.05, dpi=600)
It's hard to guess exactly how you'd like the plot to look.
The main changes the code below makes are:
adding fig, axs = plt.subplots(nrows=5, ncols=3, figsize=(12, 18)). Here axs is a 2d array of subplots. figsize should be large enough to fit the 15 subplots.
df_1b['group'] is used for the labels that decide the color (it's not clear where the labels themselves should be shown, maybe in a common legend)
autopct='%.1f%%' is added to ax.pie(...). This shows the percentages with one decimal.
With autopct, ax.pie(...) now returns 3 lists: wedges, texts, autotexts. The wedges are the pie wedges (patches), the texts are the text objects for the labels (currently empty), and the autotexts are the percentage texts (calculated "automatically").
ax.set_title now uses the site name, and puts it at a negative y-value (y=0 would overlap with the pie)
plt.tight_layout() at the end tries to optimize the surrounding white space
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

site = ['Ireland', 'England', 'France', 'Scotland', 'Italy', 'Spain',
        'Croatia', 'Sweden', 'Denmark', 'Germany', 'Belgium', 'Austria', 'Poland', 'Stearman', 'Hungary']
colors = {'Dog': 'orange', 'Cat': 'cyan', 'Pig': 'darkred', 'Horse': 'lightcoral', 'Bird': 'grey',
          'Rat': 'lightsteelblue', 'Whale': 'teal', 'Fish': 'plum', 'Shark': 'darkgreen'}
wedge_properties = {'linewidth': 1, 'edgecolor': "black"}

# create some dummy test data
df = pd.DataFrame({'Site': np.random.choice(site, 1000),
                   'group': np.random.choice(list(colors.keys()), 1000),
                   'Abundance': np.random.randint(1, 11, 1000)})
df_1 = df.groupby(['Site', 'group'])['Abundance'].sum().reset_index(name='site_count')

fig, axs = plt.subplots(nrows=5, ncols=3, figsize=(12, 18))
for site_i, ax in zip(site, axs.flat):
    df_1b = df_1[df_1['Site'] == site_i]
    labels = df_1b['group']
    wedges, texts, autotexts = ax.pie(df_1b['site_count'],
                                      labels=None,
                                      shadow=False,
                                      colors=[colors[key] for key in labels],
                                      startangle=90,
                                      wedgeprops=wedge_properties,
                                      textprops=dict(color="black"),
                                      autopct='%.1f%%')
    plt.setp(autotexts, size=10)
    ax.set_title(site_i, size=16, weight="bold", y=-0.05)
plt.tight_layout()
plt.savefig('group_diversity.png', bbox_inches='tight', pad_inches=0.05, dpi=600)
plt.show()
I have a post-join dataset where the columns are identical, except that the right-hand side has new and corrected data and a .TODROP suffix appended to the end of each column's name.
So the dataset looks something like this:
df = spark.createDataFrame(
    [
        (1, "Mary", 133, "Pizza", "Mary", 13, "Pizza"),
        (2, "Jimmy", 8, "Hamburger", None, None, None),
        (3, None, None, None, "Carl", 6, "Cake")
    ],
    ["guid", "name", "age", "fav_food", "name.TODROP", "age.TODROP", "fav_food.TODROP"]
)
I'm trying to slide the right-hand columns over to the left-hand columns wherever they have a value:
if df['name.TODROP'].isNotNull():
    df['name'] = df['name.TODROP']
if df['age.TODROP'].isNotNull():
    df['age'] = df['age.TODROP']
if df['fav_food.TODROP'].isNotNull():
    df['fav_food'] = df['fav_food.TODROP']
However, the brute-force solution will take a lot longer with my real dataset because it has many more columns than this example. I'm also getting this error, so it wasn't working out anyway:
"pyspark.sql.utils.AnalysisException: Can't extract value from
name#1527: need struct type but got string;"
Another attempt where I try to do it in a loop:
col_list = []
suffix = ".TODROP"
for x in df.columns:
    if x.endswith(suffix) == False:
        col_list.append(x)
for x in col_list:
    df[x] = df[x + suffix]
Same error as above.
Goal: end up with a single set of columns (guid, name, age, fav_food) where any .TODROP value has been slid into the base column.
Can someone point me in the right direction? Thank you.
First of all, the dots in your column names make Spark interpret them as struct-type field accesses, which is what causes the error, so be aware of that. I have wrapped the column name in backticks, which prevents that misleading interpretation.
import pyspark.sql.functions as f

suffix = '.TODROP'
cols1 = list(filter(lambda c: not c.endswith(suffix), df.columns))
cols2 = list(filter(lambda c: c.endswith(suffix), df.columns))

for c in cols1[1:]:
    df = df.withColumn(c, f.coalesce(f.col(c), f.col('`' + c + suffix + '`')))

df = df.drop(*cols2)
df.show()
+----+-----+---+---------+
|guid| name|age| fav_food|
+----+-----+---+---------+
| 1| Mary|133| Pizza|
| 2|Jimmy| 8|Hamburger|
| 3| Carl| 6| Cake|
+----+-----+---+---------+
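A sketch of an alternative with the same idea, assuming the df from the question (the variable names here are my own): build all the coalesce expressions in a single select instead of a withColumn loop, which stays manageable when there are many columns:

import pyspark.sql.functions as F

suffix = '.TODROP'
base_cols = [c for c in df.columns if not c.endswith(suffix)]

# coalesce each base column with its backticked .TODROP twin;
# columns without a twin (e.g. guid) pass through unchanged
exprs = [F.coalesce(F.col(c), F.col('`' + c + suffix + '`')).alias(c)
         if c + suffix in df.columns else F.col(c)
         for c in base_cols]
df_clean = df.select(*exprs)
df_clean.show()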
I have a dataframe that looks like the following. I want to find the average of each array and put it in a new column. I can find the average using a UDF, but I cannot put it in a column. It would be nice if you can help without a UDF; otherwise, any help with the current solution is welcome.
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, IntegerType, ArrayType

data = [
    ("Smith", "[55, 65, 75]"),
    ("Anna", "[33, 44, 55]"),
    ("Williams", "[9.5, 4.5, 9.7]"),
]
schema = StructType([
    StructField('name', StringType(), True),
    StructField('some_value', StringType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
+--------+---------------+
|name |some_value |
+--------+---------------+
|Smith |[55, 65, 75] |
|Anna |[33, 44, 55] |
|Williams|[9.5, 4.5, 9.7]|
+--------+---------------+
A solution like this,
array_mean = F.udf(lambda x: float(np.mean(x)), FloatType())
(from "Find mean of pyspark array<double>") returns a dataframe, not a new column.
Any help is welcome. Thank you.
You have a string column that looks like an array, instead of an array column, so you need to convert the datatype in the UDF as well:
import json
import numpy as np
import pyspark.sql.functions as F
array_mean = F.udf(lambda x: float(np.mean(json.loads(x))), 'float')
df2 = df.withColumn('mean_value', array_mean('some_value'))
df2.show()
+--------+---------------+----------+
| name| some_value|mean_value|
+--------+---------------+----------+
| Smith| [55, 65, 75]| 65.0|
| Anna| [33, 44, 55]| 44.0|
|Williams|[9.5, 4.5, 9.7]| 7.9|
+--------+---------------+----------+
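If avoiding the UDF entirely is the goal (as the question mentions), here is a sketch using only built-in functions; it assumes Spark 2.4+, where from_json can parse arrays of primitives and the aggregate higher-order SQL function is available:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

# parse the JSON-like string into a real array column, then average it in SQL
df2 = (df
       .withColumn('arr', F.from_json('some_value', ArrayType(DoubleType())))
       .withColumn('mean_value', F.expr('aggregate(arr, 0D, (acc, x) -> acc + x) / size(arr)'))
       .drop('arr'))
df2.show(truncate=False)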
Coming from Pandas and being new to pyspark, I went the long way:
Strip []
split to turn into list
explode
mean
df2 = (df
       .select(df.name, F.regexp_replace('some_value', '[\\]\\[]', "").alias("some_value"))
       .select(df.name, F.split("some_value", ",").alias("some_value"))
       .select(df.name, F.explode("some_value").alias("some_value")))
df2 = (df2
       .withColumn("some_value", df2.some_value.cast('float'))
       .groupBy("name")
       .mean("some_value"))
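One small follow-up on this approach (the target name mean_value is just my assumption): groupBy().mean() names the result column avg(some_value), so you may want to rename it:

df2 = df2.withColumnRenamed('avg(some_value)', 'mean_value')
df2.show()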
I'm having trouble getting data from several .csv files into a single array. I can get all of the data from the .csv files fine; I just can't get everything into a simple numpy array. The name of each .csv file is important to me, so in the end I'd like to have a Pandas DataFrame with the columns labeled by the original name of each .csv file.
import glob
import numpy as np
import pandas as pd

files = glob.glob("*.csv")
temp_dict = {}
wind_dict = {}
for file in files:
    data = pd.read_csv(file)
    temp_dict[file[:-4]] = data['HLY-TEMP-NORMAL'].values
    wind_dict[file[:-4]] = data['HLY-WIND-AVGSPD'].values
temp = []
wind = []
name = []
for word in temp_dict:
    name.append(word)
    temp.append(temp_dict[word])
for word in wind_dict:
    wind.append(wind_dict[word])
temp = np.array(temp)
wind = np.array(wind)
When I print temp or wind I get something like this:
[array([ 32.1, 31.1, 30.3, ..., 34.9, 33.9, 32.9])
array([ 17.3, 17.2, 17.2, ..., 17.5, 17.5, 17.2])
array([ 41.8, 41.1, 40.6, ..., 44.3, 43.4, 42.6])
...
array([ 32.5, 32.2, 31.9, ..., 34.8, 34.1, 33.7])]
when what I really want is:
[[ 32.1, 31.1, 30.3, ..., 34.9, 33.9, 32.9]
[ 17.3, 17.2, 17.2, ..., 17.5, 17.5, 17.2]
[ 41.8, 41.1, 40.6, ..., 44.3, 43.4, 42.6]
...
[ 32.5, 32.2, 31.9, ..., 34.8, 34.1, 33.7]]
This does not work but is the goal of my code:
df = pd.DataFrame(temp, columns=name)
And when I try to build a DataFrame with Pandas, each row is its own array, which isn't helpful because it thinks every row has only one element in it. I know the problem is with "array(...)"; I just don't know how to get rid of it. Thank you in advance for your time and consideration.
I think you can use:
files = glob.glob("*.csv")
#read each file to list of DataFrames
dfs = [pd.read_csv(fp) for fp in files]
#create names for each file
lst4 = [x[:-4] for x in files]
#create one big df with MultiIndex by files names
df = pd.concat(dfs, keys=lst4)
If you want separate DataFrames, change the last row of the above solution to reshape with unstack:
df = pd.concat(dfs, keys=lst4).unstack()
df_temp = df['HLY-TEMP-NORMAL']
df_wind = df['HLY-WIND-AVGSPD']
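If the final goal is the wide layout described in the question (one column per file, labeled by the file name), here is a sketch that builds it directly from the per-file DataFrames; it assumes every file has the same number of rows:

import glob
import pandas as pd

files = glob.glob("*.csv")
names = [fp[:-4] for fp in files]
dfs = [pd.read_csv(fp) for fp in files]

# one column per file, labeled with the file name minus ".csv"
df_temp = pd.DataFrame({n: d['HLY-TEMP-NORMAL'].values for n, d in zip(names, dfs)})
df_wind = pd.DataFrame({n: d['HLY-WIND-AVGSPD'].values for n, d in zip(names, dfs)})

# the plain 2D numpy arrays from the question, one row per file
temp = df_temp.values.T
wind = df_wind.values.T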
I am trying to use Zeppelin to plot a realtime graph. I am doing sentiment analysis on tweets per minute. I am able to query statically and plot a graph, but I would like this to be done dynamically. I am new to Zeppelin and do not have much knowledge of AngularJS. What would be the correct approach to this problem?
val final_score = uni_join.map{ case ((year, month, day, hour, minutes), (tweet_count, sentiment)) =>
  (year, month, day, hour, minutes, (sentiment / tweet_count).ceil) }

final_score.saveToCassandra("twitter", "score", writeConf = WriteConf(ttl = TTLOption.constant(1000)))

final_score.foreachRDD(score => {
  val rowRDD = score.map{ case (year, month, day, hour, minutes, sentiment) => (year, month, day, hour, minutes, sentiment) }
  val tempDF = sqlContext.createDataFrame(rowRDD)
  z.angularBindGlobal("stream", parsed) // to bind parsed to stream
  tempDF.registerTempTable("realTimeTable")
})
Doing a query on the above table, I am able to get the graph. But I would like to dynamically update the graph every minute in order to keep it in sync with the sentiment score.
Thanks in advance.
[Update] The angular part of the Zeppelin notebook is as follows:
%angular
<div id="graph" style="height: 100%; width: 100%">
  <canvas id="myChart" width="400" height="400"></canvas>
  <div id="legendDiv"></div>
</div>
<script>
  function initMap() {
    var colorList = ["#fde577", "#ff6c40", "#c72a40", "#520833", "#a88399"]
    var el = angular.element($('#stream'));
    console.log("El is " + el) // returns el as object
    angular.element(el).ready(function() {
      console.log('Hello')
      window.locationWatcher = el.$scope.$watch('stream', function(new, old) {
        console.log('changed');
      }, true)
    })
</script>
But running this code keeps returning the following error.
vendor.js:29 jQuery.Deferred exception: Cannot read property '$watch' of undefined TypeError: Cannot read property '$watch' of undefined
The Spark version that I am using is 1.6, and Zeppelin is 0.6.
spark-highcharts since version 0.6.3 supports Spark Structured Streaming.
For a structuredDataFrame after aggregation, put the following code in one Zeppelin paragraph. The OutputMode can be either append or complete, depending on how the structuredDataFrame is aggregated.
import com.knockdata.spark.highcharts._
import com.knockdata.spark.highcharts.model._

val query = highcharts(
  structuredDataFrame.seriesCol("country")
    .series("x" -> "year", "y" -> "stockpile")
    .orderBy(col("year")), z, "append")
And put the following code in the next paragraph. The chart in this paragraph will be updated when new data arrives in the structuredDataFrame.
StreamingChart(z)
Run the following code to stop updating the chart.
query.stop()
Here is an example that generates the structuredDataFrame.
spark.conf.set("spark.sql.streaming.checkpointLocation", "/usr/zeppelin/checkpoint")

case class NuclearStockpile(country: String, stockpile: Int, year: Int)

val USA = Seq(0, 0, 0, 0, 0, 6, 11, 32, 110, 235, 369, 640,
  1005, 1436, 2063, 3057, 4618, 6444, 9822, 15468, 20434, 24126,
  27387, 29459, 31056, 31982, 32040, 31233, 29224, 27342, 26662,
  26956, 27912, 28999, 28965, 27826, 25579, 25722, 24826, 24605,
  24304, 23464, 23708, 24099, 24357, 24237, 24401, 24344, 23586,
  22380, 21004, 17287, 14747, 13076, 12555, 12144, 11009, 10950,
  10871, 10824, 10577, 10527, 10475, 10421, 10358, 10295, 10104).
  zip(1940 to 2006).map(p => NuclearStockpile("USA", p._1, p._2))

val USSR = Seq(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  5, 25, 50, 120, 150, 200, 426, 660, 869, 1060, 1605, 2471, 3322,
  4238, 5221, 6129, 7089, 8339, 9399, 10538, 11643, 13092, 14478,
  15915, 17385, 19055, 21205, 23044, 25393, 27935, 30062, 32049,
  33952, 35804, 37431, 39197, 45000, 43000, 41000, 39000, 37000,
  35000, 33000, 31000, 29000, 27000, 25000, 24000, 23000, 22000,
  21000, 20000, 19000, 18000, 18000, 17000, 16000).
  zip(1940 to 2006).map(p => NuclearStockpile("USSR/Russia", p._1, p._2))

// `input` is assumed to be a streaming source (e.g. a MemoryStream[NuclearStockpile]) defined elsewhere
input.addData(USA.take(30) ++ USSR.take(30))

val structuredDataFrame = input.toDF
And the following code can be run to simulate updating the chart; the chart will refresh when it runs.
input.addData(USA.drop(30) ++ USSR.drop(30))
NOTE: The example uses Zeppelin 0.6.2 and Spark 2.0.
NOTE: Please check the highcharts license for commercial usage