Calculating the average for a specific column in a 2D array

I am new to Python and need your help. I need to calculate the average of a specific column in a very large array. I would like to use the numpy.average function (I am open to other suggestions), but I cannot figure out how to select a column by its header (e.g. the average of the Flavor_Score column):
Beer_name   Tester  Flavor_Score  Overall_Score
Coors       Jim     2.0           3.0
Sam Adams   Dave    4.0           4.5
Becks       Jim     3.5           3.5
Coors       Dave    2.0           2.2
Becks       Dave    3.5           3.7
Do I have to transpose the array to get the average of a column? (It seems there are many functions for rows in pandas and numpy but relatively few for columns; I could be wrong, of course.)
Second question for the same array: to calculate the average Flavor_Score for a specific beer (across the different testers), is the best way to build on the answer to the first question? Something like:
Beer_test = "Coors"
for i in Beer_name:
    if i == Beer_test:
        pass  # recurring average calculation here
    else:
        pass
I would hope there is a built-in function for this.
Very much appreciate your help!

Ok here is an example of how to do that.
import pandas as pd

# Test data
df = pd.DataFrame({'Beer_name': ['Coors', 'Sam Adams', 'Becks', 'Coors', 'Becks'],
                   'Tester': ['Jim', 'Dave', 'Jim', 'Dave', 'Dave'],
                   'Flavor_Score': [2, 4, 3.5, 2, 3.5],
                   'Overall_Score': [3, 4.5, 3.5, 2.2, 3.7]})
# Simply call mean on the DataFrame (numeric_only skips the string columns)
df.mean(numeric_only=True)
Flavor_Score 3.00
Overall_Score 3.38
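If you only want a single column, you can also select it by its header first; a minimal sketch, assuming the df above:
df['Flavor_Score'].mean()   # 3.0
So there is no need to transpose anything; a column is addressed directly by its name.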
Then you can use the groupby feature:
df.groupby('Beer_name').mean(numeric_only=True)
           Flavor_Score  Overall_Score
Beer_name
Becks               3.5            3.6
Coors               2.0            2.6
Sam Adams           4.0            4.5
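And for the average Flavor_Score of one specific beer (the second question), you can index into that grouped result or filter first; a small sketch, again assuming the df above:
df.groupby('Beer_name')['Flavor_Score'].mean()['Coors']    # 2.0
df.loc[df['Beer_name'] == 'Coors', 'Flavor_Score'].mean()  # 2.0, same thing via boolean filtering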
Now you can even see how it looks by tester.
df.groupby(['Beer_name','Tester']).mean()
                  Flavor_Score  Overall_Score
Beer_name Tester
Becks     Dave             3.5            3.7
          Jim              3.5            3.5
Coors     Dave             2.0            2.2
          Jim              2.0            3.0
Sam Adams Dave             4.0            4.5
Good beer!

Pandas: switch from two arrays (i.e. columns) to one array [duplicate]

Given the following df with fruits and their prices by month, I'd like to have all of the fruits listed in a single Fruit_Month column and another column called Price. The ultimate goal is to calculate the correlation between fruit prices.
Given:
Fruit Jan Feb
Apple 2.00 2.50
Banana 1.00 1.25
Desired output:
Fruit_Month Price
Apple_Jan 2.00
Apple_Feb 2.50
Banana_Jan 1.00
Banana_Feb 1.25
And then from here, I'd like to see how correlated each fruit is with one another. In this simple example, it'd just be Apple vs Banana, but it should apply if there were more fruits. If there's a better/easier way, please let me know.
Here is an approach that first melts the table to create a Month column, then builds a new df from the melted columns. I bet there are more clever ways to do this, maybe with unstack. Depending on what you need to do, it may also be easier to keep Fruit and Month as separate columns.
df = df.melt(id_vars='Fruit', var_name='Month', value_name='Price')
df = pd.DataFrame({
    'Fruit_Month': df.Fruit + '_' + df.Month,
    'Price': df.Price
})
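For the correlation goal from the question, it is indeed easier to keep Fruit and Month as separate columns and pivot so each fruit becomes its own column; a rough, self-contained sketch using the sample data from the question (with only two months the correlations are trivially ±1, but the same pattern scales to more fruits and months):
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Banana'],
                   'Jan': [2.00, 1.00],
                   'Feb': [2.50, 1.25]})

long_df = df.melt(id_vars='Fruit', var_name='Month', value_name='Price')

# One column per fruit, one row per month, then pairwise price correlations
wide = long_df.pivot(index='Month', columns='Fruit', values='Price')
print(wide.corr())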

How should I format/set up my dataset/dataframe? And factor -> numeric problems

New to R and new to this forum; I tried searching, and I hope I don't embarrass myself by failing to spot previous answers.
So I have my data, and I intend to fit some kind of GLMMs in the end, but that is far off in the future; first I am going to fit some simple GLMs/LMs to learn what I am doing.
First, about my data:
I have data sampled from 2 "general areas" on opposite sides of the country.
In these general areas there are roughly 50 trakts placed (in a grid, with a random starting point).
Trakts have been revisited each year for a duration of 4 years.
A trakt contains 16 sample plots; I intend to work at the trakt level, so I use the means of the 16 sample plots for each trakt.
2 x 4 x 50 = 400 rows (the actual number is 373 rows after I removed trakts where not enough plots could be sampled due to terrain etc.).
The data in my Excel file is currently organised like this:
rows = trakts
columns = the measured variables
I have 8-10 columns I want to use.
A short example of how the data looks now:
V1 - predictor, 4 different columns
V2 - response variable = proportional data, 1-4 columns depending on which hypothesis I end up testing
The GLMM in the end would look something like V2 ~ V1 + V1 + V1, with (Area, Year) as random effects.
Area  Year  Trakt  V1           V2
A     2015  1      25.165651    0
A     2015  2      11.16894652  0.1
A     2015  3      18.231       0.16
A     2014  1      3.1222       N/A
A     2014  2      6.1651       0.98
A     2014  3      8.651        1
A     2013  1      6.16416      0.16
B     2015  1      9.12312      0.44
B     2015  2      22.2131      0.17
B     2015  3      12.213       0.76
B     2014  1      1.123132     0.66
B     2014  2      0.000        0.44
B     2014  3      5.213265     0.33
B     2013  1      2.1236       0.268
How should I get started on this?
8 different files?
Nested by trakts (do I start nesting now, or later when I am fitting the GLMMs)?
I load my data into R with the read.table function.
If I run sapply(dataframe, class), V1 and V2 are factors and everything else is integer.
If I run sapply(dataframe, mode), everything is numeric.
So, finally, to my actual problems: I have been trying to do normality tests (only tried shapiro.test so far), but I keep getting errors that imply my data is not numeric.
Also, when I run a normality test, do I run only one column and evaluate it before moving on to the next column, or should I run several columns at once? The entire dataset?
Should I, in my case, run independent normality tests for each of my areas and years?
I hope it didn't end up too cluttered.
Best regards

Storing the order of entities

If I have entities with the attribute :fruit:
apple
banana
grapes
tomato
and a feature allowing a user to order his fruits:
1 grapes
2 apple
3 tomato
4 banana
Is there a good way to store the fruit order in the database, with the expectation that a fruit may be deleted, a fruit added, and fruits reordered?
A naive solution is to add an order column. A problem with that is expensive updates. Say I have an entity: 1000000 durian. I suddenly decide it's my favorite fruit and move it to the top. This causes 999999 fruits to require an order update.
The short answer is no, Datomic doesn't have this built in, and to be fair neither do many other databases.
You have the "order" column approach that you mentioned, which also has the problem you mentioned. Gaps aren't really the worst part, since you can still sort correctly with some gaps; it gets even worse if you want to insert an item in the middle, because then you have to update all of the following entities. And you should probably do it all in a transaction function unless you're certain your peer is single-threaded.
There's also the linked list approach, where each entity points to the next and the last doesn't point to anything. Appending, prepending and splicing in the middle become constant-time operations.
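As a rough sketch of that linked-list idea in plain Python (the names here are made up for illustration; in Datomic the "next" pointer would be a ref attribute on the entity):
# each fruit stores a pointer to the fruit that comes after it
order = {'grapes': 'apple', 'apple': 'tomato', 'tomato': 'banana', 'banana': None}
head = 'grapes'

def insert_after(order, existing, new_item):
    # splice the new item in: only two pointers change, regardless of list size
    order[new_item] = order[existing]
    order[existing] = new_item

insert_after(order, 'apple', 'durian')   # durian now sits between apple and tomato

# walk the list back out in order
item, result = head, []
while item is not None:
    result.append(item)
    item = order[item]
print(result)   # ['grapes', 'apple', 'durian', 'tomato', 'banana']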
There is not a built-in way to accomplish your goal in any database, whether PostgreSQL, Datomic, or anything else. However, there is an easy answer.
Just convert your proposed "priority" column from an integer to a floating-point value. Then you can always insert a new entry between any two existing items without needing to change any other rows. Suppose you start with
1.0 grape
2.0 apple
3.0 tomato
4.0 banana
and you then decide to add a pear between grape and apple. Just insert it as:
1.0 grape
1.5 pear
2.0 apple
3.0 tomato
4.0 banana
You then decide to insert cherry between grape and pear, so you get:
1.0 grape
1.25 cherry
1.5 pear
2.0 apple
3.0 tomato
4.0 banana
Then, whenever you want to examine your list you simply fetch both the priority column and the name column, sort by priority, and you are done.
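The "insert between two neighbours" step is just a midpoint calculation; here is a tiny illustrative sketch in plain Python (the function names are made up, not part of any library):
def priority_between(prev_priority, next_priority):
    # new item goes halfway between its two neighbours
    return (prev_priority + next_priority) / 2.0

def priority_at_end(last_priority):
    # appending just steps past the current maximum
    return last_priority + 1.0

pear = priority_between(1.0, 2.0)      # 1.5, between grape and apple
cherry = priority_between(1.0, pear)   # 1.25, between grape and pear
kiwi = priority_at_end(4.0)            # 5.0, after banana

# After a very large number of repeated midpoint inserts in the same spot the
# floats get crowded together, so an occasional renumbering pass (1.0, 2.0, 3.0, ...)
# keeps the priorities tidy.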

Multiple operations on columns with a single scan in SQL Server

I have a table with 200 columns (maybe more...)
a1 a2 a3 a4 a5 ...a200
---------------------------------
1.2 2.3 4.4 5.1 6.7... 11.9
7.2 2.3 4.3 5.1 4.7... 3.9
1.9 5.3 3.3 5.1 3.7... 8.9
5.2 2.7 7.4 9.1 1.7... 2.9
I would like to compute many operations:
SUM(every column)
AVG(every column)
SQRT(SUM(every column))
POWER(SUM(every column),2)
MIN(all columns)
MAX(all columns)
GREATEST(SUM(one column) vs SUM(other column))
Something like finding which sum is greatest for every pair of columns:
a1 vs a2, a1 vs a3, a1 vs a4, ..., a1 vs a200,
a2 vs a1, a2 vs a3, a2 vs a4, ..., a2 vs a200,
...
a200 vs a1, a200 vs a2, a200 vs a3, ..., a200 vs a199
If I did all of this in a single SELECT statement, for each column and each operation I'd have:
SELECT
    SUM(a1), ..., SUM(a200),
    AVG(a1), ..., AVG(a200),
    POWER(SUM(a1), 2), ..., POWER(SUM(a200), 2),
    GREATEST(SUM(a1), SUM(a2)), GREATEST(SUM(a1), SUM(a3)), ..., GREATEST(SUM(a1), SUM(a200)),
    GREATEST(SUM(a2), SUM(a1)), GREATEST(SUM(a2), SUM(a3)), ..., GREATEST(SUM(a2), SUM(a200)),
    GREATEST(SUM(a200), SUM(a1)), GREATEST(SUM(a200), SUM(a3)), ..., GREATEST(SUM(a200), SUM(a199)),
    etc.
FROM tabMultipleColumns
The problem here is when the query produces 1024 or more result columns.
Is there a way to keep doing massive operations on the data with a single scan of the table, i.e. avoiding multiple SELECT statements?
I am trying to use only one scan because, if the table is huge (many GBs in size), running many SELECT statements that each scan the same table would be expensive...
Could a tool like BCP be used, or what solution do you think is more efficient?
If you look only at SUM, POWER(SUM(), 2) and SQRT(SUM()), there are already 600 result columns... if I keep adding operations there are more than 1024...
That's a lot of calculations. I would probably just do a periodic dump of them into another table to minimize server load. It depends on how often the query is going to be used though.

Human Name lookup / translation

I am working on a requirement to match people from different databases. One tricky problem is variance in names like Bob - Robert, Jim - James, Lizzy - Elizabeth etc across databases.
Is there a lookup/translation table available for this kind of requirement?
Take a look at my answer (as well as the others) here:
Tools for matching name/address data
You'd need to implement a lookup table with the alternate names in it:
Base      | Alternate
----------|----------
Robert    | Bob
Elizabeth | Liz
Elizabeth | Lizzy
Elizabeth | Beth
Then search the database for the base name and all of its alternates. You'll end up with a number of candidate matches, which will then need to be checked to see whether they really match, based on a comparison of whatever other data you have in the two databases. Maybe the dates of the records in each database could be used; records entered close together in time may indicate the same person.
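As a rough illustration of the lookup-table idea (plain Python with a tiny, made-up nickname list; a real list would be much longer and would probably live in a database table):
# map alternate/nick names to a canonical base name
NICKNAMES = {
    'bob': 'robert',
    'bobby': 'robert',
    'jim': 'james',
    'jimmy': 'james',
    'liz': 'elizabeth',
    'lizzy': 'elizabeth',
    'beth': 'elizabeth',
}

def canonical(first_name):
    # return the base form of a first name, falling back to the name itself
    key = first_name.strip().lower()
    return NICKNAMES.get(key, key)

def names_match(a, b):
    # two records agree on first name if their canonical forms agree
    return canonical(a) == canonical(b)

print(names_match('Lizzy', 'Elizabeth'))  # True
print(names_match('Bob', 'James'))        # False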
