Database-like basic use - database

I would like to use R for basic database purposes with two data frames. The first data frame is a list of individuals with different features:
data = data.frame("individual"=c("Steve","Bob","Simon","Lisa"),
"feature1"=c(1,2,2,3),
"feature2"=c(3,4,1,NA))
The second data frame has the feature descriptions:
description = data.frame("feature"=c(1,2,3,4,NA),
"label"=c("foot","golf","curling","ski","No answer"))
My goal is to make a third data frame with the individuals' names followed by their feature descriptions:
Steve foot curling
Bob golf ski
and so on...

sqldf
Try this three-way join:
library(sqldf)
data[is.na(data)] <- "NA"
description[is.na(description)] <- "NA"
sqldf("select d1.individual, d2.label, d3.label
from data d1
left join description d2 on d1.feature1 = d2.feature
left join description d3 on d1.feature2 = d3.feature"
)
The output is:
individual label label
1 Simon golf foot
2 Steve foot curling
3 Bob golf ski
4 Lisa curling No answer
subscripting
This solution assumes we have run the two <- "NA" lines above.
labels <- with(description, setNames(label, feature))
with(data,
data.frame(individual, labels[feature1], labels[feature2], stringsAsFactors = FALSE)
)
which gives the output:
individual labels.feature1. labels.feature2.
3 Steve foot curling
4 Bob golf ski
1 Simon golf foot
NA Lisa curling No answer
REVISED:
Use left join.
Handle NAs as regular values.
Add second solution.

For this task, match can be used.
cbind(data[1], as.data.frame(lapply(data[-1], function(x)
description$label[match(x, description$feature)])))
individual feature1 feature2
1 Steve foot curling
2 Bob golf ski
3 Simon golf foot
4 Lisa curling No answer
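The lookup idea behind match can also be sketched outside R. The following is a minimal Python sketch, not part of the original answers: the list of dicts and the `label_features` helper are hypothetical stand-ins for the two data frames, and `None` plays the role of `NA`.

```python
# Hypothetical Python stand-ins for the two R data frames in the question.
data = [
    {"individual": "Steve", "feature1": 1, "feature2": 3},
    {"individual": "Bob", "feature1": 2, "feature2": 4},
    {"individual": "Simon", "feature1": 2, "feature2": 1},
    {"individual": "Lisa", "feature1": 3, "feature2": None},
]
# None stands in for NA, so a missing feature maps to "No answer".
description = {1: "foot", 2: "golf", 3: "curling", 4: "ski", None: "No answer"}


def label_features(rows, labels):
    """Replace every feature code with its label, keeping the row order."""
    result = []
    for row in rows:
        labelled_row = {"individual": row["individual"]}
        for key, value in row.items():
            if key != "individual":
                # dict.get returns None for a code that has no label.
                labelled_row[key] = labels.get(value)
        result.append(labelled_row)
    return result


labelled = label_features(data, description)
for row in labelled:
    print(row["individual"], row["feature1"], row["feature2"])
```

As with match, the rows keep their original order, which the join-based approaches do not guarantee.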

Just for fun, a third approach using plyr and reshape2:
require(reshape2)
require(plyr)
dcast(join(melt(data, id = "individual", value.name = "feature"), description),
individual ~ variable, value.var = "label")
individual feature1 feature2
1 Bob golf ski
2 Lisa curling No answer
3 Simon golf foot
4 Steve foot curling

Related

How to return an array where a column matches a value

I am trying to create the column _Type_ below, which returns the type value for a matching name AND a matching interval. I know I can use VLOOKUP for individual names, but let's say I have thousands of names and I can't specify an array for VLOOKUP for all of them. Cheers!!
Name  position  _Type_    Name  Range_From  Range_To  Type
bob   0         A         bob   0           30        A
bob   5         A         bob   30          100       B
bob   10        A         doug  0           40        C
bob   15        A         doug  40          200       A
bob   20        A
bob   30        B
bob   40        B
bob   80        B
doug  0         C
doug  20        C
doug  40        A
...
If you have dynamic array formulas, you can use FILTER():
=VLOOKUP(B2,FILTER(E:G,A2=D:D),3)
If not, then your data must be sorted on column D, then on column E:
=VLOOKUP(B2,INDEX(E:E,MATCH(A2,D:D,0)):INDEX(G:G,MATCH(A2,D:D,0)+COUNTIF(D:D,A2)-1),3)
This should be relatively quick, but it requires the data sorted.
You can go super old fashioned and get the row using SumProduct() and pass that into Index(). It's not going to be speedy though.
=INDEX($I$1:$I$5, SUMPRODUCT(($F$1:$F$5=A2)*($G$1:$G$5<=B2)*($H$1:$H$5>B2)*ROW($F$1:$F$5)), 1)
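The condition that the SUMPRODUCT formula encodes (same name, lower bound <= position, upper bound > position) can be sketched in Python as follows. This is only an illustration of the matching rule: the `ranges` list and the `lookup_type` function are hypothetical names, and the intervals are assumed to be half-open, as in the formula.

```python
# Hypothetical copy of the range table: (name, range_from, range_to, type).
ranges = [
    ("bob", 0, 30, "A"),
    ("bob", 30, 100, "B"),
    ("doug", 0, 40, "C"),
    ("doug", 40, 200, "A"),
]


def lookup_type(name, position, table):
    """Return the type of the first row whose name matches and whose
    half-open interval [range_from, range_to) contains position."""
    for row_name, range_from, range_to, type_ in table:
        if row_name == name and range_from <= position < range_to:
            return type_
    return None  # no matching name/interval


for name, position in [("bob", 20), ("bob", 30), ("doug", 20), ("doug", 40)]:
    print(name, position, lookup_type(name, position, ranges))
```

A linear scan like this has the same cost profile as the SUMPRODUCT approach: simple, needs no sorting, but is slow on large tables.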

Displaying differences with PowerPivot

Based on the following Data:
Location Salesperson Category SalesValue
North Bill Bikes 10
South Bill Bikes 90
South Bill Clothes 250
North Bill Accessories 20
South Bill Accessories 20
South Bob Bikes 200
South Bob Clothes 400
North Bob Accessories 40
I have the following Sales PivotTable in Excel 2016
Bill Bob
Bikes 100 200
Clothes 10 160
Accessories 40 40
I would now like to display the difference between Bill and Bob and, importantly, be able to sort the table by that difference. I have tried adding the Sales a second time and displaying it as a difference to "Bill". This gives me the correct values but sorts according to the underlying sales value and not the computed difference.
Bill Bob Difference
Bikes 100 200 100
Clothes 10 160 150
Accessories 40 40 0
I am fairly sure I need to use some form of DAX calculation but am having difficulty finding out exactly how. Can anyone give me a pointer?
Create a measure for that calculation:
If Bill and Bob are columns in your table:
Difference = ABS(TableName[Bill] - TableName[Bob])
If Bill and Bob are measures:
Difference = ABS([Bill] - [Bob])
UPDATE: Expression to only calculate difference between Bob and Bill.
Create a measure (in this case DifferenceBillAndBob) and use the following expression.
DifferenceBillAndBob =
ABS (
SUMX ( FILTER ( Sales, Sales[SalesPerson] = "Bob" ), [SalesValue] )
- SUMX ( FILTER ( Sales, Sales[SalesPerson] = "Bill" ), [SalesValue] )
)
It is not tested but should work.
Let me know if this helps.
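To make the intended calculation concrete, here is a minimal Python sketch of the same logic outside PowerPivot: sum the sales per category and salesperson, take the absolute difference, and sort by it. It is an illustration only, not DAX. The `sales` list copies the question's rows with the category spellings normalized (an assumption), and the variable names are hypothetical.

```python
from collections import defaultdict

# Rows from the question: (location, salesperson, category, sales value).
sales = [
    ("North", "Bill", "Bikes", 10),
    ("South", "Bill", "Bikes", 90),
    ("South", "Bill", "Clothes", 250),
    ("North", "Bill", "Accessories", 20),
    ("South", "Bill", "Accessories", 20),
    ("South", "Bob", "Bikes", 200),
    ("South", "Bob", "Clothes", 400),
    ("North", "Bob", "Accessories", 40),
]

# totals[category][salesperson] = summed sales value.
totals = defaultdict(lambda: defaultdict(int))
for _location, person, category, value in sales:
    totals[category][person] += value

# Absolute difference between Bob and Bill per category.
differences = {
    category: abs(per_person["Bob"] - per_person["Bill"])
    for category, per_person in totals.items()
}

# Sort the categories by the computed difference, largest first.
ranked = sorted(differences.items(), key=lambda item: item[1], reverse=True)
for category, difference in ranked:
    print(category, difference)
```

The key point is that the sort key is the computed difference itself, which is what a dedicated measure makes available to the PivotTable.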

Excel Address County Appending

I have a sheet of addresses which do not have the counties in the "add4" column.
I also have a separate sheet of town names and the corresponding counties.
I would like to append the county names to the "add4" column table where they match the town names using a formula.
Here are the tables:
Address Sheet (sheet1):
A B C D E
1 Name Add1 Add2 Add3 Add4
2 John Smith 1 High Street Little Town Chelmsford
3 Keith Jones 44 Tall House Bransby Lincoln
Town & County Sheet (sheet2):
A B
1 Town County
2 Chelmsford Essex
3 Lincoln Lincolnshire
In the example above the formula should return the following:
A B C D E
1 Name Add1 Add2 Add3 Add4
2 John Smith 1 High Street Little Town Chelmsford Essex
3 Keith Jones 44 Tall House Bransby Lincoln Lincolnshire
Please can you advise on the formula I should be using.
Thanks in advance.
You just need to use the VLOOKUP() function.
In cell E2 of your address sheet, enter the formula below (replace TownCountySheet with the actual name of the sheet that holds the towns and counties):
=vlookup(D2,'TownCountySheet'!A:B,2,false)

Trying to sort two columns of an array in R

Assume my data (df) looks something like this:
Rank Student Points Type
3 Liz 60 Junior
1 Sarah 100 Junior
10 John 40 Senior
2 Robert 70 Freshman
13 Jackie 33 Freshman
11 Stevie 35 Senior
I want to sort the data according to the Points, followed by Rank column in descending and ascending order, respectively, so that it looks like this:
Rank Student Points Type
1 Sarah 100 Junior
2 Robert 70 Freshman
3 Liz 60 Junior
10 John 40 Senior
11 Stevie 35 Senior
13 Jackie 33 Freshman
So I did this:
df[order(df[, "Points"], df[, "Rank"]), ]
Resulted in this:
Rank Student Points Type
1 Sarah 100 Junior
10 John 40 Senior
11 Stevie 35 Senior
13 Jackie 33 Freshman
2 Robert 70 Freshman
3 Liz 60 Junior
Question: How do I fix this?
I'm trying to use the column headers because the column length/width may change which can affect my sorting if I use physical locations.
FYI: I've tried so many suggestions and none seems to work:
one, two, three and four...
Try this:
df[order(-df$Points, df$Rank), ]
Rank Student Points Type
2 1 Sarah 100 Junior
4 2 Robert 70 Freshman
1 3 Liz 60 Junior
3 10 John 40 Senior
6 11 Stevie 35 Senior
5 13 Jackie 33 Freshman
going back to the basics :) http://www.statmethods.net/management/sorting.html
so, your code should be:
df <- df[order(-df$Points, df$Rank), ]
As m0nhawk pointed out in his comment, you probably have the data as strings. Strings are compared one character at a time, so "10" sorts before "2".
You need to convert them to numeric first. Also note that decreasing = TRUE applies to every sort key, so to get Points descending and Rank ascending, negate Points instead.
df[, "Rank"] <- as.numeric(df[, "Rank"])
df[, "Points"] <- as.numeric(df[, "Points"])
df[order(-df[, "Points"], df[, "Rank"]), ]
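If the data type is 'factor' this will not work, though, because as.numeric() on a factor returns its internal level codes. Convert through character first:
df[, "Rank"] <- as.numeric(as.character(df[, "Rank"]))
df[, "Points"] <- as.numeric(as.character(df[, "Points"]))
And then the ordering line above will work.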
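The string-versus-number issue is not specific to R. The following Python sketch, with a hypothetical `students` list holding the question's data as strings, shows why the ranks came out as 1, 10, 11, 13, 2, 3 and how a numeric sort key with a negated first component gives Points descending and Rank ascending.

```python
# Hypothetical copy of the question's data, with every value stored as a string.
students = [
    {"Rank": "3", "Student": "Liz", "Points": "60", "Type": "Junior"},
    {"Rank": "1", "Student": "Sarah", "Points": "100", "Type": "Junior"},
    {"Rank": "10", "Student": "John", "Points": "40", "Type": "Senior"},
    {"Rank": "2", "Student": "Robert", "Points": "70", "Type": "Freshman"},
    {"Rank": "13", "Student": "Jackie", "Points": "33", "Type": "Freshman"},
    {"Rank": "11", "Student": "Stevie", "Points": "35", "Type": "Senior"},
]

# Sorting the ranks as strings compares character by character.
string_order = sorted(s["Rank"] for s in students)
print(string_order)

# Convert to numbers; negating Points sorts it descending while Rank
# stays ascending as the tie-breaker.
by_points_then_rank = sorted(
    students, key=lambda s: (-int(s["Points"]), int(s["Rank"]))
)
for s in by_points_then_rank:
    print(s["Rank"], s["Student"], s["Points"], s["Type"])
```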

combinations of players for a team in C

I am trying to generate all combinations of players to make up a team of basketball players.
Let's say there are 5 positions (SG, PG, SF, PF, C) and I need to fill a roster with 9 players: 2 for each position except Center, for which there's only 1.
Let's say I have 10 players for each position. How can I generate a list of all possible combinations?
I would like to import the names from Excel in a CSV file, and then output all the combinations back to Excel in another CSV file.
I can figure out how to do the CSV importing and exporting, but I am more interested in the best algorithm for generating the combinations.
If it is easier to generate permutations, that is fine as well, as I can easily eliminate duplicates in Excel.
Thanks!
You could use an algorithmic technique called backtracking.
Or, depending on how many players you have, you could use brute force and just loop. For example, you could use the following to select all the combinations of 2 forwards and 1 center (this is a C++ sample just shown to illustrate the technique).
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector<string> centers;
    vector<string> forwards;
    centers.push_back("joey");
    centers.push_back("rick");
    centers.push_back("sam");
    forwards.push_back("steve");
    forwards.push_back("joe");
    forwards.push_back("harry");
    forwards.push_back("william");
    // One center, then every unordered pair of forwards.
    for (size_t i = 0; i < centers.size(); ++i) {
        for (size_t j = 0; j < forwards.size(); ++j) {
            // Starting k at j + 1 avoids repeating a pair in reverse order.
            for (size_t k = j + 1; k < forwards.size(); ++k) {
                printf("%s %s %s\n", centers[i].c_str(), forwards[j].c_str(), forwards[k].c_str());
            }
        }
    }
    return 0;
}
Output:
---------- Capture Output ----------
> "c:\windows\system32\cmd.exe" /c c:\temp\temp.exe
joey steve joe
joey steve harry
joey steve william
joey joe harry
joey joe william
joey harry william
rick steve joe
rick steve harry
rick steve william
rick joe harry
rick joe william
rick harry william
sam steve joe
sam steve harry
sam steve william
sam joe harry
sam joe william
sam harry william
> Terminated with exit code 0.
However, it's important to remember that if you have a lot of players, anything you do that's "brute force", which includes backtracking (backtracking is the same idea as the loops I used above, only it uses recursion), is going to grow exponentially in running time as the roster size increases. So, for example, for a 5-man roster, if you have 10 centers, 20 forwards, and 18 guards, then the running time is basically:
10 * 20 * 20 * 18 * 18 = 1,296,000
(20 * 20 because we need 2 forwards, and 18 * 18 because we need 2 guards.)
1,296,000 is not too bad for running time, but when you start talking about 9-man rosters, you get much higher running times, because now you're dealing with a lot more combinations.
So it depends on how much data you have as to whether this is even feasible.
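To judge feasibility before running anything, the number of rosters can be counted directly. The sketch below is a hypothetical Python illustration, not part of the original answer: `count_rosters` multiplies binomial coefficients per position, and `generate_rosters` enumerates the rosters lazily. Because each unordered pair is visited only once (as with the `k = j + 1` loop), the exact count is smaller than the 20 * 20 style estimate.

```python
from itertools import combinations, product
from math import comb


def count_rosters(pool_sizes, slots):
    """Number of rosters: product over positions of C(pool size, slots)."""
    total = 1
    for position, size in pool_sizes.items():
        total *= comb(size, slots[position])
    return total


def generate_rosters(pools, slots):
    """Yield every roster as a flat tuple of player names."""
    per_position = [combinations(pools[pos], slots[pos]) for pos in pools]
    for choice in product(*per_position):
        yield tuple(player for group in choice for player in group)


# The 5-player example: 10 centers, 20 forwards, 18 guards.
five_man = count_rosters({"C": 10, "F": 20, "G": 18}, {"C": 1, "F": 2, "G": 2})

# The 9-player roster from the question with 10 players per position.
nine_man = count_rosters(
    {"SG": 10, "PG": 10, "SF": 10, "PF": 10, "C": 10},
    {"SG": 2, "PG": 2, "SF": 2, "PF": 2, "C": 1},
)
print(five_man, nine_man)

# The small centers/forwards example from the C++ sample.
pools = {"C": ["joey", "rick", "sam"], "F": ["steve", "joe", "harry", "william"]}
lineups = list(generate_rosters(pools, {"C": 1, "F": 2}))
print(len(lineups), lineups[0])
```

Since `generate_rosters` is a generator, rosters can be written to a CSV file one at a time instead of being held in memory all at once.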

Resources