String array combination in R

I'm just starting to learn R, and even after searching many forums for this topic I couldn't find a good answer. Maybe I'm not searching with the right terms, or maybe it isn't possible in R, so please excuse my ignorance.
I would like to find out how many times two professionals participate in the same project. In addition, I would like to record what their positions are when they are found together.
I'm not using any specific notation below. For example, assume I have the following string arrays:
Project1: Bob (President), Joe (Vice President), Mary (Participant), Paul (Participant)
Project2: Bob (President), Joe (Vice President), Sue (Participant), Bill (Participant)
Project3: Paul (President), Sue (Vice President), Bob (Participant), Joe (Participant)
Project'n: (...)
The output would be:
Bob (President) & Joe (Vice President) = 2
Bob (President) & Mary (Participant) = 1
Bob (President) & Paul (Participant) = 1
Bob (Participant) & Paul (President) = 1
Sue (Vice President) & Joe (Participant) = 1
And so on. I assume these results could be aggregated in a histogram. I have 86 names, participating in 38 different projects, in 3 different possible positions.
Is this possible in R? How could it be accomplished? Are there any code templates or documentation that I could use to get to this answer?
## MY ATTEMPT (START)
Groups <- data.frame (Name=c('Paul','Paul','Paul','Bob','Bob','Sue','Bill'),Group=c('P1','P2','P3','P1','P2','P3','P3'),Role=c('President','President','President','Vice President','Vice President','Participant','Participant'))
Table <- table (Groups)
When I print 'Table', it shows this output:
, , Role = Participant

      Group
Name   P1 P2 P3
  Bill  0  0  1
  Bob   0  0  0
  Paul  0  0  0
  Sue   0  0  1

, , Role = President

      Group
Name   P1 P2 P3
  Bill  0  0  0
  Bob   0  0  0
  Paul  1  1  1
  Sue   0  0  0

, , Role = Vice President

      Group
Name   P1 P2 P3
  Bill  0  0  0
  Bob   1  1  0
  Paul  0  0  0
  Sue   0  0  0
Now - for instance - in project "P1" we can see Paul as President and Bob as Vice President. Same happens in project "P2". In "P3", we have Paul as President plus Sue and Bill both as Participants.
My question now is how to count the occurrences of a given relationship across all the projects. Something like:
Paul/President & Bob/Vice = 2 occurrences,
Paul/President & Sue/Participant = 1 occurrence,
Paul/President & Bill/Participant = 1 occurrence, etc
Basically a 'hist' based on the occurrences of each particular person/role combination.
## MY ATTEMPT (END)

Now that you have your Table, you can count the occurrence of different types of relationships using apply over different sets of axes:
How many occurrences of different types of participants are there for each project?
> apply(Table, c(2,3), sum)
     Role
Group Participant President Vice President
   P1           0         1              1
   P2           0         1              1
   P3           2         1              0
How many occurrences of Person-Role combinations?
> apply(Table, c(1,3), sum)
      Role
Name   Participant President Vice President
  Bill           1         0              0
  Bob            0         0              2
  Paul           0         3              0
  Sue            1         0              0
Which projects is each person working in?
> apply(Table, c(1,2), sum)
      Group
Name   P1 P2 P3
  Bill  0  0  1
  Bob   1  1  0
  Paul  1  1  1
  Sue   0  0  1
How many projects is each person working on?
> apply(Table, 1, sum)
Bill  Bob Paul  Sue
   1    2    3    1
How many people are involved in each project?
> apply(Table, 2, sum)
P1 P2 P3
 2  2  3
How many people belong to each role?
> apply(Table, 3, sum)
   Participant      President Vice President
             2              3              2
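
The apply calls above give per-axis totals. The pairwise counts asked for in the question need one more step: list every pair of (name, role) members inside each project and tally identical pairs across projects. Below is a minimal sketch of that idea, written in Python rather than R purely for illustration. It uses the three example projects from the question; the function name count_pairs is made up for this example.

```python
from collections import Counter
from itertools import combinations

# Example data from the question: project -> list of (name, role).
projects = {
    "Project1": [("Bob", "President"), ("Joe", "Vice President"),
                 ("Mary", "Participant"), ("Paul", "Participant")],
    "Project2": [("Bob", "President"), ("Joe", "Vice President"),
                 ("Sue", "Participant"), ("Bill", "Participant")],
    "Project3": [("Paul", "President"), ("Sue", "Vice President"),
                 ("Bob", "Participant"), ("Joe", "Participant")],
}

def count_pairs(projects):
    """Count how often each pair of (name, role) entries shares a project."""
    counts = Counter()
    for members in projects.values():
        # Sorting first gives every pair one canonical order,
        # so (A, B) and (B, A) end up under the same key.
        for pair in combinations(sorted(members), 2):
            counts[pair] += 1
    return counts

pair_counts = count_pairs(projects)
for (a, b), n in pair_counts.most_common():
    print(f"{a[0]} ({a[1]}) & {b[0]} ({b[1]}) = {n}")
```

The resulting counter can then be plotted as a bar chart. With 86 names and 3 roles most pairs will never occur, so counting only the pairs that are actually observed keeps the result small.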

Thanks @ScottRitchie for your tips. After some additional reading and testing, I came up with the following:
I imported a csv file with columns containing the name, project and role. I also added another column at the end to act as a counter (with a constant value of 1 in every row).
I did:
Groupings <- read.csv("~/Documents/TCC_BIGDATA/Test.csv", sep=";")
Groupings$Counter <- as.integer(Groupings$Counter)
print(Groupings)
   Project Name           Role Counter
1       P1 Paul      President       1
2       P1  Bob Vice President       1
3       P1  Sue    Participant       1
4       P1 Bill    Participant       1
5       P2 Paul Vice President       1
6       P2  Bob    Participant       1
7       P2 Bill      President       1
8       P3  Bob      President       1
9       P3 Bill Vice President       1
10      P3  Sue    Participant       1
How many times does each name appear in the list?
aggregate(Counter ~ Name, data = Groupings, sum)
  Name Counter
1 Bill       3
2  Bob       3
3 Paul       2
4  Sue       2
How many times does each Name+Role combination appear in the list?
aggregate(Counter ~ Name + Role, data = Groupings, sum)
  Name           Role Counter
1 Bill    Participant       1
2  Bob    Participant       1
3  Sue    Participant       2
4 Bill      President       1
5  Bob      President       1
6 Paul      President       1
7 Bill Vice President       1
8  Bob Vice President       1
9 Paul Vice President       1
Other exercises and combinations can be built the same way. In the end, it is just another way to achieve what you (@ScottRitchie) built to answer my question. I thought it would be a good idea to share it so that others could apply it.
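
The constant Counter column is only there to give aggregate something to sum; the same tallies come from counting rows directly. Here is a small sketch of that idea in Python (not R), using the ten rows printed above. The variable names are chosen for this example only.

```python
from collections import Counter

# The ten (Project, Name, Role) rows from the printed data frame above.
rows = [
    ("P1", "Paul", "President"), ("P1", "Bob", "Vice President"),
    ("P1", "Sue", "Participant"), ("P1", "Bill", "Participant"),
    ("P2", "Paul", "Vice President"), ("P2", "Bob", "Participant"),
    ("P2", "Bill", "President"), ("P3", "Bob", "President"),
    ("P3", "Bill", "Vice President"), ("P3", "Sue", "Participant"),
]

# Counting rows per key replaces summing a column of ones.
name_counts = Counter(name for _, name, _ in rows)
name_role_counts = Counter((name, role) for _, name, role in rows)

for name in sorted(name_counts):
    print(name, name_counts[name])
```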

Related

Extract items from Column in a Dataframe

country_name  rank  show_title
Argentina     2     The Queen of Flow
India         1     Cobra Kai
Argentina     1     The Queen of Flow
England       3     Stay Close
Argentina     1     The Queen of Flow
I am trying to get a table that displays the number of times each show title is ranked 1st, 2nd or 3rd. The result should look something like this:
Rank  Cobra Kai  The Queen of Flow  Stay Close
1     1          2                  0
2     0          1                  1
3     0          0                  0

You can use pivot_table like this.
df.pivot_table(index=['rank'], columns=['show_title'], aggfunc='count', fill_value=0)
Result
           country_name
show_title    Cobra Kai Stay Close The Queen of Flow
rank
1                     1          0                 2
2                     0          0                 1
3                     0          1                 0
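
To make explicit what the pivot computes, here is a sketch of the same cross-tabulation using only the Python standard library instead of pandas. It tallies each (rank, show_title) pair from the five sample rows and fills missing combinations with 0. The variable names are illustrative.

```python
from collections import Counter

# Sample rows from the question: (country_name, rank, show_title).
rows = [
    ("Argentina", 2, "The Queen of Flow"),
    ("India", 1, "Cobra Kai"),
    ("Argentina", 1, "The Queen of Flow"),
    ("England", 3, "Stay Close"),
    ("Argentina", 1, "The Queen of Flow"),
]

counts = Counter((rank, title) for _, rank, title in rows)
ranks = sorted({rank for _, rank, _ in rows})
titles = sorted({title for _, _, title in rows})

# A Counter returns 0 for unseen keys, which plays the role of fill_value=0.
table = {rank: {title: counts[(rank, title)] for title in titles}
         for rank in ranks}

for rank in ranks:
    print(rank, [table[rank][title] for title in titles])
```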

How to aggregate number of notes sent to each user?

Consider the following tables
group (obj_id here is user_id)
group_id   obj_id   role
--------------------------
100        1        A
100        2        root
100        3        B
100        4        C
notes
obj_id   ref_obj_id   note     note_id
-------------------------------------------
1        2                     10
1        3                     10
1        0            foobar   10
1        4                     20
1        2                     20
1        0            barbaz   20
2        0            caszes   30
2        1                     30
4        1                     70
4        0            taz      70
4        3                     70
Note: a note in the system can be assigned to multiple users (for instance: an admin could write "sent warning to 2 users" and link it to 2 user_ids). The first user the note gets linked to is stored differently than the other linked users. The note itself is linked to the first linked user only. Whenever group.obj_id = notes.obj_id then ref_obj_id = 0 and note <> null
I need to make an overview of the notes per user. Normally I would do this by joining on group.obj_id = notes.ref_obj_id, but here this goes wrong because ref_obj_id is 0 for the primary user (in which case I should join on notes.obj_id instead).
There are 4 notes in this system (foobar, barbaz, caszes and taz).
The desired output is:
obj_id   user_is_primary   notes_primary   user_is_linked   notes_linked
-------------------------------------------------------------------------
1        2                 10;20           2                30;70
2        1                 30              2                10;20
3        0                                 2                10;70
4        1                 70              1                20
How can I get to this aggregated result?
I hope that I was able to explain the situation clearly; perhaps it is my inexperience but I find the data model not the most straightforward.

Couldn't you simply put this in the ON clause of your join?
case when notes.ref_obj_id = 0 then notes.obj_id else notes.ref_obj_id end = group.obj_id
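
As a rough check of that suggestion, the sketch below loads the sample tables into an in-memory SQLite database through Python's sqlite3 module and uses the CASE expression as the join condition. The choice of SQLite is an assumption made only for this example. GROUP_CONCAT is SQLite's string aggregate and other databases name it differently, and the order of the concatenated ids is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "group" (group_id INTEGER, obj_id INTEGER, role TEXT);
CREATE TABLE notes (obj_id INTEGER, ref_obj_id INTEGER, note TEXT, note_id INTEGER);
""")
conn.executemany('INSERT INTO "group" VALUES (?, ?, ?)',
                 [(100, 1, "A"), (100, 2, "root"), (100, 3, "B"), (100, 4, "C")])
conn.executemany("INSERT INTO notes VALUES (?, ?, ?, ?)", [
    (1, 2, None, 10), (1, 3, None, 10), (1, 0, "foobar", 10),
    (1, 4, None, 20), (1, 2, None, 20), (1, 0, "barbaz", 20),
    (2, 0, "caszes", 30), (2, 1, None, 30),
    (4, 1, None, 70), (4, 0, "taz", 70), (4, 3, None, 70),
])

# ref_obj_id = 0 marks the primary user's row; otherwise ref_obj_id is the linked user.
query = """
SELECT g.obj_id,
       COUNT(CASE WHEN n.ref_obj_id = 0 THEN 1 END) AS user_is_primary,
       GROUP_CONCAT(CASE WHEN n.ref_obj_id = 0 THEN n.note_id END, ';') AS notes_primary,
       COUNT(CASE WHEN n.ref_obj_id <> 0 THEN 1 END) AS user_is_linked,
       GROUP_CONCAT(CASE WHEN n.ref_obj_id <> 0 THEN n.note_id END, ';') AS notes_linked
FROM "group" AS g
LEFT JOIN notes AS n
  ON CASE WHEN n.ref_obj_id = 0 THEN n.obj_id ELSE n.ref_obj_id END = g.obj_id
GROUP BY g.obj_id
ORDER BY g.obj_id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The LEFT JOIN keeps users who have no notes at all, and the conditional COUNT and GROUP_CONCAT split each user's rows into the primary and linked columns of the desired output.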

Crystal Report - Display Count of differents Details on same line

I have a question about Crystal Reports. I have a query that returns a list of names that have free slots in the morning and free slots in the afternoon. A person can have one or several free slots in the morning or afternoon. The query returns something like this:
NAME | HALFDAY
Jean | 1
Jean | 1
Jean | 2
Martin | 2
Martin | 2
Martin | 2
Francois | 1
Francois | 1
Francois | 1
1 is for the morning, 2 is for the afternoon.
So, Jean has 2 free slots in the morning and one in the afternoon.
Martin has 3 free slots in the afternoon.
I would like my report to look like this:
Jean 2 / 1
Martin 0 / 3
Francois 3 / 0
But I don't know how to do that. Any ideas, please?
Thanks a lot!

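First, group the report by name.
Create one formula per half-day that returns 1 for a matching row and 0 otherwise (Table, @Morning and @Afternoon below are placeholder names):
If {Table.HALFDAY} = 1
Then 1
Else 0
If {Table.HALFDAY} = 2
Then 1
Else 0
Now summarize each formula in the group footer with Sum (a Count would count every row in the group, including the zeros).
Create a separate formula in the group footer
containing the code below:
{Table.NAME} & " " & ToText(Sum({@Morning}, {Table.NAME}), 0) & " / " & ToText(Sum({@Afternoon}, {Table.NAME}), 0) // and so on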
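
The logic behind those formulas is a conditional count per name. The following sketch shows that logic in Python, outside Crystal Reports, on the sample rows from the question. It only illustrates the grouping and is not report code.

```python
# Sample (NAME, HALFDAY) rows from the question; 1 = morning, 2 = afternoon.
rows = [
    ("Jean", 1), ("Jean", 1), ("Jean", 2),
    ("Martin", 2), ("Martin", 2), ("Martin", 2),
    ("Francois", 1), ("Francois", 1), ("Francois", 1),
]

free_slots = {}  # name -> [morning_count, afternoon_count]
for name, halfday in rows:
    slots = free_slots.setdefault(name, [0, 0])
    slots[halfday - 1] += 1  # index 0 holds mornings, index 1 afternoons

# Dicts keep insertion order, so names appear as they did in the query result.
lines = [f"{name} {morning} / {afternoon}"
         for name, (morning, afternoon) in free_slots.items()]
print("\n".join(lines))
```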

Clustering Coefficient using SQL Server/C#

I have two tables in SQL Server i.e.
one table is GraphNodes as:
---------------------------------------------------------
id | Node_ID | Node             | Node_Label | Node_Type
---------------------------------------------------------
1  | 677     | Nuno Vasconcelos | Author     | 1
2  | 1359    | Peng Shi         | Author     | 1
3  | 6242    | Z. Q. Shi        | Author     | 1
4  | 8318    | Kiyoung Choi     | Author     | 1
5  | 12405   | Johan A. K.      | Author     | 1
6  | 26615   | Tzung-Pei Hong   | Author     | 1
7  | 30559   | Luca Benini      | Author     | 1
...
...
and other table is GraphEdges as:
-----------------------------------------------------------------------------------------
id | Source_Node | Source_Node_Type | Target_Node | Target_Node_Type | Year | Edge_Type
-----------------------------------------------------------------------------------------
1  | 1           | 1                | 10965       | 2                | 2005 | 1
2  | 1           | 1                | 10179       | 2                | 2007 | 1
3  | 1           | 1                | 10965       | 2                | 2007 | 1
4  | 1           | 1                | 19741       | 2                | 2007 | 1
5  | 1           | 1                | 10965       | 2                | 2009 | 1
6  | 1           | 1                | 4816        | 2                | 2011 | 1
7  | 1           | 1                | 5155        | 2                | 2011 | 1
...
...
I also have two more tables, i.e. GraphNodeTypes as:
-------------------------
id | Node     | Node_Type
-------------------------
1  | Author   | 1
2  | CoAuthor | 2
3  | Venue    | 3
4  | Paper    | 4
and GraphEdgeTypes as:
-------------------------------
id | Edge           | Edge_Type
-------------------------------
1  | AuthorCoAuthor | 1
2  | CoAuthorVenue  | 2
3  | AuthorVenue    | 3
4  | PaperVenue     | 4
5  | AuthorPaper    | 5
6  | CoAuthorPaper  | 6
Now I want to calculate the clustering coefficient for this graph, of which there are two types:
If N(V) is the number of links between the neighbors of node V and K(V) is the degree of node V, then
Local Clustering Coefficient(V) = 2 * N(V) / (K(V) * (K(V) - 1))
and
Global Clustering Coefficient = 3 * (number of triangles) / (number of connected triplets of vertices)
The question is: how can I calculate the degree of a node? Is it possible in SQL Server, or is C# programming required? Please also suggest hints for calculating the local and global CCs.
Thanks!

The degree of a node is not "calculated". It's simply the number of edges this node has.
While you can try to do this in SQL, the performance will likely be mediocre. This type of analysis is commonly done in specialized databases and, if possible, in memory.

Compute the degree of each vertex as the number of edges connected to it. Using COUNT(source_node) with GROUP BY source_node will be helpful in this case.
To find N(V), you can join the edge table with itself and then take the intersection between the resulting table and the edge table. From the result, take the COUNT() for each vertex.
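
For an in-memory version of the same computation, here is a minimal Python sketch of the two formulas from the question. It runs on a small made-up undirected edge list (not the GraphEdges data) and treats every edge as undirected, which is an assumption. The function names are invented for this example.

```python
from itertools import combinations

# Hypothetical undirected graph: a triangle 1-2-3 plus a pendant vertex 4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]

def build_adjacency(edges):
    """Map each vertex to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def neighbor_links(adj, v):
    """N(V): number of edges between the neighbors of v."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])

def local_cc(adj, v):
    k = len(adj[v])  # K(V): the degree of v
    if k < 2:
        return 0.0   # convention chosen here for vertices with degree below 2
    return 2 * neighbor_links(adj, v) / (k * (k - 1))

def global_cc(adj):
    # Summing N(V) over all vertices counts every triangle three times.
    closed = sum(neighbor_links(adj, v) for v in adj)
    triplets = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return closed / triplets if triplets else 0.0

adj = build_adjacency(edges)
degrees = {v: len(n) for v, n in adj.items()}
print(degrees, local_cc(adj, 3), global_cc(adj))
```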

Comparisons across multiple rows in Stata (household dataset)

I'm working on a household dataset and my data looks like this:
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
What I want to do is identify the mother in each family. A mother is a member of the family whose id is equal to one of the mother_id's of another family member. In the example above, for the family with id_family=3, individual 5 has mother_id=4, which makes individual 4 her mother.
I create a family size variable that tells me how many members there are per family. I also create a rank variable for each member within a family. For families of three, I then have the following piece of code that works:
bysort id_family: gen family_size=_N
bysort id_family: gen rank=_n
gen mother=.
bysort id_family: replace mother=1 if male==0 & rank==1 & family_size==3 & (id[_n]==mother_id[_n+1] | id[_n]==mother_id[_n+2])
bysort id_family: replace mother=1 if male==0 & rank==2 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n+1])
bysort id_family: replace mother=1 if male==0 & rank==3 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n-2])
What I get is:
id   id_family   mother_id   male   family_size   rank   mother
1    2           12          0      2             1      .
2    2           13          1      2             2      .
3    3           15          1      3             1      .
4    3           17          0      3             2      1
5    3           4           0      3             3      .
However, in my real data set, I have to get the mother for families of size 4 and higher (up to 9), which makes this procedure very inefficient (in the sense that there are too many row elements to compare "manually").
How would you obtain this in a cleaner way? Would you make use of permutations to index the rows? Or would you use a for-loop?

Here's an approach using merge.
// create sample data
clear
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
save families, replace
clear
// do the job
use families
drop id male
rename mother_id id
sort id_family id
duplicates drop
list, clean abbreviate(10)
save mothers, replace
use families, clear
merge 1:1 id_family id using mothers, keep(master match)
generate byte is_mother = _merge==3
list, clean abbreviate(10)
The second list yields
       id   id_family   mother_id   male            _merge   is_mother
  1.    1           2          12      0   master only (1)           0
  2.    2           2          13      1   master only (1)           0
  3.    3           3          15      1   master only (1)           0
  4.    4           3          17      0           matched (3)       1
  5.    5           3           4      0   master only (1)           0
where I retained _merge only for expositional purposes.
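
The merge boils down to a membership test: a person is a mother if their (id_family, id) pair appears among the (id_family, mother_id) pairs of the same data. Here is a sketch of that logic in Python (not Stata), using the five sample rows. Like the merge above, it does not check the male variable.

```python
# Sample rows from the question.
people = [
    {"id": 1, "id_family": 2, "mother_id": 12, "male": 0},
    {"id": 2, "id_family": 2, "mother_id": 13, "male": 1},
    {"id": 3, "id_family": 3, "mother_id": 15, "male": 1},
    {"id": 4, "id_family": 3, "mother_id": 17, "male": 0},
    {"id": 5, "id_family": 3, "mother_id": 4, "male": 0},
]

# Equivalent of the deduplicated "mothers" file: (family, mother id) pairs.
mother_keys = {(p["id_family"], p["mother_id"]) for p in people}

# Equivalent of the merge: flag people whose own (family, id) is a mother key.
for p in people:
    p["is_mother"] = int((p["id_family"], p["id"]) in mother_keys)

flags = [p["is_mother"] for p in people]
print(flags)
```

Because the lookup is keyed on the family as well as the id, it works for any family size without comparing rows by position.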
