Clustering Coefficient using SQL Server/C#

I have two tables in SQL Server. One table is GraphNodes:
---------------------------------------------------------
id | Node_ID | Node             | Node_Label | Node_Type
---------------------------------------------------------
1  | 677     | Nuno Vasconcelos | Author     | 1
2  | 1359    | Peng Shi         | Author     | 1
3  | 6242    | Z. Q. Shi        | Author     | 1
4  | 8318    | Kiyoung Choi     | Author     | 1
5  | 12405   | Johan A. K.      | Author     | 1
6  | 26615   | Tzung-Pei Hong   | Author     | 1
7  | 30559   | Luca Benini      | Author     | 1
...
and the other table is GraphEdges:
-----------------------------------------------------------------------------------------
id | Source_Node | Source_Node_Type | Target_Node | Target_Node_Type | Year | Edge_Type
-----------------------------------------------------------------------------------------
1  | 1           | 1                | 10965       | 2                | 2005 | 1
2  | 1           | 1                | 10179       | 2                | 2007 | 1
3  | 1           | 1                | 10965       | 2                | 2007 | 1
4  | 1           | 1                | 19741       | 2                | 2007 | 1
5  | 1           | 1                | 10965       | 2                | 2009 | 1
6  | 1           | 1                | 4816        | 2                | 2011 | 1
7  | 1           | 1                | 5155        | 2                | 2011 | 1
...
I also have two lookup tables. GraphNodeTypes:
-------------------------
id | Node     | Node_Type
-------------------------
1  | Author   | 1
2  | CoAuthor | 2
3  | Venue    | 3
4  | Paper    | 4
and GraphEdgeTypes:
-------------------------------
id | Edge           | Edge_Type
-------------------------------
1  | AuthorCoAuthor | 1
2  | CoAuthorVenue  | 2
3  | AuthorVenue    | 3
4  | PaperVenue     | 4
5  | AuthorPaper    | 5
6  | CoAuthorPaper  | 6
Now I want to calculate the clustering coefficient for this graph, which comes in two types.
If N(V) is the number of links between the neighbors of node V and K(V) is the degree of node V, then:
Local Clustering Coefficient(V) = 2 * N(V) / (K(V) * (K(V) - 1))
and
Global Clustering Coefficient = 3 * (# of triangles) / (# of connected triplets)
The question is: how can I calculate the degree of a node? Is that possible in SQL Server, or is C# programming required? Please also suggest hints for calculating the local and global clustering coefficients.
Thanks!

The degree of a node is not "calculated". It's simply the number of edges the node has.
While you can try to do this in SQL, the performance will likely be mediocre. This type of analysis is commonly done in specialized graph databases and, if possible, in memory.

Count the degree of each vertex as the number of edges connected to it. COUNT combined with GROUP BY on the node column will be helpful in this case.
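For instance, a minimal T-SQL sketch, assuming the graph is undirected and each edge is stored exactly once in GraphEdges (node/edge type filters omitted for brevity):
-- K(V): count every edge incident to a node, whether the node
-- appears as Source_Node or as Target_Node.
SELECT Node, COUNT(*) AS Degree
FROM (
    SELECT Source_Node AS Node FROM GraphEdges
    UNION ALL
    SELECT Target_Node FROM GraphEdges
) AS Incident
GROUP BY Node;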
To find N(V), you can join the edge table with itself to enumerate pairs of neighbors of each vertex, then intersect the result with the edge table to keep only the neighbor pairs that are actually linked. From the result, take the COUNT() for each vertex.
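A sketch of that self-join under the same assumptions (the CTE name Neighbors is mine):
WITH Neighbors AS (
    -- list each edge in both directions so every node sees all its neighbors;
    -- UNION (not UNION ALL) also dedupes edges stored in both directions
    SELECT Source_Node AS A, Target_Node AS B FROM GraphEdges
    UNION
    SELECT Target_Node, Source_Node FROM GraphEdges
)
SELECT n1.A AS Node, COUNT(*) AS N_V
FROM Neighbors AS n1
JOIN Neighbors AS n2                 -- two neighbors of the same node ...
  ON n1.A = n2.A AND n1.B < n2.B     -- ... each unordered pair taken once
JOIN Neighbors AS e                  -- keep pairs that are themselves linked
  ON e.A = n1.B AND e.B = n2.B
GROUP BY n1.A;
Joining this to the degree query gives Local CC(V) = 2.0 * N_V / (Degree * (Degree - 1)). For the global coefficient, note that SUM(N_V) over all nodes equals 3 * (# of triangles), and the number of connected triplets is SUM(Degree * (Degree - 1) / 2).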

Related

How to track changes in the board of directors within the same firm

In Stata I need to create a new variable "changes in the board of directors" which indicates whether the same directors are observed in the same firm over time. Consider an example below:
clear
input dirid firmid year
1 10 2006
2 10 2006
3 10 2006
1 10 2007
2 10 2007
3 10 2007
1 10 2008
2 10 2008
3 10 2008
4 10 2008
3 10 2009
4 10 2009
end
Directors 1, 2, and 3 are in firm 10 in both 2006 and 2007, so there was no change in the board of directors from t-1 to t, and the variable "changes in the board of directors" should be 0. However, in 2008 a new director (dirid = 4) came to the board, so there was a change and the variable should be 1. The same holds in 2009, because dirid 1 and 2 left the company. So any change, whether the entrance or exit of directors, should be flagged with 1 in the new binary variable.
Here's another way to do it. I think it should cope with directors leaving and later coming back.
clear
input dirid firmid year
1 10 2006
2 10 2006
3 10 2006
1 10 2007
2 10 2007
3 10 2007
1 10 2008
2 10 2008
3 10 2008
4 10 2008
3 10 2009
4 10 2009
end
* build one string per firm-year holding the sorted director ids (the board)
bysort firmid year (dirid) : gen board = strofreal(dirid) if _n == 1
by firmid year : replace board = board[_n-1] + " " + strofreal(dirid) if _n > 1
by firmid year : replace board = board[_N]
* flag observations where the board differs from the previous year's board
by firmid : gen anychange = year != year[_n-1] & board != board[_n-1]
* spread the flag to every observation in the firm-year
bysort firmid year (anychange) : replace anychange = anychange[_N]
sort firmid year dirid
list, sepby(firmid year)
+--------------------------------------------+
| dirid firmid year board anycha~e |
|--------------------------------------------|
1. | 1 10 2006 1 2 3 1 |
2. | 2 10 2006 1 2 3 1 |
3. | 3 10 2006 1 2 3 1 |
|--------------------------------------------|
4. | 1 10 2007 1 2 3 0 |
5. | 2 10 2007 1 2 3 0 |
6. | 3 10 2007 1 2 3 0 |
|--------------------------------------------|
7. | 1 10 2008 1 2 3 4 1 |
8. | 2 10 2008 1 2 3 4 1 |
9. | 3 10 2008 1 2 3 4 1 |
10. | 4 10 2008 1 2 3 4 1 |
|--------------------------------------------|
11. | 3 10 2009 3 4 1 |
12. | 4 10 2009 3 4 1 |
+--------------------------------------------+
See also this paper on concatenating rowwise: https://journals.sagepub.com/doi/full/10.1177/1536867X20909698
Here is an alternative that reshapes wide, concatenates the director ids with egen's concat(), and compares boards across years:
clear
input dirid firmid year
1 10 2006
2 10 2006
3 10 2006
1 10 2007
2 10 2007
3 10 2007
1 10 2008
2 10 2008
3 10 2008
4 10 2008
3 10 2009
4 10 2009
end
bysort firmid year (dirid): gen n = _n
reshape wide n, i(firmid year) j(dirid)
egen all_directors = concat(n*)
bysort firmid (year): gen change = all_directors != all_directors[_n-1] & _n > 1
reshape long
drop if missing(n)
drop all_directors n

How to query different columns and stack into a new table in Google Sheets

Hi, I have columns like so, where every row is auto-filled. Columns B-D are from source a, columns E-G from source b, and columns H-J from source c.
sheet data:

   |  A  |     B      |   C   |  D   |     E      |   F   |  G   |     H      |   I   |  J
 1 |     | Date       | Name  | Cost | Date       | Name  | Cost | Date       | Name  | Cost
 2 |     | 2022-01-02 | Alan  | 5    | 2022-01-03 | James | 6    | 2022-01-02 | Timmy | 5
 3 |     | 2022-01-02 | Hana  | 5    | 2022-01-03 | Paul  | 6    | 2022-01-02 | Jane  | 5
into a summary sheet:

   |  A  |     B      |   C   |  D   |   E
 1 |     | Date       | Name  | Cost | Source
 2 |     | 2022-01-02 | Alan  | 5    | sourceA
 3 |     | 2022-01-02 | Hana  | 5    | sourceA
 4 |     | 2022-01-03 | James | 6    | sourceB
 5 |     | 2022-01-03 | Paul  | 6    | sourceB
 6 |     | 2022-01-02 | Timmy | 5    | sourceC
 7 |     | 2022-01-02 | Jane  | 5    | sourceC
How do I achieve this with a QUERY formula, stacking the sources on top of one another? The Source column would use IF, but then how do you detect the last row of each source to use in the IF? The number of rows per source might differ.
This is an array: {}. Inside it you can use a comma , to put elements next to each other, or a semicolon ; to put one element under another. E.g. having:
={1,2;3,4}
will yield:
      |   A   |   B
------+-------+------
   1  |   1   |   2
------+-------+------
   2  |   3   |   4
In that manner you can do:
={QUERY(B:D);
QUERY(E:G);
QUERY(H:J)}
Side note: if your locale is non-English, then the comma , in arrays is replaced by a backslash \.
Use array notation to combine the ranges, and wrap them with either filter() or query() to remove the empty rows; see the sketch after the doc links below.
Filter docs
Query docs
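For instance, a hedged sketch (the labels sourceA/sourceB/sourceC and the row-2 data start are assumptions; adjust them to your sheet):
=ARRAYFORMULA(QUERY(
   {B2:D, IF(LEN(B2:B), "sourceA", );
    E2:G, IF(LEN(E2:E), "sourceB", );
    H2:J, IF(LEN(H2:H), "sourceC", )},
   "select * where Col1 is not null"))
The IF(LEN(...)) part writes the source label only on non-empty rows, and the QUERY then drops the empty rows, so the sources may have different row counts. If your dates are stored as text rather than real dates, use "select * where Col1 <> ''" instead.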

Equivalent of Excel Pivoting in Stata

I have been working with country-level survey data in Stata that I needed to reshape. I ended up exporting the .dta to a .csv and making a pivot table in Excel, but I am curious to know how to do this in Stata, as I couldn't figure it out.
Suppose we have the following data:
country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
I would like the data to be reformatted as such:
country sum_1 sum_2
A 4 4
B 3 2
First I tried a simple reshape wide command but got the error that "values of variable response not unique within country" before realizing reshape without additional steps wouldn't work anyway.
Then I tried generating new variables conditional on the value of response and trying to use reshape following that... the whole thing turned into kind of a mess so I just used Excel.
Just curious if there is a more intuitive way of doing that transformation.
If you just want a table, then just ask for one:
clear
input str1 country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
end
tabulate country response
           |       response
   country |         1          2 |     Total
-----------+----------------------+----------
         A |         4          4 |         8
         B |         3          2 |         5
-----------+----------------------+----------
     Total |         7          6 |        13
If you want the data to be changed to this, reshape is part of the answer, but you should contract first. collapse is in several ways more versatile, but your "sum" is really a count or frequency, so contract is more direct.
contract country response, freq(sum_)
reshape wide sum_, i(country) j(response)
list
+-------------------------+
| country sum_1 sum_2 |
|-------------------------|
1. | A 4 4 |
2. | B 3 2 |
+-------------------------+
In Stata 16 and up, help frames introduces frames as a way to work with multiple datasets in the same session.
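For instance, a minimal sketch (Stata 16+; the frame name wide is mine) that builds the wide table in a separate frame so the original long data stay intact:
frame copy default wide, replace
frame wide {
    contract country response, freq(sum_)
    reshape wide sum_, i(country) j(response)
    list
}
frame default: describe  // the original data are untouched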

Comparisons across multiple rows in Stata (household dataset)

I'm working on a household dataset and my data looks like this:
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
What I want to do is identify the mother in each family. A mother is a member of the family whose id is equal to one of the mother_id's of another family member. In the example above, for the family with id_family=3, individual 5 has mother_id=4, which makes individual 4 her mother.
I create a family size variable that tells me how many members there are per family. I also create a rank variable for each member within a family. For families of three, I then have the following piece of code that works:
bysort id_family: gen family_size=_N
bysort id_family: gen rank=_n
gen mother=.
bysort id_family: replace mother=1 if male==0 & rank==1 & family_size==3 & (id[_n]==mother_id[_n+1] | id[_n]==mother_id[_n+2])
bysort id_family: replace mother=1 if male==0 & rank==2 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n+1])
bysort id_family: replace mother=1 if male==0 & rank==3 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n-2])
What I get is:
id id_family mother_id male family_size rank mother
1 2 12 0 2 1 .
2 2 13 1 2 2 .
3 3 15 1 3 1 .
4 3 17 0 3 2 1
5 3 4 0 3 3 .
However, in my real data set, I have to get the mother for families of size 4 and higher (up to 9), which makes this procedure very inefficient (in the sense that there are too many row elements to compare "manually").
How would you obtain this in a cleaner way? Would you make use of permutations to index the rows? Or would you use a for-loop?
Here's an approach using merge.
// create sample data
clear
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
save families, replace
clear
// do the job
use families
drop id male
rename mother_id id
sort id_family id
duplicates drop
list, clean abbreviate(10)
save mothers, replace
use families, clear
merge 1:1 id_family id using mothers, keep(master match)
generate byte is_mother = _merge==3
list, clean abbreviate(10)
The second list yields
id id_family mother_id male _merge is_mother
1. 1 2 12 0 master only (1) 0
2. 2 2 13 1 master only (1) 0
3. 3 3 15 1 master only (1) 0
4. 4 3 17 0 matched (3) 1
5. 5 3 4 0 master only (1) 0
where I retained _merge only for expositional purposes.

Inflow/Outflow Count in Stata

I have the following data
id pair_id id_in id_out date
1 1 2 3 1/1/2010
2 1 2 3 1/2/2010
3 1 3 2 1/3/2010
4 1 3 2 1/5/2010
5 1 3 2 1/7/2010
6 2 2 1 1/2/2010
7 3 1 3 1/5/2010
8 2 1 2 1/7/2010
At any given row I want to know what the inflow/outflow differential is between the unique pair id_in and id_out, from the id_in perspective.
For example, for id_in == 2 and id_out == 3 it would look like the following (from id_in == 2's perspective):
id pair_id id_in id_out date inflow_outflow
1 1 2 3 1/1/2010 1
2 1 2 3 1/2/2010 2
3 1 3 2 1/3/2010 1
4 1 3 2 1/5/2010 0
5 1 3 2 1/7/2010 -1
Explanation: id_in == 2 received first, so they get +1; then they received again, so +2. Then they gave out, so the total is reduced by 1, bringing it to 1 at that point, and so on.
This is what I have tried
sort pair_id id_in date
gen count = 0
qui forval i = 2/`=_N' {
    local I = `i' - 1
    count if id_in == id_out[`i'] in 1/`I'
    replace count = r(N) in `i'
}
I don't follow all the logic here and in particular presenting transactions from the point of view of one member seems quite arbitrary. But a broad impression from loosely similar problems is that you should not be thinking about loops here. It should suffice to use by: and cumulative sums. There is an attempt at some systematic discussion of how to handle dyads at http://www.stata-journal.com/sjpdf.html?articlenum=dm0043 but it is only a beginning.
Please note that presenting dates according to some display format is a small pain as they need to be reverse engineered. dataex from SSC can be used to create examples that are easy to copy and paste.
This code may suggest some technique:
clear
input id pair_id id_in id_out str8 sdate
1 1 2 3 "1/1/2010"
2 1 2 3 "1/2/2010"
3 1 3 2 "1/3/2010"
4 1 3 2 "1/5/2010"
5 1 3 2 "1/7/2010"
6 2 2 1 "1/2/2010"
7 3 1 3 "1/5/2010"
8 2 1 2 "1/7/2010"
end
gen date = daily(sdate, "MDY")
format date %td
assert id_in != id_out
gen pair1 = cond(id_in < id_out, id_in, id_out)
gen pair2 = cond(id_in < id_out, id_out, id_in)
bysort pair_id (date): gen sum1 = sum(id_in == pair1) - sum(id_out == pair1)
bysort pair_id (date): gen sum2 = sum(id_in == pair2) - sum(id_out == pair2)
list date id_* pair? sum?, sepby(pair_id)
+----------------------------------------------------------+
| date id_in id_out pair1 pair2 sum1 sum2 |
|----------------------------------------------------------|
1. | 01jan2010 2 3 2 3 1 -1 |
2. | 02jan2010 2 3 2 3 2 -2 |
3. | 03jan2010 3 2 2 3 1 -1 |
4. | 05jan2010 3 2 2 3 0 0 |
5. | 07jan2010 3 2 2 3 -1 1 |
|----------------------------------------------------------|
6. | 02jan2010 2 1 1 2 -1 1 |
7. | 07jan2010 1 2 1 2 0 0 |
|----------------------------------------------------------|
8. | 05jan2010 1 3 1 3 1 -1 |
+----------------------------------------------------------+
A specific pair (as defined by pair_id) is always formed by two entities that can be ordered in one of two ways: for example, entity 5 with entity 8, or entity 8 with entity 5. If one is receiving, the other is necessarily giving out.
Two slightly different ways of approaching the problem can be found below.
clear all
set more off
*----- example data -----
input id pair_id id_in id_out str8 sdate
1 1 2 3 "1/1/2010"
2 1 2 3 "1/2/2010"
3 1 3 2 "1/3/2010"
4 1 3 2 "1/5/2010"
5 1 3 2 "1/7/2010"
6 2 2 1 "1/2/2010"
7 3 1 3 "1/5/2010"
8 2 1 2 "1/7/2010"
end
gen date = daily(sdate, "MDY")
format date %td
drop sdate
sort pair_id date id
list, sepby(pair_id)
*---- what you want -----
// approach 1
bysort pair_id (date id) : gen sum1 = sum(cond(id_in == id_in[1], 1, -1))
gen sum2 = -1 * sum1
// approach 2
bysort pair_id (id_in date id) : gen temp = cond(id_in == id_in[1], 1, -1)
bysort pair_id (date id) : gen sum100 = sum(temp)
gen sum200 = -1 * sum100
// list
drop temp
sort pair_id date
list, sepby(pair_id)
The first approach involves creating a variable that holds the differential for the entity that first receives according to the date variable. sum1 does just that. Variable sum2 holds the differential for the other entity.
The second approach creates a variable that holds the differential for the entity that has the smallest identifying number. I've named it sum100. Variable sum200 holds the information for the other entity.
Note that I added id to the sorting list in case pair_id date does not uniquely identify observations.
The second approach is equivalent to the code provided by @NickCox, or so I believe.
The results:
. list, sepby(pair_id)
+---------------------------------------------------------------------------+
| id pair_id id_in id_out date sum1 sum2 sum100 sum200 |
|---------------------------------------------------------------------------|
1. | 1 1 2 3 01jan2010 1 -1 1 -1 |
2. | 2 1 2 3 02jan2010 2 -2 2 -2 |
3. | 3 1 3 2 03jan2010 1 -1 1 -1 |
4. | 4 1 3 2 05jan2010 0 0 0 0 |
5. | 5 1 3 2 07jan2010 -1 1 -1 1 |
|---------------------------------------------------------------------------|
6. | 6 2 2 1 02jan2010 1 -1 -1 1 |
7. | 8 2 1 2 07jan2010 0 0 0 0 |
|---------------------------------------------------------------------------|
8. | 7 3 1 3 05jan2010 1 -1 1 -1 |
+---------------------------------------------------------------------------+
Check them carefully, as the difference between both approaches is subtle, at least initially.
