Record Linkage with multiple datasets - matching

The problem
fastLink and RecordLinkage packages do extremely well in matching records (rows) from database A to database B and vice-versa. The developers are working on extending from matching only 2 databases to multiple databases.
A simple example of both I gave here.
In the meantime, how would we go about matching multiple data frames? For example, I have multiple medical records of patients from clinic A, B, C, D, E, F, and I want to merge them into a single one.
A reproducible example:
dfA <-
structure(list(fname = c("Jafar", "Nemo", "Simba", "Belle", "Nala",
"Jasmine"), lname = c("Evil", "Water", "King", "Beauty", "Princess",
"Princess"), gender = c("M", "M", "M", "F", "F", "F"), dob = c(1987,
2000, 2011, 1989, 1970, 1989), city = c("Arabtown", "Atlantic",
"Sahara", "Nice", "Sahara", "Arabtown")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
dfB <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Nala", "Jasmine"), lname = c("Evil", "Waterson", "King", "Beauty",
"Princess", "Princess of Arabtown"), gender = c("M", "M", "M",
"F", "F", "F"), dob = c(NA, 2000, 2011, NA, NA, 1989), city = c("Arabtown",
"Atlantica", "Sahara", "Nice-France", "Sahara", "Arabia")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
dfC <-
structure(list(fname = c("Jafar Jr", "Fishy", "Lion", "Belle",
"Sarabi", "Jasmine"), lname = c("Evil", "Waterpal", "King", "Beauty",
"Queen", NA), gender = c("M", "M", NA, "F", "F", "F"), dob = c(NA,
2000, 2011, NA, 1940, 1989), city = c("Arabia", NA, "Sahara",
"France", "Sahara", NA)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfD <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Sarabi", "Jasmine"), lname = c("Evil", "Waterson", "King", "Beast",
"Queen", "Evil"), gender = c("M", "M", "M", "F", "F", "M"), dob = c(NA,
2000, 2011, 1989, NA, 1989), city = c("Arabtown", "Atlantica",
"Sahara", NA, "Sahara", "Arabtown")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfE <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Nala", "Aladdin"), lname = c("Evil", "Pateron", NA, "Gaston",
NA, "Streetrat"), gender = c("M", NA, "M", "F", "F", "M"), dob = c(1987,
NA, NA, NA, 1970, 1989), city = c("Arabtown", "Atlantica", "Sahara",
"France", "Sahara", "Arabia")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfF <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Nala", "Al"), lname = c("Evil", "Waterson", "Dead", "Beauty",
"Princess", "Streetrat"), gender = c("M", "M", NA, "F", "F",
"M"), dob = c(1987, 2000, 2011, NA, NA, 1989), city = c("Arabia",
"Atlantic", "Sahara", "Nice-France", "Sahara", "Arabia")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Expected result :
In the end I want unique identified records :
1 Jafar Evil M 1987 Arabtown
2 Nemo Water M 2000 Atlantic
3 Simba King M 2011 Sahara
4 Belle Beauty F 1989 Nice
5 Nala Princess F 1970 Sahara
6 Jasmine Princess F 1989 Arabtown
7 Sarabi Queen F 1940 Sahara
8 Aladdin Streetrat M 1989 Arabia
Even if the result isn't as clean as above, it's alright. The goal is to find a unified record from all 6 records and belong to the same entity.
Both fastLink & RecordLinkage take care of deduping (removing duplicates).
How can I develop an approach to deal with more than two databases in this scenario?

Related

VBA Arrays and If statements

I'm just getting started on learning VBA and am a bit stumped on the following. I'd be grateful for your assistance.
I'm using If Range("C3") - which will either be Sitel Audit or BA Tracker to paste values in a spreadsheet into found.row of slightly different columns. The corresponding worksheets are 'Sitel Audit' and 'BA Tracker.'
The PasteArray doesn't seem to work for either of the following, but did work before I introduced the If.
Private Sub Extract()
Dim CopyArray As Variant, PasteArray As Variant, X As Long, FileYear As String, FileMonth As String, AgentName As String, Agreement As String
Dim CallDate As String, wb As Workbook, wd As Worksheet, found As Range, wtf As Range
On Error Resume Next
Set wb = Workbooks.Open("C:\Users\matthew.varnham\Desktop\QA Improvements\SITEL - Inbound Tracker.xlsm")
Set wtf = Sheets("Observation Sheet").Range("E8:G8")
Set found = wd.Columns("E:E").find(what:=wtf, LookIn:=xlValues, lookat:=xlWhole)
Set wss = Sheets("Observation Sheet")
If Range("C3") = "Sitel Audit" Then
Set wd = wb.Sheets("Sitel Audit")
CopyArray = Array(17, 20, 21, 22, 23, 26, 27, 28, 30, 31, 34, 35, 36, 37, 40, 41, 44, 47, 51)
PasteArray = Array("P", "Q", "R", "S", "T", "U", "Y", "Z", "AB", "AC", "AD", "AE", "AF", "AG", "AH", "AI", "AJ", "AK", "N")
For X = LBound(CopyArray) To UBound(CopyArray)
wd.Range(PasteArray(X) & found.Row).Value = wss.Range("F" & CopyArray(X)).Value
Next
ElseIf Range("C3") = "BA Tracker" Then
Set wd = wb.Sheets("BA Tracker")
CopyArray = Array(17, 20, 21, 22, 23, 26, 27, 28, 30, 31, 34, 35, 36, 37, 40, 41, 44, 47)
PasteArray = Array("Q", "R", "S", "T", "U", "V", "Z", "AA", "AC", "AD", "AE", "AF", "AG", "AH", "AI", "AJ", "AK", "AL")
For X = LBound(CopyArray) To UBound(CopyArray)
wd.Range(PasteArray(X) & found.Row).Value = wss.Range("F" & CopyArray(X)).Value
Next
End If
wb.Save
Workbooks(1).Close SaveChanges:=True
End Sub

Permutation/Combinations of Multiple Arrays / Cartesian product generation in swift

I have an example of 4 arrays describing the
let arrColor = ["red","green","blue"]
let arrSize = ["L","M"]
let arrMaterial = ["Cotton","polyester"]
let arrstyle = ["Eastern","Western"]
I need all the combinations of products possible from these arrays.
This concept is called the Cartesian product generation. We can use the following code from here to do this
let arrColor = ["red","green","blue"]
let arrSize = ["L","M"]
let arrMaterial = ["Cotton","polyester"]
let arrstyle = ["Eastern","Western"]
func + <T>(el: T, arr: [T]) -> [T] {
var ret = arr
ret.insert(el, at: 0)
return ret
}
func cartesianProduct<T>(_ arrays: [T]...) -> [[T]] {
guard let head = arrays.first else {
return []
}
let first = Array(head)
func pel(
_ el: T,
_ ll: [[T]],
_ a: [[T]] = []
) -> [[T]] {
switch ll.count {
case 0:
return a.reversed()
case _:
let tail = Array(ll.dropFirst())
let head = ll.first!
return pel(el, tail, el + head + a)
}
}
return arrays.reversed()
.reduce([first], {res, el in el.flatMap({ pel($0, res) }) })
.map({ $0.dropLast(first.count) })
}
let arrResult = cartesianProduct(arrColor,arrSize,arrMaterial,arrstyle)
print(arrResult)
Output :
[["red", "L", "Cotton", "Eastern"], ["red", "L", "Cotton", "Western"], ["red", "L", "polyester", "Eastern"], ["red", "L", "polyester", "Western"], ["red", "M", "Cotton", "Eastern"], ["red", "M", "Cotton", "Western"], ["red", "M", "polyester", "Eastern"], ["red", "M", "polyester", "Western"],
["green", "L", "Cotton", "Eastern"], ["green", "L", "Cotton", "Western"], ["green", "L", "polyester", "Eastern"], ["green", "L", "polyester", "Western"], ["green", "M", "Cotton", "Eastern"], ["green", "M", "Cotton", "Western"], ["green", "M", "polyester", "Eastern"], ["green", "M", "polyester", "Western"],
["blue", "L", "Cotton", "Eastern"], ["blue", "L", "Cotton", "Western"], ["blue", "L", "polyester", "Eastern"], ["blue", "L", "polyester", "Western"], ["blue", "M", "Cotton", "Eastern"], ["blue", "M", "Cotton", "Western"], ["blue", "M", "polyester", "Eastern"], ["blue", "M", "polyester", "Western"]]

Shuffling an array with a condition

Here's an array:
scramble = [
"R", "U", "R' ", "U' ", "L", "L' ", "L2", "B", "B' ",
"B2", "F", "F' ", "F2", "R2", "U2", "D", "D' ", "D2"
]
I want to shuffle it with a condition such that "R", "R' ", and "R2" are not together while shuffled, and similar for other letters.
scramble.shuffle does shuffle it, but how can I set such conditions?
Lets derive a generic solution.
Given a 2D array of groups of elements, return an array of shuffled elements such that no two consecutive elements belong to same group.
For example:
all_groups = [
["R", "R'", "R2" ],
["L", "L'", "L2" ],
["U", "U'", "U2" ],
["D", "D'", "D2" ],
["F", "F'", "F2" ],
["B", "B'", "B2" ]
]
Expected output:
["F'", "U'", "R'", "L2", "R2", "D2", "F2", "R", "L", "B", "F", "L'", "D'", "B'", "U2", "B2", "U", "D"]
TL;DR Code:
all_groups = [
["R", "R'", "R2" ],
["L", "L'", "L2" ],
["U", "U'", "U2" ],
["D", "D'", "D2" ],
["F", "F'", "F2" ],
["B", "B'", "B2" ]
]
ans = []
total = all_groups.map(&:length).reduce(:+)
# Initial shuffling
all_groups = all_groups.each { |i| i.shuffle! }.shuffle
until all_groups.empty?
# Select and pop last group
last_group = all_groups.pop
# Insert last element to our ans, and remove from group
ans.push(last_group.pop)
total -= 1
# Shuffle remaining groups
all_groups.shuffle!
# Check if any group has reached critical state
length_of_longest_group = all_groups.reduce(0) { |len, i| [len, i.length].max }
if length_of_longest_group * 2 == total + 1
# Move that group to last
# This ensures that next element picked is from this group
longest_group = all_groups.detect { |i| i.length == length_of_longest_group }
all_groups.delete(longest_group)
all_groups.push(longest_group)
end
# Insert the last group at first place. This ensures that next element
# is not from this group.
all_groups.unshift(last_group) unless last_group.empty?
end
puts ans.inspect
Examples:
all_groups = [
["R", "R'", "R2" ],
["L", "L'", "L2" ],
["U", "U'", "U2" ],
["D", "D'", "D2" ],
["F", "F'", "F2" ],
["B", "B'", "B2" ]
]
# Output 1:
ans = ["B'", "U'", "L'", "U2", "R", "B2", "F2", "R2", "D2", "L2", "D", "R'", "U", "F'", "D'", "L", "F", "B"]
# Output 2:
ans = ["U'", "R", "D", "R'", "U", "D2", "B2", "D'", "L", "B", "L2", "B'", "U2", "F'", "L'", "F", "R2", "F2"]
# Output 3:
ans = ["B", "D", "R'", "D'", "B'", "R", "F2", "L", "D2", "B2", "F'", "R2", "U'", "F", "L'", "U2", "L2", "U"]
# Output 4:
ans = ["U'", "F'", "R2", "B2", "D", "L2", "B'", "U", "R", "B", "R'", "L'", "D'", "U2", "F", "D2", "F2", "L"]
# Output 5:
ans = ["U2", "F2", "L'", "F'", "R'", "F", "D'", "B2", "D2", "L", "B", "D", "L2", "B'", "R", "U", "R2", "U'"]
all_groups = [
['a', 'aa', 'aaa', 'A', 'AA', 'AAA'],
['b', 'bb', 'bbb', 'B', 'BB', 'BBB'],
['c', 'cc', 'ccc', 'C', 'CC', 'CCC']
]
ans = ["c", "AAA", "B", "ccc", "bbb", "C", "AA", "CC", "aa", "BB", "CCC", "bb", "cc", "A", "BBB", "a", "b", "aaa"]
all_groups = [
['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8'],
['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8']
]
# Output1:
ans = ["r2", "b7", "r1", "b5", "r7", "b6", "r3", "b8", "r4", "b3", "r5", "b1", "r6", "b4", "r8", "b2"]
# Output2:
ans = ["b6", "r8", "b2", "r1", "b4", "r2", "b8", "r7", "b3", "r4", "b5", "r5", "b7", "r3", "b1", "r6"]
Explanation:
To begin with, lets shuffle the input data. First we shuffle each group, then we shuffle the outer array.
all_groups = all_groups.each { |i| i.shuffle! }.shuffle
Now our array looks like this:
[["B", "B2", "B'"],
["F'", "F", "F2"],
["L", "L2", "L'"],
["R2", "R'", "R"],
["D", "D'", "D2"],
["U'", "U", "U2"]]
Now, if we traverse this array in column-major order, we get a prety decent shuffling of elements wherein no two consecutive elements belong to same group.
["B", "F'", "L", "R2", "D", "U'", "B2", "F", "L2", "R'", "D'", "U", "B'", "F2", "L'", "R", "D2", "U2"]
But only flaw in this is that all elements of a particular group are equidistant. Lets improve.
So we have a set of groups. Lets select any one of group and then select the last element from that group and add this element to our answer array. Now shuffle the 2D array and repeat until all elements are selected.
Also, we don't want any two consecutive elements to belong to same group, so we need to ensure that the next selection of element is from some other group. So how about this strategy:
Always pick the last group from our 2d array, and then once we shuffle we will ensure that this group is not the last group in 2d array.
In terms of psuedocode:
1. Select last group from 2d array.
2. Remove an element from this group.
3. Shuffle the order of other groups in 2d array
4. Insert the selected group at begining.
For example, lets start with this 2d array:
[["D", "D'", "D2"],
["U", "U'", "U2"],
["B2", "B", "B'"],
["F2", "F", "F'"],
["L'", "L", "L2"],
["R'", "R2", "R"]]
last group: ["R'", "R2", "R"]
Element removed: ("R")
Shuffle order of remaining groups of 2d array:
[["L'", "L", "L2"],
["B2", "B", "B'"],
["D", "D'", "D2"],
["U", "U'", "U2"],
["F2", "F", "F'"]]
2D array after inserting the popped group(the group from which we extracted an element)
[["R'", "R2"],
["L'", "L", "L2"],
["B2", "B", "B'"],
["D", "D'", "D2"],
["U", "U'", "U2"],
["F2", "F", "F'"]]
Now, since our logic selects the last group and then inserts it in the begining, we are insured that two elements of same group will never be picked in two consecutive turns.
A catch: There can be a possibility that there is a group which never reaches the last position, hence this group will never shrink, and its element will be forced to be picked consecutively once the number of elements is too less.
For example observe the output of 4 iterations of above algorithm on below input:
[['a', 'aa', 'aaa', 'A', 'AA', 'AAA'],
['b', 'bb', 'bbb', 'B', 'BB', 'BBB'],
['c', 'cc', 'ccc', 'C', 'CC', 'CCC']]
# Output 1: in this 'a' & 'aaa' are together
["CC", "bb", "C", "BB", "aa", "bbb", "AA", "CCC", "b", "c", "B", "cc", "A", "BBB", "AAA", "ccc", "a", "aaa"]
# Output 2: in this 'b' & 'BB' are together
["cc", "A", "B", "c", "AAA", "C", "a", "ccc", "bbb", "CCC", "aaa", "bb", "CC", "aa", "BBB", "AA", "b", "BB"]
# Output 3: this is perfect
["CCC", "BB", "a", "c", "BBB", "aa", "bbb", "A", "bb", "ccc", "B", "CC", "AAA", "b", "AA", "cc", "aaa", "C"]
# Output 4: in this 'c' & 'cc' are together
["CC", "bb", "AA", "b", "aa", "BBB", "aaa", "bbb", "CCC", "B", "A", "BB", "C", "a", "ccc", "AAA", "c", "cc"]
So, there are chances that when ratio of number of elements in a group to total no of elements increases, two elements of same group can be clubbed together. Hmm, lets improve further:
Since the group with maximum number of elements has most probability of having two elements combined consecutively, lets study such groups. i.e. group with maximum number of elements.
Lets say, there is a group with X number of elements in it. So what is the minimum number of other type of elements required that none of the X elements are consecutive? Simple right: insert a different element between each pair of element of this group.
x * x * x * x * x * x
So we realise, that if group has X elements it requires atleast (X-1) other type of elements so that no two elements of this group get consecutive.
Mathematically the condition can be represented as:
number_of_elements_in_longest_group == number_of_other_type_of_elements + 1
=> number_of_elements_in_longest_group == (total_number_of_elements - number_of_elements_in_longest_group) + 1
=> number_of_elements_in_longest_group * 2 == total_number_of_elements + 1
Going back to our algorithm, lets add a condition, that at any point of time, if number of remaining items is 1 less than the number of elements in largest group then we have to ensure that next element picked is from this largest group
Juksefantomet sent me this solution in discord, since he can't post here due to an account lock.
The below codeblock contains an alternative approach on how to tackle the problem at hand. This contains a fragmented solution to understand the steps on how you would normally approach a complex problem like the one presented above.
going through the various steps you can see how each condition has to be known in advance and specified to the point where your final array is not "illegal".
#illegal_order = ['1','2','3','4','5','6','7','8','9']
puts #illegal_order.count
puts "#{#illegal_order}"
#foo = []
# traverse the original array and append a random value from that into foo array
# this can of course and should be done in a loop where you check for duplicates
# this is done below in this example, fragmented to see the individual action
(0..#illegal_order.count).each do |add_value|
#temp = #illegal_order[rand(#illegal_order.count)]
unless #foo.count == #illegal_order.count
#foo.push(#temp)
end
end
# making sure it contains everything in the original array
#foo.each do |bar|
unless #illegal_order.include?(bar)
#approve = false
puts "Errored generation!"
end
end
# defining patterns, this can of course be extracted by the original array and be done alot cleaner
# but printing it out to better break it down
#pattern1 = [#illegal_order[0],#illegal_order[1],#illegal_order[2]]
#pattern2 = [#illegal_order[3],#illegal_order[4],#illegal_order[5]]
#pattern3 = [#illegal_order[6],#illegal_order[7],#illegal_order[8]]
# Let us step through the new array and use case to check for various conditions that would flag the new array as invalid
#foo.each do |step|
# setting a temp pattern based on current state
#temp_pattern = [#foo[step.to_i-1],#illegal_order[step.to_i],#illegal_order[step.to_i+1]]
case step
when #temp_pattern == #pattern1
#illegalpatterns = true
when #temp_pattern == #pattern2
#illegalpatterns = true
when #temp_pattern == #pattern3
#illegalpatterns = true
end
# checking the foo array if it contains duplicates, if yes, set conditional to true
#illegal_order.each do |count|
if #foo.count(count) > 1
#duplicates = true
end
end
end
# printing out feedback based on duplicates or not, this is where you rescramble the array if duplicate found
(#duplicates == true) ? (puts "dupes found") : (puts "looks clear. no duplicates")
# printing out feedback based on illegal patterns or not, this is where you rescramble the array if duplicate found
(#illegalpatterns == true) ? (puts "illegal patterns found") : (puts "looks clear, no illegal patterns")
puts "#{#foo}"

RSQLite RS-DBI driver: (error in statement: no such table: test)

I have just started using RSQLite for analysis of a very large survey data set using R and the survey package by Thomas Lumley. I am getting an error message that has been asked about before on Stack Overflow and the R help archive, but the solutions don't apply to my data (one solution was that the original poster was using POSIX data type, but my data doesn't have that). I don't think it is a problem with the survey package, rather I think I am doing something wrong with the database/table creation. One thing that may help, when I use the sample from my data that I posed below, I don't get an error with a SELECT query, but when I do the same thing with my full data set, I do get the same error. Here is a sample of my data and some reproducible code:
test=structure(list(household = c(0, 0, 0, 0, 0), NUMADULT = c(2L,
1L, 2L, 1L, 1L), CHILDREN = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), SEX = c(1L, 2L, 1L, 2L, 2L), X_STATE = c(36L, 5L,
53L, 41L, 10L), X_FINALWT = c(665.97647582, 53.293518032, 72.60538811,
61.223634396, 5.5921160216), AGE = c(30L, 65L, 9L, 49L, 48L),
X_INCOMG = structure(c(6L, 6L, 6L, 6L, 6L), .Label = c("1",
"2", "3", "4", "5", "9"), class = "factor"), X_MAM502Y = structure(c(NA,
1L, NA, NA, NA), .Label = c("1", "2", "9"), class = "factor"),
HLTHPLAN = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("1",
"2"), class = "factor"), MEDCOST = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), QLACTLM2 = c(2L,
2L, 2L, 2L, 2L), CTYCODE = structure(c(30L, 53L, 33L, 26L,
1L), .Label = c("1", "3", "5", "6", "7", "9", "10", "11",
"13", "14", "15", "17", "19", "20", "21", "23", "25", "27",
"28", "29", "30", "31", "33", "35", "37", "39", "41", "43",
"45", "47", "49", "51", "53", "55", "57", "59", "61", "63",
"65", "67", "69", "71", "73", "75", "77", "79", "81", "83",
"85", "86", "87", "89", "91", "93", "95", "97", "99", "101",
"103", "105", "107", "109", "111", "113", "115", "117", "119",
"121", "123", "125", "127", "129", "131", "133", "135", "137",
"139", "141", "143", "145", "147", "149", "151", "153", "155",
"157", "159", "161", "163", "165", "167", "169", "171", "173",
"175", "177", "179", "181", "183", "185", "187", "189", "191",
"193", "195", "197", "199", "201", "205", "209", "215", "227",
"235", "245", "297", "303", "309", "339", "355", "439", "453",
"491", "510", "550", "590", "650", "700", "710", "740", "760",
"770", "777", "800", "810", "999", "203", "207", "217", "221",
"223", "275", "277", "295", "313", "381", "423", "680", "12",
"54", "186", "211", "213", "219", "225", "229", "231", "233",
"237", "239", "241", "247", "249", "251", "253", "255", "257",
"259", "261", "265", "267", "271", "273", "279", "281", "285",
"287", "289", "291", "293", "299", "305", "311", "321", "323",
"325", "329", "331", "337", "341", "343", "347", "349", "351",
"353", "361", "363", "365", "367", "371", "373", "375", "387",
"395", "397", "401", "407", "409", "415", "419", "427", "441",
"449", "451", "455", "457", "459", "463", "465", "467", "469",
"471", "473", "477", "479", "481", "485", "487", "489", "493",
"497", "499", "503", "520", "540", "570", "600", "630", "660",
"670", "683", "690", "730", "750", "775", "820", "830", "840",
"790"), class = "factor"), X_RACEGR2 = structure(c(1L, 1L,
NA, 1L, NA), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
PERSDOC2 = structure(c(3L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3"), class = "factor"), POORHLTH = c(0, NA, NA, 0,
0), X_EDUCAG = structure(c(3L, 2L, 4L, 4L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), X_PSU = c(2004006698L,
2004014294L, 2004100796L, 2004024220L, 2004005537L), X_STSTR = c(36011L,
5012L, 53271L, 41012L, 10011L), X_RFMAM2Y = structure(c(NA,
1L, NA, 1L, 1L), .Label = c("1", "2", "9"), class = "factor"),
X_RFSMOK3 = structure(c(2L, 1L, 1L, 2L, 1L), .Label = c("1",
"2"), class = "factor"), X_RFHLTH = structure(c(1L, 1L, 1L,
1L, 1L), .Label = c("1", "2", "3"), class = "factor"), YEAR = c(2004,
2004, 2004, 2004, 2004), bcccp = structure(c(2L, 2L, 2L,
2L, 1L), .Label = c("0", "1"), class = "factor"), pov.limit = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), cutoff = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), elig = c(NA, NA,
NA, NA, NA), bcccp_elig = c(NA, NA, NA, NA, NA)), .Names = c("household",
"NUMADULT", "CHILDREN", "SEX", "X_STATE", "X_FINALWT", "AGE",
"X_INCOMG", "X_MAM502Y", "HLTHPLAN", "MEDCOST", "QLACTLM2", "CTYCODE",
"X_RACEGR2", "PERSDOC2", "POORHLTH", "X_EDUCAG", "X_PSU", "X_STSTR",
"X_RFMAM2Y", "X_RFSMOK3", "X_RFHLTH", "YEAR", "bcccp", "pov.limit",
"cutoff", "elig", "bcccp_elig"), row.names = c(NA, 5L), class = "data.frame")
library(survey)
library(sqldf)
library(RSQLite)
drv=dbDriver('SQLite')
con=dbConnect(drv,'brfsagg.db')
dbWriteTable(con,'brfs0210',test)
dbListFields(con,'brfs0210') #This function works
sqldf("select SEX from brfs0210") #This works with my sample data but I get the same error message when I use the full data set.
dbExistsTable(con,'test') #This proves that the table exists
brfsvy=svydesign(id=~X_PSU, strata=~X_STSTR, weights=~X_FINALWT,nest=TRUE,
data='test',dbtype='SQLite',dbname=system.file('brfsagg.db',package='survey')) #This always generates the error message, regardless of whether I am using the test sample data or my full data set.
the r code that you are trying to write has already been written here with accompanying blog post here. why would you bother re-inventing the wheel? googling r brfss or import brfss into r gets you to those posts.
is there a reason you want to re-write everything from scratch yourself? there is lots of example syntax using SQLite with the survey package here ..here's how to fix this particular issue. :)
library(survey)
library(RSQLite)
db.filename <- 'brfsagg.db'
con <- dbConnect(SQLite(),db.filename)
dbWriteTable( con , 'test' , test )
brfsvy <-
svydesign(
id = ~X_PSU ,
strata = ~X_STSTR ,
weights = ~X_FINALWT ,
nest = TRUE ,
data = 'test' ,
dbtype = 'SQLite' ,
dbname = db.filename
)
svymean( ~ SEX , brfsvy )
options( 'survey.lonely.psu' = 'adjust' )
svymean( ~ SEX , brfsvy )
svymean( ~ factor( SEX ) , brfsvy )

Complex data transformation

I need to transform following (simplified) dataset, created by following code:
structure(list(W1.1 = structure(c(1L, NA, NA), .Names = c("case1",
"case2", "case3"), .Label = "1", class = "factor"), R1.1 = structure(c(1L,
NA, NA), .Names = c("case1", "case2", "case3"), .Label = "2", class = "factor"),
W1.2 = structure(c(NA, 1L, NA), .Names = c("case1", "case2",
"case3"), .Label = "1", class = "factor"), R1.2 = structure(c(NA,
1L, NA), .Names = c("case1", "case2", "case3"), .Label = "1", class = "factor"),
W2.1 = structure(c(NA, 1L, NA), .Names = c("case1", "case2",
"case3"), .Label = "1", class = "factor"), R2.1 = structure(c(NA,
1L, NA), .Names = c("case1", "case2", "case3"), .Label = "1", class = "factor"),
W2.2 = structure(c(1L, NA, NA), .Names = c("case1", "case2",
"case3"), .Label = "2", class = "factor"), R2.2 = structure(c(1L,
NA, NA), .Names = c("case1", "case2", "case3"), .Label = "1", class = "factor"),
W3.1 = structure(c(1L, NA, NA), .Names = c("case1", "case2",
"case3"), .Label = "1", class = "factor"), R3.1 = structure(c(1L,
NA, NA), .Names = c("case1", "case2", "case3"), .Label = "1", class = "factor"),
W3.2 = structure(c(1L, 1L, NA), .Names = c("case1", "case2",
"case3"), .Label = "1", class = "factor"), R3.2 = structure(c(1L,
1L, NA), .Names = c("case1", "case2", "case3"), .Label = "1", class = "factor"),
age = structure(c(3L, 1L, 2L), .Names = c("case1", "case2",
"case3"), .Label = c("20", "48", "56"), class = "factor"),
gender = structure(c(2L, 1L, 2L), .Names = c("case1", "case2",
"case3"), .Label = c("female", "male"), class = "factor")), .Names = c("W1.1",
"R1.1", "W1.2", "R1.2", "W2.1", "R2.1", "W2.2", "R2.2", "W3.1",
"R3.1", "W3.2", "R3.2", "age", "gender"), row.names = c(NA, 3L
), class = "data.frame")
For the new data I want:
- a row dedicated to every x.x, with info on the Rx.x value, age and gender.
- only have a row returned when Wx.x was 1. When 2 or NA, I don't need it.
For my example this dataset should look something like this:
incident type Where Reported age gender
1 1 1.1 1 2 56 male
2 2 3.1 1 1 56 male
3 3 3.2 1 1 56 male
4 4 1.2 1 1 20 female
5 5 2.1 1 1 20 female
6 6 3.2 1 1 20 female
Note: the "Where" column can even be omitted since it should be a constant vector of 1, and I don't need it for the analysis.
This is (mostly) a problem to be tackled by reshape(). Assuming your original dataset is called "temp":
First, reshape it from a wide format to a long format.
temp.long <- reshape(temp, direction = "long",
idvar=c("age", "gender"),
varying = which(!names(temp) %in% c("age", "gender")),
sep = "")
temp.long
# age gender time W R
# 56.male.1.1 56 male 1.1 1 2
# 20.female.1.1 20 female 1.1 <NA> <NA>
# 48.male.1.1 48 male 1.1 <NA> <NA>
# 56.male.1.2 56 male 1.2 <NA> <NA>
# 20.female.1.2 20 female 1.2 1 1
# 48.male.1.2 48 male 1.2 <NA> <NA>
# 56.male.2.1 56 male 2.1 <NA> <NA>
# 20.female.2.1 20 female 2.1 1 1
# 48.male.2.1 48 male 2.1 <NA> <NA>
# 56.male.2.2 56 male 2.2 2 1
# 20.female.2.2 20 female 2.2 <NA> <NA>
# 48.male.2.2 48 male 2.2 <NA> <NA>
# 56.male.3.1 56 male 3.1 1 1
# 20.female.3.1 20 female 3.1 <NA> <NA>
# 48.male.3.1 48 male 3.1 <NA> <NA>
# 56.male.3.2 56 male 3.2 1 1
# 20.female.3.2 20 female 3.2 1 1
# 48.male.3.2 48 male 3.2 <NA> <NA>
Second, do some cleanup.
temp.long <- na.omit(temp.long)
temp.long <- temp.long[-which(temp.long$W == 2), ]
temp.long <- temp.long[order(rev(temp.long$gender), temp.long$time), ]
rownames(temp.long) <- NULL
temp.long$incident <- seq(nrow(temp.long))
temp.long
# age gender time W R incident
# 1 56 male 1.1 1 2 1
# 2 56 male 3.1 1 1 2
# 3 56 male 3.2 1 1 3
# 4 20 female 1.2 1 1 4
# 5 20 female 2.1 1 1 5
# 6 20 female 3.2 1 1 6
You can do further cleanup to change your column names and column order if it's important.

Resources