The problem
The fastLink and RecordLinkage packages do extremely well at matching records (rows) from database A to database B and vice versa. The developers are working on extending them from matching only two databases to matching multiple databases.
I gave a simple example of both here.
In the meantime, how would we go about matching multiple data frames? For example, I have medical records of patients from clinics A, B, C, D, E, and F, and I want to merge them into a single data frame.
A reproducible example:
dfA <-
structure(list(fname = c("Jafar", "Nemo", "Simba", "Belle", "Nala",
"Jasmine"), lname = c("Evil", "Water", "King", "Beauty", "Princess",
"Princess"), gender = c("M", "M", "M", "F", "F", "F"), dob = c(1987,
2000, 2011, 1989, 1970, 1989), city = c("Arabtown", "Atlantic",
"Sahara", "Nice", "Sahara", "Arabtown")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
dfB <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Nala", "Jasmine"), lname = c("Evil", "Waterson", "King", "Beauty",
"Princess", "Princess of Arabtown"), gender = c("M", "M", "M",
"F", "F", "F"), dob = c(NA, 2000, 2011, NA, NA, 1989), city = c("Arabtown",
"Atlantica", "Sahara", "Nice-France", "Sahara", "Arabia")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
dfC <-
structure(list(fname = c("Jafar Jr", "Fishy", "Lion", "Belle",
"Sarabi", "Jasmine"), lname = c("Evil", "Waterpal", "King", "Beauty",
"Queen", NA), gender = c("M", "M", NA, "F", "F", "F"), dob = c(NA,
2000, 2011, NA, 1940, 1989), city = c("Arabia", NA, "Sahara",
"France", "Sahara", NA)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfD <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Sarabi", "Jasmine"), lname = c("Evil", "Waterson", "King", "Beast",
"Queen", "Evil"), gender = c("M", "M", "M", "F", "F", "M"), dob = c(NA,
2000, 2011, 1989, NA, 1989), city = c("Arabtown", "Atlantica",
"Sahara", NA, "Sahara", "Arabtown")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfE <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Nala", "Aladdin"), lname = c("Evil", "Pateron", NA, "Gaston",
NA, "Streetrat"), gender = c("M", NA, "M", "F", "F", "M"), dob = c(1987,
NA, NA, NA, 1970, 1989), city = c("Arabtown", "Atlantica", "Sahara",
"France", "Sahara", "Arabia")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfF <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Nala", "Al"), lname = c("Evil", "Waterson", "Dead", "Beauty",
"Princess", "Streetrat"), gender = c("M", "M", NA, "F", "F",
"M"), dob = c(1987, 2000, 2011, NA, NA, 1989), city = c("Arabia",
"Atlantic", "Sahara", "Nice-France", "Sahara", "Arabia")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Expected result:
In the end I want uniquely identified records:
1 Jafar Evil M 1987 Arabtown
2 Nemo Water M 2000 Atlantic
3 Simba King M 2011 Sahara
4 Belle Beauty F 1989 Nice
5 Nala Princess F 1970 Sahara
6 Jasmine Princess F 1989 Arabtown
7 Sarabi Queen F 1940 Sahara
8 Aladdin Streetrat M 1989 Arabia
It's all right if the result isn't as clean as the above. The goal is to find the records across all 6 data frames that belong to the same entity and unify them into a single record.
Both fastLink & RecordLinkage take care of deduping (removing duplicates).
How can I develop an approach to deal with more than two databases in this scenario?
I'm trying to convert a data.frame into JSON format.
My data.frame has the following structure:
a <- rep(c("Mario", "Luigi"), each = 3)
b <- sample(34:57, size = length(a))
df <- data.frame(a,b)
> df
a b
1 Mario 43
2 Mario 34
3 Mario 36
4 Luigi 45
5 Luigi 52
6 Luigi 35
What I want to create is something like this (to finally write it to a .json file):
[
{
"a": "Mario",
"b": [43, 34, 36]
},
{
"a": "Luigi",
"b": [45, 52, 35]
}
]
I've tried different packages that handle JSON, but so far I have failed to produce this kind of output. I usually end up with something like this:
[
{
"a":"Mario",
"b":43
},
{
"a":"Mario",
"b":34
},
{
"a":"Mario",
"b":36
},
{
"a":"Luigi",
"b":45
},
{
"a":"Luigi",
"b":52
},
{
"a":"Luigi",
"b":35
}
]
If you nest b as a list column, it will convert correctly:
library(jsonlite)
# converts b to nested list column
df2 <- aggregate(b ~ a, df, list)
df2
## a b
## 1 Luigi 49, 42, 37
## 2 Mario 46, 50, 45
toJSON(df2, pretty = TRUE)
## [
## {
## "a": "Luigi",
## "b": [49, 42, 37]
## },
## {
## "a": "Mario",
## "b": [46, 50, 45]
## }
## ]
or if you prefer dplyr:
library(dplyr)
df %>% group_by(a) %>%
summarise(b = list(b)) %>%
toJSON(pretty = TRUE)
or data.table:
library(data.table)
toJSON(setDT(df)[, .(b = list(b)), by = a], pretty = TRUE)
which both return the same thing.
To get the required JSON structure you will want your data in a list, something like:
l <- list(list(a = "Mario",
b = c(43,34,36)),
list(a = "Luigi",
b = c(45,52,35)))
## then can use the library(jsonlite) to convert to JSON
library(jsonlite)
toJSON(l, pretty = T)
[
{
"a": ["Mario"],
"b": [43, 34, 36]
},
{
"a": ["Luigi"],
"b": [45, 52, 35]
}
]
So to split your data into this format, you can do
l <- lapply(unique(df$a), function(x) list(a = x, b = df[df$a == x, "b"]))
## and then the conversion works
toJSON(l, pretty = T)
[
{
"a": ["Mario"],
"b": [44, 49, 50]
},
{
"a": ["Luigi"],
"b": [39, 57, 35]
}
]
This works for the simple case, but if the structure gets more complex it might be better to redesign how you create your data.frame and build a list (or lists) to begin with.
Reference
The jsonlite vignette is a very good resource.
I'm trying to return the index of the smallest integer in an array. I found a solution on this forum. Here is what I did:
require 'amatch'
include Amatch
ingredients_arr = [
"All purpose", "Ammoniaco", "Assorted sprinkles", "Baking Powder",
"Baking soda", "Banana", "Banana flavor", "Bread improver", "Brown Sugar",
"Buko pandan flavor", "Butter", "Butter flavor", "Butter oil subs",
"Cake Flour", "Cake emulsi\nfier", "Canyon baking powder", "Cheese",
"Chiffon oil", "Choco flavor", "Choco spri\nnkles", "Cocoa (imp)",
"Cocoa (loc)", "Coconut", "Condensada", "Cooking\nOil", "\n Corn Starch",
"Dessicated", "Dutch choco fudge premium", "Dutch cocoa premium", "Egg",
"Evaporad\n a (B)", "Evaporada (S)", "Evaporadaorated\n(B)", "First Class",
"Food color", "Glucose", "Go\n ld coin", "Heart sprinkles", "LPG", "Lard",
"Linga", "Margarine", "Mocha flavor", "Mongo paste red", "Onion", "Ovalet",
"Powdered sugar", "Rhum", "Royal", "Salt", "Sibuyas", "Skim milk (h-end)",
"Skim milk (l-end)", "\n Star sprinkles", "Strawberry flavor",
"Styro (l. plan)", "Super syrup", "Taba", "Tartar", "\n Third Class",
"Ube Paste", "Ube flavor", "Vanilla", "Vanilla 1G", "Vivid icing",
"Wash Sugar", "Water", "White Sugar"
]
i, ingredient = 1, "Flour"
ing_array = Array.new
until ingredient == ""
puts "Enter ingredient #{i}: "
ingredient = gets.chomp
ing_array << ingredient
i += 1
end
ing_array.pop
m = Sellers.new("margarine")
no_words = ing_array.length
ing_index_arr = Array.new
i = 0
while i < no_words
rating_arr = Array.new
m = Sellers.new(ing_array[i])
j = 0
while j < ingredients_arr.length
x = m.match(ingredients_arr[j])
rating_arr << x
j += 1
end
y = rating_arr.each.with_index.find_all{|a, i| a == rating_arr.min }.map{|a, b| b}
ing_index_arr << y
i += 1
end
ing_index_arr # => [[0], [67], [4]]
but I need something like this:
[0, 67, 4]
Hope someone can help me.
If you want to collapse your sub-arrays, use the flatten method. http://ruby-doc.org/core-2.2.0/Array.html#method-i-flatten
[[0],[67],[4]].flatten == [0,67,4]
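A minimal sketch of how this applies to the loop in the question. The scores below are hypothetical stand-ins for the Sellers#match results, so the example runs without the amatch gem. It also shows one caveat: when two candidates tie for the minimum, flatten keeps both indices, whereas Array#index keeps only the first.

```ruby
# Hypothetical distance scores, one row per entered ingredient
# (stand-ins for the values the question gets from Sellers#match).
ratings_per_word = [
  [0.0, 3.0, 5.0],
  [4.0, 2.0, 1.0],
  [6.0, 1.0, 1.0] # two candidates tie for the minimum
]

# Same idea as the question: collect every index holding the minimum,
# which yields one sub-array per word.
nested = ratings_per_word.map do |ratings|
  min = ratings.min
  ratings.each_index.select { |idx| ratings[idx] == min }
end

# Collapse the sub-arrays into one flat array of indices.
flat = nested.flatten

# Alternative: keep only the first index of the minimum for each word,
# so the result always has exactly one entry per word.
first_only = ratings_per_word.map { |ratings| ratings.index(ratings.min) }

p nested     # [[0], [2], [1, 2]]
p flat       # [0, 2, 1, 2]
p first_only # [0, 2, 1]
```

If exactly one index per ingredient is needed, the Array#index variant may be the safer choice, since ties would otherwise make the flattened array longer than the list of ingredients.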
I'm not sure if I understood you correctly.
If you want to get the index of the smallest integer in the array, you can simply sort it.
array=[2,9,90,345]
array.sort
=> [2, 9, 90, 345]
After sorting, the lowest integer is always at index 0.
When you have array=[2,3,4] it is already numeric:
array[0].class
=> Fixnum
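Note that sorting reorders the elements, so index 0 of the sorted array is generally not the position of the smallest value in the original array. A small sketch, using a made-up unsorted array, of getting the original index without sorting:

```ruby
array = [90, 2, 345, 9]

# Sorting gives the smallest value at position 0 of the sorted copy...
sorted = array.sort

# ...but its position in the original array has to be looked up separately.
smallest = array.min
index_in_original = array.index(smallest)

# One-pass alternative: each_with_index yields [value, index] pairs,
# and min compares those pairs by value first.
value, index = array.each_with_index.min

p sorted            # [2, 9, 90, 345]
p index_in_original # 1
p [value, index]    # [2, 1]
```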
I have just started using RSQLite to analyse a very large survey data set with R and the survey package by Thomas Lumley. I am getting an error message that has been asked about before on Stack Overflow and in the R help archive, but the solutions don't apply to my data (in one case the original poster was using a POSIX data type, which my data doesn't have). I don't think it is a problem with the survey package; rather, I think I am doing something wrong when creating the database or table. One thing that may help: when I use the sample of my data posted below, I don't get an error with a SELECT query, but when I do the same thing with my full data set, I do get the same error. Here is a sample of my data and some reproducible code:
test=structure(list(household = c(0, 0, 0, 0, 0), NUMADULT = c(2L,
1L, 2L, 1L, 1L), CHILDREN = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), SEX = c(1L, 2L, 1L, 2L, 2L), X_STATE = c(36L, 5L,
53L, 41L, 10L), X_FINALWT = c(665.97647582, 53.293518032, 72.60538811,
61.223634396, 5.5921160216), AGE = c(30L, 65L, 9L, 49L, 48L),
X_INCOMG = structure(c(6L, 6L, 6L, 6L, 6L), .Label = c("1",
"2", "3", "4", "5", "9"), class = "factor"), X_MAM502Y = structure(c(NA,
1L, NA, NA, NA), .Label = c("1", "2", "9"), class = "factor"),
HLTHPLAN = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("1",
"2"), class = "factor"), MEDCOST = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), QLACTLM2 = c(2L,
2L, 2L, 2L, 2L), CTYCODE = structure(c(30L, 53L, 33L, 26L,
1L), .Label = c("1", "3", "5", "6", "7", "9", "10", "11",
"13", "14", "15", "17", "19", "20", "21", "23", "25", "27",
"28", "29", "30", "31", "33", "35", "37", "39", "41", "43",
"45", "47", "49", "51", "53", "55", "57", "59", "61", "63",
"65", "67", "69", "71", "73", "75", "77", "79", "81", "83",
"85", "86", "87", "89", "91", "93", "95", "97", "99", "101",
"103", "105", "107", "109", "111", "113", "115", "117", "119",
"121", "123", "125", "127", "129", "131", "133", "135", "137",
"139", "141", "143", "145", "147", "149", "151", "153", "155",
"157", "159", "161", "163", "165", "167", "169", "171", "173",
"175", "177", "179", "181", "183", "185", "187", "189", "191",
"193", "195", "197", "199", "201", "205", "209", "215", "227",
"235", "245", "297", "303", "309", "339", "355", "439", "453",
"491", "510", "550", "590", "650", "700", "710", "740", "760",
"770", "777", "800", "810", "999", "203", "207", "217", "221",
"223", "275", "277", "295", "313", "381", "423", "680", "12",
"54", "186", "211", "213", "219", "225", "229", "231", "233",
"237", "239", "241", "247", "249", "251", "253", "255", "257",
"259", "261", "265", "267", "271", "273", "279", "281", "285",
"287", "289", "291", "293", "299", "305", "311", "321", "323",
"325", "329", "331", "337", "341", "343", "347", "349", "351",
"353", "361", "363", "365", "367", "371", "373", "375", "387",
"395", "397", "401", "407", "409", "415", "419", "427", "441",
"449", "451", "455", "457", "459", "463", "465", "467", "469",
"471", "473", "477", "479", "481", "485", "487", "489", "493",
"497", "499", "503", "520", "540", "570", "600", "630", "660",
"670", "683", "690", "730", "750", "775", "820", "830", "840",
"790"), class = "factor"), X_RACEGR2 = structure(c(1L, 1L,
NA, 1L, NA), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
PERSDOC2 = structure(c(3L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3"), class = "factor"), POORHLTH = c(0, NA, NA, 0,
0), X_EDUCAG = structure(c(3L, 2L, 4L, 4L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), X_PSU = c(2004006698L,
2004014294L, 2004100796L, 2004024220L, 2004005537L), X_STSTR = c(36011L,
5012L, 53271L, 41012L, 10011L), X_RFMAM2Y = structure(c(NA,
1L, NA, 1L, 1L), .Label = c("1", "2", "9"), class = "factor"),
X_RFSMOK3 = structure(c(2L, 1L, 1L, 2L, 1L), .Label = c("1",
"2"), class = "factor"), X_RFHLTH = structure(c(1L, 1L, 1L,
1L, 1L), .Label = c("1", "2", "3"), class = "factor"), YEAR = c(2004,
2004, 2004, 2004, 2004), bcccp = structure(c(2L, 2L, 2L,
2L, 1L), .Label = c("0", "1"), class = "factor"), pov.limit = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), cutoff = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), elig = c(NA, NA,
NA, NA, NA), bcccp_elig = c(NA, NA, NA, NA, NA)), .Names = c("household",
"NUMADULT", "CHILDREN", "SEX", "X_STATE", "X_FINALWT", "AGE",
"X_INCOMG", "X_MAM502Y", "HLTHPLAN", "MEDCOST", "QLACTLM2", "CTYCODE",
"X_RACEGR2", "PERSDOC2", "POORHLTH", "X_EDUCAG", "X_PSU", "X_STSTR",
"X_RFMAM2Y", "X_RFSMOK3", "X_RFHLTH", "YEAR", "bcccp", "pov.limit",
"cutoff", "elig", "bcccp_elig"), row.names = c(NA, 5L), class = "data.frame")
library(survey)
library(sqldf)
library(RSQLite)
drv=dbDriver('SQLite')
con=dbConnect(drv,'brfsagg.db')
dbWriteTable(con,'brfs0210',test)
dbListFields(con,'brfs0210') #This function works
sqldf("select SEX from brfs0210") #This works with my sample data but I get the same error message when I use the full data set.
dbExistsTable(con,'test') #This proves that the table exists
brfsvy=svydesign(id=~X_PSU, strata=~X_STSTR, weights=~X_FINALWT,nest=TRUE,
data='test',dbtype='SQLite',dbname=system.file('brfsagg.db',package='survey')) #This always generates the error message, regardless of whether I am using the test sample data or my full data set.
the r code that you are trying to write has already been written here, with an accompanying blog post here. why would you bother re-inventing the wheel? googling "r brfss" or "import brfss into r" gets you to those posts.
is there a reason you want to re-write everything from scratch yourself? there is lots of example syntax using SQLite with the survey package here. here's how to fix this particular issue: write the table under the same name that you pass to data=, and point dbname= at the database file you created rather than at system.file(), which only looks inside the installed survey package. :)
library(survey)
library(RSQLite)
db.filename <- 'brfsagg.db'
con <- dbConnect(SQLite(),db.filename)
dbWriteTable( con , 'test' , test )
brfsvy <-
svydesign(
id = ~X_PSU ,
strata = ~X_STSTR ,
weights = ~X_FINALWT ,
nest = TRUE ,
data = 'test' ,
dbtype = 'SQLite' ,
dbname = db.filename
)
svymean( ~ SEX , brfsvy )
options( 'survey.lonely.psu' = 'adjust' )
svymean( ~ SEX , brfsvy )
svymean( ~ factor( SEX ) , brfsvy )