create hash value for each row of data in dataframe in R - database

I am exploring how to compare two data frames in R more efficiently, and I came up with the idea of hashing.
My plan is to create a hash for each row of data in two data frames with the same columns, using digest from the digest package. I suppose the hash should be the same for any two identical rows.
I tried to generate a hash for each row of data, using the code below:
for (loop.ssi in 1:nrow(ssi.10q3.v1)) {
  ssi.10q3.v1[loop.ssi, "hash"] <- digest(as.character(ssi.10q3.v1[loop.ssi, ]))
  print(paste(loop.ssi, nrow(ssi.10q3.v1), sep = "/"))
  flush.console()
}
But this is very slow.
Is my approach to comparing data frames sound? If so, any suggestions for speeding up the code above? Thanks.
UPDATE
I have updated the code as below:
ssi.10q3.v1[,"uid"] <- 1:nrow(ssi.10q3.v1)
ssi.10q3.v1.hash <- ddply(ssi.10q3.v1,
                          c("uid"),
                          function(df) {
                            df[, "uid"] <- NULL
                            hash <- digest(as.character(df))
                            data.frame(hash = hash)
                          },
                          .progress = "text")
I generated the uid column myself so that every row has a unique key to group on.

If I understand what you want correctly, digest will work directly with apply:
library(digest)
ssi.10q3.v1.hash <- data.frame(uid = 1:nrow(ssi.10q3.v1), hash = apply(ssi.10q3.v1, 1, digest))

I know this answer doesn't match the title of the question, but if you just want to see when rows are different you can do it directly:
rowSums(df2 == df1) == ncol(df1)
Assuming both data.frames have the same dimensions, that will evaluate to FALSE for every row that is not identical. If you need to test rownames as well, that could be managed separately and combined with the test of contents, and similarly for colnames (and attributes, and strict tests on column types).
rowSums(df2 == df1) == ncol(df1) & rownames(df2) == rownames(df1)
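For readers who want to see the hash-per-row idea outside R, here is a minimal sketch in Python. It uses the standard hashlib module rather than R's digest package, and the names row_hash and differing_rows, the separator character and the sample tables are all illustrative assumptions, not anything from the question.

```python
import hashlib


def row_hash(row):
    """Return an MD5 hex digest for one row (a sequence of cell values)."""
    # Join the cells with a separator that is unlikely to occur in the data,
    # so that ("ab", "c") and ("a", "bc") do not produce the same string.
    joined = "\x1f".join(str(cell) for cell in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def differing_rows(table_a, table_b):
    """Return the positions at which two equal-length tables have different rows."""
    return [
        i
        for i, (a, b) in enumerate(zip(table_a, table_b))
        if row_hash(a) != row_hash(b)
    ]


# Hypothetical sample data: only the second row differs.
table_a = [(1, "x", 2.5), (2, "y", 3.0), (3, "z", 4.5)]
table_b = [(1, "x", 2.5), (2, "y", 3.1), (3, "z", 4.5)]

changed = differing_rows(table_a, table_b)
print(changed)
```

Hashing pays off mainly when the hashes are stored and reused, or when rows have to be matched regardless of order. For a one-off positional comparison, a direct element-wise test such as the rowSums approach avoids the hashing cost altogether.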

Related

Matching and replacing a selection of data from two different dataframes

(First time posting, so please bear with me.) I have two different dataframes, one of which contains a column of replacement data for a selection of data within the first dataframe.
#dataframe 1
df<-data.frame(site= rep(1:4,3), landings = rep("val",12),
harbour = c("a","b","c","d","e","f","g","h","i","j","k","l"))
#dataframe 2
new_site4<-data.frame(harbour = c("a","b","c","d","e","f","g","h","i","j","k","l"),
sub_site = c("x","x","y","x","y","y","y","x","y","x","y","y") )
I want to replace the "site" in dataframe 1 with the "sub_site" in dataframe 2 based on the match of "harbour"; however, I only need to do it for records for site "4".
Is there a neat way to select only site 4 and then replace the site number with the sub_site, ideally without merging or creating a whole new dataframe? My real dataset is large, but the key is small, as it only refers to the small selection of the data that needs the sub_site added.
I tried using match() on my main dataset, but for some reason it only matched some of the required data, not all of it, and this code won't work on my sample data either.
#df$site[match(df$harbour, new_site4$harbour)] <- new_site4$sub_site[match(df$harbour, df$harbour)]

Merge/concatenate CSV-imported dataframes and delete duplicates

I am following up on my previous question.
I have sorted out a loop to import the CSVs, concatenate the data and remove duplicates.
files = glob.glob('./A08_csv/A08_B1_T*.csv')
dfs = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True) for fp in files]
df = pd.concat(dfs)
df_purged = df.drop_duplicates(inplace=True)
print df_purged
However, df.drop_duplicates(inplace=True) does not seem to work (surely I am missing something) and print outputs None. How can I specify that duplicates should be checked by index? Adding the column name does not seem to work.
Also, how can I turn this loop into a function, so I can apply the same import to CSVs with different filenames (i.e. something that could work for A08_B1_T*.csv (bedroom) and for A08_KI_T*.csv (kitchen) etc.)?
Do you understand the inplace=True option?
If you do it in place, df itself is modified and the call returns None, so don't assign the result to df_purged.
You have two options. Either you want to keep the 'unpurged' dataframe, and you do:
df_purged = df.drop_duplicates()
Or you don't care about keeping it, and you do:
df.drop_duplicates(inplace=True)
With the first option the result is df_purged; with the second it is df itself, which has been purged because the operation was performed in place.
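The same convention exists in core Python, which may make the behaviour easier to remember. The sketch below uses list.sort() and sorted() as a stand-in for the two pandas calls; it is an analogy only, not pandas code, and the variable names are made up for the illustration.

```python
numbers = [3, 1, 3, 2]

# In-place style: the list itself is modified and the call returns None,
# so assigning the result (as df_purged was assigned above) captures None.
result_in_place = numbers.sort()

# Copy style: the original is left untouched and a new object is returned.
original = [3, 1, 3, 2]
result_copy = sorted(original)

print(result_in_place)  # None
print(numbers)          # [1, 2, 3, 3]
print(result_copy)      # [1, 2, 3, 3]
print(original)         # [3, 1, 3, 2]
```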
That being said, if you want to purge on your index and you don't need to keep it, you can reset_index and then drop_duplicates like this:
df_purged = df.reset_index().drop_duplicates(['index']).drop('index', axis=1)
And if you need to keep the index (modulo the dropped lines):
df_purged = df.reset_index().drop_duplicates(['index']).set_index('index')
df_purged.index.name = None
(Note that clearing the index name is purely cosmetic.)
Would this help?
df.drop_duplicates(['col_name'])
Here is a solution that turns the index into regular columns, drops duplicates on those, then restores them as the index:
df = df.reset_index().drop_duplicates(subset=['Date', 'Time'], keep='last').set_index(['Date', 'Time'])

SparkR - extracting dataframe's array<int> for an R function

I have thousands of sensors. I need to partition the data (i.e. per sensor per day), then submit each list of data points to an R algorithm. Using Spark, a simplified sample looks like:
//Spark
val rddData = List(
("1:3", List(1,1,456,1,1,2,480,0,1,3,425,0)),
("1:4", List(1,4,437,1,1,5,490,0)),
("1:6", List(1,6,500,0,1,7,515,1,1,8,517,0,1,9,522,0,1,10,525,0)),
("1:11", List(1,11,610,1))
)
case class DataPoint(
key: String,
value: List[Int]) // 4 value pattern, sensorID:seq#, seq#, value, state
I convert to a parquet file, save it.
Load the parquet in SparkR, no problem, the schema says:
#SparkR
df <- read.df(sqlContext, filespec, "parquet")
schema(df)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value", type = "ArrayType(IntegerType,true)", nullable = TRUE
So in SparkR, I have a dataframe where each record has all of the data I want (df$value). I want to extract that array into something R can consume then mutate my original dataframe(df) with a new column holding the resultant array. Logically something like results = function(df$value). Then I need to get results (for all rows) back into a SparkR dataframe for output.
How do I extract an array from the SparkR dataframe and then mutate with the results?
Let the Spark data frame be df and the R data frame be df_r.
To convert the SparkR df to an R data frame, use:
df_r <- collect(df)
With the R data frame df_r, you can do all the computations you want in R.
Let's say you have the result in the column df_r$result.
Then, to convert back to a SparkR data frame, use:
#this is a new SparkR data frame, df_1
df_1 <- createDataFrame(sqlContext, df_r)
To add the result back to the SparkR data frame `df`, use:
#this adds the df_1$result to a new column df$result
#note that number of rows should be same in df and `df_1`, if not use `join` operation
df$result <- df_1$result
Hope this solves your problem.
I had this problem too. The way I got around it was by adding a row index into the spark DataFrame and then using explode inside a select statement. Make sure to select the index and then the row you want in your select statement. That will get you a "long" dataframe. If each of the nested lists in the DataFrame column has the same amount of information in it (for example if you are exploding a list-column of x,y coordinates), you would expect each row index in the long DataFrame to occur twice.
After doing the above, I typically do a groupBy(index) on the exploded DataFrame, filter where the n() of each index is not equal to the expected number of items in the list and proceed with additional groupBy, merge, join, filter, etc. operations on the Spark DataFrame.
There are some excellent guides on the Urban Institute's GitHub page. Good luck. -nate

How to ignore lines with missing fields in the database

So I'm following the tutorial on Spark using Scala, and working with this dataset from Wikimedia. I was interested in generating a histogram of total page views by language. The first column is the language, while the third column is the page views. However, it seems that some lines in that dataset do not have any field for the third column, as I get an ArrayIndexOutOfBoundsException when I run the following code.
scala> val tuples = pagecounts.map(line => line.split(" "))
scala> val keyValuePairs = tuples.map(line => (line(0).substring(0, 2),
line(2).toInt))
scala> keyValuePairs.reduceByKey(_+_, 1).collect
Does anyone know how to ignore the lines that are missing the third column, so that I can run the query against only those lines that contain it?
You want to filter the page counts so that only the lines with at least three fields are operated on. Apply filter to the split lines, after the map rather than inside it:
val tuples = pagecounts.map(line => line.split(" ")).filter(_.length >= 3)
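To make the split, filter and sum-by-key steps concrete without a Spark cluster, here is the same logic sketched in plain Python. This is an illustration of the idea only, not Spark code: the sample lines are invented, and a dictionary stands in for reduceByKey.

```python
from collections import defaultdict

# Hypothetical sample lines: language/project code, page title, view count.
# Two of them are missing the third field.
lines = [
    "en Main_Page 10",
    "en Broken_line",
    "fr Accueil 4",
    "en Python 5",
    "de",
]

totals = defaultdict(int)
for line in lines:
    fields = line.split(" ")
    if len(fields) < 3:
        continue  # skip lines that have no third field
    # Key on the first two characters of the first field, as in the question.
    totals[fields[0][:2]] += int(fields[2])

print(dict(totals))
```

Testing the number of fields before indexing is what prevents the out-of-bounds error; the short lines are simply dropped from the totals.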

Searching for and matching elements across arrays

I have two tables.
In one table there are two columns, one has the ID and the other the abstracts of a document about 300-500 words long. There are about 500 rows.
The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.
I am interested in a script that will scan each abstract of the table 1 and identify one or more of the acronyms present in it, which are also present in table 2.
Finally the program will create a separate table where the first column contains the content of the first column of the table 1 (i.e. ID) and the acronyms found in the document associated with that ID.
Can someone with expertise in Python, Perl or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(document.abstract)
Given the desired semantics, you can use the most straightforward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass  # word not an acronym
This is a straightforward implementation; however, its running time is the product of the number of documents, the words per abstract and the number of acronyms, because acronyms.index performs a linear search (of our largest array, no less). We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass  # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
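One practical wrinkle with abstract.split() is punctuation: a token such as "EPO," or "(TPO)" will not match the bare acronym. The sketch below is one possible variant, assuming acronyms consist only of letters and digits. It uses a set for the lookup and a regular expression to extract words, and it produces the table the question asks for (an ID followed by the acronyms found). The sample acronyms and abstracts are made up.

```python
import re

# Hypothetical sample data standing in for the two tables.
acronyms = ["NGF", "EPO", "TPO"]
documents = [
    (101, "NGF and EPO levels rose; TPO was unchanged."),
    (102, "No growth factors were measured."),
    (103, "EPO (erythropoietin) was given twice. EPO dosing varied."),
]

# A set gives constant-time membership tests, like the dict index above.
acronym_set = set(acronyms)

matches = []
for doc_id, abstract in documents:
    # Extract runs of letters and digits so "EPO," and "(EPO)" still match.
    words = set(re.findall(r"[A-Za-z0-9]+", abstract))
    # Keep each acronym once per document, in alphabetical order.
    found = sorted(words & acronym_set)
    matches.append((doc_id, found))

for doc_id, found in matches:
    print(doc_id, ", ".join(found))
```

The match is case-sensitive, so an ordinary word is only reported if it is written exactly like an acronym.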
Thanks a lot for the quick response.
I assume the pseudo-SQL solution is for MySQL etc. However, it did not work in Microsoft Access.
The second and third are for Python, I assume. Can I feed the acronyms and documents in as input files?
babru
It didn't work in Access because tables are accessed differently (e.g. acronym.[id])
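On the question of feeding the data in from files, here is a minimal sketch using only the standard library. The file layouts are assumptions: one acronym per line, and a two-column CSV of ID and abstract with no header row. The helper names load_acronyms and load_documents are hypothetical, and in-memory strings stand in for real files so the example runs as-is.

```python
import csv
import io


def load_acronyms(handle):
    """Read one acronym per line, ignoring blank lines."""
    return {line.strip() for line in handle if line.strip()}


def load_documents(handle):
    """Read (id, abstract) pairs from a two-column CSV without a header."""
    return [(row[0], row[1]) for row in csv.reader(handle) if len(row) >= 2]


# In-memory stand-ins for real files. With files on disk you would pass
# open("acronyms.txt") and open("abstracts.csv", newline="") instead;
# both file names are hypothetical.
acronym_file = io.StringIO("NGF\nEPO\n\nTPO\n")
document_file = io.StringIO(
    '1,"NGF binds its receptor, TrkA"\n2,"Nothing relevant"\n'
)

acronyms = load_acronyms(acronym_file)
documents = load_documents(document_file)

print(sorted(acronyms))
print(documents)
```

The csv module handles quoted abstracts that contain commas, which a plain split(",") would break apart.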

Resources