How to check if two data frames are equal [duplicate]

This question already has an answer here:
regarding matrix comparison in R
(1 answer)
Closed 9 years ago.
Say I have large datasets in R and I just want to know whether two of them are the same. I often need this when I'm experimenting with different algorithms to achieve the same result. For example, say we have the following datasets:
df1 <- data.frame(num = 1:5, let = letters[1:5])
df2 <- df1
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3
So this is what I do to compare them:
table(x == y, useNA = 'ifany')
Which works great when the datasets have no NAs:
> table(df1 == df2, useNA = 'ifany')
TRUE
10
But not so much when they have NAs:
> table(df3 == df4, useNA = 'ifany')
TRUE <NA>
11 1
In the example, it's easy to dismiss the NA as not a problem since we know that both data frames are equal. The problem is that NA == <anything> yields NA, so whenever one of the datasets has an NA, it doesn't matter what the other one has in that same position: the result is always going to be NA.
So using table() to compare datasets doesn't seem ideal to me. How can I better check if two data frames are identical?
P.S.: Note this is not a duplicate of R - comparing several datasets, Comparing 2 datasets in R or Compare datasets in R

Look up all.equal. It comes with some caveats, but it might work for you.
all.equal(df3,df4)
# [1] TRUE
all.equal(df2,df1)
# [1] TRUE

As Metrics pointed out, one could also use identical() to compare the datasets. The difference between this approach and that of Codoremifa is that identical() will just yield TRUE or FALSE, depending on whether the objects being compared are identical or not, whereas all.equal() will either return TRUE or hints about the differences between the objects. For instance, consider the following:
> identical(df1, df3)
[1] FALSE
> all.equal(df1, df3)
[1] "Attributes: < Component 2: Numeric: lengths (5, 6) differ >"
[2] "Component 1: Numeric: lengths (5, 6) differ"
[3] "Component 2: Lengths: 5, 6"
[4] "Component 2: Attributes: < Component 2: Lengths (5, 6) differ (string compare on first 5) >"
[5] "Component 2: Lengths (5, 6) differ (string compare on first 5)"
Moreover, from what I've tested identical() seems to run much faster than all.equal().

Related

How can I create a tiled/stacked array based on ranges using these 2 input arrays - but without looping?

My basic problem is that I need to take 2 arrays of integers and arrive at a combined array that is the concatenation of many ranges, built from pairwise combinations of the 2 initial arrays.
Said slightly differently, I want to use 2 arrays, combine them to produce a set of ranges, and then merge these ranges together. Importantly, I need to do this without using any looping, as I am going to need to do this almost 4 million times.
My 2 starting arrays are:
import numpy as np
sd = np.array([3,3,4,2,5,1]) # StartDate
ed = np.array([4,5,5,5,8,2]) # EndDate
Pairwise, they would look like this, combining (sd[i] with ed[i]):
[(3, 4), (3, 5), (4, 5), (2, 5), (5, 8), (1, 2)] # Pairwise combinations of StartDate and EndDate
By way of example, I could iterate over these pairs, creating ranges, exemplifying below:
[In]: range1 = np.arange(3,4)
[Out]: array([3])
[In]: range2 = np.arange(3,5)
[Out]: array([3,4])
...and so on, to arrive at the final output, which would be:
array([3, 3, 4, 4, 2, 3, 4, 5, 6, 7, 1]) # End result where the arrays are tiled after one another
# (note: the first 3 digits are range1 and range2 from immediately above)
My issue is that I need to go from the input arrays to the output array without looping, as I have already tried a looping version of this and it is WAY too slow. Any help is very much appreciated.
You are in luck. Here is a one-liner solution:
indexer = np.r_[tuple([np.s_[i:j] for (i,j) in zip(sd,ed)])]
output:
[3 3 4 4 2 3 4 5 6 7 1]
I have also explained a similar case for torch here: "Here is how it works:
np.s_[i:j] creates a slice object (simply a range) of indices from start=i to end=j.
np.r_[i:j, k:m] creates an array of ALL indices in slices (i,j) and (k,m). (You can pass more slices to np.r_ to concatenate them all together at once. This is an example of concatenating only two slices.)
Therefore, indexer creates an array of ALL indices by concatenating a list of slices (each slice is a range of indices)."
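A loop-free variant, as a hedged sketch: the one-liner above still builds its slices in a Python-level list comprehension, so when this has to run millions of times it may help to push all of the work into NumPy. The function name concat_ranges is made up for illustration, and the sketch assumes every end value is greater than or equal to its start value.

```python
import numpy as np

def concat_ranges(starts, stops):
    """Concatenate arange(start, stop) for each pair without a Python loop.

    Hypothetical helper; assumes stops[i] >= starts[i] for every i.
    """
    starts = np.asarray(starts)
    stops = np.asarray(stops)
    lengths = stops - starts
    # Position in the output where each range begins.
    offsets = np.cumsum(lengths) - lengths
    # 0, 1, 2, ... counted from the beginning of each range.
    within = np.arange(lengths.sum()) - np.repeat(offsets, lengths)
    return np.repeat(starts, lengths) + within

sd = np.array([3, 3, 4, 2, 5, 1])  # StartDate
ed = np.array([4, 5, 5, 5, 8, 2])  # EndDate
result = concat_ranges(sd, ed)
print(result)  # [3 3 4 4 2 3 4 5 6 7 1]
```

The idea is to repeat each start value once per element of its range and then add a counter that restarts at zero for every range.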

Array division comparison between Matlab and Julia [duplicate]

A \ B in matlab gives a special solution while numpy.linalg.lstsq doesn't.
A = [1 2 0; 0 4 3];
b = [8; 18];
c_mldivide = A \ b
c_mldivide =
0
4
0.66666666666667
c_lstsq = np.linalg.lstsq([[1 ,2, 0],[0, 4, 3]],[[8],[18]])
print c_lstsq
c_lstsq = (array([[ 0.91803279],
[ 3.54098361],
[ 1.27868852]]), array([], dtype=float64), 2, array([ 5.27316304,1.48113184]))
How does mldivide A \ B in matlab give a special solution?
Is this solution useful in achieving computational accuracy?
Why is this solution special and how might you implement it in numpy?
For under-determined systems such as yours (rank is less than the number of variables), mldivide returns a solution with as many zero values as possible. Which of the variables are set to zero is an arbitrary choice made by the solver.
In contrast, the lstsq method returns the solution of minimal norm in such cases: that is, among the infinite family of exact solutions it will pick the one that has the smallest sum of squares of the variables.
So, the "special" solution of Matlab is somewhat arbitrary: one can set any of the three variables to zero in this problem. The solution given by NumPy is in fact more special: there is a unique minimal-norm solution.
Which solution is better for your purpose depends on what your purpose is. The non-uniqueness of solution is usually a reason to rethink your approach to the equations. But since you asked, here is NumPy code that produces Matlab-type solutions.
import numpy as np
from itertools import combinations
A = np.matrix([[1 ,2, 0],[0, 4, 3]])
b = np.matrix([[8],[18]])
num_vars = A.shape[1]
rank = np.linalg.matrix_rank(A)
if rank == num_vars:
    sol = np.linalg.lstsq(A, b)[0] # not under-determined
else:
    for nz in combinations(range(num_vars), rank): # the variables not set to zero
        try:
            sol = np.zeros((num_vars, 1))
            sol[nz, :] = np.asarray(np.linalg.solve(A[:, nz], b))
            print(sol)
        except np.linalg.LinAlgError:
            pass # picked bad variables, can't solve
For your example it outputs three "special" solutions, the last of which is what Matlab chooses.
[[-1. ]
[ 4.5]
[ 0. ]]
[[ 8.]
[ 0.]
[ 6.]]
[[ 0. ]
[ 4. ]
[ 0.66666667]]
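To make the comparison concrete, here is a small sketch that checks both kinds of solution against the same system. It uses np.linalg.pinv to obtain the minimal-norm solution (for this full-row-rank system that is the same vector lstsq returns) and takes the Matlab result from the question as given. The variable names are illustrative only.

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 4.0, 3.0]])
b = np.array([8.0, 18.0])

# Minimal-norm solution via the pseudoinverse.
x_min = np.linalg.pinv(A) @ b

# The solution reported by Matlab's mldivide in the question.
x_matlab = np.array([0.0, 4.0, 2.0 / 3.0])

for name, x in [("minimal norm", x_min), ("mldivide-style", x_matlab)]:
    residual = np.linalg.norm(A @ x - b)  # both should be ~0: exact solutions
    print(name, x, "residual:", residual, "norm:", np.linalg.norm(x))
```

Both vectors solve the system exactly; they differ only in which one has the smaller Euclidean norm and which one has more zero entries.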

How to check whether two arrays have an element in common? [duplicate]

This question already has answers here:
How can I check if a Ruby array includes one of several values?
(5 answers)
Closed 7 years ago.
For example:
a = [1,2,3,4,5,6,7,8]
b = [1,9,10,11,12,13,14,15]
Array a contains 1 and array b contains 1 too, so they have an element in common.
How can I compare them and get true or false in Ruby?
Check if a & b is empty:
a & b
# => [1]
(a & b).empty?
# => false
If you have many elements per Array, doing an intersection (&) can be an expensive operation. I assume that it would be quicker to go 'by hand':
def have_same_element?(array1, array2)
  # Return true on first element found that is in both array1 and array2
  # Return false if no such element found
  array1.each do |elem|
    return true if array2.include?(elem)
  end
  return false
end
a = [*1..100] # [1, 2, 3, ... , 100]
b = a.reverse.to_a # [100, 99, 98, ... , 1]
puts have_same_element?(a, b)
If you know more beforehand (e.g. "array1 contains many duplicates") you can further optimize the operation (e.g. by calling uniq or compact first, depending on your data).
Would be interesting to see actual benchmarks.
Edit
require 'benchmark'
Benchmark.bmbm(10) do |bm|
  bm.report("by hand") { have_same_element?(a, b) }
  bm.report("set operation") { (a & b).empty? }
end
Rehearsal -------------------------------------------------
by hand 0.000000 0.000000 0.000000 ( 0.000014)
set operation 0.000000 0.000000 0.000000 ( 0.000095)
---------------------------------------- total: 0.000000sec
user system total real
by hand 0.000000 0.000000 0.000000 ( 0.000012)
set operation 0.000000 0.000000 0.000000 ( 0.000131)
So, in this case it looks as if the "by hand" method really is faster, but it's quite a sloppy method of benchmarking with limited expressiveness.
Also, see @CarySwoveland's excellent comments about using sets, proper benchmarking, and a snappier expression using find. (detect would do the same and is more expressive, in my opinion, but be careful: it returns the value found, which is a problem if your arrays contain falsey values like nil or false. You generally want to use any?{} here.)
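The set-based idea mentioned in those comments can be sketched as follows. This is written in Python purely for illustration, the function name is hypothetical, and it assumes the elements are hashable. Building a set from one array makes each membership test cheap, and any() stops at the first hit without being confused by falsey values.

```python
def have_common_element(xs, ys):
    """Return True as soon as some element of ys is also in xs.

    Hypothetical helper; assumes the elements are hashable.
    """
    seen = set(xs)                     # built once, O(len(xs))
    return any(y in seen for y in ys)  # stops at the first match

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1, 9, 10, 11, 12, 13, 14, 15]
print(have_common_element(a, b))        # True
print(have_common_element(a, [0, 99]))  # False
```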
The intersection of two arrays can be obtained with the & operator. If you need the elements that two arrays have in common, start with
a = [1,2,3,4,5,6,7,8]
b = [1,9,10,11,12,13,14,15]
and take the intersection:
u = a & b
p u
# [1]
u.empty?
# false

Form array of equivalence related classes

I have an array in Matlab. I labeled every entry in the array with a natural number, which defines an equivalence relation on the array.
For example,
array = [1 2 3 5 6 7]
classes = [1 2 1 1 3 3]
I want to get a cell array in which the i-th cell corresponds to the i-th entry of the initial array and lists the elements that are in the same class as that entry. For the example above, I would get:
{[1 3 5], [2], [1 3 5], [1 3 5], [6 7], [6 7]}
It can be done easily with a for-loop, but is there any other solution? It would be good if it ran faster than O(n^2), where n is the size of the initial array.
Edit.
The problem would be solved if I knew how to split a sorted array into cells holding the indices of equal elements in O(n).
array = [1 1 1 2 3 3]
groups = {[1 2 3], [4], [5 6]}
Not sure about complexity, but accumarray with cell output is useful for splitting up the array based on unique values of the classes:
data = sortrows([classes; array].',1) %' stable w.r.t. array
arrayPieces = accumarray(data(:,1),data(:,2)',[],@(x){x.'})
classElements = arrayPieces(classes).'
Regarding splitting a sorted array into cells of indices:
>> array = [1 1 1 2 3 3]
>> arrayinds = accumarray(array',1:numel(array),[],@(x){x'})' %' transpose for rows
arrayinds =
[1x3 double] [4] [1x2 double]
>> arrayinds{:}
ans =
1 2 3
ans =
4
ans =
5 6
I don't know how to do this without for-loops entirely, but you can use a combination of sort, diff, and find to organize and partition the equivalence class identifiers. That'll give you a mostly vectorized solution, where the M-code level for-loop is O(n) where n is the number of classes, not the length of the whole input array. This should be pretty fast in practice.
Here's a rough example using some index munging. Be careful; there's probably an off-by-one edge case bug in there somewhere since I just banged this out.
function [eqVals,eqIx] = equivsets(a,x)
%EQUIVSETS Find indexes of equivalent values
[b,ix] = sort(x);
ixEdges = find(diff(b)); % identifies partitions between equiv classes
ix2 = [0 ixEdges numel(ix)];
eqVals = cell([1 numel(ix2)-1]);
eqIx = cell([1 numel(ix2)-1]);
% Map back to original input indexes and values
for i = 1:numel(ix2)-1
    eqIx{i} = ix((ix2(i)+1):ix2(i+1));
    eqVals{i} = a(eqIx{i});
end
I included the indexes in the output because they're often more useful than the values themselves. You'd call it like this.
% Get indexes of occurrences of each class
equivs = equivsets(array, classes)
% You can expand that to get equivalences for each input element
equivsByValue = equivs(classes)
It's a lot more efficient to build the lists for each class first and then expand them out to match the input indexes. Not only do you have to do the work just once, but when you use the b = a(ix) to expand a small cell array to a larger one, Matlab's copy-on-write optimization will end up reusing the memory for the underlying numeric mxArrays so you get a more compact representation in memory.
This transformation pops up a lot when working with unique() or databases. For decision support systems and data warehouse style things I've worked with, it happens all over the place. I wish it were built in to Matlab. (And maybe it's been added to one of the db or timeseries toolboxes in recent years; I'm a few versions behind.)
Realistically, if performance of this is critical for your code, you might also look at dropping down to Java or C MEX functions and implementing it there. But if your data sets are low cardinality - that is, have a small number of classes/distinct values, like numel(unique(classes)) / numel(array) tends to be less than 0.1 or so - the M-code implementation will probably be just fine.
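For comparison, the same sort/diff/find idea can be sketched in NumPy. This is an illustrative analog rather than a translation of the Matlab function: the name equiv_sets is hypothetical, indices are zero-based, and the final expansion assumes nothing about the class labels beyond their being sortable.

```python
import numpy as np

def equiv_sets(values, classes):
    """Group values (and their indices) by class label using sort + diff.

    Hypothetical helper; returns (value_groups, index_groups), one entry
    per distinct class, ordered by class label.
    """
    values = np.asarray(values)
    classes = np.asarray(classes)
    order = np.argsort(classes, kind="stable")  # keeps input order within a class
    sorted_classes = classes[order]
    # Positions where the sorted class label changes mark the group edges.
    edges = np.flatnonzero(np.diff(sorted_classes)) + 1
    index_groups = np.split(order, edges)
    value_groups = [values[g] for g in index_groups]
    return value_groups, index_groups

array = [1, 2, 3, 5, 6, 7]
classes = [1, 2, 1, 1, 3, 3]
value_groups, index_groups = equiv_sets(array, classes)
print([g.tolist() for g in value_groups])  # [[1, 3, 5], [2], [6, 7]]

# Expand to one (shared) group per input element.
_, group_of = np.unique(classes, return_inverse=True)
per_element = [value_groups[k] for k in group_of]
print([g.tolist() for g in per_element])
```

As in the Matlab version, the groups are built once per class and then reused for every element of that class.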
For the second question:
array = [1 1 1 2 3 3]; %// example data
Use diff to find the end of each run of equal values, and from that build the groups:
ind = [0 find(diff([array NaN])~=0)];
groups = arrayfun(@(n) ind(n)+1:ind(n+1), 1:numel(ind)-1, 'uni', 0);
Same approach using unique:
[~, ind] = unique(array, 'last'); % index of the last element of each run
ind = [0 ind(:).'];               % force a row vector before prepending 0
groups = arrayfun(@(n) ind(n)+1:ind(n+1), 1:numel(ind)-1, 'uni', 0);
I haven't tested if the complexity is O(n), though.

Multiple assignment in Scala without using Array?

I have an input something like this: "1 2 3 4 5".
What I would like to do, is to create a set of new variables, let a be the first one of the sequence, b the second, and xs the rest as a sequence (obviously I can do it in 3 different lines, but I would like to use multiple assignment).
A bit of search helped me by finding the right-ignoring sequence patterns, which I was able to use:
val Array(a, b, xs @ _*) = "1 2 3 4 5".split(" ")
What I do not understand is why it doesn't work if I try it with a tuple. I get an error for this:
val (a, b, xs @ _*) = "1 2 3 4 5".split(" ")
The error message is:
<console>:1: error: illegal start of simple pattern
Are there any alternatives for multiple-assignment without using Array?
I have just started playing with Scala a few days ago, so please bear with me :-) Thanks in advance!
Other answers tell you why you can't use tuples, but arrays are awkward for this purpose. I prefer lists:
val a :: b :: xs = "1 2 3 4 5".split(" ").toList
Simple answer
val Array(a, b, xs @ _*) = "1 2 3 4 5".split(" ")
The syntax you are seeing here is a simple pattern-match. It works because "1 2 3 4 5".split(" ") evaluates to an Array:
scala> "1 2 3 4 5".split(" ")
res0: Array[java.lang.String] = Array(1, 2, 3, 4, 5)
Since the right-hand side is an Array, the pattern on the left-hand side must also be an Array.
The left-hand side can be a tuple only if the right-hand side evaluates to a tuple as well:
val (a, b, xs) = (1, 2, Seq(3,4,5))
More complex answer
Technically what's happening here is that the pattern match syntax is invoking the unapplySeq method on the Array object, which looks like this:
def unapplySeq[T](x: Array[T]): Option[IndexedSeq[T]] =
  if (x == null) None else Some(x.toIndexedSeq)
Note that the method accepts an Array. This is what Scala must see on the right-hand side of the assignment. And it returns a Seq, which allows for the @ _* syntax you used.
Your version with the tuple doesn't work because Tuple3's unapply is defined with a Product3 as its parameter, not an Array:
def unapply[T1, T2, T3](x: Product3[T1, T2, T3]): Option[Product3[T1, T2, T3]] =
  Some(x)
You can actually write your own "extractors" like this that do whatever you want by simply creating an object and writing an unapply or unapplySeq method.
The answer is:
val a :: b :: c = "1 2 3 4 5".split(" ").toList
Should clarify that in some cases one may want to bind just the first n elements in a list, ignoring the non-matched elements. To do that, just add a trailing underscore:
val a :: b :: c :: _ = "1 2 3 4 5".split(" ").toList
That way:
c = "3" vs. c = List("3","4","5")
I'm not an expert in Scala by any means, but I think this might have to do with the fact that Tuples in Scala are just syntactic sugar for classes ranging from Tuple2 to Tuple22.
Meaning, Tuples in Scala aren't flexible structures like in Python or other languages of the sort, so you can't really create a Tuple whose size isn't known a priori.
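For contrast, this is what the flexible behavior alluded to looks like in Python, shown only to illustrate the comparison being made. Star-unpacking accepts a sequence of any length on the right-hand side and collects the remainder into a list.

```python
# Bind the first two items and collect the rest, whatever the length.
a, b, *xs = "1 2 3 4 5".split(" ")
print(a, b, xs)  # 1 2 ['3', '4', '5']

# The same pattern works when the right-hand side is a tuple.
first, *rest = (1, 2, 3)
print(first, rest)  # 1 [2, 3]
```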
We can use pattern matching to extract the values from a string and assign them to multiple variables. This requires two lines, though.
Pattern says that there are 3 numbers([0-9]) with space in between. After the 3rd number, there can be text or not, which we don't care about (.*).
val pat = "([0-9]) ([0-9]) ([0-9]).*".r
val (a,b,c) = "1 2 3 4 5" match { case pat(a,b,c) => (a,b,c) }
Output
a: String = 1
b: String = 2
c: String = 3
