I am writing a data mining algorithm in Scala and I want to write a Euclidean distance function for a given test instance against several train instances. I have an Array[Array[Double]] for the test instances and another for the train instances. I have a method that loops over each test instance against all training instances, calculates the distance between the two (picking one test and one train instance per iteration), and returns a Double.
Say, for example, I have the following data points:
testInstance = Array(Array(3.2, 2.1, 4.3, 2.8))
trainPoints = Array(Array(3.9, 4.1, 6.2, 7.3), Array(4.5, 6.1, 8.3, 3.8), Array(5.2, 4.6, 7.4, 9.8), Array(5.1, 7.1, 4.4, 6.9))
I have a method stub (highlighting the distance function) which returns neighbours around a given test instance:
def predictClass(testPoints: Array[Array[Double]], trainPoints: Array[Array[Double]], k: Int): Array[Double] = {
  for (testInstance <- testPoints) {
    for (trainInstance <- trainPoints) {
      for (i <- 0 to k) {
        distance = euclideanDistanceBetween(testInstance, trainInstance) // need help in defining this function
      }
    }
  }
  return distance
}
I know how to write a generic Euclidean Distance formula as:
math.sqrt(math.pow((x1 - y1), 2) + math.pow((x2 - y2), 2))
I have some pseudo steps as to what I want the method to do with a basic definition of the function:
def distanceBetween(testInstance: Array[Double], trainInstance: Array[Double]): Double = {
  // subtract each element of testInstance from the corresponding element of trainInstance
  // for example,
  // iteration 1 will do [Array(3.9, 4.1, 6.2, 7.3) - Array(3.2, 2.1, 4.3, 2.8)]
  // i.e. sqrt((3.9-3.2)^2 + (4.1-2.1)^2 + (6.2-4.3)^2 + (7.3-2.8)^2)
  // return result
  // iteration 2 will do [Array(4.5, 6.1, 8.3, 3.8) - Array(3.2, 2.1, 4.3, 2.8)]
  // i.e. sqrt((4.5-3.2)^2 + (6.1-2.1)^2 + (8.3-4.3)^2 + (3.8-2.8)^2)
  // return result, and so on......
}
How can I write this in code?
The formula you put in only works for two-dimensional vectors. You have four dimensions here, and you should probably write your function to be flexible about the number of dimensions anyway.
So what you really want to say is:
for each position i:
subtract the ith element of Y from the ith element of X
square it
add all of those up
square root the whole thing
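A direct, loop-based sketch of those steps (the name distanceLoop is mine, and it assumes both arrays have the same length):
import math.sqrt

def distanceLoop(xs: Array[Double], ys: Array[Double]): Double = {
  var sum = 0.0
  for (i <- xs.indices) {
    val diff = xs(i) - ys(i) // subtract the ith elements
    sum += diff * diff       // square it and add it to the running total
  }
  sqrt(sum)                  // square root the whole thing
}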
To put this in a more functional-programming style, it will look more like:
square root the:
sum of:
zip X and Y into pairs
for each pair, square the difference
So that would look like:
import math._
def distance(xs: Array[Double], ys: Array[Double]) = {
  sqrt((xs zip ys).map { case (x, y) => pow(y - x, 2) }.sum)
}
val testInstances = Array(Array(5.0, 4.8, 7.5, 10.0), Array(3.2, 2.1, 4.3, 2.8))
val trainPoints = Array(Array(3.9, 4.1, 6.2, 7.3), Array(4.5, 6.1, 8.3, 3.8), Array(5.2, 4.6, 7.4, 9.8), Array(5.1, 7.1, 4.4, 6.9))
distance(testInstances.head, trainPoints.head)
// 3.2680269276736382
As for predicting the class, you can make that more functional too, but it's unclear what the Double is that you are intending to return. It seems like you would want to predict the class for each test instance? Maybe choosing the class c corresponding to the nearest training point?
def findNearestClasses(testPoints: Array[Array[Double]], trainPoints: Array[Array[Double]]): Array[Int] = {
  testPoints.map { testInstance =>
    trainPoints.zipWithIndex.map { case (trainInstance, c) =>
      c -> distance(testInstance, trainInstance)
    }.minBy(_._2)._1
  }
}
findNearestClasses(testInstances, trainPoints)
// Array(2, 0)
Or maybe you want the k-nearest neighbors:
def findKNearestClasses(testPoints: Array[Array[Double]], trainPoints: Array[Array[Double]], k: Int): Array[Int] = {
  testPoints.map { testInstance =>
    val distances =
      trainPoints.zipWithIndex.map { case (trainInstance, c) =>
        c -> distance(testInstance, trainInstance)
      }
    val classes = distances.sortBy(_._2).take(k).map(_._1)
    val classCounts = classes.groupBy(identity).mapValues(_.size)
    classCounts.maxBy(_._2)._1
  }
}
findKNearestClasses(testInstances, trainPoints, 2)
// Array(2, 1)
The generic formula for the Euclidean distance between two points (x1, y1) and (x2, y2) is as follows:
math.sqrt(math.pow((x1 - x2), 2) + math.pow((y1 - y2), 2))
You can only compare the x coordinate with the x, and the y with the y.
I have a simple class I always implement when learning a new language, MergeSort. My implementation with type Int looks great, so I wanted to make it generic. I started with a simple implementation over T, but I noticed that I needed to reflect the ClassTag. How do I combine the reflected ClassTag with extending a trait?
class MergeSort[T: scala.reflect.ClassTag] {
  var array: Array[T] = Array[T]()
  var length: Int = 0
  var tempArray: Array[T] = new Array(length)

  def sort(data: Array[T]): Unit = {
    array = data
    length = data.length
    tempArray = new Array[T](length)
    //sort(0, length - 1)
  }
}
Now this looks nice! It works, but when I implement the sort and the rest of the functionality, I need to be able to compare two items of type T. The "Java" way was to just make sure the object has a compareTo method, so I was thinking: [T extends Comparable].
But in Scala I am already binding T with a ClassTag, and something like
class MergeSort[T: scala.reflect.ClassTag extends Comparable] {}
will error with:
']' expected, but 'extends' found.
I was thinking this would roughly be the way to do things, but I am not sure what is going on here.
The end state is to implement the merge portion of the class:
def merge(lower: Int, center: Int, upper: Int): Unit = {
  // ...
  // loop
  // if (tempArray(i) <= tempArray(j)) {}              // old way, since the first attempt was with Int
  // if (tempArray(i).compareTo(tempArray(j)) < 0) {}  // modified way with Comparable
}
Is this the Scala way of implementing it? I noticed that people mention Ordering, but I thought Comparable made sense.
The Scala way of implementing merge sort is to use List, vals, and the Ordering trait. The advantage of Ordering (the analogue of Java's Comparator) is that Scala gives you implicit orderings for all standard library types by default.
import scala.annotation.tailrec

def msort[T: Ordering](xs: List[T]): List[T] = {
  @tailrec
  def merge(xs: List[T], ys: List[T], acc: List[T] = Nil): List[T] =
    (xs, ys) match {
      case (Nil, _) => acc.reverse ++ ys
      case (_, Nil) => acc.reverse ++ xs
      case (x :: xs1, y :: ys1) =>
        if (implicitly[Ordering[T]].lt(x, y))
          merge(xs1, ys, x :: acc)
        else
          merge(xs, ys1, y :: acc)
    }

  xs match {
    case Nil | _ :: Nil => xs
    case _ =>
      val (xs1, xs2) = xs splitAt (xs.length / 2)
      merge(msort(xs1), msort(xs2))
  }
}
msort(List(4, 23, 1, 2, 5, 76, 3, 142, 4321, 213, 42323))
// List(1, 2, 3, 4, 5, 23, 76, 142, 213, 4321, 42323)
msort(List("John", "Chris", "Helen", "Danny", "Michelle"))
// List(Chris, Danny, Helen, John, Michelle)
Another advantage over Ordered is that Scala provides implicit conversions from Ordered[A] => Ordering[A], which means your custom types that mix in Ordered will work with msort without the need to define implicit orderings.
Finally, the last advantage over Ordered shows up with numeric types: Int, Double, etc. do not mix in Ordered, so you cannot sort elements of these types with an Ordered bound; this is why most people use Ordering instead.
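For example, a custom type that mixes in Ordered (a hypothetical Person type, my own illustration) works with msort out of the box:
case class Person(name: String, age: Int) extends Ordered[Person] {
  // Order people by age; Ordered[Person] is implicitly converted to an Ordering[Person] for msort.
  def compare(that: Person): Int = this.age compare that.age
}

msort(List(Person("Helen", 41), Person("Chris", 27), Person("Danny", 33)))
// List(Person(Chris,27), Person(Danny,33), Person(Helen,41))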
I'm well aware this variant is not in-place, but it does not require ClassTag at all to implement.
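If you do want to keep the array-based design from your question, the syntax for combining bounds is the part that was tripping you up; a minimal sketch (the class name and helper method are my own, not required for the msort above) would be:
import scala.reflect.ClassTag

// Sketch only: a ClassTag context bound combined with an Ordering context bound.
// (An upper bound plus a context bound would instead be written [T <: Comparable[T] : ClassTag].)
class ArrayMergeSort[T: ClassTag : Ordering] {
  private val ord = implicitly[Ordering[T]]
  var tempArray: Array[T] = new Array[T](0)

  // The comparison inside merge then becomes:
  def inOrder(i: Int, j: Int): Boolean = ord.lteq(tempArray(i), tempArray(j))
}
Either form compiles; the Ordering version keeps the advantages described above.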
I am trying to create time stamp arrays in Swift.
So, say I want to go from 0 to 4 seconds, I can use Array(0...4), which gives [0, 1, 2, 3, 4]
But how can I get [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]?
Essentially I want a flexible delta, such as 0.5, 0.05, etc.
You can use stride(from:through:by:):
let a = Array(stride(from: 0.0, through: 4.0, by: 0.5))
An alternative for non-constant increments (even more viable in Swift 3.1)
The stride(from:through:by:) function, as covered in Alexander's answer, is the fit-for-purpose solution here. But for readers of this Q&A who want to construct a sequence (/collection) with non-constant increments (a case where the linear-sequence-constructing stride(...) falls short), I'll also include another alternative.
For such scenarios, sequence(first:next:) is a good choice; it constructs a lazy sequence that can be repeatedly queried for the next element.
E.g., constructing the first 5 ticks for a log10 scale (Double array)
let log10Seq = sequence(first: 1.0, next: { 10*$0 })
let arr = Array(log10Seq.prefix(5)) // [1.0, 10.0, 100.0, 1000.0, 10000.0]
Swift 3.1 is intended to be released in the spring of 2017, and with this (among lots of other things) comes the implementation of the following accepted Swift evolution proposal:
SE-0045: Add prefix(while:) and drop(while:) to the stdlib
prefix(while:) in combination with sequence(first:next:) provides a neat tool for generating sequences with everything from simple next closures (such as imitating the behaviour of stride(...)) to more advanced ones. The stride(...) example of this question is a good minimal (very simple) example of such usage:
/* this we can do already in Swift 3.0 */
let delta = 0.5
let seq = sequence(first: 0.0, next: { $0 + delta})
/* 'prefix(while:)' soon available in Swift 3.1 */
let arr = Array(seq.prefix(while: { $0 <= 4.0 }))
// [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
// ...
for elem in sequence(first: 0.0, next: { $0 + delta})
.prefix(while: { $0 <= 4.0 }) {
// ...
}
Again, this is not competing with stride(...) in the simple case of this question, but it becomes very viable as soon as the simple applications of stride(...) fall short, e.g. for constructing non-linear sequences.
Say I have a file (like CSV, txt, ...).
I wish to get two arrays such as
Array(Array(1.0,2.0),Array(4.0,5.0),Array(7.0, 8.0),Array(10.0,11.0),Array(13.0,14.0))
and
Array(3.0, 6.0, 9.0, 12.0, 15.0)
What's the ideal way to do this in scala?
val rdd = sc.textFile("1.csv").map(_.split(',').map(_.trim().toDouble))
rdd.map(_.take(2)).collect()
res0: Array[Array[Double]] = Array(Array(1.0, 2.0), Array(4.0, 5.0), Array(7.0, 8.0), Array(10.0, 11.0), Array(13.0, 14.0))
rdd.map(_(2)).collect()
res2: Array[Double] = Array(3.0, 6.0, 9.0, 12.0, 15.0)
You can get both arrays in one go, so that you don't need to traverse the data twice:
val (first, second) = {
  io.Source.fromFile(name).getLines
    .map(_.split(",").map(_.toDouble))
    .foldRight(Seq.empty[Array[Double]] -> Seq.empty[Double]) {
      case (Array(x, y, z), (as, bs)) => (Array(x, y) +: as, z +: bs)
    }
}
Now, you end up with two lists rather than arrays. If that matters to you, first.toArray and second.toArray will do the conversion for you.
Similar to Vitaliy Kotlyarenko's answer, but without using third-party libraries like Spark (Spark is great if your data is large, but overkill otherwise):
val lines: Iterator[String] = scala.io.Source.fromFile("txt.csv").getLines()
val matrix: Array[Array[Double]] = lines.map(_.split(",").map(_.trim.toDouble)).toArray
val twoFirstColumns: Array[Array[Double]] = matrix.map(_.take(2))
val thirdColumn: Array[Double] = matrix.map(_(2))
I want to randomly sample from a Scala list or array (not an RDD); the sample size can be much larger than the length of the list or array. How can I do this efficiently? The sample size can be very big, and the sampling (on different lists/arrays) needs to be done a large number of times.
I know for a Spark RDD we can use takeSample() to do it, is there an equivalent for Scala list/array?
Thank you very much.
An easy-to-understand version would look like this:
import scala.util.Random
Random.shuffle(list).take(n)
Random.shuffle(array.toList).take(n)
// Seeded version
val r = new Random(seed)
r.shuffle(...)
For arrays:
import scala.util.Random
import scala.reflect.ClassTag
def takeSample[T: ClassTag](a: Array[T], n: Int, seed: Long) = {
  val rnd = new Random(seed)
  Array.fill(n)(a(rnd.nextInt(a.size)))
}
Make a random number generator (rnd) based on your seed. Then fill an array of length n with random indices between 0 (inclusive) and the size of your input array (exclusive).
The last step is applying each random index to the indexing operator of your input array. Using it in the REPL could look as follows:
scala> val myArray = Array(1,3,5,7,8,9,10)
myArray: Array[Int] = Array(1, 3, 5, 7, 8, 9, 10)
scala> takeSample(myArray, 20, System.currentTimeMillis)
res0: Array[Int] = Array(7, 8, 7, 3, 8, 3, 9, 1, 7, 10, 7, 10, 1, 1, 3, 1, 7, 1, 3, 7)
For lists, I would simply convert the list to an Array and use the same function; I doubt you can get much more efficient for lists anyway.
It is important to note that the same function written against lists would take O(n^2) time (list indexing is linear), whereas converting the list to an array first takes O(n) time, as sketched below.
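A minimal wrapper along those lines (the name takeSampleFromList is mine, reusing the takeSample defined above):
import scala.reflect.ClassTag

// Convert once to an array (O(n)), then sample by constant-time indexing.
def takeSampleFromList[T: ClassTag](l: List[T], n: Int, seed: Long): Array[T] =
  takeSample(l.toArray, n, seed)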
If you want to sample without replacement: zip with randoms, sort (O(n*log(n))), discard the randoms, take:
import scala.util.Random
val l = Seq("a", "b", "c", "d", "e")
val ran = l.map(x => (Random.nextFloat(), x))
  .sortBy(_._1)
  .map(_._2)
  .take(3)
Using a for comprehension, for a given array xs, as follows:
for (i <- 1 to sampleSize; r = (Math.random * xs.size).toInt) yield xs(r)
Note the random generator here produces values within the unit interval, which are scaled to range over the size of the array, and converted to Int for indexing over the array.
Note For pure functional random generator consider for instance the State Monad approach from Functional Programming in Scala, discussed here.
Note Consider also NICTA, another pure functional random value generator, it's use illustrated for instance here.
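For illustration only, a minimal sketch of that purely functional style, modelled on the RNG from Functional Programming in Scala (the sampling helper and its name are my own assumption):
trait RNG { def nextInt: (Int, RNG) }

case class SimpleRNG(seed: Long) extends RNG {
  // Linear congruential generator: returns the next pseudo-random Int and the next generator state.
  def nextInt: (Int, RNG) = {
    val newSeed = (seed * 0x5DEECE66DL + 0xBL) & 0xFFFFFFFFFFFFL
    ((newSeed >>> 16).toInt, SimpleRNG(newSeed))
  }
}

// Hypothetical helper: draws n elements of xs with replacement, threading the generator state explicitly.
def sampleWithReplacement[T](xs: Vector[T], n: Int, rng: RNG): (List[T], RNG) =
  (1 to n).foldLeft((List.empty[T], rng)) { case ((acc, r), _) =>
    val (i, r2) = r.nextInt
    (xs(math.abs(i % xs.size)) :: acc, r2)
  }
The point is that randomness is threaded explicitly as a value, so the function stays referentially transparent.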
Using classical recursion.
import scala.util.Random
def takeSample[T](a: List[T], n: Int): List[T] = {
  n match {
    case n: Int if n <= 0 => List.empty[T]
    case n: Int => a(Random.nextInt(a.size)) :: takeSample(a, n - 1)
  }
}
package your.pkg

import your.pkg.SeqHelpers.SampleOps

import scala.collection.generic.CanBuildFrom
import scala.collection.mutable
import scala.language.{higherKinds, implicitConversions}
import scala.util.Random

trait SeqHelpers {
  implicit def withSampleOps[E, CC[_] <: Seq[_]](cc: CC[E]): SampleOps[E, CC] = SampleOps(cc)
}

object SeqHelpers extends SeqHelpers {

  case class SampleOps[E, CC[_] <: Seq[_]](cc: CC[_]) {

    private def recurse(n: Int, builder: mutable.Builder[E, CC[E]]): CC[E] = n match {
      case 0 => builder.result
      case _ =>
        val element = cc(Random.nextInt(cc.size)).asInstanceOf[E]
        recurse(n - 1, builder += element)
    }

    def sample(n: Int)(implicit cbf: CanBuildFrom[CC[_], E, CC[E]]): CC[E] = {
      require(n >= 0, "Cannot take less than 0 samples")
      recurse(n, cbf.apply)
    }
  }
}
Either:
Mix in SeqHelpers (for example, into a ScalaTest spec), or
Include import your.pkg.SeqHelpers._
Then the following should work:
Seq(1 to 100: _*) sample 10 foreach { println }
Edits to remove the cast are welcome.
Also if there is a way to create an empty instance of the collection for the accumulator, without knowing the concrete type ahead of time, please comment. That said, the builder is probably more efficient.
I did not test for performance, but the following code is a simple and elegant way to do the sampling, and I believe it can help many who come here just to get sampling code. Just change the range according to the size of your end sample. If pseudo-randomness is not enough for your needs, you can use take(1) on the inner list and increase the range.
Random.shuffle((1 to 100).toList.flatMap(x => (Random.shuffle(yourList))))
I am a beginner to functional programming and Scala. I have an array of arrays containing Double values. I want to subtract elements (basically of two arrays, see the example below), and I am unable to find out online how to do this.
For example, consider
val instance = Array(Array(2.1, 3.4, 5.6),
Array(4.4, 7.8, 6.7))
I want to subtract 4.4 from 2.1, 7.8 from 3.4 and 6.7 from 5.6
Is this possible in Scala?
Apologies if the question seems very basic but any guidance in the right direction would be appreciated. Thank you for your time.
You can use .zip:
scala> instance(1).zip(instance(0)).map{ case (a,b) => a - b}
res3: Array[Double] = Array(2.3000000000000003, 4.4, 1.1000000000000005)
instance(1).zip(instance(0)) makes an array of tuples Array((4.4,2.1), (7.8,3.4), (6.7,5.6)) from corresponding pairs in your arrays.
.map{ case (a,b) => a - b} or .map(x => x._1 - x._2) then performs the subtraction for every tuple.
I would also recommend using a tuple instead of your top-level array:
val instance = (Array(2.1, 3.4, 5.6), Array(4.4, 7.8, 6.7))
So now, with a few additional definitions, it looks much better:
scala> val (a,b) = instance
a: Array[Double] = Array(2.1, 3.4, 5.6)
b: Array[Double] = Array(4.4, 7.8, 6.7)
scala> val sub = (_: Double) - (_: Double) //defined it as function, not method
sub: (Double, Double) => Double = <function2>
scala> a zip b map sub.tupled
res20: Array[Double] = Array(2.3000000000000003, 4.4, 1.1000000000000005)
*sub.tupled converts the sub function into one that receives a single tuple of two elements instead of two separate parameters.
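For instance, a quick check of what tupled does (the values here just reuse the first pair from above):
sub(4.4, 2.1)          // 2.3000000000000003
sub.tupled((4.4, 2.1)) // 2.3000000000000003, same result, but taking one (Double, Double) tuple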