In Scala, how can I get the max date from an array of maps where one of the keys in the map is "date"?

I have an array of maps, and I would like to find the maximum date in it. I think I'm heading down a non-Scala path because I'm not sure how to wire the pieces of this question together.
Is there a better way of doing this? I'm concerned that I have to resort to casting the value to a Date for comparison, but that is what's in the map, and the map holds other value types as well (so Map[String, Object] is what I have).
import java.text.SimpleDateFormat
import java.util.Date

val df = new SimpleDateFormat("yyyy-MM-dd")
def omap = List(
  Map("date" -> df.parse("2013-08-01")),
  Map("date" -> df.parse("2013-02-01"), "otherkey" -> "nothing special"),
  Map("date" -> df.parse("2013-01-01")))

omap.max(new Ordering[Map[String, Object]] {
  def compare(x: Map[String, Object], y: Map[String, Object]) =
    x.get("date").get.asInstanceOf[Date] compareTo y.get("date").get.asInstanceOf[Date]
})
The code seems to work, but I feel like I'm missing a more Scala-like way of doing this.

This little one-liner works, but it will throw an exception if there is no date in any map:
omap.flatMap(map => map.get("date").collect({case d:Date => d})).max
Here's a safer version, but you have to provide a default date:
val defaultDate = new Date()
omap.map(map => map.get("date").collect({ case d: Date => d }))
  .foldLeft(defaultDate)((default, od) => od.fold(default)(d => if (d.compareTo(default) > 0) d else default))
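On Scala 2.13 or later you can also skip the default date entirely, since maxOption returns None when no date is found. A small sketch, building on the flatMap one-liner above:
// Scala 2.13+: Option[Date] instead of an exception or a default.
omap.flatMap(_.get("date").collect { case d: Date => d }).maxOption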

If you have a minimal number of types, you should probably make them explicit with something like Either. But I'll assume you have too many for that.
Then, the more canonical way to get your value is something like this (to replace your omap.max):
val defaultDate = df.parse("1970-01-01")
def dateFrom(m: Map[String, Object]) =
m.get("date").collect{ case d: Date => d }.getOrElse(defaultDate)
omap.maxBy(dateFrom)
This protects you from missing entries by lumping everything without a date together at the early default date.
If you don't have any sensible default date, then
def dateFrom(m: Map[String, Object]) = m.get("date").collect{ case d: Date => d }
omap.max(new Ordering[Map[String, Object]] {
  def compare(x: Map[String, Object], y: Map[String, Object]) = {
    (dateFrom(x), dateFrom(y)) match {
      case (Some(a), Some(b)) => a compareTo b
      case (_, None) => -1
      case _ => 1
    }
  }
})
to shuffle all dateless maps to the end.
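Another option (just a sketch) is to lean on the standard library's Ordering[Option], in which None sorts before Some, so a dateless map simply loses against any map that has a date:
// Uses the implicit Ordering[Option[Date]]: None < Some, so dated maps win.
omap.maxBy(m => m.get("date").collect { case d: Date => d })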

omap.maxBy{ _("date").asInstanceOf[Date] }
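Like the first one-liner, this throws if a map has no "date" entry, and max-style calls also throw on an empty list. On Scala 2.13+ the empty-list case at least can be handled with maxByOption; a minimal sketch:
// Scala 2.13+: yields None for an empty list instead of throwing
// (still assumes every map has a "date" entry holding a Date).
omap.maxByOption(_("date").asInstanceOf[Date])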

Related

GenericRowWithSchema ClassCastException in Spark 3 Scala UDF for Array data

I am writing a Spark 3 UDF to mask an attribute in an Array field.
My data (in parquet, but shown in a JSON format):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
case class:
case class MyClass(conditions: Seq[MyItem])
case class MyItem(code: String, category: String)
Spark code:
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
import spark.implicits._
val rdd = spark.sparkContext.parallelize(data)
val ds = rdd.toDF().as[MyClass]
val maskedConditions: Column = updateArray.apply(col("conditions"))
ds.withColumn("conditions", maskedConditions)
  .select("conditions")
  .show(2)
I tried the following UDF function.
UDF code:
def updateArray = udf((arr: Seq[MyItem]) => {
  for (i <- 0 to arr.size - 1) {
    // Line 3: the cast to GenericRowWithSchema
    val a = arr(i).asInstanceOf[org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema]
    // without the cast above I would just use: val a = arr(i)
    println(a.getAs[MyItem](0))
    // TODO: How to make code = "XXXX" here
    // a.code = "XXXX"
  }
  arr
})
Goal:
I need to set 'code' field value in each array item to "XXXX" in a UDF.
Issue:
I am unable to modify the array fields.
Also, I get the following error if I remove line 3 in the UDF (the cast to GenericRowWithSchema).
Error:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to MyItem
Question: How to capture Array of Structs in a function and how to return a modified array of items?
Welcome to Stack Overflow!
There was a small JSON linting error in your data: I assumed that you wanted to close the [] square brackets of the list array. So for this example I used the following data (the same as yours, with the brackets closed):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
You don't need UDFs for this: a simple map operation will be sufficient! The following code does what you want:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyElement(element: MyItem)
case class MyList(list: Seq[MyElement])
case class MyClass(conditions: MyList)
val df = spark.read.json("./someData.json").as[MyClass]
val transformedDF = df.map {
  case MyClass(MyList(list)) => MyClass(MyList(list.map {
    case MyElement(item) => MyElement(MyItem(code = "XXXX", item.category))
  }))
}
transformedDF.show(false)
+--------------------------------+
|conditions |
+--------------------------------+
|[[[[XXXX, ABC]], [[XXXX, EDC]]]]|
+--------------------------------+
As you can see, we're doing some simple pattern matching on the case classes we've defined and setting all of the code fields' values to "XXXX". If you want to get JSON back, you can call the to_json function like so:
transformedDF.select(to_json($"conditions")).show(false)
+----------------------------------------------------------------------------------------------------+
|structstojson(conditions) |
+----------------------------------------------------------------------------------------------------+
|{"list":[{"element":{"code":"XXXX","category":"ABC"}},{"element":{"code":"XXXX","category":"EDC"}}]}|
+----------------------------------------------------------------------------------------------------+
Finally, a very small remark about the data. If you have any control over how the data gets made, I would offer the following suggestions:
The conditions JSON object seems to serve no purpose here, since it just contains a single array called list. Consider making the conditions object the array itself, which would allow you to discard the list name. That would simplify your structure.
The element object does nothing except contain a single item. Consider removing one level of abstraction there too.
With these suggestions, your data would contain the same information but look something like:
{"conditions":[{"code":"1234","category":"ABC"},{"code":"4550","category":"EDC"}]}
With these suggestions, you would also remove the need for the MyElement and MyList case classes! But very often we're not in control of the data we receive, so this is just a small disclaimer :)
Hope this helps!
EDIT: After your addition of simplified data according to the above suggestions, the task gets even easier. Again, you only need a map operation here:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
val df = data.toDF.as[MyClass]
val transformedDF = df.map {
  case MyClass(conditions) => MyClass(conditions.map {
    item => MyItem("XXXX", item.category)
  })
}
transformedDF.show(false)
+--------------------------+
|conditions |
+--------------------------+
|[[XXXX, ABC], [XXXX, EDC]]|
+--------------------------+
I was able to find a simple solution with Spark 3.1+, as new features were added in that version.
Updated code:
val data = Seq(
  MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("234", "KBC"))),
  MyClass(conditions = Seq(MyItem("4550", "DTC"), MyItem("900", "RDT")))
)

import spark.implicits._
val ds = data.toDF()

val updatedDS = ds.withColumn(
  "conditions",
  transform(
    col("conditions"),
    x => x.withField("code", updateArray(x.getField("code")))))

updatedDS.show()
UDF:
def updateArray = udf((oldVal: String) => {
  if (oldVal.contains("1234"))
    "XXXX"
  else
    oldVal
})
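For what it's worth, on Spark 3.1+ the same masking can be written entirely with Column functions, so no UDF is needed at all; a rough sketch using the same ds as above:
import org.apache.spark.sql.functions.{col, lit, transform, when}

// Same masking rule expressed as a Column expression instead of a UDF.
val updatedDS = ds.withColumn(
  "conditions",
  transform(
    col("conditions"),
    x => x.withField(
      "code",
      when(x.getField("code").contains("1234"), lit("XXXX"))
        .otherwise(x.getField("code")))))

updatedDS.show()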

map reduce sum item weights in a string

I have a string like the following:
s = "eggs 103.24,eggs 345.22,milk 231.25,widgets 123.11,milk 14.2"
such that item/weight pairs are separated by commas, and within each pair the item name and the weight are separated by a space. I want to get the sum of the weights for each item:
//scala.collection.immutable.Map[String,Double] = Map(eggs -> 448.46, milk -> 245.45, widgets -> 123.11)
I have done the following but got stuck on the steps of separating out the item and its weight:
s.split(",").map(w=>(w,1)).sortWith(_._1 < _._1)
//Array[(String, Int)] = Array((eggs 103.24,1), (eggs 345.22,1), (milk 14.2,1), (milk 231.25,1), (widgets 123.11,1))
I think that to proceed, for each element in the array I need to separate out the item name and the weight, but when I tried the following I got quite confused:
s.split(",").map(w=>(w,1)).sortWith(_._1 < _._1).map(w => w._1.split(" ") )
//Array[Array[String]] = Array(Array(eggs, 103.24), Array(eggs, 345.22), Array(milk, 14.2), Array(milk, 231.25), Array(widgets, 123.11))
I am not sure what the next steps should be to finish the calculation.
If you are guaranteed to have the string in this format (so no exception or edge-case handling), you can do something like this:
val s = "eggs 103.24,eggs 345.22,milk 231.25,widgets 123.11,milk 14.2"
val result = s
  .split(",")                                   // array of strings like "eggs 103.24"
  .map(_.split(" "))                            // arrays like ["eggs", "103.24"]
  .map { case Array(x, y) => (x, y.toDouble) }  // convert to tuples (key, number)
  .groupBy(_._1)                                // group by key
  .map(t => (t._1, t._2.map(_._2).sum))         // sum each group, results in Map(eggs -> 448.46, ...)
Similar to what @GuruStron proposed, but handling possible errors (by just ignoring any kind of malformed data). Also, this one requires Scala 2.13+; older versions won't work.
def mapReduce(data: String): Map[String, Double] =
  data
    .split(',')
    .iterator
    .map(_.split(' '))
    .collect {
      case Array(key, value) =>
        key.trim.toLowerCase -> value.toDoubleOption.getOrElse(default = 0.0)
    }.toList
    .groupMapReduce(_._1)(_._2)(_ + _)
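For reference, a quick usage sketch on the question's sample string (exact decimals are subject to floating-point rounding):
val s = "eggs 103.24,eggs 345.22,milk 231.25,widgets 123.11,milk 14.2"
val totals = mapReduce(s)
// roughly Map(eggs -> 448.46, milk -> 245.45, widgets -> 123.11); entry order is not guaranteed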

Scala convert WrappedArray or Array[Any] to Array[String]

I've been trying to convert an RDD to a DataFrame. For that, the types need to be defined and not Any. I'm using Spark MLlib's PrefixSpan; that's where freqSequence.sequence comes from. I start with a DataFrame that contains session IDs, views and purchases as string arrays:
viewsPurchasesGrouped: org.apache.spark.sql.DataFrame =
[session_id: decimal(29,0), view_product_ids: array[string], purchase_product_ids: array[string]]
I then calculate frequent patterns and need them in a dataframe so I can write them to a Hive table.
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row => Array(Array(row(1)), Array(row(2))) )

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.001)
  .setMaxPatternLength(2)
val model = prefixSpan.run(viewsPurchasesRddString)

val freqSequencesRdd = sc.parallelize(model.freqSequences.collect())
case class FreqSequences(views: Array[String], purchases: Array[String], support: Long)

val viewsPurchasesDf = freqSequencesRdd.map( fs => {
  val views = fs.sequence(0)(0)
  val purchases = fs.sequence(1)(0)
  val freq = fs.freq
  FreqSequences(views, purchases, freq)
})
viewsPurchasesDf.toDF() // optional
When I try to run this, views and purchases are "Any" instead of "Array[String]". I've desperately tried to convert them around, but the best I get is Array[Any]. I think I need to map the contents to a String; I've tried e.g. this: How to get an element in WrappedArray: result of Dataset.select("x").collect()? and this: How to cast a WrappedArray[WrappedArray[Float]] to Array[Array[Float]] in spark (scala) and thousands of other Stack Overflow questions...
I really don't know how to solve this. I guess I'm already converting the initial DataFrame/RDD too much, but I can't understand where.
I think the problem is that you have a DataFrame, which retains no static type information. When you take an item out of a Row, you have to tell it explicitly which type you expect to get.
Untested, but inferred from the information you gave:
import scala.collection.mutable.WrappedArray
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
  Array(
    Array(row.getAs[WrappedArray[String]](1).toArray),
    Array(row.getAs[WrappedArray[String]](2).toArray)
  )
)
I solved the problem. For reference, this works:
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
  Array(
    row.getSeq[Long](1).toArray,
    row.getSeq[Long](2).toArray
  )
)

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.001)
  .setMaxPatternLength(2)
val model = prefixSpan.run(viewsPurchasesRddString)

case class FreqSequences(views: Long, purchases: Long, frequence: Long)

val ps_frequences = model.freqSequences.filter(fs => fs.sequence.length > 1).map( fs => {
  val views = fs.sequence(0)(0)
  val purchases = fs.sequence(1)(0)
  val freq = fs.freq
  FreqSequences(views, purchases, freq)
})

ps_frequences.toDF()
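Since the original goal was to write the frequent patterns to a Hive table, a hypothetical final step (the table name here is made up) might look like this:
// Hypothetical table name; adjust to your own Hive database/table.
ps_frequences.toDF()
  .write
  .mode("overwrite")
  .saveAsTable("frequent_view_purchase_patterns")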

How to use Slick's mapped tables with foreign keys?

I'm struggling with Slick's lifted embedding and mapped tables. The API feels strange to me, maybe just because it is structured in a way that's unfamiliar to me.
I want to build a task/to-do list. There are two entities:
Task: Each task has an optional reference to the next task; that way a linked list is built. The intention is that the user can order the tasks by priority. This order is represented by the references from task to task.
TaskList: Represents a TaskList with a label and a reference to the first Task of the list.
case class Task(id: Option[Long], title: String, nextTask: Option[Task])
case class TaskList(label: String, firstTask: Option[Task])
Now I tried to write a data access object (DAO) for these two entities.
import scala.slick.driver.H2Driver.simple._
import slick.lifted.MappedTypeMapper

implicit val session: Session = Database.threadLocalSession

val queryById = Tasks.createFinderBy( t => t.id )

def task(id: Long): Option[Task] = queryById(id).firstOption

private object Tasks extends Table[Task]("TASKS") {
  def id = column[Long]("ID", O.PrimaryKey, O.AutoInc)
  def title = column[String]("TITLE")
  def nextTaskId = column[Option[Long]]("NEXT_TASK_ID")
  def nextTask = foreignKey("NEXT_TASK_FK", nextTaskId, Tasks)(_.id)
  def * = id ~ title ~ nextTask <> (Task, Task.unapply _)
}

private object TaskLists extends Table[TaskList]("TASKLISTS") {
  def label = column[String]("LABEL", O.PrimaryKey)
  def firstTaskId = column[Option[Long]]("FIRST_TASK_ID")
  def firstTask = foreignKey("FIRST_TASK_FK", firstTaskId, Tasks)(_.id)
  def * = label ~ firstTask <> (TaskList, TaskList.unapply _)
}
Unfortunately it does not compile. The problems are in the * projections of both tables, at nextTask and firstTask respectively.
could not find implicit value for evidence parameter of type scala.slick.lifted.TypeMapper[scala.slick.lifted.ForeignKeyQuery[SlickTaskRepository.this.Tasks.type,justf0rfun.bookmark.model.Task]]
could not find implicit value for evidence parameter of type scala.slick.lifted.TypeMapper[scala.slick.lifted.ForeignKeyQuery[SlickTaskRepository.this.Tasks.type,justf0rfun.bookmark.model.Task]]
I tried to solve that with the following TypeMapper, but that does not compile either.
implicit val taskMapper = MappedTypeMapper.base[Option[Long], Option[Task]](
  option => option match {
    case Some(id) => task(id)
    case _ => None
  },
  option => option match {
    case Some(task) => task.id
    case _ => None
  })
could not find implicit value for parameter tm: scala.slick.lifted.TypeMapper[Option[justf0rfun.bookmark.model.Task]]
not enough arguments for method base: (implicit tm: scala.slick.lifted.TypeMapper[Option[justf0rfun.bookmark.model.Task]])scala.slick.lifted.BaseTypeMapper[Option[Long]]. Unspecified value parameter tm.
Main question: How do I use Slick's lifted embedding and mapped tables the right way? How do I get this to work?
Thanks in advance.
The short answer is: Use ids instead of object references and use Slick queries to dereference ids. You can put the queries into methods for re-use.
That would make your case classes look like this:
case class Task(id: Option[Long], title: String, nextTaskId: Option[Long])
case class TaskList(label: String, firstTaskId: Option[Long])
I'll publish an article about this topic at some point and link it here.
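A rough sketch of what that can look like with the Slick 1.x lifted embedding used in the question (details such as .? and the projection syntax vary between Slick versions):
private object Tasks extends Table[Task]("TASKS") {
  def id = column[Long]("ID", O.PrimaryKey, O.AutoInc)
  def title = column[String]("TITLE")
  def nextTaskId = column[Option[Long]]("NEXT_TASK_ID")
  def nextTask = foreignKey("NEXT_TASK_FK", nextTaskId, Tasks)(_.id)
  // Map plain columns only; the foreign key stays out of the row projection.
  def * = id.? ~ title ~ nextTaskId <> (Task, Task.unapply _)
}

// Dereference the id on demand, reusing the task(id) finder from the question.
def nextTaskOf(t: Task): Option[Task] = t.nextTaskId.flatMap(task)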

How do I remove duplicate values from my Multidimensional array in a Scala way?

I'm trying to extract some values from a String. The string contains several lines of values; the values on each line are a number, a first name and a last name. Then I want to filter by a given pattern and remove the duplicate numbers.
This is my test:
test("Numbers should be unique") {
val s = Cool.prepareListAccordingToPattern(ALLOWED_PATTERN, "1234,örjan,nilsson\n4321,eva-lisa,nyman\n1234,eva,nilsson")
assert(s.length == 2, "Well that didn't work.. ")
info("Chopping seems to work. Filtered duplicate numbers. Expected 1234:4321, got: "+s(0)(0)+":"+s(1)(0))
}
The methods:
def prepareListAccordingToPattern(allowedPattern: String, s: String): Array[Array[String]] = {
  val lines = chop("\n", s)
  val choppedUp = lines.map(line =>
    chop(",", line)).filter(array =>
    array.length == 3 && array(0).matches(allowedPattern)
  )
  choppedUp
}

def chop(splitSymbol: String, toChop: String): Array[String] = {
  toChop.split(splitSymbol)
}
My test fails as expected since I receive back a multidimensional array with duplicates:
[0]["1234","örjan","nilsson"]
[1]["4321","eva-lisa","nyman"]
[2]["1234","eva","nilsson"]
What I would like to do is to filter out the duplicated numbers, in this case "1234"
so that I get back:
[0]["1234","örjan","nilsson"]
[1]["4321","eva-lisa","nyman"]
How should I do this in a Scala way? Or maybe I should attack this problem differently?
val arr = Array(
  Array("1234", "örjan", "nilsson"),
  Array("4321", "eva-lisa", "nyman"),
  Array("1234", "eva", "nilsson")
)

arr.groupBy( _(0) ).map { case (_, vs) => vs.head }.toArray
// Array(Array(1234, örjan, nilsson), Array(4321, eva-lisa, nyman))
If you have a collection of elements (in this case an Array of Array[String]) and want to keep a single element for every value of some property (here the property is the first string of each Array[String]), you should group the elements of the collection by this property (arr.groupBy( _(0) )) and then somehow select one element from every group. In this case we picked the first element (Array[String]) from every group.
If you want to select any (not necessarily the first) element for every group, you could convert every element (Array[String]) to a key-value pair ((String, Array[String])) where the key is the value of the target property, and then convert this collection of pairs to a Map:
val myMap = arr.map{ a => a(0) -> a }.toMap
// Map(1234 -> Array(1234, eva, nilsson), 4321 -> Array(4321, eva-lisa, nyman))
myMap.values.toArray
// Array(Array(1234, eva, nilsson), Array(4321, eva-lisa, nyman))
In this case you'll get the last element from every group.
A bit implicit, but should work:
val arr = Array(
  Array("1234", "örjan", "nilsson"),
  Array("4321", "eva-lisa", "nyman"),
  Array("1234", "eva", "nilsson")
)

arr.view.reverse.map(x => x.head -> x).toMap.values
// Iterable[Array[String]] = MapLike(Array(1234, örjan, nilsson), Array(4321, eva-lisa, nyman))
Reverse here to override "eva","nilsson" with "örjan","nilsson", not vice versa.
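On Scala 2.13+ there is also distinctBy, which keeps the first row per number and preserves the original order; a small sketch:
val arr = Array(
  Array("1234", "örjan", "nilsson"),
  Array("4321", "eva-lisa", "nyman"),
  Array("1234", "eva", "nilsson")
)

// Keeps the first occurrence for each number, in the original order.
arr.distinctBy(_(0))
// Array(Array(1234, örjan, nilsson), Array(4321, eva-lisa, nyman))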
