I am using Apache Flink for batch processing, and I have a sorted 10-element dataset consisting of movie IDs and their ratings. It looks something like (movie_id, rating):
(318,4.429022082018927)
(858,4.2890625)
(2959,4.272935779816514)
(1276,4.271929824561403)
(750,4.268041237113402)
(904,4.261904761904762)
(1221,4.25968992248062)
(48516,4.252336448598131)
(1213,4.25)
(912,4.24)
I have to join this dataset with a movies dataset that contains the movie ID and the movie name, like:
(1285,Heathers (1989))
(1286,Somewhere in Time (1980))
(1287,Ben-Hur (1959))
(1288,This Is Spinal Tap (1984))
The code I am using for the join is:
movies.join(sorted) // sorted represents the sorted ratings dataset
    .where(0) // joining movie_id of movies to movie_id of ratings
    .equalTo(0)
    .with(new JoinFunction<Tuple2<Long, String>, Tuple2<Long, Double>, Tuple3<Long, String, Double>>() {
        @Override
        public Tuple3<Long, String, Double> join(
                Tuple2<Long, String> movie,
                Tuple2<Long, Double> rating) throws Exception {
            return new Tuple3<>(movie.f0, movie.f1, rating.f1);
        }
    }).print();
// movies row: movie_id, movie name
// sorted row: movie_id, rating
When I print this dataset, the output is in random order, like:
(904,Rear Window (1954),4.261904761904762)
(318,Shawshank Redemption, The (1994),4.429022082018927)
(48516,Departed, The (2006),4.252336448598131)
(858,Godfather, The (1972),4.2890625)
(2959,Fight Club (1999),4.272935779816514)
(912,Casablanca (1942),4.24)
(1276,Cool Hand Luke (1967),4.271929824561403)
Here I would expect the movie Shawshank Redemption (id 318) to be placed above Rear Window (id 904), based on the descending order of rating. I am following a course online and the dataset is available here. Can anyone help me correct my logic?
I'm trying to test a single API that internally does different things based on inputs:
country
customer
amount of items
The following simulation is what I came up with:
val countries = List("US", "CAN")
val customerTypes = List("TYPE1", "TYPE2")
val basketSizes = List(1, 10, 50)

val scenarioGenerator: Seq[(String, String, Int)] = for {
  country <- countries
  customerType <- customerTypes
  basketSize <- basketSizes
} yield (country, customerType, basketSize)

def scenarios(): Seq[PopulationBuilder] = {
  var scenarioList = new ArraySeq[PopulationBuilder](countries.size * customerTypes.size * basketSizes.size)
  var i = 0
  for ((country: String, customerType: String, basketSize: Int) <- scenarioGenerator) {
    // fetch customer data for scenario
    val customers = DataFetcher.customerRequest(country, customerType)
    // fetch product data for scenario
    val products = DataFetcher.productRequest(country)
    // generate a scenario with given data and parameters
    val scen = scenario(s"Pricing-(${country},${customerType},${basketSize})")
      // feeder that creates the request object for the gatling user
      .feed(new PricingFeeder(country, customers, products, basketSize))
      .repeat(10) {
        exec(Pricing.price)
          .pause(500 milliseconds)
      }
      .inject(
        rampUsers(10) over (10 seconds)
      )
    scenarioList(i) = scen
    i = i + 1
  }
  scenarioList
}

setUp(scenarios: _*).protocols(httpProto)
This is run with the Maven plugin (and tracked in Jenkins using the Gatling plugin), but it results in a single tracked case: Pricing. That is not very useful, since even the item count alone should produce a roughly linear increase in response time.
The simulation.log has the data for each scenario type, but the out-of-the-box reporting treats it as a single type of query and merges all the results into a single graph, which makes it impossible to see whether a certain combination causes a spike due to a calculation or data bug.
I'd like to get separate metrics for each of the combinations, so it would be easy to see, for example, that a code or data change in the API caused a latency spike in the Pricing-(US,TYPE1,50) scenario.
What is the idiomatic way of achieving this with Gatling? I don't want to create a simulation for each case, as that would be a nightmare to manage (getting rid of manually managed data and Jenkins jobs with JMeter is exactly what we are trying to achieve).
First of all, it is not good practice to run so many scenarios in one simulation, because they run in parallel, not sequentially, so be sure that this is what you want.
If so, you can use the fact that the Gatling report can show graphs per group. Wrap all your requests in a group named after the parameters; that way, in the detailed view of the report, you will be able to select which group to show, e.g.:
val singleScenario = scenario(s"Pricing-(${country},${customerType},${basketSize})")
  .group(s"Pricing-(${country},${customerType},${basketSize})") {
    feed(new PricingFeeder(country, customers, products, basketSize))
      .repeat(10) {
        exec(Pricing.price)
          .pause(500 milliseconds)
      }
  }
If you do not need all scenarios to run in parallel and want separate reports for separate scenarios, the best way is to implement the simulation as a parametrized abstract class and add a separate subclass for each parameter set, since in Gatling one simulation equals one report, e.g.:
package com.performance.project.simulations

import io.gatling.core.Predef._
import scala.concurrent.duration._

class UsType1Simulation1 extends ParametrizedSimulation("US", "TYPE1", 1)
class UsType1Simulation10 extends ParametrizedSimulation("US", "TYPE1", 10)
class UsType1Simulation50 extends ParametrizedSimulation("US", "TYPE1", 50)
class UsType2Simulation1 extends ParametrizedSimulation("US", "TYPE2", 1)
class UsType2Simulation10 extends ParametrizedSimulation("US", "TYPE2", 10)
class UsType2Simulation50 extends ParametrizedSimulation("US", "TYPE2", 50)
class CanType1Simulation1 extends ParametrizedSimulation("CAN", "TYPE1", 1)
class CanType1Simulation10 extends ParametrizedSimulation("CAN", "TYPE1", 10)
class CanType1Simulation50 extends ParametrizedSimulation("CAN", "TYPE1", 50)
class CanType2Simulation1 extends ParametrizedSimulation("CAN", "TYPE2", 1)
class CanType2Simulation10 extends ParametrizedSimulation("CAN", "TYPE2", 10)
class CanType2Simulation50 extends ParametrizedSimulation("CAN", "TYPE2", 50)

sealed abstract class ParametrizedSimulation(country: String, customerType: String, basketSize: Int) extends Simulation {
  val customers = DataFetcher.customerRequest(country, customerType)
  val products = DataFetcher.productRequest(country)

  val singleScenario = scenario(s"Pricing-(${country},${customerType},${basketSize})")
    .feed(new PricingFeeder(country, customers, products, basketSize))
    .repeat(10) {
      exec(Pricing.price)
        .pause(500 milliseconds)
    }
    .inject(
      rampUsers(10) over (10 seconds)
    )

  setUp(singleScenario).protocols(httpProto)
}
Of course, this makes sense only with a small number of combinations; with hundreds of them it will get messy.
I would like to aggregate a stream of trades into windows of equal trade volume, where volume is the sum of the trade sizes of all the trades in the interval.
I was able to write a custom Trigger that partitions the data into windows. Here is the code:
case class Trade(key: Int, millis: Long, time: LocalDateTime, price: Double, size: Int)

class VolumeTrigger(triggerVolume: Int, config: ExecutionConfig) extends Trigger[Trade, Window] {
  val LOG: Logger = LoggerFactory.getLogger(classOf[VolumeTrigger])
  val stateDesc = new ValueStateDescriptor[Double]("volume", createTypeInformation[Double].createSerializer(config))

  override def onElement(event: Trade, timestamp: Long, window: Window, ctx: TriggerContext): TriggerResult = {
    val volume = ctx.getPartitionedState(stateDesc)
    if (volume.value == null) {
      volume.update(event.size)
      return TriggerResult.CONTINUE
    }
    volume.update(volume.value + event.size)
    if (volume.value < triggerVolume) {
      TriggerResult.CONTINUE
    } else {
      volume.update(volume.value - triggerVolume)
      TriggerResult.FIRE_AND_PURGE
    }
  }

  override def onEventTime(time: Long, window: Window, ctx: TriggerContext): TriggerResult = {
    TriggerResult.FIRE_AND_PURGE
  }

  override def onProcessingTime(time: Long, window: Window, ctx: TriggerContext): TriggerResult = {
    throw new UnsupportedOperationException("Not a processing time trigger")
  }

  override def clear(window: Window, ctx: TriggerContext): Unit = {
    ctx.getPartitionedState(stateDesc).clear()
  }
}

def main(args: Array[String]): Unit = {
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  val trades = env
    .readTextFile("/tmp/trades.csv")
    .map { line =>
      val cells = line.split(",")
      val time = LocalDateTime.parse(cells(0), DateTimeFormatter.ofPattern("yyyyMMdd HH:mm:ss.SSSSSSSSS"))
      val millis = time.toInstant(ZoneOffset.UTC).toEpochMilli
      Trade(0, millis, time, cells(1).toDouble, cells(2).toInt)
    }

  val aggregated = trades
    .assignAscendingTimestamps(_.millis)
    .keyBy("key")
    .window(GlobalWindows.create)
    .trigger(new VolumeTrigger(500, env.getConfig))
    .sum(4)

  aggregated.writeAsText("/tmp/trades_agg.csv")
  env.execute("volume agg")
}
The data looks, for example, as follows:
20180102 04:00:29.715706404,169.10,100
20180102 04:00:29.715715627,169.10,100
20180102 05:08:29.025299624,169.12,100
20180102 05:08:29.025906589,169.10,214
20180102 05:08:29.327113252,169.10,200
20180102 05:09:08.350939314,169.00,100
20180102 05:09:11.532817015,169.00,474
20180102 06:06:55.373584329,169.34,200
20180102 06:07:06.993081961,169.34,100
20180102 06:07:08.153291898,169.34,100
20180102 06:07:20.081524768,169.34,364
20180102 06:07:22.838656715,169.34,200
20180102 06:07:24.561360031,169.34,100
20180102 06:07:37.774385969,169.34,100
20180102 06:07:39.305219107,169.34,200
I have a timestamp, a price, and a size.
The above code can partition it into windows of roughly the same volume:
Trade(0,1514865629715,2018-01-02T04:00:29.715706404,169.1,514)
Trade(0,1514869709327,2018-01-02T05:08:29.327113252,169.1,774)
Trade(0,1514873215373,2018-01-02T06:06:55.373584329,169.34,300)
Trade(0,1514873228153,2018-01-02T06:07:08.153291898,169.34,464)
Trade(0,1514873242838,2018-01-02T06:07:22.838656715,169.34,600)
Trade(0,1514873294898,2018-01-02T06:08:14.898397117,169.34,500)
Trade(0,1514873299492,2018-01-02T06:08:19.492589659,169.34,400)
Trade(0,1514873332251,2018-01-02T06:08:52.251339070,169.34,500)
Trade(0,1514873337928,2018-01-02T06:08:57.928680090,169.34,1000)
Trade(0,1514873338078,2018-01-02T06:08:58.078221995,169.34,1000)
Now I would like to partition the data so that the volume exactly matches the trigger value. For this I would need to change the data slightly, by splitting a trade at the end of an interval into two parts: one part belongs to the window actually being fired, and the remaining volume above the trigger value has to be assigned to the next window.
Can that be handled with some custom aggregation function? It would need to know the results from the previous window(s), though, and I was not able to find out how to do that.
Any ideas from Apache Flink experts on how to handle this case?
Adding an evictor does not work, as it only purges some elements at the beginning of the window.
I hope the change from Spark Structured Streaming to Flink was a good choice, as I will have even more complicated situations to handle later.
Since your key is the same for all records, you may not require a window in this case. Please refer to this page in Flink's documentation https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/state.html#using-managed-keyed-state.
It has a CountWindowAverage class in which the aggregation of a value from each record of a stream is done using a state variable. You can implement something similar: send the output whenever the state variable reaches your trigger volume, and reset the state variable to the remaining volume.
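A minimal sketch of that idea, assuming the Trade case class from the question (the operator name VolumeBucketer and the state name are made up here): the running volume lives in a ValueState, one record is emitted every time a full bucket of triggerVolume is reached, and any excess is carried over into the next bucket.
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Hypothetical sketch: emits one record per completed bucket of `triggerVolume`,
// keeping the leftover size in keyed state for the next bucket.
class VolumeBucketer(triggerVolume: Int) extends RichFlatMapFunction[Trade, Trade] {

  private var carried: ValueState[java.lang.Integer] = _

  override def open(parameters: Configuration): Unit = {
    carried = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Integer]("carried-volume", classOf[java.lang.Integer]))
  }

  override def flatMap(trade: Trade, out: Collector[Trade]): Unit = {
    val previous = if (carried.value() == null) 0 else carried.value().intValue()
    var accumulated = previous + trade.size
    // Several buckets may complete at once if a single trade is large enough.
    while (accumulated >= triggerVolume) {
      out.collect(trade.copy(size = triggerVolume))
      accumulated -= triggerVolume
    }
    carried.update(accumulated)
  }
}
Applied as trades.keyBy(_.key).flatMap(new VolumeBucketer(500)), this would replace the window and trigger entirely; whether that is sufficient depends on what else you need from the windowed result.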
A simple approach (though not super-efficient) would be to put a FlatMapFunction ahead of your windowing flow. If it's keyed the same way, then you can use ValueState to track total volume, and emit two records (the split) when it hits your limit.
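For completeness, here is a rough sketch of that splitting flatMap, reusing the imports and the Trade case class from the sketch above (TradeSplitter is a made-up name). It cuts a trade into two records at the bucket boundary so that the existing GlobalWindow with the VolumeTrigger closes at exactly the trigger volume; for simplicity it assumes that a single trade never spans more than two buckets.
// Hypothetical splitter placed in front of the windowing flow: when the running
// volume would cross the limit, the trade is emitted as two records.
class TradeSplitter(triggerVolume: Int) extends RichFlatMapFunction[Trade, Trade] {

  private var runningVolume: ValueState[java.lang.Integer] = _

  override def open(parameters: Configuration): Unit = {
    runningVolume = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Integer]("running-volume", classOf[java.lang.Integer]))
  }

  override def flatMap(trade: Trade, out: Collector[Trade]): Unit = {
    val before = if (runningVolume.value() == null) 0 else runningVolume.value().intValue()
    val after = before + trade.size
    if (after > triggerVolume) {
      val firstPart = triggerVolume - before                  // fills the current bucket exactly
      out.collect(trade.copy(size = firstPart))
      out.collect(trade.copy(size = trade.size - firstPart))  // spills into the next bucket
      runningVolume.update(trade.size - firstPart)
    } else {
      out.collect(trade)
      runningVolume.update(if (after == triggerVolume) 0 else after)
    }
  }
}
The splitter has to run on a keyed stream, e.g. trades.keyBy(_.key).flatMap(new TradeSplitter(500)), and the result has to be keyed again before the existing window and trigger, since flatMap returns an un-keyed DataStream.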
My objective is to use a playground to show monthly temperatures (e.g. January: high of 30, low of -2), and it needs to use an array of strings as well as a dictionary with tuple values for the temperatures.
So far I have an array of strings, Months: [String], with the months in it, as well as a dictionary, Temperatures: [String: (temp1: Int, temp2: Int)]. I have a function SetMonthlyTemp(month: String, temp1: Int, temp2: Int) which I am trying to use to set up the dictionary, but I can't figure out how to do so. I'm totally new to dictionaries and have only used a tuple once, last week, and that was as a standalone property. Any help with setting up this dictionary to take a tuple (Int, Int) would be great! Obviously there will be a display method that prints the results, but I am not having any trouble finding information for that.
Enjoy:
var temperatures = [String: (Int, Int)]()
temperatures["Jan"] = (10, 20)
temperatures["Feb"] = (-1, -16)
// setting temp 1 for January (note: "Jan" entry must exist in dictionary)
temperatures["Jan"]?.0 = 30
// setter ;)
func setMonthlyTemp(month: String, temp1: Int, temp2: Int) {
    temperatures[month] = (temp1, temp2)
}
Accessing:
temperatures["Feb"] // whole tuple for February
temperatures["Jan"]?.0 // first temperature for January
temperatures["Feb"]?.1 // second temperature for February
From my perspective, if you are working with a constrained data set (e.g. months, weeks, categories of something), it is better to use an enum, which will describe your data much better than tuples and strings:
enum Month {
    case january
    case february
    // ...
    case november
    case december

    static let allMonths = [january, february, /*...*/ november, december]
}

struct MonthlyTemperature {
    let month: Month
    var lowestTemp: Double?
    var highestTemp: Double?

    init(month: Month, lowest: Double? = nil, highest: Double? = nil) {
        self.month = month
        self.lowestTemp = lowest
        self.highestTemp = highest
    }
}

let temperatures = [MonthlyTemperature]()
// ...
var dict = Dictionary(grouping: temperatures, by: { $0.month })
Month.allMonths.forEach { month in
    dict.updateValue(dict[month] ?? [], forKey: month)
}
I know that my title is slightly confusing but let me explain:
I have been at this for a while and I can't seem to figure this out.
First of all here is some of my code:
struct CalorieLog {
    var date: Date
    var calories: Int
}

var logs: [CalorieLog] = []

func logCalories() {
    // ... calculate calories to log
    let currentDate: Date = Date()
    let calories: Int = calculatedCalories
    logs.append(CalorieLog(date: currentDate, calories: calculatedCalories))
}
Now, how do I group the CalorieLog items in the logs array by day and get the sum of all the calories logged per day? And maybe sort them into an array of dictionaries, e.g. [String: Int], so that each entry is (Day: Total Calories)?
Please don't be harsh I am still a novice developer. Thanks in advance.
A great deal depends on what you mean by a "day". So, in this highly simplified example, I simply use the default definition, that is, a day as the Calendar defines it for a particular Date (not taking time zone realities into account).
Here's some basic data:
struct CalorieLog {
    var date: Date
    var calories: Int
}

var logs: [CalorieLog] = []
logs.append(CalorieLog(date: Date(), calories: 150))
logs.append(CalorieLog(date: Date(), calories: 140))
logs.append(CalorieLog(date: Date() + (60 * 60 * 24), calories: 130))
Now we construct a dictionary where the key is the ordinality of the date's Day, and the value is an array of all logs having that day as their date:
var dict = [Int: [CalorieLog]]()
for log in logs {
    let d = log.date
    let cal = Calendar(identifier: .gregorian)
    if let ord = cal.ordinality(of: .day, in: .era, for: d) {
        if dict[ord] == nil {
            dict[ord] = []
        }
        dict[ord]!.append(log)
    }
}
Your CalorieLogs are now clumped into days! Now it's easy to run through that dictionary and sum the calories for each day's array-of-logs. I don't know what you ultimately want to do with this information, so here I just print it, to prove that our dictionary organization is useful:
for (ord, logs) in dict {
    print(ord)
    print(logs.reduce(0) { $0 + $1.calories })
}
A lot of what you're attempting to do can be accomplished using Swift's map, sorted, filter, and reduce functions.
struct CalorieLog {
    var date: Date
    var calories: Int
}

var logs: [CalorieLog] = []

// I changed your method to pass in calculatedCalories; we can make that random just for learning purposes. See below.
func logCalories(calculatedCalories: Int) {
    let currentDate: Date = Date()
    logs.append(CalorieLog(date: currentDate, calories: calculatedCalories))
}

// This is a method that will calculate dummy calorie data n times and append it to your logs array
func addDummyCalorieData(n: Int, maxRandomCalorie: Int) {
    for _ in 1...n {
        let random = Int(arc4random_uniform(UInt32(maxRandomCalorie)))
        logCalories(calculatedCalories: random)
    }
}

// Calculate 100 random CalorieLogs with a max calorie value of 1000 calories
addDummyCalorieData(n: 100, maxRandomCalorie: 1000)

// Print the unsorted CalorieLogs
print("Unsorted Calorie Data: \(logs)")

// Sort the logs from low to high based on the individual calories value
let sortedLowToHigh = logs.sorted { $0.calories < $1.calories }
// Print to console window
print("Sorted Low to High: \(sortedLowToHigh)")

// Sort the CalorieLogs from high to low
let sortedHighToLow = logs.sorted { $1.calories < $0.calories }
// Print to console window
print("Sorted High to Low: \(sortedHighToLow)")

// Sum
// This will reduce the CalorieLogs based on their calorie values, represented as a sum
let sumOfCalories = logs.map { $0.calories }.reduce(0, +)
// Print the sum
print("Sum: \(sumOfCalories)")
If you wanted to map your CalorieLogs as an array of dictionaries you could do something like this:
let arrayOfDictionaries = logs.map { [$0.date : $0.calories] }
However, that's kind of inefficient. Why would you want an array of dictionaries? If you just want to track the calories consumed/burned for a specific date, you could make one dictionary where the date is your key and an array of Int is the value, representing all the calories for that day. You would probably only need one dictionary, i.e.
var dictionary = [Date : [Int]]()
Then you could find all the calories for a date by saying dictionary[Date()]. Keep in mind, though, that you would have to have the exact date and time. You may want to change the key of your dictionary to something like a String that just represents a date, like 2/19/2017, something that can be compared more easily. That will have to be taken into account when you design your model.
To get the logs sorted by date, you can simply do:
logs.sorted(by: { $0.date < $1.date })
To get a dictionary that maps a day to a sum of calories on that day you can do this:
let dateFormatter = DateFormatter()
dateFormatter.dateFormat = "dd MMM yyyy"

var calorieCalendar = [String: Int]()
for log in logs {
    let date = dateFormatter.string(from: log.date)
    if let _ = calorieCalendar[date] {
        calorieCalendar[date]! += log.calories
    } else {
        calorieCalendar[date] = log.calories
    }
}
For logs set up like this:
logs.append(CalorieLog(date: Date(), calories: 1))
logs.append(CalorieLog(date: Date.init(timeIntervalSinceNow: -10), calories: 2))
logs.append(CalorieLog(date: Date.init(timeIntervalSinceNow: -60*60*24*2), calories: 3))
The code above will produce a dictionary like this:
["17 Feb 2017": 3, "19 Feb 2017": 3]