What is the difference between combining an array using reduce or joined?

Consider the following array of strings:
let arrayStrings = ["H", "e", "l", "l", "o"]
To combine its elements (to get "Hello" as a single String), we could either:
reduce it:
let reducedString = arrayStrings.reduce("", { $0 + $1 }) // "Hello"
Or join it:
let joinedString = arrayStrings.joined() // "Hello"
Both return the String "Hello" as output.
However, what logic should be kept in mind when deciding which is the better choice for such a task? And how do the two compare in terms of performance?

There are two reasons why joined is a better choice than reduce:
Readability
If you want to join multiple strings into one string, why would you use reduce, with manual concatenation? If there is a specific function for the task you want to do, use it. When reading the code, it's easier to understand joined than reduce.
Performance
joined for String can be implemented more efficiently than reduce. It does not have to be, but it can. reduce operates on one element at a time, without knowledge of the other elements, and with many temporary values passed around. joined has knowledge of the entire sequence and knows that the operation is always the same, therefore it can optimize. It can even use the internal structure of String. See the String.joined implementation.
In summary, always use the more specific implementation.
Note that the performance reason above is the less important one.

Update: the results below were obtained by running an iOS app on the simulator. Running the app on a real device, or running the code from a macOS command-line app, gives results similar to the ones @Sulthan mentioned.
Interestingly enough, reduce gave better results on my machine:
import QuartzCore
func benchmark(_ label: String, times: Int = 100000, _ f: () -> Void) {
    let start = CACurrentMediaTime()
    (0..<times).forEach { _ in f() }
    let end = CACurrentMediaTime()
    print("\(label) took \(end - start)")
}
let arrayStrings = ["H", "e", "l", "l", "o"]
benchmark("reduce", { _ = arrayStrings.reduce("", +) })
benchmark("join", { _ = arrayStrings.joined() })
The results were around the following numbers when run from the main method of a typical iOS app, built in Debug mode:
reduce took 0.358474982960615
join took 0.582276367989834
Same app, built in Release mode, gave the same order of results:
reduce took 0.126910287013743
join took 0.291724550188519
I ran the benchmarks multiple times, and reduce performed better in all cases. The difference is not that big, though, so unless your string operations are performance-critical, I'd recommend using joined: it carries more semantic value and better conveys the intent.

Related

How might I implement a map of maps in Flink keyed state that supports fast insert, lookup and iteration of nested maps?

I'd like to write a Flink streaming operator that maintains say 1500-2000 maps per key, with each map containing perhaps 100,000s of elements of ~100B. Most records will trigger inserts and reads, but I’d also like to support occasional fast iteration of entire nested maps.
I've written a KeyedProcessFunction that creates 1500 RocksDB-backed MapStates per key, and tested it by generating a stream of records with a single distinct key, but I find things perform poorly. Just initialising all of them takes on the order of several minutes, and once data begins to flow, async incremental checkpoints frequently fail due to timeouts. Is this a reasonable approach? If not, what alternative(s) should I consider?
Thanks!
Functionally my code is along the lines of:
val stream = env.fromCollection(new Iterator[(Int, String)] with Serializable {
  override def hasNext: Boolean = true
  override def next(): (Int, String) = {
    (1, randomString())
  }
})
stream
  .keyBy(_._1)
  .process(new KFP())
  .writeUsingOutputFormat(...)
class KFP extends KeyedProcessFunction[Int, (Int, String), String] {
  var states: Array[MapState[Int, String]] = _
  override def processElement(
      value: (Int, String),
      ctx: KeyedProcessFunction[Int, (Int, String), String]#Context,
      out: Collector[String]
  ): Unit = {
    if (states(0).isEmpty) {
      // insert 0-300,000 random strings <= 100B
    }
    val state = states(random.nextInt(1500))
    // Read from R random keys in state
    // Write to W random keys in state
    // With probability 0.01 iterate entire contents of state
    if (random.nextInt(100) == 0) {
      state.iterator().forEachRemaining {
        // do something trivial
      }
    }
  }
  override def open(parameters: Configuration): Unit = {
    states = (0 until 1500).map { stateId =>
      getRuntimeContext.getMapState(
        new MapStateDescriptor[Int, String](stateId.toString, classOf[Int], classOf[String]))
    }.toArray
  }
}
There's nothing in what you've described that's an obvious explanation for poor performance. You are already doing the most important thing, which is to use MapState<K, V> rather than ValueState<Map<K, V>>. This way each key/value pair in the map is a separate RocksDB object, rather than the entire Map being one RocksDB object that has to go through ser/de for every access/update for any of its entries.
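As a minimal illustration of that distinction (the descriptor names here are made up, not taken from your code):
import org.apache.flink.api.common.state.{MapStateDescriptor, ValueStateDescriptor}
// Preferred with RocksDB: each map entry is its own RocksDB key/value, so a single
// get/put only (de)serializes that one entry.
val perEntry = new MapStateDescriptor[Int, String]("perEntry", classOf[Int], classOf[String])
// Anti-pattern with RocksDB: the whole map is one blob, fully deserialized and
// re-serialized on every access or update of any entry.
val wholeMap = new ValueStateDescriptor[Map[Int, String]]("wholeMap", classOf[Map[Int, String]])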
To understand the performance better, the next step might be to enable the RocksDB native metrics and study those for clues. RocksDB is quite tunable, and better performance may be achievable. E.g., you can tune for your expected mix of reads and writes, and if you are trying to access keys that don't exist, then you should enable bloom filters (which are turned off by default).
The RocksDB state backend has to go through ser/de for every state access/update, which is certainly expensive. You should consider whether you can optimize the serializer; some serializers can be 2-5x faster than others. (Some benchmarks.)
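A sketch of one such tweak (the Payload value type is hypothetical, not from the question): if the map values were a richer type than String, building the descriptor from createTypeInformation rather than a bare classOf keeps Flink on its own case-class serializer instead of falling back to Kryo, which is typically one of the cheaper wins:
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.scala.createTypeInformation
case class Payload(id: Int, body: String) // hypothetical value type
// classOf[Payload] would be analyzed as a generic type and serialized via Kryo;
// createTypeInformation derives Flink's case-class serializer instead.
val descriptor = new MapStateDescriptor[Int, Payload](
  "payloads",
  createTypeInformation[Int],
  createTypeInformation[Payload])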
Also, you may want to investigate the new spillable heap state backend that is being developed. See https://flink-packages.org/packages/spillable-state-backend-for-flink, https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend, and https://issues.apache.org/jira/browse/FLINK-12692. Early benchmarking suggests this state backend is significantly faster than RocksDB, as it keeps its working state as objects on the heap and spills cold objects to disk. (How much this would help probably depends on how often you have to iterate.)
And if you don't need to spill to disk, the FsStateBackend would be faster still.
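A rough sketch of wiring up these backend choices (the checkpoint path and the predefined option profile are placeholders, and the exact API differs slightly between Flink versions):
import org.apache.flink.contrib.streaming.state.{PredefinedOptions, RocksDBStateBackend}
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// RocksDB with incremental checkpoints; a predefined option profile is a simple
// starting point before hand-tuning individual RocksDB options.
val rocksDb = new RocksDBStateBackend("hdfs:///flink/checkpoints", true)
rocksDb.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM)
env.setStateBackend(rocksDb)
// If the working state fits in memory and spilling to disk is not needed:
// env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))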

Appending values to DataSet in Apache Flink

I am currently writing a (simple) analysis code to sum time-connected power readings. Since the data is presumably raw (e.g. disturbances from the measuring device have not been removed), I have to account for disturbances by calculating the mean of the first one thousand samples. Calculating the mean itself is not a problem; I am only unsure of how to generate the appropriate DataSet.
For now it looks roughly like this:
DataSet<Tupel2<long,double>>Gyrotron_1=ECRH.includeFields('11000000000'); // obviously the line to declare the first gyrotron, continues for the next ten lines, assuming separattion of not occupied space
DataSet<Tupel2<long,double>>Gyrotron_2=ECRH.includeFields('10100000000');
DataSet<Tupel2<long,double>>Gyrotron_3=ECRH.includeFields('10010000000');
DataSet<Tupel2<long,double>>Gyrotron_4=ECRH.includeFields('10001000000');
DataSet<Tupel2<long,double>>Gyrotron_5=ECRH.includeFields('10000100000');
DataSet<Tupel2<long,double>>Gyrotron_6=ECRH.includeFields('10000010000');
DataSet<Tupel2<long,double>>Gyrotron_7=ECRH.includeFields('10000001000');
DataSet<Tupel2<long,double>>Gyrotron_8=ECRH.includeFields('10000000100');
DataSet<Tupel2<long,double>>Gyrotron_9=ECRH.includeFields('10000000010');
DataSet<Tupel2<long,double>>Gyrotron_10=ECRH.includeFields('10000000001');
for (int=1,i<=10;i++) {
DataSet<double> offset=Gyroton_'+i+'.groupBy(1).first(1000).sum()/1000;
}
It's the part in the for-loop I'm unsure of. Does anybody know if it is possible to append values to DataSets and if so how?
In case of doubt, I could always put the values into an array but I do not know if that is the wise thing to do.
This code will not work for many reasons. I'd recommend looking into the fundamentals of Java and its basic data structures, and also into Flink.
It's really hard to understand what you are actually trying to achieve, but this is the closest that I came up with:
String[] codes = { "11000000000", ..., "10000000001" };
DataSet<Tuple2<Long, Double>> result = env.fromElements();
for (final String code : codes) {
    DataSet<Tuple2<Long, Double>> codeResult = ECRH.includeFields(code)
        .groupBy(1)
        .first(1000)
        .sum(0)
        .map(sum -> new Tuple2<>(sum.f0, sum.f1 / 1000d));
    result = codeResult.union(result);
}
result.print();
But please take the time to understand the basics before delving deeper. I also recommend using an IDE like IntelliJ, which would point out at least 6 issues in your code.

How to execute a collection of statements in Tiberius?

I could not figure out how to iterate over a collection and execute statements one by one with Tiberius.
My current code looks like this (simplified):
use futures::Future;
use futures_state_stream::StateStream;
use tokio::executor::current_thread;
use tiberius::SqlConnection;
fn find_files(files: &mut Vec<String>) {
    files.push(String::from("file1.txt"));
    files.push(String::from("file2.txt"));
    files.push(String::from("file3.txt"));
}
fn main() {
    let mut files: Vec<String> = Vec::new();
    find_files(&mut files);
    let future = SqlConnection::connect(CONN_STR)
        .and_then(|conn| {
            conn.simple_exec("CREATE TABLE db.dbo.[Filenames] ( [Spalte 0] varchar(80) );")
        })
        .and_then(|(_, conn)| {
            for k in files.iter() {
                let sql = format!("INSERT INTO db.dbo.Filenames ([Spalte 0]) VALUES ('{}')", k);
                &conn.simple_exec(sql);
            }
            Ok(())
        });
    current_thread::block_on_all(future).unwrap();
}
I got the following error message
error[E0382]: use of moved value: `conn`
--> src/main.rs:23:18
|
20 | .and_then(|(_, conn)| {
| ---- move occurs because `conn` has type `tiberius::SqlConnection<std::boxed::Box<dyn tiberius::BoxableIo>>`, which does not implement the `Copy` trait
...
23 | &conn.simple_exec(sql);
| ^^^^ value moved here, in previous iteration of loop
I'm new to Rust; I know there is something wrong with how I use the conn variable, but nothing I've tried works.
There are actually two questions here:
The header question: how to perform multiple sequential statements using tiberius?
The specific question concerning why an error message comes from a specific bit of code.
I will answer them separately.
Multiple statements
There are many ways to skin a cat. In TDS (the underlying protocol Tiberius implements) it is possible to execute several statements in a single command; they just need to be delimited by semicolons. In Tiberius, the response from such an execution is represented as a stream of futures, one for each statement.
So if your chain of statements is not too big to fit into one command, just build one string and send it over:
fn main() {
    let mut files: Vec<String> = Vec::new();
    find_files(&mut files);
    let stmts = vec![
        String::from("CREATE TABLE db.dbo.[Filenames] ( [Spalte 0] varchar(80) )")]
        .into_iter()
        .chain(files.iter().map(|k|
            format!("INSERT INTO db.dbo.Filenames ([Spalte 0]) VALUES ('{}')", k)))
        .collect::<Vec<_>>()
        .join(";");
    let future = SqlConnection::connect(std::env::var("CONN_STR").unwrap().as_str())
        .and_then(|conn|
            conn.simple_exec(stmts)
                .into_stream()
                .and_then(|future| future)
                .for_each(|_| Ok(())));
    current_thread::block_on_all(future).unwrap();
}
There is some simple boilerplate in that example.
simple_exec returns an ExecResult, a wrapper around the individual statements' future results. Calling into_stream() on that provides a stream of those futures.
That stream of results needs to be driven to completion; one way of doing that is to call and_then, which awaits each future and does something with it.
We don't actually care about the results here, so we just do a noop for_each.
But say there are a lot of statements, more than can fit in a single TDS command; then they need to be issued separately (another case is when the statements themselves depend on earlier ones). A version of that problem is solved in How do I iterate over a Vec of functions returning Futures in Rust?
Then finally, what is your specific error? conn is consumed by simple_exec, so it cannot be used afterwards; that is what the error tells you. If you want to use the connection after that execution is done, you have to use the Future it returns, which wraps the mutated connection. I defer to the link above for one way to do that.

Flink: Implementing a "join" between a DataStream and a "set of rules"

What is the best practice recommendation for the following use case? We need to match a stream against a set of “rules”, which are essentially a Flink DataSet concept. Updates to this “rule set” are possible but not frequent. Each stream event must be checked against all the records in the “rule set”, and each match produces one or more events into a sink data stream. The number of records in the rule set is in the 6-digit range.
Currently we're simply loading the rules into a local List and using flatMap over an incoming DataStream. Inside flatMap, we iterate over the list, comparing each event to each rule.
To speed up the iteration, we can also split the list into several batches, essentially creating a list of lists, and creating a separate thread to iterate over each sub-list (using Futures in either Java or Scala).
Questions:
Is there a better way to do this kind of a join?
If not, is it safe to add additional parallelism by creating new threads inside each flatMap operation, on top of what Flink is already doing?
EDIT:
Here's sample code as requested:
package wikiedits
import org.apache.flink.streaming.connectors.wikiedits.{WikipediaEditEvent, WikipediaEditsSource}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.extensions._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
object WikipediaEditEventProcessor {
  def main(args: Array[String]) = {
    val see = StreamExecutionEnvironment.getExecutionEnvironment
    val edits = see.addSource(new WikipediaEditsSource())
    val ruleSets = Map[Int, List[String]](
      (1, List("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")),
      (2, List("k", "l", "m", "n", "o", "p", "q", "r", "s", "t")),
      (3, List("u", "v", "w", "x", "y", "z", "0", "1", "2", "3"))
    )
    val result = edits.flatMap { edit =>
      ruleSets.map { ruleSet =>
        applyRuleSet(edit, ruleSet._2, ruleSet._1)
      }
    }
    see.execute
  }
  def applyRuleSet(event: WikipediaEditEvent, ruleSet: List[String], ruleSetId: Int): Future[List[String]] = {
    val title = event.getTitle
    Future(
      ruleSet.map {
        case rule if title.contains(rule) =>
          val result = s"Ruleset $ruleSetId: $rule -> exists in: $title"
          println(result) // this would be creating an output event instead
          result
        case rule =>
          val result = s"Ruleset $ruleSetId: $rule -> NO MATCH in: $title"
          println(result)
          result
      }
    )
  }
}
Each stream event must be checked against all the records in the “rule set”,
and each match produces one or more events into a sink data stream.
The number of records in the rule set is in the 6-digit range
Say you have K rules. Your approach is fine as long as a single event can be checked against all K rules before the next event arrives.
Otherwise, you need an approach where these K rules can be processed in parallel.
Think of them as K toll booths. Place them one after another rather than putting them all inside a single big room. This simplifies things for the streaming engine.
In other words, use a simple for loop to iterate over all the rules and create a separate flatMap for each rule (see the sketch below).
That way, each of them is independent of the others and can therefore be processed in parallel.
In the end you would have K flatMaps to execute. The engine will use the maximum parallelism possible with whatever configuration you provide for execution.
This approach limits the maximum possible parallelism to K, but that is good enough for a high number of rules.
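A minimal sketch of that idea, reusing the edits stream and ruleSets map from the question; whether you chain the flatMaps or apply one per rule set and union their outputs is a design choice, and the union form is shown here:
import org.apache.flink.streaming.api.scala._
// One flatMap per rule set, each applied to the same source stream, so Flink can
// schedule and parallelize them independently; only matches are emitted.
val perRuleSet: Iterable[DataStream[String]] = ruleSets.map { case (ruleSetId, rules) =>
  edits.flatMap { edit =>
    val title = edit.getTitle
    rules.collect {
      case rule if title.contains(rule) =>
        s"Ruleset $ruleSetId: $rule -> exists in: $title"
    }
  }
}
// Merge the per-rule-set match streams into a single output stream.
val allMatches: DataStream[String] = perRuleSet.reduce(_ union _)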
additional parallelism by creating new threads inside each flatMap
operation
Not recommended at all. Leave parallelism to Flink. You define the unit of work you wish to perform inside your flatMap.

How to find which element failed comparison between arrays in Kotlin?

I'm writing automated tests for a site. There's a page with all the items added to the cart; the maximum number of items is 58. Instead of verifying each element one by one, I decided to create 2 arrays of strings: one with the correct names and one with the names I got from the site. Then I compare those 2 arrays with contentEquals.
If that comparison fails, how do I know exactly which element caused the failure?
A short sample of what I have now:
@Test
fun verifyNamesOfAddedItems() {
    val getAllElementsNames = arrayOf(materials.text, element2.text,
        element3.text...)
    val correctElementsNames = arrayOf("name1", "name2", "name3"...)
    val areArraysEqual = getAllElementsNames contentEquals correctElementsNames
    if (!areArraysEqual) {
        assert(false)
    } else {
        assert(true)
    }
}
This test fails if the 2 arrays are not the same, but it doesn't show me the details. Is there a way to see more details about the failure, e.g. which element failed the comparison?
Thanks.
I recommend using a matcher library like Hamcrest or AssertJ in tests. They provide much better error messages for cases like this. In this case with Hamcrest it would be:
import org.hamcrest.MatcherAssert.assertThat
import org.hamcrest.Matchers.*
// contains() matches an Iterable, so compare the array as a List:
assertThat(getAllElementsNames.toList(), contains(*correctElementsNames))
// or just
assertThat(getAllElementsNames.toList(), contains("name1", "name2", "name3", ...))
There are also matcher libraries made specifically for Kotlin: https://github.com/kotlintest/kotlintest, https://yobriefca.se/expect.kt/, https://github.com/winterbe/expekt, https://github.com/MarkusAmshove/Kluent, probably more. Tests using them should be even more readable, but I haven't tried any of them. Look at their documentation and examples and pick the one you like.
You need to find the intersection between the two collections. The intersection contains the common elements; removing it from the collection you want to test then gives you the elements that differ.
val intersection = getAllElementsNames.intersect(correctElementsNames.toList())
val mismatched = getAllElementsNames.subtract(intersection)

Resources