Timing discrepancy in NetLogo - benchmarking

Can anyone explain why there is a performance difference between the following two segments? It is statistically significant that the second timer call reports a smaller number than the first. My only thought is that NetLogo might be caching the turtles in memory. Is this expected behavior, or is it a bug?
to setup
  clear-all
  crt 100
  let repetitions 10000
  ;; timing: assign self to x
  reset-timer
  repeat repetitions [
    ask turtles [ let x self ]
  ]
  show timer
  ;; timing: assign [who] of self to x
  reset-timer
  repeat repetitions [
    ask turtles [ let x [who] of self ]
  ]
  show timer
end

This isn't caused by anything in NetLogo itself, but by the fact that NetLogo runs on the JVM. The JVM optimizes code the more it runs it, as part of its just-in-time (JIT) compilation.
By the time the second segment is run, the JVM has had time to optimize many code paths that the two segments have in common. Indeed, switching the order of the segments, I got the following results:
observer> setup
observer: 0.203
observer: 0.094
observer> setup
observer: 0.136
observer: 0.098
observer> setup
observer: 0.13
observer: 0.097
observer> setup
observer: 0.119
observer: 0.095
observer> setup
observer: 0.13
observer: 0.09
Now the let x self code is faster (it's now the second segment to run)! Notice also that both times decrease the more I run setup. This, too, is due to the JVM's JIT.
Similarly, if I turn off view updates and run your original code, I get:
observer> setup
observer: 0.088
observer: 0.071
observer> setup
observer: 0.094
observer: 0.072
observer> setup
observer: 0.065
observer: 0.075
observer> setup
observer: 0.067
observer: 0.071
observer> setup
observer: 0.067
observer: 0.068
The let x self code starts out slower (for the reason above) and then becomes about the same speed, as one might expect. There are many possible reasons why this only happens with view updates off; at a minimum, NetLogo is doing a lot less work when view updates are off.
The JVM's JIT is highly effective but complicated, and it can be hard to reason about. There's a lot to consider if you want to write truly correct micro-benchmarks.
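The same warm-up principle applies to micro-benchmarks in any language: run the code under test before you start timing it, so that caches (and, on JIT runtimes like the JVM, compiled code paths) are hot when measurement begins. A minimal sketch in Python; the task function is just a stand-in workload, not anything from the question:

```python
import timeit

def task():
    # a small, deterministic unit of work to benchmark
    return sum(i * i for i in range(1000))

# Warm-up phase: execute the code before timing it, so that later
# measurements don't include one-off startup costs.
for _ in range(1000):
    task()

# Only now take the measurement.
elapsed = timeit.timeit(task, number=10_000)
print(f"10,000 calls took {elapsed:.3f}s")
```

Swapping the order of two such measured segments (as in the answer above) is a quick sanity check that you are measuring the code, not the warm-up.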


Train an already trained model in Sagemaker and Huggingface without re-initialising

Let's say I have successfully trained a model on some training data for 10 epochs. How can I then access the very same model and train for a further 10 epochs?
In the docs it suggests "you need to specify a checkpoint output path through hyperparameters" --> how?
# define my estimator the standard way
huggingface_estimator = HuggingFace(
    entry_point = 'train.py',
    ...
    hyperparameters = hyperparameters,
)

# train the model
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})
If I run huggingface_estimator.fit again it will just start the whole thing over again and overwrite my previous training.
You can find the relevant checkpoint save/load code in Spot Instances - Amazon SageMaker x Hugging Face Transformers (the example enables Spot instances, but you can use on-demand instances as well).
In hyperparameters you set: 'output_dir':'/opt/ml/checkpoints'.
You define a checkpoint_s3_uri in the Estimator (which is unique to the series of jobs you'll run).
You add code for train.py to support checkpointing:
from transformers.trainer_utils import get_last_checkpoint

# check if a checkpoint exists; if so, continue training from it
last_checkpoint = None
if get_last_checkpoint(args.output_dir) is not None:
    logger.info("***** continue training *****")
    last_checkpoint = get_last_checkpoint(args.output_dir)

trainer.train(resume_from_checkpoint=last_checkpoint)
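To make the resume logic concrete, here is a small self-contained sketch of what get_last_checkpoint-style discovery does: scan the output directory for checkpoint-&lt;step&gt; folders and pick the newest. The last_checkpoint helper below is a simplified, hypothetical stand-in, not the actual transformers implementation:

```python
import os
import re
import tempfile

def last_checkpoint(output_dir):
    """Return the newest 'checkpoint-<step>' subdirectory, or None.

    Simplified, illustrative stand-in for
    transformers.trainer_utils.get_last_checkpoint."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best, best_step = None, -1
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            step = int(m.group(1))
            if step > best_step:
                best, best_step = os.path.join(output_dir, name), step
    return best

workdir = tempfile.mkdtemp()
print(last_checkpoint(workdir))  # None: nothing saved yet

# simulate a trainer having saved three checkpoints
for step in (500, 1000, 1500):
    os.makedirs(os.path.join(workdir, f"checkpoint-{step}"))

result = last_checkpoint(workdir)
print(result)  # path ending in checkpoint-1500
```

Because checkpoint_s3_uri syncs /opt/ml/checkpoints with S3, a new training job started with the same URI sees the old checkpoint folders and resumes instead of re-initialising.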

What is the difference between combining array by using reduce or joined?

Consider the following array of strings:
let arrayStrings = ["H", "e", "l", "l", "o"]
For combining its elements (to get "Hello" as single String), we could:
reduce it:
let reducedString = arrayStrings.reduce("", { $0 + $1 }) // "Hello"
Or join it:
let joinedString = arrayStrings.joined() // "Hello"
Both would return "Hello" String as output.
However, what logic should one keep in mind when deciding which is the better choice for such a task? And how do the two compare in performance?
There are two reasons why joined is a better choice than reduce:
If you want to join multiple strings into one, why would you use reduce with manual concatenation? If there is a specific function for the task at hand, use it: when reading the code, joined is easier to understand than reduce.
joined for String can be implemented more efficiently than reduce. It does not have to be, but it can. reduce operates on one element at a time, with no knowledge of the other elements and with many temporary values passed around. joined sees the entire sequence and knows the operation is always the same, so it can optimize; it can even use the internal structure of String. See the String.joined implementation.
In summary, always use the more specific implementation.
Note that the performance reason above is the less important one.
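The same trade-off exists outside Swift. As a rough illustration in Python rather than Swift: pairwise reduce concatenation builds a new intermediate string at every step, so joining n chunks copies O(n^2) characters in total, while join can inspect the whole sequence, size the result once, and copy each chunk exactly once:

```python
import functools
import timeit

letters = ["H", "e", "l", "l", "o"]

# Both approaches produce the same result for small inputs.
reduced = functools.reduce(lambda acc, s: acc + s, letters, "")
joined = "".join(letters)
print(reduced, joined)  # Hello Hello

# On larger inputs the asymptotic difference shows up.
chunks = ["x"] * 10_000
t_reduce = timeit.timeit(
    lambda: functools.reduce(lambda a, b: a + b, chunks, ""), number=10)
t_join = timeit.timeit(lambda: "".join(chunks), number=10)
print(f"reduce: {t_reduce:.4f}s, join: {t_join:.4f}s")
```

This mirrors the argument above: the specialized function has whole-sequence knowledge that the general fold cannot exploit.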
Update: The previous results were obtained by running an iOS app on the simulator. Running the app on a real device, or running the code from a macOS command-line app, gives results similar to the ones @Sulthan mentioned.
Interestingly enough, reduce gave better results on my machine:
func benchmark(_ label: String, times: Int = 100000, _ f: () -> Void) {
    let start = CACurrentMediaTime()
    (0..<times).forEach { _ in f() }
    let end = CACurrentMediaTime()
    print("\(label) took \(end - start)")
}

let arrayStrings = ["H", "e", "l", "l", "o"]
benchmark("reduce", { _ = arrayStrings.reduce("", +) })
benchmark("join", { _ = arrayStrings.joined() })
The results were around the following numbers when run from the main method of a typical iOS app, build in Debug mode:
reduce took 0.358474982960615
join took 0.582276367989834
Same app, built in Release mode, reversed the order:
reduce took 0.126910287013743
join took 0.0291724550188519
I ran the benchmarks multiple times, and in Debug mode reduce performed better every time. The difference is not that big, though, so unless your string operations are performance-critical, I'd recommend using joined: it carries more semantic value and better conveys the intent.

How to write the content of a Flink var to screen in Zeppelin?

I try to run the following simple commands in Apache Zeppelin.
var rabbit = env.fromElements(
"ARTHUR: What, behind the rabbit?",
"TIM: It is the rabbit!",
"ARTHUR: You silly sod! You got us all worked up!",
"TIM: Well, that's no ordinary rabbit. That's the most foul, cruel, and bad-tempered rodent you ever set eyes on.",
"ROBIN: You tit! I soiled my armor I was so scared!",
"TIM: Look, that rabbit's got a vicious streak a mile wide, it's a killer!")
var counts = rabbit.flatMap { _.toLowerCase.split("\\W+")}.map{ (_,1)}.groupBy(0).sum(1)
I try to print out the results in the notebook. But unfortunately, I only get the following output.
rabbit: org.apache.flink.api.scala.DataSet[String] = org.apache.flink.api.scala.DataSet#37fdb65c
counts: org.apache.flink.api.scala.AggregateDataSet[(String, Int)] = org.apache.flink.api.scala.AggregateDataSet#1efc7158
res103: org.apache.flink.api.java.operators.DataSink[(String, Int)] = DataSink '<unnamed>' (Print to System.out)
How can I print the content of counts in the Zeppelin notebook?
The way to print the result of such a computation in Zeppelin is:
counts.collect().foreach(println(_))
//or one might prefer
//counts.collect foreach println
The reason for the observed behaviour lies in the interplay between Apache Zeppelin and Apache Flink. Zeppelin captures all output written to Scala's Console. Flink's counts.print(), however, writes to Java's System.out, which Zeppelin does not capture. The reason bzz's solution works is that it prints the result via Console.
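The same capture gap is easy to reproduce in other runtimes. As a loose Python analogue, redirecting sys.stdout (like Zeppelin capturing Console) catches print but not writes to the raw file descriptor (like Flink writing to System.out):

```python
import io
import os
from contextlib import redirect_stdout

buf = io.StringIO()
with redirect_stdout(buf):
    print("via sys.stdout")      # routed through sys.stdout: captured
    os.write(1, b"via fd 1\n")   # written to the raw descriptor: NOT captured

captured = buf.getvalue()
print(repr(captured))  # only the print() output made it into the buffer
```

Anything that bypasses the layer the notebook hooks into will escape to the process's real stdout, exactly as counts.print() did.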
I opened a JIRA issue [1] and opened a pull request [2] to correct this behaviour so that you can also use counts.print().
[1] https://issues.apache.org/jira/browse/ZEPPELIN-287
[2] https://github.com/apache/incubator-zeppelin/pull/288

Is there a way to query a Julia Timer to see whether it is running or not?

Julia has a Timer object which can run a callback function at a set repetition rate. According to the standard library documentation, the only functions that use a Timer are start_timer() and stop_timer().
Is there a way, given a Timer, to check whether it is currently running or not?
The best way to look for something like this is methodswith. Unfortunately, there aren't many methods defined for Julia Timer objects:
julia> methodswith(Timer, true) # true to check super types, too (but not Any)
5-element Array{Method,1}:
stop_timer(timer::Timer) at stream.jl:499
close(t::Timer) at stream.jl:460
start_timer(timer::Timer,timeout::Int64,repeat::Int64) at deprecated.jl:204
start_timer(timer::Timer,timeout::Real,repeat::Real) at stream.jl:490
close(t::Timer) at stream.jl:460
So we've got to dig a bit deeper. Looking at the implementation of Timer reveals that it simply wraps a libuv timer object. So I searched through libuv/include/uv.h for the timer API, and found int uv_is_active(const uv_handle_t* handle), which looks very promising. Wrapping this C call in a Julia function works like a charm:
julia> isactive(t::Timer) = bool(ccall(:uv_is_active, Cint, (Ptr{Void},), t.handle));
julia> t = Timer((x)->println(STDOUT,"\nboo"));
julia> isactive(t)
false

julia> start_timer(t, 10., 0); # fire in 10 seconds, don't repeat

julia> isactive(t)
true

boo

julia> isactive(t)
false
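For comparison, some other runtimes expose this query directly. Python's threading.Timer, for example, is a thread, so is_alive() answers exactly this question; a small sketch:

```python
import threading

fired = threading.Event()
t = threading.Timer(0.2, fired.set)  # fire once after 0.2 s, don't repeat

before = t.is_alive()   # False: not started yet
t.start()
running = t.is_alive()  # True while the timer is pending
t.join()                # block until the callback has run
after = t.is_alive()    # False once it has fired

print(before, running, after)
```

The Julia isactive wrapper above gives the same three-state behaviour: inactive before start, active while pending, inactive again after a one-shot timer fires.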

R tm: reloading a 'PCorpus' backend filehash database as corpus (e.g. in restarted session/script)

Having learned loads from answers on this site (thanks!), it's finally time to ask my own question.
I'm using R (tm and lsa packages) to create, clean and simplify, and then run LSA (latent semantic analysis) on, a corpus of about 15,000 text documents. I'm doing this in R 3.0.0 under Mac OS X 10.6.
For efficiency (and to cope with having too little RAM), I've been trying to use either the 'PCorpus' option in tm (a database backend supported by the 'filehash' package), or the newer 'tm.plugin.dc' option for so-called 'distributed' corpus processing. But I don't really understand how either one works under the bonnet.
An apparent bug using DCorpus with tm_map (not relevant right now) led me to do some of the preprocessing work with the PCorpus option instead. And it takes hours. So I use R CMD BATCH to run a script doing things like:
> # load corpus from predefined directory path,
> # and create backend database to support processing:
> bigCcorp = PCorpus(bigCdir, readerControl = list(load=FALSE), dbControl = list(useDb = TRUE, dbName = "bigCdb", dbType = "DB1"))
> # converting to lower case:
> bigCcorp = tm_map(bigCcorp, tolower)
> # removing stopwords:
> stoppedCcorp = tm_map(bigCcorp, removeWords, stoplist)
Now, supposing my script crashes soon after this point, or I just forget to export the corpus in some other form, and then I restart R. The database is still there on my hard drive, full of nicely tidied-up data. Surely I can reload it back into the new R session, to carry on with the corpus processing, instead of starting all over again?
It feels like a noodle question... but no amount of dbInit() or dbLoad() or variations on the 'PCorpus()' function seem to work. Does anyone know the correct incantation?
I've scoured all the related documentation, and every paper and web forum I can find, but total blank - nobody seems to have done it. Or have I missed it?
The original question was from 2013. Meanwhile, in Feb 2015, a duplicate (or similar) question was answered:
How to reconnect to the PCorpus in the R tm package?. The answer in that post is essential, although pretty minimalist, so I'll try to augment it here.
These are some comments I've just discovered while working on a similar problem:
Note that the dbInit() function is not part of the tm package.
First you need to install the filehash package, which the tm documentation only "suggests". This means it is not a hard dependency of tm.
Supposedly, you can also use the filehashSQLite package with library("filehashSQLite") instead of library("filehash"); both packages have the same interface and work seamlessly together, thanks to their object-oriented design. So also install "filehashSQLite" (edit 2016: some functions such as tm::content_transformer() are not implemented for filehashSQLite).
Then this works:
# this string becomes the filename, and must not contain dots.
# Example: "mydata.sqlite" is not permitted.
s <- "sqldb_pcorpus_mydata" # replace mydata with something more descriptive
if (!file.exists(s)) {
  # csv is a data frame of 900 documents, 18 cols/features
  pc <- PCorpus(DataframeSource(csv), readerControl = list(language = "en"), dbControl = list(dbName = s, dbType = "SQLite"))
  dbCreate(s, "SQLite")
  db <- dbInit(s, "SQLite")
  # add another record, just to show we can
  # (key = "test", value = "hi there"):
  dbInsert(db, "test", "hi there")
} else {
  db <- dbInit(s, "SQLite")
  pc <- dbLoad(db)
}
pc
# <<PCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 900
dbFetch(db, "test")
# [1] "hi there"
# remove the objects from the workspace, then reload them from the database:
rm(pc, db)
db <- dbInit(s, "SQLite")
pc <- dbLoad(db)
# the corpus entries are accessible again, but not loaded into memory;
# the 900 documents are bound via "Active Bindings", created by
# makeActiveBinding() from the base package
dbList(db)
# [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"
# ...
# [883] "883" "884" "885" "886" "887" "888" "889" "890" "891" "892"
# [893] "893" "894" "895" "896" "897" "898" "899" "900"
# [901] "test"
dbFetch(db, "900")
# <<PlainTextDocument>>
# Metadata: 7
# Content: chars: 33
dbFetch(db, "test")
# [1] "hi there"
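The reconnect pattern itself is not tm-specific. As a loose Python analogue, a shelve store shows the same workflow: one session writes processed documents to a disk-backed database, and a later (restarted) session reopens the same file and carries on where it left off:

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "corpusdb")

# "First session": write cleaned documents into a disk-backed store.
with shelve.open(path) as db:
    db["1"] = "first cleaned document"
    db["2"] = "second cleaned document"

# "Restarted session": reopen the same file and pick up the data again,
# without redoing any of the processing.
with shelve.open(path) as db:
    keys = sorted(db.keys())
    doc1 = db["1"]

print(keys, doc1)
```

The dbInit()/dbLoad() pair in the R code above plays the role of the second shelve.open(): it reattaches an existing on-disk database to a fresh session.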
Looking at the database backend directly, you can see that the documents from the data frame have been encoded somehow, inside the SQLite table.