Is it possible to set a random number generator seed to get reproducible training? - tensorflow.js

I would like to re-run training with fewer epochs and stop with the same state the earlier training had at that point.
I see that the tf.initializers accept a seed argument. tf.layers.dropout does as well, but 1.2.7 reports "Error: Non-default seed is not implemented in Dropout layer yet: 1". But even without dropout, are there other sources of randomness? And can those be given a seed?

You can get reproducible training by controlling the weights' initial values, which are otherwise generated randomly at the start of training.
To set how the weights are initialized, use the kernelInitializer property of the layer's configuration object; seeded initializers produce the same starting weights on every run.
Another way to set the weights is to call setWeights on the model, passing the weight values as arguments.
Also, the shuffle option of model.fit is set to true by default. It has to be set to false to prevent the training data from being shuffled at each epoch.
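A minimal sketch of both points, assuming @tensorflow/tfjs (the layer sizes, seeds and data below are placeholders, not anything from your setup):

import * as tf from '@tensorflow/tfjs';

async function run() {
  const model = tf.sequential();
  model.add(tf.layers.dense({
    inputShape: [4],
    units: 8,
    // Seeded initializers give the same starting weights on every run.
    kernelInitializer: tf.initializers.glorotUniform({seed: 42}),
    biasInitializer: 'zeros',
  }));
  model.add(tf.layers.dense({
    units: 1,
    kernelInitializer: tf.initializers.glorotUniform({seed: 43}),
    biasInitializer: 'zeros',
  }));
  model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

  // Placeholder data, seeded so the demo itself is repeatable.
  const xs = tf.randomNormal([16, 4], 0, 1, 'float32', 7);
  const ys = tf.randomNormal([16, 1], 0, 1, 'float32', 8);

  // shuffle: false keeps the batch order identical across runs.
  await model.fit(xs, ys, {epochs: 3, batchSize: 4, shuffle: false});
}
run();

With seeded initializers, a fixed data order and no dropout, repeated runs should reach the same state after the same number of batches.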

Related

How do I track random numbers with stateless services

I have a design problem that I'm looking for an efficient way to solve:
I have three instances of a single service running. Each instance is totally stateless and exposes a single endpoint /token. When the /token endpoint is called by a client, a random number is returned. The random number is generated by a non-repeating pseudo-random number generator, which generates a unique random integer for the first n times it is called and then repeats the same sequence for the next n times. In other words, it's a repeating cycle of n values. So if n = 20, it returns unique values in the range 0 to 20 for the first 20 times it is called.
The problem here is: given that I have three instances of this service running, how do I avoid duplicating random integers, since none of the services can know what random values the others have already generated?
Here's what I've done:
I have passed, as an environment variable, a seed value that ensures that all the services generate the same random sequence.
I have set up a database that each of these services can access remotely. It has a table with a single column, map, set to a default value of 0.
When the client calls the /token endpoint of a service, the service increments the value in the map column by 1 and fetches the resulting value.
I then return the random number this resulting value maps to in the random sequence (sketched below).
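To make the idea concrete, here is a rough sketch of that scheme: the shared seed comes from the environment variable, counter is the value read back after the increment in the database, and the function names (mulberry32, seededSequence, tokenFor) are made up for illustration.

// Small seeded PRNG (mulberry32); any seedable generator would do.
function mulberry32(seed: number) {
  return function () {
    let t = (seed += 0x6D2B79F5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Every instance derives the same permutation of 0..n-1 from the shared seed.
function seededSequence(seed: number, n: number): number[] {
  const rand = mulberry32(seed);
  const values = Array.from({length: n}, (_, i) => i);
  for (let i = n - 1; i > 0; i--) {        // Fisher-Yates shuffle
    const j = Math.floor(rand() * (i + 1));
    [values[i], values[j]] = [values[j], values[i]];
  }
  return values;
}

// counter is the shared, atomically incremented value from the database row.
function tokenFor(counter: number, seed: number, n: number): number {
  return seededSequence(seed, n)[counter % n];
}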
Is the above approach efficient?
Could the services experience a race condition when trying to access the database row?
Could this problem be solved without a database?
Suggestions would be really appreciated.
Thanks in advance
Sounds like snowflake-style IDs could be used.
Instead of regular integers (well, they will still be integers), you construct them so that they encode a bit of extra information bitwise.
So let's say your integers are 64 bits. You can reserve x bits for the service id, y bits for the seed, and z bits for the "real" part (for example, datetime ticks).
Doing this you will have something like 0001 (service id, x = 4 bits) 1010 (seed, y = 4 bits) 0101....0 (z = 56 bits of datetime ticks).
So now you can identify your generated numbers by service id, give them some seed (maybe from configuration) and some "real" part.
More on snowflake IDs you can probably find online.
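For illustration, a rough sketch of that bit layout, assuming 4 bits of service id, 4 bits of seed and 56 bits of datetime ticks (the widths are arbitrary and would be tuned to your case):

// Pack service id, seed and ticks into one 64-bit value; BigInt keeps all 64 bits exact.
function makeId(serviceId: bigint, seed: bigint, ticks: bigint): bigint {
  return ((serviceId & 0xFn) << 60n) | ((seed & 0xFn) << 56n) | (ticks & ((1n << 56n) - 1n));
}

// Recover the parts again by shifting and masking.
function decodeId(id: bigint) {
  return {
    serviceId: id >> 60n,
    seed: (id >> 56n) & 0xFn,
    ticks: id & ((1n << 56n) - 1n),
  };
}

const id = makeId(1n, 10n, BigInt(Date.now()));
console.log(id.toString(2).padStart(64, '0'));   // 64-bit layout: service | seed | ticks
console.log(decodeId(id));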

Sort by constant number

I need to randomize Solr (6.6.2) search results, but the order needs to be consistent given a specific seed. This is for a paginated search that returns a limited result set from a much larger one, so I must do the ordering at the query level and not at the application level once the data has been fetched.
Initially I tried this:
https://localhost:8984/solr/some_index/select?q=*:*&sort=random_999+ASC
Where 999 is a constant that is fed in when constructing the query prior to sending it to Solr. The constant value changes for each new search.
This solution works. However, when I run the query a few times, or run it on different Solr instances, the ordering is different.
After doing some reading, I found that random_ generates a number via:
fieldName.hashCode() + context.docBase + (int)top.getVersion()
This means that when the random number is generated, it takes the index version into account. This becomes problematic when using a distributed architecture or when indexes are updated, as is well explained here.
There are various recommended solutions online, but I am trying to avoid writing a custom random override. Is there some kind of trick where I can feed a function or equation into the sort param?
For example:
min(999,random_999)
Though this always results in the same order, even when either of the values changes.
This question is somewhat similar to this other question, but not quite.
I searched for answers on SO containing solr.RandomSortField, and while they point out what the issue is, none of them have a solution. It seems the best way would be to override the solr.RandomSortField logic, but it's not clear how.
Prior Research
https://lucene.472066.n3.nabble.com/Random-sorting-and-result-consistency-across-successive-calls-based-on-seed-td4170508.html
Solr: Random sort order after index version change
https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/%3CJIRA.13196983.1541639245000.300557.1541639520069#Atlassian.JIRA%3E
Solr - Return random results (Sort by Random)
https://realize.be/blog/random-results-apache-solr-and-drupal
https://lucene.472066.n3.nabble.com/Sorting-with-customized-function-of-score-td3987281.html
Even after implementing a custom random sort field, the results still differed across instances of Solr.
I ended up adding a new field, populated at index time, which is a 32-bit hash of an ID field that already existed in the document.
I then built a "stateless" linear congruential generator to produce a set of acceptably random numbers to use for sorting:
?sort=mod(product(hash_int_id,{seedConstant},982451653), 104395301) asc
Since this function technically passes a new seed for each row, and because it does not store state (like rand.Next() would), this solution is admittedly inferior and it is not a true PRNG; however, it does seem to get me most of the way there. Note that you will have to tune your values depending on the size of your data set and the size of the values in your hash_int_id equivalent field.
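For clarity, a rough re-creation of what that sort expression computes per document. Here hashIntId stands for the value stored in the hash_int_id field and seedConstant for the per-search constant; Solr evaluates function queries with limited floating-point precision, so treat this as an illustration of the idea rather than a bit-exact reproduction.

// mod(product(hash_int_id, seedConstant, 982451653), 104395301)
function sortKey(hashIntId: number, seedConstant: number): number {
  const product = BigInt(hashIntId) * BigInt(seedConstant) * 982451653n;
  return Number(product % 104395301n);
}

// Same seed -> same ordering on every instance; a new seed -> a new ordering.
const docHashes = [123456789, 987654321, 555555555];
console.log(docHashes.map(h => sortKey(h, 999)));
console.log(docHashes.map(h => sortKey(h, 1000)));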

How can you exclude a large initial value from a running delta calculation?

I'm trying to use a running delta calculation to graph how much additional storage is used per hour, from a field that contains how much storage is used. Let's say I have a field like disk_space_used_mb. If I have the values 50000, 50100, 50300, the running delta would be 50000, 100, 200, but I don't really care about the first value, and it throws off my graph. I can of course set the max value of the y-axis manually, but that isn't dynamic.
How can I prevent this first large value from throwing off my graph? Is there a way to force it to 0?
Sadly, this is currently not possible, and it is a very common problem when plotting running deltas.
As a workaround, if your initial value is static, you can create a new calculated field where you subtract the initial value from every row (so the first value will always be zero). But obviously this is not an elegant solution, and your chart's Y-axis values will differ from the real values.
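As a small illustration of that workaround with the numbers from the question (assuming the initial value of 50000 is known and static):

const diskSpaceUsedMb = [50000, 50100, 50300];
const initialValue = 50000;                       // static, known up front

// New calculated field: subtract the initial value from every row.
const adjusted = diskSpaceUsedMb.map(v => v - initialValue);   // [0, 100, 300]

// Running delta over the adjusted field: the first point becomes 0 instead of 50000.
const runningDelta = adjusted.map((v, i) => (i === 0 ? v : v - adjusted[i - 1]));
console.log(runningDelta);                        // [0, 100, 200]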
But if the initial value can be changed by the user (it is dynamic), you're really out of luck. The only solution I can imagine is to look for an alternative visualization that supports this feature or develop your own visualization.
The second option would probably solve your problem, but developing community visualizations is far from an easy task.

VAE input data scaling

Variational Autoencoders (VAEs) are quite a heavy concept in themselves. Unsurprisingly, most posts, comments and tutorials focus on the theory and architecture, but most also fail to address the topic of data scaling. While experimenting with VAEs I came across a (to me) surprising red flag: the way the data fed into a VAE is scaled matters a lot, and I could not get my head around the explanation.
To visualize the issue described below, see the notebook here: https://github.com/mobias17/VAE-Input-Scaling/blob/master/VAE%20Input%20Scaling.ipynb
Let's assume the goal is to reconstruct a sine wave (e.g. a sound wave) with a VAE. When I feed the standardized data through the model, it is only able to approximate values between -1 and 1. Obviously, the quick answer would be to normalize the data. Still, this leads to the following questions:
1) What is the rationale for the VAE only being able to approximate values between -1 and 1? (Is it the Gaussian reparameterization, vanishing gradients?)
2) Is there a way to overcome this boundary (model changes)?
3) What is the best practice to scale data for a VAE? Should the data be normalized over the std dev?
Results showing outputs are between -1 and 1
Variational autoencoders can approximate values in any range. The problem here is with this particular model's architecture.
The decoder of this VAE uses keras.layers.LSTM as its last layer.
This layer's default activation function is tanh, and the tanh function outputs values in the range (-1,1). This is why the model cannot generate values outside that range.
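A tiny illustration of that bound (plain Math.tanh, nothing model-specific): no matter how large the input gets, the output never leaves the (-1, 1) range, so a tanh-activated output layer cannot reproduce targets outside it.

for (const x of [-100, -2, 0, 2, 100]) {
  console.log(x, Math.tanh(x));   // tanh(-100) ≈ -1, tanh(0) = 0, tanh(100) ≈ 1; nothing falls outside [-1, 1]
}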
However, if we change the activation function to, say, linear, replacing
decoder_mean = LSTM(input_dim, return_sequences=True)  # default activation is tanh
with
decoder_mean = LSTM(input_dim, return_sequences=True, activation=None)  # None means linear, i.e. no squashing
the VAE can now approximate the data. This is the result I got after training for 100 epochs.
The general recommendation is to ensure that the data you are trying to approximate lies in the range of the function you are using to approximate it, either by scaling the data or choosing a more expressive function.

How is the hold out set chosen in vowpal wabbit

I am using vowpal wabbit for logistic regression. I came to know that vowpal wabbit selects a holdout set for validation from the given training data. Is this set chosen randomly? I have a very unbalanced dataset with 100 +ve examples and 1000 -ve examples. I want to know, given this training data, how vowpal wabbit selects the holdout examples.
How do I assign more weight to the +ve examples?
By default each 10th example is used for holdout (you can change it with --holdout_period,
see https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-line-arguments#holdout-options).
This means the model trained with holdout evaluation on is trained only on 90% of the training data.
This may result in slightly worse accuracy.
On the other hand, it allows you to use --early_terminate (which is set to 3 passes by default),
which makes it easier to reduce the risk of overtraining caused by too many training passes.
Note that by default, holdout evaluation is on only if multiple passes are used (VW uses progressive validation loss otherwise).
As for the second question, you can add an importance weight to the positive examples. The default importance weight is 1. See https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format
