I am new to reinforcement learning and am experimenting with training RL agents.
I have a question about reward formulation: from a given state, if the agent takes a good action I give a positive reward, and if the action is bad, I give a negative reward. So if I give the agent a very large positive reward when it takes a good action, say 100 times the magnitude of the negative reward, will that help the agent during training?
Intuitively I feel it will help the agent's training, but will there be any drawbacks to such a skewed reward structure?
Well, generally (personal opinion based on my experience) I think that rewards should be scaled relative to the impact the corresponding event has on the agent. If the problem is sparse rewards, you can have a look at the Arxiv Insights YouTube video on the topic to see how that can be addressed.
I can give one example of where this might be problematic: if the positive reward is much larger in magnitude than the negative rewards, the agent will probably not care much about risking the states with negative rewards in order to acquire the big positive reward. So you might end up with a risk-seeking agent.
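To make that concrete, here is a tiny, made-up two-action example (all probabilities and reward values are invented for illustration, not taken from the question): a "safe" action that always earns a modest reward, and a "risky" action that usually reaches the big positive reward but sometimes lands in a penalized state.

    # Hypothetical choice between two actions (all numbers invented for illustration):
    #   "safe"  : always yields a modest reward of +1
    #   "risky" : reaches the big positive reward with probability 0.6,
    #             and a penalized state with reward r_bad otherwise
    def expected_return(p_success, r_success, r_fail):
        return p_success * r_success + (1 - p_success) * r_fail

    r_bad = -2.0
    for r_good in (2.0, 200.0):   # balanced rewards vs. a 100x-skewed positive reward
        safe = 1.0
        risky = expected_return(0.6, r_good, r_bad)
        preferred = "risky" if risky > safe else "safe"
        print(f"r_good={r_good:>5}: safe={safe:.1f}, risky={risky:.1f} -> prefers {preferred}")

With the balanced scale the agent prefers the safe action; once the positive reward is scaled 100x, the expected value of the risky action dwarfs the occasional penalty, which is exactly the risk-seeking behaviour described above.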
I see frequent trades of the same token in the same wallet in the same minute. What is going on?
For example, this pair of swaps in and out of AIOZ are from the same wallet:
Out:
https://etherscan.io/tx/0xf306745eef0ea056d28df0a70a0b78c8544b2cb2421697a55371345486c74320
In:
https://etherscan.io/tx/0x827313545c9fe2c3f5990d11e728c9c9be7f4f1581a1107c5f75a4b97cb5f1ad
Thanks.
I received this answer from a blockchain dev:
The account you're referring to has been doing it since day 1 with large amounts, in the order of $20k to $30k, almost never in profit. This to me is wash trading to simulate volume. It is a common but unspoken practice of new projects until they get up and running with legitimate volume. Most CEXs require a certain sustained daily volume, while other, not-so-reputable ones do not care if it is done through wash trading. This behaviour usually dies down shortly after CEX listings with a reputable project. If it persists too long it is a serious warning sign. In this market I would be concerned if wash trading goes on longer than 2 months after IDO.
Pretty important stuff!
IMHO it is a failure of Stack Overflow to allow downvoting without justification (if at all). Clearly this was not downvoted because someone recognized a problem with the question; they may have had indigestion for all we know. I'm hesitant to post here anymore.
While implementing agents for various problems, I have seen that my actor loss decreases as expected, but my critic loss keeps increasing even though the learned policy is very good. This happens for DDPG, PPO, etc.
Any thoughts on why my critic loss is increasing?
I tried playing with hyperparameters, but that actually makes my policy worse.
In Reinforcement Learning, you typically shouldn't pay much attention to the precise values of your losses. They are not informative in the same sense that they would be in, for example, supervised learning. The loss values should only be used to compute the correct updates for your RL approach; they do not actually give you any real indication of how well or poorly you are doing.
This is because in RL, your learning targets are often non-stationary; they are often a function of the policy that you are modifying (hopefully improving!). It's entirely possible that, as the performance of your RL agent improves, your loss actually increases. Because of that improvement, the agent may discover new parts of its search space, which lead to new target values that it was previously completely oblivious to.
Your only really reliable metric for how well your agent is doing is the returns it collects in evaluation runs.
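As a minimal sketch of what tracking that looks like (the `env` and `policy` objects here are placeholders assumed to follow a classic Gym-style reset()/step() interface; adapt it to your own training loop):

    def evaluate(env, policy, n_episodes=10):
        """Average undiscounted return of the current policy over a few greedy episodes.

        `env` is assumed to expose reset() -> obs and step(action) -> (obs, reward,
        done, info), and `policy(obs)` to return an action without exploration noise;
        both are placeholders for whatever you use in your own training code.
        """
        returns = []
        for _ in range(n_episodes):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done, _ = env.step(policy(obs))
                total += reward
            returns.append(total)
        return sum(returns) / len(returns)

    # Call this every k training iterations and plot the result over time;
    # that curve, not the critic loss, tells you whether learning is progressing.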
I'm trying to implement the MCTS algorithm for a game. I can only use around 0.33 seconds per move. In that time I can generate only one or two playout games per child of the start state, which has around 500 child nodes. My simulations aren't random, but of course I can't make a good choice based on only one or two simulations. Further into the game the tree becomes smaller, and my choices are based on more simulations.
So my problem is in the first few moves. Is there a way to improve the MCTS algorithm so that it can simulate more games, or should I use another algorithm?
Is it possible to come up with some heuristic evaluation function for states? I realise that one of the primary benefits of MCTS is that in theory you wouldn't need this, BUT: if you can create a somewhat reasonable evaluation function anyway, this will allow you to stop simulations early, before they reach a terminal game state. Then you can back-up the evaluation of such a non-terminal game state instead of just a win or a loss. If you stop your simulations early like this, you may be able to run more simulations (because every individual simulation takes less time).
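A minimal sketch of such a cutoff playout (the `game` interface and `heuristic_value` function are placeholders for your own game code, and the cutoff depth is an arbitrary choice):

    import random

    def simulate(state, game, heuristic_value, max_depth=20):
        """Playout that is cut off after max_depth moves and backs up a heuristic value.

        `game` is assumed (purely for illustration) to expose is_terminal(state),
        result(state) in [0, 1], legal_moves(state) and apply(state, move);
        `heuristic_value(state)` should return an estimate in the same [0, 1] range.
        """
        for _ in range(max_depth):
            if game.is_terminal(state):
                return game.result(state)                    # real outcome: win = 1, loss = 0
            move = random.choice(game.legal_moves(state))    # or your informed playout policy
            state = game.apply(state, move)
        return heuristic_value(state)                        # early cutoff: back up the estimate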
Apart from that, you'll want to try to find ways to "generalize". If you run one simulation, you should try to see if you can also extract some useful information from that simulation for other nodes in the tree which you didn't go through. Examples of enhancements you may want to consider in this spirit are AMAF, RAVE, Progressive History, and the N-Gram Selection Technique.
Do you happen to know where the bottleneck is for your performance? You could investigate this using a profiler. If most of your processing time is spent in functions related to the game (move generation, advancing from one state to the next, etc.), you know for sure that you're going to be limited in the number of simulations you can do. You should then try to implement enhancements that make each individual simulation as informative as possible. This can, for example, mean using really good, computationally expensive evaluation functions. If the game code itself is already very well optimized and fast, moving extra computation time into things like evaluation functions will be more harmful to your simulation count and probably pay off less.
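If your implementation is in Python and you haven't located the bottleneck yet, a quick way to do so is with the standard cProfile module (use the equivalent profiler for whatever language you're actually using; the `run_search` callable below is a placeholder for whatever function runs your 0.33-second search):

    import cProfile
    import pstats

    def profile_search(run_search, *args, top=15, **kwargs):
        """Profile one call to your search routine and print the heaviest functions.

        `run_search` is a placeholder for your own entry point, e.g. something like
        run_search(root_state, time_budget=0.33); its arguments are passed through.
        """
        profiler = cProfile.Profile()
        profiler.enable()
        result = run_search(*args, **kwargs)
        profiler.disable()
        # If move generation / state transitions dominate the cumulative time,
        # the simulation count is limited by the game code itself.
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(top)
        return result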
For more on this idea of making individual simulations as informative as possible, it may be interesting to have a look through some things I wrote about my MCTS-based agent in General Video Game AI, which is also a real-time setting with a very computationally expensive game, meaning that simulation counts are severely constrained (but the branching factor is much, much smaller than it seems to be in your case). PDF files of my publications on this are also available online.
I had points docked on a homework assignment for calculating the wrong total cost in an amortized analysis of a dynamic array. I think the grader probably only looked at the total and not the steps I had taken, and I believe I accounted for malloc while their answer key did not.
Here is a section of my analysis:
The example we were shown did not account for malloc, but I saw a video that did, and it made a lot of sense, so I put it in there. I realize that although malloc is a relatively costly operation, it would probably be O(1) here, so I could have left it out.
But my question is this: Is there only 1 way to calculate cost when doing this type of analysis? Is there an objective right and wrong cost, or is the conclusion drawn what really matters?
You asked, "Is there only 1 way to calculate cost when doing this type of analysis?" The answer is no.
These analyses are on mathematical models of machines, not real ones. When we say things like "appending to a resizable array is O(1) amortized", we are abstracting away the costs of various procedures needed in the algorithm. The motivation is to be able to compare algorithms even when you and I own different machines.
In addition to different physical machines, however, there are also different models of machines. For instance, some models don't allow integers to be multiplied in constant time. Some models allow variables to be real numbers with infinite precision. In some models all computation is free and the only cost tracked is the latency of fetching data from memory.
As hardware evolves, computer scientists make arguments for new models to be used in the analysis of algorithms. See, for instance, the work of Tomasz Jurkiewicz, including "The Cost of Address Translation".
It sounds like your model included a concrete cost for malloc. That is neither wrong nor right. It might be a more accurate model on your computer and a less accurate model on the grader's.
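As an illustration of why that choice barely matters for the conclusion here, a quick sketch that tallies the cost of n appends to a doubling dynamic array under two cost models (the per-element copy cost of 1 and the constant malloc cost of 5 are made-up numbers for this example):

    def total_append_cost(n, malloc_cost=0):
        """Total cost of n appends to a dynamic array that doubles its capacity.

        Each append costs 1, each resize copies the existing elements at cost 1 per
        element, and the second model additionally charges a constant malloc_cost
        per allocation. All unit costs are arbitrary choices for this illustration.
        """
        cost, capacity, size = 0, 1, 0
        for _ in range(n):
            if size == capacity:               # full: allocate a bigger array and copy
                cost += malloc_cost + size
                capacity *= 2
            cost += 1                          # the append itself
            size += 1
        return cost

    for n in (1_000, 1_000_000):
        print(n, total_append_cost(n) / n, total_append_cost(n, malloc_cost=5) / n)

In both models the per-append (amortized) cost stays bounded by a small constant, so charging for malloc changes the totals but not the O(1)-amortized conclusion; what differs between your work and the answer key is simply which model was assumed.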
I'm using a feed-forward neural network in Python with the PyBrain implementation. For training, I'll be using the back-propagation algorithm. I know that with neural networks, we need just the right amount of data in order not to under- or over-train the network. I can get about 1200 different templates of training data for the datasets.
So here's the question:
How do I calculate the optimal amount of data for my training? Since I've tried with 500 items in the dataset and it took many hours to converge, I would prefer not to have to try too many sizes. The results were quite good with this last size, but I would like to find the optimal amount. The neural network has about 7 inputs, 3 hidden nodes and one output.
"How do I calculate the optimal amount of data for my training?"
It's completely solution-dependent. There's also a bit of art with the science. The only way to know if you're into overfitting territory is to be regularly testing your network against a set of validation data (that is data you do not train with). When performance on that set of data begins to drop, you've probably trained too far -- roll back to the last iteration.
"The results were quite good with this last size but I would like to find the optimal amount."
"Optimal" isn't necessarily possible; it also depends on your definition. What you're generally looking for is a high degree of confidence that a given set of weights will perform "well" on unseen data. That's the idea behind a validation set.
The diversity of the dataset is much more important than the quantity of samples you are feeding to the network.
You should customize your dataset to include and reinforce the data you want the network to learn.
After you have crafted this custom dataset, you have to start playing with the number of samples, as it is completely dependent on your problem.
For example: if you are building a neural network to detect the peaks of a particular signal, it would be completely useless to train your network with a zillion samples of signals that do not have peaks. Therein lies the importance of customizing your training dataset, no matter how many samples you have.
Technically speaking, in the general case, and assuming all examples are correct, more examples are always better. The real question is: what is the marginal improvement (the first derivative of answer quality)?
You can test this by training it with 10 examples, checking quality (say 95%), then 20, and so on, to get a table like:
Examples   Quality
10         95%
20         96%
30         96.5%
40         96.55%
50         96.56%
You can then clearly see your marginal gains and make your decision accordingly.
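A rough sketch of how you might build such a table (the `train_and_score` callable and the subset sizes are placeholders; this is just the generic learning-curve recipe, not PyBrain-specific code):

    def learning_curve(dataset, train_and_score, sizes=(10, 20, 30, 40, 50)):
        """Train on growing subsets of the data and report quality for each size.

        `train_and_score(train_subset, holdout)` is a placeholder: it should train a
        fresh network on the subset and return its quality (e.g. accuracy) on the
        held-out remainder.
        """
        results = []
        for n in sizes:
            quality = train_and_score(dataset[:n], dataset[n:])
            results.append((n, quality))
            print(f"{n:>4} examples -> {quality:.2%}")
        return results   # the gain between consecutive rows is your marginal improvement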