Central parameter in our problem statement, it is never explicitly given to the agents. We instead let each agent run as long as necessary and analyse the time elapsed afterwards.

Another point which needs to be discussed is the impact of the implementation of an algorithm on the comparison results. For each algorithm, many implementations are possible, some being better than others. Even though we did our best to provide the best possible implementations, BBRL does not compare algorithms but rather the implementations of each algorithm. Note that this issue mainly concerns small problems, since the complexity of the algorithms is preserved.

5 Illustration

This section presents an illustration of the protocol presented in Section 3. We first describe the algorithms considered for the comparison in Section 5.1, followed by a description of the benchmarks in Section 5.2. Section 5.3 shows and analyses the results obtained.

5.1 Compared algorithms

In this section, we present the list of the algorithms considered in this study. The pseudo-code of each algorithm can be found in S1 File. For each algorithm, a list of “reasonable” values is provided to test each of its parameters. When an algorithm has more than one parameter, all possible parameter combinations are tested, even for algorithms which do not use the offline phase explicitly. We considered that tuning their parameters with an optimisation algorithm chosen arbitrarily would not be fair with respect to either offline computation time or online performance.

PLOS ONE | DOI:10.1371/journal.pone.0157088 | June 15 | Benchmarking for Bayesian Reinforcement Learning

5.1.1 Random. At each time-step $t$, the action $u_t$ is drawn uniformly from $U$.

5.1.2 $\epsilon$-Greedy. The $\epsilon$-Greedy agent maintains an approximation of the current MDP and computes, at each time-step, its associated Q-function. The action is selected either randomly (with a probability of $\epsilon$, where $0 \le \epsilon \le 1$) or greedily (with a probability of $1 - \epsilon$) with respect to the approximated model. Tested values: $\epsilon \in \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$.

5.1.3 Soft-max. The Soft-max agent maintains an approximation of the current MDP and computes, at each time-step, its associated Q-function. The action is selected randomly, where the probability of drawing an action $u$ is proportional to $Q(x_t, u)$. The temperature parameter $\tau$ controls the impact of the Q-function on these probabilities ($\tau \to 0^+$: greedy selection; $\tau \to +\infty$: random selection). Tested values: $\tau \in \{0.05, 0.10, 0.20, 0.33, 0.50, 1.0, 2.0, 3.0, 5.0, 25.0\}$.
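As an illustration of the action-selection rules of Sections 5.1.2 and 5.1.3, the following Python sketch draws an action index from a vector of Q-values. It is not the pseudo-code of S1 File. The function names are hypothetical, ties in the greedy case are broken by taking the first maximiser, and the soft-max weights are assumed to take the usual Boltzmann form $\exp(Q(x_t, u)/\tau)$, which is consistent with the two limits of $\tau$ given in Section 5.1.3.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a uniformly random action index with probability epsilon,
    otherwise the index of the first action maximising q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return q_values.index(max(q_values))


def softmax_probabilities(q_values, tau):
    """Return action probabilities proportional to exp(Q / tau).

    ASSUMPTION: Boltzmann weighting; the exact form used by the authors
    is defined in their S1 File, not here.
    """
    if tau <= 0:
        raise ValueError("tau must be strictly positive")
    # Subtracting the maximum keeps every exponent <= 0, which avoids
    # overflow for small temperatures without changing the probabilities.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(q_values, tau, rng=random):
    """Draw one action index according to the soft-max probabilities."""
    probabilities = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probabilities, k=1)[0]
```

With `epsilon = 0.0` the first function is purely greedy and with `epsilon = 1.0` it reduces to the Random agent of Section 5.1.1. Likewise, a very small `tau` concentrates nearly all probability on the best action, while a very large `tau` makes the distribution nearly uniform.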

5.1.4 OPPS. Given a prior distribution $p^0_M(\cdot)$ and an E/E strategy space $S$ (either discrete or continuous), the Offline, Prior-based Policy Search algorithm (OPPS) identifies a strategy $\pi^* \in S$ which maximises the expected discounted sum of returns over MDPs drawn from the prior. The OPPS for Discrete Strategy spaces algorithm (OPPS-DS) [4, 8] formalises the strategy selection problem as a $k$-armed bandit problem, where $k = |S|$. Pulling an arm amounts to drawing an MDP from $p^0_M(\cdot)$ and playing the E/E strategy associated with this arm on it for one single trajectory. The discounted sum of returns observed is the return of this arm. This multi-armed bandit problem has been solved by using the UCB1 algorithm [9, 10]. The time budget is defined by a variable $\beta$, corresponding to the total number of draws performed by UCB1. The E/E strategies considered by Castronovo et al. are index-based strategies, where the index is generated by evaluating a.
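
The bandit formulation of OPPS-DS described in Section 5.1.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `opps_ds`, `sample_mdp` and `play_trajectory` are hypothetical, the returns are assumed to be scaled to [0, 1] so that the standard UCB1 exploration term applies, and returning the arm with the highest empirical mean at the end is also an assumption.

```python
import math


def opps_ds(strategies, sample_mdp, play_trajectory, budget):
    """Select one strategy among `strategies` with UCB1.

    strategies      -- candidate E/E strategies, one per arm (k = |S|)
    sample_mdp      -- hypothetical callable drawing an MDP from the prior
    play_trajectory -- hypothetical callable (strategy, mdp) -> discounted
                       return of one trajectory, assumed to lie in [0, 1]
    budget          -- total number of arm pulls (the time budget)
    """
    k = len(strategies)
    if budget < k:
        raise ValueError("budget must allow at least one pull per arm")
    counts = [0] * k   # number of pulls of each arm
    means = [0.0] * k  # empirical mean return of each arm

    def pull(arm):
        # One pull: draw an MDP from the prior, play the arm's strategy on
        # it for a single trajectory, and update the running mean.
        ret = play_trajectory(strategies[arm], sample_mdp())
        counts[arm] += 1
        means[arm] += (ret - means[arm]) / counts[arm]

    # Initialisation: pull every arm once.
    for arm in range(k):
        pull(arm)

    # Main loop: t is the number of pulls performed so far.
    for t in range(k, budget):
        ucb = [means[a] + math.sqrt(2.0 * math.log(t) / counts[a])
               for a in range(k)]
        pull(ucb.index(max(ucb)))

    best = means.index(max(means))
    return best, means, counts
```

In a toy check, `strategies` can be any list of objects and `play_trajectory` any function mapping a strategy and an MDP to a number in [0, 1]. The pulls should concentrate on the arm with the highest expected return as `budget` grows.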