
Commit bea7eb4

Author: Nathan Lambert

Update RL docs for better sharing / adding models (huggingface#1563)

* init docs update
* style
* fix bad colab formatting, add pipeline comment
* update todo

1 parent ca68ab3 commit bea7eb4

File tree

6 files changed (+46 -68 lines)


docs/source/using-diffusers/other-modalities.mdx (+3 -2)

@@ -14,7 +14,8 @@ specific language governing permissions and limitations under the License.
 
 Diffusers is in the process of expanding to modalities other than images.
 
-Currently, one example is for [molecule conformation](https://www.nature.com/subjects/molecular-conformation#:~:text=Definition,to%20changes%20in%20their%20environment.) generation.
-* Generate conformations in Colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/geodiff_molecule_conformation.ipynb)
+Example type | Colab | Pipeline |
+:-------------------------:|:-------------------------:|:-------------------------:|
+[Molecule conformation](https://www.nature.com/subjects/molecular-conformation#:~:text=Definition,to%20changes%20in%20their%20environment.) generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/geodiff_molecule_conformation.ipynb) |
 
 More coming soon!

docs/source/using-diffusers/rl.mdx (+9 -2)

@@ -13,6 +13,13 @@ specific language governing permissions and limitations under the License.
 # Using Diffusers for reinforcement learning
 
 Support for one RL model and related pipelines is included in the `experimental` source of diffusers.
+More models and examples coming soon!
 
-To try some of this in colab, please look at the following example:
-* Model-based reinforcement learning on Colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb) ![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)
+# Diffuser Value-guided Planning
+
+You can run the model from [*Planning with Diffusion for Flexible Behavior Synthesis*](https://arxiv.org/abs/2205.09991) with Diffusers.
+The script is located in the [RL Examples](https://github.com/huggingface/diffusers/tree/main/examples/rl) folder.
+
+Or, run this example in Colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb)
+
+[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline
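
For orientation, a minimal sketch of loading and calling the value-guided pipeline that the new docs page documents. The class path and the call arguments follow the diffs in this commit; the gym environment and the hub checkpoint id are illustrative assumptions.

import gym
import d4rl  # noqa: F401 -- importing d4rl registers the locomotion envs with gym

from diffusers.experimental import ValueGuidedRLPipeline

# assumed environment; the docstring below notes only Hopper has pretrained models
env = gym.make("hopper-medium-v2")

pipeline = ValueGuidedRLPipeline.from_pretrained(
    "bglick13/hopper-medium-v2-value-function-hor32",  # assumed checkpoint id
    env=env,
)

obs = env.reset()
# arguments mirror the __call__ signature shown in value_guided_sampling.py
denorm_actions = pipeline(obs, planning_horizon=32, n_guide_steps=2)
next_observation, reward, terminal, _ = env.step(denorm_actions)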

examples/rl/README.md (+7 -4)

@@ -1,9 +1,12 @@
 # Overview
 
-These examples show how to run (Diffuser)[https://arxiv.org/abs/2205.09991] in Diffusers.
-There are four scripts,
-1. `run_diffuser_locomotion.py` to sample actions and run them in the environment,
-2. and `run_diffuser_gen_trajectories.py` to just sample actions from the pre-trained diffusion model.
+These examples show how to run [Diffuser](https://arxiv.org/abs/2205.09991) in Diffusers.
+There are two ways to use the script, `run_diffuser_locomotion.py`.
+
+The key option is a change of the variable `n_guide_steps`.
+When `n_guide_steps=0`, the trajectories are sampled from the diffusion model, but not fine-tuned to maximize reward in the environment.
+By default, `n_guide_steps=2` to match the original implementation.
+
 
 You will need some RL specific requirements to run the examples:
 
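
A small sketch of the two modes the updated README describes, assuming the values are collected in a single dict (the values come from run_diffuser_locomotion.py further down; the variable name `config` is assumed).

config = dict(
    n_samples=64,
    horizon=32,
    num_inference_steps=20,
    n_guide_steps=2,  # default: value-guided planning, matches the original implementation
    scale_grad_by_std=True,
    scale=0.1,
    eta=0.0,
)

# Unguided sampling: trajectories come straight from the diffusion model and
# skip the value network (faster, but not fine-tuned to maximize reward).
config["n_guide_steps"] = 0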

examples/rl/run_diffuser_gen_trajectories.py (-57)

This file was deleted.

examples/rl/run_diffuser_locomotion.py (+3 -1)

@@ -8,7 +8,7 @@
     n_samples=64,
     horizon=32,
     num_inference_steps=20,
-    n_guide_steps=2,
+    n_guide_steps=2,  # can set to 0 for faster sampling, does not use value network
     scale_grad_by_std=True,
     scale=0.1,
     eta=0.0,
@@ -40,13 +40,15 @@
         # execute action in environment
         next_observation, reward, terminal, _ = env.step(denorm_actions)
         score = env.get_normalized_score(total_reward)
+
         # update return
         total_reward += reward
         total_score += score
         print(
             f"Step: {t}, Reward: {reward}, Total Reward: {total_reward}, Score: {score}, Total Score:"
             f" {total_score}"
         )
+
         # save observations for rendering
         rollout.append(next_observation.copy())
 
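
For context, a hedged reconstruction of the rollout loop these fragments sit in, reusing the pipeline and env from the loading sketch above. Variable names follow the diff; the loop bound, the pipeline call, and the termination handling are assumptions.

obs = env.reset()
total_reward = 0
total_score = 0
rollout = [obs.copy()]

for t in range(1000):  # assumed episode length
    # plan an action with the value-guided pipeline
    denorm_actions = pipeline(obs, planning_horizon=32, n_guide_steps=2)

    # execute action in environment
    next_observation, reward, terminal, _ = env.step(denorm_actions)
    score = env.get_normalized_score(total_reward)

    # update return
    total_reward += reward
    total_score += score

    # save observations for rendering
    rollout.append(next_observation.copy())
    obs = next_observation
    if terminal:
        break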

src/diffusers/experimental/rl/value_guided_sampling.py (+24 -2)

@@ -23,6 +23,22 @@
 
 
 class ValueGuidedRLPipeline(DiffusionPipeline):
+    r"""
+    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
+    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+    Pipeline for sampling actions from a diffusion model trained to predict sequences of states.
+
+    Original implementation inspired by this repository: https://github.com/jannerm/diffuser.
+
+    Parameters:
+        value_function ([`UNet1DModel`]): A specialized UNet for fine-tuning trajectories base on reward.
+        unet ([`UNet1DModel`]): U-Net architecture to denoise the encoded trajectories.
+        scheduler ([`SchedulerMixin`]):
+            A scheduler to be used in combination with `unet` to denoise the encoded trajectories. Default for this
+            application is [`DDPMScheduler`].
+        env: An environment following the OpenAI gym API to act in. For now only Hopper has pretrained models.
+    """
+
     def __init__(
         self,
         value_function: UNet1DModel,
@@ -78,21 +94,26 @@ def run_diffusion(self, x, conditions, n_guide_steps, scale):
             for _ in range(n_guide_steps):
                 with torch.enable_grad():
                     x.requires_grad_()
+
+                    # permute to match dimension for pre-trained models
                     y = self.value_function(x.permute(0, 2, 1), timesteps).sample
                     grad = torch.autograd.grad([y.sum()], [x])[0]
 
                     posterior_variance = self.scheduler._get_variance(i)
                     model_std = torch.exp(0.5 * posterior_variance)
                     grad = model_std * grad
+
                 grad[timesteps < 2] = 0
                 x = x.detach()
                 x = x + scale * grad
                 x = self.reset_x0(x, conditions, self.action_dim)
+
             prev_x = self.unet(x.permute(0, 2, 1), timesteps).sample.permute(0, 2, 1)
-            # TODO: set prediction_type when instantiating the model
+
+            # TODO: verify deprecation of this kwarg
             x = self.scheduler.step(prev_x, i, x, predict_epsilon=False)["prev_sample"]
 
-            # apply conditions to the trajectory
+            # apply conditions to the trajectory (set the initial state)
             x = self.reset_x0(x, conditions, self.action_dim)
             x = self.to_torch(x)
         return x, y
@@ -126,5 +147,6 @@ def __call__(self, obs, batch_size=64, planning_horizon=32, n_guide_steps=2, sca
         else:
             # if we didn't run value guiding, select a random action
             selected_index = np.random.randint(0, batch_size)
+
         denorm_actions = denorm_actions[selected_index, 0]
         return denorm_actions
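
The core of the documented behavior is the value-guidance step in run_diffusion. Below is a standalone sketch of that step: the tensor operations mirror the hunk above, while the function wrapper, argument list, and shape comment are assumptions added for illustration.

import torch


def guide_step(x, value_function, timesteps, posterior_variance, scale):
    """One value-guidance step: nudge the noisy trajectory x uphill on the learned value."""
    with torch.enable_grad():
        x.requires_grad_()

        # permute to match dimension for pre-trained models
        # (assumed layout: x is (batch, horizon, transition_dim), the value net wants channels first)
        y = value_function(x.permute(0, 2, 1), timesteps).sample
        grad = torch.autograd.grad([y.sum()], [x])[0]

        # scale the gradient by the scheduler's posterior std for this timestep
        model_std = torch.exp(0.5 * posterior_variance)
        grad = model_std * grad

    grad[timesteps < 2] = 0  # no guidance on the final denoising steps
    x = x.detach()
    return x + scale * grad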
