BokehDepth: Released Checkpoints
This repository hosts the released checkpoints for BokehDepth: Boosting Monocular Metric Depth Estimation via Bokeh Rendering (ICML 2026).
- Paper: https://arxiv.org/abs/2512.12425
- Project page: https://fogradio.github.io/BokehDepth_Project/
- Code: https://github.com/fogradio/BokehDepth
BokehDepth is a two-stage framework. Stage-1 turns a single sharp image into a calibrated multi-strength bokeh stack (no depth map needed); Stage-2 fuses the resulting defocus cues to produce sharper, more reliable metric depth. The three files in this repository correspond exactly to the two-stage inference pipeline.
| File | Stage | Role | Size |
|---|---|---|---|
| bokeh_lora.bin | Stage-1 | Bokeh generation LoRA adapter on top of FLUX.1-Kontext | ≈ 556 MB |
| bokeh_lora_ft.bin | Stage-1 | Robustness-finetuned variant of bokeh_lora.bin | ≈ 556 MB |
| UDv2_dsfa_release.pth | Stage-2 | UniDepthV2 + DSFA depth estimator | ≈ 5.19 GB |
Stage-1: Bokeh Generation LoRA
Stage-1 uses FLUX.1-Kontext (rectified-flow MMDiT) plus a lightweight bokeh cross-attention adapter. Heterogeneous optical settings (focal length, aperture, focus distance) collapse into a single calibrated scalar K from the thin-lens circle-of-confusion model, which captures the near-linear relation r ≈ K · Δdisp between blur radius and disparity offset. Conditioned on K, Stage-1 turns one sharp image into a multi-strength bokeh stack with no depth map at any point.
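To make the role of K concrete, here is a minimal sketch of how a thin-lens circle-of-confusion model folds focal length, aperture, and focus distance into one scalar. It uses the textbook thin-lens formula; the function names, the `px_per_mm` sensor scale, and the millimetre units are illustrative assumptions, and the paper's own calibration and normalization of K may differ.

```python
def blur_strength_k(focal_length_mm, f_number, focus_distance_mm, px_per_mm):
    """Blur radius (pixels) per unit disparity offset (1/mm), textbook thin-lens model.

    Circle-of-confusion diameter for a point at depth d, lens focused at d_f:
        c = A * f * |d - d_f| / (d * (d_f - f))
          = A * f * d_f / (d_f - f) * |1/d - 1/d_f|
    with aperture diameter A = f / N. Everything except the disparity
    offset |1/d - 1/d_f| is folded into K.
    """
    aperture_mm = focal_length_mm / f_number
    lens_term = focal_length_mm * focus_distance_mm / (focus_distance_mm - focal_length_mm)
    # 0.5 converts CoC diameter to radius; px_per_mm converts sensor mm to pixels.
    return 0.5 * aperture_mm * lens_term * px_per_mm


def blur_radius_px(k, depth_mm, focus_distance_mm):
    """r = K * delta_disp, where delta_disp is the disparity offset from the focal plane."""
    delta_disp = abs(1.0 / depth_mm - 1.0 / focus_distance_mm)
    return k * delta_disp


if __name__ == "__main__":
    # Hypothetical camera: 50 mm lens focused at 2 m, 100 px per mm of sensor.
    for f_number in (1.4, 2.8, 5.6):
        k = blur_strength_k(50.0, f_number, 2000.0, 100.0)
        r = blur_radius_px(k, depth_mm=1000.0, focus_distance_mm=2000.0)
        print(f"f/{f_number}: K = {k:.1f}, blur radius at 1 m = {r:.2f} px")
```

Under this model, different lens settings that yield the same K produce the same blur for a given disparity offset, which is what allows a single scalar to condition the generator.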
bokeh_lora.bin: base LoRA
The base Stage-1 checkpoint. Trained on the unified Stage-1 data pipeline that aligns real defocused photos, synthetic renderings, and paired datasets onto the shared K axis.
bokeh_lora_ft.bin: robustness fine-tune
A continued fine-tune of bokeh_lora.bin that additionally mixes in synthetic bokeh renderings produced by BokehMe from subsets of the standard monocular-depth datasets KITTI / Hypersim / NYU-v2 / vKITTI 2. Since these datasets cover many scenes where the foreground is ambiguous, low-contrast, or simply absent, the resulting checkpoint is noticeably more robust at generating clean bokeh on such "no-clear-subject" inputs (driving scenes, dense indoor clutter, distant cityscapes, etc.) while preserving the calibrated K-control of the base LoRA.
Both LoRAs are wrapped at inference time by BokehFluxControlAdapter (see bokeh-generation/model/bokeh_adapter_flux.py in the code repository) and are loaded with lora_rank=128, lora_alpha=128 over FLUX transformer blocks 0–56.
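As a rough illustration of what these hyperparameters mean, the sketch below assumes the conventional LoRA formulation, in which the low-rank update is scaled by lora_alpha / lora_rank. This is not the code of BokehFluxControlAdapter; the helper names and the tiny matrices are hypothetical, and the actual adapter may scale or target layers differently.

```python
def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_scale(rank, alpha):
    """Conventional LoRA scaling factor (assumed, not confirmed by the source)."""
    return alpha / rank


def lora_forward(x, W, A, B, alpha):
    """y = W x + (alpha / rank) * B (A x), with A of shape (rank, d_in), B of shape (d_out, rank)."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_scale(rank, alpha)
    return [b + scale * u for b, u in zip(base, update)]


# Block indices 0..56 inclusive, as stated for both checkpoints.
TARGET_BLOCKS = range(0, 57)

if __name__ == "__main__":
    print("scale for rank=128, alpha=128:", lora_scale(128, 128))
    print("number of adapted block indices:", len(TARGET_BLOCKS))
```

With rank and alpha both set to 128, this convention gives a scale of 1, so the low-rank update is added to the frozen weights without extra rescaling.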
Stage-2: UniDepthV2-DSFA
UDv2_dsfa_release.pth
The Stage-2 metric depth model: UniDepthV2 (ViT-L/14 DINOv2 backbone) with our Divided Space Focus Attention (DSFA) module inserted into the depth encoder. DSFA first runs spatial attention inside each frame conditioned on that frame's blur strength K_f, then runs focus attention across frames at matching spatial locations, modulated by FiLM. Each location can therefore read how its blur grows with K, which is the physical depth-from-defocus cue. Only reference-frame tokens are passed downstream, so the original DPT decoder and metric head stay untouched.
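The divided attention pattern can be sketched in plain Python as a toy model of the data flow only. Everything here is an assumption made for illustration: attention uses identity projections and a single head, FiLM uses fixed hypothetical constants instead of a learned mapping from K, and where exactly the modulation is applied is a guess. The released model's DSFA module is defined in the code repository.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Toy single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)])
    return out


def film(token, k, gamma=0.1, beta=0.01):
    """FiLM-style scale and shift driven by blur strength k (gamma, beta are hypothetical)."""
    return [x * (1.0 + gamma * k) + beta * k for x in token]


def divided_space_focus_attention(stack, ks, ref=0):
    """stack[f][n] is the feature vector at spatial location n of frame f."""
    # 1) Spatial attention inside each frame, conditioned on that frame's K_f.
    spatial = [attend([film(tok, k) for tok in frame]) for frame, k in zip(stack, ks)]
    # 2) Focus attention across frames at each matching spatial location.
    num_frames, num_locs = len(spatial), len(spatial[0])
    focus = [[None] * num_locs for _ in range(num_frames)]
    for n in range(num_locs):
        column = attend([film(spatial[f][n], ks[f]) for f in range(num_frames)])
        for f in range(num_frames):
            focus[f][n] = column[f]
    # 3) Only the reference frame's tokens continue to the decoder.
    return focus[ref]
```

Because the output has the same shape as a single frame's token grid, a decoder that expects single-image features can consume it unchanged, which is consistent with the decoder and metric head staying untouched.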
This checkpoint is the plug-and-play DSFA build dropped onto UniDepthV2 and trained jointly with the Stage-1 bokeh stack as input. Use it together with the config UniDepth/configs/config_v2_vitl14_DSFA_inference.json in the code repository.
How to use
# from the project root
bash run_inference.sh
run_inference.sh expects all three files under weights/:
weights/
├── bokeh_lora.bin # or bokeh_lora_ft.bin (see ADAPTER_CKPT env var)
├── bokeh_lora_ft.bin
└── UDv2_dsfa_release.pth
Override which Stage-1 LoRA is used with:
ADAPTER_CKPT=weights/bokeh_lora_ft.bin bash run_inference.sh # robust default
ADAPTER_CKPT=weights/bokeh_lora.bin bash run_inference.sh # base LoRA
The Stage-2 weights path is set by WEIGHTS_PATH, which defaults to weights/UDv2_dsfa_release.pth.
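Before launching a long inference run, it can help to confirm that the checkpoints are where the script will look for them. The following is a hypothetical pre-flight helper, not part of the repository. It reads the ADAPTER_CKPT and WEIGHTS_PATH variables mentioned above; the fallback of weights/bokeh_lora_ft.bin for the adapter is an assumption, so check run_inference.sh for the actual default.

```python
import os
from pathlib import Path


def resolve_checkpoints(env=None):
    """Return the checkpoint paths implied by the environment.

    The adapter fallback is an assumed default; the depth fallback is the
    documented default for WEIGHTS_PATH.
    """
    env = os.environ if env is None else env
    return {
        "adapter": Path(env.get("ADAPTER_CKPT", "weights/bokeh_lora_ft.bin")),
        "depth": Path(env.get("WEIGHTS_PATH", "weights/UDv2_dsfa_release.pth")),
    }


def missing_checkpoints(paths):
    """List the logical names of checkpoints that do not exist as files."""
    return [name for name, path in paths.items() if not path.is_file()]


if __name__ == "__main__":
    paths = resolve_checkpoints()
    missing = missing_checkpoints(paths)
    for name, path in paths.items():
        status = "MISSING" if name in missing else "ok"
        print(f"{name:8s} {path} [{status}]")
```

Run it from the project root so the relative weights/ paths resolve the same way they do for run_inference.sh.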
Citation
If you use these checkpoints, please cite:
@inproceedings{zhang2026bokehdepth,
title = {Boosting Monocular Metric Depth Estimation via Bokeh Rendering},
author = {Zhang, Hangwei and Fortes, Armando and Wei, Tianyi and Pan, Xingang},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year = {2026}
}
License & acknowledgements
Released under CC BY-NC 4.0 for research use only. Stage-1 builds on FLUX.1-Kontext (Black Forest Labs) and Stage-2 builds on UniDepthV2; both upstream licenses apply to their respective base weights. The robustness fine-tune additionally relies on synthetic bokeh produced by BokehMe on standard monocular-depth datasets (KITTI / Hypersim / NYU-v2 / vKITTI 2). Please respect each dataset's individual license when redistributing derived data.