BokehDepth: Released Checkpoints

This repository hosts the released checkpoints for BokehDepth: Boosting Monocular Metric Depth Estimation via Bokeh Rendering (ICML 2026).

BokehDepth is a two-stage framework. Stage-1 turns a single sharp image into a calibrated multi-strength bokeh stack (no depth map needed); Stage-2 fuses the resulting defocus cues to produce sharper, more reliable metric depth. The three files in this repository correspond exactly to the two-stage inference pipeline.

| File | Stage | Role | Size |
|---|---|---|---|
| bokeh_lora.bin | Stage-1 | Bokeh generation LoRA adapter on top of FLUX.1-Kontext | ≈ 556 MB |
| bokeh_lora_ft.bin | Stage-1 | Robustness-finetuned variant of bokeh_lora.bin | ≈ 556 MB |
| UDv2_dsfa_release.pth | Stage-2 | UniDepthV2 + DSFA depth estimator | ≈ 5.19 GB |

Stage-1: Bokeh Generation LoRA

Stage-1 uses FLUX.1-Kontext (rectified-flow MMDiT) plus a lightweight bokeh cross-attention adapter. Heterogeneous optical settings (focal length, aperture, focus distance) collapse into a single calibrated scalar K from the thin-lens circle-of-confusion model, which captures the near-linear relation r ≈ K · Δdisp between blur radius and disparity offset. Conditioned on K, Stage-1 turns one sharp image into a multi-strength bokeh stack without using a depth map at any point.
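The linear relation between blur radius and disparity offset can be checked against the textbook thin-lens model. The sketch below is illustrative only: the function names, the millimetre units, and the use of inverse depth as the disparity axis are assumptions, and the released code may normalize K differently (for example to pixels or to a dataset-specific disparity scale).

```python
import math


def coc_radius_thin_lens(f_mm, n_stop, focus_mm, depth_mm):
    """Thin-lens circle-of-confusion radius on the sensor, in mm.

    f_mm: focal length, n_stop: f-number, focus_mm: focus distance,
    depth_mm: distance of the scene point. All names are illustrative.
    """
    aperture = f_mm / n_stop  # aperture diameter A = f / N
    diameter = aperture * f_mm * abs(depth_mm - focus_mm) / (
        depth_mm * (focus_mm - f_mm)
    )
    return diameter / 2.0


def blur_coefficient(f_mm, n_stop, focus_mm):
    """Coefficient K such that r = K * |1/depth - 1/focus| (assumed form)."""
    aperture = f_mm / n_stop
    return 0.5 * aperture * f_mm * focus_mm / (focus_mm - f_mm)


if __name__ == "__main__":
    # Hypothetical lens: 50 mm at f/1.8, focused at 2 m, scene point at 5 m.
    f_mm, n_stop, focus_mm, depth_mm = 50.0, 1.8, 2000.0, 5000.0
    r = coc_radius_thin_lens(f_mm, n_stop, focus_mm, depth_mm)
    k = blur_coefficient(f_mm, n_stop, focus_mm)
    delta_disp = abs(1.0 / depth_mm - 1.0 / focus_mm)
    # Both routes give the same radius under the ideal thin-lens model.
    print(math.isclose(r, k * delta_disp, rel_tol=1e-9))
```

Under the ideal thin-lens model the relation is exact in inverse depth, so focal length, f-number, and focus distance affect the blur only through the single coefficient K.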

bokeh_lora.bin: base LoRA

The base Stage-1 checkpoint. It is trained on the unified Stage-1 data pipeline, which aligns real defocused photos, synthetic renderings, and paired datasets onto the shared K axis.

bokeh_lora_ft.bin: robustness fine-tune

A continued fine-tune of bokeh_lora.bin that additionally mixes in synthetic bokeh renderings produced by BokehMe from subsets of the standard monocular-depth datasets KITTI / Hypersim / NYU-v2 / vKITTI 2. Since these datasets cover many scenes where the foreground is ambiguous, low-contrast, or simply absent, the resulting checkpoint is noticeably more robust at generating clean bokeh on such "no-clear-subject" inputs (driving scenes, dense indoor clutter, distant cityscapes, etc.) while preserving the calibrated K-control of the base LoRA.

Both LoRAs are wrapped at inference time by BokehFluxControlAdapter (see bokeh-generation/model/bokeh_adapter_flux.py in the code repository) and are loaded with lora_rank=128, lora_alpha=128 over FLUX transformer blocks 0–56.
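To make the rank and alpha settings concrete, here is a small sketch of what they imply under the standard LoRA convention, where the low-rank update is scaled by alpha / rank. That BokehFluxControlAdapter follows this convention is an assumption, and the 3072-wide layer in the usage example is an illustrative size, not a value taken from this repository.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Scaling applied to the low-rank update B @ A (standard LoRA convention)."""
    return alpha / rank


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added to one linear layer: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # With lora_rank=128 and lora_alpha=128 the update is applied unscaled.
    print(lora_scale(rank=128, alpha=128))
    # Hypothetical 3072 x 3072 projection, for illustration only.
    print(lora_extra_params(d_in=3072, d_out=3072, rank=128))
```

With rank equal to alpha, the scale factor is 1, so the learned update is added to the frozen weight as is.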


Stage-2: UniDepthV2-DSFA

UDv2_dsfa_release.pth

The Stage-2 metric depth model: UniDepthV2 (ViT-L/14 DINOv2 backbone) with our Divided Space Focus Attention (DSFA) module inserted into the depth encoder. DSFA first runs spatial attention inside each frame conditioned on that frame's blur strength K_f, then runs focus attention across frames at matching spatial locations, modulated by FiLM. Each location can therefore read how its blur grows with K, which is the physical depth-from-defocus cue. Only reference-frame tokens are passed downstream, so the original DPT decoder and metric head stay untouched.
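The two attention passes can be pictured with a shape-level NumPy sketch. It is not the released DSFA module: there are no learned projections, a single head is used, and the additive K embedding, the linear FiLM maps (`w_gamma`, `w_beta`), the residual connections, and the reference frame at index 0 are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x):
    """Single-head self-attention over the second-to-last axis, no projections."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x


def dsfa_sketch(tokens, k_values, k_embed, w_gamma, w_beta, ref_index=0):
    """tokens: (F, N, C) frame tokens; k_values: (F,) blur strength per frame.

    k_embed, w_gamma, w_beta: (C,) hypothetical conditioning vectors.
    Returns the reference-frame tokens, shape (N, C).
    """
    k = k_values[:, None, None]            # (F, 1, 1), broadcasts over N and C
    x = tokens + k * k_embed               # condition each frame on its K_f
    x = x + attention(x)                   # spatial attention: over N, per frame
    xt = np.swapaxes(x, 0, 1)              # (N, F, C): group by spatial location
    focus = np.swapaxes(attention(xt), 0, 1)  # focus attention: over F, per location
    # FiLM: per-frame scale and shift of the focus branch, predicted from K_f.
    x = x + (1.0 + k * w_gamma) * focus + k * w_beta
    return x[ref_index]                    # only reference-frame tokens continue


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F, N, C = 4, 6, 8                      # frames, locations, channels (toy sizes)
    out = dsfa_sketch(
        rng.normal(size=(F, N, C)),
        np.linspace(0.0, 1.0, F),          # reference frame has K = 0 (sharp)
        rng.normal(size=C), rng.normal(size=C), rng.normal(size=C),
    )
    print(out.shape)
```

The point of the split is cost and structure: spatial attention mixes the N locations inside one frame, while focus attention mixes only the F frames at one location, which is where the blur-versus-K signal lives.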

This checkpoint is the plug-and-play DSFA build dropped onto UniDepthV2 and trained jointly with the Stage-1 bokeh stack as input. Use it together with the config UniDepth/configs/config_v2_vitl14_DSFA_inference.json in the code repository.


How to use

# from the project root
bash run_inference.sh

run_inference.sh expects all three files under weights/:

weights/
├── bokeh_lora.bin          # or bokeh_lora_ft.bin (see ADAPTER_CKPT env var)
├── bokeh_lora_ft.bin
└── UDv2_dsfa_release.pth
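Before launching inference it can help to confirm the layout above. The helper below is a hypothetical convenience, not part of the released code; it only checks that the three file names listed in this repository exist under a given directory.

```python
from pathlib import Path

# File names as listed in this repository.
EXPECTED_FILES = ("bokeh_lora.bin", "bokeh_lora_ft.bin", "UDv2_dsfa_release.pth")


def missing_checkpoints(weights_dir="weights"):
    """Return the expected checkpoint names that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```

Run it from the project root so that the relative weights/ path resolves the same way it does for run_inference.sh.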

Override which Stage-1 LoRA is used with:

ADAPTER_CKPT=weights/bokeh_lora_ft.bin bash run_inference.sh   # robust default
ADAPTER_CKPT=weights/bokeh_lora.bin    bash run_inference.sh   # base LoRA

The Stage-2 weights path is set via WEIGHTS_PATH, which defaults to weights/UDv2_dsfa_release.pth.


Citation

If you use these checkpoints, please cite:

@inproceedings{zhang2026bokehdepth,
  title     = {Boosting Monocular Metric Depth Estimation via Bokeh Rendering},
  author    = {Zhang, Hangwei and Fortes, Armando and Wei, Tianyi and Pan, Xingang},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}

License & acknowledgements

Released under CC BY-NC 4.0 for research use only. Stage-1 builds on FLUX.1-Kontext (Black Forest Labs) and Stage-2 builds on UniDepthV2; both upstream licenses apply to their respective base weights. The robustness fine-tune additionally relies on synthetic bokeh produced by BokehMe on standard monocular-depth datasets (KITTI / Hypersim / NYU-v2 / vKITTI 2). Please respect each dataset's individual license when redistributing derived data.
