Pulpie Orange

Pulpie Orange Base

Pareto-optimal main-content extraction from HTML.
610M-parameter encoder · 0.863 ROUGE-5 F1 on WebMainBench · the balanced Pulpie model.

GitHub · Blog · PyPI

Pulpie Orange Base extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It is an encoder that labels every HTML block as content or boilerplate in a single forward pass rather than generating output token by token. This makes it far faster and cheaper to run than autoregressive extractors while approaching state-of-the-art extraction quality.

At 610M parameters it sits between Orange Small (210M) and Orange Large (2.1B), scoring 0.863 ROUGE-5 F1. For most use cases Orange Small offers a better speed/quality trade-off: it scores 0.862 at 13.7 pages/sec on an L4, against 3.9 pages/sec for Base. Choose Base when you want a little more headroom than Small without the cost of the 2.1B teacher.

Usage

The easiest way to use this model is through the pulpie package:

pip install pulpie

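from pulpie import Extractor

# Any raw HTML string works here; this inline page is only a placeholder.
html = "<html><body><nav>Home</nav><article><p>Hello, world.</p></article></body></html>"

extractor = Extractor(model="orange-base")
result = extractor.extract(html)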

print(result.markdown)                # clean Markdown
print(result.html)                    # clean HTML
print(result.n_main, result.n_other)  # blocks kept vs dropped

Extractor picks the first available device, checking CUDA, then Apple MPS, then CPU. See the GitHub README for batch and multi-GPU usage.
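
When processing several pages, one simple pattern is to build the extractor once and reuse it. The sketch below relies only on the `extract` call and the `markdown` attribute shown above. The helper `extract_all` and the `StubExtractor` class are hypothetical names introduced for illustration; the stub stands in for `pulpie.Extractor` so the snippet runs without downloading weights. It is not the library's batch API, which is documented in the GitHub README.

```python
from typing import Iterable, Protocol


class HasExtract(Protocol):
    """Anything with an extract(html) method, e.g. pulpie.Extractor."""

    def extract(self, html: str): ...


def extract_all(extractor: HasExtract, pages: Iterable[str]) -> list[str]:
    """Run one shared extractor over many pages and collect the Markdown."""
    return [extractor.extract(html).markdown for html in pages]


class _StubResult:
    def __init__(self, markdown: str) -> None:
        self.markdown = markdown


class StubExtractor:
    """Hypothetical stand-in so the example runs without model weights."""

    def extract(self, html: str) -> _StubResult:
        return _StubResult(markdown=f"({len(html)} chars of HTML)")


if __name__ == "__main__":
    pages = ["<p>first</p>", "<p>second page</p>"]
    print(extract_all(StubExtractor(), pages))
```

With the real package installed, passing `Extractor(model="orange-base")` in place of the stub should work the same way, since only `extract` and `markdown` are used.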

How it works

Pulpie runs a four-stage pipeline:

1. Simplify: remove scripts, styles, and formatting noise; tag each block with a unique ID.
2. Chunk: pack blocks into sequences of up to 8,192 tokens separated by <|sep|> markers (~80% of pages fit in one chunk).
3. Classify: a single encoder forward pass labels every block (at its <|sep|> position) as content or boilerplate.
4. Reconstruct: return the kept blocks as HTML, or convert to Markdown.
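
The chunking and reconstruction stages can be pictured with a small sketch. This is not the pulpie implementation: the `Block` type, the function names, the greedy packing strategy, the placement of one `<|sep|>` after each block, and the use of whitespace words as a stand-in for tokenizer tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass

SEP = "<|sep|>"
MAX_TOKENS = 8192  # chunk budget quoted in the pipeline description


@dataclass
class Block:
    block_id: int
    text: str


def count_tokens(text: str) -> int:
    """Assumption: whitespace words stand in for real tokenizer tokens."""
    return len(text.split())


def chunk_blocks(blocks: list[Block], max_tokens: int = MAX_TOKENS) -> list[list[Block]]:
    """Greedily pack blocks into chunks; each block costs its tokens plus one separator.

    A single block larger than the budget still gets a chunk of its own.
    """
    chunks: list[list[Block]] = []
    current: list[Block] = []
    used = 0
    for block in blocks:
        cost = count_tokens(block.text) + 1  # +1 for the separator marker
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        chunks.append(current)
    return chunks


def render_chunk(chunk: list[Block]) -> str:
    """Join block texts, closing each block with a separator for the classifier."""
    return "".join(f"{b.text}{SEP}" for b in chunk)


def reconstruct(blocks: list[Block], keep: dict[int, bool]) -> str:
    """Keep only blocks labelled as content, in document order."""
    return "\n".join(b.text for b in blocks if keep.get(b.block_id, False))


if __name__ == "__main__":
    page = [
        Block(0, "Home About Contact"),
        Block(1, "The article body starts here."),
        Block(2, "Copyright notice"),
    ]
    for chunk in chunk_blocks(page, max_tokens=10):
        print(render_chunk(chunk))
    # Labels would come from the classifier; here they are written by hand.
    print(reconstruct(page, {0: False, 1: True, 2: False}))
```

Because every block owns exactly one separator position in this sketch, a per-separator prediction maps back to a block ID without any extra alignment step.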

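This model is a token-classification head over EuroBERT-610m, distilled from the 2.1B Pulpie Orange Large teacher. The distillation loss is a weighted sum of KL divergence against the teacher's outputs (weight 0.7) and cross-entropy against the hard labels (weight 0.3), with temperature 2.0.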
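
To make the distillation objective concrete, here is a minimal per-block sketch in plain Python using the weights and temperature quoted above. Several details are assumptions rather than facts from the training code: the KL direction (teacher to student), applying the temperature to both sets of logits, and omitting the temperature-squared scaling that some distillation recipes add. The function names are hypothetical.

```python
import math

ALPHA_KD = 0.7     # weight on the KL term (from the text above)
ALPHA_CE = 0.3     # weight on hard-label cross-entropy (from the text above)
TEMPERATURE = 2.0  # softmax temperature (from the text above)


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(
    student_logits: list[float],
    teacher_logits: list[float],
    label: int,
    temperature: float = TEMPERATURE,
) -> float:
    """Weighted sum of KL(teacher || student) on softened outputs and hard-label CE."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    ce = -math.log(softmax(student_logits)[label])  # CE uses unsoftened logits
    return ALPHA_KD * kl + ALPHA_CE * ce


if __name__ == "__main__":
    # Two classes per block: index 0 = boilerplate, index 1 = content.
    print(distill_loss([2.0, -1.0], [0.0, 3.0], label=1))
```

When the student's logits equal the teacher's, the KL term vanishes and only the cross-entropy term remains.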

Benchmarks

WebMainBench, English subset (6,647 pages), ROUGE-5 F1:

Model	Params	ROUGE-5 F1	Throughput (L4)
Pulpie Orange Large	2.1B	0.873	1.3 pages/sec
Dripper	0.6B	0.864	0.68 pages/sec
Pulpie Orange Base (this model)	610M	0.863	3.9 pages/sec
Pulpie Orange Small	210M	0.862	13.7 pages/sec
magic-html	-	0.700	-
Trafilatura	-	0.619	-
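
The "Pareto-optimal" claim can be checked directly against the table. The sketch below copies the four rows that report both a score and a throughput and keeps the models that no other row matches or beats on both axes. magic-html and Trafilatura are left out because the table gives no throughput for them. The function name `pareto_front` is a hypothetical helper, not part of pulpie.

```python
# (model, ROUGE-5 F1, pages/sec on L4), copied from the table above
RESULTS = [
    ("Pulpie Orange Large", 0.873, 1.3),
    ("Dripper", 0.864, 0.68),
    ("Pulpie Orange Base", 0.863, 3.9),
    ("Pulpie Orange Small", 0.862, 13.7),
]


def pareto_front(rows: list[tuple[str, float, float]]) -> list[str]:
    """Return names of rows that no other row dominates on quality and speed."""
    front = []
    for name, f1, speed in rows:
        dominated = any(
            o_f1 >= f1 and o_speed >= speed and (o_f1 > f1 or o_speed > speed)
            for _, o_f1, o_speed in rows
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    print(pareto_front(RESULTS))
```

On these figures the three Pulpie models are on the front, while Dripper is dominated by Orange Large, which is both more accurate and faster.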

Full analysis in the blog post.
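
For readers unfamiliar with the metric, ROUGE-N F1 with N = 5 compares the 5-grams of the extracted text against those of the reference text. The sketch below shows the general idea only. It assumes whitespace tokenization, and the benchmark's own scorer may tokenize and normalize differently, so its numbers are not comparable to the table.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-token window."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, candidate: str, n: int = 5) -> float:
    """F1 over clipped n-gram overlap between reference and candidate text."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())  # Counter & takes the minimum count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "a b c d e f g"
    candidate = "a b c d e f"
    print(round(rouge_n_f1(reference, candidate), 3))
```

Because a single wrongly kept or dropped block breaks every 5-gram that spans its boundary, long n-grams penalize extraction errors more sharply than unigram overlap would.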

Model family

Model	Params	ROUGE-5 F1	Use case
pulpie-orange-small	210M	0.862	Recommended — best value, fastest
pulpie-orange-base	610M	0.863	Balanced
pulpie-orange-large	2.1B	0.873	Highest quality (teacher)

Acknowledgements

Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. The encoder is built on EuroBERT (Boizard et al., 2025).

Citation

@note{pulpie2026,
  title  = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
  author = {Minhas, Bhavnick and Nigam, Shreyash and {Feyn Research}},
  year   = {2026},
  venue  = {Feyn Field Notes}
}

Built by Feyn. Model weights and the pulpie library are licensed under Apache 2.0.