Scaling Data Processing for Training Workloads at DeepL Research with Rust

Jonas Dedden, Johanna Goergen

Rust
Python Skill Intermediate
Domain Expertise Intermediate
Wednesday 15:00

We set out to replace an inefficient internal file format with an industry standard: a seemingly straightforward task. What we got instead was a descent into memory-leak hell.

This talk walks you through our journey of scaling DeepL's data preprocessing and model-training pipelines to handle petabyte-scale corpora. When open-source, C++-based Python libraries proved too unstable and memory-inefficient, we invested time and resources into developing our own Rust-based tooling. Compared to our previous internal file format, it decreased memory load by a factor of 10 and latency until first byte read by a factor of 50.

What we'll cover:

• Why Rust's memory safety guarantees matter in practice: a direct comparison of our results using C++-based vs. Rust-based implementations of data processing libraries.

• The Rust ecosystem advantage for Python interop: while C++ offers a fragmented landscape of build systems and tooling choices, Rust provides a canonical path with cargo, maturin, and PyO3, giving a clean interface for everything from GIL management to readable, zero-copy conversions between Rust and Python objects.

• Rust's surprisingly friendly features: despite its reputation for a steep learning curve, Rust offers language features that make it genuinely pleasant to work with, even for beginners coming from a Python background: enums, pattern matching, error handling with Result, and cargo's canonical, ergonomic tooling.

• Rust's impact on the Arrow ecosystem and on data engineering with Python in general: beyond the well-known influence that Rust-based data processing libraries like polars, Daft, and DataFusion are having on the engineering ecosystem, we will show how arrow-rs, the Rust implementation of Arrow, is expanding the data engineering toolkit by powering a growing number of excellent, contributor-friendly processing and introspection tools built in Rust.
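To give a flavor of the "friendly features" point above, here is a minimal, hypothetical sketch (not DeepL's actual tooling): an enum models the kinds of record a line can hold, pattern matching dispatches on the input, and Result makes error handling explicit instead of relying on exceptions.

```rust
// Hypothetical record parser illustrating enums, pattern matching,
// and Result-based error handling. All names are made up for this sketch.

#[derive(Debug, PartialEq)]
enum Record {
    Text(String),
    Count(u64),
}

#[derive(Debug, PartialEq)]
enum ParseError {
    Empty,
    BadCount(String),
}

fn parse_record(line: &str) -> Result<Record, ParseError> {
    match line.trim() {
        // An empty line is an explicit, typed error, not a silent None.
        "" => Err(ParseError::Empty),
        // All-digit input is parsed as a count; overflow surfaces as BadCount.
        s if s.chars().all(|c| c.is_ascii_digit()) => s
            .parse::<u64>()
            .map(Record::Count)
            .map_err(|_| ParseError::BadCount(s.to_string())),
        // Everything else is plain text.
        s => Ok(Record::Text(s.to_string())),
    }
}

fn main() {
    assert_eq!(parse_record("42"), Ok(Record::Count(42)));
    assert_eq!(parse_record("hello"), Ok(Record::Text("hello".into())));
    assert_eq!(parse_record("   "), Err(ParseError::Empty));
    println!("all checks passed");
}
```

The compiler enforces that every enum variant is handled, and the `?` operator (not shown) lets such Results propagate up a pipeline with a single character, which is part of what makes the transition from Python gentler than Rust's reputation suggests.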

Jonas Dedden

Hi, I'm Jonas Dedden, Staff Research Data Engineer at DeepL SE, Germany. Johanna Goergen and I work on the Research Data Platform team of DeepL Research, where we are responsible for the on-prem & cloud-based k8s compute infrastructure for petabyte-scale data processing pipelines. We provide the platform our Research Data Engineers use to collect & preprocess all the data needed for training the DeepL foundational language models that power our production services.

Johanna Goergen

I'm a Staff Research Data Engineer in the Research Department of DeepL, working on platform-level tooling for scaling data pipelines to petabyte scale. I have been part of the initiative to adopt Rust in critical components used for model training, and I'm looking forward to sharing this experience with you.