LOBench: Representation Learning of Limit Order Book

A Comprehensive Study and Benchmarking of Transferable LOB Representations

Muyao Zhong, Yushi Lin, Peng Yang
Southern University of Science and Technology

The first benchmark that unifies data, preprocessing, and evaluation for LOB representation learning on China A-share markets.

Motivation and Research Questions

The paper follows the structure of the accompanying Chinese explainer, tracing the evolution from industry demands to methodological advances. On high-frequency trading desks and in quantitative research, analysts routinely re-engineer bespoke indicators for each task. Such pipelines struggle to transfer across markets or asset classes, and they rarely expose the intrinsic quality of the learned representations.

LOBench reframes the problem through three guiding questions:

  1. How should LOB data be standardized? Real trading venues impose rules—daily price limits, call auctions, intraday halts—that reshape the stochastic process. LOBench proposes reproducible preprocessing and sampling pipelines that respect these constraints.
  2. What constitutes a sufficient representation? Encoders must capture level-to-level inventory pressure, asynchronous order arrivals, and rapidly shifting volatility regimes. The benchmark quantifies sufficiency through reconstruction fidelity and error attribution.
  3. When is representation learning necessary? By attaching lightweight decoders to frozen embeddings, LOBench evaluates whether general-purpose representations reduce engineering effort while delivering competitive accuracy on prediction and imputation tasks.

This perspective connects academic advances—self-attention, temporal convolutions, latent diffusion—to concrete trading scenarios such as short-term trend forecasting and liquidity risk assessment.

Figure 1. LOBench spans CNN/RNN baselines and recent Transformer-style encoders to probe representation capacity.
Figure 2. Weighted reconstruction objectives emphasise liquidity-critical levels during encoder training.

Study Overview

The Limit Order Book (LOB) records the full depth of bid and ask intentions across price levels and is indispensable for understanding microstructure dynamics, liquidity formation, and price discovery. This work revisits LOB modelling through the lens of representation learning. Instead of crafting task-specific predictors, the authors compare canonical CNN/RNN baselines, Transformer families, and recent time-series encoders under a unified experimental protocol.

LOBench consolidates the fragmented landscape by offering consistent data curation, an encoder–decoder benchmarking pipeline, and reproducible downstream evaluations. The benchmark targets two central questions: representation sufficiency—whether an encoder can capture strong temporal autocorrelation, cross-level constraints, and heterogeneous feature scales inherent to LOBs; and representation necessity—whether transferable embeddings outperform bespoke task designs in both accuracy and development efficiency.

All experiments are conducted on newly curated China A-share datasets that reflect T+1 settlement, price-limit rules, and retail-driven order flows. The resulting analysis produces practical guidelines for building general-purpose, finance-ready time-series representations.

Standardized Benchmark

Unified schemas for data formatting, normalization, slicing, and labelling underpin fair comparisons across reconstruction, trend prediction, and imputation tasks.

Real-World Market Coverage

Curated desensitized snapshots from Shenzhen Stock Exchange securities in 2019 capture liquidity tiers, sector diversity, and distinct institutional constraints.

Broad Model Spectrum

Benchmarked architectures span DeepLOB, TransLOB, SimLOB, TimesNet, iTransformer, TimeMixer, and other state-of-the-art time-series encoders.

Transferability Assessment

Lightweight feed-forward decoders probe how learned embeddings generalize to downstream tasks, highlighting when representation learning pays off.

Figure 3. A snapshot of the limit order book highlights multi-level bids/asks, market order matching, and cancellations.

Benchmark Design

LOBench decouples representation learning from task-specific objectives. Each model is trained as an encoder–decoder pair on standardized LOB windows; downstream experiments reuse the frozen encoder with simple decoders to isolate the contribution of the learned representations. The protocol mirrors the workflow described in the Chinese article: data alignment → representation learning → task transfer → quantitative diagnosis.
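This transfer protocol can be sketched in a few lines. The following is a hedged illustration only: a mean-pooling function stands in for a pretrained LOB encoder, and a closed-form ridge regression plays the role of the lightweight decoder head (neither is the paper's actual model).

```python
import numpy as np

def frozen_probe(encode, X_train, y_train, X_test, lam=1e-2):
    """Transfer-protocol sketch: embed windows with a frozen encoder, then
    fit only a linear (ridge) decoder on the resulting embeddings.

    `encode` is any fixed function mapping (N, T, F) windows to (N, D)
    embeddings; it stands in for a pretrained LOB encoder and is never
    updated, so downstream performance reflects representation quality.
    """
    Z_tr, Z_te = encode(X_train), encode(X_test)
    # Closed-form ridge regression as the lightweight decoder head.
    A = Z_tr.T @ Z_tr + lam * np.eye(Z_tr.shape[1])
    W = np.linalg.solve(A, Z_tr.T @ y_train)
    return Z_te @ W

def mean_pool(X):
    """Stand-in 'encoder': average each window over time (frozen by construction)."""
    return X.mean(axis=1)
```

Because only the decoder is trained, differences in downstream error can be attributed to the frozen embeddings rather than to task-specific tuning.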

The dataset contains 10-level bid/ask prices and volumes, order-flow derived features (mid-price, spread, depth imbalances), and calendar indicators. Raw transactions drawn from the Shenzhen Stock Exchange in 2019 are aggregated following exchange regulations, with outlier handling for suspended sessions and capped limit-move days. Sliding windows of 120 time steps form the input tensors, while target windows cover both reconstruction horizons and forward-looking prediction spans.
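As a minimal sketch of the windowing and derived-feature steps above (assuming a flat per-snapshot feature vector whose first four entries are best ask price, best bid price, best ask volume, and best bid volume; the paper does not specify this layout):

```python
import numpy as np

def derived_features(snap):
    """Compute mid-price, spread, and level-1 depth imbalance from one
    LOB snapshot. Column order is an illustrative assumption:
    snap[0] = best ask price, snap[1] = best bid price,
    snap[2] = best ask volume, snap[3] = best bid volume.
    """
    ask_p, bid_p, ask_v, bid_v = snap[0], snap[1], snap[2], snap[3]
    mid = (ask_p + bid_p) / 2.0
    spread = ask_p - bid_p
    imbalance = (bid_v - ask_v) / (bid_v + ask_v)
    return np.array([mid, spread, imbalance])

def sliding_windows(series, length=120, stride=60):
    """Cut a (T, F) feature matrix into overlapping (length, F) windows;
    stride = length // 2 gives the 50% overlap used in the pipeline."""
    return np.stack([series[i:i + length]
                     for i in range(0, len(series) - length + 1, stride)])
```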

Data Processing Pipeline

  • Full-day order-flow is synchronized to event time, respecting call auction boundaries, T+1 settlement, and price-limit mechanics.
  • Snapshots retain the top ten levels on both sides along with derived statistics; all data are anonymized and normalized within each trading day.
  • Sliding windows with 50% overlap provide balanced coverage of high- and low-liquidity periods for training, validation, and testing.
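The within-day normalization step can be sketched as follows (assuming one (T, F) array per trading day; the exact statistics used in the paper may differ):

```python
import numpy as np

def normalize_per_day(day, eps=1e-8):
    """Z-score each feature column within a single trading day.

    Feature-wise statistics preserve cross-level structure: prices and
    volumes are scaled independently instead of being mixed by a single
    global z-score over the whole tensor.
    """
    mu = day.mean(axis=0, keepdims=True)
    sigma = day.std(axis=0, keepdims=True)
    return (day - mu) / (sigma + eps)
```

Normalizing per day also prevents statistics from leaking across sessions with very different volatility regimes.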

Task Suite

  • Reconstruction: Evaluate how faithfully the decoder recovers the original LOB tensor, with weighted MSE emphasising best bid/ask levels.
  • Trend Prediction: Classify the mid-price movement over the next five steps using the encoding at time t; stratified sampling balances upward, downward, and neutral labels.
  • Imputation: Randomly mask 20% of entries in each sequence to assess robustness against missing depth information.
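The three tasks above can be sketched as small NumPy helpers. The level-weight normalization, the neutral-band threshold `eps`, and the label encoding are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def weighted_mse(pred, target, level_weights):
    """MSE with per-feature weights so best bid/ask levels dominate the
    reconstruction loss. Weights are normalized to mean 1, so uniform
    weights recover the plain MSE."""
    w = np.asarray(level_weights, dtype=float)
    w = w / w.mean()
    return float(np.mean(((pred - target) ** 2) * w))

def trend_label(mid, t, horizon=5, eps=1e-4):
    """Three-class trend label from the relative mid-price change over the
    next `horizon` steps: 2 = up, 1 = neutral, 0 = down (illustrative codes)."""
    change = (mid[t + horizon] - mid[t]) / mid[t]
    return 2 if change > eps else (0 if change < -eps else 1)

def mask_entries(x, ratio=0.2, seed=0):
    """Randomly zero out roughly `ratio` of the entries for the imputation
    task; returns the masked tensor and the boolean mask."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < ratio
    return np.where(mask, 0.0, x), mask
```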

Evaluation Protocol

Weighted MSE and unweighted MSE capture reconstruction and imputation fidelity; weighted cross-entropy and macro-F1 score quantify trend prediction accuracy under class imbalance. Every benchmark run logs training time, parameter count, and FLOPs, enabling a nuanced comparison between representation quality and computational overhead.
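Macro-F1 averages the per-class F1 scores with equal weight, so rare up/down classes count as much as the dominant neutral class. A minimal sketch (scikit-learn's `f1_score` with `average='macro'` computes the same quantity):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes=3):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight,
    making the metric sensitive to minority trend classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(0.0 if denom == 0 else 2 * tp / denom)
    return float(np.mean(scores))
```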

Figure 4. The data preprocessing pipeline standardises sampling, normalization, and labelling before representation learning.
Figure 5. Feature-wise normalization preserves cross-level structure better than global z-score scaling on LOB data.
Figure 6. Additional comparison illustrating stability gains from feature-wise normalization during volatile sessions.

Experimental Findings

The evaluation is organised into two groups, echoing the narrative set out in the Chinese article. Group 1 compares representation models in a reconstruction-only setting. Group 2 freezes the encoders and attaches lightweight heads for downstream prediction and imputation. All models are trained with Adam, cosine learning-rate decay, and early stopping based on validation weighted MSE.

Transformer-centric encoders dominate. TimesNet delivers the lowest reconstruction error across all five benchmark stocks, closely followed by iTransformer and TimeMixer. The CNN-based DeepLOB and recurrent baselines struggle to separate nearby price curves, especially during volatility spikes. Introducing weighted losses consistently lowers reconstruction MSE and reduces the variance of macro-F1 across stocks, underscoring the importance of prioritising best-level liquidity information.

Transferability aligns with representation quality. When encoders are reused for trend prediction, performance improvements mirror reconstruction rankings: the best representations yield the highest macro-F1 and the most stable calibration across stocks. For imputation, Transformer variants recover masked depth values with 10–15% lower error than convolutional counterparts, highlighting their ability to model cross-level dependencies.

Efficiency matters. LOBench reports training time in seconds, revealing that iTransformer and TimeMixer strike a favourable balance between accuracy and runtime, while TimesNet trades marginally higher cost for superior fidelity. These statistics provide actionable guidance for practitioners who must weigh latency constraints against modelling accuracy.

Key Takeaways

  • Transformer-based encoders best capture cross-level dependencies and volatility regimes.
  • Task-agnostic representations reduce the need for bespoke feature engineering downstream.
  • Weighted objectives are crucial for emphasising liquidity-critical price levels.

Open Challenges

  • Scalability to longer horizons and additional exchanges while preserving reproducibility.
  • Balancing model complexity with execution latency in production trading systems.
  • Leveraging self-supervised or contrastive objectives to further enhance transferability.
Figure 7. Group 1 results: Transformer variants achieve the lowest weighted reconstruction error across all benchmark stocks.
Figure 8. Representative reconstruction example showing improved separation of bid/ask trajectories by TimesNet and iTransformer.
Figure 9. Training dynamics for the imputation task highlight faster convergence and lower loss for Transformer-based encoders.
Figure 10. Transfer performance summary demonstrates the correlation between encoder quality and downstream macro-F1 and MSE.

Resources and Outlook

The public repository ships with preprocessing scripts, benchmark configurations, and experiment logs to replicate every figure reported in the paper, while the processed dataset is available through the HuggingFace hub. To comply with exchange policies, the released dataset consists of desensitized LOB snapshots; future updates will broaden the stock universe and extend coverage to additional trading venues. The team plans to enrich LOBench with self-supervised objectives, anomaly detection tasks, and risk-oriented evaluation metrics, inviting the community to collaborate on high-frequency financial representation learning.

BibTeX

@misc{zhong2025representationlearninglimitorder,
  title={Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking},
  author={Muyao Zhong and Yushi Lin and Peng Yang},
  year={2025},
  eprint={2505.02139},
  archivePrefix={arXiv},
  primaryClass={cs.CE},
  url={https://arxiv.org/abs/2505.02139},
}