I've been working on Highway, a Strict Atomic Workflow Management System that handles massive amounts of log data from multiple concurrent processes. The challenge I faced was brutal: dozens of processes writing tons of metadata simultaneously, and I kept running into data corruption issues that would derail entire machine learning pipelines.
Traditional file systems just weren't cutting it. When 10 or 20 processes try to write to the same files, you get race conditions, partial writes, and corrupted data. I'd wake up to find that 80% of my log processing results were garbage because processes interfered with each other. It wasn't just inefficient – it was destroying months of work.
Having worked with Iceberg for years at both Chase UK and Revolut, I knew it solves this elegantly with ACID transactions and time travel, but integrating it meant dealing with Java dependencies and complex infrastructure I couldn't take on. I wanted something that would work in pure Python without all that overhead.
D A T A S H A R D
So I built datashard (with a bit of help from Gemini :)) – a Python implementation that brings Iceberg's core concepts to regular Python workflows. The key insight was implementing Optimistic Concurrency Control (OCC) properly. Instead of locking files while processes fight over access, datashard validates that the world still looks the way the process expects before committing changes. If something changed in the meantime, it retries safely.
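The read–validate–commit–retry cycle is easiest to see in a toy in-memory version. This is a hypothetical sketch to illustrate the OCC pattern, not datashard's actual API: each writer reads a value plus a version number, does its work without holding any lock, and the commit only succeeds if the version is still what it read.

```python
import threading

class OptimisticStore:
    """Toy in-memory store illustrating optimistic concurrency control.
    (Hypothetical illustration of the OCC pattern, not datashard's API.)"""

    def __init__(self, value=0):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()  # guards only the brief compare-and-swap

    def read(self):
        with self._lock:
            return self._value, self._version

    def commit(self, new_value, expected_version):
        # Succeed only if nothing changed since the caller's read.
        with self._lock:
            if self._version != expected_version:
                return False  # conflict: caller must re-read and retry
            self._value = new_value
            self._version += 1
            return True

def increment(store):
    while True:  # retry loop: on conflict, re-read and try again
        value, version = store.read()
        if store.commit(value + 1, version):
            return

store = OptimisticStore()
threads = [threading.Thread(target=increment, args=(store,)) for _ in range(12)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.read())  # all 12 increments survive, none lost
```

The point of the pattern is that no writer blocks another during its actual work; the only serialized step is the cheap version check at commit time, and a failed check costs a retry rather than corrupted data.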
The results shocked me. In my test with 12 concurrent processes each doing 10,000 operations, traditional file operations completed only 8,566 successfully – meaning over 111,000 operations were lost to corruption. With datashard, all 120,000 operations completed safely. Zero data loss.
Now Highway's log processing pipeline handles massive concurrency without fear of corruption. Multiple ML models can ingest and process the same data streams, different workflow stages can safely update metadata, and I never worry about processes stepping on each other. The time travel feature means I can debug issues by examining data as it existed at any point, which has saved countless hours of troubleshooting.
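Time travel falls out naturally once every commit produces an immutable snapshot, the way Iceberg does it. Here is a minimal sketch of that idea; the class and method names are my own invention for illustration, not datashard's real interface:

```python
class SnapshotLog:
    """Sketch of time travel via immutable snapshots, in the spirit of
    Iceberg's snapshot model. (Hypothetical API, not datashard's.)"""

    def __init__(self):
        self._snapshots = []  # every commit appends a full immutable state

    def commit(self, state):
        self._snapshots.append(dict(state))  # copy, so old versions never mutate
        return len(self._snapshots) - 1      # the new snapshot id

    def read(self, snapshot_id=None):
        # Default: latest state; pass an id to see data as it existed then.
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return self._snapshots[snapshot_id]

log = SnapshotLog()
s0 = log.commit({"rows": 100})
s1 = log.commit({"rows": 250, "schema": "v2"})
print(log.read(s0))  # debug against the old state: {'rows': 100}
print(log.read())    # latest state: {'rows': 250, 'schema': 'v2'}
```

Because commits never overwrite earlier snapshots, reproducing "what did the metadata look like when the pipeline broke?" is just a read at an old snapshot id.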
The hardest part was getting the atomic commit patterns right – ensuring that either all changes succeed or none do, while maintaining performance. But seeing data pipelines run reliably with dozens of concurrent operations makes it worth every hour spent debugging concurrency edge cases.
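The classic building block for all-or-nothing file updates is write-to-temp-then-rename: write the new content to a temporary file in the same directory, fsync it, then atomically rename it over the target. os.replace is atomic on both POSIX and Windows, so readers see either the old file or the new one, never a partial write. A minimal sketch of the general pattern (not datashard's internals):

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON to `path` atomically: readers never observe a torn file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Temp file must live in the same directory: rename is only atomic
    # within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp_path, path)  # the atomic "commit point"
    except BaseException:
        os.unlink(tmp_path)  # failed commit leaves no partial state behind
        raise

atomic_write_json("metadata.json", {"snapshot": 42, "files": 3})
```

Either the rename happens and the whole new file is visible, or it doesn't and the old file is untouched – which is exactly the all-or-nothing property the commit path needs.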
Install it with: pip install datashard