ETL pipeline to achieve reliability at scale

The architecture of an online betting exchange. Luigi and Spark among other technologies.

In an online betting exchange, thousands of money-related transactions are generated per minute. This data flow turns accounting, a common and generally tedious task, into an interesting big data engineering problem. At Smarkets, accounting reports serve two main purposes: housekeeping of our financial operations and documentation for the relevant regulatory authorities. In both cases, the reliability and accuracy of the final result are crucial. Three factors made the original accounting system obsolete and called for a new pipeline that could scale: the reports are generated daily, failures when retrieving data from previous days must be handled, and transaction volume is growing fast.

This talk presents the ETL pipeline designed to meet the constraints highlighted above and explains the motivations behind the tech stack chosen for the job, which includes Python 3, Luigi and Spark, among others. These topics will be covered by describing the main technical problems solved by our design:

- Fault tolerance and reliability, i.e. the ability to identify faulty steps and rerun only those instead of the whole pipeline.
- Fast input/output.
- Fast computations.
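The idea of rerunning only the faulty steps can be illustrated with a minimal sketch. It is not the Smarkets pipeline and does not use Luigi itself; it is a plain-Python approximation of the general approach that Luigi-style schedulers take, where a step counts as complete once its output exists. The `Task` class, the `build` function, and the three step names (`extract`, `transform`, `report`) are all hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


class Task:
    """One pipeline step; it counts as done once its output file exists."""

    def __init__(self, name, out_dir, func, requires=()):
        self.name = name
        self.output = Path(out_dir) / f"{name}.json"
        self.func = func
        self.requires = list(requires)

    def complete(self):
        return self.output.exists()

    def run(self):
        # Read the outputs of upstream steps and pass them to this step.
        inputs = [json.loads(dep.output.read_text()) for dep in self.requires]
        result = self.func(*inputs)
        # Write to a temporary file, then rename, so a crash mid-write
        # never leaves a half-written output that looks complete.
        tmp = self.output.with_name(self.output.name + ".tmp")
        tmp.write_text(json.dumps(result))
        tmp.replace(self.output)


def build(task, executed):
    """Run `task` after its dependencies, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, executed)
    task.run()
    executed.append(task.name)


def demo():
    out_dir = tempfile.mkdtemp()
    state = {"fail": True}  # makes the transform step fail on the first attempt

    def extract():
        return [10, -4, 25]

    def transform(rows):
        if state["fail"]:
            raise RuntimeError("simulated transient failure")
        return [r * 2 for r in rows]

    def report(rows):
        return {"total": sum(rows)}

    extract_t = Task("extract", out_dir, extract)
    transform_t = Task("transform", out_dir, transform, [extract_t])
    report_t = Task("report", out_dir, report, [transform_t])

    first_run = []
    try:
        build(report_t, first_run)
    except RuntimeError:
        pass  # the pipeline stops at the faulty step

    state["fail"] = False  # the fault is fixed; rerun the same pipeline
    second_run = []
    build(report_t, second_run)
    total = json.loads(report_t.output.read_text())["total"]
    return first_run, second_run, total


if __name__ == "__main__":
    print(demo())
```

In this sketch the first attempt completes only `extract` before failing. The second attempt finds the `extract` output already on disk, skips it, and runs just `transform` and `report`. Keeping completion state in the outputs rather than in the scheduler is what makes a rerun cheap: restarting the whole pipeline is always safe, because finished steps are no-ops.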


