27-29 November, Vilnius

Conference about Big Data, High Load, Data Science, Machine Learning & AI

The conference is over. See you next year!

GERARD TOONSTRA

BigData Republic, The Netherlands

Biography

Gerard Toonstra is an Apache Airflow enthusiast and has been excited about it ever since it was announced as open source. He contributed the initial HttpHook and HttpOperator and set up the site “ETL with airflow”, which is one of the richest practical sources of information about Apache Airflow. Gerard has a background in nautical engineering, but has worked in information technology since 1998, holding engineering positions in the UK, The Netherlands and Brazil.
He now works at BigData Republic in The Netherlands as a Big Data Architect / Engineer. BigData Republic is a multidisciplinary team of experienced and business-oriented Data Scientists, Data Engineers and Architects who, irrespective of an organization’s data maturity level, help translate business goals into the design, implementation and utilization of innovative solutions. In his spare time Gerard likes oil painting, and on his holidays he visits a beautiful beach in Brazil to read spy novels or psychology books.

Talk #1

Agile Data Architecture

For many retailers there is both big and small data to manage. Small data needs to be managed carefully, because it is subject to many business rules and definitions. Both streams should converge at some point to support forecasting models, effective recommendation engines and other uses. How can we develop an agile data architecture that supports this? This session provides an example of how a layered architecture consisting of a data lake, a data vault and the big data warehouse fits together, so that every group in the organization can connect to the architecture somewhere and get a consistent view of shipments, page views, sales and stock movements for the organization.
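As a rough illustration (a sketch, not material from the talk itself), the layers can be pictured as a dependency chain in an Airflow DAG; the DAG and task names below are hypothetical, and the imports follow Airflow 1.x conventions:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG("layered_retail_architecture",
              start_date=datetime(2019, 1, 1),
              schedule_interval="@daily")

    # Raw big data lands unmodified in the lake; the vault applies the
    # business rules governing the small data; the warehouse joins both
    # streams into consistent views for every group in the organization.
    ingest_lake = DummyOperator(task_id="ingest_to_data_lake", dag=dag)
    load_vault = DummyOperator(task_id="load_data_vault", dag=dag)
    build_warehouse = DummyOperator(task_id="build_big_data_warehouse", dag=dag)

    ingest_lake >> load_vault >> build_warehouse

Each group can read from the layer that matches its needs: data scientists from the lake, reporting teams from the warehouse.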

Talk #2

Design philosophy of Apache Airflow ETL Pipelines

Apache Airflow has attracted a lot of attention over the past couple of years. This session explains important principles that your ETL pipelines should maintain to be scalable and restartable; many of these principles have been known for years in functional programming communities. Apache Airflow is designed around that philosophy and naturally guides the developer towards better and more scalable pipelines.
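One such principle is idempotency: a task should behave like a pure function of its execution date, so rerunning it overwrites the same partition rather than duplicating data. A minimal sketch, assuming a hypothetical Spark job script load_sales.py:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("idempotent_sales_load",
              start_date=datetime(2019, 1, 1),
              schedule_interval="@daily")

    # {{ ds }} is the templated execution date. Because the job writes
    # exactly one date partition in overwrite mode, a rerun or backfill
    # is safe: the task is deterministic and leaves no duplicates behind.
    load = BashOperator(
        task_id="load_sales_partition",
        bash_command="spark-submit load_sales.py --date {{ ds }} --mode overwrite",
        dag=dag)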

Workshop

Apache Airflow hands on

Apache Airflow is attracting more attention worldwide as a de facto ETL platform. As the author of the site “ETL with airflow”, I’d like to share this knowledge and get novices up to speed with Apache Airflow as their ETL platform. Learn how to write your first DAG in Python, set up email notifications, configure the scheduler, and write your own hooks and operators, while picking up important principles to maintain when composing your DAGs.
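To give a flavour of the starting point, here is a sketch of a first DAG with failure notifications (not the workshop material itself; the alert address is a placeholder, and email also requires SMTP settings in airflow.cfg):

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # default_args are inherited by every task in the DAG;
    # email_on_failure enables the notifications covered in the workshop.
    default_args = {
        "owner": "airflow",
        "start_date": datetime(2019, 11, 1),
        "email": ["alerts@example.com"],
        "email_on_failure": True,
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG("my_first_dag", default_args=default_args,
              schedule_interval="@daily")

    hello = BashOperator(task_id="say_hello",
                         bash_command="echo 'hello airflow'",
                         dag=dag)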

Apache Airflow has become a very popular tool for running ETL, machine learning and data processing pipelines. Embedded in its implementation are the insights and learnings from years of experience in data engineering.

The workshop explains what these principles are and how they can be achieved rather effortlessly by putting the components of Apache Airflow together in a data processing workflow.
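As one example of putting those components together, a custom operator can wrap an existing hook; everything below (class name, connection id, endpoint) is hypothetical, using Airflow 1.x-style imports:

    from airflow.hooks.http_hook import HttpHook
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class FetchToLogOperator(BaseOperator):
        """Fetches an HTTP endpoint and logs the response size."""

        template_fields = ("endpoint",)  # allows {{ ds }} in the endpoint

        @apply_defaults
        def __init__(self, endpoint, http_conn_id="http_default",
                     *args, **kwargs):
            super(FetchToLogOperator, self).__init__(*args, **kwargs)
            self.endpoint = endpoint
            self.http_conn_id = http_conn_id

        def execute(self, context):
            # The hook handles connection details; the operator adds the
            # scheduling and retry behaviour on top of it.
            hook = HttpHook(method="GET", http_conn_id=self.http_conn_id)
            response = hook.run(self.endpoint)
            self.log.info("Fetched %d bytes", len(response.content))

Splitting the work this way keeps connection handling reusable across operators, which is the composition style the workshop walks through.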