Willy Lulciuc is a Software Engineer at WeWork and works on the Project Marquez team in San Francisco, making datasets discoverable and meaningful. Previously, he worked on the real-time streaming data platform powering BounceX, and before that, designed and scaled sensor data streams at Canary. When he's not reviewing code and creating indirections, he can be found experimenting with analog synthesizers.
The term data quality is used to describe the dependability, reliability, and usability of datasets. Data scientists and business analysts often determine the quality of a dataset by its trustworthiness and completeness. But what information might be needed to differentiate between useful and noisy data? How quickly can data quality issues be identified and explored? More importantly, how can metadata enable data scientists to make better sense of the high volume of data within their organization from a variety of data sources?
With Airflow now ubiquitous for DAG orchestration, organizations increasingly depend on Airflow to manage complex inter-DAG dependencies and provide up-to-date runtime visibility into DAG execution. At WeWork, Airflow has quickly become an important component of our Data Platform, powering use cases such as billing and space inventory. But what effects (if any) would upstream DAGs have on downstream DAGs if dataset consumption were delayed? What alerting rules should be in place to notify downstream DAGs of possible upstream processing issues or failures?
At WeWork, we feel it’s critical that DAG metadata is collected, maintained, and shared across the organization. This investment in metadata enables:
● Data lineage
● Data governance
● Data discovery
In this talk, we introduce Marquez: an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. We will demonstrate how metadata management with Marquez helps maintain inter-DAG dependencies, catalog historical runs of DAGs, and minimize data quality issues.