Shirshanka is a Principal Staff Software Engineer and the architect for LinkedIn's “Analytics Platforms and Applications” team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team on simplifying the big data analytics space through a multitude of mostly open-source projects: Apache Pinot (incubating), a high-performance distributed OLAP engine; Apache Gobblin (incubating), a distributed data integration framework; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.
What is metadata?
What sorts of data constructs does it apply to?
When should you collect it?
Where and how should you store it?
What can you do with it?
How do you scale it to a million data constructs, thousands of people, and hundreds of teams?
These fundamental questions are at the heart of LinkedIn’s metadata evolution, a journey that started with a small team trying to improve the searchability of Hadoop data. Over the years, the resulting system has grown into the central data hub where all of LinkedIn's data assets (more than a million, spanning online, streaming and batch) have a home. It is deployed at global scale and powers data productivity for all engineers and data enthusiasts, while serving as critical infrastructure for data privacy by default in our data systems.
In this talk, I focus on strategies for modeling metadata, storing it, and scaling its acquisition and refinement across thousands of metadata authors and producing systems. I discuss the pros and cons of each strategy and the scenarios in which I think organizations should deploy it. Strategies discussed include generic types versus specific types, crawling versus publish-subscribe, a single source of truth versus multiple federated sources of truth, automated classification of data, lineage propagation and more! I also discuss the different axes on which we’ve been tested at scale: the sheer number of entities, the richness of metadata, the connectivity between entities, the velocity of evolution of the metadata model, and the efficiency of serving metadata for simple and complex queries.