Crunch Data Engineering and Analytics Conference, Budapest, October 29-31, 2018

CRUNCH is a use-case-heavy conference for people interested in building the finest data-driven businesses. No matter the size of your venture or your job description, you will find exactly what you need at the two-track CRUNCH conference. A data engineering track and a data analytics track will serve diverse business needs and levels of expertise.

If you are a Data Engineer, Data Scientist, Product Manager or simply interested in how to utilise data to develop your business, this conference is for you. No matter the size of your company or the volume of your data, come and learn from the biggest players in Big Data, get inspiration from their practices, their successes and their failures, and network with other professionals like you.

29
October
CONFERENCE DAY #1, MONDAY

The day will start at 9AM and the last talk will end around 6PM. After the sessions there will be a Crunch party at the conference venue.

30
October
CONFERENCE DAY #2, TUESDAY

The day will start at 9AM and the closing ceremony will end around 6PM.

31
October
WORKSHOP DAY

Our full-day workshops will be announced soon. You need to buy separate workshop tickets to attend them.


Location

Meet Budapest, a really awesome city

Here are a few reasons why you need to visit Budapest

MAGYAR VASÚTTÖRTÉNETI PARK

BUDAPEST, TATAI ÚT 95, 1142

The Magyar Vasúttörténeti Park (Hungarian Railway History Park) is Europe’s first interactive railway museum, located at a railway station and workshop of the Hungarian State Railways. There are over a hundred vintage trains, locomotives, cars and other types of railroad equipment on display, including a steam engine built in 1877, a railcar from the 1930s and a dining car built in 1912 for the famous Orient Express.

On the conference days there will be direct Crunch trains in the morning from Budapest-Nyugati Railway Terminal to the venue, and in the evening from the venue back to Budapest-Nyugati Railway Terminal, so we recommend finding a hotel near Nyugati station.


Speakers

Jon Morra

Vice President of Data Science at Zefr
Clustering YouTube: A Top Down & Bottom up Approach

At ZEFR we know that when an advertisement on YouTube is relevant to the content a user is watching, it is a better experience for both the user and the advertiser. In order to facilitate this experience we discover billions of videos on YouTube and cluster them into concepts that advertisers and brands want to buy to align with their particular creatives. To serve our clients we use two different clustering strategies: a top-down supervised learning approach and a bottom-up unsupervised learning approach. The top-down approach involves using human-annotated data and a very fast and robust machine learning model deployment system that solves problems with model drift. Our clients are also interested in discovering topics on YouTube. To serve this need we use unsupervised clustering of videos to surface clusters that are relevant. This type of clustering allows ZEFR to highlight what users are currently interested in. We show how using Latent Dirichlet Allocation can help to solve this problem. Along the way we will show some of the tricks that produce an accurate unsupervised learning system. This talk will touch on some common machine learning engines including Keras, TensorFlow, and Vowpal Wabbit. We will also introduce our open-source Scala DSL for model representation, Aloha. We show how Aloha solves a key problem in a typical data scientist's workflow, namely ensuring that feature functions make it from the data scientist's machine to production with zero changes.
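To make the bottom-up, unsupervised side of this concrete, here is a minimal, illustrative sketch of topic clustering with Latent Dirichlet Allocation using scikit-learn (version 1.0+ assumed); the toy corpus and parameters are placeholders, not ZEFR's production pipeline.

```python
# Illustrative only: LDA topic clustering over toy video metadata.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

video_texts = [
    "diy home renovation kitchen remodel",
    "nba highlights playoff dunks",
    "easy dinner recipes pasta",
    "college basketball buzzer beater",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(video_texts)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(X)        # per-video topic mixture

# Top words that define each discovered topic ("cluster")
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {topic_idx}: {top}")
```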

Bio

Jon Morra is the Vice President of Data Science at Zefr, a video ad-tech company. His team's main focus is on figuring out the best videos to deliver to Zefr's clients to optimize their advertising campaign objectives based on the content of the videos. In this role he leads a team of data scientists who are responsible for extracting information from both videos and our clients to create data-driven models. Prior to Zefr, Jon was the Director of Data Science at eHarmony, where he helped increase both the breadth and depth of data science usage. Jon holds a B.S. from Johns Hopkins and a Ph.D. from UCLA, both in Biomedical Engineering.

Daniel Porter

Co-founding member of BlueLabs
Using Rapid Experiments and Uplift Modeling to Optimize Outreach at Scale

In the current environment, media consumption is fragmenting, cord cutters are an increasingly large segment of the population, and “digital” is no longer a ubiquitous, single medium. As such, large companies and other organizations looking to do outreach at scale to change individuals’ behavior have an overwhelming number of choices for how to deploy their outreach resources. In this talk, Daniel Porter, co-founder and Chief Analytics Officer of BlueLabs, will discuss how current tools that combine uplift models with state-of-the-art allocation algorithms make it possible for organizations ranging from Fortune 100 companies to Presidential campaigns to large government agencies to optimize these decisions at the individual level, ensuring delivery of the right message to the right person at the right time, through the media channels where an individual is most likely to engage positively with the content.
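For readers unfamiliar with uplift modeling, the sketch below shows the generic "two-model" approach (a response model for treated individuals minus one for controls) on synthetic data; it is a simplified illustration, not BlueLabs' actual methodology or allocation algorithm.

```python
# Simplified two-model uplift estimate on synthetic data (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))               # individual-level features
treated = rng.integers(0, 2, size=1000)      # 1 = received outreach
# Synthetic outcome: outreach helps only a subset of individuals
y = (rng.random(1000) < 0.3 + 0.1 * treated * (X[:, 0] > 0)).astype(int)

m_treat = GradientBoostingClassifier().fit(X[treated == 1], y[treated == 1])
m_ctrl = GradientBoostingClassifier().fit(X[treated == 0], y[treated == 0])

# Uplift = expected response if contacted minus expected response if not
uplift = m_treat.predict_proba(X)[:, 1] - m_ctrl.predict_proba(X)[:, 1]
priority = np.argsort(-uplift)[:100]         # contact those with the largest lift
```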

Bio

Dan is a co-founding member of BlueLabs and has led its data science team since the company’s inception. Over the past 5 years, he has overseen the team’s growth in applying data science to new industries, spearheaded the development of critical new tools and methodologies, and expanded the team’s technical capabilities to incorporate the world’s most recent cutting-edge data science innovations. Dan’s interest is in using data science as a tool to predict, and more importantly, influence behaviors. As Director of Statistical Modeling on the 2012 Obama Campaign, his team was the first in the history of Presidential politics to use persuasion modeling to determine the voters who were most likely to be persuaded by the campaign’s outreach. Since co-founding BlueLabs, Dan’s team has iterated on this work to influence behaviors and attitudes for applications ranging from perceptions of a Fortune 10 company, to buying products from big-box retail stores, to the uptake of key Federal Government services. Much of Dan’s team’s recent work has focused on how different individuals can be influential on each other’s attitudes and behaviors in asymmetric ways. Dan is passionate about understanding how these key drivers of influence are critical to organizations seeking to achieve their campaign, brand, or policy goals. Dan has an MA in Quantitative Methods from Columbia University, and a BA from Wesleyan University. He is an avid sports fan (always watching from a statistical perspective), and, sadly, enjoys optimizing his frequent flyer miles portfolio between vacations almost as much as vacation itself.

Ananth Packkildurai

Senior data engineer at Slack
Operating data pipeline using Airflow @ Slack

Slack is a communication and collaboration platform for teams. Our millions of users spend 10+ hours connected to the service on a typical working day. The Slack data engineering team's goal is simple: drive up the speed, efficiency, and reliability of making data-informed decisions, for engineers, for people managers, for salespeople, for every Slack customer. Airflow is the core system in our data infrastructure for orchestrating our data pipeline. We use Airflow to schedule Hive/Tez, Spark, Flink and TensorFlow applications. Airflow helps us to manage our stream processing, statistical analytics, machine learning, and deep learning pipelines. About six months ago, we started an on-call rotation for our data pipeline to adopt what we learned from the DevOps paradigm. We found several Airflow performance bottlenecks and operational inefficiencies that came with ad-hoc pipeline management. In this talk, I will speak about how we identified Airflow performance issues and fixed them. I will talk about our experience as we strive to resolve our on-call nightmares and make the data pipeline simpler and more pleasant to operate, and the hacks we did to improve alerting and visibility of our data pipeline. Though the talk is tuned towards Airflow, the principles we applied to data pipeline visibility engineering are more generic and can be applied to any tool or data pipeline.
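As background for the talk, here is a minimal Airflow DAG sketch using Airflow 1.x-style imports (the version current in 2018); the task names, commands, and schedule are hypothetical, not Slack's actual pipeline.

```python
# Hypothetical example DAG; not Slack's production code.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                          # retry transient failures before paging on-call
    "retry_delay": timedelta(minutes=10),
}

dag = DAG(
    "daily_metrics",
    default_args=default_args,
    start_date=datetime(2018, 10, 1),
    schedule_interval="@daily",            # one run per day of data
)

extract = BashOperator(
    task_id="extract_events",
    bash_command="spark-submit extract_events.py --date {{ ds }}",
    dag=dag,
)

aggregate = BashOperator(
    task_id="aggregate_metrics",
    bash_command="spark-submit aggregate_metrics.py --date {{ ds }}",
    dag=dag,
)

extract >> aggregate                       # aggregate runs only after extract succeeds
```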

Bio

I work as a Senior Data Engineer at Slack, managing core data infrastructure such as Airflow, Kafka, Flink, and Pinot. I love talking about all things ethical data management.

Szilard Pafka

Chief Scientist at Epoch USA
Better than Deep Learning: Gradient Boosting Machines (GBM)

With all the hype about deep learning and "AI", it is not well publicized that for the structured/tabular data widely encountered in business applications it is actually another machine learning algorithm, the gradient boosting machine (GBM), that most often achieves the highest accuracy in supervised learning tasks. In this talk we'll review some of the main GBM implementations available as R and Python packages, such as xgboost, h2o and lightgbm, we'll discuss some of their main features and characteristics, and we'll see how tuning GBMs and creating ensembles of the best models can achieve the best prediction accuracy for many business problems.
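As a taste of what GBM on tabular data looks like in practice, here is a minimal sketch using the LightGBM scikit-learn API on a bundled dataset; the hyperparameters are illustrative defaults, not the tuned configurations discussed in the talk.

```python
# Minimal, untuned GBM example (LightGBM); illustrative only.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,        # main capacity knob for LightGBM trees
)
model.fit(X_train, y_train)

pred = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, pred))
```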

Bio

Szilard studied Physics in the 90s and obtained a PhD by using statistical methods to analyze the risk of financial portfolios. He worked in finance, then more than a decade ago moved to become the Chief Scientist of a tech company in Santa Monica, California doing everything data (analysis, modeling, data visualization, machine learning, data infrastructure etc). He is the founder/organizer of several meetups in the Los Angeles area (R, data science etc) and the data science community website datascience.la. He is the author of a well-known machine learning benchmark on GitHub (1000+ stars), a frequent speaker at conferences (keynote/invited at KDD, R-finance, Crunch, eRum and contributed at useR!, PAW, EARL etc.), and he has developed and taught graduate data science and machine learning courses as a visiting professor at two universities (UCLA in California and CEU in Europe).

Thomas Dinsmore

Senior Director for DataRobot
The Path to Open Data Science

Bio

Thomas W. Dinsmore is a Senior Director for DataRobot, an AI startup based in Boston, Massachusetts, where he is responsible for competitor and market intelligence. Thomas’ previous experience includes service for Cloudera, The Boston Consulting Group, IBM Big Data, and SAS. Thomas has worked with data and machine learning for more than 30 years. He has led or contributed to projects for more than 500 clients around the world, including AT&T, Banco Santander, Citibank, CVS, Dell, J.C.Penney, Monsanto, Morgan Stanley, Office Depot, Sony, Staples, United Health Group, UBS, Vodafone, and Zurich Insurance Group. Apress published Thomas’ book, Disruptive Analytics, in 2016. Previously, he co-authored Modern Analytics Methodologies and Advanced Analytics Methodologies for FT Press and served as a reviewer for the Spark Cookbook. He posts observations about the machine learning business on his personal blog at thomaswdinsmore.com.

Ajay Gopal

Chief Data Scientist at Deserve, Inc
"Full-Stack" Data Science with R

In the past 5 years, there has been a rapid evolution of the ecosystem of R packages and services. This enables the crossover of R from the domain of statisticians to being an efficient functional programming language that can be used across the board for data engineering, analytics, reporting and data science. I'll illustrate how startups and medium-size companies can use R as a common language for

  1. engineering functions such as ETL and creation of data APIs,
  2. analytics through scalable real-time reporting dashboards and
  3. the prototyping and deployment of ML models.
Along the way, I'll specifically identify open-source tools that allow scalable stacks to be built in the cloud with minimal budgets. The efficiency gained enables small teams of R programmers & data scientists to provide diverse lateral intelligence across a company.

Bio

Ajay is a California resident, building his second FinTech Startup Data Science team as Chief Data Scientist at Deserve. Before that, he built the data science & digital marketing automation functions at CARD.com – another CA FinTech Startup. In both roles, he has built diverse teams, cloud data science infrastructures and R&D/Prod workflows with a "full stack" approach to scalable intelligence & IP generation. Ajay holds a PhD in physical chemistry and researched bio-informatics and graph theory as a post-doc before transitioning to the startup world.

Tim Berglund

Senior Director of Developer Experience at Confluent
Kafka as a Platform: the Ecosystem from the Ground Up

Kafka has become a key data infrastructure technology, and we all have at least a vague sense that it is a messaging system, but what else is it? How can an overgrown message bus be getting this much buzz? Well, because Kafka is merely the center of a rich streaming data platform that invites detailed exploration.


In this talk, we’ll look at the entire open-source streaming platform provided by the Apache Kafka and Confluent Open Source projects. Starting with a lonely key-value pair, we’ll build up topics, partitioning, replication, and low-level Producer and Consumer APIs. We’ll group consumers into elastically scalable, fault-tolerant application clusters, then layer on more sophisticated stream processing APIs like Kafka Streams and KSQL. We’ll help teams collaborate around data formats with schema management. We’ll integrate with legacy systems without writing custom code. By the time we’re done, the open-source project we thought was Big Data’s answer to message queues will have become an enterprise-grade streaming platform.
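To ground the "lonely key-value pair" starting point, here is a minimal produce/consume sketch using the confluent-kafka Python client; the broker address, topic, and payload are placeholders.

```python
# Placeholder topic/broker; minimal producer and consumer loop.
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
# A single key-value pair: the key determines which partition the record lands on.
producer.produce("page-views", key="user-42", value='{"page": "/pricing"}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics",          # consumers in one group share the partitions
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page-views"])

msg = consumer.poll(5.0)              # wait up to 5 seconds for a record
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```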

Bio

Tim is a teacher, author, and technology leader with Confluent, where he serves as the Senior Director of Developer Experience. He can frequently be found speaking at conferences in the United States and all over the world. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to Distributed Systems, and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at http://timberglund.com, is the co-host of the http://devrelrad.io podcast, and lives in Littleton, CO, USA with the wife of his youth and their youngest child, the other two having mostly grown up.

Jeffrey Theobald

Staff Engineer at Zendesk
Machine Learning: The Journey to Production

Building a successful machine learning model is extremely challenging in itself, and just as much effort is needed to turn that model into a customer-facing product. As we cover the journey of Zendesk’s article recommendation product, we’ll discuss design challenges and real-world problems you may encounter when building a machine learning product at scale. We’ll talk in detail about the evolution of the machine learning system, from individual models per customer (using Hadoop to aggregate the training data) to a universal deep learning model for all customers using TensorFlow, and outline some challenges we faced while building the infrastructure to serve TensorFlow models. We’ll also explore the complexities of seamlessly upgrading to a new version of the model and detail the architecture that handles the constantly changing collection of articles that feed into the recommendation engine.
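One concrete piece of the serving-and-upgrading story is the versioned SavedModel layout that TensorFlow Serving watches; the sketch below shows the general pattern with a toy Keras model and hypothetical paths, not Zendesk's recommendation model.

```python
# Toy model and hypothetical paths; shows the versioned-export pattern only.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# TensorFlow Serving watches the base directory and loads the highest version,
# so rolling out a new model is a matter of writing a new numbered subdirectory.
tf.saved_model.save(model, "models/article_recommender/1")
tf.saved_model.save(model, "models/article_recommender/2")
```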

Bio

Jeffrey Theobald is a Staff Engineer at Zendesk, a customer support company that provides a myriad of solutions to help its customers improve their relationships with their end users. He has been working in data processing for around 9 years, across several companies and several languages, from Python and Ruby through Bash to C++ and Java. He has used Hadoop since 2011 and has built analytics and batch processing systems as well as data preparation tools for machine learning. When not stressing about data correctness, he enjoys hiking and recently climbed Kilimanjaro. He believes that talks should be entertaining as well as informative and has tried to promote interesting and unusual talks about software engineering by organising the talk series Software Art Thou (www.softwareartthou.com).

Jacek Laskowski

Spark, Kafka, Kafka Streams Consultant, Developer and Technical Instructor
Deep Dive into Query Execution in Spark SQL 2.3

If you want to get even slightly better performance out of your structured queries (regardless of whether they are batch or streaming), you have to peek at the foundations of the Dataset API, starting with QueryExecution. That’s where any structured query ends up, and that’s where my talk starts. The talk will show you what stages a structured query has to go through before execution in Spark SQL. I’ll be talking about the different phases of query execution and the logical and physical optimizations. I’ll show the different optimizations in Spark SQL 2.3 and how to write one yourself (in Scala).
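Although the talk's own examples are in Scala, the phases it walks through are easy to peek at from any Spark session; the PySpark sketch below simply prints the parsed, analyzed, optimized logical, and physical plans for a small query.

```python
# Inspecting the plans a structured query goes through before execution.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-execution-demo").getOrCreate()

df = spark.range(1000).withColumnRenamed("id", "user_id")
query = df.filter("user_id % 2 = 0").groupBy().count()

# extended=True prints all plans, including what the Catalyst optimizer produced
query.explain(extended=True)
```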

Bio

Jacek Laskowski is an independent consultant, software developer and technical instructor specializing in Apache Spark, Apache Kafka and Kafka Streams (with Scala, sbt, Kubernetes, DC/OS, Apache Mesos, and Hadoop YARN). He is best known for the gitbooks at https://jaceklaskowski.gitbooks.io about Apache Spark, Spark Structured Streaming, and Apache Kafka. Follow Jacek at https://twitter.com/jaceklaskowski.

Andrey Sharapov

Data Scientist and Data Engineer at Lidl
Building data products: from zero to hero!

Modern organizations are overwhelmingly becoming data-driven in order to optimize internal processes and increase competitiveness. At Lidl we turn data into products and provide our internal customers with business insights at scale. Come to learn how we started from zero and turned into data heroes!

Bio

Andrey Sharapov is a data scientist and data engineer at Lidl. He is currently working on various projects related to machine learning and data product development. Previously, he spent 2 years at Xaxis, where he helped develop a campaign optimization tool for GroupM agencies, and then at TeamViewer, where he led data science initiatives and developed tools for customer analytics. Andrey is interested in “explainable AI” and is passionate about making machine learning accessible to the general public.

Abhishek Tiwari

Staff Software Engineer at LinkedIn, Apache Gobblin PPMC / Committer
Stream and Batch Data Integration at LinkedIn scale using Apache Gobblin

This talk will discuss how Apache Gobblin powers stream and batch data integration at LinkedIn for use cases such as: ingestion of 300+ billion Kafka events daily, storage management of several petabytes of data on HDFS, and near real-time processing of thousands of enterprise customer jobs.

Bio

Abhishek Tiwari is a Committer and PPMC member of Apache Gobblin (incubating). He is the Tech Lead for Data Integration Infrastructure at LinkedIn. Before joining LinkedIn, he worked on building the Amazon CloudSearch service at AWS, the platform for the Watson supercomputer at Nuance, Hadoop infrastructure at Yahoo, and web architecture serving several million monthly users at AOL.

Wael Elrifai

VP of Big Data, IOT & AI at Hitachi Vantara

Bio

Wael Elrifai is a thought leader, book author and public speaker in the AI & IoT space, in addition to his role as VP of Big Data, IoT & AI at Hitachi Vantara. He has served corporate and government clients in North America, Europe, the Middle East, and East Asia across a number of industry verticals and has presented at conferences worldwide. With graduate degrees in both electrical engineering and economics, he’s a member of the Association for Computing Machinery, the Special Interest Group for Artificial Intelligence, the Royal Economic Society, and The Royal Institute of International Affairs.

Tyler Akidau

Software engineer at Google
Foundations of streaming SQL or: How I learned to love stream and table theory

What does it mean to execute robust streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing conceptually, or different? And how does all of this relate to the programmatic frameworks we’re all familiar with? This talk will address all of those questions in two parts, providing a survey of core points from chapters 6 and 8 in the recently published Streaming Systems book.
First, we’ll explore the relationship between the Beam Model for data processing (as described in The Dataflow Model paper and the Streaming 101 and Streaming 102 blog posts) and stream & table theory (as popularized by Martin Kleppmann and Jay Kreps, amongst others, but essentially originating out of the database world). It turns out that stream & table theory does an illuminating job of describing the low-level concepts that underlie the Beam Model.
Second, we’ll apply our clear understanding of that relationship towards explaining what is required to provide robust stream processing support in SQL. We’ll discuss concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, compare to other offerings such as Apache Kafka’s KSQL and Apache Spark’s Structured streaming, and talk about new ideas yet to come.
In the end, you can expect to have a much better understanding of the key concepts underpinning data processing, regardless of whether that data processing is batch or streaming, SQL or programmatic, as well as a concrete notion of what robust stream processing in SQL looks like.
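As a tiny, framework-free illustration of the stream/table duality the talk builds on (a toy example, not code from the book), consider aggregating an event log in plain Python: applying the stream produces the table, and emitting each update would reproduce the stream.

```python
# Toy illustration of stream/table duality in plain Python.
events = [                       # the stream: an ordered log of (user, amount) events
    ("alice", 3), ("bob", 5), ("alice", 2), ("bob", 1),
]

table = {}                       # the table: current totals per user
for user, amount in events:      # applying the stream row by row yields the table
    table[user] = table.get(user, 0) + amount
    print(f"after {user}+{amount}: {table}")

# Emitting each update as a record would turn the table back into a changelog
# stream: the two are different views of the same data.
```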

Bio

Tyler Akidau is a software engineer at Google, where he is the technical lead for the Data Processing Languages & Systems group, responsible for Google's Apache Beam efforts, Google Cloud Dataflow, and internal data processing tools like Google Flume, MapReduce, and MillWheel. He's also a founding member of the Apache Beam PMC. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the Streaming Systems book from O'Reilly, the 2015 Dataflow Model paper, and the Streaming 101 and Streaming 102 articles. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Juliet Hougland

Data Platform Engineering Manager at Stitch Fix
Enabling Full Stack Data Scientists

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. Data Scientists are expected to build their systems end to end and maintain them in the long run. We rely on automation, documentation, and collaboration to enable data scientists to build and maintain production services. In this talk I will discuss the platform we have built and how we communicate about these tools with our data scientists.

Bio

Juliet Hougland leads a team that builds data science infrastructure at Stitch Fix. She is a data scientist and engineer with expertise in computational mathematics and years of hands-on machine learning and big data experience. She has built and deployed production ML models, advised Fortune 500 companies on infrastructure and worked on a variety of open source projects (Apache Spark, Scalding, and Kiji) at the intersection of big data and machine learning. She has worked at Cloudera as well as two tiny (now defunct) startups.

Jonathon Morgan

Founder and CEO of New Knowledge
Machine Learning and Information Warfare

We've entered an age of information war. Hyper-partisan rhetoric, social media filter bubbles, and massive networks of fake social media accounts are being used to undermine elections, sow discord, and even inspire acts of violence. We can quantify this manipulation and its impact using new and novel approaches to natural language understanding and information semantics. We'll look at how knowledge graph embeddings can help humans quickly identify computational propaganda, and investigate how word vectors from models trained on partisan corpora can measure radicalization and polarization in political discourse.
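To hint at how word vectors trained on partisan corpora can surface framing differences, here is a hedged sketch with gensim (4.x API assumed) and toy placeholder corpora; it is a generic illustration, not the speaker's models or data.

```python
# Toy corpora and generic approach; not the speaker's actual models or data.
from gensim.models import Word2Vec

corpus_a = [["immigration", "reform", "opportunity", "economy"],
            ["healthcare", "coverage", "families", "jobs"]]
corpus_b = [["immigration", "border", "security", "crisis"],
            ["healthcare", "costs", "government", "taxes"]]

model_a = Word2Vec(corpus_a, vector_size=50, min_count=1, seed=1)
model_b = Word2Vec(corpus_b, vector_size=50, min_count=1, seed=1)

# The same word can sit in very different neighbourhoods in each vector space,
# which is one rough signal of divergent framing between the corpora.
print(model_a.wv.most_similar("immigration", topn=3))
print(model_b.wv.most_similar("immigration", topn=3))
```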

Bio

Jonathon Morgan is the founder and CEO of New Knowledge, a technology company building AI for disinformation defense. He is also the founder of Data for Democracy, a policy, research, and volunteer collective with nearly 4,000 members that's bridging the gap between technology and society. Prior to founding New Knowledge, Jonathon published research about extremist groups manipulating social media with the Brookings Institution, The Atlantic, and the Washington Post, presented at NATO's Center of Excellence for Defense Against Terrorism, the United States Institute for Peace, and the African Union. He also served as an adviser to the US State Department, developing strategies for digital counter-terrorism. He regularly provides expert commentary about online disinformation for publications such as the New York Times, NBC, NPR, and Wired, and has published op-eds about information warfare and computational propaganda for CNN, The Guardian, and VICE.

Christoph Reininger

Head of Business Intelligence at Runtastic GmbH
From Data Science to Business Science - How Runtastic's data scientists translate stakeholder needs from what they say they want

You got a working business model, scalable analytics infrastructure and highly skilled data scientists. But somehow you just don’t seem to be generating value from your data. Digital health and fitness company Runtastic shares their experience in translating business requirements into actionable data products that drive innovation.
After scaling a capable analytics infrastructure and building skilled data science and engineering teams, Runtastic's challenge was to apply these capabilities to its fast-growing and ever-changing business. Business processes like user acquisition and customer relationship management had matured quickly and become more complex and sophisticated. Existing analytics and data products no longer fit the business requirements, and external solutions appeared both too expensive and too limited. The solution was for the data scientists to take a step back from the data they knew and take a hard look at the business and how it works.
The requirements engineering process that leads to a functional and valuable data product is a big challenge that involves a lot of different stakeholders and requires a wide variety of skills. In the past 24 months Runtastic tackled and revamped some of its most crucial business processes and discovered a lot of learnings along the way.

Bio

After receiving his master’s degree in Medical Informatics at the Medical University of Vienna, Christoph started working for gespag, one of Austria’s biggest healthcare providers. After 4+ years of working as an IT architect and project manager, Christoph joined Runtastic in 2013 to start their business intelligence initiative. In his position as Head of Business Intelligence he has implemented and grown the data infrastructure and organization at Runtastic for the past 5 years. Working for a very innovative organization in the mobile health & fitness area has given Christoph the opportunity to not only apply his knowledge in data management but to expand his experience regarding business processes and agile product management.

Wes McKinney

Director of Ursa Labs, PMC for Apache Arrow
Apache Arrow: A Cross-language Development Platform for In-memory Data

This talk discusses the Apache Arrow project and its uses for high-performance analytics and system interoperability.
Data processing systems have historically been full-stack systems featuring memory management, IO, file format adapters, a runtime memory format, an in-memory query engine, and front-end user interfaces. Many of these components are fully "bespoke" or "custom", in part due to a lack of open standards for many of the pieces.
Apache Arrow was created by a diverse group of open source data system developers to define open standards and community-maintained libraries for high performance in-memory data processing. Since the beginning of 2016, we have been building a cross-language development platform for data processing to help create systems that are faster, more scalable, and more interoperable.
I discuss the current development initiative and future roadmap as it relates to the data science and data engineering worlds.
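For a flavour of what community-maintained libraries for in-memory data mean in practice, here is a minimal pyarrow sketch moving tabular data between pandas, Arrow's columnar format, and Parquet; the data is a placeholder.

```python
# Placeholder data; shows pandas <-> Arrow <-> Parquet round-tripping.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"city": ["Budapest", "Vienna"], "visitors": [120, 95]})

table = pa.Table.from_pandas(df)        # pandas -> Arrow columnar memory
pq.write_table(table, "visitors.parquet")

table2 = pq.read_table("visitors.parquet")
print(table2.to_pandas())               # Arrow -> pandas; the same columnar data
                                        # could be shared with other Arrow-aware systems
```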

Bio

Wes McKinney is an open source software developer focusing on data processing tools. He created the Python pandas project and has been a major contributor to many other OSS projects. He is a Member of the Apache Software Foundation and a project PMC member for Apache Arrow and Apache Parquet. He is the director of Ursa Labs, an innovation lab for open source data science tools powered by Apache Arrow.

More speakers will be announced soon


Workshops (31 Oct, Wed)

Zoltan C. Toth

Apache Spark™ for Machine Learning and Data Science (Official Databricks Workshop)

CTO at Datapao

Overview
This 3-day course is primarily for data scientists but is directly applicable to analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark and its applications to Machine Learning. The course covers the fundamentals of Apache Spark including Spark’s architecture and internals, the core APIs for using Spark, SQL and other high-level data access tools, Spark’s streaming capabilities and a heavy focus on Spark’s machine learning APIs. The class is a mixture of lecture and hands-on labs. Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment. By the end of the course, you will be able to:

  • Use the core Spark APIs to operate on data.
  • Articulate and implement typical use cases for Spark
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Analyze Spark jobs using the administration UIs inside Databricks
  • Create Structured Streaming jobs
  • Understand the basics of Spark’s internals
  • Work with graph data using the GraphFrames APIs
  • Understand how a Machine Learning pipeline works
  • Use various ML algorithms to perform clustering, regression and classification tasks.
  • Train & export ML models
  • Train models with 3rd-party libraries like scikit-learn
  • Create and transform DataFrames to query large datasets.
  • Improve performance through judicious use of caching and applying best practices.
  • Visualize how jobs are broken into stages and tasks and executed within Spark.
  • Troubleshoot errors and program crashes using Spark UI, executor logs, driver stack traces, and local-mode runtimes.
  • Find answers to common Spark and Databricks questions using the documentation and other resources.

Topics

  • Spark Overview
  • In-depth discussion of Spark SQL and DataFrames, including:
    • The DataFrames/Datasets API
    • Spark SQL
    • Data Aggregation
    • Column Operations
    • The Functions API: date/time, string manipulation, aggregation
    • Caching and caching storage levels
    • Use of the Spark UI to analyze behavior and performance
  • Overview of Spark internals
    • Cluster Architecture
    • How Spark schedules and executes jobs and tasks
    • Shuffling, shuffle files, and performance
    • The Catalyst query optimizer
  • An in-depth overview of Spark’s MLlib Pipeline API for Machine Learning
    • Build machine learning pipelines for both supervised and unsupervised learning
    • Transformer/Estimator/Pipeline API
    • Use transformers to perform pre-processing on a dataset prior to training
    • Train analytical models with Spark ML’s DataFrame-based estimators, including Linear Regression, Logistic Regression, Decision Trees + Random Forests, Boosted Trees, K-Means, Alternating Least Squares, and Neural Nets
    • Tune hyperparameters via cross-validation and grid search
    • Evaluate model performance
  • Spark-sklearn
    • How to distribute single-node algorithms (like scikit-learn) with Spark
    • Partitioning data concerns
  • Spark Structured Streaming
    • Sources and sinks
    • Structured Streaming APIs
    • Windowing & Aggregation
    • Checkpointing & Watermarking
    • Reliability and Fault Tolerance
  • Graph processing with GraphFrames
    • Transforming DataFrames into a graph
    • Perform graph analysis, including Label Propagation, PageRank, and ShortestPaths
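
For orientation, here is a minimal sketch of the Transformer/Estimator/Pipeline pattern listed above, written in PySpark with a tiny in-memory dataset; it is an illustrative example, not part of the official course labs.

```python
# Illustrative PySpark ML pipeline; not part of the official course labs.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

train = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.4, 2.1), (0.0, 0.9, 0.3), (1.0, 2.8, 1.9)],
    ["label", "f1", "f2"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)   # Estimator -> fitted Transformer
model.transform(train).select("label", "prediction").show()
```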

More details on this workshop:
https://databricks.com/training/instructor-led-training/courses/apache-spark-for-machine-learning-and-data-science

Bio

Zoltan is CTO at Datapao, helping companies build data analytics infrastructures and teaching their teams to do the same. He is also a Senior Instructor on the Databricks training team and a professor at the Central European University. Earlier he worked on RapidMiner's Spark integration and managed a petabyte-scale data infrastructure at Prezi.com.

Mate Gulyas

Apache Spark Essentials

CEO and Senior Instructor at Datapao

Apache Spark Essentials will help you get productive with the core capabilities of Spark, as well as provide an overview and examples for some of Spark’s more advanced features. This full-day course features hands-on technical exercises so that you can become comfortable applying Spark to your datasets. In this class, you will get hands-on experience with ETL, exploration, and analysis using real world data. Prerequisites: This class doesn't require any Spark knowledge. Some experience in Python and some familiarity with big data or parallel processing concepts is helpful.
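For a flavour of the hands-on ETL and exploration exercises, here is a minimal PySpark sketch; the file path and column names are placeholders, not the workshop's actual dataset.

```python
# Placeholder paths/columns; minimal extract-transform-load example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-essentials-sketch").getOrCreate()

events = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("events.csv"))                       # extract

daily = (events
         .filter(F.col("status") == "ok")           # transform: keep valid rows
         .groupBy("event_date")
         .agg(F.count("*").alias("events")))

daily.write.mode("overwrite").parquet("daily_events")   # load
```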

Bio

CEO and Senior Instructor at Datapao, a Big Data and Cloud consultancy and training firm focusing on industrial applications (aka Industry 4.0). Datapao helps Fortune 500 companies kick off and mature their data analytics infrastructure by giving them Apache Spark, Big Data and Data Analytics training and consultancy. Mate also serves as Senior Instructor in the Professional Services Team at Databricks, the company founded by the authors of Apache Spark. Previously he was Co-Founder and CTO of enbrite.ly, an award-winning Budapest-based startup.
Mate has experience spanning more than a decade with Big Data architectures, data analytics pipelines, operation of infrastructures and growing organisations by focusing on culture. Mate also teaches Big Data analytics at the Budapest University of Technology and Economics. He is a speaker at and organiser of local and international conferences and meetups.

Gergely Daróczi

Practical Introduction to Data Science and Engineering with R

Gergely Daróczi, Passionate R developer

This is an introductory 1-day workshop on how to use the R programming language and software environment for the most common data engineering and data science tasks. After a brief overview of the R ecosystem and language syntax, we quickly get up to speed with hands-on examples on

  • reading data from local files (CSV, Excel) or simple web pages and APIs
  • manipulating datasets (filtering, summarizing, ordering data, creating new variables)
  • computing descriptive statistics
  • building basic models
  • visualizing data with `ggplot2`, the R implementation of the Grammar of Graphics
  • doing multivariate data analysis for dummies (e.g. anomaly detection with principal component analysis; dimension reduction with multidimensional scaling to transform the distance matrix of European cities into a map)
  • introduction to decision trees, random forest and boosted trees with `h2o`
No prior R knowledge or programming skills required.

Bio

Gergely has been using R for more than 10 years in academia (teaching data analysis and statistics in the MA programs of PPCU, Corvinus and CEU) and in industry as well. He started his professional career at public opinion and market research companies, automating the survey analysis workflow, then founded and became the CTO of rapporter.net, a reporting SaaS based on R and Ruby on Rails, a role he quit to move to Los Angeles and standardize the data infrastructure of a fintech startup. Currently, he is the Senior Director of Data Operations at an adtech company in Venice, CA. Gergely is an active member of the R community (main organizer of the Hungarian R meetup, the first satRday conference and the eRum 2018 conference; speaker at international R conferences; developer and maintainer of CRAN packages).


Tickets

Crunch will be held together with Impact (a product management conference) and Amuse (a UX conference).
Your ticket allows you to attend all three tracks.

Contact

Crunch Conference is organized by

Questions? Drop us a line at hello@crunchconf.com