Crunch Data Engineering and Analytics Conference, Budapest, October 18-20, 2017

Get tickets

CRUNCH is a use-case-heavy conference for people interested in building the finest data-driven businesses. No matter the size of your venture or your job description, you will find exactly what you need at the two-track CRUNCH conference. A data engineering track and a data analytics track will serve diverse business needs and levels of expertise.

If you are a Data Engineer, Data Scientist, or Product Manager, or are simply interested in how to use data to develop your business, this conference is for you. No matter the size of your company or the volume of your data, come and learn from the biggest players in Big Data, get inspiration from their practices, their successes and their failures, and network with other professionals like you.

18 October: WORKSHOP DAY

Our full-day workshops are listed below. You need to buy a separate workshop ticket to attend them.

19 October: CONFERENCE DAY #1, THURSDAY

The day will start at 9AM and the last talk will end around 6PM. After the sessions there will be a Crunch party at the conference venue.

20 October: CONFERENCE DAY #2, FRIDAY

The day will start at 9AM and the closing ceremony will end around 6PM.


Speakers

Charles Smith

Manager - Big Data Platform Architecture, Netflix
Working hard to build an easy data platform at Netflix

Here is a problem: You would like to buy the next great show for Netflix. The dream is that, given your data and a question, you can find the next House of Cards with a click of the mouse. But is that the reality? Why does it seem like data engineers and analysts spend so much time talking about memory requirements and stack traces? This talk will explore the past, present, and some of the future of the Netflix data platform, as well as how we are prioritizing work that will make it easier to focus on data problems rather than the complexities of the platform.

Bio

Charles Smith leads the Big Data Platform Architecture team at Netflix, whose mission is to make using data easy and efficient. He and his team are responsible for envisioning how the data platform allows data scientists to make Netflix's service even better.

Gyula Fóra

Data Warehouse Engineer, King
Real-time analytics at King

This talk gives a technical overview of the different tools and systems we use at King to process and analyse over 30 billion events in real time every day.
The core topic of this talk is RBEA (Rule-Based Event Aggregator), the scalable real-time analytics platform developed by King's Streaming Platform team. RBEA is a streaming-as-a-service platform built on top of Apache Flink and Kafka which allows developers and data scientists to write analytics scripts in a high-level DSL and deploy them on the live event streams in a matter of a few clicks.
The distinguishing feature of this platform is that new analytics jobs are not deployed as independent Flink programs; instead, a fixed number of continuously running jobs serve as backends for the RBEA platform. By streaming both the events and the new scripts to the backends, scripts share both the incoming data and the state they may build up when analyzing user activity in the games. This design makes new deployments very lightweight and the whole architecture highly efficient without sacrificing expressivity.
We push the Apache Flink framework to its full potential in order to provide highly scalable stateful and windowed processing logic for the analytics applications. We will show how we have built a high-level DSL on the abstractions provided by Flink that is more approachable to developers without stream-processing experience, and how we use code generation to execute the programs efficiently at scale.
In addition to our streaming platform we will also introduce other tools that we have developed in order to make deployment and monitoring of real-time applications as simple as possible at scale.
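
To make the deployment model concrete, here is a toy Python sketch (not RBEA itself, which runs on Flink and Kafka with a code-generated DSL) of the core idea: one long-running consumer receives both events and new analytics scripts as data, so deploying a script never means deploying a new job. All names below are hypothetical.

  # Toy sketch only: one long-running "backend" job that treats analytics
  # scripts as just another input stream.
  scripts = {}  # script_id -> process(event) callable

  def on_script(script_id, source):
      # A new script arrives on the control stream; exec-ing it here stands
      # in for RBEA's DSL compilation and code generation.
      env = {}
      exec(source, env)
      scripts[script_id] = env["process"]

  def on_event(event):
      # Every registered script sees the same shared event stream.
      for process in scripts.values():
          process(event)

  on_script("spend", "def process(e):\n    print('spend', e['amount'])")
  on_event({"user": 42, "amount": 9.99})  # prints: spend 9.99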

Bio

Gyula is a Data Warehouse Engineer in the Streaming Platform team at King, working hard on shaping the future of real-time data processing. This includes researching, developing and sharing awesome streaming technologies. Gyula grew up in Budapest, where he first started working on distributed stream processing and later became a core contributor to the Apache Flink project. Among his everyday joys and challenges you will find endless video game battles, super-spicy foods and thinking about stupid bugs at night.
Gyula has been a speaker at numerous big data related conferences and meetups, talking about stream processing technologies and use-cases.

Shirshanka Das

Principal Staff Software Engineer, LinkedIn
Taming the ever-evolving Compliance Beast: Lessons learned at LinkedIn

Just when you think you have your Kafka and Hadoop clusters set up and humming and you're well on your path to democratizing data, you realize that you now have a very different set of challenges to solve. You want to give your data scientists unfettered access to data, but at the same time you need to preserve the privacy of the data your members have entrusted you with.

In this session, I outline the path LinkedIn has taken to protect member privacy on our scalable distributed data ecosystem built around Kafka and Hadoop. Like most companies, in our early days, our first priority was getting data flowing freely and reliably. Over the past few years we’ve made significant advances in data governance, going above and beyond expectations with regard to the commitments we’ve made to our members in how we handle their data.

Specifically, I'll discuss how we've handled the Irish Data Protection Commissioner's requirements for ensuring that our members' data was purged from all our data systems, including Hadoop, within the required timeframe, and the kind of systems we had to build to solve it. I'll discuss three foundational building blocks that we've focused on: a centralized metadata system, a standardized data movement platform and a unified data access layer. Some of these systems are open source and can be of use to companies in a situation similar to ours. I'll also look to the future as the General Data Protection Regulation comes into effect in 2018, and outline our plans for addressing those requirements and the challenges that lie ahead.

Technology is just part of the solution.
In this talk I'll also discuss the culture and process changes we've seen happen at the company, and what we've learned about sustainable process and governance.

Bio

Shirshanka is a Principal Staff Software Engineer and the architect for LinkedIn's Data & Analytics team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team on simplifying the big data analytics space at LinkedIn through a multitude of mostly open-source projects: Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.

Justin Bozonier

Lead Data Scientist, Finance & Analytics, GrubHub
Science the shit out of your business

The mission of my data science team is to make a science out of our business at GrubHub. We work on understanding how every initiative our company undertakes affects our bottom line. I will discuss how we analyze every feature shipped to production, marketing programs, customer service, and more using a variety of statistical, machine learning, and decision-theoretic tools and techniques. Most importantly, I will cover how we have learned to tune these tools, not just with abstract or theoretical scores, but by connecting model error with bottom-line impact.

Bio

Justin Bozonier is the author of Test-Driven Machine Learning (published by Packt) and Lead Data Scientist in GrubHub's Financial Planning & Analytics group. As the founding data scientist of GrubHub's split-testing efforts, he and his team run the company's experiment analysis platform, develop experiments and models to tune larger business operations, and mine experiment and operational data to look for new business opportunities and to value existing programs. He has spoken previously at PyData Seattle, Kellogg at Northwestern, PyData Chicago's monthly meetup, and more.
He lives in Lake Villa, IL (just outside the greater Chicago area) with his wife Savannah and soon, their first child. In his spare time he studies math, video game development, and enjoys running.

Sean Kross

Programmer Analyst, The Johns Hopkins Bloomberg School of Public Health
Lessons from teaching data science to over a million people

My colleagues and I saw the demand for data scientists ballooning, and we decided to do something about it. In this talk I will explain how the Johns Hopkins Data Science Lab leveraged the latest statistical, computational, and open source methods in order to create over a million new data scientists. We'll talk about what happens as you take data newbies through their first serious programming experiences, rigorous mathematical training, and the creation of their first data products. We'll discuss the data we collected about how students handle these challenges, and how you can use our insights to implement better data science training and understanding in your organization.

Bio

Sean Kross is a PhD student at the University of California San Diego where he studies data science, human-computer interaction, and distributed education. Sean formerly worked in the Johns Hopkins Data Science Lab where he and his colleagues developed The Data Science Specialization on Coursera.org. Sean is the author of Mastering Software Development in R, Developing Data Products, and The Unix Workbench. He blogs less often than he would like at seankross.com and you can find him on Twitter @seankross.

Gio Fernandez-Kincade

Co-Founder @ RelatedWorks.io. Formerly Staff Engineer @ Etsy
AI in Production

Read enough Hacker News and you will quickly become convinced that building AI products looks something like:

  1. Fire up TensorFlow
  2. Choose your favorite network architecture (or better yet, generate one!)
  3. Pipe in tons of data
  4. Profit

That couldn’t be farther from the truth. In this talk, we’ll figure out what it really takes to ship AI products in production.

Bio

Gio has been working with data, architecting systems, and leading teams of engineers for over a decade. He's currently a co-founder at Related Works, which aims to build simple, intelligent products that help cultural institutions share their collections with the world. Previously he worked as a Staff Engineer at Etsy, where he led the Search Ranking and Search Experience teams. He focused on Search from the ground up: infrastructure, ranking and machine-learned relevance, diversity, fairness, query understanding, autosuggest, faceting, navigation, experimentation, etc. Prior to working at Etsy, Gio worked at CapitalIQ, where he designed, built, and maintained a multi-terabyte database, real-time processing system, and search engine for globally sourced financial reports.

Cassandra Jacobs

Data Scientist, Stitch Fix
Imposing structure on unstructured text at Stitch Fix

At Stitch Fix, we have a wealth of text data related to each Fix we send out to clients. Fixes contain 5 apparel and non-apparel fashion items, ranging anywhere from blouses to leggings to shoes. Stylists are shown algorithmically scored pieces and ultimately use their own discretion to decide what to send to a client. After they've picked everything, stylists write notes detailing the items they selected, and once clients have received their Fix, they leave feedback on the pieces that we sent them. These notes and this feedback can be leveraged to learn about our inventory: we can explore which occasions an item is good for, or learn features that are missing from the item descriptions in our databases, which function like a knowledge base. Ultimately we can use this information to make recommendations to stylists about what to write about an item if they're suffering from writer's block, to automatically make suggestions about what a client might like given a request note they've written, or even to help stylists find better alternatives to the items they are considering sending.

Unfortunately, our text data is largely unstructured – stylists can talk about anything they send and in any order and clients don’t necessarily talk about the item’s prints or fabrics, or occasions that an item is good for. I will discuss a technique I have developed that builds upon a number of existing information extraction methods in natural language processing that allows us to impose structure on these notes and comments. This way we can find out how a stylist talks about an item even if we don’t know where it’s mentioned. The technique results in a network that defines words and items in a common space that we can use to make recommendations about how to talk about an item in a note, or for finding the right item in our inventory.
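
As a toy illustration of putting words and items into a common space, one generic trick (not necessarily the technique described in the talk) is to inject item IDs as tokens into the notes and train word2vec over the combined sequences; gensim's 4.x API and the tokens below are assumptions.

  # Train word2vec over notes in which each item ID appears as a token, so
  # items and words land in the same vector space.
  from gensim.models import Word2Vec

  notes = [
      ["ITEM_123", "flowy", "blouse", "great", "for", "work"],
      ["ITEM_456", "soft", "leggings", "weekend", "casual"],
      ["ITEM_123", "office", "ready", "blouse"],
  ]
  model = Word2Vec(notes, vector_size=32, window=5, min_count=1, seed=7)
  print(model.wv.most_similar("ITEM_123"))  # words nearest this item's vector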

Bio

Cassandra Jacobs is a data scientist at Stitch Fix. A lover of unstructured data, she works primarily on natural language processing systems for recommendation algorithms, helping expert stylists pick the right pieces to send to clients. After earning her BA in Linguistics at the University of Texas, she earned her PhD in Cognitive Psychology and an MS in Computer Science at the University of Illinois at Urbana-Champaign. In her spare time, she enjoys backpacking trips, literary science fiction, and learning foreign languages.

Thomas in’t Veld

Head of Data Science, Peak Labs
Event Driven Growth Hacking

Peak acquired more than 25 million users in two years by combining event analytics with marketing attribution and predictive modelling. In this talk, I will take you on a journey through what makes this tick, how we built it and why it is one of the best ways to grow a new business. Event analytics is the cool new thing used by everyone from Facebook to your second cousin's dog's start-up, but why are so many people doing it wrong? And what will be the next step?

Our mission here at Peak is to make lifelong progress enjoyable. We believe there’s always a little room for improvement, and we should strive to better ourselves bit by bit. That’s why we use a combination of neuroscience, technology and fun to get those little grey cells active and striding purposefully towards their full potential. Peak is the number one brain training app on mobile and, since it launched in 2014, has been downloaded more than 25 million times. It has been recognised by both Apple and Google as one of the best apps available, winning Best of 2014, Best of 2015 and Best of 2016 awards as well as Editors’ Choice on both the App Store and Play Store.

Bio

I am a theoretical physicist turned data scientist, and after building shiny data tools for Sky and The Guardian I joined Peak in 2015 to build a data science team. My continuing mission: making sure that every decision at Peak is made with as much data as possible.

Dirk Gorissen

Senior Engineer, Oxbotica
Beyond Ad-Click Prediction

We all know machine learning is great for helping you tag friends on Facebook, suggesting which brand of toothpaste will improve your smile, and picking the ad most likely to unlock your wallet. In this talk, however, I hope to show you some interesting applications you may not have thought of, such as detecting landmines with drone-mounted radar, finding orangutans in the Bornean jungle, or helping a car avoid pedestrians.

Bio

Dirk Gorissen has a background in Computer Science & AI and has worked in academic and commercial research labs across Europe and the US. His interests span machine learning, robotics, and computational engineering, as well as their application in the humanitarian and development areas. He has been a regular consultant for the World Bank in Tanzania and closely involved with a number of drone-related startups. He is currently a senior engineer at the self-driving car company Oxbotica, and on the side is an active STEM Ambassador and organiser of the London Big-O Algorithms & Machine Learning meetups.

Maxime Beauchemin

Data Engineer, Airbnb
Advanced Data Engineering Patterns with Apache Airflow

Analysis automation and analytic services are the future of data engineering! Apache Airflow's DSL makes it natural to build complex DAGs of tasks dynamically, and Airbnb has been leveraging this feature in intricate ways, creating a wide array of services as dynamic workflows. In this talk, we'll explain the mechanics of dynamic pipeline generation using Apache Airflow, and present advanced use cases that have been developed at Airbnb.
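
For a flavor of the mechanics, here is a minimal sketch of dynamic DAG generation in Airflow's Python DSL; the Airflow 1.x-era module path and the table names are assumptions, and Airbnb's real services are of course far more intricate.

  # One loop emits a whole family of DAGs, which is the mechanism that
  # dynamic workflow services build on.
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash_operator import BashOperator

  for table in ["users", "orders", "payments"]:
      dag = DAG(
          dag_id="load_%s" % table,
          start_date=datetime(2017, 10, 1),
          schedule_interval="@daily",
      )
      extract = BashOperator(task_id="extract",
                             bash_command="echo extracting %s" % table,
                             dag=dag)
      load = BashOperator(task_id="load",
                          bash_command="echo loading %s" % table,
                          dag=dag)
      extract >> load
      # Expose each generated DAG at module level so the scheduler finds it.
      globals()[dag.dag_id] = dag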

Bio

Maxime Beauchemin works at Airbnb as part of the Analytics & Experimentation Products team, developing open source products that reduce friction and help generate insights from data. He is the creator and a lead maintainer of Apache Airflow (incubating), a workflow engine, and of Superset, a data visualization platform, and is recognized as a thought leader in the data engineering field. Before Airbnb, Maxime worked at Facebook on computation frameworks powering engagement and growth analytics, at Yahoo! on clickstream analytics, and at Ubisoft as a data warehouse architect.

Melanie Warrick

Senior Developer Advocate, Google
Machine Learning with Containers and Cloud

Machine learning (ML) has gained significant attention because of its impact on areas ranging from automated medical diagnosis to unique product interactions and advertising for individual users. At its core, it's a set of algorithms used for pattern matching and prediction, and it plays a prominent role in AI development.
ML in production doesn't happen in a vacuum. That's where containers and cloud systems can help. Containers create isolated environments that make it easy to set up servers and safely run software. Cloud systems give flexible access to hardware resources without the cost and pain of building and maintaining it all. This talk will walk through an example of how to implement a machine learning algorithm using containers in the cloud. The goal is to give you an understanding of how the tools work together, and how you can apply these concepts.

Bio

Melanie Warrick is a Senior Developer Advocate at Google. Her previous experience includes work as a founding engineer on Deeplearning4j as well as implementing machine learning in production at Change.org. Prior experience also covers business consulting and large enterprise technology implementations for a wide variety of companies. Over the last couple of years she has spoken at many conferences about artificial intelligence, and her passions include working on machine learning problems at scale.

Dr. Martin Loetzsch

Chief Data Officer, Project A Ventures
ETL Patterns with Postgres

Some companies have to process data volumes that far exceed the capacity of “small” database clusters, and they definitely have a valid use case for one of the modern parallelizing / streaming / big data processing technologies. For all others, expressing transformations in plain SQL is just fine, and PostgreSQL is the perfect workhorse for that purpose.
In this talk, I will go through some of our best practices for building fast, robust, and tested data integration pipelines inside PostgreSQL. I will explain many of our technical patterns, for example for schema management or for splitting large computations via chunking and table partitioning. And I will show how to apply standard software engineering techniques to maintain agility, consistency, and correctness.
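
As one example of the kind of pattern covered, here is a hedged sketch of a common PostgreSQL data-integration idiom: rebuild a dimension into a staging schema, then swap it in atomically. This is a generic illustration rather than necessarily the speaker's exact approach, and psycopg2 plus the schema and table names are assumptions.

  # Build into dim_next, then rename schemas, so readers never observe a
  # half-built state; the connection context manager commits on success.
  import psycopg2

  with psycopg2.connect("dbname=dwh") as conn:
      with conn.cursor() as cur:
          cur.execute("DROP SCHEMA IF EXISTS dim_next CASCADE")
          cur.execute("CREATE SCHEMA dim_next")
          # The transformation itself is expressed in plain SQL.
          cur.execute("""
              CREATE TABLE dim_next.customer AS
              SELECT customer_id, min(order_date) AS first_order_date
              FROM raw.orders
              GROUP BY customer_id
          """)
          # Atomically replace the old schema with the freshly built one.
          cur.execute("DROP SCHEMA IF EXISTS dim CASCADE")
          cur.execute("ALTER SCHEMA dim_next RENAME TO dim")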

Bio

Martin Loetzsch works at Project A, a Berlin-based operational VC focusing on digital business models. As Chief Data Officer, he has helped many of Project A's portfolio companies form teams that build data warehouses and other data-driven applications. Before joining Project A (with a short interlude at Rocket Internet), he worked in artificial intelligence labs in Paris and Brussels on computational linguistics and robotics. He received a PhD in computer science from the Humboldt University of Berlin.

Shrikanth Shankar

Director of Engineering, Data Analytics Infrastructure, LinkedIn
Scaling Reporting and Analytics at LinkedIn

At LinkedIn, we have been working on the next generation of our reporting infrastructure. This talk will describe our journey to build a centralized platform that scales to hundreds of users and thousands of metrics and supports applications ranging from simple dashboarding to anomaly detection. We will discuss how a combination of technology and processes has allowed us to scale our user base while preserving trust in our metrics. We will also cover some of the exciting work we have been doing running metrics across a wide variety of platforms (from MapReduce to streaming systems like Samza).

Bio

Shrikanth Shankar is a Director of Engineering at LinkedIn where he leads multiple teams that work on infrastructure and platforms to support LinkedIn's analytic needs. Shrikanth has a long background in data and has worked at big companies and startups in a variety of technical and management roles.

András Németh

Chief Technology Officer, Lynx Analytics
Scalable Distributed Graph Algorithms on Apache Spark

Graph analysis is extremely important for getting insights out of the ever-increasing amounts of data available today. Be it connections in a social network, calls placed among subscribers of a mobile network, connections among computers and routers, webpages with links, or proteins reacting with each other, there are vast datasets which can best be modeled as graphs.

To make sense of these datasets we need to run various graph algorithms on them. To identify critical nodes we need PageRank, centrality, and clustering coefficients; to find related groups of nodes we might want to find maximal cliques, communities or a modular clustering; to decompose a graph into independent sets we need graph coloring; and so on.

The above list is made of fairly standard graph problems with well-understood algorithms to solve them. But very often these algorithms are unsuitable for trivial parallelization, because they intrinsically require the full graph to be available in the memory of a single computer.

So how do we handle graphs too large for a single computer?

This talk is exactly about that. At Lynx Analytics we have built a big graph analysis engine on top of Apache Spark with a large library of graph algorithms readily available to users. In this talk we will dive into a few representative single-computer graph algorithms and show how to translate them into Spark's execution model. We will also get into some hands-on technical details. We will see how to optimally partition the data, and we will show some tricks for dealing with skewed graphs with vertices of immense degree without running out of memory.
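
To give a feel for such a translation (a generic example, not Lynx's engine), here is a minimal PySpark sketch of PageRank, a classic single-computer algorithm re-expressed as joins over a distributed edge list; the tiny hard-coded graph is purely illustrative.

  # Rank state lives in an RDD; each iteration joins the link structure
  # with the current ranks instead of touching shared memory.
  from pyspark import SparkContext

  sc = SparkContext(appName="pagerank-sketch")
  edges = sc.parallelize([(1, 2), (1, 3), (2, 3), (3, 1)])  # (src, dst)
  links = edges.groupByKey().mapValues(list).cache()        # src -> [dsts]
  ranks = links.mapValues(lambda _: 1.0)                    # initial ranks

  for _ in range(10):
      # Each vertex spreads its rank evenly over its out-edges...
      contribs = links.join(ranks).flatMap(
          lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]]
      )
      # ...and incoming contributions are summed with a damping factor.
      ranks = contribs.reduceByKey(lambda a, b: a + b) \
                      .mapValues(lambda r: 0.15 + 0.85 * r)

  print(sorted(ranks.collect()))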

Bio

Andras is the CTO at Lynx. On top of Apache Spark, Andras and his team built innovative tools that help users build, run, and generate insights from big data graphs.

Andras joined Lynx in 2014 from Google, where he served as Lead Software Engineer in Zurich. At Google, Andras contributed to YouTube’s revolutionary personal ad-targeting system, a cross-media prediction engine, and semantic web analysis based on Google’s knowledge graph.

Prior to Google, Andras worked for Applied Logic Laboratory in Budapest. His primary efforts were on speech recognition, text classification and intelligent retrieval systems.

Andras holds two Master's degrees, in Mathematics and in Computer Science, from Eötvös Loránd University and the Budapest University of Technology and Economics respectively.

Sameer Farooqui

Freelancer / AI + Deep Learning
Separating hype from reality in Deep Learning

Deep Learning is all the rage these days, but where does the reality of what Deep Learning can do end and the media hype begin? In this talk, I will dispel common myths about Deep Learning and help you decide whether you should practically use it in your software stack. I will begin with a technical overview of common neural network architectures like CNNs, RNNs and GANs, and their common use cases, like computer vision, language understanding and unsupervised machine learning.

Then I'll separate the hype from reality around questions like:

  • When should you prefer traditional ML systems like scikit-learn or Spark.ML instead of Deep Learning?
  • Do you no longer need to do careful feature extraction and standardization if using Deep Learning?
  • Do you really need terabytes of data when training neural networks, or can you 'steal' pre-trained lower layers from public models by using transfer learning? (See the sketch after this list.)
  • How do you decide which activation function (like ReLU, leaky ReLU, ELU, etc.) or optimizer (like Momentum, AdaGrad, RMSProp, Adam, etc.) to use in your neural network?
  • Should you randomly initialize the weights in your network or use more advanced strategies like Xavier or He initialization?
  • How easy is it to overfit/overtrain a neural network, and what are the common techniques to avoid overfitting (like l1/l2 regularization, dropout and early stopping)?
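
As a small taste of the transfer-learning question above, here is a hedged Keras sketch of reusing pre-trained lower layers; the VGG16 base, layer sizes and 10-class head are illustrative assumptions, not a recipe from the talk.

  # Freeze ImageNet-trained convolutional layers and train only a new head.
  from tensorflow.keras import layers, models
  from tensorflow.keras.applications import VGG16

  base = VGG16(weights="imagenet", include_top=False,
               input_shape=(224, 224, 3))
  base.trainable = False  # "steal" the pre-trained lower layers as-is

  model = models.Sequential([
      base,
      layers.Flatten(),
      layers.Dense(256, activation="relu"),
      layers.Dropout(0.5),  # one of the anti-overfitting techniques above
      layers.Dense(10, activation="softmax"),
  ])
  model.compile(optimizer="adam",
                loss="categorical_crossentropy",
                metrics=["accuracy"])
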
Bio

Sameer Farooqui is a freelancer who teaches corporate classes on big data and machine learning. Over the past 5 years, he has taught 150+ classes globally at conferences and for private clients on topics like NoSQL, Hadoop, Cassandra, HBase, Hive, Couchbase, and Spark. Sameer has been teaching Spark classes for three years and was the first full-time hire into Databricks' training department, where he worked closely with the Apache Spark committers on designing curriculum. Previously, Sameer also worked as a Systems Architect at Hortonworks and an Emerging Data Platforms consultant at Accenture R&D. When not working on Spark projects, Sameer enjoys exploring ideas in AI + Deep Learning, especially Google's new TensorFlow library.

Zareen Farooqui

Business Intelligence Analyst, Wayfair
Breaking into Data Analytics

Are you interested in starting a career in data analytics, but don't know where to begin? Attend a boot camp or teach yourself? Python or R? Last year, I quit my sales job to learn programming and break into the analytics field. In this talk, I will share the advice I learned and wish I had known then. I'll discuss common tools and technologies used in industry, how to continue developing tech skills after landing your first analytics job, and recommendations for managers to support direct reports with data-science-related ambitions.

Bio

Zareen is a Business Intelligence Analyst at Wayfair, where she focuses on marketing analytics. Previously, she interned at the Wikimedia Foundation and worked on projects to help understand how readers around the globe consume Wikipedia. She transitioned into data analytics after working as a sales engineer for 3 years and then taking time off to learn Python, SQL, and visualization. Zareen holds a B.S. in Industrial and Systems Engineering from the University of Florida.

Gábor Szabó

Data Scientist, Tesla
Some notes on processing sensor data for autonomous driving

Tesla's vision with Autopilot is to make driving safer and more effortless on the path towards full autonomy on the road. To quickly converge towards an ever more intelligent vehicle, we apply fleet learning to a number of driving scenarios, which requires the collection and processing of streams of anonymized machine-generated data.

In this talk I cover some of the tools and approaches that we use to make sense of these streams, and how the vehicles in turn can make use of the models we generate.

Bio

Gabor Szabo leads the Autopilot Maps team at Tesla, where our goal is to bring fully autonomous driving to the world of sustainable transportation. Previously, he was a data scientist at Lyft and Twitter, and did research on social networks and online communities at HP Labs, the Harvard Medical School, the University of Notre Dame, and the Budapest University of Technology and Economics.

Nilan Peiris

VP Growth, TransferWise
How to Grow Without Spending a Ton on Marketing
Bio

Nilan Peiris is VP Growth at TransferWise, the international money transfer platform.

TransferWise is the low-cost and fair way of transferring money internationally. Using peer-to-peer technology and without any hidden fees, it makes sending money abroad up to eight times cheaper than using a bank. TransferWise customers send £1.3 billion every month using the platform, and it has attracted $117m from investors such as the world's largest VC firm Andreessen Horowitz, Sir Richard Branson, and Peter Thiel and Max Levchin, the co-founders of PayPal.

Prior to TransferWise, Nilan was VP Growth at HouseTrip, in charge of scaling the company's growth in the European market. He has also worked as Chief Marketing Technology Officer at Holiday Extras, where he was responsible for all areas of technology, marketing and customer acquisition. Nilan also advises a number of early-stage startups on growth and getting to traction.

More speakers will be announced soon

If you want to be one of the speakers at Crunch 2017, submit your application via Papercall. The deadline for submissions is 15 May 2017.


Workshops (18 Oct, Wed)

Zoltan C. Toth

Apache® Spark™ Foundations (Databricks training)

Zoltan C. Toth, Senior Instructor & Consultant, Databricks

This hands-on 1-day course is for data engineers, analysts, and architects; software engineers; IT operations; and technical managers interested in a brief hands-on overview of Apache Spark.
The course covers core APIs for using Spark, basic internals of the framework, SQL and other high-level data access tools, as well as Spark’s streaming capabilities and machine learning APIs. Each topic includes slide and lecture content along with hands-on use of a Spark cluster through a web-based notebook environment.

After taking this class, you will be able to:

  • Experiment with use cases for Spark and Databricks, including extract-transform-load operations, data analytics, data visualization, batch analysis, machine learning, graph processing, and stream processing.
  • Identify Spark and Databricks capabilities appropriate to your business needs.
  • Communicate with team members and engineers using appropriate terminology.
  • Build data pipelines and query large data sets using Spark SQL and DataFrames.
  • Execute and modify extract-transform-load (ETL) jobs to process big data using the Spark API, DataFrames, and Resilient Distributed Datasets (RDD).
  • Analyze Spark jobs using the administration UIs and logs inside Databricks.
  • Find answers to common Spark and Databricks questions using the documentation and other resources.

Modules

  • Spark Overview
  • RDD Fundamentals
  • SparkSQL and DataFrames (see the sketch after this list)
  • Spark Job Execution
  • Intro to Spark Streaming
  • Machine Learning Basics
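
To give a sense of the SparkSQL and DataFrames module, here is a hedged sketch of the kind of pipeline-and-query step the course builds toward; the file path and column names are hypothetical.

  # Read semi-structured data, register it as a table, query it with SQL.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("crunch-demo").getOrCreate()
  events = spark.read.json("events.json")  # hypothetical input file
  events.createOrReplaceTempView("events")
  spark.sql("""
      SELECT user_id, count(*) AS n_events
      FROM events
      GROUP BY user_id
      ORDER BY n_events DESC
  """).show(10)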

More details on this workshop:
https://databricks.com/training/courses/apache-spark-overview

Bio

Zoltan works as a Senior Spark Instructor and Consultant at Databricks, the company founded by the creators of Apache Spark. Earlier he worked on Prezi.com's data infrastructure and managed the team that scaled it up to an infrastructure that crunches over 1 petabyte of data. Later he joined RapidMiner, a global leader in predictive analytics, and worked on kicking off the company's Apache Spark integration. Besides his work for Databricks, he designs and prototypes Big Data architectures and regularly gives Spark courses at conferences and for companies.

Sameer Farooqui

Deep Learning Fundamentals with TensorFlow and Keras

Sameer Farooqui, Freelancer / AI + Deep Learning

Abstract: Are you a software engineer who has been curious to get hands-on experience with Deep Learning? In this workshop, I'll introduce the fundamental concepts of Deep Learning and walk through code examples of common use cases. The class will be 60% lecture and 40% labs. The labs will run in Google Cloud, and all students should sign up for Google Cloud prior to class. Note that new users of Google Cloud will receive a $300 USD credit valid for 12 months after sign-up, which is sufficient to run all of the class examples and code for free.

In the morning, the class will cover:

  • What is Deep Learning?
  • Math fundamentals of Neural Networks (matrices, derivatives, gradient descent; see the sketch after this list)
  • Initialization, Activation, Loss and Optimization functions
  • Fundamentals of TensorFlow
  • Data preprocessing and feature engineering for different use cases
  • Overfitting and Underfitting
  • Introduction to the Keras API
  • Lab: TensorBoard UI
  • Lab: MNIST
  • Lab: Regression
  • Lab: Classification
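
To preview the gradient-descent fundamentals flagged above, here is a tiny NumPy sketch that fits y = w*x + b by following the gradient of the squared error; the toy data and learning rate are illustrative.

  # Gradient descent on a one-feature linear model, the core mechanism
  # behind the neural network optimizers covered in class.
  import numpy as np

  x = np.array([0.0, 1.0, 2.0, 3.0])
  y = 2.0 * x + 1.0                          # toy targets: true w=2, b=1
  w, b, lr = 0.0, 0.0, 0.1

  for _ in range(200):
      y_hat = w * x + b
      grad_w = 2 * np.mean((y_hat - y) * x)  # d(MSE)/dw
      grad_b = 2 * np.mean(y_hat - y)        # d(MSE)/db
      w -= lr * grad_w
      b -= lr * grad_b

  print(w, b)  # converges towards 2.0 and 1.0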

In the afternoon, we will cover:

  • Common network architectures
  • Convolutional Neural Networks (CNNs) for computer vision
  • Recurrent Neural Networks (RNNs) for language understanding (including LSTMs)
  • Stealing pre-trained layers with Transfer Learning
  • Lab: Object Detection in Images
  • Lab: Text and Language Understanding
Bio

Sameer Farooqui is a freelancer who teaches corporate classes on big data and machine learning. Over the past 5 years, he has taught 150+ classes globally at conferences and for private clients on topics like NoSQL, Hadoop, Cassandra, HBase, Hive, Couchbase, and Spark. Sameer has been teaching Spark classes for three years and was the first full-time hire into Databricks' training department, where he worked closely with the Apache Spark committers on designing curriculum. Previously, Sameer also worked as a Systems Architect at Hortonworks and an Emerging Data Platforms consultant at Accenture R&D. When not working on Spark projects, Sameer enjoys exploring ideas in AI + Deep Learning, especially Google's new TensorFlow library.

Gergely Daróczi

Practical Introduction to Data Science and Engineering with R

Gergely Daróczi, Passionate R developer

This is an introductory 1-day workshop on how to use the R programming language and software environment for the most common data engineering and data science tasks. After a brief overview of the R ecosystem and language syntax, we quickly get up to speed with hands-on examples on

  • reading data from local files (CSV, Excel) or simple web pages and APIs
  • manipulating datasets (filtering, summarizing, ordering data, creating new variables)
  • computing descriptive statistics
  • building basic models
  • visualizing data with `ggplot2`, the R implementation of the Grammar of Graphics
  • doing multivariate data analysis for dummies (e.g. anomaly detection with principal component analysis; dimension reduction with multidimensional scaling to transform the distance matrix of European cities into a map)
  • introduction to decision trees, random forest and boosted trees with `h2o`
Bio

Gergely has been using R for more than 10 years, in academia (teaching data analysis and statistics in the MA programs of PPCU, Corvinus and CEU) and in industry as well. He started his professional career at public opinion and market research companies, automating the survey analysis workflow, then founded and became the CTO of rapporter.net, a reporting SaaS based on R and Ruby on Rails, a role he quit to move to Los Angeles and standardize the data infrastructure of a fintech startup. Currently, he is the Senior Director of Data Operations at an adtech company in Venice, CA. Gergely is an active member of the R community (main organizer of the Hungarian R meetup and the first satRday conference, speaker at international R conferences, and developer and maintainer of CRAN packages).

Ágoston Nagy

Visualizing multidimensional data using unsupervised machine learning (t-SNE) in JavaScript

Ágoston Nagy, HCI Researcher, Prezi

Dimensionality reduction (especially t-SNE) is an area of unsupervised machine learning and a useful technique for finding previously unknown correlations and patterns within different datasets. It is efficient and scalable: it can be used on anything from low-dimensional personal data to high-dimensional, large-scale datasets. It is a very useful technique for visualization, where the user would like to see an n-dimensional space embedded into a human-understandable 2D/3D space. By the end of the workshop, participants will be able to use dimension reduction to find patterns and correlations within their own high-dimensional datasets. They will learn how to display elements on a 2D/3D canvas and interact with them using different inputs and animations. (See the sketch after the topic list below.)

What we cover:

  • Drawing basics (canvas coordinates, shapes, lines)
  • 3D basics (WebGL)
  • Animation
  • Generating Data (Perlin noise, random probabilities, sensors, environment, etc.)
  • Loading Data, using APIs
  • Publicly available datasets
  • Visualizing Data
  • Feature extraction
  • Dimension Reduction (t-SNE)
  • Beyond the canvas (socket.io, nodejs etc)
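
The workshop itself is in JavaScript; purely to show the shape of such a pipeline, here is a hedged equivalent in Python with scikit-learn on a standard public dataset.

  # Embed 64-dimensional digit images into 2D with t-SNE, ready to draw.
  from sklearn.datasets import load_digits
  from sklearn.manifold import TSNE

  digits = load_digits()  # a public, 64-dimensional dataset
  xy = TSNE(n_components=2, perplexity=30.0).fit_transform(digits.data)
  print(xy.shape)         # (1797, 2): one point per digit image
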
Bio

Agoston works in interaction design, experimental media and generative art using free and open source tools. He designs dynamic systems and interfaces for networked installations and develops creative mobile applications. He is addicted to hacking and to altering the functions of existing contexts and ordinary objects. He regularly gives workshops and courses on interaction design and creative coding using several open source languages. He is a guest lecturer at the Bergen University of Fine Arts (Norway) and the Moholy-Nagy University of Art and Design (MOME, Hungary), and an HCI researcher at Prezi.com. As of 2016, he has been doing postdoctoral research in Realtime Interactions & Machine Learning at MOME. His works have been exhibited worldwide, including in China, India, Canada, Germany, Italy, Norway, Poland, the United States, Belgium and Hungary. He is the co-founder of the experimental new media design group Binaura.


Location

Meet Budapest, a really awesome city

Here are a few reasons why you need to visit Budapest

MAGYAR VASÚTTÖRTÉNETI PARK

BUDAPEST, TATAI ÚT 95, 1142

The Magyar Vasúttörténeti Park (Hungarian Railway History Park) is a railway museum located at a railway station and workshop of the Hungarian State Railways (MÁV) in Budapest, Hungary. Located on the site of the railway's former north depot, it is Europe's first interactive museum of its kind. The north depot's roundhouse, home to the museum, was built in 1911 and is itself part of Hungarian railway history. There are over a hundred vintage trains, locomotives, cars and other types of railroad equipment on display, including a steam engine built in 1877, a railcar from the 1930s and a dining car built in 1912 for the famous Orient Express.


Tickets

Sponsors

Platinum

Gold

Silver

CRUNCH is a non-profit conference. We are looking for sponsors to help us make this conference happen.
Take a look at our sponsor packages and contact us at hello@crunchconf.com


Contact

Crunch Conference is organized by

Ádám Boros
Marketing Intern, Prezi
Attila Balogi
Event manager, Prezi
Attila Petróczi
R&D and Data Science Manager, Realeyes
Balázs Szakács
Business Intelligence Manager, IBM Budapest Lab
Dániel Molnár
Senior Data & Applied Scientist, Microsoft Deutschland GmbH / Wunderlist Team
Katalin Marosvölgyi
Travel and accommodation manager, Prezi
Medea Baccifava
Head of conference management, Prezi
Tamás Imre
Lead Analyst, Prezi
Tamás Németh
Data Engineer, Prezi
Zoé Rimay
Software Developer, Morgan Stanley
Zoltán Prekopcsák
VP Big Data, RapidMiner
Zoltán Tóth
Big Data and Hadoop expert, Datapao; Teacher, CEU Business School
Ryan McCabe
Data Analyst, Prezi
Gergely Krasznai
Data Analyst, Prezi

Questions? Drop us a line at hello@crunchconf.com