At the end of October 2015 the first European Spark Summit took place at the Beurs van Berlage center in Amsterdam. It was the third conference dedicated to Apache Spark this year. Four of comSysto’s engineers traveled to Amsterdam for three intense days of Spark. This post summarizes highlights from the training and talks, as well as some of our general thoughts about Spark.
comSysto at Spark Summit Europe Amsterdam 2015
Keynotes
There were a total of nine keynotes over the two days; here are our favorites:
Matei Zaharia, the creator of Spark, gave a state-of-the-union keynote focusing on the rapid adoption and overall growth of Spark as an Apache Foundation project. Spark now has over 600 contributors and is one of the most active Apache projects. 51% of Spark users deploy in the cloud, Python’s popularity as a Spark language grew by 20%, and people are also picking up R as a fourth language choice. The introduction of the new DataFrame API was the major change this year, and more performance optimizations are coming with Project Tungsten. Zaharia also gave a peek at the upcoming Spark 1.6 features: mainly a type-safe DataFrame API named the Dataset API, the integration of DataFrames into the Spark Streaming and GraphX APIs, and more Tungsten features (in-memory cache, SSD storage).
Martin Odersky gave a keynote on Spark being the “ultimate Scala collections”. Spark is an example of a Scala DSL that defines lazy collection operations and adds pairwise operations (e.g. reduceByKey). Because Spark uses them so extensively, Scala will adopt some of these concepts, such as collection views, cachable collections, and pairwise operations on sequences of pairs. On the other hand, Spark can benefit from Scala’s rich type system as well as the upcoming Spores feature for compile-time checking of closure captures that might get distributed across nodes. There is clearly a lot of exchange between the two communities, and both stand to benefit from it.
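To make the “pairwise operations” point concrete, here is a minimal sketch contrasting plain Scala collections with Spark’s pair RDDs (local mode, toy data of our own):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairwiseOpsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pairwise-ops").setMaster("local[*]"))

    val words = Seq("spark", "scala", "spark", "summit", "scala", "spark")

    // Plain Scala collections: eager, single-machine grouping
    val localCounts: Map[String, Int] =
      words.groupBy(identity).mapValues(_.size)

    // Spark as the "ultimate Scala collection": lazy, distributed, with
    // pairwise operations defined on RDDs of key/value pairs
    val distributedCounts = sc.parallelize(words)
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // only available on RDD[(K, V)]

    distributedCounts.collect().foreach(println)
    sc.stop()
  }
}
```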
Talks
Magellan: Geospatial Analytics on Spark
It is promising to see a library addressing the handling of geospatial data and operations in Spark. There are many libraries available for encoding, parsing, and storing geospatial data in various formats; however, when trying to express more advanced operations such as geospatial joins, unions, or intersections in a distributed fashion you were on your own. Spatial operations often involve a join of multiple geospatial layers, which maps well to RDD operations (see the sketch below). Magellan provides optimized geospatial predicates and operations on top of Spark’s DataFrame API. For primitive spatial operations it depends on ESRI’s Geometry API, and it aims at implementing the OpenGIS Simple Feature for SQL API.
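Magellan’s own DSL is not shown here; the following sketch only illustrates the general idea of expressing a spatial predicate as a DataFrame join condition, using a naive point-in-bounding-box check in place of real geometry operations (all data and names are made up):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}

object SpatialJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spatial-join").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Layer 1: points of interest (hypothetical sample data)
    val points = Seq(("poi-1", 4.89, 52.37), ("poi-2", 11.58, 48.14))
      .toDF("poiId", "lon", "lat")

    // Layer 2: regions as bounding boxes (minLon, minLat, maxLon, maxLat)
    val regions = Seq(
      ("Amsterdam", 4.7, 52.3, 5.0, 52.45),
      ("Munich", 11.3, 48.0, 11.8, 48.25)
    ).toDF("region", "minLon", "minLat", "maxLon", "maxLat")

    // The spatial predicate as a UDF; a real library like Magellan would push
    // this into optimized geometry operations instead of a naive check.
    val contains = udf { (minLon: Double, minLat: Double, maxLon: Double, maxLat: Double,
                          lon: Double, lat: Double) =>
      lon >= minLon && lon <= maxLon && lat >= minLat && lat <= maxLat
    }

    // Geospatial join of the two layers on the spatial predicate
    points.join(regions,
      contains($"minLon", $"minLat", $"maxLon", $"maxLat", $"lon", $"lat"))
      .select("poiId", "region")
      .show()

    sc.stop()
  }
}
```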
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson gave a presentation on rethinking classical data processing architectures to cope with the flood of data companies face today. LinkedIn, for example, generates 2.5 trillion events per day, amounting to 1 petabyte of streaming data. The Lambda Architecture style provides guidelines for handling both batch and stream processing of massive datasets, but implementing it is still hard. Edelson discussed technology choices for implementing the different aspects of Lambda: Spark/Scala for distributed computing, Mesos for cluster resource management, Akka for concurrent and fault-tolerant application logic, Cassandra for distributed data storage, and Kafka for real-time ingestion of streaming data: the SMACK stack. Colocating Cassandra and Spark nodes for data locality in particular seems like a good choice. Code for her reference application killrweather can be found on GitHub.
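A minimal sketch of the ingestion path of such a stack, assuming Spark Streaming’s Kafka direct stream and the DataStax spark-cassandra-connector; the topic, keyspace, table, and event format are our own invention:

```scala
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingIngestSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1") // connector setting
    val ssc = new StreamingContext(conf, Seconds(5))

    // Read raw events from Kafka using the direct (receiver-less) approach
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("weather-events"))

    // Parse "stationId,temperature" payloads and persist them to Cassandra
    events.map { case (_, payload) =>
      val Array(stationId, temperature) = payload.split(",")
      (stationId, temperature.toDouble)
    }.saveToCassandra("weather", "raw_events", SomeColumns("station_id", "temperature"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```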
Spark DataFrames
Michael Armbrust from Databricks talked about Spark’s DataFrame API and its integration with Spark ML. A DataFrame is a distributed collection of rows organized into named columns and a unified interface for interacting with data from Scala, Java, Python, or R. The main advantage of DataFrames over RDDs is Spark’s ability to optimize program execution: since DataFrames carry more information about the structure of the data, they usually achieve better performance than regular RDDs. DataFrame programs are also language agnostic: for example, transformations written with the Python DataFrame API are no longer shipped to the worker nodes and executed by a slower Python interpreter; they run as optimized plans on the JVM. Regarding integration with Spark ML, a more streamlined version of MLlib built on top of the DataFrame API was presented. Databricks also introduced the Spark ML Pipeline abstraction: a practical machine learning pipeline often involves a sequence of data pre-processing, feature extraction, model fitting, and validation stages. Previously these steps had to be wired together manually, which was error prone; Spark ML Pipelines provide an abstraction for those common data processing steps (see the sketch below). It is nice to see that the programming interface has matured, and we think we will see plenty of new features in the upcoming releases.
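A minimal sketch of such a pipeline against the Spark 1.5 ML API; the toy data and column names are our own:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ml-pipeline").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Toy training data: (id, text, label)
    val training = Seq(
      (0L, "spark summit amsterdam", 1.0),
      (1L, "regular batch job", 0.0),
      (2L, "spark streaming talk", 1.0),
      (3L, "coffee break", 0.0)
    ).toDF("id", "text", "label")

    // Pre-processing, feature extraction and model fitting as pipeline stages
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fitting the pipeline runs all stages in order and returns a single model
    val model = pipeline.fit(training)
    model.transform(training).select("id", "text", "prediction").show()

    sc.stop()
  }
}
```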
Productionizing Spark and the Spark Job Server
The talk by Evan Chan focused on setting up and tuning Spark clusters and how to avoid common pitfalls: from choosing the right cluster mode to debugging Spark applications and collecting Spark context metrics. Another step towards making Spark production ready is the Spark Job Server, which turns a Spark cluster into a “cluster as a service” by adding a REST management interface. Spark Job Server provides its own metadata store for storing and sharing jobs, configurations, and job jars. It sits on top of your streaming or batch workloads and manages jobs and Spark contexts for you. Since the Job Server creates the context, an existing Spark context can be re-used or a new one can be created, allowing for low-latency queries and RDD sharing among jobs (see the sketch below). Security, authentication, and all cluster managers are supported. Spark Job Server has also found its way into the latest DataStax Enterprise distribution.
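For illustration, a minimal sketch of a job written against the Spark Job Server job API of that time (package and trait names as in the spark-jobserver project; the word-count job itself and the endpoints shown in the comments are a toy example based on the project’s documentation):

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

// A job managed by the Spark Job Server: the server owns the SparkContext and
// hands it to runJob, so contexts can be shared and re-used across jobs.
object WordCountJob extends SparkJob {

  // Validate the job configuration before the job is scheduled
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  // The actual job logic; the return value is serialized into the REST response
  override def runJob(sc: SparkContext, config: Config): Any = {
    val input = config.getString("input.string")
    sc.parallelize(input.split(" ").toSeq)
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap
  }
}

// Typical interaction via the REST interface (illustrative):
//   upload the jar:  POST the assembly to /jars/<appName>
//   start the job:   POST to /jobs?appName=<appName>&classPath=WordCountJob
```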
Spark Training
On the first day Databricks offered four parallel training sessions on Spark. We chose the “Data Science with Apache Spark” training by Jon Bates, since our main use cases include exploratory data analysis and machine learning. Offering a training at that scale (hundreds of participants) is definitely a challenge, but it was well executed. Databricks provided access to their cloud platform for all participants, which gave everyone the opportunity to use browser-based “notebooks” for exploring and running lab code against their own Spark clusters in the cloud (AWS). Compared to small-scale trainings there were obviously fewer opportunities to ask questions, and the pace of the presentation and the amount of material were tremendous: there was a lot to digest. However, the quality of the tutorial content and the opportunity to keep using the platform for some weeks after the training made up for that.
Conclusion
Spark is a promising tool for handling all kinds of large-scale data processing tasks, which are becoming more and more common at companies across all industries. IBM calls Spark “Potentially the Most Significant Open Source Project of the Next Decade” and commits to it by investing $300 million over the next few years and assigning more than 3,500 researchers and developers to Spark-related projects. Microsoft, for instance, is using Spark and Cassandra to process over 10 TB of event data per day from its Office 365 products. The diverse ecosystem of languages and tools offered by Spark is definitely a unique feature, making the switch from exploratory data analysis to application development a lot smoother. Deploying complete stacks (such as SMACK) on a computing cluster or in the cloud still seems challenging at the moment. The current focus lies on exploratory tools (notebooks) and languages (Python) tailored towards data scientists, as well as deployment topics. Discussions on developing full-stack applications and integrating Spark into existing systems, however, are still rare.
At comSysto we explore Spark during our labs, at data science challenges, and by implementing prototypes. For data-intensive projects and for implementing Lambda Architectures we currently regard Spark as one of the primary options.
Do you want to shape a fundamental change in dealing with data in Germany? Then join our Big Data Community Alliance!