Extending Spark Machine Learning Pipelines

FOSDEM 2017

Apache Spark is one of the most popular new "big data" technologies, and now has a scikit-learn-inspired pipeline API. This talk looks at how the pipeline API works as well as how to add your own custom algorithms to Apache Spark. The talk will focus on Scala, but the same techniques can be used in Java or other JVM languages. Sadly, extending the pipeline API cannot currently be done in non-JVM languages, but the information on how to use the pipeline API will be useful to Python and R users as well.
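
As a rough illustration of what adding a custom stage can look like, here is a minimal sketch of a pipeline Transformer in Scala. It is not taken from the talk: the class name StringLengthTransformer and its parameter names are hypothetical, and it assumes Spark 2.x with spark-mllib on the classpath. The stage appends a column holding the length of a string column.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, length}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical example stage (not from the talk): adds an integer column
// containing the character length of an existing string column.
class StringLengthTransformer(override val uid: String) extends Transformer {

  // Zero-argument constructor that generates a unique stage id.
  def this() = this(Identifiable.randomUID("stringLength"))

  // Parameters naming the column to read and the column to create.
  final val inputCol = new Param[String](this, "inputCol", "name of the input string column")
  final val outputCol = new Param[String](this, "outputCol", "name of the output length column")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Validate the input schema and describe the output schema
  // without touching any data.
  override def transformSchema(schema: StructType): StructType = {
    val field = schema($(inputCol))
    require(field.dataType == StringType,
      s"Column ${$(inputCol)} must be StringType but was ${field.dataType}")
    schema.add(StructField($(outputCol), IntegerType, nullable = true))
  }

  // Do the actual work: append the length column.
  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    dataset.withColumn($(outputCol), length(col($(inputCol))))
  }

  // Pipelines copy their stages; defaultCopy relies on the uid constructor above.
  override def copy(extra: ParamMap): StringLengthTransformer = defaultCopy(extra)
}
```

A stage written this way could be configured with setInputCol and setOutputCol and then placed in a Pipeline's stages alongside built-in stages. An algorithm that needs to learn from data would instead extend Estimator and return a Model from fit.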

Speakers: Holden Karau