Interview with Dean Wampler at GOTO Chicago 2015
Transcript
Mike: Hi, it’s Mike with UGtastic. I’m here again at GOTO Conf 2015, sitting here with Dean Wampler, who’s giving a talk on data science with Spark. Thank you very much for taking the time to speak with me, Dean. The big question, kind of jokingly, is "what is data science?" But more in the moment: what is Spark?

Dean: Spark is a distributed computing engine designed for data systems. It was invented about five or so years ago as a research project at Berkeley. It’s now an Apache project and an official tool in the Hadoop ecosystem, but it’s used outside it as well. You can use it to process large data sets that you have on the file system somewhere or in a database, or data streaming in from, say, Twitter or your log files. You can do transformations, analyze the data, even run machine learning algorithms, like looking for anomalies or making predictions.

Mike: It’s interesting that you mention logs. That’s one of those things that we as developers tend to forget has real data in it that can tell us a lot about how our systems are working. But the thing I’m thinking about is a theme from some of the interviews I’ve done here: I don’t know if these are second-generation versions of these massive processing systems. What is it about Spark compared to Ignite or some of the other vendor offerings that we have here?

Dean: You made a good point that it’s kind of a second-generation tool. It’s really a rethinking of the original Hadoop tool called MapReduce, which was actually invented at Google, making it more flexible for different problems and more efficient, so that you can do things like highly iterative processing. That turns out to be important for things like machine learning, or if you’re representing your data as a graph and you need to walk the edges of the graph.

Compared to some of the commercial tools that are available, it’s a little rougher around the edges, because it has been open source; like I said, it was a research project that’s rapidly being adopted by industry. So a lot of times it’s a question of: do I have the scale requirements, and maybe the flexibility, to work with an open source tool, and am I less concerned about having something that’s as mature and commercially supported as a proprietary system? Some of the proprietary systems are really, really good for particular kinds of problems and particular kinds of users, whereas something like Spark is more general purpose and can be more widely used, but you might have to do a little more self-service at this stage.

But as with all open source projects, and this one happens to be maybe the most active in the world right now, with the possible exception of Linux or something, things rapidly improve and people fill in the gaps where needed. I work for Typesafe, and we’re supporting Spark commercially now. We’re also contributing to Spark and to how it runs on top of the Mesos framework, for example, which is an alternative to Hadoop. There’s even a standalone mode, if you just have a small cluster you want to wire up and go. That’s also an interesting option.

Mike: Now, Spark is going to be one of those hard words to Google. Is there a project page? You mentioned Apache.

Dean: Yeah, when I want to get to the page I always Google "Apache Spark". The home page is spark.apache.org. You’re right, it’s not a really Google-friendly word.

Mike: Well, thank you very much for taking the time to speak to me.

Dean: My pleasure, thank you.

Mike: Thanks.