Conference Speaking And Presentation Skills: Mike Hall Interviews Dean Wampler | GOTO Conference 2015

UGtastic Archive
Dean Wampler explains Spark, a distributed computing engine for big data processing, and compares it to other tools in the ecosystem. He highlights Spark's flexibility, efficiency, and open-source nature, as well as its growing community and commercial support. Learn more: https://just3ws.github.io/interviews/dean-wampler-goto-conference-2015
The Interviewer

Mike Hall

Interviewer, UGtastic

The Guest

Dean Wampler

conference speaking and presentation skills

The Conversation


Mike Hall Interviewer, UGtastic
Hi, it's Mike with UGtastic. I'm here again at GOTO Conference 2015, sitting here with Dean Wampler, who's giving a talk on data science with Spark. Thank you very much for taking the time to speak with me, Dean. You know, the big question is, and kind of jokingly, what is data science? But more in the moment, what is Spark?
Dean Wampler conference speaking and presentation skills
So Spark is a distributed computing engine designed for data systems. It was invented about five or so years ago as a research project at Berkeley, and it's now an Apache project and an official tool in the Hadoop ecosystem, though it's used outside of it as well. It's for processing large data sets that you have on a file system somewhere or in a database, or data streaming in from, say, Twitter or your log files or whatever, so that you can do transformations, analyze it, even run machine learning algorithms, like looking for anomalies or making predictions and things like that.
Mike Hall Interviewer, UGtastic
It's interesting about the logs. That's one of those aspects that we, as developers, tend to forget has real data in there and can tell us a lot about how our systems are working. But the thing I'm thinking about is there's been kind of a theme in some of the interviews I've done about this concept, these new ways of, I don't know if they're second-generation versions of these massive processing systems. What is it about Spark compared to, like, Ignite or some of the other vendor offerings that we have here?
Dean Wampler conference speaking and presentation skills
Yeah, so you made a good point that it's kind of a second-generation tool. It's really sort of a rethinking of the original Hadoop tool called MapReduce, which actually was invented at Google, making it more flexible for different problems and more efficient, so that you can do things like highly iterative processing. That turns out to be important for things like machine learning, or if you're representing your data as a graph and you need to walk the edges of the graph, that sort of thing. Compared to some of the commercial tools that are available, it's a little rougher around the edges because it has been open source. It was, like I said, a research project that's rapidly being adopted by industry.
So a lot of times it's a question of: do I have the scale requirements, and maybe the flexibility, to work with an open source tool, and am I less concerned about having something that's as mature and commercially supported as a proprietary system? Some of the proprietary systems are really, really good for particular kinds of problems and particular kinds of users, whereas something like Spark is more general purpose and can be more widely used. But you might have to do a little more self-service at this stage. But like all open source projects, and this one happens to be maybe the most active in the world right now, with the possible exception of Linux or something, things just rapidly improve and people fill in the gaps where needed. I work for Typesafe, and we're supporting Spark commercially now. We're also contributing to Spark and to how it runs on top of the Mesos framework, for example, which is an alternative to Hadoop. There's even a standalone mode: if you just have a small cluster you want to wire up and then just go, that's also an interesting option.
Mike Hall Interviewer, UGtastic
Now, Spark is going to be one of those hard words to Google. Is there a project page? You mentioned Apache.
Dean Wampler conference speaking and presentation skills
Yeah, when I want to get to the page I always Google "Apache Spark". And spark.apache.org is the homepage for it. So, yeah, you're right. It's not a really Google-friendly word.
Mike Hall Interviewer, UGtastic
Well, thank you very much for taking the time to speak with me.
Dean Wampler conference speaking and presentation skills
My pleasure. Thank you.
Mike Hall Interviewer, UGtastic
Thanks.