From the course: Learning Data Science: Understanding the Basics

Focus on knowledge

- One of the key challenges in data science is what I call the cluster of dreams. It's based on the movie starring Kevin Costner called Field of Dreams. It was about a man who spent all of his savings building a baseball diamond in a cornfield. He was visited by the ghosts of old players. They told him to finish the baseball diamond. They said, if you build it, they will come.
Many organizations get caught in the same trap. They focus their energy on building hardware and collecting massive amounts of data. They make sizable investments in software to run on large data clusters. Their dream is that if they have enough hardware and software, then they'll gain valuable insights. If they build it, it will come.
That makes a lot of sense if you think about it. Many organizations have a lot of experience delivering successful hardware projects. It's something they know how to do. Most large organizations are good at it and they've done it for decades. Data science is new. In many organizations, it's not as easy to spend money on exploring and asking questions. You're not building out an operational capability. Instead, you should be focused on starting a new mindset. Hardware is real. It's tangible. You can see what you're buying. Exploration is tougher to quantify. It doesn't have an ROI that fits neatly into a project pipeline. It can be volatile. You only know if it was worth it after you've already done it.
The US Library of Congress famously started a project to collect 170 billion tweets. They wanted to show that they could work with data science. They bought the servers and hardware to hold the tweets, but they didn't have any plan for what to do with the data. They also couldn't give anyone access to the data. They figured if they built it, then they would come. Unfortunately, it still sits idle on hundreds of servers, a monument to data collection. This might seem like an unusual case, but it's actually very common.
Organizations focus on building out capability. They set a goal for a certain number of nodes in their Hadoop cluster. They also focus on using a suite of visualization tools. The budget goes into hardware and software. There's little left over for the data science team.
I once worked for an organization that was trying to use a big data cluster to replace their data warehouse. They were already used to spending millions on data warehouse hardware, so they would hire experts to maintain this investment. When they moved to Hadoop, they had the same mindset. They started a multi-million dollar project to create three separate clusters. The entire budget went into servers and software. At the end of two years, they had three clusters, but only a few people accessed the data. To make matters worse, those people were spread out across several different functional areas. They had millions in hardware and software, but no data science team. A couple of years into the project, the cluster only had a few terabytes of data, roughly the same amount that you could fit on a hard drive for a few hundred dollars. Only a few people created simple reports for one or two departments.
There are a few things to remember to keep from falling into this trap. The first thing is that data science teams are exploratory. They're looking at the data to find insights. Data isn't the product. It's the insights that come from the data. There's no prize for having the biggest cluster. Even though a data science team might spend most of their time collecting their data, that doesn't mean that all the value comes from this raw material. It's the same way that having a chef's knife doesn't make you a chef. A big data collection doesn't make you a data science team. What makes a data science team is the questions you ask. It's about the scientific method.
The second thing to remember is that most data science teams use several different software tools. Sometimes they want to use R instead of Python. It might be easier to hold a small amount of data in a relational database. They might use different visualization tools. Give them the option to explore. Often, a data science team can get a lot more done with several free tools than they can with one big investment. The team should build tools out as they need them. A good data science team will be very messy. They'll use many different tools and techniques to wrangle and scrub their data. Instead of hardware and software, invest in training and expertise. Remember that the most valuable part of your data science team is the people asking the questions.
