21 Questions You Should Ask Before Starting a Data-Driven Project


A data-driven approach enables companies to examine and organise their data with the goal of better serving their customers and consumers. By using data to drive its actions, an organisation can contextualise and/or personalise its messaging to its prospects and customers for a more customer-centric approach.

Published on April 24, 2021

problem statement data science Over The Top


    Let’s say there are two OTT platforms, A and B, similar to Netflix and Amazon. Company-A uses advanced AI-powered technology to retain its customers, recommend shows and movies, analyze user demographics, and so on, and is one of the top streaming platforms nowadays.

    Company-B, on the other hand, is having a tough time figuring out ways to increase its annual revenue and provide the best experience to its users. Assume you are a data scientist or an analyst and are given this task to complete…

    In this post we are going to explore how to tackle a business problem and turn it into a data science task.

Understanding the Problem

    The first and foremost thing we should do is understand the goals of the business in order to arrive at a solution.

    The information provided to you will be the basis for your analysis, so take your time and make sure to get all the information you need from the domain expert.

    “It’s necessary to accurately define the data problem that is to be solved. The problem should be clear, concise, and measurable. Many companies are too vague when defining data problems, which makes it difficult or even impossible for data scientists to translate them into machine code” - Brainhub

    The things we have to focus on are the main problems that the client (in this case a company) is facing, the resources given to us, the benefits of solving the issue, and the risks that may occur.

    Going back to the problem we were solving, it states that company-B needs to improve its income by using big data and AI. It is also mentioned that another company, company-A, uses data analytics for its betterment.

    Below is a video showing you how Netflix uses data.

    With this you may have understood what we are trying to build… Yes! A RECOMMENDATION SYSTEM.

    I do agree there are many more things we can do to improve a company's revenue. But in this case it is one of the most feasible options.

Converting to a data science task

    A problem statement generally follows the format:
        “The problem P, has the impact I, which affects B, so a good starting point would be S.”

    After you have thoroughly understood the goals of company-B, you have to make your own data-driven statements that will lead to insights.

    I believe this is an entirely individual task; nevertheless, I will help you in doing so.

    I think of this as narrowing the big problem statement into smaller statements that include some data science jargon such as cleaning, models, classification, etc.

    In our case we can identify the features that determine a user's niche, such as the timestamps of pausing and watching a video, the date, time and place (zip code) of watching, ratings provided, searches, browsing and scrolling behavior, etc., and then recommend accordingly.

or

    Classify each user by the genre he/she is most likely to be interested in watching at a given moment, and then recommend those genres at the top of the user's list.
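
    To make the second idea a little more concrete, here is a minimal sketch, not a real recommender. It assumes a made-up watch log of (user, hour of day, genre) rows and simply counts which genres a user watches most in each part of the day. The names WATCH_LOG, time_bucket, build_profiles and recommend_genres are all hypothetical, and a real system would use far richer features and a trained model.

```python
from collections import Counter, defaultdict

# Hypothetical watch log: (user_id, hour_of_day, genre). All values are made up.
WATCH_LOG = [
    ("u1", 21, "thriller"),
    ("u1", 22, "thriller"),
    ("u1", 21, "comedy"),
    ("u1", 8, "news"),
    ("u2", 20, "drama"),
    ("u2", 23, "drama"),
    ("u2", 9, "comedy"),
]


def time_bucket(hour):
    """Map an hour (0-23) to a coarse part of the day (an assumed split)."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"  # also covers late night, 18:00 to 04:59


def build_profiles(log):
    """Count how often each user watched each genre in each time bucket."""
    profiles = defaultdict(Counter)
    for user_id, hour, genre in log:
        profiles[(user_id, time_bucket(hour))][genre] += 1
    return profiles


def recommend_genres(profiles, user_id, hour, top_n=2):
    """Return the user's most-watched genres for this part of the day."""
    counts = profiles.get((user_id, time_bucket(hour)), Counter())
    return [genre for genre, _ in counts.most_common(top_n)]


profiles = build_profiles(WATCH_LOG)
print(recommend_genres(profiles, "u1", 21))
```

    A user with no history in a time bucket gets an empty list here, which is the classic cold-start case; falling back to overall popular genres would be one way to handle it.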

Data Scarcity:

    Yes, I agree that data is no longer in short supply in this BIG DATA age.

    But I’m talking about the scarcity that occurs when choosing data suitable for the problem.

    I recommend starting with small datasets and then increasing their volume as the requirements grow.
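
    One way to work like this is to prototype on a reproducible random sample and only scale up once the pipeline behaves. The sketch below assumes the data fits in a Python list; sample_rows and all_rows are hypothetical names, and the fractions are arbitrary.

```python
import random


def sample_rows(rows, fraction, seed=42):
    """Return a reproducible random subset holding `fraction` of the rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not rows:
        return []
    size = max(1, round(len(rows) * fraction))
    # A dedicated Random instance keeps the sample stable across runs.
    return random.Random(seed).sample(rows, size)


all_rows = list(range(1000))  # stand-in for real records

# Start small, then grow the sample as the requirements grow.
for fraction in (0.01, 0.1, 1.0):
    subset = sample_rows(all_rows, fraction)
    print(fraction, len(subset))
```

    Fixing the seed means the same rows come back every time, so results on the small sample stay comparable while you iterate.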

Exploring the data goals

    After successfully defining your data goals you will still be able to ask questions at various stages of your workflow.

    One such stage is Exploratory Data Analysis (EDA).

    Data visualization tools like Qlik or Tableau typically have capabilities to directly access several kinds of structured and unstructured data sources, so they can be applied on top of raw data and are extremely effective in identifying trends, anomalies, outliers in analyzed data with a productivity level not comparable to a classical tabular approach.

    You will find never-before-seen insights about the data while performing EDA, and these will be helpful later in the model building or inference process.

    For company-B, this could mean narrowing down our analysis to a specific group of users or TV shows. It may even lead us to change the data goals we decided on before.

With this we have come to the end… wait!

What about the

21 questions

The questions are not fixed; they vary from one problem to another. It is you who needs to figure them out.

But I told you I would help you… Here you go

Recommendation System

Business Goals

  1. What are the main objectives or goals that company-B wants to achieve?
  2. What are the expectations of company-B?
  3. What are the positive and negative impacts of solving the task?
  4. What are the potential and personal risks that may occur?
  5. Do the goals fall into the short term or the long term?
  6. What is the point of view or perspective of the subscribers?
  7. Do I have perfect domain knowledge? If not, who is the domain expert?

converting to data terms...

  1. What are the popular TV shows or movies streaming on both B and A?
  2. What does company-B lack compared to A?
  3. What are the similarities and differences between B and A?
  4. What are the peak hours of streaming on B? Do they coincide with A's?
  5. What are the average subscriber stats of company-B?
  6. Can we create some clusters based on subscriber stats?
  7. Is our data sufficient to solve the individual goals?
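
As an illustration of how one of these questions turns into code, take the peak-hours question. The sketch below assumes we have the start hour (0-23) of each stream on both platforms; the lists streams_b and streams_a are invented numbers, and peak_hours is a hypothetical helper.

```python
from collections import Counter

# Hypothetical stream start hours (0-23) for each platform; made-up data.
streams_b = [20, 21, 21, 22, 9, 21, 20, 13]
streams_a = [21, 21, 22, 22, 20, 8, 22, 21]


def peak_hours(start_hours, top_n=2):
    """Return the set of the top_n busiest hours by stream count."""
    counts = Counter(start_hours)
    return {hour for hour, _ in counts.most_common(top_n)}


peak_b = peak_hours(streams_b)
peak_a = peak_hours(streams_a)

# Hours that are busy on both platforms.
print(sorted(peak_b & peak_a))
```

With real data, ties at the cutoff would need a deliberate rule, since most_common simply keeps the order in which hours were first seen.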

Problem Statement: The lack of customer recommendations in company-B has an impact on its annual revenue, which affects the company's growth, so a good starting point would be comparing it with its competitor, company-A.

During EDA...

  1. Which age bins of subscribers need more attention?
  2. Who are the top-n users in each category?
  3. What are the important features?
  4. Which algorithm is best for training?
  5. What are some trends in the features?
  6. Is a feature correlated with another?
  7. Should null values be imputed or ignored?
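
A few of these EDA questions can be tried out on a toy table before touching real data. The sketch below assumes invented subscriber records with an age and weekly viewing hours. It bins ages, fills a missing value with the median (one possible choice, not the only one), and computes a Pearson correlation by hand. The records, bin edges and helper names are all hypothetical.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical subscriber records; None marks a missing value.
subscribers = [
    {"age": 19, "hours_per_week": 14.0},
    {"age": 24, "hours_per_week": 12.0},
    {"age": 31, "hours_per_week": None},
    {"age": 38, "hours_per_week": 8.0},
    {"age": 52, "hours_per_week": 5.0},
    {"age": 67, "hours_per_week": 4.0},
]


def age_bin(age):
    """Assign an age to one of three assumed bins."""
    if age < 25:
        return "under 25"
    if age < 45:
        return "25-44"
    return "45 and over"


def impute_median(values):
    """Replace None with the median of the known values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def pearson(xs, ys):
    """Pearson correlation; undefined if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


ages = [s["age"] for s in subscribers]
hours = impute_median([s["hours_per_week"] for s in subscribers])
bins = Counter(age_bin(a) for a in ages)
r = pearson(ages, hours)

print(dict(bins))
print(hours)
print(round(r, 2))
```

Whether to impute or drop a missing value depends on how much data is missing and why, so the median fill here is only a placeholder for that decision.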

 
Nevertheless, the questions may vary… but the one who asks them does not, and I say “you are the one”.