Cracking the Data Science Interview: Unlock insider tips from industry experts to master the data science field
Ebook, 1,037 pages, 6 hours


About this ebook

The data science job market is saturated with professionals of all backgrounds, including academics, researchers, bootcampers, and Massive Open Online Course (MOOC) graduates. This poses a challenge for companies seeking the best person to fill their roles. At the heart of this selection process is the data science interview, a crucial juncture that determines the best fit for both the candidate and the company.
Cracking the Data Science Interview provides expert guidance on approaching the interview process with full preparation and confidence. Starting with an introduction to the modern data science landscape, you’ll find tips on job hunting, resume writing, and creating a top-notch portfolio. You’ll then advance to topics such as Python, SQL databases, Git, and productivity with shell scripting and Bash. Building on this foundation, you'll delve into the fundamentals of statistics, laying the groundwork for pre-modeling concepts, machine learning, deep learning, and generative AI. The book concludes by offering insights into how best to prepare for the intensive data science interview.
By the end of this interview guide, you’ll have gained the confidence, business acumen, and technical skills required to distinguish yourself within this competitive landscape and land your next data science job.

Release date: Feb 29, 2024

    Book preview

    Cracking the Data Science Interview - Leondra R. Gonzalez


    Cracking the Data Science Interview

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Niranjan Naikwadi

    Publishing Product Manager: Nitin Nainani

    Senior Editor: Hayden Edwards

    Technical Editor: Simran Haresh Udasi

    Copy Editor: Safis Editing

    Project Coordinator: Aishwarya Mohan

    Proofreader: Safis Editing

    Indexer: Rekha Nair

    Production Designer: Prashant Ghare

    Marketing Coordinators: Vinishka Kalra

    First published: March 2024

    Production reference: 1160224

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square


    B3 1RB

    ISBN 978-1-80512-050-6


    The data science landscape is ever-evolving and has been that way since its conception. Though it is a rewarding field with many opportunities, navigating it can be a challenge, especially when you’re just getting started.

    During my career, I have found that various companies can interpret data science differently depending on their business needs or understanding of data science. When I first began my data science journey in 2015, I was employed as a health data analyst with a start-up. It was there that I was exposed to data science, as my role was not purely data analytics or data science, but a mixture somewhere in between. I wanted to continue learning and advancing, but I did not know where to focus my energy to gain the information needed to thrive in this field. So, I curated a list of lessons I needed to learn in order to be competent enough to enter and advance in the field. I learned Python, data science with Python, R programming, linear algebra, and calculus. As time went on, the task became more and more daunting, and the list of lessons grew even longer than what was required for a graduate degree. Unfortunately, even after all of my hard work, during interviews, I found there were still concepts that I was unaware of. This has been the issue that I, as well as others, have noted with this field – there is so much information, but it can be unclear where to begin and what information is necessary to know.

    On top of this, the data science interview is universally dreaded and challenging for various reasons that I have already alluded to. For instance, candidates are usually unsure of what that particular company considers data science. Plus, take-home assignments can take hours to complete – and once that time has been invested in completing the assignment, the company may choose to not offer feedback or, even worse, disappear completely when they’ve decided they aren’t interested. After experiencing this devastating outcome more than once, I became highly selective in what companies I chose to do a take-home assignment for. Many companies had a habit of immediately asking candidates to complete a take-home assignment before an interview, which I have learned rarely works in the candidate’s favor.

    This book will address and outline the concepts that are necessary to begin or progress in a data science role. Because this field is ever-evolving, our understanding of its concepts will continue to evolve as well. Even so, this book can be used as a reference by those who are experienced in the field, or by those who are in data science-adjacent roles and want to keep their knowledge current. It includes essential information to help candidates succeed in a data science interview and removes some of the guesswork about what companies expect.

    It is widely expected that data science candidates have an online portfolio to showcase their talent and application of knowledge. For this reason, there is information on how to build a portfolio and create a resume that will get you noticed. Salary and benefits negotiation is also outlined to streamline the process for you – a process that many of us had to learn with no guidance in the past is now shared for the benefit of others.

    We are certain that you will find this book helpful in your data science journey. Cheers!

    Angela Baltes, PhD

    Data Scientist, UnitedHealth Group


    About the authors

    Leondra R. Gonzalez is a senior data and applied scientist at Microsoft with a decade of experience in data science, analytics, and corporate strategy. In addition to her work as a data scientist, Leondra has led teams in the entertainment, media, and advertising space to produce advanced e-commerce models for top brands, including NBC Peacock, First Aid Beauty, Procter & Gamble, HBO Max, Toyota, Whirlpool, and Tubi.

    Academically, Leondra graduated from Carnegie Mellon University’s Heinz College of Information Systems Management with a master’s in entertainment industry management, with a focus on business analytics; Quantic School of Business and Technology with an MBA, including a specialization in statistics; and Otterbein University with a bachelor’s in music and business. Leondra is currently pursuing a PhD in information technology with a specialization in artificial intelligence at the University of the Cumberlands, and she has researched deep learning architectures as a PhD computer science apprentice at Google.

    To my loving husband, Chris, my parents, my sister, and my unborn son who kicked my bump every day while writing this book.

    Aaren Stubberfield is a senior data scientist for Microsoft’s digital advertising business and the author of three popular courses on DataCamp. He graduated with an MS in predictive analytics and has over 10 years of experience in various data science and analytical roles, focused on finding insights for business-related questions.

    With his experience, he has led numerous teams of data scientists and has been instrumental in the successful completion of many projects. Aaren’s technical skills include the use of AI (such as LLMs), Python, and various other tools necessary for the execution of data science projects.

    I want to thank the people who have been close to me and supported me, especially my wife, Pam, and my family.

    About the reviewer

    Vishal Kumar, a seasoned data scientist, has over seven years of experience with a premium credit card company, where he has made indelible contributions to the realms of AI and ML. He has a master’s degree in statistics from Delhi University.

    Throughout his career, he has garnered a plethora of accolades, stemming from his adeptness in constructing cutting-edge decision science tools that have steered various organizations’ success. His commitment to continuous learning is evidenced by his embrace of new technologies, such as generative AI, to stay at the forefront of the ever-evolving data science landscape.

    Beyond his professional pursuits, his creativity extends into his personal life, as he likes to paint and play ukulele.

    Table of Contents


    Part 1: Breaking into the Data Science Field


    Exploring Today’s Modern Data Science Landscape

    What is data science?

    Exploring the data science process

    Data collection

    Data exploration

    Data modeling

    Model evaluation

    Model deployment and monitoring

    Dissecting the flavors of data science

    Data engineer

    Dashboarding and visual specialist

    ML specialist

    Domain expert

    Reviewing career paths in data science

    The traditionalist

    Domain expert

    Off-the-beaten path-er

    Tackling the experience bottleneck

    Academic experience

    Work experience

    Understanding expected skills and competencies

    Hard (technical) skills

    Soft (communication) skills

    Exploring the evolution of data science

    New models

    New environments

    New computing

    New applications




    Finding a Job in Data Science

    Searching for your first data science job

    Preparing for the road ahead

    Finding job boards

    Beginning to build a standout portfolio

    Applying for jobs

    Constructing the Golden Resume

    The perfect resume myth

    Understanding automated resume screening

    Crafting an effective resume

    Formatting and organization

    Using the correct terminology

    Prepping for landing the interview

    Moore’s Law

    Research, research, research



    Part 2: Manipulating and Managing Data


    Programming with Python

    Using variables, data types, and data structures


    Indexing in Python

    Using string operations

    Initializing a string

    String indexing



    Using Python control statements, loops, and list comprehensions

    Conditional statements such as if, elif, and else

    Loop statements such as for and while

    List comprehension

    Using user-defined functions

    Breaking down the user-defined function syntax

    Doing stuff with user-defined functions

    Getting familiar with lambda functions

    Creating good functions


    Handling files in Python

    Opening files with pandas


    Wrangling data with pandas

    Handling missing data

    Selecting data

    Sorting data

    Merging data

    Aggregation with groupby()




    Visualizing Data and Data Storytelling

    Understanding data visualization

    Bar charts

    Line charts

    Scatter plots


    Density plots

    Quantile-quantile plots (Q-Q plots)

    Box plots

    Pie charts

    Surveying tools of the trade

    Power BI



    ggplot2 (R)

    Matplotlib (Python)

    Seaborn (Python)

    Developing dashboards, reports, and KPIs

    Developing charts and graphs

    Bar chart – Matplotlib

    Bar chart – Seaborn

    Scatter plot – Matplotlib

    Scatter plot – Seaborn

    Histogram plot – Matplotlib

    Histogram plot – Seaborn

    Applying scenario-based storytelling



    Querying Databases with SQL

    Introducing relational databases

    Mastering SQL basics

    The SELECT statement

    The WHERE clause

    The ORDER BY clause

    Aggregating data with GROUP BY and HAVING

    The GROUP BY statement

    The HAVING clause

    Creating fields with CASE WHEN

    Analyzing subqueries and CTEs

    Subqueries in the SELECT clause

    Subqueries in the FROM clause

    Subqueries in the WHERE clause

    Subqueries in the HAVING clause

    Distinguishing common table expressions (CTEs) from subqueries

    Merging tables with joins

    Inner joins

    Left and right join

    Full outer join

    Multi-table joins

    Calculating window functions


    LAG and LEAD



    Using date functions

    Approaching complex queries

    Process and answer



    Scripting with Shell and Bash Commands in Linux

    Introducing operating systems

    Navigating system directories

    Introducing basic command-line prompts

    Understanding directory types

    Filing and directory manipulation

    Scripting with Bash

    Introducing control statements

    Creating functions

    Processing data and pipelines

    Using pipes

    Using cron



    Using Git for Version Control

    Introducing repositories (repos)

    Creating a repo

    Cloning an existing remote repository

    Creating a local repository from scratch

    Linking local and remote repositories

    Detailing the Git workflow for data scientists

    Using Git tags for data science

    Understanding Git tags

    Using tagging as a data scientist

    Understanding common operations


    Part 3: Exploring Artificial Intelligence


    Mining Data with Probability and Statistics

    Describing data with descriptive statistics

    Measuring central tendency

    Measuring variability

    Introducing populations and samples

    Defining populations and samples

    Representing samples

    Reducing the sampling error

    Understanding the Central Limit Theorem (CLT)

    The CLT

    Demonstrating the assumption of normality

    Shaping data with sampling distributions

    Probability distributions

    Uniform distribution

    Normal and Student’s t-distributions

    The binomial distribution

    The Poisson distribution

    Exponential distribution

    Geometric distribution

    The Weibull distribution

    Testing hypotheses

    Understanding one-sample t-tests

    Understanding two-sample t-tests

    Understanding paired sample t-tests

    Understanding ANOVA and MANOVA

    Chi-squared test

    A/B tests

    Understanding Type I and Type II errors

    Type I error (false positive)

    Type II error (false negative)

    Striking a balance




    Understanding Feature Engineering and Preparing Data for Modeling

    Understanding feature engineering

    Avoiding data leakage

    Handling missing data

    Scaling data

    Applying data transformations

    Introducing data transformations

    Logarithm transformations

    Power transformations

    Box-Cox transformations

    Exponential transformations

    Engineering categorical data and other features

    One-hot encoding

    Label encoding

    Target encoding

    Calculated fields

    Performing feature selection

    Types of feature selection

    Recursive feature elimination

    L1 regularization

    Tree-based feature selection

    The variance inflation factor

    Working with imbalanced data

    Understanding imbalanced data

    Treating imbalanced data

    Reducing the dimensionality

    Principal component analysis

    Singular value decomposition





    Mastering Machine Learning Concepts

    Introducing the machine learning workflow

    Problem statement

    Model selection

    Model tuning

    Model predictions

    Getting started with supervised machine learning

    Regression versus classification

    Linear regression – regression

    Logistic regression

    k-nearest neighbors (k-NN)

    Random forest

    Extreme Gradient Boosting (XGBoost)

    Getting started with unsupervised machine learning


    Density-based spatial clustering of applications with noise (DBSCAN)

    Other clustering algorithms

    Evaluating clusters

    Summarizing other notable machine learning models

    Understanding the bias-variance trade-off

    Tuning with hyperparameters

    Grid search

    Random search

    Bayesian optimization



    Building Networks with Deep Learning

    Introducing neural networks and deep learning

    Weighing in on weights and biases

    Introduction to weights

    Introduction to biases

    Activating neurons with activation functions

    Common activation functions

    Choosing the right activation function

    Unraveling backpropagation

    Gradient descent

    What is backpropagation?

    Loss functions

    Gradient descent steps

    The vanishing gradient problem

    Using optimizers

    Optimization algorithms

    Network tuning

    Understanding embeddings

    Word embeddings

    Training embeddings

    Listing common network architectures

    Common networks

    Tools and packages

    Introducing GenAI and LLMs

    Unveiling language models

    Transformers and self-attention

    Transfer Learning

    GPT in action



    Implementing Machine Learning Solutions with MLOps

    Introducing MLOps

    A model pipeline overview

    Understanding data ingestion

    Learning the basics of data storage

    Reviewing model development

    Packaging for model deployment

    Identifying requirements

    Virtual environments

    Tools and approaches for environment management

    Deploying a model with containers

    Using Docker

    Validating and monitoring the model

    Validating the model deployment

    Model monitoring

    Thinking about governance

    Using Azure ML for MLOps


    Part 4: Getting the Job


    Mastering the Interview Rounds

    Mastering early interactions with the recruiter

    Mastering the different interview stages

    The hiring manager stage

    The technical interview

    Coding questions, step by step

    The panel stage




    Negotiating Compensation

    Understanding the compensation landscape

    Negotiating the offer

    Negotiation considerations

    Responding to the offer

    Maximum negotiable compensation and situational value


    Final words


    Other Books You May Enjoy


    In today’s dynamic technological landscape, the demand for skilled professionals in artificial intelligence (AI) and data science roles has surged, and the data science job market is increasingly saturated by various levels of data science and AI employees. This book is a comprehensive guide, crafted to equip both aspiring and seasoned individuals with the essential tools and knowledge required to navigate the intricacies of data science interviews. Whether you’re stepping into the AI realm for the first time or aiming to elevate your expertise, this book offers a holistic approach to mastering the fundamental and cutting-edge facets of the field.

    The chapters within this book span a wide spectrum of critical subjects, from programming with Python and SQL to statistical analysis, pre-modeling and data cleaning concepts, machine learning (ML), deep learning, Large Language Models (LLMs), and generative AI. We aim to provide a comprehensive review and update on the foundational concepts while also delving into the latest advancements. In an era marked by the disruptive potential of language models and generative AI, it’s imperative to continually hone your skills. This book serves as a compass, guiding you through the intricacies of these transformative technologies, ensuring you’re poised to tackle the challenges and harness the opportunities they present.

    Moreover, beyond technical prowess, we delve into the art of interviewing for AI roles, offering guidance on how to ace interviews and negotiate compensation effectively. Additionally, crafting a standout résumé tailored for data science roles is a crucial step, and our guide offers insights into writing compelling résumés that capture attention in a competitive job market. As AI reshapes industries and innovation accelerates, now is the ideal time to embark on or advance in your data science journey. We invite you to dive into this comprehensive resource and embark on your path to mastering the dynamic world of data science and AI.

    Who this book is for

    If you are a seasoned or young professional who needs to brush up on your technical skills, or you are looking to break into the exciting world of the data science industry, then this book is for you.

    What this book covers

    In Chapter 1, Exploring Today’s Modern Data Science Landscape, we begin our journey with a brief but valuable overview of the contemporary landscape of data science and AI.

    In Chapter 2, Finding a Job in Data Science, we will introduce data science roles and their various categories.

    In Chapter 3, Programming with Python, you will familiarize yourself with the most common and useful tasks and operations in the Python language.

    In Chapter 4, Visualizing Data and Data Storytelling, you will learn techniques for telling engaging data stories.

    In Chapter 5, Querying Databases with SQL, you will dive into the world of databases, understanding their design and how to query them to acquire data.

    In Chapter 6, Scripting with Shell and Bash Commands in Linux, you will boost your operating system skills with the power of Bash and shell commands, enabling you to interface with multiple technologies either locally or in the cloud.

    In Chapter 7, Using Git for Version Control, we explore the most useful commands in Git for project collaboration and reproducibility.

    In Chapter 8, Mining Data with Probability and Statistics, you will understand some of the most relevant topics in probability and statistics that serve as the foundation for many ML models and assumptions.

    In Chapter 9, Understanding Feature Engineering and Preparing Data for Modeling, you will use your understanding of descriptive statistics to create clean, machine-legible datasets.

    In Chapter 10, Mastering Machine Learning Concepts, you will learn about the most used ML algorithms, their assumptions, how they work, and how to best evaluate their performance.

    In Chapter 11, Building Networks with Deep Learning, we take a step further into building and evaluating neural networks in various applications while also touching base on the latest advancements in AI.

    In Chapter 12, Implementing Machine Learning Solutions with MLOps, we will review the data science process, tools, and strategies to effectively design and implement an end-to-end ML solution.

    In Chapter 13, Mastering the Interview Rounds, you will learn the best techniques to successfully navigate technical and non-technical factors at every stage of the interview process.

    In Chapter 14, Negotiating Compensation, you will learn to optimize your earning potential.

    To get the most out of this book

    To get the most out of this book, you should have a basic knowledge of Python, SQL, and statistics. However, you will also benefit from this book if you have familiarity with other analytical languages, such as R. By brushing up on critical data science concepts such as SQL, Git, statistics, and deep learning, you’ll be well-equipped to crack through the interview process.

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: The split() method can be used to split s into individual words: words = s.split().

    A block of code is set as follows:

    x = 5

    print(type(x)) # <class 'int'>

    Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: The increased computing power and the development of advanced algorithms, especially in machine learning (ML) and deep learning (DL), have made it possible to efficiently process and analyze massive amounts of data.

    Tips or important notes

    Appear like this.

    Special Note

    The prevalence of accessible AI technology has exploded over the past few months, particularly over the course of writing this book. We encourage our readers to utilize AI during their educational journey, leveraging tools such as ChatGPT to test their newly acquired skills. Long gone are the days when you browsed Stack Overflow for hours to answer a specific question. Now, the power of asking for help is right at your fingertips.

    Even we, the authors of this book, leveraged generative AI to aid in minor editorial tasks and creating code examples. However, rest assured that humans wrote the content and laid out what is covered in the book! In this new era, we just wanted to make our readers aware of how we used the tool.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit


    Part 1: Breaking into the Data Science Field

    In the first part of this book, you will learn about the data science profession as it exists in the modern day, and how this relates to your endeavors in the field. This will serve as an introduction to various career paths and help to set expectations in terms of the skills and competencies required to be successful.

    This part includes the following chapters:

    Chapter 1, Exploring Today’s Modern Data Science Landscape

    Chapter 2, Finding a Job in Data Science


    Exploring Today’s Modern Data Science Landscape

    If you’ve picked up this book, chances are that you’ve already heard of data science. It’s arguably one of the fastest-growing, most discussed professions within the tech and STEM space, all while maintaining its relative edge and mystique. That is, many people have heard of data scientists, but very few know what they do, how a data scientist produces value, or how to break into the field from scratch.

    In this chapter, we will ground the definition of data science in a practical description. Then, we will discuss what most data science jobs entail, while spending some time describing the distinction between different flavors of data science. We’ll then dive into the various paths into data science and what makes it so challenging to land your first job. We’ll finish the chapter with an overview of the non-negotiable competencies expected of data scientists.

    By the end of this chapter, you will have a firm understanding of the job of a modern data scientist, which path into the field best fits your journey, the barriers to expect along the way, and which skills you should master.

    In this chapter, we will cover the following topics:

    What is data science?

    Exploring the data science process

    Dissecting the flavors of data science

    Reviewing career paths in data science

    Tackling the experience bottleneck

    Understanding expected skills and competencies

    Exploring the evolution of data science

    What is data science?

    To begin, let’s offer a definition of data science. According to Wikipedia, data science "is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms, and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data"[1]. It encompasses various techniques, procedures, and tools to process, analyze, and visualize data, enabling businesses and organizations to make data-driven decisions and predictions. The primary goal of data science is to identify patterns, relationships, and trends within data to support decision-making and create actionable insights.

    You are not alone in your interest in data science – the Harvard Business Review called it one of the sexiest jobs of the 21st century [2], and stories of data scientists earning six-figure salaries are not uncommon. Data scientists are often looked at as oracles within an organization, answering complex business questions such as "If we increase our offering to this group of customers, can we increase our revenues?" or "What are the common causes of customer churn?"

    Within organizations, the demand for the skills of data scientists has continued to grow. In 2022, the U.S. Bureau of Labor Statistics estimated that the number of jobs for data scientists would increase by roughly 36% over the following 10 years [3]. This growth in the demand for data scientists is being fueled by several factors, which are shown here:

    Figure 1.1: Reasons for the increased demand for data scientists

    The first is the proliferation of data. The exponential growth of data generated by digital devices, social media, and various other sources has made it essential for organizations to harness this data for decision-making and innovation. This data growth is expected to continue in the future, with the International Data Corporation (IDC) expecting that by 2025, we will generate 175 zettabytes of data annually [4]. That is a staggering amount of data!

    Organizations want to take advantage of this explosion in data availability to generate insights for decision-making. As the world becomes more interconnected and complex, the need for evidence-based decision-making has grown, leading to an increased demand for skilled data scientists who can transform data into actionable insights. Organizations and businesses increasingly rely on data-driven insights to gain a competitive edge in the market, optimize operations, and improve customer experiences.

    Finally, transforming data into insights would not be possible without advances in computational power and in tools and platforms. Increased computing power and the development of advanced algorithms, especially in machine learning (ML) and deep learning (DL), have made it possible to process and analyze massive amounts of data efficiently. In addition, the development of open source tools, libraries, and platforms has made data science more accessible to a broader audience, fostering the growth of the profession.

    Hence, data science is still an evolving field that is expected to grow in parallel with computational and technological advancements (such as generative AI). Furthermore, as companies continue to embrace the digital age, with an increased interest in maximizing the utility of their data and capitalizing on its underlying insights for a competitive advantage, the demand for data scientists will also expand.

    However, although data science is often regarded and described as a monolithic function, you’ll soon learn that it’s a multi-faceted discipline that often varies by team, department, or even company. Naturally, the data scientist job profile is also an ever-evolving description, but we will cover all our bases for the most common tasks.

    Exploring the data science process

    Performing data science work is often an iterative process, where the data scientist needs to return to earlier steps if they run into challenges. There are many ways to categorize the data science process, but it often includes:

    Data collection

    Data exploration

    Data modeling

    Model evaluation

    Model deployment and monitoring

    Let’s briefly touch on each step and discuss what’s expected of the data scientist during them.

    Data collection

    Data collection and preprocessing involves gathering data from various sources (such as databases, APIs, and web scraping), then cleaning and transforming the data to prepare it for analysis. This step involves dealing with missing, inconsistent, or noisy data and converting it into a structured format. Depending on the organization, a team of data engineers may support this step of the data science process; however, it is common for the data scientist to manage this process as well. This requires them to have intimate knowledge of the data sources and the ability to write Structured Query Language (SQL) queries, other code that queries databases, or custom tools such as web scrapers to gather the needed data.
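
    As a rough illustration of this step, the following sketch uses Python's built-in sqlite3 module to query a small in-memory database and then drop rows with missing values. The customers table, its columns, and the sample rows are hypothetical and exist only for this example; a real project would connect to the organization's own data sources.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT, monthly_spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, " North", 120.0), (2, "south", None), (3, "north", 80.0), (4, None, 50.0)],
)

# Collect: query the rows needed for the analysis.
rows = conn.execute(
    "SELECT id, region, monthly_spend FROM customers ORDER BY id"
).fetchall()
conn.close()

# Clean: drop rows with missing values and normalize the inconsistent text field.
clean_rows = [
    {"id": r[0], "region": r[1].strip().lower(), "monthly_spend": r[2]}
    for r in rows
    if r[1] is not None and r[2] is not None
]
print(clean_rows)
```

    Dropping incomplete rows is only one possible way to handle missing data; depending on the problem, imputing values may be more appropriate.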

    Data exploration

    Data exploration involves conducting exploratory data analysis (EDA) to better understand the data, detect anomalies, and identify relationships between variables. The key to this step is to look for correlations and understand the distribution of the data. This involves using descriptive statistics and visualization techniques to summarize the data and gain insights; therefore, the data scientist should be able to use summary statistics, program descriptive visualizations, or utilize reporting tools such as Power BI or Tableau to create robust charts.
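
    To make this concrete, here is a minimal sketch of descriptive statistics and a correlation check using only the Python standard library. The ad_spend and sales values are made-up toy data, and the pearson helper is a hypothetical function written for this example; in practice, a library such as pandas is commonly used for this kind of work.

```python
import statistics

# Hypothetical toy data: advertising spend and resulting sales.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 41.0, 62.0, 79.0, 103.0]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Summary statistics describe the distribution of a single variable.
summary = {
    "mean": statistics.fmean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),
}

# The correlation coefficient describes the linear relationship between two variables.
r = pearson(ad_spend, sales)
print(summary, round(r, 3))
```

    A coefficient close to 1 suggests a strong positive linear relationship, which would usually be confirmed visually with a scatter plot.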

    Data modeling

    Using what was learned in the data exploration step, data modeling is the step in which the data scientist builds predictive or descriptive models using ML and statistical techniques that identify patterns and relationships in the data. Here, the data scientist selects the appropriate algorithms, trains the models on historical data, and validates their performance.
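
    The train-then-validate pattern can be sketched with a very simple model. The example below fits a straight line by ordinary least squares on part of a made-up dataset and checks its errors on held-out points. The data and the fit_line helper are hypothetical; real projects typically rely on an ML library and more careful validation schemes such as cross-validation.

```python
# Hypothetical historical data: (feature, target) pairs.
data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 9.0), (5.0, 10.8), (6.0, 13.1)]

# Hold out the last observations for validation; train on the rest.
train, valid = data[:4], data[4:]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Train on historical data, then validate on data the model has not seen.
slope, intercept = fit_line(train)
valid_errors = [abs(y - (slope * x + intercept)) for x, y in valid]
print(slope, intercept, valid_errors)
```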

    Model evaluation

    Model evaluation and optimization involves assessing the performance of models using metrics such as accuracy, root mean squared error (RMSE), precision, recall, area under the curve (AUC), or F1 scores. Based on these evaluations, data scientists may refine the models or try alternative algorithms to improve their performance. Understanding the underlying reasons behind a model’s predictions is crucial for building trust in its results and ensuring that it aligns with domain knowledge. Therefore, the data scientist must be sure the model solves the organizational or business goal. Here, the data scientist needs to be able to communicate their findings to both technical and non-technical audiences.
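
    Several of these metrics are simple to compute by hand, which helps when explaining them in an interview. The sketch below implements accuracy, precision, recall, F1, and RMSE from their standard definitions. The function names and the sample labels and predictions are hypothetical and chosen only for illustration.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def rmse(y_true, y_pred):
    """Root mean squared error for a regression model."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels and predictions from a binary classifier.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(labels, preds)

# Hypothetical actual and predicted values from a regression model.
error = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(metrics, error)
```

    Which metric matters most depends on the business goal; for example, recall is often prioritized when missing a positive case is costly.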

    Model deployment and monitoring

    Model deployment and monitoring involves implementing the models in real-world applications, monitoring their performance, and maintaining them to ensure their continued accuracy and relevance. For example, the data scientist might work with a data engineering team or use tools such as containers to implement the model. Once deployed, the data scientist may also need to develop dashboards to monitor the model’s performance over time and flag stakeholders if it goes outside the expected performance range.
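
    The monitoring idea can be sketched as a simple range check on a tracked metric. In the example below, the weekly accuracy values, the thresholds, and the out_of_range_periods helper are all hypothetical; a production setup would typically pull these values from logs or a metrics store and route alerts to a dashboard or notification system.

```python
def out_of_range_periods(metric_by_period, lower, upper):
    """Return the periods whose metric falls outside the expected range."""
    return [
        period
        for period, value in metric_by_period.items()
        if not lower <= value <= upper
    ]


# Hypothetical weekly accuracy of a deployed model.
weekly_accuracy = {"week_1": 0.91, "week_2": 0.90, "week_3": 0.84, "week_4": 0.79}

# Flag any week where accuracy drops below the agreed threshold.
alerts = out_of_range_periods(weekly_accuracy, lower=0.85, upper=1.0)
print(alerts)
```

    A sustained drop like the one in this toy data is a common signal that the model may need to be retrained on more recent data.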

    As you can see, data science is a profession that incorporates many data-related tasks – particularly those that involve the acquisition, prepping, and delivery of data in one format or another. While data modeling makes up most of the glitz and glamour associated with the job, it is really everything else that takes up roughly 80% of the gig. This does not include non-data-related tasks, such as interfacing with stakeholders, gathering requirements, debugging software, checking emails, and research. However, those tasks are not necessarily unique to data scientists.

    Now that you understand the common tasks associated with the job, let’s explore the different types or flavors of data science.

    Dissecting the flavors of data science

    Now that we have defined some of the critical aspects of the role of a data scientist, it is clear that the role often covers many different skills. Data scientists are frequently asked to perform a variety of data-related tasks, including designing database tables to collect data, programming ML algorithms, understanding statistics, and creating stunning visuals to help explain interesting findings to others, but it is difficult for any single person to master all of these skill areas.

    Therefore, we often see data scientists who are particularly skilled in one or two areas and have basic competencies in the others. Their talents could be considered T-shaped: they are proficient across many areas, like the horizontal line of a T, while they have deep knowledge and expertise in a few areas, like the vertical portion of the letter:

    Figure 1.2: Example of the ‘T of Competencies’

    While this figure shows someone who is adequate in data engineering and visualization principles but exceptional in ML, you can expect to see every possible combination of skills among data scientists. These competencies are often aligned with a person’s unique experiences or interests. Perhaps they were a statistics major and took a liking to ML, or perhaps they’re a former business intelligence (BI) engineer with considerable experience in data extraction, transformation, and loading (ETL), allowing them to grasp data engineering concepts much faster.

    Whatever the reason, it’s natural for someone to grasp some concepts better than others. This is important to remember as you navigate this book. While you are not expected to specialize in every facet of data science, you are expected to master the fundamentals. However, you will almost certainly discover your T of Competencies – a trinity of top skill sets that will solidify your identity in the data science space.

    While there are countless combinations of skill proficiencies, let’s review some of the most common that you will encounter:

    The data engineer

    The dashboarding and visual specialist

    The ML specialist

    The domain expert

    Let’s take a look at these now.

    Data engineer

    As we discussed earlier, data engineering is a crucial aspect of the data science process that involves data collection, storage, processing, and management. It focuses on designing, developing, and maintaining scalable data infrastructure, ensuring the availability of high-quality data for analysis and modeling. Data engineers are best known for their oversight of the ETL process in data pipelines. On some data science teams, especially within smaller organizations, the data engineering responsibilities sit within the team itself. Therefore, a data scientist specializing in this area can support team projects with data collection and storage while understanding the needs of the ML process, such as structuring the data so that it can be fed efficiently to a DL algorithm.

    Data engineers have a wealth of tools to choose from. No single data engineer is expected to know all of these technologies, especially at the same level of competency. In fact, the more senior the engineer, the more competent they are in their tools of choice. Furthermore, this is not a comprehensive list. However, you can expect to see the following on
