The Right Way to Get Started with Data Science
When I was introduced to data science almost 7–8 years ago, there wasn't an abundance of information available online. It was easier to read most of the articles and make sense of them. We knew what came first and how one should proceed in this field. Fast forward to today, and there is so much information available online that it often confuses those who want to start their career in data science.
To make things simpler, I’ve decided to provide a detailed roadmap to help you get started with data science.
This roadmap reflects exactly how my industry peers and I would have started our careers in this field. In the following blogs, I'll cover some of these topics in detail.
Basics: Although some of the topics mentioned below might not be immediately useful in day-to-day data science tasks, it’s essential to start with a comprehensive list. Linear algebra will be particularly useful when you dive into deep learning, statistics and probability are used constantly, and calculus will help you understand machine learning and deep learning algorithms in depth.
A. Mathematics
- Linear Algebra: Scalars, vectors, matrices, tensors, matrix operations (multiplication, inversion, transposition), eigenvalues, eigenvectors.
- Statistics & Probability: Descriptive statistics (mean, median, mode, variance, standard deviation, percentiles & quartiles), probability (conditional probability, Bayes' theorem), distributions (normal, Bernoulli, binomial, Poisson, uniform), Central Limit Theorem (CLT), Law of Large Numbers, hypothesis testing (z-test, t-test, chi-square test, ANOVA), correlation vs. causation.
- Calculus: Differentiation (gradients, partial derivatives), integration (area under curves, definite/indefinite integrals), gradient descent.
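To make two of these ideas concrete, here's a minimal NumPy sketch (assuming only that NumPy is installed): an eigendecomposition, and a few steps of gradient descent on a simple quadratic.

```python
import numpy as np

# Eigenvalues and eigenvectors of a small symmetric matrix
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("eigenvalues:", eigenvalues)

# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x, learning_rate = 0.0, 0.1
for _ in range(50):
    x -= learning_rate * 2 * (x - 3)
print("minimum found near x =", round(x, 4))  # converges towards 3
```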
Types of Analytics: Before diving into the roadmap, it’s crucial to understand the different types of analytics that form the backbone of data science. Each type addresses a unique question and provides valuable insights.
A. Descriptive Analytics: What happened? This involves summarizing past data to identify trends or patterns. Common examples include dashboards, reports, and data visualizations.
B. Diagnostic Analytics: Why did it happen? This type goes a step further by identifying the causes of trends and anomalies. Techniques like correlation analysis and root cause analysis fall under this category.
C. Predictive Analytics: What will happen? By using historical data and machine learning models, predictive analytics forecasts future trends or outcomes. Examples include churn prediction and demand forecasting.
D. Prescriptive Analytics: What should we do? The most advanced type, prescriptive analytics suggests actions based on predictive insights, often leveraging optimization and simulation techniques.
Data Handling: Data handling is where most of your work begins. You’ll spend around 80% of your time analysing and cleaning data, as well as creating features that make the data suitable for modeling. Many people make the mistake of jumping straight into model building, but this step is essential to ensure that the data is ready.
A. Data Cleaning
- Handling missing values (imputation, dropping).
- Removing duplicates and outliers.
- Data normalization, standardization, and transformation.
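As an illustration, here's a minimal pandas sketch of these cleaning steps on a toy table (the age and income columns are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 40, 120],
    "income": [30000, 45000, None, None, 52000],
})

# Impute missing values with the column median
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Remove exact duplicate rows
df = df.drop_duplicates()

# Drop outliers with a simple z-score rule, then standardize a column
z = (df["age"] - df["age"].mean()) / df["age"].std()
df = df[z.abs() < 3]
df["age_standardized"] = (df["age"] - df["age"].mean()) / df["age"].std()

print(df)
```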
B. Feature Engineering
- Scaling and Encoding: Scaling (Min-Max scaling, standardization); encoding (one-hot encoding, label encoding, target encoding).
- Feature selection: Correlation, variance threshold, recursive feature elimination.
- Feature extraction: PCA, t-SNE.
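Here's a small scikit-learn sketch of scaling, one-hot encoding, and PCA-based feature extraction; the toy values and the city column are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.decomposition import PCA

# Toy numeric features: 4 samples, 3 features
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 180.0, 0.7],
              [3.0, 240.0, 0.2],
              [4.0, 210.0, 0.9]])

# Scaling: squash every feature into [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)

# One-hot encoding of a categorical column
cities = np.array([["London"], ["Paris"], ["London"], ["Tokyo"]])
city_onehot = OneHotEncoder().fit_transform(cities).toarray()

# Feature extraction: project the scaled features onto 2 principal components
X_pca = PCA(n_components=2).fit_transform(X_scaled)
print(X_pca.shape)  # (4, 2)
```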
Statistical Modeling: Statistical modeling is the foundation of data science. I remember that my first model was either linear regression or logistic regression. These models may seem basic now, but understanding them thoroughly will help you grasp more advanced concepts later, such as deep learning. That's why statistical modeling is often called the backbone of data science.
A. Supervised Learning
- Regression: Linear regression, Ridge regression, Lasso regression, Elastic Net Regression, Quantile Regression, Polynomial Regression.
- Classification: Logistic regression, k-Nearest Neighbours (kNN), Naive Bayes (Gaussian, Multinomial), Support Vector Machines (SVM).
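For instance, fitting a logistic regression classifier takes only a few lines in scikit-learn; this sketch uses the library's built-in breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Built-in binary classification dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000)  # extra iterations so the solver converges
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```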
B. Unsupervised Learning
- Clustering: k-Means, Hierarchical clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), Gaussian Mixture Models (GMMs).
- Dimensionality Reduction: Principal Component Analysis (PCA), Independent Component Analysis (ICA).
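And a quick k-Means sketch on synthetic data, with PCA reducing five features to two components (both from scikit-learn):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data: 300 samples, 5 features, 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)

# Cluster, then reduce dimensionality for plotting or inspection
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)

print(labels[:10], X_2d.shape)  # cluster ids and (300, 2)
```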
C. Time Series Forecasting
- Stationarity tests: Augmented Dickey-Fuller (ADF) Test, Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test.
- Seasonality and Trend Analysis: STL (Seasonal and Trend decomposition using Loess), Autocorrelation Function (ACF), Partial Autocorrelation Function (PACF).
- Residual Autocorrelation: Durbin-Watson Test.
- Tests for residual normality: Shapiro-Wilk Test, Kolmogorov-Smirnov Test, or Anderson-Darling Test.
- Multicollinearity Test: Variance Inflation Factor (VIF).
- Feature Scaling: Min-Max scaling, Standardization.
- Models: ARIMA (Auto-Regressive Integrated Moving Average), SARIMA (Seasonal ARIMA), ARIMAX (ARIMA with exogenous variables), SARIMAX (SARIMA with exogenous variables), Simple Exponential Smoothing, Holt-Winters Method (Additive and Multiplicative).
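Putting a few of these pieces together, here's a minimal statsmodels sketch (assuming statsmodels is installed) that checks stationarity with an ADF test and fits a simple ARIMA on a synthetic monthly series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Synthetic monthly series: linear trend plus noise
rng = np.random.default_rng(42)
index = pd.date_range("2020-01-01", periods=60, freq="MS")
series = pd.Series(np.linspace(100, 160, 60) + rng.normal(0, 2, 60), index=index)

# Stationarity check: a p-value above 0.05 suggests differencing is needed
adf_stat, p_value = adfuller(series)[:2]
print(f"ADF p-value: {p_value:.3f}")

# Fit ARIMA(1, 1, 1) (the middle 1 is one round of differencing) and forecast
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))
```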
Machine Learning: As machine learning evolved, these models began gaining popularity due to their high accuracy, though they were less interpretable than statistical models. Over time, they became more explainable and started to dominate, as they required less effort in data cleaning and produced better results on complex data.
A. Advanced Supervised Learning
- Decision Trees.
- Ensemble Methods: Bagging, Boosting (AdaBoost), Random Forest, Gradient Boosting (XGBoost, LightGBM, CatBoost).
- Probabilistic Models: Hidden Markov Models (HMMs), Bayesian Networks.
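To see bagging and boosting side by side, here's a short scikit-learn sketch comparing a random forest with gradient boosting on the same dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging-style ensemble: many decorrelated trees, predictions averaged
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Boosting: trees added sequentially, each one correcting the previous errors
boost = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

print("random forest:", forest.score(X_test, y_test))
print("gradient boosting:", boost.score(X_test, y_test))
```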
B. Advanced Unsupervised Learning
- Advanced Clustering: Gaussian Mixture Models (GMM), k-Medoids.
- Recommendation Systems: Collaborative Filtering, Content-based Filtering, Hybrid Systems.
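And a toy user-based collaborative-filtering sketch in plain NumPy; the rating matrix is invented for the example:

```python
import numpy as np

# Toy user-item rating matrix (0 = not rated); rows are users, columns are items
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# Cosine similarity between users
norms = np.linalg.norm(R, axis=1, keepdims=True)
similarity = (R @ R.T) / (norms @ norms.T)

# Predict user 0's ratings as a similarity-weighted average of the other users
weights = similarity[0, 1:]
predicted = weights @ R[1:] / weights.sum()
print(np.round(predicted, 2))
```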
Deep Learning: As technology advanced, deep learning emerged as a powerful tool for handling unstructured data such as images, text, and audio. While machine learning models are still preferred for structured data due to their interpretability, deep learning shines in complex tasks that require high-dimensional data.
A. Foundations
- Neural Networks: Perceptron, Multi-Layer Perceptron (MLP).
- Activation Functions: Sigmoid, ReLU, Tanh, Softmax, Leaky ReLU.
- Loss Functions: Mean Squared Error (MSE), Binary Cross-Entropy, Categorical Cross-Entropy.
- Optimisation Algorithms: Gradient Descent, Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent.
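As a concrete example, here's a minimal PyTorch sketch (assuming PyTorch is installed) that wires these foundations together: an MLP with ReLU activations, binary cross-entropy loss, and a plain SGD training loop on random toy data:

```python
import torch
import torch.nn as nn

# Multi-layer perceptron: two hidden layers with ReLU activations
model = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()  # binary cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Random toy data: 64 samples, 4 features, binary labels
X = torch.randn(64, 4)
y = torch.randint(0, 2, (64, 1)).float()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)  # forward pass
    loss.backward()              # back-propagation
    optimizer.step()             # one gradient-descent update
print("final loss:", loss.item())
```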
B. Convolutional Neural Networks (CNNs)
- Concepts: Filters/Kernels, Convolution, pooling layers, Padding.
- Architectures: LeNet, AlexNet, VGG, ResNet, Inception, DenseNet, EfficientNet.
C. Recurrent Neural Networks (RNNs)
- Concepts: Sequence modeling, Back-propagation Through Time (BPTT), Vanishing/Exploding Gradient Problem.
- Variants: Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), Bidirectional RNNs.
D. Transformers
- Attention mechanism
- Architectures: Transformer, BERT, GPT, Transformer-XL.
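The attention mechanism at the heart of these architectures is compact enough to write out directly. Here's a minimal PyTorch sketch of scaled dot-product self-attention:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V

# One sequence of 5 tokens with 8-dimensional embeddings
Q = K = V = torch.randn(5, 8)  # self-attention: Q, K, V come from the same input
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([5, 8])
```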
E. Generative Models
- Autoencoders: Variational Autoencoders (VAEs).
- Generative Adversarial Networks (GANs): DCGAN, CycleGAN.
F. Reinforcement Learning
- Basics: Markov Decision Processes (MDPs).
- Algorithms: Q-Learning, Deep Q-Learning.
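A tabular Q-learning sketch on a tiny 5-state chain (an environment invented for the example) shows the core update rule:

```python
import numpy as np

# Tiny chain environment: start at state 0, reward for reaching state 4
n_states, n_actions = 5, 2               # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy action selection
        action = rng.integers(2) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q.round(2))  # the 'right' action should dominate in every state
```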
Natural Language Processing (NLP): NLP is an exciting field within data science, especially given the recent advancements with transformers and pre-trained language models.
A. Basics
- Text preprocessing: Tokenization, Stemming, Lemmatization, Stopword Removal, Text Normalization, POS (Part of Speech) Tagging.
- Word embeddings: Word2Vec, GloVe, FastText.
- Vectorization Techniques: TF-IDF, Count Vectorization (Bag of Words).
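Here's a small scikit-learn sketch of bag-of-words and TF-IDF vectorization on a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Data science is fun",
    "Machine learning is part of data science",
    "Deep learning extends machine learning",
]

# Bag-of-words: raw token counts per document
bow = CountVectorizer(stop_words="english")
counts = bow.fit_transform(corpus)
print(bow.get_feature_names_out())

# TF-IDF: down-weights words that appear in nearly every document
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
print(tfidf.shape)  # (3 documents, vocabulary size)
```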
B. Core Tasks
- Text Classification: Sentiment analysis, Named Entity Recognition (NER), Spam Detection.
- Text Generation: Language Modeling, Text Summarization (Extractive and Abstractive).
- Topic Modeling: Latent Dirichlet Allocation (LDA).
C. Advanced NLP
- Sequence-to-Sequence Models: Encoder-decoder architectures built from RNNs, LSTMs, GRUs.
- Transformers: Self-Attention Mechanism, BERT, GPT, Transformer-XL, T5.
- Pre-Trained Language Models: GPT-2/GPT-3, BERT, RoBERTa, DistilBERT, XLNet, ERNIE.
- Attention Mechanisms: Scaled Dot-Product Attention, Multi-Head Attention.
- Transfer Learning in NLP: Fine-Tuning Pre-trained Models.
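As a taste of how far pre-trained models take you, here's a minimal sketch using the Hugging Face transformers library (assuming it's installed; the call downloads a default pre-trained sentiment model):

```python
from transformers import pipeline

# A model already fine-tuned for sentiment analysis, used out of the box
classifier = pipeline("sentiment-analysis")
print(classifier("This roadmap made getting started much easier!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```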
This roadmap ensures a strong foundation before diving into more advanced topics. In addition to mastering these concepts, one should also become proficient in programming languages like Python, R, and SQL. Further, understanding how to deploy your models will be crucial; I'll cover that in later blogs.
If you found this article helpful, I’d love to hear your thoughts! Feel free to leave a comment below with your feedback or any questions you have. Don’t forget to share this article with others who might find it useful. Your support means a lot — thank you for reading!