Data Science
Description
Data science is a field geared toward extracting meaningful information from large amounts of complex data. Data science, or data-driven science, combines statistics and computation to interpret data for the purpose of decision making.
Data Science Course Content
Introduction to Data Science
- a. What is data science?
- How is data science different from BI and reporting?
- b. Who are data scientists?
- What skill sets are required?
- c. What do they do?
- What kinds of projects do they work on?
Business statistics
- a. Data types
- Continuous variables
- Ordinal Variables
- Categorical variables
- Time Series
- Miscellaneous
- b. Descriptive statistics
- c. Sampling
- Need for sampling
- Different types of Sampling
- Simple random sampling
- Systematic sampling
- Stratified Sampling
- d. Data distributions
- Normal Distribution – Characteristics of a normal distribution
- Binomial Distribution
- e. Inferential statistics
- f. Hypothesis testing
- Type I error
- Type II error
- Null and alternative hypotheses
- Rejection or acceptance criterion
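As an illustration of the hypothesis-testing topics above (null and alternative hypotheses, Type I error, rejection criterion), here is a minimal sketch of a two-sided one-sample z-test. It assumes the population standard deviation is known; the function name `z_test` and the numbers in the usage lines are hypothetical, chosen only for illustration, and are not part of the course material.

```python
import math


def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z-test (assumes a known population sigma).

    H0: the population mean equals mu0.
    H1: the population mean differs from mu0.
    alpha is the accepted Type I error rate (rejecting a true H0).
    """
    # Standardize the observed sample mean under H0.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    # Two-sided p-value: probability of a result at least this extreme.
    p_value = 2.0 * (1.0 - cdf)
    # Rejection criterion: reject H0 when the p-value is below alpha.
    return z, p_value, p_value < alpha


# Hypothetical example: 100 observations, sample mean 52, H0 mean 50, sigma 10.
z, p_value, reject = z_test(sample_mean=52, mu0=50, sigma=10, n=100)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject}")
```

Failing to reject H0 when it is actually false would be a Type II error; lowering `alpha` reduces the Type I error rate at the cost of a higher Type II error rate for a fixed sample size.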
Introduction to R
- A Primer to R programming
- What is R? Similarities to OOP and SQL
- Types of objects in R – lists, matrices, arrays, data.frames etc.
- Creating new variables or updating existing variables
- If statements and loops (for, while, etc.)
- String manipulations
- Subsetting data from matrices and data.frames
- Melting and casting data between long and wide formats
- Merging datasets
Exploratory data analysis and visualization
- Getting data into R – reading from files
- Cleaning and preparing the data – converting data types (character to numeric, etc.)
- Handling missing values – imputation or replacing with placeholder values
- Visualization in R using ggplot2 (plots and charts) – histograms, bar charts, box plots, scatter plots
- Adding more dimensions to the plots
- Visualization using Tableau (introduction)
- Correlation – positive, negative and no correlation
- What is a spurious correlation
- Correlation vs. causation
Introduction to Python
- a. Different types of predictive analytics – prediction, forecasting, optimization, segmentation, etc.
- b. Supervised learning
- Prediction (Linear)
- Simple Linear Regression
- Assumptions
- Model development and interpretation
- Least squares (minimizing the sum of squared errors)
- Model validation – tests to validate assumptions
- Multiple linear regression
- Disadvantages of linear models
Classification
- Logistic Regression
- Need for logistic regression
- Logit link function
- Maximum likelihood estimation
- Model development and interpretation
- Confusion Matrix – error measurement
- ROC curve
- Measuring sensitivity and specificity
- Advantages and disadvantages of logistic regression models
Decision trees
- C5.0
- Classification and Regression Trees (CART)
- Process of tree building
- Entropy and Gini Index
- Problem of overfitting
- Pruning a tree back
- Trees for Prediction (Linear) – example
- Trees for classification models – example
- Advantages of tree-based models
KNN – K nearest neighbors
- Advantages and disadvantages of KNN
- Resampling and ensemble methods
- Bagging
- Random Forests
- Boosting – Gradient boosting machines
- Advanced methods
- Support vector machines
- Neural networks
- Introduction to deep learning
- Introduction to online learning
- Unsupervised learning
- Cluster analysis
- Hierarchical clustering
- K-Means clustering
- Distance measures
- Applications of cluster analysis – Customer Segmentation
- Time series analysis - Forecasting
- Simple moving averages
- Exponential smoothing
- Time series decomposition
- ARIMA
- Collaborative filtering
- User-based filtering
- Item-based filtering
Model validation and deployment
- Error measurement
- RMSE – Root mean squared error
- Misclassification rate
- Area under the curve (AUC)
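Two of the error measures listed above can be sketched in a few lines. This is a minimal, hedged illustration using only the Python standard library; the function names `rmse` and `misclassification_rate` and the sample values are hypothetical and chosen for illustration only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for a regression model's predictions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the squared residuals, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def misclassification_rate(actual_labels, predicted_labels):
    """Fraction of observations whose predicted class is wrong."""
    if len(actual_labels) != len(predicted_labels) or not actual_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(a != p for a, p in zip(actual_labels, predicted_labels))
    return wrong / len(actual_labels)


# Hypothetical values for illustration only.
print(rmse([3, 5, 7], [2, 5, 9]))
print(misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0]))
```

RMSE is expressed in the same units as the target variable, while the misclassification rate is simply one minus the accuracy of a classifier.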
Practical use cases and best practices
- a. Translating a business problem into an analytical problem
- Problem definition and analytical method selection
- b. Guidelines in model development
Introduction to big data and other tools (Python and R-Server)
- a. Big data and analytics
- Leveraging big data platforms for data science
- b. Introduction to evolving tools, e.g., Spark
- Machine learning with Spark
Introduction to Azure cloud and big data computing in the cloud
- Creation of R-Server clusters
- Running big data ML algorithms on the Azure cloud
Introduction to Deep Learning
- What is DL, and where does it score better than traditional ML?
- Convolutional and perceptron models
- Comparison of DL and ML performance on the MNIST dataset
Analytical Visualization with Tableau
- Why it is important for data analysts
- Tableau workbook walkthrough
- Instructions for creating your own workbooks
- Demo of a few more workbooks
Offerings from Kelly
- Mock interview questions and case study walkthroughs over the Azure Cortana gallery
- Guidance on preparing resumes
- Information on companies and industry trends in data science