As a data scientist, your role will require sound statistical skills to make sense of quantitative data to spot trends and make predictions.

However, most learners are not very clear about what to learn or where to start.

If you are eager to learn statistics, first understand the difference between its two major categories, and then the key concepts essential for Data Science.

Descriptive Statistics

Inferential Statistics

>>> Competency in statistics will lead you to a rewarding career in Data Science.

Descriptive statistics enable us to present raw data constructively and summarize data points in a practical way.

Learn these basic concepts:

Normal Distribution

Central Tendency

Skewness

Kurtosis

Variability

>>> Learn these concepts of research methods to perform statistical analysis.

Inferential statistics provides the basis for predictions, forecasts, and estimates that are used to transform information into knowledge.

Learn these basic concepts:

Central Limit Theorem

Hypothesis Testing

ANOVA

Regression Analysis

>>> Inferential Statistics allows us to infer trends and make predictions about a population based on a study of sample data.

You must learn these basic concepts of research methods to perform simple statistical analysis, visualize data, predict future trends from the data, etc.

The normal distribution is one of the most common probability distributions and is used in many types of real-world applications.

>>> Learn 68–95–99.7 rule

The normal distribution has a symmetric bell shape, which is why it's often called a bell curve.

This bell curve is centered around its mean and spreads out with decreasing probability as you move away in either direction.

>>> 1.1 Normal distribution

In a normal distribution, about 68% of the data lies within plus or minus one standard deviation of the mean.

>>> Many ML algorithms assume this distribution.

Within plus or minus two standard deviations, there is about 95% of the data.

Finally, within plus or minus three standard deviations, there is about 99.7% of the data.
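
The three percentages above can be checked numerically. The following is a minimal sketch using Python's standard library; the helper name share_within is made up for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    # Prints roughly 0.6827, 0.9545 and 0.9973.
    print(f"within +/-{k} standard deviations: {share_within(k):.4f}")
```

The same shares hold for any normal distribution, because measuring distance in standard deviations removes the effect of the particular mean and spread.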

>>> 1.2 Central Tendency

Measures of central tendency identify the central point of the data. The three main measures are:

Mean: The arithmetic average of all the observations (values) in the dataset.

Median: The middle value of the data arranged from the least to the greatest.

Mode: The most frequent value in the dataset.
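
To see how the three measures can disagree, here is a small sketch with Python's statistics module. The ticket counts are invented for the example, and the last value is a deliberate outlier.

```python
import statistics

# Hypothetical sample: support tickets per day over nine days.
tickets = [3, 5, 5, 6, 7, 8, 5, 9, 42]

mean_value = statistics.mean(tickets)      # arithmetic average
median_value = statistics.median(tickets)  # middle value after sorting
mode_value = statistics.mode(tickets)      # most frequent value

print(mean_value, median_value, mode_value)
```

With this data the mean is 10, the median is 6, and the mode is 5. The single outlier pulls the mean upward while the median and mode barely notice it.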

>>> It is very easy to learn Mean, Median, and Mode, as well as Skewness and Kurtosis.

>>> 1.3 Skewness

Skewness is a measure of symmetry, or more precisely of the lack of it, since a distribution does not always exhibit symmetry.

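We can see visually what happens to the measures of central tendency when we encounter an asymmetrical distribution.

Notice how these measures spread apart when the symmetrical normal distribution is distorted. In a right-skewed distribution, for example, the long tail typically pulls the mean above the median.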
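
One common way to put a number on asymmetry is the moment-based skewness coefficient: the third central moment divided by the cube of the standard deviation. The sketch below assumes that definition (other definitions exist), and both datasets are invented.

```python
import statistics


def skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5, 6, 7]         # balanced around 4
right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]  # long tail on the right

print(skewness(symmetric))     # zero for perfectly symmetric data
print(skewness(right_skewed))  # positive for a right tail
print(statistics.fmean(right_skewed), statistics.median(right_skewed))
```

For the right-skewed list the mean sits above the median, which is the spreading of the central measures described above.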

>>> Learn to detect the extent of asymmetry

>>> 1.4 Kurtosis

Kurtosis is a measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution.

Distributions with high kurtosis have heavier tails than a normal distribution, whereas distributions with negative (excess) kurtosis have lighter tails than a normal distribution.

>>> Learning the distribution of data is a very important aspect

Differences in kurtosis come from how the values vary around the center of the distribution. There are three types of curve:

Leptokurtic: A curve with a higher peak than the normal curve, because more of the items are concentrated near the center.

Platykurtic: A curve with a lower, flatter peak than the normal curve, because fewer of the items are concentrated near the center.

Mesokurtic: A curve with the same peak as the normal curve. When the data is distributed symmetrically around the center value (mean), the mean, median, and mode are equal.
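
The three types can be told apart with excess kurtosis, the fourth standardized moment minus 3, so that a normal distribution scores about zero. The sketch below assumes that definition and uses simulated samples chosen for illustration: a normal sample, a uniform sample (light tails), and a Laplace-like sample built as the difference of two exponentials (heavy tails).

```python
import random
import statistics


def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3.0


rng = random.Random(0)  # fixed seed so the run is repeatable
size = 50_000

normal_sample = [rng.gauss(0, 1) for _ in range(size)]
uniform_sample = [rng.uniform(-1, 1) for _ in range(size)]
heavy_sample = [rng.expovariate(1) - rng.expovariate(1) for _ in range(size)]

print("mesokurtic (normal):", round(excess_kurtosis(normal_sample), 2))
print("platykurtic (uniform):", round(excess_kurtosis(uniform_sample), 2))
print("leptokurtic (Laplace-like):", round(excess_kurtosis(heavy_sample), 2))
```

In theory these values are 0, -1.2, and 3 respectively. Simulated samples only approximate them.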

>>> Learn these concepts

>>> 1.5 Variability

Variability measures how far the data points lie from the mean of the distribution.

Variability is usually defined in terms of distance:

How far apart scores are from each other

How far apart scores are from the mean

How representative a score is of the data set as a whole

>>> The measures of variability include range, variance, standard deviation, and interquartile range.

Range: The difference between the largest and smallest scores. The range only considers the two extreme scores and ignores any values in between.

Standard Deviation: This measure expresses the variability in terms of a typical deviation in the data set. It is commonly used to learn whether a specific data point is standard and expected, or unusual and unexpected.

Variance: The variance is a measure of variability. It is essentially the average of the squared differences from the mean.
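
These three measures can be sketched in a few lines of Python. The scores are invented, and the example uses the population formulas (dividing by the number of observations), which match the "average of the squared differences" wording above.

```python
import statistics

# Hypothetical test scores; their mean is 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(scores) - min(scores)  # uses only the two extreme scores
variance = statistics.pvariance(scores)  # mean of squared deviations
std_dev = statistics.pstdev(scores)      # square root of the variance

print(value_range, variance, std_dev)
```

Here the range is 7, the variance is 4, and the standard deviation is 2. When the data is a sample rather than the whole population, statistics.variance and statistics.stdev divide by n - 1 instead.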

>>> These concepts may sound intimidating, but are simple to learn.

Percentiles: Percentiles tell us how a value compares to other values. A percentile is a value below which a certain percentage of observations lie.

Quantiles: Quantiles are cut points that divide a probability distribution into areas of equal probability. A quantile tells us what share of the values in a distribution fall above or below a certain limit.

Interquartile Range (IQR): The interquartile range measures the spread of the middle 50% of the data. It is the third quartile minus the first quartile.
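
A short sketch of these ideas with the standard library follows. The data and the helper percent_below are made up for the example, and statistics.quantiles is called with its default "exclusive" method; other methods can give slightly different cut points.

```python
import statistics

# Hypothetical data: the integers 1 through 11.
data = list(range(1, 12))

# Three cut points that split the data into four equal-probability parts.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle half of the data


def percent_below(values, x):
    """Percentage of observations strictly below x."""
    return 100 * sum(1 for v in values if v < x) / len(values)


print(q1, q2, q3, iqr)
print(round(percent_below(data, 9), 1))
```

For this data the quartiles are 3, 6, and 9, so the IQR is 6, and about 72.7% of the observations lie below the value 9.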

The process of “inferring” insights, or drawing conclusions from the data through probability, is called “Inferential Statistics.”

The following techniques are especially helpful for data science.

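The Central Limit Theorem (CLT) is one of the most fundamental, and one of the simplest, concepts in statistics.

It describes the sampling distribution of the mean.

As the sample size increases, the distribution of the sample mean approaches a normal distribution, whatever the shape of the population, provided the population has a finite variance.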

>>> It is easier to understand the CLT if you are familiar with the Normal distribution.

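The mean of the sample means is the same as the mean of the larger population. The standard deviation of the sample means, however, is not equal to the standard deviation of the population: it equals the population standard deviation divided by the square root of the sample size, and is called the standard error.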
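
A quick simulation makes the theorem concrete. This is a minimal sketch with choices made purely for illustration: the population is exponential (strongly skewed, with mean 1 and standard deviation 1), each sample has 50 values, and 5,000 samples are drawn.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

sample_size = 50
num_samples = 5_000


def sample_mean(n):
    """Mean of n draws from a skewed (exponential) population."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])


means = [sample_mean(sample_size) for _ in range(num_samples)]

mean_of_means = statistics.fmean(means)  # should be near the population mean, 1
sd_of_means = statistics.stdev(means)    # should be near 1 / sqrt(50)

print(round(mean_of_means, 3), round(sd_of_means, 3))
print("sigma / sqrt(n) =", round(1 / math.sqrt(sample_size), 3))
```

The sample means cluster around the population mean, and their spread shrinks like sigma / sqrt(n), even though the population itself is far from normal.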

>>> 2.1 Central limit theorem

The central limit theorem is important for statistics because it justifies the normality assumption behind many methods and determines the precision of the estimates.

Learn concepts like estimation of the population mean and the law of frequency of errors, including how to calculate the margin of error from the z-score for a given confidence level.

Understand and practice CLT concepts with:

Normal Population

Dichotomous Outcome

Skewed Distribution
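
The margin-of-error calculation mentioned above can be sketched as follows. The function name and the numbers (100 observations with a standard deviation of 15) are assumptions for the example, and the formula z * sd / sqrt(n) assumes the sample is large enough for the normal approximation to hold.

```python
import math
from statistics import NormalDist


def margin_of_error(sample_sd, n, confidence=0.95):
    """Margin of error for a mean, using the normal (z) approximation."""
    # z-score that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sample_sd / math.sqrt(n)


# Hypothetical survey: 100 observations, standard deviation 15.
moe = margin_of_error(sample_sd=15, n=100, confidence=0.95)
print(round(moe, 2))  # about 2.94, since z is about 1.96
```

A 95% confidence interval for the population mean is then the sample mean plus or minus this margin.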

Hypothesis testing is the heart of all statistics that allows us to make inferences about the world.

A hypothesis is an assumption, an educated guess about the world around us, that can be tested against data.

>>> It is most often used by Data Scientists to test specific predictions, called hypotheses, that arise from theories.

Hypothesis testing, in simple words, is a process whereby a Statistical Analyst investigates an assumption regarding a population parameter. The methods employed depend on the data used and the reason for the analysis.

>>> 2.2 Hypothesis testing

Normally, the methods include estimating population properties like mean, differences between means, proportions, and the relations between variables.

There are two hypotheses that we require to test against each other:

Null Hypothesis: It is a prediction of no relationship between the variables. For example, it may state that the population mean is equal to zero.

Alternative Hypothesis: It is the research hypothesis that predicts a relationship between variables.

>>> The two are mutually exclusive, so only one of the two hypotheses can be true.

It's not always clear how the null and alternative hypotheses should be formulated and, for this reason, the context of the situation is important in determining how the hypotheses should be stated.

In data science, the applications of hypothesis testing are analytical and involve an attempt to gather evidence to support a research hypothesis.

Correct hypothesis formulation takes a lot of practice. Understand the terminology, testing process and concepts well enough to make inferences with real-world examples.
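
As one concrete instance of the testing process, here is a sketch of a two-sided one-sample z-test. It assumes the population standard deviation is known, which is rarely true in practice (a t-test is the usual alternative). The measurements, the null value of 50, and the 0.05 significance level are all invented for the example.

```python
import math
from statistics import NormalDist, fmean


def one_sample_z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean equals mu0 (sigma known)."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical measurements; H0 says the population mean is 50.
sample = [52, 48, 55, 51, 53, 49, 54, 50, 56, 52]
z, p_value = one_sample_z_test(sample, mu0=50, sigma=3)

alpha = 0.05  # significance level chosen before looking at the data
reject_null = p_value < alpha

print(round(z, 2), round(p_value, 3), reject_null)
```

Here z is about 2.11 and the p-value is about 0.035, so the null hypothesis is rejected at the 0.05 level. A result like this counts as evidence against the null hypothesis, not as proof of the alternative.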

>>> Learn Hypothesis testing

Analysis of variance (ANOVA), in its simplest form, is used to test whether the differences between the means of groups of data are statistically significant.

>>> ANOVA techniques are used in Data Science and Machine Learning to judge whether the differences observed in a dataset are significant.

The ANOVA test applies when there are more than two independent groups. Using ANOVA, we test whether to reject the null hypothesis, that all group means are equal, in favor of the alternative hypothesis.

ANOVA compares all the groups in a single test, which keeps the error rate lower than running a separate test for every pair of groups.

Cases when you might want to test multiple groups:

A group of cancer patients is trying three different therapies: Chemotherapy, Hormone Therapy, and Immunotherapy. You want to study if one therapy is more effective than the others.

A company makes two affordable smartphones using two different operating systems. They want to learn if one is better than the other.

>>> You will need to apply statistical knowledge to make confident and reliable decisions as a Data Scientist.
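
To show what a one-way ANOVA actually computes, the sketch below builds the F statistic by hand: the variation between group means divided by the variation within groups. The three groups of scores are invented and deliberately tiny. They are not real therapy data.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = fmean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical outcome scores for three treatment groups.
groups = [
    [4, 5, 6],
    [6, 7, 8],
    [8, 9, 10],
]

f_stat = one_way_anova_f(groups)
print(f_stat)  # 12.0 for this data
```

An F statistic of 12 with 2 and 6 degrees of freedom exceeds the 5% critical value found in standard F tables (roughly 5.14), so the null hypothesis of equal group means would be rejected here. In practice a statistics library, for example scipy.stats.f_oneway, can report the p-value directly.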

Regression is a form of quantitative data analysis used to find trends in data.

>>> We use Python libraries like NumPy, Pylab, and Scikit-learn for simple and multiple regression analysis.

It is a predictive modeling technique that helps companies understand what their data points represent and use them skillfully alongside other analytical techniques to make better decisions.

In Data Science, the evaluation of the relation between the variables is called Regression Analysis, and these variables refer to the properties or characteristics of certain events or objects.
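
Simple linear regression, with one predictor and one response, can be sketched in plain Python with the least-squares formulas. The function fit_line and the data points are invented for the example. In practice a library such as NumPy or scikit-learn would usually do this work.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predicted response at x = 6

print(round(slope, 2), round(intercept, 2), round(prediction, 2))
```

For this data the fitted slope is about 1.99 and the intercept about 0.05, giving a prediction of about 11.99 at x = 6. Multiple regression extends the same idea to several predictors.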
