Statistics for Business

Statistics For Management

  1. Important Terminology
  2. Introduction
  3. Statistical Survey
  4. Classification, Tabulation & Presentation of data
  5. Probabilities
  6. Theoretical Distributions
  7. Sampling & Sampling Distributions
  8. Estimation
  9. Testing of Hypothesis in case of large & small samples
  10. Chi-Square
  11. F-Distribution and Analysis of variance (ANOVA)
  12. Simple correlation and Regression
  13. Business Forecasting
  14. Time Series Analysis
  15. Index Numbers

 

  1. Important Terminologies

Statistics: Statistics is a collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions from that data.

Variable: A characteristic or attribute which can assume different values.



Random Variable: A variable whose values are determined by chance.

Population: All subjects which possess a common characteristic that is being studied.

Sample: A subgroup or subset of the population.

Parameter: A characteristic or measure obtained from a population.

Statistic: A characteristic or measure obtained from a sample.

 

Descriptive Statistics: The collection, organization, summarization, and presentation of data.



Inferential Statistics: Drawing conclusions by generalizing from samples to populations using probability.

Qualitative Variables:

Variables assume non-numerical values.

Quantitative Variables:

Variables assume numerical values.

Discrete Variables:

Variables assume a finite or countable number of possible values.

Continuous Variables:

Variables assume an infinite number of possible values. Their values are usually recorded in range form.

Nominal Level Scaling:

A level of measurement that classifies data into mutually exclusive (non-overlapping) categories. It is used only for identification, by assigning numerals, letters or other symbols.

Ordinal Level:

This is a level of measurement which classifies data into categories that can be ranked.

Interval Level:

This is a level of measurement which classifies data that can be ranked and differences are meaningful. However, there is no meaningful zero, so ratios are meaningless.

E.g. Ramesh scored marks in the range 50-70.

 

Ratio Level:

Level of measurement which classifies data that can be ranked, differences are meaningful, and there is a true zero. True ratios exist between the different units of measure.

Random Sampling:

Sampling in which data are collected using chance methods or random numbers.

Systematic Sampling:

Sampling in which data is obtained by selecting every kth object.

Convenience Sampling:

Sampling in which data that are readily available are used.

Stratified Sampling:

Sampling in which the population is divided into groups (called strata) according to some common characteristic. Each of these strata is then sampled using one of the other sampling techniques. Within each stratum, the members are homogeneous in nature.




Cluster Sampling:

Sampling in which the population is divided into groups (called clusters) in such a way that each cluster is internally heterogeneous, like a miniature of the population. Each of these clusters is then sampled using one of the other sampling techniques.

Raw Data:

Data collected in original form.

Frequency:

The number of times a certain value or class of values occurs.

Frequency Distribution:

This is an organization of raw data in table form with classes and frequencies.

Grouped Frequency Distribution:

This is a frequency distribution where several numbers are grouped into one class.

Class Limits:

Class limits are the smallest and largest values that can actually appear in the data for each class; there may be gaps between the upper limit of one class and the lower limit of the next.

Class Width:

This is the difference between the upper and lower boundaries of any class. The class width is also the difference between the lower limits of two consecutive classes or the upper limits of two consecutive classes.

Class Mark (Midpoint):

It is the number in the middle of the class. It is found by adding the upper and lower limits and dividing by two.

Cumulative Frequency:

It is the number of values less than the upper class boundary for the current class. This is a running total of the frequencies.

Histogram:

It is a graph which displays the data by using vertical bars of various heights to represent frequencies. The horizontal axis can be the class boundaries, the class marks, or the class limits.


Frequency Polygon:

It is a line graph. The frequency is placed along the vertical axis and the class midpoints are placed along the horizontal axis. These points are connected with lines.

 

 

Ogive:

Ogive is a frequency polygon of the cumulative frequency or the relative cumulative frequency.

Pareto Chart:

It is a bar graph for qualitative data with the bars arranged according to frequency.

Pie Chart:

It is a graphical depiction of data as slices of a pie. The frequency determines the size of the slice. The number of degrees in any slice is the relative frequency times 360 degrees.

Pictograph:

It is a graph which uses pictures to represent data.

 

  2. Introduction

Definition of Statistics:
Statistics are usually defined as a collection of numerical data that measure something.
Statistics is the science of recording, organizing, analyzing and reporting quantitative information.

Different components of statistics:
Four components as per Croxton & Cowden
1. Collection of Data.
2. Presentation of Data
3. Analysis of Data
4. Interpretation of Data

Use of Correlation & Regression:
Correlation and regression are statistical tools used to measure the relationship between two variables.

 

 

Significance of Statistics:

Statistics gives us techniques to obtain, condense, analyze and relate numerical data. Statistics is used in many fields, such as sociology, economics and business. Statistics are everywhere: election predictions are statistics, a food product that claims x% more or less of a certain ingredient is quoting a statistic, life expectancy is a statistic, and card counting in card games relies on statistics.

Statistical survey:

It is used to collect quantitative information about items in a population. A survey may focus on opinions or factual information depending on its purpose, and many surveys involve administering questions to individuals. When the questions are administered by a researcher, the survey is called a structured interview or a researcher-administered survey. When the questions are administered by the respondent, the survey is referred to as a questionnaire or a self-administered survey.

 

Pros of survey:

  • Efficient way of collecting information
  • Wide coverage of information
  • Easy to administer
  • Complete enumeration

Cons of survey:

  • Responses may be subjective
  • Motivation may be low to answer
  • Errors due to sampling
  • Less specific questions may lead to vague data.

Modes of data collection:

  • Telephone
  • Mail
  • Online surveys
  • Personal survey
  • Mall intercept survey

 

Sampling:

Sampling basically means selecting people/objects from a given population in order to study the population with respect to some characteristic. For example, we might want to find out people's choice of candidate at the next election. Obviously we can't ask everyone in the country, so we ask a sample.

Classification, Tabulation & Presentation of data

 

Types of data collection:

Primary Data: Primary data is data collected by the researchers themselves.
This kind of data is new, original research information.
Secondary Data: This is derived from other sources. It is past data.

 

Difference Between Primary and Secondary Sources of Data:




Primary sources enable the researcher to get as close as possible to what actually happened and are hands-on. A primary source reflects the individual viewpoint of a participant or observer. Primary sources are first-hand information from a person who witnessed or participated in an event.

Examples of primary data are:
Interviews
Questionnaires
Observations

Secondary research is using information that has already been produced
by other people. A secondary source is used by a person usually not
present at the event and relying on primary source documents for
information. Secondary sources usually analyze and interpret. Finding
out about research that already exists will help form new research.
Examples of secondary data:
Internet
Books/ Magazines
Newspapers
Office statistics
The government statistics service
The office of national statistics

Another classification of data:

Qualitative Data

  • Nominal, Attributable or Categorical data
  • Ordinal or Ranked data

 

Quantitative or Interval data

  • Discrete data
  • Continuous measurements

 

Tabulation of data:

 

Tabulation refers to the systematic arrangement of information in rows and columns. Rows are the horizontal arrangement and columns are the vertical arrangement. In simple words, tabulation is a layout of figures in rectangular form with appropriate headings to explain the different rows and columns. The main purpose of a table is to simplify the presentation and to facilitate comparisons.

Presentation of data:

Descriptive statistics can be illustrated in an understandable fashion by presenting them graphically using statistical and data presentation tools.

Different elements of tabulation:

Tabulation:

  • Table Number
  • Title
  • Captions and Stubs
  • Headnotes
  • Body
  • Source

 

Forms of presentation of the data:

Grouped and ungrouped data may be presented as:

  • Pie Charts
  • Frequency Histograms
  • Frequency Polygons
  • Ogives

 

Measures of summarizing data:




 

  • Measures of Central tendency: Mean, median, mode
  • Measures of Dispersion: Range, Variance, Standard Deviation

 

Mean: The mean value is what we typically call the “average.” You calculate the mean by adding up all of the measurements in a group and then dividing by the number of measurements.

Median: The median is the middle-most value in a series when the values are arranged in ascending or descending order.

Mode: The most repeated value in a series.
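These three measures can be checked quickly in Python; a minimal sketch using the standard statistics module, with a made-up series of values:

import statistics

# A small, made-up series of values
data = [4, 8, 6, 5, 3, 8, 9, 8, 5]

print(statistics.mean(data))    # average: sum of the values divided by their count
print(statistics.median(data))  # middle value when the series is sorted
print(statistics.mode(data))    # most frequently occurring value (8 here)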

Central tendency:

 

The measure to be used differs in different contexts. If your results involve categories instead of continuous numbers, then the best measure of central tendency will probably be the most frequent outcome (the mode). On the other hand, sometimes it is an advantage to have a measure of central tendency that is less sensitive to changes in the extremes of the data.

Range:

It is the difference between the largest and smallest data values in the set.

 

Variance:

The variance (σ²) is a measure of how far the values in the data set are spread from the mean; it is the average of the squared deviations from the mean.

 

 

Standard Deviation:

It is the square root of the variance.
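A minimal sketch of these dispersion measures in Python; statistics.pvariance and statistics.pstdev treat the data as a whole population, and the values are made up:

import statistics

data = [4, 8, 6, 5, 3, 8, 9, 8, 5]

data_range = max(data) - min(data)      # range: largest value minus smallest value
variance = statistics.pvariance(data)   # population variance (sigma squared)
std_dev = statistics.pstdev(data)       # standard deviation: square root of the variance

print(data_range, variance, std_dev)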

Probability:

Probability is a way of expressing knowledge or belief that an event will occur or has occurred.

Random experiment:

An experiment is said to be a random experiment if its outcome cannot be predicted with certainty.

Sample space:

The set of all possible outcomes of an experiment is called the sample space. It is denoted by 'S', and the number of elements in it is denoted n(S).

Example: In throwing a die, the number that appears on top is any one of 1, 2, 3, 4, 5, 6. So here:

S = {1, 2, 3, 4, 5, 6} and n(S) = 6

Similarly, in the case of a coin, S = {Head, Tail} or {H, T} and n(S) = 2.

Event and Its Classifications:

Event: Every subset of a sample space is an event. It is denoted by ‘E’.

Example: In throwing a die, S = {1,2,3,4,5,6}; the appearance of an even number is the event E = {2,4,6}.

Clearly E is a sub set of S.

Simple event: An event, consisting of a single sample point is called a simple event.

Example: In throwing a dice, S={1,2,3,4,5,6}, so each of {1},{2},{3},{4},{5} and {6} are simple events.

Compound event: A subset of the sample space which has more than one element is called a compound (or mixed) event.

Example: In throwing a dice, the event of appearing of odd numbers is a compound event, because E= {1,3,5} which has ‘3’ elements.

Definition of probability:

If ‘S’ be the sample space, then the probability of occurrence of an event ‘E’ is defined as:

P(E) = n(E) / n(S)

For example, in throwing a die, the probability of an even number is P(E) = n({2, 4, 6}) / n(S) = 3/6 = 1/2.
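The die example can be verified with a few lines of Python; a sketch in which the sample space and the event are written out explicitly:

from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}             # sample space for throwing a die
E = {x for x in S if x % 2 == 0}   # event: an even number appears

P_E = Fraction(len(E), len(S))     # P(E) = n(E) / n(S)
print(P_E)                         # 1/2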

Theoretical Distributions:

Theoretical distributions are based on mathematical formulae and logic. When empirical and theoretical distributions correspond, you can use the theoretical one to determine probabilities of an outcome, which will lead to inferential statistics.

Types of theoretical distributions:

  • Rectangular distribution (or Uniform Distribution)
  • Binomial distribution
  • Normal distribution

Rectangular distribution:

Rectangular distribution: Distribution in which all possible scores have the same probability of occurrence.

Binomial distribution:

Distribution of the frequency of events that can have only two possible outcomes.

Normal distribution:

The normal distribution is a bell-shaped theoretical distribution that predicts the frequency of occurrence of chance events. The probability of an event or a group of events corresponds to the area of the theoretical distribution associated with that event or group of events. The distribution is asymptotic: its curve continually approaches but never touches the horizontal axis. The curve is symmetrical: half of the total area lies to the left of the mean and the other half to the right.
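As an illustration of the binomial and normal distributions, here is a sketch that uses only the Python standard library (math.comb and statistics.NormalDist need Python 3.8 or later; the parameter values are made up):

import math
from statistics import NormalDist

# Binomial: probability of exactly k successes in n independent trials,
# each with success probability p.
def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

print(binomial_pmf(3, 10, 0.5))    # P(exactly 3 heads in 10 fair coin tosses)

# Normal: bell-shaped distribution with mean mu and standard deviation sigma.
nd = NormalDist(mu=100, sigma=15)
print(nd.cdf(115) - nd.cdf(85))    # probability of falling within one sigma of the mean (about 0.68)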

Central limit theorem:




This theorem states that when a large number of successive random samples are taken from a population, the sampling distribution of the means of those samples will be approximately normal, with mean μ and standard deviation σ/√N, as the sample size (N) becomes larger, irrespective of the shape of the population distribution.
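The theorem can be seen in a small simulation; a sketch using Python's random module, where the skewed (exponential) population and the sample size are arbitrary choices:

import random
import statistics

random.seed(1)

mu = 2.0            # population mean of an exponential distribution (its std dev is also 2.0)
N = 30              # sample size
num_samples = 5000  # number of repeated samples

sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1 / mu) for _ in range(N)]
    sample_means.append(statistics.mean(sample))

# The sample means are approximately normal, with mean close to mu
# and standard deviation close to sigma / sqrt(N).
print(statistics.mean(sample_means))   # close to 2.0
print(statistics.stdev(sample_means))  # close to 2.0 / sqrt(30), about 0.37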

Sampling & Sampling Distributions

Sampling distribution:

Suppose that we draw all possible samples of size n from a given population. Suppose further that we compute a statistic (mean, proportion, standard deviation) for each sample. The probability distribution of this statistic is called Sampling Distribution.

Variability of a sampling distribution:

The variability of a sampling distribution is measured by its variance or its standard deviation. The variability of a sampling distribution depends on three factors:

  • N: the no. of observations in the population.
  • n: the no. of observations in the sample
  • The way that the random sample is chosen.

Sampling distribution of the proportion:

In a population of size N, suppose that the probability of the occurrence of an event (dubbed a “success”) is P; and the probability of the event’s non-occurrence (dubbed a “failure”) is Q. From this population, suppose that we draw all possible samples of size n. And finally, within each sample, suppose that we determine the proportion of successes p and failures q. In this way, we create a sampling distribution of the proportion.

 

Estimation

When will the sampling distribution be normally distributed?

Generally, the sampling distribution will be approximately normally distributed if any of the following conditions apply.

  • The population distribution is normal.
  • The sampling distribution is symmetric, unimodal, without outliers, and the sample size is 15 or less.
  • The sampling distribution is moderately skewed, unimodal, without outliers, and the sample size is between 16 and 40.
  • The sample size is greater than 40, without outliers.

 

 

Testing of Hypothesis in case of large & small samples

Statistical hypothesis:

 

A statistical hypothesis is an assumption about a population parameter. This assumption may or may not be true.

 

Types of statistical hypothesis:

There are two types of statistical hypotheses.

  • Null hypothesis. The null hypothesis, denoted by H0, is usually the hypothesis that sample observations result purely from chance.
  • Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample observations are influenced by some non-random cause.

 

Hypothesis testing:

 

Statisticians follow a formal process to determine whether to reject a null hypothesis, based on sample data. This process is called hypothesis testing.




 

Steps of hypothesis testing:

Hypothesis testing consists of four steps.

  • State the hypotheses. This involves stating the null and alternative hypotheses. The hypotheses are stated in such a way that they are mutually exclusive. That is, if one is true, the other must be false.
  • Formulate an analysis plan. The analysis plan describes how to use sample data to evaluate the null hypothesis. The evaluation often focuses around a single test statistic.
  • Analyze sample data. Find the value of the test statistic (mean score, proportion, t-score, z-score, etc.) described in the analysis plan.
  • Interpret results. Apply the decision rule described in the analysis plan. If the value of the test statistic is unlikely, based on the null hypothesis, reject the null hypothesis.
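The four steps above can be sketched for a one-sample z-test of a mean; a minimal illustration with made-up numbers that assumes the population standard deviation is known and uses a two-tailed test at the 5% significance level:

import math
from statistics import NormalDist

# Step 1: state the hypotheses.  H0: mu = 50   versus   H1: mu != 50
mu0, sigma = 50, 10        # hypothesised mean and (assumed known) population std dev

# Step 2: formulate the analysis plan -- two-tailed z-test, significance level alpha = 0.05
alpha = 0.05

# Step 3: analyze sample data (a made-up sample of n = 36 with sample mean 53.2)
n, sample_mean = 36, 53.2
z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # test statistic

# Step 4: interpret the results -- two-tailed P-value and decision rule
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(z, p_value)
print("Reject H0" if p_value < alpha else "Fail to reject H0")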

 

 

 

 

Decision errors:

Two types of errors can result from a hypothesis test.

  • Type I error. A Type I error occurs when the researcher rejects a null hypothesis when it is true. The probability of committing a Type I error is called the significance level. This probability is also called alpha, and is often denoted by α.
  • Type II error. A Type II error occurs when the researcher fails to reject a null hypothesis that is false. The probability of committing a Type II error is called beta, and is often denoted by β. The probability of not committing a Type II error is called the power of the test.

How to arrive at a decision on hypothesis?

The decision rule can be stated in two ways – with reference to a P-value or with reference to a region of acceptance.

  • P-value. The strength of evidence in support of a null hypothesis is measured by the P-value. Suppose the test statistic is equal to S. The P-value is the probability of observing a test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less than the significance level, we reject the null hypothesis.
  • Region of acceptance. The region of acceptance is a range of values. If the test statistic falls within the region of acceptance, the null hypothesis is not rejected. The region of acceptance is defined so that the chance of making a Type I error is equal to the significance level. The set of values outside the region of acceptance is called the region of rejection. If the test statistic falls within the region of rejection, the null hypothesis is rejected. In such cases, we say that the hypothesis has been rejected at the α level of significance.

One-tailed and two-tailed tests:

A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling distribution, is called a one-tailed test. For example, suppose the null hypothesis states that the mean is less than or equal to 10. The alternative hypothesis would be that the mean is greater than 10. The region of rejection would consist of a range of numbers located on the right side of the sampling distribution; that is, a set of numbers greater than 10.

A test of a statistical hypothesis, where the region of rejection is on both sides of the sampling distribution, is called a two-tailed test. For example, suppose the null hypothesis states that the mean is equal to 10. The alternative hypothesis would be that the mean is less than 10 or greater than 10. The region of rejection would consist of a range of numbers located on both sides of the sampling distribution; that is, the region of rejection would consist partly of numbers that were less than 10 and partly of numbers that were greater than 10.

Chi-Square in Statistics:

Suppose Manish plays 100 tests and scores 50 in 20 of them. Is he a good player?

In statistics, the chi-square test calculates how well a series of numbers fits a distribution. In this module, we only test whether results fit an even (uniform) distribution. The test doesn't simply say "yes" or "no". Instead, it gives you upper and lower bounds on the likelihood that the variation in your data is due to chance.

There are basically two types of random variables and they yield two types of data: numerical and categorical.

A chi-square (χ²) statistic is used to investigate whether distributions of categorical variables differ from one another. Basically, categorical variables yield data in categories and numerical variables yield data in numerical form.

Responses to such questions as "What is your major?" or "Do you own a car?" are categorical because they yield data such as "biology" or "no." In contrast, responses to such questions as "How tall are you?" or "What is your G.P.A.?" are numerical. Numerical data can be either discrete or continuous.

Data type     Question type          Possible answer
Categorical   Where are you from?    India / US / Japan
Numerical     How tall are you?      70 inches
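A sketch of a chi-square goodness-of-fit test against an even (uniform) distribution, computed by hand in Python; the observed die counts are made up, and the 5% critical value for 5 degrees of freedom (about 11.07, from a chi-square table) is quoted as an assumption:

# Observed frequencies from 60 made-up throws of a die
observed = [5, 8, 9, 8, 10, 20]
expected = [sum(observed) / len(observed)] * len(observed)   # 10 per face if the die is fair

# Chi-square statistic: sum of (O - E)^2 / E over all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)   # 13.4 for these numbers

# Degrees of freedom = number of categories - 1 = 5; 5% critical value is about 11.07
print("Departure from an even distribution" if chi_square > 11.07 else "No significant departure")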

 

F-Distribution and Analysis of variance (ANOVA)

 

ANOVA:

Analysis of variance (ANOVA) is a collection of statistical models and their associated procedures in which the observed variance is partitioned into components due to different sources of variation. ANOVA provides a statistical test of whether or not the means of several groups are all equal.

Assumption in ANOVA:

  • Independence of cases – this is an assumption of the model that simplifies the statistical analysis.
  • Normality – the distributions of the residuals are normal.
  • Equality (or “homogeneity”) of variances.

 

Logic of ANOVA:

 

Partitioning of the sum of squares

The fundamental technique is a partitioning of the total sum of squares (abbreviated SS) into components related to the effects used in the model. For example, for a simplified ANOVA with one type of treatment at different levels:

SS(Total) = SS(Treatments) + SS(Error)

The number of degrees of freedom (abbreviated df) can be partitioned in a similar way; it specifies the chi-square distribution which describes the associated sums of squares.

F-test:

The F-test is used for comparisons of the components of the total deviation. For example, in one-way, or single-factor, ANOVA, statistical significance is tested for by comparing the F test statistic

F = (variance between treatments) / (variance within treatments)
  = [SS(Treatments) / (I − 1)] / [SS(Error) / (nT − I)]

where

I = number of treatments

and

nT = total number of cases,

to the F-distribution with I − 1 and nT − I degrees of freedom. The F-distribution is a natural candidate because the test statistic is the quotient of two mean sums of squares, each of which has a chi-square distribution.

How ANOVA is helpful:

ANOVAs are helpful because they possess a certain advantage over a two-sample t-test. Doing multiple two-sample t-tests would result in a greatly increased chance of committing a Type I error. For this reason, ANOVAs are useful in comparing three or more means.
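A minimal one-way ANOVA sketch; it assumes SciPy is available and uses three made-up treatment groups:

from scipy import stats

# Three made-up treatment groups (I = 3 treatments, nT = 15 cases in total)
group_a = [23, 25, 21, 24, 26]
group_b = [30, 28, 31, 29, 32]
group_c = [22, 24, 23, 25, 21]

# One-way ANOVA: F statistic with I - 1 = 2 and nT - I = 12 degrees of freedom
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value suggests the group means are not all equal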

Simple correlation and Regression

Correlation:

Correlation is a measure of association between two variables. The variables are not designated as dependent or independent.

Values for correlation coefficient:

The value of a correlation coefficient can vary from -1 to +1. A value of -1 indicates a perfect negative correlation and a value of +1 indicates a perfect positive correlation. A correlation coefficient of zero means there is no relationship between the two variables.

Interpretation of the correlation coefficient values:

When there is a negative correlation between two variables, as the value of one variable increases, the value of the other variable decreases, and vice versa. In other words, for a negative correlation, the variables work opposite each other. When there is a positive correlation between two variables, as the value of one variable increases, the value of the other variable also increases. The variables move together.
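A sketch of the correlation coefficient computed directly from its definition (made-up paired data; Python 3.10+ also provides statistics.correlation for the same calculation):

import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# r = sum of products of deviations / (root of sum of squared x deviations * same for y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))

r = sxy / (sx * sy)
print(r)   # about 0.77: a fairly strong positive correlation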

Simple regression:

Simple regression is used to examine the relationship between one dependent and one independent variable. After performing an analysis, the regression statistics can be used to predict the dependent variable when the independent variable is known. Regression goes beyond correlation by adding prediction capabilities.

There are three equivalent ways to mathematically describe a linear regression model.

y = intercept + (slope × x) + error

y = constant + (coefficient × x) + error

y = a + bx + e
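A least-squares sketch that fits the line y = a + bx to made-up data and then uses it for prediction (the closed-form formulas for slope and intercept are the standard ones):

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# slope b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2);  intercept a = ybar - b*xbar
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x
print(a, b)          # intercept 2.2, slope 0.6 for this data

# Prediction: estimate y for a new value of the independent variable
print(a + b * 6)     # predicted y when x = 6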

Business Forecasting

Forecasting:

Forecasting is a prediction of what will occur in the future, and it is an uncertain process. Because of the uncertainty, the accuracy of a forecast is as important as the outcome predicted by the forecast.

Various types of business forecasting techniques:

Various smoothing techniques:

Simple Moving Average: One of the best-known forecasting methods is the moving average: simply take a certain number of past periods, add them together, and then divide by the number of periods. The simple moving average (MA) is an effective and efficient approach provided the time series is stationary in both mean and variance. The following formula is used in finding the moving average of order n, MA(n), for a period t+1:

MAt+1 = [Dt + Dt-1 + … +Dt-n+1] / n

Where n is the number of observations used in the calculation.

Weighted Moving Average: Weighted moving averages are very powerful and economical. They are widely used where repeated forecasts are required, and use methods such as sum-of-the-digits and trend adjustment. As an example, a weighted moving average of order 3 is:

Weighted MA(3) = w1.Dt + w2.Dt-1 + w3.Dt-2

where the weights are any positive numbers such that: w1 + w2 + w3 = 1.
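Both formulas can be sketched in a few lines of Python; the demand values and the weights are made up, and the weights sum to 1 with more importance given to the most recent period:

demand = [20, 22, 25, 24, 26, 28]   # D values for successive periods

# Simple moving average of order n over the last n observations
def moving_average(series, n):
    return sum(series[-n:]) / n

# Weighted moving average of order 3: w1*Dt + w2*Dt-1 + w3*Dt-2
def weighted_ma3(series, w1, w2, w3):
    return w1 * series[-1] + w2 * series[-2] + w3 * series[-3]

print(moving_average(demand, 3))            # MA(3) forecast for the next period: 26.0
print(weighted_ma3(demand, 0.5, 0.3, 0.2))  # weighted forecast: 26.6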

Exponential smoothing techniques:

Single Exponential Smoothing: It calculates the smoothed series as the damping coefficient times the actual series plus (1 minus the damping coefficient) times the lagged value of the smoothed series. The extrapolated smoothed series is a constant, equal to the last value of the smoothed series during the period when actual data on the underlying series are available.

Ft+1 = a Dt + (1 – a) Ft

where:

  • Dt is the actual value
  • Ft is the forecasted value
  • a is the weighting factor, which ranges from 0 to 1
  • t is the current time period.
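A minimal sketch of this recursion in Python; the demand series and the weighting factor a = 0.3 are made up, and the first forecast is seeded with the first actual value:

demand = [20, 22, 25, 24, 26, 28]
a = 0.3                      # weighting (damping) factor, between 0 and 1

forecast = demand[0]         # seed the first forecast with the first actual value
for d in demand:
    forecast = a * d + (1 - a) * forecast   # Ft+1 = a*Dt + (1 - a)*Ft

print(forecast)              # forecast for the next, not yet observed, period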

Double Exponential Smoothing: It applies the smoothing process described above twice, to account for a linear trend. The extrapolated series has a constant growth rate, equal to the growth of the smoothed series at the end of the data period.




Time series models:

A time series is a set of numbers that measures the status of some activity over time. It is the historical record of some activity, with measurements taken at equally spaced intervals (exception: monthly) with a consistency in the activity and the method of measurement.

Time Series Analysis

  1. What is time series forecasting?

A time series can be represented as a curve that evolves over time. Forecasting the time series means that we extend the historical values into the future, where measurements are not yet available.

  2. What are the different models in time series forecasting?
  • Simple moving average
  • Weighted moving average
  • Simple exponential smoothing
  • Holt’s double Exponential smoothing
  • Winter’s triple exponential smoothing
  • Forecast by linear regression

Exponential smoothing techniques:

Single and double exponential smoothing work exactly as described in the Business Forecasting section above.

Triple Exponential Smoothing: It applies the smoothing process described above three times, to account for a nonlinear trend.

Forecasting by linear regression:

Regression is the study of relationships among variables, a principal purpose of which is to predict, or estimate the value of one variable from known or assumed values of other variables related to it.

Index Numbers:

 

Index numbers are used to measure changes in some quantity which we cannot observe directly, e.g. changes in business activity.

 

Classification of index numbers:

Index numbers are classified in terms of the variables they are intended to measure. In business, the different groups of variables in the measurement of which index number techniques are commonly used are: i) price, ii) quantity, iii) value, and iv) business activity.

Simple index numbers:

 

A simple index number is a number that measures a relative change in a single variable with respect to a base.

 

Composite index numbers:

 

A composite index number is a number that measures an average relative change in a group of relative variables with respect to a base.

 

Price index numbers:

 

Price index numbers measure the relative changes in the prices of commodities between two periods. Prices can be retail or wholesale.
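A sketch of a simple price index (price relative) and a simple aggregate price index, using made-up commodity prices for a base year and a current year:

base_prices    = {"wheat": 20, "rice": 30, "oil": 50}   # base-year prices
current_prices = {"wheat": 24, "rice": 33, "oil": 60}   # current-year prices

# Simple index number for one commodity: (current price / base price) * 100
wheat_index = current_prices["wheat"] / base_prices["wheat"] * 100
print(wheat_index)       # 120.0: wheat is 20% dearer than in the base year

# Simple aggregate price index: (sum of current prices / sum of base prices) * 100
aggregate_index = sum(current_prices.values()) / sum(base_prices.values()) * 100
print(aggregate_index)   # 117.0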

Quantity index numbers:

These index numbers are considered to measure changes in the physical quantity of goods produced, consumed, or sold of an item or a group of items.

 

 

 
