RM Notes
SYLLABUS
Objectives:
• To provide an understanding of the basic concepts of research methods
• To expose the students to the role that statistics plays in business decisions
Module 2: (8 hours)
Types of Business Research Design: Exploratory and Conclusive Research Design
Exploratory Research: Meaning, purpose, methods – secondary resource analysis, comprehensive
case methods, expert opinion survey, focus group discussions.
Conclusive Research Design – Descriptive Research – Meaning, Types – cross-sectional studies and
longitudinal studies.
Experimental research design – Meaning and classification of experimental designs- Pre
experimental design, Quasi-experimental design, True experimental design, statistical experimental
design.
Module 3: (8 hours)
Sampling: Concepts- Types of Sampling - Probability Sampling – simple random sampling,
systematic sampling, stratified random sampling, cluster sampling -Non Probability Sampling –
convenience sampling- judgemental sampling, snowball sampling- quota sampling - Errors in
sampling.
Module 5: (8 hours)
Preparing the Data for Analysis: Editing, Coding, Classification, Tabulation, Validation,
Analysis and Interpretation. Report writing and presentation of results: Importance of report
writing, types of research report, report structure, guidelines for effective documentation.
Practical Components:
• Students are expected to write the research design on Exploratory and Descriptive Research.
• Students are asked to prepare a questionnaire on brand awareness, effectiveness of training in a
public sector organization, or investors' attitude towards mutual funds in any financial institution.
• Students are asked to conduct a market survey to know the consumer perception towards any FMCG.
• Identify the problem and collect relevant literatures and data for analysis
• Data Interpretation and report writing: Short and Long reports.
• Report presentation methods, e.g. PowerPoint presentations.
RECOMMENDED BOOKS
• Research Methodology – C R Kothari, Vishwa Prakashan, 2002
• Business Research Methods – Donald R. Cooper & Pamela S. Schindler, 9/e, TMH, 2007
• Research Methodology – Concepts and Cases – Deepak Chawla and Neena Sondhi, Vikas
Publication, 2014
• Research Methods for Business – Uma Sekaran & Roger Bougie, 6th Edition, Wiley, 2013
• Business Research Methods – S L Gupta and Hitesh Gupta, McGraw Hill, 2012
• Marketing Research – Naresh K Malhotra, 5th Edition, Pearson Education/PHI, 2007
• Business Research Methodology – J K Sachdeva, 2nd Edition, HPH, 2011
REFERENCE BOOKS
• Research Methods – William M K Trochim, 2/e, Biztantra, 2007
Module 1
Business research involves establishing objectives and gathering relevant information to obtain the
answer to a business issue. Alternatively, business research can be defined as the systematic and
objective process of gathering, recording and analyzing data to aid in making business decisions.
• Applied Research: Applied research is conducted when a decision must be made about a
specific real-life problem. It is thus problem oriented and action directed.
Contribution of Applied Research:
Action Research: Action research is either research initiated to solve an immediate problem
or a reflective process of progressive problem solving led by individuals working with others in
teams or as part of a "community of practice" to improve the way they address issues and solve
problems.
Process of research
The research problem is a general statement of an issue meriting research. Its nature will
suggest appropriate forms for its investigation
Problem definition involves stating the general marketing research problem and identifying
its specific components. Only when the research problem has been clearly defined can research
be designed and conducted properly.
Research proposal is a specific kind of document written for a specific purpose. Research
involves a series of actions and therefore it presents all actions in a systematic and scientific
way. In this way, Research proposal is a blue print of the study which simply outlines the
steps that the researcher will undertake during the conduct of his/her study. A proposal is a tentative
plan, so the researcher has every right to modify the proposal on the basis of his reading,
discussion and experiences gathered in the process of research. Even with this relaxation
available to the researcher, writing a research proposal is a must.
A research design is a framework or blueprint for conducting the marketing research project.
It details the procedures necessary for obtaining the required information, and its purpose is
to design a study that will test the hypotheses of interest, determine possible answers to the
research questions, and provide the information needed for decision making. Decisions are
also made regarding what data should be obtained from the respondents (e.g., by conducting
a survey or an experiment). A questionnaire and sampling plan also are designed in order to
select the most appropriate respondents for the study. The following steps are involved in
formulating a research design:
– Secondary data analysis (based on secondary research)
– Qualitative research
– Methods of collecting quantitative data (survey, observation, and experimentation)
– Definition of the information needed
– Measurement and scaling procedures
– Questionnaire design
– Sampling process and sample size
– Plan of data analysis
Sampling design
Sampling is a means of selecting a subset of units from a target population for the purpose of
collecting information. This information is used to draw inferences about the population as a
whole. The subset of units that are selected is called a sample. The sample design
encompasses all aspects of how to group units on the frame, determine the sample size,
allocate the sample to the various classifications of frame units, and finally, select the sample.
Choices in sample design are influenced by many factors, including the desired level of
precision of the estimates and operational constraints such as cost and time.
Module 2:
Exploratory research is research conducted for a problem that has not been clearly defined. It often
occurs before one knows enough to make conceptual distinctions or posit an explanatory
relationship. Exploratory research helps determine the best research design, data collection method
and selection of subjects. It should draw definitive conclusions only with extreme caution. Given its
fundamental nature, exploratory research often concludes that a perceived problem does not actually
exist.
Purpose
– The purpose of exploratory research is to gather preliminary information that will help define
problems and suggest hypotheses
– to gain familiarity with a phenomenon or acquire new insight into it in order to formulate a
more precise problem or develop hypothesis
Methods
Expert opinion survey /Experience Survey: It is better to interview those individuals who know
about the subject. The objectives of such survey are to obtain insight into the relationship between
variables and new ideas relating to the research problem. The respondents picked are interviewed by
the researcher. The researcher should prepare an interview schedule for the systematic questioning of
informants. Thus an experience survey may enable the researcher to define the problem more
consciously and help in the formulation of hypothesis.
Focus group discussions: This is one of the most widely used methods in exploratory research. In a focus group,
only a few individuals (e.g., 8-12) are brought together to speak about some topic of interest. The
dialogue is coordinated by a moderator. The majority of the organizations engaging in the focus
groups first screen the candidates to find out who will compose the particular group. Organizations
also make sure to avoid groups in which some of the participants have their relatives and friends, as
this can result in a one-sided discussion. Group interaction is the key factor that differentiates focus
group interviews from experience survey that are conducted with one respondent at a time.
Furthermore it is the key advantage of the focus group over the majority of exploratory techniques.
Due to their interactive nature, ideas sometimes drop "out of the blue" in a focus group discussion.
Comprehensive case method: An in-depth study of a social unit – a person, a group or an institution –
is called a case study. Study of the relationships between different factors of each case is more important
than the number of cases. It is specifically helpful in situations where there is little experience to serve as a
guide. The attitude of the investigator, the intensity of the study and the ability of the researcher to
draw together diverse information into a unified interpretation are the main features which make this
method a suitable procedure for evoking insights.
Conclusive research aims to verify insights and to aid decision makers in selecting a specific course
of action. Conclusive research is sometimes called confirmatory research, as it is used to "confirm" a
hypothesis.
Cross-sectional studies are carried out at one time point or over a short period. They are usually
conducted to estimate the prevalence of the outcome of interest for a given population, commonly for
the purposes of public health planning. Data can also be collected on individual characteristics,
including exposure to risk factors, alongside information about the outcome. In this way cross-
sectional studies provide a 'snapshot' of the outcome and the characteristics associated with it, at a
specific point in time.
Cross-sectional research studies all have the following characteristics:
Takes place at a single point in time
Variables are not manipulated by researchers
Provide information only; do not answer why
Longitudinal studies
A longitudinal survey is a correlational research study that involves repeated observations of the
same variables over long periods of time — often many decades. It is a type of observational study.
Longitudinal studies are often used in psychology to study developmental trends across the life span,
and in sociology to study life events throughout lifetimes or generations. The reason for this is that,
unlike cross-sectional studies, in which different individuals with same characteristics are
compared, longitudinal studies track the same people, and therefore the differences observed in those
people are less likely to be the result of cultural differences across generations. Because of this
benefit, longitudinal studies make observing changes more accurate, and they are applied in various
other fields. In medicine, the design is used to uncover predictors of certain diseases. In advertising,
the design is used to identify the changes that advertising has produced in the attitudes and behaviors
of those within the target audience who have seen the advertising campaign.
Classified as:
1. Pre experimental design,
2. Quasi-experimental design,
3. True experimental design,
4. Statistical experimental design
A quasi-experiment is an empirical study used to estimate the causal impact of an intervention on its
target population. Quasi-experimental research shares similarities with the traditional experimental
design or randomized controlled trial, but it specifically lacks the element of random assignment to
treatment or control. Instead, quasi-experimental designs typically allow the researcher to control
assignment to the treatment condition using some criterion other than random assignment (e.g.,
an eligibility cutoff mark).
The first part of creating a quasi-experimental design is to identify the variables. The quasi-
independent variable will be the x-variable, the variable that is manipulated in order to affect a
dependent variable. "X" is generally a grouping variable with different levels. Grouping means two
or more groups such as a treatment group and a placebo or control group (placebos are more
frequently used in medical or physiological experiments). The predicted outcome is the dependent
variable, which is the y-variable. In a time series analysis, the dependent variable is observed over
time for any changes that may take place. Once the variables have been identified and defined, a
procedure should then be implemented and group differences should be examined
True experimental design is regarded as the most accurate form of experimental research, in that it
tries to prove or disprove a hypothesis mathematically, with statistical analysis.
For some of the physical sciences, such as physics, chemistry and geology, they are standard and
commonly used. For social sciences, psychology and biology, they can be a little more difficult to set
up.
For an experiment to be classed as a true experimental design, it must fit all of the following criteria.
The sample groups must be assigned randomly.
There must be a viable control group.
Only one variable can be manipulated and tested. It is possible to test more than one, but such
experiments and their statistical analysis tend to be cumbersome and difficult.
The tested subjects must be randomly assigned to either control or experimental groups.
Causation. It allows the experimenter to make causal inferences about the relationship
between independent variables and a dependent variable.
Control. It allows the experimenter to rule out alternative explanations due to
the confounding effects of extraneous variables (i.e., variables other than the independent
variables).
Variability. It reduces variability within treatment conditions, which makes it easier to detect
differences in treatment outcomes.
Module 3
Sampling: Concepts
Sampling is the process by which inference is made to the whole by examining a part.
a) Population
The collection of all units of a specified type in a given region at a particular point or period of time
is termed as a population or universe. Thus, one may consider a population of persons, families,
farms, cattle in a region or a population of trees or birds in a forest or a population of fish in a tank
etc. depending on the nature of data required.
b) Sampling Unit
Elementary units or group of such units which besides being clearly defined, identifiable and
observable, are convenient for purpose of sampling are called sampling units. For instance, in a
family budget enquiry, usually a family is considered as the sampling unit since it is found to be
convenient for sampling and for ascertaining the required information. In a crop survey, a farm or a
group of farms owned or operated by a household may be considered as the sampling unit.
c) Sampling Frame
A list of all the sampling units belonging to the population to be studied with their identification
particulars or a map showing the boundaries of the sampling units is known as sampling frame.
Examples of a frame are a list of farms and a list of suitable area segments like villages in India or
counties in the United States. The frame should be up to date and free from errors of omission and
duplication of sampling units.
Types of Sampling
Probability Sampling
A probability sampling method is any method of sampling that utilizes some form of random
selection. In order to have a random selection method, one must set up some process or procedure
that assures that the different units in the population have equal probabilities of being chosen.
Types of Probability Sampling include Simple random sampling, systematic sampling, stratified
random sampling, cluster sampling
Simple random sampling
In simple random sampling, every unit in the frame has an equal chance of selection, and every subset
of k individuals has the same probability of being chosen for the sample as any other subset
of k individuals.
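As an illustration, simple random sampling can be sketched with Python's standard library; the frame of 1,000 household IDs below is a made-up example, not data from the text.

```python
import random

# Hypothetical sampling frame: 1,000 household IDs (illustrative only).
frame = list(range(1, 1001))

random.seed(42)  # fixed seed so the draw is reproducible
# random.sample draws without replacement, giving every subset of 50
# units the same chance of selection -- the defining property of a
# simple random sample.
sample = random.sample(frame, 50)

print(len(sample))       # 50 units
print(len(set(sample)))  # 50 distinct units (no duplicates)
```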
Systematic sampling
Systematic sampling is a random sampling technique which is frequently chosen by researchers for
its simplicity and its periodic quality. Systematic sampling is a statistical method involving the
selection of elements from an ordered sampling frame. The most common form of systematic
sampling is an equal-probability method. In this approach, progression through the list is treated
circularly, with a return to the top once the end of the list is passed. The sampling starts by selecting
an element from the list at random, and then every kth element in the frame is selected, where k, the
sampling interval (sometimes known as the skip), is calculated as k = N/n (frame size divided by the
desired sample size).
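A minimal sketch of this circular, equal-probability procedure in Python; the 100-element frame is hypothetical:

```python
import random

def systematic_sample(frame, n):
    """Draw an equal-probability systematic sample of size n.

    k, the sampling interval (the "skip"), is the frame size divided
    by the sample size; selection starts at a random element and
    wraps back to the top of the list once the end is passed.
    """
    N = len(frame)
    k = N // n                       # sampling interval k = N/n
    start = random.randrange(N)      # random starting element
    return [frame[(start + i * k) % N] for i in range(n)]

random.seed(1)
frame = list(range(100))             # hypothetical ordered frame
sample = systematic_sample(frame, 10)
print(sample)                        # every 10th element after the random start
```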
Stratified random sampling
A method of sampling that involves the division of a population into smaller groups known as strata.
In stratified random sampling, the strata are formed based on members' shared attributes or
characteristics. A random sample from each stratum is taken in a number proportional to the
stratum's size when compared to the population. These subsets of the strata are then pooled to form a
random sample.
The main advantage with stratified sampling is how it captures key population characteristics in the
sample. Similar to a weighted average, this method of sampling produces characteristics in the
sample that are proportional to the overall population. Stratified sampling works well for populations
with a variety of attributes, but is otherwise ineffective, as subgroups cannot be formed.
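A sketch of proportional stratified sampling in Python; the two regional strata and their sizes are assumptions for illustration:

```python
import random

def stratified_sample(strata, total_n):
    """Proportional stratified random sample.

    `strata` maps a stratum name to its list of units; each stratum
    contributes units in proportion to its share of the population,
    and the per-stratum draws are pooled into one sample.
    """
    N = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        n_h = round(total_n * len(units) / N)  # proportional allocation
        sample.extend(random.sample(units, n_h))
    return sample

random.seed(7)
# Hypothetical population of 100 customers split into two regions.
strata = {"north": list(range(60)), "south": list(range(60, 100))}
sample = stratified_sample(strata, 10)
print(len(sample))  # 6 drawn from north + 4 from south = 10
```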
Cluster sampling
Cluster sampling refers to a sampling method that has the following properties.
The population is divided into N groups, called clusters.
The researcher randomly selects n clusters to include in the sample.
The number of observations within each cluster Mi is known, and M = M1 + M2 + M3 + ... +
MN-1 + MN.
Each element of the population can be assigned to one, and only one, cluster.
two types of cluster sampling methods.
One-stage sampling. All of the elements within selected clusters are included in the sample.
Two-stage sampling. A subset of elements within selected clusters are randomly selected for
inclusion in the sample.
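The properties above can be sketched as one-stage cluster sampling in Python; the five clusters of units are made up for illustration:

```python
import random

# Hypothetical population divided into N = 5 clusters (e.g. villages);
# every unit belongs to exactly one cluster.
clusters = {
    "A": [1, 2, 3],
    "B": [4, 5],
    "C": [6, 7, 8, 9],
    "D": [10, 11],
    "E": [12, 13, 14],
}

random.seed(3)
# One-stage cluster sampling: randomly select n = 2 clusters, then
# include every element of each selected cluster in the sample.
chosen = random.sample(sorted(clusters), 2)
sample = [unit for name in chosen for unit in clusters[name]]
print(chosen)   # the two selected clusters
print(sample)   # all units from those clusters
```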
Convenience sampling
Convenience sampling is a non-probability sampling technique where subjects are selected because
of their convenient accessibility and proximity to the researcher.
A statistical method of drawing representative data by selecting people because of the ease of their
volunteering or selecting units because of their availability or easy access. The advantages of this
type of sampling are the availability and the quickness with which data can be gathered.
The disadvantages are the risk that the sample might not represent the population as a whole, and it
might be biased by volunteers. For example, a study to determine the average age of gamblers at a
casino that is conducted for three hours on a weekday afternoon might be overrepresented by elderly
people who have retired and underrepresented by people of working age. It is also called accidental
sampling.
Judgemental sampling
Judgmental sampling is a non-probability sampling technique where the researcher selects units to be
sampled based on their knowledge and professional judgment.
This type of sampling technique is also known as purposive sampling and authoritative sampling.
Purposive sampling is used in cases where the specialty of an authority can select a more
representative sample that can bring more accurate results than by using other probability sampling
techniques. The process involves nothing but purposely handpicking individuals from the population
based on the authority's or the researcher's knowledge and judgment.
Snowball sampling
Snowball sampling is a non-probability sampling technique that is used by researchers to identify
potential subjects in studies where subjects are hard to locate.
To create a snowball sample, there are two steps: (a) trying to identify one or more units in the
desired population; and (b) using these units to find further units, and so on until the sample size is
met.
Quota sampling
A sampling method of gathering representative data from a group. As opposed to random sampling,
quota sampling requires that representative individuals are chosen out of a specific subgroup. For
example, a researcher might ask for a sample of 100 females, or 100 individuals between the ages of
20-30.
Step-by-step Quota Sampling
The first step in non-probability quota sampling is to divide the population into
exclusive subgroups.
Then, the researcher must identify the proportions of these subgroups in the population; this
same proportion will be applied in the sampling process.
Finally, the researcher selects subjects from the various subgroups while taking into
consideration the proportions noted in the previous step.
The final step ensures that the sample is representative of the entire population. It also allows
the researcher to study traits and characteristics that are noted for each subgroup.
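The steps above can be sketched in Python; the subgroup proportions below are illustrative assumptions, not figures from the text:

```python
# Step 1: divide the population into exclusive subgroups
# (the proportions here are made-up illustration values).
population_proportions = {"female": 0.6, "male": 0.4}

# Step 2: apply the population proportions to the desired sample
# size to obtain a quota for each subgroup.
total_n = 100
quotas = {group: round(total_n * p)
          for group, p in population_proportions.items()}
print(quotas)  # {'female': 60, 'male': 40}

# Step 3 (not shown): non-randomly recruit respondents from each
# subgroup until its quota is filled, keeping the sample's make-up
# proportional to the population.
```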
Errors in sampling
Sampling error is the deviation of the selected sample from the true characteristics, traits,
behaviors, qualities or figures of the entire population.
Given two studies that are exactly the same (same sampling methods, same population), the study with a
larger sample size will have less sampling process error compared to the study with smaller
sample size. Keep in mind that as the sample size increases, it approaches the size of the entire
population, therefore, it also approaches all the characteristics of the population, thus, decreasing
sampling process error.
There is only one way to eliminate this error. This solution is to eliminate the concept of sample,
and to test the entire population.
In most cases this is not possible; consequently, what a researcher must do is minimize the
sampling process error. This can be achieved by proper, unbiased probability sampling and
by using a large sample size.
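The sample-size effect can be illustrated with a small simulation (the normal population below is hypothetical): repeated samples of 400 land closer to the true mean, on average, than repeated samples of 25.

```python
import random
import statistics

random.seed(0)
# Hypothetical population: 10,000 values centred around a mean of 50.
population = [random.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)

def mean_abs_error(n, trials=200):
    """Average absolute gap between sample means of size n and the true mean."""
    return statistics.fmean(
        abs(statistics.fmean(random.sample(population, n)) - true_mean)
        for _ in range(trials)
    )

small_n_error = mean_abs_error(25)
large_n_error = mean_abs_error(400)
print(small_n_error > large_n_error)  # True: larger samples, less sampling error
```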
Module 4
Data Collection
Data collection is the process of gathering and measuring information on variables of interest, in an
established systematic fashion that enables one to answer stated research questions, test hypotheses,
and evaluate outcomes.
Secondary data: Data collected by someone else for some other purpose (but being utilized by the
investigator for another purpose).
Examples: Census data being used to analyze the impact of education on career choice and earning.
Data collection methods for impact evaluation vary along a continuum. At the one end of this
continuum are quantitative methods and at the other end of the continuum are Qualitative methods
for data collection.
Observations
Observation is a process of recording the behaviour patterns of people, objects and occurrences
without questioning or communicating with them. Observation can take place in a laboratory
setting or in a natural setting. Generally there are two ways to conduct observation, namely non-
participative observation and participative observation.
The researcher in non-participative observation does not get involved in the activities of the people
being observed. He or she merely records whatever happens among the people, including their actions,
their behaviour and anything worth recording. On the other hand, the researcher in participative
observation is fully involved with the people being observed, with the objective of trying to understand
the values, motives and practices of those being researched.
The main advantage of observation as compared to a questionnaire survey is that one can obtain richer
and more in-depth information. One is able to capture phenomena, characteristics, activities and other
things impossible to detect by a questionnaire survey. However, there are some weaknesses associated
with the observation method, as shown below:
Cannot control variables in the natural setting
Researcher own values and ethics might affect his objectivity and give rise to observer bias
Failure to observe some activities due to distractions.
Survey
Survey research is often used to assess thoughts, opinions, and feelings. Survey research can be
specific and limited, or it can have more global, widespread goals.
The span of time needed to complete the survey brings us to the two different types of surveys:
cross-sectional and longitudinal.
1. Cross-Sectional Surveys
A cross-sectional survey collects information from respondents at a single point in time.
Cross-sectional surveys usually utilize questionnaires to ask about a particular
topic at one point in time. For instance, a researcher might conduct a cross-sectional survey asking
teenagers' views on cigarette smoking as of May 2010. Sometimes, cross-sectional surveys are
used to identify the relationship between two variables, as in a comparative study. An example of
this is administering a cross-sectional survey about the relationship of peer pressure and cigarette
smoking among teenagers as of May 2010.
2. Longitudinal Surveys
When the researcher attempts to gather information over a period of time or from one point in
time up to another, he is doing a longitudinal survey. The aim of longitudinal surveys is to
collect data and examine the changes in the data gathered. Longitudinal surveys are used
in cohort studies, panel studies and trend studies.
According to Instrumentation
In survey research, the instruments that are utilized can be either a questionnaire or
an interview (either structured or unstructured).
1. Questionnaires
Typically, a questionnaire is a paper-and-pencil instrument that is administered to the respondents.
The usual questions found in questionnaires are closed-ended questions, which are followed by
response options. However, there are questionnaires that ask open-ended questions to explore the
answers of the respondents.
Questionnaires have been developed over the years. Today, questionnaires are utilized in
various survey methods, according to how they are given. These methods include the self-
administered, the group-administered, and the household drop-off. Among the three, the self-
administered survey method is often used by researchers nowadays. The self-administered
questionnaires are widely known as the mail survey method. However, since the response rates
for mail surveys have gone down, questionnaires are now commonly administered online, in
the form of web surveys.
Advantages: Ideal for asking closed-ended questions; effective for market or consumer
research
Disadvantages: Limit the researcher's understanding of the respondent's answers; requires
budget for reproduction of survey questionnaires
2. Interviews
Between the two broad types of surveys, interviews are more personal and probing. Questionnaires
do not provide the freedom to ask follow-up questions to explore the answers of the respondents, but
interviews do.
An interview includes two persons – the researcher as the interviewer, and the respondent as the
interviewee. There are several survey methods that utilize interviews. These are the personal or face-
to-face interview, the phone interview, and more recently, the online interview.
Advantages: Follow-up questions can be asked; provide better understanding of the answers
of the respondents
A questionnaire is a research instrument consisting of a series of questions and other prompts for the
purpose of gathering information from respondents. Although they are often designed
for statistical analysis of the responses, this is not always the case. The questionnaire was invented
by Sir Francis Galton.
At the outset, the researcher must define the population about which he/she wishes to generalise from
the sample data to be collected. For example, in marketing research, researchers often have to decide
whether they should cover only existing users of the generic product type or whether to also include
non-users. Secondly, researchers have to draw up a sampling frame. Thirdly, in designing the
questionnaire we must take into account factors such as the age, education, etc. of the target
respondents.
· personal interviews
· group or focus interviews
· mailed questionnaires
· telephone interviews.
Within this region, the first two methods are used much more extensively than the second pair.
However, each has its advantages and disadvantages. A general rule is that the more sensitive or
personal the information, the more personal the form of data collection should be.
Opening questions: If respondents find the opening question difficult or threatening,
they are likely to break off immediately. If, on the other hand, they find the opening question easy
and pleasant to answer, they are encouraged to continue.
Question flow: Questions should flow in some kind of psychological order, so that one leads easily
and naturally to the next. Questions on one subject, or one particular aspect of a subject, should be
grouped together. Respondents may feel it disconcerting to keep shifting from one topic to another,
or to be asked to return to some subject they thought they gave their opinions about earlier.
Question variety: Respondents become bored quickly and restless when asked similar questions for
half an hour or so. It usually improves response, therefore, to vary the respondent's task from time to
time. An open-ended question here and there (even if it is not analysed) may provide much-needed
relief from a long series of questions in which respondents have been forced to limit their replies to
pre-coded categories. Questions involving showing cards/pictures to respondents can help vary the
pace and increase interest.
Other sources of secondary data include web information and historical data and information.
Nominal Scale is the crudest among all measurement scales but it is also the simplest scale. In this
scale the different scores on a measurement simply indicate different categories. The nominal scale
does not express any values or relationships between variables. The nominal scale is often referred to
as a categorical scale. The assigned numbers have no arithmetic properties and act only as labels. The
only statistical operation that can be performed on nominal scales is a frequency count. We cannot
determine any average except the mode. For example, labeling men as '1' and women as '2', which is the
most common way of labeling gender for data recording purposes, does not mean women are 'twice
something or other' than men. Nor does it suggest that men are somehow 'better' than women.
Ordinal scale
Ordinal Scale involves the ranking of items along the continuum of the characteristic being scaled. In
this scale, the items are classified according to whether they have more or less of a characteristic. The
main characteristic of the ordinal scale is that the categories have a logical or ordered relationship.
This type of scale permits the measurement of degrees of difference (i.e. 'more' or 'less') but not the
specific amount of difference (i.e. how much 'more' or 'less'). This scale is very common in
marketing, satisfaction and attitudinal research. Using ordinal scale data, we can perform statistical
analysis like the median and mode, but not the mean. For example, a fast food home delivery shop may
wish to ask its customers: How would you rate the service of our staff? (1) Excellent (2) Very Good
(3) Good (4) Poor (5) Worst
Interval scale
Interval Scale is a scale in which the numbers are used to rank attributes such that numerically equal
distances on the scale represent equal distance in the characteristic being measured. An interval scale
contains all the information of an ordinal scale, but it also allows one to compare the
difference/distance between attributes. Interval scales may be either in numeric or semantic formats.
The interval scales allow the calculation of averages like the mean, median and mode, and dispersion
measures like the range and standard deviation. For example, the difference between '1' and '2' is equal to the
difference between '3' and '4'. Further, the difference between '2' and '4' is twice the difference
between '1' and '2'. Measuring temperature is an example of an interval scale. But we cannot say 40°C
is twice as hot as 20°C.
Ratio scale
Ratio Scale is the highest level of measurement scales. This has the properties of an interval scale
together with a fixed (absolute) zero point. The absolute zero point allows us to construct a
meaningful ratio. Ratio scales permit the researcher to compare both differences in scores and
relative magnitude of scores. Examples of ratio scales include weights, lengths and times. For
example, the number of customers of a bank's ATM in the last three months is a ratio scale, because
you can compare it with the previous three months. Similarly, the difference between 10 and 15
minutes is the same as the difference between 25 and 30 minutes, and 30 minutes is twice as long as
15 minutes.
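The permissible statistics for each scale can be illustrated with a short Python sketch (the data values below are invented for illustration):

```python
# Sketch: which summary statistics each measurement scale supports.
from statistics import mean, median, mode

# Ordinal: service ratings coded 1 (Excellent) .. 5 (Worst)
ratings = [1, 2, 2, 3, 2, 4, 1, 2]
print(mode(ratings))    # the mode is meaningful for ordinal data
print(median(ratings))  # so is the median
# mean(ratings) would run, but the result has no clear meaning,
# because ordinal codes do not guarantee equal spacing.

# Ratio: ATM wait times in minutes -- mean and ratios are both permitted
waits = [10, 15, 25, 30]
print(mean(waits))           # 20
print(waits[3] / waits[1])   # 30 min is twice 15 min -> 2.0
```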
Attitudes are composed of: 1) beliefs about the subject, 2) emotional feelings (like-dislike), and 3) a
readiness to respond behaviourally, i.e. to buy. "Attitude is defined as the predisposition to respond to
an idea or object, and in marketing it relates to the consumer's predisposition to respond to a
particular product or service."
Likert’s Scale
The Likert scale is extremely popular for measuring attitudes because the method is simple to administer.
With the Likert scale, the respondents indicate their own attitudes by checking how strongly they
agree or disagree with carefully worded statements that range from very positive to very negative
towards the attitudinal object. Respondents generally choose from five alternatives (say strongly
agree, agree, neither agree nor disagree, disagree, strongly disagree). A Likert scale may include a
number of items or statements. A disadvantage of the Likert scale is that it takes longer to complete
than other itemised rating scales because respondents have to read each statement. Despite this
disadvantage, the scale has several advantages: it is easy to construct, administer and use.
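A hypothetical scoring routine for a small Likert battery might look like this in Python (the agreement labels, items and reverse-scoring choice are illustrative assumptions, not from the text):

```python
# Sketch of scoring a 5-point Likert item battery. Negatively worded
# items are reverse-scored so a high total always means a favourable attitude.
AGREEMENT = {"strongly agree": 5, "agree": 4, "neither": 3,
             "disagree": 2, "strongly disagree": 1}

def score(responses, reverse=()):
    """Sum item scores, reverse-scoring the item indexes in `reverse`."""
    total = 0
    for i, answer in enumerate(responses):
        value = AGREEMENT[answer]
        if i in reverse:
            value = 6 - value  # 5 becomes 1, 4 becomes 2, ...
        total += value
    return total

# Items 0-1 are positive; item 2 is negatively worded (e.g. "The staff were rude")
print(score(["agree", "strongly agree", "strongly disagree"], reverse={2}))  # 14
```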
Semantic Differential Scale
This is a seven point rating scale with end points associated with bipolar labels (such as good and
bad, complex and simple) that have semantic meaning. It can be used to find whether a respondent
has a positive or negative attitude towards an object. It has been widely used in comparing brands,
products and company images. It has also been used to develop advertising and promotion strategies
and in a new product development study.
Thurstone scale
The Thurstone scale (the method of equal-appearing intervals) presents respondents with a set of
statements about the attitude object, each carrying a scale value assigned in advance by a panel of
judges; a respondent's attitude score is the average scale value of the statements with which he or she
agrees.
Multi-Dimensional Scaling
Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual
cases of a dataset. It refers to a set of related ordination techniques used in information
visualization, in particular to display the information contained in a distance matrix.
Module 5
Editing
Editing is the process of checking and adjusting the data for omissions, legibility, and consistency.
Editing may be differentiated from coding, which is the assignment of numerical scales or classifying
symbols to previously edited data.
The purpose of editing is to ensure the completeness, consistency, and readability of the data to be
transferred to data storage. The editor's task is to check for errors and omissions on the questionnaires
or other data collection forms.
Information gathered during data collection may lack uniformity. For example, data collected through
questionnaires and schedules may have answers that are not ticked at the proper places, or some
questions may be left unanswered. Sometimes information is given in a form that needs
reconstruction into a category designed for analysis, e.g., converting daily/monthly income into
annual income, and so on. The researcher has to decide how to edit it.
Types:
1. Field Editing
• Preliminary editing by a field supervisor on the same day as the interview to catch
technical omissions, check legibility of handwriting, and clarify responses that are
logically or conceptually inconsistent.
2. In-house Editing
• Editing performed by a central office staff; often done more rigorously than field
editing.
Pitfalls of Editing
• Allowing subjectivity to enter into the editing process. Data editors should be
intelligent, experienced, and objective.
• Failing to have a systematic procedure for assessing the questionnaires developed by
the research analyst. An editor should have clearly defined decision rules to follow.
Pretesting Edit
• Editing during the pretest stage can prove very valuable for improving questionnaire
format and identifying poor instructions or inappropriate question wording.
Coding
Coding is translating answers into numerical values or assigning numbers to the various categories of
a variable to be used in data analysis. Coding is done by using a code book, code sheet, and a
computer card. Coding is done on the basis of the instructions given in the codebook. The code book
gives a numerical code for each variable.
Manual processing is employed when qualitative methods are used, when a small sample is used in a
quantitative study, when the questionnaire/schedule has a large number of open-ended questions, or
when access to computers is difficult or inappropriate. Coding is done in manual processing also.
Example: Male - Code 1, Female - Code 2.
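The Male-Code 1 / Female-Code 2 example can be sketched as a small codebook in Python (the income categories and their codes are invented for illustration):

```python
# Sketch of a codebook applied during coding; only the gender codes
# follow the text, the income codes are hypothetical.
CODEBOOK = {
    "gender": {"Male": 1, "Female": 2},
    "income": {"< Rs. 500": 1, "Rs. 500-1000": 2, "> Rs. 1000": 3},
}

def code_response(variable, answer):
    """Translate one raw answer into its numerical code."""
    return CODEBOOK[variable][answer]

raw = [("gender", "Female"), ("income", "< Rs. 500")]
coded = [code_response(var, ans) for var, ans in raw]
print(coded)  # [2, 1]
```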
Classification
Distribution of data as a form of classification of scores obtained for the various categories or a
particular variable. There are four types of distributions:
Frequency distribution
Percentage distribution
Cumulative distribution
Statistical distribution
Ungrouped: Here, the scores are not collapsed into categories, e.g., distribution of ages of the
students of a BJ (MC) class, each age value (e.g., 18, 19, 20, and so on) will be presented
separately in the distribution.
Grouped: Here, the scores are collapsed into categories, so that two or three scores are presented
together as a group. For example, in the above age distribution, groups like 18-20, 21-22, etc., can
be formed.
Percentage distribution: It is also possible to give frequencies not in absolute numbers but in
percentages. For instance, instead of saying 200 respondents out of a total of 2000 had a monthly
income of less than Rs. 500, we can say 10% of the respondents had a monthly income of less than
Rs. 500.
Cumulative distribution: It tells how often the value of the random variable is less than or equal
to a particular reference value
Statistical distribution: In this type of data distribution, some measure of average is found for a
sample of respondents. Several kinds of averages are available (mean, median, mode) and the
researcher must decide which is most suitable to his purpose. Once the average has been calculated,
the question arises: how representative a figure is it, i.e., how closely are the answers bunched
around it?
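The frequency, percentage and cumulative distributions described above can be sketched in Python; the 200-out-of-2000 figure follows the income example in the text, while the other counts are invented:

```python
# Sketch: frequency, percentage and cumulative distributions.
from collections import Counter

incomes = (["< Rs. 500"] * 200 + ["Rs. 500-1000"] * 1400 + ["> Rs. 1000"] * 400)

freq = Counter(incomes)                              # frequency distribution
total = sum(freq.values())
pct = {k: 100 * v / total for k, v in freq.items()}  # percentage distribution
print(pct["< Rs. 500"])  # 10.0 -> "10% earn less than Rs. 500"

# cumulative distribution over the ordered categories
order = ["< Rs. 500", "Rs. 500-1000", "> Rs. 1000"]
running, cum = 0, {}
for cat in order:
    running += freq[cat]
    cum[cat] = running
print(cum["Rs. 500-1000"])  # 1600 respondents earn Rs. 1000 or less
```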
Tabulation
After editing, which ensures that the information on the schedule is accurate and categorized in a
suitable form, the data are put together in some kind of table and may also undergo other forms of
statistical analysis. Tables can be prepared manually and/or by computer. For a small study of 100 to
200 persons, there may be little point in tabulating by computer, since this necessitates putting the
data on punched cards. But for a survey analysis involving a large number of respondents and
requiring cross-tabulation of more than two variables, hand tabulation would be inappropriate and
time-consuming.
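A hand cross-tabulation of two variables can be sketched as follows (the gender-by-brand responses are invented for illustration):

```python
# Sketch of a two-way tabulation (cross-tab) built by counting cells.
from collections import Counter

responses = [("Male", "Brand A"), ("Male", "Brand B"), ("Female", "Brand A"),
             ("Female", "Brand A"), ("Male", "Brand A"), ("Female", "Brand B")]

table = Counter(responses)  # cell counts of the two-way table
for gender in ("Male", "Female"):
    row = [table[(gender, brand)] for brand in ("Brand A", "Brand B")]
    print(gender, row)
```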
Uses of tables
Tables are useful to the researchers and the readers in three ways:
Validation
Data validation ensures that the survey questionnaires are complete and present consistent data.
In this step, questions that were not answered by most respondents should not be included in the
data analysis, as this would bias the results. For incomplete questionnaires, the researcher must
count the actual number of respondents who were able to answer a particular question, and do the
same for the rest of the questions.
Interpretation
Interpretation is the process by which sense and meaning are made of the data gathered in qualitative
research, and by which the emergent knowledge is applied to problems.
Inferential statistics draws conclusions about a larger population from the sample data; it can be
contrasted with descriptive statistics, which is solely concerned with properties of the observed data
and does not assume that the data came from a larger population.
A research report is a document prepared by an analyst or strategist who is part of the investment
research team. A research report may focus on a specific stock or industry sector, a currency,
commodity or fixed-income instrument, or even on a geographic region or country. Research reports
generally, but not always, have "actionable" recommendations.
Reports keep a record that can be used if the same situation recurs. Reports also provide objective
recommendations on any problem. Hence the skill of report writing is as important as good raw
material and equipment for running an industry or a business efficiently. An efficient executive needs
to possess this skill if he wants to rise up the corporate ladder; it helps him to perform his functions
of planning and of evaluating men and material resources efficiently.
Other forms
– Dissertations and theses
– Dissertation and thesis proposals
– Journal articles
– Conference papers
– Conference paper proposals
– Reports for policy makers and schools
Report structure
A. Preliminary Section
1. Title Page
2. Acknowledgments (if any)
3. Table of Contents
4. List of Tables (if any)
5. List of Figures (if any)
6. Abstract
B. Main Body
1. Introduction
a. Statement of the Problem
b. Significance of the Problem (and historical background)
c. Purpose
d. Statement of Hypothesis
e. Assumptions
f. Limitations
g. Definition of Terms
2. Review of Related Literature
3. Design of the Study
4. Analysis of Data (text with appropriate tables and figures)
5. Summary and Conclusions
C. Reference Section
1. End Notes (if in that format of citation)
2. Bibliography or Literature Cited
3. Appendix
Title: Be specific. Tell what, when, where, etc. In one main title and a subtitle, give a clear idea
of what the paper investigated.
Acknowledgment: Include only if special help was received from an individual or group.
Abstract: Summarizes the report including the hypotheses, procedures, and major findings.
Introduction: Sections may be combined in short reports.
Purpose: What is the goal to be gained from a better understanding of this question?
Statement of the Hypothesis: In one statement (not a question) declare the question which is
investigated and the expected results. (For a null hypothesis, no difference is predicted.)
Assumptions: Explain everything that is assumed in order for the investigation to be undertaken.
Limitations: Explain the limitations that may invalidate the study or make it less than accurate.
Definition of Terms: Define or clarify any term or concept that is used in the study in a non-
traditional manner or in only one of many interpretations.
Review of Related Literature: Gives the reader the necessary background to understand the study
by citing the investigations and findings of previous researchers and documents the researcher's
knowledge and preparation to investigate the problem.
Design of the Study: Gives the reader the information necessary to exactly replicate (repeat) the
study with new data, or, if the same raw data were available, to duplicate the results. This is written
in the past tense but without reference to or inclusion of the results determined from the analysis.
Description of the Research Design and Procedures Used: Completely explain step-by-step what
was done.
Sources of Data: Give complete information about who, what, when, where, and how the data
was collected.
Sampling Procedures: Explain how the data was limited to the amount which was gathered. If all
of the available data were not utilized, how was a representative sample achieved?
Methods and Instruments of Data Gathering: Explain the procedures for obtaining the data
collected. Include the forms or manner by which it was recorded.
Statistical Treatment: Explain the complete mathematical procedures used in analyzing the data
and determining the significance of the results.
Analysis of Data: Describe the patterns observed in the data. Use tables and figures to help
clarify the material when possible.
Summary and Conclusions: This section condenses the previous sections, succinctly presents the
results concerning the hypotheses, and suggests what else can be done.
Description of the Procedures: This is a brief reiteration of important elements of the design of
the study.
Major Findings: The final results from the analysis are presented, the hypothesis stated, and the
decision about the rejection or the failure to reject the hypothesis is given.
Recommendations for Further Investigation: From the knowledge and experience gained in
undertaking this particular study, how might the study have been improved, or what other possible
hypotheses might be investigated?
End Notes: These are like footnotes but are located at the back rather than the bottom of each
page. These would include all of the references for all works cited in the Review of Related
Literature or any other sections of the report as well as the references for quotations, either direct
or indirect, taken from other sources, or any footnote comments that might have been included.
These are listed in numeric order as presented in the text.
Bibliography or Literature Cited: These are the bibliographic reference for each of the works
cited in the End Notes.
Appendix: Any tables, figures, forms, or other materials that are not totally central to the analysis
but that need to be included are placed in the Appendix.
Module 6
Hypothesis: Meaning
A hypothesis is a tentative, testable statement that predicts the relationship between two or more
variables; it is formulated before the data are collected and is accepted or rejected on the basis of
the evidence.
Types
Simple hypothesis - this predicts the relationship between a single independent variable
(IV) and a single dependent variable (DV)
For example:
Lower levels of exercise postpartum (IV) will be associated with greater weight retention
(DV).
NB.
IV = independent variable
DV = dependent variable
Complex hypothesis - this predicts the relationship between two or more independent
variables and two or more dependent variables.
Hypotheses can be stated in various ways as long as the researcher specifies or implies the
relationship that will be tested.
For example:
Lower levels of exercise postpartum are associated with greater weight retention.
There is a relationship between level of exercise postpartum and weight retention.
The greater the level of exercise postpartum, the lower the weight retention.
Women with different levels of exercise postpartum differ with regard to weight retention.
Weight retention postpartum decreases as the woman's level of exercise increases.
Women who exercise vigorously postpartum have lower weight retention than women who
do not.
Directional hypotheses
These are usually derived from theory.
They may imply that the researcher is intellectually committed to a particular outcome.
They specify the expected direction of the relationship between variables i.e. the researcher predicts
not only the existence of a relationship but also its nature.
Non-directional hypotheses
Used when there is little or no theory, or when findings of previous studies are contradictory.
They may imply impartiality.
Do not stipulate the direction of the relationship.
The dependent variable is measured to examine the effect created by the independent variable.
Null hypotheses
These are used when the researcher believes there is no relationship between two variables or when
there is inadequate theoretical or empirical information to state a research hypothesis
Null hypotheses can be:
simple or complex;
associative or causal.
Testable hypotheses
Characteristics
Source
1. General Culture:
The general pattern of culture helps not only to formulate a hypothesis, but also to guide its
trend. The culture has a great influence upon the thinking process of people and hypothesis
may be formed to test one or more of these ideas.
2. Scientific Theory:
The knowledge of theory leads to form further generalizations from it. These generalizations
form the part of hypothesis.
3. Analogies:
Sometimes a hypothesis is formed from the analogy. A similarity between two phenomena is
observed and a hypothesis is formed to test whether the two phenomena are similar in any
other respect.
Formulation of Hypothesis
Once the research question has been identified, it is time to formulate the hypothesis. While the
research question is broad and includes all the variables one wants to consider, the hypothesis is a
statement of the specific relationship one expects to find from an examination of these variables.
When formulating the hypothesis(es), there are a few things to keep in mind. Good hypotheses meet
the following criteria:
1) They identify the independent and dependent variables to be studied.
2) They specify the nature of the relationship that exists between these variables.
3) They are simple (often referred to as parsimonious). It is better to be concise than
long-winded, and better to have several simple hypotheses than one complicated
hypothesis.
4) They do not include reference to specific measures.
5) They do not refer to specific statistical procedures that will be used in the analysis.
Errors in Hypothesis
Type I error: Rejecting the null hypothesis when it is in fact true is called a Type I error.
Before doing a hypothesis test, the researcher decides on a maximum p-value at which the null
hypothesis will be rejected. This value is often denoted α (alpha) and is also called the significance
level. When a hypothesis test results in a p-value that is less than the significance level, the result of
the hypothesis test is called statistically significant.
Type II error: Not rejecting the null hypothesis when in fact the alternate hypothesis is true is called
a Type II error.
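The meaning of the significance level α can be illustrated by simulation: when the null hypothesis is true, a test at α = .05 should commit a Type I error about 5% of the time. A rough Python sketch (the sample size, seed and trial count are arbitrary choices):

```python
# Sketch: simulating the Type I error rate of a two-tailed z-test
# when H0 is true (mu = 0, known sigma = 1).
import random
from math import sqrt

random.seed(42)
alpha_rejections, trials, n = 0, 2000, 30
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]   # H0 really holds
    z = (sum(sample) / n) / (1 / sqrt(n))             # standardized sample mean
    if abs(z) > 1.96:                                 # two-tailed, alpha = .05
        alpha_rejections += 1
print(alpha_rejections / trials)  # close to 0.05
```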
If the information about the population is completely known by means of its parameters, then the
statistical test is called a parametric test, e.g. t-test, F-test, Z-test, ANOVA.
If there is no knowledge about the population or its parameters, but it is still required to test a
hypothesis about the population, the test is called a non-parametric test, e.g. Mann-Whitney U test,
rank sum test, Kruskal-Wallis test.
T-Test
A t-test is any statistical hypothesis test in which the test statistic follows
a Student's t distribution if the null hypothesis is supported. It can be used to determine if two
sets of data are significantly different from each other, and is most commonly applied when the
test statistic would follow a normal distribution if the value of a scaling term in the test statistic
were known. When the scaling term is unknown and is replaced by an estimate based on
the data, the test statistic (under certain conditions) follows a Student's t distribution.
A two-sample t-test examines whether two samples are different and is commonly used when the
variances of two normal distributions are unknown and when an experiment uses a small sample
size. For example, a t-test could be used to compare the average floor routine score of the U.S.
women's Olympic gymnastic team to the average floor routine score of China's women's team.
The t-test, and any statistical test of this sort, consists of three steps:
1. Define the null and alternate hypotheses,
2. Calculate the t-statistic for the data,
3. Compare tcalc to the tabulated t-value for the appropriate significance level and degrees of
freedom. If tcalc > ttab, we reject the null hypothesis and accept the alternate hypothesis;
otherwise, we fail to reject the null hypothesis.
The t-test can be used to compare a sample mean to an accepted value (a population mean), or it
can be used to compare the means of two sample sets.
t-test to Compare One Sample Mean to an Accepted Value
t-test to Compare Two Sample Means
t-test to Compare One Sample Mean to an Accepted Value
In the example, the mean of the arsenic concentration measurements was m = 4 ppm, for n = 7, with
sample standard deviation s = 0.9 ppm. We established suitable null and alternative hypotheses:
Null Hypothesis H0: μ = μ0
Alternate Hypothesis HA: μ > μ0
where μ0 = 2 ppm is the allowable limit and μ is the population mean of the measured soil
(refresher on the difference between sample and population means).
We have already seen how to do the first step, and have null and alternate hypotheses. The
second step involves the calculation of the t-statistic for one mean, using the formula:
tcalc = (m − μ0) / (s / √n)
where s is the standard deviation of the sample, not the population standard deviation. In our
case,
tcalc = (4 − 2) / (0.9 / √7) = 5.88
For the third step, we need a table of tabulated t-values for significance level and degrees of
freedom, such as the one found in your lab manual or most statistics textbooks. Referring to a
table for a 95% confidence limit for a 1-tailed test, we find tν=6,95% = 1.94. (The difference
between 1- and 2-tailed distributions was covered in a previous section.)
We are now ready to accept or reject the null hypothesis. If tcalc > ttab, we reject the null
hypothesis. In our case, tcalc = 5.88 > ttab = 1.94, so we reject the null hypothesis and say that our
sample mean is indeed larger than the accepted limit, and not due to random chance, so we can
say that the soil is indeed contaminated.
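The arithmetic of the arsenic example can be checked with a few lines of Python:

```python
# Sketch: one-sample t statistic for the arsenic example
# (mean 4 ppm, s = 0.9 ppm, n = 7, allowable limit mu0 = 2 ppm).
from math import sqrt

m, s, n, mu0 = 4.0, 0.9, 7, 2.0
t_calc = (m - mu0) / (s / sqrt(n))
print(round(t_calc, 2))  # 5.88, which exceeds the 1-tailed critical value
```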
t-test to Compare Two Sample Means
The method for comparing two sample means is very similar. The only two differences are the
equation used to compute the t-statistic and the degrees of freedom for choosing the tabulated
t-value. The formula is given by
tcalc = (m1 − m2) / √(s1²/n1 + s2²/n2)
In this case, we require two separate sample means, standard deviations and sample sizes. The
number of degrees of freedom is computed using the Welch-Satterthwaite formula
ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
and the result is rounded to the nearest whole number. Once these quantities are determined, the
same three steps for determining the validity of a hypothesis are used for two sample means.
Problem: Sam Sleepresearcher hypothesizes that people who are allowed to sleep for
only four hours will score significantly lower than people who are allowed to sleep for eight
hours on a cognitive skills test. He brings sixteen participants into his sleep lab and randomly
assigns them to one of two groups. In one group he has participants sleep for eight hours and in
the other group he has them sleep for four. The next morning he administers the SCAT (Sam's
Cognitive Ability Test) to all participants. (Scores on the SCAT range from 1-9 with high scores
representing better performance).
SCAT scores
8 hours sleep group (X): 5, 7, 5, 3, 5, 3, 3, 9
4 hours sleep group (Y): 8, 1, 4, 6, 6, 4, 1, 2
x   (x − Mx)²   y   (y − My)²
5       0       8      16
7       4       1       9
5       0       4       0
3       4       6       4
5       0       6       4
3       4       4       0
3       4       1       9
9      16       2       4
Mx = 5, Σ(x − Mx)² = 32;  My = 4, Σ(y − My)² = 46
tcalc = (5 − 4) / √[ ((32 + 46)/14) · (1/8 + 1/8) ] ≈ 0.85
(according to the t significance/probability table with df = 14, t must be at least 2.145 to reach
p < .05, so this difference is not statistically significant)
Interpretation: Sam's hypothesis was not confirmed. He did not find a significant difference
between those who slept for four hours versus those who slept for eight hours on cognitive test
performance.
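The pooled two-sample t statistic for Sam's data can be computed as follows (a sketch using the standard pooled-variance formula for two groups of eight):

```python
# Sketch: pooled two-sample t-test for Sam's sleep experiment.
from math import sqrt

x = [5, 7, 5, 3, 5, 3, 3, 9]  # 8 hours sleep group
y = [8, 1, 4, 6, 6, 4, 1, 2]  # 4 hours sleep group

mx, my = sum(x) / len(x), sum(y) / len(y)
ssx = sum((v - mx) ** 2 for v in x)   # sum of squared deviations, group X
ssy = sum((v - my) ** 2 for v in y)   # sum of squared deviations, group Y
df = len(x) + len(y) - 2              # 14
sp2 = (ssx + ssy) / df                # pooled variance
t_calc = (mx - my) / sqrt(sp2 * (1 / len(x) + 1 / len(y)))
print(round(t_calc, 2))  # 0.85, well below the critical 2.145
```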
Z-Test
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution. Because of the central limit theorem,
many test statistics are approximately normally distributed for large samples. For each
significance level, the Z-test has a single critical value (for example, 1.96 for 5% two tailed)
which makes it more convenient than the Student's t-test which has separate critical values for
each sample size.
Therefore, many statistical tests can be conveniently performed as approximate Z-tests if the
sample size is large or the population variance known. If the population variance is unknown
(and therefore has to be estimated from the sample itself) and the sample size is not large (n <
30), the Student's t-test may be more appropriate.
If T is a statistic that is approximately normally distributed under the null hypothesis, the next
step in performing a Z-test is to estimate the expected value θ of T under the null hypothesis, and
then obtain an estimate s of the standard deviation of T. After that, the standard
score Z = (T − θ) / s is calculated, from which one-tailed and two-tailed p-values can be
calculated as Φ(−Z) (for upper-tailed tests), Φ(Z) (for lower-tailed tests) and 2Φ(−|Z|) (for two-
tailed tests), where Φ is the standard normal cumulative distribution function.
Problem:
A principal at a certain school claims that the students in his school are of above average
intelligence. A random sample of thirty students' IQ scores has a mean score of 112. Is there
sufficient evidence to support the principal's claim? The mean population IQ is 100 with
a standard deviation of 15.
Step 1: State the null hypothesis. The accepted fact is that the population mean is 100, so:
H0: μ = 100.
Step 2: State the alternate hypothesis. The claim is that the students have above average IQ
scores, so:
H1: μ > 100.
The fact that we are looking for scores "greater than" a certain point means that this is a one-
tailed test.
Step 3: Draw a picture to help you visualize the problem.
Step 4: State the alpha level. If you aren't given an alpha level, use 5% (0.05).
Step 5: Find the rejection region area (given by your alpha level above) from the z-table. An area
of .05 corresponds to a z-score of 1.645.
Step 6: Find the test statistic: z = (112 − 100) / (15/√30) ≈ 4.38.
Step 7: Since 4.38 is greater than 1.645, reject the null hypothesis; there is sufficient evidence to
support the principal's claim.
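The test statistic for this example works out as follows (a quick Python check of the arithmetic):

```python
# Sketch: z statistic for the IQ example
# (n = 30, sample mean 112, population mu = 100, sigma = 15).
from math import sqrt

z = (112 - 100) / (15 / sqrt(30))
print(round(z, 2))  # 4.38, far beyond the 1.645 cut-off, so reject H0
```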
F-Test
The f statistic, also known as an f value, is a random variable that has an F distribution.
An F-test is any statistical test in which the test statistic has an F-distribution under the null
hypothesis. It is most often used when comparing statistical models that have been fitted to
a data set, in order to identify the model that best fits the population from which the data were
sampled. Exact "F-tests" mainly arise when the models have been fitted to the data using least
squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher.
Fisher initially developed the statistic as the variance ratio in the 1920s.
f = (s1²/σ1²) / (s2²/σ2²)
f = (s1² · σ2²) / (s2² · σ1²)
f = (χ1²/v1) / (χ2²/v2)
f = (χ1² · v2) / (χ2² · v1)
where σ1 is the standard deviation of population 1, s1 is the standard deviation of the sample
drawn from population 1, σ2 is the standard deviation of population 2, s2 is the standard deviation
of the sample drawn from population 2, χ1² is the chi-square statistic for the sample drawn from
population 1, v1 is the degrees of freedom for χ1², χ2² is the chi-square statistic for the sample
drawn from population 2, and v2 is the degrees of freedom for χ2². Note that degrees of
freedom v1 = n1 − 1 and degrees of freedom v2 = n2 − 1.
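In the common case where the two populations are assumed to have equal variances, the σ terms cancel and f reduces to the ratio of the two sample variances. A minimal Python sketch with invented data:

```python
# Sketch: F statistic as a ratio of two sample variances
# (equal-population-variance case; the data are invented).
from statistics import variance

sample1 = [2, 4, 6, 8, 10]   # s1^2 = 10
sample2 = [1, 2, 3, 4, 5]    # s2^2 = 2.5
f = variance(sample1) / variance(sample2)
print(f)  # 4.0, compared against F tables with v1 = 4, v2 = 4
```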
Problem 1
In one state, 52% of the voters are Republicans, and 48% are Democrats. In a second state, 47%
of the voters are Republicans, and 53% are Democrats. Suppose a simple random sample of 100
voters is surveyed from each state.
What is the probability that the survey will show a greater percentage of Republican voters in the
second state than in the first state?
(A) 0.04
(B) 0.05
(C) 0.24
(D) 0.71
(E) 0.76
Solution
The correct answer is C. For this analysis, let P1 = the proportion of Republican voters in the first
state, P2 = the proportion of Republican voters in the second state, p1 = the proportion of
Republican voters in the sample from the first state, and p2 = the proportion of Republican voters
in the sample from the second state. The number of voters sampled from the first state (n1) = 100,
and the number of voters sampled from the second state (n2) = 100.
Make sure the sample size is big enough to model differences with a normal population.
Because n1P1 = 100 * 0.52 = 52, n1(1 - P1) = 100 * 0.48 = 48, n2P2 = 100 * 0.47 = 47, and
n2(1 - P2) = 100 * 0.53 = 53 are each greater than 10, the sample size is large enough.
Find the mean of the difference in sample proportions: E(p1 − p2) = P1 − P2 = 0.52 − 0.47 =
0.05.
Find the standard deviation of the difference: σd = √[ P1(1 − P1)/n1 + P2(1 − P2)/n2 ] =
√[ (0.52)(0.48)/100 + (0.47)(0.53)/100 ] ≈ 0.0706.
Find the probability. This problem requires us to find the probability that p1 is less than
p2. This is equivalent to finding the probability that p1 − p2 is less than zero. To find this
probability, we need to transform the random variable (p1 − p2) into a z-score. That
transformation appears below.
z = (x − μp1−p2) / σd = (0 − 0.05)/0.0706 = −0.7082
Using Stat Trek's Normal Distribution Calculator, we find that the probability of a z-
score being -0.7082 or less is 0.24.
Therefore, the probability that the survey will show a greater percentage of Republican voters in
the second state than in the first state is 0.24.
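The same calculation can be reproduced in Python, using the error function to evaluate the normal CDF instead of a calculator:

```python
# Sketch: probability that the sample shows more Republicans in state 2,
# using Phi(z) = (1 + erf(z / sqrt(2))) / 2 for the normal CDF.
from math import sqrt, erf

p1, p2, n1, n2 = 0.52, 0.47, 100, 100
sd = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # approx 0.0706
z = (0 - (p1 - p2)) / sd                            # approx -0.71
prob = 0.5 * (1 + erf(z / sqrt(2)))                 # Phi(z)
print(round(prob, 2))  # 0.24
```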
U-Test
The Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon (MWW), Wilcoxon rank-
sum test(WRS), or Wilcoxon–Mann–Whitney test) is a nonparametric test of the null
hypothesis that two populations are the same against an alternative hypothesis, especially that a
particular population tends to have larger values than the other.
It has greater efficiency than the t-test on non-normal distributions, such as a mixture of normal
distributions, and it is nearly as efficient as the t-test on normal distributions.
The Wilcoxon rank-sum test is not the same as the Wilcoxon signed-rank test, although both are
nonparametric and involve summation of ranks.
Problem:
Researchers have asked several smokers how many cigarettes they had smoked in the previous
day. Here are the data.
Women: 4, 7, 20, 20
Men: 2, 2, 5, 6, 8, 16
The distribution that these data are drawn from is not normal. Is there a difference in the number
of cigarettes smoked per day between the sexes?
Applying the Mann-Whitney U test: U = 6, U′ = 18. U.05(2),4,6 = 22, so we cannot reject H0:
women smoke the same number of cigarettes as men.
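U can be obtained by direct pair counting, which makes the logic of the statistic visible. A Python sketch for the cigarette data (a tie between groups would score 0.5; there happen to be none here):

```python
# Sketch: Mann-Whitney U by counting woman-man pairs.
women = [4, 7, 20, 20]
men = [2, 2, 5, 6, 8, 16]

# Each pair scores 1 if the woman's count is higher, 0.5 on a tie, else 0.
u_women = sum(1.0 if w > m else 0.5 if w == m else 0.0
              for w in women for m in men)
u_men = len(women) * len(men) - u_women   # U + U' = n1 * n2
print(min(u_women, u_men), max(u_women, u_men))  # 6.0 18.0
```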
K-W Test
The Kruskal–Wallis one-way analysis of variance by ranks (named after William
Kruskal and W. Allen Wallis) is a non-parametric method for testing whether samples originate
from the same distribution. It is used for comparing two or more independent samples, which may
have different sample sizes, and extends the Mann–Whitney U test to more than two groups. The
parametric equivalent of the Kruskal-Wallis test is the one-way analysis of variance (ANOVA).
When the null hypothesis of the Kruskal-Wallis test is rejected, at least one sample stochastically
dominates at least one other sample.
1. Rank all data from all groups together; i.e., rank the data from 1 to N ignoring group
membership. Assign any tied values the average of the ranks they would have received
had they not been tied.
2. The test statistic is given by:
H = (N − 1) · Σ_i n_i (r̄_i − r̄)² / Σ_i Σ_j (r_ij − r̄)²
where:
n_i is the number of observations in group i,
r_ij is the rank (among all observations) of observation j from group i,
N is the total number of observations across all groups,
r̄_i is the average rank of the observations in group i,
r̄ = (N + 1)/2 is the average of all the r_ij.
3. If the data contain no ties, the denominator of the expression for H is
exactly N(N² − 1)/12 and r̄ = (N + 1)/2. Thus
H = [12 / (N(N + 1))] · Σ_i n_i r̄_i² − 3(N + 1)
The last formula only contains the squares of the average ranks.
4. If the short-cut formula from the previous point is used, a correction for ties can be applied
by dividing H by 1 − Σ(t³ − t)/(N³ − N), where t is the number of tied values in each group
of tied ranks. When the null hypothesis is rejected, follow-up pairwise comparisons can use
the pooled variance implied by the null hypothesis of the Kruskal-Wallis test in order to
determine which of the sample pairs are significantly different. When performing
multiple sample contrasts or tests, the Type I error rate tends to become inflated, raising
concerns about multiple comparisons.
Problem:
A shoe company wants to know if three groups of workers have different salaries:
Women: 23K, 41K, 54K, 66K, 78K.
Men: 45K, 55K, 60K, 70K, 72K
Minorities: 18K, 30K, 34K, 40K, 44K.
Step 1: Sort the data for all groups/samples into ascending order in one combined set.
18K
23K
30K
34K
40K
41K
44K
45K
54K
55K
60K
66K
70K
72K
78K
Step 2: Assign ranks to the sorted data points. Give tied values the average rank.
18K 1
23K 2
30K 3
34K 4
40K 5
41K 6
44K 7
45K 8
54K 9
55K 10
60K 11
66K 12
70K 13
72K 14
78K 15
Where:
H = 6.72
Step 5: Find the critical chi-square value, with c − 1 degrees of freedom. For 3 − 1 = 2 degrees of
freedom and an alpha level of .05, the critical chi-square value is 5.9915.
Step 6: Compare the H value from Step 4 to the critical chi-square value from Step 5.
If the critical chi-square value is less than the test statistic, reject the null hypothesis that the medians
are equal.
Here the critical value (5.9915) is less than the test statistic (6.72), so we reject the null hypothesis:
there is evidence that the median salaries of the three groups are not all equal.
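The ranking and H computation above can be sketched in a few lines of Python. This is an illustrative implementation (it assigns average ranks to ties, as in Step 1), not part of the original example:

```python
# Kruskal-Wallis H statistic computed by hand for the shoe-company data above.

def kruskal_wallis_h(*groups):
    """Return the Kruskal-Wallis H statistic (tied values get average ranks)."""
    pooled = sorted(x for g in groups for x in g)
    n_total = len(pooled)
    # Map each value to its (average) rank in the pooled, sorted data.
    ranks = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    rank_sums = [sum(ranks[x] for x in g) for g in groups]
    # Short-cut formula: H = 12/(N(N+1)) * sum(Ri^2/ni) - 3(N+1)
    return (12 / (n_total * (n_total + 1))
            * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
            - 3 * (n_total + 1))

women = [23, 41, 54, 66, 78]
men = [45, 55, 60, 70, 72]
minorities = [18, 30, 34, 40, 44]
print(round(kruskal_wallis_h(women, men, minorities), 2))  # 6.72
```

With two degrees of freedom, H = 6.72 exceeds the .05 critical value of 5.9915, matching the hand calculation.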
Statistical Analysis
Statistical analysis is a component of data analytics. In the context of business intelligence (BI),
statistical analysis involves collecting and scrutinizing every single data sample in a set of items from
which samples can be drawn.
Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the
analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical
relationship between them. In order to see if the variables are related to one another, it is
common to measure how those two variables simultaneously change together (see
also covariance).
Bivariate analysis can be helpful in testing simple hypotheses of association and causality –
checking to what extent it becomes easier to know and predict a value for the dependent
variable if we know a case's value of the independent variable.
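As a concrete illustration of measuring how two variables change together, the following sketch computes the sample covariance and Pearson correlation for a small made-up data set (the variable names and values are hypothetical, not from the notes):

```python
# Sample covariance and Pearson correlation for two variables X and Y.

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by the standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

hours_studied = [1, 2, 3, 4, 5]     # hypothetical independent variable X
exam_score = [52, 57, 61, 68, 72]   # hypothetical dependent variable Y

print(covariance(hours_studied, exam_score))
print(correlation(hours_studied, exam_score))
```

A correlation near +1 (as here) means knowing X makes Y highly predictable, which is exactly the point of the paragraph above.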
Chi-Square
A chi-square test, also referred to as a χ² test (infrequently as the chi-squared test), is
any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-
square distribution when the null hypothesis is true.
To review, the chi-square method of hypothesis testing has seven basic steps
1. State the null and research/alternative hypotheses.
2. Specify the decision rule and the level of statistical significance for the test, i.e., .05, .01, or
.001. (A significance level of .01 would mean that the probability of the chi-square value must
be .01 or less to reject the null hypothesis, a more stringent criterion than .05.)
3. Compute the expected frequencies, i.e., the cell counts that would be expected in each cell of
the table if the null hypothesis were true.
4. Compute the chi-square test statistic from the observed and expected frequencies.
5. Determine the degrees of freedom for the table. Then identify the critical value of chi-square
at the specified level of significance and appropriate degrees of freedom.
6. Compare the computed chi-square statistic with the critical value of chi-square; reject the null
hypothesis if the chi-square is equal to or larger than the critical value; fail to reject the null
hypothesis if the chi-square is less than the critical value.
7. State a substantive conclusion, i.e., describe the meaning and importance of the test results in
terms of the historical problem under investigation.
Problem
A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were
classified by gender (male or female) and by voting preference (Republican, Democrat, or
Independent). Results are shown in the contingency table below.
                 Voting Preferences
                 Republican   Democrat   Independent   Row total
Male             200          150        50            400
Female           250          300        50            600
Column total     450          450        100           1000
Is there a gender gap? Do the men's voting preferences differ significantly from the women's
preferences? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis
plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:
State the hypotheses. The first step is to state the null hypothesis and an alternative
hypothesis.
Formulate an analysis plan. For this analysis, the significance level is 0.05. Using
sample data, we will conduct a chi-square test for independence.
Analyze sample data. Applying the chi-square test for independence to sample data, we
compute the degrees of freedom, the expected frequency counts, and the chi-square test
statistic. Based on the chi-square statistic and the degrees of freedom, we determine
the P-value.
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
Er,c = (nr * nc) / n
Χ² = Σ [ (Or,c - Er,c)² / Er,c ] = 16.2
where DF is the degrees of freedom, r is the number of levels of gender, c is the number
of levels of the voting preference, nr is the number of observations from level r of gender,
nc is the number of observations from level c of voting preference, n is the number of
observations in the sample, Er,c is the expected frequency count when gender is
level r and voting preference is level c, and Or,c is the observed frequency count when
gender is level r voting preference is level c.
The P-value is the probability that a chi-square statistic having 2 degrees of freedom is
more extreme than 16.2.
We use the Chi-Square Distribution Calculator to find P(Χ2 > 16.2) = 0.0003.
Interpret results. Since the P-value (0.0003) is less than the significance level (0.05), we
reject the null hypothesis. Thus, we conclude that there is a relationship between
gender and voting preference.
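The expected-count and chi-square computations above can be sketched as follows; the observed counts used here (men: 200 Republican, 150 Democrat, 50 Independent; women: 250, 300, 50) are the ones consistent with the reported Χ² of 16.2:

```python
# Chi-square test for independence, computed from first principles.

observed = [
    [200, 150, 50],   # male: Republican, Democrat, Independent
    [250, 300, 50],   # female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: E[r][c] = (row total * column total) / n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Chi-square statistic: sum over cells of (O - E)^2 / E
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi_square, 1), df)  # 16.2 2
```

The statistic (16.2) and degrees of freedom (2) match the hand calculation, and the corresponding P-value is the 0.0003 quoted above.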
One-Way ANOVA
Formula
F = MSamong / MSwithin
SStotal = Σx² − (Σx)² / N
SSamong = Σ [ (Σxg)² / ng ] − (Σx)² / N
SSwithin = SStotal − SSamong
where:
x = individual observation
Σxg = sum of the observations in group g
r = number of groups
N = total number of observations (all groups)
n = number of observations in a group
Create six columns: "x1", "x1²", "x2", "x2²", "x3", and "x3²"
1. Put the raw data, according to group, in "x1", "x2", and "x3"
2. Square each raw score and enter the result in the adjacent "x1²", "x2²", or "x3²" column
3. Sum each of the six columns
4. Compute Σx (the total of the group sums) and Σx² (the total of the squared-score columns)
5. Calculate the correction term (Σx)² / N
6. Calculate SStotal
7. Calculate SSamong
8. Calculate SSwithin
9. Enter sums of squares into the ANOVA table, and complete the table by calculating:
dfamong, dfwithin, MSamong, and MSwithin, and F
10. Check to see if F is statistically significant on probability table with appropriate degrees
of freedom and p < .05.
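A minimal sketch of this worksheet procedure in Python, reusing the salary figures from the Kruskal-Wallis example purely for illustration:

```python
# One-way ANOVA via the sums-of-squares worksheet described above.

def one_way_anova(groups):
    """Return (SSamong, SSwithin, F) for a list of groups of raw scores."""
    n_total = sum(len(g) for g in groups)            # N
    grand_sum = sum(sum(g) for g in groups)          # Σx
    correction = grand_sum ** 2 / n_total            # (Σx)² / N
    ss_total = sum(x * x for g in groups for x in g) - correction
    ss_among = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_within = ss_total - ss_among                  # step 8
    df_among = len(groups) - 1                       # r - 1
    df_within = n_total - len(groups)                # N - r
    f = (ss_among / df_among) / (ss_within / df_within)
    return ss_among, ss_within, f

groups = [[23, 41, 54, 66, 78], [45, 55, 60, 70, 72], [18, 30, 34, 40, 44]]
ss_among, ss_within, f = one_way_anova(groups)
print(round(f, 2))  # 4.29
```

With 2 and 12 degrees of freedom, F = 4.29 exceeds the .05 critical value (about 3.89), so the group means differ significantly for this data.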
The two-way ANOVA is an extension of the one-way ANOVA. The name "two-way" comes from
the fact that each item is classified in two ways, as opposed to one way.
For example, one way classifications might be: gender, political party, religion, or race. Two
way classifications might be by gender and political party, gender and race, or religion and race.
Each classification variable is called a factor, so there are two factors, each having several
levels within that factor. The factors are called the "row factor" and the "column factor" because
the data is usually arranged into table format. Each combination of a row level and a column
level is called a treatment.
Assumptions
• The populations from which the samples were obtained must be normally or
approximately normally distributed.
• The samples must be independent.
• The variances of the populations must be equal.
• The groups must have the same sample size.
Hypotheses
The null hypotheses for each of the sets are given below.
1. The population means of the first factor are equal. This is like the one-way ANOVA for
the row factor.
2. The population means of the second factor are equal. This is like the one-way ANOVA
for the column factor.
3. There is no interaction between the two factors. This is similar to performing a test for
independence with contingency tables.
Factors
The two independent variables in a two-way ANOVA are called factors. The idea is that there
are two variables, factors, which affect the dependent variable. Each factor will have two or more
levels within it, and the degrees of freedom for each factor is one less than the number of levels.
It is assumed that main effect A has a levels (and df = a − 1), main effect B has b levels (and
df = b − 1), n is the sample size of each treatment, and N = abn is the total sample size. Notice
that the overall degrees of freedom are once again one less than the total sample size.
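The degrees-of-freedom bookkeeping just described can be checked with a short sketch; the design sizes a = 3, b = 2, n = 5 are assumed purely for illustration:

```python
# Degrees of freedom for a balanced two-way ANOVA with factors A and B.
# a, b = number of levels of each factor; n = observations per treatment.

a, b, n = 3, 2, 5          # hypothetical 3 x 2 design, 5 observations per cell
N = a * b * n              # total sample size

df_A = a - 1               # main effect A
df_B = b - 1               # main effect B
df_interaction = (a - 1) * (b - 1)
df_within = a * b * (n - 1)
df_total = N - 1           # one less than the total sample size

# The component dfs partition the total, just as the sums of squares do.
assert df_A + df_B + df_interaction + df_within == df_total
print(df_A, df_B, df_interaction, df_within, df_total)  # 2 1 2 24 29
```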
Source SS Df MS F