Getting Started in Frequencies, Crosstab, Factor and Regression Analysis
Oscar Torres-Reyna
Data Consultant
[email protected]
http://dss.princeton.edu/training/
Case study: intro
Screenshot callouts: search the home page for this dataset; metadata; codebook in two formats.
. tab q5 /*No weights*/

Q5 response                                    Freq.     Percent        Cum.
Barack Obama and Joe Biden, the Democra          481       45.68       45.68
John McCain and Sarah Palin, the Republ          464       44.06       89.74
(VOL) Other/Neither                               21        1.99       91.74
(VOL) Undecided/Don't know/no answer              87        8.26      100.00
. tab q5 [aweight=weight] /*With weights*/

Q5 response                                    Freq.     Percent        Cum.
Barack Obama and Joe Biden, the Democra   504.337749       47.90       47.90
John McCain and Sarah Palin, the Republ   449.487545       42.69       90.58
(VOL) Other/Neither                       20.5570831        1.95       92.53
(VOL) Undecided/Don't know/no answer     78.61762284        7.47      100.00
. tab qa /*No weights*/

A. Gender (DO NOT ASK)        Freq.     Percent        Cum.
Male                            493       46.82       46.82
Female                          560       53.18      100.00
Total                         1,053      100.00
. tab qa [aweight=weight] /*With weights*/

A. Gender (DO NOT ASK)        Freq.     Percent        Cum.
Male                     500.388396       47.52       47.52
Female                   552.611604       52.48      100.00
Total                         1,053      100.00

NOTE: At this point, it is strongly recommended to open a log to keep a record of your work and to extract output. Type:
log using mywork.log
You could also open a do-file by typing doedit and copy your commands there.
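As a minimal sketch of the logging advice, the do-file fragment below builds a tiny made-up dataset, opens a text log, runs the same frequency table with and without analytic weights, and closes the log. The variables gender and wt and their values are illustrative assumptions, not the survey data.

```stata
* Hypothetical toy data: 5 respondents with an illustrative weight
clear
input byte gender double wt
1 1.5
1 1.5
2 0.5
2 0.5
2 1.0
end
label define genderlbl 1 "Male" 2 "Female"
label values gender genderlbl

* Open a plain-text log; replace overwrites mywork.log if it already exists
log using mywork.log, text replace

tab gender                  // no weights
tab gender [aweight=wt]     // analytic weights

* Close the log so the file is complete and can be opened elsewhere
log close
```

With aweights, tabulate rescales the weights so that they sum to the number of observations, which is why the weighted tables still show a total of 1,053.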
Case study: Electoral preferences by gender
Key: frequency, row percentage, column percentage.
Rows: Q5. If the Presidential election were held today and the candidates were Barack...
Columns: A. Gender (DO NOT ASK): Male | Female | Total
Case study: Electoral preferences by age
Key: frequency, row percentage, column percentage (three lines per row).
Rows: Q5. If the Presidential election were held today and the candidates were Barack...
Columns: F1. What is your age?
18-24 | 25-29 | 30-34 | 35-39 | 40-44 | 45-54 | 55-64 | 65 or old | (VOL) No | Total
Barack Obama and Joe 29.355119 26.435913 40.727272 46.595118 38.610873 129.51971 86.169373 102.39238 4.5319886 504.33775
5.82 5.24 8.08 9.24 7.66 25.68 17.09 20.30 0.90 100.00
77.57 50.53 40.92 51.51 37.59 51.68 50.32 43.21 40.00 47.90
John McCain and Sarah 6.2229886 22.18839 54.883049 36.825588 51.046351 99.992283 69.037199 104.76215 4.5295414 449.487545
1.38 4.94 12.21 8.19 11.36 22.25 15.36 23.31 1.01 100.00
16.44 42.42 55.14 40.71 49.69 39.90 40.31 44.21 39.98 42.69
(VOL) Undecided/Don't 2.2672181 3.6879373 1.809561 4.5920698 8.5570854 17.952531 13.264407 24.219596 2.2672179 78.617623
2.88 4.69 2.30 5.84 10.88 22.84 16.87 30.81 2.88 100.00
5.99 7.05 1.82 5.08 8.33 7.16 7.75 10.22 20.01 7.47
Total 37.845325 52.312241 99.540836 90.454747 102.7289 250.600407 171.24932 236.93948 11.328748 1,053
3.59 4.97 9.45 8.59 9.76 23.80 16.26 22.50 1.08 100.00
100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
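A two-way table with this key (frequency, row percentage, column percentage) comes from tabulate with the row and col options. The sketch below uses Stata's shipped auto dataset rather than the survey file, so the variables rep78 and foreign are stand-ins for q5 and the demographic variable.

```stata
* Illustration on Stata's shipped auto dataset (not the survey data)
sysuse auto, clear

* Each cell reports frequency, row percentage and column percentage
tab rep78 foreign, row col

* Column percentages only, suppressing the raw frequencies
tab rep78 foreign, col nofreq
```

Row percentages sum to 100 across each row and column percentages sum to 100 down each column, matching the Total row and column in the tables here.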
Case study: Electoral preferences by educational attainment
Key: frequency, row percentage, column percentage (three lines per row).
Rows: Q5. If the Presidential election were held today and the candidates were Barack...
Columns: F4. What is the highest grade of schooling that you've completed?
8th grade | Some high | High scho | Some coll | College g | Postgradu | (VOL) No | Total
Barack Obama and Joe 2.2991619 3.883265 81.589679 113.53524 169.39657 130.23545 3.3983797 504.33775
0.46 0.77 16.18 22.51 33.59 25.82 0.67 100.00
41.27 33.65 44.32 45.42 44.89 59.52 60.03 47.90
John McCain and Sarah 3.2718681 6.1159475 76.7484051 116.69213 170.30303 74.093841 2.2623235 449.487545
0.73 1.36 17.07 25.96 37.89 16.48 0.50 100.00
58.73 53.00 41.69 46.68 45.13 33.86 39.97 42.69
Case study: Electoral preferences by family income
Key: frequency, row percentage, column percentage (three lines per row).
Rows: Q5. If the Presidential election were held today and the candidates were Barack...
Columns: F13. Finally, just for classification purposes, was your total family income bef...
Less than | $20,000 t | $35,000 t | $50,000 t | $75,000 t | $100,000 | or $150,0 | (VOL) No | Total
Barack Obama and Joe 37.525195 51.14097 72.715849 122.78749 59.632459 69.732723 51.1129092 39.690155 504.33775
7.44 10.14 14.42 24.35 11.82 13.83 10.13 7.87 100.00
60.42 49.40 48.59 57.51 39.05 46.16 44.46 37.63 47.90
John McCain and Sarah 18.630762 39.764056 64.4115908 69.827216 86.023642 68.843117 54.640308 47.346852 449.487545
4.14 8.85 14.33 15.53 19.14 15.32 12.16 10.53 100.00
30.00 38.41 43.04 32.71 56.34 45.57 47.53 44.88 42.69
(VOL) Other/Neither 1.5060026 .88321203 3.2060684 2.5018142 2.1243815 3.0806277 2.200355 5.0546217 20.557083
7.33 4.30 15.60 12.17 10.33 14.99 10.70 24.59 100.00
2.42 0.85 2.14 1.17 1.39 2.04 1.91 4.79 1.95
(VOL) Undecided/Don't 4.4480018 11.739914 9.3136182 18.37691 4.9181423 9.409895 7.01703324 13.3941079 78.617623
5.66 14.93 11.85 23.38 6.26 11.97 8.93 17.04 100.00
7.16 11.34 6.22 8.61 3.22 6.23 6.10 12.70 7.47
Total 62.109961 103.52815 149.64713 213.49343 152.69863 151.06636 114.97061 105.48574 1,053
5.90 9.83 14.21 20.27 14.50 14.35 10.92 10.02 100.00
100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Case study: Electoral preferences by employment status
Key: frequency, row percentage, column percentage (three lines per row).
Rows: Q5. If the Presidential election were held today and the candidates were Barack...
Columns: f8
Employed | Employed | Laid off | Retired | Student | Homemaker | Something | (VOL) No | Total
Barack Obama and Joe 263.30095 36.9693237 17.692466 125.50328 15.486465 16.644394 24.6988275 4.0420475 504.33775
52.21 7.33 3.51 24.88 3.07 3.30 4.90 0.80 100.00
47.30 52.01 67.23 46.14 82.02 27.25 60.77 64.12 47.90
John McCain and Sarah 252.31686 25.723928 6.1500438 112.5963 1.1268505 37.187532 12.123702 2.2623235 449.487545
56.13 5.72 1.37 25.05 0.25 8.27 2.70 0.50 100.00
45.33 36.19 23.37 41.39 5.97 60.89 29.83 35.88 42.69
(VOL) Undecided/Don't 29.558151 6.7386098 2.4747578 27.814172 2.2672181 7.2399743 2.52474 0 78.617623
37.60 8.57 3.15 35.38 2.88 9.21 3.21 0.00 100.00
5.31 9.48 9.40 10.22 12.01 11.85 6.21 0.00 7.47
Total 556.67476 71.08488 26.3172676 272.02643 18.8805338 61.0719 40.639858 6.304371 1,053
52.87 6.75 2.50 25.83 1.79 5.80 3.86 0.60 100.00
100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Case study: Testing for associations (preparing the data)
Before running any test we need to prepare the data by setting to missing any non-valid response (like
“don’t know/no answer/not sure”) unless it is relevant to the question. It is important to ‘clean’ the variables
so that the tests are as accurate as possible. For demographics we will remove non-response items. Here is
a series of commands, grouped by step, to prepare some variables for you to run on your own.
Creating a new variable:
gen age=f1
gen educ=f4
gen income=f13
gen employ=f8
gen gender=qa

Adding variable labels:
label variable age "Age"
label variable educ "Educational attainment"
label variable income "Family income"
label variable employ "Employment status"

Setting no response to missing:
replace age=. if age>8
replace educ=. if educ==8
replace income=. if income==8
replace employ=. if employ==8
Callouts for the recode command: original variable; value 1=1 with label in quotes; value 2=2 with label in quotes; values 3, 4 & 8 = 3 with label in quotes; new variable, name in parentheses.
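The recode command these callouts describe did not survive in this text, so the following is only a sketch of the pattern they describe. The toy data, the label text and the new variable name q5_3cat are all assumptions made for illustration.

```stata
* Hypothetical toy data standing in for the original q5 codes
clear
input byte q5
1
2
3
4
8
1
end

* Keep 1 and 2, collapse 3, 4 and 8 into a single category 3.
* Labels go in quotes; the new variable name goes inside gen().
recode q5 (1=1 "Obama/Biden") (2=2 "McCain/Palin") ///
          (3 4 8=3 "Other/undecided"), gen(q5_3cat)

* Cross-check the mapping of old codes to new codes
tab q5 q5_3cat
```

Using gen() leaves the original variable untouched, so the recode can be checked against it and redone if needed.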
When you have continuous data you need to use descriptive statistics. To start exploring
this option you can use the summarize command, which provides a first look at the data
(number of observations, mean, standard deviation, minimum and maximum values).
Let's take a look at the battery of questions in q8.
To get more than the mean and sd you can use tabstat, which offers several
options (type help tabstat for more details). Notice we use weights here.
In this series of questions ‘0’ means ‘unfavorable’ and ‘100’ means ‘favorable’.
. tabstat x8a x8b x8c x8d x8e x8f x8g x8h x8i x8j, s(mean median sd var count range min max)
stats x8a x8b x8c x8d x8e x8f x8g x8h x8i x8j
mean 55.83044 54.43365 55.87257 49.39961 40.33595 56.01527 55.93845 34.61905 53.96314 60.41824
p50 60 55 60 50 50 60 55 30 55 65
sd 35.31804 31.28831 31.18157 35.33493 24.2347 26.50595 22.22173 31.2718 23.95454 22.56533
variance 1247.364 978.9581 972.2905 1248.557 587.3207 702.5655 493.8053 977.9253 573.82 509.194
N 1038 1040 1028 1036 1018 982 991 1050 1031 1009
range 100 100 100 100 100 100 100 100 100 100
min 0 0 0 0 0 0 0 0 0 0
max 100 100 100 100 100 100 100 100 100 100
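For readers without the survey file, the same summarize and tabstat calls can be tried on Stata's shipped auto dataset. This is a sketch only: price and mpg stand in for the x8 items, and wt is a made-up analytic weight.

```stata
* Illustration on Stata's shipped auto dataset (not the survey data)
sysuse auto, clear

* First look: observations, mean, sd, min and max
summarize price mpg

* More detail for one variable (percentiles, variance, skewness, kurtosis)
summarize price, detail

* Choose the statistics yourself with tabstat
tabstat price mpg, s(mean median sd variance count range min max)

* Weighted version; wt is a hypothetical weight created for this example
gen double wt = 1 + (foreign == 1)
tabstat price mpg [aweight=wt], s(mean sd)
```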
. factor x8a x8b x8c x8d x8e x8f x8g x8h x8i x8j, pcf
(obs=897)

Factor analysis/correlation                 Number of obs    = 897
Method: principal-component factors         Retained factors = 2
Rotation: (unrotated)                       Number of params = 19

Factor       Eigenvalue   Difference   Proportion   Cumulative
Factor1         4.10910      1.89154       0.4109       0.4109
Factor2         2.21756      1.35886       0.2218       0.6327
Factor3         0.85870      0.12199       0.0859       0.7185
Factor4         0.73671      0.15331       0.0737       0.7922
Factor5         0.58340      0.05168       0.0583       0.8505
Factor6         0.53172      0.13910       0.0532       0.9037
Factor7         0.39262      0.11864       0.0393       0.9430
Factor8         0.27398      0.11808       0.0274       0.9704
Factor9         0.15591      0.01559       0.0156       0.9860
Factor10        0.14031            .       0.0140       1.0000

LR test: independent vs. saturated: chi2( 45) = 4884.51 Prob>chi2 = 0.0000

Notes on the output:
• In the command, x8a through x8j are the variables and pcf requests principal-components factoring.
• Eigenvalue is the total variance accounted for by each factor. The sum of all eigenvalues = total number of variables. When some eigenvalues are negative, the sum of eigenvalues = total number of factors (variables) with positive eigenvalues.
• The Kaiser criterion suggests retaining those factors with eigenvalues equal to or higher than 1.
• Proportion indicates the relative weight of each factor in the total variance, since the sum of eigenvalues = total number of variables. For example, 4.109/10=0.4109: the first factor explains 41% of the total variance.
• Cumulative shows the amount of variance explained by a factor together with all the factors before it. For example, factor 1 and factor 2 account for 63% of the total variance.
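The Proportion and Cumulative columns can be checked by hand with display, using the eigenvalues reported in the output. The second half of this sketch makes the Kaiser cutoff explicit with mineigen(1) on Stata's shipped auto dataset; that dataset and its variables are stand-ins, not the survey data.

```stata
* Proportion = eigenvalue / number of variables (10 items in the q8 battery)
display %6.4f 4.10910/10                  // Factor1
display %6.4f 2.21756/10                  // Factor2

* Cumulative share for the two retained factors
display %6.4f (4.10910 + 2.21756)/10

* Kaiser criterion made explicit: retain factors with eigenvalue >= 1.
* Illustration on the auto dataset, not the survey data.
sysuse auto, clear
factor price mpg weight length turn displacement, pcf mineigen(1)
```

The three display lines print 0.4109, 0.2218 and 0.6327, matching the first two rows of the table.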
Factor analysis is a data reduction technique. Question 8 has a battery of questions evaluating
favorability levels for different candidates/politicians.
The pattern matrix here offers a clearer picture of the relevance of each variable in the factor.

Variable     Factor1    Factor2   Uniqueness
x8a          -0.8860     0.2103       0.1709
x8b           0.8780     0.1124       0.2165
x8c          -0.8260     0.2790       0.2399
x8d           0.9285     0.0343       0.1367
x8e          -0.4075     0.6055       0.4674
x8f          -0.0888     0.6869       0.5202
x8g           0.2836     0.5257       0.6432
x8h           0.8513     0.1947       0.2373
x8i           0.0483     0.7245       0.4728
x8j           0.0350     0.6559       0.5686
This is the factor rotation matrix: the orthogonal transformation that rotate applied to factor1 and factor2.

             Factor1    Factor2
Factor1       0.9930    -0.1177
Factor2       0.1177     0.9930
NOTE: If you want the factors to be correlated (oblique rotation) you need to use the option promax after rotate:
rotate, promax
Type help rotate for details.
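A short sketch of the two rotation choices, run on Stata's shipped auto dataset as a stand-in for the q8 battery (the variable list is an assumption for illustration). After an oblique rotation, estat common reports the correlation between the rotated factors.

```stata
* Illustration on Stata's shipped auto dataset (not the survey data)
sysuse auto, clear
factor price mpg weight length turn displacement, pcf factors(2)

* Default rotation: orthogonal varimax, factors stay uncorrelated
rotate

* Oblique rotation: factors are allowed to correlate
rotate, promax

* Correlation matrix of the rotated common factors
estat common
```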
Case study: factor analysis, step 3 (predict)
To create the new variables, after factor and rotate you type predict.
predict x8f1 x8f2 /*Or whatever name you prefer to identify the factors*/
We reduced all ten variables to two: x8f1 and x8f2. There is another way to use these results. We could
create indexes out of each cluster of variables. For example, ‘x8b’, ‘x8d’ and ‘x8h’ define the first factor.
You could aggregate these to create a new variable to measure ‘Republican favorability’. The second
factor is defined by ‘x8e’, ‘x8f’, ‘x8i’ and ‘x8j’, related to ‘government institutions’. Since all variables have
the same valence (they go from 0 to 100), we can create the two new variables by combining the items in each cluster.
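The commands that built the two indexes are not preserved here, so the sketch below shows one possible way to do it: averaging the items in each cluster with egen rowmean. The toy values and the names repfav and instfav are assumptions; a sum or another aggregate would be equally legitimate.

```stata
* Hypothetical toy data on the 0-100 favorability scale
clear
input x8b x8d x8h x8e x8f x8i x8j
80 70 90 40 50 60 50
10 20  0 60 70 80 70
50 50 50 50 50 50 50
end

* Index 1: mean of the items loading on the first factor
egen repfav = rowmean(x8b x8d x8h)

* Index 2: mean of the items loading on the second factor
egen instfav = rowmean(x8e x8f x8i x8j)

list repfav instfav
```

rowmean() averages over the non-missing items in each row, so a respondent who skipped one item still gets an index value. Whether that is desirable depends on the analysis.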
factor x8a x8b x8c x8d x8e x8f x8g x8h x8i x8j, pcf
rotate
predict x8f1 x8f2
factor x25a x25b x25c x25d x25e x25f x25g x25h x25i x25j, pcf
rotate
predict x25f1 x25f2 x25f3
/*Regression*/
• The mean is the sum of the observations divided by the total number of observations.
• The median (p50 in the table above) is the number in the middle. To get the median you have to order
the data from lowest to highest. If the number of cases is odd, the median is the single middle value; for an even
number of cases, the median is the average of the two numbers in the middle.
• The standard deviation is the square root of the variance. It indicates how close the data are to the mean.
Assuming a normal distribution, about 68% of the values are within 1 sd of the mean, about 95% within 2 sd and
about 99.7% within 3 sd.
• The variance measures the dispersion of the data around the mean. It is the average of the squared
distances from the mean (Stata divides the sum of squared distances by N-1 rather than N).
• Count (N in the table) refers to the number of observations per variable.
• Range is a measure of dispersion. It is the difference between the largest and smallest value, max – min.
• Min is the lowest value in the variable.
• Max is the largest value in the variable.
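These definitions can be verified on a five-value example small enough to work out by hand. The data are made up for illustration: the squared distances from the mean sum to 36, so the variance is 36/(5-1) = 9 and the sd is 3.

```stata
* Hypothetical toy data: 2 4 4 5 10
clear
input x
2
4
4
5
10
end

summarize x, detail

display r(mean)           // 5  = (2+4+4+5+10)/5
display r(p50)            // 4  = middle value of the ordered data
display r(Var)            // 9  = 36/(5-1)
display r(sd)             // 3  = sqrt(9)
display r(max) - r(min)   // 8  = range
```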
Exploring data: regression (what to look for)
Let's run the regression:
regress x8a gender age educ income, robust
• x8a is the dependent variable (Y).
• gender, age, educ and income are the independent variables (X).
• The robust option requests standard errors that control for heteroskedasticity.
• Prob > F in the output is the p-value of the model. It indicates the reliability of X to predict Y. Usually we need a p-value lower than 0.05 to show a statistically significant relationship between X and Y.
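To see where the model p-value comes from, the sketch below runs a robust regression on Stata's shipped auto dataset (a stand-in for the survey variables) and recomputes Prob > F and one coefficient's p-value from the stored results.

```stata
* Illustration on Stata's shipped auto dataset (not the survey data)
sysuse auto, clear

* Y = price; X = mpg, weight, foreign; robust standard errors
regress price mpg weight foreign, robust

* Model p-value (Prob > F), recomputed from the stored F statistic
display Ftail(e(df_m), e(df_r), e(F))

* Two-sided p-value for the mpg coefficient, from its t statistic
display 2*ttail(e(df_r), abs(_b[mpg]/_se[mpg]))
```

Both values should agree with the Prob > F line and the P>|t| column in the regression table.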
Type help outreg2 for more details. If you do not see outreg2, you may have to install it by typing ssc install outreg2. If this does not work, type
findit outreg2, select it from the list and click “install”.
Note: If you get an error message when you use the option append or replace, it means that you need to close the Excel/Word window.
You can add more models to compare. Let's say you want to add another model without percent2:
regress csat percent high
Now type the following to export the results to Excel (notice we add the append option):
outreg2 using results, bdec(2) tdec(2) rdec(2) adec(2) alpha(0.001, 0.01, 0.05) addstat(Adj. R-squared, e(r2_a)) excel append