COURSE EXAMPLES AND FILES --- ED161 Winter 2000 D Rogosa

*.dat are ASCII data files.
Output from computer packages (e.g. MINITAB) is typically *.lis, *.out, *.log.
Links in this file take you directly to the specific data or data analysis example.
This file is cumulative; I'll add entries as we move to that material.

(Note: Additional examples not in electronic form will be introduced throughout the course during lectures. Also, additional data sets and output in electronic form for homework assignments, solutions, etcetera will be described in those documents and posted to the course directory at the appropriate point in the course.)


I. Design and Analysis of Comparative Studies (Experiments)


 NAME          DESCRIPTION

mlapair.dat    Paired pre-test post-test data example.
               Story from textbook (MM p517), EXAMPLE 8.3.
               "The National Endowment for the Humanities 
               sponsors summer institutes to improve the skills
               of high school teachers of foreign languages. One such 
               institute hosted 20 French teachers for 4 weeks. At 
               the beginning of the period, the teachers were given 
               the Modern Language Association's (MLA) listening test
               of understanding of spoken French. After 4 weeks of
               immersion in French in and out of class, the listening 
               test was given again. (The actual spoken French in 
               the two tests was different, so that taking the first 
               test should not improve the score on the second test.) 
               The maximum possible score on the test is 36."
mlapair.lis    Analysis of Paired pre-test post-test data 
               example using Minitab.
mlasign.lis    Nonparametric analysis of Paired pre-test
               post-test data via sign test procedures using Minitab.
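
               A minimal sketch of the sign test idea behind mlasign.lis,
               written in Python rather than Minitab. It assumes a
               one-sided alternative (post-test higher than pre-test) and
               drops zero differences, which is a common convention but an
               assumption here. The differences below are placeholders,
               not the values in mlapair.dat.

```python
from math import comb

def sign_test_upper_p(differences):
    """One-sided sign test: P(X >= observed positives) for X ~ Binomial(n, 1/2).

    Zero differences are dropped before counting (an assumed convention).
    Returns (number of positive differences, n used, p-value).
    """
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(1 for d in nonzero if d > 0)
    # Upper tail of Binomial(n, 1/2), computed from exact integer counts.
    tail_count = sum(comb(n, k) for k in range(positives, n + 1))
    return positives, n, tail_count / 2 ** n

# Placeholder post-minus-pre differences, NOT the data in mlapair.dat.
example_diffs = [1, 1, 1, -1, 0]
print(sign_test_upper_p(example_diffs))
```

               To apply this to the course data, replace example_diffs
               with the post-test minus pre-test differences computed
               from mlapair.dat.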


smsg.dat       Used in Part I (and in analysis of covariance).
               Data from a mathematics curriculum evaluation,
               circa 1961. Purpose of the large scale study was
               to compare mathematics achievement in a 
               traditional ninth-grade algebra course with 
               that in an alternative course developed by the 
               School Mathematics Study Group (SMSG). 43 
               teachers from schools across the US 
               participated; by random assignment there were 
               21 SMSG (new math) classrooms with 22 traditional
               math classrooms.
               Columns c1 and c3 contain group indicator 
               variables; c3 = 1 is SMSG classroom and c3 = 0 
               is traditional.
               The post-instruction outcome measure (classroom
               average) on math achievement given at the end of 
               the school year is in c2; this test was a 
               traditional algebra test published by the 
               Cooperative Test division of Educational Testing
               Service.
               In c4 is a pre-instruction ("pre-test") measure 
               of knowledge of number systems.
smsg.lis       Used in Part I review.  Descriptive and
               inferential two-group comparisons for the outcome
               measure (c2) in smsg.dat.

drptwot.dat    Two group comparison example.
               Story from textbook (MM p542), EXAMPLE 8.8.
               "An educator believes that new directed reading activities
               in the classroom will help elementary school pupils 
               improve some aspects of their reading ability. She
               arranges for a third-grade class of 21 students to follow 
               these activities for an 8-week period.  A control classroom 
               of 23 third graders follows the same curriculum without 
               the activities.  At the end of the 8 weeks, all students are
               given a Degree of Reading Power (DRP) test, which measures the
               aspects of reading ability that the treatment is designed to
               improve."  Data are in unstacked form with treatment in C1 and
               control in C2.
drptwot.lis    Two-sample analysis of the two-group comparison
               example using Minitab.


alphatot.tab   Tabulation of total error rate        
               probabilities for c inferences each done at level
               alph:  tot = 1 - (1 - alph)^c  is solved for 
               alph. Mathematica script appended.
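
               The relation tabulated in alphatot.tab can be sketched in
               Python (the original script is Mathematica; this is not a
               copy of it). Solving tot = 1 - (1 - alph)^c for alph gives
               alph = 1 - (1 - tot)^(1/c). The formula treats the c
               inferences as independent; the function names are
               illustrative.

```python
def per_inference_alpha(tot, c):
    """Solve tot = 1 - (1 - alph)**c for alph."""
    if not 0 < tot < 1:
        raise ValueError("tot must lie strictly between 0 and 1")
    if c < 1:
        raise ValueError("c must be at least 1")
    return 1 - (1 - tot) ** (1 / c)

def total_error_rate(alph, c):
    """Total error rate for c inferences, each done at level alph."""
    return 1 - (1 - alph) ** c

# Tabulate the per-inference level for a total error rate of 0.05.
for c in (1, 2, 5, 10):
    alph = per_inference_alpha(0.05, c)
    print(c, round(alph, 5), round(total_error_rate(alph, c), 5))
```

               The per-inference level from this formula is always a
               little larger than the simpler tot / c rule.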


harr.dat       Data obtained from the
               Hopkins&Glass textbook.  Their description is
               "Harrington (1968) experimented with the order
               of 'mental organizers' that structure the 
               material for the learner.  A group of 30 persons
               were randomly split into three groups of 10 each.
               Group I received organizing material before
               studying instructional material on mathematics;
               Group II received the 'organizer' after studying
               the mathematics; Group III received the math 
               materials but no organizing materials.  Scores 
               are from a 10-item mathematics test on the 
               instructional content."
               The data are in "unstacked" form in c1-c3.
harr.lis       One-way anova (MINITAB) on harr.dat.
harr1v.out     BMDP1V output for harr.dat using
               orthogonal contrasts.
hartukey.lis   Minitab implementation of Tukey post-hoc 
               comparison procedures with the harr.dat data.


ibs.dat        Used in Part I.A.1.  
               These are waiting-time data under
               three different protocols.  Data are in stacked 
               form. The actual data are from Ott's text and 
               are described as follows: 
               "Irritable bowel syndrome (IBS) is a non-
               specific intestinal disorder characterized by 
               abdominal pain and irregular bowel habits.  Each
               person in a random sample of 24 patients having 
               periodic attacks of IBS was randomly assigned to
               one of three treatment groups.  The number of 
               hours of relief while on therapy is recorded 
               for each patient." Outcome in c1, group 
               indicator in c2.
ibsbmd7d.log   Part I.A.1. BMDP7D output for ibs.dat. Implements
               Levene's test.  Implements two versions of 
               one-way anova (Welch, Brown-Forsythe) that do 
               not assume equal within-group variances.
ibslev.lis     Part I.A.1.  Gives description of ibs.dat;
               implements in Minitab two forms of Levene's 
               test for equal within-group variances.
ibstrans.log   Part I.A.1.  Carries out (in MINITAB) natural log
               transformation of ibs.dat outcome to stabilize
               variance.  Compares anova on raw and transformed
               data.
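
               A rough Python sketch of the idea behind Levene's test as
               used in ibslev.lis and ibsbmd7d.log: run a one-way anova
               on the absolute deviations of each observation from its
               group center. Centering at the mean and centering at the
               median are two common forms; whether these are exactly the
               two forms Minitab reports is an assumption. The example
               values are made up, not taken from ibs.dat, and no p-value
               is computed (that needs the F distribution).

```python
from statistics import mean, median

def one_way_f(groups):
    """F statistic for a one-way anova; groups is a list of lists."""
    all_values = [y for g in groups for y in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def levene_f(groups, center=median):
    """Levene-type statistic: anova F on |y - center(group)|.

    center=mean gives the mean-centered form,
    center=median the median-centered form.
    """
    deviations = [[abs(y - center(g)) for y in g] for g in groups]
    return one_way_f(deviations)

# Made-up outcome values for three groups, NOT the data in ibs.dat.
example = [[4.0, 5.0, 6.5, 7.0], [2.0, 9.0, 15.0, 21.0], [5.5, 6.0, 6.0, 7.5]]
print(levene_f(example, center=mean), levene_f(example, center=median))
```

               With stacked data such as ibs.dat (outcome in c1, group
               in c2), the outcomes would first be split into one list
               per group.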

clergy.lis     Part I.A.4.  Illustration of Kruskal-Wallis test
               (in MINITAB), non-parametric alternative to 
               one-way anova.  Comparison with standard anova 
               on ranked data.
               Data taken from Ott text: "Three random samples
               of clergymen were drawn: one containing 10
               Methodist ministers, the second containing 10
               Catholic priests, the third containing 10
               Pentecostal ministers.  Each of the clergymen
               was examined with a test to measure
               his knowledge about causes of mental illness."


bakery.dat     3 x 2  fixed effects with 2 replications per cell.
               The Castle Bakery Company 
               supplies wrapped Italian bread to a large number 
               of supermarkets in a metropolitan area. An 
               experimental study was made of the effects of 
               height of the shelf display (factor A: bottom, 
               middle, top in c2) and the width of the shelf 
               display (factor B: regular, wide in c3) on sales 
               of this bakery’s bread during the experimental 
               period (c1, measured in cases). Twelve supermarkets, 
               similar in terms of sales volume and clientele, were 
               utilized in the study. The six treatments were 
               assigned at random to two stores each according to 
               a completely randomized design, and the display of 
               the bread in each store followed the treatment 
               specifications for that store. Sales of the bread 
               were recorded, and these results are presented in 
               bakery.dat.
bakery.lis     Table of cell means and two-way fixed effects anova 
               for bakery.dat.


integ.dat      2 x 2  fixed effects with 50
               replications per cell.  Data obtained from early
               Minitab Handbook which gives the following
               description: "A researcher at Columbia 
               University was interested in the effect of 
               school integration on racial attitudes.  He 
               gave an 'ethnocentrism' test to four groups of
               children: black children in a segregated school,
               white children in a segregated school, black
               children in an integrated school, and white 
               children in an integrated school. 'Ethnocentrism'
               is defined as the tendency of children to prefer
               to associate with, and respect, other children 
               of the same ethnic group to those of another 
               ethnic group.  Thus, students who score high on 
               this test have a stronger preference for their 
               own race."  The data are in stacked form,
               with the test score in c1, schooltype in c2
               (1 = integrated, 2 = segregated) and race in c3
               (1 = black, 2 = caucasian).
integ.lis      Cell means and anova table 
               (from MINITAB) for integ.dat.

scitest.dat    Data collected as part of study designed
               to investigate the feasibility and technical 
               quality of science performance assessments.  Two
               tasks, called Radiation and Rate of Cooling, 
               were developed from a common "task shell"; 
               in other words, they were designed to be as 
               parallel as possible in the science processes
               tested and in the format of stimulus materials 
               and required response.  They can be thought of 
               as two sample tasks from a "universe" of similar,
               parallel tasks.  The investigators treat task as 
               a random factor because they could imagine 
               creating additional tasks out of the task shell 
               from which these two came.  This data set 
               contains the scores of thirty students, assumed 
               to be drawn at random from the population of 
               students, each tested on both tasks.  Three
               raters scored the responses; each paper was 
               scored by two of the three raters.  The students 
               come from three different schools, ten from each.
               Scores are in C1, student ID is in C2, task (1 
               for Radiation, 2 for Rate of Cooling) is in C3, 
               rater is in C4, and school is in C5.
scitest.lis    Minitab output from a 2-way random effects anova
               with outcome the score on the science test, with
               the two random factors being student and task.
               So the design is 30x2 with 2 replications per
               cell.


sunburn.dat    Two-way mixed example; taken from Sunscreen ex.
               Ott p.770
               A corporation is interested in comparing two 
               different sunscreens (s1 and s2).  A random 
               sample of 10 females (ages 20-25 years) 
               participated in the study.  For each person two 
               1" x 1" squares were marked off on either side
               of the back, under the shoulder but above the 
               small of the back.  Sunscreen s1 was randomly 
               assigned to the two squares on one side of the 
               back, with s2 on the other two squares. Exposure
               to the sun was for a two-hour period.
               The outcome was change (postexposure minus 
               preexposure) in a reading based on the color of 
               skin in a square.  So we have 10 levels of the 
               random column factor subjects, two levels of the 
               fixed row factor, sunscreen, and two replications
               per cell.  In file sunburn.dat we have the 
               outcome measure in c1, the type of sunscreen 
               (s1 = 1, s2 = 2) in c2, the person (i.e. female
               tanning subject) in c3.
sunburn.lis    Minitab output for the 
               mixed model analysis of the sunburn.dat data, 
               a 2X10 design with 2 replications per cell.

unbalanc.dat   
               Data for a 2 x 3 fixed effects
               design, having between 1 and 3 replications per 
               cell. The data are shown and described in Table 
               20.1 and section 20.2 of NWK text. The first
               part of this data file has the outcome measure 
               (growth rate in response to therapy) in c4, the 
               row factor (subject gender 1,2) in c1, the column
               factor (degree of depressed development; 
               severe = 1, moderate = 2, mild = 3) in c2, and 
               the replication indicator in c3.  
               This data structure is set up for the GLM 
               approach to the analysis of unbalanced designs.
               The second part of the data file is set up for 
               the application of the approximate analysis based
               on cell means; cell means in c1, row factor in 
               c2, column factor in c3.
unbalanc.log     
               Analyses of the data in unbalanc.dat.
               First is shown the GLM analysis (cf. MTB version 
               7 manual p. 8-27). Second the approximate cell 
               means analysis is constructed and then compared 
               with GLM results.

stress.dat     Data are from a 2x2x2 fixed
               effects design with 3 replications per cell.  
               Data are shown in Table 22.2 and described in 
               Section 22.2 of NWK text.  The outcome 
               measure is exercise tolerance from a stress 
               test in c1, with gender (male = 1, female = 2) 
               in c2, body fat level (low = 1, high = 2) in c3 
               and smoking history (light = 1, heavy = 2) in c4.
stress.lis     Analysis of the 3-way design from
               stress.dat.  Description using versions of 
               MINITAB Table command along with Layout 
               subcommand (cf. MTB version 7 manual pages 
               11-9,11-12).  Three-way analysis of variance 
               using anova command.

*************************** PART II ********************************************
                         CORRELATION and REGRESSION

corr.dat       28 bivariate observations,
               test 1 in c1, test 2 in c2.
corr.out       Simple plotting,
               correlation, and straight-line regression 
               analyses of corr.dat.
corrres.lis    Illustration of different types of
               residual scores using corr.dat data.  
               See NWK text Chap 9 (esp Sec. 9.2).
predict.lis    Illustration of PREDICT subcommand
               (cf. MTB ver 7 manual 7-10,11) using corr.dat.

welfare.dat    Children's Welfare in California.
               Data collected by the Oakland-based 
               "Children Now" from government resources over the
               past four years to comprise a "year-in-the-life"
               composite index of children's welfare.  Data are
               presented on a county-by-county basis.
               c1: County ranking on Welfare index
               c2: Median family income
               c3: Median family income ranking
welfare.lis    Illustrates descriptive univariate analyses
               (stem-and-leaf etc) and correlation and 
               regression analyses and plots.

coleman.dat    Data from the Coleman report used
               to illustrate multiple regression.
               File coleman.dat contains data from a random 
               sample of 20 schools (from the East) from the 
               1966 Coleman Report.
               The outcome measure C7 is the verbal mean test 
               score for all sixth graders in the school.  The 
               predictor variables are:  C2, staff salaries
               per pupil; C3, percent white collar fathers for
               the sixth graders; C4, an SES composite measure
               (deviation) for the sixth graders; C5, mean
               teacher's verbal test score; C6, 6th grade mean
               mother's educational level (1 unit = 2 school yrs).

bodyfat.dat    Data taken from NWK text,
               Table 8.1.  Measurement data in which 3 
               relatively inexpensive methods of assessment 
               are compared with the "gold standard" of 
               accurate measurement.
               Description: "data for a study of the relation of
               the amount of body fat to several possible 
               explanatory, independent variables, based on a 
               sample of 20 healthy females 25-34 years old.  
               The possible independent variables are triceps 
               skinfold thickness, thigh circumference, and 
               midarm circumference."
               c1 has triceps, c2 has thigh, c3 has midarm, c4 
               has amount of body fat.
bodyfat.out    Illustrates multiple regression
               procedures in NWK text Sec. xx, and residual
               diagnostics.


marks.dat      Used in Part II.  Data from 17 students in 
               a prior (many years ago) 2-qtr version of part
               of this course (i.e. Education 250A,B). c2 has 
               the sum of the scores on the six graded homework
               assignments; c1 has the final exam for 250A, c3 
               has the midterm in 250B, and c4 has the outcome 
               score, the final exam in 250B.
marks.log      Uses marks.dat to illustrate
               properties of multiple regression (and partial 
               correlation) coefficients and diagnostics for 
               same via adjusted variables approach.
marksnew.log   Repeats, revises aspects of the 
               marks.log analyses to match partial regression 
               slopes and plots approach in NWK Section 11.1.

nels.dat       Contains a subset of observations and variables
               from the public release data tape for National
               Educational Longitudinal Study of 1988 (NELS:88).
               The National Center for Education Statistics 
               collected data from a representative sample of 
               8th-graders across the U.S. and followed these 
               students through grades 10 and 12.  At each 
               grade, students took several achievement tests 
               and completed surveys that included questions 
               about their academic, family, and social lives.
               The nels.dat data set contains students' 
               10th-grade scores on the science achievement 
               test, along with several variables that are
               hypothesized to be good predictors of 10th-grade
               science achievement.
               Student ID is in C1 and 10th-grade science score
               is in C2.  Four achievement variables from 8th 
               grade are included:  science, reading, math 
               knowledge, and math reasoning (C3-C6).  The 
               math knowledge and math reasoning scores are 
               standardized (they have mean zero, variance
               one).  Indicator variables are included for 
               advanced "track" (i.e., high school program) and
               general track; each student receives a 1 on the 
               variable if he or she is in that program and a 
               0 otherwise.  Students in the academic track 
               receive 0's on both variables.  These are found 
               in C7 and C8, respectively.  In C9-C12 there are
               indicator variables for courses taken - biology
               or not in C9, chemistry or not in C10, earth 
               science or not in C11, and general science or 
               not in C12.  C13 contains an indicator variable
               for gender:  1 for males, 0 for females.  In 
               C14-C16 are indicator variables for ethnicity:  
               Asian or not in C14, African-American or not
               in C15, and Latino/Hispanic or not in C16.  
               Finally, C17 and C18 contain indicator variables
               for socio-economic status:  Lowest quartile or 
               not in C17 and highest quartile or not in C18.

grow.dat
               Data from the Berkeley Growth Study
               (Nancy Bayley).  These data are for Child
               #8 in the BGS study with age in months in c2
               (ranging from 1 to 60) and intellectual
               performance in C1.
grow.lis       Fitting a score on age regression
               for grow.dat, using polynomial regression.

                                    
dummy.log      Single classification anova via
               regression with dummy (group membership)
               predictor variables. Uses smsg.dat and harr.dat.
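
               A small Python sketch of the dummy-variable idea in
               dummy.log, not a reproduction of that output. Group codes
               are turned into 0/1 columns with one level left out as
               the reference. For two groups, the regression of the
               outcome on the single dummy has intercept equal to the
               reference group mean and slope equal to the difference in
               group means. The scores are made up, and the choice of
               reference level is an assumption.

```python
from statistics import mean

def dummy_columns(groups, reference):
    """Return {level: 0/1 column} for every level except the reference."""
    levels = sorted(set(groups) - {reference})
    return {lvl: [1 if g == lvl else 0 for g in groups] for lvl in levels}

def simple_regression(x, y):
    """Least-squares (intercept, slope) for y on a single predictor x."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Made-up group codes and scores, NOT values from smsg.dat or harr.dat.
group = [0, 0, 0, 1, 1, 1]
score = [10, 12, 14, 15, 17, 19]

d = dummy_columns(group, reference=0)[1]
intercept, slope = simple_regression(d, score)
print(intercept, slope)  # reference-group mean, difference in group means
```

               With three groups (as in harr.dat) there would be two
               dummy columns and a multiple regression, which this
               sketch does not fit.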

ancova.log
               Illustration of 2-group, pre-post analysis of
               covariance with data from smsg.dat.  First the
               multiple regression approach is shown,
               followed by the MINITAB ancova routine
               for comparison.

ancvdrug.dat     Data taken from Ott's text
               to illustrate a 2-group, pre-post design.  The
               description of these data is: "An investigator is
               interested in comparing two drug products (A and
               B) in overweight female volunteers.  The
               experiment calls for 20 randomly selected
               subjects who are at least 25% overweight.  Ten
               of these women are to be randomly assigned to
               product 1 and the remaining 10 to product 2.
               The response of interest is a score on a rating
               scale used to measure the mood of a subject.  To
               obtain a score, a subject must complete a
               checklist indicating how each of 50 adjectives
               describes her mood at that time.
               On the study day, all 20 volunteers are required
               to complete the checklist at 8 AM.  Then each
               subject is given the prescribed medication
               (product 1 or 2). Each subject is required to
               complete the checklist again at 10 AM."  The 8 AM
               score is in c1, the 10 AM score
               in c2 and the group membership indicator
               (1 = product 1; 0 = product 2) in c3.
ancvdrug.lis
               Description of 2-group pre-post data in
               ancvdrug.dat.  Analysis of covariance is carried
               out with multiple regression, dummy-variable
               approach and then compared with MINITAB ancova
               command.

huitema.dat
               Three groups, each of size 10,
               single outcome, 2 covariates.  Taken from the
               Huitema text with the description: "The
               investigator is concerned with the effects of
               three different types of study objectives on
               student achievement in freshman biology. The
               three types of objectives are:
               1. General--students are told to know and
               understand everything in the text.
               2. Specific--students are provided with a clear
               specification of the terms and concepts they are
               expected to master and of the testing format.
               3. Specific with study time allocations--the
               amount of time that should be spent on each
               topic is provided in addition to specific
               objectives that describe the type
               of behavior expected on examinations.
               The dependent variable is the biology
               achievement test.
               A population of freshman students scheduled to
               enroll in biology is defined, and 30 students
               are randomly selected.  The investigator obtains
               aptitude test scores and scores from an academic
               motivation test for all students before the
               investigator randomly assigns 10 students to each
               of the three treatments.  Treatments are
               administered, and scores on the dependent
               variable are obtained for all students."
               In the data file, the dependent variable is in
               c1, aptitude test in c2, academic motivation in
               c3, and group membership variable (1,2,3) in
               c6.  In c4-c5 are two 0,1 dummy variables that
               define the group membership in c6.
huitema.lis    Description of data in huitema.dat.
               Carries out ancova for the 3-group two-covariate
               design using MINITAB ancova and multiple
               regression approach.



*************************** PART III ********************************************
                       BINARY and CATEGORICAL DATA

Binomial Distribution examples.
binchina.lis    You've just entered a class in
               ancient Chinese literature. You haven't even 
               learned the alphabet yet but they've given you 
               a pop quiz. You'll have to guess on every question.
               It's a multiple choice test, with each of the 20
               questions having three possible answers. To pass, you
               must get at least 12 correct. What are the chances 
               you'll pass?
binfreet.lis    Rick is a basketball player
               who makes 75 percent of his free throws over the
               course of a season.  In a key game Rick shoots 12 free 
               throws and misses 5 of them. The fans think he failed 
               because he was nervous.  Is it unusual for Rick to 
               perform this poorly?
binnorm.lis     Illustrations of normal
               approximations to the binomial.
binsign.lis     Sign test example from
               GH section 9.11; use of binomial probability.
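
               The two story problems above reduce to binomial tail
               probabilities, sketched here in Python as a cross-check
               on the Minitab listings. For the quiz, X ~ Binomial(20,
               1/3) and passing means X >= 12. For the free throws,
               missing 5 of 12 means making 7, so the question is
               P(X <= 7) with X ~ Binomial(12, 0.75). The last function
               is a normal approximation with continuity correction in
               the spirit of binnorm.lis; the function names are
               illustrative.

```python
from math import comb, erf, sqrt

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_upper_tail(k, n, p):
    """P(X >= k)."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

def binom_lower_tail(k, n, p):
    """P(X <= k)."""
    return sum(binom_pmf(j, n, p) for j in range(0, k + 1))

def normal_upper_tail(k, n, p):
    """Normal approximation to P(X >= k), with continuity correction."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    z = (k - 0.5 - mu) / sd
    return 0.5 * (1 - erf(z / sqrt(2)))

print(binom_upper_tail(12, 20, 1 / 3))   # chance of passing the quiz
print(binom_lower_tail(7, 12, 0.75))     # chance of 7 or fewer made shots
print(normal_upper_tail(12, 20, 1 / 3))  # normal approximation for the quiz
```

               The exact values come out near 0.013 for the quiz and
               near 0.158 for the free throws.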

Poisson Distribution examples.
poisson.lis    Illustration of Poisson distribution
               and binomial approximations for rare events.
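
               A short Python sketch, reading the entry above as the
               Poisson approximation to the binomial for rare events:
               when n is large and p is small, Binomial(n, p)
               probabilities are close to Poisson probabilities with
               mean n * p. The values n = 1000 and p = 0.002 are
               illustrative choices, not taken from poisson.lis.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

# Illustrative rare-event setting (assumed values).
n, p = 1000, 0.002
for k in range(6):
    print(k, round(binom_pmf(k, n, p), 5), round(poisson_pmf(k, n * p), 5))
```

               The two columns of probabilities agree closely here; the
               agreement degrades as p grows with n * p held fixed.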


draft.cnt      Draft lottery data from 1971. Rows are
               months Jan-Dec and columns are #days with
               highest risk C1 (numbers 1-122), numbers
               123-244 in C2 and lowest risk
               (numbers 245-366) in C3.
draft.lis      Chi-square test for independence
               (fairness) for draft lottery data.

teacher1.dat   Part III. Source: U.S. Department of Education, 
               National Center for Education Statistics, 
               1987-1988 Schools and Staffing Survey. 
               Data: Willingness to become a teacher again 
               for Elementary and Secondary school teachers. 
               (Data + Output).  This example illustrates 
               cross-classified categorical data, 2x5 table
               and chi-square test.

teacher2.dat   Part III. 1987-1988 Schools and Staffing Survey
               Data: 
               Gender distribution for teachers in Elementary 
               and Secondary schools. (Data + Output) 
               Illustrates 2x2 table and chi-square test.

Agresti Supplement  Tables from the Appendix
               of "An Introduction to Categorical Data Analysis,"
               by Alan Agresti, published by John Wiley and Sons, Inc.,
               January 1996.  The tables show SAS code for the analyses
               conducted in that text, and contain the major data sets
               from that text.

Aspirin and MI Data and SAS analysis for Aspirin Use
               and Myocardial Infarction, Agresti Section 2.2.2

Lung Cancer    Data and SAS analysis for Smoking and
               Lung Cancer example

Tea Tasting    Data and SAS analysis for Fisher's Tea Tasting
               example; Fisher's Exact test, Agresti Section 2.6.1

program.dat    Dichotomous outcome, single
               quantitative predictor.  From NWK supplement
               (or the NWK regression book), the description is:
              "A small-scale investigation was undertaken to
               study the effect of computer programming
               experience on ability to complete a complex
               programming task, including debugging, within
               a specified time.
               Twenty-five persons were selected for the study.
               They had varying amounts of programming
               experience (measured in months of experience).
               All persons were given the same programming task.
               The results are coded in binary fashion; if the
               task was completed successfully in the allotted
               time, it was scored 1, and if the task was not
               completed successfully, it was scored 0."
               Months of experience are in c1, and the binary
               outcome measure is in c2.
program.lis    Plots and description
               of program.dat. OLS and WLS fits of straight-line
               functional form.
               BMDPLR logistic regression fit (presented in
               class) compared with straight-line fit.
               NEW! Minitab blog binary logistic regression.

progsas.sas    contains the SAS instructions to carry out
               a logistic regression for the data in program.dat.
progsas.lst    SAS output obtained from the command
               line statement: "sas progsas" on an elaine.
               Contains the logistic regression parameter
               estimates and fits.


coupon.dat     Dichotomous outcome, single
               quantitative predictor (*with replication*).
               From NWK supplement (or the NWK regression book),
               the description is:
              "In a study of the effectiveness of coupons 
               offering a price reduction on a given product, 
               1,000 homes were selected and a coupon and 
               advertising material for the product were mailed 
               to each.  The coupons offered different price 
               reductions (5,10,15,20, and 30 cents), and 200 
               homes were assigned at random to each of the
               price reduction categories.  The independent 
               variable in this study is the amount of price 
               reduction, and the dependent variable is a binary
               variable indicating whether or not the coupon 
               was redeemed within a six-month period."
               The price reduction is in c1, number of 
               households (200) in c2, and number redeemed from
               the 200 households in c3.
coupon.lis      Logit transformation and 
               OLS and WLS fits to coupon.dat.
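
               A hedged Python sketch of the logit-transformation
               approach named in coupon.lis, not a reproduction of that
               output. Each group proportion p is replaced by
               log(p / (1 - p)), and a straight line in price reduction
               is fit by weighted least squares with weights
               n * p * (1 - p), the usual approximate inverse variance
               of an empirical logit. Setting all weights to 1 gives the
               OLS fit. The price reductions and group size of 200 come
               from the description above; the redeemed counts are
               placeholders, NOT the values in coupon.dat.

```python
from math import log

def empirical_logits(successes, trials):
    """log(p / (1 - p)) for each group proportion p = successes / trials."""
    return [log(s / (t - s)) for s, t in zip(successes, trials)]

def logit_weights(successes, trials):
    """n * p * (1 - p) for each group, written as s * (t - s) / t."""
    return [s * (t - s) / t for s, t in zip(successes, trials)]

def weighted_line(x, y, w):
    """Weighted least-squares (intercept, slope) for y on x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

reduction = [5, 10, 15, 20, 30]      # cents, from the description
households = [200] * 5               # from the description
redeemed = [20, 40, 60, 90, 130]     # PLACEHOLDER counts, not coupon.dat

y = empirical_logits(redeemed, households)
print(weighted_line(reduction, y, [1] * 5))                            # OLS
print(weighted_line(reduction, y, logit_weights(redeemed, households)))  # WLS
```

               A count of 0 or 200 redeemed would make the logit
               undefined; this sketch does not handle that case.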