Final Report
Physical Setting/Earth Science Regents Examination:
Data and Information Related to Standard Setting
A study performed for the New York State Education Department by
Gary Echternacht
Gary Echternacht, Inc.
4 State Park Drive
Titusville, NJ 08560
(609) 737-8187
garyecht@aol.com
April 27, 2001
Introduction
The New York State Board of Regents has established learning standards all students must meet to graduate from high school. One set of learning standards is for mathematics, science and technology. Within those learning standards, some apply to earth science. In terms of general content, the earth science content refers to:
- Astronomy
- Meteorology and weather
- Geology
Key ideas, performance indicators, and sample tasks further describe each learning standard. Standards are also broken down by educational level--elementary, intermediate, and commencement. To assess the extent that students have met the learning standards, the New York State Education Department has developed a testing program. The content of the tests reflects accomplishment of the learning standards. For earth science, the State Education Department has developed a Regents Examination in Physical Setting/Earth Science to reflect accomplishment of the learning standards pertaining to the above content areas.
Although scores for the Physical Setting/Earth Science Regents Examination are placed on a numerical scale, essentially there are only three scores: does not meet standards, meets standards, and meets standards with distinction. New York State teachers, using professionally established procedures, have developed the test items, and the items have been pretested and field-tested on samples of students.
The purpose of the study described in this report is to obtain information that the State Education Department can use to establish scores that will classify test takers into three categories: does not meet standards, meets standards, and meets standards with distinction. Setting cut-points requires judgment. This study employs professionally established methods to quantify and summarize the judgments of experts related to how individuals who have met the learning standards will perform on the test.
The Physical Setting/Earth Science Regents Examination
The Physical Setting/Earth Science Regents Examination assesses student achievement at the commencement level. Items for the examination were developed through the cooperative efforts of teachers, school districts, other science educators, and New York State Education Department staff. The examination consists of two parts. The first part is a written examination. The second part is a performance examination. The written portion of the examination is administered in a 3-hour period and will first be offered in June 2001.
The written part of the examination has three sections (or parts):
- Part A consists of multiple-choice questions assessing the student’s knowledge and understanding of core material.
- Part B consists of multiple choice and constructed response questions assessing the student’s ability to apply, analyze, and evaluate material.
- Part C consists of constructed response and extended response questions assessing the student’s ability to apply knowledge of science concepts and skills.
The performance part of the examination, termed Part D, assesses laboratory skills. Part D must be administered prior to the written examination.
The examination blueprint, taken from the test sampler, is given in the table below:
| Content | Approximate Weight (%) |
| Standard 1 (Analysis, Inquiry, and Design): Mathematical Analysis, Scientific Inquiry, Engineering Design | 15-20 |
| Standard 2: Information Systems | 0-5 |
| Standard 6 (Interconnectedness: Common Themes): Systems Thinking, Models, Magnitude and Scale, Equilibrium and Stability, Patterns of Change, Optimization | 15-20 |
| Standard 7 (Interdisciplinary Problem Solving): Connections, Strategies | 5-10 |
| Standard 4 | |
| Key Idea 1 | 20-25 |
| Key Idea 2 | 20-25 |
| Key Idea 3 | 0-5 |
A complete description of the examination, including test specifications and scoring rubrics, is given in a test sampler.
Methods Employed
Data related to the performance standards for the test were obtained from a committee of experts. Judgments from committee members were quantified using standard practices employed by psychometricians who conduct standard setting studies. The committee made their judgments with respect to the difficulty scale resulting from the scaling and equating of field test items. In the field testing, each item, or score category if the item has multiple scores, is given a difficulty parameter obtained through item response methods. Test items corresponding to various points on the difficulty scale are presented as examples of test items at that difficulty level. The items used came from the anchor test form. The anchor test form is the test form upon which the cut-points are set and the form to which all later forms of the test will be equated.
Committee members were given definitions of three performance categories—not meeting standards, meeting standards, and meeting standards with distinction. The State Education Department has developed these category definitions and they are applied to all of the Regents tests that are being developed. In addition, committee members were given an exercise designed to help familiarize them with the examination and an exercise in which they were asked to categorize some of their students into the performance categories as defined by the State Education Department.
The committee met as a group on April 2, 2001 at the State Education Department.
The standard setting study used the bookmarking approach because all the multiple choice and constructed response items had been scaled using item response theory methods and because the bookmarking procedure enables committee members to consider these two item types together.
In the bookmarking procedure, multiple choice items and constructed response items are ordered in terms of their difficulty parameters. The purpose of the items is to illustrate the meaning of the difficulty scale at specific points. Committee members are asked to apply their judgments to these ordered items. The committee meeting is conducted in rounds. The rounds and the activities employed in each round are given below.
| Round | Activity |
| 1 | Committee members review the Learning Standards for the content area and consider ways of measuring accomplishment of the performance indicators and key ideas. Committee members review the ordered items and learn and understand the increasing complexity of the items and responses required. |
| 2 | Working individually, committee members set their bookmark for meeting the standards. That is, committee members conceive of an individual who has the minimum level of skill and knowledge needed to meet the learning standards and indicate the last item (or difficulty level) that the hypothetical individual is likely to answer correctly two-thirds of the time (or to construct a response that is at least as good). |
| 3 | Working individually, committee members set their bookmarks for meeting standards with distinction. That is, committee members conceive of an individual who has the minimum level of skill and knowledge needed to meet the standards with distinction and indicate the last item students are likely to answer correctly (or to construct a response that is at least as good). |
| 4 | A report of the results of round 2 is given to committee members. The committee is divided into small groups and the individual results are discussed. Committee members revise their judgments in light of the discussion. |
| 5 | The same procedure as in round 4 is used with the round 3 results. |
| 6 | A report of rounds 4 and 5 is given to the committee, along with the impacts (percent below the committee median for meeting standards and percent above for meeting standards with distinction, based on field test results). Committee members make final judgments based on the accumulated judgments and data. |
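The two-thirds criterion used in round 2 can be made concrete with a small sketch. The report says only that difficulty parameters were obtained through item response methods, so the Rasch (one-parameter logistic) model used here is an assumption for illustration, as are the function names and the sample difficulty value. Under that assumption, a student answers an item of difficulty b correctly two-thirds of the time when the student's ability equals b + ln 2.

```python
import math

def p_correct(ability, difficulty):
    """Probability of a correct response under the assumed Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def ability_at_probability(difficulty, probability=2.0 / 3.0):
    """Ability at which an item of the given difficulty is answered
    correctly with the given probability (assumed Rasch model)."""
    return difficulty + math.log(probability / (1.0 - probability))

# Hypothetical bookmarked item with difficulty 0.50 on the logit scale.
bookmarked_difficulty = 0.50
theta = ability_at_probability(bookmarked_difficulty)
print(round(theta, 3))                                     # 1.193
print(round(p_correct(theta, bookmarked_difficulty), 3))   # 0.667
```

Whether the difficulty values reported in the tables of this report already incorporate such an adjustment is not stated.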
Committee members were also asked four overall questions about accomplishment of the learning standards and test performance. Answers to these questions might aid New York in setting appropriate performance standards on the test. These questions asked:
- Each committee member's estimate of the percentage of students in their classes who are currently meeting the learning standards.
- Each committee member's estimate of the percentage of students in their classes who are currently meeting the learning standards with distinction.
- Which was the more serious error--to categorize a student as having met the standards when, in fact, that student has not met the learning standards or to categorize a student as having not met the learning standards when, in fact, that student has met the standards?
- Which was the more serious error--to grant distinction to a student who has not met the learning standards at that level or to fail to grant distinction to a student who had achieved that level of proficiency?
Committee members provided judgments relating to the performance test using the following procedure.
- Committee members reviewed the directions to the student and the scoring rubric for each item. All committee members were familiar with the rubric and most had used the rubric in scoring the performance test.
- Committee members then estimated the score that a borderline student who meets standards (i.e., a student who meets the standards minimally) would achieve on the test.
- Committee members then estimated the score that the borderline student who meets the standards with distinction would achieve on the test.
- Score distributions of the individual committee member judgments were obtained.
Committee Members
The New York State Education Department assembled a committee of 24 people to provide judgments for the study. Committee members were, with one exception, current or former classroom teachers. Some were supervisors. One committee member was a representative from the business community. All committee members were recognized as very knowledgeable of the learning standards pertaining to physical setting and earth science and of how students perform on standardized tests similar to the Physical Setting/Earth Science Examination. Some had worked on an aspect of either the standards or development of the tests.
Committee members, their schools, the number of years of experience each has in teaching Earth Science, and the number of students currently in their Earth Science classes are given in the table below.
| Committee Member | School and Location | Years Teaching Physical Setting/Earth Science | Number of Students Currently in Earth Science Classes |
| Sue Ellen Ali | Highland Residential Center, Highland | 17 | 10 |
| David Banker | Stamford Central School, Stamford | 21 | 33 |
| Mary Bishop | Saugerties High School, Saugerties | 30 | 78 |
| Kathleen Champney | Retired | 37 | 0 |
| Dennis Conklin | Retired | 34 | 0 |
| Kathy Conway | Sand Creek Middle School, Albany | 12 | 25 |
| Dennis DeSain | Retired | 30 | 0 |
| Lisa Gottlieb | Ardsley High School, Ardsley | 2 | 74 |
| Frances Hess | Cooperstown High School, Cooperstown | 36 | 85 |
| Susan Hoffmire | Phoenix Central High School, Phoenix | 3 | 60 |
| Faye Landsman | Community District 10, Bronx | 5 | 0 |
| Janette Liddle | Adirondack High School, Boonville | 17 | 26 |
| Michael McDonnell | Millwood High School, Brooklyn | 7 | 100 |
| Glenn Meyer | Marlboro High School, Marlboro | 17 | 0 |
| Glen Olf | Hoosac School, Hoosick | 26 | 13 |
| George Pafumi | Geologist | 0 | 0 |
| John Pritchard | Grover Cleveland High School, Ridgewood | 8 | 10 |
| Jack Ridolph | Roy C. Ketcham High School, Wappingers Falls | 31 | 75 |
| Len Sharp | Liverpool High School, Liverpool | 30 | 105 |
| Sue Marie Soto | Health Opportunities High School, Bronx | 2 | 0 |
| Nancy Spaulding | Elmira Free Academy, Elmira | 35 | 0 |
| Wendy Taylor | Schenectady High School, Schenectady | 6 | 110 |
| Bernadette Tomaselli | Lancaster High School, Lancaster | 24 | 70 |
| Ruth Wahl | Allegany-Limestone High School, Allegany | 13 | 125 |
Committee members were chosen so that they would represent a wide range of schools and different types of students. Each committee member was asked to complete a short background questionnaire that included questions about their sex, ethnic background, and the setting for their school. Results of the questionnaire tabulations are given in the table below.
| Characteristic | Percent of committee |
| Sex | |
| Female | 58% |
| Male | 42% |
| Ethnic Background of Committee Member | |
| Hispanic | 4% |
| White | 96% |
| School Setting | |
| New York City | 21% |
| Other urban | 13% |
| Suburban | 33% |
| Rural | 33% |
Findings related to the bookmarking procedure
Findings--round 2
In round 2 every committee member independently placed his or her own bookmarks for meeting standards. The results of the placements are given in the table below. The table gives the difficulty parameter, corresponding raw score, and percentage of students below that raw score based on the field test results. The cut-points include the committee average plus or minus one or two standard deviations (i.e., standard deviations of the committee estimates) and the median committee cut-point, together with the cut-points corresponding to the 75th and 25th percentile ranks of committee estimates.
| Cut-point | Difficulty | Raw score (Max=83) | Percent below |
| Mean + 2 SD | 1.60 | 73 | 93% |
| Mean + 1 SD | 0.91 | 60 | 70% |
| Mean | 0.23 | 28 | 9% |
| Mean - 1 SD | -0.46 | 11 | 1% |
| Mean - 2 SD | -1.15 | 4 | 0% |
| 75% | 0.60 | 45 | 34% |
| Median | 0.30 | 31 | 13% |
| 25% | -0.30 | 17 | 2% |
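The summary rows in the table above can be derived from the individual bookmark placements. The sketch below is illustrative only: the bookmark values are made up, the function name is hypothetical, and the report does not say whether a sample or population standard deviation was used or which percentile convention was applied. The sample standard deviation and Python's default quartile method are assumed here.

```python
import statistics

def summarize_bookmarks(difficulties):
    """Summarize committee bookmark placements on the difficulty scale."""
    mean = statistics.mean(difficulties)
    sd = statistics.stdev(difficulties)  # sample SD (an assumption)
    q1, _, q3 = statistics.quantiles(difficulties, n=4)  # default method
    return {
        "Mean + 2 SD": mean + 2 * sd,
        "Mean + 1 SD": mean + sd,
        "Mean": mean,
        "Mean - 1 SD": mean - sd,
        "Mean - 2 SD": mean - 2 * sd,
        "75%": q3,
        "Median": statistics.median(difficulties),
        "25%": q1,
    }

# Hypothetical bookmark difficulties from five committee members.
bookmarks = [0.0, 0.2, 0.4, 0.6, 0.8]
for label, value in summarize_bookmarks(bookmarks).items():
    print(f"{label:12s} {value:6.2f}")
```

Converting each difficulty value to a raw score and a percent below additionally requires the raw score conversion and the field test score distribution, which the report does not reproduce.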
Findings--round 3
In round 3 every committee member independently placed his or her own bookmarks for meeting standards with distinction. The results of the placements are given in the table below. The table gives the difficulty parameter, corresponding raw score, and the percentage above that raw score based on the field test results. The cut-points include the committee average plus or minus one or two standard deviations (i.e., standard deviations of the committee estimates) and the median committee cut-point, together with the cut-points corresponding to the 75th and 25th percentile ranks of committee estimates.
| Cut-point | Difficulty | Raw score (Max=83) | Percent above |
| Mean + 2 SD | 2.36 | 82 | 0% |
| Mean + 1 SD | 1.83 | 76 | 4% |
| Mean | 1.30 | 69 | 13% |
| Mean - 1 SD | 0.76 | 56 | 40% |
| Mean - 2 SD | 0.23 | 28 | 91% |
| 75% | 1.70 | 74 | 6% |
| Median | 1.50 | 71 | 10% |
| 25% | 0.93 | 60 | 30% |
Findings--round 4
In round four, committee members received a report of their round two results. They also were placed in small groups where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards based on the information and knowledge they had gained up to this point. The round four results, which generally show less variation than the round two results, are given in the table below.
| Cut-point | Difficulty | Raw score (Max=83) | Percent below |
| Mean + 2 SD | 1.41 | 69 | 87% |
| Mean + 1 SD | 0.77 | 57 | 62% |
| Mean | 0.13 | 27 | 8% |
| Mean - 1 SD | -0.52 | 10 | 1% |
| Mean - 2 SD | -1.16 | 4 | 0% |
| 75% | 0.50 | 40 | 25% |
| Median | -0.05 | 23 | 5% |
| 25% | -0.40 | 15 | 2% |
Findings--round 5
In round five, committee members received a report of their round three results. They also were placed in small groups where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards with distinction based on the information and knowledge they had gained up to this point. The round five results, which generally show less variation than the round three results, are given in the table below.
| Cut-point | Difficulty | Raw score (Max=83) | Percent above |
| Mean + 2 SD | 2.10 | 82 | 0% |
| Mean + 1 SD | 1.73 | 74 | 6% |
| Mean | 1.37 | 69 | 13% |
| Mean - 1 SD | 1.00 | 63 | 24% |
| Mean - 2 SD | 0.63 | 46 | 63% |
| 75% | 1.50 | 71 | 10% |
| Median | 1.50 | 71 | 10% |
| 25% | 1.18 | 66 | 19% |
Findings--round 6
In round six, committee members received a report of their round four and five judgments. They also received a report of the impact of their estimates from that round. Impact was reported in terms of the frequency distributions of the field test scores. The committee was also advised that scores from field testing generally underestimate operational test performance, but that the amount of the underestimate was not known. Committee members then returned to their groups and discussed the report and their judgments. At the end of the discussion, committee members were asked to place new bookmarks for both meeting standards and meeting standards with distinction based on the information and knowledge they had at that time. Results of this final placement are given in the table below.
| Cut-point | Meeting standards: Difficulty | Meeting standards: Raw score | Meeting standards: Percent below | Meeting standards with distinction: Difficulty | Meeting standards with distinction: Raw score | Meeting standards with distinction: Percent above |
| Mean + 2 SD | 1.31 | 69 | 87% | 1.90 | 79 | 2% |
| Mean + 1 SD | 0.84 | 58 | 65% | 1.72 | 74 | 6% |
| Mean | 0.36 | 33 | 15% | 1.53 | 73 | 7% |
| Mean - 1 SD | -0.11 | 21 | 4% | 1.34 | 69 | 13% |
| Mean - 2 SD | -0.59 | 8 | 0% | 1.16 | 66 | 19% |
| 75% | 0.61 | 45 | 34% | 1.53 | 73 | 7% |
| Median | 0.50 | 40 | 25% | 1.50 | 71 | 10% |
| 25% | 0.10 | 26 | 8% | 1.50 | 71 | 10% |
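The impact figures (percent below and percent above) come from the frequency distribution of field test scores. A minimal sketch of that calculation follows. The scores and function names are hypothetical, and the report does not say whether "percent above" counts students scoring exactly at the cut-point. The sketch assumes it does, since those students would earn the classification.

```python
def percent_below(scores, cut):
    """Percentage of test takers scoring strictly below the cut-point."""
    return 100.0 * sum(1 for s in scores if s < cut) / len(scores)

def percent_at_or_above(scores, cut):
    """Percentage of test takers scoring at or above the cut-point
    (treating 'percent above' as inclusive is an assumption)."""
    return 100.0 * sum(1 for s in scores if s >= cut) / len(scores)

# Hypothetical field test raw scores (maximum possible score is 83).
field_test_scores = [22, 31, 38, 40, 44, 52, 58, 63, 71, 75]

print(percent_below(field_test_scores, 40))        # 30.0
print(percent_at_or_above(field_test_scores, 71))  # 20.0
```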
Other Judgments Obtained
Committee members were asked to provide their best judgment of the percentage of their current students who are achieving the learning standards as well as the percentage of their current students who are achieving the learning standards with distinction. These judgments were made not with respect to the test, but with respect to the learning standards and the definitions of meeting standards and meeting standards with distinction. Results appear in the table below.
| Standard | % Meeting standards | % Meeting standards with distinction |
| Mean + 2 SD | 100% | 63% |
| Mean + 1 SD | 97% | 44% |
| Mean | 73% | 24% |
| Mean - 1 SD | 49% | 5% |
| Mean - 2 SD | 24% | 0% |
| 75% | 89% | 37% |
| Median | 75% | 15% |
| 25% | 64% | 10% |
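The 100% and 0% entries in the table above appear to be truncated at the limits of the percentage scale: a mean of 73% with a standard deviation of roughly 24 points would otherwise put the mean plus two standard deviations above 100%. This is an inference from the table, not something the report states. The sketch below shows the arithmetic with a hypothetical function name and rounded, assumed values for the mean and standard deviation.

```python
def bounded_band(mean, sd, k, low=0.0, high=100.0):
    """Return mean + k * sd, clipped to the valid percentage range."""
    return min(high, max(low, mean + k * sd))

# Rounded values assumed for illustration; the committee's unrounded
# mean and standard deviation are not given in the report.
mean_pct, sd_pct = 73.0, 24.0
for k in (2, 1, 0, -1, -2):
    print(k, bounded_band(mean_pct, sd_pct, k))
```

With these rounded inputs, the band for k = -2 comes out one point above the tabled value, which is consistent with the table having been computed from unrounded statistics.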
The data in the table above relate to the cut-points for the test in that the committee, on average, indicated that in their judgment about one in four students in the state is not currently achieving at the level suggested by the learning standards. This assessment was made without test scores and is independent of the test scores. Similarly, the committee on average judged that about 15%-25% of students were achieving at the distinguished level.
Also noteworthy are the relatively large standard deviations of the estimates, which reflect the very real variation in achievement among classrooms. For example, estimates of the percentage of students achieving at least at the meets standards level of achievement ranged from 1% to 100%. For meeting standards with distinction, the estimates ranged from 0% to 75%.
With respect to the relative severity of the errors of classification, 71% of the committee said that classifying a student as having not met standards who in reality has met the learning standards was more serious than classifying a student as having met the standards who in reality has not met the learning standards. Twenty-nine percent of the committee said the opposite. With respect to meeting standards with distinction, the committee was about evenly divided. Fifty-four percent said that not granting a student distinction who in reality has attained that level of achievement was the more serious error.
Thus, the committee might be considered "lenient" with respect to setting the lower cut-point, but indifferent about setting the higher cut-point.
Cut-Points for the performance test
Results for the performance test are given in the table below.
| Standard | Meeting Standards | Meeting Standards with Distinction |
| Mean + 2 SD | 17.1 | 23.7 |
| Mean + 1 SD | 15.0 | 22.2 |
| Mean | 12.9 | 20.7 |
| Mean - 1 SD | 10.8 | 19.2 |
| Mean - 2 SD | 8.7 | 17.7 |
| 75% | 14.0 | 21.3 |
| Median | 13.0 | 21.0 |
| 25% | 11.8 | 20.0 |
Discussion and Recommendations
The purpose of this study was to obtain data and information that New York may use in setting cut-points for the Physical Setting/Earth Science Examination. The data should be used to guide those decisions.
The committee that provided the data was diverse and well represented the diversity of New York students, teachers, and school districts. With that diversity, it is not surprising that committee judgments varied.
The final bookmarks from the procedure are given in the table below.
| Cut-point | Meeting standards: Difficulty | Meeting standards: Raw score | Meeting standards: Percent below | Meeting standards with distinction: Difficulty | Meeting standards with distinction: Raw score | Meeting standards with distinction: Percent above |
| Mean + 2 SD | 1.31 | 69 | 87% | 1.90 | 79 | 2% |
| Mean + 1 SD | 0.84 | 58 | 65% | 1.72 | 74 | 6% |
| Mean | 0.36 | 33 | 15% | 1.53 | 73 | 7% |
| Mean - 1 SD | -0.11 | 21 | 4% | 1.34 | 69 | 13% |
| Mean - 2 SD | -0.59 | 8 | 0% | 1.16 | 66 | 19% |
| 75% | 0.61 | 45 | 34% | 1.53 | 73 | 7% |
| Median | 0.50 | 40 | 25% | 1.50 | 71 | 10% |
| 25% | 0.10 | 26 | 8% | 1.50 | 71 | 10% |
The committee also indicated that currently about 25% of students are not meeting the learning standards and about 15% - 25% of students are meeting the standards at the distinction level. Further, the committee overwhelmingly believes that the error of classifying a student as not meeting standards who in reality has met standards should be minimized. The committee seems indifferent with respect to classification errors at the distinction level.
Finally, it is well known that student performance improves once operational testing begins. What is not known is the amount of improvement that might be expected.
Final judgments for the performance test are given in the table below.
| Standard | Meeting Standards | Meeting Standards with Distinction |
| Mean + 2 SD | 17.1 | 23.7 |
| Mean + 1 SD | 15.0 | 22.2 |
| Mean | 12.9 | 20.7 |
| Mean - 1 SD | 10.8 | 19.2 |
| Mean - 2 SD | 8.7 | 17.7 |
| 75% | 14.0 | 21.3 |
| Median | 13.0 | 21.0 |
| 25% | 11.8 | 20.0 |
What should be made of these results?
The study author recognizes that New York has the responsibility and duty to set cut-points in such a way that the purpose of the testing program is best accomplished. That requires judgment and consideration of all the data and information that is available at the time cut-points are set. The study author strongly encourages New York not to routinely adopt the mean for the bookmarking procedure as the final cut-points. Final cut-points should result from staff deliberations using all of the data presented in this report.
It is well known that field test results underestimate how well students perform on operational testing. Underperformance in field testing is due to several factors, chief among them that students recognize the test scores do not count and that teaching practices are not yet congruent with the standards on which the tests are based. The amount of underestimation for the Physical Setting/Earth Science Examination is unknown. Yet the difficulty parameters and impact estimates used in the standard setting were based on field test statistics.
Thus, the study author’s first recommendation is to repeat the standard setting study once the test becomes operational. The repeated study should use item difficulty and impact estimates obtained from operational testing rather than simply rerunning the study with the same data. If that is not possible, the study author encourages New York to repeat the standard setting study with other methods, such as the contrasting groups method, which does not rely on the state collecting item level data on a large scale.
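The contrasting groups method mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure the study author would necessarily follow: the function name, the error weights, and the scores are all hypothetical. In this version, teachers classify students as meeting or not meeting the standards independently of the test, and the cut-point is the score that minimizes the weighted number of misclassified students.

```python
def contrasting_groups_cut(meeting_scores, not_meeting_scores,
                           false_fail_weight=1.0, false_pass_weight=1.0):
    """Return the cut score with the lowest weighted misclassification.

    A student at or above the cut is classified as meeting standards.
    Ties are resolved in favor of the lowest candidate cut.
    """
    candidates = sorted(set(meeting_scores) | set(not_meeting_scores))
    best_cut, best_cost = None, float("inf")
    for cut in candidates:
        # Students judged as meeting standards who would fail the test.
        false_fails = sum(1 for s in meeting_scores if s < cut)
        # Students judged as not meeting standards who would pass.
        false_passes = sum(1 for s in not_meeting_scores if s >= cut)
        cost = (false_fail_weight * false_fails
                + false_pass_weight * false_passes)
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    return best_cut

# Hypothetical raw scores for two teacher-classified groups.
meeting = [40, 45, 50, 55]
not_meeting = [20, 25, 30, 42]

print(contrasting_groups_cut(meeting, not_meeting))                       # 40
print(contrasting_groups_cut(meeting, not_meeting, false_pass_weight=3))  # 45
```

The weights offer one way to reflect the committee's stated view on which classification error is more serious. A larger `false_fail_weight` pushes the cut-point lower.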
For initial operational testing, the study author recommends that the cut-point for meeting standards be set within the raw score range of 33-45. The committee means and medians fall within this range. Within this range, the study author recommends a final cut-point be set based on the state's best judgment as to the improvement that will actually occur once operational testing begins. That judgment should be informed by discussions with test developers, curriculum specialists, and teachers. The study author would choose a raw score of 40, which is lenient and will likely result in fewer than about 20% failing the test, a lower level than indicated by teachers of the percentage of students who are not currently meeting standards. This is only the personal opinion of the study author, however.
For initial operational testing, the study author recommends that the cut-point for meeting standards with distinction be set within the raw-score range of 69-74. Again, all committee mean and median judgments fall within that range. And again, within that range, the choice should be made based on the estimated improvement from field testing to operational testing and should be informed by discussions with test developers, curriculum specialists, and teachers. The state should realize, however, that improvement in the upper range of scores is likely to be less than improvement in the lower range of scores. The study author would choose a raw score of 71, but again that is the personal opinion of the study author only.
With respect to the performance test, until more data on performance can be collected, the study author recommends that cut-points of 13 and 21 (the committee medians) be used.