Dan MacIsaac, Rebecca Pollard Cole, and David M. Cole
Northern Arizona University

This is a preprint of a manuscript submitted for publication in the Journal of Research in Science Teaching ©1999 by the National Association for Research in Science Teaching.

Running Head: Paper- vs. Web-based Testing in Physics

Reprint requests should be sent to Dan MacIsaac, Department of Physics and Astronomy, Northern Arizona University, Flagstaff, AZ 86011-6010. This research was partially supported by funding from the Arizona Collaborative for Excellence in the Preparation of Teachers (ACEPT) and NAU Organized Research Grant funds. The authors wish to acknowledge the helpful contributions of Nate Davis and Brian Nance, who assisted with HTML and data coding and were funded by an NAU Hooper Undergraduate Research Fellowship. Valuable comments and suggestions regarding the statistical comparisons of item response patterns were provided by Professor Philip Sadler of Harvard University.
Abstract
On-line web-based
technologies provide students with the opportunity to complete assessment
instruments from personal computers with internet access. The purpose of this
study was to examine the differences in paper-based and web-based
administrations of a commonly used assessment instrument, the Force Concept
Inventory (FCI). Results demonstrated no appreciable difference on FCI scores
or FCI items based on the type of administration. A four-way ANOVA (N = 376) demonstrated differences in FCI scores due to different sections of the same course, different courses, and gender. However, none of these differences was influenced by the type of test administration. Similarly, FCI student scores
were comparable with respect to both test reliability and predictive validity.
For individual FCI items, paper-based and web-based comparisons were made by
examining potential differences in item means and by examining potential
differences in response patterns. Chi Squares demonstrated no differences in
response patterns and t Tests demonstrated no differences in item means between
paper-based and web-based administrations. In summary, the web-based
administration of the Force Concept Inventory appears to be as efficacious as
the paper-based administration.
Since the late 1970s, science educators have been experimenting with the use of microcomputers for the conceptual and attitudinal assessment of their students (Arons, 1984, 1986; Bork, 1981; Waugh, 1985). Since the late 1980s, multiple-choice, machine-scored, standardized instruments have been developed to assess the conceptual and attitudinal state of introductory physics students. The Force Concept Inventory (FCI), perhaps the best known of these standardized instruments, assesses students' conceptual knowledge of physics
(see Hestenes, Wells & Swackhamer, 1992). Recently, Redish, Saul, and
Steinberg (1998) developed the Maryland Physics Expectations Survey (MPEX), a
standardized instrument which assesses the attitudinal state of physics students.
Both the FCI and the MPEX are widely used in the physics education research
community (Hake, 1998).
Data
from these instruments can provide valuable information for both research and
teaching. For example, the instruments can be used to assess physics learning,
to justify and guide interventions in physics teaching practices, to evaluate
introductory physics programs and to compare student learning and attitudinal
outcomes. However, each of these instruments requires approximately thirty minutes to administer. Additionally, in both research and teaching situations, the instruments are typically given both pre- and post-instruction. Each
instrument can therefore consume a full hour of valuable instructional time.
Further, additional resources are required to score, collate, record, and
analyze the instrument data. Both the loss of instructional time and the
administrative overhead may discourage the regular use of these instruments by
many introductory physics instructors (Hake, 1998).
Recently available, “on-line” or web-based technologies
provide students with the opportunity to complete assessment instruments from
personal computers with internet access (Titus, Martin & Beichner, 1998).
While not as sophisticated as advanced computer-adaptive testing (CAT) of the
sort recently adopted for tests like the Graduate Record Examinations
(Straetmans & Eggen, 1998), web testing could still greatly reduce the
administrative and class time burden required for the application of
standardized instruments. Furthermore, new kinds of data could be collected for
improving the instruments themselves (such as question latency data) and data
could be readily collated, via databases populated by the on-line instruments, for long-term studies of student learning.
To be widely used, the web-based administration of these
instruments must be characterized in terms of reliability, and results from the
web-based administration of these instruments must be statistically compared to
results from standard paper administration. If differences are found, measurements from web-based administrations can then be corrected or calibrated against paper-based administrations. Therefore, the purpose of this study is to begin
this process by examining the differences in paper-based and web-based
administrations of the Force Concept Inventory.
The participants in the study were
students from three introductory physics courses taught at a medium-sized university in the southwest during the Spring of 1998 and the Fall of 1998. The
first two courses, General College Physics I (Physics 111) and General College
Physics II (Physics 112) comprise the two semester algebra-based sequence for
non-science majors. Students in these two courses were mostly pre-health
professions, biology and education majors. The third course, University Physics
I (Physics 161) is part of the three semester calculus-based sequence for
science majors. Students in this course were mostly science (e.g. physics,
chemistry) and engineering majors.
The participants made up a sample of 376 students,
235 (62.5%) women and 141 (37.5%) men. As the majority of the
students were Caucasian and in the age range of 18 to 22, age and ethnicity were not considered further.
Instruments
The Force Concept Inventory
(FCI) is a 30-item, multiple-choice test which "requires a forced choice
between Newtonian concepts and common-sense alternatives" (Hestenes,
Wells, & Swackhamer, 1992, p. 142). The concepts tested include kinematics,
Newton's First, Second and Third Laws, the superposition principle and forces.
Student data from the FCI and related instruments have now been collected and
published on thousands of students (Hake, 1998). The Maryland Physics
Expectations Survey (MPEX) is a 34-item Likert instrument with 5 attitudinal subscales (Redish, Saul, & Steinberg, 1998), which was used as a filler task
and not analyzed further in this study.
This study used a quasi-random,
quasi-experimental design. During the Spring of 1998, one section of Physics
112 and one section of Physics 161 participated in the study. During the Fall
of 1998, one section each of Physics 111, Physics 112, and Physics 161
participated. In total, 5 sections of three different courses participated. For
simplicity, these will be referred to as classes. Each class section was divided
into two equal (within one student) half-class groups by selecting every second
name in alphabetical order from the roster. During the first week of each
semester, thirty minutes was devoted to testing. In each class, one half-class
group completed a paper-based FCI and was then asked to complete the web-based MPEX within the next seven days. The other half-class group completed a paper-based MPEX and was then asked to complete the web-based FCI within the next seven days.
Each student was supplied with the web address for the test
appropriate to their assigned half-class group. No training was provided to the
students for taking either the FCI or the MPEX on the web. Further, there was
no attempt to authenticate the web users. Each student's work was accepted as
their own. Overall completion times, submission times and dates were recorded.
This information was used to ensure that students took no longer than 30
minutes to complete the test and that they took the test within the seven day
period. It should be noted that the web-based format allowed students to retake
the test after they received on-line feedback regarding their first submission.
The date and time information ensured that the test data used as part of the
study was their first submission.
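For readers who wish to apply a comparable screening rule to their own web-test logs, the sketch below keeps only each student's first submission, then discards attempts that took longer than 30 minutes or fell outside the seven-day window. This is an illustrative reconstruction; the file name and column names (student_id, assigned_date, start_time, submit_time) are assumptions, not the names used by our logging software.

```python
import pandas as pd

# Hypothetical submission log; file and column names are illustrative only.
logs = pd.read_csv(
    "web_fci_log.csv",
    parse_dates=["assigned_date", "start_time", "submit_time"],
)

# Keep only each student's first submission.
logs = logs.sort_values("submit_time").drop_duplicates("student_id", keep="first")

# Discard attempts longer than 30 minutes or outside the seven-day window.
duration_ok = (logs["submit_time"] - logs["start_time"]) <= pd.Timedelta(minutes=30)
window_ok = (logs["submit_time"] - logs["assigned_date"]) <= pd.Timedelta(days=7)
usable = logs[duration_ok & window_ok]
print(len(usable), "usable web submissions")
```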
All of the tests were graded as to completeness and counted
as the equivalent of one homework or quiz assignment. With respect to final
class grades, students’ participation comprised about 3 points out of one
thousand total points, so that completion or non-completion had negligible
impact.
As a result of the paper-based and web-based
administrations, 376 usable tests were collected. Tests that were turned in
after the seven-day period, or that took longer than 30 minutes to complete, were deemed unusable. Student scores on the FCI were calculated by summing the number of correct answers, for a maximum possible score of 30. For the
entire data set (N = 376), the mean of the FCI was M = 13.71
(SD = 6.08). Table 1 presents the means and standard deviations
of the Force Concept Inventory for all sections of all of the introductory
physics classes tested.
Table 1

Means and Standard Deviations of FCI student scores in all sections of all physics classes

| Course      | N (Spring 1998) | Mean  | SD   | N (Fall 1998) | Mean  | SD   |
|-------------|-----------------|-------|------|---------------|-------|------|
| Physics 111 | na              | na    | na   | 109           | 9.11  | 4.19 |
| Physics 112 | 38              | 15.37 | 6.09 | 38            | 13.71 | 4.16 |
| Physics 161 | 90              | 18.17 | 5.64 | 101           | 14.09 | 5.41 |
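As a concrete illustration of the scoring rule described above, the sketch below awards one point per item whose response matches an answer key. The key and responses shown are placeholders invented for the example; they are not the actual FCI answer key or real student data.

```python
# Hypothetical scoring of a single student's FCI responses. The key below is
# a placeholder, NOT the actual FCI answer key.
answer_key = {item: "A" for item in range(1, 31)}   # 30 items, placeholder key
responses = {item: "A" for item in range(1, 31)}    # example response set

# One point per item whose response matches the key; maximum possible score is 30.
fci_score = sum(1 for item, key in answer_key.items() if responses.get(item) == key)
print(fci_score)   # prints 30 for this all-correct example
```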
The purpose of the study was to examine
differences in paper-based and web-based administrations of the Force Concept
Inventory. Therefore, several different
analyses were conducted. First, total FCI scores were calculated and
differences between paper and web were examined. Second, differences in
individual items between paper and web were explored. Third, patterns of
responses in the individual items were examined to determine if differences
existed between paper and web-based administrations. Finally, the predictive
validity of the two different FCI administrations on students' course grades was
examined. The results of these analyses are reported in the sections which
follow.
Data for this study were collected in different sections of
3 different physics courses (see Table 1). In addition, previous research has
indicated differences in FCI scores due to gender. Therefore, to examine
differences in paper-based and web-based FCI student scores, a 5 × 3 × 2 × 2 ANOVA was used (5 sections, 3
courses, 2 genders, 2 types of FCI administration). An alpha level of .01 was
used for all statistical tests. Significant differences were found for the main
effects of section, course, and gender. No significant differences were found
for the main effect of FCI administration. For the first-order interactions, no
significant differences were found due to type of FCI administration. Table 2
presents the results of the ANOVA.
Table 2

Four-Way ANOVA summary table for section, course, gender, and type of FCI administration

| Source                   | df | MSe     | F       |
|--------------------------|----|---------|---------|
| course                   | 2  | 1684.72 | 68.09 * |
| section                  | 2  | 421.75  | 17.05 * |
| gender                   | 1  | 499.79  | 20.20 * |
| administration           | 1  | 29.06   | 1.17    |
| course x administration  | 2  | 26.79   | 1.08    |
| section x administration | 2  | 41.45   | 1.68    |
| gender x administration  | 1  | .14     | .01     |

*p < .01
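For readers wishing to run a comparable analysis on their own data, the sketch below fits a factorial model of FCI score on course, section, gender, and administration type, with the first-order interactions involving administration, using statsmodels. The data file and column names are assumptions for the sketch, not our actual data files, and this is not the analysis script used for the paper.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical data file with columns: score, course, section, gender, admin.
df = pd.read_csv("fci_scores.csv")

# Main effects plus the first-order interactions involving administration type,
# mirroring the sources reported in Table 2.
model = smf.ols(
    "score ~ C(course) + C(section) + C(gender) + C(admin)"
    " + C(course):C(admin) + C(section):C(admin) + C(gender):C(admin)",
    data=df,
).fit()
print(anova_lm(model, typ=2))   # F and p for each main effect and interaction
```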
To
further examine potential differences in the student scores, Cronbach's alpha
was calculated separately for the paper and web administrations. For the entire
sample α = .86 (N = 376), for the paper-based administration α = .86 (N = 212), and for the web-based administration α = .85 (N = 164). These
alpha levels appear to be comparable.
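As a reference for this reliability check, a minimal sketch of Cronbach's alpha computed from a students-by-items matrix of dichotomously scored (0/1) responses is shown below. The simulated matrix is an assumption for demonstration only; real use would pass the scored FCI responses.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (students x items) matrix of 0/1 item scores."""
    k = item_scores.shape[1]                          # number of items (30 for the FCI)
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Demonstration with simulated data; real use would pass the scored FCI matrix.
rng = np.random.default_rng(0)
demo = rng.integers(0, 2, size=(200, 30))
print(cronbach_alpha(demo))
```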
Paper-Based
Versus Web-based Individual FCI Items
Differences in the paper-based and web-based administrations
of the FCI for individual items were explored using t Tests. A probability level
of .01 was used for all statistical tests. The F statistic was used to
determine whether the variances of the paper- and web-based administrations of
each item were equal. No significant differences were found for any of the 30
items. Table 3 presents the results of the t Tests.
Table 3

Results of t Tests for paper-based and web-based administrations of FCI items

| Item    | F    | prob<F | Item    | F    | prob<F |
|---------|------|--------|---------|------|--------|
| Item 1  | 1.29 | .08    | Item 16 | 1.00 | .98    |
| Item 2  | 1.08 | .60    | Item 17 | 1.04 | .79    |
| Item 3  | 1.02 | .91    | Item 18 | 1.12 | .45    |
| Item 4  | 1.05 | .71    | Item 19 | 1.07 | .66    |
| Item 5  | 1.04 | .77    | Item 20 | 1.08 | .62    |
| Item 6  | 1.05 | .75    | Item 21 | 1.04 | .80    |
| Item 7  | 1.13 | .41    | Item 22 | 1.01 | .96    |
| Item 8  | 1.03 | .86    | Item 23 | 1.00 | .98    |
| Item 9  | 1.01 | .98    | Item 24 | 1.00 | .98    |
| Item 10 | 1.04 | .81    | Item 25 | 1.06 | .67    |
| Item 11 | 1.12 | .45    | Item 26 | 1.01 | .93    |
| Item 12 | 1.02 | .90    | Item 27 | 1.00 | .98    |
| Item 13 | 1.04 | .80    | Item 28 | 1.10 | .53    |
| Item 14 | 1.03 | .83    | Item 29 | 1.07 | .66    |
| Item 15 | 1.20 | .22    | Item 30 | 1.06 | .70    |

df = (211, 163) for all tests
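A per-item comparison of this kind can be sketched as follows: a variance-ratio F test checks whether the paper- and web-based variances are equal, and an independent-samples t test compares the item means. This is an illustrative reconstruction under assumed variable names and simulated data, not the analysis script used for the paper.

```python
import numpy as np
from scipy import stats

def compare_item(paper: np.ndarray, web: np.ndarray):
    """Compare one 0/1-scored FCI item across the two administrations."""
    # Variance-ratio F test (larger variance on top) for equality of variances.
    v_paper, v_web = paper.var(ddof=1), web.var(ddof=1)
    if v_paper >= v_web:
        f_stat, df1, df2 = v_paper / v_web, len(paper) - 1, len(web) - 1
    else:
        f_stat, df1, df2 = v_web / v_paper, len(web) - 1, len(paper) - 1
    f_p = min(1.0, 2 * stats.f.sf(f_stat, df1, df2))   # two-sided p value

    # Independent-samples t test on the item means.
    t_stat, t_p = stats.ttest_ind(paper, web, equal_var=(f_p > .01))
    return f_stat, f_p, t_stat, t_p

# Demonstration with simulated item responses (1 = correct, 0 = incorrect).
rng = np.random.default_rng(1)
print(compare_item(rng.integers(0, 2, 212), rng.integers(0, 2, 164)))
```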
Additional
analysis was performed to explore potential differences in the administration
for individual items. Chi Square tests of the paper- and web-based
administrations of each item were conducted to determine whether the response
patterns (the distribution of responses across choices A, B, C, D, and E) of the two administrations
differed. A probability level of .01 was used for all statistical tests. No
significant differences in the response patterns were found for any of the 30
items. Table 4 presents the results of the Chi Square tests.
Table 4

Results of χ² tests for paper-based and web-based administrations of FCI items

| Item    | χ²    | p   | Item    | χ²    | p   |
|---------|-------|-----|---------|-------|-----|
| Item 1  | 8.89  | .06 | Item 16 | 1.95  | .75 |
| Item 2  | 3.33  | .51 | Item 17 | 1.23  | .87 |
| Item 3  | 2.12  | .71 | Item 18 | 4.69  | .32 |
| Item 4  | 11.85 | .02 | Item 19 | 2.78  | .60 |
| Item 5  | .97   | .91 | Item 20 | 2.67  | .67 |
| Item 6  | 2.25  | .69 | Item 21 | 2.19  | .70 |
| Item 7  | 8.32  | .08 | Item 22 | 4.25  | .37 |
| Item 8  | 1.96  | .74 | Item 23 | 2.00  | .74 |
| Item 9  | 2.12  | .70 | Item 24 | .91   | .92 |
| Item 10 | 1.14  | .89 | Item 25 | 4.99  | .29 |
| Item 11 | 2.40  | .66 | Item 26 | 11.47 | .02 |
| Item 12 | 3.27  | .51 | Item 27 | 2.55  | .64 |
| Item 13 | 8.12  | .09 | Item 28 | 4.62  | .33 |
| Item 14 | 8.08  | .04 | Item 29 | 8.10  | .09 |
| Item 15 | 9.42  | .05 | Item 30 | 4.60  | .33 |

df = 4 for all tests
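A response-pattern test of this kind can be reproduced from a 2 × 5 contingency table per item (administration type by response choice), as sketched below. The counts shown are invented for illustration and do not come from our data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical response counts for one FCI item (choices A-E), paper vs. web;
# the numbers are invented for illustration.
counts = np.array([
    [40, 55, 60, 32, 25],   # paper-based administration
    [30, 45, 48, 23, 18],   # web-based administration
])

chi2, p, dof, expected = chi2_contingency(counts)
print(chi2, p, dof)   # dof = 4 for a 2 x 5 table, as in Table 4
```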
Predictive
Validity of Paper-Based Versus Web-based FCI Scores
Finally,
differences in the predictive validity of the paper-based and web-based
administrations of the FCI were explored by examining the correlations between
students' scores on the FCI and their subsequent letter grades in the course.
Letter grades were changed to their numeric equivalents ("A" was
given a value of 4, etc.). For the entire sample r = .26 (N = 376),
for the paper-based administration r = .26 (N = 212),
and for the web-based administration r = .28 (N = 164).
These correlations, indicating predictive validity, appear to be comparable.
Table 5 presents these results.
Table 5

Predictive validity of the paper-based and web-based administrations of the FCI

| FCI Administration   | N   | M     | SD   | r   |
|----------------------|-----|-------|------|-----|
| Paper-based          |     |       |      |     |
| FCI Score            | 212 | 13.29 | 6.11 | .26 |
| Grade                | 191 | 2.69  | .96  |     |
| Web-based            |     |       |      |     |
| FCI Score            | 164 | 14.26 | 6.02 | .28 |
| Grade                | 153 | 2.69  | .85  |     |
| Both Administrations |     |       |      |     |
| FCI Score            | 376 | 13.71 | 6.08 | .26 |
| Grade                | 344 | 2.69  | .91  |     |
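The grade conversion and correlation reported in Table 5 can be sketched as follows. The data frame, its column names, and the mapping for grades below "A" are assumptions made for the example; only the "A" = 4 convention comes from the text.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical records; column names and data values are illustrative only.
df = pd.DataFrame({
    "fci_score": [22, 15, 9, 27, 13, 18],
    "letter_grade": ["A", "B", "C", "A", "C", "B"],
})

# Convert letter grades to numeric equivalents ("A" = 4, etc.); the values
# assigned below "A" are assumed for this sketch.
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
df["grade"] = df["letter_grade"].map(grade_points)

r, p = pearsonr(df["fci_score"], df["grade"])
print(r, p)
```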
This study sought to examine
potential differences in paper-based and web-based administrations of the Force
Concept Inventory. The results of these analyses demonstrated no appreciable
differences on FCI scores or items based on the type of administration. While
the results of a four-way ANOVA did demonstrate differences in FCI student scores due to different sections, courses, and gender, none of these differences was
influenced by the type of test administration. FCI student scores were
comparable with respect to both reliability and predictive validity. For
individual FCI items, paper- and web-based comparisons were made by examining
potential differences in item means and by examining potential differences in
response patterns. Again, no differences in item means (as demonstrated by t
Tests) and no differences in response patterns (as demonstrated by Chi Squares)
were found. In summary, the web-based administration of the Force Concept
Inventory appears to be as efficacious as the paper-based administration.
Although
this study reports no differences between web- and paper-based administrations of the
FCI, there are a number of issues related to web-administered testing of
concern to students, instructors and researchers. The first of these is
academic dishonesty. In our study, students were awarded only a small grade
(1-3 points maximum from 1000 total for the course) for completing the survey. We
wanted to encourage students to participate and to be conscientious in their
responses, yet minimize the incentive to cheat. We did not prevent students
from copying or printing out the test, nor did we authenticate that the
students were who they claimed to be. There is no practical way of doing these
things without requiring students to take the test in a proctored computer lab, a solution that has been used at other institutions (e.g., Harvard).
Another
issue related to web-administered tests is the resolution of the student's
computer video monitor. Computer video monitors have a much lower resolution
than paper printouts (typically 72 dots per inch vs. 600 dots per inch). In the
present study, the paper-administered FCI was a direct printout of the web
pages. However, the finer resolution of the laser printer made it easier to
read both the text and graphics, particularly the vectors and dotted lines
which indicated trajectories. While Clausing and Schmitt (1989, 1990a, 1990b)
found that, with reasonable diligence, there was no difference in reading
errors between computer video monitors and paper-printed tests, the finer paper
resolution may still be more comfortable to work with. In addition, it was
difficult for students using a smaller computer monitor to see several test
questions together with the accompanying diagrams. Conversely, printed pages
afford students the opportunity to easily flip back and forth or lay successive
pages side by side. For the web administrations, this can only be accomplished
by the unwieldy process of scrolling back and forth.
Through
informal discussions, we believe that the web-administered FCI may both
reinforce and alleviate test anxiety for different students. A small number of
student participants in the present study reported they were uncomfortable
using the computer technology to take the test. They reported the computer
interface and web browsers to be slightly intimidating, and were particularly
concerned that their test responses might not be received by the instructor or
"lost" by the software. On the other hand, two female students
informally reported they liked having control over the time and environment of
test-taking; they felt more comfortable being able to take the test at home in
the evening, at a time of their choice, and in a relaxed environment with a cup
of coffee.
Finally,
the paper-administered FCI has its own problems. In the present study, the
optically-encoded scanned bubble sheets produced errors due to skipped rows of
questions and incomplete erasures. We eliminated such errors from our data set
by rigorously proofreading and screening bubble sheets prior to scanning, and
by comparing scanner output files to the original bubble sheets. Such proofing
is unlikely to occur with typical paper-administrations, as it poses a
significant additional burden on the instructor. Eliminating the use of bubble
sheets and allowing students to mark directly on the test might alleviate this
problem, but would complicate the grading process. In comparison, the web
administered FCI used "radio buttons" for responses. These buttons
accepted only one response per question, allowed students to cleanly change responses (i.e., no erasing), and aligned each and every response with
the question text and graphics on the screen.
This study demonstrated no differences
between the paper-based and web-based administration of a major standardized
physics test, the Force Concept Inventory. The main implication of this finding
is that, at least for the FCI, web-based administrations could be used in place
of paper administrations, thus saving precious instructional time, reducing the administrative overhead associated with testing, grading, and photocopying, and cutting the costs associated with large-scale data collection. Further,
web-based administrations offer information that paper-based administrations do
not. For example, item latency and completion data can be collected.
We are extending this research by
investigating the possibility of creating a web-based "Physics Testing
Center" that could administer tests and feed resulting measurements
directly into a modern database. Such a testing center would allow for the
routine collection of conceptual and attitudinal data and be available for longitudinal
studies of student learning and instruction.
This would enhance our understanding of programs and pedagogy both
inside and outside our university. Another use of a Physics Testing Center
would be the opportunity for researchers to pilot and standardize new
instruments by providing access to large numbers of student participants.
Along these lines, the authors have begun
to collaborate with other researchers and institutions in an attempt to create
such a centralized web-based testing center and common database. In addition,
we are expanding our on-line standardized testing effort to include other
instruments. Specifically, we are readying the Conceptual Survey in Electricity
and Magnetism (Hieggelke, Maloney, O’Kuma, & van Heuvelen, 1996) for web-based
administration.
References
Arons,
A.B. (1984). Computer-based instructional dialogs in science courses. Science,
224, 1051.
Arons,
A.B. (1986). Overcoming conceptual difficulties in physical science through
computer-based Socratic dialogs. In A. Bork & H. Weinstock (Eds.), Designing
computer-based learning materials: NATO ASI Series Vol F23. Berlin:
Springer-Verlag.
Bork,
A. (1981). Learning with computers. Bedford MA: Digital Press.
Champagne,
A.B., Klopfer, L.E., & Anderson, J.H. (1980). Factors influencing the
learning of classical physics. American Journal of Physics, 48,
1074-1079.
Clausing,
C.S., & Schmitt, D. R. (1989). The effects of computer usage on computer
screen reading rate. Paper presented at
the annual meeting of the Mid-South Educational Research Association. Little
Rock, AR.
Clausing,
C.S., & Schmitt, D. R. (1990a). Paper versus CRT: Are reading rate and
comprehension affected? Proceedings
of Selected Paper Presentations at the Convention of the Association for
Educational Communications and Technology.
Clausing,
C.S., & Schmitt, D. R. (1990b). Does computer screen color affect reading
rate? Paper presented at the annual meeting of the Mid-South Educational
Research Association. New Orleans, LA.
Hake,
R.R. (1998). Interactive-engagement versus traditional methods: A
six-thousand-student survey of mechanics test data for introductory physics
courses. American Journal of Physics, 66, 64-74.
Hallouin,
H. (1998). Views about science and physics achievement: The VASS story.
Proceedings of the Third International Conference on Undergraduate Physics
Education. College Park, MD.
Hallouin,
H., & Hestenes, D. (1996). The initial knowledge state of college physics
students. American Journal of Physics, 53, 1043-1055.
Hestenes
D., & Hallouin, I. (1995). Interpreting the Force Concept Inventory: A
response to March 1995 critiques by Huffman and Heller. The Physics Teacher,
33, 502.
Hestenes,
D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The
Physics Teacher, 30, 141-158.
Hieggelke,
C.J., Maloney, D., O’Kuma, T., & Van Heuvelen, A. (1996). Electric and
magnetic concept inventory. AAPT Announcer, 26, 80.
Huffman
D., & Heller, P. (1995). What does the Force Concept Inventory actually
measure? The Physics Teacher, 33, 138.
Linn,
R.L. (Ed.). (1978) Educational Measurement (3rd ed.). New York:
Macmillan.
Redish,
E.F., Saul, J.M., & Steinberg, R.N. (1998). Student expectations in introductory physics. American Journal of Physics, 66, 212-224.
Straetmans,
G., & Eggen, T. (1998). Computerized adaptive testing: What it is and how
it works. Educational Technology, January-February 1998, 45.
Titus,
A.P., Martin, L.W., & Beichner, R.J. (1998). Web-based testing in physics
education: Methods and opportunities. Computers in Physics, 12,
117.
Waugh,
M.L. (1985). Effects of microcomputer-administered diagnostic testing on
immediate and continuing science achievement and attitudes. Journal of
Research in Science Teaching, 23, 793.
Wise,
S.L., & Plake, B.S. (1990). Computer-based testing in higher education. Measurement
and Evaluation in Counseling and Development, 23, 3.
Zandvliet,
D., & Farrager, P. (1997). A comparison of computer-administered and
written tests. Journal of Research on Computing in Education, 29,
423.