Correlating higher order thinking skills among high school students with their performance on a government assessment

Sean M. Lennon

Spring semester, 2004

In partial fulfillment of the requirements for ORLD 611

Quantitative Research



Background Information

With the enactment of No Child Left Behind (NCLB), states are now mandated to test students enrolled in public schools. Data from this testing must be published, and schools are held accountable for continued failing marks (Goertz & Duffy, 2003). NCLB continues a trend widely accepted and advanced by a majority of states over the last few decades. In 1994 Congress enacted the Improving America’s Schools Act, later expanded by NCLB, which initially mandated the testing of students. The accountability and testing movement has continued to expand into other curricula, grades, and subjects. It appears that testing, at least for the foreseeable future, is here to stay.

The accountability attached to testing is also expanding, incorporating students, teachers, and school systems. High-stakes testing, usually referring to a single test or series of tests used to advance or graduate students, is also gaining in prominence. Students must demonstrate proficiency in the content area of the assessment and, under new NCLB guidelines, show improvement from year to year (Goertz & Duffy, 2003). Common to these assessments is proficiency in application, better known as ‘performance-based skills’, where the test taker must show the ability to use pertinent knowledge to answer questions. This testing style is considered more applicable in assessing higher-order knowledge skills but more challenging to prepare for. The priority is magnified when the stakes are higher.


Preparing students for performance-based assessments can be a challenging task for an educator. The problem is twofold: first, the student must be given the basic knowledge so it can be remembered; second, the student must be taught how to use that knowledge correctly. Compounding the fickleness of human memory, the teacher must now deal with another human variable: cognition. It is assumed that modeling higher order thinking techniques, usually classified by the original Bloom’s Taxonomy, is one of the most effective ways to teach these skills (Gray & Waggoner, 2002). This assumption, however, has never been tested or shown to be effective. This experiment hopes to address that gap: does the use of higher order thinking skills relate to a higher score on a performance-based assessment?

Because of the accountability attached to high-stakes testing, educators must strive for the most effective way to teach students. Students must not only pass these assessments but excel at them, and do so repeatedly year after year. Teachers, constrained by time, class size, and disruptions, must find a way to give students the best opportunity to pass. The repercussions of failure affect the teacher as much as, if not more than, the student taking the test.


This experiment can be of great interest to teachers and school districts preparing for a high-stakes assessment. Performance-based measurements rely upon the student to demonstrate actual mastery of the knowledge. This action is a higher order thinking skill, originally defined by Bloom as comprehension, application, or analysis depending upon the nature of the particular question (Krathwohl, 2002). Under the newer version of Bloom’s Taxonomy these skills are listed as verbs; students must apply, analyze, or evaluate (Krathwohl, 2002). Either model lists thinking skills commonly utilized within a high-stakes test.

Educators, usually crunched for time, often fail to incorporate techniques and activities that foster and generate such skills. Traditionally, knowledge, or simple memorization, was the staple form of teaching. Today we see the limitations of this approach as many graduates have trouble assimilating into the workforce: they do not know how to ‘do the job’. Most children have been taught to memorize, then praised with grades highlighting this simple task. This leads to difficulty in a world where getting the job done is rewarded over simply knowing how to do it.

Testing has incorporated these skills, forcing changes in the weight of content and pedagogy. In some cases this change might be for the better. The reality for the teacher, however, is less noble than we would like to believe. Scores on the tests have become more important than the actual thinking abilities of the students. The assessments can only measure gradable outcomes, a limitation of all testing. Students’ actual abilities are more abstract and difficult to test, and are ignored in the scramble to teach only testable outcomes. If the two could be correlated, the issue would be moot. Does utilizing higher order thinking skills and applications increase performance on a high-stakes assessment? Does it help cognitive ability as well?

Research Question

The research question has been narrowed to test this relationship in the social studies curriculum. Does the use of higher order thinking skills improve performance on a high-stakes government assessment? It is generally perceived that the answer would apply across curriculum or content areas, but such a study would be too large and cumbersome; that broader theory will likely be tested eventually. At this point, however, the focus is narrowed to high school students enrolled in a government class. These students will take the Maryland Government Assessment at the end of the semester. Will higher order thinking help them on their upcoming test?

Relevance to Educational Leadership

This research, in light of current trends in testing, is highly pertinent today. The knowledge is applicable to test creators, school districts, and educators, including teachers of pedagogy as well as classroom teachers. The research will give insights into the effectiveness of higher order thinking on test performance, in turn possibly driving curriculum reform. Most educators and professionals believe such skills are crucial to the teaching process. This research will hopefully show whether higher order thinking activities should be incorporated and expanded.


The limitations of the research center on random error associated with human subjects and possible systematic errors in the survey instrument. To deal with random error the population sample has been enlarged to encompass all, or as much as possible, of Easton High School’s junior class. This population consists of approximately 225 students, representing demographics consistent with the average American public high school. Students in Academic Government classes, a mandatory eleventh grade course, will be asked to participate. Error can occur if too few volunteer or if the volunteers do not represent the proportionate population. To offset this possibility each volunteer will be rewarded with an ice cream or similar ‘treat’ immediately after finishing the experiment.

The questions on the instrument itself can also be a source of error or bias. Enough questions must be used to ensure statistical validity and reduce error, but too many can generate student apathy or multiple-treatment interference. The number of questions on the assessment is set at fifteen. Each question will be of multiple-choice design, consisting of four possible choices. The instrument will be formatted for consistency.

Other possible sources of error include the instructional block and the higher order exercises. To validate these, the instruction will be reverse engineered from the questions: the instructional content will be derived from the content the questions themselves ask. Social studies content specialists at the Maryland Department of Education will establish content validity. The higher order thinking exercises will be created from the research literature, and their face or content validity will be established through experts at the state department or through local universities.


Review of Related Literature

Assessment Movement

To understand the correlation of higher order thinking and assessments we have to understand the history, process, and impact of such testing and how it has affected, and continues to affect, the educational process. The earliest known form of testing dates back to 210 B.C. in China, where candidates had to pass an examination for a career in the civil service. Candidates were tested on their knowledge as well as their reasoning skills based on the canon of Confucius (Madaus & O’Dwyer, 1999). According to Madaus and O’Dwyer (1999), the emphasis on reasoning and its subjective scoring would see this test dropped in favor of assessments using more objective scoring. The issue of scoring students for a grade has continued through to modern times.

In Europe’s early Middle Ages testing was used for membership in craft guilds as well as for entrance into the priesthood or knighthood. By the Reformation such testing had been incorporated by colleges and universities. Evidence of reference guides used for high-stakes examinations has been discovered, illustrating the use of set standards and the curricula created to meet them (Madaus & O’Dwyer, 1999). This trend would continue in America, most notably through the great American educator Horace Mann and his quest for school reform in mid-19th century Boston (Madaus & O’Dwyer, 1999).

Mann saw the need for written examinations of students and for school and educator accountability for their scores. Though his reasons were political in nature, as is common in most school reform, he was instrumental in starting what would eventually become the modern American public school testing movement (Madaus & O’Dwyer, 1999). Other notable contributors would follow: Francis Kelly, inventor of the first norm-referenced assessment; Alfred Binet, who introduced the first successful intelligence test in 1905; Kelly again, in creating the concept of the multiple-choice assessment in 1914; and finally Arthur Otis, who in 1917 created the Army Alpha, the first group-administered intelligence test, for American soldiers heading to Europe during World War I (Madaus & O’Dwyer, 1999). The movement continues today with high-stakes testing, performance assessments, and standards.

By the end of World War II many critics saw a discrepancy between the skills of young graduates and what was needed in the workforce. Due in part to the hysteria surrounding the Cold War between the United States and the Soviet Union, this issue came to be seen as a problem of national security (Rich, 2003). Competitiveness within this nation and against other nations would be scrutinized; the skills of our young would be seen in terms of strength or weakness. This was the beginning of the achievement testing movement. As a nation we worry over low scores and are concerned with countries that score competitively or better than our children (Rich, 2003).

According to Gunzenhauser (2003) this has led to a default philosophy in education, a narrow focus on test scores rather than the subjects the tests are actually supposed to measure. In other words, tests meant to be a part of the system instead drive the system (Gunzenhauser, 2003). Actual control of curriculum now moves away from teachers, schools, and districts to whatever agency creates or monitors the assessment (Vogler, 2003). Regardless of the reasons or impact, testing has become part of the American psyche and is here to stay (Rich, 2003). An offshoot of this movement, also influenced by criticisms and concerns, has been the call for established state or national content standards and the creation and utilization of performance-based assessments.

Standards Movement

The push for national standardization of content areas originated with the National Education Goals, a consensus among the nation’s fifty governors to create a series of outcomes and initiatives in response to educational reform (Nash, 1997). Through this initiative, the National Council of Education Standards would be created, with funding established through the Department of Education and the National Endowment for the Humanities (Nash, 1997). By 1994 the Geography for Education Standards Project had completed the standards for geography. In that same year the Center for Civic Education completed the standards for civics and the National Council for the Social Studies finished those for social studies. The National Center for History in the Schools would complete the standards for history in 1996 (Buckles & Watts, 1998). In 2000 legislation signed by President Clinton mandated that states adopt some form of standardization, either using the previously completed national sets or creating and implementing their own.

Controversy has surrounded this movement almost since its inception. Critics, from teachers to college professors, educational leaders, historians, and everyday Americans, have been outraged by the selections used for each standard. Disputes about what to teach, what not to teach, and the weight and prioritization of certain issues have developed with no clear consensus or outcome (Nash, 1997). What to teach has become increasingly controversial as states mandate successful completion of tests for graduation and grade promotion. As subject matter narrows toward testable outcomes and objectives, content areas become limited and in some cases discarded entirely. Which curriculum, therefore, is deemed more valuable than the others? The answer appears to lie in the nature of the assessment being utilized (Vogler, 2003).

Most experts agree that multiple assessments, in some form or another, are the only fair and impartial way to assess the learning of a child or individual (Olson, 2001). If states utilize more than one test, the limitations on curriculum will not be as severe as in areas using just one test for student promotion or graduation (Olson, 2001). The problem surrounding multiple testing, however, is one of definition: is multiple testing taking the same test more than once, or taking multiple tests? Or taking a test combined with grades from other subjective outcomes, or possibly some combination of all of these? What form the program takes depends upon the state, the stakes for the student (i.e., whether the test is used for promotion or graduation, or has no passing requirement), and the grade being assessed (Vogler, 2003).

States differ in the way they test their students. Federal guidelines require at least one test in every educational ‘block’: elementary, middle, and high school. Most states test more frequently than this minimum (Goertz & Duffy, 2003). Testing varies from state to state and year to year, but almost all states use some form of criterion-referenced and norm-referenced assessment, types of tests that look at performance as well as knowledge (Goertz & Duffy, 2003).

Curriculum reform, created in response to the standards movement and itself conceived from the assessment movement, has come full circle to its original proposition: to teach our children the skills necessary to augment the workforce (Rich, 2003). Has testing of our children driven improvements in education? By itself the answer is probably no, but as part of a series of reforms and programs, the new generation of curricula and assessments has utilized performance outcomes and objectives based on student thinking and problem solving. By passing on knowledge and then teaching how to use it in a real-world setting, students are given a better chance for success (Evans, 1999). Tests that utilize application or thinking concepts in answering a problem have long been used throughout history but were abandoned due to the subjectivity of grading (Madaus & O’Dwyer, 1999). Despite limitations in grading, these types of assessments are coming back, as is the demand for the curricula that incorporate them. As in the forties and fifties, we need a skilled and intelligent workforce, not only for economic reasons but for security reasons as well. How can we continue to be a superpower if we cannot teach our children? How are we going to survive if they cannot think? In our zeal to measure performance we initially moved away from such measurements and are only now beginning to address thinking and application, both through assessments and in the classrooms.

Higher Order Thinking

The application of higher order thinking skills in the classroom can be problematic for many educators. Thinking is physiological but is applied in an abstract way, especially in terms of pedagogy; it remains maddeningly complex to categorize and format, as well as to apply in the classroom. In 1949 Benjamin Bloom, then associate director of the Board of Examinations of the University of Chicago, wanted to create an easily accessible ‘bank’ of test questions open to professionals who could use them. Such a bank would make it easier and cheaper for schools and programs to create and implement tests (Krathwohl, 2002). Bloom enlisted specialists in measurement, meeting approximately twice a year for six years before finishing the final draft of what would eventually be known as Bloom’s Taxonomy. Originally titled the ‘Taxonomy of Educational Objectives: The Classification of Educational Goals’, this would be used as a classification system of higher order thinking skills (Krathwohl, 2002). The skills were listed in ascending order from lower to higher, starting with knowledge, considered the lowest form of thinking, followed by comprehension, application, analysis, synthesis, and finally evaluation (Krathwohl, 2002).

Bloom’s Taxonomy would quickly be incorporated into the assessment movement. Test creators, implementers, and advocates would push for questions based on Bloom’s classifications. These questions, referred to as performance-based questions, would assess the students’ ability to take knowledge and then apply it. Classroom educators would respond by incorporating the skills into their teaching, though few would be adequately trained in their use (Ivie, 1998). Today Bloom’s Taxonomy is a staple of teacher education despite being infrequently used in the classroom. Most educators still teach to the lower thinking skill of knowledge due to lack of training, available resources, ease of application, and their comfort level (Evans, 1999).

Researchers with expertise in medicine and psychology would expand upon and define the physiological nature of thinking, revising Bloom’s original table and extending beyond it. David Ausubel, author of a different higher thinking theory, based his research on the cognitive memory and comprehension techniques commonly utilized in the brain (Ivie, 1998). Ausubel saw higher thinking as extending beyond simple memorization but, unlike Bloom, tried to conceptualize the actual process and to give definition to his concept. In his view Bloom’s Taxonomy did not go far enough in helping people actually apply the skills (Ivie, 1998). Ausubel saw higher order thinking in terms of organizing information into a series of relationships from general to specific, with logic and reasoning used to apply it. Consistent with other research, a student learns best through proper organization of new ideas and the use of practice (Kauffman, Davis, Jakubecy & Lundgren, 2001). If an educator is cognizant of the process, then he or she can utilize it through best-practice teaching strategies (Ivie, 1998).

Howard Gardner, in 1983, would take a different approach to thinking. Gardner believed people have different strengths in cognitive thought and would be best served by teaching to those strengths, whether through higher or lower order thinking skills. He originally proposed eight different categories of thinking styles: verbal/linguistic, logical/mathematical, visual/spatial, musical/rhythmic, bodily/kinesthetic, naturalistic, interpersonal, and intrapersonal (Gray & Waggoner, 2002).

Another work has combined Bloom’s Taxonomy with Gardner’s Multiple Intelligences in forming a matrix incorporating all of the domains and skills of both. This matrix follows Bloom’s order from lower to higher intelligence skills but allows freedom in using the different strengths as defined by Gardner (Gray & Waggoner, 2002). The matrix is designed to be user friendly and can be incorporated into lesson plans and teaching formats.

A revised edition of Bloom’s Taxonomy has been published, utilizing newer concepts of metacognitive application. A previous criticism of Bloom’s table was the difficulty of applying it to the teaching process: the skills were defined more as subjects than as actions. Because each skill was not defined as a verb, definition and use were difficult, and the vagueness of its listed characteristics caused confusion (Krathwohl, 2002). In the revision the table is separated into two charts, the first describing the structure of knowledge and the second the structure of the cognitive process. In the knowledge chart, skills are divided into factual knowledge, conceptual knowledge, procedural knowledge, and metacognitive knowledge. In the cognitive process chart, the skills are broken into remembering, understanding, applying, analyzing, evaluating, and creating. Like the original, these categories are in ascending order from lower to higher skills (Krathwohl, 2002). Included in the revision is a simple-to-use matrix, similar to the Bloom’s/Gardner matrix, written for use within an educational setting. This is a useful and handy reference for a teacher wanting a visual aid for applying higher order thinking skills in a lesson (Krathwohl, 2002).


In 2001 President Bush signed the No Child Left Behind (NCLB) Act into law. Considered a momentous achievement in reforming the public school system, NCLB is one of numerous changes that have swept the nation since the close of World War II (Goertz & Duffy, 2003). Testing subjects to assess their skills and knowledge is nothing new, nor is holding test takers or those who prepare them accountable a novel idea. The difference today is in the scope and breadth of the reform. The assessment movement, born from national pride and the desire to compare our children against the world, was the catalyst for the climate seen today. From this reform materialized the standards movement, created in the hope of implementing a uniform, high set of lessons, objectives, and measures for teachers to apply in helping students pass the newly implemented assessments. Running parallel was the call for higher thinking, both in research and in classroom application. Eventually these movements would coalesce into the standards reform movement familiar to us today.

But does it work? The Improving America’s Schools Act (IASA) of 1994, signed by President Clinton, mandated state assessments for elementary, middle, and high schools. These assessments had to be aligned with established national standards or standards defined by the state. Forty-eight states soon implemented state-controlled assessments in reading and math; the two remaining states also assessed but allowed the measurements to be controlled by individual districts (Goertz & Duffy, 2003). Forty-six of the 48 states administering these assessments utilized either criterion-referenced or both criterion- and norm-referenced testing, and all of the states publicized the results. NCLB continued this trend of accountability but did not start anything new. Under the new legislation states must create assessment programs based on preset standards; all but two had already done so. Accountability requires demonstrated improvement, yet the IASA had already mandated in 1994 that data be used to track low-performing schools (Goertz & Duffy, 2003). The trend is more aggressive, but is it worth the time and money?

New research suggests that it may be. Carnoy and Loeb (2002), in an analysis of two databases, looked for correlations between states that used high-stakes assessments and those that did not, including outcomes not measured by the assessments themselves. The researchers looked at improvements in math and reading in grades four and eight, high school retention rates, and ninth grade passing rates. The indicators for retention and passing were inconclusive, but math and reading showed improvement (Carnoy & Loeb, 2002). The data are too new and the samples too small to be conclusive, yet it appears that assessments may lead to student improvements in some domains. There is no doubt these programs will continue to be developed and implemented.



Experimental Design

The testing will occur within the 2004-2005 school year at Easton High School in Easton, Maryland. The school offers nine government courses throughout the year, excluding the advanced placement classes, which will not be tested. Classes follow a block scheduling format of 88-minute periods, which gives ample time for the experiment to be conducted; the length of the period allows for variability as well as unforeseen situations in the treatment and the measurement. The school uses a heterogeneous grouping format, meaning students are not placed into specific classes based upon ability or other demographic variables. Class size averages about twenty-five students per class, allowing for approximately 225 possible students tested. Random variation already exists due to the heterogeneous grouping, and by testing all of the students in a year the percentage of sampling error can be reduced. The only concern in the sample population is within the control groups.

There will be three control groups assigned by the researcher based on demographic variables found within each class. The experiment measures the impact of higher order thinking skills on the performance of students taking a government assessment. The primary control group, control group #1, will receive the same instructional block and posttest as the experimental groups but will be withheld from the higher order thinking skills instruction; this variability will be measured through the assessment. Control groups #2 and #3 are included to reduce the possibility of random and systematic error and testing bias. Control group #2 will be the only group pre-tested with the assessment, then offered the instructional block, and then tested again; this group will measure possible testing bias and instrument reliability, as it is the only group that will see the instrument twice. Control group #3 will be offered the posttest only, hopefully defining the baseline score or measurement (see Appendix).
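The group structure described above can be summarized in a small sketch. The Python snippet below is purely illustrative: the group labels follow the text, but the flag for whether control group #2 receives the higher order thinking activities is an assumption, since the proposal does not state it explicitly.

```python
# Sketch of the planned group design. Labels follow the proposal text;
# hots=False for control group #2 is an assumption (not stated in the text).
design = {
    "experimental": {"pretest": False, "instruction": True,  "hots": True,  "posttest": True},
    "control_1":    {"pretest": False, "instruction": True,  "hots": False, "posttest": True},
    "control_2":    {"pretest": True,  "instruction": True,  "hots": False, "posttest": True},
    "control_3":    {"pretest": False, "instruction": False, "hots": False, "posttest": True},
}

# Control group #1 isolates the effect of the higher order instruction,
# #2 detects pretest sensitization, and #3 provides a no-instruction baseline.
for group, plan in design.items():
    steps = [step for step, used in plan.items() if used]
    print(group, "->", " + ".join(steps))
```

Laying the design out this way makes the purpose of each control group explicit: each group differs from the experimental group on exactly the factor it is meant to isolate.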

Survey Instrument

The instrument consists of fifteen multiple-choice questions pertaining to the information covered in the instructional block. The administration time is limited to less than twenty minutes to prevent subject apathy and neglect. The questions selected were once used on the Maryland State Government Assessment but have been retired; presently they are allocated for teacher instructional purposes and are published on the Maryland Department of Education website. Parallel-forms reliability and survey rating reliability have already been established through statewide testing. I am asking department officials to assign the fifteen questions to me based on their expertise, which will establish content validity.

Instructional Block

The instructional block is a pre-selected section of curriculum already published to my web site. The actual lesson used will depend upon the exact questions released by the state. The instructional block is designed to last no more than thirty minutes to prevent subject apathy and to allow time for other variables within the 88-minute period. The treatment will be the same for all the groups: I will teach the lesson using my own style and the exact same notes, information, and allocated time. The information will be presented visually by projecting the web page onto the classroom screen while lecturing verbally to the subjects. Using both provides auditory and visual stimuli, addressing the two dominant learning styles usually taught within high school classrooms.


The subjects will comprise approximately 80% of the junior class at Easton High School for the selected year. The percentage not tested includes approximately five percent who took the class in a different school or a different year, and students enrolled in the advanced placement government course; the latter group comprises fewer than fifteen percent of the testable population. The groups tested include all pertinent demographic groups found in the Talbot County region, including, but not limited to, Caucasians, African Americans, Hispanics, and other populations. Talbot County is representative, or very nearly so, of most of America in terms of the racial makeup of its public schools. Other variables, such as socio-economics, will be determined through testing.

Analysis Procedures

The experiment looks for differences in performance on a high-stakes government assessment resulting from teacher-applied higher order thinking applications. To determine any causal effect the experiment will give the exact same lesson to multiple classes, with some receiving higher order thinking activities. The control groups, classes that receive no higher order instruction, are utilized to measure differences, to determine baseline(s), and to minimize error. The population is large enough to reduce bias and to give a more representative sample of typical 11th grade American public school students. The mean (M), median (Mdn), standard deviation (SD), and percentage (P) difference will be calculated for all the experimental groups against the three control groups.
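As a sketch of the planned descriptive analysis, the following Python snippet computes M, Mdn, SD, and the percentage difference between group means. The scores and group sizes are hypothetical, chosen only to illustrate the calculation; they are not data from the study.

```python
# Minimal sketch of the descriptive statistics planned for the analysis.
# All score values below are hypothetical, for illustration only.
from statistics import mean, median, stdev

def describe(scores):
    """Return mean (M), median (Mdn), and sample standard deviation (SD)."""
    return {"M": mean(scores), "Mdn": median(scores), "SD": stdev(scores)}

# Hypothetical posttest scores out of 15 questions
experimental = [11, 12, 9, 13, 10, 12, 11]  # received higher order instruction
control_1 = [9, 10, 8, 11, 9, 10, 8]        # instructional block only

exp_stats = describe(experimental)
ctl_stats = describe(control_1)

# Percentage (P) difference of group means, relative to the control baseline
p_diff = (exp_stats["M"] - ctl_stats["M"]) / ctl_stats["M"] * 100

print("Experimental:", exp_stats)
print("Control #1:  ", ctl_stats)
print(f"P difference: {p_diff:.1f}%")
```

The same `describe` call would be applied to each experimental class and each of the three control groups, with the percentage difference computed against whichever control group serves as the relevant baseline.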



References

Buckles, S. & Watts, M. (1998). National standards in economics, history, social studies, civics, and geography: Complementarities, competition, or peaceful coexistence? Journal of Economic Education, 29(2), 157-166. Retrieved February 12, 2004, from ProQuest database.

Carnoy, M. & Loeb, S. (2002). Does external accountability affect student outcomes? A cross-state analysis. Educational Evaluation and Policy Analysis, 24(4), 305-331. Retrieved September 15, 2003, from ProQuest database.

Evans, C. (1999). Improving test practices to require and evaluate higher levels of thinking. Education, 119(4), 616-619. Retrieved February 1, 2004, from ProQuest database.

Goertz, M. & Duffy, M. (2003). Mapping the landscape of high-stakes testing and accountability programs. Theory into Practice, 42(1). Retrieved February 12, 2004, from ProQuest database.

Gray, K. C. & Waggoner, J. E. (2002). Multiple intelligences meet Bloom’s taxonomy. Kappa Delta Pi Record, 38(4), 184. Retrieved February 13, 2004, from ProQuest database.

Gunzenhauser, M. G. (2003). High-stakes testing and the default philosophy of education. Theory into Practice, 42(1). Retrieved February 13, 2004, from ProQuest database.