Contributions to Session Two of the 1998 NISE Forum: Assessment and the Promotion of Change

Assessing and Evaluating the Evaluation Tool - The Standardized Test
Richard Tapia

Introduction

The misuse of standardized tests at selective and even not-so-selective institutions prevents the nation from tapping into a large part of its human resources' creativity and leadership. We are significantly retarding the process of change and reform that has been accepted as critical to maintaining our national health. For decades now, we have let the traditional beliefs of the ruling class dictate the policy for change and reform in testing, and consequently we have ended up with little or no reform.

My purpose in writing this essay is to push for rigorous study of standardized tests' traditional use. It is imperative that we collect data, evaluate and assess, and use these findings as the impetus for change and reform. While we often allude to such studies, they are invariably incomplete, anecdotal, or nonrigorous. Hence, there can be no effective dissemination or buy-in on the part of our colleagues, administrators, and national educational policy makers. This problem is not restricted to underrepresented minorities, although I quickly add that, as identifiable groups, they are hurt the most. Indeed, there is good correlation between the misuse of standardized tests and underrepresentation. Bluntly put, the misuse of standardized tests is the underrepresented minority's worst enemy.

Let us suggest an effective way of using SAT scores that we shall refer to as a "threshold approach". In most selective universities, admissions people are going to look at the higher end of the test spectrum, say 1300 and above, and then try to make decisions from that group. I maintain that in terms of SAT score alone we cannot make meaningful distinctions in terms of real success between members of the group consisting of individuals who have scored, say, 1050 and above. Moreover, there are significantly many individuals with SAT scores between 1050 and 1300 who in a real sense will be equally or more successful than most individuals with scores above 1300. So, 1050 is our threshold value. This means that all with scores above 1050 are deemed acceptable and other factors should be used to differentiate among the members of the acceptable group. All other factors being equal, I have no problem with breaking the ties with SAT scores. On the other hand, experience has taught me that at Rice University it is unlikely that individuals with scores below, say, 850 will succeed. So we shouldn't accept them into Rice. Now what can we say about the group of individuals with scores between 850 and 1050? Well, we need to look very closely at them and decide if they should be put in the reject class, the acceptable class, or some other class that would require additional information and study.

Rice University has been quite successful at implementing diversity in its undergraduate population. The threshold system deserves much credit for this success. The Rice Guidelines for Admission and Financial Aid included in section 3 of this paper strongly allude to a threshold approach to the use of SAT scores in the undergraduate admission process. On average, Rice University underrepresented minority students have substantially lower SAT scores than does the university at large. However, they are on par with their nonminority counterparts in terms of retention rates and grade point average. They bring in more than their share of awards and admissions to prestigious graduate and professional schools.

I realize that the parameters used in my presentation of the threshold approach to the SAT score are somewhat arbitrary. In a real situation they would have to be fuzzy numbers. However, it is really more the concept that I want to discuss in this section.
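
As a minimal sketch of the concept, and not a description of any actual Rice admissions procedure, the threshold idea can be written as a three-way banding of SAT scores followed by a ranking of the acceptable group on other factors, with the SAT score used only to break ties. The cutoffs 850 and 1050 are the illustrative values from the text. The treatment of scores exactly at a cutoff, the names `sat_band` and `rank_acceptable`, and the single numeric `other_factors` rating are assumptions made purely for illustration.

```python
from enum import Enum
from typing import List, NamedTuple


class Band(Enum):
    ACCEPTABLE = "acceptable"
    REVIEW = "needs closer review"
    UNLIKELY = "unlikely to succeed"


# Illustrative cutoffs taken from the text; in practice they would be fuzzy.
ACCEPT_THRESHOLD = 1050
REJECT_THRESHOLD = 850


class Applicant(NamedTuple):
    name: str
    sat: int
    other_factors: float  # hypothetical combined rating of everything except the SAT


def sat_band(score: int,
             accept: int = ACCEPT_THRESHOLD,
             reject: int = REJECT_THRESHOLD) -> Band:
    """Place an SAT score into one of three bands.

    Assumption: a score exactly at a cutoff falls into the higher band.
    """
    if score >= accept:
        return Band.ACCEPTABLE
    if score < reject:
        return Band.UNLIKELY
    return Band.REVIEW


def rank_acceptable(applicants: List[Applicant]) -> List[Applicant]:
    """Rank the acceptable band by other factors; SAT only breaks ties."""
    acceptable = [a for a in applicants if sat_band(a.sat) is Band.ACCEPTABLE]
    return sorted(acceptable, key=lambda a: (a.other_factors, a.sat), reverse=True)
```

Under this sketch a 1100 and a 1400 land in the same band, so an applicant with the lower score but a stronger rating on other factors is ranked first. Applicants in the review band are deliberately left out of the ranking, since the text calls for a closer individual look at them rather than a formula.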

I would like to preface my remaining comments with three anecdotes. The stories are true; they really happened. The first concerns a Mexican American male, the second a white male, and the third an African American female.

Pedro was born in the barrios of San Antonio, Texas. He was proud of his Mexican-American and barrio heritage; in fact, he was so proud that he had no problem referring to himself as Chicano. Some at Rice University felt that he was a "Barrio elitist". He gave them the feeling that if you were not from the barrio, you really did not have a handle on life, and did not know what life was all about. From his traditional family Pedro learned to be respectful and considerate to others. From the barrio he learned a sense of survival and toughness. He could be sensitive and he could be tough as the situation required. Pedro possessed excellent mathematical and scientific talent. He was the star of his local barrio school. Of course no one from his barrio school had ever gone off to a selective college like Rice; in fact few had gone anywhere except community college. The combination of academic success, inner-city survival, and pride in his heritage gave Pedro considerable inner confidence and self-esteem. He had learned not only how to survive, but also the excellent attribute of never quitting or giving up.

While Pedro and his family had never heard about Rice University, one of his counselors knew about Rice. He advised him to apply to Rice. Pedro had excellent grades and letters of recommendation. His SAT score was 400 points below the Rice average of 1410. However, Rice University was making strong efforts to improve diversity. As such we were experimenting with what I call the threshold approach to standardized tests. More will be said about this later. Suffice it to say for the purpose of this story, the threshold approach essentially sets a threshold score for acceptability and deems all above that score as equivalent with respect to SAT score. Decisions are then made on members of this equivalence class by considering other factors. So, Pedro was accepted at Rice. In previous years when the threshold policy was not in play he would have been quickly rejected. Pedro found Rice very demanding and very challenging. He received several C grades. He thought of leaving. But he was not a quitter and he stayed. When I met Pedro in his junior year at Rice he was president of an active Hispanic organization at Rice. He took a class from me in mathematics. In class it was clear that he had excellent scientific talent. I found him to be exceptionally creative. He was not the best student in the traditional sense, but he was very good, and no one seemed to me to have better potential for graduate school. So, I asked him if he planned on attending graduate school. He replied that he had received several grades of C early on in his Rice career. I told him that the grades alone would not preclude his acceptance, especially since they were early in his career and he was doing so well now. He was very excited, took the GRE and applied to several good schools: Stanford, Berkeley, The University of Texas, and Texas A&M University.
I wrote him a very strong letter emphasizing that he was not only an excellent student, but one of the more creative students that I had taught at Rice in 25 years. School after school rejected him, saying that his GRE scores were too low. I then pushed strongly for his acceptance at Rice and was successful, again because we use a form of the threshold approach in some of our departments' graduate programs. He breezed through a thesis master's degree. As before, our faculty gained a high respect for his talent and creativity. He had an opportunity to work in high-tech here in Houston while finishing up the Ph.D. degree. He was a star there. Recently his supervisor asked me if we had any more like Pedro; he said that he would hire as many as we had. Pedro will finish his doctorate this year with an excellent dissertation.

The moral of this story is obvious. If I had not played a major role, Pedro would not have realized his full potential and his leadership would have been lost to the scientific community. He would not have had the chance to become the leader that he has become in both the industrial and academic communities. He would have been cut down by his GRE score. The misuse of the standardized test would have claimed yet one more victim.

Jim is a white male who grew up in a small Texas town. He told me that he always knew that he was smart, but things were not in proper alignment and he dropped out of school in the ninth grade. He eventually moved to the Houston area, and decided to obtain his GED from a local community college. In community college he was an absolute star. He was directed to Rice University for his undergraduate education. Rice is extremely selective and rarely pulls from the community college population. However, it is to our credit that we accepted him. As was explained in the previous story about Pedro, Rice does not put excessive weight on the use of standardized tests at the undergraduate level (however, wait for my third story). At Rice, Jim took several advanced math classes from me. One of the classes is essentially a graduate course, and Jim was the star of the class as an undergraduate. He clearly was one of the more mathematically creative students that I have taught in all my 27 years of teaching. We talked often and I encouraged him to apply to good graduate schools. I wrote him a very strong letter. Not long ago, Jim appeared in my office very somber and distraught. He confided in me that he had been rejected at Berkeley, Cornell, and Stanford. He had been accepted at one good state school and at Cambridge and Oxford in England. He desperately wanted to know what had gone wrong. I asked him about his GPA. He said, "I will be graduating from Rice with an A+ average and the distinction of summa cum laude. Moreover, I did it in mathematics, one of Rice's most challenging majors, and I did it in three years coming from a community college high school GED". I then asked him about his reference letters. He quickly replied that all who wrote were professors from classes where he was at the very top of the class and they had told him that their letters were very strong. Finally, I said, "Tell me about your GRE scores".
He answered that in two of the three categories he had done very well, but in one category he was only in the 75th percentile. I replied, "That's it". He said, "How can that possibly be?" I repeated my reply, and we had a much-needed conversation.

Jim was a victim of the misuse of a standardized test. Yet he was a white male, a straight-A student at a very demanding school, and one of the most intelligent and creative individuals that I have ever had the pleasure of teaching.

Sandra is African American and was born and raised in Houston, Texas. She was an excellent student in high school and received a full scholarship to study at a university in the northeast, well-known for its excellent engineering programs. Upon graduation she applied to graduate school at Rice University, to one of our "better" engineering departments. She felt that it would be nice to return to Texas, and Rice had a fine reputation. She applied, had not been accepted, and was visiting Rice. I was asked if I would be willing to talk to her. I replied that I would be happy to talk to her. She was brought to my office by a faculty member that I have respect for professionally. I spent considerable time with Sandra; in my estimation she was a potential star. She had an A average from an excellent school, was very mature and focused, had overcome serious obstacles, knew what she wanted and why, and was in general most impressive. I expected the departmental representative to proudly tell me that they were going to accept her and support her. Instead I was asked if she could be considered for tuition and support under a diversity program that I administer. I asked, "Why? She is an outstanding applicant". I was told that they actually had several applicants whom they had not yet decided to accept and who were definitely superior to Sandra. I asked in what way they were superior. I was told that the applicant at hand was only in the 89th percentile on one part of the GRE test and the other applicants (all foreign) had GRE scores in the 92nd or 93rd percentile. Hence, the department felt that it could not pass them over for an "inferior" applicant. It was difficult to contain myself. They were sincere, and of course extremely naive. I chose to support Sandra, and lost even more respect for that particular department. She may or may not come to Rice.

As was the case with Pedro, Sandra would not have been accepted if I had not intervened. Our evaluation system is flawed, and it is not going to be saved by waiting for these interventions from outside.

Today universities are looking for individuals with a broader range of attributes. However, standardized tests do nothing to identify most of these attributes. I firmly believe that members of underrepresented groups, by the very nature of being members of such groups, have learned skills and have developed sensitivities and understandings that would fall into this broad range. For example, in research university environments we talk about the needs for nurturing, mentoring, more effective teaching, a better understanding of the whole student, and outreach to broader communities. Members of our underrepresented groups are prepared to contribute in these directions. However, to a very large extent, these individuals do not have an opportunity to demonstrate this creativity and leadership skill because of traditional assessment barriers. These barriers are not outright discrimination; no, they are much more subtle. On the surface they look like reasonable measurements of necessary prerequisites or skills. However, they are strongly biased towards the precocious attainment of various pieces of information and knowledge. Potential for success, creativity, the ability to guide and lead, and the ability to adapt to a new environment and bring needed understanding from another environment are not measured. This is too hard; we do not know how to do this. Moreover, our basic leadership is not totally unhappy with the traditional process since, after all, their careers were spawned by the process in place, so there must be some real good in this current traditional version. While I am basically criticizing the use of standardized tests in undergraduate and graduate admission processes, it is a straightforward matter to extend my criticism to hiring policies, promotion policies, and selection procedures for prestigious fellowships, grants, and other professional rewards.
Moreover, while I find it easy to argue in terms of the effect on members of underrepresented groups, I certainly do not wish to imply that these statements and concerns are restricted to them. We are in danger of locally restricting participation that would globally be of value to our national agenda. Local values and global values are usually at odds, often without our being aware of this conflict. The department does not worry about the division, and the division does not worry about the whole university.

A couple of years ago, I served on a committee to review education and human resource development activities of the National Science Foundation. The committee was quite taken aback to find that essentially all the winners of the prestigious fellowship awards were nonminority males who had demonstrated an affinity for science by the time that they were 10 years old or so. The winners were very impressive and undoubtedly very precocious. It was easy for us to feel that the door had been closed on you if you were not extremely precocious. Of course our concern was whether this is in the best interest of the nation. We drafted the following statement as a part of our recommendations:

Committee to Review Education and Human Resource Development Activities of the National Science Foundation (March 1996)

"The committee feels that the implementation of the current evaluation criteria concerning the quality of the applicants overemphasizes the 'focused prodigy' profile. Since it is impossible to disentangle productivity due to privilege from productivity due to talent, reviewers and panelists generally fall back on this profile as a means of evaluating candidates even though it may not be a good predictor of scientific creativity and success. This emphasis works against some candidates (most often women and underrepresented minorities) who may not have been interested in mathematics or science as a young child, but who develop rapidly and demonstrate great creativity once the interest is manifest."

Our Addiction to the Use of Test Scores

For some not well-understood reason, university admission committees demonstrate an addiction to the use of one-dimensional qualifiers like the SAT and the GRE test scores in the admission evaluation process. There seems to be a belief that all students can be well-ordered; hence we should try to well-order all students. Clearly no two students are the same; therefore we should be able to come up with some measurement that will differentiate. In mathematics we know that it is not possible to well-order quantities that display many components of value. We know in the admissions process that we value many student attributes, yet we fall back on the one-dimensional standardized test. It does get us out of our dilemma, and perhaps this is the most valued aspect of the test. It is simple to use, and it is readily available. It allows us to differentiate with some feeling of security between any two students. It gives us a simple tool. We know that this simple tool can't be perfect, but no one really knows how good or how bad it is; hence for convenience let's use it until someone demonstrates that it is totally flawed. But it works: we get good students, they perform well and succeed, and furthermore it is not at all clear that it ignores truly qualified students. So, it can't be that flawed.

People are multidimensional. Science is not only multidimensional, but it benefits from multidimensional approaches. Evaluation from standardized tests places all the weight in one dimension. What about the other dimensions? Are they not important? Boldness and creativity play critical roles in research activity; yet, at best, they have a weak correlation with scores on standardized tests. The rub is that once we concede that the problem is multidimensional, then we don't know what to do. The evaluation process becomes extremely difficult. There is another deficiency in the way we evaluate. We define success in a manner which may not be meaningful. For example, the MCAT score may be a fair predictor of success in medical school. However, success in medical school may not correlate well with success as an effective physician. What is real success? From this point of view, the MCAT is an absolutely hopeless evaluation tool. We value what we measure, because measuring what we value is simply too hard to do.
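
The ordering point made in this section can be illustrated with a small sketch. It assumes, purely for illustration, that each student is described by a tuple of attribute ratings and that one student is "better" than another only if at least as good on every attribute and strictly better on one (a componentwise comparison). The attribute names, the numbers, and the function names `dominates` and `incomparable` are all hypothetical.

```python
from typing import Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every attribute and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def incomparable(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if neither profile dominates the other and they are not identical."""
    return tuple(a) != tuple(b) and not dominates(a, b) and not dominates(b, a)


# Hypothetical profiles: (test score percentile, creativity, persistence)
student_a = (95, 60, 70)
student_b = (80, 90, 85)

# Neither student is better on every attribute, so the comparison gives no order.
no_natural_order = incomparable(student_a, student_b)

# Projecting onto the single test-score coordinate forces an order anyway,
# which is exactly what a one-dimensional qualifier does.
forced_order = sorted(
    [("a", student_a), ("b", student_b)],
    key=lambda item: item[1][0],
    reverse=True,
)
```

In this made-up example the two students cannot be ranked on their full profiles, yet sorting on the test score alone confidently places the first ahead of the second and discards the other two attributes entirely.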

Test Scores and a Perceived Lowering of Standards

It is interesting that a significant part of our population equates lowering of standards, or an inferior applicant, with lower scores on a standardized test. This was the essence of the infamous Bakke decision in California and Hopwood decision in Texas. So-called inferior minority students were accepted over the named plaintiffs of Bakke and Hopwood. Why were these minority students inferior and less capable? Solely because they had lower standardized test scores. We equate lower scores at all levels with lower standards. I have seen highly intelligent colleagues argue the merits of a 93rd percentile GRE score over that of a 90th percentile score (recall our anecdote concerning Sandra). The individual with the lower score was rejected in favor of the one with the higher score with no doubt whatsoever that the process was fair. We have learned to put great value on what we measure and have forgotten to ask if this measure is flawed concerning what we value. These tests are far from God-given. We must evaluate the evaluation criteria. We are so naive as a nation as to spend considerable time, money, energy, and rational comment on a criterion that is blindly accepted. Here we need to play philosopher more and mathematician less. We must question the validity of the axioms, and not just follow the implications of these axioms. However, breaking away from traditional use of these standardized tests will be nearly impossible. We need to assess and evaluate their effectiveness and then use them in the appropriate fashion.

Jesse Shapiro, last year's valedictorian at New York's prestigious Stuyvesant High School, stated in his valedictory address,

"Nothing could be fairer than a simple multiple-choice exam. It leaves no room for political patronage, racial bias, or other discrimination. Unless New York wants talent-blind admissions, it should keep testing".

Let's stop and reflect on Shapiro's comments. It is easy to be fair. But is being fair the complete picture? An unfair process would certainly lead us to believe that the process should be questioned. However, a fair process may also have some serious deficiencies. Years ago, my son raced BMX bicycles. I questioned a lane selection process that was being used. I told the officials that it could be greatly improved. They told me that it was fine as it was, because it was fair: each rider had the same chance of getting any particular scenario. I asked them to consider the following hypothetical scenario. Two riders go to the gate. One rider will have to start backwards (rear wheel on the gate) and a coin will be flipped to see which rider has to start backwards. I told them such a procedure was fair in their sense, but was far from optimal. Indeed, I knew exactly how to improve it. They conceded my point, and we eventually introduced a new lane selection system nationwide. An additional point is that multiple-choice tests may be fair, but they rarely test what you want to test.

In the opposite direction from Shapiro's comments we quote from the Rice University admissions guidelines,

Rice University Guidelines for Admissions and Financial Aid

First, we seek students, both undergraduates and graduates, of keen intellect who will benefit from the Rice experience. Our admissions process employs many different means to identify these qualities in applicants. History shows that no single gauge can adequately predict a student's preparedness for a successful career at Rice. For example, we are cautious in the use of standardized test scores to assess student preparedness and potential. In making a decision to admit or award financial aid, we are careful not to ascribe too much value to any single metric, such as rank in class, grade-point average, the Standard Achievement Test or Graduate Record Exam.

Rice University seeks to create on its campus a rich learning environment in which all students will meet individuals whose life-experiences and world-views differ significantly from their own. We believe that an educated person is one who is at home in many different environments, at ease among people from many different cultures, and willing to test his or her views against those of others. Moreover, we recognize that in this or any university, learning about the world we live in is not by any means limited to the structured interaction between faculty and students in the classroom but also occurs through informal dialogue between students outside the classroom.

Rice places a premium on recruitment of students who have distinguished themselves through initiatives that build bridges between different cultural, racial, and ethnic groups. In so doing, we endeavor to craft a residential community that fosters creative, inter-cultural interactions between students; a place where prejudices of all sorts are confronted squarely and dispelled.

Our admissions process precludes any quick formula for admitting a given applicant or for giving preference to one particular set of qualifications without reference to the class as a whole. An inevitable consequence of this approach is that some otherwise deserving and well-qualified students will not be admitted to Rice. By selecting a wide range of matriculants of all types, the admissions process seeks to enrich the learning environment at Rice, and thus increase the value of a Rice education for all students.

What is in a Test Score?

In a complex world, be leery of easily quantifiable criteria. In undergraduate admissions there is evidence to believe that SAT scores have some meaningful correlation with first year grades. The question that must be asked here is whether grades are an end in themselves, or just an implied, and perhaps ineffective, predictor of some other meaningful property. In graduate school grades are never the dominant issue. There is more concern for creativity and an ability to perform new and independent investigations that lead to new theory. Does the GRE score measure this ability, or even more to the point, can it be used to predict success? Bowen and Rudenstine in their well-known text In Pursuit of the Ph.D. argue that traditional evaluation criteria employed by today's graduate admission committees do not do a good job of predicting success.

In my years of experience at Rice University on both undergraduate and graduate admissions committees, I have seen many diverse students come through our doors with varying degrees of success and varying levels of scores on standardized tests. I am prepared to say that students with very low test scores will not succeed at Rice. The SAT and the GRE tests are effective predictors of failure for those who score very low. I am not prepared to say that students with high test scores will succeed. This is particularly so in graduate education. I have seen students accepted into our graduate program with excellent undergraduate grades coupled with excellent GRE scores, and yet from the very beginning they displayed other attributes, including a perceived lack of creativity, that made me seriously question their ability to succeed in our program. Moreover, they did not succeed. On the other hand we have accepted students with only reasonable GRE scores who were quite successful.

What is Success for Today's Graduate Student?

In traditional mathematics graduate programs, we have screened and evaluated our students with the implied objective of looking for the next Gauss, or Newton, or Einstein. The loss of an individual who could not measure up was not really a loss according to the accepted objective. Well, perhaps they didn't have to measure up to this extent, but they should be able to be successful faculty at any good research institution in the country. However, only a minuscule number of today's Ph.D. recipients are able to obtain faculty positions at research universities. The vast majority obtain employment in a host of different areas. Many are employed by industry, government, the business world, or nonresearch teaching colleges. Things have changed; the job market has taken on a completely new look. Yet, we evaluate, select, train, and educate according to our out-of-date objective. Today's student needs different skills and different training. But our more traditional departments don't change. They continue business as usual. A major point here is that without a change in the evaluation and assessment procedures we are undoubtedly excluding students who could excel in the job market and the new world, and on many occasions producing students who do not fit well into the new job market. We should also realize that today college degrees play the role that high school diplomas played years ago: union cards for fairly nontechnical jobs, e.g., sales. Universities are playing different and broader roles, yet our admissions policies don't reflect these changes.

The department that I represent at Rice, the Department of Computational and Applied Mathematics, is a world-class department in the area of computational and applied mathematics. Our graduate student population is over 50% women and about 35% underrepresented minorities. These representation figures are unique within the collection of mathematical sciences departments of research universities. Retention through the Ph.D. degree is essentially the same for our women and minority students as it is for all students. Our minority students are very qualified. They come from various schools with excellent grades and all have previous successes. However, as a group their GRE scores are somewhat lower than those of many of our other students. Their scores are not low, but they also are not what would be required at most other selective universities. Our women and minority students succeed in their graduate careers and go off to successful careers. Most go into industry or government research labs. Several of our minority students have demonstrated strong national leadership. We have learned not to put excessive emphasis on the GRE score in the application evaluation process for all students.

While we are on the topic of graduate student diversity, I would like to relate one of my more satisfying teaching experiences. Two years ago I taught a graduate course in mathematical optimization theory. There were 24 students in the class and 12 were members of underrepresented groups, in this case African-American and Mexican-American. Some of the minority students sat in the front, some in the back, some asked good questions, some didn't ask any questions, some asked questions that did not need to be asked, some did well on the exams, some did not. The class atmosphere was one of genuine interaction. At the end of the semester the majority students had learned a very strong lesson: the minority students were just like them, in that on essentially any professional issue they represented the complete spectrum, and could not be stereotyped.

In summary, further study is desperately needed on the way we use SAT and GRE standardized tests in admissions practices. I have put forward a new model that could take us a step in the right direction. This model is not intended as an ultimate solution, but as a way to demonstrate that with some effort we can improve the situation.
