Creative Education 2012. Vol. 3, Special Issue, 951-958
Published Online October 2012 in SciRes (http://www.SciRP.org/journal/ce)
http://dx.doi.org/10.4236/ce.2012.326145

The Program Assessment and Improvement Cycle Today: A New and Simple Taxonomy of General Types and Levels of Program Evaluation

James Carifio
University of Massachusetts-Lowell, Lowell, USA
Email: James_Carifio@uml.edu

Received August 20th, 2012; revised September 22nd, 2012; accepted October 5th, 2012

There has been strong pressure from just about every quarter in the last twenty years for higher education institutions to evaluate and improve their programs. This pressure is being exerted by several different stakeholder groups simultaneously, and it also represents the growing cumulative impact of four somewhat contradictory but powerful evaluation and improvement movements, models and advocacy groups. Consequently, the program assessment, evaluation and improvement cycle today is much different and far more complex than it was fifty years ago, or even two decades ago; it is a highly diversified and confusing landscape from both the practitioner's and the consumer's view of such evaluative and improvement information, with its seemingly different and competing advocacies, standards, foci, findings and asserted claims. The purpose of this article, therefore, is to present and begin to elucidate a relatively simple general taxonomy that helps practitioners, consumers, and professionals make better sense of the competing evaluation and improvement models, methodologies and results available today, which should improve communication and understanding and provide a broad, simple and useful framework or schema to guide their more detailed learning.

Keywords: Program Evaluation; General Types of Program Evaluation; Program Evaluation Foci; A Program Evaluation Taxonomy; Program Life Cycles; Higher Education

Introduction

In the past decade, there has been strong pressure from just about every quarter for higher education institutions to evaluate and improve their programs, with this pressure currently at its highest point in the last century (American Council on Education, 2012; State Higher Education Executive Officers, 2012). This pressure has come simultaneously from parents, students themselves (particularly those working to pay for their own education and/or borrowing substantial amounts of money to do so), government agencies, business leaders, the general public, accrediting bodies and professional associations, all of whom are both clients of and stakeholders in the higher education system, which adds several layers of complexity and nuance to any program evaluation or improvement effort an institution undertakes in terms of its goals, design, data structures, analyses and reporting (Scriven, 2010a). The program development, evaluation and improvement cycle today, consequently, is much different and far more complex than it was fifty years ago, or even two decades ago, and it is a highly diversified and confusing landscape from both the practitioner's and the consumer's view of such evaluative and improvement information (Mets, 2011).
Part of today's pressure for program evaluation and improvement information has come as a reaction to rapidly and continually rising higher education costs, a reaction from those who actually pay for the higher education students receive and who want to know the degree to which these students are getting quality (or desired adequacy) for what is being paid. Quality here is defined in many different ways, ranging from student satisfaction to parental satisfaction to the achievement of highly desired outcomes, including personal development objectives and desired types of employment or further education (US News, 2012; London Times, 2012). Another source of pressure for program evaluation and improvement information today is a strong accountability and stewardship factor that has increasingly come to the fore in all areas of public and private endeavor, given many of the excesses of the 1980s and 1990s; this factor concerns far more than the intended use of resources and the avoidance of moral hazards, for it also concerns the achievement of core ideals, values and societal obligations in the conduct of higher education on a daily basis (Burke, 2005; Lederman, 2009; Shavelson, 2010). Where the first force above may be described as individuals "getting value for their money," the second force can be seen as society "getting value for its money," and it does not take a lot of reflection to see that these two forces may be working and pressuring institutions at cross purposes, particularly when it comes to program evaluation and improvement efforts and information.

Another part of today's pressure for program evaluation and improvement efforts and information in higher education has come from the Continuous Improvement or Total Quality Management (TQM) movement and its approach to carrying out one's mission or charge. Total Quality Management (English & Hill, 1994; Harman, 1994; Dlugacy, 2006; Mulligan, 2012) has become a strong, institutionalized factor in all areas of endeavor, but it actually came to education and allied health later than to other areas, as a more real-time, dynamic and different kind of statistical approach to evaluation and improvement in a changing and competitive environment, rather than the fairly static and club-like environment that could be said to have characterized education and health both nationally and internationally until about the mid-1970s. The work of Deming (1986) in particular stands out in this area, even though many in both education and health do not realize that this creative and revolutionizing genius is at the foundation of much of their methodologies and what they do, after Deming's work was recognized and successfully implemented in Japan. What also tends not to be realized by many is that Deming's work and approach runs counter to, and is not easily or seamlessly compatible with, classical evaluation and improvement models and methodologies (in education in particular), which have been developed and used over the last 150 years and which have a pantheon of creative and revolutionary geniuses of their own, ranging from Taylor to Thorndike to Tyler to Cronbach to Campbell to Stake to Stufflebeam and many others.
The current force of and demand for continuous improvement and the continuous improvement movement, it should be further noted, often work at cross purposes to the other forces mentioned above, and the views and approaches of the classical evaluation and improvement models and forces tend to look for more definitive answers that typically take more time to produce (Stufflebeam, 2001). In particular, there are the often contradictory pressures of the last of the five factors in TQM, namely marketing and the marketing movement, for which evaluation and improvement information is its life's blood. Higher education and its programs have never been more strongly marketed than now, both nationally and internationally, and that marketing has created a version of "agenda-driven," as opposed to more neutral, program evaluation and improvement information and efforts, which further confuses attempts to understand this currently complex landscape in a more organized and systematic way.

I have spent four decades doing program evaluation and improvement in K-12 and higher education, business, health and the military, of just about every conceivable kind and at just about every level, as well as trying to help practitioners and those charged with doing program evaluation and improvement carry out their charge, mandate, or mission in sensible and valid ways. I have increasingly found that the program development, assessment, evaluation and improvement cycle today is much different and far more complex than it was even a decade ago. In fact, the field now is a highly diversified and confusing landscape from both the practitioner's and the consumer's view of such evaluative and improvement functions, activities and information, as there are many competing and conflicting forces, approaches and agendas, which make the whole area difficult for the non-specialist and even for those who call themselves experts. To illustrate this point more concretely, Stufflebeam (2001) identified 23 different evaluation, assessment, and accountability models that reflect the four major movements and forces outlined above, either singly or in combination. Stufflebeam expanded these 23 models to 31 in 2007 (Stufflebeam & Shinkfield, 2007), and more models and model variants have been added since then. Mets (2011) tried to analyze and better systematize the models and movements in this field, but concluded that a set of more macro and simplifying categories was needed as a step towards one or more taxonomies that would help to better organize and represent this seemingly ever-burgeoning field. Creating taxonomies, however, is not an easy task, as is well known, and all taxonomies have various advantages and disadvantages (Mezzich, 1980; Godfray, 2002) and relative "goodness of fit" and usefulness in different contexts and situations.

In an attempt to help the practitioners, managers, colleagues, doctoral students and others with whom I work, I have developed a very simple taxonomy that is quite helpful in organizing and understanding the highly diversified and confusing program evaluation and improvement landscape today. This simple taxonomy allows one to locate and classify various kinds of program evaluation and improvement efforts, activities, methodologies and reports in their own right and as compared to others, but also in terms of the general and developmental nature of program evaluation and improvement efforts today.
Like all taxonomies, the primary purpose of the one presented below is to facilitate communication and discussion between people, as well as to better situate and contextualize particulars and particular instances, and what in general they are and are about. Such classifications help to represent an approach or model more appropriately and to understand what it provides and does not provide, so that one has more reasonable expectations in a given context, as well as evaluates what has or has not been done more reasonably. Therefore, the taxonomy presented below, as well as this article, is not meant as a definitive or detailed answer to the things it conceptualizes, categorizes, and discusses, but as an advance organizer for the field as it currently stands, to help those doing or consuming program evaluation and improvement information to have a broad, simple and useful framework or schema to guide their more detailed learning. At one level, this simple taxonomy can be seen as one way to group Stufflebeam's 31-plus models of accountability and program evaluation into more macro categories and progressive levels, questions, and functions that are easier and quicker to ascertain, understand, and evaluate in Scriven's (2012) sense of this term.

The Program Assessment and Evaluation Cycle

Many people do not grasp, or do not seem to grasp when talking about program evaluation and improvement, that all programs (like many other things, including institutions) are not eternal, and do not spring fully formed and developed from the left ear of Zeus like Athena, but rather have and go through a life cycle from birth to suspended animation or death or rebirth of some kind, and that during this developmental life cycle the program and its evaluation and improvement are qualitatively different in several key and important ways at each stage of the cycle. This basic fact obviously means, among other things, that one may be trying to use the wrong, or rather the least appropriate, evaluation models, methodologies, activities, tools, and information for a particular program or desired improvement, given where the program is in its developmental life cycle and the improvement sought. There are several excellent descriptions of program (or product) life cycles (O'Rand & Krecker, 1990), but the critical point of importance here is that all program evaluation and improvement efforts begin with, and really cannot wisely progress without answering, the prime core question, which typically has not been answered when I am asked for consultative help and ask it. This prime core question is: Where are you, and what in general are you looking to do? Briefly define and characterize (give me a model of) your venture/program (and its general goals) and what general kind (Level) of "evaluation" you want to do and why, with the implicit question here being, "Where exactly are you on the program's life cycle, and do you know and understand this key point that drives almost everything else?"

Please note that a program of inquiry (or research) to evaluate a particular "venture" (i.e., a set of activities or a program at any level) can be designed to reflect multiple levels of sophistication and goals. Once you define and characterize your "venture" (and its goals), it is usually quite helpful to "profile" the kinds (levels) of evaluation you wish to do, or ultimately wish to do, and what each level will require both incrementally and developmentally.
Your evaluation efforts may progress across levels and over time, but the key is knowing where you are on your venture's developmental cycle, and what "business and evaluation business you are really in right now," as Richard Morley, president of "The Breakfast Club," has helped generations of entrepreneurs understand (Morley, 2012). What level(s) you choose also typically depends on where you are in the "program development, improvement and evaluation cycle." The levels of program development, assessment and evaluation in a program's or venture's life cycle, and my simple general taxonomy that attempts to capture and organize them, are as follows:

Level 1: Venture/Program (Status) Reporting

The goal of this level is to produce on-demand and fairly quick narrative (and quasi-quantitative) reports and similar stories about the program/venture and its current status and/or promise and progress (or lack thereof). This type of evaluation is often called managerial or practitioner evaluation, but it is also quite often called qualitative evaluation of several different kinds. It is typically characterized by "background homework" activities (briefings) on the program (and its competitors), "census" surveys of various kinds, program review activities (and reports), program auditing activities (and reports), news stories (and feature articles), press releases, testimonials, official testimony (and briefings), symposia, conferences, various kinds of case studies and similar activities, all of which often represent different evaluation models from different traditions and somewhat non-commensurate disciplines. Managerial, practitioner and much qualitative evaluation is typically done without a formal underlying data structure for the venture/program (or a formal venture/program evaluation plan), and with shifting goals and priorities that are most often externally determined. The lack of these two features (which are part of the underlying core foundation of the more classical evaluation models and of the higher levels in this taxonomy) is among the characteristics that distinguish "managerial" or "naturalistic" evaluation and its activities from other types and levels of evaluation (Lincoln & Guba, 1985; Pawson & Tilley, 2008).

Managerial, practitioner, and qualitative evaluation is most typically done in a fast-paced, fast-moving setting where there are many competing ventures and priorities and comparatively little response time, and where the external and internal environment must be responded to quickly with mission-critical information for several different stakeholders. Such settings tend to be "institutional" and "action-oriented" in character, and informal R&D and "change-oriented" settings where making a quick (initial) response that is then modified at will (usually with comparatively little "high-powered and hard" data) is key. Given this context and its features, a wide variety of more qualitative research and evaluation techniques (e.g., interviews, focus groups, open-ended questions) tend to be used at this level, as they are far quicker and easier both to develop and to do, and speed and response deadlines are of the essence at this level (Denzin & Lincoln, 2005).
As explained in detail elsewhere (Carifio & Perla, 2009), all research and evaluation methodologies and techniques are "qual-quantifications" or "quant-qualifications" and different sides of the same blanket, so the "qualitative-quantitative controversies" are essentially irrelevant from this perspective, and all methodology is essentially "mixed methods" to some degree. Methodology, therefore, is typically a matter of the mixture, as well as of the precision and warrants one wants for claims, and of the complexity of the design and data structure one needs to employ to make different kinds of decisions about different kinds of claims (Mertens, 2010). Consequently, evaluation (and research) methodologies are not competitive but complementary, and are well to poorly suited, singly or in combination, to the problem and the questions to be answered (Green et al., 2006).

It also follows from the above points that almost all evaluation and research methodologies are both qualitative and quantitative to some degree or mixture of degrees at the same time, with the degrees and mixtures varying according to several factors, including what level of the taxonomy presented here one is currently at or working to be at, as one progresses through the program or venture's life cycle, relative to where one "stops," decides to stay, or "exits" the life cycle. In general, as one progresses up the levels of this taxonomy, the evaluation or research one does typically becomes more quantitative and more powerfully quantitative and statistical in sophistication, design and complexity, but that is in great part due to having built the measures and data structures in the activities carried out at the lower levels of this taxonomy, and thus having the capacity as well as the time needed to implement and carry out this more extensive, sophisticated, and complex quantitative and statistical type and level of evaluation and research. This latter type of evaluation and research also does not spring fully blown in an instant from the left ear of Zeus like Athena, but must typically be developed as a capacity of the venture and its program evaluation efforts over a fairly considerable amount of time.

Further, just because one is doing evaluation or research that is somewhat more qualitative than quantitative (as a mixture) does not mean that one cannot employ an experimental or quasi-experimental approach and actually use an experimental or quasi-experimental/evaluative design, even in case studies (Yin, 2008), as the extensive work of Kleining (1982) has clearly shown. Such qualitative designs and analyses indeed do not have the "power" of more quantitative and statistical designs and analyses, but they can still establish and answer causal and similar types of questions. In a word, there is more to qualitative methodology than ethnography and various forms of text and literary analyses and methods, and if done appropriately, one can get valuable and valid findings for decision making at the lower levels of this taxonomy, just as one can get fairly useless and invalid quantitative and statistical findings at the higher levels of this taxonomy when blind empiricism and shotgun designs are used. These issues are just not that simple or easy to generalize about definitively in a few words or paragraphs, and each case and design must be judged on its own details, quality, adequacy and merit, as Phillips (2005) has pointed out and analyzed in detail.
However, venture or program status reporting and evaluation designs typically tend not to be of the Kleining or Yin kind, and tend to be more along the lines of the managerial and practitioner evaluations described above. Lastly, it should also be noted that managerial, practitioner, and qualitative evaluation is often done in settings where there are not the resources to do much else (Bamberger et al., 2004), and this basic fact is an important contextualizing factor. This type and managerial level of evaluation may suffice and be quite adequate for many ventures, programs and activities, particularly in their initial phases, but once the "stakes" increase, and particularly relative to the claims and assertions one wants to make about the program or venture, higher and different levels and types of evaluation are needed.

Level 2: Data for Decision-Making

The goal here is to select and measure a small set of variables (e.g., percent passing grades, number of users, increases in knowledge, etc.) and to link them in a regression "equation" or logical decision-making algorithm of some kind, such that by inputting actual "quantitative" data and formulating critical thresholds, decisions can be made to continue, discontinue, expand or modify use of the venture or particular parts of it. In other words, is there an adequate "return on investment" (or "benefits" as compared to the lack thereof and/or losses), or enough promise, to continue the venture and keep working on it?

I call this "keep or kill" evaluation, and this kind of evaluation may be formative or summative, retrospective or prospective. The "decision equations" may include stakeholder interests and "good will factors" as well as policy and organizational interests and goals. This kind of evaluation is usually the cheapest, least labor-intensive and easiest formal evaluation to do, as very simple criteria, relatively "low-powered" data and "keep or kill value equations" may be used (or not). Keep or kill (Level 2) evaluation is often done with "initial ventures" and prototypes, or to make better and more timely practical and management decisions about a given venture or program, and there is a formal minimum data structure of some kind at this level. Usually, attempts to start building some kind of formal data structure for the venture or program begin at this level, usually to improve the quality and sophistication of the keep or kill equation and statistical analyses so that they are something more than just blind and/or shotgun "number crunching." However, it is also usually at this level that institutional and corporate evaluators, evaluation teams and units discover that the institution has an (applications and "business procedures") Management Information System (MIS) which is highly problematic and often fairly useless for the keep or kill evaluations and decision-making that are the goal at this level, rather than an Evaluation Information Management System and the associated generally useful data structures that are needed for higher quality, more sophisticated keep or kill evaluations and decision making. Being more than "twice burned" on this critical shortcoming and flaw usually begins to encourage management and institutional and corporate evaluators to start designing and building more generally useful Evaluation Information Management Systems and data structures, so that they have better capacity to do this level and the higher and more sophisticated levels of program and venture evaluation.
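To make the threshold-based "keep or kill" logic described above more concrete, here is a minimal, purely illustrative sketch in Python; the indicator names, weights, and thresholds are hypothetical and are not taken from any particular evaluation model or decision equation discussed in this article.

```python
# Hypothetical "keep or kill" decision rule for a Level 2 evaluation.
# The indicator names, weights, and thresholds are illustrative only; a real
# decision equation would be negotiated with stakeholders and grounded in the
# venture's own data structure.

def keep_or_kill(indicators, weights, keep_threshold, kill_threshold):
    """Return ('keep' | 'kill' | 'monitor', composite score).

    indicators: dict of measured values scaled to 0-1 (e.g., pass rate,
                enrollment growth, stakeholder satisfaction).
    weights:    dict of relative importance per indicator (summing to 1).
    """
    score = sum(weights[name] * value for name, value in indicators.items())
    if score >= keep_threshold:
        return "keep", score
    if score <= kill_threshold:
        return "kill", score
    return "monitor", score  # between thresholds: gather more data

# One made-up program review cycle.
indicators = {"pass_rate": 0.82, "enrollment_growth": 0.40, "satisfaction": 0.65}
weights = {"pass_rate": 0.5, "enrollment_growth": 0.2, "satisfaction": 0.3}

decision, score = keep_or_kill(indicators, weights,
                               keep_threshold=0.70, kill_threshold=0.45)
print(decision, score)  # -> "monitor", since 0.45 < score (about 0.685) < 0.70
```

The point of such a sketch is only the structure of the decision rule: explicit indicators, explicit weights, and explicit thresholds that force the "keep or kill" question to be answered with data rather than with post-hoc narratives.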
The problems at the data for decision-making level tend to be problems of finding "good" (reliable and valid) measures that maximize variance and minimize measurement error on each variable included in the functional inputs-throughputs-outputs (multiple regression) equation of some kind that will be used, and of selecting "powerful" and "explanatory" variables, as opposed to convenient and locally "believable" variables and their accompanying convenient post-hoc armchair narratives, as often happens in this typically popular blind-empiricism approach, which can actually amount to shotgun evaluation or research of the quantitative and/or qualitative kind, given the setting and the institutional or corporate data available (Schick, 2000). Many evaluation experts have written about the positives of using data for decision-making (Scriven, 2001; Pawson, 2006; Dlugacy, 2006), but many have also written about the flaws, difficulties, poor designs and even poor logic of this approach and of over-interpreting and over-generalizing its results, which tend to be very context bound (Phillips, 2005; Coryn, 2007; Sloane, 2008). However, used judiciously, wisely, and for what it is, this level of evaluation is very useful, efficient, fairly timely, and cost effective for making "keep or kill" decisions in particular.

Usually, in my experience, decision-makers tend to wait far too long to make the "kill" decision when doing "keep or kill" evaluations. At one level, I believe that this problem is due to the decision-makers being too invested in too many ways in the venture (including their reputations for championing the venture), and a natural tendency not to want to be disappointed or to disappoint others. However, I also believe that this delay and foot dragging comes from decision-makers not being honest about the fact that this is the level and kind of evaluation that they are actually in and doing, and that they need to kill off non-performing and non-promising ventures fairly ruthlessly, as Deming and others counsel, even if they are making a mistake, as the "power of the approach" will eventually assert itself, and usually in a better form (Suppe, 1974). Continuous improvement, never mind more extensive change, can be a very slow process if decision-makers drag their feet on the kill decisions, or let politics impede these decisions, or engage in "keep or kill" evaluations only cosmetically, which also often happens.

Level 3: Review and Learn

This level is a typical early stage in the evaluation/research/inquiry process, with an emphasis on gathering more information from reviews of the available literature, examination of archived records, focus-group style interviews with current users (faculty, students and other stakeholders), and actual questionnaires and "harder measures." The goals are to identify potential key variables worth investigating, how they might be related, and how they can be measured; what actually is "The Theory of the Program (venture)"; how sound it is; whether it has been implemented appropriately; and what problems and impediments are being encountered and what might be done about either or both (Aneshensel, 2002). This level is sometimes called "program improvement evaluation" or "getting the program up to its specs," so that there is a valid version of the program/venture to evaluate and an appropriate framework and model with which to interpret the evaluative results.
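As a small, hypothetical illustration of the "identify potential key variables and how they might be related" activity at this level, the sketch below computes pairwise correlations among a few archived program measures; the variable names and values are invented for illustration only and do not come from the article or any real data set.

```python
# Illustrative Level 3 "review and learn" exploration: inspect how a few
# candidate program variables relate to one another before committing to a
# program theory. All variable names and values below are hypothetical.
import numpy as np

# Archived records for ten program cohorts (rows); each column is one
# candidate variable worth investigating.
# Columns: hours_advising, midterm_gpa, retention_rate
records = np.array([
    [ 5.0, 2.6, 0.78],
    [ 8.0, 2.9, 0.83],
    [ 3.0, 2.4, 0.71],
    [10.0, 3.1, 0.88],
    [ 6.0, 2.7, 0.80],
    [ 9.0, 3.0, 0.86],
    [ 4.0, 2.5, 0.74],
    [ 7.0, 2.8, 0.81],
    [11.0, 3.2, 0.90],
    [ 2.0, 2.3, 0.70],
])

names = ["hours_advising", "midterm_gpa", "retention_rate"]
corr = np.corrcoef(records, rowvar=False)  # 3 x 3 correlation matrix

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"r({names[i]}, {names[j]}) = {corr[i, j]:.2f}")
```

Such an exploratory pass is only a first step toward a draft theory of the program; it suggests which relationships are worth modeling, not which ones are causal.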
It cannot be over-emphasized how important this level of program evaluation is in terms of developing a research knowledge base and theory for the venture, even if only in first draft form, as these two components are necessary if the venture is not to "fly blind in a changing storm" and not to become yet another example of old-fashioned shotgun research and evaluation and "black box" empiricism (of both the quantitative and qualitative kind) of the sort associated with logical positivism, a much used "vampire" model that more or less died fifty years ago but is quite difficult to keep in its coffin (Schick, 2000), particularly when the "improvement fever" is high. One also needs some type of research knowledge base and an initial first-draft or proto-theory of the program if one is going to be primed for, and recognize rather than simply ignore, highly important unanticipated consequences or outcomes of one's program or venture and evaluation efforts, which may be a groundbreaking discovery even if one of the petite kind. There is a long and well documented history worldwide of accidental and unanticipated discoveries that have occurred in all areas and with all kinds of ventures and that made the venture and its efforts a thousand times more valuable than its initial goals or theory. In fact, one could argue that it would not actually be research, evaluation or a major improvement effort if no unanticipated positive (or negative) consequences were observed. Obviously, it is the "venture changing" positive unanticipated consequences that are important, but one has to be primed to observe and discover them, and that requires having a research knowledge base and theory for the venture, which is also needed to some degree for the next levels in this taxonomy. All of these points are more fully explained and elaborated in Perla and Carifio (2011). A few concrete examples of "review and learn" (Level 3) evaluations are Glass (2000), Kenney (2008), Carifio and Perla (2009), and Mets (2011). One should also not miss that Level 3 review and learn evaluations today tend to be quite quantitative and statistically sophisticated in nature, ranging from various forms of meta-analysis (Glass, 2000) to quantitative model building (Aneshensel, 2002) and even secondary data analysis and formative causal and structural equation modeling.

Level 4: Defining "Does It Work?"

Does the program/venture actually work in terms of its "advertised capabilities" (and underlying theories)? Is it accessible, trouble-free, convenient, hitting or exceeding the benchmarks set on the goals and criteria chosen? One should note that defining "works" and "does it work" is often not an easy thing to do and usually takes considerable effort. For example, are program effects immediate (and how immediate) or delayed (and how delayed), and are they lasting (and how lasting) or temporary (and how temporary)? Does the program help some subgroup of students or clients, or hurt some subgroup, or both simultaneously (all forms of different kinds of interaction effects or "workings")? Are the subgroups helped (and/or their advocates) so important mission-wise and politically that this trumps the subgroups hurt (and/or their advocates), or is the reverse the case?
Does the program "stop facial tics" (the target goal) but "cause stuttering" in doing so; namely, are there unanticipated consequences, outcomes, or collateral damages (Elton, 1988; van Thiel & Leeuw, 2002; Figlio, 2011)? Are there key qualitative differences between the outcomes of the new program (which, say, increases comprehension but decreases retention of facts) and the program it is replacing (always called the "traditional approach," even though it was once the new approach), which produces lower comprehension but higher retention of key facts, and what is the calculus of choice in such a situation, or for saying the new program works or not?

"Does it work" is most often a very hard question to answer and requires a great deal of a priori focus and clarity about what "works" actually means, as well as a decent evaluative design and an adequate evaluative data structure. It is at this level that the relevancy and adequacy of the general data structure of the institution in which the program or venture is embedded begins to express itself even more strongly than at Level 2, and the many weaknesses of the institution's evaluative data structures begin to be discovered relative to being able to actually answer questions about whether some program or venture "works." Further, if one must move across institutions to answer questions of "does it work," one typically encounters multiple incompatible measures and data structures, which not only impede one's efforts, but also help one to understand the current movements to develop a common standard student or client "unit record" at the K-12 and higher education levels, and in the field of medicine as well, to begin to alleviate this major evaluative data structures problem and enable much better "does it work" evaluations in a much more feasible and cost effective way (Brass et al., 2006).

Also, when it comes to the question "Does it work," there is an unfortunate truism that one must always keep in mind, which is that any program any human can conceive will work for someone somewhere at some point in time, or might appear to do so, if one of Campbell and Stanley's (now twenty) evaluative design flaws is operating in the situation. One must always be extremely cautious that one is not so over-focused and over-customized in terms of one's program, goals, clients, and situation or context that one essentially has a "sample of one" on everything (i.e., uniqueness or its fuzzy equivalent), or a flawed design, or a flawed data structure, or all three, when it comes to questions of "Does it work." The basic problem here is that if one does in effect have "samples of one" across the board, it really does not matter, as the situation, problem, set of circumstances or client type will most likely never occur again (or only extremely rarely), and one is doing a lot of work and making a lot of hoopla for very little return, unless one is in the "rare and orphan disease" business and that is the nature of one's venture. "Does it work" is one of the trickiest questions to ask and answer, particularly in terms of the manner in which this question tends to be asked and often answered by the various stakeholders in this process, which tends to be in a fairly vague, imprecise, somewhat naive, and implicitly personally defined way.
These various flaws are some of the major roots and sources of the difficulties in answering this question, the other roots being inadequate data structures and designs to actually do the job of saying whether the program or venture works or not. The "Does it work" level in this simple taxonomy means establishing a design and data structure that causally connects the program or venture to its inputs and outputs in a reasonably valid way, one that allows causal statements and claims to be made about the effects of the program on whom, relative to what outcomes, and why, as opposed to other uncontrolled, unperceived or unknown (exogenous) factors (variables), which is by no means an easy thing to do for many different reasons, and which is why "Does it work" designs today tend to be multivariate in character. The "Does it work" level is also focused on understanding the general class of the program/venture and the general theory underlying it, so that the evaluation effort is not an "over-customized" and "tunnel vision" effort, but contributes something to the general knowledge base of the program type and the theory that guides it. It looks to "expand the view and knowledge base" a little bit and to build organizational understanding and insight into what kinds of things the venture does and what it and the things that are its chief foci are about in a broader and more general way. And it should be clearly noted that organizational knowledge and wisdom is quite often very important (sometimes called "understanding the business you are actually in as opposed to the business you think you are in"), and sometimes even more important than the much more generalized knowledge and wisdom that all of the experts and expert sources in this area tend to focus on and discuss. This organizational knowledge and wisdom, given that it is reasonably valid locally, is the very payoff of these efforts, provided that it is indeed espoused and touted as such (i.e., "this works for us in our context for our goals and clients") and not as something more through the various rhetoric and pufferies that are endemic to reporting and dissemination activities now.

Level 5: Formative and Summative Evaluation Research

This level of evaluation represents the standard model of program evaluation research, where various "Stake and Stufflebeam" quasi-experimental and experimental designs and decision-making models are used to do (in the end) confirmatory evaluation of the program/venture, possibly with comparisons to naturally existing "control groups" and to purposefully constructed "control" groups as well. Are there unanticipated outcomes and/or side effects of various kinds? "Policy research" and decisions to scale a program/venture up and/or disseminate it are usually made at this level, and this level typically requires even better designs, data structures and multivariate analytic techniques than are needed at Level 4. Sometimes this type of evaluation is "high stakes" evaluation, and usually it is also done to provide the program's financers, potential program users, and the general public with reasonable information about the veracity and validity of the program's claims (i.e., external social-action consumer reports).
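As a deliberately simplified, hypothetical sketch of the comparison-group logic at this level (not a full Stake- or Stufflebeam-style design), the snippet below computes a standardized mean difference between a program group and a constructed control group on a single outcome measure; the scores are invented for illustration.

```python
# Simplified quasi-experimental comparison for a Level 5 evaluation: a
# standardized mean difference (Cohen's d) between a program group and a
# constructed control group on one outcome measure. The scores are invented;
# a real Level 5 design would also address covariates, threats to validity,
# and unanticipated outcomes.
import statistics as stats

program_scores = [78, 85, 81, 90, 76, 88, 84, 79, 91, 83]
control_scores = [72, 80, 75, 78, 70, 82, 77, 74, 79, 73]

def cohens_d(treated, control):
    """Mean difference divided by the pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    s1, s2 = stats.stdev(treated), stats.stdev(control)  # sample SDs
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (stats.mean(treated) - stats.mean(control)) / pooled_sd

d = cohens_d(program_scores, control_scores)
print(f"Standardized mean difference: d = {d:.2f}")
```

A real Level 5 evaluation would, of course, embed such a comparison in a proper design, with attention to selection, covariates, and the unanticipated outcomes discussed above.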
This level of this simple taxonomy is well developed and has been well worked by the pantheon of experts who have assiduously labored at this level for the last century, and the reader is referred to the most representative of these texts (e.g., Stake, 2003; Stufflebeam & Shinkfield, 2007; Pawson & Tilley, 2008; Mertens, 2010), and particularly to Stufflebeam's (2001) classic article summarizing the major models and approaches to formative and summative evaluation that have been developed and used extensively in this type of evaluation work. I have only a few comments of importance to make about this type and level of evaluation in this simple taxonomy.

The first of these comments is to strongly emphasize Stufflebeam's view, which he has expressed in several places, that there really are no direct, straightforward or simple algorithmic connections between formative and summative research and evaluation. Nor are there simple and straightforward transformations of "formative" research and research efforts into "summative" research and research efforts; the two are essentially different in kind and basically incommensurate. This point, it should be clearly noted, in no way means that one is better than the other, as each has an appropriate setting and context. The point only means that although there are indeed fuzzy overlaps between the two, each has its own appropriate and valid questions, designs, data, analyses, standards and decision-making sets that need to be used, and the formative sets are not necessarily valid for, or convertible or transformable into, the summative sets, and vice versa. The two, therefore, are qualitatively different, and one is actually not necessary for the other. Stufflebeam's important point helps to explain the "disconnects and disappointments" that are often observed between formative and summative evaluations of the same program and the "effects discounting" (diminution) that typically occurs when the program is disseminated to other settings. However, Stufflebeam's point has also given rise to some promising new approaches at this level, which have been exploring the formative and summative evaluations of programs or ventures as (macro-level) case studies along the lines of those done in business and medicine, as opposed to the classical scientific model that has been the reigning paradigm for formative and summative evaluation at this level for several decades (Stake, 2010). This new line of inquiry has a great deal of potential, particularly relative to building up institutional knowledge and wisdom about an institution's programs and ventures of various kinds.

My second comment of importance here is that the previous four levels are the developmental precursors of this level, more or less and to some degree, and may be understood (and even characterized) as missing one or more of the elements in the models that are used to conduct an acceptable and valid evaluation at this level (which is a highly informative way to view and understand each of the previous levels). Each of the previous levels, therefore, is an "approximation" of some kind to this level and the next one.
It should also be noted that not a great deal (comparatively) is written about the first four levels in this simple taxonomy, which is why I wrote more about each one of them than about these last two levels; nor are these previous levels typically located, situated and contextualized in terms of this level and the next, which is one of the several useful and valuable attributes of this simple taxonomy.

Level 6: "Hard" Research/Evaluation

This level is the most advanced and ambitious level of program or venture evaluation. The goals here are to examine the relationships that exist between multiple antecedent, mediator, and outcome variables through mixtures of regression analyses, quasi-experimental studies, true experimental manipulations, and even national and now international trials. This type of evaluation typically is "high stakes" evaluation and is almost always prospective in character, although approximate retrospective designs/efforts are sometimes possible in certain situations. Often one also tries to estimate the range of outcomes for the program (the lower-limit and upper-limit results that will be observed, and under what conditions) and other similar parameters, as well as the decay of the program's effects over time (all effects are usually initially inflated). Often, one also tries to assess how well the program or venture works independent of its originators/founders (is it person or stakeholder dependent?) and the degree to which it is "context/site/practitioner" proof (dissemination vulnerabilities). The standards for assessing ROI (return on investment) are also usually higher, as are the policy questions and evaluations. It is relatively straightforward to see how much more generalized Level 6 is than Level 5 in terms of its focus and the types of claims it seeks to make, and its stronger and much tighter focus on causation and on establishing strong evidence and warrants for causal claims. There is much ongoing debate about this evaluation level and its requirements (Phillips, 2005; Brass et al., 2006; Coryn, 2007; Sloane, 2008; Scriven, 2010b), and about the context and conditions under which it should be attempted and occur, but it is the kind of program or venture evaluation that needs to occur on key issues and goals if we are to build truly generalizable learning, instruction and educational theory. It is also at this level that the availability of stable and general data structures over significant periods of time becomes both critical and key. And once again one sees the importance and value of the current movements to develop a common standard student or client "unit record" at the K-12 and higher education levels, and in the field of medicine as well, to the longitudinal program and venture evaluations we do at this level, which examine and assess the more remote antecedents and the longer-range outcomes of the programs and ventures we evaluate, in far more sophisticated and higher quality ways.

Summary

As previously stated, there has been strong pressure from just about every quarter in the last twenty years for higher education institutions to evaluate and improve their programs. This pressure is being exerted by several different stakeholder groups simultaneously, and it also represents the growing cumulative impact of four somewhat contradictory but powerful evaluation and improvement movements, models and advocacy groups.
Consequently, the program assessment, evaluation and improvement cycle today is much different and far more complex than it was fifty years ago, or even two decades ago, and it is a highly diversified and confusing landscape from both the practitioner's and the consumer's view of such evaluative and improvement information. Therefore, the purpose of this article was to present and begin to elucidate a relatively simple general taxonomy that can help practitioners, consumers, and professionals to make better sense of the competing evaluation and improvement models, methodologies and results available today, which should help to improve communication and understanding and to provide a broad, simple and useful framework or schema to guide their more detailed learning. It is hoped that the simple six-level taxonomy presented here achieves these goals and simplifies this complex area for those involved in evaluating programs and ventures today.

REFERENCES

American Council on Education (2012). National and international projects on accountability and higher education outcomes. http://www.acenet.edu/Content/NavigationMenu/OnlineResources/Accountability/index.htm
Aneshensel, C. S. (2002). Theory-based data analysis for the social sciences. Thousand Oaks, CA: Pine Forge Press.
Bamberger, M., Rugh, J., Church, M., & Fort, L. (2004). Shoestring evaluation: Designing impact evaluations under budget, time and data constraints. American Journal of Evaluation, 25, 5-7.
Brass, C. T., Nunez-Neto, B., & Williams, E. D. (2006). Congress and program evaluation: An overview of randomized control trials (RCTs) and related issues. URL (last checked 24 October 2008). http://digital.library.unt.edu/govdocs/crs/permalink/meta-crs-9145:1
Burke, J. (2005). Achieving accountability in higher education: Balancing public, academic, and market demands. San Francisco: Jossey-Bass.
Carifio, J., & Perla, R. (2009). A critique of the theoretical and empirical literature on the use of diagrams, graphs and other visual aids in the learning of scientific-technical content from expository texts and instruction. Interchange, 41, 403-436.
Coryn, C. L. S. (2007). The "holy trinity" of methodological rigor: A skeptical view. Journal of Multidisciplinary Evaluation, 4, 26-31.
Deming, W. E. (1986). Out of the crisis. Cambridge, MA: Center for Advanced Engineering Study, Massachusetts Institute of Technology.
Denzin, N., & Lincoln, Y. (2005). The Sage handbook of qualitative research. Thousand Oaks, CA: Sage.
Dlugacy, Y. (2006). Measuring healthcare: Using quality data for operational, financial and clinical improvement. San Francisco, CA: Jossey-Bass.
Elton, L. (1988). Accountability in higher education: The danger of unintended consequences. Higher Education, 17, 377-390.
English, F. W., & Hill, J. C. (1994). Total quality education: Transforming schools into learning places. Thousand Oaks, CA: Corwin Press.
Figlio, D. (2011). Intended and unintended consequences of school accountability. http://www.youtube.com/watch?v=e3aKEuctqy8
Glass, G. (2000). Meta-analysis at 25. URL (last checked 15 January 2007). http://glass.ed.asu.edu/gene/papers/meta25.html
Godfray, H. (2002). Challenges for taxonomy. Nature, 417, 17-19.
Green, J., Camilli, G., & Elmore, P. (2006). Handbook of complementary methods in educational research. Mahwah, NJ: Erlbaum.
Harman, G. (1994). Australian higher education administration and quality assurance movement. Journal for Higher Education Management, 9, 25-45.
Kenney, C. (2008). The best practice: How the new quality movement is transforming medicine. Philadelphia, PA: Perseus Book Group.
Kleining, G. (1982). An outline for the methodology of qualitative social research. URL (last checked 22 October 2008). http://www1.unihamburg.de/abu//Archiv/QualitativeMethoden/Kleining/KleiningEng1982.htm
Lederman, D. (2009). Defining accountability. Inside Higher Education. http://www.insidehighered.com/news/2009/11/18/aei
Lincoln, Y., & Guba, E. (1985). Naturalistic inquiry. Thousand Oaks, CA: Sage.
London Times (2012). World university rankings. http://www.timeshighereducation.co.uk/world-university-rankings/2011-2012/top-400.html
Mets, T. (2011). Accountability in higher education: A comprehensive analytical framework. Theory and Research in Education, 9, 41-58.
Morley, R. (2012). R. Morley Incorporated. http://www.barn.org/index.htm
O'Rand, A., & Krecker, M. (1990). Concepts of the life cycle: Their history, meanings, and uses in the social sciences. Annual Review of Sociology, 16, 241-262.
Mertens, D. (2010). Research and evaluation in education and psychology: Integrating diversity with quantitative, qualitative, and mixed-methods approaches. Thousand Oaks, CA: Sage.
Mezzich, J. E. (1980). Taxonomy and behavioral science: Comparative performance of grouping methods. New York: Academic Press.
Mulligan, R. (2012). The Deming University. http://paws.wcu.edu/mulligan/www/demingu.html
Pawson, R. (2006). Evidence-based policy: A realistic perspective. Thousand Oaks, CA: Sage.
Pawson, R., & Tilley, N. (2008). Realistic evaluation. Thousand Oaks, CA: Sage.
Perla, R., & Carifio, J. (2009). Toward a general and unified view of educational research and educational evaluation: Bridging philosophy and methodology. Journal of Multi-Disciplinary Evaluation, 5, 38-55.
Perla, R., & Carifio, J. (2011). Theory creation, modification, and testing: An information-processing model and theory of the anticipated and unanticipated consequences of research and development. Journal of Multi-Disciplinary Evaluation, 7, 84-110.
Phillips, F. (2005). The contested nature of empirical research (and why philosophy of education offers little help). Journal of Philosophy of Education, 39, 577-597.
Schick, T. (2000). Readings in the philosophy of science: From positivism to postmodernism. Mountain View, CA: Mayfield.
Scriven, M. (2010a). Rethinking evaluation methodology. Journal of Multidisciplinary Evaluation, 6, 1-2.
Scriven, M. (2010b). Contemporary thinking about causation in evaluation: A dialogue with Tom Cook and Michael Scriven. American Journal of Evaluation, 31, 105-117.
Scriven, M. (2012). Evaluating evaluations: A meta-evaluation checklist. http://michaelscriven.info/images/EVALUATING_EVALUATIONS_8.16.11.pdf
Shavelson, R. (2010). Accountability in higher education: Déjà vu all over again. http://www.stanford.edu/dept/SUSE/SEAL/Presentation/Presentation%20PDF/Accountability%20in%20hi%20ed%20CRESST.pdf
Sloane, F. (2008). Through the looking glass: Experiments, quasi-experiments and the medical model. Educational Researcher, 37, 41-46.
Stake, R. (2003). Standards-based and responsive evaluation. Thousand Oaks, CA: Sage.
Stake, R. (2010). Qualitative research: Studying how things work. New York: Guilford Press.
State Higher Education Executive Officers (2012). National commission on accountability in higher education.
http://www.sheeo.org/account/comm-home.htm
Stufflebeam, D. (2001). Evaluation models. New Directions for Evaluation, 89, 7-98.
Stufflebeam, D. L., & Shinkfield, A. J. (2007). Evaluation theory, models and applications. San Francisco, CA: Jossey-Bass.
Suppe, F. (1974). The structure of scientific theories. Urbana: University of Illinois Press.
US News (2012). Best colleges and universities. http://www.usnews.com/rankings
Van Thiel, S., & Leeuw, F. (2002). The performance paradox in the public sector. Public Performance & Management Review, 25, 267-281.
Yin, R. (2008). Case study research: Design and methods (applied social research methods). Thousand Oaks, CA: Sage.