Berkman v. City of New York, 536 F. Supp. 177 (E.D.N.Y. 1982)
March 4, 1982
The CITY OF NEW YORK; Edward Koch, individually and as Mayor of the City of New York; New York City Fire Department; Augustus Beekman, individually and as Fire Commissioner of the City of New York; New York City Department of Personnel; Michael Nadel, individually and as Director of Personnel of the City of New York; Thomas Roche, individually and as former Director of Personnel of the City of New York; Civil Service Commission of the City of New York, Defendants.
United States District Court, E. D. New York.
*178 *179 Women's Rights Clinic of the Washington Square Legal Services, Inc. by Laura Sager, and Debevoise & Plimpton by Robert L. King, Bart R. Schwartz, Jeffrey N. Drummond, Kathryn Quirk, New York City, for plaintiff.
Frederick A.O. Schwarz, Corp. Counsel of the City of New York by Norma Kerlin, Thomas C. Crane, Gary P. Shaffer, Asst. Corp. Counsels, New York City, for defendants.
Norman Teitler, Rego Park, N. Y., attorney for intervenor, Uniformed Firefighters Association.
MEMORANDUM DECISION AND ORDER
SIFTON, District Judge.
This is an action brought pursuant to Title VII of the Civil Rights Act of 1964, as amended, 42 U.S.C. § 2000e et seq.; the Civil Rights Act of 1871, 42 U.S.C. § 1983; the fourteenth amendment to the United States Constitution; and Section 296 of the New York Human Rights Law (Executive Law), seeking declaratory and injunctive relief and damages to redress alleged sex-based discrimination against plaintiff and the class she represents in connection with the physical test portion of New York City's Examination 3040 ("Exam 3040") for the entry level position of firefighter in New York City.
Plaintiff, Brenda Berkman, is a twenty-nine-year-old woman who passed the written portion of Exam 3040, but failed the physical test portion, which she took on February 22, 1978. The class she represents consists of 410 women who took the written portion of Exam 3040 and are alleged either to have taken the physical portion of Exam 3040 and failed it or to have been deterred from taking it as a result of sex discrimination by defendants. Defendants are the City of New York, its Mayor, the City's Fire Department and its Commissioner, the City's Personnel Department, its Director and former Director, and the Civil Service Commission of the City of New York. Appearing as Intervenors pursuant to this Court's Order of April 10, 1981, are the Uniformed Firefighters Association ("UFA") and the Uniformed Fire Officers Association ("UFOA"), both employee organizations representing current job incumbents.
The trial of this case was conducted before the undersigned, sitting without a jury, over several weeks between September and December 1981. For the reasons set forth herein, I conclude that the physical portion of Exam 3040 discriminated against plaintiff and the class she represents on the basis of their sex and that injunctive relief is appropriate prohibiting further use of the eligibility list established pursuant to Exam 3040 except on a showing of compelling necessity, directing the preparation of a new physical exam which does not discriminate against women, and awarding plaintiff interim and other relief to secure compliance with the requirements of the Civil Rights Act of 1964, as amended. What follows sets forth the findings of fact and conclusions of law on which these determinations are based, as required by Rule 52(a) of the Federal Rules of Civil Procedure.
Jurisdiction over plaintiff's Title VII action exists under 42 U.S.C. § 2000e-5. Plaintiff, by filing her complaint with the Equal Employment Opportunity Commission on May 16, 1978, has complied with the time limitations imposed by that section with respect to all of the named defendants except the defendant Fire and Personnel Departments and the Director of the latter Department, who were not named in plaintiff's administrative complaint. As to those defendants not named in plaintiff's administrative complaint, jurisdiction exists since there is substantial identity between those *180 defendants who were named in the conciliation proceeding and those not named and since those not named had notice of the pendency of the conciliation proceedings. Vulcan Society of Westchester County v. City of White Plains, 82 F.R.D. 379 (S.D.N.Y. 1979).
Jurisdiction exists over plaintiff's claim under 42 U.S.C. § 1983 and the fourteenth amendment by virtue of 28 U.S.C. § 1343. Pendent jurisdiction exists over plaintiff's claim under New York State's Human Rights Law.
As of the end of 1980, New York City employed in excess of 285,000 persons. Of these, approximately 175,000 (including those in the City's Fire Department) were hired by the Department of Personnel. The balance were employed by independent agencies such as the Board of Education and the Off-track Betting Corporation.
Of the approximately 175,000 employed by the Department of Personnel, approximately 168,000 (including those in the City's Fire Department) were in the competitive class, a class including all positions for which it is practicable to determine the merit and fitness of candidates by competitive examination.
Employment statistics for the uniformed force of the New York City Fire Department for the period 1973 through 1981 were as follows:
Year    Firemen    Officers    Total
1973    10,720     2,548       13,394
1974    10,426     2,538       13,091
1975     9,089     2,347       11,548
1976     8,304     2,267       10,662
1977     8,847     2,329       11,271
1978     8,513     2,375       10,979
1979     8,966     2,404       11,466
1980     8,765     2,460       11,048
1981     9,042     2,563       11,616
The Fire Department of the City of New York is charged with responsibility for extinguishment, prevention and investigation of fires occurring in the City. The most recent available statistics for operational firefighting incidents (1979) indicate a total of close to 350,000 incidents annually requiring Fire Department attention. The largest number of these incidents have been false alarms (162,529). A total of 43,072 of these incidents involved structural fires, of which the largest number occurred in residential buildings (31,504), the next largest, in vacant buildings (5,698), followed by fires in commercial (4,227) and public structures (1,643). Non-structural fires (71,298) exceeded structural fires. In addition to its ordinary firefighting duties, the Fire Department was, in 1979, called upon to respond to close to 72,250 other emergencies of various types.
The Fire Commissioner is the head of the Department, responsible for policy decisions. The Chief of Department is charged with operational responsibility. Under the Chief of Department the uniformed force of the Department consists of the assistant chiefs, deputy assistant chiefs, deputy chiefs, battalion chiefs, chief medical officer, medical officers, chaplains, captains, lieutenants, marine engineers, pilots, and firemen.
The uniformed force is organized into divisions, battalions, companies, and other operational units. A division is composed of one or more battalions and is commanded by a battalion chief. A company is composed of its captain, one or more lieutenants, and the firefighters assigned to it.
The command structure of Department operations consists of borough commands in each of the five boroughs of New York City. Each borough command (with the exception of Staten Island) is composed of several divisions, the number varying in each borough. The borough commanders report to the Chief of Operations.
Companies are divided into engine companies, ladder companies, rescue companies, and marine companies. An engine company consists of a pumper apparatus and an assigned complement of personnel (typically, one captain, three lieutenants, and 20 or 25 firefighters). One platoon (typically, one officer and four or five firefighters) is assigned to the apparatus on each tour of duty. The pumper apparatus consists of *181 pumping equipment and carries lengths of hoseline, nozzles, and other equipment. The engine company is primarily responsible for the extinguishment of fires.
A ladder company consists of a ladder truck apparatus and an assigned complement of personnel (typically, one captain, three lieutenants, and 25 firefighters). One platoon (typically, one officer and five firefighters) is assigned to the apparatus on each tour of duty. The ladder truck apparatus includes an extension ladder, portable ladders, tools and equipment for entry and ventilation of burning buildings, and a variety of other equipment. The members of the ladder company perform operations at a fire scene exclusive of extinguishment, including forcible entry, ventilation, search and rescue, and overhauling, the process of seeking hidden sources of fire after the visible fire is extinguished.
There are at present 208 engine companies and 138 ladder companies in the Department. These are currently housed in 233 firehouses, most of which quarter one ladder company and one engine company.
In addition to engine companies and ladder companies, there are rescue and marine companies. A rescue company consists of the rescue apparatus and an assigned complement of personnel (one captain, three lieutenants and 25 firefighters). A platoon (one officer and five firefighters) is assigned to the apparatus on each tour of duty. The members of a rescue company are assigned on an ad hoc basis at fires to augment ladder or engine company personnel. They also perform specialized search and extrication tasks, utilizing special tools carried on the apparatus for use in emergencies. There are four rescue companies in the City, one assigned to each borough, except Staten Island. Rescue companies respond to multiple alarm fires and unusual emergencies.
A marine company is responsible for extinguishment of fires on the waterfront and in New York Harbor. A marine company consists of a captain, three lieutenants, four pilots, eight marine engineers, one wiper, and from ten to twenty firefighters. Forty-five firefighters are assigned to the marine division.
Under Section 487a-3.0a. of the New York City Administrative Code (the "Administrative Code"), a member of the Fire Department must be a citizen of the United States, must be able to read and write the English language with understanding, and must have passed his or her eighteenth, but not twenty-ninth, birthday on the date of the filing of an application for civil service examination. No person will be appointed unless twenty-one years of age at the time of appointment. There is at present no restriction on the sex of applicants.
There are four grades of firefighter. Section 487a-7.0 of the Administrative Code provides in this connection:
"Members of the uniformed force, upon appointment shall be assigned to the fourth grade; after one year of service in the fourth grade, they shall be advanced to the third grade; after one year in the third grade, they shall be advanced to the second grade; after one year in service in the second grade, they shall be advanced to the first grade; and they shall in each instance receive the annual pay or compensation of the grade to which they belong."
The basic starting salary for firefighter fourth grade is currently $17,949.
Under Section 487a-4.0 of the Administrative Code, preliminary to a permanent appointment as a firefighter, there is a period of probation of one year. Probationary firefighters undergo a six-week training program in the fundamentals of firefighting at the beginning of their probationary period.
The Department of Personnel of New York City has responsibility for the hiring of Fire Department personnel. When staffing needs arise for the position of entry-level firefighter, the Fire Department advises the Department of Personnel and requests the scheduling of an entrance examination. The examination is prepared by the Department of Personnel with the assistance of the Fire Department. Scheduling, *182 notification, receipt of applications, and administration of the examination are the responsibility of the Department of Personnel. The examination produces an eligible list of individuals who have passed the examination. Thereafter, the Department of Personnel and the Fire Department conduct further investigations, including background investigation, medical examination, and interviews, of sufficient members of the eligible list to satisfy projected hiring requirements. Candidates who satisfactorily complete the post-examination investigations are deemed qualified. In order to fill vacancies, the Fire Department will request certification of qualified eligible candidates from the Department of Personnel. Upon issuance of such certification, the Fire Department appoints candidates from the qualified eligible list.
Examinations for the Position of Fireman Prior to Exam 3040
From 1960 to 1971 the Department of Personnel administered five examinations for the position of fireman. None of these was open to women. These were the five examinations which immediately preceded Exam 3040, which is at issue in this lawsuit.
From October 5, 1960, to January 31, 1961, pursuant to Notice of Examination 9010, the Department of Personnel accepted applications for the position of fireman. A written test for 9010 was administered March 3, 1961, and the eligible list was promulgated January 1, 1962. No physical test appears to have been administered.
From July 5, 1962, to October 2, 1962, pursuant to Notice of Examination 9606, the Department of Personnel accepted applications for the position of fireman. A written test for 9606 was administered December 1, 1962, followed by a qualifying physical test. The physical test was graded on a pass/fail basis; the written test had a passing score of 70%. The eligible list resulting from this process was promulgated June 26, 1963.
From December 3, 1963, to January 27, 1964, pursuant to Notice of Examination 9984, the Department of Personnel accepted applications for the position of fireman. A written test for 9984 was administered on May 19, 1964, followed by a competitive physical test from which an eligible list was promulgated. The written examination and the competitive physical examination were given equal weight. An average of 70% was required on the results of both written and physical tests in order to qualify for appointment.
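The equal-weight rule described for Exam 9984 can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes that an "average of 70%" means a simple mean of the two percentage scores of at least 70, with no separate minimum on either component, a point the description above does not address.

```python
def composite_score(written: float, physical: float) -> float:
    """Equal-weight average of the written and physical percentage scores."""
    return (written + physical) / 2


def qualifies(written: float, physical: float, passing_average: float = 70.0) -> bool:
    """Assumed rule: qualify when the equal-weight average reaches the passing mark."""
    return composite_score(written, physical) >= passing_average


# Hypothetical candidates, not drawn from the record.
print(qualifies(80, 60))  # average is exactly 70
print(qualifies(65, 70))  # average is 67.5
```

Under this assumed rule a higher written score can offset a lower physical score, which is not possible where the physical test is merely qualifying (pass/fail), as in Exams 9606, 7060, and 0159.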
From April 3, 1968, to May 31, 1968, pursuant to Notice of Examination 7060, the Department of Personnel again accepted applications for the position of fireman. The competitive written test for 7060 was administered June 15, 1968, followed by a qualifying (pass/fail) physical test from which an eligible list was promulgated August 20, 1968. A 70% score on the written test was required to pass.
From February 3, 1971, to August 18, 1971, pursuant to Notice of Examination 0159, the Department of Personnel accepted applications for the position of fireman. The written test for 0159 was administered on September 18, 1971, followed by a qualifying (pass/fail) physical test, and an eligible list was promulgated January 18, 1973. Again, a 70% score on the written test was required to pass.
The Vulcan Society Litigation
On January 12, 1973, immediately preceding the promulgation of the eligible list resulting from Exam 0159, the Vulcan and Hispanic Societies, representing black and Hispanic firefighters, commenced an action in the United States District Court for the Southern District of New York pursuant to 42 U.S.C. §§ 1981 and 1983 and sought a preliminary injunction to restrain the New York City Personnel Department, Civil Service Commission, and Fire Department from appointing firemen from List No. 0159. Vulcan Society v. Civil Service Commission, 73 Civ. 201 (Weinfeld, J.).
Thereafter, on June 12, 1973, Judge Weinfeld issued an opinion declaring Examination 0159 unconstitutional under the equal protection clause because of its discriminatory *183 impact on blacks and Hispanics and enjoining defendants from making further appointments based on the examination's results, except upon a showing of compelling necessity. 360 F. Supp. 1265, aff'd in part and remanded, 490 F.2d 387 (2d Cir. 1973).
Subsequently, on August 2, 1973, Judge Weinfeld issued an order directing defendants, inter alia, to proceed expeditiously to prepare a new examination for the position of fireman which did not discriminate against blacks and Hispanics in accordance with professionally accepted methods of test preparation.
Retention of AIR
The City had previously placed itself in a position from which it was able to commence compliance with Judge Weinfeld's direction with some rapidity. In 1970 Congress had enacted Public Law 91-648, known as the Intergovernmental Personnel Act ("IPA"), 84 Stat. 1909 (1970), which described itself as follows:
"An Act to reinforce the federal system by strengthening the personnel resources of State and local governments, to improve intergovernmental cooperation in the administration of grant-in-aid programs, to provide grants for improvement of State and local personnel administration, to authorize Federal assistance in training State and local employees, to provide grants to State and local governments for training of their employees, to authorize interstate compacts for personnel and training activities, to facilitate the temporary assignment of personnel between the Federal Government, the State and local governments, and for other purposes."
Pursuant to Title II of the IPA, "Strengthening State and Local Personnel Administration," the United States Civil Service Commission was authorized to make grants to state and local governments to enable them to strengthen their staffs by improving personnel administration. Pub. L. No. 91-648, § 201, 84 Stat. 1911 (1970).
In January 1973, the City's Personnel Department had determined to take advantage of this federal program by applying for and receiving IPA funding for development of personnel tests and personnel test validation procedures. With the promise of such funding, the City, in January 1973, entered into two contracts with a Washington, D.C. consulting firm named American Institutes for Research ("AIR"): one directed towards developing a method of constructing valid physical selection tests for physically demanding city jobs, the other directed towards the development of suitable written examinations.
Since IPA grants are awarded to support broad research projects and not to pay for the development of an operational selection procedure to select employees for a particular job, the 1973 City contracts looked to the development of generalizable approaches to job analysis, test development, and validation which could be applied to a wide variety of city jobs. In accordance with IPA requirements, both contracts were for one year only. However, each contract was thereafter renewed annually for the following two years.
Although general in their overall objectives, both contracts provided that AIR would study three job titles in particular in order to develop more broadly applicable methods. In March 1973 the decision was made that the three jobs to be studied would be fireman, sanitationman, and parking enforcement agent.
Specifically, as of January 1, 1973, the City contracted with AIR for the "Development of Physical Test Validation Procedures" to be applied to jobs with physical demands. The contract was authorized under IPA Grant No. 72-14. The full amount of the contract ($37,476) was to be paid with funds received from the U.S. Civil Service Commission. The term of the contract was January 1, 1973, through December 31, 1973, later extended to March 31, 1974. The second year of the City's contract with AIR for the "Development of Physical Test Validation Procedures" was carried out under IPA Grant No. 73-NY-27c. *184 The full amount of the contract ($22,500) was to be payable with funds received from the U.S. Civil Service Commission. The term of the contract was April 1974, through December 31, 1974, later extended to February 15, 1975.
Also, as of January 1, 1973, the City entered into a contract with AIR to develop written test procedures. Programs were to be developed which, with a minimum number of modifications, might be utilized for most entry-level City positions and some promotional positions. The contract was authorized under IPA Grant No. 72-5, and the full amount of the contract ($59,992) was to be payable with funds received from the U.S. Civil Service Commission. The term of the contract was January 1, 1973, to December 31, 1973, but was extended to March 31, 1974. The second year of the City's contract with AIR for the development of written test validation procedures was carried out under IPA Grant No. 73-NY-28c. The full amount of the contract ($19,956) was to be payable with funds received from the U.S. Civil Service Commission. The term of the contract was April 1, 1974, through December 31, 1974, later extended to February 15, 1975.
While the parties to the contracts had contemplated that the overall term of the projects undertaken by AIR would last three years and while IPA grants (No. 75-NY-04c and No. 75-NY-05c in the amounts of $19,922 and $15,000 for the physical and written projects, respectively) to be paid in full by the U.S. Civil Service Commission had been issued, both contracts were cancelled by the City, effective September 30, 1976. These cancellations were explained by witnesses for the City at trial as attributable to the City's fiscal crisis. Whatever the merits of this explanation, the cancellations, as will be seen, had the unfortunate consequence of depriving the City of an opportunity, at least as part of AIR's IPA-funded work, to produce a criterion-based validation study of the physical test for fireman which is at issue in this litigation, that is, a follow-up study of those who passed the physical test to determine how well they performed as firefighters in order to decide if the test accurately predicted job performance.
In all events, in the period immediately following the June 1973 decision in the Vulcan case, with AIR already performing pursuant to these two contracts and, coincidentally, investigating the position of City firefighter as one of the three job titles from which a generalizable approach was to be developed, the City understandably turned to AIR for the specific purpose of developing a new entrance level examination for firefighters as required by Judge Weinfeld's decision.
Accordingly, on October 25, 1973, the City entered into a contract with AIR for the purpose of preparing an examination for firemen in the New York City Fire Department. The contract was not to exceed $32,500, and the work was to be completed by April 12, 1974, although this date too was later extended. In the contract AIR agreed, inter alia, to comply in the course of its work with EEOC Guidelines on employee selection pertaining to fair employment, to conduct a job analysis of the position of fireman, using material and expertise it had acquired on its two other contracts with the City, and to prepare, based on the job analysis, a notice of examination, test plan, written test, medical standards, and minimum standards and procedures to be followed in administration of the physical test. Precisely what portions of its earlier work *185 in connection with the 1972 contracts were, in fact, drawn upon by AIR in developing Exam 3040 pursuant to this contract is the subject of considerable debate between the parties. However, since all parties are agreed that at least the physical abilities analysis ("PAA") conducted by AIR pursuant to the 1972 physical test contract formed a central part of the job analysis eventually relied upon to develop the firefighters' exam, it is appropriate to discuss that aspect of the job analysis first.
*186 Physical Abilities Analysis
The 1972 proposals for the IPA contracts specified that AIR's work for the City would build on previous research and methodology developed by Dr. Edwin Fleishman, an industrial psychologist, then the head of AIR's principal office. Dr. Fleishman's previous research in physical fitness measurement is reported in his book, The Structure and Measurement of Physical Fitness (1964). The book describes a method of factor analysis by which Dr. Fleishman identified eight basic physical fitness "factors" as comprising the domain of human physical performance. Dr. Fleishman isolated these factors by applying a factor-analysis methodology described in the book to test scores on various physical fitness tests, all, coincidentally, administered to male subjects.
Factor analysis is a statistical technique for determining the minimum number of factors necessary to account for the inter-correlations among a set of variables. Through factor analysis, Dr. Fleishman isolated clusters of physical tests that appeared to measure the same underlying ability. Each cluster represents a factor. The correlation between a test and a factor is called a factor loading.
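The definition of a factor loading as a correlation can be illustrated with a short sketch. It does not perform factor analysis itself, in which the factors are latent and must be estimated from the inter-correlations among the tests. It assumes that estimated factor scores are already in hand, and all names and data below are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical estimated scores of five subjects on one factor.
factor_scores = [-1.2, -0.5, 0.0, 0.4, 1.3]

# Hypothetical test whose scores are an exact linear function of the factor,
# so its loading on that factor is 1 (up to floating-point error).
test_a = [2 * f + 10 for f in factor_scores]

# Hypothetical test that tracks the factor only loosely.
test_b = [7.0, 9.0, 6.5, 8.0, 8.5]

loading_a = pearson(test_a, factor_scores)
loading_b = pearson(test_b, factor_scores)
print(loading_a, loading_b)
```

In these terms, a test would be "factorially pure" if its loading were high on one factor and close to zero on the others.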
The eight factors identified by Dr. Fleishman in his 1964 book are static strength, explosive strength, gross body equilibrium, extent flexibility, gross body coordination, dynamic strength, dynamic flexibility, and speed of limb movement. A ninth physical fitness factor, stamina, was subsequently adopted by Dr. Fleishman as an additional physical ability from tests reported by other researchers in the field of industrial psychology.
Dr. Fleishman's 1964 book also identifies the factor loadings of each of the physical fitness tests on each factor and contains recommendations as to which tests best *187 measure each hypothesized underlying ability. Tests were recommended on the basis of being "factorially pure," i.e., loading highest on the factor tested and not loading significantly on other factors. The book also contains data on the reliability of the various tests recommended.
In 1970 and 1971 Dr. Fleishman undertook further research to identify the full range of all human abilities. Based on this work, he reported that there are 37 human abilities: 14 cognitive abilities, nine psychomotor abilities, five perceptual abilities, and the nine physical abilities enumerated above. Having identified the 37 human abilities, Dr. Fleishman and his colleagues developed a system of rating scales, known as the abilities analysis, which could be used by raters to indicate on a scale of one to seven how much of a given ability is needed to perform a particular task.
The rating scales include references to particular concrete instances of human behavior thought to exemplify the ability under consideration and a particular degree of that ability. These concrete examples were intended to serve as "anchors" at intervals along the scale to assist the rater in using the instrument. These behavior anchors were developed based on research with groups of psychologists who devised the task anchors in group sessions and later participated in trying out the scales by using them to rate various tasks. The anchors are descriptions of various kinds of human activity and do not reflect tasks commonly performed by any particular occupational group.
For example, anchors under the physical ability of stamina include, at the top of the seven-point scale, "swim the English Channel"; at a point mid-way between five and six on the scale of seven, "climb the stairs to the top of the Statue of Liberty"; and at the bottom of the scale, "walk to the corner grocery." Under gross body coordination the anchors are "do a skilled ballet dance like Swan Lake"; "a runner jumps a series of ten three-foot hurdles"; and "make a lay up shot in the basketball game." Under extent flexibility the anchors are "side show lady bends herself into a pretzel shape and other unusual positions"; "do a modern dance"; and, at the bottom of the scale, "reach for the control lever of a drill press." Using these scales the person completing the abilities analysis is asked to state "how much" of each ability is needed to do the task under study.
In addition to the task anchors, the test instrument gives a definition of the ability, e.g.:
"Static Strength. This ability involves the amount of muscle force (power) used against an object to lift, push or pull the object. Force is used without stopping up to the amount needed to move the object. This ability can involve different muscle groups like the hand, arm, back, shoulder, and leg. This ability does not involve the use of force over a long time. This ability does not deal with the number of times the act is repeated." (Emphasis in original)
Further, an effort is made in each instance to differentiate the ability under consideration from other abilities with which it might be confused. Thus, static strength is contrasted with explosive and dynamic strength and with stamina as follows:
"HOW STATIC STRENGTH IS DIFFERENT FROM OTHER ABILITIES:
Use force without stopping to lift, push or pull objects [vs.] Explosive Strength (2): Gather energy to move the body's own weight or objects with short bursts of force.
Use force to lift, push or pull objects [vs.] Dynamic Strength (3): Use muscle power (force) repeatedly or without stopping to hold up or move the body's own weight.
Does not involve the use of force over a long time [vs.] Stamina (4): Do physical work over a long time; involves heart and blood vessels resisting becoming tired."
*188 Further, the persons performing the abilities analysis are asked to complete a sheet entitled "comments" on which they are to state separately with respect to "usual tasks" (described as those performed every day or almost every day on the job) and with respect to "special tasks" "what the worker does" and "under what circumstances the worker does it." It is principally based on these comments that defendants contend that the PAA was content validated as grounded in the important observable behaviors of firefighters.
The Job Analysis of the Physical Requirements of the Firefighter
In November 1973, AIR administered the PAA to 23 firemen and 12 fire officers. This use of the PAA was the first time that the instrument had ever been used in an actual analysis of an entire job as opposed to the analysis of a specific task. The firemen and officers were asked to rate the amount of each ability required by all of the usual tasks of their job and by all of the special tasks of their job on a scale of 0 to 7. Usual tasks were, as noted above, defined as tasks performed every day or every other day. Special tasks were defined as tasks performed less often. The PAA questionnaire also asked the raters to provide written comments, giving examples of both usual and special tasks which required each ability. These comments provide some idea of the various specific tasks which the raters were considering in answering the questions directed to the rather abstract abilities they were asked to rate.
For some reason, most probably because of the difficulties presented by the complex analysis instrument, relatively few of the 35 firemen who did the PAA were able to think of special tasks, that is, ones performed on a less frequent basis than every day or almost every day on the job. As a result, noting that the rank orderings for special tasks would be influenced by very small differences in mean ratings and calling them "secondary ratings," AIR determined to rely on the ratings given for every day tasks and noted that the results for the special tasks analysis "should be viewed with caution and considered merely suggestive."
Using the responses given for usual tasks, the abilities were then ranked as follows, based on the mean ratings for both officers and firemen:
Ability                    Mean Score on Scale of 7
Stamina                    5.41
Static Strength            4.64
Explosive Strength         4.41
Gross Body Equilibrium     4.27
Extent Flexibility         4.26
Gross Body Coordination    4.04
Dynamic Strength           4.03
Dynamic Flexibility        3.74
Speed of Limb Movement     3.41
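The rank ordering above amounts to a descending sort of the mean ratings. The sketch below reproduces it from the reported means and computes the gap between each ability and the one ranked immediately below it. The variable names are illustrative; only the nine mean scores come from the table.

```python
# Mean ratings for usual tasks, as reported in the table above (scale of 7).
mean_ratings = {
    "Stamina": 5.41,
    "Static Strength": 4.64,
    "Explosive Strength": 4.41,
    "Gross Body Equilibrium": 4.27,
    "Extent Flexibility": 4.26,
    "Gross Body Coordination": 4.04,
    "Dynamic Strength": 4.03,
    "Dynamic Flexibility": 3.74,
    "Speed of Limb Movement": 3.41,
}

# Rank abilities from highest to lowest mean rating.
ranked = sorted(mean_ratings.items(), key=lambda item: item[1], reverse=True)

for rank, (ability, mean) in enumerate(ranked, start=1):
    print(f"{rank}. {ability}: {mean:.2f}")

# Gap between each ability and the one ranked just below it,
# rounded to the two decimal places reported in the table.
gaps = [round(ranked[i][1] - ranked[i + 1][1], 2) for i in range(len(ranked) - 1)]
smallest_gap = min(gaps)
print(smallest_gap)
```

Two adjacent pairs in the ranking are separated by only 0.01 on the seven-point scale, so their relative order turns on a very small difference in mean rating.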
The AIR effort was not only the first occasion on which an entire job was analyzed using the PAA; it was also the first time job incumbents, as opposed to professional scientists, were asked to perform the rather abstract abilities analysis. In all earlier uses of the test instruments to rate *189 specific tasks, the ratings had been made by trained psychologists, familiar with the mechanisms of the analysis, who had at least some acquaintance with the definitions they were being asked to apply and some of whom had, in fact, participated in the preparation of the test anchors used in an effort to make the analysis concrete. Given the complexity of the rating scales and the nature of the task anchors, the analysis is clearly capable of producing a different result when used by different individuals with different backgrounds. Thus, the significance of an anchor for static strength (which is set at a level of 5.8 on a scale of 7) described as loading five full fifty-gallon drums into a truck may be quite different for a person who has done the task described than for a person who has not. In fact, there is considerable evidence that actual confusion and misuse of the instrument did occur.
As already noted, a large number of respondents failed to find the selected abilities present at all in special firefighting tasks, i.e., those occurring less frequently than once every one or two days. Moreover, many of the same tasks were considered usual tasks by some respondents and special tasks by others. Thus, "overhauling using ladders," "ventilation," "lifting people to stretchers," "moving large refrigerator or furniture in fire building," "making rescue of person or persons of varying weights under adverse conditions," and "forcible entry" are all listed as special tasks by some respondents. Other respondents describe essentially the same tasks, e.g., "moving usual household items such as refrigerator in a time of heavy fire conditions," "overhauling at a fire," "needed when forcing entry through doors or walls under fire conditions," "moving heavy refrigerators, furniture, unconscious persons," "to remove person from smoke-filled room or apartment," "force doors, vent," as a part of usual operations.
Moreover, many of the comments list the same activity or job behavior as an example of all or almost all of the abilities inquired about in the PAA questionnaire. For example, "functions at fire operations" and "stretching a hose" are given as examples of each of the given abilities in one fireman's comments. While this may reflect a correct understanding of the abilities being asked about (it is possible, for example, given the complexity of body movements involved in stretching a hose, to find at one point or another in the evolution an example of static strength, explosive strength, dynamic strength, stamina, extent flexibility, and gross body coordination), it is difficult to say, given the complexity of the example of job behavior offered in the comment, whether the example reflects a correct understanding of the ability or not. Nor is it clear that one has given concrete support for the degree of importance attributed to the ability when one has cited a complex job behavior involving to some unspecified degree the ability inquired about, but also involving other abilities.
This point may be made clearer by examining the definition of dynamic strength given in the PAA, along with the contrasting definitions provided, and then considering a number of examples given for that ability in the comments of firemen.
Dynamic strength is defined in the PAA instrument as follows:
"This ability involves the power of arm and trunk muscles to hold up or move the body's own weight repeatedly or at one time without stopping. This ability involves muscles resisting getting tired when having to repeatedly or without stopping to hold up or move the body's own weight." (Emphasis in original)
Contrasts are then provided between dynamic strength and static strength (involving the use of force to lift, push or pull objects), explosive strength (involving short bursts of muscle force to move the body's own weight or objects), and stamina (defined as work over a long time, involving "heart and blood vessels resisting becoming tired").
*190 The following comments by firemen, listing job behaviors thought to exemplify dynamic strength, are taken in order from a list prepared by AIR from the firemen's comments:
"Pull hose up six flights in smoke, pull ceilings; overhauling operations; carry persons up or down stairs; to and from places.
"Advancing a line into a building, in heat and smoke. Rescue operations. [Underlined comments represent a special task.]
"Advancing a line when charged up two or three floors or when pulling a charged line in a window off the fire escape four floors up.
"Firefighting; climbing stairs with tools and Air Paks on backs; climbing ladders; leaning over roofs to break windows.
"Under heavy fire conditions in rescues, this is pertinent.
"Pulling and running at a fire.
"Act of fighting a fire carrying in line, movement from place to place with line while extinguishing fire.
"Needed when advancing charged line at fire. Needed for opening ceilings and walls to allow access to fire to engine company.
"Pulling your own body up to the top of tall bulkheads. Pulling yourself over fences and walls to reach a certain objective (rear office building, adjoining roof tops). Pulling yourself up outside of fire escape to pass fire."
What is noteworthy about this list, when it is compared with the definition for dynamic strength given above, is that none of the groups of examples, except the last, appears to involve the defined ability as anything like its most noticeable characteristic. Indeed, each of the first eight activities described appears to involve first and foremost the contrasted ability of static strength in which force is used to lift, push or pull objects. The contrasted ability of stamina and other unmentioned abilities such as extent flexibility appear as much involved in the examples listed as the defined ability of dynamic strength. When these examples are placed alongside others in which the defined ability is almost certainly misunderstood, e.g., "moving Multi-Versal Nozzle from apparatus" as an example of explosive strength, it appears that the PAA and its ranking of abilities are not so well grounded in the observable behaviors of firefighters that one can predict with any confidence, from the degree to which the abilities are possessed, the capacity of an individual to perform well as a firefighter.
Moreover, despite the small size of the group used to perform the analysis and the failure to select that group on a random basis, no effort appears to have been made to assess the reliability of the responses, either on a subjective basis, by weeding out responses that evinced misunderstanding or inappropriate use of the scales, or on a more objective and systematic basis, by computing inter-rater correlation coefficients. Inter-rater correlation coefficients are statistical expressions of the degree to which raters agree with each other. Without inter-rater correlation coefficients, a researcher can have only a rough idea of the consistency with which raters ranked the rated abilities. The only inter-rater correlations which do appear to have been calculated (those comparing the rank ordering by firemen with the rank ordering by officers) supply further troublesome data, since there appears to have been a significantly greater degree of agreement between the officers and men in the more concrete physical demands analysis ("PDA") than that found with respect to the PAA.
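By way of illustration only, one common way to express agreement between two raters' rank orderings is Spearman's rank correlation coefficient. The sketch below is not drawn from the record: the two sets of ratings are hypothetical, the function names are invented for the example, and nothing in the opinion indicates which statistic AIR used for the officer/fireman comparison.

```python
# Illustrative sketch only: Spearman's rank correlation between two raters.
# The ratings below are HYPOTHETICAL and do not come from the AIR study.

def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman_rho(ratings_a, ratings_b):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties."""
    n = len(ratings_a)
    rank_a, rank_b = ranks(ratings_a), ranks(ratings_b)
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_a, rank_b))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Two hypothetical raters scoring the same five abilities on a 7-point scale.
rater_a = [7, 5, 6, 3, 4]
rater_b = [6, 5, 7, 2, 4]
rho = spearman_rho(rater_a, rater_b)
```

A coefficient near 1 indicates that the two raters ordered the abilities in nearly the same way; a coefficient near 0 indicates little agreement. For the hypothetical ratings above, the two orderings differ only in the top two positions, and the formula gives 0.9.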
Finally, AIR failed to reconcile the results of the PAA with data from other job analysis techniques. In this connection, it appears appropriate to consider, in particular, the PDA in some detail in part because *191 of the contrast between the PDA and the PAA in the degree to which the analysis is concretely grounded in observable work behaviors.
The Physical Demands Analysis
The PDA is a job analysis method, designed by the U.S. Department of Labor, which describes systematically the physical demands of a job in terms of 11 different activities: strength, climbing, balancing, stooping, kneeling, crouching, crawling, reaching, handling, fingering, and feeling. On November 13, 1973, a panel of 19 firemen and 12 officers under the supervision of AIR used the PDA instrument to describe the job of fireman. The panel was first asked to divide the job into six categories (sitting, standing, crawling, walking, running, and other) and to state the percentage of time spent in each activity. The results showed that 71.7% of the firefighter's time was spent in the more physically taxing work positions of standing, walking, running, and crawling.
Next, the panel was asked to analyze the strength demands of the job by stating in terms of "sometimes," "often," and "very often" the need to lift, carry, push or pull five categories of weights from the very light (0-10 lbs.) to the very heavy (100+ lbs.). The results of this exercise showed that firefighters sometimes have to lift weights on the average of 110 lbs., carry weights on the same average, push weights of 161 lbs., and pull weights of 125 lbs. At the other end of the scale, firemen "very often" lift, carry, push or pull weights averaging 63 lbs. Each of these activities (lifting, carrying, pushing, and pulling) was then particularized by questions as to what was required to be moved, the circumstances, the tools to be used, and the nature of any special job or task requiring the ability. In addition, respondents were asked to rate the importance of strength in general on a scale of no importance, important, and high importance. Although the answers to these questions would appear to have been particularly valuable in establishing a content valid test for strength, little use appears to have been made of them in fact, since the PAA provided another system of rating strength as a required ability in terms of three separate components, static strength, explosive strength, and dynamic strength, and since it was the PAA which, as noted below, was principally relied upon.
After dealing with strength in this fashion, the PDA further investigated each of the remaining activities (e.g., climbing, crawling, balancing, fingering, etc.) by asking whether it is present in the job, its importance in terms of the same importance scale referred to above, what climbing, balancing, etc. the worker did (and, in the case of climbing, how far the firefighter had to climb), the circumstances under which the activity was performed, and the name of any special task requiring the activity.
While AIR made some effort to compare the results of the PDA with the results of the PAA, some of the comparisons smack of an effort at rationalization of apparent conflicts in the results of the two analyses to support use of the PAA. Thus, although climbing and crawling were rated first and fourth out of the eleven behaviors on the PDA, the ability of dynamic strength with which AIR associated them is given a rank of seven out of the nine abilities on the PAA. The differences are to be reconciled, according to AIR, because dynamic strength was ranked third with respect to special tasks on the PAA. Elsewhere, however, as noted above, AIR had cautioned that the general lack of response to the special tasks portion of the PAA made it not a very useful part of the job analysis. As a result, the contrast between the ratings of the PDA and the PAA for dynamic strength remains troubling.
So too, AIR noted that reaching is ranked considerably lower on the PDA than the ability with which AIR associated it on the PAA (extent flexibility). AIR explained this difference as "most likely" reflecting *192 the difference of definition given for the two physical activities. In other areas no comparison at all is made between the ratings achieved on the PAA and on the PDA because the differences of definition between the physical demands involved in the PDA and the physical abilities involved in the PAA make clear that different matters are being inquired about. Thus, as firmly grounded as the PDA appears to be in the observable behaviors of the job of firefighter, it does not appear that those observable behaviors provide much support for the analysis accomplished by the PAA.
The Test Development
As Dr. Fleishman testified and as AIR's reports to the City confirm, the principal reason for selecting the PAA method of job analysis for use in connection with Exam 3040 over the other techniques explored pursuant to the 1972 contracts was that the appropriate tests to determine applicants' abilities identified as necessary to the fireman's job had already been the subject of extensive study, chiefly by Dr. Fleishman himself, in his 1964 book and later work. As already noted, the appropriate test for each ability, Dr. Fleishman had determined, was, all else being equal, the one most factorially pure for the abilities determined by the PAA to be important for the job, that is to say, tests with relatively high loadings on those factors and relatively low loadings on other factors. In order to assure that the tests did not unduly overlap in measuring the same abilities, and also as the beginning of an effort, which could not draw to any substantial degree on Dr. Fleishman's prior work, namely, an effort to develop an appropriate scoring and weighting system, AIR, in December 1973, tried out an experimental battery of 11 factorially "pure" tests of eight physical abilities selected on the basis of the PAA, as well as a test of discrimination reaction time, a psychomotor ability. The 11 physical tests tried out in December 1973 and the eight abilities (in the rank order established by the PAA) they were intended to measure were as follows:
Ability                                Test
1. Stamina                              1. One Mile Run/Walk
                                        2. Five Minute Free-Style Stepping
2. Static Strength                      3. Hand Grip Preferred
                                        4. Hand Grip Nonpreferred
3. Explosive Strength                   5. Free-Style Broad Jump
4. Gross Body Equilibrium               6. Balancing
5. Extent Flexibility                   7. Twist and Touch
6. Gross Body Coordination              8. Cable Jump
7. Dynamic Strength                     9. Push-Ups
   Trunk Strength, a subcategory       10. Leg Lifts
   of Dynamic Strength
8. Dynamic Flexibility                 11. Bend, Twist, and Touch
The tests were administered at the Baruch Recreational Center, which contains a small gym, on the Lower East Side of Manhattan, to 68 men then in training to become firefighters and to 32 firemen who had recently completed their training and were in the first probationary year. All of the persons taking the trial tests had, thus, taken and passed the previous physical Exam 0159, administered by the Fire Department two years before, in 1971. The reason for using trainees and probationers as test subjects was to try out the tests to be used on a sample closely approximating the applicant population expected to take Exam 3040. (The sample was not confined exclusively to trainees because there were not enough at the time to produce a test sample of 100.) Also, to this same end, and mindful of its obligation to the litigants in *193 the Vulcan case, the City selected a sample designed to include a sufficient number of white, black, and Hispanic men to permit evaluation of the relative performance of each group on the various tests. Since, however, it was anticipated at this point that women would again not be permitted to apply for the firefighter's position, no women were included in the December 1973 tryout sample.
After the tryout battery was given, AIR computed correlations among the various tests to determine the degree of overlap between them and, also, correlated the tryout scores of 95 of the men in the sample with their scores on the five physical tests of Exam 0159 (the physical test previously used by the City), one of which was an agility test. The use to which the results of these correlations were put in determining which tests were to be included in Exam 3040 and, most importantly, in establishing the scoring of the exam will be discussed infra.
Before turning to the content of the battery of tests recommended by AIR to the City for inclusion in Exam 3040, however, it is appropriate to note at this point the occurrence of another event which was, in its implications, at least, to have profound effects on the test development of Exam 3040. In early 1974, because of concern about the impact of a height requirement on Hispanic persons, then-Fire Commissioner O'Hagan asked AIR to consider the appropriate height requirement for the fireman job.
The Height-Related Tests
On January 14, 1974, 12 individuals of varying heights (six firemen and six non-firemen from the Model Cities Program) performed a number of actual job sample tasks, thought to be height- and job-related, to determine what relationship, if any, existed between the height of an individual and his ability to perform firefighting tasks.
Subsequently, on January 29 and February 2, 1974, the following tasks and one test, again thought to be height-related, were administered to a group of 100 firemen and 100 non-firemen, all male, including persons shorter than the City's then current 5'7" minimum height requirement for firefighters:
(1) Ladder Climb: Subject climbs 35-foot ladder wearing a Scott air pack, coat, boots, and hat;
(2) Ladder Raise and Positioning: Subject raises non-extended 20-foot length of 35-foot extension ladder from floor to wall and back down;
(3) Window Ventilation: Subject swings six-foot hook to a designated target area in upper right quadrant of a wall;
(4) Hose Unclamping: Subject removes one end of a hose section from a 74"-high clamp (with an examiner's assistant unclamping and steadying the other end), lowers the hose to the ground, and then raises and re-clamps;
(5) Victim Rescue: Subject carries a 120-pound dummy up and down one flight of stairs one time;
(6) Vertical Hose Stretch: Subject hauls over the shoulder the nozzle section of a 2½" hose up and down one flight of stairs;
(7) Static Strength: Hand grip test.
*194 On February 22, 1974, immediately following these tests, AIR sent to the City a recommended test plan for both the written and physical portions of Exam 3040, including recommended tests and scoring tables. For the physical test battery, AIR recommended eight tests to test the parenthetically indicated abilities: the one-mile run/walk (stamina), hand grip preferred (static strength), dummy carry (static strength, explosive strength, dynamic strength, and gross body equilibrium), agility (explosive strength, stamina, dynamic strength and gross body coordination), freestyle broad jump (explosive strength), balance (gross body equilibrium), twist and touch (extent flexibility), and push-ups (dynamic strength).
Of these eight recommended tests, five were factorially pure tests discussed in Dr. Fleishman's 1964 book, and one, the mile run, was a factorially pure test for stamina not included in his book but recognized in the literature. However, as indicated by the descriptions given above, the agility test and the dummy carry were also recommended although they were not factorially pure tests. Their inclusion in the test battery requires separate discussion.
The "Face Valid" Tests
The agility test recommended in February 1974, consisting of an obstacle course with two walls, one 5' and one 8', appears to have been drawn directly from the prior Fire Department Exam 0159, administered on a pass/fail basis in 1971. Why it was substituted, as it appears to have been, for the more factorially pure cable jump test is not at all clear. Whereas the cable jump test is said by Dr. Fleishman to measure the sixth-ranked ability of gross body coordination, the dodge-and-run aspect of the agility test recommended is said by Fleishman in his 1964 book to measure the third-ranked ability of explosive strength, an ability already tested for by two of the recommended subtests (free-style broad jump and dummy carry). Moreover, in calculating intercorrelations between scores for the agility test on the December 1973 tryout and on Exam 0159, AIR determined that the agility test had a moderate correlation with tests for explosive strength and stamina, a somewhat lower correlation with gross body equilibrium, and the lowest correlation of all with gross body coordination, the ability tested by the cable jump test for which it was substituted. This substitution became of crucial importance for plaintiff and the class she represents because the height-related eight-foot wall which formed part of the agility test became not just a stumbling block, but a literal barrier for almost all women taking the exam.
In January 1974, immediately before submission of the proposed test battery to the City, an effort was made by AIR to justify the inclusion of the eight-foot wall in the agility test. The AIR letter reporting the results of an effort to justify the height chosen, however, explains its rationale not in terms of the height needed to establish the ability (explosive strength) sought to be tested, but rather in terms of the nature of the fireman's job. Moreover, rather than *195 refer to the results of any of the job analyses already performed in order to determine if the analyses had established the relative importance of scaling walls as a task performed by firemen, AIR concluded, apparently on the basis of a single unrecorded interview between an AIR representative and representatives of the Fire Department, that firemen in New York have occasion to encounter eight-foot parapets between roof tops which must be scaled in the course of their work. The relative importance of this work behavior to the job as a whole is nowhere addressed in AIR's job analysis.
The dummy test (as it came to be known) appears to have had its origin in the height study requested by then-Commissioner O'Hagan. Again the reason for inclusion of the test in the February recommendations is unclear. Whatever the explanation for the inclusion (whether a desire to aid minorities or a recognition, which later proved accurate, that the City and the Department of Personnel would not accept too many tests with which they were unfamiliar either from earlier exams, the athletic field, or notions of what the fireman's job entailed), inclusion of the test presented the same difficulties, given AIR's overall approach, as the inclusion of the agility test. Again, the complexity of the test assured that more than one ability was, in all likelihood, being tested for, thus increasing the problem, sought to be avoided by Fleishman's factorially pure approach, that scoring would be distorted by measuring more than once for the same ability. In addition, since neither the dummy carry nor the agility test arose out of any systematic analysis of observable work behaviors (although appearing to owe their justification to such an origin), the addition of the tests reinforced the need for a criterion-related validity study.
The same movement away from the measurement of the relatively abstract abilities required for the job based on Fleishman's abilities analysis towards so-called "face valid" tests continued thereafter as *196 City officials began to criticize and question the proposed test battery, in some cases on quite subjective grounds. Thus, both the Chief of the Fire Department and the Chief of the Department of Personnel complained of the scoring plan for the mile run (which permitted a candidate to complete the mile in 12 minutes and still not fail), not on the basis of its methodology, but simply because they or some close relative could run the mile in less time. Similarly, the so-called twist-and-touch test for the ability of extent flexibility and the balance test for gross body equilibrium were the subjects of complaints by then-Commissioner O'Hagan to representatives of the City Department of Personnel and through the Department to AIR on the ground that what was wanted was a test "for firefighters, not for ballet dancers."
In response to this criticism of the mile run, AIR appears to have initially reacted quite appropriately by re-examining the data on the basis of which its scoring of the mile run had been determined. This re-examination revealed that AIR had developed the March 1974 scoring norms based on the distribution of scores in the December 1973 tryouts, using a 31-lap track at the Baruch gym. Obviously, the effect of having so many short laps was to increase the average time of the runners. When City officials complained that the scoring system was too lenient and insisted it be made more stringent, AIR asked to try out the mile run again on a four-lap track. This request was, however, rejected. Instead, the representatives of the City's Department of Personnel directed AIR's attention to a standard reference work by T. K. Cureton, Jr., which set out suggested scores for the mile run based on norms developed from runs performed by an all-male sample population consisting of high school, college, and military men involved in a program of daily physical training in preparation for military service in World War II. AIR accepted the suggestion that this more stringent scoring norm be used with little if any attention to the logic of using such norms in its overall scoring system, either in terms of job requirements or norms in the applicant pool.
A substitute for the rejected twist-and-touch test appears to have emerged from a visit by AIR, Fire Department personnel, and a representative of the Personnel Department to a physical examination for firefighters being administered by the City of Chicago in August 1974, made at the specific request of then-Commissioner O'Hagan who had already expressed his dissatisfaction with AIR's test battery. The Chicago battery consisted of five tests: flexed arm hang, obstacle run, stair climb, man lift and carry, and hose coupling. The AIR representative summoned to Chicago was asked to observe and evaluate the Chicago battery as compared with AIR's.
Following this visit, AIR continued to recommend the tests it had originally proposed, based primarily on differences in the job analysis and testing techniques being used, with Chicago testing for multiple abilities in tests resembling actual job tasks and New York testing for single abilities by means of factorially pure tests highly loaded for a single factor deemed important to the job. In addition, AIR opposed use of the one factorially pure test used in Chicago, the flexed arm hang, because of the low degree of reliability of that test from one administration of the test to another, so-called test/re-test reliability.
*197 Not satisfied with this response, a representative of the City's Department of Personnel then contacted, apparently without consulting AIR, the consultants used by the City of Chicago in preparing that City's physical exam and asked those consultants to criticize AIR's proposed test battery. This criticism served to confirm the impression of the City officials that the mile run scoring was too lenient and that the twist-and-touch and balance tests were "inappropriate" for firemen. In addition, the Chicago experts pointed out that at least one of the tests, push-ups, would have an adverse impact on women because of their lower average upper-body strength.
Without reaching any professionally acceptable resolution of these differences, AIR thereafter acceded to the City's requests by substituting a test dubbed a "ledge walk" test for Fleishman's balance test of gross body equilibrium, adding a new "obstacle" to the Exam 0159 obstacle test called "window-ladder-window" as a test for extent flexibility and replacing push-ups with what AIR had previously termed a relatively unreliable test of dynamic strength: the flexed arm hang. Both the "window-ladder-window" obstacle and the flexed arm hang were drawn directly from the Chicago battery. The "ledge walk" appears to be entirely unstudied and unprecedented as a test of the ability of gross body equilibrium which it purported to measure either in terms of factor analysis or reliability. The "window-ladder-window" test, by being incorporated into the agility test, came to measure extent flexibility (to the extent it can be said to have reliably measured it at all) in terms of time rather than in terms of the inches to which one could extend oneself, as under the original twist and touch test. The ledge walk, performed with eyes open, no longer eliminated that type of balance produced by visual eye contacts, as was the case with Fleishman's balance test. In fact, there appears no basis for saying that any of the three tests reliably measured the abstract abilities measured by the Fleishman tests for which they were substituted.
As the result of these additions to the proposed test battery, it became necessary to conduct another tryout test to establish new scoring norms. This tryout was administered in early November 1974 to a group of 75 incumbent firemen, only six of whom were probationers. The November tryout battery consisted of the new tests: the flexed arm hang, ledge walk, and a window-ladder-window test (consisting of crawling through a window, across a horizontal ladder and through another window). In addition, the dummy carry was tried out again apparently in order to get new scoring norms.
A final proposed test battery for Exam 3040 was forwarded to the Department of Personnel on November 29, 1974. The physical portion of the exam contained seven physical subtests. The tests and the abilities they were said by AIR to measure were the mile run (stamina), hand grip (static strength), broad jump (explosive strength), dummy carry (static strength, explosive strength, dynamic strength, and gross body equilibrium), agility (including window-ladder-window) (explosive strength, stamina, dynamic strength, gross body coordination and extent flexibility), ledge balance (gross body equilibrium), and flexed arm hang (dynamic strength).
Because AIR's development of test scores did not arise out of Fleishman's prior work, but rather had its origin at least in part (leaving to one side the scoring of the mile *198 run already discussed) in empirical studies conducted during the test development, the subject merits separate discussion.
As already noted, the December 1973 tryouts were designed to provide AIR with the basis for a norm-referenced scoring system based on a sample that was thought to approximate the applicant population. To this end trainees were used to the extent they were available with probationary firemen filling out the complement of persons tested. No women were included in the sample intended to represent the applicant population; and the sample, by definition, differed from the applicant population in at least one other notable characteristic, namely, that all members of it had passed (on the pass/fail basis on which it was administered) Exam 0159.
In order to get norms for the tests added subsequent to the December 1973 tryouts, however, AIR appears to have departed to some extent from its intention of using an applicant sample. Thus, norms for the dummy carry were drawn from the height-related tasks tryout on firemen and non-firemen held on January 29 and February 2, 1974. (Although the hand grip test was also administered to the same group, AIR used its December 1973-based sample, perhaps on the ground that it more closely resembled the applicant population.) On the other hand, to obtain norms for the agility test drawn from Exam 0159, AIR turned to the test scores reported for that exam when it was administered in 1971, thereby securing a sample resembling perhaps as closely as possible (but for the exclusion of women) a sample representative of what might be expected to turn out for Exam 3040.
As noted, the agility test had been modified to include a window-ladder-window component. Scores on the window-ladder-window component from the November 7 and 8 tryout were added by AIR to the original distribution of scores on the agility test derived from Exam 0159 by re-working the original 1971 distribution on the assumption that those persons who did well on window-ladder-window would also have done well on the agility test in its previous form, an illogical assumption if the window-ladder-window component was supposed to test for the discrete ability of extent flexibility, an ability not included in the former agility test.
In setting the scores for the flexed arm hang, although AIR had data both for the 75 firemen who tried out this test on November 7 and 8, 1974, and data from the arm hang test in Chicago, AIR disregarded the New York City scores and set the scoring table according to the Chicago data. This made some sense in terms of a desire to replicate an applicant population, since the November 1974 group consisted of 75 incumbent firemen, only six of whom were probationers. However, one paradoxical result of this choice was that a large number of the New York City firemen who performed the flexed arm hang as administered in November 1974 would have scored at or below zero on this portion of Physical Exam 3040, as scored based on the Chicago norms.
In the scoring of the ledge balance, for which there was no normative data other than the November 7 and 8, 1974 tryouts, that data was necessarily used. However, as a result of using as norms the performance on the test by a group of job incumbents rather than performance by an applicant pool, the scores were set at a level at which many of the New York City firemen who tried out the test would have received an unacceptably low score.
This result came about because AIR applied to all scores, whether for groups of job incumbents or for groups purporting to reflect the applicant pool, traditional grading techniques, fixing the pass/fail levels at what appeared to be "natural breaks" in a roughly drawn curve in the distribution of scores. AIR then, also using traditional grading techniques, fixed 70.0% as the appropriate designation for a passing score, thereby effectively dividing those passing the exam into 300 separate ranks, each representing a tenth of a percentage point between 70.0 and 100.0%. Selection of candidates from within each of these 300 ranks (in cases in which a number of candidates *199 scored at the same percentage level) was made on the basis of candidates' social security numbers, that is, essentially on a random basis.
On March 6, 1974, AIR sent to the City a proposed scoring table for the recommended battery of eight tests. In determining the proposed tables, AIR first determined the relative weighting of each test based on the rankings of the mean ratings of the physical abilities in the PAA. For example, because stamina received the highest mean rating on the PAA, AIR proposed that the mile run, which tested stamina, should be weighted 20 percent of the final score or 200 points out of a total of 1,000 points. (1,000 points was to equal a score of 100%.) The other five recommended factorially pure tests were to be given a weight of 100 points each. In order to deal with what had become known as the mixed ability tests, it was proposed that the agility test and the dummy carry were to receive 150 points each. Thereafter, when one of the factorially pure tests (the balance test) was incorporated into the agility test by the addition of the window-ladder-window component, the 100 points attributable to the balance test were simply added in toto to the agility test, with the paradoxical effect that one of the mixed ability tests received 150, and the other 250, points, becoming the most important test of all. Another paradoxical result of this system of weighting was that the mile run, said to test for the ability rated first as a needed ability for firefighters, was assigned a total of 200 points while the hand grip test, a measure of static strength, the next highest rated ability, was assigned a total of 100 points.
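The point arithmetic described above can be checked with a short sketch. Only the point values come from the text; the variable names are illustrative, and the five factorially pure tests other than the mile run are counted rather than named because the March 1974 proposal's list is not reproduced here.

```python
# Point weights in the March 6, 1974 proposal, as described in the text
# (1,000 points = 100%). Variable names are illustrative only.
MILE_RUN = 200
PURE_TEST = 100        # each of the five other factorially pure tests
N_PURE_TESTS = 5
AGILITY = 150          # mixed ability test
DUMMY_CARRY = 150      # mixed ability test

proposed_total = MILE_RUN + N_PURE_TESTS * PURE_TEST + AGILITY + DUMMY_CARRY

# Later change: the balance test (one of the pure tests) was folded into the
# agility test, and its 100 points were added in toto to the agility weight.
agility_after_merge = AGILITY + PURE_TEST
total_after_merge = (
    MILE_RUN + (N_PURE_TESTS - 1) * PURE_TEST + agility_after_merge + DUMMY_CARRY
)

print("proposed total:", proposed_total)
print("agility weight after merge:", agility_after_merge)
print("total after merge:", total_after_merge)
print("agility share of final score:", agility_after_merge / total_after_merge)
```

On these figures the merged agility test alone carries a quarter of the physical score, more than the mile run.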
At the same time as representatives of the City expressed themselves not satisfied with the proposed test battery, they also complained about the recommended scoring tables for those tests. Not only was the scoring for the mile run criticized, as noted above, but questions were also raised, on a not much more objective basis, concerning the scoring for the broad jump, the dummy carry, and the hand grip. AIR responded to these criticisms by "tightening" the proposed scores on each of them. Thus, with regard to the broad jump, the original proposal gave no credit on the test unless the candidate jumped at least six feet. At the insistence of the City, this minimum distance was raised to six feet two inches. With regard to the hand grip, the minimum pressure required to obtain the minimum score on the test was raised from between 28 and 30 kilograms to between 34 and 35 kilograms. There appears no consistency in these increases in the minimum scores (which served to deprive some women candidates of any score at all on the subtests: 5 out of the total 79 on the broad jump and 3 on the hand grip) nor any plausible reason for them, apart from acquiescence in the City's view that the scores provided were too lenient.
Finally, it must be noted that, for reasons not explained by the evidence at trial, AIR used no score conversions in arriving at a combined score for the seven tests comprising the physical examination and for the written examination. The effect of this was to give a far greater weight to the physical portion of Exam 3040 than was intended. The reason for this is as follows. As already noted, Exam 3040 consisted of both a written and a physical part, each weighted 50 percent in determining the candidate's rank order. In any test consisting of more than one part (each to be given equal weight), it is necessary to insure that one test not become, de facto, the determinative one because of greater variance in results on one test than the other. The generally accepted method of avoiding this result is to convert the raw scores on each test into a standard unit before the results of the test are combined to determine the candidate's overall rating. Use of such a standard score would have insured that the variance in the two sets of raw scores was taken into account so as to maintain an effective 50/50 rating.
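The effect of unequal variance on a nominal 50/50 weighting can be illustrated with a minimal sketch. The scores below are hypothetical and are not drawn from the record; the conversion shown is the ordinary z-score (raw score minus the mean, divided by the standard deviation), which is one common form of standard score and is assumed here for illustration.

```python
from statistics import mean, pstdev

# Hypothetical scores for five candidates (NOT data from the record).
written = [92.0, 91.0, 90.0, 89.0, 88.0]    # narrow spread
physical = [70.0, 100.0, 85.0, 75.0, 95.0]  # wide spread


def standardize(xs):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def ranking(scores):
    """Candidate indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Averaging raw scores: the part with the larger spread dominates the order.
raw_combined = [(w + p) / 2 for w, p in zip(written, physical)]

# Averaging standard scores: each part contributes equally to the order.
std_combined = [
    (zw + zp) / 2 for zw, zp in zip(standardize(written), standardize(physical))
]

print("rank by raw average:    ", ranking(raw_combined))
print("rank by standard scores:", ranking(std_combined))
print("rank by physical alone: ", ranking(physical))
```

With these hypothetical numbers the raw average reproduces the rank order of the physical test alone, while the standardized average lets the written test move candidates up or down the list.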
Test Administration and Results
As noted above, Exam 3040 included a written test and a physical test. The written test consisted of 100 short-answer questions. Only those candidates who passed *200 the written test were allowed to take the physical test battery. As noted above, candidates who passed the physical test received a final combined score consisting of the average of their scores on the written and physical tests. Candidates who were veterans received an additional five points, which were added to their combined score.
The written test was given on December 3, 1977, to 25,168 persons: 24,758 males and 410 females. (It is these 410 women who constitute the class, as presently defined.) A total of 24,252 males and 389 females passed the written test. The physical exam was administered over the period February 15, 1978, to April 30, 1978, to a total of 18,148 persons: 18,060 males and 88 females. Of these, 16,925 males and 79 females completed the physical test. A total of 7,847 males and no females passed the test.
Prior to administering the test, as part of an effort stimulated by the Vulcan litigation to recruit minorities, an effort was made by the City through the City University system to encourage women to apply for the job of firefighter and take the test. Another effective means of recruitment (to judge from the witnesses who testified at trial) was the encouragement by incumbent firemen of female family members (sisters, nieces, and daughters) to take the exam.
All applicants completing a formal application form were sent a 104-page booklet which described the written and physical tests. The booklet included a description of each physical subtest as well as a description, with pictures depicting both men and women, of 25 preparatory exercises and general information on physical fitness. In addition, the City conducted a "physical fitness familiarization program" in anticipation of the physical portion of Exam 3040. Four sessions were held outdoors at the Fire Department's division of training at Wards Island during December 1977, and the remaining eight sessions were held indoors during January 1978 at the Summer Avenue Armory in Brooklyn, the site of the actual test.
More important than this "familiarization" program in terms of test preparation appears to have been an informal system of schooling conducted privately for a fee by City firemen on their own time. While the familiarization program succeeded in giving candidates a concrete idea of what the physical tests would look like, only the private sessions gave candidates an opportunity to practice and train for the physical tests. As established by the testimony at trial, including that of one woman candidate who scored second among the women on the test, these training and practice sessions were of enormous value in teaching techniques of mastering the tests, particularly with the so-called multiple ability tests (the dummy test and the agility test with its eight-foot wall) and the ledge walk.
The physical exam as administered consisted of the following tests: dummy carry, hand grip, broad jump, flexed arm hang, agility test, ledge walk, and one-mile run, administered in this order. No particular order to the administration of the test appears to have been prescribed by AIR. What follows describes the administration of each of the tests.
1. Dummy Carry. The test was administered and scored as described to the candidates in the booklet provided them as follows:

Dummy Carry
Test Weight = 150
Standing with both feet behind the start line, on the signal "Go," step forward, pick up the 120-pound dummy, put it on one shoulder, carry it in that position up one flight of stairs, walk around the marked lines of the upper stair landing, return down the stairs, bend down on one knee, then place the dummy back on the mat under control. The dummy must be kept off the ground when carrying it. Score is time to completion.
A two-second penalty is added to the total time score if the dummy is "dropped" out of control after the dummy has been carried up and down the flight of stairs. "Dropping" the dummy *201 anywhere else in the test, such as on the stairs or the upper stairs landing results in a zero score.
One trial is to be given unless a candidate receives a zero penalty score, in which case a second and final trial is allowed.
Score (in seconds)    Weighted Percent
18 or less            150
19                    145
20                    141
21                    137
22                    133
23                    129
24                    125
25                    121
26                    117
27                    113
28                    109
29                    105
30                    100
31                    95
32                    90
33                    85
34                    80
35                    75

Still unfinished after 35 seconds:
Lowering dummy to ground                           70
On upper stair landing or returning down stairs    35
Can't lift or going up stairs                      0
As already noted, the dummy test was said by AIR to test for static strength, explosive strength, dynamic strength, and gross body equilibrium.
The dummy used in the test was cylindrical, covered in canvas, without handles or other articulation, and presented initially a considerable technical problem to the uninitiated as to how to lift and hold it. Anyone unable (whether through failure of technique or a failure in the capacities of strength needed to lift the dummy) lost not only his or her score for the abilities said to be involved in lifting, but also any score on the others of the multiple abilities sought to be tested during the test, including gross body equilibrium, ranked number four among the abilities needed for the job. Out of the 80 women taking the exam, only four appear to have succeeded in lifting the dummy.
2. The Hand Grip. The requirements and scoring for the hand grip test were as follows:

Hand Grip
Test Weight = 100
Holding the dynamometer in the preferred hand, with the palm around the top bar and the fingers curled around the bottom bar, arm extended downward away from the side of the body and with the palm facing the side of the body, squeeze the dynamometer as sharply and as steadily as possible. This test is to be performed while standing. Score is the highest dynamometer reading obtained on one squeeze.
One squeeze is a trial. Two trials, separated by at least one minute, are to be given. The best of the two trials is to be rated.
Score (in kilograms; kg. = 2.2 lbs.)    Weighted Percent
70 or more                              100
65 - 69                                 95
60 - 64                                 90
55 - 59                                 85
50 - 54                                 80
46 - 49                                 75
42 - 45                                 70
40 - 41                                 65
38 - 39                                 60
36 - 37                                 55
34 - 35                                 50
Less than 34 kgs.                       0
The ability said to be measured by the hand grip is static strength. The principal difficulty in administration of the hand grip appears to have been that, although the dynamometer employed contained a means of adjustment for hand size, AIR did not explain its use to the City, and the City's representatives did not explain its use to candidates. As a result, plaintiff, for one, found difficulty in getting a grip on the instrument with, undoubtedly, some significant effect on her score. It may be inferred, moreover, that there was as a result of this error in administration an adverse effect on women in general as a result of smaller hand size on average.
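The hand grip scoring table can be read as a simple lookup, sketched below. The band floors and point values are taken from the booklet table; the function name and the assumption that readings are recorded in whole kilograms are illustrative only.

```python
# Hand grip scoring bands from the candidate booklet:
# (minimum kilograms for the band, points awarded), highest band first.
HAND_GRIP_BANDS = [
    (70, 100), (65, 95), (60, 90), (55, 85), (50, 80), (46, 75),
    (42, 70), (40, 65), (38, 60), (36, 55), (34, 50),
]


def hand_grip_points(kg: int) -> int:
    """Points for a dynamometer reading, assuming whole-kilogram readings.

    Any reading below 34 kg. receives no credit at all.
    """
    for floor, points in HAND_GRIP_BANDS:
        if kg >= floor:
            return points
    return 0


for reading in (33, 34, 45, 70):
    print(reading, "kg ->", hand_grip_points(reading), "points")
```

The sketch makes the threshold visible: a one-kilogram difference at the bottom of the table separates 50 points from none, which is why an unexplained adjustment for hand size could matter to a candidate near the minimum.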
3. The Free-Style Broad Jump. The broad jump requirements and scoring were as follows:

Free-Style Broad Jump
Test Weight = 100
*202 With both feet behind the start line, jump forward as far as possible without falling backwards. Arms may be used free-style during a jump. Score is greatest distance jumped as measured from the point where the rear-most heel lands to the start line.
Falling backward when landing results in a zero score for that trial. The best of two trials is to be rated.
Score (in feet and inches)    Weighted Percent
8'08" or better               100
8'05" or better               95
8'02" or better               90
7'10" or better               85
7'06" or better               80
7'02" or better               75
6'10" or better               70
6'08" or better               65
6'06" or better               60
6'04" or better               55
6'02" or better               50
Less than 6'02"               0
The test was said to measure (as did the dummy carry) explosive strength.
One problem encountered by plaintiff in the administration of the test was that at the time she took the test it was administered on a mat which had a tendency to slip on the floor. The mat slipped in her case, and she fell backwards as a result. The mat was later removed. These and other minor difficulties in test administration are complained of by plaintiff not because they affected women only but because they call into question the degree of precision in scoring with which the test purported to measure differences in the abilities of candidates to perform on the job.
4. Flexed Arm Hang. The test's requirements and scoring were as follows:

Flexed Arm Hang
Test Weight = 100
Mount the ladder to a height such that the chin is even with the horizontal bar; use a firm overhand grip on the bar and, when ready, nod; the ladder will be taken away and the timing will begin. Hang onto the bar as long as possible. The test will be over when arms are completely extended. Score is time before arms are completely extended.
One trial is to be given.
Score (in minutes and seconds)    Weighted Percent
2'20" or more                     100
2'10" - 2'19"                     95
2'00" - 2'09"                     90
1'50" - 1'59"                     85
1'40" - 1'49"                     80
1'30" - 1'39"                     75
1'20" - 1'29"                     70
1'15" - 1'19"                     65
1'10" - 1'14"                     60
1'05" - 1'09"                     55
1'00" - 1'04"                     50
Less than 1'00"                   0
The test was intended to measure for dynamic strength (also measured by the dummy carry).
One notable difficulty in the administration of the flexed arm hang was, as AIR noted in recommending against its use by the City, its low (.53) test/re-test reliability. This low degree of reliability was attributed to difficulties in obtaining objective measurements because of subjective differences between monitors in deciding when the arms are "completely extended."
5. Agility Test. The agility test was administered and scored as follows:

Agility
Test Weight = 250
From a position lying on the back, feet together, hands at side, on signal "Go," rise and run nine feet to the five-foot wall and scale it; run 11½ feet to a maze of obstacles and dodge through; run about 15 feet and crawl through the first "Window;" mount the 30-foot horizontal ladder and crawl across it, keeping one hand on each outside rail; crawl through the second window; run about 20 feet to the eight-foot wall and scale it; run around one obstacle; sprint 45 feet back to the finish line. Score is time to complete the course.
A penalty of 2 seconds will be added to the total time score each time one foot touches the floor on the outside of the ladder while crawling across it. A penalty of two seconds will be added to the total time score if both feet fail to land *203 beyond the last rung of the ladder and on the side of the ladder farthest away from the second window.
Use of the iron supporting rods on the eight-foot wall to aid the climb or running out of the course without retracking and continuing properly within the time limit will result in credit only for the previous properly completed obstacles.
The best of two trials is to be rated.
Score (in seconds)    Weighted Percent
22 or less            250
23                    245
24                    240
25                    235
26                    229
27                    223
28                    217
29                    211
30                    205
31                    199
32                    193
33                    187
34                    181
35                    175
36                    169
37                    163
38                    157
39                    151
40                    145
41                    139
42                    132
43                    125

Still unfinished after 43 seconds:
Sprint                                     118
Eight-foot wall                            59
Window, ladder, maze, or five-foot wall    0
The test was said to test for explosive strength, stamina, dynamic strength, gross body coordination, and extent flexibility. Almost all of the women candidates appear to have been unable to clear the eight-foot wall. Whether this was because of a failure of technique or because of lack of ability cannot be determined in individual cases. However, technique clearly played a large part in the successful completion of the test by other candidates. Since the eight-foot wall appeared at the end of the obstacle test and since failure to clear it automatically dropped one's score from a possible 250 to 59, a candidate who might have scored well on all of the other abilities tested for (including gross body coordination, which was tested only by the agility test and which was deemed by AIR to be worth 100 points when tested separately by its balance test) was effectively deprived, by means of a failure on the eight-foot wall, of accurately demonstrating abilities deemed by AIR to be necessary for the job.
6. Ledge Walk. The test and its scoring were as follows:

Ledge Walk
Test Weight = 100
Wearing an "oxygen unit" consisting of a harness and weight on the back, for a total weight of 26 lbs., start with both feet facing the wall, within the one-foot area marked off on the 30-foot "ledge" that is two and one-half inches wide, and six inches from the wall. On the signal "Go," side step to the right as quickly as possible to the opposite side of the ledge without falling off; touch the side board with the right foot and make sure that both feet are within the one-foot marked-off area; then side step to the left as quickly as possible, but without falling off, until both feet are again within the one-foot marked-off starting area. Score is time to move across the ledge and back again without falling off.
"Falling" off the ledge constitutes a zero score for that trial. Two trials are to be given and the best trial rated.
Score (in seconds)    Weighted Percentage
9 or less             100
10                    97
11                    94
12                    90
13                    86
14                    82
15                    78
16                    74
17                    70
18                    65
19                    60
20                    55
21                    50
More than 21          0

The ledge balance was said to test for gross body equilibrium. One problem encountered by plaintiff in the administration of the test was that she was unable to *204 adjust the harness for the breathing unit to her back and was not assisted by anyone in doing so. In addition, when plaintiff took the test one of the walls used leaned towards the candidates, making it more difficult to accomplish the walk. Again, this difficulty is mentioned not because it adversely affected the ability of women to perform on the test (it affected all equally) but to illustrate the difficulties, discussed infra, created by attributing too great precision to the test instrument in predicting job performance.
7. Mile Run. The test was administered in general as described for the candidates in the booklet provided them as follows:

One Mile
Test Weight = 200
With both feet behind the start line, on the signal "Go," complete X number of laps around the track as fast as possible by running or a combination of running and walking. Score is time to complete X number of laps which is equal to one mile.
One trial is to be given.
Score (in minutes and seconds)    Weighted Percent
Less than 4'50"                   200
4'50" - 4'56"                     195
4'57" - 5'03"                     190
5'04" - 5'11"                     185
5'12" - 5'19"                     180
5'20" - 5'27"                     175
5'28" - 5'36"                     170
5'37" - 5'45"                     165
5'46" - 5'54"                     160
5'55" - 6'03"                     155
6'04" - 6'12"                     150
6'13" - 6'21"                     145
6'22" - 6'30"                     140
6'31" - 6'39"                     135
6'40" - 6'48"                     130
6'49" - 6'56"                     125
6'57" - 7'04"                     120
7'05" - 7'11"                     115
7'12" - 7'18"                     110
7'19" - 7'24"                     105
7'25" - 7'30"                     100
More than 7'30"                   0

The test was designed to measure stamina or aerobic capacity, something it will do only for runners taught to pace themselves in running. In the absence of pacing, the test measures not stamina or aerobic capacity (that is, the ability of the body to generate energy from oxygen uptake) but anaerobic capacity (the ability of the body to generate energy from itself), something also measured by strength tests. For this reason, a more generally accepted test for stamina than the mile run is a test requiring the candidate to run or walk for twelve minutes, measuring the distance accomplished, on the theory that this test forces pacing on even an untrained candidate. Several of the candidates testified that they had trained themselves in pacing and that only this training had enabled them to complete the mile. The absence of an accurate test of stamina or aerobic capacity is said by plaintiff to have been of great significance in terms of the validity of the test because of the general consensus among the witnesses, both experts and job incumbents, that stamina rather than brute strength is the prime requirement for the job of firefighting. "Firefighting is," as one witness testified, "not a sprint event."
25,168 candidates filed applications and appeared for the written portion of Exam 3040, of which 24,758 were men and 410 were women. 506 men failed the written exam, and 21 women did so. A total of 18,148 persons who passed the written exam appeared to take the physical exam: 18,060 males and 88 females. Thus, over one-quarter (26%) of the men who passed the written exam (24,252) did not present themselves for the physical (6,192). Close to 77% of the women who took and passed the written exam (389) did not appear to take the physical exam (301).
Of the 17,004 candidates who completed the physical exam, 16,925 were men and 79 were women. 7,847, or 46%, of the men passed. None of the women did.
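The attrition and pass-rate figures recited above follow directly from the counts in the record, as the short sketch below shows. The variable names are illustrative; every number is taken from the text.

```python
# Counts stated in the opinion.
men_passed_written, women_passed_written = 24_252, 389
men_appeared, women_appeared = 18_060, 88
men_completed, women_completed = 16_925, 79
men_passed_physical, women_passed_physical = 7_847, 0

# Candidates who passed the written test but did not appear for the physical.
men_no_show = men_passed_written - men_appeared
women_no_show = women_passed_written - women_appeared

men_no_show_rate = men_no_show / men_passed_written
women_no_show_rate = women_no_show / women_passed_written

# Pass rates among those who completed the physical test.
men_pass_rate = men_passed_physical / men_completed
women_pass_rate = women_passed_physical / women_completed

print(f"men not appearing:   {men_no_show:,} ({men_no_show_rate:.1%})")
print(f"women not appearing: {women_no_show:,} ({women_no_show_rate:.1%})")
print(f"men pass rate:       {men_pass_rate:.1%}")
print(f"women pass rate:     {women_pass_rate:.1%}")
```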
Data from the United States Bureau of Census, based on the 1970 census, relating to the Standard New York Metropolitan Statistical Area and to Orange and Putnam Counties, show that 872,284 males and *205 1,014,706 females of the age group eligible for firefighters resided in the geographical area within which eligible persons are required to reside at the time of appointment. Expressed another way, 46.2% of this population was male, and 53.8% of this population was female. The total labor force between the ages of 18 and 29 residing in the identified geographic area was 655,022 males and 518,301 females, or 55.8% males and 44.2% females.
Roughly comparable data from the United States Bureau of Labor Statistics for the year 1979 indicate that in that year males constituted 55.3% and females, 44.7% of the civilian non-institutional labor force in the age group and localities from which the applicants for Exam 3040 were drawn.
An eligibility list consisting of some 8,018 names in rank order was produced from Exam 3040. As noted above, each "rank" on the list consists of a tenth of a percentage point; and, where there are several candidates in the same rank, the call-up of candidates is made by random selection, using candidates' social security numbers. As of February 19, 1982, 2,666 men have been appointed from the list. Fifty places have been reserved for members of the plaintiff class, pursuant to an agreement between counsel, with back pay, benefits, and seniority as of November 5, 1980, should this Court determine that Exam 3040 was discriminatory and that members of the plaintiff class are entitled to appointment to the Fire Department. The eligibility list expires June 21, 1982.
The Validity of Exam 3040
As noted in Guardians Association v. Civil Service Commission, 630 F.2d 79, 88 (2d Cir. 1980) (Guardians IV):
"... the accepted procedure for Title VII cases is to require the plaintiffs to establish a prima facie case, and then to require the defendants to rebut this showing with proof that the test was legitimately job-related. See Albemarle Paper Co. v. Moody, 422 U.S. 405, 95 S. Ct. 2362, 45 L. Ed. 2d 280 (1975); McDonnell Douglas Corp. v. Green, 411 U.S. 792, 93 S. Ct. 1817, 36 L. Ed. 2d 668 (1973); Griggs v. Duke Power Co., 401 U.S. 424, 91 S. Ct. 849, 28 L. Ed. 2d 158 (1971)."
A. The Prima Facie Case
In this case, as in Guardians IV, the exam at issue had a disparate impact by any reasonable measure including the standards developed by the Supreme Court in Castaneda v. Partida, 430 U.S. 482, 97 S. Ct. 1272, 51 L. Ed. 2d 498 (1977), and by the Equal Employment Opportunity Commission ("EEOC") in its Uniform Guidelines on Employee Selection Procedures, 29 C.F.R. § 1607 (1978) ("Guidelines", hereinafter cited only by the subdivision numbers of 29 C.F.R. § 1607). Under Castaneda, in cases such as this involving substantial samples, "if the difference between the expected value [from the random selection] and the observed number is greater than two or three standard deviations," a prima facie case of discriminatory impact is established. 430 U.S. at 497 n.17, 97 S. Ct. at 1281 n.17. Under the Guidelines, "[a] selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact." Section 4(D).
Here, the 0% pass rate for women is obviously less than 80% of the 46% pass rate for men, employing the test referred to in the Guidelines. The pass rates of the men and the women were separated from the results to be expected from a sex-neutral selection process by more than eight standard deviations, representing far more than the .03% likelihood of chance factors having produced the outcome found to establish a prima facie case of disparate impact in Castaneda. Comparing the zero pass rate with the available work force figures produces an even stronger case for non-chance selection *206 factors. In this case, the inference of discriminatory impact from facially neutral employment practices arises from "the inexorable zero." International Brotherhood of Teamsters v. United States, 431 U.S. 324, 342 n.23, 97 S. Ct. 1843, 1858 n.23, 52 L. Ed. 2d 396 (1977).
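Both measures can be worked through from the counts of candidates who completed the physical test. The sketch below is one conventional way of doing so and rests on assumptions the opinion does not spell out: a binomial model in which each of the 79 women would pass with the pooled pass rate of all 17,004 candidates under a sex-neutral process. The variable names are illustrative.

```python
from math import sqrt

# Counts stated in the opinion (candidates completing the physical test).
men_completed, men_passed = 16_925, 7_847
women_completed, women_passed = 79, 0

men_rate = men_passed / men_completed
women_rate = women_passed / women_completed

# Four-fifths rule (Guidelines Section 4(D)): compare the lower selection
# rate with 80% of the highest group's rate.
impact_ratio = women_rate / men_rate
adverse_impact = impact_ratio < 0.8

# Standard-deviation comparison in the manner of Castaneda.
# ASSUMPTION: binomial model using the pooled pass rate.
pooled_rate = (men_passed + women_passed) / (men_completed + women_completed)
expected_women_passing = women_completed * pooled_rate
std_dev = sqrt(women_completed * pooled_rate * (1 - pooled_rate))
deviations = (expected_women_passing - women_passed) / std_dev

print(f"impact ratio: {impact_ratio:.2f} (adverse impact: {adverse_impact})")
print(f"expected women passing: {expected_women_passing:.1f}")
print(f"standard deviation:     {std_dev:.2f}")
print(f"difference in standard deviations: {deviations:.1f}")
```

Under these assumptions roughly 36 of the 79 women would have been expected to pass, and the observed zero lies a little more than eight standard deviations from that expectation, well beyond the "two or three" mentioned in Castaneda.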
Defendants argue that statistics alone should not be sufficient to establish a prima facie case of disparate impact here in the absence of proof to rebut another available inference from the statistics, namely, that none of the women who took the test were strong enough to do the work required by the job. This argument seeks to re-define the issues and reverse the burden of proof established by the Supreme Court for Title VII cases, see Griggs v. Duke Power Co., supra; Dothard v. Rawlinson, supra, by requiring plaintiff to show that she is an exception to the general characteristics of her sex and to prove that individual members of her sex are worthy of employment. Title VII imposes neither requirement on the plaintiff.
B. Test Validation
The issue in this case, as in Guardians IV, therefore, is whether defendants have rebutted plaintiff's prima facie case by showing that Exam 3040 was job-related, that is, that the test was valid because it accurately selected those applicants who will make better firefighters. Guardians IV, supra, 630 F.2d at 88.
The "threshold task" in determining the validity or job-relatedness of the test is to select the appropriate method of assessing its validity. Id. at 91. In this case defendants consistently, until almost the conclusion of the trial, asserted that only one technique of the three specified in Guidelines §§ 5(B) and 14 was being urged, namely, content validation. Accordingly, that proposition will be considered first. The belated assertion that the test is criterion valid will be considered infra.
The proposition that the physical portion of Exam 3040 is content valid must be rejected. Defendants themselves and their outside consultants did not consider it to be such until the validity of their test was attacked. In all events "the abilities that the test attempts to measure are ... [not] the most observable abilities of significance to the particular job in question." Guardians IV, supra, 630 F.2d at 93.
That Dr. Fleishman's nine categories of physical ability constitute anything other than constructs would seem beyond debate by almost any definition, whether it be the somewhat mechanical fact/inference distinction of Guidelines § 14(C) (2) or the more realistic view of Guardians IV, which distinguishes between job content and job construct as "simply different segments along a continuum reflecting a person's capacity to perform various categories of tasks." Id. at 93. In terms of the purpose of the distinction between content and construct (namely, the method by which one is going to determine on a firm, factual foundation that one has tested for the abilities or qualities needed for the job), it is clear that Dr. Fleishman's technique can produce such factually grounded confidence only by a criterion study characteristic of construct validation. Nothing in the concepts of dynamic strength, gross body equilibrium, stamina, and the like has such a grounding in observable behavior or the way firefighters operate that one can say with confidence that a person who possesses a high degree of these abilities as opposed to others will perform well on the job.
AIR's own description of the method of validation of the physical tests it proposed *207 to develop through the PAA confirms that criterion rather than content validation is appropriate for a physical test grounded on the PAA. This description was in marked contrast to its description of the method of validation of the written exam, which was stated explicitly to be content validation. The proposal for the use of the PAA makes clear that criterion or construct validation was contemplated by AIR from the beginning of its project, to be abandoned only when the City declined to fund the contemplated criterion study because of an asserted shortage of funds. The only mention of content validation appears in a 1974 progress report on the project which, after noting the clear suitability of the PAA method of analysis for construct or criterion validation, states that the suitability of the PAA for content validation "depends" on the grounding of the PAA in a thorough task analysis. Here, the abilities analysis arose not out of an analysis of firefighting activities, but out of Dr. Fleishman's 1964 factor analysis of physical tests. While some effort was made to connect the abilities analysis to specific examples of firefighters' behavior, the effort can hardly be called thorough. Many of the concrete instances of the abstract abilities identified as important to the job are described with almost as great generality as the ability itself, e.g., "functions at fire operation." Other instances betray confusion concerning the content of the ability and create severe doubts as to whether the job analysis method was understood and correctly applied, e.g., as an instance of dynamic strength, "have a fire late in a work shift." Even where concrete instances of work behavior as evidence of a particular ability are given, there is no systematic study of what the rater has in mind or of its relative importance or criticality. See Guidelines § 14(D) (2), and cf. Guardians IV, supra, 630 F.2d at 95.
What has been said might, on first impression, seem appropriately limited only to those factorially pure tests which remained part of the test battery and to have less application with respect to the so-called "face valid" tests: the ledge walk, dummy carry, and agility tests. Yet, in fact, the same conclusion applies to these tests as well. All of the tests were conceived of and justified by AIR as tests for the factorially pure abilities identified by Fleishman in his 1964 work, not in terms of the job behaviors they mimic or appear to resemble. The multi-ability tests were explicitly referred to by AIR as "face valid," a term of art developed by the American Psychological Association, Inc., indicating that no pretense was made that these tests were content valid, that is, drawn from an assessment "of the important work behavior(s) required for successful performance and their relative importance." Guidelines § 14(C) (2). As APA Standards at p. 29 caution, while the concept of face validity may have some public relations role in selling a test to job applicants, face validity is no substitute for validation based on concrete analysis either of the job's requirements or the test's results.
Thus, while it is possible to find, scattered through AIR's job analysis, isolated references to work behaviors bearing superficial resemblance to certain of the face valid tests (although there is little or nothing about walking on building ledges, carrying people on one's shoulder or dodging and running through obstacles at top speed), the work behaviors reported are simply given as examples of the more generalized abilities which are at the center of AIR's study; they are hardly described with the care required to establish with certainty what is being referred to. Nor is there any indication of their relative importance to the job. As the evidence at trial established, a carry of the type illustrated by the dummy carry portion of Exam 3040 would be dangerous in the extreme to carrier and carried alike; ledge walking of the type illustrated in Exam 3040 would be an extraordinarily rare event as an actual job behavior; and the top speed obstacle run bears little relation to the paced conservative approach, which, all witnesses agreed, is appropriate to effective performance at a fire. Finally, even if these techniques did represent actual job behaviors, it is clear from the demonstrated *208 advantage possessed by those who were taught the special tricks by which the dummy is to be lifted and the eight-foot wall scaled, that a large part of the abilities measured represented those which applicants could be trained to acquire. As the Guidelines sensibly conclude, tricks of the trade that can be taught do not represent appropriate abilities to measure to determine whether the applicant can perform successfully on the job. Guidelines § 14(C) (1).
To say that Dr. Fleishman's abilities analysis lacks content validity is, of course, not to criticize it as an unworkable job analysis device, but only to point out what Fleishman himself recognized, namely, the need for construct or criterion validation to see if the test in fact succeeded in predicting those persons who would perform the job best. Unfortunately, not only was the original criterion validation proposed a victim of the City's fiscal crisis, but a more limited concurrent validation study in which the predictive value of the test would be studied by its administration to job incumbents was also rejected. In the absence of either a content or criterion validation of AIR's analysis, the value of it remains a moot question. Moreover, because of errors in the test preparation and scoring of the exam, it seems likely that even a belated criterion validation of Exam 3040 would be of little assistance in determining if those who did well on it would make better firefighters than those who did not. Accordingly, before discussing the subject of criterion validity, it seems appropriate at this point to turn to the subject of the test preparation and, then, to the method of scoring.
C. Preparation of the Test
The standard required for test preparation is that the test makers exercise reasonable competence, so that it can reasonably be said that the matters measured by the test are the matters identified as appropriate for measurement by the job analysis. Guardians IV, supra at 95. Here, there can be no question that AIR and Dr. Fleishman exercised reasonable competence in the design of their factorially pure tests to measure the constructs identified by the job analysis as necessary to the job. Whatever disagreement there may be within the test maker's profession as to whether Dr. Fleishman's method of test construction will ultimately prove the best one, there can be no question that it represents one reasonable approach to the subject, carried out, at least in instances which preceded and followed the actual preparation of Exam 3040, according to a high level of professional competence.
The problem which arises in this case, however, is that Dr. Fleishman and his subordinates were subjected to marked interference in the process of professional job preparation by representatives of the Fire Department and of the City's Department of Personnel. As we said in Guardians IV, supra, 630 F.2d at 96:
"Of course, the law should not be designed to subsidize specialists. But employment testing is a task of sufficient difficulty to suggest that an employer dispenses with expert assistance at his peril."
Here, the City, under Judge Weinfeld's mandate, employed experts to assist them, but then took the dangerous course of dispensing with the experts' advice. The resulting mixture of professionalism and the kind of subjective judgment of supervisors which condemned an earlier generation of employment tests, see Guardians IV, supra, 630 F.2d at 88-89, resulted in an inadequate job of test preparation.
Most notable of the defects in test preparation is the substitution of the series of so-called face valid tests for the factorially pure tests selected by Dr. Fleishman after extensive analysis and testing. As noted, the agility test and dummy test first appeared between the December 1973 pre-test and the February 1974 proposed test battery not because of any consideration having to do with job-relatedness, but apparently in an effort to accommodate the demands made on the City by other disadvantaged groups involved in the Vulcan litigation. While the motivation behind this substitution *209 was certainly laudable, the high purpose is no excuse for a marked relaxation of professional standards. In the case of the agility test, the effect of substituting it for the cable test earlier conceived of as the appropriate one for measuring the ability of gross body coordination was to introduce marked overlapping and distortion in the rank ordering of the abilities purportedly identified as necessary for the job of firefighting. While an effort at factor analysis of the agility test appears to have been made, the results it produced, namely, low correlation between the test and gross body coordination, dictated against the substitution rather than in favor of it.
The crucial eight-foot wall which entered the test battery by means of the agility test was justified not in terms of the PAA technique otherwise used to analyze the job's requirements, but rather in terms of an altogether ad hoc and incomplete inquiry into job content: whether New York City firemen ever have occasion to climb eight-foot walls in the course of the job. As noted earlier, this inquiry failed to determine the importance, frequency, or criticality of the job behavior or even to define it with sufficient precision to make it possible to judge whether the scaling techniques required were those appropriately measured by the obstacle course wall.
Similarly, the dummy carry, derived from the February 1974 height study also carried out as an outgrowth of the Vulcan litigation, introduced further distortions of ranking and overlapping of tests into the factorially pure battery devised by Fleishman. As the evidence at trial established, the carry in fact tested in Physical Exam 3040 tests for abilities different than those that are actually needed in rescue operations and, to the extent it does test for abilities needed in one form of firemen's carry, those abilities are ones learned by training rather than received as part of one's natural endowment.
Equally disquieting in terms of the professionalism of the test preparation is the elimination of AIR's balance and twist and touch tests as measuring abilities needed by ballet dancers as opposed to firefighters. Again, the failure of the professional test maker to turn the discussion from the dangerous realm of job images to the neutral realm of job requirements must be criticized both in terms of the overall goals of the test preparation and in concrete terms of test preparation. The insertion of what was deemed to be a test of extent flexibility (the window-ladder-window test) into the agility run meant that no accurate test of extent flexibility would be recorded at all unless the candidate mastered the eight-foot wall which, whatever else it did, did not measure for extent flexibility. The substitution of the ledge walk for the balance test (to be performed with eyes closed to insure that gross body equilibrium was measured rather than balance achieved through visual contacts) again compromised the factorial purity of AIR's battery. It also introduced an observable behavior into the test without any factual inquiry as to the existence and importance in firefighting of having to side step at high speed forward and back over a narrow ledge.
Of similar significance in evaluating AIR's test preparation is its "tightening" of *210 the test requirements in response not to any requirements of the job, but rather in response to the "subjective judgments of supervisors," Vulcan Society v. Civil Service Commission, supra, 490 F.2d at 396-98. While there is certainly every reason for a professional to listen to the criticism of job incumbents on the lack of relation of the test to the job's requirements, the reason for listening to the criticism is to make certain that the test in fact tests for what the job needs. Here, a beneficial result of the criticism of the one-mile run was that AIR recognized an error in its test preparation, namely, that the test would not be performed on the 31-lap track on which the norm scores had been based. However, instead of then re-doing the pre-test or re-calculating it to eliminate the time attributable to the test takers having to run more corners than would be present in the actual test, thereby assuring that the norms would be based on a sample of the applicant group as originally intended, AIR accepted from the Department of Personnel and the Fire Department a set of norms having little if any rational connection with what it was reasonable to expect either of applicants or of incumbent firemen. Thus, instead of using the norms developed on a group of probationary firemen and firemen in training, reasonably thought to approximate the class of applicants as it was then anticipated it would be, AIR used figures for the mile run developed by Thomas Cureton in studies conducted in the earlier '40's on men in physical training for war. The combined effect of the effort to tighten up the mile run requirements was to produce a scoring table for the mile run which not only could not be passed by any of the incumbent firemen who took the December 1973 tryout, but also would have been failed by half of the Cureton sample.
Similar tightening of the scoring was based on subjective criticism by Fire Department personnel and representatives of the Department of Personnel of the cut-off scores for the broad jump, the hand grip, and the dummy carry. This overall tendency to defer to the pressures of employers to get "the best men" for the job, without consideration whether such a requirement unnecessarily excludes women or other disadvantaged groups, requires a conclusion that the test preparation of Exam 3040 was wanting in professional competence precisely because it failed to take reasonable steps to exclude engrained discrimination of the sort that Title VII was designed to eliminate. Griggs v. Duke Power Co., supra at 429-30, 91 S. Ct. at 852-53.
D. Test Scoring
As already noted, the City did not use the results of the exam simply to determine who possessed the minimum qualifications necessary for the job, as it had on earlier physical examinations administered on a pass/fail basis in 1962, 1968, and 1971 (Exams 9606, 7060 and 0159). Instead, it used the results of the test (after averaging the raw scores on both the written exam and the physical) to compile the rank-ordering of all applicants on a scale of 1,000 (100.0%) with a cut-off score fixed at 700 (70.0%). The use of a percentage scale with ranks calculated to the nearest tenth of a percent and the use of the 70th percentile as the passing score appears to have been decided upon simply because such a rank and cutoff method was customary in competitive civil service testing and, indeed, had been the practice on earlier competitive firefighters' exams.
As was stated in Guardians IV, supra, 630 F.2d at 100:
"The Guidelines provide that rank-ordering should be used only if it can be shown that `a higher score ... is likely to result in better job performance.' Guidelines § 14(C) (9). This requirement is reasonable and consistent with Title VII's provision that the `results' of a test may not be `used to discriminate.' 42 U.S.C. § 2000e-2(h). If test scores do not vary directly with job performance, ranking *211 the candidates on the basis of their scores will not select better employees .... The frequency with which such one-point differentials are used for important decisions in our society, both in academic assessment and civil service employment, should not obscure their equally frequent lack of demonstrated significance."
At first glance it might seem that, since no women passed the physical exam, the only subject that need concern us is the validity of the cut-off score. Realistically, however, since over 17,000 persons took the physical exam and since fewer than 3,000 persons have been appointed to date from the all-male eligibility list on which over 8,000 names appeared, the effective cut-off score is as a practical matter considerably above the 70th percentile. Accordingly, it is the strict rank ordering of Exam 3040, pursuant to which every one of the thousands of successful candidates in the 299 ranks between 100.0 and 70.1 must be exhausted before the candidates in the first passing rank are called, that must be considered to determine whether it is predictive of job performance.
Based on the evidence introduced at the hearing, I conclude that here, as in Guardians IV, supra, 630 F.2d at 101, "the defects ... in the job analysis and the test construction are substantial enough to preclude an inference that passing scores will correlate with job performance closely enough to justify rank-ordered selections" of the type which resulted from Exam 3040.
First, as has been noted by the EEOC, it is "easier" to make the inference between higher scores and better job performance, "[t]he more closely and completely the selection procedure approximates the important work behaviors." EEOC, Uniform Employee Selection Guidelines: Interpretation and Clarification (Questions and Answers) Q. 62 (1979). Here, the lack of systematic study of the relationship between work behaviors observed in the PAA and PDA and the abilities tested for in the test battery makes it apparent that the fine gradations of the rank order employed are hardly justified by the job analysis. Moreover, an examination of the PAA and the PDA instruments themselves makes clear that for the most part all that AIR hoped to establish were gross distinctions in the degrees of abilities and physical demands, respectively. For example, in the PDA, ratings for balancing are in terms of "no importance," "important," "high importance"; weight-lifting ability is analyzed in terms of five categories of weight. In the PAA all abilities are rated on a scale of 1 to 7. While analyses of this sort might be used to justify a limited series of grades (within which the candidates would be selected at random), they cannot be used to justify the extraordinary pretense at precision implicit in the ranking established by Exam 3040.
In terms of test preparation, the compromise in the weighting of the subtests, resulting from the substitution of the multiple-ability tests for the tests selected originally by AIR and the "tightening" of the test scores in response to subjective judgments of the employer that firefighters ought to be able to run the mile faster, jump farther, and have a firmer hand grip than the experts concluded, likewise makes it impossible to say with any reasonable measure of confidence that the rank order resulting from the test is an accurate predictor of job performance.
The City responds to the attack on its rank ordering not with any justification drawn from its job analysis or based on its test preparation, but rather by an assertion that every increment in the abilities tested for in its physical exam necessarily represents a better performance as a firefighter. This conclusion was presented as "obvious" by two witnesses, one an industrial psychologist, the other a physiologist who was also *212 a volunteer fireman, both of whom testified that it is in the "nature" of firefighting that the stronger the firefighter the better. This proposition was, however, hotly contested by plaintiff's witnesses, including an industrial psychologist with extensive acquaintance with firefighting and an exercise physiologist, among others, who pointed out that few jobs making large physical demands, least of all firefighting, are properly performed at maximum speed or at the limits of one's strength or endurance. As these witnesses testified, what must be identified are not those who are strongest or fastest but, instead, those who, with the benefit of training in pacing or because of their native capacities of endurance, can perform the punishing tasks of firefighting as they are actually required to be performed. According to these witnesses, firefighting takes its toll, not as a result of failures of maximum strength or speed, even at critical moments, but rather through the physical demands extending over long periods of time which necessitate paced performance at less-than-maximum levels. This explanation makes sense of a number of otherwise puzzling features of firefighting, namely, the ability of firemen to continue to do their work competently over an entire working career and the results of physical testing of incumbent male firemen which show that on a variety of measures of maximum physical capacity firemen rate no better than the average American male. 
It also makes sense in terms of the recognized dangers of firefighting and their unpredictability which, as numerous witnesses testified, make hazardous in the extreme performance at top speed or at the limits of strength or capacity. Not only does maximal performance compromise the firefighter's ability to pace performance over the long periods of time during which physical demands are being made on the body, in addition, the very unpredictability of fire and the instability of burning structures call upon qualities of foresight, endurance, and pacing not examined by tests of maximum physical strength.
In all events, even the industrial psychologist who testified that every increment of strength predicted a better fireman did not claim that his own test instrument could grade firefighters beyond the degrees of ability measured in a scale of excellent, good, average, fair, and poor (P.X. 251, Table 5). In short, neither the job analysis instrument, the test instrument, nor the validation instruments appear able to perform their tasks with the precision necessary to justify the rank ordering used here.
This conclusion does not necessarily mean that a future exam must be administered on a pass/fail basis. It does suggest, however, that some larger use of random selection within fewer, more rationally grounded ranks will have to be substituted for the present system, unless finer test instruments can be found.
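By way of illustration only, the alternative described above, random selection within a small number of ranks, might be sketched as follows. The band boundaries, the 0-to-100 score scale, the candidate labels, and the function name `banded_order` are all hypothetical; nothing in the record prescribes them.

```python
import random


def banded_order(scores, band_edges, seed=None):
    """Order candidates by band (highest band first), at random within a band.

    scores: dict mapping a candidate label to a score (hypothetical scale).
    band_edges: descending lower bounds of the bands, e.g. [90, 80, 70].
        A candidate is placed in the first band whose lower bound the score
        meets; scores below the last bound are treated as failing and omitted.
    seed: optional seed so that the random draw can be reproduced.
    """
    rng = random.Random(seed)
    bands = [[] for _ in band_edges]
    for candidate, score in scores.items():
        for i, lower in enumerate(band_edges):
            if score >= lower:
                bands[i].append(candidate)
                break
    ordered = []
    for band in bands:
        band.sort()        # fixed starting order before the shuffle
        rng.shuffle(band)  # candidates within a band are treated as equal
        ordered.extend(band)
    return ordered


# Hypothetical candidates and scores, for illustration only.
example_scores = {"A": 95.3, "B": 91.0, "C": 84.9, "D": 70.0, "E": 69.9}
example_list = banded_order(example_scores, [90, 80, 70], seed=1)
```

Under this sketch a difference of a few tenths of a point matters only when it crosses a band boundary; within a band, order of appointment is left to chance rather than to distinctions the test instruments cannot support.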
E. The Effort to Criterion Validate Exam 3040
There remains for consideration whether Exam 3040 can be criterion validated by a study of firefighters' task performances conducted at the University of Maryland for the National Fire Data Center subsequent to the administration of Exam 3040.
The City's "borrowing" of this study of a group of firefighters from the Washington, *213 D. C. area was offered late in the trial in an effort to obtain predictive validation for Exam 3040 pursuant to Guidelines § 7 which provides:
A. Validity studies not conducted by the user. Users may, under certain circumstances, support the use of selection procedures by validity studies conducted by other users or conducted by test publishers or distributors and described in test manuals. While publishers of selection procedures have a professional obligation to provide evidence of validity which meets generally accepted professional standards (see section 5C above), users are cautioned that they are responsible for compliance with these guidelines. Accordingly, users seeking to obtain selection procedures from publishers and distributors should be careful to determine that, in the event the user becomes subject to the validity requirements of these guidelines, the necessary information to support validity has been determined and will be made available to the user.
B. Use of criterion-related validity evidence from other sources. Criterion-related validity studies conducted by one test user, or described in test manuals and the professional literature, will be considered acceptable for use by another user when the following requirements are met:
(1) Validity evidence. Evidence from the available studies meeting the standards of section 14B below clearly demonstrates that the selection procedure is valid;
(2) Job similarity. The incumbents in the user's job and the incumbents in the job or group of jobs on which the validity study was conducted perform substantially the same major work behaviors, as shown by appropriate job analyses both on the job or group of jobs on which the validity study was performed and on the job for which the selection procedure is to be used; and
(3) Fairness evidence. The studies include a study of test fairness for each race, sex, and ethnic group which constitutes a significant factor in the borrowing user's relevant labor market for the job or jobs in question. If the studies under consideration satisfy (1) and (2) above but do not contain an investigation of test fairness, and it is not technically feasible for the borrowing user to conduct an internal study of test fairness, the borrowing user may utilize the study until studies conducted elsewhere meeting the requirements of these guidelines show test unfairness, or until such time as it becomes technically feasible to conduct an internal study of test fairness and the results of that study can be acted upon. Users obtaining selection procedures from publishers should consider, as one factor in the decision to purchase a particular selection procedure, the availability of evidence concerning test fairness.
The first matter which bears noting is that defendants have not offered any evidence of the "test fairness" of Exam 3040 for members of the female sex as required by Guideline § 7B(3), supra. Moreover, the only evidence offered by defendants to support an argument that it is not technically feasible for New York City to conduct such a study of the fairness of the test is that New York City has at present no female firefighters whose performance on the job could be measured against their performance on Exam 3040. The answer to this argument is that the criterion used to measure the accuracy of the predictors in the Maryland study was not performance in actual firefighting tasks by job incumbents. Instead, the criterion was performance on a series of so-called job sample tasks, requiring neither instruction nor training. New York City's lack of female job incumbents was, in other words, no barrier at all to the performance of the series of Maryland criteria tasks by a representative sample of male and female firefighter applicants in order to determine the fairness of applying those criterion tasks to measure the accuracy of Exam 3040 as a predictor of job performance. However, quite apart from that fundamental objection, the University *214 of Maryland study does not purport to validate the tests administered in Exam 3040 or anything like them.
What the Maryland study does is suggest both a laboratory and a field test to predict performance ability of firefighters. The laboratory test consists of a determination of percent of body fat, lean body weight, maximal heart rate, score on a treadmill test, grip strength, performance of sit-ups and long jump, and submaximal oxygen pulse. Of this list, only the long jump and the grip strength tests bear any resemblance to the subtests administered as part of Exam 3040, and in fact both of these differ in either their method of administration or scoring from the tests administered in New York. The field test consists of a determination of percent fat, lean weight, score on a step test, push-ups, sit-ups, and grip strength. In this group, only grip strength bears resemblance to a subtest administered as part of the New York City exam.
Defendants do not, in fact, suggest that Exam 3040 is comparable to either of the two recommended Maryland tests. Instead, by picking and choosing from among the subtests actually recommended in Maryland, from among a number of other predictor tests considered and rejected in the Maryland study, and, finally, from among the performance measures or tasks used in Maryland in an effort to validate the predictor tests recommended, defendants came up with a group of five "matching" tests, the validation of which they claim validates Exam 3040. Even this extraordinary selection process, in other words, could not secure a match for all seven of the subtests employed in Exam 3040; and, in fact, the five performance measures selected to match the subtests of Exam 3040 do so only approximately.
In addition to the Maryland long jump and hand grip, the defendants chose two predictor measures which were considered, but not recommended, as part of the Maryland test batteries, namely, a balance beam test and a 12-minute run, as matching the New York City balance test and mile run, respectively. Further, a measure used in the Maryland study in an effort to validate the Maryland tests, namely, a dummy drag or carry, was chosen by the defendants as a match for the New York City dummy carry which was in New York, of course, a predictive measure. The distinctions between each of these three tests and the New York City tests are considerable.
The 12-minute run used in Maryland measures the distance run by a candidate in 12 minutes and was almost universally acknowledged by the experts at trial to be superior to the mile run as a test of stamina since it assists the runner in finding his or her own pace, thereby more accurately measuring aerobic, as opposed to anaerobic capacity. The balance test employed in Maryland, in contrast to the New York City ledge walk, involved walking forward over a balance beam with no time limit or emphasis on speed. Finally, the Maryland dummy drag consisted of dragging a dummy down five flights of stairs, as opposed to the test in New York involving a carry of the dummy first up and then down a flight of stairs. Clearly, the substantial differences between these three tests and their *215 alleged counterparts in New York preclude validation of the New York City tests by such validation as they were given in Maryland. Since all of the criterion validation measures offered by defendants as established by the Maryland study include at least one of these markedly dissimilar tests on the predictor side, it can hardly be said, laying all its other faults aside, that defendants' efforts to borrow the Maryland study to validate the New York test have, to any degree, succeeded.
However, there are even more substantial problems with defendants' criterion validation effort. The method by which these Maryland tests were validated was by a study of the correlations between the results of the performance of these tests by a group of firefighters from the Washington, D. C. area and the scores they obtained in performing five tasks said to replicate actual firefighting behavior. These tasks involved (1) the vertical raising of a 35-foot ladder; (2) the carrying of 73 lbs. of standpipe hose up five flights of stairs; (3) the pulling of 52 lbs. of hose to the fifth floor window by means of a line; (4) the dragging of a 117-lb. dummy down five flights of stairs; and (5) striking a railroad tie with 30 swings of an eight-lb. sledge hammer.
In the Maryland study the researchers measured the subjects' heart rate and time to complete these five tasks in order to determine which subjects had performed the work fastest with the greatest reserve ability to continue working. The ability of the selected tests to predict high performance on the criterion tasks was seen as validating the predictors. In borrowing the Maryland study for validation purposes, however, defendants inexplicably eliminated the average heart rate as a factor in rating task performance and sought to validate the tests said to be comparable to some of those used in New York City based solely on the speed with which the subjects completed the five criterion tasks. The faster one accomplished the five tasks, all to be performed seriatim as expeditiously as possible, the more successful one was considered as a fireman for the purpose of validating the New York City predictive tests.
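The kind of correlation on which such a criterion study rests can be sketched, under stated assumptions, as a Pearson coefficient between scores on a predictor test and a measure of performance on the criterion tasks. The function name `pearson_r` and every figure below are hypothetical and are not drawn from the Maryland study or from the record.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return sxy / sqrt(sxx * syy)


# Invented figures, for illustration only.
grip_scores = [40, 45, 50, 55, 60]             # predictor: hypothetical grip test
task_minutes = [9.1, 8.4, 8.6, 7.2, 7.0]       # criterion: time on the tasks
r_time_only = pearson_r(grip_scores, task_minutes)
```

A coefficient computed this way says only how well the predictor tracks whatever criterion is placed on the other side of the calculation. If the criterion is time alone, a strong correlation shows that the test predicts speed, and nothing more.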
The problem with this method of analysis is clear. Even if one could agree based on an adequate job analysis that the five job sample tasks simulate actual job behaviors, there is here obviously an assumption about the appropriate way to measure optimum performance of the tasks which has no basis in the evidence or common sense, that is, that the most successful firefighter will be the one who performs these five tasks in the shortest time. Moreover, when one realizes that the characteristic manner in which the dummy test was performed in the Maryland test was to drag the dummy down five flights of stairs at top speed, one begins to question whether the job tasks in fact do simulate actual job behavior. Moreover, giving speed the emphasis it received in the Maryland study, even taking into account heart rate, appears highly questionable. This seems obvious, once it is realized that the mean time to complete the five simulated tasks was 7.03 minutes, during which the firefighters' average heart rate was measured at a value representing 97 percent of the maximum heart rate. This heart rate was taken by the Maryland researchers to suggest that firefighters, while performing their tasks, work at their maximum aerobic capacity. However, as the study itself paradoxically noted as a result of comparing these demands with the general fitness level of the firefighters studied, "it can be concluded that the aerobic capacity of the firefighter is not, on the average, adequate to complete typical firefighting tasks at the pace observed in this study." The study then concluded that the firemen studied were unfit for their jobs. If the firemen used to validate the tests were unfit for the job, it can well be asked how their performance could be used to validate tests administered in other jurisdictions. 
A *216 much more obvious explanation for the apparent paradox is that the pace at which the tasks were performed bore no relation to reality and that the validity study does not pass the technical standard for validity studies in Guidelines § 14B(3) that "[w]hatever criteria are used should represent important or critical work behavior(s)." In all events, nothing in the Maryland study shows that performance of the five tasks listed above under actual fighting conditions, assuming them to be important or critical work behaviors, is to be rated simply according to the speed with which they can be performed seriatim. Accordingly, I can find no basis in the Maryland study for determining that Exam 3040 is criterion valid, that is, that it accurately predicts the abilities of the members of the plaintiff class who took the physical exam to be New York City firefighters.
In considering the relief to which plaintiff is entitled, it is appropriate "to distinguish between those aspects ... designed to assure compliance with Title VII and those aspects that provide affirmative relief as a remedy for past discrimination." Guardians IV, supra, 630 F.2d at 108. Turning first to the relief necessary to assure that Title VII's requirements will be complied with, "[c]ompliance involves restricting the use of an invalid exam, specifying procedures and standards for a new valid selection procedure, and authorizing interim hiring that does not have a disparate impact." Id.
In order to insure that an invalid exam is not improperly used to discriminate in the future, it appears appropriate in this case, as it was in Vulcan Society v. Civil Service Commission, supra, 360 F. Supp. at 1278, to enjoin defendants "from making any further appointments based upon its results, without prejudice, however, to any application by the parties, upon a showing of compelling necessity, for interim relief which would permit appointments from the eligible list upon an equitable basis pending the prompt promulgation and administration of a new examination free from ... taint." As Judge Weinfeld noted in the Vulcan Society case, "to freeze all appointments [to the Fire Department] may present a hazardous situation to the citizens of the community."
It is also appropriate, however, to direct defendants to commence forthwith preparation of new and valid selection procedures, validated in general accord with the Guidelines and the APA Standards and which have the least adverse impact on women. Id.
With respect to interim hiring provisions, such as are sought here,
"[s]ince interim hiring provisions, where needed to satisfy immediate personnel requirements, are to be used prior to development and approval of a valid selection procedure, such provisions cannot meet Title VII standards by demonstrated job relatedness. Therefore, one appropriate way to assure Title VII compliance on an interim basis is to avoid disparate ... impact. This means selecting from among adequately qualified applicants either on a random basis, see, e.g., Association Against Discrimination, supra, 594 F.2d at 313 n.19 [2nd Cir.], or according to some appropriately noncompensatory ratio, see, e.g., Kirkland, supra, 520 F.2d at 429-30 [2nd Cir.]; Vulcan Society, supra, 490 F.2d at 398-99, normally reflecting the minority ratio of the applicant pool or the relevant work force." Id. at 109.
In this case since it cannot be said that Exam 3040 was sufficient to determine which among the plaintiff class were "adequately qualified," it will first be necessary to determine which, if any, of the plaintiff class are in that category, either by a procedure to be agreed upon by the parties or, in the absence of agreement, by one established by the Court.
With regard to the number of positions which will be offered to qualified women applicants determined by this means, the percentage of women in "the relevant work *217 force" does not, as already noted, appear an adequate measure of non-discriminatory hiring. As defendants point out, it is unreasonable to suppose that the relevant work force can be determined without further evidence concerning the physical abilities of those women in the work force to do the physical work required of firefighters. Instead, it appears more appropriate in this case to resort to the actual "applicant pool," id., to determine the appropriate measure of non-discriminatory hiring, since the applicant pool represents those women in the work force who, like their male counterparts, believed themselves to be qualified for the job.
At the same time, care must be taken to avoid giving continued force to the deterrent effect of the City's discriminatory physical test as a result of the publicity and familiarization program. While it is difficult to calculate with any precision the deterrent effect of the physical exam on those who took and passed the written exam, one reasonable measure results from comparing the approximately 77% fall off in interest among women who took and passed the written exam with the approximately 26% fall off in interest among men in the same category. Assuming that the sex-neutral exam would have been followed by an equal decline (26%) in interest between both groups, one would have expected a group of approximately 288 women to have presented itself to take the physical exam. Finally, it seems reasonable to assume that, given a sex-neutral exam and sex-neutral call-up rates, the ratio of persons called up to total applicants would be the same for women and men. Of the total of 16,925 men presenting themselves to take the physical exam, 2,666, or approximately 16%, have been called up to date. Applying the same percentage figure to the 288 women who one would have expected to have taken the physical exam but for its deterrent effect results in a figure of close to 45 women who one would expect to have been called up under a sex-neutral exam. This figure compares favorably with the figure of 50 agreed upon by the parties as reflecting the number of places appropriate to reserve for women, should the exam be found to be discriminatory. In the absence of some other proposal, the most reasonable, non-discriminatory measure to provide for interim hiring appears to be 45 places.
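The arithmetic behind the 45-place figure can be restated as a short calculation. What follows is only an illustrative sketch in Python using the figures stated above; the function and variable names are hypothetical, and the figure of 288 women is taken as given from the text rather than re-derived, since the underlying count of women who passed the written exam is not stated here.

```python
# Illustrative sketch only: restates the call-up arithmetic described above.
# All names are hypothetical; the numbers are those given in the text.

MEN_TOOK_PHYSICAL = 16_925         # men presenting themselves for the physical exam
MEN_CALLED_UP = 2_666              # men called up to date
EXPECTED_WOMEN_AT_PHYSICAL = 288   # estimate in the text, assuming an equal 26% fall-off


def call_up_rate(called_up: int, took_physical: int) -> float:
    """Ratio of persons called up to persons who presented for the physical."""
    return called_up / took_physical


def expected_call_ups(pool_size: int, rate: float) -> float:
    """Expected call-ups if the same rate applied to a pool of the given size."""
    return pool_size * rate


rate = call_up_rate(MEN_CALLED_UP, MEN_TOOK_PHYSICAL)
women_called_up = expected_call_ups(EXPECTED_WOMEN_AT_PHYSICAL, rate)

print(f"Male call-up rate: {rate:.1%}")
print(f"Expected women called up at the same rate: {women_called_up:.1f}")
```

Applying the unrounded male call-up rate to the 288-woman pool yields a figure slightly above 45, consistent with the "close to 45" stated above; applying the rounded 16% instead would yield a figure nearer 46.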
Plaintiff recognizes, however, given the small size of the plaintiff class to begin with, the deterrent effect of the exam and the delay which has occurred as a result of the preparation of this case for trial, that it may well be that there are not available 45 women in the class as originally defined who continue to be interested in the position of firefighter. Nevertheless, plaintiff argues that she is entitled to a decree directing that defendants affirmatively recruit and appoint women for these positions either because defendants' discrimination was intentional or because there has been demonstrated here a pattern of significant and long-standing discrimination warranting such relief.
I find no basis in the record for a determination that defendants' discrimination against women has been intentional. Nor do I find that the evidence is sufficient to warrant a conclusion that there has been "a demonstrated pattern of significant prior discrimination." Guardians IV, supra, 630 F.2d at 112.
With regard to intentional discrimination, apart from a few isolated remarks demeaning to women by referring to them as a group in circumstances in which it was appropriate to deal with them as individuals, the only evidence relied on is the continued use by defendants of the eligibility list established pursuant to Exam 0159 which was closed to women after 1972 when Title VII was amended to include municipal employers. However, as noted in Guardians IV, supra at 112, in rejecting a similar argument:
"[p]ersistent use of an exam with disparate ... effects would support an inference of intentional discrimination if proper test construction were not even attempted." (Emphasis added)
Here, as in Guardians IV, proper test construction was at least attempted: "The *218 City's ... officials made extensive efforts to understand and apply the Guidelines and develop a test they hoped would have the requisite validity." Here, unlike Guardians IV, at least one warning concerning the adverse impact of the test on women was, in fact, heeded and led to the substitution of a test conceded to have a less adverse effect on women. That defendants proceeded in what Guardians IV referred to as "a naive self-confidence" that this was the only aspect of the exam which adversely affected women is not a sufficient basis for a finding of intentional wrong-doing.
Nor is the evidence sufficient to establish a demonstrated pattern of significant prior discrimination. To be sure, of the six examinations administered between 1960 and 1978, which are responsible for the present composition of the fire department, none was open to women. No women are firefighters at present. Moreover, women constitute 44% of the labor force between the ages of 18 and 29 in the relevant geographic area. However, as indicated above, there is at present insufficient evidence from which to conclude that a substantial part of the total female work force possesses the requisite qualifications necessary to perform the job of firefighter. In the absence of proof that any substantial part of the female work force possesses the requisite qualifications for the job, there has been no showing of the kind of "flagrant disparity shown in prior cases where long-term hiring quotas were in issue." Guardians IV, supra, 630 F.2d at 113.
Accordingly, defendants are hereby enjoined from further use of the eligibility list compiled pursuant to Exam 3040, except upon a showing of compelling necessity. Notices of appointment shall be sent to up to 45 of those women who in 1977 and 1978 applied to become firefighters and who are found to be qualified for appointment and willing to be appointed. In the event that more than 45 are found to be qualified, then notices of appointment shall be sent to 45 of those qualifying, selected by lot. Qualification shall be determined pursuant to procedures agreed upon by the parties. In the event the parties are unable expeditiously to agree upon such procedures, then the procedures shall be, on notice to the parties, fixed by the Court.
The Clerk is directed to notify the attorneys for the parties and intervenors of the entry of this Memorandum Decision and to mail a copy to each.
Settle Order on Notice.
NOTES
 What the fiscal crisis prevented was the hiring of the firemen, parking enforcement agents, and sanitationmen who were to be the subjects of the validation study. Since, as discussed infra, the AIR study contemplated a criterion-based validation study to find out whether the test scores accurately predicted actual job performance, the failure by the City to do any hiring is said to have made such a criterion study impossible. The fiscal crisis does not, however, entirely explain why a concurrent validation study not involving new hiring, as described in an AIR letter to the City of September 22, 1976 (Plaintiff's Exhibit [P.X.] 141), could not have been performed with IPA money. What this letter also demonstrates is that no validation of the physical portion of the exam of any sort had been done prior to the time the letter was composed.
 In addition to the PAA, AIR analyzed all three jobs, that is, fireman, sanitationman, and parking enforcement agent, for purposes of both the written and physical exam projects by means of the following methods: job inventory, critical incidents analysis, position analysis questionnaire, and physical demands analysis. AIR personnel also met and conferred with Fire Department officials and observed the probationary and training schools and firefighting operations on a number of more informal occasions.
Since defendants now seek to rely on selections from some or all of these job analysis techniques, other than the PAA, to validate Exam 3040 and since there is in each of them some reference to the kind of observable behavior and ratings of importance which are missing from the PAA, it is important to understand what role these other analysis techniques actually played in the preparation of the physical test portion of Exam 3040. In general, as stated by AIR in its Final Report:
"A variety of job analysis procedures were reviewed in order to chose [sic] techniques appropriate for the present effort. Many of the standardized techniques such as the Job Inventory, Abilities Analysis, Functional Job Analysis, and the Position Analysis Questionnaire often include cognitive, perceptual, and psychomotor dimensions in addition to physical proficiency ones and are designed to assess requirements for total job performance. Since the focus of this project involves the identification of physical job requirements exclusively, total job analyses were evaluated in terms of how detailed and comprehensive the physical abilities portion of these instruments were [sic]. The physical portion of Abilities Analysis represents one detailed categorization of the physical performance domain empirically derived from extensive factor analytic and correlational analyses (see Fleishman, 1964). One other technique reviewed, called Physical Demands Analysis, concentrates particularly on physical attributes of job performance.
"Rather than pre-judge the relative effectiveness of describing physical requirements of jobs by either physical abilities or physical demands both techniques were utilized and later evaluated in terms of the results which they produced."
Technical Report No. 3 (P.X. 87), prepared by AIR, confirms that the job analysis methodology ultimately used with respect to the physical requirements of the fireman's job involved "the utilization of two job assessment techniques": the PAA and the PDA. Between these two job analysis techniques, the PAA was ultimately selected for the reason that, relying on Fleishman's work, it already had associated with each ability identified an appropriate test, thus theoretically facilitating the task of test preparation. According to AIR: "Data derived from Physical Demands Analysis help to determine the exact nature of the tests to be used which are diagnostic of the abilities selected."
Despite this clear evidence that AIR itself intended to rely exclusively on the PAA and the PDA to construct the physical portion of Exam 3040, defendants sought at trial to use evidence of observable behaviors reported as part of at least three other job analysis techniques to demonstrate the content validity of the physical exam, namely, the job inventory, the critical incidents analysis, and the position analysis questionnaire. Accordingly, it is appropriate to say a word about AIR's use of these three techniques.
The job inventory involved by far the most far-reaching inquiry made by AIR of all of its job analysis efforts. The effort started with the creation of a checklist of 269 work activities involving both physical and mental abilities derived from training manuals and meetings with Fire Department officials. These were then divided into eight categories, e.g., public education and community relations, inspection, investigation and enforcement, miscellaneous firehouse activities, firefighting, and other emergency and support operations. Each activity and each category of activity was then rated in terms of its difficulty, the time spent on it, and its importance, first by 20 job incumbents, consisting of ten firemen and ten officers, and then by some 600 firemen who completed the job inventory questionnaire. The result achieved from this inventory was the general conclusion that "firefighting and other emergency and support operations" rated first in terms of job importance, difficulty, and time consumed. The ratings given for the top quartile of the 269 job activities were found to confirm this general conclusion. Unfortunately, the 90 job activities rated most difficult and most important (e.g., in order of importance, "drive apparatus to and from scene, operate apparatus at scene, select appropriate tool or piece of equipment, make and unmake couplings and connections, estimate lengths of hose needed, stretch hose vertically, operate low pressure hydrant," etc.) were not analyzed by AIR either at the time of the test's preparation or at trial in terms of physical abilities involved in the activity or in terms of the tests administered as part of Exam 3040.
In place of such an analysis, defendants have simply asserted on the basis of quite superficial resemblances between a few of the many job activities listed in the job inventory and certain of the tests actually administered as part of Exam 3040 that the tests were content validated by the job inventory (e.g., the job activity described as "perform search operations" is said to be "connected" to the window-ladder-window component of the agility test, Defendants' Interim Post Trial Brief at p. 29). In sum, defendants appear to be engaged in ex post facto rationalization of tests selected for quite different purposes, drawing on aspects of AIR's job analysis which were never intended to be put to such use.
The so-called critical incidents analysis involved representatives of AIR sitting down with the two groups of firemen and officers, the same groups as those used in the job inventory, "to discuss examples of very effective and ineffective job performance and attributes believed to be important for job performance." This process resulted in two lists describing in narrative form "examples provided of very effective and ineffective performance by individual firemen," and "attributes believed to be important for the job of fireman." Among the 32 "attributes" listed as "important" (along with "need initiative and imagination," "need to administer first aid," "need to make decisions and follow through," and "know the use of tools") are "strength" (including "dynamic strength" and "explosive strength"), "stamina," "balance," "coordination," "extent flexibility," "dynamic flexibility," and "speed of limb movement." Under each of these general categories of "attributes" (which appear to have been drawn directly from the language of the PAA rather than from a job study), a few examples are given, e.g., "an example of explosive strength is when the man must raise a ladder." However, here, as with the "comments" section of the PAA, there is a lack of any systematic effort to determine whether the respondents understood the abilities being discussed in the same sense in which they were used in the PAA. Moreover, there is a failure to assess the relative importance of the specific job behaviors reported vis à vis the overall job and vis à vis each other.
The results of the Position Analysis Questionnaire ("PAQ"), which are at a level of generality equal to that of the PAA, in fact raise serious questions about the reliability of the PAA results. The ordering of importance of the physical attributes of stamina, explosive strength, dynamic strength, and static strength based on the results of the PAQ appears strikingly different from the ordering accorded those attributes based on the results of the PAA (D.X. GG p. 31). Moreover, in those few instances in which defendants claim that the results of the PAQ confirm the results of the PAA, it is to be noted that the definitions of the matters being inquired about differ from one analysis instrument to the other and are, in fact, measured on a different scale. (Cf. D.X. BB, p. 28 with D.X. GG, p. 27.)
 As stated in AIR's final report: "An examination of the ratings of abilities for special tasks and narrative descriptions revealed that these data were entirely too fragmentary and divergent to be of any value here."
 Given the bounded nature of the scale and the small size of the sample used, it is questionable whether the differences between the ratings reach the level of statistical significance. This becomes even more apparent when one considers the results of ratings for the abilities used in special tasks:
                            Mean Score on
Ability                     Scale of 7
Stamina                     5.86
Extent Flexibility          5.82
Dynamic Strength            5.81
Explosive Strength          5.67
Gross Body Equilibrium      5.64
Gross Body Coordination     5.54
Static Strength             5.50
Dynamic Flexibility         5.43
Speed of Limb Movement      5.29
AIR itself noted this deficiency in its final report: "It should be noted ... that the differences between mean ratings for adjacently ranked abilities are very small whether considered within or across groups of ranked abilities."
 It should be noted that not all of the abilities inquired about appear, from the examples given in the comments, to have generated the same degree of confusion. The examples given for activities requiring stamina in particular appear to be in accord with the definition given in the PAA, which is in turn a definition which accords well with our common, everyday understanding of the term.
 This high rating for standing, walking, running, and crawling was properly considered by AIR to confirm the high mean rating for stamina given in the PAA.
 Dr. Fleishman testified at trial that a group of 9 abilities identified by him, called psychomotor abilities, exists between the physical and mental abilities and can be tested for by tests which bear comparison at times with physical tests requiring special apparatus and at other times with the paper-and-pencil tests used to test mental abilities. The psychomotor test here discussed was a paper-and-pencil test.
 Although a total of 100 men took the tryout test, only 99 results were used, because one trainee had a bad cough and was dropped.
 Examination 0159 had a five-part physical portion consisting of an agility test, a dumbbell lift (pectoral), a dumbbell lift (strength), a dumbbell lift (supine), and a broad jump. The exam was a qualifying exam, administered on a pass/fail basis, unlike Exam 3040.
 The basis on which these tasks and tests were determined to be height- and job-related so as to be included in the height study is not at all clear. Nor, with the exception of the hand grip test for static strength, does it appear that the tasks selected to be performed bore any relationship to AIR's job analysis. What is significant about the height study is that it appears to be the first instance of a problem which was eventually to compromise severely the rigor of AIR's test preparation, namely, a tendency on the part of AIR to defer in an entirely unsystematic way to the say-so of individual job incumbents when confronted with a claim that a particular test, derived from a source other than Fleishman's prior work, was job-related. As one example, the job task described in the height-related study as a victim rescue, accomplished by slinging a football blocking type dummy over one's shoulder and running up (and down) a flight of stairs, appears, on the basis of the testimony of fireman witnesses at trial, to be a highly uncharacteristic fireman's carry, if used at all.
 One likely explanation offered for the inclusion of the agility test is that the two groups identified as adversely impacted in the Vulcan litigation, namely, blacks and Hispanics, were perceived to do better on the agility test than whites. If so, inclusion of this test represented a step in the direction of decreasing the adverse impact of the physical test on blacks and Hispanics. Such a laudable purpose (while clearly having a bearing on the City's good faith) cannot excuse the lack of a rational connection between the test and the job analysis. Nor, indeed, would the addition of such a test, simply to reduce adverse impact on one class of employees, appear to comply with Judge Weinfeld's direction in the Vulcan litigation that the preparation of a replacement for the physical portion of Exam 0159 proceed in accordance with professionally accepted methods of test preparation.
 One woman who scaled the eight-foot wall not once, but twice, testified that she was able to do so by reason of a simple technique taught her in a private training school she attended, not apparently involving any of the abilities purportedly measured, namely, running up the wall with rubber-soled shoes before attempting to grasp the top rather than, as most women did, standing before the wall and jumping up.
 The only evidence of such a characteristic of New York architecture introduced at trial was a Factory Exit Rule adopted by the New York City Board of Standards and Appeals in 1947 implementing the state's Labor Law by requiring special safety exits to be provided for factory buildings, built before 1914 and then still in use, of five stories or less in height, where the distance between the roof of the building and the roof of the next building exceeds eight feet. Whether or not this provision can be taken as establishing the existence of the eight-foot differential as a common characteristic of factory buildings of this height and vintage at the time of enactment of the Code provision, the testimony at trial indicates that there are at present exceedingly few buildings in this category. The witness who testified for the City on this subject, a fireman of over two decades' experience, related that on the only occasion he has had to climb an eight-foot wall from one building to another he used the scuttle cover from a roof opening which he leaned against the side of the wall as a step. Other after-the-fact rationalizations were offered at trial for the eight-foot wall, such as the need to leap into the air to grab the lower rung of fire escape ladders said to hang on occasion eight feet or more above the street, an ability not tested by the agility test actually used, since the technique taught and permitted to scale the wall in the test involved not leaping in place, but running up the wall as far as one could before grabbing its top. Still another fireman testified at trial to an occasional need to climb to the top of an enclosed exit to the roof of buildings to break through a skylight and create ventilation. However, as noted, the need to climb an eight-foot wall as a work behavior of sufficient importance to warrant its role in Exam 3040 does not appear to have been established on the basis of any of the systematic job analyses done.
I conclude, accordingly, that the wall was included in Exam 3040 simply because of its inclusion in earlier tests and a subjective feeling on the part of City personnel that firemen should be able to do such things. As the EEOC Guidelines wisely inform us, such seat-of-the-pants justifications for employment tests may well be the source for the stereotyped images of jobs which exclude women without there being a rational reason for their exclusion.
 As noted in the American Psychological Association, Inc. Standards for Education and Psychological Tests ("APA Standards") at p. 29: "It should be clear that content validity is quite different from face validity. Content validity is determined by a set of operations[,] and one evaluates content validity by the thoroughness and care with which these operations have been conducted. In contrast, face validity is a judgment that the requirements of a test appear to be relevant. The writing of items in terms used in a particular job ... may give an appearance of relevance while contributing nothing to content validity or indeed to any other useful validity information (although such information may serve a useful public-relations function)."
 Ironically, among the relatively neutral job anchors used by Dr. Fleishman in his job analysis, the abilities of ballet dancers were recognized as one reliable index of human abilities common to ballet dancers and firefighters.
 Even so, the scoring for the mile run, as it was finally established for the physical portion of Exam 3040, was such that only 50% of the trained Cureton sample would have been able to achieve a passing score despite the fact that the Exam 3040 mile was run on a ten-lap, rather than a four-lap, track. Based on their recorded times in the tryouts on a 31-lap track, none of the incumbent firemen who tried out the one-mile run in December 1973 would have received a passing score on Exam 3040.
 The Chicago experts' complaint about the twist-and-touch test was somewhat more reasoned than Commissioner O'Hagan's. Thus, they are reported to have noted that firemen must "twist and touch," not just to one side of the body, but to both sides. This criticism, while rational in terms of a test based on actual job behavior, misconceived the purpose of the AIR test, which was to determine the abstract ability of extent flexibility which Dr. Fleishman's studies showed could be evaluated by measuring the candidate's ability to twist and touch in one direction only.
 In addition, the push-up test was reasonably criticized as difficult to administer and monitor.
 Defendants question the failure of plaintiff to present evidence with regard to the number of men and women in the work force who have valid New York driver's licenses and meet the vision requirement of the N.Y.C.Admin.Code of 20/40 for each eye separately, without glasses, and the height requirement. For this reason, as well as the other unusual demands of the firefighter's vocation, it seems appropriate in determining adverse impact and relief, as noted infra, to rely on a comparison of the ratio of male to female pass rates to the male/female ratio among those actually presenting themselves for the physical examination rather than a comparison of the pass-rate ratio to the ratio of men to women in the work force. But cf. Dothard v. Rawlinson, 433 U.S. 321, 330, 97 S. Ct. 2720, 2727, 53 L. Ed. 2d 786 (1977).
 Defendants seek to excuse shortcomings in the job analysis and test preparation by reference to Spurlock v. United Airlines, 475 F.2d 216, 219 (10th Cir. 1972), which determined that, because of the economic and human risks of hiring unqualified flight officers to pilot what was described as a $20 million aircraft carrying as many as 300 passengers per flight, "the employer bears a correspondingly lighter burden to show that employment criteria are job related." Whatever the merits of applying such a rule to the hiring of firefighters (see Vulcan Society v. Fire Department, 505 F. Supp. 955, 965 (S.D.N.Y.1981), in which Judge Sofaer declined to impose a requirement of a high-school diploma on firefighters in the face of an argument that it was required by the Spurlock rule), it is clear that its application cannot save a job analysis and test preparation which fail not simply as a matter of degree, but in the overall logic and consistency of the approaches employed. Thus, even accepting the rule of the Spurlock case as appropriately employed to New York City firefighters because of the economic and human risks involved in their work, the inconsistencies in the test preparation and in defendants' attempts at validation prevent me from validating the test on that ground.
 There is considerably less justification for relying on the comments of a non-job incumbent, such as a representative of the Department of Personnel, as to what it takes to be a firefighter.
 Significantly, the one woman firefighter who testified in the trial of this case, a member of the Chicago fire department, related that she had become a firefighter, after securing a passing score of just above 70 and, as a result, being rank ordered to the bottom of Chicago's eligibility list, solely as a result of the happenstance of a firefighter's strike which resulted in the call-up of the entire eligibility list. She has been serving now in firefighting operations in Chicago for close to two years, by all accounts, with distinction.
 An illustration of this point is contained in a comment by an AIR observer who recorded his on-the-job observations while accompanying firemen on several runs (D.X. CC, pp. 36-37):
"I observed at a tenement fire in Brownsville a situation where the Lieutenant gave a probie fireman immediate feedback about his aggressiveness. For whatever reason, this probie apparently was moving upon the firefront[;] however, he was doing this at a rate which the Lieutenant considered too fast for safety, namely, he wasn't checking and making sure that the fire didn't break out behind him."
 One illustration of this distinction is the contrast between the ledge balance test, calling upon the test taker to race at top speed sideways along a balance beam said to represent a ledge facing a wall, and balancing as described in the examples furnished in the PDA and in the testimony of trial witnesses. What is required is the ability to get across an exposed floor joist, for example, safely within a reasonable time, not top speed.
 The dangers of this kind of picking and choosing are manifest. For one thing, such a process severely undercuts one rationale behind permitting the borrowing of another's test, namely, that the test has been actually chosen and either sold (by a test manufacturer or test developer) or at least adopted by another test user willing to stand behind its results. Indeed, the borrower in such a situation as that presented here is not, in fact, relying on any effort on the part of a test lender to criterion validate a specific test battery, but is, instead, simply asserting that individual tests, e.g., hand grip or broad jump, have been shown by studies to be good predictors of job performance without consideration of their job-relatedness, representativeness or fairness.
 This readiness to ignore what the authors of the Maryland study thought they were doing, by converting one of the Maryland criterion measures into a predictive measure, in order to establish the validity of the New York City exam, makes clear that defendants are not, in fact, relying on the Maryland study for anything like what it purported to conclude about the validity of firefighter's tests.
 It bears repeating that the Maryland dummy drag was never, in any sense, validated in Maryland since it was in fact one of the criterion measures by which other predictive measures were validated.
 In the event that more than 45 members of the class are found qualified and willing to accept appointment, selection shall be by lot.
 If more men are called up between the date hereof and the date of the list's expiration, it will be appropriate to increase the number of places available to be filled by qualified women, correspondingly.