Our measurement professionals are fluent in classical statistics, Rasch, and multi-parameter item response theory (IRT). We will assess and report on the quality and usefulness of each individual test item/performance task, create balanced and equivalent test forms, provide equating services, and report on the proficiency level of test candidates.
We will work with you, your team, and your SMEs to determine an appropriate performance standard (cut score). Our measurement professionals are experienced with multiple standard setting methodologies and will work with you to determine the most appropriate methodology to apply for a given test development project.
Alpine Testing Solutions' scientists use classical and item response theory (IRT) analyses and a variety of other techniques to assess the quality and usefulness of individual test items, the efficacy of entire test forms, and the proficiency level of test candidates. We are experienced in a variety of candidate-centered and test-centered standard setting methodologies, and work with our clients to determine the most appropriate method for a given test.
Classical Item Analysis, based on traditional classical test theory models, forms the foundation for looking at the performance of each item in a test. Our report includes item difficulty, item high and low group discrimination, item score to total score correlations, and item option statistics. These measures tell our psychometricians much about how samples of candidates responded to each item.
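As a rough illustration only, and not a description of our production tooling, the first three statistics named above can be sketched in a few lines of Python. The function names, the 0/1 scoring of responses, and the 27% cutoff for the high and low groups are assumptions made for this example. Item option statistics are omitted because they require the raw option chosen by each candidate.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns None when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def item_analysis(responses, group_fraction=0.27):
    """responses: one row per candidate, each row a list of 0/1 item scores.

    group_fraction (assumed 27% here) sets the size of the high and low groups.
    """
    totals = [sum(row) for row in responses]
    # Rank candidates by total score, lowest first, to form the two groups.
    order = sorted(range(len(responses)), key=lambda i: totals[i])
    k = max(1, round(group_fraction * len(responses)))
    low, high = order[:k], order[-k:]

    report = []
    for j in range(len(responses[0])):
        scores = [row[j] for row in responses]
        p_high = mean(responses[i][j] for i in high)
        p_low = mean(responses[i][j] for i in low)
        report.append({
            "item": j,
            # Difficulty: proportion of candidates answering correctly.
            "difficulty": mean(scores),
            # High/low group discrimination: difference in proportion correct.
            "discrimination": p_high - p_low,
            # Uncorrected item-total correlation (the item is part of the total).
            "item_total_r": pearson(scores, totals),
        })
    return report


# Tiny made-up data set: 4 candidates, 3 items.
sample = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
for row in item_analysis(sample):
    print(row)
```

Real analyses use far larger candidate samples; with only a handful of candidates these statistics are too unstable to interpret.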
Item response theory (IRT) can be used as a supplement to classical item and test analysis. IRT is a statistical framework that models examinee responses to test items, relating each examinee's proficiency level to the probability of a correct answer on each test item. IRT computes an estimated item characteristic curve (ICC) for each test item. IRT can use one to three parameters to specify the item response model. IRT has the following advantages when compared to classical item analysis:
Differential Item Functioning (DIF) analysis is a method to determine whether test items or tests are performing differently (in a measurable way) for two or more groups of examinees that are classified by some distinguishing characteristic such as gender, age, experience, race, ethnicity, or culture. DIF is a statistical approach to determine whether there is potential bias either for or against a particular group based on group membership and not on the ability, proficiency, or achievement measured by the test. If differential item functioning is found between groups, this is an indication that additional investigation is needed to determine whether there is bias present for or against one of the target groups.
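The text above does not name a specific DIF procedure. Purely as an example, one widely used option is the Mantel-Haenszel statistic, which compares the odds of a correct answer for a reference group and a focal group among examinees matched on total score. The sketch below assumes that approach; the function name, the record layout, and the group labels are hypothetical.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(records):
    """records: iterable of (group, total_score, item_correct) tuples.

    group is "reference" or "focal" (labels assumed for this example);
    total_score is the matching variable; item_correct is True/False or 1/0.
    Returns (common odds ratio, delta on the ETS scale), or (None, None)
    when the odds ratio cannot be computed.
    """
    # One 2x2 table per score level, stored as [A, B, C, D]:
    # A = reference correct, B = reference incorrect,
    # C = focal correct,     D = focal incorrect.
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, total, correct in records:
        cell = (0 if correct else 1) + (0 if group == "reference" else 2)
        strata[total][cell] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        return None, None

    odds_ratio = num / den
    # ETS delta metric: negative values mean the item is relatively
    # harder for the focal group than for matched reference examinees.
    delta = -2.35 * math.log(odds_ratio)
    return odds_ratio, delta


# Made-up records at a single score level.
records = (
    [("reference", 5, 1)] * 3 + [("reference", 5, 0)] * 1
    + [("focal", 5, 1)] * 1 + [("focal", 5, 0)] * 3
)
print(mantel_haenszel_dif(records))
```

A flagged item is a prompt for content review, not proof of bias, which is consistent with the follow-up investigation described above.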
We use the results from the item analysis and calibrations to create test forms that are as parallel as possible in terms of content, expected test score means, expected test score standard deviations, and expected standard errors of measurement.
We use classical test statistics to compute various types of reliability and test precision. Our psychometricians regularly use internal consistency and decision reliability measures (e.g., Cronbach's alpha, Spearman-Brown, and Livingston's coefficient), along with more traditional raw score test precision measures.
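For readers unfamiliar with these measures, here is a minimal sketch of two of them, Cronbach's alpha and the Spearman-Brown formula, using their standard textbook definitions. The function names and the sample data are assumptions for illustration only.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """responses: one row per candidate, each row a list of item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have no variance")
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Made-up data: 4 candidates, 3 dichotomously scored items.
sample = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha = cronbach_alpha(sample)
print(alpha, spearman_brown(alpha, 2))
```

The second call estimates the reliability of a test twice as long built from comparable items, which is the typical use of Spearman-Brown when planning form length.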
Reliability is an index of the consistency of measurements of behavior in a specified content, ability, or performance domain. The difference between the observed score for a test taker and his or her true score or universe score is called measurement error.
Every psychological or educational test contains some measurable amount of measurement error. The amount of measurement error in scores can be estimated from the test reliability and is generally represented by the standard error of measurement (SEM). The more reliable the test, the smaller the standard error.
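The relationship between reliability and the SEM follows the standard classical formula, SEM = SD x sqrt(1 - reliability), where SD is the standard deviation of observed scores. The short sketch below applies it; the function names and the example numbers are made up for illustration.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM from the score standard deviation and a reliability estimate."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * sqrt(1.0 - reliability)


def score_band(observed_score, sem, n_sems=1.0):
    """Band of +/- n_sems standard errors around an observed score."""
    return observed_score - n_sems * sem, observed_score + n_sems * sem


# Hypothetical test: score SD of 10 points, reliability of 0.91.
sem = standard_error_of_measurement(10.0, 0.91)
print(round(sem, 2), score_band(72.0, sem))
```

With the same score spread, raising reliability from 0.75 to 0.91 cuts the SEM from 5 points to about 3, which is the sense in which a more reliable test has a smaller standard error.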
Reliability of a test can be expressed in terms of the components of systematic variance and error variance. Measurement error can be influenced by the test administration conditions, test-taker conditions (e.g., the test-taker is distracted due to lack of sleep, illness, etc.), scoring procedures, poor item quality, and item sampling (the test taker simply knows the answers to more of the items on one test form than on another even though they are built to the same specifications). Different reliability estimates take different sources of error variance into account.
For certain applications in educational, certification, licensure, and occupational or employment testing, you must determine a performance standard or test score(s) to distinguish qualified from unqualified individuals. These scores will identify individuals who perform at basic, proficient, or advanced levels.
These performance standards may be determined using test-centered methods, which rely upon the judgment of experienced subject matter experts; candidate-centered methods, which rely upon examinee test scores; or a combination of both. We will determine which option is best for you and set the standards you will use to ensure learning and performance.
In its broadest sense, validity is the extent to which a test in fact measures what it claims to measure for the purpose for which it is intended to be used. Validity studies are conducted to accumulate and document multiple lines of evidence and information supporting the scientific basis for the interpretations and uses of test scores. Validity studies attempt to triangulate or provide multiple approaches and complementary strategies to strengthen the validity argument and the evidence to support the construct or purpose of the test. Validity studies attempt to find both convergent evidence pointing toward the target test purpose and divergent evidence that shows that other assessment methods and assessment instruments measure different constructs than the target test purpose. The process of validation is an ongoing and continuous process of accumulating evidence to support the use of the test for its intended purpose.
We will work with you and your team to design and implement a test maintenance program that can encompass ongoing validation studies, assessment of the performance and exposure of individual items/performance tasks, assessment of the efficacy of test forms, development and piloting of items/performance tasks, refreshing and equating of test forms, and monitoring test security.
Tests—particularly ones that are administered on demand—require a sound maintenance plan that can address security breaches, item exposure, ongoing validation, and shifts in the content domain. Alpine Testing Solutions will work with you to create and implement a plan that addresses all of these needs.
Health Checks are vital indicators of how well a test is performing, and provide insight as to when additional investigation or action is required to maintain the quality of your test.
Health Checks are particularly important for programs that are delivering on-demand, linear fixed form tests. The initial health check should occur after the test has been released for 30 days or has been administered to approximately 200 candidates. Subsequent health checks can be scheduled at reasonable intervals based upon delivery volume.
The purpose of the health check is to determine whether or not the test forms and items are performing as expected and whether or not there is undue drift in item performance, test performance, or pass/fail rates. Alpine Testing Solutions will calculate classical item and test statistics and conduct a Rasch or IRT calibration to determine if the items and tests are performing within acceptable tolerances.
Whether your test is delivered in limited testing windows or on demand, Alpine Testing Solutions will work with you to create and implement a plan for blueprint reviews, item development, pilot testing, and item analysis and selection so that test forms can be refreshed or replaced as needed and cut scores can be equated to the appropriate performance standard. A well-thought-out and well-implemented test maintenance and ongoing validation plan is every bit as critical as the original test development and validation plan.