Introduction

Assessments that are used to measure students’ ability or knowledge need to produce valid, reliable and fair scores (Brennan 2006; Downing and Haladyna 2011; AERA, APA and NCME 2014). While many R packages have been developed to cover general psychometric concepts (e. g. psych (Revelle 2018), ltm (Rizopoulos 2006)) or specific psychometric topics (e. g. difR (D. Magis et al. 2010), lavaan (Rosseel 2012), see also the Psychometrics CRAN Task View), stakeholders in this area are often non-programmers and may thus find it difficult to overcome the initial burden of an R-based environment. Commercially available software provides an alternative, but high prices and limited methodology may be an issue. Nevertheless, it is highly important to establish routine psychometric analysis in the development and validation of educational tests of various types worldwide. Freely available software with a user-friendly interface and interactive features may help establish this practice even in regions or scientific areas where understanding and usage of psychometric concepts are underdeveloped.

In this work we introduce ShinyItemAnalysis (Patrícia Martinková et al. 2018), an R package and an online application based on shiny (Chang et al. 2018), which was initially created to support teaching of psychometric concepts and test development, and was subsequently used to support routine validation of admission tests to Czech universities (Patricia Martinková, Drabinová, and Houdek 2017). Its current mission is to support routine validation of educational and psychological measurements worldwide.

We first briefly explain the methodology and concepts in a step-by-step way, from the classical test theory (CTT) to the item response theory (IRT) models, including methods to detect differential item functioning (DIF). We then describe the implementation of ShinyItemAnalysis with practical examples coming from development and validation of the Homeostasis Concept Inventory (HCI, McFarland et al. 2017). We conclude with a discussion of features helpful for teaching psychometrics, as well as features important for generation of PDF and HTML reports, which support routine analysis of admission and other educational tests.

Psychometric models for analysis of items and tests

Classical test theory

Traditional item analysis uses counts, proportions and correlations to evaluate properties of test items. The difficulty of an item is estimated as the percentage of students who answered that item correctly. Discrimination is usually described as the difference between the percent correct in the upper and lower third of the students tested (upper-lower index, ULI). As a rule of thumb, ULI should not be lower than 0.2, except for very easy or very difficult items (Ebel 1954). ULI can be customized by determining the number of groups and by changing which groups should be compared: this is especially helpful for admission tests where a certain proportion of students (e. g. one fifth) are usually admitted (Patrícia Martinková, Štěpánek, et al. 2017).
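
The two indices can be sketched in base R as follows. This is a minimal illustration on a made-up response matrix, not the package's gDiscrim() implementation; the helper name uli() and its arguments are hypothetical, and ties in the total score are broken by order of appearance.

```r
# Hypothetical toy data: 6 respondents (rows) x 3 binary items (columns)
resp <- matrix(c(1, 1, 1,
                 1, 1, 0,
                 1, 0, 1,
                 0, 1, 0,
                 1, 0, 0,
                 0, 0, 0), ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("item1", "item2", "item3")))

# Difficulty: proportion of respondents answering each item correctly
difficulty <- colMeans(resp)

# ULI: proportion correct in the upper group minus that in the lower group;
# respondents are ordered by total score and split into k groups
uli <- function(resp, k = 3, lower = 1, upper = k) {
  total <- rowSums(resp)
  # ranks 1..n (ties broken by order of appearance), cut into k equal-width bins
  grp <- cut(rank(total, ties.method = "first"), breaks = k, labels = FALSE)
  colMeans(resp[grp == upper, , drop = FALSE]) -
    colMeans(resp[grp == lower, , drop = FALSE])
}

difficulty
uli(resp)                          # upper vs. lower third
uli(resp, k = 2)                   # upper vs. lower half
```

On this toy data, item2 is answered correctly by half of both the upper and the lower third, so its ULI is 0 and it would fall below the 0.2 rule of thumb.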

Other traditional statistics for a description of item discrimination include the point-biserial correlation, which is the Pearson correlation between responses to a particular item and scores on the total test. This correlation (R) is denoted RIT index if an item score (I) is correlated with the total score (T) of the whole test, or RIR if an item score (I) is correlated with the sum of the rest of the items (R).
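
A minimal base R sketch of both indices, using a made-up response matrix (the data and object names are illustrative assumptions):

```r
# Hypothetical toy data: 6 respondents x 3 binary items
resp <- matrix(c(1, 1, 1,
                 1, 1, 0,
                 1, 0, 1,
                 0, 1, 0,
                 1, 0, 0,
                 0, 0, 0), ncol = 3, byrow = TRUE)
total <- rowSums(resp)

# RIT: Pearson correlation of each item score with the total score
rit <- apply(resp, 2, function(item) cor(item, total))

# RIR: Pearson correlation of each item score with the sum of the other items
rir <- apply(resp, 2, function(item) cor(item, total - item))

round(cbind(RIT = rit, RIR = rir), 3)
```

Because the item itself contributes to the total score, RIT is inflated relative to RIR, which is most visible in short tests such as this toy example.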

In addition, difficulty and discrimination may be calculated for each response of the multiple choice item to evaluate distractors and diagnose possible issues, such as confusing wording. Respondents are divided into N groups (usually 3, 4 or 5) by their total score. Subsequently, the percentage of students in each group which selected a given answer (correct answer or distractor) is calculated and may be displayed in a distractor plot.
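
The table behind a distractor plot can be sketched in base R. The responses, the key ("A") and the object names below are hypothetical, and the grouping rule (equal-width bins of ranks) is one simple choice among several.

```r
# Hypothetical toy data: option chosen on one multiple-choice item (key is "A")
# and total test scores of 9 respondents
choice <- c("A", "A", "A", "A", "B", "A", "B", "C", "C")
total  <- c(9, 8, 8, 6, 5, 5, 3, 2, 1)

# Split respondents into N = 3 groups of equal size by total score
n_groups <- 3
grp <- cut(rank(total, ties.method = "first"), breaks = n_groups,
           labels = paste0("group", seq_len(n_groups)))

# Proportion of each group selecting each option (each row sums to 1)
distr_tab <- prop.table(table(grp, choice), margin = 1)
round(distr_tab, 2)
```

In this example the proportion choosing the key "A" increases from the lowest to the highest group, while the distractors "B" and "C" are chosen mainly by the lower groups.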

To gain empirical evidence of the construct validity of the whole instrument, the correlation structure needs to be examined. Empirical evidence of validity may also be provided by correlation with a criterion variable. For example, a correlation with subsequent university success or university GPA may be used to demonstrate predictive validity of admission scores.

To gain evidence of test reliability, internal consistency of items can be examined using Cronbach’s alpha (Cronbach 1951; see also Zinbarg et al. 2005 for more discussion). Cronbach’s alpha of the test without a given item can be used to identify items that are not internally consistent with the rest of the test.
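
Both quantities can be computed directly from the item and total-score variances. The following base R sketch uses made-up data; the function name cronbach_alpha() is hypothetical.

```r
# Hypothetical toy data: 6 respondents x 3 binary items
resp <- matrix(c(1, 1, 1,
                 1, 1, 0,
                 1, 0, 1,
                 0, 1, 0,
                 1, 0, 0,
                 0, 0, 0), ncol = 3, byrow = TRUE)

# Cronbach's alpha: k / (k - 1) * (1 - sum of item variances / variance of total)
cronbach_alpha <- function(resp) {
  k <- ncol(resp)
  item_var  <- apply(resp, 2, var)
  total_var <- var(rowSums(resp))
  k / (k - 1) * (1 - sum(item_var) / total_var)
}

alpha_all <- cronbach_alpha(resp)

# Alpha of the test without a given item, for each item in turn
alpha_drop <- sapply(seq_len(ncol(resp)),
                     function(j) cronbach_alpha(resp[, -j, drop = FALSE]))

alpha_all
alpha_drop
```

Here alpha increases when the second item is dropped, which flags that item as not consistent with the rest of this toy test.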

Regression models for description of item properties

Various regression models may be fitted to describe item properties in more detail. With binary data, logistic regression may be used to model the probability of a correct answer as a function of the total score by an S-shaped logistic curve. Parameter \(b_0 \in R\) describes horizontal location of the fitted curve, parameter \(b_1 \in R\) describes its slope. \[\begin{aligned} \mathrm{P}\left(Y = 1|X, b_0, b_1\right) = \frac{\exp\left(b_0 + b_1 X\right)}{1 + \exp\left(b_0 + b_1 X\right)}. \label{eq:2PLreg} \end{aligned} (\#eq:2PLreg)\]
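
As a minimal sketch, such a model can be fitted with glm() from base R. The data below are simulated under assumed parameter values and serve only as an illustration.

```r
set.seed(1)
# Hypothetical simulated data: total scores of 500 respondents on a 20-item test
total <- rbinom(500, size = 20, prob = 0.6)
# One binary item whose probability of success rises with the total score
item <- rbinom(500, size = 1, prob = plogis(-6 + 0.5 * total))

# Logistic regression of the item response on the total score
fit <- glm(item ~ total, family = binomial)
b <- coef(fit)   # b[1] estimates b_0, b[2] estimates b_1

# Fitted probability of a correct answer for a total score of 12
p12 <- predict(fit, newdata = data.frame(total = 12), type = "response")

# The same value obtained from the model equation
p12_manual <- plogis(b[1] + b[2] * 12)
```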

A standardized total score may be used instead of a total score as an independent variable to model the properties of a given item. In such a case, the estimated curve remains the same, only the interpretation of item properties now focuses on improvement of 1 standard deviation rather than 1 point improvement. It is also helpful to re-parametrize the model using new parameters \(a \in R\) and \(b \in R\). Item difficulty parameter \(b\) is given by location of the inflection point, and item discrimination parameter \(a\) is given by the slope at the inflection point: \[\begin{aligned} \mathrm{P}(Y = 1|Z, a, b) = \frac{\exp(a(Z - b))}{1 + \exp(a(Z - b))}. \label{eq:2PLregZIRT} \end{aligned} (\#eq:2PLregZIRT)\]
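
The link between the two parametrizations can be checked by matching the linear predictors. This is a short derivation, assuming the logistic model is fitted on the standardized score \(Z\) and that \(b_1 \neq 0\): \[\begin{aligned} b_0 + b_1 Z = a\left(Z - b\right) \quad \Longrightarrow \quad a = b_1, \qquad b = -\frac{b_0}{b_1}. \end{aligned}\] At the inflection point \(Z = b\) the probability equals \(1/2\) and the derivative of the curve with respect to \(Z\) equals \(a/4\), so \(a\) determines the slope at that point up to the constant factor \(1/4\).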

Further, non-linear regression models allow us to account for guessing by providing non-zero left asymptote \(c \in [0,1]\) and inattention by providing right asymptote \(d \in [0,1]\): \[\begin{aligned} \mathrm{P}\left(Y = 1|Z, a, b, c, d\right) = c + \left(d - c\right)\frac{\exp\left(a\left(Z - b\right)\right)}{1 + \exp\left(a\left(Z - b\right)\right)}. \label{eq:4PLreg} \end{aligned} (\#eq:4PLreg)\]
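
The curve with both asymptotes is easy to evaluate directly. The sketch below uses a hypothetical function name p_4pl() and made-up parameter values; with the defaults c = 0 and d = 1 it reduces to the two-parameter logistic curve.

```r
# Curve with lower asymptote c (guessing) and upper asymptote d (inattention)
p_4pl <- function(z, a, b, c = 0, d = 1) {
  c + (d - c) * plogis(a * (z - b))
}

# Hypothetical parameter values, chosen for illustration only
z <- seq(-4, 4, by = 0.5)
p <- p_4pl(z, a = 1.5, b = 0.5, c = 0.2, d = 0.95)

range(p)                                          # stays between c and d
p_4pl(0.5, a = 1.5, b = 0.5, c = 0.2, d = 0.95)   # midpoint (c + d) / 2 at z = b
```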

Other regression models allow for further extensions or different data types: Ordinal regression allows for modelling Likert-scale and partial-credit items. To model responses to all given options (correct response and all distractors) in multiple-choice questions, a multinomial regression model can be used. Further complexities of the measurement data and item functioning may be accounted for by incorporating student characteristics with multiple regression models or clustering with hierarchical models.

A logistic regression model @ref(eq:2PLreg) for item description, and its re-parametrizations and extensions as illustrated in equations @ref(eq:2PLregZIRT) and @ref(eq:4PLreg) may serve as a helpful introductory step for explaining and building IRT models which can be conceptualized as (logistic/non-linear/ordinal/multinomial) mixed effect regression models (Rijmen et al. 2003).

Item response theory models

IRT models treat respondent abilities as unobserved (latent) scores \(\theta\) which need to be estimated together with item parameters. The 4-parameter logistic (4PL) IRT model corresponds to regression model @ref(eq:4PLreg) above: \[\begin{aligned} \mathrm{P}\left(Y = 1|\theta, a, b, c, d\right) = c + \left(d - c\right)\frac{\exp\left(a\left(\theta - b\right)\right)}{1 + \exp\left(a\left(\theta - b\right)\right)}. \label{eq:4PLIRT} \end{aligned} (\#eq:4PLIRT)\] Submodels of the 4PL model @ref(eq:4PLIRT) may be obtained by fixing appropriate parameters. E. g. the 3PL IRT model is obtained by fixing inattention to \(d = 1.\) The 2PL IRT model is obtained by further fixing pseudo-guessing to \(c = 0,\) and the Rasch model by fixing discrimination \(a = 1\) in addition.

Other IRT models allow for further extensions or different data types: Modelling Likert-scale and partial-credit items can be done by modelling cumulative responses in a graded response model (GRM, Samejima 1969). Alternatively, ordinal items may be analyzed by modelling adjacent categories logits using a generalized partial credit model (GPCM, Muraki 1992), or its restricted version - partial credit model (PCM, Masters 1982), and rating scale model (RSM, Andrich 1978). To model distractor properties in multiple-choice items, Bock’s nominal response model (NRM, Bock 1972) is an IRT analogy of a multinomial regression model. This model is also a generalization of GPCM/PCM/RSM ordinal models. Many other IRT models have been used in the past to model item properties, including models accounting for multidimensional latent traits, hierarchical structure or response times (van der Linden 2017).

A wide variety of estimation procedures has been proposed in recent decades. Joint maximum likelihood estimation treats both ability and item parameters as unknown but fixed. Conditional maximum likelihood estimation takes advantage of the fact that in exponential family models (such as the Rasch model), the total score is a sufficient statistic for an ability estimate and the ratio of correct answers is a sufficient statistic for a difficulty parameter. Finally, in the marginal maximum likelihood estimation used by the mirt (Chalmers 2018) and ltm (Rizopoulos 2006) packages as well as in ShinyItemAnalysis, parameter \(\theta\) is assumed to be a random variable following a certain distribution (often standard normal) and is integrated out (see for example, Johnson 2007). An EM algorithm with a fixed quadrature is used to estimate latent scores and item parameters. Besides MLE approaches, Bayesian methods with Markov chain Monte Carlo are a good alternative, especially for multidimensional IRT models.
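
To illustrate what integrating \(\theta\) out means, the base R sketch below approximates the marginal probability of a response pattern under the Rasch model on a fixed, equally spaced quadrature grid. The function name marginal_prob(), the grid and the item difficulties are assumptions made for this example; this is only the likelihood evaluation, not the EM algorithm used by mirt or ltm.

```r
# Marginal probability of a response pattern y under the Rasch model,
# with theta ~ N(0, 1) integrated out on a fixed quadrature grid
marginal_prob <- function(y, b, nodes = seq(-6, 6, length.out = 61)) {
  w <- dnorm(nodes)
  w <- w / sum(w)                        # normalized quadrature weights
  lik <- sapply(nodes, function(theta) {
    p <- plogis(theta - b)               # P(correct) for each item at this node
    prod(p^y * (1 - p)^(1 - y))          # likelihood of the whole pattern
  })
  sum(w * lik)
}

b <- c(-1, 0, 1)                                   # hypothetical item difficulties
patterns <- as.matrix(expand.grid(0:1, 0:1, 0:1))  # all 2^3 response patterns
probs <- apply(patterns, 1, marginal_prob, b = b)

sum(probs)   # marginal probabilities of all patterns sum to 1
```

Marginal maximum likelihood then chooses the item parameters that maximize the product of such marginal probabilities over the observed response patterns.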

Differential item functioning

DIF occurs when respondents from different groups (e. g. defined by gender or ethnicity) with the same underlying true ability have a different probability of answering the item correctly. Differential distractor functioning (DDF) is a phenomenon in which different distractors, or incorrect option choices, attract various groups with the same ability differentially. If an item functions differently for two groups, it is potentially unfair; thus detection of DIF and DDF should be a routine analysis when developing and validating educational and psychological tests (Patrícia Martinková, Drabinová, et al. 2017).

Presence of DIF can be tested by many methods including Delta plot (Angoff and Ford 1973), Mantel-Haenszel test based on contingency tables that are calculated for each level of a total score (Mantel and Haenszel 1959), logistic regression models accounting for group membership (Swaminathan and Rogers 1990), nonlinear regression (Adéla Drabinová and Martinková 2017), and IRT based tests (Lord 1980; Raju 1988, 1990).
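
As one example, the logistic regression approach can be sketched in base R as a likelihood-ratio test between nested models with and without group terms. The data are simulated under assumed values, and a simulated standardized score stands in for the matching variable; this is an illustration of the idea, not the difR or difNLR implementation.

```r
set.seed(2)
n <- 1000
group <- rep(0:1, each = n / 2)   # 0 = reference, 1 = focal group (hypothetical)
score <- rnorm(n)                 # standardized matching score
# Simulated item with uniform DIF: focal group disadvantaged by 1 logit
item <- rbinom(n, size = 1, prob = plogis(0.2 + 1.2 * score - 1 * group))

m0 <- glm(item ~ score, family = binomial)           # no DIF
m1 <- glm(item ~ score * group, family = binomial)   # uniform and non-uniform DIF

# Likelihood-ratio test of the two added group terms (2 degrees of freedom)
lrt <- anova(m0, m1, test = "Chisq")
p_value <- lrt[2, "Pr(>Chi)"]
```

A small p-value indicates that group membership improves the prediction of the item response beyond the matching score, i. e. the item is flagged as DIF.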

Implementation

The ShinyItemAnalysis package can be used either locally or online. The package uses several other R packages to provide a wide palette of psychometric tools to analyze data (see Table 1). The main function is called startShinyItemAnalysis(). It launches an interactive shiny application (Figure 1) which is further described below. Furthermore, function gDiscrim() calculates the generalized ULI coefficient comparing the ratio of correct answers in predefined upper and lower groups of students (Patrícia Martinková, Štěpánek, et al. 2017). Function ItemAnalysis() provides a complete traditional item analysis table with summary statistics and various difficulty and discrimination indices for all items. Function DDplot() plots difficulty and selected discrimination indices of the items ordered by their difficulty. Function DistractorAnalysis() calculates the proportions of choosing individual distractors for groups of respondents defined by their total score. Graphical representation of distractor analysis is provided via function plotDistractorAnalysis(). Other functions include item-person maps for IRT models, ggWrightMap(), and plots for DIF analysis using IRT methods, plotDIFirt(), and logistic regression models, plotDIFlogistic(). These functions may be applied directly on data from an R console as shown in the provided R code. The package also includes training datasets Medical 100, Medical 100 graded (Patrícia Martinková, Štěpánek, et al. 2017), and HCI (McFarland et al. 2017).

Table 1: R packages used for developing ShinyItemAnalysis.
R package Citation Title
corrplot (Wei and Simko 2017) Visualization of a correlation matrix
cowplot (Wilke 2018) Streamlined plot theme and plot annotations for ‘ggplot2’
CTT (Willse 2018) Classical test theory functions
data.table (Dowle and Srinivasan 2018) Extension of data.frame
deltaPlotR (David Magis and Facon 2014) Identification of dichotomous differential item functioning using Angoff’s delta plot method
difNLR (Adela Drabinová, Martinková, and Zvára 2018) DIF and DDF detection by non-linear regression models
difR (D. Magis et al. 2010) Collection of methods to detect dichotomous differential item functioning
DT (Xie, Cheng, and Tan 2018) A wrapper of the JavaScript library ‘DataTables’
ggdendro (de Vries and Ripley 2016) Create dendrograms and tree diagrams using ‘ggplot2’
ggplot2 (Wickham 2016) Create elegant data visualisations using the grammar of graphics
gridExtra (Auguie 2017) Miscellaneous functions for ‘grid’ graphics
knitr (Xie 2018) A general-purpose package for dynamic report generation in R
latticeExtra (Sarkar and Andrews 2016) Extra graphical utilities based on lattice
ltm (Rizopoulos 2006) Latent trait models under IRT
mirt (Chalmers 2018) Multidimensional item response theory
moments (Komsta and Novomestky 2015) Moments, cumulants, skewness, kurtosis and related tests
msm (Jackson 2011) Multi-state Markov and hidden Markov models in continuous time
nnet (Venables and Ripley 2002) Feed-forward neural networks and multinomial log-linear models
plotly (Sievert 2018) Create interactive web graphics via ‘plotly.js’
psych (Revelle 2018) Procedures for psychological, psychometric, and personality research
psychometric (Fletcher 2010) Applied psychometric theory
reshape2 (Wickham 2007) Flexibly reshape data: A reboot of the reshape package
rmarkdown (Allaire et al. 2018) Dynamic documents for R
shiny (Chang et al. 2018) Web application framework for R
shinyBS (Bailey 2015) Twitter bootstrap components for shiny
shinydashboard (Chang and Borges Ribeiro 2018) Create dashboards with ‘shiny’
shinyjs (Attali 2018) Easily improve the user experience of your shiny apps in seconds
stringr (Wickham 2018) Simple, consistent wrappers for common string operations
xtable (Dahl et al. 2018) Export tables to LaTeX or HTML

Examples

Running the application

The ShinyItemAnalysis application may be launched in R by calling startShinyItemAnalysis(), or more conveniently, directly from https://shiny.cs.cas.cz/ShinyItemAnalysis. The intro page (Figure 1) includes general information about the application. Various tools are included in separate tabs which correspond to separate sections.

Figure 1: Intro page.

Data selection and upload

Data selection is available in section Data. Six training datasets may be selected using the Select dataset button: Medical 100, Medical 100 graded (Patrícia Martinková, Štěpánek, et al. 2017) and HCI (McFarland et al. 2017) from the ShinyItemAnalysis package, and GMAT (Patrícia Martinková, Drabinová, et al. 2017), GMAT2 and MSAT-B (Adéla Drabinová and Martinková 2017) from the difNLR package (Adela Drabinová, Martinková, and Zvára 2018).

Besides the provided toy datasets, users’ own data may be uploaded as csv files and previewed in this section. To replicate examples involving the HCI dataset (McFarland et al. 2017), csv files are provided for upload in Supplemental Materials.

Figure 2: Page to select or upload data.

Item analysis step-by-step

Further sections of the ShinyItemAnalysis application allow for step-by-step test and item analysis. The first four sections are devoted to traditional test and item analyses in a framework of classical test theory. Further sections are devoted to regression models, to IRT models and to DIF methods. A separate section is devoted to report generation and references are provided in the final section. The individual sections are described below in more detail.

Section Summary provides a histogram and summary statistics of the total scores as well as various standard scores (Z scores, T scores), together with the percentile and success rate for each level of the total score. Section Reliability offers internal consistency estimates with Cronbach’s alpha.
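
The standard scores can be sketched in base R as follows. The total scores are made up, the maximum score of 3 is assumed, and the definitions used (T score as 50 + 10 Z, percentile as the percentage of respondents at or below a score) are common conventions that may differ in detail from those used in the application.

```r
total <- c(3, 2, 2, 1, 1, 0)     # hypothetical total scores
max_score <- 3                   # assumed maximum possible score

z_score      <- (total - mean(total)) / sd(total)   # mean 0, SD 1
t_score      <- 50 + 10 * z_score                   # mean 50, SD 10
percentile   <- 100 * ecdf(total)(total)            # % at or below each score
success_rate <- 100 * total / max_score             # % of maximum score

data.frame(total,
           z_score = round(z_score, 2),
           t_score = round(t_score, 1),
           percentile = round(percentile, 1),
           success_rate = round(success_rate, 1))
```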

Section Validity provides correlation structure (Figure 3, left) and checks of criterion validity (Figure 3, right). A correlation heat map displays selected type of correlation estimates between items. Polychoric correlation is the default correlation used for binary data. The plot can be reordered using hierarchical clustering while highlighting a specified number of clusters. Clusters of correlated items need to be checked for content and other similarities to see if they are intended. Criterion validity is analyzed with respect to selected variables (e. g. subsequent study success, major, or total score on a related test) and may be analyzed for the test as a whole or for individual items.

Figure 3: Validity plots for HCI data.

Section Item analysis offers traditional item analysis of the test as well as a more detailed distractor analysis. The so-called DD plot (Figure 4) displays difficulty (red) and a selected discrimination index (blue) for all items. Items are ordered by difficulty. While lower discrimination may be expected for very easy or very hard items, all items with ULI discrimination lower than 0.2 (borderline in the plot) are worth further checks by content experts.

Figure 4: DD plot for HCI data.

The distractor plot (Figure 5) provides a detailed analysis of individual distractors by the respondents’ total score. The correct answer should be selected more often by strong students than by students with a lower total score, i. e. the solid line in the distractor plot should be increasing. The distractors should work in the opposite direction, i. e. the dotted lines should be decreasing.

Figure 5: Distractor plots for items 13 and 17 in HCI data.

Section Regression allows for modelling of item properties with a logistic, non-linear or multinomial regression (see Figure 6). Probability of the selection of a given answer is modelled with respect to the total or standardized total score. Classical as well as IRT parametrizations are provided for logistic and non-linear models. Model fit can be compared by Akaike’s (Akaike 1974) or Schwarz’s Bayesian (Schwarz 1978) information criteria and a likelihood-ratio test.

Figure 6: Regression plots for item 13 in HCI data.

Section IRT models provides 1PL to 4PL IRT models as well as Bock’s nominal model, which may also be used for ordinal items. In IRT model specification, ShinyItemAnalysis uses default settings of the mirt package, and for the Rasch model (Rasch 1960) it fixes discrimination to \(a = 1\), while the variance of ability \(\theta\) is freely estimated. In contrast to the Rasch model, the 1PL model allows any discrimination \(a \in R\) (common to all items), while fixing the variance of ability \(\theta\) to 1. Similarly, other submodels of the 4PL model @ref(eq:4PLIRT), e. g. the 2PL and 3PL models, may be obtained by fixing appropriate parameters, while the variance of ability \(\theta\) is fixed to 1.
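
The Rasch and 1PL specifications are two parametrizations of the same model, which a short derivation makes explicit. The starred symbols are notation introduced only for this check, and normally distributed ability with \(a > 0\) is assumed. Starting from the 1PL model with common discrimination \(a\) and \(\theta \sim N(0, 1)\), set \(\theta^{*} = a\theta\) and \(b^{*} = ab\): \[\begin{aligned} a\left(\theta - b\right) = \theta^{*} - b^{*}, \qquad \theta^{*} \sim N\left(0, a^2\right). \end{aligned}\] The right-hand side is the Rasch model with unit discrimination and freely estimated ability variance \(a^2\), so the two specifications give the same response probabilities and differ only in the scale on which abilities and difficulties are reported.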

Interactive item characteristic curves (ICC), item information curves (IIC) and test information curves (TIC) are provided for all IRT models (see Figure 7). An item-person map is displayed for Rasch and 1PL models (Figure 7, bottom right). The table of item parameter estimates is complemented by \(S-X^2\) item fit statistics (Orlando and Thissen 2000). Estimated latent abilities (factor scores) are also available for download. While fitting of IRT models is mainly implemented using the mirt package (Chalmers 2018), sample R code is provided for work in both mirt and ltm (Rizopoulos 2006).
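
For the 2PL model, item information has the closed form \(a^2 P(\theta)(1 - P(\theta))\), test information is the sum of the item information functions, and the standard error of measurement is the reciprocal of its square root. The base R sketch below uses made-up item parameters and a hypothetical function name item_info(); it does not reproduce the mirt-based computations of the application.

```r
# 2PL item information: I(theta) = a^2 * P(theta) * (1 - P(theta))
item_info <- function(theta, a, b) {
  p <- plogis(a * (theta - b))
  a^2 * p * (1 - p)
}

theta <- seq(-4, 4, by = 0.1)
a <- c(0.8, 1.2, 2.0)     # hypothetical discriminations
b <- c(-1.0, 0.0, 1.0)    # hypothetical difficulties

# One column per item; test information is the sum over items
iic <- mapply(function(a_j, b_j) item_info(theta, a_j, b_j), a, b)
tic <- rowSums(iic)
sem <- 1 / sqrt(tic)      # standard error of measurement

# Each item is most informative at theta equal to its difficulty b
theta[apply(iic, 2, which.max)]
```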

Figure 7: IRT plots for HCI data. From top left: Item characteristic curves, item information curves, test information curve with standard error of measurement, and item-person map.

Finally, section DIF/Fairness offers the most used tools for detection of DIF and DDF included in deltaPlotR (David Magis and Facon 2014), difR (D. Magis et al. 2010) and the difNLR package (Adela Drabinová, Martinková, and Zvára 2018; see also Adéla Drabinová and Martinková 2018 and Adéla Drabinová and Martinková 2017).

Datasets GMAT and HCI provide valuable teaching examples, further discussed in Patrícia Martinková, Drabinová, et al. (2017). HCI is an example of a situation in which the two groups differ significantly in their overall ability, yet no item is detected as DIF. Dataset GMAT was simulated to demonstrate that, even when the distributions of the total scores for the two groups are identical, an item that functions differently for each group may still be present (see Figure 8).

Figure 8: GMAT data simulated to show that hypothetically, two groups may have an identical distribution of total scores, yet there may be a DIF item present in the data.

Teaching with ShinyItemAnalysis

ShinyItemAnalysis is based on examples developed for a course of IRT models and psychometrics offered to graduate students at the University of Washington and at the Charles University. It has also been used at workshops for educators developing admission tests and other tests in various fields.

Besides a broad range of CTT and IRT methods, toy data examples, model equations, parameter estimates, and interactive interpretation of results, selected R code is also available, ready to be copy-pasted and run in R. The shiny application can thus serve as a bridge for users who do not feel secure in the R programming environment by providing examples which can be further modified or adapted to different datasets.

As an important teaching tool, an interactive training section is present, presenting item characteristic and item information curves for IRT models. For dichotomous models (Figure 9, left), the user can specify parameters \(a,\) \(b,\) \(c,\) and \(d\) of two toy items and latent ability \(\theta\) of a respondent. The ICC is then provided interactively, displaying the probability of a correct answer for any \(\theta\) and highlighting this probability for the selected \(\theta\). The IIC compares the two items in the amount of information they provide about respondents of a given ability level.

For polytomous items (Figure 9, right), analogous interactive plots are available for GRM, (G)PCM as well as for NRM. Step functions are displayed for GRM, and category response functions are available for all three models. In addition, the expected item score is displayed for all the models.

Figure 9: Interactive training IRT section.

The training sections also contain interactive exercises where students may check their understanding of the IRT models. They are asked to sketch ICC and IIC functions of items with given parameters, and to answer questions, e. g. regarding probabilities of correct answers and the information these items provide for certain ability levels.

Automatic report generation

To support routine usage of psychometric methods in test development, ShinyItemAnalysis offers the possibility to upload your own data for analysis as csv files, and to generate PDF or HTML reports. A sample PDF report and the corresponding csv files used for its generation are provided in Supplemental Materials.

Report generation uses rmarkdown templates and knitr for compiling (see Figure 10). LaTeX is used for creating PDF reports. The latest version of LaTeX with properly set paths is needed to generate PDF reports locally.

Figure 10: Report generation workflow.

The report settings page allows the user to specify a dataset name and the name of the person who generated the report, to select a method, and to customize the settings (see Figure 11). The Generate report button at the bottom starts the analyses needed for the report. Subsequently, the Download report button starts compiling the text, tables and figures into a PDF or an HTML file.

Figure 11: Report settings for the HCI data analysis.

Sample pages of a PDF report on the HCI dataset are displayed in Figure 12. Reports provide a quick overview of test characteristics and may be helpful material for test developers, item writers and institutional stakeholders.

Figure 12: Selected pages of a report on the HCI data.

Discussion and conclusion

ShinyItemAnalysis is an R package (currently version 1.2.9) and an online shiny application for psychometric analysis of educational tests and items. It is suitable for teaching psychometric concepts and it aspires to be an easy-to-use tool for routine analysis of educational tests. For teaching psychometric concepts, a wide range of models and methods are provided, together with interactive plots, exercises, data examples, model equations, parameter estimates, interpretation of results, and selected R code to bring new users to R. For analysis of educational tests by educators, ShinyItemAnalysis interactive application allows users to upload and analyze their own data online or locally, and to automatically generate analysis reports in PDF or HTML.

Functionality of ShinyItemAnalysis has been validated with three groups of users. As the first group, two university professors teaching psychometrics and test development provided written feedback on using the application and suggested edits to the wording used for interpretation of results provided by the shiny application. As the second group, over 20 participants of a graduate seminar on "Selected Topics in Psychometrics" at the Charles University in 2017/2018 used ShinyItemAnalysis throughout the year in practical exercises. Students prepared their final projects with ShinyItemAnalysis applied to their own datasets and provided detailed feedback on their experience. The third group consisted of more than 20 university academics from different fields who participated in a short-term course on "Test Development and Validation" in 2018 at the Charles University. During the course, participants used the application on toy data embedded in the package. In addition, an online knowledge test was prepared in Google Docs, answered by participants, and subsequently analyzed in ShinyItemAnalysis during the same session. Participants provided their feedback and commented on usability of the package and shiny application. As a result, various features were improved (e. g. data upload format was extended, some cases of missing values are now handled).

Current developments of the ShinyItemAnalysis package comprise implementation of wider types of models, especially ordinal and multinomial models and models accounting for effect of the rater and a hierarchical structure. In reliability estimation, further sources of data variability are being implemented to provide estimation of model-based inter-rater reliability (Patrícia Martinková, Goldhaber, and Erosheva 2018). Technical improvements include a more complex data upload or automatic testing of new versions of the application.

We argue that psychometric analysis should be a routine part of test development in order to gather evidence of reliability and validity of the measurement, and we have demonstrated how ShinyItemAnalysis may support this goal. It may also serve as an example for other fields, demonstrating the ability of shiny applications to interactively present theoretical concepts and, when complemented with sample R code, to bring new users to R, or to serve as a bridge to those who have not yet discovered the beauties of R.

Acknowledgments

This work was initiated while P. Martinková was a Fulbright-Masaryk fellow with University of Washington. Research was partly supported by the Czech Science Foundation (grant number GJ15-15856Y). We gratefully thank Jenny McFarland for providing HCI data and David Magis, Hynek Cígler, Hana Ševčíková, Jon Kern and anonymous reviewers for helpful comments to previous versions of this manuscript. We also wish to acknowledge those who contributed to the ShinyItemAnalysis package or provided their valuable feedback by e-mail, on GitHub or through an online Google form at http://www.ShinyItemAnalysis.org/feedback.html.

Akaike, Hirotugu. 1974. “A New Look at the Statistical Model Identification.” IEEE Transactions on Automatic Control 19 (6): 716–23. https://doi.org/10.1109/TAC.1974.1100705.
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, and Winston Chang. 2018. rmarkdown: Dynamic Documents for R. https://CRAN.R-project.org/package=rmarkdown.
American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME). 2014. Standards for Educational and Psychological Testing. American Educational Research Association.
Andrich, David. 1978. “A Rating Formulation for Ordered Response Categories.” Psychometrika 43 (4): 561–73. https://doi.org/10.1007/BF02293814.
Angoff, William H, and Susan F Ford. 1973. “Item-Race Interaction on a Test of Scholastic Aptitude.” Journal of Educational Measurement 10 (2): 95–105. https://doi.org/10.1002/j.2333-8504.1971.tb00812.x.
Attali, Dean. 2018. shinyjs: Easily Improve the User Experience of Your shiny Apps in Seconds. https://CRAN.R-project.org/package=shinyjs.
Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for ’Grid’ Graphics. https://CRAN.R-project.org/package=gridExtra.
Bailey, Eric. 2015. shinyBS: Twitter Bootstrap Components for shiny. https://CRAN.R-project.org/package=shinyBS.
Bock, R Darrell. 1972. “Estimating Item Parameters and Latent Ability When Responses Are Scored in Two or More Nominal Categories.” Psychometrika 37 (1): 29–51. https://doi.org/10.1007/BF02291411.
Brennan, Robert L. 2006. Educational Measurement. Praeger Publishers.
Chalmers, Phil. 2018. mirt: Multidimensional Item Response Theory. https://CRAN.R-project.org/package=mirt.
Chang, Winston, and Barbara Borges Ribeiro. 2018. shinydashboard: Create Dashboards with ’Shiny’. https://CRAN.R-project.org/package=shinydashboard.
Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2018. shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Cronbach, Lee J. 1951. “Coefficient Alpha and the Internal Structure of Tests.” Psychometrika 16 (3): 297–334. https://doi.org/10.1007/BF02310555.
Dahl, David B., David Scott, Charles Roosen, Arni Magnusson, and Jonathan Swinton. 2018. xtable: Export Tables to LaTeX or HTML. https://CRAN.R-project.org/package=xtable.
de Vries, Andrie, and Brian D. Ripley. 2016. ggdendro: Create Dendrograms and Tree Diagrams Using ‘ggplot2’. https://CRAN.R-project.org/package=ggdendro.
Dowle, Matt, and Arun Srinivasan. 2018. data.table: Extension of ‘data.frame’. https://CRAN.R-project.org/package=data.table.
Downing, Steven M, and Thomas M Haladyna, eds. 2011. Handbook of Test Development. Lawrence Erlbaum Associates, Inc.
Drabinová, Adéla, Patrícia Martinková, and Karel Zvára. 2018. difNLR: DIF and DDF Detection by Non-Linear Regression Models. https://CRAN.R-project.org/package=difNLR.
Drabinová, Adéla, and Patrícia Martinková. 2017. “Detection of Differential Item Functioning with Nonlinear Regression: A Non-IRT Approach Accounting for Guessing.” Journal of Educational Measurement 54 (4): 498–517. https://doi.org/10.1111/jedm.12158.
———. 2018. “difNLR: Generalized Logistic Regression Models for DIF and DDF Detection.” The R Journal.
Ebel, Robert L. 1954. “Procedures for the Analysis of Classroom Tests.” Educational and Psychological Measurement 14 (2): 352–64. https://doi.org/10.1177/001316445401400215.
Fletcher, Thomas D. 2010. psychometric: Applied Psychometric Theory. https://CRAN.R-project.org/package=psychometric.
Jackson, Christopher H. 2011. “Multi-State Models for Panel Data: The msm Package for R.” Journal of Statistical Software 38 (8): 1–29. https://doi.org/10.18637/jss.v038.i08.
Johnson, Matthew S. 2007. “Marginal Maximum Likelihood Estimation of Item Response Models in R.” Journal of Statistical Software 20 (10): 1–24. https://doi.org/10.18637/jss.v020.i10.
Komsta, Lukasz, and Frederick Novomestky. 2015. moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests. https://CRAN.R-project.org/package=moments.
Lord, Frederic M. 1980. Applications of Item Response Theory to Practical Testing Problems. Routledge.
Magis, David, and Bruno Facon. 2014. “deltaPlotR: An R Package for Differential Item Functioning Analysis with Angoff’s Delta Plot.” Journal of Statistical Software, Code Snippets 59 (1): 1–19. https://doi.org/10.18637/jss.v059.c01.
Magis, D., S. Beland, F. Tuerlinckx, and P. De Boeck. 2010. “A General Framework and an R Package for the Detection of Dichotomous Differential Item Functioning.” Behavior Research Methods 42 (3): 847–62. https://doi.org/10.3758/BRM.42.3.847.
Mantel, Nathan, and William Haenszel. 1959. “Statistical Aspects of the Analysis of Data from Retrospective Studies of Disease.” Journal of the National Cancer Institute 22 (4): 719–48. https://doi.org/10.1093/jnci/22.4.719.
Martinková, Patrícia, Adéla Drabinová, and Jakub Houdek. 2017. “ShinyItemAnalysis: Analýza Přijímacích a Jiných Znalostních Či Psychologických Testů [ShinyItemAnalysis: Analyzing Admission and Other Educational and Psychological Tests].” TESTFÓRUM 6 (9): 16–35. https://doi.org/10.5817/TF2017-9-129.
Martinková, Patrícia, Adéla Drabinová, Ondřej Leder, and Jakub Houdek. 2018. ShinyItemAnalysis: Test and Item Analysis via shiny. https://CRAN.R-project.org/package=ShinyItemAnalysis.
Martinková, Patrícia, Adéla Drabinová, Yuan-Ling Liaw, Elizabeth A Sanders, Jenny L McFarland, and Rebecca M Price. 2017. “Checking Equity: Why Differential Item Functioning Analysis Should Be a Routine Part of Developing Conceptual Assessments.” CBE-Life Sciences Education 16 (2): rm2. https://doi.org/10.1187/cbe.16-10-0307.
Martinková, Patrícia, Dan Goldhaber, and Elena Erosheva. 2018. “Disparities in Ratings of Internal and External Applicants: A Case for Model-Based Inter-Rater Reliability.” PLoS ONE 13 (10): e0203002. https://doi.org/10.1371/journal.pone.0203002.
Martinková, Patrícia, L Štěpánek, A Drabinová, J Houdek, M Vejražka, and Č Štuka. 2017. “Semi-Real-Time Analyses of Item Characteristics for Medical School Admission Tests.” In Computer Science and Information Systems (FedCSIS), 2017 Federated Conference on, 189–94. IEEE. https://doi.org/10.15439/2017F380.
Masters, Geoff N. 1982. “A Rasch Model for Partial Credit Scoring.” Psychometrika 47 (2): 149–74. https://doi.org/10.1007/BF02296272.
McFarland, Jenny L, Rebecca M Price, Mary Pat Wenderoth, Patrícia Martinková, William Cliff, Joel Michael, Harold Modell, and Ann Wright. 2017. “Development and Validation of the Homeostasis Concept Inventory.” CBE-Life Sciences Education 16 (2): ar35. https://doi.org/10.1187/cbe.16-10-0305.
Muraki, Eiji. 1992. “A Generalized Partial Credit Model: Application of an EM Algorithm.” ETS Research Report Series 1992 (1). https://doi.org/10.1002/j.2333-8504.1992.tb01436.x.
Orlando, Maria, and David Thissen. 2000. “Likelihood-Based Item Fit Indices for Dichotomous Item Response Theory Models.” Applied Psychological Measurement 24 (1): 50–64. https://doi.org/10.1177/01466216000241003.
Raju, Nambury S. 1988. “The Area Between Two Item Characteristic Curves.” Psychometrika 53 (4): 495–502. https://doi.org/10.1007/BF02294403.
———. 1990. “Determining the Significance of Estimated Signed and Unsigned Areas Between Two Item Response Functions.” Applied Psychological Measurement 14 (2): 197–207. https://doi.org/10.1177/014662169001400208.
Rasch, Georg. 1960. Studies in Mathematical Psychology: I. Probabilistic Models for Some Intelligence and Attainment Tests. Nielsen & Lydiche.
Revelle, William. 2018. psych: Procedures for Psychological, Psychometric, and Personality Research. https://CRAN.R-project.org/package=psych.
Rijmen, Frank, Francis Tuerlinckx, Paul De Boeck, and Peter Kuppens. 2003. “A Nonlinear Mixed Model Framework for Item Response Theory.” Psychological Methods 8 (2): 185. https://doi.org/10.1037/1082-989X.8.2.185.
Rizopoulos, Dimitris. 2006. “ltm: An R Package for Latent Variable Modelling and Item Response Theory Analyses.” Journal of Statistical Software 17 (5): 1–25. https://doi.org/10.18637/jss.v017.i05.
Rosseel, Yves. 2012. “lavaan: An R Package for Structural Equation Modeling.” Journal of Statistical Software 48 (2): 1–36. https://doi.org/10.18637/jss.v048.i02.
Samejima, Fumiko. 1969. “Estimation of Latent Ability Using a Response Pattern of Graded Scores.” Psychometrika 34 (1): 1–97. https://doi.org/10.1007/BF03372160.
Sarkar, Deepayan, and Felix Andrews. 2016. latticeExtra: Extra Graphical Utilities Based on Lattice. https://CRAN.R-project.org/package=latticeExtra.
Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2): 461–64. https://doi.org/10.1214/aos/1176344136.
Sievert, Carson. 2018. plotly for R. https://plotly-book.cpsievert.me.
Swaminathan, Hariharan, and H Jane Rogers. 1990. “Detecting Differential Item Functioning Using Logistic Regression Procedures.” Journal of Educational Measurement 27 (4): 361–70. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x.
van der Linden, Wim J. 2017. Handbook of Item Response Theory. Chapman and Hall/CRC.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. 4th ed. New York: Springer-Verlag. http://www.stats.ox.ac.uk/pub/MASS4.
Wei, Taiyun, and Viliam Simko. 2017. corrplot: Visualization of a Correlation Matrix. https://CRAN.R-project.org/package=corrplot.
Wickham, Hadley. 2007. “Reshaping Data with the reshape Package.” Journal of Statistical Software 21 (12): 1–20. https://doi.org/10.18637/jss.v021.i12.
———. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag. http://ggplot2.org.
———. 2018. stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
Wilke, Claus O. 2018. cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. https://CRAN.R-project.org/package=cowplot.
Willse, John T. 2018. CTT: Classical Test Theory Functions. https://CRAN.R-project.org/package=CTT.
Xie, Yihui. 2018. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.
Xie, Yihui, Joe Cheng, and Xianying Tan. 2018. DT: A Wrapper of the JavaScript Library ‘DataTables’. https://CRAN.R-project.org/package=DT.
Zinbarg, Richard E, William Revelle, Iftah Yovel, and Wen Li. 2005. “Cronbach’s \(\alpha\), Revelle’s \(\beta\), and McDonald’s \(\omega_H\): Their Relations with Each Other and Two Alternative Conceptualizations of Reliability.” Psychometrika 70 (1): 123–33. https://doi.org/10.1007/s11336-003-0974-7.