Proportions are actually not raw data: a proportion is the number of
one type of response (typically called a success) over the total number
of responses (the other responses being collectively called
failures). As such, a proportion is a summary
statistic, a bit like the mean is a summary statistic of continuous
data.
Very often, successes are coded using the digit
1 and failures using the digit
0. When this is the case, computing the mean is actually
the same as computing the proportion of successes. However, it is a
conceptual mistake to think of proportions as means, because they must
be processed completely differently from averages. For example,
standard errors and confidence intervals for proportions are obtained
using very different procedures than standard errors and confidence
intervals for the mean.
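The numerical equivalence between the mean of 0/1 codes and the proportion of successes can be checked in base R. This is only a minimal sketch: the vector name scores is hypothetical, and its values mirror the First/Easy group used later in this vignette (11 successes out of 12).

```r
# Hypothetical 0/1 scores: 11 successes and 1 failure, as in the First/Easy group.
scores <- c(rep(1, 11), 0)

# The proportion of successes is the number of 1s over the number of observations.
prop_success <- sum(scores == 1) / length(scores)

# With 0/1 coding, the arithmetic mean returns the same number.
mean_score <- mean(scores)

print(c(proportion = prop_success, mean = mean_score))
```

The two values coincide, but this does not make the proportion a mean: its standard error and confidence interval still require procedures specific to proportions.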
In this vignette, we review various ways that data can be coded in a data frame. In a nutshell, there are three ways to represent successes and failures: Wide, Long, and Compiled. The first two show raw scores whereas the last shows a summary of the data.
Before we begin, we load the package ANOPA (if it is not
present on your computer, first install it from CRAN or
from the source repository with
devtools::install_github("dcousin3/ANOPA")):
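A minimal sketch of that loading step, assuming CRAN is reachable from your session. The conditional installation is an addition for convenience, not something the package requires.

```r
# Install ANOPA from CRAN only if it is not already available (assumption:
# the CRAN version is the one wanted; use devtools::install_github() otherwise).
if (!requireNamespace("ANOPA", quietly = TRUE)) {
  install.packages("ANOPA")
}

# Attach the package so that anopa(), toWide(), etc. are available.
library(ANOPA)
```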
In the wide format, there is one line per subject and one column
for each measurement. The columns contain only 1s
(success) or 0s (failure).
If the participants were measured multiple times, there are one or more within-subject factors, resulting in multiple columns of measurements. In a between-group design, there is only a single column of scores.
As an example, consider the following data for a between-subject design with two factors, Class (2 levels) and Difficulty (3 levels), for a total of 6 groups. Each group has the same number of participants, 12, for a total of 72 participants.
##    Class Difficulty success
## 1  First       Easy       1
## 2  First       Easy       1
## 3  First       Easy       1
## 4  First       Easy       1
## 5  First       Easy       1
## 6  First       Easy       1
## 7  First       Easy       1
## 8  First       Easy       1
## 9  First       Easy       1
## 10 First       Easy       1
## 11 First       Easy       1
## 12 First       Easy       0
## 13 First   Moderate       1
## 14 First   Moderate       1
## 15 First   Moderate       1
## 16 First   Moderate       1
## 17 First   Moderate       1
## 18 First   Moderate       1
## 19 First   Moderate       1
## 20 First   Moderate       1
## 21 First   Moderate       1
## 22 First   Moderate       0
## 23 First   Moderate       0
## 24 First   Moderate       0
## 25 First  Difficult       1
## 26 First  Difficult       1
## 27 First  Difficult       1
## 28 First  Difficult       1
## 29 First  Difficult       1
## 30 First  Difficult       1
## 31 First  Difficult       0
## 32 First  Difficult       0
## 33 First  Difficult       0
## 34 First  Difficult       0
## 35 First  Difficult       0
## 36 First  Difficult       0
## 37  Last       Easy       1
## 38  Last       Easy       1
## 39  Last       Easy       1
## 40  Last       Easy       1
## 41  Last       Easy       1
## 42  Last       Easy       1
## 43  Last       Easy       1
## 44  Last       Easy       1
## 45  Last       Easy       1
## 46  Last       Easy       1
## 47  Last       Easy       0
## 48  Last       Easy       0
## 49  Last   Moderate       1
## 50  Last   Moderate       1
## 51  Last   Moderate       1
## 52  Last   Moderate       1
## 53  Last   Moderate       1
## 54  Last   Moderate       1
## 55  Last   Moderate       1
## 56  Last   Moderate       1
## 57  Last   Moderate       0
## 58  Last   Moderate       0
## 59  Last   Moderate       0
## 60  Last   Moderate       0
## 61  Last  Difficult       1
## 62  Last  Difficult       1
## 63  Last  Difficult       1
## 64  Last  Difficult       0
## 65  Last  Difficult       0
## 66  Last  Difficult       0
## 67  Last  Difficult       0
## 68  Last  Difficult       0
## 69  Last  Difficult       0
## 70  Last  Difficult       0
## 71  Last  Difficult       0
## 72  Last  Difficult       0
When the data are in a wide format, the formula in
anopa() must give the column(s) where the successes/failures
are stored before the usual ~ and the conditions after it, as in
success ~ Class * Difficulty applied to dataWide1.
(How dataWide1 was obtained is shown in the section Converting between formats below.)
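The call for the wide, between-subject case can be sketched as follows. The data frame dataWide1 is rebuilt here with the same calls as in the conversion section so that the block runs on its own; the object name w1Wide is hypothetical.

```r
library(ANOPA)

# Rebuild the wide data frame from the compiled example shipped with the package.
w1        <- anopa( {success; total} ~ Class * Difficulty, twoWayExample )
dataWide1 <- toWide(w1)

# Wide format: the 0/1 column on the left of ~, the between-subject factors on the right.
w1Wide <- anopa( success ~ Class * Difficulty, dataWide1 )
summary(w1Wide)
```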
As another example, consider the following data obtained in a
mixed (within- and between-subject) design. It has a between-subject factor
Status with 8, 9 and 10 participants per group, respectively.
It also has four repeated measures, bpre,
bpost, b1week and b5week, which
represent four different Moments of measurement. The data frame is:
##      Status bpre bpost b1week b5week
## 1    Broken    1     1      1      0
## 2    Broken    1     1      0      0
## 3    Broken    0     0      1      1
## 4    Broken    1     1      1      1
## 5    Broken    0     0      1      1
## 6    Broken    1     0      1      1
## 7    Broken    1     1      0      1
## 8    Broken    0     1      1      0
## 9  Repaired    1     1      0      0
## 10 Repaired    0     1      0      1
## 11 Repaired    1     1      0      0
## 12 Repaired    0     0      1      0
## 13 Repaired    0     0      0      0
## 14 Repaired    1     0      0      0
## 15 Repaired    0     0      0      0
## 16 Repaired    0     0      0      1
## 17 Repaired    0     0      1      0
## 18      New    0     0      0      1
## 19      New    0     0      1      0
## 20      New    0     0      0      0
## 21      New    0     0      1      0
## 22      New    0     0      0      0
## 23      New    0     1      0      0
## 24      New    0     0      1      0
## 25      New    1     1      0      0
## 26      New    0     0      0      1
## 27      New    1     1      1      0
The formula for analyzing these data in this format is
cbind(bpre, bpost, b1week, b5week) ~ Status, together with the argument WSFactors = "Moment(4)".
It is necessary to (a) group all the measurement columns using
cbind(); (b) indicate the within-subject factor(s) using
the argument WSFactors, giving each factor's name and its number of levels
in a string.
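Points (a) and (b) can be sketched with the call used later in this vignette to create w2. It is applied here directly to minimalMxExample, which the vignette states is shipped in wide form.

```r
library(ANOPA)

# (a) cbind() groups the four repeated-measure columns;
# (b) WSFactors names the within-subject factor and gives its number of levels.
w2 <- anopa( cbind(bpre, bpost, b1week, b5week) ~ Status,
             minimalMxExample,
             WSFactors = "Moment(4)" )
summary(w2)
```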
Alternatively, cbind() can be replaced by
crange(), given the first and last variables to be bound. The
variables located between them are taken from the
data frame.
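A sketch of the same analysis written with crange(), modeled on the compiled-data call shown further down. The object name w2bis is hypothetical, and the block assumes that bpre through b5week are adjacent columns of minimalMxExample, as in the listing above.

```r
library(ANOPA)

# crange(first, last) stands for all the columns from bpre to b5week inclusive.
w2bis <- anopa( crange(bpre, b5week) ~ Status,
                minimalMxExample,
                WSFactors = "Moment(4)" )
summary(w2bis)
```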
The long format may be preferred by linear modelers (but it may rapidly become very long!). There are always at least these columns: one Id column, one column to indicate a within-subject level, and one column to indicate the observed score. On the other hand, this format has fewer columns in repeated-measure designs.
This example shows the first 6 lines of the 2-factor between design data above, stored in the long format.
##   Id Class Difficulty Variable Value
## 1  1 First       Easy  success     1
## 2  2 First       Easy  success     1
## 3  3 First       Easy  success     1
## 4  4 First       Easy  success     1
## 5  5 First       Easy  success     1
## 6  6 First       Easy  success     1
To analyse data in this format with anopa(), use a formula that ends with | Id.
The vertical line symbol indicates that the observations are nested
within Id (i.e., all the lines with the same Id belong to
the same subject).
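The following sketch shows one way such a call may be written. The exact formula is an assumption based on the column names of the listing above (Value, Variable, Id), and the object name w1Long is hypothetical; check the documentation of anopa() before relying on it.

```r
library(ANOPA)

# Rebuild the long data frame from the compiled example shipped with the package.
w1        <- anopa( {success; total} ~ Class * Difficulty, twoWayExample )
dataLong1 <- toLong(w1)

# Assumed long-format formula: the score column on the left, the factors and the
# Variable column on the right, and "| Id" to nest the observations within subjects.
w1Long <- anopa( Value ~ Class * Difficulty * Variable | Id, dataLong1 )
summary(w1Long)
```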
With the mixed design described above, the data begin as:
##   Id Status Variable Value
## 1  1 Broken     bpre     1
## 2  1 Broken    bpost     1
## 3  1 Broken   b1week     1
## 4  1 Broken   b5week     0
## 5  2 Broken     bpre     1
## 6  2 Broken    bpost     1
and are analyzed with a formula of the same kind, with the argument WSFactors added.
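A sketch for the mixed design in long format. As before, the formula is an assumption derived from the column names shown above, and w2Long is a hypothetical name; WSFactors is needed because there is more than one measurement per subject.

```r
library(ANOPA)

# Rebuild the long data frame of the mixed design from the wide example.
w2 <- anopa( cbind(bpre, bpost, b1week, b5week) ~ Status,
             minimalMxExample, WSFactors = "Moment(4)" )
dataLong2 <- toLong(w2)

# Assumed long-format call: Variable identifies the repeated measure and
# "| Id" groups the four lines that belong to the same participant.
w2Long <- anopa( Value ~ Status * Variable | Id, dataLong2,
                 WSFactors = "Moment(4)" )
summary(w2Long)
```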
The third format is compiled, in the sense that the 0s and 1s have been replaced by a single count of successes for each cell of the design. Hence, we no longer have access to the raw data. This format, however, has the advantage of being very compact, requiring few lines. Here are the data for the example with two between-subject factors:
##   Class Difficulty success Count
## 1 First  Difficult       6    12
## 2 First       Easy      11    12
## 3 First   Moderate       9    12
## 4  Last  Difficult       3    12
## 5  Last       Easy      10    12
## 6  Last   Moderate       8    12
To use a compiled format in anopa(), use a formula whose left-hand side is written {success; Count},
where success identifies the column in which the total numbers
of successes are stored, and Count the column giving the total number of
observations in each cell. The notation {s;n} is read
s over n (note the curly braces and semicolon).
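The {s;n} notation can be sketched in two steps: first on the shipped data set twoWayExample, whose total column is named total (as in the conversion section of this vignette), then on the converted data frame, where that column is named Count. The object name w1Compiled is hypothetical.

```r
library(ANOPA)

# Compiled data as shipped with the package: successes in "success", totals in "total".
w1 <- anopa( {success; total} ~ Class * Difficulty, twoWayExample )

# After conversion with toCompiled(), the total column is named Count.
dataCompiled1 <- toCompiled(w1)
w1Compiled <- anopa( {success; Count} ~ Class * Difficulty, dataCompiled1 )
summary(w1Compiled)
```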
For the mixed design presented earlier, the data look like:
##     Status bpre bpost b1week b5week Count      uAlpha
## 1   Broken    5     5      6      5     8 -0.15204678
## 2      New    2     3      4      2    10 -0.03463203
## 3 Repaired    3     3      2      2     9 -0.10416667
where there are columns giving the number of successes for each repeated
measure. A new column appears, uAlpha. This column (called
unitary alpha) is a measure of correlation (between -1 and +1).
In this fictitious example, the correlations are near zero (slightly negative,
actually) by chance, as the data were generated randomly.
To run an ANOPA on compiled data having repeated measures, use
w2Compiled <- anopa( {cbind(bpre, bpost, b1week, b5week); Count; uAlpha} ~ Status, 
                    dataCompiled2, WSFactors = "Week(4)")
summary(w2Compiled)
##                      MS  df        F   pvalue correction    Fcorr pvalcorr
## Week           0.006472   3 0.221708 0.881375   1.030864 0.215070 0.886009
## Status         0.174546   2 6.583566 0.001383   1.018673 6.462886 0.001560
## Week:Status    0.005486   6 0.187927 0.980311   1.445036 0.130050 0.992591
## Error(within)  0.029192 Inf                                               
## Error(between) 0.026512 Inf
where cbind() lists all the within-subject success count
columns, Count is the column in the data.frame with the total
number of observations, and uAlpha is the column containing the
mean pairwise correlation measured with the unitary alpha.
Here again, crange() can be used in place of
cbind() with
w2Compiledbis <- anopa( {crange(bpre, b5week); Count; uAlpha} ~ Status, 
                    dataCompiled2, WSFactors = "Week(4)")
summary(w2Compiledbis)
##                      MS  df        F   pvalue correction    Fcorr pvalcorr
## Week           0.006472   3 0.221708 0.881375   1.030864 0.215070 0.886009
## Status         0.174546   2 6.583566 0.001383   1.018673 6.462886 0.001560
## Week:Status    0.005486   6 0.187927 0.980311   1.445036 0.130050 0.992591
## Error(within)  0.029192 Inf                                               
## Error(between) 0.026512 Inf
Once the data are entered in an anopa() object, they can be
converted to any format using toWide(),
toCompiled() and toLong(). For example:
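A sketch of conversion calls of this kind, assuming the two listings that follow come from toCompiled() applied to the two example objects. The objects w1 and w2 are created as in the next section so that the block runs on its own.

```r
library(ANOPA)

# The two example analyses used throughout this vignette.
w1 <- anopa( {success; total} ~ Class * Difficulty, twoWayExample )
w2 <- anopa( cbind(bpre, bpost, b1week, b5week) ~ Status,
             minimalMxExample, WSFactors = "Moment(4)" )

# Convert each one to the compiled format: one line per cell of the design.
toCompiled(w1)
toCompiled(w2)
```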
##   Class Difficulty success Count
## 1 First  Difficult       6    12
## 2 First       Easy      11    12
## 3 First   Moderate       9    12
## 4  Last  Difficult       3    12
## 5  Last       Easy      10    12
## 6  Last   Moderate       8    12
##     Status bpre bpost b1week b5week Count      uAlpha
## 1   Broken    5     5      6      5     8 -0.15204678
## 2      New    2     3      4      2    10 -0.03463203
## 3 Repaired    3     3      2      2     9 -0.10416667
The compiled format is probably the most compact format, but the wide format is the most explicit format (as we see all the subjects and their scores on a single line, one subject per line).
Above, we used two examples. They are available in this package under
the names twoWayExample and minimalMxExample.
The first is available in compiled form, the second in wide form.
We converted these data sets to the other formats using:
w1 <- anopa( {success;total} ~ Class * Difficulty, twoWayExample)
dataWide1     <- toWide(w1)
dataCompiled1 <-toCompiled(w1)
dataLong1     <- toLong(w1)
w2 <- anopa( cbind(bpre, bpost, b1week, b5week) ~ Status, minimalMxExample, WSFactors = "Moment(4)")
dataWide2     <- toWide(w2)
dataCompiled2 <-toCompiled(w2)
dataLong2     <- toLong(w2)
One limitation concerns repeated measures: it is not
possible to guess the names of the within-subject factors from the names
of the columns. This is why, as soon as there is more than one
measurement, the argument WSFactors must be added.
Suppose a two-way within-subject design with 2 x 3 levels. The data
set twoWayWithinExample has 6 columns; the first three are
for factor A, level 1, and the last three are for factor A, level 2.
Within each triplet of columns, factor B goes from 1 to 3.
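The column ordering just described, with B changing fastest and A slowest, can be enumerated in base R. This is only an illustration: the object name design is hypothetical, and the column names r11 to r23 are those of twoWayWithinExample used in the call that follows.

```r
# expand.grid() varies its first argument fastest, so B cycles through 1..3
# within each level of A, which matches the layout of the six columns.
design <- expand.grid(B = 1:3, A = 1:2)

# Column names follow the pattern r<A><B>.
design$Variable <- paste0("r", design$A, design$B)

print(design)
```

Listing the factors in WSFactors in this same fastest-first order, B then A, is what makes the columns line up with the intended levels.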
w3 <- anopa( cbind(r11,r12,r13,r21,r22,r23) ~ . , 
             twoWayWithinExample, 
             WSFactors = c("B(3)","A(2)") 
            )
## ANOPA::fyi: Here is how the within-subject variables are understood:
##  B A Variable
##  1 1      r11
##  2 1      r12
##  3 1      r13
##  1 2      r21
##  2 2      r22
##  3 2      r23
##   r11 r12 r13 r21 r22 r23 Count     uAlpha
## 1  14   6   8  14  16  14    30 0.08223684
A “fyi” message is shown which lets you see how the variables are
interpreted. Take the time to verify that the order of the variables
within cbind() does match the expected order from
anopa(). Note that FYI messages can be inhibited by
changing the option
To learn more about analyzing proportions with ANOPA, refer to Laurencelle & Cousineau (2023) or to the vignette What is an ANOPA?