## simsalRbim - A package for preference test simulations

Preference tests are a valuable tool to measure the “wants” of individuals and have been proven to be a valid method to rate different commodities. The number of commodities presented at the same time is, however, limited and in classical test settings usually, only two options are presented. In our paper (add reference), we evaluate the option of combining multiple binary choices to rank preferences among a larger number of commodities. The simsalRbim package offers the necessarytools to test selections of commodities and to obtain an estimate of new or incompletely tested items and their relative position.

### Data availability

The package contains an artificial set of data, ZickeZacke, which we use as an example to show the functionality of the package. In addition experimental data from six different preference tests can be downloaded from a separate GitHub repository and can be used directly in this package (e.g., with the bimload function).

### Dependencies

simsalRbim was developed on R (v4.0.3). It depends on the following packages (in no particular order, excluding R base packages) and some of them may have to be installed manually. (If you did not yet install the package devtools, remove the hashtag in the first line.)

“dplyr”[1]
“ggplot2”[2]
“gnm”[3]
“prefmod”[4]
“reshape2”[5]
“rlang”[6]
“stringr”[7]
“viridis”[8]
“ggrepel”[9]

The following function can be used to install single packages - or just the missing ones from CRAN.

install.packages("paste missing package name here")


### simsalRbim

The development version of the simsalRbim package can be downloaded from GitHub with the following command.

# install.packages("devtools")
devtools::install_github("mytalbot/simsalRbim@main")
library(simsalRbim)

A CRAN version may be available soon.

## General concept of the simsalRbim package

Data from a series of binary preference tests may be incomplete, may show ties or new item positions shall be tested without the need of testing ALL possible combinations. The simsalRbim package can address all three cases by using informed and uninformed simulations to find the optimal item positions in the data. Uninformed or informed simulations refer to whether there is knowledge about intransitivity in the ranking. For details see the simsalRbim Examples Vignette.

Preference test data usually come in two formats: quantities (i.e., a subject consumes 200 ml of orange juice but only 17 ml of milk) or binary data (i.e., “I like Star Trek better than Star Wars”). The bimload function in the package can load both formats, including the side the item is presented at as a test variable. The side information can be examined with the bimbalance function. However, if a preference test is showing significant side dependencies (i.e., milk is preferred over orange juice if it is presented in a bottle positioned on the left side but not if it is presented in the right bottle), the simulated outcome will be biased. The current package can highlight such experimental imbalances, which should be addressed using a different experimental design.

In case of present ties (A = B), the binary response variable is randomized. By default this happens at a 50% likelihood for an equal commodity selection. However, the user may specify other thresholds for considering a case as a tie, for example, a preference can only be deemed meaningful if one item is chosen at >65%.

Missing item tests and combinations are filled as ties and then simulated in the two simulation functions. In the uninformed simulation, data are continuously randomized and analyzed. With each additional simulation, the number of degrees of freedom decreases and the rank position of the items will become more confident, At a given number of degrees of freedom this will stabilize the 95% confidence intervals on the worth scale. Thus, item positions can be obtained with a reasonable confidence without having to include tests on all possible item combinations. If you assume that the data is following the supposition of transitivity (i.e., if A>B and B>C then A>C) then the informed simulation can help to derive a ranking with higher confidence. In the informed simulation, the number of intransitive relationships within the simulated rank order is also calculated. The number of intransitive relationships in the results can be limited to any percentage between 0% and 100%. Thereby, the item of interest can be placed on the worth scale at much higher confidence.

## Data

The package includes example data (ZickeZacke) that can be used to explore the packages functions and gives an example of how the input data must be formatted. Please note, that the item ‘HoiHoiHoi’ deliberately introduces variance in the data as it is (i) incomplete, (ii) introduces a tie, and (iii) introduces an intransitive relationship. Without this item it would be perfectly balanced and complete data which do not need a simulation.

head(simsalRbim::ZickeZacke)
#>   subjectID optionA optionB quantityA quantityB
#> 1      eins   Zicke   Zacke        50        40
#> 2      zwei   Zicke   Zacke        30        10
#> 3      drei   Zicke   Zacke        80         5
#> 4      vier   Zicke   Zacke        66        15
#> 5      eins Huehner   Kacke        30        20
#> 6      zwei Huehner   Kacke        50        40

The input table must have the following columns:

subjectID - a unique identifier for each subject in the preference test
optionA - item A
optionB - item B
quantityA - quantitiy of item A (can also be binary 0 or 1)
quantityB - quantitiy of item B (can also be binary 0 or 1)
side - side of option A (this column is optional)

## Functions

# bimload - load function allows easy loading of data.
bimload(filename)

The function takes a filename (*.txt) with quantitative or binary test data in the format shown the Data section above.

### bimpre

# bimpre - will take care of the data preprocessing.
bimpre(dat=NULL, GT=NULL, simOpt=NULL, deviation=0, minQuantity=0,
verbose=TRUE)

The function takes the data object (dat) from bimload, a parameter defining the ground truth (GT) as well as the item that requires simulation (simOpt). The deviation and minQuantity arguments can be used to modify the threshold of the tie selection [e.g, 50% + deviation; with deviation = amount of additional deviation in percent (i.e, 50% + 5% deviation = 55%)].

### bimworth

# bimworth is the central function of the package doing the worth calculations.
bimworth(ydata=NULL, GT=NULL, simOpt=NULL, randOP=FALSE, intrans=FALSE,
showPlot=FALSE, ylim=c(0,0.8), size = 5, verbose=FALSE)

The ydata object takes the output from the preprocessing function (bimpre) as well as the ground truth (GT) and the simulated item (simOpt). With the randOP argument the randomization process can be controlled. If randOP=TRUE (default; if FALSE, the random seed will be fixed) ties are randomized each time the function is executed. The intrans argument controls whether the intransitivity of the items shall be computed.

### bimeval

# bimeval evaluates the consensus error in the worth calculations
bimeval(ydata=NULL, GT=NULL, simOpt=NULL, worth= NULL, coverage=0.8,
showPlot=FALSE, filtersim=FALSE, title="Consensus Analysis",subtitle=NULL,
ylim=c(0,1))

The first three input objects are the same as in bimworth. Additionally, the function requires the output from bimworth for the evaluation. Be aware that the bimworth output can be a list when intrans=TRUE. The coverage object defines an arbitrary threshold (default=0.8, ratio of tested subjects/total subjects) for data coverage warnings. There can be two warnings: number of subjects and number of items warnings. The graphical output can be controlled with the showPlot object.

### bimUninformed

# bimUninformed does an uninformed simulation of item positions to find a
# cutoff.
bimUninformed(ydata=NULL, GT=NULL, simOpt=NULL, limitToRun=5, seed=TRUE,
showPlot=TRUE, ylim=c(-0.5,1.5) )

The function has the same first three input parameters as bimworth. In addition, the limitToRun object controls the number of randomizations in the worth calculations. The seeding can be set constant (seed). The output indicates an optimal cutoff for the number of required randomizations, e.g., for the separation of items, and, therefore, potential item positioning.

The function also plots the results with 95% confidence intervals of the adjusted p-values, derived from a post-hoc ANOVA with multiple comparisons (Tukey) test. There will be perfect item separation when the plotted value reaches zero. Any remaining variance indicates potential overlaps in the 95% confidence intervals. A less stringent (but less confident) threshold can be obtained when the upper level of the 95% confidence intervals falls below (0.05, see green points in the plot). This may be useful when calculation times are long. There will be a warning, if the function did not converge to a threshold.

### bimpos

# bimpos - calculates item positions at a discrete number of randomizations.
bimpos(ydata=NULL, GT=NULL, simOpt=NULL, limitToRun=5, showPlot=TRUE)

This function is very similar to bimUninformed and uses the same input arguments. Contrary to bimUninformed, however, it will not explore randomization space but rather uses a distinct cutoff to show the items’ relative position. The query item (simOpt) is shown in “red” in the plot. All items will be shown on the worth scale, together with 95% confidence intervals. Any overlap indicates potential ambiguity in item positioning.

### bimsim

# bimsim - informed position simulation
bimsim(rawdat=NULL, GT=GT, simOpt=simOpt, filter.crit="CE", limitToRun=5,
tcut=0.9, deviation=0, minQuantity=0, seed=TRUE, showPlot=TRUE,ylim=c(0,0.7))

The bimsim function performs an informed position simulation of variable preference test data. The limitToRun argument from the bimUninformed output can be used as an estimate for the required number of randomizations. The tcut argument sets the cutoff for showing the consensus error (%, CE) of the simulated results in the plot OR the transitivity ratios (1-Iratio; with Iratio indicating the number of intransitive triplets per total number of triplets). The selection depends on the filter.crit parameter (“CE” or “Iratio”), i.e., tcut=0.95 results in a 95% cutoff. The more stringent (= higher tcut in case of transitivity ratios) the criterion, the more informed the simulation becomes. The output is a frequency distribution of simOpts preferred position at the given tcut. The plot shows the worth value of the simOpt item as a function of its position. The bubblesizes code for the consensus error/transitivity ratio.

### simsim

# simsim(data = NULL, simOpt = NULL, GT = NULL, seeding = TRUE, runs = NULL,
truepos = NULL, verbose = FALSE, path = NULL)

The simsim function will simulate an item (simOpt) against any combinations given in the Ground Truth (GT). A constant seeding ensures that the results remain stable. The number of simulation runs can be set with the runs object. However, this number is data-dependent and, therefore, a heuristic. When the true position of the simulated item is known (or shall be tested), the truepos object is used to indicate that position. The verbose object silences outputs from the bimpre function. When the path is given, the resulting table from the simulation will be stored as a *.txt file.

### bimristics

# bimristics - data characteristics
bimristics(predat=NULL, simOpt=NULL)

The bimristics functions gives an overview about conditional data characteristics. The simOpt object determines the subgroup for which the conditions are checked (e.g., the number of tested subjects, the number of natural ties, the number of simulated items, etc.)

### bimbalance

# bimbalance - data characteristics
bimbalance(dat=NULL, sidevar="sideA")

The bimbalance determines how balanced the items were tested in terms of side. Items can be presented left (L) or right (R) - or in some other binary logic. If the data are somehow tested in an imbalanced way, this may introduce bias. For each (included!) item combination, the output shows the number of counts on the right (R) and left (L) side. Also, there is a ratio output which is obtained by dividing the counts(R) by counts(L).

## References

1. Wickham H, François R, Henry L, Müller K: Dplyr: A grammar of data manipulation. 2020.

2. Wickham H: Ggplot2: Elegant graphics for data analysis. Springer-Verlag New York; 2016.

3. Turner H, Firth D: Generalized nonlinear models in r: An overview of the gnm package. 2020.

4. Hatzinger R, Dittrich R: prefmod: An R Package for Modeling Preferences Based on Paired Comparisons, Rankings, or Ratings. Journal of Statistical Software 2012, 48:1–31.

5. Wickham H: Reshaping data with the reshape package. Journal of Statistical Software 2007, 21:1–20.

6. Henry L, Wickham H: Rlang: Functions for base types and core r and ’tidyverse’ features. 2020.

7. Wickham H: Stringr: Simple, consistent wrappers for common string operations. 2019.

8. Garnier S: Viridis: Default color maps from ’matplotlib’. 2018.

9. Slowikowski K: Ggrepel: Automatically position non-overlapping text labels with ’ggplot2’. 2021.