The evola package is nice wrapper of the AlphaSimR package that enables the use of the evolutionary algorithm (EA) to solve complex questions in a simple manner.
The vignettes aim to provide several examples in how to use the evola package under different optimization scenarios. We will spend the rest of the space providing examples for:
Because of CRAN requirements I will only run few generations but please when you run your analysis let it run for many generations.
The example presented here is a list of gems (Color) that have different weights in Kg (Weight) and a given value (Value).
set.seed(1)
# Data
Gems <- data.frame(
  Color = c("Red", "Blue", "Purple", "Orange",
            "Green", "Pink", "White", "Black", 
            "Yellow"),
  Weight = round(runif(9,0.5,5),2),
  Value = round(abs(rnorm(9,0,5))+0.5,2)
)
head(Gems)##    Color Weight Value
## 1    Red   1.69  8.20
## 2   Blue   2.17  5.14
## 3 Purple   3.08  1.97
## 4 Orange   4.59  0.53
## 5  Green   1.41 12.52
## 6   Pink   4.54  4.32The task to optimize here is to be able to pick in your bag all the possible gems (explanatory variable) that maximize the Value (response variable) with the constraint (Weight) that your bag would break after 10Kg. In the evolafit function this would be specified as follows:
# Task: Gem selection. 
# Aim: Get highest combined value.
# Restriction: Max weight of the gem combined = 10. 
res0<-evolafit(cbind(Weight,Value)~Color, dt= Gems,
               # constraints: if greater than this ignore
               constraintsUB = c(10,Inf), 
               # constraints: if smaller than this ignore
               constraintsLB= c(-Inf,-Inf), 
               # weight the traits for the selection
               b = c(0,1), 
               # population parameters
               nCrosses = 100, nProgeny = 20, recombGens = 1, 
               # coancestry parameters
               D=NULL, lambda=0, nQtlStart = 1, 
               # selection parameters
               propSelBetween = .9, propSelWithin =0.9, 
               nGenerations = 15, verbose = FALSE
) 
pmonitor(res0)Notice that the formula cbind(Weight,Value)~Color specifies the traits to be considered in the optimization problem and are indicated in the left side of the formula whereas the right side of the formula specifies the term corresponding to the genes that will form the ‘genome’ of the possible solutions (progeny). Each trait in the formula requires a value for the constraints, weights in the fitness function (e.g., a selection index or any other customized fitness function). Please notice that the default fitness function is a classical base selection index. In this example only Value contributes to the fitness and Weight is purely used as a constraint. Lambda (weight for the group relationship between the genes in the genome; equivalent to the linkage disequilibrium). The rest of the parameters are the parameters controlling the evolution of the population of solutions.
When looking at the results of the evolution we can observe that the best solution for the traits under the contraints can be extracted with the bestSol() function.
## Value 
##    51##    Red   Blue Purple Orange  Green   Pink  White  Black Yellow 
##      1      1      0      0      1      0      0      1      0# value and weight for the selected solution 
qa = Q[best,] %*% as.matrix(Gems[,c("Weight","Value")]); qa##      Weight Value
## [1,]   8.74  32.1The best selection of Gems was the one one found in the M element of the resulting object.
One situation that occurs in plant and animal breeding is the so called ‘optimal contribution’ problem where we want to pick a set of parents that can maximize the gain while managing genetic variance as much as possible. In the following example we take a population of 363 possible parents (which will become the genes) and pick the best 20 while conserving genetic variance (group relationship).
##        id Row Col Year      color  Yield FruitAver Firmness Rowf Colf occ
## P003 P003   3   1 2014 0.10075269 154.67     41.93  588.917    3    1   0
## P004 P004   4   1 2014 0.13891940 186.77     58.79  640.031    4    1   1
## P005 P005   5   1 2014 0.08681502  80.21     48.16  671.523    5    1   1
## P006 P006   6   1 2014 0.13408561 202.96     48.24  687.172    6    1   1
## P007 P007   7   1 2014 0.13519278 174.74     45.83  601.322    7    1   1
## P008 P008   8   1 2014 0.17406685 194.16     44.63  656.379    8    1   1Our surrogate of fitness will be the Yield trait and we will have a second trait to control the number of individuals we can select. We will set a constraint for the occurrence (occ) trait to 20 but the only trait contributing to fitness will be Yield (using the b argument).
# get best 20 individuals weighting variance by 0.5
res<-evolafit(cbind(Yield, occ)~id, dt= DT, 
              # constraints: if sum is greater than this ignore 
              constraintsUB = c(Inf,20), 
              # constraints: if sum is smaller than this ignore
              constraintsLB= c(-Inf,-Inf), 
              # weight the traits for the selection
              b = c(1,0), 
              # population parameters
              nCrosses = 100, nProgeny = 10, 
              # coancestry parameters
              D=A, lambda= (30*pi)/180 , nQtlStart = 2, 
              # selection parameters
              propSelBetween = 0.5, propSelWithin =0.5, 
              nGenerations = 15, verbose=FALSE) We then use the bestSol() function to extract the solution that maximized our fitness function and constraints.
Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2
best = bestSol(res$pop)[,"Yield"];
sum(Q[best,]) # total # of inds selected## [1] 20We can use the pmonitor() function to see if convergence was achieved between the best and the average solutions.
D variation of the same problem is when we want to pick the best crosses instead of the best parents to directly find the optimal solution for a crossing block. In the following example we use a dataset of crosses with marker and phenotype information to show how to optimize this problem.
##    hybrid dent flint     GY    GM      hy occ
## 1 518.298  518   298  -8.04 -0.85 518:298   0
## 2 518.305  518   305 -11.10  1.70 518:305   1
## 3 518.306  518   306 -16.85  2.24 518:306   1
## 4 518.316  518   316   2.08 -1.33 518:316   1
## 5 518.323  518   323   5.65 -2.71 518:323   1
## 6 518.327  518   327 -16.95 -0.52 518:327   1The way to specify this problem is exactly the same than with the optimization of parents but the input information is at the level of predicted crosses instead of individuals (genes).
# silent for CRAN checks restriction on vignettes time
# run the genetic algorithm
#   res<-evolafit(formula = c(GY, occ)~hy, dt= DT, 
#                 # constraints: if sum is greater than this ignore
#                 constraintsUB = c(Inf,100), 
#                 # constraints: if sum is smaller than this ignore
#                 constraintsLB= c(-Inf,-Inf),
#                 # weight the traits for the selection
#                 b = c(1,0), 
#                 # population parameters
#                 nCrosses = 100, nProgeny = 10, 
#                 # coancestry parameters
#                 D=D, lambda= (20*pi)/180 , nQtlStart = 100, 
#                 # selection parameters
#                 propSelBetween = 0.5, propSelWithin =0.5, 
#                 nGenerations = 15, verbose=FALSE) 
# 
# Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2
# best = bestSol(res$pop)[1,1]
# sum(Q[best,]) # total # of inds selectedYou can use the pmonitor() or pareto() functions to see the evolution of the solution and see the performance of the solution selected.
Notice that we have maximized te GY variable which is the pure phenotype, but alternatively you could use the GEBVs to get crosses with maximum GEBV or predict the TGV (total genetic value) to maximize the F1 performance (including the dominance).
One particular case when we want to pick a representative subsample is when we don’t have the resources to test everything (e.g., in the field/farm). In this example we use the information from 599 wheat lines to pick a subsample that maximizes the prediction accuracy for the entire sample. We start loading the data, in particular the phenotypes (DT) and the pedigree relationship matrix (D).
data(DT_wheat)
DT <- as.data.frame(DT_wheat)
DT$id <- rownames(DT) # IDs
DT$occ <- 1; DT$occ[1]=0 # to track occurrences
DT$dummy <- 1; DT$dummy[1]=0 # dummy trait
# if genomic
# GT <- GT_wheat + 1; rownames(GT) <- rownames(DT)
# D <-  GT%*%t(GT)
# D <- D/mean(diag(D))
# if pedigree
D <- A_wheatNow in order to pick a structured sample we will do a PCA and pick the cluster number 3 to be a subset to predict later (vp), while we will focus in rest of the population as candidates for the training set (tp).
##Perform eigenvalue decomposition for clustering
##And select cluster 5 as target set to predict
pcNum=25
svdWheat <- svd(D, nu = pcNum, nv = pcNum)
PCWheat <- D %*% svdWheat$v
rownames(PCWheat) <- rownames(D)
DistWheat <- dist(PCWheat)
TreeWheat <- cutree(hclust(DistWheat), k = 5 )
plot(PCWheat[,1], PCWheat[,2], col = TreeWheat, 
     pch = as.character(TreeWheat), xlab = "pc1", ylab = "pc2")## [1] 159Since the objective is to select a set of 100 lines that represent best the training set (tp) of ~400 lines we will subset a relationship matrix for that training set (As).
For this particular case there is no trait to optimize (x’a) but we just want to make sure that we maintain as much variation in the sample as possible (x’Ax). We then just create a dummy trait in the dataset (dummy) to put all the weight into the group relationship (x’Ax) using the lambda argument. The trait for occurrence we will use it as before to control the number of individuals to be in the sample.
res<-evolafit(cbind(dummy, occ)~id, dt= DT2, 
                # constraints: if sum is greater than this ignore 
                constraintsUB = c(Inf, 100), 
                # constraints: if sum is smaller than this ignore
                constraintsLB= c(-Inf, -Inf), 
                # weight the traits for the selection
                b = c(1,0), 
                # population parameters
                nCrosses = 100, nProgeny = 10, 
                # coancestry parameters
                D=As,
                lambda=(60*pi)/180, nQtlStart = 80, 
                # selection parameters
                propSelBetween = 0.5, propSelWithin =0.5, 
                nGenerations = 15, verbose = FALSE)
Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2
best = bestSol(res$pop)[,1]
sum(Q[best,]) # total # of inds selected## [1] 94You can see which individuals were selected.
cex <- rep(0.5,nrow(PCWheat))
names(cex) <- rownames(PCWheat)
cex[names(which(Q[best,]==1))]=2
plot(PCWheat[,1], PCWheat[,2], col = TreeWheat, cex=cex,
     pch = TreeWheat, xlab = "pc1", ylab = "pc2")we can use the covariance between the training population and the validation population to create a new trait (x’a) that can be used in addition to the group relationship (x’Ax).
The model can be specified as before with the suttle difference that the covariance between the training and validation population contributes to the fitness function.
res<-evolafit(cbind(cov, occ)~id, dt= DT2, 
                # constraints: if sum is greater than this ignore 
                constraintsUB = c(Inf, 100), 
                # constraints: if sum is smaller than this ignore
                constraintsLB= c(-Inf, -Inf), 
                # weight the traits for the selection
                b = c(1,0), 
                # population parameters
                nCrosses = 100, nProgeny = 10, 
                # coancestry parameters
                D=As,
                lambda=(60*pi)/180, nQtlStart = 80, 
                # selection parameters
                propSelBetween = 0.5, propSelWithin =0.5, 
                nGenerations = 15, verbose = FALSE)
Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2
best = bestSol(res$pop)[,1]
sum(Q[best,]) # total # of inds selected## [1] 77You can plot the final results and see which individuals were picked.
cex <- rep(0.5,nrow(PCWheat))
names(cex) <- rownames(PCWheat)
cex[names(which(Q[best,]==1))]=2
plot(PCWheat[,1], PCWheat[,2], col = TreeWheat, cex=cex,
     pch = TreeWheat, xlab = "pc1", ylab = "pc2")In this case is better if you only create the cross combinations that are possible (e.g., where male and female can couple) and you handed them to the evolutionary algorithm. That means, the rows of the crosses to be in the searching space only include the realistic ones.
In this case you can modify the fitness function to set to a low value the fitness of solutions that have used too many times the same parent. Using the DT_technow dataset this would be done the following way:
First you create an incidence matrix for parents in columns and hybrids in crosses:
data(DT_technow)
DT <- DT_technow
DT$occ <- 1; DT$occ[1]=0
M <- M_technow
D <- A.mat(M)
Z=with(DT,overlay(dent,flint) )#  Matrix::sparse.model.matrix(~dent-1, data=DT)
rownames(Z) <- DT$hy # needed to link to the QTL matrixthe secons step is to create a new fitness function for the genetic algorithm. Our objective function to be maximized is normally of the form Yb - d, where Y is the trait values, b is the trait weights, and d is the group relationship x’Ax. We then are going to put some additional constrait that parents of the crosses can’t show up more than twice. This can be done in the following way:
# regular fitness function
fitnessf <-function (Y, b, d, Q, Z) {
  fit <- Y %*% b - d
  return(fit)
}
# new fitness function with constraint
fitnessf <-function (Y, b, Q, D, a, lambda, scale=TRUE, Z) {
  X=Q%*%Z[colnames(Q),]
  bad <- as.vector( apply(X,1, function(x){length(which(x > 5))}) )
  bad <- which(bad > 0)
  # (q'a)b - l(q'Dq)
  if(scale){
    fit <- stan( apply(Y,2,scale) %*% b) -  lambda*stan( Matrix::diag(Q %*% Matrix::tcrossprod(D, Q)) )
  }else{
    fit <- stan( Y %*% b) -  lambda*stan( Matrix::diag(Q %*% Matrix::tcrossprod(D, Q)) )
  }
  if(length(bad) > 0){fit[bad,1]=min(fit[,1])}
  return(fit)
}Notice that we have added a matrix product Q%*%Z to see how may times each parent is used in the proposed solution of crosses. The next step would be to provide the new fitness function to the evolafit() function and the additional argument Z which is the overlay matrix formed in the first chunck of code:
# silent for CRAN checks restriction on time
# res<-evolafit(formula = c(GY, occ)~hy,
#               dt= DT, 
#               # constraints: if sum is greater than this ignore
#               constraintsUB = c(Inf,50), 
#               # constraints: if sum is smaller than this ignore
#               constraintsLB= c(-Inf,-Inf),
#               # weight the traits for the selection
#               b = c(1,0), 
#               # population parameters
#               nCrosses = 100, nProgeny = 10, 
#               # coancestry parameters
#               D=D, lambda= (10*pi)/180 , nQtlStart = 40, 
#               # new fitness function and additional args
#               fitnessf = fitnessf, Z=Z,
#               # selection parameters
#               propSelBetween = 0.5, propSelWithin =0.5, 
#               nGenerations = 15, verbose=FALSE) 
# 
# Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2
# best = bestSol(res$pop)[,1]
# qa = (Q %*% DT$GY)[best,]; qa 
# qDq = Q[best,] %*% D %*% Q[best,]; qDq 
# sum(Q[best,]) # total # of inds selectedNow, last but not least we check how many times each parent was used:
The following example show how the genetic algorithm can be tweeked to do a predictive model of the type of of a linear regression.
data("mtcars")
# we scale the variables
mtcars <- as.data.frame(apply(mtcars,2,scale))
mtcars$inter <- 1 # add an intercept if desired
# define the train and validation set
train <- sample(1:nrow(mtcars) , round((nrow(mtcars)*.4)))
validate <- setdiff(1:nrow(mtcars),train)
mtcarsT <- mtcars[train,]
mtcarsV <- mtcars[validate,]
##############################
# fit the regular linear model
head(mtcarsT)##           mpg       cyl       disp         hp       drat         wt        qsec
## 26  1.1961900 -1.224858 -1.2241687 -1.1768396  0.9041644 -1.3104811  0.58829513
## 25 -0.1477738  1.014882  1.3658214  0.4129422 -0.9661175  0.6415711 -0.44699237
## 14 -0.8114596  1.014882  0.3637131  0.4858679 -0.9848204  0.5751400  0.08464175
## 21  0.2338456 -1.224858 -0.8925532 -0.7246998  0.1934573 -0.7688122  1.20946763
## 17 -0.8944204  1.014882  1.6885616  1.2151256 -0.6855752  2.1745964 -0.23993487
## 27  0.9804921 -1.224858 -0.8909395 -0.8122108  1.5587631 -1.1009677 -0.64285758
##            vs         am       gear       carb inter
## 26  1.1160357  1.1899014  0.4235542 -1.1221521     1
## 25 -0.8680278 -0.8141431 -0.9318192 -0.5030337     1
## 14 -0.8680278 -0.8141431 -0.9318192  0.1160847     1
## 21  1.1160357 -0.8141431 -0.9318192 -1.1221521     1
## 17 -0.8680278 -0.8141431 -0.9318192  0.7352031     1
## 27 -0.8680278  1.1899014  1.7789276 -0.5030337     1## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat, data = mtcarsT)
## 
## Coefficients:
## (Intercept)          cyl         disp           hp         drat  
##    -0.06048      0.25659      0.13217     -1.18239      0.35296##############################
# fit the genetic algorithm
# 1) create initial QTL effects to evolve
nqtls=100
dt <- data.frame(alpha=rnorm(nqtls,0,.3),qtl=paste0("Q",1:nqtls))
head(dt); nrow(dt)##          alpha qtl
## 1 -0.248420700  Q1
## 2 -0.317197344  Q2
## 3 -0.510055983  Q3
## 4  0.037101738  Q4
## 5  0.002896908  Q5
## 6  0.308134871  Q6## [1] 100# generate n samples equivalent to the number of progeny
# you are planning to start the simulation with (e.g., 500)
# these are fixed values
sam <- sample(1:nrow(mtcarsT),500,replace = TRUE)
y <- mtcarsT$mpg[sam]
X = mtcarsT[sam,c("cyl","disp","hp","drat")]
# Task: linear regression
res0<-evolafit(alpha~qtl, dt= dt,
               # constraints: if greater than this ignore
               constraintsUB = c(Inf), 
               # constraints: if smaller than this ignore
               constraintsLB= c(-Inf), 
               # weight the traits for the selection
               b = c(1), 
               # population parameters
               nCrosses = 50, nProgeny = 10, recombGens = 1, 
               # coancestry parameters
               D=NULL, lambda=0, nQtlStart = 4, fixNumQtlPerInd = TRUE,
               # least MSE function (y - Xb)^2; Y are betas; X*Y is X*beta; 
               # Y and X are fixed, we just evolve the betas
               fitnessf=regFun,
               # selection parameters
               propSelBetween = 0.65, propSelWithin =0.65, selectTop=FALSE,
               nGenerations = 10, y=y, X=X, verbose = FALSE
) 
# check how the fitness function changed across generations
pmonitor(res0, kind = 1)# this time the best solution is the one that minimizes the error
Q <- pullQtlGeno(res0$pop, simParam = res0$simParam, trait=1); Q <- Q/2
bestid <- bestSol(res0$pop, selectTop = FALSE)[,"fitness"]
bestid## fitness 
##      61## [1] -0.09469920  0.02158504 -0.73070103  0.21953003# plot predicted versus real values
plot( as.matrix(mtcarsV[,c("cyl","disp","hp","drat")]) %*% betas  , mtcarsV$mpg,
      xlab="predicted mpg value by GA", ylab="mpg",
      main="Correlation between GA-prediction and observed") # GAplot( as.matrix(mtcarsV[,c("inter","cyl","disp","hp","drat")]) %*% mod$coefficients , mtcarsV$mpg,
      xlab="predicted mpg value by lm", ylab="mpg",
      main="Correlation between lm-prediction and observed") # LM# Correlation between GA-prediction and observed 
cor( as.matrix(mtcarsV[,c("cyl","disp","hp","drat")]) %*% betas  , mtcarsV$mpg) ##          [,1]
## [1,] 0.767097# Correlation between lm-prediction and observed
cor( as.matrix(mtcarsV[,c("inter","cyl","disp","hp","drat")]) %*% mod$coefficients , mtcarsV$mpg) # LM##           [,1]
## [1,] 0.6595502A popular problem to optimize is the one of a travel salesman person that needs to go to n cities while minimizing the distance incurred. The way to specify the problem in evola is the following:
Let us start by defining some functions to simulate and plot n cities with their coordinates.
# "not in" operator
'%!in%' <- function(x,y)!('%in%'(x,y))
# function to simulate cities
simCities <- function(n_cities = 5) {
  # extend "LETTERS" function to run from "A" to "ZZ"
  MORELETTERS <- c(LETTERS, sapply(LETTERS, function(x) paste0(x, LETTERS)))
  cities <- matrix(runif(2*n_cities,-1,1),ncol=2)
  rownames(cities) <- MORELETTERS[1:n_cities]
  colnames(cities) <- c("x","y")
  return(cities)
}
# function to plot cities
plotCities <- function(cities, route, dist=NULL, 
                      main="", bg="white", 
                      main_col = "black",
                      point_col = "deepskyblue",
                      start_col = "red",
                      line_col = "black") {
  # plot cities
  city_colors <- c(start_col, rep(point_col,nrow(cities)))
  
  par(bg=bg)
  plot(cities, 
       pch=16, cex=3,
       col=point_col, 
       ylim=c(-1,1.1), xlim=c(-1,1),
       # yaxt="n", xaxt="n",
       # ylab="", xlab="",
       bty="n",
       main=main, col.main=main_col)
  text(x=cities[,1], y=cities[,2], labels=rownames(cities))
  # plot route
  if(!missing(route)){
    for(i in 1:length(route)) {
      if(route[i]>0){
        nodes0 <- strsplit(names(route)[i], "-")[[1]]
        lines(x=cities[nodes0,"x"],
              y=cities[nodes0,"y"],
              col=line_col,
              lwd=1.5)
      }
    }
  }
  # add distance in legend
  if(!is.null(dist)) legend("topleft", 
                            bty="n", # no box around legend
                            legend=round(dist,4), 
                            bg="transparent", 
                            text.col="black")
  
}
# function to compute adjacency matrix
compAdjMat <- function(cities) return(as.matrix(dist(cities)))with such functions defined let us start simulating the problem:
nCities=5
cities <- simCities(nCities)
adjmat <- compAdjMat(cities)
# make the distance matrix a data frame
df2 <- as.data.frame(as.table(adjmat)) 
df2 <- df2[-which(df2$Var1 == df2$Var2),]
colnames(df2)[3] <- c("distances")
df2$route <- paste(df2$Var1, df2$Var2, sep="-")
df2##    Var1 Var2 distances route
## 2     B    A 1.7439462   B-A
## 3     C    A 0.2415316   C-A
## 4     D    A 0.2530058   D-A
## 5     E    A 1.2051771   E-A
## 6     A    B 1.7439462   A-B
## 8     C    B 1.7941801   C-B
## 9     D    B 1.9960206   D-B
## 10    E    B 0.9356282   E-B
## 11    A    C 0.2415316   A-C
## 12    B    C 1.7941801   B-C
## 14    D    C 0.3408769   D-C
## 15    E    C 1.1351260   E-C
## 16    A    D 0.2530058   A-D
## 17    B    D 1.9960206   B-D
## 18    C    D 0.3408769   C-D
## 20    E    D 1.4379369   E-D
## 21    A    E 1.2051771   A-E
## 22    B    E 0.9356282   B-E
## 23    C    E 1.1351260   C-E
## 24    D    E 1.4379369   D-ESince the QTLs will be the different routes and the average allelic effects the distances we need to keep track of the cities (nodes) implied in the different routes (edges), we will create an incidence matrix of cities implied in each possible route step.
##     A B C D E
## B-A 1 1 0 0 0
## C-A 1 0 1 0 0
## D-A 1 0 0 1 0
## E-A 1 0 0 0 1
## A-B 1 1 0 0 0
## C-B 0 1 1 0 0Now we need to define what will be our objective function to score the different possible solutions or combinations of possible routes to take. If you are familiar with evola you know that certain internal variable names are taken such as Y, b, d, Q, D and a which represent different sources of information. We recommend you to read the evolafit documentation. We will use the regular fitness function that gives value to solutions based on the crossproduct of QTLs by average allelic effects but we will add some constraints. These are to ensure that all cities are visited and that all cities are left so there’s no open edges.
salesf <- function(Y,b,Q,D,a,lambda ,H, nCities){
                # simple fitness function
                fitnessVal <-  (Y%*%b) 
                ###############
                # CONSTRAINTS
                ###############
                # calculate how many cities the solution has travelled to
                QH <- Q %*% H
                # ensure all cities have at least 2 edges
                edgeCheck <- apply(QH,1,function(x){if(all(x>1)){return(1)}else{return(0)} })
                #apply condition on arriving and leaving the city
                fitnessVal <- ifelse(edgeCheck == 0, Inf, fitnessVal) # if not touching at least 2 cities give inf distance
                # number of unique cities visited
                nCityCheck <- apply(QH,1,function(x){length(which(x > 0))}) 
                # apply condition on visiting all cities
                fitnessVal <- ifelse(nCityCheck < nCities, Inf, fitnessVal) # if not in all cities give an infinite distance
                return(fitnessVal)
              }With the fitness function defined now we can fit the genetic algorithm:
res<-evolafit(formula=distances~route, dt= df2,
              # constraints on traits: if greater than this ignore
              constraintsUB = c(Inf), 
              # constraints on traits: if smaller than this ignore
              constraintsLB= c(-Inf), 
              # weight the traits for the selection (fitness function)
              b = c(1), 
              # population parameters
              nCrosses = 50, nProgeny = 10, 
              # genome parameters
              recombGens = 1, nChr=1, mutRateAllele=0, 
              # start with at least n QTLs equivalent to n cities
              nQtlStart = nCities*2, 
              # coancestry parameters
              D=NULL, lambda=0, 
              fitnessf = salesf, 
              selectTop = FALSE, 
              # additional variables for the fitness function
              H=H, nCities=nCities,
              # selection parameters
              # propSelBetween = .8, propSelWithin =0.8, 
              nGenerations = 50, verbose=FALSE
) ## Variance across traits exhausted. Early stop.Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2
best <- bestSol(res$pop, selectTop = FALSE)[,"fitness"]
Q[best,] # routes taken## B-A C-A D-A E-A A-B C-B D-B E-B A-C B-C D-C E-C A-D B-D C-D E-D A-E B-E C-E D-E 
##   0   0   0   0   0   0   0   1   1   0   1   0   1   0   0   0   0   1   0   0##      A B C D E
## [1,] 2 2 2 2 2The advice here is to upload directly the phased genotypes (haplotypes) to the AlphaSimR machinery and simulate the possible crosses to explore how many individuals are required to sample a given trait (oligogenic or polygenic) with a given probablility. You can also use the inbreeding value of each cross to decide the number of progeny for a given cross since there is a negative relationship between inbreeding of a cross and the expected variance observed in the progeny of such cross. No need to use the evola package.
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for optimization of complex problems. To be submitted to Bioinformatics.
Gaynor, R. Chris, Gregor Gorjanc, and John M. Hickey. 2021. AlphaSimR: an R package for breeding program simulations. G3 Gene|Genomes|Genetics 11(2):jkaa017. https://doi.org/10.1093/g3journal/jkaa017.
Chen GK, Marjoram P, Wall JD (2009). Fast and Flexible Simulation of DNA Sequence Data. Genome Research, 19, 136-142.