RGxE: An R Program for Genotype x Environment Interaction Analysis

doi:10.4236/ajps.2017.87116

American Journal of Plant Sciences
Vol.08 No.07(2017), Article ID:77183,27 pages

Mahendra Dia¹, Todd C. Wehner¹*, Consuelo Arellano²


¹Department of Horticultural Science, North Carolina State University, Raleigh, USA

²Statistics Department, North Carolina State University, Raleigh, USA

This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).

http://creativecommons.org/licenses/by/4.0/

Received: March 31, 2017; Accepted: June 20, 2017; Published: June 26, 2017

ABSTRACT

Genotype x environmental interaction (GxE) can lead to differences in performance of genotypes over environments. GxE analysis can be used to analyze the stability of genotypes and the value of test locations. We developed an R language program (RGxE) that computes univariate stability statistics, descriptive statistics, pooled ANOVA, genotype F ratio across location and environment, cluster analysis for location, and location correlation with average location performance. The univariate stability statistics calculated are regression slope (b_i), deviation from regression (S²_d), Shukla’s variance (σ_i²), Wricke’s ecovalence (W_i), and Kang’s yield stability (YS_i). RGxE is free and intended for use by scientists studying performance of polygenic or quantitative traits over multiple environments. In the present paper we provide the RGxE program and its components along with example input data and outputs. The RGxE program and associated files are also available on GitHub at https://github.com/mahendra1/RGxE, http://cucurbitbreeding.com/todd-wehner/publications/software-sas-r-project/ and http://cuke.hort.ncsu.edu/cucurbit/wehner/software.html.

Keywords:

Genotype x Environment Interaction, R Programming Language, RGxE, Univariate, Multivariate

1. Introduction

Genotype x environmental interaction (GxE) refers to the modification of genetic factors by environmental factors, and to the role of genetic factors in determining the performance of genotypes in different environments. GxE can occur for quantitative traits of economic importance and is often studied in plant and animal breeding, genetic epidemiology, pharmacogenomics and conservation biology research. Such traits include reproductive fitness, longevity, height, weight, yield, and disease resistance.

Selection of superior genotypes in target environments is an important objective of plant breeding programs. A target environment is a production environment used by growers [1] [2] [3] [4] [5]. In order to identify superior genotypes across multiple environments, plant breeders conduct trials across locations and years, especially during the final stages of cultivar development. GxE is said to exist when genotype performance differs over environments. The performance of a genotype can vary greatly across environments because of the effect of environment on trait expression. Cultivars with high and stable performance are difficult to identify, but are of great value [6] [7].

Since it is impossible to test genotypes in all target environments, plant breeders practice indirect selection using their own multiple-environment trials, or test environments. GxE reduces the predictability of the performance of genotypes in target environments based on genotype performance in test environments [8]. An important factor in plant breeding is the selection of suitable test locations, since it accounts for GxE and maximizes gain from selection [9]. An efficient test location is discriminating, and is representative of the target environments for the cultivars to be released. Discriminating locations can detect differences among genotypes with few replications. Representative locations make it likely that genotypes selected will perform well in target environments [9].

The analysis of variance (ANOVA) is useful in determining the existence, size and significance of GxE. In order to determine GxE for a group of elite cultivars, genotypes are often considered to be fixed effects and environments random. However, for the purpose of estimating breeding values using best linear unbiased prediction (BLUP), genotypes are considered to be random and environments fixed. Some statisticians consider genotypes a random effect, provided that the objective is to select the best ones [10]. If GxE is significant, additional stability statistics can be calculated.

Several statistical methods have been proposed for stability analysis. These methods are based on univariate and multivariate models. The present paper focuses on univariate models for stability analysis implemented in R, so a brief description of each stability measure is provided below.

The most widely used methods are univariate stability models based on regression and variance estimates. According to the regression model, stability is expressed in terms of the trait mean (M), the slope of the regression line (b_i) and the sum of squares for deviation from regression. A high mean performance of a genotype is a precondition of stability. The slope (b_i) of the regression indicates the response of a genotype to the environmental index, which is derived from the average performance of all genotypes in each environment. If b_i is not significantly different from unity, the genotype is adapted to all environments. A b_i greater than unity describes genotypes with higher sensitivity to environmental change (below average stability), and greater specificity of adaptability to high yielding environments. A b_i less than unity indicates greater resistance to environmental change (above average stability), and therefore increasing specificity of adaptability to low yielding environments.
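The regression approach described above can be illustrated with a short sketch in base R. It is not the RGxE code: the data are simulated, the object names (ge, stab, dev_ms) are hypothetical, and the deviation mean square is shown without any pooled-error correction. Each genotype's environment means are regressed on an environmental index, taken here as the environment mean minus the grand mean.

```r
# Simulated genotype-by-environment means (hypothetical: 4 genotypes x 5 environments)
set.seed(1)
ge <- expand.grid(CLT = paste0("G", 1:4), ENV = paste0("E", 1:5))
ge$Trait <- 50 + rnorm(nrow(ge), sd = 5)

# Environmental index: environment mean minus grand mean
env_mean <- tapply(ge$Trait, ge$ENV, mean)
ge$index <- as.numeric(env_mean[as.character(ge$ENV)]) - mean(ge$Trait)

# Regress each genotype on the index; keep mean, slope and deviation mean square
stab <- do.call(rbind, lapply(split(ge, ge$CLT), function(d) {
  fit <- lm(Trait ~ index, data = d)
  data.frame(CLT = d$CLT[1],
             M = mean(d$Trait),
             b_i = unname(coef(fit)["index"]),
             dev_ms = sum(resid(fit)^2) / (nrow(d) - 2))
}))
print(stab)
```

With balanced data the slopes average exactly one across genotypes, which is why unity is the reference value against which each b_i is compared.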

The variance parameters that measure stability statistics include stability ecovalence proposed by [11], stability variance proposed by [12], and yield stability (YS_i) proposed by [13].

The ecovalence stability index (W_i²) of a genotype is its contribution to the GxE, squared and summed across all environments. Since the value of W_i² is expressed as a sum of squares, a test of significance for W_i² is not available. [12] proposed an unbiased estimate of the variance of GxE plus an error term associated with genotype. Shukla’s stability variance is a linear combination of Wricke’s ecovalence. Shukla’s stability statistic measures the contribution of a genotype to the GxE and error term; therefore, a genotype with low σ_i² is regarded as stable. According to [13], W_i² and σ_i² are equivalent in ranking genotypes for stability.
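The relationship between the two variance statistics can be sketched in base R from a table of genotype-by-environment means. This is an illustration on simulated data, not the RGxE implementation; p, q, y, W and sigma2 are hypothetical names, and the stability variance uses the standard expression of Shukla's statistic as a linear function of the ecovalence values.

```r
set.seed(2)
p <- 5  # hypothetical number of genotypes
q <- 4  # hypothetical number of environments
y <- matrix(rnorm(p * q, mean = 50, sd = 5), nrow = p,
            dimnames = list(paste0("G", 1:p), paste0("E", 1:q)))

# Interaction residuals: y_ij - genotype mean - environment mean + grand mean
ge_dev <- sweep(sweep(y, 1, rowMeans(y)), 2, colMeans(y)) + mean(y)

# Wricke's ecovalence: squared interaction residuals summed over environments
W <- rowSums(ge_dev^2)

# Shukla's stability variance as a linear function of the ecovalence values
sigma2 <- (p * W) / ((p - 2) * (q - 1)) -
  sum(W) / ((p - 1) * (p - 2) * (q - 1))

print(data.frame(W = W, sigma2 = sigma2,
                 rank_W = rank(W), rank_sigma2 = rank(sigma2)))
```

Because sigma2 is an increasing linear function of W, the two rank columns agree, which is the equivalence in ranking noted above. The stability variances also average to the GxE mean square, sum(W) / ((p - 1) * (q - 1)).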

The yield stability statistic (YS_i) of [14] is a nonparametric stability procedure in which both the mean (M) and the stability variance of [12] for a trait are used as selection criteria. This method gives equal weight to M and σ_i². According to this method, genotypes with YS_i greater than the mean YS_i are considered stable [14] [15] [16].

The genotype F ratio for each test location and the correlation of each test location with the average location are important measures of location value. When the means of all genotypes are equal, the F ratio will be close to 1. If analysis of variance is run by location, a high genotype F ratio indicates high discriminating ability for that location. A high and significant Pearson correlation of a location with the mean of all locations indicates strong representation of mean location performance.
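Both measures of location value can be sketched in a few lines of base R. The example below uses simulated data and hypothetical object names (d, f_ratio, r_loc), and it assumes a randomized complete block layout within each location; it is not taken from the RGxE source.

```r
set.seed(3)
# Hypothetical trial: 6 genotypes x 3 replications x 4 locations
d <- expand.grid(CLT = paste0("G", 1:6), RP = factor(1:3), LC = paste0("L", 1:4))
d$Trait <- 50 + rnorm(nrow(d), sd = 5)

# Genotype F ratio from an ANOVA run separately within each location
f_ratio <- sapply(split(d, d$LC), function(x) {
  tab <- anova(lm(Trait ~ RP + CLT, data = x))
  tab["CLT", "F value"]
})

# Genotype means per location, correlated with the mean over all locations
m <- tapply(d$Trait, list(d$CLT, d$LC), mean)
r_loc <- apply(m, 2, function(col) cor(col, rowMeans(m)))

print(data.frame(LC = names(f_ratio), F_genotype = f_ratio, r_with_mean = r_loc))
```

A location with a large F_genotype would be read as discriminating, and one with a large r_with_mean as representative. Since the simulated genotypes here do not truly differ, the F ratios are expected to scatter around 1.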

Our objective was to develop an R language program (RGxE) that gives an output for genotype stability and location value using univariate models, descriptive statistics, genotype F ratio across location and environment, cluster analysis for location, and location correlation with average location performance. In addition to the RGxE program, [17] provided a SAS program (SASGxE) that computes univariate stability statistics and location value using SAS, and multivariate stability statistics using an R program. These multivariate stability statistics include the additive main effects and multiplicative interaction (AMMI) model, and the genotype main effects plus GxE (GGE) model. RGxE uses R software (version 3.1.3 and higher). RGxE is freely available, annotated, and intended for scientists studying performance of polygenic or quantitative traits under different environmental conditions. In the present paper we provide the general features of the RGxE program along with the functionality of each module and their outputs. A supplemental file is provided with the RGxE program, instructions for the user-entered fields required in the RGxE program, interpretation of univariate stability statistics, example input data, and output from the example input data. The RGxE program and associated files are also available on GitHub at https://github.com/mahendra1/RGxE, http://cucurbitbreeding.com/todd-wehner/publications/software-sas-r-project/ and http://cuke.hort.ncsu.edu/cucurbit/wehner/software.html.

2. General Features and Functionality of the RGxE Program

2.1. Overview of the RGxE Program

RGxE is a user-friendly, annotated R program that allows the user to analyze genotype stability and evaluate test location value from balanced multi-location replicated trial data. The program writes its output (.csv or .txt) to the same folder from which it reads the input dataset, and to the Console window of RStudio [18], a helper application for the R statistical software [19]. A schematic representation of RGxE is presented in Figure 1. Below are the key components of the RGxE program, which the user can run independently.

2.2. Installing and Loading Packages

Figure 1. Overview of overall process of RGxE program for genotype stability and location value.

RGxE uses the dplyr [20], tidyr [21], broom [22], agricolae [23], lme4 [24], afex [25], cluster [26], and grDevices [19] packages. All but grDevices, which ships with R itself, are available from the Comprehensive R Archive Network (CRAN) and can be installed like any other package by typing:

install.packages("dplyr")
install.packages("tidyr")
install.packages("broom")
install.packages("agricolae")
install.packages("lme4")
install.packages("afex")
install.packages("cluster")
# grDevices is a base package bundled with R; this call is not needed
install.packages("grDevices")

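Once installed, the packages must be loaded before they can be used. This can be done through the library() or require() command, as shown below. The listing also loads sqldf, which is not among the packages installed above and would need to be installed separately.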

library(tidyr)
library(dplyr)
library(sqldf)
library(lme4)
library(afex)
library(broom)
library(agricolae)
library(cluster)
library(grDevices)

2.3. Input Data and Validation

RGxE starts with user-entered fields to read the input data. Instructions on the user-entered fields are presented in the Supplemental Material. The user is required to set the current working directory, which is the input data file location, using setwd(), and to pass the input data file name. RGxE requires an input data file in .csv (comma separated value) format. The working directory, file names and paths are the user-entered fields in the code shown below for the Windows and Mac operating systems, respectively.

setwd("E:/PhD Research Work/PhD Articles")

#### For Windows user ####
tempa <- read.csv("RGxEInputData2_2016_02_15.csv", header = TRUE)

#### For Mac user ####
file.name <- "E:/PhD Research Work/PhD Articles/RGxEInputData2_2016_02_15.csv"
out.name <- "E:/PhD Research Work/PhD Articles/GxEROutput.csv"
tempa <- read.csv(file.name)

The input data file comprises the columns YR (year), LC (location), RP (replication), CLT (cultigen or genotype), and the dependent variable (Trait). Sample input data are presented in the Supplemental Material. The user must not change the column names, because the program uses these variable names for the analysis. The dependent variable in the example input data is yield (Mg∙ha⁻¹) of watermelon. Hereafter, the word “genotype” is used to indicate cultigen, cultivar, variety or genotype. RGxE validates the structure of the input data with the statements below, so that the correct column types (numeric, logical, factor, or character) are used for statistical analysis.

tempa$YR <- as.factor(tempa$YR)
tempa$RP <- as.factor(tempa$RP)
tempa$LC <- as.factor(tempa$LC)
tempa$CLT <- as.factor(tempa$CLT)
tempa$Trait <- as.numeric(tempa$Trait)
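Because RGxE expects fixed column names and balanced data, a user may wish to check an input file before running the analysis. The function below is a hypothetical helper, not part of RGxE; the name check_rgxe_input and its messages are assumptions made for this sketch, and it treats "balanced" as every year x location x replication x genotype combination occurring exactly once.

```r
# Hypothetical helper (not part of RGxE): check the layout of an input data frame
check_rgxe_input <- function(dat) {
  required <- c("YR", "LC", "RP", "CLT", "Trait")
  missing_cols <- setdiff(required, names(dat))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (!is.numeric(dat$Trait)) {
    stop("Trait must be numeric")
  }
  # Balanced data: each YR x LC x RP x CLT combination appears exactly once
  counts <- table(dat$YR, dat$LC, dat$RP, dat$CLT)
  if (any(counts != 1)) {
    stop("Data are not balanced: some cells are missing or duplicated")
  }
  invisible(TRUE)
}

# Small simulated data set: 2 years x 2 locations x 2 replications x 3 genotypes
demo <- expand.grid(YR = c(2009, 2010), LC = c("KN", "CI"), RP = 1:2,
                    CLT = c("A", "B", "C"))
demo$Trait <- seq_len(nrow(demo)) * 1.5
check_rgxe_input(demo)
```

The same call could be applied to tempa after it is read; it stops with an error message if a column is missing or a cell of the design is empty or duplicated.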

To inspect the structure of the data, the str() command can be used.

str(tempa)
'data.frame': 400 obs. of 5 variables:
 $ YR   : Factor w/ 2 levels "2009","2010": 1 1 1 1 1 1 1 1 1 1 ...
 $ LC   : Factor w/ 5 levels "CI","FL","KN",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ RP   : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ CLT  : Factor w/ 10 levels "CalhounGray",..: 3 1 9 2 5 4 7 10 6 8 ...
 $ Trait: num 56.2 74.2 32.6 74.2 64.8 ...

The first six rows of the example input data can be viewed using the head() command.

head(tempa)
    YR LC RP                CLT  Trait
1 2009 KN  1        EarlyCanada 56.236
2 2009 KN  1        CalhounGray 74.167
3 2009 KN  1        StarbriteF1 32.601
4 2009 KN  1       CrimsonSweet 74.167
5 2009 KN  1 GeorgiaRattlesnake 64.794
6 2009 KN  1           FiestaF1 70.907

2.4. Genotype Stability Statistics

2.4.1. Analysis of Variance (ANOVA)

In multi-location replicated trial data, combined ANOVA is performed to identify the significance of different effects; to estimate and compare means for levels of fixed factors; and to estimate the size of the genotype and GxE variance components. The ANOVA model comprises four factors: genotype (CLT), location (LC), year (YR), and replication or block (RP) nested within location and year. The response of genotype i in location j, year k and replication r is presented as:

where m = grand mean. Depending on the objectives of the analysis, genotype, location and year are defined as random or fixed effects, which gives five different ANOVA models (Table 1). Genotype is random when the aim is to estimate variance components, genetic parameters, genetic gains expected from selection, or different breeding strategies. Conversely, genotype is a fixed factor when the aim is to compare test material for selection or recommendation. Similarly, location is considered random when the main interest is to estimate variance components for sites that are representative of the relevant population within a target region. Location is fixed when the interest is to make an explicit comparison of one level with another, and each location represents a well-defined area with respect to crop management. Year and replication are usually treated as random factors.
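The response model referred to at the start of this section can be written out so that each term corresponds to one term of the lmer() formula used for Case 1 later in this section. Apart from m, the symbols are assumed notation chosen for this sketch and are not necessarily those of the authors: y is the response, G, L and Y are the genotype, location and year effects, B is the block effect nested within location and year, and e is the residual.

```latex
y_{ijkr} = m + G_i + L_j + Y_k + (LY)_{jk} + B_{r(jk)}
         + (GL)_{ij} + (GY)_{ik} + (GLY)_{ijk} + e_{ijkr}
```

Under this reading, the nested block term B_{r(jk)} corresponds to RPid, and (LY), (GY), (GL) and (GLY) correspond to YR:LC, YR:CLT, LC:CLT and YR:LC:CLT, respectively.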

Different combinations of random and fixed effects in ANOVA model have implications for the expectations of mean square (MS) values with the possible modification of the error term to be adopted in the F test. Therefore, sometimes the F test is not as straightforward as the ratio between two mean squares.

RGxE computes five different cases of ANOVA:

・ case 1: CLT, YR, LC and RP - all random

・ case 2: CLT, YR and LC - fixed; RP - random

・ case 3: CLT - fixed; LC, YR and RP - random

・ case 4: LC - fixed; CLT, YR and RP - random

・ case 5: CLT and LC - fixed; YR and RP - random

For random effects, RGxE computes estimates of variance components using the lmer() function of the lme4 package. The significance of random effects is computed using a likelihood ratio test to obtain p-values. Likelihood is the probability of the data given a model. The logic of the likelihood ratio test is to compare the likelihoods of two models with each other. The model without the factor of interest (null model) is compared with the model with the factor of interest (full model) using the anova() function. It gives a Chi-Square value, the associated degrees of freedom and the p-value. According to Wilks’ theorem, negative two times the log likelihood ratio of the two models approaches a Chi-Square distribution with k degrees of freedom, where k is the number of random effects tested. RGxE defines a user function, anova_lrt(), to compute the likelihood ratio test; it is stored in the ANOVA model Case 1 code.

Table 1. ANOVA models including the factors genotype (CLT), location (LC), year (YR), and replication (RP) for multi-location replicated trials across years in a randomized complete block design.
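The mechanics of the likelihood ratio test can be shown with a minimal base R sketch. To keep the example self-contained it compares two ordinary lm() fits on simulated data rather than lmer() fits, so it illustrates the computation only and is not the RGxE procedure; all object names are hypothetical.

```r
set.seed(4)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 + 0.8 * x2 + rnorm(n)

null_fit <- lm(y ~ x1)       # model without the factor of interest
full_fit <- lm(y ~ x1 + x2)  # model with the factor of interest

# Negative two times the log likelihood ratio of the two models
lrt_stat <- as.numeric(-2 * (logLik(null_fit) - logLik(full_fit)))

# Degrees of freedom: difference in the number of estimated parameters
lrt_df <- attr(logLik(full_fit), "df") - attr(logLik(null_fit), "df")

# p-value from the Chi-Square reference distribution
p_value <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
print(c(Chisq = lrt_stat, Df = lrt_df, p = p_value))
```

When anova() is given a full and a reduced lmer() fit, it reports the same three quantities (Chi-Square value, degrees of freedom and p-value) for the mixed models.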

The type III sum of squares (SS), MS, and F value of fixed effects are computed by passing the fitted model to the anova() function of the lme4 package. The significance (p-value) of fixed effects is computed using the mixed() function of the afex package. The mixed() function computes type III-like p-values, by default using the Kenward-Roger approximation for degrees of freedom.

To identify each experimental unit (EU) uniquely, a distinct value must be assigned to each EU. RGxE assigns a distinct value to each combination of replication (RP) nested within location (LC) x year (YR) and uses this new term (RPid) in the model. After installing and loading the packages, the user can independently compute the five different ANOVA models by feeding the input data (tempa) into the code below. User-friendly output is generated in the “data.frame” class using the dplyr and tidyr packages.

########################################################################
## ANOVA: Compute analysis of variance ##
########################################################################

#Generate unique id for replication for anova
tempa$RPid <- as.factor(paste(tempa$YR, tempa$LC, tempa$RP, sep="."))

########################################################################
### ANOVA Case 1: CLT, YR, LC and RP - All Random ###
########################################################################

#full model
fit.f1 <- lmer(Trait ~ 1 + (1|YR) + (1|LC) + (1|CLT) + (1|YR:LC) +
               (1|YR:CLT) + (1|LC:CLT) + (1|YR:LC:CLT) +
               (1|RPid), data=tempa)

#model summary
summary1 <- summary(fit.f1)

#variance of random factors
variance <- as.data.frame(summary1$varcor)

#drop rownames
rownames(variance) <- NULL
variance1 <- variance %>% select(-var1, -var2) %>%
  rename(sov=grp, Variance=vcov, stddev=sdcor)

#Type 3 test of hypothesis
#Type III Wald chisquare tests
anova(fit.f1, type="III")

#Type 1 test of hypothesis
anova(fit.f1, type="marginal", test="F")

#model fitness
anovacase1 <- plot(fit.f1,
  main="Model fitness Case 1: CLT, YR, LC and RP - All Random",
  xlab="Predicted Value", ylab="Residual")

#LRT - likelihood ratio test for computing significance of random effect
#create function (anova_lrt) for Likelihood ratio test, where parameters
#a=output dataset name; example-anova1r
#b=full model name; example-fit.f1
#c=reduced model name; example-fit.f1r
#d=effect name; example- "RPid", NOTE: call it in quotation
anova_lrt <- function(a, b, c, d) {
  #level of significance
  a <- anova(b, c)
  #convert anova into data frame
  a <- data.frame(a)
  #convert rownames into column
  a$name <- rownames(a)
  #drop rownames
  rownames(a) <- NULL
  #keep the row for the full model (b) and label it with the effect name
  a <- a %>% filter(name == "b") %>%
    mutate(sov = d) %>% select(sov, Pr_Chisq = starts_with("Pr..Chisq."))
  #return the result
  return(a)
}

#null model for YR

fit.f1y<-lmer(Trait~ 1 + (1|LC) + (1|CLT) + (1|YR:LC) + (1|YR:CLT) +