Introduction

This package is a collection of tools routinely used by the Texas A&M University Rose Breeding Program for analysis of genetic data from the WagRhSNP 68K SNP array.

Installation

If you have not yet installed RoseArrayTools in R, use the following code to do so.

install.packages('devtools')
devtools::install_github("jeekinlau/RoseArrayTools")

Functions

Below is the list of functions currently available:

‘compare_probes2’ ‘compare_probes’
‘dosage_to_ACGT’
‘call_specs’
‘polyorigin_imputation’
‘fitPoly_to_MAPpoly_input’
‘fitPoly_to_MAPpoly2_input’
‘fitPoly_to_GWASpoly_input’
‘fitPoly_to_PolyOrigin_input’
‘plot_progeny_dosage_change’

‘compare_probes’

IMPORTANT: this function needs both the R package “data.table” and internet access.

compare_probes(data_dat="path_to_file.dat", progress=T)

The function ‘compare_probes’ takes as input the long-form .dat file that fitPoly produces. On the WagRhSNP 68K SNP array, each marker is probed twice (forward and reverse strand), and fitPoly scores these probes as independent markers, so the two probes must be compared. For each marker, the function compares the genotype calls from its two probes. It writes two new files: compared_calls.csv and compared_calls_kind_counts.csv.

IMPORTANT: for this function to work properly, you must have the R package “data.table” installed.
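The probe comparison described above can be sketched in a few lines of base R. This is only an illustration, not the package's implementation: the helper name classify_probe_calls is hypothetical, and the meaning of the codes ("S" = both probes give the same call, "O" = only one probe gives a call, "D" = the probes give different calls, NA = neither probe gives a call) is inferred from the ‘call_specs’ description.

```r
# Hypothetical helper, NOT part of RoseArrayTools: classify the pair of
# dosage calls made by the two probes of a marker (NA means "no call").
classify_probe_calls <- function(probe_a, probe_b) {
  stopifnot(length(probe_a) == length(probe_b))
  kind <- rep(NA_character_, length(probe_a))        # default: neither probe called
  both <- !is.na(probe_a) & !is.na(probe_b)          # both probes have a call
  one  <- xor(is.na(probe_a), is.na(probe_b))        # exactly one probe has a call
  kind[both & probe_a == probe_b] <- "S"             # same call
  kind[both & probe_a != probe_b] <- "D"             # different calls
  kind[one] <- "O"                                   # one call only
  kind
}

# Toy calls for five samples at one marker
probe_a <- c(2, 1, NA, NA, 3)
probe_b <- c(2, 0, 4, NA, NA)
classify_probe_calls(probe_a, probe_b)
# "S" "D" "O" NA "O"
```

The sketch is vectorised over samples, so applying it marker by marker avoids an explicit inner loop.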

‘compare_probes_defunct’ (DEFUNCT DO NOT USE)

IMPORTANT: this function is now defunct. The ideas behind the description are still valid, but a small bug introduces a large error in the results. Please use the compare_probes2 function instead. The function is kept only to document the previous error.

install.packages('data.table')

This function takes a very long time. If progress=T (the default), it reports which step it is on. Steps 1-3 are background steps that set up the comparison. The comparing_probes and Marker_stats steps take longer, and when the function is done it prints “FINISHED” in the console. Depending on the speed of the computer, it can take a couple of hours to finish. There are 5 automated steps in total, with Marker_stats being the last.

compare_probes(data_dat="path_to_file.dat", progress=T)

‘dosage_to_ACGT’

The function ‘dosage_to_ACGT’ requires a csv file returned from fitPoly, formatted as shown below (left), and converts it to the format shown on the right. The A allele is the “reference” allele, the B allele is the alternative allele, and dosage is defined as the number of copies of the alternative allele.

This is how to use the function. Be aware that it was written as a loop inside a loop with 5 ifelse statements, so on large files with many individuals it will take a long time. The console prints the number of the marker currently being processed, which gives an estimate of the time left. The default for progress is TRUE; if progress is set to F, the function does not print the markers completed and only prints “FINISHED” when done.

IMPORTANT: if you leave the separator argument blank, the output will have no delimiter between the characters, i.e. a dosage of “1” with reference allele “A” and alternative allele “C” returns AAAC. If you set the separator argument to any character, the output uses it to separate the alleles. For example, with the same dosage and separator = “|”, the result is A|A|A|C instead of AAAC. The default is no separator.

dosage_to_ACGT(dosage.file="path_to_dosage_file.csv", progress=T, separator = "")
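To make the conversion rule concrete, here is a minimal base R sketch for a single genotype. It is not the package's code: the helper name dosage_to_genotype and its ploidy argument are assumptions for illustration, with a tetraploid default inferred from the AAAC example above.

```r
# Hypothetical helper, NOT part of RoseArrayTools: turn one dosage value
# (number of copies of the alternative allele) into an allele string.
# The 'ploidy' argument is an assumption; 4 matches the AAAC example.
dosage_to_genotype <- function(dosage, ref, alt, ploidy = 4, separator = "") {
  if (is.na(dosage)) return(NA_character_)           # missing call stays missing
  alleles <- c(rep(ref, ploidy - dosage),             # reference copies first
               rep(alt, dosage))                      # then alternative copies
  paste(alleles, collapse = separator)
}

dosage_to_genotype(1, ref = "A", alt = "C")                   # "AAAC"
dosage_to_genotype(1, ref = "A", alt = "C", separator = "|")  # "A|A|A|C"
```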

‘call_specs’

This function uses the compare_probes output “compared_calls_kind_counts_v2.csv” and reports the percentage of SNPs for which both probes have the same call, the percentage for which only one probe had a call, the percentage for which the calls differ, and the percentage missing. As background, the rose array has two probes for each SNP.

image of input file and output file formats no separator

image of input file and output file formats with separator=“|”

‘call_specs’

This function uses an output file of ‘compare_probes’, the compared_calls_kind_counts.csv file, to count what percentage of the calls are made by two probes, by one probe, differently by the two probes, or not made at all.

call_specs(kind_counts_file = "compared_calls_kinds.csv", select_genotypes=NULL)

Running the above function results in the output below. The script prints a percentage and an estimated number of markers for each group. Both numbers are approximations, because the script counts the “S”, “O”, “D”, and NA entries in the entire matrix and divides by the total number of cells (rows x columns).

By default, select_genotypes is NULL, which counts and displays the specs of all the genotypes in the file provided. If a .csv file with genotype names is provided, only those genotypes are counted.

[1] "percent______same 0.109028854404337 7455"
[1] "percent_______one 0.359272009497065 24564"
[1] "percent_different 0.0783920744846818 5360"
[1] "percent________NA 0.453307061613916 30994"

image of select_genotypes input file
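The counting described above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the helper name summarise_call_kinds is hypothetical, the input is taken to be a character matrix of "S"/"O"/"D"/NA with markers in rows and genotypes in columns, and the marker estimate is assumed to be the fraction multiplied by the number of rows.

```r
# Hypothetical helper, NOT part of RoseArrayTools: summarise a matrix of
# call kinds ("S", "O", "D", NA); rows are markers, columns are genotypes.
summarise_call_kinds <- function(kinds, select_genotypes = NULL) {
  if (!is.null(select_genotypes)) {
    kinds <- kinds[, select_genotypes, drop = FALSE]  # keep chosen genotypes only
  }
  total <- length(kinds)                              # rows x columns
  fraction <- c(
    same      = sum(kinds == "S", na.rm = TRUE),
    one       = sum(kinds == "O", na.rm = TRUE),
    different = sum(kinds == "D", na.rm = TRUE),
    missing   = sum(is.na(kinds))
  ) / total
  data.frame(
    kind           = names(fraction),
    fraction       = unname(fraction),
    approx_markers = round(unname(fraction) * nrow(kinds))  # assumed estimate
  )
}

# Toy matrix: 4 markers x 2 genotypes
kinds <- matrix(c("S", "O", "D", NA,
                  "S", "S", "O", NA),
                nrow = 4,
                dimnames = list(paste0("m", 1:4), c("g1", "g2")))
summarise_call_kinds(kinds)                           # all genotypes
summarise_call_kinds(kinds, select_genotypes = "g1")  # a subset of genotypes
```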

‘polyorigin_imputation’

IMPORTANT: for this function to work, you must have the R package “splitstackshape” installed.

install.packages('splitstackshape')

This function takes the PolyOrigin postdoseprob.csv output, which contains conditional probabilities, and outputs a csv with the imputed dosages. You must provide input.file, number.parents, ploidy, and probability.

input.file - should be the name of the .csv file you want to convert
number.parents - number of parents in the .csv file
ploidy - ploidy number as an integer
probability - probability you want to use as a cutoff

polyorigin_imputation(input.file = "example.csv", number.parents = 3, ploidy = 2, probability = 0.9)

Below is a picture of what the data looks like before and after conversion on a diploid dataset. Tetraploid datasets will have 5 dosage classes.

input and output files
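The role of the probability cutoff can be illustrated with a short base R sketch. It assumes, as a simplification, that the probabilities for one individual at one marker are already available as a numeric vector ordered by dosage class (0, 1, ..., ploidy); the helper name impute_dosage is hypothetical and the rule shown (keep the most probable dosage only if it reaches the cutoff) is an assumption about how such a cutoff is typically applied, not a description of the package's internals.

```r
# Hypothetical helper, NOT part of RoseArrayTools: pick the most probable
# dosage class if its probability reaches the cutoff, otherwise return NA.
# 'probs' is ordered by dosage class 0, 1, ..., ploidy.
impute_dosage <- function(probs, probability = 0.9) {
  best <- which.max(probs)                            # index of the largest probability
  if (length(best) == 0 || probs[best] < probability) {
    return(NA_integer_)                               # not confident enough
  }
  best - 1L                                           # index 1 corresponds to dosage 0
}

impute_dosage(c(0.02, 0.95, 0.03))   # diploid, confident call: 1
impute_dosage(c(0.40, 0.55, 0.05))   # diploid, below the 0.9 cutoff: NA
impute_dosage(c(0, 0, 0.01, 0.97, 0.02), probability = 0.9)  # tetraploid: 3
```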

‘fitPoly_to_MAPpoly_input’

This function takes the output of fitPoly after the “compare_probes” function and converts it into an input ready to be used by MAPpoly. This function also includes the genome positions of the WagRhSNP 68K array aligned to the Old Blush reference genome v1.0 (Hibrand-Saint Oyant et al., 2018).
NOTE: This function requires an internet connection, as a reference file is needed to assign the genome LG and positions.
Link to MAPpoly on CRAN (Stable)
Link to MAPpoly on Github (Experimental)

fitPoly_to_MAPpoly_input(input_file="input_file_name.csv", p1="parent1", p2="parent2")

input and output files

‘fitPoly_to_MAPpoly2_input’

This function takes the output of fitPoly after the “compare_probes” function and converts it into an input ready to be used by MAPpoly2. Usage differs slightly from the ‘fitPoly_to_MAPpoly_input’ function because it takes a genome argument. If genome is NULL or “saintoyant”, the Rosa chinensis genome (Hibrand-Saint Oyant et al., 2018) is used. If genome is “raymond”, the Rosa chinensis genome (Raymond et al., 2018) is used.

fitPoly_to_MAPpoly2_input(input_file="input_file_name.csv", genome=NULL, p1="parent1", p2="parent2")

‘fitPoly_to_GWASpoly_input’

This function takes the output of fitPoly after the “compare_probes” function and converts it into an input ready to be used by GWASpoly. This function also includes the genome positions of the WagRhSNP 68K array aligned to the Old Blush reference genome v1.0 (Hibrand-Saint Oyant et al., 2018).
NOTE: This function requires an internet connection, as a reference file is needed to assign the genome LG and positions.
Link to GWASpoly on Github

fitPoly_to_GWASpoly_input(input_file="input_file_name.csv")

Input and output files

‘fitPoly_to_PolyOrigin_input’

This function takes the output of fitPoly after the “compare_probes” function and converts it into an input ready to be used by PolyOrigin.

‘plot_progeny_dosage_change’

This function looks at genotypes that were imputed or changed by the HMM chain given a level of global genotypic error, and outputs a ggplot graphic with the percent of data changed. The HMM chain in the mapping algorithm makes changes to the original genotypic data, so this function lets a user see how many dosage changes were made to get the final map. The function can also output the corrected genotype dosages. This version of the function is OLD; the newer one is implemented in ‘mappoly::plot_progeny_dosage_change’, see https://github.com/mmollina/MAPpoly/blob/main/R/plot_progeny_dosage_change.R
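The summary statistic behind the plot, the percent of data imputed or changed, can be sketched in base R. This is a simplified illustration only: the helper name percent_dosage_changed is hypothetical, and it assumes the original and HMM-corrected dosages are available as two numeric matrices of identical dimensions, which is not necessarily how the package or MAPpoly stores them.

```r
# Hypothetical helper, NOT part of RoseArrayTools or MAPpoly: compare an
# original dosage matrix with a corrected one of the same shape and report
# the percent of cells that were imputed (NA -> value) or changed.
percent_dosage_changed <- function(original, corrected) {
  stopifnot(identical(dim(original), dim(corrected)))
  imputed <- is.na(original) & !is.na(corrected)      # missing call was filled in
  changed <- !is.na(original) & !is.na(corrected) &
    original != corrected                             # existing call was altered
  100 * c(imputed = sum(imputed), changed = sum(changed)) / length(original)
}

# Toy data: 4 markers x 2 individuals
original  <- matrix(c(0, 1, NA, 2,
                      1, 1, 2, NA), nrow = 4)
corrected <- matrix(c(0, 2, 1, 2,
                      1, 1, 2, 0), nrow = 4)
percent_dosage_changed(original, corrected)
# imputed = 25, changed = 12.5
```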

RoseArrayTools_vignette

Jeekin Lau

Updated: 5/21/2021
