Identification of expressed resistance gene analogs (RGA) and development of RGA-SSR markers in tobacco

Tobacco is an important cash crop and an ideal experimental system for 
 studies of plant-pathogen interactions. Identification of tobacco resistance 
 (R) genes and resistance gene analogs (RGAs) helps to elucidate the 
 underlying resistance mechanisms. In recent years, the public tobacco EST 
 (expressed sequence tags) data set, which provides a rich source for 
 identifying expressed RGAs, has grown substantially. In this study, 149606 
 Uni-ESTs were assembled from 412325 tobacco ESTs available in GenBank and 
 scanned with 112 published plant R-gene protein sequences, and 1113 
 Nicotiana (tobacco) RGAs (NtRGAs) were identified. The majority of them 
 contained the common R-gene domains, such as NBS-LRR, LRR-PK, LRR, PK and 
 Mlo, whereas no published R-gene domain could be identified in 109 RGAs. 
 Upon sequence alignment, 1079 NtRGAs were allocated to 712 loci 
 within the Nicotiana benthamiana genome. A total of 78 simple sequence 
 repeats (SSRs) were identified from 72 NtRGAs, and out of 64 newly designed 
 primer pairs, 54 primer pairs generated clear bands upon PCR amplification 
 using tobacco genomic DNA. Only nine primer pairs displayed polymorphism in 
 24 varieties of tobacco, with 2-4 alleles per locus (2.56 alleles on 
 average), while 41 primer pairs were able to detect polymorphisms in six wild 
 species of genus Nicotiana, with 2-4 alleles per locus (2.61 alleles on 
 average).


INTRODUCTION
Tobacco (Nicotiana tabacum) is an important cash crop worldwide and an ideal experimental system for studies of plant-pathogen interaction. In tobacco production, severe losses in yield and quality are caused by various diseases and pests, including bacterial wilt, mosaic virus and black shank. According to statistical data released by the China tobacco disease and pest forecast, prediction and integrated prevention website, in 2010 and 2011 the total area affected by diseases and pests in the 16 main tobacco-producing provinces amounted to 800 000 ha, causing a yield loss of 60 million kg and a value loss of 700 million RMB. Effective disease and pest control is therefore of great significance for tobacco production, and the identification and cloning of tobacco disease resistance genes (R-genes) and resistance gene analogs (RGAs) will play a fundamental role in elucidating the underlying disease resistance mechanisms and in formulating appropriate disease and pest control measures.
Plant disease resistance genes play a crucial role in the recognition of the proteins encoded by the avirulence genes of pathogens. In recent years, more than 100 plant disease resistance genes have been cloned by either map-based cloning or transposon tagging (Sanseverino et al., 2010; Johal et al., 1992; Whitham et al., 1996; Dixon et al., 1996) (http://prgdb.crg.eu/wiki/Species_with_R-genes). Although plant disease resistance genes confer defense against a broad range of pathogens, they share only a few highly conserved domains, such as the nucleotide binding site (NBS), leucine-rich repeats (LRR), serine-threonine kinase (STK), leucine zippers (LZ), transmembrane domain (TM) and Toll/Interleukin-1 Receptor (TIR) domain (Bent et al., 1996; Meyers et al., 1999; Hulbert et al., 2001; Dangl et al., 2001). These conserved domains provide a convenient and reliable basis for the rapid identification and cloning of R-genes and RGAs.
Plant disease resistance genes can be divided into five major classes according to the conserved domains of their amino acid sequences. The first class, containing NBS-LRR domains, may be further divided into two sub-classes based on the presence or absence of a TIR domain at the N terminus (i.e. TIR-NBS-LRR and non-TIR-NBS-LRR R-genes). For instance, the tobacco mosaic virus resistance gene, N, contains a TIR-NBS-LRR domain (Meyers et al., 1999; Meyers et al., 2003), while the Rps2 gene of Arabidopsis thaliana, which confers resistance to Pseudomonas syringae, contains a coiled-coil (CC)-NBS-LRR domain (Bent et al., 1994). The second class contains LRR-PK domains, such as the Fls2 gene of Arabidopsis thaliana and the Xa21 gene of rice (Dunning et al., 2007; Song et al., 1995). The third class is characterized by an extracellular LRR domain, such as the RPP27 gene of Arabidopsis thaliana (Tor et al., 2004). The fourth class contains only the PK domain, such as the Pto gene of tomato and the At1 gene of melon (Martin et al., 1993; Taler et al., 2004). The fifth class comprises all remaining R-genes, which are characterized by different mechanisms of resistance to pathogens, such as the Hm1 gene of maize and the Mlo gene of barley (Johal et al., 1992; Buschges et al., 1997).
In the past, RGAs were isolated by PCR amplification of conserved R-gene domains, an approach by which a number of RGAs were successfully cloned from Arabidopsis thaliana (Botella et al., 1997; Aarts et al., 1998), soybean (Graham et al., 2000), rice (Mago et al., 1999), corn (Collins et al., 1998), wheat (Seah 1998), tobacco (Leng et al., 2010; Gao et al., 2010) and other plants (Wan et al., 2010; Huettel et al., 2002; Nair et al., 2007). Compared with PCR amplification, data mining is a more effective and efficient strategy for the identification of RGAs from genomes. Meyers et al. (2003) identified 149 NBS-LRR-encoding genes and 58 genes of other types in the genome of Arabidopsis thaliana. Ameline-Torregrosa et al. (2008) identified 333 non-redundant NBS-LRR genes from the draft genome sequence of Medicago truncatula and predicted that the whole genome contains 400-500 NBS-LRR genes. Recently, Li et al. (2010) identified 158 NBS-encoding R genes from the genome of Lotus corniculatus.
Plant genomes contain abundant pseudogenes that have lost their biological functions. Inevitably, most of the RGAs identified on the basis of genomic sequences are unexpressed pseudogenes (Li et al., 2010), which severely hampers the effective cloning of R-genes. True R-genes therefore have to be distinguished from the many pseudogenes. Recently, a number of RGAs have been identified from plant EST sequences through data mining. Liu et al. (2012)
Once the RGAs are identified, the next logical step is to develop RGA markers, which could be restriction fragment length polymorphisms (RFLP) (Sanz et al., 2013), sequence-tagged sites (STS) (Loarce et al., 2009), single-strand conformation polymorphisms (SSCP) (Tantasawat et al., 2012), cleaved amplified polymorphic sequences (CAPS) (Palomino et al., 2009), simple sequence repeats (SSR) (Liu et al., 2013), etc. Sanz et al. (2013) designed 31 RFLP probes based on the RGA sequences of oat and mapped 53 RGA-RFLP profiling markers on the hexaploid map of A. byzantina cv. Kanota × A. sativa cv. Ogle. Recently, Liu et al. (2013) developed 28 SSR markers using 25 peanut RGAs and mapped one of the markers, RGA121, onto the linkage group AhIV. SSR markers have several advantages over other types of markers, such as codominance, high polymorphism and easy handling with good reproducibility (Agarwal et al., 2008); it is therefore more practical to develop SSR markers from RGAs for the mapping and cloning of plant disease resistance genes.
As of June 2012, the number of tobacco EST sequences in the public nucleotide database GenBank had reached 412325, covering almost all genes expressed in different growth stages and tissues and thus enabling the identification of expressed tobacco RGAs. This study was therefore intended to identify expressed RGAs from tobacco EST sequences by data mining and to use them to develop RGA-SSR markers, which will provide a useful basis for the future identification and cloning of tobacco disease resistance genes.

MATERIALS AND METHODS

Plant materials
Twenty-four varieties of tobacco and 6 wild species of the genus Nicotiana were used in this study (Table 1). The plant materials were cultivated on the experimental farm of the Guangdong Academy of Agricultural Sciences, China, and young leaves were sampled in the summer of 2012. Genomic DNA was extracted from fresh leaf samples using a DNA extraction kit (Cat. No. DP320, Tiangen, Beijing, China).

Tobacco EST sequence assembly
The tobacco EST sequences available in GenBank (http://www.ncbi.nlm.nih.gov/) were downloaded and assembled using the TIGR Gene Indices Clustering Tools (TGICL) (http://compbio.dfci.harvard.edu/tgi/software/). EST sequences were considered to meet the assembly requirements if a) the length of overlapping nucleotides exceeded 50; b) the similarity reached 90%; and c) the nonmatching length did not exceed 20 nucleotides.
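The three acceptance criteria above can be written as a single predicate. The following is only a minimal sketch of that rule, not TGICL's implementation; the `Overlap` record and its field names are hypothetical, and "nonmatching length" is assumed to mean the longest unaligned overhang.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    """Pairwise overlap between two ESTs (hypothetical record for illustration)."""
    length: int        # number of overlapping nucleotides
    identity: float    # percent similarity within the overlap
    overhang: int      # longest non-matching end, in nucleotides (assumed meaning)


def meets_assembly_criteria(ov, min_len=50, min_identity=90.0, max_overhang=20):
    """Apply criteria a) to c): overlap > 50 nt, similarity >= 90%, overhang <= 20 nt."""
    return (
        ov.length > min_len
        and ov.identity >= min_identity
        and ov.overhang <= max_overhang
    )
```

Under this reading, an overlap of exactly 50 nucleotides is rejected because the text requires the length to exceed 50, whereas a similarity of exactly 90% is accepted.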

Data mining of tobacco RGAs
The amino acid sequences of 112 published plant R-genes (Table 2) were used to scan the tobacco Uni-ESTs in order to identify RGAs. The BLAST search was carried out using the tBLASTn tool, and Uni-ESTs with a BLAST score ≥100 and an E-value ≤1e-10 were considered candidate Nicotiana tabacum RGAs (NtRGAs).
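The thresholding step can be illustrated with a short filter. This sketch assumes that the tBLASTn results were exported in the standard 12-column BLAST+ tabular layout (query, subject, ..., E-value, bit score) and that the "blast score" corresponds to the bit score; neither assumption is stated in the text, and the function name is hypothetical.

```python
def filter_hits(tabular_lines, min_score=100.0, max_evalue=1e-10):
    """Keep hits with score >= 100 and E-value <= 1e-10.

    Each input line is assumed to hold 12 tab-separated columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore.
    Here the query is an R-gene protein and the subject is a Uni-EST.
    """
    kept = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        r_gene, uni_est = fields[0], fields[1]
        evalue, score = float(fields[10]), float(fields[11])
        if score >= min_score and evalue <= max_evalue:
            kept.append((r_gene, uni_est, score, evalue))
    return kept
```

The returned tuples keep the R-gene identifier alongside each Uni-EST, so the same Uni-EST can still appear more than once at this stage.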

Mapping of NtRGAs in Nicotiana benthamiana genome
The NtRGAs were mapped onto the genome of Nicotiana benthamiana by BLAST search using the BLAST tool (http://solgenomics.net/organism/Nicotiana_benthamiana/genome). For each NtRGA, the genome sequence with the highest BLAST score (>50) and the smallest E-value (<1e-10) was regarded as its mapping segment.
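The best-hit rule can be sketched as follows. The input is assumed to be an iterable of (NtRGA, genome sequence, score, E-value) tuples, which is a hypothetical representation chosen for illustration. Ties in score are broken by the smaller E-value, which is one plausible reading of "highest score and smallest E-value".

```python
def best_genome_hits(hits, min_score=50.0, max_evalue=1e-10):
    """Return, for each NtRGA, its single best genome hit.

    A hit qualifies only if score > 50 and E-value < 1e-10. Among qualifying
    hits, the highest score wins; equal scores are resolved by the lower E-value.
    """
    best = {}
    for ntrga, genome_seq, score, evalue in hits:
        if not (score > min_score and evalue < max_evalue):
            continue
        current = best.get(ntrga)
        # Compare (score, -evalue) so that a larger tuple is always the better hit.
        if current is None or (score, -evalue) > (current[1], -current[2]):
            best[ntrga] = (genome_seq, score, evalue)
    return best
```

NtRGAs without any qualifying hit are simply absent from the result, which corresponds to sequences that could not be mapped.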

Development of RGA-SSR markers
SSR loci within RGAs were delineated using the Perl script MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa/). The following limiting conditions were set during the screening:
Newly designed primers were used for PCR amplification in 24 varieties of tobacco and 6 wild species of the genus Nicotiana in order to detect polymorphism at the species and/or genus levels. PCR was performed in a total volume of 20 μL using standard PCR conditions {20 ng DNA, 2.0 μL 10× buffer [0.8 mol/L Tris-HCl, 0.2 mol/L (NH4)2SO4, 0.2% (v/v) Tween 20], 2.0 μL 10× dNTPs (2.5 mmol/L each), 0.4 μL each PCR primer (10 mmol/L), 2.4 μL MgCl2 (25 mmol/L), 1 unit Taq polymerase (Cat. No. ET101, Tiangen, Beijing, China)}. The PCR profile was as follows: 1 cycle of 5 min at 94°C; 35 cycles of 1 min at 94°C, 30 s at 55°C and 45 s at 72°C; and a final extension of 10 min at 72°C. All primers were initially screened using Taq DNA polymerase. A negative control containing all PCR reaction components except template DNA served to validate the PCR. Each primer pair was screened twice to confirm the repeatability of the observed bands in each genotype. PCR products were separated on a 6% denaturing polyacrylamide gel, and the gels were silver stained for SSR band detection. Alleles were scored visually by comparing the position of the bands with the DNA marker.
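The SSR screening step at the start of this subsection can be approximated with a regular-expression scan. This is not the MISA script itself, and because the screening conditions used in the study are not given here, the minimum repeat numbers below (`MIN_REPEATS`) are purely hypothetical placeholders.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of repeat units.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n))


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = sequence.upper()
    found = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A unit followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                found.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(found)
```

Start positions are 0-based. The `is_primitive` check prevents a dinucleotide run such as (AT)n from being reported a second time as a tetranucleotide (ATAT)n repeat.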

RESULTS

Tobacco EST sequence assembly
As of June 7, 2012, the number of ESTs available for the genus Nicotiana in GenBank had reached 412325, of which 334384 were from N. tabacum, 56102 from N. benthamiana, 12448 from N. langsdorffii x N. sanderae, 8583 from N. sylvestris, 355 from N. attenuata and 453 from other species. All these EST sequences were downloaded from GenBank in FASTA format and used for the development of tobacco RGAs. However, they included a large number of redundant sequences. To reduce this redundancy and to obtain longer consensus sequences derived from the same loci, the EST sequences were assembled with TGICL. The assembly generated a total of 149606 potential unique EST sequences (Uni-ESTs), including 45137 contigs and 101169 singletons, with the longest sequence being 2312 bp, the shortest 431 bp and the average length 874 bp.

Identification of NtRGAs
A total of 112 R genes were used to search the tobacco Uni-EST sequences with an E-value cutoff of 1e-10. Of the 112 R genes, 109 showed similarity to a total of 6963 Uni-EST sequences, whereas 3 R genes (RPW8.1, RPW8.2 and xa27) had no match. Because different R-genes often harbor the same or similar domains and therefore tend to match the same Uni-EST sequence, many of the matched Uni-EST sequences were counted repeatedly. After removal of the repeated counts, a total of 1113 Uni-EST sequences matched the 109 R-genes (Additional file 1). Of these sequences, 273 harbored NBS-LRR domains, 546 harbored LRR-PK domains, 53 harbored extracellular LRR domains, 102 harbored only the PK domain, 30 harbored an Mlo domain, and no domain was found in the remaining 109. These Uni-ESTs were identified as tobacco RGAs and designated NtRGAs.
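The removal of repeated counts amounts to collapsing (R-gene, Uni-EST) matches onto unique Uni-ESTs. A minimal sketch follows; the function name and the pair representation are assumptions made for illustration.

```python
from collections import defaultdict


def collapse_hits(pairs):
    """Map each Uni-EST to the set of R-genes that matched it.

    `pairs` is an iterable of (r_gene, uni_est) tuples, one per BLAST match.
    The number of keys in the result is the non-redundant RGA count.
    """
    by_est = defaultdict(set)
    for r_gene, uni_est in pairs:
        by_est[uni_est].add(r_gene)
    return dict(by_est)
```

Applied to the matches reported above, this kind of collapsing reduces the 6963 matched entries to 1113 distinct Uni-ESTs, while retaining which R-genes each Uni-EST resembles.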

Mapping of NtRGAs in the Nicotiana benthamiana genome
Mapping the NtRGAs onto the tobacco genome is of great significance for the isolation of specific candidate disease resistance genes/QTLs. Because a draft genome sequence of N. benthamiana has been published, it was possible to map the NtRGAs directly onto its genome by BLAST search. In the present study, the draft genome sequence of N. benthamiana was screened for loci matching the NtRGA sequences. With the score threshold set to greater than 50 and the E-value to less than 1e-10, 1071 of the 1113 NtRGAs matched similar sequences in the N. benthamiana genome. Of the 1071 matched sequences, 965 (90.7%) matched more than one fragment. On average, one NtRGA sequence matched 9.67 fragments, and one (CL7158Contig2) matched 529 genome fragments. Since plant genes have frequently acquired multiple copies during the long course of evolution, it remains difficult to map the NtRGAs accurately onto the genome. In our study, the genome sequence with the highest BLAST score and the lowest E-value was regarded as the most probable genome locus of each NtRGA. In this way, the 1071 NtRGAs were allocated to 712 genome loci, of which 218 loci matched more than one NtRGA (Appendix 2). Further analysis revealed two types of NtRGAs matching the same genome locus: (1) NtRGAs with high homology matching the same genomic region (Fig. 1), and (2) NtRGAs without homology matching different genomic regions (Fig. 2). Because an alternatively spliced gene may be transcribed into different mRNAs, Type 1 NtRGAs could derive from homologous genes of different tobacco species but could also derive from the same gene. ESTs usually represent only part of a complete gene; EST sequences from the same gene may therefore lack an overlapping region and cannot be assembled. Thus, Type 2 NtRGAs were possibly derived from different segments of the same gene.
Regarding the distribution of NtRGAs in the N. benthamiana genome, we found that the NtRGAs were not evenly distributed throughout the genome and that tandem arrays of several NtRGAs occurred in 17 genomic scaffolds. For example, the 63 kb genomic scaffold (sequence ID: Niben.v0.3.Scf25265845) contained four NtRGAs.

Development of SSR markers
A total of 78 SSR loci were detected in the 1113 NtRGAs using the Perl script MISA; they were distributed over 72 sequences, and thus one SSR was present at every 939348 bp on average. Six NtRGAs harbored more than one SSR. We designed 64 pairs of primers within the flanking sequences of the SSRs using the software Primer Premier 5, but were unable to design primers for 8 NtRGAs because the flanking sequences were too short or too complex in structure. Of these 64 primer pairs, 54 generated clear bands in tobacco, and the remaining 10 failed to produce PCR products or generated RAPD-like non-specific bands (Table 3). Of the 54 primer pairs that generated clear bands, 46 produced amplified products of the expected length, seven produced products longer than expected, and one produced a fragment shorter than expected. Nine of the 54 primer pairs (16.7%) displayed polymorphism in the 24 varieties of tobacco. The total number of alleles detected at these nine loci was 23; the number of alleles per locus ranged from 2 to 4, with an average of 2.56 alleles per locus. All 54 primer pairs tested in cultivated varieties also amplified successfully in the 6 wild species of the genus Nicotiana, where higher levels of polymorphism were detected than in the cultivated varieties: a total of 41 primer pairs displayed polymorphism, accounting for 75.9% of the amplified primers. The total number of alleles detected at the 41 loci was 92; the number of alleles per locus ranged from 2 to 4, with an average of 2.61. The amplification results of RGA-63 in Nicotiana are shown in Fig. 3. At this locus, four alleles were observed in the 24 varieties of tobacco and 6 wild species of the genus Nicotiana.
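The polymorphism summary for the cultivated varieties follows from simple ratios. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the inputs are the counts reported above for the 24 tobacco varieties.

```python
def summarize_polymorphism(n_amplified, n_polymorphic, total_alleles):
    """Return (percent polymorphic primer pairs, mean alleles per polymorphic locus)."""
    rate = 100.0 * n_polymorphic / n_amplified
    mean_alleles = total_alleles / n_polymorphic
    return round(rate, 1), round(mean_alleles, 2)


# Cultivated tobacco: 54 amplified primer pairs, 9 polymorphic, 23 alleles in total.
print(summarize_polymorphism(54, 9, 23))  # prints (16.7, 2.56)
```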

DISCUSSION
As an integral part of the gene-for-gene disease resistance mechanism, plant R-genes play a crucial role in the recognition of pathogen-specific proteins encoded by the avirulence genes (Flor, 1956). In the present study, 1113 RGAs were identified from the tobacco EST data submitted to GenBank and then mapped onto the N. benthamiana genome, indicating that EST data can be used to identify RGAs efficiently.
To date, RGAs have been successfully identified from sugarcane, wheat, corn and other crops by data mining (Rossi et al., 2003; Dilbirligi et al., 2003; Collins et al., 1998). Dilbirligi et al. (2003) tested four strategies to search for RGAs in wheat: domain search, single or multiple motif search, consensus sequence search and single full-length sequence search. The authors found that the last strategy performed best, detecting 243 NBS-LRR-type RGAs and 101 RGAs of other types with the E-value set at ≤ 1e-10. Xiao et al. (2006) applied modified amplified fragment length polymorphism (AFLP), rapid amplification of cDNA ends (RACE) and data mining to identify R-gene-like ESTs (or RGAs) in maize and found that data mining was the most effective. Using a much stricter BLAST condition (E < 1e-50), Rossi et al. (2003) detected 88 RGAs from sugarcane EST sequences, representing three main R-gene families, namely NBS-LRR, LRR-TM and PK. These reports indicate that the results of RGA searches are influenced largely by the applied E-value. In the present study, the E-value was set to ≤ 1e-10, and a total of 1113 RGAs were identified, about three times as many as the number obtained from wheat (Dilbirligi et al., 2003). Two factors likely account for this large difference. First, the number of ESTs used for data mining was larger in tobacco than in wheat: we used a total of 412325 tobacco EST sequences, whereas Dilbirligi et al. (2003) used only 78221 wheat EST sequences. Second, 112 R-genes were employed for the BLAST search in this study, many more than the number of R-genes applied in wheat (Dilbirligi et al., 2003).
One of the advantages of identifying RGAs by mining EST sequences is that all identified RGAs are expressed genes, whereas RGAs identified from genome sequences may be unexpressed pseudogenes. For example, 65 RGAs of Lotus corniculatus identified by Li et al. (2010) were eventually found to be pseudogenes. Meyers et al. (2003) searched Arabidopsis thaliana for RGAs containing NBS-LRR domains and found that at least 12 NBS-LRR genes had evolved into pseudogenes through frameshift or nonsense mutations.
In this study, 1071 of the identified NtRGAs were allocated to 712 loci of the N. benthamiana genome, providing a basis for the future cloning of tobacco R-genes. However, 42 of the identified NtRGAs could not be mapped onto the N. benthamiana genome, most likely for the following reasons: (1) the whole-genome sequence of N. benthamiana used in the present study is still a draft and contains regions that have not yet been sequenced; (2) only part of the NtRGAs identified in this study were derived from N. benthamiana, and NtRGAs derived from other species may not be mappable because of sequence variation between the genomes of different species. We found that the NtRGAs were not distributed evenly throughout the genome, with a number of RGAs occurring in clusters, which is consistent with previous reports in other plants (He et al., 2004; Peñuela et al., 2002). The tandem arrangement of R-genes facilitates their genetic variation and evolution. Bertioli et al. (2009) analyzed the synteny of Arachis with Lotus and Medicago and found that retrotransposons are associated with some disease resistance gene families. Other hypotheses, such as replication, gene conversion and non-allelic exchange, have also been used to explain the clustering and evolution of plant R-genes (Ellis et al., 2000).
How to utilize the identified RGAs remains an unanswered question. In this study, 62 RGA-SSR markers have been developed, and they will facilitate the future mapping and cloning of disease resistance genes. Recent studies revealed that SSRs exhibit high polymorphism in common tobacco (Bindler et al., 2011). In the construction of a tobacco genetic map, Bindler et al. (2011) found that 2415 (47%) of 5119 pairs of SSR primers detected polymorphism between the parents. In this study, however, only 16.7% of the primers detected polymorphism in tobacco, and the number of alleles detected per locus was just 2-4. The low polymorphism of the SSR markers developed in this study might be because they were derived from expressed genes, which are under higher selective pressure than the genome as a whole and thus have to maintain high sequence conservation to retain their biological functions (Fay et al., 2003; Flowers et al., 2008).

Table 1. Varieties and wild species of tobacco used in this study.

Table 2. Information on the name, protein ID and structure of 112 known R genes from plants.

Gao et al. (2010)2009) isolated 78 RGAs from the peanut and its wild relatives with degenerate primers designed based on an NBS domain. Gao et al. (2010) isolated 100 RGAs from Nicotiana repanda based on NBS and PK domains.

Table 3. Characteristics of RGA-SSR in Nicotiana.