GENOME-WIDE CHARACTERIZATION AND PHYLOGENETIC AND EXPRESSION ANALYSES OF THE CALEOSIN GENE FAMILY IN SOYBEAN , COMMON BEAN AND BARREL MEDIC

Caleosin are a class of calcium-binding proteins embedded in the phospholipid monolayer of lipid droplets. In addition to maintaining the structure of lipid droplets, caleosin proteins are involved in dormancy and lipid signaling, and are associated with the stress response via their histidine-dependent peroxygenase activity. To date, caleosins have been studied in Arabidopsis thaliana. However, little is known about these genes in legumes, including the most cultivated oilseed crop, soybean. In this paper, 20 caleosin genes in soybean, common bean and barrel medic were studied. Among these, 13 caleosin genes, including 3 in Glycine max, 5 in Phaseolus vulgaris and 5 in Medicago truncatula, are identified for the first time. The structures, characteristics and evolution of the 20 caleosin proteins are analyzed. Expansion patterns show that tandem duplication was the main reason for the caleosin family expansion in the legume. Expression profiles indicate that L-caleosin in soybean and common bean are more important than H-caleosin, which is just the opposite in Arabidopsis thaliana. GmaCLO2, PvuCLO1, PvuCLO3 and MtrCLO3 may play important roles, while GmaCLO6, GmaCLO10 and MtrCLO4 may lose their function in the examined tissues. In addition, according to the results of cis-element analyses, we propose potential functions for the more important caleosin genes in leguminous plants. Our work provides helpful information for further evolution and function analyses of the caleosin gene family in soybean, common bean and barrel medic.


INTRODUCTION
Soybean (Glycine max) is not only the largest source of animal protein feed, but also the most cultivated oilseed crop, and the second largest vegetable oil source in the world.Soybean seed oil is used to supply essential nutrition for humans as well as renewable materials for bioenergy production [1].Triacylglycerols (TAGs), which are packed in organelles called lipid droplets (LDs), are the most abundant ingredient in seed oil.TAGs in LDs are enveloped by a monolayer of phospholipids embedded with some unique proteins including oleosins, caleosins and steroleosins [2,3].
Caleosins are a class of proteins that are only found in plants and fungi [4].Caleosins are structurally similar to the oleosins, consisting of three domains: an N-terminal hydrophilic domain, a central hydrophobic domain containing a proline knot for anchoring LDs, and a C-terminal hydrophilic domain.Unlike the protein structure of oleosins, the N-terminal of caleosins include an EF-hand calcium-binding motif and the C-terminal of caleosins contains several phosphorylation sites [3].
Studies of caleosins, which mainly focus on the model plant Arabidopsis thaliana, suggest that they may have a structural role and other unique func-tions.For instance, AtCLO1 is associated with lipid breakdown; AtCLO2 is involved in dormancy; At-CLO1, AtCLO2, AtCLO3 and AtCLO4 have haemdependent peroxygenase activity, which may be associated with plant stress responses [2,[5][6][7][8][9].However, limited information is available on leguminous caleosin proteins.Recently, a 27 KDa and a 29 KDa caleosin protein were identified in extracted soybean LDs [10].Seven caleosin genes were identified in the soybean (Glycine max) genome [11].However, some questions remain unanswered.Are there any additional caleosin genes in soybean?How many caleosin genes are there in common bean and barrel medic?How do they express and evolve?Which of these caleosin genes are the most important and what are the functions of these important genes?
The soybean genome provides an opportunity to analyze the caleosin gene family at a genomic level.The paleopolyploid soybean went through genome duplication 58 million years ago (MYA) and 13 MYA [12].After the duplication, many genes underwent loss or local duplication.The common bean (Phaseolus vulgaris) and soybean split 19.2 MYA [13], and barrel medic (Medicago truncatula) and soybean diverged 54 MYA [14].The available genome information about common bean [13] and barrel medic [14] facilitates our study of the evolution patterns and other analyses of the caleosin gene family in soybean.
In this study, 10, 5 and 5 caleosin genes were identified in the latest genomes of soybean (Glycine maxWm82.a2.v1), common bean (Phaseolus vulgaris v1.0) and barrel medic (Medicago truncatulaMt4.0v1),respectively.Moreover, the classification, characteristics, evolution analyses of these caleosin proteins and expression profiles of the genes were analyzed.In addition, potential functions of important leguminous caleosin proteins are proposed.

Identification of caleosin genes and characterization of caleosin proteins in three sequenced legume genomes
In order to obtain putative caleosin genes in three sequenced legume genomes, BLAST and Keyword searches were executed.The sequences of 8 Arabidopsis caleosin genes were obtained from the Arabidopsis Information Resource (TAIR10.0,http:// www.arabi-dopsis.org/)[15], and used as queries to search each of the latest legume protein databases using "BLASTP" in phytozome 10 (http://phytozome.jgi.doe.gov/pz/portal.html)[16], with E-value<10 -10 .For soybean, the latest database was Glycine max-Wm82.a2.v1; for common bean it was Phaseolus vulgaris v1.0, and for barrel medic it was Medicago truncatulaMt4.0v1.Sequences of candidate legume caleosin proteins were gathered to run BLASTP in TAIR to get the most similar corresponding genes in Arabidopsis.If the best hit gene in Arabidopsis was not a "caleosin", the candidate legume caleosin was excluded.The Keyword search was also performed using "caleosin" as a query in each legume genome of Phytozome 10 [16].Then, the redundant sequences were deleted manually.The deduced sequences of candidate caleosin proteins were checked for caleosin domains (IPR007736) using InterProScan 5 (http://www.ebi.ac.uk/interpro/search/sequencesearch) [17].

Multiple alignment and phylogenetic analysis
Multiple sequence alignment of leguminous caleosin proteins was performed by ClustalX 2.1 [21].Shading and output of the results were executed using Boxshade 3.21 in ExPASy (http://www.ch.embnet.org/software/BOX_form.html)[18].To evaluate the evolution relationship of caleosin proteins in Arabidopsis and the legumes, multiple sequence alignment was performed again with the leguminous caleosin proteins and 8 Arabidopsis caleosin proteins.Then, a neighbor-joining (NJ) tree was constructed by MEGA 6.06 [22] with the parameters: Poisson model, pairwise deletion of gaps and 1000 bootstrap; a maximum-likelihood (ML) tree was constructed by PhyML 3.0 (http://www.atgc-montpellier.fr/phyml/)[23] with the parameters: LG model, substitution rate 6, SPR improvement and 100 bootstrap.Both trees were visualized using MEGA 6.06 [22].

Expression analysis of caleosin genes
Expression patterns of caleosin genes were examined by analyzing data in public databases.We used RNA-seq data and ESTs (expressed sequence tags) for soybean and common bean, and microarray data and ESTs for barrel medic.ESTs and cDNAs were collected from the EST database in the NCBI (http:// www.ncbi.nlm.nih.gov/nucest/).Leguminous caleosin genes were used as queries to search the corre-sponding EST database, using BLASTN, with the following parameters: E-value<10 -10 , identity>95% and length>200bp.To investigate gene expression in different tissues/organs, all libraries were assigned into 12, 8, 10 synthetic libraries in soybean, common bean and barrel medic, respectively (Table S2).Then, the corresponding ESTs of each gene in each synthetic library were tallied and normalized to transcripts per million (TPM).
RNA-seq data of soybean and common bean were obtained from PhytoMine in Phytozome10.0,using gene loci as queries.
Microarray data of barrel medic were searched in the database Medicago truncatula gene expression atlas (http://mtgea.noble.org/v3/)[27].All microarray data were normalized using mean values of all replicates of the same experiment.As the genomes from version 4 were used, expression data could not found by gene ID search which includes version 3.5.Instead, BLASTN was employed to find the corresponding microarrays, using barrel medic caleosin genes as queries with identity>95%, E-value<10 -10 and length>200bp.
A heat map was drawn for visualizing expression by R script [28] with log2 transformed normalized expression values.

Identification of caleosin genes and classification of caleosin proteins in soybean, common bean and barrel medic
Based on Keyword and BLAST searches in Phytozome, 11, 5 and 5 caleosin genes were identified in soybean, common bean and barrel medic, respectively.After BLASTP in TAIR, no caleosin genes were deleted.Among these 21 leguminous caleosin genes, two adjacent genes in soybean were found.BLASTP in TAIR for these two genes showed that the upstream one was homologous to the 5' end of At5g55240 while the downstream one was homologous to the 3' end of At4g26740.Then, the genomic regions spanning these two genes were obtained, and were re-predicted using the software FGENESH.The obtained re-sult showed that there was a single gene, renamed Glyma.20G200900*,suggesting wrong annotation in the soybean genome.Then, 10, 5 and 5 caleosin genes were finally identified in soybean, common bean and barrel medic, respectively.
Multiple alignments were performed using all 20 complete protein sequences.The results of some alignments were poor, which might be due to the severely truncated sequences.Subsequently, these truncated proteins, GmaCLO6 and MtrCLO4, were deleted.Multiple alignments were performed again using the remaining 18 complete protein sequences (Fig. S1).According to the results of multiple alignment, apart from MtrCLO4 and GmaCLO6, 9 caleosin proteins were classified as both H-types and L-types.
Alignment results also showed that the N-terminal of GmaCLO9 was especially long (Fig. S1), which is characteristic of caleosin proteins in monocot [30].The EF-hand domain and proline knot, which are important to caleosin proteins, were examined.The EFhand domain was constant, except for GmaCLO10 (Fig. S1), which suggested that GmaCLO10 may lose its calcium-binding ability.And proline knots of the 18 caleosin proteins were highly conserved.In addition, previous research indicated that His 70 and His 133 in AtCLO1 were found to be crucial for peroxygenase activity; a similar result was also obtained in AtCLO3 and AtCLO4 [7][8][9].These two sites were found to be highly conserved in all 18 leguminous caleosin proteins, except for GmaCLO10.GmaCLO10 lost His 70 because of the medium sequence loss (Fig. S1).
Potential phosphorylation sites of serine, threonine and tyrosine were analyzed by NetPhos 2.0 and are given in Fig. S1.S 118 , T 136 and T 181 (according to GmaCLO2) were conserved in L-caleosin with an ex-ception of PvuCLO3, while T 85 (according to Gma-CLO1) was conserved in H-caleosin.Y 143 (according to GmaCLO1) was conserved in all the examined caleosin proteins, except GmaCLO2 and GmaCLO4 in L-caleosin (Fig. S1).
In addition, 12 residues were constant in the tested sequences, indicating these sites might play unique roles in the function of leguminous caleosin proteins (Fig. S1).

Property and motif analysis of leguminous caleosin proteins
After exclusion of the two truncated members (Gma-CLO6 and MtrCLO4), the predicted pIs of L-caleosin proteins were all above 8.6, except GmaCLO4, GmaC-LO5, PvuCLO1; and predicted pIs of H-caleosin proteins were all below 7.1, except GmaCLO9 (Table 1).MEME was performed to analyze the motif configurations of leguminous caleosin proteins and the lost motifs in the truncated proteins.The results showed that all motifs were in accord with the char-  S1. acters of "caleosin" (Fig. 1, Table S1).GmaCLO6 had only N-terminal motifs 2, 8, while MtrCLO4 had only C-terminal motifs 3, 7 (Fig. 1).Besides, in H-caleosin, motifs 5, 6, 3 and 10 were lost in GmaCLO7, motifs 3 and 10 were lost in GmaCLO8, and motifs 6, 2, 7, 3 and 10 were absent in GmaCLO10; in L-caleosin, motif 8 was absent in MtrCLO1 (Fig. 1).Moreover, Gma-CLO6 contained motif 8, which was the N-terminal unique to L-caleosin.Thus, GmaCLO6 belonged to L-caleosin.This provided supplementary information for the classification based on multiple alignment.

Phylogenetic analysis of the leguminous caleosin proteins
To investigate the evolutionary relationships of caleosin proteins in the legumes, phylogenetic trees were built using the NJ method by MEGA and the ML method by PhyML.Two trees had similar topologies with high bootstrap values (mostly >80%), suggesting that the trees were reliable (Fig. 2).Based on the trees, 18 caleosin proteins (excluding the two truncated caleosin proteins) were assigned into two groups, which was identical to the two groups classified by the alignment.

Expansion analysis of caleosin genes in the three legumes
According to the chromosomal architecture of the leguminous caleosin genes (Table 1), one cluster including five genes (50%) was tandemly located on chromosome 9 in soybean; one cluster including three genes (60%) was tandemly located on chromosome 3 in common bean, while a pair of genes (40%) was tandemly located on chromosome 1 in barrel medic.It was noteworthy that L-caleosin underwent tandem duplication.Moreover, two pairs containing four genes (40%) were identified to be a segmental duplication in soybean, while no gene was found to be a segmental duplication in common bean and barrel medic (Table 2).These four segmentally duplicated caleosin genes belong to H-caleosin.According to Ks values, GmaCLO1 and GmaCLO9 might have been duplicated ~11.61 million years ago (MYA), while GmaCLO8 and GmaCLO10, might have been duplicated ~58.99MYA (Table 2).

Expression patterns of leguminous caleosin genes in different tissues
The expression patterns of legume caleosin genes were investigated by searching public resources.Corresponding ESTs were identified in three legumes.RNA-seq data were collected in soybean and common bean, while microarrays were obtained for barrel medic.The ESTs of common bean and barrel medic were limited from these resources.The data showed that every caleosin gene had at least one expression, except MtrCLO4, suggesting that these genes were all expressed except MtrCLO4 (Table S2).It was further suggested that MtrCLO4 may lose its function in the examined tissues.
The expression profiles from soybean ESTs and RNA-seq data were similar (Table S2).The expression of GmaCLO2 was relatively high in the examined tissues.Meanwhile, the expressions of GmaCLO4, Gma-CLO6 and GmaCLO7 were particularly low, close to zero.GmaCLO1, GmaCLO2 and GmaCLO9 displayed high expression in flowers (Fig. 4a).
The common bean ESTs of seeds were abundant and ESTs of other tissues were scarce; the RNA-seq data of seeds were absence ( S2).Therefore, the data Fig. 3.A deduced evolutionary tree in caleosin genes in legumes and Arabidopsis.In L-caleosin, three copies (LF1, LF2, LF3, short for L-caleosin 1, 2, 3 of fabaceaes) (a) or two copies (LF1, LF2, short for L-caleosin 1, 2 of fabaceaes) (b) were formed by tandem duplication.In H-caleosin, three copies (HF1, HF2, HF3, short for H-caleosin 1, 2, 3 of fabaceaes) are formed by segmental duplication.In Fig. 3a and Fig. 3b, H  from two sources were combined to survey the expression pattern in common bean.As shown in Fig. 4b, the expressions of PvuCLO1and PvuCLO3 were relatively high in the examined tissues.PvuCLO1 was specifically expressed in flowers.
In barrel medic, the expression patterns from microarray data and ESTs were similar (Table S2).The expression of MtrCLO3 was relatively high and specifically expressed in seeds (Fig. 4c).Evidence for MtrCLO5 expression from ESTs was very low (Table S2).Moreover, the data of ESTs also showed that GmaCLO1, GmaCLO2, GmaCLO5, GmaCLO8 and MtrCLO1 were expressed under stress conditions (Table S2).In addition, some caleosin genes were expressed in nodules.Whether caleosin genes play a role in nodules remains to be determined.

Caleosin genes and coding proteins in three legumes
In this study, ten caleosin genes are identified in the latest soybean genome Glycine max Wm82.a2.v1.Song et al. [11] showed that there are seven caleosin genes in soybean genome Glycine max.This may be because the latest soybean genome version (978Mb, 56,044 protein-coding loci) is more complete than the older version (975Mb, 54,175 protein-coding loci).Meanwhile, the 27 KDa and 29 KDa caleosin proteins have recently been identified in extracted soybean LDs [10], which does not match the molecular weights predicted in this study, which may be due to the post-translational modifications of caleosin proteins [31].
Compared with Arabidopsis [6], phosphorylation sites T 136 and T 181 (according to GmaCLO2) are unique in the L-caleosin of the legumes.These phosphorylation sites may play import roles in these species.In addition, analysis of protein motif configuration indicates that H-caleosins are more likely to lose motifs.Besides, all L-caleosin genes undergo tandem duplication, and two H-caleosin genes undergo segmental duplication, suggesting this also may be the result of different ways of expansion.

Expansion patterns of the caleosin gene family in three legumes
Segmental duplication, tandem duplication and transposition can form members of a gene family.In 20 leguminous caleosin genes, ten genes (50%) have evolved by tandem duplication and four (20%) by segmental duplication.Thus, tandem duplication is the main reason for caleosin expansion in the three legumes.

Evolution of caleosin genes in three legumes and Arabidopsis
In the H-caleosin type, the segmental duplicated pair, GmaCLO8 and GmaCLO10, is speculated to have occurred ~58.99 MYA, which is in agreement with the emergence of papilionoids (59 MYA) [12].The pair, AthCLO1 and AthCLO2, might have evolved 26.33 MYA [6], which is consistent with the origin of crucifers (24-40 MYA) [32].This suggests that the duplications of H-caleosin genes in legumes and Arabidopsis occurred just after the speciation of legumes and crucifers.The phylogeny trees show that each clade of H-caleosin contains genes from the three legumes, implying that the H-caleosin in legumes duplicated before the split between barrel medic and milletioids (including soybean and common bean) (54MYA) [14].Another gene pair in H-caleosin genes, Gma-CLO1 and GmaCLO9, is due to a recent duplication ~11.61MYA after the whole genome duplication of soybean ~13MYA [12].
In the L-caleosin type, the central hydrophobic domain, PX3PSX3P, develops into PX3FSX3P in Arabidopsis [6] but is invariable in legumes.This indicates that the duplications of L-caleosin in legumes and Arabidopsis occurred after the speciation.The duplicated pair, AthCLO4 and AthCLO5, is speculated to have occurred ~40.33MYA [6], which fits into the origin of crucifers (24-40 MYA) well [32], providing support of the above viewpoints.However, numbers of clades (two (Fig. 3b) or three (Fig. 3a)) of the Lcaleosin classification in legumes remain to be determined (Fig. 3).The duplicated pair, GmaCLO1 and GmaCLO9, is speculated to have evolved ~11.61MYA, following the split of common bean and soybean ~19.2 [13].

Expression of caleosin genes in the three legumes
Motif analysis show that most H-caleosin genes with lower expression are truncated, suggesting that H-caleosin genes may lose or undermine their functions through loss of essential motifs or sequences.Moreover, the expressions of L-caleosin are much higher than H-caleosin in soybean and common bean, while the result is just the opposite in barrel medic.This may be due to the divergence of Hologalegina (including barrel medic) and milletioids (including soybean and common bean) 54 MYA [14].

Potential functions of important leguminous caleosin genes
The motifs DOFCOREZM (S000265) and RAV1AAT (S000314), which bind DOF and RAV1 proteins, are found in GmaCLO2, PvuCLO1, PvuCLO3, and MtrCLO3.The RAV1 protein contains AP2-like and B3-like domains, which include transcription factors WRI1, FUS3, ABI3 and LEC2 related to lipid metabolisms [34].A study shows that the soybean DOF proteins are associated with lipid accumulation [35].Thus, these caleosin proteins may play roles in lipid accumulation.ABRERATCAL (S000507), which responds to calcium signaling [36], is also found in GmaCLO2 and PvuCLO3.GmaCLO2 and PvuCLO3 have a calciumbinding domain, which suggests that these two caleosin proteins may be associated with calcium signal transduction.
GmaCLO2 displays expression under stress and specific expression in the flower.Therefore, Gma-CLO2 may be involved in lipid accumulation in the flower, or calcium signal transduction in flower and the process under stress.
PvuCLO3 presents upregulated expression in flower and bud.Thus, PvuCLO3 may participate in calcium signal transduction and lipid in flower and bud.
PvuCLO1 and MtrCLO3 show specific expression in flower and seed, respectively.Therefore, PvuCLO1 and MtrCLO3 may be related to lipid accumulation in flower and seed, respectively.

Fig. 1 .
Fig. 1.Distribution of conserved motifs in 20 caleosin proteins of legumes.Sequences and annotations of the motifs are shown in TableS1.

Fig. 2 .
Fig. 2. Phylogenetic trees of 18 caleosin proteins in legumes and 7 caleosin proteins in Arabidopsis by the neighbor-joining method in MEGA 6.06 based on 1000 replications (a) and the maximum likelihood method in PhyML 3.0 based on 100 replications (b).Blue, black, purple and red represent caleosin proteins from soybean, common bean, barrel medic and Arabidopsis, respectively.Numbers next to a node represent percent bootstrap values (>50%).The bars at the bottom represent the average number of amino acid substitutions per site.

Table 1 .
Classification and characteristics of caleosin proteins in three legumes a. NA means no available value.b.Glyma.20G200900* is a gene re-analyzed by FGENESH using gene models Glyma.20G200900 and Glyma.20G201000.