Development of the CusProSe software
CustomProteinSearch (CusProSe) is a generic genome mining software, consisting of two distinct but complementary customizable programs: IterHMMBuild and ProSeCDA. IterHMMBuild is an HMM profile building tool based on an iterative learning process. ProSeCDA is a protein search and annotation tool based on user-defined domain architectures. The two programs can be run independently. An overview of the CusProSe workflow and package functionalities is presented in Fig. 1. Detailed information about its implementation and functioning is provided in the “Methods” section as well as in the CusProSe documentation page (https://i2bc.github.io/CusProSe/).
The CusProSe IterHMMBuild tool allows users to construct HMM profiles representative of their protein sequences of interest, through an iterative learning process starting from seed sequences and a fasta protein dataset (Fig. 1a,b). The IterHMMBuild procedure starts by building an HMM profile from a set of related protein or protein domain seed sequences, or from a single query sequence. This initial HMM profile is then used to identify sequences with similar domains in any user-specified protein sequence dataset, in order to enrich the profile model. If matching sequences are found, they are added to the initial query sequences and a new HMM profile model is built. This new HMM profile is then searched against the same target dataset in order to find sequences more distantly related to the original query sequence(s)19. This process is repeated (iterations) until convergence is reached, i.e., until no new sequences are recovered from the dataset (Fig. 1b) (see also the supplementary Fig. S1 and “Methods” section for technical details). When convergence is reached, a final enriched HMM profile is built. A database of HMM profiles is then created by concatenating the individual final profiles, either automatically or manually depending on input parameters and the user’s choice. Additional information about the HMM database creation procedure can be found in the “Methods” section and in the IterHMMBuild usage guideline chapter of the Documentation page.
ProSeCDA searches a given protein dataset for multiple proteins of interest defined by a user-specified set of rules (Fig. 1a,c). The first step of the ProSeCDA pipeline is to annotate the protein dataset of interest with functional protein domains from a user-specified HMM profile database (Fig. 1c). This database can be the one generated using the IterHMMBuild package of CusProSe, as exemplified in the scheme of Fig. 1a, or, alternatively, any other compatible HMM profile database (.hmm file format). In the second step, annotated proteins are filtered according to a set of rules, which are also user-defined (Fig. 1c). The rules describe any protein families of interest in terms of user-defined domain architectures. Features defining each rule include the protein family name, the “mandatory” domains (list of domains the protein must contain) and the “forbidden” domains (optional, list of domains the protein must not contain). All proteins matching those rules are then available in the ProSeCDA output files directory. For each identified protein, the output files include a summary in xml format containing information such as the protein sequence and the boundaries of the conserved domain architectures (.xml), the protein sequence in fasta format (.fa), and, optionally, plots in pdf format showing a graphical representation of all the domains that matched the rules (.pdf) (Figs. 1c, 2b,c). An interactive web page for visualizing the ProSeCDA annotation results is also created (index.html file). This web page is illustrated in Fig. 2a.
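The rule-based filtering step can be illustrated with a minimal Python sketch. It is not the ProSeCDA implementation and does not reproduce the format of its rules file: the `Rule` class, the `matching_families` function, the protein identifiers and the choice of forbidden domains are all assumptions made for illustration. Only the notions of family name, mandatory domains and forbidden domains come from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory form of one rule (not the real rules file format)."""
    family: str
    mandatory: frozenset              # domains the protein must contain
    forbidden: frozenset = frozenset()  # domains the protein must not contain

def matching_families(domains, rules):
    """Return the families whose rule is satisfied by a protein's annotated domains."""
    found = set(domains)
    return [
        rule.family
        for rule in rules
        if rule.mandatory <= found and not (rule.forbidden & found)
    ]

# Illustrative rules; domain names "A" and "C" are placeholders for NRPS domains.
rules = [
    Rule("PKS", frozenset({"KS", "AT", "PP-binding"}), frozenset({"A", "C"})),
    Rule("PKS-NRPS", frozenset({"KS", "AT", "PP-binding", "A", "C"})),
]

# Made-up domain annotations, as would be produced by the annotation step.
annotations = {
    "protein_1": ["KS", "AT", "PP-binding"],
    "protein_2": ["KS", "AT", "PP-binding", "C", "A"],
    "protein_3": ["KS", "AT"],
}

results = {name: matching_families(doms, rules) for name, doms in annotations.items()}
for name, families in results.items():
    print(name, families)
```

In this sketch the forbidden domains keep a hybrid architecture from also being reported as a plain PKS, and a protein lacking a mandatory domain matches no rule at all.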
Application of CusProSe to the prediction of fungal SMKEs
As an example of application, CusProSe was used to identify major families of SMKEs in fungi. The tools developed were first tested to detect PKS, NRPS, hybrid PKS-NRPS (including NRPS-PKS), and DMATS enzymes, in four species of phylogenetically unrelated plant pathogenic fungi with different host spectra, infection lifestyles and SM repertoires: (i) Botrytis cinerea (Leotiomycetes), a necrotrophic pathogen responsible for gray mold on more than 200 dicotyledons including grapevine, (ii) Colletotrichum higginsianum (Sordariomycetes), a hemibiotroph which attacks many cultivated plants among Brassicaceae as well as the model plant Arabidopsis thaliana, (iii) Zymoseptoria tritici (Dothideomycetes), a hemibiotroph which causes the most important foliar disease of wheat (“Septoria tritici blotch”) and (iv) Magnaporthe oryzae (Sordariomycetes), also a hemibiotroph, responsible for the most important disease of rice worldwide, rice blast3,4,20,21,22,23,24.
First, HMM profiles were constructed using M. oryzae protein sequences of conserved functional domains (Table 1) characteristic of PKS, NRPS, PKS-NRPS and DMATS enzymes25. At this stage, three PKS, three NRPS, three hybrid PKS-NRPS and three DMATS were used to seed IterHMMBuild (Supplementary Data S1). HMM profiles were also constructed for type III PKS (t3PKS), using conserved domain sequences of two M. oryzae t3PKS enzymes (Supplementary Data S1). These initial domain profile models were then used to screen the M. oryzae proteome by homology search, in order to identify potential new domains and thereby improve the HMM profiles. The database of domain profiles generated by IterHMMBuild was then given as input to ProSeCDA, together with the rules file, to annotate the M. oryzae proteome. The rules used to define each type of SMKE cited above can be found in the supplementary File S1. This protocol made it possible to detect all PKS, NRPS, PKS-NRPS and DMATS from M. oryzae25. The enriched HMM domain profile models were subsequently used to screen the C. higginsianum proteome, which again led to the identification of all SMKEs and further enriched the HMM profiles. The same process was applied to B. cinerea and to Z. tritici. The identified proteins were manually validated at each step of the analysis. Comparison with existing data showed that the results obtained with CusProSe for these four fungi were consistent with previous SMKE annotations4,22,24,25.
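The profile enrichment used here (search the proteome, add the new hits to the seed set, rebuild the profile, and repeat until nothing new is found) can be sketched as a simple loop. The Python example below is a toy under stated assumptions, not the IterHMMBuild code: the "profile" is just a set of 3-mers standing in for a real HMM profile, and the function names, threshold and sequences are invented for illustration.

```python
def kmers(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_profile(sequences):
    """Toy stand-in for profile building: the union of k-mers of all members."""
    profile = set()
    for seq in sequences:
        profile |= kmers(seq)
    return profile

def search(profile, dataset, min_shared=2):
    """Toy stand-in for a profile search: names sharing enough k-mers with the profile."""
    return {name for name, seq in dataset.items()
            if len(kmers(seq) & profile) >= min_shared}

def iterative_build(seeds, dataset, max_iter=20):
    """Rebuild the profile until a search recovers no new sequences (convergence)."""
    members = dict(seeds)
    for iteration in range(1, max_iter + 1):
        profile = build_profile(members.values())
        new_hits = search(profile, dataset) - set(members)
        if not new_hits:
            return profile, members, iteration
        for name in new_hits:
            members[name] = dataset[name]
    # Iteration cap reached without convergence: return the last enriched profile.
    return build_profile(members.values()), members, max_iter

# Made-up sequences: p2 is only reachable once p1 has enriched the profile.
seeds = {"seed": "ACDEFG"}
dataset = {"p1": "ACDEFGHIK", "p2": "GHIKLMN", "p3": "WWYYWW"}

profile, members, iterations = iterative_build(seeds, dataset)
print(sorted(members), "converged after", iterations, "iterations")
```

The toy data show why iteration matters: `p2` shares nothing with the seed alone and is recovered only after `p1` has been added to the model.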
Comparison of CusProSe with existing SMKE predictors
To assess the performance of CusProSe in predicting SMKEs relative to other predictors, the catalogs of proteins obtained for each fungus were compared to those obtained with antiSMASH18 and SMURF16. The total numbers of SMKEs detected with the three predictors are shown in Fig. 3. The list of all proteins identified is available in supplementary data (File S3). Comparisons of the results obtained with the three programs are also presented in the Venn diagrams and tables of Fig. 4. As shown in both Figs. 3 and 4, the effectiveness of CusProSe, antiSMASH and SMURF varies according to the family of SMKEs. All members of the DMATS family were detected by the three programs (Fig. 3). In contrast, for PKS, NRPS and the PKS-NRPS hybrid enzymes, the programs differed in both the number of sequences recovered and the number of sequences correctly assigned. For PKS-NRPS enzymes, a substantial number of false negatives (FN) were observed with antiSMASH and SMURF (Figs. 3 and 4). These missed enzymes were wrongly annotated as NRPS or PKS. For instance, the M. oryzae NRPS-PKS hybrid enzyme MGG_07803, involved in tenuazonic acid biosynthesis26, was assigned as NRPS by both antiSMASH and SMURF, while it was correctly detected as a hybrid enzyme by CusProSe. Overall, antiSMASH correctly assigned 14 of the 19 hybrid enzymes (74%), whereas only 9 were identified by SMURF (47%). In contrast, CusProSe successfully detected all 19 PKS-NRPS/NRPS-PKS hybrids in the four fungal genomes analyzed4,22,24,25.
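The percentages quoted for the hybrid enzymes follow directly from the counts given in the text. The short Python check below recomputes them as the fraction of the 19 hybrids that each predictor assigned correctly; the variable names are illustrative only.

```python
# Counts of correctly assigned hybrid enzymes, taken from the text above.
total_hybrids = 19
correctly_assigned = {"CusProSe": 19, "antiSMASH": 14, "SMURF": 9}

# Fraction of the known hybrids recovered by each predictor.
recovered_fraction = {tool: n / total_hybrids for tool, n in correctly_assigned.items()}

for tool, fraction in recovered_fraction.items():
    print(f"{tool}: {correctly_assigned[tool]}/{total_hybrids} = {fraction:.0%}")
```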
Regarding the total number of PKS, 89 proteins were detected with CusProSe compared to 96 and 92 with antiSMASH and SMURF, respectively. As discussed above, some of the SMKEs detected as PKS by antiSMASH and SMURF are actually PKS-NRPS hybrids and can therefore be considered false positives (FP). Among the seven additional PKS reported by antiSMASH, four were in reality PKS-NRPS hybrids (Fig. 4), as attested by the presence of NRPS domains in addition to the PKS module4,25. The three other additional antiSMASH PKS were identified as PKS-like by CusProSe. A careful examination of these three proteins revealed that the PP-binding domain, one of the three essential domains of PKS enzymes, is absent from their sequences. In our rules file, we chose to consider as PKS only those enzymes that have all three essential functional domains: KS, AT and PP-binding. Proteins with one missing domain were therefore classified as PKS-like enzymes by CusProSe (Fig. 3). The same holds true for the NRPS enzymes. Regarding SMURF, nine predicted PKS were found to be hybrid PKS-NRPS enzymes by CusProSe. SMURF also missed the five PKS belonging to the type III PKSs (t3PKS), unlike CusProSe and antiSMASH, which both have specific profiles and rules for these PKS enzymes.
For NRPS, 40 proteins were detected by CusProSe compared to 45 and 34 by antiSMASH and SMURF, respectively (Fig. 3). The different annotation predictions are illustrated in Fig. 4c. Thirty-three proteins were identified as NRPS by the three predictors, but differences were observed in 17 cases. Of the six proteins annotated as NRPS by CusProSe and antiSMASH only, two were missed by SMURF, whereas the four others were annotated by SMURF as “NRPS-like”. Nine proteins were annotated as NRPS by antiSMASH only. Of these, two were annotated by CusProSe and SMURF as NRPS-like, whereas the seven others were identified by CusProSe as NRPS-like (they lacked one of the three essential domains, Table 1) or “dom_A” (only an isolated adenylation domain was detected). In contrast, one protein was annotated as NRPS by CusProSe alone, being classified as “NRPS-like” by both antiSMASH and SMURF. Finally, as discussed previously, one SMKE detected as NRPS by antiSMASH and SMURF is in fact a hybrid enzyme (NRPS-PKS)26, as annotated by CusProSe.
Identification of Terpene synthase family enzymes
CusProSe was used to improve the detection of Terpene synthases (TS, also referred to as TC or Terpene Cyclases)27. TS are SMKEs involved in the biosynthesis of terpenoids, which are among the most structurally and functionally diverse natural compounds28,29. They are synthesized in various organisms such as plants, bacteria and fungi30,31. Compared to other SMKEs, TS are highly variable both in the type of their functional protein domains and in their protein sequences32,33. This variability makes their detection by bioinformatic methods more difficult. As a consequence, TS analysis was not considered in the SMURF software16, whereas antiSMASH does not distinguish between the different families of TS. The lack of a good TS detection method in currently available tools was therefore a challenging issue.
HMM profiles were constructed separately for five different families of TS, including sesquiterpenes, diterpenes, phytoenes, squalenes, and chimeric TS. These last enzymes are bi-functional proteins presenting both TS and prenyltransferase domains30. Specific profiles for sub-families of sesquiterpene synthases, which are the most abundant fungal TS enzymes, were also built. The rules file was enriched to include information on the different TS domain architectures of each sub-family (supplementary File S2). We started from a small number of well-defined fungal TS protein sequences from each of the different TS enzyme sub-families. The set included biochemically characterized proteins and manually annotated and reviewed sequences from the UniProtKB/Swiss-Prot section of the UniProt knowledgebase34,35,36,37,38,39,40,41 (Data S4). HMM profile models for the different TS families were constructed with IterHMMBuild and used to screen the M. oryzae, C. higginsianum, B. cinerea and Z. tritici proteomes. As the SMURF algorithm does not include TS detection, we compared CusProSe predictions to those of antiSMASH only. As shown in Fig. 5, our profiles and rules lead to a more precise classification of TS. Indeed, we were able to identify the sub-families of TS present in the four different fungi, and found additional TS enzymes relative to the antiSMASH predictions (Fig. 5a,b). CusProSe particularly outperformed antiSMASH for the chimeric TS, and also performed better for diterpene synthases, sesquiterpene synthases, and squalene-hopene synthases (Fig. 5c, see also Fig. S2). In addition, CusProSe avoids false positives such as prenyltransferases (PT), which are implicated in terpenoid biosynthetic pathways but are not TS enzymes per se42. These proteins were classified as TS by antiSMASH (5 FP) (Fig. 5c). Parallel phylogenetic analyses confirmed our classification of TS into these sub-families (supplementary Fig. 3A,B).
Application of CusProSe to other fungal genomes
To further evaluate the efficiency of our HMM profiles and rules in identifying fungal SMKEs, both were used to mine other unrelated fungal genomes. We chose representative fungal species with well-annotated genomes from different taxonomic classes: Aspergillus nidulans and Aspergillus niger from Eurotiomycetes, Fusarium fujikuroi and Fusarium graminearum from Sordariomycetes, and Leptosphaeria maculans from Dothideomycetes43,44,45,46. The SMKEs identified for each fungus using the custom HMM profiles and ProSeCDA annotation rules were carefully analyzed and compared to previously published annotations43,44,47,48. The numbers and lists of all SMKEs are displayed in Fig. S4 and File S5, respectively. A comparison with the results obtained with antiSMASH showed that CusProSe performed notably better in the identification of the distinct TS enzymes (Figs. S5, S6). Beyond confirming the accuracy of the CusProSe predictions, these additional genome mining analyses further enriched the fungal-specific HMM profiles of each family with sequences from phylogenetically diverse fungi. These HMM models may be useful to the scientific community interested in fungal SM and we hope they can contribute to identifying other as-yet-unknown KEs.
Source link : https://news.google.com/__i/rss/rd/articles/CBMiMmh0dHBzOi8vd3d3Lm5hdHVyZS5jb20vYXJ0aWNsZXMvczQxNTk4LTAyMy0yNzgxMy150gEA?oc=5
Publish date : 2023-01-25 13:38:33