
CusProSe: a customizable protein annotation software with an application to the prediction of fungal secondary metabolism genes

January 25, 2023


Development of the CusProSe software

CustomProteinSearch (CusProSe) is a generic genome mining software, consisting of two distinct but complementary customizable programs: IterHMMBuild and ProSeCDA. IterHMMBuild is an HMM profile building tool based on an iterative learning process. ProSeCDA is a protein search and annotation tool based on user-defined domain architectures. The two programs can be run independently. An overview of the CusProSe workflow and package functionalities is presented in Fig. 1. Detailed information about its implementation and functioning is provided in the “Methods” section as well as in the CusProSe documentation page (https://i2bc.github.io/CusProSe/).

Figure 1

CusProSe package functionalities. (a) Overview of CusProSe. CusProSe contains two independent but complementary programs, IterHMMBuild and ProSeCDA. The figure schematizes how these tools work. IterHMMBuild provides users with representative HMM profiles of the proteins of interest, constructed by an iterative enrichment process starting from a small set of defined protein seed sequences. Two inputs are required in fasta format: the original seed sequence(s) (exemplified here as a.fa, b.fa, c.fa and x.fa) and a set of other protein sequences, such as a proteome (named here dataset.fa), used to iteratively feed the HMM profile. The output of IterHMMBuild includes, for each protein or protein domain of interest, the final HMM profile file (enriched.hmm). The different HMM profiles are then concatenated, and a database of profiles (database.hmm) is created and placed in the output directory. ProSeCDA searches a given protein dataset for multiple proteins of interest, defined by a user-specified set of domains and rules. The program takes as input a protein dataset of interest such as a proteome (dataset_2.fa file), an HMM profile database (database.hmm file) and a user-defined set of rules (rules.yaml file). The HMM profile database can be the one created with IterHMMBuild or, alternatively, any other compatible user-defined database (in hmm format). (b) Overview of the IterHMMBuild iterative enrichment process. In the first step of the procedure, an HMM profile model is built from the query protein sequences (x.hmm). This initial profile is then used to identify sequences with similar domains in the user-specified protein sequence dataset (dataset.fa). If matching sequences are found, they are added to the initial query sequence file (creating a new file named x-enriched.fa), and a new HMM profile is built. This process is repeated until no new sequences are recovered (i.e. convergence is reached). When convergence is reached, a final HMM file is built (named here x-enriched.hmm) and placed in the output directory. (c) Overview of the ProSeCDA steps. The first step of the procedure is the annotation of the protein dataset to be mined (such as a proteome, named here dataset_2.fa) with protein domains from a user-specified HMM profile database (HMM DB). In the next step, the annotated proteins (dataset_2.domtblout file) are filtered according to user-specified rules (rules.yaml file). Each rule is defined by different features including the protein family name, the “mandatory” list of domains (domains the protein must contain, green), and the “forbidden” list of domains (optional; domains the protein must not contain, red). All proteins matching those rules are selected and accessible in the output files. More details of the ProSeCDA outputs are presented in Fig. 2.

The CusProSe IterHMMBuild tool allows users to construct HMM profiles representative of their protein sequences of interest, through an iterative learning process starting from seed sequences and a fasta protein dataset (Fig. 1a,b). The IterHMMBuild procedure starts by building an HMM profile from a set of related protein or protein domain seed sequences, or from a single query sequence. This initial HMM profile is then used to identify sequences with similar domains in any user-specified protein sequence dataset, in order to enrich the profile model. If matching sequences are found, they are added to the initial query sequences and a new HMM profile model is built. This new HMM profile is then searched against the same target dataset in order to find sequences more distantly related to the original query sequence(s) [19]. This process is repeated (iterations) until convergence is reached, i.e., until no new sequences are recovered from the dataset (Fig. 1b; see also supplementary Fig. S1 and the “Methods” section for technical details). When convergence is reached, a final enriched HMM profile is built. A database of HMM profiles is then created by concatenating the individual final profiles, either automatically or manually depending on the input parameters and the user’s choice. Additional information about the HMM database creation procedure can be found in the “Methods” section and in the IterHMMBuild usage guideline chapter of the documentation page.

ProSeCDA searches a given protein dataset for multiple proteins of interest defined by a user-specified set of rules (Fig. 1a,c). The first step of the ProSeCDA pipeline is to annotate the protein dataset of interest with functional protein domains from a user-specified HMM profile database (Fig. 1c). This database can be the one generated using the IterHMMBuild package of CusProSe, as exemplified in the scheme of Fig. 1a, or, alternatively, any other compatible HMM profile database (.hmm file format). In the second step, annotated proteins are filtered following a set of rules which are also user-determined (Fig. 1c). The rules describe any protein families of interest based on user-defined domain architectures. Features defining each rule include the protein family name, the “mandatory” domains (domains the protein must contain) and the “forbidden” domains (optional; domains the protein must not contain). All proteins matching those rules are then accessible in the ProSeCDA output directory. The output files include, for each identified protein, a summary in xml format containing information such as the protein sequence and the boundaries of the conserved domain architectures (.xml), the protein sequence in fasta format (.fa) and, optionally, plots in pdf format showing a graphical representation of all the domains that matched the rules (.pdf) (Figs. 1c, 2b,c). An interactive web page for visualizing the ProSeCDA annotation results is also created (index.html file). This web page is illustrated in Fig. 2a.
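The iterative enrichment procedure described above can be sketched as a simple fixed-point loop. The following is a minimal illustration only, not the IterHMMBuild implementation: the function names (`iterative_enrichment`, `build_kmer_profile`, `kmer_search`) are hypothetical, and the toy "profile" is just a set of 3-mers standing in for a real HMM profile and a real HMM search.

```python
from typing import Callable, Iterable, Set, Tuple, TypeVar

P = TypeVar("P")  # any object playing the role of a profile


def iterative_enrichment(
    seeds: Iterable[str],
    dataset: Iterable[str],
    build_profile: Callable[[Set[str]], P],
    search: Callable[[P, Iterable[str]], Iterable[str]],
    max_iterations: int = 50,
) -> Tuple[P, Set[str]]:
    """Grow the seed set until a search recovers no new sequence."""
    members = set(seeds)
    dataset = list(dataset)
    for _ in range(max_iterations):
        profile = build_profile(members)
        new_hits = set(search(profile, dataset)) - members
        if not new_hits:  # convergence: nothing new was recovered
            break
        members |= new_hits
    # Build the final profile from the enriched set of sequences.
    return build_profile(members), members


# Toy stand-ins (assumptions for illustration, not HMMs):
def kmers(seq: str, k: int = 3) -> Set[str]:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_profile(members: Set[str]) -> Set[str]:
    # The "profile" is the union of all 3-mers seen in the members.
    return set().union(*(kmers(s) for s in members))


def kmer_search(profile: Set[str], dataset: Iterable[str], min_shared: int = 2):
    # A sequence "matches" if it shares at least min_shared 3-mers.
    return [s for s in dataset if len(kmers(s) & profile) >= min_shared]


seeds = ["ACDEFG"]
dataset = ["DEFGHI", "FGHIKL", "WWYYWW"]
profile, members = iterative_enrichment(
    seeds, dataset, build_kmer_profile, kmer_search
)
print(sorted(members))  # ['ACDEFG', 'DEFGHI', 'FGHIKL']
```

In this toy run, "FGHIKL" shares no 3-mer with the seed and is only recovered in the second iteration, once "DEFGHI" has enriched the profile. This mirrors the idea of reaching more distant sequences through successive rounds.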

Figure 2

Output of ProSeCDA. (a) Interactive web page for visualizing the annotation results. The page displays several panels. The user can get detailed information about each panel by clicking on the “i” located on the right of its header. The left panel lists the user-defined protein families for which proteins have been found. Clicking on a protein family name selects that family and updates the related information shown in the other panels. (b) XML file showing details of an individual annotated protein. (c) Schematic visualization of the domains identified in a protein, in pdf format (optional parameter of ProSeCDA). Two types of pdf files are generated. Upper panel: a file containing graphical representations of the most likely domain architecture of all the proteins matching the user-defined family rule; only one protein is shown here as an example. Lower panel: a file for each individual protein representing all the domains that matched the protein sequence during the annotation step.

Application of CusProSe to the prediction of fungal SMKEs

As an example application, CusProSe was used to identify major families of SMKEs in fungi. The tools developed were first tested to detect PKS, NRPS, hybrid PKS-NRPS (including NRPS-PKS), and DMATS enzymes in four species of phylogenetically unrelated plant pathogenic fungi with different host spectra, infection lifestyles and SM repertoires: (i) Botrytis cinerea (Leotiomycetes), a necrotrophic pathogen responsible for gray mold on more than 200 dicotyledons including grapevine, (ii) Colletotrichum higginsianum (Sordariomycetes), a hemibiotroph which attacks many cultivated plants among Brassicaceae as well as the model plant Arabidopsis thaliana, (iii) Zymoseptoria tritici (Dothideomycetes), a hemibiotroph which causes the most important foliar disease of wheat (“Septoria tritici blotch”) and (iv) Magnaporthe oryzae (Sordariomycetes), also a hemibiotroph, responsible for the most important disease of rice worldwide, rice blast [3,4,20,21,22,23,24].

First, HMM profiles were constructed using M. oryzae protein sequences of conserved functional domains (Table 1) characteristic of PKS, NRPS, PKS-NRPS and DMATS enzymes [25]. At this stage, three PKS, three NRPS, three hybrid PKS-NRPS and three DMATS were used to seed IterHMMBuild (Supplementary Data S1). HMM profiles were also constructed for type III PKS (t3PKS), using conserved domain sequences of two M. oryzae t3PKS enzymes (Supplementary Data S1). These initial domain profile models were then used to screen the M. oryzae proteome to identify potential new domains by homology search and thereby improve the HMM profiles. The database of domain profiles generated by IterHMMBuild was then given as input to ProSeCDA, together with the rules file, to annotate the M. oryzae proteome. The rules used to define each type of SMKE cited above can be found in supplementary File S1. This protocol made it possible to detect all PKS, NRPS, PKS-NRPS and DMATS from M. oryzae [25]. The enriched HMM domain profile models were subsequently used to screen the C. higginsianum proteome, which again led to the identification of all SMKEs and further enriched the HMM profiles. The same process was applied to B. cinerea and to Z. tritici. The identified proteins were manually validated at each step of the analysis. Comparison with existing data showed that the results obtained with CusProSe for these four fungi were consistent with previous SMKE annotations [4,22,24,25].

Table 1 List of PKS, NRPS and DMATS essential functional domains.

Comparison of CusProSe with existing SMKEs predictors

To assess the performance of CusProSe in predicting SMKEs relative to other predictors, the catalogs of proteins obtained for each fungus were compared to those obtained with antiSMASH [18] and SMURF [16]. The total numbers of SMKEs detected with the three predictors are shown in Fig. 3. The list of all proteins identified is available in supplementary data (File S3). Comparisons of the results obtained with the three programs are also presented in the Venn diagrams and tables of Fig. 4. As shown in Figs. 3 and 4, the effectiveness of CusProSe, antiSMASH and SMURF varies according to the family of SMKEs. For DMATS, all members of the family were detected by the three programs (Fig. 3). In contrast, for PKS, NRPS and PKS-NRPS hybrid enzymes, the programs differed in both the number of sequences recovered and the number of sequences correctly assigned. For PKS-NRPS enzymes, a significant number of false negatives (FN) were observed with antiSMASH and SMURF (Figs. 3 and 4). These missed enzymes were wrongly annotated as NRPS or PKS. For instance, the M. oryzae NRPS-PKS hybrid enzyme MGG_07803, involved in tenuazonic acid biosynthesis [26], was assigned as NRPS by both antiSMASH and SMURF, while it was correctly detected as a hybrid enzyme by CusProSe. Overall, antiSMASH correctly assigned 14 of the 19 hybrid enzymes (74%), whereas only 9 were identified by SMURF (47%). In contrast, CusProSe successfully detected all 19 PKS-NRPS/NRPS-PKS hybrids in the four fungal genomes analyzed [4,22,24,25].
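The percentages quoted above are simply the fraction of the 19 known hybrid enzymes that each predictor assigned correctly. As a small worked check (the `recall` helper and the dictionary layout are illustrative choices, not part of any of the three tools):

```python
def recall(correctly_assigned: int, total_known: int) -> float:
    """Fraction of known enzymes that a predictor assigned correctly."""
    if total_known <= 0:
        raise ValueError("total_known must be positive")
    return correctly_assigned / total_known


TOTAL_HYBRIDS = 19  # PKS-NRPS/NRPS-PKS hybrids in the four genomes
correct = {"CusProSe": 19, "antiSMASH": 14, "SMURF": 9}  # counts from the text

for predictor, n in correct.items():
    missed = TOTAL_HYBRIDS - n  # false negatives for the hybrid family
    print(f"{predictor}: {n}/{TOTAL_HYBRIDS} = "
          f"{recall(n, TOTAL_HYBRIDS):.0%} ({missed} missed)")
# CusProSe: 19/19 = 100% (0 missed)
# antiSMASH: 14/19 = 74% (5 missed)
# SMURF: 9/19 = 47% (10 missed)
```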

Figure 3

Number of fungal SMKEs identified by CusProSe, antiSMASH and SMURF. The graph displays the number of proteins identified with CusProSe, antiSMASH and SMURF for the PKS and PKS-like, NRPS and NRPS-like, PKS-NRPS (also including NRPS-PKS) and DMATS SMKE families from Magnaporthe oryzae (blue), Colletotrichum higginsianum (red), Botrytis cinerea (yellow) and Zymoseptoria tritici (green).

Figure 4

Venn diagrams of SMKEs detected with CusProSe, antiSMASH and SMURF. (a) PKS-NRPS*; (b) PKS; (c) NRPS. CusProSe SMKEs are labeled in red, while antiSMASH and SMURF SMKEs are labeled in green and blue, respectively. The tables on the right of each panel (a–c) highlight the proteins whose annotation differs between predictors. The color code is the same as in the Venn diagrams. MGG_ID: Magnaporthe oryzae; CH63R_ID: Colletotrichum higginsianum; BcinID: Botrytis cinerea; Mycgr3PID: Zymoseptoria tritici. Asterisk: includes NRPS-PKS hybrids.

Regarding the total number of PKS, 89 proteins were detected with CusProSe compared to 96 and 92 with antiSMASH and SMURF, respectively. As discussed above, some of the SMKEs detected as PKS by antiSMASH and SMURF are actually PKS-NRPS hybrids and can therefore be considered false positives (FP). Among the seven additional PKS reported by antiSMASH, four were in reality PKS-NRPS hybrids (Fig. 4), as attested by the presence of NRPS domains in addition to the PKS module [4,25]. The three other additional antiSMASH PKS were identified as PKS-like by CusProSe. Careful examination of these three proteins revealed that the PP-binding domain, one of the three essential domains of PKS enzymes, is absent from their sequences. We chose, in our rules file, to consider as PKS only enzymes with all three essential functional domains: KS, AT and PP-binding. Proteins with one missing domain were therefore classified as PKS-like enzymes by CusProSe (Fig. 3). The same holds true for the NRPS enzymes. Regarding SMURF, nine predicted PKS were found to be hybrid PKS-NRPS enzymes by CusProSe. SMURF also missed the five PKS belonging to the type III PKSs (t3PKS), unlike CusProSe and antiSMASH, which both have specific profiles or rules for these PKS enzymes.
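The PKS versus PKS-like distinction described above can be illustrated with a small rule check based on mandatory and forbidden domains. This is a hedged sketch and not the ProSeCDA rules format: the `Rule` class and `classify` function are hypothetical, and using the adenylation domain (labeled "A" here) as the forbidden marker of an NRPS module is an assumption made for illustration. The actual rules are those of supplementary File S1.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    family: str
    mandatory: frozenset          # domains the protein must contain
    forbidden: frozenset = frozenset()  # domains it must not contain


# Mandatory domains are those named in the text; the forbidden set
# is an illustrative assumption.
PKS_RULE = Rule(
    family="PKS",
    mandatory=frozenset({"KS", "AT", "PP-binding"}),
    forbidden=frozenset({"A"}),
)


def classify(domains: Iterable[str], rule: Rule = PKS_RULE) -> Optional[str]:
    """Return the family, '<family>-like' if exactly one mandatory
    domain is missing, or None if the protein does not qualify."""
    found = set(domains)
    if rule.forbidden & found:
        return None  # left to another rule, e.g. a hybrid family
    missing = rule.mandatory - found
    if not missing:
        return rule.family
    if len(missing) == 1:
        return f"{rule.family}-like"
    return None


print(classify(["KS", "AT", "PP-binding"]))       # PKS
print(classify(["KS", "AT"]))                     # PKS-like
print(classify(["KS", "AT", "PP-binding", "A"]))  # None
```

With such a scheme, a protein lacking only the PP-binding domain falls into the "-like" class rather than being counted as a full PKS, which is the behavior the paragraph above describes.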

For NRPS, 40 proteins were detected by CusProSe compared to 45 and 34 by antiSMASH and SMURF, respectively (Fig. 3). The different annotation predictions are illustrated in Fig. 4c. Thirty-three proteins were identified as NRPS by the three predictors, but differences were observed in 17 cases. Of the six proteins annotated as NRPS by CusProSe and antiSMASH only, two were missed by SMURF, whereas the four others were annotated by SMURF as “NRPS-like”. Nine proteins were annotated as NRPS by antiSMASH only. Of these, two were annotated by CusProSe and SMURF as NRPS-like, whereas the seven others were identified by CusProSe as NRPS-like (they lacked one of the three essential domains, Table 1) or “dom_A” (only an isolated adenylation domain was detected). In contrast, one protein was annotated as NRPS by CusProSe alone, being classified as “NRPS-like” by both antiSMASH and SMURF. Finally, as discussed previously, one SMKE detected as NRPS by antiSMASH and SMURF is in fact a hybrid enzyme (NRPS-PKS) [26], as annotated by CusProSe.
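Comparisons of this kind amount to partitioning the union of the predicted proteins into Venn regions, one region per combination of predictors. The sketch below shows one way to compute such a partition; the function name `venn_regions` and the protein identifiers (`p1` to `p4`) are hypothetical placeholders, not data from the study.

```python
from typing import Dict, FrozenSet, Set


def venn_regions(calls: Dict[str, Set[str]]) -> Dict[FrozenSet[str], Set[str]]:
    """Map each combination of predictors to the proteins annotated
    by exactly that combination and no other predictor."""
    regions: Dict[FrozenSet[str], Set[str]] = {}
    all_ids = set().union(*calls.values())
    for protein_id in all_ids:
        key = frozenset(name for name, ids in calls.items() if protein_id in ids)
        regions.setdefault(key, set()).add(protein_id)
    return regions


# Placeholder identifiers for illustration only.
calls = {
    "CusProSe": {"p1", "p2", "p3"},
    "antiSMASH": {"p1", "p2", "p4"},
    "SMURF": {"p1", "p4"},
}

regions = venn_regions(calls)
for predictors, ids in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(predictors), sorted(ids))
```

Each protein lands in exactly one region, so the region sizes sum to the size of the union, and the total for a given predictor is the sum of the regions that include it.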

Identification of Terpene synthase family enzymes

CusProSe was used to improve the detection of terpene synthases (TS, also referred to as TC or terpene cyclases) [27]. TS are SMKEs involved in the biosynthesis of terpenoids, which are among the most structurally and functionally diverse natural compounds [28,29]. They are synthesized in various organisms such as plants, bacteria and fungi [30,31]. Compared to other SMKEs, TS are highly variable both in the type of their functional protein domains and in their protein sequences [32,33]. This makes their detection by bioinformatic methods more difficult. As a consequence, TS analysis was not considered in the SMURF software [16], whereas antiSMASH does not distinguish between the different families of TS. The lack of a good TS detection method in currently available tools was therefore a challenging issue.

HMM profiles were constructed separately for five different families of TS, including sesquiterpenes, diterpenes, phytoenes, squalenes, and chimeric TS. These last enzymes are bi-functional proteins presenting both TS and prenyltransferase domains [30]. Specific profiles for sub-families of sesquiterpene synthases, which are the most abundant fungal TS enzymes, were also built. The rules file was enriched to include information on the different TS domain architectures of each sub-family (supplementary File S2). We started from a small number of well-defined fungal TS protein sequences from each of the different TS enzyme sub-families. The set included biochemically characterized proteins and manually annotated and reviewed sequences from the UniProtKB/Swiss-Prot section of the UniProt knowledgebase [34,35,36,37,38,39,40,41] (Data S4). HMM profile models for the different TS families were constructed with IterHMMBuild and used to screen the M. oryzae, C. higginsianum, B. cinerea and Z. tritici proteomes. As the SMURF algorithm does not include TS detection, we compared CusProSe predictions to those of antiSMASH only. As shown in Fig. 5, our profiles and rules lead to a more precise classification of TS. Indeed, we were able to identify the sub-families of TS present in the four different fungi, with additional TS enzymes found relative to the antiSMASH predictions (Fig. 5a,b). CusProSe particularly outperformed antiSMASH for the chimeric TS, but also performed better for diterpene synthases, sesquiterpene synthases, and squalene-hopene synthases (Fig. 5c, see also Fig. S2). In addition, CusProSe avoids false positives such as prenyltransferases (PT), which are implicated in terpenoid biosynthetic pathways but are not TS enzymes per se [42]. These proteins were classified as TS by antiSMASH (5 FP) (Fig. 5c). Parallel phylogenetic analyses confirmed our classification of TS into these sub-families (supplementary Fig. 3A,B).

Figure 5

Identification of TS by CusProSe in four fungal genomes and comparison with antiSMASH. (a) Number of TS identified by CusProSe and antiSMASH in Magnaporthe oryzae, Colletotrichum higginsianum, Botrytis cinerea, and Zymoseptoria tritici. (b) Number of TS identified with CusProSe for each sub-family in the four fungal genomes. (c) Venn diagram representing the efficiency of CusProSe for the detection of TS compared to antiSMASH. The number of TS enzymes annotated by CusProSe is shown in red in the Venn diagram, while the number of TS detected using antiSMASH is shown in green. On the right, proteins whose annotation differs between the two predictors.

Application of CusProSe to other fungal genomes

To further evaluate the efficiency of our HMM profiles and rules in identifying fungal SMKEs, both were used to mine other unrelated fungal genomes. We chose representative fungal species with well-annotated genomes from different taxonomic classes: Aspergillus nidulans and Aspergillus niger from Eurotiomycetes, Fusarium fujikuroi and Fusarium graminearum from Sordariomycetes, and Leptosphaeria maculans from Dothideomycetes [43,44,45,46]. The SMKEs identified for each fungus using the custom HMM profiles and ProSeCDA annotation rules were carefully analyzed and compared to previously published annotations [43,44,47,48]. The numbers and lists of all SMKEs are displayed in Fig. S4 and File S5, respectively. A comparison with the results obtained with antiSMASH showed that CusProSe performed notably better in the identification of the distinct TS enzymes (Figs. S5, S6). Beyond confirming the accuracy of the CusProSe predictions, these new genome mining analyses further enriched the fungal-specific HMM profiles of each family with additional sequences from phylogenetically diverse fungi. These HMM models may be useful to the scientific community interested in fungal SM, and we hope they can contribute to identifying other as-yet-unknown KEs.


Source link : https://news.google.com/__i/rss/rd/articles/CBMiMmh0dHBzOi8vd3d3Lm5hdHVyZS5jb20vYXJ0aWNsZXMvczQxNTk4LTAyMy0yNzgxMy150gEA?oc=5


Publish date : 2023-01-25 13:38:33
