Borgs are huge, linear extrachromosomal elements associated with anaerobic methane-oxidizing archaea. Striking features of Borg genomes are pervasive tandem direct repeat (TR) regions. Here, we present six new Borg genomes and investigate the characteristics of TRs in all ten complete Borg genomes. We find that TR regions are rapidly evolving, recently formed, arise independently, and are virtually absent in host Methanoperedens genomes. Flanking partial repeats and A-enriched character constrain the TR formation mechanism. TRs can be in intergenic regions, where they might serve as regulatory RNAs, or in open reading frames (ORFs). TRs in ORFs are under very strong selective pressure, leading to perfect amino acid TRs (aaTRs) that are commonly intrinsically disordered regions. Proteins with aaTRs are often extracellular or membrane proteins, and functionally similar or homologous proteins often have aaTRs composed of the same amino acids. We propose that Borg aaTR-proteins functionally diversify Methanoperedens and that all TRs are crucial for specific Borg–host associations and possibly cospeciation.
Citation: Schoelmerich MC, Sachdeva R, West-Roberts J, Waldburger L, Banfield JF (2023) Tandem repeats in giant archaeal Borg elements undergo rapid evolution and create new intrinsically disordered regions in proteins. PLoS Biol 21(1):
Academic Editor: Harmit S. Malik, Fred Hutchinson Cancer Research Center, UNITED STATES
Received: June 3, 2022; Accepted: December 23, 2022; Published: January 26, 2023
Copyright: © 2023 Schoelmerich et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Metagenomic sequencing reads and newly released Borg genomes reported in this paper are available under NCBI BioProject: PRJNA914281. Additional Borg and “Candidatus Methanoperedens spp.” metagenomes are available under NCBI BioProject: PRJNA866293. Protein sequences, structural models, and the phylogenetic tree of the DNA polymerases from this study are available via Zenodo (10.5281/zenodo.6533809). The Python script that was used to detect tandem repeats on the nucleotide or protein level is available on GitHub (https://github.com/rohansachdeva/tandem_repeats). More statistics on TR regions in the ten completed Borg genomes can be found in Table S14 and Table S15. Sequence databases used include KEGG, UniRef100, UniProt, pfam, and ggkbase (ggkbase.berkeley.edu).
Funding: This publication is based on research in part funded by the Bill & Melinda Gates Foundation (Grant Number: INV-037174 to JFB). The findings and conclusions contained within are those of the authors and do not necessarily reflect positions or policies of the Bill & Melinda Gates Foundation. Additional funding for this research was provided by a DFG fellowship for MCS (Project Number: 447383558 to MCS), and by the Innovative Genomics Institute at UC Berkeley (GI 52482 to JFB). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: JFB is a cofounder of Metagenomi.
Abbreviations: aaTR, amino acid TR; ANME, anaerobic methanotrophic; AOM, anaerobic oxidation of methane; arcRNA, architectural RNA; ECE, extrachromosomal element; ELM, eukaryotic linear motif; FDR, false discovery rate; IDR, intrinsically disordered region; (l)ncRNA, (long) noncoding RNA; MHC, multiheme cytochrome; NCLDV, nucleocytoplasmic large DNA virus; ORF, open reading frame; PAM, protospacer adjacent motif; SLiM, short linear motif; SNP, single nucleotide polymorphism; TMH, transmembrane helix; TR, tandem repeat; VNTR, variable number TR
Metagenomics has led to the increasing discovery of microbial extrachromosomal elements (ECEs) from environmental samples [1,2]. This is of great importance, as it allows us to better understand evolutionary processes and functional roles of ECEs in natural systems and potentially use the ECEs, or parts of them, for genetic engineering. Using genome-resolved metagenomics, we recently discovered Borgs, which are unusually large ECEs. Based on gene content and co-occurrence patterns, Borgs associate with several species of anaerobic methanotrophic (ANME) archaea of the Methanoperedens genus. A search for other ECEs associated with these archaea also led to the recent discovery of large plasmids found in members of the same genus, yet a distinct clade of Methanoperedens species.
ANME perform anaerobic oxidation of methane (AOM) by using the reverse methanogenesis pathway. One outstanding feature of Borgs is that they carry genes that encode proteins involved in key steps of their hosts’ metabolism. Remarkably, some Borgs encode the methyl-CoM reductase, all encode multiheme cytochromes (MHCs) that relay electrons onto extracellular terminal electron acceptors, and others encode nitrogenase used for nitrogen fixation. The metabolic potential of different Borgs varies, suggesting diverse modes of interplay between Borgs and their hosts.
Borg genomes share a conserved genomic architecture that is very distinct from Methanoperedens chromosomes and plasmids of Methanoperedens. They are linear and large, with genome sizes for the first described examples ranging from 0.66 to 0.92 Mbp, thus exceeding known archaeal virus genome sizes by far. This places them in the range of giant and large eukaryotic double-stranded DNA viruses from the nucleocytoplasmic large DNA virus (NCLDV), with genome sizes that can exceed 2.5 Mbp. Linearity also occurs in some virus genomes and in eukaryotic chromosomes (Saccharomyces) as well as large linear plasmids of bacteria (Streptomyces, Micrococcus luteus) [8–10]. The Borg genomes reported so far have a large and a small replichore, and each replichore carries virtually all genes on only one strand. Borg genomes are terminated by long inverted repeats, and nucleotide tandem repeats (TRs) are scattered throughout their entire genomes. These TR sequences follow a head-to-tail pattern, occur in both intergenic and genic regions, and the units are perfectly repeated without single nucleotide polymorphisms (SNPs).
Since these perfect nucleotide TRs are a key common feature of Borgs, we decided to investigate them further to shed light on their potential functions. Here, we analyze TR regions in Borg genomes that are ≥50 nucleotides in length and include ≥3 repeat units. We found that DNA sequence assembly algorithms often collapse TR regions and that TRs frequently terminate contigs. This is not surprising, given that repeats in general are a well-known cause of errors in assemblies [7–10]. Thus, we augmented the four existing manually curated Borg genomes by manually curating six new Borg genomes, and twelve additional Borg contigs, to more completely uncover TRs. We found that TRs within open reading frames (ORFs) did not disrupt ORFs, were of lengths divisible by three, and thus form amino acid tandem repeats (perfect aaTRs within proteins). We then bioinformatically analyzed these aaTRs alone and together with the proteins they arise in. The high frequency, abundance, and within-population evolutionary dynamics of repetitive sequences suggest that they are fast evolving and have important biological functions.
2.1. Genome curation and TR analysis
2.1.1. Curation of repetitive DNA sequences to complete Borg genomes.
We reconstructed and manually curated six new Borg genomes to completion (see Methods; Table 1). Curation included correction of local errors where the automatically generated assemblies collapsed regions or incorporated the wrong number of repeat units. These regions were identified visually based on an elevated incidence of SNPs (Fig 1A) that clearly indicated that the region was misassembled (Fig 1B). All gaps created during the scaffolding step of the assembly were filled and genome fragments extended by making use of unplaced or misplaced paired reads. In some cases, the extending sequences were used to identify missing genome fragments that were then curated into the final assembly. In a few instances, TR regions exceeded the sequencing insert length so the actual number of TR units could not be precisely determined. Resolving the de novo assembly errors unmasked TR regions in the genomes and revealed the aaTRs that these introduce into protein sequences.
Fig 1. Reads mapped to a de novo assembly before and after curation.
Reads are shown as gray bars below the consensus sequence. SNPs in reads are highlighted in color. The pink segments mark a unique sequence that was used to place misplaced reads correctly and the nucleotide repeat unit is shown as black segments. (A) A large number of SNPs indicates a local assembly error associated with a collapsed repeat region, resulting in a consensus sequence with one repeat unit. (B) The consensus sequence has three repeat units after curation.
Rose Borg possesses the smallest (623,782 bp), and Green Borg possesses the largest genome (1,094,519 bp). As expected, based on the four previously reported genomes, five of the new genomes are linear, terminated by inverted repeats. Based on GC skew analysis, replication of Borg DNA is initiated at the chromosome ends (Figs 2 and S1 and S1 Table). The Pink Borg genome contains a repeated sequence that prevented identification of a unique genome assembly path. The variant that generated the expected pattern of GC skew as for the other Borgs was chosen, completing the final set of ten Borg genomes. All genomes but the Green Borg genome are composed of a large and a small replichore; the Green Borg genome has a slightly more complex organization. Each replichore carries essentially all genes on only one strand. Consequently, there are no apparent transcriptional operons. This was reflected by a low frequency of transcriptional regulators in Borg genomes. Specifically, we only found 0.35 transcriptional regulators per 100 kbp of Borg genome, whereas near-complete Methanoperedens genomes encoded 5.9 in 100 kbp (1 in Rose; 2 in Purple; 3 in Brown, Lilac, Sky, Green; 5 in Purple; 6 in Orange; and none in Ochre and Pink Borg versus 53, 53, and 62 in three near-complete Methanoperedens genomes SRVP18_hole-7m-from-trench_1_20cm__Methanoperedens-related_44_31, RifSed_csp2_16ft_3_Methanoperedens_45_12, RifSed_csp1_19ft_2_Methanoperedens_44_10; 2.77–2.90 Mbp). We also note that none of the Borgs encode a DNA-dependent RNA polymerase or a TATA-box binding protein.
Fig 2. Borg genomes are composed of a large and a small replichore, and their replication is initiated at the termini.
Shown is the genome architecture of Brown Borg, including terminal inverted repeats (white arrows), TR regions (purple arrows), and GC composition. GC skew and cumulative GC skew show that the genome is replicated from the terminal inverted repeats (origin) to the terminus. The data underlying this Figure can be found in S1 Table.
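The GC skew and cumulative GC skew shown in Fig 2 can be illustrated with a minimal sketch. It uses the standard definition, (G − C)/(G + C) per window, and is not the authors' pipeline; the function name `gc_skew`, the non-overlapping windows, and the default window size are assumptions made for illustration only.

```python
def gc_skew(seq, window=1000):
    """Return (per-window GC skew, cumulative GC skew) for a DNA sequence.

    Skew per window is (G - C) / (G + C). Windows are non-overlapping and a
    trailing partial window is ignored (illustrative choices, not taken from
    the paper).
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window without any G or C contributes a skew of zero.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)

    # The cumulative skew is the running sum of the per-window values.
    cumulative = []
    total = 0.0
    for value in skews:
        total += value
        cumulative.append(total)
    return skews, cumulative


if __name__ == "__main__":
    toy = "GGGGCCCC"  # toy sequence, not Borg data
    print(gc_skew(toy, window=4))
```

Extrema of the cumulative curve are commonly read as candidate replication origins and termini, which is the kind of interpretation applied to the Borg genomes in the text.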
The previously reported four completed Borg genomes and six new completed Borg genomes were used to evaluate the distribution and features of perfect TRs using a stringent threshold that allowed no mismatch (≥50 nt region and ≥3 TR units). However, during the manual curation, we noticed that some regions had mapped sequencing reads with slight variations in the unit composition, usually a single nucleotide change. This is because reads assigned to a specific Borg likely derive from thousands of genomes, many of which are slightly different from each other.
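The stringent threshold described above (no mismatches, region ≥50 nt, ≥3 units) can be sketched as a brute-force scan. This is an illustration only and not the authors' script, which is available from the GitHub repository named in the Data Availability statement; the function name `find_perfect_trs` and the `max_unit` limit are hypothetical.

```python
def find_perfect_trs(seq, min_region=50, min_units=3, max_unit=60):
    """Find perfect head-to-tail tandem repeats.

    Returns a list of (start, unit, copies) tuples for regions made of at
    least `min_units` identical units spanning at least `min_region` nt.
    `max_unit` caps the unit length tried (an assumption of this sketch).
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    i = 0
    while i < n:
        best = None
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # ran off the end of the sequence
            copies = 1
            # Count exact, consecutive copies of the unit.
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            span = copies * k
            if copies >= min_units and span >= min_region:
                # Keep the longest span; ties keep the shortest unit.
                if best is None or span > best[2] * len(best[1]):
                    best = (i, unit, copies)
        if best is not None:
            hits.append(best)
            i += best[2] * len(best[1])  # jump past the reported region
        else:
            i += 1
    return hits


if __name__ == "__main__":
    unit = "ACGTTGCAAGCTTAGCATCA"  # made-up 20-nt unit
    print(find_perfect_trs("GGG" + unit * 3 + "TTT"))
```

Because the scan demands exact unit identity, a region whose units differ by a single SNP is either not reported or reported only for its perfect stretch, which mirrors the distinction the text draws between perfect TRs and the slight read-level variation seen during curation.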
In the example of a repeat region in an MHC gene, there are two TR unit types that differ by a G → A substitution SNP that is synonymous at the amino acid level. Interestingly, the mixture of TR variants can differ within a Borg population (S2 Fig).
To assess whether TR regions were particularly prone to mutation, and, specifically, if there is a bias towards certain SNPs, we analyzed the reads mapped to each Borg genome using inStrain. We detected no SNPs in 3 genomes and extremely few SNPs in 6 genomes (2.2 SNPs across the cumulative TR regions of each genome). The exception was Lilac Borg, which harbored 17 SNPs, of which 7 were within the same TR region within an MHC (S2 Table). These small numbers preclude any statistical analyses of different mutation incidences in the TR regions compared to the whole genome. Interestingly, we observed an A-bias in the SNPs found in TR regions, and this is mainly at the expense of G (14/18 cases). In contrast, A-SNPs found in non-TR regions were equally at the expense of G (462/1057 cases) and C (494/1057 cases), but less frequently at the expense of T (101/1057 cases).
2.1.2. Regions with TRs are fast evolving.
The number of TR units in a region often clearly differed within a single Borg population (Fig 3A). This suggests that the Borg TR regions are fast evolving, much like CRISPR repeat-spacer inventories. Moreover, the repeat loci were rarely conserved in otherwise alignable regions of the most closely related Borgs and were absent from homologous proteins from the host or other homologues in NCBI or our own database ggKbase. This is additional evidence that TR regions formed very recently, after Borgs diverged. A rare case where TR regions are similar occurs in Black and Brown Borgs, which are closely related based on genome alignments (Fig 3B). Close inspection of an approximately 7-kbp region in the aligned genomes revealed 83% sequence identity and four different kinds of TRs. TR-1 is intergenic and consists of a 20-nt unit repeated six times, followed by a nearly perfect additional unit of 21 nt, then another perfect unit. This intergenic TR is absent in Brown Borg. TR-2 is in an ORF and comprises 18-nt units that are identical in Black and Brown Borgs, where they occur seven and eight times, respectively. TR-3 is also in an ORF and comprises 21-nt units that occur consecutively six times in Black Borg. Brown Borg has two identical units in the same ORF, followed by a sequence that differs by one SNP, then another identical unit. TR-4 is also in an ORF and comprises 36-nt units that occur four times in Brown Borg, but these are absent in the same ORF from Black Borg. Interestingly, the nucleotide sequence from Black Borg in the vicinity of TR-4 has high nucleotide-level similarity but no TRs.
Fig 3. Variability of TR unit numbers within one Borg population and differences in TR regions in two closely related but distinct Borg genomes.
(A) The consensus sequence of the reads mapped to the curated genome contains three TR units (aqua segments). Reads are shown as gray bars below the consensus sequence (***). Six reads span this region completely (**), five reads span this region but are missing one TR unit (*, black segments), and nine reads do not span the entire TR region. (B) Genome alignment of closely related Black Borg and Brown Borg. Genomic regions that align are depicted as same-colored collinear blocks in the top panel. A 7-kb region (black box) shows four instances of TRs in these genomes: TR-1 is present in Black Borg but absent in Brown Borg; TR-2 is present in both Borg genomes but with different numbers of TR units; TR-3 is only present in Black Borg, but an imperfect version is present in Brown Borg; TR-4 is only present in Brown Borg. Identical color segments show identical TR units; genes are depicted in gray.
To quantify the variation in TR unit number and compare it with the count of insertions/deletions in different regions of the Borg genomes, we performed a case study using the genome of Ochre Borg with all reads mapped onto it. We found no instances throughout the entire genome of indels in well-mapped Ochre sequencing reads, other than in the TR regions. For this analysis, we included loci with only two repeat units and found that six of the 27 TR regions showed variation in the repeat unit number. Since many reads start or end within the TR region, they do not provide information regarding the total number of repeat units, hence the true incidence of variation is likely even higher.
2.1.3. TRs are flanked by partial repeats and are enriched in adenine.
To constrain the mechanism behind TR formation in Borg genomes, we assessed characteristics of the TR sequences and their flanking DNA regions. Repetition of sequences with variation in GC versus AT content often introduces GC/AT-symmetry in regions containing TR units (Fig 4 and S3 Table). Offset of the symmetric units and the TRs depends on the exact choice of the repeat unit. The TR regions can be preceded by sequences that are identical to the end of the TR unit and/or followed by sequences that are identical to the start of the TR unit (Fig 4A and 4B). These flanking partial repeats are often also adjacent to TR regions that consist of only two repeats (and were excluded from statistics). Usually the middle of the TR unit is not in either of the 5′ or 3′ partial repeats. When there are different choices for the repeat unit, there are slightly different partial repeats that flank the TR sequences (Fig 4C).
Fig 4. Tandem repeats are often flanked by partial repeat sequences.
(A) TR regions can occur within ORFs. (B) TRs can be in intergenic regions, or the region can start within the end of an ORF. The TR regions are often flanked by partial repeats. (C) Different choices for the start of a TR region define different flanking partial repeats. The GC graph displays the proportional amount of G or C (or A or T) residues in a sliding window that is set at half of the TR unit length. The data underlying this Figure can be found in S3 Table.
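The flanking partial repeats described above, upstream sequence matching the end of the TR unit and downstream sequence matching its start, can be located with a short sketch. The function name `flanking_partial_repeats` and its coordinate convention (0-based start of the first unit) are assumptions for illustration; the paper does not specify how the flanks were computed.

```python
def flanking_partial_repeats(seq, start, unit, copies):
    """Return (five_prime_partial, three_prime_partial) around a TR region.

    The 5' partial repeat is the longest upstream stretch equal to a suffix
    of the unit; the 3' partial repeat is the longest downstream stretch
    equal to a prefix of the unit. Both are capped at len(unit) - 1 so that
    a full extra copy is never reported as a partial repeat.
    """
    seq = seq.upper()
    unit = unit.upper()
    end = start + len(unit) * copies
    if seq[start:end] != unit * copies:
        raise ValueError("coordinates do not describe a perfect TR region")

    max_len = len(unit) - 1

    # Walk upstream, comparing against the unit from its last base backwards.
    left = 0
    while (left < max_len and start - left - 1 >= 0
           and seq[start - left - 1] == unit[-(left + 1)]):
        left += 1

    # Walk downstream, comparing against the unit from its first base onwards.
    right = 0
    while (right < max_len and end + right < len(seq)
           and seq[end + right] == unit[right]):
        right += 1

    return seq[start - left:start], seq[end:end + right]


if __name__ == "__main__":
    toy = "GG" + "CA" + "ACGTCA" * 3 + "ACG" + "GG"  # made-up sequence
    print(flanking_partial_repeats(toy, start=4, unit="ACGTCA", copies=3))
```

As the text notes, shifting the chosen start of the repeat unit changes which bases count as the 5′ and 3′ partial repeats, so the output depends on the unit definition passed in.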
The nucleotide composition of the ORF-carrying strands on both replichores was similar, with nucleotide frequencies of A>T>G>C (38.4%, 28.8%, 20.3%, 12.5%). The nucleotide composition of all TR units compared to this overall composition showed a strong bias towards A (48.4%), whereas all other nucleotides were depleted (T 23.7%, G 17.1%, C 10.8%) (S4 Table). This A-bias is similar to the A-bias observed in SNPs within TR regions (S2 Table). To establish that the differences in nucleotide composition of the TRs versus all nucleotides are significant, we performed a nonparametric Kruskal–Wallis test and a false discovery rate (FDR) correction according to Benjamini–Yekutieli. This revealed that the compositional bias between the nucleotide TRs and non-TR sequences is of high statistical significance, with corrected p-values of 9.78 × 10^−68, 1.95 × 10^−41, 1.46 × 10^−29, and 1.51 × 10^−23 for A, T, G, and C. This bias was less pronounced in TRs within ORFs (A 46.3%), and these TRs were also enriched in C (15.5%), whereas T and G were depleted (18.6%, 19.6%) (corrected p-values of 1.39 × 10^−18, 5.06 × 10^−58, 8.40 × 10^−3, and 2.16 × 10^−3 for A, T, G, and C) (S4 Table). The fact that this compositional bias is observed in genes, in intergenic regions, and in flanking partial repeats suggests that the A enrichment is important for the process that forms the repeats. Predicting the RNA secondary structure of the nucleotide TRs with RNAfold revealed that some are predicted to form loops and others form hairpins.
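The Benjamini–Yekutieli correction named above can be written out in a few lines. The sketch below follows the standard step-up procedure, adjusted p(i) = min over j ≥ i of p(j) · m · c(m) / j with c(m) = Σ 1/k, and is not the authors' code; the function name and the example p-values are made up for illustration.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic correction factor that makes the procedure valid under
    # arbitrary dependence between tests.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])

    adjusted = [0.0] * m
    running_min = 1.0  # adjusted p-values are capped at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        value = pvalues[idx] * m * c_m / rank
        running_min = min(running_min, value)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03]  # hypothetical raw p-values
    print(benjamini_yekutieli(raw))
```

In practice the raw p-values would come from the per-nucleotide (or per-codon) Kruskal–Wallis tests; established statistics libraries provide both steps and are preferable to hand-rolled code for real analyses.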
We searched for polymerases in Borg genomes, given their potential relevance for repeat formation. We found that all Borgs encode at least one DNA polymerase, and phylogenetic placement within reference sequences revealed that there are two types of PolBs encoded in the Borgs: the B2 clade and the B9 clade (S3 Fig). The DNA PolB9 features a predicted 3′ to 5′ exonuclease and was found in each complete Borg genome. The amino acid sequences are very similar, suggesting a high degree of conservation for these particular proteins.
2.1.4. TRs are rare in Methanoperedens and introduce aaTRs in Borg ORFs.
To shed light on the role and function of the TRs, we surveyed their prevalence across all ten complete Borg genomes. We found 460 regions that make up 0.62% of the average Borg genome (Table 1). Draft Methanoperedens metagenomes, on the other hand, only have 1 to 4 TR regions, making up ≤0.01% of the metagenomes (0.0099% in SRVP18_hole-7m-from-trench_1_20cm__Methanoperedens-related_44_31, 0.0021% in RifSed_csp2_16ft_3_Methanoperedens_45_12, 0.0018% in RifSed_csp1_19ft_2_Methanoperedens_44_10), suggesting that TR formation is highly genome specific. Approximately half (43% to 65%) of the Borg TRs were located within ORFs. All TRs within ORFs had unit lengths that are divisible by three, so these repeats do not disrupt reading frames. They result in amino acid tandem repeats that we refer to as aaTRs (and aaTR-proteins). Only 14% to 38% of intergenic TR unit lengths were divisible by three (S4 Fig). Sometimes, multiple different TRs occurred within the same ORF, so the dataset comprised 214 aaTRs in 178 individual proteins from 10 Borg genomes. Fifteen aaTR-proteins were not perfect TRs at the nucleotide level, yet almost all were imperfect due to the occurrence of single SNPs. The TR nucleotide sequences are virtually all unique within each Borg genome and are very rarely shared between Borg genomes, with only five cases of identical TR units in the same genome (Green Borg TR 16 and 19 and TR 17 and 21; S14 Table) or shared between Borgs (Black Borg TR 35 and Brown Borg TR 31, Black Borg TR 43 and Brown Borg TR 38, Purple Borg TR 21 and Ochre Borg 13). Upon inspection of aaTR-proteins, we found a different codon usage in genes with TRs relative to all Borg genes. As expected, based on the relatively A-rich composition of repeat regions, the aaTR-bearing ORFs often favored incorporation of codons containing A.
Codons containing T in any position were generally depleted in aaTR regions. Codons with C in position 1 or 2 of the triplet, and G in position 1, were enriched in aaTRs, but codons with C or G in other positions or combinations of positions 1 to 3 were depleted (S5A Fig and S5 Table). The most frequent codons in aaTRs were CCA (encoding proline) and ACA (encoding threonine), with an up to 3-fold higher frequency in aaTRs than in non-aaTR regions. The most statistically enriched codons in TRs were GAT and AAA (encoding aspartate and lysine, with corrected p-values of 1.63 × 10^−57 and 1.66 × 10^−52), and the most statistically depleted codons in TRs were ATG and TTT (start codon and phenylalanine, with corrected p-values of 6.43 × 10^−104 and 6.67 × 10^−100) (S5 Table). Eight codons were completely absent in aaTRs, namely, TGA, TAA, and TAG (stop codons), CGA, CGT, and CGG (encoding arginine), TTC (phenylalanine), and TGC (cysteine) (S5B Fig).
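The codon comparison described above, codon frequencies in aaTR regions relative to non-aaTR regions, can be sketched as follows. The function names and the toy sequences are hypothetical, and the sketch covers only the frequency and fold-change calculation, not the statistical testing reported in S5 Table.

```python
from collections import Counter


def codon_frequencies(in_frame_seq):
    """Return {codon: relative frequency} for an in-frame coding segment."""
    seq = in_frame_seq.upper()
    if len(seq) % 3 != 0:
        # TRs within ORFs have unit lengths divisible by three, so an
        # in-frame aaTR region should always pass this check.
        raise ValueError("segment length must be divisible by three")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


def fold_change(tr_freqs, background_freqs):
    """Return frequency ratios (TR / background) for codons seen in both."""
    return {
        codon: freq / background_freqs[codon]
        for codon, freq in tr_freqs.items()
        if background_freqs.get(codon, 0) > 0
    }


if __name__ == "__main__":
    aatr_region = "CCAACACCA" * 2          # made-up repeat of Pro/Thr codons
    background = "CCAACAGATAAAATGTTT"      # made-up non-aaTR segment
    print(fold_change(codon_frequencies(aatr_region),
                      codon_frequencies(background)))
```

Codons absent from the background are skipped by `fold_change` to avoid division by zero; codons absent from the aaTR set simply do not appear in its frequency table, which is how completely missing codons such as the stop codons would show up.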
2.2. Biophysical properties of aaTRs
2.2.1. Amino acid frequency in aaTRs.
We closely examined the predicted biophysical and biochemical properties of the single repeat units from all ten Borg genomes and 37 additional proteins from curated Borg contigs. This final dataset comprised 215 Borg aaTR-proteins and 306 repeat regions (S6 and S7 Tables). Proline, threonine, glutamate, and lysine were particularly enriched across all Borgs, whereas tryptophan, cysteine, and phenylalanine were almost absent (Fig 5A and S8 Table). We noticed that disorder-promoting amino acids were enriched in the aaTRs and order-promoting amino acids were depleted (S6 Fig).
Fig 5. Relative amino acid abundance in aaTRs and hierarchical clustering of triple aaTR units.
(A) Relative aa abundance in aaTRs compared to all proteins. (B) Repeat units were hierarchically clustered based on triple aaTR units. Clusters with ≥3 aaTR members were named by the amino acids that form the aaTR unit in descending order (threshold ≥10%). The color strip shows in which Borg genome each aaTR is encoded. The data underlying this Figure can be found in S7 and S8 Tables.
Hierarchical clustering of triple aaTR units revealed 28 aaTR clusters, each cluster composed of aaTRs with very similar amino acid composition and frequency (Fig 5B and S7 Table). Many aaTR units possessed nearly equal numbers of the charged amino acids K and E (polyampholyte repeat cluster); others were enriched in P/T/S, which are potential sites of posttranslational modification. From this, we conclude that there are distinct and related groups of aaTRs.
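The naming rule given in the Fig 5 caption (amino acids forming the aaTR unit, in descending order, threshold ≥10%) can be illustrated on a single repeat unit. This is only a sketch of the composition step: the hierarchical clustering itself is not reproduced, and the function name, the alphabetical tie-breaking, and the example units are assumptions.

```python
from collections import Counter


def dominant_residues(aatr_unit, threshold_pct=10):
    """List residues making up >= threshold_pct of a unit, most common first."""
    unit = aatr_unit.upper()
    counts = Counter(unit)
    total = len(unit)
    # Integer comparison avoids floating-point surprises at the threshold.
    kept = [(aa, n) for aa, n in counts.items()
            if n * 100 >= threshold_pct * total]
    # Descending by count; ties broken alphabetically (an assumption).
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [aa for aa, _ in kept]


if __name__ == "__main__":
    print(dominant_residues("KEKEKEKEKP"))   # hypothetical K/E-rich unit
    print(dominant_residues("PTPTSPTPTS"))   # hypothetical P/T/S-rich unit
```

A unit dominated by K and E would fall under the polyampholyte pattern mentioned in the text, while one dominated by P, T, and S would match the group flagged as potential posttranslational modification sites.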
2.2.2. aaTRs are enriched in extracellular proteins, and most aaTRs are intrinsically disordered.
Strikingly, 32.1% (69/215) of aaTR-proteins have extracellular regions and 30.2% have transmembrane helices (TMHs), whereas only 18.6% (2,230/11,995) and 16.7% of all Borg proteins have these features. Consistently, aaTR-proteins are highly enriched in signal peptides (16.7% of aaTR-proteins compared to 4.0% of all Borg proteins).
To further investigate the observation that aaTRs particularly often contain disorder-promoting amino acids, we assessed the prominence of intrinsically disordered regions (IDRs) in aaTR-proteins versus all Borg proteins. IDRs are polypeptide segments that are characterized by a lack of a well-defined 3D structure. Remarkably, 62.8% of aaTR-proteins contained at least one IDR (≥15 amino acids), whereas only 5.6% of all Borg proteins contained an IDR. The relative length of the IDR regions varied from 2.6% to 100% and 1.2% to 100% in aaTR-proteins and non-aaTR-proteins, respectively (S7 Fig and S9 Table). In aaTR-proteins, the IDRs almost always corresponded to the TR region, and Methanoperedens homologues did not have IDRs (Fig 6A). Thus, we hypothesize that the TRs in ORFs mostly lead to the creation of new IDRs in existing proteins.
Fig 6. aaTRs introduce intrinsic disorder, are predicted binding sites, and comprise similar amino acids in homologous Borg proteins.
(A) Domain architecture of S-layer protein from Green Borg. The aaTR region is intrinsically disordered (predicted with IUPred3), and the IDR regions are predicted binding sites for proteins, DNA, RNA, or a linker region (predicted with flDPnn; probability of feature increases with color intensity). (B) All or most members of two protein subfamilies have aaTRs. The aaTRs within a protein subfamily are from one or two related aaTR clusters. The amino acid sequence of each aaTR unit is shown in brackets.
Forty proteins possessed multiple repeat regions, ranging from two to five aaTR regions (S10 Table), and the aaTRs were often located at the N- or C-termini or between domains. Some of these regions fall into the same or a similar aaTR cluster, while others are distinct and often located in a completely different region of the polypeptide chain. In 22 cases, the aaTRs make up the majority (50% to 96%) of the predicted proteins, and many of these are small proteins (14 proteins are <100 aa) (S11 Table). The repeat units of these proteins are diverse, but clusters enriched in K, E, and I (cluster 10 and 11) were particularly common (6/18). Although it is possible that these are wrongly predicted genes in intergenic regions, they could also be de novo repeat peptides. Their existence as real proteins is supported by the observation that they possess a start codon (or alternative start codon), that the TR units are divisible by three and thus introduce aaTRs, and by the fact that four have signal peptides and four have functional annotations (linked to apolipoprotein, proline-rich domains, and a glycoprotein domain).
2.3. Functions of aaTR-proteins
2.3.1. Similar aaTRs fulfill similar predicted functions, and functionally related proteins have similar aaTRs.
To investigate which functions aaTR-proteins have, and if functionally related proteins possess similar aaTRs, we performed protein family clustering of 11,995 Borg proteins. Briefly, we clustered proteins using the fast and sensitive protein sequence searching software MMseqs2 in an all-versus-all approach to define protein subclusters based on amino acid similarities. Based on the alignments of the subfamily members, HMMs were generated for each subfamily using HHblits. These HMMs were then functionally annotated by an HMM-HMM comparison with the PFAM database using HHSearch. This resulted in 85% of Borg proteins being grouped into 1,890 subfamilies, and 80% (172/215) of the aaTR-proteins clustered into 112 subfamilies (S6 Table). Based on the annotations for the individual protein members of each subfamily and the annotation of the HMM of the subfamily, protein subfamilies were manually grouped in terms of function. The functional landscape of the subfamilies ranged from carbon metabolism (18 proteins), cell and protein architecture and scaffolding (9), nucleotide processing (15), transcription-associated proteins (8), redox (8), signaling (11), transport (3), and stress response (2) to protein processing (1) (Table 2).
Table 2. Functional annotations for protein subfamilies with aaTR-bearing protein members.
Proteins were manually placed into functional categories based on the pfam annotation of the subfamily. Single aaTR units are listed as well as the aaTR cluster into which they fall. Some aaTRs did not fall into clusters (n.c., no cluster), and some proteins have two or more aaTRs, which fall into one or two aaTR clusters.
Screening the aaTR-proteins with the eukaryotic linear motif (ELM) resource for functional sites in proteins (http://elm.eu.org/) revealed that the aaTRs within them have similarity to a plethora of different short linear motifs (SLiMs). SLiMs are composed of three to ten consecutive amino acids, which are used by eukaryotic cells as cleavage sites, degradation sites, docking sites, ligand binding sites, posttranslational modification sites, and targeting sites. We found that aaTRs from the same aaTR cluster or related aaTR clusters often had matching functional predictions (see examples in S7 Table).
Fifteen protein subfamilies were enriched in aaTR-proteins, which included subfamilies functionally annotated as transmembrane phosphatases, ribonucleoproteins, phosphoesterases, zinc-ribbon proteins, and DNA-binding proteins; the other subfamilies had no functional annotations (S6 Table). Members of the same protein subfamily often had similar aaTRs. For example, a subfamily of DNA binding proteins (subfam0897) only comprises aaTR-bearing members, all of which have aaTRs that are unique in nucleotide sequence (S8 Fig), yet of the same or related repeat clusters (Fig 6B). These aaTRs form a predicted coil structure or an IDR and resemble SLiMs that play a role in ligand binding, degradation, and targeting (S7 Table). Similarly, subfam1609 comprises eight members, five of which are aaTR-proteins, and, despite all being only distant homologues (highest aa identity is 67%), the aaTRs are encoded by unique nucleotide TR units (S8 Fig) but belong to the same aaTR cluster (SVG/IQ), which corresponds to predicted modification (phosphorylation) and ligand-binding sites (Fig 6B and S7 Table).
2.3.2. aaTR-proteins responsible for cell integrity/stability and surface decoration.
Several functionally related but phylogenetically unrelated proteins that are responsible for cell wall architecture (PEGA and S-layer) and decoration (glycosyltransferases and glycosyl hydrolases) have aaTRs. The aaTRs of the PEGA proteins resemble predicted ligand binding, docking, and modification (phosphorylation) sites, and the aaTRs of the S-layer resemble proteolysis-initiating degrons. The aaTRs of glycosyl hydrolases resemble modification and ligand binding sites, and the aaTRs of glycosyl transferases resemble degrons (S7 Table). Some Borgs also possess an aaTR in tubulins, which are proteins required for cell division. These aaTRs resemble SLiMs that are potential docking and cleavage sites and/or initiate proteasomal degradation. Importantly, these SLiM-bearing aaTRs are absent in non-Borg homologues.
2.4. Case studies of aaTR-proteins: Ribonucleoproteins, MHCs, and a conserved TR hotspot
To further investigate proteins with the same function that evolve similar but not identical aaTRs, we performed in-depth analyses of two distinct types of proteins with a known function and of a conserved region across Borgs comprising multiple aaTR-proteins.
Most Borgs encode Sm ribonucleoproteins, which are archaeal homologues of bacterial Hfq and eukaryotic Sm/Sm-like (Lsm) proteins. These proteins are implicated in versatile functions such as RNA processing and stability, and the protein monomers form a stable hexameric or heptameric ring-shaped particle [25,26]. We found 19 Borg Sm ribonucleoproteins (S12 Table); five possess aaTRs, and one additional sequence from Gray Borg has a near-perfect aaTR. The aaTRs are always located at the N-terminus, preserving the conserved Sm1 and Sm2 RNA-binding motifs (S9A Fig). The aaTR units comprise 4, 8, 12, or 17 amino acids, have the same aa character, and are predicted SLiMs that facilitate docking or resemble degrons (S9B Fig and S12 Table). An initial homology-based structural search of the aaTR-Sm from Black Borg identified the bacterial Hfq of Pseudomonas aeruginosa (PDB: 4MML, aa identity: 26%), which forms a homohexameric ring structure. Because the database contained no alignable template for the aaTR, the model did not predict a structure for this N-terminal region. Remodeling with AlphaFold2 predicted the structure of the Sm core, which formed a hexameric ring with long loops extending on the distal side of the protein (Fig 7). This unstructured region corresponds to the aaTR region and matches the prediction of MobiDBLite and IUPred3 that the aaTR is an IDR. Modeling of the other four aaTR-Sm proteins also showed N-terminal unstructured extensions corresponding to the aaTRs, which were likewise predicted to be IDRs (S9C Fig).
Fig 7. Predicted structures of aaTR-bearing ribonucleoproteins and multiheme cytochromes.
(A) Sm ribonucleoprotein from Black Borg was superimposed on Sm from M. jannaschii (blue, PDB: 4X9D with UMP RNA + ions, magenta). (B) MHC domain architecture and predicted structures. aaTR regions in predicted structures are highlighted in purple, and the amino acid sequence of each aaTR unit is shown in brackets below each structure. Structures were predicted with AlphaFold2.
2.4.2. Extracellular electron-transferring MHCs.
All Borgs encode MHCs, which are particularly important and abundant in Methanoperedens as they mediate the final electron transfer from methane metabolism to an external electron acceptor. We found 14 MHCs with aaTRs in Borg genomes. All of these are predicted to be located extracellularly, and some possess a membrane anchor. They range from 20.5 kDa (190 aa) to 137.5 kDa (1,293 aa) and contain up to 30 heme-binding sites (Fig 7). The proteins are thus clearly distinct in sequence and domain architecture. The aaTRs are located at different sites in the polypeptides but never disrupt existing functional domains. Yet remarkably, the aaTRs belong to identical or similar repeat clusters that are consistently enriched in T and P and resemble ligand binding, docking, and modification sites (S13 Table). The aaTRs are mostly IDRs, which are predicted to form unstructured extensions that protrude from the folded protein core.
2.4.3. A conserved aaTR hotspot.
There is a region in all complete Borg genomes that is a hotspot for TRs, with up to five aaTR-bearing proteins (Fig 8A). It includes a gene encoding subfam1773, which we refer to as cell envelope integrity protein TolA. Best blastp and structural hits are TolA from, e.g., the pathogenic bacterium Leptospira sarikeiensis (48% aa identity with WP_167882360) and YgfJ from the pathogen Salmonella typhimurium (PDB: 2JRP). Eight TolA proteins have unique aaTRs, and some carry additional aaTR units shared by other Borgs (e.g., Ochre Borg repeat units are found in Rose Borg and Purple Borg) (Fig 8B). These aaTRs are similar in amino acid composition, and they are predominantly predicted degradation, docking, and ligand binding sites (S6 and S7 Tables). The regions encode two other conserved protein subfamilies with aaTR-bearing members: subfam0649, which lacks functional annotation, and a subfamily of zinc-ribbon proteins (subfam0108), which usually form interaction modules with nucleic acids, proteins, or metabolites. The subfam0108 proteins possess seven distinct aaTR units, six of which fall into the same repeat cluster (S6 and S7 Tables). Additional aaTR-bearing proteins in the same context are a glycogen synthase/glycosyl transferase (subfam0184) in Black Borg and Green Borg, proteins of unknown function (subfam1382) in Lilac Borg and Orange Borg, and a hypothetical protein in Green Borg.
Fig 8. A conserved genetic region is a hotspot for tandem repeats.
(A) Gene cluster alignment of a conserved region in all Borg genomes that encompasses TolA. The alignment is based on amino acid identity and was generated with clinker. (B) Most TolA homologues have at least one aaTR region. The aaTR units are depicted as arrays beneath each sequence, and regions with only two aaTR units are shown as well.
Tandem nucleotide repeats in Borg genomes apparently form and evolve independently, as evidenced by the essentially unique sequences of TR units in each region within and among Borg genomes. They evolve rapidly, based on differences in the alignable regions of the most closely related Borgs (e.g., Black and Brown Borgs) and variability in repeat numbers within a single Borg population. This parallels the rapid evolution of eukaryotic variable number TRs (VNTRs). VNTRs are widespread and linked to neuropathological disorders (caused by the accumulation of short TRs), gene silencing, and rapid morphological variation. In bacterial genomes, highly repeated sequences are less common, and TR variations are linked to immune evasion, cell-pathogen specificity (i.e., which cells, tissues, or hosts a pathogen or virus can infect), and stress tolerance [33,34]. TRs are relatively common in some archaeal genomes, but their functions there remain uncertain.
Previously proposed mechanisms for the expansion and retraction of VNTRs implicate various DNA transactions, including replication, transcription, repair, and recombination. Replication slippage is one explanation for the origin and evolution of repeat loci. This mechanism involves pausing of the DNA polymerase and its dissociation from the TR region, followed by DNA reannealing. Realignment of the newly synthesized strand can be out of register on the template strand, leading to propagation or retraction of the TRs. Currently, it is unclear whether the TRs are introduced by Borg or Methanoperedens machinery. If the former, it is notable that all ten Borgs encode a highly conserved DNA PolB9, a member of a clade of uncharacterized polymerases that have only been reported from metagenomic datasets. Archaeal DNA polymerases B and D have been shown to slip during replication of TR ssDNA regions. Thus, the Borg PolB9 could be responsible for TR propagation, possibly triggered by noncanonical secondary structures formed by the TR DNA or its upstream region.
Given the shared nucleotide characteristics of genic and intergenic TR regions (perfection, A bias), we infer that they form by the same mechanism. Even if TRs are introduced perfectly, one would expect some of them to have accumulated mutations, unless perfection is strongly selected for and ensured via a repair mechanism, or unless they all formed very recently. The latter is consistent with other evidence that TR regions are fast evolving, namely the variability in TR unit number within a Borg population.
In many cases, the sequences that flank the TR regions are partial repeats. We suspect that these are seed sequences that were used to initiate TR formation. If these regions were the seeds that gave rise to the TRs, the fact that they are often only parts of the repeat raises a mystery regarding the origin of the central regions of these TRs. It is possible that the sequences that served as seeds were subjected to elevated mutation rates, which is supported by the observation of high SNP levels flanking TR regions in the human genome [25,27].
CRISPR repeats suggest a possible alternative to the replication slippage explanation for the origin of TR regions. Like TR regions, CRISPR loci are fast evolving, in that case motivated by the need to counter the rapid evolution of bacteriophages to outwit spacer-based immunity. CRISPR repeats are introduced by an integrase that excises the previously added repeat, ligates it to a new spacer, and adds that unit to the expanding locus, filling gaps to recover a double-stranded sequence. If a Borg system is involved, the genomes encode many and diverse nucleases and recombinases, some of which may be responsible for repeat addition. Unfortunately, a complete Methanoperedens genome is not yet available, so we cannot confidently assess host capacities for repeat introduction.
As noted above, our analyses revealed a compositional bias in nucleotide TRs towards A on the coding strand of each replichore, mainly at the expense of T. If this reflects mutation, the transversion (interconversion of a pyrimidine and a purine) must occur at the biogenesis of the TR template, because the TR units are subsequently copied faithfully and without error. We tentatively support the alternative explanation that the A-rich nature of the seeds initiated TR formation. A small region that is A-rich, possibly functioning in a manner somewhat analogous to a protospacer adjacent motif (PAM) that binds the CRISPR system nuclease, may localize the machinery involved in repeat formation. We speculate that TR formation could be regulated by distinct methylation patterns of Borg DNA, possibly related to the A-rich character of the seed.
An analysis of the SNPs in TR regions provides another clue regarding A-enrichment in the TRs. Although the small dataset size precludes statistical analysis, we note that the reads with SNPs are more A-rich than the consensus repeat sequences, and this enrichment is apparently at the expense of G. This could suggest that during Borg genome replication or TR propagation, guanines in TRs are selectively mutated to adenines. This may be reminiscent of G to A transitions observed in human diseases at sites of 5-methylcytosine, but much larger datasets will be required to investigate this further.
A key question pertains to how stable coexistence of Borgs, with their huge genomes, apparently in multicopy, and the host Methanoperedens cells is achieved. Although it is possible that Borgs direct the functioning and evolution of the hosts on which they fundamentally depend for their existence, it is perhaps more likely that Methanoperedens controls Borg activity. The mechanisms for regulation of Borg gene expression are unclear, given the absence of obvious operons and the conspicuous lack of transcriptional regulators. As we identified no RNA polymerase or TATA-box binding proteins in any of the ten complete Borg genomes, it is possible that Methanoperedens can tightly regulate Borg gene transcription, thus controlling the production and localization of Borg proteins. TRs may be involved in this process. Approximately half of the TR regions were intergenic or had the first unit within or partially within gene ends. These could be (long) noncoding RNAs ((l)ncRNAs), which are known to modulate chromosome structure and function and the transcription of neighboring and distant genes, to affect RNA stability and translation, and to serve as architectural RNAs (arcRNAs). arcRNAs are of particular interest as they are involved in forming RNA-protein condensates, and some arcRNAs possess repeat sequences to accumulate multiple copies of specific proteins and/or multiple copies of RNAs.
Virtually all TRs in ORFs had lengths divisible by three and thus do not disrupt ORFs, and their A bias was not as strong as that of intergenic TRs. This suggests a strong selective pressure to always introduce amino acid repeats and a selection for (or deselection against) specific codons. Bulky or highly reactive amino acids such as cysteine, which could severely interfere with redox homeostasis, were absent. Yet, disorder-promoting amino acids were enriched in aaTRs, and, concomitantly, most aaTR regions were predicted to be IDRs. Thus, we infer that aaTRs may impact protein functioning. Whereas protein homologues from Methanoperedens (or other microbial genomes) almost never showed aaTRs, Borg homologues (or functionally related Borg proteins) had aaTRs that were highly similar in aa character, but not identical in DNA sequence. For very closely related Borgs, this may arise because similar DNA regions in Borg genes were seeds for TR formation. However, the presence of aaTRs in different regions of proteins with the same function (e.g., MHCs) indicates a strong selective pressure for specific aaTRs that is constrained by protein function. Overall, we consider the combination of features to be strong evidence that the aaTRs are not random genomic aberrations and that those in ORFs can fulfill an important role in protein functioning.
To clarify the predicted function that the proteins gain by having evolved aaTRs, we screened many aaTRs for predicted functional sites. Most aaTRs are IDRs, and IDRs and SLiMs commonly co-occur. In the case of the N-terminal aaTRs in the Borg ribonucleoproteins, the long loops that extend from the RNA-binding protein core could serve as docking sites for interaction partners. This is reminiscent of eukaryotic homologues that possess fused C-terminal IDR domains and are part of the spliceosome, a ribonucleoprotein machine that excises introns from eukaryotic pre-mRNAs. The aaTR extension of the Borg ribonucleoproteins could undergo a disorder-to-order transition induced by binding to a yet to be identified protein partner, which could redefine its function in, e.g., pre-mRNA or pre-tRNA processing. The aaTRs associated with MHCs were also predicted to form long, mostly unstructured extensions. Their consistent compositional dominance by proline and threonine suggests a shared function and convergent evolution of these aaTRs. Based on the predicted SLiMs within the aaTRs, they could be intraprotein intersections that are glycosylated or phosphorylated, triggering a signaling cascade and conformational changes that lead to a modified electron flow, redox capacity, and, possibly, nature of electron acceptor. Since intrinsic disorder is a common feature of (eukaryotic) hub proteins, the aaTRs could also be involved in navigating interprotein interfaces by being docking sites for other MHCs or extracellular oxidoreductases, such as a manganese oxidase that we found in Lilac Borg. We thus propose that the Borg MHCs expand the redox and metabolic capacity of a large MHC network on the cell surface of the host, and that aaTRs within them may enable a tunable connection to the electron conduction system that is integral to the metabolism of Methanoperedens.
We found that aaTRs were statistically enriched in Borg proteins with predicted extracellular and membrane localization, another indication that their formation is not a random event. These proteins are generally implicated in cell–cell interactions, transport, protection, and virulence. It is unclear whether Borgs have an existence outside of Methanoperedens cells, but the inferred high copy number of some Borgs compared to host cells may point to this possibility. Borg-encoded S-layer and PEGA domain proteins (with and without aaTRs) could potentially encapsulate Borgs and mimic host proteins to evade host defense. Alternatively (or at a different time in their existence), these Borg proteins could be displayed on the Methanoperedens cell surface to alter host processes.
If Borgs can exist outside of host cells, they would need the ability to infect Methanoperedens cells to replicate (analogous to viruses). Proteins involved in infection and defense are expected to be fast evolving, and TRs could serve as a mechanism to enable this. A hotspot for TR evolution that is conserved across Borgs encodes TolA (and additional conserved hypothetical proteins). TolA is important for cell envelope integrity and is crucial for entry of filamentous bacteriophages that infect Escherichia coli or Vibrio cholerae [45,46]. TolA is remarkable because, unlike other Borg aaTR-proteins, homologous proteins from Streptomyces and Bacillus are also repeat proteins with similar aa signatures (EAKQ). The observation that the TolA-containing regions in Borgs are conserved, fast evolving, and under strong selective pressure could be consistent with a role in Borg–host attachment. As proteins produced in the host, they could facilitate cell–cell interaction (e.g., for lateral gene transfer).
A striking observation from prior work at the wetland site is the large number of different and coexisting Borgs and Methanoperedens species. Based on correlation analyses, different Borgs associate with different Methanoperedens, implying coevolution of Borgs and hosts. Lateral gene transfer has been reported as a driver for metabolic flexibility in members of the Methanoperedenaceae. Gene acquisition (lateral gene transfer) is a notable capacity of Borgs, which gave rise to their name and may contribute to Methanoperedens speciation. TR regions may also be key for establishing new Borg–host associations, especially if the aaTRs and noncoding TRs enable functional cooperation between Borg and host protein inventories. Thus, while CRISPR repeats evolve rapidly to defend against phage infection, rapid evolution of Borg TRs may be required to maintain coexistence with their hosts during coevolution.
We conclude that the nucleotide regions flanking repeats, and the uniqueness of the TRs in each locus, likely indicate that TRs arise from local sequences rather than being introduced from external templates. Other constraints on the mechanism behind TR formation are their generally perfect sequence repetition and A-enriched composition. Many TRs lead to aaTRs in proteins, and these are usually IDRs that are also predicted posttranslational modification sites and protein or nucleic acid binding sites. We propose that aaTR-proteins expand and modify the cellular and metabolic capacity of Borg-bearing Methanoperedens, yet expression of their large gene inventories is likely under tight control. Introduction of TRs in both genes and intergenic regions may be central to regulating Borg gene expression, translation, protein localization, and function. TR regions change rapidly in number and distribution to generate within-population heterogeneity and between-population diversity. This feature is likely central for Borg infection, association, and cospeciation of Borgs and their Methanoperedens hosts.
5.1. Identification of Borg genomes and manual genome curation
Metagenomic datasets on ggKbase (ggkbase.berkeley.edu) were searched for contigs with a dominant taxonomic profile matching Methanoperedens (Archaea; Euryarchaeota; Methanomicrobia; Methanosarcinales; Candidatus Methanoperedens; Candidatus Methanoperedens nitroreducens). Manual genome binning was performed based on contig taxonomy, coverage, GC content (25% to 35% GC), and presence of nucleotide TRs.
Manual curation of Borg genomes was performed in Geneious Prime 2021.2.2 (https://www.geneious.com). It involved piecing together fragments of approximately the same GC content, sequencing read coverage, phylogenetic profile, and relatedness to known Borgs into a single chromosome, and verifying the joins by paired-read placement. Subsequently, careful visualization of the patterns of read discrepancies was used to locate local assembly errors, most of which were fixed by either relocating paired reads or introducing previously unplaced paired reads.
5.2. Genome visualization and alignments
Genomes were visualized in Geneious Prime 2021.2.2 and aligned with the MCM algorithm, or with the progressiveMauve algorithm when aligning multiple contigs. Genetic neighborhood comparisons were performed using clinker (v0.0.21).
5.3. GC skew analysis
Replichores were predicted by calculating the GC skew, (G − C)/(G + C), and the cumulative GC skew using the iRep package (gc_skew.py).
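The skew calculation can be illustrated with a minimal Python sketch. This is not the iRep implementation: the function names, the window size, and the step size are assumptions chosen for illustration, and only the formula (G − C)/(G + C) is taken from the text. Extrema of the cumulative curve are commonly read as candidate replication origin and terminus positions, but which extremum corresponds to which has to be judged per genome.

```python
def gc_skew(seq, window=1000, step=1000):
    """Return (window_start, skew, cumulative_skew) tuples for a DNA sequence.

    window and step are illustrative defaults, not values from the paper.
    """
    seq = seq.upper()
    results = []
    cumulative = 0.0
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # GC skew as defined in the text; windows without G or C get 0.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        cumulative += skew
        results.append((start, skew, cumulative))
    return results


def cumulative_extrema(results):
    """Window starts of the minimum and maximum cumulative skew."""
    lowest = min(results, key=lambda r: r[2])
    highest = max(results, key=lambda r: r[2])
    return lowest[0], highest[0]
```

For example, `cumulative_extrema(gc_skew(genome_sequence))` returns two candidate switch points, where `genome_sequence` stands for any DNA string loaded by the user.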
5.4. SNP analysis
To detect SNPs in Borg genomes, we mapped reads to the individual Borg genomes using bowtie2 (v.188.8.131.52, default settings) and extracted mapped reads with SAMtools (v.1.12) and BBMap (v.38.79). The reads were remapped to the Borg genomes allowing 3% mismatch using BBMap (minid=0.97 ambiguous=random). To detect SNPs, the mapped read files were analyzed with inStrain profile (v.1.3.4, standard settings).
5.5. Tandem repeat identification
Nucleotide TRs were initially predicted in Geneious Prime 2021.2.2 (Repeat Finder, minimum repeat length 50, maximum mismatches 0) and then annotated in the completed genomes using a custom Python script (https://github.com/rohansachdeva/tandem_repeats) based on MUMmer (v3.23). Nucleotide TRs were searched using a stringent threshold of a ≥50 nt region with ≥3 TR units and no mismatches (--min_region_len 50 --min_repeat_count 3). aaTRs were searched using a stringent threshold of ≥16 aa and ≥3 TR units (-l 3 --min_repeat_count 3 --min_uniq_nt 1 --min_region_len 16).
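To make the thresholds concrete, the following is a minimal Python sketch of a perfect tandem repeat scan. It is a naive left-to-right search, not the MUMmer-based script used in this study. The function name and the `max_unit_len` cap are hypothetical, while the defaults of 50 and 3 mirror the region-length and repeat-count thresholds stated above.

```python
def find_perfect_tandem_repeats(seq, min_region_len=50, min_repeat_count=3,
                                max_unit_len=100):
    """Return (start, end, unit, count) for perfect tandem repeat regions.

    Coordinates are 0-based, end-exclusive. max_unit_len is an
    illustrative cap on the repeat unit size, not a value from the paper.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    i = 0
    while i < n:
        best = None
        for unit_len in range(1, max_unit_len + 1):
            if i + unit_len * min_repeat_count > n:
                break  # not enough sequence left for the minimum copy number
            unit = seq[i:i + unit_len]
            count = 1
            # Count consecutive exact copies of the unit (no mismatches).
            while seq[i + count * unit_len:i + (count + 1) * unit_len] == unit:
                count += 1
            region_len = count * unit_len
            if count >= min_repeat_count and region_len >= min_region_len:
                # Keep the longest region; ties keep the shortest unit.
                if best is None or region_len > best[1] - best[0]:
                    best = (i, i + region_len, unit, count)
        if best:
            hits.append(best)
            i = best[1]  # continue after the reported region
        else:
            i += 1
    return hits
```

The same scan applies to protein sequences by passing an amino acid string and lowering `min_region_len` to 16, in line with the aaTR threshold. The scan is slow on whole genomes and is meant only to show how the two thresholds interact.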
5.6. Statistical analysis
To assess the significance of the observed compositional bias between the nucleotide TRs and non-TR sequences, we performed a Kruskal–Wallis test. This involved separating DNA sequences into repeat and non-repeat segments and dividing these into 50 bp substrings (resulting in a total of 803 substrings from repeat DNA and 9,430 from non-repeat DNA). The occurrences of each nucleotide were counted in each substring, and the values were placed into 4-member nucleotide frequency vectors [N_A, N_T, N_C, N_G]. Nucleotide frequency vectors for each 50 bp substring were then grouped into repeat and non-repeat categories and used as input to the Kruskal–Wallis test, which was implemented in SciPy (v. 1.9.0). The resulting p-values were then corrected for multiple testing with the Benjamini–Yekutieli FDR procedure.
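The two bookkeeping steps of this analysis, the 50 bp windowing and the Benjamini–Yekutieli adjustment, can be sketched in plain Python. The function names are hypothetical and this is not the code used in the study. The test itself is left out of the sketch; the per-nucleotide p-values could be obtained, for instance, by passing the repeat and non-repeat counts of one nucleotide to `scipy.stats.kruskal`.

```python
def window_counts(seq, size=50):
    """Split seq into non-overlapping windows and return [N_A, N_T, N_C, N_G]
    for each complete window (a trailing partial window is dropped)."""
    seq = seq.upper()
    vectors = []
    for start in range(0, len(seq) - size + 1, size):
        window = seq[start:start + size]
        vectors.append([window.count(base) for base in "ATCG"])
    return vectors


def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    # Harmonic-sum factor that distinguishes BY from Benjamini-Hochberg.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        value = pvalues[idx] * m * c_m / rank
        running_min = min(running_min, value)
        adjusted[idx] = running_min
    return adjusted
```

With four tests, one per nucleotide, the harmonic factor is 1 + 1/2 + 1/3 + 1/4, so the Benjamini–Yekutieli adjustment is roughly twice as conservative as Benjamini–Hochberg.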
5.7. Secondary structure prediction
The secondary structure of TRs was predicted with the RNAfold WebServer.
5.8. Protein family clustering
A dataset of 11,995 Borg proteins was constructed using all proteins from the ten curated Borg genomes and 37 additional aaTR-proteins from curated contigs of Pink, Blue, Steel, Olive, Gray, and Apricot Borg. All proteins were clustered using the fast and sensitive protein sequence searching software MMseqs2 (v7e2840992948ee89dcc336522dc98a74fe0adf00) in an all-versus-all search using sensitivity 7.5, coverage 0.5, and e-value 0.001. The sequences of each member within a protein subfamily were aligned using result2msa of MMseqs2 and used as input to construct HMMs for each subfamily using HHblits. The HMMs were then profiled against the PFAM database by HMM-HMM comparison using HHsearch, and protein subfamilies enriched in plasmid proteins were determined as described previously.
5.9. Functional prediction of aaTR-proteins
aaTR-proteins were profiled using InterProScan (v5.51–85.0) to obtain functional and structural annotations of individual proteins. Protein subfamilies were functionally annotated using HMMER (hmmer.org) (v3.3, hmmsearch) and the PFAM (--cut_nc) HMM database. Homology search was performed with blastp. Homologous proteins were defined as members of the same protein subfamily (for Borg proteins), or as best hits from blastp on NCBI or ggKbase (for non-Borg proteins). Intrinsic disorder was predicted with MobiDBLite and IUPred3; TMHs and cellular localization were predicted with TMHMM (v2.0) and psort (v2.0, archaeal mode). SLiMs were predicted with the ELM resource for functional sites in proteins (http://elm.eu.org/). The single aaTR unit was queried (cell compartment: not specified), and the functional site covering most of the aaTR was usually chosen (S7 Table). Functions of aaTR regions in proteins were also predicted with flDPnn.
5.10. Hierarchical clustering of aaTRs and construction of a phylogenetic tree
Hierarchical clustering of aaTRs was performed using triple TR units for the alignments, since three was the minimum number of repeat units set as the threshold, and the regions were dynamic. The alignments were performed with MAFFT (v7.453) (--treeout --reorder --localpair). The aaTR clusters were visualized and decorated in iTOL. An aaTR cluster was formed when a branch contained ≥3 related sequences. The names of the clusters were given based on the most abundant amino acids found in the repeat units (amino acids represented at ≥10% became name-giving).
The Borg DNA polymerase sequences were combined with a published reference dataset, aligned with MAFFT (v7.453) (--reorder --auto), and trimmed with trimal (v1.4.rev15) (-gt 0.2), and a maximum-likelihood tree was calculated in IQ-TREE (v1.6.12) (-m TEST -st AA -bb 1000 -nt AUTO -ntmax 20 -pre). The phylogenetic tree of the DNA polymerases was visualized and decorated in iTOL (v6.6).
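The naming rule can be sketched in a few lines of Python. The function name is hypothetical, and pooling all repeat units of a cluster before computing abundances is an assumption; the text specifies only the ≥10% cutoff and, in the S7 Table legend, the descending order.

```python
from collections import Counter


def name_cluster(repeat_units, min_fraction=0.10):
    """Name an aaTR cluster after its most abundant amino acids.

    Amino acids at or above min_fraction of all residues in the pooled
    repeat units are kept and listed in descending order of abundance
    (ties are broken alphabetically, an arbitrary illustrative choice).
    """
    counts = Counter("".join(repeat_units))
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "".join(aa for aa, n in ranked if n / total >= min_fraction)
```

For instance, pooling the hypothetical units `TTPP` and `TPTS` gives T at 50%, P at 37.5%, and S at 12.5%, so the cluster would be named `TPS`.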
5.11. Structural modeling
Structural modeling of aaTR-bearing Sm proteins was initially performed using SWISS-MODEL with the best hit in the Swiss-Prot database as template. Further structural modeling of aaTR-bearing Sm proteins was performed with AlphaFold2 using ColabFold, selecting the experimental option homooligomer 6. Structural modeling of aaTR-MHC proteins was performed using AlphaFold2 via LocalColabFold (--use_ptm --use_turbo --num_relax Top5 --max_recycle 3) [68,69]. Modeled protein structures were visualized and superimposed onto PDB structures using PyMOL (v2.3.4).
S1 Fig. Borg genome replication is initiated at the termini.
Shown are the GC skew (gray) and cumulative GC skew (green lines). Borg DNA is replicated from the terminal inverted repeats (origin, purple lines) to the terminus (terminus, blue lines).
S2 Fig. Reads mapped to a de novo assembly showing different combinations of repeat units in a Borg population.
There are two types of repeat units, shown as magenta or green segments. They are identical, except for a SNP highlighted in purple.
S3 Fig. Phylogenetic placement of Borg DNA polymerases.
Amino acid sequences of the DNA polymerases cluster together in the B9 clade. Additional DNA polymerases present in some Borgs cluster together in the B2 clade. Reference sequences originate from Kazlauskas and colleagues. The tree was rooted between the G3 and B9 clades. The data underlying this Figure can be found in Zenodo (10.5281/zenodo.6533809).
S5 Fig. Positional nucleotide frequency and total codon frequency in aaTR areas and non-aaTR areas of ORFs.
(A) The positional frequency of the four nucleotides was calculated for each codon within aaTRs and for all other codons. The codons were then divided into six categories (on the x-axis) based on the position of the individual nucleotides in the triplets. One codon can fall into multiple categories. (B) The codon frequency in aaTR regions was divided by the codon frequency in non-aaTR regions. Identical codon usage would result in a value of 1; codons enriched in aaTR regions have values >1, codons depleted in aaTR regions have values <1, and codons absent from aaTRs have a value of 0 (7 instances). The data underlying this Figure can be found in S5 Table.
S6 Fig. aaTRs are enriched in disorder-promoting amino acids.
The aa abundance reflects the frequency of amino acids in aaTRs divided by the frequency in all proteins. The data underlying this Figure can be found in S8 Table and have been sorted by their propensity to promote intrinsic disorder.
S7 Fig. Length and localization of aaTR and IDR regions in aaTR proteins and non-aaTR proteins.
IDRs were predicted with MobiDBLite (threshold: ≥15 consecutive residues). The lengths of aaTR or IDR regions were divided by the full protein length to calculate the relative length. The localization of aaTRs and IDRs was calculated by dividing the mean coordinate of each region by the full sequence length. A total of 178 Borg aaTR-proteins had 220 aaTR regions (blue), and 112/178 aaTR-proteins had IDRs (green). A total of 557 Borg non-aaTR proteins had an IDR (yellow). The data underlying this Figure can be found in S9 Table.
S9 Fig. Sequence alignment, aaTR composition, and predicted constructions of Borg Sm ribonucleoproteins.
(A) Multiple sequence alignment of Borg Sm ribonucleoproteins with and without aaTRs, Sm from Methanoperedens bins co-occurring with Borgs, and reference sequences from M. jannaschii (PDB: 4X9D), M. nitroreducens (WP_096203417), and E. coli (PDB: 1HK9). (B) aaTR units of Sm ribonucleoproteins. (C) Predicted structures of aaTR-Sm ribonucleoproteins.
S4 Table. Nucleotide composition of Borg genomes and TRs (in %).
The nucleotide composition was calculated for the coding strand of each replichore. Subsequently, the two values were summed, divided by the genome length, and multiplied by 100%.
S7 Table. Proteins with similar aaTR composition are part of the same repeat cluster.
This Table accompanies Fig 5B. Repeat units were hierarchically clustered based on triple aaTR units. Clusters with ≥3 aaTR members were named after the amino acids that form the aaTR unit (≥10% abundance, descending order). Features of selected aaTRs were predicted using the eukaryotic linear motif (ELM) resource for functional sites in proteins (http://elm.eu.org/).
We thank Shufei Lei, Jordan Hoff, Adair Borges, and Yue Claire Lou for bioinformatics help and Luis Valentin Alvarado and Susan Marquesee for useful discussions.
Lai S, Jia L, Subramanian B, Pan S, Zhang J, Dong Y, et al. mMGE: a database for human metagenomic extrachromosomal mobile genetic elements. Nucleic Acids Res. 2021;49:D783–D791. pmid:33074335
Yu MK, Fogarty EC, Murat Eren A. The genetic and ecological landscape of plasmids in the human gut. bioRxiv. 2022. p. 2020.11.01.361691.
Al-Shayeb B, Schoelmerich MC, West-Roberts J, Valentin-Alvarado LE, Sachdeva R, Mullen S, et al. Borgs are giant genetic elements with potential to expand metabolic capacity. Nature. 2022. pmid:36261517
Schoelmerich MC, Ouboter HT, Sachdeva R, Penev PI, Amano Y, West-Roberts J, et al. A widespread group of large plasmids in methanotrophic Methanoperedens archaea. Nat Commun. 2022;13:1–11.
Haroon MF, Hu S, Shi Y, Imelfort M, Keller J, Hugenholtz P, et al. Anaerobic oxidation of methane coupled to nitrate reduction in a novel archaeal lineage. Nature. 2013;500:567–570. pmid:23892779
Koonin EV, Yutin N. Evolution of the Large Nucleocytoplasmic DNA Viruses of Eukaryotes and Convergent Origins of Viral Gigantism. Adv Virus Res. 2019;103:167–202. pmid:30635076
Wang H, Peng N, Shah SA, Huang L, She Q. Archaeal extrachromosomal genetic elements. Microbiol Mol Biol Rev. 2015;79:117–152. pmid:25694123
Gunge N, Fukuda K, Takahashi S, Meinhardt F. Migration of the yeast linear DNA plasmid from the cytoplasm into the nucleus in Saccharomyces cerevisiae. Curr Genet. 1995;28:280–288. pmid:8529275
Chater KF, Kinashi H. Streptomyces Linear Plasmids: Their Discovery, Functions, Interactions with Other Replicons, and Evolutionary Significance. In: Meinhardt F, Klassen R, editors. Microbial Linear Plasmids. Berlin, Heidelberg: Springer Berlin Heidelberg; 2007. pp. 1–31.
Wagenknecht M, Dib JR, Thürmer A, Daniel R, Farías ME, Meinhardt F. Structural peculiarities of linear megaplasmid, pLMA1, from Micrococcus luteus interfere with pyrosequencing reads assembly. Biotechnol Lett. 2010;32:1853–1862. pmid:20652620
Olm MR, Crits-Christoph A, Bouma-Gregson K, Firek BA, Morowitz MJ, Banfield JF. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat Biotechnol. 2021;39:727–736. pmid:33462508
Kruskal WH, Wallis WA. Use of Ranks in One-Criterion Variance Analysis. J Am Stat Assoc. 1952;47:583–621.
Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001;29:1165–1188.
Gruber AR, Lorenz R, Bernhart SH, Neuböck R, Hofacker IL. The Vienna RNA websuite. Nucleic Acids Res. 2008;36:W70–W74. pmid:18424795
Kim JC, Mirkin SM. The balancing act of DNA repeat expansions. Curr Opin Genet Dev. 2013;23:280–288. pmid:23725800
Kazlauskas D, Krupovic M, Guglielmini J, Forterre P, Venclovas Č. Diversity and evolution of B-family DNA polymerases. Nucleic Acids Res. 2020;48:10142–10156. pmid:32976577
van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, et al. Classification of intrinsically disordered regions and proteins. Chem Rev. 2014;114:6589–6631. pmid:24773235
Hauser M, Steinegger M, Söding J. MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics. 2016;32:1323–1330. pmid:26743509
Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2011;9:173–175. pmid:22198341
Söding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21:951–60. pmid:15531603
Van Roey K, Uyar B, Weatheritt RJ, Dinkel H, Seiler M, Budd A, et al. Short linear motifs: ubiquitous and functionally diverse protein interaction modules directing cell regulation. Chem Rev. 2014;114:6733–78. pmid:24926813
Kumar M, Gouw M, Michael S, Sámano-Sánchez H, Pancsa R, Glavina J, et al. ELM-the eukaryotic linear motif resource in 2020. Nucleic Acids Res. 2020;48:D296–D306. pmid:31680160
Maupin-Furlow J. Proteasomes and protein conjugation across domains of life. Nat Rev Microbiol. 2011;10:100–111. pmid:22183254
Garnham CP, Roll-Mecak A. The chemical complexity of cellular microtubules: tubulin post-translational modification enzymes and their roles in tuning microtubule functions. Cytoskeleton. 2012;69:442–63. pmid:22422711
Vogel J, Luisi BF. Hfq and its constellation of RNA. Nat Rev Microbiol. 2011;9:578–89. pmid:21760622
Nikulin A, Mikhailina A, Lekontseva N, Balobanov V, Nikonova E, Tishchenko S. Characterization of RNA-binding properties of the archaeal Hfq-like protein from Methanococcus jannaschii. J Biomol Struct Dyn. 2017;35:1615–28. pmid:27187760
Necci M, Piovesan D, Dosztányi Z, Tosatto SCE. MobiDB-lite: fast and highly specific consensus prediction of intrinsic disorder in proteins. Bioinformatics. 2017;33:1402–4. pmid:28453683
Erdős G, Pajkos M, Dosztányi Z. IUPred3: prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res. 2021;49:W297–W303. pmid:34048569
Kletzin A, Heimerl T, Flechsler J, van Niftrik L, Rachel R, Klingl A. Cytochromes c in Archaea: distribution, maturation, cell architecture, and the special case of Ignicoccus hospitalis. Front Microbiol. 2015;6:439. pmid:26029183
Ryan CP. Tandem repeat disorders. Evol Med Public Health. 2019;2019:17. pmid:30800316
Usdin K. The biological effects of simple tandem repeats: lessons from the repeat expansion diseases. Genome Res. 2008;18:1011–19. pmid:18593815
Fondon JW 3rd, Garner HR. Molecular origins of rapid and continuous morphological evolution. Proc Natl Acad Sci U S A. 2004;101:18058–18063. pmid:15596718
Viguera E, Canceill D, Ehrlich SD. Replication slippage involves DNA polymerase pausing and dissociation. EMBO J. 2001;20:2587–95. pmid:11350948
Zhou K, Aertsen A, Michiels CW. The role of variable DNA tandem repeats in bacterial adaptation. FEMS Microbiol Rev. 2014;38:119–141. pmid:23927439
Castillo-Lizardo M, Henneke G, Viguera E. Replication slippage of the thermophilic DNA polymerases B and D from the Euryarchaeota Pyrococcus abyssi. Front Microbiol. 2014;5:403. pmid:25177316
Tyson GW, Banfield JF. Rapidly evolving CRISPRs implicated in acquired resistance of microorganisms to viruses. Environ Microbiol. 2008;10:200–7. pmid:17894817
McGinn J, Marraffini LA. Molecular mechanisms of CRISPR-Cas spacer acquisition. Nat Rev Microbiol. 2019;17:7–12. pmid:30171202
Waters TR, Swann PF. Thymine-DNA glycosylase and G to A transition mutations at CpG sites. Mutat Res. 2000;462:137–147. pmid:10767625
Statello L, Guo C-J, Chen L-L, Huarte M. Gene regulation by long non-coding RNAs and its biological functions. Nat Rev Mol Cell Biol. 2021;22:96–118. pmid:33353982
Ninomiya K, Hirose T. Short Tandem Repeat-Enriched Architectural RNAs in Nuclear Bodies: Functions and Associated Diseases. Noncoding RNA. 2020:6. pmid:32093161
Coelho Ribeiro M de L, Espinosa J, Islam S, Martinez O, Thanki JJ, Mazariegos S, et al. Malleable ribonucleoprotein machine: protein intrinsic disorder in the Saccharomyces cerevisiae spliceosome. PeerJ. 2013;1:e2. pmid:23638354
Törö I, Thore S, Mayer C, Basquin J, Séraphin B, Suck D. RNA binding in an Sm core domain: X-ray structure and functional analysis of an archaeal Sm protein complex. EMBO J. 2001;20:2293–2303. pmid:11331594
Haynes C, Oldfield CJ, Ji F, Klitgord N, Cusick ME, Radivojac P, et al. Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol. 2006;2:e100. pmid:16884331
Breuer M, Rosso KM, Blumberger J. Electron flow in multiheme bacterial cytochromes is a balancing act between heme electronic interaction and redox potentials. Proc Natl Acad Sci U S A. 2014;111:611–616. pmid:24385579
Yakhnina AA, Bernhardt TG. The Tol-Pal system is required for peptidoglycan-cleaving enzymes to complete bacterial cell division. Proc Natl Acad Sci U S A. 2020;117:6777–6783. pmid:32152098
Heilpern AJ, Waldor MK. CTXphi infection of Vibrio cholerae requires the tolQRA gene products. J Bacteriol. 2000;182:1739–1747. pmid:10692381
Leu AO, McIlroy SJ, Ye J, Parks DH, Orphan VJ, Tyson GW. Lateral Gene Transfer Drives Metabolic Flexibility in the Anaerobic Methane-Oxidizing Archaeal Family Methanoperedenaceae. MBio. 2020:11. pmid:32605988
Gilchrist CLM, Chooi Y-H. Clinker & clustermap.js: Automatic generation of gene cluster comparison figures. Bioinformatics. 2021. pmid:33459763
Brown CT, Olm MR, Thomas BC, Banfield JF. Measurement of bacterial replication rates in microbial communities. Nat Biotechnol. 2016;34:1256–1263. pmid:27819664
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. pmid:22388286
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009:2078–2079. pmid:19505943
Bushnell B. BBMap: A fast, accurate, splice-aware aligner. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); 2014 Mar. Report No.: LBNL-7065E. Available from: https://www.osti.gov/biblio/1241166-bbmap-fast-accurate-splice-aware-aligner
Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12. pmid:14759262
Jaffe AL, Thomas AD, He C, Keren R, Valentin-Alvarado LE, Munk P, et al. Patterns of Gene Content and Co-occurrence Constrain the Evolutionary Path toward Animal Association in Candidate Phyla Radiation Bacteria. MBio. 2021;12:e0052121. pmid:34253055
Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30:1236–1240. pmid:24451626
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–D230. pmid:24288371
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. pmid:2231712
Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001;305:567–580. pmid:11152613
Yu NY, Wagner JR, Laird MR, Melli G, Rey S, Lo R, et al. PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics. 2010;26:1608–1615. pmid:20472543
Hu G, Katuwawala A, Wang K, Wu Z, Ghadermarzi S, Gao J, et al. flDPnn: Accurate intrinsic disorder prediction with putative propensities of disorder functions. Nat Commun. 2021;12:4438. pmid:34290238
Katoh K, Misawa K, Kuma K-I, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30:3059–3066. pmid:12136088
Letunic I, Bork P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 2016;44:W242–W245. pmid:27095192
Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. pmid:19505945
Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015;32:268–274. pmid:25371430
Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 2018;46:W296–W303. pmid:29788355
Mirdita M, Ovchinnikov S, Steinegger M. ColabFold—Making protein folding accessible to all. bioRxiv. 2021. p. 2021.08.15.456425.
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. pmid:34265844
Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold—Making protein folding accessible to all. Research Square. 2021.
Moriwaki Y. localcolabfold: ColabFold on your local PC. GitHub; Available from: https://github.com/YoshitakaMo/localcolabfold
Delano WL. The PyMOL Molecular Graphics System. http://www.pymol.org. 2002 [cited 2022 Jan 27]. Available from: https://ci.nii.ac.jp/naid/10020095229/
Uversky VN. Intrinsically disordered proteins and their “mysterious” (meta)physics. Front Physiol. 2019:7.