Bioinformatics geekery

Natural GMOs part 85. Speed matters! Era7 and crowd outsourcing provide – E coli EHEC genome annotation fast!

Post Updated 7/06/2011:

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”
He took his vorpal sword in hand:
Long time the manxome foe he sought —
So rested he by the Tumtum tree,
And stood awhile in thought.
And, as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!
One, two! One, two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.
“And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
He chortled in his joy.

[Image shows a portrait of the Pundit as a young man fighting the Jabberwock]

Out of Germany, not Out of Africa — the beastly German germ is a GMO whose parents have been around Germany for 10 years.
But it has now been virtually dismembered and its entrails laid out to dry, thanks to an intense weekend of in silico combat by the super-geeks at BGI and elsewhere in the cloud. The latest results are a sensitive new test for the germ from BGI, and a full demonstration by Kat Holt of the several gene movement events that generated this strain. The Era7 crowd have provided a detailed annotation of the gene content of this organism, which is now freely available as a paper in Nature Precedings.
The German beast has a main chromosome that is 99.69% identical to the known Escherichia coli EAEC strain 55989 over 96.07% of the chromosome’s length. This strain comes from Africa. Another strain, German strain 01-09591, originally isolated in 2001, is probably even more closely related to the current outbreak strain, but its genome has not yet been completely decoded. BGI could do it in about a day if given the DNA. Interestingly, Kat Holt and others now show that the German outbreak strain has inherited a shigatoxin gene as part of an acquired virus cassette inserted into its main chromosome. It also has genes for a gut surface attachment apparatus (aggABCD) on a mobile gene cassette carried on a plasmid: a clear-cut example of a natural GMO.

Gene-jockies are the scientists who go riding along long stretches of DNA data on a computer screen to try and discover its hidden secrets. Their amazing but largely unappreciated work, and that of their clever colleagues the computer nerds, in dissecting the DNA of disease germs to find how they cause harm certainly merits much more publicity than it currently gets.
But first a warning: we are going to occasionally talk in jabberwocky, the language of the gene-jockies. It sometimes frightens newbies, but after a while one gets to enjoy it, and even gets addicted to it. Indeed, it is extremely satisfying to investigate a disease outbreak with the computer, and to see virtual collaboration in action between research groups in different countries. With the current food poisoning emergency, data come from China and France, outbreak DNA samples from Germany, databases from Los Alamos, computers from Bill Gates, Intel corporation and Silicon Valley, and the internet from Al Gore.
Fortunately, because laboratory work with cloned DNA was not banned in 1976 in spite of the efforts of many people to stop it, gene-jockies are now able to decode the genome information in pathogens in a few days. The German public deserve to know that science is their friend in times of trouble, and the efforts of much-disparaged DNA scientists need proper recognition. So we are posting here some samples of the feverish effort that’s currently giving the German EHEC beast a heavy working over in silico.
The Pundit always takes pleasure in saluting people who would do something rather than doing nothing and stopping everything. It is fantastic to witness the power of the wisdom of crowds in open-source action over the internet. The donation to the scientific community just a day or so ago of the complete DNA sequence of the German epidemic germ (mentioned in the headline of this post, and described more fully in notes added at the end) is stimulating a lot of open-source analysis (details below).
The full version:
To set the stage with a better understanding of the EHEC germ, let’s have a go at some rather simple German germ genome analysis here, and follow up with a scan through what super-geeks like David Studholme, Kat Holt and Nick Loman do.
Using the genetic leads and data deposited in public databases such as Genbank graciously provided by the gene jockies (key sections of which are appended below)  one can use the computer and the internet to discover valuable new information about the German outbreak strain.
For example, gene sequence data from BGI indicate it is closely related to E. coli strain EAEC 55989 (proof for this is given below), and we find from searching the open-access genetic databases that EAEC E. coli strain 55989 is fully described by Touchon and others (2009). We find in these databases that its disease-related plasmid pAA and main chromosome are both completely characterised by database reports that were presciently placed there a year ago by the following fine French scientists:

Touchon M, Hoede C, Tenaillon O, Barbe V, Baeriswyl S, et al. 2009 Organised Genome Dynamics in the Escherichia coli Species Results in Highly Diverse Adaptive Paths. PLoS Genet 5(1): e1000344. doi:10.1371/journal.pgen.1000344

JABBERWOCKY ALERT: This paper describes the diverse horizontal gene movement events happening repeatedly in E. coli in very sophisticated detail. Non-experts should avoid it as it contains many obscene passages written in an obscure dialect of the jabberwock language.
Fortunately, for those geeks among us who read jabberwocky, the paper and associated database entries tell us a lot more about E. coli strain 55989. Just a part of this new information is where the strain might have come from, and what previous research was done on it:
Quote from Touchon and others 2009:
Enteroaggregative E. coli strain 55989 was originally isolated from the diarrheagenic stools of an HIV-positive adult suffering from persistent watery diarrhea in Central African Republic [86]. The enteroaggregative pathotype is recognized as an emerging cause of diarrhoea in children and adults worldwide [87]. [Added note: Amazingly, some readers think the virus involved in stx2 gene movement could be HIV because of the mention of HIV in the African reference just above. Stx2 is carried on a bacterial virus which has nothing to do with HIV. Sigh! So much misinformation and misunderstanding circulating on the internet, no wonder people get scared.]

REF 86. Mossoro C, Glaziou P, Yassibanda S, Lan NT, Bekondi C, et al. (2002) Chronic diarrhea, hemorrhagic colitis, and hemolytic-uremic syndrome associated with HEp-2 adherent Escherichia coli in adults infected with human immunodeficiency virus in Bangui, Central African Republic. J Clin Microbiol 40: 3086–3088.
REF 87. Bernier C, Gounon P, Le Bouguénec C (2002) Identification of an aggregative adhesion fimbria (AAF) type III-encoding operon in enteroaggregative Escherichia coli as a sensitive probe for detecting the AAF-encoding operon family. Infect Immun 70: 4302–11.

Wow! How fascinating. An African pedigree. The Pundit is now going to continue hunting through these DNA sequences to tease out in further detail the virulence characteristics of the German strain.
He’s hunting for the aggR disease plasmid and the stx2 toxin prophage. Not the Snark or the Jabberwock. As he finds it he is going to translate the work from jabberwocky into ordinary English.
Gene-Jocky Sleuthing results:
From the BGI genome data, the Pundit was able to confirm plausible natural mechanisms for gene movement of the virulent gene aggR and the toxin gene stx2 into the German outbreak strain from other E. coli strains known to carry them. These mechanisms involve mating to transfer a virulence plasmid, and virus mediated or assisted movement of the toxin gene stx2.
Some database sleuthing details
A little bit of reading of gene sequences provided by the gene-jockies, plus some software-assisted sleuthing in the main Genbank gene databases, easily turns up the DNA data files on EAEC strain E. coli 55989 kindly deposited in the databases a year ago by the French workers for open use by the scientific community. These notes are referenced by hyperlinks to the Genbank entries and brief descriptions later on in this post.
From these entries the Pundit discovered that Bernier, Gounon and Le Bouguénec did some key work way back in 2002. They describe a germ virulence capability known as aggregative adhesion fimbria (AAF). These hair-like objects are on the surface of the bacteria and enable the bacteria to attach firmly to the gut wall. They are alternatively called pili (singular pilus). These pili are associated with a special mechanism that enables these bacteria to inject a protein into the gut lining cells of their host animal. Microbiologists call this a type III secretion system. Type III secretion systems of various sorts are frequently a crucial disease-related germ attribute. They inject many different proteins into animal cell targets.
Another fact can be gleaned by scrutiny of the Genbank database entries on EAEC E. coli 55989. The African strain has no gene for making shigatoxin, whose presence is a key feature of EHEC germs. Thus if the German beast evolved from the African EAEC germ, it must have captured a new shigatoxin stx2 gene, quite likely via a virus-mediated or virus-assisted addition of new genes into its main chromosome.
EAEC virulence components are transmissible.
More sleuthing yields an authoritative 2009 review– “Pathogenomics of the Virulence Plasmids of Escherichia coli” by Timothy Johnson and Lisa Nolan which says this:

EAEC strains are the most recently described of the E. coli intestinal pathotypes (77). These bacteria were first described by Nataro et al. in 1987, based upon their distinct aggregative adherence phenotype, which is seen as a brick-like pattern when the bacteria adhere to cultured HEp-2 cells (129). EAEC strains are considered to be an emergent diarrheal pathotype implicated in traveler’s diarrhea and affecting immunocompromised children in developing countries (77). In fact, EAEC strains are second only to ETEC strains as being the most common agent of traveler’s diarrhea. It is thought that food and water are the most likely means of transmission (77). Epidemiological studies involving this strain have demonstrated that EAEC virulence is heterogeneous, complex, and likely dependent on multiple bacterial factors and host immune status (126). EAEC pathogenesis is thought to involve three primary steps. First, the bacteria adhere to the intestinal mucosa using aggregative adherent fimbriae (AAF). Second, the bacteria produce a mucus-mediated biofilm on the enterocyte surface. Finally, the bacteria release toxins that affect the inflammatory response, intestinal secretion, and mucosal toxicity (77). Aspects of each of these steps involve plasmid-encoded traits.

A primary virulence factor of EAEC is that encoding the aggregative adherence phenotype (72). This trait was found to be associated with AAF (127) and is localized to a 55- to 65-MDa plasmid, termed the “pAA plasmid” (129). Like ETEC CFs, allelic variants of AAF have been identified. AAF from prototypical EAEC strain 17-2 (127) is genetically distinct from AAF from prototypical strain O42 (126), and their respective allelic variants are named types AAF/I and AAF/II. Other allelic variants of AAF have been described, including AAF/III from prototypical strain 55989 (9) and AAF/IV from strain C1010-00 (16). All the identified AAF allelic types appear to be plasmid encoded, and most of the strains analyzed tend to possess only a single AAF allelic type (72). AAF genes are regulated by an AraC-like transcriptional activator, AggR, and strains containing AggR have been termed “typical” EAEC strains (131). The AAF regulon contains both fimbrial genes and a regulator linked to one another on the pAA-type plasmid. There is evidence that AggR is a global regulator of EAEC virulence, as it exhibits effects on a number of chromosomal virulence factors as well (125). The major AAF pilins regulated by AggR include aggA (AAF/I), aafA (AAF/II), and agg3 (AAF/III) (935131). AggR also regulates the expression of aap, a dispersin that is highly prevalent among EAEC isolates and facilitates the movement of EAEC across the intestinal mucosa for subsequent aggregation and adherence (77). This dispersin is exported out of the EAEC cell via the antiaggregation protein transporter system, encoded by the genes aatPABCD (8). This ABC transporter system is highly prevalent among EAEC populations, highly conserved, and regulated by AggR (77). While few studies have involved large numbers of EAEC isolates, recent work by Jenkins et al. found that two groups of EAEC exist based upon gene clustering. They are distinguished by the presence or absence of genes encoded on plasmid pAA and en bloc sets of genes located on genomic islands near the pheU and glyU loci (83). The definition of “typical” versus “atypical” EAEC strains has thus been supported by such results, with typical EAEC strains possessing pAA-associated genes and certain chromosomal islands, apparently coinherited.

The EAEC plasmids also encode toxins such as the plasmid-encoded toxins Pet and EAST1 (45). Pet appears to belong to the serine protease autotransporter family and has been shown to confer cytoskeletal rearrangements, suggesting a role for Pet in EAEC pathogenesis (24188). EAST1 has been found to activate guanylate cyclase, resulting in ion secretion (128). However, relatively few EAEC strains actually possess the genes encoding Pet and EAST1, so their role in EAEC pathogenesis may be limited (36).

Three EAEC plasmids have been completely sequenced: pO42, belonging to AAF/II+ strain O42; 55989p, belonging to AAF/III+ strain 55989; and pO86A1, containing a novel AAF-like operon. All three of these plasmids are F-type plasmids with stability, maintenance, and transfer regions (Fig. 2). Plasmid 55989p is considerably smaller than plasmids pO42 and pO86A1, which is due to truncations in the F transfer region. This plasmid also differs from pO86A1 and pO42 in that it contains a RepFIC replication region instead of RepFIIA, although all three plasmids also contain a second replication region known as RepFIB (Fig. 2). All three plasmids encode their respective AAF types, and each contains the AAF regulatory gene aggR. While the AAF types possess considerable genetic diversity, aggR is generally highly conserved among the plasmid sequences available. A phylogenetic comparison based upon a nucleotide alignment of available aggR sequences revealed that aggR genes from AAF types I and III appear to be most closely related, whereas other AAF types are more divergent (Fig. 3). Also, sharing nucleotide similarity with aggR is the AraC-type transcriptional regulator rns of human ETEC plasmid types (26).

The features common to all three sequenced EAEC plasmids are the AAF operons, aggR, and aatPABCD (Fig. 2). In all three plasmids, these sequences are present on a RepFIB/FIIA-type backbone. Each of these plasmids also has unique regions not present in the other two sequenced plasmids, including the pet gene in pO42, the ipd gene in pO86A1 encoding an extracellular serine protease, and the Ets iron transport system in pO42 (Fig. 2). The acquisition of Eit by pO42 is particularly interesting because it was previously found only within ExPEC ColV and ColBM plasmids on a RepFIB/FIIA plasmid backbone (8788). Although the EAEC plasmids share a common plasmid backbone and core EAEC-associated virulence genes, the gross genetic composition and synteny of these three plasmids are quite different from one another. This would suggest that a significant amount of gene shuffling and rearrangement has occurred since the introduction of their virulence-associated module or that this module has been introduced on different occasions.

Another Wow! Have these gene-jockies been busy! And what do they say? That the EAEC plasmids are F-plasmids (described here). This most likely means that they are mobile or can be easily mobilised by fully functional F-plasmids. Their disease-causing ability is infectious and is predicted to be easily transmissible to other bacteria. They are related to the first plasmid ever characterised, the fertile one (called F) discovered by Joshua Lederberg in 1946! They are able to create new natural GMOs.

But all this is no surprise to microbiologists– it is quite predictable from what they are taught in Germs 101. But the gene databases provide strong confirmatory evidence on the sexual prowess of EAEC type pathogenic E. coli. Thank you again gene-jockies and database nerd-people. Bioinformatics Rules OK!

Joshua Lederberg in the 1960s (NLM). Lederberg's 1946 discovery of mating in Escherichia coli was heralded by Salvador Luria as likely to be "among the most fundamental advances in the whole history of bacteriological science". For this work he was awarded the Nobel Prize in Physiology or Medicine for 1958. The bacterial mating mechanism, called conjugation, a major mechanism for horizontal gene transfer, is now known to have evolved to perform a wide range of biological roles for injection of both DNA and protein into diverse target cells, including bacteria, yeasts, plants and protists.

Plausible mechanisms for toxin gene movement into the German outbreak strain can be deduced from the BGI DNA data:

From DNA sequence information on the German strain provided by BGI, several phage (silent virus) genes can be identified as being inserted in the main chromosome, e.g. a gene for phage antitermination protein Q. This gene is near the shigatoxin chromosomal genes in EHEC bacteria (e.g. strain EDL933, analysed by Perna et al. in 2001). This finding confirms a possible role of bacterial virus genes in the transfer of toxin-forming ability into the German outbreak strain. (Kat Holt provides strong confirmation of this; see later for details.)

Gene database searches show these virus genes are also widely dispersed in other E. coli bacteria. A simple explanation is that virus genes in some way facilitate toxin gene movement between strains. Several mechanisms can be plausibly suggested for this, but the bottom line is that plausible routes for interstrain toxin gene movement (horizontal gene transfer, HGT) are suggested by the DNA sequence evidence donated to the science community by BGI.

Full gene content of the German isolate described

Escherichia coli EHEC Germany outbreak preliminary functional annotation using BG7 system

by: Raquel Tobes, Marina Manrique, Pablo Pareja-Tobes, Eduardo Pareja-Tobes, Eduardo Pareja

Free access to full text

Nature Precedings, No. 713. (6 June 2011) doi:10.1038/npre.2011.6001.1

We have annotated the European outbreak E. coli EHEC genome sequenced by BGI (6-2-2011) and assembled with MIRA by Nick Loman (6-2-2011). Our system BG7, Bacterial Genome annotation of Era7 Bioinformatics, predicts ORFs and annotates them based on fragments of similarity with Uniprot proteins. We have predicted 6327 genes, 6156 encoding proteins and 171 corresponding to ribosomal and tRNA. Based on the preliminary results of our semi-automated method of annotation we have selected some predicted proteins with potential implications in pathogenicity and virulence. There are 33 predicted genes annotated as toxins and we have found three putative hemolysins: Hemolysin E, a putative hemolysin expression modulating protein and a channel protein, hemolysin III family. We have found 31 predicted genes that could be related to specific antibiotic resistance: beta-lactamic, aminoglycoside, macrolide, polymyxin, tetracycline, fosfomycin and deoxycholate, novobiocin, chloramphenicol, bicyclomycin, norfloxacin and enoxacin and 6-mercaptopurine. This strain is rich in adhesion, secretion systems, pathogenicity and virulence related proteins. It seems to have a restriction-modification system, many proteins involved in Fe transport and utilization (siderophores as aerobactin and enterobactin), lysozyme, one inhibitor of pancreatic serine proteases, proteins involved in anaerobic respiration, antimicrobial peptides, proteins involved in quorum sensing and biofilm formation that could confer competitive advantage to this strain.


Now for something really geekerful. The blow-by-blow wounding of the Jabberwock by the crowd working in the cloud. May the force be with you.

First we listen to Dave Studholme’s story:

E. coli TY2482: strain-specific genes

Posted on June 4, 2011 by david j studholme

Within days of the reported E. coli outbreak,  BGI have released five runs of genome-wide sequence data generated using Ion Torrent. With equally astonishing speed, Nick Loman has generated a preliminary de novo sequence assembly and Marina Manrique generated a preliminary annotation of that assembly.

I was curious to know: is there anything in the genome of TY2482 that is unique (i.e. not found in any previously seen E. coli genomes)? The answer is probably yes — but not much! The unique genome regions are listed in the table below.

I performed BLASTN searches of Nick Loman’s TY2482 assembly against each of the Escherichia genomes in the NCBI RefSeq database using an E-value threshold of 1e-10. The set of 221 RefSeq genomes (including chromosomes and plasmids) is listed below the table at the bottom of this page. I then pulled out all the regions of the TY2482 assembly that showed no BLASTN matches against any of the RefSeq sequences. I found just four such regions, ranging in length from 1. kb to 1.5 kb. The lists of predicted genes in these regions are taken from Marina’s preliminary annotation.

Please note that the list of sequences against which I compared TY2482 is not comprehensive. In other words, there are other E. coli (and closely related species) sequences in the public databases that are not included in my list. (…go to link for details.)
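
For geeks who want to see what "pulling out the regions with no BLASTN matches" amounts to, here is a minimal Python sketch. It is not David Studholme's actual pipeline. It assumes the BLASTN hits on one contig have already been reduced to (start, end) query coordinates, for example from the qstart and qend columns of BLAST tabular output, and the function name and the min_len parameter are made up for illustration.

```python
def uncovered_regions(contig_length, hits, min_len=1):
    """Return 1-based inclusive (start, end) regions of a contig not covered by any hit.

    hits: iterable of (start, end) query coordinates, 1-based inclusive.
    min_len: hypothetical cut-off; shorter uncovered stretches are ignored.
    """
    # Normalise each hit so that start <= end, then sort by start position
    intervals = sorted((min(s, e), max(s, e)) for s, e in hits)
    gaps = []
    next_uncovered = 1  # first position not yet covered by any hit
    for start, end in intervals:
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    # Anything left after the last hit is also uncovered
    if next_uncovered <= contig_length:
        gaps.append((next_uncovered, contig_length))
    return [(s, e) for s, e in gaps if e - s + 1 >= min_len]


# Toy example (invented numbers): a 5,000 bp contig with three overlapping hits
toy_hits = [(1, 1200), (1000, 2500), (4000, 5000)]
print(uncovered_regions(5000, toy_hits, min_len=1000))
```

On the toy data the only stretch without a hit is positions 2501 to 3999. In a real analysis the hits would first be filtered by the E-value threshold (1e-10 in the post above) before being passed in.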

Wait! There is more:

Comparisons of E. coli TY2482 against previously sequenced E. coli genomes

Posted on June 5, 2011 by david j studholme


I have aligned Nick Loman’s TY2482 assembly, and the BGI’s raw Ion Torrent sequence reads, against the complete E. coli genome sequences from the NCBI RefSeq database. I used BLASTN to align Nick’s contigs and used BWA to align BGI’s raw reads. I used CGView to display the results of the alignments.

I also aligned Nick’s assembly against all these genome sequences using Mummer. The results in Excel format are here and OpenOffice format here. Looks like the most similar genome is Escherichia coli 55989 NC_011748: 99.69% nucleotide sequence identity over 96.07% of the chromosome’s length.
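
A figure like "99.69% identity over 96.07% of the chromosome’s length" combines two numbers: how much of the reference is covered by alignments, and how similar the aligned parts are. The Python sketch below shows one simple way such a summary could be computed. It is an illustration only, and it assumes the alignments have been reduced to (ref_start, ref_end, percent_identity) blocks, roughly the information a whole-genome aligner reports. The function name and the length-weighted averaging are the Pundit's assumptions, not necessarily what was done for the spreadsheet above.

```python
def summarise_alignment(ref_length, blocks):
    """Return (percent of reference covered, length-weighted percent identity).

    blocks: list of (ref_start, ref_end, percent_identity), 1-based inclusive.
    """
    if not blocks:
        return 0.0, 0.0
    # Coverage: walk the blocks in order and count each reference base once
    covered = 0
    current_end = 0
    for start, end, _ in sorted(blocks):
        start = max(start, current_end + 1)  # skip bases already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    # Identity: average of the block identities, weighted by block length
    total_len = sum(end - start + 1 for start, end, _ in blocks)
    identity = sum((end - start + 1) * pid for start, end, pid in blocks) / total_len
    return 100.0 * covered / ref_length, identity


# Toy example (invented numbers): three blocks on a 1,000 bp reference
coverage, identity = summarise_alignment(
    1000, [(1, 400, 99.0), (301, 600, 100.0), (701, 900, 98.0)]
)
print(f"{identity:.2f}% identity over {coverage:.2f}% of the reference")
```

Overlapping blocks are counted once for coverage but contribute separately to the identity average, which is a simplification.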

Beautiful Bioinformatics geekery

E. coli TY2482: strain-specific genes

Posted on June 4, 2011 by david j studholme

UPDATE Monday 6th June 2011

I have investigated the candidate ‘novel’ genes below a bit more carefully. Thanks to @kamounlab for some help with this (but any mistakes are mine!). As I stated previously, none of these shows significant nucleotide sequence similarity to any of the sequenced E. coli genomes listed at the bottom of this post. However, we do find some similarities to other E. coli sequences in the public databases. The only truly ‘unique’ sequence in the TY2482 is about 1 kb on contig husec41_c1687.

  • husec41_c1060: This contig shares a lot of sequence similarity with Stx2-converting phage 86 (NC_008464.1), which was previously seen in Stx2 phage of EHEC O86:H- strain DIJ1. It shares 96% identity over 75% of the contig’s length.
  • husec41_c1408: This contig is almost identical to an E. coli strain Ec222 pathogenicity island GenBank: AY151282.1.
  • husec41_c1687: About two-thirds of this contig shows no significant nucleotide sequence similarity to anything in the NR or RefSeq databases. However, BLASTX reveals some protein-level sequence similarity to E. coli transposon Tn21 resolvase at the 3′ end of the contig. But the section in the middle of the contig has no detectable similarity to anything at either the DNA or the protein level.
[Pundit’s reaction: A clear cut movement of new mobile DNA]

  • husec41_c1496: Although this contig shows no detectable nucleic acid sequence similarity to any fully sequenced genome in RefSeq, it is almost identical to several sequences in GenBank, including a microcin operon from strain CA58.

E. coli TY2482 genome compared versus E. coli EAEC strain 55989

Of the E. coli strains whose genomes have been fully sequenced previously, EAEC 55989 is the one most closely related to TY2482. So what are the genomic differences between TY2482 and its sibling 55989?

The following CGView plots illustrate alignments of the BGI’s TY2482 Ion Torrent data against the EAEC 55989 genome (using BWA).


The Pundit: Oooooooooh! Very nice. Plasmid gene movement to the main chromosome! Maybe not. Those are probably plasmid contigs matching 55989 data. Still nice. Maybe there is a lot of rearrangement of the plasmid?

Then there is the super-geek woman Kat H, who lives near the Pundit:

At bacpathgenomes

EHEC genomes

So two sequences have so far been released relating to the EHEC outbreak in Europe, see details here and links to public data & analyses here on Nick Loman’s blog:

For the first sequence, TY2482, BGI has released fastqs and an assembly (methods undescribed); Nick Loman did an assembly using MIRA.

The second sequence is LB226692, for which only an assembly is available (using a combination of mapping and de novo approaches, see here).

So how similar are the two? As a really basic first pass analysis, I used MUMmer to map the two assemblies to the closest reference sequence, E. coli 55989 (accession CU928145). Excluding indels and SNPs called within 100 bp of a contig end or other variant, this leaves 331 SNPs that the two novel genomes share relative to the reference genome; plus each of the novel genomes has 28-40 unique SNPs of their own. This is the tree (bioNJ, but really doesn’t matter as it’s so simple). (Go there or be square).
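
The filtering and counting that Kat describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not her actual code: SNP calls are assumed to be available as records carrying the reference position, the variant base, and the position and length of the contig they were called from. All names and the toy data are invented, and only SNP-to-SNP distances are checked (the real analysis also excluded SNPs near indels).

```python
from collections import namedtuple

# Hypothetical record layout for one SNP call
Snp = namedtuple("Snp", "ref_pos alt contig_pos contig_len")


def filter_snps(snps, min_dist=100):
    """Keep SNPs that are not within min_dist bp of a contig end or of another SNP.

    Returns a set of (ref_pos, alt) pairs so that strains can be compared.
    """
    positions = [s.ref_pos for s in snps]
    kept = set()
    for s in snps:
        near_end = min(s.contig_pos - 1, s.contig_len - s.contig_pos) < min_dist
        near_other = any(
            p != s.ref_pos and abs(p - s.ref_pos) <= min_dist for p in positions
        )
        if not (near_end or near_other):
            kept.add((s.ref_pos, s.alt))
    return kept


# Toy calls for two assemblies mapped to the same reference (invented numbers)
ty2482 = filter_snps([
    Snp(5000, "A", 2000, 10000),
    Snp(9000, "G", 6000, 10000),   # dropped: 50 bp from the next SNP
    Snp(9050, "T", 6050, 10000),   # dropped: 50 bp from the previous SNP
    Snp(12000, "C", 50, 8000),     # dropped: 49 bp from a contig end
    Snp(20000, "T", 4000, 8000),
])
lb226692 = filter_snps([
    Snp(5000, "A", 3000, 12000),
    Snp(30000, "G", 7000, 12000),
])

shared = ty2482 & lb226692          # variants both genomes share against the reference
only_ty2482 = ty2482 - lb226692     # variants unique to the first assembly
only_lb226692 = lb226692 - ty2482   # variants unique to the second assembly
print(len(shared), len(only_ty2482), len(only_lb226692))
```

Treating each filtered call set as a Python set makes the "shared" and "unique" counts simple set intersections and differences.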

and at EAEC/STEC genomes (at 6/06/2011; we are ahead of Europe and the US on time zones)

Updated Summary: The clear differences so far between the German outbreak strains and the similar EAEC genome Ec55989 are:

  • Stx2 phage, see below for alignment to the VT2 Sakai phage (details below)
  • IncI resistance plasmid including blaTEM and blaCTX-M, similar to those found in other E. coli and Shigella
  • An aggregative adherence fimbrial cluster (aggABCD) whose sequence was published in 1994 (U12894; ref in PubMed Central), mobilised by IS and present on a plasmid (details below)…

…So the consensus is emerging that the German outbreak strains are an enteroaggregative E. coli (EAEC, similar to Ec55989 causing diarrhea in children in Africa), which has acquired a Shiga toxin phage. Best evidence for this is David Studholme’s comparison of the novel genomes to all available E. coli genomes, which shows that Ec55989 is the most similar by gene content (sharing 96% of the genes from the outbreak strain), coupled with Konrad Paszkiewicz’s phylogenetic tree which shows that, within these genes, the sequences from the outbreak strains and Ec55989 are near-identical at the DNA level.

Nico Petty and I have been looking at what the novel genes are, and found what other people are reporting – that the EAEC has acquired the shiga toxin phage

STEC/EHEC outbreak – horizontally transferred genes

Posted June 7, 2011 by kat in Uncategorized.

In the German outbreak bacteria, as in most E. coli, plenty of horizontal transfer has gone on to create the genome we are now looking at…

…Firstly, as established by others’ work mapping reads and contigs to the available E. coli reference genome sequences, the chromosome of the outbreak strain is most similar to strain Ec55989, an enteroaggregative E. coli (EAEC) isolated in Africa over a decade ago [central circle in figure]. It shares with this strain part of the EAEC plasmid [55989p, top right] carrying aggregative adhesion operons aat, the regulator aggR and some other bits, but it has a different aggregative adhesion fimbrial complement (AAF/I) from Ec55989. It has also acquired the stx2 phage carrying shiga-toxin 2 genes stx2A, stx2B [top left]; a plasmid sharing high similarity with the IncI plasmid pEC_Bactec, including blaCTX-M and blaTEM-1 beta-lactamase (antibiotic resistance) genes [bottom left] and a lot of sequence similar to plasmid pCVM29188_101 from Salmonella enterica Kentucky [bottom left]. The circles represent the sequence of the plasmids and phage (previously sequenced and deposited in GenBank) that are most similar to sequences in the novel strain. The green rings indicate which parts of these reference sequences are also present in the novel German strain (via BLAST comparison with TY2482/MIRA contigs)….so nearly all of the Ec55989 chromosome and pEC_Bactec plasmid, and not quite all of the other phage & plasmid sequences.

Investigation of the parents to outbreak strain– main points at Biofortified

(Re post the above item)

Other work on the more immediate parental strain in Germany

06.06.2011 Investigations should deliver further indications as to the behavior of the current pathogen HUSEC041 (O104:H4)

New clues found in tracing the origin of the deadly E coli strain and an appeal for the sharing of additional data

2011-06-05 20:54:46 BGI website

… we are now tracing the history of the bacteria, as this latest analysis indicates that the two German strains (01-09591 originally isolated in 2001 and TY2482 from the 2011 outbreak) have identical profiles for all 12 virulence/fitness genes and 7 MLST housekeeping genes. However, at some point over this 10-year period the new 2011 outbreak strain seems to have developed the ability to resist many additional types of antibiotics. The latest data is now pointing to this candidate, as it now seems the African strain (strain 55989) is genetically more “distant” as the Shiga-toxin-producing gene and tellurite-resistance-genes were shown to be absent. (see  for our detailed comparison). The utility of so quickly sharing our initial data is further supported as the link to this original strain has already been independently verified by other groups: See also ColiScope where the sequence of strain 55989 was first displayed (, with option chromosome EC55989_EC55v2)

TY2482, LB226692 vs Genbank Ecoli


Whole genome phylogenies (03/06/2011)

Konrad Paszkiewicz, University of Exeter Sequencing Service khp204 at

Konrad's trees

The Pundit’s last words: Wow wow wow! That’s enough for a week-end.

Now for week 2 Updates

More data and a diagnostic test from BGI

BGI releases a complete de novo E. coli O104 genome assembly and is making their detection kit protocols and synthesized primers freely available to worldwide disease control and research agencies
2011-06-07 15:25:50

June 7, 2011, Shenzhen– Scientists worldwide have been working on the publicly available genomic sequences of the deadly E. coli O104 strain, which is causing the current health crisis in Germany and now spreading throughout Europe. To continue to speed the ongoing international efforts of researchers to assess and halt this growing epidemic, BGI and their collaborators at the University Medical Centre Hamburg-Eppendorf have now released their third version of the assembled genome, which includes new data from this E. coli O104. In addition, the FTP site contains a file that provides the PCR primer sequences BGI and their collaborators have used to create diagnostic kits for rapid identification of this highly infectious bacterium.

The new assembly includes more than 200x single-end reads from the Illumina HiSeq platform, which allowed BGI to provide a more complete genome map and to correct any assembly errors from the previous version. More importantly, this version is a completely de novo assembly, whereas the previous versions by BGI and others used a reference-based assembly method to obtain a consensus sequence. The new assembly continues to support the finding that this infectious strain carries disease-causing genes from two types of pathogenic E. coli: enteroaggregative E. coli (EAEC) and enterohemorrhagic E. coli (EHEC).

Taking advantage of this genomic feature, BGI and the Beijing Institute of Microbiology and Epidemiology researchers have developed a straightforward PCR diagnostic protocol for rapid identification of the outbreak strain. The diagnostic method consists of two pairs of amplification primers that target the enteroaggregative- and hemorrhagic-associated genes (a more detailed protocol is available on the BGI FTP site). Diagnostic results can be obtained within 2–3 hours after receiving the sample, and thus will be extremely useful for epidemic surveillance and detection of this bacterium.

BGI has assessed the specificity and sensitivity of this kit and protocol through computational analyses of 4,547 strains (from 2,183 species) using publicly available whole-genome sequences, and through experimental analyses of 323 DNA samples (from 93 species, including 55 E. coli strains that have different phenotypes and the current infectious strain). The findings demonstrated that the kit and protocol have high specificity: no bacterial strain other than E. coli O104 had positive amplification results for both target regions. Sensitivity testing indicated that the kit and protocol could detect this bacterium using a DNA concentration as low as ~1 picogram (10^-12 g) in the PCR. Additional validation tests on more patient isolates will be carried out within the week.
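To put the ~1 picogram figure in perspective, here is a rough back-of-the-envelope sketch in Python. It rests on two assumptions that are not part of the BGI release: the genome length is taken from the strain 55989 chromosome (5,154,862 bp) purely as a proxy for the outbreak strain, and one base pair of double-stranded DNA is taken to weigh roughly 650 g/mol, a common approximation.

```python
AVOGADRO = 6.02214076e23        # molecules per mole
GRAMS_PER_MOLE_PER_BP = 650.0   # assumption: average mass of one dsDNA base pair
GENOME_BP = 5_154_862           # assumption: strain 55989 chromosome used as a proxy


def genome_copies(mass_grams, genome_bp=GENOME_BP):
    """Estimate how many genome copies are contained in a given mass of DNA."""
    grams_per_genome = genome_bp * GRAMS_PER_MOLE_PER_BP / AVOGADRO
    return mass_grams / grams_per_genome


if __name__ == "__main__":
    copies = genome_copies(1e-12)  # 1 picogram of template DNA
    print(f"about {copies:.0f} genome copies in 1 pg")
```

Under these assumptions, 1 pg of template corresponds to something on the order of a couple of hundred genome copies. Plasmids and the exact genome size of the outbreak strain would shift the estimate slightly.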

NOTE: The complete test protocol for E. coli O104 is available at . Additionally, in a desire to help control the spread of this lethal bacterium, BGI and the Beijing Institute of Microbiology and Epidemiology will provide the designed and synthesized primers free to any disease control and research agency worldwide (contact ). They also continue to call for labs with isolates HUSEC041 (from Germany in 2001) and other 2011 outbreak isolates to understand how the lethal strain originated.

All files previously released on the FTP are available

Professional paper:

Escherichia coli EHEC Germany outbreak preliminary functional annotation using BG7 system

Nature Precedings, No. 713. (6 June 2011) doi:10.1038/npre.2011.6001.1 Key: citeulike:9388507
Crowdsourcing results are now collected at a GitHub wiki. The work is open source.
They include



More Gene-Jocky jabberwocky talk follows for the record :

  • Mike the Mad Biologist

I Don’t Think the German Outbreak E. coli Strain Is Novel: Something Very Similar Was Isolated Ten Years Ago…

Category: E. coli • Genomics

Posted on: June 3, 2011 8:10 AM, by Mike

…in Europe. I’ll get to that in a moment. You’ve probably heard of the E. coli outbreak sweeping through Germany and now other European countries that has caused over one thousand cases of hemolytic uremic syndrome (‘HUS’). What’s odd is that the initial reports are calling this a novel hybrid or some new strain of E. coli.

BGI has done some sequencing using Ion Torrent of one of these isolates, and Nick Loman assembled the data. Without getting too technical, the genome is actually in about 3,000 pieces, but with those data (and thanks to Nick for assembling them and releasing them) I was able to perform multilocus sequence typing (‘MLST’). Basically, we look at the partial sequences of several genes (in this case, seven) to identify its sequence type; think of it as a molecular barcode (for the scheme and details, see here).
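For readers who want to see the "molecular barcode" idea in code, here is a minimal Python sketch of an MLST lookup. Everything in it is illustrative. The locus names are those of a commonly used seven-gene E. coli scheme, which is assumed here rather than taken from Mike's post. The allele numbers and the "ST-A"/"ST-B" labels are made up, and the function names are hypothetical. Real typing aligns the sequenced gene fragments against a curated allele database first.

```python
from typing import Dict, Optional, Tuple

# Assumption: seven housekeeping loci of a commonly used E. coli MLST scheme.
LOCI = ("adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA")

# Made-up allele profiles: one allele number per locus, in LOCI order.
PROFILES: Dict[Tuple[int, ...], str] = {
    (1, 1, 1, 1, 1, 1, 1): "ST-A",
    (1, 2, 1, 3, 1, 1, 2): "ST-B",
}


def assign_st(alleles: Dict[str, int]) -> Optional[str]:
    """Return the sequence type whose profile matches at every locus, if any."""
    key = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(key)


def closest_st(alleles: Dict[str, int]) -> Tuple[str, int]:
    """Return the nearest known sequence type and the number of matching loci."""
    key = tuple(alleles[locus] for locus in LOCI)
    best = max(PROFILES, key=lambda p: sum(a == b for a, b in zip(p, key)))
    return PROFILES[best], sum(a == b for a, b in zip(best, key))


if __name__ == "__main__":
    isolate = dict(zip(LOCI, (1, 2, 1, 3, 1, 1, 9)))  # one locus differs from ST-B
    print(assign_st(isolate))   # no exact match
    print(closest_st(isolate))  # nearest profile and its match count
```

An isolate that matches a known profile at most but not all loci is what one would call "a very close relative" of that sequence type.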

So what did I find?

This EHEC strain is most likely a very close relative of ST678 (details in a bit). In fact, according to the strain database, there is a strain “Jan-91”, isolated in 2001* from Europe (no further geographic information is provided). That strain belongs to phylogroup D, and is associated with HUS…just like the outbreak strain. And the older strain also has the exact same serotype as the outbreak strain, O104:H4. ….(continues at Mike’s blog)

  • The work of Nick Loman :
Pathogens: Genes and Genomes

A heady mix of bacterial pathogenomics, next-generation sequencing, type-III secretion, bioinformatics and evolution!

EHEC Genome Assembly
By Nick Loman on June 2, 2011

BGI have released 5 runs of Ion Torrent data for the German EHEC/VTEC outbreak strain. I hope it is released with no specific restrictions on use for the benefit of the entire community, but the site doesn’t make that entirely clear. Thanks to the BGI for putting it up!

Shall we crowd source some analysis? This comes at a very timely moment as I am currently helping organise the Applied Bioinformatics & Public Health conference in Hinxton (#ABPH11), where we are discussing the use of whole-genome sequencing in epidemiology. The problem is I don’t have much time to dig into the data.
But I’ve put a first-pass de novo assembly up using MIRA here: 3,057 contigs, total bases: 5,491,032, N50 3,675. If you want the alignment files etc. get the big file here (282Mb).
Parameters are: mira --job=denovo,genome,accurate,iontor -GE:not=1
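The N50 quoted above is a standard summary of how fragmented an assembly is. As a small illustration (a sketch with a hypothetical function name and toy contig lengths, not the code MIRA itself uses), it can be computed from a list of contig lengths like this:

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig")


if __name__ == "__main__":
    # Toy lengths for illustration only, not the real assembly.
    print(n50([2, 3, 4, 5, 6]))
```

An N50 of 3,675 on a genome of roughly 5.5 Mb means that half of the assembled bases sit in contigs of 3,675 bp or longer, which is why the assembly is described as a first pass.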
Update 3/6/11 09:15 GMT+1

Marina Manrique has run the assembly through their BG7 bacterial genome annotation pipeline, results are here.

Torsten Seemann and Simon Gladman from the Victorian Bioinformatics Consortium have sent me the results of their in-house annotation pipeline. Results are available: contigs reordered according to E. coli EAEC 55989 and TWEC.

NCBI have also posted a preliminary assembly (of a different isolate – LB226692) – although it is not a true de novo assembly. The approach is a bit different. “Reads were mapped with TMAP against the publicly available E. coli 55989 chromosome (CU928145.2) and the derived consensus was split into contigs at zero-coverage regions. These contigs were used as a ‘backbone’ for mapping of reads, followed by de novo assembly of unmapped reads with the MIRA assembler (v 3.2.1). A small number of de novo and consensus contigs were merged using CAP3.”
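The "split into contigs at zero-coverage regions" step in that description is easy to picture with a toy example. The Python sketch below is only an illustration of the idea, with a hypothetical function name and made-up input. It is not TMAP or MIRA code, and it assumes the per-base read depth along the reference-derived consensus has already been computed.

```python
from typing import List


def split_at_zero_coverage(consensus: str, coverage: List[int]) -> List[str]:
    """Cut a consensus sequence wherever no read covers the position."""
    if len(consensus) != len(coverage):
        raise ValueError("consensus and coverage must have the same length")
    contigs: List[str] = []
    current: List[str] = []
    for base, depth in zip(consensus, coverage):
        if depth > 0:
            current.append(base)           # position supported by at least one read
        elif current:
            contigs.append("".join(current))  # a gap closes the current contig
            current = []
    if current:
        contigs.append("".join(current))
    return contigs


if __name__ == "__main__":
    # Toy input: the zero-depth positions split the consensus into three pieces.
    print(split_at_zero_coverage("ACGTACGT", [1, 2, 0, 0, 3, 1, 0, 5]))
```

Regions the outbreak strain has but the reference lacks never appear in such a consensus at all, which is why the quoted method follows up with a de novo assembly of the unmapped reads.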

Update 3/6/11 16:50 GMT+1
There are two O104 isolates sequenced from this outbreak now. The first, named TY2482, was done by BGI in collaboration with University Medical Centre Hamburg-Eppendorf, and the second, called LB226692, was done by Life Tech in-house in collaboration with the University of Muenster. So opportunities for comparison exist now.
In summary: TY2482 assembly (BGI reads, my assembly), LB226692 assembly (Life Tech reads, assembly).

Mike the Mad Biologist has looked at the TY2482 assembly and concludes it is ST678 (or closely related) which agrees with the original molecular typing release from the Robert Koch Institute.
I’ve heard from another group they are planning on sequencing another isolate. I am going to try and find a place where the latest information can be collated to aid in further crowd-sourcing analysis.

Update 3/6/11 19:50 GMT+1

BGI just released two more 314 chips worth of data and their own assembly of TY2482. I don’t have any details on program used or parameters just yet but I’ve enquired.
Who will take on the challenge of building a whole-genome phylogeny?

  • Era 7 Bioinformatics
Repost from Era 7
Era7 Bioinformatics annotation of the genome of the E Coli strain causing EU outbreak
We are annotating the genome sequences released by BGI on the 2nd of June 2011:
We are using the assembly published in the blog of Nick Loman:
The annotation has been done with the Era7 Bioinformatics BG7 pipeline for bacterial genome annotation. This method has been developed by the Era7 Bioinformatics R&D group Oh No Sequences!
Click here for a more detailed description of the method.
The people from Oh No Sequences! (Era7 Bioinformatics R&D group) involved in this work are: Marina Manrique, Pablo Pareja-Tobes, Eduardo Pareja-Tobes, Eduardo Pareja and Raquel Tobes as team leader.
We will update this page as soon as we find useful information or as new annotations based on new assemblies are released.
Version 1: June 3rd 2011:
We understand that the assembly obtained from
and the sequences obtained from are not restricted in any way.
The information we are publishing here and on other web pages and sites must be used for research activities only, and we do not guarantee the accuracy of this information.
The data published here are preliminary and may contain errors.
Era7 Information Technologies SLU provides these annotation data “as is” without any warranty, express or implied, including any warranty of merchantability or fitness for a particular purpose or use.
Era7 Information Technologies SLU assumes NO legal liability or responsibility for any purposes for which the data are used.
You can use the data from this draft annotation and information provided that you properly attribute the source and authors. Copyright Era7 Information Technologies SLU 2011
Creative Commons License
E Coli genome draft annotation by Era7 Bioinformatics (Era7 Information Technologies SLU) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Based on the work at

  • Genetic database entries describing EAEC E. coli strain 55989 and providing access to highly relevant papers about these pathogenic variants.

LOCUS       AF411067               12012 bp    DNA     linear   BCT 29-JUL-2002
DEFINITION  Escherichia coli strain 55989 plasmid pAA-like agg3 gene cluster, complete sequence.
VERSION     AF411067.1  GI:22001085
SOURCE      Escherichia coli 55989
ORGANISM  Escherichia coli 55989
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 12012)
AUTHORS   Bernier,C., Gounon,P. and Le Bouguenec,C.
TITLE     Identification of an aggregative adhesion fimbria (AAF) type III-encoding operon in enteroaggregative Escherichia coli as a sensitive probe for detecting the AAF-encoding operon family
JOURNAL   Infect. Immun. 70 (8), 4302-4311 (2002)
PUBMED   12117939

LOCUS       NC_011748            5154862 bp    DNA     circular BCT 15-MAY-2010
DEFINITION  Escherichia coli 55989, complete genome.
VERSION     NC_011748.1  GI:218693476
DBLINK      Project: 59383
SOURCE      Escherichia coli 55989
ORGANISM  Escherichia coli 55989
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 5154862)
AUTHORS   Touchon,M., Hoede,C., Tenaillon,O., Barbe,V., Baeriswyl,S., Bidet,P., Bingen,E., Bonacorsi,S., Bouchier,C., Bouvet,O., Calteau,A., Chiapello,H., Clermont,O., Cruveiller,S., Danchin,A., Diard,M., Dossat,C., Karoui,M.E., Frapy,E., Garry,L., Ghigo,J.M., Gilles,A.M., Johnson,J., Le Bouguenec,C., Lescat,M., Mangenot,S., Martinez-Jehanne,V., Matic,I., Nassif,X., Oztas,S., Petit,M.A., Pichon,C., Rouy,Z., Ruf,C.S., Schneider,D., Tourret,J., Vacherie,B., Vallenet,D., Medigue,C., Rocha,E.P. and Denamur,E.
TITLE     Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths
JOURNAL   PLoS Genet. 5 (1), E1000344 (2009)
PUBMED   19165319
REFERENCE   2  (bases 1 to 5154862)
LOCUS       NC_011752              72482 bp    DNA     circular BCT 16-APR-2010
DEFINITION  Escherichia coli 55989 plasmid 55989p, complete sequence.
VERSION     NC_011752.1  GI:218511148
DBLINK      Project: 33333
SOURCE      Escherichia coli 55989
ORGANISM  Escherichia coli 55989
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 72482)
AUTHORS   Touchon,M., Hoede,C., Tenaillon,O., Barbe,V., Baeriswyl,S., Bidet,P., Bingen,E., Bonacorsi,S., Bouchier,C., Bouvet,O., Calteau,A., Chiapello,H., Clermont,O., Cruveiller,S., Danchin,A.,Diard,M., Dossat,C., Karoui,M.E., Frapy,E., Garry,L., Ghigo,J.M., Gilles,A.M., Johnson,J., Le Bouguenec,C., Lescat,M., Mangenot,S., Martinez-Jehanne,V., Matic,I., Nassif,X., Oztas,S., Petit,M.A., Pichon,C., Rouy,Z., Ruf,C.S., Schneider,D., Tourret,J., Vacherie,B., Vallenet,D., Medigue,C., Rocha,E.P. and Denamur,E.
TITLE     Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths
JOURNAL   PLoS Genet. 5 (1), E1000344 (2009)
PUBMED   19165319
REFERENCE   2  (bases 1 to 72482)
CONSRTM   NCBI Genome Project
TITLE     Direct Submission
JOURNAL   Submitted (18-DEC-2008) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 72482)
AUTHORS   Genoscope -,C.E.A.
CONSRTM   Institut Pasteur and Genoscope
TITLE     Direct Submission
JOURNAL   Submitted (15-DEC-2008) Genoscope – Centre National de Sequencage :
BP 191 91006 EVRY cedex – FRANCE (E-mail :


  1. There are several things I’d like to know.
    How long does it take to set up an accurate, high-throughput system to detect a specific/novel strain of E. coli? The availability and use of such a testing regime argues in favor of a non-vegetable vector as the culprit. For instance, bottled water. I understand a ‘challenge dose’ of this bacterium is 100 (one hundred) bacteria, so an extremely attenuated population can be virulent.
    Doesn’t this incident provide support for ‘cold pasteurization’/irradiation? In turn, wouldn’t this cheap, effective food safety measure provide support for labeling of ground beef/raw vegetables etc., that hasn’t been irradiated, as ‘May Contain Live Fecal Bacteria’?
    What are the odds that someone could come up with a human vaccine against E. coli? Are there too many strains, too diverse, to make that possible? But also, doesn’t the fact that ‘friendly’ E. coli actually provides benefits in the human gut mean that a vaccine would be a bad idea?

  2. Detecting something “novel” would be real hard to do. If it is novel, you can’t look for a specific tracer via PCR or antibodies. There are zillions of novel things out there.
    Typically what is looked for are “coliforms” which cover a very large group of organisms, most of which are not disease causing. Everyone has about a pound of these in their gut and the specific ones in your gut are virtually always not causing any problems.

  3. Daedalus,
    I see your point about novelty — but it appears that in this case the novelty is now gone. Specific gene sequences have been identified as unique to this pathogen.
    So the question becomes: how fast can a lab ‘scale up’ a high-throughput test when the unique sequences are known? Week, month, year?

  4. Time and time again I try to correct students who describe a problem like this as being caused by E. coli. There are so many different types of this germ, and most are harmless. Let’s call them “this pathogenic strain of E. coli” or some such label.
    Yes, we depend on E. coli for some vitamins.

  5. Is it possible that the pathogen involved is not a chimera, but rather, different strains coexisting?

  6. No, not very likely.
    The bacteria are routinely cultured from single cells. E. coli consistently behaves as clonable cells. The genome sequence also consists of CONTIGuous stretches of DNA, so within a contig the sequence data themselves support a hybrid event. Mixtures would generate overlapping signals.

Comments are closed.