Databases#

What is a locus?#

A locus in the Kaptive sense refers to a biosynthetic gene cluster that is responsible for the synthesis of a bacterial surface polysaccharide, e.g. the Klebsiella pneumoniae K locus is responsible for the synthesis of the capsular polysaccharide, also known as the K antigen. Each locus in the Kaptive databases has been defined based on a unique set of genes, with the assumption that this encodes a unique polysaccharide structure. In many cases, these unique structures will result in unique immunological serotypes.

The gene translations (protein sequences) from each locus are compared by pairwise alignment, and must fall under a defined percent identity threshold to be considered ‘unique’. Some genes (such as the core assembly machinery) will be highly similar, however the genes responsible for the polysaccharide structural diversity are expected to be more variable. The specific identity thresholds vary across species. The thresholds corresponding to the databases distributed with Kaptive are as follows:

Species	Pairwise protein identity threshold
Klebsiella pneumoniae	82.5%
Acinetobacter baumannii	85%

Format#

Genbank file#

Kaptive stores databases in Genbank format consisting of unique loci each with a single record with the following requirements:

The source feature must contain a note qualifier which begins with a label such as K locus:. Whatever follows is used as the locus name reported in the Kaptive output. The label is automatically determined, and any consistent label ending in a colon will work. However, the user can specify exactly which label to use with --locus_label, if desired.
The source feature may optionally contain a note qualifier which begins with a label such as K type: that specifies the serotype (phenotype) associated with the locus (is known). In cases where only some loci are associated with known serotypes we recommend adding a note such as K type: unknown. If no type notes are specified for any loci, the Kaptive will list them as unknown in the output. (Kaptive v2.0+)
Any locus gene should be annotated as CDS features. All CDS features will be used and any other type of feature will be ignored.
If the gene has a name, it should be specified in a gene qualifier. This is not required for Kaptive to run, but if absent the gene will only be named using its numbered position in the locus and it will not be checked for any specific sequence variations relevant to phenotype prediction.

Example piece of input Genbank file:

source          1..23877
                /organism="Klebsiella pneumoniae"
                /mol_type="genomic DNA"
                /note="K locus: KL1"
                /note="K type: K1"
CDS             1..897
                /gene="galF"

Nomenclature#

In constructing the databases included with Kaptive, we have used the following nomenclature rules:

Loci are named after their respective antigen (K, O, or OC) followed by the letter L (which stands for Locus), which separates the label for the genotype from the phenotype (e.g. KL1 -> K1). These letters should be in upper case.
Loci are numbered, first, by their corresponding antigen, and second, in the order in which they were discovered. For example, Klebsiella K-loci 1-79 correspond to K-types 1-79. K-loci 101 and greater correspond to K-loci with unknown antigens in the order in which they were discovered. We intentionally started at 101 to leave room to assign phenotype-genotype pairs.
Locus genes are named in three parts delimited by an underscore (_):
1. The locus the gene belongs to, e.g. KL1_ for a gene in the KL1 locus.
2. The position of the gene in the locus, e.g. KL1_01 for the first gene in the KL1 locus.
3. The name of the gene as a three-letter italicized symbol written in lower case letters and usually suffixed with an italicized capital letter, e.g. KL1_01_galF for the galF gene in the KL1 locus. If the gene name is unknown, this part will be blank and the gene instead would be called KL1_01.

Note

Databases must follow this nomenclature system for distribution within Kaptive.

Phenotype logic#

Phenotype logic (previously called “special logic”) is a set of rules that Kaptive uses to predict the polysaccharide phenotype based on the genes it finds. This was initially implemented for the Klebsiella pneumoniae O locus, whereby additional genes outside of the locus are used to predict the O antigen (sub)type. This logic was extended to the A. baumannii K locus in Kaptive v2.0.2.

In Kaptive 3, we thought about how we could extend this given what we know about truncations or other sequence variations of specific genes in the locus and the impact on the phenotype. For example, in the Klebsiella pneumoniae K locus, we know that a truncation of the core initiating glycosyltransferase (wcaJ) results in a capsule-null phenotype.

The relevant sequence variations are detailed in the database logic files, each lablled with the same file prefix as its resppective locus database, and marked with the extension .logic. Each line consists of three tab-separated columns and represents a phenotype rule:

loci - the loci the rule applies to (or ALL if the rule applies to all loci in the database)
genes - the genes (and optional state) the rule applies to (or ALL if the rule applies to all genes in the locus)
phenotype - the resulting phenotype that appears in the Type column of the Kaptive tabular output, replacing the default phenotype i.e. the one specified in the locus genbank source identifier in the matching locus database.

Let’s look at an example of a logic file for the K. pneumoniae K locus:

loci	genes	phenotype
ALL	wcaJ,truncated	Capsule null
KL22	KL22_17,truncated	K37

In the first line, you can see that if wcaJ is truncated in any locus (selected with ALL), the phenotype will be predicted as ‘Capsule null’. Here, any gene with the name wcaJ will be considered, and the state of the gene is specified as truncated. In the last line, you can see that if KL22_17 (acetyl-transferase) is truncated in locus KL22, the phenotype is predicted as ‘K37’, the non-acetylated version of the K22 capsule.

Note

The gene name and state are delimited by a comma.

Note

The default phenotype is the “type” label in the Genbank record (e.g. K1).

Let’s look at an example that uses extra genes outside of the locus (from the K. pneumoniae O locus database):

loci	genes	phenotype
OL2α.1;OL2α.2;OL2α.3	orf8	O2αγ

Here, the first line states that if orf8 is present in a genome carrying any of the OL2α.1, OL2α.2 or OL2α.3 loci, the phenotype will be predicted as ‘O2αγ’.

Note

Each specific locus and gene is delimited by a semicolon.

Note

Default state is ‘presence’.

This logic is applied during the phenotype prediction step of typing and is reported in the Type column of the Kaptive tabular output.

Databases distributed with Kaptive#

Kaptive is distributed with databases for detection of Klebsiella pneumoniae species complex and Acinetobacter baumanii surface antigen synthesis loci in the reference_database directory, (see details below). You can also generate your own databases for use with Kaptive by following these guidelines.

The existing databases were developed and curated by Kelly Wyres (Klebsiella) and Johanna Kenyon (A. baumannii).

A third-party Kaptive database is available for Vibrio parahaemolyticus K and O loci, created by Aldert Zomer and team (see preprint). The database can be downloaded and used as input to command-line Kaptive, it is also available in the online tool Kaptive-Web along with our Klebsiella and A. baumannii databases.

We are always keen to expand the utility of Kaptive for the research community, so if you have created a database that you feel will be useful for others and you are willing to share this resource, please get in touch via the issues page or email.

Similarly, if you have identified new locus variants not currently in the existing databases, please let us know!

Klebsiella K locus databases#

The Klebsiella K locus primary reference database (Klebsiella_k_locus_primary_reference.gbk) comprises full-length (galF to ugd) annotated sequences for each distinct Klebsiella K locus, where available:

KL1 - KL77 correspond to the loci associated with each of the 77 serologically defined K-type references, for which the corresponding predicted serotypes are K1-K77, respectively.
KL101 and above are defined from DNA sequence data on the basis of gene content, and are not currently associated with any defined serotypes.

Note

Insertion sequences (IS) are excluded from this database since we assume that the ancestral sequence was likely IS-free and IS transposase genes are not specific to the K locus.

Synthetic IS-free K locus sequences were generated for K loci for which no naturally occurring IS-free variants have been identified to date.

Note

KL156-D1 is included in the primary reference database since no full-length version of this locus has been identified to date.

We recommend screening your data with the primary reference database first to find the best-matching K locus. If you have poor matches or are particularly interested in detecting variant loci you should try the variant database.

Warning

The variants database (Klebsiella_k_locus_variant_reference.gbk) has been retired as of v3.0.0b6 as it’s no longer actively maintained and results can be misleading without additional in depth analysis.

Database versions:

Kaptive releases v0.5.1 and below include the original Klebsiella K locus databases, as described in Wyres, K. et al. Microbial Genomics 2016.
Kaptive v0.6.0 and above include four novel primary Klebsiella K locus references defined on the basis of gene content (KL162-KL165) in Wyres et al. Genome Medicine 2020.
Kaptive v0.7.1 and above contain updated versions of the KL53 and KL126 loci (see table below for details). The updated KL126 locus sequence is described in McDougall, F. et al. Research in Microbiology 2021.
Kaptive v0.7.2 and above include a novel primary Klebsiella K locus reference defined on the basis of gene content (KL166), described in Le, MN. et al. Microbial Genomics 2022.
Kaptive v0.7.3 and above include four novel primary Klebsiella K locus references defined on the basis of gene content (KL167-KL170), described in Gorrie, C. et al. Nature Communications 2022.
Kaptive v2.0 and above include 16 novel primary Klebsiella K locus references defined on the basis of gene content (KL171-KL186) and described in Lam, M.M.C et al. Microbial Genomics 2022.

Changes to the Klebsiella K locus primary reference database:

Locus	Change	Reason	Date of change	Kaptive version no.
KL53	Annotation update: wcaJ changed to wbaP	Error in original annotation	21 July 2020	v 0.7.1
KL126	Sequence update: new sequence from isolate FF923 includes rmlBADC genes between gnd and ugd	Assembly scaffolding error in original sequence from isolate A-003-I-a-1	21 July 2020	v 0.7.1
KL37	Removed from the database	Locus is a deletion (atr) variant of KL22	22 March 2024	v 3.0.0

Klebsiella O locus database#

In Kaptive 3.1.0, we introduced new O-antigen nomenclature in the Klebsiella O locus database (Klebsiella_o_locus_primary_reference.gbk) along wth the publication of this review: O-antigen polysaccharides in Klebsiella pneumoniae: structures and molecular basis for antigenic diversity.

We have also summarised the O-antigen nomenclature update on the Wyres Lab website.

The Klebsiella O locus database (Klebsiella_o_locus_primary_reference.gbk) contains annotated sequences for 13 distinct Klebsiella O loci.

O locus classification requires some special logic, as the O1 and O2 serotypes are associated with the same loci and the distinction between O1 and each of the defined O2 subtypes (2α, 2β, 2γ) is determined by the presence/absence of ‘extra genes’ (gml2β and orf8) elsewhere in the chromosome as indicated in the table below. Kaptive therefore looks for these genes to predict antigen (sub)types.

Note

You can find information about the Klebsiella O locus database in Kaptive versions <3.1.0 here.

New serotype designation	Required genes/loci (implemented in Kaptive v.3.1+)	Prior Kaptive designation (v.2.0.8–v.3.0.0b6)	Prior Kaptive genes/loci (v.2.0.8–v.3.0.0b6)
O1αβ,2α	OL2α.(1/2/3), wbbYZ	O1ab	O1/O2v1, wbbYZ
O1α,2α	OL2α.(1/2/3), wbbY	O1a	O1/O2v1, wbbY
O1αβ,2β	OL2α.(1/2/3), gml2β, wbbYZ	O1ab	O1/O2v2, wbbYZ
O1α,2β	OL2α.(1/2/3), gml2β, wbbY	O1a	O1/O2v2, wbbY
O1αβ,2γ	OL2α.(1/2/3), orf8, wbbYZ	O1ab	O1/O2v3, wbbYZ
O2α	OL2α.(1/2/3)	O2a	O1/O2v1
O2β	OL2α.(1/2/3), gml2β	O2afg	O1/O2v2
O2αγ	OL2α.(1/2/3), orf8	O2a	O1/O2v3
O3α + O3β	OL3α/β	O3/O3a	O3/O3a
O3γ	OL3γ	O3b	O3b
O4	OL4	O4	O4
O5	OL5	O5	O5
O10	OL10	OL103	OL103
O11αβ,2α	OL2α.(1/2/3), wbmVWX	O2ac	O1/O2v1, wbmVWX
O11α,2α	OL2α.(1/2/3), wbmVW	O2ac	O1/O2v1, wbmVW
O11αβ,2β	OL2α.(1/2/3), gml2β, wbmVWX	O2ac	O1/O2v2, wbmVWX
O11α,2β	OL2α.(1/2/3), gml2β, wbmVW	O2ac	O1/O2v2, wbmVW
O11αβ,2γ	OL2α.(1/2/3), orf8, wbmVWX	O2ac	O1/O2v3, wbmVW
O12	OL12	O12	O12
O13	OL13	O13	OL13
O14	OL14	OL102	OL102
O15	OL15	OL104	OL104

Acinetobacter baunannii K and OC locus databases#

The A. baumannii K (capsule) locus reference database (Acinetobacter*baumannii*k*locus*primary_reference.gbk) contains annotated sequences for 241 distinct K loci.

The A. baumannii OC (lipooligosaccharide outer core) locus reference database (Acinetobacter*baumannii*OC*locus*primary_reference.gbk) contains annotated sequences for 22 distinct OC loci.

Warning

These databases have been developed and tested specifically for A. baumannii and may not be suitable for screening other Acinetobacter species. You can check that your assembly is a true A. baumannii by screening for the oxaAB gene e.g. using blastn.

Database versions:

Kaptive v0.7.0 and above include the original A. baumannii K and OC locus databases, as described in Wyres, KL. et al. Microbial Genomics 2020.
Kaptive v2.0.1 and above include 149 novel primary A. baumannii K locus references as described in Cahill, S.M. et al. 2022. An update to the database for Acinetobacter baumannii capsular polysaccharide locus typing extends the extensive and diverse repertoire of genes found at and outside the K locus. Microbial Genomics.
Kaptive v2.0.2 and above include special logic parameters that enable prediction of the capsule polysaccharide type based on KL or the detected combination of a specific KL with ‘extra genes’ elsewhere in the chromosome as indicated in the table below and described in Cahill, S.M. et al. 2022. An update to the database for A. baumannii capsular polysaccharide locus typing extends the extensive and diverse repertoire of genes found at and outside the K locus. Microbial Genomics.
Kaptive v2.0.5 and above includes a further 10 A. baumannii OC locus references (OCL13-OCL22) as described in Sorbello, B. et al. Identification of further variation at the lipooligosaccharide outer core locus in Acinetobacter baumannii genomes and extension of the OCL reference sequence database for Kaptive. In prep.

Database keywords#

When Kaptive is installed, it may be difficult to find the databases in the file system. However, each <database> argument in the Kaptive CLI accepts either a path to a Genbank file or a keyword that refers to a database distributed with Kaptive. The keywords are listed below.

Database	Keywords
Klebsiella pneumoniae K locus primary reference database	kpsc_k kp_k k_k
Klebsiella pneumoniae O locus primary reference database	kpsc_o kp_o k_o
Acinetobacter baumannii K locus primary reference database	ab_k
Acinetobacter baumannii OC locus primary reference database	ab_o

Extract#

Kaptive 3.0.0 and above includes a new command-line mode extract that allows you to extract features from a Kaptive database in the following formats:

fna: Locus nucleotide sequences in fasta format.
ffn: Gene nucleotide sequences in fasta format.
faa: Protein sequences in fasta format.

Usage#

General usage is as follows:

kaptive extract <db> [formats] [options]

Formats:

Note, text outputs accept '-' for stdout

--fna []         Convert to locus nucleotide sequences in fasta format
                 Accepts a single file or a directory (default: cwd)
--ffn []         Convert to locus gene nucleotide sequences in fasta format
                 Accepts a single file or a directory (default: cwd)
--faa []         Convert to locus gene protein sequences in fasta format
                 Accepts a single file or a directory (default: cwd)

Database options:

--locus-regex    Python regular-expression to match locus names in db source note
--type-regex     Python regular-expression to match locus types in db source note
--filter         Python regular-expression to select loci to include in the database

Note

These options are useful for customising the database to your needs, for example, to include only a subset of loci or to change the way locus names and types are parsed from the source note.

Other options:

-V, --verbose    Print debug messages to stderr
-v , --version   Show version number and exit
-h , --help      Show this help message and exit

For example, to extract the gene nucleotide sequences from the Klebsiella pneumoniae K locus primary reference database in fasta format, run:

kaptive extract kp_k --fna k_loci.fna

To extract all protein sequences from KL1 and KL2, run either one of the following:

kaptive extract kp_k --filter "^KL(1|2)$" --faa KL1_KL2_proteins.faa
kaptive extract kp_k --filter "^KL(1|2)$" --faa - > KL1_KL2_proteins.faa

To do the same but output each locus to a separate file, run either:

kaptive extract kp_k --filter "^KL(1|2)$" --faa
kaptive extract kp_k --filter "^KL(1|2)$" --faa protein_files/

Which would create two files: KL1.faa and KL2.faa.

kaptive assembly kpsc_k assembly.fasta -j kaptive_results.json

Warning

It is possible to write all text formats (fna, faa and ffn) to the same file (including stdout), however this is not recommended for downstream analysis.

Databases

Contents

Databases#

What is a locus?#

Format#

Genbank file#

Nomenclature#

Phenotype logic#

Databases distributed with Kaptive#

Klebsiella K locus databases#

Klebsiella O locus database#

Acinetobacter baunannii K and OC locus databases#

Database keywords#

Extract#

Usage#