Databases#
What is a locus?#
A locus in the Kaptive sense refers to a biosynthetic gene cluster that is responsible for the synthesis of a bacterial surface polysaccharide, e.g. the Klebsiella pneumoniae K locus is responsible for the synthesis of the capsular polysaccharide, also known as the K antigen. Each locus in the Kaptive databases has been defined based on a unique set of genes, with the assumption that this encodes a unique polysaccharide structure. In many cases, these unique structures will result in unique immunological serotypes.
The gene translations (protein sequences) from each locus are compared by pairwise alignment, and must fall under a defined percent identity threshold to be considered ‘unique’. Some genes (such as the core assembly machinery) will be highly similar; however, the genes responsible for the polysaccharide structural diversity are expected to be more variable. The specific identity thresholds vary across species. The thresholds corresponding to the databases distributed with Kaptive are as follows:
| Species | Pairwise protein identity threshold |
|---|---|
| Klebsiella pneumoniae | 82.5% |
| Acinetobacter baumannii | 85% |
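As a rough illustration of the threshold check described above, the sketch below computes percent identity for two protein sequences that are assumed to be already aligned (equal length, `-` for gaps) and compares it with the species threshold. The function names, the treatment of gaps as mismatches and the toy sequences are assumptions made for illustration; this is not the procedure used to curate the Kaptive databases.

```python
# Thresholds taken from the table above (percent identity).
THRESHOLDS = {
    "Klebsiella pneumoniae": 82.5,
    "Acinetobacter baumannii": 85.0,
}


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding identical residues.

    Assumption: gap columns ('-') count as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


def is_distinct(aligned_a: str, aligned_b: str, species: str) -> bool:
    """True if the pair falls under the identity threshold for the species."""
    return percent_identity(aligned_a, aligned_b) < THRESHOLDS[species]


# Toy sequences (hypothetical, not taken from any database).
print(is_distinct("MKTAYIAKQR", "MKSAYIGKQ-", "Klebsiella pneumoniae"))
```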
Format#
Genbank file#
Kaptive stores databases in Genbank format, with a single record for each unique locus. Each record must meet the following requirements:
The `source` feature must contain a `note` qualifier which begins with a label such as `K locus:`. Whatever follows is used as the locus name reported in the Kaptive output. The label is automatically determined, and any consistent label ending in a colon will work. However, the user can specify exactly which label to use with `--locus_label`, if desired.
The `source` feature may optionally contain a `note` qualifier which begins with a label such as `K type:` that specifies the serotype (phenotype) associated with the locus (if known). In cases where only some loci are associated with known serotypes we recommend adding a `note` such as `K type: unknown`. If no `type` notes are specified for any loci, Kaptive will list them as `unknown` in the output. (Kaptive v2.0+)
Any locus gene should be annotated as a `CDS` feature. All `CDS` features will be used and any other type of feature will be ignored.
If the gene has a name, it should be specified in a `gene` qualifier. This is not required for Kaptive to run, but if absent the gene will only be named using its numbered position in the locus and it will not be checked for any specific sequence variations relevant to phenotype prediction.
Example piece of input Genbank file:
source 1..23877
/organism="Klebsiella pneumoniae"
/mol_type="genomic DNA"
/note="K locus: KL1"
/note="K type: K1"
CDS 1..897
/gene="galF"
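To make the role of the two `note` qualifiers concrete, here is a minimal sketch that pulls a locus name and type out of a list of note strings like those in the example record. It assumes every label ends in the word `locus` or `type` followed by a colon, which is narrower than the automatic label detection described above. The function and pattern names are hypothetical and this is not Kaptive's own parser.

```python
import re

# Assumption: labels look like "<something> locus:" or "<something> type:".
NOTE_PATTERN = re.compile(r"^.+ (?P<kind>locus|type):\s*(?P<value>.+)$")


def parse_source_notes(notes):
    """Return (locus_name, type_name) from the /note values of a source feature.

    The type falls back to "unknown" when no type note is present.
    """
    locus_name, type_name = None, "unknown"
    for note in notes:
        match = NOTE_PATTERN.match(note.strip())
        if match is None:
            continue  # unrelated note, ignore it
        if match.group("kind") == "locus":
            locus_name = match.group("value")
        else:
            type_name = match.group("value")
    return locus_name, type_name


print(parse_source_notes(["K locus: KL1", "K type: K1"]))
```

With a full record, the note strings would come from whatever Genbank parser you use; the sketch takes them as plain strings so that it has no dependencies.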
Nomenclature#
In constructing the databases included with Kaptive, we have used the following nomenclature rules:
Loci are named after their respective antigen (K, O, or OC) followed by the letter L (which stands for Locus), which separates the label for the genotype from the phenotype (e.g. KL1 -> K1). These letters should be in upper case.
Loci are numbered, first, by their corresponding antigen, and second, in the order in which they were discovered. For example, Klebsiella K-loci 1-79 correspond to K-types 1-79. K-loci 101 and greater correspond to K-loci with unknown antigens in the order in which they were discovered. We intentionally started at 101 to leave room to assign phenotype-genotype pairs.
Locus genes are named in three parts delimited by an underscore (_):
The locus the gene belongs to, e.g. `KL1_` for a gene in the `KL1` locus.
The position of the gene in the locus, e.g. `KL1_01` for the first gene in the `KL1` locus.
The name of the gene as a three-letter italicized symbol written in lower case letters and usually suffixed with an italicized capital letter, e.g. `KL1_01_galF` for the galF gene in the `KL1` locus. If the gene name is unknown, this part will be blank and the gene instead would be called `KL1_01`.
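The three-part gene naming scheme above can be split apart mechanically. The following is a small sketch under the assumption that locus names never contain an underscore; `LocusGene` and `parse_gene_name` are hypothetical names used only for this example.

```python
from typing import NamedTuple, Optional


class LocusGene(NamedTuple):
    locus: str            # e.g. "KL1"
    position: int         # 1-based position of the gene in the locus
    gene: Optional[str]   # gene symbol, or None when the gene is unnamed


def parse_gene_name(name: str) -> LocusGene:
    """Split a name such as 'KL1_01_galF' into locus, position and gene symbol."""
    parts = name.split("_", 2)  # at most three parts
    if len(parts) < 2:
        raise ValueError(f"expected at least '<locus>_<position>': {name!r}")
    locus, position = parts[0], int(parts[1])
    gene = parts[2] if len(parts) == 3 and parts[2] else None
    return LocusGene(locus, position, gene)


print(parse_gene_name("KL1_01_galF"))
print(parse_gene_name("KL1_01"))
```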
Note
Databases must follow this nomenclature system for distribution within Kaptive.
Phenotype logic#
Phenotype logic (previously called “special logic”) is a set of rules that Kaptive uses to predict the polysaccharide phenotype based on the genes it finds. This was initially implemented for the Klebsiella pneumoniae O locus, whereby additional genes outside of the locus are used to predict the O antigen (sub)type. This logic was extended to the A. baumannii K locus in Kaptive v2.0.2.
In Kaptive 3, we thought about how we could extend this given what we know about truncations or other sequence variations of specific genes in the locus and the impact on the phenotype. For example, in the Klebsiella pneumoniae K locus, we know that a truncation of the core initiating glycosyltransferase (wcaJ) results in a capsule-null phenotype.
The relevant sequence variations are detailed in the database logic files, each labelled with the same file prefix as its respective locus database and marked with the extension `.logic`. Each line consists of three tab-separated columns and represents a phenotype rule:
loci - the loci the rule applies to (or ALL if the rule applies to all loci in the database)
genes - the genes (and optional state) the rule applies to (or ALL if the rule applies to all genes in the locus)
phenotype - the resulting phenotype that appears in the Type column of the Kaptive tabular output, replacing the default phenotype i.e. the one specified in the locus genbank source identifier in the matching locus database.
Let’s look at an example of a logic file for the K. pneumoniae K locus:
| loci | genes | phenotype |
|---|---|---|
| ALL | wcaJ,truncated | Capsule null |
| KL22 | KL22_17,truncated | K37 |
In the first line, you can see that if wcaJ is truncated in any locus (selected with ALL), the phenotype will be predicted as ‘Capsule null’. Here, any gene with the name wcaJ will be considered, and the state of the gene is specified as truncated. In the last line, you can see that if KL22_17 (acetyl-transferase) is truncated in locus KL22, the phenotype is predicted as ‘K37’, the non-acetylated version of the K22 capsule.
Note
The gene name and state are delimited by a comma.
Note
The default phenotype is the “type” label in the Genbank record (e.g. K1).
Let’s look at an example that uses extra genes outside of the locus (from the K. pneumoniae O locus database):
| loci | genes | phenotype |
|---|---|---|
| O1/O2v1;O1/O2v2;O1/O2v3 | wbbY | O1a |
| O1/O2v1;O1/O2v2;O1/O2v3 | wbbY;wbbZ | O1ab |
Here, the first line states that if wbbY is present in a genome carrying any of the O1/O2v1, O1/O2v2, or O1/O2v3 loci, the phenotype will be predicted as ‘O1a’. The second line states that if both wbbY and wbbZ are present in a genome carrying any of the same loci, the phenotype will instead be predicted as ‘O1ab’.
Note
Each specific locus and gene is delimited by a semicolon.
Note
Default state is ‘presence’.
This logic is applied during the phenotype prediction step of typing and is reported in the Type column of the Kaptive tabular output.
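The rule format described in this section can be sketched in a few lines of Python. The code below parses the two example rule sets shown above and applies them to a set of observed genes. It is an illustration only, not Kaptive's implementation, and it makes several assumptions that the text does not confirm: each observed gene carries exactly one state string, a rule gene matches either a full identifier or the trailing gene symbol, the `ALL` keyword in the genes column is not handled, and when several rules match the one with the most gene conditions wins (the first listed wins a tie).

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    loci: Optional[FrozenSet[str]]       # None stands for ALL
    genes: Tuple[Tuple[str, str], ...]   # (gene name, required state)
    phenotype: str


def parse_rule(line: str) -> Rule:
    """Parse one tab-separated line: loci, genes, phenotype."""
    loci_field, genes_field, phenotype = line.rstrip("\n").split("\t")
    loci = None if loci_field == "ALL" else frozenset(loci_field.split(";"))
    genes = []
    for item in genes_field.split(";"):
        name, _, state = item.partition(",")
        genes.append((name, state or "presence"))  # default state
    return Rule(loci, tuple(genes), phenotype)


def gene_matches(rule_gene: str, observed: str) -> bool:
    # Assumption: "wcaJ" matches "KL1_05_wcaJ"; "KL22_17" matches "KL22_17[_name]".
    return (observed == rule_gene
            or observed.startswith(rule_gene + "_")
            or observed.endswith("_" + rule_gene))


def predict(locus: str, default: str, observed: Dict[str, str],
            rules: List[Rule]) -> str:
    """Return the phenotype of the most specific satisfied rule, else the default."""
    best = None
    for rule in rules:
        if rule.loci is not None and locus not in rule.loci:
            continue
        satisfied = all(
            any(gene_matches(gene, name) and state == found
                for name, found in observed.items())
            for gene, state in rule.genes
        )
        if satisfied and (best is None or len(rule.genes) > len(best.genes)):
            best = rule
    return best.phenotype if best is not None else default


k_rules = [parse_rule(line) for line in (
    "ALL\twcaJ,truncated\tCapsule null",
    "KL22\tKL22_17,truncated\tK37",
)]
o_rules = [parse_rule(line) for line in (
    "O1/O2v1;O1/O2v2;O1/O2v3\twbbY\tO1a",
    "O1/O2v1;O1/O2v2;O1/O2v3\twbbY;wbbZ\tO1ab",
)]

print(predict("KL22", "K22", {"KL22_17": "truncated"}, k_rules))
print(predict("O1/O2v1", "O2a", {"wbbY": "presence", "wbbZ": "presence"}, o_rules))
```

The "most gene conditions wins" tie-break is what lets the `wbbY;wbbZ` rule override the `wbbY` rule when both genes are found; a real implementation may resolve overlapping rules differently.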
Databases distributed with Kaptive#
Kaptive is distributed with databases for detection of Klebsiella pneumoniae species complex and Acinetobacter baumannii surface antigen synthesis loci in the reference_database directory (see details below). You can also generate your own databases for use with Kaptive by following these guidelines.
The existing databases were developed and curated by Kelly Wyres (Klebsiella) and Johanna Kenyon (A. baumannii).
A third-party Kaptive database is available for Vibrio parahaemolyticus K and O loci, created by Aldert Zomer and team (see preprint). The database can be downloaded and used as input to command-line Kaptive; it is also available in the online tool Kaptive-Web along with our Klebsiella and A. baumannii databases.
We are always keen to expand the utility of Kaptive for the research community, so if you have created a database that you feel will be useful for others and you are willing to share this resource, please get in touch via the issues page or email.
Similarly, if you have identified new locus variants not currently in the existing databases, please let us know!
Klebsiella K locus databases#
The Klebsiella K locus primary reference database (`Klebsiella_k_locus_primary_reference.gbk`) comprises full-length (galF to ugd) annotated sequences for each distinct Klebsiella K locus, where available:
KL1 - KL77 correspond to the loci associated with each of the 77 serologically defined K-type references, for which the corresponding predicted serotypes are K1-K77, respectively.
KL101 and above are defined from DNA sequence data on the basis of gene content, and are not currently associated with any defined serotypes.
Note
Insertion sequences (IS) are excluded from this database since we assume that the ancestral sequence was likely IS-free and IS transposase genes are not specific to the K locus.
Synthetic IS-free K locus sequences were generated for K loci for which no naturally occurring IS-free variants have been identified to date.
Note
KL156-D1 is included in the primary reference database since no full-length version of this locus has been identified to date.
We recommend screening your data with the primary reference database first to find the best-matching K locus. If you have poor matches or are particularly interested in detecting variant loci you should try the variant database.
Warning
The variants database (`Klebsiella_k_locus_variant_reference.gbk`) has been retired as of v3.0.0b6, as it is no longer actively maintained and results can be misleading without additional in-depth analysis.
Database versions:
Kaptive releases v0.5.1 and below include the original Klebsiella K locus databases, as described in Wyres, K. et al. Microbial Genomics 2016.
Kaptive v0.6.0 and above include four novel primary Klebsiella K locus references defined on the basis of gene content (KL162-KL165) in Wyres et al. Genome Medicine 2020.
Kaptive v0.7.1 and above contain updated versions of the KL53 and KL126 loci (see table below for details). The updated KL126 locus sequence is described in McDougall, F. et al. Research in Microbiology 2021.
Kaptive v0.7.2 and above include a novel primary Klebsiella K locus reference defined on the basis of gene content (KL166), described in Le, MN. et al. Microbial Genomics 2022.
Kaptive v0.7.3 and above include four novel primary Klebsiella K locus references defined on the basis of gene content (KL167-KL170), described in Gorrie, C. et al. Nature Communications 2022.
Kaptive v2.0 and above include 16 novel primary Klebsiella K locus references defined on the basis of gene content (KL171-KL186) and described in Lam, M.M.C et al. Microbial Genomics 2022.
Changes to the Klebsiella K locus primary reference database:
| Locus | Change | Reason | Date of change | Kaptive version no. |
|---|---|---|---|---|
| KL53 | Annotation update: wcaJ changed to wbaP | Error in original annotation | 21 July 2020 | v0.7.1 |
| KL126 | Sequence update: new sequence from isolate FF923 includes rmlBADC genes between gnd and ugd | Assembly scaffolding error in original sequence from isolate A-003-I-a-1 | 21 July 2020 | v0.7.1 |
| KL37 | Removed from the database | Locus is a deletion (atr) variant of KL22 | 22 March 2024 | v3.0.0 |
Klebsiella O locus database#
The Klebsiella O locus database (`Klebsiella_o_locus_primary_reference.gbk`) contains annotated sequences for 13 distinct Klebsiella O loci.
O locus classification requires some special logic, as the O1 and O2 serotypes are associated with the same loci and the distinction between O1 and each of the four defined O2 subtypes (O2a, O2afg, O2ac, O2aeh) is determined by the presence/absence of ‘extra genes’ elsewhere in the chromosome as indicated in the table below. Kaptive therefore looks for these genes to predict antigen (sub)types. (Note that the original implementation of O locus typing in Kaptive (< v2.0) distinguished O1 and O2 but not the O2 subtypes.)
Read more about the O locus and its classification here: The diversity of *Klebsiella* pneumoniae surface polysaccharides.
Find out about the genetic determinants of O1 and O2 (sub)types here: Molecular basis for the structural diversity in serogroup O2-antigen polysaccharides in *Klebsiella pneumoniae*.
Find out about the O1 glycoforms and their genetic determinants here: Identification of a second glycoform of the clinically prevalent O1 antigen from *Klebsiella pneumoniae*.
Database versions:
Kaptive v0.4.0 and above include the original version of the Klebsiella O locus database, as described in Wick, R. et al. J Clin Microbiol 2019.
Kaptive v2.0 and above include a novel O locus reference (O1/O2v3) and updated ‘Extra genes’ for prediction of O1 and O2 antigen (sub)types, as shown in the table below and described in Lam, M.M.C et al. 2021. Microbial Genomics 2022.
Kaptive v2.0.8 and above include:
updated ‘Extra genes’ logic for prediction of O1 glycoforms, reported as O1a (isolate predicted to produce O1a only) and O1ab (isolate predicted to be able to produce both O1a and O1b glycoforms);
OL101 re-assigned as OL13 and its associated phenotype prediction updated to O13, to reflect the description of the novel O13 polysaccharide structure.
Genetic determinants of O1 and O2 outer LPS antigens as reported in Kaptive:
| O locus | Extra genes | Kaptive < v2.0 (locusᵃ) | Kaptive v2.0+ (locusᵃ) | Kaptive v2.0 - v2.0.7 (typeᵇ) | Kaptive v2.0.8+ (typeᵇ) |
|---|---|---|---|---|---|
| O1/O2v1 | none | O2v1 | O1/O2v1 | O2a | O2a |
| O1/O2v2 | none | O2v2 | O1/O2v2 | O2afg | O2afg |
| O1/O2v3 | none | Na | O1/O2v3 | O2a | O2a |
| O1/O2v1 | wbbYZ | O1v1 | O1/O2v1 | Na | O1ab |
| O1/O2v2 | wbbYZ | O1v2 | O1/O2v2 | Na | O1ab |
| O1/O2v3 | wbbYZ | Na | O1/O2v3 | Na | O1ab |
| O1/O2v1 | wbbY only | O1v1 | O1/O2v1 | O1 | O1a |
| O1/O2v2 | wbbY only | O1v2 | O1/O2v2 | O1 | O1a |
| O1/O2v3 | wbbY only | Na | O1/O2v3 | O1 | O1a |
| O1/O2v1 | wbbY OR wbbZ | O1/O2v1 | Na | Na | Na |
| O1/O2v2 | wbbY OR wbbZ | O1/O2v2 | Na | Na | Na |
| O1/O2v3 | wbbY OR wbbZ | Na | Na | Na | Na |
| O1/O2v1 | wbmVW | Na | O1/O2v1 | O2ac | O2ac |
| O1/O2v2 | wbmVW | Na | O1/O2v2 | O2ac | O2ac |
| O1/O2v3 | wbmVW | Na | O1/O2v3 | O2ac | O2ac |
| O1/O2v1 | gmlABD | Na | O1/O2v1 | O2aeh | O2aeh |
| O1/O2v2 | gmlABD | Na | O1/O2v2 | O2aeh | O2aeh |
| O1/O2v3 | gmlABD | Na | O1/O2v3 | O2aeh | O2aeh |
| O1/O2v1 | wbbY AND wbmVW | Na | O1/O2v1 | O1 (O2ac)ᵇ | O1 (O2ac)ᵇ |
| O1/O2v2 | wbbY AND wbmVW | Na | O1/O2v2 | O1 (O2ac)ᵇ | O1 (O2ac)ᵇ |
| O1/O2v3 | wbbY AND wbmVW | Na | O1/O2v3 | O1 (O2ac)ᵇ | O1 (O2ac)ᵇ |

ᵃ As reported in the ‘Best match locus’ column in the Kaptive output.
ᵇ Predicted antigenic serotype reported in the ‘Best match type’ column in the Kaptive output (v2.0 and above).
Na: not applicable.
Acinetobacter baumannii K and OC locus databases#
The A. baumannii K (capsule) locus reference database (`Acinetobacter_baumannii_k_locus_primary_reference.gbk`) contains annotated sequences for 241 distinct K loci.
The A. baumannii OC (lipooligosaccharide outer core) locus reference database (`Acinetobacter_baumannii_OC_locus_primary_reference.gbk`) contains annotated sequences for 22 distinct OC loci.
Warning
These databases have been developed and tested specifically for A. baumannii and may not be suitable for screening other Acinetobacter species. You can check that your assembly is a true A. baumannii by screening for the oxaAB gene e.g. using blastn.
Database versions:
Kaptive v0.7.0 and above include the original A. baumannii K and OC locus databases, as described in Wyres, KL. et al. Microbial Genomics 2020.
Kaptive v2.0.1 and above include 149 novel primary A. baumannii K locus references as described in Cahill, S.M. et al. 2022. An update to the database for Acinetobacter baumannii capsular polysaccharide locus typing extends the extensive and diverse repertoire of genes found at and outside the K locus. Microbial Genomics.
Kaptive v2.0.2 and above include special logic parameters that enable prediction of the capsule polysaccharide type based on KL or the detected combination of a specific KL with ‘extra genes’ elsewhere in the chromosome as indicated in the table below and described in Cahill, S.M. et al. 2022. An update to the database for A. baumannii capsular polysaccharide locus typing extends the extensive and diverse repertoire of genes found at and outside the K locus. Microbial Genomics.
Kaptive v2.0.5 and above include a further 10 A. baumannii OC locus references (OCL13-OCL22) as described in Sorbello, B. et al. Identification of further variation at the lipooligosaccharide outer core locus in Acinetobacter baumannii genomes and extension of the OCL reference sequence database for Kaptive. In prep.
Database keywords#
When Kaptive is installed, it may be difficult to find the databases in the file system. However, each `<database>` argument in the Kaptive CLI accepts either a path to a Genbank file or a keyword that refers to a database distributed with Kaptive. The keywords are listed below.
| Database | Keywords |
|---|---|
| Klebsiella pneumoniae K locus primary reference database | |
| Klebsiella pneumoniae K locus variant reference database | |
| Klebsiella pneumoniae O locus primary reference database | |
| Acinetobacter baumannii K locus primary reference database | |
| Acinetobacter baumannii OC locus primary reference database | |
Extract#
Kaptive 3.0.0 and above include a new command-line mode, `extract`, that allows you to extract features from a Kaptive database in the following formats:
fna: Locus nucleotide sequences in fasta format.
ffn: Gene nucleotide sequences in fasta format.
faa: Protein sequences in fasta format.
Usage#
General usage is as follows:
kaptive extract <db> [formats] [options]
Formats:
Note, text outputs accept '-' for stdout
--fna [] Convert to locus nucleotide sequences in fasta format
Accepts a single file or a directory (default: cwd)
--ffn [] Convert to locus gene nucleotide sequences in fasta format
Accepts a single file or a directory (default: cwd)
--faa [] Convert to locus gene protein sequences in fasta format
Accepts a single file or a directory (default: cwd)
Database options:
--locus-regex Python regular-expression to match locus names in db source note
--type-regex Python regular-expression to match locus types in db source note
--filter Python regular-expression to select loci to include in the database
Note
These options are useful for customising the database to your needs, for example, to include only a subset of loci or to change the way locus names and types are parsed from the source note.
Other options:
-V, --verbose Print debug messages to stderr
-v , --version Show version number and exit
-h , --help Show this help message and exit
For example, to extract the locus nucleotide sequences from the Klebsiella pneumoniae K locus primary reference database in fasta format, run:
kaptive extract kp_k --fna k_loci.fna
To extract all protein sequences from KL1 and KL2, run either one of the following:
kaptive extract kp_k --filter "^KL(1|2)$" --faa KL1_KL2_proteins.faa
kaptive extract kp_k --filter "^KL(1|2)$" --faa - > KL1_KL2_proteins.faa
To do the same but output each locus to a separate file, run either:
kaptive extract kp_k --filter "^KL(1|2)$" --faa
kaptive extract kp_k --filter "^KL(1|2)$" --faa protein_files/
Which would create two files: `KL1.faa` and `KL2.faa`.
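Since `--filter` takes a Python regular expression, it can help to check a pattern against a few locus names before running the extraction. The snippet below is a sketch using Python's `re` module. The list of locus names is a hypothetical sample, and whether Kaptive applies the pattern with `re.search` or `re.match` is not stated here; with the `^` and `$` anchors used in the examples the result is the same either way.

```python
import re

# Hypothetical sample of locus names to test a --filter pattern against.
locus_names = ["KL1", "KL2", "KL10", "KL21", "KL102"]

# Anchored pattern from the examples above: selects exactly KL1 and KL2.
anchored = re.compile(r"^KL(1|2)$")
selected = [name for name in locus_names if anchored.search(name)]
print(selected)

# Without the anchors, any name containing "KL1" or "KL2" is selected.
unanchored = re.compile(r"KL(1|2)")
loose = [name for name in locus_names if unanchored.search(name)]
print(loose)
```

The second pattern shows why the anchors matter: dropping them would also pull in KL10, KL21 and KL102.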
kaptive assembly kpsc_k assembly.fasta -j kaptive_results.json
Warning
It is possible to write all text formats (FNA, FAA and FFN) to the same file (including stdout), however this is not recommended for downstream analysis.