Introduction

Biodiversity of the marine zooplankton assemblage

The zooplankton assemblage of the global ocean exhibits exceptional levels of phylogenetic, taxonomic, and functional diversity, with described species spanning 15 phyla and 41 functional groups (orders or classes) of animals. The assemblage includes both holoplankton, which are planktonic throughout their life cycle, and meroplankton, which include planktonic forms of animals with both pelagic and benthic life stages. The diversity of forms is also classified by size, including mesozooplankton (0.2–20 mm), macrozooplankton (2–20 cm), and megazooplankton (20–200 cm) (Sieburth et al. 1978; Wiebe et al. 2017). In addition, immature (eggs and larvae) stages of meroplanktonic species may be even smaller. Estimates of the actual number of species, including cryptic and new species, range widely (Bucklin et al. 2010c). Currently, there are ~5700 described species of metazoan holoplankton, with an estimated additional ~1600 species yet to be discovered and/or described (Wiebe et al. 2010). The estimated number of species increases significantly, to nearly 28,000 total metazoan species (Lenz 2000), when meroplankton are included.

The marine zooplankton assemblage presents a number of challenges for species-level diversity analyses. Some taxonomic groups (e.g., Copepoda) include numerous groups of sibling species, which are difficult or impossible (i.e., cryptic) to discriminate based on morphological characteristics (Knowlton 2000; Lindsay et al. 2017; Snelgrove et al. 2017; Choquet et al. 2018). Many species exhibit broad biogeographical distributions spanning multiple ocean basins, some with genetic divergence among regional populations (Bucklin et al. 2010b; Peijnenburg and Goetze 2013; Kolbasova et al. 2020). The assemblage is characterized by high local-to-global ratios of biodiversity, especially in regions characterized as biodiversity hotspots (Tittensor et al. 2010; Snelgrove et al. 2017). In some regions of the ocean, a single sample may contain hundreds of species, and as many as 10% of the known species of a given taxon (McGowan and Walker 1979).

Marine zooplankton are considered rapid responders to climate change (Hays et al. 2005; Richardson 2008; Beaugrand et al. 2010). Species-specific responses to environmental variation and anthropogenic perturbations have been documented for many taxonomic groups, and changes in the demographic patterns and biogeographical range shifts of single species have been shown to have significant impact on functioning of pelagic food webs, carbon cycling, and ecosystem sustainability (Beaugrand 2005; Hay 2006; Beaugrand et al. 2010; Weydmann et al. 2014, 2018; Polyakov et al. 2020). Despite—and because of—the systematic and taxonomic complexity of the marine zooplankton assemblage, accurate and reliable discrimination and identification of species is critically needed.

DNA barcoding of marine zooplankton

The most frequently used gene region for species-level identification of marine zooplankton is a ~570 bp region of mitochondrial cytochrome oxidase I (COI; Hebert et al. 2003; Bucklin et al. 2011). Other gene regions used for species identification include the mitochondrial 16S ribosomal RNA (rRNA) gene, which is widely used for Cnidaria (Lindsay et al. 2015) and Copepoda (Bucklin et al. 1992; Lindeque et al. 1999; Goetze 2010), and a portion of the nuclear Internal Transcribed Spacer (ITS) region, which has been used for Ctenophora (Johansson et al. 2018).

Given the phylogenetic diversity of the assemblage, it is perhaps not surprising that the so-called “universal” PCR primers for the COI barcode region (Folmer et al. 1994) do not reliably amplify species of all groups of zooplankton. Subsequent efforts over the many years of DNA barcoding include group-specific primers (e.g., Bucklin et al. 2010a) and re-designed universal primers (Geller et al. 2013; Leray et al. 2013).

The primary goal of DNA barcoding using COI is discrimination and identification of species based on the non-overlapping frequency distributions of intra- and inter-specific levels of sequence divergence, known as the barcode gap (Meyer and Paulay 2005). The presence of a COI barcode gap is a reliable feature across most phyla and taxonomic groups of the marine mesozooplankton assemblage (Bucklin et al. 2011). A confounding issue is the significant COI sequence divergence of individuals of the same species, which may or may not be associated with geographic isolation or morphological differentiation (Peijnenburg and Goetze 2013). The challenge of accurate interpretation of the taxonomic significance of intraspecific variation of the COI barcode region for marine zooplankton led to permanent maintenance of associated archives with photographs of the living specimen, voucher DNA from the barcoded specimen, and voucher specimens preserved in both 95% ethyl alcohol for later genetic analysis and 4% formalin for morphological examination of soft tissues (Bucklin et al. 2010b). The significance and interpretation of genuinely cryptic variation—genetic divergence within a known or described species—continues to be a subject of debate.
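The barcode gap described above can be illustrated with a minimal sketch. It assumes sequences that are already aligned to equal length and uses the uncorrected proportion of differing sites (p-distance) as the divergence measure; published analyses often use model-corrected distances instead. The function names and the toy sequences are hypothetical and are not drawn from any barcode record.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguity code are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    sites = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def barcode_gap(records):
    """Summarize intra- and interspecific distances.

    records: list of (species_name, aligned_sequence) tuples.
    Returns (max_intraspecific, min_interspecific, gap_present), where a
    gap is present when the two distance distributions do not overlap.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    if not intra or not inter:
        raise ValueError("need both within- and between-species pairs")
    max_intra, min_inter = max(intra), min(inter)
    return max_intra, min_inter, min_inter > max_intra


# Toy data (invented for illustration only).
toy = [
    ("Species A", "ACGTACGTAC"),
    ("Species A", "ACGTACGTAT"),
    ("Species B", "TCGAACGAAC"),
    ("Species B", "TCGAACGAAT"),
]
print(barcode_gap(toy))
```

In this toy case the largest within-species distance is smaller than the smallest between-species distance, so a gap is reported. Real datasets with divergent intraspecific lineages, as discussed above, would report overlapping distributions.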

The COI barcode region can provide valuable insights into evolutionary processes, demographic history, and population genetic diversity, structure, and connectivity of a species. Many zooplankton species exhibit sufficient intraspecific DNA sequence variation in the COI barcode region to allow useful analysis of population genetic diversity and structure (e.g., Goodall-Copestake et al. 2010; Goetze et al. 2016; Choo et al. 2020). Phylogeographic analysis of zooplankton has revealed patterns and pathways of population connectivity over a variety of spatial scales (Questel et al. 2016; DeHart et al. 2020), and has revealed new insights into the evolution and demographic history of species (e.g., Peijnenburg et al. 2004; Blanco-Bercial et al. 2011; Aarbakke et al. 2014; Burridge et al. 2015; Goetze et al. 2015).

DNA metabarcoding of marine zooplankton

Zooplankton biodiversity is an essential component of ecosystem monitoring and assessment for many applications, from fisheries management to climate impacts (O’Brien et al. 2013). The need for rapid analysis of the pelagic assemblage has driven development and widespread use of DNA barcoding for analysis of complex environmental samples using metabarcoding (i.e., high-throughput DNA sequencing of target genes from environmental samples for analysis of taxon richness; Taberlet et al. 2012). Metabarcoding using hypervariable regions of the nuclear 18S ribosomal RNA (rRNA) gene has been used for detection of marine microbial diversity (Amaral-Zettler et al. 2009; DeVargas et al. 2015).

The taxonomic complexity of zooplankton samples makes metabarcoding a particularly useful and effective approach for analysis of biodiversity, but this also makes species-level identifications particularly challenging (Corell and Rodríguez-Ezpeleta 2014; Bucklin et al. 2016; Rey et al. 2020). Metabarcoding primers and protocols have been developed for analysis of marine zooplankton diversity using COI, usually targeting a shorter region within the usual barcode region (Leray et al. 2013; Cristescu 2014; Stefanni et al. 2018; Wangensteen et al. 2018; Suter et al. 2020). This application makes the COI barcode sequences from archived voucher specimens that have been accurately identified by taxonomic experts even more valuable (Raupach and Radulovici 2015; Bucklin et al. 2016). Studies have shown that species identification using DNA metabarcoding improves when reference sequence databases are specifically designed for the particular taxonomic groups and/or geographic regions of interest (Hirai et al. 2015; Lindsay et al. 2015; Questel et al. 2021). Best practices also include ensuring that digital and/or physical vouchers are linked to all COI reference sequences, to provide additional resources for confirmation, error checking, and future investigations based on broader sampling.

Need for a global COI reference database

A taxonomically complete, globally comprehensive COI reference sequence database for marine zooplankton is an essential foundation for widespread implementation of DNA barcoding and metabarcoding applications for ocean ecosystem monitoring and assessment. The COI barcode region is certain to remain an indispensable tool for accurate and reliable species-level identification of zooplankton, which is becoming increasingly necessary for fisheries management and environmental protection. The availability of DNA barcode sequences for accurately identified species with reliable collection metadata provides a resource for various multivariate statistical data explorations to relate the barcode data to environmental conditions and variability, including range shifts and other responses to climate change. A number of studies have demonstrated the wide range of important questions regarding marine diversity and ecosystem dynamics that can be addressed using fully georeferenced DNA barcode data (Bucklin et al. 2016; Rey et al. 2020). Public access to data and metadata for DNA barcode sequences of specimens that have been accurately identified to species is necessary to ensure species-level detection and analysis.

The MetaZooGene database (MZGdb; https://metazoogene.org/MZGdb) includes both holoplanktonic and meroplanktonic marine species, and currently includes barcode data for ~ 5600 species. The numbers of known species with barcodes will certainly continue to increase, as new species are discovered, described, and barcoded. The MZGdb has been designed to simplify targeted data searches and applications, which allow users to designate particular taxonomic groups and/or geographic regions of interest. One of the most important expected applications of the database is the capacity to map and visualize the geographic distributions of species observations (collection locations of identified specimens) and barcoding records (collection locations of specimens used for DNA sequencing) on global and regional scales. Another expected use of the MZGdb is the analysis of completeness of barcode records by taxon and region. This is an essential step that can identify missing barcode data for a given species by geographic region. This information can inform and prioritize continued efforts toward taxonomically complete and geographically comprehensive reference databases for marine zooplankton.

Projects focused on barcoding marine zooplankton

Census of Marine Zooplankton (CMarZ)

The Census of Marine Zooplankton (CMarZ; http://www.cmarz.org/) was a field project of the Census of Marine Life active during 2004–2010 that focused on species diversity of zooplankton throughout the global ocean. DNA barcoding was a primary goal for the program, and CMarZ established barcoding laboratories in the USA (University of Connecticut), Japan (University of Tokyo), and China (Institute of Oceanology, Chinese Academy of Sciences), with new capacity for barcoding in Germany (Alfred Wegener Institute for Polar and Marine Research) and India (Regional Centre of National Institute of Oceanography, Kochi). CMarZ was dedicated to “gold-standard” DNA barcoding, with prior morphological taxonomic identification of any specimen to species level, with archiving of associated specimen and photo vouchers (Fig. 1). When CMarZ ended in 2010, identified specimens had been sequenced for a selected barcode gene region, most usually the mitochondrial cytochrome oxidase I (mtCOI) gene, for 25% to 30% of the described species of marine holozooplankton (Bucklin et al. 2010c, 2011).

Fig. 1

Census of Marine Zooplankton online gallery of photo vouchers for living specimens of marine zooplankton identified at sea immediately upon collection by taxonomic experts. See http://www.cmarz.org/galleries.html for species identifications. Photos by R.R. Hopcroft and C. Clarke (University of Alaska Fairbanks) and L.P. Madin (Woods Hole Oceanographic Institution)

Norwegian Taxonomy Initiative

The Norwegian Taxonomy Initiative (NTI) was established in 2009 to improve knowledge about Norwegian biodiversity, with the ultimate goal of providing an inventory of all multicellular species occurring in Norway. The program, coordinated by the Norwegian Biodiversity Information Centre, funds survey and barcoding projects with special emphasis on poorly known taxa. Barcoding is conducted in collaboration with the Norwegian Barcode of Life (NorBOL), the local node of iBOL, and the resulting data are made available through the Barcode of Life Data Systems (http://www.boldsystems.org/). NTI has funded inventory and barcoding projects on several taxa that encompass marine zooplankton, resulting in over 570 vouchered COI sequences from the North Sea, Northeast Atlantic, and Arctic Ocean. Project COPCLAD (2015–2017) collected DNA samples of planktonic Copepoda and Cladocera, resulting in 253 barcode-compliant COI sequences for 64 species in Norwegian waters. The inventory of copepods has continued within HYPCOP (2020–2022), which focuses on copepods in the hyperbenthic marine habitats in Norwegian waters. Projects HYPNO (2015–2018) and NORHYDRO (2019–2022) have so far collected DNA samples with pictorial vouchers for 102 species of pelagic hydrozoans occurring in Norwegian waters, with 298 barcode COI sequences for 84 species. NTI has also funded projects on Ctenophora (GooseAlien, 2016–2020) and Amphipoda (NorAmph, 2016–2018; NorAmph2, 2019–2022).

Marine Barcode of Life (MarBOL)

The Marine Barcode of Life (MarBOL) project was a joint effort of the Consortium for the Barcode of Life and the Census of Marine Life, which sought to highlight the variety of applications of DNA barcodes, including accelerating species-level analysis of biodiversity and facilitating conservation efforts. Marine barcoding efforts have continued under the auspices of the International Barcode of Life, including new initiatives that seek to track species dynamics in marine ecosystems (Trivedi et al. 2016; Adamowicz et al. 2019).

Census of Antarctic Marine Life (CAML)

The Census of Antarctic Marine Life (CAML) was a project of the Census of Marine Life and was led by the Australian Antarctic Division and aimed at assessing the nature, distribution, and abundance of Southern Ocean biodiversity. Field work took place between 2005 and 2010, and was based on 18 major research voyages to Antarctica and the Southern Ocean, mainly during the International Polar Year, 2007–2008. A major legacy of CAML is the SCAR-MarBIN (Scientific Committee on Antarctic Research Marine Biodiversity Network) data portal (http://www.scarmarbin.be/), which serves data arising from CAML field projects, including the barcode sequences for Antarctic species, and creates related data storage, analysis, and visualization tools (DeBroyer and Danis 2011).

Arctic Ocean Diversity (ArcOD)

The Arctic Ocean Diversity (ArcOD; http://www.arcodiv.org/) project was one of the first efforts to synthesize basic biodiversity inventories and consolidate datasets in the Arctic; it also embraced the early promise of barcoding (Gradinger et al. 2010). ArcOD was led from the University of Alaska Fairbanks and was one of the Census of Marine Life field projects working in conjunction with the Barcode of Life Database (BOLD) during 2000–2010. DNA barcoding efforts resulted in published COI sequences for specimens of 41 of the more prominent species across 8 major zooplankton groups (Bucklin et al. 2010a). Over 400 benthic invertebrate taxa, many with meroplanktonic larval stages, were also barcoded (Hardy et al. 2011).

Barcodes of Marine Zooplankton in China (BoMZC)

A comprehensive multi-gene barcode database comprising key marine zooplankton species in Chinese coastal regions was the primary goal of the Barcodes of Marine Zooplankton in China (BoMZC) project. Zooplankton samples were collected from inshore regions along mainland China from Bohai Bay to the South China Sea (Fig. 2). When the project ended in 2018, COI barcodes had been generated for 462 species representing 10 major taxonomic groups of marine zooplankton.

Fig. 2

Cruise tracks showing regions sampled during the BoMZC program. Zooplankton samples were analyzed for morphological (microscopic) identification of species and DNA barcoding for COI and other gene regions

MetaZooGene Barcode Atlas and Database

Description and purpose

The MetaZooGene Barcode Atlas and Database (MZGdb; https://metazoogene.org/MZGdb) was created to provide advanced searching and reporting functions to the existing content of the GenBank and BOLD databases. For example, a zooplankton researcher may want to know which species of marine copepods are commonly found in the North Atlantic region, which of these already have COI barcodes, and which do not. This question cannot be answered with GenBank or BOLD, because these databases only contain the species names of already barcoded species (not comprehensive lists of all known species), and they do not provide searching options for geographic regions or marine/freshwater/terrestrial classifications. At best, they can return a list of species with COI barcodes that were collected anywhere in the world from any environment, including oceans, rivers, and freshwater lakes.
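The example question above reduces to comparing two species lists: one of species observed in a region, and one of species with at least one COI barcode. The following is a minimal sketch under that assumption; the function name and the placeholder species names are hypothetical, and this is not a description of how the MZGdb is implemented.

```python
def barcode_coverage(observed_species, barcoded_species):
    """Split a regional species list by barcode status.

    Returns (species_with_barcodes, species_without_barcodes, fraction),
    where fraction is the share of observed species that have a barcode.
    """
    observed = set(observed_species)
    barcoded = set(barcoded_species)
    with_barcode = sorted(observed & barcoded)
    without_barcode = sorted(observed - barcoded)
    fraction = len(with_barcode) / len(observed) if observed else 0.0
    return with_barcode, without_barcode, fraction


# Placeholder names, not real occurrence or barcode data.
regional_list = ["Species A", "Species B", "Species C", "Species D"]
has_barcode = ["Species B", "Species D", "Species Z"]
print(barcode_coverage(regional_list, has_barcode))
```

Species that have barcodes but are not on the regional list ("Species Z" here) are ignored, which mirrors the point that a list of barcoded species alone cannot answer a regional question.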

The MZGdb provides graphical and statistical summaries of barcoding taxonomic coverage (by class, family, and genus) and offers pre-compiled data files in a variety of common formats (e.g., paired taxonomic mapping and fasta sequence files formatted for the Mothur pipeline; https://mothur.org; Schloss et al. 2009) containing COI sequence data for selected ocean regions and taxonomic groups. This is achieved by combining sequence data from BOLD and GenBank, with species observation data from the Coastal and Oceanic Plankton Ecology, Production, and Observation Database (COPEPOD) and Ocean Biodiversity Information System (OBIS).
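As a rough illustration of what a paired fasta and taxonomy mapping looks like, the sketch below writes both from a single list of records. It follows the general mothur convention of a tab-separated sequence identifier and a semicolon-delimited lineage ending in a semicolon. The exact ranks, identifiers, and layout of the MZGdb files are not specified here, so the record structure, the function name, and the replacement of spaces by underscores are assumptions.

```python
def to_mothur(records):
    """Build paired fasta and taxonomy text from barcode records.

    records: iterable of (seq_id, lineage, sequence), where lineage is a
    list of taxon names ordered from highest to lowest rank.
    Returns (fasta_text, taxonomy_text) with matching sequence identifiers.
    """
    fasta_lines, taxonomy_lines = [], []
    for seq_id, lineage, sequence in records:
        fasta_lines.append(f">{seq_id}\n{sequence}")
        # Assumption: spaces in taxon names are replaced by underscores.
        ranks = [name.replace(" ", "_") for name in lineage]
        taxonomy_lines.append(f"{seq_id}\t{';'.join(ranks)};")
    return "\n".join(fasta_lines) + "\n", "\n".join(taxonomy_lines) + "\n"


# Illustrative record; the identifier and sequence are invented.
fasta_text, taxonomy_text = to_mothur(
    [("seq1", ["Animalia", "Arthropoda", "Copepoda", "Calanus finmarchicus"], "ACGT")]
)
print(fasta_text, taxonomy_text, sep="")
```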

The MZGdb provides information for 41 major taxonomic groups of marine zooplankton for the seven major oceans, as well as the Baltic and Mediterranean Seas. The MZGdb adds ancillary information and labeling to the records, and applies quality control to the GenBank and BOLD records to remove duplicates, mislabeled gene types, and possible errors (e.g., barcode data for a specimen collected in the Mediterranean Sea, but identified as a species found only in the Arctic Ocean).

Data sources and search methods

Data from NCBI GenBank

DNA barcode data and metadata are downloaded from NCBI GenBank in a two-step process. The first step uses the NCBI Entrez E-utilities (https://www.ncbi.nlm.nih.gov/books/NBK25499/) to request a listing of all GenBank Accessions (Leray et al. 2019) for a designated taxonomic group or associated with a particular author. Since “zooplankton” fall under multiple taxonomic groups, MZGdb queries 30 different taxonomic sub-groups to represent the zooplankton assemblage. Large groups (e.g., Crustacea) are divided into smaller groups to speed up processing. MZGdb updates these lists of accession numbers monthly, and downloads any new or modified accessions for addition to the MZGdb catalog.
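The first step, requesting a listing of accessions for a taxonomic group, could be sketched with the E-utilities ESearch endpoint as follows. The sketch only builds request URLs and makes no network calls. The search term, page size, and function names are assumptions for illustration; the actual queries used by MZGdb are not given in the text.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(taxon: str, retstart: int = 0, retmax: int = 500) -> str:
    """Return an ESearch URL listing nucleotide records for one taxon.

    Assumption: an "[Organism]" field search is sufficient; a real query
    would likely add further restrictions.
    """
    params = {
        "db": "nuccore",              # GenBank nucleotide records
        "term": f"{taxon}[Organism]",
        "idtype": "acc",              # return accession.version identifiers
        "retstart": retstart,         # offset of the first record in this page
        "retmax": retmax,             # number of records per page
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def page_starts(total: int, retmax: int = 500):
    """Offsets needed to page through `total` search results."""
    return list(range(0, total, retmax))


# Hypothetical subset of the taxonomic sub-groups queried.
for group in ["Copepoda", "Euphausiacea"]:
    print(build_esearch_url(group))
```

Splitting large groups into smaller ones, as described above, keeps each result set short enough to page through with a modest number of requests.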

There are several challenges associated with the MZGdb data gathering process described above. In its earliest years, NCBI GenBank did not require or consistently enforce standard naming or field assignment in its uploaded data. For example, the name of the sequenced gene region (e.g., “COI”) should be placed in the GenBank “/gene = ” field, but some older records instead include this in the “/product = ” or “/note = ” fields. This means those records will not be returned when using only a “/gene = COI” data search in GenBank. Furthermore, the exact gene naming text was not controlled, and the mitochondrial cytochrome oxidase I gene has been designated using various abbreviations, as preferred by the author (e.g., COI, CoI, CO1, COX1, CO-1, Co 1). Ignoring capitalization, 85% of the MZGdb-extracted GenBank records used “coi”, 11% used “cox1”, and 3% used “co1”. In all, 32 different text strings are used in GenBank to represent this one gene. A person searching with “/gene = COI” would only get 85% of the entire GenBank COI data collection, while a person searching with “/gene = COX1” would get only 11% of these data. Due to these multiple searching difficulties, MZGdb eventually stopped using the “/gene = ” and “/product = ” fields altogether, and simply searched for the presence of three words (i.e., “dna” + “cytochrome” + “oxidase”) anywhere in the GenBank record. This returns more than the desired COI data, so additional quality control checks were used to remove undesired results (e.g., “COI-like” sequences, “unverified sequences”, and/or ITS records that contain these three keywords).
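Both search strategies described above can be sketched in a few lines: normalizing the variable gene labels, and the broader three-keyword search followed by exclusion checks. The label set covers only the six example spellings listed in the text, not all 32 strings, and the exclusion patterns are hypothetical stand-ins for the quality control checks, whose exact rules are not given.

```python
import re

# Only the spellings named in the text; the full set of 32 is not listed there.
COI_LABELS = {"coi", "co1", "cox1"}

REQUIRED_WORDS = ("dna", "cytochrome", "oxidase")

# Hypothetical exclusion patterns standing in for the QC checks.
EXCLUDE_PATTERNS = (r"coi-like", r"unverified", r"internal transcribed spacer")


def normalize_gene_label(label: str):
    """Map a free-text gene label to "COI", or None if not recognized."""
    key = re.sub(r"[^a-z0-9]", "", label.lower())
    return "COI" if key in COI_LABELS else None


def looks_like_coi(record_text: str) -> bool:
    """Keyword test: all three words present and no exclusion pattern."""
    text = record_text.lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    if not all(word in words for word in REQUIRED_WORDS):
        return False
    return not any(re.search(pattern, text) for pattern in EXCLUDE_PATTERNS)


for label in ["COI", "CoI", "CO1", "COX1", "CO-1", "Co 1", "16S"]:
    print(label, "->", normalize_gene_label(label))
```

The keyword test deliberately ignores which field the words appear in, which is what allows it to recover older records that placed the gene name outside the “/gene = ” field.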

Data from BOLD

Data for MZGdb were also downloaded from the Barcode of Life Database (BOLD; Ratnasingham and Hebert 2007) using the BOLD API tool (http://www.boldsystems.org/index.php/resources/api). The BOLD database regularly synchronizes with GenBank, and there is significant duplication with GenBank records. These duplicated records contain GenBank Accession Numbers, which were checked against the GenBank downloaded entries and removed or added as necessary. GenBank records were given priority over BOLD records, because according to the BOLD handbook (https://v3.boldsystems.org/index.php/resources/handbook), all BOLD records are eventually submitted to GenBank. Any records unique to BOLD should therefore eventually be included in GenBank, and would then be removed as duplicates. After downloading respective GenBank and BOLD data to MZGdb, duplicated records from BOLD and GenBank are resolved (keeping the GenBank version in cases of duplication), and a final check is run to identify and remove any remaining non-COI records in the combined dataset (e.g., mRNA data and sequences with > 2000 bp).
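The merging logic described above can be sketched as follows: GenBank records are kept, BOLD records that carry an already-seen GenBank accession are dropped, and a final filter removes mRNA records and sequences longer than 2000 bp. The record structure, field names, and the decision to compare accessions without their version suffix are assumptions; the identifiers in the example are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LENGTH_BP = 2000  # sequences longer than this are treated as non-COI


@dataclass(frozen=True)
class BarcodeRecord:
    source: str                       # "GenBank" or "BOLD"
    record_id: str                    # accession or BOLD process identifier
    genbank_accession: Optional[str]  # None for records unique to BOLD
    molecule: str                     # e.g. "DNA" or "mRNA"
    sequence: str


def _bare_accession(accession: Optional[str]) -> Optional[str]:
    """Drop the version suffix (".1") so both sources compare equally."""
    return accession.split(".")[0] if accession else None


def merge_records(genbank, bold):
    """Combine both sources, keeping the GenBank copy of any duplicate."""
    seen = {_bare_accession(r.genbank_accession) for r in genbank}
    seen.discard(None)
    merged = list(genbank)
    for record in bold:
        if _bare_accession(record.genbank_accession) in seen:
            continue  # duplicate of a GenBank record
        merged.append(record)
    return [r for r in merged
            if r.molecule.upper() != "MRNA" and len(r.sequence) <= MAX_LENGTH_BP]


# Invented placeholder records.
genbank_toy = [
    BarcodeRecord("GenBank", "XX000001.1", "XX000001.1", "DNA", "ACGT" * 150),
    BarcodeRecord("GenBank", "XX000002.1", "XX000002.1", "mRNA", "ACGT" * 150),
]
bold_toy = [
    BarcodeRecord("BOLD", "TOY001-00", "XX000001", "DNA", "ACGT" * 150),
    BarcodeRecord("BOLD", "TOY002-00", None, "DNA", "ACGT" * 150),
    BarcodeRecord("BOLD", "TOY003-00", None, "DNA", "A" * 2001),
]
print([r.record_id for r in merge_records(genbank_toy, bold_toy)])
```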

Translation and verification of species names

The original species description provided in the GenBank/BOLD sequence record is validated against the World Register of Marine Species (WoRMS, https://www.marinespecies.org/). MZGdb stores the original description and also a secondary name field, which contains the official WoRMS spelling and taxonomic status (e.g., “accepted”, “unaccepted (synonym)”, “alternate representation”). In some cases, the original sequence species name contains non-taxonomic information (e.g., “Calanus sp. Sample-X” or “Calanus aff. helgolandicus”). In these cases, the species can only be matched in WoRMS to the genus level (e.g., “Calanus”), so the secondary name field will only show the genus name, but the original description field will retain the full original text.
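The fallback to genus-level matching can be illustrated with a simple heuristic that decides which name to submit for matching. This is a sketch only: the list of open-nomenclature qualifiers and the epithet pattern are assumptions, the function name is hypothetical, and the lookup against WoRMS itself is not shown.

```python
import re

# Assumed list of qualifiers that signal an uncertain identification.
QUALIFIERS = {"sp", "spp", "cf", "aff", "nr"}

# A species epithet is assumed to be lowercase letters, optionally hyphenated.
EPITHET = re.compile(r"^[a-z][a-z-]+$")


def name_for_matching(original: str):
    """Return (name, rank) to use when matching a record description.

    A clean binomial is matched at species level; anything else falls
    back to the genus, while the caller keeps the original text unchanged.
    """
    tokens = original.split()
    if not tokens:
        raise ValueError("empty species description")
    genus = tokens[0]
    if len(tokens) >= 2:
        second = tokens[1].rstrip(".")
        if second not in QUALIFIERS and EPITHET.match(second):
            return f"{genus} {second}", "species"
    return genus, "genus"


for text in ["Calanus finmarchicus", "Calanus sp. Sample-X", "Calanus aff. helgolandicus"]:
    print(text, "->", name_for_matching(text))
```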

Enhanced geographic indexing and searching

With hundreds of species and thousands of sequences in most ocean basins, a researcher may want to limit the sequence data to only those species relevant to their study area. Considering only the georeferenced barcodes found in GenBank and BOLD (Fig. 3, red stars) would give a misleading view of a species’ geographic coverage. This can be improved by adding observation data from the COPEPOD and OBIS databases (Fig. 3, blue dots), illustrating that the species of interest is indeed found in additional ocean regions, even though georeferenced barcodes do not exist from these areas.

Fig. 3

MZGdb Atlas map of the copepod, Neocalanus gracilis, showing collection sites of identified specimens based on COPEPOD/OBIS (blue dots) and specimens used for DNA barcode records available in GenBank or BOLD (red stars). Although N. gracilis has been collected in most oceans, georeferenced barcodes are available for only 2 locations in the Pacific Ocean. (See https://metazoogene.org/MZGdb)

Information on the locations of all prior collections and observations (species, latitude, longitude) is downloaded from COPEPOD and OBIS and used to calculate the geographic distribution and presence by ocean for all of the MZGdb species. If a species is found in an ocean in at least three different locations, it is marked as being found in that ocean. This minimum of three observations was set to allow cryptic or rare species to be registered, while seeking to exclude errors in data entry (e.g., a missing ± longitude value that might put an observation in the wrong ocean or hemisphere). These ocean assignments are then compiled into ocean-based species lists. Whereas the MZGdb has > 5600 zooplankton species globally, only ~ 4400 of these are assigned to the North Atlantic, and ~ 1200 are assigned to the Mediterranean Sea.
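The three-location rule can be sketched as a small aggregation over observation records. The sketch assumes that each observation has already been assigned to an ocean (how coordinates are mapped to ocean regions is not described here) and that two observations count as different locations when their coordinates differ after rounding to two decimal places; both choices, and the function name, are assumptions.

```python
from collections import defaultdict

MIN_LOCATIONS = 3  # threshold stated in the text


def oceans_of_presence(observations, min_locations: int = MIN_LOCATIONS):
    """Assign species to oceans where they have enough distinct locations.

    observations: iterable of (species, ocean, latitude, longitude).
    Returns {species: sorted list of oceans}; species that never reach the
    threshold in any ocean are omitted.
    """
    sites = defaultdict(set)
    for species, ocean, lat, lon in observations:
        # Repeated observations at the same rounded position count once.
        sites[(species, ocean)].add((round(lat, 2), round(lon, 2)))
    presence = defaultdict(set)
    for (species, ocean), locations in sites.items():
        if len(locations) >= min_locations:
            presence[species].add(ocean)
    return {species: sorted(oceans) for species, oceans in presence.items()}


# Placeholder observations, not real occurrence data.
toy_observations = [
    ("Species A", "North Atlantic", 40.0, -30.0),
    ("Species A", "North Atlantic", 41.0, -31.0),
    ("Species A", "North Atlantic", 42.0, -32.0),
    ("Species A", "North Atlantic", 40.0, -30.0),  # repeat of the first site
    ("Species A", "Arctic", 80.0, 0.0),            # single record, ignored
    ("Species B", "Mediterranean", 35.0, 18.0),
    ("Species B", "Mediterranean", 36.0, 19.0),
]
print(oceans_of_presence(toy_observations))
```

The single Arctic record in the toy data is excluded, which is the intended behaviour for isolated observations that may reflect a data entry error.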

A quality control measure currently in development is to use the NCBI GenBank Basic Local Alignment Search Tool (BLAST; Altschul et al. 1990; https://blast.ncbi.nlm.nih.gov/Blast.cgi) to compare all available sequences for a given species, and to evaluate sequence variation within the species and between closely related congeneric species (Boratyn et al. 2013). Findings of high levels of sequence differences and apparent mismatches, consistent with species misidentification, will be reported in a notes field in the MZGdb. While the original GenBank/BOLD species description would remain intact and not be changed, this additional information can be used to improve the reference utility of the sequence.

MZGdb data format

The MZGdb data format is abbreviated and does not contain all information from the original GenBank or BOLD records. The database is designed for particular applications (including file formatting for particular bioinformatics pipelines), and does not seek to fully replicate or reproduce the entire data content of the original entries, which are easily accessible via the linked original GenBank and/or BOLD records. Supplemental data fields have been added (e.g., MZG suggested name, QC notes, oceans-of-presence, marine/fresh, holoplankton/meroplankton).

A possible enhancement for MZGdb would be to add collection information missing in the GenBank or BOLD entries, especially georeferencing (latitude/longitude coordinates) and ideally date of collection. These additional metadata will allow evaluation of geographic distributions and identification of possible misidentification of species. In some cases, these metadata may be recovered from the GenBank record in the “/note = ” section and/or cited publications, and thus added to the MZGdb entry. In the future, additional data descriptors not provided by any source (WoRMS, OBIS, GenBank, or BOLD) may be selected for inclusion in the MZGdb.

Current status of DNA barcoding

The MetaZooGene Atlas and Database are designed to enhance access, improve quality control, and expand possible applications for COI barcodes of marine zooplankton. Both international programs and individual research projects have continued progress toward a taxonomically complete and geographically comprehensive reference database in recent years. The MZGdb searching and reporting functions, which yield COI barcode sequences and collection metadata, allow summary analysis of the current status of DNA barcoding for selected taxonomic groups and ocean regions, including assessment of completeness and identification of priorities for future efforts.

DNA barcoding by taxonomic group

Overview summaries are provided for selected taxonomic groups of marine zooplankton for which there has been significant recent progress in DNA barcoding. Several important groups of marine zooplankton are not included here (e.g., Ostracods, gammarid Amphipods, Mysids, Polychaetes), although barcode data for species of these groups are included in the MZGdb, and updated information is available through the publicly accessible websites.

Cnidaria

Cnidaria of the Class Anthozoa, which have planktonic larval stages, are known to have low COI sequence divergence between species (Shearer et al. 2002; Hellberg 2006). In contrast, medusozoan Cnidaria (Classes Hydrozoa, Scyphozoa, and Cubozoa) have COI distances between species more compatible with reliable discrimination and accurate identification of species (Huang et al. 2008; Ortman et al. 2010; Zheng et al. 2014; Lindsay et al. 2015). The mitochondrial genome in the Medusozoa is linear; in Cubozoa, it is broken into multiple chromosomes (Kayal et al. 2011, 2015). With the exception of the Subclass Trachylina, pelagic Hydrozoa possess two copies of COI (full and partial) at the ends of the linear mitochondrial chromosome (Kayal et al. 2011). Another issue is the presence of nuclear insertions of mitochondrial sequences (NUMTs), which have been confirmed among medusozoan Cnidaria in the benthic species, Hydra magnipapillata (Song et al. 2013). NUMTs can potentially introduce ambiguity into species identifications of planktonic Cnidaria, as discussed by Lindsay et al. (2015). Despite these complexities, COI has been analyzed for more than 100 species and has proven to be useful for species delimitation in Hydrozoa, including clades with multiple copies (e.g., Ortman et al. 2010; Watson and Govindarajan 2017; Lindsay et al. 2017; Li et al. 2018; Martell et al. 2018), and allows matching pelagic and benthic life history stages of one species (Pyataeva et al. 2016; Schuchert et al. 2017). However, COI primers universally targeting all hydrozoan clades have yet to be designed (Moura et al. 2018). The mitochondrial large ribosomal RNA subunit, mt16S rRNA, has frequently been used for Hydrozoa, since PCR amplification success is reported to be higher and the sequence provides more phylogenetic information (Zheng et al. 2014; Lindsay et al. 2015). Although the use of COI, often in parallel with mt16S rRNA, has become more commonplace since its promotion as the universal barcode locus, publicly available mt16S rRNA sequences for Hydrozoa are currently more numerous than COI sequences in GenBank (~ 5300 vs. 3500, respectively). For some taxa, COI may be the preferred barcode, since mt16S rRNA shows very low divergence between congeneric species for genera as phylogenetically diverse as Catablema (Order Anthoathecata) (Schuchert 2020).

It is apparent that misidentifications, due to the lack of taxonomic expertise and/or the presence of morphologically cryptic species, are frequent in public DNA barcode sequence repositories, and these records should be used with caution (Lindsay et al. 2015). Undamaged specimens of Cnidaria are required for species-level identification, and organisms identified as the same species have subsequently been assigned to different families when integrated studies on morphology and molecules were carried out (Lindsay et al. 2017). The multiple life stages of many Cnidaria have often been described and barcoded as separate species. Reference barcode sequences linked to vouchered specimens are needed to identify taxonomic errors and discrepancies, and to be able to confidently use DNA barcodes for species identification. Unfortunately, securing voucher specimens is not always straightforward, especially for gelatinous zooplankton, since specimens directly preserved in ethanol are generally unsuited for morphological work, while formalin, the preferred medium for fixation of morphological specimens, causes denaturation of DNA (Bucklin and Allen 2004). For the smaller Hydrozoa, photographic vouchers that carefully document key structures of specimens prior to fixation in ethanol may be the best available substitute for physical vouchers.

Ctenophora

Ctenophora remains a problematic phylum in terms of DNA barcoding and taxonomy in general. BOLD (https://www.boldsystems.org/) currently lists only 10 species of formally described planktonic Ctenophora as having COI barcodes (including data mined from GenBank), not all of which are publicly available, indicating a significant gap in terms of available reference sequences. There are also very limited data on intraspecific variation, with only two of these species, Bolinopsis infundibulum and Beroe ovata, listed with more than four barcode sequences > 500 bp. There is currently no consensus on a suitable barcoding locus for the phylum, as previous molecular studies have used 18S rRNA, 28S rRNA, the internal transcribed spacer regions (ITS1/2), and mitochondrial cytochrome b (CYTB), in addition to COI, with varying success (Podar et al. 2001; Bayha et al. 2004, 2015; Simion et al. 2015).

While a few recent papers successfully used COI for delimitation of species in the benthic ctenophore family Coeloplaniidae (Alamaru et al. 2017) and the genus Beroe (Johansson et al. 2018), obtaining COI sequences has been challenging for many species, with standard primers failing to amplify COI sequences due to high intraspecific nucleotide diversity (Schultz et al. 2020). Lack of sequenced mitochondrial genomes has complicated PCR primer design for ctenophores (Wang and Cheng 2019). To date, mitogenomes for 7 Ctenophora species have been sequenced: Mnemiopsis leidyi (Pett et al. 2011), Pleurobrachia bachei (Kohn et al. 2012), three benthic platyctenids (Arafat et al. 2018), Beroe cucumis (Wang and Cheng 2019) and B. forskalii (Schultz et al. 2020). The mitochondrial genomes of ctenophores have been shown to exhibit rapid evolutionary rates, and to be reduced and highly derived compared to other Metazoa (Pett et al. 2011; Kohn et al. 2012; Lavrov and Pett 2016; Arafat et al. 2018; Wang and Cheng 2019; Schultz et al. 2020).

The COI barcode region of Ctenophora has proven problematic, and the traditional taxonomy of the group also presents numerous challenges: there is a lack of taxonomic experts, identification literature is sparse, and the phylum surely includes considerable hidden diversity, with many undescribed species (Haddock 2004). Many ctenophore species are exceedingly fragile, and can only be successfully collected for morphological studies by divers or, in the case of deep-water species, by Remotely Operated Vehicles (ROVs). Preserving physical voucher specimens is nearly impossible, since most ctenophores will rapidly disintegrate in both ethanol and formalin (Adams et al. 1976), as well as other fixatives. Consequently, type material is rarely available for study (e.g., Gershwin et al. 2010). Detailed, ideally in situ, photographic or video documentation of live ctenophores, together with detailed descriptions and illustrations, provides the most feasible way to document morphological diversity.

Copepoda

Planktonic copepods are thought to be the most abundant metazoans on Earth and one of the most-studied taxonomic groups of marine zooplankton. As a group, copepods are both taxonomically and ecologically diverse; they frequently dominate zooplankton communities. Planktonic copepods currently comprise > 2600 species of > 340 genera, which are assigned to 8 orders: Calanoida, Platycopioida, Misophrioida, Mormonilloida, Cyclopoida, Siphonostomatoida, Harpacticoida, and Monstrilloida (Razouls et al. 2005–2020). The MZGdb includes 2402 species, counting only names with “accepted” taxonomic status in WoRMS. The Order Calanoida includes the highest number of species (955), followed by Harpacticoida (599), Cyclopoida (509), Siphonostomatoida (289), Canuelloida (22), Monstrilloida (19) and Mormonilloida (3 species).

Species identification and discrimination of copepods using DNA sequences have used both mt16S rRNA (Goetze 2003) and COI (Bucklin et al. 2010a, b; Laakmann et al. 2013; Blanco-Bercial et al. 2014). Analyses have used various PCR primers and protocols for COI (Folmer et al. 1994; Simon et al. 1994), including development of copepod-specific primers (Bucklin et al. 2010c). DNA barcoding has revealed hidden diversity and resolved cryptic species (Goetze 2003; Cornils and Held 2014; Goetze et al. 2016; Bode et al. 2017).

Copepods are one of the best-studied groups of marine zooplankton using integrative morphological and molecular approaches. COI barcode reference sequences are available for many, if not most, of the more abundant and/or ecologically important species from coastal ocean areas, as well as surface layers (epipelagic zone) of the open ocean. Additional effort is needed for taxonomically challenging taxa, including species of the genus Acartia (Figueroa et al. 2020) and family Paracalanidae (Cornils and Held 2014; Moon et al. 2010), as well as representatives of bathy- and abyssopelagic taxa, which remain under-sampled. Rapid progress is being made with COI barcoding of copepods. A comprehensive analysis reported 1381 COI barcode sequences for 195 marine copepods in 2014 (Blanco-Bercial et al. 2014), while the compilation of marine pelagic copepod barcodes in the MZGdb now includes 12,155 sequences for 752 species, or 31% of the total of 2401 valid copepod species (Fig. 4). The highest numbers of copepod species are recorded for the North Atlantic (975 species), Indian Ocean (911 species) and the North and South Pacific (783 and 625 species, respectively), while lower numbers are documented for the South Atlantic (388 species), the Southern Ocean (189 species) and the Arctic (161 species). Regarding the percentage of species for which DNA barcodes are present, the Arctic and South Atlantic show the highest proportions of barcoded species with > 2/3 and > 1/2 of the documented species, respectively. In the Indian Ocean, DNA barcodes exist for < 1/3 of the species. The order Calanoida has the highest species diversity and the most species with a COI barcode (405 species; 42% barcoded, Fig. 5). Siphonostomatoida make up the second largest group with respect to barcoded species (115 species, 39% barcoded), followed by the Harpacticoida (110 species, 18%), Cyclopoida (95 species, 18%), Monstrilloida (11 species, 57% barcoded), Canuelloida (3 species, 13%) and Mormonilloida (2 species, 66% barcoded; Fig. 5).
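The per-order percentages quoted above follow from dividing the number of barcoded species by the number of accepted species in each order. A minimal Python sketch of that arithmetic is given here; the species counts are the ones stated in the text, while the function and variable names are illustrative assumptions and not part of the MZGdb itself.

```python
# Species counts per copepod order as quoted in the text:
# (accepted species in the MZGdb, species with a COI barcode).
ORDER_COUNTS = {
    "Calanoida": (955, 405),
    "Siphonostomatoida": (289, 115),
    "Harpacticoida": (599, 110),
    "Cyclopoida": (509, 95),
    "Monstrilloida": (19, 11),
    "Canuelloida": (22, 3),
    "Mormonilloida": (3, 2),
}


def coverage_percent(total_species, barcoded_species):
    """Return the percentage of species that have at least one barcode."""
    if total_species <= 0:
        raise ValueError("total_species must be positive")
    if not 0 <= barcoded_species <= total_species:
        raise ValueError("barcoded_species must lie between 0 and total_species")
    return 100.0 * barcoded_species / total_species


def coverage_table(counts):
    """Map each order name to its barcode coverage in percent."""
    return {
        name: coverage_percent(total, barcoded)
        for name, (total, barcoded) in counts.items()
    }


if __name__ == "__main__":
    table = coverage_table(ORDER_COUNTS)
    # List orders from highest to lowest coverage.
    for name, pct in sorted(table.items(), key=lambda item: -item[1]):
        print(f"{name:18s} {pct:5.1f}%")
```

Truncating each computed value to a whole number reproduces the percentages given in the text, which suggests that the quoted figures were rounded down.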

Fig. 4

Copepod species with DNA barcodes. a Entire ocean; b–h by ocean region. Pie charts show proportions of barcoded species: total (left); by order (right)

Fig. 5

Maximum likelihood tree of COI barcodes from MZGdb. Numbers of barcoded species are shown in parentheses; colors reflect different copepod orders; green branches indicate taxonomic uncertainties or possible errors

Hyperiid amphipoda

Hyperiid amphipods are an exclusively holoplanktonic group of marine Crustacea, which serve as important prey in pelagic food webs for whales, planktivorous fish and seabirds, and as commensals and parasitoids of gelatinous zooplankton (Bocher et al. 2001; Harbison et al. 1977). Despite their conspicuous presence in zooplankton samples worldwide and their striking morphological adaptations to pelagic life, knowledge of the species diversity of hyperiids is very incomplete. The suborder Hyperiidea includes two infraorders, Physosomata and Physocephalata, which together contain ~ 300 species (Gasca et al. 2012; Hurt et al. 2013). The majority of species diversity is contained within the Physocephalata, with approximately 65% of extant species within the 23 families of this infraorder according to the World Register of Marine Species (WoRMS Editorial Board 2021). The MZGdb contains 643 COI barcodes representing 66 and 8 species of Physocephalata and Physosomata, respectively. However, geographic locations associated with DNA barcodes were accessible for only 16 species. The absence of collection information (latitude and longitude) in GenBank or BOLD database entries, despite inclusion in the cited publication (e.g., Hurt et al. 2013), is common, especially for less recent submissions. Among barcode records with collection locations, the majority of specimens were sampled from coastal regions, and many open ocean and deep-water species have not yet been added to reference databases.

Euphausiacea

The marine crustacean Order Euphausiacea, known as krill, includes 87 described species, many of which have biogeographic ranges spanning multiple ocean basins (Brinton et al. 1999). The MZGdb includes a total of 2567 COI barcode sequences for 66 species. The widespread distributions of many euphausiid species, with the associated possibilities of genetic differentiation of geographic populations and eventual speciation, present significant challenges for reliable species identification based on COI barcodes. The COI barcode region has proven to be a valuable tool for species identification and discrimination of euphausiid species, and also for evaluation of the taxonomic significance of variation among geographic populations (Bucklin et al. 2007). Wiebe et al. (2016) analyzed COI barcodes for specimens of Stylocheiron abbreviatum and S. affine collected in the Red Sea. Comparison with COI barcodes for these species across their multi-ocean distributions indicated that Red Sea populations of S. affine may represent a cryptic species, while COI showed high variability, but no significant divergence (or cladogenesis) for S. abbreviatum (Wiebe et al. 2016; Fig. 6).

Fig. 6

MetaZooGene Barcode Atlas maps for the euphausiids Stylocheiron abbreviatum (top) and S. affine (bottom). The MZGdb Atlas records collection locations for specimens that were morphologically identified (blue dots) and barcoded (red stars). See https://metazoogene.org/MZGdb

These case studies indicate the need for COI barcoding of all species of euphausiids, including analysis of samples collected throughout each species’ geographic range. A taxonomically complete and geographically comprehensive reference barcode database will provide an invaluable resource for accurate species identification, and detection of cryptic and novel species.

Gastropoda

Although most marine gastropods are benthic, and the majority of these have meroplanktonic larvae, two groups are holoplanktonic and represent independent colonizations of the pelagic zone. These include the Order Pteropoda and the Superfamily Pterotracheoidea (commonly referred to as heteropods) within the Order Littorinimorpha. Both groups are important members of the marine zooplankton assemblage and share the key characteristic of producing shells composed of aragonite, a naturally occurring form of calcium carbonate (CaCO3) that is more soluble than calcite. They are thus considered to be vulnerable to ocean acidification, but the deposition of their shells in marine sediments has also provided a unique and invaluable fossil record of the biodiversity of pelagic ecosystems. The fossil record makes these holoplanktonic gastropods particularly useful for studying evolutionary processes in metazoan plankton in the global ocean (Peijnenburg et al. 2020; Wall-Palmer et al. 2020).

Pteropods comprise three suborders: Euthecosomata, Pseudothecosomata, and Gymnosomata (Bouchet et al. 2017), with a total of 71, 23, and 54 extant species recorded, respectively, in WoRMS. DNA barcoding of described species of the fully-shelled Euthecosomata has usually used the COI barcode region (e.g., Jennings et al. 2010a; Corse et al. 2013; Burridge et al. 2017). The MZGdb currently contains 1719 sequences representing 50 pteropod species sampled worldwide, with the best sampling coverage in the Atlantic Ocean. However, the Pseudothecosomata, which are deeper-dwelling and less-frequently sampled, and the Gymnosomata, which lack shells as adults and are referred to as 'sea angels', remain poorly characterized and underrepresented in the database. Only 8 out of 19 genera of Gymnosomata and only 6 out of 23 species of Pseudothecosomata are represented, and species identifications are often doubtful. These groups should be a priority for future DNA barcoding efforts, although challenges include lack of specimen vouchers due to poor preservation in ethanol. Recent studies have demonstrated the power of integrated molecular and morphometric analyses for these groups (Burridge et al. 2015, 2019; Shimizu et al. 2018; Choo et al. 2020). Discovery of new species and increased estimates of global diversity can be expected, especially for species with widespread and multi-ocean distributions, which may be expected to reveal evidence of cryptic speciation.

The Superfamily Pterotracheoidea includes carnivorous holoplanktonic Gastropoda, with current records for a total of 38 recognized species. Although less abundant than the Pteropoda, the Pterotracheoidea are frequently sampled throughout all ocean basins, with distributions ranging from temperate to tropical regions (Wall-Palmer et al. 2018). The Carinariidae and Pterotracheidae, including a total of 14 species, are larger-bodied forms that are nearly lacking in barcode data, with only 14 COI sequences available for 4 species. In contrast, a taxonomically- and geographically-extensive COI reference dataset is available for the family Atlantidae, with 668 COI sequences, including all 24 species (Wall-Palmer et al. 2018, 2020). The COI barcode region has proven to be a reliable marker of species identification for the Atlantidae, and has also supported the discovery of new species and recognition of likely cryptic species (Wall-Palmer et al. 2018). Morphological identification of this group of Gastropoda relies upon analysis of their minute larval shells (see https://www.planktonic.org/) and loss of taxonomic expertise has contributed to the lack of recent research on this group.

Shelled holoplanktonic gastropods provide useful and reliable indicators of ongoing processes associated with global climate change, including ocean acidification, and also provide an evolutionary record of long-term climatic variation in the oceans. COI barcodes provide an invaluable tool for accurate and consistent identification of species. These COI barcodes and the global resource of the MZGdb will ensure that these groups can be reliably identified, recorded in reference databases, and included in a variety of future studies.

Chaetognatha

The Chaetognatha, commonly known as arrow worms, is a marine phylum containing ~ 150 species, of which the majority are holoplanktonic (Pierrot-Bults 2017). In pelagic waters, Chaetognatha are primary predators of copepods and generally make up a substantial proportion of biomass. Chaetognath species can be found from coastal waters to the open ocean, and from the surface to the deep sea, although quantitative data on distribution and abundance in the global ocean are scarce. Because of their simple body plan, lack of clear morphological characters, and soft body, Chaetognatha are not easily identified to species level, especially in preserved samples. For barcoding reference databases, therefore, specimens should be identified by taxonomic experts while still alive, and/or specimens preserved in 4% formaldehyde should be paired with voucher specimens for DNA barcoding.

There has been considerable debate about the systematics within the Chaetognatha, but the phylum is classified into two main orders: Aphragmophora and Phragmophora, for which WoRMS currently lists 181 and 127 accepted species names, respectively. However, many of these species have only been observed once and remain to be verified. The majority of species are contained in the family Sagittidae of the Aphragmophora, which comprises 12 genera and 174 species. The MZGdb contains 1117 COI barcodes representing 29 species. The primary geographic focus of studies of Chaetognatha is the Atlantic and Arctic Oceans, where 22 species of Aphragmophora (including 20 species of Sagittidae) have been collected. Hence, there remain large gaps in the reference database for Chaetognatha.

An additional complication is that previous studies examining mitochondrial DNA variation within and between Chaetognatha species have uncovered unusually high levels of diversity, often combined with significant population genetic structuring and potential cryptic speciation (Peijnenburg et al. 2004, 2006; Jennings et al. 2010b; Miyamoto et al. 2012; Kulagin et al. 2014). Analysis of entire mitochondrial genomes combined with nuclear genetic markers, however, led Marlétaz et al. (2017) to conclude that Chaetognatha have unusual patterns of mitochondrial evolution and extreme levels of mitochondrial diversity can be present within natural populations of single species. Hence, valid reference barcode databases for Chaetognatha species should include nuclear, as well as mitochondrial, genetic markers. Best practices also include that specimens preserved in formaldehyde, matched with voucher specimens used for barcoding, should be archived for future examination.

Pelagic tunicata

A large portion of the undiscovered biodiversity of marine mesozooplankton may lie within the pelagic Tunicata, primarily the Appendicularia (also called Larvacea) and the Thaliacea. Larvaceans are holoplanktonic tunicates comprising three families: Oikopleuridae, Fritillariidae, and Kowalevskiidae. To date, only 72 species have been described, with most of the diversity (38 species) within the Oikopleuridae. Accurate and detailed taxonomic characterization of larvaceans has proven difficult, if not impossible, due to damage to specimens during routine net sampling, which has markedly limited our understanding of the true biodiversity of these pelagic tunicates (Hopcroft 2005), despite our appreciation for their importance in the deep ocean (Robison et al. 2010). The Thaliacea currently contain 75 species within 7 families, and while most surface-dwelling species are relatively robust and well known, deeper-dwelling species have only begun to be discovered (Robison et al. 2005a, b).

Currently, COI barcodes exist for only 6 species of Larvacea (8% of the described diversity). This lack of COI barcodes is partly a consequence of the difficulty of accurately identifying imperfect or damaged specimens to species for COI barcoding. Additionally, standard PCR primers for the COI barcode region (Folmer et al. 1994; Geller et al. 2013) have proven to be unreliable in targeting and amplifying COI for Larvacea (Sherlock et al. 2017) and Thaliacea (Govindarajan et al. 2011). Progress has been made towards COI barcodes with the tunicate primer pair (Hirose et al. 2009), which has successfully amplified COI from the giant larvacean, Bathochordaeus mcnutti (Sherlock et al. 2017). In addition, nuclear 18S rRNA primers have been used successfully (Tsagkogeorga et al. 2009; Govindarajan et al. 2011). These taxonomic and molecular issues have led to pelagic tunicates being vastly underrepresented in DNA reference databases and hinder the ability of metabarcoding analyses to accurately characterize ecosystem diversity. Therefore, considerable attention from taxonomic experts and molecular ecologists is still needed to help expand and resolve tunicate biodiversity.

DNA barcoding by ocean region

Patterns of zooplankton diversity vary markedly among the many distinct and diverse geographic regions of the global ocean. Characterization of species diversity of the many taxonomic groups of marine zooplankton by ocean region is a key foundation for understanding the relationships to environmental conditions and recognizing the impacts of human activities, including climate change.

North Atlantic Ocean and regional seas

For many decades, marine research and fisheries management in the North Atlantic Ocean have been promoted by two intergovernmental organizations: the International Council for the Exploration of the Sea (ICES) and the Northwest Atlantic Fisheries Organization (NAFO). For more than 100 years, data on oceanography, plankton, and fish have been collected, allowing the identification and analysis of abiotic and biotic changes in the ICES/NAFO areas over this long time period (Fig. 7). In both areas, there are more than 62 monitoring sites, providing high spatial and temporal resolution zooplankton data based on 40 Continuous Plankton Recorder (CPR) standard areas (O’Brien et al. 2013). These long-term data give detailed information on the zooplankton species and the zooplankton community change over time in the eastern North Atlantic (Greve et al. 2004; Beaugrand 2005; Beaugrand et al. 2009, 2014; Pitois et al. 2009; Eloire et al. 2010) and western North Atlantic (Pershing et al. 2005; Kane 2007; Johnson et al. 2011).

Fig. 7

ICES and NAFO areas used for monitoring and ecosystem assessments, including sampling of zooplankton. Figure from Wiebe et al. (2012) used by permission

The MZGdb now gathers available information on zooplankton distribution, and also provides collection metadata and DNA barcode sequences for zooplankton throughout the ICES North Atlantic Region, with additional detail from adjacent ocean regions, including the Baltic and Mediterranean Seas. For the entire area, DNA barcodes exist for 32% of the > 8600 species, including 38% of > 3200 crustacean species and 29% of > 5300 non-crustacean species. Half of these DNA barcodes have been collected from specimens sampled in this area. Considering only the eastern oceanic part (which together with the Baltic Sea is defined as the ICES ecoregion), these proportions are somewhat higher.

The Copepoda, Amphipoda, and Decapoda are the groups with highest species numbers among crustacean zooplankton, comprising 2/3 of all species. The same is true for some non-crustacean zooplankton groups, including Polychaeta, Mollusca, and Cnidaria. A number of studies yielding DNA barcodes for zooplankton groups have focused on particular species (Bucklin et al. 2000; Castellani et al. 2011) or specific genera (Aarbakke et al. 2011, 2014; Cornils and Held 2014).

The North Sea is a particularly well-studied region within the ICES area, with boundaries defined by the English Channel to the southwest, the southern boundary of the Kattegatt to the east, and the Shetlands to the north. The North Sea zooplankton community consists of neritic species with a high seasonal proportion of meroplankton (Reid et al. 2003; Alheit et al. 2005) and an occasional and temporary influx of oceanic species. In total, > 3200 species have been recorded for the North Sea, of which > 1300 species (41%) have associated COI barcodes, with a smaller proportion of species (32%) having barcodes for specimens collected in the region. DNA barcodes for specimens collected from the North Sea are available for Copepoda (Laakmann et al. 2013; Cornils and Wend-Heckmann 2015) and other Crustacea (Raupach et al. 2015), as well as Cnidaria (Holst and Laakmann 2014; Laakmann and Holst 2014). Reliable identification of meroplankton, especially in coastal areas, requires a comprehensive COI reference database for both benthic and pelagic organisms, which is available for Crustacea (Raupach et al. 2015), Mollusca (Barco et al. 2016) and Echinodermata (Laakmann et al. 2016). Considering mero- and holozooplankton together, COI barcodes are available for ~ 50% of the > 900 species of Crustacea in the North Sea zooplankton assemblage, which is dominated by Amphipoda and Copepoda.

The Baltic Sea is a semi-enclosed, marginal sea of the North Atlantic Ocean, with a short geological history. It is also one of the largest brackish waterbodies on Earth, with salinity and temperature gradients decreasing from southwest to northeast. The zooplankton assemblage of the Baltic Sea includes marine, brackish and freshwater species (Schiewer 2008), so assigning certain species to this area in the MZGdb Atlas is challenging. Including zooplankton of all size classes and all hydrographic zones, the Baltic Sea assemblage comprises a total of 1199 species, with 1031 occurring in marine waters (the open Baltic) and 168 species in associated estuaries. No species are considered to be endemic to the region, although microzooplankton of the open Baltic are not well-studied (Ojaveer et al. 2010). The mesozooplankton from the region are relatively well known, based on monitoring programs coordinated by the Helsinki Commission (HELCOM; https://helcom.fi/) and research institutions in the Baltic countries (Hernroth and Ackefors 1979; Viitasalo et al. 1995; Dippner et al. 2001; Díaz-Gil et al. 2014; Musialik-Koszarowska et al. 2019). Despite knowledge of the morphological taxonomy of the zooplankton assemblage, relatively few COI barcodes are available for species based on specimens collected from regions of the Baltic Sea.

Arctic Ocean

The Arctic Ocean is described as having low pelagic biodiversity when compared to other major ocean basins of the global ocean, considering only holozooplankton (Kosobokova et al. 2011; Kosobokova 2012; Halsband et al. 2020). Approximately 300–350 species of holozooplankton have been described and/or recorded across the Arctic Ocean (Brodsky 1967; Kosobokova et al. 1998, 2011; Sirenko et al. 1996; Sirenko 2001; Ershova et al. 2015; Ershova and Kosobokova 2019), including Arctic resident species and expatriates from both North Pacific and North Atlantic Oceans. However, the Arctic Ocean is roughly half continental shelf and half basin (Wassmann et al. 2020; Bluhm et al. 2015), and within its vast shallow-water shelf limits, meroplankton make a huge seasonal contribution to diversity (Weydmann-Zwolicka et al. 2021). There are probably more species in this assemblage than holozooplankton, but meroplankton are not well-characterized. Among highly diverse benthic taxa, e.g., Polychaeta and Mollusca, which species have planktonic larvae is largely unknown. Including both holo- and meroplankton, the total diversity may reach 700–900 species in the Arctic Ocean.

Currently, 84% of the holozooplankton species reported from the Arctic Ocean have been barcoded, which is among the highest proportions among major ocean basins. Several large-scale programs have focused on COI barcoding of holozooplankton, including the Census of Marine Zooplankton (CMarZ; Bucklin et al. 2010b) and Arctic Ocean Diversity (ArcOD; Gradinger et al. 2010). Additional efforts have included Arctic samples in focused analysis of particular taxonomic groups (Hunt et al. 2010; Nigro et al. 2016; Questel et al. 2016, 2021; DeHart et al. 2020; Kolbasova et al. 2020). Extensive collections of zooplankton for taxonomic diversity and molecular studies were carried out in all four deep basins of the Arctic Ocean during 2005–2019. Concerted efforts by expert morphological taxonomists and geneticists have resulted in a COI barcode reference database specific to the Arctic Ocean, which has proven beneficial for obtaining accurate species-level identifications and enhancing the detection of ecosystem biodiversity in metabarcoding analyses (Questel et al. 2021).

The deep basins of the Arctic Ocean certainly harbor additional zooplankton species that await discovery and description, and eventual DNA barcoding. Much of what is known about Arctic zooplankton diversity has come from net-based collections during periods of low or no ice cover, due to difficulties in collecting during winter conditions, although a number of studies have also characterized the zooplankton assemblage during seasonal and multi-year ice cover (Mumm 1993; Harding 1966; Sirenko et al. 1996; Kosobokova et al. 1998, 2011; Auel and Hagen 2002; Kosobokova 2012; Questel et al. 2013; Smoot and Hopcroft 2017a, b). As the Arctic sea ice cover has degraded (Hanna et al. 2021), both the lesser geographic extent and reduction in sea ice thickness, as well as the increasing number of ice-free days, are allowing continued oceanographic exploration in previously inaccessible regions. With the added use of tools (e.g., ROVs) capable of video and photographic observations, with targeted organismal collections, as well as integrated morphological and molecular approaches to species identification, our understanding of the biodiversity of Arctic zooplankton is becoming more refined. This is particularly true for the soft-bodied gelatinous zooplankton groups (i.e., Ctenophora, Hydrozoa, Scyphozoa, and Larvacea), and organisms found in the deepest regions of the Arctic Ocean, including specific near-bottom habitats (Andronov and Kosobokova 2011; Aarbakke et al. 2017; Weydmann et al. 2017; Walczyńska et al. 2019; Kolbasova et al. 2020).

North Pacific Ocean

Continental shelf regions of the eastern North Pacific have been sampled regularly for many years, with collections from surface waters (top 200 m). Several long-running programs have carried out time-series monitoring efforts that include preserving zooplankton samples in alcohol for genetic analysis; these well-sampled areas include Sub-Arctic regions off Vancouver Island, northern Gulf of Alaska, and the Bering Sea, as well as throughout the California Current System. The COI barcoding efforts based on the resulting samples have focused on particular taxonomic groups (e.g., Questel et al. 2016; Nigro et al. 2016), although new efforts are focusing on community analyses using metabarcoding. Notably, only a few studies have sampled the deep sea; one study carried out routine sampling down to 1000 m at Ocean Station Papa, off the coast of British Columbia (Mackas et al. 1998). Additional studies focused on characterizing the calanoid copepod assemblage of the North Pacific, sampling from 1000–4000 m (Yamaguchi et al. 2002, 2015; Homma and Yamaguchi 2010; Homma et al. 2011).

Time-series monitoring is continuing along the Seward Line in the northern Gulf of Alaska and within the fjord ecosystem of Prince William Sound. These field programs are providing samples for integrated morphological and molecular analysis, which will yield COI barcodes for additional species of zooplankton and likely result in discovery of undescribed species and new records of species distributions in the North Pacific. The growing reference COI barcode database will be vital for population genetic studies and metabarcoding analyses to further characterize diversity of the zooplankton assemblage of the North Pacific Ocean.

In the western North Pacific Ocean, the Barcodes of Marine Zooplankton in China (BoMZC) project made major strides toward the goal of a comprehensive COI barcode reference database. BoMZC generated > 3700 barcodes for 462 zooplankton species based on > 150 samples collected during cruises in 2015–2019 (Fig. 2), including many stations in the deep ocean with net tows to 1000 m. This progress can be considered with reference to a Checklist of Marine Biota of China Seas, which recorded ~ 2000 species of holozooplankton in the western North Pacific Ocean (Liu 2008). There are still significant gaps in knowledge of zooplankton diversity in the region, especially south of the Changjiang River, China. During a 2020 biodiversity survey of zooplankton in the North China Sea based on metabarcoding of environmental DNA (eDNA), only 60% of the detected taxa could be assigned to species. Half of the unidentified COI sequences showed low similarity (< 85%) to any sequences in available reference databases, including NCBI GenBank. Continuing efforts are needed to fill the gaps and build a comprehensive and complete barcode database, especially for small-sized and gelatinous groups (e.g., Ctenophora and Tunicata) and for deep-sea communities, where preliminary studies have suggested the presence of cryptic species in different depth zones.

Southern Ocean

Knowledge of pelagic diversity differs among regions of the Southern Ocean, due primarily to logistical difficulties of sampling. The Antarctic Peninsula and continental shelf are better sampled and thus have more records of valid species. Relative numbers of zooplankton species with georeferenced COI barcode sequences in GenBank or BOLD differ markedly among taxonomic groups. The Southern Ocean presents challenges for designation of species diversity in the MZGdb, which currently indicates COI barcode records for Crustacea as follows: for planktonic Copepoda, 60 of 139 valid species of Calanoida and 9 of 32 species of Cyclopoida; 10 of 13 species of Euphausiacea; and all 8 species of Ostracoda. Barcode sequence records are also available for 2 of 5 species of Chaetognatha, 8 of 10 Pteropoda (Gastropoda), and 35 of 108 Tunicata species. For Cnidaria, 30 of 179 species of Hydrozoa and 3 species of Scyphozoa are barcoded. A number of studies have used integrative morphological and molecular methods to examine diversity in the Southern Ocean of selected zooplankton groups, including Copepoda (Bucklin and Frost 2009; Laakmann et al. 2012), Ostracoda (Nigro et al. 2016), Amphipoda (Havermans et al. 2011), Euphausiacea (Jarman et al. 2000; Goodall-Copestake et al. 2010), Pteropoda (Jennings et al. 2010a; Hunt et al. 2010; Sromek et al. 2015; Havermans et al. 2019), and Chaetognatha (Jennings et al. 2010b; Kulagin et al. 2014).

The existing COI barcodes provide a foundation for the goal of determining barcodes for all species of the Antarctic zooplankton assemblage (Cheng et al. 2013). Based on available COI barcode data, Deagle et al. (2017) found that metabarcoding increased the number of species identified and usually detected more species than microscopic analysis of samples collected during Continuous Plankton Recorder (CPR) Surveys along transects south of Tasmania, although the prevalence of DNA from large plankton (e.g., krill) sometimes masked the presence of smaller species (e.g., copepods).

Priorities for the future

Need for morphological taxonomy

A key issue to ensure continued progress toward realizing the promise of DNA barcoding as a universal tool for analysis of pelagic biodiversity is the continuing need for morphological taxonomic identifications by experts. Also key is the retention of the voucher specimens that have been identified, including those subsequently sequenced (see Cornils 2015), and photographs of the specimen before and/or after preservation, focusing on diagnostic morphological characteristics. Complete and careful coordination between morphological and molecular taxonomic approaches is essential to implement the best practices of “gold-standard” DNA barcoding or integrative taxonomy (Dayrat 2005). Support for the integrative taxonomic approach for DNA barcoding is widespread (e.g., DeSalle 2006; Will et al. 2005; Pinheiro et al. 2019), but not universal. This issue is particularly important for identification of specimens used for the development of DNA sequence reference databases, which can provide a solid foundation for accurate identification of species based solely on COI barcodes. DNA barcode sequence divergence “thresholds” or haplotype frequencies are frequently considered to provide evidence of cryptic speciation, but there is continuing and widespread concern about the use of DNA sequence data as a sole basis for the description of new species (e.g., DeSalle 2006).

Priorities for DNA barcoding and metabarcoding

The Cnidaria and Ctenophora are taxonomic groups that, as a whole, are often missed or ignored by routine plankton surveys using nets and morphological identification; molecular methods could provide increased data on their diversity and distributions (Hosia et al. 2017; Leduc et al. 2019; Schroeder et al. 2020). Molecular methods can also improve our understanding of the trophic role of gelatinous zooplankton. Whereas predation on gelatinous zooplankton often goes unnoticed by traditional diet studies, due to rapid digestion and lack of hard body parts, DNA-based methods can reliably confirm the presence of gelatinous prey in gut contents or feces of predators (Sousa et al. 2016; McInnes et al. 2017; Ayala et al. 2018; Hays et al. 2018). However, the required comprehensive and reliable reference libraries, as well as universal sequencing protocols applicable for these groups, are still wanting, in particular for Ctenophora.

Deep-sea communities are poorly known in most ocean regions. Specimens of most taxonomic groups, particularly the more fragile forms, do not survive collection in nets and trawls with hours-long deployments. ROVs have proven useful for assessing deep-sea zooplankton of diverse taxonomic groups (Raskoff et al. 2010; Robison et al. 2010; Hidaka et al. 2021), but are limited by the number of specimens collected during a single dive. Modified deployments, including nets with finer mesh and large capture devices, may improve the success of collection of fragile, especially gelatinous, zooplankton in suitable condition for both morphological and molecular analysis (Wiebe et al. 2010). Such advances may improve species inventories, allow descriptions of new species, and provide material suitable for DNA barcoding of species found in the deep ocean zooplankton assemblages.

DNA metabarcoding analysis of the diversity of marine zooplankton has been carried out using a variety of marker gene regions, including COI (Hirai et al. 2015; Yang et al. 2017; Djurhuus et al. 2018), nuclear 18S rRNA (Bucklin et al. 2019; Blanco-Bercial 2020), and frequently multiple gene regions used in combination (Sommer et al. 2017; Hirai et al. 2020; Questel et al. 2021). Classification of the resulting sequences for any gene requires a taxonomically-comprehensive and globally-extensive reference database of sequences determined from morphologically identified specimens, ideally with complete collection metadata (georeferencing). Efforts are currently underway to include additional genes used for metabarcoding analysis of marine zooplankton in the MetaZooGene database (Todd O’Brien, NOAA Fisheries, pers. comm.), providing data and tools for targeted searches by taxonomic groups and geographic regions, and thereby facilitating creation of custom databases for particular biodiversity research, assessment, and management needs.

Identifying and correcting errors and quality control

After the automatic curation of the MetaZooGene database, such as removing “COI-like” sequences, a phylogenetic analysis of the 11,000 sequences of Copepoda revealed that the database still includes sequences that are wrongly assigned to species (Fig. 5). These errors in species assignment are assumed to be either misidentifications, contaminated sequences, or pseudogenes. While misidentifications likely occur mostly within genera, contamination may yield barcodes that are identical or highly similar (> 97% similarity) to sequences of taxa ranging from different genera to different phyla. BLAST searches will identify some of these errors when the presumably erroneous sequences match sequences from other taxonomic groups. Errors can be curated and corrected manually for each record by taxonomic experts for each zooplankton group, but the source records (i.e., GenBank or BOLD submission) can only be modified by the original author.
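The contamination signature described above (near-identical barcodes assigned to different genera) can be screened for with a simple pairwise comparison. The following Python sketch assumes the sequences have already been aligned to equal length; the record identifiers and the short sequences are toy values made up for illustration (not real COI barcodes), the function names are hypothetical, and the 97% cutoff is taken from the text. It is not the curation procedure used for the MetaZooGene database, and a production workflow would more likely rely on BLAST.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of matching positions between two aligned sequences.

    Positions with a gap ('-') or an ambiguous base ('N') in either
    sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def flag_cross_genus_matches(records, threshold=97.0):
    """Return (id1, id2, identity) for record pairs from different genera
    whose identity exceeds the threshold."""
    flagged = []
    for (id1, sp1, s1), (id2, sp2, s2) in combinations(records, 2):
        if sp1.split()[0] != sp2.split()[0]:  # genus is the first word
            identity = percent_identity(s1, s2)
            if identity > threshold:
                flagged.append((id1, id2, identity))
    return flagged


# Toy aligned sequences (hypothetical, 40 bases each).
s_a = "ATGGCAGGTATAGTAGGAACTGCTTTAAGAATACTAATTC"
s_b = s_a[:-1] + "T"   # one substitution relative to s_a
s_c = s_a              # identical to s_a but labelled as another genus
s_d = "ATGGCTGGAATGGTTGGGACAGCCCTTAGTCTTTTGATCC"

records = [
    ("rec_A", "Acartia tonsa", s_a),
    ("rec_B", "Acartia tonsa", s_b),
    ("rec_C", "Calanus finmarchicus", s_c),
    ("rec_D", "Calanus finmarchicus", s_d),
]

flagged = flag_cross_genus_matches(records)
for id1, id2, identity in flagged:
    print(f"{id1} vs {id2}: {identity:.1f}% identity across genera")
```

In this toy set, rec_C is flagged against both Acartia records, which would mark it for review by a taxonomic expert rather than for automatic removal.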

One would expect that the sequences of one species follow a unimodal distribution. In the Copepoda phylogenetic tree, however, we identified species with two or more independent clades. This bi- or multi-modal sequence distribution of species is mainly found in taxonomically challenging taxa, such as Acartiidae or Paracalanidae. Until the taxonomic uncertainties associated with possible cryptic species are examined using integrative morphological taxonomic and molecular genetic approaches, these uncertainties cannot be resolved.

Taxonomic name changes that occur after sequences are published in GenBank are not recorded, unless the authors of the record submit requests for name changes. These new species assignments can only be identified by review of published results (see Cornils and Held 2014). Analysis of intraspecific variation of the COI barcode sequences by Locatelli et al. (2020) revealed that more than half of the species they investigated had bi- or multi-modal distributions, which could be traced back to hybridization, misidentification, and contamination. In the future, species exhibiting such distributions in COI sequence variation could be indicated in the MZGdb and used as an indicator of quality of the sequences, to be considered for various uses of barcode data, including diversity estimates based on metabarcoding.
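One possible way to flag species whose sequences fall into two or more groups is single-linkage clustering of the pairwise distances among sequences assigned to the same species name. The Python sketch below is a heuristic under stated assumptions: the sequence identifiers and distances are hypothetical, the 3% linking threshold is an illustrative choice rather than a value from the studies cited, and the approach is a stand-in for, not a reproduction of, the phylogenetic analysis described above.

```python
def single_linkage_clusters(ids, distances, threshold=0.03):
    """Group sequences by linking any pair whose distance is <= threshold.

    ids: list of sequence identifiers.
    distances: {(id_a, id_b): proportional distance} for each pair.
    Returns a sorted list of clusters, each a list of identifiers.
    """
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical pairwise distances for five sequences carrying one species name.
ids = ["s1", "s2", "s3", "s4", "s5"]
distances = {
    ("s1", "s2"): 0.004, ("s1", "s3"): 0.010, ("s2", "s3"): 0.008,
    ("s4", "s5"): 0.006,
    ("s1", "s4"): 0.150, ("s1", "s5"): 0.160, ("s2", "s4"): 0.155,
    ("s2", "s5"): 0.165, ("s3", "s4"): 0.150, ("s3", "s5"): 0.170,
}

clusters = single_linkage_clusters(ids, distances)
needs_review = len(clusters) > 1
print(clusters, "multi-modal:", needs_review)
```

A species name that resolves into more than one cluster is only a candidate for review; the cause (cryptic species, misidentification, or contamination) still has to be determined by integrative taxonomic analysis.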

Preventing errors: data entry quality control

Consistent with the goal of creating accurate and reliable DNA sequence reference databases, a best practice is to require that all COI barcodes submitted to public data repositories be determined from specimens for which the species was identified by a qualified taxonomist, based on criteria defining appropriate background and experience. The name of the morphological taxonomist would then be identified in a metadata field in the barcode record.

An important best practice for DNA barcoding is the permanent retention of voucher specimens, which should be accessible to researchers based upon documented requests. One option is to ensure archival storage in academic institutions and public museums, which may ensure long-term preservation. These specimens are valuable for confirmation of species identification, but even more for material for future analyses, including sequencing of additional barcoding genes and—eventually—entire genomes.

In addition to permanent archiving of specimen vouchers, another best practice is photographing living specimens prior to preservation and barcoding. Assuming sufficient resolution and appropriate selection of perspective (i.e., inclusion of diagnostic characteristics for the species), the photo vouchers can be used for confirmation of species identification. For some taxa, including gelatinous groups that are frequently damaged upon collection and do not preserve well, photo vouchers provide the best source of archival information for retrospective examination of species identifications of barcoded specimens.
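The practices described in this section could be checked automatically at data entry. The following Python sketch shows one way to represent a barcode record and report which of the recommended metadata are missing; the class, field names, and example values are all hypothetical and do not correspond to the actual schema of GenBank, BOLD, or the MetaZooGene database.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BarcodeRecord:
    """Hypothetical record layout for a COI reference barcode."""
    accession: str
    species: str
    identified_by: Optional[str] = None       # name of the morphological taxonomist
    voucher_repository: Optional[str] = None  # institution holding the voucher specimen
    photo_voucher: Optional[str] = None       # identifier of the pre-preservation photograph


def missing_metadata(record: BarcodeRecord) -> List[str]:
    """Return the names of fields that are empty or not provided."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Example with made-up values: taxonomist named, but no voucher information.
record = BarcodeRecord(
    accession="EXAMPLE0001",
    species="Calanus finmarchicus",
    identified_by="A. Taxonomist",
)
print(missing_metadata(record))
```

A repository could use such a check to return a submission for completion, or to label the record so that users can filter on metadata completeness.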

Call to action: COI reference library for marine zooplankton

A taxonomically complete, globally comprehensive COI reference sequence database for marine zooplankton is an essential foundation for widespread implementation of DNA barcoding and metabarcoding applications, which are rapidly becoming invaluable tools for fisheries management, environmental protection, and detection of climate change impacts throughout the global ocean. Accurate, reliable, and rapid identification of species of marine zooplankton will remain challenging, given the taxonomic and phylogenetic complexity of the assemblage and the marked variation among ocean regions and ecosystems. Public access to data and metadata for records of DNA barcode sequences of specimens that have been accurately identified to species, with associated tools to allow creation of custom databases for target taxonomic groups and ocean regions, and to ensure quality control and error detection, will enable and encourage widespread use of species-level diversity analysis for research, management, and monitoring needs. Priorities for completion of a taxonomically complete, globally comprehensive COI reference database should include:

  1. Consensus agreement on top priorities for DNA barcoding efforts: These may include ecologically and environmentally important species that have significant impacts on ecosystem function and/or are indicators of key processes and parameters, geographic regions of special interest and/or at particular risk, and representatives of ecologically important and/or highly impacted zooplankton taxonomic groups.

  2. Requirements for species-level identification: COI barcoding allows accurate discrimination and identification of species for many, but not all, taxonomic groups of marine zooplankton. Where possible, COI reference barcodes should be determined for all described species, based upon specimens that have been identified to species by morphological taxonomic experts, with associated collection metadata (georeferencing), voucher photographs of specimens prior to preservation, and permanent retention of voucher specimens.

  3. Georeferencing of collections for barcoded specimens: Collection metadata (latitude, longitude, date) for the specimen are essential for all records of COI barcode data in any public repository. This information facilitates detection of species identification errors by comparison with distribution records and aids detection of cryptic species, especially for taxa with broad biogeographical distributions.

  4. Statistical analysis of intraspecific variation of COI barcodes: Standard metrics of pairwise differences between barcode sequences for the same or closely-related species, as listed in GenBank or BOLD, should become a usual metric for barcode databases. These metrics are useful for detection of errors in species identification and the presence of possible cryptic species.
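As an illustration of the pairwise metrics named in the last priority, the Python sketch below computes uncorrected p-distances for aligned sequences and summarizes, for each species, the mean and maximum intraspecific distance and the minimum distance to any other species. The short sequences are toy values invented for the example, the function names are hypothetical, and the uncorrected p-distance is an assumption; a database might instead adopt a model-corrected distance such as K2P.

```python
from itertools import combinations
from statistics import mean


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    counting only positions where both have an unambiguous base."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def barcode_summary(records):
    """records: list of (species, aligned_sequence) tuples.

    Returns {species: {"n", "mean_intra", "max_intra", "min_inter"}};
    a metric is None when there are too few sequences to compute it.
    """
    summary = {}
    for sp in sorted({name for name, _ in records}):
        own = [s for name, s in records if name == sp]
        other = [s for name, s in records if name != sp]
        intra = [p_distance(a, b) for a, b in combinations(own, 2)]
        inter = [p_distance(a, b) for a in own for b in other]
        summary[sp] = {
            "n": len(own),
            "mean_intra": mean(intra) if intra else None,
            "max_intra": max(intra) if intra else None,
            "min_inter": min(inter) if inter else None,
        }
    return summary


# Toy aligned sequences (hypothetical, 20 bases each).
records = [
    ("Species A", "ATGGCAGGTATAGTAGGAAC"),
    ("Species A", "ATGGCAGGTATAGTAGGAAT"),
    ("Species B", "ATGGCTGGAATGGTTGGGAC"),
    ("Species B", "ATGGCTGGAATGGTTGGGAC"),
]

summary = barcode_summary(records)
for species, stats in summary.items():
    print(species, stats)
```

A maximum intraspecific distance that approaches or exceeds the minimum interspecific distance is the kind of pattern that would prompt a check for misidentification or possible cryptic species.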