About MarkerSeek

MarkerSeek is an open-source tool and database for discovering DNA-barcoding markers from plastid genomes. It aligns annotated GenBank plastomes, scans nucleotide diversity, scores candidate regions with transparent diagnostics, and designs in-silico primer pairs, turning whole plastomes into ranked, primer-ready, publication-ready markers.

Version 0.1.0 · License MIT · Source GitHub · Maintained at Qingdao University

What MarkerSeek does

MarkerSeek compares annotated chloroplast or plastid GenBank records, aligns homologous sequences with MAFFT, estimates nucleotide diversity (π), detects high-polymorphism hotspots, scores candidate markers from diagnostic and primer-design evidence, and writes reproducible tables, figures, and JSON payloads.

The question it answers

Which plastome regions combine high information content, reliable alignment, species-level resolution, and practical primer-design potential — all at once?

Two ways to use it

Browse the pre-computed catalogue of ready-made markers for more than 2,000 plant genera, or upload your own annotated plastomes and run the full pipeline in the browser.

A command-line version is also available for batch and reproducible workflows.

Why plastid genomes

Plastid genomes are widely used in plant systematics because they are usually compact, collinearly annotated, and recoverable from genome-skimming data. Classical plant DNA barcoding depends on finding loci variable enough to separate closely related species while flanked by conserved sequence suitable for robust PCR amplification. MarkerSeek targets that problem at whole-plastome scale.

  • The nucleotide diversity index π is the mean pairwise nucleotide difference per valid aligned site. Peaks in π mark mutational hotspots such as intergenic spacers or rapidly evolving coding intervals.
  • A high π value alone is not enough: a good marker should also have reliable alignment, conserved flanks, species-level resolution, low estimated misclassification risk, and a working primer pair.
  • MarkerSeek combines π, feature-level diagnostics, and in-silico primer evidence into a single ranked candidate-marker table.
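The definition of π above can be illustrated with a minimal sketch. It assumes that a "valid" site is an alignment column in which every sequence carries an unambiguous A, C, G, or T (complete deletion); MarkerSeek's own rule for valid sites may differ, and the function name is hypothetical.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def nucleotide_diversity(aligned_seqs):
    """Mean pairwise difference per valid aligned site.

    Assumption: a column is valid only if every sequence has A, C, G or T
    there (gaps and ambiguity codes exclude the whole column).
    """
    seqs = [s.upper() for s in aligned_seqs]
    if len(seqs) < 2:
        raise ValueError("need at least two aligned sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must have equal aligned length")

    # Columns usable for comparison across all sequences.
    valid_cols = [i for i in range(length) if all(s[i] in VALID_BASES for s in seqs)]
    if not valid_cols:
        return float("nan")

    # Count differences for every unordered pair of sequences.
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a[i] != b[i] for i in valid_cols) for a, b in pairs)
    return total_diffs / (len(pairs) * len(valid_cols))


# Three sequences, six columns; the gapped column is excluded.
pi = nucleotide_diversity(["ACGT-A", "ACGTTA", "ACCTTA"])
```

In this toy alignment five columns are valid and the three pairs differ at 0, 1, and 1 sites, so π is 2 / (3 × 5).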

How it works

One reproducible pipeline runs from annotated GenBank inputs to ranked, primer-ready markers.

  1. Align: MAFFT builds a genome-wide multiple alignment from annotated plastomes.
  2. Scan π: A sliding-window scan locates hypervariable hotspots gene-by-gene and spacer-by-spacer.
  3. Score: Ten weighted diagnostics rank every candidate region from 0 to 100.
  4. Design primers: primer3 plus in-silico PCR estimates cross-species amplification success.

Nucleotide diversity (π)

A sliding-window π scan across the whole plastome pinpoints mutational hotspots and assigns each to a gene or intergenic spacer.
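As a rough illustration of such a scan, the sketch below averages per-site diversity values over fixed windows, flags windows above a threshold, and labels each one by the annotated feature containing its midpoint. The window size, step, threshold, midpoint rule, and feature tuples are all assumptions made for the example, not MarkerSeek's actual settings.

```python
def window_means(per_site, window, step):
    """Return (start, end, mean) for each full window.

    per_site holds one diversity value per alignment column; None marks an
    invalid column and is skipped when averaging.
    """
    windows = []
    for start in range(0, len(per_site) - window + 1, step):
        values = [v for v in per_site[start:start + window] if v is not None]
        mean = sum(values) / len(values) if values else None
        windows.append((start, start + window, mean))
    return windows


def find_hotspots(windows, threshold, features):
    """Keep windows whose mean reaches the threshold and name their feature.

    features is a list of (name, start, end) with 0-based, end-exclusive
    coordinates; a window is assigned to the feature holding its midpoint.
    """
    hotspots = []
    for start, end, mean in windows:
        if mean is None or mean < threshold:
            continue
        midpoint = (start + end) // 2
        name = next((n for n, s, e in features if s <= midpoint < e), "unannotated")
        hotspots.append((start, end, name))
    return hotspots


# Toy data: eight columns, one invalid, two hypothetical features.
per_site = [0.0, 0.0, 0.2, 0.4, 0.4, 0.0, None, 0.0]
features = [("trnH-psbA spacer", 0, 4), ("psbA", 4, 8)]
hits = find_hotspots(window_means(per_site, window=4, step=2), 0.2, features)
```

Real plastome scans use windows of hundreds of bases; the tiny numbers here only keep the example readable.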

mVISTA-style similarity

Reference-anchored identity tracks visualise conservation and divergence across every sample at a glance.
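A reference-anchored identity track can be approximated as windowed percent identity between each aligned sample and the aligned reference. The sketch below is a simplification written for illustration: it skips columns where the reference has a gap and counts a gap in the sample as a mismatch. Those conventions, and the function name, are assumptions rather than a description of MarkerSeek or mVISTA internals.

```python
def identity_track(reference, sample, window, step):
    """Percent identity to the reference for each full window.

    Both inputs are rows of the same alignment. Columns where the reference
    is a gap are ignored; a window with no comparable column yields None.
    """
    if len(reference) != len(sample):
        raise ValueError("reference and sample must have equal aligned length")
    ref, smp = reference.upper(), sample.upper()
    track = []
    for start in range(0, len(ref) - window + 1, step):
        compared = matches = 0
        for r, s in zip(ref[start:start + window], smp[start:start + window]):
            if r == "-":
                continue  # anchored to the reference: skip its gap columns
            compared += 1
            matches += r == s
        identity = 100.0 * matches / compared if compared else None
        track.append((start, identity))
    return track


track = identity_track("ACGTACGT", "ACGAAC-T", window=4, step=4)
# -> [(0, 75.0), (4, 75.0)]
```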

Transparent scoring

Diversity, resolution, barcoding gap, conserved flanks and more are normalised and weighted into one interpretable score.

In-silico primer design

primer3 designs pairs in conserved flanks; in-silico PCR checks unique amplification on the reference and universality across samples.

The MarkerSeek score

Each candidate marker receives a transparent 0–100 score: a normalised weighted sum of ten diagnostics, each clipped to a defined range and oriented so that higher is always better.

score = round(100 × Σ wᵢ·nᵢ, 1)

Diagnostic                          Direction        Weight
Nucleotide diversity (π)            higher better    0.18
Species resolution                  higher better    0.15
Flanking conservation               higher better    0.12
Variable-site density               higher better    0.10
Alignment reliability               higher better    0.10
Barcoding gap                       higher better    0.10
Nearest-neighbour discrimination    higher better    0.10
Indel density                       higher better    0.05
Missing / ambiguous ratio           lower better     0.05
Length suitability                  higher better    0.05

Honest handling of missing data

Some diagnostics — barcoding gap and nearest-neighbour discrimination among them — can only be computed when a species is represented by two or more samples. For the typical NCBI layout of one plastome per species, MarkerSeek reports those metrics as NA rather than substituting a default, drops them from the weight set, and re-normalises the remaining weights so the score stays a valid 0–100 quantity scaled to the available evidence.
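The scoring rule and the NA handling can be sketched together. The weights below are the ones listed for the ten diagnostics; the dictionary keys, the clipping ranges passed to `normalise`, and the function names are hypothetical, since the defined range of each diagnostic is not stated here.

```python
WEIGHTS = {
    "nucleotide_diversity": 0.18,
    "species_resolution": 0.15,
    "flanking_conservation": 0.12,
    "variable_site_density": 0.10,
    "alignment_reliability": 0.10,
    "barcoding_gap": 0.10,
    "nearest_neighbour_discrimination": 0.10,
    "indel_density": 0.05,
    "missing_ambiguous_ratio": 0.05,
    "length_suitability": 0.05,
}


def normalise(value, lo, hi, higher_better=True):
    """Clip value to [lo, hi], scale to 0..1, and flip if lower is better."""
    clipped = min(max(value, lo), hi)
    n = (clipped - lo) / (hi - lo)
    return n if higher_better else 1.0 - n


def markerseek_score(normalised):
    """Weighted 0-100 score from normalised diagnostics.

    normalised maps diagnostic name -> value in [0, 1], or None for NA.
    NA diagnostics are dropped and the remaining weights are rescaled so
    they sum to 1 again.
    """
    available = {k: v for k, v in normalised.items() if v is not None}
    total_weight = sum(WEIGHTS[k] for k in available)
    if total_weight == 0:
        return None  # nothing to score
    weighted = sum(WEIGHTS[k] * v for k, v in available.items())
    return round(100 * weighted / total_weight, 1)


# One plastome per species: the two replicate-dependent metrics are NA.
diagnostics = {name: 0.5 for name in WEIGHTS}
diagnostics["barcoding_gap"] = None
diagnostics["nearest_neighbour_discrimination"] = None
score = markerseek_score(diagnostics)  # -> 50.0
```

Because the remaining weights are rescaled, a marker whose available diagnostics all sit at 0.5 scores 50.0 whether or not the two replicate-dependent metrics could be computed.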

Primer design and in-silico PCR

When enabled, MarkerSeek designs and validates primer pairs only for the curated candidate-marker set — not the whole genome — keeping the step tractable on modest servers.

  • Conserved flanking windows are identified from the genome-wide alignment, and primer3 designs candidate pairs against the reference sequence.
  • Each pair is validated by full in-silico PCR on the reference to confirm a unique, length-conformant amplicon.
  • Universality is checked across the non-reference samples with a fast anchor-strict, body-fuzzy binding search: the 3′ anchor must match exactly, while the rest of the primer (the body) tolerates a limited number of mismatches.
  • Pairs are ranked by a primer score that blends primer3 penalty, cross-species amplification success, and amplicon information content.
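The anchor-strict, body-fuzzy idea can be sketched as follows. The anchor length, the mismatch limit, and all function names are assumptions chosen for illustration; the sketch handles only unambiguous A, C, G, and T and does not reproduce MarkerSeek's actual search.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def binding_sites(primer, template, anchor_len=5, max_body_mismatches=2):
    """0-based starts where the primer (5'->3') matches the given strand.

    The last anchor_len bases (3' anchor) must match exactly; the remaining
    5' body may differ at up to max_body_mismatches positions.
    """
    primer, template = primer.upper(), template.upper()
    if not 0 < anchor_len <= len(primer):
        raise ValueError("anchor_len must be between 1 and the primer length")
    body, anchor = primer[:-anchor_len], primer[-anchor_len:]
    hits = []
    pos = template.find(anchor)
    while pos != -1:
        start = pos - len(body)
        if start >= 0:
            site_body = template[start:pos]
            mismatches = sum(a != b for a, b in zip(body, site_body))
            if mismatches <= max_body_mismatches:
                hits.append(start)
        pos = template.find(anchor, pos + 1)
    return hits


def amplicon_sizes(forward, reverse, template, **kwargs):
    """Sizes of products for a primer pair, both given 5'->3'."""
    fwd_hits = binding_sites(forward, template, **kwargs)
    # The reverse primer binds the opposite strand, so search the
    # reverse-complemented template and map the hit back.
    rev_hits = binding_sites(reverse, reverse_complement(template), **kwargs)
    sizes = []
    for f in fwd_hits:
        for r in rev_hits:
            end = len(template) - r  # exclusive amplicon end, forward strand
            if end - f >= len(forward) + len(reverse):
                sizes.append(end - f)
    return sizes


template = "GGG" + "ACGTACGTAC" + "TTTTTTTT" + "GATCGATCGA" + "CCC"
sizes = amplicon_sizes("ACGTACGTAC", reverse_complement("GATCGATCGA"), template)
# -> [28]
```

A pair that yields exactly one size within the allowed length range on the reference corresponds to the "unique, length-conformant amplicon" check; running the same search on each non-reference sample gives a simple universality estimate.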

The pre-computed catalogue

MarkerSeek ships a browsable database of ready-made markers, so you can find candidate regions without uploading anything. The catalogue currently contains:

  • 2,112 genera across 348 families and 128 orders.
  • 16,822 indexed plastomes.
  • 38,303 candidate markers and 192,055 primer pairs.

How to cite

If MarkerSeek supports your analysis, please cite the MarkerSeek paper and the underlying methods and software listed below. The tool and database are available at www.bioseqhub.cn/markerseek.

MarkerSeek

Gao C. MarkerSeek: plastome DNA-barcoding marker discovery. Qingdao University. Software and database, github.com/gaochengwen/MarkerSeek.

References

MarkerSeek builds on established DNA-barcoding theory and on widely used alignment, primer-design, and population-genetics methods.

  1. Hebert PDN, Cywinska A, Ball SL, deWaard JR. Biological identifications through DNA barcodes. Proceedings of the Royal Society B. 2003;270:313–321.
  2. Meyer CP, Paulay G. DNA barcoding: error rates based on comprehensive sampling. PLoS Biology. 2005;3:e422.
  3. Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, Rozen SG. Primer3: new capabilities and interfaces. Nucleic Acids Research. 2012;40:e115.
  4. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution. 2013;30:772–780.
  5. Nei M, Li WH. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences USA. 1979;76:5269–5273.

Contact

Questions, bug reports, and feedback are welcome.