
Test datasets for evaluating automated transcription of primary specimen labels on herbarium specimen sheets

Version 4 2025-06-18, 06:36
Version 3 2025-03-05, 11:32
Version 2 2024-10-02, 08:56
Version 1 2024-04-19, 00:50
Dataset posted on 2025-06-18, 06:36, authored by Robert Turnbull, Emily Fitzgerald, Karen Thompson, Joanne Birch
<p dir="ltr">This contains three datasets to evaluate the automated transcription of primary specimen labels (also known as 'institutional labels') on herbarium specimen sheets.</p><p dir="ltr">Two datasets are derived from the herbarium at the University of Melbourne (MELU), one with printed or typed institutional labels (MELU-T) and the other with handwritten labels (MELU-H). The other dataset (DILLEN) is derived from:</p><p dir="ltr">Mathias Dillen. (2018). A benchmark dataset of herbarium specimen images with label data: Summary [Data set]. Zenodo. <a href="https://doi.org/10.5281/zenodo.6372393" rel="noreferrer" target="_blank">https://doi.org/10.5281/zenodo.6372393</a></p><p dir="ltr">Each dataset is in CSV format and has 100 rows, each relating to an image of an individual herbarium specimen sheet.</p><p dir="ltr">There is a column in each CSV for the URL of the image. The Dillen dataset has an additional column with the DOI for each image.</p><p dir="ltr">There is a column for the `label_classification` which indicates if the type of text to be found in the institutional label in the following four categories:</p><ul><li>handwritten</li><li>typewriter</li><li>printed</li><li>mixed</li></ul><p dir="ltr">There are also columns for the following twelve text fields:</p><ul><li>family</li><li>genus</li><li>species</li><li>infrasp_taxon</li><li>authority</li><li>collector_number</li><li>collector</li><li>locality</li><li>geolocation</li><li>year</li><li>month</li><li>day</li></ul><p dir="ltr">If the text field is not present on the label, then the the corresponding cell is left empty.</p><p dir="ltr">The text fields in the dataset are designed to come from the primary specimen label only and may not agree with other information on the specimen sheet.</p><p dir="ltr">In some cases there may be ambiguity for how the text on the labels and human annotators could arrive with different encodings.</p><h2>Evaluation Script</h2><p dir="ltr">We provide a Python script to evaluate the output of an automated pipeline with these datasets. The script requires typer, pandas, plotly and kaleido.</p><p dir="ltr">You can install these dependencies in a virtual as follows:</p><p dir="ltr">python3 -m venv .venv<br>source .venv/bin/activate<br>pip install -r requirements.txt</p><p><br></p><p dir="ltr">To evaluate your pipeline, produce another CSV file with the same columns and with output of the pipeline in the save order as one of the datasets.</p><p dir="ltr">For example, if the CSV of your pipeline is called `hespi-dillen.csv`, then you can evalulate it like this:</p><p dir="ltr">python3 ./evaluate.py DILLEN.csv hespi-dillen.csv --output hespi-dillen.pdf</p><p dir="ltr">This will produce an output image called `hespi-dillen.pdf` with a plot of the similarity of each field with the test set in DILLEN.csv. The file format for the plot can also be `svg`, `png` or `jpg`.</p><p dir="ltr">The similarity measure uses the Gestalt (Ratcliff/Obershelp) approach and is a percentage similarity between the each pair of strings. Only fields where text is provided in either the test dataset or the predictions are included in the results. If a field is present in either the test dataset or the predictions but not the other then the similarity is given as zero. All non-ASCII characters are removed. By default the results are not case-sensitive. If you wish to evaluate with case-sensitive comparison, then use the `--case-sensitive` option on the command line. 
The output of the script also reports the accuracy of the label classification and of whether or not each field should be empty.

Options for the script can be found by running:

python3 ./evaluate.py --help

Credit

Robert Turnbull, Emily Fitzgerald, Karen Thompson and Joanne Birch from the University of Melbourne.

If you use this dataset, please cite it and the corresponding Hespi paper. More information at https://github.com/rbturnbull/hespi

This dataset is also available on GitHub: https://github.com/rbturnbull/hespi-test-data
