Test datasets for evaluating automated transcription of institutional labels on herbarium specimen sheets

This collection contains three datasets for evaluating the automated transcription of institutional labels on herbarium specimen sheets.

Two datasets are derived from the herbarium at the University of Melbourne (MELU), one with printed or typed institutional labels (MELU-T) and the other with handwritten labels (MELU-H). The other dataset (DILLEN) is derived from:

Mathias Dillen. (2018). A benchmark dataset of herbarium specimen images with label data: Summary [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6372393

Each dataset is in CSV format and has 100 rows, each relating to an image of an individual herbarium specimen sheet.

There is a column in each CSV for the URL of the image. The Dillen dataset has an additional column with the DOI for each image.

There is a column for the `label_classification`, which indicates the type of text found in the institutional label using one of the following four categories:

  • handwritten
  • typewriter
  • printed
  • mixed

There are also columns for the following twelve text fields:

  • family
  • genus
  • species
  • infrasp_taxon
  • authority
  • collector_number
  • collector
  • locality
  • geolocation
  • year
  • month
  • day

If a text field is not present on the label, then the corresponding cell is left empty.

The text fields in the dataset are designed to come from the institutional label only and may not agree with other information on the specimen sheet.

In some cases it may be ambiguous how the text on a label should be encoded, and different human annotators could arrive at different encodings.
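
As an illustration of the layout described above, the following is a minimal sketch in Python (standard library only) that checks a CSV for the `label_classification` column and the twelve text fields, and lists which fields are filled in for a row. The helper names `read_rows` and `present_fields` are hypothetical, and the example row is made up for demonstration; it is not taken from any of the datasets.

```python
import csv
import io

# The twelve text fields and four label categories listed in this description.
TEXT_FIELDS = [
    "family", "genus", "species", "infrasp_taxon", "authority",
    "collector_number", "collector", "locality", "geolocation",
    "year", "month", "day",
]
LABEL_CATEGORIES = {"handwritten", "typewriter", "printed", "mixed"}
EXPECTED_COLUMNS = ["label_classification"] + TEXT_FIELDS


def read_rows(handle):
    """Read all rows as dicts, checking that the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def present_fields(row):
    """Return the text fields that are not empty in this row."""
    return [f for f in TEXT_FIELDS if row[f] != ""]


# Made-up example row (illustrative only), written to an in-memory CSV.
example = dict.fromkeys(EXPECTED_COLUMNS, "")
example.update(label_classification="typewriter", family="Myrtaceae",
               genus="Eucalyptus", year="1950")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS)
writer.writeheader()
writer.writerow(example)
buffer.seek(0)

rows = read_rows(buffer)
print(present_fields(rows[0]))  # ['family', 'genus', 'year']
```

To run the same check on one of the dataset files, the in-memory buffer could be replaced with a file opened as `open("DILLEN.csv", newline="")`.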

Evaluation Script

We provide a Python script to evaluate the output of an automated pipeline with these datasets. The script requires typer, pandas, plotly and kaleido.

You can install these dependencies in a virtual environment as follows:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt


To evaluate your pipeline, produce another CSV file with the same columns as one of the datasets, containing the output of the pipeline with the rows in the same order as that dataset.

For example, if the CSV of your pipeline is called `hespi-dillen.csv`, then you can evaluate it like this:

./evaluate.py DILLEN.csv hespi-dillen.csv --output hespi-dillen.pdf

This will produce an output image called `hespi-dillen.pdf` with a plot of the similarity of each field with the test set in DILLEN.csv. The file format for the plot can also be `svg`, `png` or `jpg`.

The similarity measure uses the Gestalt (Ratcliff/Obershelp) approach and is a percentage similarity between each pair of strings. Only fields where text is provided in either the test dataset or the predictions are included in the results. If a field is present in one of the test dataset or the predictions but not the other, then the similarity is given as zero. All non-ASCII characters are removed. By default the results are not case-sensitive. If you wish to evaluate with case-sensitive comparison, then use the `--case-sensitive` option on the command line. The output of the script will also provide the accuracy of the label classification and of whether or not any particular field should be empty.
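
The rules above can be sketched in a few lines of Python. This is not the code from `evaluate.py`; it is an approximation that assumes `difflib.SequenceMatcher` from the standard library (which implements a Ratcliff/Obershelp-style matcher) gives comparable scores. The function names `normalise` and `field_similarity` are hypothetical.

```python
from difflib import SequenceMatcher


def normalise(text, case_sensitive=False):
    """Remove non-ASCII characters and, by default, ignore case."""
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return text if case_sensitive else text.lower()


def field_similarity(expected, predicted, case_sensitive=False):
    """Percentage similarity for one field, or None if the field is excluded."""
    a = normalise(expected, case_sensitive)
    b = normalise(predicted, case_sensitive)
    if a == "" and b == "":
        return None  # no text on either side: not included in the results
    if a == "" or b == "":
        return 0.0   # text on one side only: similarity is zero
    return 100.0 * SequenceMatcher(None, a, b).ratio()


print(field_similarity("Eucalyptus", "eucalyptus"))  # 100.0
print(field_similarity("abcd", "abce"))              # 75.0
print(field_similarity("abc", ""))                   # 0.0
print(field_similarity("", ""))                      # None
```

In this sketch, emptiness is checked after non-ASCII characters are removed. That ordering is an assumption, and the actual script may handle it differently.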

Options for the script can be found by running:

./evaluate.py --help

Credit

Robert Turnbull, Emily Fitzgerald, Karen Thompson and Joanne Birch from the University of Melbourne.

If you use this dataset, please cite it and the corresponding Hespi paper. More information at https://github.com/rbturnbull/hespi
