Peggy Newman, Martin Westgate, Amanda Buyan, Dax Kellie & Shandiya Balasubramaniam

The problem


For researchers, getting data out of GBIF nodes is easy…

…but sharing your own data is hard.

Hurdles


  • Darwin Core Standard formatting isn’t easy (e.g., .xml)
  • Existing documentation isn’t well-suited to newbies
  • Poor integration with existing workflows (e.g., in R or Python)
  • Sharing data is low on priority list

Q: How can we help researchers share biodiversity data?

galaxias (and friends)


galaxias: Build, check & publish Darwin Core Archives (DwC-As)
corella: Convert a tibble to Darwin Core
delma: Convert markdown to EML or xml

Darwin Core

An archive is a .zip file containing three things:

  • data (.csv format)
  • metadata (.eml format)
  • schema (.xml format)
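Conceptually, the bundling step is simple: the three components above go into one .zip file. A minimal sketch using the zip package (file names are illustrative; galaxias automates this with build_archive() later):

```r
library(zip)

# The three components of a Darwin Core Archive (illustrative paths)
files <- c(
  "occurrences.csv",  # data
  "eml.xml",          # metadata (EML)
  "meta.xml"          # schema
)

# Bundle them into a single archive
zip::zip("dwc-archive.zip", files)
```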

Process



data → metadata → schema → archive → validate → submit

Data

Load galaxias

library(galaxias)



delma and corella are loaded automatically

Data

Load an example dataset

library(readr)

df <- read_csv("my_example_data.csv")
df
# A tibble: 2 × 5
  latitude longitude date       time  species                 
     <dbl>     <dbl> <chr>      <chr> <chr>                   
1    -35.3      149. 14-01-2023 10:23 Callocephalon fimbriatum
2    -35.3      149. 15-01-2023 11:25 Eolophus roseicapilla   

Data

How should we convert this dataset to Darwin Core?

suggest_workflow(df)

Data

If we follow that advice:

df_dwc <- df |>
  set_occurrences(occurrenceID = sequential_id(),
                  basisOfRecord = "HumanObservation") |> 
  set_coordinates(decimalLatitude = latitude, 
                  decimalLongitude = longitude) |>
  set_datetime(eventDate = lubridate::dmy(date),
               eventTime = lubridate::hm(time)) |>
  set_scientific_name(scientificName = species, 
                      taxonRank = "species")

df_dwc
# A tibble: 2 × 8
  basisOfRecord    occurrenceID decimalLatitude decimalLongitude eventDate 
  <chr>            <chr>                  <dbl>            <dbl> <date>    
1 HumanObservation 01                     -35.3             149. 2023-01-14
2 HumanObservation 02                     -35.3             149. 2023-01-15
# ℹ 3 more variables: eventTime <Period>, scientificName <chr>, taxonRank <chr>

Data

Save as occurrences.csv:

use_data(df_dwc)
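A quick sanity check is to read the saved file back (hypothetical path, assuming use_data() writes into the data-publish/ folder that build_archive() later zips):

```r
library(readr)

# Confirm the Darwin Core columns survived the round trip
read_csv("data-publish/occurrences.csv") |>
  names()
```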

Process



data → metadata → schema → archive → validate → submit

Metadata

Generate a metadata file

use_metadata_template() # creates the following file:
# Dataset
 
 ## Title
 
 A Sentence Giving Your Dataset Title In Title Case
 
 ## Abstract
 
 A paragraph outlining the content of the dataset
 
 ## Creator
 
 ### Individual name
 
 #### Surname

Metadata

Convert to EML

use_metadata("metadata.Rmd") # creates the following file:
<?xml version="1.0" encoding="UTF-8"?>
 <eml:eml xmlns:d="eml://ecoinformatics.org/dataset-2.1.0" xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/terms/" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.3/eml-gbif-profile.xsd" system="R-paperbark-package" scope="system" xml:lang="en">
   <dataset>
     <title>A Sentence Giving Your Dataset Title In Title Case</title>
     <abstract>A paragraph outlining the content of the dataset</abstract>
     <creator>
       <individualName>
         <surname>Person</surname>
         <givenName>Steve</givenName>
         <electronicMailAddress>example@email.com</electronicMailAddress>
       </individualName>
       <organisationName>Put your organisation name here</organisationName>
       <address>
         <deliveryPoint>215 Road Street</deliveryPoint>
         <city>Canberra</city>

Process



data → metadata → schema → archive → validate → submit

Archive

An automated process for zipping the data-publish/ folder.

build_archive()
Data (minimum of one)
  • occurrences.csv ✔
  • events.csv      ✖
  • multimedia.csv  ✖
Metadata
  • eml.xml         ✔
Schema
  • meta.xml        ✔

Archive

We can check that the correct files are present.

fs::path_abs("../dwc-archive.zip") |>
  zip::zip_list() |>
  tibble::as_tibble() |>
  dplyr::select(filename:timestamp)
# A tibble: 3 × 4
  filename        compressed_size uncompressed_size timestamp          
  <chr>                     <dbl>             <dbl> <dttm>             
1 occurrences.csv             194               283 2025-07-01 02:34:30
2 eml.xml                     684              1452 2024-12-12 04:21:22
3 meta.xml                    509              2145 2024-12-12 04:21:22


The schema file (meta.xml) has been built automatically.

Process



data → metadata → schema → archive → validate → submit

Validate

# validate locally
check_directory() 

# validate via GBIF API
check_archive(username = "a_gbif_user",
              email = "my@email.com",
              password = "a_secure_password")

Process



data → metadata → schema → archive → validate → submit

Submitting

Run submit_archive() to create an issue on the data-publication repository
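Put together, the whole pipeline is a handful of calls. A sketch, assuming submit_archive() needs no arguments beyond what the slides show:

```r
library(galaxias)

build_archive()    # zip the data-publish/ folder into a DwC-A
check_directory()  # validate locally before submitting
submit_archive()   # open an issue on the data-publication repository
```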

Process



data metadata schema archive validate submit

Benefits of galaxias


  • Darwin Core Standard formatting is easy (e.g., .xml)
  • Documentation well-suited to newbies
  • Good integration with existing workflows (e.g., in R or Python)
  • Sharing data is on the priority list (?)

Thank you


Peggy Newman
Martin Westgate
Amanda Buyan
Dax Kellie
Shandiya Balasubramaniam

galaxias
corella
delma
galah