Supporting Open Science
at the Atlas of Living Australia

Martin Westgate

outline



about: who we are
code: what we build
workflows: what we recommend
future: where we’re going





about

who we are

Dr Shandiya Balasubramaniam
Dr Dax Kellie
Dr Martin Westgate
Dr Amanda Buyan
Ms Juliet Seers

principles


(science & decision) support

modern, reproducible workflows require code

community-building for data cleaning, publication and re-use

data life-cycle



publish check re-use





code

what we build

open science package suite

galah
potions
galaxias
corella
delma

galah: access data from the GBIF node network

galah: tidy syntax

library(galah)

galah_call() |>
  filter(genus == "Perameles",
         basisOfRecord == "HumanObservation",
         year == 2025) |>
  group_by(species) |>
  count() |>
  collect()
# A tibble: 5 × 2
  species                count
  <chr>                  <int>
1 Perameles nasuta         444
2 Perameles fasciata       128
3 Perameles gunnii          81
4 Perameles pallescens      54
5 Perameles bougainville     3
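
The verbs above deliberately mirror dplyr. As a rough point of comparison, the sketch below runs the same shape of pipeline on a small local tibble. The `occurrences` object and its five rows are invented for illustration and are not ALA data.

```r
library(dplyr)
library(tibble)

# Hypothetical local occurrence records (invented values, not ALA data)
occurrences <- tribble(
  ~genus,      ~species,           ~basisOfRecord,      ~year,
  "Perameles", "Perameles nasuta", "HumanObservation",  2025,
  "Perameles", "Perameles nasuta", "HumanObservation",  2025,
  "Perameles", "Perameles gunnii", "HumanObservation",  2025,
  "Perameles", "Perameles gunnii", "PreservedSpecimen", 2025,
  "Isoodon",   "Isoodon obesulus", "HumanObservation",  2025
)

# Same filter -> group_by -> count pattern, evaluated in memory
local_counts <- occurrences |>
  filter(genus == "Perameles",
         basisOfRecord == "HumanObservation",
         year == 2025) |>
  group_by(species) |>
  count(name = "count") |>
  ungroup()

local_counts
```

The practical difference is where the work happens: dplyr evaluates each step locally, whereas a galah pipeline builds a query that is only sent to the atlas when `collect()` is called.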

galah: reproducible workflows

galah_config(email = "martinjwestgate@gmail.com")

df <- galah_call() |>
  filter(genus == "Perameles",
         basisOfRecord == "HumanObservation",
         year == 2025) |>
  select(occurrenceID, eventDate, species, occurrenceStatus) |>
  collect(mint_doi = TRUE)

slice_head(df, n = 3)
# A tibble: 3 × 4
  occurrenceID                      eventDate           species occurrenceStatus
  <chr>                             <dttm>              <chr>   <chr>           
1 https://naturemapr.org/sightings… 2025-01-14 06:47:00 Perame… PRESENT         
2 https://naturemapr.org/sightings… 2025-01-12 08:17:00 Perame… PRESENT         
3 https://naturemapr.org/sightings… 2025-01-09 09:27:00 Perame… PRESENT         

galah: access sensitive data (coming soon)

authenticate()

galah_call() |>
  filter(species_list_uid == "dr650") |>
  collect()



Request access via https://www.rasd.org.au

galaxias: format & publish biodiversity data

galaxias: assumptions about scientists


may want to publish data, but not have the tools to do so

don’t want to learn an unfamiliar data format

should retain control over what data is published, when, and by whom

galaxias: format to Darwin Core

my_data_dwc <- df |>
  set_occurrences(occurrenceID = composite_id(location_id, 
                                              sequential_id()),
                  basisOfRecord = "humanObservation") |> 
  set_coordinates(decimalLatitude = latitude, 
                  decimalLongitude = longitude) |>
  set_datetime(eventDate = dmy(date)) |>
  set_scientific_name(scientificName = species, 
                      taxonRank = "species")

my_data_dwc
# A tibble: 2 × 8
  location_id occurrenceID basisOfRecord    decimalLatitude decimalLongitude
  <chr>       <chr>        <chr>                      <dbl>            <dbl>
1 A           A-01         humanObservation           -35.3             149.
2 B           B-02         humanObservation           -35.3             149.
# ℹ 3 more variables: eventDate <date>, scientificName <chr>, taxonRank <chr>
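
For readers who want to see what this mapping amounts to, the sketch below builds a table with the same eight columns using only dplyr and lubridate. It is an approximation for illustration, not how the packages work internally: the input `df_raw` and all of its values are invented, the ID is assembled by hand with `paste()` and `row_number()`, and none of the checks that the `set_` functions may run are reproduced.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Hypothetical raw survey data (invented values)
df_raw <- tibble(
  location_id = c("A", "B"),
  latitude    = c(-35.31, -35.28),
  longitude   = c(149.12, 149.13),
  date        = c("14-01-2025", "15-01-2025"),
  species     = c("Perameles nasuta", "Perameles nasuta")
)

my_data_manual <- df_raw |>
  mutate(
    # Composite ID: location plus a zero-padded row sequence, e.g. "A-01"
    occurrenceID     = paste(location_id,
                             sprintf("%02d", row_number()),
                             sep = "-"),
    basisOfRecord    = "humanObservation",
    decimalLatitude  = latitude,
    decimalLongitude = longitude,
    eventDate        = dmy(date),   # parse day-month-year strings to Date
    scientificName   = species,
    taxonRank        = "species"
  ) |>
  # Keep only the identifier and Darwin Core columns
  select(location_id, occurrenceID, basisOfRecord,
         decimalLatitude, decimalLongitude,
         eventDate, scientificName, taxonRank)

my_data_manual
```

Writing the mapping out by hand shows why dedicated helpers are attractive: the column names are easy to mistype, and nothing here verifies that the values are valid for each Darwin Core term.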





workflows

what we recommend

ALA labs: home

ALA labs: posts

data cleaning





future

where we’re going

themes


artificial intelligence and machine learning

biodiversity indicators

data quality

artificial intelligence & machine learning


  • ML widely used for sound and image classification
  • AI will enable new applications and change expectations
  • AI unlikely to replace detailed data work, and may lower barriers to coding

biodiversity indicators


  • Indicators have traditionally been difficult to build and productionize
  • This is changing with improvements to workflows (GEO-BON, galaxy-e) and data standards (DwC-DP)
  • Unclear how active ALA should be in this space

data quality


  • Traditionally treated as a simple error detection problem
  • In practice, many problems stem from meaningful complexity (taxonomic, morphological, computational)
  • Our capacity to identify anomalous observations should increase with higher data volumes
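
As one simple illustration of the last point (a sketch only, not a method the ALA has adopted), records can be flagged when they sit far from the bulk of a species' observations. The `records` tibble, its values, and the threshold of five median absolute deviations are all assumptions made for this example.

```r
library(dplyr)
library(tibble)

# Hypothetical records: latitudes for one species, with one distant point
records <- tibble(
  species         = "Perameles nasuta",
  decimalLatitude = c(-35.3, -35.2, -35.4, -35.1, -35.3, -12.5)
)

flagged <- records |>
  group_by(species) |>
  mutate(
    centre       = median(decimalLatitude),  # robust centre per species
    spread       = mad(decimalLatitude),     # robust spread per species
    # Arbitrary illustrative threshold: more than 5 MADs from the median
    is_anomalous = abs(decimalLatitude - centre) > 5 * spread
  ) |>
  ungroup()

flagged
```

Estimates of the centre and spread become more stable as the number of records per species grows, which is one reason higher data volumes should help. A flag like this only marks a record for review: an unusual point may reflect real complexity, such as a taxonomic change, rather than an error.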

thanks

The ALA Science & Decision Support Team are:

  • Shandiya Balasubramaniam
  • Amanda Buyan
  • Dax Kellie
  • Juliet Seers
  • Martin Westgate

https://labs.ala.org.au

These slides were made with Quarto, R, and:

  • galah
  • galaxias
  • dplyr
  • lubridate
  • tibble