Twelve years of open science infrastructure

Lessons from the Atlas of Living Australia

Martin Westgate / Science & Decision Support Team / ALA
SORTEE Conference / 13 July 2022

@westgatecology

About me / academia




Aside / academic stereotypes



  • Agility, innovation & novelty
  • Discrete projects and budgets
  • Dependence on low-cost, short-term programs to deliver major work

‘…we predict that the word “novel” will appear in every record by the year 2123.’



C Vinkers et al. (2015) Use of positive and negative words in scientific PubMed abstracts between 1974 and 2014: retrospective analysis. BMJ 351 https://doi.org/10.1136/bmj.h6467

Aside / open science



  • Necessity: Historic approaches to science have proven unreliable
  • Opportunity: Tech literacy means sharing is easier than ever
  • Complexity: Broad-scale changes to scientific methods, challenging existing incentive structures

Thesis #1:
Open science is in a transition to a more infrastructure-dependent model

Infrastructure / living atlases

Region      Organisation   Records (millions)
Australia   ALA            112.3
Austria     BAO            7.8
Brazil      SiBBr          23.6
Canada      Canadensys     6.3
Estonia     eElurikkus     6.2
France      INPN           87.4
Portugal    GBIF.pt        17.5
Spain       GBIF.es        36.3
Sweden      SBDI           103.4
UK          NBN            204.8
Vermont     VAL            7.2
Global      GBIF           2,204.6

Infrastructure / challenges


Project stage     Academia                   Infrastructure
Data collection   fieldwork                  collaboration with institutions
Data formatting   customizable               established standards
Data management   spreadsheet, app           processing pipeline
Data storage      single machine or online   database
Data out          -                          API

Infrastructure / benefits

  • stability
  • scalability

Thesis #2:
In science, stability & innovation are co-dependent

Stability / information

Image source: https://whatson.melbourne.vic.gov.au/things-to-do/state-library-victoria

Stability / observation

Image source: Gibney, E. How the revamped Large Hadron Collider will hunt for new physics. Nature 25-05-2022

Stability / inference

Stability / communication

Images:
Henry Oldenburg - Philosophical Transactions, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=36495651
Trisos, C.H., Auerbach, J. & Katti, M. Decoloniality and anti-oppressive practices for a more ethical ecology. Nat Ecol Evol 5, 1205–1212 (2021). https://doi.org/10.1038/s41559-021-01460-w

Thesis #3:
Working across the stability / innovation boundary is difficult

Stability / challenges


  • Shifting baseline syndrome
  • Managing pace of change
  • Communication across domains

galah

galah / ALA4R / benefits


  • Groundbreaking: released in 2014
  • Flexible: return the data you want, customised in various ways
  • Inclusive: most queries that the API supports can be constructed

galah / ALA4R / problems


No function naming convention

  • abbreviations: aus()
  • snake case: ala_fields()
  • contractions: fieldguide()
  • single words: occurrences(), images()

galah / ALA4R / problems


Confusing syntax

  • unclear differences between functions
    • ala_list(), ala_lists(), specieslist()
  • argument names require specialist knowledge
    • wkt, fq, qa
  • arguments require Solr queries passed as strings:
    • "taxon_name:\"Alaba vibex\""
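
To make the last point concrete, the base R sketch below builds the query string shown above by hand. The helper `build_solr_term()` is hypothetical (it is not part of ALA4R or galah) and only illustrates the quoting that users had to manage themselves; it makes no attempt to escape quotes inside the value.

```r
# Hypothetical helper, for illustration only: wrap a value in double
# quotes and join it to a field name in Solr "field:value" form.
build_solr_term <- function(field, value) {
  paste0(field, ":\"", value, "\"")
}

# Reproduce the query string from the slide
query <- build_solr_term("taxon_name", "Alaba vibex")

# cat() shows the string as Solr would receive it, without R's escapes
cat(query, "\n")
```

Even in this small case the user needs to know the field name, the Solr separator, and R's escaping rules for embedded quotes.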

galah / ALA4R / problems


Inconsistent behaviour

  • most functions return a data.frame
  • occurrences() returns a list
  • fieldguide() and plot.occurrences() output a PDF

galah / benefits

  • Query the ALA (and other national GBIF nodes)
  • Use tidy, pipe-able syntax

galah / benefits



Lookup         Narrow a query     Run a query
show_all()     galah_filter()     atlas_counts()
search_all()   galah_select()     atlas_occurrences()
               galah_group_by()   atlas_media()

galah / number of records

library(galah)

galah_call() |>
  galah_identify("Eolophus roseicapilla") |> # galahs
  atlas_counts()
# A tibble: 1 × 1
   count
   <int>
1 992079

galah / number of records

galah_call() |>
  galah_identify("Eolophus roseicapilla") |>
  galah_filter(year >= 2010,
               dataResourceName == "iNaturalist Australia") |>
  atlas_counts()
# A tibble: 1 × 1
  count
  <int>
1  7297

galah / number of records

galah_call() |>
  galah_identify("Eolophus roseicapilla") |>
  galah_filter(year >= 2010,
               dataResourceName == "iNaturalist Australia") |>
  galah_group_by(year) |>
  atlas_counts()
# A tibble: 13 × 2
   year  count
   <chr> <int>
 1 2021   1933
 2 2020   1571
 3 2019    942
 4 2022    917
 5 2018    821
 6 2017    537
 7 2016    194
 8 2015    110
 9 2014     79
10 2013     62
11 2011     54
12 2012     41
13 2010     36
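
As a quick consistency check, the yearly counts above should add up to the ungrouped total of 7297 returned by the previous query. The base R sketch below re-enters the printed values by hand (it does not query the ALA) and sorts them chronologically, since `year` is returned as a character column and the rows shown are ordered by count. The object names are illustrative.

```r
# Values copied from the tibble printed above; no API call is made
yearly <- data.frame(
  year  = c("2021", "2020", "2019", "2022", "2018", "2017", "2016",
            "2015", "2014", "2013", "2011", "2012", "2010"),
  count = c(1933L, 1571L, 942L, 917L, 821L, 537L, 194L,
            110L, 79L, 62L, 54L, 41L, 36L),
  stringsAsFactors = FALSE
)

# Convert year to integer so that it sorts numerically, then reorder
yearly$year <- as.integer(yearly$year)
yearly <- yearly[order(yearly$year), ]

# The grouped counts should sum to the ungrouped total reported earlier
total <- sum(yearly$count)
stopifnot(total == 7297L)

print(yearly, row.names = FALSE)
```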

galah / number of records

galah_call() |>
  galah_identify("Cacatuidae") |> # cockatoos
  galah_filter(year >= 2019) |>
  galah_group_by(year, dataResourceName) |>
  atlas_counts()
# A tibble: 80 × 3
   year  dataResourceName                                count
   <chr> <chr>                                           <int>
 1 2021  eBird Australia                                248142
 2 2021  iNaturalist Australia                            7621
 3 2021  NSW BioNet Atlas                                 1490
 4 2021  Earth Guardians Weekly Feed                       927
 5 2021  SA Fauna (BDBSA)                                  300
 6 2021  NatureMapr                                        166
 7 2021  WildNet - Queensland Wildlife Data                153
 8 2021  ALA species sightings and OzAtlas                 118
 9 2021  Wildlife Watch NSC                                105
10 2021  Port Adelaide Enfield Flora & Fauna Monitoring     37
# … with 70 more rows

galah / occurrences

library(galah)
library(ozmaps)
library(sf)
library(ggplot2)

# Provide an email address registered with the ALA (needed to download occurrences)
galah_config(email = "martinjwestgate@gmail.com")

# Download species occurrences
obs <- galah_call() |>
  galah_identify("peramelidae") |>
  galah_filter(year == 2021) |>
  atlas_occurrences()

# Ensure map uses correct projection
oz_wgs84 <- ozmap_data(data = "country") |>
  st_transform(crs = st_crs("WGS84"))

# Map points
ggplot(data = obs) + 
  geom_sf(data = oz_wgs84, 
          fill = "white") +
  geom_point(aes(x = decimalLongitude,
                 y = decimalLatitude), 
             color = "#78cccc") +
  theme_void()

galah / other atlases

galah_config(atlas = "Austria")
galah_call() |> atlas_counts()
# A tibble: 1 × 1
    count
    <int>
1 7786013
galah_config(atlas = "UK")
galah_call() |> atlas_counts()
# A tibble: 1 × 1
      count
      <int>
1 204774003
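
These totals correspond to the record counts, in millions, listed for Austria and the UK in the living atlases table earlier in the talk. A minimal base R sketch of the conversion, using the two counts printed above (re-entered by hand; the variable names are illustrative):

```r
# Record counts copied from the output above; no API call is made
counts <- c(Austria = 7786013, UK = 204774003)

# Express as millions of records, rounded to one decimal place
millions <- round(counts / 1e6, 1)

print(millions)
```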

galah / ALA labs

Learnings / galah


  • articulate your reasons for change
  • balance accepted vs novel methods
  • respond to feedback (but not too much)

Learnings / infrastructure


  • ‘fail quickly’ and ‘succeed slowly’ mindsets both have merit
  • reward systems can become divorced from real-world applications
  • no risk-free paths

Thank you

Summary:

  • Open science is in a transition to a more infrastructure-dependent model
  • In science, stability & innovation are co-dependent
  • Working across the stability / innovation boundary is difficult

Martin Westgate
Team Leader / Science & Decision Support / ALA
e: martin.westgate@csiro.au
t: @westgatecology
gh: @mjwestgate

These slides were made using Quarto & RStudio