Interview: Rukaya Johaadien

Botanical Museum | Oslo | 11-05-2026

Author: Martin Westgate

Published: May 11, 2026

This is a modified transcript of a discussion on 11th May 2026 with Rukaya Johaadien, Data Manager at GBIF Norway, and two-time winner of the GBIF Ebbe Nielsen Challenge (in 2024 & 2025). The transcript has been edited for brevity and clarity - mainly on the part of the interviewer - and I have also added links and references to the content we discussed.

Could you tell me about the project you won the 2025 Ebbe Nielsen prize for?

BDQEmail? I wanted it to have as low a barrier as possible. I was part of a panel which was reviewing a set of biodiversity data quality tests called BDQ, which is intended to become a TDWG standard. I was reviewing it and I felt like it would be quite hard for people to implement it.

So Paul Morris, who’s part of Lee [Belbin] and Arthur [Chapman]’s team on this, wrote some code, which implemented these tests. But to use it, you had to spin up your own Java server, know how to execute it, know how to interface with it. I wanted people to be able to use it easily, you know?

I thought, okay, the very, very simplest thing would be for people to be able to send an email with their data file and for it to return some information about it. So I made a wrapper over Paul’s code, which acted as an API, and I set up a BDQ email at gmail.com which is where people can send their text file or CSV or spreadsheet or whatever it is they have. I have a very simple Google Apps Script, which polls the inbox. It checks to see, firstly, if there’s an attachment, and then if it’s in a format that I accept, and finally whether the headers in there are at least a little bit Darwin Core-like. And then if it passes those checks, it sends that file to the API, which I set up…do you want the technical details as well?

Sure!

Well that’s running in the cloud, it just spins up when it’s needed. So most of the time it’s dormant. People were maybe using it like once every two weeks or so. Now people are using it almost daily.

So it spins up the wrapper and the API, Paul’s code in Java, and I had to have the wrapper in order to do batch processing for larger batches because it does geographical lookups, which ends up taking ages and ages. So there’s a little bit of complexity in there. This is another reason why email was such a good format for this, actually, because you don’t expect an email back immediately. GBIF has the same thing and they email you when your download is ready.
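As a rough illustration of the intake checks and batching described above, here is a minimal Python sketch. It is not the BDQEmail code: the real checks run in Google Apps Script and the batching happens in the wrapper around Paul Morris's Java implementation. The list of Darwin Core terms, the accepted file extensions, the match threshold and all function names are assumptions made for this example.

```python
import csv
import io
from typing import Iterator, List, Optional

# Assumption: a small, illustrative subset of Darwin Core term names.
DWC_TERMS = {
    "occurrenceid", "scientificname", "eventdate", "basisofrecord",
    "decimallatitude", "decimallongitude", "countrycode", "country",
}

# Assumption: the accepted formats are not listed in the interview.
ACCEPTED_EXTENSIONS = (".csv", ".txt", ".tsv")


def looks_like_darwin_core(header: List[str], min_matches: int = 2) -> bool:
    """True if at least `min_matches` column names are known Darwin Core terms."""
    normalised = {name.strip().lower() for name in header}
    return len(normalised & DWC_TERMS) >= min_matches


def accept_attachment(filename: Optional[str], content: str) -> bool:
    """Apply the three checks in order: attachment, format, header."""
    if not filename:
        return False  # no attachment on the email
    if not filename.lower().endswith(ACCEPTED_EXTENSIONS):
        return False  # not a format we accept
    lines = content.splitlines()
    first_line = lines[0] if lines else ""
    # Guess the delimiter from the header row: tab if present, else comma.
    delimiter = "\t" if "\t" in first_line else ","
    header = next(csv.reader(io.StringIO(first_line), delimiter=delimiter), [])
    return looks_like_darwin_core(header)


def batches(rows: list, size: int) -> Iterator[list]:
    """Yield consecutive chunks of at most `size` rows, so that slow
    steps (such as geographic lookups) can be run one chunk at a time."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```

For example, `accept_attachment("records.csv", "occurrenceID,scientificName,eventDate\n1,Betula pubescens,2025-06-01\n")` passes all three checks, while a file with headers like `a,b,c` is rejected at the last step.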

Once all the processing is finished, it constructs an email with the results. And that email is not the raw results of those tests because it’s horrendous to try and figure out what’s going on there. It’s an AI interpretation of what it thinks are the critical parts that need fixing, that draws on a lot of contextual information that I gave it. I think this one is using Gemini for cost reasons. For me, OpenAI’s models are the best at the moment, so that’s what I tend to be using for most of my own stuff. I’m regularly experimenting though. Anyway, it has an AI interpretation written in proper sentences, explaining what they need to do in quite friendly language. It’s encouraging.

Finally, it links to a dashboard and the dashboard has visualisations for the test results. And I think that is super useful because you can’t interpret those test results without visualisation. So that’s all that thing is, and that’s how it works. People are giving good feedback about it and say it’s useful when it gets used, so that’s good.

Botanical Museum, Lids House, Oslo Botanical Garden, May 2026

That’s fantastic.

The good thing about it is that I can run things ad hoc against that API. I haven’t done it yet, but I plan to take our important data sets, which are published by the Natural History Museum and Herbarium here, and I will run those through BDQEmail. And that generates a similar report and emails it to all of our data providers. I plan to do it like maybe twice a year. And I think that’s going to help a lot with data quality, because people, once they start publishing their data, they feel a little bit of ownership over it and they start getting proud of it and wanting to improve it.

So it sounds like you and the team here at GBIF Norway focus primarily on awareness, training, outreach, and you’ve been building additional tools to support the existing GBIF workflows. Is that a fair assessment?

Yeah, I would say so. Pretty much all of my job is about getting data published, and also a large portion on data quality and that kind of thing.

And is that mostly talking with the collections or government or researchers?

It’s mostly researchers, I guess, but there’s also a lot of private companies who are mandated to publish their data.

Yeah, okay. Is that a recent thing?

Over the past three years or so, we’ve been getting substantially more private companies who are publishing their data. So they will interact with me and we’ll set something up to interface with their database, or if they have files that they need regularly publishing. Whatever it is, I’ll help them with it. I would say that they were supposed to be doing it before, but we have been kind of like pushing them to actually really truly do it. But I’m sure in Australia you probably have like laws saying that if you do an environmental assessment, you have to publish the data?

So it feels like you’re acting like partly in an outreach role, but also just doing a lot of helping people process their data or processing data on their behalf. Is that fair?

Yeah.

That’s a big job.

Yeah. I mean, it’s a lot easier now, with AI to help. So obviously I check everything. But I can process things so much more quickly than before, and I have the chatbots, and then I have tools like this BDQ data quality standard thing.

Do you find the current tools for flagging errors are adequate, or do you feel they’re lacking in some way?

It’s an interesting question. I don’t know, really. I think it’s a bit silly that the IPT doesn’t have more built-in checks because we end up using that a lot. I feel like that should have more kind of basic checks and it shouldn’t be like buried away when you press the publish button. We can do such cool stuff with UI now. It sort of frustrates me a little bit that you don’t get like live data views showing you errors, and it could be really cool.

In fact, I want to build that. I’ve decided. Build something into the IPT; or perhaps it should go into chatIPT, since that’s my baby.

I do feel like…the data quality tools that we’ve got are slightly antiquated. In terms of user friendliness, they could be a lot better. And I guess, the actual things that they’re checking could also be improved. But the BDQ standard seems really exhaustive, so I can’t think of how to improve that.

That’s fair. It’s interesting that as developers, our instinct is often to go and write some new tool or some new code, and I’m trying to resist that in myself. Because I do think there is a communication aspect to it. Like, a lot of the things we have do provide some useful information, but people aren’t finding it.

But beyond that, a lot of the errors people report to us are only visible once the data have been aggregated, like classic outliers and stuff. And so I think there is potentially a role for GBIF and the nodes to work in that space, because there’s information you get from having all the data in place that you don’t get from individual datasets.

I’m just trying to remember if there are any BDQ tests for species distribution checking. And I don’t think that it does that kind of thing. I don’t remember anything like that. That’s kind of data interpretation as well. I mean, it’s really difficult to say what is an outlier and what’s not; you have people who dedicate their lives to trying to work this stuff out.

I don’t know how…I guess it could do things like say “Something is suspect here”. Again, I think that you should have an AI filter or something that can be smart about flagging certain things and tell what could plausibly, really, truly be a problem? I hadn’t really thought about that.

Palm House, Oslo Botanical Garden, May 2026

You’re right, I think historically at ALA, we’ve made the generalisation that we’re an aggregator and interpretation is someone else’s remit. And maintaining the infrastructure takes a lot of time too, so it’s difficult to prioritize data quality sometimes.

I’ve always been totally impressed by Atlas of Living Australia.

Thanks! I didn’t build any of it, but our operational teams do a great job.

It’s a combination of like building it and actually like keeping it going. I used to work for the South African National Biodiversity Institute, and I remember I looked at the ALA code base at one point like thinking “Oh, maybe we can implement something like that”. And I was just like, wow, this is complex.

It’s funny, because some of our code is old and needs cleaning up, for sure, but I think a great deal of that complexity is innate in the sorts of data we have and the problems we’re trying to solve.

I realised this as I got deeper and deeper into it. There’s a reason why all these bits are like this. And it’s kind of shocking that the deeper you go, the more you find. Which is why I was asking you these questions about like how much AI are you guys using for that now?

Look, my team doesn’t use it much yet. But our systems team have picked it up for helping them write code.

Gosh, I use it constantly for setting up CI/CD and anything that’s to do with SysOps stuff. I never enjoyed doing it. I kind of got over the hump of the learning curve, but the process of getting familiar with Kubernetes and Jenkins, I just didn’t enjoy that. So now I can just ask Codex to do whatever it is and deploy and roll out and set up a sensible system for me and I’m totally fine with that.