Introduction
This tutorial is designed to supplement Module 6 of the OBIS/OTGA self-paced course: Contributing and publishing datasets to OBIS. It demonstrates how to run some basic quality control (QC) checks in R. The dataset used in the tutorial is the example fishing dataset, “ExampleDataset-1”, as labeled in the course. Remember that whenever we make changes to a file, it is good practice to keep a copy of the original data unchanged. For this QC tutorial, we will check the following:
- Taxa information,
- Required fields,
- Coordinates,
- Dates,
- Identifiers,
- Presence of outliers.
We will use two R packages to check the data: obistools and Hmisc.
For this tutorial, all required packages are installed and loaded for you, but you would have to install them yourself if you chose to run these checks on your own computer. Keep in mind that installing obistools requires the devtools package. See here for details on installing obistools. Each code chunk in the tutorial will have a “Run Code” button which will allow you to run the code as it would run in your own R instance. You may also be prompted to write your own code at times.
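If you do need to install the packages on your own machine, the commands below are a minimal sketch (obistools is installed from its GitHub repository, iobis/obistools, using devtools):
# Only needed when running these checks on your own computer
install.packages(c("devtools", "Hmisc", "measurements"))   # CRAN packages used in this tutorial
devtools::install_github("iobis/obistools")                 # obistools from GitHub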
While this tutorial can be used as an aid for the OBIS/OTGA Course Exercise 6, particularly if you are new to R, please ensure assignment answers are written in your own words.
Note also that this tutorial focuses on demonstrating QC checks in R, but is not comprehensive and you will have to complete certain aspects of QC (e.g. adding missing required columns) on your own.
Load data files
The first step is of course to load our Event and Occurrence tables into R. We have already separated the Excel file “OBIS_Exercise6-1_QC.xlsx” into two separate .csv files to make the QC process easier.
# Read the data files
event<-read.csv("data/QC_event.csv")
occur<-read.csv("data/QC_occur.csv")
Right away we get an error that there’s an invalid string in the data. This is likely an encoding error, which usually occurs when there is a symbol or other special character in the data. To get around this, let’s specify the encoding for our data tables.
# Read the data files
event<-read.csv("data/QC_event.csv", fileEncoding = "latin1")
occur<-read.csv("data/QC_occur.csv", fileEncoding = "latin1")
Now the files are being read correctly, hurray!
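Following the advice above about preserving the original data, a simple habit is to keep untouched copies of the data frames before making any changes (the object names here are just illustrative):
# Keep unmodified copies of the original tables before any QC edits
event_original <- event
occur_original <- occur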
Check data table structure
Before we get into running the QC checks, it’s important to double-check that each of our data columns is actually being interpreted correctly. We will use the R function str() to look at the structure of our two data tables. The output will tell us which class each data column is. This is important because we want to make sure the columns with numeric data are indeed numeric (i.e. class is either integer or numeric). Use the str() function below to check both the “event” and “occur” tables.
str(event)
str(occur)
Interpret the results
We can see in the Event table that minimumDepthInMeters and maximumDepthInMeters are of class integer and numeric, respectively, which is expected. However, our coordinate fields, decimalLongitude and decimalLatitude, are of class character! Before we tell R to read these columns as numeric, let’s take a look at what’s going on in these two columns.
The Occurrence table appears to be read correctly so there is nothing to fix for that one.
event$decimalLongitude
event$decimalLatitude
There are two rows with some strange formatting for both the longitude and latitude:
- Row 27: 54°17’ N and 132°57’ W
- Row 46: 54°10’0.450 N and 132°59’0.536 W
Convert coordinates to decimal degrees
Let’s convert these to decimal degrees using the conv_unit function from the measurements package. We will fix the incorrect data by manually replacing the contents of the cells, but note that you could fix these errors in more automated ways if you needed to do this for many cells (a sketch of one such approach follows the code below). Once we’ve converted the coordinates to decimal degrees, we’ll go ahead and tell R to consider these columns as numeric.
library(measurements)
new_long<- conv_unit(c('54 17 00','54 10 0.450') , from = "deg_min_sec", to = "dec_deg") #we added 00 to the long/lat for row 27 so that the coordinates convert properly
new_lat <- conv_unit(c('-132 57 00', '-132 59 0.536'), from = "deg_min_sec", to = "dec_deg")
# Replace data in rows 27 and 46 for our latitude and longitude columns
event[27, c("decimalLatitude", "decimalLongitude")] <- c(new_lat[1], new_long[1])
event[46, c("decimalLatitude", "decimalLongitude")] <- c(new_lat[2], new_long[2])
# Change the longitude and latitude columns to be read as numeric
event$decimalLatitude<-as.numeric(event$decimalLatitude)
event$decimalLongitude<-as.numeric(event$decimalLongitude)
# Double check that the columns are now numeric
str(event$decimalLatitude)
str(event$decimalLongitude)
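As noted above, if many cells contained coordinates written in degrees and minutes, you could convert them in a more automated way. The helper below is a minimal sketch: the function name, the regular expression, and the assumption that values follow a “degrees minutes [seconds] hemisphere” pattern are all illustrative, so adapt it to how the coordinates are actually written in your data.
library(measurements)

# Sketch of a reusable converter: strip symbols, pad missing seconds, convert,
# and apply a negative sign for southern/western hemispheres
dms_to_dec <- function(x) {
  clean <- gsub("[^0-9. ]", " ", x)                  # keep only digits, dots and spaces
  clean <- trimws(gsub("\\s+", " ", clean))          # collapse repeated spaces
  clean <- ifelse(lengths(strsplit(clean, " ")) == 2,
                  paste(clean, "0"), clean)          # add a 0 seconds value if missing
  dec <- as.numeric(conv_unit(clean, from = "deg_min_sec", to = "dec_deg"))
  ifelse(grepl("[SW]", x), -dec, dec)
}

dms_to_dec(c("54°17' N", "132°57' W"))  # approximately 54.28333 and -132.95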
Once the coordinates are numeric we can go ahead and start running more OBIS-specific QC checks!
Check required fields
One of the first checks we will do is to make sure all the OBIS-required fields are present in our data. To do this we will use the obistools::check_fields function. We should make sure to do this for all relevant tables, so we will run the function for both the Event and the Occurrence table. Use this function in the code block below to check for required fields in our two data tables.
obistools::check_fields(event)
obistools::check_fields(occur)
Interpret the Results: Event table
We see that the Event table is missing eventDate, parentEventID, and decimalLongitude + decimalLatitude for some rows. We also see warnings that the required Occurrence terms are missing (i.e. scientificName, scientificNameID, basisOfRecord, and occurrenceStatus). We can safely ignore these “missing” fields because they are in the Occurrence table. For the missing coordinates, we would have to take a look at the file to see if the missing ones are simply associated with parent events that don’t have coordinate data, or if there are actually missing coordinates. For any localities with missing coordinates we should attempt to georeference them, as we learned how to do in this course.
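To see which events are affected, you could filter the Event table for blank coordinates directly in R (a minimal sketch; column names follow this dataset, and blank cells become NA after the conversion to numeric above):
# List events with no coordinates so we can check whether they are parent events
missing_coords <- is.na(event$decimalLatitude) | is.na(event$decimalLongitude)
event[missing_coords, c("eventID", "decimalLatitude", "decimalLongitude")]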
For the missing eventDate column, we know there is already a column named “date” in the Event table, so we just have to map this column to the Darwin Core term. We can do this in R with the code below.
# Rename the date column to eventDate
colnames(event)[3]<-"eventDate"
Interpret the Results: Occurrence table
As above, we can ignore the warnings about the missing columns that are in the Event table. However, the Occurrence table is missing the basisOfRecord column, and some rows are missing occurrenceStatus. So we will have to populate these fields based on what we know about the data (e.g. if one individual was observed, we can assume occurrenceStatus = present). A sketch of how this could be done follows below.
Check taxon names
The next step is to check that the taxon names match with WoRMS. We will use obistools::match_taxa and then inspect the output to make sure all the names in the Occurrence table match with WoRMS and are marine. Note that the code below will not be executable, but we provide the relevant output below the code chunk.
worms<-match_taxa(unique(occur$scientificName), ask=TRUE) # conduct taxon matching by telling the function to only match the unique species names in the scientificName column
## If you did not already have the scientificNameID field in the dataset, you could use the code below to attach the worms data with the Occurrence table
## Note that we set all=TRUE --> this will keep every record in the worms table which can be handy if there are any errors with taxon names. The matched name will appear in the Occurrence table with NAs for any non-WoRMS field
occur_match<-merge(occur, worms, by="scientificName", all= TRUE) # merges the worms dataframe with the occurrence table (occur) by matching according to scientificName
colnames(occur_match)[colnames(occur_match) == "scientificNameID.y"] = "scientificNameID" # rename the new column as scientificNameID
We see that one taxon has a near match, “Sebastes melanops” (Table 1), and when we look for this species in the Occurrence table (Table 2), we see that the name presumably had a typo, as it is written as “Sebastes melanps”. Changes to the scientificName can be made when you are certain, and verbatimIdentification can be populated with the original name.
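For example, if we are confident the name is a misspelling of Sebastes melanops, one way to correct it while keeping the original spelling is sketched below (the verbatimIdentification column is created here if it does not already exist):
# Keep the original (misspelled) name in verbatimIdentification, then fix scientificName
if (!"verbatimIdentification" %in% colnames(occur)) occur$verbatimIdentification <- NA
typo <- !is.na(occur$scientificName) & occur$scientificName == "Sebastes melanps"
occur$verbatimIdentification[typo] <- "Sebastes melanps"
occur$scientificName[typo] <- "Sebastes melanops"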
Check coordinates
Next, we’ll visualize the data using obistools::plot_map_leaflet(). This tool allows us to quickly spot any coordinates that might be out of place, potentially indicating that longitude and latitude values have been swapped. We’ll use obistools::plot_map_leaflet instead of obistools::plot_map because it opens an interactive leaflet map, allowing us to identify the row number of any coordinates that fall outside the expected geographic area. Be sure to zoom out to view the entire map, as some coordinates might be significantly distant from the expected range. Use this function below to investigate coordinates.
obistools::plot_map_leaflet(event)
Explore the map
There are a few coordinates at the edges of the map. Clicking on these points reveals they correspond to row 46 in the Event table. Look carefully at the Event table to verify whether these coordinates are indeed swapped, and check for any other rows that might have similar issues, since points may be overlapping. Then correct the issue; one way to swap a pair of coordinates is sketched below.
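A minimal sketch of exchanging the latitude and longitude of a row identified on the map (the row number here is taken from the example above):
# Swap decimalLatitude and decimalLongitude for a row where they were reversed
swapped_row <- 46
event[swapped_row, c("decimalLatitude", "decimalLongitude")] <-
  event[swapped_row, c("decimalLongitude", "decimalLatitude")]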
Check depths and on-land
Next we will check that all records fall within the expected depth range, and whether any coordinates are marked as being on land. We will use obistools::check_depth() and obistools::check_onland(), respectively, for these checks. We can specify report=TRUE for both functions so that errors are returned instead of records, which provides a more easily understandable report. Additionally, providing a value to depthmargin (for check_depth) or buffer (for check_onland) will allow us to indicate how much a given coordinate can deviate from the bathymetry of the reference raster(s).
Modify the code below to run these checks.
obistools::check_depth()
obistools::check_onland()
obistools::check_depth(event, report=TRUE, depthmargin = 30)
obistools::check_onland(event,report=TRUE, buffer=20)
Interpret the results
We got the following warnings:
- Depth value is greater than the value found in the bathymetry raster
- Longitude is outside the bounds of the provided raster (-180 180)
- Latitude is outside the bounds of the provided raster (-90 90)
- No bathymetry value found for coordinate
Some of these warnings will appear because certain rows are missing latitude/longitude coordinates or contain invalid coordinates (e.g., swapped latitude and longitude). In these cases, the warnings can be ignored. But it’s important to check each row, resolve any issues, and rerun the QC checks to ensure all problems are addressed.
Now, about the depth warnings…
When records are flagged for having depths exceeding the bathymetry layer, these records don’t necessarily need to be removed. The raster used by OBIS to check depth values can sometimes be at a coarser resolution than actual conditions. The warning indicates a potential issue with the depth, but the reported depth may still be accurate despite not being detected by the raster. Similarly, points flagged as being ON_LAND may sometimes be due to map resolution discrepancies rather than actual errors.
Remember, QC flags are meant to warn us of potential data issues. However, the data may still be accurate, and flags can occur due to limitations in the data resolution of reference materials.
Make sure to review all the warnings to determine if any actions are needed.
Check table identifiers
We need to check that eventID and parentEventID are present and correspond with one another, for which we’ll use obistools::check_eventids(). Then, to ensure all eventIDs in the extension table(s) have matching eventIDs in the Event core table, we’ll use obistools::check_extension_eventids().
Modify the code below to run the checks.
obistools::check_eventids()
obistools::check_extension_eventids(event, extension)
obistools::check_eventids(event)
obistools::check_extension_eventids(event,occur)
Interpret the results
The output from check_eventids() tells us that there is a duplicated eventID in row 15 (this is assuming we have already added the missing parentEventID column). The output from check_extension_eventids() indicates that row 2 in the Occurrence table has an eventID not present in the Event core table. Therefore, we should review these rows and determine how to resolve these issues.
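To see exactly which identifiers are involved, you could pull them out directly (a minimal sketch using base R):
# Which eventID value is duplicated in the Event table?
event$eventID[duplicated(event$eventID)]

# Which Occurrence rows reference an eventID that does not exist in the Event table?
occur[!occur$eventID %in% event$eventID, c("eventID", "occurrenceID")]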
Check date formatting
Next we’ll confirm that all the dates are in ISO 8601 format using obistools::check_eventdate. Use this function in the code block below.
obistools::check_eventdate(event)
Interpret the results
We received the error “eventDate does not seem to be a valid date” for 14 records. The first 13 eventDates are blank and associated with parent events, so we can ignore the warning for them. However, row 26 has an incorrect date format that needs to be corrected.
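How you fix the date in row 26 depends on how it was written; as a purely illustrative example, a date recorded as day/month/year could be reformatted to ISO 8601 like this (the input value here is hypothetical):
# Convert a hypothetical non-ISO date string to ISO 8601 (YYYY-MM-DD)
format(as.Date("15/07/2019", format = "%d/%m/%Y"), "%Y-%m-%d")  # returns "2019-07-15"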
This is also a good time to check the times associated with the dates, where applicable. Remember that times should be in 24-hour format. Also note that 00:00:00 represents midnight, so a time recorded as 00:00:00 may simply mean the time was left blank; unless the event really did occur at midnight, the value may be incorrect and should be confirmed where possible.
Look at the data with Hmisc
The final check we will perform uses the Hmisc::describe function to identify any inconsistencies in the data. We will run this on both the Event and Occurrence tables, highlighting important fields below. Remember, although we will only highlight a few data fields, it is good practice to verify that each column contains data as expected.
The describe function gives a short summary for each data column, including the number of non-blank rows (n), the number of rows missing a value (missing), and the number of distinct values for that column (distinct).
Use this function in the code block below to describe both the Event and Occurrence tables. We’ve included some additional code to demonstrate how you can also plot the results of the describe function.
Hmisc::describe()
# Plot some results of describe
options(grType='plotly')
e<-Hmisc::describe(event)
p<-plot(e)
p[[1]]; p[[2]]
Hmisc::describe(event)
Hmisc::describe(occur)
# Plot some results of describe
options(grType='plotly')
e<-Hmisc::describe(event)
p<-plot(e)
p[[1]]; p[[2]]
Describe: Event table
There are 19 variables and 47 observations.
- eventID - the sample size (n) is 47 but the number of distinct values is 46. We flagged this earlier, and we will want to find that duplicate eventID and correct it.
- minimumDepthInMeters and maximumDepthInMeters - take a look at the highest and lowest values and make sure these are in an expected range for this dataset. There is one value that seems quite extreme given the depth range of the other values.
- decimalLongitude and decimalLatitude - if we hadn’t yet corrected the swapped coordinates identified earlier, we can see in the output here that some values are out of place (e.g. the highest decimalLongitude value is 54.28333, which is well outside the longitude range of the rest of this dataset).
At the bottom of the summary we see a list of variables that have all observations missing, in this case georeferenceRemarks.
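To follow up on the extreme depth value, you could sort the Event table by depth and inspect the top rows (a minimal sketch):
# Show the events with the largest maximum depths to spot the outlier
depth_order <- order(event$maximumDepthInMeters, decreasing = TRUE)
head(event[depth_order, c("eventID", "minimumDepthInMeters", "maximumDepthInMeters")])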
Describe: Occurrence table
There are 9 variables and 34 observations.
- occurrenceID - we see that n=34 but there are only 33 distinct values, an indication that one of the occurrenceIDs is repeated; we should identify and correct the error.
- occurrenceStatus - here we see that n=31, there are 3 missing, and the only distinct value is “present”. This is another indication that some cells are missing this required field, which will need to be corrected.
Conclusion
This tutorial has covered some basic QC steps you can conduct in R to check the data quality of a dataset. You can use the output from this tutorial to help you with exercises in the course, but again please ensure the answers are your own.
Note that we could also use obistools::report() to generate a quick overview of QC issues, but it’s good to make sure you understand each check individually, which is why we have presented it step by step here. Running the code below will open a report.html file in your browser.
obistools::report(event)
obistools::report(occur)