
Introduction

This tutorial is designed to supplement Module 6 of the OBIS/OTGA self-paced course: Contributing and publishing datasets to OBIS. It demonstrates how to run some basic quality control (QC) checks in R. The dataset used in the tutorial is the example fishing dataset, “ExampleDataset-1” as labeled in the course. Remember that whenever we make changes to a file, it is good practice to ensure a copy of the original data remains unchanged. For this QC tutorial, we will check the following:

  • Taxa information,
  • Required fields,
  • Coordinates,
  • Dates,
  • Identifiers,
  • Presence of outliers.

We will use two R packages to check the data: obistools and Hmisc. For this tutorial, all required packages are installed and loaded for you, but you would have to install them if you chose to run these checks on your own computer. Keep in mind that installing obistools requires the devtools package. See here for details on installing obistools. Each code chunk in the tutorial will have a “Run Code” button which will allow you to run the code as it would in your own R instance. You may also be prompted to write your own code at times.
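If you were working on your own machine, the installation would look something like the sketch below (not run in this tutorial; it assumes obistools is still distributed from the iobis/obistools GitHub repository):

```r
# Not run here -- installation sketch for your own computer
install.packages(c("devtools", "Hmisc", "measurements"))  # CRAN packages used in this tutorial
devtools::install_github("iobis/obistools")               # obistools is installed from GitHub via devtools
```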

While this tutorial can be used as an aid for the OBIS/OTGA Course Exercise 6, particularly if you are new to R, please ensure assignment answers are written in your own words.

Note also that this tutorial focuses on demonstrating QC checks in R, but is not comprehensive and you will have to complete certain aspects of QC (e.g. adding missing required columns) on your own.

Load data files

The first step is of course to load our Event and Occurrence tables into R. We have already separated the Excel file “OBIS_Exercise6-1_QC.xlsx” into two separate .csv files to make the QC process easier.

# Read the data files
event<-read.csv("data/QC_event.csv")
occur<-read.csv("data/QC_occur.csv")

Right away we get an error that there’s an invalid string in the data. This is likely due to an encoding error, usually if there’s a symbol or other special character in the data. To get around this, let’s specify the encoding for our data tables.

# Read the data files
event<-read.csv("data/QC_event.csv", fileEncoding = "latin1")
occur<-read.csv("data/QC_occur.csv", fileEncoding = "latin1")

Now the files are being read correctly, hurray!

Check data table structure

Before we get into running the QC checks, it’s important to double-check that each of our data columns is actually being interpreted correctly. We will use the R function str() to look at the structure of our two data tables. The output will tell us the class of each data column. This is important because we want to make sure the columns with numeric data are indeed numeric (i.e. class is either integer or numeric). Use the str() function below to check both the “event” and “occur” tables.

str(event)

str(occur)

Interpret the results

We can see in the Event table that minimumDepthInMeters and maximumDepthInMeters are of class integer and numeric, respectively, which is expected. However, our coordinate fields, decimalLongitude and decimalLatitude, are of class character! Before we tell R to read these columns as numeric, let’s take a look at what’s going on in these two columns.

The Occurrence table appears to be read correctly so there is nothing to fix for that one.

event$decimalLongitude
event$decimalLatitude

There are two rows with some strange formatting for both the longitude and latitude:

  • Row 27: 54¡17’ N and 132¡57’ W
  • Row 46: 54¡10’0.450 N and 132¡59’0.536 W

Convert coordinates to decimal degrees

Let’s convert these to decimal degrees using the conv_unit function from the measurements package. We will replace the incorrect data by manually replacing the contents of the cells, but note that you could fix these errors in more automated ways if you needed to do this for many cells. Once we’ve converted the coordinates to decimal degrees, we’ll go ahead and tell R to consider these columns as numeric.

library(measurements)

new_long<- conv_unit(c('54 17 00','54 10 0.450') , from = "deg_min_sec", to = "dec_deg") #we added 00 to the long/lat for row 27 so that the coordinates convert properly 
new_lat <- conv_unit(c('-132 57 00', '-132 59 0.536'), from = "deg_min_sec", to = "dec_deg")


# Replace data in rows 27 and 46 for our latitude and longitude columns
event[27, c("decimalLatitude", "decimalLongitude")] <- c(new_lat[1], new_long[1])
event[46, c("decimalLatitude", "decimalLongitude")] <- c(new_lat[2], new_long[2])

# Change the longitude and latitude columns to be read as numeric
event$decimalLatitude<-as.numeric(event$decimalLatitude)
event$decimalLongitude<-as.numeric(event$decimalLongitude)

# Double check that the columns are now numeric
str(event$decimalLatitude) 
str(event$decimalLongitude)

Once the coordinates are numeric we can go ahead and start running more OBIS-specific QC checks!

Check required fields

One of the first checks we will do is to make sure all the OBIS-required fields are present in our data. To do this, we will use the obistools::check_fields function. We should make sure to do this for all relevant tables, so we will run the function for both the Event and Occurrence table. Use this function in the code block below to check for required fields in our two data tables.

obistools::check_fields(event)
obistools::check_fields(occur)

Interpret the Results: Event table

We see that the Event table is missing eventDate, parentEventID, and decimalLongitude + decimalLatitude for some rows. We also see warnings that the required Occurrence terms are missing (i.e. scientificName, scientificNameID, basisOfRecord, and occurrenceStatus). We can safely ignore these “missing” fields because they are in the Occurrence table. For the missing coordinates, we would have to take a look at the file to see if the missing ones are simply associated with parent events that don’t have coordinate data, or if there are actually missing coordinates. For any localities with missing coordinates we should attempt to georeference them, as we learned how to do in this course. For the missing eventDate column, we know there is already a column named “date” in the Event table, so we just have to map this column to the Darwin Core term. We can do this in R with the code below.

# Rename the date column to eventDate
colnames(event)[colnames(event) == "date"]<-"eventDate"

Interpret the Results: Occurrence table

As above, we can ignore the warnings about the missing columns that are in the Event table. However, the Occurrence table is missing the basisOfRecord column, and some rows are missing for occurrenceStatus. So we will have to populate these fields based on what we know about the data (e.g. if one individual was observed, we can assume occurrenceStatus = present).
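As a sketch of how you might populate these fields in R (the basisOfRecord value below is an assumption for illustration; use whichever Darwin Core value actually reflects how the data were collected):

```r
# Assumption: all records are human observations, and every recorded occurrence is a presence
occur$basisOfRecord <- "HumanObservation"

# Fill only the blank or missing occurrenceStatus cells
blank <- is.na(occur$occurrenceStatus) | occur$occurrenceStatus == ""
occur$occurrenceStatus[blank] <- "present"
```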

Check taxon names

The next step is to check that the taxon names match with WoRMS. We will use obistools::match_taxa and then inspect the output to make sure all the names in the Occurrence table match with WoRMS and are marine. Note that the code below will not be executable, but we provide the relevant output below the code chunk.

worms<-match_taxa(unique(occur$scientificName), ask=TRUE) # conduct taxon matching by telling the function to only match the unique species names in the scientificName column


## If you did not already have the scientificNameID field in the dataset, you could use the code below to attach the worms data to the Occurrence table
## Note that we set all=TRUE --> this keeps every record from both tables, which can be handy if there are errors with taxon names: any name that failed to match will still appear in the merged table, with NAs in the WoRMS-derived fields

occur_match<-merge(occur, worms, by="scientificName", all= TRUE) # merges the worms dataframe with the occurrence table (occur) by matching according to scientificName

colnames(occur_match)[colnames(occur_match) == "scientificNameID.y"] = "scientificNameID" # rename the new column as scientificNameID

We see that one taxon has a near match, “Sebastes melanops” (Table 1), and when we look for this species in the Occurrence table (Table 2), we see that the name presumably had a typo, as it is written as “Sebastes melanps”. Changes to the scientificName can be made when you are certain, and verbatimIdentification can be populated with the original name.
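Once you are certain of the intended name, the correction could be made along these lines (a sketch; it assumes the misspelling appears exactly as “Sebastes melanps”):

```r
# Keep the original (misspelled) name in verbatimIdentification, then correct scientificName
typo <- occur$scientificName == "Sebastes melanps"
occur$verbatimIdentification[typo] <- occur$scientificName[typo]
occur$scientificName[typo] <- "Sebastes melanops"
```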

Check coordinates

Next, we’ll visualize the data using obistools::plot_map_leaflet(). This tool allows us to quickly spot any coordinates that might be out of place, potentially indicating that longitude and latitude values have been swapped.

We’ll use obistools::plot_map_leaflet instead of obistools::plot_map because it opens an interactive leaflet map, allowing us to identify the row number of any coordinates that fall outside the expected geographic area. Be sure to zoom out to view the entire map, as some coordinates might be significantly distant from the expected range. Use this function below to investigate coordinates.

obistools::plot_map_leaflet(event)

Explore the map

There are a few coordinates at the edges of the map. Clicking on these points reveals they correspond to row 46 in the Event table. Look carefully at the Event table to verify if these coordinates are indeed swapped, and check for any other rows that might have similar issues since points may be overlapping, then correct the issue.
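If you confirm the values really are swapped, one way to correct a row is simply to exchange the two cells, as sketched below for row 46 (repeat for any other affected rows you find):

```r
# Swap the latitude and longitude values back for row 46
tmp <- event$decimalLatitude[46]
event$decimalLatitude[46] <- event$decimalLongitude[46]
event$decimalLongitude[46] <- tmp
```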

Check depths and on-land

Next we will check that all coordinates are within the expected depth range, including if any coordinates are marked as being on land. We will use obistools::check_depth() and obistools::check_onland() respectively for these checks. We can specify report=TRUE for both functions so that errors are returned instead of records, which provides a more easily understandable report. Additionally, providing a value to depthmargin (for check_depth) or buffer (for check_onland) will allow us to indicate how much a given coordinate can deviate from the bathymetry of the reference raster(s).

Modify the code below to run these checks.

obistools::check_depth()
obistools::check_onland()
obistools::check_depth(event, report=TRUE, depthmargin = 30)
obistools::check_onland(event,report=TRUE, buffer=20)

Interpret the results

We got the following warnings:

  • Depth value is greater than the value found in the bathymetry raster
  • Longitude is outside the bounds of the provided raster (-180 180)
  • Latitude is outside the bounds of the provided raster (-90 90)
  • No bathymetry value found for coordinate

Some of these warnings will appear because certain rows are missing latitude/longitude coordinates or contain invalid coordinates (e.g., swapped latitude and longitude). In these cases, the warnings can be ignored. But it’s important to check each row, resolve any issues, and rerun the QC checks to ensure all problems are addressed.

Now, about the depth warnings…

When records are flagged for having depths exceeding the bathymetry layer, these records don’t necessarily need to be removed. The raster used by OBIS to check depth values can sometimes be at a coarser resolution than actual conditions. The warning indicates a potential issue with the depth, but the reported depth may still be accurate despite not being detected by the raster. Similarly, points flagged as being ON_LAND may sometimes be due to map resolution discrepancies rather than actual errors.

Remember, QC flags are meant to warn us of potential data issues. However, the data may still be accurate, and flags can occur due to limitations in the data resolution of reference materials.

Make sure to review all the warnings to determine if any actions are needed.

Check table identifiers

We need to check that eventID and parentEventID are present and correspond with one another, for which we’ll use obistools::check_eventids(). Then, to ensure all eventIDs in the extension table(s) have matching eventIDs in the Event core table, we’ll use obistools::check_extension_eventids().

Modify the code below to run the checks.

obistools::check_eventids()
obistools::check_extension_eventids(event, extension)
obistools::check_eventids(event)
obistools::check_extension_eventids(event,occur)

Interpret the results

The output from check_eventids() tells us that there is a duplicated eventID in row 15 (this is assuming we have already added the missing parentEventID column). The output from check_extension_eventids() indicates that row 2 in the Occurrence table has an eventID not present in the Event core table. Therefore, we should review these rows and determine how to resolve these issues.
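To track down the flagged values yourself, base R makes this straightforward (a quick sketch using duplicated() and setdiff()):

```r
# Which eventID value is duplicated in the Event table?
event$eventID[duplicated(event$eventID)]

# Which Occurrence eventIDs have no match in the Event core table?
setdiff(occur$eventID, event$eventID)
```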

Check date formatting

Next we’ll confirm that all the dates are in ISO 8601 formatting using obistools::check_eventdate. Use this function in the code block below.

obistools::check_eventdate(event)

Interpret the results

We received the error “eventDate does not seem to be a valid date” for 14 records. The first 13 eventDates are blank and associated with parent events, so we can ignore the warning for them. However, row 26 has an incorrect date format that needs to be corrected.
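You can print the offending cell to see exactly what was entered; and if, hypothetically, the date turned out to be written day/month/year, a conversion might look like the second line below (the "15/06/2018" value is purely illustrative, not from the dataset):

```r
# Inspect the malformed date in row 26
event$eventDate[26]

# Hypothetical example: converting a d/m/Y string to ISO 8601 (YYYY-MM-DD)
format(as.Date("15/06/2018", format = "%d/%m/%Y"))  # "2018-06-15"
```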

This is also a good time to check the times associated with the dates, where applicable. Remember, times should be in 24-hour format. Also keep in mind that 00:00:00 represents midnight; times recorded as 00:00:00 because the field was left blank may therefore be incorrect if the event did not actually occur at midnight, and event time confirmation may be necessary (when possible, of course).

Look at the data with Hmisc

The final check we will perform is using the Hmisc::describe function to identify any inconsistencies in the data. We will be running this on both the Event and Occurrence table, highlighting important fields below. Remember, although we will only highlight a few data fields, it is good practice to verify that each column contains data as expected.

The describe function gives a short summary for each data column, including the number of non-blank rows (n), the number of rows missing a value (missing), and the number of distinct values for that column (distinct).

Use this function in the code block below to describe both the Event and Occurrence tables. We’ve included some additional code to demonstrate how you can also plot the results of the describe function.

Hmisc::describe()

# Plot some results of describe
options(grType='plotly')
e<-Hmisc::describe(event)
p<-plot(e)
p[[1]]; p[[2]]
Hmisc::describe(event)
Hmisc::describe(occur)

# Plot some results of describe
options(grType='plotly')
e<-Hmisc::describe(event)
p<-plot(e)
p[[1]]; p[[2]]

Describe: Event table

There are 19 variables and 47 observations.

eventID - the sample size (n) is 47 but the number of distinct values is 46. We identified this earlier; we’ll want to find that duplicate eventID and correct it

minimumDepthInMeters and maximumDepthInMeters - take a look at highest and lowest values and make sure these are in an expected range for this dataset. There is one value that seems quite extreme given the depth range of the other values
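A quick way to eyeball the spread of depth values is with range() (a sketch; extreme outliers will stand out immediately):

```r
# Check the spread of the depth columns; na.rm = TRUE skips any missing values
range(event$minimumDepthInMeters, na.rm = TRUE)
range(event$maximumDepthInMeters, na.rm = TRUE)
```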

decimalLongitude and decimalLatitude - if we hadn’t yet corrected the swapped coordinates as identified earlier, we can see in the output here that some values are out of place (e.g. the highest decimalLongitude value is 54.28333, which matches this dataset’s latitudes rather than its longitudes, a sign that coordinates were swapped)

At the bottom of the summary we see a list of variables that have all observations missing, in this case georeferenceRemarks.

Describe: Occurrence table

There are 9 variables and 34 observations.

occurrenceID - we see that n=34 but there are only 33 distinct values, an indication that one of the occurrenceIDs is repeated and we should identify and correct the error

occurrenceStatus - here we see that n=31, there are 3 missing, and the only distinct value is “present”. This is another indication that some cells are missing the required field, which will need to be corrected

Conclusion

This tutorial has covered some basic QC steps you can conduct in R to check the data quality of a dataset. You can use the output from this tutorial to help you with exercises in the course, but again please ensure the answers are your own.

Note we could also use obistools::report() to generate a quick overview of QC issues, but it’s good to make sure you understand each check individually, which is why we have presented it step by step here. Running this code will open a report.html file in your browser.

obistools::report(event)
obistools::report(occur)
