packages <- c("dismo", "gbm", "rgeos", "ENMeval", "maps", "maptools", "raster", "rgbif", "adehabitatHR", "sp", "tidyverse", "sf", "mapview", "sqldf", "roxygen2", "testthat")
#lapply(packages, install.packages)
sapply(packages, library, character.only = T)

Step 1: Gather coordinate data

GBIF is a great place to start your search for coordinate information, simply because of how many sources it aggregates. A few fields that I find most helpful for judging whether coordinates are reasonable are the ones used below (in the ‘fields’ argument). stateProvince, country, county, and locality are always good to have for georeferencing purposes and for checking that coordinates fall where they should. issues is a good field for whittling records down.

SPEC <- occ_search(scientificName = "Dipodomys compactus", fields = c('name', 'stateProvince', 'country', 'county', 'locality', 'decimalLongitude', 'decimalLatitude', 'geodeticDatum', 'datasetName', 'institutionCode', 'year', 'issues'), limit = 1000)
SPEC
## Records found [635] 
## Records returned [635] 
## No. unique hierarchies [1] 
## No. media records [15] 
## No. facets [0] 
## Args [limit=1000, offset=0, scientificName=Dipodomys compactus,
##      fields=name,stateProvince,country,county,locality,decimalLongitude,decimalLatitude,geodeticDatum,datasetNam] 
## # A tibble: 635 x 12
##    name      decimalLongitude decimalLatitude  year issues   geodeticDatum
##    <chr>                <dbl>           <dbl> <int> <chr>    <chr>        
##  1 Dipodomy…            -97.3            27.5  2018 cdround… WGS84        
##  2 Dipodomy…            -97.3            27.5  2018 cdround… WGS84        
##  3 Dipodomy…            -97.3            27.3  2016 cdround… WGS84        
##  4 Dipodomy…            -97.2            27.6  2016 cdround… WGS84        
##  5 Dipodomy…            -97.2            27.7  2015 gass84   WGS84        
##  6 Dipodomy…            -97.2            26.0  2015 gass84   WGS84        
##  7 Dipodomy…            -97.3            27.4  2014 gass84   WGS84        
##  8 Dipodomy…            -97.3            27.3  2014 gass84   WGS84        
##  9 Dipodomy…            -97.1            27.7  2014 gass84   WGS84        
## 10 Dipodomy…            -97.2            27.6  2013 gass84   WGS84        
## # ... with 625 more rows, and 6 more variables: country <chr>,
## #   datasetName <chr>, institutionCode <chr>, stateProvince <chr>,
## #   county <chr>, locality <chr>

There are over 600 records of Dipodomys compactus on GBIF. Not all of them are likely great, though.

The maps::map function is helpful for providing quick boundaries. You’ll notice that I’m calling maps::map instead of just map. That’s because I also have purrr loaded, which has its own map function (great when dealing with lists) that masks the one from maps.
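
If you ever want to confirm which map you would actually get, you can ask R where each function lives (a quick sketch of the masking issue, separate from the workflow itself):

environment(map) #likely namespace:purrr once the tidyverse is loaded
environment(maps::map) #namespace:maps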

plot(SPEC$data[, c("decimalLongitude", "decimalLatitude")], pch = 19, col = "red")
maps::map(add = T)

levels(as.factor(SPEC$data$stateProvince))
## [1] "California" "Oklahoma"   "Tamaulipas" "Texas"

What is the range of D. compactus, again?

Oftentimes, a placeholder of 0, 0 is used for records missing coordinates, which can be a pain to deal with. Still, that doesn’t explain the Oklahoma or California records. Always check the accepted/acceptable range of your study species.

For the purposes of our workshop today, we will get rid of all the records without coordinate information. In an actual study, you would set these to the side and georeference them (if possible).

dplyr::filter is a powerful tool that acts like the base::subset function but becomes very useful if you start creating pipelines with the magrittr package.

OCC <- filter(SPEC$data, !is.na(decimalLongitude))
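
If you prefer pipelines, the same step can be written with the magrittr pipe (a quick sketch that produces the same OCC as the line above):

OCC <- SPEC$data %>%
  filter(!is.na(decimalLongitude))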

base::duplicated determines which rows have duplicated values in the decimalLongitude and decimalLatitude columns, and there are quite a lot of them.

DUPL <- duplicated(OCC[ , c("decimalLongitude", "decimalLatitude")]) 

The base::duplicated function returns a logical vector with one element for each row of the data.frame we entered. As such, you can use the base::sum function to find out how many duplicated coordinates we have.

sum(DUPL) 
## [1] 334

Take a look at the duplicate rows just to make sure everything worked.

head(OCC[DUPL, ])
## # A tibble: 6 x 12
##   name         decimalLongitude decimalLatitude  year issues geodeticDatum
##   <chr>                   <dbl>           <dbl> <int> <chr>  <chr>        
## 1 Dipodomys c…            -97.2            26.1  1999 cdrou… WGS84        
## 2 Dipodomys c…            -97.2            26.1  1999 cdrou… WGS84        
## 3 Dipodomys c…            -97.2            26.1  1999 cdrou… WGS84        
## 4 Dipodomys c…            -97.2            26.1  1999 cdrou… WGS84        
## 5 Dipodomys c…            -97.2            27.7  1996 gass84 WGS84        
## 6 Dipodomys c…            -97.2            27.7  1996 gass84 WGS84        
## # ... with 6 more variables: country <chr>, datasetName <chr>,
## #   institutionCode <chr>, stateProvince <chr>, county <chr>,
## #   locality <chr>

This takes out the duplicated rows (remember, ! is the logical NOT operator; you can use it to take the opposite of something).

OCC <- OCC[!DUPL, ] 
dim(OCC)
## [1] 98 12

Well… Those 600 records certainly went away fast.

But you can still do so much more to get so much less info!

gbif_issues()
##         code                              issue
## 1        bri            BASIS_OF_RECORD_INVALID
## 2        ccm         CONTINENT_COUNTRY_MISMATCH
## 3        cdc CONTINENT_DERIVED_FROM_COORDINATES
## 4      conti                  CONTINENT_INVALID
## 5       cdiv                 COORDINATE_INVALID
## 6      cdout            COORDINATE_OUT_OF_RANGE
## 7      cdrep             COORDINATE_REPROJECTED
## 8     cdrepf     COORDINATE_REPROJECTION_FAILED
## 9     cdreps COORDINATE_REPROJECTION_SUSPICIOUS
## 10   cdround                 COORDINATE_ROUNDED
## 11   cucdmis        COUNTRY_COORDINATE_MISMATCH
## 12      cudc   COUNTRY_DERIVED_FROM_COORDINATES
## 13      cuiv                    COUNTRY_INVALID
## 14       cum                   COUNTRY_MISMATCH
## 15    depmms              DEPTH_MIN_MAX_SWAPPED
## 16     depnn                  DEPTH_NON_NUMERIC
## 17   depnmet                   DEPTH_NOT_METRIC
## 18    depunl                     DEPTH_UNLIKELY
## 19     elmms          ELEVATION_MIN_MAX_SWAPPED
## 20      elnn              ELEVATION_NON_NUMERIC
## 21    elnmet               ELEVATION_NOT_METRIC
## 22     elunl                 ELEVATION_UNLIKELY
## 23    gass84       GEODETIC_DATUM_ASSUMED_WGS84
## 24    gdativ             GEODETIC_DATUM_INVALID
## 25   iddativ            IDENTIFIED_DATE_INVALID
## 26  iddatunl           IDENTIFIED_DATE_UNLIKELY
## 27    mdativ              MODIFIED_DATE_INVALID
## 28   mdatunl             MODIFIED_DATE_UNLIKELY
## 29  muldativ            MULTIMEDIA_DATE_INVALID
## 30  muluriiv             MULTIMEDIA_URI_INVALID
## 31 preneglat          PRESUMED_NEGATED_LATITUDE
## 32 preneglon         PRESUMED_NEGATED_LONGITUDE
## 33   preswcd        PRESUMED_SWAPPED_COORDINATE
## 34    rdativ              RECORDED_DATE_INVALID
## 35     rdatm             RECORDED_DATE_MISMATCH
## 36   rdatunl             RECORDED_DATE_UNLIKELY
## 37  refuriiv             REFERENCES_URI_INVALID
## 38  txmatfuz                  TAXON_MATCH_FUZZY
## 39   txmathi             TAXON_MATCH_HIGHERRANK
## 40  txmatnon                   TAXON_MATCH_NONE
## 41 typstativ                TYPE_STATUS_INVALID
## 42    zerocd                    ZERO_COORDINATE
##                                                                                                                                                        description
## 1                                                     The given basis of record is impossible to interpret or seriously different from the recommended vocabulary.
## 2                                                                                                           The interpreted continent and country do not match up.
## 3                                                                      The interpreted continent is based on the coordinates, not the verbatim string information.
## 4                                                                                                                          Uninterpretable continent values found.
## 5                                                                                          Coordinate value given in some form but GBIF is unable to interpret it.
## 6                                                                                            Coordinate has invalid lat/lon values out of their decimal max range.
## 7                                                                   The original coordinate was successfully reprojected from a different geodetic datum to WGS84.
## 8                                                          The given decimal latitude and longitude could not be reprojected to WGS84 based on the provided datum.
## 9                    Indicates successful coordinate reprojection according to provided datum, but which results in a datum shift larger than 0.1 decimal degrees.
## 10                                                                                                         Original coordinate modified by rounding to 5 decimals.
## 11                                                                                   The interpreted occurrence coordinates fall outside of the indicated country.
## 12                                                                       The interpreted country is based on the coordinates, not the verbatim string information.
## 13                                                                                                                           Uninterpretable country values found.
## 14                                                                                  Interpreted country for dwc:country and dwc:countryCode contradict each other.
## 15                                                                                                                                         Set if supplied min>max
## 16                                                                                                                             Set if depth is a non numeric value
## 17                                                               Set if supplied depth is not given in the metric system, for example using feet instead of meters
## 18                                                                                                                Set if depth is larger than 11.000m or negative.
## 19                                                                                                                             Set if supplied min > max elevation
## 20                                                                                                                         Set if elevation is a non numeric value
## 21                                                           Set if supplied elevation is not given in the metric system, for example using feet instead of meters
## 22                                                                                Set if elevation is above the troposphere (17km) or below 11km (Mariana Trench).
## 23                        Indicating that the interpreted coordinates assume they are based on WGS84 datum as the datum was either not indicated or interpretable.
## 24                                                                                                              The geodetic datum given could not be interpreted.
## 25                                                                                The date given for dwc:dateIdentified is invalid and cant be interpreted at all.
## 26                                                                          The date given for dwc:dateIdentified is in the future or before Linnean times (1700).
## 27                                                        A (partial) invalid date is given for dc:modified, such as a non existing date, invalid zero month, etc.
## 28                                                                                   The date given for dc:modified is in the future or predates unix time (1970).
## 29                                                                                                 An invalid date is given for dc:created of a multimedia object.
## 30                                                                                                                An invalid uri is given for a multimedia object.
## 31                                                                                                      Latitude appears to be negated, e.g. 32.3 instead of -32.3
## 32                                                                                                     Longitude appears to be negated, e.g. 32.3 instead of -32.3
## 33                                                                                                                    Latitude and longitude appear to be swapped.
## 34                                                                        A (partial) invalid date is given, such as a non existing date, invalid zero month, etc.
## 35                                                     The recording date specified as the eventDate string and the individual year, month, day are contradicting.
## 36                  The recording date is highly unlikely, falling either into the future or represents a very old date before 1600 that predates modern taxonomy.
## 37                                                                                                                      An invalid uri is given for dc:references.
## 38                                                                             Matching to the taxonomic backbone can only be done using a fuzzy, non exact match.
## 39                                                               Matching to the taxonomic backbone can only be done on a higher rank and not the scientific name.
## 40 Matching to the taxonomic backbone cannot be done cause there was no match at all or several matches with too little information to keep them apart (homonyms).
## 41                                                        The given type status is impossible to interpret or seriously different from the recommended vocabulary.
## 42                                                                                 Coordinate is the exact 0/0 coordinate, often indicating a bad null coordinate.
summary(as.factor(OCC$issues))
##                                                       cdrep 
##                             9                             9 
##                 cdrep,cdround                       cdround 
##                             1                            33 
##                cdround,gass84 cdround,gass84,rdativ,txmathi 
##                             6                             4 
##        cucdmis,txmathi,zerocd                        gass84 
##                             1                            19 
##                 gass84,gdativ                 gass84,rdativ 
##                             6                             1 
##                  gass84,rdatm                      iddatunl 
##                             1                             1 
##                        rdativ                rdativ,txmathi 
##                             1                             2 
##                         rdatm 
##                             4

The best way I’ve found to filter out specific issues (due to how they are stored in the issues column) is to run a grep function and use the | symbol to specify everything you want to get rid of.

grep("cdiv|cdrepf|cdreps|txmatnon|cucdmis|zerocd", OCC$issues)
## [1] 39

Because grep returns a vector of the row numbers that match the pattern, we can use - to negate those numbers and keep everything else.

dim(OCC[-grep("cdiv|cdrepf|cdreps|txmatnon|cucdmis|zerocd", OCC$issues), ])
## [1] 97 12
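
One caveat about negative indexing (and a sketch of an alternative; BAD is just an illustrative name): if the pattern matches nothing, grep returns integer(0), and OCC[-integer(0), ] drops every row instead of keeping them all. base::grepl returns a logical vector instead, which avoids that edge case.

BAD <- grepl("cdiv|cdrepf|cdreps|txmatnon|cucdmis|zerocd", OCC$issues)
dim(OCC[!BAD, ]) #same result as the -grep version above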

There are still a bunch of things you can do to clean up your data (check for outliers, make sure the georeferencing was done correctly, etc.).

HOWEVER, let’s all start from the same point.

PRES <- read.csv("Dipo_presence_OSOS.csv", header = T)

Now that we have our presence data, what else do we need?

Step 2: Gather your raster data

We have used the raster::getData function to grab GADM data for administrative boundaries, but it can also be used to grab the oft-used Bioclim dataset of 19 bioclimatic variables.

PRED <- raster::getData("worldclim", var = "bio", res = 2.5)
PRED
## class       : RasterStack 
## dimensions  : 3600, 8640, 31104000, 19  (nrow, ncol, ncell, nlayers)
## resolution  : 0.04166667, 0.04166667  (x, y)
## extent      : -180, 180, -60, 90  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## names       :  bio1,  bio2,  bio3,  bio4,  bio5,  bio6,  bio7,  bio8,  bio9, bio10, bio11, bio12, bio13, bio14, bio15, ... 
## min values  :  -278,     9,     8,    64,   -86,  -559,    53,  -278,  -501,  -127,  -506,     0,     0,     0,     0, ... 
## max values  :   319,   213,    96, 22704,   489,   258,   725,   376,   365,   382,   289, 10577,  2437,   697,   265, ...

You can see that its class is a RasterStack along with its resolution, dimensions, extent, and the names of the variables.

Step 3: Gather your background data

The background points give the algorithm information about the environmental space available in your study region, so they should reflect a reasonable amount of that region. Note that if you increase the size of your study region, you can inflate your AUC (by increasing the chance of “correctly” identifying areas as absent of the species). A current reasonable recommendation is to take the extent of your points plus a buffer of one-half to one degree (but see Anderson, R.P. and A. Raza. 2010. J. Biogeogr. The effect of the extent of the study region on GIS models of species geographic distributions and estimates of niche evolution: preliminary tests with montane rodents (genus Nephelomys) in Venezuela).

MCP <- mcp(SpatialPoints(PRES, proj4string = CRS("+proj=longlat +datum=WGS84")), percent = 100) #creates a polygon that encapsulates all of your chosen points
#Take a look at your polygon and points
plot(MCP)
points(PRES)
maps::map(add = T)

raster::extent gives the geographical extent of the chosen spatial object (in this case, our polygon).

extent(MCP)
## class       : Extent 
## xmin        : -99.76694 
## xmax        : -97.11967 
## ymin        : 24.5504 
## ymax        : 29.23
attributes(MCP)
## $data
##   id         area
## a  a 0.0006752934
## 
## $polygons
## $polygons[[1]]
## An object of class "Polygons"
## Slot "Polygons":
## [[1]]
## An object of class "Polygon"
## Slot "labpt":
## [1] -98.21859  27.21397
## 
## Slot "area":
## [1] 6.752934
## 
## Slot "hole":
## [1] FALSE
## 
## Slot "ringDir":
## [1] 1
## 
## Slot "coords":
##          lon      lat
## 4  -97.11967 27.74800
## 48 -97.15222 25.82250
## 18 -97.66009 24.55040
## 12 -98.87953 26.55722
## 51 -99.11167 27.02947
## 53 -99.76694 28.56994
## 23 -98.79000 29.23000
## 20 -98.65740 29.22615
## 24 -98.65000 29.22000
## 54 -97.23447 27.98695
## 41 -97.11967 27.74800
## 
## 
## 
## Slot "plotOrder":
## [1] 1
## 
## Slot "labpt":
## [1] -98.21859  27.21397
## 
## Slot "ID":
## [1] "a"
## 
## Slot "area":
## [1] 6.752934
## 
## 
## 
## $plotOrder
## [1] 1
## 
## $bbox
##         min       max
## x -99.76694 -97.11967
## y  24.55040  29.23000
## 
## $proj4string
## CRS arguments:
##  +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## 
## $class
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
## 
## $comment
## [1] "FALSE"

Take a look at what makes up the SpatialPolygonsDataFrame that is MCP. In particular, we are interested in $bbox. A SpatialPolygonsDataFrame is an S4 object in R, which means its slots are accessed with @.
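
You can poke at those slots directly (a quick check using functions from the methods package that base R attaches by default):

slotNames(MCP)
MCP@bbox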

EXT <- extent(c(MCP@bbox[, 1] - 1, MCP@bbox[, 2] + 1)[c(1, 3, 2, 4)]) #pad the bounding box by 1 degree on each side, then reorder the values to xmin, xmax, ymin, ymax

And now we have the extent of our study!

The dismo::randomPoints function is useful for generating random background points within an area. It uses a raster mask, restricted by an extent object, to generate n points within our study region.

PNTS <- randomPoints(PRED, n = 10000, ext = EXT, extf = 1)

A note about background points: the number chosen is usually recommended to be large (e.g., 10,000), especially for MaxEnt. However, the number of background points that works best seems to differ with algorithm choice. For more information, see Barbet-Massin, M. et al. 2012. Methods in Ecology and Evolution. Selecting pseudo-absences for species distribution models: how, where and how many?

When using natural history collection specimens, it is often recommended to apply the same bias in selecting background points as is found in collection patterns. But how do we determine this bias? You can see collection bias by obtaining records of similar species (for this example, we will grab records of all heteromyid and cricetid rodents). We will use these records to create a background mask and then bias the selection of background points to within this mask.

#DO NOT RUN the below lines of code
#The rgbif::occ_search functions here are grabbing all cricetid and heteromyid records with coordinate information from Texas and Tamaulipas. This will run for a while.
#####
CRIC <- occ_search(taxonKey = 3240723, stateProvince = c("Texas", "Tamaulipas"), hasCoordinate = T, fields = c("name", "decimalLongitude", "decimalLatitude"), limit = 200000)
HMYD <- occ_search(taxonKey = 5504, stateProvince = c("Texas", "Tamaulipas"), hasCoordinate = T, fields = c("name", "decimalLongitude", "decimalLatitude"), limit = 200000)
#when querying for multiple options (such as here for both Texas and Tamaulipas), the result is in a list, requiring indexing of the different elements of the list to grab the $data portion
MASK <- data.frame(rbind(CRIC[[1]]$data, CRIC[[2]]$data, HMYD[[1]]$data, HMYD[[2]]$data))
coordinates(MASK) <- ~decimalLongitude + decimalLatitude
#I often find it easier to use the sp::coordinates function to turn a data.frame into a SpatialPointsDataFrame when I am trying to crop it to a certain geographic extent
MASK <- crop(MASK, EXT)
#the raster::crop function is handy because it allows us to restrict the points to the extent of our study area (the EXT object)
#####

DO RUN the code below. Since searching for 50,000 records takes a while, just read in the results from the above few lines of code and turn them into a SpatialPointsDataFrame.

MASK <- read.csv("enm_mask_OSOS.csv", header = T, row.names = 1)
coordinates(MASK) <- ~decimalLongitude + decimalLatitude
plot(MASK, pch = 19, col = "blue", cex = 0.4)
maps::map(add = T)
maps::map(add = T, "county", "texas")

The maps::map function unfortunately doesn’t have Mexican municipalities, so we can only show country boundaries and Texas county boundaries. It is slightly messy, but it is easy and doesn’t take absurd amounts of time to plot.

The dismo::circles function will take the points from the MASK object and create circles of a 50km radius around them.

CIRC <- circles(MASK, d = 50000, lonlat = T)
plot(CIRC)
maps::map(add = T)

You can see that overlapping polygons are dissolved into one another.

CIRC@polygons
## class       : SpatialPolygons 
## features    : 1 
## extent      : -101.2812, -95.66719, 23.12183, 30.67529  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0

The result of the function is one SpatialPolygon.

This line uses both the raster::crop and raster::rasterToPolygons functions to crop the RasterStack to our study extent and then turn it into a SpatialPolygonsDataFrame. “But why do this? We have the circle polygon and can just grab random points from within it.” Well, you can see that a not-so-insignificant portion of the polygon is over the Gulf, and we don’t want points landing in the ocean. Also, the border of, say, a GADM polygon isn’t the same as the border of the Bioclim raster data, so we could still get points with NA values even if they technically aren’t in the Gulf.

RAST <- rasterToPolygons(crop(PRED, EXT))

The rgeos::gIntersection function grabs the intersecting bits of the circle polygon (CIRC@polygons) and our raster polygon (RAST).

POL <- gIntersection(CIRC@polygons, RAST)
plot(POL)

Check out our new background mask!

Here the sp::spsample function will sample 10,000 points randomly within our POL object. The iter argument gives the number of times to try grabbing random points before declaring failure.

PNTS <- spsample(POL, 10000, type = "random", iter = 25)
plot(PNTS, pch = 19, cex = 0.2, col = "blue")
maps::map(add = T)

Look at that spread of random background points!

Now, let’s turn it into a data.frame and give it more useful column names than just x and y.

ABSV <- as.data.frame(PNTS)
colnames(ABSV) <- c("lon", "lat")

So now we have coordinate data and raster data. Where do we go from here?

Step 4: Building a model

The BIOCLIM algorithm for environmental niche modeling isn’t used much anymore (other algorithms often outperform it except in a few special cases). However, it is easy to use and doesn’t require you to go through some R version of a SAW movie’s torture chamber to run it (looking at you, MaxEnt).

All that the dismo::bioclim function needs is a raster of predictor variables and the coordinates. The function will extract the needed information for each of the records (probably using raster::extract, but I haven’t checked).

BCLM <- bioclim(PRED, PRES)
plot(BCLM)

Cool! It’s a plot of something. It appears to show each record’s values for a pair of bioclimatic variables (the default appears to be the first two), with a box drawn around a certain percentage of the points.

plot(BCLM, a = 1, b = 2, p = 0.85)

You can adjust the axes and the size of the box using the arguments a, b, and p.

Now things get slightly weird. We are using the raster::predict function to create a raster prediction from the model object (BCLM), restricted to the extent (EXT). Simple enough. However, there is also a dismo::predict function that does the same thing but takes the model object before the raster object. The raster and dismo packages have a few authors in common, which likely led to this. I believe the raster version of predict is applicable to a wider variety of model objects, so that is the version I normally use.

BCP <- predict(PRED, BCLM, ext = EXT, progress = '')
BCP
## class       : RasterLayer 
## dimensions  : 161, 111, 17871  (nrow, ncol, ncell)
## resolution  : 0.04166667, 0.04166667  (x, y)
## extent      : -100.75, -96.125, 23.54167, 30.25  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## data source : in memory
## names       : layer 
## values      : 0, 0.5  (min, max)
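
For comparison, the dismo argument order looks like this (a sketch; BCP2 is just an illustrative name, and the result should match BCP):

BCP2 <- dismo::predict(BCLM, PRED, ext = EXT)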

Look! It’s a RasterLayer. Let’s plot it! Plot the raster prediction, add our points for context, and add country boundaries.

Congrats, you have a visualization of an ENM that you can use (but that is likely worthless because it’s BIOCLIM and we’ve done the minimum because this is a 2 hour workshop).

plot(BCP)
points(PRES, pch = 19, col = "blue", cex = 0.5)
maps::map(add = T)

Step 5: Evaluating your model

The dismo::evaluate function takes presence coordinates, background coordinates, a model object and a RasterStack of predictor variables to generate an evaluation of the model.

ME <- evaluate(PRES, ABSV, BCLM, PRED)
ME
## class          : ModelEvaluation 
## n presences    : 54 
## n absences     : 10000 
## AUC            : 0.8702593 
## cor            : 0.0945532 
## max TPR+TNR at : 0.00477037

You can see that it gives the number of presence and absence/background points used and the AUC of the model.

AUC (area under the curve) is probably the most used evaluation statistic for ecological niche models. It ranges from 0 to 1, with a score of 0.5 meaning the model is essentially random and a score of 1 meaning a perfect prediction of the species distribution.

It is also a highly contentious statistic, especially in recent literature. Unfortunately given the prevalence of presence-only modeling techniques, it seems to be necessary for now despite its flaws (and most evaluation statistics seem to be similarly flawed).

plot(ME, "ROC")

Thresholds are used in ENMs to display the predictions in a binary way (presence vs. absence). If you type ?threshold and check under the Value section, you will get an idea of the options available to set as a threshold. For more insight into the use of thresholds in ENMs, see Norris, D. 2014. Tropical Conservation Science. Model thresholds are more important than presence location type: understanding the distribution of lowland tapir (Tapirus terrestris) in a continuous Atlantic forest of southeast Brazil.

threshold(ME, "no_omission")
## [1] 0.01841852
plot(BCP > threshold(ME, "no_omission"))
maps::map(add = T)

Here I used the minimum presence threshold (“no_omission”), which sets the threshold at the highest value at which no presence point used in the model is omitted.

attributes(ME)

This will send up a lot of horrifying numbers.

One of the most important things to look at here is the confusion table (ME@confusion), which shows the counts of true positives, false positives, false “negatives”, and true “negatives” for each threshold (ME@t). I used quotes around “negatives” because in a presence-only framework there are no absences, only background points.
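
A quick way to line the thresholds up with their confusion counts (just a peek at the slots described above):

head(cbind(threshold = ME@t, ME@confusion))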

k-fold cross-validation is often used to evaluate the models.

group <- list(kfold(PRES, 5), kfold(ABSV, 5))

dismo::kfold randomly and evenly assigns each record a number from 1 to k (given here as 5). I put these in a list because I really like lists.

DATA <- lapply(seq(1, 5, 1), function(x) list(PRES[which(group[[1]] != x), ],
                                              PRES[which(group[[1]] == x), ],
                                              ABSV[which(group[[2]] != x), ],
                                              ABSV[which(group[[2]] == x), ]))

This is obnoxious, but is a fairly simple way of assigning presence and background points to k groups of training and test datasets. For each number from 1 to k, the base::lapply function will create a list with 4 elements in it: 1) the training presence dataset with 4/5 of the records in it, 2) the test presence dataset with 1/5 of the records in it, 3) the training background dataset with 4/5 of the background points, and 4) the test background dataset with the remaining 1/5 background points.

STATS <- rep(0, 5)
for(i in 1:5) {
  M <- bioclim(PRED, DATA[[i]][[1]])
  #creates the model using the training data
  ME <- evaluate(DATA[[i]][[2]], DATA[[i]][[4]], M, PRED)
  #evaluates the model using the test data
  STATS[i] <- ME@auc
  #grabs the AUC for the partition and places it in the i position of the empty STATS vector
}

STATS
## [1] 0.6982045 0.7343182 0.5330250 0.8025227 0.8880000
mean(STATS)
## [1] 0.7312141

And now we have the AUC results for each partition of our k-fold cross-validation.

Additional options: Filtering presence points

Oftentimes, occurrence data will be biased in some way toward locations that are easier to access or nearer to population centers. In these cases, you can get large clusters of occurrences that the model can overfit to, thus reducing the usefulness of the model. See Boria, R.A. et al. 2014. Ecological Modelling. Spatial filtering to reduce sampling bias can improve the performance of ecological niche models.

We will be using an environmental filter (described in Varela, S.A. et al. 2014. Ecography. Environmental filters reduce the effects of sampling bias and improve predictions of ecological niche models.)

envSample<- function (coord, filters, res, do.plot=TRUE){
  
  n<- length (filters)
  pot_points<- list ()
  for (i in 1:n){
    k<- filters [[i]] [!is.na(filters[[i]])]
    ext1<- range (k)
    ext1 [1]<- ext1[1]- 1
    x<- seq(ext1[1],ext1[2], by=res[[i]])
    pot_points[[i]]<- x
  }
  pot_p<- expand.grid(pot_points)
  
  ends<- NULL
  for (i in 1:n){
    fin<- pot_p [,i] + res[[i]]
    ends<- cbind (ends, fin)
  }
  
  pot_pp<- data.frame (pot_p, ends)
  pot_pp<- data.frame (pot_pp, groupID=c(1:nrow (pot_pp)))
  rows<- length (filters[[1]])
  filter<- data.frame(matrix(unlist(filters), nrow=rows))
  real_p<- data.frame (coord, filter)
  
  names_real<- c("lon", "lat")
  names_pot_st<- NULL
  names_pot_end<- NULL
  sql1<- NULL
  for (i in 1:n){
    names_real<- c(names_real, paste ("filter", i, sep=""))
    names_pot_st<- c(names_pot_st, paste ("start_f", i, sep=""))
    names_pot_end<- c(names_pot_end, paste ("end_f", i, sep=""))
    sql1<- paste (sql1, paste ("real_p.filter", i, sep=""), sep=", ")   
  }
  
  names (real_p)<- names_real
  names (pot_pp)<- c(names_pot_st, names_pot_end, "groupID")
  
  conditions<- paste ("(real_p.filter", 1, "<= pot_pp.end_f", 1,") and (real_p.filter", 1, "> pot_pp.start_f", 1, ")", sep="")
  for (i in 2:n){
    conditions<- paste (conditions, 
                        paste ("(real_p.filter", i, "<= pot_pp.end_f", i,") and (real_p.filter", i, "> pot_pp.start_f", i, ")", sep=""), 
                        sep="and")
  }
  
  selection_NA<- sqldf(paste ("select real_p.lon, real_p.lat, pot_pp.groupID",   
                        sql1, "from pot_pp left join real_p on", conditions, sep=" "))

  selection<- selection_NA [complete.cases(selection_NA),]
  
  final_points<- selection[!duplicated(selection$groupID), ]
  coord_filter<- data.frame (final_points$lon, final_points$lat) 
  names (coord_filter)<- c("lon", "lat")
  
  if (do.plot==TRUE){
    par (mfrow=c(1,2), mar=c(4,4,0,0.5))
    plot (filters[[1]], filters[[2]], pch=19,
          col="grey50", xlab="Filter 1", ylab="Filter 2")
    points (final_points$filter1, final_points$filter2,
            pch=19, col="#88000090")
    plot (coord, pch=19, col="grey50")
    maps::map(add=T)
    points (coord_filter, pch=19, col="#88000090")
  }
  coord_filter
}

If you read closely, you’ll see that it requires coordinates, environmental variables to filter by, and a resolution to filter by.

DATA <- cbind.data.frame(PRES, raster::extract(PRED, PRES))
head(DATA)
##         lon      lat bio1 bio2 bio3 bio4 bio5 bio6 bio7 bio8 bio9 bio10
## 1 -97.18722 27.65695  223   82   34 5488  329   90  239  268  159   287
## 2 -97.15156 25.99902  229   82   36 4879  327  104  223  269  198   284
## 3 -97.29587 27.42936  223   87   36 5389  330   89  241  266  160   285
## 4 -97.11967 27.74800  222   82   34 5531  328   87  241  268  187   286
## 5 -97.18603 27.64635  223   82   34 5488  329   90  239  268  159   287
## 6 -97.16400 26.10300  228   80   35 4933  325  102  223  268  196   284
##   bio11 bio12 bio13 bio14 bio15 bio16 bio17 bio18 bio19
## 1   147   782   148    28    48   322   124   192   135
## 2   160   707   169    19    61   298    98   163   129
## 3   148   741   143    23    50   306   114   192   126
## 4   145   816   152    30    46   328   129   207   143
## 5   147   782   148    28    48   322   124   192   135
## 6   158   702   171    18    62   294    98   158   132
EFILT <- envSample(DATA[, 1:2], filters = list(DATA$bio1, DATA$bio12), res = list(diff(range(DATA$bio1))/10, diff(range(DATA$bio12))/10))

Now create a model and prediction using the environmentally filtered points.

ECLM <- bioclim(PRED, EFILT)
ECP <- predict(PRED, ECLM, ext = EXT, progress = '')
plot(ECP)
maps::map(add = T)

Additional options: Reducing correlation of bioclimatic variables

Many studies will check to see if the bioclimatic variables they are using are highly correlated (For an example see Cooper, D.M. et al. 2016. Diversity Distrib. Predicted Pleistocene-Holocene range shifts of the tiger (Panthera tigris))

Some have found that this reduces over-parameterization in the models.

raster::layerStats can be used to compute correlation and covariance for Raster objects. Here we are using it to compute the Pearson correlation coefficient for each of the bioclimatic variables. The na.rm = T argument removes the cells of the RasterStack object with NA values (often the ocean) so that they don’t interfere with the calculations. The result is a list with the correlation coefficients and the mean values for each variable.

COR <- layerStats(crop(PRED, EXT), "pearson", na.rm = T)

If you wish to visualize this, run stats::symnum on the first element of the list. Sometimes you’ll have multiple highly correlated variables and have to choose among them. You can do this via jackknife tests or by looking at a PCA.

symnum(COR[[1]])
##       bi1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15 b16 b17 b18 b19
## bio1  1                                                                  
## bio2  .   1                                                              
## bio3  .   ,  1                                                           
## bio4         +  1                                                        
## bio5  ,      ,  ,  1                                                     
## bio6  ,   .     .     1                                                  
## bio7         .  +  ,  .  1                                               
## bio8  B   .  .  .  +  ,     1                                            
## bio9  ,   .     .     *  .  ,  1                                         
## bio10 +   .  ,  ,  B  .  .  +  .  1                                      
## bio11 ,         .     B  .  ,  *  .   1                                  
## bio12     .  .                            1                              
## bio13 .   .           .  .     .      .   ,   1                          
## bio14     .  .  .     .  .     .      .   +       1                      
## bio15 .      .  .     .  .     .      ,   .   .   +   1                  
## bio16     .           .  .                +   *   .       1              
## bio17     .  .  .                     .   +       B   +   .   1          
## bio18                    .                ,   ,   .       *   .   1      
## bio19     .  .  .                     .   +       *   ,   .   B   .   1  
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
PREDI <- crop(PRED, EXT)
PCA <- prcomp(na.omit(getValues(PREDI)))
summary(PCA)
## Importance of components:
##                              PC1       PC2      PC3      PC4      PC5
## Standard deviation     1081.0070 184.65582 62.09080 39.14901 18.59457
## Proportion of Variance    0.9667   0.02821  0.00319  0.00127  0.00029
## Cumulative Proportion     0.9667   0.99495  0.99814  0.99940  0.99969
##                             PC6     PC7     PC8     PC9    PC10    PC11
## Standard deviation     13.42540 7.99356 6.68473 5.80794 4.71306 3.39378
## Proportion of Variance  0.00015 0.00005 0.00004 0.00003 0.00002 0.00001
## Cumulative Proportion   0.99984 0.99989 0.99993 0.99996 0.99997 0.99998
##                           PC12  PC13  PC14  PC15   PC16   PC17   PC18
## Standard deviation     2.67570 2.243 2.014 1.527 0.5993 0.4699 0.4282
## Proportion of Variance 0.00001 0.000 0.000 0.000 0.0000 0.0000 0.0000
## Cumulative Proportion  0.99999 1.000 1.000 1.000 1.0000 1.0000 1.0000
##                             PC19
## Standard deviation     1.272e-13
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00
plot(PCA$x[, 1:2])

PREDI[[1]][!is.na(PREDI[[1]])] <- PCA$x[, 1]
PREDI[[2]][!is.na(PREDI[[2]])] <- PCA$x[, 2]
PCAR <- stack(PREDI[[1:2]])
plot(PCAR[[1]])

plot(PCAR[[2]])
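
If you want to actually model with these PCA layers, you can feed the two-layer stack straight into bioclim (a sketch; PCACLM and PCAP are illustrative names, and renaming the layers avoids the leftover bio1/bio2 labels):

names(PCAR) <- c("PC1", "PC2")
PCACLM <- bioclim(PCAR, PRES)
PCAP <- predict(PCAR, PCACLM)
plot(PCAP)
maps::map(add = T)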

Additional options: Schoener’s D for comparison

Sometimes you have predictions from multiple algorithms or from the use of multiple arguments within the same algorithm and want to see how the prediction has changed.

MAPS <- stack(BCP, ECP)
calc.niche.overlap(MAPS)
##           layer.1 layer.2
## layer.1        NA      NA
## layer.2 0.8633062      NA

ENMeval::calc.niche.overlap takes a RasterStack of the predictive maps created for each model and calculates the Schoener’s D statistic of niche similarity.

A result of 0 means that the two maps have no similarity and a result of 1 means that they are exactly the same.
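
If you want to see what the statistic is doing, you can compute Schoener’s D by hand (a sketch that should agree with calc.niche.overlap; p1 and p2 are illustrative names): standardize each prediction so its cells sum to 1, then take 1 minus half the summed absolute differences.

p1 <- getValues(MAPS[[1]]) / sum(getValues(MAPS[[1]]), na.rm = TRUE)
p2 <- getValues(MAPS[[2]]) / sum(getValues(MAPS[[2]]), na.rm = TRUE)
1 - 0.5 * sum(abs(p1 - p2), na.rm = TRUE)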

Additional options: MaxEnt

Maxent is a slight pain to set up in R. It requires the rJava package, which will let you know in the R console if absolutely anything on your computer isn’t configured the way it wants. First you will have to actually download Maxent onto your computer. From this download, grab the maxent.jar file and place it in the folder that comes up when you run system.file("java", package = "dismo"). Chances are you will run into more errors involving paths. As always, googling the error is the best way out of the problem (read: don’t ask me).
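
A quick sanity check that the file landed where dismo expects it (a sketch; jar is just an illustrative name):

jar <- paste0(system.file("java", package = "dismo"), "/maxent.jar")
file.exists(jar)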

When everything is finally set up, you can run Maxent as easily as dismo::bioclim.

install.packages("rJava")
library(rJava)

MXNT <- maxent(PRED, PRES)
MXP <- predict(PRED, MXNT, ext = EXT, progress = '')
map(add = T)

Good introductory reading on Maxent can be found in Merow, C. et al. 2013. Ecography. A practical guide to MaxEnt for modeling species’ distributions: what it does, and why inputs and settings matter; and in Elith, J. et al. 2011. Diversity Distrib. A statistical explanation of MaxEnt for ecologists.

Additional options: Boosted Regression Trees

You can fit boosted regression trees using the gbm package. There is a nice vignette for the dismo package that goes over the use of boosted regression trees in creating ENMs. Additionally, for a good source of information, see Elith, J. et al. 2008. Journal of Animal Ecology. A working guide to boosted regression trees.
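
A minimal sketch of what that looks like, following the general pattern of the dismo vignette (PA, BRT, and BRTP are illustrative names, and the tuning values are placeholders you would want to explore):

#build a presence/background table with the extracted predictor values
PA <- rbind(data.frame(pres = 1, raster::extract(PRED, PRES)),
            data.frame(pres = 0, raster::extract(PRED, ABSV)))
PA <- na.omit(PA)
#fit the BRT with cross-validated selection of the number of trees
BRT <- gbm.step(data = PA, gbm.x = 2:20, gbm.y = 1, family = "bernoulli",
                tree.complexity = 3, learning.rate = 0.005, bag.fraction = 0.75)
#predict to the study extent, passing the gbm-specific arguments through
BRTP <- predict(PRED, BRT, ext = EXT, n.trees = BRT$gbm.call$best.trees,
                type = "response")
plot(BRTP)
maps::map(add = T)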