packages <- c("dismo", "gbm", "rgeos", "ENMeval", "maps", "maptools", "raster", "rgbif", "adehabitatHR", "sp", "tidyverse", "sf", "mapview", "sqldf", "roxygen2", "testthat")
#lapply(packages, install.packages)
sapply(packages, library, character.only = T)
GBIF is a great place to start your search for coordinate information (simply because of how many sources it aggregates data from). A few fields that I find most helpful for judging whether coordinates are reasonable are the ones used below (in the ‘fields’ argument). stateProvince, country, county, and locality are always good to have for georeferencing purposes and for checking that coordinates fall where they should. issues is a good field for whittling records down.
SPEC <- occ_search(scientificName = "Dipodomys compactus", fields = c('name', 'stateProvince', 'country', 'county', 'locality', 'decimalLongitude', 'decimalLatitude', 'geodeticDatum', 'datasetName', 'institutionCode', 'year', 'issues'), limit = 1000)
SPEC
## Records found [635]
## Records returned [635]
## No. unique hierarchies [1]
## No. media records [15]
## No. facets [0]
## Args [limit=1000, offset=0, scientificName=Dipodomys compactus,
## fields=name,stateProvince,country,county,locality,decimalLongitude,decimalLatitude,geodeticDatum,datasetNam]
## # A tibble: 635 x 12
## name decimalLongitude decimalLatitude year issues geodeticDatum
## <chr> <dbl> <dbl> <int> <chr> <chr>
## 1 Dipodomy… -97.3 27.5 2018 cdround… WGS84
## 2 Dipodomy… -97.3 27.5 2018 cdround… WGS84
## 3 Dipodomy… -97.3 27.3 2016 cdround… WGS84
## 4 Dipodomy… -97.2 27.6 2016 cdround… WGS84
## 5 Dipodomy… -97.2 27.7 2015 gass84 WGS84
## 6 Dipodomy… -97.2 26.0 2015 gass84 WGS84
## 7 Dipodomy… -97.3 27.4 2014 gass84 WGS84
## 8 Dipodomy… -97.3 27.3 2014 gass84 WGS84
## 9 Dipodomy… -97.1 27.7 2014 gass84 WGS84
## 10 Dipodomy… -97.2 27.6 2013 gass84 WGS84
## # ... with 625 more rows, and 6 more variables: country <chr>,
## # datasetName <chr>, institutionCode <chr>, stateProvince <chr>,
## # county <chr>, locality <chr>
There are over 600 records of Dipodomys compactus on GBIF. Not all of them are likely great, though.
The maps::map function is helpful for providing quick boundaries. You’ll realize that I’m using maps::map instead of just map. This is because I have purrr loaded as well, which has a map function that is great when dealing with lists.
plot(SPEC$data[, c("decimalLongitude", "decimalLatitude")], pch = 19, col = "red")
maps::map(add = T)
levels(as.factor(SPEC$data$stateProvince))
## [1] "California" "Oklahoma" "Tamaulipas" "Texas"
What is the range of D. compactus, again?
Oftentimes, a placeholder of 0, 0 is used for records, which can be a pain to deal with. Still, that doesn’t explain the Oklahoma or California records. Always check the accepted/acceptable range of your study species.
For the purposes of our workshop today, we will get rid of all the records without coordinate information. In an actual study, you would set these to the side and georeference them (if possible).
dplyr::filter is a powerful tool that acts like the base::subset function but becomes very useful if you start creating pipelines with the magrittr package.
OCC <- filter(SPEC$data, !is.na(decimalLongitude))
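If you do end up using pipes, the same filter can be written as a pipeline (this is equivalent to the line above, so only run one of the two):
SPEC$data %>% filter(!is.na(decimalLongitude)) #same result as the filter() call above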
base::duplicated determines which rows have duplicated information in the decimalLongitude and decimalLatitude columns, which is quite a lot.
DUPL <- duplicated(OCC[ , c("decimalLongitude", "decimalLatitude")])
The base::duplicated function returns a logical vector with one element for each row of the data.frame we entered. As such, you can use the base::sum function to find out how many duplicated coordinates we have.
sum(DUPL)
## [1] 334
Take a look at the duplicate rows just to make sure everything worked.
head(OCC[DUPL, ])
## # A tibble: 6 x 12
## name decimalLongitude decimalLatitude year issues geodeticDatum
## <chr> <dbl> <dbl> <int> <chr> <chr>
## 1 Dipodomys c… -97.2 26.1 1999 cdrou… WGS84
## 2 Dipodomys c… -97.2 26.1 1999 cdrou… WGS84
## 3 Dipodomys c… -97.2 26.1 1999 cdrou… WGS84
## 4 Dipodomys c… -97.2 26.1 1999 cdrou… WGS84
## 5 Dipodomys c… -97.2 27.7 1996 gass84 WGS84
## 6 Dipodomys c… -97.2 27.7 1996 gass84 WGS84
## # ... with 6 more variables: country <chr>, datasetName <chr>,
## # institutionCode <chr>, stateProvince <chr>, county <chr>,
## # locality <chr>
This takes out the duplicated rows (remember, ! is the logical NOT operator; you can use it to take the opposite of something).
OCC <- OCC[!DUPL, ]
dim(OCC)
## [1] 98 12
Well… Those 600 records certainly went away fast.
But you can still do so much more to get so much less info!
gbif_issues()
## code issue
## 1 bri BASIS_OF_RECORD_INVALID
## 2 ccm CONTINENT_COUNTRY_MISMATCH
## 3 cdc CONTINENT_DERIVED_FROM_COORDINATES
## 4 conti CONTINENT_INVALID
## 5 cdiv COORDINATE_INVALID
## 6 cdout COORDINATE_OUT_OF_RANGE
## 7 cdrep COORDINATE_REPROJECTED
## 8 cdrepf COORDINATE_REPROJECTION_FAILED
## 9 cdreps COORDINATE_REPROJECTION_SUSPICIOUS
## 10 cdround COORDINATE_ROUNDED
## 11 cucdmis COUNTRY_COORDINATE_MISMATCH
## 12 cudc COUNTRY_DERIVED_FROM_COORDINATES
## 13 cuiv COUNTRY_INVALID
## 14 cum COUNTRY_MISMATCH
## 15 depmms DEPTH_MIN_MAX_SWAPPED
## 16 depnn DEPTH_NON_NUMERIC
## 17 depnmet DEPTH_NOT_METRIC
## 18 depunl DEPTH_UNLIKELY
## 19 elmms ELEVATION_MIN_MAX_SWAPPED
## 20 elnn ELEVATION_NON_NUMERIC
## 21 elnmet ELEVATION_NOT_METRIC
## 22 elunl ELEVATION_UNLIKELY
## 23 gass84 GEODETIC_DATUM_ASSUMED_WGS84
## 24 gdativ GEODETIC_DATUM_INVALID
## 25 iddativ IDENTIFIED_DATE_INVALID
## 26 iddatunl IDENTIFIED_DATE_UNLIKELY
## 27 mdativ MODIFIED_DATE_INVALID
## 28 mdatunl MODIFIED_DATE_UNLIKELY
## 29 muldativ MULTIMEDIA_DATE_INVALID
## 30 muluriiv MULTIMEDIA_URI_INVALID
## 31 preneglat PRESUMED_NEGATED_LATITUDE
## 32 preneglon PRESUMED_NEGATED_LONGITUDE
## 33 preswcd PRESUMED_SWAPPED_COORDINATE
## 34 rdativ RECORDED_DATE_INVALID
## 35 rdatm RECORDED_DATE_MISMATCH
## 36 rdatunl RECORDED_DATE_UNLIKELY
## 37 refuriiv REFERENCES_URI_INVALID
## 38 txmatfuz TAXON_MATCH_FUZZY
## 39 txmathi TAXON_MATCH_HIGHERRANK
## 40 txmatnon TAXON_MATCH_NONE
## 41 typstativ TYPE_STATUS_INVALID
## 42 zerocd ZERO_COORDINATE
## description
## 1 The given basis of record is impossible to interpret or seriously different from the recommended vocabulary.
## 2 The interpreted continent and country do not match up.
## 3 The interpreted continent is based on the coordinates, not the verbatim string information.
## 4 Uninterpretable continent values found.
## 5 Coordinate value given in some form but GBIF is unable to interpret it.
## 6 Coordinate has invalid lat/lon values out of their decimal max range.
## 7 The original coordinate was successfully reprojected from a different geodetic datum to WGS84.
## 8 The given decimal latitude and longitude could not be reprojected to WGS84 based on the provided datum.
## 9 Indicates successful coordinate reprojection according to provided datum, but which results in a datum shift larger than 0.1 decimal degrees.
## 10 Original coordinate modified by rounding to 5 decimals.
## 11 The interpreted occurrence coordinates fall outside of the indicated country.
## 12 The interpreted country is based on the coordinates, not the verbatim string information.
## 13 Uninterpretable country values found.
## 14 Interpreted country for dwc:country and dwc:countryCode contradict each other.
## 15 Set if supplied min>max
## 16 Set if depth is a non numeric value
## 17 Set if supplied depth is not given in the metric system, for example using feet instead of meters
## 18 Set if depth is larger than 11.000m or negative.
## 19 Set if supplied min > max elevation
## 20 Set if elevation is a non numeric value
## 21 Set if supplied elevation is not given in the metric system, for example using feet instead of meters
## 22 Set if elevation is above the troposphere (17km) or below 11km (Mariana Trench).
## 23 Indicating that the interpreted coordinates assume they are based on WGS84 datum as the datum was either not indicated or interpretable.
## 24 The geodetic datum given could not be interpreted.
## 25 The date given for dwc:dateIdentified is invalid and cant be interpreted at all.
## 26 The date given for dwc:dateIdentified is in the future or before Linnean times (1700).
## 27 A (partial) invalid date is given for dc:modified, such as a non existing date, invalid zero month, etc.
## 28 The date given for dc:modified is in the future or predates unix time (1970).
## 29 An invalid date is given for dc:created of a multimedia object.
## 30 An invalid uri is given for a multimedia object.
## 31 Latitude appears to be negated, e.g. 32.3 instead of -32.3
## 32 Longitude appears to be negated, e.g. 32.3 instead of -32.3
## 33 Latitude and longitude appear to be swapped.
## 34 A (partial) invalid date is given, such as a non existing date, invalid zero month, etc.
## 35 The recording date specified as the eventDate string and the individual year, month, day are contradicting.
## 36 The recording date is highly unlikely, falling either into the future or represents a very old date before 1600 that predates modern taxonomy.
## 37 An invalid uri is given for dc:references.
## 38 Matching to the taxonomic backbone can only be done using a fuzzy, non exact match.
## 39 Matching to the taxonomic backbone can only be done on a higher rank and not the scientific name.
## 40 Matching to the taxonomic backbone cannot be done cause there was no match at all or several matches with too little information to keep them apart (homonyms).
## 41 The given type status is impossible to interpret or seriously different from the recommended vocabulary.
## 42 Coordinate is the exact 0/0 coordinate, often indicating a bad null coordinate.
summary(as.factor(OCC$issues))
## cdrep
## 9 9
## cdrep,cdround cdround
## 1 33
## cdround,gass84 cdround,gass84,rdativ,txmathi
## 6 4
## cucdmis,txmathi,zerocd gass84
## 1 19
## gass84,gdativ gass84,rdativ
## 6 1
## gass84,rdatm iddatunl
## 1 1
## rdativ rdativ,txmathi
## 1 2
## rdatm
## 4
The best way I’ve found to filter out specific issues (due to how they are placed in the issues column) is to run a grep function and use the | symbol to specify everything you want to get rid of.
grep("cdiv|cdrepf|cdreps|txmatnon|cucdmis|zerocd", OCC$issues)
## [1] 39
Because it returns a vector of row numbers that match the pattern, we can use - to negate these numbers and grab everything else.
dim(OCC[-grep("cdiv|cdrepf|cdreps|txmatnon|cucdmis|zerocd", OCC$issues), ])
## [1] 97 12
There are still a bunch of things you can do to clean up your data (check for outliers, make sure the georeferencing was done correctly, etc.).
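For example, one quick (hypothetical) check is to pull out the records whose stateProvince falls outside the accepted range of the species and inspect their locality information; OCC_CHECK is just an illustrative object name:
#records whose stateProvince is not Texas or Tamaulipas (or is missing) deserve a closer look
OCC_CHECK <- dplyr::filter(OCC, !stateProvince %in% c("Texas", "Tamaulipas"))
OCC_CHECK[, c("stateProvince", "county", "locality")]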
HOWEVER, let’s all start from the same point.
PRES <- read.csv("Dipo_presence_OSOS.csv", header = T)
Now that we have our presence data, what else do we need?
We have used the raster::getData function to grab GADM data for administrative boundaries, but it can also be used to grab the oft-used Bioclim dataset of 19 bioclimatic variables.
PRED <- raster::getData("worldclim", var = "bio", res = 2.5)
PRED
## class : RasterStack
## dimensions : 3600, 8640, 31104000, 19 (nrow, ncol, ncell, nlayers)
## resolution : 0.04166667, 0.04166667 (x, y)
## extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## names : bio1, bio2, bio3, bio4, bio5, bio6, bio7, bio8, bio9, bio10, bio11, bio12, bio13, bio14, bio15, ...
## min values : -278, 9, 8, 64, -86, -559, 53, -278, -501, -127, -506, 0, 0, 0, 0, ...
## max values : 319, 213, 96, 22704, 489, 258, 725, 376, 365, 382, 289, 10577, 2437, 697, 265, ...
You can see that its class is a RasterStack along with its resolution, dimensions, extent, and the names of the variables.
The background points give the algorithm information about the environmental space available in your study region, and thus should reflect a reasonable amount of that region. Note that if you increase the size of your study region, you can inflate your AUC (you increase the chance of “correctly” identifying areas as lacking the species). A reasonable current recommendation is to use the extent of your points plus one half to one degree (but see Anderson, R.P. and A. Raza. 2010. J Biogeogr. The effect of the extent of the study region on GIS models of species geographic distributions and estimates of niche evolution: Preliminary tests with montane rodents (genus Nephelomys) in Venezuela).
MCP <- mcp(SpatialPoints(PRES, proj4string = CRS("+proj=longlat +datum=WGS84")), percent = 100) #creates a polygon that encapsulates all of your chosen points
#Take a look at your polygon and points
plot(MCP)
points(PRES)
maps::map(add = T)
raster::extent gives the geographical extent of the chosen spatial object (in this case, our polygon).
extent(MCP)
## class : Extent
## xmin : -99.76694
## xmax : -97.11967
## ymin : 24.5504
## ymax : 29.23
attributes(MCP)
## $data
## id area
## a a 0.0006752934
##
## $polygons
## $polygons[[1]]
## An object of class "Polygons"
## Slot "Polygons":
## [[1]]
## An object of class "Polygon"
## Slot "labpt":
## [1] -98.21859 27.21397
##
## Slot "area":
## [1] 6.752934
##
## Slot "hole":
## [1] FALSE
##
## Slot "ringDir":
## [1] 1
##
## Slot "coords":
## lon lat
## 4 -97.11967 27.74800
## 48 -97.15222 25.82250
## 18 -97.66009 24.55040
## 12 -98.87953 26.55722
## 51 -99.11167 27.02947
## 53 -99.76694 28.56994
## 23 -98.79000 29.23000
## 20 -98.65740 29.22615
## 24 -98.65000 29.22000
## 54 -97.23447 27.98695
## 41 -97.11967 27.74800
##
##
##
## Slot "plotOrder":
## [1] 1
##
## Slot "labpt":
## [1] -98.21859 27.21397
##
## Slot "ID":
## [1] "a"
##
## Slot "area":
## [1] 6.752934
##
##
##
## $plotOrder
## [1] 1
##
## $bbox
## min max
## x -99.76694 -97.11967
## y 24.55040 29.23000
##
## $proj4string
## CRS arguments:
## +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
##
## $class
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
##
## $comment
## [1] "FALSE"
Take a look at what makes up the SpatialPolygonsDataFrame that is MCP. In particular, we are looking at $bbox. A SpatialPolygonsDataFrame is an S4 class in R, which means its slots are accessed using @.
#MCP@bbox holds the min (column 1) and max (column 2) x and y values; buffer them by 1 degree and reorder to (xmin, xmax, ymin, ymax), the order raster::extent expects
EXT <- extent(c(MCP@bbox[, 1] - 1, MCP@bbox[, 2] + 1)[c(1, 3, 2, 4)])
And now we have the extent of our study!
The dismo::randomPoints function is useful for generating random background points within an area. It uses a raster mask, restricted by an extent object, to generate n points within our study region.
PNTS <- randomPoints(PRED, n = 10000, ext = EXT, extf = 1)
A note about background points: the number chosen is usually recommended to be large (e.g. 10,000), especially for MaxEnt. However, the number of background points that works best seems to differ with algorithm choice. For more information, see Barbet-Massin, M. et al. 2012. Methods in Ecology and Evolution. Selecting pseudo-absences for species distribution models: How, where and how many?
When using natural history collection specimens, it is often recommended to apply the same bias in selecting background points as is found in collection patterns. But how do we determine this bias? You can see collection bias by obtaining records of similar species (for this example, we will grab records of all heteromyid and cricetid rodents). We will use these records to create a background mask and then bias the selection of background points to within this mask.
#DO NOT RUN the below lines of code
#The rgbif::occ_search functions here are grabbing all cricetid and heteromyid records with coordinate information from Texas and Tamaulipas. This will run for a while.
#####
CRIC <- occ_search(taxonKey = 3240723, stateProvince = c("Texas", "Tamaulipas"), hasCoordinate = T, fields = c("name", "decimalLongitude", "decimalLatitude"), limit = 200000)
HMYD <- occ_search(taxonKey = 5504, stateProvince = c("Texas", "Tamaulipas"), hasCoordinate = T, fields = c("name", "decimalLongitude", "decimalLatitude"), limit = 200000)
#when querying for multiple options (such as here for both Texas and Tamaulipas), the result is in a list, requiring indexing of the different elements of the list to grab the $data portion
MASK <- data.frame(rbind(CRIC[[1]]$data, CRIC[[2]]$data, HMYD[[1]]$data, HMYD[[2]]$data))
coordinates(MASK) <- ~decimalLongitude + decimalLatitude
#I often find it easier to use the sp::coordinates function to turn a data.frame into a SpatialPointsDataFrame when I am trying to crop it to a certain geographic extent
MASK <- crop(MASK, EXT)
#the raster::crop function is handy because it allows us to restrict the points to the extent of our study area (the EXT object)
#####
DO RUN the code below. Since searching for 50,000 records takes a bit of time, just read in the results from the above few lines of code and turn them into a SpatialPointsDataFrame.
MASK <- read.csv("enm_mask_OSOS.csv", header = T, row.names = 1)
coordinates(MASK) <- ~decimalLongitude + decimalLatitude
plot(MASK, pch = 19, col = "blue", cex = 0.4)
maps::map(add = T)
maps::map(add = T, "county", "texas")
The maps::map function unfortunately doesn’t have Mexican municipalities, so we can only show country boundaries and Texas county boundaries. It is slightly messy, but it is easy and doesn’t take absurd amounts of time to plot.
The dismo::circles function will take the points from the MASK object and create circles of a 50 km radius around them.
CIRC <- circles(MASK, d = 50000, lonlat = T)
plot(CIRC)
maps::map(add = T)
You can see that overlapping polygons are dissolved into one another.
CIRC@polygons
## class : SpatialPolygons
## features : 1
## extent : -101.2812, -95.66719, 23.12183, 30.67529 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
The result of the function is one SpatialPolygon.
This line uses both the raster::rasterToPolygons and raster::crop functions to create a smaller RasterStack the same size as our study extent and then turn it into a SpatialPolygonsDataFrame. “But why did you do this? We have the circle polygon and can just grab random data points from within it.” Well, you can see that a not-so-insignificant portion of the polygon is over the Gulf, and we don’t want points landing in the ocean. Also, the border from, say, a GADM polygon isn’t the same as the border of the Bioclim raster data, so we could still get points with NA values even if they technically aren’t in the Gulf.
RAST <- rasterToPolygons(crop(PRED, EXT))
The rgeos::gIntersection grabs the intersecting bits from the circle polygon (CIRC@polygons) and our raster polygon (RAST).
POL <- gIntersection(CIRC@polygons, RAST)
plot(POL)
Check out our new background mask!
Here our sp::spsample function will sample 10,000 points randomly within our POL object. The iter argument just gives it the number of times to try grabbing random points before declaring it a failure.
PNTS <- spsample(POL, 10000, type = "random", iter = 25)
plot(PNTS, pch = 19, cex = 0.2, col = "blue")
maps::map(add = T)
Look at that spread of random background points!
Now, let’s turn it into a data.frame and give it more useful column names than just x and y.
ABSV <- as.data.frame(PNTS)
colnames(ABSV) <- c("lon", "lat")
So now we have coordinate data and raster data. Where do we go from here?
The BIOCLIM algorithm for environmental niche modeling isn’t used much anymore (other algorithms often outperform it except in a few special cases). However, it is easy to use and doesn’t require you to go through some sort of R version of a SAW movie’s torture chamber to run it (looking at you, MaxEnt).
All that the dismo::bioclim function needs is a raster of predictor variables and the coordinates. The function will extract the needed information for each of the records (probably using raster::extract, but I haven’t checked).
BCLM <- bioclim(PRED, PRES)
plot(BCLM)
Cool! It’s a plot of something. Actually it appears to be a plot of values for each of the records for certain bioclimatic variables (default appears to be the first two) with a box around a certain percentage of the points.
plot(BCLM, a = 1, b = 2, p = 0.85)
You can seemingly adjust the axes and the size of the box using the arguments a, b, and p.
Now things get slightly weird here. We are using the raster::predict function to create a raster prediction of the model object (BCLM) restricted to the extent (EXT). Simple enough. However, there is also the dismo::predict function that does the same thing but with the model object coming before the raster object in the call. The raster and dismo packages have a few authors in common, which likely led to this. I believe that the raster version of the predict function is applicable to a wider variety of model objects, so I normally use that version.
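For comparison only (we run the raster form below, and BCP_dismo is just an illustrative name), the dismo form of the same call swaps the first two arguments:
#dismo::predict puts the model object first, then the RasterStack
BCP_dismo <- dismo::predict(BCLM, PRED, ext = EXT)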
BCP <- predict(PRED, BCLM, ext = EXT, progress = '')
BCP
## class : RasterLayer
## dimensions : 161, 111, 17871 (nrow, ncol, ncell)
## resolution : 0.04166667, 0.04166667 (x, y)
## extent : -100.75, -96.125, 23.54167, 30.25 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : layer
## values : 0, 0.5 (min, max)
Look! It’s a RasterLayer. Let’s plot it! Plot the raster prediction, add our points for context, and add country boundaries.
Congrats, you have a visualization of an ENM that you can use (but that is likely worthless because it’s BIOCLIM and we’ve done the minimum because this is a 2-hour workshop).
plot(BCP)
points(PRES, pch = 19, col = "blue", cex = 0.5)
maps::map(add = T)
The dismo::evaluate function takes presence coordinates, background coordinates, a model object and a RasterStack of predictor variables to generate an evaluation of the model.
ME <- evaluate(PRES, ABSV, BCLM, PRED)
ME
## class : ModelEvaluation
## n presences : 54
## n absences : 10000
## AUC : 0.8702593
## cor : 0.0945532
## max TPR+TNR at : 0.00477037
You can see that it gives the number of presence and absence/background points used and the AUC of the model.
AUC (area under the curve) is probably the most used evaluation statistic for ecological niche models. It ranges from 0 to 1 with a score of 0.5 meaning the model is essentially random and a score of 1 is a perfect prediction of a species distribution.
It is also a highly contentious statistic, especially in recent literature. Unfortunately given the prevalence of presence-only modeling techniques, it seems to be necessary for now despite its flaws (and most evaluation statistics seem to be similarly flawed).
plot(ME, "ROC")
Thresholds are something used in ENMs to display the predictions in a binary way (presence vs. absence). If you type in ?threshold and check under the Values section, you will get an idea of the options available to set as a threshold. For more insight into the use of thresholds in ENMs, see Norris, D. 2014. Tropical Conservation Science. Model thresholds are more important than presence location type: Understanding the distribution of lowland tapir (Tapirus terrestris) in a continuous Atlantic forest of southeast Brazil.
threshold(ME, "no_omission")
## [1] 0.01841852
plot(BCP > threshold(ME, "no_omission"))
maps::map(add = T)
Here I used the minimum presence threshold (“no_omission”), which sets the threshold at the lowest predicted value among the presence points used in the model.
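If you want to compare your options, calling threshold on the ModelEvaluation object without a second argument should return all of the available thresholds at once:
threshold(ME)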
attributes(ME)
This will send up a lot of horrifying numbers.
One of the most important things to look at here is the confusion table (ME@confusion), which shows the counts of true positives, false positives, false “negatives”, and true “negatives” at each threshold (ME@t). I used quotes around “negatives” because in a presence-only framework there are no absences, only background points.
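As a small illustrative peek (using the ME object from above), you can line the thresholds up with their confusion counts:
#each row pairs one threshold value (ME@t) with its counts of true positives, false positives, false "negatives", and true "negatives"
head(data.frame(threshold = ME@t, ME@confusion))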
k-fold cross validation is often used to evaluate the models.
group <- list(kfold(PRES, 5), kfold(ABSV, 5))
dismo::kfold will randomly and evenly assign each record a number from 1 to k (given here as 5). I put these in a list because I really like lists.
DATA <- lapply(seq(1, 5, 1), function(x) list(PRES[which(group[[1]] != x), ], PRES[which(group[[1]] == x), ], ABSV[which(group[[2]] != x), ], ABSV[which(group[[2]] == x), ]))
This is obnoxious, but is a fairly simple way of assigning presence and background points to k groups of training and test datasets. For each number from 1 to k, the base::lapply function will create a list with 4 elements in it: 1) the training presence dataset with 4/5 of the records in it, 2) the test presence dataset with 1/5 of the records in it, 3) the training background dataset with 4/5 of the background points, and 4) the test background dataset with the remaining 1/5 background points.
STATS <- rep(0, 5)
for(i in 1:5) {
M <- bioclim(PRED, DATA[[i]][[1]])
#creates the model using the training data
ME <- evaluate(DATA[[i]][[2]], DATA[[i]][[4]], M, PRED)
#evaluates the model using the test data
STATS[i] <- ME@auc
#grabs the AUC for the partition and places it in the i position of the empty STATS vector
}
STATS
## [1] 0.6982045 0.7343182 0.5330250 0.8025227 0.8880000
mean(STATS)
## [1] 0.7312141
And now we have the AUC results for each partition of our k-fold cross validation.
Oftentimes, occurrence data will be biased in some way towards locations that are easier to access or nearer to population centers. In these cases, you can get large clusters of occurrences that the model can overfit to, thus reducing the usefulness of the model. See Boria, R.A. et al. 2014. Ecological Modelling. Spatial filtering to reduce sampling bias can improve the performance of ecological niche models.
We will be using an environmental filter (described in Varela, S.A. et al. 2014. Ecography. Environmental filters reduce the effects of sampling bias and improve predictions of ecological niche models.)
envSample<- function (coord, filters, res, do.plot=TRUE){
n<- length (filters)
pot_points<- list ()
for (i in 1:n){
k<- filters [[i]] [!is.na(filters[[i]])]
ext1<- range (k)
ext1 [1]<- ext1[1]- 1
x<- seq(ext1[1],ext1[2], by=res[[i]])
pot_points[[i]]<- x
}
pot_p<- expand.grid(pot_points)
ends<- NULL
for (i in 1:n){
fin<- pot_p [,i] + res[[i]]
ends<- cbind (ends, fin)
}
pot_pp<- data.frame (pot_p, ends)
pot_pp<- data.frame (pot_pp, groupID=c(1:nrow (pot_pp)))
rows<- length (filters[[1]])
filter<- data.frame(matrix(unlist(filters), nrow=rows))
real_p<- data.frame (coord, filter)
names_real<- c("lon", "lat")
names_pot_st<- NULL
names_pot_end<- NULL
sql1<- NULL
for (i in 1:n){
names_real<- c(names_real, paste ("filter", i, sep=""))
names_pot_st<- c(names_pot_st, paste ("start_f", i, sep=""))
names_pot_end<- c(names_pot_end, paste ("end_f", i, sep=""))
sql1<- paste (sql1, paste ("real_p.filter", i, sep=""), sep=", ")
}
names (real_p)<- names_real
names (pot_pp)<- c(names_pot_st, names_pot_end, "groupID")
conditions<- paste ("(real_p.filter", 1, "<= pot_pp.end_f", 1,") and (real_p.filter", 1, "> pot_pp.start_f", 1, ")", sep="")
for (i in 2:n){
conditions<- paste (conditions,
paste ("(real_p.filter", i, "<= pot_pp.end_f", i,") and (real_p.filter", i, "> pot_pp.start_f", i, ")", sep=""),
sep="and")
}
selection_NA<- sqldf(paste ("select real_p.lon, real_p.lat, pot_pp.groupID",
sql1, "from pot_pp left join real_p on", conditions, sep=" "))
selection<- selection_NA [complete.cases(selection_NA),]
final_points<- selection[!duplicated(selection$groupID), ]
coord_filter<- data.frame (final_points$lon, final_points$lat)
names (coord_filter)<- c("lon", "lat")
if (do.plot==TRUE){
par (mfrow=c(1,2), mar=c(4,4,0,0.5))
plot (filters[[1]], filters[[2]], pch=19,
col="grey50", xlab="Filter 1", ylab="Filter 2")
points (final_points$filter1, final_points$filter2,
pch=19, col="#88000090")
plot (coord, pch=19, col="grey50")
maps::map(add=T)
points (coord_filter, pch=19, col="#88000090")
}
coord_filter
}
If you read closely, you’ll see that it requires coordinates, environmental variables to filter by, and a resolution to filter by.
DATA <- cbind.data.frame(PRES, raster::extract(PRED, PRES))
head(DATA)
## lon lat bio1 bio2 bio3 bio4 bio5 bio6 bio7 bio8 bio9 bio10
## 1 -97.18722 27.65695 223 82 34 5488 329 90 239 268 159 287
## 2 -97.15156 25.99902 229 82 36 4879 327 104 223 269 198 284
## 3 -97.29587 27.42936 223 87 36 5389 330 89 241 266 160 285
## 4 -97.11967 27.74800 222 82 34 5531 328 87 241 268 187 286
## 5 -97.18603 27.64635 223 82 34 5488 329 90 239 268 159 287
## 6 -97.16400 26.10300 228 80 35 4933 325 102 223 268 196 284
## bio11 bio12 bio13 bio14 bio15 bio16 bio17 bio18 bio19
## 1 147 782 148 28 48 322 124 192 135
## 2 160 707 169 19 61 298 98 163 129
## 3 148 741 143 23 50 306 114 192 126
## 4 145 816 152 30 46 328 129 207 143
## 5 147 782 148 28 48 322 124 192 135
## 6 158 702 171 18 62 294 98 158 132
EFILT <- envSample(DATA[, 1:2], filters = list(DATA$bio1, DATA$bio12), res = list(diff(range(DATA$bio1))/10, diff(range(DATA$bio12))/10))
Now create a model and prediction using the environmentally filtered points.
ECLM <- bioclim(PRED, EFILT)
ECP <- predict(PRED, ECLM, ext = EXT, progress = '')
plot(ECP)
maps::map(add = T)
Many studies will check to see if the bioclimatic variables they are using are highly correlated (For an example see Cooper, D.M. et al. 2016. Diversity Distrib. Predicted Pleistocene-Holocene range shifts of the tiger (Panthera tigris))
Some have found that this reduces over-parameterization in the models
raster::layerStats can be used to compute correlation and covariance for Raster objects. Here we are using it to compute the Pearson correlation coefficient between each pair of bioclimatic variables. The na.rm = T argument removes the cells of the RasterStack object with NA values (often the ocean) so that they don’t interfere with the calculations. The result is a list with the correlation coefficients and the mean values for each variable.
COR <- layerStats(crop(PRED, EXT), "pearson", na.rm = T)
If you wish to visualize this, run stats::symnum on the first element of the list. Sometimes you’ll have multiple highly correlated variables and have to choose one. You can do this via jackknife tests or by looking at a PCA.
symnum(COR[[1]])
## bi1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15 b16 b17 b18 b19
## bio1 1
## bio2 . 1
## bio3 . , 1
## bio4 + 1
## bio5 , , , 1
## bio6 , . . 1
## bio7 . + , . 1
## bio8 B . . . + , 1
## bio9 , . . * . , 1
## bio10 + . , , B . . + . 1
## bio11 , . B . , * . 1
## bio12 . . 1
## bio13 . . . . . . , 1
## bio14 . . . . . . . + 1
## bio15 . . . . . . , . . + 1
## bio16 . . . + * . 1
## bio17 . . . . + B + . 1
## bio18 . , , . * . 1
## bio19 . . . . + * , . B . 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
PREDI <- crop(PRED, EXT)
PCA <- prcomp(na.omit(getValues(PREDI)))
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1081.0070 184.65582 62.09080 39.14901 18.59457
## Proportion of Variance 0.9667 0.02821 0.00319 0.00127 0.00029
## Cumulative Proportion 0.9667 0.99495 0.99814 0.99940 0.99969
## PC6 PC7 PC8 PC9 PC10 PC11
## Standard deviation 13.42540 7.99356 6.68473 5.80794 4.71306 3.39378
## Proportion of Variance 0.00015 0.00005 0.00004 0.00003 0.00002 0.00001
## Cumulative Proportion 0.99984 0.99989 0.99993 0.99996 0.99997 0.99998
## PC12 PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 2.67570 2.243 2.014 1.527 0.5993 0.4699 0.4282
## Proportion of Variance 0.00001 0.000 0.000 0.000 0.0000 0.0000 0.0000
## Cumulative Proportion 0.99999 1.000 1.000 1.000 1.0000 1.0000 1.0000
## PC19
## Standard deviation 1.272e-13
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
plot(PCA$x[, 1:2])
PREDI[[1]][!is.na(PREDI[[1]])] <- PCA$x[, 1]
PREDI[[2]][!is.na(PREDI[[2]])] <- PCA$x[, 2]
PCAR <- stack(PREDI[[1:2]])
plot(PCAR[[1]])
plot(PCAR[[2]])
Sometimes you have predictions from multiple algorithms or from the use of multiple arguments within the same algorithm and want to see how the prediction has changed.
MAPS <- stack(BCP, ECP)
calc.niche.overlap(MAPS)
## layer.1 layer.2
## layer.1 NA NA
## layer.2 0.8633062 NA
ENMeval::calc.niche.overlap takes a RasterStack of the predictive maps created for each model and calculates Schoener’s D statistic of niche similarity. A result of 0 means that the two maps have no similarity, and a result of 1 means that they are exactly the same.
Maxent is a slight pain to set up in R. It requires the rJava package to run, which will let you know in the R console window if you do absolutely anything on your computer. First you will have to actually download Maxent onto your computer. From this download, you will grab the maxent.jar file and place it in the folder that comes up when you run the line system.file("java", package = "dismo"). Chances are you will run into more errors involving file paths. As always, googling the error is the best way out of the problem (read: don’t ask me).
When everything is finally set up, you can run Maxent as easily as dismo::bioclim.
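A quick, optional sanity check is to ask R whether it can see the jar file in the folder where dismo expects it:
file.exists(file.path(system.file("java", package = "dismo"), "maxent.jar"))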
install.packages("rJava")
library(rJava)
MXNT <- maxent(PRED, PRES)
MXP <- predict(PRED, MXNT, ext = EXT, progress = '')
plot(MXP)
maps::map(add = T)
Good introductory reading on Maxent can be found in: Merow, C. et al. 2013. Ecography. A practical guide to MaxEnt for modeling species’ distributions: what it does, and why inputs and settings matter. Elith, J. et al. 2011. Diversity Distrib. A statistical explanation of MaxEnt for ecologists.
You can use boosted regression trees using the gbm package. There is a nice vignette for the dismo package that goes over the use of boosted regression trees in creating ENMs. Additionally, for a good source of information, see Elith, J. et al. 2008. Journal of Animal Ecology. A working guide to boosted regression trees.
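As a minimal sketch (not run in the workshop; BRT_DATA, BRT, and BRTP are just illustrative names, and the tree.complexity and learning.rate values are starting points rather than recommendations), a boosted regression tree could be fit to our presence and background points with dismo::gbm.step and projected much like the models above:
#combine presences (1) and background points (0) and attach the bioclim values
BRT_DATA <- rbind(data.frame(raster::extract(PRED, PRES), pres = 1),
                  data.frame(raster::extract(PRED, ABSV), pres = 0))
#columns 1-19 are the bioclim layers, column 20 is the presence/background response
BRT <- gbm.step(data = BRT_DATA, gbm.x = 1:19, gbm.y = 20, family = "bernoulli", tree.complexity = 3, learning.rate = 0.01)
#project the fitted model over the study extent, as with the BIOCLIM prediction
BRTP <- predict(PRED, BRT, ext = EXT, n.trees = BRT$gbm.call$best.trees, type = "response")
plot(BRTP)
maps::map(add = T)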