packages <- c("rgbif", "raster", "maps", "rgdal", "viridis", "picante", "rinat", "sp", "broom", "maptools", "ape", "tidyverse", "magrittr", "sf", "mapview")
#lapply(packages, install.packages)
sapply(packages, library, character.only = T)

Introduction to iNaturalist

package::rinat

Taxon level queries

To query for specific species, use the rinat::get_inat_obs function.

OCC <- get_inat_obs(taxon_name = "Lontra canadensis", maxresults = 25) 

You can query by taxon name, taxon id (from iNat), or from a general query. The max number of results is set to 100 by default. Here I set it to 25 just to show a quick example.

Take a look at the data.

class(OCC)
## [1] "data.frame"
head(OCC)
summary(OCC) 
#OR
str(OCC) #gives structure of object, can look messy at times but does give the class of each column

Queries can be further filtered by year (can only give one year, not a range, but there is a way around this for enterprising individuals…), month(s), status of georeferenced coordinates, and quality. Here I set quality to only those observations deemed research quality (determined when the wider iNaturalist community agrees on a species level ID). If you are using iNat to determine where a species is, you should probably only look at these results.

OCC <- get_inat_obs(taxon_name = "Lontra canadensis", year = 2018, maxresults = 1000, quality = "research", geo = T)

Try plotting the coordinates of 2018 research grade otter observations. The maps::map function can prove helpful if you need quick boundaries and don’t want to fuss with making a ggplot2 map. The database argument is set to “world” by default, but you can choose a few US options by changing it to “state” or “county.”

plot(OCC[, c("longitude", "latitude")], pch = 19, col = "red", cex = 0.5) #plots points
maps::map(add = T) #plots world boundaries
maps::map(add = T, "state") #plots US state boundaries

You can apply a bounding box to filter out observations (S-LAT, W-LON, N-LAT, E-LON). I created a bounding box below that essentially covers Texas.

OCC <- get_inat_obs(taxon_name = "Lontra canadensis", geo = T, bounds = c(25, -106, 37, -94), maxresults = 1000)

Now you can see a general plot of otter observations in/around Texas.

plot(OCC[, c("longitude", "latitude")], pch = 19, col = "red")
maps::map(add = T, database = "county", "texas") #plots the county boundaries for the great Republic of Texas

Group level queries

Besides querying for a specific species or taxonomic group, one can query for a specific iNaturalist group project (e.g., Herps of Texas or Mammals of Texas).

#DON'T run the below code, it can take a while 
MAMM <- get_inat_obs_project(grpid = "mammals-of-texas", type = "observations")
#DO RUN the below line to read a .csv file with the Mammals of Texas Project information
MAMM <- read.csv("iNatMammals.csv", header = T)
summary(MAMM) 

Check summary for User.login, Quality.grade, Scientific.name

plot(MAMM[, c("Longitude", "Latitude")], pch = 19, col = "red") 

Well that seems to look like Texas.

Let’s take a look at trends by each county. iNat doesn’t have explicit county data for each observation, so we’ll create them. First, take out all those records without coordinates to make things easier on us. We’ll use the dplyr::filter function for this.

MAMM <- filter(MAMM, !is.na(Longitude))

The above line checks for which rows have NA values (this the is.na function) for longitude and filters them out to only select those rows that are the opposite of it. The ! symbol acts as a negator for logical values (the result of is.na), so it grabs those with coordinate data.

The raster::getData function will grab the R equivalent of a shapefile of US county data (level 0 = country, level 1 = stateProvince, level 2 = county/municipality)

CNTY <- raster::getData(country = "USA", level = 2) 
class(CNTY) 
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

A SpatialPointsDataFrame has an associated data.frame (you can access the data.frame only by using CNTY@data if you wish). Take a look at the NAME_1 and NAME_2 columns.

head(CNTY) 
##   OBJECTID ID_0 ISO        NAME_0 ID_1  NAME_1 ID_2  NAME_2   HASC_2 CCN_2
## 1        1  244 USA United States    1 Alabama    1 Autauga US.AL.AU    NA
## 2        2  244 USA United States    1 Alabama    2 Baldwin US.AL.BD    NA
## 3        3  244 USA United States    1 Alabama    3 Barbour US.AL.BR    NA
## 4        4  244 USA United States    1 Alabama    4    Bibb US.AL.BI    NA
## 5        5  244 USA United States    1 Alabama    5  Blount US.AL.BU    NA
## 6        6  244 USA United States    1 Alabama    6 Bullock US.AL.BL    NA
##   CCA_2 TYPE_2 ENGTYPE_2 NL_NAME_2 VARNAME_2
## 1       County    County                    
## 2       County    County                    
## 3       County    County                    
## 4       County    County                    
## 5       County    County                    
## 6       County    County
head(CNTY$NAME_1) #state names
## [1] "Alabama" "Alabama" "Alabama" "Alabama" "Alabama" "Alabama"
head(CNTY$NAME_2) #county names
## [1] "Autauga" "Baldwin" "Barbour" "Bibb"    "Blount"  "Bullock"

The below subsets the polygons to just Texas *ComeAndTakeItFlagWaving.gif*

CNTY <- CNTY[CNTY$NAME_1 %in% "Texas", ]

When subsetting a data.frame, you have a few options:

  1. you can use the base::subset function
subset(CNTY, CNTY$NAME_1 == "Texas")
  1. you can use dplyr::filter
filter(CNTY, NAME_1 == "Texas")
  1. you can subset within square brackets
CNTY[CNTY$NAME_1 == "Texas", ] #(or how I did it above, the %in% symbol is necessary when you want to subset by multiple things, e.g., %in% c("Texas", "Louisiana"))
CNTY$NAME_2 
##   [1] "Anderson"      "Andrews"       "Angelina"      "Aransas"      
##   [5] "Archer"        "Armstrong"     "Atascosa"      "Austin"       
##   [9] "Bailey"        "Bandera"       "Bastrop"       "Baylor"       
##  [13] "Bee"           "Bell"          "Bexar"         "Blanco"       
##  [17] "Borden"        "Bosque"        "Bowie"         "Brazoria"     
##  [21] "Brazos"        "Brewster"      "Briscoe"       "Brooks"       
##  [25] "Brown"         "Burleson"      "Burnet"        "Caldwell"     
##  [29] "Calhoun"       "Callahan"      "Cameron"       "Camp"         
##  [33] "Carson"        "Cass"          "Castro"        "Chambers"     
##  [37] "Cherokee"      "Childress"     "Clay"          "Cochran"      
##  [41] "Coke"          "Coleman"       "Collin"        "Collingsworth"
##  [45] "Colorado"      "Comal"         "Comanche"      "Concho"       
##  [49] "Cooke"         "Coryell"       "Cottle"        "Crane"        
##  [53] "Crockett"      "Crosby"        "Culberson"     "Dallam"       
##  [57] "Dallas"        "Dawson"        "Deaf Smith"    "Delta"        
##  [61] "Denton"        "Dewitt"        "Dickens"       "Dimmit"       
##  [65] "Donley"        "Duval"         "Eastland"      "Ector"        
##  [69] "Edwards"       "El Paso"       "Ellis"         "Erath"        
##  [73] "Falls"         "Fannin"        "Fayette"       "Fisher"       
##  [77] "Floyd"         "Foard"         "Fort Bend"     "Franklin"     
##  [81] "Freestone"     "Frio"          "Gaines"        "Galveston"    
##  [85] "Garza"         "Gillespie"     "Glasscock"     "Goliad"       
##  [89] "Gonzales"      "Gray"          "Grayson"       "Gregg"        
##  [93] "Grimes"        "Guadalupe"     "Hale"          "Hall"         
##  [97] "Hamilton"      "Hansford"      "Hardeman"      "Hardin"       
## [101] "Harris"        "Harrison"      "Hartley"       "Haskell"      
## [105] "Hays"          "Hemphill"      "Henderson"     "Hidalgo"      
## [109] "Hill"          "Hockley"       "Hood"          "Hopkins"      
## [113] "Houston"       "Howard"        "Hudspeth"      "Hunt"         
## [117] "Hutchinson"    "Irion"         "Jack"          "Jackson"      
## [121] "Jasper"        "Jeff Davis"    "Jefferson"     "Jim Hogg"     
## [125] "Jim Wells"     "Johnson"       "Jones"         "Karnes"       
## [129] "Kaufman"       "Kendall"       "Kenedy"        "Kent"         
## [133] "Kerr"          "Kimble"        "King"          "Kinney"       
## [137] "Kleberg"       "Knox"          "La Salle"      "Lamar"        
## [141] "Lamb"          "Lampasas"      "Lavaca"        "Lee"          
## [145] "Leon"          "Liberty"       "Limestone"     "Lipscomb"     
## [149] "Live Oak"      "Llano"         "Loving"        "Lubbock"      
## [153] "Lynn"          "Madison"       "Marion"        "Martin"       
## [157] "Mason"         "Matagorda"     "Maverick"      "McCulloch"    
## [161] "McLennan"      "McMullen"      "Medina"        "Menard"       
## [165] "Midland"       "Milam"         "Mills"         "Mitchell"     
## [169] "Montague"      "Montgomery"    "Moore"         "Morris"       
## [173] "Motley"        "Nacogdoches"   "Navarro"       "Newton"       
## [177] "Nolan"         "Nueces"        "Ochiltree"     "Oldham"       
## [181] "Orange"        "Palo Pinto"    "Panola"        "Parker"       
## [185] "Parmer"        "Pecos"         "Polk"          "Potter"       
## [189] "Presidio"      "Rains"         "Randall"       "Reagan"       
## [193] "Real"          "Red River"     "Reeves"        "Refugio"      
## [197] "Roberts"       "Robertson"     "Rockwall"      "Runnels"      
## [201] "Rusk"          "Sabine"        "San Augustine" "San Jacinto"  
## [205] "San Patricio"  "San Saba"      "Schleicher"    "Scurry"       
## [209] "Shackelford"   "Shelby"        "Sherman"       "Smith"        
## [213] "Somervell"     "Starr"         "Stephens"      "Sterling"     
## [217] "Stonewall"     "Sutton"        "Swisher"       "Tarrant"      
## [221] "Taylor"        "Terrell"       "Terry"         "Throckmorton" 
## [225] "Titus"         "Tom Green"     "Travis"        "Trinity"      
## [229] "Tyler"         "Upshur"        "Upton"         "Uvalde"       
## [233] "Val Verde"     "Van Zandt"     "Victoria"      "Walker"       
## [237] "Waller"        "Ward"          "Washington"    "Webb"         
## [241] "Wharton"       "Wheeler"       "Wichita"       "Wilbarger"    
## [245] "Willacy"       "Williamson"    "Wilson"        "Winkler"      
## [249] "Wise"          "Wood"          "Yoakum"        "Young"        
## [253] "Zapata"        "Zavala"

Oh, wow! A vector of Texas county names

Turn the MAMM data.frame into a SpatialPointsDataFrame by specifying the coordinate columns. I found that this is easier than using sp::SpatialPointsDataFrame.

coordinates(MAMM) <- ~Longitude + Latitude 

All coordinates from iNat are stored in unprojected lat/lon coordinates using the WGS84 datum, so you can grab the coordinate reference system from the county data.

crs(MAMM) <- crs(CNTY)
MAMM <- spTransform(MAMM, crs(CNTY))

The raster::over function is useful for determining what specific polygons (y argument, shown here as CNTY) a point in the x argument (here shown as MAMM) fall into.

OVR <- over(MAMM, CNTY) 
dim(OVR) 
## [1] 8563   15
head(OVR)
##   OBJECTID ID_0 ISO        NAME_0 ID_1 NAME_1 ID_2  NAME_2   HASC_2 CCN_2
## 1     2705  244 USA United States   44  Texas 2705  Nueces US.TX.NU    NA
## 2     2747  244 USA United States   44  Texas 2747 Tarrant US.TX.TN    NA
## 3     2754  244 USA United States   44  Texas 2754  Travis US.TX.TV    NA
## 4     2543  244 USA United States   44  Texas 2543  Blanco US.TX.BC    NA
## 5     2628  244 USA United States   44  Texas 2628  Harris US.TX.RR    NA
## 6     2584  244 USA United States   44  Texas 2584  Dallas US.TX.DS    NA
##   CCA_2 TYPE_2 ENGTYPE_2 NL_NAME_2 VARNAME_2
## 1       County    County                    
## 2       County    County                    
## 3       County    County                    
## 4       County    County                    
## 5       County    County                    
## 6       County    County

This means you now have county information for each iNat record by simply writing out OVR$NAME_2

Let’s add that county information to our SpatialPointsDataFrame.

MAMM@data <- cbind(MAMM@data, county = OVR$NAME_2) 

Grab the number of observations by county.

Hoo boy, time for an apply function explanation. For the below line, we are grabbing the vector of Texas county names and indexing the iNat mammal data by each county (this is what everything after function(x) is concerned with). sapply (results in a vector) and lapply (results in a list) have the following syntax: sapply(x, function(x) whatever function you can think up).

So this is going county by county and counting up the number of observations.

OBS <- sapply(CNTY$NAME_2, function(x) nrow(filter(MAMM@data, county == x)))

Grabs the number of species found in each county. This is similar to the line written to count observations, but instead further indexes by only grabbing the Scientific.name column and then grabs only the number of unique names in this column.

Soooo, you may be wondering what kind of mystical arts I’m attempting with %$% and %>%. I’m lazy, so I constructed a pipeline. Pipelines in R are created in using the handy %>% batch of symbols from the magrittr package. Essentially what it does is take the result from the left and use it as the first argument in whatever is after %>% (%$% is a special usage that only grabs a specific column or index from the left).

SR <- sapply(CNTY$NAME_2, function(x) MAMM@data %>% filter(county == x) %$% Scientific.name %>% unique %>% length)

So essentially, we are doing length(unique(filter(MAMM\@data, county = x)$Scientific.name)) but in an easier to read way (you can actually understand it from left to right).

Ta-dah(rida)! Species richness

One more thing…lets document how many unique users there are recording observations in each county.

USER <- sapply(CNTY$NAME_2, function(x) MAMM@data %>% filter(county == x) %$% User.login %>% unique %>% length)

Place our observation, species richness, and user number vectors as columns within our SpatialPolygonsDataFrame of Texas counties.

CNTY$inat.obs <- OBS 
CNTY$inat.sr <- SR
CNTY$inat.user <- USER

How about we visualize what we just wrought? ggplot2 is a fickle companion, so we will use the broom::tidy function to break down our observations and create a sort of choropleth map. broom::tidy will turn our SpatialPointsDataFrame into a regular data.frame and ID each row according to the region specified. This will allow the fill argument of the ggplot2::aes function to fill each county appropriately.

OBSR <- tidy(CNTY, region = "inat.obs")
class(OBSR$id)
## [1] "character"
OBSR$id <- as.numeric(OBSR$id)

We don’t want these numbers to be characters when filling in each polygon, so change them to numeric

Those polygons with no observations will be set to NA so as to not confuse them with counties that only have a handful of observations

OBSR[which(OBSR$id == 0), "id"] <- NA
SRR <- tidy(CNTY, region = "inat.sr")
SRR$id <- as.numeric(SRR$id)
SRR[which(SRR$id == 0), "id"] <- NA
USR <- tidy(CNTY, region = "inat.user")
USR$id <- as.numeric(USR$id)
USR[which(USR$id == 0), "id"] <- NA

Lists are great classes to store things in. Here we have the data.frames featuring county observation, species richness, and user numbers and can access each one by using [[]].

INAT <- list(OBSR, SRR, USR)
ggplot(data = INAT[[2]], aes(x = long, y = lat, fill = id, group = group)) + geom_polygon(color = "black", size = 0.2) + scale_fill_viridis(name = "iNaturalist Species Richness", option = "D", na.value = NA)

Now that we have county data associated with each record, you can subset the records by a specific county (such as Brazos Co.). You can use the base::unique function to determine which species are found in Brazos Co.

filter(MAMM@data, county == "Brazos")
MAMM@data %>% filter(county == "Brazos") %$% Scientific.name %>% unique
##  [1] Tadarida brasiliensis      Procyon lotor             
##  [3] Canis latrans              Dasypus novemcinctus      
##  [5] Baiomys taylori            Sylvilagus floridanus     
##  [7] Didelphis virginiana       Sciurus niger             
##  [9] Sigmodon hispidus          Peromyscus leucopus       
## [11] Lontra canadensis          Lynx rufus                
## [13] Peromyscus                 Castor canadensis         
## [15] Cryptotis parva            Chaetodipus hispidus      
## [17] Reithrodontomys fulvescens Urocyon cinereoargenteus  
## [19] Odocoileus virginianus    
## 229 Levels:  Ammospermophilus interpres Ammotragus lervia ... Xerospermophilus spilosoma

Introduction to GBIF

package::rgbif

CHAET <- occ_search(scientificName = "Chaetodipus hispidus", limit = 10)
class(CHAET) 
## [1] "gbif"

Hmmm, it is a gbif class. Can’t really review the information we want (presumably you want info regarding the record(s)).

attributes(CHAET)
## $names
## [1] "meta"      "hierarchy" "data"      "media"     "facets"   
## 
## $class
## [1] "gbif"
## 
## $type
## [1] "single"
## 
## $return
## [1] "all"
## 
## $args
## $args$limit
## [1] 10
## 
## $args$offset
## [1] 0
## 
## $args$scientificName
## [1] "Chaetodipus hispidus"
## 
## $args$fields
## [1] "all"
CHAET$data
## # A tibble: 10 x 98
##    name         key decimalLatitude decimalLongitude issues datasetKey    
##    <chr>      <int>           <dbl>            <dbl> <chr>  <chr>         
##  1 Chaetod…  1.83e9            40.5           -103.  gass84 b15d4952-7d20…
##  2 Chaetod…  1.80e9            34.8            -99.4 gass84 71e82020-f762…
##  3 Chaetod…  1.81e9            40.5           -103.  gass84 b15d4952-7d20…
##  4 Chaetod…  1.81e9            40.5           -103.  gass84 b15d4952-7d20…
##  5 Chaetod…  1.81e9            40.5           -103.  gass84 b15d4952-7d20…
##  6 Chaetod…  1.84e9            35.4            -99.1 gass84 b15d4952-7d20…
##  7 Chaetod…  1.81e9            40.5           -103.  gass84 b15d4952-7d20…
##  8 Chaetod…  1.84e9            35.4            -99.1 gass84 b15d4952-7d20…
##  9 Chaetod…  1.81e9            40.5           -103.  gass84 b15d4952-7d20…
## 10 Chaetod…  1.84e9            35.4            -99.1 gass84 b15d4952-7d20…
## # ... with 92 more variables: publishingOrgKey <chr>, networkKeys <chr>,
## #   installationKey <chr>, publishingCountry <chr>, protocol <chr>,
## #   lastCrawled <chr>, lastParsed <chr>, crawlId <int>, extensions <chr>,
## #   basisOfRecord <chr>, individualCount <int>, taxonKey <int>,
## #   kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
## #   familyKey <int>, genusKey <int>, speciesKey <int>,
## #   scientificName <chr>, kingdom <chr>, phylum <chr>, order <chr>,
## #   family <chr>, genus <chr>, species <chr>, genericName <chr>,
## #   specificEpithet <chr>, taxonRank <chr>, dateIdentified <chr>,
## #   coordinateUncertaintyInMeters <dbl>, continent <chr>,
## #   stateProvince <chr>, year <int>, month <int>, day <int>,
## #   eventDate <chr>, modified <chr>, lastInterpreted <chr>,
## #   references <chr>, license <chr>, identifiers <chr>, facts <chr>,
## #   relations <chr>, geodeticDatum <chr>, class <chr>, countryCode <chr>,
## #   country <chr>, institutionID <chr>, dynamicProperties <chr>,
## #   county <chr>, identificationVerificationStatus <chr>,
## #   eventRemarks <chr>, gbifID <chr>, language <chr>, type <chr>,
## #   preparations <chr>, locationAccordingTo <chr>, catalogNumber <chr>,
## #   institutionCode <chr>, identifiedBy <chr>, georeferencedDate <chr>,
## #   identifier <chr>, nomenclaturalCode <chr>, verbatimEventDate <chr>,
## #   higherGeography <chr>, georeferencedBy <chr>,
## #   georeferenceProtocol <chr>, georeferenceVerificationStatus <chr>,
## #   endDayOfYear <chr>, locality <chr>, collectionCode <chr>,
## #   verbatimLocality <chr>, occurrenceID <chr>, recordedBy <chr>,
## #   otherCatalogNumbers <chr>,
## #   http...unknown.org.http_..rs.tdwg.org.dwc.terms.ResourceRelationship <chr>,
## #   organismID <chr>, previousIdentifications <chr>,
## #   identificationQualifier <chr>, samplingProtocol <chr>,
## #   accessRights <chr>, higherClassification <chr>,
## #   georeferenceSources <chr>, sex <chr>, lifeStage <chr>,
## #   recordNumber <chr>, datasetName <chr>,
## #   http...unknown.org.http_..rs.tdwg.org.dwc.terms.MeasurementOrFact <chr>,
## #   vernacularName <chr>, verbatimElevation <chr>,
## #   reproductiveCondition <chr>

What is a tibble? It is a trimmed down data.frame. You can always change it to something that you are more familiar or comfortable with.

CHAET <- data.frame(CHAET$data)
head(CHAET)

Well, that is a lot of data. Do we really need that much data? Do you REALLY want the vernacularName, license, and publishingCountry, for example?

The summary of a vector of factors will give you a handy breakdown of what the vector is comprised of. Neat!

summary(as.factor(CHAET$county))
##          Greer   Logan County Washita County 
##              1              6              3

Here we not only listed the set of fields that we wanted, but also only grabbed the $data portion and turned it into a data.frame in one line.

CHAET <- occ_search(scientificName = "Chaetodipus hispidus", fields = c('name', 'infraspecificEpithet', 'stateProvince', 'country', 'county', 'locality', 'decimalLongitude', 'decimalLatitude', 'geodeticDatum', 'datasetName', 'institutionCode', 'recordNumber', 'catalogNumber', 'year', 'preparations'), limit = 10) %$% data %>% data.frame

But what if you wanted more than 200,000 records (the limit imposed by the draconian creators of the package)? *thinkingfaceemoji* This lapply function will call the rgbif::occ_search function twice, each with the country specified as each element of the CNTRY vector.

CNTRY <- c("US", "MX")
CHAET <- lapply(1:2, function(x) as.data.frame(occ_search(scientificName = "Chaetodipus hispidus", country = CNTRY[x], fields = c('name', 'stateProvince', 'country', 'county', 'locality', 'decimalLongitude', 'decimalLatitude', 'geodeticDatum', 'datasetName', 'institutionCode', 'recordNumber', 'catalogNumber', 'year', 'preparations'), limit = 15)$data))
length(CHAET)
## [1] 2
CHAET 

This results in a list composed of Chaetodipus hispidus records from the US and from Mexico

“But I don’t want a list”

The base::do.call function will apply the function specified in the first part of the argument to each portion of the list, thus binding the rows of each list element together. There are other methods to use a function like base::rbind or base::cbind on each element of a list (e.g., plyr::rbind.fill) but base::do.call is accessible to everyone and is simple to use.

CHAET <- do.call(rbind, CHAET)
dim(CHAET)
## [1] 30 14

Searching using the taxonKey (always found in the url of the GBIF page of the group you are looking at) can be very helpful as well

HMYD <- occ_search(taxonKey = 5504, limit = 20)
HMYD
## Records found [283584] 
## Records returned [20] 
## No. unique hierarchies [7] 
## No. media records [16] 
## No. facets [0] 
## Args [limit=20, offset=0, taxonKey=5504, fields=all] 
## # A tibble: 20 x 108
##    name         key decimalLatitude decimalLongitude issues  datasetKey   
##    <chr>      <int>           <dbl>            <dbl> <chr>   <chr>        
##  1 Dipodom…  1.81e9            33.4            -116. cdroun… 50c9509d-22c…
##  2 Dipodom…  1.80e9            33.3            -116. cdroun… 50c9509d-22c…
##  3 Dipodom…  1.81e9            36.7            -122. cdroun… 50c9509d-22c…
##  4 Dipodom…  1.81e9            36.5            -121. cdroun… 50c9509d-22c…
##  5 Dipodom…  1.81e9            36.7            -122. cdroun… 50c9509d-22c…
##  6 Chaetod…  1.83e9            33.1            -116. cdroun… 50c9509d-22c…
##  7 Dipodom…  1.81e9            32.8            -116. cdroun… 50c9509d-22c…
##  8 Perogna…  1.88e9            34.8            -108. cdround b15d4952-7d2…
##  9 Dipodom…  1.81e9            33.3            -116. cdroun… 50c9509d-22c…
## 10 Dipodom…  1.81e9            33.3            -116. cdroun… 50c9509d-22c…
## 11 Chaetod…  1.84e9            36.7            -122. cdroun… 50c9509d-22c…
## 12 Dipodom…  1.80e9            33.3            -116. cdroun… 50c9509d-22c…
## 13 Dipodom…  1.81e9            36.7            -122. cdroun… 50c9509d-22c…
## 14 Dipodom…  1.84e9            32.2            -114. cdroun… 50c9509d-22c…
## 15 Dipodom…  1.83e9            36.5            -117. cdroun… 50c9509d-22c…
## 16 Dipodom…  1.88e9            34.8            -108. cdround b15d4952-7d2…
## 17 Dipodom…  1.88e9            34.8            -108. ""      b15d4952-7d2…
## 18 Chaetod…  1.84e9            30.0            -113. cdroun… 50c9509d-22c…
## 19 Dipodom…  1.88e9            34.8            -108. cdround b15d4952-7d2…
## 20 Dipodom…  1.83e9            40.9            -113. ""      06a00852-f76…
## # ... with 102 more variables: publishingOrgKey <chr>, networkKeys <chr>,
## #   installationKey <chr>, publishingCountry <chr>, protocol <chr>,
## #   lastCrawled <chr>, lastParsed <chr>, crawlId <int>, extensions <chr>,
## #   basisOfRecord <chr>, taxonKey <int>, kingdomKey <int>,
## #   phylumKey <int>, classKey <int>, orderKey <int>, familyKey <int>,
## #   genusKey <int>, speciesKey <int>, scientificName <chr>, kingdom <chr>,
## #   phylum <chr>, order <chr>, family <chr>, genus <chr>, species <chr>,
## #   genericName <chr>, specificEpithet <chr>, taxonRank <chr>,
## #   dateIdentified <chr>, coordinateUncertaintyInMeters <dbl>, year <int>,
## #   month <int>, day <int>, eventDate <chr>, modified <chr>,
## #   lastInterpreted <chr>, references <chr>, license <chr>,
## #   identifiers <chr>, facts <chr>, relations <chr>, geodeticDatum <chr>,
## #   class <chr>, countryCode <chr>, country <chr>, rightsHolder <chr>,
## #   identifier <chr>, verbatimEventDate <chr>, datasetName <chr>,
## #   collectionCode <chr>, verbatimLocality <chr>, gbifID <chr>,
## #   occurrenceID <chr>, taxonID <chr>, catalogNumber <chr>,
## #   recordedBy <chr>, http...unknown.org.occurrenceDetails <chr>,
## #   institutionCode <chr>, rights <chr>, eventTime <chr>,
## #   http...unknown.org.http_..rs.gbif.org.terms.1.0.Multimedia <chr>,
## #   identificationID <chr>, occurrenceRemarks <chr>,
## #   informationWithheld <chr>, individualCount <int>, sex <chr>,
## #   elevation <dbl>, elevationAccuracy <dbl>, continent <chr>,
## #   stateProvince <chr>, institutionID <chr>, dynamicProperties <chr>,
## #   county <chr>, identificationVerificationStatus <chr>,
## #   eventRemarks <chr>, language <chr>, type <chr>, preparations <chr>,
## #   locationAccordingTo <chr>, identifiedBy <chr>,
## #   georeferencedDate <chr>, nomenclaturalCode <chr>,
## #   higherGeography <chr>, georeferencedBy <chr>,
## #   georeferenceProtocol <chr>, georeferenceVerificationStatus <chr>,
## #   endDayOfYear <chr>, verbatimCoordinateSystem <chr>, locality <chr>,
## #   otherCatalogNumbers <chr>,
## #   http...unknown.org.http_..rs.tdwg.org.dwc.terms.ResourceRelationship <chr>,
## #   organismID <chr>, previousIdentifications <chr>,
## #   identificationQualifier <chr>, accessRights <chr>,
## #   higherClassification <chr>, georeferenceSources <chr>,
## #   identificationRemarks <chr>, lifeStage <chr>,
## #   infraspecificEpithet <chr>, …

There are over 200,000 heteromyid records on GBIF. Look at all those cool species of Dipodomys, Perognathus, and Chaetodipus!

What if you weren’t interested in a particular species or family, but wanted to know what was in a particular institution? rgbif has you covered there as well

###DO NOT RUN the below code
BRTC <- occ_search(taxonKey = 359, basisOfRecord = "PRESERVED_SPECIMEN", institutionCode = "TCWC", fields = c('name', 'stateProvince', 'country', 'county', 'locality', 'decimalLongitude', 'decimalLatitude', 'geodeticDatum', 'datasetName', 'institutionCode', 'recordNumber', 'catalogNumber', 'year', 'preparations'), limit = 200000)$data

The above code searches for taxonKey 359 (Mammalia) of PRESERVED_SPECIMENS (so not videos, pictures, or hearsay of a species) at the institutionCode formerly known as the TCWC (our Biodiversity Research and Teaching Collections is still considered the TCWC by GBIF)

###DO RUN the below code instead
BRTC <- read.csv("BRTCMammals.csv", header = T)
dim(BRTC)
## [1] 62415    13
summary(BRTC)
##                      name          stateProvince        year      
##  Peromyscus maniculatus: 3282   Texas     :18054   Min.   :1929   
##  Tadarida brasiliensis : 3115   Chiapas   : 2866   1st Qu.:1984   
##  Peromyscus leucopus   : 2130   Tamaulipas: 1720   Median :1991   
##  Artibeus jamaicensis  : 2019   Veracruz  : 1540   Mean   :1991   
##  Sigmodon hispidus     : 1845   Guerrero  : 1409   3rd Qu.:1995   
##  Peromyscus levipes    : 1541   (Other)   :35661   Max.   :2015   
##  (Other)               :48483   NA's      : 1165   NA's   :58150  
##           country       recordNumber         county     
##  United States:28364   1      :  319   Brewster : 1886  
##  Mexico       :17733   2      :  304   Brazos   : 1193  
##  Honduras     : 5417   4      :  284   Hill     :  682  
##  Canada       : 1680   3      :  278   Tyler    :  587  
##  Guatemala    : 1628   5      :  272   Culberson:  584  
##  (Other)      : 7515   (Other):58570   (Other)  :22855  
##  NA's         :   78   NA's   : 2388   NA's     :34628  
##                              locality    
##  No specific locality            :  661  
##  7.4 mi SSW San Lorenzo          :  365  
##  Lancetilla, 40 m                :  305  
##  10.8 mi S, 2.6 mi W Jicaro Galan:  301  
##  9.5 mi NW Melaque               :  299  
##  (Other)                         :60089  
##  NA's                            :  395  
##                           preparations   catalogNumber   institutionCode
##  skin | skull                   :35050   1      :    1   TCWC:62415     
##  alcoholic                      : 9092   10     :    1                  
##  skeleton                       : 5253   100    :    1                  
##  skull                          : 5107   1000   :    1                  
##  skin in alcohol | body skeleton: 3336   10000  :    1                  
##  (Other)                        : 4554   10001  :    1                  
##  NA's                           :   23   (Other):62409                  
##  decimalLongitude  decimalLatitude  geodeticDatum
##  Min.   :-148.21   Min.   :-28.42   WGS84:15475  
##  1st Qu.:-104.56   1st Qu.: 29.93   NA's :46940  
##  Median : -98.70   Median : 30.88                
##  Mean   :-100.13   Mean   : 32.64                
##  3rd Qu.: -96.30   3rd Qu.: 34.95                
##  Max.   :  61.81   Max.   : 66.40                
##  NA's   :46940     NA's   :46940

Look at all those Peromyscus! These data aren’t up to date, but do include much of the BRTC Mammal collection.

summary(BRTC$country) 
##                         Argentina                         Australia 
##                                56                                 2 
##                        Azerbaijan                            Belize 
##                                 2                                21 
##                           Bermuda                          Botswana 
##                                 1                                12 
##                            Brazil                          Cameroon 
##                                51                                 6 
##                            Canada                             Chile 
##                              1680                                 8 
##                             China                          Colombia 
##                                12                               173 
##                        Costa Rica                              Cuba 
##                               910                                54 
##                           Czechia                          Dominica 
##                                23                                12 
##                Dominican Republic                           Ecuador 
##                                23                               513 
##                             Egypt                       El Salvador 
##                                 5                               850 
##                          Ethiopia                            France 
##                               224                                 1 
##                     French Guiana                           Germany 
##                               122                                 2 
##                         Guatemala                            Guyana 
##                              1628                                27 
##                          Honduras                             India 
##                              5417                                37 
##                         Indonesia                              Iraq 
##                                46                                19 
##                            Israel                             Italy 
##                                19                                 4 
##                           Jamaica                             Japan 
##                                28                                 3 
##                             Kenya                Korea, Republic of 
##                                42                                 3 
##                        Madagascar                            Malawi 
##                                 2                                 1 
##                          Malaysia                            Mexico 
##                                 1                             17733 
##                           Morocco                        Mozambique 
##                                 6                                 2 
##                           Myanmar                           Namibia 
##                                 1                                52 
##                             Nepal                     New Caledonia 
##                                 1                                11 
##                         Nicaragua                             Niger 
##                              1292                                27 
##                          Pakistan                            Panama 
##                                 5                               332 
##                          Paraguay                              Peru 
##                                 2                              1242 
##                       Philippines                       Puerto Rico 
##                                10                                11 
##                      Sierra Leone                           Somalia 
##                                 1                                11 
##                      South Africa                             Spain 
##                                77                                 6 
##                             Sudan                          Suriname 
##                                 1                                24 
##                         Swaziland                       Switzerland 
##                                 1                                12 
##      Tanzania, United Republic of                          Thailand 
##                                 6                                33 
##               Trinidad and Tobago                            Uganda 
##                               348                                 3 
##                     United States Venezuela, Bolivarian Republic of 
##                             28364                               639 
##                          Viet Nam              Virgin Islands, U.S. 
##                                14                                 6 
##                            Zambia                          Zimbabwe 
##                                18                                 6 
##                              NA's 
##                                78
length(unique(BRTC$country))
## [1] 73

The BRTC Mammal collection houses specimens from over 70 countries.

plot(BRTC[, c("decimalLongitude", "decimalLatitude")])
maps::map(add = T)

Specimens with coordinate information are very US biased.

Index the BRTC records to only pick those from Texas. You can use dplyr::filter, base::subset, or %in% to do this.

TEX <- filter(BRTC, stateProvince == "Texas")
summary(TEX)
##                     name               stateProvince        year      
##  Sigmodon hispidus    : 1623   Texas          :18054   Min.   :1936   
##  Peromyscus leucopus  : 1404   Aguascalientes :    0   1st Qu.:1984   
##  Geomys breviceps     :  883   Ahuachapán     :    0   Median :1991   
##  Tadarida brasiliensis:  745   Al Hillah, Liwa:    0   Mean   :1993   
##  Peromyscus gossypinus:  670   Alabama        :    0   3rd Qu.:2004   
##  Lynx rufus           :  664   Alajuela       :    0   Max.   :2015   
##  (Other)              :12065   (Other)        :    0   NA's   :16672  
##           country       recordNumber         county     
##  United States:18054   1      :  220   Brewster : 1886  
##  Argentina    :    0   2      :  212   Brazos   : 1193  
##  Australia    :    0   5      :  204   Hill     :  682  
##  Azerbaijan   :    0   4      :  201   Tyler    :  587  
##  Belize       :    0   3      :  198   Culberson:  584  
##  Bermuda      :    0   (Other):16312   (Other)  :12811  
##  (Other)      :    0   NA's   :  707   NA's     :  311  
##                    locality                              preparations 
##  No specific locality  :  641   skin | skull                   :8997  
##  Edwards Plateau       :  252   skull                          :3329  
##  El Paso               :  203   skeleton                       :2973  
##  14 mi SW Hunt, 2200 ft:  197   alcoholic                      :1187  
##  College Station       :  169   skin in alcohol | body skeleton: 583  
##  (Other)               :16343   (Other)                        : 975  
##  NA's                  :  249   NA's                           :  10  
##  catalogNumber   institutionCode decimalLongitude  decimalLatitude
##  101    :    1   TCWC:18054      Min.   :-106.63   Min.   :20.56  
##  1015   :    1                   1st Qu.: -99.57   1st Qu.:29.32  
##  1016   :    1                   Median : -97.61   Median :30.31  
##  1017   :    1                   Mean   : -98.19   Mean   :30.20  
##  1018   :    1                   3rd Qu.: -96.20   3rd Qu.:31.02  
##  1019   :    1                   Max.   :  61.81   Max.   :66.40  
##  (Other):18048                   NA's   :7953      NA's   :7953   
##  geodeticDatum
##  WGS84:10101  
##  NA's : 7953  
##               
##               
##               
##               
## 

Hmmm that doesn’t seem quite right. Why is Aguascalientes and Alabama showing up in stateProvince. Use base::droplevels to get rid of unneeded levels when using the base::summary function.

TEX <- droplevels(filter(BRTC, stateProvince == "Texas"))
dim(TEX) 
## [1] 18054    13

Over 18,000 specimens from Texas

plot(TEX[, c("decimalLongitude", "decimalLatitude")])

Looks like some specimens weren’t georeferenced correctly. Let’s ignore those.

However, we have more county records (only 311 NA values) than coordinate records (7,953 NA values). Since georeferencing is not the goal of this module, we will visualize records and species richness at the county level.

You’ve seen this before (hopefully you have been paying some modicum of attention). The base::sapply function takes each county name, subsets the BRTC Texas data with it and counts the unique species names found.

SR <- sapply(CNTY$NAME_2, function(x) TEX %>% filter(county == x) %$% name %>% unique %>% length)

Same as above, but uses base::nrow to count the number of records per county.

OBS <- sapply(CNTY$NAME_2, function(x) TEX %>% filter(county == x) %>% nrow)
CNTY$brtc.sr <- SR
CNTY$brtc.obs <- OBS

TSR <- tidy(CNTY, region = "brtc.sr")
TSR$id <- as.numeric(TSR$id)
TSR[which(TSR$id == 0), "id"] <- NA

TOBS <- tidy(CNTY, region = "brtc.obs")
TOBS$id <- as.numeric(TOBS$id)
TOBS[which(TOBS$id == 0), "id"] <- NA

DIV <- list(TSR, TOBS)

Again, the above several lines are creating a data.frame of the species richness and record values for each county in a way that ggplot2 can easily use to fill.

ggplot(data = DIV[[1]], aes(x = long, y = lat, fill = id, group = group)) + geom_polygon(color = "black", size = 0.2) + scale_fill_viridis(name = "BRTC Mammal\nSpecies Richness", option = "D", na.value = NA)

You can use \n to create a return in a title in ggplot2.

But not all species richness is created the same. Five Peromyscus species are different than a community that has marsupials, rodents, and cervids, for example. But how do we show this… Great question, follow below for more!

The ape::read.nexus function will allow us to read in a nexus file that contains a phylogenetic tree. In this case, it is a tree of 5,020 mammal species from S.A. Fritz et al. 2009. Ecology Letters. Geographical variation in predictors of mammalian extinction risk: Big is bad, but only in the tropics.

TREE <- read.nexus("ELE_1307_sm_SA1.nex")
class(TREE) 
## [1] "multiPhylo"
summary(TREE)
##                           Length Class Mode
## mammalST_MSW05_bestDates  4      phylo list
## mammalST_MSW05_lowerDates 4      phylo list
## mammalST_MSW05_upperDates 4      phylo list
attributes(TREE)
## $class
## [1] "multiPhylo"
## 
## $names
## [1] "mammalST_MSW05_bestDates"  "mammalST_MSW05_lowerDates"
## [3] "mammalST_MSW05_upperDates"

This multiPhylo class is essentially a list that contains three different trees with different dates. For our purposes, we will use TREE$mammalST_MSW05_bestDates.

summary(TREE$mammalST_MSW05_bestDates)
## 
## Phylogenetic tree: TREE$mammalST_MSW05_bestDates 
## 
##   Number of tips: 5020 
##   Number of nodes: 2503 
##   Branch lengths:
##     mean: 8.788407 
##     variance: 67.09177 
##     distribution summary:
##    Min. 1st Qu.  Median 3rd Qu.    Max. 
##     0.1     3.3     6.5    11.9   102.6 
##   No root edge.
##   First ten tip labels: Tachyglossus_aculeatus 
##                         Zaglossus_bruijni
##                         Zaglossus_bartoni
##                         Ornithorhynchus_anatinus
##                         Anomalurus_beecrofti
##                         Anomalurus_derbianus
##                         Anomalurus_pelii
##                         Anomalurus_pusillus
##                         Zenkerella_insignis
##                         Idiurus_macrotis
##   No node labels.

Shows number of tips, nodes, and branch length distribution

attributes(TREE$mammalST_MSW05_bestDates)
## $names
## [1] "edge"        "edge.length" "Nnode"       "tip.label"  
## 
## $class
## [1] "phylo"
## 
## $order
## [1] "cladewise"
head(TREE$mammalST_MSW05_bestDates$tip.label)
## [1] "Tachyglossus_aculeatus"   "Zaglossus_bruijni"       
## [3] "Zaglossus_bartoni"        "Ornithorhynchus_anatinus"
## [5] "Anomalurus_beecrofti"     "Anomalurus_derbianus"

If you notice, all tip label names for the tree have a "_" within them

Not so with our species names from GBIF. Could be a problem but it is a simple fix, so let’s fix this!

head(TEX$name)
## [1] Rattus norvegicus Rattus rattus     Cryptotis parva   Geomys attwateri 
## [5] Geomys attwateri  Geomys attwateri 
## 180 Levels: Acomys Acomys russatus ... Ziphius cavirostris

Here we use the base::sapply function to split all names by a space and replace it with a _ using the base::paste function. base::strsplit only works on characters, so I had to use base::as.character to change the class of the TEX$name vector.

TEX$name <- sapply(strsplit(as.character(TEX$name), " "), paste, collapse = "_")

Now comes the tricky part. Unfortunately the picante::pd function to determine phylogenetic diversity requires a community matrix, so we have to create one. Create a matrix of 0s with each row being a county name and each column being a species name and populate the empty row and column names with the names of the counties and species.

SAMP <- matrix(0, nrow = length(CNTY$NAME_2), ncol = length(unique(TEX$name)))
row.names(SAMP) <- CNTY$NAME_2
colnames(SAMP) <- unique(TEX$name)

Then we create a for loop that goes to each county row and places a 1 in each species column that is in that county (because the species names are the column names, we can use “name” to reference both).

for(i in 1:nrow(SAMP)) {
    SAMP[i, unique(TEX[TEX$county %in% CNTY$NAME_2[i], "name"])] <- 1
}

SAMP <- SAMP[, colnames(SAMP)[!colnames(SAMP) %in% setdiff(colnames(SAMP), TREE$mammalST_MSW05_bestDates$tip.label)]]

The picante::pd function does not play well with species being found in the matrix that are not found in the tree. This also looks horrifying but is really quite simple. We find the difference between species names from the BRTC Mammal Collection and the tree tip labels using the base::setdiff function and then exclude the names that are not found in the tree.

SAMP <- SAMP[, colnames(SAMP)[!colnames(SAMP) %in% setdiff(colnames(SAMP), TREE$mammalST_MSW05_bestDates$tip.label)]]

The picante::pd function returns a data.frame comprised of vectors of phylogenetic diversity (PD) and species richness (SR) for each county

PD <- pd(SAMP, TREE$mammalST_MSW05_bestDates)
head(PD)
##               PD SR
## Anderson   819.3 15
## Andrews    328.1  4
## Angelina   340.1  3
## Aransas   1349.2 24
## Archer     400.6  5
## Armstrong  277.7  3

Add it to our list

CNTY$pd <- log(PD$PD)
TPD <- tidy(CNTY, region = "pd")
TPD$id <- as.numeric(TPD$id)
TPD[which(TPD$id == 0), "id"] <- NA

DIV <- list(TSR, TOBS, TPD)

Plot!

ggplot(data = DIV[[3]], aes(x = long, y = lat, fill = id, group = group)) + geom_polygon(color = "black", size = 0.2) + scale_fill_viridis(name = "BRTC Phylogenetic Diversity", option = "D", na.value = NA) #find the light