packages <- c("tidyverse", "magrittr", "reshape2")
# Install any package that is missing, then attach it
for (pkg in packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}

tidyverse

For this module, we will be taking a short, introductory dive into the tidyverse series of packages, with the goal of showing you what possibilities exist for manipulating and cleaning data.

Since we can’t manipulate data without data, we’ll start there.

For the purposes of working with packages that are supposed to help us deal with messy, disorganized data, we’ll be using biodiversity occurrence data.

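# stringsAsFactors = TRUE reproduces the factor (<fct>) columns shown in the output below on R >= 4.0
OCC <- read.csv("https://raw.githubusercontent.com/acastellanos39/OSOS2019/master/tex_mammals.csv", header = TRUE, stringsAsFactors = TRUE)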

These are all the mammal records on GBIF from Texas that have coordinate data. It turns out there are quite a few records, and each has a lot of (potential) information associated with it.

dim(OCC)
head(OCC)
str(OCC)
summary(OCC)

Take a look at these data and get a feeling for the quantity, classes, what may interest us, and things like NAs or data that may not be useful.

magrittr

Before we start cleaning up our data, I want to introduce the concept of a pipe in R via the magrittr package. A pipe is written with the %>% operator (in RStudio, Cmd + Shift + M on a Mac or Ctrl + Shift + M on Windows and Linux, if you don’t want to type it out).

OCC %>% select(1:5) %>% head
##          key                             scientificName decimalLatitude
## 1 1913337407                Lynx rufus (Schreber, 1777)        31.92127
## 2 1927462932 Canis lupus baileyi Nelson & Goldman, 1929        32.17984
## 3 1927462934 Canis lupus baileyi Nelson & Goldman, 1929        32.17984
## 4 1913337038        Canis latrans texensis Bailey, 1905        26.52303
## 5 1913337381  Odocoileus virginianus (Zimmermann, 1780)        30.87341
## 6 1500196460   Bassariscus astutus (Lichtenstein, 1830)        31.92514
##   decimalLongitude        issues
## 1        -99.75406              
## 2        -97.79981              
## 3        -97.79981              
## 4        -97.49110              
## 5        -96.24263              
## 6       -106.51273 gass84,gdativ

A pipe works by taking something from the left and making it the first argument for something on the right. In this case, we take our data.frame (typically we enter some sort of data in at the beginning) and place it as the first argument in a dplyr::select function (check ?dplyr::select to see what the first argument is, but we’ll talk about this later). The real utility of a pipe comes in later by taking the result of our dplyr::select function (the first five columns of our data.frame) and putting that as the first argument of our head function.
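
As a minimal sketch of that idea, the snippet below compares a nested call with the equivalent piped call. The `toy` data.frame is a hypothetical stand-in for OCC (its columns and values are made up for illustration) so that the example runs without downloading anything.

```r
library(dplyr)  # provides select() and re-exports the %>% pipe

# Hypothetical toy data standing in for OCC
toy <- data.frame(key = 1:8,
                  species = rep(c("Lynx rufus", "Canis latrans"), 4),
                  lat = seq(26, 33, by = 1),
                  lon = seq(-106, -99, by = 1))

# Nested call: has to be read from the inside out
nested <- head(select(toy, 1:2), 3)

# Piped call: reads left to right, each result becomes the next first argument
piped <- toy %>% select(1:2) %>% head(3)

# Both routes should give the same object
identical(nested, piped)
```

Extra arguments (here the `3` passed to head) simply follow the piped-in first argument.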

There are also pipe operators other than %>% that you can use.

OCC %$% decimalLatitude %>% max
## [1] 66.39965

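The %$% operator acts like the $ in indexing: it exposes the columns of the object on its left to the expression on its right, so OCC %$% decimalLatitude is essentially OCC$decimalLatitude, but it keeps you within the pipeline ecosystem (this may be more helpful further down, e.g. in a plot).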
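
A small sketch of %$% next to its base R equivalents, using a hypothetical three-row `toy` data.frame (the values are invented for illustration). The last line shows where the operator tends to earn its keep: calling a function that needs several columns but has no data argument.

```r
library(magrittr)

# Hypothetical toy coordinates
toy <- data.frame(lat = c(26.5, 31.9, 30.8),
                  lon = c(-97.5, -99.8, -96.2))

max_base <- max(toy$lat)          # base R indexing
max_pipe <- toy %$% lat %>% max   # same value, written as a pipeline

# cor() has no data argument, so %$% exposes both columns to it
r_pipe <- toy %$% cor(lat, lon)
r_base <- cor(toy$lat, toy$lon)
```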

alt <- OCC %>% select(species, coordinateUncertaintyInMeters)
nrow(alt)
## [1] 44767
alt %<>% dplyr::filter(coordinateUncertaintyInMeters <= 5000)
nrow(alt)
## [1] 6459

The %<>% operator is for those of us too lazy to come up with new object names: it uses the object on the left as both the input to the pipe and the name that the result is assigned to. Because it overwrites the original object, don’t use it unless you are sure the pipeline won’t spit out a NULL result or something else you don’t want.
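
To make the overwrite explicit, here is a sketch on hypothetical toy data (column names and values are assumptions for illustration) showing that %<>% behaves like piping and then assigning back to the same name.

```r
library(dplyr)
library(magrittr)

# Hypothetical toy data; one uncertainty value is missing
toy <- data.frame(species = c("a", "b", "c", "d"),
                  uncertainty = c(100, 8000, NA, 2500))

# Explicit version: pipe, then assign to a new name
kept <- toy %>% filter(uncertainty <= 5000)

# Assignment pipe: alt2 is overwritten with the filtered result
alt2 <- toy
alt2 %<>% filter(uncertainty <= 5000)

identical(kept, alt2)
nrow(alt2)
```

Keeping the untouched `toy` object around is a cheap safety net while you are still experimenting with a pipeline.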

dplyr

We will start out and spend a good portion of our time working with the dplyr package for data manipulation and cleanup.

We will focus on a series of helpful functions and how they work, and then look at how things like group_by can be used with them.

It should be noted that the first argument of the main dplyr verbs (select, filter, and so on) is the data (.data =), which means that they are designed to work well with pipes if so desired.

select

dplyr::select is a helpful function that lets you select the columns that you want in a variety of ways.

We have a lot of data in our OCC data.frame object, but we don’t need most of it, so let’s only grab a bit of it to work with downstream.

dplyr::select accepts either column positions (akin to indexing by DATA[, 1:5] or DATA[, c(1:2, 5)]) or column names (akin to DATA[, "key"] or DATA[, c("key", "scientificName")]).
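
The comparison with base indexing can be sketched on a hypothetical three-column `toy` data.frame (values invented for illustration). One difference worth knowing: with a single column, base indexing drops to a plain vector by default, while select keeps a data.frame.

```r
library(dplyr)

# Hypothetical toy data with a few OCC-style column names
toy <- data.frame(key = 1:3,
                  scientificName = c("Lynx rufus", "Canis lupus", "Canis latrans"),
                  decimalLatitude = c(31.9, 32.2, 26.5),
                  stringsAsFactors = FALSE)

by_pos    <- toy %>% select(1:2)                  # by position
by_name   <- toy %>% select(key, scientificName)  # by (unquoted) name
base_pos  <- toy[, 1:2]                           # base R, by position
base_name <- toy[, c("key", "scientificName")]    # base R, by name

all.equal(by_pos, by_name)
all.equal(by_pos, base_pos)

# Single column: select() keeps a data.frame, `[` drops to a vector
one_col_dplyr <- toy %>% select(key)
one_col_base  <- toy[, "key"]
class(one_col_dplyr)
class(one_col_base)
```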

colnames(OCC)
##   [1] "key"                                    
##   [2] "scientificName"                         
##   [3] "decimalLatitude"                        
##   [4] "decimalLongitude"                       
##   [5] "issues"                                 
##   [6] "datasetKey"                             
##   [7] "publishingOrgKey"                       
##   [8] "networkKeys"                            
##   [9] "installationKey"                        
##  [10] "publishingCountry"                      
##  [11] "protocol"                               
##  [12] "lastCrawled"                            
##  [13] "lastParsed"                             
##  [14] "crawlId"                                
##  [15] "extensions"                             
##  [16] "basisOfRecord"                          
##  [17] "taxonKey"                               
##  [18] "kingdomKey"                             
##  [19] "phylumKey"                              
##  [20] "classKey"                               
##  [21] "orderKey"                               
##  [22] "familyKey"                              
##  [23] "genusKey"                               
##  [24] "speciesKey"                             
##  [25] "acceptedTaxonKey"                       
##  [26] "acceptedScientificName"                 
##  [27] "kingdom"                                
##  [28] "phylum"                                 
##  [29] "order"                                  
##  [30] "family"                                 
##  [31] "genus"                                  
##  [32] "species"                                
##  [33] "genericName"                            
##  [34] "specificEpithet"                        
##  [35] "taxonRank"                              
##  [36] "taxonomicStatus"                        
##  [37] "dateIdentified"                         
##  [38] "coordinateUncertaintyInMeters"          
##  [39] "stateProvince"                          
##  [40] "year"                                   
##  [41] "month"                                  
##  [42] "day"                                    
##  [43] "eventDate"                              
##  [44] "modified"                               
##  [45] "lastInterpreted"                        
##  [46] "references"                             
##  [47] "license"                                
##  [48] "identifiers"                            
##  [49] "facts"                                  
##  [50] "relations"                              
##  [51] "geodeticDatum"                          
##  [52] "class"                                  
##  [53] "countryCode"                            
##  [54] "country"                                
##  [55] "rightsHolder"                           
##  [56] "identifier"                             
##  [57] "informationWithheld"                    
##  [58] "verbatimEventDate"                      
##  [59] "datasetName"                            
##  [60] "verbatimLocality"                       
##  [61] "gbifID"                                 
##  [62] "collectionCode"                         
##  [63] "occurrenceID"                           
##  [64] "taxonID"                                
##  [65] "recordedBy"                             
##  [66] "catalogNumber"                          
##  [67] "http...unknown.org.occurrenceDetails"   
##  [68] "institutionCode"                        
##  [69] "rights"                                 
##  [70] "eventTime"                              
##  [71] "identificationID"                       
##  [72] "name"                                   
##  [73] "occurrenceRemarks"                      
##  [74] "infraspecificEpithet"                   
##  [75] "identificationRemarks"                  
##  [76] "http...unknown.org.recordedByOrcid"     
##  [77] "establishmentMeans"                     
##  [78] "elevation"                              
##  [79] "elevationAccuracy"                      
##  [80] "continent"                              
##  [81] "institutionID"                          
##  [82] "county"                                 
##  [83] "language"                               
##  [84] "type"                                   
##  [85] "preparations"                           
##  [86] "occurrenceStatus"                       
##  [87] "verbatimElevation"                      
##  [88] "nomenclaturalCode"                      
##  [89] "higherGeography"                        
##  [90] "georeferenceVerificationStatus"         
##  [91] "endDayOfYear"                           
##  [92] "locality"                               
##  [93] "startDayOfYear"                         
##  [94] "bibliographicCitation"                  
##  [95] "accessRights"                           
##  [96] "higherClassification"                   
##  [97] "sex"                                    
##  [98] "lifeStage"                              
##  [99] "habitat"                                
## [100] "fieldNumber"                            
## [101] "taxonConceptID"                         
## [102] "locationID"                             
## [103] "samplingProtocol"                       
## [104] "associatedSequences"                    
## [105] "identifiedBy"                           
## [106] "georeferenceSources"                    
## [107] "X.1f2c0cbe.40df.43f6.ba07.e76133e78c31."
## [108] "individualCount"                        
## [109] "dynamicProperties"                      
## [110] "identificationVerificationStatus"       
## [111] "eventRemarks"                           
## [112] "locationAccordingTo"                    
## [113] "locationRemarks"                        
## [114] "georeferencedDate"                      
## [115] "georeferencedBy"                        
## [116] "georeferenceProtocol"                   
## [117] "verbatimCoordinateSystem"               
## [118] "otherCatalogNumbers"                    
## [119] "organismID"                             
## [120] "previousIdentifications"                
## [121] "identificationQualifier"                
## [122] "collectionID"                           
## [123] "recordNumber"                           
## [124] "municipality"                           
## [125] "taxonRemarks"                           
## [126] "vernacularName"                         
## [127] "reproductiveCondition"                  
## [128] "georeferenceRemarks"                    
## [129] "ownerInstitutionCode"                   
## [130] "earliestEonOrLowestEonothem"            
## [131] "earliestEraOrLowestErathem"             
## [132] "earliestEpochOrLowestSeries"            
## [133] "earliestPeriodOrLowestSystem"           
## [134] "disposition"                            
## [135] "fieldNotes"                             
## [136] "datasetID"                              
## [137] "associatedOccurrences"                  
## [138] "behavior"                               
## [139] "depth"                                  
## [140] "depthAccuracy"                          
## [141] "namePublishedInYear"                    
## [142] "nameAccordingTo"                        
## [143] "acceptedNameUsage"                      
## [144] "parentNameUsage"                        
## [145] "latestEraOrHighestErathem"              
## [146] "latestEpochOrHighestSeries"             
## [147] "latestPeriodOrHighestSystem"            
## [148] "waterBody"                              
## [149] "associatedTaxa"                         
## [150] "associatedReferences"                   
## [151] "earliestAgeOrLowestStage"               
## [152] "formation"                              
## [153] "group"                                  
## [154] "identificationReferences"               
## [155] "dataGeneralizations"                    
## [156] "member"                                 
## [157] "latestAgeOrHighestStage"                
## [158] "http...unknown.org.recordId"            
## [159] "lowestBiostratigraphicZone"             
## [160] "highestBiostratigraphicZone"            
## [161] "eventID"                                
## [162] "typifiedName"                           
## [163] "island"                                 
## [164] "bed"                                    
## [165] "typeStatus"                             
## [166] "coordinatePrecision"                    
## [167] "samplingEffort"                         
## [168] "lithostratigraphicTerms"                
## [169] "verbatimTaxonRank"                      
## [170] "geologicalContextID"                    
## [171] "latestEonOrHighestEonothem"             
## [172] "organismRemarks"                        
## [173] "originalNameUsage"
select(OCC, 1:5) %>% head
##          key                             scientificName decimalLatitude
## 1 1913337407                Lynx rufus (Schreber, 1777)        31.92127
## 2 1927462932 Canis lupus baileyi Nelson & Goldman, 1929        32.17984
## 3 1927462934 Canis lupus baileyi Nelson & Goldman, 1929        32.17984
## 4 1913337038        Canis latrans texensis Bailey, 1905        26.52303
## 5 1913337381  Odocoileus virginianus (Zimmermann, 1780)        30.87341
## 6 1500196460   Bassariscus astutus (Lichtenstein, 1830)        31.92514
##   decimalLongitude        issues
## 1        -99.75406              
## 2        -97.79981              
## 3        -97.79981              
## 4        -97.49110              
## 5        -96.24263              
## 6       -106.51273 gass84,gdativ
OCC %>% select(1:5) %>% head
##          key                             scientificName decimalLatitude
## 1 1913337407                Lynx rufus (Schreber, 1777)        31.92127
## 2 1927462932 Canis lupus baileyi Nelson & Goldman, 1929        32.17984
## 3 1927462934 Canis lupus baileyi Nelson & Goldman, 1929        32.17984
## 4 1913337038        Canis latrans texensis Bailey, 1905        26.52303
## 5 1913337381  Odocoileus virginianus (Zimmermann, 1780)        30.87341
## 6 1500196460   Bassariscus astutus (Lichtenstein, 1830)        31.92514
##   decimalLongitude        issues
## 1        -99.75406              
## 2        -97.79981              
## 3        -97.79981              
## 4        -97.49110              
## 5        -96.24263              
## 6       -106.51273 gass84,gdativ
OCC %>% select(key, scientificName, decimalLatitude, decimalLongitude, issues) %>% head
##          key                             scientificName decimalLatitude
## 1 1913337407                Lynx rufus (Schreber, 1777)        31.92127
## 2 1927462932 Canis lupus baileyi Nelson & Goldman, 1929        32.17984
## 3 1927462934 Canis lupus baileyi Nelson & Goldman, 1929        32.17984
## 4 1913337038        Canis latrans texensis Bailey, 1905        26.52303
## 5 1913337381  Odocoileus virginianus (Zimmermann, 1780)        30.87341
## 6 1500196460   Bassariscus astutus (Lichtenstein, 1830)        31.92514
##   decimalLongitude        issues
## 1        -99.75406              
## 2        -97.79981              
## 3        -97.79981              
## 4        -97.49110              
## 5        -96.24263              
## 6       -106.51273 gass84,gdativ

Quite a few of these columns contain information that is useful to us, so we’ll select them by column number.

DATA <- OCC %>% select(4, 3, 32, 5, 16, 29:30, 38, 40, 59, 66, 68, 82, 85, 92, 97)
dim(DATA)
## [1] 44767    16
head(DATA)
##   decimalLongitude decimalLatitude                species        issues
## 1        -99.75406        31.92127             Lynx rufus              
## 2        -97.79981        32.17984            Canis lupus              
## 3        -97.79981        32.17984            Canis lupus              
## 4        -97.49110        26.52303          Canis latrans              
## 5        -96.24263        30.87341 Odocoileus virginianus              
## 6       -106.51273        31.92514    Bassariscus astutus gass84,gdativ
##        basisOfRecord        order      family
## 1 PRESERVED_SPECIMEN    Carnivora     Felidae
## 2 PRESERVED_SPECIMEN    Carnivora     Canidae
## 3 PRESERVED_SPECIMEN    Carnivora     Canidae
## 4 PRESERVED_SPECIMEN    Carnivora     Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla    Cervidae
## 6 PRESERVED_SPECIMEN    Carnivora Procyonidae
##   coordinateUncertaintyInMeters year datasetName   catalogNumber
## 1                            NA 2018        <NA>           65493
## 2                        804.67 2018        <NA> MSB:Mamm:324207
## 3                        804.67 2018        <NA> MSB:Mamm:324206
## 4                            NA 2017        <NA>           64950
## 5                            NA 2017        <NA>           65491
## 6                            NA 2017        <NA>  UTEP:Mamm:8483
##   institutionCode           county
## 1            TCWC          Runnels
## 2             MSB Somervell County
## 3             MSB Somervell County
## 4            TCWC          Willacy
## 5            TCWC           Brazos
## 6            UTEP   El Paso County
##                                                                           preparations
## 1                                                                          SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4                                                                          ss | tissue
## 5                                                                          SK | tissue
## 6                                                                skeleton; skin, study
##                                                         locality    sex
## 1                                        Highway 153 near FM 140   <NA>
## 2                                     Fossil Rim Wildlife Center   <NA>
## 3                                     Fossil Rim Wildlife Center   <NA>
## 4                                 East Foundation, El Sauz Ranch   <NA>
## 5                                           Jack Creek at FM 974   <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE

BUT that’s not all: there is still a lot more that we can do with dplyr::select and its variants. For example, you can select columns based on patterns in their names.

OCC %>% select(starts_with("decimal")) %>% head
##   decimalLatitude decimalLongitude
## 1        31.92127        -99.75406
## 2        32.17984        -97.79981
## 3        32.17984        -97.79981
## 4        26.52303        -97.49110
## 5        30.87341        -96.24263
## 6        31.92514       -106.51273
OCC %>% select(ends_with("Name")) %>% head
##                               scientificName
## 1                Lynx rufus (Schreber, 1777)
## 2 Canis lupus baileyi Nelson & Goldman, 1929
## 3 Canis lupus baileyi Nelson & Goldman, 1929
## 4        Canis latrans texensis Bailey, 1905
## 5  Odocoileus virginianus (Zimmermann, 1780)
## 6   Bassariscus astutus (Lichtenstein, 1830)
##                       acceptedScientificName genericName datasetName
## 1                Lynx rufus (Schreber, 1777)        Lynx        <NA>
## 2 Canis lupus baileyi Nelson & Goldman, 1929       Canis        <NA>
## 3 Canis lupus baileyi Nelson & Goldman, 1929       Canis        <NA>
## 4        Canis latrans texensis Bailey, 1905       Canis        <NA>
## 5  Odocoileus virginianus (Zimmermann, 1780)  Odocoileus        <NA>
## 6   Bassariscus astutus (Lichtenstein, 1830) Bassariscus        <NA>
##                                         name vernacularName typifiedName
## 1                Lynx rufus (Schreber, 1777)           <NA>         <NA>
## 2 Canis lupus baileyi Nelson & Goldman, 1929           <NA>         <NA>
## 3 Canis lupus baileyi Nelson & Goldman, 1929           <NA>         <NA>
## 4        Canis latrans texensis Bailey, 1905           <NA>         <NA>
## 5  Odocoileus virginianus (Zimmermann, 1780)           <NA>         <NA>
## 6   Bassariscus astutus (Lichtenstein, 1830)           <NA>         <NA>

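The starts_with and ends_with helpers select columns whose names begin or end with the given string. Matching ignores case by default, which is why ends_with("Name") also returned the lowercase name column.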
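
A short sketch of these helpers, plus two related ones (contains and matches), on a hypothetical one-row `toy` data.frame. The column names mimic OCC but the data are invented for illustration.

```r
library(dplyr)

# Hypothetical one-row data.frame with OCC-style column names
toy <- data.frame(decimalLatitude = 31.9,
                  decimalLongitude = -99.8,
                  scientificName = "Lynx rufus",
                  name = "Lynx rufus",
                  year = 2018)

starts_dec   <- names(select(toy, starts_with("decimal")))
ends_default <- names(select(toy, ends_with("Name")))                      # case-insensitive
ends_strict  <- names(select(toy, ends_with("Name", ignore.case = FALSE))) # case-sensitive
has_lat      <- names(select(toy, contains("Lat")))                        # substring match
by_regex     <- names(select(toy, matches("^decimal|year$")))              # regular expression

ends_default
ends_strict
```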

DATA %>% select_all(toupper) %>% head 
##   DECIMALLONGITUDE DECIMALLATITUDE                SPECIES        ISSUES
## 1        -99.75406        31.92127             Lynx rufus              
## 2        -97.79981        32.17984            Canis lupus              
## 3        -97.79981        32.17984            Canis lupus              
## 4        -97.49110        26.52303          Canis latrans              
## 5        -96.24263        30.87341 Odocoileus virginianus              
## 6       -106.51273        31.92514    Bassariscus astutus gass84,gdativ
##        BASISOFRECORD        ORDER      FAMILY
## 1 PRESERVED_SPECIMEN    Carnivora     Felidae
## 2 PRESERVED_SPECIMEN    Carnivora     Canidae
## 3 PRESERVED_SPECIMEN    Carnivora     Canidae
## 4 PRESERVED_SPECIMEN    Carnivora     Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla    Cervidae
## 6 PRESERVED_SPECIMEN    Carnivora Procyonidae
##   COORDINATEUNCERTAINTYINMETERS YEAR DATASETNAME   CATALOGNUMBER
## 1                            NA 2018        <NA>           65493
## 2                        804.67 2018        <NA> MSB:Mamm:324207
## 3                        804.67 2018        <NA> MSB:Mamm:324206
## 4                            NA 2017        <NA>           64950
## 5                            NA 2017        <NA>           65491
## 6                            NA 2017        <NA>  UTEP:Mamm:8483
##   INSTITUTIONCODE           COUNTY
## 1            TCWC          Runnels
## 2             MSB Somervell County
## 3             MSB Somervell County
## 4            TCWC          Willacy
## 5            TCWC           Brazos
## 6            UTEP   El Paso County
##                                                                           PREPARATIONS
## 1                                                                          SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4                                                                          ss | tissue
## 5                                                                          SK | tissue
## 6                                                                skeleton; skin, study
##                                                         LOCALITY    SEX
## 1                                        Highway 153 near FM 140   <NA>
## 2                                     Fossil Rim Wildlife Center   <NA>
## 3                                     Fossil Rim Wildlife Center   <NA>
## 4                                 East Foundation, El Sauz Ranch   <NA>
## 5                                           Jack Creek at FM 974   <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE

dplyr::select_all is a special case of dplyr::select (labeled by tidyverse developers as a scoped variant, along with select_if and select_at) that selects all columns and applies an additional function to their names (here base::toupper, which converts all the column names to upper case).
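
In dplyr 1.0.0 and later, the scoped variants are documented as superseded, and rename_with covers the same ground as the select_all call above. The sketch below assumes such a dplyr version is installed and uses a hypothetical one-row `toy` data.frame.

```r
library(dplyr)

# Hypothetical one-row data.frame with DATA-style column names
toy <- data.frame(decimalLongitude = -99.8,
                  decimalLatitude = 31.9,
                  species = "Lynx rufus")

# Apply toupper() to every column name
upper_all <- toy %>% rename_with(toupper)

# Apply toupper() only to the columns picked by a selection helper
upper_some <- toy %>% rename_with(toupper, starts_with("decimal"))

names(upper_all)
names(upper_some)
```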

DATA %>% select(lon = 1, lat = decimalLatitude) %>% head
##          lon      lat
## 1  -99.75406 31.92127
## 2  -97.79981 32.17984
## 3  -97.79981 32.17984
## 4  -97.49110 26.52303
## 5  -96.24263 30.87341
## 6 -106.51273 31.92514
DATA %>% rename(lon = decimalLongitude, lat = decimalLatitude) %>% head
##          lon      lat                species        issues
## 1  -99.75406 31.92127             Lynx rufus              
## 2  -97.79981 32.17984            Canis lupus              
## 3  -97.79981 32.17984            Canis lupus              
## 4  -97.49110 26.52303          Canis latrans              
## 5  -96.24263 30.87341 Odocoileus virginianus              
## 6 -106.51273 31.92514    Bassariscus astutus gass84,gdativ
##        basisOfRecord        order      family
## 1 PRESERVED_SPECIMEN    Carnivora     Felidae
## 2 PRESERVED_SPECIMEN    Carnivora     Canidae
## 3 PRESERVED_SPECIMEN    Carnivora     Canidae
## 4 PRESERVED_SPECIMEN    Carnivora     Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla    Cervidae
## 6 PRESERVED_SPECIMEN    Carnivora Procyonidae
##   coordinateUncertaintyInMeters year datasetName   catalogNumber
## 1                            NA 2018        <NA>           65493
## 2                        804.67 2018        <NA> MSB:Mamm:324207
## 3                        804.67 2018        <NA> MSB:Mamm:324206
## 4                            NA 2017        <NA>           64950
## 5                            NA 2017        <NA>           65491
## 6                            NA 2017        <NA>  UTEP:Mamm:8483
##   institutionCode           county
## 1            TCWC          Runnels
## 2             MSB Somervell County
## 3             MSB Somervell County
## 4            TCWC          Willacy
## 5            TCWC           Brazos
## 6            UTEP   El Paso County
##                                                                           preparations
## 1                                                                          SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4                                                                          ss | tissue
## 5                                                                          SK | tissue
## 6                                                                skeleton; skin, study
##                                                         locality    sex
## 1                                        Highway 153 near FM 140   <NA>
## 2                                     Fossil Rim Wildlife Center   <NA>
## 3                                     Fossil Rim Wildlife Center   <NA>
## 4                                 East Foundation, El Sauz Ranch   <NA>
## 5                                           Jack Creek at FM 974   <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE

You can rename variables inside dplyr::select (by column number or column name), which keeps only the columns you list, OR you can use dplyr::rename, which keeps all of the variables.

filter

The dplyr::filter function does what it sounds like: it filters the rows of your data based on the conditions you give it.

dim(DATA)
## [1] 44767    16
DATA %>% filter(year >= 2000) %>% dim #grabs only records from 2000 on
## [1] 5777   16
DATA %>% filter(year >= 2000 & order == "Rodentia") %>% dim #grabs only *rodent* records from 2000 on
## [1] 4016   16
#DATA %>% filter(year >= 2000, order == "Rodentia") is the same thing as the above line
DATA %>% filter(year >= 2000 | order == "Rodentia") %>% dim #grabs records that are *either* rodents or from the year 2000 on
## [1] 31764    16

You can use the & symbol to combine multiple conditions that must all be true, or the | symbol when any one of the conditions is enough.

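This is roughly equivalent to using DATA[DATA$year >= 2000, ], DATA[DATA$year >= 2000 & DATA$order == "Rodentia", ], etc. The one difference is missing values: filter drops rows where the condition evaluates to NA, whereas base indexing returns them as rows filled with NA. Depending on your coding familiarity, one or the other may feel cleaner or more useful for you.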
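
The way the two approaches treat a missing year can be sketched on a hypothetical four-row `toy` data.frame (values invented for illustration). Wrapping the base R condition in which() is one common way to get the same rows as filter.

```r
library(dplyr)

# Hypothetical toy data; the third record has no year
toy <- data.frame(species = c("a", "b", "c", "d"),
                  year = c(1995, 2005, NA, 2018))

via_filter <- toy %>% filter(year >= 2000)      # rows where the test is NA are dropped
via_base   <- toy[toy$year >= 2000, ]           # the NA test yields an all-NA row
via_which  <- toy[which(toy$year >= 2000), ]    # which() discards the NA first

nrow(via_filter)
nrow(via_base)
nrow(via_which)
```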

You can also apply functions to determine cutoffs for your filtering arguments.

median(DATA$year, na.rm = T)
## [1] 1977
DATA %>% filter(year > median(year, na.rm = T)) %>% dim
## [1] 20954    16
#DATA %>% filter(year > 1977) %>% dim

This leads us into the concept of groups!

group_by

group_by is useful for partitioning things and applying functions to these various partitions.

In the last example, we used the median year of all records as a single threshold. What if we wanted a separate median threshold for each order instead?

ORD <- DATA %>% group_by(order)
ORD
## # A tibble: 44,767 x 16
## # Groups:   order [17]
##    decimalLongitude decimalLatitude species   issues  basisOfRecord  order
##               <dbl>           <dbl> <fct>     <fct>   <fct>          <fct>
##  1            -99.8            31.9 Lynx ruf… ""      PRESERVED_SPE… Carn…
##  2            -97.8            32.2 Canis lu… ""      PRESERVED_SPE… Carn…
##  3            -97.8            32.2 Canis lu… ""      PRESERVED_SPE… Carn…
##  4            -97.5            26.5 Canis la… ""      PRESERVED_SPE… Carn…
##  5            -96.2            30.9 Odocoile… ""      PRESERVED_SPE… Arti…
##  6           -107.             31.9 Bassaris… gass84… PRESERVED_SPE… Carn…
##  7            -96.3            30.6 Baiomys … ""      PRESERVED_SPE… Rode…
##  8            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
##  9            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
## 10            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
## # ... with 44,757 more rows, and 10 more variables: family <fct>,
## #   coordinateUncertaintyInMeters <dbl>, year <int>, datasetName <fct>,
## #   catalogNumber <fct>, institutionCode <fct>, county <fct>,
## #   preparations <fct>, locality <fct>, sex <fct>

The argument in the dplyr::group_by function is the column you want to group by. Each of the different levels in the specified column will be its own group. Check out levels(DATA$order) to see what groups there are.

The result is a tibble, which is the preferred object class for the tidyverse (as part of the tibble package) and somehow not a Gremlins or Furby knockoff. A tibble holds data in a similar way to a data.frame but presents it differently by showing the dimensions, only as many columns as fit on screen, the class of each column, and the groups if they exist.

When working with larger data.frames (say in the 100,000 or 1,000,000+ row range), simply calling the object can flood your console and bog down your session; a tibble avoids this by printing only the first ten rows by default.
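
As a rough illustration of the printing difference, the sketch below converts a hypothetical 1,000-row data.frame (`big`, made up for this example) with tibble::as_tibble. Nothing about the data changes, only the class and how it prints.

```r
library(tibble)

# Hypothetical data.frame with enough rows to be annoying to print in full
big <- data.frame(id = 1:1000,
                  value = seq(0, 1, length.out = 1000))

big_tbl <- as_tibble(big)

class(big)       # a plain data.frame
class(big_tbl)   # tbl_df / tbl / data.frame

big_tbl          # prints the dimensions, column classes and the first 10 rows
dim(big_tbl)     # same dimensions as the original
```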

You can also choose multiple grouping variables.

DATA %>% group_by(institutionCode, basisOfRecord) #each group is a specific combination of institution and record type
## # A tibble: 44,767 x 16
## # Groups:   institutionCode, basisOfRecord [64]
##    decimalLongitude decimalLatitude species   issues  basisOfRecord  order
##               <dbl>           <dbl> <fct>     <fct>   <fct>          <fct>
##  1            -99.8            31.9 Lynx ruf… ""      PRESERVED_SPE… Carn…
##  2            -97.8            32.2 Canis lu… ""      PRESERVED_SPE… Carn…
##  3            -97.8            32.2 Canis lu… ""      PRESERVED_SPE… Carn…
##  4            -97.5            26.5 Canis la… ""      PRESERVED_SPE… Carn…
##  5            -96.2            30.9 Odocoile… ""      PRESERVED_SPE… Arti…
##  6           -107.             31.9 Bassaris… gass84… PRESERVED_SPE… Carn…
##  7            -96.3            30.6 Baiomys … ""      PRESERVED_SPE… Rode…
##  8            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
##  9            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
## 10            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
## # ... with 44,757 more rows, and 10 more variables: family <fct>,
## #   coordinateUncertaintyInMeters <dbl>, year <int>, datasetName <fct>,
## #   catalogNumber <fct>, institutionCode <fct>, county <fct>,
## #   preparations <fct>, locality <fct>, sex <fct>
ORD %>% filter(year > median(year, na.rm = T)) # slightly different result than before
## # A tibble: 20,788 x 16
## # Groups:   order [13]
##    decimalLongitude decimalLatitude species   issues  basisOfRecord  order
##               <dbl>           <dbl> <fct>     <fct>   <fct>          <fct>
##  1            -99.8            31.9 Lynx ruf… ""      PRESERVED_SPE… Carn…
##  2            -97.8            32.2 Canis lu… ""      PRESERVED_SPE… Carn…
##  3            -97.8            32.2 Canis lu… ""      PRESERVED_SPE… Carn…
##  4            -97.5            26.5 Canis la… ""      PRESERVED_SPE… Carn…
##  5            -96.2            30.9 Odocoile… ""      PRESERVED_SPE… Arti…
##  6           -107.             31.9 Bassaris… gass84… PRESERVED_SPE… Carn…
##  7            -96.3            30.6 Baiomys … ""      PRESERVED_SPE… Rode…
##  8            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
##  9            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
## 10            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
## # ... with 20,778 more rows, and 10 more variables: family <fct>,
## #   coordinateUncertaintyInMeters <dbl>, year <int>, datasetName <fct>,
## #   catalogNumber <fct>, institutionCode <fct>, county <fct>,
## #   preparations <fct>, locality <fct>, sex <fct>
#sapply(levels(DATA$order), function(x) DATA %>% filter(order == x) %$% year %>% median(na.rm = T)) #this also means that we have fossil dermopterans in Texas?!
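
Why is the grouped result slightly different? On a grouped tibble, the median inside filter is computed separately for each group. A minimal sketch on made-up data (toy and its values are hypothetical) makes this easier to see:

```r
library(dplyr)

# Hypothetical toy data: two orders collected in very different eras
toy <- data.frame(order = c("A", "A", "A", "B", "B", "B"),
                  year  = c(1900, 1910, 1920, 2000, 2010, 2020))

# Ungrouped: one overall median is used, so only rows from order B pass
ungrouped <- toy %>%
  filter(year > median(year, na.rm = TRUE))

# Grouped: each order is compared against its own median,
# so every order keeps its most recent records
grouped <- toy %>%
  group_by(order) %>%
  filter(year > median(year, na.rm = TRUE)) %>%
  ungroup()

ungrouped
grouped
```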

For something perhaps a bit more useful, if you wanted to get rid of duplicated locality records within an order, how do you do that?

#DATA %>% select(1, 2) %>% duplicated() %>% `!` %>% sum
#DATA %>% distinct(decimalLongitude, decimalLatitude) %>% dim
ORD %>% distinct(decimalLongitude, decimalLatitude, .keep_all = T) %>% ungroup
## # A tibble: 10,282 x 16
##    decimalLongitude decimalLatitude species   issues  basisOfRecord  order
##               <dbl>           <dbl> <fct>     <fct>   <fct>          <fct>
##  1            -99.8            31.9 Lynx ruf… ""      PRESERVED_SPE… Carn…
##  2            -97.8            32.2 Canis lu… ""      PRESERVED_SPE… Carn…
##  3            -97.5            26.5 Canis la… ""      PRESERVED_SPE… Carn…
##  4            -96.2            30.9 Odocoile… ""      PRESERVED_SPE… Arti…
##  5           -107.             31.9 Bassaris… gass84… PRESERVED_SPE… Carn…
##  6            -96.3            30.6 Baiomys … ""      PRESERVED_SPE… Rode…
##  7            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
##  8            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
##  9            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
## 10            -97.6            33.4 Peromysc… gass84  PRESERVED_SPE… Rode…
## # ... with 10,272 more rows, and 10 more variables: family <fct>,
## #   coordinateUncertaintyInMeters <dbl>, year <int>, datasetName <fct>,
## #   catalogNumber <fct>, institutionCode <fct>, county <fct>,
## #   preparations <fct>, locality <fct>, sex <fct>

Here, we use the dplyr::distinct function to keep only those rows that are distinct according to the columns we give it (longitude and latitude) and use the .keep_all = T argument to specify that we want all other columns as well. Because ORD is grouped by order, duplicates are only dropped within each order. Try it without .keep_all and see what happens.
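
If you’d rather see the behavior on something small first, here is a sketch with a made-up four-row data.frame (toy and its column values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: three records share one coordinate pair
toy <- data.frame(order   = c("Rodentia", "Rodentia", "Rodentia", "Carnivora"),
                  lon     = c(-97.6, -97.6, -96.3, -97.6),
                  lat     = c(33.4, 33.4, 30.6, 33.4),
                  species = c("sp1", "sp2", "sp3", "sp4"))

# Ungrouped and without .keep_all: only the columns named in distinct() come back
coords_only <- toy %>% distinct(lon, lat)

# Grouped with .keep_all: the first row of each order/lon/lat combination,
# with every column retained
by_order <- toy %>%
  group_by(order) %>%
  distinct(lon, lat, .keep_all = TRUE) %>%
  ungroup()

coords_only
by_order
```

Note that the Carnivora record survives in by_order even though a rodent shares its coordinates, because duplicates are judged within each order.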

If you don’t like the look of the tibble result, you can always end with %>% as.data.frame or turn it into a data.frame later with the base::as.data.frame function.

mutate

The dplyr::mutate function also works how it sounds, either by replacing an existing column or by frankensteining a new one out of the columns you already have.

Often a specimen in a natural history collection (much of what is found in GBIF is natural history collection specimens) has an associated collection number (some institution code followed by a number, such as TCWC 6524).

DATA %>% mutate(mus.num = paste(institutionCode, catalogNumber, sep = "_")) %>% head
##   decimalLongitude decimalLatitude                species        issues
## 1        -99.75406        31.92127             Lynx rufus              
## 2        -97.79981        32.17984            Canis lupus              
## 3        -97.79981        32.17984            Canis lupus              
## 4        -97.49110        26.52303          Canis latrans              
## 5        -96.24263        30.87341 Odocoileus virginianus              
## 6       -106.51273        31.92514    Bassariscus astutus gass84,gdativ
##        basisOfRecord        order      family
## 1 PRESERVED_SPECIMEN    Carnivora     Felidae
## 2 PRESERVED_SPECIMEN    Carnivora     Canidae
## 3 PRESERVED_SPECIMEN    Carnivora     Canidae
## 4 PRESERVED_SPECIMEN    Carnivora     Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla    Cervidae
## 6 PRESERVED_SPECIMEN    Carnivora Procyonidae
##   coordinateUncertaintyInMeters year datasetName   catalogNumber
## 1                            NA 2018        <NA>           65493
## 2                        804.67 2018        <NA> MSB:Mamm:324207
## 3                        804.67 2018        <NA> MSB:Mamm:324206
## 4                            NA 2017        <NA>           64950
## 5                            NA 2017        <NA>           65491
## 6                            NA 2017        <NA>  UTEP:Mamm:8483
##   institutionCode           county
## 1            TCWC          Runnels
## 2             MSB Somervell County
## 3             MSB Somervell County
## 4            TCWC          Willacy
## 5            TCWC           Brazos
## 6            UTEP   El Paso County
##                                                                           preparations
## 1                                                                          SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4                                                                          ss | tissue
## 5                                                                          SK | tissue
## 6                                                                skeleton; skin, study
##                                                         locality    sex
## 1                                        Highway 153 near FM 140   <NA>
## 2                                     Fossil Rim Wildlife Center   <NA>
## 3                                     Fossil Rim Wildlife Center   <NA>
## 4                                 East Foundation, El Sauz Ranch   <NA>
## 5                                           Jack Creek at FM 974   <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE
##               mus.num
## 1          TCWC_65493
## 2 MSB_MSB:Mamm:324207
## 3 MSB_MSB:Mamm:324206
## 4          TCWC_64950
## 5          TCWC_65491
## 6 UTEP_UTEP:Mamm:8483

You can create multiple columns in the same function call. I would also recommend always naming each new column by putting column.name = in front of the expression that creates it. Unless you are a being of pure chaos, you probably don’t want columns named year + 4 and log(year).

DATA %>% mutate(year.2 = year + 4, year.3 = log(year)) %>% head
##   decimalLongitude decimalLatitude                species        issues
## 1        -99.75406        31.92127             Lynx rufus              
## 2        -97.79981        32.17984            Canis lupus              
## 3        -97.79981        32.17984            Canis lupus              
## 4        -97.49110        26.52303          Canis latrans              
## 5        -96.24263        30.87341 Odocoileus virginianus              
## 6       -106.51273        31.92514    Bassariscus astutus gass84,gdativ
##        basisOfRecord        order      family
## 1 PRESERVED_SPECIMEN    Carnivora     Felidae
## 2 PRESERVED_SPECIMEN    Carnivora     Canidae
## 3 PRESERVED_SPECIMEN    Carnivora     Canidae
## 4 PRESERVED_SPECIMEN    Carnivora     Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla    Cervidae
## 6 PRESERVED_SPECIMEN    Carnivora Procyonidae
##   coordinateUncertaintyInMeters year datasetName   catalogNumber
## 1                            NA 2018        <NA>           65493
## 2                        804.67 2018        <NA> MSB:Mamm:324207
## 3                        804.67 2018        <NA> MSB:Mamm:324206
## 4                            NA 2017        <NA>           64950
## 5                            NA 2017        <NA>           65491
## 6                            NA 2017        <NA>  UTEP:Mamm:8483
##   institutionCode           county
## 1            TCWC          Runnels
## 2             MSB Somervell County
## 3             MSB Somervell County
## 4            TCWC          Willacy
## 5            TCWC           Brazos
## 6            UTEP   El Paso County
##                                                                           preparations
## 1                                                                          SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4                                                                          ss | tissue
## 5                                                                          SK | tissue
## 6                                                                skeleton; skin, study
##                                                         locality    sex
## 1                                        Highway 153 near FM 140   <NA>
## 2                                     Fossil Rim Wildlife Center   <NA>
## 3                                     Fossil Rim Wildlife Center   <NA>
## 4                                 East Foundation, El Sauz Ranch   <NA>
## 5                                           Jack Creek at FM 974   <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE
##   year.2   year.3
## 1   2022 7.609862
## 2   2022 7.609862
## 3   2022 7.609862
## 4   2021 7.609367
## 5   2021 7.609367
## 6   2021 7.609367
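
Here is a small sketch of what the naming advice means in practice, using a made-up two-row data.frame (toy is hypothetical):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(year = c(2017, 2018))

# Unnamed: the new columns are named after the expressions themselves
unnamed <- toy %>% mutate(year + 4, log(year))
names(unnamed)

# Named: much easier to refer to later
named <- toy %>% mutate(year.2 = year + 4, year.3 = log(year))
names(named)

# Reusing an existing name replaces that column instead of adding one
replaced <- toy %>% mutate(year = year + 4)
replaced
```

To use an unnamed column afterwards, you would have to wrap its name in backticks, which gets old quickly.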

summarize

The dplyr::summarize function again follows the obvious naming trend and gasp summarizes your data. This is yet another of the common dplyr functions that works really well when using group_by to summarize multiple groups.

DATA %>% summarize(n.spec = n_distinct(species))
##   n.spec
## 1    742
DATA %>% summarize(min = min(year, na.rm = T), max = max(year, na.rm = T), mean = mean(year, na.rm = T))
##    min  max     mean
## 1 1700 2018 1975.623

You can summarize for one function or for multiple ones, and you can either name the results or live dangerously and let it be named after the function. Instead of the potentially useful n.spec, you can have the mysterious n_distinct(species)! Swoon

dplyr::summarize is only as good as the functions you use with it. Some useful ones are the usual mean, median, min, max, and quantile, along with n (for the number of records), n_distinct (for the number of unique values, as shown above), etc.

DATA %>% group_by(order) %>% summarize(n = n(), n.spec = n_distinct(species))
## # A tibble: 17 x 3
##    order               n n.spec
##    <fct>           <int>  <int>
##  1 Artiodactyla     1357    126
##  2 Carnivora        3155    110
##  3 Cetacea          2442     23
##  4 Chiroptera       4464     35
##  5 Cingulata         288      5
##  6 Dermoptera          2      1
##  7 Didelphimorphia   297      8
##  8 Diprotodontia       1      1
##  9 Erinaceomorpha      2      2
## 10 Lagomorpha       1264     26
## 11 Perissodactyla    496     97
## 12 Pilosa             19     13
## 13 Primates           43     23
## 14 Proboscidea        44     14
## 15 Rodentia        30003    188
## 16 Soricomorpha      794     22
## 17 <NA>               96     48
#Often the real power of dplyr is in how you string multiple of these functions together
DATA %>% group_by(order) %>% summarize(n = n(), n.spec = n_distinct(species)) %>% mutate(speciesPerRecords = n.spec/n)
## # A tibble: 17 x 4
##    order               n n.spec speciesPerRecords
##    <fct>           <int>  <int>             <dbl>
##  1 Artiodactyla     1357    126           0.0929 
##  2 Carnivora        3155    110           0.0349 
##  3 Cetacea          2442     23           0.00942
##  4 Chiroptera       4464     35           0.00784
##  5 Cingulata         288      5           0.0174 
##  6 Dermoptera          2      1           0.500  
##  7 Didelphimorphia   297      8           0.0269 
##  8 Diprotodontia       1      1           1.00   
##  9 Erinaceomorpha      2      2           1.00   
## 10 Lagomorpha       1264     26           0.0206 
## 11 Perissodactyla    496     97           0.196  
## 12 Pilosa             19     13           0.684  
## 13 Primates           43     23           0.535  
## 14 Proboscidea        44     14           0.318  
## 15 Rodentia        30003    188           0.00627
## 16 Soricomorpha      794     22           0.0277 
## 17 <NA>               96     48           0.500
DATA %>% group_by(institutionCode) %>% summarize(nspec = n_distinct(species), nfam = n_distinct(family), norder = n_distinct(order))
## # A tibble: 60 x 4
##    institutionCode                                      nspec  nfam norder
##    <fct>                                                <int> <int>  <int>
##  1 AMNH                                                     6     5      4
##  2 ASNHC                                                  125    27      9
##  3 ASNHC-ASU                                                3     1      1
##  4 BM-UW                                                    1     1      1
##  5 BMNH-U of M                                              1     1      1
##  6 Borror Laboratory of Bioacoustics, Ohio State Unive…     3     2      2
##  7 CAS                                                     25    14      6
##  8 CHAS                                                    13     8      4
##  9 CLO                                                      7     6      5
## 10 CUMV                                                    19     8      4
## # ... with 50 more rows
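
As a further sketch of mixing several of these summary functions in one call, here is an example on made-up data (toy and its values are hypothetical). It also shows one thing to watch for: n_distinct counts NA as its own value unless you set na.rm = TRUE.

```r
library(dplyr)

# Hypothetical toy data with a missing species and a missing year
toy <- data.frame(order   = c("A", "A", "A", "A", "B", "B"),
                  species = c("a1", "a1", "a2", NA, "b1", "b1"),
                  year    = c(1900, 1950, 2000, NA, 1980, 1990))

res <- toy %>%
  group_by(order) %>%
  summarize(n        = n(),                                # records per order
            n.spec   = n_distinct(species, na.rm = TRUE),  # unique species, ignoring NA
            med.year = median(year, na.rm = TRUE),
            q90.year = quantile(year, 0.9, na.rm = TRUE, names = FALSE))

res
```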

tidyr

Another tidyverse package that is often useful is tidyr, specifically for its data manipulation functions that can turn wide data into long data and long data into wide data.

I know, I know. What are wide/long data and why do I need to know them???

As we’ve learned, most things are named for what they are, so wide data are column heavy and long data have more rows. It is often essential to know how to switch between them, particularly for plotting.

ORD <- DATA %>% group_by(order) %>% summarize(nspec = n_distinct(species), nfam = n_distinct(family))

Take the above result for example. How could you plot this? You’d need one axis for order, but what goes on the other? In a long data.frame, it is a factor column (called variable below) whose levels distinguish the number of species from the number of families. An additional column called value holds the counts and will be used as a color or fill variable.

A non-tidyverse method of doing this is found in the reshape2 package using the reshape2::melt and reshape2::dcast functions.

melt(ORD, id.vars = "order") #id.vars says what will be used as an ID (typically factors)
##              order variable value
## 1     Artiodactyla    nspec   126
## 2        Carnivora    nspec   110
## 3          Cetacea    nspec    23
## 4       Chiroptera    nspec    35
## 5        Cingulata    nspec     5
## 6       Dermoptera    nspec     1
## 7  Didelphimorphia    nspec     8
## 8    Diprotodontia    nspec     1
## 9   Erinaceomorpha    nspec     2
## 10      Lagomorpha    nspec    26
## 11  Perissodactyla    nspec    97
## 12          Pilosa    nspec    13
## 13        Primates    nspec    23
## 14     Proboscidea    nspec    14
## 15        Rodentia    nspec   188
## 16    Soricomorpha    nspec    22
## 17            <NA>    nspec    48
## 18    Artiodactyla     nfam    22
## 19       Carnivora     nfam    13
## 20         Cetacea     nfam     7
## 21      Chiroptera     nfam     4
## 22       Cingulata     nfam     2
## 23      Dermoptera     nfam     1
## 24 Didelphimorphia     nfam     2
## 25   Diprotodontia     nfam     1
## 26  Erinaceomorpha     nfam     1
## 27      Lagomorpha     nfam     1
## 28  Perissodactyla     nfam     8
## 29          Pilosa     nfam     6
## 30        Primates     nfam    11
## 31     Proboscidea     nfam     3
## 32        Rodentia     nfam    22
## 33    Soricomorpha     nfam     2
## 34            <NA>     nfam    24
melt(ORD, "order") %>% ggplot(aes(x = variable, y = order, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = value), color = "white") + 
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  scale_fill_viridis_c("Number")

dcast(melt(ORD, id.vars = "order"), order ~ variable)
##              order nspec nfam
## 1     Artiodactyla   126   22
## 2        Carnivora   110   13
## 3          Cetacea    23    7
## 4       Chiroptera    35    4
## 5        Cingulata     5    2
## 6       Dermoptera     1    1
## 7  Didelphimorphia     8    2
## 8    Diprotodontia     1    1
## 9   Erinaceomorpha     2    1
## 10      Lagomorpha    26    1
## 11  Perissodactyla    97    8
## 12          Pilosa    13    6
## 13        Primates    23   11
## 14     Proboscidea    14    3
## 15        Rodentia   188   22
## 16    Soricomorpha    22    2
## 17            <NA>    48   24

For a reshape2::dcast function (the d standing for data.frame, obviously, because why stop at putting an r in front of all your package and function names), you must specify a formula of x ~ y where the y is the variable that is broken up into columns.

gather and spread

Now that you have an idea of what wide/long data are and the reshape2 package (it is often shamefully joked that everyone still uses reshape2 because melt and cast make more sense), let’s see how tidyr approaches these.

ORD %>% gather(key = "var", value = "val", -order)
## # A tibble: 34 x 3
##    order           var     val
##    <fct>           <chr> <int>
##  1 Artiodactyla    nspec   126
##  2 Carnivora       nspec   110
##  3 Cetacea         nspec    23
##  4 Chiroptera      nspec    35
##  5 Cingulata       nspec     5
##  6 Dermoptera      nspec     1
##  7 Didelphimorphia nspec     8
##  8 Diprotodontia   nspec     1
##  9 Erinaceomorpha  nspec     2
## 10 Lagomorpha      nspec    26
## # ... with 24 more rows

It’s a little bit more roundabout to get to the same result using tidyr. The same basic idea applies by giving a key variable and a value variable, but it is harder for them to talk to each other and still keep the order column.

ORD %>% gather(order)
## # A tibble: 34 x 2
##    order value
##    <chr> <int>
##  1 nspec   126
##  2 nspec   110
##  3 nspec    23
##  4 nspec    35
##  5 nspec     5
##  6 nspec     1
##  7 nspec     8
##  8 nspec     1
##  9 nspec     2
## 10 nspec    26
## # ... with 24 more rows
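
If you are on tidyr 1.0.0 or later, gather also has a successor, pivot_longer, whose argument names you may find easier to keep straight. A minimal sketch, using a small made-up stand-in for ORD (ord_toy is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the ORD summary
ord_toy <- tibble(order = c("Carnivora", "Rodentia"),
                  nspec = c(110L, 188L),
                  nfam  = c(13L, 22L))

# The gather call used above
long_gather <- ord_toy %>%
  gather(key = "var", value = "val", -order)

# The pivot_longer version: every column except order is stacked
long_pivot <- ord_toy %>%
  pivot_longer(cols = -order, names_to = "var", values_to = "val")

long_gather
long_pivot
```

Both hold the same information, but the rows come out in a different sequence: gather stacks one whole column after another, while pivot_longer works through the data row by row.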

tidyr::spread is equivalent to the reshape2::dcast function and does the opposite of tidyr::gather (as the name yet again suggests).

EXP <- data.frame(order = rep(levels(DATA$order), 3), institution = rep(c("TCWC", "KU", "UMMZ"), each = 16), val = sample(1:50, 48, replace = T))
head(EXP, 25)
##              order institution val
## 1     Artiodactyla        TCWC  13
## 2        Carnivora        TCWC  14
## 3          Cetacea        TCWC  37
## 4       Chiroptera        TCWC  42
## 5        Cingulata        TCWC  46
## 6       Dermoptera        TCWC  23
## 7  Didelphimorphia        TCWC  31
## 8    Diprotodontia        TCWC  48
## 9   Erinaceomorpha        TCWC  34
## 10      Lagomorpha        TCWC  35
## 11  Perissodactyla        TCWC   4
## 12          Pilosa        TCWC   9
## 13        Primates        TCWC  33
## 14     Proboscidea        TCWC  29
## 15        Rodentia        TCWC  37
## 16    Soricomorpha        TCWC  43
## 17    Artiodactyla          KU   4
## 18       Carnivora          KU  13
## 19         Cetacea          KU  36
## 20      Chiroptera          KU  20
## 21       Cingulata          KU  40
## 22      Dermoptera          KU  38
## 23 Didelphimorphia          KU   1
## 24   Diprotodontia          KU  13
## 25  Erinaceomorpha          KU  22

Above we made a random example to use tidyr::spread on. It is made up of 16 mammalian orders, three institution codes, and a random value from 1 to 50 for each order in each institution.

EXP %>% spread(key = "institution", val = "val")
##              order KU TCWC UMMZ
## 1     Artiodactyla  4   13   21
## 2        Carnivora 13   14   13
## 3          Cetacea 36   37   40
## 4       Chiroptera 20   42   40
## 5        Cingulata 40   46   49
## 6       Dermoptera 38   23   17
## 7  Didelphimorphia  1   31   14
## 8    Diprotodontia 13   48   14
## 9   Erinaceomorpha 22   34   41
## 10      Lagomorpha 23   35   18
## 11  Perissodactyla 48    4   10
## 12          Pilosa 38    9    4
## 13        Primates 26   33   32
## 14     Proboscidea 47   29   10
## 15        Rodentia 46   37   27
## 16    Soricomorpha 17   43   35
EXP %>% spread(key = "order", val = "val")
##   institution Artiodactyla Carnivora Cetacea Chiroptera Cingulata
## 1          KU            4        13      36         20        40
## 2        TCWC           13        14      37         42        46
## 3        UMMZ           21        13      40         40        49
##   Dermoptera Didelphimorphia Diprotodontia Erinaceomorpha Lagomorpha
## 1         38               1            13             22         23
## 2         23              31            48             34         35
## 3         17              14            14             41         18
##   Perissodactyla Pilosa Primates Proboscidea Rodentia Soricomorpha
## 1             48     38       26          47       46           17
## 2              4      9       33          29       37           43
## 3             10      4       32          10       27           35

Depending on how you choose your key (the column that is spread out, with its levels now being columns) and value (the column holding the numbers that go with the key; the val = in the calls above only works because R partially matches argument names) arguments, you can get very different data.frames.

However, it is possibly easier to connect this with the reshape2::melt and tidyr::gather functions than the different formula syntax of the reshape2 cast functions.
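
As with gather, tidyr 1.0.0 and later offer a successor to spread called pivot_wider. Here is a minimal sketch on a small made-up version of EXP (exp_toy and its values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical, smaller version of EXP
exp_toy <- data.frame(order       = rep(c("Carnivora", "Rodentia"), 2),
                      institution = rep(c("TCWC", "KU"), each = 2),
                      val         = c(14, 37, 13, 46))

# spread, with the value argument written out in full
wide_spread <- exp_toy %>%
  spread(key = "institution", value = "val")

# pivot_wider: names_from plays the role of key, values_from the role of value
wide_pivot <- exp_toy %>%
  pivot_wider(names_from = institution, values_from = val)

wide_spread
wide_pivot
```

The one visible difference here is column order: spread sorts the new columns alphabetically, while pivot_wider keeps them in order of first appearance.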

separate

tidyr::separate is another function that may be useful when you want to split data held in a single column into several columns.

DATA %>% separate(species, c("genus", "specep")) %>% head
##   decimalLongitude decimalLatitude       genus      specep        issues
## 1        -99.75406        31.92127        Lynx       rufus              
## 2        -97.79981        32.17984       Canis       lupus              
## 3        -97.79981        32.17984       Canis       lupus              
## 4        -97.49110        26.52303       Canis     latrans              
## 5        -96.24263        30.87341  Odocoileus virginianus              
## 6       -106.51273        31.92514 Bassariscus     astutus gass84,gdativ
##        basisOfRecord        order      family
## 1 PRESERVED_SPECIMEN    Carnivora     Felidae
## 2 PRESERVED_SPECIMEN    Carnivora     Canidae
## 3 PRESERVED_SPECIMEN    Carnivora     Canidae
## 4 PRESERVED_SPECIMEN    Carnivora     Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla    Cervidae
## 6 PRESERVED_SPECIMEN    Carnivora Procyonidae
##   coordinateUncertaintyInMeters year datasetName   catalogNumber
## 1                            NA 2018        <NA>           65493
## 2                        804.67 2018        <NA> MSB:Mamm:324207
## 3                        804.67 2018        <NA> MSB:Mamm:324206
## 4                            NA 2017        <NA>           64950
## 5                            NA 2017        <NA>           65491
## 6                            NA 2017        <NA>  UTEP:Mamm:8483
##   institutionCode           county
## 1            TCWC          Runnels
## 2             MSB Somervell County
## 3             MSB Somervell County
## 4            TCWC          Willacy
## 5            TCWC           Brazos
## 6            UTEP   El Paso County
##                                                                           preparations
## 1                                                                          SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4                                                                          ss | tissue
## 5                                                                          SK | tissue
## 6                                                                skeleton; skin, study
##                                                         locality    sex
## 1                                        Highway 153 near FM 140   <NA>
## 2                                     Fossil Rim Wildlife Center   <NA>
## 3                                     Fossil Rim Wildlife Center   <NA>
## 4                                 East Foundation, El Sauz Ranch   <NA>
## 5                                           Jack Creek at FM 974   <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE

This can be useful for certain things like dates (although there is the lubridate package as part of the tidyverse for y’all that are temporally inclined), those pesky museum numbers, or simple text analysis.
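
For those pesky museum numbers, it helps to know that separate splits on any run of non-alphanumeric characters by default, which would also break a catalog number like MSB:Mamm:324207 at its colons. A minimal sketch of taking control with the sep and extra arguments, on made-up rows (toy is hypothetical, modeled on the mus.num column built earlier):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data modeled on the mus.num and species columns
toy <- data.frame(mus.num = c("TCWC_65493", "MSB_MSB:Mamm:324207"),
                  species = c("Lynx rufus", "Canis lupus baileyi"),
                  stringsAsFactors = FALSE)

res <- toy %>%
  # Split only on the underscore, leaving the colons in the catalog number alone
  separate(mus.num, into = c("institution", "catalog"), sep = "_") %>%
  # Split on the first space only; extra = "merge" keeps everything
  # after it together, so a subspecies name is not dropped
  separate(species, into = c("genus", "specep"), sep = " ", extra = "merge")

res
```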

Overall, hopefully this has at least shown you some functions that may be useful for cleaning up and organizing your own data (no more going into Excel and fighting with its formatting!).

As always, more information about specific functions can be found by viewing the help pages in R. Answers to additional tidyverse questions can be found on the thorough tidyverse website. And general “how do I do something ill advised”, “how do I break this”, or “how do I do this in the most complex manner possible” questions can find answers on StackExchange. Be sure to always read through the comments and most of the answers (the most highly voted answer is often not the most relevant one to you).