# Optional: uncomment the next two lines to install any packages that are missing
#packages <- c("tidyverse", "magrittr", "reshape2")
#for (pkg in packages) if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
#install.packages("tidyverse")
#install.packages("reshape2")
library(tidyverse)
library(magrittr)
library(reshape2)
tidyverse
For this module, we will be taking a short, introductory dive into the tidyverse series of packages, with the goal of showing you what possibilities exist for manipulating and cleaning data.
Since we can’t manipulate data without data, we’ll start there.
For the purposes of working with packages that are supposed to help us deal with messy, disorganized data, we’ll be using biodiversity occurrence data.
OCC <- read.csv("https://raw.githubusercontent.com/acastellanos39/OSOS2019/master/tex_mammals.csv", header = T)
These are all the mammal records on GBIF from Texas that have coordinate data. It turns out there are quite a few records and they have a lot of (potential) information associated with them.
dim(OCC)
head(OCC)
str(OCC)
summary(OCC)
Take a look at these data and get a feeling for the quantity, classes, what may interest us, and things like NAs or data that may not be useful.
magrittr
Before we start cleaning up our data, I want to introduce the concept of a pipe in R via the magrittr
package. A pipe is written with the %>%
symbol (in RStudio, Cmd + Shift + M on Macs or Ctrl + Shift + M on Windows and Linux, if you don’t want to type it out).
OCC %>% select(1:5) %>% head
## key scientificName decimalLatitude
## 1 1913337407 Lynx rufus (Schreber, 1777) 31.92127
## 2 1927462932 Canis lupus baileyi Nelson & Goldman, 1929 32.17984
## 3 1927462934 Canis lupus baileyi Nelson & Goldman, 1929 32.17984
## 4 1913337038 Canis latrans texensis Bailey, 1905 26.52303
## 5 1913337381 Odocoileus virginianus (Zimmermann, 1780) 30.87341
## 6 1500196460 Bassariscus astutus (Lichtenstein, 1830) 31.92514
## decimalLongitude issues
## 1 -99.75406
## 2 -97.79981
## 3 -97.79981
## 4 -97.49110
## 5 -96.24263
## 6 -106.51273 gass84,gdativ
A pipe works by taking something from the left and making it the first argument for something on the right. In this case, we take our data.frame (a pipeline typically starts with some sort of data) and place it as the first argument in a dplyr::select
function (check ?dplyr::select
to see what the first argument is, but we’ll talk about this later). The real utility of a pipe comes from chaining: the result of our dplyr::select
function (the first five columns of our data.frame) becomes the first argument of our head function.
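To make the left-to-right rule concrete, here is a minimal sketch on a toy vector (the object `x` is made up for illustration and is not part of the workshop data). Each piped expression should give the same result as its nested equivalent.

```r
library(magrittr)

x <- c(4, 9, 16)  # toy data, assumed for illustration

# The left-hand side becomes the first argument of the right-hand side
piped  <- x %>% sqrt %>% sum
nested <- sum(sqrt(x))
identical(piped, nested)

# Extra arguments go inside the call as usual; this is round(x / 3, digits = 1)
rounded <- (x / 3) %>% round(digits = 1)
rounded
```

Reading the piped version left to right (take `x`, take the square root, sum it) is usually easier than reading the nested version inside out.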
There are pipe symbols other than %>%
that you can use.
OCC %$% decimalLatitude %>% max
## [1] 66.39965
The %$%
symbol acts like the $
in indexing by essentially doing OCC$decimalLatitude
but keeping it within the pipeline ecosystem (may be more helpful further down in a plot).
alt <- OCC %>% select(species, coordinateUncertaintyInMeters)
nrow(alt)
## [1] 44767
alt %<>% dplyr::filter(coordinateUncertaintyInMeters <= 5000)
nrow(alt)
## [1] 6459
The %<>%
is for those of us too lazy to come up with new object names: it uses the object on the left as both the input to the pipe and the name of the object that the result is assigned to. Because it overwrites the original object, don’t use it unless you are sure the pipe won’t return NULL or something else you don’t want.
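A small sketch of both alternative pipes on a made-up data.frame (`toy` and its values are assumptions for illustration, not the GBIF data). Keeping a copy before using %<>% is one cautious way to guard against an unwanted overwrite.

```r
library(magrittr)

toy <- data.frame(id = 1:5, lat = c(26.5, 30.9, 31.9, 32.2, 33.4))  # made-up values

# %$% exposes the columns of the left-hand side to the expression on the right
top_lat <- toy %$% max(lat)   # same as max(toy$lat)

# %<>% pipes `toy` forward AND overwrites `toy` with the result
backup <- toy                 # keep a copy, since the overwrite cannot be undone
toy %<>% subset(lat > 31)

nrow(backup)                  # rows before the overwrite
nrow(toy)                     # rows after the overwrite
```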
dplyr
We will start out and spend a good portion of our time working with the dplyr
package for data manipulation and cleanup.
We will focus on a series of helpful functions, how they work, and then look at how things like group_by
can be used with them.
It should be noted that the first argument of the main dplyr
verbs is the data (.data =
), which means that they are designed to work well with pipes if so desired.
select
dplyr::select
is a helpful function that lets you select the columns that you want in a variety of ways.
We have a lot of data in our OCC
data.frame object, but we don’t need most of it, so let’s only grab a bit of it to work with downstream.
dplyr::select
works by either giving it a group of numbers (akin to indexing by DATA[, 1:5]
or DATA[, c(1:2, 5)]
) or column names (akin to DATA[, "key"]
or DATA[, c("key", "scientificName")]
)
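The comparison with base indexing can be sketched on a three-row toy data.frame (the rows are made up for illustration). It also shows one difference worth knowing: selecting a single column with dplyr::select still gives back a data.frame, while base indexing drops to a vector by default.

```r
library(dplyr)

toy <- data.frame(key = 1:3,
                  scientificName = c("Lynx rufus", "Canis lupus", "Canis latrans"),
                  year = c(2018, 2018, 2017))  # made-up rows

by_position <- select(toy, 1:2)                  # like toy[, 1:2]
by_name     <- select(toy, key, scientificName)  # like toy[, c("key", "scientificName")]
all.equal(by_position, by_name)

# Single column: select() keeps the data.frame, base indexing returns a vector
class(select(toy, key))
class(toy[, "key"])
```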
colnames(OCC)
## [1] "key"
## [2] "scientificName"
## [3] "decimalLatitude"
## [4] "decimalLongitude"
## [5] "issues"
## [6] "datasetKey"
## [7] "publishingOrgKey"
## [8] "networkKeys"
## [9] "installationKey"
## [10] "publishingCountry"
## [11] "protocol"
## [12] "lastCrawled"
## [13] "lastParsed"
## [14] "crawlId"
## [15] "extensions"
## [16] "basisOfRecord"
## [17] "taxonKey"
## [18] "kingdomKey"
## [19] "phylumKey"
## [20] "classKey"
## [21] "orderKey"
## [22] "familyKey"
## [23] "genusKey"
## [24] "speciesKey"
## [25] "acceptedTaxonKey"
## [26] "acceptedScientificName"
## [27] "kingdom"
## [28] "phylum"
## [29] "order"
## [30] "family"
## [31] "genus"
## [32] "species"
## [33] "genericName"
## [34] "specificEpithet"
## [35] "taxonRank"
## [36] "taxonomicStatus"
## [37] "dateIdentified"
## [38] "coordinateUncertaintyInMeters"
## [39] "stateProvince"
## [40] "year"
## [41] "month"
## [42] "day"
## [43] "eventDate"
## [44] "modified"
## [45] "lastInterpreted"
## [46] "references"
## [47] "license"
## [48] "identifiers"
## [49] "facts"
## [50] "relations"
## [51] "geodeticDatum"
## [52] "class"
## [53] "countryCode"
## [54] "country"
## [55] "rightsHolder"
## [56] "identifier"
## [57] "informationWithheld"
## [58] "verbatimEventDate"
## [59] "datasetName"
## [60] "verbatimLocality"
## [61] "gbifID"
## [62] "collectionCode"
## [63] "occurrenceID"
## [64] "taxonID"
## [65] "recordedBy"
## [66] "catalogNumber"
## [67] "http...unknown.org.occurrenceDetails"
## [68] "institutionCode"
## [69] "rights"
## [70] "eventTime"
## [71] "identificationID"
## [72] "name"
## [73] "occurrenceRemarks"
## [74] "infraspecificEpithet"
## [75] "identificationRemarks"
## [76] "http...unknown.org.recordedByOrcid"
## [77] "establishmentMeans"
## [78] "elevation"
## [79] "elevationAccuracy"
## [80] "continent"
## [81] "institutionID"
## [82] "county"
## [83] "language"
## [84] "type"
## [85] "preparations"
## [86] "occurrenceStatus"
## [87] "verbatimElevation"
## [88] "nomenclaturalCode"
## [89] "higherGeography"
## [90] "georeferenceVerificationStatus"
## [91] "endDayOfYear"
## [92] "locality"
## [93] "startDayOfYear"
## [94] "bibliographicCitation"
## [95] "accessRights"
## [96] "higherClassification"
## [97] "sex"
## [98] "lifeStage"
## [99] "habitat"
## [100] "fieldNumber"
## [101] "taxonConceptID"
## [102] "locationID"
## [103] "samplingProtocol"
## [104] "associatedSequences"
## [105] "identifiedBy"
## [106] "georeferenceSources"
## [107] "X.1f2c0cbe.40df.43f6.ba07.e76133e78c31."
## [108] "individualCount"
## [109] "dynamicProperties"
## [110] "identificationVerificationStatus"
## [111] "eventRemarks"
## [112] "locationAccordingTo"
## [113] "locationRemarks"
## [114] "georeferencedDate"
## [115] "georeferencedBy"
## [116] "georeferenceProtocol"
## [117] "verbatimCoordinateSystem"
## [118] "otherCatalogNumbers"
## [119] "organismID"
## [120] "previousIdentifications"
## [121] "identificationQualifier"
## [122] "collectionID"
## [123] "recordNumber"
## [124] "municipality"
## [125] "taxonRemarks"
## [126] "vernacularName"
## [127] "reproductiveCondition"
## [128] "georeferenceRemarks"
## [129] "ownerInstitutionCode"
## [130] "earliestEonOrLowestEonothem"
## [131] "earliestEraOrLowestErathem"
## [132] "earliestEpochOrLowestSeries"
## [133] "earliestPeriodOrLowestSystem"
## [134] "disposition"
## [135] "fieldNotes"
## [136] "datasetID"
## [137] "associatedOccurrences"
## [138] "behavior"
## [139] "depth"
## [140] "depthAccuracy"
## [141] "namePublishedInYear"
## [142] "nameAccordingTo"
## [143] "acceptedNameUsage"
## [144] "parentNameUsage"
## [145] "latestEraOrHighestErathem"
## [146] "latestEpochOrHighestSeries"
## [147] "latestPeriodOrHighestSystem"
## [148] "waterBody"
## [149] "associatedTaxa"
## [150] "associatedReferences"
## [151] "earliestAgeOrLowestStage"
## [152] "formation"
## [153] "group"
## [154] "identificationReferences"
## [155] "dataGeneralizations"
## [156] "member"
## [157] "latestAgeOrHighestStage"
## [158] "http...unknown.org.recordId"
## [159] "lowestBiostratigraphicZone"
## [160] "highestBiostratigraphicZone"
## [161] "eventID"
## [162] "typifiedName"
## [163] "island"
## [164] "bed"
## [165] "typeStatus"
## [166] "coordinatePrecision"
## [167] "samplingEffort"
## [168] "lithostratigraphicTerms"
## [169] "verbatimTaxonRank"
## [170] "geologicalContextID"
## [171] "latestEonOrHighestEonothem"
## [172] "organismRemarks"
## [173] "originalNameUsage"
select(OCC, 1:5) %>% head
## key scientificName decimalLatitude
## 1 1913337407 Lynx rufus (Schreber, 1777) 31.92127
## 2 1927462932 Canis lupus baileyi Nelson & Goldman, 1929 32.17984
## 3 1927462934 Canis lupus baileyi Nelson & Goldman, 1929 32.17984
## 4 1913337038 Canis latrans texensis Bailey, 1905 26.52303
## 5 1913337381 Odocoileus virginianus (Zimmermann, 1780) 30.87341
## 6 1500196460 Bassariscus astutus (Lichtenstein, 1830) 31.92514
## decimalLongitude issues
## 1 -99.75406
## 2 -97.79981
## 3 -97.79981
## 4 -97.49110
## 5 -96.24263
## 6 -106.51273 gass84,gdativ
OCC %>% select(1:5) %>% head
## key scientificName decimalLatitude
## 1 1913337407 Lynx rufus (Schreber, 1777) 31.92127
## 2 1927462932 Canis lupus baileyi Nelson & Goldman, 1929 32.17984
## 3 1927462934 Canis lupus baileyi Nelson & Goldman, 1929 32.17984
## 4 1913337038 Canis latrans texensis Bailey, 1905 26.52303
## 5 1913337381 Odocoileus virginianus (Zimmermann, 1780) 30.87341
## 6 1500196460 Bassariscus astutus (Lichtenstein, 1830) 31.92514
## decimalLongitude issues
## 1 -99.75406
## 2 -97.79981
## 3 -97.79981
## 4 -97.49110
## 5 -96.24263
## 6 -106.51273 gass84,gdativ
OCC %>% select(key, scientificName, decimalLatitude, decimalLongitude, issues) %>% head
## key scientificName decimalLatitude
## 1 1913337407 Lynx rufus (Schreber, 1777) 31.92127
## 2 1927462932 Canis lupus baileyi Nelson & Goldman, 1929 32.17984
## 3 1927462934 Canis lupus baileyi Nelson & Goldman, 1929 32.17984
## 4 1913337038 Canis latrans texensis Bailey, 1905 26.52303
## 5 1913337381 Odocoileus virginianus (Zimmermann, 1780) 30.87341
## 6 1500196460 Bassariscus astutus (Lichtenstein, 1830) 31.92514
## decimalLongitude issues
## 1 -99.75406
## 2 -97.79981
## 3 -97.79981
## 4 -97.49110
## 5 -96.24263
## 6 -106.51273 gass84,gdativ
Several columns contain information that is useful to us, so we’ll select them by column number.
DATA <- OCC %>% select(4, 3, 32, 5, 16, 29:30, 38, 40, 59, 66, 68, 82, 85, 92, 97)
dim(DATA)
## [1] 44767 16
head(DATA)
## decimalLongitude decimalLatitude species issues
## 1 -99.75406 31.92127 Lynx rufus
## 2 -97.79981 32.17984 Canis lupus
## 3 -97.79981 32.17984 Canis lupus
## 4 -97.49110 26.52303 Canis latrans
## 5 -96.24263 30.87341 Odocoileus virginianus
## 6 -106.51273 31.92514 Bassariscus astutus gass84,gdativ
## basisOfRecord order family
## 1 PRESERVED_SPECIMEN Carnivora Felidae
## 2 PRESERVED_SPECIMEN Carnivora Canidae
## 3 PRESERVED_SPECIMEN Carnivora Canidae
## 4 PRESERVED_SPECIMEN Carnivora Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla Cervidae
## 6 PRESERVED_SPECIMEN Carnivora Procyonidae
## coordinateUncertaintyInMeters year datasetName catalogNumber
## 1 NA 2018 <NA> 65493
## 2 804.67 2018 <NA> MSB:Mamm:324207
## 3 804.67 2018 <NA> MSB:Mamm:324206
## 4 NA 2017 <NA> 64950
## 5 NA 2017 <NA> 65491
## 6 NA 2017 <NA> UTEP:Mamm:8483
## institutionCode county
## 1 TCWC Runnels
## 2 MSB Somervell County
## 3 MSB Somervell County
## 4 TCWC Willacy
## 5 TCWC Brazos
## 6 UTEP El Paso County
## preparations
## 1 SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4 ss | tissue
## 5 SK | tissue
## 6 skeleton; skin, study
## locality sex
## 1 Highway 153 near FM 140 <NA>
## 2 Fossil Rim Wildlife Center <NA>
## 3 Fossil Rim Wildlife Center <NA>
## 4 East Foundation, El Sauz Ranch <NA>
## 5 Jack Creek at FM 974 <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE
BUT that’s not all: we still have a lot more that we can do with dplyr::select
and its variants. You can select columns based on patterns in their names.
OCC %>% select(starts_with("decimal")) %>% head
## decimalLatitude decimalLongitude
## 1 31.92127 -99.75406
## 2 32.17984 -97.79981
## 3 32.17984 -97.79981
## 4 26.52303 -97.49110
## 5 30.87341 -96.24263
## 6 31.92514 -106.51273
OCC %>% select(ends_with("Name")) %>% head
## scientificName
## 1 Lynx rufus (Schreber, 1777)
## 2 Canis lupus baileyi Nelson & Goldman, 1929
## 3 Canis lupus baileyi Nelson & Goldman, 1929
## 4 Canis latrans texensis Bailey, 1905
## 5 Odocoileus virginianus (Zimmermann, 1780)
## 6 Bassariscus astutus (Lichtenstein, 1830)
## acceptedScientificName genericName datasetName
## 1 Lynx rufus (Schreber, 1777) Lynx <NA>
## 2 Canis lupus baileyi Nelson & Goldman, 1929 Canis <NA>
## 3 Canis lupus baileyi Nelson & Goldman, 1929 Canis <NA>
## 4 Canis latrans texensis Bailey, 1905 Canis <NA>
## 5 Odocoileus virginianus (Zimmermann, 1780) Odocoileus <NA>
## 6 Bassariscus astutus (Lichtenstein, 1830) Bassariscus <NA>
## name vernacularName typifiedName
## 1 Lynx rufus (Schreber, 1777) <NA> <NA>
## 2 Canis lupus baileyi Nelson & Goldman, 1929 <NA> <NA>
## 3 Canis lupus baileyi Nelson & Goldman, 1929 <NA> <NA>
## 4 Canis latrans texensis Bailey, 1905 <NA> <NA>
## 5 Odocoileus virginianus (Zimmermann, 1780) <NA> <NA>
## 6 Bassariscus astutus (Lichtenstein, 1830) <NA> <NA>
The starts_with
and ends_with
helper functions select columns based on their names.
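The helpers can be tried on a one-row toy data.frame whose column names mimic a few of the GBIF ones (the values are made up). `contains()` is a third helper from the same family, and a minus sign drops whatever a helper matches.

```r
library(dplyr)

toy <- data.frame(decimalLatitude = 31.9, decimalLongitude = -99.8,
                  scientificName = "Lynx rufus", datasetName = NA,
                  year = 2018)  # one made-up row

names(select(toy, starts_with("decimal")))
names(select(toy, ends_with("Name")))
names(select(toy, contains("Name")))          # matches anywhere in the column name
names(select(toy, -starts_with("decimal")))   # a minus sign drops the matches
```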
DATA %>% select_all(toupper) %>% head
## DECIMALLONGITUDE DECIMALLATITUDE SPECIES ISSUES
## 1 -99.75406 31.92127 Lynx rufus
## 2 -97.79981 32.17984 Canis lupus
## 3 -97.79981 32.17984 Canis lupus
## 4 -97.49110 26.52303 Canis latrans
## 5 -96.24263 30.87341 Odocoileus virginianus
## 6 -106.51273 31.92514 Bassariscus astutus gass84,gdativ
## BASISOFRECORD ORDER FAMILY
## 1 PRESERVED_SPECIMEN Carnivora Felidae
## 2 PRESERVED_SPECIMEN Carnivora Canidae
## 3 PRESERVED_SPECIMEN Carnivora Canidae
## 4 PRESERVED_SPECIMEN Carnivora Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla Cervidae
## 6 PRESERVED_SPECIMEN Carnivora Procyonidae
## COORDINATEUNCERTAINTYINMETERS YEAR DATASETNAME CATALOGNUMBER
## 1 NA 2018 <NA> 65493
## 2 804.67 2018 <NA> MSB:Mamm:324207
## 3 804.67 2018 <NA> MSB:Mamm:324206
## 4 NA 2017 <NA> 64950
## 5 NA 2017 <NA> 65491
## 6 NA 2017 <NA> UTEP:Mamm:8483
## INSTITUTIONCODE COUNTY
## 1 TCWC Runnels
## 2 MSB Somervell County
## 3 MSB Somervell County
## 4 TCWC Willacy
## 5 TCWC Brazos
## 6 UTEP El Paso County
## PREPARATIONS
## 1 SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4 ss | tissue
## 5 SK | tissue
## 6 skeleton; skin, study
## LOCALITY SEX
## 1 Highway 153 near FM 140 <NA>
## 2 Fossil Rim Wildlife Center <NA>
## 3 Fossil Rim Wildlife Center <NA>
## 4 East Foundation, El Sauz Ranch <NA>
## 5 Jack Creek at FM 974 <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE
dplyr::select_all
is a special case of dplyr::select
(labeled by tidyverse developers as a scoped variant along with select_if
and select_at
) that selects all columns and applies an additional function (here base::toupper
which capitalizes all the column names)
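If you are on a recent dplyr (1.0.0 or later, as far as I know), the scoped variants are documented as superseded, and `rename_with()` covers the same job. A minimal sketch on a made-up one-row data.frame:

```r
library(dplyr)

toy <- data.frame(decimalLongitude = -99.8, species = "Lynx rufus")  # made-up row

# rename_with() applies a function to the column names and keeps every column
upper <- toy %>% rename_with(toupper)
names(upper)

# The optional .cols argument restricts which names are changed
partial <- toy %>% rename_with(toupper, .cols = starts_with("decimal"))
names(partial)
```

The older `select_all(toupper)` call shown above still runs, so this is an alternative rather than a required change.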
DATA %>% select(lon = 1, lat = decimalLatitude) %>% head
## lon lat
## 1 -99.75406 31.92127
## 2 -97.79981 32.17984
## 3 -97.79981 32.17984
## 4 -97.49110 26.52303
## 5 -96.24263 30.87341
## 6 -106.51273 31.92514
DATA %>% rename(lon = decimalLongitude, lat = decimalLatitude) %>% head
## lon lat species issues
## 1 -99.75406 31.92127 Lynx rufus
## 2 -97.79981 32.17984 Canis lupus
## 3 -97.79981 32.17984 Canis lupus
## 4 -97.49110 26.52303 Canis latrans
## 5 -96.24263 30.87341 Odocoileus virginianus
## 6 -106.51273 31.92514 Bassariscus astutus gass84,gdativ
## basisOfRecord order family
## 1 PRESERVED_SPECIMEN Carnivora Felidae
## 2 PRESERVED_SPECIMEN Carnivora Canidae
## 3 PRESERVED_SPECIMEN Carnivora Canidae
## 4 PRESERVED_SPECIMEN Carnivora Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla Cervidae
## 6 PRESERVED_SPECIMEN Carnivora Procyonidae
## coordinateUncertaintyInMeters year datasetName catalogNumber
## 1 NA 2018 <NA> 65493
## 2 804.67 2018 <NA> MSB:Mamm:324207
## 3 804.67 2018 <NA> MSB:Mamm:324206
## 4 NA 2017 <NA> 64950
## 5 NA 2017 <NA> 65491
## 6 NA 2017 <NA> UTEP:Mamm:8483
## institutionCode county
## 1 TCWC Runnels
## 2 MSB Somervell County
## 3 MSB Somervell County
## 4 TCWC Willacy
## 5 TCWC Brazos
## 6 UTEP El Paso County
## preparations
## 1 SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4 ss | tissue
## 5 SK | tissue
## 6 skeleton; skin, study
## locality sex
## 1 Highway 153 near FM 140 <NA>
## 2 Fossil Rim Wildlife Center <NA>
## 3 Fossil Rim Wildlife Center <NA>
## 4 East Foundation, El Sauz Ranch <NA>
## 5 Jack Creek at FM 974 <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE
You can use the dplyr::select
function to rename variables if you want (by column number or column name), but it keeps only the columns you list, OR you can use the dplyr::rename
function (which keeps all of the variables).
filter
The dplyr::filter
function does what it sounds like and filters your data based on the conditions you give it.
dim(DATA)
## [1] 44767 16
DATA %>% filter(year >= 2000) %>% dim #grabs only records from 2000 on
## [1] 5777 16
DATA %>% filter(year >= 2000 & order == "Rodentia") %>% dim #grabs only *rodent* records from 2000 on
## [1] 4016 16
#DATA %>% filter(year >= 2000, order == "Rodentia") is the same thing as the above line
DATA %>% filter(year >= 2000 | order == "Rodentia") %>% dim #grabs records that are *either* rodents or from the year 2000 on
## [1] 31764 16
You can use the &
symbol to string along multiple conditions to filter by or use the |
symbol to keep rows that meet either condition.
This is close to using DATA[DATA$year >= 2000, ]
, DATA[DATA$year >= 2000 & DATA$order == "Rodentia", ]
, etc., with one difference: dplyr::filter drops rows where the condition is NA, while base indexing returns them as rows filled with NA. Depending on your coding familiarity, one or the other may feel cleaner or more useful for you.
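The two approaches agree whenever the condition is TRUE or FALSE, but they treat missing values differently. A quick sketch on made-up records with one missing year:

```r
library(dplyr)

toy <- data.frame(order = c("Rodentia", "Carnivora", "Rodentia", "Chiroptera"),
                  year  = c(2005, 1998, NA, 2010))  # made-up records, one missing year

via_filter <- toy %>% filter(year >= 2000)
via_base   <- toy[toy$year >= 2000, ]
via_which  <- toy[which(toy$year >= 2000), ]

nrow(via_filter)   # only rows where the condition is TRUE
nrow(via_base)     # also carries an all-NA row for the missing year
nrow(via_which)    # which() drops the NA, matching filter()
```

With real occurrence data, where `year` is often missing, the extra NA rows from base indexing can quietly inflate row counts.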
You can also apply functions to determine cutoffs for your filtering arguments.
median(DATA$year, na.rm = T)
## [1] 1977
DATA %>% filter(year > median(year, na.rm = T)) %>% dim
## [1] 20954 16
#DATA %>% filter(year > 1977) %>% dim
Which will lead us into the concept of groups!
group_by
group_by
is useful for partitioning data and applying functions to each of the partitions.
For the last example, we used the median year of all records as the threshold. What if we wanted each order to have its own median threshold instead?
ORD <- DATA %>% group_by(order)
ORD
## # A tibble: 44,767 x 16
## # Groups: order [17]
## decimalLongitude decimalLatitude species issues basisOfRecord order
## <dbl> <dbl> <fct> <fct> <fct> <fct>
## 1 -99.8 31.9 Lynx ruf… "" PRESERVED_SPE… Carn…
## 2 -97.8 32.2 Canis lu… "" PRESERVED_SPE… Carn…
## 3 -97.8 32.2 Canis lu… "" PRESERVED_SPE… Carn…
## 4 -97.5 26.5 Canis la… "" PRESERVED_SPE… Carn…
## 5 -96.2 30.9 Odocoile… "" PRESERVED_SPE… Arti…
## 6 -107. 31.9 Bassaris… gass84… PRESERVED_SPE… Carn…
## 7 -96.3 30.6 Baiomys … "" PRESERVED_SPE… Rode…
## 8 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## 9 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## 10 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## # ... with 44,757 more rows, and 10 more variables: family <fct>,
## # coordinateUncertaintyInMeters <dbl>, year <int>, datasetName <fct>,
## # catalogNumber <fct>, institutionCode <fct>, county <fct>,
## # preparations <fct>, locality <fct>, sex <fct>
The argument in the dplyr::group_by
function is the column you want to group by. Each of the different levels in the specified column will be its own group. Check out levels(DATA$order)
to see what groups there are.
What it results in is a tibble, which is the preferred object class for the tidyverse (as part of the tibble
package) and somehow not a Gremlins or Furby knockoff. A tibble holds data in the same way as a data.frame but presents it differently by showing the dimensions, the first ten rows, as many columns as fit on screen, the class of each column, and groups if they exist.
This matters when working with larger data.frames (say in the 100,000 or 1,000,000+ row range), where simply calling the object can flood your console and bog down your computer.
You can also choose multiple grouping variables
DATA %>% group_by(institutionCode, basisOfRecord) #each group is a specific combination of institution and record type
## # A tibble: 44,767 x 16
## # Groups: institutionCode, basisOfRecord [64]
## decimalLongitude decimalLatitude species issues basisOfRecord order
## <dbl> <dbl> <fct> <fct> <fct> <fct>
## 1 -99.8 31.9 Lynx ruf… "" PRESERVED_SPE… Carn…
## 2 -97.8 32.2 Canis lu… "" PRESERVED_SPE… Carn…
## 3 -97.8 32.2 Canis lu… "" PRESERVED_SPE… Carn…
## 4 -97.5 26.5 Canis la… "" PRESERVED_SPE… Carn…
## 5 -96.2 30.9 Odocoile… "" PRESERVED_SPE… Arti…
## 6 -107. 31.9 Bassaris… gass84… PRESERVED_SPE… Carn…
## 7 -96.3 30.6 Baiomys … "" PRESERVED_SPE… Rode…
## 8 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## 9 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## 10 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## # ... with 44,757 more rows, and 10 more variables: family <fct>,
## # coordinateUncertaintyInMeters <dbl>, year <int>, datasetName <fct>,
## # catalogNumber <fct>, institutionCode <fct>, county <fct>,
## # preparations <fct>, locality <fct>, sex <fct>
ORD %>% filter(year > median(year, na.rm = T)) # slightly different result than before
## # A tibble: 20,788 x 16
## # Groups: order [13]
## decimalLongitude decimalLatitude species issues basisOfRecord order
## <dbl> <dbl> <fct> <fct> <fct> <fct>
## 1 -99.8 31.9 Lynx ruf… "" PRESERVED_SPE… Carn…
## 2 -97.8 32.2 Canis lu… "" PRESERVED_SPE… Carn…
## 3 -97.8 32.2 Canis lu… "" PRESERVED_SPE… Carn…
## 4 -97.5 26.5 Canis la… "" PRESERVED_SPE… Carn…
## 5 -96.2 30.9 Odocoile… "" PRESERVED_SPE… Arti…
## 6 -107. 31.9 Bassaris… gass84… PRESERVED_SPE… Carn…
## 7 -96.3 30.6 Baiomys … "" PRESERVED_SPE… Rode…
## 8 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## 9 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## 10 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## # ... with 20,778 more rows, and 10 more variables: family <fct>,
## # coordinateUncertaintyInMeters <dbl>, year <int>, datasetName <fct>,
## # catalogNumber <fct>, institutionCode <fct>, county <fct>,
## # preparations <fct>, locality <fct>, sex <fct>
#sapply(levels(DATA$order), function(x) DATA %>% filter(order == x) %$% year %>% median(na.rm = T)) #this also means that we have fossil dermopterans in Texas?!
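To see why the grouped result differs from the ungrouped one, here is a sketch on six made-up records where the medians are easy to work out by hand.

```r
library(dplyr)

toy <- data.frame(order = rep(c("Rodentia", "Carnivora"), each = 3),
                  year  = c(1950, 1960, 1970, 2000, 2010, 2020))  # made-up records

# Ungrouped: one overall median (1985) is the cutoff for every row
overall <- toy %>% filter(year > median(year))

# Grouped: each order is compared with its own median (1960 and 2010)
per_order <- toy %>% group_by(order) %>% filter(year > median(year)) %>% ungroup()

overall$year
per_order$year
```

The ungrouped filter keeps every Carnivora record and no Rodentia records, while the grouped filter keeps the most recent record of each order.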
For something perhaps a bit more useful, if you wanted to get rid of duplicated locality records within an order, how do you do that?
#DATA %>% select(1, 2) %>% duplicated() %>% `!` %>% sum
#DATA %>% distinct(decimalLongitude, decimalLatitude) %>% dim
ORD %>% distinct(decimalLongitude, decimalLatitude, .keep_all = T) %>% ungroup
## # A tibble: 10,282 x 16
## decimalLongitude decimalLatitude species issues basisOfRecord order
## <dbl> <dbl> <fct> <fct> <fct> <fct>
## 1 -99.8 31.9 Lynx ruf… "" PRESERVED_SPE… Carn…
## 2 -97.8 32.2 Canis lu… "" PRESERVED_SPE… Carn…
## 3 -97.5 26.5 Canis la… "" PRESERVED_SPE… Carn…
## 4 -96.2 30.9 Odocoile… "" PRESERVED_SPE… Arti…
## 5 -107. 31.9 Bassaris… gass84… PRESERVED_SPE… Carn…
## 6 -96.3 30.6 Baiomys … "" PRESERVED_SPE… Rode…
## 7 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## 8 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## 9 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## 10 -97.6 33.4 Peromysc… gass84 PRESERVED_SPE… Rode…
## # ... with 10,272 more rows, and 10 more variables: family <fct>,
## # coordinateUncertaintyInMeters <dbl>, year <int>, datasetName <fct>,
## # catalogNumber <fct>, institutionCode <fct>, county <fct>,
## # preparations <fct>, locality <fct>, sex <fct>
Here, we use the dplyr::distinct
function to only grab those rows that are distinct according to what we give it (longitude and latitude) and use the .keep_all = T
argument to specify that we want all other columns as well. Because ORD is grouped, the grouping column (order) also counts, so duplicates are removed within each order. Try it without .keep_all and see what happens.
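If you want a preview of that comparison, here is a sketch on three made-up rows, two of which share the same coordinates.

```r
library(dplyr)

toy <- data.frame(lon = c(-97.8, -97.8, -96.2),
                  lat = c(32.2, 32.2, 30.9),
                  species = c("Canis lupus", "Canis lupus",
                              "Odocoileus virginianus"))  # made-up rows

only_coords <- toy %>% distinct(lon, lat)                    # just the columns named
all_columns <- toy %>% distinct(lon, lat, .keep_all = TRUE)  # first row per combination, all columns

names(only_coords)
names(all_columns)
nrow(all_columns)
```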
If you don’t like the look of the tibble result, you can always end with %>% as.data.frame
or turn it into a data.frame later with the base::as.data.frame
function.
mutate
The dplyr::mutate
function also works how it sounds by either replacing an existing column or frankensteining a new one.
Often a specimen in a natural history collection (much of what is found in GBIF are natural history collection specimens) has an associated collection number (some institution code followed by a number, such as TCWC 6524).
DATA %>% mutate(mus.num = paste(institutionCode, catalogNumber, sep = "_")) %>% head
## decimalLongitude decimalLatitude species issues
## 1 -99.75406 31.92127 Lynx rufus
## 2 -97.79981 32.17984 Canis lupus
## 3 -97.79981 32.17984 Canis lupus
## 4 -97.49110 26.52303 Canis latrans
## 5 -96.24263 30.87341 Odocoileus virginianus
## 6 -106.51273 31.92514 Bassariscus astutus gass84,gdativ
## basisOfRecord order family
## 1 PRESERVED_SPECIMEN Carnivora Felidae
## 2 PRESERVED_SPECIMEN Carnivora Canidae
## 3 PRESERVED_SPECIMEN Carnivora Canidae
## 4 PRESERVED_SPECIMEN Carnivora Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla Cervidae
## 6 PRESERVED_SPECIMEN Carnivora Procyonidae
## coordinateUncertaintyInMeters year datasetName catalogNumber
## 1 NA 2018 <NA> 65493
## 2 804.67 2018 <NA> MSB:Mamm:324207
## 3 804.67 2018 <NA> MSB:Mamm:324206
## 4 NA 2017 <NA> 64950
## 5 NA 2017 <NA> 65491
## 6 NA 2017 <NA> UTEP:Mamm:8483
## institutionCode county
## 1 TCWC Runnels
## 2 MSB Somervell County
## 3 MSB Somervell County
## 4 TCWC Willacy
## 5 TCWC Brazos
## 6 UTEP El Paso County
## preparations
## 1 SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4 ss | tissue
## 5 SK | tissue
## 6 skeleton; skin, study
## locality sex
## 1 Highway 153 near FM 140 <NA>
## 2 Fossil Rim Wildlife Center <NA>
## 3 Fossil Rim Wildlife Center <NA>
## 4 East Foundation, El Sauz Ranch <NA>
## 5 Jack Creek at FM 974 <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE
## mus.num
## 1 TCWC_65493
## 2 MSB_MSB:Mamm:324207
## 3 MSB_MSB:Mamm:324206
## 4 TCWC_64950
## 5 TCWC_65491
## 6 UTEP_UTEP:Mamm:8483
You can create multiple columns in the same function call. I would also recommend always naming the new column by using column.name =
before the expression that creates it. You probably aren’t a being of pure chaos who wants columns named year + 4
and log(year)
.
DATA %>% mutate(year.2 = year + 4, year.3 = log(year)) %>% head
## decimalLongitude decimalLatitude species issues
## 1 -99.75406 31.92127 Lynx rufus
## 2 -97.79981 32.17984 Canis lupus
## 3 -97.79981 32.17984 Canis lupus
## 4 -97.49110 26.52303 Canis latrans
## 5 -96.24263 30.87341 Odocoileus virginianus
## 6 -106.51273 31.92514 Bassariscus astutus gass84,gdativ
## basisOfRecord order family
## 1 PRESERVED_SPECIMEN Carnivora Felidae
## 2 PRESERVED_SPECIMEN Carnivora Canidae
## 3 PRESERVED_SPECIMEN Carnivora Canidae
## 4 PRESERVED_SPECIMEN Carnivora Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla Cervidae
## 6 PRESERVED_SPECIMEN Carnivora Procyonidae
## coordinateUncertaintyInMeters year datasetName catalogNumber
## 1 NA 2018 <NA> 65493
## 2 804.67 2018 <NA> MSB:Mamm:324207
## 3 804.67 2018 <NA> MSB:Mamm:324206
## 4 NA 2017 <NA> 64950
## 5 NA 2017 <NA> 65491
## 6 NA 2017 <NA> UTEP:Mamm:8483
## institutionCode county
## 1 TCWC Runnels
## 2 MSB Somervell County
## 3 MSB Somervell County
## 4 TCWC Willacy
## 5 TCWC Brazos
## 6 UTEP El Paso County
## preparations
## 1 SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4 ss | tissue
## 5 SK | tissue
## 6 skeleton; skin, study
## locality sex
## 1 Highway 153 near FM 140 <NA>
## 2 Fossil Rim Wildlife Center <NA>
## 3 Fossil Rim Wildlife Center <NA>
## 4 East Foundation, El Sauz Ranch <NA>
## 5 Jack Creek at FM 974 <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE
## year.2 year.3
## 1 2022 7.609862
## 2 2022 7.609862
## 3 2022 7.609862
## 4 2021 7.609367
## 5 2021 7.609367
## 6 2021 7.609367
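For the curious, this is what the chaotic version looks like, sketched on a two-row toy data.frame (made-up values). It also shows that reusing an existing name replaces that column rather than adding one.

```r
library(dplyr)

toy <- data.frame(year = c(2017, 2018))  # made-up values

named   <- toy %>% mutate(year.2 = year + 4)
unnamed <- toy %>% mutate(year + 4)

names(named)
names(unnamed)      # the expression itself becomes the column name

# Reusing an existing name replaces that column instead of adding a new one
replaced <- toy %>% mutate(year = year + 4)
replaced$year
```

A column called `year + 4` has to be wrapped in backticks every time you refer to it, which is reason enough to name things up front.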
summarize
The dplyr::summarize
function again follows the obvious naming trend and (gasp) summarizes your data. This is yet another of the common dplyr
functions that works really well with group_by
to summarize multiple groups.
DATA %>% summarize(n.spec = n_distinct(species))
## n.spec
## 1 742
DATA %>% summarize(min = min(year, na.rm = T), max = max(year, na.rm = T), mean = mean(year, na.rm = T))
## min max mean
## 1 1700 2018 1975.623
You can summarize with one function or with several, and you can either name the results or live dangerously and let each be named after the expression that produced it. Instead of the potentially useful n.spec
, you can have the mysterious n_distinct(species)
! Swoon
dplyr::summarize
is only as good as the functions you use with it. Some useful ones are the usual mean
, median
, min
, max
, quantile
, n
(for the number of records), n_distinct
(for the number of unique values, as shown above), etc.
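A sketch of a few of these on five made-up records, small enough to check by hand. It also shows why the `na.rm` argument keeps turning up: a single missing year is enough to turn the summary into NA.

```r
library(dplyr)

toy <- data.frame(order   = c("Rodentia", "Rodentia", "Rodentia", "Carnivora", "Carnivora"),
                  species = c("Mus musculus", "Mus musculus", "Baiomys taylori",
                              "Lynx rufus", "Canis latrans"),
                  year    = c(1990, 2001, NA, 2017, 2018))  # made-up records

by_order <- toy %>%
  group_by(order) %>%
  summarize(n = n(),                              # records per order
            n.spec = n_distinct(species),         # unique species per order
            first.year = min(year, na.rm = TRUE)) # earliest year, ignoring NA
by_order

# Without na.rm = TRUE, one missing value makes the whole summary NA
no_narm <- toy %>% summarize(first.year = min(year))
no_narm
```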
DATA %>% group_by(order) %>% summarize(n = n(), n.spec = n_distinct(species))
## # A tibble: 17 x 3
## order n n.spec
## <fct> <int> <int>
## 1 Artiodactyla 1357 126
## 2 Carnivora 3155 110
## 3 Cetacea 2442 23
## 4 Chiroptera 4464 35
## 5 Cingulata 288 5
## 6 Dermoptera 2 1
## 7 Didelphimorphia 297 8
## 8 Diprotodontia 1 1
## 9 Erinaceomorpha 2 2
## 10 Lagomorpha 1264 26
## 11 Perissodactyla 496 97
## 12 Pilosa 19 13
## 13 Primates 43 23
## 14 Proboscidea 44 14
## 15 Rodentia 30003 188
## 16 Soricomorpha 794 22
## 17 <NA> 96 48
#Often the real power of dplyr is in how you string multiple of these functions together
DATA %>% group_by(order) %>% summarize(n = n(), n.spec = n_distinct(species)) %>% mutate(speciesPerRecords = n.spec/n)
## # A tibble: 17 x 4
## order n n.spec speciesPerRecords
## <fct> <int> <int> <dbl>
## 1 Artiodactyla 1357 126 0.0929
## 2 Carnivora 3155 110 0.0349
## 3 Cetacea 2442 23 0.00942
## 4 Chiroptera 4464 35 0.00784
## 5 Cingulata 288 5 0.0174
## 6 Dermoptera 2 1 0.500
## 7 Didelphimorphia 297 8 0.0269
## 8 Diprotodontia 1 1 1.00
## 9 Erinaceomorpha 2 2 1.00
## 10 Lagomorpha 1264 26 0.0206
## 11 Perissodactyla 496 97 0.196
## 12 Pilosa 19 13 0.684
## 13 Primates 43 23 0.535
## 14 Proboscidea 44 14 0.318
## 15 Rodentia 30003 188 0.00627
## 16 Soricomorpha 794 22 0.0277
## 17 <NA> 96 48 0.500
DATA %>% group_by(institutionCode) %>% summarize(nspec = n_distinct(species), nfam = n_distinct(family), norder = n_distinct(order))
## # A tibble: 60 x 4
## institutionCode nspec nfam norder
## <fct> <int> <int> <int>
## 1 AMNH 6 5 4
## 2 ASNHC 125 27 9
## 3 ASNHC-ASU 3 1 1
## 4 BM-UW 1 1 1
## 5 BMNH-U of M 1 1 1
## 6 Borror Laboratory of Bioacoustics, Ohio State Unive… 3 2 2
## 7 CAS 25 14 6
## 8 CHAS 13 8 4
## 9 CLO 7 6 5
## 10 CUMV 19 8 4
## # ... with 50 more rows
tidyr
Another tidyverse
package that is often useful is tidyr
, specifically for its data manipulation functions that can turn wide data into long data and long data into wide data.
I know, I know. What are wide/long data and why do I need to know them???
As we’ve learned, most things are named for what they are, so wide data are column heavy and long data have more rows. It is often essential to know how to switch between them for plotting reasons.
ORD <- DATA %>% group_by(order) %>% summarize(nspec = n_distinct(species), nfam = n_distinct(family))
Take the above result for example. How could you plot this? One axis would hold order, but what goes on the other? In a long data.frame, the other axis variable is a factor whose levels distinguish the number of species from the number of families. An additional column called value holds the counts and can be used as a color or fill variable.
A non-tidyverse method of doing this is found in the reshape2 package using the reshape2::melt and reshape2::dcast functions.
melt(ORD, id.vars = "order") #id.vars says what will be used as an ID (typically factors)
## order variable value
## 1 Artiodactyla nspec 126
## 2 Carnivora nspec 110
## 3 Cetacea nspec 23
## 4 Chiroptera nspec 35
## 5 Cingulata nspec 5
## 6 Dermoptera nspec 1
## 7 Didelphimorphia nspec 8
## 8 Diprotodontia nspec 1
## 9 Erinaceomorpha nspec 2
## 10 Lagomorpha nspec 26
## 11 Perissodactyla nspec 97
## 12 Pilosa nspec 13
## 13 Primates nspec 23
## 14 Proboscidea nspec 14
## 15 Rodentia nspec 188
## 16 Soricomorpha nspec 22
## 17 <NA> nspec 48
## 18 Artiodactyla nfam 22
## 19 Carnivora nfam 13
## 20 Cetacea nfam 7
## 21 Chiroptera nfam 4
## 22 Cingulata nfam 2
## 23 Dermoptera nfam 1
## 24 Didelphimorphia nfam 2
## 25 Diprotodontia nfam 1
## 26 Erinaceomorpha nfam 1
## 27 Lagomorpha nfam 1
## 28 Perissodactyla nfam 8
## 29 Pilosa nfam 6
## 30 Primates nfam 11
## 31 Proboscidea nfam 3
## 32 Rodentia nfam 22
## 33 Soricomorpha nfam 2
## 34 <NA> nfam 24
melt(ORD, "order") %>% ggplot(aes(x = variable, y = order, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = value), color = "white") +
scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
scale_fill_viridis_c("Number")
dcast(melt(ORD, id.vars = "order"), order ~ variable)
## order nspec nfam
## 1 Artiodactyla 126 22
## 2 Carnivora 110 13
## 3 Cetacea 23 7
## 4 Chiroptera 35 4
## 5 Cingulata 5 2
## 6 Dermoptera 1 1
## 7 Didelphimorphia 8 2
## 8 Diprotodontia 1 1
## 9 Erinaceomorpha 2 1
## 10 Lagomorpha 26 1
## 11 Perissodactyla 97 8
## 12 Pilosa 13 6
## 13 Primates 23 11
## 14 Proboscidea 14 3
## 15 Rodentia 188 22
## 16 Soricomorpha 22 2
## 17 <NA> 48 24
For the reshape2::dcast function (the d standing for data.frame, obviously, because why stop at putting an r in front of all your package and function names), you must specify a formula of x ~ y where y is the variable that is broken up into columns.
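As a minimal sketch of how the formula controls the result, the toy data.frame below is cast both ways. The name TOY is made up for illustration, and its four values are copied from the Carnivora and Rodentia rows of the melt output above. Swapping the two sides of the formula swaps which variable ends up as columns.

```r
library(reshape2)

# Hypothetical long-format data: one row per order/variable combination
TOY <- data.frame(order = rep(c("Carnivora", "Rodentia"), times = 2),
                  variable = rep(c("nspec", "nfam"), each = 2),
                  value = c(110, 188, 13, 22))

# order stays as rows, the levels of variable become columns
WIDE1 <- dcast(TOY, order ~ variable, value.var = "value")
WIDE1

# variable stays as rows, the levels of order become columns
WIDE2 <- dcast(TOY, variable ~ order, value.var = "value")
WIDE2
```

Naming value.var explicitly avoids relying on dcast guessing which column holds the values.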
gather and spread
Now that you have an idea of what wide/long data are and the reshape2 package (it is often shamefully joked that everyone still uses reshape2 because melt and cast make more sense), let’s see how tidyr approaches these.
ORD %>% gather(key = "var", value = "val", -order)
## # A tibble: 34 x 3
## order var val
## <fct> <chr> <int>
## 1 Artiodactyla nspec 126
## 2 Carnivora nspec 110
## 3 Cetacea nspec 23
## 4 Chiroptera nspec 35
## 5 Cingulata nspec 5
## 6 Dermoptera nspec 1
## 7 Didelphimorphia nspec 8
## 8 Diprotodontia nspec 1
## 9 Erinaceomorpha nspec 2
## 10 Lagomorpha nspec 26
## # ... with 24 more rows
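It’s a little more roundabout to get the same result using tidyr. The same basic idea applies, giving a key variable and a value variable, but you also have to say which columns to leave alone (here, -order) to keep the order column. If you pass order as the first argument instead, gather treats it as the name of the key column and the original order column is lost: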
ORD %>% gather(order)
## # A tibble: 34 x 2
## order value
## <chr> <int>
## 1 nspec 126
## 2 nspec 110
## 3 nspec 23
## 4 nspec 35
## 5 nspec 5
## 6 nspec 1
## 7 nspec 8
## 8 nspec 1
## 9 nspec 2
## 10 nspec 26
## # ... with 24 more rows
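In tidyr 1.0.0 and later, gather is superseded by pivot_longer, which asks you to name the columns to reshape (or to leave out) explicitly. Below is a sketch of the first gather call above rewritten that way. It assumes a small stand-in for ORD: the object ORD_TOY is hypothetical and only copies three rows of the output shown earlier.

```r
library(tidyr)

# Hypothetical stand-in for ORD with three of its rows
ORD_TOY <- data.frame(order = c("Artiodactyla", "Carnivora", "Cetacea"),
                      nspec = c(126L, 110L, 23L),
                      nfam = c(22L, 13L, 7L))

# Reshape every column except order; column names go to "var", counts to "val"
LONG <- pivot_longer(ORD_TOY, cols = -order, names_to = "var", values_to = "val")
LONG
```

One difference to expect: pivot_longer keeps the rows for each order together, whereas gather stacks all of the nspec rows before the nfam rows.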
tidyr::spread is equivalent to the reshape2::dcast function and does the opposite of tidyr::gather (as the name yet again suggests).
EXP <- data.frame(order = rep(levels(DATA$order), 3), institution = rep(c("TCWC", "KU", "UMMZ"), each = 16), val = sample(1:50, 48, replace = TRUE))
head(EXP, 25)
## order institution val
## 1 Artiodactyla TCWC 13
## 2 Carnivora TCWC 14
## 3 Cetacea TCWC 37
## 4 Chiroptera TCWC 42
## 5 Cingulata TCWC 46
## 6 Dermoptera TCWC 23
## 7 Didelphimorphia TCWC 31
## 8 Diprotodontia TCWC 48
## 9 Erinaceomorpha TCWC 34
## 10 Lagomorpha TCWC 35
## 11 Perissodactyla TCWC 4
## 12 Pilosa TCWC 9
## 13 Primates TCWC 33
## 14 Proboscidea TCWC 29
## 15 Rodentia TCWC 37
## 16 Soricomorpha TCWC 43
## 17 Artiodactyla KU 4
## 18 Carnivora KU 13
## 19 Cetacea KU 36
## 20 Chiroptera KU 20
## 21 Cingulata KU 40
## 22 Dermoptera KU 38
## 23 Didelphimorphia KU 1
## 24 Diprotodontia KU 13
## 25 Erinaceomorpha KU 22
Above we made a random example to use tidyr::spread on. It consists of 16 mammalian orders, three institution codes, and a random value from 1 to 50 for each order in each institution.
EXP %>% spread(key = "institution", value = "val")
## order KU TCWC UMMZ
## 1 Artiodactyla 4 13 21
## 2 Carnivora 13 14 13
## 3 Cetacea 36 37 40
## 4 Chiroptera 20 42 40
## 5 Cingulata 40 46 49
## 6 Dermoptera 38 23 17
## 7 Didelphimorphia 1 31 14
## 8 Diprotodontia 13 48 14
## 9 Erinaceomorpha 22 34 41
## 10 Lagomorpha 23 35 18
## 11 Perissodactyla 48 4 10
## 12 Pilosa 38 9 4
## 13 Primates 26 33 32
## 14 Proboscidea 47 29 10
## 15 Rodentia 46 37 27
## 16 Soricomorpha 17 43 35
EXP %>% spread(key = "order", value = "val")
## institution Artiodactyla Carnivora Cetacea Chiroptera Cingulata
## 1 KU 4 13 36 20 40
## 2 TCWC 13 14 37 42 46
## 3 UMMZ 21 13 40 40 49
## Dermoptera Didelphimorphia Diprotodontia Erinaceomorpha Lagomorpha
## 1 38 1 13 22 23
## 2 23 31 48 34 35
## 3 17 14 14 41 18
## Perissodactyla Pilosa Primates Proboscidea Rodentia Soricomorpha
## 1 48 38 26 47 46 17
## 2 4 9 33 29 37 43
## 3 10 4 32 10 27 35
Depending on how you choose your key (the column that is spread out, with its levels becoming new columns) and value (the column of numbers that fills those new columns) arguments, you can get very different data.frames.
However, it is possibly easier to connect this with the reshape2::melt and tidyr::gather functions than the different formula syntax of the reshape2 cast functions.
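In tidyr 1.0.0 and later, spread is likewise superseded by pivot_wider, where names_from plays the role of key and values_from the role of value. The sketch below uses a made-up data.frame (EXP_TOY is hypothetical, and its values are invented) to show both choices of key.

```r
library(tidyr)

# Hypothetical long data: two orders in three institutions, made-up values
EXP_TOY <- data.frame(order = rep(c("Carnivora", "Rodentia"), times = 3),
                      institution = rep(c("TCWC", "KU", "UMMZ"), each = 2),
                      val = c(14, 37, 13, 46, 13, 27))

# One column per institution, filled with the numbers in val
BY_INST <- pivot_wider(EXP_TOY, names_from = institution, values_from = val)
BY_INST

# One column per order instead
BY_ORD <- pivot_wider(EXP_TOY, names_from = order, values_from = val)
BY_ORD
```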
separate
Separate is another function that may be useful when you want to split data held in a single column into multiple columns.
DATA %>% separate(species, c("genus", "specep")) %>% head
## decimalLongitude decimalLatitude genus specep issues
## 1 -99.75406 31.92127 Lynx rufus
## 2 -97.79981 32.17984 Canis lupus
## 3 -97.79981 32.17984 Canis lupus
## 4 -97.49110 26.52303 Canis latrans
## 5 -96.24263 30.87341 Odocoileus virginianus
## 6 -106.51273 31.92514 Bassariscus astutus gass84,gdativ
## basisOfRecord order family
## 1 PRESERVED_SPECIMEN Carnivora Felidae
## 2 PRESERVED_SPECIMEN Carnivora Canidae
## 3 PRESERVED_SPECIMEN Carnivora Canidae
## 4 PRESERVED_SPECIMEN Carnivora Canidae
## 5 PRESERVED_SPECIMEN Artiodactyla Cervidae
## 6 PRESERVED_SPECIMEN Carnivora Procyonidae
## coordinateUncertaintyInMeters year datasetName catalogNumber
## 1 NA 2018 <NA> 65493
## 2 804.67 2018 <NA> MSB:Mamm:324207
## 3 804.67 2018 <NA> MSB:Mamm:324206
## 4 NA 2017 <NA> 64950
## 5 NA 2017 <NA> 65491
## 6 NA 2017 <NA> UTEP:Mamm:8483
## institutionCode county
## 1 TCWC Runnels
## 2 MSB Somervell County
## 3 MSB Somervell County
## 4 TCWC Willacy
## 5 TCWC Brazos
## 6 UTEP El Paso County
## preparations
## 1 SS | tissue
## 2 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 3 blood serum (frozen); blood serum (frozen); blood (EDTA); blood (EDTA); blood (EDTA)
## 4 ss | tissue
## 5 SK | tissue
## 6 skeleton; skin, study
## locality sex
## 1 Highway 153 near FM 140 <NA>
## 2 Fossil Rim Wildlife Center <NA>
## 3 Fossil Rim Wildlife Center <NA>
## 4 East Foundation, El Sauz Ranch <NA>
## 5 Jack Creek at FM 974 <NA>
## 6 Franklin Mountains State Park, Tom Mays Unit, park access road FEMALE
This can be useful for certain things like dates (although there is the lubridate package as part of the tidyverse for y’all that are temporally inclined), those pesky museum numbers, or simple text analysis.
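As a sketch of the museum number case: by default, separate splits on any run of non-alphanumeric characters, but you can name the separator yourself and tell it what to do when an entry has fewer pieces than expected. The catalog numbers below are copied from the output above. The data.frame CAT and the new column names (inst, collection, number) are made up for illustration.

```r
library(tidyr)

# Catalog numbers copied from the head() output above
CAT <- data.frame(catalogNumber = c("MSB:Mamm:324207", "UTEP:Mamm:8483", "65493"),
                  stringsAsFactors = FALSE)

# Split on ":" only; fill = "left" pads short entries with NA on the left,
# so a bare number still lands in the "number" column
SPLIT <- separate(CAT, catalogNumber, into = c("inst", "collection", "number"),
                  sep = ":", fill = "left")
SPLIT
```

Without the fill argument, separate still runs but warns about the rows that have too few pieces and pads them with NA on the right.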
Overall, hopefully this has at least shown you some functions that may be useful for cleaning up and organizing your own data (no more going into Excel and fighting with its formatting!).
As always, more information about specific functions can be found by viewing the help pages in R. Answers to additional questions about the tidyverse can be found on its thorough website. And general “how do I do something ill advised”, “how do I break this”, or “how do I do this in the most complex manner possible” questions can find answers on StackExchange. Be sure to always read through the comments and most of the answers (the most highly voted answer is often not the most relevant one to you).