Nearly all of the data originates from one of two sources: a Qualtrics survey disseminated to kind survey respondents (students) in Stiles, and some housing data from the college. While this project is a living document and is still taking in student responses, at current publication, 71 student responses have been recorded.
For the sake of tidiness, most of the coding setup and data wrangling for this project’s visualizations will be developed in the following two pages. For all readers, the pages to come contain the ~raw data tables which can be explored, as well as an interpretation of the overall survey data. For those interested in following the code, these pages should also elucidate the various datasets and data formats employed.
A copy of the survey questions may be found here.
The data table below gives a first look at the Qualtrics survey responses! (Excepting a few personally-identifying data points.)
data.qualtrics <- read.csv("data/qualtrics_20201227_233000.csv") %>%
slice(3:n()) %>% # First two rows do not have respondent data; remove them
na_if("") # Qualtrics saves NA in CSV as blank strings; reset those to NA
paged_table(data.qualtrics %>%
select(-c(IPAddress, LocationLongitude, LocationLatitude, Q25)))
A number of data-cleaning steps become quickly apparent: some rows are empty or partially empty (representing those who opened the survey but did not complete it); the columns are named quite unhelpfully (e.g. “Q20.1_1_1”), and there seem to be many columns that are not relevant. As a first cleaning step, we rename the relevant columns, remove all unwanted columns, and trim empty rows (keeping partial responses).
data.students <- data.qualtrics %>%
transmute(
# Student info
Year=Q24, Email=Q25,
Enrolled.F20=case_when(
Q23=="Enrolled" ~ T,
Q23=="On Leave" ~ F,
TRUE ~ NA),
# Location/movement data at time t
Loc1=Q4, Dorm1=Q5, Loc2=Q6, Loc2.5=Q26, Loc3=Q7, Loc4=Q8, Loc5=Q9, Dorm5=Q10,
Loc6=Q11,
# Meaning/nostalgia data (in expanded form)
Place1.n=Q20.1_1_1, Place1.t=Q20.2_1, Place2.n=Q20.1_2_1, Place2.t=Q20.2_2,
Place3.n=Q20.1_3_1, Place3.t=Q20.2_3, Place4.n=Q20.1_4_1, Place4.t=Q20.2_4,
Place5.n=Q20.1_5_1, Place5.t=Q20.2_5, Place6.n=Q20.1_6_1, Place6.t=Q20.2_6,
Place7.n=Q20.1_7_1, Place7.t=Q20.2_7, Place8.n=Q20.1_8_1, Place8.t=Q20.2_8,
Place9.n=Q20.1_9_1, Place9.t=Q20.2_9, Place10.n=Q20.1_10_1, Place10.t=Q20.2_10,
# Connectedness data
Conn1=Q22.1_1, Conn2=Q22.1_2, Conn3=Q22.1_3, Conn4=Q22.1_4,
Conn5=Q22.1_5, Conn6=Q22.1_6,
# Useful Qualtrics data
ID=ResponseId
) %>%
# Remove 100% empty responses (empty Loc1)
filter(!is.na(Loc1) | !is.na(Loc6)) %>%
# Set appropriate columns to numeric
mutate(Year=as.numeric(Year), Conn1=as.numeric(Conn1), Conn2=as.numeric(Conn2),
Conn3=as.numeric(Conn3), Conn4=as.numeric(Conn4), Conn5=as.numeric(Conn5),
Conn6=as.numeric(Conn6)
)
In addition, there remain some discrepancies in the location and place data. In order to plot students’ locations or meaningful places on a map, we will want to programatically convert the text-strings of location names (e.g. “Ashley’s Ice Cream”) to GPS coordinates; however, not all respondents gave specific-enough location names for something like the Google Maps service to find the location’s coordinates. Therefore, some by-hand reformatting will be (unfortunately) necessary. As a first step, we can start some of this work for the Spring and Fall semesters (pre-break) by inferring precise location names from dorm data:
data.students <- data.students %>%
mutate(Loc1 = case_when(
grepl("Stiles", Dorm1) ~ "Ezra Stiles College New Haven",
grepl("Lawrance", Dorm1) ~ "Lawrance Hall New Haven",
T ~ Loc1
)) %>%
mutate(Loc5 = case_when(
grepl("Stiles", Dorm5) ~ "Ezra Stiles College New Haven",
grepl("Lawrance", Dorm5) ~ "Lawrance Hall New Haven",
grepl("Vanderbilt", Dorm5) ~ "Vanderbilt Hall New Haven",
grepl("LW", Dorm5) ~ "Lanman-Wright Hall New Haven",
T ~ Loc5
))
We also note that some respondents chose “Off-Campus” for their dorm when the more correct option was “Not in New Haven”—resolving this will come in the next page with some more by-hand work. For now, the data table below represents the cleaned survey responses.
paged_table(data.students %>% select(!Email), options=list(rows.print=25))
colnames(data.students)
[1] "Year" "Email" "Enrolled.F20" "Loc1"
[5] "Dorm1" "Loc2" "Loc2.5" "Loc3"
[9] "Loc4" "Loc5" "Dorm5" "Loc6"
[13] "Place1.n" "Place1.t" "Place2.n" "Place2.t"
[17] "Place3.n" "Place3.t" "Place4.n" "Place4.t"
[21] "Place5.n" "Place5.t" "Place6.n" "Place6.t"
[25] "Place7.n" "Place7.t" "Place8.n" "Place8.t"
[29] "Place9.n" "Place9.t" "Place10.n" "Place10.t"
[33] "Conn1" "Conn2" "Conn3" "Conn4"
[37] "Conn5" "Conn6" "ID"
The data frame above features a wealth of information, and it is now much easier to make some sense of the survey responses. As a guide: note that each row represents one student’s full survey response, while each column (named immediately above) represents that student’s answer to a question, or part of an answer to a question.
In particular, for each student, we have the following response data, given in response to the corresponding survey questions:
t
of 2020.t=1
) and Fall (t=5
) semesters, if applicable.
.n
gives the place’s name while .t
gives whether the place is physical or virtual (its “type”), and the survey had space for up to i=10
places.
t
of 2020.
As hinted, some of the data above is tracked through various time periods or epochs t
which are particularly relevant to the Yale calendar and the effects of the pandemic. The epochs t
are numbered 1
through 6
, but in particular, each epoch corresponds to the following meaningful time frame:
Epoch (t= value) |
Time frame | Date range (2020) |
---|---|---|
t1 |
Before Spring Break | < 3/7 |
t2 |
Spring Break week 1 | 3/7 - 3/15 |
t2.5 |
Spring Break week 2 | 3/16 - 3/22 |
t3 |
Remainder of the Spring semester | 3/23 - 5/6 |
t4 |
Summer | 5/7 - Late August |
t5 |
Fall semester, before Break | 8/31 - 11/21 |
t6 |
Remainder of the Fall semester | 11/22 - 2020 |
In the sections which follow, we extract and clean a couple of other, more specific data sets.
The dataset above focuses on students by letting each row be the student response; hence many columns represent that student’s meaningful/nostalgic places. In particular, the survey has a fixed number of inputs expecting up to 10 places, and since each place has a name (.n
) and a type (.t
), this results in 20 columns total per student. This is a very ungainly format for the Key Places page, as we want to focus on each place rather than each student.
Hence, we extract the place data from the larger survey dataset. We l e n g t h e n
the data by pivoting the Place
column-pairs (Placei.n, Placei.t) to individual rows.
data.places <- data.students %>%
# Work only with Place data and (internal) student id
select(ID, Place1.n, Place1.t, Place2.n, Place2.t, Place3.n, Place3.t,
Place4.n, Place4.t, Place5.n, Place5.t, Place6.n, Place6.t, Place7.n,
Place7.t, Place8.n, Place8.t, Place9.n, Place9.t, Place10.n, Place10.t
) %>%
# https://tidyr.tidyverse.org/articles/pivot.html#multiple-observations-per-row-1
pivot_longer(!ID,
names_to=c("order", ".value"),
names_pattern="Place(.)\\.(.)"
) %>%
# Clean up
rename(Name=n, Type=t) %>%
filter(!is.na(Name)) %>% # One student wrote-in "N/A"
mutate(order=as.numeric(order))
paged_table(sample_n(data.places, n()), options=list(rows.print=25))
In the table above, each unique ID corresponds to one respondent, and order
represents the order in which they named the respective place (between 1 and 10). It was not explicitly stated that they order their places from most to least meaningful/nostalgic.
Note that—as noted when cleaning the survey data above—some of these locations will need to be modified manually to provide for correct geocoding (converting location name/address to GPS coordinates).
Finally, for the Missing Moose page (forthcoming), we will visualize the change in housing density from Spring to Fall. Ezra Stiles College provided the relevant housing counts to assist with this section. The data is saved in the slightly more readable “wider” format, which focuses on the two terms:
# Read housing counts
data.housing_wide <- read.csv("data/housing_S20_F20_wide.csv")
paged_table(data.housing_wide)
However, as with the previous pivot, we want the data in a “longer” format which places each entryway/dorm in a separate row:
# Lengthen housing counts
data.housing_long <- data.housing_wide %>%
pivot_longer(!Term,
names_to="Entryway",
values_to="Count")
paged_table(data.housing_long)
In the table above, Count
gives the number of students housed in that particular entryway in that term.
The initial data cleaning has run its course; we have now done all that we can do, short of manually cleaning the Location
and Place
data!
We save the above datasets to separate files:
# t <- format(Sys.time(), "%Y%m%d_%H%M%S")
t <- "20201227_235052"
write.csv(data.students, paste0("data/students_", t, ".csv", sep=""), row.names=F)
write.csv(data.places, paste0("data/places_", t, ".csv", sep=""), row.names=F)
write.csv(data.housing_long, "data/housing_S20_F20_long.csv", row.names=F)
Lastly, note that the full and original survey data is not made public in order to respect students’ privacy. Email me at cove.geary@yale.edu if you have questions.