Data Introduction & Cleaning

Overview

Nearly all of the data originates from one of two sources: a Qualtrics survey disseminated to kind survey respondents (students) in Stiles, and some housing data from the college. While this project is a living document and is still taking in student responses, at current publication, 71 student responses have been recorded.

For the sake of tidiness, most of the coding setup and data wrangling for this project’s visualizations will be developed in the following two pages. For all readers, the pages to come contain the ~raw data tables which can be explored, as well as an interpretation of the overall survey data. For those interested in following the code, these pages should also elucidate the various datasets and data formats employed.

A copy of the survey questions may be found here.

Survey Responses

Cleanup

The data table below gives a first look at the Qualtrics survey responses! (Excepting a few personally-identifying data points.)

data.qualtrics <- read.csv("data/qualtrics_20201227_233000.csv") %>%
    slice(3:n()) %>%  # First two rows do not have respondent data; remove them
    na_if("")  # Qualtrics saves NA in CSV as blank strings; reset those to NA

paged_table(data.qualtrics %>%
                select(-c(IPAddress, LocationLongitude, LocationLatitude, Q25)))

A number of data-cleaning steps become quickly apparent: some rows are empty or partially empty (representing those who opened the survey but did not complete it); the columns are named quite unhelpfully (e.g. “Q20.1_1_1”), and there seem to be many columns that are not relevant. As a first cleaning step, we rename the relevant columns, remove all unwanted columns, and trim empty rows (keeping partial responses).

data.students <- data.qualtrics %>%
    transmute(
        # Student info
        Year=Q24, Email=Q25,
        Enrolled.F20=case_when(
            Q23=="Enrolled" ~ T,
            Q23=="On Leave" ~ F,
            TRUE ~ NA),
        # Location/movement data at time t
        Loc1=Q4, Dorm1=Q5, Loc2=Q6, Loc2.5=Q26, Loc3=Q7, Loc4=Q8, Loc5=Q9, Dorm5=Q10,
        Loc6=Q11,
        # Meaning/nostalgia data (in expanded form)
        Place1.n=Q20.1_1_1, Place1.t=Q20.2_1, Place2.n=Q20.1_2_1, Place2.t=Q20.2_2,
        Place3.n=Q20.1_3_1, Place3.t=Q20.2_3, Place4.n=Q20.1_4_1, Place4.t=Q20.2_4,
        Place5.n=Q20.1_5_1, Place5.t=Q20.2_5, Place6.n=Q20.1_6_1, Place6.t=Q20.2_6,
        Place7.n=Q20.1_7_1, Place7.t=Q20.2_7, Place8.n=Q20.1_8_1, Place8.t=Q20.2_8,
        Place9.n=Q20.1_9_1, Place9.t=Q20.2_9, Place10.n=Q20.1_10_1, Place10.t=Q20.2_10,
        # Connectedness data
        Conn1=Q22.1_1, Conn2=Q22.1_2, Conn3=Q22.1_3, Conn4=Q22.1_4,
        Conn5=Q22.1_5, Conn6=Q22.1_6,
        # Useful Qualtrics data
        ID=ResponseId
    ) %>%
    # Remove 100% empty responses (empty Loc1)
    filter(!is.na(Loc1) | !is.na(Loc6)) %>%
    # Set appropriate columns to numeric
    mutate(Year=as.numeric(Year), Conn1=as.numeric(Conn1), Conn2=as.numeric(Conn2),
           Conn3=as.numeric(Conn3), Conn4=as.numeric(Conn4), Conn5=as.numeric(Conn5),
           Conn6=as.numeric(Conn6)
    )

In addition, there remain some discrepancies in the location and place data. In order to plot students’ locations or meaningful places on a map, we will want to programatically convert the text-strings of location names (e.g. “Ashley’s Ice Cream”) to GPS coordinates; however, not all respondents gave specific-enough location names for something like the Google Maps service to find the location’s coordinates. Therefore, some by-hand reformatting will be (unfortunately) necessary. As a first step, we can start some of this work for the Spring and Fall semesters (pre-break) by inferring precise location names from dorm data:

data.students <- data.students %>%
    mutate(Loc1 = case_when(
        grepl("Stiles", Dorm1) ~ "Ezra Stiles College New Haven",
        grepl("Lawrance", Dorm1) ~ "Lawrance Hall New Haven",
        T ~ Loc1
    )) %>%
    mutate(Loc5 = case_when(
        grepl("Stiles", Dorm5) ~ "Ezra Stiles College New Haven",
        grepl("Lawrance", Dorm5) ~ "Lawrance Hall New Haven",
        grepl("Vanderbilt", Dorm5) ~ "Vanderbilt Hall New Haven",
        grepl("LW", Dorm5) ~ "Lanman-Wright Hall New Haven",
        T ~ Loc5
    ))

We also note that some respondents chose “Off-Campus” for their dorm when the more correct option was “Not in New Haven”—resolving this will come in the next page with some more by-hand work. For now, the data table below represents the cleaned survey responses.

Data Table (Cleaned Survey Responses)

paged_table(data.students %>% select(!Email), options=list(rows.print=25))
colnames(data.students)
 [1] "Year"         "Email"        "Enrolled.F20" "Loc1"        
 [5] "Dorm1"        "Loc2"         "Loc2.5"       "Loc3"        
 [9] "Loc4"         "Loc5"         "Dorm5"        "Loc6"        
[13] "Place1.n"     "Place1.t"     "Place2.n"     "Place2.t"    
[17] "Place3.n"     "Place3.t"     "Place4.n"     "Place4.t"    
[21] "Place5.n"     "Place5.t"     "Place6.n"     "Place6.t"    
[25] "Place7.n"     "Place7.t"     "Place8.n"     "Place8.t"    
[29] "Place9.n"     "Place9.t"     "Place10.n"    "Place10.t"   
[33] "Conn1"        "Conn2"        "Conn3"        "Conn4"       
[37] "Conn5"        "Conn6"        "ID"          

Interpretation

The data frame above features a wealth of information, and it is now much easier to make some sense of the survey responses. As a guide: note that each row represents one student’s full survey response, while each column (named immediately above) represents that student’s answer to a question, or part of an answer to a question.

In particular, for each student, we have the following response data, given in response to the corresponding survey questions:

As hinted, some of the data above is tracked through various time periods or epochs t which are particularly relevant to the Yale calendar and the effects of the pandemic. The epochs t are numbered 1 through 6, but in particular, each epoch corresponds to the following meaningful time frame:

Epoch (t=value) Time frame Date range (2020)
t1 Before Spring Break < 3/7
t2 Spring Break week 1 3/7 - 3/15
t2.5 Spring Break week 2 3/16 - 3/22
t3 Remainder of the Spring semester 3/23 - 5/6
t4 Summer 5/7 - Late August
t5 Fall semester, before Break 8/31 - 11/21
t6 Remainder of the Fall semester 11/22 - 2020

In the sections which follow, we extract and clean a couple of other, more specific data sets.

Meaningful/Nostalgic Places

The dataset above focuses on students by letting each row be the student response; hence many columns represent that student’s meaningful/nostalgic places. In particular, the survey has a fixed number of inputs expecting up to 10 places, and since each place has a name (.n) and a type (.t), this results in 20 columns total per student. This is a very ungainly format for the Key Places page, as we want to focus on each place rather than each student.

Hence, we extract the place data from the larger survey dataset. We l e n g t h e n the data by pivoting the Place column-pairs (Placei.n, Placei.t) to individual rows.

data.places <- data.students %>%
    # Work only with Place data and (internal) student id
    select(ID, Place1.n, Place1.t, Place2.n, Place2.t, Place3.n, Place3.t,
           Place4.n, Place4.t, Place5.n, Place5.t, Place6.n, Place6.t, Place7.n,
           Place7.t, Place8.n, Place8.t, Place9.n, Place9.t, Place10.n, Place10.t
    ) %>%
    # https://tidyr.tidyverse.org/articles/pivot.html#multiple-observations-per-row-1
    pivot_longer(!ID,
                 names_to=c("order", ".value"),
                 names_pattern="Place(.)\\.(.)"
    ) %>%
    # Clean up
    rename(Name=n, Type=t) %>%
    filter(!is.na(Name)) %>%  # One student wrote-in "N/A"
    mutate(order=as.numeric(order))

Data Table (Key Places)

paged_table(sample_n(data.places, n()), options=list(rows.print=25))

In the table above, each unique ID corresponds to one respondent, and order represents the order in which they named the respective place (between 1 and 10). It was not explicitly stated that they order their places from most to least meaningful/nostalgic.

Note that—as noted when cleaning the survey data above—some of these locations will need to be modified manually to provide for correct geocoding (converting location name/address to GPS coordinates).

Housing Counts

Finally, for the Missing Moose page (forthcoming), we will visualize the change in housing density from Spring to Fall. Ezra Stiles College provided the relevant housing counts to assist with this section. The data is saved in the slightly more readable “wider” format, which focuses on the two terms:

# Read housing counts
data.housing_wide <- read.csv("data/housing_S20_F20_wide.csv")
paged_table(data.housing_wide)

However, as with the previous pivot, we want the data in a “longer” format which places each entryway/dorm in a separate row:

# Lengthen housing counts
data.housing_long <- data.housing_wide %>%
    pivot_longer(!Term,
                 names_to="Entryway",
                 values_to="Count")
paged_table(data.housing_long)

In the table above, Count gives the number of students housed in that particular entryway in that term.

Before we move on…

The initial data cleaning has run its course; we have now done all that we can do, short of manually cleaning the Location and Place data!

We save the above datasets to separate files:

# t <- format(Sys.time(), "%Y%m%d_%H%M%S")
t <- "20201227_235052"
write.csv(data.students, paste0("data/students_", t, ".csv", sep=""), row.names=F)
write.csv(data.places, paste0("data/places_", t, ".csv", sep=""), row.names=F)
write.csv(data.housing_long, "data/housing_S20_F20_long.csv", row.names=F)

Lastly, note that the full and original survey data is not made public in order to respect students’ privacy. Email me at cove.geary@yale.edu if you have questions.