Brianna Heggeseth
MSCS Happenings
In the back of your brain, start thinking about project ideas.
Each of you will generate 2-3 ideas.
By Friday night (updated!), you’ll submit those ideas to Moodle.
For each problem I marked with an X,
Talk with others in the class; help each other understand the WHY.
Turn in to me by next class.
UPDATE: You should have been notified of a shared PDF with feedback
Talk through some of the stumbling blocks with your classmates. Take notes for yourself.
By the end of THIS week, submit an updated version of the Midterm Part 2 to Moodle and write a reflection about the midterm in your spreadsheet.
My Deal: You may talk to others in the class (not preceptors, not people who have previously taken it) but you may not directly share code with each other. Instead, talk about the actions more conceptually and point each other to resources.
Exploratory Data Analysis (EDA), a name given to the process of
Another way to describe EDA:
Useful R functions:
str(): to learn about the numbers of variables and observations as well as the classes of variables
head(): to view the top of the data table (can specify the number of rows with n=)
tail(): to view the bottom of the data table
spec_tbl_df [52 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ state : chr [1:52] "United States" "Alabama" "Alaska" "Arizona" ...
$ murder : num [1:52] 5.6 8.2 4.8 7.5 6.7 6.9 3.7 2.9 4.4 35.4 ...
$ forcible_rape : num [1:52] 31.7 34.3 81.1 33.8 42.9 26 43.4 20 44.7 30.2 ...
$ robbery : num [1:52] 140.7 141.4 80.9 144.4 91.1 ...
$ aggravated_assault : num [1:52] 291 248 465 327 387 ...
$ burglary : num [1:52] 727 954 622 948 1085 ...
$ larceny_theft : num [1:52] 2286 2650 2599 2965 2711 ...
$ motor_vehicle_theft: num [1:52] 417 288 391 924 262 ...
$ population : num [1:52] 2.96e+08 4.55e+06 6.69e+05 5.97e+06 2.78e+06 ...
- attr(*, "spec")=
.. cols(
.. state = col_character(),
.. murder = col_double(),
.. forcible_rape = col_double(),
.. robbery = col_double(),
.. aggravated_assault = col_double(),
.. burglary = col_double(),
.. larceny_theft = col_double(),
.. motor_vehicle_theft = col_double(),
.. population = col_double()
.. )
- attr(*, "problems")=<externalptr>
# A tibble: 6 × 9
state murder forcibl…¹ robbery aggra…² burgl…³ larce…⁴ motor…⁵ popul…⁶
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 United States 5.6 31.7 141. 291. 727. 2286. 417. 2.96e8
2 Alabama 8.2 34.3 141. 248. 954. 2650 288. 4.55e6
3 Alaska 4.8 81.1 80.9 465. 622. 2599. 391 6.69e5
4 Arizona 7.5 33.8 144. 327. 948. 2965. 924. 5.97e6
5 Arkansas 6.7 42.9 91.1 387. 1085. 2711. 262. 2.78e6
6 California 6.9 26 176. 317. 693. 1916. 713. 3.58e7
# … with abbreviated variable names ¹forcible_rape, ²aggravated_assault,
# ³burglary, ⁴larceny_theft, ⁵motor_vehicle_theft, ⁶population
# A tibble: 6 × 9
state murder forcibl…¹ robbery aggra…² burgl…³ larce…⁴ motor…⁵ popul…⁶
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Vermont 1.3 23.3 11.7 83.5 492. 1686. 103. 618814
2 Virginia 6.1 22.7 99.2 155. 392. 2035 211. 7563887
3 Washington 3.3 44.7 92.1 206. 960. 3150. 784. 6261282
4 West Virginia 4.4 17.7 44.6 206. 621. 1794 210 1803920
5 Wisconsin 3.5 20.6 82.2 135. 441. 1993. 227. 5541443
6 Wyoming 2.7 24 15.3 188. 476. 2534. 145. 506242
# … with abbreviated variable names ¹forcible_rape, ²aggravated_assault,
# ³burglary, ⁴larceny_theft, ⁵motor_vehicle_theft, ⁶population
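Output like that shown above comes from calling str(), head(), and tail() on a data table. The following is a minimal sketch of those calls, not the exact code behind the output: it assumes a stand-in data set (R's built-in USArrests) in place of the crime table, and the object name crime is chosen only for illustration.

```r
# Stand-in data (an assumption): the built-in USArrests data frame,
# used here instead of the course's crime table.
crime <- USArrests

# Structure: number of observations and variables, plus the class of each variable
str(crime)

# Top of the data table; n = sets how many rows to show
head(crime, n = 3)

# Bottom of the data table; shows 6 rows by default
tail(crime)
```

With your own data, replace USArrests with the name of the table you read in.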
join commands)?
One convenient way to do this is with a pairs plot.
The main point of such plots is not necessarily to draw any conclusions, but help generate more specific research questions and hypotheses.
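As a rough sketch of a pairs plot, the example below uses base R's pairs() function on a stand-in data set (the built-in USArrests, which is an assumption and not the course data). The object name crime and the plot title are illustrative only.

```r
# Stand-in data (an assumption): built-in USArrests, all numeric columns
crime <- USArrests

# Draw a scatterplot for every pair of variables in the data frame
pairs(crime, main = "Pairwise scatterplots (stand-in data)")
```

Each panel shows one pair of variables, so scanning the grid is a quick way to spot relationships worth a closer look. If the GGally package is installed, GGally::ggpairs() produces a similar display in ggplot2 style.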
You will often end up with a lot of data, and it can be easy to be overwhelmed.
How should you get started?
To do so, you can again revisit questions like “What patterns do you see?” or “Why might they be occurring?”
Let’s practice these steps using data about flight delays from Kaggle. Download the template Rmd file from the course website.
Finish this activity for Assignment 11 (EDA)
Brainstorm Activity due Wednesday
Midterm Revisions Part 2 due Friday
IV1 due next week