Module 3 Assignment: Summary Statistics and Visualization of Categorical Data

Comparing the ABC and CBS poll numbers reveals a few patterns. Some candidates score about the same in both polls, while others differ substantially. Donald's support is much higher in the CBS poll than in the ABC poll (75 vs. 62, a gap of 13 points), and Hillary also rises noticeably (21 vs. 14, or +7). Ted, Marco, and Carly, on the other hand, score lower in the CBS poll, which shows how results can shift depending on the source. The CBS poll also has a wider range (1 to 75) than ABC (2 to 62), which suggests CBS shows bigger extremes, pushing some candidates higher and some lower. In the difference chart (CBS minus ABC), Jeb (+8), Hillary (+7), and Bernie (+4) show gains, though none as large as Donald's (+13), while Ted shows the largest drop (-8). The smallest differences belong to Carly (-1) and Marco (-2).
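
As a quick check on the comparisons above, here is a minimal base R sketch that ranks the candidates by the size of the gap between the two polls. It reuses the same made-up values as the code chunk in this post; the names `gap_table` and `AbsDiff` are my own additions for illustration.

```r
# Same made-up poll values used in the code chunk of this post
Name     <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
ABC_poll <- c(4, 62, 51, 21, 2, 14, 15)
CBS_poll <- c(12, 75, 43, 19, 1, 21, 19)

# Signed difference: positive means CBS is higher than ABC
Diff <- CBS_poll - ABC_poll

# gap_table and AbsDiff are illustrative names, not part of the assignment
gap_table <- data.frame(Name, Diff, AbsDiff = abs(Diff))

# Sort from the largest to the smallest absolute gap
gap_table <- gap_table[order(-gap_table$AbsDiff), ]
print(gap_table, row.names = FALSE)

# Width of each poll's range (max minus min)
diff(range(ABC_poll))  # 60
diff(range(CBS_poll))  # 74
```

Sorting by the absolute difference puts Donald first and Carly last, which matches the reading of the difference chart.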

The impact of using artificial (made-up) data is that the findings cannot be applied to real-world scenarios. This dataset is about voters and political polls, but the numbers don't reflect the actual opinions or behavior of real people. Any patterns we notice, such as one candidate seeming stronger in one poll than another, are purely artificial. This limits our ability to draw meaningful conclusions or make predictions, since the dataset wasn't collected from real respondents. Instead, the value of made-up data lies in practicing analysis techniques, learning the software, and understanding how to interpret results before working with genuine, high-stakes data.

To collect or validate real poll data in a true analysis, I would begin by drawing a random sample. To get a wide variety of ages and demographics, I might go to a mall in a major city and use a clearly worded, neutrally ordered questionnaire. I would then select participants to invite with systematic sampling (for example, approaching every k-th adult after a random start), a method that gives each adult a non-zero chance of selection. After data collection, I'd weight responses to Census/ACS benchmarks (age, sex, race/ethnicity, education, region) and report the margin of error and design effect along with the full methodology (dates, frame, response rate). To validate, I'd compare results to other reputable polls from the same period, look for consistent time trends, run sensitivity checks (alternate weights, likely-voter screens), and, where possible, benchmark demographics to voter files or administrative records. From there, I would form conclusions and report my findings.
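
To make the sampling and margin-of-error steps more concrete, here is a rough base R sketch. Everything in it is an assumption for illustration: the frame of 10,000 IDs, the target of 500 respondents, the design effect of 1.5, and the helper `moe()`, which uses the usual normal-approximation formula for a proportion. It is not a full weighting workflow.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical sampling frame: 10,000 adults identified only by an ID
frame_ids <- seq_len(10000)
n_target  <- 500                             # desired sample size (assumption)
k         <- length(frame_ids) %/% n_target  # sampling interval: every k-th person

# Systematic sampling: random start between 1 and k, then every k-th ID
start      <- sample.int(k, 1)
sample_ids <- frame_ids[seq(from = start, by = k, length.out = n_target)]

# Approximate 95% margin of error for a proportion.
# p = 0.5 is the conservative worst case; deff is an assumed design effect
# that inflates the variance to account for weighting.
moe <- function(p, n, deff = 1) {
  1.96 * sqrt(deff * p * (1 - p) / n)
}

moe(0.5, n_target)              # about 0.044, roughly +/- 4.4 points
moe(0.5, n_target, deff = 1.5)  # about 0.054 once the design effect is applied
```

The second call shows why the design effect is worth reporting: weighting makes the effective sample smaller, so the margin of error grows.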


Code Chunk:

> # 1) Define and inspect data
> Name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
> ABC_poll <- c(4, 62, 51, 21, 2, 14, 15)
> CBS_poll <- c(12, 75, 43, 19, 1, 21, 19)
> df_polls <- data.frame(Name, ABC_poll, CBS_poll, stringsAsFactors = FALSE)
> str(df_polls)
'data.frame':   7 obs. of  3 variables:
 $ Name    : chr  "Jeb" "Donald" "Ted" "Marco" ...
 $ ABC_poll: num  4 62 51 21 2 14 15
 $ CBS_poll: num  12 75 43 19 1 21 19
> head(df_polls)
     Name ABC_poll CBS_poll
1     Jeb        4       12
2  Donald       62       75
3     Ted       51       43
4   Marco       21       19
5   Carly        2        1
6 Hillary       14       21

> # Means
> mean(df_polls$ABC_poll)
[1] 24.14286
> mean(df_polls$CBS_poll)
[1] 27.14286

> # Medians
> median(df_polls$ABC_poll)
[1] 15
> median(df_polls$CBS_poll)
[1] 19

> # Ranges
> range(df_polls$ABC_poll)
[1]  2 62
> range(df_polls$CBS_poll)
[1]  1 75

> # Difference between CBS and ABC
> df_polls$Diff <- df_polls$CBS_poll - df_polls$ABC_poll
> df_polls
     Name ABC_poll CBS_poll Diff
1     Jeb        4       12    8
2  Donald       62       75   13
3     Ted       51       43   -8
4   Marco       21       19   -2
5   Carly        2        1   -1
6 Hillary       14       21    7
7  Bernie       15       19    4
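
The Diff column can also be plotted directly. Below is a minimal sketch using base R's `barplot()` rather than ggplot2, so it runs without extra packages; the colors, title, and the name `bar_cols` are my own choices and not part of the assignment.

```r
# Same made-up poll values as above
Name     <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
ABC_poll <- c(4, 62, 51, 21, 2, 14, 15)
CBS_poll <- c(12, 75, 43, 19, 1, 21, 19)
Diff     <- CBS_poll - ABC_poll

# Color bars by sign: blue where CBS is higher, red where CBS is lower
bar_cols <- ifelse(Diff >= 0, "steelblue", "tomato")

# Bars above zero mean CBS is higher than ABC; bars below zero mean lower
barplot(Diff,
        names.arg = Name,
        col  = bar_cols,
        main = "CBS minus ABC by Candidate",
        xlab = "Candidate",
        ylab = "Difference (CBS - ABC)")
abline(h = 0)  # reference line at no difference
```

Coloring by sign makes it easy to see which candidates do better in CBS and which do worse.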

For the ggplot2 bar chart:

> library(ggplot2)
> library(tidyr)
> df_long <- pivot_longer(df_polls,
+                         cols = c("ABC_poll", "CBS_poll"),
+                         names_to = "Poll",
+                         values_to = "Value")
> # bar chart
> ggplot(df_long, aes(x = Name, y = Value, fill = Poll)) +
+   geom_bar(stat = "identity", position = "dodge") +
+   labs(title = "Poll Results by Candidate",
+        x = "Candidate",
+        y = "Poll Value") +
+   theme_minimal()

R Programming Journal – Christine Jacob
