The US presidential election is held every four years on the Tuesday after the first
Monday in November. The 2008 and 2012 elections were held on Nov 4, 2008 and
Nov 6, 2012, respectively. The President of the US is not elected directly by
popular vote. Instead, the President is elected by electors who are selected by
popular vote on a state-by-state basis. These electors then cast direct votes
for the President. In all states except Maine and Nebraska, electors are
selected on a “winner-take-all” basis: all of a state's electoral votes go to the
presidential candidate who wins the most popular votes in that state. For
simplicity, we will assume that all states use the “winner-take-all” principle in
this lab. The number of electors in each state equals the number of members of
Congress (House representatives plus senators) of that state. Currently, there are
538 electors in total: 435 corresponding to House representatives, 100
corresponding to senators, and 3 from the District of Columbia. A presidential
candidate who receives an absolute majority of electoral votes (at least 270) is
elected President.
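The winner-take-all rule and the 270-vote threshold can be illustrated with a minimal R sketch. The three states, their vote shares and their elector counts are made up for illustration and are not taken from the lab data sets.

```r
# Hypothetical toy data: three made-up states, not real election results.
toy <- data.frame(
  State    = c("A", "B", "C"),
  Dem      = c(52, 47, 50.5),  # popular-vote percentages
  Rep      = c(48, 53, 49.5),
  Electors = c(10, 7, 4)
)

# Winner-take-all: all of a state's electors go to the candidate with more votes.
toy$Winner <- ifelse(toy$Dem > toy$Rep, "Dem", "Rep")
electoral_votes <- tapply(toy$Electors, toy$Winner, sum)

# With 538 electors in total, an absolute majority is more than half of them.
needed <- floor(538 / 2) + 1
```

Here `electoral_votes` holds the elector totals per party for the toy states, and `needed` is the majority threshold.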

For simplicity, our data analysis only considers the two major political parties:
Democratic (Dem) and Republican (Rep). The goal is to predict which party
(Dem or Rep) will win the most votes in each state. Because the chance that a
third-party candidate (neither Dem nor Rep) receives an electoral vote is very
small, this simplification is reasonable.

Our prediction will be based on election polls. An election poll is a survey that
asks a small sample of voters about their voting plans. If the survey is
conducted appropriately, the sample of voters should be representative of the
voting population at large. However, it is very challenging to obtain a
representative sample because a good sampling strategy needs to consider many
factors (e.g., sampling time, locations, methods). Therefore, a single poll’s
prediction could be biased, and the prediction accuracy could be improved by
combining multiple polls.

Many factors can affect the prediction accuracy of election polls. Based on the
available data sets, we consider the following three factors.

1. Sampling time. A poll conducted long before the election date is likely to
be less accurate than a poll conducted closer to the election date. Many
events can change voters’ opinions about the presidential candidates, so the
longer the time until the election, the more likely voters are to change
their voting plans.

2. Pollsters. Systematic biases can occur if a flawed sampling method is used.
For example, if a pollster only collects samples through the Internet, the
sample is biased because it only includes those who have access to the
Internet. Each pollster uses different methods for sampling voters, and some
sampling schemes may be better than others. Therefore, it is very likely that
some pollsters’ predictions are more reliable than others’, and we should not
give equal weight to every poll.

3. State edges. The state edge is the difference between the Democratic and
Republican popular vote percentages (based on the polls) in that state. For
instance, if the Democratic candidate receives 55% of the votes and the
Republican candidate receives 45% of the votes, then the Democratic edge
is 10 percentage points. If the state edge is small, sampling errors can
easily flip the predicted winner, so the prediction accuracy of a poll is
more likely to be affected. If the state edge is large, the prediction
accuracy is less likely to be affected by sampling errors.
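The effect of the state edge on accuracy can be made concrete with a rough R sketch. It assumes simple random sampling and a normal approximation, neither of which is stated in the lab, and the sample size and vote shares are hypothetical. Under those assumptions, the estimated edge from a poll of n respondents has standard error sqrt((pD + pR - (pD - pR)^2) / n).

```r
# Standard error of the estimated edge (p_dem - p_rep) for a poll of size n,
# assuming simple random sampling (an assumption made for this sketch).
edge_se <- function(p_dem, p_rep, n) {
  sqrt((p_dem + p_rep - (p_dem - p_rep)^2) / n)
}

n <- 600  # hypothetical poll size

# Approximate chance that a poll shows the true leader ahead.
p_small <- pnorm(0.02 / edge_se(0.51, 0.49, n))  # true edge: 2 points
p_big   <- pnorm(0.10 / edge_se(0.55, 0.45, n))  # true edge: 10 points
```

With the same sample size, the poll in the close state picks the true leader much less reliably than the poll in the state with the large edge.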

Available data sets

The following data sets are available for our data analysis:

1) Polling data from the 2008 US presidential election (2008-polls.csv);

2) Election results from the 2008 US presidential election (2008-results.csv);

3) Polling data from the 2012 US presidential election (2012-polls.csv);

4) Election results from the 2012 US presidential election (2012-results.csv).

Data sets 1) and 2) will be used for training. Data set 3) will be used for
prediction. Data set 4) is provided for validation, which helps us check
whether our predictions are correct.

We will first pre-process these data sets for the purpose of performing logistic
regression. As a first step, use the following commands to read the data sets
“2008-polls.csv”, “2012-polls.csv” and “2008-results.csv” into R.


setwd("…") ## Change to the directory where you saved the data sets
polls2008<-read.csv(file="2008-polls.csv",header=TRUE)
polls2012<-read.csv(file="2012-polls.csv",header=TRUE)
results2008<-read.csv(file="2008-results.csv",header=TRUE)

To simplify our data analysis, let us focus on subsets of the available data sets.
We select the subsets based on pollsters because not all pollsters conducted
polls in every state. We keep the pollsters that conducted at least five polls in
each of the 2008 and 2012 polling data sets 1) and 3), using the following R code.

## Pollsters with at least five polls in each year
pollsters20085<-table(polls2008$Pollster)[table(polls2008$Pollster)>=5]
pollsters20125<-table(polls2012$Pollster)[table(polls2012$Pollster)>=5]
## Pollsters that meet the threshold in both years
subset1<-names(pollsters20085)[names(pollsters20085)%in%names(pollsters20125)]
pollers<-names(pollsters20125)[names(pollsters20125)%in%subset1]
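The selection rule can be checked on a small made-up example. The pollster names P1, P2 and P3 and their poll counts are hypothetical, and `intersect` is used here as an equivalent shortcut for the two `%in%` steps.

```r
# Hypothetical poll lists: one row per poll, made-up pollster names.
toy2008 <- data.frame(Pollster = rep(c("P1", "P2", "P3"), times = c(6, 5, 2)))
toy2012 <- data.frame(Pollster = rep(c("P1", "P2", "P3"), times = c(7, 3, 9)))

# Number of polls per pollster in each year.
count2008 <- table(toy2008$Pollster)
count2012 <- table(toy2012$Pollster)

# Keep pollsters with at least five polls in a year, then intersect the years.
keep2008 <- names(count2008)[count2008 >= 5]
keep2012 <- names(count2012)[count2012 >= 5]
common   <- intersect(keep2008, keep2012)
```

Only P1 reaches five polls in both years, so it is the only pollster kept.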

Then, we create the subsets of the 2008 and 2012 data sets that were collected by
the selected pollsters using the following R code:

## Column 5 of the polling data is used as the pollster name
subsamplesID2008<-polls2008[,5]%in%pollers
polls2008sub<-polls2008[subsamplesID2008,]
subsamplesID2012<-polls2012[,5]%in%pollers
polls2012sub<-polls2012[subsamplesID2012,]

To build a predictive model using logistic regression, we create a response
variable and predictors. First, we define a binary response variable (Resp), which
indicates whether the prediction given by a poll is correct: Resp is 1 if the
prediction is correct and 0 otherwise. To check whether the prediction given by
each poll is correct, first find the predicted winner for each state, and then
compare it with the actual winner in the data set “2008-results.csv”. Second,
define the state edges according to the definition given above. Finally, compute
the lag time, that is, the number of days between the sampling time (polling
date) and the 2008 presidential election date, Nov 4, 2008. The following R code
does this.

## Actual 2008 winner per state: 1 if column 2 exceeds column 3, else 0
winers2008<-(results2008[,2]-results2008[,3]>0)+0
StateID2008<-results2008[,1]
## Winner predicted by each poll, in the row order of polls2008sub
pollwiners2008<-(polls2008sub[,2]-polls2008sub[,3]>0)+0
## Look up each poll's state in the results so that the responses stay aligned
## row by row with margins and lagtime, even if the polls are not sorted by state
stateindex<-match(as.character(polls2008sub$State),as.character(StateID2008))
Allresponses<-(pollwiners2008==winers2008[stateindex])+0
## State edge (absolute poll margin)
margins<-abs(polls2008sub[,2]-polls2008sub[,3])
## Lag time in days; parsing "%b" month names depends on the locale
electiondate2008<-as.Date("Nov 04 2008", format="%b %d %Y")
lagtime<-as.numeric(electiondate2008-as.Date(as.character(polls2008sub[,4]), format="%b %d %Y"))
## A data frame keeps numeric columns numeric (cbind would coerce all to character)
dataset2008<-data.frame(Resp=Allresponses,State=as.character(polls2008sub[,1]),margins=margins,lagtime=lagtime,Pollster=as.character(polls2008sub[,5]))
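The construction of the response, the state edge and the lag time can be traced on a tiny made-up example. The states "AA" and "BB", the column names and all numbers are hypothetical. ISO dates are used to avoid locale-dependent month names, whereas the lab data use the "%b %d %Y" format.

```r
# Hypothetical results and polls; not taken from the lab data sets.
toy_results <- data.frame(State = c("AA", "BB"), Dem = c(54, 44), Rep = c(45, 55))
toy_polls <- data.frame(
  State = c("BB", "AA", "AA"),
  Dem   = c(48, 50, 46),
  Rep   = c(46, 45, 49),
  Date  = c("2008-10-05", "2008-10-30", "2008-09-01")
)

# Actual and poll-predicted winners: 1 = Dem ahead, 0 = Rep ahead.
true_winner <- (toy_results$Dem > toy_results$Rep) + 0
poll_winner <- (toy_polls$Dem > toy_polls$Rep) + 0

# Align each poll with its own state's result before comparing.
idx  <- match(toy_polls$State, toy_results$State)
Resp <- (poll_winner == true_winner[idx]) + 0

# State edge in percentage points and lag time in days.
edge <- abs(toy_polls$Dem - toy_polls$Rep)
lag  <- as.numeric(as.Date("2008-11-04") - as.Date(toy_polls$Date))
```

Only the second poll calls its state correctly, so `Resp` is 0, 1, 0 for the three polls.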

Q1. Fit a logistic regression model using the data set “2008-polls-subset.csv”. In
the model, use Resp as the binary response variable (target variable), pollsters
as categorical predictors, and lag time, the square of lag time and state edges as
continuous predictors. Based on the fitted model, what is the probability of
making a correct prediction for a poll conducted by SurveyUSA exactly 5 days
before the election with a state edge of 10 percentage points?
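The general shape of such a fit can be sketched in R on simulated data. Everything here is an assumption made for illustration: the pollster names, the simulated coefficients and the column names (Resp, Pollster, lagtime, margins) are hypothetical, and the numbers it produces say nothing about the real polls.

```r
set.seed(1)
n <- 300

# Simulated stand-in for the 2008 subset; pollster names are hypothetical.
sim <- data.frame(
  Pollster = factor(sample(c("PollsterA", "PollsterB", "PollsterC"), n, replace = TRUE)),
  lagtime  = sample(1:120, n, replace = TRUE),
  margins  = runif(n, 0, 20)  # state edge in percentage points
)
eta <- -0.5 + 0.25 * sim$margins - 0.01 * sim$lagtime
sim$Resp <- rbinom(n, 1, plogis(eta))

# Logistic regression with a categorical pollster effect and a quadratic lag term.
fit <- glm(Resp ~ Pollster + lagtime + I(lagtime^2) + margins,
           family = binomial, data = sim)

# Predicted probability of a correct call for one hypothetical poll.
newpoll <- data.frame(
  Pollster = factor("PollsterA", levels = levels(sim$Pollster)),
  lagtime  = 5,
  margins  = 10
)
p_correct <- predict(fit, newdata = newpoll, type = "response")
```

Using `type = "response"` returns a probability rather than a value on the log-odds scale.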

Q2. Is the model in Q1 reasonably good (or acceptable)? Please justify your
answer using the deviance and its corresponding p-value. Is the lag time
significantly associated with the probability that an election poll predicts the
result correctly?
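One possible way to obtain deviance-based p-values in R is sketched below on simulated data. Which deviance comparison the question intends is an assumption, so both a residual-deviance check and a likelihood-ratio test against the intercept-only model are shown. The variables x and y are hypothetical.

```r
set.seed(2)
n <- 200
x <- runif(n, 0, 20)
y <- rbinom(n, 1, plogis(-1 + 0.2 * x))
fit <- glm(y ~ x, family = binomial)

# Residual deviance compared with a chi-square on the residual degrees of freedom.
p_gof <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)

# Likelihood-ratio test: drop in deviance relative to the intercept-only model.
lr_stat <- fit$null.deviance - deviance(fit)
lr_df   <- fit$df.null - df.residual(fit)
p_lr    <- pchisq(lr_stat, lr_df, lower.tail = FALSE)

# Wald z-tests for the individual coefficients.
wald <- coef(summary(fit))
```

The chi-square approximation for the residual deviance can be poor for ungrouped binary responses, so that p-value is best read with caution.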


Q3. Consider a logistic regression with Resp as the binary response variable
(target variable) and lag time, the square of lag time and state edges as
continuous predictors. Write down the separating hyperplane for classifying the
correct and wrong predictions (defined by the target variable Resp) using the
feature vector containing lag time, the square of lag time and state edges. Using
the state edges as the y-axis and the lag time as the x-axis, draw the separating
curve for the classification.
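As a sketch of the general form, write t for the lag time and e for the state edge; the symbols and the 0.5 classification cutoff are assumptions introduced here, not part of the lab text. The boundary is where the fitted probability equals 0.5, that is, where the linear predictor is zero:

```latex
\log\frac{P(\mathrm{Resp}=1)}{1-P(\mathrm{Resp}=1)}
  = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 e ,
\qquad
\text{boundary: } \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 e = 0
\;\Longrightarrow\;
e = -\,\frac{\beta_0 + \beta_1 t + \beta_2 t^2}{\beta_3}
\quad (\beta_3 \neq 0).
```

The boundary is a hyperplane in the three features (t, t^2, e), and it appears as a parabola when drawn with t on the x-axis and e on the y-axis.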

For the purpose of prediction/classification, we need to define new variables,
the state edges and the lag time, for the 2012 election poll data set. These
variables are defined in the same way as described above. For computing the lag
time, note that the 2012 presidential election date is Nov 6, 2012. The following
R code preprocesses the 2012 data set for prediction:

## Winner predicted by each 2012 poll: 1 if column 2 exceeds column 3, else 0
pollwiners2012<-(polls2012sub[,2]-polls2012sub[,3]>0)+0
## State edge (absolute poll margin)
margins2012<-abs(polls2012sub[,2]-polls2012sub[,3])
## Lag time in days; parsing "%b" month names depends on the locale
electiondate2012<-as.Date("Nov 06 2012", format="%b %d %Y")
lagtime2012<-as.numeric(electiondate2012-as.Date(as.character(polls2012sub[,4]), format="%b %d %Y"))
## A data frame keeps numeric columns numeric (cbind would coerce all to character)
dataset2012<-data.frame(pollwiners2012,State=as.character(polls2012sub[,1]),margins=margins2012,lagtime=lagtime2012,Pollster=as.character(polls2012sub[,5]))

Q4. Based on the logistic regression model fitted in Q3, predict the probability
of making a correct prediction using the 2012 election poll data set. Please
report the predicted probabilities for all the 2012 election polls from Florida (FL).
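Applying a fitted model to the polls of a single state can be sketched as follows. The training data are simulated and the three new polls are made up, so the column names and values are assumptions. The point is only that the new data frame must carry the same predictor names as the model.

```r
set.seed(3)

# Simulated training data standing in for the 2008 subset.
train <- data.frame(lagtime = sample(1:100, 200, replace = TRUE),
                    margins = runif(200, 0, 20))
train$Resp <- rbinom(200, 1,
                     plogis(-0.5 + 0.2 * train$margins - 0.005 * train$lagtime))
fit <- glm(Resp ~ lagtime + I(lagtime^2) + margins,
           family = binomial, data = train)

# Hypothetical new polls; keep only the rows for one state before predicting.
new <- data.frame(State   = c("FL", "OH", "FL"),
                  lagtime = c(3, 10, 40),
                  margins = c(1, 4, 2.5))
fl <- new[new$State == "FL", ]
fl$p_correct <- predict(fit, newdata = fl, type = "response")
```

Each row of `fl` then carries the estimated probability that the corresponding poll calls the winner correctly.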

Q5. In this question, we will predict the winner of Florida using the predictions
given in Q4.

To this end, define the winner indicator as 1 (WIND=1) if the Democratic
candidate is the winner; otherwise define it as 0. Based on Q4, we obtained the
predicted probability that a poll made a correct prediction of the winner (i.e.
