The data in the accompanying file Airline Data.xlsx was assembled by Professor Robert Windle of the Smith School with assistance from Oliver Yao. You may be familiar with this data from earlier classes! The file contains information on 638 air routes in the United States. A route refers to a pair of airports. Note that some cities are served by more than one airport. In such cases, the airports are distinguished by their 3-letter code. The data was collected for the third quarter of 1996 (3Q96). The variables in the data set are:

1. S_CODE: starting airport’s code
2. S_CITY: starting city
3. E_CODE: ending airport’s code
4. E_CITY: ending city
5. COUPON: average number of coupons (a one-coupon flight is a non-stop flight, a two-coupon flight is a one-stop flight, etc.) for that route
6. NEW: number of new carriers entering that route between Q3-96 and Q2-97
7. VACATION: whether a vacation route (Yes) or not (No); Florida and Las Vegas routes are generally considered vacation routes
8. SW: whether Southwest Airlines serves that route (Yes) or not (No)
9. HI: Herfindahl Index – airlines use this as a measure of market concentration
10. S_INCOME: starting city’s average personal income
11. E_INCOME: ending city’s average personal income
12. S_POP: starting city’s population
13. E_POP: ending city’s population
14. SLOT: whether either endpoint airport is slot controlled or not; this is a measure of airport congestion
15. GATE: whether either endpoint airport has gate constraints or not; this is another measure of airport congestion
16. DISTANCE: distance between two endpoint airports in miles
17. PAX: number of passengers on that route during period of data collection
18. FARE: average fare on that route
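
To make the HI variable concrete: the Herfindahl index is conventionally the sum of squared market shares. With shares expressed in percent it reaches 10,000 for a single-carrier route, which appears to be the scale used here (the prediction question below uses HI = 6000). The following is a minimal sketch with hypothetical carrier shares; the data set supplies HI directly, so nothing like this needs to be computed for the assignment.

```python
def herfindahl_index(shares_percent):
    """Sum of squared market shares, with shares given in percent (0-100)."""
    total = sum(shares_percent)
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"shares should sum to 100, got {total}")
    return sum(s ** 2 for s in shares_percent)

# Hypothetical shares, not taken from the data set.
monopoly = herfindahl_index([100])       # one carrier holds the whole route
duopoly = herfindahl_index([50, 50])     # two equal carriers
mixed = herfindahl_index([70, 20, 10])   # one dominant carrier
print(monopoly, duopoly, mixed)
```

A higher value means the route's traffic is concentrated in fewer carriers.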

The Assignment
The goal is to predict FARE as a function of the other variables. Please answer all questions. Supply supporting documentation and show calculations as needed (for example, for the RMSE you may want to include a picture of the error measures from the Excel output). Please submit a single well-formatted PDF or Word file. The instructor should not need to go searching for your answers! You should also upload an Excel file as supporting information.
Note that the detailed instructions refer to Analytical Solver Data Mining (ASDM); you are, however, free to use any other software.

1. Data Exploration & Visualization
   1. Using the graphical capabilities of ASDM (or the software of your choice), provide a single plot that captures some aspects of the data. Include the plot as a clearly marked Exhibit.

   2. What do you observe from the plot? How could your observation influence your regression model (or why would it not)?

2. Fitting a linear regression model
   1. Using the data analysis menu, create dummy variables for the variables VACATION, SW, GATE, and SLOT (select “Transform” – “Transform categorical data …” – “Create Dummies”).
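
For readers working outside ASDM, the dummy-coding step can be sketched in plain Python. This is an illustration under assumptions, not ASDM's procedure: the helper `add_dummies`, the column naming pattern, and the sample rows are all made up. One level per variable is left out as the reference category, because including a dummy for every level alongside an intercept makes the predictors perfectly collinear.

```python
def add_dummies(rows, column, base_level):
    """Return copies of rows with `column` replaced by 0/1 indicator columns.

    One indicator is created per level except `base_level`, which becomes
    the reference category (all of its indicators are 0).
    """
    levels = sorted({row[column] for row in rows} - {base_level})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

# Made-up rows for illustration only; the real file has 638 routes.
routes = [
    {"VACATION": "No", "SW": "Yes", "DISTANCE": 312},
    {"VACATION": "Yes", "SW": "No", "DISTANCE": 1976},
]
for col in ("VACATION", "SW"):
    routes = add_dummies(routes, col, base_level="No")
print(routes)
```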

Using the resulting new data set, randomly partition the data into 70% training and 30% validation (select “Partition” – “Standard Partition”).
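
The 70/30 random partition can be sketched as a seeded shuffle followed by a cut. The function name, the seed value, and the integer stand-ins for rows are assumptions for illustration; a fixed seed simply makes the split reproducible so the reported RMSE values can be regenerated.

```python
import random

def partition(rows, train_fraction=0.7, seed=12345):
    """Shuffle row indices with a fixed seed and split into train/validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    valid = [rows[i] for i in indices[cut:]]
    return train, valid

rows = list(range(638))   # stand-ins for the 638 routes
train, valid = partition(rows)
print(len(train), len(valid))
```

Every row lands in exactly one of the two sets, so the validation set plays no part in fitting the model.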

Run a multivariable regression (select “Predict” – “Linear Regression”), with all numerical variables and the appropriate dummies as independent variables. Provide a summary of the model (that includes the values of the regression coefficients) or otherwise include it as a clearly marked Exhibit.
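
The error measure requested in this assignment, RMSE, is the square root of the average squared difference between actual and predicted fares. A minimal sketch, assuming the average is taken over all rows of the partition (some tools divide by the residual degrees of freedom on the training data instead); the fares below are made up:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Made-up fares and predictions, for illustration only.
actual_fares = [100.0, 150.0, 200.0]
predicted_fares = [110.0, 140.0, 200.0]
print(rmse(actual_fares, predicted_fares))
```

Computing it once on the training rows and once on the validation rows shows how well the model generalizes: a validation RMSE far above the training RMSE suggests overfitting.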

   2. What is the resulting RMSE on the training data?

   3. What is the resulting RMSE on the validation data?

   4. From your model, how would you quantify the effect of GATE on the predicted FARE? Please be precise in your interpretation, thinking back to your earlier data analysis class.

   5. What is the predicted fare of a leg that has COUPON = 1, NEW = 3, VACATION = No, SW = No, HI = 6000, S_INCOME = $25,000, E_INCOME = $30,000, S_POP = 4,000,000, E_POP = 7,150,000, SLOT = Free, GATE = Constrained, DISTANCE = 1000, and PAX = 6000?
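
Mechanically, a prediction from a linear regression is the intercept plus the sum of each coefficient times the corresponding input value. The sketch below encodes the route from the last question; the function name and the dummy column names (for example `GATE_Constrained`) are assumptions and should be matched to whatever columns your software actually created. No coefficients are shown because they must come from your own fitted model.

```python
def predict_fare(intercept, coefficients, features):
    """Linear prediction: intercept + sum(coefficient * feature value)."""
    missing = set(coefficients) - set(features)
    if missing:
        raise KeyError(f"features missing values for: {sorted(missing)}")
    return intercept + sum(b * features[name] for name, b in coefficients.items())

# The route from the question, with categorical values coded as 0/1.
# The dummy names are hypothetical; rename them to match your output.
leg = {
    "COUPON": 1, "NEW": 3, "HI": 6000,
    "S_INCOME": 25000, "E_INCOME": 30000,
    "S_POP": 4_000_000, "E_POP": 7_150_000,
    "DISTANCE": 1000, "PAX": 6000,
    "VACATION_Yes": 0, "SW_Yes": 0,
    "SLOT_Controlled": 0, "GATE_Constrained": 1,
}
```

Calling `predict_fare(b0, coefs, leg)` with the intercept `b0` and a dictionary `coefs` of fitted coefficients returns the predicted fare. Because the model is linear, switching `GATE_Constrained` from 0 to 1 while leaving every other input unchanged moves the prediction by exactly that dummy's coefficient, which is the basis for interpreting the GATE effect.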
3. Variable Selection

Experiment with variable selection methods (feel free to take advantage of the how-to tutorials in the resource center that will take you through the key steps). Note: You may want to change the FIN and FOUT settings in order to view more model choices. Set FOUT higher and FIN lower as needed.

You may want to refer to pages 143-152 in the book, which apply variable selection to the Toyota example and highlight how to interpret model measures such as adjusted R² and Mallows's Cp.
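
The two model measures mentioned above can be computed from quantities most regression outputs report. This is a sketch using the standard textbook formulas, where p counts predictors excluding the intercept and n is the number of training rows; the function names and the numbers in the example calls are made up.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (plus intercept) and n rows."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def mallows_cp(sse_subset, sigma2_full, n, p):
    """Mallows's Cp = SSE_p / sigma^2_full - n + 2(p + 1).

    sigma2_full is the error variance estimated from the full model; a subset
    model with little bias has Cp close to p + 1.
    """
    return sse_subset / sigma2_full - n + 2 * (p + 1)

# Made-up numbers for illustration only.
print(adjusted_r2(r2=0.80, n=100, p=5))
print(mallows_cp(sse_subset=950.0, sigma2_full=10.0, n=100, p=5))
```

Unlike plain R², adjusted R² can fall when a weak predictor is added, which is what makes it usable for comparing models of different sizes.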

   1. From your experiments, pick a model to run as your final regression model.
   2. Provide a summary of the model or otherwise include it as a clearly marked Exhibit.

   3. Why did you select this particular model? Please provide quantitative reasoning.

   4. What is the resulting RMSE on the training data?

   5. What is the resulting RMSE on the validation data?