Statistics homework help.  
The data in the accompanying file Airline Data.xlsx was assembled by Professor Robert Windle of the Smith School with assistance from Oliver Yao. You may be familiar with this data from earlier classes! The file contains information on 638 air routes in the United States. A route refers to a pair of airports. Note that some cities are served by more than one airport. In such cases, the airports are distinguished by their 3-letter code. The data was collected for the third quarter of 1996 (3Q96). The variables in the data set are:

  1. S_CODE: starting airport’s code
  2. S_CITY: starting city
  3. E_CODE: ending airport’s code
  4. E_CITY: ending city
  5. COUPON: average number of coupons (a one-coupon flight is a non-stop flight, a two-coupon flight is a one-stop flight, etc.) for that route
  6. NEW: number of new carriers entering that route between Q3-96 and Q2-97
  7. VACATION: whether a vacation route (Yes) or not (No); Florida and Las Vegas routes are generally considered vacation routes
  8. SW: whether Southwest Airlines serves that route (Yes) or not (No)
  9. HI: Herfindahl Index – airlines use this as a measure of market concentration
  10. S_INCOME: starting city’s average personal income
  11. E_INCOME: ending city’s average personal income
  12. S_POP: starting city’s population
  13. E_POP: ending city’s population
  14. SLOT: whether either endpoint airport is slot controlled or not; this is a measure of airport congestion
  15. GATE: whether either endpoint airport has gate constraints or not; this is another measure of airport congestion
  16. DISTANCE: distance between two endpoint airports in miles
  17. PAX: number of passengers on that route during period of data collection
  18. FARE: average fare on that route

The Assignment
The goal is to predict FARE as a function of the other variables. Please answer all questions. Supply supporting documentation and show calculations as needed (for example, for the RMSE you may want to include a picture of the error measures from the Excel output). Please submit a single well-formatted PDF or Word file. The instructor should not need to go searching for your answers! You should also upload an Excel file as supporting information.
Note that the detailed instructions refer to Analytical Solver Data Mining (ASDM); you are, however, free to use any other software.

  a) Data Exploration & Visualization
    i. Using the graphical capabilities of ASDM (or the software of your choice), provide a single plot that captures some aspects of the data. Include the plot as a clearly marked Exhibit.

    ii. What do you observe from the plot? How could your observation influence your regression model (or why would it not)?
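Before settling on the single plot, it can help to check numerically which numeric variables move most closely with FARE. The following is a minimal sketch in plain Python, assuming the spreadsheet has been exported to a list of dictionaries keyed by column name. The helper names are hypothetical, and the four rows are made-up placeholders, not values from Airline Data.xlsx.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_by_correlation(rows, target, candidates):
    """Sort candidate columns by absolute correlation with the target column."""
    target_values = [row[target] for row in rows]
    scores = {
        name: pearson([row[name] for row in rows], target_values)
        for name in candidates
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Made-up placeholder rows (NOT from the real data set), for illustration only.
rows = [
    {"DISTANCE": 300, "COUPON": 1.0, "FARE": 80},
    {"DISTANCE": 900, "COUPON": 1.3, "FARE": 150},
    {"DISTANCE": 1500, "COUPON": 1.1, "FARE": 220},
    {"DISTANCE": 2100, "COUPON": 1.4, "FARE": 290},
]
ranking = rank_by_correlation(rows, "FARE", ["DISTANCE", "COUPON"])
```

The variable at the top of the ranking is a natural candidate for a scatter plot against FARE. A correlation only captures linear association, so the plot itself is still needed to spot curvature or outliers.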


  b) Fitting a linear regression model
    i. Using the data analysis menu, create dummy variables for the variables VACATION, SW, GATE, and SLOT (select “Transform” – “Transform categorical data …” – “Create Dummies”).

Using the resulting new data set, randomly partition the data into 70% training and 30% validation (select “Partition” – “Standard Partition”).
Run a multivariable regression (select “Predict” – “Linear Regression”), with all numerical variables and the appropriate dummies as independent variables. Provide a summary of the model (that includes the values of the regression coefficients) or otherwise include it as a clearly marked Exhibit.
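For readers working outside ASDM, the dummy-coding and partitioning steps can be sketched in plain Python. This is only an illustration under assumptions: the helper names, the dummy column names (such as `GATE_Constrained`), the category label `Controlled` for SLOT, and the seed are all hypothetical, and the rows are made-up placeholders rather than real routes.

```python
import random

def add_dummy(rows, column, yes_value, new_name):
    """Return copies of rows with a 0/1 indicator for column == yes_value."""
    return [{**row, new_name: 1 if row[column] == yes_value else 0} for row in rows]

def partition(rows, train_fraction=0.7, seed=12345):
    """Randomly split rows into (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Made-up placeholder rows (NOT from the real data set).
rows = [
    {"id": i,
     "VACATION": "Yes" if i % 2 else "No",
     "SW": "Yes" if i % 3 == 0 else "No",
     "SLOT": "Controlled" if i % 4 == 0 else "Free",      # "Controlled" is an assumed label
     "GATE": "Constrained" if i % 5 == 0 else "Free"}
    for i in range(10)
]

# One dummy per two-level variable; the omitted level is the reference category.
for column, yes_value, new_name in [
    ("VACATION", "Yes", "VACATION_Yes"),
    ("SW", "Yes", "SW_Yes"),
    ("SLOT", "Controlled", "SLOT_Controlled"),
    ("GATE", "Constrained", "GATE_Constrained"),
]:
    rows = add_dummy(rows, column, yes_value, new_name)

training, validation = partition(rows)
```

Only one dummy per two-level variable goes into the regression. Including both levels alongside an intercept would make the columns perfectly collinear.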

    ii. What is the resulting RMSE on the training data?

    iii. On the validation data?

    iv. From your model, how would you quantify the effects of GATE on the predicted FARE? Please be precise in your interpretation, thinking back to your earlier data analysis class.


    v. What is the predicted fare of a leg that has COUPON = 1, NEW = 3, VACATION = No, SW = No, HI = 6000, S_INCOME = $25,000, E_INCOME = $30,000, S_POP = 4,000,000, E_POP = 7,150,000, SLOT = Free, GATE = Constrained, DISTANCE = 1000, and PAX = 6000?
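The last two questions both come down to plugging values into the fitted equation. The sketch below shows the arithmetic only. It assumes dummy columns named `VACATION_Yes`, `SW_Yes`, `SLOT_Controlled`, and `GATE_Constrained` (hypothetical names; your software may label them differently or pick a different reference category), and it deliberately contains no coefficients, since those must come from your own regression output.

```python
def predict_fare(coefficients, intercept, route):
    """Linear prediction: intercept + sum of coefficient * value over all predictors."""
    return intercept + sum(coefficients[name] * route[name] for name in coefficients)

def dummy_effect(coefficients, intercept, route, dummy_name):
    """Difference in predicted FARE between dummy = 1 and dummy = 0, all else equal."""
    with_flag = {**route, dummy_name: 1}
    without_flag = {**route, dummy_name: 0}
    return (predict_fare(coefficients, intercept, with_flag)
            - predict_fare(coefficients, intercept, without_flag))

# The route described in the question, with categorical values encoded as 0/1.
route = {
    "COUPON": 1, "NEW": 3, "HI": 6000,
    "S_INCOME": 25000, "E_INCOME": 30000,
    "S_POP": 4_000_000, "E_POP": 7_150_000,
    "DISTANCE": 1000, "PAX": 6000,
    "VACATION_Yes": 0,      # VACATION = No
    "SW_Yes": 0,            # SW = No
    "SLOT_Controlled": 0,   # SLOT = Free
    "GATE_Constrained": 1,  # GATE = Constrained
}
```

In a model without interaction terms, `dummy_effect` for the GATE dummy simply returns its coefficient: the change in predicted fare for a gate-constrained route relative to an unconstrained one, holding every other variable fixed.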
  c) Variable Selection

Experiment with variable selection methods (feel free to take advantage of the how-to tutorials in the resource center that will take you through the key steps). Note: You may want to change the FIN and FOUT settings in order to view more model choices. Set FOUT higher and FIN lower as needed.
You may want to refer to pages 143-152 in the book, which apply variable selection to the Toyota example and highlight how to interpret model measures such as Adjusted R² and Cp.
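As a reminder of what the two measures compute, here is a short sketch using the standard textbook definitions. The function names are hypothetical, and `p` counts predictors without the intercept; check how your software counts parameters before comparing numbers.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (intercept not counted) fit on n rows."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows' Cp for a subset model with p predictors plus an intercept.

    sse_subset: residual sum of squares of the subset model
    mse_full:   residual mean square of the full model (all candidate predictors)
    """
    return sse_subset / mse_full - n + 2 * (p + 1)
```

Under this convention a subset model with little bias has Cp close to p + 1, and the full model has Cp exactly equal to its own p + 1. Higher Adjusted R² is better, so the two measures are usually read together when comparing candidate subsets.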

    i. From your experiments, pick a model to run as your final regression model.
    ii. Provide a summary of the model or otherwise include it as a clearly marked Exhibit.

    iii. Why did you select this particular model? Please provide quantitative reasoning.

    iv. What is the resulting RMSE on the training data?

    v. On the validation data?
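If you want to verify the RMSE reported by your software, it can be recomputed from the actual and predicted fares of each partition. A minimal sketch follows; the function name is hypothetical, and it divides by the number of rows, whereas some packages report a residual standard error that divides by the degrees of freedom, so small differences are possible.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Call it once with the training rows and once with the validation rows. A validation RMSE much larger than the training RMSE is a sign of overfitting.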


  d) Adding an interaction term

A senior consultant in the airline industry has indicated that the presence of Southwest on vacation routes has been driving prices down significantly on these legs, more so than on other routes. Add this domain knowledge to your regression model from c) by creating an interaction variable (refer back to your notes on interaction variables from your data analysis class) and rerunning the linear regression model.
Note: You need to go back to your “Encoding” worksheet and manually add a column that is SW_yes * Vacation_Yes (in the Z-column). You then need to repartition your data, and make sure to expand the data selection to include the new variable in the Z-column.

    i. Did your error measures improve?

    ii. How would you quantify the effect of SW on the fare on vacation routes vs. non-vacation routes (using your model)? Does the data support the consultant’s claim? HINT: Carefully think about which variables determine the fare on each type of route (you have both vacation and non-vacation routes, and for each type, some routes are served by SW and some are not).
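The interaction column and the way it enters the SW effect can be sketched as follows. This assumes 0/1 dummy columns named `SW_Yes` and `VACATION_Yes` and an interaction column named `SW_x_VACATION`; all three names and both helper functions are hypothetical, and the coefficients must come from your own fitted model.

```python
def add_interaction(rows, a="SW_Yes", b="VACATION_Yes", new_name="SW_x_VACATION"):
    """Return copies of rows with a new column equal to the product of two 0/1 dummies."""
    return [{**row, new_name: row[a] * row[b]} for row in rows]

def sw_effect(coefficients, vacation):
    """Change in predicted FARE when SW goes from 0 to 1, all else fixed.

    vacation: 1 for a vacation route, 0 for a non-vacation route.
    """
    return coefficients["SW_Yes"] + coefficients["SW_x_VACATION"] * vacation

# Made-up placeholder rows (NOT from the real data set), one per SW/VACATION combination.
rows = add_interaction([
    {"SW_Yes": 1, "VACATION_Yes": 1},
    {"SW_Yes": 1, "VACATION_Yes": 0},
    {"SW_Yes": 0, "VACATION_Yes": 1},
    {"SW_Yes": 0, "VACATION_Yes": 0},
])
```

On non-vacation routes the interaction column is always 0, so the SW effect is the SW coefficient alone. On vacation routes it is the SW coefficient plus the interaction coefficient, so the sign and significance of the interaction coefficient are what speak to the consultant's claim.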

