Why does Stata drop variables in regression
Readers looking for a general introduction to multiple regression should refer to the appropriate examples in SAGE Research Methods. This example focuses specifically on including dummy variables among the independent variables in a multiple regression model. When one or more of the independent variables is a categorical variable, the most common method of properly including them in the model is to code them as dummy variables.

Dummy variables are dichotomous variables coded as 1 to indicate the presence of some attribute and as 0 to indicate the absence of that attribute. The sample dataset consists of survey respondents, with weight recorded in pounds and height in inches; the average height is nearly 67 inches.
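As a minimal sketch of this coding, suppose the dataset has a categorical variable sex coded 1 for male and 2 for female (the variable name and coding here are our assumptions, not from the original):

    * create a dummy: 1 = female, 0 = male; missing stays missing
    generate female = (sex == 2) if !missing(sex)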

When conducting a multiple regression, it is often wise to examine the dependent variable in isolation first. Summary statistics for all variables can be compiled using the summarize command, followed by a list of the variables of interest.

In the case of the dependent variable, enter the summarize command in the Stata Command window and press Enter to produce summary statistics detailing the number of observations, mean, standard deviation, and range. Next, we present a histogram of weight; both commands are sketched below.
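A sketch of the commands described in this and the following paragraph, assuming the weight variable is simply named weight in the dataset:

    summarize weight              // n, mean, sd, min, max
    histogram weight              // density histogram (the default)
    histogram weight, frequency   // frequency histogram instead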

This is done in Stata with the histogram command sketched above; press Enter to produce the histogram. By default, Stata will produce a density histogram; to plot frequencies instead, add the frequency option, as also shown in the sketch.

Such a model assumed that the slope was the same for the two groups. Perhaps the slope is different for these groups. Note that the slope of the regression line looks much steeper for the year round schools than for the non-year round schools.

This is confirmed by the regression equations, which show a steeper slope for the year round schools. Indeed, the yrXsome interaction effect is significant. We can make a graph showing the regression lines for the two types of schools to illustrate how different they are.
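A minimal sketch of such an interaction model, assuming the outcome is api00, the year-round indicator is yr_rnd, and the continuous predictor is some_col (these variable names are assumptions, inferred from the yrXsome name used in the text):

    * interaction of year-round status with the continuous predictor
    generate yrXsome = yr_rnd*some_col
    regress api00 yr_rnd some_col yrXsome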

We first create the predicted value, which we call yhata. Then we create separate variables for the two types of schools: yhata0 for non-year round schools and yhata1 for year round schools.
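A sketch of those steps, under the same variable-name assumptions as above:

    predict yhata                              // fitted values from the model
    generate yhata0 = yhata if yr_rnd == 0     // non-year round schools
    generate yhata1 = yhata if yr_rnd == 1     // year round schools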

You can see how the two lines have quite different slopes, consistent with the fact that the yrXsome interaction was significant. The options for making dashed and dotted lines are new in Stata 7, and you can find more information via help grsym. The graph above used the same kind of dots for the data points of both types of schools.

We can then make the same graph as above, except showing the points differently for the two types of schools: below we use small circles for the non-year round schools and triangles for the year round schools (a sketch of such a graph command follows).

Because non-year round schools are the reference group, the interaction coefficient tests whether the two slopes differ. In this case, the difference is significant, indicating that the regression lines are significantly different. So, if we look at the graph of the two regression lines, we can see the difference in their slopes (see the graph below).
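The original used Stata 7 graph syntax; a roughly equivalent sketch in modern twoway syntax, with the same assumed variable names, might be:

    twoway (scatter api00 some_col if yr_rnd == 0, msymbol(smcircle))  ///
           (scatter api00 some_col if yr_rnd == 1, msymbol(triangle))  ///
           (line yhata0 some_col, sort)                                ///
           (line yhata1 some_col, sort lpattern(dash))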

Indeed, we can see that the non-year round schools (the solid line) have the smaller slope; the difference between these slopes is the yrXsome coefficient. We can use the xi command for doing this kind of analysis as well.

We can run the same model as above using the xi command; you can compare the results to those above and see that we get exactly the same results. The i. prefix tells xi to create the indicator variables (and, combined with *, the interaction terms) for us. As we did above, we can create predicted values and graphs showing the regression lines for the two types of schools; we omit showing these commands. We can also run the same model using the anova command and, as illustrated above, compute the predicted values using the predict command and graph the separate regression lines.
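A sketch of the xi and anova versions, again under the variable-name assumptions above (the anova call uses the Stata 7-era continuous() option; newer Stata would instead write c.some_col with # notation):

    xi: regress api00 i.yr_rnd*some_col
    anova api00 yr_rnd some_col yr_rnd*some_col, continuous(some_col)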

These commands are omitted. In general, this type of analysis allows you to test whether the strength of the relationship between two continuous variables varies based on the level of a categorical variable. The following examples extend this further by using a categorical variable with three levels, mealcat.

To get an overall test of this interaction, we can use the test command. These results indicate that the overall interaction is indeed significant. This means that the regression lines from the 3 groups differ significantly.
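For example, if the two interaction terms had been created manually under the hypothetical names mealXsome2 and mealXsome3, the overall test would be:

    test mealXsome2 mealXsome3    // joint test of the interaction terms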

Since we had three groups, we get three regression lines, one for each category of mealcat. The solid line is for group 1, the dashed line for group 2, and the dotted line for group 3. One of the lines has a downward slope, while the line for group 2 slopes upward, so the slopes clearly differ across the three groups.

This coefficient represents the coefficient for group 1, so it tested whether the coefficient for group 1 differs from zero, which is probably not an interesting test. Comparisons between successive groups (for example, group 1 versus group 2, and group 2 versus group 3) seem much more interesting.

We can do this by making group 2 the omitted group, so that each other group is compared to group 2. As we have done before, we will use the char command to indicate that we want group 2 to be the omitted category and then rerun the regression. The resulting estimates make sense given the graph and the coefficient estimates we already have. We can perform the same analysis using the anova command, as shown below. The anova command gives us somewhat less flexibility, since we cannot choose which group is the omitted group.
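A sketch of choosing the omitted category with char and rerunning both models (variable names as assumed above; the anova syntax is again the Stata 7-era form):

    char mealcat[omit] 2                   // make group 2 the reference category for xi
    xi: regress api00 i.mealcat*some_col
    anova api00 mealcat some_col mealcat*some_col, continuous(some_col)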

Because the anova command omits the third category while the analysis above omitted the second, the parameter estimates will not be the same; you can compare the two sets of results to confirm this. Since group 3 is dropped, it is the reference category and all comparisons are made against group 3. This covered four techniques for analyzing data with categorical variables: (1) manually constructing indicator variables, (2) creating indicator variables using the xi command, (3) coding variables using xi3, and (4) using the anova command.

Each method has its advantages and disadvantages, as described below. Manually constructing indicator variables can be very tedious and even error prone. However, the advantage is that you can have quite a bit of control over how the variables are created and the terms that are entered into the model.

The xi command can really ease the creation of indicator variables, and it makes it easier to include interactions in your models by allowing interaction terms such as i.mealcat*some_col (using the variables from the example above).

The xi command also gives you the flexibility to decide which category is the omitted category, unlike the anova command. On the other hand, it can be easier to perform tests of simple main effects with the anova command.

However, the anova command is not flexible in letting you choose which category is omitted (the last category is always the omitted category). In such cases, the regress command offers features not available in the anova command and may be more advantageous to use.

We can also test sets of variables, using the test command, to see if the set of variables is jointly significant.

If you compare this output with the output from the last regression, you can see that the result of the F-test is consistent with what the regression reported. Note that you could get the same results if you typed the following, since Stata defaults to comparing the listed term(s) to 0.
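For instance, assuming the term being tested is the coefficient on ell (the variable choice is our assumption), the two equivalent forms would be:

    test ell = 0    // explicit comparison to zero
    test ell        // same result; Stata defaults to testing against 0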

Perhaps a more interesting test would be to see if the contribution of class size is significant. The F-test is significant, meaning that the collective contribution of these two variables is significant. Finally, as part of doing a multiple regression analysis, you might be interested in seeing the correlations among the variables in the regression model.
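Assuming the two class-size variables are named acs_k3 and acs_46 (as in the UCLA elemapi data this discussion appears to follow), the joint test would be:

    test acs_k3 acs_46    // joint F-test of the two class-size terms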

You can do this with the correlate command, as shown below. If we look at the correlations with api00, we see that meals and ell have the two strongest correlations with api00. These correlations are negative, meaning that as the value of one variable goes down, the value of the other variable tends to go up. Knowing that these variables are strongly associated with api00, we might predict that they would be statistically significant predictors in the regression model.
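A sketch, using a subset of the model variables (the exact variable list in the original is not shown):

    correlate api00 ell meals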

We can also use the pwcorr command to do pairwise correlations. The most important difference between correlate and pwcorr is the way in which missing data are handled. With correlate, an observation or case is dropped if any variable has a missing value; in other words, correlate uses listwise (also called casewise) deletion. With pwcorr, each correlation is computed from all observations that are non-missing for that pair of variables (pairwise deletion).

Two options that you can use with pwcorr, but not with correlate, are the sig option, which gives the significance levels of the correlations, and the obs option, which gives the number of observations used in each correlation. The obs option is not necessary with correlate, as Stata lists the number of observations at the top of its output.

Earlier we focused on screening your data for potential errors.
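A sketch with the same assumed variables:

    pwcorr api00 ell meals, obs sig   // show n and p-value for each pairwise correlation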

In the next chapter, we will focus on regression diagnostics to verify whether your data meet the assumptions of linear regression. Here, we will focus on the issue of normality.

Some researchers believe that linear regression requires that the outcome (dependent) variable and the predictor variables be normally distributed. We need to clarify this issue: in actuality, it is the residuals that need to be normally distributed. In fact, the residuals need to be normal only for the t-tests to be valid.

The estimation of the regression coefficients does not require normally distributed residuals. But since we are interested in having valid t-tests, we will investigate issues concerning normality. So, let us explore the distribution of our variables and how we might transform them toward a more normal shape.

We can use the normal option to superimpose a normal curve on this graph and the bin(20) option to use 20 bins. The distribution looks skewed to the right. You may also want to modify the axis labels; for example, we use the xlabel option below to label the x-axis. Note that histograms are sensitive to the number of bins or columns used in the display.
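A sketch of these options (the xlabel range is illustrative, since the original values were lost):

    histogram enroll, normal bin(20)
    histogram enroll, normal bin(20) xlabel(0(200)1600)   // hypothetical axis labels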

An alternative to the histogram is the kernel density plot, which approximates the probability density of the variable. Kernel density plots have the advantage of being smooth and of being independent of the choice of origin, unlike histograms. Stata implements kernel density plots with the kdensity command. Not surprisingly, the kdensity plot also indicates that the variable enroll does not look normal. A boxplot gives another view: note the dots at the top of the boxplot, which indicate possible outliers, that is, data points more than 1.5 times the interquartile range above the 75th percentile.
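Sketches of the kernel density plot and a boxplot for enroll:

    kdensity enroll, normal   // kernel density with a normal overlay
    graph box enroll          // boxplot; dots beyond the whiskers flag possible outliers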

This boxplot also confirms that enroll is skewed to the right. There are three other types of graphs that are often used to examine the distribution of variables: symmetry plots, normal quantile plots, and normal probability plots.

A symmetry plot graphs the distance above the median for the i-th value against the distance below the median for the i-th value; a variable that is symmetric would have points lying on the diagonal line. As we would expect, this distribution is not symmetric. A normal quantile plot graphs the quantiles of a variable against the quantiles of a normal (Gaussian) distribution; the plot for enroll is typical of variables that are strongly skewed to the right.

Finally, the normal probability plot is also useful for examining the distribution of variables. Again, we see indications of non-normality in enroll. Having concluded that enroll is not normally distributed, how should we address this problem? First, we may try entering the variable as-is into the regression, but if we see problems, which we likely would, then we may try to transform enroll to make it more normally distributed.
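The corresponding commands for enroll are:

    symplot enroll    // symmetry plot
    qnorm enroll      // normal quantile plot
    pnorm enroll      // normal probability plot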

Potential transformations include taking the log, taking the square root, or raising the variable to a power. Selecting the appropriate transformation is somewhat of an art, and Stata includes the ladder and gladder commands to help in the process: ladder reports numeric results and gladder produces a graphic display. The log transform has the smallest chi-square, indicating that the log transformation would help make enroll more normally distributed.
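A sketch of the transformation workflow (lenroll is the logged variable named later in the text):

    ladder enroll                     // numeric search over the ladder of powers
    gladder enroll                    // the same search, displayed graphically
    generate lenroll = log(enroll)    // natural log transform
    histogram lenroll, normal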

Note that log in Stata gives you the natural log, not log base 10; to get log base 10, use log10(var). We can see that lenroll looks quite normal. We would then use the symplot, qnorm, and pnorm commands to help us assess whether lenroll seems normal, as well as to see how lenroll impacts the residuals, which is really the important consideration.

In this lecture we have discussed the basics of how to perform simple and multiple regressions, the basics of interpreting output, as well as some related commands.

We examined some tools and techniques for screening for bad data and the consequences such data can have on your results. Finally, we touched on the assumptions of linear regression and illustrated how you can check the normality of your variables and how you can transform your variables to achieve normality.

The next chapter will pick up where this chapter has left off, going into a more thorough discussion of the assumptions of linear regression and how you can use Stata to assess these assumptions for your data.

In particular, the next lecture will address the following issues:

- Checking for points that exert undue influence on the coefficients
- Checking for constant error variance (homoscedasticity)
- Checking for linear relationships
- Checking model specification
- Checking for multicollinearity
- Checking normality of residuals

As a quick reference, recall the predict command used after regress to generate fitted values and diagnostic quantities. For this example, our new variable name will be fv, so we will type predict fv (Stata replies "option xb assumed; fitted values"). If we use the list command, we see that a fitted value has been generated for each observation. The quantities predict can create are:

Value to be created                              Option after predict
predicted values of y (the dependent variable)   (no option needed)
residuals                                        resid
standardized residuals                           rstandard
studentized or jackknifed residuals              rstudent
leverage                                         lev or hat
standard error of the residual                   stdr
Cook's D                                         cooksd
standard error of predicted individual y         stdf
standard error of predicted mean y               stdp
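A sketch of this usage, assuming a previously fit model such as regress api00 meals ell (the model itself is our assumption):

    regress api00 meals ell
    predict fv              // option xb assumed; fitted values
    list api00 fv in 1/10   // one fitted value per observation
    predict r, resid        // residuals, via the resid option from the table above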

Self-assessment questions:

1. What is the correlation between api99 and meals?
2. Regress api99 on meals. What does the output tell you?
3. Create and list the fitted (predicted) values.
4. Graph meals and api99 with and without the regression line, and explain how the two graph commands differ.
