Polynomial Regression does regression on data supplied by the user. It can use least squares analysis to find the best fit polynomial of degree n = 1 (straight line), 2 (quadratic), 3 (cubic), or 4 (quartic).
The program has 5 modes of operation:
1. Linear (the traditional linear regression)
2. Quadratic (parabola)
3. Cubic (degree 3)
4. Quartic (degree 4)
5. Data only scatter chart. (This allows looking at the data in
order to determine what kinds of regression might be appropriate.)
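As a rough illustration of what a least squares fit in modes 1 to 4 involves, here is a minimal Python sketch. It is not PolyRegression's own code: the function name fit_polynomial and the choice of solving the normal equations by Gaussian elimination are assumptions made for this example. It needs at least degree + 1 distinct x values, otherwise the system is singular.

```python
def fit_polynomial(xs, ys, degree):
    """Return least squares coefficients, highest power of x first."""
    n = degree + 1
    # Normal equations A c = b: A[i][j] is the sum of x**(i + j),
    # b[i] is the sum of y * x**i, taken over all data points.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]

    # Back substitution yields the coefficients of x**0, x**1, ...
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / A[i][i]
    return coeffs[::-1]  # reorder so the highest power comes first


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x + 3 for x in xs]
print(fit_polynomial(xs, ys, 1))  # approximately [2.0, 3.0]
```

Normal equations are easy to read but can lose accuracy for high degrees or widely spread x values; a production implementation would more likely use a QR-based solver.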
Data values can be typed into the x values and y values text boxes, or one can browse local files to import those data values.
The graphics for this apt are based on the grapher apt, although many of the input boxes are used in different, dedicated ways.
There are three ways to enter the data:
Using the "Fit the curve" button is not necessary when using a new example or reading data from a file.
One can use this button to input x and y values from a file, which is
particularly helpful when there are lots of data points. Each point is
on a separate line using the format
x value , y value
For example, the point (3, 5) would appear as "3, 5".
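To make the file format concrete, here is a small Python sketch that reads text in the "x value , y value" layout into two lists. The function name parse_points is hypothetical, and skipping blank lines is an assumption; the apt's actual file reader may behave differently.

```python
def parse_points(text):
    """Split lines of the form 'x value , y value' into two lists."""
    xs, ys = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped
        x_text, y_text = line.split(",")
        # float() tolerates the spaces around each number.
        xs.append(float(x_text))
        ys.append(float(y_text))
    return xs, ys


xs, ys = parse_points("3, 5\n4 , 7.5\n")
print(xs)  # [3.0, 4.0]
print(ys)  # [5.0, 7.5]
```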
The values from the file are stored in the "x values" and "y values" text boxes. The option allows on-line users to input data from files on their computer.
After entering or changing data in the "x values" or "y values" text boxes, click this button to ask the apt to calculate the new regression curve. It is not necessary to use this button after changing the mode, using an example, or browsing for a file.
When the evaluate button is pressed, a table containing each of the x and y values along with the estimated (predicted) value from the regression curve and the error (y - estimated y) appears below the button. The table disappears the next time the regression calculations are performed (by clicking a mode, the "Fit the curve" button, or an example button).
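The rows of such a table can be sketched in Python as follows. The function name evaluate_table, the coefficients, and the sample data are made up for illustration; only the error definition (y - estimated y) comes from the description above.

```python
def evaluate_table(xs, ys, coeffs):
    """Return rows of (x, y, estimated y, error) for a fitted polynomial.

    coeffs lists the polynomial coefficients, highest power of x first.
    """
    rows = []
    for x, y in zip(xs, ys):
        estimate = 0.0
        for c in coeffs:  # Horner's rule
            estimate = estimate * x + c
        rows.append((x, y, estimate, y - estimate))
    return rows


# Hypothetical data near the line y = 2x + 3.
for row in evaluate_table([0, 1, 2], [3.1, 4.8, 7.0], [2.0, 3.0]):
    print(row)
```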
The Exact and Random buttons can be used to create data for testing and experimenting. For example, the Random button can be used to see the effects of small variations in the data.
Before using the "Exact" button, make sure you have appropriate x values. (The y values are irrelevant because they will be replaced.) When one clicks the "Exact" button, PolyRegression generates a y value for each x value using the regression formula in the "Fit to" text box with the coefficients a = 2, b = 3, c = 4, d = 5, and e = 6, as needed. For example, if one is using Linear regression, the points are generated using 2x + 3. If one is using Quadratic regression, the y values are generated with 2x^{2} + 3x + 4. The regression curve generated should pass through each of the plotted points, and the values of the coefficients should be "exact" (ignoring round-off error). R^{2} and the adjusted R^{2} should be 1 (at least to many decimal places) while σ^{2} should be 0.
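The effect of the "Exact" button can be mimicked with a few lines of Python. This is only a sketch of the rule described above (coefficients 2, 3, 4, 5, 6 taken in order, as many as the degree needs); the function name exact_y_values is hypothetical.

```python
COEFFS = [2, 3, 4, 5, 6]  # a, b, c, d, e as given in the text


def exact_y_values(xs, degree):
    """Generate y values from the first degree + 1 coefficients."""
    coeffs = COEFFS[:degree + 1]  # highest power of x first
    ys = []
    for x in xs:
        y = 0
        for c in coeffs:  # Horner's rule
            y = y * x + c
        ys.append(y)
    return ys


print(exact_y_values(range(9), 1))  # [3, 5, 7, 9, 11, 13, 15, 17, 19]
print(exact_y_values([0, 1, 2], 2))  # [4, 9, 18]
```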
When the "Random" button is pressed, the apt does the same thing except that after the y values are calculated, they are randomized by multiplying each value by a random value between 0.9 and 1.1. Each time the button is clicked, a different set of y values will be produced and the calculated values will change. One may be surprised how much the regression coefficients can change, particularly with higher degree polynomials, with only small changes in the y values.
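A minimal Python sketch of this randomization step follows. The function name randomize is hypothetical, and drawing the factor from a uniform distribution is an assumption; the text only says the factor lies between 0.9 and 1.1.

```python
import random


def randomize(ys, rng=random):
    """Multiply each y by its own random factor between 0.9 and 1.1."""
    # Assumption: the factor is uniformly distributed.
    return [y * rng.uniform(0.9, 1.1) for y in ys]


exact = [3, 5, 7, 9, 11, 13, 15, 17, 19]  # y = 2x + 3 for x = 0..8
print(randomize(exact))  # different on every run
```

Passing a seeded generator such as random.Random(1) as rng makes an experiment repeatable.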
After appropriate x and y data points have been entered, you can change the regression formula by just clicking a Linear, Quadratic, Cubic, or Quartic button. The calculations are automatically redone and the new regression polynomial is shown below the canvas and coefficients.
How does one decide which regression polynomial is "best"? The plotted graph can be helpful. The values of R^{2}, σ^{2} and adjusted R^{2} can also be helpful. R^{2} can be interpreted as the "percent" of the variation in the data that the regression polynomial explains. R^{2} = 1 means that the curve explains 100% of the data, while 0 would mean that it explains nothing about the data. (Negative values mean the curve is completely irrelevant.) Smaller values of the variance σ^{2} are better. Unfortunately, R^{2} will increase if the degree of the polynomial is increased even if the increased degree is not really a better model.
"The adjusted R-squared adjusts for the number of terms in the model. Importantly, its value increases only when the new term improves the model fit more than expected by chance alone. The adjusted R-squared value actually decreases when the term doesn't improve the model fit by a sufficient amount." https://statisticsbyjim.com/regression/interpret-adjusted-r-squared-predicted-r-squared-regression/
Here is a simple example. The 9 x data points (0, 1, 2, 3, 4, 5, 6, 7, 8) were used. The Exact button would generate y values (3, 5, 7, 9, 11, 13, 15, 17, 19) according to the formula y = 2x + 3. However, in one experiment, the Random button was used, randomly modifying the y values by less than 10%. The following data was collected:
Regression Curve | R^{2} | σ^{2} | adjusted R^{2} |
---|---|---|---|
Linear 1.957x + 3.337 | .985 | .505 | .983 |
Quadratic | .985 | .587 | .980 |
Cubic | .986 | .640 | .978 |
Quartic | .991 | .518 | .982 |
If one just looks at R^{2}, one would think the Quartic model was best. But looking at adjusted R^{2}, one sees that the adjusted value decreases for the quadratic model, suggesting that even the x^{2} term does not help. Notice that the coefficients in the linear model differ noticeably from those of the exact curve. Because the y values were picked randomly, one would normally get different results if the experiment were repeated.
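The three statistics can be sketched in Python with the standard textbook formulas. The function names are hypothetical, and dividing the residual sum of squares by n - degree - 1 to get σ^{2} is an assumption about how PolyRegression defines it (it appears consistent with the numbers in the table above). The adjusted R^{2} formula is the usual one for n points and a polynomial of the given degree.

```python
def adjusted_r2(r2, n, degree):
    """Standard adjusted R^2 for n data points and a degree-p polynomial."""
    return 1 - (1 - r2) * (n - 1) / (n - degree - 1)


def fit_statistics(ys, estimates, degree):
    """Return (R^2, sigma^2, adjusted R^2) for observed and estimated y."""
    n = len(ys)
    mean_y = sum(ys) / n
    ss_res = sum((y - e) ** 2 for y, e in zip(ys, estimates))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)  # zero if all y are equal
    r2 = 1 - ss_res / ss_tot
    sigma2 = ss_res / (n - degree - 1)  # assumption: residual variance
    return r2, sigma2, adjusted_r2(r2, n, degree)


# Linear row of the table: R^2 = .985 with 9 points.
print(round(adjusted_r2(0.985, 9, 1), 3))  # 0.983
```

Because the penalty factor (n - 1)/(n - degree - 1) grows with the degree, a higher degree must raise R^{2} by enough to offset it before the adjusted value improves.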
To illustrate the point that a bigger R^{2} does not always mean a better fit, the experiment was repeated with the same x values but this time using randomized values generated from 2x^{3} + 3x^{2} + 4x + 5.
Regression Curve | R^{2} | σ^{2} | adjusted R^{2} |
---|---|---|---|
Linear | .847 | 35730 | .825 |
Quadratic | .989 | 2941 | .986 |
Cubic 2.27x^{3} + 0.229x^{2} 14.137x - .759 | .994 | 2059 | .990 |
Quartic | .995 | 2038 | .990 |
If one looks at only the R^{2} column, one might assume that the 4th degree quartic was the best regression model. However, if one checks the adjusted R^{2} column, one sees that the quartic does not improve the fit, suggesting that the cubic would be a better model. (The σ^{2} values are large because the y values are large.) Again the coefficients of the cubic differ significantly from the original cubic despite the y values being randomized by less than 10%. Repeating the experiment would yield different results because of the randomization.
See Options for special features below.
Many of the function boxes have dedicated purposes.
Although the first 3 function text boxes are dedicated, the last three function boxes are available for functions or special options like labels.
When writing functions: The operational signs are +, -, *, /, % and **. You can use ( and ) to control the order of operations and with functions. They can also be used in almost all input expressions.
The % operator means modulus. That is, if we have a % b we would divide a by b as integers and the result is the remainder. For example, 13 % 3 = 1 as 13 divided by 3 is 4 with a remainder of 1. Some more examples: 11 % 5 = 1, 215 % 10 = 5
The ** operator means "to the power". Thus, 3**2 = 9, 2**3 = 8, and 10**3 = 1000.
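Python happens to use the same % and ** symbols, so the examples above can be checked directly. This only confirms the arithmetic for the non-negative integers shown; how the apt treats negative or fractional operands is not covered here.

```python
# Modulus examples from the text (non-negative integers only).
remainders = [13 % 3, 11 % 5, 215 % 10]
# Power examples from the text.
powers = [3 ** 2, 2 ** 3, 10 ** 3]

print(remainders)  # [1, 1, 5]
print(powers)      # [9, 8, 1000]
```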
You can use some constants: PI (π = 3.141592653589793), QUARTER_PI, HALF_PI, TWO_PI, Math.E (e = 2.718281828459045).
You can use some values: a, b, c, d, e, and s. Variables a and b have a special use. Additional variables have special purposes depending on the regression curve and should not be used for input. They will be explained in the values section.
Some of the functions that can be used follow. (Note: some of these functions normally are not appropriate when defining data.)
Order of evaluation:
Illegal values like sqrt(-2) and 1/0 are ignored.
After you finish typing a function, press "Enter".
Formulas can contain comments using // and /*...*/.
There are 7 special options, of which Words is normally the only one that is useful in PolyRegression. These options can be used to add labels to the graph. If you want a label, you can put it in one of the undedicated function boxes as illustrated in several of the examples.
Values are separated by ";".
Words are centered horizontally at the specified point and are located just above the specified point.
(This option would be better named "label" or "text". But using the first letter of these names would not work because "l" and "t" have other meanings in the grapher apt.)
It is possible to include variable values in words. The variable values are denoted by ^a, ^b, ^c, ^d, and ^e. E.g. if d currently = 3, then the "word" "value=^d stars" results in "value=3 stars".
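The substitution rule can be sketched in Python as below. The function name expand_label is hypothetical, and the way numbers are formatted (shortest general form) is an assumption; the apt may round or format values differently.

```python
import re


def expand_label(word, values):
    """Replace ^a .. ^e in a label with the current variable values."""
    def replace(match):
        value = values[match.group(1)]
        return f"{value:g}"  # assumption: compact number formatting

    return re.sub(r"\^([a-e])", replace, word)


print(expand_label("value=^d stars", {"d": 3}))  # value=3 stars
```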
When supplying a label for a curve, it may be desirable to have
the label's color match the curve's color. To do this, follow
the last "word" in the input box with
; color; color number;. E.g.
w 2; 3; aWord; color; 2;
This says to put "aWord" at (2, 3) with the same coloring as function
box 2 (Fit to), because the function boxes are numbered
starting with 0. Notice the ";"s in the statement. The
final ";" is absolutely
needed. It is possible to use formulas for the curve number.
Minimum x and Maximum x
are the left and right end points of the x axis. If the input boxes
are blank, the min and max are set by the domain of the x values.
Minimum y and Maximum y,
if specified, are the end points of the y axis. If these fields are blank,
appropriate values will be calculated automatically based on the y values
of the functions.
If "Equal spacing" is true (checked) then the mins and maxs for x and y will be adjusted as needed in order to make the equal spacing possible.
Note: After you finish typing a minimum or maximum, press "Enter". The minimum and maximum values can be formulas which are evaluated the same way functions and values are evaluated. Just don't use x, t, or o in the formulas.
Sometimes it is useful to zoom in to see a portion of the plot more closely. The Zoom In button does this. The domain is reduced by one half and centered at the last x value. If both the y minimum and y maximum are specified, the range is reduced in a similar manner.
Zoom out increases the domain by a factor of 2.
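The zoom arithmetic can be sketched in Python as follows. Both function names are hypothetical. The center for zooming in is passed explicitly, and keeping the current midpoint when zooming out is an assumption, since the text does not say where the enlarged domain is centered.

```python
def zoom_in(x_min, x_max, center):
    """Halve the width of the domain and center it on the given x."""
    half_width = (x_max - x_min) / 4
    return center - half_width, center + half_width


def zoom_out(x_min, x_max):
    """Double the width of the domain around its midpoint (assumption)."""
    center = (x_min + x_max) / 2
    half_width = x_max - x_min
    return center - half_width, center + half_width


print(zoom_in(0, 8, 6))  # (4.0, 8.0)
print(zoom_out(0, 8))    # (-4.0, 12.0)
```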
When Equal Spacing is checked, the units on the x and y axes are the same, so a 45° line will look like a 45° line and circles will look like circles.
For the sake of efficiency, PolyRegression stops redrawing plots when nothing is changed. Checking Allow motion says to keep drawing. However, this is rarely useful in PolyRegression.
The standard canvas is medium sized to allow the user to see the canvas and many of the controls at the same time without having to scroll. This is convenient while entering data and picking a mode. But a larger plot is drawn above the original canvas and controls when the "Show enlarged canvas" item is checked. When using this option, one will normally have to scroll up to see the larger plot and scroll down to see the original canvas and the control items. The enlarged canvas shows a magnified image of the original canvas except that text elements are the same size as in the original.
Notes:
The Radians and Degrees radio buttons allow the user to specify if the trig functions and trig inverse functions assume radians or degrees. The default is radians.
You can supply values for undedicated values. The dedicated registers hold coefficients for the regression polynomial starting with the coefficient of the highest power of x.
Regression | a | b | c | d | e |
---|---|---|---|---|---|
Linear | dedicated | dedicated | | | |
Quadratic | dedicated | dedicated | dedicated | | |
Cubic | dedicated | dedicated | dedicated | dedicated | |
Quartic | dedicated | dedicated | dedicated | dedicated | dedicated |
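The table can be restated as a small Python sketch that shows which power of x each dedicated register multiplies and which registers remain free. The function names are hypothetical; the rule itself (coefficients stored starting with the highest power of x) is the one stated above.

```python
REGISTERS = "abcde"


def register_powers(degree):
    """Map each dedicated register to the power of x it multiplies."""
    used = REGISTERS[:degree + 1]
    return {name: degree - i for i, name in enumerate(used)}


def free_registers(degree):
    """List the registers left undedicated for a given degree."""
    return list(REGISTERS[degree + 1:])


print(register_powers(2))  # {'a': 2, 'b': 1, 'c': 0}
print(free_registers(2))   # ['d', 'e']
```

So in Quadratic mode the curve is a*x^2 + b*x + c, and only d and e can hold user-supplied values.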
An undedicated variable is 0 when its input box is blank. They are evaluated the same way functions are evaluated. Just don't use x, t, or o in the formulas. See below for the special s value. You can use the value of one of the variables in another. For example, the expressions for c and d might be s/10 and 2 * e. Because one can use formulas for undedicated variables, the numerical values for all variables (rounded to 3 decimal places) are shown below the values.
Warning: If you ask to use a high degree regression polynomial when the data closely fits a lower degree polynomial, the leading coefficients may be very close to zero. Values very near zero are shown in power of 10 notation. For example, the apt would show .00000000000008283328405903738 as 8.283328405903738e-14. If only the leading digits (8.2833284059...) are visible in the text box and the exponent is cut off, the display can be very misleading. However, the value is shown as 0.000 in the value listing below the variable boxes and as 0 in the regression curve.
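The same display issue can be reproduced in Python, which also switches to power of 10 notation for very small numbers. This is an analogy only; the apt's own number formatting is not shown here.

```python
value = 0.00000000000008283328405903738

full_text = repr(value)      # power of 10 notation, ends in e-14
rounded = f"{value:.3f}"     # rounded to 3 decimal places

print(full_text)
print(rounded)  # 0.000
# Reading only the digits before the "e" hides the tiny magnitude.
print(full_text.split("e")[0])
```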
The slider moves from 0 to 100 and is always an integer. If you move the slider, its value is sent to the s value input box. One can type a value from 0 to 100 in the s value input box to adjust the slider to that value. Anything else typed into that s value box is ignored and will be replaced by the value of the slider.
If the value range of the slider is not appropriate you can do something like the following: f3(x) = e * sin(x) where e value is set to s/10.
Important note: When "Allow motion" is false (not checked) the value of s is set when the mouse releases the slider but the value of s does not track the slider while the slider is moving. However, if "Allow motion" is true (checked) (and at least one function has been defined) the value of s tracks the slider.
The initial value of 200 is almost always adequate as that means the functions are evaluated at about every other pixel in the x direction.
This button is useful if you have your own copy of PolyRegression on your computer. If you set up a problem that you want to use again in another session, you can click this button and a dialog box will show the information needed to add the current plot to the setupExamples function in PolyRegression.pjs. You can copy the information and paste it into the function. You will have to provide a unique example number and a name for the graph. Formulas are shown in the order of the example numbers.
If a "," is included in a "word" special option w in function boxes or as a data separator in the x and y value boxes, it will be replaced by "^$". The substitution is required because ","s are used to separate items in the data. However, this is not a problem because the "^$" is automatically replaced by "," when used in an example.
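The comma substitution is a simple reversible text replacement, sketched here in Python. The function names are hypothetical, and the sketch assumes the original text does not already contain "^$".

```python
def escape_commas(text):
    """Replace each comma so it cannot be mistaken for a data separator."""
    return text.replace(",", "^$")


def restore_commas(text):
    """Undo escape_commas when the example is loaded again."""
    return text.replace("^$", ",")


saved = escape_commas("Hello, world")
print(saved)                  # Hello^$ world
print(restore_commas(saved))  # Hello, world
```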
If you click this button, the current setup is saved as a new example, which is added to the list of examples. You will be required to supply a name for the new example. The temporary example will only be available for the rest of the current session.
After finishing a plot, it can be saved as an image file. Click the button. Depending on your browser, you may be asked if you want to save the file. In any case, you will probably find it in your normal download folder with the name polyRegression.jpg. If you save more than one time, the multiple copies will normally be numbered. Your browser may have a button to give direct access to this folder.
When the user types invalid information into one of the input boxes, a message is displayed at the top of the plot area. After the error is fixed, the error message will normally be hidden. Occasionally it may be necessary to click anywhere in the plot area to hide the message.
James Brink, 7/6/2021