Variable Interaction And Nonlinearity

Variable interaction and nonlinearity are fundamental concepts in statistical analysis and machine learning that help model complex relationships between variables.

Variable interaction occurs when the effect of one variable on the outcome depends on the value of another variable; for example, the effect of physical inactivity on diabetes rates may differ depending on the level of obesity. Understanding and accounting for interactions is essential for building accurate predictive models, as neglecting them can lead to biased estimates and erroneous conclusions about the relationships between variables.

Nonlinearity refers to a situation where the relationship between predictor variables and the response is not linear. In linear relationships, changes in predictors lead to proportional changes in the response. However, in nonlinear relationships, this proportionality does not hold. Detecting and modeling nonlinear relationships are critical for creating accurate models.

Dealing with Nonlinearity

  1. Utilize nonlinear models such as decision trees, random forests, neural networks, etc., that can capture complex relationships.
  2. Apply feature engineering techniques, such as adding polynomial features or transformations, to better represent nonlinearities (see the sketch after this list).
  3. Leverage kernel methods to implicitly map data into higher-dimensional spaces where linear models can capture nonlinear patterns.
  4. Use ensemble methods that combine various models to capture different aspects of the nonlinear relationship, resulting in more accurate predictions.
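A minimal sketch of point 2, using synthetic data and scikit-learn (the dataset and polynomial degree are illustrative assumptions): a degree-2 polynomial expansion lets an ordinary linear model capture a curved relationship.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))  # single synthetic predictor
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=0.3, size=200)  # quadratic signal plus noise
# Expanding X to [X, X^2] turns the curved fit into an ordinary least-squares problem.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("Training R-squared:", model.score(X, y))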

Cross-Validation and The Bootstrap

Cross-validation and the bootstrap are two commonly used techniques in statistics and machine learning for assessing the performance of models and estimating parameters.

Cross-validation is a resampling technique used to evaluate machine learning models on a limited data sample. It involves splitting the dataset into training and validation sets, where the model is trained on the training set and then evaluated on the validation set. This process is repeated multiple times, with different subsets of the data used for training and validation. Common types of cross-validation include k-fold cross-validation, leave-one-out cross-validation, and stratified k-fold cross-validation.

  • K-fold cross-validation: The data is divided into k subsets, and the model is trained and validated k times, with each subset used as the validation data once.
  • Leave-one-out cross-validation (LOOCV): Each observation is used as the validation set once while the rest of the data forms the training set.
  • Stratified k-fold cross-validation: Data is divided into k folds, ensuring that each fold has a similar distribution of the target variable. (A short sketch of these three splitters follows this list.)
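A minimal sketch of the three variants using scikit-learn's splitter objects; the small arrays are placeholders, not the project data.

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

X = np.arange(20).reshape(10, 2)  # ten observations, two features
y = np.array([0, 1] * 5)          # binary target, balanced between the classes
# Each splitter yields (train indices, validation indices) pairs.
for name, splitter in [("k-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
                       ("leave-one-out", LeaveOneOut()),
                       ("stratified k-fold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    print(name, "produces", splitter.get_n_splits(X, y), "train/validation splits")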

Bootstrap is a resampling technique that involves random sampling with replacement from the original dataset to create new samples of the same size. This technique is often used for estimating the sampling distribution of a statistic like mean or variance, or for constructing confidence intervals.

The key steps in bootstrap resampling are:

  • Sample with replacement: Randomly select observations from the original dataset with replacement to create a bootstrap sample of the same size as the original dataset.
  • Calculate statistic: Calculate the statistic of interest (e.g., mean, median, standard deviation) on the bootstrap sample.
  • Repeat: Repeat the above steps a large number of times to create a bootstrap distribution of the statistic.

Bootstrap is particularly useful when the underlying distribution of the data is unknown or when you want to estimate the sampling distribution of a statistic without making strong assumptions about the data.
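The resampling steps above can be sketched in a few lines of Python; the exponential sample is a synthetic stand-in for data whose distribution is unknown.

import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=500)  # stand-in for an observed sample
n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    sample = rng.choice(data, size=data.size, replace=True)  # step 1: sample with replacement
    boot_means[b] = sample.mean()                            # step 2: compute the statistic
# Step 3: the collected statistics approximate the sampling distribution of the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print("Bootstrap 95% confidence interval for the mean:", (ci_low, ci_high))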

K-Fold Cross-Validation and the Kruskal-Wallis H Test

Cross-validation is a technique used in machine learning and statistics to evaluate the performance of a predictive model. It involves partitioning a dataset into subsets, training the model on some of these subsets, and evaluating its performance on the remaining subset.

K-fold cross-validation is a specific approach to cross-validation where the original dataset is divided into K equal-sized folds. The model is trained on K-1 of these folds and validated on the remaining fold. This process is repeated K times, with each fold used as the validation set exactly once. 

  • Split the data into K equal parts.
  • For training and validation: Train on K-1 parts, and validate on the remaining part.
  • Repeat this K times with a different validation part each time.
  • To evaluate performance: Measure accuracy, mean squared error, etc., on each validation part.
  • To get the overall performance: Average the performance metrics from all K iterations (a minimal sketch of this loop follows the list).
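A minimal sketch of this loop, assuming synthetic data and a plain linear regression as the model; the project's actual model and features may differ.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                                   # synthetic predictors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)
kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_mse = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # train on K-1 folds
    preds = model.predict(X[val_idx])                           # validate on the held-out fold
    fold_mse.append(mean_squared_error(y[val_idx], preds))
print("Per-fold MSE:", np.round(fold_mse, 3))
print("Average MSE across the K folds:", np.mean(fold_mse))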

In our project, we used the Kruskal-Wallis H test, a non-parametric statistical test used to determine whether there are statistically significant differences between the medians of three or more independent groups.

  • The statistic value is a measure of the overall difference in ranks among the groups being compared. The calculated H statistic is approximately 896.813.
  • The p-value is a measure of the probability of observing the data, assuming the null hypothesis is true. It tells us how likely it is to observe such an extreme H statistic by chance alone if there were no actual differences between the groups.
  • The p-value in this output is approximately 1.817 × 10^−195, an extremely small value close to zero.
  • The small p-value suggests strong evidence against the null hypothesis. With such a small p-value, it's safe to reject the null hypothesis and conclude that there are significant differences in medians among the groups involved in the Kruskal-Wallis H test (a minimal call sketch follows this list).
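A minimal sketch of how the test can be run with scipy.stats.kruskal; the three small groups below are hypothetical values, not the project data, so the statistic and p-value will not match the figures above.

from scipy.stats import kruskal

# Hypothetical stand-ins for three independent groups.
group_a = [15.2, 17.8, 16.1, 18.4, 15.9]
group_b = [20.1, 22.3, 21.7, 19.8, 23.0]
group_c = [9.4, 10.2, 8.8, 11.1, 9.9]
h_stat, p_value = kruskal(group_a, group_b, group_c)
print("H statistic:", h_stat)
print("p-value:", p_value)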

 

Understanding Distributions in Merged Datasets

We combined information from three different sets: one for inactive individuals, another for people with diabetes, and a third for those dealing with obesity.

Histogram for Inactivity and Diabetes Data: When we looked at the data for inactive individuals and those with diabetes, the histograms showed an approximately normal distribution. This means that the data forms a bell-shaped curve, and the mean and standard deviation can fully describe the characteristics of this distribution. It's like how most people's heights cluster around an average height, with some variability.

Histogram for Obesity Data: However, when we examined the data for obesity, the histogram looked different. It was a "left-skewed histogram," meaning its longer tail extended toward the left side. The mean was typically less than the median, and the longer tail on the left side of the graph indicated that there were some unusually low values in the data, which is common in obesity data.

By identifying these different distributions, we can analyze the data more appropriately; many standard statistical methods assume normality, so they are most reliable when the data is approximately normal.

T-Test and Crab Molt Model

A t-test is a statistical hypothesis test used to determine if there is a significant difference between the means of two groups. It is commonly used when you have a small sample size and want to infer if the difference between the groups is likely due to chance.
There are two types:

  1. Independent Samples T-Test: Compares means of two separate groups.
  2. Paired Samples T-Test: Compares means of paired data.
  • Null Hypothesis (H0): No significant difference between the means.
  • Alternative Hypothesis (Ha): A significant difference between the means.

The test calculates a statistic (t) based on the sample means and standard deviations. If the p-value is less than a chosen significance level, we reject the null hypothesis, implying a significant difference. Assumptions such as normality and (for the independent-samples test) equal variances should be checked before relying on the result.
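A minimal sketch of both t-test variants with scipy; the groups are synthetic placeholders rather than the project data.

import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(3)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)   # first independent group
group_b = rng.normal(loc=11.0, scale=2.0, size=30)   # second independent group
t_ind, p_ind = ttest_ind(group_a, group_b)           # independent-samples t-test
before = rng.normal(loc=50.0, scale=5.0, size=25)    # same units measured twice
after = before + rng.normal(loc=1.5, scale=1.0, size=25)
t_rel, p_rel = ttest_rel(before, after)              # paired-samples t-test
print("Independent samples: t =", t_ind, "p =", p_ind)
print("Paired samples: t =", t_rel, "p =", p_rel)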

The Crab Molt model looks at measurements of crabs before and after molting. Our main goal was to predict crab size before molting from the post-molt measurements. Using a simple regression model, we obtained an R-squared value of 0.98, indicating that the model predicts well on this data. We also compared the pre-molt and post-molt data: the two were similar in distribution, with a mean difference of about 14.7 units.
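A minimal sketch of how such a fit could be set up with statsmodels; the file name and column names here are assumptions, not the project's actual identifiers.

import pandas as pd
import statsmodels.api as sm

crabs = pd.read_csv("crab_molt.csv")       # hypothetical file with pre- and post-molt sizes
X = sm.add_constant(crabs["postmolt"])     # predictor: post-molt size (assumed column name)
fit = sm.OLS(crabs["premolt"], X).fit()    # response: pre-molt size (assumed column name)
print("R-squared:", fit.rsquared)          # the report cites roughly 0.98
print("Mean post-minus-pre difference:", (crabs["postmolt"] - crabs["premolt"]).mean())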

Multiple Linear Regression

Multiple linear regression is a statistical method that is an extension of simple linear regression in which more than one independent variable (X) is used to predict a single dependent variable (Y). The predicted value of Y is a linear transformation of the X variables such that the sum of squared deviations of the observed and predicted Y is a minimum. The computations are more complex, however, because the interrelationships among all the variables must be considered in the weights assigned to the variables. The interpretation of the results of a multiple regression analysis is also more complex for the same reason. With two independent variables the prediction of Y is expressed by the following equation:

Y’i = b0 + b1X1i + b2X2i

This transformation is similar to the weighted linear combination of two variables discussed earlier, except that the weights (w's) have been replaced with regression coefficients (b's) and the combined score is the predicted value Y'i.

The "b" values are called regression weights and are computed in a way that minimizes the sum of squared deviations between the observed and predicted Y values.

We fit a multiple linear regression relating the two independent variables, "inactive" and "obesity," to the dependent variable "diabetic." R-squared is a statistical measure used in regression analysis to evaluate the goodness of fit of a regression model: it indicates the proportion of variance in the dependent variable that can be explained by the independent variables in the model. An R-squared value of 0.34 in this multiple linear regression indicates that the independent variables included in the model collectively explain about 34% of the variability in the dependent variable.
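A minimal sketch of such a fit with statsmodels; the file name and column names are assumptions standing in for the merged project data.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("merged_health_data.csv")        # hypothetical merged county-level file
X = sm.add_constant(df[["inactive", "obesity"]])  # assumed predictor columns
mlr = sm.OLS(df["diabetic"], X).fit()             # assumed response column
print(mlr.params)                                 # b0, b1, b2 in Y' = b0 + b1*X1 + b2*X2
print("R-squared:", mlr.rsquared)                 # the report cites about 0.34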

Test for Heteroskedasticity

 

Residuals
In statistical analysis, residuals are the differences between observed values and the corresponding values predicted by a statistical model. These differences are fundamental for understanding how well a statistical model fits the observed data and for diagnosing the appropriateness of the model assumptions.

Breusch-Pagan test for heteroskedasticity
The Breusch-Pagan test is a statistical test used to determine whether the variance of the errors in a regression model is constant or varies with respect to the predictor variables. In regression analysis, heteroscedasticity refers to the unequal scatter of residuals. Specifically, it refers to the case where there is a systematic change in the spread of the residuals over the range of measured values.

Heteroscedasticity is a problem because ordinary least squares (OLS) regression assumes that the residuals come from a population with homoscedasticity, meaning constant variance. When heteroscedasticity is present in a regression analysis, the results of the analysis become hard to trust. One way to determine whether heteroscedasticity is present in a regression analysis is to use a Breusch-Pagan test.
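A minimal self-contained sketch of the test with statsmodels, using synthetic data whose error variance deliberately grows with the predictor; the project's actual regression would be substituted for this toy fit.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2 + 0.3 * x)  # noise spread increases with x
X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
# het_breuschpagan takes the residuals and the design matrix (including the constant).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, ols_fit.model.exog)
print("LM statistic:", lm_stat, "p-value:", lm_pvalue)  # a small p-value flags heteroskedasticity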

Multiple linear regression for obesity, inactivity, and diabetes is a generalization of simple linear regression, in the sense that this approach makes it possible to evaluate the linear relationship between a response variable and several explanatory variables.

P-value and Linear Regression between Obesity and Diabetes

P-Value:

A p-value, short for "probability value," is a statistical measure used in hypothesis testing to determine the strength of evidence against a null hypothesis. The null hypothesis is a statement that there is no significant effect or relationship in a given set of data.

  • For example, when testing whether a coin is fair, the null hypothesis (H0) would be that the coin is fair, meaning it has an equal chance of landing on heads or tails (probability of each = 0.5). The alternative hypothesis (Ha) would be that the coin is biased towards tails, meaning it's more likely to land on tails than heads.

A linear regression plot of the two variables, obesity and diabetes, helps you visualize the relationship between them and understand how they are related linearly. The fitted line follows the formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = coefficient, and x = score on the independent variable. Regression analysis draws a line through these points that minimizes their overall distance from the line. More specifically, least squares regression minimizes the sum of the squared differences between the data points and the line. Following the practice in statistics, the Y-axis displays the dependent variable, % DIABETIC, and the X-axis shows the independent variable, % OBESE. The Pearson correlation coefficient is used to measure the strength of the linear association between obesity and diabetes; here the two variables show a moderate positive correlation of about 0.39, as the correlation matrix below shows:

array([[1.        , 0.38532577],
       [0.38532577, 1.        ]])
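A minimal sketch of how such a matrix and the Pearson coefficient can be computed; the two arrays are synthetic stand-ins for the % OBESE and % DIABETIC columns, so the numbers will not match the output above.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(11)
pct_obese = rng.normal(loc=19, scale=2, size=300)                      # stand-in for % OBESE
pct_diabetic = 0.4 * pct_obese + rng.normal(loc=1, scale=2, size=300)  # stand-in for % DIABETIC
print(np.corrcoef(pct_obese, pct_diabetic))   # 2x2 matrix of the form shown above
r, p = pearsonr(pct_obese, pct_diabetic)
print("Pearson r:", r, "p-value:", p)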

The Breusch-Pagan test is a statistical test used to detect heteroskedasticity in a regression model. Heteroskedasticity occurs when the variance of the residuals in the regression model is not constant across all levels of the independent variables, violating one of the assumptions of linear regression.

 

Overview of Diabetes and Obesity data

The histogram for obesity is a left-skewed histogram, also known as a negatively skewed histogram. It's a graphical representation of data in which the tail of the distribution extends to the left. The mean is typically less than the median, and the longer tail on the left side indicates some lower extreme values in the data.

The histogram for diabetes follows a normal distribution. The mean and standard deviation of a normal distribution fully describe its characteristics, and when data is approximately normal, statistical tests that assume normality tend to be more accurate.

A QQ plot for diabetes and obesity compares the quantiles of the dataset to the quantiles expected from a theoretical distribution. These pairs of quantiles are then plotted against each other on a scatterplot, with the theoretical quantiles on the x-axis and the observed quantiles on the y-axis. The QQ plot for diabetes is S-shaped, which indicates skewness. The QQ plot for obesity has points consistently below the line, which suggests that the obesity data has lighter tails.
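A minimal sketch of how such QQ plots can be drawn against a normal distribution with scipy and matplotlib; the two samples are synthetic stand-ins shaped roughly like the distributions described above, not the project data.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

rng = np.random.default_rng(5)
diabetes_like = rng.normal(loc=8.7, scale=1.0, size=500)       # roughly normal stand-in
obesity_like = 20 - rng.gamma(shape=2.0, scale=1.5, size=500)  # left-skewed stand-in
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(diabetes_like, dist="norm", plot=axes[0])       # theoretical quantiles on x, observed on y
axes[0].set_title("QQ plot: diabetes-like data")
stats.probplot(obesity_like, dist="norm", plot=axes[1])
axes[1].set_title("QQ plot: obesity-like data")
plt.tight_layout()
plt.show()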