Exploring Age Distribution and Mental Health Impact on Manner of Death

A simple histogram shows the distribution of ages in the dataset. The plot indicates that most data points cluster around a central age value, forming a distribution roughly similar to a normal distribution but with a slight right skew. This suggests that most individuals fall within a specific age range, with fewer older individuals.
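
A minimal sketch of how such a histogram could be produced with pandas and matplotlib, assuming the dataset has been loaded into a DataFrame with an age column (the file name here is only a placeholder):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; adjust the path to wherever the dataset is stored.
df = pd.read_csv("police_shootings.csv")

# Plot the age distribution as a simple histogram.
df["age"].plot(kind="hist", bins=30, edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Distribution of Ages")
plt.show()
```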


A simple bar chart shows the relationship between the presence or absence of “Signs of Mental Illness” (1 for true, 0 for false) and the “Manner of Death.” It visualizes the count of incidents in each category.

  • When “Signs of Mental Illness” is 0 (false), the counts for both “shot” and “shot and Tasered” are higher.
  • When “Signs of Mental Illness” is 1 (true), the counts for “shot” and “shot and Tasered” are lower than when there are no signs of mental illness.

When individuals do not show signs of mental illness, the most common manners of death are “shot” and “shot and Tasered”; in other words, these outcomes appear more frequently in incidents without reported signs of mental illness.
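
A hedged sketch of how this kind of grouped count plot could be drawn with seaborn, assuming column names such as manner_of_death and signs_of_mental_illness (adjust to the actual schema of the file):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("police_shootings.csv")  # hypothetical path

# Count of incidents per manner of death, split by signs of mental illness (0/1).
sns.countplot(data=df, x="manner_of_death", hue="signs_of_mental_illness")
plt.xlabel("Manner of Death")
plt.ylabel("Number of Incidents")
plt.title("Manner of Death by Signs of Mental Illness")
plt.show()
```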

Hyperparameter Tuning for K-means Clustering

Hyperparameter tuning is a crucial aspect of the K-means clustering algorithm to ensure optimal performance on specific datasets. K-means has several hyperparameters that can be adjusted to optimize its performance. Here is a guide on how to tune hyperparameters for K-means:

Number of Clusters (K): The number of clusters we want to divide the data into is the most critical hyperparameter in K-means. Selecting the appropriate value for K is often the most challenging part. Techniques such as the Elbow Method or Silhouette Score can be used to determine an appropriate value for K. The goal is to find a point where increasing the number of clusters does not significantly reduce the cost function or improve the clustering quality.
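
As a rough illustration, the Elbow Method and Silhouette Score could be computed with scikit-learn along these lines (synthetic data stands in for the real feature matrix):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data in place of the real feature matrix.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

inertias, silhouettes = [], []
k_values = range(2, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)                        # cost for the Elbow Method
    silhouettes.append(silhouette_score(X, model.labels_)) # clustering quality

# Elbow curve: look for the point where inertia stops dropping sharply.
plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.show()

print("Best K by silhouette:", k_values[int(np.argmax(silhouettes))])
```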

Initialization Method: K-means is sensitive to the initial placement of cluster centers. Different initialization methods can lead to different results. Common initialization techniques are “random” and “k-means++.” The k-means++ method usually produces better results as it initializes cluster centers in a way that is more likely to lead to a good final solution.

Max Iterations (max_iter) and Number of Initializations (n_init): K-means uses an iterative process to converge to a solution. The algorithm stops when the cluster assignments no longer change significantly or after a maximum number of iterations, controlled by max_iter. A related parameter, n_init, determines how many times the algorithm is run with different initializations before keeping the best result. Increasing n_init can help find a more stable solution.

Tolerance (tol): This parameter specifies when the algorithm should stop iterating. It is often set to a small value like 1e-4, meaning that if the change in cluster centers between iterations is smaller than this value, the algorithm stops.

Distance Metric: K-means uses Euclidean distance by default to measure the dissimilarity between data points and cluster centers. Depending on the data, we may consider using a different distance metric, such as Manhattan distance or cosine similarity.

Preprocessing: Scaling or normalizing data can impact the performance of K-means. It is often a good idea to preprocess data to have features with similar scales, especially when working with distance-based algorithms like K-means.
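
Putting the pieces above together, a tentative scikit-learn sketch that scales the features and sets K, the initialization method, n_init, max_iter, and tol explicitly might look like this (all values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Scale features so no single feature dominates the Euclidean distances,
# then run K-means with the hyperparameters discussed above.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(
        n_clusters=4,       # K, chosen e.g. via the Elbow Method
        init="k-means++",   # smarter initialization than "random"
        n_init=10,          # number of restarts with different seeds
        max_iter=300,       # cap on iterations per run
        tol=1e-4,           # stop when centers move less than this
        random_state=0,
    ),
)
labels = pipeline.fit_predict(X)
print(labels[:10])
```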

Parallelization: Some implementations of K-means can parallelize the computation, which can significantly speed up the algorithm, especially when dealing with large datasets. In that case, the number of CPU cores or threads used for parallelization can also be adjusted.

Mini-batch K-means: If we are dealing with large datasets, consider using mini-batch K-means. This variant of K-means can be faster but might require tuning additional parameters such as the batch size and learning rate.
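
A small illustrative sketch using scikit-learn’s MiniBatchKMeans, with a made-up dataset size and batch size:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset to mimic the large-data case.
X, _ = make_blobs(n_samples=100_000, centers=5, random_state=1)

mbk = MiniBatchKMeans(
    n_clusters=5,
    batch_size=1024,  # number of samples per mini-batch
    n_init=3,
    random_state=1,
)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_.shape)
```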

K-Medoids and DBSCAN: Clustering Algorithms

Medoids are the most representative points in a group of data. They are selected so that the total distance from each data point in the group to its medoid is as small as possible. To understand medoids, compare them to the central points (centroids) in K-Means. The relationship between centroids and medoids is similar to that between averages and medians in a list: medoids and medians are always actual data points, whereas centroids and averages may not be. The main difference between K-Means and K-Medoids is how they group data: K-Means clusters the data based on the distances between data points and centroids, while K-Medoids clusters it based on the distances to medoids. Unusual data points can strongly influence K-Means, but K-Medoids does not rely on computed centroids, making it more resilient to outliers.

DBSCAN is a clustering algorithm that identifies groups of data points that are close to each other, even when the groups do not have a simple circular or rectangular shape. It can also detect data points that do not belong to any group. The algorithm works by measuring the distance between data points. If the distance is less than or equal to a predetermined value ε, the data points may be considered part of the same group. Additionally, a minimum number of data points, MinPts, must lie within the distance ε for a group to be formed. Based on these criteria, DBSCAN classifies data points as Core Points, Border Points, or Outliers. Core Points have at least MinPts data points within the distance ε. Border Points can be reached from a Core Point but have fewer than MinPts data points within the distance ε. Outliers are data points that do not belong to any group and cannot be reached from any Core Point.
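
A brief sketch of DBSCAN with scikit-learn, where eps plays the role of ε and min_samples the role of MinPts (values chosen only for this toy example):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-spherical clusters that K-means would struggle with.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius (ε); min_samples is MinPts.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks outliers; other labels are the discovered clusters.
print("Clusters found:", len(set(db.labels_)) - (1 if -1 in db.labels_ else 0))
print("Outliers:", list(db.labels_).count(-1))
```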

Clustering: Sorting and Grouping Data

Clustering is like sorting or grouping similar things together. Clustering in the context of computers and data means using special techniques to do this automatically, in some cases without even having to decide in advance how many piles or groups there should be.

It’s a way to find patterns or groups in data, which can be helpful for many things like organizing information, understanding customer behavior, or even organizing photos on your computer.

K-means Clustering:

  • K-means is like sorting things into groups based on their similarity.
  • We can decide how many groups we want, and the computer puts things in those groups.
  • It keeps moving things around until it finds the best groups.

Hierarchical Clustering:

  • Hierarchical clustering is like building a family tree for data.
  • It starts with every item in its own family, and then it joins them together step by step, like merging families in a family tree.
  • It shows how things are related at different levels, from big groups to smaller ones (see the sketch below).
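
A small sketch of hierarchical (Ward) clustering with SciPy, drawing the “family tree” (dendrogram) described above on synthetic data:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=2)

# Merge items step by step (Ward linkage) and draw the tree of merges.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()
```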

Logistic Regression: Understanding Coefficients and Predictions

Logistic regression is a math method used for “yes” or “no” questions. For example, it can help tell whether an email is spam or not, or whether a person will buy something or not. It uses special numbers called “coefficients” to make predictions.

Here’s what these coefficients do:

  1. Coefficient Values: For every piece of information we have, there’s a special number (let’s call it “β”) that goes with it. These numbers help us guess how likely the event is. They show how things change when we change the info slightly, keeping everything else the same.
  2. Intercept (Bias): Besides the info numbers, we also have one more number called the “intercept” or “bias.” It’s like the starting point for our guessing. When all the info is zero, this number gives us the basic chance of the event happening.
  3. Log-Odds: To make a guess, we find the log-odds. It’s a way to see how likely the event is. We add up the info numbers times the actual info we have, and then we add the intercept: Log-Odds = β₀ + β₁ * x₁ + β₂ * x₂ + … + βₙ * xₙ
  4. Odds Ratio: We can use the odds ratio to compare how different info affects our guess. If the odds ratio is more than 1, it means the info makes the event more likely. If it’s less than 1, the event is less likely: Odds Ratio = exp(β)
  5. Probability: Finally, to get a proper chance of the event happening, we use a special function (sigmoid function) to change the log-odds into a number between 0 and 1: Probability (p) = 1 / (1 + e^(-Log-Odds)). In all of this:
    • β₀ is the starting point.
    • β₁, β₂, and so on are the info numbers.
    • x₁, x₂, and so on are the actual info we have.
    • “e” is a special number (around 2.71828) used in math.
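
A tiny worked example in Python that traces these formulas with made-up coefficients (the β values and inputs are purely illustrative):

```python
import numpy as np

# Hypothetical coefficients and inputs, just to trace the formulas above.
beta_0 = -1.5                    # intercept (bias), β₀
betas = np.array([0.8, -0.4])    # β₁, β₂
x = np.array([2.0, 1.0])         # x₁, x₂

log_odds = beta_0 + np.dot(betas, x)       # β₀ + β₁*x₁ + β₂*x₂
odds_ratios = np.exp(betas)                # effect of a one-unit change in each feature
probability = 1 / (1 + np.exp(-log_odds))  # sigmoid turns log-odds into a probability

print(f"log-odds = {log_odds:.3f}")
print(f"odds ratios = {odds_ratios}")
print(f"probability = {probability:.3f}")
```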

Average Age by Race Comparative Analysis

Making a bar plot of age by race involves using bars to display the average age for different racial groups. Each bar represents a group, and its height shows the average age for that group. This makes it easier to compare the average ages between different races and see their age differences. In this dataset, the White and unknown race categories show roughly the same average age at which individuals were shot.
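
One possible way to build such a bar plot with pandas, assuming race and age columns and a placeholder file name:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("police_shootings.csv")  # hypothetical path

# Mean age per race category, drawn as a bar chart.
avg_age = df.groupby("race")["age"].mean().sort_values()
avg_age.plot(kind="bar", edgecolor="black")
plt.xlabel("Race")
plt.ylabel("Average age")
plt.title("Average Age by Race")
plt.tight_layout()
plt.show()
```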

Clustering algorithms are used in machine learning to group similar data points based on their features. There are different types of clustering algorithms, each with its own way of identifying clusters. K-means divides data into a set number of clusters by optimizing mean distances. Hierarchical clustering creates a hierarchy of clusters by merging or splitting them. DBSCAN identifies clusters by density, while Gaussian Mixture Models use Gaussian distributions to model data. Mean Shift finds dense regions without predefining the number of clusters. Spectral clustering uses eigenvectors to cluster data, and OPTICS ranks points according to density to discover clusters of various shapes. The algorithm choice depends on the data’s characteristics and the desired clustering outcomes.

About Logistic Regression

Logistic regression is a versatile statistical technique utilized for analyzing datasets where the dependent variable involves two distinct outcomes, typically denoted as 0 and 1. The application of logistic regression is widespread, finding utility in fields ranging from healthcare to finance and beyond. The essence lies in its ability to model the probability of a binary event occurring based on one or more independent variables. This makes it invaluable for scenarios like predicting patient survival, customer purchase behavior, or whether a credit card transaction is fraudulent.

At its core, logistic regression transforms a linear combination of independent variables through the logistic function. The logistic function maps the sum of the products of the independent variables and their corresponding coefficients to a probability value between 0 and 1. This probability represents the likelihood of the event belonging to the positive class. The logistic function’s mathematical expression is essential for converting the linear prediction into a probability, enabling a clear interpretation of the prediction. During the training phase, logistic regression estimates the parameters that maximize the likelihood of the observed data given the logistic function. This process involves iteratively adjusting the coefficients to minimize the difference between the predicted probabilities and the actual binary outcomes in the training dataset. The resultant model encapsulates the relationships between the independent variables and the log odds of the event, providing a sound basis for making predictions.
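
An illustrative scikit-learn sketch of fitting a logistic regression on synthetic binary data and reading off the coefficients, intercept, and predicted probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-outcome data in place of a real dataset.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

print("Coefficients:", model.coef_)    # one β per feature
print("Intercept:", model.intercept_)  # β₀
print("P(class=1) for first test row:", model.predict_proba(X_test[:1])[0, 1])
```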

Police Shootings in US Dataset

Today, I spent time reviewing the new dataset. The data comes from The Washington Post, which started tracking police-involved killings in the US by gathering information from various sources such as news reports, law enforcement websites, social media, and databases like Fatal Encounters and Killed by Police. They made sure to keep track of key details such as the race of the person killed, the circumstances surrounding the shooting, whether the person was armed, and whether they were experiencing a mental health crisis. In 2022, The Post improved its database by adding the names of the police departments involved in each shooting, aiming to assess accountability at the department level. The death of Michael Brown in 2014 in Ferguson, Missouri, sparked the Black Lives Matter movement and increased attention to police accountability. The Post’s database specifically focuses on cases in which a police officer, while on duty, shoots and kills a civilian. The FBI and the Centers for Disease Control and Prevention also track fatal police shootings, but their data is incomplete. Since 2015, The Post has documented more than double the number of fatal police shootings annually compared to federal records. This disparity has grown, with the FBI tracking only a third of departments’ fatal shootings in 2021.

The Post aims to create comprehensive records and updates its database regularly with information on fatal shootings and individual cases.

Understanding Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a method used in data analysis and machine learning to simplify complex datasets while retaining essential information. It starts by standardizing the data and then calculating a covariance matrix, revealing how features in the dataset are related. The eigenvectors and eigenvalues of this matrix point out the directions of maximum variance, called principal components. These components are arranged by the amount of variance they represent, enabling the selection of a smaller set of key components that effectively summarize the data. PCA is beneficial for tasks like visualization, noise reduction, and feature extraction, achieved by projecting the original data into a lower-dimensional space defined by these principal components.
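
A short sketch of PCA with scikit-learn on a standard toy dataset, keeping the two components with the most variance:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize, then project onto the two directions of maximum variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)
```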

On the other hand, the t-test is a statistical tool used to determine if there’s a significant difference between the means of two groups. There are two main types: the one-sample t-test compares a single sample mean to a known or hypothesized population mean, while the two-sample t-test examines the difference between the means of two independent samples. The t-test computes a t-statistic, which is then compared to a critical t-value from a t-distribution based on the desired level of significance and degrees of freedom. If the t-statistic surpasses the critical t-value, it suggests a significant difference between the means of the groups. This makes the t-test a valuable tool for hypothesis testing in various scientific and analytical domains.
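
A minimal two-sample t-test sketch with SciPy on simulated groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=35, scale=10, size=100)  # e.g. ages in one group
group_b = rng.normal(loc=38, scale=10, size=100)  # e.g. ages in another group

# Two-sample (independent) t-test comparing the group means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference between the group means at the 5% level.")
else:
    print("No significant difference detected.")
```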

Project: Transformations

The Box-Cox transformation is a mathematical transformation widely used in statistics and data analysis. It is primarily used to stabilize variance and make the data more normally distributed, both of which are essential assumptions for many statistical techniques.

The purpose of the Box-Cox transformation is to stabilize variance and make the data closer to a normal distribution. This can be particularly useful for data analysis techniques that assume normally distributed data, such as linear regression. To determine the optimal value of λ, we typically search for the value that maximizes the log-likelihood function, which measures how well the transformed data fit a normal distribution. This is often done using numerical optimization techniques.
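
A small SciPy sketch: scipy.stats.boxcox estimates λ by maximizing the log-likelihood when no λ is supplied (the data here is simulated and strictly positive, as Box-Cox requires):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1000)  # skewed, strictly positive data

# boxcox returns the transformed data and the λ that maximizes the log-likelihood.
transformed, fitted_lambda = stats.boxcox(data)
print(f"Optimal lambda: {fitted_lambda:.3f}")
print(f"Skewness before: {stats.skew(data):.2f}, after: {stats.skew(transformed):.2f}")
```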

The Yeo-Johnson transformation is a mathematical transformation used in statistics and data analysis. It’s an extension of the Box-Cox transformation and is designed to handle a broader range of data, including both positive and negative values, as well as zero.

The Yeo-Johnson transformation can handle a wider range of data than the original Box-Cox transformation, which is limited to positive data or data with positive shifts. The Yeo-Johnson transformation can be applied to both positive and negative data, making it more versatile in practical applications. Choosing the appropriate value of λ is critical to obtaining a meaningful transformation. Typically, this is done by searching for the optimal λ that maximizes the log-likelihood function, or by using other criteria such as minimizing the mean squared error.
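
An analogous sketch with scipy.stats.yeojohnson, which also estimates λ by maximizing the log-likelihood and accepts negative values (simulated data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Right-skewed data shifted so it contains negative values as well.
data = rng.exponential(scale=2.0, size=1000) - 1.0

# yeojohnson returns the transformed data and the fitted λ.
transformed, fitted_lambda = stats.yeojohnson(data)
print(f"Optimal lambda: {fitted_lambda:.3f}")
print(f"Skewness before: {stats.skew(data):.2f}, after: {stats.skew(transformed):.2f}")
```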

Project: About Outlier Detection and Removal

Outlier detection involves identifying unusual or abnormal data points that significantly deviate from the majority of the dataset. Various approaches like statistical methods, machine learning algorithms, or domain-specific rules can be used to pinpoint these outliers. Statistical techniques such as Z-score or interquartile range are commonly applied, whereas machine learning models like Isolation Forest or One-Class SVM utilize the data’s underlying patterns to detect outliers. Once identified, outliers can be either removed from the dataset or handled through imputation or transformation, depending on the context and purpose of the analysis.
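
A quick sketch of the Z-score and IQR rules on a simulated series with two planted outliers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# 500 ordinary values plus two planted outliers (120 and -30).
s = pd.Series(np.concatenate([rng.normal(50, 5, 500), [120, -30]]))

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print("Z-score outliers:", z_outliers.values)
print("IQR outliers:", iqr_outliers.values)
```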

In the process of outlier removal, careful consideration is essential to maintain data integrity and ensure that the removal or treatment aligns with the overall goals of the analysis. Removing outliers can help enhance the accuracy and reliability of statistical analysis or machine learning models by reducing the influence of erroneous or extreme data points. However, it’s crucial to strike a balance and use domain knowledge to determine the appropriate course of action, as excessive outlier removal may lead to loss of valuable information or distortion of the true data distribution.

Isolation Forest is an efficient and effective algorithm for outlier detection. It operates by isolating anomalies within a dataset by constructing a random forest of decision trees. The key idea is that anomalies are likely to require fewer random splits to be isolated, making them stand out in the tree structure. By measuring the average path length to isolate each data point, Isolation Forest identifies outliers as those with shorter average path lengths, representing their higher degree of separability from the majority of the data. This method is particularly adept at handling high-dimensional data and offers quick and accurate outlier detection, making it widely used in various domains including fraud detection, network security, and anomaly detection.
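
A short scikit-learn sketch of Isolation Forest on simulated data, where the contamination value is simply an assumed fraction of outliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),    # ordinary points
               rng.uniform(-6, 6, size=(10, 2))])  # a few scattered anomalies

# contamination is the expected fraction of outliers; predict() returns -1 for anomalies.
iso = IsolationForest(contamination=0.03, random_state=4).fit(X)
labels = iso.predict(X)

print("Number of points flagged as outliers:", int((labels == -1).sum()))
```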