DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clever way to group similar data points together, especially when we don’t know how many groups there are. It works by looking at how densely points are packed. If a point has enough nearby points, it’s a core point. DBSCAN picks an unvisited point and, if it is a core point, starts a group from it and adds all of its neighbors; whenever one of those neighbors is itself a core point, its neighbors join too, so the group keeps growing through the dense region. Then it moves on to the next unvisited point, and the process repeats until every point is assigned to a group or marked as a loner (noise). This helps find groups of different shapes and sizes in our data, even when there are some lonely, ungrouped points.
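The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `dbscan` and `region_query` are made up for this example, distance is assumed to be Euclidean, a point is assumed to count itself as one of its own neighbors, and the neighbor search is a brute-force scan that would be too slow for large datasets.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    labels = [None] * len(points)      # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:   # not a core point: tentatively noise
            labels[i] = NOISE
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:     # earlier "noise" turns out to be a border point
                labels[j] = cluster
            if labels[j] is not None:  # already assigned: nothing more to do
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),          # dense blob A
          (10, 10), (10, 11), (11, 10), (11, 11),  # dense blob B
          (5, 5)]                                   # isolated point
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two blobs come out as clusters 0 and 1, and the isolated point at (5, 5) is labeled -1, without the number of clusters ever being specified.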
DBSCAN is great at handling messy data and doesn’t require us to guess the number of groups in advance. It’s like finding clusters of stars in the night sky, where some stars are closer to others, forming groups, while some are all by themselves.
- Core Points: In DBSCAN, a “core point” is a data point that has at least a specified number of data points (minPts) within a certain distance (epsilon, ε) of it. Many implementations, including scikit-learn, count the point itself toward this number. Core points are typically located within dense regions of a cluster.
- Border Points: A “border point” is a data point that lies within ε of a core point but does not have enough neighbors of its own to be a core point. Border points are part of a cluster but sit on its periphery.
- Noise Points: Data points that are neither core nor border points are classified as “noise points” or outliers. They do not belong to any cluster.
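To make the three point types concrete, here is a small sketch that labels each point as core, border, or noise. It is only an illustration under stated assumptions: `classify_points` is a hypothetical helper, distance is Euclidean, and a point counts itself as one of its own neighbors.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    n = len(points)
    # Neighborhood of each point: indices within eps, the point itself included.
    neighborhoods = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighborhoods]
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighborhoods[i]):
            kinds.append("border")   # not dense itself, but next to a core point
        else:
            kinds.append("noise")    # neither core nor near any core point
    return kinds

points = [(0, 0), (0, 1), (1, 0), (1, 1),  # dense square
          (2, 1),                          # on the edge of the square
          (6, 6)]                          # far from everything
kinds = classify_points(points, eps=1.1, min_pts=3)
print(kinds)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point at (2, 1) has only one neighbor besides itself, so it is not a core point, but that neighbor, (1, 1), is a core point, which makes (2, 1) a border point.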
Parameters: DBSCAN has two primary parameters:
- ε (epsilon): The radius or maximum distance that defines the neighborhood around each data point. It determines which points are considered neighbors.
- minPts: The minimum number of data points that must lie within a point’s ε-neighborhood for that point to count as a core point. It sets how dense a region must be before it can form a cluster.
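The two parameters pull in opposite directions: a larger ε makes neighborhoods bigger, so more points qualify as core points, while a larger minPts makes the density requirement stricter. The sketch below counts core points on a tiny dataset for a few settings. The helper `count_core_points` is hypothetical, and it assumes Euclidean distance and that a point counts itself as a neighbor.

```python
from math import dist

def count_core_points(points, eps, min_pts):
    """Number of points with at least min_pts points (itself included) within eps."""
    return sum(
        sum(dist(p, q) <= eps for q in points) >= min_pts
        for p in points
    )

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (6, 6)]

for eps, min_pts in [(0.5, 3), (1.1, 3), (1.1, 4), (1.5, 4)]:
    print(eps, min_pts, count_core_points(points, eps, min_pts))
# 0.5 3 0
# 1.1 3 4
# 1.1 4 1
# 1.5 4 4
```

With ε = 0.5 no point has any neighbor but itself, so nothing is dense enough to start a cluster. Raising minPts from 3 to 4 at ε = 1.1 drops the core count from 4 to 1, and widening ε to 1.5 brings it back up to 4.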