hari_plotter.cluster module

class hari_plotter.cluster.Clustering(G: Graph, node_ids: ndarray, cluster_indexes: ndarray)[source]

Bases: ABC

Abstract base class representing a clustering of graph nodes. It provides a template for clustering algorithms.

clusters

A list of clusters, where each cluster is represented by a numpy array.

Type:

list[np.ndarray]

centroids

An array of centroids for the clusters.

Type:

np.ndarray

labels

An array indicating the label of each data point.

Type:

np.ndarray

parameters

A list of parameter names used for clustering.

Type:

list[str]

classmethod available_clustering_methods() list[str][source]
property cluster_labels: list[str]
classmethod clustering_method(clustering_name)[source]
classmethod create_clustering(G: Graph, clustering_method: str = 'K-Means Clustering', **kwargs) Clustering[source]

Factory method that creates an instance of a subclass of Clustering based on the provided method name and applies specified scaling functions to the data before clustering.

Parameters:
  • clustering_method – The name of the clustering method corresponding to a subclass of Clustering.

  • data – The data to be clustered, structured as a dictionary with the key ‘data’ and value as another dictionary mapping integers to lists of float values.

  • scale – An optional dictionary where keys are parameter names and values are functions (‘Linear’ or ‘Tanh’) to be applied to the parameter values before clustering. If not provided, no scaling is applied.

Returns:

An instance of the subclass of Clustering that corresponds to the given method name.

Raises:

ValueError – If the method name is not recognized (i.e., not found in the clustering_methods).
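The name-based dispatch described above can be pictured with a small registry sketch. This is not hari_plotter's implementation: `Base`, `KMeansLike`, `registry`, `register`, and `create` are hypothetical names introduced for illustration, and the only behavior taken from the text is that an unrecognized method name raises ValueError.

```python
from abc import ABC, abstractmethod


class Base(ABC):
    """Hypothetical stand-in for an abstract clustering base class."""

    # Maps a human-readable method name to the subclass implementing it.
    registry = {}

    @classmethod
    def register(cls, name):
        """Class decorator that records a subclass under `name`."""
        def decorator(subclass):
            cls.registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def create(cls, method, **kwargs):
        """Look up `method` and build the matching subclass."""
        if method not in cls.registry:
            raise ValueError(f"Unknown clustering method: {method!r}")
        return cls.registry[method](**kwargs)

    @abstractmethod
    def get_number_of_clusters(self):
        ...


@Base.register("K-Means Clustering")
class KMeansLike(Base):
    """Hypothetical concrete subclass; it only stores a cluster count."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def get_number_of_clusters(self):
        return self.n_clusters


# Extra keyword arguments are forwarded to the selected subclass.
clustering = Base.create("K-Means Clustering", n_clusters=3)
```

Keeping the registry on the base class means callers only need the method name as a string, which matches how `clustering_method` is passed in the signature above.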

abstract classmethod from_graph(G: Graph, **kwargs) Clustering[source]
get_cluster_labels(**kwargs) list[str][source]
abstract get_number_of_clusters() int[source]

Abstract method to get the number of clusters.

get_values(key: str | list[str]) list[ndarray][source]

Returns the values corresponding to the given parameter(s) for all points in the clusters.

Parameters:
  • key (Union[str, list[str]]) – The parameter name or list of parameter names.

  • keep_scale (bool) – For convenience, some values are stored in scaled form, that is, as the output of their scale function. Use this flag to get the values as stored; by default, the actual (unscaled) values are returned.

Returns:

A list of numpy arrays, where each array corresponds to a cluster and contains the values of the specified parameter(s) for each point in that cluster.

Return type:

list[np.ndarray]
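The shape of the return value (one array per cluster) can be illustrated with a minimal sketch. It assumes each point carries a single integer cluster index; `values_by_cluster` and the sample values are hypothetical, and plain lists stand in for numpy arrays.

```python
def values_by_cluster(values, cluster_indexes, n_clusters):
    """Group per-point values into one list per cluster."""
    grouped = [[] for _ in range(n_clusters)]
    for value, index in zip(values, cluster_indexes):
        grouped[index].append(value)
    return grouped


# Made-up values of one parameter and the cluster index of each point.
values = [0.1, 0.9, 0.2, 0.8, 0.85]
cluster_indexes = [0, 1, 0, 1, 1]
grouped = values_by_cluster(values, cluster_indexes, n_clusters=2)
```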

label_to_index(label: str) int[source]
labels_nodes_dict() dict[str, tuple[tuple[int]]][source]
nodes_by_index(index: int) tuple[tuple[int]][source]

Returns the nodes that are in the cluster with the given index.

nodes_by_label(label: str) tuple[tuple[int]][source]

Returns the nodes that are in the cluster with the given label

nodes_labels_default_dict()[source]
nodes_labels_dict()[source]
reorder_labels(new_order: list[int])[source]
class hari_plotter.cluster.DBSCANClustering(G: Graph, data: ndarray, node_ids: ndarray, parameters: list[str], scales: list[str], cluster_indexes: ndarray)[source]

Bases: ParameterBasedClustering

A DBSCAN clustering representation, extending the generic Clustering class.

classmethod from_graph(G: Graph, clustering_parameters: tuple[str] | list[str], scale: list[str] | dict[str, str] | None = None, eps: float = 0.5, min_samples: int = 5) Clustering[source]

Creates an instance of DBSCANClustering from a HariGraph, applying the specified scaling to each parameter if needed.

Parameters:
  • G – HariGraph.

  • clustering_parameters – list of clustering parameters

  • scale – An optional dictionary where keys are parameter names and values are functions (‘Linear’ or ‘Tanh’) to be applied to the parameter values before clustering.

  • eps – The maximum distance between two samples for them to be considered as in the same neighborhood.

  • min_samples – The number of samples in a neighborhood for a point to be considered as a core point.

Returns:

An instance of DBSCANClustering with clusters and labels determined from the data.

Return type:

DBSCANClustering

Raises:

ValueError – If no data points remain after removing NaN values or if an unknown scaling function is specified.
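How `eps` and `min_samples` interact can be seen in a short core-point check. This is a sketch of the general DBSCAN definition, not of this class's internals: `core_points` is a hypothetical helper, Euclidean distance is assumed, and counting the point itself toward `min_samples` follows the scikit-learn convention, which this library is only assumed to share.

```python
import math


def core_points(points, eps, min_samples):
    """Return indices of points that have at least `min_samples`
    points (the point itself included) within distance `eps`."""
    core = []
    for i, p in enumerate(points):
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours >= min_samples:
            core.append(i)
    return core


# Three points close together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
core = core_points(points, eps=0.5, min_samples=3)
```

The isolated point has only itself in its neighborhood, so it is not a core point; with a larger `min_samples`, none of the points would qualify.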

get_number_of_clusters() int[source]

Get the number of clusters.

Returns:

The number of clusters (excluding noise points).

Return type:

int

predict_cluster(data_points: ndarray, points_scaled: bool = False, parameters: None | tuple[str] = None) ndarray[source]

Predicts the cluster indices to which new data points belong based on the clusters formed.

Parameters:
  • data_points – The new data points’ parameter values as a numpy array.

  • points_scaled – A boolean indicating whether the data points are already scaled.

Returns:

An array of indices of the closest cluster point to each data point.

Return type:

np.ndarray

Raises:

ValueError – If the dimensionality of the data points does not match that of the original data.

recluster(eps: float, min_samples: int)[source]
reorder_clusters(new_order: list[int])[source]

Reorders clusters and associated information based on a new order.

Parameters:

new_order – A list containing the indices of the clusters in their new order.

Raises:

ValueError – If new_order does not contain all existing cluster indices.
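A minimal sketch of the relabeling, under stated assumptions: `new_order[k]` is read as the old index of the cluster that moves to position `k`, and negative indices are treated as noise and left unchanged (the scikit-learn convention of labeling noise as -1). The helper name `reorder` is hypothetical.

```python
def reorder(cluster_indexes, new_order):
    """Relabel points so that old cluster new_order[k] becomes cluster k."""
    existing = sorted({i for i in cluster_indexes if i >= 0})
    if sorted(new_order) != existing:
        raise ValueError("new_order must contain every existing cluster index once")
    old_to_new = {old: new for new, old in enumerate(new_order)}
    # Negative labels (assumed to mark noise) are passed through untouched.
    return [old_to_new[i] if i >= 0 else i for i in cluster_indexes]


cluster_indexes = [0, 1, 1, 2, -1]
relabeled = reorder(cluster_indexes, new_order=[2, 0, 1])
```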

unscaled_centroids() ndarray[source]

Calculate centroids for the clusters, excluding noise points.

Returns:

The centroids of the clusters.

Return type:

np.ndarray

class hari_plotter.cluster.KMeansClustering(G: Graph, data: ndarray, node_ids: ndarray, parameters: list[str], scales: list[str], cluster_indexes: ndarray)[source]

Bases: ParameterBasedClustering

A KMeans clustering representation, extending the generic Clustering class.

calculate_silhouette_scores(max_clusters: int = 10) list[float][source]

Calculate the silhouette scores for different numbers of clusters.

Parameters:

max_clusters – The maximum number of clusters to consider.

Returns:

A list of silhouette scores for each number of clusters.

Return type:

list[float]
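For readers unfamiliar with the metric, the following sketch computes a mean silhouette score for one fixed labeling, using the standard definition s = (b - a) / max(a, b). It is an illustration only: `silhouette_score` here is a hypothetical pure-Python helper with Euclidean distance, not the routine this method calls.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
```

Scores close to 1 indicate compact, well-separated clusters; a labeling that splits the two visible groups would score much lower.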

calculate_wcss(max_clusters: int = 10) list[float][source]

Calculate the within-cluster sum of squares (WCSS) for different numbers of clusters.

Parameters:

max_clusters – The maximum number of clusters to consider.

Returns:

A list of WCSS values for each number of clusters.

Return type:

list[float]
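WCSS for a single labeling is the sum of squared distances from each point to the centroid of its cluster. The sketch below shows that computation for a fixed labeling; `wcss` is a hypothetical helper and plain tuples stand in for numpy rows.

```python
def wcss(points, labels):
    """Within-cluster sum of squares for one labeling."""
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)

    total = 0.0
    for members in groups.values():
        dim = len(members[0])
        # Centroid: the per-dimension mean of the cluster's members.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_clusters = wcss(points, [0, 0, 1, 1])
one_cluster = wcss(points, [0, 0, 0, 0])
```

WCSS drops as the number of clusters grows, which is why the Elbow method looks for the point where further decreases become small rather than for the minimum.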

classmethod from_graph(G: Graph, clustering_parameters: tuple[str] | list[str], scale: list[str] | dict[str, str] | None = None, n_clusters: int = -1, method: str = 'silhouette', max_clusters: int = 10) Clustering[source]

Creates an instance of KMeansClustering from a HariGraph, applying the specified scaling to each parameter if needed.

Parameters:
  • G – HariGraph.

  • clustering_parameters – list of clustering parameters

  • scale – An optional dictionary where keys are parameter names and values are functions (‘Linear’ or ‘Tanh’) to be applied to the parameter values before clustering.

  • n_clusters – The number of clusters to form. If -1, the optimal number of clusters will be determined.

  • method – The method to use for determining the optimal number of clusters (‘elbow’ or ‘silhouette’).

  • max_clusters – The maximum number of clusters to consider.

Returns:

An instance of KMeansClustering with clusters, centroids, and labels determined from the data.

Return type:

KMeansClustering

Raises:

ValueError – If no data points remain after removing NaN values or if an unknown scaling function is specified.

get_number_of_clusters() int[source]

Get the number of clusters.

Returns:

The number of clusters.

Return type:

int

optimal_number_of_clusters(method: str = 'silhouette', max_clusters: int = 10) int[source]

Determine the optimal number of clusters using the specified method.

Parameters:
  • method – The method to use (‘elbow’ or ‘silhouette’).

  • max_clusters – The maximum number of clusters to consider.

Returns:

The optimal number of clusters.

Return type:

int
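For the 'silhouette' method, one common selection rule is to take the cluster count with the highest score. The sketch below assumes that rule and assumes the first score corresponds to two clusters (the silhouette is undefined for a single cluster); neither detail is confirmed by the text above, and `best_k_by_silhouette` and the scores are made up for illustration.

```python
def best_k_by_silhouette(scores, first_k=2):
    """Return the cluster count whose silhouette score is highest.

    scores[i] is assumed to belong to first_k + i clusters.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return first_k + best_index


# Illustrative scores for k = 2, 3, 4, 5.
scores = [0.41, 0.63, 0.58, 0.37]
best_k = best_k_by_silhouette(scores)
```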

plot_elbow_method(max_clusters: int = 10)[source]

Plot the WCSS values to use the Elbow method for determining the optimal number of clusters.

Parameters:

max_clusters – The maximum number of clusters to consider.

plot_silhouette_scores(max_clusters: int = 10)[source]

Plot the silhouette scores to help determine the optimal number of clusters.

Parameters:

max_clusters – The maximum number of clusters to consider.

predict_cluster(data_points: ndarray, points_scaled: bool = False, parameters: None | tuple[str] = None) ndarray[source]

Predicts the cluster indices to which new data points belong based on the centroids.

Parameters:

data_points – The new data points’ parameter values as a numpy array.

Returns:

An array of indices of the closest cluster centroid to each data point.

Return type:

np.ndarray

Raises:

ValueError – If the dimensionality of the data points does not match that of the centroids.
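Nearest-centroid assignment, including the dimensionality check mentioned under Raises, can be sketched as follows. `nearest_centroid` is a hypothetical helper, Euclidean distance is assumed, and scaling of the input points is left out.

```python
import math


def nearest_centroid(data_points, centroids):
    """Return, for each data point, the index of its closest centroid."""
    dim = len(centroids[0])
    result = []
    for p in data_points:
        if len(p) != dim:
            raise ValueError(
                f"point has {len(p)} dimensions, centroids have {dim}"
            )
        closest = min(
            range(len(centroids)),
            key=lambda i: math.dist(p, centroids[i]),
        )
        result.append(closest)
    return result


centroids = [(0.0, 0.0), (10.0, 10.0)]
assigned = nearest_centroid([(1.0, 1.0), (9.0, 8.0), (4.0, 4.0)], centroids)
```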

recluster(n_clusters)[source]
reorder_clusters(new_order: list[int])[source]

Reorders clusters and associated information based on a new order.

Parameters:

new_order – A list containing the indices of the clusters in their new order.

Raises:

ValueError – If new_order does not contain all existing cluster indices.

unscaled_centroids() ndarray[source]

A numpy array representing the centroids of the clusters. Each row in this array corresponds to a centroid.

class hari_plotter.cluster.ParameterBasedClustering(G: Graph, node_ids: ndarray, cluster_indexes: ndarray, parameters: list[str], scales: list[str])[source]

Bases: Clustering

centroids(keep_scale: bool = False)[source]
degree_of_membership(data_point: list[float]) list[float][source]

Predicts the ‘probability’ of belonging to each cluster for a new data point.

If the clustering method does not provide probabilities, this method will return a list with a 1 at the index of the assigned cluster and 0s elsewhere.

Parameters:

data_point – The new data point’s parameter values as a list of floats.

Returns:

A list with a 1 at the index of the assigned cluster and 0s elsewhere.

Return type:

list[float]
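The fallback described above (a hard assignment expressed as a list) amounts to a one-hot encoding. A minimal sketch, with `one_hot_membership` as a hypothetical helper that takes an already predicted cluster index:

```python
def one_hot_membership(assigned_cluster, n_clusters):
    """Return 1.0 at the assigned cluster's index and 0.0 elsewhere."""
    if not 0 <= assigned_cluster < n_clusters:
        raise ValueError("assigned_cluster is out of range")
    return [1.0 if i == assigned_cluster else 0.0 for i in range(n_clusters)]


membership = one_hot_membership(assigned_cluster=2, n_clusters=4)
```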

get_indices_from_parameters(params: str | list[str]) int | list[int][source]

Returns the indices corresponding to the given parameter(s).

Parameters:

params (Union[str, list[str]]) – The parameter name or list of parameter names.

Returns:

The index or list of indices corresponding to the given parameter(s). Returns None if a parameter is not present.

Return type:

Union[int, list[int]]
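The str-or-list behavior can be sketched with a plain lookup over the parameter list. `indices_from_parameters` and the parameter names are hypothetical; returning None per missing name inside a list is an assumption, since the text only states that None is returned when a parameter is absent.

```python
def indices_from_parameters(parameters, params):
    """Map one name to its index, or a list of names to a list of indices."""
    def lookup(name):
        return parameters.index(name) if name in parameters else None

    if isinstance(params, str):
        return lookup(params)
    return [lookup(name) for name in params]


# Hypothetical parameter names used for clustering.
parameters = ["Opinion", "Neighbor mean opinion"]
single = indices_from_parameters(parameters, "Neighbor mean opinion")
several = indices_from_parameters(parameters, ["Opinion", "Activity"])
```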

get_parameters_from_indices(indices: int | list[int]) str | list[str][source]

Returns the parameter names corresponding to the given index/indices.

Parameters:

indices (Union[int, list[int]]) – The index or list of indices.

Returns:

The parameter name or list of parameter names corresponding to the given index/indices.

Return type:

Union[str, list[str]]

abstract predict_cluster(data_point: list[float], parameters: None | tuple[str] = None) int[source]

Abstract method to predict the cluster for a new data point.

prepare_data_point_for_prediction(data_points, parameters)[source]
abstract reorder_clusters(new_order: list[int])[source]

Abstract method to reorder clusters based on a new order. Assumes that the new_order list contains the indices of the clusters in their new order.

scale_funcs = {'Linear': {'direct': <function ParameterBasedClustering.<lambda>>, 'inverse': <function ParameterBasedClustering.<lambda>>}, 'Tanh': {'direct': <ufunc 'tanh'>, 'inverse': <ufunc 'arctanh'>}}
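The 'direct' and 'inverse' entries form a round trip: a value scaled with 'direct' is recovered with 'inverse'. The sketch below mirrors the structure of the mapping with the standard library instead of numpy. The 'Linear' lambdas are assumed to be the identity, since their bodies are not shown above.

```python
import math

# Same layout as the class attribute, with math functions standing in
# for the numpy ufuncs; the identity lambdas are an assumption.
scale_funcs = {
    "Linear": {"direct": lambda x: x, "inverse": lambda x: x},
    "Tanh": {"direct": math.tanh, "inverse": math.atanh},
}

raw = 0.75
scaled = scale_funcs["Tanh"]["direct"](raw)
restored = scale_funcs["Tanh"]["inverse"](scaled)
```

Note that the inverse of tanh is only defined on the open interval (-1, 1), so the round trip applies to values that were produced by the direct function.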
abstract unscaled_centroids() list[ndarray][source]

A numpy array representing the centroids of the clusters. Each row in this array corresponds to a centroid.

class hari_plotter.cluster.ValueIntervalsClustering(G: Graph, data: ndarray, parameter_boundaries: list[list[float]], node_ids: ndarray, parameters: list[str], scales: list[str], cluster_indexes: ndarray)[source]

Bases: ParameterBasedClustering

Value Intervals clustering representation, extending the generic Clustering class.

find_cluster_index(point: ndarray) int | None[source]

Identifies the cluster index for a given point based on the parameter boundaries.

Parameters:

point – The data point’s parameter values as a numpy array.

Returns:

The cluster index if the cell is a cluster, otherwise None.

Return type:

int or None

find_cluster_indices_on_grid(point: ndarray) ndarray[source]

Determines the indices of the clusters a point belongs to based on parameter boundaries.

Parameters:

point – The data point’s parameter values as a numpy array.

Returns:

An array of the indices of the clusters the point belongs to.

Return type:

np.ndarray
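Locating a point on a grid of parameter boundaries can be sketched with a binary search per parameter. This is an illustration under assumptions: `grid_indices` is a hypothetical helper, each boundary list is sorted in ascending order, and a value equal to a boundary falls into the upper interval.

```python
from bisect import bisect_right


def grid_indices(point, parameter_boundaries):
    """Return, per parameter, the index of the interval containing the value."""
    if len(point) != len(parameter_boundaries):
        raise ValueError("one boundary list is required per parameter")
    return [
        bisect_right(bounds, value)
        for value, bounds in zip(point, parameter_boundaries)
    ]


# One boundary for the first parameter, two for the second.
parameter_boundaries = [[0.0], [-0.5, 0.5]]
inside = grid_indices((0.3, 0.0), parameter_boundaries)
edge = grid_indices((-1.0, 0.7), parameter_boundaries)
```

With n boundaries a parameter is split into n + 1 intervals, so the example grid has 2 x 3 = 6 cells.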

classmethod from_graph(G: Graph, parameter_boundaries: list[list[float]], clustering_parameters: list[str], scale: list[str] | dict[str, str] | None = None) ValueIntervalsClustering[source]

Creates an instance of ValueIntervalsClustering from a HariGraph.

Parameters:
  • G – HariGraph.

  • parameter_boundaries – list of lists, each containing the boundaries for a parameter.

  • clustering_parameters – list of parameter names.

  • scale – Optional scaling functions for the clustering_parameters.

Returns:

An instance with nodes assigned to clusters based on the parameter boundaries.

Return type:

ValueIntervalsClustering

Raises:

ValueError – If the number of clustering_parameters does not match the number of parameter boundaries.

get_number_of_clusters() int[source]

Get the number of clusters.

predict_cluster(data_points: ndarray, points_scaled: bool = False, parameters: None | tuple[str] = None) ndarray[source]

Predicts the cluster indices to which new data points belong based on the centroids.

Parameters:

data_points – The new data points’ parameter values as a numpy array.

Returns:

An array of indices of the closest cluster centroid to each data point.

Return type:

np.ndarray

Raises:

ValueError – If the dimensionality of the data points does not match that of the centroids.

recluster()[source]

Recalculates the cluster indices for each data point based on the current parameter boundaries.

reorder_clusters(new_order: list[int])[source]

Reorders clusters based on a new order. Assumes that the new_order list contains the indices of the clusters in their new order.

unscaled_centroids() ndarray[source]

A numpy array representing the centroids of the clusters. Each row in this array corresponds to a centroid.