rep.tex

%! Author = anton
%! Date = 08/05/2024

% Preamble
\documentclass[11pt]{article}

% Packages
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{algorithm2e}
\usepackage{float}
\usepackage{hyperref}
\usepackage[a4paper, margin=2.2cm]{geometry}
\usepackage{listings}

\title{\textbf{Comparison of Disjoint-Sets Implementations}}
\author{Antonio Bruno}
\date{May 2024}


% Document
\begin{document}
    \maketitle
    \tableofcontents
    \newpage
    
    \section{Introduction}\label{sec:introduction}

    \subsection{Problem statement}\label{subsec:problem-statement}
    In this report I will discuss and compare two different methods for implementing data structures for \textbf{disjoint sets}, i.e.:
    \begin{itemize}
        \item The implementation based on \textbf{linked lists}
        \item The implementation based on \textbf{rooted trees}
    \end{itemize}
    In order to compare the efficiency of these implementations, I will test disjoint sets based algorithms for finding \textbf{connected components} in non-oriented graphs with different sizes and densities.

    \subsection{Hardware and software specifications}\label{subsec:hardware-and-software-specifications}
    The code will be run on a computer with the following specifications:
    \begin{itemize}
        \item \textbf{Processor}: 11th Gen Intel Core i7-1165G7 @ 2.80GHz
        \item \textbf{RAM}: 16 GB
        \item \textbf{OS}: Windows 11
        \item \textbf{Python version}: 3.12.2
    \end{itemize}

    \section{Theoretical analysis of the problem}\label{sec:theoretical-analysis}
    \subsection{Fundamental aspects}
    We will now cover the fundamental information required to understand disjoint sets and the operations I am going to analyze.
    \subsubsection{Disjoint sets}
    A disjoint-set data structure maintains a collection $\mathcal{S} = \{S_1, S_1, \dots, S_k\}$ of disjoint dynamic sets, i.e., sets which have no element in common ($S_i \ \cap \ S_j = \emptyset, \forall \ i,j$).
    To identify each set, a \textbf{representative} is chosen between the members of the set; it doesn't matter which member is chosen as representative, as long as it doesn't change until the set is modified. 

    A disjoint-set data structure must support the following three operations:
    \begin{itemize}
        \item \textbf{make-set(x)}: where object x does not already belong to some other set, creates a new set whose only member (and thus representative) is x.
        \item \textbf{union(x,y)}: unites two disjoint sets containing x and y, say $S_x$ and $S_y$, into a new set that is the union of these two sets. A new representative is chosen (usually it's one of the representatives of the former sets).
        \item \textbf{find(x)}: returns a pointer of the representative of the unique set containing x.
    \end{itemize}
    \subsubsection{Connected components of a graph}\label{subsubsec:connected-components}
    One of the many applications of disjoint-set data structures is determining the connected components of a non-oriented graph, i.e., subsets of vertices where each vertex is reachable from every other vertex within the same subset by traversing edges; for example the graph in  figure \ref{fig:connected-components-example}a has 3 connected components.
    \begin{figure}[H]
        \centering
        \includegraphics[width = \textwidth]{images/connected-components.png}
        \caption{ \textbf{(a)} A graph with 4 connected components. \textbf{(b)} The collection of disjoint sets after processing each edge.}
        \label{fig:connected-components-example}
    \end{figure}
    
    We can use disjoint-sets data structures to compute connected components in the following way: initially, each vertex \textbf{v} is placed in its own set; then, for each edge \textbf{(u, v)} the sets containing \textbf{u} and \textbf{v} are united; after all the edges are processed, two vertices belong to the same connected components if and only if the objects corresponding to the vertices belong to the same set, as shown in figure \ref{fig:connected-components-example}b.
    
    This procedure can be described by the following pseudo-code:
    \begin{algorithm}[H]
        \SetAlgoLined
        \KwIn{Graph $G = (V, E)$}
        \For{each vertex $v \in V$}{
            make-set($v$)\;
        }
        \For{each edge $(u, v) \in E$}{
            \If{find($u$) $\neq$ find($v$)}{
                 union($u, v$)\;
            }
        }
        \title{connected-components(G)}
        \label{algo:connected-components}
    \end{algorithm}
    Computing time for this algorithm depends on the specific implementation of disjoint sets.


    \subsection{Implementation of disjoint-sets data structures}
    Let's now focus on the two main implementations for disjoint-sets data structures.
    \subsubsection{Linked-list representation of disjoint sets}
    A simple implementation of disjoint-sets data structures is based on \textbf{linked lists}. In particular, every set is a linked list which has a pointer to its head object, i.e., the first element of the list as well as the \textbf{representative} of the set, and a pointer to its tail object, i.e, the last element of the list. Each object in the set contains a set member, a pointer to the next object in the list and a pointer back to the set object (figure \ref{fig:linked-list-representation}a).

    \begin{figure}[H]
        \centering
        \includegraphics[width=\textwidth]{images/linked-list-representation.png}
        \caption{\textbf{(a)} Linked-list representation of two sets. \textbf{(b)} The result of union(g,e).}
        \label{fig:linked-list-representation}
    \end{figure}

    With this representation, a simple \textbf{union(x,y)} operation appends \textbf{y}'s list onto the end of \textbf{x}'s list and updates the pointer to the set of every object of \textbf{y}'s list, which now has to point to \textbf{x}'s set (figure \ref{fig:linked-list-representation}b). As you can imagine, the \textbf{union} operation takes significantly more time than the \textbf{find} and \textbf{make-set} operations, because it must iterate through a whole set.

    Sometimes the simple \textbf{union} operation may append a longer list onto a shorter list, resulting in a waste of time. The \textbf{weighted-union heuristic} solves this issue by appending the shorter list to the longer list (this requires keeping a \textbf{size} attribute for every set object), resulting in a faster running time for the \textbf{union} operation.

    \subsubsection{Disjoint-sets forests}\label{subsubsec:disjoint-set-forests}
    A faster implementation of disjoint sets represents sets by \textbf{rooted trees}, with each node containing one member and the pointer to its parent and each tree representing one set, with the \textbf{representative} being its root (figure \ref{fig:disjoint-sets-forest}a).

    \begin{figure}[H]
        \centering
        \includegraphics{images/disjoint-sets-forest.png}
        \caption{\textbf{(a)} Trees representing the sets of figure \ref{fig:linked-list-representation}. \textbf{(b)} the result of union(g,e).}
        \label{fig:disjoint-sets-forest}
    \end{figure}
    
    The \textbf{union(x,y)} operation simply sets the root of \textbf{x} as the new root of the root of \textbf{y} (figure \ref{fig:disjoint-sets-forest}b). However the pointer to root of each tree is returned by the \textbf{find(x)} operation by following parent pointers from \textbf{x}; thus the \textbf{union} operation takes linear time.

    There are two heuristics which significantly reduce the time complexity of the \textbf{union} operation, i.e., \textbf{union by rank} and \textbf{path compression}; in this report I will only focus on the latter.

    \textbf{Path compression} is quite simple yet highly effective: in the \textbf{find(x)} operation, once the root of tree has been found, it is set as the parent of all the nodes of the tree, indeed \textit{compressing} the tree, as shown in figure \ref{fig:path-compression}; this ensures that succeeding \textbf{find} operations finding the root of the tree will take less time.  

    \begin{figure}[H]
        \centering
        \includegraphics[width=0.8\textwidth]{images/path-compression.png}
        \caption{\textbf{(a)} A tree representing a set prior to executing find(a) and \textbf{(b)} after executing find(a)}
        \label{fig:path-compression}
    \end{figure}

    \subsection{Expected performance}
    Determining connected components of a graph can be more or less efficient depending on the implementation of disjoint sets and eventual heuristics.
    We can of course expect the implementation with linked lists and simple union to be in general less efficient than the other two (i.e., linked lists with weighted union and forest trees with path compression). However, the comparison between the latter can be interesting because there's no actual a priori indicator of which one will be more efficient.
    In particular, the time complexities for determining connected components in each of the implementations are the following:
    \begin{itemize}
        \item \textbf{linked lists with simple union:} $\mathcal{O}(m\cdot n) $
        \item \textbf{linked lists with weighted union:} $\mathcal{O}(m + n \cdot \log_{2}n)$
        \item \textbf{rooted trees with path compression:} $\mathcal{O}(n + f \cdot (1 + \log_{2+f/n}n)$
    \end{itemize}
    where \textbf{n} is the number of \textbf{make-set} operations, \textbf{f} is the number of \textbf{find} operations and \textbf{m} is the number of total operations. I won't spend time proving these results but we can take them as true. As you can see, the efficiency of the latter implementations really depends on the type of graph we're operating on and it's difficult to make general predictions;  in section \ref{sec:tests-and-results} we will see the actual running time of these algorithms. 

    \section{Code documentation}

    \subsection{Content outline and interaction between modules}
    I decided to divide the code into five modules, i.e.:
    \begin{itemize}
        \item \textbf{linked\_list\_disjoint\_set.py:} this module handles the linked list representation of disjoint sets with its classes and methods.
        \item \textbf{disjoint\_set\_forest.py:} this module handles the rooted trees representation of disjoint sets with its classes and methods
        \item \textbf{graph.py:} this module handles the graph data structure and provides methods for generating graphs. 
        \item \textbf{test.py:} this module provides the methods for testing the efficiency of different implementations of the \textbf{connected-components} algorithm.
        \item \textbf{main.py:} the main executable program, where I run the tests.
    \end{itemize}
    The classes are organized as follows:
    \begin{figure}[H]
        \centering
        \includegraphics[width=\textwidth]{images/class-diagram.png}
        \caption{Class Diagram}
        \label{fig:class-diagram}
    \end{figure}

    \subsection{Implementation choices}
    I will now briefly explain the reasons behind some of the implementation choices I made.
    \begin{itemize}
        \item \textbf{Nodes and edges:} Nodes are simply represented by integers, whereas edges are represented by pairs of integers. This implementation is simple and makes sure that the graph object is not modified after running the connected-components algorithms, because new data structures are created for storing nodes.
        \item \textbf{Graph generation:} The graphs are generated as follows: given a number of nodes and a number of edges, nodes are generated with sequential values whereas edges are randomly generated; This generation is fast and unbiased.
        \item \textbf{UnionFind classes:} each UnionFind class has a data structure for storing the sets, i.e., a dictionary for linked lists and a list of nodes for rooted trees; this is because the former requires a direct access to the sets, whereas the latter requires following parent pointers.
        \item \textbf{Shuffling edges:} in the connected-components algorithms, the edges of the graph are shuffled before processing them; this is because the order of the edges can affect the running time of the algorithms. Since each test runs the algorithm multiple times, shuffling the edges ensures a more accurate average running time.
    \end{itemize}
    

    \subsection{Description of implemented methods}
    I will now briefly explain the functionality of each implemented method for each class/module.
    \begin{itemize}
        \item \textbf class \textbf{Graph}:
        \begin{itemize}
            \item \textbf{\_\_init\_\_()}: the class constructor; receives as input the number of nodes and the density and generates a random non oriented graph.
            \item \textbf{print\_edges()}: simply prints every edge of the graph; this method was only used for debugging purposes.
        \end{itemize}
        \item class \textbf{UnionFind} from module \textbf{linked\_list\_disjoint\_set}: 
        \begin{itemize}
            \item \textbf{make\_set()}: receives an integer value as input and creates a new set (a \textbf{LinkedList} object) containing the \textbf{Item} with the given value.
            \item \textbf{find()}: receives a value as input and returns the head (an \textbf{Item} object) of the set containing the \textbf{Item} with the given value.
            \item \textbf{union()}: receives two values as input and, if the items with the given values belong to different sets, performs the union between them, as well as updates the size of the new set.
            \item \textbf{weighted\_union()}: same functionality as \textbf{union()}, but before performing the union, checks which set is smaller and eventually appends it to the larger set.
            \item \textbf{connected\_components()}: receives a non-oriented graph as input and determines its connected components, which are stored in the \textbf{sets} dictionary; for details see paragraph \ref{subsubsec:connected-components}.
            \item \textbf{print\_connected\_components()}: prints the list of connected components of the given graph; used only for debugging purposes.
        \end{itemize}
        \item class \textbf{UnionFind} from module \textbf{disjoint\_set\_forest}:
        \begin{itemize}
            \item \textbf{make\_set()}:  receives an integer value as input and creates a new tree, i.e., a new \textbf{Node} with the given value and whose root is itself.
            \item \textbf{find()}: receives a value as input and returns the root (a \textbf{Node} object) of the tree containing the \textbf{Node} with the given value, as well as executes \textbf{path compression} (see paragraph \ref{subsubsec:disjoint-set-forests})
            \item \textbf{union()}: receives two values as input and, if the nodes with the given values belong to different trees, performs the union between them.
            \item \textbf{connected\_components()}: receives a non-oriented graph as input and determines its connected components, which are stored implicitly the \textbf{nodes} list; for details see paragraph \ref{subsubsec:connected-components}.
            \item \textbf{print\_connected\_components()}: prints the list of connected components of the given graph; used only for debugging purposes.
        \end{itemize}
        \item module \textbf{test}:
        \begin{itemize}
            \item \textbf{test\_linked\_list\_time()}: receives three parameters as input: a Graph \textit{graph}, a boolean \textit{weighted} and an integer \textit{measurements}. Returns the average running time of \textbf{connected\_components()} on \textit{graph} using the linked-list data structure, executed \textit{measurements} times with the weighted-union heuristic if \textit{weighted} is True. This method is indeed used for testing both  simple and weighted union on linked lists.
            \item \textbf{test\_forest\_time()}: receives \textit{graph} and \textit{measurements}. Executes \textbf{connected\_components()} on \textit{graph} using the rooted-trees data structure (with path compression) \textit{measurements} times, and returns its average running time.
            \item \textbf{compare\_all\_running\_times()}: receives three integer values as input: \textit{edges\_nodes\_ratio}, \textit{step} and \textit{measurements}. Generates 10 graphs of different dimensions based on \textit{step} and \textit{edges\_nodes\_ratio}, then calls the aforementioned methods (passing the \textit{measurements} parameter) and finally plots the running times of the three different algorithms on a time-dimension graph.
            \item \textbf{compare\_wlist\_vs\_forest()}: identical to the latter, except it only compares linked lists with weighted-union and rooted trees.
        \end{itemize}
        
    \end{itemize}

    \section{Tests and results} \label{sec:tests-and-results}
    In this section I'm going to present the details of the tests conducted and discuss and analyse the results.
    \subsection{Evaluation metrics}
    The primary metric used to evaluate the performance of the algorithms will be the running time: we measure the time taken by each implementation to compute the connected components of the generated graphs (figure \ref{fig:time-measurement}) and then compare them for each test case.

    \begin{figure}[H]
        \centering
        \includegraphics{images/time-measurement.png}
        \caption{Time measurements using \textit{time} module.}
        \label{fig:time-measurement}
    \end{figure}

    \subsection{Test cases and results}
    I've conducted a total of six tests, three of which comparing all implementations and the other three comparing only linked lists with weighted union to rooted trees, as shown in figure \ref{fig:tests-conducted}. In each test the connected components algorithm is run on 10 different permutations of edges of the same graphs, and the step is adjusted according to the test case. The main difference between the tests is the \textit{density} of the graphs, i.e., the relative number of edges, which is modified through the \textit{edges\_nodes\_ratio} parameter. In particular, I generated graphs where the number of edges is respectively 1, 10 and 100 times the number of nodes.

    \begin{figure}[H]
        \centering
        \includegraphics[width=\textwidth]{images/tests-conducted.png}
        \caption{Tests conducted.}
        \label{fig:tests-conducted}
    \end{figure}
    Let's now see the results for each test case:
    \begin{table}[H]
        \centering
        \resizebox{0.7\textwidth}{!}{
        \begin{tabular}{l|l|l|l|l}
            \textbf{test case} & \textbf{implementations compared} & \textbf{\textbar{}E\textbar{}} & \textbf{step} & \textbf{measurements}  \\
            \hline
            \multicolumn{1}{c}{1} & \multicolumn{1}{c}{all} & \multicolumn{1}{c}{\textbar{}V\textbar{}} & \multicolumn{1}{c}{1000} & \multicolumn{1}{c}{10}  \\     
        \end{tabular}
        }
        \begin{figure}[H]
            \centering
            \includegraphics[width = 0.7\textwidth]{images/test-r1-10mms.png}
            \caption{Results of test case 1.}
            \label{fig:test-case-1}
        \end{figure}
    \end{table}
    \begin{table}[H]
        \centering
        \resizebox{0.7\textwidth}{!}{
        \begin{tabular}{l|l|l|l|l}
            \textbf{test case} & \textbf{implementations compared} & \multicolumn{1}{c|}{\textbf{\textbar{}E\textbar{}}} & \textbf{step} & \textbf{measurements}  \\
            \hline
            \multicolumn{1}{c}{2} & \multicolumn{1}{c}{all} & \multicolumn{1}{c}{$10 \cdot |V|$} & \multicolumn{1}{c}{1000} & \multicolumn{1}{c}{10}  \\     
        \end{tabular}
        }
        \begin{figure}[H]
            \centering
            \includegraphics[width = 0.7\textwidth]{images/test-r10-10mms.png}
            \caption{Results of test case 2.}
            \label{fig:test-case-2}
        \end{figure}
    \end{table}
    \begin{table}[H]
        \centering
        \resizebox{0.7\textwidth}{!}{
        \begin{tabular}{l|l|l|l|l}
            \textbf{test case} & \textbf{implementations compared} & \multicolumn{1}{c|}{\textbf{\textbar{}E\textbar{}}} & \textbf{step} & \textbf{measurements}  \\
            \hline
            \multicolumn{1}{c}{3} & \multicolumn{1}{c}{all} & \multicolumn{1}{c}{$100 \cdot |V|$} & \multicolumn{1}{c}{1000} & \multicolumn{1}{c}{10}  \\     
        \end{tabular}
        }
        \begin{figure}[H]
            \centering
            \includegraphics[width = 0.7\textwidth]{images/test-r100-10mms.png}
            \caption{Results of test case 3.}
            \label{fig:test-case-3}
        \end{figure}
    \end{table}
    \begin{table}[H]
        \centering
        \resizebox{0.7\textwidth}{!}{
        \begin{tabular}{l|l|l|l|l}
            \textbf{test case} & \textbf{implementations compared} & \multicolumn{1}{c|}{\textbf{\textbar{}E\textbar{}}} & \textbf{step} & \textbf{measurements}  \\
            \hline
            \multicolumn{1}{c}{4} & \multicolumn{1}{c}{weighted-union and forest} & \multicolumn{1}{c}{$|V|$} & \multicolumn{1}{c}{2000} & \multicolumn{1}{c}{10}  \\     
        \end{tabular}
        }
        \begin{figure}[H]
            \centering
            \includegraphics[width = 0.7\textwidth]{images/test-1v1-r1-10mms.png}
            \caption{Results of test case 4.}
            \label{fig:test-case-4}
        \end{figure}
    \end{table}
    \begin{table}[H]
        \centering
        \resizebox{0.7\textwidth}{!}{
        \begin{tabular}{l|l|l|l|l}
            \textbf{test case} & \textbf{implementations compared} & \multicolumn{1}{c|}{\textbf{\textbar{}E\textbar{}}} & \textbf{step} & \textbf{measurements}  \\
            \hline
            \multicolumn{1}{c}{5} & \multicolumn{1}{c}{weighted-union and forest} & \multicolumn{1}{c}{$10 \cdot |V|$} & \multicolumn{1}{c}{1000} & \multicolumn{1}{c}{10}  \\     
        \end{tabular}
        }
        \begin{figure}[H]
            \centering
            \includegraphics[width = 0.7\textwidth]{images/test-1v1-r10-10mms.png}
            \caption{Results of test case 5.}
            \label{fig:test-case-5}
        \end{figure}
    \end{table}
    \begin{table}[H]
        \centering
        \resizebox{0.7\textwidth}{!}{
        \begin{tabular}{l|l|l|l|l}
            \textbf{test case} & \textbf{implementations compared} & \multicolumn{1}{c|}{\textbf{\textbar{}E\textbar{}}} & \textbf{step} & \textbf{measurements}  \\
            \hline
            \multicolumn{1}{c}{6} & \multicolumn{1}{c}{weighted-union and forest} & \multicolumn{1}{c}{$100 \cdot |V|$} & \multicolumn{1}{c}{500} & \multicolumn{1}{c}{10}  \\     
        \end{tabular}
        }
        \begin{figure}[H]
            \centering
            \includegraphics[width = 0.7\textwidth]{images/test-1v1-r100-10mms.png}
            \caption{Results of test case 6.}
            \label{fig:test-case-6}
        \end{figure}
    \end{table}
    \subsection{Analysis of the results}
    In the first three test cases, we can say to have proved the efficiency of heuristics on running time, since the implementation with linked lists with simple union is significantly slower than the other two (figures \ref{fig:test-case-1} and \ref{fig:test-case-2}). Even though on more dense graphs the time difference is smaller (figure \ref{fig:test-case-3}), we can still observe a significant difference and it seems to grow as the dimension of the graph grows. In figures \ref{fig:test-case-2} and \ref{fig:test-case-3} the weighted-union heuristic on linked lists seems to be slightly more efficient than the path-compression heuristic on rooted trees. This comparison is highlighted in the subsequent test cases: we can see that on sparse graphs (i.e., graphs with small density) the path-compression rooted trees implementation is slightly faster (figure \ref{fig:test-case-4}), however, when density grows, the linked lists implementation is more efficient and its running time seems to grow slower than the other when the graphs dimension grows (figures \ref{fig:test-case-5} and \ref{fig:test-case-6}).
    \section{Conclusion}
    In conclusion, this study has shed light on the efficiency of disjoint-set implementations in solving the problem of finding connected components in non-oriented graphs. Through testing and analysis, we have gained valuable insights into the impact of different heuristics and graph characteristics on algorithm performance.

    Our results clearly demonstrate the significance of heuristics, such as weighted union in linked lists and path compression in rooted trees, in optimizing the running time of disjoint-set operations. We observed notable variations in performance across different graph densities and sizes, highlighting the importance of considering these factors when selecting the most suitable implementation for a given application.

    While path compression in rooted trees shows promise for sparse graphs, linked lists with weighted union emerges as a better choice for denser graphs, particularly as graph size increased. 
    Ultimately, the most efficient implementation of disjoint sets would be the one based on rooted trees combining path-compression and union-by-rank heuristics, which ensures in most practical scenarios a linear time complexity for the connected components algorithm, even though it was not analysed in this report.
    

\end{document}