
Commit 7fb373e

Finish writing the tutorial on duplicates removal.

1 parent dc2df48 commit 7fb373e

File tree

4 files changed, +88 -2 lines changed

docs/source/index.rst

Lines changed: 2 additions & 0 deletions

@@ -22,6 +22,8 @@ Quickstart

   See the :doc:`cross-matching example <tutorial/xmatch>` for how to use it.
 - If you want to cluster the objects in a catalog with the Friends-of-Friends (FoF) algorithm, use
   the :ref:`group_by_quadtree <fof-ref>` function. See the :doc:`clustering example <tutorial/fof>` for how to use it.
+- If you want to remove duplicates from a catalog, also use the :ref:`group_by_quadtree <fof-ref>` function.
+  See the :doc:`duplicates removal example <tutorial/duplicates_removal>` for how to do it.

 Contents

docs/source/tutorial/duplicates_removal.rst

Lines changed: 83 additions & 0 deletions

@@ -0,0 +1,83 @@

Example: Duplicates removal
===========================

One common use of the FoF algorithm is to remove duplicates from a catalog.
In this example, we will show how to remove duplicates from a catalog using the
:func:`pycorrelator.group_by_quadtree` function.

First, let's create a mock catalog with duplicates:

.. code-block:: python

    import pandas as pd

    # Create a mock catalog as a pandas DataFrame
    catalog = pd.DataFrame([[80.894, 41.269, 1200], [120.689, -41.269, 1500],
                            [10.689, -41.269, 3600], [10.688, -41.270, 300],
                            [10.689, -41.270, 1800], [10.690, -41.269, 2400],
                            [120.690, -41.270, 900], [10.689, -41.269, 2700]],
                           columns=['ra', 'dec', 'exp_time'])

Here, the catalog contains 8 entries but only 3 unique objects; the other 5 entries are duplicates.
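
Note that a naive exact-match deduplication would not get us far here: only one pair of
entries shares byte-identical coordinates, while the other duplicates differ in the third
decimal place. A quick check with plain pandas (this snippet is illustrative and not part
of the deduplication pipeline itself):

.. code-block:: python

    # Exact matching on ('ra', 'dec') drops only 1 of the 5 duplicates,
    # because only two rows have identical coordinates.
    print(len(catalog.drop_duplicates(subset=['ra', 'dec'])))  # 7, not 3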

Now we wish to remove the duplicates from the catalog and retain, for each unique object,
only the entry with the highest exposure time. Here is how we can do it:

.. code-block:: python

    from pycorrelator import group_by_quadtree

    ranking_col = 'exp_time'
    tolerance = 0.01

    # Group together all entries that lie within the tolerance of each other.
    result_object = group_by_quadtree(catalog, tolerance=tolerance)
    catalog = result_object.get_group_dataframe()

    # Group size, i.e. how many entries refer to the same object.
    catalog['dup_num'] = catalog.groupby('Group')['Ra'].transform('size')
    # Rank the entries within each group by exposure time, highest first;
    # method='first' breaks ties in order of appearance.
    catalog['rank'] = catalog.groupby('Group')[ranking_col].rank(ascending=False, method='first')
    catalog['rank'] = catalog['rank'].astype(int)
    print(catalog)

Expected output::

                        Ra     Dec  exp_time  dup_num  rank
    Group Object
    0     0         80.894  41.269      1200        1     1
    1     1        120.689 -41.269      1500        2     1
          6        120.690 -41.270       900        2     2
    2     2         10.689 -41.269      3600        5     1
          3         10.688 -41.270       300        5     5
          4         10.689 -41.270      1800        5     4
          5         10.690 -41.269      2400        5     3
          7         10.689 -41.269      2700        5     2

Here the tolerance is set to 0.01, which means that objects separated by less than 0.01 degrees
from any other object in the same 'cluster' are considered duplicates. You need to adjust this
value according to the properties of your catalog. The ``'dup_num'`` column shows the number of
entries in each group, and the ``'rank'`` column shows the position of each entry within its
group when sorted by the ranking column.
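
If you are unsure what tolerance suits your catalog, one quick way to pick a value is to look at
the distribution of pairwise separations: duplicates should pile up well below the tolerance and
distinct objects should sit well above it. Below is a minimal sketch with plain ``numpy``, not a
pycorrelator feature; it uses a small-angle approximation that is adequate at these separations
but degrades near the poles:

.. code-block:: python

    import numpy as np

    ra = np.radians(catalog['Ra'].to_numpy())
    dec = np.radians(catalog['Dec'].to_numpy())

    # Small-angle approximation of the pairwise angular separation, in degrees.
    dra = (ra[:, None] - ra[None, :]) * np.cos(0.5 * (dec[:, None] + dec[None, :]))
    ddec = dec[:, None] - dec[None, :]
    sep = np.degrees(np.hypot(dra, ddec))

    # Sorted separations between distinct entries: a clear gap between the
    # smallest values (duplicates) and the rest suggests a safe tolerance.
    print(np.sort(sep[np.triu_indices_from(sep, k=1)]))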

.. note::
    When two 'unique' objects lie very close to each other, they may be grouped together.
    In an extreme case, a whole chain of unique objects can be merged into one group, linked
    together by their duplicates. This is rare for most catalogs. If it happens, you can try
    decreasing the tolerance value. However, if decreasing the tolerance separates objects
    that should be considered duplicates, this package does not provide a solution for now;
    you may need to remove the duplicates of those close objects manually.
    We are now working on new features related to this issue.
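
One heuristic for spotting such chains, sketched here in plain pandas rather than as a built-in
check, is to flag groups whose overall extent exceeds the tolerance, since a genuine duplicate
group should be at most about one tolerance across:

.. code-block:: python

    # Coordinate spread (max minus min) of each group, in degrees.
    extent = catalog.groupby('Group')[['Ra', 'Dec']].agg(lambda s: s.max() - s.min())

    # Groups wider than the tolerance may be chains of distinct objects
    # linked by their duplicates; inspect these by hand.
    print(extent[(extent > tolerance).any(axis=1)])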

Finally, we can remove the duplicates from the catalog by retaining only the objects with
``'rank'`` equal to 1:

.. code-block:: python

    # Keep only the highest-ranked (longest-exposure) entry of each group.
    catalog_no_duplicates = catalog[catalog['rank'] == 1].copy()
    catalog_no_duplicates.drop(columns=['rank'], inplace=True)
    # Turn the 'Object' index level back into a regular column.
    catalog_no_duplicates.reset_index(level='Object', inplace=True)
    print(catalog_no_duplicates)

Expected output::

           Object       Ra     Dec  exp_time  dup_num
    Group
    0           0   80.894  41.269      1200        1
    1           1  120.689 -41.269      1500        2
    2           2   10.689 -41.269      3600        5

Now the catalog contains only the unique objects, each represented by the entry with the
highest exposure time.
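
As a final sanity check, you can re-run the grouping on the cleaned catalog and confirm that
every group now contains exactly one object. The snippet below assumes the function accepts the
same lowercase ``ra``/``dec`` column names as the original mock catalog:

.. code-block:: python

    # After deduplication, no two entries should fall within the tolerance,
    # so every group must have size 1.
    check = catalog_no_duplicates.reset_index(drop=True).rename(
        columns={'Ra': 'ra', 'Dec': 'dec'})
    check_result = group_by_quadtree(check, tolerance=tolerance)
    assert check_result.get_group_dataframe().groupby('Group').size().max() == 1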

docs/source/tutorial/fof.rst

Lines changed: 2 additions & 2 deletions

@@ -36,7 +36,7 @@ The result object contains the clustering results. Four methods are available to

 get_group_dataframe()
 ---------------------

-To get the clustering results with the appended data (``"mag"`` in this case), use the
+To get the clustering results with the appended data (``'mag'`` in this case), use the
 :func:`pycorrelator.FoFResult.get_group_dataframe` method:

 .. code-block:: python

@@ -101,7 +101,7 @@ If you want DataFrame with a single layer of index and the size of each group as

 .. code-block:: python

-    groups_df['group_size'] = groups_df.groupby(level='Group')['Ra'].transform('size')
+    groups_df['group_size'] = groups_df.groupby('Group')['Ra'].transform('size')
     groups_df.reset_index(level='Group', inplace=True)
     print(groups_df)
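
The change in the second hunk works because ``'Group'`` is the name of an index level: when no
column shares that name, pandas resolves ``groupby('Group')`` to the index level, so the explicit
``level=`` keyword is redundant. A minimal illustration with a made-up two-level index:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({'Ra': [1.0, 2.0, 3.0]},
                      index=pd.MultiIndex.from_tuples([(0, 0), (1, 1), (1, 2)],
                                                      names=['Group', 'Object']))
    # Both spellings group by the 'Group' index level and agree exactly.
    assert df.groupby('Group')['Ra'].transform('size').equals(
        df.groupby(level='Group')['Ra'].transform('size'))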

docs/source/tutorial/index.rst

Lines changed: 1 addition & 0 deletions

@@ -8,3 +8,4 @@ This section contains tutorials on how to use the **pycorrelator** package.

    input_validation
    xmatch
    fof
+   duplicates_removal
