
Commit 7fb373e

Finish writing the tutorial on duplicates removal.

1 parent dc2df48 commit 7fb373e

File tree

4 files changed, +88 -2 lines changed

docs/source/index.rst

Lines changed: 2 additions & 0 deletions

@@ -22,6 +22,8 @@ Quickstart

   See the :doc:`cross-matching example <tutorial/xmatch>` for how to use it.
 - If you want to cluster the objects in a catalog with the Friends-of-Friends (FoF) algorithm, use
   the :ref:`group_by_quadtree <fof-ref>` function. See the :doc:`clustering example <tutorial/fof>` for how to use it.
+- If you want to remove duplicates from a catalog, also use the :ref:`group_by_quadtree <fof-ref>` function.
+  See the :doc:`duplicates removal example <tutorial/duplicates_removal>` for how to do it.

 Contents

docs/source/tutorial/duplicates_removal.rst

Lines changed: 83 additions & 0 deletions

@@ -0,0 +1,83 @@

Example: Duplicates removal
===========================

One common use of the FoF algorithm is to remove duplicates from a catalog.
In this example, we will show how to remove duplicates from a catalog using the
:func:`pycorrelator.group_by_quadtree` function.

First, let's create a mock catalog with duplicates:

.. code-block:: python

    import pandas as pd

    # Create a mock catalog as a pandas DataFrame
    catalog = pd.DataFrame([[80.894, 41.269, 1200], [120.689, -41.269, 1500],
                            [10.689, -41.269, 3600], [10.688, -41.270, 300],
                            [10.689, -41.270, 1800], [10.690, -41.269, 2400],
                            [120.690, -41.270, 900], [10.689, -41.269, 2700]],
                           columns=['ra', 'dec', 'exp_time'])

Here, the catalog contains 8 entries but only 3 unique objects; the other 5 entries are duplicates.
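
Note that a naive exact-match deduplication would not get us far here: only one pair of
entries shares byte-identical coordinates, while the other duplicates differ in the third
decimal place. A quick check with plain pandas (this snippet is illustrative and not part
of the deduplication pipeline itself):

.. code-block:: python

    # Exact matching on ('ra', 'dec') drops only 1 of the 5 duplicates,
    # because only two rows have identical coordinates.
    print(len(catalog.drop_duplicates(subset=['ra', 'dec'])))  # 7, not 3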

Now we wish to remove the duplicates from the catalog and retain, for each unique object,
only the entry with the highest exposure time. Here is how we can do it:

.. code-block:: python

    from pycorrelator import group_by_quadtree

    ranking_col = 'exp_time'
    tolerance = 0.01

    # Group together all entries that lie within the tolerance of each other.
    result_object = group_by_quadtree(catalog, tolerance=tolerance)
    catalog = result_object.get_group_dataframe()

    # Group size, i.e. how many entries refer to the same object.
    catalog['dup_num'] = catalog.groupby('Group')['Ra'].transform('size')
    # Rank the entries within each group by exposure time, highest first;
    # method='first' breaks ties in order of appearance.
    catalog['rank'] = catalog.groupby('Group')[ranking_col].rank(ascending=False, method='first')
    catalog['rank'] = catalog['rank'].astype(int)
    print(catalog)

Expected output::

                        Ra     Dec  exp_time  dup_num  rank
    Group Object
    0     0         80.894  41.269      1200        1     1
    1     1        120.689 -41.269      1500        2     1
          6        120.690 -41.270       900        2     2
    2     2         10.689 -41.269      3600        5     1
          3         10.688 -41.270       300        5     5
          4         10.689 -41.270      1800        5     4
          5         10.690 -41.269      2400        5     3
          7         10.689 -41.269      2700        5     2

Here the tolerance is set to 0.01, which means that objects separated by less than 0.01 degrees
from any other object in the same 'cluster' are considered duplicates. You need to adjust this
value according to the properties of your catalog. The ``'dup_num'`` column shows the number of
entries in each group, and the ``'rank'`` column shows the position of each entry within its
group when sorted by the ranking column.
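
If you are unsure what tolerance suits your catalog, one quick way to pick a value is to look at
the distribution of pairwise separations: duplicates should pile up well below the tolerance and
distinct objects should sit well above it. Below is a minimal sketch with plain ``numpy``, not a
pycorrelator feature; it uses a small-angle approximation that is adequate at these separations
but degrades near the poles:

.. code-block:: python

    import numpy as np

    ra = np.radians(catalog['Ra'].to_numpy())
    dec = np.radians(catalog['Dec'].to_numpy())

    # Small-angle approximation of the pairwise angular separation, in degrees.
    dra = (ra[:, None] - ra[None, :]) * np.cos(0.5 * (dec[:, None] + dec[None, :]))
    ddec = dec[:, None] - dec[None, :]
    sep = np.degrees(np.hypot(dra, ddec))

    # Sorted separations between distinct entries: a clear gap between the
    # smallest values (duplicates) and the rest suggests a safe tolerance.
    print(np.sort(sep[np.triu_indices_from(sep, k=1)]))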

.. note::
    When two 'unique' objects lie very close to each other, they may be grouped together.
    In an extreme case, a whole chain of unique objects can be merged into one group, linked
    together by their duplicates. This is rare for most catalogs. If it happens, you can try
    decreasing the tolerance value. However, if decreasing the tolerance separates objects
    that should be considered duplicates, this package does not provide a solution for now;
    you may need to remove the duplicates of those close objects manually.
    We are now working on new features related to this issue.
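
One heuristic for spotting such chains, sketched here in plain pandas rather than as a built-in
check, is to flag groups whose overall extent exceeds the tolerance, since a genuine duplicate
group should be at most about one tolerance across:

.. code-block:: python

    # Coordinate spread (max minus min) of each group, in degrees.
    extent = catalog.groupby('Group')[['Ra', 'Dec']].agg(lambda s: s.max() - s.min())

    # Groups wider than the tolerance may be chains of distinct objects
    # linked by their duplicates; inspect these by hand.
    print(extent[(extent > tolerance).any(axis=1)])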

Finally, we can remove the duplicates from the catalog by retaining only the objects with
``'rank'`` equal to 1:

.. code-block:: python

    # Keep only the highest-ranked (longest-exposure) entry of each group.
    catalog_no_duplicates = catalog[catalog['rank'] == 1].copy()
    catalog_no_duplicates.drop(columns=['rank'], inplace=True)
    # Turn the 'Object' index level back into a regular column.
    catalog_no_duplicates.reset_index(level='Object', inplace=True)
    print(catalog_no_duplicates)

Expected output::

           Object       Ra     Dec  exp_time  dup_num
    Group
    0           0   80.894  41.269      1200        1
    1           1  120.689 -41.269      1500        2
    2           2   10.689 -41.269      3600        5

Now the catalog contains only the unique objects, each represented by the entry with the
highest exposure time.
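
As a final sanity check, you can re-run the grouping on the cleaned catalog and confirm that
every group now contains exactly one object. The snippet below assumes the function accepts the
same lowercase ``ra``/``dec`` column names as the original mock catalog:

.. code-block:: python

    # After deduplication, no two entries should fall within the tolerance,
    # so every group must have size 1.
    check = catalog_no_duplicates.reset_index(drop=True).rename(
        columns={'Ra': 'ra', 'Dec': 'dec'})
    check_result = group_by_quadtree(check, tolerance=tolerance)
    assert check_result.get_group_dataframe().groupby('Group').size().max() == 1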

docs/source/tutorial/fof.rst

Lines changed: 2 additions & 2 deletions

@@ -36,7 +36,7 @@ The result object contains the clustering results. Four methods are available to

 get_group_dataframe()
 ---------------------

-To get the clustering results with the appended data (``"mag"`` in this case), use the
+To get the clustering results with the appended data (``'mag'`` in this case), use the
 :func:`pycorrelator.FoFResult.get_group_dataframe` method:

 .. code-block:: python

@@ -101,7 +101,7 @@ If you want DataFrame with a single layer of index and the size of each group as

 .. code-block:: python

-    groups_df['group_size'] = groups_df.groupby(level='Group')['Ra'].transform('size')
+    groups_df['group_size'] = groups_df.groupby('Group')['Ra'].transform('size')
     groups_df.reset_index(level='Group', inplace=True)
     print(groups_df)
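
The change in the second hunk works because ``'Group'`` is the name of an index level: when no
column shares that name, pandas resolves ``groupby('Group')`` to the index level, so the explicit
``level=`` keyword is redundant. A minimal illustration with a made-up two-level index:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({'Ra': [1.0, 2.0, 3.0]},
                      index=pd.MultiIndex.from_tuples([(0, 0), (1, 1), (1, 2)],
                                                      names=['Group', 'Object']))
    # Both spellings group by the 'Group' index level and agree exactly.
    assert df.groupby('Group')['Ra'].transform('size').equals(
        df.groupby(level='Group')['Ra'].transform('size'))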

docs/source/tutorial/index.rst

Lines changed: 1 addition & 0 deletions

@@ -8,3 +8,4 @@ This section contains tutorials on how to use the **pycorrelator** package.

    input_validation
    xmatch
    fof
+   duplicates_removal
