Spiral galaxy LEDA 2046648 as captured by NASA’s JWST1
Humans have been looking to the sky with wonder and curiosity since our species originated. In the modern era, sophisticated telescopes allow us to analyze celestial objects that exist billions of light-years away from Earth. By examining astronomical sky survey data, we can understand more about the universe, including what characteristics different classes of celestial objects may share. To that end, machine learning and data science techniques can help us parse complex datasets in which many measurements are taken for a single observation. In this project, a k-Nearest Neighbors (k-NN) model and a Decision Tree model are applied to labeled Sloan Digital Sky Survey (SDSS) data in RapidMiner Studio to determine which algorithm is more appropriate and accurate for identifying three types of celestial objects within an astronomical dataset. Contrary to what might be assumed, findings indicate that the simpler algorithm, the Decision Tree, categorized this dataset more accurately than the k-NN model.
The Sloan Digital Sky Survey (SDSS) is one of the most comprehensive and widely cited astronomical surveys in history. Sponsored by the Alfred P. Sloan Foundation with support from the National Science Foundation, the Simons Foundation, and the Heising-Simons Foundation, SDSS endeavors to help humanity achieve a deeper understanding of the structure and origins of the universe2.
In partnership with the Carnegie Observatories, the current phase of SDSS, SDSS-V, employs the Sloan Foundation Telescope at Apache Point Observatory (New Mexico, USA) and the Irénée du Pont Telescope at Las Campanas Observatory (Atacama, Chile) to capture over a third of the night sky in both hemispheres2. The resulting open-use, comprehensive astronomical data are provided to the public for the purposes of studying stars, galaxies, black holes, and more.
The purpose of this project is to determine whether a machine learning classification algorithm can correctly classify objects within astronomical data and, further, to understand which of two classification algorithms does so better. Considering the nature and size of the dataset, two classification models were selected as candidates for determining the classes of celestial objects:
- k-Nearest Neighbors (k-NN) is a supervised classification algorithm that assigns each individual observation within novel data to a class based on its proximity to labeled data points, as though the labeled data were a reference table to check against3. In a dataset with a high number of attributes per observation, this is a complex and time-consuming task.
- A Decision Tree is a supervised classification algorithm which works by learning from training data and then making classification decisions within novel data in the form of a flowchart, where each decision node splits the data at the highest possible level of homogeneity3. This results in a “tree” which separates the largest groups at the top and branches off smaller groups progressively until all the data is categorized. Where a k-Nearest Neighbors model computes distances between observations, a Decision Tree merely generalizes relationships within the dataset3.
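The two ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the RapidMiner implementation: the function names `knn_predict` and `gini`, the toy points, and the choice of Gini impurity as the homogeneity measure are all assumptions made for the example (RapidMiner's split criterion may differ).

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k labeled points nearest to query."""
    # Rank every labeled point by its Euclidean distance to the query.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity: 0.0 for a perfectly homogeneous group, higher when mixed."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Toy, made-up 2-D measurements (not SDSS values).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9), (1.9, 2.3)]
labels = ["STAR", "STAR", "STAR", "QSO", "QSO", "QSO"]

print(knn_predict(points, labels, (0.15, 0.1)))  # query sits among the STAR points
print(gini(labels))       # mixed group
print(gini(labels[:3]))   # homogeneous group
```

Note that `knn_predict` has to measure the distance to every labeled point for each query, which is why k-NN becomes slow on large tables, whereas a tree only needs to find split points that drive the impurity of each branch toward zero.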
The dataset used in this project to classify observed celestial objects as a star, a galaxy, or a quasi-stellar object (QSO) is labeled data derived from the most recent SDSS-V release, Data Release 18. It contains 100,000 observations, each characterized by 42 attributes and 1 classification, or type of celestial object4. These data are compiled using optical, ultraviolet, and infrared observations, and the redshift and photometric banding measurements for each celestial object are calculated using spectroscopy5.
Several attributes in the dataset are identifiers or reference columns to track when and where observations occurred. Measurement attributes include Petrosian radii, Petrosian fluxes, Petrosian half-light radii, Point Spread Function (PSF) magnitude, and the axis ratio of exponential fits of objects observed in the ultraviolet, green, red, infrared, and near-infrared photometric bands4. These attributes help to characterize the observed size, brightness, color, and shape of celestial objects4. Finally, the redshift attribute describes the distance and motion of celestial objects in relation to Earth6.
To begin the process of training and implementing the k-NN model, the dataset was first imported into RapidMiner Studio and prepared by applying the Set Role operator to designate “class” as the dataset label, indicating to RapidMiner that this was the desired attribute for which to create a prediction. In general, the dataset was very clean and complete, so it did not require extensive preparation prior to analysis.
The data was then split into a training dataset (70% of total rows) and an unlabeled dataset (30% of total rows), and each batch was fed separately to a weighted 5-NN model (the default setting of k = 5). Next, RapidMiner ignored the existing class column in the unlabeled dataset to create its own k-NN prediction as to whether each celestial object fell into the star, galaxy, or QSO class based on each data point’s proximity to labeled data.
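For readers who do not use RapidMiner, the same workflow can be approximated in Python with scikit-learn. This is a sketch under stated assumptions, not the project's actual process: the data here is synthetic (generated with `make_classification`), and `weights="distance"` is assumed to be a reasonable analogue of RapidMiner's weighted vote.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the survey table: 3 classes, 10 numeric attributes.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# 70% of rows for training, 30% held out; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# k = 5 with distance weighting: closer neighbors get a larger vote.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)

# The held-out labels are used only for scoring, never for prediction.
predictions = knn.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"held-out accuracy: {accuracy:.4f}")
```

The accuracy printed here describes the synthetic data only and says nothing about the SDSS results reported below.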
Optimizing the dataset for a second run was necessary to improve the model. As the dataset is quite detailed, this included optimizing the model run time and prediction accuracy by using the Select Attributes operator to remove extraneous reference attributes from the dataset. These included the object ID, spec object ID, run number, rerun number, camera column, field number, fiber ID, modified Julian date, and plate number.
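An equivalent of the Select Attributes step can be sketched with pandas. The column names below are assumptions about how the reference attributes might be spelled in a CSV export, and the two rows are made up purely to exercise the code.

```python
import pandas as pd

# Hypothetical column names for the nine reference attributes listed above.
REFERENCE_COLUMNS = [
    "objid", "specobjid", "run", "rerun", "camcol",
    "field", "fiberid", "mjd", "plate",
]

# Two made-up rows (not SDSS values).
df = pd.DataFrame({
    "objid": [1, 2], "specobjid": [10, 20], "run": [756, 756],
    "rerun": [301, 301], "camcol": [3, 4], "field": [100, 101],
    "fiberid": [12, 13], "mjd": [52000, 52001], "plate": [300, 301],
    "redshift": [0.0001, 1.2], "class": ["STAR", "QSO"],
})

# Drop identifiers and bookkeeping columns, and separate the label from the features.
features = df.drop(columns=REFERENCE_COLUMNS + ["class"])
target = df["class"]
print(list(features.columns))
```

Identifier columns carry no physical information about an object, and in a distance-based model such as k-NN they can dominate the distance calculation, so removing them can help both run time and accuracy.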
To prepare the dataset for processing with a Decision Tree model, it was imported into RapidMiner Studio, where extraneous reference attributes were removed using the Select Attributes operator and the class attribute was designated as the label with the Set Role operator. Again, no extensive preparation of the data was necessary as it was already clean and complete.
The Decision Tree parameters were adjusted to a maximal depth of 20 with a minimal leaf size of 10. Pruning and pre-pruning were both applied to the model.
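A rough scikit-learn counterpart of these settings is sketched below on synthetic data. The depth and leaf-size values come from the text; `ccp_alpha` (cost-complexity pruning) is only a loose stand-in for RapidMiner's pruning options, and its value of 0.001 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey table: 3 classes, 10 numeric attributes.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(
    max_depth=20,         # maximal depth from the text
    min_samples_leaf=10,  # minimal leaf size from the text
    ccp_alpha=0.001,      # assumed pruning strength, not a RapidMiner setting
    random_state=0,
)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)
print(f"depth reached: {tree.get_depth()}, leaves: {tree.get_n_leaves()}")
print(f"held-out accuracy: {accuracy:.4f}")
```

The depth limit is an upper bound rather than a target: with a minimum leaf size and pruning in place, the fitted tree is often much shallower than 20 levels.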
The k-NN model using a weighted k of 5 produced a correct prediction of celestial object class with 93.56% accuracy.
Lowering the k value to 3 did not make much difference to the accuracy of the model (93.90% accuracy). However, even optimized, the model was cumbersome to run, with a long execution time. At 100,000 observations and 33 attributes, the SDSS-V dataset is of moderate size and complexity. Running larger, more detailed astronomical datasets through a k-NN model for the purposes of classification would require significant time and resources.
The Decision Tree model produced a correct prediction of celestial object class with 98.64% accuracy.
The model was not only able to categorize the data with higher accuracy than the k-NN model, but additionally provided much deeper insight as to which astronomical measurements could broadly be used to define groups of celestial bodies. After two runs of this model with minor adjustments to tree depth and leaf size, the Decision Tree was able to produce accurate predictions using only redshift measurements. This makes a great deal of logical sense: in astronomy, stars, galaxies, and quasi-stellar objects are easily stratified by their distances from Earth, and this is reflected in the measurement of each object’s redshift.
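To illustrate how a tree can separate classes on a single attribute, the sketch below fits a shallow tree to one synthetic "redshift" column. The three value ranges are invented and deliberately non-overlapping so that the splits are easy to read; real redshift distributions overlap, so this shows the mechanism rather than reproducing the reported result.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 300

# Made-up redshift ranges, chosen only so the classes do not overlap.
redshift = np.concatenate([
    rng.uniform(-0.001, 0.001, n),  # labeled "STAR"
    rng.uniform(0.02, 0.4, n),      # labeled "GALAXY"
    rng.uniform(0.8, 3.0, n),       # labeled "QSO"
])
labels = np.array(["STAR"] * n + ["GALAXY"] * n + ["QSO"] * n)

# scikit-learn expects a 2-D feature matrix, even for a single attribute.
X = redshift.reshape(-1, 1)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, labels)

# Print the learned thresholds as a readable flowchart.
print(export_text(tree, feature_names=["redshift"]))
training_accuracy = tree.score(X, labels)
print("training accuracy:", training_accuracy)
```

The printed rules consist of two thresholds on `redshift`, which is the kind of readable categorization logic that makes a Decision Tree easier to interpret than a k-NN model.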
In conclusion, the Decision Tree model was found to be more appropriate for this astronomical survey dataset than the k-NN model due to its shorter run time and the simplicity of its categorization logic. Additionally, the Decision Tree’s classification accuracy was roughly 5 percentage points higher than that of the k-NN algorithm (98.64% versus 93.56%).
[Figure: Partition data for training and application datasets]
[Figure: Assign k value, weights, and measure types (see Parameters box)]
Footnotes
1. ESA/Webb, et al. “A Spiral Amongst Thousands.” NASA’s James Webb Space Telescope Photostream, 31 Jan. 2023, https://www.flickr.com/photos/nasawebbtelescope/52660766241/in/album-72177720305127361/. Accessed 3 Aug. 2023.
2. Sloan Digital Sky Survey: Alfred P. Sloan Foundation, sloan.org/programs/research/sloan-digital-sky-survey. Accessed 3 Aug. 2023.
3. Kotu, Vijay, and Bala Deshpande. “Ch 4: Classification.” Data Science: Concepts and Practice, Morgan Kaufmann, Cambridge, 2019.
4. R, Farid. “Sloan Digital Sky Survey - DR18.” Kaggle, 29 July 2023, www.kaggle.com/datasets/diraf0/sloan-digital-sky-survey-dr18.
5. Sloan Digital Sky Survey. “Data Release 18.” Sloan Digital Sky Survey, 26 July 2023, www.sdss.org/dr18/.
6. “Redshift.” Las Cumbres Observatory, lco.global/spacebook/light/redshift/. Accessed 6 Aug. 2023.