\volnopage

Vol.0 (20xx) No.0, 000–000

11institutetext: National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, China; [email protected], [email protected]
22institutetext: School of Astronomy and Space Science, University of Chinese Academy of Sciences, Beijing 100049, China
33institutetext: State Key Laboratory of Dark Matter Physics, Tsung-Dao Lee Institute & School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai 201210, China
44institutetext: Shanghai Key Laboratory for Particle Physics and Cosmology, Shanghai 200240, China
\vs\noReceived 20xx month day; accepted 20xx month day

A widely applicable Galaxy Group finder Using Machine Learning

Juntao Ma 1122    Jie Wang 11 2 2    Tianxiang Mao 11 2 2    Hongxiang Chen 11 2 2    Yuxi Meng 11 2 2    Xiaohu Yang 33 4 4    Qingyang Li 33 4 4
Abstract

Galaxy groups are essential for studying the distribution of matter on a large scale in redshift surveys and for deciphering the link between galaxy traits and their associated halos. In this work, we propose a widely applicable method for identifying groups through machine learning techniques in real space taking into account the impact of redshift distortion. Our methodology involves two neural networks: one is a classification model for identifying central galaxy groups, and the other is a regression model for predicting the mass of these groups. Both models input observable galaxy traits, allowing future applicability to real survey data. Testing on simulated datasets indicates our method accurately identifies over 92%percent9292\%92 % of groups with Mvir1011h1MsubscriptM𝑣𝑖𝑟superscript1011superscript1subscriptMdirect-product\mathrm{M}_{vir}\geq 10^{11}h^{-1}\mathrm{M}_{\odot}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT ≥ 10 start_POSTSUPERSCRIPT 11 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT, with 80%percent8080\%80 % achieving a membership completeness of at least 80%percent8080\%80 %. The predicted group masses vary by less than 0.3 dex across different mass scales, even in the absence of a priori data. Our network adapts seamlessly to expand to sparse samples with a flux limit of mr<14subscript𝑚𝑟14m_{r}<14italic_m start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 14, to high redshift samples at z=1.08𝑧1.08z=1.08italic_z = 1.08, and to galaxy samples from the TNG300 hydrodynamical simulation without further training. Furthermore, the framework can easily adjust to real surveys by training on redshift distorted samples without needing parameter changes. Careful consideration of different observational effects in redshift space makes it promising that this method will be applicable to real galaxy surveys.

keywords:
large-scale structure of Universe – Galaxy: halo – methods: data analysis

1 Introduction

Current structure formation theory suggests that the mass content of the universe is dominated by dark matter, and cosmic structures form hierarchically through gravitational instability (White & Rees 1978; Davis et al. 1985; Springel et al. 2006). Galaxies form and evolve within these structures, called dark matter halos. Therefore, understanding the relation between galaxies and their host halos is crucial for understanding the role played by the environment in galaxy formation and evolution, as well as for tracing the underlying density field. Apart from theoretical or simulation-based approaches, galaxy groups provide a direct way to study the galaxy-halo relation, as they consist of various galaxies residing within the same mass dark matter halos.

Due to the visual concentration of these systems, identifying galaxy groups from galaxy surveys begins at the very beginning of cosmology research. Abell (1958) identified approximately 2700 clusters from the Palomar Observatory Sky Survey (POSS) using local galaxy surface number densities. Similarly, Zwicky et al. (1968) constructed a catalogue of 9133 clusters in the Northern celestial hemisphere, and Abell et al. (1989) identified around 1600 clusters from the UK Schmidt Telescope (UKST) plates. The lack of precise distance estimation in early no-redshift surveys caused these catalogs to suffer significantly from issues like projection effects. With the advent of large redshift surveys since the 1980s, many efforts have been made to identify galaxy groups using different group finder algorithms. For instance, galaxy groups have been identified from the CfA redshift survey (e.g. Huchra & Geller 1982), the Two Degree Field Galaxy Redshift Survey (e.g. Eke et al. 2004; Yang et al. 2005a; Tago et al. 2006), the Two Micron All Sky Redshift Survey (e.g. Lavaux & Hudson 2011; Tully 2015; Crook et al. 2007), the Sloan Digital Sky Survey (e.g. Goto 2005; Berlind et al. 2006; Yang et al. 2007; Lim et al. 2017), the DESI Legacy Imaging Surveys (e.g. Yang et al. 2021).

Based on the galaxies groups identified from large redshift surveys, we can have a better understanding on how different galaxies form and evolve in different dark matter haloes. Weinmann et al. (2006) found a strong correlation in the properties of galaxies residing in common dark matter halos, i.e. galactic conformity. Wang et al. (2018) found that the apparent dependence of the quenched fraction of galaxies on large-scale environment is largely induced by the dependence of quenching on the host halo mass combined with the biased distribution of dark matter halos in the cosmic density field. The group-galaxy cross-correlation function is measured to evaluate how galaxies are distributed within and beyond their host halos (e.g. Yang et al. 2005b; Coil et al. 2006; Knobel et al. 2009). Stacking groups with similar masses can help probe the weak signal of Sunyaev-Zel,dovich (SZ) effects (e.g. Li et al. 2011; Vikram et al. 2017; Lim et al. 2018, 2020) and weak gravitational lensing signals (e.g. Mandelbaum et al. 2006; Yang et al. 2006; Han et al. 2015; Viola et al. 2015; Luo et al. 2018) over a large halo mass range. A similar approach can be used to measure the halo occupation distribution or the conditional luminosity functions of galaxies in halos of different masses (e.g. Yang et al. 2005a, 2008, 2009; Rodriguez et al. 2015; Lan et al. 2016). As biased tracer of dark matter, galaxy groups and their halos can be used to reconstruct cosmic density field (Wang et al. 2009; Muñoz-Cuartas et al. 2011) and constrain the initial conditions that produced the observed cosmic web (e.g. Wang et al. 2016).

Several group finders have been proposed and applied to redshift surveys. The Friends-of-Friends (FoF) algorithm identifies galaxies as belonging to the same group when their distance is less than a linking length. For example, Huchra & Geller (1982) used a FoF method with two linking lengths, one in the projected direction and the other in the redshift direction, to construct galaxy groups in the CfA survey. Miller et al. (2005) used the C4 algorithm, which places galaxies in a six-dimensional parameter space, to find groups in the SDSS DR2. Yang et al. (2005a) proposed a halo-based group finder that takes advantage of known halo models (e.g., NFW profile) and iteration. Wang et al. (2020) combined FoF and machine learning methods to identify groups in incomplete samples at high redshift.

Traditional group finders rely on specific physical models that statistically describe the structures of galaxies and halos. These models may lose higher-order information present in the cosmic density field. In this paper, we propose a machine-based group finder algorithm, specifically artificial neural networks (ANNs). Machine learning models learn directly from obervational data, allowing us to extract more nuanced information hidden in the data provided to the networks. Recent research has shown that machine learning models outperform traditional methods in various tasks, such as feature extraction and classification. Our recent research has confirmed the effectiveness of Artificial Neural Networks (ANNs) in solving a variety of astrophysical problems.As an example, Mao et al. (2021) introduced an innovative convolutional neural network framework for reconstructing baryon acoustic oscillation (BAO) signals, significantly enhancing the BAO signal-to-noise ratio to around k0.4hMpc1similar-to-or-equals𝑘0.4superscriptMpc1k\simeq 0.4h\mathrm{Mpc}^{-1}italic_k ≃ 0.4 italic_h roman_Mpc start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT. Similarly, Chen et al. (2024) utilized ANNs to assess environmental attributes of galaxies,achieving accurate line-of-sight velocity estimations and enabling the recovery of the real-space power spectrum with less than a 5% margin of error.

For our group finder, we aim to develop a machine learning model that accurately identifies member galaxies and estimates halo mass. Moreover, with appropriate preprocessing and network architecture, machine learning models can demonstrate good generalisability, meaning they can be applied to different galaxy catalogues without the need for retraining or hyperparameter adjustment. By leveraging ANNs, our group finder extracts information about galaxies and their host halos from a high-resolution N-body simulation. The nonlinear nature of the network enables it to uncover more intricate relations than those described by current galaxy-halo models. Our group finder demonstrates remarkable accuracy across various test datasets, including those at different cosmic epochs, and galaxy samples with different flux limits.

This paper is organised as follows. In Section 2 we provide a description of the simulation data used for training and testing our group finder. In Section 3 we describe in detail our group finding method, which includes two machine learning models. In Section 4 we test the performance of our group finder, including completeness and purity test, halo mass assignment test, on each of the test datasets. Finally, we conclude our main results in Section 5.

2 Data

This section outlines the data sets utilised in our study. The Millennium Simulation is partitioned into a subbox with a side length of 300 Mpc/h for training purposes, as well as several smaller boxes for testing to maintain data integrity. In addition, to assess the robustness and flexibility of our model, we developed three additional test datasets.

  1. 1.

    Magnitude Sample: Similar to the basic test datasets, but with a different apparent magnitude limit applied.

  2. 2.

    High-z Sample: Consisting of galaxies from higher redshift snapshots of the Millennium Simulation.

  3. 3.

    TNG Sample: Generated from the TNG300 simulation, providing a distinct environment for model evaluation.

These samples provide a diverse range of data for comprehensive testing. Additionally, we evaluated our model on a redshift-space dataset to demonstrate its applicability to real redshift surveys. We detail these data set as follows.

2.1 Traning Data

To generate suitable training datasets for our machine learning models and evaluate the performance of our group finder, we utilized cosmological simulation, Millennium simulation, and its galaxy catalogue.

The Millennium Simulation(MS; Springel et al. (2005)), a large-scale simulation of cosmic structure formation based on the ΛΛ\Lambdaroman_ΛCDM cosmology. It simulates N=21603𝑁superscript21603N=2160^{3}italic_N = 2160 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT dark matter particles across a redshift range from z=127𝑧127z=127italic_z = 127 to z=0𝑧0z=0italic_z = 0, within a co-moving volume of (500h1Mpc)3superscript500superscript1Mpc3(500h^{-1}\mathrm{Mpc})^{3}( 500 italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc ) start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. Each dark matter particle has a mass of 8.6×108M8.6superscript108subscript𝑀direct-product8.6\times 10^{8}M_{\odot}8.6 × 10 start_POSTSUPERSCRIPT 8 end_POSTSUPERSCRIPT italic_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT. The cosmological parameters of the simulation are Ωm=0.25subscriptΩ𝑚0.25\Omega_{m}=0.25roman_Ω start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = 0.25, Ωb=0.045subscriptΩ𝑏0.045\Omega_{b}=0.045roman_Ω start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT = 0.045, h=0.730.73h=0.73italic_h = 0.73, ΩΛ=0.75subscriptΩΛ0.75\Omega_{\Lambda}=0.75roman_Ω start_POSTSUBSCRIPT roman_Λ end_POSTSUBSCRIPT = 0.75, n=1𝑛1n=1italic_n = 1, and σ8=0.9subscript𝜎80.9\sigma_{8}=0.9italic_σ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT = 0.9, with the Hubble constant defined as H0=100hKms1Mpc1subscript𝐻0100Kmsuperscripts1superscriptMpc1H_{0}=100h\ \mathrm{Km\cdot s^{-1}\cdot\mathrm{Mpc}^{-1}}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 100 italic_h roman_Km ⋅ roman_s start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ⋅ roman_Mpc start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT.

We use the semi-analytic galaxy catalogue of MS developed by Guo et al. (2011), which implement the galaxy formation model L-Galaxies (Henriques et al. 2015) onto merger trees extracted from the Millennium Simulation.

A sub-box with dimensions of 300×300×500(h1Mpc)3300300500superscriptsuperscript1Mpc3300\times 300\times 500(h^{-1}\mathrm{Mpc})^{3}300 × 300 × 500 ( italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc ) start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT from the MS was selected as the training set. Edge effects of the box may lead to some groups being incomplete in terms of member galaxies, so these incomplete groups were removed in data pre-processing. Furthermore, we selected only groups with host halos containing more than 100 dark matter particles. The observer is at edge plane of the simulation box, with the perpendicular axis to this plane as the line-of-sight direction. The apparent magnitudes of galaxies were computed based on their absolute magnitudes and their line-of-sight distances, adhering to an r-band magnitude limit of magr<17.7𝑚𝑎subscript𝑔𝑟17.7mag_{r}<17.7italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 17.7, which aligns with observational sample criteria of the SDSS sample (Abazajian et al. 2009). The resulting galaxy catalogs formed the foundation for training our machine-learning models. The training dataset includes 1,298,413 galaxies distributed in 945,078 dark matter halos.

2.2 Test Data

The basic test data sets were generated from six small sub-boxes in the MS, each with a size of (200h1Mpc)3superscript200superscript1Mpc3(200h^{-1}\mathrm{Mpc})^{3}( 200 italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc ) start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. All the selection criteria were same as those used for the training datasets. The resulting six basic test datasets contain a total of 1,650,251 galaxies distributed across 1,189,865 dark matter halos.

We also evaluated our model using three extended datasets. The magnitude limited datasets, derived from the same simulation box as the training and basic test datasets, include different apparent magnitude limits for the r-band: magr<16𝑚𝑎subscript𝑔𝑟16mag_{r}<16italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 16, magr<15𝑚𝑎subscript𝑔𝑟15mag_{r}<15italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 15, and magr<14𝑚𝑎subscript𝑔𝑟14mag_{r}<14italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 14, resulting in galaxy catalogues containing 169,375, 118,931, and 77,950 galaxies, respectively. The high-z datasets comprise galaxies from higher redshift snapshots at z=0.32𝑧0.32z=0.32italic_z = 0.32 (Snapshot 52), z=0.62𝑧0.62z=0.62italic_z = 0.62 (Snapshot 46), and z=1.08𝑧1.08z=1.08italic_z = 1.08 (Snapshot 40) of the Millennium Simulation, using the same limit of apparent magnitude as magr<17.7𝑚𝑎subscript𝑔𝑟17.7mag_{r}<17.7italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 17.7. These high-z datasets contain 593,102, 651,365, and 670,209 galaxies, respectively.

The TNG samples are derived from the IllustrisTNG Project, which builds upon the earlier Illustris simulation. TNG features cosmological magnetohydrodynamical simulations aimed at understanding key mechanisms in galaxy formation and evolution. It includes three main simulation runs with different scales and resolutions: TNG50, TNG100, and TNG300. In our research, we use TNG300, which is the largest simulation, which includes a volume of (302.6Mpc)3superscript302.6Mpc3(302.6\mathrm{Mpc})^{3}( 302.6 roman_Mpc ) start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. This simulation begins at redshift z=127 and is based on the Planck 2015 ΛΛ\Lambdaroman_ΛCDM cosmological parameters (ΩΛ,0=0.6911subscriptΩΛ00.6911\Omega_{\Lambda,0}=0.6911roman_Ω start_POSTSUBSCRIPT roman_Λ , 0 end_POSTSUBSCRIPT = 0.6911, Ωm,0=0.3089subscriptΩ𝑚00.3089\Omega_{m,0}=0.3089roman_Ω start_POSTSUBSCRIPT italic_m , 0 end_POSTSUBSCRIPT = 0.3089, Ωb,0=0.0486subscriptΩ𝑏00.0486\Omega_{b,0}=0.0486roman_Ω start_POSTSUBSCRIPT italic_b , 0 end_POSTSUBSCRIPT = 0.0486, σ8=0.8159subscript𝜎80.8159\sigma_{8}=0.8159italic_σ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT = 0.8159, ns=0.9667subscript𝑛𝑠0.9667n_{s}=0.9667italic_n start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = 0.9667, h=0.67740.6774h=0.6774italic_h = 0.6774), with dark matter particles having a resolution of mDM=5.9×107Msubscript𝑚𝐷𝑀5.9superscript107subscript𝑀direct-productm_{DM}=5.9\times 10^{7}M_{\odot}italic_m start_POSTSUBSCRIPT italic_D italic_M end_POSTSUBSCRIPT = 5.9 × 10 start_POSTSUPERSCRIPT 7 end_POSTSUPERSCRIPT italic_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT and gas cells averaging mgas=1.1×107Msubscript𝑚𝑔𝑎𝑠1.1superscript107subscript𝑀direct-productm_{gas}=1.1\times 10^{7}M_{\odot}italic_m start_POSTSUBSCRIPT italic_g italic_a italic_s end_POSTSUBSCRIPT = 1.1 × 10 start_POSTSUPERSCRIPT 7 end_POSTSUPERSCRIPT italic_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT in mass. Our study selects galaxies within subhalos containing over 20 dark-matter particles and positive stellar mass (Ms>0subscript𝑀𝑠0M_{s}>0italic_M start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT > 0). We apply apparent magnitude cut-off of magr17.7𝑚𝑎subscript𝑔𝑟17.7mag_{r}\leq 17.7italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ≤ 17.7, using z = 0 as the observer’s reference frame. The TNG300 galaxy mock catalogue includes 338,161 galaxies and 221,971 groups.

To assess how well the model applies to actual redshift surveys, we evaluated it using a redshift-space mock catalogue. Despite the fact that the machine learning model was initially trained on non-redshift-space data, our group finder does not depend on exact redshift measurements, making it adaptable to redshift space. The redshift distorted samples is sourced from the z=0𝑧0z=0italic_z = 0 snapshot of the MS. As with the training data, groups intersecting the box boundaries and those with host halos having fewer than 100 dark matter particles were omitted. In this dataset, galaxy line-of-sight distances were adjusted factoring in redshift distortion effects, determined by their line-of-sight velocities. An apparent magnitude limit of magr<17.7𝑚𝑎subscript𝑔𝑟17.7mag_{r}<17.7italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 17.7 was maintained. The redshift distorted samples comprises 397,283 galaxies in 293,030 groups.

Same as the training data, the observer of all test simulation boxes will also be placed at edge plane of the simulation box, and its perpendicular axis as the line-of-sight direction.

3 Method

Our group finding algorithm is based on artificial neural networks (ANN). ANNs are a rapidly growing area of machine learning, and many network architectures, such as multilayer perceptrons, convolutional neural networks, graph neural networks, and recurrent neural networks, have been developed to solve different types of problems. ANNs use non-linear models to solve complex problems by optimising trainable parameters through gradient descent. This training process involves feeding data to the network to adjust these parameters.

Our group identification system utilizes observable attributes of galaxies to detect groups and predict their virial masses. The framework is comprised of two machine learning models: 1) Central Galaxy Identifier, a classification model that identifies the most likely central galaxy from a galaxy’s nearby neighbors; 2) Group Mass Estimator, a regression model that calculates a group’s mass based on its member galaxies’ properties. The initial output from the central galaxy identifier generally results in smaller group segments that align with actual groups, which are regarded as group candidates. These candidates undergo merging according to certain criteria, considering their virial mass projections from the group mass estimator and their spatial arrangement. This merging process continues until the group catalog stabilizes with no further modifications. A thorough description of these models and ensuing steps will be discussed in the next sections.

3.1 Identification of Central Galaxies

The catalog of galaxy groups can be described by the center-satellite systems, which highlights the importance of pinpointing the central galaxy for every group. To facilitate this, we have created an artificial neural network (ANN) to recognize central galaxies by analyzing their environmental features. The network uses the following properties as input:

Mr,0:r-band magnitude of target galaxy:subscript𝑀𝑟0r-band magnitude of target galaxy\displaystyle M_{r,0}:\text{r-band magnitude of target galaxy}italic_M start_POSTSUBSCRIPT italic_r , 0 end_POSTSUBSCRIPT : r-band magnitude of target galaxy
(gr)0:color of target galaxy:subscript𝑔𝑟0color of target galaxy\displaystyle(g-r)_{0}:\text{color of target galaxy}( italic_g - italic_r ) start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT : color of target galaxy
z0:redshift of target galaxy:subscript𝑧0redshift of target galaxy\displaystyle z_{0}:\text{redshift of target galaxy}italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT : redshift of target galaxy
di:projection distance to its neighbors:subscript𝑑𝑖projection distance to its neighbors\displaystyle d_{i}:\text{projection distance to its neighbors}italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : projection distance to its neighbors
Mr,i:r-band magnitude of neighbor galaxies:𝑀𝑟𝑖r-band magnitude of neighbor galaxies\displaystyle M{r,i}:\text{r-band magnitude of neighbor galaxies}italic_M italic_r , italic_i : r-band magnitude of neighbor galaxies
dzi:redshift distance to its neighbors:𝑑subscript𝑧𝑖redshift distance to its neighbors\displaystyle dz_{i}:\text{redshift distance to its neighbors}italic_d italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : redshift distance to its neighbors
(gr)i:color of neighbor galaxies:subscript𝑔𝑟𝑖color of neighbor galaxies\displaystyle(g-r)_{i}:\text{color of neighbor galaxies}( italic_g - italic_r ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : color of neighbor galaxies

Here, the target galaxy refers to the galaxy for which we want to identify the central galaxy. Central galaxies are selected from the 10 nearest neighboring galaxies of the target galaxy, with i𝑖iitalic_i representing the rank of the neighbors, ranging from 1 to 10. When analyzing actual survey data, the redshift distortion will cause uncertainties in the estimation of line-of-sight distances. To account for this, we allow a ±5h1Mpcplus-or-minus5superscript1Mpc\pm 5h^{-1}\mathrm{Mpc}± 5 italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc margin when pinpointing nearby galaxies. Galaxies positioned within ±5h1Mpcplus-or-minus5superscript1Mpc\pm 5h^{-1}\mathrm{Mpc}± 5 italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc of a target galaxy’s line-of-sight are deemed equivalent, and only projection distances are used to identify such neighbors. These neighboring galaxies are ranked according to their projection distance to the target galaxy. In the simulation, we verified that the central galaxies of approximately 94%percent9494\%94 % of the galaxies lie within their 10 nearest neighbors plus themselves.

Target galaxies are categorized into 12 distinct classes based on their identification outcomes. Specifically, the ten closest neighbors are numbered from 1 to 10, ordered by their projected distances (disubscript𝑑𝑖d_{i}italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) in ascending sequence, and the target galaxy is indexed as 0. If the central galaxy of the target galaxy is included in these 11 galaxies, it is given the label corresponding to the central’s own index. If none of these galaxies serve as its central galaxy, it receives a label of 11.

Our neural network consists of four hidden layers, each with a Rectified Linear Unit (ReLU) activation function. The network outputs a 12-element vector representing the probabilities of a galaxy belonging to each of the possible classes. We use cross-entropy as the loss function and train the neural network using the training dataset from the Millennium Simulation (MS) for 500 epochs.

Figure 1 illustrates the confusion matrix derived from our findings. A confusion matrix (or error matrix) offers a comprehensive overview of accurate and inaccurate classifications. Notably, the class designated as 0, which identifies the galaxy itself as the central galaxy, is the most prevalent and exhibits the greatest accuracy. The other classes maintain acceptable accuracy, although there is a declining trend as the class label numbers increase. Additionally, it is significant to note that a considerable fraction of other classes is classified as class 11.

Refer to caption
Figure 1: Confusion matrix for central galaxy identification on the test dataset. The x-axis represents the predicted labels, while the y-axis represents the true value. Each number within a cell represents the fraction of galaxies with a true label of ’y’ that were predicted as label ’x.’ Consequently, the sum of the numbers in each row equals 1. The color shading within the matrix also reflects this fraction.

Although the identifier for central galaxies demonstrates strong accuracy in various galaxy samples, there remain cases where galaxies are inaccurately categorized as central, or the central counterpart is not found (assigned to class 11). Importantly, the core objective of the central galaxy identifier, as an initial component of our group finder, is not to accurately assign a central galaxy to each individual galaxy. Instead, it aims to ensure that the predicted central-satellite pairs are situated within the same dark matter halo, thus making them viable candidates for groups in subsequent merging processes.

Furthermore, with the test data, it is observed that 99.02% of the center-satellite pairs coexist within the same halo. This suggests that the groups formed by the central galaxy identifier serve as an effective initial step in the identification of galaxy groups.

3.2 Estimation of group mass

The host halo mass of galaxies is vital for comprehending the development and dynamics of galaxies. In order to approximate the halo mass of the galaxy group identified in the previous section, we developed a new artificial neural network specifically designed to forecast the halo mass of groups. This is accomplished by using data from the central galaxies and the it’s top large satellite galaxies within these groups.

The network takes the following properties of a target group as inputs to predict its halo mass:

M,c:stellar mass of central galaxy:subscript𝑀𝑐stellar mass of central galaxy\displaystyle M_{*,c}:\text{stellar mass of central galaxy}italic_M start_POSTSUBSCRIPT ∗ , italic_c end_POSTSUBSCRIPT : stellar mass of central galaxy
M,t:total stellar mass of all galaxies in the group:subscript𝑀𝑡total stellar mass of all galaxies in the group\displaystyle M_{*,t}:\text{total stellar mass of all galaxies in the group}italic_M start_POSTSUBSCRIPT ∗ , italic_t end_POSTSUBSCRIPT : total stellar mass of all galaxies in the group
N:total number of galaxies in the group:𝑁total number of galaxies in the group\displaystyle N:\text{total number of galaxies in the group}italic_N : total number of galaxies in the group
Mr,max:maximum of r-band magnitude among all galaxies:subscript𝑀𝑟𝑚𝑎𝑥maximum of r-band magnitude among all galaxies\displaystyle M_{r,max}:\text{maximum of r-band magnitude among all galaxies}italic_M start_POSTSUBSCRIPT italic_r , italic_m italic_a italic_x end_POSTSUBSCRIPT : maximum of r-band magnitude among all galaxies
M,i:stellar mass of most massive 5 satellite galaxies:subscript𝑀𝑖stellar mass of most massive 5 satellite galaxies\displaystyle M_{*,i}:\text{stellar mass of most massive 5 satellite galaxies}italic_M start_POSTSUBSCRIPT ∗ , italic_i end_POSTSUBSCRIPT : stellar mass of most massive 5 satellite galaxies
di:projection distance of the 5 satellites to group center:subscript𝑑𝑖projection distance of the 5 satellites to group center\displaystyle d_{i}:\text{projection distance of the 5 satellites to group center}italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : projection distance of the 5 satellites to group center

The variable i𝑖iitalic_i represents the index of five most massive satellite galaxies within the group, i=1𝑖1i=1italic_i = 1 is the most massive satellite galaxy. In cases where a group has fewer than five members, M,isubscript𝑀𝑖M_{*,i}italic_M start_POSTSUBSCRIPT ∗ , italic_i end_POSTSUBSCRIPT and disubscript𝑑𝑖d_{i}italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT will be filled with zeros.

The network consists of four hidden layers, each using a Rectified Linear Unit (ReLU) activation function. We use Mean Squared Error as the loss function and train the neural network on training dataset from the Millennium Simulation (MS) for 500 epochs.

3.3 Identification of groups

Although the central galaxy identifier shows commendable precision with simulated data, deeper scrutiny discloses that it may unintentionally divide larger groups into smaller subsections. This division can adversely affect the halo mass function and hinder the accurate identification of massive halos. To remedy this, we suggest an iterative method to reconcile and unify sections of authentic groups, commencing with those identified by the central galaxy identifier.

The correction methodology is outlined in the steps below:

  1. 1.

    Group and Mass Estimation: The groups predicted by the machine learning algorithm are considered as candidates. The Group Mass Estimator is employed to forecast the halo mass, labeled as M200subscript𝑀200M_{200}italic_M start_POSTSUBSCRIPT 200 end_POSTSUBSCRIPT, for each group. We then compute R200subscript𝑅200R_{200}italic_R start_POSTSUBSCRIPT 200 end_POSTSUBSCRIPT using the equation M200/43πr2003=ρcsubscript𝑀20043𝜋superscriptsubscript𝑟2003subscript𝜌𝑐{M_{200}}/{\frac{4}{3}\pi r_{200}^{3}}=\rho_{c}italic_M start_POSTSUBSCRIPT 200 end_POSTSUBSCRIPT / divide start_ARG 4 end_ARG start_ARG 3 end_ARG italic_π italic_r start_POSTSUBSCRIPT 200 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT = italic_ρ start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT.

  2. 2.

    Group Consolidation: For every candidate group, all galaxies located within its R200subscript𝑅200R_{200}italic_R start_POSTSUBSCRIPT 200 end_POSTSUBSCRIPT radius are examined. As mentioned before, we also set a line-of-sight ±5h1Mpcplus-or-minus5superscript1Mpc\pm 5h^{-1}\mathrm{Mpc}± 5 italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc tolerance when searching for these near galaxies, which will guarantee the ability of our model to be further used in redshift surveys. If neighboring galaxies belong to another group, the two groups are combined into one. The new group’s center is aligned with the more massive of the initial groups, determined by their predicted halo masses, integrating all members from the original groups.

  3. 3.

    Reiteration: After merging, an updated group catalog is created. These revised groups are used as candidates to predict new halo masses, repeating the second step until the group catalog is stable and no more changes occur.

This correction strategy greatly enhances our group finder’s effectiveness, especially for large groups. The precision of central galaxy allocation increases from 82% to approximately 90%.

4 Results

In this section, we perform a quantitative assessment of our group finder’s performance on all the test datasets described in Section 2.2, which include:

  1. 1.

    Basic test datasets

  2. 2.

    Extended test datasets: different magnitude limit dataset, high-z dataset, and TNG dataset

  3. 3.

    Readshift space dataset

To assess performance, it is crucial to align the identified groups (IGs) discovered by the group finder with the actual groups (TGs) in the simulation datasets. However, due to unavoidable inaccuracies in assigning member galaxies, the IGs and TGs may not have identical membership. According to Campbell et al. (2015), these inaccuracies usually lead to two types of failure: ’fracturing’ and ’fusing.’ Fracturing happens when galaxies belonging to one true group are mistakenly split into multiple identified groups, whereas fusing occurs when galaxies from separate true groups are erroneously combined into one. These failure modes can appear separately or together, complicating the alignment of IGs and TGs. For clarity in presentation, the following notation will be employed:

  • IG: A group identified by the group finder

  • TG: A true group within the simulation data, located in a host halo

  • IG-T: An identified group matched to a true group

  • TG-I: A true group matched to an identified group

Following the method proposed by Wang et al. (2020), we execute a fusion of Member Matching and Central Matching to align IGs with TGs. Member Matching occurs when over 50% of an IG’s members are also within a TG, and vice versa. Central Matching is achieved if the primary galaxy of an IG coincides with that of a TG. Typically, Member Matching is regarded as more reliable, though it is more stringent for smaller assemblages. The combined approach seeks IGs that fit both member and central matching criteria for a TG, and if the outcomes of these methods diverge, we give preference to the Member Matching result. This matching procedure produces a collection of one-to-one pairings, as previously defined as IG-Ts and TG-Is.

We evaluate the model’s effectiveness using the corresponding pairs of TGs and IGs by analyzing group completeness and purity, the completeness and purity of member galaxies in groups, as well as the precision of halo mass forecasts.

4.1 Basic test datasets

Initially, we assessed the effectiveness of our group finder using the basic test datasets outlined in Section 2.2. Prior to performing quantitative analyses, we chose a particular section within the simulated test box, which corresponds to a slice measuring 60×60×20(h1Mpc)3606020superscriptsuperscript1Mpc360\times 60\times 20(h^{-1}\mathrm{Mpc})^{3}60 × 60 × 20 ( italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc ) start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. This section was selected to visually compare the actual groups in the simulation with those predicted by our method. In Figure 2, blue circles denote the true groups, while red circles represent the predicted groups in this region, with RvirsubscriptR𝑣𝑖𝑟\mathrm{R}_{vir}roman_R start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT as their radii. The proximity in their locations and sizes indicates the proficiency of our group finder.

Refer to caption
Figure 2: Comparison of true and predicted groups within a 40×20×20(h1Mpc)3402020superscriptsuperscript1Mpc340\times 20\times 20(h^{-1}\mathrm{Mpc})^{3}40 × 20 × 20 ( italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc ) start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT slice in the test box. There are 2087 galaxies in this region, shown as black point. These galaxies are resided in 1,020 true groups (halos) in the simulation, represented as background blue circles, with RvirsubscriptR𝑣𝑖𝑟\mathrm{R}_{vir}roman_R start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT of true groups as their radii. The groups identified by our group finder are shown as red circles, with RvirsubscriptR𝑣𝑖𝑟\mathrm{R}_{vir}roman_R start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT (calculated from predicted MvirsubscriptM𝑣𝑖𝑟\mathrm{M}_{vir}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT) as radii. The identified groups and true groups show strong agreement in both position and size. For example, for the largest halo, we successfully assigned 316 out of 355 true member galaxies and predicted a virial mass of 1014.57h1Msuperscript1014.57superscript1subscriptMdirect-product10^{14.57}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 14.57 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT, with an error of less than 0.2 dex compared to the true value of 1014.72h1Msuperscript1014.72superscript1subscriptMdirect-product10^{14.72}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 14.72 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT.

.

4.1.1 Group completeness and purity

We employ global completeness and purity to assess the model’s performance at the group level. Group completeness, represented as C=N(TGIs)N(TGs)𝐶NTGIsNTGsC=\frac{\mathrm{N(TG-Is)}}{\mathrm{N(TGs)}}italic_C = divide start_ARG roman_N ( roman_TG - roman_Is ) end_ARG start_ARG roman_N ( roman_TGs ) end_ARG, is the ratio of true groups (TGs) that are correctly identified by the group finder. Group purity, indicated by P=N(IGTs)N(IGs)𝑃NIGTsNIGsP=\frac{\mathrm{N(IG-Ts)}}{\mathrm{N(IGs)}}italic_P = divide start_ARG roman_N ( roman_IG - roman_Ts ) end_ARG start_ARG roman_N ( roman_IGs ) end_ARG, shows the percentage of identified groups (IGs) accurately matched with TGs. It’s crucial to note that N(TG-Is) is equal to N(IG-Ts).

Figure 3 illustrates the group completeness and purity for the basic test datasets. We plot how group completeness varies with the virial mass of the true groups Mvir,truesubscriptM𝑣𝑖𝑟true\mathrm{M}_{vir,\mathrm{true}}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r , roman_true end_POSTSUBSCRIPT and how group purity changes with the group mass predicted by our group finder Mvir,predictsubscriptM𝑣𝑖𝑟predict\mathrm{M}_{vir,\mathrm{predict}}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r , roman_predict end_POSTSUBSCRIPT. The blue line denotes the average completeness over the six basic test datasets, with error bars showing the 1σ1𝜎1\sigma1 italic_σ deviation. The green line and error bars similarly depict the average purity value and standard deviation for the six datasets. Both group completeness and purity tend towards 100% as virial mass increases and maintain a level above 90% for all groups with Mvir1011h1MsubscriptM𝑣𝑖𝑟superscript1011superscript1subscriptMdirect-product\mathrm{M}_{vir}\geq 10^{11}h^{-1}\mathrm{M}_{\odot}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT ≥ 10 start_POSTSUPERSCRIPT 11 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT. Notably, purity consistently exceeds 95% within this mass range.

The findings reveal that our group finder successfully detects most actual halos, and a significant percentage of the discovered groups align with true groups. It is worth noting that completeness is somewhat diminished for low-mass halos. This reduction is mainly due to the increased likelihood of smaller halos being mistakenly classified as extensions of adjacent, larger halos, which aligns with the ’fusing’ error category. Nonetheless, low-mass groups are uncommon and incomplete in both our test datasets and real-world surveys, so they do not substantially affect the overall efficacy of the group finder.

Refer to caption
Figure 3: Group completeness and purity on the basic test datasets. See the section 4.1.1 for the definitions of group completeness and purity. The line and error bar show mean and 1σ1𝜎1\sigma1 italic_σ value across the six test datasets. Completeness is shown in blue as a function of the virial mass of true groups Mvir,truesubscriptM𝑣𝑖𝑟true\mathrm{M}_{vir,\mathrm{true}}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r , roman_true end_POSTSUBSCRIPT, while purity is shown in green as a function of the predicted virial mass of identified groups Mvir,predictsubscriptM𝑣𝑖𝑟predict\mathrm{M}_{vir,\mathrm{predict}}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r , roman_predict end_POSTSUBSCRIPT. All Groups with Mvir1011h1MsubscriptM𝑣𝑖𝑟superscript1011superscript1subscriptMdirect-product\mathrm{M}_{vir}\geq 10^{11}h^{-1}\mathrm{M}_{\odot}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT ≥ 10 start_POSTSUPERSCRIPT 11 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT achieve greater than 90%percent9090\%90 % completeness and purity.

4.1.2 Group member completeness and purity

We further assess the group finder’s effectiveness in assigning galaxies as group members. Although member and center matching assists in recognizing groups, the precision in determining whether the assigned galaxies truly belong to their host groups is uncertain. Consequently, we establish metrics to evaluate performance within the groups.

Consider a galaxy group identified by our group finder, which comprises Ni predicted member galaxies. Assume that its corresponding true group (halo) contains Nt member galaxies. If Ns galaxies are shared between the true and predicted members, we define the following metrics:

  1. 1.

    Member Completeness: fc=NsNtsubscript𝑓𝑐subscriptN𝑠subscriptN𝑡f_{c}=\frac{\mathrm{N}_{s}}{\mathrm{N}_{t}}italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = divide start_ARG roman_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_ARG start_ARG roman_N start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG

  2. 2.

    Member Purity: fp=NsNisubscript𝑓𝑝subscriptN𝑠subscriptN𝑖f_{p}=\frac{\mathrm{N}_{s}}{\mathrm{N}_{i}}italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = divide start_ARG roman_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_ARG start_ARG roman_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG

These metrics can solely be determined for matched pairs of actual and predicted groups, since defining Nt and Ns is infeasible for unmatched groups. The percentage of matched identified groups corresponds to group purity, as mentioned in Section 4.1.1.

Figure 4 presents the member completeness and purity for the test datasets. The left panel shows the cumulative distribution of member completeness (fcsubscript𝑓𝑐f_{c}italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT), indicating the proportion of groups with a completeness of at least x=fc𝑥subscript𝑓𝑐x=f_{c}italic_x = italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. Different line styles correspond to four specific mass bins within the test sample.

Notably, nearly all low-mass groups (Mvir<1012h1MsubscriptM𝑣𝑖𝑟superscript1012superscript1subscriptMdirect-product\mathrm{M}_{vir}<10^{12}h^{-1}\mathrm{M}_{\odot}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT < 10 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT) reach a completeness of fc=1subscript𝑓𝑐1f_{c}=1italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = 1, highlighting the model’s proficiency in galaxy membership allocation. The smaller number of galaxies in low-mass groups makes full member identification easier. For groups with masses from 1012h1Msuperscript1012superscript1subscriptMdirect-product10^{12}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT to slightly under 1013h1Msuperscript1013superscript1subscriptMdirect-product10^{13}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT, there is a minor decline in member completeness (fcsubscript𝑓𝑐f_{c}italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT), yet approximately 95% still achieve fc=1subscript𝑓𝑐1f_{c}=1italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = 1. Conversely, high-mass groups, due to their larger number of members, often miss some members, especially those at the edges. Approximately 80% of high-mass groups (Mvir>1013h1MsubscriptM𝑣𝑖𝑟superscript1013superscript1subscriptMdirect-product\mathrm{M}_{vir}>10^{13}h^{-1}\mathrm{M}_{\odot}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT > 10 start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT) reach a completeness of 0.8, with around 50% to 70% achieving fc0.9subscript𝑓𝑐0.9f_{c}\geq 0.9italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ≥ 0.9.

The right panel depicts the cumulative distribution of member purity (fpsubscript𝑓𝑝f_{p}italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT), showing the fraction of groups with fpsubscript𝑓𝑝f_{p}italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT at least x=fp𝑥subscript𝑓𝑝x=f_{p}italic_x = italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT. Like completeness, member purity for low-mass groups is almost 1. For mid-mass groups, fpsubscript𝑓𝑝f_{p}italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is slightly lower than fcsubscript𝑓𝑐f_{c}italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. In high-mass groups, roughly 80% achieve fp0.8subscript𝑓𝑝0.8f_{p}\geq 0.8italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ≥ 0.8, with 50% reaching fp0.9subscript𝑓𝑝0.9f_{p}\geq 0.9italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ≥ 0.9. Overall, the membership evaluation suggests that our group finder efficiently manages most member assignments.

Refer to caption
Figure 4: Member completeness and purity on the test datasets. Left panel: The x-axis represents the member completeness of groups (fcsubscript𝑓𝑐f_{c}italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT), while the y-axis indicates the fraction of galaxies with a completeness exceeding x=fc𝑥subscript𝑓𝑐x=f_{c}italic_x = italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. The results are presented for 4 mass bins, each distinguished by different colors. The lines and error bars show mean and 1σ𝜎\sigmaitalic_σ value of the results across four mass bins. Right panel: Similar to the left panel, but shows the result of member purity of groups.

4.1.3 Halo mass

Determining the masses of galaxy groups is essential for compiling a catalog of such groups. In training the group mass estimator, we input the characteristics of genuine groups into the neural network, yet the model is designed to predict masses for more than just genuine groups. At each iteration step of group-finding process, a predicted mass is assigned to every potential group, significantly influencing the merging process of these potential groups. The previously demonstrated completeness and purity of groups, along with their member galaxies, indicate that the mass estimator is adept at predicting the masses of candidate groups in the iterative correction steps as well. In this context, we concentrate on the mass distribution of the final results produced by the group finder.

Refer to caption
Figure 5: Comparison of true and predicted group mass for the basic test dataset. The upper panel shows a 2D histogram comparing the true and predicted masses. The x-axis represents the virial mass of true groups in simulation, while the y-axis represents the predicted mass of groups identified by group finder. Both true and predicted mass are divided into 100 bins across the mass range of 1010h1Msuperscript1010superscript1subscriptMdirect-product10^{10}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT to 1015.5h1Msuperscript1015.5superscript1subscriptMdirect-product10^{15.5}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 15.5 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT. The red solid line represents the ideal 1:1 relation. In the lower panel, we show the standard deviation of (logMvir,truelogMvir,predict)subscriptM𝑣𝑖𝑟truesubscriptM𝑣𝑖𝑟predict(\log\mathrm{M}_{vir,\mathrm{true}}-\log\mathrm{M}_{vir,\mathrm{predict}})( roman_log roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r , roman_true end_POSTSUBSCRIPT - roman_log roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r , roman_predict end_POSTSUBSCRIPT ).

Our group finder is capable of assigning a mass to each predicted group based solely on observable properties of its member galaxies. Figure 5 shows a comparison between the actual mass of the groups and the estimated mass of the detected ones. This analysis, akin to determining the completeness and purity of the group members, is feasible only for pairs of true and identified groups that have been accurately paired, as both Mvir,truesubscriptM𝑣𝑖𝑟true\mathrm{M}_{vir,\mathrm{true}}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r , roman_true end_POSTSUBSCRIPT and Mvir,predictsubscriptM𝑣𝑖𝑟predict\mathrm{M}_{vir,\mathrm{predict}}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r , roman_predict end_POSTSUBSCRIPT are necessary. The proportion of these accurately matched groups indicates the groups’ completeness and purity, illustrated in Figure 3. The variance in the forecasted group mass is below 0.3 dex across all mass bins. The standard deviation tends to be slightly higher for mass ranges of 1012h1M1013h1Msimilar-tosuperscript1012superscript1subscriptMdirect-productsuperscript1013superscript1subscriptMdirect-product10^{12}h^{-1}\mathrm{M}_{\odot}\sim 10^{13}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT ∼ 10 start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT and >1014.5h1Mabsentsuperscript1014.5superscript1subscriptMdirect-product>10^{14.5}h^{-1}\mathrm{M}_{\odot}> 10 start_POSTSUPERSCRIPT 14.5 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT. The first rise is attributed to the stellar mass-to-halo mass association, introducing more variability within the 1012h1Msuperscript1012superscript1subscriptMdirect-product10^{12}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT to 1013h1Msuperscript1013superscript1subscriptMdirect-product10^{13}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT span. The second peak appears at the largest masses, likely due to limited training data for such substantial halo masses.

Refer to caption
Figure 6: Comparison of predicted and true halo mass functions in the basic test datasets. The halo mass range [1011h1M,1015.5h1M]superscript1011superscript1subscriptMdirect-productsuperscript1015.5superscript1subscriptMdirect-product[10^{11}h^{-1}\mathrm{M}_{\odot},10^{15.5}h^{-1}\mathrm{M}_{\odot}][ 10 start_POSTSUPERSCRIPT 11 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT , 10 start_POSTSUPERSCRIPT 15.5 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT ] is divided into 50 bins. The grey area shows the 1σ1𝜎1\sigma1 italic_σ range of the true mass distribution across all six basic test datasets. Blue points and error bars represent the predicted mean and 1σ1𝜎1\sigma1 italic_σ results for the test datasets. The two functions show only slight differences for halo masses 1014.5Mabsentsuperscript1014.5subscript𝑀direct-product\geq 10^{14.5}M_{\odot}≥ 10 start_POSTSUPERSCRIPT 14.5 end_POSTSUPERSCRIPT italic_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT, which may be due to the limited number of large groups.

Figure 6 not only provides a direct comparison between the masses of true and identified groups but also displays the halo mass functions for all groups, including unmatched ones. The simulation’s true halo mass function is depicted as a shaded grey region, marking the 1σ1𝜎1\sigma1 italic_σ interval across the six basic test datasets. In contrast, the group finder’s predicted halo mass function is illustrated with blue points and error bars, showing the mean and 1σ1𝜎1\sigma1 italic_σ values derived from the six test sets. Across the entire four-order mass range, both functions are largely consistent with error less than 20%, except for the largest group. This discrepancy maybe arises due to the scarcity of these large groups.

4.2 Results for extended test datasets

While our model is developed and validated using simulated galaxy catalogs, we are confident that this methodology can also be applied to real observational data. For flexibility, our technique uses fundamental input features and employs a straightforward neural network architecture, which aids in reducing overfitting to the training data. To evaluate the model’s performance with different real galaxy surveys, we applied the comprehensive datasets described in Section 4.2 to test the group’s identification efficiency. Details of these datasets are provided as follows:

  1. 1.

    Magnitude limited datasets: These samples is similar to those found in the initial test datasets, all sourced from the Millennium Simulation at a redshift of 0. They differ by employing distinct r-band apparent magnitude cutoffs: magr<16𝑚𝑎subscript𝑔𝑟16mag_{r}<16italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 16, magr<15𝑚𝑎subscript𝑔𝑟15mag_{r}<15italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 15, and magr<14𝑚𝑎subscript𝑔𝑟14mag_{r}<14italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 14. This variation enables an evaluation of the model’s effectiveness for shallower surveys.

  2. 2.

    High-z datasets. These datasets comprise galaxy samples extracted from higher-redshift snapshots within the Millennium Simulation, notably at redshifts z=0.32𝑧0.32z=0.32italic_z = 0.32 (Snapshot 52), z=0.62𝑧0.62z=0.62italic_z = 0.62 (Snapshot 46), and z=1.08𝑧1.08z=1.08italic_z = 1.08 (Snapshot 40). The apparent magnitude constraints are consistent with those found in the fundamental test datasets. These samples assist in assessing whether our model is affected by the evolution of galaxies and halos.

  3. 3.

    TNG300 datasets. The datasets are derived from the TNG300 simulation at redshift 0, using identical apparent magnitude limits as those in the training dataset. This allows for an evaluation of how our scheme depends on the galaxies’ physical models and cosmological parameters.

Table 1 shows the fundamental characteristics of galaxies, dark matter halos, and galaxy groups as estimated by the model for these datasets. The model exhibits robust predictive accuracy across these varied samples, which are notably different from the training and basic test datasets. Group completeness stays around or surpasses 90%, and group purity remains consistently over 97%. Within the extended datasets, the TNG300 sample achieves the most outstanding overall results.

Catalog Total galaxies Total halos Total groups Group completeness Group purity
MSI_Mag16 169,375 125,850 120,295 0.93 0.98
MSI_Mag15 118,931 89,325 86,077 0.94 0.97
MSI_Mag14 77,950 58,675 56,774 0.94 0.97
MSI_z0.32 593,102 421,840 396,798 0.92 0.98
MSI_z0.62 651,365 458,889 406,857 0.87 0.98
MSI_z1.08 670,209 470,646 423,472 0.88 0.98
TNG300 338,161 221,971 217,755 0.95 0.97
Table 1: Performance of the group finder across various catalogs. Each entry includes the total counts of galaxies, halos (true groups), and predicted groups, as well as group completeness and purity for evaluating the model’s effectiveness. The results illustrate the model’s robust performance, even when applied to diverse datasets beyond its original training set

Figure 7 offers a detailed portrayal of the concepts of group completeness and purity. It shows the variations in completeness and purity across different mass ranges for each extended test dataset. For reference, the basic test datasets are also shown (black solid line). At redshifts z=0.62𝑧0.62z=0.62italic_z = 0.62 and z=1.08𝑧1.08z=1.08italic_z = 1.08, usually characterized by lower completeness, values drop below 90% only in the lowest mass range (below 1012h1Msuperscript1012superscript1subscriptMdirect-product10^{12}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT). Notably, some datasets demonstrate greater group completeness than the basic test data. The TNG300 dataset, in particular, maintains a completeness rate over 95% across all mass categories.

In terms of group purity, each sample consistently surpasses 95% in every mass range, closely aligning with the basic test set results. This serves as evidence for the rarity of the model incorrectly dividing a single true group into multiple predicted groups, known as the fracturing failure mode. Furthermore, there is no direct linear relation between group completeness, group purity, apparent magnitude thresholds, and redshift; these measures display varied patterns across different mass ranges.

Refer to caption
Figure 7: Completeness and purity of galaxy groups for the extended test datasets. Red lines represent the results for different magnitude limited datasets: the dotted line corresponds to the sample with magr16𝑚𝑎subscript𝑔𝑟16mag_{r}\leq 16italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ≤ 16, the dashed line to magr15𝑚𝑎subscript𝑔𝑟15mag_{r}\leq 15italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ≤ 15, and the dashdot line to magr14𝑚𝑎subscript𝑔𝑟14mag_{r}\leq 14italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ≤ 14. Green lines represent high-z dataset, with the dotted line for z=0.32𝑧0.32z=0.32italic_z = 0.32, the dashed line for z=0.62𝑧0.62z=0.62italic_z = 0.62, and the dashdot line for z=1.08𝑧1.08z=1.08italic_z = 1.08. The blue dotted line represents the TNG300 dataset, while the solid black line serves as a benchmark for the basic test datasets’ results. Except for the low-mass intervals in z=0.62𝑧0.62z=0.62italic_z = 0.62 and z=1.08𝑧1.08z=1.08italic_z = 1.08, completeness for all samples exceeds 90%, and purity exceeds 95% for all samples.

Concerning the completeness and purity of the member galaxies (fcsubscript𝑓𝑐f_{c}italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and fpsubscript𝑓𝑝f_{p}italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT), the extended test datasets yielded quite positive outcomes. Figure 8 illustrates findings for the TNG300 dataset, which emerges as the most exemplary among the expanded test datasets. The values of fcsubscript𝑓𝑐f_{c}italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and fpsubscript𝑓𝑝f_{p}italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT in the TNG300 sample align closely with the original test data, with over 80% of galaxy groups achieving fc,fp0.8subscript𝑓𝑐subscript𝑓𝑝0.8f_{c},f_{p}\geq 0.8italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ≥ 0.8. Considering that our model was developed using the Millennium Simulation (a semi-analytic simulation), while TNG300 is a hydrodynamical model, the disparities in physical processes, simulation parameters, and cosmological parameters between the two are substantial. These outcomes highlight the strong generalizability of our model across different simulation datasets. This offers a promising basis for adapting our model to real observational data.

Refer to caption
Figure 8: Completeness and purity of member galaxies in the TNG300 sample. Same as Fig 4, change dataset to TNG300. The results exhibit similar or even better performance compared to the MSI test set. More than 80% of galaxy groups have completeness and purity exceeding 0.8, and over 70% of galaxy groups have completeness and purity exceeding 0.9.

Evaluating predicted group masses is another crucial facet of the galaxy group catalogue. Figure 9 illustrates the halo mass function (HMF) for the base dataset and seven extended datasets. The upper panel contrasts the actual HMF (black solid line) with the estimated HMF (blue dotted line), while the lower panel displays the ratio of these distributions over varying mass intervals. While the outcomes for the extended datasets aren’t as precise as those for the base dataset, the predicted halo mass functions still largely align with the actual data. Minor discrepancies arise from intrinsic differences between the base and extended datasets. We will discuss these differences and their effects in detail.

Refer to caption
Refer to caption
Figure 9: Compression of predicted halo mass function and true halo mass function on basic test set and extended test set. each column of panels show one of the dataset. The black solid line in each upper panel indicates the halo mass function directly from simulation, and blue dashed line shows the predicted one. Each lower panel represents how the corresponding ratio of halo mass distribution changing with mass.
Refer to caption
Figure 10: Dependence of group richness on Group Mass Estimator model. The figure displays the median and 1-sigma errors of predicted group mass for various true mass bins. Different colors represent galaxy groups with different richness, where blue represents all galaxy groups.
Refer to caption
Figure 11: Comparison of the relation between halo mass and Magr𝑀𝑎subscript𝑔𝑟Mag_{r}italic_M italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT of central galaxy. It presents the curve of mean Magr𝑀𝑎subscript𝑔𝑟Mag_{r}italic_M italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT as a function of halo mass for four different snapshots. An evident linear bias is observable.
Refer to caption
Figure 12: Comparison of Stellar Mass - Halo Mass relation on MSI and TNG300. The solid black line represents the median curve of total stellar mass corresponding to the dark halo mass in MSI, while the solid blue line represents the relation within the TNG300 sample. There is a noticeable discrepancy between the two. The blue dashed line also from TNG300 but using model-predicted mass, and it becomes closer to MSI’s results.

In the magnitude limited datasets, the mass distribution function aligns well with actual values for lower masses (below 1012.5superscript1012.510^{12.5}10 start_POSTSUPERSCRIPT 12.5 end_POSTSUPERSCRIPT). However, deviations become noticeable in the higher mass range, escalating as the apparent magnitude threshold is decreased. Adjusting this limit affects several input parameters for the mass estimation models, including M,truesubscriptMtrue\mathrm{M}_{*,\mathrm{true}}roman_M start_POSTSUBSCRIPT ∗ , roman_true end_POSTSUBSCRIPT, Magmax𝑀𝑎subscript𝑔maxMag_{\mathrm{max}}italic_M italic_a italic_g start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT, and properties of satellite galaxies. Analysis of these distributions and their significance revealed that group richness N𝑁Nitalic_N is the most influential factor. In groups with more than five satellite galaxies, raising the apparent magnitude limit primarily affects N𝑁Nitalic_N, while most other parameters remain stable. Additional data analysis indicates performance variations based on the galaxy clusters’ member richness. Figure 10 presents the mass prediction results for basic test datasets over various halo mass ranges, highlighting the median predicted values with corresponding 1-sigma errors. We classified the samples into five groups based on the number of member galaxies: N=1𝑁1N=1italic_N = 1, N=2𝑁2N=2italic_N = 2, N=3𝑁3N=3italic_N = 3, N=4𝑁4N=4italic_N = 4, and N5𝑁5N\geq 5italic_N ≥ 5. The figure demonstrates that, for groups sharing the same halo mass, those with more member galaxies tend to have higher predicted masses. This causes slight underestimation of masses for groups with fewer members, especially for larger groups. The increasing presence of such groups in samples with higher apparent magnitudes contributes to a decline in the mass distribution function’s high-mass end. Notably, the strictness of the magr<14𝑚𝑎subscript𝑔𝑟14mag_{r}<14italic_m italic_a italic_g start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT < 14 limit renders it unlikely to be used in modern redshift surveys.

In high-redshift datasets, the calculated halo mass function aligns reasonably well with the actual data, although discrepancies grow as redshift increases. We categorized the samples into two groups: those with N>1𝑁1N>1italic_N > 1 and those with N=1𝑁1N=1italic_N = 1. The primary discrepancies in mass predictions are found in the galaxy groups with N=1𝑁1N=1italic_N = 1. Out of all the input parameters for mass prediction, only Mcsubscript𝑀𝑐M_{c}italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and Magr,c𝑀𝑎subscript𝑔𝑟𝑐Mag_{r,c}italic_M italic_a italic_g start_POSTSUBSCRIPT italic_r , italic_c end_POSTSUBSCRIPT (equivalent to =Magmaxabsent𝑀𝑎subscript𝑔𝑚𝑎𝑥=Mag_{max}= italic_M italic_a italic_g start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT) show effectiveness for these particular groups. Upon examining their correlation with halo mass, we discovered that this relation fluctuates significantly across different redshifts. Figure 11 displays the link between the median Magr,c𝑀𝑎subscript𝑔𝑟𝑐Mag_{r,c}italic_M italic_a italic_g start_POSTSUBSCRIPT italic_r , italic_c end_POSTSUBSCRIPT and halo mass, where curves for various redshifts show a unique, nearly linear bias. For the z=1.08 samples compared to the test datasets, there’s an approximate deviation of 1 magnitude. A straightforward correction was tested by adding 1 to Magr,c𝑀𝑎subscript𝑔𝑟𝑐Mag_{r,c}italic_M italic_a italic_g start_POSTSUBSCRIPT italic_r , italic_c end_POSTSUBSCRIPT for each sample within the z=1.08 dataset, leading to a median mass prediction closely matching the actual mass. This indicates that applying a linear adjustment to the magnitudes of galaxies at higher redshifts can maintain our model’s relevance for galaxies across varying redshifts.

The TNG300 dataset is notably distinct from the training data and other samples as it is derived from hydro-dynamic simulations and employs a different cosmological model. Despite this difference, the predicted halo mass function maintains an accuracy comparable to high-redshift datasets. However, there is a minor bias: smaller groups tend to have their masses underestimated, whereas larger groups see an overestimation. This variance is linked to inherent disparities in the Stellar Mass - Halo Mass (SMHM) relation between the TNG300 and Millennium simulations. Figure 12 demonstrates the correlation between halo mass and average stellar mass within both simulations. A notable difference is evident: for lower mass ranges, TNG300 forecasts a lower total stellar mass for a given halo mass, whereas for higher mass ranges, it anticipates a greater stellar mass. The figure’s blue dashed line portrays the connection between predicted halo mass and average stellar mass, aligning closely with the Millennium Simulation’s results. It matches the Millennium Simulation’s outcomes at the low mass end because these groups are predominantly isolated centers or have only few satellites, implying that the prediction aligns with the Millennium SMHM relation. However, as the number of satellites increases, the predicted halo mass aligns more accurately with the true value in TNG300, rather than adhering strictly to the relation in the training data. This highlights the network’s capability in understanding the link between halo mass and the properties of central and satellite galaxies fed into it. Expanding our network’s application to various galaxy samples, including those from actual galaxy surveys, is anticipated to produce accurate predictions for halo mass.

These extended test set results demonstrate that our model is capable of generating dependable outcomes without needing to retrain on various mock catalogs. Group completeness is generally around 90%, and purity consistently surpasses 95% for these datasets. The completeness and purity of member identification are also high, confirming the model’s efficacy in detecting group members. Additionally, the model can reliably reconstruct the halo mass function. However, small biases in estimating the halo mass of galaxy clusters may occur due to inherent differences between the training/basic data and the extended datasets.

4.3 Redshift distorted samples

While our model is constructed for an optimal setting, we propose that it can be successfully adapted to more practical situations. This adaptability is due to its limited dependence on exact redshift measurements used for network training, as outlined in the Method section. In identifying nearby targets, we consistently incorporate a ±5h1Mpcplus-or-minus5superscript1Mpc\pm 5h^{-1}\mathrm{Mpc}± 5 italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc margin of uncertainty in line-of-sight distance measurements. This approach is utilized when locating the nearest 10 neighbors of target galaxies and when looking for all galaxies within R200subscript𝑅200R_{200}italic_R start_POSTSUBSCRIPT 200 end_POSTSUBSCRIPT of a potential group. The first try we should do is to extend the test into the redshift distorted sample.

In this test, we preserve the essential structure of our model but re-train it using a simulated redshift distorted sample. This catalog is modeled on galaxy data from the z=0𝑧0z=0italic_z = 0 snapshot of the Millennium Simulation, incorporating redshift distortions based on the galaxies’ velocities along a specified axis. We apply the same apparent magnitude threshold to these redshift distorted samples as utilized for the test data.

Figure 13 displays the result for redshift distorted sample. The mass estimations in Figure 13(a) align well with true mass values for both smaller and larger mass groups, although there is a tendency to slightly underestimate in the 1013h1M1014h1Msimilar-tosuperscript1013superscript1subscriptMdirect-productsuperscript1014superscript1subscriptMdirect-product10^{13}h^{-1}\mathrm{M}_{\odot}\sim 10^{14}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT ∼ 10 start_POSTSUPERSCRIPT 14 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT range. This underestimation stems mainly from the incomplete membership predictions in this mass range. Both group completeness and purity exceed 90% across all mass classes (Figure 3) and surpass 95% for groups with Mvir>1012h1MsubscriptM𝑣𝑖𝑟superscript1012superscript1subscriptMdirect-product\mathrm{M}_{vir}>10^{12}h^{-1}\mathrm{M}_{\odot}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT > 10 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT, showcasing our model’s strong ability to detect groups in redshift space. In Figure 13(c), the completeness curve declines more sharply compared to the original test data, with around 60% of groups having Mvir1013Msubscript𝑀𝑣𝑖𝑟superscript1013subscript𝑀direct-productM_{vir}\geq 10^{13}M_{\odot}italic_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT ≥ 10 start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT italic_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT achieving fc0.8subscript𝑓𝑐0.8f_{c}\geq 0.8italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ≥ 0.8. The overall performance on redshift distorted samples illustrates the significant capability of our model for application in real redshift surveys. Minor predictive errors suggest the need for additional adjustments and optimizations for use in redshift space contexts. More precise results for real surveys will be presented in our subsequent papers, including group catalogs from actual redshift surveys.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 13: Results on redshift distorted samples. (a) Comparison of halo mass function. Black dashed line is the true value of simulation, and blue dashed line shows the prediction of our group finder. The lower panel present (logMp,virlogMt,vir)/logMt,virsubscriptM𝑝𝑣𝑖𝑟subscriptM𝑡𝑣𝑖𝑟subscriptM𝑡𝑣𝑖𝑟(\log\mathrm{M}_{p,vir}-\log\mathrm{M}_{t,vir})/\log\mathrm{M}_{t,vir}( roman_log roman_M start_POSTSUBSCRIPT italic_p , italic_v italic_i italic_r end_POSTSUBSCRIPT - roman_log roman_M start_POSTSUBSCRIPT italic_t , italic_v italic_i italic_r end_POSTSUBSCRIPT ) / roman_log roman_M start_POSTSUBSCRIPT italic_t , italic_v italic_i italic_r end_POSTSUBSCRIPT value on each mass bins. (b) The group completeness and purity in redshift distorted samples. The definition of them can be found in section 4.1.1. We only show results for groups with Mvir1011h1MsubscriptM𝑣𝑖𝑟superscript1011superscript1subscriptMdirect-product\mathrm{M}_{vir}\geq 10^{11}h^{-1}\mathrm{M}_{\odot}roman_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT ≥ 10 start_POSTSUPERSCRIPT 11 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT because smaller groups are highly incomplete due to selection effect. (c) The completeness and purity function of member galaxies in redshift distorted samples.

5 Conclusion

In this research, we devised a machine learning-based strategy to identify galaxy groups by leveraging various observable characteristics of galaxies. This methodology exhibited strong performance when tested on simulated mock catalogs. Our galaxy group identification system is composed of three primary elements:

  1. 1.

    Central Galaxy Identifier. This component is essential for distinguishing central and satellite galaxy pairs within a group by employing a machine-learning classification algorithm. It assesses a target galaxy and its ten closest neighbors to accurately identify the central galaxy with about 90% accuracy.

  2. 2.

    Group Mass Estimator. We developed a regression model to estimate the halo mass of galaxy groups found by our system. This model uses the attributes of a group’s central galaxy and its five most massive satellites as inputs. Evaluation against a true group catalog showed a prediction error rate of approximately 0.2 dex.

  3. 3.

    Group finder. Although the Central Galaxy Identifier achieves significant accuracy in generating a galaxy group catalog, sporadic segmentation errors, particularly in more massive groups, were observed. To mitigate these, we implemented an iterative procedure to consolidate fragmented sections of actual groups.

The performance of the group identification tool was thoroughly assessed using six basic test datasets derived from the z=0 snapshot of the Millennium Simulation (MSI). The findings reveal that the completeness and purity of the groups surpass 90% for all group mass ranges, including the lower mass limit near 1011h1Msuperscript1011superscript1subscriptMdirect-product10^{11}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 11 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT. Concerning the precision of member allocation, over 80% of the groups had member completeness above 80%, while more than 90% showed member purity exceeding 60%. Additionally, the estimated halo mass distribution was in extraordinary agreement with the true values, with a ratio that approximates 1 across most mass ranges.

We conducted an additional evaluation of our group finder using three separate datasets to determine the model’s extensibility and flexibility. These comprehensive test datasets comprised samples with diverse apparent magnitude thresholds, high-redshift samples, and hydrodynamic simulation TNG300 samples.

The model reliably assigned group memberships across all datasets, achieving group completeness over 90% and purity exceeding 95%. Interestingly, certain datasets performed better than the standard test datasets. Nonetheless, predictions of halo mass showed some errors, especially in samples with varying selection criteria. We believe these inconsistencies may arise due to the following reasons: the prediction of our current scheme has a weak dependence on the richness of groups, this will lower the predicted virial mass for smaller groups. Also our model has a weak dependence on the Stellar Mass - Halo Mass (SMHM) relation from the training data for small groups, and then the prediction of halo mass will keep inline with the SMHM relation from the training data for smaller groups with few satellites. Furthermore, the difference between the r-band luminosity-halo mass from different cosmic epochs in real data bring extend errors on the prediction of the halo mass for distant groups.

However, for massive groups with more satellites in TNG300 sample, the predicted halo mass aligns more accurately with the true value in TNG300, rather than adhering strictly to the relation in the training data. This highlights the network’s capability in understanding the link between halo mass and the properties of central and satellite galaxies fed into it. Expanding our network’s application to various galaxy samples, including those from actual galaxy surveys, is anticipated to produce accurate predictions for halo mass.

Although our group finder is developed in real space, we address the impact of redshift-space distortions by setting a ±5h1Mpcplus-or-minus5superscript1Mpc\pm 5h^{-1}\mathrm{Mpc}± 5 italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Mpc uncertainty when dealing with line-of-sight distance. This approach guarantees the ability of our model in detecting galaxy groups within redshift space. Following retraining using a redshift distorted mock catalog based on the Millennium Simulation, the model consistently achieves high levels of group completeness and purity, with rates surpassing 90% for all mass categories and exceeding 95% for more massive groups (Mvir>1012h1Msubscript𝑀𝑣𝑖𝑟superscript1012superscript1subscriptMdirect-productM_{vir}>10^{12}h^{-1}\mathrm{M}_{\odot}italic_M start_POSTSUBSCRIPT italic_v italic_i italic_r end_POSTSUBSCRIPT > 10 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT). Minor discrepancies have been observed, specifically the underestimation of masses ranging from 1013h1MM1014h1Msimilar-tosuperscript1013superscript1subscriptMdirect-productsubscript𝑀direct-productsuperscript1014superscript1subscriptMdirect-product10^{13}h^{-1}\mathrm{M}_{\odot}M_{\odot}\sim 10^{14}h^{-1}\mathrm{M}_{\odot}10 start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT italic_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT ∼ 10 start_POSTSUPERSCRIPT 14 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT, mainly due to incomplete member predictions in this mass range. These findings highlight the model’s potential for use in real redshift surveys, although additional adjustments might be required to fully adapt it to observational data. The effectiveness demonstrated with mock catalogs underscores the prospects of our group finder for actual redshift surveys. In our subsequent studies, we intend to thoroughly examine the diverse observational effects in the actual survey, and apply our techniques to the real galaxy survey.

Acknowledgements.
This work was supported by the National Key R&D Program of China (2022YFA1602901), the NSFC grant (Nos 11988101, 11873051, 12125302, and 11903043), CAS Project for Young Scientists in Basic Research Grant (No. YSBR-062), and the K.C. Wong Education Foundation. YC acknowledges the support of the UK Royal Society through a University Research Fellowship. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising from this submission.

References

  • Abazajian et al. (2009) Abazajian, K. N., Adelman-McCarthy, J. K., Agüeros, M. A., et al. 2009, ApJS, 182, 543
  • Abell (1958) Abell, G. O. 1958, ApJS, 3, 211
  • Abell et al. (1989) Abell, G. O., Corwin, Jr., H. G., & Olowin, R. P. 1989, ApJS, 70, 1
  • Berlind et al. (2006) Berlind, A. A., Frieman, J., Weinberg, D. H., et al. 2006, ApJS, 167, 1
  • Campbell et al. (2015) Campbell, D., van den Bosch, F. C., Hearin, A., et al. 2015, MNRAS, 452, 444
  • Chen et al. (2024) Chen, H., Wang, J., Mao, T., et al. 2024, MNRAS, 532, 3947
  • Coil et al. (2006) Coil, A. L., Gerke, B. F., Newman, J. A., et al. 2006, ApJ, 638, 668
  • Crook et al. (2007) Crook, A. C., Huchra, J. P., Martimbeau, N., et al. 2007, ApJ, 655, 790
  • Davis et al. (1985) Davis, M., Efstathiou, G., Frenk, C. S., & White, S. D. M. 1985, ApJ, 292, 371
  • Eke et al. (2004) Eke, V. R., Baugh, C. M., Cole, S., et al. 2004, MNRAS, 348, 866
  • Goto (2005) Goto, T. 2005, MNRAS, 359, 1415
  • Guo et al. (2011) Guo, Q., White, S., Boylan-Kolchin, M., et al. 2011, MNRAS, 413, 101
  • Han et al. (2015) Han, J., Eke, V. R., Frenk, C. S., et al. 2015, MNRAS, 446, 1356
  • Henriques et al. (2015) Henriques, B. M. B., White, S. D. M., Thomas, P. A., et al. 2015, MNRAS, 451, 2663
  • Huchra & Geller (1982) Huchra, J. P., & Geller, M. J. 1982, ApJ, 257, 423
  • Knobel et al. (2009) Knobel, C., Lilly, S. J., Iovino, A., et al. 2009, ApJ, 697, 1842
  • Lan et al. (2016) Lan, T.-W., Ménard, B., & Mo, H. 2016, MNRAS, 459, 3998
  • Lavaux & Hudson (2011) Lavaux, G., & Hudson, M. J. 2011, MNRAS, 416, 2840
  • Li et al. (2011) Li, R., Mo, H. J., Fan, Z., van den Bosch, F. C., & Yang, X. 2011, MNRAS, 413, 3039
  • Lim et al. (2018) Lim, S. H., Mo, H. J., Li, R., et al. 2018, ApJ, 854, 181
  • Lim et al. (2017) Lim, S. H., Mo, H. J., Lu, Y., Wang, H., & Yang, X. 2017, MNRAS, 470, 2982
  • Lim et al. (2020) Lim, S. H., Mo, H. J., Wang, H., & Yang, X. 2020, ApJ, 889, 48
  • Luo et al. (2018) Luo, W., Yang, X., Lu, T., et al. 2018, ApJ, 862, 4
  • Mandelbaum et al. (2006) Mandelbaum, R., Seljak, U., Cool, R. J., et al. 2006, MNRAS, 372, 758
  • Mao et al. (2021) Mao, T.-X., Wang, J., Li, B., et al. 2021, MNRAS, 501, 1499
  • Miller et al. (2005) Miller, C. J., Nichol, R. C., Reichart, D., et al. 2005, AJ, 130, 968
  • Muñoz-Cuartas et al. (2011) Muñoz-Cuartas, J. C., Müller, V., & Forero-Romero, J. E. 2011, MNRAS, 417, 1303
  • Rodriguez et al. (2015) Rodriguez, F., Merchán, M., & Sgró, M. A. 2015, A&A, 580, A86
  • Springel et al. (2006) Springel, V., Frenk, C. S., & White, S. D. M. 2006, Nature, 440, 1137
  • Springel et al. (2005) Springel, V., White, S. D. M., Jenkins, A., et al. 2005, Nature, 435, 629
  • Tago et al. (2006) Tago, E., Einasto, J., Saar, E., et al. 2006, Astronomische Nachrichten, 327, 365
  • Tully (2015) Tully, R. B. 2015, AJ, 149, 171
  • Vikram et al. (2017) Vikram, V., Lidz, A., & Jain, B. 2017, MNRAS, 467, 2315
  • Viola et al. (2015) Viola, M., Cacciato, M., Brouwer, M., et al. 2015, MNRAS, 452, 3529
  • Wang et al. (2009) Wang, H., Mo, H. J., Jing, Y. P., et al. 2009, MNRAS, 394, 398
  • Wang et al. (2016) Wang, H., Mo, H. J., Yang, X., et al. 2016, ApJ, 831, 164
  • Wang et al. (2018) Wang, H., Mo, H. J., Chen, S., et al. 2018, ApJ, 852, 31
  • Wang et al. (2020) Wang, K., Mo, H. J., Li, C., Meng, J., & Chen, Y. 2020, MNRAS, 499, 89
  • Weinmann et al. (2006) Weinmann, S. M., van den Bosch, F. C., Yang, X., & Mo, H. J. 2006, MNRAS, 366, 2
  • White & Rees (1978) White, S. D. M., & Rees, M. J. 1978, MNRAS, 183, 341
  • Yang et al. (2008) Yang, X., Mo, H. J., & van den Bosch, F. C. 2008, ApJ, 676, 248
  • Yang et al. (2009) Yang, X., Mo, H. J., & van den Bosch, F. C. 2009, ApJ, 695, 900
  • Yang et al. (2005a) Yang, X., Mo, H. J., van den Bosch, F. C., & Jing, Y. P. 2005a, MNRAS, 356, 1293
  • Yang et al. (2006) Yang, X., Mo, H. J., van den Bosch, F. C., et al. 2006, MNRAS, 373, 1159
  • Yang et al. (2007) Yang, X., Mo, H. J., van den Bosch, F. C., et al. 2007, ApJ, 671, 153
  • Yang et al. (2005b) Yang, X., Mo, H. J., van den Bosch, F. C., et al. 2005b, MNRAS, 362, 711
  • Yang et al. (2021) Yang, X., Xu, H., He, M., et al. 2021, ApJ, 909, 143
  • Zwicky et al. (1968) Zwicky, F., Herzog, E., & Wild, P. 1968, Pasadena: California Institute of Technology (CIT)