Particle Hit Clustering and Identification Using Point Set Transformers in Liquid Argon Time Projection Chambers
Abstract
Liquid argon time projection chambers are often used in neutrino physics and dark-matter searches because of their high spatial resolution. The images generated by these detectors are extremely sparse: most of the detector records zero energy, so despite the high resolution, most of the detector is unused in any particular interaction. Instead of representing all of the empty detections, the interaction is usually stored as a sparse matrix, a list of detection locations paired with their energy values. Traditional machine learning methods that have been applied to particle reconstruction, such as convolutional neural networks (CNNs), cannot operate over data stored in this way and therefore require the matrix to be fully instantiated as a dense matrix. Operating on dense matrices requires far more memory and computation time than operating directly on the sparse matrix. We propose a machine learning model using a point set neural network that operates over a sparse matrix, greatly improving both processing speed and accuracy over methods that instantiate the dense matrix, as well as over other methods that operate over sparse matrices. Compared to competing state-of-the-art methods, our method improves classification performance by 14% and segmentation performance by more than 22%, while taking 80% less time and using 66% less memory. Compared to state-of-the-art CNN methods, our method improves classification performance by more than 86% and segmentation performance by more than 71%, while reducing runtime by 91% and memory usage by 61%.
1 Introduction
Experiments in the field of particle physics often create large amounts of data, which are difficult for human experts to process at scale. These data often need to be manually sorted by experts, using valuable time that could be spent interpreting the data. The advent of high-quality machine learning models has helped automate much of the manual labor required to label these images [4], but the increase in quality has come with an increase in the computational cost and resources required to run these models. Even large experimental collaborations in the field of particle physics often face strict limits on resource utilization during large-scale simulation and data processing.
The liquid argon time projection chamber (LArTPC) is a common choice of detector technology in neutrino physics and direct dark matter searches due to its very high spatial resolution. The operating principle consists of applying an electric field across a large volume of liquid argon. When charged particles pass through the detector, ionized electrons are accelerated toward the anode end of the drift volume. These drift electrons are usually detected via either a series of wire planes or a grid of charge-detecting pixels. Together with the detection time of the drift electrons, this technology allows for 3D reconstruction of particle trajectories through the detector. These trajectories appear as tracks or showers referred to as "prongs". Particles may also decay in their trajectory, splitting into more particles and creating new prongs. The task at hand is then to perform instance segmentation over these prongs to cluster them as well as to classify each hit into its corresponding particle type for prong identification.
Due to the high spatial resolution, LArTPC images are exceptionally sparse, consisting of an empty background in most of the image except for a few prongs. As such, these are usually represented as sparse matrices, stored as a list of coordinates and values. When performing computations such as the ones used in segmentation machine learning models, these sparse matrices have to be converted into dense matrices, which can take up a lot of resources and slow down training and inference. There have been implementations of differentiable convolution operations on sparse matrices, such as Nvidia’s MinkowskiEngine [6]. However, the operations need to approximate a convolution in order to save memory. An alternative to using sparse matrices is to represent the sparse image as a point cloud, which only requires coordinates and values to be operated on directly.
Similarly to traditional scintillator cell detectors, LArTPCs with wire-based readout provide multiple 2D views that are subsequently combined to create the 3D reconstruction of particle trajectories. The more novel pixel-based readout for LArTPCs intrinsically provides 3D point cloud representations [9]. However, segmentation over large 3D images can be prohibitively computationally expensive, so images are often reduced to multiple 2D views to save memory. Finally, downsampling is often used to further save on memory when it is necessary to process large volumes, as is the case with events containing long muon tracks.
1.1 Related work
The segmentation tasks considered in this work are commonly handled through the Pandora multi-algorithm approach for LArTPC event reconstruction, and a variety of clustering algorithms are available in the Pandora software development kit [1, 13]. The Wire-Cell software package has also introduced machine-learning based approaches for these tasks [23, 18], and the PoLAr-MAE model has recently addressed this task with a transformer architecture [22]. CNNs are often used for event and particle classification at LArTPCs, building on the work of the NOvA CNN [17, 3, 15]. Through panoptic segmentation, this work addresses both clustering and particle classification.
We interpret the data as point sets rather than pixels, and thus rely on the Deep Sets [24] framework. This framework has been extended in later works to implement self-attention and graphs. One such line of work is Point Transformers [21, 20], a family of models that implements an attention mechanism between neighboring points in a point cloud. Point Transformer v2 uses k-nearest neighbors to create a graph between points and calculates attention between nearby points, while Point Transformer v3 uses a serialization technique instead to reduce memory usage.
We choose to extend the concepts from point set transformers using Heterogeneous Graph Transformers [12], a method that implements attention in heterogeneous graphs. Heterogeneous graphs are graphs whose nodes can belong to different semantic classes, so using separate attention weights for each class models the data in a more semantically correct way. This allows us to interpret information from two different views that are related to each other.
Our model differs from graph attention-based models, such as Graph Attention Networks [19], by leveraging the coordinates of each point. Rather than using the coordinates only as a feature or as inputs to the kNN algorithm that builds the graph, our model also uses them to implement a faster pooling algorithm [16], which reduces the computation time. We also include a relative positional encoding scheme [21] in order to decay information from neighbors that are too far away.
2 Methods
2.1 Notation
Consider a dataset of size $N$, where each sample $n$ represents an event from the particle detector. Each event is split into $M$ views, each view denoted by $m$. Each view has a variable number of detections $P_m$. Each detection $i$ is described by coordinates $x_i$ and values $h_i$. Pixel-based TPCs present a homogeneous view as a single 3D point cloud. Wire-based TPCs present heterogeneous views as multiple 2D point clouds. To ease the burden on notation, we treat homogeneous views as a special case of heterogeneous views when explaining the methods.
For each pair of points $(i, j)$ we define an intra-view distance $d_{\mathrm{intra}}(i, j)$ for points within the same view, and therefore the same vector space, and an inter-view distance $d_{\mathrm{inter}}(i, j)$ for points in different views. Additionally, based on these distances we define an edge $e_{ij}$, which connects two nodes that may be in the same view or in different views.
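The edge construction can be illustrated with a minimal sketch that links every point of a single view to its k nearest neighbors. The function name `knn_edges`, the brute-force search, and the choice of Euclidean distance as the intra-view distance are assumptions made for illustration; they are not taken from the authors' implementation, and the inter-view distance is not sketched because its definition depends on which coordinates the views share.

```python
import math

def knn_edges(coords, k):
    """Return directed edges (i, j) linking each point i to its k nearest neighbors j.

    coords: list of equal-length coordinate tuples from ONE view.
    Brute force, O(n^2); intended for illustration only.
    """
    edges = []
    for i, xi in enumerate(coords):
        # Euclidean distance from point i to every other point in the same view.
        dists = [(math.dist(xi, xj), j) for j, xj in enumerate(coords) if j != i]
        dists.sort()
        edges.extend((i, j) for _, j in dists[:k])
    return edges

hits = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
edges = knn_edges(hits, k=2)
```

In practice a spatial index or a GPU nearest-neighbor routine would replace the quadratic loop; the sketch only shows what the resulting edge list contains.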
2.2 Homogeneous attention
Point attention is calculated by creating a graph between points, using nearest neighbors or other serialization techniques in order to emulate a rolling window, such as the one present in a traditional attention model [5]. The attention is then calculated and aggregated over the neighborhood of this graph. For example, if there is a source node $i$ and a destination node $j$, a query $q_i$ is calculated with respect to the source, and a key $k_j$ and a value $v_j$ are calculated with respect to the destination, for each edge. Each edge within the same view is then given a score of
(2.1) |
where RPE is a relative positional encoding module in which the difference between the coordinates of the two points is encoded by a linear layer $\theta$, i.e.,
(2.2) |
The scores are then normalized with a softmax operation over each neighborhood, so that they represent how much information should flow along each edge. The normalized scores are then used to weight the value vectors:
(2.3) |
leading to a point set attention mechanism.
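For orientation, one plausible way to write the three steps described in this subsection (score, relative positional encoding, and softmax-weighted aggregation) is given below. All symbols are assumptions introduced here: $i$ is the source node, $j$ a destination node in its neighborhood $\mathcal{N}(i)$, $x$ denotes coordinates, $q$, $k$, $v$ are the query, key, and value vectors, and $\theta$ is a linear layer assumed to map a coordinate difference to a scalar. The exact way the query, key, and RPE term are combined (for example a scaling factor or a vector-valued score) cannot be recovered from the text, so this should be read as a sketch rather than as the authors' formulas.

```latex
% Assumed notation, not taken from the source:
% i = source node, j = destination node, N(i) = neighborhood of i.
\begin{align}
  w_{ij} &= q_i^{\top} k_j + \mathrm{RPE}(x_i, x_j), \\
  \mathrm{RPE}(x_i, x_j) &= \theta(x_i - x_j), \\
  h_i' &= \sum_{j \in \mathcal{N}(i)}
          \frac{\exp(w_{ij})}{\sum_{l \in \mathcal{N}(i)} \exp(w_{il})}\, v_j .
\end{align}
```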
2.3 Homogeneous pooling and unpooling
A pooling operation is used in U-net-like architectures to create feature representations between points that are farther away from each other. The way to extend this concept from CNNs to point set neural networks (PSNNs) and graph neural networks (GNNs) is to pool neighbors with each other. This is slow because the nearest-neighbors operation is quite expensive, so we approximate it with a faster method: we first create a grid with cells of a specific size $g$; then, for each square or cube $c$ in the grid, we average the coordinates $x_i$ and features $h_i$ of the points that fall within that cell, that is,
$$\bar{x}_c = \frac{1}{|c|} \sum_{i \in c} x_i, \tag{2.4}$$
meaning that the pooled point that summarizes a cell is located at the mean position of all the points in that cell, and
$$\bar{h}_c = \frac{1}{|c|} \sum_{i \in c} h_i, \tag{2.5}$$
meaning that its features are the average of the features of all the points in the cell.
Unpooling reuses the coordinates saved before the corresponding pooling step and broadcasts each pooled point's features back to the original points that were merged into it. After unpooling, all points that shared a grid cell therefore carry identical features, which makes the residual connections of a U-net vital for the operation to be semantically meaningful.
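The grid pooling and unpooling steps described in this subsection can be sketched in a few lines. The names `grid_pool` and `grid_unpool`, the use of plain Python lists, and the floor-division cell index are illustrative assumptions and not the authors' code.

```python
from collections import defaultdict

def grid_pool(coords, feats, cell_size):
    """Average coordinates and features of all points that share a grid cell.

    Returns the pooled coordinates, the pooled features, and, for every input
    point, the index of the pooled point it was merged into (kept for unpooling).
    """
    cells = defaultdict(list)
    for idx, x in enumerate(coords):
        # Integer cell index along each axis.
        key = tuple(int(c // cell_size) for c in x)
        cells[key].append(idx)

    dim, width = len(coords[0]), len(feats[0])
    pooled_coords, pooled_feats = [], []
    assignment = [0] * len(coords)
    for pooled_idx, members in enumerate(cells.values()):
        n = len(members)
        pooled_coords.append(tuple(sum(coords[m][d] for m in members) / n for d in range(dim)))
        pooled_feats.append(tuple(sum(feats[m][f] for m in members) / n for f in range(width)))
        for m in members:
            assignment[m] = pooled_idx
    return pooled_coords, pooled_feats, assignment

def grid_unpool(pooled_feats, assignment):
    """Broadcast each pooled feature vector back to the points that formed it."""
    return [pooled_feats[a] for a in assignment]

coords = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
feats = [(2.0,), (4.0,), (10.0,)]
pooled_coords, pooled_feats, assignment = grid_pool(coords, feats, cell_size=4.0)
restored = grid_unpool(pooled_feats, assignment)
```

The first two points fall into the same cell and receive the same feature after unpooling, which is the loss of detail that the skip connections are meant to compensate for.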
2.4 Heterogeneous attention
Wire-based LArTPCs usually output multiple views of the point clouds, where each view presents a different subset of the spatial dimensions. This means that the data in different point clouds are related but cannot easily be built into a single graph. Using the aforementioned inter-view distance, we are able to build the neighborhood graph. Therefore, for each point $i$ we calculate the query $q^{n}_{i,\,m \to m'}$, i.e., the query on point $i$ from view $m$ to view $m'$ on sample $n$, and then for each of its neighbors $j$ we calculate both $k^{n}_{j,\,m \to m'}$ and $v^{n}_{j,\,m \to m'}$, that is, the key and value on point $j$ from view $m$ to view $m'$ on sample $n$. Using these, we can calculate
(2.6) |
An RPE cannot be used here because the two points are defined in different spaces, which makes their coordinates hard to compare. This weight is then normalized using a softmax operation over the point's neighbors and used in a weighted sum to calculate the output of the attention module,
(2.7) |
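A plausible written form of the inter-view attention described in this subsection is sketched below. The notation is assumed for illustration only: $m$ and $m'$ are view indices, $n$ is the sample index, and $\mathcal{N}_{m'}(i)$ is the set of neighbors of point $i$ found in view $m'$ through the inter-view distance. No positional term appears because the text states that an RPE cannot be used between views; whether the score is a plain or a scaled dot product is not stated in the source.

```latex
% Assumed notation, not taken from the source.
\begin{align}
  w^{n}_{ij,\,m \to m'} &= \left(q^{n}_{i,\,m \to m'}\right)^{\top} k^{n}_{j,\,m \to m'}, \\
  h'^{\,n}_{i,\,m \to m'} &= \sum_{j \in \mathcal{N}_{m'}(i)}
      \operatorname{softmax}_{j}\!\left(w^{n}_{ij,\,m \to m'}\right) v^{n}_{j,\,m \to m'} .
\end{align}
```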
2.5 Heterogeneous pooling
When dealing with multiple views from the same detector, the views may be defined in completely different vector spaces. While we may still be able to compare distances to determine nearest neighbors, points from different views cannot be pooled together. Therefore, we treat each view separately: pooling is done per view using a voxel pooling method [16], in the same manner as the homogeneous pooling described in section 2.3. A grid is created, the values of all the points within each cell of the grid are averaged, and the pooled point is positioned at the mean of all the points within that voxel.
Unpooling is performed using skip connections: the points are upsampled to the same coordinates that they were previously pooled from, using only information from the same view.
2.6 Architecture


The network is structured like a U-net [14], where attention layers take the place of the convolutions and grid pooling and unpooling take the place of the pooling operations. The architecture is described in Figure 2. The U-net is divided into stages, and each stage contains a number of blocks, each of which has an attention module. Each stage in the first half of the network is followed by a pooling step, which reduces the number of points. Each stage in the second half is followed by an unpooling step, which restores the coordinates of the points from the corresponding earlier block and concatenates that block's features. The dimensionality of the embeddings is doubled at each stage during the first half and halved at each stage during the second half. Inter-view attention is calculated at every stage so that information is mixed between views both locally (in the earlier stages) and globally (in the later stages).
2.7 Loss function
The network performs two tasks simultaneously: instance segmentation, separating prongs from each other; and semantic segmentation, classifying each detection into a particle type. As such, the loss function is separated into two parts,
(2.8) |
Semantic segmentation is a simple classification problem, so we use multi-class cross-entropy to calculate this loss:
(2.9) |
where $y_i$ is the correct semantic label of detection $i$.
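As a reading aid, the two-part loss and the multi-class cross-entropy can be written as follows. The symbols are assumptions introduced here: $P$ is the number of detections in the event, $\hat{p}_{i,c}$ is the predicted probability that detection $i$ belongs to class $c$, and $y_i$ is its true label. The equal weighting of the two terms and the averaging over detections are also assumptions, since the source does not state them.

```latex
% Assumed notation and normalization, not taken from the source.
\begin{align}
  \mathcal{L} &= \mathcal{L}_{\mathrm{sem}} + \mathcal{L}_{\mathrm{inst}}, \\
  \mathcal{L}_{\mathrm{sem}} &= -\frac{1}{P} \sum_{i=1}^{P} \log \hat{p}_{i,\,y_i} .
\end{align}
```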
Instance segmentation is trained by minimizing the loss under the best assignment between the predicted labels and the true labels. If point $i$ belongs to the segment $s_i$, then the loss calculated is
(2.10) |
where $\Pi$ is the set of all permutations of labels, allowing a unique assignment of each predicted label to a true label. The optimal assignment of the labels is found using a linear sum assignment solver [7]. The linear sum assignment problem, classically solved with the Hungarian algorithm, consists of finding the matching in a weighted bipartite graph that optimizes the total weight, and it can be solved in polynomial time. This allows us to calculate the loss function without checking every possible assignment. This is a standard method for training object segmentation models. It affects only the training time, not the inference time, and it scales polynomially with the maximum number of object segments in the model, which is chosen as a hyperparameter.
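The idea of a loss that is minimized over label permutations can be illustrated with a small sketch. It assumes a per-point cross-entropy as the underlying loss, which the source does not specify, and it enumerates all permutations by brute force purely for clarity; the function name `matched_instance_loss` is hypothetical. A polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be applied to the same cost matrix in practice.

```python
import math
from itertools import permutations

def matched_instance_loss(probs, true_ids, num_slots):
    """Mean cross-entropy under the best one-to-one relabeling of instance slots.

    probs: per-point probability lists over `num_slots` predicted slots.
    true_ids: per-point true instance id in range(num_slots).
    """
    # cost[a][b]: summed loss if true instance a is matched to predicted slot b.
    cost = [[0.0] * num_slots for _ in range(num_slots)]
    for p, a in zip(probs, true_ids):
        for b in range(num_slots):
            cost[a][b] -= math.log(p[b])

    best_loss, best_perm = math.inf, None
    # Brute force over all permutations: O(K!), for illustration only.
    for perm in permutations(range(num_slots)):
        total = sum(cost[a][perm[a]] for a in range(num_slots))
        if total < best_loss:
            best_loss, best_perm = total, perm
    return best_loss / len(probs), best_perm

probs = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]
true_ids = [0, 0, 1]
loss, perm = matched_instance_loss(probs, true_ids, num_slots=2)
```

Here the prediction uses slot 1 for the first true prong and slot 0 for the second, so the swapped assignment gives the lower loss; the labels themselves carry no meaning, only the grouping does.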
3 Experiments and Results
3.1 Dataset
We consider here a LArTPC with square pixel-based readout. The TPC is x x in with a drift length along x. and at energies are simulated with GENIE [2] in the direction with uniform neutrino energy up to . The energy deposition in liquid argon is then simulated with GEANT4 [10]. The dataset consists of 100,000 and events each, with 74% of events interacting through the charged current and the rest through the neutral current. Additionally, we created another dataset to simulate multi-view images similar to those produced by wire LArTPCs: hits at odd pixels of the Z dimension were assigned to the XZ view by removing the Y coordinate, and hits at even pixels were assigned to the YZ view by removing the X coordinate.
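The construction of the simulated multi-view dataset can be sketched as follows. The function name `split_into_views` and the tuple layout `(x, y, z, value)` with integer pixel indices are assumptions for illustration. The sketch drops the x coordinate for the YZ view, since a YZ view retains only y and z.

```python
def split_into_views(hits):
    """Split 3D hits (x, y, z, value) into two 2D views by the parity of z.

    Odd z  -> XZ view (y dropped); even z -> YZ view (x dropped).
    z is assumed to be an integer pixel index.
    """
    xz_view, yz_view = [], []
    for x, y, z, value in hits:
        if z % 2 == 1:
            xz_view.append((x, z, value))
        else:
            yz_view.append((y, z, value))
    return xz_view, yz_view

hits = [(3, 7, 0, 1.5), (4, 8, 1, 0.2), (5, 9, 2, 0.9)]
xz, yz = split_into_views(hits)
```

Each hit ends up in exactly one view, so the two views share only the z axis and never describe the same pixel.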
3.2 Performance evaluation
Six models were trained and evaluated: two graph attention network (GAT) [19] based models, one for the multi-view case and one for the single-view case; a 2D CNN-based model (R-CNN) [11], used only for the single-view case; a heterogeneous point set transformer (HPST) for the multi-view case; and two point set transformers (PSTs), one for the multi-view case and one for the single-view case. We performed a hyperparameter sweep over the number of layers, the layer size, the number of neighbors used in the nearest-neighbors calculation, and the learning rate, sampling 60 random hyperparameter configurations from the grid. The hyperparameters swept for the three networks were the number of neighbor connections (4, 8), the number of stages of the neural network (2, 3, 4), the size of the embeddings inside the neural network (128, 256, 512), and the learning rate (between 1e-4 and 1e-1). The ranges of the parameters were chosen according to the memory restrictions of our targeted production environment. The same ranges were used for all three networks, as they all have comparable parameters. Training and testing were done on a server with an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz, 503G of RAM, and 4x NVIDIA Titan V GPUs.
The hyperparameter sweep was performed over one learning rate cycle of 64 epochs with a cosine annealing scheduler, using 10% of the dataset. The configuration with the best accuracy on the segmentation's class labels in the validation set was selected. The resulting models with the best hyperparameters were then trained with an AdamW optimizer for 4 learning rate cycles of 64 epochs each, for a total of 256 epochs. The results of this optimization for the 3D PST can be seen in table 1.
Hyperparameter | Value |
---|---|
Learning Rate | 0.0006323 |
Number of stages | 3 |
Embed size | 256 |
Neighbors | 8 |
Initial grid size | 8 |
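The random sweep and the cyclic learning rate schedule described above can be sketched as follows. The search ranges are the ones quoted in the text; the log-uniform sampling of the learning rate, the restart of the schedule at the start of every cycle, the minimum learning rate of zero, and the names `sample_config` and `cosine_lr` are assumptions made for illustration.

```python
import math
import random

SEARCH_SPACE = {
    "neighbors": [4, 8],
    "stages": [2, 3, 4],
    "embed_size": [128, 256, 512],
}
LR_RANGE = (1e-4, 1e-1)

def sample_config(rng):
    """Draw one random configuration; the learning rate is log-uniform (assumption)."""
    config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    low, high = LR_RANGE
    config["learning_rate"] = 10 ** rng.uniform(math.log10(low), math.log10(high))
    return config

def cosine_lr(base_lr, epoch, cycle_length=64, min_lr=0.0):
    """Cosine-annealed learning rate that restarts every `cycle_length` epochs."""
    t = epoch % cycle_length
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / cycle_length))

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(60)]
```

With `cycle_length=64`, epochs 0 to 63 form the single cycle used during the sweep, and epochs 0 to 255 form the four cycles used for the final training.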
3.2.1 Classification and Segmentation accuracy
We evaluate the accuracy of the classification and instance segmentation of each point for each model. The results can be seen in table 2. Our more efficient implementation of attention, together with the use of pooling, gives an advantage over a traditional GAT, as evidenced by the jump in performance between the 3D GAT and our 3D PST.
Using all the information available in 3D images also helps increase accuracy. Matching prongs between views is an especially hard task, so single-view 3D models inherently achieve better segmentation accuracy. HPST helps bridge the gap by sharing data between the views, and it improves on the 2D GAT and R-CNN while using fewer parameters than the 2D PST. This creates a tradeoff: view sharing gives good performance with a smaller model, while the PST can reach higher accuracy because it is able to reach more neighbors within the same view.
3.2.2 Efficiency and purity of segmentation


For each prong, we calculated the efficiency and purity of the classification, allowing for multiple predicted prongs to be assigned to a single prong. Efficiency is defined as the percentage of a predicted prong that is assigned to the correct prong. Purity is the percentage of the true prong that is predicted correctly. These are metrics used in particle physics that should be balanced, as raising the purity can often lower the efficiency and vice versa. In figure 3 we can see the distribution of the purity and efficiency in each prong. As we can see, the segmentation results are generally good, especially in the majority classes (muons and electrons).
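The two per-prong fractions can be computed from the hit-level labels as in the minimal sketch below. The function name `overlap_fractions` and the list-based label format are assumptions. Following the definitions as stated in the text, the first returned value (shared hits divided by the size of the predicted prong) corresponds to what is called efficiency above, and the second (shared hits divided by the size of the true prong) to what is called purity.

```python
def overlap_fractions(pred_labels, true_labels, pred_id, true_id):
    """Shared-hit fraction relative to the predicted prong and to the true prong."""
    pred_hits = {i for i, p in enumerate(pred_labels) if p == pred_id}
    true_hits = {i for i, t in enumerate(true_labels) if t == true_id}
    shared = len(pred_hits & true_hits)
    frac_of_pred = shared / len(pred_hits) if pred_hits else 0.0
    frac_of_true = shared / len(true_hits) if true_hits else 0.0
    return frac_of_pred, frac_of_true

true_labels = [0, 0, 0, 0, 1, 1]
pred_labels = [0, 0, 0, 1, 1, 1]
first_prong = overlap_fractions(pred_labels, true_labels, pred_id=0, true_id=0)
second_prong = overlap_fractions(pred_labels, true_labels, pred_id=1, true_id=1)
```

In this toy event one hit of the first true prong is absorbed by the second predicted prong, which lowers one fraction for each prong and shows why the two metrics have to be balanced.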
3.2.3 Speed and memory usage
Model | Memory usage (MiB) | Time per sample (s) | Classification OVR AUC | Instance segmentation accuracy |
---|---|---|---|---|
2D R-CNN | | | 0.526 | 0.518 |
2D GAT | | | 0.833 | 0.659 |
2D HPST (ours) | | | 0.936 | 0.779 |
2D PST (ours) | | | 0.949 | 0.827 |
3D GAT | | | 0.859 | 0.727 |
3D PST (ours) | | | 0.982 | 0.889 |
We benchmarked the three models by running inference on 100 samples, with a batch size of 1, measuring the peak memory increase between the start of inference and the end of inference, in order to remove as much overhead as possible. We evaluated the time it takes for these 100 inferences and the memory used in each of them. We additionally used a Fast R-CNN [11] as a comparison in order to evaluate how much memory is saved by evaluating the data as a point cloud. The results can be seen in table 2. As we can see, memory usage is greatly decreased when comparing a regular CNN model to the sparse methods like graph neural networks and point set neural networks, even when projecting a 3D voxel into two 2D views.
Although 2D models maintain a lower memory usage profile, because they merge obscured points and remove at least 1/3 of the data, our 3D model presents a significant increase in performance, especially in segmentation accuracy. This increase in accuracy is large enough to justify the increase in memory usage relative to the models that do not use all three dimensions. Although the increase in memory usage is significant, it remains within the memory budget of the environment in which the model will be deployed.
3.3 Qualitative evaluation


Figures 4 and 5 show two samples from the test set. They are colored according to the most common particle predicted by the model in the segmentation. The muon produced by a $\nu_\mu$ charged-current interaction leaves a long track. Protons and charged pions, also visible in this event, leave tracks as well, but such tracks are generally shorter. The $\nu_e$ event produces a prominent electron shower. Separating tracks from showers is an easier task than identifying particles with similar topologies, especially in the case of protons versus pions. These particles make up the hadronic portions of neutrino interactions, which are less well understood than the leptonic portions from electrons and muons.
4 Limitations and Conclusions
The claims of memory efficiency will generally hold, although this might not be the case for every dataset. A sparse representation is more efficient than a dense matrix only up to a certain occupancy, beyond which storing the coordinates takes more space than simply storing the dense matrix. Point set operations can also greatly increase in complexity as the number of points grows, resulting in a much slower algorithm. However, these are not the regimes found in data produced by neutrino detectors.
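The crossover between sparse and dense storage can be made concrete with a small calculation. The sketch assumes a coordinate-list layout in which every hit stores one integer per dimension plus one value, all of 4 bytes; the function name `sparse_is_smaller` and the byte sizes are illustrative assumptions. Under these assumptions the sparse form is smaller as long as the fraction of occupied cells stays below `value_bytes / (dims * coord_bytes + value_bytes)`, which is 1/4 in 3D.

```python
def sparse_is_smaller(num_hits, shape, coord_bytes=4, value_bytes=4):
    """Compare coordinate-list storage against a fully instantiated dense array."""
    dense_cells = 1
    for extent in shape:
        dense_cells *= extent
    dense_bytes = dense_cells * value_bytes
    # Each hit stores one coordinate per dimension plus its value.
    sparse_bytes = num_hits * (len(shape) * coord_bytes + value_bytes)
    return sparse_bytes < dense_bytes, sparse_bytes, dense_bytes

few_hits = sparse_is_smaller(1_000, (100, 100, 100))
many_hits = sparse_is_smaller(300_000, (100, 100, 100))
```

With 1,000 hits in a volume of one million cells the sparse form is far smaller, while at 30% occupancy the dense array becomes the more compact choice.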
Improvements can also be made to the attention mechanism. The current implementation calculates attention manually instead of using the more optimized FlashAttention [8], meaning that memory usage and runtime can be reduced further. FlashAttention cannot be used directly, since it is limited to fixed-length sequences; however, a similar strategy could be implemented to speed up point transformer operations. Using nearest neighbors to encode the connections between points is also not necessarily the most efficient method for this dataset. Point Transformer v3 (PTv3) [20] demonstrates that point transformers can achieve the same performance using fewer connections than a graph based on nearest neighbors. In the PTv3 paper, this is achieved by serializing the points with a space-filling curve, drastically reducing the memory usage of one of the most expensive operations in the model's calculations.
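Serialization with a space-filling curve can be illustrated with the Z-order (Morton) curve, used here as one common example of such a curve. The sketch is not taken from PTv3 or from the model described in this paper; the function names and the restriction to non-negative integer 2D coordinates are assumptions for illustration.

```python
def morton_code_2d(x, y, bits=16):
    """Interleave the bits of non-negative integer coordinates x and y (Z-order curve)."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code

def serialize(points):
    """Return point indices sorted along the Z-order curve."""
    return sorted(range(len(points)), key=lambda i: morton_code_2d(*points[i]))

points = [(3, 3), (0, 0), (1, 0), (0, 1), (2, 2)]
order = serialize(points)
```

Once the points are sorted in this way, attention can be restricted to fixed-size windows of the resulting sequence, so that points that are close in space tend to be close in the sequence without an explicit nearest-neighbor search.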
In general, point set transformers perform very well compared to GNNs and CNNs in this task. PSTs strike a balance between memory usage, time, and performance that makes them a great fit for this application.
References
- [1] R. Acciarri et al. The Pandora multi-algorithm approach to automated pattern recognition of cosmic-ray muon and neutrino events in the MicroBooNe detector. The European Physical Journal C, 78(1):82, Jan 2018.
- [2] Costas Andreopoulos, Christopher Barry, Steve Dytman, Hugh Gallagher, Tomasz Golan, Robert Hatcher, Gabriel Perdue, and Julia Yarba. The GENIE Neutrino Monte Carlo Generator: Physics and User Manual, 10 2015.
- [3] A. Aurisano, A. Radovic, D. Rocco, A. Himmel, M. D. Messier, E. Niner, G. Pawloski, F. Psihas, A. Sousa, and P. Vahle. A Convolutional Neural Network Neutrino Event Classifier. JINST, 11(09):P09001, 2016.
- [4] Pierre Baldi. Deep Learning in Science. Cambridge University Press, 2021.
- [5] Pierre Baldi and Roman Vershynin. The quarks of attention: Structure and capacity of neural attention building blocks. Artificial Intelligence, 319:103901, 2023. Also: arXiv:2202.08371.
- [6] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
- [7] David F. Crouse. On implementing 2d rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696, 2016.
- [8] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 16344–16359. Curran Associates, Inc., 2022.
- [9] D.A. Dwyer, M. Garcia-Sciveres, D. Gnani, C. Grace, S. Kohn, M. Kramer, A. Krieger, C.J. Lin, K.B. Luk, P. Madigan, C. Marshall, H. Steiner, and T. Stezelberger. Larpix: demonstration of low-power 3d pixelated charge readout for liquid argon time projection chambers. Journal of Instrumentation, 13(10):P10007, oct 2018.
- [10] Geant4 Collaboration. Geant4 10.4 release notes. https://geant4-data.web.cern.ch/ReleaseNotes/ReleaseNotes4.10.4.html, 2017.
- [11] Ross Girshick. Fast r-cnn. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
- [12] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, WWW ’20, page 2704–2710, New York, NY, USA, 2020. Association for Computing Machinery.
- [13] J. S. Marshall and M. A. Thomson. The Pandora software development kit for pattern recognition. The European Physical Journal C, 75(9):439, Sep 2015.
- [14] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
- [15] Alexander Shmakov, Alejandro J Yankelevich, Jianming Bian, and Pierre Baldi. Interpretable joint event-particle reconstruction at NOvA with sparse cnns and transformers. In Machine Learning and the Physical Sciences, NeurIPS, 2023.
- [16] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 29–38, 2017.
- [17] The DUNE Collaboration. Deep underground neutrino experiment (DUNE), far detector technical design report, volume ii: DUNE physics, 2020.
- [18] The MicroBooNE collaboration. Neutrino event selection in the MicroBooNE liquid argon time projection chamber using Wire-Cell 3D imaging, clustering, and charge-light matching. Journal of Instrumentation, 16(06):P06043, jun 2021.
- [19] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks, 2018.
- [20] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler, faster, stronger. In CVPR, 2024.
- [21] Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Hengshuang Zhao. Point transformer v2: Grouped vector attention and partition-based pooling. In NeurIPS, 2022.
- [22] Sam Young, Yeon jae Jwa, and Kazuhiro Terao. Particle trajectory representation learning with masked point modeling, 2025.
- [23] H.W. Yu, M. Bishai, W.Q. Gu, M.F. Lin, X. Qian, Y.H. Ren, A. Scarpelli, B. Viren, H.Y. Wei, H.Z. Yu, K. Yu, and C. Zhang. Augmented signal processing in liquid argon time projection chambers with a deep neural network. Journal of Instrumentation, 16(01):P01036, jan 2021.
- [24] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.