Particle Hit Clustering and Identification Using Point Set Transformers in Liquid Argon Time Projection Chambers

Edgar E. Robles (corresponding author), Alejandro Yankelevich, Wenjie Wu, Jianming Bian, and Pierre Baldi
Abstract

Liquid argon time projection chambers are often used in neutrino physics and dark-matter searches because of their high spatial resolution. The images generated by these detectors are extremely sparse: the energy values detected by most of the detector are equal to 0, meaning that despite the high resolution, most of the detector is unused in any particular interaction. Instead of representing all of the empty detections, the interaction is usually stored as a sparse matrix, a list of detection locations paired with their energy values. Traditional machine learning methods that have been applied to particle reconstruction, such as convolutional neural networks (CNNs), cannot operate over data stored in this way and therefore require the matrix to be fully instantiated as a dense matrix. Operating on dense matrices requires a lot of memory and computation time, in contrast to operating directly on the sparse matrix. We propose a machine learning model using a point set neural network that operates over a sparse matrix, greatly improving both processing speed and accuracy over methods that instantiate the dense matrix, as well as over other methods that operate over sparse matrices. Compared to competing state-of-the-art methods, our method improves classification performance by 14% and segmentation performance by more than 22%, while taking 80% less time and using 66% less memory. Compared to state-of-the-art CNN methods, our method improves classification performance by more than 86% and segmentation performance by more than 71%, while reducing runtime by 91% and memory usage by 61%.

1 Introduction

Experiments in the field of particle physics often create large amounts of data, which is difficult to process at scale by human experts. This data often needs to be manually sorted by these experts, using valuable time that could be used interpreting the data. The advent of high-quality machine learning models has helped automate much of the manual labor required to label these images [4], but with increased quality, there has also been an increase in computational costs and resources required to run these models. Even large experimental collaborations in the field of particle physics often face strict limits in resource utilization during large-scale simulation and data processing.

The liquid argon time projection chamber (LArTPC) is a common choice of detector technology in neutrino physics and direct dark matter searches due to its very high spatial resolution. The operating principle consists of applying an electric field across a large volume of liquid argon. When charged particles pass through the detector, ionized electrons are accelerated toward the anode end of the drift volume. These drift electrons are usually detected via either a series of wire planes or a grid of charge-detecting pixels. Together with the detection time of the drift electrons, this technology allows for 3D reconstruction of particle trajectories through the detector. These trajectories appear as tracks or showers referred to as "prongs". Particles may also decay in their trajectory, splitting into more particles and creating new prongs. The task at hand is then to perform instance segmentation over these prongs to cluster them as well as to classify each hit into its corresponding particle type for prong identification.

Due to the high spatial resolution, LArTPC images are exceptionally sparse, consisting of an empty background in most of the image except for a few prongs. As such, these are usually represented as sparse matrices, stored as a list of coordinates and values. When performing computations such as the ones used in segmentation machine learning models, these sparse matrices have to be converted into dense matrices, which can take up a lot of resources and slow down training and inference. There have been implementations of differentiable convolution operations on sparse matrices, such as Nvidia’s MinkowskiEngine [6]. However, the operations need to approximate a convolution in order to save memory. An alternative to using sparse matrices is to represent the sparse image as a point cloud, which only requires coordinates and values to be operated on directly.
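The storage gap between the two representations can be illustrated with a short sketch. The volume size (400 x 400 x 1400 voxels), the hit count, and the 4-byte coordinate and value types are hypothetical choices made for illustration, not figures taken from this work.

```python
import numpy as np

def to_dense(coords, values, shape):
    """Scatter a coordinate-list (sparse) event into a dense array."""
    dense = np.zeros(shape, dtype=values.dtype)
    dense[tuple(coords.T)] = values  # one index array per spatial axis
    return dense

def storage_bytes(n_hits, shape, coord_bytes=4, value_bytes=4):
    """Return (sparse_bytes, dense_bytes) needed to store one event."""
    sparse = n_hits * (len(shape) * coord_bytes + value_bytes)
    dense = int(np.prod(shape, dtype=np.int64)) * value_bytes
    return sparse, dense

# Hypothetical event: 1,000 hits in a 400 x 400 x 1400 voxel volume.
sparse_bytes, dense_bytes = storage_bytes(1000, (400, 400, 1400))
print(f"sparse: {sparse_bytes} B, dense: {dense_bytes} B")
```

The sparse cost grows with the number of hits, while the dense cost grows with the detector volume regardless of how many voxels are occupied.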

Similarly to traditional scintillator cell detectors, LArTPCs with wire-based readout provide multiple 2D views that are subsequently combined to create the 3D reconstruction of particle trajectories. The more novel pixel-based readout for LArTPCs intrinsically provides 3D point cloud representations [9]. However, segmentation over large 3D images can be prohibitively computationally expensive, so images are often reduced to multiple 2D views to save memory. Finally, downsampling is often used to further save on memory when it is necessary to process large volumes, as is the case with events containing long muon tracks.

1.1 Related work

The segmentation tasks considered in this work are commonly handled through the Pandora multi-algorithm approach for LArTPC event reconstruction, and a variety of clustering algorithms are available in the Pandora software development kit [1, 13]. The Wire-Cell software package has also introduced machine-learning based approaches for these tasks [23, 18], and the PoLAr-MAE model has recently addressed this task with a transformer architecture [22]. CNNs are often used for event and particle classification at LArTPCs, building on the work of the NOvA CNN [17, 3, 15]. Through panoptic segmentation, this work addresses both clustering and particle classification.

We will be interpreting the data as point sets rather than pixels, so we rely on the Deep Sets [24] framework. This framework has been extended to implement self-attention and graphs in later works. One such work is Point Transformers [21, 20], a model that implements an attention mechanism between neighboring points in a point cloud. Point Transformer v2 uses k-nearest neighbors to create a graph between points to calculate the attention between closer points, while Point Transformer v3 uses a different serialization technique to reduce memory usage.

We choose to extend the concepts from point set transformers using Heterogeneous Graph Transformers [12], a method that implements attention in heterogeneous graphs. Heterogeneous graphs are graphs whose nodes belong to different semantic classes, so using different attention weights for each class models the data in a more semantically correct way. This allows us to interpret information from two different views that are related to each other.

Our model differs from graph attention-based models, such as Graph Attention Networks (GATs) [19], by leveraging the coordinates of each of the points. Rather than just using coordinates of the points as a feature or as inputs to the kNN algorithm that builds the graph, our model uses them to implement a faster pooling algorithm [16], which reduces the computation time. We also include a relative positional encoding scheme [21] in order to decay information from neighbors that are too far away.

2 Methods

2.1 Notation

Consider a dataset $\mathcal{X}$ of size $N$, where each sample $X^{(i)}$ represents an event from the particle detector. Each event $X^{(i)}$ is split into $M$ views, each view denoted by $X^{(i,j)}$. Each view has a variable number of detections $K^{(i,j)}$. Each detection is described by coordinates $x^{(i,j)}_{k}\in\mathbb{R}^{c}$ and values $v^{(i,j)}_{k}\in\mathbb{R}^{d}$. Pixel-based TPCs present a homogeneous view as a single 3D point cloud. Wire-based TPCs present heterogeneous views as multiple 2D point clouds. We will treat homogeneous views as a special case of heterogeneous views when explaining the methods for the purpose of easing the burden on notation.

For each pair of points we define an intra-view distance $d_{jj}(x^{(i,j)}_{k},x^{(i,j)}_{k'})$ for points within the same view (and therefore the same vector space) and an inter-view distance $d_{jj'}(x^{(i,j)}_{k},x^{(i,j')}_{k'})$ for points in different views. Additionally, based on these distances we define an edge $e^{(i)}_{k,k'}\in\{0,1\}$ which connects two nodes that may be in the same or different views.
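As a rough illustration of how such edges could be built, the sketch below connects each point to its k nearest neighbors. The helper `knn_edges` is hypothetical. The intra-view distance is assumed to be Euclidean, and the inter-view distance is assumed to compare only a coordinate shared by both views (here z, for an XZ and a YZ view); the text does not specify either choice.

```python
import numpy as np

def knn_edges(src, dst, k, same_set=False):
    """Boolean matrix e[a, b]: True if dst[b] is one of the k nearest points to src[a]."""
    dist = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    if same_set:
        np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    k = min(k, dst.shape[0] - (1 if same_set else 0))
    nearest = np.argsort(dist, axis=1)[:, :k]
    edges = np.zeros(dist.shape, dtype=bool)
    np.put_along_axis(edges, nearest, True, axis=1)
    return edges

# Two toy views: columns are (x, z) and (y, z) respectively.
xz = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
yz = np.array([[2.0, 0.1], [3.0, 5.2], [9.0, 9.0]])

intra = knn_edges(xz, xz, k=1, same_set=True)  # intra-view: full Euclidean distance
inter = knn_edges(xz[:, 1:], yz[:, 1:], k=1)   # inter-view: shared z coordinate only (assumption)
```

The dense pairwise distance matrix is only practical for small point counts; a spatial index would be the usual choice at scale.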

2.2 Homogeneous attention

Point attention is calculated by creating a graph between points, using nearest neighbors or other serialization techniques, in order to emulate a rolling window such as the one present in a traditional attention model [5]. The attention is then calculated and aggregated over the neighborhood of this graph. For example, given a source node $x^{(i,j)}_{k}$ and a destination node $x^{(i,j)}_{k'}$, a query $Q^{(i,j)}_{k}$ is calculated with respect to the source, and a key $K^{(i,j)}_{k'}$ and a value $V^{(i,j)}_{k'}$ are calculated with respect to the destination for each edge. Each edge within the same view $e^{(i)}_{kk'}$ is then given a score of

$$w^{(i,j)}_{kk'} = Q_{k}^{(i,j)T} K^{(i,j)}_{k'} + \mathrm{RPE}\left(x^{(i,j)}_{k}, x^{(i,j)}_{k'}\right), \tag{2.1}$$

where RPE is a relative positional encoding module in which the difference between the two points is encoded by a linear layer $W$, i.e.,

$$\mathrm{RPE}\left(x^{(i,j)}_{k}, x^{(i,j)}_{k'}\right) = W\left(x^{(i,j)}_{k} - x^{(i,j)}_{k'}\right). \tag{2.2}$$

The weights are then normalized with a softmax operation to represent the intensities of how much information is required to flow from each edge. This is then used to weigh the value vectors:

$$h^{(i,j)}_{k} = \sum_{k'} \mathrm{softmax}_{\ell}\left(w^{(i,j)}_{k\ell}\right)_{k'} V^{(i,j)}_{k'}, \tag{2.3}$$

leading to a point set attention mechanism.
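A minimal NumPy sketch of equations (2.1) to (2.3) follows. It is not the authors' implementation: the projection matrices `Wq`, `Wk`, `Wv` and the RPE weight vector `w_rpe` (a linear map from coordinate differences to a scalar score) are illustrative assumptions, a dense score matrix is used for readability, and every point is assumed to have at least one neighbor.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over each row, restricted to entries where mask is True."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def point_attention(x, feats, edges, Wq, Wk, Wv, w_rpe):
    """Single-view attention; edges[k, k'] marks k' as a neighbor of k."""
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    rel = x[:, None, :] - x[None, :, :]    # x_k - x_k', shape (n, n, c)
    scores = Q @ K.T + rel @ w_rpe         # eq. (2.1) with the linear RPE of eq. (2.2)
    attn = masked_softmax(scores, edges)   # softmax over the neighbors of each point
    return attn @ V                        # eq. (2.3)

rng = np.random.default_rng(0)
n, c, d = 6, 3, 4
x = rng.normal(size=(n, c))
feats = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
w_rpe = rng.normal(size=c)
edges = ~np.eye(n, dtype=bool)             # toy graph: every other point is a neighbor
out = point_attention(x, feats, edges, Wq, Wk, Wv, w_rpe)
```

Because the positional term depends only on coordinate differences, translating all points by the same offset leaves the output unchanged.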

2.3 Homogeneous pooling and unpooling

A pooling operation is used in U-net-like architectures to create feature representations between points that are further away from each other. The way to extend this concept from CNNs to point set neural networks (PSNNs) and graph neural networks (GNNs) is to pool neighbors with each other. This is slow, as the nearest neighbors operation is quite expensive, so we approximate it with a faster method: we first create a grid $G$ of a specific size $g$, then for each square or cube $G_{\ell}$ in the grid, we average the coordinates and features of the points within that cell, that is,

$$x'_{i,\ell} = \frac{1}{|G_{\ell}|} \sum_{x_{i,j}\in G_{\ell}} x_{i,j}, \tag{2.4}$$

meaning that the summary point of each cell is located at the centroid of all the points in that cell, and

$$v'_{i,\ell} = \frac{1}{|G_{\ell}|} \sum_{v_{i,j}\in G_{\ell}} v_{i,j}, \tag{2.5}$$

meaning that the features are the average features of all the points in the cell.

Unpooling is done by using the coordinates from a previous step and then broadcasting the pooled point’s features into the coordinates that created it. This does create a set where the features will be the same within the grid after it is unpooled, making the residual connections of a U-net vital for the operation to be semantically meaningful.
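The grid pooling of equations (2.4) and (2.5) and the matching unpooling step can be sketched as follows. The function names and the choice to return the point-to-cell index for later unpooling are illustrative assumptions rather than details of the authors' code.

```python
import numpy as np

def grid_pool(x, v, g):
    """Average coordinates and features of all points that fall in the same grid cell."""
    cells = np.floor(x / g).astype(np.int64)            # integer cell index per point
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                       # cell id of every point
    n_cells = int(inverse.max()) + 1
    counts = np.bincount(inverse, minlength=n_cells)[:, None]
    x_sum = np.zeros((n_cells, x.shape[1]))
    v_sum = np.zeros((n_cells, v.shape[1]))
    np.add.at(x_sum, inverse, x)                        # scatter-add per cell
    np.add.at(v_sum, inverse, v)
    return x_sum / counts, v_sum / counts, inverse

def grid_unpool(v_pooled, inverse):
    """Broadcast each pooled feature back to the points that produced it."""
    return v_pooled[inverse]

x = np.array([[0.1, 0.2], [0.3, 0.4], [1.5, 0.2]])
v = np.array([[1.0], [3.0], [10.0]])
x_pooled, v_pooled, inverse = grid_pool(x, v, g=1.0)
v_unpooled = grid_unpool(v_pooled, inverse)
```

After unpooling, the first two points share the same feature because they came from the same cell, which is why the U-net skip connections are needed to restore per-point detail.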

2.4 Heterogeneous attention

Wire-based LArTPCs usually output multiple views of the point clouds, where each view presents a different subset of the spatial dimensions. This means that the data between different point clouds is related but cannot easily be built into a graph. Using the aforementioned inter-view distance, we are able to build the neighborhood graph. Therefore, for each point we calculate the query $Q^{(i,j'\to j)}_{k}$, i.e., the query on point $k$ from view $j'$ to view $j$ on sample $i$, and then for each of its neighbors $k'$ we calculate both $K^{(i,j'\to j)}_{k'}$ and $V^{(i,j'\to j)}_{k'}$, that is, the key and value on point $k'$ from view $j'$ to view $j$ on sample $i$. Using these, we can calculate

$$w^{(i,j'\to j)}_{kk'} = Q_{k}^{(i,j'\to j)T} K^{(i,j'\to j)}_{k'}, \tag{2.6}$$

An RPE cannot be used here because the two points are defined in different spaces, making their coordinates hard to compare. This weight is then normalized using a softmax operation over the point's neighbors and then used in a weighted sum to calculate the output of the attention module,

$${h'}^{(i,j)}_{k} = \sum_{k'} \mathrm{softmax}_{\ell}\left(w^{(i,j'\to j)}_{k\ell}\right)_{k'} V^{(i,j'\to j)}_{k'}. \tag{2.7}$$
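Equations (2.6) and (2.7) can be sketched in NumPy as below. This is an illustration under stated assumptions: the view names, the dictionary of per-direction weights, and the choice to output zeros for points with no cross-view neighbor are all hypothetical, and no positional term is added, matching the text.

```python
import numpy as np

def cross_view_attention(feats_dst, feats_src, edges, Wq, Wk, Wv):
    """Attention from source view j' into destination view j.

    edges[k, k'] is True when point k' of the source view neighbors point k
    of the destination view under the inter-view distance.
    """
    Q = feats_dst @ Wq
    K = feats_src @ Wk
    V = feats_src @ Wv
    scores = np.where(edges, Q @ K.T, -np.inf)            # eq. (2.6), no RPE term
    has_nbr = edges.any(axis=1, keepdims=True)
    row_max = np.where(has_nbr, scores.max(axis=1, keepdims=True), 0.0)
    weights = np.exp(scores - row_max)
    denom = np.where(has_nbr, weights.sum(axis=1, keepdims=True), 1.0)
    return (weights / denom) @ V                          # eq. (2.7)

rng = np.random.default_rng(1)
d = 4
feats = {"xz": rng.normal(size=(3, d)), "yz": rng.normal(size=(5, d))}
# One (Wq, Wk, Wv) triple per direction, as in heterogeneous graph attention.
params = {("yz", "xz"): tuple(rng.normal(size=(d, d)) for _ in range(3))}

edges = np.zeros((3, 5), dtype=bool)
edges[0, 2] = True           # point 0 of xz has a single yz neighbor
edges[1, [0, 4]] = True      # point 1 of xz has two yz neighbors; point 2 has none
h_prime = cross_view_attention(feats["xz"], feats["yz"], edges, *params[("yz", "xz")])
```

Keeping a separate weight set per direction lets the XZ-to-YZ and YZ-to-XZ messages be learned independently.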

2.5 Heterogeneous pooling

When dealing with multiple views from the same detector, the views may be defined in completely different vector spaces, so while we may be able to compare distances to determine nearest neighbors or grids to pool points together, heterogeneous points cannot be pooled together. Therefore, we treat each view separately. Pooling is done per view with a voxel (grid) pooling method [16], in the same manner as the homogeneous pooling described in section 2.3: a grid is created, the values of all the points within each cell of the grid are averaged, and the pooled point is positioned at the mean of all the points within that voxel.

Unpooling is performed using skip connections: the points are upsampled to the same coordinates from which they were previously pooled, using only information from the same view.

2.6 Architecture

Figure 1: Block diagram of the attention mechanism. The top path describes the intra-view attention mechanism, and the bottom path describes the inter-view mechanism. The top section labeled PST is the attention mechanism used in the point set transformer, while HPST uses both the top and bottom sections.
Figure 2: Architecture of the neural network. The attention block is described in Figure 1. The number of stages can be arbitrarily increased by adding stages to both the pooling and unpooling sides.

The network is structured like a U-net [14], where attention layers act as the convolutions and grid pooling and unpooling act as the pooling method. The architecture is described in Figure 2. The U-net is divided into $2n$ stages. Each stage contains $m$ blocks, where each block has an attention module. Each of the first $n$ stages is followed by a pooling step, which reduces the number of points. Each of the next $n$ stages is followed by an unpooling step, which uses the coordinates of the points of a previous block and concatenates the features of that block. The dimensionality of the embeddings is doubled at each stage during the first half and halved at each stage during the second half. Inter-view attention is calculated at each stage to ensure that information is mixed between views both locally (in the earlier stages) and globally (in the later stages).

2.7 Loss function

The network performs two tasks simultaneously: instance segmentation, selecting separate prongs from each other; and semantic segmentation, classifying each detection into a particle type. As such, the loss function used is separated into two parts,

$$\mathcal{L} = \lambda\mathcal{L}_{\mathrm{sem}} + (1-\lambda)\mathcal{L}_{\mathrm{ins}}. \tag{2.8}$$

Semantic segmentation is a simple classification problem, so we use multi-class cross-entropy to calculate this loss:

$$\mathcal{L}_{\mathrm{sem}} = \sum_{X^{(i)}\in\mathcal{X}} \sum_{x^{(i,j)}_{k}\in X^{(i)}} \mathrm{CE}\left(\mathrm{softmax}_{k'}\left(f(X_{i})^{(i,j)}_{k'}\right), y^{(i,j)}_{k}\right), \tag{2.9}$$

where $y^{(i,j)}_{k}$ is the correct semantic label of the detection.

Instance segmentation is done by minimizing the loss calculated by the best assignment between the predicted labels and the real labels. If point $x^{(i,j)}_{k}$ belongs to the segment $L^{(i,j)}_{k}$, then the loss calculated is

$$\mathcal{L}_{\mathrm{ins}} = \sum_{X^{(i)}\in\mathcal{X}} \min_{\phi\in\Sigma} \sum_{x^{(i,j)}_{k}\in X^{(i)}} \mathrm{CE}\left(\mathrm{softmax}_{k'}\left(f(X_{i})^{(i,j)}_{k'}\right), \phi\left(L^{(i,j)}_{k}\right)\right), \tag{2.10}$$

where $\Sigma$ is the set of all permutations of labels, allowing a unique assignment of one label to another. The optimal assignment of the labels is found using a linear sum assignment solver [7]. Linear sum assignment, classically solved by the Hungarian algorithm, finds the one-to-one matching in a bipartite graph that optimizes a total cost (here, minimizes the summed cross-entropy) in polynomial time. This allows us to calculate the loss function without needing to check every possible assignment combination. This is a standard method used to train object segmentation models. It affects only the training time, not the inference time, and scales polynomially with the number of possible object segments in the model, which is picked as a hyperparameter.
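For a single event, the minimization over permutations in the instance loss can be sketched with SciPy's `linear_sum_assignment`. The function `instance_loss` and its arguments are hypothetical. The sketch assumes the network outputs log-probabilities over a fixed number of instance slots and that true segment ids lie in the same range.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def instance_loss(log_probs, labels, n_slots):
    """Minimum summed cross-entropy over all one-to-one label assignments.

    log_probs: (n_points, n_slots) log-softmax outputs of the network.
    labels:    (n_points,) true segment id of each point, in [0, n_slots).
    """
    one_hot = np.eye(n_slots)[labels]
    # cost[t, p]: cross-entropy summed over points of true segment t
    # if that segment were assigned to predicted slot p.
    cost = -(one_hot.T @ log_probs)
    true_idx, pred_idx = linear_sum_assignment(cost)  # minimizes the total cost
    mapping = dict(zip(true_idx.tolist(), pred_idx.tolist()))
    return float(cost[true_idx, pred_idx].sum()), mapping

# Toy event: the network predicts the right grouping under shifted slot names.
labels = np.array([0, 0, 1, 1, 2])
probs = np.full((5, 3), 0.05)
probs[np.arange(5), (labels + 1) % 3] = 0.9
loss, mapping = instance_loss(np.log(probs), labels, n_slots=3)
```

Building the segment-by-slot cost matrix first means the solver works on an `n_slots` by `n_slots` problem, independent of the number of points.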

3 Experiments and Results

3.1 Dataset

We consider here a LArTPC with square 5 mm pixel-based readout. The TPC is 2 m × 2 m × 7 m in $x, y, z$ with a 2 m drift length along $x$. $\nu_e$ and $\nu_\mu$ interactions are simulated with GENIE [2], with the neutrinos traveling in the $+z$ direction and neutrino energies distributed uniformly up to 10 GeV. The energy deposition in liquid argon is then simulated with GEANT4 [10]. The dataset consists of 100,000 events each of $\nu_e$ and $\nu_\mu$, with 74% of events interacting through the charged current and the rest through the neutral current. Additionally, we created another dataset where the odd pixels of the Z dimension were assigned to the XZ view by removing the Y coordinate and the even pixels were assigned to the YZ view by removing the X coordinate, in order to simulate multi-view images similar to those produced by wire LArTPCs.
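The construction of the two-view dataset can be sketched as follows: hits with an odd Z pixel index keep their (x, z) coordinates and hits with an even Z pixel index keep their (y, z) coordinates. The function name and the integer pixel-index input format are assumptions, and the sketch leaves untouched any hits that coincide after projection, since their handling is not described.

```python
import numpy as np

def split_views(coords, values):
    """Split 3D hits into an XZ view (odd z) and a YZ view (even z).

    coords: (n, 3) integer pixel indices ordered (x, y, z).
    values: (n,) energy deposited in each hit.
    """
    odd_z = coords[:, 2] % 2 == 1
    xz = coords[odd_z][:, [0, 2]]    # keep x and z for odd-z hits
    yz = coords[~odd_z][:, [1, 2]]   # keep y and z for even-z hits
    return (xz, values[odd_z]), (yz, values[~odd_z])

coords = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
values = np.array([0.1, 0.2, 0.3])
(xz, v_xz), (yz, v_yz) = split_views(coords, values)
```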

3.2 Performance evaluation

Six models were trained and evaluated: two graph attention network (GAT) [19] based models, one for the multi-view case and one for the single-view case; a 2D CNN-based model (R-CNN) [11], used only for the single-view case; a heterogeneous point set transformer (HPST) for the multi-view case; and a point set transformer (PST) for both the multi-view and single-view cases. We performed a hyperparameter sweep over the number of layers, the layer size, the number of neighbors used in the nearest neighbors calculation, and the learning rate, sampling 60 random configurations from the grid. The hyperparameters swept for the three networks were the number of neighbor connections (4, 8), the number of stages of the neural network (2, 3, 4), the size of the embeddings inside the neural network (128, 256, 512), and the learning rate (between 1e-4 and 1e-1). The ranges were chosen according to the memory restrictions of our targeted production environment. The same ranges were used for all three networks, as they all had comparable parameters. Training and testing were done on a server with an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz, 503 GB of RAM, and four NVIDIA Titan V GPUs.
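A random sweep of this kind can be sketched with the standard library alone. The discrete options below are the ones listed in the text. Sampling the learning rate log-uniformly, the dictionary layout, and the fixed seed are assumptions made for the example.

```python
import math
import random

# Discrete options as listed in the text.
SEARCH_SPACE = {
    "neighbors": [4, 8],
    "stages": [2, 3, 4],
    "embed_size": [128, 256, 512],
}
LR_RANGE = (1e-4, 1e-1)

def sample_config(rng):
    """Draw one random configuration from the search space."""
    cfg = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
    lo, hi = LR_RANGE
    # Log-uniform draw (assumption): spreads samples evenly across decades.
    cfg["learning_rate"] = 10 ** rng.uniform(math.log10(lo), math.log10(hi))
    return cfg

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(60)]
```

Each configuration would then be trained and scored on the validation set, and the best-scoring one kept.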

The hyperparameter sweep was performed over one learning rate cycle of 64 epochs with a cosine annealing scheduler, using 10% of the dataset. The configuration with the best accuracy on the segmentation’s class labels in the validation set was selected. The resulting models were then trained with an AdamW optimizer for 4 learning rate cycles of 64 epochs each, for a total of 256 epochs. The results of this optimization for the 3D PST can be seen in table 1.

Table 1: Best hyperparameters for the 3D PST

| Hyperparameter | Value |
|---|---|
| Learning rate | 0.0006323 |
| Number of stages | 3 |
| Embed size | 256 |
| Neighbors | 8 |
| Initial grid size | 8 |
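
For reference, the learning rate schedule described above (cosine annealing over cycles of 64 epochs, repeated 4 times) could look like the following sketch. It assumes that the learning rate restarts at its base value at the beginning of each cycle and decays toward a minimum of zero; neither detail is stated in the text, and the function name is illustrative.

```python
import math


def cosine_annealing_lr(epoch, base_lr=0.0006323, cycle_len=64, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing with restarts.

    Assumptions of this sketch: the rate resets to base_lr at the start of
    every cycle and decays toward min_lr within the cycle.
    """
    t = epoch % cycle_len  # position inside the current cycle
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / cycle_len))


# Four cycles of 64 epochs, i.e. 256 epochs in total.
schedule = [cosine_annealing_lr(epoch) for epoch in range(4 * 64)]
```

The base learning rate used here is the value reported in table 1 for the 3D PST.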

3.2.1 Classification and Segmentation accuracy

We evaluate the accuracy of the classification and instance segmentation of each point for each model. The results can be seen in table 2. Using a more efficient implementation of attention together with pooling gives an advantage over a traditional GAT, as evidenced by the jump in performance between the 3D GAT and our 3D PST: the classification OVR AUC rises from 0.859 to 0.982, and the instance segmentation accuracy from 0.727 to 0.889.

Using all the information available in 3D images also helps increase accuracy. Matching prongs between views is an especially hard task, so the single-view 3D models inherently achieve better segmentation accuracy. HPST is able to bridge the gap by sharing data between the views, and it improves on the 2D GAT and R-CNN while using fewer parameters than the 2D PST. This creates a tradeoff: view sharing gives good performance with a smaller model, while the PST can reach higher accuracy because it is able to reach more neighbors within the same view.

3.2.2 Efficiency and purity of segmentation

Figure 3: Distribution of prong efficiency and purity

For each prong, we calculated the efficiency and purity of the segmentation, allowing multiple predicted prongs to be assigned to a single true prong. Efficiency is defined as the percentage of a predicted prong that is assigned to the correct prong. Purity is the percentage of the true prong that is predicted correctly. These metrics are commonly used in particle physics and should be balanced, as raising the purity can often lower the efficiency and vice versa. Figure 3 shows the distribution of the purity and efficiency for each prong. The segmentation results are generally good, especially in the majority classes (muons and electrons).
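
The two metrics, as defined above, can be computed per true prong with a sketch like the one below. It assumes that each predicted prong is assigned to the true prong with which it shares the most hits; this matching rule, and the name `prong_metrics`, are assumptions of the example rather than the procedure used in the paper.

```python
from collections import Counter, defaultdict


def prong_metrics(true_ids, pred_ids):
    """Return {true prong: (efficiency, purity)} using the definitions above.

    Assumption: a predicted prong is assigned to the true prong with which
    it shares the most hits, so several predicted prongs may map to one
    true prong.
    """
    # Count hits for every (predicted prong, true prong) pair.
    overlap = defaultdict(Counter)
    for t, p in zip(true_ids, pred_ids):
        overlap[p][t] += 1

    matched_hits = Counter()   # correctly shared hits, per true prong
    assigned_hits = Counter()  # all hits of assigned predicted prongs
    for counts in overlap.values():
        best_true, shared = counts.most_common(1)[0]
        matched_hits[best_true] += shared
        assigned_hits[best_true] += sum(counts.values())

    metrics = {}
    for t, size in Counter(true_ids).items():
        if assigned_hits[t] == 0:
            metrics[t] = (0.0, 0.0)  # no predicted prong was assigned
            continue
        efficiency = matched_hits[t] / assigned_hits[t]  # share of predicted hits that are correct
        purity = matched_hits[t] / size                  # share of the true prong recovered
        metrics[t] = (efficiency, purity)
    return metrics


true_ids = [0, 0, 0, 0, 1, 1, 1, 1]
pred_ids = [7, 7, 8, 8, 8, 9, 9, 9]
metrics = prong_metrics(true_ids, pred_ids)
```

In this toy example, predicted prongs 7 and 8 are both assigned to true prong 0, and the single hit of prong 8 that belongs to true prong 1 lowers the efficiency of prong 0 and the purity of prong 1.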

3.2.3 Speed and memory usage

Table 2: Speed and memory usage for each model compared to their performance.

| Model | Memory usage (MiB) | Time per sample (s) | Classification OVR AUC | Instance segmentation accuracy |
|---|---|---|---|---|
| 2D R-CNN | 440.5 ± 51.04 | 1.5752 ± 0.091 | 0.526 | 0.518 |
| 2D GAT | **88.6 ± 7.56** | **0.2300 ± 0.025** | 0.833 | 0.659 |
| 2D HPST (ours) | 99.1 ± 7.39 | 0.3542 ± 0.019 | 0.936 | 0.779 |
| 2D PST (ours) | 138.1 ± 11.29 | 0.2539 ± 0.021 | 0.949 | 0.827 |
| 3D GAT | 506.1 ± 30.13 | 0.7216 ± 0.060 | 0.859 | 0.727 |
| 3D PST (ours) | **170.2 ± 9.65** | **0.1401 ± 0.012** | 0.982 | 0.889 |

We benchmarked the three models by running inference on 100 samples with a batch size of 1, measuring the peak memory increase between the start and the end of inference in order to exclude as much overhead as possible. We recorded the time taken and the memory used for each of these 100 inferences. We additionally used a Fast R-CNN [11] as a comparison, in order to evaluate how much memory is saved by treating the data as a point cloud. The results can be seen in table 2. Memory usage is greatly decreased when moving from a regular CNN model to sparse methods such as graph neural networks and point set neural networks, even when projecting the 3D voxels into two 2D views: the 2D R-CNN uses 440.5 MiB, whereas the 2D sparse models use between 88.6 and 138.1 MiB.
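
The measurement procedure can be sketched with the Python standard library alone. This is only a stand-in: `tracemalloc` tracks Python heap allocations rather than accelerator memory, the callable `run_inference` and the toy model are hypothetical, and reporting the mean and standard deviation over samples is an assumption about how the uncertainties in table 2 were obtained. `tracemalloc.reset_peak` requires Python 3.9 or later.

```python
import statistics
import time
import tracemalloc


def benchmark(run_inference, samples):
    """Time each inference and record its peak memory increase in MiB."""
    times, peak_increases = [], []
    tracemalloc.start()
    for sample in samples:
        tracemalloc.reset_peak()  # forget peaks from earlier samples
        start_mem, _ = tracemalloc.get_traced_memory()
        start = time.perf_counter()
        run_inference(sample)  # batch size of 1: one sample per call
        times.append(time.perf_counter() - start)
        _, peak_mem = tracemalloc.get_traced_memory()
        peak_increases.append((peak_mem - start_mem) / 2**20)
    tracemalloc.stop()
    return (
        statistics.mean(times), statistics.stdev(times),
        statistics.mean(peak_increases), statistics.stdev(peak_increases),
    )


def toy_model(sample):
    """Hypothetical stand-in for a network forward pass."""
    buffer = [float(v) * 2.0 for v in sample]  # temporary allocation
    return sum(buffer)


samples = [list(range(1000)) for _ in range(100)]
mean_time, std_time, mean_mem, std_mem = benchmark(toy_model, samples)
```

Subtracting the memory in use at the start of each call isolates the increase caused by inference itself, which mirrors the intent of excluding overhead described above.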

Although 2D models maintain a lower memory footprint, since they merge obscured points and remove at least 1/3 of the data, our 3D model presents a significant increase in performance, especially in segmentation accuracy (0.889 for the 3D PST versus 0.827 for the 2D PST). This gain is large enough to justify the increase in memory usage (170.2 MiB versus 138.1 MiB) relative to the models that do not use all three dimensions. Although the increase is significant, the memory usage remains within the limits of the environment in which the model will be deployed.

3.3 Qualitative evaluation

Figure 4: Two example events from the test set. The left two columns show the X and Y views of a muon neutrino event with a long muon track (blue), and the right two columns show an electron neutrino event with a prominent electron shower (black). The top row shows each hit’s true particle label, and the bottom row shows the network’s predicted segmentation, with each segment colored according to the particle class assigned to the majority of its hits.
Figure 5: The two example events from figure 4 from the test set visualized as 3D plots.

Figures 4 and 5 show two samples from the test set. Each segment is colored according to the most common particle class predicted by the model within it. The muon produced by the $\nu_\mu$ charged current interaction leaves a long track. Protons and charged pions, also visible in this event, leave tracks as well, but such tracks are generally shorter. The $\nu_e$ event produces a prominent electron shower. Separating tracks from showers is an easier task than distinguishing particles with similar topologies, especially protons and pions. These particles make up the hadronic portions of neutrino interactions, which are less well understood than the leptonic portions from electrons and muons.

4 Limitations and Conclusions

The claims of memory efficiency will generally hold, although this might not be the case for every dataset. A sparse representation is more efficient than a dense matrix only up to a certain occupancy, beyond which storing the coordinates takes more space than simply storing the dense matrix. Point set operations can also greatly increase in complexity as the number of points grows, resulting in a much slower algorithm. However, these are not the regimes found in data produced by neutrino detectors.
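
The break-even point between the two representations can be estimated with a short calculation. The sketch below assumes a coordinate-list layout with one 4-byte integer per coordinate and a 4-byte energy value per hit, and a 5 mm voxel size along all three axes of the 2 m × 2 m × 7 m volume; these storage choices and the helper names are assumptions for illustration.

```python
def dense_bytes(shape, value_bytes=4):
    """Bytes needed to store every voxel of a dense array."""
    n_voxels = 1
    for dim in shape:
        n_voxels *= dim
    return n_voxels * value_bytes


def sparse_bytes(num_hits, ndim=3, index_bytes=4, value_bytes=4):
    """Bytes needed for a coordinate list: ndim indices plus one value per hit."""
    return num_hits * (ndim * index_bytes + value_bytes)


def break_even_hits(shape, index_bytes=4, value_bytes=4):
    """Largest hit count for which the sparse form is no bigger than the dense one."""
    per_hit = len(shape) * index_bytes + value_bytes
    return dense_bytes(shape, value_bytes) // per_hit


# Assumed voxel grid: 2 m x 2 m x 7 m at 5 mm per voxel.
shape = (400, 400, 1400)
dense_size = dense_bytes(shape)
limit = break_even_hits(shape)
occupancy_limit = limit / (shape[0] * shape[1] * shape[2])
```

Under these assumptions the dense array needs 896,000,000 bytes, and the sparse form stays smaller as long as fewer than 56,000,000 voxels (25% of the volume) are occupied.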

Improvements can also be made to the attention mechanism. The current implementation calculates attention manually instead of using the more optimized FlashAttention [8], meaning that memory usage and runtime could be further reduced. FlashAttention cannot be used directly, since it is limited to fixed-length sequences; however, a similar strategy could be implemented to speed up point transformer operations. Using nearest neighbors to encode the connections between points is also not necessarily the most efficient method for this dataset. Point Transformer v3 (PTv3) [20] demonstrates that point transformers can achieve the same performance using fewer connections than a graph based on nearest neighbors. In the PTv3 paper, this is achieved by serializing the points with a space-filling curve, drastically reducing the memory usage of one of the most expensive operations in the model’s calculations.
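
As an illustration of serialization with a space-filling curve, the sketch below orders integer voxel coordinates along a z-order (Morton) curve, one of the curve types used for this purpose. It is a generic example under the assumption of non-negative integer coordinates that fit in 11 bits; it is not the PTv3 implementation, and the function names are illustrative.

```python
def morton_code(x, y, z, bits=11):
    """Interleave the bits of x, y, z into a single z-order (Morton) key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def serialize(points, bits=11):
    """Sort points along the z-order curve so nearby points tend to be adjacent."""
    return sorted(points, key=lambda p: morton_code(p[0], p[1], p[2], bits))


points = [(3, 3, 3), (0, 0, 0), (1, 0, 0), (0, 0, 1)]
ordered = serialize(points)
```

Once the points are in this one-dimensional order, attention can be computed over contiguous windows of the sequence rather than over an explicit nearest-neighbor graph, which is the source of the memory savings mentioned above.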

In general, point set transformers perform very well compared to GNNs and CNNs in this task. PSTs strike a balance between memory usage, time, and performance that makes them a great fit for this application.

References

  • [1] R. Acciarri et al. The Pandora multi-algorithm approach to automated pattern recognition of cosmic-ray muon and neutrino events in the MicroBooNE detector. The European Physical Journal C, 78(1):82, Jan 2018.
  • [2] Costas Andreopoulos, Christopher Barry, Steve Dytman, Hugh Gallagher, Tomasz Golan, Robert Hatcher, Gabriel Perdue, and Julia Yarba. The GENIE Neutrino Monte Carlo Generator: Physics and User Manual, 10 2015.
  • [3] A. Aurisano, A. Radovic, D. Rocco, A. Himmel, M. D. Messier, E. Niner, G. Pawloski, F. Psihas, A. Sousa, and P. Vahle. A Convolutional Neural Network Neutrino Event Classifier. JINST, 11(09):P09001, 2016.
  • [4] Pierre Baldi. Deep Learning in Science. Cambridge University Press, 2021.
  • [5] Pierre Baldi and Roman Vershynin. The quarks of attention: Structure and capacity of neural attention building blocks. Artificial Intelligence, 319:103901, 2023. Also: arXiv:2202.08371.
  • [6] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
  • [7] David F. Crouse. On implementing 2D rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696, 2016.
  • [8] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 16344–16359. Curran Associates, Inc., 2022.
  • [9] D.A. Dwyer, M. Garcia-Sciveres, D. Gnani, C. Grace, S. Kohn, M. Kramer, A. Krieger, C.J. Lin, K.B. Luk, P. Madigan, C. Marshall, H. Steiner, and T. Stezelberger. LArPix: demonstration of low-power 3D pixelated charge readout for liquid argon time projection chambers. Journal of Instrumentation, 13(10):P10007, Oct 2018.
  • [10] Geant4 Collaboration. Geant4 10.4 release notes. geant4-data.web.cern.ch, https://geant4-data.web.cern.ch/ReleaseNotes/ReleaseNotes4.10.4.html, 2017.
  • [11] Ross Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
  • [12] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, WWW ’20, pages 2704–2710, New York, NY, USA, 2020. Association for Computing Machinery.
  • [13] J. S. Marshall and M. A. Thomson. The Pandora software development kit for pattern recognition. The European Physical Journal C, 75(9):439, Sep 2015.
  • [14] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
  • [15] Alexander Shmakov, Alejandro J Yankelevich, Jianming Bian, and Pierre Baldi. Interpretable joint event-particle reconstruction at NOvA with sparse CNNs and transformers. In Machine Learning and the Physical Sciences, NeurIPS, 2023.
  • [16] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 29–38, 2017.
  • [17] The DUNE Collaboration. Deep underground neutrino experiment (DUNE), far detector technical design report, volume ii: DUNE physics, 2020.
  • [18] The MicroBooNE collaboration. Neutrino event selection in the MicroBooNE liquid argon time projection chamber using Wire-Cell 3D imaging, clustering, and charge-light matching. Journal of Instrumentation, 16(06):P06043, jun 2021.
  • [19] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks, 2018.
  • [20] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler, faster, stronger. In CVPR, 2024.
  • [21] Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Hengshuang Zhao. Point transformer v2: Grouped vector attention and partition-based pooling. In NeurIPS, 2022.
  • [22] Sam Young, Yeon jae Jwa, and Kazuhiro Terao. Particle trajectory representation learning with masked point modeling, 2025.
  • [23] H.W. Yu, M. Bishai, W.Q. Gu, M.F. Lin, X. Qian, Y.H. Ren, A. Scarpelli, B. Viren, H.Y. Wei, H.Z. Yu, K. Yu, and C. Zhang. Augmented signal processing in liquid argon time projection chambers with a deep neural network. Journal of Instrumentation, 16(01):P01036, jan 2021.
  • [24] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.