Software Engineering
See recent articles
Showing new listings for Friday, 18 April 2025
- [1] arXiv:2504.12300 [pdf, other]
-
Title: Implementing Effective Changes in Software Projects to Optimize Runtimes and Minimize DefectsComments: 14 pages. arXiv admin note: text overlap with arXiv:1708.05442 by other authorsSubjects: Software Engineering (cs.SE)
The continuous evolution of software projects necessitates the implementation of changes to enhance performance and reduce defects. This research explores effective strategies for learning and implementing useful changes in software projects, focusing on optimizing runtimes and minimizing software defects. A comprehensive review of existing literature sets the foundation for understanding the current landscape of software optimization and defect reduction. The study employs a mixed-methods approach, incorporating both qualitative and quantitative data from software projects before and after changes were made. Key methodologies include detailed data collection on runtimes and defect rates, root cause analysis of common issues, and the application of best practices from successful case studies. The research highlights critical techniques for learning from past projects, identifying actionable changes, and ensuring their effective implementation. In-depth case study analysis provides insights into the practical challenges and success factors associated with these changes. Statistical analysis of the results demonstrates significant improvements in runtimes and defect rates, underscoring the value of a structured approach to software project optimization. The findings offer actionable recommendations for software development teams aiming to enhance project performance and reliability. This study contributes to the broader understanding of software engineering practices, providing a framework for continuous improvement in software projects. Future research directions are suggested to refine these strategies further and explore their application in diverse software development environments.
- [2] arXiv:2504.12365 [pdf, html, other]
-
Title: Themisto: Jupyter-Based Runtime BenchmarkComments: Accepted to the third Deep Learning for Code (DL4C) workshop @ ICLR 2025Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In this work, we present a benchmark that consists of Jupyter notebooks development trajectories and allows measuring how large language models (LLMs) can leverage runtime information for predicting code output and code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that there exists a significantly understudied domain in the development of code-based models, which involves incorporating the runtime context.
- [3] arXiv:2504.12443 [pdf, html, other]
-
Title: Bridging the Gap: A Comparative Study of Academic and Developer Approaches to Smart Contract VulnerabilitiesFrancesco Salzano, Lodovica Marchesi, Cosmo Kevin Antenucci, Simone Scalabrino, Roberto Tonelli, Rocco Oliveto, Remo PareschiSubjects: Software Engineering (cs.SE)
In this paper, we investigate the strategies adopted by Solidity developers to fix security vulnerabilities in smart contracts. Vulnerabilities are categorized using the DASP TOP 10 taxonomy, and fixing strategies are extracted from GitHub commits in open-source Solidity projects. Each commit was selected through a two-phase process: an initial filter using natural language processing techniques, followed by manual validation by the authors. We analyzed these commits to evaluate adherence to academic best practices. Our results show that developers often follow established guidelines for well-known vulnerability types such as Reentrancy and Arithmetic. However, in less-documented categories like Denial of Service, Bad Randomness, and Time Manipulation, adherence is significantly lower, suggesting gaps between academic literature and practical development. From non-aligned commits, we identified 27 novel fixing strategies not previously discussed in the literature. These emerging patterns offer actionable solutions for securing smart contracts in underexplored areas. To evaluate the quality of these new fixes, we conducted a questionnaire with academic and industry experts, who assessed each strategy based on Generalizability, Long-term Sustainability, and Effectiveness. Additionally, we performed a post-fix analysis by tracking subsequent commits to the fixed files, assessing the persistence and evolution of the fixes over time. Our findings offer an empirically grounded view of how vulnerabilities are addressed in practice, bridging theoretical knowledge and real-world solutions in the domain of smart contract security.
- [4] arXiv:2504.12445 [pdf, html, other]
-
Title: RePurr: Automated Repair of Block-Based Learners' ProgramsComments: 24 pages, ACM International Conference on the Foundations of Software Engineering (FSE 2025)Subjects: Software Engineering (cs.SE)
Programming is increasingly taught using block-based languages like Scratch. While the use of blocks prevents syntax errors, learners can still make semantic mistakes, requiring feedback and help. As teachers may be overwhelmed by help requests in a classroom, may lack programming expertise themselves, or may be unavailable in independent learning scenarios, automated hint generation is desirable. Automated program repair (APR) can provide the foundation for this, but relies on multiple assumptions: (1) APR usually targets isolated bugs, but learners may fundamentally misunderstand tasks or request help for substantially incomplete code. (2) Software tests are required to guide the search and localize broken blocks, but tests for block-based programs are different to those in past APR research: They consist of system tests, and very few of them already fully cover the code. At the same time, they have vastly longer runtimes due to animations and interactions on Scratch programs, which inhibits the applicability of search. (3) The plastic surgery hypothesis assumes the code necessary for repairs already exists in the codebase. Block-based programs tend to be small and may lack this redundancy. To study if APR of such programs is still feasible, we introduce, to the best of our knowledge, the first APR approach for Scratch based on evolutionary search. Our RePurr prototype includes novel refinements of fault localization to improve the guidance of test suites, recovers the plastic surgery hypothesis by exploiting that learning scenarios provide model and student solutions, and reduces the costs of fitness evaluations via test parallelization and acceleration. Empirical evaluation on a set of real learners' programs confirms the anticipated challenges, but also demonstrates APR can still effectively improve and fix learners' programs, enabling automated generation of hints and feedback.
- [5] arXiv:2504.12461 [pdf, html, other]
-
Title: Rethinking Trust in AI Assistants for Software Development: A Critical ReviewSebastian Baltes, Timo Speith, Brenda Chiteri, Seyedmoein Mohsenimofidi, Shalini Chakraborty, Daniel BuschekComments: 16 pages, 2 figures, 3 tables, currently under reviewSubjects: Software Engineering (cs.SE)
Trust is a fundamental concept in human decision-making and collaboration that has long been studied in philosophy and psychology. However, software engineering (SE) articles often use the term 'trust' informally - providing an explicit definition or embedding results in established trust models is rare. In SE research on AI assistants, this practice culminates in equating trust with the likelihood of accepting generated content, which does not capture the full complexity of the trust concept. Without a common definition, true secondary research on trust is impossible. The objectives of our research were: (1) to present the psychological and philosophical foundations of human trust, (2) to systematically study how trust is conceptualized in SE and the related disciplines human-computer interaction and information systems, and (3) to discuss limitations of equating trust with content acceptance, outlining how SE research can adopt existing trust models to overcome the widespread informal use of the term 'trust'. We conducted a literature review across disciplines and a critical review of recent SE articles focusing on conceptualizations of trust. We found that trust is rarely defined or conceptualized in SE articles. Related disciplines commonly embed their methodology and results in established trust models, clearly distinguishing, for example, between initial trust and trust formation and discussing whether and when trust can be applied to AI assistants. Our study reveals a significant maturity gap of trust research in SE compared to related disciplines. We provide concrete recommendations on how SE researchers can adopt established trust models and instruments to study trust in AI assistants beyond the acceptance of generated software artifacts.
- [6] arXiv:2504.12517 [pdf, html, other]
-
Title: Code Improvement Practices at MetaAudris Mockus, Peter C Rigby, Rui Abreu, Anatoly Akkerman, Yogesh Bhootada, Payal Bhuptani, Gurnit Ghardhora, Lan Hoang Dao, Chris Hawley, Renzhi He, Sagar Krishnamoorthy, Sergei Krauze, Jianmin Li, Anton Lunov, Dragos Martac, Francois Morin, Neil Mitchell, Venus Montes, Maher Saba, Matt Steiner, Andrea Valori, Shanchao Wang, Nachiappan NagappanSubjects: Software Engineering (cs.SE)
The focus on rapid software delivery inevitably results in the accumulation of technical debt, which, in turn, affects quality and slows future development. Yet, companies with a long history of rapid delivery exist. Our primary aim is to discover how such companies manage to keep their codebases maintainable. Method: we investigate Meta's practices by collaborating with engineers on code quality and by analyzing rich source code change history
to reveal a range of practices used for continual improvement of the codebase. In addition, we replicate several aspects of previous industry cases studies investigating the impact of code reengineering. Results: Code improvements at Meta range from completely organic grass-roots done at the initiative of individual engineers, to regularly blocked time and engagement via gamification of Better Engineering (BE) work, to major explicit initiatives aimed at reengineering the complex parts of the codebase or deleting accumulations of dead code. Over 14% of changes are explicitly devoted to code improvement and the developers are given ``badges'' to acknowledge the type of work and the amount of effort. Our investigation to prioritize which parts of the codebase to improve lead to the development of metrics to guide this decision making. Our analysis of the impact of reengineering activities revealed substantial improvements in quality and speed as well as a reduction in code complexity. Overall, such continual improvement is an effective way to develop software with rapid releases, while maintaining high quality. - [7] arXiv:2504.12608 [pdf, html, other]
-
Title: Code Copycat Conundrum: Demystifying Repetition in LLM-based Code GenerationMingwei Liu, Juntao Li, Ying Wang, Xueying Du, Zuoyu Ou, Qiuyuan Chen, Bingxu An, Zhao Wei, Yong Xu, Fangming Zou, Xin Peng, Yiling LouSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Despite recent advances in Large Language Models (LLMs) for code generation, the quality of LLM-generated code still faces significant challenges. One significant issue is code repetition, which refers to the model's tendency to generate structurally redundant code, resulting in inefficiencies and reduced readability. To address this, we conduct the first empirical study to investigate the prevalence and nature of repetition across 19 state-of-the-art code LLMs using three widely-used benchmarks. Our study includes both quantitative and qualitative analyses, revealing that repetition is pervasive and manifests at various granularities and extents, including character, statement, and block levels. We further summarize a taxonomy of 20 repetition patterns. Building on our findings, we propose DeRep, a rule-based technique designed to detect and mitigate repetition in generated code. We evaluate DeRep using both open-source benchmarks and in an industrial setting. Our results demonstrate that DeRep significantly outperforms baselines in reducing repetition (with an average improvements of 91.3%, 93.5%, and 79.9% in rep-3, rep-line, and sim-line metrics) and enhancing code quality (with a Pass@1 increase of 208.3% over greedy search). Furthermore, integrating DeRep improves the performance of existing repetition mitigation methods, with Pass@1 improvements ranging from 53.7% to 215.7%.
- [8] arXiv:2504.12646 [pdf, html, other]
-
Title: Replication Packages in Software Engineering Secondary Studies: A Systematic MappingSubjects: Software Engineering (cs.SE)
Context: Systematic reviews (SRs) summarize state-of-the-art evidence in science, including software engineering (SE). Objective: Our objective is to evaluate how SRs report replication packages and to provide a comprehensive list of these packages. Method: We examined 528 secondary studies published between 2013 and 2023 to analyze the availability and reporting of replication packages. Results: Our findings indicate that only 25.4% of the reviewed studies include replication packages. Encouragingly, the situation is gradually improving, as our regression analysis shows significant increase in the availability of replication packages over time. However, in 2023, just 50.6% of secondary studies provided a replication package while an even lower percentage, 29.1% had used a permanent repository with a digital object identifier (DOI) for storage. Conclusion: To enhance transparency and reproducibility in SE research, we advocate for the mandatory publication of replication packages in secondary studies.
- [9] arXiv:2504.12690 [pdf, html, other]
-
Title: Accessibility Recommendations for Designing Better Mobile Application User Interfaces for SeniorsComments: Submitted to ACM Transactions on Software Engineering and Methodology (ToSEM)Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Seniors represent a growing user base for mobile applications; however, many apps fail to adequately address their accessibility challenges and usability preferences. To investigate this issue, we conducted an exploratory focus group study with 16 senior participants, from which we derived an initial set of user personas highlighting key accessibility and personalisation barriers. These personas informed the development of a model-driven engineering toolset, which was used to generate adaptive mobile app prototypes tailored to seniors' needs. We then conducted a second focus group study with 22 seniors to evaluate these prototypes and validate our findings. Based on insights from both studies, we developed a refined set of personas and a series of accessibility and personalisation recommendations grounded in empirical data, prior research, accessibility standards, and developer resources, aimed at supporting software practitioners in designing more inclusive mobile applications.
- [10] arXiv:2504.12790 [pdf, html, other]
-
Title: Empirically Evaluating the Use of Bytecode for Diversity-Based Test Case PrioritisationComments: 10 pages, 4 figures, 6 tables, EASE 2025 conferenceSubjects: Software Engineering (cs.SE)
Regression testing assures software correctness after changes but is resource-intensive. Test Case Prioritisation (TCP) mitigates this by ordering tests to maximise early fault detection. Diversity-based TCP prioritises dissimilar tests, assuming they exercise different system parts and uncover more faults. Traditional static diversity-based TCP approaches (i.e., methods that utilise the dissimilarity of tests), like the state-of-the-art FAST approach, rely on textual diversity from test source code, which is effective but inefficient due to its relative verbosity and redundancies affecting similarity calculations. This paper is the first to study bytecode as the basis of diversity in TCP, leveraging its compactness for improved efficiency and accuracy. An empirical study on seven Defects4J projects shows that bytecode diversity improves fault detection by 2.3-7.8% over text-based TCP. It is also 2-3 orders of magnitude faster in one TCP approach and 2.5-6 times faster in FAST-based TCP. Filtering specific bytecode instructions improves efficiency up to fourfold while maintaining effectiveness, making bytecode diversity a superior static approach.
- [11] arXiv:2504.12977 [pdf, html, other]
-
Title: A Phenomenological Approach to Analyzing User Queries in IT Systems Using Heidegger's Fundamental OntologyComments: 12 pages, no figuresSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
This paper presents a novel research analytical IT system grounded in Martin Heidegger's Fundamental Ontology, distinguishing between beings (das Seiende) and Being (das Sein). The system employs two modally distinct, descriptively complete languages: a categorical language of beings for processing user inputs and an existential language of Being for internal analysis. These languages are bridged via a phenomenological reduction module, enabling the system to analyze user queries (including questions, answers, and dialogues among IT specialists), identify recursive and self-referential structures, and provide actionable insights in categorical terms. Unlike contemporary systems limited to categorical analysis, this approach leverages Heidegger's phenomenological existential analysis to uncover deeper ontological patterns in query processing, aiding in resolving logical traps in complex interactions, such as metaphor usage in IT contexts. The path to full realization involves formalizing the language of Being by a research team based on Heidegger's Fundamental Ontology; given the existing completeness of the language of beings, this reduces the system's computability to completeness, paving the way for a universal query analysis tool. The paper presents the system's architecture, operational principles, technical implementation, use cases--including a case based on real IT specialist dialogues--comparative evaluation with existing tools, and its advantages and limitations.
- [12] arXiv:2504.12998 [pdf, html, other]
-
Title: Automated Generation of Commit Messages in Software RepositoriesSubjects: Software Engineering (cs.SE)
Commit messages are crucial for documenting software changes, aiding in program comprehension and maintenance. However, creating effective commit messages is often overlooked by developers due to time constraints and varying levels of documentation skills. Our research presents an automated approach to generate commit messages using Machine Learning (ML) and Natural Language Processing (NLP) by developing models that use techniques such as Logistic Regression with TF-IDF and Word2Vec, as well as more sophisticated methods like LSTM. We used the dataset of code changes and corresponding commit messages that was used by Liu et al., which we used to train and evaluate ML/NLP models and was chosen because it is extensively used in previous research, also for comparability in our study. The objective was to explore which ML/NLP techniques generate the most effective, clear, and concise commit messages that accurately reflect the code changes. We split the dataset into training, validation, and testing sets and used these sets to evaluate the performance of each model using qualitative and quantitative evaluation methods. Our results reveal a spectrum of effectiveness among these models, with the highest BLEU score achieved being 16.82, showcasing the models' capability in automating a clear and concise commit message generation. Our paper offers insights into the comparative effectiveness of different machine learning models for automating commit message generation in software development, aiming to enhance the overall practice of code documentation. The source code is available at this https URL.
- [13] arXiv:2504.13002 [pdf, html, other]
-
Title: The Role of Empathy in Software Engineering -- A Socio-Technical Grounded TheoryComments: ACM Transactions on Software Engineering and Methodology (TOSEM)(2025)Subjects: Software Engineering (cs.SE)
Empathy, defined as the ability to understand and share others' perspectives and emotions, is essential in software engineering (SE), where developers often collaborate with diverse stakeholders. It is also considered as a vital competency in many professional fields such as medicine, healthcare, nursing, animal science, education, marketing, and project management. Despite its importance, empathy remains under-researched in SE. To further explore this, we conducted a socio-technical grounded theory (STGT) study through in-depth semi-structured interviews with 22 software developers and stakeholders. Our study explored the role of empathy in SE and how SE activities and processes can be improved by considering empathy. Through applying the systematic steps of STGT data analysis and theory development, we developed a theory that explains the role of empathy in SE. Our theory details the contexts in which empathy arises, the conditions that shape it, the causes and consequences of its presence and absence. We also identified contingencies for enhancing empathy or overcoming barriers to its expression. Our findings provide practical implications for SE practitioners and researchers, offering a deeper understanding of how to effectively integrate empathy into SE processes.
- [14] arXiv:2504.13069 [pdf, html, other]
-
Title: Early Accessibility: Automating Alt-Text Generation for UI Icons During App DevelopmentSubjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Alt-text is essential for mobile app accessibility, yet UI icons often lack meaningful descriptions, limiting accessibility for screen reader users. Existing approaches either require extensive labeled datasets, struggle with partial UI contexts, or operate post-development, increasing technical debt. We first conduct a formative study to determine when and how developers prefer to generate icon alt-text. We then explore the ALTICON approach for generating alt-text for UI icons during development using two fine-tuned models: a text-only large language model that processes extracted UI metadata and a multi-modal model that jointly analyzes icon images and textual context. To improve accuracy, the method extracts relevant UI information from the DOM tree, retrieves in-icon text via OCR, and applies structured prompts for alt-text generation. Our empirical evaluation with the most closely related deep-learning and vision-language models shows that ALTICON generates alt-text that is of higher quality while not requiring a full-screen input.
New submissions (showing 14 of 14 entries)
- [15] arXiv:2504.12587 (cross-list from cs.LG) [pdf, html, other]
-
Title: Software Engineering Principles for Fairer Systems: Experiments with GroupCARTSubjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Discrimination-aware classification aims to make accurate predictions while satisfying fairness constraints. Traditional decision tree learners typically optimize for information gain in the target attribute alone, which can result in models that unfairly discriminate against protected social groups (e.g., gender, ethnicity). Motivated by these shortcomings, we propose GroupCART, a tree-based ensemble optimizer that avoids bias during model construction by optimizing not only for decreased entropy in the target attribute but also for increased entropy in protected attributes. Our experiments show that GroupCART achieves fairer models without data transformation and with minimal performance degradation. Furthermore, the method supports customizable weighting, offering a smooth and flexible trade-off between predictive performance and fairness based on user requirements. These results demonstrate that algorithmic bias in decision tree models can be mitigated through multi-task, fairness-aware learning. All code and datasets used in this study are available at: this https URL.
- [16] arXiv:2504.12729 (cross-list from quant-ph) [pdf, other]
-
Title: Dead Gate EliminationComments: Accepted by 25th International Conference on Computational Science, 2025Subjects: Quantum Physics (quant-ph); Programming Languages (cs.PL); Software Engineering (cs.SE)
Hybrid quantum algorithms combine the strengths of quantum and classical computing. Many quantum algorithms, such as the variational quantum eigensolver (VQE), leverage this synergy. However, quantum circuits are executed in full, even when only subsets of measurement outcomes contribute to subsequent classical computations. In this manuscript, we propose a novel circuit optimization technique that identifies and removes dead gates. We prove that the removal of dead gates has no influence on the probability distribution of the measurement outcomes that contribute to the subsequent calculation result. We implemented and evaluated our optimization on a VQE instance, a quantum phase estimation (QPE) instance, and hybrid programs embedded with random circuits of varying circuit width, confirming its capability to remove a non-trivial number of dead gates in real-world algorithms. The effect of our optimization scales up as more measurement outcomes are identified as non-contributory, resulting in a proportionally greater reduction of dead gates.
- [17] arXiv:2504.12813 (cross-list from cs.RO) [pdf, other]
-
Title: Approaching Current Challenges in Developing a Software Stack for Fully Autonomous DrivingComments: Accepted at IEEE IV 2025Subjects: Robotics (cs.RO); Software Engineering (cs.SE)
Autonomous driving is a complex undertaking. A common approach is to break down the driving task into individual subtasks through modularization. These sub-modules are usually developed and published separately. However, if these individually developed algorithms have to be combined again to form a full-stack autonomous driving software, this poses particular challenges. Drawing upon our practical experience in developing the software of TUM Autonomous Motorsport, we have identified and derived these challenges in developing an autonomous driving software stack within a scientific environment. We do not focus on the specific challenges of individual algorithms but on the general difficulties that arise when deploying research algorithms on real-world test vehicles. To overcome these challenges, we introduce strategies that have been effective in our development approach. We additionally provide open-source implementations that enable these concepts on GitHub. As a result, this paper's contributions will simplify future full-stack autonomous driving projects, which are essential for a thorough evaluation of the individual algorithms.
- [18] arXiv:2504.13141 (cross-list from cs.DC) [pdf, html, other]
-
Title: Complexity at Scale: A Quantitative Analysis of an Alibaba Microservice DeploymentComments: 19 pages, 24 figures, 3 tablesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
Microservice architectures are increasingly prevalent in organisations providing online applications. Recent studies have begun to explore the characteristics of real-world large-scale microservice deployments; however, their operational complexities, and the degree to which this complexities are consistent across different deployments, remains under-explored. In this paper, we analyse a microservice dataset released by Alibaba along three dimensions of complexity: scale, heterogeneity, and dynamicity. We find that large-scale deployments can consist of tens of thousands of microservices, that support an even broader array of front-end functionality. Moreover, our analysis shows wide-spread long-tailed distributions of characteristics between microservices, such as share of workload and dependencies, highlighting inequality across the deployment. This diversity is also reflected in call graphs, where we find that whilst front-end services produce dominant call graphs, rarer non-dominant call graphs are prevalent and could involve dissimilar microservice calls. We also find that runtime dependencies between microservices deviate from the static view of system dependencies, and that the deployment undergoes daily changes to microservices. We discuss the implications of our findings for state-of-the-art research in microservice management and research testbed realism, and compare our results to previous descriptions of large-scale microservice deployments to begin to build an understanding of their commonalities.
Cross submissions (showing 4 of 4 entries)
- [19] arXiv:2307.16714 (replaced) [pdf, html, other]
-
Title: A Comprehensive Study of Machine Learning Techniques for Log-Based Anomaly DetectionSubjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Growth in system complexity increases the need for automated log analysis techniques, such as Log-based Anomaly Detection (LAD). While deep learning (DL) methods have been widely used for LAD, traditional machine learning (ML) techniques can also perform well depending on the context and dataset. Semi-supervised techniques deserve the same attention as they offer practical advantages over fully supervised methods. Current evaluations mainly focus on detection accuracy, but this alone is insufficient to determine the suitability of a technique for a given LAD task. Other aspects to consider include training and prediction times as well as the sensitivity to hyperparameter tuning, which in practice matters to engineers.
This paper presents a comprehensive empirical study evaluating a wide range of supervised and semi-supervised, traditional and deep ML techniques across four criteria: detection accuracy, time performance, and sensitivity to hyperparameter tuning in both detection accuracy and time performance. The experimental results show that supervised traditional and deep ML techniques fare similarly in terms of their detection accuracy and prediction time on most of the benchmark datasets considered in our study. Moreover, overall, sensitivity analysis to hyperparameter tuning with respect to detection accuracy shows that supervised traditional ML techniques are less sensitive than deep learning techniques. Further, semi-supervised techniques yield significantly worse detection accuracy than supervised techniques. - [20] arXiv:2405.11514 (replaced) [pdf, html, other]
-
Title: Towards Translating Real-World Code with LLMs: A Study of Translating to RustHasan Ferit Eniser, Hanliang Zhang, Cristina David, Meng Wang, Maria Christakis, Brandon Paulsen, Joey Dodds, Daniel KroeningComments: 12 pages, 12 figuresSubjects: Software Engineering (cs.SE)
Large language models (LLMs) show promise in code translation - the task of translating code written in one programming language to another language - due to their ability to write code in most programming languages. However, LLM's effectiveness on translating real-world code remains largely unstudied. In this work, we perform the first substantial study on LLM-based translation to Rust by assessing the ability of five state-of-the-art LLMs, GPT4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check if a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLM's ability to produce an initially successful translation, as well as their capacity to fix a previously generated buggy one. If the original and the translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feedback to the LLM with counterexamples. Our results show that the most successful LLM can translate 47% of our benchmarks, and also provides insights into next steps for improvements.
- [21] arXiv:2408.02825 (replaced) [pdf, html, other]
-
Title: The Impact of Environment Configurations on the Stability of AI-Enabled SystemsComments: Accepted for publication at the International Conference on Evaluation and Assessment in Software Engineering (EASE 2025)Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Nowadays, software systems tend to include Artificial Intelligence (AI) components. Changes in the operational environment have been known to negatively impact the stability of AI-enabled software systems by causing unintended changes in behavior. However, how an environment configuration impacts the behavior of such systems has yet to be explored. Understanding and quantifying the degree of instability caused by different environment settings can help practitioners decide the best environment configuration for the most stable AI systems. To achieve this goal, we performed experiments with eight different combinations of three key environment variables (operating system, Python version, and CPU architecture) on $30$ open-source AI-enabled systems using the Travis CI platform. We determine the existence and the degree of instability introduced by each configuration using three metrics: the output of an AI component of the system (model performance), the time required to build and run the system (processing time), and the cost associated with building and running the system (expense). Our results indicate that changes in environment configurations lead to instability across all three metrics; however, it is observed more frequently with respect to processing time and expense rather than model performance. For example, between Linux and MacOS, instability is observed in 23\%, 96.67\%, and 100\% of the studied projects in model performance, processing time, and expense, respectively. Our findings underscore the importance of identifying the optimal combination of configuration settings to mitigate drops in model performance and reduce the processing time and expense before deploying an AI-enabled system.
- [22] arXiv:2412.10099 (replaced) [pdf, html, other]
-
Title: Unexpected but informative: What fixation-related potentials tell us about the processing of confusing program codeAnnabelle Bergum, Anna-Maria Maurer, Norman Peitek, Regine Bader, Axel Mecklinger, Vera Demberg, Janet Siegmund, Sven ApelSubjects: Software Engineering (cs.SE)
As software pervades more and more areas of our professional and personal lives, there is an ever-increasing need to maintain software, and for programmers to be able to efficiently write and understand program code. In the first study of its kind, we analyze fixation-related potentials (FRPs) to explore the online processing of program code patterns that are ambiguous to programmers, but not the computer (so-called atoms of confusion), and their underlying neurocognitive mechanisms in an ecologically valid setting. Relative to unambiguous counterparts in program code, atoms of confusion elicit a late frontal positivity with a duration of about 400 to 700 ms after first looking at the atom of confusion. As the frontal positivity shows high resemblance with an event-related potential (ERP) component found during natural language processing that is elicited by unexpected but plausible words in sentence context, we take these data to suggest that the brain engages similar neurocognitive mechanisms in response to unexpected and informative inputs in program code and in natural language. In both domains, these inputs lead to an update of a comprehender's situation model that is essential for information extraction from a quickly unfolding input.
- [23] arXiv:2501.09892 (replaced) [pdf, html, other]
-
Title: Learning from Mistakes: Understanding Ad-hoc Logs through Analyzing Accidental CommitsComments: Accepted at MSR 2025Subjects: Software Engineering (cs.SE)
Developers often insert temporary "print" or "log" instructions into their code to help them better understand runtime behavior, usually when the code is not behaving as they expected. Despite the fact that such monitoring instructions, or "ad-hoc logs," are so commonly used by developers, there is almost no existing literature that studies developers' practices in how they use them. This paucity of knowledge of the use of these ephemeral logs may be largely due to the fact that they typically only exist in the developers' local environments and are removed before they commit their code to their revision control system. In this work, we overcome this challenge by observing that developers occasionally mistakenly forget to remove such instructions before committing, and then they remove them shortly later. Additionally, we further study such developer logging practices by watching and analyzing live-streamed coding videos. Through these empirical approaches, we study where, how, and why developers use ad-hoc logs to better understand their code and its execution. We collect 27 GB of accidental commits that removed 548,880 ad-hoc logs in JavaScript from GitHub Archive repositories to provide the first large-scale dataset and empirical studies on ad-hoc logging practices. Our results reveal several illuminating findings, including a particular propensity for developers to use ad-hoc logs in asynchronous and callback functions. Our findings provide both empirical evidence and a valuable dataset for researchers and tool developers seeking to enhance ad-hoc logging practices, and potentially deepen our understanding of developers' practices towards understanding of software's runtime behaviors.
- [24] arXiv:2503.17192 (replaced) [pdf, other]
-
Title: Employing Continuous Integration inspired workflows for benchmarking of scientific software -- a use case on numerical cut cell quadratureTeoman Toprak, Michael Loibl, Guilherme Teixeira, Irina Shiskina, Chen Miao, Josef Kiendl, Benjamin Marussig, Florian KummerComments: 22 pages, 8 figures, pre-print (not submitted)Subjects: Software Engineering (cs.SE)
Scientific software often offers numerous (open or closed-source) alternatives for a given problem. A user needs to make an informed choice by selecting the best option based on specific metrics. However, setting up benchmarks ad-hoc can become overwhelming as the parameter space expands rapidly. Very often, the design of the benchmark is also not fully set at the start of some project. For instance, adding new libraries, adapting metrics, or introducing new benchmark cases during the project can significantly increase complexity and necessitate laborious re-evaluation of previous results. This paper presents a proven approach that utilizes established Continuous Integration tools and practices to achieve high automation of benchmark execution and reporting. Our use case is the numerical integration (quadrature) on arbitrary domains, which are bounded by implicitly or parametrically defined curves or surfaces in 2D or 3D.
- [25] arXiv:2503.22512 (replaced) [pdf, html, other]
-
Title: Unlocking LLM Repair Capabilities in Low-Resource Programming Languages Through Cross-Language Translation and Multi-Agent RefinementWenqiang Luo, Jacky Wai Keung, Boyang Yang, Jacques Klein, Tegawende F. Bissyande, Haoye Tian, Bach LeSubjects: Software Engineering (cs.SE)
Recent advances in leveraging LLMs for APR have demonstrated impressive capabilities in fixing software defects. However, current LLM-based approaches predominantly focus on mainstream programming languages like Java and Python, neglecting less prevalent but emerging languages such as Rust due to expensive training resources, limited datasets, and insufficient community support. This narrow focus creates a significant gap in repair capabilities across the programming language spectrum, where the full potential of LLMs for comprehensive multilingual program repair remains largely unexplored. To address this limitation, we introduce a novel cross-language program repair approach LANTERN that leverages LLMs' differential proficiency across languages through a multi-agent iterative repair paradigm. Our technique strategically translates defective code from languages where LLMs exhibit weaker repair capabilities to languages where they demonstrate stronger performance, without requiring additional training. A key innovation of our approach is an LLM-based decision-making system that dynamically selects optimal target languages based on bug characteristics and continuously incorporates feedback from previous repair attempts. We evaluate our method on xCodeEval, a comprehensive multilingual benchmark comprising 5,068 bugs across 11 programming languages. Results demonstrate significant enhancement in repair effectiveness, particularly for underrepresented languages, with Rust showing a 22.09% improvement in Pass@10 metrics. Our research provides the first empirical evidence that cross-language translation significantly expands the repair capabilities of LLMs and effectively bridges the performance gap between programming languages with different levels of popularity, opening new avenues for truly language-agnostic automated program repair.
- [26] arXiv:2504.11711 (replaced) [pdf, html, other]
-
Title: The Hitchhiker's Guide to Program Analysis, Part II: Deep Thoughts by LLMsSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Static analysis is a cornerstone for software vulnerability detection, yet it often struggles with the classic precision-scalability trade-off. In practice, such tools often produce high false positive rates, particularly in large codebases like the Linux kernel. This imprecision can arise from simplified vulnerability modeling and over-approximation of path and data constraints. While large language models (LLMs) show promise in code understanding, their naive application to program analysis yields unreliable results due to inherent reasoning limitations. We introduce BugLens, a post-refinement framework that significantly improves static analysis precision. BugLens guides an LLM to follow traditional analysis steps by assessing buggy code patterns for security impact and validating the constraints associated with static warnings. Evaluated on real-world Linux kernel bugs, BugLens raises precision from 0.10 (raw) and 0.50 (semi-automated refinement) to 0.72, substantially reducing false positives and revealing four previously unreported vulnerabilities. Our results suggest that a structured LLM-based workflow can meaningfully enhance the effectiveness of static analysis tools.
- [27] arXiv:2404.04991 (replaced) [pdf, html, other]
-
Title: An Analysis of Malicious Packages in Open-Source Software in the WildJournal-ref: the 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks(DSN), 2025Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
The open-source software (OSS) ecosystem suffers from security threats caused by this http URL, OSS malware research has three limitations: a lack of high-quality datasets, a lack of malware diversity, and a lack of attack campaign contexts. In this paper, we first build the largest dataset of 24,356 malicious packages from online sources, then propose a knowledge graph to represent the OSS malware corpus and conduct malware analysis in the this http URL main findings include (1) it is essential to collect malicious packages from various online sources because their data overlapping degrees are small;(2) despite the sheer volume of malicious packages, many reuse similar code, leading to a low diversity of malware;(3) only 28 malicious packages were repeatedly hidden via dependency libraries of 1,354 malicious packages, and dependency-hidden malware has a shorter active time;(4) security reports are the only reliable source for disclosing the malware-based context. Index Terms: Malicious Packages, Software Analysis
- [28] arXiv:2504.08876 (replaced) [pdf, html, other]
-
Title: Is Productivity in Quantum Programming Equivalent to Expressiveness?Comments: 11 pages, 6 figuresSubjects: Quantum Physics (quant-ph); Programming Languages (cs.PL); Software Engineering (cs.SE)
The expressiveness of quantum programming languages plays a crucial role in the efficient and comprehensible representation of quantum algorithms. Unlike classical programming languages, which offer mature and well-defined abstraction mechanisms, quantum languages must integrate cognitively challenging concepts such as superposition, interference and entanglement while maintaining clarity and usability. However, identifying and characterizing differences in expressiveness between quantum programming paradigms remains an open area of study. Our work investigates the landscape of expressiveness through a comparative analysis of hosted quantum programming languages such as Qiskit, Cirq, Qrisp, and quAPL, and standalone languages including Q# and Qmod. We focused on evaluating how different quantum programming languages support the implementation of core quantum algorithms -- Deutsch-Jozsa, Simon, Bernstein-Vazirani, and Grover -- using expressiveness metrics: Lines of Code (LOC), Cyclomatic Complexity (CC), and Halstead Complexity (HC) metrics as proxies for developer productivity. Our findings suggest that different quantum programming paradigms offer distinct trade-offs between expressiveness and productivity, highlighting the importance of language design in quantum software development.
- [29] arXiv:2504.09115 (replaced) [pdf, other]
-
Title: CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality ShiftJiongchi Yu, Xiaofei Xie, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, Frank LiauwComments: Accepted by FSE 2025Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
With the rapid advancement of cloud-native computing, securing cloud environments has become an important task. Log-based Anomaly Detection (LAD) is the most representative technique used in different systems for attack detection and safety guarantee, where multiple LAD methods and relevant datasets have been proposed. However, even though some of these datasets are specifically prepared for cloud systems, they only cover limited cloud behaviors and lack information from a whole-system perspective. Besides, another critical issue to consider is normality shift, which implies the test distribution could differ from the training distribution and highly affects the performance of LAD. Unfortunately, existing works only focus on simple shift types such as chronological changes, while other important and cloud-specific shift types are ignored, e.g., the distribution shift introduced by different deployed cloud architectures. Therefore, creating a new dataset that covers diverse behaviors of cloud systems and normality shift types is necessary.
To fill the gap in evaluating LAD under real-world conditions, we present CAShift, the first normality shift-aware dataset for cloud systems. CAShift captures three shift types, including application, version, and cloud architecture shifts, and includes 20 diverse attack scenarios across various cloud components. Using CAShift, we conduct an empirical study showing that (1) all LAD methods are significantly affected by normality shifts, with performance drops of up to 34%, and (2) continuous learning techniques can improve F1-scores by up to 27%, depending on data usage and algorithm choice. Based on our findings, we offer valuable implications for future research in designing more robust LAD models and methods for LAD shift adaptation.