Computer Science
See recent articles
- [1] arXiv:2409.12196 [pdf, other]
-
Title: Using Agile Story Points and Game Theory Together: Better Software Planning and Development in Agile Software DevelopmentComments: 10th International Management Information Systems Conference Sustainability in the Digital EraSubjects: Software Engineering (cs.SE)
In the realm of Agile software development, precise user story point estimation is crucial for effectual project timeline and resource management. Despite its significance, the method is often marred by issues stemming from cognitive biases, disparities in individual judgment, and hurdles related to both collaboration and competition. In addressing these challenges, this study employs a comprehensive literature review, integrating key concepts from Agile software development, Story Point estimation, and Game Theory. Through rigorous examination of existing literature and relevant case studies, we identified pervasive issues in Agile and Story Point estimation. In response, we proposed the application of game theoretic strategies, notably the Vickrey Auction and Stag Hunt Game, aiming to refine these estimations. The resultant methodology not only promotes the use of game-theory inspired mechanisms but also accentuates their potential to enhance software development planning, team cohesion, and conflict resolution. Preliminary results from our research underscore the transformative potential of these games when incorporated into Agile methodologies, especially during planning and retrospective phases. The overarching goal is to achieve improved accuracy in planning, foster team collaboration, and a discernible uplift in software product quality.
- [2] arXiv:2409.12197 [pdf, html, other]
-
Title: Nteasee: A mixed methods study of expert and general population perspectives on deploying AI for health in African countriesMercy Nyamewaa Asiedu, Iskandar Haykel, Awa Dieng, Kerrie Kauer, Tousif Ahmed, Florence Ofori, Charisma Chan, Stephen Pfohl, Negar Rostamzadeh, Katherine HellerComments: Equal contributionsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Artificial Intelligence (AI) for health has the potential to significantly change and improve healthcare. However in most African countries, identifying culturally and contextually attuned approaches for deploying these solutions is not well understood. To bridge this gap, we conduct a qualitative study to investigate the best practices, fairness indicators, and potential biases to mitigate when deploying AI for health in African countries, as well as explore opportunities where artificial intelligence could make a positive impact in health. We used a mixed methods approach combining in-depth interviews (IDIs) and surveys. We conduct 1.5-2 hour long IDIs with 50 experts in health, policy, and AI across 17 countries, and through an inductive approach we conduct a qualitative thematic analysis on expert IDI responses. We administer a blinded 30-minute survey with case studies to 672 general population participants across 5 countries in Africa and analyze responses on quantitative scales, statistically comparing responses by country, age, gender, and level of familiarity with AI. We thematically summarize open-ended responses from surveys. Our results find generally positive attitudes, high levels of trust, accompanied by moderate levels of concern among general population participants for AI usage for health in Africa. This contrasts with expert responses, where major themes revolved around trust/mistrust, ethical concerns, and systemic barriers to integration, among others. This work presents the first-of-its-kind qualitative research study of the potential of AI for health in Africa from an algorithmic fairness angle, with perspectives from both experts and the general population. We hope that this work guides policymakers and drives home the need for further research and the inclusion of general population perspectives in decision-making around AI usage.
- [3] arXiv:2409.12199 [pdf, html, other]
-
Title: Fault Tolerant Metric Dimensions of Leafless Cacti Graphs with Application in Supply Chain ManagementSubjects: Discrete Mathematics (cs.DM); Combinatorics (math.CO)
A resolving set for a simple graph $G$ is a subset of vertex set of $G$ such that it distinguishes all vertices of $G$ using the shortest distance from this subset. This subset is a metric basis if it is the smallest set with this property. A resolving set is a fault tolerant resolving set if the removal of any vertex from the subset still leaves it a resolving set. The smallest set satisfying this property is the fault tolerant metric basis, and the cardinality of this set is termed as fault tolerant metric dimension of $G$, denoted by $\beta'(G)$. In this article, we determine the fault tolerant metric dimension of bicyclic graphs of type-I and II and show that it is always $4$ for both types of graphs. We then use these results to form our basis to consider leafless cacti graphs, and calculate their fault tolerant metric dimensions in terms of \textit{inner cycles} and \textit{outer cycles}. We then consider a detailed real world example of supply and distribution center management, and discuss the application of fault tolerant metric dimension in such a scenario. We also briefly discuss some other scenarios where leafless cacti graphs can be used to model real world problems.
- [4] arXiv:2409.12202 [pdf, html, other]
-
Title: ScaleFlow++: Robust and Accurate Estimation of 3D Motion from VideoComments: arXiv admin note: substantial text overlap with arXiv:2407.09797Subjects: Computer Vision and Pattern Recognition (cs.CV)
Perceiving and understanding 3D motion is a core technology in fields such as autonomous driving, robots, and motion prediction. This paper proposes a 3D motion perception method called ScaleFlow++ that is easy to generalize. With just a pair of RGB images, ScaleFlow++ can robustly estimate optical flow and motion-in-depth (MID). Most existing methods directly regress MID from two RGB frames or optical flow, resulting in inaccurate and unstable results. Our key insight is cross-scale matching, which extracts deep motion clues by matching objects in pairs of images at different scales. Unlike previous methods, ScaleFlow++ integrates optical flow and MID estimation into a unified architecture, estimating optical flow and MID end-to-end based on feature matching. Moreover, we also proposed modules such as global initialization network, global iterative optimizer, and hybrid training pipeline to integrate global motion information, reduce the number of iterations, and prevent overfitting during training. On KITTI, ScaleFlow++ achieved the best monocular scene flow estimation performance, reducing SF-all from 6.21 to 5.79. The evaluation of MID even surpasses RGBD-based methods. In addition, ScaleFlow++ has achieved stunning zero-shot generalization performance in both rigid and nonrigid scenes. Code is available at \url{this https URL}.
- [5] arXiv:2409.12203 [pdf, html, other]
-
Title: A Simple Model to Estimate Sharing Effects in Social NetworksComments: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24Subjects: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Applications (stat.AP)
Randomised Controlled Trials (RCTs) are the gold standard for estimating treatment effects across many fields of science. Technology companies have adopted A/B-testing methods as a modern RCT counterpart, where end-users are randomly assigned various system variants and user behaviour is tracked continuously. The objective is then to estimate the causal effect that the treatment variant would have on certain metrics of interest to the business.
When the outcomes for randomisation units -- end-users in this case -- are not statistically independent, this obfuscates identifiability of treatment effects, and harms decision-makers' observability of the system. Social networks exemplify this, as they are designed to promote inter-user interactions. This interference by design notoriously complicates measurement of, e.g., the effects of sharing. In this work, we propose a simple Markov Decision Process (MDP)-based model describing user sharing behaviour in social networks. We derive an unbiased estimator for treatment effects under this model, and demonstrate through reproducible synthetic experiments that it outperforms existing methods by a significant margin. - [6] arXiv:2409.12207 [pdf, html, other]
-
Title: Architectural Co-LOD GenerationComments: ACM Transactions on Graphics (SIGGRAPH Aisa 2024); Project page: https://vcc.tech/research/2024/CoLODSubjects: Graphics (cs.GR)
Managing the level-of-detail (LOD) in architectural models is crucial yet challenging, particularly for effective representation and visualization of buildings. Traditional approaches often fail to deliver controllable detail alongside semantic consistency, especially when dealing with noisy and inconsistent inputs. We address these limitations with \emph{Co-LOD}, a new approach specifically designed for effective LOD management in architectural modeling. Co-LOD employs shape co-analysis to standardize geometric structures across multiple buildings, facilitating the progressive and consistent generation of LODs. This method allows for precise detailing in both individual models and model collections, ensuring semantic integrity. Extensive experiments demonstrate that Co-LOD effectively applies accurate LOD across a variety of architectural inputs, consistently delivering superior detail and quality in LOD representations.
- [7] arXiv:2409.12210 [pdf, html, other]
-
Title: Mixture of Diverse Size ExpertsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The Sparsely-Activated Mixture-of-Experts (MoE) has gained increasing popularity for scaling up large language models (LLMs) without exploding computational costs. Despite its success, the current design faces a challenge where all experts have the same size, limiting the ability of tokens to choose the experts with the most appropriate size for generating the next token. In this paper, we propose the Mixture of Diverse Size Experts (MoDSE), a new MoE architecture with layers designed to have experts of different sizes. Our analysis of difficult token generation tasks shows that experts of various sizes achieve better predictions, and the routing path of the experts tends to be stable after a training period. However, having experts of diverse sizes can lead to uneven workload distribution. To tackle this limitation, we introduce an expert-pair allocation strategy to evenly distribute the workload across multiple GPUs. Comprehensive evaluations across multiple benchmarks demonstrate the effectiveness of MoDSE, as it outperforms existing MoEs by allocating the parameter budget to experts adaptively while maintaining the same total parameter size and the number of experts.
- [8] arXiv:2409.12213 [pdf, html, other]
-
Title: SemAI: Semantic Artificial Intelligence-enhanced DNA storage for Internet-of-ThingsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
In the wake of the swift evolution of technologies such as the Internet of Things (IoT), the global data landscape undergoes an exponential surge, propelling DNA storage into the spotlight as a prospective medium for contemporary cloud storage applications. This paper introduces a Semantic Artificial Intelligence-enhanced DNA storage (SemAI-DNA) paradigm, distinguishing itself from prevalent deep learning-based methodologies through two key modifications: 1) embedding a semantic extraction module at the encoding terminus, facilitating the meticulous encoding and storage of nuanced semantic information; 2) conceiving a forethoughtful multi-reads filtering model at the decoding terminus, leveraging the inherent multi-copy propensity of DNA molecules to bolster system fault tolerance, coupled with a strategically optimized decoder's architectural framework. Numerical results demonstrate the SemAI-DNA's efficacy, attaining 2.61 dB Peak Signal-to-Noise Ratio (PSNR) gain and 0.13 improvement in Structural Similarity Index (SSIM) over conventional deep learning-based approaches.
- [9] arXiv:2409.12217 [pdf, html, other]
-
Title: Effects of Common Regularization Techniques on Open-Set RecognitionSubjects: Machine Learning (cs.LG)
In recent years there has been increasing interest in the field of Open-Set Recognition, which allows a classification model to identify inputs as "unknown" when it encounters an object or class not in the training set. This ability to flag unknown inputs is of vital importance to many real world classification applications. As almost all modern training methods for neural networks use extensive amounts of regularization for generalization, it is therefore important to examine how regularization techniques impact the ability of a model to perform Open-Set Recognition. In this work, we examine the relationship between common regularization techniques and Open-Set Recognition performance. Our experiments are agnostic to the specific open-set detection algorithm and examine the effects across a wide range of datasets. We show empirically that regularization methods can provide significant improvements to Open-Set Recognition performance, and we provide new insights into the relationship between accuracy and Open-Set performance.
- [10] arXiv:2409.12218 [pdf, html, other]
-
Title: ARTICLE: Annotator Reliability Through In-Context LearningSujan Dutta, Deepak Pandita, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher M. Homan, Ashiqur R. KhudaBukhshSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Ensuring annotator quality in training and evaluation data is a key piece of machine learning in NLP. Tasks such as sentiment analysis and offensive speech detection are intrinsically subjective, creating a challenging scenario for traditional quality assessment approaches because it is hard to distinguish disagreement due to poor work from that due to differences of opinions between sincere annotators. With the goal of increasing diverse perspectives in annotation while ensuring consistency, we propose \texttt{ARTICLE}, an in-context learning (ICL) framework to estimate annotation quality through self-consistency. We evaluate this framework on two offensive speech datasets using multiple LLMs and compare its performance with traditional methods. Our findings indicate that \texttt{ARTICLE} can be used as a robust method for identifying reliable annotators, hence improving data quality.
- [11] arXiv:2409.12244 [pdf, other]
-
Title: Sparks of Artificial General Intelligence(AGI) in Semiconductor Material Science: Early Explorations into the Next Frontier of Generative AI-Assisted Electron Micrograph AnalysisSakhinana Sagar Srinivas, Geethan Sannidhi, Sreeja Gangasani, Chidaksh Ravuru, Venkataramana RunkanaComments: Published at Deployable AI (DAI) Workshop at AAAI-2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Characterizing materials with electron micrographs poses significant challenges for automated labeling due to the complex nature of nanomaterial structures. To address this, we introduce a fully automated, end-to-end pipeline that leverages recent advances in Generative AI. It is designed for analyzing and understanding the microstructures of semiconductor materials with effectiveness comparable to that of human experts, contributing to the pursuit of Artificial General Intelligence (AGI) in nanomaterial identification. Our approach utilizes Large MultiModal Models (LMMs) such as GPT-4V, alongside text-to-image models like DALLE-3. We integrate a GPT-4 guided Visual Question Answering (VQA) method to analyze nanomaterial images, generate synthetic nanomaterial images via DALLE-3, and employ in-context learning with few-shot prompting in GPT-4V for accurate nanomaterial identification. Our method surpasses traditional techniques by enhancing the precision of nanomaterial identification and optimizing the process for high-throughput screening.
- [12] arXiv:2409.12245 [pdf, html, other]
-
Title: Computing-specific pedagogies and theoretical models: common uses and relationshipsSubjects: Computers and Society (cs.CY)
Computing education widely applies general learning theories and pedagogical practices. However, computing also includes specific disciplinary knowledge and skills, e.g., programming and software development methods, for which there has been a long history of development and application of specific pedagogical practices. In recent years, there has also been substantial interest in developing computing-specific theoretical models, which seek to describe and explain the complex interactions within teaching and learning computing in various contexts. In this paper, we explore connections between computing-specific pedagogies and theoretical models as reported in the literature. Our goal is to enrich computing education research and practice by illustrating how explicit use of field-specific theories and pedagogies can further the whole field. We have collected a list of computing-specific pedagogical practices and theoretical models from a literature search, identifying source papers where they have been first introduced or well described. We then searched for papers in the ACM digital library that cite source papers from each list, and analyzed the type of interaction between the model and pedagogy in each paper. We developed a categorization of how theoretical models and pedagogies have supported or discounted each other, have been used together in empirical studies or used to build new artefacts. Our results showed that pair programming and parsons problems have had the most interactions with theoretical models in the explored papers, and we present findings of the analysis of these interactions.
- [13] arXiv:2409.12246 [pdf, html, other]
-
Title: An Outline for a Jupyter-Materials-Based Repository Website Focused on the Computational SciencesComments: 14 pages, 8 figuresSubjects: Computers and Society (cs.CY); Digital Libraries (cs.DL)
As access to the internet has become increasingly ubiquitous, along with the reliability and speed of internet providers, so too has the implementation of internet-based learning tools. These tools provide students opportunities to do meaningful work away from university, however, often at a financial cost to universities and students. Moreover, limited and high-cost internet access in less-developed countries and remote areas acts as a barrier to implementing these tools in a meaningful way, leading to inequalities in both the quality of education and the opportunities provided. This paper outlines the development process, and benefits, of a low-cost and light-weight repository website centered around disseminating open-source textbooks and other supplemental learning materials for computational sciences using Jupyter Notebooks. The website focuses on allowing students to download their textbooks and other materials from a centralized location, to be used offline or with limited internet access. Internet access is not the only constraining factor; access to reasonably priced personal computers also limits the effectiveness of internet-based learning tools. As such, this paper will also explore the feasibility of integrating low-cost Raspberry Pi kits into this development process as a way of increasing the reach of an online repository of open-source Jupyter Notebook textbooks. While this paper focuses on Canadian universities and remote communities, many of the website's proposed applications are relevant worldwide.
- [14] arXiv:2409.12249 [pdf, html, other]
-
Title: GCA-SUN: A Gated Context-Aware Swin-UNet for Exemplar-Free CountingSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Exemplar-Free Counting aims to count objects of interest without intensive annotations of objects or exemplars. To achieve this, we propose Gated Context-Aware Swin-UNet (GCA-SUN) to directly map an input image to the density map of countable objects. Specifically, a Gated Context-Aware Modulation module is designed in the encoder to suppress irrelevant objects or background through a gate mechanism and exploit the attentive support of objects of interest through a self-similarity matrix. The gate strategy is also incorporated into the bottleneck network and the decoder to highlight the features most relevant to objects of interest. By explicitly exploiting the attentive support among countable objects and eliminating irrelevant features through the gate mechanisms, the proposed GCA-SUN focuses on and counts objects of interest without relying on predefined categories or exemplars. Experimental results on the FSC-147 and CARPK datasets demonstrate that GCA-SUN outperforms state-of-the-art methods.
- [15] arXiv:2409.12252 [pdf, html, other]
-
Title: Optimal Control for Discrete-Time Systems under Bounded DisturbancesSubjects: Systems and Control (eess.SY)
This paper introduces a novel approach to the optimal control of linear discrete-time systems subject to bounded disturbances. Our approach is based on the newly established duality between ellipsoidal approximations of reachable and hardly observable sets. We provide exact solutions for state-feedback control and filtering problems, aligning with existing methods while offering improved computational efficiency. Moreover, our main contribution is the optimal solution to the output-feedback control problem for discrete-time systems which was not known before. Numerical simulations demonstrate the superiority of this result over previous sub-optimal ones.
- [16] arXiv:2409.12253 [pdf, html, other]
-
Title: Development of High-Performance DSP Algorithms on the European Rad-Hard NG-ULTRA SoC FPGAVasileios Leon, Anastasios Xynos, Dimitrios Soudris, George Lentaris, Ruben Domingo, Arturo Perez, David Gonzalez-Arjona, Isabelle Conway, David Merodio CodinachsComments: Accepted for publication at the 31st IEEE ICECS Conference, 18-20 Nov, 2024, Nancy, FranceSubjects: Hardware Architecture (cs.AR)
The emergence of demanding space applications has modified the traditional landscape of computing systems in space. When reliability is a first-class concern, in addition to enhanced performance-per-Watt, radiation-hardened FPGAs are favored. In this context, the current paper evaluates the first European radiation-hardened SoC FPGA, i.e., NanoXplore's NG-ULTRA, for accelerating high-performance DSP algorithms from space applications. The proposed development & testing methodologies provide efficient implementations, while they also aim to test the new NG-ULTRA hardware and its associated software tools. The results show that NG-ULTRA achieves competitive resource utilization and performance, constituting it as a very promising device for space missions, especially for Europe.
- [17] arXiv:2409.12255 [pdf, html, other]
-
Title: Efficient Data Subset Selection to Generalize Training Across Models: Transductive and Inductive NetworksComments: Published at NeurIPS 2023Journal-ref: Advances in Neural Information Processing Systems, 36 (2024)Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Existing subset selection methods for efficient learning predominantly employ discrete combinatorial and model-specific approaches which lack generalizability. For an unseen architecture, one cannot use the subset chosen for a different model. To tackle this problem, we propose $\texttt{SubSelNet}$, a trainable subset selection framework, that generalizes across architectures. Here, we first introduce an attention-based neural gadget that leverages the graph structure of architectures and acts as a surrogate to trained deep neural networks for quick model prediction. Then, we use these predictions to build subset samplers. This naturally provides us two variants of $\texttt{SubSelNet}$. The first variant is transductive (called as Transductive-$\texttt{SubSelNet}$) which computes the subset separately for each model by solving a small optimization problem. Such an optimization is still super fast, thanks to the replacement of explicit model training by the model approximator. The second variant is inductive (called as Inductive-$\texttt{SubSelNet}$) which computes the subset using a trained subset selector, without any optimization. Our experiments show that our model outperforms several methods across several real datasets
- [18] arXiv:2409.12256 [pdf, html, other]
-
Title: The Finer Points: A Systematic Comparison of Point-Cloud Extractors for Radar OdometryComments: 8 pages, 2 figures, 1 table. Submitted to IEEE International Conference on Robotics and Automation (ICRA)Subjects: Robotics (cs.RO)
A key element of many odometry pipelines using spinning frequency-modulated continuous-wave (FMCW) radar is the extraction of a point-cloud from the raw signal. This extraction greatly impacts the overall performance of point-cloud-based odometry. This paper provides a first-of-its-kind, comprehensive comparison of 13 common radar point-cloud extractors for the task of iterative closest point based odometry in autonomous driving environments. Each extractor's parameters are tuned and tested on two FMCW radar datasets using approximately 176km of data from public roads. We find that the simplest, and fastest extractor, K-strongest, is the best overall extractor, consistently outperforming the average by 13.59% and 24.94% on each dataset, respectively. Additionally, we highlight the significance of tuning an extractor and the substantial improvement in odometry accuracy that it yields.
- [19] arXiv:2409.12257 [pdf, html, other]
-
Title: MQA-KEAL: Multi-hop Question Answering under Knowledge Editing for Arabic LanguageSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) have demonstrated significant capabilities across numerous application domains. A key challenge is to keep these models updated with latest available information, which limits the true potential of these models for the end-applications. Although, there have been numerous attempts for LLMs Knowledge Editing (KE), i.e., to edit the LLMs prior knowledge and in turn test it via Multi-hop Question Answering (MQA), yet so far these studies are primarily focused on English language. To bridge this gap, in this paper we propose: Multi-hop Questioning Answering under Knowledge Editing for Arabic Language (MQA-KEAL). MQA-KEAL stores knowledge edits as structured knowledge units in the external memory. In order to solve multi-hop question, it first uses task-decomposition to decompose the question into smaller sub-problems. Later for each sub-problem, it iteratively queries the external memory and/or target LLM in order to generate the final response. In addition, we also contribute MQUAKE-AR (Arabic translation of English benchmark MQUAKE), as well as a new benchmark MQA-AEVAL for rigorous performance evaluation of MQA under KE for Arabic language. Experimentation evaluation reveals MQA-KEAL outperforms the baseline models by a significant margin.
- [20] arXiv:2409.12258 [pdf, html, other]
-
Title: MPAI: A Co-Processing Architecture with MPSoC & AI Accelerators for Vision Applications in SpaceComments: Accepted for publication at the 31st IEEE ICECS Conference, 18-20 Nov, 2024, Nancy, FranceSubjects: Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
The emerging need for fast and power-efficient AI/ML deployment on-board spacecraft has forced the space industry to examine specialized accelerators, which have been successfully used in terrestrial applications. Towards this direction, the current work introduces a very heterogeneous co-processing architecture that is built around UltraScale+ MPSoC and its programmable DPU, as well as commercial AI/ML accelerators such as MyriadX VPU and Edge TPU. The proposed architecture, called MPAI, handles networks of different size/complexity and accommodates speed-accuracy-energy trade-offs by exploiting the diversity of accelerators in precision and computational power. This brief provides technical background and reports preliminary experimental results and outcomes.
- [21] arXiv:2409.12259 [pdf, html, other]
-
Title: WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wildComments: Project Page this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
In recent years, 3D hand pose estimation methods have garnered significant attention due to their extensive applications in human-computer interaction, virtual reality, and robotics. In contrast, there has been a notable gap in hand detection pipelines, posing significant challenges in constructing effective real-world multi-hand reconstruction systems. In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization and a high-fidelity transformer-based 3D hand reconstruction model. To tackle the limitations of previous methods and build a robust and stable detection network, we introduce a large-scale dataset with over than 2M in-the-wild hand images with diverse lighting, illumination, and occlusion conditions. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks. Finally, we showcase the effectiveness of our pipeline to achieve smooth 3D hand tracking from monocular videos, without utilizing any temporal components. Code, models, and dataset are available this https URL.
- [22] arXiv:2409.12262 [pdf, html, other]
-
Title: Bootstrapping Object-level Planning with Large Language ModelsComments: 6 pages + 1 page referenceSubjects: Robotics (cs.RO)
We introduce a new method that extracts knowledge from a large language model (LLM) to produce object-level plans, which describe high-level changes to object state, and uses them to bootstrap task and motion planning (TAMP) in a hierarchical manner. Existing works use LLMs to either directly output task plans or to generate goals in representations like PDDL. However, these methods fall short because they either rely on the LLM to do the actual planning or output a hard-to-satisfy goal. Our approach instead extracts knowledge from a LLM in the form of plan schemas as an object level representation called functional object-oriented networks (FOON), from which we automatically generate PDDL subgoals. Our experiments demonstrate how our method's performance markedly exceeds alternative planning strategies across several tasks in simulation.
- [23] arXiv:2409.12263 [pdf, html, other]
-
Title: Detecting LGBTQ+ Instances of CyberbullyingComments: 10 pages, 4 tables, 1 figure, 17th International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and SimulationSubjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Social media continues to have an impact on the trajectory of humanity. However, its introduction has also weaponized keyboards, allowing the abusive language normally reserved for in-person bullying to jump onto the screen, i.e., cyberbullying. Cyberbullying poses a significant threat to adolescents globally, affecting the mental health and well-being of many. A group that is particularly at risk is the LGBTQ+ community, as researchers have uncovered a strong correlation between identifying as LGBTQ+ and suffering from greater online harassment. Therefore, it is critical to develop machine learning models that can accurately discern cyberbullying incidents as they happen to LGBTQ+ members. The aim of this study is to compare the efficacy of several transformer models in identifying cyberbullying targeting LGBTQ+ individuals. We seek to determine the relative merits and demerits of these existing methods in addressing complex and subtle kinds of cyberbullying by assessing their effectiveness with real social media data.
- [24] arXiv:2409.12264 [pdf, html, other]
-
Title: User-friendly Foundation Model Adapters for Multivariate Time Series ClassificationComments: The first two authors contributed equallySubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Foundation models, while highly effective, are often resource-intensive, requiring substantial inference time and memory. This paper addresses the challenge of making these models more accessible with limited computational resources by exploring dimensionality reduction techniques. Our goal is to enable users to run large pre-trained foundation models on standard GPUs without sacrificing performance. We investigate classical methods such as Principal Component Analysis alongside neural network-based adapters, aiming to reduce the dimensionality of multivariate time series data while preserving key features. Our experiments show up to a 10x speedup compared to the baseline model, without performance degradation, and enable up to 4.5x more datasets to fit on a single GPU, paving the way for more user-friendly and scalable foundation models.
- [25] arXiv:2409.12266 [pdf, html, other]
-
Title: C-Uniform Trajectory Sampling For Fast Motion PlanningSubjects: Robotics (cs.RO)
We study the problem of sampling robot trajectories and introduce the notion of C-Uniformity. As opposed to the standard method of uniformly sampling control inputs (which lead to biased samples of the configuration space), C-Uniform trajectories are generated by control actions which lead to uniform sampling of the configuration space. After presenting an intuitive closed-form solution to generate C-Uniform trajectories for the 1D random-walker, we present a network-flow based optimization method to precompute C-Uniform trajectories for general robot systems. We apply the notion of C-Uniformity to the design of Model Predictive Path Integral controllers. Through simulation experiments, we show that using C-Uniform trajectories significantly improves the performance of MPPI-style controllers, achieving up to 40% coverage performance gain compared to the best baseline. We demonstrate the practical applicability of our method with an implementation on a 1/10th scale racer.
- [26] arXiv:2409.12269 [pdf, other]
-
Title: Unlocking the Power of Environment Assumptions for Unit ProofsComments: SEFM 2024Subjects: Software Engineering (cs.SE); Programming Languages (cs.PL)
Clearly articulating the assumptions of the execution environment is crucial for the successful application of code-level formal verification. The process of specifying a model for the environment can be both laborious and error-prone, often requiring domain experts. In contrast, when engineers write unit tests, they frequently employ mocks (tMocks) to define the expected behavior of the environment in which the function under test operates. These tMocks describe how the environment behaves, e.g., the return types of an external API call (stateless behaviour) or the correct sequence of function calls (stateful behaviour). Mocking frameworks have proven to be highly effective tools for crafting unit tests. In our work, we draw inspiration from tMocks and introduce their counterpart in the realm of formal verification, which we term vMocks. vMocks offer an intuitive framework for specifying a plausible environment when conducting code-level formal verification. We implement a vMock library for the verification of C programs called SEAMOCK. We investigate the practicality of vMocks by, first, comparing specifications styles in the communication layer of the Android Trusty Trusted Execution Environment (TEE) open source project, and second, in the verification of mbedTLS, a widely used open source C library that provides secure communication protocols and cryptography primitives for embedded systems. Based on our experience, we conclude that vMocks complement other forms of environment models. We believe that vMocks ease adoption of code-level formal verification among developers already familiar with tMocks.
- [27] arXiv:2409.12272 [pdf, html, other]
-
Title: Mastering Chess with a Transformer ModelSubjects: Machine Learning (cs.LG)
Transformer models have demonstrated impressive capabilities when trained at scale, excelling at difficult cognitive tasks requiring complex reasoning and rational decision-making. In this paper, we explore the application of transformer models to chess, focusing on the critical role of the position encoding within the attention mechanism. We show that in chess, transformers endowed with a sufficiently versatile position encoding can match existing chess-playing models at a fraction of the computational cost. Our architecture significantly outperforms AlphaZero at 8x fewer FLOPS and matches prior grandmaster-level transformer-based agents at 30x fewer FLOPS.
- [28] arXiv:2409.12273 [pdf, other]
-
Title: Improving Soft-Capture Phase Success in Space Debris Removal Missions: Leveraging Deep Reinforcement Learning and Tactile FeedbackComments: This paper is accepted and presented in CASE2024Subjects: Robotics (cs.RO)
Traditional control methods effectively manage robot operations using models like motion equations but face challenges with issues of contact and friction, leading to unstable and imprecise controllers that often require manual tweaking. Reinforcement learning, however, has developed as a capable solution for developing robust robot controllers that excel in handling contact-related challenges. In this work, we introduce a deep reinforcement learning approach to tackle the soft-capture phase for free-floating moving targets, mainly space debris, amidst noisy data. Our findings underscore the crucial role of tactile sensors, even during the soft-capturing phase. By employing deep reinforcement learning, we eliminate the need for manual feature design, simplifying the problem and allowing the robot to learn soft-capture strategies through trial and error. To facilitate effective learning of the approach phase, we have crafted a specialized reward function that offers clear and insightful feedback to the agent. Our method is trained entirely within the simulation environment, eliminating the need for direct demonstrations or prior knowledge of the task. The developed control policy shows promising results, highlighting the necessity of using tactile sensor information. The code and simulation results are available at Soft_Capture_Tactile repo.
- [29] arXiv:2409.12274 [pdf, html, other]
-
Title: Hierarchical LLMs In-the-loop Optimization for Real-time Multi-Robot Target Tracking under Unknown HazardsSubjects: Robotics (cs.RO)
In this paper, we propose a hierarchical Large Language Models (LLMs) in-the-loop optimization framework for real-time multi-robot task allocation and target tracking in an unknown hazardous environment subject to sensing and communication attacks. We formulate multi-robot coordination for tracking tasks as a bi-level optimization problem, with LLMs to reason about potential hazards in the environment and the status of the robot team and modify both the inner and outer levels of the optimization. The inner LLM adjusts parameters to prioritize various objectives, including performance, safety, and energy efficiency, while the outer LLM handles online variable completion for team reconfiguration. This hierarchical approach enables real-time adjustments to the robots' behavior. Additionally, a human supervisor can offer broad guidance and assessments to address unexpected dangers, model mismatches, and performance issues arising from local minima. We validate our proposed framework in both simulation and real-world experiments with comprehensive evaluations, which provide the potential for safe LLM integration for multi-robot problems.
- [30] arXiv:2409.12278 [pdf, html, other]
-
Title: Making Large Language Models into World Models with Precondition and Effect KnowledgeSubjects: Computation and Language (cs.CL)
World models, which encapsulate the dynamics of how actions affect environments, are foundational to the functioning of intelligent agents. In this work, we explore the potential of Large Language Models (LLMs) to operate as world models. Although LLMs are not inherently designed to model real-world dynamics, we show that they can be induced to perform two critical world model functions: determining the applicability of an action based on a given world state, and predicting the resulting world state upon action execution. This is achieved by fine-tuning two separate LLMs-one for precondition prediction and another for effect prediction-while leveraging synthetic data generation techniques. Through human-participant studies, we validate that the precondition and effect knowledge generated by our models aligns with human understanding of world dynamics. We also analyze the extent to which the world model trained on our synthetic data results in an inferred state space that supports the creation of action chains, a necessary property for planning.
- [31] arXiv:2409.12285 [pdf, html, other]
-
Title: On Convergent Dynamic Mode Decomposition and its Equivalence with Occupation Kernel RegressionSubjects: Systems and Control (eess.SY); Functional Analysis (math.FA); Optimization and Control (math.OC)
This paper presents a new technique for norm-convergent dynamic mode decomposition of deterministic systems. The developed method utilizes recent results on singular dynamic mode decomposition where it is shown that by appropriate selection of domain and range Hilbert spaces, the Liouville operator (also known as the Koopman generator) can be made to be compact. In this paper, it is shown that by selecting appropriate collections of finite basis functions in the domain and the range, a novel finite-rank representation of the Liouville operator may be obtained. It is also shown that the model resulting from dynamic mode decomposition of the finite-rank representation is closely related to regularized regression using the so-called occupation kernels as basis functions.
- [32] arXiv:2409.12289 [pdf, html, other]
-
Title: MetaPix: A Data-Centric AI Development Platform for Efficient Management and Utilization of Unstructured Computer Vision DataComments: Accepted @ The 22nd International Conference on Software Engineering Research & PracticeSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
In today's world of advanced AI technologies, data management is a critical component of any AI/ML solution. Effective data management is vital for the creation and maintenance of high-quality, diverse datasets, which significantly enhance predictive capabilities and lead to smarter business solutions. In this work, we introduce MetaPix, a Data-centric AI platform offering comprehensive data management solutions specifically designed for unstructured data. MetaPix offers robust tools for data ingestion, processing, storage, versioning, governance, and discovery. The platform operates on four key concepts: DataSources, Datasets, Extensions and Extractors. A DataSource serves as MetaPix top level asset, representing a narrow-scoped source of data for a specific use. Datasets are MetaPix second level object, structured collections of data. Extractors are internal tools integrated into MetaPix's backend processing, facilitate data processing and enhancement. Additionally, MetaPix supports extensions, enabling integration with external third-party tools to enhance platform functionality. This paper delves into each MetaPix concept in detail, illustrating how they collectively contribute to the platform's objectives. By providing a comprehensive solution for managing and utilizing unstructured computer vision data, MetaPix equips organizations with a powerful toolset to develop AI applications effectively.
- [33] arXiv:2409.12293 [pdf, html, other]
-
Title: Provable In-Context Learning of Linear Systems and Linear Elliptic PDEs with TransformersSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Foundation models for natural language processing, powered by the transformer architecture, exhibit remarkable in-context learning (ICL) capabilities, allowing pre-trained models to adapt to downstream tasks using few-shot prompts without updating their weights. Recently, transformer-based foundation models have also emerged as versatile tools for solving scientific problems, particularly in the realm of partial differential equations (PDEs). However, the theoretical foundations of the ICL capabilities in these scientific models remain largely unexplored. This work develops a rigorous error analysis for transformer-based ICL applied to solution operators associated with a family of linear elliptic PDEs. We first demonstrate that a linear transformer, defined by a linear self-attention layer, can provably learn in-context to invert linear systems arising from the spatial discretization of PDEs. This is achieved by deriving theoretical scaling laws for the prediction risk of the proposed linear transformers in terms of spatial discretization size, the number of training tasks, and the lengths of prompts used during training and inference. These scaling laws also enable us to establish quantitative error bounds for learning PDE solutions. Furthermore, we quantify the adaptability of the pre-trained transformer on downstream PDE tasks that experience distribution shifts in both tasks (represented by PDE coefficients) and input covariates (represented by the source term). To analyze task distribution shifts, we introduce a novel concept of task diversity and characterize the transformer's prediction error in terms of the magnitude of task shift, assuming sufficient diversity in the pre-training tasks. We also establish sufficient conditions to ensure task diversity. Finally, we validate the ICL-capabilities of transformers through extensive numerical experiments.
- [34] arXiv:2409.12294 [pdf, html, other]
-
Title: RAG-Modulo: Solving Sequential Tasks using Experience, Critics, and Language ModelsComments: 8 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Large language models (LLMs) have recently emerged as promising tools for solving challenging robotic tasks, even in the presence of action and observation uncertainties. Recent LLM-based decision-making methods (also referred to as LLM-based agents), when paired with appropriate critics, have demonstrated potential in solving complex, long-horizon tasks with relatively few interactions. However, most existing LLM-based agents lack the ability to retain and learn from past interactions - an essential trait of learning-based robotic systems. We propose RAG-Modulo, a framework that enhances LLM-based agents with a memory of past interactions and incorporates critics to evaluate the agents' decisions. The memory component allows the agent to automatically retrieve and incorporate relevant past experiences as in-context examples, providing context-aware feedback for more informed decision-making. Further by updating its memory, the agent improves its performance over time, thereby exhibiting learning. Through experiments in the challenging BabyAI and AlfWorld domains, we demonstrate significant improvements in task success rates and efficiency, showing that the proposed RAG-Modulo framework outperforms state-of-the-art baselines.
- [35] arXiv:2409.12295 [pdf, other]
-
Title: SANE: Strategic Autonomous Non-Smooth Exploration for Multiple Optima Discovery in Multi-modal and Non-differentiable Black-box FunctionsComments: 25 pages, 7 figures in main text, 2 figures in Supp MatSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Both computational and experimental material discovery bring forth the challenge of exploring multidimensional and multimodal parameter spaces, such as phase diagrams of Hamiltonians with multiple interactions, composition spaces of combinatorial libraries, material structure image spaces, and molecular embedding spaces. Often these systems are black-box and time-consuming to evaluate, which resulted in strong interest towards active learning methods such as Bayesian optimization (BO). However, these systems are often noisy which make the black box function severely multi-modal and non-differentiable, where a vanilla BO can get overly focused near a single or faux optimum, deviating from the broader goal of scientific discovery. To address these limitations, here we developed Strategic Autonomous Non-Smooth Exploration (SANE) to facilitate an intelligent Bayesian optimized navigation with a proposed cost-driven probabilistic acquisition function to find multiple global and local optimal regions, avoiding the tendency to becoming trapped in a single optimum. To distinguish between a true and false optimal region due to noisy experimental measurements, a human (domain) knowledge driven dynamic surrogate gate is integrated with SANE. We implemented the gate-SANE into a pre-acquired Piezoresponse spectroscopy data of a ferroelectric combinatorial library with high noise levels in specific regions, and a piezoresponse force microscopy (PFM) hyperspectral data. SANE demonstrated better performance than classical BO to facilitate the exploration of multiple optimal regions and thereby prioritized learning with higher coverage of scientific values in autonomous experiments. Our work showcases the potential application of this method to real-world experiment, where such combined strategic and human intervening approaches can be critical to unlocking new discoveries in autonomous research.
- [36] arXiv:2409.12296 [pdf, html, other]
-
Title: JKO for Landau: a variational particle method for homogeneous Landau equationSubjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Inspired by the gradient flow viewpoint of the Landau equation and corresponding dynamic formulation of the Landau metric in [arXiv:2007.08591], we develop a novel implicit particle method for the Landau equation in the framework of the JKO scheme. We first reformulate the Landau metric in a computationally friendly form, and then translate it into the Lagrangian viewpoint using the flow map. A key observation is that, while the flow map evolves according to a rather complicated integral equation, the unknown component is merely a score function of the corresponding density plus an additional term in the null space of the collision kernel. This insight guides us in approximating the flow map with a neural network and simplifies the training. Additionally, the objective function is in a double summation form, making it highly suitable for stochastic methods. Consequently, we design a tailored version of stochastic gradient descent that maintains particle interactions and reduces the computational complexity. Compared to other deterministic particle methods, the proposed method enjoys exact entropy dissipation and unconditional stability, therefore making it suitable for large-scale plasma simulations over extended time periods.
- [37] arXiv:2409.12297 [pdf, html, other]
-
Title: Hypersparse Traffic Matrices from Suricata Network Flows using GraphBLASMichael Houle, Michael Jones, Dan Wallmeyer, Risa Brodeur, Justin Burr, Hayden Jananthan, Sam Merrell, Peter Michaleas, Anthony Perez, Andrew Prout, Jeremy KepnerSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Hypersparse traffic matrices constructed from network packet source and destination addresses is a powerful tool for gaining insights into network traffic. SuiteSparse: GraphBLAS, an open source package or building, manipulating, and analyzing large hypersparse matrices, is one approach to constructing these traffic matrices. Suricata is a widely used open source network intrusion detection software package. This work demonstrates how Suricata network flow records can be used to efficiently construct hypersparse matrices using GraphBLAS.
- [38] arXiv:2409.12299 [pdf, html, other]
-
Title: Understanding Web Application Workloads and Their Applications: Systematic Literature Review and CharacterizationSubjects: Software Engineering (cs.SE)
Web applications, accessible via web browsers over the Internet, facilitate complex functionalities without local software installation. In the context of web applications, a workload refers to the number of user requests sent by users or applications to the underlying system. Existing studies have leveraged web application workloads to achieve various objectives, such as workload prediction and auto-scaling. However, these studies are conducted in an ad hoc manner, lacking a systematic understanding of the characteristics of web application workloads. In this study, we first conduct a systematic literature review to identify and analyze existing studies leveraging web application workloads. Our analysis sheds light on their workload utilization, analysis techniques, and high-level objectives. We further systematically analyze the characteristics of the web application workloads identified in the literature review. Our analysis centers on characterizing these workloads at two distinct temporal granularities: daily and weekly. We successfully identify and categorize three daily and three weekly patterns within the workloads. By providing a statistical characterization of these workload patterns, our study highlights the uniqueness of each pattern, paving the way for the development of realistic workload generation and resource provisioning techniques that can benefit a range of applications and research areas.
- [39] arXiv:2409.12300 [pdf, html, other]
-
Title: Autoformalization of Game Descriptions using Large Language ModelsComments: code: this https URLSubjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Game theory is a powerful framework for reasoning about strategic interactions, with applications in domains ranging from day-to-day life to international politics. However, applying formal reasoning tools in such contexts is challenging, as these scenarios are often expressed in natural language. To address this, we introduce a framework for the autoformalization of game-theoretic scenarios, which translates natural language descriptions into formal logic representations suitable for formal solvers. Our approach utilizes one-shot prompting and a solver that provides feedback on syntactic correctness to allow LLMs to refine the code. We evaluate the framework using GPT-4o and a dataset of natural language problem descriptions, achieving 98% syntactic correctness and 88% semantic correctness. These results show the potential of LLMs to bridge the gap between real-life strategic interactions and formal reasoning.
- [40] arXiv:2409.12302 [pdf, html, other]
-
Title: Space-Time Continuum: Continuous Shape and Time State Estimation for Flexible RobotsComments: Accepted at ICRA40Subjects: Robotics (cs.RO)
This extended abstract introduces a novel method for continuous state estimation of continuum robots. We formulate the estimation problem as a factor-graph optimization problem using a novel Gaussian-process prior that is parameterized over both arclength and time. We use this to introduce the first continuous-time-and-space state estimation method for continuum robots.
- [41] arXiv:2409.12304 [pdf, html, other]
-
Title: Self-Supervised Pre-training Tasks for an fMRI Time-series Transformer in Autism DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that encompasses a wide variety of symptoms and degrees of impairment, which makes the diagnosis and treatment challenging. Functional magnetic resonance imaging (fMRI) has been extensively used to study brain activity in ASD, and machine learning methods have been applied to analyze resting state fMRI (rs-fMRI) data. However, fewer studies have explored the recent transformer-based models on rs-fMRI data. Given the superiority of transformer models in capturing long-range dependencies in sequence data, we have developed a transformer-based self-supervised framework that directly analyzes time-series fMRI data without computing functional connectivity. To address over-fitting in small datasets and enhance the model performance, we propose self-supervised pre-training tasks to reconstruct the randomly masked fMRI time-series data, investigating the effects of various masking strategies. We then finetune the model for the ASD classification task and evaluate it using two public datasets and five-fold cross-validation with different amounts of training data. The experiments show that randomly masking entire ROIs gives better model performance than randomly masking time points in the pre-training step, resulting in an average improvement of 10.8% for AUC and 9.3% for subject accuracy compared with the transformer model trained from scratch across different levels of training data availability. Our code is available on GitHub.
- [42] arXiv:2409.12305 [pdf, html, other]
-
Title: QAMNet: Fast and Efficient Optical QAM Neural NetworksComments: 9 pages, 7 figuresSubjects: Emerging Technologies (cs.ET)
The energy consumption of neural network inference has become a topic of paramount importance with the growing success and adoption of deep neural networks (DNNs). Analog optical neural networks (ONNs) can reduce the energy of matrix-vector multiplication in neural network inference below that of digital electronics. However, realizing this promise remains challenging due to digital-to-analog (DAC) conversion: even at low bit precisions $b$, encoding the $2^b$ levels of digital weights and inputs into the analog domain requires specialized and power-hungry electronics. Faced with similar challenges, the field of telecommunications has developed the complex-valued Quadrature-Amplitude Modulation (QAM), the workhorse modulation format for decades. QAM maximally exploits the complex amplitude to provide a quadratic $O(N^2) \to O(N)$ energy saving over intensity-only modulation. Inspired by this advantage, this work introduces QAMNet, an optical neural network hardware and architecture with superior energy consumption to existing ONNs, that fully utilizes the complex nature of the amplitude of light with QAM. When implemented with conventional telecommunications equipment, we show that QAMNet accelerates complex-valued deep neural networks with accuracies indistinguishable from digital hardware, based on physics-based simulations. Compared to standard ONNs, we find that QAMNet ONNs: (1) attain higher accuracy above moderate levels of total bit precision, (2) are more accurate above low energy budgets, and (3) are an optimal choice when hardware bit precision is limited.
- [43] arXiv:2409.12306 [pdf, html, other]
-
Title: Measuring Sound Symbolism in Audio-visual ModelsComments: SLT 2024Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations$\unicode{x2013}$known as sound symbolism$\unicode{x2013}$which is also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models' outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.
- [44] arXiv:2409.12308 [pdf, html, other]
-
Title: Robust DOA Estimation Based on Dual Lawson Norm for RIS-Aided Wireless Communication SystemsComments: 10 figures, 28 pagesSubjects: Information Theory (cs.IT)
Reconfigurable intelligent surfaces (RIS) can actively perform beamforming and have become a crucial enabler for wireless systems in the future. The direction-of-arrival (DOA) estimates of RIS received signals can help design the reflection control matrix and improve communication quality. In this paper, we design a RIS-assisted system and propose a robust Lawson norm-based multiple-signal-classification (LN-MUSIC) DOA estimation algorithm for impulsive noise, which is divided into two parts. The first one, the non-convex Lawson norm is used as the error criterion along with a regularization constraint to formulate the optimization problem. Then, a Bregman distance based alternating direction method of multipliers is used to solve the problem and recover the desired signal. The second part is to use the multiple signal classification (MUSIC) to find out the DOAs of targets based on their sparsity in the spatial domain. In addition, we also propose a RIS control matrix optimization strategy that requires no channel state information, which effectively enhances the desired signals and improves the performance of the LN-MUSIC algorithm. A Cramer-Rao-lower-bound (CRLB) of the proposed DOA estimation algorithm is presented and verifies its feasibility. Simulated results show that the proposed robust DOA estimation algorithm based on the Lawson norm can effectively suppress the impact of large outliers caused by impulsive noise on the estimation results, outperforming existing methods.
- [45] arXiv:2409.12311 [pdf, html, other]
-
Title: Towards Closing the Loop in Robotic Pollination for Indoor Farming via Autonomous Microscopic InspectionChuizheng Kong, Alex Qiu, Idris Wibowo, Marvin Ren, Aishik Dhori, Kai-Shu Ling, Ai-Ping Hu, Shreyas KousikSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
Effective pollination is a key challenge for indoor farming, since bees struggle to navigate without the sun. While a variety of robotic system solutions have been proposed, it remains difficult to autonomously check that a flower has been sufficiently pollinated to produce high-quality fruit, which is especially critical for self-pollinating crops such as strawberries. To this end, this work proposes a novel robotic system for indoor farming. The proposed hardware combines a 7-degree-of-freedom (DOF) manipulator arm with a custom end-effector, comprised of an endoscope camera, a 2-DOF microscope subsystem, and a custom vibrating pollination tool; this is paired with algorithms to detect and estimate the pose of strawberry flowers, navigate to each flower, pollinate using the tool, and inspect with the microscope. The key novelty is vibrating the flower from below while simultaneously inspecting with a microscope from above. Each subsystem is validated via extensive experiments.
- [46] arXiv:2409.12314 [pdf, html, other]
-
Title: Understanding Implosion in Text-to-Image Generative ModelsComments: ACM CCS 2024Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Recent works show that text-to-image generative models are surprisingly vulnerable to a variety of poisoning attacks. Empirical results find that these models can be corrupted by altering associations between individual text prompts and associated visual features. Furthermore, a number of concurrent poisoning attacks can induce "model implosion," where the model becomes unable to produce meaningful images for unpoisoned prompts. These intriguing findings highlight the absence of an intuitive framework to understand poisoning attacks on these models. In this work, we establish the first analytical framework on robustness of image generative models to poisoning attacks, by modeling and analyzing the behavior of the cross-attention mechanism in latent diffusion models. We model cross-attention training as an abstract problem of "supervised graph alignment" and formally quantify the impact of training data by the hardness of alignment, measured by an Alignment Difficulty (AD) metric. The higher the AD, the harder the alignment. We prove that AD increases with the number of individual prompts (or concepts) poisoned. As AD grows, the alignment task becomes increasingly difficult, yielding highly distorted outcomes that frequently map meaningful text prompts to undefined or meaningless visual representations. As a result, the generative model implodes and outputs random, incoherent images at large. We validate our analytical framework through extensive experiments, and we confirm and explain the unexpected (and unexplained) effect of model implosion while producing new, unforeseen insights. Our work provides a useful tool for studying poisoning attacks against diffusion models and their defenses.
- [47] arXiv:2409.12317 [pdf, html, other]
-
Title: Excitable Nonlinear Opinion Dynamics (E-NOD) for Agile Decision-MakingComments: 6 pages, 5 figuresSubjects: Systems and Control (eess.SY)
We present Excitable Nonlinear Opinion Dynamics (E-NOD), which describe opinion-forming and decision-making behavior with superior "agility" in responding and adapting to fast and unpredictable changes in context, environment, or information about available options. E-NOD is derived by introducing a single extra term to the previously presented Nonlinear Opinion Dynamics (NOD), which have been shown to enable fast and flexible multi-agent behavior. This extra term is inspired by the fast-positive, slow-negative mixed-feedback structure of excitable systems. The agile behaviors resulting from the excitable nature of decision-making driven by E-NOD are analyzed in a general setting and illustrated through an application to robot navigation around human movers.
- [48] arXiv:2409.12318 [pdf, html, other]
-
Title: A large-scale study of performance and equity of commercial remote identity verification technologies across demographicsKaniz Fatima, Michael Schuckers, Gerardo Cruz-Ortiz, Daqing Hou, Sandip Purnapatra, Tiffany Andrews, Ambuj Neupane, Brandeis Marshall, Stephanie SchuckersSubjects: Computer Vision and Pattern Recognition (cs.CV)
As more types of transactions move online, there is an increasing need to verify someone's identity remotely. Remote identity verification (RIdV) technologies have emerged to fill this need. RIdV solutions typically use a smart device to validate an identity document like a driver's license by comparing a face selfie to the face photo on the document. Recent research has been focused on ensuring that biometric systems work fairly across demographic groups. This study assesses five commercial RIdV solutions for equity across age, gender, race/ethnicity, and skin tone across 3,991 test subjects. This paper employs statistical methods to discern whether the RIdV result across demographic groups is statistically distinguishable. Two of the RIdV solutions were equitable across all demographics, while two RIdV solutions had at least one demographic that was inequitable. For example, the results for one technology had a false negative rate of 10.5% +/- 4.5% and its performance for each demographic category was within the error bounds, and, hence, were equitable. The other technologies saw either poor overall performance or inequitable performance. For one of these, participants of the race Black/African American (B/AA) as well as those with darker skin tones (Monk scale 7/8/9/10) experienced higher false rejections. Finally, one technology demonstrated more favorable but inequitable performance for the Asian American and Pacific Islander (AAPI) demographic. This study confirms that it is necessary to evaluate products across demographic groups to fully understand the performance of remote identity verification technologies.
- [49] arXiv:2409.12319 [pdf, html, other]
-
Title: Large Language Models Are Strong Audio-Visual Speech Recognition LearnersUmberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja PanticComments: The code will be made available at this link: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with (automatic) speech recognition (ASR) abilities by just concatenating the audio tokens, computed with an audio encoder, and the text tokens to achieve state-of-the-art results. On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. It leverages pre-trained audio and video encoders to produce modality-specific tokens which, together with the text tokens, are processed by a pre-trained LLM (e.g., Llama3.1-8B) to yield the resulting response in an auto-regressive fashion. Llama-AVSR requires a small number of trainable parameters as only modality-specific projectors and LoRA modules are trained whereas the multi-modal encoders and LLM are kept frozen. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively. To bolster our results, we investigate the key factors that underpin the effectiveness of Llama-AVSR: the choice of the pre-trained encoders and LLM, the efficient integration of LoRA modules, and the optimal performance-efficiency trade-off obtained via modality-aware compression rates.
- [50] arXiv:2409.12320 [pdf, html, other]
-
Title: The Effect of Education in Prompt Engineering: Evidence from JournalistsSubjects: Human-Computer Interaction (cs.HC)
Large language models (LLMs) are increasingly used in daily work. In this paper, we analyze whether training in prompt engineering can improve the interactions of users with LLMs. For this, we conducted a field experiment where we asked journalists to write short texts before and after training in prompt engineering. We then analyzed the effect of training on three dimensions: (1) the user experience of journalists when interacting with LLMs, (2) the accuracy of the texts (assessed by a domain expert), and (3) the reader perception, such as clarity, engagement, and other text quality dimensions (assessed by non-expert readers). Our results show: (1) Our training improved the perceived expertise of journalists but also decreased the perceived helpfulness of LLM use. (2) The effect on accuracy varied by the difficulty of the task. (3) There is a mixed impact of training on reader perception across different text quality dimensions.
- [51] arXiv:2409.12323 [pdf, html, other]
-
Title: Depth Estimation Based on 3D Gaussian Splatting Siamese DefocusSubjects: Computer Vision and Pattern Recognition (cs.CV)
Depth estimation is a fundamental task in 3D geometry. While stereo depth estimation can be achieved through triangulation methods, it is not as straightforward for monocular methods, which require the integration of global and local information. The Depth from Defocus (DFD) method utilizes camera lens models and parameters to recover depth information from blurred images and has been proven to perform well. However, these methods rely on All-In-Focus (AIF) images for depth estimation, which is nearly impossible to obtain in real-world applications. To address this issue, we propose a self-supervised framework based on 3D Gaussian splatting and Siamese networks. By learning the blur levels at different focal distances of the same scene in the focal stack, the framework predicts the defocus map and Circle of Confusion (CoC) from a single defocused image, using the defocus map as input to DepthNet for monocular depth estimation. The 3D Gaussian splatting model renders defocused images using the predicted CoC, and the differences between these and the real defocused images provide additional supervision signals for the Siamese Defocus self-supervised network. This framework has been validated on both artificially synthesized and real blurred datasets. Subsequent quantitative and visualization experiments demonstrate that our proposed framework is highly effective as a DFD method.
- [52] arXiv:2409.12326 [pdf, html, other]
-
Title: ReFu: Recursive Fusion for Exemplar-Free 3D Class-Incremental LearningSubjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce a novel Recursive Fusion model, dubbed ReFu, designed to integrate point clouds and meshes for exemplar-free 3D Class-Incremental Learning, where the model learns new 3D classes while retaining knowledge of previously learned ones. Unlike existing methods that either rely on storing historical data to mitigate forgetting or focus on single data modalities, ReFu eliminates the need for exemplar storage while utilizing the complementary strengths of both point clouds and meshes. To achieve this, we introduce a recursive method which continuously accumulates knowledge by updating the regularized auto-correlation matrix. Furthermore, we propose a fusion module, featuring a Pointcloud-guided Mesh Attention Layer that learns correlations between the two modalities. This mechanism effectively integrates point cloud and mesh features, leading to more robust and stable continual learning. Experiments across various datasets demonstrate that our proposed framework outperforms existing methods in 3D class-incremental learning. Project Page: this https URL
- [53] arXiv:2409.12328 [pdf, html, other]
-
Title: SplitVAEs: Decentralized scenario generation from siloed data for stochastic optimization problemsComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Methodology (stat.ME)
Stochastic optimization problems in large-scale multi-stakeholder networked systems (e.g., power grids and supply chains) rely on data-driven scenarios to encapsulate complex spatiotemporal interdependencies. However, centralized aggregation of stakeholder data is challenging due to the existence of data silos resulting from computational and logistical bottlenecks. In this paper, we present SplitVAEs, a decentralized scenario generation framework that leverages variational autoencoders to generate high-quality scenarios without moving stakeholder data. With the help of experiments on distributed memory systems, we demonstrate the broad applicability of SplitVAEs in a variety of domain areas that are dominated by a large number of stakeholders. Our experiments indicate that SplitVAEs can learn spatial and temporal interdependencies in large-scale networks to generate scenarios that match the joint historical distribution of stakeholder data in a decentralized manner. Our experiments show that SplitVAEs deliver robust performance compared to centralized, state-of-the-art benchmark methods while significantly reducing data transmission costs, leading to a scalable, privacy-enhancing alternative to scenario generation.
- [54] arXiv:2409.12330 [pdf, html, other]
-
Title: Heterogeneous Mixed Traffic Control and CoordinationSubjects: Multiagent Systems (cs.MA)
Urban intersections, filled with a diverse mix of vehicles from small cars to large semi-trailers, present a persistent challenge for traffic control and management. This reality drives our investigation into how robot vehicles (RVs) can transform such heterogeneous traffic flow, particularly at unsignalized intersections where traditional control methods often falter during power failures and emergencies. Using reinforcement learning (RL) and real-world traffic data, we study heterogeneous mixed traffic across complex intersections under gradual automation by varying RV penetration from 10% to 90%. The results are compelling: average waiting times decrease by up to 86% and 91% compared to signalized and unsignalized intersections, respectively. Additionally, we uncover a "rarity advantage," where less frequent vehicles, such as trucks, benefit the most from RV coordination (by up to 87%). RVs' presence also leads to lower CO2 emissions and fuel consumption compared to managing traffic via traffic lights. Moreover, space headways decrease across all vehicle types as RV rate increases, indicating better road space utilization.
- [55] arXiv:2409.12331 [pdf, html, other]
-
Title: FuzzEval: Assessing Fuzzers on Generating Context-Sensitive InputsSubjects: Cryptography and Security (cs.CR)
Cryptographic protocols form the backbone of modern security systems, yet vulnerabilities persist within their implementations. Traditional testing techniques, including fuzzing, have struggled to effectively identify vulnerabilities in cryptographic libraries due to their reliance on context-sensitive inputs. This paper presents a comprehensive evaluation of eleven state-of-the-art fuzzers' ability to generate context-sensitive inputs for testing a cryptographic standard, PKCS#1-v1.5, across thirteen implementations. Our study reveals nuanced performance differences among the fuzzers in terms of the validity and diversity of the produced inputs. This investigation underscores the limitations of existing fuzzers in handling context-sensitive inputs. These findings are expected to drive further research and development in this area.
- [56] arXiv:2409.12332 [pdf, html, other]
-
Title: Perceptions of the Fairness Impacts of Multiplicity in Machine LearningComments: 30 pages, 3 figuresSubjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Machine learning (ML) is increasingly used in high-stakes settings, yet multiplicity -- the existence of multiple good models -- means that some predictions are essentially arbitrary. ML researchers and philosophers posit that multiplicity poses a fairness risk, but no studies have investigated whether stakeholders agree. In this work, we conduct a survey to see how the presence of multiplicity impacts lay stakeholders' -- i.e., decision subjects' -- perceptions of ML fairness, and which approaches to address multiplicity they prefer. We investigate how these perceptions are modulated by task characteristics (e.g., stakes and uncertainty). Survey respondents think that multiplicity lowers distributional, but not procedural, fairness, even though existing work suggests the opposite. Participants are strongly against resolving multiplicity by using a single good model (effectively ignoring multiplicity) or by randomizing over possible outcomes. Our results indicate that model developers should be intentional about dealing with multiplicity in order to maintain fairness.
- [57] arXiv:2409.12335 [pdf, other]
-
Title: Bridging the Gap Between Approximation and Learning via Optimal Approximation by ReLU MLPs of Maximal RegularityComments: 16 pages main body, 40 pages proofs, 7 figures, 1 tableSubjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Functional Analysis (math.FA); Numerical Analysis (math.NA); Machine Learning (stat.ML)
The foundations of deep learning are supported by the seemingly opposing perspectives of approximation or learning theory. The former advocates for large/expressive models that need not generalize, while the latter considers classes that generalize but may be too small/constrained to be universal approximators. Motivated by real-world deep learning implementations that are both expressive and statistically reliable, we ask: "Is there a class of neural networks that is both large enough to be universal but structured enough to generalize?"
This paper constructively provides a positive answer to this question by identifying a highly structured class of ReLU multilayer perceptions (MLPs), which are optimal function approximators and are statistically well-behaved. We show that any $L$-Lipschitz function from $[0,1]^d$ to $[-n,n]$ can be approximated to a uniform $Ld/(2n)$ error on $[0,1]^d$ with a sparsely connected $L$-Lipschitz ReLU MLP of width $\mathcal{O}(dn^d)$, depth $\mathcal{O}(\log(d))$, with $\mathcal{O}(dn^d)$ nonzero parameters, and whose weights and biases take values in $\{0,\pm 1/2\}$ except in the first and last layers which instead have magnitude at-most $n$. Unlike previously known "large" classes of universal ReLU MLPs, the empirical Rademacher complexity of our class remains bounded even when its depth and width become arbitrarily large. Further, our class of MLPs achieves a near-optimal sample complexity of $\mathcal{O}(\log(N)/\sqrt{N})$ when given $N$ i.i.d. normalized sub-Gaussian training samples.
We achieve this by avoiding the standard approach to constructing optimal ReLU approximators, which sacrifices regularity by relying on small spikes. Instead, we introduce a new construction that perfectly fits together linear pieces using Kuhn triangulations and avoids these small spikes. - [58] arXiv:2409.12338 [pdf, html, other]
-
Title: Can I Pet Your Robot? Incorporating Capacitive Touch Sensing into a Soft Socially Assistive Robot PlatformComments: Accepted as a Work-In-Progress submission at the 2024 IEEE Haptics SymposiumSubjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
This work presents a method of incorporating low-cost capacitive tactile sensors on a soft socially assistive robot platform. By embedding conductive thread into the robot's crocheted exterior, we formed a set of low-cost, flexible capacitive tactile sensors that do not disrupt the robot's soft, zoomorphic embodiment. We evaluated the sensors' performance through a user study (N=20) and found that the sensors reliably detected user touch events and localized touch inputs to one of three regions on the robot's exterior.
- [59] arXiv:2409.12339 [pdf, html, other]
-
Title: A Learning-based Controller for Multi-Contact Grasps on Unknown Objects with a Dexterous HandComments: 7 pages, 10 figures, Accepted at International Conference on Intelligent Robots and Systems (IROS) 2024Subjects: Robotics (cs.RO)
Existing grasp controllers usually either only support finger-tip grasps or need explicit configuration of the inner forces. We propose a novel grasp controller that supports arbitrary grasp types, including power grasps with multi-contacts, while operating self-contained on before unseen objects. No detailed contact information is needed, but only a rough 3D model, e.g., reconstructed from a single depth image. First, the external wrench being applied to the object is estimated by using the measured torques at the joints. Then, the torques necessary to counteract the estimated wrench while keeping the object at its initial pose are predicted. The torques are commanded via desired joint angles to an underlying joint-level impedance controller. To reach real-time performance, we propose a learning-based approach that is based on a wrench estimator- and a torque predictor neural network. Both networks are trained in a supervised fashion using data generated via the analytical formulation of the controller. In an extensive simulation-based evaluation, we show that our controller is able to keep 83.1% of the tested grasps stable when applying external wrenches with up to 10N. At the same time, we outperform the two tested baselines by being more efficient and inducing less involuntary object movement. Finally, we show that the controller also works on the real DLR-Hand II, reaching a cycle time of 6ms.
- [60] arXiv:2409.12341 [pdf, html, other]
-
Title: Provable Privacy Guarantee for Individual Identities and Locations in Large-Scale Contact TracingSubjects: Cryptography and Security (cs.CR)
The task of infectious disease contact tracing is crucial yet challenging, especially when meeting strict privacy requirements. Previous attempts in this area have had limitations in terms of applicable scenarios and efficiency. Our paper proposes a highly scalable, practical contact tracing system called PREVENT that can work with a variety of location collection methods to gain a comprehensive overview of a person's trajectory while ensuring the privacy of individuals being tracked, without revealing their plain text locations to any party, including servers. Our system is very efficient and can provide real-time query services for large-scale datasets with millions of locations. This is made possible by a newly designed secret-sharing based architecture that is tightly integrated into unique private space partitioning trees. Notably, our experimental results on both real and synthetic datasets demonstrate that our system introduces negligible performance overhead compared to traditional contact tracing methods. PREVENT could be a game-changer in the fight against infectious diseases and set a new standard for privacy-preserving location tracking.
- [61] arXiv:2409.12345 [pdf, html, other]
-
Title: Wing Optimisation for a tractor propeller driven Micro Aerial VehicleComments: 12 pages, 14 figuresSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
This paper describes an investigation of the possible benefits from wing optimisation in improving the performance of Micro Air Vehicles (MAVs). As an example we study the Avion (3.64 kg mass, 1.60 m span), being designed at the CSIR National Aerospace Laboratories (NAL), Bengaluru. The optimisation is first carried out using the methodology described by Rakshith \emph{et al.} (using an in\textendash house software PROWING), developed for large transport aircraft, with certain modifications to adapt the code to the special features of the MAV. The chief among such features is the use of low Reynolds number aerofoils with significantly different aerodynamic characteristics on a small MAV. These characteristics are taken from test data when available, and/or estimated by the XFOIL code of Drela. A total of 8 optimisation cases are studied for the purpose, leading to 6 different options for new wing planforms (and associated twist distributions along the wing span) with an improved performance. It is found that the improvements in drag coefficient using the PROWING code are about 5%. However, by allowing the operating lift coefficient $C_L$ to float within a specified range, drag bucket characteristics of the Eppler E423 aerofoil used on Avion can be exploited to improve the endurance, which is a major performance parameter for Avion. Thus, compared to the control wing $W_0$ (with operating point at $C_L$ =0.7) used in the preliminary design, permitting a variation of $C_L$ over a range of $\pm$ 10% is shown to enhance the endurance of wing $W_4$ by 18.6%, and of wing $W_{6}$ with a permitted $C_L$ range of $\pm$ 50% by 39.2%. Apart from the philosophy of seeking optimal operating conditions for a given configuration, the advantages of optimising design parameters such as washout of a simple wing proposed in the preliminary design stage, is also demonstrated.
- [62] arXiv:2409.12346 [pdf, html, other]
-
Title: Simultaneous Music Separation and Generation Using Multi-Track Latent Diffusion ModelsSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Diffusion models have recently shown strong potential in both music generation and music source separation tasks. Although in early stages, a trend is emerging towards integrating these tasks into a single framework, as both involve generating musically aligned parts and can be seen as facets of the same generative process. In this work, we introduce a latent diffusion-based multi-track generation model capable of both source separation and multi-track music synthesis by learning the joint probability distribution of tracks sharing a musical context. Our model also enables arrangement generation by creating any subset of tracks given the others. We trained our model on the Slakh2100 dataset, compared it with an existing simultaneous generation and separation model, and observed significant improvements across objective metrics for source separation, music, and arrangement generation tasks. Sound examples are available at this https URL.
- [63] arXiv:2409.12350 [pdf, html, other]
-
Title: Advancing Cucumber Disease Detection in Agriculture through Machine Vision and Drone TechnologyComments: 10 page and 6 figureSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
This study uses machine vision and drone technologies to propose a unique method for the diagnosis of cucumber disease in agriculture. The backbone of this research is a painstakingly curated dataset of hyperspectral photographs acquired under genuine field conditions. Unlike earlier datasets, this study included a wide variety of illness types, allowing for precise early-stage detection. The model achieves an excellent 87.5\% accuracy in distinguishing eight unique cucumber illnesses after considerable data augmentation. The incorporation of drone technology for high-resolution images improves disease evaluation. This development has enormous potential for improving crop management, lowering labor costs, and increasing agricultural productivity. This research, which automates disease detection, represents a significant step toward a more efficient and sustainable agricultural future.
- [64] arXiv:2409.12357 [pdf, other]
-
Title: Dynamics of Post-disaster Recovery in Behavior-dependent Business NetworksComments: 22 pages, 11 figuresSubjects: Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
The recovery of businesses after a disaster is vital to community economic resilience, yet the network dynamics influencing the speed and spillover effects of recovery remain poorly understood. Understanding these dynamics is essential for characterizing economic resilience and informing more effective recovery policies. This study explores the extent to which post-disaster business recovery is shaped by network diffusion processes within pre-disaster business dependency networks, driven by visitation behaviors among business points of interest (POIs). We developed a network diffusion model to simulate recovery across businesses in the Louisiana Gulf Coast following Hurricane Ida (2021) and assessed its performance using empirical data. Our analysis focuses on four key areas: (1) the presence of a diffusion process influencing recovery across the business network; (2) variations in how different business types depend on others for recovery; (3) identification of recovery multiplier businesses that accelerate regional recovery; and (4) differences in recovery multipliers across high- and low-income areas. The findings reveal that business recovery is governed by diffusion dynamics in these behavior-based networks, with recovery speed closely linked to pre-disaster visitation patterns. Retail and service businesses are identified as key recovery multipliers whose rapid recovery accelerates the broader business network's recovery, enhancing economic resilience. Additionally, recovery multipliers vary between high- and low-income areas. This study enhances our understanding of network mechanisms in post-disaster recovery and offers valuable insights for improving recovery policies.
- [65] arXiv:2409.12362 [pdf, html, other]
-
Title: Rest Shape Optimization for Sag-Free Discrete Elastic RodsSubjects: Graphics (cs.GR)
We propose a new rest shape optimization framework to achieve sag-free simulations of discrete elastic rods. To optimize rest shape parameters, we formulate a minimization problem based on the kinetic energy with a regularizer while imposing box constraints on these parameters to ensure the system's stability. Our method solves the resulting constrained minimization problem via the Gauss-Newton algorithm augmented with penalty methods. We demonstrate that the optimized rest shape parameters enable discrete elastic rods to achieve static equilibrium for a wide range of strand geometries and material parameters.
- [66] arXiv:2409.12366 [pdf, html, other]
-
Title: Bilevel Optimization for Real-Time Control with Application to Locomotion Gait GenerationComments: Accepted to CDC 2024Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Model Predictive Control (MPC) is a common tool for the control of nonlinear, real-world systems, such as legged robots. However, solving MPC quickly enough to enable its use in real-time is often challenging. One common solution is given by real-time iterations, which does not solve the MPC problem to convergence, but rather close enough to give an approximate solution. In this paper, we extend this idea to a bilevel control framework where a "high-level" optimization program modifies a controller parameter of a "low-level" MPC problem which generates the control inputs and desired state trajectory. We propose an algorithm to iterate on this bilevel program in real-time and provide conditions for its convergence and improvements in stability. We then demonstrate the efficacy of this algorithm by applying it to a quadrupedal robot where the high-level problem optimizes a contact schedule in real-time. We show through simulation that the algorithm can yield improvements in disturbance rejection and optimality, while creating qualitatively new gaits.
- [67] arXiv:2409.12367 [pdf, html, other]
-
Title: Extracting Memorized Training Data via DecompositionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
The widespread use of Large Language Models (LLMs) in society creates new information security challenges for developers, organizations, and end-users alike. LLMs are trained on large volumes of data, and their susceptibility to reveal the exact contents of the source training datasets poses security and safety risks. Although current alignment procedures restrict common risky behaviors, they do not completely prevent LLMs from leaking data. Prior work demonstrated that LLMs may be tricked into divulging training data by using out-of-distribution queries or adversarial techniques. In this paper, we demonstrate a simple, query-based decompositional method to extract news articles from two frontier LLMs. We use instruction decomposition techniques to incrementally extract fragments of training data. Out of 3723 New York Times articles, we extract at least one verbatim sentence from 73 articles, and over 20% of verbatim sentences from 6 articles. Our analysis demonstrates that this method successfully induces the LLM to generate texts that are reliable reproductions of news articles, meaning that they likely originate from the source training dataset. This method is simple, generalizable, and does not fine-tune or change the production model. If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities, including privacy risks and unauthorized data leaks. These implications require careful consideration from model development to its end-use.
- [68] arXiv:2409.12368 [pdf, html, other]
-
Title: Optimal Linear Filtering for Discrete-Time Systems with Infinite-Dimensional MeasurementsComments: 15 pages, 3 Figures. This paper will appear in the IEEE Transactions on Automatic ControlSubjects: Systems and Control (eess.SY)
Systems equipped with modern sensing modalities such as vision and lidar gain access to increasingly high-dimensional measurements with which to enact estimation and control schemes. In this article, we examine the continuum limit of high-dimensional measurements and analyze state estimation in linear time-invariant systems with infinite-dimensional measurements but finite-dimensional states, both corrupted by additive noise. We propose a linear filter and derive the corresponding optimal gain functional in the sense of the minimum mean square error, analogous to the classic Kalman filter. By modeling the measurement noise as a wide-sense stationary random field, we are able to derive the optimal linear filter explicitly, in contrast to previous derivations of Kalman filters in distributed-parameter settings. Interestingly, we find that we need only impose conditions that are finite-dimensional in nature to ensure that the filter is asymptotically stable. The proposed filter is verified via simulation of a linearized system with a pinhole camera sensor.
- [69] arXiv:2409.12369 [pdf, html, other]
-
Title: Program Slicing in the Era of Large Language ModelsSubjects: Software Engineering (cs.SE)
Program slicing is a critical technique in software engineering, enabling developers to isolate relevant portions of code for tasks such as bug detection, code comprehension, and debugging. In this study, we investigate the application of large language models (LLMs) to both static and dynamic program slicing, with a focus on Java programs. We evaluate the performance of four state-of-the-art LLMs- GPT-4o, GPT-3.5 Turbo, Llama-2, and Gemma-7B leveraging advanced prompting techniques, including few-shot learning and chain-of-thought reasoning. Using a dataset of 100 Java programs derived from LeetCode problems, our experiments reveal that GPT-4o performs the best in both static and dynamic slicing across other LLMs, achieving an accuracy of 60.84% and 59.69%, respectively. Our results also show that the LLMs we experimented with are yet to achieve reasonable performance for either static slicing or dynamic slicing. Through a rigorous manual analysis, we developed a taxonomy of root causes and failure locations to explore the unsuccessful cases in more depth. We identified Complex Control Flow as the most frequent root cause of failures, with the majority of issues occurring in Variable Declarations and Assignments locations. To improve the performance of LLMs, we further examined two independent strategies for prompting guided by our taxonomy, including prompt crafting, which involved refining the prompts to better guide the LLM through the slicing process, and iterative prompting, where the model receives feedback on the root cause and location of the failure and re-generates its responses. Our evaluation shows these two prompting enhancement approaches can improve accuracy by 4% and 3.9%, respectively.
- [70] arXiv:2409.12371 [pdf, html, other]
-
Title: Communication-Efficient Federated Low-Rank Update Algorithm and its Connection to Implicit RegularizationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Federated Learning (FL) faces significant challenges related to communication efficiency and heterogeneity. To address these issues, we explore the potential of using low-rank updates. Our theoretical analysis reveals that client's loss exhibits a higher rank structure (gradients span higher rank subspace of Hessian) compared to the server's loss. Based on this insight, we hypothesize that constraining client-side optimization to a low-rank subspace could provide an implicit regularization effect. Consequently, we propose FedLoRU, a general low-rank update framework for federated learning. Our framework enforces low-rank client-side updates and accumulates these updates to form a higher-rank model. Additionally, variants of FedLoRU can adapt to environments with statistical and model heterogeneity by employing multiple or hierarchical low-rank updates. Experimental results demonstrate that FedLoRU performs comparably to full-rank algorithms and exhibits robustness to heterogeneous and large numbers of clients.
- [71] arXiv:2409.12374 [pdf, html, other]
-
Title: Linear Model Predictive Control for Quadrotors with An Analytically Derived Koopman ModelComments: 6 pages, 5 figuresSubjects: Systems and Control (eess.SY)
This letter presents a Koopman-theoretic lifted linear parameter-varying (LPV) system with countably infinite dimensions to model the nonlinear dynamics of a quadrotor on SE(3) for facilitating control design. The LPV system evolves in time in the space of the observables, called the lifted space. A primary challenge in utilizing the Koopman-based linearization is identifying a set of observables that can adequately span the lifted space, with the majority of the current methods using data to learn these observables. In this study, we analytically derive the observables for the quadrotor dynamics on SE(3) to formulate the lifted LPV system. The lifted LPV system has a countably infinite dimension which is then truncated for practical control design. The truncation is analytically justified by showing vanishing residual property in a bounded trajectory regime. The LPV system is then approximated as a linear time-invariant (LTI) system with a set of virtual control inputs. The controllability of the lifted LTI system is translatable to the true quadrotor system on SE(3). A linear model-predictive control (LMPC) scheme is formulated and implemented in numerical simulations employing this LTI framework for various tracking problems, with attention given to the potential for real-time implementation.
- [72] arXiv:2409.12376 [pdf, other]
-
Title: Prediction of Brent crude oil price based on LSTM model under the background of low-carbon transitionSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
In the field of global energy and environment, crude oil is an important strategic resource, and its price fluctuation has a far-reaching impact on the global economy, financial market and the process of low-carbon development. In recent years, with the gradual promotion of green energy transformation and low-carbon development in various countries, the dynamics of crude oil market have become more complicated and changeable. The price of crude oil is not only influenced by traditional factors such as supply and demand, geopolitical conflict and production technology, but also faces the challenges of energy policy transformation, carbon emission control and new energy technology development. This diversified driving factor makes the prediction of crude oil price not only very important in economic decision-making and energy planning, but also a key issue in financial this http URL this paper, the spot price data of European Brent crude oil provided by us energy information administration are selected, and a deep learning model with three layers of LSTM units is constructed to predict the crude oil price in the next few days. The results show that the LSTM model performs well in capturing the overall price trend, although there is some deviation during the period of sharp price fluctuation. The research in this paper not only verifies the applicability of LSTM model in energy market forecasting, but also provides data support for policy makers and investors when facing the uncertainty of crude oil price.
- [73] arXiv:2409.12379 [pdf, html, other]
-
Title: Enhancing 3D Robotic Vision Robustness by Minimizing Adversarial Mutual Information through a Curriculum Training ApproachSubjects: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Robotics (cs.RO)
Adversarial attacks exploit vulnerabilities in a model's decision boundaries through small, carefully crafted perturbations that lead to significant mispredictions. In 3D vision, the high dimensionality and sparsity of data greatly expand the attack surface, making 3D vision particularly vulnerable for safety-critical robotics. To enhance 3D vision's adversarial robustness, we propose a training objective that simultaneously minimizes prediction loss and mutual information (MI) under adversarial perturbations to contain the upper bound of misprediction errors. This approach simplifies handling adversarial examples compared to conventional methods, which require explicit searching and training on adversarial samples. However, minimizing prediction loss conflicts with minimizing MI, leading to reduced robustness and catastrophic forgetting. To address this, we integrate curriculum advisors in the training setup that gradually introduce adversarial objectives to balance training and prevent models from being overwhelmed by difficult cases early in the process. The advisors also enhance robustness by encouraging training on diverse MI examples through entropy regularizers. We evaluated our method on ModelNet40 and KITTI using PointNet, DGCNN, SECOND, and PointTransformers, achieving 2-5% accuracy gains on ModelNet40 and a 5-10% mAP improvement in object detection. Our code is publicly available at this https URL.
- [74] arXiv:2409.12380 [pdf, html, other]
-
Title: Bundle Fragments into a Whole: Mining More Complete Clusters via Submodular Selection of Interesting webpages for Web Topic DetectionComments: 10Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Organizing interesting webpages into hot topics is one of key steps to understand the trends of multimodal web data. A state-of-the-art solution is firstly to organize webpages into a large volume of multi-granularity topic candidates; hot topics are further identified by estimating their interestingness. However, these topic candidates contain a large number of fragments of hot topics due to both the inefficient feature representations and the unsupervised topic generation. This paper proposes a bundling-refining approach to mine more complete hot topics from fragments. Concretely, the bundling step organizes the fragment topics into coarse topics; next, the refining step proposes a submodular-based method to refine coarse topics in a scalable approach. The propose unconventional method is simple, yet powerful by leveraging submodular optimization, our approach outperforms the traditional ranking methods which involve the careful design and complex steps. Extensive experiments demonstrate that the proposed approach surpasses the state-of-the-art method (i.e., latent Poisson deconvolution Pang et al. (2016)) 20% accuracy and 10% one on two public data sets, respectively.
- [75] arXiv:2409.12381 [pdf, html, other]
-
Title: A Stochastic Iteratively Regularized Gauss-Newton MethodComments: 23 pagesSubjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
This work focuses on developing and motivating a stochastic version of a wellknown inverse problem methodology. Specifically, we consider the iteratively regularized Gauss-Newton method, originally proposed by Bakushinskii for infinite-dimensional problems. Recent work have extended this method to handle sequential observations, rather than a single instance of the data, demonstrating notable improvements in reconstruction accuracy. In this paper, we further extend these methods to a stochastic framework through mini-batching, introducing a new algorithm, the stochastic iteratively regularized Gauss-Newton method (SIRGNM). Our algorithm is designed through the use randomized sketching. We provide an analysis for the SIRGNM, which includes a preliminary error decomposition and a convergence analysis, related to the residuals. We provide numerical experiments on a 2D elliptic PDE example. This illustrates the effectiveness of the SIRGNM, through maintaining a similar level of accuracy while reducing on the computational time.
- [76] arXiv:2409.12382 [pdf, html, other]
-
Title: Incremental Composition of Learned Control Barrier Functions in Unknown EnvironmentsSubjects: Systems and Control (eess.SY)
We consider the problem of safely exploring a static and unknown environment while learning valid control barrier functions (CBFs) from sensor data. Existing works either assume known environments, target specific dynamics models, or use a-priori valid CBFs, and are thus limited in their safety guarantees for general systems during exploration. We present a method for safely exploring the unknown environment by incrementally composing a global CBF from locally-learned CBFs. The challenge here is that local CBFs may not have well-defined end behavior outside their training domain, i.e. local CBFs may be positive (indicating safety) in regions where no training data is available. We show that well-defined end behavior can be obtained when local CBFs are parameterized by compactly-supported radial basis functions. For learning local CBFs, we collect sensor data, e.g. LiDAR capturing obstacles in the environment, and augment it with simulated data from a safe oracle controller. Our work complements recent efforts to learn CBFs from safe demonstrations -- where learned safe sets are limited to their training domains -- by demonstrating how to grow the safe set over time as more data becomes available. We evaluate our approach on two simulated systems, where our method successfully explores an unknown environment while maintaining safety throughout the entire execution.
- [77] arXiv:2409.12384 [pdf, html, other]
-
Title: Privacy-Preserving Student Learning with Differentially Private Data-Free DistillationComments: Published by IEEE MMSP 2022Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Deep learning models can achieve high inference accuracy by extracting rich knowledge from massive well-annotated data, but may pose the risk of data privacy leakage in practical deployment. In this paper, we present an effective teacher-student learning approach to train privacy-preserving deep learning models via differentially private data-free distillation. The main idea is generating synthetic data to learn a student that can mimic the ability of a teacher well-trained on private data. In the approach, a generator is first pretrained in a data-free manner by incorporating the teacher as a fixed discriminator. With the generator, massive synthetic data can be generated for model training without exposing data privacy. Then, the synthetic data is fed into the teacher to generate private labels. Towards this end, we propose a label differential privacy algorithm termed selective randomized response to protect the label information. Finally, a student is trained on the synthetic data with the supervision of private labels. In this way, both data privacy and label privacy are well protected in a unified framework, leading to privacy-preserving models. Extensive experiments and analysis clearly demonstrate the effectiveness of our approach.
- [78] arXiv:2409.12385 [pdf, html, other]
-
Title: Look Through Masks: Towards Masked Face Recognition with De-Occlusion DistillationComments: Accepted by ACM MM 2020Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Many real-world applications today like video surveillance and urban governance need to address the recognition of masked faces, where content replacement by diverse masks often brings in incomplete appearance and ambiguous representation, leading to a sharp drop in accuracy. Inspired by recent progress on amodal perception, we propose to migrate the mechanism of amodal completion for the task of masked face recognition with an end-to-end de-occlusion distillation framework, which consists of two modules. The \textit{de-occlusion} module applies a generative adversarial network to perform face completion, which recovers the content under the mask and eliminates appearance ambiguity. The \textit{distillation} module takes a pre-trained general face recognition model as the teacher and transfers its knowledge to train a student for completed faces using massive online synthesized face pairs. Especially, the teacher knowledge is represented with structural relations among instances in multiple orders, which serves as a posterior regularization to enable the adaptation. In this way, the knowledge can be fully distilled and transferred to identify masked faces. Experiments on synthetic and realistic datasets show the efficacy of the proposed approach.
- [79] arXiv:2409.12386 [pdf, html, other]
-
Title: Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech RecognitionComments: Submitted to ICASSP 2025Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. Our method harnesses the synergistic power of channel-extractive techniques and generative adversarial networks (GANs). We first train a channel encoder capable of extracting embeddings from arbitrary audio. On top of this, channel embeddings are extracted using a minimal amount of target-domain data and used to guide a GAN-based speech synthesizer. This synthesizer generates speech that faithfully preserves the phonetic content of the input while mimicking the channel characteristics of the target domain. We evaluate our method on the challenging Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, achieving relative character error rate (CER) reductions of 20.02% and 9.64%, respectively, compared to the baselines. These results highlight the efficacy of our channel-aware data simulation method for bridging the gap between source- and target-domain acoustics.
- [80] arXiv:2409.12387 [pdf, html, other]
-
Title: On the Regret of Coded Caching with Adversarial RequestsSubjects: Information Theory (cs.IT); Machine Learning (cs.LG)
We study the well-known coded caching problem in an online learning framework, wherein requests arrive sequentially, and an online policy can update the cache contents based on the history of requests seen thus far. We introduce a caching policy based on the Follow-The-Perturbed-Leader principle and show that for any time horizon T and any request sequence, it achieves a sub-linear regret of \mathcal{O}(\sqrt(T) ) with respect to an oracle that knows the request sequence beforehand. Our study marks the first examination of adversarial regret in the coded caching setup. Furthermore, we also address the issue of switching cost by establishing an upper bound on the expected number of cache updates made by our algorithm under unrestricted switching and also provide an upper bound on the regret under restricted switching when cache updates can only happen in a pre-specified subset of timeslots. Finally, we validate our theoretical insights with numerical results using a real-world dataset
- [81] arXiv:2409.12390 [pdf, html, other]
-
Title: A Novel Perspective for Multi-modal Multi-label Skin Lesion ClassificationComments: Accepted by WACV2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
The efficacy of deep learning-based Computer-Aided Diagnosis (CAD) methods for skin diseases relies on analyzing multiple data modalities (i.e., clinical+dermoscopic images, and patient metadata) and addressing the challenges of multi-label classification. Current approaches tend to rely on limited multi-modal techniques and treat the multi-label problem as a multiple multi-class problem, overlooking issues related to imbalanced learning and multi-label correlation. This paper introduces the innovative Skin Lesion Classifier, utilizing a Multi-modal Multi-label TransFormer-based model (SkinM2Former). For multi-modal analysis, we introduce the Tri-Modal Cross-attention Transformer (TMCT) that fuses the three image and metadata modalities at various feature levels of a transformer encoder. For multi-label classification, we introduce a multi-head attention (MHA) module to learn multi-label correlations, complemented by an optimisation that handles multi-label and imbalanced learning problems. SkinM2Former achieves a mean average accuracy of 77.27% and a mean diagnostic accuracy of 77.85% on the public Derm7pt dataset, outperforming state-of-the-art (SOTA) methods.
- [82] arXiv:2409.12391 [pdf, html, other]
-
Title: Selecting a classification performance measure: matching the measure to the problemSubjects: Machine Learning (cs.LG)
The problem of identifying to which of a given set of classes objects belong is ubiquitous, occurring in many research domains and application areas, including medical diagnosis, financial decision making, online commerce, and national security. But such assignments are rarely completely perfect, and classification errors occur. This means it is necessary to compare classification methods and algorithms to decide which is ``best'' for any particular problem. However, just as there are many different classification methods, so there are many different ways of measuring their performance. It is thus vital to choose a measure of performance which matches the aims of the research or application. This paper is a contribution to the growing literature on the relative merits of different performance measures. Its particular focus is the critical importance of matching the properties of the measure to the aims for which the classification is being made.
- [83] arXiv:2409.12393 [pdf, html, other]
-
Title: Small Language Models are Equation ReasonersComments: 6 pages, 2 figuresSubjects: Computation and Language (cs.CL)
Chain-of-Thought (CoT) reasoning has enabled Large Language Model (LLM) to achieve remarkable performance in various NLP tasks, including arithmetic problem-solving. However, this success does not generalize to small language model (sLM) like T5, due to their limited capacity and absence of emergent abilities associated with larger models. Recent works to enhance sLM through knowledge distillation have yielded some improvements but still face significant limitations, particularly high ambiguity from the variability in natural language expressions and substantial computational costs. In this paper, we investigate why sLM perform poorly on arithmetic reasoning tasks and hypothesize that natural language format variability introduces high ambiguity for these smaller models. Based on this hypothesis, we conduct experiments with equation-only format, which is a reasoning format that unifies arithmetic reasoning previously expressed in natural language formats into mathematical equations. Experiment results demonstrate that equation-only format effectively boosts the arithmetic reasoning abilities of sLM, especially in very small models like T5-Tiny.
- [84] arXiv:2409.12394 [pdf, html, other]
-
Title: ITPatch: An Invisible and Triggered Physical Adversarial Patch against Traffic Sign RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Physical adversarial patches have emerged as a key adversarial attack to cause misclassification of traffic sign recognition (TSR) systems in the real world. However, existing adversarial patches have poor stealthiness and attack all vehicles indiscriminately once deployed. In this paper, we introduce an invisible and triggered physical adversarial patch (ITPatch) with a novel attack vector, i.e., fluorescent ink, to advance the state-of-the-art. It applies carefully designed fluorescent perturbations to a target sign, an attacker can later trigger a fluorescent effect using invisible ultraviolet light, causing the TSR system to misclassify the sign and potentially resulting in traffic accidents. We conducted a comprehensive evaluation to investigate the effectiveness of ITPatch, which shows a success rate of 98.31% in low-light conditions. Furthermore, our attack successfully bypasses five popular defenses and achieves a success rate of 96.72%.
- [85] arXiv:2409.12396 [pdf, html, other]
-
Title: ARTAI: An Evaluation Platform to Assess Societal Risk of Recommender AlgorithmsComments: 3 pages, 1 figure, accepted at FAccTRec 2024 Workshop, RecSys 2024Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Societal risk emanating from how recommender algorithms disseminate content online is now well documented. Emergent regulation aims to mitigate this risk through ethical audits and enabling new research on the social impact of algorithms. However, there is currently a need for tools and methods that enable such evaluation. This paper presents ARTAI, an evaluation environment that enables large-scale assessments of recommender algorithms to identify harmful patterns in how content is distributed online and enables the implementation of new regulatory requirements for increased transparency in recommender systems.
- [86] arXiv:2409.12397 [pdf, html, other]
-
Title: Learning to Coordinate without Communication under Incomplete InformationComments: This paper is currently under review at AAAI 2025Subjects: Artificial Intelligence (cs.AI)
Achieving seamless coordination in cooperative games is a crucial challenge in artificial intelligence, particularly when players operate under incomplete information. A common strategy to mitigate this information asymmetry involves leveraging explicit communication. However, direct communication is not always feasible due to factors such as transmission loss. We explore how effective coordination can be achieved without verbal communication, relying solely on observing each other's actions. We demonstrate how an autonomous agent can learn to cooperate by interpreting its partner's actions, which are used to hint at its intents. Our approach involves developing an agent strategy by constructing deterministic finite automata for each possible action and integrating them into a non-Markovian finite-state transducer. This transducer represents a non-deterministic strategy for the agent that suggests actions to assist its partner during gameplay. Experimental results in a testbed called Gnomes at Night show that the learned no-communication coordination strategy achieves significantly higher success rates and requires fewer steps to complete the game compared to uncoordinated scenarios, performing almost as well as an oracle baseline with direct communication.
- [87] arXiv:2409.12400 [pdf, html, other]
-
Title: Shape-informed surrogate models based on signed distance function domain encodingSubjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
We propose a non-intrusive method to build surrogate models that approximate the solution of parameterized partial differential equations (PDEs), capable of taking into account the dependence of the solution on the shape of the computational domain. Our approach is based on the combination of two neural networks (NNs). The first NN, conditioned on a latent code, provides an implicit representation of geometry variability through signed distance functions. This automated shape encoding technique generates compact, low-dimensional representations of geometries within a latent space, without requiring the explicit construction of an encoder. The second NN reconstructs the output physical fields independently for each spatial point, thus avoiding the computational burden typically associated with high-dimensional discretizations like computational meshes. Furthermore, we show that accuracy in geometrical characterization can be further enhanced by employing Fourier feature mapping as input feature of the NN. The meshless nature of the proposed method, combined with the dimensionality reduction achieved through automatic feature extraction in latent space, makes it highly flexible and computationally efficient. This strategy eliminates the need for manual intervention in extracting geometric parameters, and can even be applied in cases where geometries undergo changes in their topology. Numerical tests in the field of fluid dynamics and solid mechanics demonstrate the effectiveness of the proposed method in accurately predict the solution of PDEs in domains of arbitrary shape. Remarkably, the results show that it achieves accuracy comparable to the best-case scenarios where an explicit parametrization of the computational domain is available.
- [88] arXiv:2409.12403 [pdf, html, other]
-
Title: Preference Alignment Improves Language Model-Based TTSSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based systems offer competitive performance to their counterparts. Further optimization can be achieved through preference alignment algorithms, which adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content. This study presents a thorough empirical evaluation of how preference alignment algorithms, particularly Direct Preference Optimization (DPO), enhance LM-based TTS. With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores, with the latter two metrics surpassing even human speech in certain evaluations. We also show preference alignment is applicable to low-resource scenarios and effectively generalized to out-of-domain applications.
- [89] arXiv:2409.12405 [pdf, html, other]
-
Title: On the Effectiveness of LLMs for Manual Test VerificationsMyron David Lucena Campos Peixoto, Davy de Medeiros Baia, Nathalia Nascimento, Paulo Alencar, Baldoino Fonseca, Márcio RibeiroComments: 9 pagesSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Background: Manual testing is vital for detecting issues missed by automated tests, but specifying accurate verifications is challenging. Aims: This study aims to explore the use of Large Language Models (LLMs) to produce verifications for manual tests. Method: We conducted two independent and complementary exploratory studies. The first study involved using 2 closed-source and 6 open-source LLMs to generate verifications for manual test steps and evaluate their similarity to original verifications. The second study involved recruiting software testing professionals to assess their perception and agreement with the generated verifications compared to the original ones. Results: The open-source models Mistral-7B and Phi-3-mini-4k demonstrated effectiveness and consistency comparable to closed-source models like Gemini-1.5-flash and GPT-3.5-turbo in generating manual test verifications. However, the agreement level among professional testers was slightly above 40%, indicating both promise and room for improvement. While some LLM-generated verifications were considered better than the originals, there were also concerns about AI hallucinations, where verifications significantly deviated from expectations. Conclusion: We contributed by generating a dataset of 37,040 test verifications using 8 different LLMs. Although the models show potential, the relatively modest 40% agreement level highlights the need for further refinement. Enhancing the accuracy, relevance, and clarity of the generated verifications is crucial to ensure greater reliability in real-world testing scenarios.
- [90] arXiv:2409.12406 [pdf, html, other]
-
Title: Robust Model-Free Control Framework with Safety Constraints for a Fully Electric Linear Actuator SystemComments: This work has been accepted in the IEEE Power Electronics and Motion Control (IEEE-PEMC 2024) ConferenceSubjects: Systems and Control (eess.SY)
This paper introduces a novel model-free control strategy for a complex multi-stage gearbox electromechanical linear actuator (EMLA) system, driven by a permanent magnet synchronous motor (PMSM) with non-ideal ball screw characteristics. The proposed control approach aims to (1) manage user-specified safety constraints, (2) identify optimal control parameters for minimizing tracking errors, (3) ensure robustness, and (4) guarantee uniformly exponential stability. First, this paper employs a trajectory-setting interpolation-based algorithm to specify the piecewise definition of a smooth and jerk-bounded reference trajectory. Then, a dual robust subsystem-based barrier Lyapunov function (DRS-BLF) control is proposed for the PMSM-powered EMLA system to track the reference motions, guaranteeing user-specified safety related to constraints on system characteristics and alleviating control signal efforts. This methodology guarantees robustness and uniform exponential convergence. Lastly, optimal control parameter values are determined by customizing a swarm intelligence technique known as the Jaya (a term derived from the Sanskrit word for `victory') algorithm to minimize tracking errors. Experimental results validate the performance of the DRS-BLF control.
- [91] arXiv:2409.12408 [pdf, html, other]
-
Title: Mutual Information-based Representations Disentanglement for Unaligned Multimodal Language SequencesComments: 31 pages, 8 figuresSubjects: Computation and Language (cs.CL); Multimedia (cs.MM)
The key challenge in unaligned multimodal language sequences lies in effectively integrating information from various modalities to obtain a refined multimodal joint representation. Recently, the disentangle and fuse methods have achieved the promising performance by explicitly learning modality-agnostic and modality-specific representations and then fusing them into a multimodal joint representation. However, these methods often independently learn modality-agnostic representations for each modality and utilize orthogonal constraints to reduce linear correlations between modality-agnostic and modality-specific representations, neglecting to eliminate their nonlinear correlations. As a result, the obtained multimodal joint representation usually suffers from information redundancy, leading to overfitting and poor generalization of the models. In this paper, we propose a Mutual Information-based Representations Disentanglement (MIRD) method for unaligned multimodal language sequences, in which a novel disentanglement framework is designed to jointly learn a single modality-agnostic representation. In addition, the mutual information minimization constraint is employed to ensure superior disentanglement of representations, thereby eliminating information redundancy within the multimodal joint representation. Furthermore, the challenge of estimating mutual information caused by the limited labeled data is mitigated by introducing unlabeled data. Meanwhile, the unlabeled data also help to characterize the underlying structure of multimodal data, consequently further preventing overfitting and enhancing the performance of the models. Experimental results on several widely used benchmark datasets validate the effectiveness of our proposed approach.
- [92] arXiv:2409.12409 [pdf, html, other]
-
Title: LMT-Net: Lane Model Transformer Network for Automated HD Mapping from Sparse Vehicle ObservationsComments: Accepted for 2024 IEEE International Conference on Intelligent Transportation Systems (ITSC 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
In autonomous driving, High Definition (HD) maps provide a complete lane model that is not limited by sensor range and occlusions. However, the generation and upkeep of HD maps involves periodic data collection and human annotations, limiting scalability. To address this, we investigate automating the lane model generation and the use of sparse vehicle observations instead of dense sensor measurements. For our approach, a pre-processing step generates polylines by aligning and aggregating observed lane boundaries. Aligned driven traces are used as starting points for predicting lane pairs defined by the left and right boundary points. We propose Lane Model Transformer Network (LMT-Net), an encoder-decoder neural network architecture that performs polyline encoding and predicts lane pairs and their connectivity. A lane graph is formed by using predicted lane pairs as nodes and predicted lane connectivity as edges. We evaluate the performance of LMT-Net on an internal dataset that consists of multiple vehicle observations as well as human annotations as Ground Truth (GT). The evaluation shows promising results and demonstrates superior performance compared to the implemented baseline on both highway and non-highway Operational Design Domain (ODD).
- [93] arXiv:2409.12411 [pdf, html, other]
-
Title: Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM GenerationSubjects: Computation and Language (cs.CL)
Chain-of-thought prompting significantly boosts the reasoning ability of large language models but still faces three issues: hallucination problem, restricted interpretability, and uncontrollable generation. To address these challenges, we present AgentCOT, a llm-based autonomous agent framework, which can solve complex problems in an agent-style manner by multiple round LLM generation. At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence. In addition, we integrate the step's index into the reasoning process to form a graph structure for complex inference logic. We introduce two new strategies to enhance the performance of AgentCOT.We conduct extensive experiments to verify the effectiveness of our method on six common benchmarks. Results exhibit that our method brings in substantial improvements over current competitive approaches.
- [94] arXiv:2409.12412 [pdf, other]
-
Title: How to predict on-road air pollution based on street view images and machine learning: a quantitative analysis of the optimal strategySubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
On-road air pollution exhibits substantial variability over short distances due to emission sources, dilution, and physicochemical processes. Integrating mobile monitoring data with street view images (SVIs) holds promise for predicting local air pollution. However, algorithms, sampling strategies, and image quality introduce extra errors due to a lack of reliable references that quantify their effects. To bridge this gap, we employed 314 taxis to monitor NO, NO2, PM2.5 and PM10 dynamically and sampled corresponding SVIs, aiming to develop a reliable strategy. We extracted SVI features from ~ 382,000 streetscape images, which were collected at various angles (0°, 90°, 180°, 270°) and ranges (buffers with radii of 100m, 200m, 300m, 400m, 500m). Also, three machine learning algorithms alongside the linear land-used regression (LUR) model were experimented with to explore the influences of different algorithms. Four typical image quality issues were identified and discussed. Generally, machine learning methods outperform linear LUR for estimating the four pollutants, with the ranking: random forest > XGBoost > neural network > LUR. Compared to single-angle sampling, the averaging strategy is an effective method to avoid bias of insufficient feature capture. Therefore, the optimal sampling strategy is to obtain SVIs at a 100m radius buffer and extract features using the averaging strategy. This approach achieved estimation results for each aggregation location with absolute errors almost less than 2.5 {\mu}g/m^2 or ppb. Overexposure, blur, and underexposure led to image misjudgments and incorrect identifications, causing an overestimation of road features and underestimation of human-activity features, contributing to inaccurate NO, NO2, PM2.5 and PM10 estimation.
- [95] arXiv:2409.12418 [pdf, other]
-
Title: Domain-stratified Training for Cross-organ and Cross-scanner Adenocarcinoma Segmentation in the COSAS 2024 ChallengeSubjects: Computer Vision and Pattern Recognition (cs.CV)
This manuscript presents an image segmentation algorithm developed for the Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation (COSAS 2024) challenge. We adopted an organ-stratified and scanner-stratified approach to train multiple Upernet-based segmentation models and subsequently ensembled the results. Despite the challenges posed by the varying tumor characteristics across different organs and the differing imaging conditions of various scanners, our method achieved a final test score of 0.7643 for Task 1 and 0.8354 for Task 2. These results demonstrate the adaptability and efficacy of our approach across diverse conditions. Our model's ability to generalize across various datasets underscores its potential for real-world applications.
- [96] arXiv:2409.12419 [pdf, html, other]
-
Title: Shape-Space Deformer: Unified Visuo-Tactile Representations for Robotic Manipulation of Deformable ObjectsComments: Submitted to ICRA2025Subjects: Robotics (cs.RO)
Accurate modelling of object deformations is crucial for a wide range of robotic manipulation tasks, where interacting with soft or deformable objects is essential. Current methods struggle to generalise to unseen forces or adapt to new objects, limiting their utility in real-world applications. We propose Shape-Space Deformer, a unified representation for encoding a diverse range of object deformations using template augmentation to achieve robust, fine-grained reconstructions that are resilient to outliers and unwanted artefacts. Our method improves generalization to unseen forces and can rapidly adapt to novel objects, significantly outperforming existing approaches. We perform extensive experiments to test a range of force generalisation settings and evaluate our method's ability to reconstruct unseen deformations, demonstrating significant improvements in reconstruction accuracy and robustness. Our approach is suitable for real-time performance, making it ready for downstream manipulation applications.
- [97] arXiv:2409.12421 [pdf, html, other]
-
Title: Frequency-Guided Spatial Adaptation for Camouflaged Object DetectionShizhou Zhang, Dexuan Kong, Yinghui Xing, Yue Lu, Lingyan Ran, Guoqiang Liang, Hexu Wang, Yanning ZhangComments: The paper has been accepted for publication as a regular paper in the IEEE Transactions on MultimediaSubjects: Computer Vision and Pattern Recognition (cs.CV)
Camouflaged object detection (COD) aims to segment camouflaged objects which exhibit very similar patterns with the surrounding environment. Recent research works have shown that enhancing the feature representation via the frequency information can greatly alleviate the ambiguity problem between the foreground objects and the background.With the emergence of vision foundation models, like InternImage, Segment Anything Model etc, adapting the pretrained model on COD tasks with a lightweight adapter module shows a novel and promising research direction. Existing adapter modules mainly care about the feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for COD task. Specifically, we transform the input features of the adapter into frequency domain. By grouping and interacting with frequency components located within non overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, making the intensity of image details and contour features adaptively adjusted. At the same time, the features that are conducive to distinguishing object and background are highlighted, indirectly implying the position and shape of camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets and the proposed method outperforms 26 state-of-the-art methods with large margins. Code will be released.
- [98] arXiv:2409.12422 [pdf, html, other]
-
Title: An Adaptive Difference Method for Variable-Order Diffusion EquationsComments: 19 pages, 8 figuresJournal-ref: Mediterr. J. Math. 21, 145 (2024)Subjects: Numerical Analysis (math.NA); Statistical Mechanics (cond-mat.stat-mech); Computational Physics (physics.comp-ph)
An adaptive finite difference scheme for variable-order fractional-time subdiffusion equations in the Caputo form is studied. The fractional time derivative is discretized by the L1 procedure but using nonhomogeneous timesteps. The size of these timesteps is chosen by an adaptive algorithm in order to keep the local error bounded around a preset value, a value that can be chosen at will. For some types of problems this adaptive method is much faster than the corresponding usual method with fixed timesteps while keeping the local error of the numerical solution around the preset values. These findings turns out to be similar to those found for constant-order fractional diffusion equations.
- [99] arXiv:2409.12425 [pdf, html, other]
-
Title: Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold LabelsComments: 15 pagesSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels. However, this paradigm is limited by the availability of gold labels, while in certain scenarios, LLMs may need to perform tasks that are too complex for humans to provide such labels. To tackle this challenge, this study explores whether solely utilizing unlabeled data can elicit strong model capabilities. We propose a new paradigm termed zero-to-strong generalization. We iteratively prompt LLMs to annotate unlabeled data and retain high-quality labels by filtering. Surprisingly, we obverse that this iterative process gradually unlocks LLMs' potential on downstream tasks. Our experiments on extensive classification and reasoning tasks confirm the effectiveness of our proposed framework. Our analysis indicates that this paradigm is effective for both in-context learning and fine-tuning, and for various model sizes.
- [100] arXiv:2409.12426 [pdf, html, other]
-
Title: UniMSF: A Unified Multi-Sensor Fusion Framework for Intelligent Transportation System Global LocalizationWei Liu, Jiaqi Zhu, Guirong Zhuo, Wufei Fu, Zonglin Meng, Yishi Lu, Min Hua, Feng Qiao, You Li, Yi He, Lu XiongSubjects: Robotics (cs.RO)
Intelligent transportation systems (ITS) localization is of significant importance as it provides fundamental position and orientation for autonomous operations like intelligent vehicles. Integrating diverse and complementary sensors such as global navigation satellite system (GNSS) and 4D-radar can provide scalable and reliable global localization. Nevertheless, multi-sensor fusion encounters challenges including heterogeneity and time-varying uncertainty in measurements. Consequently, developing a reliable and unified multi-sensor framework remains challenging. In this paper, we introduce UniMSF, a comprehensive multi-sensor fusion localization framework for ITS, utilizing factor graphs. By integrating a multi-sensor fusion front-end, alongside outlier detection\&noise model estimation, and a factor graph optimization back-end, this framework accomplishes efficient fusion and ensures accurate localization for ITS. Specifically, in the multi-sensor fusion front-end module, we tackle the measurement heterogeneity among different modality sensors and establish effective measurement models. Reliable outlier detection and data-driven online noise estimation methods ensure that back-end optimization is immune to interference from outlier measurements. In addition, integrating multi-sensor observations via factor graph optimization offers the advantage of \enquote{plug and play}. Notably, our framework features high modularity and is seamlessly adapted to various sensor configurations. We demonstrate the effectiveness of the proposed framework through real vehicle tests by tightly integrating GNSS pseudorange and carrier phase information with IMU, and 4D-radar.
- [101] arXiv:2409.12427 [pdf, html, other]
-
Title: Sustainable Visions: Unsupervised Machine Learning Insights on Global Development GoalsAlberto García-Rodríguez, Matias Núñez, Miguel Robles Pérez, Tzipe Govezensky, Rafael A. Barrio, Carlos Gershenson, Kimmo K. Kaski, Julia TagüeñaSubjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
The United Nations 2030 Agenda for Sustainable Development outlines 17 goals to address global challenges. However, progress has been slower than expected and, consequently, there is a need to investigate the reasons behind this fact. In this study, we used a novel data-driven methodology to analyze data from 107 countries (2000$-$2022) using unsupervised machine learning techniques. Our analysis reveals strong positive and negative correlations between certain SDGs. The findings show that progress toward the SDGs is heavily influenced by geographical, cultural and socioeconomic factors, with no country on track to achieve all goals by 2030. This highlights the need for a region specific, systemic approach to sustainable development that acknowledges the complex interdependencies of the goals and the diverse capacities of nations. Our approach provides a robust framework for developing efficient and data-informed strategies, to promote cooperative and targeted initiatives for sustainable progress.
- [102] arXiv:2409.12428 [pdf, html, other]
-
Title: Is it Still Fair? A Comparative Evaluation of Fairness Algorithms through the Lens of Covariate DriftOscar Blessed Deho, Michael Bewong, Selasi Kwashie, Jiuyong Li, Jixue Liu, Lin Liu, Srecko JoksimovicSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Over the last few decades, machine learning (ML) applications have grown exponentially, yielding several benefits to society. However, these benefits are tempered with concerns of discriminatory behaviours exhibited by ML models. In this regard, fairness in machine learning has emerged as a priority research area. Consequently, several fairness metrics and algorithms have been developed to mitigate against discriminatory behaviours that ML models may possess. Yet still, very little attention has been paid to the problem of naturally occurring changes in data patterns (\textit{aka} data distributional drift), and its impact on fairness algorithms and metrics. In this work, we study this problem comprehensively by analyzing 4 fairness-unaware baseline algorithms and 7 fairness-aware algorithms, carefully curated to cover the breadth of its typology, across 5 datasets including public and proprietary data, and evaluated them using 3 predictive performance and 10 fairness metrics. In doing so, we show that (1) data distributional drift is not a trivial occurrence, and in several cases can lead to serious deterioration of fairness in so-called fair models; (2) contrary to some existing literature, the size and direction of data distributional drift is not correlated to the resulting size and direction of unfairness; and (3) choice of, and training of fairness algorithms is impacted by the effect of data distributional drift which is largely ignored in the literature. Emanating from our findings, we synthesize several policy implications of data distributional drift on fairness algorithms that can be very relevant to stakeholders and practitioners.
- [103] arXiv:2409.12431 [pdf, html, other]
-
Title: FlexiTex: Enhancing Texture Generation with Visual GuidanceDaDong Jiang, Xianghui Yang, Zibo Zhao, Sheng Zhang, Jiaao Yu, Zeqiang Lai, Shaoxiong Yang, Chunchao Guo, Xiaobo Zhou, Zhihui KeComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from large-scale text-to-image diffusion models. However, abstract textual prompts are limited in providing global textural or shape information, which results in the texture generation methods producing blurry or inconsistent patterns. To tackle this, we present FlexiTex, embedding rich information via visual guidance to generate a high-quality texture. The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details. To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining semantically global consistency. Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.
- [104] arXiv:2409.12433 [pdf, html, other]
-
Title: A High-Throughput Hardware Accelerator for Lempel-Ziv 4 Compression AlgorithmSubjects: Hardware Architecture (cs.AR); Signal Processing (eess.SP)
This paper delves into recent hardware implementations of the Lempel-Ziv 4 (LZ4) algorithm, highlighting two key factors that limit the throughput of single-kernel compressors. Firstly, the actual parallelism exhibited in single-kernel designs falls short of the theoretical potential. Secondly, the clock frequency is constrained due to the presence of the feedback loops. To tackle these challenges, we propose a novel scheme that restricts each parallelization window to a single match, thus elevating the level of actual parallelism. Furthermore, by restricting the maximum match length, we eliminate the feedback loops within the architecture, enabling a significant boost in throughput. Finally, we present a high-speed hardware architecture. The implementation results demonstrate that the proposed architecture achieves a throughput of up to 16.10 Gb/s, exhibiting a 2.648x improvement over the start-of-the-art. The new design only results in an acceptable compression ratio reduction ranging from 4.93% to 11.68% with various numbers of hash table entries, compared to the LZ4 compression ratio achieved by official software implementations disclosed on GitHub.
- [105] arXiv:2409.12435 [pdf, html, other]
-
Title: Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language ModelsComments: Codes and data are available at this https URLSubjects: Computation and Language (cs.CL)
We introduce a novel analysis that leverages linguistic minimal pairs to probe the internal linguistic representations of Large Language Models (LLMs). By measuring the similarity between LLM activation differences across minimal pairs, we quantify the and gain insight into the linguistic knowledge captured by LLMs. Our large-scale experiments, spanning 100+ LLMs and 150k minimal pairs in three languages, reveal properties of linguistic similarity from four key aspects: consistency across LLMs, relation to theoretical categorizations, dependency to semantic context, and cross-lingual alignment of relevant phenomena. Our findings suggest that 1) linguistic similarity is significantly influenced by training data exposure, leading to higher cross-LLM agreement in higher-resource languages. 2) Linguistic similarity strongly aligns with fine-grained theoretical linguistic categories but weakly with broader ones. 3) Linguistic similarity shows a weak correlation with semantic similarity, showing its context-dependent nature. 4) LLMs exhibit limited cross-lingual alignment in their understanding of relevant linguistic phenomena. This work demonstrates the potential of minimal pairs as a window into the neural representations of language in LLMs, shedding light on the relationship between LLMs and linguistic theory.
- [106] arXiv:2409.12437 [pdf, html, other]
-
Title: Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic DataJiaming Zhou, Abbas Ghaddar, Ge Zhang, Liheng Ma, Yaochen Hu, Soumyasundar Pal, Mark Coates, Bin Wang, Yingxue Zhang, Jianye HaoSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Despite recent advances in training and prompting strategies for Large Language Models (LLMs), these models continue to face challenges with complex logical reasoning tasks that involve long reasoning chains. In this work, we explore the potential and limitations of using graph-based synthetic reasoning data as training signals to enhance LLMs' reasoning capabilities. Our extensive experiments, conducted on two established natural language reasoning tasks -- inductive reasoning and spatial reasoning -- demonstrate that supervised fine-tuning (SFT) with synthetic graph-based reasoning data effectively enhances LLMs' reasoning performance without compromising their effectiveness on other standard evaluation benchmarks.
- [107] arXiv:2409.12439 [pdf, html, other]
-
Title: Event-Driven Real-Time Multi-Objective Charging Schedule Optimization For Electric Vehicle FleetsComments: 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)Subjects: Systems and Control (eess.SY)
The utilization of Electric Vehicles (EVs) in car rental services is gaining momentum around the world and most commercial fleets are expected to fully adopt EVs by 2030. At the moment, the baseline solution that most fleet operators use is a Business as Usual (BAU) policy of charging at the maximum power at all times when charging EVs. Unlike petrol prices that are fairly constant, electricity prices are more volatile and can vary vastly within several minutes depending on electricity supply which is influenced by intermittent energy supplies like renewable energy and increased demand due to electrification in many industrial sectors including transportation. The battery in EVs is the most critical component as it is the most expensive component to replace and the most dangerous component with fire risks. For safe operation and battery longevity it is imperative to prevent battery capacity fade whenever the EVs are under the control of the fleet operator such as during charging.Fundamentally, the fleet operator would like to service as much demand as possible to maximize the revenue generated at a particular time instance.This is achieved by minimizing the EVs time spent on charging and thereby increasing their availability for rides.The three goals of reducing charging cost, battery capacity fade and maximizing ride availability are formulated as a multi-objective optimization problem. The formulation is tested using the Gurobi solver on two cases from the real-world ACN dataset involving low and high EV charging densities over a week long period. The results of the proposed solution show 33.3% reduction in peak electricity loading period, 53.2% savings in charging cost and 16% lower battery capacity fade for the fleet operator.
- [108] arXiv:2409.12440 [pdf, html, other]
-
Title: Incremental and Data-Efficient Concept Formation to Support Masked Word PredictionComments: Accepted by the Eleventh Annual Conference on Advances in Cognitive SystemsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This paper introduces Cobweb4L, a novel approach for efficient language model learning that supports masked word prediction. The approach builds on Cobweb, an incremental system that learns a hierarchy of probabilistic concepts. Each concept stores the frequencies of words that appear in instances tagged with that concept label. The system utilizes an attribute value representation to encode words and their surrounding context into instances. Cobweb4L uses the information theoretic variant of category utility and a new performance mechanism that leverages multiple concepts to generate predictions. We demonstrate that with these extensions it significantly outperforms prior Cobweb performance mechanisms that use only a single node to generate predictions. Further, we demonstrate that Cobweb4L learns rapidly and achieves performance comparable to and even superior to Word2Vec. Next, we show that Cobweb4L and Word2Vec outperform BERT in the same task with less training data. Finally, we discuss future work to make our conclusions more robust and inclusive.
- [109] arXiv:2409.12443 [pdf, html, other]
-
Title: A Neural Network-based Framework for Fast and Smooth Posture Reconstruction of a Soft Continuum ArmTixian Wang, Heng-Sheng Chang, Seung Hyun Kim, Jiamiao Guo, Ugur Akcal, Benjamin Walt, Darren Biskup, Udit Halder, Girish Krishnan, Girish Chowdhary, Mattia Gazzola, Prashant G. MehtaComments: 6 pages + reference, 5 figures, submitted to ICRA 2025Subjects: Robotics (cs.RO)
A neural network-based framework is developed and experimentally demonstrated for the problem of estimating the shape of a soft continuum arm (SCA) from noisy measurements of the pose at a finite number of locations along the length of the arm. The neural network takes as input these measurements and produces as output a finite-dimensional approximation of the strain, which is further used to reconstruct the infinite-dimensional smooth posture. This problem is important for various soft robotic applications. It is challenging due to the flexible aspects that lead to the infinite-dimensional reconstruction problem for the continuous posture and strains. Because of this, past solutions to this problem are computationally intensive. The proposed fast smooth reconstruction method is shown to be five orders of magnitude faster while having comparable accuracy. The framework is evaluated on two testbeds: a simulated octopus muscular arm and a physical BR2 pneumatic soft manipulator.
- [110] arXiv:2409.12444 [pdf, html, other]
-
Title: A Lightweight and Real-Time Binaural Speech Enhancement Model with Spatial Cues PreservationSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Binaural speech enhancement (BSE) aims to jointly improve the speech quality and intelligibility of noisy signals received by hearing devices and preserve the spatial cues of the target for natural listening. Existing methods often suffer from the compromise between noise reduction (NR) capacity and spatial cues preservation (SCP) accuracy and a high computational demand in complex acoustic scenes. In this work, we present a learning-based lightweight binaural complex convolutional network (LBCCN), which excels in NR by filtering low-frequency bands and keeping the rest. Additionally, our approach explicitly incorporates the estimation of interchannel relative acoustic transfer function to ensure the spatial cues fidelity and speech clarity. Results show that the proposed LBCCN can achieve a comparable NR performance to state-of-the-art methods under various noise conditions, but with a much lower computational cost and a better SCP. The reproducible code and audio examples are available at this https URL.
- [111] arXiv:2409.12446 [pdf, html, other]
-
Title: Neural Networks Generalize on Low Complexity DataComments: Comments welcome. 27 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
We show that feedforward neural networks with ReLU activation generalize on low complexity data, suitably defined. Given i.i.d. data generated from a simple programming language, the minimum description length (MDL) feedforward neural network which interpolates the data generalizes with high probability. We define this simple programming language, along with a notion of description length of such networks. We provide several examples on basic computational tasks, such as checking primality of a natural number, and more. For primality testing, our theorem shows the following. Suppose that we draw an i.i.d. sample of $\Theta(N^{\delta}\ln N)$ numbers uniformly at random from $1$ to $N$, where $\delta\in (0,1)$. For each number $x_i$, let $y_i = 1$ if $x_i$ is a prime and $0$ if it is not. Then with high probability, the MDL network fitted to this data accurately answers whether a newly drawn number between $1$ and $N$ is a prime or not, with test error $\leq O(N^{-\delta})$. Note that the network is not designed to detect primes; minimum description learning discovers a network which does so.
- [112] arXiv:2409.12447 [pdf, html, other]
-
Title: Prompts Are Programs Too! Understanding How Developers Build Software Containing PromptsSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
The introduction of generative pre-trained models, like GPT-4, has introduced a phenomenon known as prompt engineering, whereby model users repeatedly write and revise prompts while trying to achieve a task. Using these AI models for intelligent features in software applications require using APIs that are controlled through developer-written prompts. These prompts have powered AI experiences in popular software products, potentially reaching millions of users. Despite the growing impact of prompt-powered software, little is known about its development process and its relationship to programming. In this work, we argue that some forms of prompts are programs, and that the development of prompts is a distinct phenomenon in programming. We refer to this phenomenon as prompt programming. To this end, we develop an understanding of prompt programming using Straussian grounded theory through interviews with 20 developers engaged in prompt development across a variety of contexts, models, domains, and prompt complexities.
Through this study, we contribute 14 observations about prompt programming. For example, rather than building mental models of code, prompt programmers develop mental models of the FM's behavior on the prompt and its unique qualities by interacting with the model. While prior research has shown that experts have well-formed mental models, we find that prompt programmers who have developed dozens of prompts, each with many iterations, still struggle to develop reliable mental models. This contributes to a rapid and unsystematic development process. Taken together, our observations indicate that prompt programming is significantly different from traditional software development, motivating the creation of tools to support prompt programming. Our findings have implications for software engineering practitioners, educators, and researchers. - [113] arXiv:2409.12448 [pdf, html, other]
-
Title: Infrared Small Target Detection in Satellite Videos: A New Dataset and A Novel Recurrent Feature Refinement FrameworkXinyi Ying, Li Liu, Zaipin Lin, Yangsi Shi, Yingqian Wang, Ruojing Li, Xu Cao, Boyang Li, Shilin ZhouSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multi-frame infrared small target (MIRST) detection in satellite videos is a long-standing, fundamental yet challenging task for decades, and the challenges can be summarized as: First, extremely small target size, highly complex clutters & noises, various satellite motions result in limited feature representation, high false alarms, and difficult motion analyses. Second, the lack of large-scale public available MIRST dataset in satellite videos greatly hinders the algorithm development. To address the aforementioned challenges, in this paper, we first build a large-scale dataset for MIRST detection in satellite videos (namely IRSatVideo-LEO), and then develop a recurrent feature refinement (RFR) framework as the baseline method. Specifically, IRSatVideo-LEO is a semi-simulated dataset with synthesized satellite motion, target appearance, trajectory and intensity, which can provide a standard toolbox for satellite video generation and a reliable evaluation platform to facilitate the algorithm development. For baseline method, RFR is proposed to be equipped with existing powerful CNN-based methods for long-term temporal dependency exploitation and integrated motion compensation & MIRST detection. Specifically, a pyramid deformable alignment (PDA) module and a temporal-spatial-frequency modulation (TSFM) module are proposed to achieve effective and efficient feature alignment, propagation, aggregation and refinement. Extensive experiments have been conducted to demonstrate the effectiveness and superiority of our scheme. The comparative results show that ResUNet equipped with RFR outperforms the state-of-the-art MIRST detection methods. Dataset and code are released at this https URL.
- [114] arXiv:2409.12450 [pdf, html, other]
-
Title: Domain Generalization for Endoscopic Image Segmentation by Disentangling Style-Content Information and SuperPixel ConsistencyJournal-ref: 2024 IEEE 37th International Symposium on Computer-Based Medical Systems (CBMS)Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Frequent monitoring is necessary to stratify individuals based on their likelihood of developing gastrointestinal (GI) cancer precursors. In clinical practice, white-light imaging (WLI) and complementary modalities such as narrow-band imaging (NBI) and fluorescence imaging are used to assess risk areas. However, conventional deep learning (DL) models show degraded performance due to the domain gap when a model is trained on one modality and tested on a different one. In our earlier approach, we used a superpixel-based method referred to as "SUPRA" to effectively learn domain-invariant information using color and space distances to generate groups of pixels. One of the main limitations of this earlier work is that the aggregation does not exploit structural information, making it suboptimal for segmentation tasks, especially for polyps and heterogeneous color distributions. Therefore, in this work, we propose an approach for style-content disentanglement using instance normalization and instance selective whitening (ISW) for improved domain generalization when combined with SUPRA. We evaluate our approach on two datasets: EndoUDA Barrett's Esophagus and EndoUDA polyps, and compare its performance with three state-of-the-art (SOTA) methods. Our findings demonstrate a notable enhancement in performance compared to both baseline and SOTA methods across the target domain data. Specifically, our approach exhibited improvements of 14%, 10%, 8%, and 18% over the baseline and three SOTA methods on the polyp dataset. Additionally, it surpassed the second-best method (EndoUDA) on the Barrett's Esophagus dataset by nearly 2%.
- [115] arXiv:2409.12452 [pdf, html, other]
-
Title: CodePlan: Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form PlanningSubjects: Computation and Language (cs.CL)
Despite the remarkable success of large language models (LLMs) on traditional natural language processing tasks, their planning ability remains a critical bottleneck in tackling complex multi-step reasoning tasks. Existing approaches mainly rely on prompting or task-specific fine-tuning, often suffering from weak robustness and cross-task generalization. To address the limitation, we introduce CODEPLAN, a scalable paradigm that empowers LLMs to generate and follow code-form plans pseudocode that outlines high-level, structured reasoning processes. By leveraging the structured and versatile nature of code, CODEPLAN effectively captures the rich semantics and control flows inherent to sophisticated reasoning. Importantly, CODEPLAN allows the automatic extraction of code-form plans from massive, wide-ranging text corpora without the need for curated, task-specific datasets. This enables it to scale up efficiently and improve reasoning capabilities across diverse scenarios. To train CODEPLAN, we construct a large-scale dataset of 2M examples that integrate code-form plans with standard prompt-response pairs from existing corpora. With minimal computation overhead during both training and inference, CODEPLAN achieves a 25.1% relative improvement compared with directly generating responses, averaged across 13 challenging multi-step reasoning benchmarks, spanning mathematical reasoning, symbolic reasoning, instruction-following, multi-hop QA, and decision-making tasks. Further analysis reveals CODEPLAN's increasing performance gains on more complex reasoning tasks, as well as significant data efficiency thanks to its generalization ability.
- [116] arXiv:2409.12454 [pdf, html, other]
-
Title: FoME: A Foundation Model for EEG using Adaptive Temporal-Lateral Attention ScalingSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Electroencephalography (EEG) is a vital tool to measure and record brain activity in neuroscience and clinical applications, yet its potential is constrained by signal heterogeneity, low signal-to-noise ratios, and limited labeled datasets. In this paper, we propose FoME (Foundation Model for EEG), a novel approach using adaptive temporal-lateral attention scaling to address above-mentioned challenges. FoME is pre-trained on a diverse 1.7TB dataset of scalp and intracranial EEG recordings, comprising 745M parameters trained for 1,096k steps. Our model introduces two key innovations: a time-frequency fusion embedding technique and an adaptive time-lateral attention scaling (ATLAS) mechanism. These components synergistically capture complex temporal and spectral EEG dynamics, enabling FoME to adapt to varying patterns across diverse data streams and facilitate robust multi-channel modeling. Evaluations across four downstream tasks demonstrate FoME's superior performance in classification and forecasting applications, consistently achieving state-of-the-art results. To conclude, FoME establishes a new paradigm for EEG analysis, offering a versatile foundation that advances brain-computer interfaces, clinical diagnostics, and cognitive research across neuroscience and related fields. Our code will be available at this https URL.
- [117] arXiv:2409.12455 [pdf, html, other]
-
Title: MuxHand: A Cable-driven Dexterous Robotic Hand Using Time-division Multiplexing MotorsComments: 7 pagesSubjects: Robotics (cs.RO)
The robotic dexterous hand is responsible for both grasping and dexterous manipulation. The number of motors directly influences both the dexterity and the cost of such systems. In this paper, we present MuxHand, a robotic hand that employs a time-division multiplexing motor (TDMM) mechanism. This system allows 9 cables to be independently controlled by just 4 motors, significantly reducing cost while maintaining high dexterity. To enhance stability and smoothness during grasping and manipulation tasks, we have integrated magnetic joints into the three 3D-printed fingers. These joints offer superior impact resistance and self-resetting capabilities. We conduct a series of experiments to evaluate the grasping and manipulation performance of MuxHand. The results demonstrate that the TDMM mechanism can precisely control each cable connected to the finger joints, enabling robust grasping and dexterous manipulation. Furthermore, the fingertip load capacity reached 1.0 kg, and the magnetic joints effectively absorbed impact and corrected misalignments without damage.
- [118] arXiv:2409.12456 [pdf, html, other]
-
Title: Bayesian-Optimized One-Step Diffusion Model with Knowledge Distillation for Real-Time 3D Human Motion PredictionSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Human motion prediction is a cornerstone of human-robot collaboration (HRC), as robots need to infer the future movements of human workers based on past motion cues to proactively plan their motion, ensuring safety in close collaboration scenarios. The diffusion model has demonstrated remarkable performance in predicting high-quality motion samples with reasonable diversity, but suffers from a slow generative process which necessitates multiple model evaluations, hindering real-world applications. To enable real-time prediction, in this work, we propose training a one-step multi-layer perceptron-based (MLP-based) diffusion model for motion prediction using knowledge distillation and Bayesian optimization. Our method contains two steps. First, we distill a pretrained diffusion-based motion predictor, TransFusion, directly into a one-step diffusion model with the same denoiser architecture. Then, to further reduce the inference time, we remove the computationally expensive components from the original denoiser and use knowledge distillation once again to distill the obtained one-step diffusion model into an even smaller model based solely on MLPs. Bayesian optimization is used to tune the hyperparameters for training the smaller diffusion model. Extensive experimental studies are conducted on benchmark datasets, and our model can significantly improve the inference speed, achieving real-time prediction without noticeable degradation in performance.
- [119] arXiv:2409.12457 [pdf, html, other]
-
Title: Canonical forms for matrix tuples in polynomial timeComments: Accepted to FOCS 2024Subjects: Data Structures and Algorithms (cs.DS)
Left-right and conjugation actions on matrix tuples have received considerable attention in theoretical computer science due to their connections with polynomial identity testing, group isomorphism, and tensor isomorphism. In this paper, we present polynomial-time algorithms for computing canonical forms of matrix tuples over a finite field under these actions. Our algorithm builds upon new structural insights for matrix tuples, which can be viewed as a generalization of Schur's lemma for irreducible representations to general representations.
- [120] arXiv:2409.12461 [pdf, html, other]
-
Title: Verification with Common Knowledge of Rationality for Graph GamesSubjects: Computer Science and Game Theory (cs.GT)
Realizability asks whether there exists a program satisfying its specification. In this problem, we assume that each agent has her own objective and behaves rationally to satisfy her objective. Traditionally, the rationality of agents is modeled by a Nash equilibrium (NE), where each agent has no incentive to change her strategy because she cannot satisfy her objective by changing her strategy alone. However, an NE is not always an appropriate notion for the rationality of agents because the condition of an NE is too strong; each agent is assumed to know strategies of the other agents completely. In this paper, we use an epistemic model to define common knowledge of rationality of all agents (CKR). We define the verification problem as a variant of the realizability problem, based on CKR, instead of NE. We then analyze the complexity of the verification problems for the class of positional strategies.
- [121] arXiv:2409.12465 [pdf, html, other]
-
Title: Galileo: A Pseudospectral Collocation Framework for Legged RobotsComments: This extended abstract was accepted for presentation at ICRA@40Subjects: Robotics (cs.RO)
Dynamic maneuvers for legged robots present a difficult challenge due to the complex dynamics and contact constraints. This paper introduces a versatile trajectory optimization framework for continuous-time multi-phase problems. We introduce a new transcription scheme that enables pseudospectral collocation to optimize directly on Lie Groups, such as SE(3) and quaternions without special normalization constraints. The key insight is the change of variables - we choose to optimize over the history of the tangent vectors rather than the states themselves. Our approach uses a modified Legendre-Gauss-Radau (LGR) method to produce dynamic motions for various legged robots. We implement our approach as a Model Predictive Controller (MPC) and track the MPC output using a Quadratic Program (QP) based whole-body controller. Results on the Go1 Unitree and WPI HURON humanoid confirm the feasibility of the planned trajectories.
- [122] arXiv:2409.12466 [pdf, other]
-
Title: AudioEditor: A Training-Free Diffusion-Based Audio Editing FrameworkSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Diffusion-based text-to-audio (TTA) generation has made substantial progress, leveraging latent diffusion model (LDM) to produce high-quality, diverse and instruction-relevant audios. However, beyond generation, the task of audio editing remains equally important but has received comparatively little attention. Audio editing tasks face two primary challenges: executing precise edits and preserving the unedited sections. While workflows based on LDMs have effectively addressed these challenges in the field of image processing, similar approaches have been scarcely applied to audio editing. In this paper, we introduce AudioEditor, a training-free audio editing framework built on the pretrained diffusion-based TTA model. AudioEditor incorporates Null-text Inversion and EOT-suppression methods, enabling the model to preserve original audio features while executing accurate edits. Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in delivering high-quality audio edits. Code and demo can be found at this https URL.
- [123] arXiv:2409.12467 [pdf, html, other]
-
Title: SurgPLAN++: Universal Surgical Phase Localization Network for Online and Offline InferenceZhen Chen, Xingjian Luo, Jinlin Wu, Long Bai, Zhen Lei, Hongliang Ren, Sebastien Ourselin, Hongbin LiuSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Surgical phase recognition is critical for assisting surgeons in understanding surgical videos. Existing studies focused more on online surgical phase recognition, by leveraging preceding frames to predict the current frame. Despite great progress, they formulated the task as a series of frame-wise classification, which resulted in a lack of global context of the entire procedure and incoherent predictions. Moreover, besides online analysis, accurate offline surgical phase recognition is also in significant clinical need for retrospective analysis, and existing online algorithms do not fully analyze the entire video, thereby limiting accuracy in offline analysis. To overcome these challenges and enhance both online and offline inference capabilities, we propose a universal Surgical Phase Localization Network, named SurgPLAN++, with the principle of temporal detection. To ensure a global understanding of the surgical procedure, we devise a phase localization strategy for SurgPLAN++ to predict phase segments across the entire video through phase proposals. For online analysis, to generate high-quality phase proposals, SurgPLAN++ incorporates a data augmentation strategy to extend the streaming video into a pseudo-complete video through mirroring, center-duplication, and down-sampling. For offline analysis, SurgPLAN++ capitalizes on its global phase prediction framework to continuously refine preceding predictions during each online inference step, thereby significantly improving the accuracy of phase recognition. We perform extensive experiments to validate the effectiveness, and our SurgPLAN++ achieves remarkable performance in both online and offline modes, which outperforms state-of-the-art methods. The source code is available at this https URL.
- [124] arXiv:2409.12468 [pdf, html, other]
-
Title: Familiarity-aware Evidence Compression for Retrieval Augmented GenerationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Retrieval Augmented Generation (RAG) improves large language models (LMs) by incorporating non-parametric knowledge through evidence retrieval from external sources. However, it often struggles to filter out inconsistent and irrelevant information that can distract the LM from its tasks. While compressing the retrieved evidence with a compression model aims to address this issue, the compressed evidence may still be unfamiliar to the target model used for downstream task, potentially failing to utilize the evidence effectively. We propose FaviComp (Familiarity-aware Evidence Compression), a novel training-free evidence compression technique that makes retrieved evidence more familiar to the target model, while seamlessly integrating parametric knowledge from the model. Specifically, FaviComp proactively lowers the perplexity of the compressed evidence with regard to the target model by combining token probabilities from both the compression model and the target model to generate context that is more familiar to the target model. This approach balances the integration of parametric and non-parametric knowledge, which is especially helpful in complex tasks where the retrieved evidence set may not contain all the necessary information. Experimental results demonstrate that FaviComp consistently outperforms existing baselines in multiple open-domain QA datasets, achieving high compression rates and showcasing the effective integration of both parametric and non-parametric knowledge.
- [125] arXiv:2409.12469 [pdf, other]
-
Title: From Data to Control: A Formal Compositional Framework for Large-Scale Interconnected NetworksSubjects: Systems and Control (eess.SY)
We introduce a compositional data-driven methodology for designing fully-decentralized safety controllers applicable to large-scale interconnected networks, encompassing subsystems with unknown mathematical models. Our compositional scheme leverages the interconnection topology and breaks down the network analysis into the examination of distinct subsystems. This is accompanied by utilizing a concept of control storage certificates (CSCs) to capture joint dissipativity-type properties among subsystems. These CSCs are instrumental in a compositional derivation of a control barrier certificate (CBC) specialized for the interconnected network, thereby ensuring its safety. In our data-driven scheme, we gather solely one input-output trajectory from each unknown subsystem within a specified time frame. By fulfilling a specific rank condition, this process facilitates the construction of a CSC for each subsystem. Following this, by adhering to compositional dissipativity reasoning, we compose CSCs derived from data and build a CBC for the unknown network, ensuring its safety over an infinite time horizon, while providing correctness guarantees. We demonstrate that our compositional data-driven approach significantly enhances the design of a CBC and its safety controller across the interconnected network. This advancement is achieved by reducing the computational complexity from a polynomial growth in relation to network dimension, when using sum-of-squares (SOS) optimization, to a linear scale based on the number of subsystems. We additionally demonstrate that the dissipativity-type compositionality condition can benefit from the structure of interconnection topology and potentially be fulfilled regardless of the number of subsystems. We apply our data-driven findings to a variety of benchmarks, involving physical networks with unknown models and diverse interconnection topologies.
- [126] arXiv:2409.12470 [pdf, html, other]
-
Title: HSIGene: A Foundation Model For Hyperspectral Image GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Hyperspectral image (HSI) plays a vital role in various fields such as agriculture and environmental monitoring. However, due to the expensive acquisition cost, the number of hyperspectral images is limited, degenerating the performance of downstream tasks. Although some recent studies have attempted to employ diffusion models to synthesize HSIs, they still struggle with the scarcity of HSIs, affecting the reliability and diversity of the generated images. Some studies propose to incorporate multi-modal data to enhance spatial diversity, but the spectral fidelity cannot be ensured. In addition, existing HSI synthesis models are typically uncontrollable or only support single-condition control, limiting their ability to generate accurate and reliable HSIs. To alleviate these issues, we propose HSIGene, a novel HSI generation foundation model which is based on latent diffusion and supports multi-condition control, allowing for more precise and reliable HSI generation. To enhance the spatial diversity of the training data while preserving spectral fidelity, we propose a new data augmentation method based on spatial super-resolution, in which HSIs are upscaled first, and thus abundant training patches could be obtained by cropping the high-resolution HSIs. In addition, to improve the perceptual quality of the augmented data, we introduce a novel two-stage HSI super-resolution framework, which first applies RGB bands super-resolution and then utilizes our proposed Rectangular Guided Attention Network (RGAN) for guided HSI super-resolution. Experiments demonstrate that the proposed model is capable of generating a vast quantity of realistic HSIs for downstream tasks such as denoising and super-resolution. The code and models are available at this https URL.
- [127] arXiv:2409.12471 [pdf, html, other]
-
Title: Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-centric Navigation Using Generative-Model-based Environment GenerationVolodymyr Shcherbyna1, Linh Kästner, Diego Diaz, Huu Giang Nguyen, Maximilian Ho-Kyoung Schreff, Tim Lenz, Jonas Kreutz, Ahmed Martban, Huajian Zeng, Harold SohComments: 7 pages, 7 figuresSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Building on the foundations of our previous work, this paper introduces Arena 4.0, a significant advancement over Arena 3.0, Arena-Bench, Arena 1.0, and Arena 2.0. Arena 4.0 offers three key novel contributions: (1) a generative-model-based world and scenario generation approach that utilizes large language models (LLMs) and diffusion models to dynamically generate complex, human-centric environments from text prompts or 2D floorplans, useful for the development and benchmarking of social navigation strategies; (2) a comprehensive 3D model database, extendable with additional 3D assets that are semantically linked and annotated for dynamic spawning and arrangement within 3D worlds; and (3) a complete migration to ROS 2, enabling compatibility with modern hardware and enhanced functionalities for improved navigation, usability, and easier deployment on real robots. We evaluated the platform's performance through a comprehensive user study, demonstrating significant improvements in usability and efficiency compared to previous versions. Arena 4.0 is openly available at this https URL.
- [128] arXiv:2409.12472 [pdf, html, other]
-
Title: TEAM: Temporal Adversarial Examples Attack Model against Network Intrusion Detection System Applied to RNNSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
With the development of artificial intelligence, neural networks play a key role in network intrusion detection systems (NIDS). Despite the tremendous advantages, neural networks are susceptible to adversarial attacks. To improve the reliability of NIDS, many research has been conducted and plenty of solutions have been proposed. However, the existing solutions rarely consider the adversarial attacks against recurrent neural networks (RNN) with time steps, which would greatly affect the application of NIDS in real world. Therefore, we first propose a novel RNN adversarial attack model based on feature reconstruction called \textbf{T}emporal adversarial \textbf{E}xamples \textbf{A}ttack \textbf{M}odel \textbf{(TEAM)}, which applied to time series data and reveals the potential connection between adversarial and time steps in RNN. That is, the past adversarial examples within the same time steps can trigger further attacks on current or future original examples. Moreover, TEAM leverages Time Dilation (TD) to effectively mitigates the effect of temporal among adversarial examples within the same time steps. Experimental results show that in most attack categories, TEAM improves the misjudgment rate of NIDS on both black and white boxes, making the misjudgment rate reach more than 96.68%. Meanwhile, the maximum increase in the misjudgment rate of the NIDS for subsequent original samples exceeds 95.57%.
- [129] arXiv:2409.12475 [pdf, html, other]
-
Title: Reference Dataset and Benchmark for Reconstructing Laser Parameters from On-axis Video in Powder Bed Fusion of Bulk Stainless SteelComments: Dataset download: this https URLJournal-ref: Additive Manufacturing Letters 7 (2023) 1-7Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present RAISE-LPBF, a large dataset on the effect of laser power and laser dot speed in powder bed fusion (LPBF) of 316L stainless steel bulk material, monitored by on-axis 20k FPS video. Both process parameters are independently sampled for each scan line from a continuous distribution, so interactions of different parameter choices can be investigated. The data can be used to derive statistical properties of LPBF, as well as to build anomaly detectors. We provide example source code for loading the data, baseline machine learning models and results, and a public benchmark to evaluate predictive models.
- [130] arXiv:2409.12476 [pdf, html, other]
-
Title: AutoMode-ASR: Learning to Select ASR Systems for Better Quality and CostAhmet Gündüz, Yunsu Kim, Kamer Ali Yuksel, Mohamed Al-Badrashiny, Thiago Castro Ferreira, Hassan SawafComments: SPECOM 2024 ConferenceSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We present AutoMode-ASR, a novel framework that effectively integrates multiple ASR systems to enhance the overall transcription quality while optimizing cost. The idea is to train a decision model to select the optimal ASR system for each segment based solely on the audio input before running the systems. We achieve this by ensembling binary classifiers determining the preference between two systems. These classifiers are equipped with various features, such as audio embeddings, quality estimation, and signal properties. Additionally, we demonstrate how using a quality estimator can further improve performance with minimal cost increase. Experimental results show a relative reduction in WER of 16.2%, a cost saving of 65%, and a speed improvement of 75%, compared to using a single-best model for all segments. Our framework is compatible with commercial and open-source black-box ASR systems as it does not require changes in model codes.
- [131] arXiv:2409.12477 [pdf, html, other]
-
Title: ViolinDiff: Enhancing Expressive Violin Synthesis with Pitch Bend ConditioningSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Modeling the natural contour of fundamental frequency (F0) plays a critical role in music audio synthesis. However, transcribing and managing multiple F0 contours in polyphonic music is challenging, and explicit F0 contour modeling has not yet been explored for polyphonic instrumental synthesis. In this paper, we present ViolinDiff, a two-stage diffusion-based synthesis framework. For a given violin MIDI file, the first stage estimates the F0 contour as pitch bend information, and the second stage generates mel spectrogram incorporating these expressive details. The quantitative metrics and listening test results show that the proposed model generates more realistic violin sounds than the model without explicit pitch bend modeling. Audio samples are available online: this http URL.
- [132] arXiv:2409.12479 [pdf, html, other]
-
Title: Learning Multi-Manifold Embedding for Out-Of-Distribution DetectionComments: European Conference on Computer Vision ECCV 2024 BEW Workshop Best PaperSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Detecting out-of-distribution (OOD) samples is crucial for trustworthy AI in real-world applications. Leveraging recent advances in representation learning and latent embeddings, Various scoring algorithms estimate distributions beyond the training data. However, a single embedding space falls short in characterizing in-distribution data and defending against diverse OOD conditions. This paper introduces a novel Multi-Manifold Embedding Learning (MMEL) framework, optimizing hypersphere and hyperbolic spaces jointly for enhanced OOD detection. MMEL generates representative embeddings and employs a prototype-aware scoring function to differentiate OOD samples. It operates with very few OOD samples and requires no model retraining. Experiments on six open datasets demonstrate MMEL's significant reduction in FPR while maintaining a high AUC compared to state-of-the-art distance-based OOD detection methods. We analyze the effects of learning multiple manifolds and visualize OOD score distributions across datasets. Notably, enrolling ten OOD samples without retraining achieves comparable FPR and AUC to modern outlier exposure methods using 80 million outlier samples for model training.
- [133] arXiv:2409.12481 [pdf, html, other]
-
Title: A physics-enhanced multi-modal fused neural network for predicting contamination length interval in pipelineComments: 16 pages, 9 figures. This paper is one of the research outputs of the intelligent oil and gas pipeline in our team, which can be abbreviated as "DeepPipe". This paper have been submitted to the journal and is under reviewSubjects: Computational Engineering, Finance, and Science (cs.CE)
During the operation of a multi-product pipeline, an accurate and effective prediction of contamination length interval is the central key to guiding the cutting plan formulation and improving the economic effect. However, the existing methods focus on extracting implicit principles and insufficient feature correlations in a data-driven pattern but overlook the potential knowledge in the scientific theory of contamination development, may cause practically useless results. Consequently, in this study, the holistic feature correlations and physical knowledge are extracted and integrated into the neural network to propose a physics-enhanced adaptive multi-modal fused neural network (PE-AMFNN) for contamination length interval prediction. In PE-AMFNN, a multi-modal adaptive feature fusion module is created to establish a comprehensive feature space with quantified feature importance, thus capturing sufficient feature correlations. Subsequently, a mechanism-coupled customized neural network is designed to incorporate the explicit scientific principle into the forward and backward propagation. Besides, a physics-embedded loss function, which introduces interval differences and interval correlation constraints, is established to unearth the latent physical knowledge in contamination development and force the model to draw physically unreasonable results. Validation on the real-world cases implies that the proposed model outperforms the start-of-art techniques and latest achievements, with Root Mean Squared Relative Errors reduced by 31% and 36% in lower and upper limit prediction. Furthermore, the sensitivity analysis of model modules suggests that both the multi-modal feature fusion and the physical principle are crucial for model improvements
- [134] arXiv:2409.12490 [pdf, html, other]
-
Title: CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Large language models have achieved notable success across various domains, yet efficient inference is still limited by the quadratic computation complexity of the attention mechanism. The inference consists of prefilling and decoding phases. Although several attempts have been made to accelerate decoding, the inefficiency of the prefilling phase, especially for long-context tasks, remains a challenge. In this paper, we observe a locality in query criticality during the prefilling phase of long-context processing: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache. Based on this observation, we propose CritiPrefill, a criticality-based segment-wise prefilling method. This method partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality. By pruning non-critical computations between query segments and cache blocks in the self-attention mechanism, the prefilling process can be significantly accelerated. Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU, with minimal quality degradation.
- [135] arXiv:2409.12493 [pdf, html, other]
-
Title: ConvexECG: Lightweight and Explainable Neural Networks for Personalized, Continuous Cardiac MonitoringSubjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
We present ConvexECG, an explainable and resource-efficient method for reconstructing six-lead electrocardiograms (ECG) from single-lead data, aimed at advancing personalized and continuous cardiac monitoring. ConvexECG leverages a convex reformulation of a two-layer ReLU neural network, enabling the potential for efficient training and deployment in resource constrained environments, while also having deterministic and explainable behavior. Using data from 25 patients, we demonstrate that ConvexECG achieves accuracy comparable to larger neural networks while significantly reducing computational overhead, highlighting its potential for real-time, low-resource monitoring applications.
- [136] arXiv:2409.12499 [pdf, html, other]
-
Title: End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal PromptingSubjects: Computer Vision and Pattern Recognition (cs.CV)
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based Transformer decoder, where the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and accordingly, an auxiliary relationship loss is designed to enable the decoder to perceive the relationships between objects explicitly. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP well to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for visual representation and vision-guided language prompting for language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. Our framework is also applied to a more difficult cross-dataset scenario to further demonstrate its generalization ability.
- [137] arXiv:2409.12500 [pdf, html, other]
-
Title: LLMR: Knowledge Distillation with a Large Language Model-Induced RewardComments: Accepted by LERC COLING 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models have become increasingly popular and demonstrated remarkable performance in various natural language processing (NLP) tasks. However, these models are typically computationally expensive and difficult to be deployed in resource-constrained environments. In this paper, we propose LLMR, a novel knowledge distillation (KD) method based on a reward function induced from large language models. We conducted experiments on multiple datasets in the dialogue generation and summarization tasks. Empirical results demonstrate that our LLMR approach consistently outperforms traditional KD methods in different tasks and datasets.
- [138] arXiv:2409.12504 [pdf, html, other]
-
Title: Sustainable Placement with Cost Minimization in Wireless Digital Twin NetworksSubjects: Networking and Internet Architecture (cs.NI)
Digital twin (DT) technology has a high potential to satisfy different requirements of the ever-expanding new applications. Nonetheless, the DT placement in wireless digital twin networks (WDTNs) poses a significant challenge due to the conflict between unpredictable workloads and the limited capacity of edge servers. In other words, each edge server has a risk of overload when handling an excessive number of tasks or services. Overload risks can have detrimental effects on a network's sustainability, yet this aspect is often overlooked in the literature. In this paper, we aim to study the sustainability-aware DT placement problem for WDTNs from a cost minimization perspective. To this end, we formulate the DT placement-driven cost optimization problem as a chance-constrained integer programming problem. For tractability, we transform the original non-deterministic problem into a deterministic integer linear programming (ILP) problem using the sample average approximation (SAA) approach. We prove that the transformed problem remains NP-hard and thus finding a global optimal solution is very difficult. To strike a balance between time efficiency and performance guarantee, we propose an improved local search algorithm for this ILP by identifying high-quality starting states from historical search data and enhancing the search process. Numerical results show a lower cost and higher efficiency of our proposed method compared with the previous schemes.
- [139] arXiv:2409.12505 [pdf, html, other]
-
Title: Accurately Tracking Relative Positions of Moving Trackers based on UWB Ranging and Inertial Sensing without AnchorsComments: Accepted at IROS2024Subjects: Robotics (cs.RO)
We present a tracking system for relative positioning that can operate on entirely moving tracking nodes without the need for stationary anchors. Each node embeds a 9-DOF magnetic and inertial measurement unit and a single-antenna ultra-wideband radio. We introduce a multi-stage filtering pipeline through which our system estimates the relative layout of all tracking nodes within the group. The key novelty of our method is the integration of a custom Extended Kalman filter (EKF) with a refinement step via multidimensional scaling (MDS). Our method integrates the MDS output back into the EKF, thereby creating a dynamic feedback loop for more robust estimates. We complement our method with UWB ranging protocol that we designed to allow tracking nodes to opportunistically join and leave the group. In our evaluation with constantly moving nodes, our system estimated relative positions with an error of 10.2cm (in 2D) and 21.7cm (in 3D), including obstacles that occluded the line of sight between tracking nodes. Our approach requires no external infrastructure, making it particularly suitable for operation in environments where stationary setups are impractical.
- [140] arXiv:2409.12507 [pdf, html, other]
-
Title: Towards Low-latency Event-based Visual Recognition with Hybrid Step-wise Distillation Spiking Neural NetworksSubjects: Computer Vision and Pattern Recognition (cs.CV)
Spiking neural networks (SNNs) have garnered significant attention for their low power consumption and high biological interpretability. Their rich spatio-temporal information processing capability and event-driven nature make them ideally well-suited for neuromorphic datasets. However, current SNNs struggle to balance accuracy and latency in classifying these datasets. In this paper, we propose Hybrid Step-wise Distillation (HSD) method, tailored for neuromorphic datasets, to mitigate the notable decline in performance at lower time steps. Our work disentangles the dependency between the number of event frames and the time steps of SNNs, utilizing more event frames during the training stage to improve performance, while using fewer event frames during the inference stage to reduce latency. Nevertheless, the average output of SNNs across all time steps is susceptible to individual time step with abnormal outputs, particularly at extremely low time steps. To tackle this issue, we implement Step-wise Knowledge Distillation (SKD) module that considers variations in the output distribution of SNNs at each time step. Empirical evidence demonstrates that our method yields competitive performance in classification tasks on neuromorphic datasets, especially at lower time steps. Our code will be available at: {this https URL}.
- [141] arXiv:2409.12512 [pdf, html, other]
-
Title: Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language ModelsSubjects: Computation and Language (cs.CL)
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them. The success of KD in auto-regressive language models mainly relies on Reverse KL for mode-seeking and student-generated output (SGO) to combat exposure bias. Our theoretical analyses and experimental validation reveal that while Reverse KL effectively mimics certain features of the teacher distribution, it fails to capture most of its behaviors. Conversely, SGO incurs higher computational costs and presents challenges in optimization, particularly when the student model is significantly smaller than the teacher model. These constraints are primarily due to the immutable distribution of the teacher model, which fails to adjust adaptively to models of varying sizes. We introduce Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model. This strategy abolishes the necessity for on-policy sampling and merely requires minimal updates to the parameters of the teacher's online module during training, thereby allowing dynamic adaptation to the student's distribution to make distillation better. Extensive results across multiple generation datasets show that OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
- [142] arXiv:2409.12514 [pdf, html, other]
-
Title: TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic ManipulationJunjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian TangSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this paper, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive evaluations of TinyVLA in both simulation and on real robots, demonstrating that our approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in terms of speed and data efficiency, while delivering comparable or superior performance. Additionally, TinyVLA exhibits strong generalization capabilities across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding the performance of OpenVLA. We believe that \methodname offers an interesting perspective on utilizing pre-trained multimodal models for policy learning. Our project is at this https URL.
- [143] arXiv:2409.12517 [pdf, html, other]
-
Title: Scaling FP8 training to trillion-token LLMsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens -- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a $\sim 34 \%$ throughput improvement.
- [144] arXiv:2409.12518 [pdf, html, other]
-
Title: Hi-SLAM: Scaling-up Semantics in SLAM with a Hierarchically Categorical Gaussian SplattingComments: 6 pages, 4 figuresSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
We propose Hi-SLAM, a semantic 3D Gaussian Splatting SLAM method featuring a novel hierarchical categorical representation, which enables accurate global 3D semantic mapping, scaling-up capability, and explicit semantic label prediction in the 3D world. The parameter usage in semantic SLAM systems increases significantly with the growing complexity of the environment, making it particularly challenging and costly for scene understanding. To address this problem, we introduce a novel hierarchical representation that encodes semantic information in a compact form into 3D Gaussian Splatting, leveraging the capabilities of large language models (LLMs). We further introduce a novel semantic loss designed to optimize hierarchical semantic information through both inter-level and cross-level optimization. Furthermore, we enhance the whole SLAM system, resulting in improved tracking and mapping performance. Our Hi-SLAM outperforms existing dense SLAM methods in both mapping and tracking accuracy, while achieving a 2x operation speed-up. Additionally, it exhibits competitive performance in rendering semantic segmentation in small synthetic scenes, with significantly reduced storage and training time requirements. Rendering FPS impressively reaches 2,000 with semantic information and 3,000 without it. Most notably, it showcases the capability of handling the complex real-world scene with more than 500 semantic classes, highlighting its valuable scaling-up capability.
- [145] arXiv:2409.12519 [pdf, html, other]
-
Title: Multi-View Adaptive Contrastive Learning for Information Retrieval Based Fault LocalizationSubjects: Software Engineering (cs.SE); Information Retrieval (cs.IR)
Most studies focused on information retrieval-based techniques for fault localization, which built representations for bug reports and source code files and matched their semantic vectors through similarity measurement. However, such approaches often ignore some useful information that might help improve localization performance, such as 1) the interaction relationship between bug reports and source code files; 2) the similarity relationship between bug reports; and 3) the co-citation relationship between source code files. In this paper, we propose a novel approach named Multi-View Adaptive Contrastive Learning for Information Retrieval Fault Localization (MACL-IRFL) to learn the above-mentioned relationships for software fault localization. Specifically, we first generate data augmentations from report-code interaction view, report-report similarity view and code-code co-citation view separately, and adopt graph neural network to aggregate the information of bug reports or source code files from the three views in the embedding process. Moreover, we perform contrastive learning across these views. Our design of contrastive learning task will force the bug report representations to encode information shared by report-report and report-code views,and the source code file representations shared by code-code and report-code views, thereby alleviating the noise from auxiliary information. Finally, to evaluate the performance of our approach, we conduct extensive experiments on five open-source Java projects. The results show that our model can improve over the best baseline up to 28.93%, 25.57% and 20.35% on Accuracy@1, MAP and MRR, respectively.
- [146] arXiv:2409.12521 [pdf, html, other]
-
Title: GraspSAM: When Segment Anything Model Meets Grasp DetectionComments: 6 pages (main), 1 page (references)Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Grasp detection requires flexibility to handle objects of various shapes without relying on prior knowledge of the object, while also offering intuitive, user-guided control. This paper introduces GraspSAM, an innovative extension of the Segment Anything Model (SAM), designed for prompt-driven and category-agnostic grasp detection. Unlike previous methods, which are often limited by small-scale training data, GraspSAM leverages the large-scale training and prompt-based segmentation capabilities of SAM to efficiently support both target-object and category-agnostic grasping. By utilizing adapters, learnable token embeddings, and a lightweight modified decoder, GraspSAM requires minimal fine-tuning to integrate object segmentation and grasp prediction into a unified framework. The model achieves state-of-the-art (SOTA) performance across multiple datasets, including Jacquard, Grasp-Anything, and Grasp-Anything++. Extensive experiments demonstrate the flexibility of GraspSAM in handling different types of prompts (such as points, boxes, and language), highlighting its robustness and effectiveness in real-world robotic applications.
- [147] arXiv:2409.12522 [pdf, html, other]
-
Title: Prompting Segment Anything Model with Domain-Adaptive Prototype for Generalizable Medical Image SegmentationComments: Accepted by the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Deep learning based methods often suffer from performance degradation caused by domain shift. In recent years, many sophisticated network structures have been designed to tackle this problem. However, the advent of large model trained on massive data, with its exceptional segmentation capability, introduces a new perspective for solving medical segmentation problems. In this paper, we propose a novel Domain-Adaptive Prompt framework for fine-tuning the Segment Anything Model (termed as DAPSAM) to address single-source domain generalization (SDG) in segmenting medical images. DAPSAM not only utilizes a more generalization-friendly adapter to fine-tune the large model, but also introduces a self-learning prototype-based prompt generator to enhance model's generalization ability. Specifically, we first merge the important low-level features into intermediate features before feeding to each adapter, followed by an attention filter to remove redundant information. This yields more robust image embeddings. Then, we propose using a learnable memory bank to construct domain-adaptive prototypes for prompt generation, helping to achieve generalizable medical image segmentation. Extensive experimental results demonstrate that our DAPSAM achieves state-of-the-art performance on two SDG medical image segmentation tasks with different modalities. The code is available at this https URL.
- [148] arXiv:2409.12524 [pdf, html, other]
-
Title: Should RAG Chatbots Forget Unimportant Conversations? Exploring Importance and Forgetting with Psychological InsightsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
While Retrieval-Augmented Generation (RAG) has shown promise in enhancing long-term conversations, the increasing memory load as conversations progress degrades retrieval accuracy. Drawing on psychological insights, we propose LUFY, a simple yet effective method that focuses on emotionally arousing memories and retains less than 10% of the conversation. In the user experiment, participants interacted with three types of RAG chatbots, each for 2 hours over 4 sessions, marking the most extensive assessment of a chatbot's long-term capabilities to date -- more than four times longer than any existing benchmark. The results demonstrate that prioritizing arousing memories while forgetting the majority of the conversation significantly enhances user experience. This study pushes the frontier of long-term conversations and highlights the importance of forgetting unimportant parts of conversations. Code and Dataset: this https URL
- [149] arXiv:2409.12528 [pdf, html, other]
-
Title: SoundBeam meets M2D: Target Sound Extraction with Audio Foundation ModelCarlos Hernandez-Olivan, Marc Delcroix, Tsubasa Ochiai, Daisuke Niizumi, Naohiro Tawara, Tomohiro Nakatani, Shoko ArakiSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Target sound extraction (TSE) consists of isolating a desired sound from a mixture of arbitrary sounds using clues to identify it. A TSE system requires solving two problems at once, identifying the target source and extracting the target signal from the mixture. For increased practicability, the same system should work with various types of sound. The duality of the problem and the wide variety of sounds make it challenging to train a powerful TSE system from scratch. In this paper, to tackle this problem, we explore using a pre-trained audio foundation model that can provide rich feature representations of sounds within a TSE system. We chose the masked-modeling duo (M2D) foundation model, which appears especially suited for the TSE task, as it is trained using a dual objective consisting of sound-label predictions and improved masked prediction. These objectives are related to sound identification and the signal extraction problems of TSE. We propose a new TSE system that integrates the feature representation from M2D into SoundBeam, which is a strong TSE system that can exploit both target sound class labels and pre-recorded enrollments (or audio queries) as clues. We show experimentally that using M2D can increase extraction performance, especially when employing enrollment clues.
- [150] arXiv:2409.12532 [pdf, html, other]
-
Title: Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent GenerationChenyu Wang, Shuo Yan, Yixuan Chen, Yujiang Wang, Mingzhi Dong, Xiaochen Yang, Dongsheng Li, Robert P. Dick, Qin Lv, Fan Yang, Tun Lu, Ning Gu, Li ShangSubjects: Computer Vision and Pattern Recognition (cs.CV)
Video generation using diffusion-based models is constrained by high computational costs due to the frame-wise iterative diffusion process. This work presents a Diffusion Reuse MOtion (Dr. Mo) network to accelerate latent video generation. Our key discovery is that coarse-grained noises in earlier denoising steps have demonstrated high motion consistency across consecutive video frames. Following this observation, Dr. Mo propagates those coarse-grained noises onto the next frame by incorporating carefully designed, lightweight inter-frame motions, eliminating massive computational redundancy in frame-wise diffusion models. The more sensitive and fine-grained noises are still acquired via later denoising steps, which can be essential to retain visual qualities. As such, deciding which intermediate steps should switch from motion-based propagations to denoising can be a crucial problem and a key tradeoff between efficiency and quality. Dr. Mo employs a meta-network named Denoising Step Selector (DSS) to dynamically determine desirable intermediate steps across video frames. Extensive evaluations on video generation and editing tasks have shown that Dr. Mo can substantially accelerate diffusion models in video tasks with improved visual qualities.
- [151] arXiv:2409.12535 [pdf, html, other]
-
Title: Deep Probability Segmentation: Are segmentation models probability estimators?Journal-ref: IEEE AICT2024 Conference ProceedingsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Deep learning has revolutionized various fields by enabling highly accurate predictions and estimates. One important application is probabilistic prediction, where models estimate the probability of events rather than deterministic outcomes. This approach is particularly relevant and, therefore, still unexplored for segmentation tasks where each pixel in an image needs to be classified. Conventional models often overlook the probabilistic nature of labels, but accurate uncertainty estimation is crucial for improving the reliability and applicability of models.
In this study, we applied Calibrated Probability Estimation (CaPE) to segmentation tasks to evaluate its impact on model calibration. Our results indicate that while CaPE improves calibration, its effect is less pronounced compared to classification tasks, suggesting that segmentation models can inherently provide better probability estimates. We also investigated the influence of dataset size and bin optimization on the effectiveness of calibration. Our results emphasize the expressive power of segmentation models as probability estimators and incorporate probabilistic reasoning, which is crucial for applications requiring precise uncertainty quantification. - [152] arXiv:2409.12538 [pdf, html, other]
-
Title: PersonaFlow: Boosting Research Ideation with LLM-Simulated Expert PersonasSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Developing novel interdisciplinary research ideas often requires discussions and feedback from experts across different domains. However, obtaining timely inputs is challenging due to the scarce availability of domain experts. Recent advances in Large Language Model (LLM) research have suggested the feasibility of utilizing LLM-simulated expert personas to support research ideation. In this study, we introduce PersonaFlow, an LLM-based system using persona simulation to support the ideation stage of interdisciplinary scientific discovery. Our findings indicate that using multiple personas during ideation significantly enhances user-perceived quality of outcomes (e.g., relevance of critiques, creativity of research questions) without increasing cognitive load. We also found that users' persona customization interactions significantly improved their sense of control and recall of generated ideas. Based on the findings, we discuss highlighting ethical concerns, including potential over-reliance and cognitive biases, and suggest design implications for leveraging LLM-simulated expert personas to support research ideation when human expertise is inaccessible.
- [153] arXiv:2409.12539 [pdf, other]
-
Title: Improving Cone-Beam CT Image Quality with Knowledge Distillation-Enhanced Diffusion Model in Imbalanced Data SettingsComments: MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
In radiation therapy (RT), the reliance on pre-treatment computed tomography (CT) images encounter challenges due to anatomical changes, necessitating adaptive planning. Daily cone-beam CT (CBCT) imaging, pivotal for therapy adjustment, falls short in tissue density accuracy. To address this, our innovative approach integrates diffusion models for CT image generation, offering precise control over data synthesis. Leveraging a self-training method with knowledge distillation, we maximize CBCT data during therapy, complemented by sparse paired fan-beam CTs. This strategy, incorporated into state-of-the-art diffusion-based models, surpasses conventional methods like Pix2pix and CycleGAN. A meticulously curated dataset of 2800 paired CBCT and CT scans, supplemented by 4200 CBCT scans, undergoes preprocessing and teacher model training, including the Brownian Bridge Diffusion Model (BBDM). Pseudo-label CT images are generated, resulting in a dataset combining 5600 CT images with corresponding CBCT images. Thorough evaluation using MSE, SSIM, PSNR and LPIPS demonstrates superior performance against Pix2pix and CycleGAN. Our approach shows promise in generating high-quality CT images from CBCT scans in RT.
- [154] arXiv:2409.12540 [pdf, html, other]
-
Title: Impacts of aspect ratio on task accuracy in parallel coordinatesSubjects: Human-Computer Interaction (cs.HC)
Parallel coordinates plots (PCPs) are a widely used visualization method, particularly for exploratory analysis. Previous studies show that PCPs perform much more poorly for estimating positive correlation than for estimating negative correlation, but it is not clear if this is affected by the aspect ratio (AR) of the axes pairs. In this paper, we present the results from an evaluation of the effect of the aspect ratio of axes in static (non-interactive) PCPs for two tasks: a) linear correlation estimation and b) value tracing. For both tasks we find strong evidence that AR influences accuracy, including ARs greater than 1:1 being much more performant for estimation of positive correlations. We provide a set of recommendations for visualization designers using PCPs for correlation or value-tracing tasks, based on the data characteristics and expected use cases.
- [155] arXiv:2409.12541 [pdf, html, other]
-
Title: Profiling Patient Transcript Using Large Language Model Reasoning Augmentation for Alzheimer's Disease DetectionComments: accepted to EMBC 2024Subjects: Computation and Language (cs.CL)
Alzheimer's disease (AD) stands as the predominant cause of dementia, characterized by a gradual decline in speech and language capabilities. Recent deep-learning advancements have facilitated automated AD detection through spontaneous speech. However, common transcript-based detection methods directly model text patterns in each utterance without a global view of the patient's linguistic characteristics, resulting in limited discriminability and interpretability. Despite the enhanced reasoning abilities of large language models (LLMs), there remains a gap in fully harnessing the reasoning ability to facilitate AD detection and model interpretation. Therefore, we propose a patient-level transcript profiling framework leveraging LLM-based reasoning augmentation to systematically elicit linguistic deficit attributes. The summarized embeddings of the attributes are integrated into an Albert model for AD detection. The framework achieves 8.51\% ACC and 8.34\% F1 improvements on the ADReSS dataset compared to the baseline without reasoning augmentation. Our further analysis shows the effectiveness of our identified linguistic deficit attributes and the potential to use LLM for AD detection interpretation.
- [156] arXiv:2409.12544 [pdf, html, other]
-
Title: Nigerian Software Engineer or American Data Scientist? GitHub Profile Recruitment Bias in Large Language ModelsTakashi Nakano, Kazumasa Shimari, Raula Gaikovina Kula, Christoph Treude, Marc Cheong, Kenichi MatsumotoSubjects: Software Engineering (cs.SE)
Large Language Models (LLMs) have taken the world by storm, demonstrating their ability not only to automate tedious tasks, but also to show some degree of proficiency in completing software engineering tasks. A key concern with LLMs is their "black-box" nature, which obscures their internal workings and could lead to societal biases in their outputs. In the software engineering context, in this early results paper, we empirically explore how well LLMs can automate recruitment tasks for a geographically diverse software team. We use OpenAI's ChatGPT to conduct an initial set of experiments using GitHub User Profiles from four regions to recruit a six-person software development team, analyzing a total of 3,657 profiles over a five-year period (2019-2023). Results indicate that ChatGPT shows preference for some regions over others, even when swapping the location strings of two profiles (counterfactuals). Furthermore, ChatGPT was more likely to assign certain developer roles to users from a specific country, revealing an implicit bias. Overall, this study reveals insights into the inner workings of LLMs and has implications for mitigating such societal biases in these models.
- [157] arXiv:2409.12545 [pdf, html, other]
-
Title: Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution AlignmentComments: 18 pagesSubjects: Computation and Language (cs.CL)
Knowledge distillation (KD) is an effective model compression method that can transfer the internal capabilities of large language models (LLMs) to smaller ones. However, the multi-modal probability distribution predicted by teacher LLMs causes difficulties for student models to learn. In this paper, we first demonstrate the importance of multi-modal distribution alignment with experiments and then highlight the inefficiency of existing KD approaches in learning multi-modal distributions. To address this problem, we propose Ranking Loss based Knowledge Distillation (RLKD), which encourages the consistency of the ranking of peak predictions between the teacher and student models. By incorporating word-level ranking loss, we ensure excellent compatibility with existing distillation objectives while fully leveraging the fine-grained information between different categories in peaks of two predicted distribution. Experimental results demonstrate that our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement in various downstream tasks.
- [158] arXiv:2409.12548 [pdf, html, other]
-
Title: Mimicking Networks for Constrained Multicuts in HypergraphsComments: Publised in ISAAC 2024Subjects: Data Structures and Algorithms (cs.DS)
In this paper, we study a \emph{multicut-mimicking network} for a hypergraph over terminals $T$ with a parameter $c$. It is a hypergraph preserving the minimum multicut values of any set of pairs over $T$ where the value is at most $c$. This is a new variant of the multicut-mimicking network of a graph in [Wahlstr{ö}m ICALP'20], which introduces a parameter $c$ and extends it to handle hypergraphs. Additionally, it is a natural extension of the \emph{connectivity-$c$ mimicking network} introduced by [Chalermsook et al. SODA'21] and [Jiang et al. ESA'22] that is a (hyper)graph preserving the minimum cut values between two subsets of terminals where the value is at most $c$. We propose an algorithm for a hypergraph that returns a multicut-mimicking network over terminals $T$ with a parameter $c$ having $|T|c^{O(r\log c)}$ hyperedges in $p^{1+o(1)}+|T|(c^r\log n)^{\tilde{O}(rc)}m$ time, where $p$ and $r$ are the total size and the rank, respectively, of the hypergraph.
- [159] arXiv:2409.12549 [pdf, html, other]
-
Title: FruitsMusic: A Real-World Corpus of Japanese Idol-Group SongsComments: Accepted at the 25th International Society for Music Information Retrieval (ISMIR) Conference 2024, San Francisco, United StatesSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
This study presents FruitsMusic, a metadata corpus of Japanese idol-group songs in the real world, precisely annotated with who sings what and when. Japanese idol-group songs, vital to Japanese pop culture, feature a unique vocal arrangement style, where songs are divided into several segments, and a specific individual or multiple singers are assigned to each segment. To enhance singer diarization methods for recognizing such structures, we constructed FruitsMusic as a resource using 40 music videos of Japanese idol groups from YouTube. The corpus includes detailed annotations, covering songs across various genres, division and assignment styles, and groups ranging from 4 to 9 members. FruitsMusic also facilitates the development of various music information retrieval techniques, such as lyrics transcription and singer identification, benefiting not only Japanese idol-group songs but also a wide range of songs featuring single or multiple singers from various cultures. This paper offers a comprehensive overview of FruitsMusic, including its creation methodology and unique characteristics compared to conversational speech. Additionally, this paper evaluates the efficacy of current methods for singer embedding extraction and diarization in challenging real-world conditions using FruitsMusic. Furthermore, this paper examines potential improvements in automatic diarization performance through evaluating human performance.
- [160] arXiv:2409.12553 [pdf, html, other]
-
Title: Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and MitigationsComments: 13 pages, 12 figures, 6 tablesSubjects: Cryptography and Security (cs.CR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Thanks to the popularisation of transformer-based models, speech recognition (SR) is gaining traction in various application fields, such as industrial and robotics environments populated with mission-critical devices. While transformer-based SR can provide various benefits for simplifying human-machine interfacing, the research on the cybersecurity aspects of these models is lacklustre. In particular, concerning backdoor poisoning attacks. In this paper, we propose a new poisoning approach that maps different environmental trigger sounds to target phrases of different lengths, during the fine-tuning phase. We test our approach on Whisper, one of the most popular transformer-based SR model, showing that it is highly vulnerable to our attack, under several testing conditions. To mitigate the attack proposed in this paper, we investigate the use of Silero VAD, a state-of-the-art voice activity detection (VAD) model, as a defence mechanism. Our experiments show that it is possible to use VAD models to filter out malicious triggers and mitigate our attacks, with a varying degree of success, depending on the type of trigger sound and testing conditions.
- [161] arXiv:2409.12558 [pdf, html, other]
-
Title: RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented DialoguesSubjects: Computation and Language (cs.CL)
In real-world applications with Large Language Models (LLMs), external retrieval mechanisms - such as Search-Augmented Generation (SAG), tool utilization, and Retrieval-Augmented Generation (RAG) - are often employed to enhance the quality of augmented generations in dialogues. These approaches often come with multi-turn dialogue, where each interaction is enriched by relevant information retrieved from external sources. Existing benchmarks either assess LLMs' chat abilities in multi-turn dialogues or their use of retrieval for augmented responses in single-turn settings. However, there is a gap in evaluating LLMs' ability to leverage retrieval for more precise responses across multiple turns. To address this limitation, we introduce RAD-Bench (Retrieval Augmented Dialogue), a benchmark designed to evaluate LLMs' capabilities in multi-turn dialogues following retrievals, essential for their deployment in context-rich applications. RAD-Bench evaluates two key abilities of LLMs: Retrieval Synthesis and Retrieval Reasoning. These are measured using discriminative questions and retrieved contexts, and corresponding reference answers, assessing how effectively LLMs integrate and reason with context to maintain and enhance conversation quality over multiple turns. Our evaluation results on commonly used LLMs reveal that model performance deteriorates as additional layers of conditions or constraints are applied across conversation turns, even when accurate retrieved contexts are provided.
- [162] arXiv:2409.12561 [pdf, html, other]
-
Title: Human Interest or Conflict? Leveraging LLMs for Automated Framing Analysis in TV ShowsSubjects: Human-Computer Interaction (cs.HC)
In the current media landscape, understanding the framing of information is crucial for critical consumption and informed decision making. Framing analysis is a valuable tool for identifying the underlying perspectives used to present information, and has been applied to a variety of media formats, including television programs. However, manual analysis of framing can be time-consuming and labor-intensive. This is where large language models (LLMs) can play a key role. In this paper, we propose a novel approach to use prompt-engineering to identify the framing of spoken content in television programs. Our findings indicate that prompt-engineering LLMs can be used as a support tool to identify frames, with agreement rates between human and machine reaching up to 43\%. As LLMs are still under development, we believe that our approach has the potential to be refined and further improved. The potential of this technology for interactive media applications is vast, including the development of support tools for journalists, educational resources for students of journalism learning about framing and related concepts, and interactive media experiences for audiences.
- [163] arXiv:2409.12564 [pdf, html, other]
-
Title: State Estimation and Environment Recognition for Articulated Structures via Proximity Sensors Distributed over the Whole BodyComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Robotics (cs.RO)
For robots with low rigidity, determining the robot's state based solely on kinematics is challenging. This is particularly crucial for a robot whose entire body is in contact with the environment, as accurate state estimation is essential for environmental interaction. We propose a method for simultaneous articulated robot posture estimation and environmental mapping by integrating data from proximity sensors distributed over the whole body. Our method extends the discrete-time model, typically used for state estimation, to the spatial direction of the articulated structure. The simulations demonstrate that this approach significantly reduces estimation errors.
- [164] arXiv:2409.12567 [pdf, html, other]
-
Title: Model calibration using a parallel differential evolution algorithm in computational neuroscience: simulation of stretch induced nerve deficitAntonio LaTorre, Man Ting Kwong, Julián A. García-Grajales, Riyi Shi, Antoine Jérusalem, José-María PeñaJournal-ref: Journal of Computational Science, vol. 39, p. 101053, Jan. 2020Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Neuronal damage, in the form of both brain and spinal cord injuries, is one of the major causes of disability and death in young adults worldwide. One way to assess the direct damage occurring after a mechanical insult is the simulation of the neuronal cells functional deficits following the mechanical event. In this study, we use a coupled mechanical electrophysiological model with several free parameters that are required to be calibrated against experimental results. The calibration is carried out by means of an evolutionary algorithm (differential evolution, DE) that needs to evaluate each configuration of parameters on six different damage cases, each of them taking several minutes to compute. To minimise the simulation time of the parameter tuning for the DE, the stretch of one unique fixed-diameter axon with a simplified triggering process is used to speed up the calculations. The model is then leveraged for the parameter optimization of the more realistic bundle of independent axons, an impractical configuration to run on a single processor computer. To this end, we have developed a parallel implementation based on OpenMP that runs on a multi-processor taking advantage of all the available computational power. The parallel DE algorithm obtains good results, outperforming the best effort achieved by published manual calibration, in a fraction of the time. While not being able to fully capture the experimental results, the resulting nerve model provides a complex averaging framework for nerve damage simulation able to simulate gradual axonal functional alteration in a bundle.
- [165] arXiv:2409.12568 [pdf, html, other]
-
Title: InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical ReasoningXiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng YouSubjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Pre-training on large-scale, high-quality datasets is crucial for enhancing the reasoning capabilities of Large Language Models (LLMs), especially in specialized domains such as mathematics. Despite the recognized importance, the Multimodal LLMs (MLLMs) field currently lacks a comprehensive open-source pre-training dataset specifically designed for mathematical reasoning. To address this gap, we introduce InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. We provide a detailed overview of our data collection and processing pipeline. To demonstrate the robustness of InfiMM-WebMath-40B, we conducted evaluations in both text-only and multimodal settings. Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model, delivering results comparable to DeepSeekMath-1.3B, which uses 120 billion tokens for the same model size. Nevertheless, with the introduction of our multi-modal math pre-training dataset, our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math. We release our data at this https URL.
- [166] arXiv:2409.12572 [pdf, html, other]
-
Title: Scalable and Robust Mobile Activity Fingerprinting via Over-the-Air Control Channel in 5G NetworksComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR)
5G has undergone significant changes in its over-the-air control channel architecture compared to legacy networks, aimed at enhancing performance. These changes have unintentionally strengthened the security of control channels, reducing vulnerabilities in radio channels for attackers. However, based on our experimental results, less than 10% of Physical Downlink Control Channel (PDCCH) messages could be decoded using sniffers. We demonstrate that even with this limited data, cell scanning and targeted user mobile activity tracking are feasible. This privacy attack exposes the number of active communication channels and reveals the mobile applications and their usage time. We propose an efficient deep learning-based mobile traffic classification method that eliminates the need for manual feature extraction, enabling scalability across various applications while maintaining high performance even in scenarios with data loss. We evaluated the effectiveness of our approach using both an open-source testbed and a commercial 5G testbed, demonstrating the feasibility of mobile activity fingerprinting and targeted attacks. To the best of our knowledge, this is the first study to track mobile activity over-the-air using PDCCH messages.
- [167] arXiv:2409.12575 [pdf, html, other]
-
Title: Deep Transfer Hashing for Adaptive Learning on Federated Streaming DataComments: Presented at ECML2024: 8th Intl. Worksh. and Tutorial on Interactive Adaptive Learning, Sep. 9th, 2024, Vilnius, LithuaniaSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
This extended abstract explores the integration of federated learning with deep transfer hashing for distributed prediction tasks, emphasizing resource-efficient client training from evolving data streams. Federated learning allows multiple clients to collaboratively train a shared model while maintaining data privacy - by incorporating deep transfer hashing, high-dimensional data can be converted into compact hash codes, reducing data transmission size and network loads. The proposed framework utilizes transfer learning, pre-training deep neural networks on a central server, and fine-tuning on clients to enhance model accuracy and adaptability. A selective hash code sharing mechanism using a privacy-preserving global memory bank further supports client fine-tuning. This approach addresses challenges in previous research by improving computational efficiency and scalability. Practical applications include Car2X event predictions, where a shared model is collectively trained to recognize traffic patterns, aiding in tasks such as traffic density assessment and accident detection. The research aims to develop a robust framework that combines federated learning, deep transfer hashing and transfer learning for efficient and secure downstream task execution.
- [168] arXiv:2409.12576 [pdf, html, other]
-
Title: StoryMaker: Towards Holistic Consistent Characters in Text-to-image GenerationComments: 12 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Tuning-free personalized image generation methods have achieved significant success in maintaining facial consistency, i.e., identities, even with multiple characters. However, the lack of holistic consistency in scenes with multiple characters hampers these methods' ability to create a cohesive narrative. In this paper, we introduce StoryMaker, a personalization solution that preserves not only facial consistency but also clothing, hairstyles, and body consistency, thus facilitating the creation of a story through a series of images. StoryMaker incorporates conditions based on face identities and cropped character images, which include clothing, hairstyles, and bodies. Specifically, we integrate the facial identity information with the cropped character images using the Positional-aware Perceiver Resampler (PPR) to obtain distinct character features. To prevent intermingling of multiple characters and the background, we separately constrain the cross-attention impact regions of different characters and the background using MSE loss with segmentation masks. Additionally, we train the generation network conditioned on poses to promote decoupling from poses. A LoRA is also employed to enhance fidelity and quality. Experiments underscore the effectiveness of our approach. StoryMaker supports numerous applications and is compatible with other societal plug-ins. Our source codes and model weights are available at this https URL.
- [169] arXiv:2409.12580 [pdf, html, other]
-
Title: LLMs Can Check Their Own Results to Mitigate Hallucinations in Traffic Understanding TasksComments: ICTSS 2024, 36th International Conference on Testing Software and SystemsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Today's Large Language Models (LLMs) have showcased exemplary capabilities, ranging from simple text generation to advanced image processing. Such models are currently being explored for in-vehicle services such as supporting perception tasks in Advanced Driver Assistance Systems (ADAS) or Autonomous Driving (AD) systems, given the LLMs' capabilities to process multi-modal data. However, LLMs often generate nonsensical or unfaithful information, known as ``hallucinations'': a notable issue that needs to be mitigated. In this paper, we systematically explore the adoption of SelfCheckGPT to spot hallucinations by three state-of-the-art LLMs (GPT-4o, LLaVA, and Llama3) when analysing visual automotive data from two sources: Waymo Open Dataset, from the US, and PREPER CITY dataset, from Sweden. Our results show that GPT-4o is better at generating faithful image captions than LLaVA, whereas the former demonstrated leniency in mislabeling non-hallucinated content as hallucinations compared to the latter. Furthermore, the analysis of the performance metrics revealed that the dataset type (Waymo or PREPER CITY) did not significantly affect the quality of the captions or the effectiveness of hallucination detection. However, the models showed better performance rates over images captured during daytime, compared to during dawn, dusk or night. Overall, the results show that SelfCheckGPT and its adaptation can be used to filter hallucinations in generated traffic-related image captions for state-of-the-art LLMs.
- [170] arXiv:2409.12586 [pdf, html, other]
-
Title: Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model InsightsSubjects: Computation and Language (cs.CL)
Enhancing small language models for real-life application deployment is a significant challenge facing the research community. Due to the difficulties and costs of using large language models, researchers are seeking ways to effectively deploy task-specific small models. In this work, we introduce a simple yet effective knowledge distillation method to improve the performance of small language models. Our approach utilizes a teacher model with approximately 3 billion parameters to identify the most influential tokens in its decision-making process. These tokens are extracted from the input based on their attribution scores relative to the output, using methods like saliency maps. These important tokens are then provided as rationales to a student model, aiming to distill the knowledge of the teacher model. This method has proven to be effective, as demonstrated by testing it on four diverse datasets, where it shows improvement over both standard fine-tuning methods and state-of-the-art knowledge distillation models. Furthermore, we explore explanations of the success of the model by analyzing the important tokens extracted from the teacher model. Our findings reveal that in 68\% of cases, specifically in datasets where labels are part of the answer, such as multiple-choice questions, the extracted tokens are part of the ground truth.
- [171] arXiv:2409.12590 [pdf, html, other]
-
Title: Hybrid Ensemble Deep Graph Temporal Clustering for Spatiotemporal DataComments: 10 pagesSubjects: Machine Learning (cs.LG)
Classifying subsets based on spatial and temporal features is crucial to the analysis of spatiotemporal data given the inherent spatial and temporal variability. Since no single clustering algorithm ensures optimal results, researchers have increasingly explored the effectiveness of ensemble approaches. Ensemble clustering has attracted much attention due to increased diversity, better generalization, and overall improved clustering performance. While ensemble clustering may yield promising results on simple datasets, it has not been fully explored on complex multivariate spatiotemporal data. For our contribution to this field, we propose a novel hybrid ensemble deep graph temporal clustering (HEDGTC) method for multivariate spatiotemporal data. HEDGTC integrates homogeneous and heterogeneous ensemble methods and adopts a dual consensus approach to address noise and misclassification from traditional clustering. It further applies a graph attention autoencoder network to improve clustering performance and stability. When evaluated on three real-world multivariate spatiotemporal data, HEDGTC outperforms state-of-the-art ensemble clustering models by showing improved performance and stability with consistent results. This indicates that HEDGTC can effectively capture implicit temporal patterns in complex spatiotemporal data.
- [172] arXiv:2409.12597 [pdf, html, other]
-
Title: LARE: Latent Augmentation using Regional Embedding with Vision-Language ModelComments: 10 pages, 4 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as "image-related chat," "image recognition by instruction," and "answering visual questions." Vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), are also high-performance image classifiers that are being developed into domain adaptation methods that can utilize language information to extend into unseen domains. However, because these VLMs embed images as a single point in a unified embedding space, there is room for improvement in the classification accuracy. Therefore, in this study, we proposed the Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM. By sampling the augmented image embeddings from within this latent region, LARE enables data augmentation to various unseen domains, not just to specific unseen domains. LARE achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs. We demonstrate that LARE outperforms previous fine-tuning models in terms of image classification accuracy on three benchmarks. We also demonstrate that LARE is a more robust and general model that is valid under multiple conditions, such as unseen domains, small amounts of data, and imbalanced data.
- [173] arXiv:2409.12599 [pdf, html, other]
-
Title: Enhancing SLM via ChatGPT and Dataset AugmentationSubjects: Computation and Language (cs.CL)
This paper explores the enhancement of small language models through strategic dataset augmentation via ChatGPT-3.5-Turbo, in the domain of Natural Language Inference (NLI). By employing knowledge distillation-based techniques and synthetic dataset augmentation, we aim to bridge the performance gap between large language models (LLMs) and small language models (SLMs) without the immense cost of human annotation. Our methods involve two forms of rationale generation--information extraction and informed reasoning--to enrich the ANLI dataset. We then fine-tune T5-Small on these augmented datasets, evaluating its performance against an established benchmark. Our findings reveal that the incorporation of synthetic rationales significantly improves the model's ability to comprehend natural language, leading to 1.3\% and 2.3\% higher classification accuracy, respectively, on the ANLI dataset, demonstrating the potential of leveraging LLMs for dataset augmentation. This approach not only enhances the performance of smaller models on complex tasks but also introduces a cost-effective method for fine-tuning smaller language models. By advancing our understanding of knowledge distillation and fine-tuning strategies, this work contributes to the ongoing effort to create more capable and efficient NLP systems.
- [174] arXiv:2409.12601 [pdf, html, other]
-
Title: Frisking-Johnsen Model with Diminishing CompetitionComments: This work has been submitted to IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
This letter studies the Friedkin-Johnsen (FJ) model with diminishing competition, or stubbornness. The original FJ model assumes fixed competition that is manifested through a constant weight that each agent gives to its initial opinion in addition to its contribution through a consensus dynamic. This letter investigates the effect of diminishing competition on the convergence point and speed of the FJ dynamics. We show that, if the competition is uniform across agents and vanishes asymptotically, the convergence point coincides with the nominal consensus reached with no competition. However, the diminishing competition slows down convergence according to its own rate of decay. We evaluate this phenomenon analytically and provide upper and lower bounds on the convergence rate. If competition is not uniform across clients, we show that the convergence point may not coincide with the nominal consensus point. Finally, we evaluate and validate our analytical insights numerically.
- [175] arXiv:2409.12602 [pdf, html, other]
-
Title: Enhancing Agricultural Environment Perception via Active Vision and Zero-Shot LearningSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Agriculture, fundamental for human sustenance, faces unprecedented challenges. The need for efficient, human-cooperative, and sustainable farming methods has never been greater. The core contributions of this work involve leveraging Active Vision (AV) techniques and Zero-Shot Learning (ZSL) to improve the robot's ability to perceive and interact with agricultural environment in the context of fruit harvesting. The AV Pipeline implemented within ROS 2 integrates the Next-Best View (NBV) Planning for 3D environment reconstruction through a dynamic 3D Occupancy Map. Our system allows the robotics arm to dynamically plan and move to the most informative viewpoints and explore the environment, updating the 3D reconstruction using semantic information produced through ZSL models. Simulation and real-world experimental results demonstrate our system's effectiveness in complex visibility conditions, outperforming traditional and static predefined planning methods. ZSL segmentation models employed, such as YOLO World + EfficientViT SAM, exhibit high-speed performance and accurate segmentation, allowing flexibility when dealing with semantic information in unknown agricultural contexts without requiring any fine-tuning process.
- [176] arXiv:2409.12606 [pdf, html, other]
-
Title: Well-balanced Point-Average-Moment PolynomiAl-interpreted (PAMPA) Methods for Shallow Water Equations on Triangular MeshesSubjects: Numerical Analysis (math.NA)
In this paper, we develop novel well-balanced Point-Average-Moment PolynomiAl-interpreted (PAMPA) numerical methods for solving the two-dimensional shallow water equations on triangular meshes. The proposed PAMPA methods use a globally continuous representation of the variables, with degree of freedoms (DoFs) consisting of point values on the edges and average values within each triangular element. The update of cell averages is carried out using a conservative form of the partial differential equations (PDEs), while the point values are updated using a non-conservative formulation. This non-conservative formulation can be expressed in terms of either conservative or primitive variables. This new class of schemes is proved to be well-balanced and positivity-preserving. We validate the performance of the proposed methods through a series of numerical experiments. The numerical results are as expected and confirm that the performance of the PAMPA method using primitive variables in the non-conservative formulation is comparable to that using only the conservative variables. This work represents a step forward in the development and application of PAMPA methods for solving hyperbolic balance laws.
- [177] arXiv:2409.12610 [pdf, html, other]
-
Title: CF-GO-Net: A Universal Distribution Learner via Characteristic Function Networks with Graph OptimizersSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Generative models aim to learn the distribution of datasets, such as images, so as to be able to generate samples that statistically resemble real data. However, learning the underlying probability distribution can be very challenging and intractable. To this end, we introduce an approach which employs the characteristic function (CF), a probabilistic descriptor that directly corresponds to the distribution. However, unlike the probability density function (pdf), the characteristic function not only always exists, but also provides an additional degree of freedom, hence enhances flexibility in learning distributions. This removes the critical dependence on pdf-based assumptions, which limit the applicability of traditional methods. While several works have attempted to use CF in generative modeling, they often impose strong constraints on the training process. In contrast, our approach calculates the distance between query points in the CF domain, which is an unconstrained and well defined problem. Next, to deal with the sampling strategy, which is crucial to model performance, we propose a graph neural network (GNN)-based optimizer for the sampling process, which identifies regions where the difference between CFs is most significant. In addition, our method allows the use of a pre-trained model, such as a well-trained autoencoder, and is capable of learning directly in its feature space, without modifying its parameters. This offers a flexible and robust approach to generative modeling, not only provides broader applicability and improved performance, but also equips any latent space world with the ability to become a generative model.
- [178] arXiv:2409.12612 [pdf, html, other]
-
Title: Enhancing Perception of Key Changes in Remote Sensing Image Change CaptioningSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, while significant progress has been made in remote sensing image change captioning, existing methods fail to filter out areas unrelated to actual changes, making models susceptible to irrelevant features. In this article, we propose a novel multimodal framework for remote sensing image change captioning, guided by Key Change Features and Instruction-tuned (KCFI). This framework aims to fully leverage the intrinsic knowledge of large language models through visual instructions and enhance the effectiveness and accuracy of change features using pixel-level change detection tasks. Specifically, KCFI includes a ViTs encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, a pixel-level change detection decoder to constrain key change features, and an instruction-tuned decoder based on a large language model. Moreover, to ensure that change description and change detection tasks are jointly optimized, we employ a dynamic weight-averaging strategy to balance the losses between the two tasks. We also explore various feature combinations for visual fine-tuning instructions and demonstrate that using only key change features to guide the large language model is the optimal choice. To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset, achieving the best performance. Our code will be available at this https URL.
- [179] arXiv:2409.12615 [pdf, html, other]
-
Title: Discrete Incremental Voting on ExpandersComments: arXiv admin note: text overlap with arXiv:2305.15632Subjects: Discrete Mathematics (cs.DM)
Pull voting is a random process in which vertices of a connected graph have initial opinions chosen from a set of $k$ distinct opinions, and at each step a random vertex alters its opinion to that of a randomly chosen neighbour. If the system reaches a state where each vertex holds the same opinion, then this opinion will persist forthwith.
In general the opinions are regarded as incommensurate, whereas in this paper we consider a type of pull voting suitable for integer opinions such as $\{1,2,\ldots,k\}$ which can be compared on a linear scale; for example, 1 ('disagree strongly'), 2 ('disagree'), $\ldots,$ 5 ('agree strongly'). On observing the opinion of a random neighbour, a vertex updates its opinion by a discrete change towards the value of the neighbour's opinion, if different.
Discrete incremental voting is a pull voting process which mimics this behaviour. At each step a random vertex alters its opinion towards that of a randomly chosen neighbour; increasing its opinion by $+1$ if the opinion of the chosen neighbour is larger, or decreasing its opinion by $-1$, if the opinion of the neighbour is smaller. If initially there are only two adjacent integer opinions, for example $\{0,1\}$, incremental voting coincides with pull voting, but if initially there are more than two opinions this is not the case.
For an $n$-vertex graph $G=(V,E)$, let $\lambda$ be the absolute second eigenvalue of the transition matrix $P$ of a simple random walk on $G$. Let the initial opinions of the vertices be chosen from $\{1,2,\ldots,k\}$. Let $c=\sum_{v \in V} \pi_v X_v$, where $X_v$ is the initial opinion of vertex $v$, and $\pi_v$ is the stationary distribution of the vertex. Then provided $\lambda k=o(1)$ and $k=o(n/\log n)$, with high probability the final opinion is the initial weighted average $c$ suitably rounded to $\lfloor c \rfloor$ or $\lceil c\rceil$. - [180] arXiv:2409.12616 [pdf, html, other]
-
Title: Semi-Supervised Safe Visuomotor Policy Synthesis using Barrier CertificatesComments: First two authors have contributed equally. 8 Pages, 3 figuresSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
In modern robotics, addressing the lack of accurate state space information in real-world scenarios has led to a significant focus on utilizing visuomotor observation to provide safety assurances. Although supervised learning methods, such as imitation learning, have demonstrated potential in synthesizing control policies based on visuomotor observations, they require ground truth safety labels for the complete dataset and do not provide formal safety assurances. On the other hand, traditional control-theoretic methods like Control Barrier Functions (CBFs) and Hamilton-Jacobi (HJ) Reachability provide formal safety guarantees but depend on accurate knowledge of system dynamics, which is often unavailable for high-dimensional visuomotor data. To overcome these limitations, we propose a novel approach to synthesize a semi-supervised safe visuomotor policy using barrier certificates that integrate the strengths of model-free supervised learning and model-based control methods. This framework synthesizes a provably safe controller without requiring safety labels for the complete dataset and ensures completeness guarantees for both the barrier certificate and the policy. We validate our approach through distinct case studies: an inverted pendulum system and the obstacle avoidance of an autonomous mobile robot.
- [181] arXiv:2409.12617 [pdf, html, other]
-
Title: CrossRT: A cross platform programming technology for hardware-accelerated ray tracing in CG and CV applicationsSubjects: Graphics (cs.GR)
We propose a programming technology that bridges cross-platform compatibility and hardware acceleration in ray tracing applications. Our methodology enables developers to define algorithms while our translator manages implementation specifics for different hardware or APIs. Features include: generating hardware-accelerated code from hardware-agnostic, object-oriented C++ algorithm descriptions; enabling users to define software fallbacks for non-hardware-accelerated CPUs and GPUs; producing GPU programming API-based algorithm implementations resembling manually ported C++ versions. The generated code is editable and readable, allowing for additional hardware acceleration. Our translator supports single megakernel and multiple kernel path tracing implementations without altering the programming model or input source code. Wavefront mode is crucial for NeRF and SDF, ensuring efficient evaluation with multiple kernels. Validation on tasks such as BVH tree build/traversal, ray-surface intersection for SDF, ray-volume intersection for 3D Gaussian Splatting, and complex Path Tracing models showed comparable performance levels to expert-written implementations for GPUs. Our technology outperformed existing Path Tracing implementations.
- [182] arXiv:2409.12618 [pdf, other]
-
Title: Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model ReasoningSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well-structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses. Motivated by this insight, we propose the Iteration of Thought (IoT) framework for enhancing LLM responses by generating "thought"-provoking prompts vis a vis an input query and the current iteration of an LLM's response. Unlike static or semi-static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically, based on evolving context, and without generating alternate explorative thoughts which are ultimately discarded. The three components of the IoT framework are (1) an Inner Dialogue Agent (IDA) responsible for generating instructive, context-specific prompts; (2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and (3) an iterative prompting loop that implements a conversation between the former two components. We introduce two variants of our framework: Autonomous Iteration of Thought (AIoT), where an LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number iterations. We investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset. Our results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.
- [183] arXiv:2409.12620 [pdf, html, other]
-
Title: Accurate Automatic 3D Annotation of Traffic Lights and Signs for Autonomous DrivingComments: Accepted at the 2nd Workshop on Vision-Centric Autonomous Driving (VCAD) as part of ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
3D detection of traffic management objects, such as traffic lights and road signs, is vital for self-driving cars, particularly for address-to-address navigation where vehicles encounter numerous intersections with these static objects. This paper introduces a novel method for automatically generating accurate and temporally consistent 3D bounding box annotations for traffic lights and signs, effective up to a range of 200 meters. These annotations are suitable for training real-time models used in self-driving cars, which need a large amount of training data. The proposed method relies only on RGB images with 2D bounding boxes of traffic management objects, which can be automatically obtained using an off-the-shelf image-space detector neural network, along with GNSS/INS data, eliminating the need for LiDAR point cloud data.
- [184] arXiv:2409.12623 [pdf, html, other]
-
Title: CamelEval: Advancing Culturally Aligned Arabic Language Models and BenchmarksSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) are the cornerstones of modern artificial intelligence systems. This paper introduces Juhaina, a Arabic-English bilingual LLM specifically designed to align with the values and preferences of Arabic speakers. Juhaina inherently supports advanced functionalities such as instruction following, open-ended question answering, information provisioning, and text processing. Our model contains 9.24 billion parameters and is trained on a context window of up to 8,192 tokens. This paper details the creation process of Juhaina and provides an extensive empirical evaluation. Furthermore, we identify the limitations of widely-adopted Open Arabic LLM Leaderboard (OALL) and propose a new evaluation benchmark, CamelEval. Our findings demonstrate that Juhaina surpasses existing LLMs of comparable sizes, such as the Llama and Gemma families, in generating helpful responses in Arabic, providing factually accurate information about the region, and understanding nuanced cultural aspects. We aspire for Juhaina to democratize cutting-edge AI technologies, serving over 400 million Arabic speakers by offering LLMs that not only communicate in their language but also comprehend their culture. We publicly release all models on Huggingface \url{this https URL}.
- [185] arXiv:2409.12626 [pdf, html, other]
-
Title: Green Federated Learning: A new era of Green Aware AISubjects: Machine Learning (cs.LG)
The development of AI applications, especially in large-scale wireless networks, is growing exponentially, alongside the size and complexity of the architectures used. Particularly, machine learning is acknowledged as one of today's most energy-intensive computational applications, posing a significant challenge to the environmental sustainability of next-generation intelligent systems. Achieving environmental sustainability entails ensuring that every AI algorithm is designed with sustainability in mind, integrating green considerations from the architectural phase onwards. Recently, Federated Learning (FL), with its distributed nature, presents new opportunities to address this need. Hence, it's imperative to elucidate the potential and challenges stemming from recent FL advancements and their implications for sustainability. Moreover, it's crucial to furnish researchers, stakeholders, and interested parties with a roadmap to navigate and understand existing efforts and gaps in green-aware AI algorithms. This survey primarily aims to achieve this objective by identifying and analyzing over a hundred FL works, assessing their contributions to green-aware artificial intelligence for sustainable environments, with a specific focus on IoT research. It delves into current issues in green federated learning from an energy-efficient standpoint, discussing potential challenges and future prospects for green IoT application research.
- [186] arXiv:2409.12627 [pdf, html, other]
-
Title: A topological proof of the Hell-Ne\v{s}et\v{r}il dichotomyComments: 13 pagesSubjects: Computational Complexity (cs.CC); Algebraic Topology (math.AT); Combinatorics (math.CO)
We provide a new proof of a theorem of Hell and Nešetřil [J. Comb. Theory B, 48(1):92-110, 1990] using tools from topological combinatorics based on ideas of Lovász [J. Comb. Theory, Ser. A, 25(3):319-324, 1978]. The Hell-Nešetřil Theorem provides a dichotomy of the graph homomorphism problem. It states that deciding whether there is a graph homomorphism from a given graph to a fixed graph $H$ is in P if $H$ is bipartite (or contains a self-loop), and is NP-complete otherwise. In our proof we combine topological combinatorics with the algebraic approach to constraint satisfaction problem.
- [187] arXiv:2409.12632 [pdf, html, other]
-
Title: Counterfactual Explanations for Clustering ModelsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Clustering algorithms rely on complex optimisation processes that may be difficult to comprehend, especially for individuals who lack technical expertise. While many explainable artificial intelligence techniques exist for supervised machine learning, unsupervised learning -- and clustering in particular -- has been largely neglected. To complicate matters further, the notion of a ``true'' cluster is inherently challenging to define. These facets of unsupervised learning and its explainability make it difficult to foster trust in such methods and curtail their adoption. To address these challenges, we propose a new, model-agnostic technique for explaining clustering algorithms with counterfactual statements. Our approach relies on a novel soft-scoring method that captures the spatial information utilised by clustering models. It builds upon a state-of-the-art Bayesian counterfactual generator for supervised learning to deliver high-quality explanations. We evaluate its performance on five datasets and two clustering algorithms, and demonstrate that introducing soft scores to guide counterfactual search significantly improves the results.
- [188] arXiv:2409.12634 [pdf, html, other]
-
Title: Exploring bat song syllable representations in self-supervised audio encodersComments: Presented at VIHAR-2024; see this https URLJournal-ref: Proceedings of the 4th International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR), Kos, GR, 6-9 Sept 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
How well can deep learning models trained on human-generated sounds distinguish between another species' vocalization types? We analyze the encoding of bat song syllables in several self-supervised audio encoders, and find that models pre-trained on human speech generate the most distinctive representations of different syllable types. These findings form first steps towards the application of cross-species transfer learning in bat bioacoustics, as well as an improved understanding of out-of-distribution signal processing in audio encoder models.
- [189] arXiv:2409.12635 [pdf, html, other]
-
Title: EFA-YOLO: An Efficient Feature Attention Model for Fire and Flame DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
As a natural disaster with high suddenness and great destructiveness, fire has long posed a major threat to human society and ecological environment. In recent years, with the rapid development of smart city and Internet of Things (IoT) technologies, fire detection systems based on deep learning have gradually become a key means to cope with fire hazards. However, existing fire detection models still have many challenges in terms of detection accuracy and real-time performance in complex contexts. To address these issues, we propose two key modules: EAConv (Efficient Attention Convolution) and EADown (Efficient Attention Downsampling). The EAConv module significantly improves the feature extraction efficiency by combining an efficient attention mechanism with depth-separable convolution, while the EADown module enhances the accuracy and efficiency of feature downsampling by utilizing spatial and channel attention mechanisms in combination with pooling operations. Based on these two modules, we design an efficient and lightweight flame detection model, EFA-YOLO (Efficient Feature Attention YOLO). Experimental results show that EFA-YOLO has a model parameter quantity of only 1.4M, GFLOPs of 4.6, and the inference time per image on the CPU is only 22.19 ms. Compared with existing mainstream models (e.g., YOLOv5, YOLOv8, YOLOv9, and YOLOv10), EFA-YOLO exhibits a significant enhancement in detection accuracy (mAP) and inference speed, with model parameter amount is reduced by 94.6 and the inference speed is improved by 88 times.
- [190] arXiv:2409.12636 [pdf, html, other]
-
Title: Image inpainting for corrupted images by using the semi-super resolution GANSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Image inpainting is a valuable technique for enhancing images that have been corrupted. The primary challenge in this research revolves around the extent of corruption in the input image that the deep learning model must restore. To address this challenge, we introduce a Generative Adversarial Network (GAN) for learning and replicating the missing pixels. Additionally, we have developed a distinct variant of the Super-Resolution GAN (SRGAN), which we refer to as the Semi-SRGAN (SSRGAN). Furthermore, we leveraged three diverse datasets to assess the robustness and accuracy of our proposed model. Our training process involves varying levels of pixel corruption to attain optimal accuracy and generate high-quality images.
- [191] arXiv:2409.12638 [pdf, html, other]
-
Title: $\text{M}^\text{6}(\text{GPT})^\text{3}$: Generating Multitrack Modifiable Multi-Minute MIDI Music from Text using Genetic algorithms, Probabilistic methods and GPT Models in any Progression and Time signatureComments: 12 pages, 1 figureSubjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
This work introduces the $\text{M}^\text{6}(\text{GPT})^\text{3}$ Composer system, capable of generating complete, multi-minute musical compositions with complex structures in any time signature, in the MIDI domain from input descriptions in natural language. The system utilizes an autoregressive transformer language model to map natural language prompts to composition parameters in JSON format. The defined structure includes time signature, scales, chord progressions, and valence-arousal values, from which accompaniment, melody, bass, motif, and percussion tracks are created. We propose a genetic algorithm for the generation of melodic elements. The algorithm incorporates mutations with musical significance and a fitness function based on normal distribution and predefined musical feature values. The values adaptively evolve, influenced by emotional parameters and distinct playing styles. The system for generating percussion in any time signature utilises probabilistic methods, including Markov chains. Through both human and objective evaluations, we demonstrate that our music generation approach outperforms baselines on specific, musically meaningful metrics, offering a valuable alternative to purely neural network-based systems.
- [192] arXiv:2409.12640 [pdf, html, other]
-
Title: Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure QueriesKiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate OlszewskaSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models which is also easy to automatically score. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts which measure the model's ability to do more than retrieve a single piece of information from its context. The central idea of the \frameworkname framework (\frameworkshort) is to construct tasks which require a model to ``chisel away'' the irrelevant information in the context, revealing a latent structure in the context. To verify a model's understanding of this latent structure, we query the model for details of the structure. Using \frameworkshort, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context language model capabilities. We perform evaluations on several state-of-the-art models and demonstrate both that a) the proposed evaluations are high-signal and b) that there is significant room for improvement in synthesizing long-context information.
- [193] arXiv:2409.12642 [pdf, html, other]
-
Title: Deep generative models as an adversarial attack strategy for tabular machine learningComments: Accepted at ICMLC 2024 (International Conference on Machine Learning and Cybernetics)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Deep Generative Models (DGMs) have found application in computer vision for generating adversarial examples to test the robustness of machine learning (ML) systems. Extending these adversarial techniques to tabular ML presents unique challenges due to the distinct nature of tabular data and the necessity to preserve domain constraints in adversarial examples. In this paper, we adapt four popular tabular DGMs into adversarial DGMs (AdvDGMs) and evaluate their effectiveness in generating realistic adversarial examples that conform to domain constraints.
- [194] arXiv:2409.12646 [pdf, html, other]
-
Title: Native Execution of GraphQL Queries over RDF Graphs Using Multi-way JoinsSubjects: Databases (cs.DB)
Purpose: The query language GraphQL has gained significant traction in recent years. In particular, it has recently gained the attention of the semantic web and graph database communities and is now often used as a means to query knowledge graphs. Most of the storage solutions that support GraphQL rely on a translation layer to map the said language to another query language that they support natively, for example SPARQL. Methodology: Our main innovation is a multi-way left-join algorithm inspired by worst-case optimal multi-way join algorithms. This novel algorithm enables the native execution of GraphQL queries over RDF knowledge graphs. We evaluate our approach in two settings using the LinGBM benchmark generator. Findings: The experimental results suggest that our solution outperforms the state-of-the-art graph storage solution for GraphQL with respect to both query runtimes and scalability. Value: Our solution is implemented in an open-sourced triple store, and is intended to advance the development of representation-agnostic storage solutions for knowledge graphs.
- [195] arXiv:2409.12651 [pdf, other]
-
Title: A Deep Dive into Fairness, Bias, Threats, and Privacy in Recommender Systems: Insights and Future ResearchComments: 38 pages, 6 figuresSubjects: Information Retrieval (cs.IR); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Recommender systems are essential for personalizing digital experiences on e-commerce sites, streaming services, and social media platforms. While these systems are necessary for modern digital interactions, they face fairness, bias, threats, and privacy challenges. Bias in recommender systems can result in unfair treatment of specific users and item groups, and fairness concerns demand that recommendations be equitable for all users and items. These systems are also vulnerable to various threats that compromise reliability and security. Furthermore, privacy issues arise from the extensive use of personal data, making it crucial to have robust protection mechanisms to safeguard user information. This study explores fairness, bias, threats, and privacy in recommender systems. It examines how algorithmic decisions can unintentionally reinforce biases or marginalize specific user and item groups, emphasizing the need for fair recommendation strategies. The study also looks at the range of threats in the form of attacks that can undermine system integrity and discusses advanced privacy-preserving techniques. By addressing these critical areas, the study highlights current limitations and suggests future research directions to improve recommender systems' robustness, fairness, and privacy. Ultimately, this research aims to help develop more trustworthy and ethical recommender systems that better serve diverse user populations.
- [196] arXiv:2409.12656 [pdf, html, other]
-
Title: Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific LeaderboardsSubjects: Computation and Language (cs.CL)
Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods. Typically, a leaderboard is defined by a task, dataset, and evaluation metric (TDM) triple, allowing objective performance assessment and fostering innovation through benchmarking. However, the exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually. Automatic leaderboard construction has emerged as a solution to reduce manual labor. Existing datasets for this task are based on the community-contributed leaderboards without additional curation. Our analysis shows that a large portion of these leaderboards are incomplete, and some of them contain incorrect information. In this work, we present SciLead, a manually-curated Scientific Leaderboard dataset that overcomes the aforementioned problems. Building on this dataset, we propose three experimental settings that simulate real-world scenarios where TDM triples are fully defined, partially defined, or undefined during leaderboard construction. While previous research has only explored the first setting, the latter two are more representative of real-world applications. To address these diverse settings, we develop a comprehensive LLM-based framework for constructing leaderboards. Our experiments and analysis reveal that various LLMs often correctly identify TDM triples while struggling to extract result values from publications. We make our code and data publicly available.
- [197] arXiv:2409.12658 [pdf, html, other]
-
Title: Exploring the topics, sentiments and hate speech in the Spanish information environmentComments: 24 pagesSubjects: Computation and Language (cs.CL)
In the digital era, the internet and social media have transformed communication but have also facilitated the spread of hate speech and disinformation, leading to radicalization, polarization, and toxicity. This is especially concerning for media outlets due to their significant role in shaping public discourse. This study examines the topics, sentiments, and hate prevalence in 337,807 response messages (website comments and tweets) to news from five Spanish media outlets (La Vanguardia, ABC, El País, El Mundo, and 20 Minutos) in January 2021. These public reactions were originally labeled as distinct types of hate by experts following an original procedure, and they are now classified into three sentiment values (negative, neutral, or positive) and main topics. The BERTopic unsupervised framework was used to extract 81 topics, manually named with the help of Large Language Models (LLMs) and grouped into nine primary categories.
Results show social issues (22.22%), expressions and slang (20.35%), and political issues (11.80%) as the most discussed. Content is mainly negative (62.7%) and neutral (28.57%), with low positivity (8.73%). Toxic narratives relate to conversation expressions, gender, feminism, and COVID-19. Despite low levels of hate speech (3.98%), the study confirms high toxicity in online responses to social and political topics. - [198] arXiv:2409.12659 [pdf, other]
-
Title: PoTATO: A Dataset for Analyzing Polarimetric Traces of Afloat Trash ObjectsComments: ECCV24 TRICKY workshop, Sep 2024, Milano (Italy), ItalySubjects: Computer Vision and Pattern Recognition (cs.CV)
Plastic waste in aquatic environments poses severe risks to marine life and human health. Autonomous robots can be utilized to collect floating waste, but they require accurate object identification capability. While deep learning has been widely used as a powerful tool for this task, its performance is significantly limited by outdoor light conditions and water surface reflection. Light polarization, abundant in such environments yet invisible to the human eye, can be captured by modern sensors to significantly improve litter detection accuracy on water surfaces. With this goal in mind, we introduce PoTATO, a dataset containing 12,380 labeled plastic bottles and rich polarimetric information. We demonstrate under which conditions polarization can enhance object detection and, by providing raw image data, we offer an opportunity for the research community to explore novel approaches and push the boundaries of state-of-the-art object detection algorithms even further. Code and data are publicly available at this https URL PoTATO/tree/eccv2024.
- [199] arXiv:2409.12661 [pdf, html, other]
-
Title: Manifold Sampling for Differentiable Uncertainty in Radiance FieldsLinjie Lyu, Ayush Tewari, Marc Habermann, Shunsuke Saito, Michael Zollhöfer, Thomas Leimkühler, Christian TheobaltComments: Siggraph Asia 2024 conferenceSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Radiance fields are powerful and, hence, popular models for representing the appearance of complex scenes. Yet, constructing them based on image observations gives rise to ambiguities and uncertainties. We propose a versatile approach for learning Gaussian radiance fields with explicit and fine-grained uncertainty estimates that impose only little additional cost compared to uncertainty-agnostic training. Our key observation is that uncertainties can be modeled as a low-dimensional manifold in the space of radiance field parameters that is highly amenable to Monte Carlo sampling. Importantly, our uncertainties are differentiable and, thus, allow for gradient-based optimization of subsequent captures that optimally reduce ambiguities. We demonstrate state-of-the-art performance on next-best-view planning tasks, including high-dimensional illumination planning for optimal radiance field relighting quality.
- [200] arXiv:2409.12667 [pdf, html, other]
-
Title: METDrive: Multi-modal End-to-end Autonomous Driving with Temporal GuidanceSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Multi-modal end-to-end autonomous driving has shown promising advancements in recent work. By embedding more modalities into end-to-end networks, the system's understanding of both static and dynamic aspects of the driving environment is enhanced, thereby improving the safety of autonomous driving. In this paper, we introduce METDrive, an end-to-end system that leverages temporal guidance from the embedded time series features of ego states, including rotation angles, steering, throttle signals, and waypoint vectors. The geometric features derived from perception sensor data and the time series features of ego state data jointly guide the waypoint prediction with the proposed temporal guidance loss function. We evaluated METDrive on the CARLA leaderboard's Longest6 benchmark, achieving a driving score of 70%, a route completion score of 94%, and an infraction score of 0.78.
- [201] arXiv:2409.12669 [pdf, html, other]
-
Title: Enhancing Construction Site Safety: A Lightweight Convolutional Network for Effective Helmet DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
In the realm of construction safety, the detection of personal protective equipment, such as helmets, plays a critical role in preventing workplace injuries. This paper details the development and evaluation of convolutional neural networks (CNNs) designed for the accurate classification of helmet presence on construction sites. Initially, a simple CNN model comprising one convolutional block and one fully connected layer was developed, yielding modest results. To enhance its performance, the model was progressively refined, first by extending the architecture to include an additional convolutional block and a fully connected layer. Subsequently, batch normalization and dropout techniques were integrated, aiming to mitigate overfitting and improve the model's generalization capabilities. The performance of these models is methodically analyzed, revealing a peak F1-score of 84\%, precision of 82\%, and recall of 86\% with the most advanced configuration of the first study phase. Despite these improvements, the accuracy remained suboptimal, thus setting the stage for further architectural and operational enhancements. This work lays a foundational framework for ongoing adjustments and optimization in automated helmet detection technology, with future enhancements expected to address the limitations identified during these initial experiments.
- [202] arXiv:2409.12670 [pdf, html, other]
-
Title: Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement TrajectoriesComments: To appear in the International Natural Language Generation Conference (INLG 2024)Subjects: Computation and Language (cs.CL)
This paper presents Text2Traj2Text, a novel learning-by-synthesis framework for captioning possible contexts behind shopper's trajectory data in retail stores. Our work will impact various retail applications that need better customer understanding, such as targeted advertising and inventory management. The key idea is leveraging large language models to synthesize a diverse and realistic collection of contextual captions as well as the corresponding movement trajectories on a store map. Despite learned from fully synthesized data, the captioning model can generalize well to trajectories/captions created by real human subjects. Our systematic evaluation confirmed the effectiveness of the proposed framework over competitive approaches in terms of ROUGE and BERT Score metrics.
- [203] arXiv:2409.12676 [pdf, html, other]
-
Title: An Exploration of Agile Methods in the Automotive Industry: Benefits, Challenges and OpportunitiesSubjects: Software Engineering (cs.SE)
Agile methodologies have gained significant traction in the software development industry, promising increased flexibility and responsiveness to changing requirements. However, their applicability to safety-critical systems, particularly in the automotive sector, remains a topic of debate. This paper examines the benefits and challenges of implementing agile methods in the automotive industry through a comprehensive review of relevant literature and case studies. Our findings highlight the potential advantages of agile approaches, such as improved collaboration and faster time-to-market, as well as the inherent challenges, including safety compliance and cultural resistance. By synthesizing existing research and practical insights, this paper aims to provide an understanding of the role of agile methods in shaping the future of automotive software development.
- [204] arXiv:2409.12677 [pdf, html, other]
-
Title: (Un)certainty of (Un)fairness: Preference-Based Selection of Certainly Fair Decision-MakersComments: Accepted in 27TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (ECAI 2024)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Fairness metrics are used to assess discrimination and bias in decision-making processes across various domains, including machine learning models and human decision-makers in real-world applications. This involves calculating the disparities between probabilistic outcomes among social groups, such as acceptance rates between male and female applicants. However, traditional fairness metrics do not account for the uncertainty in these processes and lack of comparability when two decision-makers exhibit the same disparity. Using Bayesian statistics, we quantify the uncertainty of the disparity to enhance discrimination assessments. We represent each decision-maker, whether a machine learning model or a human, by its disparity and the corresponding uncertainty in that disparity. We define preferences over decision-makers and utilize brute-force to choose the optimal decision-maker according to a utility function that ranks decision-makers based on these preferences. The decision-maker with the highest utility score can be interpreted as the one for whom we are most certain that it is fair.
- [205] arXiv:2409.12680 [pdf, html, other]
-
Title: Semi-Supervised Semantic Segmentation with Professional and General TrainingComments: 18 pages, 10 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
With the advancement of convolutional neural networks, semantic segmentation has achieved remarkable progress. The training of such networks heavily relies on image annotations, which are very expensive to obtain. Semi-supervised learning can utilize both labeled data and unlabeled data with the help of pseudo-labels. However, in many real-world scenarios where classes are imbalanced, majority classes often play a dominant role during training and the learning quality of minority classes can be undermined. To overcome this limitation, we propose a synergistic training framework, including a professional training module to enhance minority class learning and a general training module to learn more comprehensive semantic information. Based on a pixel selection strategy, they can iteratively learn from each other to reduce error accumulation and coupling. In addition, a dual contrastive learning with anchors is proposed to guarantee more distinct decision boundaries. In experiments, our framework demonstrates superior performance compared to state-of-the-art methods on benchmark datasets.
- [206] arXiv:2409.12682 [pdf, html, other]
-
Title: Retrieval-Augmented Test Generation: How Far Are We?Comments: 18 pages + referenceSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Retrieval Augmented Generation (RAG) has shown notable advancements in software engineering tasks. Despite its potential, RAG's application in unit test generation remains under-explored. To bridge this gap, we take the initiative to investigate the efficacy of RAG-based LLMs in test generation. As RAGs can leverage various knowledge sources to enhance their performance, we also explore the impact of different sources of RAGs' knowledge bases on unit test generation to provide insights into their practical benefits and limitations. Specifically, we examine RAG built upon three types of domain knowledge: 1) API documentation, 2) GitHub issues, and 3) StackOverflow Q&As. Each source offers essential knowledge for creating tests from different perspectives, i.e., API documentations provide official API usage guidelines, GitHub issues offer resolutions of issues related to the APIs from the library developers, and StackOverflow Q&As present community-driven solutions and best practices. For our experiment, we focus on five widely used and typical Python-based machine learning (ML) projects, i.e., TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost to build, train, and deploy complex neural networks efficiently. We conducted experiments using the top 10% most widely used APIs across these projects, involving a total of 188 APIs. We investigate the effectiveness of four state-of-the-art LLMs (open and closed-sourced), i.e., GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llamma 3.1 405B. Additionally, we compare three prompting strategies in generating unit test cases for the experimental APIs, i.e., zero-shot, a Basic RAG, and an API-level RAG on the three external sources. Finally, we compare the cost of different sources of knowledge used for the RAG.
- [207] arXiv:2409.12683 [pdf, html, other]
-
Title: Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource ScenariosComments: Selected as a full-day tutorial at COLING 2025Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in `lower-resource' scenarios such as dialects/sociolects (national or social varieties of a language), Creoles (languages arising from linguistic contact between multiple languages) and other low-resource languages. This introductory tutorial will identify common challenges, approaches, and themes in natural language processing (NLP) research for confronting and overcoming the obstacles inherent to data-poor contexts. By connecting past ideas to the present field, this tutorial aims to ignite collaboration and cross-pollination between researchers working in these scenarios. Our notion of `lower-resource' broadly denotes the outstanding lack of data required for model training - and may be applied to scenarios apart from the three covered in the tutorial.
- [208] arXiv:2409.12690 [pdf, other]
-
Title: Exploring Scientometrics with the OpenAIRE Graph: Introducing the OpenAIRE Beginner's KitComments: Accepted for oral presentation at The 28th International Conference on Science, Technology and Innovation Indicators (STI), 2024. September 18-20, 2024, BerlinSubjects: Digital Libraries (cs.DL)
The OpenAIRE Graph is an extensive resource housing diverse information on research products, including literature, datasets, and software, alongside research projects and other scholarly outputs and context. It stands as a cornerstone among contemporary research information databases, offering invaluable insights for scientometric investigations. Despite its wealth of data, its sheer size may initially appear daunting, potentially hindering its widespread adoption. To address this challenge, this paper introduces the OpenAIRE Beginner's Kit, a user-friendly solution providing access to a subset of the OpenAIRE Graph within a sandboxed environment coupled with a Jupyter notebook for analysis. The OpenAIRE Beginner's Kit is meticulously designed to democratise research and data exploration, offering accessibility from standard desktop and laptop setups. Within this paper, we provide a brief overview of the included dataset and offer guidance on leveraging the kit through a selection of illustrative queries tailored to address common scientometric inquiries.
- [209] arXiv:2409.12691 [pdf, html, other]
-
Title: A dynamic vision sensor object recognition model based on trainable event-driven convolution and spiking attention mechanismComments: 14 pages, 2 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Spiking Neural Networks (SNNs) are well-suited for processing event streams from Dynamic Visual Sensors (DVSs) due to their use of sparse spike-based coding and asynchronous event-driven computation. To extract features from DVS objects, SNNs commonly use event-driven convolution with fixed kernel parameters. These filters respond strongly to features in specific orientations while disregarding others, leading to incomplete feature extraction. To improve the current event-driven convolution feature extraction capability of SNNs, we propose a DVS object recognition model that utilizes a trainable event-driven convolution and a spiking attention mechanism. The trainable event-driven convolution is proposed in this paper to update its convolution kernel through gradient descent. This method can extract local features of the event stream more efficiently than traditional event-driven convolution. Furthermore, the spiking attention mechanism is used to extract global dependence features. The classification performances of our model are better than the baseline methods on two neuromorphic datasets including MNIST-DVS and the more complex CIFAR10-DVS. Moreover, our model showed good classification ability for short event streams. It was shown that our model can improve the performance of event-driven convolutional SNNs for DVS objects.
- [210] arXiv:2409.12694 [pdf, html, other]
-
Title: Comparing the Hardness of Online Minimization and Maximization Problems with PredictionsSubjects: Data Structures and Algorithms (cs.DS)
Building on the work of Berg, Boyar, Favrholdt, and Larsen, who developed a complexity theory for online minimization problems with and without predictions (arXiv:2406.18265), we consider online maximization problems with and without predictions. We define complexity classes that capture the hardness of Online Bounded Degree Independent Set, prove several structural properties of the complexity classes, establish a strict hierarchy, and show multiple membership, hardness, and completeness results.
Further, we establish reductions that relate the hardness of Online Bounded Degree Vertex Cover and Online Bounded Degree Independent Set, while respecting that the hardness of minimization problems is measured differently than the hardness of maximization problems. In particular, we show that there exist good algorithms for Online Bounded Degree Independent Set if and only if there exist good algorithms for Online Bounded Degree Vertex Cover. Since these reductions relate the hardness of complete problems for the complexity classes for maximization problems and complete problems for the complexity classes for minimization problems, their existence provides a connection between the complexity classes for minimization and maximization problems. This connection gives similar reductions relating the hardness of various other pairs of minimization and maximization problems. - [211] arXiv:2409.12695 [pdf, html, other]
-
Title: Exploring Large Language Models for Product Attribute Value IdentificationSubjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Product attribute value identification (PAVI) involves automatically identifying attributes and their values from product information, enabling features like product search, recommendation, and comparison. Existing methods primarily rely on fine-tuning pre-trained language models, such as BART and T5, which require extensive task-specific training data and struggle to generalize to new attributes. This paper explores large language models (LLMs), such as LLaMA and Mistral, as data-efficient and robust alternatives for PAVI. We propose various strategies: comparing one-step and two-step prompt-based approaches in zero-shot settings and utilizing parametric and non-parametric knowledge through in-context learning examples. We also introduce a dense demonstration retriever based on a pre-trained T5 model and perform instruction fine-tuning to explicitly train LLMs on task-specific instructions. Extensive experiments on two product benchmarks show that our two-step approach significantly improves performance in zero-shot settings, and instruction fine-tuning further boosts performance when using training data, demonstrating the practical benefits of using LLMs for PAVI.
- [212] arXiv:2409.12699 [pdf, html, other]
-
Title: PromSec: Prompt Optimization for Secure Generation of Functional Source Code with Large Language Models (LLMs)Comments: 15 pages, 19 figures, CCS 2024Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
The capability of generating high-quality source code using large language models (LLMs) reduces software development time and costs. However, they often introduce security vulnerabilities due to training on insecure open-source data. This highlights the need for ensuring secure and functional code generation. This paper introduces PromSec, an algorithm for prom optimization for secure and functioning code generation using LLMs. In PromSec, we combine 1) code vulnerability clearing using a generative adversarial graph neural network, dubbed as gGAN, to fix and reduce security vulnerabilities in generated codes and 2) code generation using an LLM into an interactive loop, such that the outcome of the gGAN drives the LLM with enhanced prompts to generate secure codes while preserving their functionality. Introducing a new contrastive learning approach in gGAN, we formulate code-clearing and generation as a dual-objective optimization problem, enabling PromSec to notably reduce the number of LLM inferences. PromSec offers a cost-effective and practical solution for generating secure, functional code. Extensive experiments conducted on Python and Java code datasets confirm that PromSec effectively enhances code security while upholding its intended functionality. Our experiments show that while a state-of-the-art approach fails to address all code vulnerabilities, PromSec effectively resolves them. Moreover, PromSec achieves more than an order-of-magnitude reduction in operation time, number of LLM queries, and security analysis costs. Furthermore, prompts optimized with PromSec for a certain LLM are transferable to other LLMs across programming languages and generalizable to unseen vulnerabilities in training. This study is a step in enhancing the trustworthiness of LLMs for secure and functional code generation, supporting their integration into real-world software development.
- [213] arXiv:2409.12701 [pdf, html, other]
-
Title: An Empirical Study on the Distance Metric in Guiding Directed Grey-box FuzzingSubjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Directed grey-box fuzzing (DGF) aims to discover vulnerabilities in specific code areas efficiently. Distance metric, which is used to measure the quality of seed in DGF, is a crucial factor in affecting the fuzzing performance. Despite distance metrics being widely applied in existing DGF frameworks, it remains opaque about how different distance metrics guide the fuzzing process and affect the fuzzing result in practice. In this paper, we conduct the first empirical study to explore how different distance metrics perform in guiding DGFs. Specifically, we systematically discuss different distance metrics in the aspect of calculation method and granularity. Then, we implement different distance metrics based on AFLGo. On this basis, we conduct comprehensive experiments to evaluate the performance of these distance metrics on the benchmarks widely used in existing DGF-related work. The experimental results demonstrate the following insights. First, the difference among different distance metrics with varying methods of calculation and granularities is not significant. Second, the distance metrics may not be effective in describing the difficulty of triggering the target vulnerability. In addition, by scrutinizing the quality of testcases, our research highlights the inherent limitation of existing mutation strategies in generating high-quality testcases, calling for designing effective mutation strategies for directed fuzzing. We open-source the implementation code and experiment dataset to facilitate future research in DGF.
- [214] arXiv:2409.12705 [pdf, html, other]
-
Title: Generation and Editing of Mandrill Faces: Application to Sex Editing and AssessmentSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Generative AI has seen major developments in recent years, enhancing the realism of synthetic images, also known as computer-generated images. In addition, generative AI has also made it possible to modify specific image characteristics through image editing. Previous work has developed methods based on generative adversarial networks (GAN) for generating realistic images, in particular faces, but also to modify specific features. However, this work has never been applied to specific animal species. Moreover, the assessment of the results has been generally done subjectively, rather than quantitatively. In this paper, we propose an approach based on methods for generating images of faces of male or female mandrills, a non-human primate. The main novelty of proposed method is the ability to edit their sex by identifying a sex axis in the latent space of a specific GAN. In addition, we have developed an assessment of the sex levels based on statistical features extracted from real image distributions. The experimental results we obtained from a specific database are not only realistic, but also accurate, meeting a need for future work in behavioral experiments with wild mandrills.
- [215] arXiv:2409.12709 [pdf, html, other]
-
Title: SeqRisk: Transformer-augmented latent variable model for improved survival prediction with longitudinal dataSubjects: Machine Learning (cs.LG)
In healthcare, risk assessment of different patient outcomes has for long time been based on survival analysis, i.e.\ modeling time-to-event associations. However, conventional approaches rely on data from a single time-point, making them suboptimal for fully leveraging longitudinal patient history and capturing temporal regularities. Focusing on clinical real-world data and acknowledging its challenges, we utilize latent variable models to effectively handle irregular, noisy, and sparsely observed longitudinal data. We propose SeqRisk, a method that combines variational autoencoder (VAE) or longitudinal VAE (LVAE) with a transformer encoder and Cox proportional hazards module for risk prediction. SeqRisk captures long-range interactions, improves patient trajectory representations, enhances predictive accuracy and generalizability, as well as provides partial explainability for sample population characteristics in attempts to identify high-risk patients. We demonstrate that SeqRisk performs competitively compared to existing approaches on both simulated and real-world datasets.
- [216] arXiv:2409.12710 [pdf, html, other]
-
Title: Age of gossip from connective properties via first passage percolationComments: 12 pagesSubjects: Information Theory (cs.IT); Probability (math.PR)
In gossip networks, a source node forwards time-stamped updates to a network of observers according to a Poisson process. The observers then update each other on this information according to Poisson processes as well. The Age of Information (AoI) of a given node is the difference between the current time and the most recent time-stamp of source information that the node has received.
We provide a method for evaluating the AoI of a node in terms of first passage percolation. We then use this distributional identity to prove matching upper and lower bounds on the AoI in terms of connectivity properties of the underlying network. In particular, if one sets $X_v$ to be the AoI of node $v$ on a finite graph $G$ with $n$ nodes, then we define $m_\ast = \min\{m : m \cdot |B_m(v)| \geq n\}$ where $B_m(v)$ is the ball of radius $m$ in $G$. In the case when the maximum degree of $G$ is bounded by $\Delta$ we prove $\mathbb{E} X_v = \Theta_\Delta(m_\ast)$.
As corollaries, we solve multiple open problems in the literature such as showing the age of information on a subset of $\mathbb{Z}^d$ is $\Theta(n^{1/(d+1)})$.
We also demonstrate examples of graphs with AoI scaling like $n^{\alpha}$ for each $\alpha \in (0,1/2)$. These graphs are not vertex-transitive and in fact we show that if one considers the AoI on a graph coming from a vertex-transitive infinite graph then either $\mathbb{E} X_v = \Theta(n^{1/k})$ for some integer $k \geq 2$ or $\mathbb{E} X_v = n^{o(1)}$. - [217] arXiv:2409.12713 [pdf, html, other]
-
Title: A Signal Temporal Logic Approach for Task-Based Coordination of Multi-Aerial Systems: a Wind Turbine Inspection Case StudyComments: This work has been submitted to "Robotics and Autonomous Systems" for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Robotics (cs.RO)
The paper addresses task assignment and trajectory generation for collaborative inspection missions using a fleet of multi-rotors, focusing on the wind turbine inspection scenario. The proposed solution enables safe and feasible trajectories while accommodating heterogeneous time-bound constraints and vehicle physical limits. An optimization problem is formulated to meet mission objectives and temporal requirements encoded as Signal Temporal Logic (STL) specifications. Additionally, an event-triggered replanner is introduced to address unforeseen events and compensate for lost time. Furthermore, a generalized robustness scoring method is employed to reflect user preferences and mitigate task conflicts. The effectiveness of the proposed approach is demonstrated through MATLAB and Gazebo simulations, as well as field multi-robot experiments in a mock-up scenario.
- [218] arXiv:2409.12714 [pdf, html, other]
-
Title: Towards adaptive trajectories for mixed autonomous and human-operated shipsComments: Submitted to SISSY 2024Subjects: Computers and Society (cs.CY); Software Engineering (cs.SE)
We are witnessing the rise of autonomous cars, which will likely revolutionize the way we travel. Arguably, the maritime domain lags behind, as ships operate on many more degrees of freedom (thus, a much larger search space): there is less physical infrastructure, and rules are less consistent and constraining than what is found on roads. The problem is further complicated by the inevitable co-existence of autonomous and human-operated ships: the latter may take unpredictable decisions, which require adjustments on the autonomous ones. Finally, the problem is inherently decentralised, there is no central authority, and communication means can be very diverse in terms of communication distance and performance, mandating special care on which information is shared and how. In this work, we elaborate on the challenges of trajectory prediction and adaptation for mixed autonomous and human-operated ships, and we propose initial ideas on potential approaches to address them.
- [219] arXiv:2409.12716 [pdf, html, other]
-
Title: Optical Flow Matters: an Empirical Comparative Study on Fusing Monocular Extracted Modalities for Better SteeringFouad Makiyeh, Mark Bastourous, Anass Bairouk, Wei Xiao, Mirjana Maras, Tsun-Hsuan Wangb, Marc Blanchon, Ramin Hasani, Patrick Chareyre, Daniela RusSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Autonomous vehicle navigation is a key challenge in artificial intelligence, requiring robust and accurate decision-making processes. This research introduces a new end-to-end method that exploits multimodal information from a single monocular camera to improve the steering predictions for self-driving cars. Unlike conventional models that require several sensors which can be costly and complex or rely exclusively on RGB images that may not be robust enough under different conditions, our model significantly improves vehicle steering prediction performance from a single visual sensor. By focusing on the fusion of RGB imagery with depth completion information or optical flow data, we propose a comprehensive framework that integrates these modalities through both early and hybrid fusion techniques.
We use three distinct neural network models to implement our approach: Convolution Neural Network - Neutral Circuit Policy (CNN-NCP) , Variational Auto Encoder - Long Short-Term Memory (VAE-LSTM) , and Neural Circuit Policy architecture VAE-NCP. By incorporating optical flow into the decision-making process, our method significantly advances autonomous navigation. Empirical results from our comparative study using Boston driving data show that our model, which integrates image and motion information, is robust and reliable. It outperforms state-of-the-art approaches that do not use optical flow, reducing the steering estimation error by 31%. This demonstrates the potential of optical flow data, combined with advanced neural network architectures (a CNN-based structure for fusing data and a Recurrence-based network for inferring a command from latent space), to enhance the performance of autonomous vehicles steering estimation. - [220] arXiv:2409.12718 [pdf, html, other]
-
Title: Probabilistically Robust Trajectory Planning of Multiple Aerial AgentsComments: 18th International Conference on Control, Automation, Robotics and Vision (ICARCV 2024)Subjects: Systems and Control (eess.SY)
Current research on robust trajectory planning for autonomous agents aims to mitigate uncertainties arising from disturbances and modeling errors while ensuring guaranteed safety. Existing methods primarily utilize stochastic optimal control techniques with chance constraints to maintain a minimum distance among agents with a guaranteed probability. However, these approaches face challenges, such as the use of simplifying assumptions that result in linear system models or Gaussian disturbances, which limit their practicality in complex realistic scenarios. To address these limitations, this work introduces a novel probabilistically robust distributed controller enabling autonomous agents to plan safe trajectories, even under non-Gaussian uncertainty and nonlinear systems. Leveraging exact uncertainty propagation techniques based on mixed-trigonometric-polynomial moment propagation, this method transforms non-Gaussian chance constraints into deterministic ones, seamlessly integrating them into a distributed model predictive control framework solvable with standard optimization tools. Simulation results demonstrate the effectiveness of this technique, highlighting its ability to consistently handle various types of uncertainty, ensuring robust and accurate path planning in complex scenarios.
- [221] arXiv:2409.12720 [pdf, html, other]
-
Title: FAST GDRNPP: Improving the Speed of State-of-the-Art 6D Object Pose EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
6D object pose estimation involves determining the three-dimensional translation and rotation of an object within a scene and relative to a chosen coordinate system. This problem is of particular interest for many practical applications in industrial tasks such as quality control, bin picking, and robotic manipulation, where both speed and accuracy are critical for real-world deployment. Current models, both classical and deep-learning-based, often struggle with the trade-off between accuracy and latency. Our research focuses on enhancing the speed of a prominent state-of-the-art deep learning model, GDRNPP, while keeping its high accuracy. We employ several techniques to reduce the model size and improve inference time. These techniques include using smaller and quicker backbones, pruning unnecessary parameters, and distillation to transfer knowledge from a large, high-performing model to a smaller, more efficient student model. Our findings demonstrate that the proposed configuration maintains accuracy comparable to the state-of-the-art while significantly improving inference time. This advancement could lead to more efficient and practical applications in various industrial scenarios, thereby enhancing the overall applicability of 6D Object Pose Estimation models in real-world settings.
- [222] arXiv:2409.12722 [pdf, other]
-
Title: LLM-Measure: Generating Valid, Consistent, and Reproducible Text-Based Measures for Social Science ResearchSubjects: Computation and Language (cs.CL)
The increasing use of text as data in social science research necessitates the development of valid, consistent, reproducible, and efficient methods for generating text-based concept measures. This paper presents a novel method that leverages the internal hidden states of large language models (LLMs) to generate these concept measures. Specifically, the proposed method learns a concept vector that captures how the LLM internally represents the target concept, then estimates the concept value for text data by projecting the text's LLM hidden states onto the concept vector. Three replication studies demonstrate the method's effectiveness in producing highly valid, consistent, and reproducible text-based measures across various social science research contexts, highlighting its potential as a valuable tool for the research community.
- [223] arXiv:2409.12723 [pdf, html, other]
-
Title: Optimal Cosserat-based deformation control for robotic manipulation of linear objectsSubjects: Robotics (cs.RO)
The robotic shape control of deformable linear objects has garnered increasing interest within the robotics community. Despite recent progress, the majority of shape control approaches can be classified into two main groups: open-loop control, which relies on physically realistic models to represent the object, and closed-loop control, which employs less precise models alongside visual data to compute commands. In this work, we present a novel 3D shape control approach that includes the physically realistic Cosserat model into a closed-loop control framework, using vision feedback to rectify errors in real-time. This approach capitalizes on the advantages of both groups: the realism and precision provided by physics-based models, and the rapid computation, therefore enabling real-time correction of model errors, and robustness to elastic parameter estimation inherent in vision-based approaches. This is achieved by computing a deformation Jacobian derived from both the Cosserat model and visual data. To demonstrate the effectiveness of the method, we conduct a series of shape control experiments where robots are tasked with deforming linear objects towards a desired shape.
- [224] arXiv:2409.12724 [pdf, html, other]
-
Title: PVContext: Hybrid Context Model for Point Cloud CompressionSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Efficient storage of large-scale point cloud data has become increasingly challenging due to advancements in scanning technology. Recent deep learning techniques have revolutionized this field; However, most existing approaches rely on single-modality contexts, such as octree nodes or voxel occupancy, limiting their ability to capture information across large regions. In this paper, we propose PVContext, a hybrid context model for effective octree-based point cloud compression. PVContext comprises two components with distinct modalities: the Voxel Context, which accurately represents local geometric information using voxels, and the Point Context, which efficiently preserves global shape information from point clouds. By integrating these two contexts, we retain detailed information across large areas while controlling the context size. The combined context is then fed into a deep entropy model to accurately predict occupancy. Experimental results demonstrate that, compared to G-PCC, our method reduces the bitrate by 37.95\% on SemanticKITTI LiDAR point clouds and by 48.98\% and 36.36\% on dense object point clouds from MPEG 8i and MVUB, respectively.
- [225] arXiv:2409.12726 [pdf, html, other]
-
Title: Cloudy with a Chance of Anomalies: Dynamic Graph Neural Network for Early Detection of Cloud Services' User AnomaliesSubjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Ensuring the security of cloud environments is imperative for sustaining organizational growth and operational efficiency. As the ubiquity of cloud services continues to rise, the inevitability of cyber threats underscores the importance of preemptive detection. This paper introduces a pioneering time-based embedding approach for Cloud Services Graph-based Anomaly Detection (CS-GAD), utilizing a Graph Neural Network (GNN) to discern anomalous user behavior during interactions with cloud services. Our method employs a dynamic tripartite graph representation to encapsulate the evolving interactions among cloud services, users, and their activities over time. Leveraging GNN models in each time frame, our approach generates a graph embedding wherein each user is assigned a score based on their historical activity, facilitating the identification of unusual behavior. Results demonstrate a notable reduction in false positive rates (2-9%) compared to prevailing methods, coupled with a commendable true positive rate (100%). The contributions of this work encompass early detection capabilities, a low false positive rate, an innovative tripartite graph representation incorporating action types, the introduction of a new cloud services dataset featuring various user attacks, and an open-source implementation for community collaboration in advancing cloud service security.
- [226] arXiv:2409.12727 [pdf, html, other]
-
Title: A Generalization of Habicht's Theorem for Subresultants of Several Univariate PolynomialsSubjects: Symbolic Computation (cs.SC)
Subresultants of two univariate polynomials are one of the most classic and ubiquitous objects in computational algebra and algebraic geometry. In 1948, Habicht discovered and proved interesting relationships among subresultants. Those relationships were found to be useful for both structural understanding and efficient computation. Often one needs to consider several (possibly more than two) polynomials. It is rather straightforward to generalize the notion of subresultants to several polynomials. However, it is not obvious (in fact, quite challenging) to generalize the Habicht's result to several polynomials. The main contribution of this paper is to provide such a generalization.
- [227] arXiv:2409.12730 [pdf, html, other]
-
Title: When SparseMoE Meets Noisy Interactions: An Ensemble View on Denoising RecommendationSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Learning user preferences from implicit feedback is one of the core challenges in recommendation. The difficulty lies in the potential noise within implicit feedback. Therefore, various denoising recommendation methods have been proposed recently. However, most of them overly rely on the hyperparameter configurations, inevitably leading to inadequacies in model adaptability and generalization performance. In this study, we propose a novel Adaptive Ensemble Learning (AEL) for denoising recommendation, which employs a sparse gating network as a brain, selecting suitable experts to synthesize appropriate denoising capacities for different data samples. To address the ensemble learning shortcoming of model complexity and ensure sub-recommender diversity, we also proposed a novel method that stacks components to create sub-recommenders instead of directly constructing them. Extensive experiments across various datasets demonstrate that AEL outperforms others in kinds of popular metrics, even in the presence of substantial and dynamic noise. Our code is available at this https URL.
- [228] arXiv:2409.12735 [pdf, html, other]
-
Title: Fine Manipulation Using a Tactile Skin: Learning in Simulation and Sim-to-Real TransferComments: Accepted for the 2024 IEEE/RSJ International Conference on Intelligent Robots and SystemsSubjects: Robotics (cs.RO)
We want to enable fine manipulation with a multi-fingered robotic hand by using modern deep reinforcement learning methods. Key for fine manipulation is a spatially resolved tactile sensor. Here, we present a novel model of a tactile skin that can be used together with rigid-body (hence fast) physics simulators. The model considers the softness of the real fingertips such that a contact can spread across multiple taxels of the sensor depending on the contact geometry. We calibrate the model parameters to allow for an accurate simulation of the real-world sensor. For this, we present a self-contained calibration method without external tools or sensors. To demonstrate the validity of our approach, we learn two challenging fine manipulation tasks: Rolling a marble and a bolt between two fingers. We show in simulation experiments that tactile feedback is crucial for precise manipulation and reaching sub-taxel resolution of < 1 mm (despite a taxel spacing of 4 mm). Moreover, we demonstrate that all policies successfully transfer from the simulation to the real robotic hand.
- [229] arXiv:2409.12737 [pdf, html, other]
-
Title: MEXMA: Token-level objectives improve sentence representationsComments: 11 pages, 12 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Current pre-trained cross-lingual sentence encoders approaches use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and all tokens directly updating the encoder. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bi-text mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them.
- [230] arXiv:2409.12739 [pdf, html, other]
-
Title: Edu-Values: Towards Evaluating the Chinese Education Values of Large Language ModelsComments: 9 pages, 5 figuresSubjects: Computation and Language (cs.CL)
With the recent evolution of large language models (LLMs), concerns about aligning such models with human values have grown. Previous research has primarily focused on assessing LLMs' performance in terms of the Helpful, Honest, Harmless (3H) basic principles, while often overlooking their alignment with educational values in the Chinese context. To fill this gap, we present Edu-Values, the first Chinese education values evaluation benchmark designed to measure LLMs' alignment ability across seven dimensions: professional ideology, cultural literacy, educational knowledge and skills, education laws and regulations, teachers' professional ethics, basic competencies, and subject knowledge. We meticulously design and compile 1,418 questions, including multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and questions on traditional Chinese culture. We conduct both human evaluation and automatic evaluation over 11 state-of-the-art (SoTA) LLMs, and highlight three main findings: (1) due to differences in educational culture, Chinese LLMs significantly outperform English LLMs, with Qwen 2 ranking the first with a score of 81.37; (2) LLMs perform well in subject knowledge and teaching skills but struggle with teachers' professional ethics and basic competencies; (3) LLMs excel at multiple-choice questions but perform poorly on subjective analysis and multi-modal tasks. This demonstrates the effectiveness and potential of the proposed benchmark.
Our dataset is available at this https URL. - [231] arXiv:2409.12740 [pdf, html, other]
-
Title: HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User ModelingSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have achieved remarkable success in various fields, prompting several studies to explore their potential in recommendation systems. However, these attempts have so far resulted in only modest improvements over traditional recommendation models. Moreover, three critical questions remain under-explored: firstly, the real value of LLMs' pre-trained weights, often considered to encapsulate world knowledge; secondly, the necessity of fine-tuning for recommendation tasks; lastly, whether LLMs can exhibit the same scalability benefits in recommendation systems as they do in other domains. In this paper, we propose a novel Hierarchical Large Language Model (HLLM) architecture designed to enhance sequential recommendation systems. Our approach employs a two-tier model: the first Item LLM extracts rich content features from the detailed text description of the item, while the second User LLM utilizes these features to predict users' future interests based on their interaction history. Extensive experiments demonstrate that our method effectively leverages the pre-trained capabilities of open-source LLMs, and further fine-tuning leads to significant performance boosts. Additionally, HLLM achieves excellent scalability, with the largest configuration utilizing 7B parameters for both item feature extraction and user interest modeling. Moreover, HLLM offers excellent training and serving efficiency, making it practical in real-world applications. Evaluations on two large-scale datasets, PixelRec and Amazon Reviews, show that HLLM achieves state-of-the-art results, outperforming traditional ID-based models by a wide margin. In online A/B testing, HLLM showcases notable gains, validating its practical impact in real-world recommendation scenarios. Codes are available at this https URL.
- [232] arXiv:2409.12741 [pdf, other]
-
Title: Fine Tuning Large Language Models for Medicine: The Role and Importance of Direct Parameter OptimizationThomas Savage, Stephen Ma, Abdessalem Boukil, Vishwesh Patel, Ekanath Rangan, Ivan Rodriguez, Jonathan H ChenSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Model (LLM) fine tuning is underutilized in the field of medicine. Two of the most common methods of fine tuning are Supervised Fine Tuning (SFT) and Direct Parameter Optimization (DPO), but there is little guidance informing users when to use either technique. In this investigation, we compare the performance of SFT and DPO for five common natural language tasks in medicine: Classification with text data, Classification with numeric data, Clinical Reasoning, Summarization, and Clinical Triage. We find that SFT alone is sufficient for Classification with text data, whereas DPO improves performance for the more complex tasks of Clinical Reasoning, Summarization and Clinical Triage. Our results establish the role and importance of DPO fine tuning within medicine, and consequently call attention to current software gaps that prevent widespread deployment of this technique.
- [233] arXiv:2409.12744 [pdf, other]
-
Title: Optimal Coding for Randomized Kolmogorov Complexity and Its ApplicationsSubjects: Computational Complexity (cs.CC)
The coding theorem for Kolmogorov complexity states that any string sampled from a computable distribution has a description length close to its information content. A coding theorem for resource-bounded Kolmogorov complexity is the key to obtaining fundamental results in average-case complexity, yet whether any samplable distribution admits a coding theorem for randomized time-bounded Kolmogorov complexity ($\mathsf{rK}^\mathsf{poly}$) is open and a common bottleneck in the recent literature of meta-complexity. Previous works bypassed this issue by considering probabilistic Kolmogorov complexity ($\mathsf{pK}^\mathsf{poly}$), in which public random bits are assumed to be available. In this paper, we present an efficient coding theorem for randomized Kolmogorov complexity under the non-existence of one-way functions, thereby removing the common bottleneck. This enables us to prove $\mathsf{rK}^\mathsf{poly}$ counterparts of virtually all the average-case results that were proved only for $\mathsf{pK}^\mathsf{poly}$, and enables the resolution of the open problems of Hirahara, Ilango, Lu, Nanashima, and Oliveira (STOC'23) and Hirahara, Kabanets, Lu, and Oliveira (CCC'24). The key technical lemma is that any distribution whose next bits are efficiently predictable admits an efficient encoding and decoding scheme, which could be of independent interest to data compression.
- [234] arXiv:2409.12745 [pdf, html, other]
-
Title: Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent SpaceSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
The use of synthetic speech as data augmentation is gaining increasing popularity in fields such as automatic speech recognition and speech classification tasks. Despite novel text-to-speech systems with voice cloning capabilities, that allow the usage of a larger amount of voices based on short audio segments, it is known that these systems tend to hallucinate and oftentimes produce bad data that will most likely have a negative impact on the downstream task. In the present work, we conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification. Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact in the quality of the generated data, translating to a better performance. Furthermore, despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguishable when using self-supervised (WavLM) features, an aspect further explored with a CycleGAN to bridge the gap between the two types of speech material.
- [235] arXiv:2409.12746 [pdf, html, other]
-
Title: Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal ContaminationEva Sánchez Salido, Roser Morante, Julio Gonzalo, Guillermo Marco, Jorge Carrillo-de-Albornoz, Laura Plaza, Enrique Amigó, Andrés Fernández, Alejandro Benito-Santos, Adrián Ghajari Espinosa, Victor FresnoSubjects: Computation and Language (cs.CL)
In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions of university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and translated manually into English, and have not ever been publicly released. A selection of current open-source and proprietary models are evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) reasoning questions are challenging for models, (ii) smaller models perform worse than larger models and degrade faster in Spanish than in English and (iii) the performance gap between languages is negligible for the best models and grows up to 37% for smaller models. Model ranking on UNED-ACCESS 2024 is almost identical in English and Spanish, and has also a high correlation (0.98 Pearson) with ranking on MMLU, suggesting that a small dataset is sufficiently diverse and representative to measure performance by discipline.
- [236] arXiv:2409.12753 [pdf, html, other]
-
Title: DrivingForward: Feed-forward 3D Gaussian Splatting for Driving Scene Reconstruction from Flexible Surround-view InputComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
We propose DrivingForward, a feed-forward Gaussian Splatting model that reconstructs driving scenes from flexible surround-view input. Driving scene images from vehicle-mounted cameras are typically sparse, with limited overlap, and the movement of the vehicle further complicates the acquisition of camera extrinsics. To tackle these challenges and achieve real-time reconstruction, we jointly train a pose network, a depth network, and a Gaussian network to predict the Gaussian primitives that represent the driving scenes. The pose network and depth network determine the position of the Gaussian primitives in a self-supervised manner, without using depth ground truth and camera extrinsics during training. The Gaussian network independently predicts primitive parameters from each input image, including covariance, opacity, and spherical harmonics coefficients. At the inference stage, our model can achieve feed-forward reconstruction from flexible multi-frame surround-view input. Experiments on the nuScenes dataset show that our model outperforms existing state-of-the-art feed-forward and scene-optimized reconstruction methods in terms of reconstruction.
- [237] arXiv:2409.12760 [pdf, html, other]
-
Title: COCO-Occ: A Benchmark for Occluded Panoptic Segmentation and Image UnderstandingSubjects: Computer Vision and Pattern Recognition (cs.CV)
To help address the occlusion problem in panoptic segmentation and image understanding, this paper proposes a new large-scale dataset, COCO-Occ, which is derived from the COCO dataset by manually labelling the COCO images into three perceived occlusion levels. Using COCO-Occ, we systematically assess and quantify the impact of occlusion on panoptic segmentation on samples having different levels of occlusion. Comparative experiments with SOTA panoptic models demonstrate that the presence of occlusion significantly affects performance with higher occlusion levels resulting in notably poorer performance. Additionally, we propose a straightforward yet effective method as an initial attempt to leverage the occlusion annotation using contrastive learning to render a model that learns a more robust representation capturing different severities of occlusion. Experimental results demonstrate that the proposed approach boosts the performance of the baseline model and achieves SOTA performance on the proposed COCO-Occ dataset.
- [238] arXiv:2409.12762 [pdf, html, other]
-
Title: The inverse obstacle scattering with a tapered wave incidenceComments: 17 pages, 11 figuresSubjects: Numerical Analysis (math.NA)
This paper is concerned with the reconstruction of the shape of an acoustic obstacle. Based on the use of the tapered waves with very narrow widths illuminating the obstacle, the boundary of the obstacle is reconstructed by a direct imaging algorithm. The stability of the imaging scheme is mathematically analyzed. We emphasize that different from the incident plane waves or point sources, the tapered waves with narrow widths bring several benefits in the inverse scattering: 1. local property. A tapered wave can illuminate only on a local part of the boundary of the obstacle, which generates the scattered field; 2. high resolution. We need only reconstruct the boundary near the beam, which improves the quality of some well-known algorithms; 3. fast and easy to implement. Numerical examples are included to demonstrate the effectiveness of the tapered waves.
- [239] arXiv:2409.12765 [pdf, html, other]
-
Title: Towards AI-enabled Cyber Threat Assessment in the Health SectorComments: 10 pages (including references), 5 figures, 2 tables, cs.CRSubjects: Cryptography and Security (cs.CR)
Cyber attacks on the healthcare industry can have tremendous consequences and the attack surface expands continuously. In order to handle the steadily rising workload, an expanding amount of analog processes in healthcare institutions is digitized. Despite regulations becoming stricter, not all existing infrastructure is sufficiently protected against cyber attacks. With an increasing number of devices and digital processes, the system and network landscape becomes more complex and harder to manage and therefore also more difficult to protect. The aim of this project is to introduce an AI-enabled platform that collects security relevant information from the outside of a health organization, analyzes it, delivers a risk score and supports decision makers in healthcare institutions to optimize investment choices for security measures. Therefore, an architecture of such a platform is designed, relevant information sources are identified, and AI methods for relevant data collection, selection, and risk scoring are explored.
- [240] arXiv:2409.12769 [pdf, html, other]
-
Title: The Robustness of Spiking Neural Networks in Communication and its Application towards Network Efficiency in Federated LearningComments: This paper has been accepted for publication at the 43rd IEEE International Performance Computing and Communications Conference (IPCCC 2024)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Spiking Neural Networks (SNNs) have recently gained significant interest in on-chip learning in embedded devices and emerged as an energy-efficient alternative to conventional Artificial Neural Networks (ANNs). However, to extend SNNs to a Federated Learning (FL) setting involving collaborative model training, the communication between the local devices and the remote server remains the bottleneck, which is often restricted and costly. In this paper, we first explore the inherent robustness of SNNs under noisy communication in FL. Building upon this foundation, we propose a novel Federated Learning with Top-K Sparsification (FLTS) algorithm to reduce the bandwidth usage for FL training. We discover that the proposed scheme with SNNs allows more bandwidth savings compared to ANNs without impacting the model's accuracy. Additionally, the number of parameters to be communicated can be reduced to as low as 6 percent of the size of the original model. We further improve the communication efficiency by enabling dynamic parameter compression during model training. Extensive experiment results demonstrate that our proposed algorithms significantly outperform the baselines in terms of communication cost and model accuracy and are promising for practical network-efficient FL with SNNs.
- [241] arXiv:2409.12771 [pdf, html, other]
-
Title: Spectral-GS: Taming 3D Gaussian Splatting with Spectral EntropySubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Recently, 3D Gaussian Splatting (3D-GS) has achieved impressive results in novel view synthesis, demonstrating high fidelity and efficiency. However, it easily exhibits needle-like artifacts, especially when increasing the sampling rate. Mip-Splatting tries to remove these artifacts with a 3D smoothing filter for frequency constraints and a 2D Mip filter for approximated supersampling. Unfortunately, it tends to produce over-blurred results, and sometimes needle-like Gaussians still persist. Our spectral analysis of the covariance matrix during optimization and densification reveals that current 3D-GS lacks shape awareness, relying instead on spectral radius and view positional gradients to determine splitting. As a result, needle-like Gaussians with small positional gradients and low spectral entropy fail to split and overfit high-frequency details. Furthermore, both the filters used in 3D-GS and Mip-Splatting reduce the spectral entropy and increase the condition number during zooming in to synthesize novel view, causing view inconsistencies and more pronounced artifacts. Our Spectral-GS, based on spectral analysis, introduces 3D shape-aware splitting and 2D view-consistent filtering strategies, effectively addressing these issues, enhancing 3D-GS's capability to represent high-frequency details without noticeable artifacts, and achieving high-quality photorealistic rendering.
- [242] arXiv:2409.12774 [pdf, html, other]
-
Title: GaRField++: Reinforced Gaussian Radiance Fields for Large-Scale 3D Scene ReconstructionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
This paper proposes a novel framework for large-scale scene reconstruction based on 3D Gaussian splatting (3DGS) and aims to address the scalability and accuracy challenges faced by existing methods. For tackling the scalability issue, we split the large scene into multiple cells, and the candidate point-cloud and camera views of each cell are correlated through a visibility-based camera selection and a progressive point-cloud extension. To reinforce the rendering quality, three highlighted improvements are made in comparison with vanilla 3DGS, which are a strategy of the ray-Gaussian intersection and the novel Gaussians density control for learning efficiency, an appearance decoupling module based on ConvKAN network to solve uneven lighting conditions in large-scale scenes, and a refined final loss with the color loss, the depth distortion loss, and the normal consistency loss. Finally, the seamless stitching procedure is executed to merge the individual Gaussian radiance field for novel view synthesis across different cells. Evaluation of Mill19, Urban3D, and MatrixCity datasets shows that our method consistently generates more high-fidelity rendering results than state-of-the-art methods of large-scale scene reconstruction. We further validate the generalizability of the proposed approach by rendering on self-collected video clips recorded by a commercial drone.
- [243] arXiv:2409.12778 [pdf, html, other]
-
Title: EventDance++: Language-guided Unsupervised Source-free Cross-modal Adaptation for Event-based Object RecognitionComments: arXiv admin note: text overlap with arXiv:2403.14082Subjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper, we address the challenging problem of cross-modal (image-to-events) adaptation for event-based recognition without accessing any labeled source image data. This task is arduous due to the substantial modality gap between images and events. With only a pre-trained source model available, the key challenge lies in extracting knowledge from this model and effectively transferring knowledge to the event-based domain. Inspired by the natural ability of language to convey semantics across different modalities, we propose EventDance++, a novel framework that tackles this unsupervised source-free cross-modal adaptation problem from a language-guided perspective. We introduce a language-guided reconstruction-based modality bridging (L-RMB) module, which reconstructs intensity frames from events in a self-supervised manner. Importantly, it leverages a vision-language model to provide further supervision, enriching the surrogate images and enhancing modality bridging. This enables the creation of surrogate images to extract knowledge (i.e., labels) from the source model. On top, we propose a multi-representation knowledge adaptation (MKA) module to transfer knowledge to target models, utilizing multiple event representations to capture the spatiotemporal characteristics of events fully. The L-RMB and MKA modules are jointly optimized to achieve optimal performance in bridging the modality gap. Experiments on three benchmark datasets demonstrate that EventDance++ performs on par with methods that utilize source data, validating the effectiveness of our language-guided approach in event-based recognition.
- [244] arXiv:2409.12780 [pdf, html, other]
-
Title: Infrastructure-less UWB-based Active Relative LocalizationSubjects: Robotics (cs.RO)
In multi-robot systems, relative localization between platforms plays a crucial role in many tasks, such as leader following, target tracking, or cooperative maneuvering. State of the Art (SotA) approaches either rely on infrastructure-based or on infrastructure-less setups. The former typically achieve high localization accuracy but require fixed external structures. The latter provide more flexibility, however, most of the works use cameras or lidars that require Line-of-Sight (LoS) to operate. Ultra Wide Band (UWB) devices are emerging as a viable alternative to build infrastructure-less solutions that do not require LoS. These approaches directly deploy the UWB sensors on the robots. However, they require that at least one of the platforms is static, limiting the advantages of an infrastructure-less setup. In this work, we remove this constraint and introduce an active method for infrastructure-less relative localization. Our approach allows the robot to adapt its position to minimize the relative localization error of the other platform. To this aim, we first design a specialized anchor placement for the active localization task. Then, we propose a novel UWB Relative Localization Loss that adapts the Geometric Dilution Of Precision metric to the infrastructure-less scenario. Lastly, we leverage this loss function to train an active Deep Reinforcement Learning-based controller for UWB relative localization. An extensive simulation campaign and real-world experiments validate our method, showing up to a 60% reduction of the localization error compared to current SotA approaches.
- [245] arXiv:2409.12784 [pdf, html, other]
-
Title: Evaluating Image Hallucination in Text-to-Image Generation with Question-AnsweringComments: 20 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Despite the impressive success of text-to-image (TTI) generation models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by generation models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing text-to-image models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five text-to-image models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (rho=0.95) with human judgments. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate text-to-image generation models.
- [246] arXiv:2409.12785 [pdf, other]
-
Title: Investigation on domain adaptation of additive manufacturing monitoring systems to enhance digital twin reusabilityComments: 8 pages, 7 figures, 3 tables. IEEE CASE 2024Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Powder bed fusion (PBF) is an emerging metal additive manufacturing (AM) technology that enables rapid fabrication of complex geometries. However, defects such as pores and balling may occur and lead to structural unconformities, thus compromising the mechanical performance of the part. This has become a critical challenge for quality assurance as the nature of some defects is stochastic during the process and invisible from the exterior. To address this issue, digital twin (DT) using machine learning (ML)-based modeling can be deployed for AM process monitoring and control. Melt pool is one of the most commonly observed physical phenomena for process monitoring, usually by high-speed cameras. Once labeled and preprocessed, the melt pool images are used to train ML-based models for DT applications such as process anomaly detection and print quality evaluation. Nonetheless, the reusability of DTs is restricted due to the wide variability of AM settings, including AM machines and monitoring instruments. The performance of the ML models trained using the dataset collected from one setting is usually compromised when applied to other settings. This paper proposes a knowledge transfer pipeline between different AM settings to enhance the reusability of AM DTs. The source and target datasets are collected from the National Institute of Standards and Technology and National Cheng Kung University with different cameras, materials, AM machines, and process parameters. The proposed pipeline consists of four steps: data preprocessing, data augmentation, domain alignment, and decision alignment. Compared with the model trained only using the source dataset, this pipeline increased the melt pool anomaly detection accuracy by 31% without any labeled training data from the target dataset.
- [247] arXiv:2409.12788 [pdf, html, other]
-
Title: Optimal or Greedy Decision Trees? Revisiting their Objectives, Tuning, and PerformanceSubjects: Machine Learning (cs.LG)
Decision trees are traditionally trained using greedy heuristics that locally optimize an impurity or information metric. Recently there has been a surge of interest in optimal decision tree (ODT) methods that globally optimize accuracy directly. We identify two relatively unexplored aspects of ODTs: the objective function used in training trees and tuning techniques. Additionally, the value of optimal methods is not well understood yet, as the literature provides conflicting results, with some demonstrating superior out-of-sample performance of ODTs over greedy approaches, while others show the exact opposite. In this paper, we address these three questions: what objective to optimize in ODTs; how to tune ODTs; and how do optimal and greedy methods compare? Our experimental evaluation examines 13 objective functions, including four novel objectives resulting from our analysis, seven tuning methods, and six claims from the literature on optimal and greedy methods on 165 real and synthetic data sets. Through our analysis, both conceptually and experimentally, we discover new non-concave objectives, highlight the importance of proper tuning, support and refute several claims from the literature, and provide clear recommendations for researchers and practitioners on the usage of greedy and optimal methods, and code for future comparisons.
- [248] arXiv:2409.12789 [pdf, other]
-
Title: Reinforcement Learning-based Model Predictive Control for Greenhouse Climate ControlComments: 12 pages, 6 figures, code available at this https URL, submitted to Computers and Electronics in AgricultureSubjects: Systems and Control (eess.SY)
Greenhouse climate control is concerned with maximizing performance in terms of crop yield and resource efficiency. One promising approach is model predictive control (MPC), which leverages a model of the system to optimize the control inputs, while enforcing physical constraints. However, prediction models for greenhouse systems are inherently inaccurate due to the complexity of the real system and the uncertainty in predicted weather profiles. For model-based control approaches such as MPC, this can degrade performance and lead to constraint violations. Existing approaches address uncertainty in the prediction model with robust or stochastic MPC methodology; however, these necessarily reduce crop yield due to conservatism and often bear higher computational loads. In contrast, learning-based control approaches, such as reinforcement learning (RL), can handle uncertainty naturally by leveraging data to improve performance. This work proposes an MPC-based RL control framework to optimize the climate control performance in the presence of prediction uncertainty. The approach employs a parametrized MPC scheme that learns directly from data, in an online fashion, the parametrization of the constraints, prediction model, and optimization cost that minimizes constraint violations and maximizes climate control performance. Simulations show that the approach can learn an MPC controller that significantly outperforms the current state-of-the-art in terms of constraint violations and efficient crop growth.
- [249] arXiv:2409.12796 [pdf, html, other]
-
Title: Angular Divergent Component of Motion: A step towards planning Spatial DCM Objectives for Legged RobotsConnor W. Herron, Robert Schuller, Benjamin C. Beiter, Robert J. Griffin, Alexander Leonessa, Johannes EnglsbergerSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
In this work, the Divergent Component of Motion (DCM) method is expanded to include angular coordinates for the first time. This work introduces the idea of spatial DCM, which adds an angular objective to the existing linear DCM theory. To incorporate the angular component into the framework, a discussion is provided on extending beyond the linear motion of the Linear Inverted Pendulum model (LIPM) towards the Single Rigid Body model (SRBM) for DCM. This work presents the angular DCM theory for a 1D rotation, simplifying the SRBM rotational dynamics to a flywheel to satisfy necessary linearity constraints. The 1D angular DCM is mathematically identical to the linear DCM and defined as an angle which is ahead of the current body rotation based on the angular velocity. This theory is combined into a 3D linear and 1D angular DCM framework, with discussion on the feasibility of simultaneously achieving both sets of objectives. A simulation in MATLAB and hardware results on the TORO humanoid are presented to validate the framework's performance.
- [250] arXiv:2409.12797 [pdf, html, other]
-
Title: Efficient Identification of Direct Causal Parents via Invariance and Minimum Error TestingComments: Accepted at TMLRSubjects: Machine Learning (cs.LG)
Invariant causal prediction (ICP) is a popular technique for finding causal parents (direct causes) of a target via exploiting distribution shifts and invariance testing (Peters et al., 2016). However, since ICP needs to run an exponential number of tests and fails to identify parents when distribution shifts only affect a few variables, applying ICP to practical large scale problems is challenging. We propose MMSE-ICP and fastICP, two approaches which employ an error inequality to address the identifiability problem of ICP. The inequality states that the minimum prediction error of the predictor using causal parents is the smallest among all predictors which do not use descendants. fastICP is an efficient approximation tailored for large problems as it exploits the inequality and a heuristic to run fewer tests. MMSE-ICP and fastICP not only outperform competitive baselines in many simulations but also achieve state-of-the-art result on a large scale real data benchmark.
- [251] arXiv:2409.12798 [pdf, html, other]
-
Title: Assessing the Zero-Shot Capabilities of LLMs for Action Evaluation in RLEduardo Pignatelli, Johan Ferret, Tim Rockäschel, Edward Grefenstette, Davide Paglieri, Samuel Coward, Laura ToniComments: 9 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The temporal credit assignment problem is a central challenge in Reinforcement Learning (RL), concerned with attributing the appropriate influence to each actions in a trajectory for their ability to achieve a goal. However, when feedback is delayed and sparse, the learning signal is poor, and action evaluation becomes harder. Canonical solutions, such as reward shaping and options, require extensive domain knowledge and manual intervention, limiting their scalability and applicability. In this work, we lay the foundations for Credit Assignment with Language Models (CALM), a novel approach that leverages Large Language Models (LLMs) to automate credit assignment via reward shaping and options discovery. CALM uses LLMs to decompose a task into elementary subgoals and assess the achievement of these subgoals in state-action transitions. Every time an option terminates, a subgoal is achieved, and CALM provides an auxiliary reward. This additional reward signal can enhance the learning process when the task reward is sparse and delayed without the need for human-designed rewards. We provide a preliminary evaluation of CALM using a dataset of human-annotated demonstrations from MiniHack, suggesting that LLMs can be effective in assigning credit in zero-shot settings, without examples or LLM fine-tuning. Our preliminary results indicate that the knowledge of LLMs is a promising prior for credit assignment in RL, facilitating the transfer of human knowledge into value functions.
- [252] arXiv:2409.12801 [pdf, html, other]
-
Title: Exploring the Lands Between: A Method for Finding Differences between AI-Decisions and Human Ratings through Generated SamplesSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Many important decisions in our everyday lives, such as authentication via biometric models, are made by Artificial Intelligence (AI) systems. These can be in poor alignment with human expectations, and testing them on clear-cut existing data may not be enough to uncover those cases. We propose a method to find samples in the latent space of a generative model, designed to be challenging for a decision-making model with regard to matching human expectations. By presenting those samples to both the decision-making model and human raters, we can identify areas where its decisions align with human intuition and where they contradict it. We apply this method to a face recognition model and collect a dataset of 11,200 human ratings from 100 participants. We discuss findings from our dataset and how our approach can be used to explore the performance of AI models in different contexts and for different user groups.
- [253] arXiv:2409.12809 [pdf, html, other]
-
Title: Don't be Fooled: The Misinformation Effect of Explanations in Human-AI CollaborationSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Across various applications, humans increasingly use black-box artificial intelligence (AI) systems without insight into these systems' reasoning. To counter this opacity, explainable AI (XAI) methods promise enhanced transparency and interpretability. While recent studies have explored how XAI affects human-AI collaboration, few have examined the potential pitfalls caused by incorrect explanations. The implications for humans can be far-reaching but have not been explored extensively. To investigate this, we ran a study (n=160) on AI-assisted decision-making in which humans were supported by XAI. Our findings reveal a misinformation effect when incorrect explanations accompany correct AI advice with implications post-collaboration. This effect causes humans to infer flawed reasoning strategies, hindering task execution and demonstrating impaired procedural knowledge. Additionally, incorrect explanations compromise human-AI team-performance during collaboration. With our work, we contribute to HCI by providing empirical evidence for the negative consequences of incorrect explanations on humans post-collaboration and outlining guidelines for designers of AI.
- [254] arXiv:2409.12812 [pdf, html, other]
-
Title: Towards Interactive and Learnable Cooperative Driving Automation: a Large Language Model-Driven Decision-Making FrameworkSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
At present, Connected Autonomous Vehicles (CAVs) have begun to open road testing around the world, but their safety and efficiency performance in complex scenarios is still not satisfactory. Cooperative driving leverages the connectivity ability of CAVs to achieve synergies greater than the sum of their parts, making it a promising approach to improving CAV performance in complex scenarios. However, the lack of interaction and continuous learning ability limits current cooperative driving to single-scenario applications and specific Cooperative Driving Automation (CDA). To address these challenges, this paper proposes CoDrivingLLM, an interactive and learnable LLM-driven cooperative driving framework, to achieve all-scenario and all-CDA. First, since Large Language Models(LLMs) are not adept at handling mathematical calculations, an environment module is introduced to update vehicle positions based on semantic decisions, thus avoiding potential errors from direct LLM control of vehicle positions. Second, based on the four levels of CDA defined by the SAE J3216 standard, we propose a Chain-of-Thought (COT) based reasoning module that includes state perception, intent sharing, negotiation, and decision-making, enhancing the stability of LLMs in multi-step reasoning tasks. Centralized conflict resolution is then managed through a conflict coordinator in the reasoning process. Finally, by introducing a memory module and employing retrieval-augmented generation, CAVs are endowed with the ability to learn from their past experiences. We validate the proposed CoDrivingLLM through ablation experiments on the negotiation module, reasoning with different shots experience, and comparison with other cooperative driving methods.
- [255] arXiv:2409.12813 [pdf, html, other]
-
Title: Autonomous Visual Fish Pen Inspections for Estimating the State of Biofouling Buildup Using ROV -- Extended AbstractComments: IEEE ICRA Workshop on Field Robotics 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
The process of fish cage inspections, which is a necessary maintenance task at any fish farm, be it small scale or industrial, is a task that has the potential to be fully automated. Replacing trained divers who perform regular inspections with autonomous marine vehicles would lower the costs of manpower and remove the risks associated with humans performing underwater inspections. Achieving such a level of autonomy implies developing an image processing algorithm that is capable of estimating the state of biofouling buildup. The aim of this work is to propose a complete solution for automating the said inspection process; from developing an autonomous control algorithm for an ROV, to automatically segmenting images of fish cages, and accurately estimating the state of biofouling. The first part is achieved by modifying a commercially available ROV with an acoustic SBL positioning system and developing a closed-loop control system. The second part is realized by implementing a proposed biofouling estimation framework, which relies on AI to perform image segmentation, and by processing images using established computer vision methods to obtain a rough estimate of the distance of the ROV from the fish cage. This also involved developing a labeling tool in order to create a dataset of images for the neural network performing the semantic segmentation to be trained on. The experimental results show the viability of using an ROV fitted with an acoustic transponder for autonomous missions, and demonstrate the biofouling estimation framework's ability to provide accurate assessments, alongside satisfactory distance estimation capabilities. In conclusion, the achieved biofouling estimation accuracy showcases clear potential for use in the aquaculture industry.
- [256] arXiv:2409.12816 [pdf, html, other]
-
Title: Hierarchical Gradient-Based Genetic Sampling for Accurate Prediction of Biological OscillationsSubjects: Machine Learning (cs.LG)
Biological oscillations are periodic changes in various signaling processes crucial for the proper functioning of living organisms. These oscillations are modeled by ordinary differential equations, with coefficient variations leading to diverse periodic behaviors, typically measured by oscillatory frequencies. This paper explores sampling techniques for neural networks to model the relationship between system coefficients and oscillatory frequency. However, the scarcity of oscillations in the vast coefficient space results in many samples exhibiting non-periodic behaviors, and small coefficient changes near oscillation boundaries can significantly alter oscillatory properties. This leads to non-oscillatory bias and boundary sensitivity, making accurate predictions difficult. While existing importance and uncertainty sampling approaches partially mitigate these challenges, they either fail to resolve the sensitivity problem or result in redundant sampling. To address these limitations, we propose the Hierarchical Gradient-based Genetic Sampling (HGGS) framework, which improves the accuracy of neural network predictions for biological oscillations. The first layer, Gradient-based Filtering, extracts sensitive oscillation boundaries and removes redundant non-oscillatory samples, creating a balanced coarse dataset. The second layer, Multigrid Genetic Sampling, utilizes residual information to refine these boundaries and explore new high-residual regions, increasing data diversity for model training. Experimental results demonstrate that HGGS outperforms seven comparative sampling methods across four biological systems, highlighting its effectiveness in enhancing sampling and prediction accuracy.
- [257] arXiv:2409.12817 [pdf, html, other]
-
Title: Automated Linear Disturbance Mapping via Semantic Segmentation of Sentinel-2 ImageryAndrew M. Nagel, Anne Webster, Christopher Henry, Christopher Storie, Ignacio San-Miguel Sanchez, Olivier Tsui, Jason Duffe, Andy DeanSubjects: Computer Vision and Pattern Recognition (cs.CV)
In Canada's northern regions, linear disturbances such as roads, seismic exploration lines, and pipelines pose a significant threat to the boreal woodland caribou population (Rangifer tarandus). To address the critical need for management of these disturbances, there is a strong emphasis on developing mapping approaches that accurately identify forest habitat fragmentation. The traditional approach is manually generating maps, which is time-consuming and lacks the capability for frequent updates. Instead, applying deep learning methods to multispectral satellite imagery offers a cost-effective solution for automated and regularly updated map production. Deep learning models have shown promise in extracting paved roads in urban environments when paired with high-resolution (<0.5m) imagery, but their effectiveness for general linear feature extraction in forested areas from lower resolution imagery remains underexplored. This research employs a deep convolutional neural network model based on the VGGNet16 architecture for semantic segmentation of lower resolution (10m) Sentinel-2 satellite imagery, creating precise multi-class linear disturbance maps. The model is trained using ground-truth label maps sourced from the freely available Alberta Institute of Biodiversity Monitoring Human Footprint dataset, specifically targeting the Boreal and Taiga Plains ecozones in Alberta, Canada. Despite challenges in segmenting lower resolution imagery, particularly for thin linear disturbances like seismic exploration lines that can exhibit a width of 1-3 pixels in Sentinel-2 imagery, our results demonstrate the effectiveness of the VGGNet model for accurate linear disturbance retrieval. By leveraging the freely available Sentinel-2 imagery, this work advances cost-effective automated mapping techniques for identifying and monitoring linear disturbance fragmentation.
- [258] arXiv:2409.12822 [pdf, html, other]
-
Title: Language Models Learn to Mislead Humans via RLHFJiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Boman, He He, Shi FengSubjects: Computation and Language (cs.CL)
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g. backdoored LMs), does not generalize to U-SOPHISTRY. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
- [259] arXiv:2409.12824 [pdf, html, other]
-
Title: Data-Driven Cooperative Output Regulation of Continuous-Time Multi-Agent Systems with Unknown Network TopologySubjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
This paper investigates data-driven cooperative output regulation for continuous-time multi-agent systems with unknown network topology. Unlike existing studies that typically assume a known network topology to directly compute controller parameters, a novel approach is proposed that allows for the computation of the parameter without prior knowledge of the topology. A lower bound on the minimum non-zero eigenvalue of the Laplacian matrix is estimated using only edge weight bounds, enabling the output regulation controller design to be independent of global network information. Additionally, the common need for state derivative measurements is eliminated, reducing the amount of data requirements. Furthermore, necessary and sufficient conditions are established to ensure that the data are informative for cooperative output regulation, leading to the design of a distributed output regulation controller. For the case with noisy data, the bound of the output error is provided, which is positively correlated with the noise bound, and a distributed controller is constructed for the approximate cooperative output regulation. Finally, the effectiveness of the proposed methods is verified through numerical simulations.
- [260] arXiv:2409.12832 [pdf, html, other]
-
Title: FoodPuzzle: Developing Large Language Model Agents as Flavor ScientistsTenghao Huang, Donghee Lee, John Sweeney, Jiatong Shi, Emily Steliotes, Matthew Lange, Jonathan May, Muhao ChenSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Flavor development in the food industry is increasingly challenged by the need for rapid innovation and precise flavor profile creation. Traditional flavor research methods typically rely on iterative, subjective testing, which lacks the efficiency and scalability required for modern demands. This paper presents three contributions to address the challenges. Firstly, we define a new problem domain for scientific agents in flavor science, conceptualized as the generation of hypotheses for flavor profile sourcing and understanding. To facilitate research in this area, we introduce the FoodPuzzle, a challenging benchmark consisting of 978 food items and 1,766 flavor molecules profiles. We propose a novel Scientific Agent approach, integrating in-context learning and retrieval augmented techniques to generate grounded hypotheses in the domain of food science. Experimental results indicate that our model significantly surpasses traditional methods in flavor profile prediction tasks, demonstrating its potential to transform flavor development practices.
- [261] arXiv:2409.12836 [pdf, html, other]
-
Title: SituationAdapt: Contextual UI Optimization in Mixed Reality with Situation Awareness via LLM ReasoningComments: 13 pagesSubjects: Human-Computer Interaction (cs.HC)
Mixed Reality is increasingly used in mobile settings beyond controlled home and office spaces. This mobility introduces the need for user interface layouts that adapt to varying contexts. However, existing adaptive systems are designed only for static environments. In this paper, we introduce SituationAdapt, a system that adjusts Mixed Reality UIs to real-world surroundings by considering environmental and social cues in shared settings. Our system consists of perception, reasoning, and optimization modules for UI adaptation. Our perception module identifies objects and individuals around the user, while our reasoning module leverages a Vision-and-Language Model to assess the placement of interactive UI elements. This ensures that adapted layouts do not obstruct relevant environmental cues or interfere with social norms. Our optimization module then generates Mixed Reality interfaces that account for these considerations as well as temporal constraints. For evaluation, we first validate our reasoning module's capability of assessing UI contexts in comparison to human expert users. In an online user study, we then establish SituationAdapt's capability of producing context-aware layouts for Mixed Reality, where it outperformed previous adaptive layout methods. We conclude with a series of applications and scenarios to demonstrate SituationAdapt's versatility.
- [262] arXiv:2409.12839 [pdf, html, other]
-
Title: Social impact of CAVs -- coexistence of machines and humans in the context of route choiceSubjects: Multiagent Systems (cs.MA)
Suppose in a stable urban traffic system populated only by human driven vehicles (HDVs), a given proportion (e.g. 10%) is replaced by a fleet of Connected and Autonomous Vehicles (CAVs), which share information and pursue a collective goal. Suppose these vehicles are centrally coordinated and differ from HDVs only by their collective capacities allowing them to make more efficient routing decisions before the travel on a given day begins. Suppose there is a choice between two routes and every day each driver makes a decision which route to take. Human drivers maximize their utility. CAVs might optimize different goals, such as the total travel time of the fleet. We show that in this plausible futuristic setting, the strategy CAVs are allowed to adopt may result in human drivers either benefitting or being systematically disadvantaged and urban networks becoming more or less optimal. Consequently, some regulatory measures might become indispensable.
- [263] arXiv:2409.12840 [pdf, html, other]
-
Title: Lexicon-Based Sentiment Analysis on Text Polarities with Evaluation of Classification ModelsSubjects: Computation and Language (cs.CL)
Sentiment analysis possesses the potential of diverse applicability on digital platforms. Sentiment analysis extracts the polarity to understand the intensity and subjectivity in the text. This work uses a lexicon-based method to perform sentiment analysis and shows an evaluation of classification models trained over textual data. The lexicon-based methods identify the intensity of emotion and subjectivity at word levels. The categorization identifies the informative words inside a text and specifies the quantitative ranking of the polarity of words. This work is based on a multi-class problem of text being labeled as positive, negative, or neutral. Twitter sentiment dataset containing 1.6 million unprocessed tweets is used with lexicon-based methods like Text Blob and Vader Sentiment to introduce the neutrality measure on text. The analysis of lexicons shows how the word count and the intensity classify the text. A comparative analysis of machine learning models, Naiive Bayes, Support Vector Machines, Multinomial Logistic Regression, Random Forest, and Extreme Gradient (XG) Boost performed across multiple performance metrics. The best estimations are achieved through Random Forest with an accuracy score of 81%. Additionally, sentiment analysis is applied for a personality judgment case against a Twitter profile based on online activity.
- [264] arXiv:2409.12842 [pdf, html, other]
-
Title: Vision Language Models Can Parse Floor Plan MapsSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Vision language models (VLMs) can simultaneously reason about images and texts to tackle many tasks, from visual question answering to image captioning. This paper focuses on map parsing, a novel task that is unexplored within the VLM context and particularly useful to mobile robots. Map parsing requires understanding not only the labels but also the geometric configurations of a map, i.e., what areas are like and how they are connected. To evaluate the performance of VLMs on map parsing, we prompt VLMs with floorplan maps to generate task plans for complex indoor navigation. Our results demonstrate the remarkable capability of VLMs in map parsing, with a success rate of 0.96 in tasks requiring a sequence of nine navigation actions, e.g., approaching and going through doors. Other than intuitive observations, e.g., VLMs do better in smaller maps and simpler navigation tasks, there was a very interesting observation that its performance drops in large open areas. We provide practical suggestions to address such challenges as validated by our experimental results. Webpage: this https URL
- [265] arXiv:2409.12846 [pdf, html, other]
-
Title: How the (Tensor-) Brain uses Embeddings and Embodiment to Encode Senses and Decode SymbolsSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
The tensor brain has been introduced as a computational model for perception and memory. We provide an overview of the tensor brain model, including recent developments. The tensor brain has two major layers: the representation layer and the index layer. The representation layer is a model for the subsymbolic global workspace from consciousness research. The state of the representation layer is the cognitive brain state. The index layer contains symbols for concepts, time instances, and predicates. In a bottom-up operation, the cognitive brain state is encoded by the index layer as symbolic labels. In a top-down operation, symbols are decoded and written to the representation layer. This feeds to earlier processing layers as embodiment. The top-down operation became the basis for semantic memory. The embedding vector of a concept forms the connection weights between its index and the representation layer. The embedding is the signature or ``DNA'' of a concept, which is decoded by the brain when its index is activated. It integrates all that is known about a concept from different experiences, modalities, and symbolic decodings. Although being computational, it has been suggested that the tensor brain might be related to the actual operation of the brain. The sequential nature of symbol generation might have been a prerequisite to the generation of natural language. We describe an attention mechanism and discuss multitasking by multiplexing. We emphasize the inherent multimodality of the tensor brain. Finally, we discuss embedded and symbolic reasoning.
- [266] arXiv:2409.12849 [pdf, html, other]
-
Title: A Margin-Maximizing Fine-Grained Ensemble MethodSubjects: Machine Learning (cs.LG)
Ensemble learning has achieved remarkable success in machine learning, but its reliance on numerous base learners limits its application in resource-constrained environments. This paper introduces an innovative "Margin-Maximizing Fine-Grained Ensemble Method" that achieves performance surpassing large-scale ensembles by meticulously optimizing a small number of learners and enhancing generalization capability. We propose a novel learnable confidence matrix, quantifying each classifier's confidence for each category, precisely capturing category-specific advantages of individual learners. Furthermore, we design a margin-based loss function, constructing a smooth and partially convex objective using the logsumexp technique. This approach improves optimization, eases convergence, and enables adaptive confidence allocation. Finally, we prove that the loss function is Lipschitz continuous, based on which we develop an efficient gradient optimization algorithm that simultaneously maximizes margins and dynamically adjusts learner weights. Extensive experiments demonstrate that our method outperforms traditional random forests using only one-tenth of the base learners and other state-of-the-art ensemble methods.
- [267] arXiv:2409.12850 [pdf, other]
-
Title: USBIPS Framework: Protecting Hosts from Malicious USB PeripheralsComments: Under review by Computer Standards & InterfacesSubjects: Cryptography and Security (cs.CR)
USB-based attacks have increased in complexity in recent years. Modern attacks incorporate a wide range of attack vectors, from social engineering to signal injection. The security community is addressing these challenges using a growing set of fragmented defenses. Regardless of the vector of a USB-based attack, the most important risks concerning most people and enterprises are service crashes and data loss. The host OS manages USB peripherals, and malicious USB peripherals, such as those infected with BadUSB, can crash a service or steal data from the OS. Although USB firewalls have been proposed to thwart malicious USB peripherals, such as USBFilter and USBGuard, they cannot prevent real-world intrusions. This paper focuses on building a security framework called USBIPS within OSs to defend against malicious USB peripherals. This includes major efforts to explore the nature of malicious behavior and build persistent protection from USB-based intrusions. We first present a behavior-based detection mechanism focusing on attacks integrated into USB peripherals. We then introduce the novel idea of an allowlisting-based method for USB access control. We finally develop endpoint detection and response system to build the first generic security framework that thwarts USB-based intrusion. Within a centralized threat analysis framework, it provides persistent protection and may detect unknown malicious behavior. By addressing key security and performance challenges, these efforts help modern OSs against attacks from untrusted USB peripherals.
- [268] arXiv:2409.12851 [pdf, html, other]
-
Title: Harnessing Stacked Intelligent Metasurface for Enhanced Cell-Free Massive MIMO Systems: A Low-Power and Cost ApproachSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In this paper, we explore the integration of low-power, low-cost stacked intelligent metasurfaces (SIM) into cell-free (CF) massive multiple-input multiple-output (mMIMO) systems to enhance access point (AP) capabilities and address high power consumption and cost challenges. Specifically, we investigate the uplink performance of a SIM-enhanced CF mMIMO system and propose a novel system framework. First, the closed-form expressions of the spectral efficiency (SE) are obtained using the unique two-layer signal processing framework of CF mMIMO systems. Second, to mitigate inter-user interference, an interference-based greedy algorithm for pilot allocation is introduced. Third, a wave-based beamforming algorithm for SIM is proposed, based only on statistical channel state information, which effectively reduces the fronthaul costs. Finally, a max-min SE power control algorithm is proposed to improve the performance of UE with inferior channel conditions. The results indicate that increasing the number of SIM layers and meta-atoms leads to significant performance improvements and allows for a reduction in the number of APs and AP antennas, thus lowering the costs. In particular, the best SE performance is achieved with the deployment of 20 APs plus 1200 SIM meta-atoms. Finally, the proposed wave-based beamforming algorithm can enhance the SE performance of SIM-enhanced CF-mMIMO systems by 57\%, significantly outperforming traditional CF mMIMO systems.
- [269] arXiv:2409.12853 [pdf, html, other]
-
Title: A New Perspective on ADHD Research: Knowledge Graph Construction with LLMs and Network Based InsightsComments: 14 pages, 2 figuresSubjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Attention-Deficit/Hyperactivity Disorder (ADHD) is a challenging disorder to study due to its complex symptomatology and diverse contributing factors. To explore how we can gain deeper insights on this topic, we performed a network analysis on a comprehensive knowledge graph (KG) of ADHD, constructed by integrating scientific literature and clinical data with the help of cutting-edge large language models. The analysis, including k-core techniques, identified critical nodes and relationships that are central to understanding the disorder. Building on these findings, we developed a context-aware chatbot using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), enabling accurate and informed interactions. Our knowledge graph not only advances the understanding of ADHD but also provides a powerful tool for research and clinical applications.
- [270] arXiv:2409.12862 [pdf, html, other]
-
Title: Extended Reality System for Robotic Learning from Human DemonstrationIsaac Ngui, Courtney McBeth, Grace He, André Corrêa Santos, Luciano Soares, Marco Morales, Nancy M. AmatoComments: In submissionSubjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Many real-world tasks are intuitive for a human to perform, but difficult to encode algorithmically when utilizing a robot to perform the tasks. In these scenarios, robotic systems can benefit from expert demonstrations to learn how to perform each task. In many settings, it may be difficult or unsafe to use a physical robot to provide these demonstrations, for example, considering cooking tasks such as slicing with a knife. Extended reality provides a natural setting for demonstrating robotic trajectories while bypassing safety concerns and providing a broader range of interaction modalities. We propose the Robot Action Demonstration in Extended Reality (RADER) system, a generic extended reality interface for learning from demonstration. We additionally present its application to an existing state-of-the-art learning from demonstration approach and show comparable results between demonstrations given on a physical robot and those given using our extended reality system.
- [271] arXiv:2409.12865 [pdf, html, other]
-
Title: KnowFormer: Revisiting Transformers for Knowledge Graph ReasoningComments: Accepted by ICML2024Subjects: Artificial Intelligence (cs.AI)
Knowledge graph reasoning plays a vital role in various applications and has garnered considerable attention. Recently, path-based methods have achieved impressive performance. However, they may face limitations stemming from constraints in message-passing neural networks, such as missing paths and information over-squashing. In this paper, we revisit the application of transformers for knowledge graph reasoning to address the constraints faced by path-based methods and propose a novel method KnowFormer.KnowFormer utilizes a transformer architecture to perform reasoning on knowledge graphs from the message-passing perspective, rather than reasoning by textual information like previous pretrained language model based methods. Specifically, we define the attention computation based on the query prototype of knowledge graph reasoning, facilitating convenient construction and efficient optimization. To incorporate structural information into the self-attention mechanism, we introduce structure-aware modules to calculate query, key, and value respectively. Additionally, we present an efficient attention computation method for better scalability. Experimental results demonstrate the superior performance of KnowFormer compared to prominent baseline methods on both transductive and inductive benchmarks.
- [272] arXiv:2409.12866 [pdf, html, other]
-
Title: SpecEval: Evaluating Code Comprehension in Large Language Models via Program SpecificationsSubjects: Software Engineering (cs.SE)
Large Language models have achieved impressive performance in automated software engineering. Extensive efforts have been made to evaluate the abilities of code LLMs in various aspects, with an increasing number of benchmarks and evaluation frameworks proposed. Apart from the most sought-after capability of code generation, the capability of code comprehension is being granted growing attention. Nevertheless, existing works assessing the code comprehension capability of LLMs exhibit varied limitations. Evaluation frameworks like CRUXEval and REval usually focus on code reasoning tasks over a certain input case, leading to a limited range of execution traces covered, resulting in a loss in code semantics examined and the inability to assess the comprehensive understanding of LLMs concerning the target program. To tackle the challenges above, we propose SpecEval, a novel black-box evaluation framework to evaluate code comprehension in LLMs via program specifications. Inspired by the idea that specifications can comprehensively articulate program behaviors concerning all possible execution traces, we employ formal specifications to represent program semantics and perform thorough evaluations. In particular, four specification-related tasks are designed to assess the capability of LLMs from basic to advanced levels. Moreover, counterfactual analysis is conducted to study the performance variance of LLMs under semantics-preserving perturbations, and progressive consistency analysis is performed to study the performance consistency of LLMs over a series of tasks with sequential dependence. Systematic experiments are conducted on six state-of-the-art LLMs. Experimental results present a below-satisfactory performance of LLMs on specification-related tasks, revealing the limitations of existing LLMs in articulating program semantics, underscoring future directions for enhancement.
- [273] arXiv:2409.12870 [pdf, html, other]
-
Title: Joint AP-UE Association and Precoding for SIM-Aided Cell-Free Massive MIMO SystemsSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Cell-free (CF) massive multiple-input multiple-output (mMIMO) systems are emerging as promising alternatives to cellular networks, especially in ultra-dense environments. However, further capacity enhancement requires the deployment of more access points (APs), which will lead to high costs and high energy consumption. To address this issue, in this paper, we explore the integration of low-power, low-cost stacked intelligent metasurfaces (SIM) into CF mMIMO systems to enhance AP capabilities. The key point is that SIM performs precoding-related matrix operations in the wave domain. As a consequence, each AP antenna only needs to transmit data streams for a single user equipment (UE), eliminating the need for complex baseband digital precoding. Then, we formulate the problem of joint AP-UE association and precoding at APs and SIMs to maximize the system sum rate. Due to the non-convexity and high complexity of the formulated problem, we propose a two-stage signal processing framework to solve it. In particular, in the first stage, we propose an AP antenna greedy association (AGA) algorithm to minimize UE interference. In the second stage, we introduce an alternating optimization (AO)-based algorithm that separates the joint power and wave-based precoding optimization problem into two distinct sub-problems: the complex quadratic transform method is used for AP antenna power control, and the projection gradient ascent (PGA) algorithm is employed to find suboptimal solutions for the SIM wave-based precoding. Finally, the numerical results validate the effectiveness of the proposed framework and assess the performance enhancement achieved by the algorithm in comparison to various benchmark schemes. The results show that, with the same number of SIM meta-atoms, the proposed algorithm improves the sum rate by approximately 275% compared to the benchmark scheme.
- [274] arXiv:2409.12873 [pdf, html, other]
-
Title: Reliability-Based Planning of Cable Layout for Offshore Wind Farm Electrical Collector System Considering Post-Fault Network ReconfigurationComments: 13 pagesSubjects: Systems and Control (eess.SY)
The electrical collector system (ECS) plays a crucial role in determining the performance of offshore wind farms (OWFs). Existing research has predominantly restricted ECS cable layouts to conventional radial or ring structures and employed graph theory heuristics for solutions. However, both economic efficiency and reliability of the OWFs heavily depend on their ECS structure, and the optimal ECS cable layout often deviates from typical configurations. In this context, this paper introduces a novel reliability-based ECS cable layout planning method for large-scale OWFs, employing a two-stage stochastic programming approach to address uncertainties of wind power and contingencies. To enhance reliability, the model incorporates optimal post-fault network reconfiguration strategies by adjusting wind turbine power supply paths through link cables. To tackle computation challenges arising from numerous contingency scenarios, a customized progressive contingency incorporation (CPCI) framework is developed to solve the model with higher efficiency by iteratively identifying non-trivial scenarios and solving the simplified problems. The convergence and optimality are theoretically proven. Numerical tests on several real-world OWFs validate the necessity of fully optimizing ECS structures and demonstrate the efficiency of the CPCI algorithm.
- [275] arXiv:2409.12878 [pdf, other]
-
Title: Impact of ML Optimization Tactics on Greener Pre-Trained ML ModelsSubjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Background: Given the fast-paced nature of today's technology, which has surpassed human performance in tasks like image classification, visual reasoning, and English understanding, assessing the impact of Machine Learning (ML) on energy consumption is crucial. Traditionally, ML projects have prioritized accuracy over energy, creating a gap in energy consumption during model inference.
Aims: This study aims to (i) analyze image classification datasets and pre-trained models, (ii) improve inference efficiency by comparing optimized and non-optimized models, and (iii) assess the economic impact of the optimizations.
Method: We conduct a controlled experiment to evaluate the impact of various PyTorch optimization techniques (dynamic quantization, torch.compile, local pruning, and global pruning) to 42 Hugging Face models for image classification. The metrics examined include GPU utilization, power and energy consumption, accuracy, time, computational complexity, and economic costs. The models are repeatedly evaluated to quantify the effects of these software engineering tactics.
Results: Dynamic quantization demonstrates significant reductions in inference time and energy consumption, making it highly suitable for large-scale systems. Additionally, torch.compile balances accuracy and energy. In contrast, local pruning shows no positive impact on performance, and global pruning's longer optimization times significantly impact costs.
Conclusions: This study highlights the role of software engineering tactics in achieving greener ML models, offering guidelines for practitioners to make informed decisions on optimization methods that align with sustainability goals. - [276] arXiv:2409.12879 [pdf, html, other]
-
Title: QMC integration based on arbitrary (t,m,s)-nets yields optimal convergence rates on several scales of function spacesComments: 56 pagesSubjects: Numerical Analysis (math.NA)
We study the integration problem over the $s$-dimensional unit cube on four types of Banach spaces of integrands. First we consider Haar wavelet spaces, consisting of functions whose Haar wavelet coefficients exhibit a certain decay behavior measured by a parameter $\alpha >0$. We study the worst case error of integration over the norm unit ball and provide upper error bounds for quasi-Monte Carlo (QMC) cubature rules based on arbitrary $(t,m,s)$-nets as well as matching lower error bounds for arbitrary cubature rules. These results show that using arbitrary $(t,m,s)$-nets as sample points yields the best possible rate of convergence. Afterwards we study spaces of integrands of fractional smoothness $\alpha \in (0,1)$ and state a sharp Koksma-Hlawka-type inequality. More precisely, we show that on those spaces the worst case error of integration is equal to the corresponding fractional discrepancy. Those spaces can be continuously embedded into tensor product Bessel potential spaces, also known as Sobolev spaces of dominated mixed smoothness, with the same set of parameters. The latter spaces can be embedded into suitable Besov spaces of dominating mixed smoothness $\alpha$, which in turn can be embedded into the Haar wavelet spaces with the same set of parameters. Therefore our upper error bounds on Haar wavelet spaces for QMC cubatures based on $(t,m,s)$-nets transfer (with possibly different constants) to the corresponding spaces of integrands of fractional smoothness and to Sobolev and Besov spaces of dominating mixed smoothness. Moreover, known lower error bounds for periodic Sobolev and Besov spaces of dominating mixed smoothness show that QMC integration based on arbitrary $(t,m,s)$-nets yields the best possible convergence rate on periodic as well as on non-periodic Sobolev and Besov spaces of dominating smoothness.
- [277] arXiv:2409.12880 [pdf, html, other]
-
Title: Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language ModelsComments: 6 Pages,In Proceedings of ACM CIKM Workshop on Data-Centric AI (CIKM DCAI 2024)Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
E-commerce stores enable multilingual product discovery which require accurate product title translation. Multilingual large language models (LLMs) have shown promising capacity to perform machine translation tasks, and it can also enhance and translate product titles cross-lingually in one step. However, product title translation often requires more than just language conversion because titles are short, lack context, and contain specialized terminology. This study proposes a retrieval-augmented generation (RAG) approach that leverages existing bilingual product information in e-commerce by retrieving similar bilingual examples and incorporating them as few-shot prompts to enhance LLM-based product title translation. Experiment results show that our proposed RAG approach improve product title translation quality with chrF score gains of up to 15.3% for language pairs where the LLM has limited proficiency.
- [278] arXiv:2409.12882 [pdf, html, other]
-
Title: On the Hardness of Decentralized Multi-Agent Policy Evaluation under Byzantine AttacksComments: To appear in Proceedings of the 22nd International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt 2024)Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
In this paper, we study a fully-decentralized multi-agent policy evaluation problem, which is an important sub-problem in cooperative multi-agent reinforcement learning, in the presence of up to $f$ faulty agents. In particular, we focus on the so-called Byzantine faulty model with model poisoning setting. In general, policy evaluation is to evaluate the value function of any given policy. In cooperative multi-agent system, the system-wide rewards are usually modeled as the uniform average of rewards from all agents. We investigate the multi-agent policy evaluation problem in the presence of Byzantine agents, particularly in the setting of heterogeneous local rewards. Ideally, the goal of the agents is to evaluate the accumulated system-wide rewards, which are uniform average of rewards of the normal agents for a given policy. It means that all agents agree upon common values (the consensus part) and furthermore, the consensus values are the value functions (the convergence part). However, we prove that this goal is not achievable. Instead, we consider a relaxed version of the problem, where the goal of the agents is to evaluate accumulated system-wide reward, which is an appropriately weighted average reward of the normal agents. We further prove that there is no correct algorithm that can guarantee that the total number of positive weights exceeds $|\mathcal{N}|-f $, where $|\mathcal{N}|$ is the number of normal agents. Towards the end, we propose a Byzantine-tolerant decentralized temporal difference algorithm that can guarantee asymptotic consensus under scalar function approximation. We then empirically test the effective of the proposed algorithm.
- [279] arXiv:2409.12883 [pdf, html, other]
-
Title: Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type RecognitionDaniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian DaulComments: Paper submitted to Artificial Intelligence in Medicine. (AIIM), ElsevierSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The in-vivo identification of the kidney stone types during an ureteroscopy would be a major medical advance in urology, as it could reduce the time of the tedious renal calculi extraction process, while diminishing infection risks. Furthermore, such an automated procedure would make possible to prescribe anti-recurrence treatments immediately. Nowadays, only few experienced urologists are able to recognize the kidney stone types in the images of the videos displayed on a screen during the endoscopy. Thus, several deep learning (DL) models have recently been proposed to automatically recognize the kidney stone types using ureteroscopic images. However, these DL models are of black box nature whicl limits their applicability in clinical settings. This contribution proposes a case-based reasoning DL model which uses prototypical parts (PPs) and generates local and global descriptors. The PPs encode for each class (i.e., kidney stone type) visual feature information (hue, saturation, intensity and textures) similar to that used by biologists. The PPs are optimally generated due a new loss function used during the model training. Moreover, the local and global descriptors of PPs allow to explain the decisions ("what" information, "where in the images") in an understandable way for biologists and urologists. The proposed DL model has been tested on a database including images of the six most widespread kidney stone types. The overall average classification accuracy was 90.37. When comparing this results with that of the eight other DL models of the kidney stone state-of-the-art, it can be seen that the valuable gain in explanability was not reached at the expense of accuracy which was even slightly increased with respect to that (88.2) of the best method of the literature. These promising and interpretable results also encourage urologists to put their trust in AI-based solutions.
- [280] arXiv:2409.12884 [pdf, html, other]
-
Title: Hypersphere Secure Sketch Revisited: Probabilistic Linear Regression Attack on IronMask in Multiple UsageSubjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Protection of biometric templates is a critical and urgent area of focus. IronMask demonstrates outstanding recognition performance while protecting facial templates against existing known attacks. In high-level, IronMask can be conceptualized as a fuzzy commitment scheme building on the hypersphere directly. We devise an attack on IronMask targeting on the security notion of renewability. Our attack, termed as Probabilistic Linear Regression Attack, utilizes the linearity of underlying used error correcting code. This attack is the first algorithm to successfully recover the original template when getting multiple protected templates in acceptable time and requirement of storage. We implement experiments on IronMask applied to protect ArcFace that well verify the validity of our attacks. Furthermore, we carry out experiments in noisy environments and confirm that our attacks are still applicable. Finally, we put forward two strategies to mitigate this type of attacks.
- [281] arXiv:2409.12886 [pdf, html, other]
-
Title: EdgeGaussians -- 3D Edge Mapping via Gaussian SplattingSubjects: Computer Vision and Pattern Recognition (cs.CV)
With their meaningful geometry and their omnipresence in the 3D world, edges are extremely useful primitives in computer vision. 3D edges comprise of lines and curves, and methods to reconstruct them use either multi-view images or point clouds as input. State-of-the-art image-based methods first learn a 3D edge point cloud then fit 3D edges to it. The edge point cloud is obtained by learning a 3D neural implicit edge field from which the 3D edge points are sampled on a specific level set (0 or 1). However, such methods present two important drawbacks: i) it is not realistic to sample points on exact level sets due to float imprecision and training inaccuracies. Instead, they are sampled within a range of levels so the points do not lie accurately on the 3D edges and require further processing. ii) Such implicit representations are computationally expensive and require long training times. In this paper, we address these two limitations and propose a 3D edge mapping that is simpler, more efficient, and preserves accuracy. Our method learns explicitly the 3D edge points and their edge direction hence bypassing the need for point sampling. It casts a 3D edge point as the center of a 3D Gaussian and the edge direction as the principal axis of the Gaussian. Such a representation has the advantage of being not only geometrically meaningful but also compatible with the efficient training optimization defined in Gaussian Splatting. Results show that the proposed method produces edges as accurate and complete as the state-of-the-art while being an order of magnitude faster. Code is released at this https URL.
- [282] arXiv:2409.12887 [pdf, html, other]
-
Title: Knowledge-Based Domain-Oriented Data Augmentation for Enhancing Unsupervised Sentence EmbeddingSubjects: Computation and Language (cs.CL)
Recently, unsupervised sentence embedding models have received significant attention in downstream natural language processing tasks. Using large language models (LLMs) for data augmentation has led to considerable improvements in previous studies. Nevertheless, these strategies emphasize data augmentation with extensive generic corpora, neglecting the consideration of few-shot domain data. The synthesized data lacks fine-grained information and may introduce negative sample noise. This study introduces a novel pipeline-based data augmentation method that leverages LLM to synthesize the domain-specific dataset. It produces both positive and negative samples through entity- and quantity-aware augmentation, utilizing an entity knowledge graph to synthesize samples with fine-grained semantic distinctions, increasing training sample diversity and relevance. We then present a Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model to reduce synthetic data noise and improve model discrimination to reduce negative sample noise. Experimental results demonstrate that our approach achieves state-of-the-art semantic textual similarity performance with fewer synthetic data samples and lesser LLM parameters, demonstrating its efficiency and robustness in varied backbones.
- [283] arXiv:2409.12889 [pdf, html, other]
-
Title: Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study CaseSubjects: Artificial Intelligence (cs.AI)
Recently, large language model (LLM)-based agents have made significant advances across various fields. One of the most popular research areas involves applying these agents to video games. Traditionally, these methods have relied on game APIs to access in-game environmental and action data. However, this approach is limited by the availability of APIs and does not reflect how humans play games. With the advent of vision language models (VLMs), agents now have enhanced visual understanding capabilities, enabling them to interact with games using only visual inputs. Despite these advances, current approaches still face challenges in action-oriented tasks, particularly in action role-playing games (ARPGs), where reinforcement learning methods are prevalent but suffer from poor generalization and require extensive training. To address these limitations, we select an ARPG, ``Black Myth: Wukong'', as a research platform to explore the capability boundaries of existing VLMs in scenarios requiring visual-only input and complex action output. We define 12 tasks within the game, with 75% focusing on combat, and incorporate several state-of-the-art VLMs into this benchmark. Additionally, we will release a human operation dataset containing recorded gameplay videos and operation logs, including mouse and keyboard actions. Moreover, we propose a novel VARP (Vision Action Role-Playing) agent framework, consisting of an action planning system and a visual trajectory system. Our framework demonstrates the ability to perform basic tasks and succeed in 90% of easy and medium-level combat scenarios. This research aims to provide new insights and directions for applying multimodal agents in complex action game environments. The code and datasets will be made available at this https URL.
- [284] arXiv:2409.12892 [pdf, html, other]
-
Title: 3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-MarquardtSubjects: Computer Vision and Pattern Recognition (cs.CV)
We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS) by replacing its ADAM optimizer with a tailored Levenberg-Marquardt (LM). Existing methods reduce the optimization time by decreasing the number of Gaussians or by improving the implementation of the differentiable rasterizer. However, they still rely on the ADAM optimizer to fit Gaussian parameters of a scene in thousands of iterations, which can take up to an hour. To this end, we change the optimizer to LM that runs in conjunction with the 3DGS differentiable rasterizer. For efficient GPU parallization, we propose a caching data structure for intermediate gradients that allows us to efficiently calculate Jacobian-vector products in custom CUDA kernels. In every LM iteration, we calculate update directions from multiple image subsets using these kernels and combine them in a weighted mean. Overall, our method is 30% faster than the original 3DGS while obtaining the same reconstruction quality. Our optimization is also agnostic to other methods that acclerate 3DGS, thus enabling even faster speedups compared to vanilla 3DGS.
- [285] arXiv:2409.12894 [pdf, html, other]
-
Title: Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical StudyComments: 14 pages, 7 figuresSubjects: Software Engineering (cs.SE); Robotics (cs.RO)
Multi-modal foundation models and generative AI have demonstrated promising capabilities in applications across various domains. Recently, Vision-language-action (VLA) models have attracted much attention regarding their potential to advance robotic manipulation. Despite the end-to-end perception-control loop offered by the VLA models, there is a lack of comprehensive understanding of the capabilities of such models and an automated testing platform to reveal their robustness and reliability across different robotic manipulation scenarios. To address these challenges, in this work, we present VLATest, a testing framework that automatically generates diverse robotic manipulation scenes to assess the performance of VLA models from various perspectives. Large-scale experiments are considered, including eight VLA models, four types of manipulation tasks, and over 18,604 testing scenes. The experimental results show that existing VAL models still lack imperative robustness for practical applications. Specifically, the performance of VLA models can be significantly affected by several factors from the operation environments, such as camera poses, lighting conditions, and unseen objects. Our framework and the insights derived from the study are expected to pave the way for more advanced and reliable VLA-enabled robotic manipulation systems in practice.
- [286] arXiv:2409.12899 [pdf, html, other]
-
Title: LI-GS: Gaussian Splatting with LiDAR Incorporated for Accurate Large-Scale ReconstructionSubjects: Robotics (cs.RO)
Large-scale 3D reconstruction is critical in the field of robotics, and the potential of 3D Gaussian Splatting (3DGS) for achieving accurate object-level reconstruction has been demonstrated. However, ensuring geometric accuracy in outdoor and unbounded scenes remains a significant challenge. This study introduces LI-GS, a reconstruction system that incorporates LiDAR and Gaussian Splatting to enhance geometric accuracy in large-scale scenes. 2D Gaussain surfels are employed as the map representation to enhance surface alignment. Additionally, a novel modeling method is proposed to convert LiDAR point clouds to plane-constrained multimodal Gaussian Mixture Models (GMMs). The GMMs are utilized during both initialization and optimization stages to ensure sufficient and continuous supervision over the entire scene while mitigating the risk of over-fitting. Furthermore, GMMs are employed in mesh extraction to eliminate artifacts and improve the overall geometric quality. Experiments demonstrate that our method outperforms state-of-the-art methods in large-scale 3D reconstruction, achieving higher accuracy compared to both LiDAR-based methods and Gaussian-based methods with improvements of 52.6% and 68.7%, respectively.
- [287] arXiv:2409.12900 [pdf, html, other]
-
Title: Recognition of Harmful Phytoplankton from Microscopic Images using Deep LearningComments: 8 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Monitoring plankton distribution, particularly harmful phytoplankton, is vital for preserving aquatic ecosystems, regulating the global climate, and ensuring environmental protection. Traditional methods for monitoring are often time-consuming, expensive, error-prone, and unsuitable for large-scale applications, highlighting the need for accurate and efficient automated systems. In this study, we evaluate several state-of-the-art CNN models, including ResNet, ResNeXt, DenseNet, and EfficientNet, using three transfer learning approaches: linear probing, fine-tuning, and a combined approach, to classify eleven harmful phytoplankton genera from microscopic images. The best performance was achieved by ResNet-50 using the fine-tuning approach, with an accuracy of 96.97%. The results also revealed that the models struggled to differentiate between four harmful phytoplankton types with similar morphological features.
- [288] arXiv:2409.12902 [pdf, html, other]
-
Title: Fast End-to-End Generation of Belief Space Paths for Minimum Sensing NavigationSubjects: Robotics (cs.RO); Machine Learning (cs.LG)
We revisit the problem of motion planning in the Gaussian belief space. Motivated by the fact that most existing sampling-based planners suffer from high computational costs due to the high-dimensional nature of the problem, we propose an approach that leverages a deep learning model to predict optimal path candidates directly from the problem description. Our proposed approach consists of three steps. First, we prepare a training dataset comprising a large number of input-output pairs: the input image encodes the problem to be solved (e.g., start states, goal states, and obstacle locations), whereas the output image encodes the solution (i.e., the ground truth of the shortest path). Any existing planner can be used to generate this training dataset. Next, we leverage the U-Net architecture to learn the dependencies between the input and output data. Finally, a trained U-Net model is applied to a new problem encoded as an input image. From the U-Net's output image, which is interpreted as a distribution of paths,an optimal path candidate is reconstructed. The proposed method significantly reduces computation time compared to the sampling-based baseline algorithm.
- [289] arXiv:2409.12903 [pdf, html, other]
-
Title: Scaling Smart: Accelerating Large Language Model Pre-training with Small Model InitializationMohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid, Fartash Faghri, Minsik Cho, Moin Nabi, Devang Naik, Mehrdad FarajtabarSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of training time and final accuracy? In this paper, we introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts. We demonstrate that training such an initialized model results in significant savings in terms of GPU hours required for pre-training large language models.
- [290] arXiv:2409.12912 [pdf, html, other]
-
Title: The Relevance of Item-Co-Exposure For Exposure Bias MitigationSubjects: Information Retrieval (cs.IR)
Through exposing items to users, implicit feedback recommender systems influence the logged interactions, and, ultimately, their own recommendations. This effect is called exposure bias and it can lead to issues such as filter bubbles and echo chambers. Previous research employed the multinomial logit model (MNL) with exposure information to reduce exposure bias on synthetic data.
This extended abstract summarizes our previous study in which we investigated whether (i) these findings hold for human-generated choices, (ii) other discrete choice models mitigate bias better, and (iii) an item's estimated relevance can depend on the relevances of the other items that were presented with it. We collected a data set of biased and unbiased choices in a controlled online user study and measured the effects of overexposure and competition.
We found that (i) the discrete choice models effectively mitigated exposure bias on human-generated choice data, (ii) there were no significant differences in robustness among the different discrete choice models, and (iii) only multivariate discrete choice models were robust to competition between items. We conclude that discrete choice models mitigate exposure bias effectively because they consider item-co-exposure. Moreover, exposing items alongside more or less popular items can bias future recommendations significantly and item exposure must be tracked for overcoming exposure bias. We consider our work vital for understanding what exposure bias it, how it forms, and how it can be mitigated. - [291] arXiv:2409.12913 [pdf, html, other]
-
Title: Universal approximation theorem for neural networks with inputs from a topological vector spaceComments: 10 pagesSubjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
We study feedforward neural networks with inputs from a topological vector space (TVS-FNNs). Unlike traditional feedforward neural networks, TVS-FNNs can process a broader range of inputs, including sequences, matrices, functions and more. We prove a universal approximation theorem for TVS-FNNs, which demonstrates their capacity to approximate any continuous function defined on this expanded input space.
- [292] arXiv:2409.12914 [pdf, html, other]
-
Title: Defending against Reverse Preference Attacks is DifficultDomenic Rosati, Giles Edkins, Harsh Raj, David Atanasov, Subhabrata Majumdar, Janarthanan Rajendran, Frank Rudzicz, Hassan SajjadSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
While there has been progress towards aligning Large Language Models (LLMs) with human values and ensuring safe behaviour at inference time, safety-aligned LLMs are known to be vulnerable to training-time attacks such as supervised fine-tuning (SFT) on harmful datasets. In this paper, we ask if LLMs are vulnerable to adversarial reinforcement learning. Motivated by this goal, we propose Reverse Preference Attacks (RPA), a class of attacks to make LLMs learn harmful behavior using adversarial reward during reinforcement learning from human feedback (RLHF). RPAs expose a critical safety gap of safety-aligned LLMs in RL settings: they easily explore the harmful text generation policies to optimize adversarial reward. To protect against RPAs, we explore a host of mitigation strategies. Leveraging Constrained Markov-Decision Processes, we adapt a number of mechanisms to defend against harmful fine-tuning attacks into the RL setting. Our experiments show that ``online" defenses that are based on the idea of minimizing the negative log likelihood of refusals -- with the defender having control of the loss function -- can effectively protect LLMs against RPAs. However, trying to defend model weights using ``offline" defenses that operate under the assumption that the defender has no control over the loss function are less effective in the face of RPAs. These findings show that attacks done using RL can be used to successfully undo safety alignment in open-weight LLMs and use them for malicious purposes.
- [293] arXiv:2409.12915 [pdf, html, other]
-
Title: Unveiling and Manipulating Concepts in Time Series Foundation ModelsSubjects: Machine Learning (cs.LG)
Time series foundation models promise to be powerful tools for a wide range of applications. However, little is known about the concepts that these models learn and how can we manipulate them in the latent space. Our study bridges these gaps by identifying concepts learned by these models, localizing them to specific parts of the model, and steering model predictions along these conceptual directions, using synthetic time series data. Our results show that MOMENT, a state-of-the-art foundation model, can discern distinct time series patterns, and that this ability peaks in the middle layers of the network. Moreover, we show that model outputs can be steered using insights from its activations (e.g., by introducing periodic trends to initially constant signals through intervention during inference). Our findings underscore the importance of synthetic data in studying and steering time series foundation models and intervening throughout the whole model (using steering matrices), instead of a single layer.
- [294] arXiv:2409.12917 [pdf, other]
-
Title: Training Language Models to Self-Correct via Reinforcement LearningAviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra FaustSubjects: Machine Learning (cs.LG)
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
- [295] arXiv:2409.12919 [pdf, html, other]
-
Title: Swine Diet Design using Multi-objective Regionalized Bayesian OptimizationComments: 21 pages, 7 figuresSubjects: Artificial Intelligence (cs.AI)
The design of food diets in the context of animal nutrition is a complex problem that aims to develop cost-effective formulations while balancing minimum nutritional content. Traditional approaches based on theoretical models of metabolic responses and concentrations of digestible energy in raw materials face limitations in incorporating zootechnical or environmental variables affecting the performance of animals and including multiple objectives aligned with sustainable development policies. Recently, multi-objective Bayesian optimization has been proposed as a promising heuristic alternative able to deal with the combination of multiple sources of information, multiple and diverse objectives, and with an intrinsic capacity to deal with uncertainty in the measurements that could be related to variability in the nutritional content of raw materials. However, Bayesian optimization encounters difficulties in high-dimensional search spaces, leading to exploration predominantly at the boundaries. This work analyses a strategy to split the search space into regions that provide local candidates termed multi-objective regionalized Bayesian optimization as an alternative to improve the quality of the Pareto set and Pareto front approximation provided by BO in the context of swine diet design. Results indicate that this regionalized approach produces more diverse non-dominated solutions compared to the standard multi-objective Bayesian optimization. Besides, the regionalized strategy was four times more effective in finding solutions that outperform those identified by a stochastic programming approach referenced in the literature. Experiments using batches of query candidate solutions per iteration show that the optimization process can also be accelerated without compromising the quality of the Pareto set approximation during the initial, most critical phase of optimization.
- [296] arXiv:2409.12921 [pdf, html, other]
-
Title: Motion as Emotion: Detecting Affect and Cognitive Load from Free-Hand Gestures in VRSubjects: Human-Computer Interaction (cs.HC)
Affect and cognitive load influence many user behaviors. In this paper, we propose Motion as Emotion, a novel method that utilizes fine differences in hand motion to recognise affect and cognitive load in virtual reality (VR). We conducted a study with 22 participants who used common free-hand gesture interactions to carry out tasks of varying difficulty in VR environments. We find that the affect and cognitive load induced by tasks are associated with significant differences in gesture features such as speed, distance and hand tension. Standard support vector classification (SVC) models could accurately predict two levels (low, high) of valence, arousal and cognitive load from these features. Our results demonstrate the potential of Motion as Emotion as an accurate and reliable method of inferring user affect and cognitive load from free-hand gestures, without needing any additional wearable sensors or modifications to a standard VR headset.
- [297] arXiv:2409.12922 [pdf, other]
-
Title: AI Thinking: A framework for rethinking artificial intelligence in practiceComments: 30 pages, 2 figuresSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Artificial intelligence is transforming the way we work with information across disciplines and practical contexts. A growing range of disciplines are now involved in studying, developing, and assessing the use of AI in practice, but these disciplines often employ conflicting understandings of what AI is and what is involved in its use. New, interdisciplinary approaches are needed to bridge competing conceptualisations of AI in practice and help shape the future of AI use. I propose a novel conceptual framework called AI Thinking, which models key decisions and considerations involved in AI use across disciplinary perspectives. The AI Thinking model addresses five practice-based competencies involved in applying AI in context: motivating AI use in information processes, formulating AI methods, assessing available tools and technologies, selecting appropriate data, and situating AI in the sociotechnical contexts it is used in. A hypothetical case study is provided to illustrate the application of AI Thinking in practice. This article situates AI Thinking in broader cross-disciplinary discourses of AI, including its connections to ongoing discussions around AI literacy and AI-driven innovation. AI Thinking can help to bridge divides between academic disciplines and diverse contexts of AI use, and to reshape the future of AI in practice.
- [298] arXiv:2409.12925 [pdf, html, other]
-
Title: Data-driven surrogate model for etch rate profiles using sensor data from a reactive ion etcherAbhijit Pranav Pamarty, Robert Neuweiler, Le Quyen Do, Keaton Johnson, James J. Sanchez, Dinesh KoliComments: 6 pages, 9 figuresSubjects: Computational Engineering, Finance, and Science (cs.CE)
Reliable predictions of the etch rate profile are desirable in semiconductor manufacturing to prevent etch rate target misses and yield rate excursions. Conventional methods for analyzing etch rate require extensive metrology, which adds considerable costs to manufacturing. We demonstrate a data-driven method to predict the etch rate profiles of a capacitively-coupled plasma RIE etcher from the tool's sensor data. The model employs a hybrid autoencoder-multiquadric interpolation-based approach, with the autoencoder being used to encode the features of the wafers' etch rate profiles into a latent space representation. The tool's sensor data is then used to construct interpolation maps for the latent space variables using multiquadric radial basis functions, which are then used to generate synthetic wafer etch rate profiles using the decoder. The accuracy of the model is determined using experimental data, and the errors are analyzed in interpolation and extrapolation.
- [299] arXiv:2409.12926 [pdf, html, other]
-
Title: MaskMol: Knowledge-guided Molecular Image Pre-Training Framework for Activity CliffsZhixiang Cheng, Hongxin Xiang, Pengsen Ma, Li Zeng, Xin Jin, Xixi Yang, Jianxin Lin, Yang Deng, Bosheng Song, Xinxin Feng, Changhui Deng, Xiangxiang ZengComments: 33 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Activity cliffs, which refer to pairs of molecules that are structurally similar but show significant differences in their potency, can lead to model representation collapse and make the model challenging to distinguish them. Our research indicates that as molecular similarity increases, graph-based methods struggle to capture these nuances, whereas image-based approaches effectively retain the distinctions. Thus, we developed MaskMol, a knowledge-guided molecular image self-supervised learning framework. MaskMol accurately learns the representation of molecular images by considering multiple levels of molecular knowledge, such as atoms, bonds, and substructures. By utilizing pixel masking tasks, MaskMol extracts fine-grained information from molecular images, overcoming the limitations of existing deep learning models in identifying subtle structural changes. Experimental results demonstrate MaskMol's high accuracy and transferability in activity cliff estimation and compound potency prediction across 20 different macromolecular targets, outperforming 25 state-of-the-art deep learning and machine learning approaches. Visualization analyses reveal MaskMol's high biological interpretability in identifying activity cliff-relevant molecular substructures. Notably, through MaskMol, we identified candidate EP4 inhibitors that could be used to treat tumors. This study not only raises awareness about activity cliffs but also introduces a novel method for molecular image representation learning and virtual screening, advancing drug discovery and providing new insights into structure-activity relationships (SAR).
- [300] arXiv:2409.12929 [pdf, html, other]
-
Title: LogicPro: Improving Complex Logical Reasoning via Program-Guided LearningJin Jiang, Yuchen Yan, Yang Liu, Yonggang Jin, Shuai Peng, Mengdi Zhang, Xunliang Cai, Yixin Cao, Liangcai Gao, Zhi TangSubjects: Computation and Language (cs.CL)
In this paper, we present a novel approach, called LogicPro, to enhance Large Language Models (LLMs) complex Logical reasoning through Program Examples. We do this effectively by simply utilizing widely available algorithmic problems and their code solutions. First, we constructed diverse test samples input based on algorithmic questions and code solutions. Then, we designed different complex reasoning questions based on algorithmic problems and test samples. Finally, combining the intermediate variable outputs of the code solutions and the complex reasoning questions, we derived the reasoning process and the final answer. With this approach, we can construct a dataset that is sufficiently difficult (all models are ineffective), diverse (synthesized from 2,360 different algorithmic questions), and scalable (building different test samples and collecting more algorithmic questions). In addition, we obtain a high-quality reasoning process guided by the values of intermediate variables. As a result, our approach achieves significant improvements in multiple models for the BBH$^{27}$, GSM8K, HellSwag, Logicqa, Reclor, and RTE datasets, outperforming a wide range of existing reasoning datasets.
- [301] arXiv:2409.12935 [pdf, html, other]
-
Title: Stochastic gradient descent in continuous time for drift identification in multiscale diffusionsSubjects: Numerical Analysis (math.NA)
We consider the setting of multiscale overdamped Langevin stochastic differential equations, and study the problem of learning the drift function of the homogenized dynamics from continuous-time observations of the multiscale system. We decompose the drift term in a truncated series of basis functions, and employ the stochastic gradient descent in continuous time to infer the coefficients of the expansion. Due to the incompatibility between the multiscale data and the homogenized model, the estimator alone is not able to reconstruct the exact drift. We therefore propose to filter the original trajectory through appropriate kernels and include filtered data in the stochastic differential equation for the estimator, which indeed solves the misspecification issue. Several numerical experiments highlight the accuracy of our approach. Moreover, we show theoretically in a simplified framework the asymptotic unbiasedness of our estimator in the limit of infinite data and when the multiscale parameter describing the fastest scale vanishes.
- [302] arXiv:2409.12939 [pdf, html, other]
-
Title: Accelerating AI and Computer Vision for Satellite Pose Estimation on the Intel Myriad X Embedded SoCComments: Accepted for publication at Elsevier Microprocessors and MicrosystemsJournal-ref: Elsevier Microprocessors and Microsystems, Vol. 103, Nov. 2023Subjects: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
The challenging deployment of Artificial Intelligence (AI) and Computer Vision (CV) algorithms at the edge pushes the community of embedded computing to examine heterogeneous System-on-Chips (SoCs). Such novel computing platforms provide increased diversity in interfaces, processors and storage, however, the efficient partitioning and mapping of AI/CV workloads still remains an open issue. In this context, the current paper develops a hybrid AI/CV system on Intel's Movidius Myriad X, which is an heterogeneous Vision Processing Unit (VPU), for initializing and tracking the satellite's pose in space missions. The space industry is among the communities examining alternative computing platforms to comply with the tight constraints of on-board data processing, while it is also striving to adopt functionalities from the AI domain. At algorithmic level, we rely on the ResNet-50-based UrsoNet network along with a custom classical CV pipeline. For efficient acceleration, we exploit the SoC's neural compute engine and 16 vector processors by combining multiple parallelization and low-level optimization techniques. The proposed single-chip, robust-estimation, and real-time solution delivers a throughput of up to 5 FPS for 1-MegaPixel RGB images within a limited power envelope of 2W.
- [303] arXiv:2409.12941 [pdf, html, other]
-
Title: Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented GenerationSatyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal FaruquiComments: Arxiv PreprintSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.
- [304] arXiv:2409.12946 [pdf, html, other]
-
Title: Revisiting Semi-supervised Adversarial Robustness via Noise-aware Online Robust DistillationComments: 12 pages, 4 figures, 9 tablesSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
The robust self-training (RST) framework has emerged as a prominent approach for semi-supervised adversarial training. To explore the possibility of tackling more complicated tasks with even lower labeling budgets, unlike prior approaches that rely on robust pretrained models, we present SNORD - a simple yet effective framework that introduces contemporary semi-supervised learning techniques into the realm of adversarial training. By enhancing pseudo labels and managing noisy training data more effectively, SNORD showcases impressive, state-of-the-art performance across diverse datasets and labeling budgets, all without the need for pretrained models. Compared to full adversarial supervision, SNORD achieves a 90% relative robust accuracy under epsilon = 8/255 AutoAttack, requiring less than 0.1%, 2%, and 10% labels for CIFAR-10, CIFAR-100, and TinyImageNet-200, respectively. Additional experiments confirm the efficacy of each component and demonstrate the adaptability of integrating SNORD with existing adversarial pretraining strategies to further bolster robustness.
- [305] arXiv:2409.12947 [pdf, html, other]
-
Title: Unrolled denoising networks provably learn optimal Bayesian inferenceComments: 32 pagesSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Much of Bayesian inference centers around the design of estimators for inverse problems which are optimal assuming the data comes from a known prior. But what do these optimality guarantees mean if the prior is unknown? In recent years, algorithm unrolling has emerged as deep learning's answer to this age-old question: design a neural network whose layers can in principle simulate iterations of inference algorithms and train on data generated by the unknown prior. Despite its empirical success, however, it has remained unclear whether this method can provably recover the performance of its optimal, prior-aware counterparts.
In this work, we prove the first rigorous learning guarantees for neural networks based on unrolling approximate message passing (AMP). For compressed sensing, we prove that when trained on data drawn from a product prior, the layers of the network approximately converge to the same denoisers used in Bayes AMP. We also provide extensive numerical experiments for compressed sensing and rank-one matrix estimation demonstrating the advantages of our unrolled architecture - in addition to being able to obliviously adapt to general priors, it exhibits improvements over Bayes AMP in more general settings of low dimensions, non-Gaussian designs, and non-product priors. - [306] arXiv:2409.12949 [pdf, html, other]
-
Title: A Learning-based Quadcopter Controller with Extreme AdaptationComments: 12 pages, 9 figuresSubjects: Robotics (cs.RO)
This paper introduces a learning-based low-level controller for quadcopters, which adaptively controls quadcopters with significant variations in mass, size, and actuator capabilities. Our approach leverages a combination of imitation learning and reinforcement learning, creating a fast-adapting and general control framework for quadcopters that eliminates the need for precise model estimation or manual tuning. The controller estimates a latent representation of the vehicle's system parameters from sensor-action history, enabling it to adapt swiftly to diverse dynamics. Extensive evaluations in simulation demonstrate the controller's ability to generalize to unseen quadcopter parameters, with an adaptation range up to 16 times broader than the training set. In real-world tests, the controller is successfully deployed on quadcopters with mass differences of 3.7 times and propeller constants varying by more than 100 times, while also showing rapid adaptation to disturbances such as off-center payloads and motor failures. These results highlight the potential of our controller in extreme adaptation to simplify the design process and enhance the reliability of autonomous drone operations in unpredictable environments. The video and code are at: this https URL
- [307] arXiv:2409.12951 [pdf, html, other]
-
Title: Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNormSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Layer normalization is a pivotal step in the transformer architecture. This paper delves into the less explored geometric implications of this process, examining how LayerNorm influences the norm and orientation of hidden vectors in the representation space. We show that the definition of LayerNorm is innately linked to the uniform vector, defined as $\boldsymbol{1} = [1, 1, 1, 1, \cdots, 1]^T \in \mathbb{R}^d$. We then show that the standardization step in LayerNorm can be understood in three simple steps: (i) remove the component of a vector along the uniform vector, (ii) normalize the remaining vector, and (iii) scale the resultant vector by $\sqrt{d}$, where $d$ is the dimensionality of the representation space. We also introduce the property of "irreversibility" for LayerNorm, where we show that the information lost during the normalization process cannot be recovered. In other words, unlike batch normalization, LayerNorm cannot learn an identity transform. While we present possible arguments for removing the component along the uniform vector, the choice of removing this component seems arbitrary and not well motivated by the original authors. To evaluate the usefulness of this step, we compare the hidden representations of LayerNorm-based LLMs with models trained using RMSNorm and show that all LLMs naturally align representations orthogonal to the uniform vector, presenting the first mechanistic evidence that removing the component along the uniform vector in LayerNorm is a redundant step. Our findings support the use of RMSNorm over LayerNorm as it is not only more computationally efficient with comparable downstream performance, but also learns a similar distribution of hidden representations that operate orthogonal to the uniform vector.
- [308] arXiv:2409.12952 [pdf, html, other]
-
Title: The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual ExplanationsComments: Accepted paper at the ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Visual counterfactual explanation (CF) methods modify image concepts, e.g, shape, to change a prediction to a predefined outcome while closely resembling the original query image. Unlike self-explainable models (SEMs) and heatmap techniques, they grant users the ability to examine hypothetical "what-if" scenarios. Previous CF methods either entail post-hoc training, limiting the balance between transparency and CF quality, or demand optimization during inference. To bridge the gap between transparent SEMs and CF methods, we introduce the GdVAE, a self-explainable model based on a conditional variational autoencoder (CVAE), featuring a Gaussian discriminant analysis (GDA) classifier and integrated CF explanations. Full transparency is achieved through a generative classifier that leverages class-specific prototypes for the downstream task and a closed-form solution for CFs in the latent space. The consistency of CFs is improved by regularizing the latent space with the explainer function. Extensive comparisons with existing approaches affirm the effectiveness of our method in producing high-quality CF explanations while preserving transparency. Code and models are public.
- [309] arXiv:2409.12953 [pdf, html, other]
-
Title: JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated ImagesZhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani Alomari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Ishmam, Kai-Wei Chang, Shih-Fu Chang, Chris ThomasSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model's fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models' visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
- [310] arXiv:2409.12954 [pdf, other]
-
Title: GStex: Per-Primitive Texturing of 2D Gaussian Splatting for Decoupled Appearance and Geometry ModelingComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Gaussian splatting has demonstrated excellent performance for view synthesis and scene reconstruction. The representation achieves photorealistic quality by optimizing the position, scale, color, and opacity of thousands to millions of 2D or 3D Gaussian primitives within a scene. However, since each Gaussian primitive encodes both appearance and geometry, these attributes are strongly coupled--thus, high-fidelity appearance modeling requires a large number of Gaussian primitives, even when the scene geometry is simple (e.g., for a textured planar surface). We propose to texture each 2D Gaussian primitive so that even a single Gaussian can be used to capture appearance details. By employing per-primitive texturing, our appearance representation is agnostic to the topology and complexity of the scene's geometry. We show that our approach, GStex, yields improved visual quality over prior work in texturing Gaussian splats. Furthermore, we demonstrate that our decoupling enables improved novel view synthesis performance compared to 2D Gaussian splatting when reducing the number of Gaussian primitives, and that GStex can be used for scene appearance editing and re-texturing.
- [311] arXiv:2409.12957 [pdf, html, other]
-
Title: 3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive DiffusionZhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, Liang Pan, Dahua Lin, Ziwei LiuSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
The increasing demand for high-quality 3D assets across various industries necessitates efficient and automated 3D content creation. Despite recent advancements in 3D generative models, existing methods still face challenges with optimization speed, geometric fidelity, and the lack of assets for physically based rendering (PBR). In this paper, we introduce 3DTopia-XL, a scalable native 3D generative model designed to overcome these limitations. 3DTopia-XL leverages a novel primitive-based 3D representation, PrimX, which encodes detailed shape, albedo, and material field into a compact tensorial format, facilitating the modeling of high-resolution geometry with PBR assets. On top of the novel representation, we propose a generative framework based on Diffusion Transformer (DiT), which comprises 1) Primitive Patch Compression, 2) and Latent Primitive Diffusion. 3DTopia-XL learns to generate high-quality 3D assets from textual or visual inputs. We conduct extensive qualitative and quantitative experiments to demonstrate that 3DTopia-XL significantly outperforms existing methods in generating high-quality 3D assets with fine-grained textures and materials, efficiently bridging the quality gap between generative models and real-world applications.
- [312] arXiv:2409.12958 [pdf, html, other]
-
Title: MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse InstructionsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach's effectiveness for both NLU and open-ended generation. We publicly release datasets and models at this https URL.
- [313] arXiv:2409.12959 [pdf, html, other]
-
Title: MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search EnginesDongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng LiComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting the multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which involves no overlap with the current LMMs' training data, ensuring the correct answer can only be obtained within searching. By using MMSearch-Engine, the LMMs are evaluated by performing three individual tasks (requery, rerank, and summarization), and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, which surpasses the commercial product, Perplexity Pro, in the end-to-end task, demonstrating the effectiveness of our proposed pipeline. We further present error analysis to unveil current LMMs still struggle to fully grasp the multimodal search tasks, and conduct ablation study to indicate the potential of scaling test-time computation for AI search engine. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engine. Project Page: this https URL
- [314] arXiv:2409.12960 [pdf, html, other]
-
Title: LVCD: Reference-based Lineart Video Colorization with Diffusion ModelsComments: Accepted by ACM Transactions on Graphics and SIGGRAPH Asia 2024. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
We propose the first video diffusion framework for reference-based lineart video colorization. Unlike previous works that rely solely on image generative models to colorize lineart frame by frame, our approach leverages a large-scale pretrained video diffusion model to generate colorized animation videos. This approach leads to more temporally consistent results and is better equipped to handle large motions. Firstly, we introduce Sketch-guided ControlNet which provides additional control to finetune an image-to-video diffusion model for controllable video synthesis, enabling the generation of animation videos conditioned on lineart. We then propose Reference Attention to facilitate the transfer of colors from the reference frame to other frames containing fast and expansive motions. Finally, we present a novel scheme for sequential sampling, incorporating the Overlapped Blending Module and Prev-Reference Attention, to extend the video diffusion model beyond its original fixed-length limitation for long video colorization. Both qualitative and quantitative results demonstrate that our method significantly outperforms state-of-the-art techniques in terms of frame and video quality, as well as temporal consistency. Moreover, our method is capable of generating high-quality, long temporal-consistent animation videos with large motions, which is not achievable in previous works. Our code and model are available at this https URL.
- [315] arXiv:2409.12961 [pdf, html, other]
-
Title: Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary ResolutionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for visual encoders and yield similar numbers of tokens for LLMs. This approach is non-optimal for multimodal understanding and inefficient for processing inputs with long and short visual contents. To solve the problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression on visual tokens by request. These design features enable Oryx to accommodate extremely long visual contexts, such as videos, with lower resolution and high compression while maintaining high recognition precision for tasks like document understanding with native resolution and no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously. Our work is open-sourced at this https URL.
- [316] arXiv:2409.12962 [pdf, html, other]
-
Title: CLAIR-A: Leveraging Large Language Models to Judge Audio CaptionsComments: Code is publicly available at this https URLSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
The Automated Audio Captioning (AAC) task asks models to generate natural language descriptions of an audio input. Evaluating these machine-generated audio captions is a complex task that requires considering diverse factors, among them, auditory scene understanding, sound-object inference, temporal coherence, and the environmental context of the scene. While current methods focus on specific aspects, they often fail to provide an overall score that aligns well with human judgment. In this work, we propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models (LLMs) to evaluate candidate audio captions by directly asking LLMs for a semantic distance score. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics, with a 5.8% relative accuracy improvement compared to the domain-specific FENSE metric and up to 11% over the best general-purpose measure on the Clotho-Eval dataset. Moreover, CLAIR-A offers more transparency by allowing the language model to explain the reasoning behind its scores, with these explanations rated up to 30% better by human evaluators than those provided by baseline methods. CLAIR-A is made publicly available at this https URL.
- [317] arXiv:2409.12963 [pdf, html, other]
-
Title: Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free MannerYuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan YanSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Advancements in Large Language Models (LLMs) inspire various strategies for integrating video modalities. A key approach is Video-LLMs, which incorporate an optimizable interface linking sophisticated video encoders to LLMs. However, due to computation and data limitations, these Video-LLMs are typically pre-trained to process only short videos, limiting their broader application for understanding longer video content. Additionally, fine-tuning Video-LLMs to handle longer videos is cost-prohibitive. Consequently, it becomes essential to explore the interpolation of Video-LLMs under a completely training-free setting. In this paper, we first identify the primary challenges in interpolating Video-LLMs: (1) the video encoder and modality alignment projector are fixed, preventing the integration of additional frames into Video-LLMs, and (2) the LLM backbone is limited in its content length capabilities, which complicates the processing of an increased number of video tokens. To address these challenges, we propose a specific INTerPolation method for Video-LLMs (INTP-Video-LLMs). We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector. Furthermore, we introduce a training-free LLM context window extension method to enable Video-LLMs to understand a correspondingly increased number of visual tokens.
New submissions for Friday, 20 September 2024 (showing 317 of 317 entries )
- [318] arXiv:2409.10729 (cross-list from physics.flu-dyn) [pdf, html, other]
-
Title: OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUsBenjamin Wilfong, Anand Radhakrishnan, Henry A. Le Berre, Steve Abbott, Reuben D. Budiardja, Spencer H. BryngelsonComments: 11 pages, 9 figures, 6 listings, WACCPD at SC24Subjects: Fluid Dynamics (physics.flu-dyn); Mathematical Software (cs.MS); Computational Physics (physics.comp-ph)
GPUs are the heart of the latest generations of supercomputers. We efficiently accelerate a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses 'gang vector' and 'collapse'. Further speedups of six and ten times are achieved by packing user-defined types into coalesced multidimensional arrays and manual inlining via metaprogramming. Additional optimizations yield seven-times speedup in array packing and thirty-times speedup of select kernels on Frontier. Weak scaling efficiencies of 97% and 95% are observed when scaling to 50% of Summit and 95% of Frontier. Strong scaling efficiencies of 84% and 81% are observed when increasing the device count by a factor of 8 and 16 on V100 and MI250X hardware. The strong scaling efficiency of AMD's MI250X increases to 92% when increasing the device count by a factor of 16 when GPU-aware MPI is used for communication.
- [319] arXiv:2409.12195 (cross-list from stat.ML) [pdf, html, other]
-
Title: Reproduction of IVFS algorithm for high-dimensional topology preservation feature selectionComments: arXiv admin note: substantial text overlap with arXiv:2004.01299 by other authorsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Feature selection is a crucial technique for handling high-dimensional data. In unsupervised scenarios, many popular algorithms focus on preserving the original data structure. In this paper, we reproduce the IVFS algorithm introduced in AAAI 2020, which is inspired by the random subset method and preserves data similarity by maintaining topological structure. We systematically organize the mathematical foundations of IVFS and validate its effectiveness through numerical experiments similar to those in the original paper. The results demonstrate that IVFS outperforms SPEC and MCFS on most datasets, although issues with its convergence and stability persist.
- [320] arXiv:2409.12209 (cross-list from q-bio.GN) [pdf, other]
-
Title: Multivariate Analysis of Gut Microbiota Composition and Prevalence of Gastric CancerSubjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
The global surge in the cases of gastric cancer has prompted an investigation into the potential of gut microbiota as a predictive marker for the disease. The alterations in gut diversity are suspected to be associated with an elevated risk of gastric cancer. This paper delves into finding the correlation between gut microbiota and gastric cancer, focusing on patients who have undergone total and subtotal gastrectomy. Utilizing data mining and statistical learning methods, an analysis was conducted on 16S-RNA sequenced genes obtained from 96 participants with the aim of identifying specific genera of gut microbiota associated with gastric cancer. The study reveals several prominent bacterial genera that could potentially serve as biomarkers assessing the risk of gastric cancer. These findings offer a pathway for early risk assessment and precautionary measures in the diagnosis of gastric cancer. The intricate mechanisms through which these gut microbiotas influence gastric cancer progression warrant further investigation. This research significantly aims to contribute to the growing understanding of the gut-cancer axis and its implications in disease prediction and prevention.
- [321] arXiv:2409.12215 (cross-list from q-bio.BM) [pdf, html, other]
-
Title: Assessing Reusability of Deep Learning-Based Monotherapy Drug Response Prediction Models Trained with Omics DataJamie C. Overbeek, Alexander Partin, Thomas S. Brettin, Nicholas Chia, Oleksandr Narykov, Priyanka Vasanthakumari, Andreas Wilke, Yitan Zhu, Austin Clyde, Sara Jones, Rohan Gnanaolivu, Yuanhang Liu, Jun Jiang, Chen Wang, Carter Knutson, Andrew McNaughton, Neeraj Kumar, Gayara Demini Fernando, Souparno Ghosh, Cesar Sanchez-Villalobos, Ruibo Zhang, Ranadip Pal, M. Ryan Weil, Rick L. StevensComments: 12 pages, 2 figuresSubjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Cancer drug response prediction (DRP) models present a promising approach towards precision oncology, tailoring treatments to individual patient profiles. While deep learning (DL) methods have shown great potential in this area, models that can be successfully translated into clinical practice and shed light on the molecular mechanisms underlying treatment response will likely emerge from collaborative research efforts. This highlights the need for reusable and adaptable models that can be improved and tested by the wider scientific community. In this study, we present a scoring system for assessing the reusability of prediction DRP models, and apply it to 17 peer-reviewed DL-based DRP models. As part of the IMPROVE (Innovative Methodologies and New Data for Predictive Oncology Model Evaluation) project, which aims to develop methods for systematic evaluation and comparison DL models across scientific domains, we analyzed these 17 DRP models focusing on three key categories: software environment, code modularity, and data availability and preprocessing. While not the primary focus, we also attempted to reproduce key performance metrics to verify model behavior and adaptability. Our assessment of 17 DRP models reveals both strengths and shortcomings in model reusability. To promote rigorous practices and open-source sharing, we offer recommendations for developing and sharing prediction models. Following these recommendations can address many of the issues identified in this study, improving model reusability without adding significant burdens on researchers. This work offers the first comprehensive assessment of reusability and reproducibility across diverse DRP models, providing insights into current model sharing practices and promoting standards within the DRP and broader AI-enabled scientific research community.
- [322] arXiv:2409.12222 (cross-list from hep-th) [pdf, html, other]
-
Title: Conformal Fields from Neural NetworksComments: 32+16 pagesSubjects: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
We use the embedding formalism to construct conformal fields in $D$ dimensions, by restricting Lorentz-invariant ensembles of homogeneous neural networks in $(D+2)$ dimensions to the projective null cone. Conformal correlators may be computed using the parameter space description of the neural network. Exact four-point correlators are computed in a number of examples, and we perform a 4D conformal block decomposition that elucidates the spectrum. In some examples the analysis is facilitated by recent approaches to Feynman integrals. Generalized free CFTs are constructed using the infinite-width Gaussian process limit of the neural network, enabling a realization of the free boson. The extension to deep networks constructs conformal fields at each subsequent layer, with recursion relations relating their conformal dimensions and four-point functions. Numerical approaches are discussed.
- [323] arXiv:2409.12243 (cross-list from math.OC) [pdf, html, other]
-
Title: Convergence of Markov Chains for Constant Step-size Stochastic Gradient Descent with Separable FunctionsSubjects: Optimization and Control (math.OC); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
Stochastic gradient descent (SGD) is a popular algorithm for minimizing objective functions that arise in machine learning. For constant step-sized SGD, the iterates form a Markov chain on a general state space. Focusing on a class of separable (non-convex) objective functions, we establish a "Doeblin-type decomposition," in that the state space decomposes into a uniformly transient set and a disjoint union of absorbing sets. Each of the absorbing sets contains a unique invariant measure, with the set of all invariant measures being the convex hull. Moreover the set of invariant measures are shown to be global attractors to the Markov chain with a geometric convergence rate. The theory is highlighted with examples that show: (1) the failure of the diffusion approximation to characterize the long-time dynamics of SGD; (2) the global minimum of an objective function may lie outside the support of the invariant measures (i.e., even if initialized at the global minimum, SGD iterates will leave); and (3) bifurcations may enable the SGD iterates to transition between two local minima. Key ingredients in the theory involve viewing the SGD dynamics as a monotone iterated function system and establishing a "splitting condition" of Dubins and Freedman 1966 and Bhattacharya and Lee 1988.
- [324] arXiv:2409.12276 (cross-list from eess.IV) [pdf, html, other]
-
Title: Unsupervised Feature Orthogonalization for Learning Distortion-Invariant RepresentationsComments: Accepted at RROW@BMVC 2024 (Workshop on Robust Recognition in the Open World at the British Machine Vision Conference)Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
This study introduces unORANIC+, a novel method that integrates unsupervised feature orthogonalization with the ability of a Vision Transformer to capture both local and global relationships for improved robustness and generalizability. The streamlined architecture of unORANIC+ effectively separates anatomical and image-specific attributes, resulting in robust and unbiased latent representations that allow the model to demonstrate excellent performance across various medical image analysis tasks and diverse datasets. Extensive experimentation demonstrates unORANIC+'s reconstruction proficiency, corruption resilience, as well as capability to revise existing image distortions. Additionally, the model exhibits notable aptitude in downstream tasks such as disease classification and corruption detection. We confirm its adaptability to diverse datasets of varying image sources and sample sizes which positions the method as a promising algorithm for advanced medical image analysis, particularly in resource-constrained environments lacking large, tailored datasets. The source code is available at this https URL .
- [325] arXiv:2409.12287 (cross-list from quant-ph) [pdf, html, other]
-
Title: Constructing Noise-Robust Quantum Gates via Pontryagin's Maximum PrincipleComments: 10 pages, 8 figures, accepted to IEEE International Conference on Quantum Computing & Engineering (QCE24)Subjects: Quantum Physics (quant-ph); Systems and Control (eess.SY)
Reliable quantum information technologies depend on precise actuation and techniques to mitigate the effects of undesired disturbances such as environmental noise and imperfect calibration. In this work, we present a general framework based in geometric optimal control theory to synthesize smooth control pulses for implementing arbitrary noise-robust quantum gates. The methodology applies to generic unitary quantum dynamics with any number of qubits or energy levels, any number of control fields, and any number of disturbances, extending existing dynamical decoupling approaches that are only applicable for limited gate sets or small systems affected by one or two disturbances. The noise-suppressing controls are computed via indirect trajectory optimization based on Pontryagin's maximum principle, eliminating the need to make heuristic structural assumptions on parameterized pulse envelopes.
- [326] arXiv:2409.12290 (cross-list from math.OC) [pdf, html, other]
-
Title: Adaptive Extremum Seeking Control via the RMSprop OptimizerComments: 6 pages, 1 figure, L-CSS and American Control Conference SubmissionSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
Extremum Seeking Control (ESC) is a well-known set of continuous time algorithms for model-free optimization of a cost function. One issue for ESCs is the convergence rates of parameters to extrema of unknown cost functions. The local convergence rate depends on the second, or sometimes higher, order derivatives of the unknown cost function. To mitigate this dependency, we propose the use of the RMSprop optimizer for ESCs as RMSprop is an adaptive gradient-based optimizer which attempts to have a normalized convergence rate in all parameters. Practical stability results are given for this RMSprop ESC (RMSpESC). In particular notability, the proof of practical stability uses Lyapunov function based on observed contracting, attractive sets. Versions of this Lyapunov function could be applied to other areas of applications, in particular for interconnected systems.
- [327] arXiv:2409.12298 (cross-list from math.OC) [pdf, html, other]
-
Title: Computing Bouligand stationary points efficiently in low-rank optimizationSubjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)
This paper considers the problem of minimizing a differentiable function with locally Lipschitz continuous gradient on the algebraic variety of all $m$-by-$n$ real matrices of rank at most $r$. Several definitions of stationarity exist for this nonconvex problem. Among them, Bouligand stationarity is the strongest necessary condition for local optimality. Only a handful of algorithms generate a sequence in the variety whose accumulation points are provably Bouligand stationary. Among them, the most parsimonious with (truncated) singular value decompositions (SVDs) or eigenvalue decompositions can still require a truncated SVD of a matrix whose rank can be as large as $\min\{m, n\}-r+1$ if the gradient does not have low rank, which is computationally prohibitive in the typical case where $r \ll \min\{m, n\}$. This paper proposes a first-order algorithm that generates a sequence in the variety whose accumulation points are Bouligand stationary while requiring SVDs of matrices whose smaller dimension is always at most $r$. A standard measure of Bouligand stationarity converges to zero along the bounded subsequences at a rate at least $O(1/\sqrt{i+1})$, where $i$ is the iteration counter. Furthermore, a rank-increasing scheme based on the proposed algorithm is presented, which can be of interest if the parameter $r$ is potentially overestimated.
- [328] arXiv:2409.12301 (cross-list from stat.ML) [pdf, html, other]
-
Title: Amortized Variational Inference for Deep Gaussian ProcessesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Gaussian processes (GPs) are Bayesian nonparametric models for function approximation with principled predictive uncertainty estimates. Deep Gaussian processes (DGPs) are multilayer generalizations of GPs that can represent complex marginal densities as well as complex mappings. As exact inference is either computationally prohibitive or analytically intractable in GPs and extensions thereof, some existing methods resort to variational inference (VI) techniques for tractable approximations. However, the expressivity of conventional approximate GP models critically relies on independent inducing variables that might not be informative enough for some problems. In this work we introduce amortized variational inference for DGPs, which learns an inference function that maps each observation to variational parameters. The resulting method enjoys a more expressive prior conditioned on fewer input dependent inducing variables and a flexible amortized marginal posterior that is able to model more complicated functions. We show with theoretical reasoning and experimental results that our method performs similarly or better than previous approaches at less computational cost.
- [329] arXiv:2409.12333 (cross-list from eess.IV) [pdf, html, other]
-
Title: Scale-specific auxiliary multi-task contrastive learning for deep liver vessel segmentationAmine Sadikine, Bogdan Badic, Jean-Pierre Tasu, Vincent Noblet, Pascal Ballet, Dimitris Visvikis, Pierre-Henri ConzeComments: 5 pages, 5 figures, conferenceJournal-ref: 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), Cartagena, Colombia, 2023, pp. 1-5Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Extracting hepatic vessels from abdominal images is of high interest for clinicians since it allows to divide the liver into functionally-independent Couinaud segments. In this respect, an automated liver blood vessel extraction is widely summoned. Despite the significant growth in performance of semantic segmentation methodologies, preserving the complex multi-scale geometry of main vessels and ramifications remains a major challenge. This paper provides a new deep supervised approach for vessel segmentation, with a strong focus on representations arising from the different scales inherent to the vascular tree geometry. In particular, we propose a new clustering technique to decompose the tree into various scale levels, from tiny to large vessels. Then, we extend standard 3D UNet to multi-task learning by incorporating scale-specific auxiliary tasks and contrastive learning to encourage the discrimination between scales in the shared representation. Promising results, depicted in several evaluation metrics, are revealed on the public 3D-IRCADb dataset.
- [330] arXiv:2409.12334 (cross-list from eess.IV) [pdf, html, other]
-
Title: Deep vessel segmentation with joint multi-prior encodingAmine Sadikine, Bogdan Badic, Enzo Ferrante, Vincent Noblet, Pascal Ballet, Dimitris Visvikis, Pierre-Henri ConzeComments: 5 pages, 3 figures, conferenceJournal-ref: 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 2024, pp. 1-5Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The precise delineation of blood vessels in medical images is critical for many clinical applications, including pathology detection and surgical planning. However, fully-automated vascular segmentation is challenging because of the variability in shape, size, and topology. Manual segmentation remains the gold standard but is time-consuming, subjective, and impractical for large-scale studies. Hence, there is a need for automatic and reliable segmentation methods that can accurately detect blood vessels from medical images. The integration of shape and topological priors into vessel segmentation models has been shown to improve segmentation accuracy by offering contextual information about the shape of the blood vessels and their spatial relationships within the vascular tree. To further improve anatomical consistency, we propose a new joint prior encoding mechanism which incorporates both shape and topology in a single latent space. The effectiveness of our method is demonstrated on the publicly available 3D-IRCADb dataset. More globally, the proposed approach holds promise in overcoming the challenges associated with automatic vessel delineation and has the potential to advance the field of deep priors encoding.
- [331] arXiv:2409.12343 (cross-list from math.OC) [pdf, html, other]
-
Title: Seperable Bregman Framework for Sparsity Constrained Nonlinear OptimizationSubjects: Optimization and Control (math.OC); Information Theory (cs.IT)
This paper considers the minimization of a continuously differentiable function over a cardinality constraint. We focus on smooth and relatively smooth functions. These smoothness criteria result in new descent lemmas. Based on the new descent lemmas, novel optimality conditions and algorithms are developed, which extend the previously proposed hard-thresholding algorithms. We give a theoretical analysis of these algorithms and extend previous results on properties of iterative hard thresholding-like algorithms. In particular, we focus on the weighted $\ell_2$ norm, which requires efficient solution of convex subproblems. We apply our algorithms to compressed sensing problems to demonstrate the theoretical findings and the enhancements achieved through the proposed framework.
- [332] arXiv:2409.12347 (cross-list from eess.IV) [pdf, other]
-
Title: Axial Attention Transformer Networks: A New Frontier in Breast Cancer DetectionSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
This paper delves into the challenges and advancements in the field of medical image segmentation, particularly focusing on breast cancer diagnosis. The authors propose a novel Transformer-based segmentation model that addresses the limitations of traditional convolutional neural networks (CNNs), such as U-Net, in accurately localizing and segmenting small lesions within breast cancer images. The model introduces an axial attention mechanism to enhance the computational efficiency and address the issue of global contextual information that is often overlooked by CNNs. Additionally, the paper discusses improvements tailored to the small dataset challenge, including the incorporation of relative position information and a gated axial attention mechanism to refine the model's focus on relevant features. The proposed model aims to significantly improve the segmentation accuracy of breast cancer images, offering a more efficient and effective tool for computer-aided diagnosis.
- [333] arXiv:2409.12352 (cross-list from eess.AS) [pdf, html, other]
-
Title: META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASRJinhan Wang, Weiqing Wang, Kunal Dhawan, Taejin Park, Myungjong Kim, Ivan Medennikov, He Huang, Nithin Koluguri, Jagadeesh Balam, Boris GinsburgSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. Our proposed model is trained in a fully end-to-end manner, incorporating speaker supervision from a pre-trained speaker diarization module. We introduce an intuitive yet effective method for masking ASR encoder activations using output from the speaker supervision module, a technique we term Meta-Cat (meta-information concatenation), that can be applied to both MS-ASR and TS-ASR. Our results demonstrate that the proposed architecture achieves competitive performance in both MS-ASR and TS-ASR tasks, without the need for traditional methods, such as neural mask estimation or masking at the audio or feature level. Furthermore, we demonstrate a glimpse of a unified dual-task model which can efficiently handle both MS-ASR and TS-ASR tasks. Thus, this work illustrates that a robust end-to-end multi-talker ASR framework can be implemented with a streamlined architecture, obviating the need for the complex speaker filtering mechanisms employed in previous studies.
- [334] arXiv:2409.12370 (cross-list from eess.AS) [pdf, html, other]
-
Title: Robust Audiovisual Speech Recognition Models with Mixture-of-ExpertsComments: 6 pages, 2 figures, accepted by IEEE Spoken Language Technology Workshop 2024Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Visual signals can enhance audiovisual speech recognition accuracy by providing additional contextual information. Given the complexity of visual signals, an audiovisual speech recognition model requires robust generalization capabilities across diverse video scenarios, presenting a significant challenge. In this paper, we introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for ``in-the-wild'' videos. Specifically, we first encode visual information into visual tokens sequence and map them into speech space by a lightweight projection. Then, we build EVA upon a robust pretrained speech recognition model, ensuring its generalization ability. Moreover, to incorporate visual information effectively, we inject visual information into the ASR model through a mixture-of-experts module. Experiments show our model achieves state-of-the-art results on three benchmarks, which demonstrates the generalization ability of EVA across diverse video domains.
- [335] arXiv:2409.12377 (cross-list from eess.IV) [pdf, html, other]
-
Title: Fundus image enhancement through direct diffusion bridgesComments: Published at IEEE JBHI. 12 pages, 10 figures. Code and Data: this https URLSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
We propose FD3, a fundus image enhancement method based on direct diffusion bridges, which can cope with a wide range of complex degradations, including haze, blur, noise, and shadow. We first propose a synthetic forward model through a human feedback loop with board-certified ophthalmologists for maximal quality improvement of low-quality in-vivo images. Using the proposed forward model, we train a robust and flexible diffusion-based image enhancement network that is highly effective as a stand-alone method, unlike previous diffusion model-based approaches which act only as a refiner on top of pre-trained models. Through extensive experiments, we show that FD3 establishes \add{superior quality} not only on synthetic degradations but also on in vivo studies with low-quality fundus photos taken from patients with cataracts or small pupils. To promote further research in this area, we open-source all our code and data used for this research at this https URL
- [336] arXiv:2409.12388 (cross-list from eess.AS) [pdf, html, other]
-
Title: Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTCSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework. SACTC is a tailored CTC variant for multi-talker scenarios, it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees of speech overlap. Specifically, we observe relative word error rate reductions of 10% overall and 15% on low-overlap speech. This work represents an initial exploration of CTC-based enhancements for MTASR tasks, offering a new perspective on speaker disentanglement in multi-talker speech recognition.
- [337] arXiv:2409.12399 (cross-list from eess.IV) [pdf, html, other]
-
Title: I2I-Galip: Unsupervised Medical Image Translation Using Generative Adversarial CLIPSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Unpaired image-to-image translation is a challenging task due to the absence of paired examples, which complicates learning the complex mappings between the distinct distributions of the source and target domains. One of the most commonly used approach for this task is CycleGAN which requires the training of a new pair of generator-discriminator networks for each domain pair. In this paper, we propose a new image-to-image translation framework named Image-to-Image-Generative-Adversarial-CLIP (I2I-Galip) where we utilize a pre-trained multi-model foundation model (i.e., CLIP) to mitigate the need of separate generator-discriminator pairs for each source-target mapping while achieving better and more efficient multi-domain translation. By utilizing the massive knowledge gathered during pre-training a foundation model, our approach makes use of a single lightweight generator network with ~13M parameters for the multi-domain image translation task. Comprehensive experiments on translation performance in public MRI and CT datasets show the superior performance of the proposed framework over the existing approaches. Code will be available (this https URL).
- [338] arXiv:2409.12401 (cross-list from eess.IV) [pdf, html, other]
-
Title: MambaRecon: MRI Reconstruction with Structured State Space ModelsSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Magnetic Resonance Imaging (MRI) is one of the most important medical imaging modalities as it provides superior resolution of soft tissues, albeit with a notable limitation in scanning speed. The advent of deep learning has catalyzed the development of cutting-edge methods for the expedited reconstruction of MRI scans, utilizing convolutional neural networks and, more recently, vision transformers. Recently proposed structured state space models (e.g., Mamba) have gained some traction due to their efficiency and low computational requirements compared to transformer models. We propose an innovative MRI reconstruction framework that employs structured state space models at its core, aimed at amplifying both long-range contextual sensitivity and reconstruction efficacy. Comprehensive experiments on public brain MRI datasets show that our model sets new benchmarks beating state-of-the-art reconstruction baselines. Code will be available (this https URL).
- [339] arXiv:2409.12413 (cross-list from eess.AS) [pdf, html, other]
-
Title: DeFT-Mamba: Universal Multichannel Sound Separation and Polyphonic Audio ClassificationComments: 5 pages, 2 figuresSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
This paper presents a framework for universal sound separation and polyphonic audio classification, addressing the challenges of separating and classifying individual sound sources in a multichannel mixture. The proposed framework, DeFT-Mamba, utilizes the dense frequency-time attentive network (DeFTAN) combined with Mamba to extract sound objects, capturing the local time-frequency relations through gated convolution block and the global time-frequency relations through position-wise Hybrid Mamba. DeFT-Mamba surpasses existing separation and classification networks by a large margin, particularly in complex scenarios involving in-class polyphony. Additionally, a classification-based source counting method is introduced to identify the presence of multiple sources, outperforming conventional threshold-based approaches. Separation refinement tuning is also proposed to improve performance further. The proposed framework is trained and tested on a multichannel universal sound separation dataset developed in this work, designed to mimic realistic environments with moving sources and varying onsets and offsets of polyphonic events.
- [340] arXiv:2409.12415 (cross-list from eess.AS) [pdf, html, other]
-
Title: Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp CluesComments: 5 pages, 4 figuresSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
We propose a multichannel-to-multichannel target sound extraction (M2M-TSE) framework for separating multichannel target signals from a multichannel mixture of sound sources. Target sound extraction (TSE) isolates a specific target signal using user-provided clues, typically focusing on single-channel extraction with class labels or temporal activation maps. However, to preserve and utilize spatial information in multichannel audio signals, it is essential to extract multichannel signals of a target sound source. Moreover, the clue for extraction can also include spatial or temporal cues like direction-of-arrival (DoA) or timestamps of source activation. To address these challenges, we present an M2M framework that extracts a multichannel sound signal based on spatio-temporal clues. We demonstrate that our transformer-based architecture can successively accomplish the M2M-TSE task for multichannel signals synthesized from audio signals of diverse classes in different room environments. Furthermore, we show that the multichannel extraction task introduces sufficient inductive bias in the DNN, allowing it to directly handle DoA clues without utilizing hand-crafted spatial features.
- [341] arXiv:2409.12416 (cross-list from eess.AS) [pdf, html, other]
-
Title: Speech-Declipping Transformer with Complex Spectrogram and Learnerble Temporal FeaturesComments: 5 pages, 2 figures, submitted to ICASSP 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
We present a transformer-based speech-declipping model that effectively recovers clipped signals across a wide range of input signal-to-distortion ratios (SDRs). While recent time-domain deep neural network (DNN)-based declippers have outperformed traditional handcrafted and spectrogram-based DNN approaches, they still struggle with low-SDR inputs. To address this, we incorporate a transformer-based architecture that operates in the time-frequency (TF) domain. The TF-transformer architecture has demonstrated remarkable performance in the speech enhancement task for low-SDR signals but cannot be optimal for the time-domain artifact like clipping. To overcome the limitations of spectrogram-based DNNs, we design an extra convolutional block that directly extracts temporal features from time-domain waveforms. The joint analysis of complex spectrogram and learned temporal features allows the model to improve performance on both high- and low-SDR inputs. Our approach also preserves the unclipped portions of the speech signal during processing, preventing degradation typically seen when only spectral information is used. In evaluations on the VoiceBank-DEMAND and DNS challenge datasets, the proposed model consistently outperformed state-of-the-art (SOTA) declipping models across various metrics, demonstrating its robustness and generalizability.
- [342] arXiv:2409.12417 (cross-list from math.CO) [pdf, html, other]
-
Title: Universal partial toriComments: 22 pages, 1 figureSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
A De Bruijn cycle is a cyclic sequence in which every word of length $n$ over an alphabet $\mathcal{A}$ appears exactly once. De Bruijn tori are a two-dimensional analogue. Motivated by recent progress on universal partial cycles and words, which shorten De Bruijn cycles using a wildcard character, we introduce universal partial tori and matrices. We find them computationally and construct infinitely many of them using one-dimensional variants of universal cycles, including a new variant called a universal partial family.
- [343] arXiv:2409.12420 (cross-list from math.AP) [pdf, html, other]
-
Title: Spatially-invariant opinion dynamics on the circleSubjects: Analysis of PDEs (math.AP); Social and Information Networks (cs.SI)
We propose a nonlinear opinion dynamics model for an agent making decisions about a continuous distribution of options in the presence of distributed input. Inspired by perceptual decision-making, we develop the theory for options distributed on the circle, representing, e.g., the set of possible heading directions in planar robotic navigation problems. Interactions among options are encoded through a spatially invariant kernel. We design the kernel to ensure that decision-making is robust, in the sense that only a finite subset of options can be favored over the continuum. We prove the spatial invariance of the model linearization and use this result to prove an opinion-forming bifurcation in the model with zero input. We then use space and time frequency domain analysis of the model linearization to infer the ultra-sensitive input-output behavior of the nonlinear dynamics with input. We illustrate our model's versatility with a robotic navigation example in crowded spaces.
- [344] arXiv:2409.12460 (cross-list from physics.soc-ph) [pdf, html, other]
-
Title: Robustness of the public transport network against attacks on its routesComments: 10 pages, 8 figuresJournal-ref: Chaos, Solitons and Fractals, 184, 115019 (2024)Subjects: Physics and Society (physics.soc-ph); Social and Information Networks (cs.SI)
We investigate the robustness of Public Transport Networks (PTNs) when subjected to route attacks, focusing specifically on public bus lines. Such attacks, mirroring real-world scenarios, offer insight into the multifaceted dynamics of cities. Our study delves into the consequences of systematically removing entire routes based on strategies that use centrality measures. We evaluate the network's robustness by analyzing the sizes of fragmented networks, focusing on the largest components and derived metrics. To assess the efficacy of various attack strategies, we employ them on both a synthetic PTN model and a real-world example, specifically the Buenos Aires Metropolitan Area in Argentina. We examine these strategies and contrast them with random, and one-step most and least harmful procedures. Our findings indicate that \textit{betweenness}-based attacks and the one-step most (\textit{maximal}) harmful procedure emerge as the most effective attack strategies. Remarkably, the \textit{betweenness} strategy partitions the network into components of similar sizes, whereas alternative approaches yield one dominant and several minor components.
- [345] arXiv:2409.12462 (cross-list from cond-mat.mtrl-sci) [pdf, other]
-
Title: Unsupervised Reward-Driven Image Segmentation in Automated Scanning Transmission Electron Microscopy ExperimentsComments: 17 pages, 6 imagesSubjects: Materials Science (cond-mat.mtrl-sci); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Automated experiments in scanning transmission electron microscopy (STEM) require rapid image segmentation to optimize data representation for human interpretation, decision-making, site-selective spectroscopies, and atomic manipulation. Currently, segmentation tasks are typically performed using supervised machine learning methods, which require human-labeled data and are sensitive to out-of-distribution drift effects caused by changes in resolution, sampling, or beam shape. Here, we operationalize and benchmark a recently proposed reward-driven optimization workflow for on-the fly image analysis in STEM. This unsupervised approach is much more robust, as it does not rely on human labels and is fully explainable. The explanatory feedback can help the human to verify the decision making and potentially tune the model by selecting the position along the Pareto frontier of reward functions. We establish the timing and effectiveness of this method, demonstrating its capability for real-time performance in high-throughput and dynamic automated STEM experiments. The reward driven approach allows to construct explainable robust analysis workflows and can be generalized to a broad range of image analysis tasks in electron and scanning probe microscopy and chemical imaging.
- [346] arXiv:2409.12483 (cross-list from physics.flu-dyn) [pdf, html, other]
-
Title: The effective use of BLAS interface for implementation of finite-element ADER-DG and finite-volume ADER-WENO methodsComments: 33 pages, 5 figures, 13 tables. arXiv admin note: text overlap with arXiv:2409.09911Subjects: Fluid Dynamics (physics.flu-dyn); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
Numerical methods of the ADER family, in particular finite-element ADER-DG and finite-volume ADER-WENO methods, are among the most accurate numerical methods for solving quasilinear PDE systems. The internal structure of ADER-DG and ADER-WENO numerical methods contains a large number of basic linear algebra operations related to matrix multiplication. At the moment, the main interface of software libraries for high-performance computing is BLAS. This paper presents an effective method for integration the standard functions of the BLAS interface into the implementation of these numerical methods. The calculated matrices are small matrices; at the same time, the proposed implementation makes it possible to effectively use existing JIT technologies. The proposed approach immediately operates on AoS, which makes it possible to efficiently calculate flux, source and non-conservative terms without need to carry out transposition. The obtained computational costs allowed to conclude that the effective implementation, based on the use of the JIT functions of the BLAS, outperformed both the implementation based on the general BLAS functions and the naive implementations by several times. At the same time, the complexity of developing an implementation based on the approach proposed in this work does not exceed the complexity of developing a naive optimized implementation.
- [347] arXiv:2409.12491 (cross-list from math.ST) [pdf, html, other]
-
Title: Two New Families of Local Asymptotically Minimax Lower Bounds in Parameter EstimationComments: 26 pages, submitted for publicationSubjects: Statistics Theory (math.ST); Information Theory (cs.IT)
We propose two families of asymptotically local minimax lower bounds on parameter estimation performance. The first family of bounds applies to any convex, symmetric loss function that depends solely on the difference between the estimate and the true underlying parameter value (i.e., the estimation error), whereas the second is more specifically oriented to the moments of the estimation error. The proposed bounds are relatively easy to calculate numerically (in the sense that their optimization is over relatively few auxiliary parameters), yet they turn out to be tighter (sometimes significantly so) than previously reported bounds that are associated with similar calculation efforts, across a variety of application examples. In addition to their relative simplicity, they also have the following advantages: (i) Essentially no regularity conditions are required regarding the parametric family of distributions; (ii) The bounds are local (in a sense to be specified); (iii) The bounds provide the correct order of decay as functions of the number of observations, at least in all examples examined; (iv) At least the first family of bounds extends straightforwardly to vector parameters.
- [348] arXiv:2409.12516 (cross-list from q-fin.CP) [pdf, html, other]
-
Title: A Multi-agent Market Model Can Explain the Impact of AI Traders in Financial Markets -- A New Microfoundations of GARCH modelComments: Accepted PRIMA2024Subjects: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Trading and Market Microstructure (q-fin.TR)
The AI traders in financial markets have sparked significant interest in their effects on price formation mechanisms and market volatility, raising important questions for market stability and regulation. Despite this interest, a comprehensive model to quantitatively assess the specific impacts of AI traders remains undeveloped. This study aims to address this gap by modeling the influence of AI traders on market price formation and volatility within a multi-agent framework, leveraging the concept of microfoundations. Microfoundations involve understanding macroeconomic phenomena, such as market price formation, through the decision-making and interactions of individual economic agents. While widely acknowledged in macroeconomics, microfoundational approaches remain unexplored in empirical finance, particularly for models like the GARCH model, which captures key financial statistical properties such as volatility clustering and fat tails. This study proposes a multi-agent market model to derive the microfoundations of the GARCH model, incorporating three types of agents: noise traders, fundamental traders, and AI traders. By mathematically aggregating the micro-structure of these agents, we establish the microfoundations of the GARCH model. We validate this model through multi-agent simulations, confirming its ability to reproduce the stylized facts of financial markets. Finally, we analyze the impact of AI traders using parameters derived from these microfoundations, contributing to a deeper understanding of their role in market dynamics.
- [349] arXiv:2409.12520 (cross-list from eess.AS) [pdf, html, other]
-
Title: Geometry-Constrained EEG Channel Selection for Brain-Assisted Speech EnhancementSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Brain-assisted speech enhancement (BASE) aims to extract the target speaker in complex multi-talker scenarios using electroencephalogram (EEG) signals as an assistive modality, as the auditory attention of the listener can be decoded from electroneurographic signals of the brain. This facilitates a potential integration of EEG electrodes with listening devices to improve the speech intelligibility of hearing-impaired listeners, which was shown by the recently-proposed BASEN model. As in general the multichannel EEG signals are highly correlated and some are even irrelevant to listening, blindly incorporating all EEG channels would lead to a high economic and computational cost. In this work, we therefore propose a geometry-constrained EEG channel selection approach for BASE. We design a new weighted multi-dilation temporal convolutional network (WDTCN) as the backbone to replace the Conv-TasNet in BASEN. Given a raw channel set that is defined by the electrode geometry for feasible integration, we then propose a geometry-constrained convolutional regularization selection (GC-ConvRS) module for WD-TCN to find an informative EEG subset. Experimental results on a public dataset show the superiority of the proposed WD-TCN over BASEN. The GC-ConvRS can further refine the useful EEG subset subject to the geometry constraint, resulting in a better trade-off between performance and integration cost.
- [350] arXiv:2409.12533 (cross-list from eess.IV) [pdf, other]
-
Title: MambaClinix: Hierarchical Gated Convolution and Mamba-Based U-Net for Enhanced 3D Medical Image SegmentationComments: 18 pages, 5 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Deep learning, particularly convolutional neural networks (CNNs) and Transformers, has significantly advanced 3D medical image segmentation. While CNNs are highly effective at capturing local features, their limited receptive fields may hinder performance in complex clinical scenarios. In contrast, Transformers excel at modeling long-range dependencies but are computationally intensive, making them expensive to train and deploy. Recently, the Mamba architecture, based on the State Space Model (SSM), has been proposed to efficiently model long-range dependencies while maintaining linear computational complexity. However, its application in medical image segmentation reveals shortcomings, particularly in capturing critical local features essential for accurate delineation of clinical regions. In this study, we propose MambaClinix, a novel U-shaped architecture for medical image segmentation that integrates a hierarchical gated convolutional network(HGCN) with Mamba in an adaptive stage-wise framework. This design significantly enhances computational efficiency and high-order spatial interactions, enabling the model to effectively capture both proximal and distal relationships in medical images. Specifically, our HGCN is designed to mimic the attention mechanism of Transformers by a purely convolutional structure, facilitating high-order spatial interactions in feature maps while avoiding the computational complexity typically associated with Transformer-based methods. Additionally, we introduce a region-specific Tversky loss, which emphasizes specific pixel regions to improve auto-segmentation performance, thereby optimizing the model's decision-making process. Experimental results on five benchmark datasets demonstrate that the proposed MambaClinix achieves high segmentation accuracy while maintaining low model complexity.
- [351] arXiv:2409.12550 (cross-list from physics.ins-det) [pdf, html, other]
-
Title: Robust State Estimation from Partial Out-Core Measurements with Shallow Recurrent Decoder for Nuclear ReactorsSubjects: Instrumentation and Detectors (physics.ins-det); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Reliable, real-time state estimation in nuclear reactors is of critical importance for monitoring, control and safety. It further empowers the development of digital twins that are sufficiently accurate for real-world deployment. As nuclear engineering systems are typically characterised by extreme environments, their in-core sensing is a challenging task, even more so in Generation-IV reactor concepts, which feature molten salt or liquid metals as thermal carriers. The emergence of data-driven methods allows for new techniques for accurate and robust estimation of the full state space vector characterising the reactor (mainly composed by neutron fluxes and the thermal-hydraulics fields). These techniques can combine different sources of information, including computational proxy models and local noisy measurements on the system, in order to robustly estimate the state. This work leverages the Shallow Recurrent Decoder (SHRED) architecture to estimate the entire state vector of a reactor from three, out-of-core time-series neutron flux measurements alone. Specifically, the Molten Salt Fast Reactor, in the EVOL geometry (Evaluation and Viability of Liquid Fuel Fast Reactor System project), is demonstrated as a test case, with neutron flux measurements alone allowing for reconstruction of the 20 coupled field variables of the dynamics. This approach can further quantify the uncertainty associated with the state estimation due to its considerably low training cost on compressed data. The accurate reconstruction of every characteristic field in real-time makes this approach suitable for monitoring and control purposes in the framework of a reactor digital twin.
- [352] arXiv:2409.12560 (cross-list from eess.AS) [pdf, html, other]
-
Title: AudioComposer: Towards Fine-grained Audio Generation with Natural Language DescriptionsSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Current Text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style. Some studies try to improve the granularity by incorporating additional frame-level conditions or control networks. However, this usually leads to complex system design and difficulties due to the requirement for reference frame-level conditions. To address these challenges, we propose AudioComposer, a novel TTA generation framework that relies solely on natural language descriptions (NLDs) to provide both content specification and style control information. To further enhance audio generative modeling, we employ flow-based diffusion transformers with the cross-attention mechanism to incorporate text descriptions effectively into audio generation processes, which can not only simultaneously consider the content and style information in the text inputs, but also accelerate generation compared to other architectures. Furthermore, we propose a novel and comprehensive automatic data simulation pipeline to construct data with fine-grained text descriptions, which significantly alleviates the problem of data scarcity in the area. Experiments demonstrate the effectiveness of our framework using solely NLDs as inputs for content specification and style control. The generation quality and controllability surpass state-of-the-art TTA models, even with a smaller model size.
- [353] arXiv:2409.12566 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum Channel Testing in Average-Case DistanceSubjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)
We study the complexity of testing properties of quantum channels. First, we show that testing identity to any channel $\mathcal N: \mathbb C^{d_{\mathrm{in}} \times d_{\mathrm{in}}} \to \mathbb C^{d_{\mathrm{out}} \times d_{\mathrm{out}}}$ in diamond norm distance requires $\Omega(\sqrt{d_{\mathrm{in}} / \varepsilon})$ queries, even in the strongest algorithmic model that admits ancillae, coherence, and adaptivity. This is due to the worst-case nature of the distance induced by the diamond norm.
Motivated by this limitation and other theoretical and practical applications, we introduce an average-case analogue of the diamond norm, which we call the average-case imitation diamond (ACID) norm. In the weakest algorithmic model without ancillae, coherence, or adaptivity, we prove that testing identity to certain types of channels in ACID distance can be done with complexity independent of the dimensions of the channel, while for other types of channels the complexity depends on both the input and output dimensions. Building on previous work, we also show that identity to any fixed channel can be tested with $\tilde O(d_{\mathrm{in}} d_{\mathrm{out}}^{3/2} / \varepsilon^2)$ queries in ACID distance and $\tilde O(d_{\mathrm{in}}^2 d_{\mathrm{out}}^{3/2} / \varepsilon^2)$ queries in diamond distance in this model. Finally, we prove tight bounds on the complexity of channel tomography in ACID distance. - [354] arXiv:2409.12579 (cross-list from math.CO) [pdf, html, other]
-
Title: Sharp estimates for Gowers norms on discrete cubesComments: 22 pages, Mathematica notebook attachedSubjects: Combinatorics (math.CO); Information Theory (cs.IT); Classical Analysis and ODEs (math.CA)
We study optimal dimensionless inequalities $$ \|f\|_{U^k} \leq \|f\|_{\ell^{p_{k,n}}} $$ that hold for all functions $f\colon\mathbb{Z}^d\to\mathbb{C}$ supported in $\{0,1,\ldots,n-1\}^d$ and estimates $$ \|1_A\|_{U^k}^{2^k}\leq |A|^{t_{k,n}} $$ that hold for all subsets $A$ of the same discrete cubes. A general theory, analogous to the work of de Dios Pont, Greenfeld, Ivanisvili, and Madrid, is developed to show that the critical exponents are related by $p_{k,n} t_{k,n} = 2^k$. This is used to prove the three main results of the paper: an explicit formula for $t_{k,2}$, which generalizes a theorem by Kane and Tao, two-sided asymptotic estimates for $t_{k,n}$ as $n\to\infty$ for a fixed $k\geq2$, which generalize a theorem by Shao, and a precise asymptotic formula for $t_{k,n}$ as $k\to\infty$ for a fixed $n\geq2$.
- [355] arXiv:2409.12587 (cross-list from stat.ML) [pdf, html, other]
-
Title: Test-Time Augmentation Meets Variational BayesSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Data augmentation is known to contribute significantly to the robustness of machine learning models. In most instances, data augmentation is utilized during the training phase. Test-Time Augmentation (TTA) is a technique that instead leverages these data augmentations during the testing phase to achieve robust predictions. More precisely, TTA averages the predictions of multiple data augmentations of an instance to produce a final prediction. Although the effectiveness of TTA has been empirically reported, it can be expected that the predictive performance achieved will depend on the set of data augmentation methods used during testing. In particular, the data augmentation methods applied should make different contributions to performance. That is, it is anticipated that there may be differing degrees of contribution in the set of data augmentation methods used for TTA, and these could have a negative impact on prediction performance. In this study, we consider a weighted version of the TTA based on the contribution of each data augmentation. Some variants of TTA can be regarded as considering the problem of determining the appropriate weighting. We demonstrate that the determination of the coefficients of this weighted TTA can be formalized in a variational Bayesian framework. We also show that optimizing the weights to maximize the marginal log-likelihood suppresses candidates of unwanted data augmentations at the test phase.
- [356] arXiv:2409.12589 (cross-list from hep-ph) [pdf, html, other]
-
Title: Is Tokenization Needed for Masked Particle Modelling?Matthew Leigh, Samuel Klein, François Charton, Tobias Golling, Lukas Heinrich, Michael Kagan, Inês Ochoa, Margarita OsadchySubjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
In this work, we significantly enhance masked particle modeling (MPM), a self-supervised learning scheme for constructing highly expressive representations of unordered sets relevant to developing foundation models for high-energy physics. In MPM, a model is trained to recover the missing elements of a set, a learning objective that requires no labels and can be applied directly to experimental data. We achieve significant performance improvements over previous work on MPM by addressing inefficiencies in the implementation and incorporating a more powerful decoder. We compare several pre-training tasks and introduce new reconstruction methods that utilize conditional generative models without data tokenization or discretization. We show that these new methods outperform the tokenized learning objective from the original MPM on a new test bed for foundation models for jets, which includes using a wide variety of downstream tasks relevant to jet physics, such as classification, secondary vertex finding, and track identification.
- [357] arXiv:2409.12622 (cross-list from math.OC) [pdf, html, other]
-
Title: Theoretical Analysis of Heteroscedastic Gaussian Processes with Posterior DistributionsComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
This study introduces a novel theoretical framework for analyzing heteroscedastic Gaussian processes (HGPs) that identify unknown systems in a data-driven manner. Although HGPs effectively address the heteroscedasticity of noise in complex training datasets, calculating the exact posterior distributions of the HGPs is challenging, as these distributions are no longer multivariate normal. This study derives the exact means, variances, and cumulative distributions of the posterior distributions. Furthermore, the derived theoretical findings are applied to a chance-constrained tracking controller. After an HGP identifies an unknown disturbance in a plant system, the controller can handle chance constraints regarding the system despite the presence of the disturbance.
- [358] arXiv:2409.12650 (cross-list from math.OC) [pdf, html, other]
-
Title: Stochastic Prediction Equilibrium for Dynamic Traffic AssignmentComments: 34 pages, 2 figuresSubjects: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT)
Stochastic effects significantly influence the dynamics of traffic flows. Many dynamic traffic assignment (DTA) models attempt to capture these effects by prescribing a specific ratio that determines how flow splits across different routes based on the routes' costs. In this paper, we propose a new framework for DTA that incorporates the interplay between the routing decisions of each single traffic participant, the stochastic nature of predicting the future state of the network, and the physical flow dynamics. Our framework consists of an edge loading operator modeling the physical flow propagation and a routing operator modeling the routing behavior of traffic participants. The routing operator is assumed to be set-valued and capable to model complex (deterministic) equilibrium conditions as well as stochastic equilibrium conditions assuming that measurements for predicting traffic are noisy. As our main results, we derive several quite general equilibrium existence and uniqueness results which not only subsume known results from the literature but also lead to new results. Specifically, for the new stochastic prediction equilibrium, we show existence and uniqueness under natural assumptions on the probability distribution over the predictions.
- [359] arXiv:2409.12672 (cross-list from math.CO) [pdf, html, other]
-
Title: List Conflict-free ColoringComments: 17 pagesSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
Motivated by its application in the frequency assignment problem for cellular networks, conflict-free coloring was first studied by Even et al. in [Conflict-free colorings of simple geometric regions with applications to frequency assignment in cellular networks, SIAM Journal on Computing, 2004]. A \emph{conflict-free coloring} of a hypergraph $\mathcal{H}$ is an assignment of colors to the vertex set of $\mathcal{H}$ such that every hyperedge in $\mathcal{H}$ has a vertex whose color is distinct from every other vertex in that hyperedge. The minimum number of colors required for such a coloring is known as the \emph{conflict-free chromatic number} of $\mathcal{H}$. Conflict-free coloring has also been studied on open/closed neighborhood hypergraphs of a given graph.
In this paper, we study the list variant of conflict-free coloring where, for every vertex $v$, we are given a list of admissible colors $L_v$ such that $v$ is allowed to be colored only from $L_v$. We prove upper bounds for the list conflict-free chromatic number of general hypergraphs and graphs. - [360] arXiv:2409.12674 (cross-list from physics.ed-ph) [pdf, other]
-
Title: Exploring Engagement and Perceived Learning Outcomes in an Immersive Flipped Learning ContextComments: 14 pagesJournal-ref: The International Journal in Information Technology in Governance, Education, and Business 2024Subjects: Physics Education (physics.ed-ph); Computers and Society (cs.CY)
The flipped classroom model has been widely acknowledged as a practical pedagogical approach to enhancing student engagement and learning. However, it faces challenges such as improving student interaction with learning content and peers, particularly in Japanese universities where digital technologies are not always fully utilized. To address these challenges and identify potential solutions, a case study was conducted in which an online flipped course on academic skills was developed and implemented in an immersive virtual environment. The primary objective during this initial phase was not to establish a causal relationship between the use of immersive flipped learning and students' engagement and perceived learning outcomes. Instead, this initiative aimed to explore the benefits and challenges of the immersive flipped learning approach in relation to students' online engagement and their perceived learning outcomes. Following a mixed-methods research approach, quantitative and qualitative data were collected through a survey (N=50) and students' reflective reports (N=80). The study revealed high levels of student engagement and perceived learning outcomes, although it also identified areas needing improvement, particularly in supporting student interactions in the target language. Despite the exploratory nature of this study, the findings suggest that a well-designed flipped learning approach, set in an engaging immersive environment, can significantly enhance student engagement, thereby supporting the learning process. When creating an immersive flipped learning course, educators should incorporate best practices from the literature on both flipped learning and immersive learning design to ensure optimal learning outcomes. The findings of this study can serve as a valuable resource for educators seeking to design engaging and effective remote learning experiences.
- [361] arXiv:2409.12678 (cross-list from eess.IV) [pdf, html, other]
-
Title: PMR-Net: Parallel Multi-Resolution Encoder-Decoder Network Framework for Medical Image SegmentationSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In recent years, encoder-decoder networks have focused on expanding receptive fields and incorporating multi-scale context to capture global features for objects of varying sizes. However, as networks deepen, they often discard fine spatial details, impairing precise object localization. Additionally, conventional decoders' use of interpolation for upsampling leads to a loss of global context, diminishing edge segmentation accuracy. To address the above problems, we propose a novel parallel multi-resolution encoder-decoder network, namely PMR-Net for short. First, we design a parallel multi-resolution encoder and a multi-resolution context encoder. The parallel multi-resolution encoder can extract and fuse multi-scale fine-grained local features in parallel for input images with different resolutions. The multi-resolution context encoder fuses the global context semantic features of different receptive fields from different encoder branches to maintain effectively the integrity of global information. Secondly, we design a parallel multi-resolution decoder symmetrical to the structure of parallel multi-resolution encoder. The decoder can continuously supplement the global context features of low-resolution branches to the feature maps of high-resolution branches, and effectively solve the problem of global context feature loss caused by upsampling operation in the decoding process. Extensive experiment results demonstrate that our proposed PMR-Net can achieve more accurate segmentation results than state-of-the-art methods on five public available datasets. Moreover, PMR-Net is also a flexible network framework, which can meet the requirements of different scenarios by adjusting the number of network layers and the number of parallel encoder-decoder branches.
- [362] arXiv:2409.12707 (cross-list from physics.flu-dyn) [pdf, other]
-
Title: Machine-learning-based multipoint optimization of fluidic injection parameters for improving nozzle performanceSubjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Fluidic injection provides a promising solution to improve the performance of overexpanded single expansion ramp nozzle (SERN) during vehicle acceleration. However, determining the injection parameters for the best overall performance under multiple nozzle operating conditions is still a challenge. The gradient-based optimization method requires gradients of injection parameters at each design point, leading to high computational costs if traditional computational fluid dynamic (CFD) simulations are adopted. This paper uses a pretrained neural network model to replace CFD during optimization to quickly calculate the nozzle flow field at multiple design points. Considering the physical characteristics of the nozzle flow field, a prior-based prediction strategy is adopted to enhance the model's transferability. In addition, the back-propagation algorithm of the neural network is adopted to quickly evaluate the gradients by calling the computation process only once, thereby greatly reducing the gradient computation time compared to the finite differential method. As a test case, the average nozzle thrust coefficient of a SERN at seven design points is optimized. An improvement in the thrust coefficient of 1.14% is achieved, and the time cost is greatly reduced compared with the traditional optimization methods, even when the time to establish the database for training is considered.
- [363] arXiv:2409.12711 (cross-list from physics.flu-dyn) [pdf, other]
-
Title: Rapid aerodynamic prediction of swept wings via physics-embedded transfer learningSubjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Machine learning-based models provide a promising way to rapidly acquire transonic swept wing flow fields but suffer from large computational costs in establishing training datasets. Here, we propose a physics-embedded transfer learning framework to efficiently train the model by leveraging the idea that a three-dimensional flow field around wings can be analyzed with two-dimensional flow fields around cross-sectional airfoils. An airfoil aerodynamics prediction model is pretrained with airfoil samples. Then, an airfoil-to-wing transfer model is fine-tuned with a few wing samples to predict three-dimensional flow fields based on two-dimensional results on each spanwise cross section. Sweep theory is embedded when determining the corresponding airfoil geometry and operating conditions, and to obtain the sectional airfoil lift coefficient, which is one of the operating conditions, the low-fidelity vortex lattice method and data-driven methods are proposed and evaluated. Compared to a nontransfer model, introducing the pretrained model reduces the error by 30%, while introducing sweep theory further reduces the error by 9%. When reducing the dataset size, less than half of the wing training samples are need to reach the same error level as the nontransfer framework, which makes establishing the model much easier.
- [364] arXiv:2409.12717 (cross-list from eess.AS) [pdf, html, other]
-
Title: NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector QuantizationSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Built upon vector quantization (VQ), discrete audio codec models have achieved great success in audio compression and auto-regressive audio generation. However, existing models face substantial challenges in perceptual quality and signal distortion, especially when operating in extremely low bandwidth, rooted in the sensitivity of the VQ codebook to noise. This degradation poses significant challenges for several downstream tasks, such as codec-based speech synthesis. To address this issue, we propose a novel VQ method, Normal Distribution-based Vector Quantization (NDVQ), by introducing an explicit margin between the VQ codes via learning a variance. Specifically, our approach involves mapping the waveform to a latent space and quantizing it by selecting the most likely normal distribution, with each codebook entry representing a unique normal distribution defined by its mean and variance. Using these distribution-based VQ codec codes, a decoder reconstructs the input waveform. NDVQ is trained with additional distribution-related losses, alongside reconstruction and discrimination losses. Experiments demonstrate that NDVQ outperforms existing audio compression baselines, such as EnCodec, in terms of audio quality and zero-shot TTS, particularly in very low bandwidth scenarios.
- [365] arXiv:2409.12719 (cross-list from eess.IV) [pdf, html, other]
-
Title: Multi-Scale Feature Prediction with Auxiliary-Info for Neural Image CompressionSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Recently, significant improvements in rate-distortion performance of image compression have been achieved with deep-learning techniques. A key factor in this success is the use of additional bits to predict an approximation of the latent vector, which is the output of the encoder, through another neural network. Then, only the difference between the prediction and the latent vector is coded into the bitstream, along with its estimated probability distribution. We introduce a new predictive structure consisting of the auxiliary coarse network and the main network, inspired by neural video compression. The auxiliary coarse network encodes the auxiliary information and predicts the approximation of the original image as multi-scale features. The main network encodes the residual between the predicted feature from the auxiliary coarse network and the feature of the original image. To further leverage our new structure, we propose Auxiliary info-guided Feature Prediction (AFP) module that uses global correlation to predict more accurate predicted features. Moreover, we present Context Junction module that refines the auxiliary feature from AFP module and produces the residuals between the refined features and the original image features. Finally, we introduce Auxiliary info-guided Parameter Estimation (APE) module, which predicts the approximation of the latent vector and estimates the probability distribution of these residuals. We demonstrate the effectiveness of the proposed modules by various ablation studies. Under extensive experiments, our model outperforms other neural image compression models and achieves a 19.49\% higher rate-distortion performance than VVC on Tecnick dataset.
- [366] arXiv:2409.12728 (cross-list from q-bio.GN) [pdf, html, other]
-
Title: PRAGA: Prototype-aware Graph Adaptive Aggregation for Spatial Multi-modal Omics AnalysisSubjects: Genomics (q-bio.GN); Machine Learning (cs.LG)
Spatial multi-modal omics technology, highlighted by Nature Methods as an advanced biological technique in 2023, plays a critical role in resolving biological regulatory processes with spatial context. Recently, graph neural networks based on K-nearest neighbor (KNN) graphs have gained prominence in spatial multi-modal omics methods due to their ability to model semantic relations between sequencing spots. However, the fixed KNN graph fails to capture the latent semantic relations hidden by the inevitable data perturbations during the biological sequencing process, resulting in the loss of semantic information. In addition, the common lack of spot annotation and class number priors in practice further hinders the optimization of spatial multi-modal omics models. Here, we propose a novel spatial multi-modal omics resolved framework, termed PRototype-Aware Graph Adaptative Aggregation for Spatial Multi-modal Omics Analysis (PRAGA). PRAGA constructs a dynamic graph to capture latent semantic relations and comprehensively integrate spatial information and feature semantics. The learnable graph structure can also denoise perturbations by learning cross-modal knowledge. Moreover, a dynamic prototype contrastive learning is proposed based on the dynamic adaptability of Bayesian Gaussian Mixture Models to optimize the multi-modal omics representations for unknown biological priors. Quantitative and qualitative experiments on simulated and real datasets with 7 competing methods demonstrate the superior performance of PRAGA.
- [367] arXiv:2409.12777 (cross-list from eess.IV) [pdf, html, other]
-
Title: TEAM PILOT -- Learned Feasible Extendable Set of Dynamic MRI Acquisition TrajectoriesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Dynamic Magnetic Resonance Imaging (MRI) is a crucial non-invasive method used to capture the movement of internal organs and tissues, making it a key tool for medical diagnosis. However, dynamic MRI faces a major challenge: long acquisition times needed to achieve high spatial and temporal resolution. This leads to higher costs, patient discomfort, motion artifacts, and lower image quality. Compressed Sensing (CS) addresses this problem by acquiring a reduced amount of MR data in the Fourier domain, based on a chosen sampling pattern, and reconstructing the full image from this partial data. While various deep learning methods have been developed to optimize these sampling patterns and improve reconstruction, they often struggle with slow optimization and inference times or are limited to specific temporal dimensions used during training. In this work, we introduce a novel deep-compressed sensing approach that uses 3D window attention and flexible, temporally extendable acquisition trajectories. Our method significantly reduces both training and inference times compared to existing approaches, while also adapting to different temporal dimensions during inference without requiring additional training. Tests with real data show that our approach outperforms current state-of-theart techniques. The code for reproducing all experiments will be made available upon acceptance of the paper.
- [368] arXiv:2409.12792 (cross-list from eess.IV) [pdf, html, other]
-
Title: Multi-Source and Multi-Sequence Myocardial Pathology Segmentation Using a Cascading Refinement CNNSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Myocardial infarction (MI) is one of the most prevalent cardiovascular diseases and consequently, a major cause for mortality and morbidity worldwide. Accurate assessment of myocardial tissue viability for post-MI patients is critical for diagnosis and treatment planning, e.g. allowing surgical revascularization, or to determine the risk of adverse cardiovascular events in the future. Fine-grained analysis of the myocardium and its surrounding anatomical structures can be performed by combining the information obtained from complementary medical imaging techniques. In this work, we use late gadolinium enhanced (LGE) magnetic resonance (MR), T2-weighted (T2) MR and balanced steady-state free precession (bSSFP) cine MR in order to semantically segment the left and right ventricle, healthy and scarred myocardial tissue, as well as edema. To this end, we propose the Multi-Sequence Cascading Refinement CNN (MS-CaRe-CNN), a 2-stage CNN cascade that receives multi-sequence data and generates predictions of the anatomical structures of interest without considering tissue viability at Stage 1. The prediction of Stage 1 is then further refined in Stage 2, where the model additionally distinguishes myocardial tissue based on viability, i.e. healthy, scarred and edema regions. Our proposed method is set up as a 5-fold ensemble and semantically segments scar tissue achieving 62.31% DSC and 82.65% precision, as well as 63.78% DSC and 87.69% precision for the combined scar and edema region. These promising results for such small and challenging structures confirm that MS-CaRe-CNN is well-suited to generate semantic segmentations to assess the viability of myocardial tissue, enabling downstream tasks like personalized therapy planning.
- [369] arXiv:2409.12799 (cross-list from stat.ML) [pdf, html, other]
-
Title: The Central Role of the Loss Function in Reinforcement LearningSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
This paper illustrates the central role of loss functions in data-driven decision making, providing a comprehensive survey on their influence in cost-sensitive classification (CSC) and reinforcement learning (RL). We demonstrate how different regression loss functions affect the sample efficiency and adaptivity of value-based decision making algorithms. Across multiple settings, we prove that algorithms using the binary cross-entropy loss achieve first-order bounds scaling with the optimal policy's cost and are much more efficient than the commonly used squared loss. Moreover, we prove that distributional algorithms using the maximum likelihood loss achieve second-order bounds scaling with the policy variance and are even sharper than first-order bounds. This in particular proves the benefits of distributional RL. We hope that this paper serves as a guide analyzing decision making algorithms with varying loss functions, and can inspire the reader to seek out better loss functions to improve any decision making algorithm.
- [370] arXiv:2409.12805 (cross-list from stat.ML) [pdf, html, other]
-
Title: Robust estimation of the intrinsic dimension of data sets with quantum cognition machine learningLuca Candelori, Alexander G. Abanov, Jeffrey Berger, Cameron J. Hogan, Vahagn Kirakosyan, Kharen Musaelian, Ryan Samson, James E. T. Smith, Dario Villani, Martin T. Wells, Mengjia XuSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantum Physics (quant-ph)
We propose a new data representation method based on Quantum Cognition Machine Learning and apply it to manifold learning, specifically to the estimation of intrinsic dimension of data sets. The idea is to learn a representation of each data point as a quantum state, encoding both local properties of the point as well as its relation with the entire data. Inspired by ideas from quantum geometry, we then construct from the quantum states a point cloud equipped with a quantum metric. The metric exhibits a spectral gap whose location corresponds to the intrinsic dimension of the data. The proposed estimator is based on the detection of this spectral gap. When tested on synthetic manifold benchmarks, our estimates are shown to be robust with respect to the introduction of point-wise Gaussian noise. This is in contrast to current state-of-the-art estimators, which tend to attribute artificial ``shadow dimensions'' to noise artifacts, leading to overestimates. This is a significant advantage when dealing with real data sets, which are inevitably affected by unknown levels of noise. We show the applicability and robustness of our method on real data, by testing it on the ISOMAP face database, MNIST, and the Wisconsin Breast Cancer Dataset.
- [371] arXiv:2409.12815 (cross-list from physics.ao-ph) [pdf, html, other]
-
Title: Graph Convolutional Neural Networks as Surrogate Models for Climate SimulationComments: 10 pages, 8 figuresSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Many climate processes are characterized using large systems of nonlinear differential equations; this, along with the immense amount of data required to parameterize complex interactions, means that Earth-System Model (ESM) simulations may take weeks to run on large clusters. Uncertainty quantification may require thousands of runs, making ESM simulations impractical for preliminary assessment. Alternatives may include simplifying the processes in the model, but recent efforts have focused on using machine learning to complement these models or even act as full surrogates. \textit{We leverage machine learning, specifically fully-connected neural networks (FCNNs) and graph convolutional neural networks (GCNNs), to enable rapid simulation and uncertainty quantification in order to inform more extensive ESM simulations.} Our surrogate simulated 80 years in approximately 310 seconds on a single A100 GPU, compared to weeks for the ESM model while having mean temperature errors below $0.1^{\circ}C$ and maximum errors below $2^{\circ}C$.
- [372] arXiv:2409.12820 (cross-list from quant-ph) [pdf, html, other]
-
Title: Machine-learning based high-bandwidth magnetic sensingComments: 12 pages including supplementary, 6 figuresSubjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
Recent years have seen significant growth of quantum technologies, and specifically quantum sensing, both in terms of the capabilities of advanced platforms and their applications. One of the leading platforms in this context is nitrogen-vacancy (NV) color centers in diamond, providing versatile, high-sensitivity, and high-resolution magnetic sensing. Nevertheless, current schemes for spin resonance magnetic sensing (as applied by NV quantum sensing) suffer from tradeoffs associated with sensitivity, dynamic range, and bandwidth. Here we address this issue, and implement machine learning tools to enhance NV magnetic sensing in terms of the sensitivity/bandwidth tradeoff in large dynamic range scenarios. We experimentally demonstrate this new approach, reaching an improvement in the relevant figure of merit by a factor of up to 5. Our results promote quantum machine learning protocols for sensing applications towards more feasible and efficient quantum technologies.
- [373] arXiv:2409.12844 (cross-list from math.AP) [pdf, html, other]
-
Title: Iterative algorithms for the reconstruction of early states of prostate cancer growthComments: 37 pagesSubjects: Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Tissues and Organs (q-bio.TO)
The development of mathematical models of cancer informed by time-resolved measurements has enabled personalised predictions of tumour growth and treatment response. However, frequent cancer monitoring is rare, and many tumours are treated soon after diagnosis with limited data. To improve the predictive capabilities of cancer models, we investigate the problem of recovering earlier tumour states from a single spatial measurement at a later time. Focusing on prostate cancer, we describe tumour dynamics using a phase-field model coupled with two reaction-diffusion equations for a nutrient and the local prostate-specific antigen. We generate synthetic data using a discretisation based on Isogeometric Analysis. Then, building on our previous analytical work (Beretta et al., SIAP (2024)), we propose an iterative reconstruction algorithm based on the Landweber scheme, showing local convergence with quantitative rates and exploring an adaptive step size that leads to faster reconstruction algorithms. Finally, we run simulations demonstrating high-quality reconstructions even with long time horizons and noisy data.
- [374] arXiv:2409.12854 (cross-list from eess.IV) [pdf, html, other]
-
Title: Deep Learning-Based Detection of Referable Diabetic Retinopathy and Macular Edema Using Ultra-Widefield Fundus ImagingSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Diabetic retinopathy and diabetic macular edema are significant complications of diabetes that can lead to vision loss. Early detection through ultra-widefield fundus imaging enhances patient outcomes but presents challenges in image quality and analysis scale. This paper introduces deep learning solutions for automated UWF image analysis within the framework of the MICCAI 2024 UWF4DR challenge. We detail methods and results across three tasks: image quality assessment, detection of referable DR, and identification of DME. Employing advanced convolutional neural network architectures such as EfficientNet and ResNet, along with preprocessing and augmentation strategies, our models demonstrate robust performance in these tasks. Results indicate that deep learning can significantly aid in the automated analysis of UWF images, potentially improving the efficiency and accuracy of DR and DME detection in clinical settings.
- [375] arXiv:2409.12891 (cross-list from quant-ph) [pdf, html, other]
-
Title: SPARQ: Efficient Entanglement Distribution and Routing in Space-Air-Ground Quantum NetworksSubjects: Quantum Physics (quant-ph); Networking and Internet Architecture (cs.NI)
In this paper, a space-air-ground quantum (SPARQ) network is developed as a means for providing a seamless on-demand entanglement distribution. The node mobility in SPARQ poses significant challenges to entanglement routing. Existing quantum routing algorithms focus on stationary ground nodes and utilize link distance as an optimality metric, which is unrealistic for dynamic systems like SPARQ. Moreover, in contrast to the prior art that assumes homogeneous nodes, SPARQ encompasses heterogeneous nodes with different functionalities further complicates the entanglement distribution. To solve the entanglement routing problem, a deep reinforcement learning (RL) framework is proposed and trained using deep Q-network (DQN) on multiple graphs of SPARQ to account for the network dynamics. Subsequently, an entanglement distribution policy, third-party entanglement distribution (TPED), is proposed to establish entanglement between communication parties. A realistic quantum network simulator is designed for performance evaluation. Simulation results show that the TPED policy improves entanglement fidelity by 3% and reduces memory consumption by 50% compared with benchmark. The results also show that the proposed DQN algorithm improves the number of resolved teleportation requests by 39% compared with shortest path baseline and the entanglement fidelity by 2% compared with an RL algorithm that is based on long short-term memory (LSTM). It also improved entanglement fidelity by 6% and 9% compared with two state-of-the-art benchmarks. Moreover, the entanglement fidelity is improved by 15% compared with DQN trained on a snapshot of SPARQ. Additionally, SPARQ enhances the average entanglement fidelity by 23.5% compared with existing networks spanning only space and ground layers.
- [376] arXiv:2409.12916 (cross-list from eess.SP) [pdf, html, other]
-
Title: Online Proximal ADMM for Graph Learning from Streaming Smooth SignalsComments: 5 pages, 2 figures, submitted to ICASSP 2025Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Graph signal processing deals with algorithms and signal representations that leverage graph structures for multivariate data analysis. Often said graph topology is not readily available and may be time-varying, hence (dynamic) graph structure learning from nodal (e.g., sensor) observations becomes a critical first step. In this paper, we develop a novel algorithm for online graph learning using observation streams, assumed to be smooth on the latent graph. Unlike batch algorithms for topology identification from smooth signals, our modus operandi is to process graph signals sequentially and thus keep memory and computational costs in check. To solve the resulting smoothness-regularized, time-varying inverse problem, we develop online and lightweight iterations built upon the proximal variant of the alternating direction method of multipliers (ADMM), well known for its fast convergence in batch settings. The proximal term in the topology updates seamlessly implements a temporal-variation regularization, and we argue the online procedure exhibits sublinear static regret under some simplifying assumptions. Reproducible experiments with synthetic and real graphs demonstrate the effectiveness of our method in adapting to streaming signals and tracking slowly-varying network connectivity. The proposed approach also exhibits better tracking performance (in terms of suboptimality), when compared to state-of-the-art online graph learning baselines.
- [377] arXiv:2409.12924 (cross-list from eess.SP) [pdf, html, other]
-
Title: WaveletGPT: Wavelets Meet Large Language ModelsComments: 16 pages, 4 figuresSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. They are trained on a simple objective: to predict the next token given the previous context. We live in a world where most of the data around us, e.g., text, audio, and music, has a multi-scale structure associated with it. This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of the structure. Without adding \textbf{any extra parameters} to a GPT-style LLM architecture, we achieve the same pre-training performance almost twice as fast in text, raw audio, and symbolic music. This is achieved by imposing a structure on intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, which is comparable to pre-training a larger neural architecture. Our architecture allows every next token prediction access to intermediate embeddings at different temporal resolutions in every Transformer decoder block. This work will hopefully pave the way for incorporating multi-rate signal processing ideas into traditional LLM pre-training. Further, we showcase pushing model performance by improving internal structure instead of just going after scale.
- [378] arXiv:2409.12930 (cross-list from eess.SP) [pdf, html, other]
-
Title: NL-COMM: Demonstrating Gains of Non-Linear Processing in Open-RAN EcosystemComments: Accepted for IEEE CAMADSubjects: Signal Processing (eess.SP); Networking and Internet Architecture (cs.NI)
Multi-user multiple-input, multiple-output (MU-MIMO) designs can substantially increase wireless systems' achievable throughput and connectivity capabilities. However, existing MU-MIMO deployments typically utilize linear processing techniques that, despite their practical benefits, such as low computational complexity and easy integrability, can leave much of the available throughput and connectivity gains unexploited. They typically require many power-intensive antennas and RF chains to support a smaller number of MIMO streams, even when the transmitted information streams are of low rate. Alternatively, non-linear (NL) processing methods can maximize the capabilities of the MIMO channel. Despite their potential, traditional NL methods are challenged by high computational complexity and processing latency, making them impractical for real-time applications, especially in software-based systems envisioned for emerging Open Radio Access Networks (Open-RAN). Additionally, essential functionalities such as rate adaptation (RA) are currently unavailable for NL systems, limiting their practicality in real-world deployments. In this demo, we present the latest capabilities of our advanced NL processing framework (NL-COMM) in real-time and over-the-air, comparing them side-by-side with conventional linear processing. For the first time, NL-COMM not only meets the practical 5G-NR real-time latency requirements in pure software but also does so within a standard-compliant ecosystem. To achieve this, we significantly extended the NL-COMM algorithmic framework to support the first practical RA for NL processing. The demonstrated gains include enhanced connectivity by supporting four MIMO streams with a single base-station antenna, substantially increased throughput, and the ability to halve the number of base-station antennas without any performance loss to linear approaches.
Cross submissions for Friday, 20 September 2024 (showing 61 of 61 entries )
- [379] arXiv:1808.02451 (replaced) [pdf, html, other]
-
Title: Evolution of Preferences in Multiple PopulationsJournal-ref: International Journal of Game Theory 53 (2024) 211-259Subjects: Computer Science and Game Theory (cs.GT)
We study the evolution of preferences in multi-population settings that allow matches across distinct populations. Each individual has subjective preferences over potential outcomes, and chooses a best response based on his preferences and the information about the opponents' preferences. Individuals' realized fitnesses are given by material payoff functions. Following Dekel et al. (2007), we assume that individuals observe their opponents' preferences with probability $p$. We first derive necessary and sufficient conditions for stability for $p=1$ and $p=0$, and then check the robustness of our results against small perturbations on observability for the case of pure-strategy outcomes.
- [380] arXiv:2006.16925 (replaced) [pdf, html, other]
-
Title: Ethical Analysis on the Application of Neurotechnology for Human Augmentation in Physicians and SurgeonsComments: This is a preprint of the following chapter: Ethical Analysis on the Application of Neurotechnology for Human Augmentation in Physicians and Surgeons, published in Proceedings of the Future Technologies Conference (FTC) 2020, Volume 3, edited by Kohei Arai, Supriya Kapoor and Rahul Bhatia. The final authenticated version is available online at: this https URLSubjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Medical Physics (physics.med-ph)
With the shortage of physicians and surgeons and increase in demand worldwide due to situations such as the COVID-19 pandemic, there is a growing interest in finding solutions to help address the problem. A solution to this problem would be to use neurotechnology to provide them augmented cognition, senses and action for optimal diagnosis and treatment. Consequently, doing so can negatively impact them and others. We argue that applying neurotechnology for human enhancement in physicians and surgeons can cause injustices, and harm to them and patients. In this paper, we will first describe the augmentations and neurotechnologies that can be used to achieve the relevant augmentations for physicians and surgeons. We will then review selected ethical concerns discussed within literature, discuss the neuroengineering behind using neurotechnology for augmentation purposes, then conclude with an analysis on outcomes and ethical issues of implementing human augmentation via neurotechnology in medical and surgical practice.
- [381] arXiv:2106.10866 (replaced) [pdf, html, other]
-
Title: Customizing Graph Neural Networks using Path ReweightingJianpeng Chen, Yujing Wang, Ming Zeng, Zongyi Xiang, Bitan Hou, Yunhai Tong, Ole J. Mengshoel, Yazhou RenComments: 16 pages. Accepted by Information SciencesJournal-ref: Information Sciences 674 (2024): 120681Subjects: Machine Learning (cs.LG)
Graph Neural Networks (GNNs) have been extensively used for mining graph-structured data with impressive performance. However, because these traditional GNNs do not distinguish among various downstream tasks, embeddings embedded by them are not always effective. Intuitively, paths in a graph imply different semantics for different downstream tasks. Inspired by this, we design a novel GNN solution, namely Customized Graph Neural Network with Path Reweighting (CustomGNN for short). Specifically, the proposed CustomGNN can automatically learn the high-level semantics for specific downstream tasks to highlight semantically relevant paths as well to filter out task-irrelevant noises in a graph. Furthermore, we empirically analyze the semantics learned by CustomGNN and demonstrate its ability to avoid the three inherent problems in traditional GNNs, i.e., over-smoothing, poor robustness, and overfitting. In experiments with the node classification task, CustomGNN achieves state-of-the-art accuracies on three standard graph datasets and four large graph datasets. The source code of the proposed CustomGNN is available at \url{this https URL}.
- [382] arXiv:2106.13672 (replaced) [pdf, other]
-
Title: Exploring individual differences through network topologySubjects: Social and Information Networks (cs.SI)
Social animals, including humans, have a broad range of personality traits, which can be used to predict individual behavioral responses and decisions. Current methods to quantify individual personality traits in humans rely on self-report questionnaires, which require time and effort to collect, and rely on active cooperation. However, personality differences naturally manifest in social interactions such as online social networks. Here, we explored this option and found that the topology of an online social network can be used to characterize the personality traits of its members. We analyzed the directed social graph formed by the users of the LiveJournal (LJ) blogging platform. Individual user personality traits, inferred from their self-reported domains of interest (DOIs), were associated with their network measures. Empirical clustering of DOIs by topological similarity exposed two main self-emergent DOI groups that were in alignment with the personality meta-traits plasticity and stability. Closeness, a global topological measure of network centrality, was higher for bloggers associated with plasticity (vs. stability). A local network motif (a triad of 3 connected bloggers) also separated the personality meta-traits. Finally, topology-based classification of DOIs (without analyzing the blog content) attained > 70% accuracy (average AUC of the test-set). These results indicate that personality traits can be detected in social network topology. This has serious implications for user privacy. But, if used responsibly, network identification of personality traits could aid in early identification of health-related risks, at the population level.
- [383] arXiv:2107.08792 (replaced) [pdf, other]
-
Title: Enhancing Stability in Training Conditional Generative Adversarial Networks via Selective Data MatchingComments: 13 pages, IEEE Access (2024)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Conditional generative adversarial networks (cGANs) have demonstrated remarkable success due to their class-wise controllability and superior quality for complex generation tasks. Typical cGANs solve the joint distribution matching problem by decomposing two easier sub-problems: marginal matching and conditional matching. In this paper, we proposes a simple but effective training methodology, selective focusing learning, which enforces the discriminator and generator to learn easy samples of each class rapidly while maintaining diversity. Our key idea is to selectively apply conditional and joint matching for the data in each mini-batch.Specifically, we first select the samples with the highest scores when sorted using the conditional term of the discriminator outputs (real and generated samples). Then we optimize the model using the selected samples with only conditional matching and the other samples with joint matching. From our toy experiments, we found that it is the best to apply only conditional matching to certain samples due to the content-aware optimization of the discriminator. We conducted experiments on ImageNet (64x64 and 128x128), CIFAR-10, CIFAR-100 datasets, and Mixture of Gaussian, noisy label settings to demonstrate that the proposed method can substantially (up to 35.18% in terms of FID) improve all indicators with 10 independent trials. Code is available at this https URL.
- [384] arXiv:2108.13338 (replaced) [pdf, html, other]
-
Title: Beyond Value Iteration for Parity Games: Strategy Iteration with Universal TreesComments: 35 pages, 9 figuresSubjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Logic in Computer Science (cs.LO)
Parity games have witnessed several new quasi-polynomial algorithms since the breakthrough result of Calude et al. (STOC 2017). The combinatorial object underlying these approaches is a universal tree, as identified by Czerwiński et al. (SODA 2019). By proving a quasi-polynomial lower bound on the size of a universal tree, they have highlighted a barrier that must be overcome by all existing approaches to attain polynomial running time. This is due to the existence of worst case instances which force these algorithms to explore a large portion of the tree.
As an attempt to overcome this barrier, we propose a strategy iteration framework which can be applied on any universal tree. It is at least as fast as its value iteration counterparts, while allowing one to take bigger leaps in the universal tree. Our main technical contribution is an efficient method for computing the least fixed point of 1-player games. This is achieved via a careful adaptation of shortest path algorithms to the setting of ordered trees. By plugging in the universal tree of Jurdziński and Lazić (LICS 2017), or the Strahler universal tree of Daviaud et al. (ICALP 2020), we obtain instantiations of the general framework that take time $O(mn^2\log n\log d)$ and $O(mn^2\log^3 n \log d)$ respectively per iteration. - [385] arXiv:2109.06713 (replaced) [pdf, other]
-
Title: Machine-Learned Prediction Equilibrium for Dynamic Traffic AssignmentComments: 42 pages, 5 figuresSubjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
We study a dynamic traffic assignment model, where agents base their instantaneous routing decisions on real-time delay predictions. We formulate a mathematically concise model and define dynamic prediction equilibrium (DPE) in which no agent can at any point during their journey improve their predicted travel time by switching to a different route. We demonstrate the versatility of our framework by showing that it subsumes the well-known full information and instantaneous information models, in addition to admitting further realistic predictors as special cases. We then proceed to derive properties of the predictors that ensure a dynamic prediction equilibrium exists. Additionally, we define $\varepsilon$-approximate DPE wherein no agent can improve their predicted travel time by more than $\varepsilon$ and provide further conditions of the predictors under which such an approximate equilibrium can be computed. Finally, we complement our theoretical analysis by an experimental study, in which we systematically compare the induced average travel times of different predictors, including two machine-learning based models trained on data gained from previously computed approximate equilibrium flows, both on synthetic and real world road networks.
- [386] arXiv:2208.01430 (replaced) [pdf, html, other]
-
Title: A Model for Multi-Agent Heterogeneous Interaction ProblemsComments: 8 pages, 7 figuresJournal-ref: Proc. of the American Control Conference (ACC), 2024Subjects: Multiagent Systems (cs.MA)
We introduce a model for multi-agent interaction problems to understand how a heterogeneous team of agents should organize its resources to tackle a heterogeneous team of attackers. This model is inspired by how the human immune system tackles a diverse set of pathogens. The key property of this model is a ``cross-reactivity'' kernel which enables a particular defender type to respond strongly to some attacker types but weakly to a few different types of attackers. We show how due to such cross-reactivity, the defender team can optimally counteract a heterogeneous attacker team using very few types of defender agents, and thereby minimize its resources. We study this model in different settings to characterize a set of guiding principles for control problems with heterogeneous teams of agents, e.g., sensitivity of the harm to sub-optimal defender distributions, and competition between defenders gives near-optimal behavior using decentralized computation of the control. We also compare this model with existing approaches including reinforcement-learned policies, perimeter defense, and coverage control.
- [387] arXiv:2209.02275 (replaced) [pdf, html, other]
-
Title: Multi-class Classifier based Failure Prediction with Artificial and Anonymous Training for Data PrivacySubjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
This paper proposes a novel non-intrusive system failure prediction technique using available information from developers and minimal information from raw logs (rather than mining entire logs) but keeping the data entirely private with the data owners. A neural network based multi-class classifier is developed for failure prediction, using artificially generated anonymous data set, applying a combination of techniques, viz., genetic algorithm (steps), pattern repetition, etc., to train and test the network. The proposed mechanism completely decouples the data set used for training process from the actual data which is kept private. Moreover, multi-criteria decision making (MCDM) schemes are used to prioritize failures meeting business requirements. Results show high accuracy in failure prediction under different parameter configurations. On a broader context, any classification problem, beyond failure prediction, can be performed using the proposed mechanism with artificially generated data set without looking into the actual data as long as the input features can be translated to binary values (e.g. output from private binary classifiers) and can provide classification-as-a-service.
- [388] arXiv:2210.05918 (replaced) [pdf, html, other]
-
Title: Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisationJournal-ref: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, 2023Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Machine Learning (stat.ML)
We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.
- [389] arXiv:2211.04165 (replaced) [pdf, html, other]
-
Title: Dynamic loss balancing and sequential enhancement for road-safety assessment and traffic scene classificationComments: Published in IEEE Transactions on Intelligent Transportation SystemsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Road-safety inspection is an indispensable instrument for reducing road-accident fatalities contributed to road infrastructure. Recent work formalizes road-safety assessment in terms of carefully selected risk factors that are also known as road-safety attributes. In current practice, these attributes are manually annotated in geo-referenced monocular video for each road segment. We propose to reduce dependency on tedious human labor by automating recognition with a two-stage neural architecture. The first stage predicts more than forty road-safety attributes by observing a local spatio-temporal context. Our design leverages an efficient convolutional pipeline, which benefits from pre-training on semantic segmentation of street scenes. The second stage enhances predictions through sequential integration across a larger temporal window. Our design leverages per-attribute instances of a lightweight bidirectional LSTM architecture. Both stages alleviate extreme class imbalance by incorporating a multi-task variant of recall-based dynamic loss weighting. We perform experiments on the iRAP-BH dataset, which involves fully labeled geo-referenced video along 2,300 km of public roads in Bosnia and Herzegovina. We also validate our approach by comparing it with the related work on two road-scene classification datasets from the literature: Honda Scenes and FM3m. Experimental evaluation confirms the value of our contributions on all three datasets.
- [390] arXiv:2212.02494 (replaced) [pdf, html, other]
-
Title: Equivalence of eval-readback and eval-apply big-step evaluators by structuring the lambda-calculus's strategy spacePablo Nogueira (1), Álvaro García-Pérez (2) ((1) UDIT University of Design and Technology, Madrid, Spain, (2) Université Paris-Saclay, CEA List, Palaiseau, France)Comments: 76 pages, 15 figuresSubjects: Logic in Computer Science (cs.LO)
We study the equivalence between eval-readback and eval-apply big-step evaluators in the general setting of the pure lambda calculus. We study `one-step' equivalence (same strategy) and also discuss `big-step' equivalence (same final result). One-step equivalence extends for free to evaluators in other settings (calculi, programming languages, proof assistants, etc.) by restricting the terms (closed, convergent) while maintaining the strategy. We present a proof that one-step equivalence holds when (1) the `readback' stage satisfies straightforward well-formedness provisos, (2) the `eval' stage implements a `uniform' strategy, and (3) the eval-apply evaluator implements a `balanced hybrid' strategy that has `eval' as a subsidiary strategy. The proof proceeds by application of the `lightweight fusion by fixed-point promotion' program transformation on evaluator implementations to fuse readback and eval into the balanced hybrid. The proof can be followed with no previous knowledge of the transformation. We use Haskell 2010 as the implementation language, with all evaluators written in monadic style to guarantee semantics (strategy) preservation, but the choice of implementation language is immaterial. To illustrate the large scope of the equivalence, we provide an extensive survey of the strategy space using canonical eval-apply evaluators in code and big-step `natural' operational semantics. We discuss the strategies' properties, some of their uses, and their abstract machines. We improve the formal definition of uniform and hybrid strategy, use it to structure and classify the strategy space, and to obtain generic higher-order evaluators which are used in the equivalence proof. We introduce a systematic notation for both evaluator styles and use it to summarise strategy and evaluator equivalences, including (non-)equivalences.
- [391] arXiv:2212.06431 (replaced) [pdf, html, other]
-
Title: Object-fabrication Targeted Attack for Object DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recent studies have demonstrated that object detection networks are usually vulnerable to adversarial examples. Generally, adversarial attacks for object detection can be categorized into targeted and untargeted attacks. Compared with untargeted attacks, targeted attacks present greater challenges and all existing targeted attack methods launch the attack by misleading detectors to mislabel the detected object as a specific wrong label. However, since these methods must depend on the presence of the detected objects within the victim image, they suffer from limitations in attack scenarios and attack success rates. In this paper, we propose a targeted feature space attack method that can mislead detectors to `fabricate' extra designated objects regardless of whether the victim image contains objects or not. Specifically, we introduce a guided image to extract coarse-grained features of the target objects and design an innovative dual attention mechanism to filter out the critical features of the target objects efficiently. The attack performance of the proposed method is evaluated on MS COCO and BDD100K datasets with FasterRCNN and YOLOv5. Evaluation results indicate that the proposed targeted feature space attack method shows significant improvements in terms of image-specific, universality, and generalization attack performance, compared with the previous targeted attack for object detection.
- [392] arXiv:2212.08158 (replaced) [pdf, html, other]
-
Title: MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & TasksComments: Published at ACL 2023 Main (Toronto); 10 pages, 14 appendix pages, 11 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. That a unimodal model achieves similar accuracy on a VL task to a multimodal one, indicates that so-called unimodal collapse occurred. However, accuracy-based tests fail to detect e.g., when the model prediction is wrong, while the model used relevant information from a modality. Instead, we propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values that reliably quantifies in which proportions a multimodal model uses individual modalities. We apply MM-SHAP in two ways: (1) to compare models for their average degree of multimodality, and (2) to measure for individual models the contribution of individual modalities for different tasks and datasets. Experiments with six VL models -- LXMERT, CLIP and four ALBEF variants -- on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the wide-spread assumption that unimodal collapse is one-sided. Based on our results, we recommend MM-SHAP for analysing multimodal tasks, to diagnose and guide progress towards multimodal integration. Code available at \url{this https URL}.
- [393] arXiv:2301.01490 (replaced) [pdf, html, other]
-
Title: Towards a Pipeline for Real-Time Visualization of Faces for VR-based Telepresence and Live Broadcasting Utilizing Neural RenderingSubjects: Computer Vision and Pattern Recognition (cs.CV)
While head-mounted displays (HMDs) for Virtual Reality (VR) have become widely available in the consumer market, they pose a considerable obstacle for a realistic face-to-face conversation in VR since HMDs hide a significant portion of the participants faces. Even with image streams from cameras directly attached to an HMD, stitching together a convincing image of an entire face remains a challenging task because of extreme capture angles and strong lens distortions due to a wide field of view. Compared to the long line of research in VR, reconstruction of faces hidden beneath an HMD is a very recent topic of research. While the current state-of-the-art solutions demonstrate photo-realistic 3D reconstruction results, they require high-cost laboratory equipment and large computational costs. We present an approach that focuses on low-cost hardware and can be used on a commodity gaming computer with a single GPU. We leverage the benefits of an end-to-end pipeline by means of Generative Adversarial Networks (GAN). Our GAN produces a frontal-facing 2.5D point cloud based on a training dataset captured with an RGBD camera. In our approach, the training process is offline, while the reconstruction runs in real-time. Our results show adequate reconstruction quality within the 'learned' expressions. Expressions not learned by the network produce artifacts and can trigger the Uncanny Valley effect.
- [394] arXiv:2302.07666 (replaced) [pdf, html, other]
-
Title: Forbidden Patterns in Temporal Graphs Resulting from Encounters in a CorridorSubjects: Data Structures and Algorithms (cs.DS)
In this paper, we study temporal graphs arising from mobility models, where vertices correspond to agents moving in space and edges appear each time two agents meet. We propose a rather natural one-dimensional model.
If each pair of agents meets exactly once, we get a simple temporal clique where the edges are ordered according to meeting times. In order to characterize which temporal cliques can be obtained as such `mobility graphs', we introduce the notion of forbidden patterns in temporal graphs. Furthermore, using a classical result in combinatorics, we count the number of such mobility cliques for a given number of agents, and show that not every temporal clique resulting from the 1D model can be realized with agents moving with different constant speeds. For the analogous circular problem, where agents are moving along a circle, we provide a characterization via circular forbidden patterns.
Our characterization in terms of forbidden patterns can be extended to the case where each edge appears at most once. We also study the problem where pairs of agents are allowed to cross each other several times, using an approach from automata theory. We observe that in this case, there is no finite set of forbidden patterns that characterize such temporal graphs and nevertheless give a linear-time algorithm to recognize temporal graphs arising from this model. - [395] arXiv:2303.14301 (replaced) [pdf, html, other]
-
Title: High-Level Synthetic Data Generation with Data Set ArchetypesComments: 21 pages, 11 figuresSubjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms. Simulation studies on synthetic data are popular because important features of the data sets, such as the overlap between clusters, or the variation in cluster shapes, can be effectively varied. Unfortunately, curating evaluation scenarios is often laborious, as practitioners must find lower-level geometric parameters (like cluster covariance matrices) to match a higher-level scenario description like "clusters with very different shapes." To make benchmarks more convenient and informative, we propose synthetic data generation based on data set archetypes. In this paradigm, the user describes an evaluation scenario in a high-level manner, and the software automatically generates data sets with the desired characteristics. Combining such data set archetypes with large language models (LLMs), it is possible to set up benchmarks purely from verbal descriptions of the evaluation scenarios. We provide an open-source Python package, repliclust, that implements this workflow. A demo of data generation from verbal inputs is available at this https URL.
- [396] arXiv:2303.14994 (replaced) [pdf, other]
-
Title: Similarity analysis of DNA sequences through local distribution of nucleotides in strategic neighborhoodComments: 5 pages, 5 figuresSubjects: Data Structures and Algorithms (cs.DS)
We propose a new alignment-free algorithm by constructing a compact vector representation on $\mathbb{R}^{24}$ of a DNA sequence of arbitrary length. Each component of this vector is obtained from a representative sequence, the elements of which are the values realized by a function $\Gamma$. This function $\Gamma$ acts on neighborhoods of arbitrary radius that are located at strategic positions within the DNA sequence and carries complete information about the local distribution of frequencies of the nucleotides as a consequence of the uniqueness of prime factorization of integer. The algorithm exhibits linear time complexity and turns out to consume significantly small memory. The two natural parameters characterizing the radius and location of the neighbourhoods are fixed by comparing the phylogenetic tree with the benchmark for full genome sequences of fish mtDNA datasets. Using these fitting parameters, the method is applied to analyze a number of genome sequences from benchmark and other standard datasets. Our algorithm proves to be computationally efficient compared to other well known algorithms when applied on simulated dataset.
- [397] arXiv:2303.16199 (replaced) [pdf, html, other]
-
Title: LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init AttentionRenrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Yu QiaoComments: Accepted by ICLR 2024. Code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserves its pre-trained knowledge. With our efficient training, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, our approach can be simply extended to multi-modal instructions for learning image-conditioned LLaMA model, which achieves superior reasoning performance on ScienceQA and COCO Caption benchmarks. Furthermore, we also evaluate the zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa) on traditional vision and language tasks, demonstrating the superior generalization capacity of our approach. Code is released at this https URL.
- [398] arXiv:2303.17448 (replaced) [pdf, other]
-
Title: NN-Copula-CD: A Copula-Guided Interpretable Neural Network for Change Detection in Heterogeneous Remote Sensing ImagesComments: The full version of this work is submitted to IEEE TGRSSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Change detection (CD) in heterogeneous remote sensing images has been widely used for disaster monitoring and land-use management. In the past decade, the heterogeneous CD problem has significantly benefited from the development of deep neural networks (DNNs). However, the purely data-driven DNNs perform like a black box where the lack of interpretability limits the trustworthiness and controllability of DNNs in most practical CD applications. As a powerful knowledge-driven tool, copula theory performs well in modeling relationships among random variables. To enhance the interpretability of existing neural networks for CD, we propose a knowledge-data-driven heterogeneous CD method based on a copula-guided neural network, named NN-Copula-CD. In our NN-Copula-CD, the mathematical characteristics of copula are employed as the loss functions to supervise a neural network to learn the dependence between bi-temporal heterogeneous superpixel pairs, and then the changed regions are identified via binary classification based on the degrees of dependence of all the superpixel pairs in the bi-temporal images. We conduct in-depth experiments on three datasets with heterogeneous images, where both quantitative and visual results demonstrate the effectiveness of our proposed NN-Copula-CD method.
- [399] arXiv:2304.09603 (replaced) [pdf, html, other]
-
Title: Visualising Personal Data Flows: Insights from a Case Study of Booking.comComments: This is the full edition of a paper published in Intelligent Information Systems: CAiSE Forum 2023, Zaragoza, Spain, June 12-16, 2023, Proceedings, Lecture Notes in Business Information Processing (LNBIP), Volume 477, pp. 52-60, 2023, Springer Nature, this https URLJournal-ref: Lecture Notes in Business Information Processing (LNBIP), 2023Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Commercial organisations are holding and processing an ever-increasing amount of personal data. Policies and laws are continually changing to require these companies to be more transparent regarding the collection, storage, processing and sharing of this data. This paper reports our work of taking this http URL as a case study to visualise personal data flows extracted from their privacy policy. By showcasing how the company shares its consumers' personal data, we raise questions and extend discussions on the challenges and limitations of using privacy policies to inform online users about the true scale and the landscape of personal data flows. This case study can inform us about future research on more data flow-oriented privacy policy analysis and on the construction of a more comprehensive ontology on personal data flows in complicated business ecosystems.
- [400] arXiv:2304.09697 (replaced) [pdf, other]
-
Title: A Calculus for Scoped Effects & HandlersSubjects: Programming Languages (cs.PL)
Algebraic effects & handlers have become a standard approach for side-effects in functional programming. Their modular composition with other effects and clean separation of syntax and semantics make them attractive to a wide audience. However, not all effects can be classified as algebraic; some need a more sophisticated handling. In particular, effects that have or create a delimited scope need special care, as their continuation consists of two parts-in and out of the scope-and their modular composition introduces additional complexity. These effects are called scoped and have gained attention by their growing applicability and adoption in popular libraries. While calculi have been designed with algebraic effects & handlers built in to facilitate their use, a calculus that supports scoped effects & handlers in a similar manner does not yet exist. This work fills this gap: we present $\lambda_{\mathit{sc}}$, a calculus with native support for both algebraic and scoped effects & handlers. It addresses the need for polymorphic handlers and explicit clauses for forwarding unknown scoped operations to other handlers. Our calculus is based on Eff, an existing calculus for algebraic effects, extended with Koka-style row polymorphism, and consists of a formal grammar, operational semantics, a (type-safe) type-and-effect system and type inference. We demonstrate $\lambda_{\mathit{sc}}$ on a range of examples.
- [401] arXiv:2304.12751 (replaced) [pdf, other]
-
Title: Centrality-Based Node Feature Augmentation for Robust Network AlignmentComments: 19 pages, 12 figures, 5 tables; its conference version was presented at the ACM International Conference on Information and Knowledge Management (CIKM 2022)Subjects: Social and Information Networks (cs.SI); Information Theory (cs.IT); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Networking and Internet Architecture (cs.NI)
Network alignment (NA) is the task of discovering node correspondences across multiple networks. Although NA methods have achieved remarkable success in a myriad of scenarios, their effectiveness is not without additional information such as prior anchor links and/or node features, which may not always be available due to privacy concerns or access restrictions. To tackle this challenge, we propose Grad-Align+, a novel NA method built upon a recent state-of-the-art NA method, the so-called Grad-Align, that gradually discovers a part of node pairs until all node pairs are found. In designing Grad-Align+, we account for how to augment node features in the sense of performing the NA task and how to design our NA method by maximally exploiting the augmented node features. To achieve this goal, Grad-Align+ consists of three key components: 1) centrality-based node feature augmentation (CNFA), 2) graph neural network (GNN)-aided embedding similarity calculation alongside the augmented node features, and 3) gradual NA with similarity calculation using aligned cross-network neighbor-pairs (ACNs). Through comprehensive experiments, we demonstrate that Grad-Align+ exhibits (a) the superiority over benchmark NA methods, (b) empirical validations as well as our theoretical findings to see the effectiveness of CNFA, (c) the influence of each component, (d) the robustness to network noises, and (e) the computational efficiency.
- [402] arXiv:2304.13343 (replaced) [pdf, html, other]
-
Title: Enhancing Large Language Model with Self-Controlled Memory FrameworkBing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, Zhoujun LiSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) are constrained by their inability to process lengthy inputs, resulting in the loss of critical historical information. To address this limitation, in this paper, we propose the Self-Controlled Memory (SCM) framework to enhance the ability of LLMs to maintain long-term memory and recall relevant information. Our SCM framework comprises three key components: an LLM-based agent serving as the backbone of the framework, a memory stream storing agent memories, and a memory controller updating memories and determining when and how to utilize memories from memory stream. Additionally, the proposed SCM is able to process ultra-long texts without any modification or fine-tuning, which can integrate with any instruction following LLMs in a plug-and-play paradigm. Furthermore, we annotate a dataset to evaluate the effectiveness of SCM for handling lengthy inputs. The annotated dataset covers three tasks: long-term dialogues, book summarization, and meeting summarization. Experimental results demonstrate that our method achieves better retrieval recall and generates more informative responses compared to competitive baselines in long-term dialogues. (this https URL)
- [403] arXiv:2305.02673 (replaced) [pdf, html, other]
-
Title: Modelling Spatio-Temporal Interactions For Compositional Action RecognitionComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Computer Vision and Pattern Recognition (cs.CV)
Humans have the natural ability to recognize actions even if the objects involved in the action or the background are changed. Humans can abstract away the action from the appearance of the objects which is referred to as compositionality of actions. We focus on this compositional aspect of action recognition to impart human-like generalization abilities to video action-recognition models. First, we propose an interaction model that captures both fine-grained and long-range interactions between hands and objects. Frame-wise hand-object interactions capture fine-grained movements, while long-range interactions capture broader context and disambiguate actions across time. Second, in order to provide additional contextual cues to differentiate similar actions, we infuse the interaction tokens with global motion information from video tokens. The final global motion refined interaction tokens are used for compositional action recognition. We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset where we obtain a new state-of-the-art result outperforming recent object-centric methods by a significant margin.
- [404] arXiv:2305.03216 (replaced) [pdf, html, other]
-
Title: Near-realtime Facial Animation by Deep 3D Simulation Super-ResolutionHyojoon Park, Sangeetha Grama Srinivasan, Matthew Cong, Doyub Kim, Byungsoo Kim, Jonathan Swartz, Ken Museth, Eftychios SifakisSubjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)
We present a neural network-based simulation super-resolution framework that can efficiently and realistically enhance a facial performance produced by a low-cost, realtime physics-based simulation to a level of detail that closely approximates that of a reference-quality off-line simulator with much higher resolution (26x element count in our examples) and accurate physical modeling. Our approach is rooted in our ability to construct - via simulation - a training set of paired frames, from the low- and high-resolution simulators respectively, that are in semantic correspondence with each other. We use face animation as an exemplar of such a simulation domain, where creating this semantic congruence is achieved by simply dialing in the same muscle actuation controls and skeletal pose in the two simulators. Our proposed neural network super-resolution framework generalizes from this training set to unseen expressions, compensates for modeling discrepancies between the two simulations due to limited resolution or cost-cutting approximations in the real-time variant, and does not require any semantic descriptors or parameters to be provided as input, other than the result of the real-time simulation. We evaluate the efficacy of our pipeline on a variety of expressive performances and provide comparisons and ablation experiments for plausible variations and alternatives to our proposed scheme.
- [405] arXiv:2305.08456 (replaced) [pdf, html, other]
-
Title: DAppSCAN: Building Large-Scale Datasets for Smart Contract Weaknesses in DApp ProjectsComments: Published in IEEE Transactions on Software Engineering ( Volume: 50, Issue: 6, June 2024). Dataset available at this https URLSubjects: Software Engineering (cs.SE)
The Smart Contract Weakness Classification Registry (SWC Registry) is a widely recognized list of smart contract weaknesses specific to the Ethereum platform. Despite the SWC Registry not being updated with new entries since 2020, the sustained development of smart contract analysis tools for detecting SWC-listed weaknesses highlights their ongoing significance in the field. However, evaluating these tools has proven challenging due to the absence of a large, unbiased, real-world dataset. To address this problem, we aim to build a large-scale SWC weakness dataset from real-world DApp projects. We recruited 22 participants and spent 44 person-months analyzing 1,199 open-source audit reports from 29 security teams. In total, we identified 9,154 weaknesses and developed two distinct datasets, i.e., DAPPSCAN-SOURCE and DAPPSCAN-BYTECODE. The DAPPSCAN-SOURCE dataset comprises 39,904 Solidity files, featuring 1,618 SWC weaknesses sourced from 682 real-world DApp projects. However, the Solidity files in this dataset may not be directly compilable for further analysis. To facilitate automated analysis, we developed a tool capable of automatically identifying dependency relationships within DApp projects and completing missing public libraries. Using this tool, we created DAPPSCAN-BYTECODE dataset, which consists of 6,665 compiled smart contract with 888 SWC weaknesses. Based on DAPPSCAN-BYTECODE, we conducted an empirical study to evaluate the performance of state-of-the-art smart contract weakness detection tools. The evaluation results revealed sub-par performance for these tools in terms of both effectiveness and success detection rate, indicating that future development should prioritize real-world datasets over simplistic toy contracts.
- [406] arXiv:2305.14297 (replaced) [pdf, html, other]
-
Title: Order conditions for Runge--Kutta-like methods with solution-dependent coefficientsComments: 29 pages, 1 figureSubjects: Numerical Analysis (math.NA)
In recent years, many positivity-preserving schemes for initial value problems have been constructed by modifying a Runge--Kutta (RK) method by weighting the right-hand side of the system of differential equations with solution-dependent factors. These include the classes of modified Patankar--Runge--Kutta (MPRK) and Geometric Conservative (GeCo) methods. Compared to traditional RK methods, the analysis of accuracy and stability of these methods is more complicated. In this work, we provide a comprehensive and unifying theory of order conditions for such RK-like methods, which differ from original RK schemes in that their coefficients are solution-dependent. The resulting order conditions are themselves solution-dependent and obtained using the theory of NB-series, and thus, can easily be read off from labeled N-trees. We present for the first time order conditions for MPRK and GeCo schemes of arbitrary order; For MPRK schemes, the order conditions are given implicitly in terms of the stages. From these results, we recover as particular cases all known order conditions from the literature for first- and second-order GeCo as well as first-, second- and third-order MPRK methods. Additionally, we derive sufficient and necessary conditions in an explicit form for 3rd and 4th order GeCo schemes as well as 4th order MPRK methods. We also present a new 4th order MPRK method within this framework and numerically confirm its convergence rate.
- [407] arXiv:2305.18385 (replaced) [pdf, html, other]
-
Title: Self-attention Dual Embedding for Graphs with HeterophilyComments: 9 pages, 15 figuresSubjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Graph Neural Networks (GNNs) have been highly successful for the node classification task. GNNs typically assume graphs are homophilic, i.e. neighboring nodes are likely to belong to the same class. However, a number of real-world graphs are heterophilic, and this leads to much lower classification accuracy using standard GNNs. In this work, we design a novel GNN which is effective for both heterophilic and homophilic graphs. Our work is based on three main observations. First, we show that node features and graph topology provide different amounts of informativeness in different graphs, and therefore they should be encoded independently and prioritized in an adaptive manner. Second, we show that allowing negative attention weights when propagating graph topology information improves accuracy. Finally, we show that asymmetric attention weights between nodes are helpful. We design a GNN which makes use of these observations through a novel self-attention mechanism. We evaluate our algorithm on real-world graphs containing thousands to millions of nodes and show that we achieve state-of-the-art results compared to existing GNNs. We also analyze the effectiveness of the main components of our design on different graphs.
- [408] arXiv:2306.03660 (replaced) [pdf, html, other]
-
Title: Empir3D : A Framework for Multi-Dimensional Point Cloud AssessmentSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Advancements in sensors, algorithms, and compute hardware have made 3D perception feasible in real time. Current methods to compare and evaluate the quality of a 3D model, such as Chamfer, Hausdorff, and Earth-Mover's distance, are uni-dimensional and have limitations, including an inability to capture coverage, local variations in density and error, and sensitivity to outliers. In this paper, we propose an evaluation framework for point clouds (Empir3D) that consists of four metrics: resolution to quantify the ability to distinguish between individual parts in the point cloud, accuracy to measure registration error, coverage to evaluate the portion of missing data, and artifact score to characterize the presence of artifacts. Through detailed analysis, we demonstrate the complementary nature of each of these dimensions and the improvements they provide compared to the aforementioned uni-dimensional measures. Furthermore, we illustrate the utility of Empir3D by comparing our metrics with uni-dimensional metrics for two 3D perception applications (SLAM and point cloud completion). We believe that Empir3D advances our ability to reason about point clouds and helps better debug 3D perception applications by providing a richer evaluation of their performance. Our implementation of Empir3D, custom real-world datasets, evaluations on learning methods, and detailed documentation on how to integrate the pipeline will be made available upon publication.
- [409] arXiv:2307.00824 (replaced) [pdf, html, other]
-
Title: Sufficient Conditions on Bipartite Consensus of Weakly Connected Matrix-weighted NetworksComments: 28 pagesSubjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Recent advancements in bipartite consensus, a scenario where agents are divided into two disjoint sets with agents in the same set agreeing on a certain value and those in different sets agreeing on opposite or specifically related values, have highlighted its potential applications across various fields. Traditional research typically relies on the presence of a positive-negative spanning tree, which limits the practical applicability of bipartite consensus. This study relaxes that assumption by allowing for weak connectivity within the network, where paths can be weighted by semidefinite matrices. By exploring the algebraic constraints imposed by positive-negative trees and semidefinite paths, we derive sufficient conditions for achieving bipartite consensus. Our theoretical findings are validated through numerical results.
- [410] arXiv:2307.02930 (replaced) [pdf, html, other]
-
Title: Global convergence of Newton's method for the regularized $p$-Stokes equationsComments: 25 pages, 6 figuresJournal-ref: Numer Algor (2024)Subjects: Numerical Analysis (math.NA)
The motion of glaciers can be simulated with the $p$-Stokes equations. Up to now, Newton's method to solve these equations has been analyzed in finite-dimensional settings only. We analyze the problem in infinite dimensions to gain a new viewpoint. We do that by proving global convergence of the infinite-dimensional Newton's method with Armijo step sizes to the solution of these equations. We only have to add an arbitrarily small diffusion term for this convergence result. We prove that the additional diffusion term only causes minor differences in the solution compared to the original $p$-Stokes equations under the assumption of some regularity. Finally, we test our algorithms on two experiments: A reformulation of the experiment ISMIP-HOM $B$ without sliding and a block with sliding. For the former, the approximation of exact step sizes for the Picard iteration and exact step sizes and Armijo step sizes for Newton's method are superior in the experiment compared to the Picard iteration. For the latter experiment, Newton's method with Armijo step sizes needs many iterations until it converges fast to the solution. Thus, Newton's method with approximately exact step sizes is better than Armijo step sizes in this experiment.
- [411] arXiv:2307.03515 (replaced) [pdf, html, other]
-
Title: Incentive Allocation in Vertical Federated Learning Based on Bankruptcy ProblemSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
Vertical federated learning (VFL) is a promising approach for collaboratively training machine learning models using private data partitioned vertically across different parties. Ideally in a VFL setting, the active party (party possessing features of samples with labels) benefits by improving its machine learning model through collaboration with some passive parties (parties possessing additional features of the same samples without labels) in a privacy preserving manner. However, motivating passive parties to participate in VFL can be challenging. In this paper, we focus on the problem of allocating incentives to the passive parties by the active party based on their contributions to the VFL process. We address this by formulating the incentive allocation problem as a bankruptcy game, a concept from cooperative game theory. Using the Talmudic division rule, which leads to the Nucleolus as its solution, we ensure a fair distribution of incentives. We evaluate our proposed method on synthetic and real-world datasets and show that it ensures fairness and stability in incentive allocation among passive parties who contribute their data to the federated model. Additionally, we compare our method to the existing solution of calculating Shapley values and show that our approach provides a more efficient solution with fewer computations.
- [412] arXiv:2307.09483 (replaced) [pdf, html, other]
-
Title: Forecasting steam mass flow in power plants using the parallel hybrid networkComments: 11 pages, 5 figuresSubjects: Machine Learning (cs.LG); Software Engineering (cs.SE); Data Analysis, Statistics and Probability (physics.data-an); Quantum Physics (quant-ph)
Efficient and sustainable power generation is a crucial concern in the energy sector. In particular, thermal power plants grapple with accurately predicting steam mass flow, which is crucial for operational efficiency and cost reduction. In this study, we use a parallel hybrid neural network architecture that combines a parametrized quantum circuit and a conventional feed-forward neural network specifically designed for time-series prediction in industrial settings to enhance predictions of steam mass flow 15 minutes into the future. Our results show that the parallel hybrid model outperforms standalone classical and quantum models, achieving more than 5.7 and 4.9 times lower mean squared error loss on the test set after training compared to pure classical and pure quantum networks, respectively. Furthermore, the hybrid model demonstrates smaller relative errors between the ground truth and the model predictions on the test set, up to 2 times better than the pure classical model. These findings contribute to the broader scientific understanding of how integrating quantum and classical machine learning techniques can be applied to real-world challenges faced by the energy sector, ultimately leading to optimized power plant operations.
- [413] arXiv:2307.09494 (replaced) [pdf, html, other]
-
Title: Explanation-Guided Fair Federated Learning for Transparent 6G RAN SlicingComments: Submitted for possible publication in IEEESubjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Future zero-touch artificial intelligence (AI)-driven 6G network automation requires building trust in the AI black boxes via explainable artificial intelligence (XAI), where it is expected that AI faithfulness would be a quantifiable service-level agreement (SLA) metric along with telecommunications key performance indicators (KPIs). This entails exploiting the XAI outputs to generate transparent and unbiased deep neural networks (DNNs). Motivated by closed-loop (CL) automation and explanation-guided learning (EGL), we design an explanation-guided federated learning (EGFL) scheme to ensure trustworthy predictions by exploiting the model explanation emanating from XAI strategies during the training run time via Jensen-Shannon (JS) divergence. Specifically, we predict per-slice RAN dropped traffic probability to exemplify the proposed concept while respecting fairness goals formulated in terms of the recall metric which is included as a constraint in the optimization task. Finally, the comprehensiveness score is adopted to measure and validate the faithfulness of the explanations quantitatively. Simulation results show that the proposed EGFL-JS scheme has achieved more than $50\%$ increase in terms of comprehensiveness compared to different baselines from the literature, especially the variant EGFL-KL that is based on the Kullback-Leibler Divergence. It has also improved the recall score with more than $25\%$ relatively to unconstrained-EGFL.
- [414] arXiv:2307.12903 (replaced) [pdf, html, other]
-
Title: Towards Bridging the FL Performance-Explainability Trade-Off: A Trustworthy 6G RAN Slicing Use-CaseComments: Submitted for possible publication in IEEE. arXiv admin note: text overlap with arXiv:2210.10147Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
In the context of sixth-generation (6G) networks, where diverse network slices coexist, the adoption of AI-driven zero-touch management and orchestration (MANO) becomes crucial. However, ensuring the trustworthiness of AI black-boxes in real deployments is challenging. Explainable AI (XAI) tools can play a vital role in establishing transparency among the stakeholders in the slicing ecosystem. But there is a trade-off between AI performance and explainability, posing a dilemma for trustworthy 6G network slicing because the stakeholders require both highly performing AI models for efficient resource allocation and explainable decision-making to ensure fairness, accountability, and compliance. To balance this trade off and inspired by the closed loop automation and XAI methodologies, this paper presents a novel explanation-guided in-hoc federated learning (FL) approach where a constrained resource allocation model and an explainer exchange -- in a closed loop (CL) fashion -- soft attributions of the features as well as inference predictions to achieve a transparent 6G network slicing resource management in a RAN-Edge setup under non-independent identically distributed (non-IID) datasets. In particular, we quantitatively validate the faithfulness of the explanations via the so-called attribution-based confidence metric that is included as a constraint to guide the overall training process in the run-time FL optimization task. In this respect, Integrated-Gradient (IG) as well as Input $\times$ Gradient and SHAP are used to generate the attributions for our proposed in-hoc scheme, wherefore simulation results under different methods confirm its success in tackling the performance-explainability trade-off and its superiority over the unconstrained Integrated-Gradient post-hoc FL baseline.
- [415] arXiv:2308.07060 (replaced) [pdf, other]
-
Title: A convergent stochastic scalar auxiliary variable methodSubjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)
We discuss an extension of the scalar auxiliary variable approach, which was originally introduced by Shen et al. ([Shen, Xu, Yang, J. Comput. Phys., 2018]) for the discretization of deterministic gradient flows. By introducing an additional scalar auxiliary variable, this approach allows to derive a linear scheme, while still maintaining unconditional stability. Our extension augments the approximation of the evolution of this scalar auxiliary variable with higher order terms, which enables its application to stochastic partial differential equations. Using the stochastic Allen--Cahn equation as a prototype for nonlinear stochastic partial differential equations with multiplicative noise, we propose an unconditionally energy stable, linear, fully discrete finite element scheme based on our augmented scalar auxiliary variable method. Recovering a discrete version of the energy estimate and establishing Nikolskii estimates with respect to time, we are able to prove convergence of discrete solutions towards pathwise unique martingale solutions by applying Jakubowski's generalization of Skorokhod's theorem. A generalization of the Gyöngy--Krylov characterization of convergence in probability to quasi-Polish spaces finally provides convergence of fully discrete solutions towards strong solutions of the stochastic Allen--Cahn equation. Finally, we present numerical simulations underlining the practicality of the scheme and the importance of the introduced augmentation terms.
- [416] arXiv:2308.11406 (replaced) [pdf, html, other]
-
Title: Designing an attack-defense game: how to increase robustness of financial transaction models via a competitionAlexey Zaytsev, Maria Kovaleva, Alex Natekin, Evgeni Vorsin, Valerii Smirnov, Georgii Smirnov, Oleg Sidorshin, Alexander Senin, Alexander Dudin, Dmitry BerestnevSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Statistical Finance (q-fin.ST)
Banks routinely use neural networks to make decisions. While these models offer higher accuracy, they are susceptible to adversarial attacks, a risk often overlooked in the context of event sequences, particularly sequences of financial transactions, as most works consider computer vision and NLP modalities.
We propose a thorough approach to studying these risks: a novel type of competition that allows a realistic and detailed investigation of problems in financial transaction data. The participants directly oppose each other, proposing attacks and defenses -- so they are examined in close-to-real-life conditions.
The paper outlines our unique competition structure with direct opposition of participants, presents results for several different top submissions, and analyzes the competition results. We also introduce a new open dataset featuring financial transactions with credit default labels, enhancing the scope for practical research and development. - [417] arXiv:2309.02253 (replaced) [pdf, html, other]
-
Title: MA-VAE: Multi-head Attention-based Variational Autoencoder Approach for Anomaly Detection in Multivariate Time-series Applied to Automotive Endurance Powertrain TestingSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
A clear need for automatic anomaly detection applied to automotive testing has emerged as more and more attention is paid to the data recorded and manual evaluation by humans reaches its capacity. Such real-world data is massive, diverse, multivariate and temporal in nature, therefore requiring modelling of the testee behaviour. We propose a variational autoencoder with multi-head attention (MA-VAE), which, when trained on unlabelled data, not only provides very few false positives but also manages to detect the majority of the anomalies presented. In addition to that, the approach offers a novel way to avoid the bypass phenomenon, an undesirable behaviour investigated in literature. Lastly, the approach also introduces a new method to remap individual windows to a continuous time series. The results are presented in the context of a real-world industrial data set and several experiments are undertaken to further investigate certain aspects of the proposed model. When configured properly, it is 9% of the time wrong when an anomaly is flagged and discovers 67% of the anomalies present. Also, MA-VAE has the potential to perform well with only a fraction of the training and validation subset, however, to extract it, a more sophisticated threshold estimation method is required.
- [418] arXiv:2309.04156 (replaced) [pdf, html, other]
-
Title: Cross-Utterance Conditioned VAE for Speech GenerationYang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei SunComments: 13 pages;Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
- [419] arXiv:2309.06020 (replaced) [pdf, html, other]
-
Title: PRESTI: Predicting Repayment Effort of Self-Admitted Technical Debt Using Textual InformationSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Technical debt refers to the consequences of sub-optimal decisions made during software development that prioritize short-term benefits over long-term maintainability. Self-Admitted Technical Debt (SATD) is a specific form of technical debt, explicitly documented by developers within software artifacts such as source code comments and commit messages. As SATD can hinder software development and maintenance, it is crucial to estimate the effort required to repay it so that we can effectively prioritize it. However, we currently lack an understanding of SATD repayment, and more importantly, we lack approaches that can automatically estimate the repayment effort of SATD based on its textual description. To bridge this gap, we have curated a comprehensive dataset of 341,740 SATD items from 2,568,728 commits across 1,060 Apache repositories and analyzed the repayment effort comparing SATD vs. non-SATD items, as well as different types of SATD items. Furthermore, we proposed an innovative approach for Predicting Repayment Effort of SATD using Textual Information, named PRESTI. Our findings show that different types of SATD require varying levels of repayment effort, with code/design, requirement, and test debt demanding greater effort compared to non-SATD items, while documentation debt requires less. We have evaluated our approaches, particularly BERT- and TextCNN-based models, which outperform traditional machine learning methods and the baseline in estimating repayment effort. Additionally, we summarize keywords associated with varying levels of repayment effort that occur during SATD repayment. Our work aims to enhance SATD repayment prioritization and resource allocation, thereby improving software development and maintainability.
- [420] arXiv:2310.02704 (replaced) [pdf, html, other]
-
Title: Extending Isabelle/HOL's Code Generator with support for the Go programming languageComments: 23 pages, of which 15 pages are main contentJournal-ref: FM 2024: Formal Methods Volume 14934 of the series Lecture Notes in Computer Science pp 3-19, SpringerSubjects: Programming Languages (cs.PL); Logic in Computer Science (cs.LO)
The Isabelle proof assistant includes a small functional language, which allows users to write and reason about programs. So far, these programs could be extracted into a number of functional languages: Standard ML, OCaml, Scala, and Haskell. This work adds support for Go as a fifth target language for the Code Generator. Unlike the previous targets, Go is not a functional language and encourages code in an imperative style, thus many of the features of Isabelle's language (particularly data types, pattern matching, and type classes) have to be emulated using imperative language constructs in Go. The developed Code Generation is provided as an add-on library that can be simply imported into existing theories.
- [421] arXiv:2310.05566 (replaced) [pdf, other]
-
Title: Aggregated f-average Neural Network applied to Few-Shot Class Incremental LearningComments: 27 pages, 3 figures, submitted to Signal ProcessingSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Ensemble learning leverages multiple models (i.e., weak learners) on a common machine learning task to enhance prediction performance. Basic ensembling approaches average the weak learners outputs, while more sophisticated ones stack a machine learning model in between the weak learners outputs and the final prediction. This work fuses both aforementioned frameworks. We introduce an aggregated f-average (AFA) shallow neural network which models and combines different types of averages to perform an optimal aggregation of the weak learners predictions. We emphasise its interpretable architecture and simple training strategy, and illustrate its good performance on the problem of few-shot class incremental learning.
- [422] arXiv:2310.11133 (replaced) [pdf, other]
-
Title: Synthesizing Invariants for Polynomial Programs by Semidefinite ProgrammingComments: 35 pages, in submissionSubjects: Programming Languages (cs.PL)
Constraint-solving-based program invariant synthesis takes a parametric invariant template and encodes the (inductive) invariant conditions into constraints. The problem of characterizing the set of all valid parameter assignments is referred to as the strong invariant synthesis problem, while the problem of finding a concrete valid parameter assignment is called the weak invariant synthesis problem. For both problems, the challenge lies in solving or reducing the encoded constraints, which are generally non-convex and lack efficient solvers. Consequently, existing works either rely on heuristic optimization techniques (such as bilinear matrix inequalities) or resort to general-purpose solvers (such as quantifier elimination), leading to a trade-off between completeness and efficiency.
In this paper, we propose two novel algorithms for synthesizing invariants of polynomial programs using semidefinite programming (SDP): (1) The Cluster algorithm targets the strong invariant synthesis problem for polynomial invariant templates. Leveraging robust optimization techniques, it solves a series of SDP relaxations and yields a sequence of increasingly precise under-approximations of the set of valid parameter assignments. We prove the algorithm's soundness, convergence, and weak completeness under a specific robustness assumption on templates. Moreover, the outputs can simplify the weak invariant synthesis problem. (2) The Mask algorithm addresses the weak invariant synthesis problem in scenarios where the aforementioned robustness assumption does not hold, rendering the Cluster algorithm ineffective. It identifies a specific subclass of invariant templates, termed masked templates, involving parameterized polynomial equalities and known inequalities. By applying variable substitution, the algorithm transforms constraints into an equivalent form amenable to SDP relaxations. - [423] arXiv:2311.00207 (replaced) [pdf, html, other]
-
Title: Magmaw: Modality-Agnostic Adversarial Attacks on Machine Learning-Based Wireless Communication SystemsComments: Accepted at NDSS 2025Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Machine Learning (ML) has been instrumental in enabling joint transceiver optimization by merging all physical layer blocks of the end-to-end wireless communication systems. Although there have been a number of adversarial attacks on ML-based wireless systems, the existing methods do not provide a comprehensive view including multi-modality of the source data, common physical layer protocols, and wireless domain constraints. This paper proposes Magmaw, a novel wireless attack methodology capable of generating universal adversarial perturbations for any multimodal signal transmitted over a wireless channel. We further introduce new objectives for adversarial attacks on downstream applications. We adopt the widely-used defenses to verify the resilience of Magmaw. For proof-of-concept evaluation, we build a real-time wireless attack platform using a software-defined radio system. Experimental results demonstrate that Magmaw causes significant performance degradation even in the presence of strong defense mechanisms. Furthermore, we validate the performance of Magmaw in two case studies: encrypted communication channel and channel modality-based ML model.
- [424] arXiv:2311.02287 (replaced) [pdf, html, other]
-
Title: Estimating Ground Reaction Forces from Inertial SensorsBowen Song, Marco Paolieri, Harper E. Stewart, Leana Golubchik, Jill L. McNitt-Gray, Vishal Misra, Devavrat ShahComments: Accepted for publication at IEEE Transactions on Biomedical EngineeringSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Objective: Our aim is to determine if data collected with inertial measurement units (IMUs) during steady-state running could be used to estimate ground reaction forces (GRFs) and to derive biomechanical variables (e.g., contact time, impulse, change in velocity) using lightweight machine-learning approaches. In contrast, state-of-the-art estimation using LSTMs suffers from prohibitive inference times on edge devices, requires expensive training and hyperparameter optimization, and results in black box models. Methods: We proposed a novel lightweight solution, SVD Embedding Regression (SER), using linear regression between SVD embeddings of IMU data and GRF data. We also compared lightweight solutions including SER and k-Nearest-Neighbors (KNN) regression with state-of-the-art LSTMs. Results: We performed extensive experiments to evaluate these techniques under multiple scenarios and combinations of IMU signals and quantified estimation errors for predicting GRFs and biomechanical variables. We did this using training data from different athletes, from the same athlete, or both, and we explored the use of acceleration and angular velocity data from sensors at different locations (sacrum and shanks). Conclusion: Our results illustrated that lightweight solutions such as SER and KNN can be similarly accurate or more accurate than LSTMs. The use of personal data reduced estimation errors of all methods, particularly for most biomechanical variables (as compared to GRFs); moreover, this gain was more pronounced in the lightweight methods. Significance: The study of GRFs is used to characterize the mechanical loading experienced by individuals in movements such as running, which is clinically applicable to identify athletes at risk for stress-related injuries.
- [425] arXiv:2311.07466 (replaced) [pdf, html, other]
-
Title: On Measuring Faithfulness or Self-consistency of Natural Language ExplanationsComments: Paper accepted for publication at ACL 2024 Main (Bangkok, Thailand); 10 main paper pages, 30 appendix pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this work we argue that these faithfulness tests do not measure faithfulness to the models' inner workings -- but rather their self-consistency at output level. Our contributions are three-fold: i) We clarify the status of faithfulness tests in view of model explainability, characterising them as self-consistency tests instead. This assessment we underline by ii) constructing a Comparative Consistency Bank for self-consistency tests that for the first time compares existing tests on a common suite of 11 open LLMs and 5 tasks -- including iii) our new self-consistency measure CC-SHAP. CC-SHAP is a fine-grained measure (not a test) of LLM self-consistency. It compares how a model's input contributes to the predicted answer and to generating the explanation. Our fine-grained CC-SHAP metric allows us iii) to compare LLM behaviour when making predictions and to analyse the effect of other consistency tests at a deeper level, which takes us one step further towards measuring faithfulness by bringing us closer to the internals of the model than strictly surface output-oriented tests. Our code is available at \url{this https URL}
- [426] arXiv:2311.16114 (replaced) [pdf, html, other]
-
Title: Learning Noise-Robust Joint Representation for Multimodal Emotion Recognition under Incomplete Data ScenariosQi Fan (1), Haolin Zuo (1), Rui Liu (1), Zheng Lian (2), Guanglai Gao (1) ((1) Inner Mongolia University, Hohhot, China, (2) Institute of Automation, Chinese Academy of Sciences, Beijing, China)Comments: Accepted by MRAC@ACM Multimedia 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Multimodal emotion recognition (MER) in practical scenarios is significantly challenged by the presence of missing or incomplete data across different modalities. To overcome these challenges, researchers have aimed to simulate incomplete conditions during the training phase to enhance the system's overall robustness. Traditional methods have often involved discarding data or substituting data segments with zero vectors to approximate these incompletenesses. However, such approaches neither accurately represent real-world conditions nor adequately address the issue of noisy data availability. For instance, a blurry image cannot be simply replaced with zero vectors, while still retaining information. To tackle this issue and develop a more precise MER system, we introduce a novel noise-robust MER model that effectively learns robust multimodal joint representations from noisy data. This approach includes two pivotal components: firstly, a noise scheduler that adjusts the type and level of noise in the data to emulate various realistic incomplete situations. Secondly, a Variational AutoEncoder (VAE)-based module is employed to reconstruct these robust multimodal joint representations from the noisy inputs. Notably, the introduction of the noise scheduler enables the exploration of an entirely new type of incomplete data condition, which is impossible with existing methods. Extensive experimental evaluations on the benchmark datasets IEMOCAP and CMU-MOSEI demonstrate the effectiveness of the noise scheduler and the excellent performance of our proposed model. Our project is publicly available on this https URL.
- [427] arXiv:2311.18615 (replaced) [pdf, html, other]
-
Title: Two-scale exponential integrators with uniform accuracy for three-dimensional charged-particle dynamics under strong magnetic fieldSubjects: Numerical Analysis (math.NA)
The numerical simulation of three-dimensional charged-particle dynamics (CPD) under strong magnetic field is a basic and challenging algorithmic task in plasma physics. In this paper, we introduce a new methodology to design two-scale exponential integrators for three-dimensional CPD whose magnetic field's strength is inversely proportional to a dimensionless and small parameter $0<\varepsilon \ll 1$. By dealing with the transformed form of three-dimensional CPD, we linearize the magnetic field and put the residual component in a new nonlinear function which is shown to be uniformly bounded. Based on this foundation and the proposed two-scale exponential integrators, a class of novel integrators is formulated and studied. The corresponding uniform accuracy of the proposed $r$-th order integrator is shown to be $\mathcal{O}(h^r)$, where $r=1,2,3,4$ and the constant symbolized by $\mathcal{O}$, the time stepsize $h$ and the computation cost are all independent of $\varepsilon$. Moreover, in the case of maximal ordering strong magnetic field, improved error bound $\mathcal{O}(\varepsilon^r h^r)$ is obtained for the proposed $r$-th order integrator. A rigorous proof of these uniform and improved error bounds is presented, and a numerical test is performed to illustrate the error and efficiency behaviour of the proposed integrators.
- [428] arXiv:2312.04356 (replaced) [pdf, other]
-
Title: NeuJeans: Private Neural Network Inference with Joint Optimization of Convolution and BootstrappingComments: 15 pages, 6 figuresSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Fully homomorphic encryption (FHE) is a promising cryptographic primitive for realizing private neural network inference (PI) services by allowing a client to fully offload the inference task to a cloud server while keeping the client data oblivious to the server. This work proposes NeuJeans, an FHE-based solution for the PI of deep convolutional neural networks (CNNs). NeuJeans tackles the critical problem of the enormous computational cost for the FHE evaluation of CNNs. We introduce a novel encoding method called Coefficients-in-Slot (CinS) encoding, which enables multiple convolutions in one HE multiplication without costly slot permutations. We further observe that CinS encoding is obtained by conducting the first several steps of the Discrete Fourier Transform (DFT) on a ciphertext in conventional Slot encoding. This property enables us to save the conversion between CinS and Slot encodings as bootstrapping a ciphertext starts with DFT. Exploiting this, we devise optimized execution flows for various two-dimensional convolution (conv2d) operations and apply them to end-to-end CNN implementations. NeuJeans accelerates the performance of conv2d-activation sequences by up to 5.68 times compared to state-of-the-art FHE-based PI work and performs the PI of a CNN at the scale of ImageNet within a mere few seconds.
- [429] arXiv:2312.11242 (replaced) [pdf, other]
-
Title: MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQLBing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, Zhoujun LiSubjects: Computation and Language (cs.CL)
Recent LLM-based Text-to-SQL methods usually suffer from significant performance degradation on "huge" databases and complex user questions that require multi-step reasoning. Moreover, most existing methods neglect the crucial significance of LLMs utilizing external tools and model collaboration. To address these challenges, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework. Our framework comprises a core decomposer agent for Text-to-SQL generation with few-shot chain-of-thought reasoning, accompanied by two auxiliary agents that utilize external tools or models to acquire smaller sub-databases and refine erroneous SQL queries. The decomposer agent collaborates with auxiliary agents, which are activated as needed and can be expanded to accommodate new features or tools for effective Text-to-SQL parsing. In our framework, We initially leverage GPT-4 as the strong backbone LLM for all agent tasks to determine the upper bound of our framework. We then fine-tune an open-sourced instruction-followed model, SQL-Llama, by leveraging Code Llama 7B, to accomplish all tasks as GPT-4 does. Experiments show that SQL-Llama achieves a comparable execution accuracy of 43.94, compared to the baseline accuracy of 46.35 for vanilla GPT-4. At the time of writing, MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set (this https URL).
- [430] arXiv:2401.00713 (replaced) [pdf, html, other]
-
Title: Graph Neural Networks in Intelligent Transportation Systems: Advances, Applications and TrendsHourun Li, Yusheng Zhao, Zhengyang Mao, Yifang Qin, Zhiping Xiao, Jiaqi Feng, Yiyang Gu, Wei Ju, Xiao Luo, Ming ZhangSubjects: Machine Learning (cs.LG)
Intelligent Transportation System (ITS) is crucial for improving traffic congestion, reducing accidents, optimizing urban planning, and more. However, the complexity of traffic networks has rendered traditional machine learning and statistical methods less effective. With the advent of artificial intelligence, deep learning frameworks have achieved remarkable progress across various fields and are now considered highly effective in many areas. Since 2019, Graph Neural Networks (GNNs) have emerged as a particularly promising deep learning approach within the ITS domain, owing to their robust ability to model graph-structured data and address complex problems. Consequently, there has been increasing scholarly attention to the applications of GNNs in transportation, which have demonstrated excellent performance. Nevertheless, current research predominantly focuses on traffic forecasting, with other ITS domains, such as autonomous vehicles and demand prediction, receiving less attention. This paper aims to review the applications of GNNs across six representative and emerging ITS research areas: traffic forecasting, vehicle control system, traffic signal control, transportation safety, demand prediction, and parking management. We have examined a wide range of graph-related studies from 2018 to 2023, summarizing their methodologies, features, and contributions in detailed tables and lists. Additionally, we identify the challenges of applying GNNs in ITS and propose potential future research directions.
- [431] arXiv:2401.00832 (replaced) [pdf, html, other]
-
Title: Taking the Next Step with Generative Artificial Intelligence: The Transformative Role of Multimodal Large Language Models in Science EducationArne Bewersdorff, Christian Hartmann, Marie Hornberger, Kathrin Seßler, Maria Bannert, Enkelejda Kasneci, Gjergji Kasneci, Xiaoming Zhai, Claudia NerdelComments: revised version 2. September 2024Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
The integration of Artificial Intelligence (AI), particularly Large Language Model (LLM)-based systems, in education has shown promise in enhancing teaching and learning experiences. However, the advent of Multimodal Large Language Models (MLLMs) like GPT-4 with vision (GPT-4V), capable of processing multimodal data including text, sound, and visual inputs, opens a new era of enriched, personalized, and interactive learning landscapes in education. Grounded in theory of multimedia learning, this paper explores the transformative role of MLLMs in central aspects of science education by presenting exemplary innovative learning scenarios. Possible applications for MLLMs could range from content creation to tailored support for learning, fostering competencies in scientific practices, and providing assessment and feedback. These scenarios are not limited to text-based and uni-modal formats but can be multimodal, increasing thus personalization, accessibility, and potential learning effectiveness. Besides many opportunities, challenges such as data protection and ethical considerations become more salient, calling for robust frameworks to ensure responsible integration. This paper underscores the necessity for a balanced approach in implementing MLLMs, where the technology complements rather than supplants the educator's role, ensuring thus an effective and ethical use of AI in science education. It calls for further research to explore the nuanced implications of MLLMs on the evolving role of educators and to extend the discourse beyond science education to other disciplines. Through the exploration of potentials, challenges, and future implications, we aim to contribute to a preliminary understanding of the transformative trajectory of MLLMs in science education and beyond.
- [432] arXiv:2401.01568 (replaced) [pdf, html, other]
-
Title: A Survey of Protocol FuzzingXiaohan Zhang, Cen Zhang, Xinghua Li, Zhengjie Du, Bing Mao, Yuekang Li, Yaowen Zheng, Yeting Li, Li Pan, Yang Liu, Robert H. DengSubjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Communication protocols form the bedrock of our interconnected world, yet vulnerabilities within their implementations pose significant security threats. Recent developments have seen a surge in fuzzing-based research dedicated to uncovering these vulnerabilities within protocol implementations. However, there still lacks a systematic overview of protocol fuzzing for answering the essential questions such as what the unique challenges are, how existing works solve them, etc. To bridge this gap, we conducted a comprehensive investigation of related works from both academia and industry. Our study includes a detailed summary of the specific challenges in protocol fuzzing, and provides a systematic categorization and overview of existing research efforts. Furthermore, we explore and discuss potential future research directions in protocol fuzzing. This survey serves as a foundational guideline for researchers and practitioners in the field.
- [433] arXiv:2401.07261 (replaced) [pdf, html, other]
-
Title: LookAhead: Preventing DeFi Attacks via Unveiling Adversarial ContractsComments: 21 pages, 7 figuresSubjects: Cryptography and Security (cs.CR)
Decentralized Finance (DeFi) incidents stemming from the exploitation of smart contract vulnerabilities have culminated in financial damages exceeding 3 billion US dollars. Existing defense mechanisms typically focus on detecting and reacting to malicious transactions executed by attackers that target victim contracts. However, with the emergence of private transaction pools where transactions are sent directly to miners without first appearing in public mempools, current detection tools face significant challenges in identifying attack activities effectively.
Based on the fact that most attack logic rely on deploying one or more intermediate smart contracts as supporting components to the exploitation of victim contracts, in this paper, we propose a new direction for detecting DeFi attacks that focuses on identifying adversarial contracts instead of adversarial transactions. Our approach allows us to leverage common attack patterns, code semantics and intrinsic characteristics found in malicious smart contracts to build the LookAhead system based on Machine Learning (ML) classifiers and a transformer model that is able to effectively distinguish adversarial contracts from benign ones, and make just-in-time predictions of potential zero-day attacks. Our contributions are three-fold: First, we construct a comprehensive dataset consisting of features extracted and constructed from recent contracts deployed on the Ethereum and BSC blockchains. Secondly, we design a condensed representation of smart contract programs called Pruned Semantic-Control Flow Tokenization (PSCFT) and use it to train a combination of ML models that understand the behaviour of malicious codes based on function calls, control flows and other pattern-conforming features. Lastly, we provide the complete implementation of LookAhead and the evaluation of its performance metrics for detecting adversarial contracts. - [434] arXiv:2401.09384 (replaced) [pdf, html, other]
-
Title: Diverse Part Synthesis for 3D Shape CreationSubjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Methods that use neural networks for synthesizing 3D shapes in the form of a part-based representation have been introduced over the last few years. These methods represent shapes as a graph or hierarchy of parts and enable a variety of applications such as shape sampling and reconstruction. However, current methods do not allow easily regenerating individual shape parts according to user preferences. In this paper, we investigate techniques that allow the user to generate multiple, diverse suggestions for individual parts. Specifically, we experiment with multimodal deep generative models that allow sampling diverse suggestions for shape parts and focus on models which have not been considered in previous work on shape synthesis. To provide a comparative study of these techniques, we introduce a method for synthesizing 3D shapes in a part-based representation and evaluate all the part suggestion techniques within this synthesis method. In our method, which is inspired by previous work, shapes are represented as a set of parts in the form of implicit functions which are then positioned in space to form the final shape. Synthesis in this representation is enabled by a neural network architecture based on an implicit decoder and a spatial transformer. We compare the various multimodal generative models by evaluating their performance in generating part suggestions. Our contribution is to show with qualitative and quantitative evaluations which of the new techniques for multimodal part generation perform the best and that a synthesis method based on the top-performing techniques allows the user to more finely control the parts that are generated in the 3D shapes while maintaining high shape fidelity when reconstructing shapes.
- [435] arXiv:2401.10268 (replaced) [pdf, other]
-
Title: The complementary contributions of academia and industry to AI researchLizhen Liang (Syracuse University), Han Zhuang (Northeastern University), James Zou (Stanford University), Daniel E. Acuna (University of Colorado at Boulder)Comments: 35 pages, 11 figuresSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Artificial intelligence (AI) has seen fast paced development in industry and academia. However, striking recent advances by industry have stunned the field, inviting a fresh perspective on the role of academic research on this progress. Here, we characterize the impact and type of AI produced by both environments over the last 25 years and establish several patterns. We find that articles published by teams consisting exclusively of industry researchers tend to get greater attention, with a higher chance of being highly cited and citation-disruptive, and several times more likely to produce state-of-the-art models. In contrast, we find that exclusively academic teams publish the bulk of AI research and tend to produce higher novelty work, with single papers having several times higher likelihood of being unconventional and atypical. The respective impact-novelty advantages of industry and academia are robust to controls for subfield, team size, seniority, and prestige. We find that academic-industry collaborations produce the most impactful work overall but do not have the novelty level of academic teams. Together, our findings identify the unique and nearly irreplaceable contributions that both academia and industry make toward the progress of AI.
- [436] arXiv:2401.10745 (replaced) [pdf, html, other]
-
Title: Ethical Artificial Intelligence Principles and Guidelines for the Governance and Utilization of Highly Advanced Large Language ModelsComments: 5 pages, accepted to workshop on Responsible Language Models (ReLM) at Association of the Advancement of Artificial Intelligence Conference (AAAI 2024)Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Given the success of ChatGPT, LaMDA and other large language models (LLMs), there has been an increase in development and usage of LLMs within the technology sector and other sectors. While the level in which LLMs has not reached a level where it has surpassed human intelligence, there will be a time when it will. Such LLMs can be referred to as advanced LLMs. Currently, there are limited usage of ethical artificial intelligence (AI) principles and guidelines addressing advanced LLMs due to the fact that we have not reached that point yet. However, this is a problem as once we do reach that point, we will not be adequately prepared to deal with the aftermath of it in an ethical and optimal way, which will lead to undesired and unexpected consequences. This paper addresses this issue by discussing what ethical AI principles and guidelines can be used to address highly advanced LLMs.
- [437] arXiv:2401.13256 (replaced) [pdf, html, other]
-
Title: UniMS-RAG: A Unified Multi-source Retrieval-Augmented Generation for Personalized Dialogue SystemsHongru Wang, Wenyu Huang, Yang Deng, Rui Wang, Zezhong Wang, Yufei Wang, Fei Mi, Jeff Z. Pan, Kam-Fai WongSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) has shown exceptional capabilities in many natual language understanding and generation tasks. However, the personalization issue still remains a much-coveted property, especially when it comes to the multiple sources involved in the dialogue system. To better plan and incorporate the use of multiple sources in generating personalized response, we firstly decompose it into three sub-tasks: Knowledge Source Selection, Knowledge Retrieval, and Response Generation. We then propose a novel Unified Multi-Source Retrieval-Augmented Generation system (UniMS-RAG) Specifically, we unify these three sub-tasks with different formulations into the same sequence-to-sequence paradigm during the training, to adaptively retrieve evidences and evaluate the relevance on-demand using special tokens, called acting tokens and evaluation tokens. Enabling language models to generate acting tokens facilitates interaction with various knowledge sources, allowing them to adapt their behavior to diverse task requirements. Meanwhile, evaluation tokens gauge the relevance score between the dialogue context and the retrieved evidence. In addition, we carefully design a self-refinement mechanism to iteratively refine the generated response considering 1) the consistency scores between the generated response and retrieved evidence; and 2) the relevance scores. Experiments on two personalized datasets (DuLeMon and KBP) show that UniMS-RAG achieves state-of-the-art performance on the knowledge source selection and response generation task with itself as a retriever in a unified manner. Extensive analyses and discussions are provided for shedding some new perspectives for personalized dialogue systems.
- [438] arXiv:2401.14818 (replaced) [pdf, html, other]
-
Title: ChemDFM: A Large Language Foundation Model for ChemistryZihan Zhao, Da Ma, Lu Chen, Liangtai Sun, Zihao Li, Yi Xia, Bo Chen, Hongshen Xu, Zichen Zhu, Su Zhu, Shuai Fan, Guodong Shen, Kai Yu, Xin ChenComments: 10 pages, 12 figures, 12 tables. Under ReviewSubjects: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Artificial intelligence (AI) has played an increasingly important role in chemical research. However, most models currently used in chemistry are specialist models that require training and tuning for specific tasks. A more generic and efficient solution would be an AI model that could address many tasks and support free-form dialogue in the broad field of chemistry. In its utmost form, such a generalist AI chemist could be referred to as Chemical General Intelligence. Large language models (LLMs) have recently logged tremendous success in the general domain of natural language processing, showing emerging task generalization and free-form dialogue capabilities. However, domain knowledge of chemistry is largely missing when training general-domain LLMs. The lack of such knowledge greatly hinders the performance of generalist LLMs in the field of chemistry. To this end, we develop ChemDFM, a pioneering LLM for chemistry trained on 34B tokens from chemical literature and textbooks, and fine-tuned using 2.7M instructions. As a result, it can understand and reason with chemical knowledge in free-form dialogue. Quantitative evaluations show that ChemDFM significantly surpasses most representative open-source LLMs. It outperforms GPT-4 on a great portion of chemical tasks, despite the substantial size difference. We will open-source ChemDFM for the broader community of AI and chemistry.
- [439] arXiv:2402.02108 (replaced) [pdf, html, other]
-
Title: From Synthetic to Real: Unveiling the Power of Synthetic Data for Video Person Re-IDSubjects: Computer Vision and Pattern Recognition (cs.CV)
In this study, we investigate the novel challenge of cross-domain video-based person re-identification (Re-ID). Here, we utilize synthetic video datasets as the source domain for training and real-world videos for testing, notably reducing the reliance on expensive real data acquisition and annotation. To harness the potential of synthetic data, we first propose a self-supervised domain-invariant feature learning strategy for both static and dynamic (temporal) features. Additionally, to enhance person identification accuracy in the target domain, we propose a mean-teacher scheme incorporating a self-supervised ID consistency loss. Experimental results across five real datasets validate the rationale behind cross-synthetic-real domain adaptation and demonstrate the efficacy of our method. Notably, the discovery that synthetic data outperforms real data in the cross-domain scenario is a surprising outcome. The code and data will be publicly available at this https URL.
- [440] arXiv:2402.03144 (replaced) [pdf, other]
-
Title: Computing Generic Fibers of Polynomial Ideals with FGLM and Hensel LiftingJournal-ref: 49th International Symposium on Symbolic and Algebraic Computation, Jul 2024, Raleigh, NC, United States. pp.307-315Subjects: Symbolic Computation (cs.SC); Commutative Algebra (math.AC)
We describe a version of the FGLM algorithm that can be used to compute generic fibers of positive-dimensional polynomial ideals. It combines the FGLM algorithm with a Hensel lifting strategy. In analogy with Hensel lifting, we show that this algorithm has a complexity quasi-linear in the number of terms of certain $\mathfrak{m}$-adic expansions we compute. Some provided experimental data also demonstrates the practical efficacy of our algorithm.
- [441] arXiv:2402.03944 (replaced) [pdf, html, other]
-
Title: Capturing the Unseen: Vision-Free Facial Motion Capture Using Inertial Measurement UnitsYoujia Wang, Yiwen Wu, Hengan Zhou, Hongyang Lin, Xingyue Peng, Jingyan Zhang, Yingsheng Zhu, Yingwenqi Jiang, Yatu Zhang, Lan Xu, Jingya Wang, Jingyi YuSubjects: Computer Vision and Pattern Recognition (cs.CV)
We present Capturing the Unseen (CAPUS), a novel facial motion capture (MoCap) technique that operates without visual signals. CAPUS leverages miniaturized Inertial Measurement Units (IMUs) as a new sensing modality for facial motion capture. While IMUs have become essential in full-body MoCap for their portability and independence from environmental conditions, their application in facial MoCap remains underexplored. We address this by customizing micro-IMUs, small enough to be placed on the face, and strategically positioning them in alignment with key facial muscles to capture expression dynamics. CAPUS introduces the first facial IMU dataset, encompassing both IMU and visual signals from participants engaged in diverse activities such as multilingual speech, facial expressions, and emotionally intoned auditions. We train a Transformer Diffusion-based neural network to infer Blendshape parameters directly from IMU data. Our experimental results demonstrate that CAPUS reliably captures facial motion in conditions where visual-based methods struggle, including facial occlusions, rapid movements, and low-light environments. Additionally, by eliminating the need for visual inputs, CAPUS offers enhanced privacy protection, making it a robust solution for vision-free facial MoCap.
- [442] arXiv:2402.04454 (replaced) [pdf, html, other]
-
Title: Evolving Mobile Cloud Gaming with 5G Standalone Network TelemetrySubjects: Networking and Internet Architecture (cs.NI)
Mobile cloud gaming places the simultaneous demands of high capacity and low latency on the wireless network, demands that Private and Metropolitan-Area Standalone 5G networks are poised to meet. However, lacking introspection into the 5G Radio Access Network (RAN), cloud gaming servers are ill-poised to cope with the vagaries of the wireless last hop to a mobile client, while 5G network operators run mostly closed networks, limiting their potential for co-design with the wider internet and user applications. This paper presents Telesa, a passive, incrementally-deployable, and independently-deployable Standalone 5G network telemetry system that streams fine-grained RAN capacity, latency, and retransmission information to application servers to enable better millisecond scale, application-level decisions on offered load and bit rate adaptation than end-to-end latency measurements or end-to-end packet losses currently permit. We design, implement, and evaluate a Telesa telemetry-enhanced game streaming platform, demonstrating exact congestion-control that can better adapt game video bitrate while simultaneously controlling end-to-end latency, thus maximizing game quality of experience. Our experimental evaluation on a production 5G Standalone network demonstrates a 178-249% Quality of Experience improvement versus two state-of-the-art cloud gaming applications.
- [443] arXiv:2402.07127 (replaced) [pdf, html, other]
-
Title: Learning by Watching: A Review of Video-based Learning Approaches for Robot ManipulationComments: Submitted at IEEE AccessSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Robot learning of manipulation skills is hindered by the scarcity of diverse, unbiased datasets. While curated datasets can help, challenges remain in generalizability and real-world transfer. Meanwhile, large-scale "in-the-wild" video datasets have driven progress in computer vision through self-supervised techniques. Translating this to robotics, recent works have explored learning manipulation skills by passively watching abundant videos sourced online. Showing promising results, such video-based learning paradigms provide scalable supervision while reducing dataset bias. This survey reviews foundations such as video feature representation learning techniques, object affordance understanding, 3D hand/body modeling, and large-scale robot resources, as well as emerging techniques for acquiring robot manipulation skills from uncontrolled video demonstrations. We discuss how learning only from observing large-scale human videos can enhance generalization and sample efficiency for robotic manipulation. The survey summarizes video-based learning approaches, analyses their benefits over standard datasets, survey metrics, and benchmarks, and discusses open challenges and future directions in this nascent domain at the intersection of computer vision, natural language processing, and robot learning.
- [444] arXiv:2402.08566 (replaced) [pdf, html, other]
-
Title: Gaussian-Sum Filter for Range-based 3D Relative Pose Estimation in the Presence of AmbiguitiesComments: 7 pages, 9 figures, submitted to IEEE Conference on Control Technology and Applications 2024Subjects: Robotics (cs.RO)
Multi-robot systems must have the ability to accurately estimate relative states between robots in order to perform collaborative tasks, possibly with no external aiding. Three-dimensional relative pose estimation using range measurements oftentimes suffers from a finite number of non-unique solutions, or ambiguities. This paper: 1) identifies and accurately estimates all possible ambiguities in 2D; 2) treats them as components of a Gaussian mixture model; and 3) presents a computationally-efficient estimator, in the form of a Gaussian-sum filter (GSF), to realize range-based relative pose estimation in an infrastructure-free, 3D, setup. This estimator is evaluated in simulation and experiment and is shown to avoid divergence to local minima induced by the ambiguous poses. Furthermore, the proposed GSF outperforms an extended Kalman filter, demonstrates similar performance to the computationally-demanding particle filter, and is shown to be consistent.
- [445] arXiv:2402.09117 (replaced) [pdf, html, other]
-
Title: Deterministic identification over channels with finite output: a dimensional perspective on superlinear ratesComments: 39 pages, 5 figuresSubjects: Information Theory (cs.IT); Quantum Physics (quant-ph)
Following initial work by JaJa, Ahlswede and Cai, and inspired by a recent renewed surge in interest in deterministic identification (DI) via noisy channels, we consider the problem in its generality for memoryless channels with finite output, but arbitrary input alphabets. Such a channel is essentially given by its output distributions as a subset in the probability simplex. Our main findings are that the maximum length of messages thus identifiable scales superlinearly as $R\,n\log n$ with the block length $n$, and that the optimal rate $R$ is bounded in terms of the covering (aka Minkowski, or Kolmogorov, or entropy) dimension $d$ of a certain algebraic transformation of the output set: $\frac14 d \leq R \leq \frac12 d$. Remarkably, both the lower and upper Minkowski dimensions play a role in this result. Along the way, we present a "Hypothesis Testing Lemma" showing that it is sufficient to ensure pairwise reliable distinguishability of the output distributions to construct a DI code. Although we do not know the exact capacity formula, we can conclude that the DI capacity exhibits superactivation: there exist channels whose capacities individually are zero, but whose product has positive capacity. We also generalise these results to classical-quantum channels with finite-dimensional output quantum system, in particular to quantum channels on finite-dimensional quantum systems under the constraint that the identification code can only use tensor product inputs.
- [446] arXiv:2402.11420 (replaced) [pdf, html, other]
-
Title: Rethinking the Roles of Large Language Models in Chinese Grammatical Error CorrectionYinghui Li, Shang Qin, Haojing Huang, Yangning Li, Libo Qin, Xuming Hu, Wenhao Jiang, Hai-Tao Zheng, Philip S. YuSubjects: Computation and Language (cs.CL)
Recently, Large Language Models (LLMs) have been widely studied by researchers for their roles in various downstream NLP tasks. As a fundamental task in the NLP field, Chinese Grammatical Error Correction (CGEC) aims to correct all potential grammatical errors in the input sentences. Previous studies have shown that LLMs' performance as correctors on CGEC remains unsatisfactory due to its challenging task focus. To promote the CGEC field to better adapt to the era of LLMs, we rethink the roles of LLMs in the CGEC task so that they can be better utilized and explored in CGEC. Considering the rich grammatical knowledge stored in LLMs and their powerful semantic understanding capabilities, we utilize LLMs as explainers to provide explanation information for the CGEC small models during error correction to enhance performance. We also use LLMs as evaluators to bring more reasonable CGEC evaluations, thus alleviating the troubles caused by the subjectivity of the CGEC task. In particular, our work is also an active exploration of how LLMs and small models better collaborate in downstream tasks. Extensive experiments and detailed analyses on widely used datasets verify the effectiveness of our thinking intuition and the proposed methods.
- [447] arXiv:2402.11520 (replaced) [pdf, html, other]
-
Title: Cross-Attention Fusion of Visual and Geometric Features for Large Vocabulary Arabic LipreadingComments: submitted for reviewSubjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Lipreading involves using visual data to recognize spoken words by analyzing the movements of the lips and surrounding area. It is a hot research topic with many potential applications, such as human-machine interaction and enhancing audio speech recognition. Recent deep-learning based works aim to integrate visual features extracted from the mouth region with landmark points on the lip contours. However, employing a simple combination method such as concatenation may not be the most effective approach to get the optimal feature vector. To address this challenge, firstly, we propose a cross-attention fusion-based approach for large lexicon Arabic vocabulary to predict spoken words in videos. Our method leverages the power of cross-attention networks to efficiently integrate visual and geometric features computed on the mouth region. Secondly, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset containing 20,000 videos for 100-word classes, uttered by 36 speakers. The experimental results obtained on LRW-AR and ArabicVisual databases showed the effectiveness and robustness of the proposed approach in recognizing Arabic words. Our work provides insights into the feasibility and effectiveness of applying lipreading techniques to the Arabic language, opening doors for further research in this field. Link to the project page: this https URL
- [448] arXiv:2402.11813 (replaced) [pdf, html, other]
-
Title: A novel framework for adaptive stress testing of autonomous vehicles in multi-lane roadsSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Stress testing is an approach for evaluating the reliability of systems under extreme conditions which help reveal vulnerable scenarios that standard testing may overlook. Identifying such scenarios is of great importance in autonomous vehicles (AV) and other safety-critical systems. Since failure events are rare, naive random search approaches require a large number of vehicle operation hours to identify potential system failures. Adaptive Stress Testing (AST) is a method addressing this constraint by effectively exploring the failure trajectories of AV using a Markov decision process and employs reinforcement learning techniques to identify driving scenarios with high probability of failures. However, existing AST frameworks are able to handle only simple scenarios, such as one vehicle moving longitudinally on a single lane road which is not realistic and has a limited applicability. In this paper, we propose a novel AST framework to systematically explore corner cases of intelligent driving models that can result in safety concerns involving both longitudinal and lateral vehicle's movements. Specially, we develop a new reward function for Deep Reinforcement Learning to guide the AST in identifying crash scenarios based on the collision probability estimate between the AV under test (i.e., the ego vehicle) and the trajectory of other vehicles on the multi-lane roads. To demonstrate the effectiveness of our framework, we tested it with a complex driving model vehicle that can be controlled in both longitudinal and lateral directions. Quantitative and qualitative analyses of our experimental results demonstrate that our framework outperforms the state-of-the-art AST scheme in identifying corner cases with complex driving maneuvers.
- [449] arXiv:2402.12110 (replaced) [pdf, html, other]
-
Title: The Complexity of Geodesic Spanners using Steiner PointsComments: 25 pages, 10 figuresSubjects: Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS)
A geometric $t$-spanner $\mathcal{G}$ on a set $S$ of $n$ point sites in a metric space $P$ is a subgraph of the complete graph on $S$ such that for every pair of sites $p,q$ the distance in $\mathcal{G}$ is a most $t$ times the distance $d(p,q)$ in $P$. We call a connection between two sites a \emph{link}. In some settings, such as when $P$ is a simple polygon with $m$ vertices and a link is a shortest path in $P$, links can consist of $\Theta (m)$ segments and thus have non-constant complexity. The spanner complexity is a measure of how compact a spanner is, which is equal to the sum of the complexities of all links in the spanner. In this paper, we study what happens if we are allowed to introduce $k$ Steiner points to reduce the spanner complexity. We study such Steiner spanners in simple polygons, polygonal domains, and edge-weighted trees. We show that Steiner points have only limited utility. For a spanner that uses $k$ Steiner points, we provide an $\Omega(mn^{1/(t+1)}/k^{1/(t+1)})$ lower bound on the worst-case complexity of any $(t-\varepsilon)$-spanner, for any constant $\varepsilon \in (0,1)$ and integer constant $t \geq 2$. Additionally, we show NP-hardness for the problem of deciding whether a set of sites in a polygonal domain admits a $3$-spanner with a given maximum complexity using $k$ Steiner points. On the positive side, for trees we show how to build a $2t$-spanner that uses $k$ Steiner points of complexity $O(mn^{1/t}/k^{1/t} + n \log (n/k))$, for any integer $t \geq 1$. We generalize this to forests, and use it to obtain a $2\sqrt{2}t$-spanner in a simple polygon with complexity $O(mn^{1/t}(\log k)^{1+1/t}/k^{1/t} + n\log^2 n)$. When a link can be any path between two sites, we show how to improve the spanning ratio to $(2k+\varepsilon)$, for any constant $\varepsilon \in (0,2k)$, and how to build a $6t$-spanner in a polygonal domain with the same complexity.
- [450] arXiv:2402.12660 (replaced) [pdf, html, other]
-
Title: SingVisio: Visual Analytics of Diffusion Model for Singing Voice ConversionSubjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also facilitates side-by-side comparisons of different conditions, such as source content, melody, and target timbre, highlighting the impact of these conditions on the diffusion generation process and resulting conversions. Through comparative and comprehensive evaluations, SingVisio demonstrates its effectiveness in terms of system design, functionality, explainability, and user-friendliness. It offers users of various backgrounds valuable learning experiences and insights into the diffusion model for singing voice conversion.
- [451] arXiv:2402.16933 (replaced) [pdf, html, other]
-
Title: Incremental Concept Formation over Visual Images Without Catastrophic ForgettingComments: Accepted by The Eleventh Annual Conference on Advances in Cognitive SystemsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Deep neural networks have excelled in machine learning, particularly in vision tasks, however, they often suffer from catastrophic forgetting when learning new tasks sequentially. In this work, we introduce Cobweb4V, an alternative to traditional neural network approaches. Cobweb4V is a novel visual classification method that builds on Cobweb, a human like learning system that is inspired by the way humans incrementally learn new concepts over time. In this research, we conduct a comprehensive evaluation, showcasing Cobweb4Vs proficiency in learning visual concepts, requiring less data to achieve effective learning outcomes compared to traditional methods, maintaining stable performance over time, and achieving commendable asymptotic behavior, without catastrophic forgetting effects. These characteristics align with learning strategies in human cognition, positioning Cobweb4V as a promising alternative to neural network approaches.
- [452] arXiv:2402.18603 (replaced) [pdf, html, other]
-
Title: MMSR: Symbolic Regression is a Multi-Modal Information Fusion TaskComments: The Information Fusion has accepted this paperSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Mathematical formulas are the crystallization of human wisdom in exploring the laws of nature for thousands of years. Describing the complex laws of nature with a concise mathematical formula is a constant pursuit of scientists and a great challenge for artificial intelligence. This field is called symbolic regression (SR). Symbolic regression was originally formulated as a combinatorial optimization problem, and Genetic Programming (GP) and Reinforcement Learning algorithms were used to solve it. However, GP is sensitive to hyperparameters, and these two types of algorithms are inefficient. To solve this problem, researchers treat the mapping from data to expressions as a translation problem. And the corresponding large-scale pre-trained model is introduced. However, the data and expression skeletons do not have very clear word correspondences as the two languages do. Instead, they are more like two modalities (e.g., image and text). Therefore, in this paper, we proposed MMSR. The SR problem is solved as a pure multi-modal problem, and contrastive learning is also introduced in the training process for modal alignment to facilitate later modal feature fusion. It is worth noting that to better promote the modal feature fusion, we adopt the strategy of training contrastive learning loss and other losses at the same time, which only needs one-step training, instead of training contrastive learning loss first and then training other losses. Because our experiments prove training together can make the feature extraction module and feature fusion module wearing-in better. Experimental results show that compared with multiple large-scale pre-training baselines, MMSR achieves the most advanced results on multiple mainstream datasets including SRBench. Our code is open source at this https URL
- [453] arXiv:2402.18607 (replaced) [pdf, html, other]
-
Title: Exploring Privacy and Fairness Risks in Sharing Diffusion Models: An Adversarial PerspectiveSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Diffusion models have recently gained significant attention in both academia and industry due to their impressive generative performance in terms of both sampling quality and distribution coverage. Accordingly, proposals are made for sharing pre-trained diffusion models across different organizations, as a way of improving data utilization while enhancing privacy protection by avoiding sharing private data directly. However, the potential risks associated with such an approach have not been comprehensively examined.
In this paper, we take an adversarial perspective to investigate the potential privacy and fairness risks associated with the sharing of diffusion models. Specifically, we investigate the circumstances in which one party (the sharer) trains a diffusion model using private data and provides another party (the receiver) black-box access to the pre-trained model for downstream tasks. We demonstrate that the sharer can execute fairness poisoning attacks to undermine the receiver's downstream models by manipulating the training data distribution of the diffusion model. Meanwhile, the receiver can perform property inference attacks to reveal the distribution of sensitive features in the sharer's dataset. Our experiments conducted on real-world datasets demonstrate remarkable attack performance on different types of diffusion models, which highlights the critical importance of robust data auditing and privacy protection protocols in pertinent applications. - [454] arXiv:2402.19150 (replaced) [pdf, html, other]
-
Title: Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language ModelHao Cheng, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, Renjing XuComments: This paper is accepted by ECCV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, typographic attacks, which disrupt Vision-Language Models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), have also been expected to be a security threat to LVLMs. Firstly, we verify typographic attacks on current well-known commercial and open-source LVLMs and uncover the widespread existence of this threat. Secondly, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The Typographic Dataset not only considers the evaluation of typographic attacks under various multi-modal tasks but also evaluates the effects of typographic attacks, influenced by texts generated with diverse factors. Based on the evaluation results, we investigate the causes why typographic attacks impacting VLMs and LVLMs, leading to three highly insightful discoveries. During the process of further validating the rationality of our discoveries, we can reduce the performance degradation caused by typographic attacks from 42.07\% to 13.90\%. Code and Dataset are available in \href{this https URL}
- [455] arXiv:2403.00292 (replaced) [pdf, html, other]
-
Title: Enhancing Jailbreak Attacks with Diversity GuidanceSubjects: Computation and Language (cs.CL)
As large language models(LLMs) become commonplace in practical applications, the security issues of LLMs have attracted societal concerns. Although extensive efforts have been made to safety alignment, LLMs remain vulnerable to jailbreak attacks. We find that redundant computations limit the performance of existing jailbreak attack methods. Therefore, we propose DPP-based Stochastic Trigger Searching (DSTS), a new optimization algorithm for jailbreak attacks. DSTS incorporates diversity guidance through techniques including stochastic gradient search and DPP selection during optimization. Detailed experiments and ablation studies demonstrate the effectiveness of the algorithm. Moreover, we use the proposed algorithm to compute the risk boundaries for different LLMs, providing a new perspective on LLM safety evaluation.
- [456] arXiv:2403.01215 (replaced) [pdf, html, other]
-
Title: Efficient Algorithm Level Error Detection for Number-Theoretic Transform used for Kyber Assessed on FPGAs and ARMSubjects: Cryptography and Security (cs.CR)
Polynomial multiplication stands out as a highly demanding arithmetic process in the development of post-quantum cryptosystems. The importance of the number-theoretic transform (NTT) extends beyond post-quantum cryptosystems, proving valuable in enhancing existing security protocols such as digital signature schemes and hash functions. CRYSTALS-KYBER stands out as the sole public key encryption (PKE) algorithm chosen by the National Institute of Standards and Technology (NIST) in its third round selection, making it highly regarded as a leading post-quantum cryptography (PQC) solution. Due to the potential for errors to significantly disrupt the operation of secure, cryptographically-protected systems, compromising data integrity, and safeguarding against side-channel attacks initiated through faults it is essential to incorporate mitigating error detection schemes. This paper introduces algorithm level fault detection schemes in the NTT multiplication using Negative Wrapped Convolution and the NTT tailored for Kyber Round 3, representing a significant enhancement compared to previous research. We evaluate this through the simulation of a fault model, ensuring that the conducted assessments accurately mirror the obtained results. Consequently, we attain a notably comprehensive coverage of errors. Furthermore, we assess the performance of our efficient error detection scheme for Negative Wrapped Convolution on FPGAs to showcase its implementation and resource requirements. Through implementation of our error detection approach on Xilinx/AMD Zynq Ultrascale+ and Artix-7, we achieve a comparable throughput with just a 9% increase in area and 13% increase in latency compared to the original hardware implementations. Finally, we attained an error detection ratio of nearly 100% for the NTT operation in Kyber Round 3, with a clock cycle overhead of 16% on the Cortex-A72 processor.
- [457] arXiv:2403.04205 (replaced) [pdf, html, other]
-
Title: OGMP: Oracle Guided Multi-mode Policies for Agile and Versatile Robot ControlComments: 7 pages, 6 figuresSubjects: Robotics (cs.RO)
The efficacy of reinforcement learning for robot control relies on the tailored integration of task-specific priors and heuristics for effective exploration, which challenges their straightforward application to complex tasks and necessitates a unified approach. In this work, we define a general class for priors called oracles that generate state references when queried in a closed-loop manner during training. By bounding the permissible state around the oracle's ansatz, we propose a task-agnostic oracle-guided policy optimization. To enhance modularity, we introduce task-vital modes, showing that a policy mastering a compact set of modes and transitions can handle infinite-horizon tasks. For instance, to perform parkour on an infinitely long track, the policy must learn to jump, leap, pace, and transition between these modes effectively. We validate this approach in challenging bipedal control tasks: parkour and diving using a 16 DoF dynamic bipedal robot, HECTOR. Our method results in a single policy per task, solving parkour across diverse tracks and omnidirectional diving from varied heights up to 2m in simulation, showcasing versatile agility. We demonstrate successful sim-to-real transfer of parkour, including leaping over gaps up to 105 % of the leg length, jumping over blocks up to 20 % of the robot's nominal height, and pacing at speeds of up to 0.6 m/s, along with effective transitions between these modes in the real robot.
- [458] arXiv:2403.07623 (replaced) [pdf, html, other]
-
Title: Empowering Sequential Recommendation from Collaborative Signals and Semantic RelatednessMingyue Cheng, Hao Zhang, Qi Liu, Fajie Yuan, Zhi Li, Zhenya Huang, Enhong Chen, Jun Zhou, Longfei LiComments: Accepted By DASFAA 2024Subjects: Information Retrieval (cs.IR)
Sequential recommender systems (SRS) could capture dynamic user preferences by modeling historical behaviors ordered in time. Despite effectiveness, focusing only on the \textit{collaborative signals} from behaviors does not fully grasp user interests. It is also significant to model the \textit{semantic relatedness} reflected in content features, e.g., images and text. Towards that end, in this paper, we aim to enhance the SRS tasks by effectively unifying collaborative signals and semantic relatedness together. Notably, we empirically point out that it is nontrivial to achieve such a goal due to semantic gap issues. Thus, we propose an end-to-end two-stream architecture for sequential recommendation, named TSSR, to learn user preferences from ID-based and content-based sequence. Specifically, we first present novel hierarchical contrasting module, including coarse user-grained and fine item-grained terms, to align the representations of inter-modality. Furthermore, we also design a two-stream architecture to learn the dependence of intra-modality sequence and the complex interactions of inter-modality sequence, which can yield more expressive capacity in understanding user interests. We conduct extensive experiments on five public datasets. The experimental results show that the TSSR could yield superior performance than competitive baselines. We also make our experimental codes publicly available at this https URL.
- [459] arXiv:2403.11482 (replaced) [pdf, html, other]
-
Title: SeisFusion: Constrained Diffusion Model with Input Guidance for 3D Seismic Data Interpolation and ReconstructionComments: in IEEE Transactions on Geoscience and Remote Sensing (2024)Subjects: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Geographical, physical, or economic constraints often result in missing traces within seismic data, making the reconstruction of complete seismic data a crucial step in seismic data processing. Traditional methods for seismic data reconstruction require the selection of multiple empirical parameters and struggle to handle large-scale continuous missing data. With the development of deep learning, various neural networks have demonstrated powerful reconstruction capabilities. However, these convolutional neural networks represent a point-to-point reconstruction approach that may not cover the entire distribution of the dataset. Consequently, when dealing with seismic data featuring complex missing patterns, such networks may experience varying degrees of performance degradation. In response to this challenge, we propose a novel diffusion model reconstruction framework tailored for 3D seismic data. To constrain the results generated by the diffusion model, we introduce conditional supervision constraints into the diffusion model, constraining the generated data of the diffusion model based on the input data to be reconstructed. We introduce a 3D neural network architecture into the diffusion model, successfully extending the 2D diffusion model to 3D space. Additionally, we refine the model's generation process by incorporating missing data into the generation process, resulting in reconstructions with higher consistency. Through ablation studies determining optimal parameter values, our method exhibits superior reconstruction accuracy when applied to both field datasets and synthetic datasets, effectively addressing a wide range of complex missing patterns. Our implementation is available at this https URL.
- [460] arXiv:2403.11484 (replaced) [pdf, html, other]
-
Title: Robot Navigation in Unknown and Cluttered Workspace with Dynamical System Modulation in Starshaped RoadmapSubjects: Robotics (cs.RO)
Compared to conventional decomposition methods that use ellipses or polygons to represent free space, starshaped representation can better capture the natural distribution of sensor data, thereby exploiting a larger portion of traversable space. This paper introduces a novel motion planning and control framework for navigating robots in unknown and cluttered environments using a dynamically constructed starshaped roadmap. Our approach generates a starshaped representation of the surrounding free space from real-time sensor data using piece-wise polynomials. Additionally, an incremental roadmap maintaining the connectivity information is constructed, and a searching algorithm efficiently selects short-term goals on this roadmap. Importantly, this framework addresses deadend situations with a graph updating mechanism. To ensure safe and efficient movement within the starshaped roadmap, we propose a reactive controller based on Dynamic System Modulation (DSM). This controller facilitates smooth motion within starshaped regions and their intersections, avoiding conservative and short-sighted behaviors and allowing the system to handle intricate obstacle configurations in unknown and cluttered environments. Comprehensive evaluations in both simulations and real-world experiments show that the proposed method achieves higher success rates and reduced travel times compared to other methods. It effectively manages intricate obstacle configurations, avoiding conservative and myopic behaviors.
- [461] arXiv:2403.13367 (replaced) [pdf, other]
-
Title: Quantifying the Aggregate Flexibility of EV Charging Stations for Dependable Congestion Management Products: A Dutch Case StudyComments: 17 Pages,12 figures, 2 tablesSubjects: Systems and Control (eess.SY)
Electric vehicles (EVs) play a crucial role in the transition towards sustainable modes of transportation and thus are critical to the energy transition. As their number grows, managing the aggregate power of EV charging is crucial to maintain grid stability and mitigate congestion. This study analyses more than 500 thousand real charging transactions in the Netherlands to explore the challenge and opportunity for the energy system presented by EV growth and smart charging flexibility. Specifically, it analyses the collective ability to provide congestion management services according to the specifications of those services in the Netherlands. In this study, a data-driven model of charging behaviour is created to explore the implications of delivering dependable congestion management services at various aggregation levels and types of service. The probability of offering specific grid services by different categories of charging stations (CS) is analysed. These probabilities can help EV aggregators, such as charging point operators, make informed decisions about offering congestion mitigation products per relevant regulations and distribution system operators to assess their potential. The ability to offer different flexibility products, namely re-dispatch and capacity limitation, for congestion management, is assessed using various dispatch strategies. Next, machine learning models are used to predict the probability of CSs being able to deliver these products, accounting for uncertainties. Results indicate that residential charging locations have significant potential to provide both products during evening peak hours. While shared EVs offer better certainty regarding arrival and departure times, their small fleet size currently restricts their ability to meet the minimum order size of flexible products.
- [462] arXiv:2403.13900 (replaced) [pdf, html, other]
-
Title: CoMo: Controllable Motion Generation through Language Guided Pose Code EditingSubjects: Computer Vision and Pattern Recognition (cs.CV)
Text-to-motion models excel at efficient human motion generation, but existing approaches lack fine-grained controllability over the generation process. Consequently, modifying subtle postures within a motion or inserting new actions at specific moments remains a challenge, limiting the applicability of these methods in diverse scenarios. In light of these challenges, we introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions by leveraging the knowledge priors of large language models (LLMs). Specifically, CoMo decomposes motions into discrete and semantically meaningful pose codes, with each code encapsulating the semantics of a body part, representing elementary information such as "left knee slightly bent". Given textual inputs, CoMo autoregressively generates sequences of pose codes, which are then decoded into 3D motions. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions. Experiments demonstrate that CoMo achieves competitive performance in motion generation compared to state-of-the-art models while, in human studies, CoMo substantially surpasses previous work in motion editing abilities.
- [463] arXiv:2403.15069 (replaced) [pdf, html, other]
-
Title: Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory SystemsComments: accepted by IEEE Transactions on ComputersSubjects: Hardware Architecture (cs.AR)
The advent of Transformers has revolutionized computer vision, offering a powerful alternative to convolutional neural networks (CNNs), especially with the local attention mechanism that excels at capturing local structures within the input and achieve state-of-the-art performance. Processing in-memory (PIM) architecture offers extensive parallelism, low data movement costs, and scalable memory bandwidth, making it a promising solution to accelerate Transformer with memory-intensive operations. However, the crucial issue lies in efficiently deploying an entire model onto resource-limited PIM system while parallelizing each transformer block with potentially many computational branches based on local-attention mechanisms.
We present Allspark, which focuses on workload orchestration for visual Transformers on PIM systems, aiming at minimizing inference latency. Firstly, to fully utilize the massive parallelism of PIM, Allspark employs a fine-grained partitioning scheme for computational branches, and formats a systematic layout and interleaved dataflow with maximized data locality and reduced data movement. Secondly, Allspark formulates the scheduling of the complete model on a resource-limited distributed PIM system as an integer linear programming (ILP) problem. Thirdly, as local-global data interactions exhibit complex yet regular dependencies, Allspark provides a two-stage placement method, which simplifies the challenging placement of computational branches on the PIM system into the structured layout and greedy-based binding, to minimize NoC communication costs. Extensive experiments on 3D-stacked DRAM-based PIM systems show that Allspark brings 1.2x-24.0x inference speedup for various visual Transformers over baselines. Compared to Nvidia V100 GPU, Allspark-enriched PIM system yields average speedups of 2.3x and energy savings of 20x-55x. - [464] arXiv:2403.16920 (replaced) [pdf, html, other]
-
Title: Optimal convergence rates of MCMC integration for functions with unbounded second momentComments: The title has changed, the proofs have been simplified, and some typos were correctedSubjects: Numerical Analysis (math.NA)
We study the Markov chain Monte Carlo (MCMC) estimator for numerical integration for functions that do not need to be square integrable w.r.t. the invariant distribution. For chains with a spectral gap we show that the absolute mean error for $L^p$ functions, with $p \in (1,2)$, decreases like $n^{1/p -1}$, which is known to be the optimal rate. This improves currently known results where an additional parameter $\delta>0$ appears and the convergence is of order $n^{(1+\delta)/p-1}$.
- [465] arXiv:2403.17774 (replaced) [pdf, html, other]
-
Title: Towards Over-Canopy Autonomous Navigation: Crop-Agnostic LiDAR-Based Crop-Row Detection in Arable FieldsComments: 6 pages, 9 figures, submitted to ICRA 2025 (Under Review)Subjects: Robotics (cs.RO)
Autonomous navigation is crucial for various robotics applications in agriculture. However, many existing methods depend on RTK-GPS devices, which can be susceptible to loss of radio signal or intermittent reception of corrections from the internet. Consequently, research has increasingly focused on using RGB cameras for crop-row detection, though challenges persist when dealing with grown plants. This paper introduces a LiDAR-based navigation system that can achieve crop-agnostic over-canopy autonomous navigation in row-crop fields, even when the canopy fully blocks the inter-row spacing. Our algorithm can detect crop rows across diverse scenarios, encompassing various crop types, growth stages, the presence of weeds, curved rows, and discontinuities. Without utilizing a global localization method (i.e., based on GPS), our navigation system can perform autonomous navigation in these challenging scenarios, detect the end of the crop rows, and navigate to the next crop row autonomously, providing a crop-agnostic approach to navigate an entire field. The proposed navigation system has undergone tests in various simulated and real agricultural fields, achieving an average cross-track error of 3.55cm without human intervention. The system has been deployed on a customized UGV robot, which can be reconfigured depending on the field conditions.
- [466] arXiv:2403.19001 (replaced) [pdf, html, other]
-
Title: Cross-domain Fiber Cluster Shape Analysis for Language Performance Cognitive Score PredictionYui Lo, Yuqian Chen, Dongnan Liu, Wan Liu, Leo Zekelman, Fan Zhang, Yogesh Rathi, Nikos Makris, Alexandra J. Golby, Weidong Cai, Lauren J. O'DonnellComments: This paper has been accepted for presentation at The 27th Intl. Conf. on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024) Workshop on Computational Diffusion MRI (CDMRI). 11 pages, 2 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Shape plays an important role in computer graphics, offering informative features to convey an object's morphology and functionality. Shape analysis in brain imaging can help interpret structural and functionality correlations of the human brain. In this work, we investigate the shape of the brain's 3D white matter connections and its potential predictive relationship to human cognitive function. We reconstruct brain connections as sequences of 3D points using diffusion magnetic resonance imaging (dMRI) tractography. To describe each connection, we extract 12 shape descriptors in addition to traditional dMRI connectivity and tissue microstructure features. We introduce a novel framework, Shape--fused Fiber Cluster Transformer (SFFormer), that leverages a multi-head cross-attention feature fusion module to predict subject-specific language performance based on dMRI tractography. We assess the performance of the method on a large dataset including 1065 healthy young adults. The results demonstrate that both the transformer-based SFFormer model and its inter/intra feature fusion with shape, microstructure, and connectivity are informative, and together, they improve the prediction of subject-specific language performance scores. Overall, our results indicate that the shape of the brain's connections is predictive of human language function.
- [467] arXiv:2403.19062 (replaced) [pdf, html, other]
-
Title: GENESIS-RL: GEnerating Natural Edge-cases with Systematic Integration of Safety considerations and Reinforcement LearningHsin-Jung Yang, Joe Beck, Md Zahid Hasan, Ekin Beyazit, Subhadeep Chakraborty, Tichakorn Wongpiromsarn, Soumik SarkarSubjects: Systems and Control (eess.SY); Robotics (cs.RO)
In the rapidly evolving field of autonomous systems, the safety and reliability of the system components are fundamental requirements. These components are often vulnerable to complex and unforeseen environments, making natural edge-case generation essential for enhancing system resilience. This paper presents GENESIS-RL, a novel framework that leverages system-level safety considerations and reinforcement learning techniques to systematically generate naturalistic edge cases. By simulating challenging conditions that mimic the real-world situations, our framework aims to rigorously test entire system's safety and reliability. Although demonstrated within the autonomous driving application, our methodology is adaptable across diverse autonomous systems. Our experimental validation, conducted on high-fidelity simulator underscores the overall effectiveness of this framework.
- [468] arXiv:2404.04351 (replaced) [pdf, html, other]
-
Title: Assisting humans in complex comparisons: automated information comparison at scaleComments: 11 pages, 7 figures, 5 tablesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Generative Large Language Models enable efficient analytics across knowledge domains, rivalling human experts in information comparisons. However, the applications of LLMs for information comparisons face scalability challenges due to the difficulties in maintaining information across large contexts and overcoming model token limitations. To address these challenges, we developed the novel Abstractive Summarization & Criteria-driven Comparison Endpoint (ASC$^2$End) system to automate information comparison at scale. Our system employs Semantic Text Similarity comparisons for generating evidence-supported analyses. We utilize proven data-handling strategies such as abstractive summarization and retrieval augmented generation to overcome token limitations and retain relevant information during model inference. Prompts were designed using zero-shot strategies to contextualize information for improved model reasoning. We evaluated abstractive summarization using ROUGE scoring and assessed the generated comparison quality using survey responses. Models evaluated on the ASC$^2$End system show desirable results providing insights on the expected performance of the system. ASC$^2$End is a novel system and tool that enables accurate, automated information comparison at scale across knowledge domains, overcoming limitations in context length and retrieval.
- [469] arXiv:2404.05859 (replaced) [pdf, html, other]
-
Title: Box FiltrationComments: 17 figuresSubjects: Computational Geometry (cs.CG); Algebraic Topology (math.AT)
We define a new framework that unifies the filtration and mapper approaches from TDA, and present efficient algorithms to compute it. Termed the box filtration of a PCD, we grow boxes (hyperrectangles) that are not necessarily centered at each point (in place of balls centered at points). We grow the boxes non-uniformly and asymmetrically in different dimensions based on the distribution of points. We present two approaches to handle the boxes: a point cover where each point is assigned its own box at start, and a pixel cover that works with a pixelization of the space of the PCD. Any box cover in either setting automatically gives a mapper of the PCD. We show that the persistence diagrams generated by the box filtration using both point and pixel covers satisfy the classical stability based on the Gromov-Hausdorff distance. Using boxes also implies that the box filtration is identical for pairwise or higher order intersections whereas the VR and Cech filtration are not the same.
Growth in each dimension is computed by solving a linear program (LP) that optimizes a cost functional balancing the cost of expansion and benefit of including more points in the box. The box filtration algorithm runs in $O(m|U(0)|\log(mn\pi)L(q))$ time, where $m$ is number of steps of increments considered for box growth, $|U(0)|$ is the number of boxes in the initial cover ($\leq$ number of points), $\pi$ is the step length for increasing each box dimension, each LP is solved in $O(L(q))$ time, $n$ is the PCD dimension, and $q = n \times |X|$. We demonstrate through multiple examples that the box filtration can produce more accurate results to summarize the topology of the PCD than VR and distance-to-measure (DTM) filtrations. Software for our implementation is available at this https URL. - [470] arXiv:2404.07527 (replaced) [pdf, html, other]
-
Title: Security Modelling for Cyber-Physical Systems: A Systematic Literature ReviewComments: Preprint under submissionSubjects: Cryptography and Security (cs.CR)
Cyber-physical systems (CPS) are at the intersection of digital technology and engineering domains, rendering them high-value targets of sophisticated and well-funded cybersecurity threat actors. Prominent cybersecurity attacks on CPS have brought attention to the vulnerability of these systems, and the inherent weaknesses of critical infrastructure reliant on CPS. Security modelling for CPS is an important mechanism to systematically identify and assess vulnerabilities, threats, and risks throughout system lifecycles, and to ultimately ensure system resilience, safety, and reliability. This survey delves into state-of-the-art research in CPS security modelling, encompassing both threat and attack modelling. While these terms are sometimes used interchangeably, they are different concepts. This article elaborates on the differences between threat and attack modelling, examining their implications for CPS security. A systematic search yielded 428 articles, from which 15 were selected and categorised into three clusters: those focused on threat modelling methods, attack modelling methods, and literature reviews. Specifically, we sought to examine what security modelling methods exist today, and how they address real-world cybersecurity threats and CPS-specific attacker capabilities throughout the lifecycle of CPS, which typically span longer durations compared to traditional IT systems. This article also highlights several limitations in existing research, wherein security models adopt simplistic approaches that do not adequately consider the dynamic, multi-layer, multi-path, and multi-agent characteristics of real-world cyber-physical attacks.
- [471] arXiv:2404.11064 (replaced) [pdf, html, other]
-
Title: Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based LocalizationYongdong Luo, Haojia Lin, Xiawu Zheng, Yigeng Jiang, Fei Chao, Jie Hu, Guannan Jiang, Songan Zhang, Rongrong JiSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, which require both shared and complementary information in localization and visual-language relationships. Therefore, existing approaches adopt the two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the performance of the detector, resulting in suboptimal performance. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model. In this way, the 3DVG model with a well-designed prompt as input can assist the 3DDC task by extracting localization information from the prompt. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection, effectively harnessing the existing 3DVG model's inherent localization capacity, thereby boosting 3DDC capability. This integration facilitates simultaneous multi-task training on both tasks, mutually enhancing their performance. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in CIDEr@0.5IoU in MLE training and improves upon the SOTA 3DVG method by 3.16% in Acc@0.25IoU. The codes are at this https URL.
- [472] arXiv:2404.11803 (replaced) [pdf, html, other]
-
Title: TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal AggregationComments: Accepted for 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. Thereby, the effectiveness of developed BEV encoders crucially depends on the operators used to aggregate temporal information and on the used latent representation spaces. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in image or in BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We consider subsequent image frames as stereo through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. The ablation uncovers a strong synergy of joint temporal aggregation in the image and BEV latent space. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.
- [473] arXiv:2404.13270 (replaced) [pdf, html, other]
-
Title: StrideNET: Swin Transformer for Terrain Recognition with Dynamic Roughness ExtractionComments: 6 pages, 5 figures, 3rd IEEE International Conference on Computer Vision and Machine Intelligence (IEEE CVMI)Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The field of remote-sensing image classification has seen immense progress with the rise of convolutional neural networks, and more recently, through vision transformers. These models, with their self-attention mechanism, can effectively capture global relationships and long-range dependencies between the image patches, in contrast with traditional convolutional models. This paper introduces StrideNET, a dual-branch transformer-based model developed for terrain recognition and surface roughness extraction. The terrain recognition branch employs the Swin Transformer to classify varied terrains by leveraging its capability to capture both local and global features. Complementing this, the roughness extraction branch utilizes a statistical texture-feature analysis technique to dynamically extract important land surface properties such as roughness and slipperiness. The model was trained on a custom dataset consisting of four terrain classes - grassy, marshy, sandy, and rocky, and it outperforms benchmark CNN and transformer based models, by achieving an average test accuracy of over 99 % across all classes. The applications of this work extend to different domains such as environmental monitoring, land use and cover classification, disaster response and precision agriculture.
- [474] arXiv:2404.14303 (replaced) [pdf, html, other]
-
Title: Orthogonal Laurent polynomials of two real variablesSubjects: Numerical Analysis (math.NA)
In this paper we consider an appropriate ordering of the Laurent monomials $x^{i}y^{j}$, $i,j \in \mathbb{Z}$ that allows us to study sequences of orthogonal Laurent polynomials of the real variables $x$ and $y$ with respect to a positive Borel measure $\mu$ defined on $\mathbb{R}^2$ such that $\{ x=0 \}\cup \{ y=0 \} \not\in \textrm{supp}(\mu)$. This ordering is suitable for considering the {\em multiplication plus inverse multiplication operator} on each varibale $\left( x+\frac{1}{x}\right.$ and $\left. y+\frac{1}{y}\right)$, and as a result we obtain five-term recurrence relations, Christoffel-Darboux and confluent formulas for the reproducing kernel and a related Favard's theorem. A connection with the one variable case is also presented, along with some applications for future research.
- [475] arXiv:2404.18713 (replaced) [pdf, html, other]
-
Title: Task and Domain Adaptive Reinforcement Learning for Robot ControlSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Deep reinforcement learning (DRL) has shown remarkable success in simulation domains, yet its application in designing robot controllers remains limited, due to its single-task orientation and insufficient adaptability to environmental changes. To overcome these limitations, we present a novel adaptive agent that leverages transfer learning techniques to dynamically adapt policy in response to different tasks and environmental conditions. The approach is validated through the blimp control challenge, where multitasking capabilities and environmental adaptability are essential. The agent is trained using a custom, highly parallelized simulator built on IsaacGym. We perform zero-shot transfer to fly the blimp in the real world to solve various tasks. We share our code at this https URL.
- [476] arXiv:2404.18982 (replaced) [pdf, other]
-
Title: Can ChatGPT Make Explanatory Inferences? Benchmarks for Abductive ReasoningSubjects: Artificial Intelligence (cs.AI)
Explanatory inference is the creation and evaluation of hypotheses that provide explanations, and is sometimes known as abduction or abductive inference. Generative AI is a new set of artificial intelligence models based on novel algorithms for generating text, images, and sounds. This paper proposes a set of benchmarks for assessing the ability of AI programs to perform explanatory inference, and uses them to determine the extent to which ChatGPT, a leading generative AI model, is capable of making explanatory inferences. Tests on the benchmarks reveal that ChatGPT performs creative and evaluative inferences in many domains, although it is limited to verbal and visual modalities. Claims that ChatGPT and similar models are incapable of explanation, understanding, causal reasoning, meaning, and creativity are rebutted.
- [477] arXiv:2405.05175 (replaced) [pdf, html, other]
-
Title: AirGapAgent: Protecting Privacy-Conscious Conversational AgentsEugene Bagdasarian, Ren Yi, Sahra Ghalebikesabi, Peter Kairouz, Marco Gruteser, Sewoong Oh, Borja Balle, Daniel RamageComments: at CCS'24Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
The growing use of large language model (LLM)-based conversational agents to manage sensitive user data raises significant privacy concerns. While these agents excel at understanding and acting on context, this capability can be exploited by malicious actors. We introduce a novel threat model where adversarial third-party apps manipulate the context of interaction to trick LLM-based agents into revealing private information not relevant to the task at hand.
Grounded in the framework of contextual integrity, we introduce AirGapAgent, a privacy-conscious agent designed to prevent unintended data leakage by restricting the agent's access to only the data necessary for a specific task. Extensive experiments using Gemini, GPT, and Mistral models as agents validate our approach's effectiveness in mitigating this form of context hijacking while maintaining core agent functionality. For example, we show that a single-query context hijacking attack on a Gemini Ultra agent reduces its ability to protect user data from 94% to 45%, while an AirGapAgent achieves 97% protection, rendering the same attack ineffective. - [478] arXiv:2405.06107 (replaced) [pdf, html, other]
-
Title: Transforming the Bootstrap: Using Transformers to Compute Scattering Amplitudes in Planar N = 4 Super Yang-Mills TheoryTianji Cai, Garrett W. Merz, François Charton, Niklas Nolte, Matthias Wilhelm, Kyle Cranmer, Lance J. DixonComments: 26+10 pages, 9 figures, 7 tables, application of machine learning aimed at physics and machine learning audience; v2: clarifications added, matches published versionJournal-ref: Mach.Learn.Sci.Tech. 5 (2024) 3, 035073Subjects: Machine Learning (cs.LG); Symbolic Computation (cs.SC); High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
We pursue the use of deep learning methods to improve state-of-the-art computations in theoretical high-energy physics. Planar N = 4 Super Yang-Mills theory is a close cousin to the theory that describes Higgs boson production at the Large Hadron Collider; its scattering amplitudes are large mathematical expressions containing integer coefficients. In this paper, we apply Transformers to predict these coefficients. The problem can be formulated in a language-like representation amenable to standard cross-entropy training objectives. We design two related experiments and show that the model achieves high accuracy (> 98%) on both tasks. Our work shows that Transformers can be applied successfully to problems in theoretical physics that require exact solutions.
- [479] arXiv:2405.06830 (replaced) [pdf, html, other]
-
Title: Towards Browser Controls to Protect Cookies from Malicious ExtensionsSubjects: Cryptography and Security (cs.CR)
Cookies maintain state across related web traffic. As such, cookies are commonly used for authentication by storing a user's session ID and replacing the need to re-enter credentials in subsequent traffic. These so-called ``session cookies'' are prime targets for attacks that aim to steal them to gain unauthorized access to user accounts. To mitigate these attacks, the Secure and HttpOnly cookie attributes limit a cookie's accessibility from malicious networks and websites. However, these controls overlook browser extensions: third-party HTML/JavaScript add-ons with access to privileged browser APIs and the ability to operate across multiple websites. Thus malicious or compromised extensions can provide unrestricted access to a user's session cookies.
In this work, we first analyze the prevalence of extensions with access to ``risky'' APIs (those that enable cookie modification and theft) and find that they have hundreds of millions of users. Motivated by this, we propose a mechanism to protect cookies from malicious extensions by introducing two new cookie attributes: BrowserOnly and Monitored. The BrowserOnly attribute prevents extension access to cookies altogether. While effective, not all cookies can be made inaccessible. Thus cookies with the Monitored attribute remain accessible but are tied to a single browser and any changes made to the cookie are logged. As a result, stolen Monitored cookies are unusable outside their original browser and servers can validate the modifications performed. To demonstrate the proposed functionalities, we design and implement CREAM (Cookie Restrictions for Extension Abuse Mitigation) a modified version of the open-source Chromium browser realizing these controls. Our evaluation indicates that CREAM effectively protects cookies from malicious extensions while incurring little run-time overheads. - [480] arXiv:2405.10504 (replaced) [pdf, other]
-
Title: Multi-scale Semantic Prior Features Guided Deep Neural Network for Urban Street-view ImageSubjects: Computer Vision and Pattern Recognition (cs.CV)
Street-view image has been widely applied as a crucial mobile mapping data source. The inpainting of street-view images is a critical step for street-view image processing, not only for the privacy protection, but also for the urban environment mapping applications. This paper presents a novel Deep Neural Network (DNN), multi-scale semantic prior Feature guided image inpainting Network (MFN) for inpainting street-view images, which generate static street-view images without moving objects (e.g., pedestrians, vehicles). To enhance global context understanding, a semantic prior prompter is introduced to learn rich semantic priors from large pre-trained model. We design the prompter by stacking multiple Semantic Pyramid Aggregation (SPA) modules, capturing a broad range of visual feature patterns. A semantic-enhanced image generator with a decoder is proposed that incorporates a novel cascaded Learnable Prior Transferring (LPT) module at each scale level. For each decoder block, an attention transfer mechanism is applied to capture long-term dependencies, and the semantic prior features are fused with the image features to restore plausible structure in an adaptive manner. Additionally, a background-aware data processing scheme is adopted to prevent the generation of hallucinated objects within holes. Experiments on Apolloscapes and Cityscapes datasets demonstrate better performance than state-of-the-art methods, with MAE, and LPIPS showing improvements of about 9.5% and 41.07% respectively. Visual comparison survey among multi-group person is also conducted to provide performance evaluation, and the results suggest that the proposed MFN offers a promising solution for privacy protection and generate more reliable scene for urban applications with street-view images.
- [481] arXiv:2405.12926 (replaced) [pdf, html, other]
-
Title: Trusting Fair Data: Leveraging Quality in Fairness-Driven Data Removal TechniquesComments: The Version of Record of this contribution is published in Springer LNCS 14912 and is available online at this https URLJournal-ref: Lecture Notes in Computer Science, Vol. 14912 (2024), pp. 375-380. SpringerSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
In this paper, we deal with bias mitigation techniques that remove specific data points from the training set to aim for a fair representation of the population in that set. Machine learning models are trained on these pre-processed datasets, and their predictions are expected to be fair. However, such approaches may exclude relevant data, making the attained subsets less trustworthy for further usage. To enhance the trustworthiness of prior methods, we propose additional requirements and objectives that the subsets must fulfill in addition to fairness: (1) group coverage, and (2) minimal data loss. While removing entire groups may improve the measured fairness, this practice is very problematic as failing to represent every group cannot be considered fair. In our second concern, we advocate for the retention of data while minimizing discrimination. By introducing a multi-objective optimization problem that considers fairness and data loss, we propose a methodology to find Pareto-optimal solutions that balance these objectives. By identifying such solutions, users can make informed decisions about the trade-off between fairness and data quality and select the most suitable subset for their application. Our method is distributed as a Python package via PyPI under the name FairDo (this https URL).
- [482] arXiv:2405.13014 (replaced) [pdf, html, other]
-
Title: QCRD: Quality-guided Contrastive Rationale Distillation for Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The deployment of large language models (LLMs) faces considerable challenges concerning resource constraints and inference efficiency. Recent research has increasingly focused on smaller, task-specific models enhanced by distilling knowledge from LLMs. However, prior studies have often overlooked the diversity and quality of knowledge, especially the untapped potential of negative knowledge. Constructing effective negative knowledge remains severely understudied. In this paper, we introduce a novel framework called quality-guided contrastive rationale distillation aimed at enhancing reasoning capabilities through contrastive knowledge learning. For positive knowledge, we enrich its diversity through temperature sampling and employ self-consistency for further denoising and refinement. For negative knowledge, we propose an innovative self-adversarial approach that generates low-quality rationales by sampling previous iterations of smaller language models, embracing the idea that one can learn from one's own weaknesses. A contrastive loss is developed to distill both positive and negative knowledge into smaller language models, where an online-updating discriminator is integrated to assess qualities of rationales and assign them appropriate weights, optimizing the training process. Through extensive experiments across multiple reasoning tasks, we demonstrate that our method consistently outperforms existing distillation techniques, yielding higher-quality rationales.
- [483] arXiv:2405.13785 (replaced) [pdf, html, other]
-
Title: Efficient Two-Stage Gaussian Process Regression Via Automatic Kernel Search and SubsamplingSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
Gaussian Process Regression (GPR) is widely used in statistics and machine learning for prediction tasks requiring uncertainty measures. Its efficacy depends on the appropriate specification of the mean function, covariance kernel function, and associated hyperparameters. Severe misspecifications can lead to inaccurate results and problematic consequences, especially in safety-critical applications. However, a systematic approach to handle these misspecifications is lacking in the literature. In this work, we propose a general framework to address these issues. Firstly, we introduce a flexible two-stage GPR framework that separates mean prediction and uncertainty quantification (UQ) to prevent mean misspecification, which can introduce bias into the model. Secondly, kernel function misspecification is addressed through a novel automatic kernel search algorithm, supported by theoretical analysis, that selects the optimal kernel from a candidate set. Additionally, we propose a subsampling-based warm-start strategy for hyperparameter initialization to improve efficiency and avoid hyperparameter misspecification. With much lower computational cost, our subsampling-based strategy can yield competitive or better performance than training exclusively on the full dataset. Combining all these components, we recommend two GPR methods-exact and scalable-designed to match available computational resources and specific UQ requirements. Extensive evaluation on real-world datasets, including UCI benchmarks and a safety-critical medical case study, demonstrates the robustness and precision of our methods.
- [484] arXiv:2405.15313 (replaced) [pdf, html, other]
-
Title: Enhancing Text-to-Image Editing via Hybrid Mask-Informed FusionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, text-to-image (T2I) editing has been greatly pushed forward by applying diffusion models. Despite the visual promise of the generated images, inconsistencies with the expected textual prompt remain prevalent. This paper aims to systematically improve the text-guided image editing techniques based on diffusion models, by addressing their limitations. Notably, the common idea in diffusion-based editing firstly reconstructs the source image via inversion techniques e.g., DDIM Inversion. Then following a fusion process that carefully integrates the source intermediate (hidden) states (obtained by inversion) with the ones of the target image. Unfortunately, such a standard pipeline fails in many cases due to the interference of texture retention and the new characters creation in some regions. To mitigate this, we incorporate human annotation as an external knowledge to confine editing within a ``Mask-informed'' region. Then we carefully Fuse the edited image with the source image and a constructed intermediate image within the model's Self-Attention module. Extensive empirical results demonstrate the proposed ``MaSaFusion'' significantly improves the existing T2I editing techniques.
- [485] arXiv:2405.16038 (replaced) [pdf, html, other]
-
Title: Rethinking Early-Fusion Strategies for Improved Multispectral Object DetectionComments: This paper has been accepted by IEEE T-IV journal. Please jump to External DOI to view the official versionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Most recent multispectral object detectors employ a two-branch structure to extract features from RGB and thermal images. While the two-branch structure achieves better performance than a single-branch structure, it overlooks inference efficiency. This conflict is increasingly aggressive, as recent works solely pursue higher performance rather than both performance and efficiency. In this paper, we address this issue by improving the performance of efficient single-branch structures. We revisit the reasons causing the performance gap between these structures. For the first time, we reveal the information interference problem in the naive early-fusion strategy adopted by previous single-branch structures. Besides, we find that the domain gap between multispectral images, and weak feature representation of the single-branch structure are also key obstacles for performance. Focusing on these three problems, we propose corresponding solutions, including a novel shape-priority early-fusion strategy, a weakly supervised learning method, and a core knowledge distillation technique. Experiments demonstrate that single-branch networks equipped with these three contributions achieve significant performance enhancements while retaining high efficiency. Our code will be available at \url{this https URL}.
- [486] arXiv:2405.18734 (replaced) [pdf, html, other]
-
Title: PillarHist: A Quantization-aware Pillar Feature Encoder based on Height-aware HistogramComments: 12 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Real-time and high-performance 3D object detection plays a critical role in autonomous driving and robotics. Recent pillar-based 3D object detectors have gained significant attention due to their compact representation and low computational overhead, making them suitable for onboard deployment and quantization. However, existing pillar-based detectors still suffer from information loss along height dimension and large numerical distribution difference during pillar feature encoding (PFE), which severely limits their performance and quantization potential. To address above issue, we first unveil the importance of different input information during PFE and identify the height dimension as a key factor in enhancing 3D detection performance. Motivated by this observation, we propose a height-aware pillar feature encoder, called PillarHist. Specifically, PillarHist statistics the discrete distribution of points at different heights within one pillar with the information entropy guidance. This simple yet effective design greatly preserves the information along the height dimension while significantly reducing the computation overhead of the PFE. Meanwhile, PillarHist also constrains the arithmetic distribution of PFE input to a stable range, making it quantization-friendly. Notably, PillarHist operates exclusively within the PFE stage to enhance performance, enabling seamless integration into existing pillar-based methods without introducing complex operations. Extensive experiments show the effectiveness of PillarHist in terms of both efficiency and performance.
- [487] arXiv:2405.19300 (replaced) [pdf, html, other]
-
Title: Measuring and Mitigating Bias for Tabular Datasets with Multiple Protected AttributesComments: 12 pages, 13 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Motivated by the recital (67) of the current corrigendum of the AI Act in the European Union, we propose and present measures and mitigation strategies for discrimination in tabular datasets. We specifically focus on datasets that contain multiple protected attributes, such as nationality, age, and sex. This makes measuring and mitigating bias more challenging, as many existing methods are designed for a single protected attribute. This paper comes with a twofold contribution: Firstly, new discrimination measures are introduced. These measures are categorized in our framework along with existing ones, guiding researchers and practitioners in choosing the right measure to assess the fairness of the underlying dataset. Secondly, a novel application of an existing bias mitigation method, FairDo, is presented. We show that this strategy can mitigate any type of discrimination, including intersectional discrimination, by transforming the dataset. By conducting experiments on real-world datasets (Adult, Bank, COMPAS), we demonstrate that de-biasing datasets with multiple protected attributes is possible. All transformed datasets show a reduction in discrimination, on average by 28%. Further, these datasets do not compromise any of the tested machine learning models' performances significantly compared to the original datasets. Conclusively, this study demonstrates the effectiveness of the mitigation strategy used and contributes to the ongoing discussion on the implementation of the European Union's AI Act.
- [488] arXiv:2405.20090 (replaced) [pdf, html, other]
-
Title: Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language ModelsHao Cheng, Erjia Xiao, Jiayan Yang, Jiahang Cao, Le Yang, Jize Zhang, Kaidi Xu, Jindong Gu, Renjing XuSubjects: Computer Vision and Pattern Recognition (cs.CV)
Following the advent of the Artificial Intelligence (AI) era of large models, Multimodal Large Language Models (MLLMs) with the ability to understand cross-modal interactions between vision and text have attracted wide attention. Adversarial examples with human-imperceptible perturbation are shown to possess a characteristic known as transferability, which means that a perturbation generated by one model could also mislead another different model. Augmenting the diversity in input data is one of the most significant methods for enhancing adversarial transferability. This method has been certified as a way to significantly enlarge the threat impact under black-box conditions. Research works also demonstrate that MLLMs can be exploited to generate adversarial examples in the white-box scenario. However, the adversarial transferability of such perturbations is quite limited, failing to achieve effective black-box attacks across different models. In this paper, we propose the Typographic-based Semantic Transfer Attack (TSTA), which is inspired by: (1) MLLMs tend to process semantic-level information; (2) Typographic Attack could effectively distract the visual information captured by MLLMs. In the scenarios of Harmful Word Insertion and Important Information Protection, our TSTA demonstrates superior performance.
- [489] arXiv:2405.20104 (replaced) [pdf, html, other]
-
Title: Object-centric Reconstruction and Tracking of Dynamic Unknown Objects using 3D Gaussian SplattingComments: Accepted at IEEE International Conference on Space Robotics 2024Subjects: Robotics (cs.RO)
Generalizable perception is one of the pillars of high-level autonomy in space robotics. Estimating the structure and motion of unknown objects in dynamic environments is fundamental for such autonomous systems. Traditionally, the solutions have relied on prior knowledge of target objects, multiple disparate representations, or low-fidelity outputs unsuitable for robotic operations. This work proposes a novel approach to incrementally reconstruct and track a dynamic unknown object using a unified representation -- a set of 3D Gaussian blobs that describe its geometry and appearance. The differentiable 3D Gaussian Splatting framework is adapted to a dynamic object-centric setting. The input to the pipeline is a sequential set of RGB-D images. 3D reconstruction and 6-DoF pose tracking tasks are tackled using first-order gradient-based optimization. The formulation is simple, requires no pre-training, assumes no prior knowledge of the object or its motion, and is suitable for online applications. The proposed approach is validated on a dataset of 10 unknown spacecraft of diverse geometry and texture under arbitrary relative motion. The experiments demonstrate successful 3D reconstruction and accurate 6-DoF tracking of the target object in proximity operations over a short to medium duration. The causes of tracking drift are discussed and potential solutions are outlined.
- [490] arXiv:2406.00116 (replaced) [pdf, html, other]
-
Title: A Sim2Real Approach for Identifying Task-Relevant Properties in Interpretable Machine LearningSubjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Explanations of an AI's function can assist human decision-makers, but the most useful explanation depends on the decision's context, referred to as the downstream task. User studies are necessary to determine the best explanations for each task. Unfortunately, testing every explanation and task combination is impractical, especially considering the many factors influencing human+AI collaboration beyond the explanation's content. This work leverages two insights to streamline finding the most effective explanation. First, explanations can be characterized by properties, such as faithfulness or complexity, which indicate if they contain the right information for the task. Second, we introduce XAIsim2real, a pipeline for running synthetic user studies. In our validation study, XAIsim2real accurately predicts user preferences across three tasks, making it a valuable tool for refining explanation choices before full studies. Additionally, it uncovers nuanced relationships, like how cognitive budget limits a user's engagement with complex explanations -- a trend confirmed with real users.
- [491] arXiv:2406.04829 (replaced) [pdf, html, other]
-
Title: IOR: Inversed Objects Replay for Incremental Object DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Existing Incremental Object Detection (IOD) methods partially alleviate catastrophic forgetting when incrementally detecting new objects in real-world scenarios. However, many of these methods rely on the assumption that unlabeled old-class objects may co-occur with labeled new-class objects in the incremental data. When unlabeled old-class objects are absent, the performance of existing methods tends to degrade. The absence can be mitigated by generating old-class samples, but it incurs high costs. This paper argues that previous generation-based IOD suffer from redundancy, both in the use of generative models, which require additional training and storage, and in the overproduction of generated samples, many of which do not contribute significantly to performance improvements. To eliminate the redundancy, we propose Inversed Object Replay (IOR). Specifically, we generate old-class samples by inversing the original detectors, thus eliminating the necessity of training and storing additional generative models. We propose augmented replay to reuse the objects in generated samples, reducing redundant generations. Moreover, we propose high-value knowledge distillation focusing on the positions of old-class objects overwhelmed by the background, which transfers the knowledge to the incremental detector. Extensive experiments conducted on MS COCO 2017 demonstrate that our method can efficiently improve detection performance in IOD scenarios with the absence of old-class objects.
- [492] arXiv:2406.05596 (replaced) [pdf, html, other]
-
Title: Aligning Human Knowledge with Visual Concepts Towards Explainable Medical Image ClassificationComments: MICCAI 2024 Early AcceptSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Although explainability is essential in the clinical diagnosis, most deep learning models still function as black boxes without elucidating their decision-making process. In this study, we investigate the explainable model development that can mimic the decision-making process of human experts by fusing the domain knowledge of explicit diagnostic criteria. We introduce a simple yet effective framework, Explicd, towards Explainable language-informed criteria-based diagnosis. Explicd initiates its process by querying domain knowledge from either large language models (LLMs) or human experts to establish diagnostic criteria across various concept axes (e.g., color, shape, texture, or specific patterns of diseases). By leveraging a pretrained vision-language model, Explicd injects these criteria into the embedding space as knowledge anchors, thereby facilitating the learning of corresponding visual concepts within medical images. The final diagnostic outcome is determined based on the similarity scores between the encoded visual concepts and the textual criteria embeddings. Through extensive evaluation of five medical image classification benchmarks, Explicd has demonstrated its inherent explainability and extends to improve classification performance compared to traditional black-box models. Code is available at \url{this https URL}.
- [493] arXiv:2406.12536 (replaced) [pdf, html, other]
-
Title: ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object DetectionJournal-ref: International Journal of Computer Vision (2024)Subjects: Computer Vision and Pattern Recognition (cs.CV)
With the rapid development of depth sensor, more and more RGB-D videos could be obtained. Identifying the foreground in RGB-D videos is a fundamental and important task. However, the existing salient object detection (SOD) works only focus on either static RGB-D images or RGB videos, ignoring the collaborating of RGB-D and video information. In this paper, we first collect a new annotated RGB-D video SOD (ViDSOD-100) dataset, which contains 100 videos within a total of 9,362 frames, acquired from diverse natural scenes. All the frames in each video are manually annotated to a high-quality saliency annotation. Moreover, we propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection. Our method aggregates the appearance information from an input RGB image, spatio-temporal information from an estimated motion map, and the geometry information from the depth map by devising three modality-specific branches and a multi-modality integration branch. The modality-specific branches extract the representation of different inputs, while the multi-modality integration branch combines the multi-level modality-specific features by introducing the encoder feature aggregation (MEA) modules and decoder feature aggregation (MDA) modules. The experimental findings conducted on both our newly introduced ViDSOD-100 dataset and the well-established DAVSOD dataset highlight the superior performance of the proposed ATF-Net. This performance enhancement is demonstrated both quantitatively and qualitatively, surpassing the capabilities of current state-of-the-art techniques across various domains, including RGB-D saliency detection, video saliency detection, and video object segmentation. Our data and our code are available at this http URL.
- [494] arXiv:2406.13605 (replaced) [pdf, html, other]
-
Title: Nicer Than Humans: How do Large Language Models Behave in the Prisoner's Dilemma?Comments: v1: 9 pages, 8 figures, 1 table v2: 11 pages, 14 figures, 1 table. Increased number of models studied, expanded results and conclusion, added references, corrected typosSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Physics and Society (physics.soc-ph)
The behavior of Large Language Models (LLMs) as artificial social agents is largely unexplored, and we still lack extensive evidence of how these agents react to simple social stimuli. Testing the behavior of AI agents in classic Game Theory experiments provides a promising theoretical framework for evaluating the norms and values of these agents in archetypal social situations. In this work, we investigate the cooperative behavior of three LLMs (Llama2, Llama3, and GPT3.5) when playing the Iterated Prisoner's Dilemma against random adversaries displaying various levels of hostility. We introduce a systematic methodology to evaluate an LLM's comprehension of the game rules and its capability to parse historical gameplay logs for decision-making. We conducted simulations of games lasting for 100 rounds and analyzed the LLMs' decisions in terms of dimensions defined in the behavioral economics literature. We find that all models tend not to initiate defection but act cautiously, favoring cooperation over defection only when the opponent's defection rate is low. Overall, LLMs behave at least as cooperatively as the typical human player, although our results indicate some substantial differences among models. In particular, Llama2 and GPT3.5 are more cooperative than humans, and especially forgiving and non-retaliatory for opponent defection rates below 30%. More similar to humans, Llama3 exhibits consistently uncooperative and exploitative behavior unless the opponent always cooperates. Our systematic approach to the study of LLMs in game theoretical scenarios is a step towards using these simulations to inform practices of LLM auditing and alignment.
- [495] arXiv:2406.15585 (replaced) [pdf, html, other]
-
Title: Ten Years of ZMapComments: 10 pages, 7 figures, in submission at Internet Measurement Conference 2024Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Since ZMap's debut in 2013, networking and security researchers have used the open-source scanner to write hundreds of research papers that study Internet behavior. In addition, ZMap has been adopted by the security industry to build new classes of enterprise security and compliance products. Over the past decade, much of ZMap's behavior -- ranging from its pseudorandom IP generation to its packet construction -- has evolved as we have learned more about how to scan the Internet. In this work, we quantify ZMap's adoption over the ten years since its release, describe its modern behavior (and the measurements that motivated changes), and offer lessons from releasing and maintaining ZMap for future tools.
- [496] arXiv:2406.15662 (replaced) [pdf, html, other]
-
Title: Matching Problems to Solutions: An Explainable Way of Solving Machine Learning ProblemsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Domain experts from all fields are called upon, working with data scientists, to explore the use of ML techniques to solve their problems. Starting from a domain problem/question, ML-based problem-solving typically involves three steps: (1) formulating the business problem (problem domain) as a data analysis problem (solution domain), (2) sketching a high-level ML-based solution pattern, given the domain requirements and the properties of the available data, and (3) designing and refining the different components of the solution pattern. There has to be a substantial body of ML problem solving knowledge that ML researchers agree on, and that ML practitioners routinely apply to solve the most common problems. Our work deals with capturing this body of knowledge, and embodying it in a ML problem solving workbench to helps domain specialists who are not ML experts to explore the ML solution space. This paper focuses on: 1) the representation of domain problems, ML problems, and the main ML solution artefacts, and 2) a heuristic matching function that helps identify the ML algorithm family that is most appropriate for the domain problem at hand, given the domain (expert) requirements, and the characteristics of the training data. We review related work and outline our strategy for validating the workbench
- [497] arXiv:2406.16487 (replaced) [pdf, html, other]
-
Title: Decomposing God Header File via Multi-View Graph ClusteringComments: Accepted by ICSME 2024Subjects: Software Engineering (cs.SE)
God Header Files, just like God Classes, pose significant challenges for code comprehension and maintenance. Additionally, they increase the time required for code recompilation. However, existing refactoring methods for God Classes are inappropriate to deal with God Header Files because the code elements in header files are mostly short declaration types, and build dependencies of the entire system should be considered with the aim of improving compilation efficiency. Meanwhile, ensuring acyclic dependencies among the decomposed sub-header files is also crucial in the God Header File decomposition. This paper proposes a multi-view graph clustering based approach for decomposing God Header Files. It first constructs and coarsens the code element graph, then a novel multi-view graph clustering algorithm is applied to identify the clusters and a heuristic algorithm is introduced to address the cyclic dependencies in the clustering results. To evaluate our approach, we built both a synthetic dataset and a real-world God Header Files dataset. The results show that 1) Our approach could achieve 11.5% higher accuracy than existing God Class refactoring methods; 2) Our decomposition results attain better architecture on real-world God Header Files, evidenced by higher modularity and acyclic dependencies; 3) We can reduce 15% to 60% recompilation time for historical commits that require recompiling.
- [498] arXiv:2406.17650 (replaced) [pdf, html, other]
-
Title: ELIZA Reinterpreted: The world's first chatbot was not intended as a chatbot at allComments: In process of journal reviewSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
ELIZA, often considered the world's first chatbot, was written by Joseph Weizenbaum in the early 1960s. Weizenbaum did not intend to invent the chatbot, but rather to build a platform for research into human-machine conversation and the important cognitive processes of interpretation and misinterpretation. His purpose was obscured by ELIZA's fame, resulting in large part from the fortuitous timing of it's creation, and it's escape into the wild. In this paper I provide a rich historical context for ELIZA's creation, demonstrating that ELIZA arose from the intersection of some of the central threads in the technical history of AI. I also briefly discuss how ELIZA escaped into the world, and how its accidental escape, along with several coincidental turns of the programming language screws, led both to the misapprehension that ELIZA was intended as a chatbot, and to the loss of the original ELIZA to history for over 50 years.
- [499] arXiv:2406.17768 (replaced) [pdf, html, other]
-
Title: EXTRACT: Efficient Policy Learning by Extracting Transferable Robot Skills from Offline DataComments: 25 pages, 16 figuresJournal-ref: CoRL 2024Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either requires expert supervision to define useful skills, which is hard to scale, or learns a skill-space from offline data with heuristics that limit the adaptability of the skills, making them difficult to transfer during downstream RL. Our approach, EXTRACT, instead utilizes pre-trained vision language models to extract a discrete set of semantically meaningful skills from offline data, each of which is parameterized by continuous arguments, without human supervision. This skill parameterization allows robots to learn new tasks by only needing to learn when to select a specific skill and how to modify its arguments for the specific task. We demonstrate through experiments in sparse-reward, image-based, robot manipulation environments that EXTRACT can more quickly learn new tasks than prior works, with major gains in sample efficiency and performance over prior skill-based RL. Website at this https URL.
- [500] arXiv:2406.17935 (replaced) [pdf, html, other]
-
Title: Sequential Editing for Lifelong Training of Speech Recognition ModelsComments: INTERSPEECH 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Automatic Speech Recognition (ASR) traditionally assumes known domains, but adding data from a new domain raises concerns about computational inefficiencies linked to retraining models on both existing and new domains. Fine-tuning solely on new domain risks Catastrophic Forgetting (CF). To address this, Lifelong Learning (LLL) algorithms have been proposed for ASR. Prior research has explored techniques such as Elastic Weight Consolidation, Knowledge Distillation, and Replay, all of which necessitate either additional parameters or access to prior domain data. We propose Sequential Model Editing as a novel method to continually learn new domains in ASR systems. Different than previous methods, our approach does not necessitate access to prior datasets or the introduction of extra parameters. Our study demonstrates up to 15% Word Error Rate Reduction (WERR) over fine-tuning baseline, and superior efficiency over other LLL techniques on CommonVoice English multi-accent dataset.
- [501] arXiv:2406.18000 (replaced) [pdf, html, other]
-
Title: Tiered Service Architecture for Remote Patient MonitoringComments: Accepted for presentation at IEEE Healthcom 2024. 7 pagesSubjects: Systems and Control (eess.SY)
We develop a remote patient monitoring (RPM) service architecture, which has two tiers of monitoring: ordinary and intensive. The patient's health state improves or worsens in each time period according to certain probabilities, which depend on the monitoring tier. The patient incurs a "loss of quality of life" cost or an "invasiveness" cost, which is higher under intensive monitoring than under ordinary. On the other hand, their health improves faster under intensive monitoring than under ordinary. In each period, the service decides which monitoring tier to use, based on the health of the patient. We investigate the optimal policy for making that choice by formulating the problem using dynamic programming. We first provide analytic conditions for selecting ordinary vs intensive monitoring in the asymptotic regime where the number of health states is large. In the general case, we investigate the optimal policy numerically. We observe a threshold behavior, that is, when the patient's health drops below a certain threshold the service switches them to intensive monitoring, while ordinary monitoring is used during adequately good health states of the patient. The modeling and analysis provides a general framework for managing RPM services for various health conditions with medically/clinically defined system parameters.
- [502] arXiv:2406.19831 (replaced) [pdf, html, other]
-
Title: Meshfree Variational Physics Informed Neural Networks (MF-VPINN): an adaptive training strategyComments: 24 pages, 65 figures, 1 tableJournal-ref: Algorithms 17(9), 415 (2024)Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
In this paper, we introduce a Meshfree Variational-Physics-Informed Neural Network. It is a Variational-Physics-Informed Neural Network that does not require the generation of the triangulation of the entire domain and that can be trained with an adaptive set of test functions. In order to generate the test space, we exploit an a posteriori error indicator and add test functions only where the error is higher. Four training strategies are proposed and compared. Numerical results show that the accuracy is higher than the one of a Variational-Physics-Informed Neural Network trained with the same number of test functions but defined on a quasi-uniform mesh.
- [503] arXiv:2407.00225 (replaced) [pdf, html, other]
-
Title: Large-scale, Independent and Comprehensive study of the power of LLMs for test case generationWendkûuni C. Ouédraogo, Kader Kaboré, Haoye Tian, Yewei Song, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. BissyandéSubjects: Software Engineering (cs.SE)
Unit testing, crucial for ensuring the reliability of code modules, such as classes and methods, is often overlooked by developers due to time constraints. Automated test generation techniques have emerged to address this, but they frequently lack readability and require significant developer intervention. Large Language Models (LLMs), such as GPT and Mistral, have shown promise in software engineering tasks, including test generation, but their overall effectiveness remains unclear. This study presents an extensive investigation of LLMs, evaluating the effectiveness of four models and five prompt engineering techniques for unit test generation. We analyze 216 300 tests generated by the selected advanced instruct-tuned LLMs for 690 Java classes collected from diverse datasets. Our evaluation considers correctness, understandability, coverage, and test smell detection in the generated tests, comparing them to a widely used automated testing tool, EvoSuite. While LLMs demonstrate potential, improvements in test quality particularly in reducing common test smells are necessary. This study highlights the strengths and limitations of LLM-generated tests compared to traditional methods, paving the way for further research on LLMs in test automation.
- [504] arXiv:2407.00449 (replaced) [pdf, html, other]
-
Title: Fully tensorial approach to hypercomplex neural networksComments: 13 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Fully tensorial theory of hypercomplex neural networks is given. It allows neural networks to use arithmetic based on arbitrary algebras. The key point is to observe that algebra multiplication can be represented as a rank three tensor and use this tensor in every algebraic operation. This approach is attractive for neural network libraries that support effective tensorial operations. It agrees with previous implementations for four-dimensional algebras.
- [505] arXiv:2407.00611 (replaced) [pdf, html, other]
-
Title: WallFacer: Harnessing Multi-dimensional Ring Parallelism for Efficient Long Sequence Model TrainingZiming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Kai Wang, Xuanlei Zhao, James Demmel, Yang YouSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Training Transformer models on long sequences in a distributed setting poses significant challenges in terms of efficiency and scalability. Current methods are either constrained by the number of attention heads or excessive communication overheads. To address this problem, we propose WallFacer, a multi-dimensional distributed training system for long sequences, fostering an efficient communication paradigm and providing additional tuning flexibility for communication arrangements. Specifically, WallFacer introduces an extra parallel dimension to substantially reduce communication volume and avoid bandwidth bottlenecks. Through comprehensive experiments across diverse hardware environments and on both Natural Language Processing (NLP) and Computer Vision (CV) tasks, we demonstrate that our approach significantly surpasses state-of-the-art methods that support near-infinite sequence lengths, achieving performance improvements of up to 77.12% on GPT-style models and up to 114.33% on DiT (Diffusion Transformer) models.
- [506] arXiv:2407.07805 (replaced) [pdf, html, other]
-
Title: SUMix: Mixup with Semantic and Uncertain InformationComments: Accepted by ECCV2024 [Camera Ready] (19 pages, 7 figures) with the source code at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Mixup data augmentation approaches have been applied for various tasks of deep learning to improve the generalization ability of deep neural networks. Some existing approaches CutMix, SaliencyMix, etc. randomly replace a patch in one image with patches from another to generate the mixed image. Similarly, the corresponding labels are linearly combined by a fixed ratio $\lambda$ by l. The objects in two images may be overlapped during the mixing process, so some semantic information is corrupted in the mixed samples. In this case, the mixed image does not match the mixed label information. Besides, such a label may mislead the deep learning model training, which results in poor performance. To solve this problem, we proposed a novel approach named SUMix to learn the mixing ratio as well as the uncertainty for the mixed samples during the training process. First, we design a learnable similarity function to compute an accurate mix ratio. Second, an approach is investigated as a regularized term to model the uncertainty of the mixed samples. We conduct experiments on five image benchmarks, and extensive experimental results imply that our method is capable of improving the performance of classifiers with different cutting-based mixup approaches. The source code is available at this https URL.
- [507] arXiv:2407.10186 (replaced) [pdf, html, other]
-
Title: Toward Explainable Reasoning in 6G: A Proof of Concept Study on Radio Resource AllocationFarhad Rezazadeh, Sergio Barrachina-Muñoz, Hatim Chergui, Josep Mangues, Mehdi Bennis, Dusit Niyato, Houbing Song, Lingjia LiuComments: 21 pages, 11 Figures, 5 TablesSubjects: Networking and Internet Architecture (cs.NI)
The move toward artificial intelligence (AI)-native sixth-generation (6G) networks has put more emphasis on the importance of explainability and trustworthiness in network management operations, especially for mission-critical use-cases. Such desired trust transcends traditional post-hoc explainable AI (XAI) methods to using contextual explanations for guiding the learning process in an in-hoc way. This paper proposes a novel graph reinforcement learning (GRL) framework named TANGO which relies on a symbolic subsystem. It consists of a Bayesian-graph neural network (GNN) Explainer, whose outputs, in terms of edge/node importance and uncertainty, are periodically translated to a logical GRL reward function. This adjustment is accomplished through defined symbolic reasoning rules within a Reasoner. Considering a real-world testbed proof-of-concept (PoC), a gNodeB (gNB) radio resource allocation problem is formulated, which aims to minimize under- and over-provisioning of physical resource blocks (PRBs) while penalizing decisions emanating from the uncertain and less important edge-nodes relations. Our findings reveal that the proposed in-hoc explainability solution significantly expedites convergence compared to standard GRL baseline and other benchmarks in the deep reinforcement learning (DRL) domain. The experiment evaluates performance in AI, complexity, energy consumption, robustness, network, scalability, and explainability metrics. Specifically, the results show that TANGO achieves a noteworthy accuracy of 96.39% in terms of optimal PRB allocation in inference phase, outperforming the baseline by 1.22x.
- [508] arXiv:2407.10838 (replaced) [pdf, other]
-
Title: Compositional Symbolic Execution for Correctness and Incorrectness Reasoning (Extended Version)Andreas Lööw, Daniele Nantes-Sobrinho, Sacha-Élie Ayoun, Caroline Cronjäger, Petar Maksimović, Philippa GardnerJournal-ref: ECOOP 2024Subjects: Programming Languages (cs.PL); Logic in Computer Science (cs.LO)
The introduction of separation logic has led to the development of symbolic execution techniques and tools that are (functionally) compositional with function specifications that can be used in broader calling contexts. Many of the compositional symbolic execution tools developed in academia and industry have been grounded on a formal foundation, but either the function specifications are not validated with respect to the underlying separation logic of the theory, or there is a large gulf between the theory and the implementation of the tool.
We introduce a formal compositional symbolic execution engine which creates and uses function specifications from an underlying separation logic and provides a sound theoretical foundation for, and indeed was partially inspired by, the Gillian symbolic execution platform. This is achieved by providing an axiomatic interface which describes the properties of the consume and produce operations used in the engine to update compositionally the symbolic state, for example when calling function specifications. This consume-produce technique is used by VeriFast, Viper, and Gillian, but has not been previously characterised independently of the tool. As part of our result, we give consume and produce operations inspired by the Gillian implementation that satisfy the properties described by our axiomatic interface. A surprising property is that our engine semantics provides a common foundation for both correctness and incorrectness reasoning, with the difference in the underlying engine only amounting to the choice to use satisfiability or validity. We use this property to extend the Gillian platform, which previously only supported correctness reasoning, with incorrectness reasoning and automatic true bug-finding using incorrectness bi-abduction. - [509] arXiv:2407.19460 (replaced) [pdf, html, other]
-
Title: White Matter Geometry-Guided Score-Based Diffusion Model for Tissue Microstructure Imputation in Tractography ImagingYui Lo, Yuqian Chen, Fan Zhang, Dongnan Liu, Leo Zekelman, Suheyla Cetin-Karayumak, Yogesh Rathi, Weidong Cai, Lauren J. O'DonnellComments: This paper has been accepted for presentation at The 31st International Conference on Neural Information Processing (ICONIP 2024). 12 pages, 3 figures, 2 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Parcellation of white matter tractography provides anatomical features for disease prediction, anatomical tract segmentation, surgical brain mapping, and non-imaging phenotype classifications. However, parcellation does not always reach 100\% accuracy due to various factors, including inter-individual anatomical variability and the quality of neuroimaging scan data. The failure to identify parcels causes a problem of missing microstructure data values, which is especially challenging for downstream tasks that analyze large brain datasets. In this work, we propose a novel deep-learning model to impute tissue microstructure: the White Matter Geometry-guided Diffusion (WMG-Diff) model. Specifically, we first propose a deep score-based guided diffusion model to impute tissue microstructure for diffusion magnetic resonance imaging (dMRI) tractography fiber clusters. Second, we propose a white matter atlas geometric relationship-guided denoising function to guide the reverse denoising process at the subject-specific level. Third, we train and evaluate our model on a large dataset with 9342 subjects. Comprehensive experiments for tissue microstructure imputation and a downstream non-imaging phenotype prediction task demonstrate that our proposed WMG-Diff outperforms the compared state-of-the-art methods in both error and accuracy metrics. Our code will be available at: this https URL.
- [510] arXiv:2407.20424 (replaced) [pdf, other]
-
Title: A convergent augmented SAV scheme for stochastic Cahn--Hilliard equations with dynamic boundary conditions describing contact line tensionSubjects: Numerical Analysis (math.NA)
We augment a thermodynamically consistent diffuse interface model for the description of line tension phenomena by multiplicative stochastic noise to capture the effects of thermal fluctuations and establish the existence of pathwise unique (stochastically) strong solutions. By starting from a fully discrete linear finite element scheme, we do not only prove the well-posedness of the model, but also provide a practicable and convergent scheme for its numerical treatment. Conceptually, our discrete scheme relies on a recently developed augmentation of the scalar auxiliary variable approach, which reduces the requirements on the time regularity of the solution. By showing that fully discrete solutions to this scheme satisfy an energy estimate, we obtain first uniform regularity results. Establishing Nikolskii estimates with respect to time, we are able to show convergence towards pathwise unique martingale solutions by applying Jakubowski's generalization of Skorokhod's theorem. Finally, a generalization of the Gyöngy--Krylov characterization of convergence in probability provides convergence towards strong solutions and thereby completes the proof.
- [511] arXiv:2407.21275 (replaced) [pdf, html, other]
-
Title: FreqTSF: Time Series Forecasting Via Capturing Intra- and Inter-Variable Variations in the frequency domainSubjects: Artificial Intelligence (cs.AI)
Time series forecasting (TSF) is immensely important in extensive applications, such as electricity transformation, medical monitoring, and smart agriculture. Although deep learning methods have been proposed to handle time series data and achieved superior performances, their ability to predict long-term time series is limited due to overlooking intra- and inter-variable variations in the frequency domain. To address this problem, we propose the FreqBlock, where we obtain frequency representations through the Frequency Transform Module. Subsequently, inspired by the inherent Kramer-Kronig relations (KKRs) in the frequency domain, the Frequency Cross Attention between the real and imaginary parts is designed to obtian enhanced frequency representations and capture intra-variable variations. And then we use inception blocks to mix information to capture correlations between variables. Our backbone network, FreqTSF, adopts a residual structure by concatenating multiple FreqBlocks to avoid degradation problems. On a theoretical level, we demonstrate that the proposed two modules can significantly reduce the time and memory complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for each FreqBlock computation. Empirical studies on three benchmark datasets show that FreqTSF achieves an overall relative MSE reduction of 15\% and an overall relative MAE reduction of 11\% compared to the state-of-the-art methods. The code is available at \url{this https URL}.
- [512] arXiv:2408.00712 (replaced) [pdf, html, other]
-
Title: MotionFix: Text-Driven 3D Human Motion EditingComments: SIGGRAPH Asia 2024 Camera Ready, Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
The focus of this paper is on 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion. In this paper, we address both challenges. We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text, introducing the new MotionFix dataset. Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input. We develop several baselines to evaluate our model, comparing it against models trained solely on text-motion pair datasets, and demonstrate the superior performance of our model trained on triplets. We also introduce new retrieval-based metrics for motion editing, establishing a benchmark on the evaluation set of MotionFix. Our results are promising, paving the way for further research in fine-grained motion generation. Code, models, and data are available at this https URL .
- [513] arXiv:2408.01215 (replaced) [pdf, html, other]
-
Title: ZNorm: Z-Score Gradient Normalization for Deep Neural NetworksSubjects: Machine Learning (cs.LG)
The rapid advancements in deep learning necessitate better training methods for deep neural networks (DNNs). As models grow in complexity, vanishing and exploding gradients impede performance. We propose Z-Score Normalization for Gradient Descent (ZNorm), an innovative technique that adjusts only the gradients to accelerate training and improve model performance. ZNorm normalizes the overall gradients, providing consistent gradient scaling across layers, thereby reducing the risks of vanishing and exploding gradients, having better performances. Our extensive experiments on CIFAR-10 and medical datasets demonstrate that ZNorm enhances performance metrics. ZNorm consistently outperforms existing methods, achieving superior results using the same experimental settings. In medical imaging applications, ZNorm improves tumor prediction and segmentation performances, underscoring its practical utility. These findings highlight ZNorm's potential as a robust and versatile tool for enhancing the training speed and effectiveness of deep neural networks across a wide range of architectures and applications.
- [514] arXiv:2408.01508 (replaced) [pdf, html, other]
-
Title: Blockchain Amplification AttackSubjects: Cryptography and Security (cs.CR)
Strategies related to the blockchain concept of Extractable Value (MEV/BEV), such as arbitrage, front- or backrunning create an economic incentive for network nodes to reduce latency. A modified node, that minimizes transaction validation time and neglects to filter invalid transactions in the Ethereum P2P network, introduces a novel attack vector -- Blockchain Amplification Attack. An attacker exploits those modified nodes to amplify an invalid transaction thousands of times, posing a threat to the entire network. To illustrate attack feasibility and practicality in the current mainnet, we 1) identify thousands of similar attacks in the wild, 2) mathematically model propagation mechanism, 3) empirically measure model parameters from our two monitoring nodes, and 4) compare performance with existing Denial-of-Service attacks through local simulation. We show that an attacker can amplify network traffic at modified nodes by a factor of 3,600, and cause economic damages 13,800 times greater than the amount needed to carry out the attack. Despite these risks, aggressive latency reduction may still be profitable enough to justify the existence of modified nodes. To assess this tradeoff, we 1) simulate the transaction validation process in the local network and 2) empirically measure the latency reduction by deploying our modified node in the Ethereum testnet. We conclude with a cost-benefit analysis of skipping validation and provide mitigation strategies against this attack.
- [515] arXiv:2408.02615 (replaced) [pdf, html, other]
-
Title: LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and MambaSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent Transformer-based diffusion models have shown remarkable performance, largely attributed to the ability of the self-attention mechanism to accurately capture both global and local contexts by computing all-pair interactions among input tokens. However, their quadratic complexity poses significant computational challenges for long-sequence inputs. Conversely, a recent state space model called Mamba offers linear complexity by compressing a filtered global context into a hidden state. Despite its efficiency, compression inevitably leads to information loss of fine-grained local dependencies among tokens, which are crucial for effective visual generative modeling. Motivated by these observations, we introduce Local Attentional Mamba (LaMamba) blocks that combine the strengths of self-attention and Mamba, capturing both global contexts and local details with linear complexity. Leveraging the efficient U-Net architecture, our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution, all while utilizing substantially fewer GFLOPs and a comparable number of parameters. Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs compared to DiT-XL/2, while achieving superior performance with comparable or fewer parameters. Our code is available at this https URL.
- [516] arXiv:2408.03747 (replaced) [pdf, html, other]
-
Title: Online Model-based Anomaly Detection in Multivariate Time Series: Taxonomy, Survey, Research Challenges and Future DirectionsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Time-series anomaly detection plays an important role in engineering processes, like development, manufacturing and other operations involving dynamic systems. These processes can greatly benefit from advances in the field, as state-of-the-art approaches may aid in cases involving, for example, highly dimensional data. To provide the reader with understanding of the terminology, this survey introduces a novel taxonomy where a distinction between online and offline, and training and inference is made. Additionally, it presents the most popular data sets and evaluation metrics used in the literature, as well as a detailed analysis. Furthermore, this survey provides an extensive overview of the state-of-the-art model-based online semi- and unsupervised anomaly detection approaches for multivariate time-series data, categorising them into different model families and other properties. The biggest research challenge revolves around benchmarking, as currently there is no reliable way to compare different approaches against one another. This problem is two-fold: on the one hand, public data sets suffers from at least one fundamental flaw, while on the other hand, there is a lack of intuitive and representative evaluation metrics in the field. Moreover, the way most publications choose a detection threshold disregards real-world conditions, which hinders the application in the real world. To allow for tangible advances in the field, these issues must be addressed in future work.
- [517] arXiv:2408.04734 (replaced) [pdf, html, other]
-
Title: A Multi-Scale Cognitive Interaction Model of Instrument Operations at the Linac Coherent Light SourceComments: Under review. Supplemental videos: this https URLSubjects: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); High Energy Physics - Experiment (hep-ex)
We describe a novel multi-agent, multi-scale computational cognitive interaction model of instrument operations at the Linac Coherent Light Source (LCLS). A leading scientific user facility, LCLS is the world's first hard x-ray free electron laser, operated by the SLAC National Accelerator Laboratory for the U.S. Department of Energy. As the world's first x-ray free electron laser, LCLS is in high demand and heavily oversubscribed. Our overall project employs cognitive engineering methodologies to improve experimental efficiency and scientific productivity by refining experimental interfaces and workflows, simplifying tasks, reducing errors, and improving operator safety and stress levels. Our model simulates aspects of human cognition at multiple cognitive and temporal scales, ranging from seconds to hours, and among agents playing multiple roles, including instrument operator, real time data analyst, and experiment manager. The model can predict impacts stemming from proposed changes to operational interfaces and workflows. Because the model code is open source, and supplemental videos go into detail on all aspects of the model and results, this approach could be applied to other experimental apparatus and processes. Example results demonstrate the model's potential in guiding modifications to improve operational efficiency and scientific output. We discuss the implications of our findings for cognitive engineering in complex experimental settings and outline future directions for research.
- [518] arXiv:2408.05365 (replaced) [pdf, html, other]
-
Title: FiSTECH: Financial Style Transfer to Enhance Creativity without Hallucinations in LLMsComments: 9 pages, 13 figures, 5 tables, conferenceSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Financial report generation using general purpose large language models (LLMs) pose two major challenges namely, the lack of compound sentences and hallucinations. Advanced prompt engineering and retrieval augmented generation (RAG) techniques are limited in scope for curing these writing style discrepancies. In this work we propose a novel two-stage fine-tuning (FT) process wherein public domain financial reports are processed into prompt-completions and augmented using simple LLM prompts to then enable sectional financial report generation using minimal instructions and tabular data inputs. The proposed fine-tuning process exploits the self-learning capability of LLMs by allowing hallucinations in the first stage and showing the corrections in the second stage. Our proposed fine-tuning framework results doubles the number of correct questions answers and reduces hallucinations by over 50%. Additionally, the two-stage FT model has lower perplexity, improved ROUGE, TER and BLEU scores, higher creativity and knowledge density with lower uncertainty and cross entropy. Thus, the proposed framework can be generalized to domain specific fine-tuning tasks at minimized tuning costs.
- [519] arXiv:2408.05817 (replaced) [pdf, other]
-
Title: High Probability Latency Sequential Change Detection over an Unknown Finite HorizonComments: 7 pages, 2 figures, International Symposium of Information TheorySubjects: Data Structures and Algorithms (cs.DS); Systems and Control (eess.SY); Statistics Theory (math.ST)
A finite horizon variant of the quickest change detection problem is studied, in which the goal is to minimize a delay threshold (latency), under constraints on the probability of false alarm and the probability that the latency is exceeded. In addition, the horizon is not known to the change detector. A variant of the cumulative sum (CuSum) test with a threshold that increasing logarithmically with time is proposed as a candidate solution to the problem. An information-theoretic lower bound on the minimum value of the latency under the constraints is then developed. This lower bound is used to establish certain asymptotic optimality properties of the proposed test in terms of the horizon and the false alarm probability. Some experimental results are given to illustrate the performance of the test.
- [520] arXiv:2408.08761 (replaced) [pdf, html, other]
-
Title: SYMPOL: Symbolic Tree-Based On-Policy Reinforcement LearningSubjects: Machine Learning (cs.LG)
Reinforcement learning (RL) has seen significant success across various domains, but its adoption is often limited by the black-box nature of neural network policies, making them difficult to interpret. In contrast, symbolic policies allow representing decision-making strategies in a compact and interpretable way. However, learning symbolic policies directly within on-policy methods remains challenging. In this paper, we introduce SYMPOL, a novel method for SYMbolic tree-based on-POLicy RL. SYMPOL employs a tree-based model integrated with a policy gradient method, enabling the agent to learn and adapt its actions while maintaining a high level of interpretability. We evaluate SYMPOL on a set of benchmark RL tasks, demonstrating its superiority over alternative tree-based RL approaches in terms of performance and interpretability. To the best of our knowledge, this is the first method, that allows a gradient-based end-to-end learning of interpretable, axis-aligned decision trees within existing on-policy RL algorithms. Therefore, SYMPOL can become the foundation for a new class of interpretable RL based on decision trees. Our implementation is available under: this https URL
- [521] arXiv:2408.08803 (replaced) [pdf, html, other]
-
Title: FourierKAN outperforms MLP on Text Classification Head Fine-tuningSubjects: Computation and Language (cs.CL)
In resource constraint settings, adaptation to downstream classification tasks involves fine-tuning the final layer of a classifier (i.e. classification head) while keeping rest of the model weights frozen. Multi-Layer Perceptron (MLP) heads fine-tuned with pre-trained transformer backbones have long been the de facto standard for text classification head fine-tuning. However, the fixed non-linearity of MLPs often struggles to fully capture the nuances of contextual embeddings produced by pre-trained models, while also being computationally expensive. In our work, we investigate the efficacy of KAN and its variant, Fourier KAN (FR-KAN), as alternative text classification heads. Our experiments reveal that FR-KAN significantly outperforms MLPs with an average improvement of 10% in accuracy and 11% in F1-score across seven pre-trained transformer models and four text classification tasks. Beyond performance gains, FR-KAN is more computationally efficient and trains faster with fewer parameters. These results underscore the potential of FR-KAN to serve as a lightweight classification head, with broader implications for advancing other Natural Language Processing (NLP) tasks.
- [522] arXiv:2408.10695 (replaced) [pdf, html, other]
-
Title: On NVD Users' Attitudes, Experiences, Hopes and HurdlesComments: To appear in ACM DTRAP Special Issue on IMF 2024Subjects: Cryptography and Security (cs.CR)
The National Vulnerability Database (NVD) is a major vulnerability database that is free to use for everyone. It provides information about vulnerabilities and further useful resources such as linked advisories and patches. The NVD is often considered as the central source for vulnerability information and as a help to improve the resource-intensive process of vulnerability management. Although the NVD receives much public attention, little is known about its usage in vulnerability management, users' attitudes towards it and whether they encounter any problems during usage. We explored these questions using a preliminary interview study with seven people, and a follow-up survey with 71 participants. The results show that the NVD is consulted regularly and often aids decision making. Generally, users are positive about the NVD and perceive it as a helpful, clearly structured tool. But users also faced issues: missing or incorrect entries, incomplete descriptions or incomprehensible CVSS ratings. In order to identify the problems origins, we discussed the results with two senior NVD members. Many of the problems can be attributed to higher-level problems such as the CVE List or limited resources. Nevertheless, the NVD is working on improving existing problems.
- [523] arXiv:2408.10795 (replaced) [pdf, html, other]
-
Title: Adversarial Attack for Explanation Robustness of Rationalization ModelsSubjects: Computation and Language (cs.CL)
Rationalization models, which select a subset of input text as rationale-crucial for humans to understand and trust predictions-have recently emerged as a prominent research area in eXplainable Artificial Intelligence. However, most of previous studies mainly focus on improving the quality of the rationale, ignoring its robustness to malicious attack. Specifically, whether the rationalization models can still generate high-quality rationale under the adversarial attack remains unknown. To explore this, this paper proposes UAT2E, which aims to undermine the explainability of rationalization models without altering their predictions, thereby eliciting distrust in these models from human users. UAT2E employs the gradient-based search on triggers and then inserts them into the original input to conduct both the non-target and target attack. Experimental results on five datasets reveal the vulnerability of rationalization models in terms of explanation, where they tend to select more meaningless tokens under attacks. Based on this, we make a series of recommendations for improving rationalization models in terms of explanation.
- [524] arXiv:2408.10812 (replaced) [pdf, other]
-
Title: Use Cases for Prospective Sensemaking of Human-AI-CollaborationComments: Author pre-print - accepted for publication at the Hawaii International Conference on System Sciences 2025Subjects: Human-Computer Interaction (cs.HC)
This study explores the potential of Human-AI Collaboration (HAIC) use cases as a tool for prospective sensemaking. Based on 14 interviews with executives of an automotive company, we identify and categorize HAIC use cases that can help organizations anticipate and strategically respond to the impact of HAIC. Feedback from the case company shows that our systematic mapping of HAIC use cases along the value chain and group tasks enables a structured understanding of the potential role of AI and underscores the importance of strategic foresight when integrating AI into organizational processes.
- [525] arXiv:2408.11919 (replaced) [pdf, html, other]
-
Title: PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU ClustersSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Large-scale computing systems are increasingly using accelerators such as GPUs to enable peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and scientific computing applications. Given the widespread and growing use of ML, including in some scientific applications, optimizing these clusters for ML workloads is particularly important. However, recent work has demonstrated that accelerators in these clusters can suffer from performance variability and this variability can lead to resource under-utilization and load imbalance. In this work we focus on how clusters schedulers, which are used to share accelerator-rich clusters across many concurrent ML jobs, can embrace performance variability to mitigate its effects. Our key insight to address this challenge is to characterize which applications are more likely to suffer from performance variability and take that into account while placing jobs on the cluster. We design a novel cluster scheduler, PAL, which uses performance variability measurements and application-specific profiles to improve job performance and resource utilization. PAL also balances performance variability with locality to ensure jobs are spread across as few nodes as possible. Overall, PAL significantly improves GPU-rich cluster scheduling: across traces for six ML workload applications spanning image, language, and vision models with a variety of variability profiles, PAL improves geomean job completion time by 42%, cluster utilization by 28%, and makespan by 47% over existing state-of-the-art schedulers.
- [526] arXiv:2408.12240 (replaced) [pdf, other]
-
Title: The Bright Side of Timed OpacityComments: This is the author (and extended) version of the manuscript of the same name published in the proceedings of the 25th International Conference on Formal Engineering Methods (ICFEM 2024)Subjects: Logic in Computer Science (cs.LO); Cryptography and Security (cs.CR); Formal Languages and Automata Theory (cs.FL)
In 2009, Franck Cassez showed that the timed opacity problem, where an attacker can observe some actions with their timestamps and attempts to deduce information, is undecidable for timed automata (TAs). Moreover, he showed that the undecidability holds even for subclasses such as event-recording automata. In this article, we consider the same definition of opacity for several other subclasses of TAs: with restrictions on the number of clocks, of actions, on the nature of time, or on a new subclass called observable event-recording automata. We show that opacity can mostly be retrieved, except for one-action TAs and for one-clock TAs with $\epsilon$-transitions, for which undecidability remains. We then exhibit a new decidable subclass in which the number of observations made by the attacker is limited.
- [527] arXiv:2408.13085 (replaced) [pdf, html, other]
-
Title: Map-Free Visual Relocalization Enhanced by Instance Knowledge and Depth KnowledgeComments: 17 pages,6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Map-free relocalization technology is crucial for applications in autonomous navigation and augmented reality, but relying on pre-built maps is often impractical. It faces significant challenges due to limitations in matching methods and the inherent lack of scale in monocular images. These issues lead to substantial rotational and metric errors and even localization failures in real-world scenarios. Large matching errors significantly impact the overall relocalization process, affecting both rotational and translational accuracy. Due to the inherent limitations of the camera itself, recovering the metric scale from a single image is crucial, as this significantly impacts the translation error. To address these challenges, we propose a map-free relocalization method enhanced by instance knowledge and depth knowledge. By leveraging instance-based matching information to improve global matching results, our method significantly reduces the possibility of mismatching across different objects. The robustness of instance knowledge across the scene helps the feature point matching model focus on relevant regions and enhance matching accuracy. Additionally, we use estimated metric depth from a single image to reduce metric errors and improve scale recovery accuracy. By integrating methods dedicated to mitigating large translational and rotational errors, our approach demonstrates superior performance in map-free relocalization techniques.
- [528] arXiv:2408.13976 (replaced) [pdf, html, other]
-
Title: Sifting through the Chaff: On Utilizing Execution Feedback for Ranking the Generated Code CandidatesComments: Accepted by the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)Subjects: Software Engineering (cs.SE)
Large Language Models (LLMs), such as GPT-4, StarCoder, and CodeLlama, are transforming the way developers approach programming by automatically generating code based on given natural language descriptions. Despite advancements, generating syntactically and semantically correct code remains challenging, especially for complex programming tasks. Existing approaches typically generate multiple candidate solutions using LLMs to increase the likelihood of producing correct code. However, selecting the correct code from these candidates-a process known as code ranking-remains a major challenge. Current research on code ranking can be categorized into execution-based and non-execution-based methods. Execution-based methods, although effective, encounter notable limitations, such as scarcity of quality unit tests and security risks. Non-execution-based methods like CodeRanker, which rely solely on classification labels to train a code ranker, struggle to capture subtle errors and provide detailed error insights. Recognizing the strengths and limitations of both approaches, we propose a new method. The key insight of our work is that an effective code ranker is expected to truly comprehend the underlying causes of erroneous code, as relying solely on classification labels is insufficient. Inspired by this, this paper puts forward RankEF, an innovative approach for code ranking that leverages execution feedback. RankEF employs multi-task learning to integrate code classification with execution feedback generation. This approach enables the model to understand the reasons behind incorrect code, distinguishing between correct and incorrect solutions without the need to execute the code during the ranking phase. Experiments on three code generation benchmarks demonstrate that RankEF significantly outperforms the state-of-the-art CodeRanker.
- [529] arXiv:2408.14899 (replaced) [pdf, html, other]
-
Title: MeshUp: Multi-Target Mesh Deformation via Blended Score DistillationComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
We propose MeshUp, a technique that deforms a 3D mesh towards multiple target concepts, and intuitively controls the region where each concept is expressed. Conveniently, the concepts can be defined as either text queries, e.g., "a dog" and "a turtle," or inspirational images, and the local regions can be selected as any number of vertices on the mesh. We can effectively control the influence of the concepts and mix them together using a novel score distillation approach, referred to as the Blended Score Distillation (BSD). BSD operates on each attention layer of the denoising U-Net of a diffusion model as it extracts and injects the per-objective activations into a unified denoising pipeline from which the deformation gradients are calculated. To localize the expression of these activations, we create a probabilistic Region of Interest (ROI) map on the surface of the mesh, and turn it into 3D-consistent masks that we use to control the expression of these activations. We demonstrate the effectiveness of BSD empirically and show that it can deform various meshes towards multiple objectives. Our project page is at this https URL.
- [530] arXiv:2408.15098 (replaced) [pdf, html, other]
-
Title: CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality Assessment with CLIPComments: accepted by ICPR2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
With the rapid development of generative technologies, AI-Generated Images (AIGIs) have been widely applied in various aspects of daily life. However, due to the immaturity of the technology, the quality of the generated images varies, so it is important to develop quality assessment techniques for the generated images. Although some models have been proposed to assess the quality of generated images, they are inadequate when faced with the ever-increasing and diverse categories of generated images. Consequently, the development of more advanced and effective models for evaluating the quality of generated images is urgently needed. Recent research has explored the significant potential of the visual language model CLIP in image quality assessment, finding that it performs well in evaluating the quality of natural images. However, its application to generated images has not been thoroughly investigated. In this paper, we build on this idea and further explore the potential of CLIP in evaluating the quality of generated images. We design CLIP-AGIQA, a CLIP-based regression model for quality assessment of generated images, leveraging rich visual and textual knowledge encapsulated in CLIP. Particularly, we implement multi-category learnable prompts to fully utilize the textual knowledge in CLIP for quality assessment. Extensive experiments on several generated image quality assessment benchmarks, including AGIQA-3K and AIGCIQA2023, demonstrate that CLIP-AGIQA outperforms existing IQA models, achieving excellent results in evaluating the quality of generated images.
- [531] arXiv:2408.15521 (replaced) [pdf, html, other]
-
Title: A Simple Baseline with Single-encoder for Referring Image SegmentationComments: arXiv pre-printSubjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, commonly adapted dual-encoders in RIS, e.g., Swin transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, leading to a gap with a pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that interact two encoders, but this approach leads to high computational costs. In this paper, we present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interactions of two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.
- [532] arXiv:2408.15766 (replaced) [pdf, html, other]
-
Title: Learning Harmonized Representations for Speculative SamplingSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Speculative sampling is a promising approach to accelerate the decoding stage for Large Language Models (LLMs). Recent advancements that leverage target LLM's contextual information, such as hidden states and KV cache, have shown significant practical improvements. However, these approaches suffer from inconsistent context between training and decoding. We also observe another discrepancy between the training and decoding objectives in existing speculative sampling methods. In this work, we propose a solution named HArmonized Speculative Sampling (HASS) that learns harmonized representations to address these issues. HASS accelerates the decoding stage without adding inference overhead through harmonized objective distillation and harmonized context alignment. Experiments on four LLaMA models demonstrate that HASS achieves 2.81x-4.05x wall-clock time speedup ratio averaging across three datasets, surpassing EAGLE-2 by 8%-20%.
- [533] arXiv:2409.00097 (replaced) [pdf, html, other]
-
Title: Large Language Models for Disease Diagnosis: A Scoping ReviewShuang Zhou, Zidu Xu, Mian Zhang, Chunpu Xu, Yawen Guo, Zaifu Zhan, Sirui Ding, Jiashuo Wang, Kaishuai Xu, Yi Fang, Liqiao Xia, Jeremy Yeung, Daochen Zha, Genevieve B. Melton, Mingquan Lin, Rui ZhangComments: 69 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Automatic disease diagnosis has become increasingly valuable in clinical practice. The advent of large language models (LLMs) has catalyzed a paradigm shift in artificial intelligence, with growing evidence supporting the efficacy of LLMs in diagnostic tasks. Despite the increasing attention in this field, a holistic view is still lacking. Many critical aspects remain unclear, such as the diseases and clinical data to which LLMs have been applied, the LLM techniques employed, and the evaluation methods used. In this article, we perform a comprehensive review of LLM-based methods for disease diagnosis. Our review examines the existing literature across various dimensions, including disease types and associated clinical specialties, clinical data, LLM techniques, and evaluation methods. Additionally, we offer recommendations for applying and evaluating LLMs for diagnostic tasks. Furthermore, we assess the limitations of current research and discuss future directions. To our knowledge, this is the first comprehensive review for LLM-based disease diagnosis.
- [534] arXiv:2409.00222 (replaced) [pdf, html, other]
-
Title: Can Large Language Models Address Open-Target Stance Detection?Comments: 12 pages; currently under submissionSubjects: Computation and Language (cs.CL)
Stance detection (SD) identifies a text's position towards a target, typically labeled as favor, against, or none. We introduce Open-Target Stance Detection (OTSD), the most realistic task where targets are neither seen during training nor provided as input. We evaluate Large Language Models (LLMs) GPT-4o, GPT-3.5, Llama-3, and Mistral, comparing their performance to the only existing work, Target-Stance Extraction (TSE), which benefits from predefined targets. Unlike TSE, OTSD removes the dependency of a predefined list, making target generation and evaluation more challenging. We also provide a metric for evaluating target quality that correlates well with human judgment. Our experiments reveal that LLMs outperform TSE in target generation when the real target is explicitly and not explicitly mentioned in the text. Likewise, for stance detection, LLMs excel in explicit cases with comparable performance in non-explicit in general.
- [535] arXiv:2409.01586 (replaced) [pdf, html, other]
-
Title: Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful PerturbationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for Large language models' fine-tuning-as-a-service. While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performances are still far away from satisfactory, and the root cause of the problem has not been fully recovered. For the first time in the literature, we in this paper show that \textit{harmful perturbation} over the model weights should be the root cause of alignment-broken of harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage's optimization. The regularizer ensures that the model's harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at \url{this https URL}.
- [536] arXiv:2409.01974 (replaced) [pdf, html, other]
-
Title: Planning to avoid ambiguous states through Gaussian approximations to non-linear sensors in active inference agentsComments: 13 pages, 3 figures. Accepted to the International Workshop on Active Inference 2024Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
In nature, active inference agents must learn how observations of the world represent the state of the agent. In engineering, the physics behind sensors is often known reasonably accurately and measurement functions can be incorporated into generative models. When a measurement function is non-linear, the transformed variable is typically approximated with a Gaussian distribution to ensure tractable inference. We show that Gaussian approximations that are sensitive to the curvature of the measurement function, such as a second-order Taylor approximation, produce a state-dependent ambiguity term. This induces a preference over states, based on how accurately the state can be inferred from the observation. We demonstrate this preference with a robot navigation experiment where agents plan trajectories.
- [537] arXiv:2409.02802 (replaced) [pdf, html, other]
-
Title: Boosting Certified Robustness for Time Series Classification with Efficient Self-EnsembleComments: 6 figures, 4 tables, 10 pagesSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Recently, the issue of adversarial robustness in the time series domain has garnered significant attention. However, the available defense mechanisms remain limited, with adversarial training being the predominant approach, though it does not provide theoretical guarantees. Randomized Smoothing has emerged as a standout method due to its ability to certify a provable lower bound on robustness radius under $\ell_p$-ball attacks. Recognizing its success, research in the time series domain has started focusing on these aspects. However, existing research predominantly focuses on time series forecasting, or under the non-$\ell_p$ robustness in statistic feature augmentation for time series classification~(TSC). Our review found that Randomized Smoothing performs modestly in TSC, struggling to provide effective assurances on datasets with poor robustness. Therefore, we propose a self-ensemble method to enhance the lower bound of the probability confidence of predicted labels by reducing the variance of classification margins, thereby certifying a larger radius. This approach also addresses the computational overhead issue of Deep Ensemble~(DE) while remaining competitive and, in some cases, outperforming it in terms of robustness. Both theoretical analysis and experimental results validate the effectiveness of our method, demonstrating superior performance in robustness testing compared to baseline approaches.
- [538] arXiv:2409.04171 (replaced) [pdf, html, other]
-
Title: RCM++:Reverse Cuthill-McKee ordering with Bi-Criteria Node FinderSubjects: Data Structures and Algorithms (cs.DS)
The Reverse Cuthill-McKee (RCM) algorithm is a graph-based method for reordering sparse matrices, renowned for its effectiveness in minimizing matrix bandwidth and profile. This reordering enhances the efficiency of matrix operations, making RCM pivotal among reordering algorithms. In the context of executing the RCM algorithm, it is often necessary to select a starting node from the graph representation of the matrix. This selection allows the execution of BFS (Breadth-First Search) to construct the level structure. The choice of this starting node significantly impacts the algorithm's performance, necessitating a heuristic approach to identify an optimal starting node, commonly referred to as the RCM starting node problem. Techniques such as the minimum degree method and George-Liu (GL) algorithm are popular solutions.
This paper introduces a novel algorithm addressing the RCM starting node problem by considering both the eccentricity and the width of the node during the run. Integrating this algorithm with the RCM algorithm, we introduce RCM++. Experimental results demonstrate that RCM++ outperforms existing RCM methods in major software libraries, achieving higher quality results with comparable computation time. This advancement fosters the further application and development of the RCM algorithm.The code related to this research has been made available at this https URL\_PP.git. - [539] arXiv:2409.04265 (replaced) [pdf, html, other]
-
Title: Fast Algorithms for Fourier extension based on boundary interval dataSubjects: Numerical Analysis (math.NA)
This paper present a new algorithm for the computation of Fourier extension based on boundary data, which can obtain a super-algebraic convergent Fourier approximation for non-periodic functions. The algorithm calculates the extension part through boundary data and connects it with the original function to form a periodic smooth function. By testing the key parameters involved, their impact on the algorithm is clarified and the optimization setting scheme of the parameters is proposed. Compared with FFT, the algorithm only needs to increase the computational complexity by a fixed small amount.
- [540] arXiv:2409.04356 (replaced) [pdf, html, other]
-
Title: Serp-Mamba: Advancing High-Resolution Retinal Vessel Segmentation with Selective State-Space ModelHongqiu Wang, Yixian Chen, Wu Chen, Huihui Xu, Haoyu Zhao, Bin Sheng, Huazhu Fu, Guang Yang, Lei ZhuSubjects: Computer Vision and Pattern Recognition (cs.CV)
Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) images capture high-resolution views of the retina with typically 200 spanning degrees. Accurate segmentation of vessels in UWF-SLO images is essential for detecting and diagnosing fundus disease. Recent studies have revealed that the selective State Space Model (SSM) in Mamba performs well in modeling long-range dependencies, which is crucial for capturing the continuity of elongated vessel structures. Inspired by this, we propose the first Serpentine Mamba (Serp-Mamba) network to address this challenging task. Specifically, we recognize the intricate, varied, and delicate nature of the tubular structure of vessels. Furthermore, the high-resolution of UWF-SLO images exacerbates the imbalance between the vessel and background categories. Based on the above observations, we first devise a Serpentine Interwoven Adaptive (SIA) scan mechanism, which scans UWF-SLO images along curved vessel structures in a snake-like crawling manner. This approach, consistent with vascular texture transformations, ensures the effective and continuous capture of curved vascular structure features. Second, we propose an Ambiguity-Driven Dual Recalibration (ADDR) module to address the category imbalance problem intensified by high-resolution images. Our ADDR module delineates pixels by two learnable thresholds and refines ambiguous pixels through a dual-driven strategy, thereby accurately distinguishing vessels and background regions. Experiment results on three datasets demonstrate the superior performance of our Serp-Mamba on high-resolution vessel segmentation. We also conduct a series of ablation studies to verify the impact of our designs. Our code shall be released upon publication of this work.
- [541] arXiv:2409.04859 (replaced) [pdf, html, other]
-
Title: Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow MatchingComments: submitted to ICASSP 2025Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speaker diarization is typically considered a discriminative task, using discriminative approaches to produce fixed diarization results. In this paper, we explore the use of neural network-based generative methods for speaker diarization for the first time. We implement a Flow-Matching (FM) based generative algorithm within the sequence-to-sequence target speaker voice activity detection (Seq2Seq-TSVAD) diarization system. Our experiments reveal that applying the generative method directly to the original binary label sequence space of the TS-VAD output is ineffective. To address this issue, we propose mapping the binary label sequence into a dense latent space before applying the generative algorithm and our proposed Flow-TSVAD method outperforms the Seq2Seq-TSVAD system. Additionally, we observe that the FM algorithm converges rapidly during the inference stage, requiring only two inference steps to achieve promising results. As a generative model, Flow-TSVAD allows for sampling different diarization results by running the model multiple times. Moreover, ensembling results from various sampling instances further enhances diarization performance.
- [542] arXiv:2409.05099 (replaced) [pdf, html, other]
-
Title: DreamMapping: High-Fidelity Text-to-3D Generation via Variational Distribution MappingComments: 15 pages, 14 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Score Distillation Sampling (SDS) has emerged as a prevalent technique for text-to-3D generation, enabling 3D content creation by distilling view-dependent information from text-to-2D guidance. However, they frequently exhibit shortcomings such as over-saturated color and excess smoothness. In this paper, we conduct a thorough analysis of SDS and refine its formulation, finding that the core design is to model the distribution of rendered images. Following this insight, we introduce a novel strategy called Variational Distribution Mapping (VDM), which expedites the distribution modeling process by regarding the rendered images as instances of degradation from diffusion-based generation. This special design enables the efficient training of variational distribution by skipping the calculations of the Jacobians in the diffusion U-Net. We also introduce timestep-dependent Distribution Coefficient Annealing (DCA) to further improve distilling precision. Leveraging VDM and DCA, we use Gaussian Splatting as the 3D representation and build a text-to-3D generation framework. Extensive experiments and evaluations demonstrate the capability of VDM and DCA to generate high-fidelity and realistic assets with optimization efficiency.
- [543] arXiv:2409.05112 (replaced) [pdf, html, other]
-
Title: WaterSeeker: Efficient Detection of Watermarked Segments in Large DocumentsComments: 19 pages, 5 figures, 5 tablesSubjects: Computation and Language (cs.CL)
Watermarking algorithms for large language models (LLMs) have attained high accuracy in detecting LLM-generated text. However, existing methods primarily focus on distinguishing fully watermarked text from non-watermarked text, overlooking real-world scenarios where LLMs generate only small sections within large documents. In this scenario, balancing time complexity and detection performance poses significant challenges. This paper presents WaterSeeker, a novel approach to efficiently detect and locate watermarked segments amid extensive natural text. It first applies an efficient anomaly extraction method to preliminarily locate suspicious watermarked regions. Following this, it conducts a local traversal and performs full-text detection for more precise verification. Theoretical analysis and experimental results demonstrate that WaterSeeker achieves a superior balance between detection accuracy and computational efficiency. Moreover, WaterSeeker's localization ability supports the development of interpretable AI detection systems. This work pioneers a new direction in watermarked segment detection, facilitating more reliable AI-generated content identification.Our code is available at this https URL.
- [544] arXiv:2409.05292 (replaced) [pdf, other]
-
Title: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety AnalysisSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. No prior work related to social media mining has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper aims to address this research gap and makes two scientific contributions to this field. First, it presents a multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. The dataset, available at this https URL, contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were performed. This process included classifying each post into (i) one of the sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral, (ii) hate or not hate, and (iii) anxiety/stress detected or no anxiety/stress detected. These results are presented as separate attributes in the dataset. Second, this paper presents the results of performing sentiment analysis, hate speech analysis, and anxiety or stress analysis. The variation of the sentiment classes - fear, surprise, joy, sadness, anger, disgust, and neutral were observed to be 27.95%, 2.57%, 8.69%, 5.94%, 2.69%, 1.53%, and 50.64%, respectively. In terms of hate speech detection, 95.75% of the posts did not contain hate and the remaining 4.25% of the posts contained hate. Finally, 72.05% of the posts did not indicate any anxiety/stress, and the remaining 27.95% of the posts represented some form of anxiety/stress.
- [545] arXiv:2409.05498 (replaced) [pdf, html, other]
-
Title: Deciding the synthesis problem for hybrid games through bisimulationSubjects: Logic in Computer Science (cs.LO); Formal Languages and Automata Theory (cs.FL)
Hybrid games are games played on a finite graph endowed with real variables which may model behaviors of discrete controllers of continuous systems. The synthesis problem for hybrid games is decidable for classical objectives (like LTL formulas) when the games are initialized singular, meaning that the slopes of the continuous variables are piecewise constant and variables are reset whenever their slope changes. The known proof adapts the region construction from timed games. In this paper we show that initialized singular games can be reduced, via a sequence of alternating bisimulations, to timed games, generalizing the known reductions by bisimulation from initialized singular automata to timed automata. Alternating bisimulation is the generalization of bisimulation to games, accomodating a strategy translation lemma by which, when two games are bisimilar and carry the same observations, each strategy in one of the games can be translated to a strategy in the second game such that all the outcomes of the second strategy satisfies the same property that are satisfied by the first strategy. The advantage of the proposed approach is that one may then use realizability tools for timed games to synthesize a winning strategy for a given objective, and then use the strategy translation lemma to obtain a winning strategy in the hybrid game for the same objective.
- [546] arXiv:2409.05500 (replaced) [pdf, html, other]
-
Title: Optimizing VarLiNGAM for Scalable and Efficient Time Series Causal DiscoverySubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Computation (stat.CO)
Causal discovery identifies causal relationships in data, but the task is more complex for multivariate time series due to the computational demands of methods like VarLiNGAM, which combines a Vector Autoregressive Model with a Linear Non-Gaussian Acyclic Model. This study optimizes causal discovery specifically for time series data, which are common in practical applications. Time series causal discovery is particularly challenging because of temporal dependencies and potential time lag effects. By developing a specialized dataset generator and reducing the computational complexity of the VarLiNGAM model from \( O(m^3 \cdot n) \) to \( O(m^3 + m^2 \cdot n) \), this study enhances the feasibility of processing large datasets. The proposed methods were validated on advanced computational platforms and tested on simulated, real-world, and large-scale datasets, demonstrating improved efficiency and performance. The optimized algorithm achieved 7 to 13 times speedup compared to the original and about 4.5 times speedup compared to the GPU-accelerated version on large-scale datasets with feature sizes from 200 to 400. Our methods extend current causal discovery capabilities, making them more robust, scalable, and applicable to real-world scenarios, facilitating advancements in fields like healthcare and finance.
- [547] arXiv:2409.05543 (replaced) [pdf, html, other]
-
Title: On-line Anomaly Detection and Qualification of Random Bit StreamsComments: 9 pages, 12 figures, 3 tables, submitted at IEEE CSR 2024Subjects: Cryptography and Security (cs.CR)
Generating random bit streams is required in various applications, most notably cyber-security. Ensuring high-quality and robust randomness is crucial to mitigate risks associated with predictability and system compromise. True random numbers provide the highest unpredictability levels. However, potential biases in the processes exploited for the random number generation must be carefully monitored. This paper reports the implementation and characterization of an on-line procedure for the detection of anomalies in a true random bit stream. It is based on the NIST Adaptive Proportion and Repetition Count tests, complemented by statistical analysis relying on the Monobit and RUNS. The procedure is firmware implemented and performed simultaneously with the bit stream generation, and providing as well an estimate of the entropy of the source. The experimental validation of the approach is performed upon the bit streams generated by a quantum, silicon-based entropy source.
- [548] arXiv:2409.05840 (replaced) [pdf, html, other]
-
Title: MMEvol: Empowering Multimodal Large Language Models with Evol-InstructRun Luo, Haonan Zhang, Longze Chen, Ting-En Lin, Xiong Liu, Yuchuan Wu, Min Yang, Minzheng Wang, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, Yunshui Li, Xiaobo Xia, Fei Huang, Jingkuan Song, Yongbin LiSubjects: Computation and Language (cs.CL)
The development of Multimodal Large Language Models (MLLMs) has seen significant advancements with increasing demands in various fields (e.g., multimodal agents, embodied intelligence). While model-driven approaches attempt to enhance MLLMs capabilities through diverse architectures, the gains have become increasingly marginal. Conversely, data-driven methods, which scale up image-text instruction data, are more effective but face limited data diversity and complexity challenges. The absence of high-quality data constitutes a significant development barrier for MLLMs. To address the data quality bottleneck, we propose MMEvol, a novel multimodal instruction data evolution framework. This framework iteratively improve data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution, generating a more complex and diverse image-text instruction dataset that empowers MLLMs with enhanced capabilities. Beginning with an initial set of instructions, SEED-163K, we utilize MMEvol to systematically broaden the diversity of instruction types, extend visual reasoning steps to improve cognitive reasoning abilities, and thoroughly explore fine-grained information within images to enhance visual understanding and robustness. To comprehensively evaluate the effectiveness of our approach, we conduct extensive qualitative analysis and quantitative experiments across 13 vision-language tasks. Compared to baseline models trained with the initial seed data, the results demonstrate that our method achieves an average accuracy improvement of 3.1 percentage points. Furthermore, our approach reaches state-of-the-art (SOTA) performance in nine tasks using significantly less data compared to state-of-the-art models.
- [549] arXiv:2409.06219 (replaced) [pdf, html, other]
-
Title: Denoising: A Powerful Building-Block for Imaging, Inverse Problems, and Machine LearningSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Denoising, the process of reducing random fluctuations in a signal to emphasize essential patterns, has been a fundamental problem of interest since the dawn of modern scientific inquiry. Recent denoising techniques, particularly in imaging, have achieved remarkable success, nearing theoretical limits by some measures. Yet, despite tens of thousands of research papers, the wide-ranging applications of denoising beyond noise removal have not been fully recognized. This is partly due to the vast and diverse literature, making a clear overview challenging.
This paper aims to address this gap. We present a clarifying perspective on denoisers, their structure, and desired properties. We emphasize the increasing importance of denoising and showcase its evolution into an essential building block for complex tasks in imaging, inverse problems, and machine learning. Despite its long history, the community continues to uncover unexpected and groundbreaking uses for denoising, further solidifying its place as a cornerstone of scientific and engineering practice. - [550] arXiv:2409.06490 (replaced) [pdf, html, other]
-
Title: UAVDB: Trajectory-Guided Adaptable Bounding Boxes for UAV DetectionComments: 8 pages, 6 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
The rapid advancement of drone technology has made accurate Unmanned Aerial Vehicle (UAV) detection essential for surveillance, security, and airspace management. This paper presents a novel trajectory-guided approach, the Patch Intensity Convergence (PIC) technique, which generates high-fidelity bounding boxes for UAV detection without manual labeling. This technique forms the foundation of UAVDB, a dedicated database designed specifically for UAV detection. Unlike datasets that often focus on large UAVs or simple backgrounds, UAVDB utilizes high-resolution RGB video to capture UAVs at various scales, from hundreds of pixels to near-single-digit sizes. This extensive scale variation enables robust evaluation of detection algorithms under diverse conditions. Using the PIC technique, bounding boxes can be efficiently generated from trajectory or position data. We benchmark UAVDB using state-of-the-art (SOTA) YOLO series detectors, providing a comprehensive performance analysis. Our results demonstrate UAVDB's potential as a critical resource for advancing UAV detection, particularly in high-resolution and long-distance tracking scenarios.
- [551] arXiv:2409.06746 (replaced) [pdf, html, other]
-
Title: Decentralized Neural Networks for Robust and Scalable Eigenvalue ComputationSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
This paper introduces a novel method for eigenvalue computation using a distributed cooperative neural network framework. Unlike traditional techniques that face scalability challenges in large systems, our decentralized algorithm enables multiple autonomous agents to collaboratively estimate the smallest eigenvalue of large matrices. Each agent employs a localized neural network, refining its estimates through communication with neighboring agents. Our empirical results confirm the algorithm's convergence towards the true eigenvalue, with estimates clustered closely around the true value. Even in the presence of communication delays or network disruptions, the method demonstrates strong robustness and scalability. Theoretical analysis further validates the accuracy and stability of the proposed approach, while empirical tests highlight its efficiency and precision, surpassing traditional centralized algorithms in large-scale eigenvalue computations.
- [552] arXiv:2409.07162 (replaced) [pdf, html, other]
-
Title: A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation StudyComments: The summary of the project is available at this https URLSubjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Analyzing user reviews for sentiment towards app features can provide valuable insights into users' perceptions of app functionality and their evolving needs. Given the volume of user reviews received daily, an automated mechanism to generate feature-level sentiment summaries of user reviews is needed. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model's parameters i.e. using zero or a few labeled examples. Despite these advancements, LLMs' capabilities to perform feature-specific sentiment analysis of user reviews remain unexplored. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for extracting app features and associated sentiments under 0-shot, 1-shot, and 5-shot scenarios. Results indicate the best-performing GPT-4 model outperforms rule-based approaches by 23.6% in f1-score with zero-shot feature extraction; 5-shot further improving it by 6%. GPT-4 achieves a 74% f1-score for predicting positive sentiment towards correctly predicted app features, with 5-shot enhancing it by 7%. Our study suggests that LLM models are promising for generating feature-specific sentiment summaries of user reviews.
- [553] arXiv:2409.08240 (replaced) [pdf, html, other]
-
Title: IFAdapter: Instance Feature Control for Grounded Text-to-Image GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position and control the features generation of multiple instances. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models' abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.
- [554] arXiv:2409.09169 (replaced) [pdf, html, other]
-
Title: Curricula for Learning Robust Policies with Factored State Representations in Changing EnvironmentsComments: 17th European Workshop on Reinforcement Learning (EWRL 2024)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Robust policies enable reinforcement learning agents to effectively adapt to and operate in unpredictable, dynamic, and ever-changing real-world environments. Factored representations, which break down complex state and action spaces into distinct components, can improve generalization and sample efficiency in policy learning. In this paper, we explore how the curriculum of an agent using a factored state representation affects the robustness of the learned policy. We experimentally demonstrate three simple curricula, such as varying only the variable of highest regret between episodes, that can significantly enhance policy robustness, offering practical insights for reinforcement learning in complex environments.
- [555] arXiv:2409.09191 (replaced) [pdf, html, other]
-
Title: ProcessTBench: An LLM Plan Generation Dataset for Process MiningComments: 6 pages, 4 figures, dataset available at this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Large Language Models (LLMs) have shown significant promise in plan generation. Yet, existing datasets often lack the complexity needed for advanced tool use scenarios - such as handling paraphrased query statements, supporting multiple languages, and managing actions that can be done in parallel. These scenarios are crucial for evaluating the evolving capabilities of LLMs in real-world applications. Moreover, current datasets don't enable the study of LLMs from a process perspective, particularly in scenarios where understanding typical behaviors and challenges in executing the same process under different conditions or formulations is crucial. To address these gaps, we present the ProcessTBench synthetic dataset, an extension of the TaskBench dataset specifically designed to evaluate LLMs within a process mining framework.
- [556] arXiv:2409.09214 (replaced) [pdf, html, other]
-
Title: Seed-Music: A Unified Framework for High Quality and Controlled Music GenerationYe Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, Yuping Wang, Yuxuan Wang, Ling Xu, Yifeng Yang, Chao Yao, Shuo Zhang, Yang Zhang, Yilin Zhang, Hang Zhao, Ziyi Zhao, Dejian Zhong, Shicen Zhou, Pei ZouComments: Seed-Music technical report, 20 pages, 5 figuresSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio.
We encourage readers to listen to demo audio examples at this https URL "this https URL. - [557] arXiv:2409.09218 (replaced) [pdf, html, other]
-
Title: AI as Extraherics: Fostering Higher-order Thinking Skills in Human-AI InteractionSubjects: Human-Computer Interaction (cs.HC)
As artificial intelligence (AI) technologies, including generative AI, continue to evolve, concerns have arisen about over-reliance on AI, which may lead to human deskilling and diminished cognitive engagement. Over-reliance on AI can also lead users to accept information given by AI without performing critical examinations, causing negative consequences, such as misleading users with hallucinated contents. This paper introduces extraheric AI, a human-AI interaction conceptual framework that fosters users' higher-order thinking skills, such as creativity, critical thinking, and problem-solving, during task completion. Unlike existing human-AI interaction designs, which replace or augment human cognition, extraheric AI fosters cognitive engagement by posing questions or providing alternative perspectives to users, rather than direct answers. We discuss interaction strategies, evaluation methods aligned with cognitive load theory and Bloom's taxonomy, and future research directions to ensure that human cognitive skills remain a crucial element in AI-integrated environments, promoting a balanced partnership between humans and AI.
- [558] arXiv:2409.09288 (replaced) [pdf, html, other]
-
Title: Generating API Parameter Security Rules with LLM for API Misuse DetectionComments: Accepted by NDSS Symposium 2025. Please cite this paper as "Jinghua Liu, Yi Yang, Kai Chen, and Miaoqian Lin. Generating API Parameter Security Rules with LLM for API Misuse Detection. In the 32nd Annual Network and Distributed System Security Symposium (NDSS 2025)Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
In this paper, we present a new framework, named GPTAid, for automatic APSRs generation by analyzing API source code with LLM and detecting API misuse caused by incorrect parameter use. To validate the correctness of the LLM-generated APSRs, we propose an execution feedback-checking approach based on the observation that security-critical API misuse is often caused by APSRs violations, and most of them result in runtime errors. Specifically, GPTAid first uses LLM to generate raw APSRs and the Right calling code, and then generates Violation code for each raw APSR by modifying the Right calling code using LLM. Subsequently, GPTAid performs dynamic execution on each piece of Violation code and further filters out the incorrect APSRs based on runtime errors. To further generate concrete APSRs, GPTAid employs a code differential analysis to refine the filtered ones. Particularly, as the programming language is more precise than natural language, GPTAid identifies the key operations within Violation code by differential analysis, and then generates the corresponding concrete APSR based on the aforementioned operations. These concrete APSRs could be precisely interpreted into applicable detection code, which proven to be effective in API misuse detection. Implementing on the dataset containing 200 randomly selected APIs from eight popular libraries, GPTAid achieves a precision of 92.3%. Moreover, it generates 6 times more APSRs than state-of-the-art detectors on a comparison dataset of previously reported bugs and APSRs. We further evaluated GPTAid on 47 applications, 210 unknown security bugs were found potentially resulting in severe security issues (e.g., system crashes), 150 of which have been confirmed by developers after our reports.
- [559] arXiv:2409.09363 (replaced) [pdf, html, other]
-
Title: Security and Privacy Perspectives of People Living in Shared Home EnvironmentsComments: Published in Proceedings of the ACM on Human-Computer Interaction (PACMHCI), Volume 8, Issue CSCW2, Article Number 368, 39 pages, ACM, 2024, accepted to CSCW 2024 (27th ACM Conference on Computer-Supported Cooperative Work and Social Computing) for presentation, this https URLSubjects: Human-Computer Interaction (cs.HC)
Security and privacy perspectives of people in a multi-user home are a growing area of research, with many researchers reflecting on the complicated power imbalance and challenging access control issues of the devices involved. However, these studies primarily focused on the multi-user scenarios in traditional family home settings, leaving other types of multi-user home environments, such as homes shared by co-habitants without a familial relationship, under-studied. This paper closes this research gap via quantitative and qualitative analysis of results from an online survey and content analysis of sampled online posts on Reddit. It explores the complex roles of shared home users, which depend on various factors unique to the shared home environment, e.g., who owns what home devices, how home devices are used by multiple users, and more complicated relationships between the landlord and people in the shared home and among co-habitants. Half (50.7%) of our survey participants thought that devices in a shared home are less secure than in a traditional family home. This perception was found statistically significantly associated with factors such as the fear of devices being tampered with in their absence and (lack of) trust in other co-habitants and their visitors. Our study revealed new user types and relationships in a multi-user environment such as ExternalPrimary-InternalPrimary while analysing the landlord and shared home resident relationship with regard to shared home device use. We propose a threat actor model for shared home environments, which has a focus on possible malicious behaviours of current and past co-habitants of a shared home, as a special type of insider threat in a home environment. We also recommend further research to understand the complex roles co-habitants can play in navigating and adapting to a shared home environment's security and privacy landscape.
- [560] arXiv:2409.09380 (replaced) [pdf, html, other]
-
Title: The Midas Touch: Triggering the Capability of LLMs for RM-API Misuse DetectionComments: Accepted by NDSS Symposium 2025. Please cite this paper as "Yi Yang, Jinghua Liu, Kai Chen, Miaoqian Lin. The Midas Touch: Triggering the Capability of LLMs for RM-API Misuse Detection. In the 32nd Annual Network and Distributed System Security Symposium (NDSS 2025)."Subjects: Cryptography and Security (cs.CR)
In this paper, we propose an LLM-empowered RM-API misuse detection solution, ChatDetector, which fully automates LLMs for documentation understanding which helps RM-API constraints retrieval and RM-API misuse detection. To correctly retrieve the RM-API constraints, ChatDetector is inspired by the ReAct framework which is optimized based on Chain-of-Thought (CoT) to decompose the complex task into allocation APIs identification, RM-object (allocated/released by RM APIs) extraction and RM-APIs pairing (RM APIs usually exist in pairs). It first verifies the semantics of allocation APIs based on the retrieved RM sentences from API documentation through LLMs. Inspired by the LLMs' performance on various prompting methods,ChatDetector adopts a two-dimensional prompting approach for cross-validation. At the same time, an inconsistency-checking approach between the LLMs' output and the reasoning process is adopted for the allocation APIs confirmation with an off-the-shelf Natural Language Processing (NLP) tool. To accurately pair the RM-APIs, ChatDetector decomposes the task again and identifies the RM-object type first, with which it can then accurately pair the releasing APIs and further construct the RM-API constraints for misuse detection. With the diminished hallucinations, ChatDetector identifies 165 pairs of RM-APIs with a precision of 98.21% compared with the state-of-the-art API detectors. By employing a static detector CodeQL, we ethically report 115 security bugs on the applications integrating on six popular libraries to the developers, which may result in severe issues, such as Denial-of-Services (DoS) and memory corruption. Compared with the end-to-end benchmark method, the result shows that ChatDetector can retrieve at least 47% more RM sentences and 80.85% more RM-API constraints.
- [561] arXiv:2409.09464 (replaced) [pdf, html, other]
-
Title: Rethinking the Influence of Source Code on Test Case GenerationComments: 23 pagesSubjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Large language models (LLMs) have been widely applied to assist test generation with the source code under test provided as the context. This paper aims to answer the question: If the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection effectiveness. Our evaluation results with five open- and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs in generating correct, high-coverage, and bug-revealing tests. For instance, in the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. For the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regression, but on early-stage immature code, it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs resilience against incorrect code in generating reliable and bug-revealing tests.
- [562] arXiv:2409.09627 (replaced) [pdf, html, other]
-
Title: Spatial-Temporal Mamba Network for EEG-based Motor Imagery ClassificationComments: 15 pages,3 figures, accepted conference:ADMA2024Subjects: Human-Computer Interaction (cs.HC)
Motor imagery (MI) classification is key for brain-computer interfaces (BCIs). Until recent years, numerous models had been proposed, ranging from classical algorithms like Common Spatial Pattern (CSP) to deep learning models such as convolutional neural networks (CNNs) and transformers. However, these models have shown limitations in areas such as generalizability, contextuality and scalability when it comes to effectively extracting the complex spatial-temporal information inherent in electroencephalography (EEG) signals. To address these limitations, we introduce Spatial-Temporal Mamba Network (STMambaNet), an innovative model leveraging the Mamba state space architecture, which excels in processing extended sequences with linear scalability. By incorporating spatial and temporal Mamba encoders, STMambaNet effectively captures the intricate dynamics in both space and time, significantly enhancing the decoding performance of EEG signals for MI classification. Experimental results on BCI Competition IV 2a and 2b datasets demonstrate STMambaNet's superiority over existing models, establishing it as a powerful tool for advancing MI-based BCIs and improving real-world BCI systems.
- [563] arXiv:2409.09670 (replaced) [pdf, html, other]
-
Title: Unsupervised Hyperspectral and Multispectral Image Blind Fusion Based on Deep Tucker Decomposition Network with Spatial-Spectral Manifold LearningComments: Accepted by TNNLS 2024 Some errors has been correctedSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Hyperspectral and multispectral image fusion aims to generate high spectral and spatial resolution hyperspectral images (HR-HSI) by fusing high-resolution multispectral images (HR-MSI) and low-resolution hyperspectral images (LR-HSI). However, existing fusion methods encounter challenges such as unknown degradation parameters, incomplete exploitation of the correlation between high-dimensional structures and deep image features. To overcome these issues, in this article, an unsupervised blind fusion method for hyperspectral and multispectral images based on Tucker decomposition and spatial spectral manifold learning (DTDNML) is proposed. We design a novel deep Tucker decomposition network that maps LR-HSI and HR-MSI into a consistent feature space, achieving reconstruction through decoders with shared parameter. To better exploit and fuse spatial-spectral features in the data, we design a core tensor fusion network that incorporates a spatial spectral attention mechanism for aligning and fusing features at different scales. Furthermore, to enhance the capacity in capturing global information, a Laplacian-based spatial-spectral manifold constraints is introduced in shared-decoders. Sufficient experiments have validated that this method enhances the accuracy and efficiency of hyperspectral and multispectral fusion on different remote sensing datasets. The source code is available at this https URL.
- [564] arXiv:2409.09959 (replaced) [pdf, other]
-
Title: Mission Planning on Autonomous Avoidance for Spacecraft Confronting Orbital DebrisComments: One of the co-authors has expressed disagreement with the submission of the paper. After further discussions, we believe it is best to withdraw the manuscript from consideration given this disagreementSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
This paper investigates the mission planning problem for spacecraft confronting orbital debris to achieve autonomous avoidance. Firstly, combined with the avoidance requirements, a closed-loop framework of autonomous avoidance for orbital debris is proposed. Under the established model of mission planning, a two-stage planning is proposed to coordinate the conflict between routine tasks and debris avoidance. During the planning for expansion, the temporal constraints for duration actions are handled by the ordering choices. Meanwhile, dynamic resource variables satisfying instantaneous numerical change and continuous linear change are reasoned in the execution of actions. Linear Programming (LP) can solve the bounds of variables in each state, which is used to check the consistency of the interactive constraints on duration and resource. Then, the temporal relaxed planning graph (TRPG) heuristics is rationally developed to guide the plan towards the goal. Finally, the simulation demonstrates that the proposed mission planning strategy can effectively achieve the autonomous debris avoidance of the spacecraft.
- [565] arXiv:2409.10011 (replaced) [pdf, html, other]
-
Title: HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision MakingComments: 10 pages, 4 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) have significantly advanced natural language processing tasks, yet they are susceptible to generating inaccurate or unreliable responses, a phenomenon known as hallucination. In critical domains such as health and medicine, these hallucinations can pose serious risks. This paper introduces HALO, a novel framework designed to enhance the accuracy and reliability of medical question-answering (QA) systems by focusing on the detection and mitigation of hallucinations. Our approach generates multiple variations of a given query using LLMs and retrieves relevant information from external open knowledge bases to enrich the context. We utilize maximum marginal relevance scoring to prioritize the retrieved context, which is then provided to LLMs for answer generation, thereby reducing the risk of hallucinations. The integration of LangChain further streamlines this process, resulting in a notable and robust increase in the accuracy of both open-source and commercial LLMs, such as Llama-3.1 (from 44% to 65%) and ChatGPT (from 56% to 70%). This framework underscores the critical importance of addressing hallucinations in medical QA systems, ultimately improving clinical decision-making and patient care. The open-source HALO is available at: this https URL.
- [566] arXiv:2409.10071 (replaced) [pdf, html, other]
-
Title: Towards Physically-Realizable Adversarial Attacks in Embodied Vision NavigationComments: 8 pages, 6 figures, submitted to the 2025 IEEE International Conference on Robotics & Automation (ICRA)Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
The deployment of embodied navigation agents in safety-critical environments raises concerns about their vulnerability to adversarial attacks on deep neural networks. However, current attack methods often lack practicality due to challenges in transitioning from the digital to the physical world, while existing physical attacks for object detection fail to achieve both multi-view effectiveness and naturalness. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches with learnable textures and opacity to objects. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which uses feedback from the navigation model to optimize the patch's texture. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, where opacity is refined after texture optimization. Experimental results show our adversarial patches reduce navigation success rates by about 40%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: [this https URL].
- [567] arXiv:2409.10078 (replaced) [pdf, html, other]
-
Title: IRIS: Interactive Responsive Intelligent Segmentation for 3D Affordance AnalysisSubjects: Robotics (cs.RO)
Recent advancements in large language and vision-language models have significantly enhanced multimodal understanding, yet translating high-level linguistic instructions into precise robotic actions in 3D space remains challenging. This paper introduces IRIS (Interactive Responsive Intelligent Segmentation), a novel training-free multimodal system for 3D affordance segmentation, alongside a benchmark for evaluating interactive language-guided affordance in everyday environments. IRIS integrates a large multimodal model with a specialized 3D vision network, enabling seamless fusion of 2D and 3D visual understanding with language comprehension. To facilitate evaluation, we present a dataset of 10 typical indoor environments, each with 50 images annotated for object actions and 3D affordance segmentation. Extensive experiments demonstrate IRIS's capability in handling interactive 3D affordance segmentation tasks across diverse settings, showcasing competitive performance across various metrics. Our results highlight IRIS's potential for enhancing human-robot interaction based on affordance understanding in complex indoor environments, advancing the development of more intuitive and efficient robotic systems for real-world applications.
- [568] arXiv:2409.10173 (replaced) [pdf, html, other]
-
Title: jina-embeddings-v3: Multilingual Embeddings With Task LoRASaba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han XiaoComments: 20 pages, pp11-13 references, pp14-20 appendix and experiment tablesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters, achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks. With a default output dimension of 1024, users can flexibly reduce the embedding dimensions to as low as 32 without compromising performance, enabled by Matryoshka Representation Learning.
- [569] arXiv:2409.10372 (replaced) [pdf, html, other]
-
Title: Instigating Cooperation among LLM Agents Using Adaptive Information ModulationSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
This paper introduces a novel framework combining LLM agents as proxies for human strategic behavior with reinforcement learning (RL) to engage these agents in evolving strategic interactions within team environments. Our approach extends traditional agent-based simulations by using strategic LLM agents (SLA) and introducing dynamic and adaptive governance through a pro-social promoting RL agent (PPA) that modulates information access across agents in a network, optimizing social welfare and promoting pro-social behavior. Through validation in iterative games, including the prisoner dilemma, we demonstrate that SLA agents exhibit nuanced strategic adaptations. The PPA agent effectively learns to adjust information transparency, resulting in enhanced cooperation rates. This framework offers significant insights into AI-mediated social dynamics, contributing to the deployment of AI in real-world team settings.
- [570] arXiv:2409.10836 (replaced) [pdf, html, other]
-
Title: Single-Layer Learnable Activation for Implicit Neural Representation (SL$^{2}$A-INR)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Implicit Neural Representation (INR), leveraging a neural network to transform coordinate input into corresponding attributes, has recently driven significant advances in several vision-related domains. However, the performance of INR is heavily influenced by the choice of the nonlinear activation function used in its multilayer perceptron (MLP) architecture. Multiple nonlinearities have been investigated; yet, current INRs face limitations in capturing high-frequency components, diverse signal types, and handling inverse problems. We have identified that these problems can be greatly alleviated by introducing a paradigm shift in INRs. We find that an architecture with learnable activations in initial layers can represent fine details in the underlying signals. Specifically, we propose SL$^{2}$A-INR, a hybrid network for INR with a single-layer learnable activation function, prompting the effectiveness of traditional ReLU-based MLPs. Our method performs superior across diverse tasks, including image representation, 3D shape reconstructions, inpainting, single image super-resolution, CT reconstruction, and novel view synthesis. Through comprehensive experiments, SL$^{2}$A-INR sets new benchmarks in accuracy, quality, and convergence rates for INR.
- [571] arXiv:2409.10840 (replaced) [pdf, html, other]
-
Title: Implicit Reasoning in Deep Time Series ForecastingSubjects: Machine Learning (cs.LG)
Recently, time series foundation models have shown promising zero-shot forecasting performance on time series from a wide range of domains. However, it remains unclear whether their success stems from a true understanding of temporal dynamics or simply from memorizing the training data. While implicit reasoning in language models has been studied, similar evaluations for time series models have been largely unexplored. This work takes an initial step toward assessing the reasoning abilities of deep time series forecasting models. We find that certain linear, MLP-based, and patch-based Transformer models generalize effectively in systematically orchestrated out-of-distribution scenarios, suggesting underexplored reasoning capabilities beyond simple pattern memorization.
- [572] arXiv:2409.11261 (replaced) [pdf, html, other]
-
Title: The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal NarrativesSubjects: Computation and Language (cs.CL)
This paper introduces the concept of an education tool that utilizes Generative Artificial Intelligence (GenAI) to enhance storytelling for children. The system combines GenAI-driven narrative co-creation, text-to-speech conversion, and text-to-video generation to produce an engaging experience for learners. We describe the co-creation process, the adaptation of narratives into spoken words using text-to-speech models, and the transformation of these narratives into contextually relevant visuals through text-to-video technology. Our evaluation covers the linguistics of the generated stories, the text-to-speech conversion quality, and the accuracy of the generated visuals.
- [573] arXiv:2409.11272 (replaced) [pdf, html, other]
-
Title: LOLA -- An Open-Source Massively Multilingual Large Language ModelNikit Srivastava, Denis Kuchelev, Tatiana Moteu Ngoli, Kshitij Shetty, Michael Röder, Diego Moussallem, Hamada Zahera, Axel-Cyrille Ngonga NgomoSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model's strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.
- [574] arXiv:2409.11502 (replaced) [pdf, html, other]
-
Title: Super Resolution On Global Weather ForecastsSubjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Weather forecasting is a vitally important tool for tasks ranging from planning day to day activities to disaster response planning. However, modeling weather has proven to be challenging task due to its chaotic and unpredictable nature. Each variable, from temperature to precipitation to wind, all influence the path the environment will take. As a result, all models tend to rapidly lose accuracy as the temporal range of their forecasts increase. Classical forecasting methods use a myriad of physics-based, numerical, and stochastic techniques to predict the change in weather variables over time. However, such forecasts often require a very large amount of data and are extremely computationally expensive. Furthermore, as climate and global weather patterns change, classical models are substantially more difficult and time-consuming to update for changing environments. Fortunately, with recent advances in deep learning and publicly available high quality weather datasets, deploying learning methods for estimating these complex systems has become feasible. The current state-of-the-art deep learning models have comparable accuracy to the industry standard numerical models and are becoming more ubiquitous in practice due to their adaptability. Our group seeks to improve upon existing deep learning based forecasting methods by increasing spatial resolutions of global weather predictions. Specifically, we are interested in performing super resolution (SR) on GraphCast temperature predictions by increasing the global precision from 1 degree of accuracy to 0.5 degrees, which is approximately 111km and 55km respectively.
- [575] arXiv:2409.11639 (replaced) [pdf, html, other]
-
Title: Spline-based solution transfer for space-time methods in 2D+tComments: 36 pages, 17 figures, 3 tablesSubjects: Numerical Analysis (math.NA)
This work introduces a new solution-transfer process for slab-based space-time finite element methods. The new transfer process is based on Hsieh-Clough-Tocher (HCT) splines and satisfies the following requirements: (i) it maintains high-order accuracy up to 4th order, (ii) it preserves a discrete maximum principle, (iii) it asymptotically enforces mass conservation, and (iv) it constructs a smooth, continuous surrogate solution in between space-time slabs. While many existing transfer methods meet the first three requirements, the fourth requirement is crucial for enabling visualization and boundary condition enforcement for space-time applications. In this paper, we derive an error bound for our HCT spline-based transfer process. Additionally, we conduct numerical experiments quantifying the conservative nature and order of accuracy of the transfer process. Lastly, we present a qualitative evaluation of the visualization properties of the smooth surrogate solution.
- [576] arXiv:2409.11690 (replaced) [pdf, html, other]
-
Title: LLM-Powered Text Simulation Attack Against ID-Free Recommender SystemsComments: 12 pagesSubjects: Information Retrieval (cs.IR)
The ID-free recommendation paradigm has been proposed to address the limitation that traditional recommender systems struggle to model cold-start users or items with new IDs. Despite its effectiveness, this study uncovers that ID-free recommender systems are vulnerable to the proposed Text Simulation attack (TextSimu) which aims to promote specific target items. As a novel type of text poisoning attack, TextSimu exploits large language models (LLM) to alter the textual information of target items by simulating the characteristics of popular items. It operates effectively in both black-box and white-box settings, utilizing two key components: a unified popularity extraction module, which captures the essential characteristics of popular items, and an N-persona consistency simulation strategy, which creates multiple personas to collaboratively synthesize refined promotional textual descriptions for target items by simulating the popular items. To withstand TextSimu-like attacks, we further explore the detection approach for identifying LLM-generated promotional text. Extensive experiments conducted on three datasets demonstrate that TextSimu poses a more significant threat than existing poisoning attacks, while our defense method can detect malicious text of target items generated by TextSimu. By identifying the vulnerability, we aim to advance the development of more robust ID-free recommender systems.
- [577] arXiv:2409.11693 (replaced) [pdf, html, other]
-
Title: On the second-order zero differential properties of several classes of power functions over finite fieldsSubjects: Cryptography and Security (cs.CR); Information Theory (cs.IT)
Feistel Boomerang Connectivity Table (FBCT) is an important cryptanalytic technique on analysing the resistance of the Feistel network-based ciphers to power attacks such as differential and boomerang attacks. Moreover, the coefficients of FBCT are closely related to the second-order zero differential spectra of the function $F(x)$ over the finite fields with even characteristic and the Feistel boomerang uniformity is the second-order zero differential uniformity of $F(x)$. In this paper, by computing the number of solutions of specific equations over finite fields, we determine explicitly the second-order zero differential spectra of power functions $x^{2^m+3}$ and $x^{2^m+5}$ with $m>2$ being a positive integer over finite field with even characteristic, and $x^{p^k+1}$ with integer $k\geq1$ over finite field with odd characteristic $p$. It is worth noting that $x^{2^m+3}$ is a permutation over $\mathbb{F}_{2^n}$ and only when $m$ is odd, $x^{2^m+5}$ is a permutation over $\mathbb{F}_{2^n}$, where integer $n=2m$. As a byproduct, we find $F(x)=x^4$ is a PN and second-order zero differentially $0$-uniform function over $\mathbb{F}_{3^n}$ with odd $n$. The computation of these entries and the cardinalities in each table aimed to facilitate the analysis of differential and boomerang cryptanalysis of S-boxes when studying distinguishers and trails.
- [578] arXiv:2409.11733 (replaced) [pdf, html, other]
-
Title: Human-like Affective Cognition in Foundation ModelsKanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. GoodmanSubjects: Computation and Language (cs.CL)
Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and do a variety of other affective cognition. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are ``superhuman'' -- they better predict modal human judgements than the average human. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
- [579] arXiv:2409.11742 (replaced) [pdf, html, other]
-
Title: Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech RepresentationsComments: Submitted to ICASSP2025 Demo available: this https URLSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Evaluating speech intelligibility is a critical task in computer-aided language learning systems. Traditional methods often rely on word error rates (WER) provided by automatic speech recognition (ASR) as intelligibility scores. However, this approach has significant limitations due to notable differences between human speech recognition (HSR) and ASR. A promising alternative is to involve a native (L1) speaker in shadowing what nonnative (L2) speakers say. Breakdowns or mispronunciations in the L1 speaker's shadowing utterance can serve as indicators for assessing L2 speech intelligibility. In this study, we propose a speech generation system that simulates the L1 shadowing process using voice conversion (VC) techniques and latent speech representations. Our experimental results demonstrate that this method effectively replicates the L1 shadowing process, offering an innovative tool to evaluate L2 speech intelligibility. Notably, systems that utilize self-supervised speech representations (S3R) show a higher degree of similarity to real L1 shadowing utterances in both linguistic accuracy and naturalness.
- [580] arXiv:2409.11868 (replaced) [pdf, other]
-
Title: Practical Investigation on the Distinguishability of Longa's Atomic PatternsSubjects: Cryptography and Security (cs.CR)
This paper investigates the distinguishability of the atomic patterns for elliptic curve point doubling and addition operations proposed by Longa. We implemented a binary elliptic curve scalar multiplication kP algorithm with Longa's atomic patterns for the NIST elliptic curve P-256 using the open-source cryptographic library FLECC in C. We measured and analysed an electromagnetic trace of a single kP execution on a microcontroller (TI Launchpad F28379 board). Due to various technical limitations, significant differences in the execution time and the shapes of the atomic blocks could not be determined. Further investigations of the side channel analysis-resistance can be performed based on this work. Last but not least, we examined and corrected Longa's atomic patterns corresponding to formulae proposed by Longa.
- [581] arXiv:2409.12005 (replaced) [pdf, html, other]
-
Title: Representing Positional Information in Generative World Models for Object ManipulationSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Object manipulation capabilities are essential skills that set apart embodied agents engaging with the world, especially in the realm of robotics. The ability to predict outcomes of interactions with objects is paramount in this setting. While model-based control methods have started to be employed for tackling manipulation tasks, they have faced challenges in accurately manipulating objects. As we analyze the causes of this limitation, we identify the cause of underperformance in the way current world models represent crucial positional information, especially about the target's goal specification for object positioning tasks. We introduce a general approach that empowers world model-based agents to effectively solve object-positioning tasks. We propose two declinations of this approach for generative world models: position-conditioned (PCP) and latent-conditioned (LCP) policy learning. In particular, LCP employs object-centric latent representations that explicitly capture object positional information for goal specification. This naturally leads to the emergence of multimodal capabilities, enabling the specification of goals through spatial coordinates or a visual goal. Our methods are rigorously evaluated across several manipulation environments, showing favorable performance compared to current model-based control approaches.
- [582] arXiv:2409.12046 (replaced) [pdf, html, other]
-
Title: Using Large Language Models to Generate Clinical Trial Tables and FiguresSubjects: Computation and Language (cs.CL)
Tables, figures, and listings (TFLs) are essential tools for summarizing clinical trial data. Creation of TFLs for reporting activities is often a time-consuming task encountered routinely during the execution of clinical trials. This study explored the use of large language models (LLMs) to automate the generation of TFLs through prompt engineering and few-shot transfer learning. Using public clinical trial data in ADaM format, our results demonstrated that LLMs can efficiently generate TFLs with prompt instructions, showcasing their potential in this domain. Furthermore, we developed a conservational agent named Clinical Trial TFL Generation Agent: An app that matches user queries to predefined prompts that produce customized programs to generate specific predefined TFLs.
- [583] arXiv:2409.12089 (replaced) [pdf, html, other]
-
Title: The Impact of Element Ordering on LM Agent PerformanceSubjects: Machine Learning (cs.LG)
There has been a surge of interest in language model agents that can navigate virtual environments such as the web or desktop. To navigate such environments, agents benefit from information on the various elements (e.g., buttons, text, or images) present. It remains unclear which element attributes have the greatest impact on agent performance, especially in environments that only provide a graphical representation (i.e., pixels). Here we find that the ordering in which elements are presented to the language model is surprisingly impactful--randomizing element ordering in a webpage degrades agent performance comparably to removing all visible text from an agent's state representation. While a webpage provides a hierarchical ordering of elements, there is no such ordering when parsing elements directly from pixels. Moreover, as tasks become more challenging and models more sophisticated, our experiments suggest that the impact of ordering increases. Finding an effective ordering is non-trivial. We investigate the impact of various element ordering methods in web and desktop environments. We find that dimensionality reduction provides a viable ordering for pixel-only environments. We train a UI element detection model to derive elements from pixels and apply our findings to an agent benchmark--OmniACT--where we only have access to pixels. Our method completes more than two times as many tasks on average relative to the previous state-of-the-art.
- [584] arXiv:2409.12097 (replaced) [pdf, html, other]
-
Title: Skill matching at scale: freelancer-project alignment for efficient multilingual candidate retrievalSubjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Finding the perfect match between a job proposal and a set of freelancers is not an easy task to perform at scale, especially in multiple languages. In this paper, we propose a novel neural retriever architecture that tackles this problem in a multilingual setting. Our method encodes project descriptions and freelancer profiles by leveraging pre-trained multilingual language models. The latter are used as backbone for a custom transformer architecture that aims to keep the structure of the profiles and project. This model is trained with a contrastive loss on historical data. Thanks to several experiments, we show that this approach effectively captures skill matching similarity and facilitates efficient matching, outperforming traditional methods.
- [585] arXiv:2409.12159 (replaced) [pdf, html, other]
-
Title: WeHelp: A Shared Autonomy System for Wheelchair UsersSubjects: Robotics (cs.RO)
There is a large population of wheelchair users. Most of the wheelchair users need help with daily tasks. However, according to recent reports, their needs are not properly satisfied due to the lack of caregivers. Therefore, in this project, we develop WeHelp, a shared autonomy system aimed for wheelchair users. A robot with a WeHelp system has three modes, following mode, remote control mode and tele-operation mode. In the following mode, the robot follows the wheelchair user automatically via visual tracking. The wheelchair user can ask the robot to follow them from behind, by the left or by the right. When the wheelchair user asks for help, the robot will recognize the command via speech recognition, and then switch to the teleoperation mode or remote control mode. In the teleoperation mode, the wheelchair user takes over the robot with a joy stick and controls the robot to complete some complex tasks for their needs, such as opening doors, moving obstacles on the way, reaching objects on a high shelf or on the low ground, etc. In the remote control mode, a remote assistant takes over the robot and helps the wheelchair user complete some complex tasks for their needs. Our evaluation shows that the pipeline is useful and practical for wheelchair users. Source code and demo of the paper are available at \url{this https URL}.
- [586] arXiv:2409.12194 (replaced) [pdf, html, other]
-
Title: Gender Representation and Bias in Indian Civil Service Mock InterviewsSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
This paper makes three key contributions. First, via a substantial corpus of 51,278 interview questions sourced from 888 YouTube videos of mock interviews of Indian civil service candidates, we demonstrate stark gender bias in the broad nature of questions asked to male and female candidates. Second, our experiments with large language models show a strong presence of gender bias in explanations provided by the LLMs on the gender inference task. Finally, we present a novel dataset of 51,278 interview questions that can inform future social science studies.
- [587] arXiv:2103.09320 (replaced) [pdf, html, other]
-
Title: Quantum Pseudorandomness and Classical ComplexityComments: 31 pages. V2: added a new result about Haar random oracles (Corollary 5); various writing improvements. V3: added a new section about t-designs and corrected some proofs involving the use of t-designs. V4: corrected Lemma 25. V5: major changes to Section 5; the proof uses a new oracle construction after a bug was discovered in the previous versionJournal-ref: 16th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC 2021), Leibniz International Proceedings in Informatics (LIPIcs) 197, pp. 2:1-2:20 (2021)Subjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Cryptography and Security (cs.CR)
We construct a quantum oracle relative to which $\mathsf{BQP} = \mathsf{QMA}$ but cryptographic pseudorandom quantum states and pseudorandom unitary transformations exist, a counterintuitive result in light of the fact that pseudorandom states can be "broken" by quantum Merlin-Arthur adversaries. We explain how this nuance arises as the result of a distinction between algorithms that operate on quantum and classical inputs. On the other hand, we show that some computational complexity assumption is needed to construct pseudorandom states, by proving that pseudorandom states do not exist if $\mathsf{BQP} = \mathsf{PP}$. We discuss implications of these results for cryptography, complexity theory, and shadow tomography.
- [588] arXiv:2306.06291 (replaced) [pdf, html, other]
-
Title: Optimal Multitask Linear Regression and Contextual Bandits under Sparse HeterogeneitySubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Large and complex datasets are often collected from several, possibly heterogeneous sources. Multitask learning methods improve efficiency by leveraging commonalities across datasets while accounting for possible differences among them. Here, we study multitask linear regression and contextual bandits under sparse heterogeneity, where the source/task-associated parameters are equal to a global parameter plus a sparse task-specific term. We propose a novel two-stage estimator called MOLAR that leverages this structure by first constructing a covariate-wise weighted median of the task-wise linear regression estimates and then shrinking the task-wise estimates towards the weighted median. Compared to task-wise least squares estimates, MOLAR improves the dependence of the estimation error on the data dimension. Extensions of MOLAR to generalized linear models and constructing confidence intervals are discussed in the paper. We then apply MOLAR to develop methods for sparsely heterogeneous multitask contextual bandits, obtaining improved regret guarantees over single-task bandit methods. We further show that our methods are minimax optimal by providing a number of lower bounds. Finally, we support the efficiency of our methods by performing experiments on both synthetic data and the PISA dataset on student educational outcomes from heterogeneous countries.
- [589] arXiv:2309.06869 (replaced) [pdf, other]
-
Title: Dynamic control of self-assembly of quasicrystalline structures through reinforcement learningComments: 11 pages, 12 figures, includes Supplementary InformationSubjects: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
We propose reinforcement learning to control the dynamical self-assembly of the dodecagonal quasicrystal (DDQC) from patchy particles. The patchy particles have anisotropic interactions with other particles and form DDQC. However, their structures at steady states are significantly influenced by the kinetic pathways of their structural formation. We estimate the best policy of temperature control trained by the Q-learning method and demonstrate that we can generate DDQC with few defects using the estimated policy. It is found that reinforcement learning autonomously discovers a characteristic temperature at which structural fluctuations enhance the chance of forming a globally stable state. The estimated policy guides the system toward the characteristic temperature to assist the formation of DDQC. We also illustrate the performance of RL when the target is metastable or unstable. To clarify the success of the learning, we analyse a simple model describing the kinetics of structural changes through the motion in a triple-well potential.
- [590] arXiv:2309.15035 (replaced) [pdf, html, other]
-
Title: On the Reduced Gr\"obner Bases of Blockwise Determinantal IdealsComments: 38 pages, 8 figuresSubjects: Commutative Algebra (math.AC); Symbolic Computation (cs.SC); Combinatorics (math.CO)
Blockwise determinantal ideals are those generated by the union of all the minors of specified sizes in certain blocks of a generic matrix, and they are the natural generalization of many existing determinantal ideals like the Schubert and ladder ones. In this paper we establish several criteria to verify whether the Gröbner bases of blockwise determinantal ideals with respect to (anti-)diagonal term orders are minimal or reduced. In particular, for Schubert determinantal ideals, while all the elusive minors form the reduced Gröbner bases when the defining permutations are vexillary, in the non-vexillary case we derive an explicit formula for computing the reduced Gröbner basis from elusive minors which avoids all algebraic operations. The fundamental properties of being normal and strong for W-characteristic sets and characteristic pairs, which are heavily connected to the reduced Gröbner bases, of Schubert determinantal ideals are also proven.
- [591] arXiv:2309.15793 (replaced) [pdf, html, other]
-
Title: Targeting relative risk heterogeneity with causal forestsComments: 17 pages, 4 figuresSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
The estimation of heterogeneous treatment effects (HTE) across different subgroups in a population is of significant interest in clinical trial analysis. State-of-the-art HTE estimation methods, including causal forests (Wager and Athey, 2018), generally rely on recursive partitioning for non-parametric identification of relevant covariates and interactions. However, like many other methods in this area, causal forests partition subgroups based on differences in absolute risk. This can dilute statistical power by masking variability in the relative risk, which is often a more appropriate quantity of clinical interest. In this work, we propose and implement a methodology for modifying causal forests to target relative risk, using a novel node-splitting procedure based on exhaustive generalized linear model comparison. We present results that suggest relative risk causal forests can capture otherwise undetected sources of heterogeneity.
- [592] arXiv:2309.17262 (replaced) [pdf, other]
-
Title: Estimation and Inference in Distributional Reinforcement LearningSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, aiming to estimate the complete return distribution (denoted $\eta^\pi$) attained by a given policy $\pi$. We use the certainty-equivalence method to construct our estimator $\hat\eta^\pi$, given a generative model is available. In this circumstance we need a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ to guarantee the $p$-Wasserstein metric between $\hat\eta^\pi$ and $\eta^\pi$ less than $\varepsilon$ with high probability. This implies the distributional policy evaluation problem can be solved with sample efficiency. Also, we show that under different mild assumptions a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure the Kolmogorov metric and total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ is below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hat\eta^\pi-\eta^\pi)$ converges weakly to a Gaussian process in the space of bounded functionals on Lipschitz function class $\ell^\infty(\mathcal{F}_{\text{W}})$, also in the space of bounded functionals on indicator function class $\ell^\infty(\mathcal{F}_{\text{KS}})$ and bounded measurable function class $\ell^\infty(\mathcal{F}_{\text{TV}})$ when some mild conditions hold. Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $\eta^\pi$.
- [593] arXiv:2312.01043 (replaced) [pdf, html, other]
-
Title: Quantifying Hippocampal Shape Asymmetry in Alzheimer's Disease Using Optimal Shape CorrespondencesComments: 4 pages, 3 figures Published in 2024 IEEE International Symposium on Biomedical Imaging (ISBI)Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Hippocampal atrophy in Alzheimer's disease (AD) is asymmetric and spatially inhomogeneous. While extensive work has been done on volume and shape analysis of atrophy of the hippocampus in AD, less attention has been given to hippocampal asymmetry specifically. Previous studies of hippocampal asymmetry are limited to global volume or shape measures, which don't localize shape asymmetry at the point level. In this paper, we propose to quantify localized shape asymmetry by optimizing point correspondences between left and right hippocampi within a subject, while simultaneously favoring a compact statistical shape model of the entire sample. To account for related variables that have impact on AD and healthy subject differences, we build linear models with other confounding factors. Our results on the OASIS3 dataset demonstrate that compared to using volumetric information, shape asymmetry reveals fine-grained, localized differences that indicate the hippocampal regions of most significant shape asymmetry in AD patients.
- [594] arXiv:2401.01258 (replaced) [pdf, html, other]
-
Title: Model-Free Learning for the Linear Quadratic Regulator over Rate-Limited ChannelsComments: 37 pages, 3 figures. Compared to the previous version, we extend the general AQGD algorithm to handle noisy gradients and prove its convergence. In addition, we apply the algorithm to solve model-free LQR with communication constraints and provide finite sample analysis regarding the convergence of the algorithmSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Consider a linear quadratic regulator (LQR) problem being solved in a model-free manner using the policy gradient approach. If the gradient of the quadratic cost is being transmitted across a rate-limited channel, both the convergence and the rate of convergence of the resulting controller may be affected by the bit-rate permitted by the channel. We first pose this problem in a communication-constrained optimization framework and propose a new adaptive quantization algorithm titled Adaptively Quantized Gradient Descent (AQGD). This algorithm guarantees exponentially fast convergence to the globally optimal policy, with no deterioration of the exponent relative to the unquantized setting, above a certain finite threshold bit-rate allowed by the communication channel. We then propose a variant of AQGD that provides similar performance guarantees when applied to solve the model-free LQR problem. Our approach reveals the benefits of adaptive quantization in preserving fast linear convergence rates, and, as such, may be of independent interest to the literature on compressed optimization. Our work also marks a first step towards a more general bridge between the fields of model-free control design and networked control systems.
- [595] arXiv:2402.00014 (replaced) [pdf, html, other]
-
Title: Hybrid quantum cycle generative adversarial network for small molecule generationMatvei Anoshin, Asel Sagingalieva, Christopher Mansell, Dmitry Zhiganov, Vishal Shete, Markus Pflitsch, Alexey MelnikovComments: 16 pages, 8 figures, 4 tablesJournal-ref: IEEE Trans. Quantum Eng. 5, 2500514 (2024)Subjects: Biomolecules (q-bio.BM); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Quantum Physics (quant-ph)
The drug design process currently requires considerable time and resources to develop each new compound that enters the market. This work develops an application of hybrid quantum generative models based on the integration of parametrized quantum circuits into known molecular generative adversarial networks, and proposes quantum cycle architectures that improve model performance and stability during training. Through extensive experimentation on benchmark drug design datasets, QM9 and PC9, the introduced models are shown to outperform the previously achieved scores. Most prominently, the new scores indicate an increase of up to 30% in the quantitative estimation of druglikeness. The new hybrid quantum machine learning algorithms, as well as the achieved scores of pharmacokinetic properties, contribute to the development of fast and accurate drug discovery processes.
- [596] arXiv:2402.03990 (replaced) [pdf, html, other]
-
Title: Subsampling is not Magic: Why Large Batch Sizes Work for Differentially Private Stochastic OptimisationComments: After the publication of this work (ICML 2024), the Conjecture 6.3 has been proven by Kalinin (2024). Kalinin, N. P. Notes on Sampled Gaussian Mechanism. arXiv:2409.04636, 2024Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
We study how the batch size affects the total gradient variance in differentially private stochastic gradient descent (DP-SGD), seeking a theoretical explanation for the usefulness of large batch sizes. As DP-SGD is the basis of modern DP deep learning, its properties have been widely studied, and recent works have empirically found large batch sizes to be beneficial. However, theoretical explanations of this benefit are currently heuristic at best. We first observe that the total gradient variance in DP-SGD can be decomposed into subsampling-induced and noise-induced variances. We then prove that in the limit of an infinite number of iterations, the effective noise-induced variance is invariant to the batch size. The remaining subsampling-induced variance decreases with larger batch sizes, so large batches reduce the effective total gradient variance. We confirm numerically that the asymptotic regime is relevant in practical settings when the batch size is not small, and find that outside the asymptotic regime, the total gradient variance decreases even more with large batch sizes. We also find a sufficient condition that implies that large batch sizes similarly reduce effective DP noise variance for one iteration of DP-SGD.
- [597] arXiv:2402.06455 (replaced) [pdf, html, other]
-
Title: Quantum Computing and Tensor Networks for Laminate Design: A Novel Approach to Stacking Sequence RetrievalComments: 44 pages, 6 figures. Accompanying code repository: this https URL . Accompanying data repository: this https URL . Changes: Minor revisionJournal-ref: Comput. Methods Appl. Mech. Eng. 432 (2024) 117380Subjects: Quantum Physics (quant-ph); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET)
As with many tasks in engineering, structural design frequently involves navigating complex and computationally expensive problems. A prime example is the weight optimization of laminated composite materials, which to this day remains a formidable task, due to an exponentially large configuration space and non-linear constraints. The rapidly developing field of quantum computation may offer novel approaches for addressing these intricate problems. However, before applying any quantum algorithm to a given problem, it must be translated into a form that is compatible with the underlying operations on a quantum computer. Our work specifically targets stacking sequence retrieval with lamination parameters. To adapt this problem for quantum computational methods, we map the possible stacking sequences onto a quantum state space. We further derive a linear operator, the Hamiltonian, within this state space that encapsulates the loss function inherent to the stacking sequence retrieval problem. Additionally, we demonstrate the incorporation of manufacturing constraints on stacking sequences as penalty terms in the Hamiltonian. This quantum representation is suitable for a variety of classical and quantum algorithms for finding the ground state of a quantum Hamiltonian. For a practical demonstration, we performed state-vector simulations of two variational quantum algorithms and additionally chose a classical tensor network algorithm, the DMRG algorithm, to numerically validate our approach. Although this work primarily concentrates on quantum computation, the application of tensor network algorithms presents a novel quantum-inspired approach for stacking sequence retrieval.
- [598] arXiv:2402.14053 (replaced) [pdf, html, other]
-
Title: Self-adhesivity in lattices of abstract conditional independence modelsComments: 39 pages, 4 figures; v2: major revisionSubjects: Combinatorics (math.CO); Information Theory (cs.IT)
We introduce an algebraic concept of the frame for abstract conditional independence (CI) models, together with basic operations with respect to which such a frame should be closed: copying and marginalization. Three standard examples of such frames are (discrete) probabilistic CI structures, semi-graphoids and structural semi-graphoids. We concentrate on those frames which are closed under the operation of set-theoretical intersection because, for these, the respective families of CI models are lattices. This allows one to apply the results from lattice theory and formal concept analysis to describe such families in terms of implications among CI statements.
The central concept of this paper is that of self-adhesivity defined in algebraic terms, which is a combinatorial reflection of the self-adhesivity concept studied earlier in context of polymatroids and information theory. The generalization also leads to a self-adhesivity operator defined on the hyper-level of CI frames. We answer some of the questions related to this approach and raise other open questions.
The core of the paper is in computations. The combinatorial approach to computation might overcome some memory and space limitation of software packages based on polyhedral geometry, in particular, if SAT solvers are utilized. We characterize some basic CI families over 4 variables in terms of canonical implications among CI statements. We apply our method in information-theoretical context to the task of entropic region demarcation over 5 variables. - [599] arXiv:2403.03013 (replaced) [pdf, html, other]
-
Title: The clique chromatic number of sparse random graphsComments: 24 pages; minor editsSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM); Probability (math.PR)
The clique chromatic number of a graph is the smallest number of colors in a vertex coloring so that no maximal clique is monochromatic. In this paper, we determine the order of magnitude of the clique chromatic number of the random graph G_{n,p} for most edge-probabilities p in the range n^{-2/5} \ll p \ll 1. This resolves open problems and questions of Lichev, Mitsche and Warnke as well as Alon and Krievelevich.
One major proof difficulty stems from high-degree vertices, which prevent maximal cliques in their neighborhoods: we deal with these vertices by an intricate union bound argument, that combines the probabilistic method with new degree counting arguments in order to enable Janson's inequality. This way we determine the asymptotics of the clique chromatic number of G_{n,p} in some ranges, and discover a surprising new phenomenon that contradicts earlier predictions for edge-probabilities p close to n^{-2/5}. - [600] arXiv:2403.10861 (replaced) [pdf, html, other]
-
Title: FedQNN: Federated Learning using Quantum Neural NetworksComments: Accepted for presentation at IJCNN 2024Journal-ref: 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-9Subjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
In this study, we explore the innovative domain of Quantum Federated Learning (QFL) as a framework for training Quantum Machine Learning (QML) models via distributed networks. Conventional machine learning models frequently grapple with issues about data privacy and the exposure of sensitive information. Our proposed Federated Quantum Neural Network (FedQNN) framework emerges as a cutting-edge solution, integrating the singular characteristics of QML with the principles of classical federated learning. This work thoroughly investigates QFL, underscoring its capability to secure data handling in a distributed environment and facilitate cooperative learning without direct data sharing. Our research corroborates the concept through experiments across varied datasets, including genomics and healthcare, thereby validating the versatility and efficacy of our FedQNN framework. The results consistently exceed 86% accuracy across three distinct datasets, proving its suitability for conducting various QML tasks. Our research not only identifies the limitations of classical paradigms but also presents a novel framework to propel the field of QML into a new era of secure and collaborative innovation.
- [601] arXiv:2404.08064 (replaced) [pdf, other]
-
Title: The Impact of Speech Anonymization on Pathology and Its LimitsSoroosh Tayebi Arasteh, Tomas Arias-Vergara, Paula Andrea Perez-Toro, Tobias Weise, Kai Packhaeuser, Maria Schuster, Elmar Noeth, Andreas Maier, Seung Hee YangComments: Published in Communications MedicineJournal-ref: Commun Med 4, (2024)Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Integration of speech into healthcare has intensified privacy concerns due to its potential as a non-invasive biomarker containing individual biometric information. In response, speaker anonymization aims to conceal personally identifiable information while retaining crucial linguistic content. However, the application of anonymization techniques to pathological speech, a critical area where privacy is especially vital, has not been extensively examined. This study investigates anonymization's impact on pathological speech across over 2,700 speakers from multiple German institutions, focusing on privacy, pathological utility, and demographic fairness. We explore both deep-learning-based and signal processing-based anonymization methods. We document substantial privacy improvements across disorders-evidenced by equal error rate increases up to 1933%, with minimal overall impact on utility. Specific disorders such as Dysarthria, Dysphonia, and Cleft Lip and Palate experience minimal utility changes, while Dysglossia shows slight improvements. Our findings underscore that the impact of anonymization varies substantially across different disorders. This necessitates disorder-specific anonymization strategies to optimally balance privacy with diagnostic utility. Additionally, our fairness analysis reveals consistent anonymization effects across most of the demographics. This study demonstrates the effectiveness of anonymization in pathological speech for enhancing privacy, while also highlighting the importance of customized and disorder-specific approaches to account for inversion attacks.
- [602] arXiv:2405.05779 (replaced) [pdf, html, other]
-
Title: A note on the theory of well ordersComments: 4 pagesSubjects: Logic (math.LO); Logic in Computer Science (cs.LO)
We give a simple proof that the first-order theory of well orders is axiomatized by transfinite induction, and that it is decidable.
- [603] arXiv:2405.08417 (replaced) [pdf, html, other]
-
Title: Neural Speech Coding for Real-time Communications using Constant Bitrate Scalar QuantizationSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Neural audio coding has emerged as a vivid research direction by promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art, where a discrete representation in the bottleneck of the autoencoder is learned. This allows for efficient transmission of the input audio signal. The learned discrete representation of neural codecs is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ) and a lot of effort has been spent to alleviate drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose and analyze simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters or codebook storage thereby simplifying the training of neural audio codecs. For real-time speech communication applications, these neural codecs are required to operate at low complexity, low latency and at low bitrates. We address those challenges by proposing a new causal network architecture that is based on SQ and a Short-Time Fourier Transform (STFT) representation. The proposed method performs particularly well in the very low complexity and low bitrate regime.
- [604] arXiv:2407.08555 (replaced) [pdf, html, other]
-
Title: SLoRD: Structural Low-Rank Descriptors for Shape Consistency in Vertebrae SegmentationComments: Under reviewSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Automatic and precise multi-class vertebrae segmentation from CT images is crucial for various clinical applications. However, due to a lack of explicit consistency constraints, existing methods especially for single-stage methods, still suffer from the challenge of intra-vertebrae segmentation inconsistency, which refers to multiple label predictions inside a singular vertebra. For multi-stage methods, vertebrae detection serving as the first step, tends to be affected by the pathology and metal implants. Thus, imprecise detections cause biased patches before segmentation, which then leads to inaccurate contour delineation and inconsistent segmentation. In our work, we intend to label individual and complete binary masks to address that challenge. Specifically, a contour generation network is proposed based on Structural Low-Rank Descriptors for shape consistency, termed SLoRD. For a structural representation of vertebral contours, we adopt the spherical coordinate system and devise the spherical centroid to calculate contour descriptors. Due to vertebrae's similar appearances, basic contour descriptors can be acquired to restore original contours. Therefore, SLoRD leverages these contour priors and explicit shape constraints to facilitate regressed contour points close to vertebral surfaces. Quantitative and qualitative evaluations on VerSe 2019 and 2020 demonstrate the superior performance of our framework over other single-stage and multi-stage state-of-the-art (SOTA) methods. Further, SLoRD is a plug-and-play framework to refine the segmentation inconsistency existing in coarse predictions from other approaches.
- [605] arXiv:2407.09845 (replaced) [pdf, html, other]
-
Title: Towards Understanding Epoch-wise Double descent in Two-layer Linear Neural NetworksSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Epoch-wise double descent is the phenomenon where generalisation performance improves beyond the point of overfitting, resulting in a generalisation curve exhibiting two descents under the course of learning. Understanding the mechanisms driving this behaviour is crucial not only for understanding the generalisation behaviour of machine learning models in general, but also for employing conventional selection methods, such as the use of early stopping to mitigate overfitting. While we ultimately want to draw conclusions of more complex models, such as deep neural networks, a majority of theoretical results regarding the underlying cause of epoch-wise double descent are based on simple models, such as standard linear regression. In this paper, to take a step towards more complex models in theoretical analysis, we study epoch-wise double descent in two-layer linear neural networks. First, we derive a gradient flow for the linear two-layer model, that bridges the learning dynamics of the standard linear regression model, and the linear two-layer diagonal network with quadratic weights. Second, we identify additional factors of epoch-wise double descent emerging with the extra model layer, by deriving necessary conditions for the generalisation error to follow a double descent pattern. While epoch-wise double descent in linear regression has been attributed to differences in input variance, in the two-layer model, also the singular values of the input-output covariance matrix play an important role. This opens up for further questions regarding unidentified factors of epoch-wise double descent for truly deep models.
- [606] arXiv:2407.11242 (replaced) [pdf, html, other]
-
Title: OmniGenome: Aligning RNA Sequences with Secondary Structures in Genomic Foundation ModelsComments: submitted to NeurIPS 2024, 19 pagesSubjects: Genomics (q-bio.GN); Computation and Language (cs.CL)
The alignment between RNA sequences and structures in foundation models (FMs) has yet to be thoroughly investigated. Existing FMs have struggled to establish sequence-structure alignment, hindering the free flow of genomic information between RNA sequences and structures. In this study, we introduce OmniGenome, an RNA FM trained to align RNA sequences with respect to secondary structures based on structure-contextualised modelling. The alignment enables free and bidirectional mappings between sequences and structures by utilising the flexible RNA modelling paradigm that supports versatile input and output modalities, i.e., sequence and/or structure as input/output. We implement RNA design and zero-shot secondary structure prediction as case studies to evaluate the Seq2Str and Str2Seq mapping capacity of OmniGenome. Results on the EternaV2 benchmark show that OmniGenome solved 74% of puzzles, whereas existing FMs only solved up to 3% of the puzzles due to the oversight of sequence-structure alignment. We leverage four comprehensive in-silico genome modelling benchmarks to evaluate performance across a diverse set of genome downstream tasks, where the results show that OmniGenome achieves state-of-the-art performance on RNA and DNA benchmarks, even without any training on DNA genomes.
- [607] arXiv:2407.18070 (replaced) [pdf, html, other]
-
Title: CSWin-UNet: Transformer UNet with Cross-Shaped Windows for Medical Image SegmentationSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Deep learning, especially convolutional neural networks (CNNs) and Transformer architectures, have become the focus of extensive research in medical image segmentation, achieving impressive results. However, CNNs come with inductive biases that limit their effectiveness in more complex, varied segmentation scenarios. Conversely, while Transformer-based methods excel at capturing global and long-range semantic details, they suffer from high computational demands. In this study, we propose CSWin-UNet, a novel U-shaped segmentation method that incorporates the CSWin self-attention mechanism into the UNet to facilitate horizontal and vertical stripes self-attention. This method significantly enhances both computational efficiency and receptive field interactions. Additionally, our innovative decoder utilizes a content-aware reassembly operator that strategically reassembles features, guided by predicted kernels, for precise image resolution restoration. Our extensive empirical evaluations on diverse datasets, including synapse multi-organ CT, cardiac MRI, and skin lesions, demonstrate that CSWin-UNet maintains low model complexity while delivering high segmentation accuracy. Codes are available at this https URL.
- [608] arXiv:2408.07067 (replaced) [pdf, html, other]
-
Title: Asymptotic quantification of entanglement with a single copyComments: 18+18 pagesSubjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Mathematical Physics (math-ph)
Despite the central importance of quantum entanglement in fueling many quantum technologies, the understanding of the optimal ways to exploit it is still beyond our reach, and even measuring entanglement in an operationally meaningful way is prohibitively difficult. This is due to the need to precisely characterise many-copy, asymptotic protocols for entanglement processing. Here we overcome these issues by introducing a new way of benchmarking the fundamental protocol of entanglement distillation (purification), where instead of measuring its asymptotic yield, we focus on the best achievable error. We connect this formulation of the task with an information-theoretic problem in composite quantum hypothesis testing known as generalised Sanov's theorem. By solving the latter problem -- which had no previously known solution even in classical information theory -- we thus compute the optimal asymptotic error exponent of entanglement distillation. We show this asymptotic solution to be given by the reverse relative entropy of entanglement, a single-letter quantity that can be evaluated using only a single copy of a quantum state, which is a unique feature among operational measures of entanglement. Altogether, we thus demonstrate a measure of entanglement that admits a direct operational interpretation as the optimal asymptotic rate of an important entanglement manipulation protocol while enjoying an exact, single-letter formula.
- [609] arXiv:2408.10114 (replaced) [pdf, html, other]
-
Title: Topics in Non-local Games: Synchronous Algebras, Algebraic Graph Identities, and Quantum NP-hardness ReductionsComments: 21 pages. Research conducted under the supervision of Dr. Connor Paddock and Prof. Anne BroadbentSubjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Operator Algebras (math.OA)
We review the correspondence between synchronous games and their associated $*$-algebra. Building upon the work of (Helton et al., New York J. Math. 2017), we propose results on algebraic and locally commuting graph identities. Based on the noncommutative Nullstellensätze (Watts, Helton and Klep, Annales Henri Poincaré 2023), we build computational tools that check the non-existence of perfect $C^*$ and algebraic strategies of synchronous games using Gröbner basis methods and semidefinite programming. We prove the equivalence between the hereditary and $C^*$ models questioned in (Helton et al., New York J. Math. 2017). We also extend the quantum-version NP-hardness reduction $\texttt{3-Coloring}^* \leq_p \texttt{3-SAT}^*$ due to (Ji, arXiv 2013) by exhibiting another instance of such reduction $\texttt{Clique}^* \leq_p \texttt{3-SAT}^*$.
- [610] arXiv:2408.16462 (replaced) [pdf, html, other]
-
Title: Consensus Planning with Primal, Dual, and Proximal AgentsSubjects: Optimization and Control (math.OC); Multiagent Systems (cs.MA)
Consensus planning is a method for coordinating decision making across complex systems and organizations, including complex supply chain optimization pipelines. It arises when large interdependent distributed agents (systems) share common resources and must act in order to achieve a joint goal. In prior consensus planning work, all agents have been assumed to have the same interaction pattern (e.g., all dual agents or all primal agents or all proximal agents), most commonly using the Alternating Direction Method of Multipliers (ADMM) as proximal agents. However, this is often not a valid assumption in practice, where agents consist of large complex systems, and where we might not have the luxury of modifying these large complex systems at will. In this paper, we introduce a consensus algorithm that overcomes this hurdle by allowing for the coordination of agents with different types of interfaces (named primal, dual, and proximal). Our consensus planning algorithm allows for any mix of agents by combining ADMM-like updates for the proximal agents, dual ascent updates for the dual agents, and linearized ADMM updates for the primal agents. We prove convergence results for the algorithm, namely a sublinear O(1/k) convergence rate under mild assumptions, and two-step linear convergence under stronger assumptions. We also discuss enhancements to the basic method and provide illustrative empirical results.
- [611] arXiv:2408.17175 (replaced) [pdf, html, other]
-
Title: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language ModelZhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei XueSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation. Our code and demo are available (Demo: this https URL Code: this https URL)
- [612] arXiv:2409.03510 (replaced) [pdf, html, other]
-
Title: A Deceptively Simple Quadratic RecurrenceComments: 9 pagesSubjects: Number Theory (math.NT); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
Standard techniques for treating linear recurrences no longer apply for quadratic recurrences. It is not hard to determine asymptotics for a specific parametrized model over a wide domain of values (all $p \neq 1/2$ here). The gap between theory and experimentation seems insurmountable, however, at a single outlier ($p = 1/2$).
- [613] arXiv:2409.03607 (replaced) [pdf, html, other]
-
Title: Variance reduction in Texas hold'em and in video pokerComments: 15 pages, 0 figuresSubjects: Probability (math.PR); Computer Science and Game Theory (cs.GT)
In Texas hold'em, after an all-in bet is made and called before the flop, the turn, or the river, the two players sometimes agree to run it $n$ times, meaning that the remaining five, two, or one cards are dealt out not just once but $n$ times successively without replacement, with $1/n$ of the pot attached to each run. In $n$-play video poker, five cards are dealt exactly as in the conventional single-play game. After the player chooses which cards to hold, new cards are drawn to replace the discards, not just once but $n$ times independently, with $1/n$ of the bet attached to each draw. In both scenarios the players are attempting to reduce the variance of the return without changing the mean. We quantify the extent to which the variance is reduced.
- [614] arXiv:2409.07077 (replaced) [pdf, html, other]
-
Title: Submonoid Membership in n-dimensional lamplighter groups and S-unit equationsComments: Added funding information, 21 pagesSubjects: Group Theory (math.GR); Formal Languages and Automata Theory (cs.FL); Number Theory (math.NT)
We show that Submonoid Membership is decidable in n-dimensional lamplighter groups $(\mathbb{Z}/p\mathbb{Z}) \wr \mathbb{Z}^n$ for any prime $p$ and integer $n$. More generally, we show decidability of Submonoid Membership in semidirect products of the form $\mathcal{Y} \rtimes \mathbb{Z}^n$, where $\mathcal{Y}$ is any finitely presented module over the Laurent polynomial ring $\mathbb{F}_p[X_1^{\pm}, \ldots, X_n^{\pm}]$. Combined with a result of Shafrir (2024), this gives the first example of a group $G$ and a finite index subgroup $\widetilde{G} \leq G$, such that Submonoid Membership is decidable in $\widetilde{G}$ but undecidable in $G$.
To obtain our decidability result, we reduce Submonoid Membership in $\mathcal{Y} \rtimes \mathbb{Z}^n$ to solving S-unit equations over $\mathbb{F}_p[X_1^{\pm}, \ldots, X_n^{\pm}]$-modules. We show that the solution set of such equations is effectively $p$-automatic, extending a result of Adamczewski and Bell (2012). As an intermediate result, we also obtain that the solution set of the Knapsack Problem in $\mathcal{Y} \rtimes \mathbb{Z}^n$ is effectively $p$-automatic. - [615] arXiv:2409.08584 (replaced) [pdf, html, other]
-
Title: CompressedMediQ: Hybrid Quantum Machine Learning Pipeline for High-Dimensional Neuroimaging DataSubjects: Quantum Physics (quant-ph); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
This paper introduces CompressedMediQ, a novel hybrid quantum-classical machine learning pipeline specifically developed to address the computational challenges associated with high-dimensional multi-class neuroimaging data analysis. Standard neuroimaging datasets, such as 4D MRI data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and Neuroimaging in Frontotemporal Dementia (NIFD), present significant hurdles due to their vast size and complexity. CompressedMediQ integrates classical high-performance computing (HPC) nodes for advanced MRI pre-processing and Convolutional Neural Network (CNN)-PCA-based feature extraction and reduction, addressing the limited-qubit availability for quantum data encoding in the NISQ (Noisy Intermediate-Scale Quantum) era. This is followed by Quantum Support Vector Machine (QSVM) classification. By utilizing quantum kernel methods, the pipeline optimizes feature mapping and classification, enhancing data separability and outperforming traditional neuroimaging analysis techniques. Experimental results highlight the pipeline's superior accuracy in dementia staging, validating the practical use of quantum machine learning in clinical diagnostics. Despite the limitations of NISQ devices, this proof-of-concept demonstrates the transformative potential of quantum-enhanced learning, paving the way for scalable and precise diagnostic tools in healthcare and signal processing.
- [616] arXiv:2409.10588 (replaced) [pdf, html, other]
-
Title: Opponent Shaping for Antibody DevelopmentSebastian Towers, Aleksandra Kalisz, Philippe A. Robert, Alicia Higueruelo, Francesca Vianello, Ming-Han Chloe Tsai, Harrison Steel, Jakob N. FoersterComments: PreprintSubjects: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Anti-viral therapies are typically designed or evolved towards the current strains of a virus. In learning terms, this corresponds to a myopic best response, i.e., not considering the possible adaptive moves of the opponent. However, therapy-induced selective pressures act on viral antigens to drive the emergence of mutated strains, against which initial therapies have reduced efficacy. To motivate our work, we consider antibody designs that target not only the current viral strains but also the wide range of possible future variants that the virus might evolve into under the evolutionary pressure exerted by said antibodies. Building on a computational model of binding between antibodies and viral antigens (the Absolut! framework), we design and implement a genetic simulation of the viral evolutionary escape. Crucially, this allows our antibody optimisation algorithm to consider and influence the entire escape curve of the virus, i.e. to guide (or ''shape'') the viral evolution. This is inspired by opponent shaping which, in general-sum learning, accounts for the adaptation of the co-player rather than playing a myopic best response. Hence we call the optimised antibodies shapers. Within our simulations, we demonstrate that our shapers target both current and simulated future viral variants, outperforming the antibodies chosen in a myopic way. Furthermore, we show that shapers exert specific evolutionary pressure on the virus compared to myopic antibodies. Altogether, shapers modify the evolutionary trajectories of viral strains and minimise the viral escape compared to their myopic counterparts. While this is a simple model, we hope that our proposed paradigm will enable the discovery of better long-lived vaccines and antibody therapies in the future, enabled by rapid advancements in the capabilities of simulation tools.
- [617] arXiv:2409.10739 (replaced) [pdf, html, other]
-
Title: Evolving a Multi-Population Evolutionary-QAOA on Distributed QPUsComments: 8 pages, 5 figuresSubjects: Quantum Physics (quant-ph); Neural and Evolutionary Computing (cs.NE)
Our research combines an Evolutionary Algorithm (EA) with a Quantum Approximate Optimization Algorithm (QAOA) to update the ansatz parameters, in place of traditional gradient-based methods, and benchmark on the Max-Cut problem. We demonstrate that our Evolutionary-QAOA (E-QAOA) pairing performs on par or better than a COBYLA-based QAOA in terms of solution accuracy and variance, for $d$-3 regular graphs between 4 and 26 nodes, using both $max\_count$ and Conditional Value at Risk (CVaR) for fitness function evaluations. Furthermore, we take our algorithm one step further and present a novel approach by presenting a multi-population EA distributed on two QPUs, which evolves independent and isolated populations in parallel, classically communicating elite individuals. Experiments were conducted on both simulators and IBM quantum hardware, and we investigated the relative performance accuracy and variance.
- [618] arXiv:2409.11299 (replaced) [pdf, html, other]
-
Title: TTT-Unet: Enhancing U-Net with Test-Time Training Layers for Biomedical Image SegmentationRong Zhou, Zhengqing Yuan, Zhiling Yan, Weixiang Sun, Kai Zhang, Yiwei Li, Yanfang Ye, Xiang Li, Lifang He, Lichao SunSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Biomedical image segmentation is crucial for accurately diagnosing and analyzing various diseases. However, Convolutional Neural Networks (CNNs) and Transformers, the most commonly used architectures for this task, struggle to effectively capture long-range dependencies due to the inherent locality of CNNs and the computational complexity of Transformers. To address this limitation, we introduce TTT-Unet, a novel framework that integrates Test-Time Training (TTT) layers into the traditional U-Net architecture for biomedical image segmentation. TTT-Unet dynamically adjusts model parameters during the testing time, enhancing the model's ability to capture both local and long-range features. We evaluate TTT-Unet on multiple medical imaging datasets, including 3D abdominal organ segmentation in CT and MR images, instrument segmentation in endoscopy images, and cell segmentation in microscopy images. The results demonstrate that TTT-Unet consistently outperforms state-of-the-art CNN-based and Transformer-based segmentation models across all tasks. The code is available at this https URL.
- [619] arXiv:2409.11430 (replaced) [pdf, other]
-
Title: Federated Learning with Quantum Computing and Fully Homomorphic Encryption: A Novel Computing Paradigm Shift in Privacy-Preserving MLSiddhant Dutta, Pavana P Karanth, Pedro Maciel Xavier, Iago Leal de Freitas, Nouhaila Innan, Sadok Ben Yahia, Muhammad Shafique, David E. Bernal NeiraComments: 10 pages, 2 figuresSubjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
The widespread deployment of products powered by machine learning models is raising concerns around data privacy and information security worldwide. To address this issue, Federated Learning was first proposed as a privacy-preserving alternative to conventional methods that allow multiple learning clients to share model knowledge without disclosing private data. A complementary approach known as Fully Homomorphic Encryption (FHE) is a quantum-safe cryptographic system that enables operations to be performed on encrypted weights. However, implementing mechanisms such as these in practice often comes with significant computational overhead and can expose potential security threats. Novel computing paradigms, such as analog, quantum, and specialized digital hardware, present opportunities for implementing privacy-preserving machine learning systems while enhancing security and mitigating performance loss. This work instantiates these ideas by applying the FHE scheme to a Federated Learning Neural Network architecture that integrates both classical and quantum layers.
- [620] arXiv:2409.11738 (replaced) [pdf, html, other]
-
Title: Adaptive Selection of Sampling-Reconstruction in Fourier Compressed SensingComments: 30 pages, 9.8 MB, Accepted to ECCV 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Compressed sensing (CS) has emerged to overcome the inefficiency of Nyquist sampling. However, traditional optimization-based reconstruction is slow and can not yield an exact image in practice. Deep learning-based reconstruction has been a promising alternative to optimization-based reconstruction, outperforming it in accuracy and computation speed. Finding an efficient sampling method with deep learning-based reconstruction, especially for Fourier CS remains a challenge. Existing joint optimization of sampling-reconstruction works ($\mathcal{H}_1$) optimize the sampling mask but have low potential as it is not adaptive to each data point. Adaptive sampling ($\mathcal{H}_2$) has also disadvantages of difficult optimization and Pareto sub-optimality. Here, we propose a novel adaptive selection of sampling-reconstruction ($\mathcal{H}_{1.5}$) framework that selects the best sampling mask and reconstruction network for each input data. We provide theorems that our method has a higher potential than $\mathcal{H}_1$ and effectively solves the Pareto sub-optimality problem in sampling-reconstruction by using separate reconstruction networks for different sampling masks. To select the best sampling mask, we propose to quantify the high-frequency Bayesian uncertainty of the input, using a super-resolution space generation model. Our method outperforms joint optimization of sampling-reconstruction ($\mathcal{H}_1$) and adaptive sampling ($\mathcal{H}_2$) by achieving significant improvements on several Fourier CS problems.
- [621] arXiv:2409.11752 (replaced) [pdf, html, other]
-
Title: Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation using Rein to Fine-tune Vision Foundation ModelsSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In recent years, significant progress has been made in tumor segmentation within the field of digital pathology. However, variations in organs, tissue preparation methods, and image acquisition processes can lead to domain discrepancies among digital pathology images. To address this problem, in this paper, we use Rein, a fine-tuning method, to parametrically and efficiently fine-tune various vision foundation models (VFMs) for MICCAI 2024 Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation (COSAS2024). The core of Rein consists of a set of learnable tokens, which are directly linked to instances, improving functionality at the instance level in each layer. In the data environment of the COSAS2024 Challenge, extensive experiments demonstrate that Rein fine-tuned the VFMs to achieve satisfactory results. Specifically, we used Rein to fine-tune ConvNeXt and DINOv2. Our team used the former to achieve scores of 0.7719 and 0.7557 on the preliminary test phase and final test phase in task1, respectively, while the latter achieved scores of 0.8848 and 0.8192 on the preliminary test phase and final test phase in task2. Code is available at GitHub.