A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.
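A minimal sketch of the kind of fusion this abstract describes, under assumptions of my own (module names, feature dimensions, and the sigmoid heads are illustrative, not the authors' architecture): image features are combined with an embedding of a VLM-generated inconsistency description to produce both a global realness score and a dense realness map.

```python
# Illustrative sketch, not the paper's code: fuse spatial image features with a
# text embedding of a VLM-generated inconsistency description to predict a
# global realness score and a per-pixel realness map. All names are assumptions.
import torch
import torch.nn as nn

class RealnessHead(nn.Module):
    def __init__(self, img_dim=256, txt_dim=512, hidden=256):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)           # align text to image feature space
        self.fuse = nn.Conv2d(2 * img_dim, hidden, kernel_size=1)
        self.map_head = nn.Conv2d(hidden, 1, kernel_size=1)   # dense realness map
        self.score_head = nn.Linear(hidden, 1)                # global realness score

    def forward(self, img_feats, txt_emb):
        # img_feats: (B, C, H, W) spatial features; txt_emb: (B, txt_dim) from a VLM text encoder
        t = self.txt_proj(txt_emb)[:, :, None, None].expand(-1, -1, *img_feats.shape[2:])
        h = torch.relu(self.fuse(torch.cat([img_feats, t], dim=1)))
        realness_map = torch.sigmoid(self.map_head(h))         # per-pixel realness in [0, 1]
        score = torch.sigmoid(self.score_head(h.mean(dim=(2, 3))))
        return score, realness_map

score, rmap = RealnessHead()(torch.randn(2, 256, 14, 14), torch.randn(2, 512))
```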
Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require "temporally dense" reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.
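A hedged sketch of the two ideas named above (learnable residual connections and token reduction); the gating, the change-based token selection, and the tensor shapes are assumptions for illustration, not the ResidualViT implementation.

```python
# Illustrative sketch (assumptions, not ResidualViT): reuse the previous frame's
# features through a learnable residual blend and drop temporally redundant
# tokens before running a frozen pretrained ViT on the remaining ones.
import torch
import torch.nn as nn

class ResidualFrameEncoder(nn.Module):
    def __init__(self, vit, dim=768, keep_ratio=0.5):
        super().__init__()
        self.vit = vit                               # frozen encoder: (B, k, D) tokens -> (B, k, D) features
        self.gate = nn.Parameter(torch.zeros(dim))   # learnable residual weight per channel
        self.keep_ratio = keep_ratio

    def forward(self, tokens_t, tokens_prev, feats_prev):
        # tokens_*: (B, N, D) patch embeddings of current / previous frame
        # feats_prev: (B, N, D) features already computed for the previous frame
        diff = (tokens_t - tokens_prev).abs().mean(-1)            # (B, N) temporal change per token
        k = max(1, int(self.keep_ratio * tokens_t.shape[1]))
        idx = diff.topk(k, dim=1).indices                         # keep the k most-changed tokens
        kept = torch.gather(tokens_t, 1, idx.unsqueeze(-1).expand(-1, -1, tokens_t.shape[-1]))
        feats_kept = self.vit(kept)                               # encode only the reduced token set
        feats_t = feats_prev.clone()                              # reuse previous features elsewhere
        feats_t.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, feats_t.shape[-1]), feats_kept)
        g = torch.sigmoid(self.gate)                              # learnable blend for temporal consistency
        return g * feats_t + (1.0 - g) * feats_prev
```

For example, `ResidualFrameEncoder(nn.Identity())` can be used to test the shapes before plugging in a real token-level encoder.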
Reliable uncertainty quantification (UQ) is essential in medical AI. Evidential Deep Learning (EDL) offers a computationally efficient way to quantify model uncertainty alongside predictions, unlike traditional methods such as Monte Carlo (MC) Dropout and Deep Ensembles (DE). However, all these methods often rely on a single expert's annotations as ground truth for model training, overlooking the inter-rater variability in healthcare. To address this issue, we propose MEGAN, a Multi-Expert Gating Network that aggregates uncertainty estimates and predictions from multiple AI experts via EDL models trained with diverse ground truths and modeling strategies. MEGAN's gating network optimally combines predictions and uncertainties from each EDL model, enhancing overall prediction confidence and calibration. We extensively benchmark MEGAN on endoscopy videos for ulcerative colitis (UC) disease severity estimation, assessed by visual labeling of the Mayo Endoscopic Subscore (MES), where inter-rater variability is prevalent. In a large-scale prospective UC clinical trial, MEGAN achieved a 3.5% improvement in F1-score and a 30.5% reduction in Expected Calibration Error (ECE) compared to existing methods. Furthermore, MEGAN facilitated uncertainty-guided sample stratification, reducing the annotation burden and potentially increasing efficiency and consistency in UC trials.
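A minimal sketch of a gating network in the spirit described above (hypothetical layer sizes and input layout, not MEGAN's code): the gate consumes each expert's class probabilities and scalar evidential uncertainty and outputs mixture weights for fusing the experts.

```python
# Minimal sketch (hypothetical, not MEGAN): a gating network that weights class
# probabilities from several EDL experts using their predictions and
# per-expert uncertainty estimates.
import torch
import torch.nn as nn

class ExpertGate(nn.Module):
    def __init__(self, n_experts, n_classes, hidden=32):
        super().__init__()
        in_dim = n_experts * (n_classes + 1)           # per expert: class probs + scalar uncertainty
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_experts))

    def forward(self, probs, uncertainty):
        # probs: (B, E, C) expert class probabilities; uncertainty: (B, E) evidential uncertainty
        x = torch.cat([probs.flatten(1), uncertainty], dim=1)
        w = torch.softmax(self.net(x), dim=1)           # (B, E) gating weights
        fused = (w.unsqueeze(-1) * probs).sum(dim=1)    # weighted mixture of expert predictions
        return fused, w

fused, w = ExpertGate(n_experts=3, n_classes=4)(torch.softmax(torch.randn(2, 3, 4), -1), torch.rand(2, 3))
```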
The aorta is the body's largest arterial vessel and the primary conduit for oxygenated blood in the systemic circulation. Aortic aneurysms consistently rank among the top twenty causes of mortality in the United States. Thoracic aortic aneurysm (TAA), an abnormal dilation of the thoracic aorta, remains a clinically significant disease and a leading cause of death in adults; rupture occurs when elevated blood pressure compromises the integrity of all aortic wall layers. Three-dimensional computed tomography (3D CT) is currently considered the gold standard for diagnosing TAA. The geometric characteristics of the aorta, which can be quantified from medical imaging, and the stresses on the aortic wall, which can be obtained by finite element analysis (FEA), are critical for evaluating the risk of rupture and dissection. Deep learning-based image segmentation has emerged as a reliable method for extracting anatomical regions of interest from medical images, and voxel-based segmentation masks are typically converted into structured mesh representations to enable accurate simulation. Hexahedral meshes are commonly used in finite element simulations of the aorta because of their computational efficiency and superior simulation accuracy. Given anatomical variability, patient-specific modeling enables detailed assessment of individual anatomy and biomechanical behavior, supporting precise simulations, accurate diagnoses, and personalized treatment strategies. Finite element (FE) simulations provide valuable insight into the biomechanical behavior of tissues and organs in clinical studies, and developing accurate FE models is a crucial first step toward a patient-specific, biomechanically based framework for predicting the risk of TAA.
The increasing prevalence of retinal diseases poses a significant challenge to the healthcare system, as the demand for ophthalmologists surpasses the available workforce. This imbalance creates a bottleneck in diagnosis and treatment, potentially delaying critical care. Traditional methods of generating medical reports from retinal images rely on manual interpretation, which is time-consuming and prone to errors, further straining ophthalmologists' limited resources. This thesis investigates the potential of Artificial Intelligence (AI) to automate medical report generation for retinal images. AI can quickly analyze large volumes of image data, identifying subtle patterns essential for accurate diagnosis. By automating this process, AI systems can greatly enhance the efficiency of retinal disease diagnosis, reducing doctors' workloads and enabling them to focus on more complex cases. The proposed AI-based methods address key challenges in automated report generation: (1) A multi-modal deep learning approach captures interactions between textual keywords and retinal images, resulting in more comprehensive medical reports; (2) Improved methods for medical keyword representation enhance the system's ability to capture nuances in medical terminology; (3) Strategies to overcome RNN-based models' limitations, particularly in capturing long-range dependencies within medical descriptions; (4) Techniques to enhance the interpretability of the AI-based report generation system, fostering trust and acceptance in clinical practice. These methods are rigorously evaluated using various metrics and achieve state-of-the-art performance. This thesis demonstrates AI's potential to revolutionize retinal disease diagnosis by automating medical report generation, ultimately improving clinical efficiency, diagnostic accuracy, and patient care.
Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. Emerging vision foundation models such as DINOv2, pretrained on extensive, unlabeled datasets, offer generalized representations that can potentially alleviate these limitations. In this study, we propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification. Leveraging the self-supervised DINOv2 model as a pretrained feature extractor, our method processes individual 2D axial slices of brain MRIs, assigning adaptive slice-level importance weights through a soft attention mechanism. To further address data scarcity, we employ a composite loss function combining supervised contrastive learning with class-variance regularization, enhancing inter-class separability and intra-class consistency. We validate our framework on the ADNI dataset and an institutional multi-class headache cohort, demonstrating strong anomaly classification performance despite limited data availability and significant class imbalance. Our results highlight the efficacy of utilizing pretrained 2D foundation models combined with attention-based slice aggregation for robust volumetric anomaly detection in medical imaging. Our implementation is publicly available at https://github.com/Rafsani/DinoAtten3D.git.
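A sketch of the slice-aggregation step under stated assumptions (embedding size, attention MLP, and class count are illustrative, not the released DinoAtten3D code): DINOv2 embeddings of the axial slices are pooled with a learned soft attention to give one volume-level representation.

```python
# Sketch under assumptions, not the released DinoAtten3D code: pool DINOv2 slice
# embeddings of a 3D volume with a learned soft attention over the axial axis.
import torch
import torch.nn as nn

class SliceAttentionPool(nn.Module):
    def __init__(self, dim=768, hidden=128, n_classes=3):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, slice_embs):
        # slice_embs: (B, S, D) DINOv2 embeddings of S axial slices
        attn = torch.softmax(self.score(slice_embs).squeeze(-1), dim=1)   # (B, S) slice weights
        volume_emb = (attn.unsqueeze(-1) * slice_embs).sum(dim=1)         # attention-weighted average
        return self.classifier(volume_emb), attn

logits, attn = SliceAttentionPool()(torch.randn(2, 96, 768))
```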
A healthy smile contributes to both function and esthetics and improves patient confidence, yet dental professionals often find it difficult to balance esthetic and functional requirements. Traditional smile design relied heavily on dentist expertise, plaster models, and hand drawings, raising questions about the predictability of patient outcomes. Digital workflows, pioneered by Dr. Christian Coachman in 2007, allow photographic and videographic assessment and enable improved communication among specialists and patients. In recent years, advances in artificial intelligence (AI) and big data have supported the analysis of facial features and the development of personalized smile designs. Outputs are, however, susceptible to practitioner bias and limitations of the training data, and may be suboptimal for individual patients. The study presented here proposes a comprehensive system integrating AI, big data, and recognition technologies to automate the smile design process so that both experienced and inexperienced dentists can generate pleasing esthetics with ease. The system comprises a Facial Feature Extraction Module and an Image Generation Module, serving diverse practitioner and patient needs. Future research can incorporate user data for design optimization and test virtual and augmented reality for real-time previewing. The data gathered can also be used in aesthetic preference analyses, enhancing our understanding of smile design in dental practice.
The recent success of immersive applications is pushing the research community to define new approaches to process 360° images and videos and optimize their transmission. Among these, saliency estimation provides a powerful tool that can be used to identify visually relevant areas and, consequently, adapt processing algorithms. Although saliency estimation has been widely investigated for 2D content, very few algorithms have been proposed for 360° saliency estimation. Towards this goal, we introduce Sphere-GAN, a saliency detection model for 360° videos that leverages a Generative Adversarial Network with spherical convolutions. Extensive experiments were conducted using a public 360° video saliency dataset, and the results demonstrate that Sphere-GAN outperforms state-of-the-art models in accurately predicting saliency maps.
To select suitable filters for a task or to improve existing filters, a deep understanding of their inner workings is vital. Diffusion echoes, which are space-adaptive impulse responses, are useful to visualise the effect of nonlinear diffusion filters. However, they have received little attention in the literature. There may be two reasons for this: Firstly, the concept was introduced specifically for diffusion filters, which might appear too limited. Secondly, diffusion echoes have large storage requirements, which restricts their practicality. This work addresses both problems. We introduce the filter echo as a generalisation of the diffusion echo and use it for applications beyond adaptive smoothing, such as image inpainting, osmosis, and variational optic flow computation. We provide a framework to visualise and inspect echoes from various filters with different applications. Furthermore, we propose a compression approach for filter echoes, which reduces storage requirements by a factor of 20 to 100.
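One generic way to probe a filter echo numerically, shown below as a hedged illustration of the concept rather than the paper's algorithm: read off a column of the filter's Jacobian via a finite difference, i.e. the response of the (possibly nonlinear) filter to a small impulse at a chosen pixel.

```python
# Hedged sketch: probe a space-variant impulse response ("echo") of a generic
# filter by perturbing one pixel and differencing the outputs. This illustrates
# the concept; it is not the compression or visualisation method of the paper.
import numpy as np

def filter_echo(filt, image, y, x, eps=1e-3):
    """filt: callable mapping a 2D array to a filtered 2D array."""
    perturbed = image.copy()
    perturbed[y, x] += eps
    return (filt(perturbed) - filt(image)) / eps   # echo of the filter at pixel (y, x)

# Example with a simple (linear) box filter; echoes of adaptive filters vary with (y, x).
box = lambda u: (np.roll(u, 1, 0) + np.roll(u, -1, 0) + np.roll(u, 1, 1) + np.roll(u, -1, 1) + u) / 5.0
echo = filter_echo(box, np.random.rand(64, 64), 32, 32)
```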
Virtual Reality (VR) cloud gaming systems render 3D graphics on cloud servers so that graphically demanding games can be played on VR headsets. Delivering high-resolution game scenes is challenging due to variation in network performance. By leveraging the non-uniform nature of human visual perception, foveated rendering and encoding have proven effective for optimized streaming over constrained networks. State-of-the-art foveation methods either do not incorporate real-time gaze data or cannot handle variations in network conditions, resulting in a suboptimal user experience. We introduce EyeNexus, a pioneering system that combines real-time gaze-driven spatial compression (FSC) with gaze-driven video encoding (FVE), transforming the gaze point for precise alignment and foveation. We propose a novel foveation model that dynamically adjusts the foveation region based on real-time bandwidth and gaze data. The model simplifies network-aware quality assignment in FVE, ensuring smooth and imperceptible quality gradients. We evaluate EyeNexus using objective and subjective measures across different network conditions and games. EyeNexus reduces latency by up to 70.9% and improves perceptual visual quality by up to 24.6%. Our IRB-approved user study shows that EyeNexus achieves the highest playability and visual quality, with improvements of up to 48%, while eliminating motion sickness.
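A toy sketch of the bandwidth- and gaze-aware quality assignment idea (the radius mapping, tile grid, and QP range are assumptions, not EyeNexus itself): the foveal region shrinks as bandwidth drops, and each tile's quality falls off smoothly with its distance from the gaze point.

```python
# Illustrative sketch (assumed logic, not EyeNexus): shrink the foveal region as
# bandwidth drops and assign a per-tile quantisation parameter (QP) from the
# tile's distance to the current gaze point, with a smooth quality gradient.
import numpy as np

def tile_quality(gaze_xy, tiles_xy, bandwidth_mbps, max_radius=0.3, min_radius=0.1,
                 max_bw=100.0, qp_best=22, qp_worst=42):
    # Foveal radius (normalised screen units) grows with bandwidth, clamped to a range.
    radius = min_radius + (max_radius - min_radius) * min(bandwidth_mbps / max_bw, 1.0)
    dist = np.linalg.norm(tiles_xy - np.asarray(gaze_xy), axis=1)
    # Quality degrades smoothly outside the foveal radius (no hard edge).
    falloff = np.clip((dist - radius) / (1.0 - radius), 0.0, 1.0)
    return np.round(qp_best + falloff * (qp_worst - qp_best)).astype(int)

tiles = np.stack(np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8)), -1).reshape(-1, 2)
print(tile_quality(gaze_xy=(0.5, 0.5), tiles_xy=tiles, bandwidth_mbps=30))
```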
Recent research has explored complex loss functions for deblurring. In this work, we explore the impact of a previously introduced loss function, Q, which explicitly addresses sharpness, and employ it to fine-tune state-of-the-art (SOTA) deblurring models. Standard image quality metrics such as PSNR or SSIM do not distinguish sharpness from ringing. Therefore, we propose a novel full-reference image quality metric, Omega, that combines PSNR with Q. This metric is sensitive to ringing artefacts but not to a slight increase in sharpness, making it a fair metric for comparing restorations from deblurring mechanisms. Our approach yields an increase of 15 percent in sharpness (Q) and up to 10 percent in Omega over the use of standard losses.
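The abstract does not give the exact form of Omega, so the sketch below uses a placeholder weighted sum of PSNR and a sharpness score Q, purely to illustrate how such a composite full-reference metric could be assembled; the weighting and the `q_metric` callable are assumptions.

```python
# Hedged sketch: Omega's exact definition is not given in the abstract, so the
# alpha-weighted combination below is a placeholder, not the proposed metric.
import numpy as np

def psnr(ref, img, peak=1.0):
    mse = np.mean((ref - img) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def omega(ref, img, q_metric, alpha=0.5):
    # q_metric: callable returning a no-reference sharpness score Q for an image
    return alpha * psnr(ref, img) + (1.0 - alpha) * q_metric(img)   # placeholder weighting
```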
Objective: Lung cancer is a leading cause of cancer-related mortality worldwide, primarily due to delayed diagnosis and poor early detection. This study aims to develop a computer-aided diagnosis (CAD) system that leverages large vision-language models (VLMs) for the accurate detection and classification of pulmonary nodules in computed tomography (CT) scans. Methods: We propose an end-to-end CAD pipeline consisting of two modules: (i) a detection module (CADe) based on the Segment Anything Model 2 (SAM2), in which the standard visual prompt is replaced with a text prompt encoded by CLIP (Contrastive Language-Image Pretraining), and (ii) a diagnosis module (CADx) that calculates similarity scores between segmented nodules and radiomic features. To add clinical context, synthetic electronic medical records (EMRs) were generated using radiomic assessments by expert radiologists and combined with similarity scores for final classification. The method was tested on the publicly available LIDC-IDRI dataset (1,018 CT scans). Results: The proposed approach demonstrated strong performance in zero-shot lung nodule analysis. The CADe module achieved a Dice score of 0.92 and an IoU of 0.85 for nodule segmentation. The CADx module attained a specificity of 0.97 for malignancy classification, surpassing existing fully supervised methods. Conclusions: The integration of VLMs with radiomics and synthetic EMRs allows for accurate and clinically relevant CAD of pulmonary nodules in CT scans. The proposed system shows strong potential to enhance early lung cancer detection, increase diagnostic confidence, and improve patient management in routine clinical workflows.
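A minimal sketch of the CADx similarity idea only (the descriptor phrases and the use of the off-the-shelf CLIP checkpoint are assumptions, and the SAM2-based detection stage is omitted): a segmented nodule crop is scored against textual radiomic descriptors with CLIP, and the similarities can then serve as classification features.

```python
# Minimal sketch of the CADx idea (assumptions, not the paper's exact pipeline):
# score a segmented nodule crop against textual radiomic descriptors with CLIP.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def radiomic_similarities(nodule_crop, descriptors):
    # nodule_crop: PIL image of the segmented nodule; descriptors: list of phrases
    # such as "spiculated margin" or "ground-glass opacity" (illustrative examples).
    inputs = proc(text=descriptors, images=nodule_crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0)        # cosine similarity per descriptor
```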
We propose MindVL, a multimodal large language model trained on Ascend NPUs. Similar to Qwen2.5-VL, MindVL adopts native-resolution Vision Transformers, which enables it to process images at their original variable resolutions. This design avoids the degradation caused by fixed-resolution tiling while preserving fine-grained details and global layouts, which is crucial for visually dense content such as complex charts and diagrams. To ensure the smooth training of MindVL on Ascend NPUs, we develop Mindspeed-MLLM, a distributed multimodal training framework tailored for Ascend NPUs. To maintain training accuracy, we implement equivalent replacements for certain operators. MindVL undergoes a three-phase training process, namely the warm-up phase, multitask training phase, and supervised instruction tuning phase, to gradually enhance its capabilities. This process starts with basic visual and multimodal pre-training, followed by large-scale multitask training and instruction tuning. We also adopt multimodal data packaging and hybrid parallelism techniques, which significantly improve end-to-end training speed. To further boost model performance, we specifically introduce test-time resolution search and model weight averaging. Notably, despite using about 1/10 of the training data required by Qwen2.5-VL, MindVL achieves performance on par with Qwen2.5-VL in evaluations of general multimodal understanding and document/table comprehension. Beyond overall scores, MindVL also delivers leading performance in OCR assessments.
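A simple sketch of checkpoint weight averaging as mentioned above; this is the generic recipe (averaging parameters of several fine-tuned checkpoints), not MindVL's training code, and the file paths are placeholders.

```python
# Generic sketch of model weight averaging (not MindVL's code): average the
# parameters of several checkpoints into a single state dict.
import torch

def average_checkpoints(paths):
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# torch.save(average_checkpoints(["ckpt_a.pt", "ckpt_b.pt"]), "averaged.pt")  # placeholder paths
```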
Existing deep learning models for chest radiology often neglect patient metadata, limiting diagnostic accuracy and fairness. To bridge this gap, we introduce MetaCheX, a novel multimodal framework that integrates chest X-ray images with structured patient metadata to replicate clinical decision-making. Our approach combines a convolutional neural network (CNN) backbone with metadata processed by a multilayer perceptron through a shared classifier. Evaluated on the CheXpert Plus dataset, MetaCheX consistently outperformed radiograph-only baseline models across multiple CNN architectures. Integrating metadata significantly improved overall diagnostic accuracy, as measured by an increase in AUROC. The results of this study demonstrate that metadata reduces algorithmic bias and enhances model generalizability across diverse patient populations. MetaCheX advances clinical artificial intelligence toward robust, context-aware radiographic disease detection.
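A sketch of the described fusion pattern (the backbone choice, metadata dimensionality, and label count are illustrative assumptions, not the MetaCheX release): a CNN image embedding and an MLP metadata embedding are concatenated and fed to a shared multi-label classifier.

```python
# Illustrative sketch of CNN + metadata-MLP fusion with a shared classifier
# (assumed dimensions, not MetaCheX's implementation).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ImageMetadataNet(nn.Module):
    def __init__(self, n_meta=10, n_labels=14):
        super().__init__()
        self.backbone = resnet50(weights=None)
        self.backbone.fc = nn.Identity()                        # 2048-d image features
        self.meta_mlp = nn.Sequential(nn.Linear(n_meta, 64), nn.ReLU(), nn.Linear(64, 64))
        self.classifier = nn.Linear(2048 + 64, n_labels)        # shared classifier head

    def forward(self, image, metadata):
        feats = torch.cat([self.backbone(image), self.meta_mlp(metadata)], dim=1)
        return self.classifier(feats)                           # multi-label logits (use BCE loss)

logits = ImageMetadataNet()(torch.randn(2, 3, 224, 224), torch.randn(2, 10))
```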
This article presents a novel microscopy image analysis framework designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public LIVECell dataset, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical applications -- particularly in cell transplantation for personalized medicine and muscle regeneration therapies.
Ultrasound imaging is widely used in clinical practice due to its cost-effectiveness, mobility, and safety. However, current AI research often treats disease prediction and tissue segmentation as two separate tasks, and the resulting models incur substantial computational overhead. To address this, we introduce UltraUPConvNet, a computationally efficient universal framework designed for both ultrasound image classification and segmentation. Trained on a large-scale dataset containing more than 9,700 annotations across seven different anatomical regions, our model achieves state-of-the-art performance on certain datasets with lower computational overhead. Our model weights and code are available at https://github.com/yyxl123/UltraUPConvNet
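A minimal sketch of the shared-computation idea (layer sizes and class count are assumptions, not the released UltraUPConvNet): one shared convolutional encoder feeds both a classification head and a segmentation decoder, so the two tasks reuse most of the computation.

```python
# Illustrative sketch (not UltraUPConvNet): a shared encoder with a
# classification head and a segmentation decoder on top.
import torch
import torch.nn as nn

class SharedEncoderNet(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.cls_head = nn.Linear(64, n_classes)
        self.seg_head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 2, stride=2))             # tissue mask logits

    def forward(self, x):
        f = self.encoder(x)
        return self.cls_head(f.mean(dim=(2, 3))), self.seg_head(f)

cls_logits, seg_logits = SharedEncoderNet()(torch.randn(2, 1, 256, 256))
```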
Microwave vision from spaceborne synthetic aperture radar (SAR) provides an all-weather, day-and-night capability to observe Earth, yet much of the information encoded in its signals remains undeciphered. Recent high-resolution imagery has revealed a striking phenomenon: man-made structures systematically appear in a spectrum of colours, the physical origin of which has been an open question. Here we show that this effect, which we term the microwave rainbow, is a form of geometric dispersion arising from structures acting as intrinsic diffraction gratings. We introduce a geometric-physical model that provides a direct analytical link between a target's geometry and its observed colour signature. This model quantitatively explains the full range of signatures, from continuous colour gradients on curved surfaces (zero-order diffraction) to repeating spectral patterns from periodic structures (high-order diffraction). This work transforms colour from a visual artefact into a precise measure of physical form, enabling the geometry of both critical infrastructure and natural phenomena to be mapped directly from space. Our findings establish the physical basis for a new remote sensing modality: microwave colour vision, and open a new frontier in how we perceive our world.
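For context, the classical grating equation below underlies the "intrinsic diffraction grating" interpretation described above; it is standard diffraction physics rather than the paper's specific geometric-physical model.

```latex
% Hedged illustration: the classical grating condition for a periodic structure
% with spacing $d$, incidence angle $\theta_i$, diffraction order $m$, and
% wavelength $\lambda$ (general physics, not the paper's model).
\[
  d\,\bigl(\sin\theta_m - \sin\theta_i\bigr) = m\,\lambda
\]
% For a fixed viewing geometry, the wavelength returned most strongly -- and
% hence the rendered colour -- therefore encodes the structural spacing $d$.
```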
Branched broomrape (Phelipanche ramosa (L.) Pomel) is a chlorophyll-deficient parasitic plant that threatens tomato production by extracting nutrients from the host, with reported yield losses up to 80 percent. Its mostly subterranean life cycle and prolific seed production (more than 200,000 seeds per plant, viable for up to 20 years) make early detection essential. We present an end-to-end pipeline that uses Sentinel-2 imagery and time-series analysis to identify broomrape-infested tomato fields in California. Regions of interest were defined from farmer-reported infestations, and images with less than 10 percent cloud cover were retained. We processed 12 spectral bands and sun-sensor geometry, computed 20 vegetation indices (e.g., NDVI, NDMI), and derived five plant traits (Leaf Area Index, Leaf Chlorophyll Content, Canopy Chlorophyll Content, Fraction of Absorbed Photosynthetically Active Radiation, and Fractional Vegetation Cover) using a neural network calibrated with ground-truth and synthetic data. Trends in Canopy Chlorophyll Content delineated transplanting-to-harvest periods, and phenology was aligned using growing degree days. Vegetation pixels were segmented and used to train a Long Short-Term Memory (LSTM) network on 18,874 pixels across 48 growing-degree-day time points. The model achieved 88 percent training accuracy and 87 percent test accuracy, with precision 0.86, recall 0.92, and F1 0.89. Permutation feature importance ranked NDMI, Canopy Chlorophyll Content, FAPAR, and a chlorophyll red-edge index as most informative, consistent with the physiological effects of infestation. Results show the promise of satellite-driven time-series modeling for scalable detection of parasitic stress in tomato farms.
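A sketch of the general recipe described above (band names follow the Sentinel-2 convention; the feature count, hidden size, and single-layer LSTM are assumptions, not the paper's exact configuration): compute per-pixel vegetation indices from Sentinel-2 bands and classify the resulting time series with a small LSTM.

```python
# Sketch of the general recipe (assumed feature count and model size, not the
# paper's configuration): vegetation indices per pixel + a small LSTM classifier.
import torch
import torch.nn as nn

def ndvi(nir, red):            # Sentinel-2: NIR = B8, red = B4
    return (nir - red) / (nir + red + 1e-6)

def ndmi(nir, swir):           # Sentinel-2: NIR = B8, SWIR = B11
    return (nir - swir) / (nir + swir + 1e-6)

class PixelLSTM(nn.Module):
    def __init__(self, n_features=25, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)           # infested vs. healthy logit

    def forward(self, x):                          # x: (pixels, time steps, features)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])

logit = PixelLSTM()(torch.randn(16, 48, 25))       # e.g. 48 growing-degree-day time points
```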