Despite their cultural and historical significance, Black digital archives continue to be a structurally underrepresented area in AI research and infrastructure. This is especially evident in efforts to digitize historical Black newspapers, where inconsistent typography, visual degradation, and limited annotated layout data hinder accurate transcription, despite the availability of various systems that claim to handle optical character recognition (OCR) well. In this short paper, we present a layout-aware OCR pipeline tailored for Black newspaper archives and introduce an unsupervised evaluation framework suited to low-resource archival contexts. Our approach integrates synthetic layout generation, model pretraining on augmented data, and a fusion of state-of-the-art You Only Look Once (YOLO) detectors. We used three annotation-free evaluation metrics, the Semantic Coherence Score (SCS), Region Entropy (RE), and Textual Redundancy Score (TRS), which quantify linguistic fluency, informational diversity, and redundancy across OCR regions. Our evaluation on a 400-page dataset from ten Black newspaper titles demonstrates that layout-aware OCR improves structural diversity and reduces redundancy compared to full-page baselines, with modest trade-offs in coherence. Our results highlight the importance of respecting cultural layout logic in AI-driven document understanding and lay the foundation for future community-driven and ethically grounded archival AI systems.
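As an illustration of how annotation-free metrics of this kind can be computed over OCR region outputs, the sketch below derives a toy Region Entropy and a pairwise-overlap redundancy score from region texts; the token-level entropy and Jaccard formulations are assumptions for illustration, not the paper's exact definitions of RE and TRS.

```python
import math
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def region_entropy(regions):
    """Shannon entropy of the token distribution pooled across OCR regions.
    Higher values suggest more informational diversity (assumed formulation)."""
    counts = Counter(tok for r in regions for tok in tokenize(r))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def textual_redundancy(regions):
    """Mean pairwise Jaccard overlap between region token sets.
    Higher values indicate more repeated text across regions (assumed formulation)."""
    sets = [set(tokenize(r)) for r in regions if tokenize(r)]
    if len(sets) < 2:
        return 0.0
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return sum(overlaps) / len(overlaps)

# Invented OCR region texts standing in for detected newspaper layout regions.
regions = [
    "City council meets on Tuesday",
    "Council meets Tuesday at the hall",
    "Local church bake sale Saturday",
]
print(region_entropy(regions), textual_redundancy(regions))
```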
The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of "AI for Science". However, existing methods typically extract only fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping the categorized phrases to their locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.
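For readers unfamiliar with the reported metrics, the following minimal sketch shows how ROUGE-1, ROUGE-2, and ROUGE-L scores between a gold workflow phrase and a generated one can be computed with the widely used rouge-score package; the phrases here are invented, and the paper's exact evaluation setup may differ.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Hypothetical gold and generated workflow phrases for one paragraph.
reference = "fine-tune SciBERT on annotated paragraphs"
generated = "fine-tune SciBERT on labeled paragraphs"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # score(target, prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```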
The complex systems keyword diagram generated by the author in 2010 has been used widely for a variety of educational and outreach purposes, but it definitely needs a major update and reorganization. This short paper reports our recent attempt to update the keyword diagram using information collected from the following multiple sources: (a) collective feedback posted on social media, (b) recent reference books on complex systems and network science, (c) online resources on complex systems, and (d) keyword search hits obtained using OpenAlex, an open-access bibliographic catalogue of scientific publications. The data from (a), (b) and (c) were used to incorporate the research community's internal perceptions of the relevant topics, whereas the data from (d) were used to obtain more objective measurements of the keywords' relevance and associations from publications made in complex systems science. Results revealed differences and overlaps between public perception and actual usage of keywords in publications on complex systems. Four topical communities were obtained from the keyword association network, although they were highly intertwined with each other. We hope that the resulting network visualization of complex systems keywords provides a more up-to-date, accurate topic map of the field of complex systems as of today.
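A minimal sketch of how keyword search-hit counts could be retrieved from OpenAlex for data source (d); the keyword list is illustrative, and building the full keyword association network from pairwise hits is not shown.

```python
# pip install requests
import requests

KEYWORDS = ["complex networks", "self-organization", "agent-based modeling"]

def openalex_hit_count(keyword):
    """Number of OpenAlex works matching a keyword search
    (searches titles, abstracts, and full text where available)."""
    resp = requests.get(
        "https://api.openalex.org/works",
        params={"search": keyword, "per-page": 1},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["meta"]["count"]

for kw in KEYWORDS:
    print(kw, openalex_hit_count(kw))
```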
Teams now drive most scientific advances, yet the impact of absolute beginners -- authors with no prior publications -- remains understudied. Analyzing over 28 million articles published between 1971 and 2020 across disciplines and team sizes, we uncover a universal and previously undocumented pattern: teams with a higher fraction of beginners are systematically more disruptive and innovative. Their contributions are linked to distinct knowledge-integration behaviors, including drawing on broader and less canonical prior work and producing more atypical recombinations. Collaboration structure further shapes outcomes: disruption is high when beginners work with early-career colleagues or with co-authors who have disruptive track records. Although disruption and citations are negatively correlated overall, highly disruptive papers from beginner-heavy teams are highly cited. These findings reveal a "beginner's charm" in science, highlighting the underrecognized yet powerful value of beginner fractions in teams and suggesting actionable strategies for fostering a thriving ecosystem of innovation in science and technology.
Information Technology (IT) is recognized as an independent and unique research field. However, there has been ambiguity and difficulty in identifying IT research and differentiating it from closely related fields. Given this context, this paper aimed to explore the roots of the IT research domain by conducting a large-scale text mining analysis of 50,780 abstracts from awarded NSF CISE grants from 1985 to 2024. We categorized the awards based on their program content, labeling human-centric programs as IT research programs and infrastructure-centric programs as other research programs, based on the IT definitions in the literature. This novel approach helped us identify the core concepts of IT research and compare the similarities and differences between IT research and other research areas. The results showed that IT research differentiates itself from these closely related fields by focusing more on the needs of users, organizations, and societies.
This study investigates how Large Language Models (LLMs) are influencing the language of academic papers by tracking 12 LLM-associated terms across six major scholarly databases (Scopus, Web of Science, PubMed, PubMed Central (PMC), Dimensions, and OpenAlex) from 2015 to 2024. Using over 2.4 million PMC open-access publications (2021-July 2025), we also analysed full texts to assess changes in the frequency and co-occurrence of these terms before and after ChatGPT's initial public release. Across databases, delve (+1,500%), underscore (+1,000%), and intricate (+700%) had the largest increases between 2022 and 2024. Growth in LLM-term usage was much higher in STEM fields than in the social sciences, arts, and humanities. In PMC full texts, the proportion of papers using underscore six or more times increased by over 10,000% from 2022 to 2025, followed by intricate (+5,400%) and meticulous (+2,800%). Nearly half of all 2024 PMC papers using any LLM term also included underscore, compared with only 3%-14% of papers before ChatGPT in 2022. Papers using one LLM term are now much more likely to include other terms. For example, in 2024, underscore strongly correlated with pivotal (0.449) and delve (0.311), compared with very weak associations in 2022 (0.032 and 0.018, respectively). These findings provide the first large-scale evidence, based on full-text publications and multiple databases, that some LLM-related terms are now being used much more frequently and together. The rapid uptake of LLMs to support scholarly publishing is a welcome development, reducing the language barrier to academic publishing for non-English speakers.
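The kind of year-over-year growth and term co-occurrence analysis described above can be sketched on binary per-paper term indicators as follows; the data frame is a tiny invented example, not the PMC corpus, and the study's exact correlation measure is not reproduced here.

```python
import pandas as pd

# Hypothetical per-paper indicators: 1 if the paper uses the term, 0 otherwise.
df = pd.DataFrame({
    "year":       [2022, 2022, 2024, 2024, 2024, 2024],
    "underscore": [0,    1,    1,    1,    0,    1],
    "delve":      [0,    0,    1,    0,    0,    1],
    "pivotal":    [0,    0,    1,    1,    0,    1],
})

# Share of papers using each term per year, and percent growth 2022 -> 2024.
rates = df.groupby("year")[["underscore", "delve", "pivotal"]].mean()
growth = (rates.loc[2024] - rates.loc[2022]) / rates.loc[2022].replace(0, float("nan")) * 100
print(rates, growth, sep="\n\n")

# Co-occurrence: Pearson correlation between binary term indicators within a year
# (for 0/1 variables this equals the phi coefficient).
print(df[df.year == 2024][["underscore", "delve", "pivotal"]].corr())
```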
AI scientist systems, capable of autonomously executing the full research workflow from hypothesis generation and experimentation to paper writing, hold significant potential for accelerating scientific discovery. However, the internal workflows of these systems have not been closely examined. This lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. In this paper, we identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. To examine these risks, we design controlled experiments that isolate each failure mode while addressing challenges unique to evaluating AI scientist systems. Our assessment of two prominent open-source AI scientist systems reveals the presence of several failures, across a spectrum of severity, which can be easily overlooked in practice. Finally, we demonstrate that access to trace logs and code from the full automated workflow enables far more effective detection of such failures than examining the final paper alone. We thus recommend that journals and conferences evaluating AI-generated research mandate submission of these artifacts alongside the paper to ensure transparency, accountability, and reproducibility.
This article proposes a novel methodological approach for developing use cases for cultural heritage (CH) e-infrastructures, documented using Jupyter Notebooks (JNs) to enable transparency and reproducibility. We also address the problem that current use cases are not consistently documented to cover all the key aspects, derived from the use case literature outside the CH field, that define a useful use case. Purpose. Our primary objective is to explore the practices around creating and analysing use cases related to digital cultural heritage. Our review of the literature showed substantial variation in the depth and coverage of use cases and revealed the need for a more robust and consistent approach to creating use cases in a digital heritage context. As a first step, we developed a framework for creating use cases to support ongoing efforts to expand the use of e-infrastructures in the digital heritage domain. Design/methodology/approach. Our research design combines desk research of the existing literature with analysis of examples of use cases documented in projects. We examine the challenges and inconsistencies in the current practice of use case production in digital heritage. Finally, we synthesize a systematic process for generating use cases, illustrated by five example use cases within this context. Our work directly impacts infrastructures and communities such as the International GLAM Labs Community, AI for Libraries, Archives, and Museums (AI4LAM), and the Time Machine Organisation. This work advances the use of data research infrastructures within communities of researchers, scholars, students, GLAM (Galleries, Libraries, Archives, and Museums) institutions, and Cultural Heritage and Cultural and Creative Industries (CCIs).
Scientific progress fundamentally depends on researchers' ability to access and build upon the work of others. Yet, a majority of published work remains behind expensive paywalls, limiting access to universities that can afford subscriptions. Furthermore, even when articles are accessible, the underlying datasets could be restricted, available only through a "reasonable request" to the authors. One way researchers could overcome these barriers is by relying on informal channels, such as emailing authors directly, to obtain paywalled articles or restricted datasets. However, whether these informal channels are hindered by racial and/or institutional biases remains unknown. Here, we combine qualitative semi-structured interviews, large-scale observational analysis, and two randomized audit experiments to examine racial and institutional disparities in access to scientific knowledge. Our analysis of 250 million articles reveals that researchers in the Global South cite paywalled papers and upon-request datasets at significantly lower rates than their Global North counterparts, and that these access gaps are associated with reduced knowledge breadth and scholarly impact. To interrogate the mechanisms underlying this phenomenon, we conduct two randomized email audit studies in which fictional PhD students differing in racial background and institutional affiliation request access to paywalled articles (N = 18,000) and datasets (N = 11,840). We find that racial identity more strongly predicts response rate to paywalled article requests compared to institutional affiliation, whereas institutional affiliation plays a larger role in shaping access to datasets. These findings reveal how informal gatekeeping can perpetuate structural inequities in science, highlighting the need for stronger data-sharing mandates and more equitable open access policies.
Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP - a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph for the NLP domain. Our KG has an average node degree of 3.2 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at https://github.com/AKADDC/SciNLP.
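As a small illustration of the reported knowledge-graph statistic, the snippet below computes the average node degree of a toy entity-relation graph with networkx; the triples are invented examples, not part of SciNLP.

```python
# pip install networkx
import networkx as nx

# Toy fragment of an NLP knowledge graph: (head entity, relation, tail entity) triples.
triples = [
    ("BERT", "used-for", "named entity recognition"),
    ("BERT", "compare", "RoBERTa"),
    ("RoBERTa", "used-for", "named entity recognition"),
    ("CoNLL-2003", "evaluate-for", "named entity recognition"),
]

G = nx.Graph()
for head, rel, tail in triples:
    G.add_edge(head, tail, relation=rel)

# Average node degree = 2 * |E| / |V| for an undirected graph.
avg_degree = 2 * G.number_of_edges() / G.number_of_nodes()
print(f"nodes={G.number_of_nodes()} edges={G.number_of_edges()} avg_degree={avg_degree:.2f}")
```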
This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at https://github.com/zhiyintan/topic-model-LLMjudgment.
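A minimal, hedged sketch of an LLM-as-judge check for one of the listed dimensions (lexical validity of topic words); the prompt wording, scoring scheme, and the placeholder ask_llm function are assumptions for illustration, not the paper's nine metrics.

```python
# Illustrative LLM-as-judge check for one topic-quality dimension (lexical validity).
# The prompt wording and scoring scale are assumptions, not the paper's exact metrics.

def ask_llm(prompt: str) -> str:
    """Placeholder: replace with a call to your LLM of choice (e.g. a local
    open-source model). Here it returns a canned answer so the sketch runs."""
    return "word2vec: 1, embedding: 1, xqz17: 0, semantic: 1, the: 0"

def lexical_validity(topic_words):
    prompt = (
        "For each word below, answer 1 if it is a meaningful English or "
        "domain-specific term and 0 if it is noise (junk token, stopword, "
        "artifact). Answer as 'word: 0/1' pairs, comma-separated.\n"
        + ", ".join(topic_words)
    )
    reply = ask_llm(prompt)
    labels = [int(pair.split(":")[1]) for pair in reply.split(",")]
    return sum(labels) / len(labels)  # share of valid words in the topic

print(lexical_validity(["word2vec", "embedding", "xqz17", "semantic", "the"]))
```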
Navigating the vast sea of academic publications to identify institutional synergies, benchmark research contributions, and pinpoint key contributions has become an increasingly daunting task, especially given the current exponential increase in new publications. Existing tools provide useful overviews or single-document insights, but none supports structured, qualitative comparisons across institutions or publications. To address this, we demonstrate Compare, a novel framework that enables sophisticated long-context comparisons of scientific contributions. Compare empowers users to explore and analyze research overlaps and differences at both institutional and publication granularity, driven by user-defined questions and automatic retrieval over online resources. To this end, we leverage Retrieval-Augmented Generation over evolving data sources to foster long-context knowledge synthesis. Unlike traditional scientometric tools, Compare goes beyond quantitative indicators by providing qualitative, citation-supported comparisons.
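The retrieval step of such a Retrieval-Augmented Generation pipeline can be illustrated with a simple TF-IDF ranker over candidate passages; the documents and question below are invented, and the actual Compare system retrieves from evolving online resources and passes the results to a generator.

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical abstracts from two institutions; a real system would retrieve from
# evolving online sources and feed top hits to an LLM for qualitative comparison.
documents = [
    "Institution A: graph neural networks for molecule property prediction.",
    "Institution A: contrastive pretraining for protein structure.",
    "Institution B: graph neural networks for traffic forecasting.",
    "Institution B: reinforcement learning for robotic grasping.",
]

question = "Which groups work on graph neural networks, and for what applications?"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vecs = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform([question])

# Rank documents by similarity to the user question (the retrieval step of RAG);
# the top-k passages would then be passed to a generator as long context.
scores = cosine_similarity(query_vec, doc_vecs).ravel()
for idx in scores.argsort()[::-1][:3]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```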
This study examines the evolution of references to grain storage structures in medieval European charters, based on a quantitative and semantic analysis of the digitized CEMA (Cartae Europae Medii Aevi) corpus comprising more than 225,000 documents. The author applies text mining and distributional analysis methods to a lexicon of some forty terms designating storage locations (grangia, horreum, granarium, granica, etc.), cross-referencing these data with references to grain and analyzing their semantic contexts over the long term. The analysis reveals a paradigm shift between the early Middle Ages (decentralized, loosely regulated storage) and the 12th-13th centuries (centralization of storage by the ruling classes). Granaries became instruments of spatial polarization and social control, contributing to the accentuation of social domination in medieval Europe. This evolution was accompanied by a new conceptualization of storage, both material and spiritual.
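A rough sketch of the kind of co-occurrence counting such a distributional analysis relies on, using invented Latin snippets in place of CEMA charters; the term lists and the crude stem matching are illustrative assumptions only.

```python
import re

# Illustrative Latin snippets standing in for CEMA charter texts (not real data).
charters = [
    "concedimus grangiam cum omnibus decimis frumenti et annonae",
    "dedit horreum iuxta ecclesiam sancti Petri",
    "in granario castri reponatur annona dominica",
]

STORAGE_TERMS = ["grangia", "horreum", "granarium", "granica"]
GRAIN_TERMS = ["frumentum", "annona", "bladum"]

def has_term(text, lemmas):
    # Crude stem matching on the lemma minus its ending; a real study would lemmatize.
    return any(re.search(lemma[:-2], text, re.IGNORECASE) for lemma in lemmas)

cooccur = sum(1 for c in charters if has_term(c, STORAGE_TERMS) and has_term(c, GRAIN_TERMS))
print(f"{cooccur} of {len(charters)} charters mention both a storage place and grain")
```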
The mechanisms driving different types of scientific innovation through collaboration remain poorly understood. Here we develop a comprehensive framework analyzing over 14 million papers across 19 disciplines from 1960 to 2020 to unpack how collaborative synergy shapes research disruption. We introduce the synergy factor to quantify collaboration cost-benefit dynamics, revealing discipline-specific architectures where Physics peaks at medium team sizes while humanities achieve maximal synergy through individual scholarship. Our mediation analysis demonstrates that collaborative synergy, not team size alone, mediates 75% of the relationship between team composition and disruption. Key authors play a catalytic role, with papers featuring exceptional researchers showing 561% higher disruption indices. Surprisingly, high-citation authors reduce disruptive potential while those with breakthrough track records enhance it, challenging traditional evaluation metrics. We identify four distinct knowledge production modes: elite-driven, baseline, heterogeneity-driven, and low-cost. These findings reveal substantial heterogeneity in optimal collaboration strategies across disciplines and provide evidence-based guidance for research organization, with implications for science policy and the design of research institutions in an increasingly collaborative scientific landscape.
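The mediation claim can be illustrated with a standard proportion-mediated calculation on simulated data; the variables and effect sizes below are invented and do not reproduce the authors' model.

```python
# pip install numpy statsmodels
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000

# Hypothetical variables: team composition -> collaborative synergy -> disruption.
composition = rng.normal(size=n)                      # e.g. standardized team-size mix
synergy = 0.8 * composition + rng.normal(size=n)      # mediator
disruption = 0.5 * synergy + 0.1 * composition + rng.normal(size=n)

X_total = sm.add_constant(composition)
total = sm.OLS(disruption, X_total).fit().params[1]            # total effect c

X_direct = sm.add_constant(np.column_stack([composition, synergy]))
direct = sm.OLS(disruption, X_direct).fit().params[1]          # direct effect c'

print(f"proportion mediated ~= {(total - direct) / total:.2f}")  # (c - c') / c
```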
Gender inequality in scientific careers has been extensively documented through aggregate measures such as total publications and cumulative citations, yet the temporal dynamics underlying these disparities remain largely unexplored. Here we developed a multi-dimensional framework to examine gender differences in scientific knowledge creation through three complementary temporal dimensions: stability (consistency of performance over time), volatility (degree of year-to-year fluctuation), and persistence (ability to maintain high performance for extended periods). Using comprehensive bibliometric data from SciSciNet covering 62.5 million authors whose careers began between 1960 and 2010, we constructed knowledge creation capability measures that captured how scientists absorb knowledge from diverse sources and contribute to field advancement. We found that female scientists demonstrated significantly higher knowledge production stability (0.170 vs. 0.119 for males) while simultaneously exhibiting greater year-to-year volatility (6.606 vs. 6.228), revealing a striking paradox in career dynamics. Female scientists showed persistence advantages under moderate performance requirements but faced disadvantages under extreme criteria demanding sustained peak performance. However, these patterns varied substantially across disciplines, with female advantages strongest in humanities and social sciences while STEM fields showed mixed results.
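Illustrative proxies for the three temporal dimensions can be computed from a yearly score series as follows; the specific formulations (mean absolute change, lag-1 autocorrelation, longest above-median run) are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

# Hypothetical yearly knowledge-creation scores for one career (illustrative only).
scores = np.array([1.0, 1.4, 1.1, 2.0, 1.8, 2.2, 1.9, 2.5])

# Volatility: average magnitude of year-to-year change.
volatility = np.abs(np.diff(scores)).mean()

# Stability: lag-1 autocorrelation (how strongly one year predicts the next).
stability = np.corrcoef(scores[:-1], scores[1:])[0, 1]

# Persistence: longest run of consecutive years at or above the career median.
above = scores >= np.median(scores)
longest, run = 0, 0
for flag in above:
    run = run + 1 if flag else 0
    longest = max(longest, run)

print(f"volatility={volatility:.3f} stability={stability:.3f} persistence={longest} years")
```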
In this work, we study how URL extraction results depend on input format. We compiled a pilot dataset by extracting URLs from 10 arXiv papers and used the same heuristic method to extract URLs from four formats derived from the PDF files or the source LaTeX files. We found that accurate and complete URL extraction from any single format or a combination of multiple formats is challenging, with a best F1-score of 0.71. Using the pilot dataset, we evaluate extraction performance across formats and show that structured formats like HTML and XML produce more accurate results than PDF or plain text. Combining multiple formats improves coverage, especially when targeting research-critical resources. We further apply URL extraction to two tasks: classifying URLs as pointing to open-access datasets and software versus other resources, and analyzing the trend of URL usage in arXiv papers from 1992 to 2024. These results suggest that combining multiple formats achieves better URL extraction performance than any single format, and that the number of URLs in arXiv papers increased steadily from 1992 to 2014 and has risen sharply from 2014 to 2024. The dataset and the Jupyter notebooks used for the preliminary analysis are publicly available at https://github.com/lamps-lab/arxiv-urls
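A minimal sketch of heuristic URL extraction and F1 scoring against gold annotations; the regular expression and the text snippet below are simplified stand-ins for the paper's heuristic method and pilot dataset.

```python
import re

# A deliberately simple URL pattern; the paper's heuristic method is not reproduced here.
URL_RE = re.compile(r"""https?://[^\s<>)\]}"']+""")

text = (
    "Code is available at https://github.com/lamps-lab/arxiv-urls and data at "
    "http://example.org/data. See also the project page (https://example.org/project)."
)
extracted = {u.rstrip(".,;") for u in URL_RE.findall(text)}

# Hypothetical gold annotations for this snippet.
gold = {
    "https://github.com/lamps-lab/arxiv-urls",
    "http://example.org/data",
    "https://example.org/project",
}

tp = len(extracted & gold)
precision = tp / len(extracted) if extracted else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```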
This paper examines how the role of cited papers evolves over time by analyzing nearly 900 highly cited papers (HCPs) published between 2000 and 2016 and the full text of over 220,000 papers citing them. We investigate multiple citation characteristics, including citation location within the full text, reference and in-text citation types, citation sentiment, and textual and bibliographic relatedness between citing and cited papers. Our findings reveal that as HCPs age, they tend to be cited earlier in the papers citing them, mentioned fewer times in the full text, and more often cited alongside other references. Citation sentiment remains predominantly neutral, while both textual and bibliographic similarity between HCPs and their citing papers decline over time. These patterns indicate a shift from direct topical and methodological engagement toward more general, background, and symbolic referencing. The findings highlight the importance of considering citation context rather than relying solely on simple citation counts. Large-scale full-text analyses such as ours can help refine measures of scientific impact and advance scholarly search and science mapping by uncovering more nuanced connections between papers.
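One simple way to operationalize citation location is the relative character offset of each in-text citation marker, sketched below on a toy citing paper; the paper's actual location measure may be section-based rather than offset-based.

```python
import re

# Toy citing paper; in the study, locations come from parsed full text.
full_text = (
    "Introduction. Prior work [12] established the field. Methods. We extend the "
    "approach of [7]. Results. Our model outperforms baselines. Discussion. "
    "Consistent with [12], we find strong effects."
)

# Relative position of each in-text citation marker (0 = start of paper, 1 = end).
for match in re.finditer(r"\[\d+\]", full_text):
    rel_pos = match.start() / len(full_text)
    print(f"{match.group()} at relative position {rel_pos:.2f}")
```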
Bibliometric measures, such as total citations and the h-index, have become a cornerstone for evaluating academic performance; however, these traditional metrics, being non-weighted, inadequately capture the nuances of individual contributions. To address this constraint, we developed GScholarLens, an open-access browser extension that integrates seamlessly with Google Scholar to enable detailed bibliometric analysis. GScholarLens categorizes publications by authorship role, adjusts citation weightings accordingly, and introduces the Scholar h-index (Sh-index), an authorship-contribution normalized h-index. The tool proportionally weights citations based on authorship position using heuristic percentages: corresponding author 100 percent, first author 90 percent, second author 50 percent, other co-authors 25 percent in publications with fewer than six authors, and 10 percent in publications with more than six authors. Currently, no empirical data on author-contribution weights are available; however, this proof-of-concept framework can easily adopt more precise author-contribution weights decided by the authors at the time of manuscript submission along with CRediT, which journals and publishers can mandate. Furthermore, the tool incorporates retraction detection by mapping data from retraction databases into the Google Scholar interface. By aligning bibliometric evaluation more closely with actual scholarly contribution, GScholarLens presents a better open-access framework for academic recognition, particularly within interdisciplinary and highly collaborative research environments. This tool is freely accessible at https://project.iith.ac.in/sharmaglab/gscholarlens/.
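Under the stated weighting scheme, the Sh-index can be sketched as a standard h-index computed over position-weighted citation counts; the handling of publications with exactly six authors and the example records below are assumptions for illustration.

```python
# Authorship-position weights following the heuristic percentages described above.
def citation_weight(role, n_authors):
    if role == "corresponding":
        return 1.00
    if role == "first":
        return 0.90
    if role == "second":
        return 0.50
    # Other co-author positions; treatment of exactly six authors is an assumption.
    return 0.25 if n_authors < 6 else 0.10

def sh_index(papers):
    """Sh-index: the largest h such that h papers have at least h weighted citations."""
    weighted = sorted(
        (cites * citation_weight(role, n) for cites, role, n in papers),
        reverse=True,
    )
    return sum(1 for i, w in enumerate(weighted, start=1) if w >= i)

# (citations, author role, number of authors) for one researcher -- illustrative values.
papers = [
    (120, "corresponding", 4),
    (45, "first", 3),
    (30, "second", 8),
    (22, "other", 9),
    (8, "other", 4),
]
print("Sh-index:", sh_index(papers))
```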