This paper explores methods for building a comprehensive citation graph using big data techniques to evaluate scientific impact more accurately. Traditional citation metrics have well-known limitations, and this work investigates whether merging large citation datasets can produce a more complete picture. Big data challenges, such as inconsistent data formats and the lack of unique identifiers, are addressed through deduplication, yielding a streamlined and reliable merged dataset with over 119 million records and 1.4 billion citations. We demonstrate that merging large citation datasets yields a more accurate citation graph, facilitating a more robust evaluation of scientific impact.
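The abstract above does not include code, but the deduplication-and-merge step it describes can be sketched roughly as follows. The field names, the preference for DOIs, and the fallback to normalized titles are my assumptions, not the authors' pipeline:

```python
def normalize_key(record):
    """Merge key: prefer a DOI, fall back on a normalized title.
    The field names ('doi', 'title') are illustrative, not the paper's schema."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    title = "".join(ch for ch in record.get("title", "").lower() if ch.isalnum())
    return ("title", title)

def merge_records(*sources):
    """Deduplicate records from several sources into one keyed dict."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(normalize_key(record), record)
    return merged

def merge_citations(*edge_lists):
    """Union citation edges given as (citing_record, cited_record) pairs."""
    return {(normalize_key(a), normalize_key(b))
            for edges in edge_lists for a, b in edges}

src_a = [{"doi": "10.1/x", "title": "A Study"}]
src_b = [{"doi": "10.1/X ", "title": "A study"}, {"title": "Another Paper"}]
print(len(merge_records(src_a, src_b)))  # 2 unique records after dedup
```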
Historical study of the Holocaust is commonly hampered by the dispersed and fragmented nature of important archival sources relating to this event. The EHRI project set out to mitigate this problem by building a trans-national network of archives, researchers, and digital practitioners, and one of its main outcomes was the creation of the EHRI Portal, a "virtual observatory" that gathers in one centralised platform descriptions of Holocaust-related archival sources from around the world. Building the Portal required a substantial data identification and integration effort, culminating in the project's third phase with the creation of the EHRI-3 data integration lab. The focus of the lab was to lower the barrier to participation in the EHRI Portal by supporting institutions in bringing their archival metadata into conformance with the format required for integration, ultimately opening the process up to smaller institutions (and even so-called "micro-archives") that lack the resources to undertake this process themselves. In this paper we present our experiences from running the data integration lab and discuss some of the challenges (both technical and social), how we tried to overcome them, and the overall lessons learnt. We envisage this work as an archetype upon which other practitioners seeking to pursue similar data integration activities can build their own efforts.
The adoption of open science has quickly changed how artificial intelligence (AI) policy research is distributed globally. This study examines regional trends in the citation of preprints, focusing on the impact of two major disruptive events, the COVID-19 pandemic and the release of ChatGPT, on research dissemination patterns in the United States, Europe, and South Korea from 2015 to 2024. Using bibliometric data from the Web of Science, this study tracks how these global disruptions influenced the adoption of preprints in AI policy research and how such shifts vary by region. When the timing of these events is marked in the citation series, the analysis reveals that while all regions experienced growth in preprint citations, the magnitude and trajectory of change varied significantly. The United States exhibited sharp, event-driven increases; Europe demonstrated institutional growth; and South Korea maintained consistent, linear growth in preprint adoption. These findings suggest that global disruptions may have accelerated preprint adoption, but that the extent and trajectory are shaped by local research cultures, policy environments, and levels of open science maturity. This paper emphasizes the need for future AI governance strategies to consider regional variability in research dissemination and highlights opportunities for further longitudinal and comparative research to deepen our understanding of open-access adoption in AI policy development.
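A minimal sketch of the kind of event-marked trend analysis described above, using an interrupted time-series regression. The yearly counts, variable names, and the choice of statsmodels OLS are illustrative assumptions, not the study's actual specification:

```python
# Level shifts at two events (COVID-19 in 2020, ChatGPT in late 2022)
# on top of a linear trend. All numbers are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "year": range(2015, 2025),
    "preprint_citations": [120, 150, 180, 230, 300, 520, 640, 700, 980, 1100],
})
df["t"] = df["year"] - 2015
df["post_covid"] = (df["year"] >= 2020).astype(int)
df["post_chatgpt"] = (df["year"] >= 2023).astype(int)

model = smf.ols("preprint_citations ~ t + post_covid + post_chatgpt", data=df).fit()
print(model.params)  # trend plus estimated jumps at each event
```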
Historical visualizations are a rich resource for visualization research. While taxonomies are commonly used to structure and understand the design space of visualizations, existing taxonomies primarily focus on contemporary visualizations and largely overlook historical ones. To address this gap, we describe an empirical method for taxonomy development. We introduce a coding protocol and the VisTaxa system for taxonomy labeling and comparison. We demonstrate our method by using it to develop a historical visualization taxonomy from the coding of 400 images of historical visualizations. We analyze the coding results and reflect on the coding process. Our work is an initial step toward a systematic investigation of the design space of historical visualizations.
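The coding protocol implies comparing labels across coders; a standard ingredient for that, shown purely as an assumed illustration (the abstract does not name its agreement measure), is Cohen's kappa over two coders' taxonomy labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical taxonomy labels assigned by two coders to the same images.
coder_a = ["map", "chart", "diagram", "map", "table", "chart"]
coder_b = ["map", "chart", "map", "map", "table", "diagram"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # chance-corrected inter-coder agreement
```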
Navigating and visualizing multilayered knowledge graphs remains a challenging, unresolved problem in information systems design. Building on our earlier study, which engaged end users in both the design and population of a domain-specific knowledge graph, we now focus on translating their insights into actionable interface guidelines. In this paper, we synthesize recommendations drawn from a participatory workshop with doctoral students. We then demonstrate how these recommendations inform the design of a prototype interface. Finally, we find that a participatory, iterative design approach can support designers' decision-making, leading to interfaces that are both innovative and user-centric. By combining user-driven requirements with proven visualization techniques, this paper presents a coherent framework for guiding the future development of knowledge-graph navigation tools.
This paper presents our system developed for SemEval-2025 Task 5 (LLMs4Subjects: LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog). Our system relies on prompting a selection of LLMs with varying examples of intellectually annotated records and asking the LLMs to suggest keywords for new records in the same fashion. This few-shot prompting technique is combined with a series of post-processing steps that map the generated keywords to the target vocabulary, aggregate the resulting subject terms into an ensemble vote and, finally, rank them by their relevance to the record. Our system ranked fourth in the quantitative ranking in the all-subjects track but achieved the best result in the qualitative ranking conducted by subject indexing experts.
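A rough sketch of the post-processing chain described above: mapping free-text LLM keywords onto a controlled vocabulary, aggregating an ensemble vote, and ranking. The vocabulary, fuzzy-matching cutoff, and voting rule are assumptions, not the team's exact implementation:

```python
from collections import Counter
from difflib import get_close_matches

# Illustrative stand-in for the real target vocabulary.
VOCAB = ["machine learning", "information retrieval", "digital libraries",
         "subject indexing", "metadata"]

def map_to_vocab(keyword, cutoff=0.8):
    """Map a free-text LLM keyword onto the controlled vocabulary, if close enough."""
    match = get_close_matches(keyword.lower(), VOCAB, n=1, cutoff=cutoff)
    return match[0] if match else None

def ensemble_rank(llm_outputs):
    """Aggregate keyword lists from several LLMs into one ranked subject list."""
    votes = Counter()
    for keywords in llm_outputs:
        mapped = {map_to_vocab(k) for k in keywords} - {None}
        votes.update(mapped)  # one vote per model per subject term
    return [term for term, _ in votes.most_common()]

outputs = [
    ["Machine Learning", "metadata", "indexing of subjects"],
    ["machine-learning", "digital library", "metadata"],
]
print(ensemble_rank(outputs))
```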
Language is a major source of systemic inequities in science, particularly among scholars whose first language is not English. Studies have examined scientists' linguistic practices in specific contexts; few, however, have provided a global analysis of multilingualism in science. Using two major bibliometric databases (OpenAlex and Dimensions), we provide a large-scale analysis of linguistic diversity in science, considering both the language of publications (N=87,577,942) and of cited references (N=1,480,570,087). For the 1990-2023 period, we find that only Indonesian, Portuguese and Spanish have expanded at a faster pace than English. Country-level analyses show that this trend is due to the growing strength of the Latin American and Indonesian academic circuits. Our results also confirm the own-language preference phenomenon (particularly for languages other than English), the strong connection between multilingualism and bibliodiversity, and that social sciences and humanities are the least English-dominated fields. Our findings suggest that policies recognizing the value of both national-language and English-language publications have had a concrete impact on the distribution of languages in the global field of scholarly communication.
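OpenAlex exposes a public API that can group works by language; a sketch of such a query appears below. The endpoint and parameters follow the OpenAlex documentation as I understand it and are not taken from the paper, so treat them as assumptions to verify against the current API reference:

```python
# Share of works per language in OpenAlex since 1990, via the public API.
import requests

resp = requests.get(
    "https://api.openalex.org/works",
    params={"group_by": "language", "filter": "from_publication_date:1990-01-01"},
    timeout=30,
)
resp.raise_for_status()
groups = resp.json()["group_by"]
total = sum(g["count"] for g in groups)
for g in sorted(groups, key=lambda g: -g["count"])[:10]:
    print(f"{g['key']}: {g['count'] / total:.2%}")
```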
Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
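A toy version of the combinatorial prompt engine behind the six-slot scaffold might look like the following; the slot values and template wording are invented for illustration and are not the TF1-EN-3M generation code:

```python
import itertools
import random

SLOTS = {  # illustrative slot values, not the dataset's actual lists
    "character": ["a fox", "a tortoise", "an ant"],
    "trait": ["greedy", "patient", "boastful"],
    "setting": ["a riverbank", "an old orchard"],
    "conflict": ["a famine", "a wager"],
    "resolution": ["an act of kindness", "a clever trick"],
    "moral": ["pride comes before a fall", "slow and steady wins"],
}

TEMPLATE = ("Write a short fable about {character} who is {trait}, set in "
            "{setting}, facing {conflict}, resolved through {resolution}, "
            "ending with the moral: '{moral}'.")

def all_prompts():
    keys = list(SLOTS)
    for combo in itertools.product(*(SLOTS[k] for k in keys)):
        yield TEMPLATE.format(**dict(zip(keys, combo)))

random.seed(0)
prompts = list(all_prompts())
print(len(prompts), "prompts; example:", random.choice(prompts))
```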
We propose a citation index $\nu$ ("nu") and show that it lies between the classical $h$-index and $g$-index. This idea is then generalized to a monotone parametric family $(\nu_\alpha)$, $\alpha \ge 0$, with $h = \nu_0$ and $\nu = \nu_1$, while the limiting value $\nu_\infty$ is expressed in terms of the maximum citation count.
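The abstract does not define $\nu$ itself, but its two classical endpoints are standard; here is a minimal sketch of the $h$- and $g$-index, between which $\nu$ is shown to lie:

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    cs = sorted(citations, reverse=True)
    return max((i + 1 for i, c in enumerate(cs) if c >= i + 1), default=0)

def g_index(citations):
    """Largest g such that the top g papers have >= g^2 citations in total."""
    cs = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(cs, start=1):
        total += c
        if total >= i * i:
            g = i
    return g

cites = [10, 8, 5, 4, 3, 0]
print(h_index(cites), g_index(cites))  # 4 and 5; nu would lie in between
```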
This report addresses the challenge of limited labeled datasets for developing legal recommender systems, particularly in specialized domains such as labor disputes. We propose a new approach that leverages the co-citation of legal articles within cases to establish similarity and enable algorithmic annotation. This method draws a parallel to the concept of case co-citation, using cited precedents as indicators of shared legal issues. To evaluate the labeled results, we employ a system that recommends similar cases based on plaintiffs' accusations, defendants' rebuttals, and points of dispute. The evaluation demonstrates that the recommender, equipped with fine-tuned text embedding models and a reasonably sized BiLSTM module, can recommend labor cases whose similarity is measured by the co-citation of legal articles. This research contributes to the development of automated annotation techniques for legal documents, particularly in areas with limited access to comprehensive legal databases.
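A minimal sketch of co-citation-based similarity labeling as described above, using Jaccard overlap of cited-article sets; the threshold, article names, and labeling rule are illustrative assumptions, not the report's exact procedure:

```python
def jaccard(a, b):
    """Overlap of two cited-article sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

case_articles = {  # invented citations for illustration
    "case_1": {"Labor Standards Act s11", "Labor Standards Act s17"},
    "case_2": {"Labor Standards Act s11", "Civil Code s483"},
    "case_3": {"Civil Code s184"},
}

SIM_THRESHOLD = 0.25  # assumed cutoff for labeling a pair as similar
pairs = [(a, b, jaccard(case_articles[a], case_articles[b]))
         for a in case_articles for b in case_articles if a < b]
labels = [(a, b, sim >= SIM_THRESHOLD) for a, b, sim in pairs]
print(labels)  # only case_1/case_2 co-cite an article and get a positive label
```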
A key factor for lunar mission planning is the ability to assess the local availability of raw materials. However, many potentially relevant measurements are scattered across a variety of scientific publications. In this paper we consider the viability of obtaining lunar composition data by leveraging LLMs to rapidly process a corpus of scientific publications. While using LLMs to extract knowledge from scientific documents is not new, this particular application presents interesting challenges due to the heterogeneity of lunar samples and the nuances involved in their characterization. Accuracy and uncertainty quantification are particularly crucial, since many material properties can be sensitive to small variations in composition. Our findings indicate that off-the-shelf LLMs are generally effective at extracting data from the tables commonly found in these documents. However, there remains room to refine the data extracted by this initial approach, in particular to capture fine-grained mineralogy information and to improve performance on more subtle or complex pieces of information.
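A sketch of how such a pipeline might prompt an off-the-shelf LLM for structured table data with uncertainties; the JSON schema and instruction wording are assumptions, not the authors' prompts:

```python
import json

SCHEMA = {  # hypothetical extraction schema
    "sample_id": "string",
    "oxide": "string, e.g. 'SiO2'",
    "value_wt_percent": "number",
    "uncertainty_wt_percent": "number or null if not reported",
}

def extraction_prompt(table_text):
    """Build a structured-extraction prompt for one composition table."""
    return (
        "Extract every composition measurement from the table below as a JSON "
        "list of objects with this schema:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n"
        "Report uncertainties exactly as printed; use null when absent. "
        "Do not infer values that are not in the table.\n\n"
        f"TABLE:\n{table_text}"
    )

print(extraction_prompt("Sample 10084 | SiO2 41.3 +/- 0.4 | FeO 16.4 +/- 0.2"))
```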
This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, as well as the merging of predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in the qualitative evaluations. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.
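One simple way to merge predictions from monolingual models, shown purely as an assumed illustration (Annif's own ensemble backends implement more refined fusion), is to average per-subject scores:

```python
from collections import defaultdict

def merge_predictions(*model_outputs, top_k=5):
    """Average scores per subject ID across models; each output maps
    subject ID -> score in [0, 1]."""
    sums = defaultdict(float)
    for output in model_outputs:
        for subject, score in output.items():
            sums[subject] += score
    n = len(model_outputs)
    merged = {s: total / n for s, total in sums.items()}
    return sorted(merged.items(), key=lambda kv: -kv[1])[:top_k]

# Invented subject IDs and scores, for illustration only.
de_model = {"gnd:A": 0.91, "gnd:B": 0.40}
en_model = {"gnd:A": 0.78, "gnd:C": 0.55}
print(merge_predictions(de_model, en_model))
```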
Open science is increasingly recognised worldwide, with preprint posting emerging as a key strategy. This study explores the factors influencing researchers' adoption of preprint publication, particularly the perceived effectiveness of the practice and research intensity indicators such as publication and review frequency. Using open data from a comprehensive survey with 5,873 valid responses, we conducted regression analyses controlling for demographic variables. Researchers' productivity, particularly the number of journal articles and books published, most strongly influences the frequency of preprint deposits, followed by the perceived effectiveness of preprints. Preprints are viewed positively as a means of early access to new research, but negatively as a means of obtaining early feedback. Demographic variables, such as gender and the type of organisation conducting the research, do not have a significant impact on preprint production when other factors are controlled for. However, the researcher's discipline, years of experience, and geographical region generally have a moderate effect. These findings highlight the motivations and barriers associated with preprint publication and provide insights into how researchers perceive the benefits and challenges of this practice within the broader context of open science.
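A minimal sketch of the kind of regression described above, with productivity and perceived effectiveness as predictors and demographic controls; the synthetic data, variable names, and OLS choice are my assumptions, not the study's model:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({  # tiny synthetic stand-in for the survey data
    "preprint_freq":  [0, 2, 5, 1, 4, 6, 0, 3],
    "articles":       [1, 4, 9, 2, 7, 10, 1, 5],
    "perceived_eff":  [2, 3, 5, 2, 4, 5, 1, 4],
    "gender":         ["f", "m", "f", "m", "f", "m", "f", "m"],
    "org_type":       ["univ", "univ", "inst", "univ", "inst", "inst", "univ", "inst"],
})
model = smf.ols(
    "preprint_freq ~ articles + perceived_eff + C(gender) + C(org_type)",
    data=df,
).fit()
print(model.params)  # demographic terms act as controls
```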
This study examines the use of evidence in policymaking by analysing a range of journal and article attributes as well as online engagement metrics, employing a large-scale citation analysis of nearly 150,000 articles covering diverse policy topics. The findings show that scholarly citations exert the strongest positive influence on policy citations: articles from journals with a higher citation impact and a larger Mendeley readership are cited more frequently in policy documents. Other forms of online engagement, such as news and blog mentions, also boost policy citations, while mentions on the social media platform X have a negative effect. The finding that highly cited and widely read papers are also frequently referenced in policy documents likely reflects a perception among policymakers that such research is more trustworthy. In contrast, papers that derive their influence primarily from social media tend to be cited less often in policy contexts.
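Policy-citation counts of this kind are often modelled with count regressions; a sketch with a Poisson specification follows. The data are synthetic and the specification is an assumption, since the paper's exact model is not given in the abstract:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({  # invented per-article counts
    "policy_cites":    [0, 1, 3, 0, 5, 2, 0, 7],
    "scholarly_cites": [2, 10, 40, 1, 90, 25, 3, 120],
    "mendeley":        [5, 30, 80, 4, 150, 60, 8, 200],
    "news_mentions":   [0, 1, 2, 0, 4, 1, 0, 5],
    "x_mentions":      [3, 2, 1, 5, 0, 2, 6, 1],
})
model = smf.poisson(
    "policy_cites ~ scholarly_cites + mendeley + news_mentions + x_mentions",
    data=df,
).fit(disp=False)
print(model.params)
```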
Citation metrics serve as the cornerstone of scholarly impact evaluation despite their well-documented vulnerability to inflation through self-citation practices. This paper introduces the Self-Citation Adjusted Index (SCAI), a sophisticated metric designed to recalibrate citation counts by accounting for discipline-specific self-citation patterns. Through comprehensive analysis of 5,000 researcher profiles across diverse disciplines, we demonstrate that excessive self-citation inflates traditional metrics by 10-20%, potentially misdirecting billions in research funding. Recent studies confirm that self-citation patterns exhibit significant gender disparities, with men self-citing up to 70% more frequently than women, exacerbating existing inequalities in academic recognition. Our open-source implementation provides comprehensive tools for calculating SCAI and related metrics, offering a more equitable assessment of research impact that reduces the gender citation gap by approximately 8.5%. This work contributes to the paradigm shift toward transparent, nuanced, and equitable research evaluation methodologies in academia, with direct implications for funding allocation decisions that collectively amount to over $100 billion annually in the United States alone.
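The exact SCAI formula is in the paper; as a loudly hypothetical stand-in, a discipline-normalised adjustment could discount only self-citations in excess of the field's expected rate:

```python
def adjusted_citations(total, self_cites, field_self_rate):
    """Remove self-citations beyond the discipline's expected rate.
    This is an illustrative assumption, not the authors' SCAI definition."""
    expected = total * field_self_rate
    excess = max(self_cites - expected, 0.0)
    return total - excess

# e.g. 200 citations, 50 self-citations, in a field averaging 12% self-citation
print(adjusted_citations(200, 50, 0.12))  # 174.0
```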
Recent advances in machine learning and artificial intelligence have provided more alternatives for automating repetitive or monotonous tasks. However, the development of AI tools has not been straightforward, and use-case exploration and workflow integration remain ongoing challenges. In this work, we present a detailed qualitative analysis of the performance and user experience of popular commercial AI chatbots when used for document classification with limited data, reporting results for a real-world example of metadata augmentation in an academic library environment. We compare the results of the AI chatbots with those of other machine learning and natural language processing methods, such as XGBoost and BERT-based fine-tuning, and share insights from our experience. We found that the AI chatbots perform similarly to one another while outperforming the machine learning methods we tested, an advantage in settings where those methods must be trained on scarce local data. We also found that, while working with AI chatbots is easier than writing code, getting useful results from them still poses a challenge for the user. Furthermore, we encountered alarming conceptual errors in the output of some chatbots, such as failing to count the number of lines in our inputs and explaining the mistake as "human error". Although this is not conclusive evidence that AI chatbots can be effectively used for metadata classification, we believe the information provided in this work can be useful to librarians and data curators developing pathways for the integration and use of AI tools in data curation and metadata augmentation tasks.
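A sketch of the kind of classical baseline the chatbots were compared against: TF-IDF features feeding an XGBoost classifier. Records, classes, and hyperparameters are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

records = ["metadata of a physics thesis", "catalog entry for a novel",
           "dataset readme for a chemistry survey", "poetry collection record"]
labels = [0, 1, 0, 1]  # 0 = scientific, 1 = literary (invented classes)

clf = make_pipeline(TfidfVectorizer(), XGBClassifier(n_estimators=50))
clf.fit(records, labels)  # this is the "local data for training" the text mentions
print(clf.predict(["lab notebook metadata entry"]))
```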
We applied computational methods to analyze references across 2,245 philosophical texts, spanning approximately 550 BCE to 1940 CE, in order to measure patterns in how philosophical ideas have spread over time. Using natural language processing and network analysis, we mapped over 294,970 references between authors, classifying each reference into a subdiscipline of philosophy based on its surrounding context. We then constructed a graph, with authors as nodes and textual references as edges, to empirically validate, visualize, and quantify intellectual lineages as they are understood within philosophical scholarship. For instance, we find that Plato and Aristotle alone account for nearly 10% of all references from authors in our dataset, suggesting that their influence may still be underestimated. As another example, we support the view that St. Thomas Aquinas served as a synthesizer of Aristotelian and Christian philosophy by analyzing the network structures around Aquinas, Aristotle, and Christian theologians. Our results are presented through an interactive visualization tool that allows users to dynamically explore these networks, alongside a mathematical analysis of the network's structure. Our methodology demonstrates the value of applying network analysis to textual references in order to study a large collection of historical works.
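A toy version of the author-level reference graph and the reference-share computation described above; the edges and counts are invented, not the paper's data:

```python
import networkx as nx

G = nx.DiGraph()
edges = [  # (citing author, cited author, number of references)
    ("Aquinas", "Aristotle", 120), ("Aquinas", "Augustine", 45),
    ("Kant", "Plato", 20), ("Hegel", "Aristotle", 30),
    ("Hegel", "Plato", 25),
]
G.add_weighted_edges_from(edges)

total = sum(w for _, _, w in G.edges(data="weight"))
for target in ("Plato", "Aristotle"):
    share = sum(w for _, t, w in G.edges(data="weight") if t == target) / total
    print(f"{target}: {share:.1%} of references")
```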
Failures of retraction are common in science. Why do these failures occur? And, relatedly, what makes findings harder or easier to retract? We use data from Microsoft Academic Graph, Retraction Watch, and Altmetric -- including retracted papers, citation records, and Altmetric scores and mentions -- to test recently proposed answers to these questions. A recent study by LaCroix et al. employs simple network models to argue that the social spread of scientific information helps explain failures of retraction. One prediction of their models is that widely known or well-established results should, surprisingly, be easier to retract, since their retraction is relevant to more scientists. Our results support this prediction: we find that highly cited papers show larger reductions in citations after retraction, and their retractions garner more attention as they occur.
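A minimal sketch of the pre- versus post-retraction comparison, on synthetic numbers; the study's actual estimation is more careful (controls, Altmetric attention data), so this is illustrative only:

```python
import pandas as pd

df = pd.DataFrame({  # invented yearly citation rates around retraction
    "paper": ["A", "A", "B", "B"],
    "group": ["highly_cited", "highly_cited", "lowly_cited", "lowly_cited"],
    "period": ["pre", "post", "pre", "post"],
    "cites_per_year": [80, 20, 6, 4],
})
wide = df.pivot_table(index=["paper", "group"], columns="period",
                      values="cites_per_year").reset_index()
wide["relative_drop"] = (wide["pre"] - wide["post"]) / wide["pre"]
print(wide[["paper", "group", "relative_drop"]])  # larger drop for the highly cited paper
```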