
Ramon Perez
Abstract
Asset management organizations rely on accurate and consistent failure codes to understand equipment reliability, optimize maintenance strategies, and prioritize high-value interventions. Structured failure data enables engineering and maintenance teams to identify recurring issues, quantify risk, allocate capital effectively, and focus resources on the most impactful reliability improvements. However, in many legacy Maintenance Management Systems, historical corrective maintenance work orders were recorded without structured failure classifications. In this case, more than 20 years of work order history lacked standardized failure codes, creating a significant data gap. Manually reviewing and back-classifying tens of thousands of records would require thousands of labor hours, introduce subjectivity and inconsistency, and still result in incomplete coverage.
This presentation describes a Phase 3 initiative that leverages Large Language Models (LLMs) to automatically classify failure types in power generation maintenance records. The objective was to determine whether modern generative AI systems could accurately interpret free-text corrective work orders and assign structured failure codes aligned to an established failure hierarchy. Data sources included corrective work orders extracted from a Maintenance Management System (Maximo), a multi-level failure classification hierarchy (e.g., failure class, component, cause, mitigation), and domain-specific technical documentation. To improve contextual accuracy and reduce hallucinations, a Retrieval-Augmented Generation (RAG) framework was implemented so that relevant hierarchical definitions and technical references were dynamically supplied to the model at inference time. A few-shot learning strategy was also applied to guide the model with representative labeled examples.
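The RAG and few-shot strategy described above can be illustrated with a minimal sketch: retrieve the hierarchy definitions most relevant to a work order, then assemble them with labeled examples into a classification prompt. All function names, the toy failure hierarchy, and the keyword-overlap scoring below are illustrative assumptions, not the project's actual implementation (which would use an embedding-based retriever and the hierarchy from the CMMS).

```python
# Illustrative sketch of RAG + few-shot prompt assembly.
# The hierarchy, codes, and scoring are hypothetical examples.

def retrieve_definitions(work_order_text, hierarchy, top_k=2):
    """Rank failure-hierarchy definitions by naive keyword overlap
    with the work order text and return the top_k best matches."""
    words = set(work_order_text.lower().split())
    scored = [
        (len(words & set(definition.lower().split())), code, definition)
        for code, definition in hierarchy.items()
    ]
    scored.sort(reverse=True)
    return [(code, definition) for _, code, definition in scored[:top_k]]

def build_prompt(work_order_text, hierarchy, few_shot_examples):
    """Combine retrieved definitions and labeled examples into one prompt."""
    context = "\n".join(
        f"{code}: {definition}"
        for code, definition in retrieve_definitions(work_order_text, hierarchy)
    )
    shots = "\n".join(
        f"Work order: {text}\nFailure code: {label}"
        for text, label in few_shot_examples
    )
    return (
        "Classify the work order using these failure definitions:\n"
        f"{context}\n\nExamples:\n{shots}\n\n"
        f"Work order: {work_order_text}\nFailure code:"
    )

# Hypothetical hierarchy and labeled example:
hierarchy = {
    "BRG-WEAR": "Bearing degradation due to wear or lubrication loss",
    "SEAL-LEAK": "Seal failure causing fluid leakage",
    "ELEC-SHORT": "Electrical short circuit in motor windings",
}
examples = [("Pump bearing running hot, grease depleted", "BRG-WEAR")]
prompt = build_prompt("Oil leaking from pump shaft seal", hierarchy, examples)
```

The assembled prompt supplies only the retrieved slice of the hierarchy, which keeps the context window focused and reduces the model's opportunity to hallucinate codes that were never defined.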
The selected model, Anthropic Claude 3.5 Sonnet, was evaluated using majority-vote methodologies to increase robustness and confidence in final predictions. Multiple inference passes were performed per work order, and consensus-based selection was used to stabilize outputs. Model performance was assessed against a curated evaluation dataset using standard classification metrics. Results demonstrate that LLMs can reliably assign structured failure codes, achieving a 92% F1-score across evaluation datasets. The approach significantly reduces manual effort while maintaining high classification quality, enabling the rapid creation of a historically complete failure dataset that would otherwise take years to produce manually. Beyond retrospective data enrichment, the methodology also shows promise as a quality assurance tool for future human-coded work orders and as a potential replacement for, or augmentation of, traditional failure reporting processes within the CMMS. The presentation discusses methodology design decisions, evaluation results, observed limitations, and areas for improvement, including expanding datasets with more diverse equipment failures, refining prompting and retrieval strategies, and exploring integration with diagnostic tools and industry reporting frameworks such as GADS. Overall, the findings suggest that LLM-driven classification can accelerate asset intelligence initiatives, enhance data-driven maintenance decision-making, and provide scalable, repeatable quality assurance for both historical and future maintenance records in power generation environments.
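The majority-vote step can be sketched in a few lines: run several inference passes per work order and keep the most frequent label, along with an agreement fraction that can flag low-confidence records for review. The `classify_once` callable below is a stand-in for an actual LLM API call, and the simulated responses are hypothetical.

```python
from collections import Counter

def consensus_label(work_order_text, classify_once, n_passes=5):
    """Collect n_passes predictions for one work order and return
    (majority label, fraction of passes that agreed with it)."""
    votes = Counter(classify_once(work_order_text) for _ in range(n_passes))
    label, count = votes.most_common(1)[0]
    return label, count / n_passes

# Simulate a model that is occasionally inconsistent across passes:
responses = iter(["BRG-WEAR", "BRG-WEAR", "SEAL-LEAK", "BRG-WEAR", "BRG-WEAR"])
label, agreement = consensus_label("Pump bearing hot", lambda _: next(responses))
# label == "BRG-WEAR", agreement == 0.8
```

A natural extension, in the quality-assurance spirit described above, is to route any work order whose agreement falls below a threshold to a human reviewer rather than accepting the consensus label automatically.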
Keywords: AI, LLM
Biography of the presenter
Ramon Perez is the Director of AI Solutions at Elder Research, an AI/ML consultancy, where he builds AI-enabled software products to solve challenging industrial problems. Ramon holds a bachelor's degree in engineering from Georgia Tech and master's degrees from Georgetown and Harvard.

