Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation

¹University of Bath, UK, ²National University Philippines

Standardize is a retrieval-style, in-context learning (ICL) framework that guides large language models to generate text content (e.g., short stories) aligned with expert-defined standards (e.g., CEFR or CCS).

Figure 1.


What are standards?
Standards are documented guidelines that describe, often in rich detail, the requirements, specifications, and criteria of a process or a piece of content. These guidelines are defined and continuously improved by experts or interest groups in various domains, such as education, healthcare, and accounting.

Why should LLMs be aligned with expert-defined standards?
Augmenting an LLM with these expert-defined standards, so that it generates text content aligned with them, offers a number of advantages. Primarily, using standards helps ensure that a model's internal processes, decision-making, and outputs are consistent and reproducible, which is a prerequisite to building trustworthy AI systems. Human users will also be able to trust LLM-based systems more if they are assured that the technology adheres to the same standards and rules that human experts follow.

This study focuses on content-based standards used in education and language assessment (see Table below), such as the Common European Framework of Reference for Languages (CEFR) and the Common Core Standards (CCS), which are incorporated into an LLM's text generation process. Aligning generated text material with these existing standards is crucial to ensure quality and consistency before the material is used in classroom settings.


The Standardize Framework

Standardize is a new retrieval-style in-context learning-based framework that exploits the rich information found in standards and transforms this into knowledge artifacts to improve the quality of content produced by generative models.

As seen in the Figure above, the Standardize framework involves a three-part process:

  • Extraction of Target Specifications is performed first to obtain informative tags in the prompt and to correctly match this information within the standards. For academic standards in language assessment, these specifications should indicate who the content will be delivered to (target audience) and which specific standard, out of many, applies (CEFR or CCS).
  • Lookup and Retrieval is performed next, once the target specifications have been extracted. A lookup process finds the matching entry in the selected standard, which is usually stored as a database or an external machine-readable file. The length and complexity of the information a standard provides for a given specification may vary.
  • Knowledge Augmentation is done last. We propose a further technical augmentation of information found in standards to obtain knowledge artifacts in the prompts. These knowledge artifacts can range from simple additional information already present in the standard to complex representations, such as incorporating actual linguistic features to control the granularity of the generation process.
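
The three steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `STANDARDS` dictionary, its descriptor text, and the function names `extract_target_spec`, `lookup`, and `augment` are all hypothetical, and a real standard would be loaded from a database or machine-readable file.

```python
import re
from dataclasses import dataclass

# Hypothetical, heavily abridged stand-in for a machine-readable standard.
# The descriptor text is illustrative and NOT quoted from CEFR or CCS.
STANDARDS = {
    "CEFR": {
        "A2": {
            "aspects": "Short, simple sentences about familiar everyday topics.",
            "exemplar": "(title of an expert-recommended A2 text)",
        },
        "B1": {
            "aspects": "Connected text on familiar topics with a clear storyline.",
            "exemplar": "(title of an expert-recommended B1 text)",
        },
    },
}


@dataclass(frozen=True)
class TargetSpec:
    standard: str
    level: str


def extract_target_spec(prompt: str) -> TargetSpec:
    """Step 1: find the standard name and level tag mentioned in the prompt."""
    for standard, levels in STANDARDS.items():
        if not re.search(rf"\b{re.escape(standard)}\b", prompt, re.IGNORECASE):
            continue
        for level in levels:
            if re.search(rf"\b{re.escape(level)}\b", prompt, re.IGNORECASE):
                return TargetSpec(standard, level)
    raise ValueError("no known standard and level found in the prompt")


def lookup(spec: TargetSpec) -> dict:
    """Step 2: retrieve the matching entry from the stored standard."""
    return STANDARDS[spec.standard][spec.level]


def augment(prompt: str, spec: TargetSpec, entry: dict) -> str:
    """Step 3: append the retrieved knowledge artifacts to the prompt."""
    return (
        f"{prompt}\n\n"
        f"Target: {spec.standard} level {spec.level}\n"
        f"Aspect information: {entry['aspects']}\n"
        f"Exemplar: {entry['exemplar']}"
    )


user_prompt = "Write a short story about a lost dog for CEFR A2 learners."
spec = extract_target_spec(user_prompt)
augmented_prompt = augment(user_prompt, spec, lookup(spec))
```

The augmented prompt, rather than the raw user prompt, is what would be sent to the generative model.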

Knowledge Artifacts for Standard Alignment

The knowledge augmentation process of Standardize involves the extraction of three forms of knowledge artifacts from the standard itself:

  1. Aspect Information (Standardize-A) - this is generally attributed to linguistic criteria of content with respect to its target audience. The addition of aspect criteria information ensures that the generative model will have access to explicit characteristics of the desired generated content in different dimensions.
  2. Exemplars (Standardize-E) - these are examples recommended by experts or developers of standards for users' reference. The addition of exemplars, or any artifact found in the standard that showcases gold-standard output, gives the generative model a sense of implicit knowledge during the content generation process.
  3. Linguistic Flags (Standardize-L) - these represent the controllable variables of a standard that a generative model can use to explicitly steer the direction of content generation. In the Standardize framework, this process serves as a rewrite function: the generative model first produces an initial draft using another prompting method (e.g., aspect information), then rewrites it by comparing the linguistic flag values of the draft against the mean values of a gold-standard dataset at the target level.

    A verbalizer is used to transform the computed linguistic flags into natural language prompts. The keywords "increase" and "decrease" are used in constructing the prompts to provide a sense of direction for the generative model.
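
The rewrite step with linguistic flags can be sketched as follows. This is a simplified illustration, not the authors' code: the two flags computed here (average sentence length and type-token ratio), the 5% tolerance, the gold mean values, and the prompt wording are all assumptions chosen for the example.

```python
import re


def linguistic_flags(text: str) -> dict[str, float]:
    """Compute two simple flags for a draft text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not tokens:
        raise ValueError("text has no sentences or tokens")
    return {
        "average sentence length": len(tokens) / len(sentences),
        "type-token ratio": len(set(tokens)) / len(tokens),
    }


def verbalize(draft_flags: dict[str, float],
              gold_means: dict[str, float],
              tolerance: float = 0.05) -> str:
    """Turn the gap between draft flags and gold means into a rewrite prompt."""
    instructions = []
    for name, gold in gold_means.items():
        value = draft_flags[name]
        if abs(value - gold) <= tolerance * abs(gold):
            continue  # close enough to the gold mean, no instruction needed
        direction = "increase" if value < gold else "decrease"
        instructions.append(
            f"{direction} the {name} from {value:.2f} to about {gold:.2f}"
        )
    if not instructions:
        return "Keep the text as it is."
    return "Rewrite the story and " + "; ".join(instructions) + "."


draft = "The cat sat. The cat ran. The cat hid."
# Hypothetical gold-standard means for the target level.
gold_means = {"average sentence length": 8.0, "type-token ratio": 0.40}
rewrite_prompt = verbalize(linguistic_flags(draft), gold_means)
```

For this draft the prompt asks the model to increase the average sentence length and decrease the type-token ratio, since the draft falls below the first gold mean and above the second.
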

Figure: Knowledge artifacts.


Standard Alignment via Overall Performance

For CEFR, we report a more than 100% increase in precise accuracy with GPT-4 (0.227 → 0.480) and a 43% increase in adjacent accuracy (0.630 → 0.906) using the Standardize framework compared to the teacher style method. The open models also gained substantial boosts in performance: Longform improved by 23%, OpenChat by 14%, and Llama2 by 58%. In terms of adjacent accuracy, GPT-4 remained the best model for preserving the ordinality of the labels, with a score of 0.906.

For CCS, we see a similar pattern: all open and closed models obtained their best performance with Standardize, with boosts ranging from 3% to 45% when linguistic signals are used to refine the generated content toward the target level.
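
The two accuracy measures can be illustrated with a short sketch. It assumes the usual definitions for ordinal labels: precise accuracy counts exact level matches, and adjacent accuracy also accepts predictions one level away from the target. The function names and the toy labels are hypothetical, and `LEVELS` lists the six standard CEFR levels, which may differ from the subset evaluated in the study.

```python
# The six CEFR levels, ordered from lowest to highest proficiency.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def precise_accuracy(targets: list[str], predictions: list[str]) -> float:
    """Fraction of texts whose predicted level equals the target level."""
    hits = sum(t == p for t, p in zip(targets, predictions))
    return hits / len(targets)


def adjacent_accuracy(targets: list[str], predictions: list[str],
                      levels: list[str] = LEVELS) -> float:
    """Fraction of texts whose predicted level is at most one step away."""
    rank = {level: i for i, level in enumerate(levels)}
    hits = sum(abs(rank[t] - rank[p]) <= 1
               for t, p in zip(targets, predictions))
    return hits / len(targets)


def relative_increase(before: float, after: float) -> float:
    """Relative change, e.g. 0.5 means a 50% increase."""
    return (after - before) / before


# Toy example: four generated texts with target and classifier-assigned levels.
targets = ["A2", "B1", "B2", "C1"]
predictions = ["A2", "B2", "B2", "A2"]
precise = precise_accuracy(targets, predictions)    # 2 of 4 exact matches
adjacent = adjacent_accuracy(targets, predictions)  # 3 of 4 within one level

# Relative gain for the GPT-4 precise accuracy scores quoted above.
gpt4_precise_gain = relative_increase(0.227, 0.480)  # about 1.11
```

Adjacent accuracy is always at least as high as precise accuracy, because every exact match also counts as an adjacent match.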

Figure: Main results.

Standard Alignment via Distributional Closeness

We observe that, as a general trend, using the best models with linguistic signals produces a more stable distribution across the variables being explicitly controlled for (e.g., average sentence length or type-token diversity), particularly with the CCS standards. We also notice that generations produced with Standardize have distributions closer to the mean of their corresponding gold-standard data.

In terms of distributional closeness, using linguistic signals makes the quality of model generations more similar to the linguistic characteristics of the gold standard datasets in CEFR and CCS. Overall, these findings further strengthen the evidence of standard alignment by incorporating specific linguistic variables in the content generation process through the Standardize framework.
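
One simple way to quantify this kind of closeness is to measure how far the mean of a linguistic variable in the generated texts lies from the gold-standard mean, in units of the gold-standard deviation. The page does not state which closeness measure was used, so the sketch below is only an illustrative assumption; the function name and the toy sentence-length values are hypothetical.

```python
from statistics import mean, stdev


def gap_to_gold(generated: list[float], gold: list[float]) -> float:
    """Distance between the two means, scaled by the gold standard deviation.

    Smaller values mean the generated texts sit closer to the gold data.
    """
    return abs(mean(generated) - mean(gold)) / stdev(gold)


# Toy average-sentence-length values for one target level.
gold_lengths = [10.0, 12.0, 14.0]
baseline_lengths = [18.0, 20.0, 22.0]     # e.g. generations without flags
standardize_lengths = [11.0, 13.0, 15.0]  # e.g. generations with flags

baseline_gap = gap_to_gold(baseline_lengths, gold_lengths)
standardize_gap = gap_to_gold(standardize_lengths, gold_lengths)
```

In this toy data the second set of generations has the smaller gap, which is the pattern described above for content refined with linguistic signals.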

Figure: Distribution results.

Fluency and Diversity

For fluency, we do not see an obvious tradeoff for models generating CEFR and CCS content, and we report relatively consistent performance with the Standardize setup. The best-performing model is still GPT-4 for both standards, and Longform among the open models. For diversity, on the other hand, the most diverse batch of generated content comes from the teacher style method for CCS. This may vary case by case and be task-dependent, however, since we do not see the same tradeoff in performance with the CEFR standards in context-assisted story generation.
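
The page does not say which diversity metric was used. As one common choice, assumed here purely for illustration, distinct-n measures the share of unique n-grams across a batch of generated texts; the function name and the toy batch are hypothetical.

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Share of unique n-grams among all n-grams in a batch of texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the tokens of this text.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


batch = ["the cat sat", "the cat ran"]
bigram_diversity = distinct_n(batch, n=2)  # 3 unique bigrams out of 4
```

A higher value indicates less repetition across the batch, so comparing this score between prompting methods gives a rough view of any diversity tradeoff.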

Ultimately, our experimental procedure is focused on generating text content that aligns with the specified target level of a standard. The standards that we applied in this study, CEFR and CCS, do not explicitly provide information on content creativity or how to measure it. Thus, we posit that creativity may be an interesting angle to explore in future work.


Validity on Global Educational Context

Our experiments with the CEFR and CCS standards showcase an opportunity for the texts generated by language model interfaces such as GPT-4, which are commonly used by educators and teachers, to be aligned with international language proficiency levels. Moreover, showing the effectiveness of the Standardize framework on these internationally recognized academic standards, used in European and North American schools, signifies the framework's strong potential for cross-curricular application. Thus, we invite future researchers to explore, validate, and propose derivations of our base framework for their own languages and language-specific standards for content generation.

Towards More Personalized Content Generation

This work contributes toward the goal of helping educators craft more personalized content for learners, using the capabilities of large language models and an assigned language proficiency level described by a standard. While we present a new task specifically targeted at the NLP community to encourage research in this direction, our results may already be useful for educators: they provide context on better methods for generating level-specific or audience-specific texts by prompting language models with information found in educational standards, in the way we proposed.

Prior Work

Works in Complexity-Controlled NLG. Complexity-controlled generation has been explored in the past, covering diverse facets in terms of text format, level granularity, and task variation. The work of Agrawal and Carpuat (2019) introduced control for specific complexity levels in the machine translation task. The subsequent works of Agrawal and Carpuat (2023) and Ribeiro et al. (2023) explored grade-specific text simplification and summarization using control tokens and reinforcement learning, respectively. Currently, only two works have investigated incorporating CEFR for language learning content generation. Stowe et al. (2022) and Imperial and Madabushi (2023) both made use of CEFR-aligned text for NLG but limited their studies to two levels, A1 and C2.

However, none of these works made use of the actual guideline information found in CEFR during the generation process.

Novelty. The Standardize framework parallels the work of Zhou et al. (2023), where a verbalizer is used to transform quantitative constraints into natural language for prompting, as well as the work of Ram et al. (2023) in the lookup and retrieval phase, where aspect information is added to the prompt to influence model controllability. Compared to all the works mentioned, our study's main novelty is capturing expert-defined standards in full: it prioritizes fine granularity rather than just one or two levels, and it includes information that can be represented as artifacts in the content generation process.


      author    = {Imperial, Joseph Marvin and Forey, Gail and Tayyar Madabushi, Harish},
      title     = {{Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation}},
      year      = {2024},
      journal   = {arXiv preprint arXiv:2402.12593},
      url       = {}