The hot topic in business right now is how to use AI and LLMs. Many companies want to train their own model, and while training solves a lot of challenges, it is certainly not a solution for everything. Sometimes the more conventional, boring tools are better; sometimes training a large language model really is the best way forward. So how do you do that? Let's look at the different ways you can teach a model: what you want to achieve should define the approach you take. Another angle is to look at what data you have available and then explore your options based on that data. I often start by looking at the data format before anything else, an approach that might be unorthodox but is not unheard of.
Before Training: Methods That Don’t Require Data
Before jumping into training, most teams should try these approaches first:
Prompting & System Prompts: Write better instructions and context in your prompts. Often solves 80% of problems without any training.
RAG (Retrieval-Augmented Generation): Connect the model to external databases or documents. The model retrieves relevant information before answering (a minimal code sketch follows this list).
Tool/Function Calling: Give the model access to APIs, calculators, or databases. It learns when to call these tools and how to use their outputs. This is like giving instructions to a coworker: “When someone asks about the weather, check the weather API first, then respond with what you found.”
These methods teach the model new capabilities without changing its internal parameters. Try these first – they’re faster and cheaper than training.
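To make the RAG idea concrete, here is a minimal sketch in Python. Everything in it is illustrative: the "retriever" is just keyword overlap over an in-memory list of policy snippets, and `call_llm` is a placeholder for whatever model API you actually use.

```python
# Minimal RAG sketch: retrieve relevant snippets, then prepend them to the prompt.
# The scoring is naive keyword overlap; a real system would use embeddings and a
# vector store, and call_llm() is a stand-in for your model API of choice.

DOCUMENTS = [
    "Travel policy: trips over $1,000 require manager approval.",
    "Expense policy: meals are reimbursed up to $50 per day.",
    "IT policy: laptops are refreshed every three years.",
]

def retrieve(question: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by how many question words they contain."""
    words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def call_llm(prompt: str) -> str:
    return "(model response would go here)"  # placeholder for your model call

def answer(question: str) -> str:
    context = "\n".join(retrieve(question, DOCUMENTS))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

print(answer("Do I need approval for a $1,200 trip to Chicago?"))
```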
Supervised Fine-Tuning (SFT) / Instruction Tuning
How is this technique used: You show the model thousands of examples of correct input-output pairs until it learns to mimic your style and approach.
What data it eats: Question and answer pairs. Input and expected output examples. Text files with two columns work fine. In practice, these are usually JSONL files with instruction, input, and output fields, plus metadata like domain tags for organization.
Examples:
- Travel request: “Business trip to Chicago, 3 days, client meeting with Acme Corp, estimated cost $1,200.” Approval decision: “Approved. This aligns with the Q3 client expansion strategy, Acme Corp is a high-priority prospect, and the budget is reasonable for a 3-day trip.”
- Restaurant recommendation: “Anniversary dinner, likes Italian food, budget around $150 for two people, downtown area.” Response: “I recommend Bella Vista on Main Street. They specialize in Northern Italian cuisine, have a romantic atmosphere perfect for anniversaries, and most entrees run $25-35 per person. Make a reservation – they book up quickly on weekends.”
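In JSONL form, the travel-request example above might be stored like the sketch below. The field names follow the instruction/input/output convention mentioned earlier plus a metadata tag; they are a common convention, not a fixed standard.

```python
import json

# One SFT training record per line in the output file.
record = {
    "instruction": "Review this travel request and give an approval decision.",
    "input": "Business trip to Chicago, 3 days, client meeting with Acme Corp, estimated cost $1,200.",
    "output": "Approved. This aligns with the Q3 client expansion strategy, Acme Corp is a "
              "high-priority prospect, and the budget is reasonable for a 3-day trip.",
    "domain": "travel_approvals",  # optional metadata tag for organization
}

with open("sft_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```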
Learning outcome: The model learns to handle your specific business situations the way you do. It picks up on details that matter – mentioning competitor concerns, referencing previous conversations, connecting decisions to budget priorities. After training on sales follow-ups, it knows to address objections proactively. After training on expense approvals, it learns what criteria you actually care about.
How it’s different: This teaches by direct example. Unlike other methods, you show exactly what you want. The model learns “when I see this input, produce this type of output.”
Chain of Thought Training
How is this technique used: You demonstrate the complete reasoning process step-by-step, not just the final answer, so the model learns to show its work.
What data it eats: Question, thinking steps, then final answer. You show the work, not just the result. Technically stored as JSONL with instruction, input, reasoning steps, and final output fields.
Examples:
- Question: “Should we extend credit terms to this customer?”
- Thinking: “First, check payment history. They’ve paid 18 of 20 invoices on time, with only two delays during their busy season. Second, review credit limit usage. Currently using 60% of available credit, which is reasonable. Third, assess business stability. Revenue has grown 15% year-over-year for three years. Fourth, consider strategic value. They’re expanding into new markets where we want presence.”
- Answer: “Yes, extend credit terms. Strong payment history and growth trajectory outweigh the two seasonal delays.”
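Stored as data, the credit-terms example above might look like the sketch below. The field names are illustrative; the key point is that the reasoning steps become part of the text the model is trained to generate, usually written out before the final answer.

```python
import json

# Chain-of-thought record: the reasoning is training data, not just the answer.
record = {
    "instruction": "Decide whether to extend credit terms to this customer and show your reasoning.",
    "input": "18 of 20 invoices paid on time, 60% of credit limit in use, revenue up 15% year-over-year.",
    "reasoning": [
        "Check payment history: 18 of 20 invoices on time, delays only during their busy season.",
        "Review credit limit usage: 60% of available credit, which is reasonable.",
        "Assess business stability: revenue has grown 15% year-over-year for three years.",
        "Consider strategic value: they are expanding into markets where we want presence.",
    ],
    "output": "Yes, extend credit terms. Strong payment history and growth trajectory "
              "outweigh the two seasonal delays.",
}

# At training time the reasoning and the answer are usually joined into one target string.
target = "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(record["reasoning"]))
target += "\nAnswer: " + record["output"]
print(target)
print(json.dumps(record))
```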
Learning outcome: The model learns to show its reasoning process. Instead of jumping to conclusions, it works through problems step by step. This makes its answers more trustworthy because you can see how it arrived at the conclusion.
How it’s different: Unlike supervised fine-tuning that only shows final answers, this teaches the thinking process. However, note that CoT is often disabled in production for privacy or safety reasons. The model learns process-shaped outputs but this doesn’t guarantee true understanding – it’s guided mimicry of reasoning patterns.
Conversational Training
How is this technique used: You train the model on multi-turn dialogue datasets, but mask out the human turns (their contribution to the loss is multiplied by zero) so the model only learns to predict the AI responses, not the human ones.
What data it eats: Chat logs with back-and-forth conversations. The human parts get masked out during training so the AI only learns from its own responses. These are formatted as JSONL files with conversation arrays containing role-labeled messages.
Examples:
- Customer support chat: Human: “My order hasn’t arrived” → AI: “Let me look up your order number. Can you provide the order ID?” → Human: “It’s #12345” → AI: “I see your order shipped yesterday and should arrive tomorrow by 2 PM” (Only the AI responses are learned)
- Technical support: Human: “The server keeps crashing” → AI: “What error messages do you see?” → Human: “Out of memory errors” → AI: “Let’s check your memory usage and increase the allocated RAM” (Human parts masked out)
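Here is a sketch of one conversation record plus the loss mask that implements the "multiply by zero" idea: only the assistant turns contribute to the loss. Role names and field layout follow the common user/assistant convention and are illustrative.

```python
import json

# One multi-turn conversation, stored as an array of role-labeled messages.
conversation = {
    "messages": [
        {"role": "user", "content": "My order hasn't arrived"},
        {"role": "assistant", "content": "Let me look up your order number. Can you provide the order ID?"},
        {"role": "user", "content": "It's #12345"},
        {"role": "assistant", "content": "I see your order shipped yesterday and should arrive tomorrow by 2 PM"},
    ]
}

# Loss masking: human turns get weight 0, assistant turns weight 1, so the model
# is only trained to reproduce its own side of the dialogue.
loss_weights = [1.0 if m["role"] == "assistant" else 0.0 for m in conversation["messages"]]

print(json.dumps(conversation))
print(loss_weights)  # [0.0, 1.0, 0.0, 1.0]
```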
Learning outcome: The model learns to maintain context across multiple conversation turns and respond appropriately to human inputs without trying to mimic human speech patterns.
How it’s different: Unlike single-turn training, this teaches conversational flow and context retention. The masking ensures the model learns to be the AI assistant, not to predict what humans might say next.
RLHF (Preference Learning with PPO)
How is this technique used: You train a reward model to score responses, then use the PPO (Proximal Policy Optimization) algorithm to update the main model so it generates higher-scoring responses.
What data it eats: Pairs of responses where humans said which one was better. Plus prompts for the model to practice on during training. Stored as JSONL with prompt, chosen response, and rejected response, plus additional prompts for policy training.
Examples:
- Project status question → Answer A: “Everything is on track” vs Answer B: “Milestones 1-3 completed on schedule, milestone 4 delayed by 2 weeks due to vendor issues, overall delivery still projected for original deadline” → B preferred
- Investment recommendation → Answer A: “Buy Tesla stock” vs Answer B: “Tesla shows strong EV market position but high volatility. Consider 2-3% portfolio allocation if risk tolerance allows, with exit strategy if it drops below $180” → B preferred
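In data form, each comparison like the ones above becomes one record with a prompt, the chosen answer, and the rejected answer; the reward model is trained on these records before PPO ever runs. Field names are illustrative.

```python
import json

# One human comparison: the reward model learns to score "chosen" above "rejected".
record = {
    "prompt": "What is the current project status?",
    "chosen": "Milestones 1-3 completed on schedule, milestone 4 delayed by 2 weeks due to "
              "vendor issues, overall delivery still projected for original deadline.",
    "rejected": "Everything is on track.",
}

with open("preference_pairs.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```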
Learning outcome: The model learns human preferences for helpful, detailed, actionable responses over generic ones.
How it’s different: This uses reinforcement learning with a separate reward model. PPO ensures the model doesn’t change too dramatically in each training step, maintaining stability while optimizing for human preferences.
Direct Preference Optimization (DPO/IPO/KTO)
How is this technique used: You directly train the model to assign higher probability to preferred responses and lower probability to rejected ones, skipping the reward model step entirely.
What data it eats: Three things for each example: the question, the better answer, and the worse answer. Technically stored as JSONL with prompt, chosen response, and rejected response triplets. Some variants like KTO can work with just positive examples.
Examples:
- Customer complaint handling → Preferred: “I understand your frustration with the delayed shipment. Let me track your order and provide a specific update within 2 hours” vs Rejected: “Sorry for the inconvenience, we’ll look into it”
- Technical explanation → Preferred: “API rate limiting works by counting requests per time window. When you exceed 100 requests per minute, the server returns a 429 status code and you must wait” vs Rejected: “Rate limiting prevents too many requests”
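The "directly adjusts probabilities" part can be shown in a few lines. This sketch computes the standard DPO loss from log-probabilities of the chosen and rejected answers under the model being trained and a frozen reference model; the numbers are made up for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective: push the policy to prefer the chosen answer relative to a
    frozen reference model, scaled by the beta hyperparameter."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Made-up log-probabilities: the policy already slightly prefers the chosen answer.
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
print(round(loss, 3))  # the loss shrinks as the preferred answer becomes relatively more likely
```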
Learning outcome: The model learns to favor detailed, helpful responses directly without needing a separate reward model.
How it’s different: DPO skips the reward model and reinforcement learning complexity. It directly adjusts probabilities based on preference pairs, making it simpler and more stable than PPO-based methods.
Listwise Preference Optimization (Group Ranking with GRPO)
How is this technique used: You rank multiple responses from best to worst across different quality dimensions, and use the GRPO (Group Relative Policy Optimization) algorithm to learn from these relative comparisons.
What data it eats: Multiple responses to the same question, ranked from best to worst. You need to show which dimensions matter for ranking. Stored as JSONL with prompts and arrays of responses with ranking scores across multiple evaluation criteria.
Examples:
- “Explain our Q4 budget variance to the board” → Response 1: Perfectly accurate with detailed spreadsheet data but puts executives to sleep, Response 2: Engaging storytelling but skips important financial details, Response 3: Clear narrative with key numbers and actionable next steps, Response 4: Simple but draws wrong conclusions from the data → Ranked 3,2,1,4
Learning outcome: The model learns to balance competing objectives like accuracy vs accessibility, comprehensiveness vs clarity.
How it’s different: Instead of just “better vs worse,” this teaches the model to optimize for multiple criteria simultaneously. GRPO computes advantages relative to the group average, allowing the model to learn from ranking relationships rather than absolute preferences.
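The "relative to the group average" idea is simple to show in code. This sketch assigns made-up quality scores to a group of responses (standing in for the four board-report drafts above) and converts them into group-relative advantages, which is the core computation in GRPO-style training.

```python
# Group-relative advantages: each response is judged against the other responses
# to the same prompt, not against an absolute scale. Scores here are invented.
scores = {
    "accurate but dry": 6.0,
    "engaging but thin on numbers": 5.0,
    "clear narrative with key numbers": 9.0,
    "simple but wrong conclusions": 2.0,
}

mean = sum(scores.values()) / len(scores)
std = (sum((s - mean) ** 2 for s in scores.values()) / len(scores)) ** 0.5

advantages = {name: (s - mean) / std for name, s in scores.items()}
for name, adv in advantages.items():
    print(f"{name:35s} advantage = {adv:+.2f}")

# Responses above the group average get positive advantages and are reinforced;
# those below the average get negative advantages and are discouraged.
```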
Reinforcement Learning from Verifiable Rewards (RLVR)
How is this technique used: You set up automatic tests that check whether the model’s answers are objectively correct, giving points for right answers and zero for wrong ones.
What data it eats: Problems with clear right and wrong answers that can be checked automatically. The system needs to include the problem, the model’s attempt, and an automatic checker that returns pass/fail. Stored as JSONL with problems and verification functions that handle edge cases.
Examples:
- Invoice data extraction → Either gets vendor name, amount, due date correct or doesn’t
- Email classification → Either correctly categorizes as support/sales/billing or doesn’t
- Report formatting → Either produces the required table structure or doesn’t
- Data validation → Either correctly flags orders over $10,000 without manager approval or doesn’t
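A verifier for the invoice-extraction case might look like the sketch below: it returns a reward of 1 only when every required field matches the known-correct values, and 0 otherwise. The field names and expected values are made up for illustration.

```python
import json

def verify_invoice_extraction(model_output: str, expected: dict) -> int:
    """Return 1 if the model's JSON answer matches every expected field, else 0."""
    try:
        answer = json.loads(model_output)
    except json.JSONDecodeError:
        return 0  # unparseable output earns no reward
    return int(all(answer.get(key) == value for key, value in expected.items()))

expected = {"vendor": "Acme Corp", "amount": "1200.00", "due_date": "2024-10-15"}
good = '{"vendor": "Acme Corp", "amount": "1200.00", "due_date": "2024-10-15"}'
bad = '{"vendor": "Acme", "amount": "1200.00", "due_date": "2024-10-15"}'

print(verify_invoice_extraction(good, expected))  # 1
print(verify_invoice_extraction(bad, expected))   # 0
```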
Learning outcome: The model gets extremely good at tasks with objective success criteria.
How it’s different: This uses automatic checking instead of human judgment. Success is measured by pass/fail tests, not human preference ratings. Also called Programmatic Rewards or RL from Unit Tests.
Constitutional AI Training
How is this technique used: You train the model to critique its own responses based on written principles, then improve them before giving the final answer.
What data it eats: Examples of bad responses, critiques explaining what’s wrong, and improved versions. You need the original response, the critique based on your principles, and the better version. Stored as JSONL with original responses, critique explanations, and revised responses.
Examples:
- Original response: “That’s a terrible idea. Any competent manager would know better.”
- Critique: “This violates the principle of providing constructive feedback without personal attacks.”
- Improved response: “I see some challenges with this approach. The main concern is resource allocation – this would require pulling staff from other priorities. Have you considered starting with a smaller pilot program?”
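The critique-then-revise loop that produces this data can be sketched in a few lines. `call_llm` is again a placeholder for whatever model you use, and the single principle stands in for a full written constitution.

```python
# Constitutional AI data generation sketch: draft -> critique -> revision.
PRINCIPLES = [
    "Provide constructive feedback without personal attacks.",
]

def call_llm(prompt: str) -> str:
    return "(model response would go here)"  # placeholder for your model API

def constitutional_revision(draft: str) -> dict:
    """Critique a draft answer against a principle, then produce a revision.
    The (original, critique, revision) triples become the training data."""
    critique = call_llm(
        f"Critique this response against the principle '{PRINCIPLES[0]}':\n{draft}"
    )
    revision = call_llm(
        f"Rewrite the response so it satisfies the principle.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    return {"original": draft, "critique": critique, "revision": revision}

record = constitutional_revision("That's a terrible idea. Any competent manager would know better.")
print(record)
```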
Learning outcome: The model learns to self-correct based on quality principles you define.
How it’s different: This teaches self-monitoring. Instead of learning from external feedback, the model learns to evaluate and improve its own outputs.
Parameter-Efficient Training (LoRA/Adapters)
How is this technique used: You add small adapter layers to an existing model and train only those layers, specializing the model for your domain without retraining everything.
What data it eats: Same data types as other methods – question-answer pairs, conversations, preferences. The difference is in how the training works, not the data format. Uses the same JSONL format as the base method you’re adapting.
Examples:
- Insurance domain: Take a general model and feed it thousands of insurance documents – claim forms, policy language, adjuster reports, settlement decisions. The model learns insurance terminology and decision patterns while keeping its ability to write emails, summarize text, and handle general business tasks.
- Legal domain: Feed contract templates, case law summaries, and legal memos to a general model. It learns to write like a lawyer and understand legal concepts, but still retains knowledge about other topics.
Learning outcome: The model gains specialized domain knowledge while keeping its general capabilities intact.
How it’s different: This is about training efficiency, not learning type. You can adapt a general model to your specific domain without the cost of full retraining.
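Here is a numerical sketch of the LoRA idea, independent of any particular training library: the pretrained weight stays frozen and only two small low-rank matrices are learned, which is where the efficiency comes from. The sizes are arbitrary.

```python
import numpy as np

# LoRA in one picture: the original weight W is frozen; only the small matrices
# A and B are trained, and their product is added to W during the forward pass.
d, r = 1024, 8                      # hidden size and adapter rank (r << d)
W = np.random.randn(d, d)           # frozen pretrained weight, never updated
A = np.random.randn(r, d) * 0.01    # trainable
B = np.zeros((d, r))                # trainable, zero-initialised so training starts from W

def lora_forward(x: np.ndarray, alpha: float = 16.0) -> np.ndarray:
    """Apply the frozen weight plus the scaled low-rank update."""
    return x @ (W + (alpha / r) * (B @ A)).T

x = np.random.randn(1, d)
print(lora_forward(x).shape)                                    # (1, 1024)
print(f"trainable params: {A.size + B.size:,} of {W.size:,}")   # ~16K of ~1M
```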
Data Quality & Licensing Considerations
The data is the most important part of any training project, yet most people rush through data preparation to get to the exciting model training phase. This is a mistake that kills projects before they start. It’s better to prepare a small sample manually, even when it feels frustrating and slow. This manual work gives you time to think about what you’re actually trying to teach the model and helps you spot problems early when they’re cheap to fix.
Once you have a small, high-quality sample, scaling up becomes much easier. AI models excel at understanding patterns from examples, so you can use the models themselves to help convert larger databases to match your successful format. But you need that carefully crafted foundation first. Remove duplicates, fix formatting inconsistencies, and redact sensitive information like personal data, credentials, or proprietary details. Ensure you have proper licensing rights to use the training data, especially for commercial applications – many datasets restrict usage to research only. The extra time spent on data quality pays back exponentially in model performance.
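A cleanup pass over a JSONL file can be very plain code. This sketch drops exact duplicates and redacts obvious email addresses; the input file name matches the earlier SFT sketch and is a placeholder, and a real pipeline would check far more than this, but the read-filter-rewrite structure is the same.

```python
import json
import re

# Minimal cleanup pass: remove exact duplicates and redact email addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

seen = set()
with open("sft_train.jsonl", encoding="utf-8") as src, \
     open("sft_train_clean.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        key = (record.get("input"), record.get("output"))
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        for field in ("input", "output"):
            if field in record:
                record[field] = EMAIL.sub("[REDACTED_EMAIL]", record[field])
        dst.write(json.dumps(record) + "\n")
```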
How to Know It Worked: Evaluation Methods
You can’t trust students telling you they learned something in your course – you need to conduct an exam. The same principle applies to AI training. Many projects fail because teams leave evaluation until the end, but this is backwards thinking. You should design your evaluation first, before any training begins. Once you know what success looks like, you can build a mechanism to achieve that outcome. This might feel counterintuitive, but it’s how successful projects work.
When you conduct an exam, you need grading rules – a clear rubric that defines what counts as a good answer versus a poor one. Once you have those grading rules established, testing becomes straightforward. The purpose is to score how well different models perform so you can compare their abilities objectively. Start with offline task metrics by measuring accuracy on test sets you’ve held back from training. For text generation, track scores like BLEU or ROUGE. For classification tasks, measure accuracy and precision.
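For classification tasks the offline metrics are easy to compute yourself. This sketch scores made-up predictions on a small held-out set; accuracy covers all predictions, while precision is shown for one class as an example.

```python
# Offline evaluation sketch: compare model predictions on a held-out test set
# against gold labels. The labels and predictions here are invented.
gold        = ["support", "sales", "billing", "support", "sales"]
predictions = ["support", "sales", "support", "support", "billing"]

accuracy = sum(g == p for g, p in zip(gold, predictions)) / len(gold)

# Precision for the "support" class: of everything the model called "support",
# how much really was support?
called_support = [g for g, p in zip(gold, predictions) if p == "support"]
precision_support = called_support.count("support") / len(called_support)

print(f"accuracy = {accuracy:.2f}")                       # 0.60
print(f"precision(support) = {precision_support:.2f}")    # 0.67
```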
Human evaluations provide another crucial layer. Have domain experts rate model outputs on helpfulness, accuracy, and style using your standardized rubrics. Don’t skip red-team testing – try adversarial prompts to find failure modes, test edge cases, prompt injections, and requests for harmful content. Set up hallucination checks to verify factual claims, especially for applications where accuracy matters. In production, run A/B tests comparing your new model against the baseline, measuring task completion rates and user satisfaction. Include safety evaluations that test for bias, toxicity, and inappropriate content using both automatic classifiers and human review.
When to Use Which Method
- Prompting, RAG, or tool calling: you have no training data, or you haven't tried the cheap options yet.
- Supervised fine-tuning: you have input-output examples and want the model to mimic your style and decisions.
- Chain of thought: you need the reasoning to be visible, not just the final answer.
- Conversational training: you need multi-turn assistant behavior and you have chat logs.
- RLHF with PPO: quality is subjective, you have human preference ratings, and you can handle the extra complexity.
- DPO/IPO/KTO: same preference data as RLHF, but you want a simpler and more stable pipeline.
- Listwise preference optimization (GRPO): you need to balance several quality criteria and can rank groups of responses.
- Verifiable rewards (RLVR): correctness can be checked automatically with pass/fail tests.
- Constitutional AI: you want the model to critique and revise its own outputs against written principles.
- Parameter-efficient training (LoRA): any of the above, on a budget, without losing general capabilities.
What Each Method Actually Teaches
Supervised Fine-Tuning teaches format mimicry. The model learns “when I see this pattern, produce this pattern.”
Chain of Thought teaches reasoning display. The model learns to show work, but may not actually understand the logic.
Conversational Training teaches dialogue flow. The model learns to maintain context across turns and respond as an AI assistant.
RLHF with PPO teaches human approval optimization through reinforcement learning. The model learns what responses get higher ratings.
Direct Preference Optimization (DPO) teaches preference satisfaction directly. The model learns to favor good responses without needing a separate reward model.
Listwise Preference Optimization (GRPO) teaches multi-objective optimization. The model learns to balance different quality criteria using relative comparisons.
Verifiable Rewards teaches objective task completion. The model learns to pass specific tests.
Constitutional AI teaches self-correction habits. The model learns to critique and revise its own outputs.
Parameter-Efficient teaches domain adaptation. The model learns specialized knowledge without losing general abilities.
The fundamental limitation across all methods: you can teach a model to produce outputs you approve of, but you cannot verify it learned the reasoning process you intended to teach.