The Rise of “Vibe Coding” and Intuitive Software Development


The world of software development is being reshaped by a new, more intuitive approach: “vibe coding.” This method, fueled by advancements in artificial intelligence, is moving the focus from writing syntactically perfect code to expressing the desired outcome in natural language. This deep-dive article explores the essence of vibe coding, spotlights the pioneering tools enabling this shift, and provides a framework for its integration across the entire Software Development Life Cycle (SDLC).


Deconstructing the “Vibe”: What is Vibe Coding?

At its core, vibe coding is a development practice where a human developer collaborates with an AI-powered coding assistant to generate, refine, and debug code. The developer provides high-level prompts, ideas, and feedback—the “vibe”—and the AI translates this into functional software. This approach represents a significant paradigm shift, moving the developer’s role from a meticulous crafter of syntax to a creative director of automated systems. This section unpacks the nuances of this emerging methodology, exploring its origins, its foundational principles, the various forms it takes, and the critical debates surrounding its adoption.

The Genesis of a Term

The phrase “vibe coding” entered the developer lexicon in early 2025, coined by esteemed AI researcher Andrej Karpathy. In a post that quickly resonated throughout the tech community, he described a novel method of software creation: one where you “fully give in to the vibes, embrace exponentials, and forget that the code even exists.” Karpathy wasn’t just describing a more advanced form of AI-assisted autocompletion; he was articulating a more profound surrender of low-level implementation details to the machine. His vision was of a developer operating almost purely on the level of intent, guiding the AI with natural language and immediate feedback in a fluid, conversational loop. This concept rapidly spread from niche forums to major tech publications, capturing the imagination of developers who saw it as a glimpse into the future of their craft, where the barrier between a creative idea and a functional application becomes almost transparent.

The Core Philosophy: Intent Over Implementation

The foundational principle of vibe coding is the prioritization of intent over implementation. It fundamentally shifts the developer’s focus from the “how” to the “what.” Traditionally, building a feature requires a developer to mentally map a desired outcome onto specific programming languages, frameworks, and architectural patterns. Vibe coding abstracts away much of this cognitive load. The developer’s primary task is no longer to write syntactically perfect code, but to clearly and effectively articulate their goal to an AI partner.

Consider building a feature for a car rental application that allows users to see available vehicles. A traditional approach would involve writing explicit code to handle database connections, execute SQL queries, manage state, and render the results.

# Traditional Approach: The "How"
import psycopg2
from datetime import datetime

def get_available_cars(db_params, start_date, end_date):
    """
    Connects to the database and fetches cars not booked within the given date range.
    """
    conn = None
    available_cars = []
    try:
        # Manually handle connection and cursor
        conn = psycopg2.connect(**db_params)
        cur = conn.cursor()

        # Write a specific SQL query
        sql = """
            SELECT c.id, c.make, c.model, c.year, c.daily_rate
            FROM cars c
            WHERE c.id NOT IN (
                SELECT b.car_id
                FROM bookings b
                WHERE (b.start_date, b.end_date) OVERLAPS (%s, %s)
            )
        """

        # Execute and fetch results
        cur.execute(sql, (start_date, end_date))
        rows = cur.fetchall()

        # Format the results
        for row in rows:
            available_cars.append({
                "id": row[0], "make": row[1], "model": row[2],
                "year": row[3], "daily_rate": row[4]
            })

        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

    return available_cars

In contrast, the vibe coding approach focuses purely on the desired outcome. The developer expresses their intent in natural language, and the AI handles the complex implementation.

Developer Prompt: “Using my existing FastAPI setup and a PostgreSQL database with tables cars and bookings, create an API endpoint /available_cars that accepts a start_date and end_date. It should return a JSON list of all cars that are not booked during that period.”

The AI then generates the necessary code, translating the high-level “vibe” into a concrete, functional implementation. The developer is liberated from recalling specific library functions, SQL syntax, and error-handling boilerplate, allowing them to remain focused on the larger architectural and user experience goals.
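
What that generated endpoint might look like is sketched below. This is only one plausible shape of the output, written with asyncpg in the style of the article's later examples; the connection settings are placeholders, and the real result would depend on the project's existing setup.

# A plausible AI-generated endpoint for the prompt above (illustrative sketch)
from datetime import date
from fastapi import FastAPI
import asyncpg

app = FastAPI()

@app.get("/available_cars")
async def available_cars(start_date: date, end_date: date):
    # Placeholder connection settings; a real project would use its own config
    conn = await asyncpg.connect(user="user", password="password",
                                 database="rentals", host="127.0.0.1")
    try:
        rows = await conn.fetch(
            """
            SELECT c.id, c.make, c.model, c.year, c.daily_rate
            FROM cars c
            WHERE c.id NOT IN (
                SELECT b.car_id FROM bookings b
                WHERE (b.start_date, b.end_date) OVERLAPS ($1::date, $2::date)
            )
            """,
            start_date, end_date,
        )
        # Each asyncpg Record converts cleanly to a dict for the JSON response
        return [dict(row) for row in rows]
    finally:
        await conn.close()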

The Spectrum of Vibe Coding

Vibe coding is not a single, monolithic practice; it exists on a spectrum of human-AI interaction, ranging from subtle assistance to full-blown conversational development. The level of engagement often depends on the developer’s needs, the complexity of the task, and the capabilities of the chosen tool.

At the most basic level, vibe coding manifests as intelligent code completion. Here, the AI acts as a silent partner, anticipating the developer’s next move. While writing a function to finalize a booking in our car rental app, the developer might only need to type the function signature, and the AI will suggest the entire implementation body.

# Low-end Spectrum: AI-powered autocompletion

# Developer writes this line:
async def finalize_booking(booking_id: int, db: Session):
    # AI suggests the following block of code:
    booking = db.query(Booking).filter(Booking.id == booking_id).first()
    if not booking:
        raise HTTPException(status_code=404, detail="Booking not found")

    booking.status = "confirmed"
    db.commit()

    # Send a confirmation email (placeholder)
    send_confirmation_email(booking.customer_email, booking_id)

    return {"message": "Booking confirmed successfully"}

Further along the spectrum is component generation from comments or prompts. In this mode, the developer provides a concise, natural language description of a desired piece of functionality, and the AI generates the complete code block. This is especially powerful for creating UI components.

Developer Prompt in a React file: // Create a React component to display a car's details. It should take a 'car' object as a prop, which includes make, model, year, daily_rate, and an image_url. Display the information in a card format with a "Book Now" button.

The AI would then generate the corresponding JSX and CSS, instantly creating a reusable UI element without the developer needing to write a single line of component code manually.

At the most advanced end of the spectrum lies conversational development. This is an iterative, dialogue-driven process where the developer and AI collaborate to build and refine a feature.

Developer: “Create a Python function to calculate the total price for a car rental, given a car ID and a start and end date. The price should be the daily rate multiplied by the number of days. Also, add a 10% discount if the rental period is 7 days or longer.”

AI: (Generates the initial function)

Developer: “This looks good, but it doesn’t account for weekends. Can you modify it to increase the daily rate by 20% for any days that fall on a Saturday or Sunday?”

In this back-and-forth, the AI is not just a code generator but a creative partner. The developer guides the process at a high level, progressively adding complexity and refining the logic through conversation, embodying the purest form of vibe coding.
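
One plausible shape of the function after the second round of refinement is sketched below. It is illustrative only; the get_car lookup is a hypothetical data-access helper, not part of any real codebase.

# A possible result of the conversational refinement above (sketch)
from datetime import date, timedelta

def calculate_rental_price(car_id: int, start_date: date, end_date: date) -> float:
    car = get_car(car_id)  # hypothetical helper returning an object with daily_rate
    total = 0.0
    day = start_date
    while day < end_date:
        rate = car.daily_rate
        if day.weekday() >= 5:  # Saturday (5) or Sunday (6)
            rate *= 1.20        # 20% weekend uplift from the second prompt
        total += rate
        day += timedelta(days=1)

    if (end_date - start_date).days >= 7:
        total *= 0.90  # 10% discount for rentals of a week or longer

    return round(total, 2)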

The Great Debate: Pros and Cons

The rapid ascent of vibe coding has sparked a vibrant and necessary debate within the engineering community. Its advantages in speed and accessibility are profound, but they are counterbalanced by significant concerns regarding code quality, security, and the potential erosion of core development skills.

The most celebrated advantage is the dramatic increase in development speed. A task that might have taken a developer hours of manual coding, such as creating a search and filtering interface for the car rental app, can be prototyped in minutes. A simple prompt like, “Build a UI with filters for car type, price range, and availability dates that updates the list of cars in real-time,” can produce a working prototype almost instantly. This velocity empowers developers to experiment and iterate far more freely. Furthermore, it enhances accessibility, allowing individuals with strong domain knowledge but limited programming expertise, such as a product manager or a UI/UX designer, to build functional mock-ups and contribute more directly to the development process.

However, these benefits come with serious disadvantages. A primary concern is code quality and maintainability. AI-generated code can often be functional but suboptimal, inefficient, or difficult for a human to read and maintain. For example, when asked to retrieve a user’s booking history, an AI might generate a simple but inefficient database query.

-- AI-Generated Query (Potentially Inefficient)
-- This query might be slow on a large 'bookings' table if 'customer_id' is not indexed.
-- A human developer would ideally ensure such indexes exist.
SELECT * FROM bookings WHERE customer_id = 123;

An even more critical pitfall lies in security vulnerabilities. AI models are trained on vast amounts of public code, which includes both secure and insecure patterns. Without careful oversight, an AI can easily generate code with classic vulnerabilities. A prompt to create a function for retrieving car details might produce code susceptible to SQL injection if it doesn’t use parameterised queries.

# AI-Generated Code with a Security Flaw
def get_car_by_id(car_id: str):
    # WARNING: This code is vulnerable to SQL Injection.
    # It directly formats the input into the SQL string.
    query = f"SELECT * FROM cars WHERE id = {car_id}"
    # ... database execution logic ...
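
The remedy is a parameterised query, where the database driver handles escaping. A minimal sketch, assuming a psycopg2 connection object named conn:

# Safer version: the input is passed as a parameter, never formatted into the SQL
def get_car_by_id(car_id: str):
    query = "SELECT * FROM cars WHERE id = %s"
    with conn.cursor() as cur:               # conn is an assumed psycopg2 connection
        cur.execute(query, (car_id,))        # the driver escapes the value safely
        return cur.fetchone()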

This leads to the ultimate concern: the risk of over-reliance. If a developer uses vibe coding to generate complex, mission-critical systems—such as the payment processing logic for the car rental app—without fully understanding the underlying implementation, they become incapable of properly debugging, securing, or extending that system. The convenience of generating code with a simple “vibe” can obscure a dangerous lack of true comprehension, creating a fragile system that is a mystery to the very person responsible for it.

The Vibe Coder’s Toolkit

A new ecosystem of tools has emerged to facilitate vibe coding, each offering a unique approach to translating human intent into functional software. This section provides a comprehensive overview of the most popular platforms, detailing their distinct features, target audiences, and ideal use cases within the context of building a modern car rental application.

The All-in-One Platforms

All-in-one platforms are designed to take a developer from a simple idea to a fully deployed application within a single, cohesive environment. They handle the frontend, backend, and database setup, allowing the user to focus almost entirely on the application’s features and logic.

Lovable is renowned for its intuitive, guided approach to building full-stack web applications. It’s particularly well-suited for developers and entrepreneurs who want to quickly scaffold a project without getting bogged down in configuration. Lovable acts as an AI co-engineer, asking clarifying questions to ensure the generated application meets the user’s vision. For our car rental application, a developer could start with a high-level prompt that describes a complete user journey.

Lovable Prompt: “Create a car rental app using Next.js and Supabase. I need user authentication with email/password. After signing up, users should have a profile page where they can upload a picture of their driver’s license. The main page should show a list of available cars from the database.”

Lovable would then generate the foundational code, set up the database schema for users and cars, and create the necessary pages and components, effectively building the application’s skeleton in minutes.

Bolt excels at rapid prototyping and seamless integration with popular third-party services. It’s a versatile tool for developers who need to build and validate a minimum viable product (MVP) at lightning speed. Bolt’s strength lies in its ability to quickly wire up external APIs for essential services like payments or backend infrastructure. In the context of our car rental app, a developer could use Bolt to quickly establish the core business logic.

Bolt Prompt: “Generate a full-stack application with a React frontend and a Node.js backend. Create a ‘cars’ table in a Supabase database with columns for make, model, year, and daily_rate. Integrate Stripe for payments, creating an API endpoint that generates a checkout session based on a car’s daily rate and the number of rental days.”

Bolt would not only generate the code but also configure the webhooks and API clients needed to communicate with both Supabase and Stripe, making the application functional far more quickly than a manual setup would allow.
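
The prompt asks for a Node.js backend; expressed in Python for consistency with this article's other examples, the payment piece of such an app might look roughly like the sketch below. The FastAPI app and the get_daily_rate lookup are assumptions, and the Stripe key is a placeholder.

# Hedged sketch of a checkout-session endpoint using the Stripe Python SDK
import stripe
from fastapi import FastAPI

stripe.api_key = "sk_test_..."  # placeholder secret key
app = FastAPI()

@app.post("/create-checkout-session")
async def create_checkout_session(car_id: int, rental_days: int):
    daily_rate = await get_daily_rate(car_id)  # hypothetical lookup against the cars table
    session = stripe.checkout.Session.create(
        mode="payment",
        line_items=[{
            "price_data": {
                "currency": "usd",
                "product_data": {"name": f"Car rental #{car_id}"},
                "unit_amount": int(daily_rate * 100),  # Stripe amounts are in cents
            },
            "quantity": rental_days,
        }],
        success_url="https://example.com/success",
        cancel_url="https://example.com/cancel",
    )
    return {"checkout_url": session.url}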

Replit offers a powerful, browser-based Integrated Development Environment (IDE) that makes it incredibly easy to start coding, collaborate with others, and deploy applications without any local setup. Its AI assistant, Ghostwriter, is deeply integrated, offering features from code completion to full-fledged generation. Replit is ideal for both beginners and experienced developers looking for a flexible and collaborative cloud environment. For our car rental app, a team could use Replit to work on a specific backend feature simultaneously.

# In Replit, a developer might start with a comment for the AI
#
# Create a FastAPI endpoint at /search/cars
# It should accept query parameters: 'make', 'model', and 'max_price'.
# Connect to the Postgres database and return cars that match the criteria.
# Only show cars where the 'is_available' flag is true.

# Replit's AI would then generate the following code directly in the editor:

from fastapi import FastAPI
from pydantic import BaseModel
import asyncpg

app = FastAPI()

# --- AI-Generated Code Starts ---

class Car(BaseModel):
    id: int
    make: str
    model: str
    year: int
    daily_rate: float
    is_available: bool

@app.get("/search/cars", response_model=list[Car])
async def search_cars(make: str = None, model: str = None, max_price: float = None):
    conn = await asyncpg.connect(user='user', password='password', database='rentals', host='127.0.0.1')

    query = "SELECT * FROM cars WHERE is_available = TRUE"
    params = []

    if make:
        params.append(f"%{make}%")
        query += f" AND make ILIKE ${len(params)}"
    if model:
        params.append(f"%{model}%")
        query += f" AND model ILIKE ${len(params)}"
    if max_price:
        params.append(max_price)
        query += f" AND daily_rate <= ${len(params)}"

    results = await conn.fetch(query, *params)
    await conn.close()

    return [Car(**dict(result)) for result in results]

# --- AI-Generated Code Ends ---

The AI-Powered IDEs and Editors

These tools integrate AI directly into the developer’s primary workspace—the code editor. They are less about generating entire applications from scratch and more about augmenting the moment-to-moment coding experience, acting as an intelligent pair programmer.

Cursor is an “AI-native” code editor, forked from VS Code, that is built from the ground up for vibe coding. It allows a developer to highlight a block of code and provide natural language instructions to refactor or debug it. Its deep integration with the project’s entire codebase allows it to provide highly contextual suggestions. This is perfect for working with existing or complex code. Imagine our car rental app has a convoluted pricing function; a developer could use Cursor to simplify it.

Developer highlights the messy function and prompts Cursor: “Refactor this code to be more readable. Separate the base price calculation from the discount and tax logic. Add comments explaining each step.”

Cursor would then rewrite the code in place, applying best practices for clarity and structure without the developer having to manually untangle the logic.

GitHub Copilot is the most widely adopted AI pair programmer, living as an extension inside popular editors like VS Code. It excels at providing real-time code suggestions and autocompletions based on the current file’s context and the developer’s comments. It shines at reducing boilerplate and speeding up the implementation of well-defined functions. For our car rental app, a developer could use Copilot to swiftly create a utility function.

// In VS Code, a developer writes a comment and the function signature.
// A utility function to format a date range for display.
// Example: "July 7, 2025 - July 14, 2025"
function formatDateRange(startDate, endDate) {
  // GitHub Copilot will automatically suggest the following implementation:

  const options = { year: 'numeric', month: 'long', day: 'numeric' };
  const start = new Date(startDate).toLocaleDateString('en-US', options);
  const end = new Date(endDate).toLocaleDateString('en-US', options);
  return `${start} - ${end}`;
}

The Specialised Tools

Specialised tools focus on excelling at one specific part of the development workflow, often the bridge between design and front-end development. They are designed to be integrated into a larger toolchain.

v0.dev, by Vercel, is a generative UI platform focused exclusively on creating web components. Using natural language prompts, developers can describe an interface, and v0 generates the corresponding React code using Tailwind CSS and shadcn/ui. It’s ideal for rapidly building the visual elements of an application. For our car rental project, we could use it to generate a visually appealing card to display a single vehicle.

v0.dev Prompt: “Create a responsive card for a rental car. It should have an image at the top. Below the image, display the car’s make and model in a large font. Underneath that, show the model year. At the bottom, display the daily rental price on the left, and a primary-colored ‘Book Now’ button on the right.”

v0.dev would provide several visual options along with the production-ready JSX code, allowing the developer to simply copy and paste a professionally designed component directly into the application.

Anima serves as a powerful bridge between design and development, helping teams convert high-fidelity designs from tools like Figma directly into clean, functional code. It’s perfect for teams where design fidelity is paramount, ensuring that the final product is a pixel-perfect match to the original design. A designer for the car rental app could complete the entire search results page layout in Figma, including responsive breakpoints. Using the Anima plugin, they could then export that design directly into React or HTML/CSS code that developers can immediately integrate and wire up to the backend data, drastically reducing the time spent translating visual mockups into code.

The Conversational AI Assistants

General-purpose large language models (LLMs) have become indispensable tools for developers. While not specialised for coding, their broad knowledge base makes them excellent partners for brainstorming, learning new concepts, and debugging tricky problems.

ChatGPT and Claude can be used as versatile, conversational partners throughout the development process. A developer can use them to think through high-level architectural decisions, generate code snippets for specific algorithms, or get help understanding a cryptic error message. For our car rental application, a developer could use an AI assistant to plan the database structure before writing any code.

Developer’s Conversational Prompt: “I’m building a car rental application. I need a database schema to store cars, customers, and bookings. A customer can have multiple bookings, and each booking is for one car. Bookings need a start date, end date, total price, and a status (e.g., ‘confirmed’, ‘completed’, ‘cancelled’). Can you give me the SQL CREATE TABLE statements for this using PostgreSQL?”

The AI would provide the complete SQL schema, acting as an expert consultant and saving the developer the time of designing it from scratch. This brainstorming and problem-solving capability makes these assistants a crucial part of the modern vibe coder’s toolkit.

Vibe Coding Across the Software Development Life Cycle (SDLC): A Practical Framework

Vibe coding is not merely a tool for isolated, rapid prototyping; its principles and the platforms that power it can be strategically integrated into every phase of the traditional Software Development Life Cycle (SDLC). By weaving AI-assisted techniques throughout the entire process, from initial concept to final deployment and maintenance, teams can unlock significant gains in efficiency, creativity, and collaboration. This section outlines a practical framework for applying vibe coding across the SDLC, transforming it from a linear, often cumbersome process into a more fluid and intelligent workflow.

Phase 1: Planning and Requirements Gathering

The initial phase of any project, where ideas are nebulous and requirements are still taking shape, is an area where vibe coding can provide immense value. It bridges the gap between abstract concepts and tangible artifacts, facilitating clearer communication and a more robust planning process.

One of the most powerful applications in this phase is the ability to translate a concept to code, creating interactive prototypes directly from user stories or high-level ideas. Instead of relying on static wireframes or lengthy specification documents, a product manager or business analyst can use an all-in-one platform to generate a functional, clickable prototype. For our car rental application, a simple user story can be transformed into a live demo.

Prompt for an all-in-one platform: “Generate a three-page web application. The first page is a landing page with a search bar for location and dates. The second page shows a grid of available cars based on the search. The third page is a detailed view of a single car with a booking form. Don’t worry about the database connection yet; use mock data for the cars.”

This instantly creates a tangible artifact that stakeholders can interact with, providing concrete feedback far earlier in the process than traditional methods allow.

This phase also benefits greatly from AI-assisted brainstorming. Using a conversational AI like ChatGPT or Claude, the project team can explore different features, user flows, and technical approaches without committing to a specific path. This allows for a more expansive and creative exploration of possibilities.

Brainstorming Prompt: “We’re designing the user registration flow for a car rental app. The goal is to minimize friction. Can you outline three different user flow options? One standard email/password flow, one using social logins like Google, and a third ‘magic link’ flow that doesn’t require a password. For each, describe the steps the user would take and the potential pros and cons regarding security and user experience.”

This approach allows the team to evaluate complex trade-offs and make more informed decisions before any significant design or development work has begun, setting a solid foundation for the rest of the project.

Phase 2: Design and Prototyping

During the design phase, vibe coding accelerates the transition from visual concepts to interactive components, blurring the lines between design and front-end development. Specialized tools empower designers and developers to create and iterate on the user interface with unprecedented speed.

This is where rapid UI/UX mockups become a reality. A designer can use a tool like v0.dev to generate production-ready front-end code from a simple natural language description, bypassing the need to manually code a component from a static design file. This dramatically accelerates the design-to-development handoff. For the car rental application’s search results page, a developer could generate a filter component with a single prompt.

v0.dev Prompt: “Create a responsive sidebar filter component for a car rental website. It should include a price range slider, a multi-select checklist for ‘Car Type’ (e.g., Sedan, SUV, Truck), and a set of radio buttons for ‘Transmission’ (Automatic, Manual). Add a clear ‘Apply Filters’ button at the bottom.”

The tool would generate the React component, complete with appropriate state management hooks and styled with Tailwind CSS, ready to be integrated into the application.

// AI-Generated React Component for a Filter Sidebar
import { Slider } from "@/components/ui/slider";
import { Checkbox } from "@/components/ui/checkbox";
import { RadioGroup, RadioGroupItem } from "@/components/ui/radio-group";
import { Label } from "@/components/ui/label";
import { Button } from "@/components/ui/button";

export default function CarFilterSidebar() {
  return (
    <aside className="w-full md:w-64 p-4 border-r bg-gray-50">
      <h3 className="text-lg font-semibold mb-4">Filters</h3>
      <div className="space-y-6">
        <div>
          <Label htmlFor="price-range">Price Range</Label>
          <Slider id="price-range" defaultValue={[50]} max={500} step={10} className="mt-2" />
          <div className="flex justify-between text-sm text-gray-500 mt-1">
            <span>$0</span>
            <span>$500</span>
          </div>
        </div>
        <div>
          <h4 className="font-medium mb-2">Car Type</h4>
          <div className="space-y-2">
            <div className="flex items-center space-x-2">
              <Checkbox id="sedan" />
              <Label htmlFor="sedan">Sedan</Label>
            </div>
            <div className="flex items-center space-x-2">
              <Checkbox id="suv" />
              <Label htmlFor="suv">SUV</Label>
            </div>
            <div className="flex items-center space-x-2">
              <Checkbox id="truck" />
              <Label htmlFor="truck">Truck</Label>
            </div>
          </div>
        </div>
        <div>
          <h4 className="font-medium mb-2">Transmission</h4>
          <RadioGroup defaultValue="automatic">
            <div className="flex items-center space-x-2">
              <RadioGroupItem value="automatic" id="automatic" />
              <Label htmlFor="automatic">Automatic</Label>
            </div>
            <div className="flex items-center space-x-2">
              <RadioGroupItem value="manual" id="manual" />
              <Label htmlFor="manual">Manual</Label>
            </div>
          </RadioGroup>
        </div>
        <Button className="w-full">Apply Filters</Button>
      </div>
    </aside>
  );
}

This capability enables iterative design with instant feedback. Because functional components can be created so quickly, designers and developers can test different layouts, interactions, and styles in a live environment, making adjustments on the fly based on real-time feedback rather than waiting for the entire development cycle to complete.

Phase 3: Development and Implementation

In the core development phase, AI-powered tools act as a force multiplier for engineers, automating repetitive tasks, tackling complex logic, and ensuring code quality.

Here, the AI truly becomes a pair programmer. Using a tool like GitHub Copilot, a developer can generate boilerplate code, such as API endpoints or data models, with minimal effort. More impressively, it can help implement complex business logic. For our car rental app, calculating the final price of a rental involves several factors. A developer can outline the logic in a comment, and the AI will generate the implementation.

# Developer writes a detailed comment outlining the business logic
#
# Function to calculate the total rental price.
# 1. Calculate the number of rental days between start_date and end_date.
# 2. Get the car's base daily_rate from the database.
# 3. Apply a 15% surcharge for drivers under 25.
# 4. Add a flat insurance fee of $15 per day.
# 5. Apply a 10% discount if the total rental duration is 7 days or more.
# 6. Add an 8% sales tax to the final amount.

# GitHub Copilot suggests the full function implementation
def calculate_total_price(car, user, start_date, end_date):
    from datetime import date

    # 1. Calculate rental days
    rental_days = (end_date - start_date).days
    if rental_days <= 0:
        return 0.0

    # 2. Get base price
    base_price = rental_days * car.daily_rate

    # 3. Apply underage surcharge
    surcharge = 0.0
    user_age = (date.today() - user.date_of_birth).days / 365.25
    if user_age < 25:
        surcharge = base_price * 0.15

    # 4. Add insurance fee
    insurance_cost = rental_days * 15.0

    subtotal = base_price + surcharge + insurance_cost

    # 5. Apply long-term discount
    discount = 0.0
    if rental_days >= 7:
        discount = subtotal * 0.10

    final_subtotal = subtotal - discount

    # 6. Add sales tax
    tax = final_subtotal * 0.08
    total_price = final_subtotal + tax

    return round(total_price, 2)

Beyond initial creation, vibe coding is transformative for code maintenance through “vibe-driven” refactoring. Using an AI-native editor like Cursor, a developer can highlight a section of legacy or poorly written code and ask the AI to improve it based on specific criteria.

Refactoring Prompt: “This database query uses multiple joins and is becoming slow. Refactor it to use a Common Table Expression (CTE) for clarity and potentially better performance. Also, ensure all selected column names are explicit to avoid ambiguity.”

The AI will analyze the selected code and rewrite it according to the developer’s instructions, improving its structure, readability, and performance without requiring a manual, line-by-line rewrite.

Phase 4: Testing and Quality Assurance

Ensuring the reliability of an application is a critical, though often tedious, part of the SDLC. Vibe coding can significantly reduce the manual effort involved in testing and debugging by automating test creation and providing intelligent diagnostic assistance.

The practice of automated test generation is a prime example. A developer can prompt an AI assistant to write comprehensive tests for a specific function, ensuring robust code coverage. For the pricing function in our car rental app, we can ask for a suite of unit tests using a framework like pytest.

Testing Prompt: “Write a set of pytest unit tests for the calculate_total_price function. Include tests for a standard rental, a rental with an underage driver surcharge, a rental long enough to receive a discount, and an edge case with a single-day rental.”

The AI would then generate the corresponding test file, complete with mock objects for the car and user data.

# AI-Generated Pytest file for the pricing function
import pytest
from datetime import date, timedelta
from your_app.pricing import calculate_total_price

# Mock objects for testing
class MockCar:
    def __init__(self, daily_rate):
        self.daily_rate = daily_rate

class MockUser:
    def __init__(self, dob):
        self.date_of_birth = dob

def test_standard_rental():
    car = MockCar(daily_rate=50)
    user = MockUser(dob=date(1990, 1, 1))  # Over 25
    start = date(2025, 8, 1)
    end = date(2025, 8, 6)  # 5 days
    # Expected: (5 * 50) + (5 * 15) = 325. 325 * 1.08 tax = 351.00
    assert calculate_total_price(car, user, start, end) == 351.00

def test_underage_driver_surcharge():
    car = MockCar(daily_rate=50)
    user = MockUser(dob=date(2002, 1, 1))  # Under 25
    start = date(2025, 8, 1)
    end = date(2025, 8, 6)  # 5 days
    # Expected: (5 * 50) = 250. Surcharge = 250 * 0.15 = 37.5. Insurance = 75.
    # Subtotal = 250 + 37.5 + 75 = 362.5. Tax = 362.5 * 0.08 = 29. Total = 391.50
    assert calculate_total_price(car, user, start, end) == 391.50

def test_long_term_discount():
    car = MockCar(daily_rate=50)
    user = MockUser(dob=date(1990, 1, 1))  # Over 25
    start = date(2025, 8, 1)
    end = date(2025, 8, 9)  # 8 days
    # Expected: (8 * 50) + (8 * 15) = 520. Discount = 520 * 0.1 = 52.
    # Final Sub = 520 - 52 = 468. Tax = 468 * 0.08 = 37.44. Total = 505.44
    assert calculate_total_price(car, user, start, end) == 505.44

def test_single_day_rental():
    car = MockCar(daily_rate=50)
    user = MockUser(dob=date(1990, 1, 1))  # Over 25
    start = date(2025, 8, 1)
    end = date(2025, 8, 2)  # 1 day (edge case requested in the prompt)
    # Expected: (1 * 50) + (1 * 15) = 65. 65 * 1.08 tax = 70.20
    assert calculate_total_price(car, user, start, end) == 70.20

Furthermore, AI-powered debugging transforms a frustrating process into a collaborative one. When faced with a bug, instead of spending hours manually tracing the code, a developer can describe the problem in natural language to the AI.

Debugging Prompt: “I’m getting a TypeError: unsupported operand type(s) for -: 'datetime.date' and 'NoneType' in my calculate_total_price function. The error happens on the line rental_days = (end_date - start_date).days. What could be causing this and how can I fix it?”

The AI would analyze the context and explain that either start_date or end_date is likely None when the function is called. It would then suggest adding validation checks at the beginning of the function to handle these null values gracefully, providing the exact code to fix the issue.
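
The kind of guard it might propose is simple to sketch: validate the dates before any arithmetic runs.

# Sketch of the suggested fix: guard against missing or inverted dates up front
def calculate_total_price(car, user, start_date, end_date):
    if start_date is None or end_date is None:
        raise ValueError("start_date and end_date must both be provided")
    if end_date <= start_date:
        raise ValueError("end_date must be after start_date")
    # ... existing pricing logic continues unchanged ...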

Phase 5: Deployment and Maintenance

The final phase of the SDLC, which involves deploying the application and ensuring its ongoing health, can also be streamlined with vibe coding techniques. AI can assist in generating the complex configuration files needed for modern deployment pipelines and can help make sense of production data.

Automated deployment scripts are a key area of improvement. Creating configuration files for tools like Docker or platforms like Vercel and AWS can be complex and error-prone. An AI assistant can generate these files based on a high-level description of the application’s stack.

Deployment Prompt: “Create a multi-stage Dockerfile for my FastAPI car rental application. The first stage should install Python dependencies from requirements.txt. The final stage should be a lightweight image that copies the application code and runs it using Uvicorn on port 8000.”

The AI would generate a complete, optimized Dockerfile, saving the developer from having to memorize the specific syntax and best practices for containerization.

# AI-Generated Dockerfile

# Stage 1: Build stage with dependencies
FROM python:3.9-slim as builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: Final lightweight production stage
FROM python:3.9-slim

WORKDIR /app

# Copy installed packages from the builder stage
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages

# Copy application code
COPY . .

# Expose the port the app runs on
EXPOSE 8000

# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

After deployment, intelligent monitoring and alerting becomes possible. While dedicated observability platforms are essential, a conversational AI can be an invaluable tool for interpreting the vast amounts of data they produce. A developer on call who receives an alert can paste a series of cryptic log messages into the AI.

Maintenance Prompt: “I’m seeing a spike in 502 Bad Gateway errors in our production logs for the car rental app. The logs show multiple entries of (111: Connection refused) while connecting to upstream. This seems to be happening when the /api/payment/confirm endpoint is called. What is the likely cause of this issue?”

The AI could analyze the logs and explain that the web server is unable to connect to the backend payment processing service. It might suggest that the payment service has crashed or is overwhelmed, guiding the developer to check the status of that specific microservice, thereby dramatically reducing the time to diagnose and resolve the production issue.

The Future of Vibe: The Evolving Landscape of Software Creation

As artificial intelligence models become more sophisticated and their integration into development tools deepens, the potential of vibe coding will continue to expand. We are standing at the threshold of a new era in software creation, one that will redefine the roles of developers, broaden access to technology, and demand a renewed focus on responsibility and ethics. This concluding section looks ahead at the future of this paradigm shift and the evolving landscape it promises to shape.

The Rise of the “AI-Augmented” Developer

The ascent of vibe coding does not signal the end of the software developer; rather, it heralds the rise of the AI-augmented developer. The focus of the role is shifting away from the meticulous, line-by-line transcription of logic and toward a higher-level function of architecture, creative direction, and system orchestration. In this new reality, a developer’s value is less about their speed at typing syntactically correct code and more about their ability to translate a complex business problem into a series of well-crafted prompts and to critically evaluate the AI’s output.

Think of building our car rental application. The AI-augmented developer isn’t just concerned with generating the code for a single booking form. Instead, they are architecting the entire customer journey. They ask the high-level questions: How do we design a scalable database schema that can handle peak demand? What is the most secure and frictionless way to handle user authentication and payments? How do we build a front-end that is not only functional but also intuitive and delightful to use? They use their deep domain knowledge to guide the AI, prompting it to build the individual components while they focus on ensuring the pieces fit together into a coherent, secure, and robust system. The developer becomes less of a bricklayer and more of an architect, armed with an infinitely fast and knowledgeable construction crew.

The Democratization of Development

Perhaps one of the most profound impacts of vibe coding is its potential for the democratization of development. For decades, software creation has been the exclusive domain of those with specialized and often expensive training in computer science. Vibe coding is rapidly dismantling this barrier, enabling a new wave of “citizen developers”—entrepreneurs, designers, scientists, and small business owners—to build the tools they need without writing a single line of traditional code.

Imagine the owner of a small, independent car rental business. Previously, creating a custom booking and inventory management system would require a significant capital investment to hire a team of software engineers. Today, that same owner can use an all-in-one vibe coding platform to build a functional application tailored to their specific needs. By describing their business logic in plain language—”I need a system to track my five cars, show their availability on a calendar, and let customers book and pay online with a credit card”—they can generate a working product. This empowerment allows for an explosion of niche innovation, enabling subject-matter experts to directly solve their own problems and bring their ideas to life at a speed that was previously unimaginable.

The Ethical Considerations and the Road Ahead

This powerful new landscape is not without its challenges and requires a deep commitment to ethical considerations and responsible development. As we hand over more of the implementation details to AI, we must remain vigilant and intentional in our oversight. The road ahead demands a framework built on three core pillars.

First is confronting the risk of inherent bias. AI models learn from vast datasets of existing code and text from the internet, which inevitably contain the biases of their human creators. An AI, if not carefully guided, could inadvertently generate code for our car rental app that creates discriminatory pricing models or has accessibility flaws that exclude users with disabilities. The AI-augmented developer must serve as the ethical gatekeeper, actively auditing AI outputs for fairness and inclusivity.

Second is the critical need to maintain code quality and security standards. The convenience of vibe coding can lead to a dangerous complacency, where developers blindly trust AI-generated code. As we’ve seen, AI can produce code that is inefficient, difficult to maintain, or riddled with security vulnerabilities like SQL injection. The future of software engineering will require an even stronger emphasis on code review, security audits, and architectural validation. The “vibe” can guide creation, but it cannot replace the rigorous engineering discipline required to build safe and reliable systems.

Finally, this all points to the evolving and essential role of human oversight. The future is not a fully autonomous system where humans are obsolete; it’s a collaborative one where human judgment is more critical than ever. The most effective development teams will be those who master this human-AI partnership. The road ahead involves creating new best practices for managing, testing, and documenting AI-generated codebases. It requires training developers not just in programming languages, but in the art of prompt engineering, architectural thinking, and critical analysis. Vibe coding is not an autopilot for software development; it is a powerful new instrument, and its ultimate potential will only be realised in the hands of a skilled human operator who knows how to play it with intention, wisdom, and responsibility.

Too many llamas? Running AI locally


In the rapidly evolving landscape of artificial intelligence, understanding the distinctions between various tools and models is crucial for developers and researchers. This blog post aims to elucidate the differences between the LLaMA model, llama.cpp, and Ollama. While the LLaMA model serves as the foundational large language model developed by Meta, llama.cpp is an open-source C++ implementation designed to run LLaMA efficiently on local hardware. Building upon llama.cpp, Ollama offers a user-friendly interface with additional optimizations and features. By exploring these distinctions, readers will gain insights into selecting the appropriate tool for their AI applications.


What is the LLaMA Model?

LLaMA (Large Language Model Meta AI) is a series of open-weight large language models (LLMs) developed by Meta (formerly Facebook AI). Unlike proprietary models like GPT-4, LLaMA models are released under a research-friendly license, allowing developers and researchers to experiment with state-of-the-art AI while maintaining control over data and privacy.

LLaMA models are designed to be smaller and more efficient than competing models while maintaining strong performance in natural language understanding, text generation, and reasoning.

LLaMA is a Transformer-based AI model that processes and generates human-like text. It is similar to OpenAI’s GPT models but optimized for efficiency. Meta’s goal with LLaMA is to provide smaller yet powerful language models that can run on consumer hardware.

Unlike GPT-4, which is closed-source, LLaMA models are available to researchers and developers, enabling:

  • Customisation & fine-tuning for specific applications
  • Running models locally instead of relying on cloud APIs
  • Improved privacy since queries don’t need to be sent to external servers

LLaMA models are powerful, but they are not the only open-source LLMs available. Let’s compare them with other major models:

| Feature | LLaMA 2 | GPT-4 (OpenAI) | Mistral 7B | Mixtral (MoE) |
| --- | --- | --- | --- | --- |
| Size | 7B, 13B, 70B | Proprietary | 7B | 8x7B (~12.9B active per token) |
| Open-Source? | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Performance | GPT-3.5 level | 🔥 Best | Better than LLaMA 2-7B | Outperforms LLaMA 2-13B |
| Fine-Tunable? | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Runs on CPU? | ✅ Yes (with llama.cpp) | ❌ No | ✅ Yes | ⚠️ Possible with llama.cpp, but needs plenty of RAM |
| Best For | Chatbots, research, AI apps | General AI, commercial APIs | Fast reasoning, efficiency | Scalable AI applications |

LLaMA models are versatile and can be used for various applications:

  • AI Chatbots
  • Code Generation
  • Scientific Research
  • Private AI Applications

LLaMA is one of the most influential open-weight LLMs, offering a balance between power, efficiency, and accessibility. Unlike closed-source models like GPT-4, LLaMA allows developers to run AI locally, fine-tune models, and ensure data privacy.

AI Model Quantisation: Making AI Models Smaller and Faster

AI models, especially deep learning models like large language models (LLMs) and speech recognition systems, are huge. They require massive amounts of computational power and memory to run efficiently. This is where model quantisation comes in—a technique that reduces the size of AI models and speeds up inference while keeping accuracy as high as possible.

Quantisation is the process of converting a model’s parameters (weights and activations) from high-precision floating-point numbers (e.g., 32-bit float, FP32) into lower-precision numbers (e.g., 8-bit integer, INT8). This reduces the memory footprint and improves computational efficiency, allowing AI models to run on less powerful hardware like CPUs, edge devices, and mobile phones.

When an AI model is trained, it typically uses 32-bit floating-point (FP32) numbers to represent its weights and activations. These provide high precision but require a lot of memory and processing power. Quantisation converts these high-precision numbers into lower-bit representations, such as:

  • FP32 → FP16 (Half-precision floating-point)
  • FP32 → INT8 (8-bit integer)
  • FP32 → INT4 / INT2 (Ultra-low precision)

The lower the bit-width, the smaller and faster the model becomes, but at the cost of some accuracy. Assume we have a weight value stored as a 32-bit float:

Weight (FP32) = 0.87654321

If we convert this to 8-bit integer (INT8):

Weight (INT8) ≈ 87 (scaled down)

Even though we lose some precision, the model remains usable while consuming much less memory and processing power.
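
For intuition, here is a minimal NumPy sketch of symmetric INT8 quantisation: the largest weight magnitude is mapped to 127, and every other weight is scaled and rounded to fit the same range.

# Minimal sketch of symmetric INT8 quantisation of a small weight tensor
import numpy as np

weights = np.array([0.87654321, -0.5, 0.12, 1.9], dtype=np.float32)

scale = np.abs(weights).max() / 127.0                   # map the largest magnitude to 127
q_weights = np.round(weights / scale).astype(np.int8)   # quantise: FP32 -> INT8
deq_weights = q_weights.astype(np.float32) * scale      # dequantise for comparison

print(q_weights)              # approximately [ 59 -33   8 127 ]
print(deq_weights - weights)  # the small rounding error introduced by quantisation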

There are several types of quantisation:

  • Post-Training Quantisation – PTQ (Applied after training, converts model weights and activations to lower precision, faster but may cause some accuracy loss)
  • Quantisation-Aware Training – QAT (The model is trained while simulating lower precision, maintains higher accuracy compared to PTQ, more computationally expensive during training, used when accuracy is critical e.g., in medical AI models)
  • Dynamic Quantisation (Only weights are stored in lower precision; activations are quantised on the fly at runtime, making it flexible and well suited to NLP inference — see the PyTorch sketch after this list)
  • Weight-Only Quantisation (Only model weights are quantised, not activations, used in GGUF/GGML models to run LLMs efficiently on CPUs)
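
As a concrete example of the dynamic approach, PyTorch can convert a trained model's Linear-layer weights to INT8 with a single call. This is a minimal sketch, not a production recipe.

# Hedged sketch: post-training dynamic quantisation of Linear layers with PyTorch
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

quantised = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # only Linear weights become INT8
)

x = torch.randn(1, 512)
print(quantised(x).shape)  # same interface, smaller weights, faster CPU inference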

Some of the benefits of quantisation are:

  • Reduces Model Size – Helps fit large AI models on small devices.
  • Speeds Up Inference – Allows faster processing on CPUs and edge devices.
  • Lower Power Consumption – Essential for mobile and embedded applications.
  • Enables AI on Consumer Hardware – Allows running LLMs (like llama.cpp) on laptops and smartphones.

Real world examples of quantisation include:

  • Whisper.cpp – Uses INT8 quantisation for speech-to-text transcription on CPUs.
  • Llama.cpp – Uses GGUF/GGML quantisation to run LLaMA models efficiently on local machines.
  • TensorFlow Lite & ONNX – Deploy AI models on mobile and IoT devices using quantized versions.

Quantisation is one of the most effective techniques for optimising AI models, making them smaller, faster, and more efficient. It allows complex deep learning models to run on consumer-grade hardware without sacrificing too much accuracy. Whether you’re working with text generation, speech recognition, or computer vision, quantisation is a game-changer in bringing AI to the real world.

Model fine-tuning with LoRA

Low-Rank Adaptation (LoRA) is a technique introduced to efficiently fine-tune large-scale pre-trained models, such as Large Language Models (LLMs), for specific tasks without updating all of their parameters. As models grow in size, full fine-tuning becomes computationally expensive and resource-intensive. LoRA addresses this challenge by freezing the original model’s weights and injecting trainable low-rank matrices into each layer of the Transformer architecture. This approach significantly reduces the number of trainable parameters and the required GPU memory, making the fine-tuning process more efficient.  

In traditional fine-tuning, all parameters of a pre-trained model are updated, which is not feasible for models with billions of parameters. LoRA proposes that the changes in weights during adaptation can be approximated by low-rank matrices. By decomposing these weight updates into the product of two smaller matrices, LoRA introduces additional trainable parameters that are much fewer in number. These low-rank matrices are integrated into the model’s layers, allowing for task-specific adaptation while keeping the original weights intact.  
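
The core idea fits in a few lines of PyTorch. In this minimal sketch, the frozen weight W is left untouched and only the low-rank factors A and B are trained; the rank r and scaling factor are illustrative defaults, not values from any particular paper or library.

# Minimal sketch of a LoRA-style linear layer
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable low-rank factor
        self.B = nn.Parameter(torch.zeros(d_out, r))         # trainable, initialised to zero
        self.scaling = alpha / r

    def forward(self, x):
        # Effective weight is W + B @ A, but the full-rank update is never materialised
        return x @ self.weight.T + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(1024, 1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16,384 trainable values versus 1,048,576 in the full weight matrix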

LoRA presents several advantages:

  • Parameter Efficiency: LoRA reduces the number of trainable parameters by orders of magnitude. For instance, fine-tuning GPT-3 with LoRA can decrease the trainable parameters by approximately 10,000 times compared to full fine-tuning.  
  • Reduced Memory Footprint: By updating only the low-rank matrices, LoRA lowers the GPU memory requirements during training, making it feasible to fine-tune large models on hardware with limited resources.  
  • Maintained Performance: Despite the reduction in trainable parameters, models fine-tuned with LoRA perform on par with, or even better than, those fine-tuned traditionally across various tasks.  

LoRA has been applied successfully in various domains, including:

  • Natural Language Processing (NLP): Fine-tuning models for specific tasks like sentiment analysis, translation, or question-answering.
  • Computer Vision: Adapting vision transformers to specialised image recognition tasks.
  • Generative Models: Customising models like Stable Diffusion for domain-specific image generation.

By enabling efficient and effective fine-tuning, LoRA facilitates the deployment of large models in specialised applications without the associated computational burdens of full model adaptation.

Using llama.cpp to Run Large Language Models Locally

With the rise of large language models (LLMs) like OpenAI’s GPT-4 and Meta’s LLaMA series, the demand for running these models efficiently on local machines has grown. However, most large-scale AI models require powerful GPUs and cloud-based services, which can be costly and raise privacy concerns.

Enter llama.cpp, a highly optimised C++ implementation of Meta’s LLaMA models that allows users to run language models directly on CPUs. This makes it possible to deploy chatbots, assistants, and other AI applications on personal computers, edge devices, and even mobile phones—without relying on cloud services.

What is llama.cpp?

llama.cpp is an efficient CPU-based inference engine for running Meta’s LLaMA models (LLaMA 1 and LLaMA 2) and other compatible open-weight families such as Mistral, Phi, and Qwen on Windows, macOS, Linux, and even ARM-based devices. It uses quantisation techniques to reduce the model size and memory requirements, making it possible to run LLMs on consumer-grade hardware.

The key features of llama.cpp are:

  • CPU-based execution – No need for GPUs.
  • Quantisation support – Reduces model size with minimal accuracy loss.
  • Multi-platform – Runs on Windows, Linux, macOS, Raspberry Pi, and Android.
  • Memory efficiency – Optimised for low RAM usage.
  • GGUF format – Uses an efficient binary format for LLaMA models.

Installing llama.cpp

The minimum system requirements for llama.cpp are:

  • OS: Windows, macOS, or Linux.
  • CPU: Intel, AMD, Apple Silicon (M1/M2), or ARM-based processors.
  • RAM: 4GB minimum, 8GB+ recommended for better performance.
  • Dependencies: gcc, make, cmake, python3, pip

To install on Linux/macOS, first clone the repository:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Then, build the project:

make

This compiles the main executable for CPU inference.

On Windows, install MinGW-w64 or use WSL (Windows Subsystem for Linux). Then, open a terminal (PowerShell or WSL) and run:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Alternatively, you can use Python Bindings. llama.cpp provides Python bindings for easy usage:

pip install llama-cpp-python
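
A minimal sketch of using the bindings, assuming a GGUF model has already been downloaded (see the next section) to the path shown:

# Sketch: running a quantised GGUF model through the Python bindings
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"].strip())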

Downloading and Preparing Models

Meta’s LLaMA models require approval for access. However, open-weight alternatives like Mistral, Phi, and Qwen can be used freely. To download a model, visit Hugging Face and search for LLaMA 2 GGUF models. Download a quantised model, e.g., llama-2-7b.Q4_K_M.gguf.

If you have raw LLaMA models, you must convert them to the GGUF format. First, install transformers:

pip install transformers

Then, convert:

python convert.py --model /path/to/llama/model

Once you have a GGUF model, you can start chatting!

./main -m models/llama-2-7b.Q4_K_M.gguf -p "Tell me a joke"

This runs inference using the model and generates a response. To run a chatbot session:

./main -m models/llama-2-7b.Q4_K_M.gguf --interactive

It will allow continuous interaction, just like ChatGPT.

If needed, you can quantise a model using one of the available levels:

  • Q8_0 – High accuracy, large size.
  • Q6_K – Balanced performance and accuracy.
  • Q4_K_M – Optimised for speed and memory.
  • Q2_K – Ultra-low memory, reduced accuracy.

You can quantise a model using the quantize tool built alongside the main executable:

./quantize models/llama-2-7b.gguf models/llama-2-7b.Q4_K_M.gguf Q4_K_M

This produces a GGUF file that is much smaller and runs faster.

To improve performance, use more CPU threads:

./main -m models/llama-2-7b.Q4_K_M.gguf -t 8

This will use 8 threads for inference.

If you have a GPU, you can enable acceleration:

make LLAMA_CUBLAS=1

This allows CUDA-based inference on NVIDIA GPUs.

Fine-tuning

With the power of llama.cpp and LoRA, you can build advanced chatbots, specialised assistants and domain-specific NLP solutions, all running locally, with full control over data and privacy.

Fine-tuning with llama.cpp requires a dataset in JSONL format (JSON Lines), which is a widely-used structure for text data in machine learning. Each line in the JSONL file represents an input-output pair. This format allows the model to learn a mapping from inputs (prompts) to outputs (desired completions):

{"input": "What is the capital of France?", "output": "Paris"}
{"input": "Translate to French: apple", "output": "pomme"}
{"input": "Explain quantum mechanics.", "output": "Quantum mechanics is a fundamental theory in physics..."}

To create a dataset, collect data relevant to your task. For example:

  • Question-Answer Pairs – For a Q&A bot.
  • Translation Examples – For a language translation model.
  • Dialogue Snippets – For chatbot fine-tuning.
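
A small Python sketch for assembling such a file from collected pairs:

# Write collected input/output pairs to a JSONL file, one JSON object per line
import json

pairs = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "Translate to French: apple", "output": "pomme"},
]

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")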

Once you have the JSONL dataset ready, you can fine-tune the model with a LoRA training script such as finetune.py, which builds on the PyTorch and Hugging Face PEFT stack; the resulting adapters can then be merged and used with llama.cpp.

First, you need to install the required libraries:

pip install torch transformers datasets peft bitsandbytes

You can now run finetune.py using the following command:

python finetune.py --model models/llama-2-7b.Q4_K_M.gguf --data dataset.jsonl --output-dir lora-output

After fine-tuning, the LoRA adapters must be merged with the base model to produce a single, fine-tuned model file.

python merge_lora.py --base models/llama-2-7b.Q4_K_M.gguf --lora lora-output --output models/llama-2-7b-finetuned.gguf

You can test the fine-tuned model using llama.cpp to see how it performs:

./main -m models/llama-2-7b-finetuned.gguf -p "What is the capital of France?"

Interesting Models to Run on llama.cpp

There are several models that you can run on llama.cpp:

1. LLaMA 2

  • Creator: Meta
  • Variants: 7B, 13B, 70B
  • Use Cases: General-purpose chatbot, knowledge retrieval, creative writing
  • Best Quantized Version: Q4_K_M (balanced accuracy and speed)
  • Why It’s Interesting: LLaMA 2 is one of the most powerful open-weight language models, comparable to GPT-3.5 in many tasks. It serves as the baseline for experimentation.

Example Usage in llama.cpp:

./main -m models/llama-2-13b.Q4_K_M.gguf -p "Explain the theory of relativity in simple terms."

2. Mistral 7B

  • Creator: Mistral AI
  • Variants: 7B (densely trained)
  • Use Cases: Chatbot, reasoning, math, structured answers
  • Best Quantized Version: Q6_K
  • Why It’s Interesting: Mistral 7B is optimized for factual accuracy and reasoning. It outperforms LLaMA 2 in some tasks despite being smaller.

Example Usage:

./main -m models/mistral-7b.Q6_K.gguf -p "Summarize the latest advancements in quantum computing."

3. Mixtral (Mixture of Experts)

  • Creator: Mistral AI
  • Variants: 8x7B (≈47B total parameters, with roughly 12.9B active per token since only 2 of 8 experts fire)
  • Use Cases: High-performance chatbot, research assistant
  • Best Quantized Version: Q5_K_M
  • Why It’s Interesting: Unlike standard models, Mixtral is a Mixture of Experts (MoE) model, meaning it activates only two out of eight experts per token. This makes it more efficient than similarly sized dense models.

Example Usage:

./main -m models/mixtral-8x7b.Q5_K_M.gguf --interactive

4. Code LLaMA

  • Creator: Meta
  • Variants: 7B, 13B, 34B
  • Use Cases: Code generation, debugging, explaining code
  • Best Quantized Version: Q4_K
  • Why It’s Interesting: This model is fine-tuned for programming tasks. It can generate Python, JavaScript, C++, Rust, and more.

Example Usage:

./main -m models/code-llama-13b.Q4_K.gguf -p "Write a Python function to reverse a linked list."

5. Phi-2

  • Creator: Microsoft
  • Variants: 2.7B
  • Use Cases: Math, logic, reasoning, lightweight chatbot
  • Best Quantized Version: Q5_K_M
  • Why It’s Interesting: Despite being only 2.7B parameters, Phi-2 is surprisingly strong in logical reasoning and problem-solving, outperforming models twice its size.

Example Usage:

./main -m models/phi-2.Q5_K_M.gguf -p "Solve the equation: 5x + 7 = 2x + 20."

6. Qwen-7B

  • Creator: Alibaba
  • Variants: 7B, 14B
  • Use Cases: Conversational AI, structured text generation
  • Best Quantized Version: Q4_K_M
  • Why It’s Interesting: Qwen models are multilingual and trained with high-quality data, making them excellent for chatbots.

Example Usage:

./main -m models/qwen-7b.Q4_K_M.gguf --interactive

Ollama: A Local AI Tool for Running Large Language Models

Ollama is another open-source tool that enables users to run large language models (LLMs) locally on their machines. Unlike cloud-based AI services like OpenAI’s GPT models, Ollama provides a privacy-focused, efficient, and customisable approach to working with AI models. It allows users to download, manage, and run models on macOS, Linux, and Windows (preview), reducing reliance on external servers.

Ollama supports multiple models, including LLaMA 3.3, Mistral, Phi-4, DeepSeek-R1, and Gemma 2, catering to a range of applications such as text generation, code assistance, and scientific research.

Ollama is easy to install with just a single command (macOS & Linux):

curl -fsSL https://ollama.com/install.sh | sh

Windows support is currently in preview. You can install it by downloading the latest version from the Ollama website.

Once installed, you can run an AI model with one simple command:

ollama run mistral

This command downloads the model automatically (if not already installed) and starts generating text based on the input. You can provide a custom prompt to the model:

ollama run mistral "What are black holes?"

Available AI Models in Ollama

Ollama supports multiple open-weight models. Here are some of the key ones:

1. LLaMA 3.3

General-purpose NLP tasks such as text generation, summarisation, and translation.

Example Command:

ollama run llama3.3 "Explain the theory of relativity in simple terms."

2. Mistral

Code generation, large-scale data analysis, and fast text-based tasks.

Example Command:

ollama run mistral "Write a Python script that calculates Fibonacci numbers."

3. Phi-4

Scientific research, literature review, and data summarisation.

Example Command:

ollama run phi4 "Summarise the key findings of quantum mechanics."

4. DeepSeek-R1

AI-assisted research, programming help, and chatbot applications.

Example Command:

ollama run deepseek-r1 "What are the ethical considerations of AI in medicine?"

5. Gemma 2

A multi-purpose AI model optimised for efficiency.

Example Command:

ollama run gemma2 "Generate a short sci-fi story about Mars."

Using Ollama in a Python Script

Developers can integrate Ollama into their Python applications via its local REST API. The example below calls the native /api/generate endpoint directly:

import requests

# Call Ollama's native /api/generate endpoint; "stream": False returns a single JSON object
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Explain black holes.", "stream": False},
)

print(response.json()["response"])

This allows developers to build AI-powered applications without relying on cloud services.
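
Ollama also exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, so existing OpenAI client code can be pointed at the local server with minimal changes. Here is a minimal sketch using the official openai Python package (the api_key value is a required placeholder; Ollama does not check it):

from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Explain black holes."}],
)

print(completion.choices[0].message.content)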

Advanced Usage

To see which models you have installed:

ollama list

If you want to download a model without running it:

ollama pull llama3

You can start Ollama in server mode for use in applications:

ollama serve
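
With the server running, any HTTP client can call the same API shown earlier. For example, a quick check from the command line (assuming the mistral model has already been pulled):

curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "What are black holes?", "stream": false}'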

Ollama is a powerful tool for anyone looking to run AI models locally—whether for text generation, coding, research, or creative writing. Its simplicity, efficiency, and privacy-first approach make it an excellent alternative to cloud-based AI services.

Key Differences Between Ollama and llama.cpp

Both Ollama and llama.cpp are powerful tools for running large language models (LLMs) locally, but they serve different purposes. While llama.cpp is a low-level inference engine focused on efficiency and CPU-based execution, Ollama is a high-level tool designed to simplify running LLMs with an easy-to-use API and built-in model management.

If you’re wondering which one to use, here is a breakdown of the major differences between llama.cpp and Ollama, covering their features, performance, ease of use, and best use cases.

Feature | llama.cpp | Ollama
Primary Purpose | Low-level LLM inference engine | High-level LLM runtime with API
Ease of Use | Requires manual setup & CLI knowledge | Simple CLI with built-in model handling
Model Management | Manual | Automatic download & caching
Supported Models | LLaMA, Mistral, Mixtral, Qwen, etc. | Same as llama.cpp, plus model catalog
Quantization Support | Yes (GGUF) | Yes (automatically handled)
Runs on CPU | ✅ Yes | ✅ Yes
Runs on GPU | ❌ (Only with extra setup) | ✅ Yes (CUDA-enabled by default)
API Support | ❌ No built-in API | ✅ Has an OpenAI-compatible API
Web Server Support | ❌ No | ✅ Yes (serves models via HTTP API)
Installation Simplicity | Requires compiling manually | One-command install
Performance Optimization | Fine-tuned for CPU efficiency | Optimised but with slight overhead due to API layer

llama.cpp is slightly faster on CPU since it is a barebones inference engine with no extra API layers. Ollama has a small overhead because it manages API interactions and model caching.

llama.cpp does not natively support GPU but can be compiled with CUDA or Metal manually. Ollama supports GPU out of the box on NVIDIA (CUDA) and Apple Silicon (Metal).

So, when should you use one or the other?

If you need… | Use llama.cpp | Use Ollama
Maximum CPU efficiency | ✅ Yes | ❌ No
Easy setup & installation | ❌ No | ✅ Yes
Built-in API for applications | ❌ No | ✅ Yes
Manual model control (fine-tuning, conversion) | ✅ Yes | ❌ No
GPU acceleration out of the box | ❌ No (requires manual setup) | ✅ Yes
Streaming responses (for chatbot UIs) | ❌ No | ✅ Yes
Web-based AI serving (like OpenAI API) | ❌ No | ✅ Yes

If you’re a developer or researcher who wants fine-grained control over model execution, llama.cpp is the better choice. If you just want an easy way to run LLMs (especially with an API and GPU support), Ollama is the way to go.

Understanding Vector Databases in the Modern Data Landscape


In the ever-expanding cosmos of data management, relational databases once held the status of celestial bodies—structured, predictable, and elegant in their ordered revolutions around SQL queries. Then came the meteoric rise of NoSQL databases, breaking free from rigid schemas like rebellious planets charting eccentric orbits. And now, we find ourselves grappling with a new cosmic phenomenon: vector databases—databases designed to handle data not in neatly ordered rows and columns, nor in flexible JSON-like blobs, but as multidimensional points floating in abstract mathematical spaces.

At first glance, the term vector database may sound like something conjured up by a caffeinated data scientist at 2 AM, but it’s anything but a fleeting buzzword. Vector databases are redefining how we store, search, and interact with complex, unstructured data—especially in the era of artificial intelligence, machine learning, and large-scale recommendation systems. But to truly appreciate their significance, we need to peel back the layers of abstraction and venture into the mechanics that make vector databases both fascinating and indispensable.


The Vector: A Brief Mathematical Detour

Imagine, if you will, the humble vector—not the villain from Despicable Me, but the mathematical object. In its simplest form, a vector is an ordered list of numbers, each representing a dimension. A 2-dimensional vector could be something like [3, 4], which you might recognize from your high school geometry class as a point on a Cartesian plane. Add a third number, and you’ve got a 3D point. But why stop at three? In the world of vector databases, we often deal with hundreds or even thousands of dimensions.

Why so many dimensions? Because when we represent complex data—like images, videos, audio clips, or even blocks of text—we extract features that capture essential characteristics. Each feature corresponds to a dimension. For example, an image might be transformed into a vector of 512 or 1024 floating-point numbers, each representing something abstract like color gradients, edge patterns, or latent semantic concepts. This transformation is often the result of deep learning models, which specialize in distilling raw data into dense, numerical representations known as embeddings.
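
To make this concrete, here is a small sketch of turning text into such vectors, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model, which produces 384-dimensional embeddings:

from sentence_transformers import SentenceTransformer

# Load a small, general-purpose embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["Vector databases store embeddings.", "Relational databases store rows."]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384): one 384-dimensional vector per sentence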

The Problem: Why Traditional Databases Fall Short

Now, consider the task of finding similar items in a dataset. In SQL, if you want to find records with the same customer_id or order_date, it’s a simple matter of writing a WHERE clause. Indexes on columns make these lookups blazingly fast. But what if you wanted to find images that look similar to each other? Or documents with similar meanings? How would you even define “similarity” in a structured table?

This is where relational databases throw up their hands in despair. Their indexing strategies—B-trees, hash maps, etc.—are optimized for exact matches or range queries, not for the fuzzy, high-dimensional notion of similarity. You could, in theory, store vectors as JSON blobs in a NoSQL database, but querying them would be excruciatingly slow and inefficient because you’d lack the underlying data structures optimized for similarity searches.

Enter Vector Databases: The Knights of Approximate Similarity

Vector databases are purpose-built to address this exact problem. Instead of optimizing for exact matches, they specialize in approximate nearest neighbor (ANN) search—a fancy term for finding the vectors that are most similar to a given query vector. The key here is approximate, because finding the exact nearest neighbors in high-dimensional spaces is computationally expensive to the point of impracticality. But thanks to clever algorithms, vector databases can find results that are close enough, in a fraction of the time.

These algorithms are designed to handle millions, even billions, of high-dimensional vectors with impressive speed and accuracy.

A Practical Example: Searching Similar Texts

Let’s say you’re building a recommendation system that suggests similar news articles. First, you’d convert each article into a vector using a model like Sentence Transformers or OpenAI’s text embeddings. Here’s a simplified Python example using faiss, an open-source vector search library developed by Facebook:

import faiss
import numpy as np

# Imagine we have 1000 articles, each represented by a 512-dimensional vector
np.random.seed(42)
article_vectors = np.random.random((1000, 512)).astype('float32')

# Create an index for fast similarity search
index = faiss.IndexFlatL2(512) # L2 is the Euclidean distance
index.add(article_vectors)

# Now, suppose we have a new article we want to find similar articles for
new_article_vector = np.random.random((1, 512)).astype('float32')

# Perform the search
k = 5 # Number of similar articles to retrieve
distances, indices = index.search(new_article_vector, k)

# Output the indices of the most similar articles
print(f"Top {k} similar articles are at indices: {indices}")
Note: In mathematics, Euclidean distance is the measure of the shortest straight-line distance between two points in Euclidean space. Named after the ancient Greek mathematician Euclid, who laid the groundwork for geometry, this distance metric is fundamental in fields ranging from computer graphics to machine learning.

Because we used IndexFlatL2, faiss is actually performing an exact, exhaustive comparison against all 1000 vectors here, which is perfectly fast at this scale. For larger collections, you would switch to one of faiss’s approximate index types, which use optimised data structures to prune the search space and still return results in milliseconds.

Peering Under the Hood

As with any technological marvel, the real intrigue lies beneath the surface. What happens when we peel back the abstraction layers and dive into the guts of these systems? How do they manage to handle millions—or billions—of high-dimensional vectors with such grace and efficiency? And what does the landscape of vector database offerings look like in the wild, both as standalone titans and as cloud-native services?

The Core Anatomy

At the heart of every vector database lies a deceptively simple question: “Given this vector, what are the most similar vectors in my collection?” This might sound like the database equivalent of asking a room full of people, “Who here looks the most like me?”—except instead of comparing faces, we’re comparing mathematical representations across hundreds or thousands of dimensions.

Now, brute-forcing this problem would mean calculating the distance between the query vector and every single vector in the database—a computational nightmare, especially when you’re dealing with millions of entries. This is where vector databases show their true genius: they don’t look at everything; they look at just enough to get the job done efficiently.

Indexing

In relational databases, indexes are like those sticky tabs you put on important pages in a textbook. In vector databases, the indexing mechanism is more like an intricate map that helps you find the closest coffee shop—not by checking every building in the city but by guiding you down the most promising streets.

The most common indexing techniques include:

  • HNSW (Hierarchical Navigable Small World Graphs): Imagine trying to find the shortest path through a vast network of cities. Instead of walking from door to door, HNSW creates a multi-layered graph where higher layers cover more ground (like express highways), and lower layers provide finer detail (like local streets). When searching for similar vectors, the algorithm starts at the top layer and gradually descends, zooming in on the best candidates with impressive speed.
  • IVF (Inverted File Index): Think of this like sorting a library into genres. Instead of scanning every book for a keyword, you first narrow your search to the right genre (or cluster), drastically reducing the number of comparisons. IVF clusters vectors into groups based on similarity, then searches only within the most relevant clusters.
  • PQ (Product Quantization): This technique compresses vectors into smaller chunks, reducing both storage requirements and computation time. It’s like summarizing long essays into key bullet points—not perfect, but good enough to quickly find what you’re looking for.

Most vector databases don’t rely on just one of these techniques; they often combine them, tuning performance based on the specific use case.
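
As a rough sketch of how such a combination looks in practice, here is an IVF+PQ index built with faiss; the cluster count and code sizes are arbitrary choices for illustration:

import faiss
import numpy as np

d = 128  # vector dimensionality
np.random.seed(0)
vectors = np.random.random((10000, d)).astype("float32")

# IVF clusters vectors into 100 lists; PQ compresses each vector into 16 sub-codes of 8 bits
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 100, 16, 8)

index.train(vectors)  # IVF/PQ indexes must be trained before adding data
index.add(vectors)
index.nprobe = 10  # search only the 10 most promising clusters instead of all 100

distances, indices = index.search(vectors[:1], 5)
print(indices)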

The Search

When you submit a query to a vector database, here’s a simplified version of what happens under the hood:

1. Preprocessing: The query vector is normalised or transformed to match the format of the stored vectors.

2. Index Traversal: The database navigates its index (whether it’s an HNSW graph, IVF clusters, or some hybrid) to identify promising candidates.

3. Distance Calculation: For these candidates, the database computes similarity scores using distance metrics like Euclidean distance, cosine similarity, or dot product.

4. Ranking: The results are ranked based on similarity, and the top-k closest vectors are returned.

And all of this happens in milliseconds, even for datasets with billions of vectors.

Note: Cosine similarity measures not the distance between two points but the angle between two vectors. It answers the question: “How similar are these two vectors in terms of their orientation?” At its core, cosine similarity calculates the cosine of the angle between two non-zero vectors in an inner product space. The cosine of 0° is 1, meaning the vectors are perfectly aligned (maximum similarity), while the cosine of 90° is 0, indicating that the vectors are orthogonal (no similarity). If the angle is 180°, the cosine is -1, meaning the vectors are diametrically opposed. The dot product (also known as the scalar product) is an operation that takes two equal-length vectors and returns a single number, a scalar. In plain English: multiply corresponding elements of the two vectors, then sum the results.
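
All three metrics are easy to compute directly with NumPy, which makes for a useful sanity check when debugging similarity scores:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.linalg.norm(a - b)  # straight-line distance between the points
dot = np.dot(a, b)  # multiply element-wise, then sum
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 here: identical orientation

print(euclidean, dot, cosine)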

Real-World Use Cases

While the technical details are fascinating, the real magic of vector databases becomes evident when you see them in action. They are the quiet engines behind some of the most advanced applications today.

Recommendation Systems

When Netflix suggests shows you might like, it’s not just comparing genres or actors—it’s comparing complex behavioural vectors derived from your viewing habits, preferences, and even micro-interactions. Vector databases enable these systems to perform real-time similarity searches, ensuring recommendations are both personalised and timely.

Semantic Search

Forget keyword-based search. Modern search engines aim to understand meaning. When you type “How to bake a chocolate cake?” the system doesn’t just look for pages with those exact words. It converts your query into a vector that captures semantic meaning and finds documents with similar vectors, even if the wording is entirely different.

Computer Vision

In facial recognition, each face is represented as a vector based on key features—eye spacing, cheekbone structure, etc. Vector databases can compare a new face against millions of stored vectors to find matches with remarkable accuracy.

Fraud Detection

Financial institutions use vector databases to identify unusual patterns that might indicate fraud. Transaction histories are converted into vectors, and anomalies are flagged based on their “distance” from typical behavior patterns.

The Vector Database Landscape

Now that we’ve dissected the internals and marveled at the use cases, it’s time to tour the bustling marketplace of vector databases. The landscape can be broadly categorized into standalone and cloud-native offerings.

Standalone Solutions

These are databases you can deploy on your own infrastructure, giving you full control over data privacy, performance tuning, and resource allocation.

  • Faiss: Developed by Facebook AI Research, Faiss is a library rather than a full-fledged database. It’s blazing fast for similarity search but requires some DIY effort to manage persistence, scaling, and API layers.
  • Annoy: Created by Spotify, Annoy (Approximate Nearest Neighbors Oh Yeah) is optimized for read-heavy workloads. It’s great for static datasets where the index doesn’t change often.
  • Milvus: A powerhouse in the open-source vector database arena, Milvus is designed for scalability. It supports multiple indexing algorithms, integrates well with big data ecosystems, and handles real-time updates gracefully.

Cloud-Native Solutions

For those who prefer to offload infrastructure headaches to someone else, cloud-native vector databases offer managed services with easy scaling, high availability, and integrations with other cloud products.

  • Pinecone: Pinecone abstracts away all the complexity of vector indexing, offering a simple API for similarity search. It’s optimised for performance and scalability, making it popular in production-grade AI applications.
  • Weaviate: More than just a vector database, Weaviate includes built-in machine learning capabilities, allowing you to perform semantic search without external models. It’s cloud-native but also offers self-hosting options.
  • Amazon Kendra / OpenSearch: AWS has dipped its toes into vector search through Kendra and OpenSearch, integrating vector capabilities with their broader cloud ecosystem.
  • Qdrant: A rising star in the vector database space, Qdrant offers high performance, flexibility, and strong API support. It’s designed with modern AI applications in mind, supporting real-time data ingestion and querying.

Exploring Azure and AWS Implementations

While open-source solutions like Faiss, Milvus, and Weaviate offer flexibility and control, managing them at scale comes with operational overhead. This is where Azure and AWS step in, offering managed services that handle the heavy lifting—provisioning infrastructure, scaling, ensuring high availability, and integrating seamlessly with their vast ecosystems of data and AI tools. Today, we’ll delve into how each of these cloud giants approaches vector databases, comparing their offerings, strengths, and implementation nuances.

AWS and the Vector Landscape

AWS, being the sprawling behemoth it is, doesn’t offer a single monolithic “vector database” product. Instead, it provides a constellation of services that, when combined, form a powerful ecosystem for vector search and management.

Amazon OpenSearch Service with k-NN Plugin

AWS’s primary foray into vector search comes via Amazon OpenSearch Service, formerly known as Elasticsearch Service. While OpenSearch is traditionally associated with full-text search and log analytics, AWS supercharged it with the k-NN (k-Nearest Neighbours) plugin, enabling efficient vector-based similarity search.

The k-NN plugin integrates libraries like Faiss and nmslib under the hood. Vectors are stored as part of OpenSearch documents, and the plugin allows you to perform approximate nearest neighbour (ANN) searches alongside traditional keyword queries.

PUT /my-index
{
  "settings": {
    "index": { "knn": true }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "vector": { "type": "knn_vector", "dimension": 128 }
    }
  }
}

POST /my-index/_doc
{
  "title": "Introduction to Vector Databases",
  "vector": [0.1, 0.2, 0.3, ..., 0.128]
}

POST /my-index/_search
{
  "size": 3,
  "query": {
    "knn": {
      "vector": {
        "vector": [0.12, 0.18, 0.31, ..., 0.134],
        "k": 3
      }
    }
  }
}

This blend of full-text and vector search capabilities makes OpenSearch a versatile choice for applications like e-commerce search engines, where you might want to combine semantic relevance with keyword matching.

Amazon Aurora with pgvector

For those entrenched in the relational world, AWS offers another compelling option: Amazon Aurora (PostgreSQL-compatible) with the pgvector extension. This approach allows developers to store and search vectors directly within a relational database, bridging the gap between structured data and vector embeddings. It brings two additional benefits: there is no separate vector database to manage, and you can run SQL queries that mix structured data with vector similarity searches.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE articles (
  id SERIAL PRIMARY KEY,
  title TEXT,
  embedding VECTOR(300)
);

INSERT INTO articles (title, embedding)
VALUES ('Deep Learning Basics', '[0.23, 0.11, ..., 0.89]');

SELECT id, title
FROM articles
ORDER BY embedding <-> '[0.25, 0.13, ..., 0.85]' -- <-> is L2 (Euclidean) distance; use <=> for cosine distance
LIMIT 5;

While this solution doesn’t match the raw performance of dedicated vector databases like Pinecone, it’s incredibly convenient for applications where relational integrity and SQL querying are paramount.

Amazon Kendra: AI-Powered Semantic Search

If OpenSearch and Aurora are the “build-it-yourself” kits, Amazon Kendra is the sleek, pre-assembled appliance. Kendra is a fully managed, AI-powered enterprise search service designed to deliver highly relevant search results using natural language queries. It abstracts away all the complexities of vector embeddings and ANN algorithms.

You feed Kendra your documents, and it automatically generates embeddings, indexes them, and provides semantic search capabilities via API. Kendra is ideal if you need out-of-the-box semantic search without delving into the mechanics of vector databases.
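
Querying is then a single API call. Here is a minimal sketch using boto3, where the index ID is a placeholder for an index you have already created and populated:

import boto3

kendra = boto3.client("kendra", region_name="us-east-1")

# Ask a natural-language question against an existing Kendra index
response = kendra.query(
    IndexId="YOUR-KENDRA-INDEX-ID",
    QueryText="What is our parental leave policy?",
)

for item in response["ResultItems"][:3]:
    print(item["DocumentTitle"]["Text"])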

Azure and the Vector Frontier

While AWS takes a modular approach, Microsoft Azure has focused on tightly integrated services that embed vector capabilities within its broader AI and data ecosystem. Azure’s strategy revolves around Cognitive Search and Azure Database for PostgreSQL.

Azure Cognitive Search with Vector Search

Azure Cognitive Search is the crown jewel of Microsoft’s search services. Initially designed for full-text search, it now supports vector search capabilities, allowing developers to combine keyword-based and semantic search in a single API. The key features are native support for HNSW indexing for fast ANN search and integration with Azure’s AI services, which makes it easy to generate embeddings using models from the Azure OpenAI Service.

POST /indexes/my-index/docs/search?api-version=2021-04-30-Preview
{
  "search": "machine learning",
  "vector": {
    "value": [0.15, 0.22, 0.37, ..., 0.91],
    "fields": "contentVector",
    "k": 5
  },
  "select": "title, summary"
}

This hybrid search approach allows you to retrieve documents based on both traditional keyword relevance and semantic similarity, making it perfect for applications like enterprise knowledge bases and intelligent document retrieval systems.

Azure Database for PostgreSQL with pgvector

Much like AWS’s Aurora, Azure Database for PostgreSQL supports the pgvector extension. This allows you to run vector similarity queries directly within your relational database, providing an elegant solution for applications that need to mix structured SQL data with unstructured semantic data.

The implementation is almost identical to what we’ve seen with AWS, thanks to PostgreSQL’s consistency across platforms. However, Azure’s deep integration with Power BI, Data Factory, and other analytics tools adds an extra layer of convenience for enterprise applications.

Azure Synapse Analytics and AI Integration

For organizations dealing with petabytes of data, Azure Synapse Analytics offers a powerful environment for big data processing and analytics. While Synapse doesn’t natively support vector search out of the box, it integrates seamlessly with Cognitive Search, allowing for large-scale vector analysis combined with data warehousing capabilities.

Imagine running complex data transformations in Synapse, generating embeddings using Azure Machine Learning, and then indexing those embeddings in Cognitive Search—all within the Azure ecosystem.

Comparing AWS and Azure: A Tale of Two Cloud Giants

While both AWS and Azure offer robust vector database capabilities, their approaches reflect their broader cloud philosophies:

AWS: Emphasises modularity and flexibility. You can mix and match services like OpenSearch, Aurora, and Kendra to create custom solutions tailored to specific use cases. AWS is ideal for teams that prefer granular control over their architecture.

Azure: Focuses on integrated, enterprise-grade solutions. Cognitive Search, in particular, shines for its seamless blend of traditional search, vector search, and AI-driven features. Azure is a natural fit for businesses deeply invested in Microsoft’s ecosystem.

Ultimately, the “best” vector database solution depends on your specific requirements. If you need real-time recommendations with low latency, AWS OpenSearch with k-NN or Azure Cognitive Search with HNSW might be your best bet. For applications where structured SQL data meets unstructured embeddings, PostgreSQL with pgvector on either AWS or Azure provides a flexible, developer-friendly solution. If you prefer managed AI-powered search with minimal configuration, Amazon Kendra or Azure Cognitive Search’s AI integrations will get you up and running quickly.

In the ever-evolving world of vector databases, both AWS and Azure are not just keeping pace—they’re setting the pace. Whether you’re a data engineer optimising for performance, a developer building AI-powered applications, or an enterprise architect designing at scale, these platforms offer the tools to turn vectors into value. And in the grand narrative of data, that’s what it’s all about.

The Importance of Vector Databases in the Modern Landscape

So why is this important? Because the world is drowning in unstructured data—images, videos, text, audio—and vector databases are the life rafts. They power recommendation systems at Netflix and Spotify, semantic search at Google, facial recognition systems in security applications, and product recommendations in e-commerce platforms. Without vector databases, these systems would be slower, less accurate, and more resource-intensive.

Moreover, vector databases are increasingly being integrated with traditional databases to create hybrid systems. For example, you might have user profiles stored in PostgreSQL, but their activity history represented as vectors in a vector database like Pinecone or Weaviate. The ability to combine structured metadata with unstructured vector search opens up new possibilities for personalisation, search relevance, and AI-driven insights.
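
As a rough sketch of that hybrid pattern using Pinecone’s Python client, the query below combines a similarity search over activity embeddings with a filter on structured metadata; the index name, field names, and the tiny three-dimensional query vector are purely illustrative:

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR-API-KEY")
index = pc.Index("user-activity")  # hypothetical index of user activity embeddings

# Find users with similar activity vectors, restricted by structured attributes
results = index.query(
    vector=[0.12, 0.48, 0.33],  # illustrative query vector (dimension must match the index)
    top_k=5,
    filter={"country": {"$eq": "UK"}, "plan": {"$eq": "premium"}},
    include_metadata=True,
)

for match in results.matches:
    print(match.id, match.score)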

In a way, vector databases represent the next evolutionary step in data management. Just as relational databases structured the chaos of early data processing, and NoSQL systems liberated us from rigid schemas, vector databases are unlocking the potential of data that doesn’t fit neatly into rows and columns—or even into traditional key-value pairs.

For developers coming from relational and NoSQL backgrounds, understanding vector databases requires a shift in thinking—from deterministic queries to probabilistic approximations, from indexing discrete values to navigating high-dimensional spaces. But the underlying principles of data modeling, querying, and optimization still apply. It’s just that the data now lives in a more abstract, mathematical universe.