HàPhan 河

🤖 Automating Code Reviews with LangChain, GitLab MCP, and RAG

Imagine this: a developer pushes code to a merge request on GitLab. Immediately, our AI agent springs into action. It reviews the changes, cross-referencing them with the existing codebase, identifying potential issues, suggesting improvements, and even explaining why certain changes might be problematic. All of this, as a comment directly on the merge request.

Creating an AI-powered code review agent that integrates with GitLab can significantly streamline the development process. By leveraging LangChain, the Model Context Protocol (MCP), and Retrieval-Augmented Generation (RAG), we can build an agent capable of retrieving, analyzing, and providing feedback on code changes. This guide walks you through setting up such an agent, complete with sample code snippets.

🧠 Overview

Our AI code review agent will:

Connect to GitLab: Utilize an MCP server to interact with GitLab repositories.
Index Source Code: Employ RAG to embed and index the codebase for efficient retrieval.
Analyze Merge Requests: Automatically fetch and review merge requests, providing actionable feedback.

🛠️ Prerequisites

Before we begin, ensure you have the following:

  • Python 3.8+

  • Node.js (v16+)

  • GitLab Personal Access Token

  • GitLab MCP Server: A server that exposes GitLab functionality via MCP. Refer to the GitLab MCP Server project (cloned and built later in this guide) for setup instructions.

  • Required Python Packages

pip install langchain langchain-community langchain-huggingface langchain-chroma langchain-ollama openai mcp-use

💡Step-by-Step: Building the Agent

Let's break down the development process.

🧱 Indexing the Codebase with RAG

Retrieval-Augmented Generation (RAG) enhances the agent's ability to provide contextually relevant feedback by retrieving pertinent information from the codebase.

Load and Split the Codebase:

from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load all code files from the local folder
loader = GenericLoader.from_filesystem(
    "./source",
    glob="**/*",
    suffixes=[".swift"],
    parser=LanguageParser()
)
documents = loader.load()

# Split documents into manageable chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Max characters per chunk
    chunk_overlap=200,    # Overlap between chunks to maintain context
    length_function=len,
    add_start_index=True, # Add metadata about the start index of the chunk
)
chunks = text_splitter.split_documents(documents)
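
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification: RecursiveCharacterTextSplitter first tries to split on separators such as blank lines and newlines, and only then falls back to raw character counts. The helper name sliding_chunks is hypothetical and not part of LangChain.

```python
from typing import Dict, List, Union


def sliding_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[Dict[str, Union[int, str]]]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the settings used above, each window advances by 800 characters, so the last 200 characters of one chunk reappear at the start of the next. That overlap is what keeps a function signature and its body from being separated entirely.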

Generate Embeddings and Store in Vector Database:

import os

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the vector store if it already exists; otherwise build it from the chunks
persist_directory = "./chroma"
if os.path.exists(persist_directory) and os.listdir(persist_directory):
    print(f"Loading existing Chroma DB from {persist_directory}")
    vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
else:
    # Ensure the persist directory exists
    os.makedirs(persist_directory, exist_ok=True)
    # langchain_chroma writes to persist_directory automatically,
    # so no explicit persist() call is needed
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory
    )
    print("Chroma DB created and persisted.")

This setup allows the agent to retrieve relevant code snippets when analyzing merge requests.
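
If it helps to see what retrieval means here, the sketch below ranks stored vectors by cosine similarity to a query vector and returns the best matches. It only illustrates the idea: Chroma uses an approximate nearest-neighbour index and a configurable distance metric, and the names cosine_similarity and top_k are hypothetical helpers rather than library functions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    indexed: List[Tuple[str, Sequence[float]]],
    k: int = 4,
) -> List[str]:
    """Return the texts of the k entries whose vectors are most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

In the real pipeline the vectors come from the embedding model, so a query about "token refresh" can land near chunks that implement it even when the wording differs.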

Define the RAG Retrieval Tool

Create a custom tool that utilizes the vector store to retrieve relevant code snippets:

from langchain.agents import Tool

def retrieve_code_snippets(query):
    # Retrieve relevant documents based on the query
    docs = vectorstore.similarity_search(query)
    # Combine the content of the documents into a single string
    return "\n".join([doc.page_content for doc in docs])

# Define the tool
rag_tool = Tool(
    name="RetrieveCodeSnippets",
    func=retrieve_code_snippets,
    description="Retrieves relevant code snippets based on a query."
)
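
One possible refinement is to label each snippet with the file it came from and cap the total size, so the agent knows where the code lives and the prompt stays within the model's context window. The sketch below assumes each retrieved document carries a "source" entry in its metadata, which file-based loaders typically set. The Snippet class is a stand-in for LangChain's Document so the example runs on its own, and format_snippets and max_chars are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Snippet:
    """Stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_snippets(docs: List[Snippet], max_chars: int = 4000) -> str:
    """Join snippets with a file header each, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for doc in docs:
        source = doc.metadata.get("source", "unknown file")
        block = f"# File: {source}\n{doc.page_content}"
        if used + len(block) > max_chars:
            break  # keep the highest-ranked snippets and drop the rest
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```

Inside retrieve_code_snippets, the return statement could then call format_snippets(docs) instead of joining the raw page contents.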

Hint: Instead of exposing retrieval as a tool, you can use a RetrievalQA chain, which retrieves the relevant chunks and passes them to the LLM in a single step.

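from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_ollama import OllamaLLM

def setup_rag_chain(vectorstore):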
    """
    Sets up the Retrieval Augmented Generation (RAG) chain.
    This chain takes a query, retrieves relevant documents from the vector store,
    and then passes them to the LLM to generate an answer.
    """
    print("Setting up RAG chain...")
    llm = OllamaLLM(model="deepseek-r1:7b") # local llm

    # Define a custom prompt for the LLM
    # This prompt guides the LLM on how to use the retrieved context
    custom_prompt_template = """Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Keep the answer concise and relevant to the codebase.

Context:
{context}

Question:
{question}

Helpful Answer:"""
    CUSTOM_PROMPT = PromptTemplate(
        template=custom_prompt_template,
        input_variables=["context", "question"]
    )

    # Create the RetrievalQA chain
    # 'stuff' chain type puts all retrieved documents into the prompt
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 5}), # Retrieve top 5 relevant documents
        return_source_documents=True, # Return the documents that were used to generate the answer
        chain_type_kwargs={"prompt": CUSTOM_PROMPT}
    )
    print("RAG chain setup complete.")
    return qa_chain
    
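# Usage
qa_chain = setup_rag_chain(vectorstore)
query = "How is the network layer structured?"  # example question
result = qa_chain.invoke({"query": query})
print(result["result"])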
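
To see what the "stuff" chain type does with the retrieved documents, the following sketch builds the final prompt by hand. It is an illustration in plain Python, not LangChain's implementation: stuff_prompt is a hypothetical helper, and the template is a shortened copy of the one above.

```python
from typing import List

PROMPT_TEMPLATE = """Use the following pieces of context to answer the user's question.

Context:
{context}

Question:
{question}

Helpful Answer:"""


def stuff_prompt(snippets: List[str], question: str) -> str:
    """Concatenate all retrieved snippets and place them in the prompt's context slot."""
    context = "\n\n".join(snippets)  # every snippet goes into a single prompt
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Because every retrieved chunk ends up in one prompt, the value of k and the chunk size together determine how much of the model's context window the retrieval step consumes.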

🛠️ Setting Up the Agent with mcp-use

To streamline the integration of GitLab tools into your LangChain agent, you can utilize the mcp-use library. This library simplifies the process of connecting LangChain-compatible LLMs with MCP servers, allowing for seamless tool integration without manual definitions.

Clone and build the GitLab MCP server:

You need Node.js installed to build and start the server.

git clone https://github.com/rifqi96/mcp-gitlab.git
cd mcp-gitlab
npm install
npm run build

After the build completes, you will find index.js in the build folder. Use it to start the local MCP server, which the client then connects to.

Initialize the MCP Client:

Set up the MCP client to connect to your GitLab MCP server:

from mcp_use import MCPClient

# Configure the client to launch the locally built server
config = {
  "mcpServers": {
    "gitlab": {
      "command": "node",
      "args": ["/path/to/mcp-gitlab/build/index.js"],
      "env": {
        "GITLAB_API_TOKEN": "YOUR_GITLAB_API_TOKEN",
        "GITLAB_API_URL": "https://gitlab.com/api/v4"
      }
    }
  }
}

client = MCPClient.from_dict(config)
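
Hardcoding the token in the config is fine for a quick test, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead is shown below. The helper name build_gitlab_config is hypothetical, and the environment variable name simply mirrors the key used in the config above.

```python
import os


def build_gitlab_config(server_path: str, api_url: str = "https://gitlab.com/api/v4") -> dict:
    """Build the MCP client config, taking the GitLab token from the environment."""
    token = os.environ.get("GITLAB_API_TOKEN")
    if not token:
        raise RuntimeError("Set the GITLAB_API_TOKEN environment variable first")
    return {
        "mcpServers": {
            "gitlab": {
                "command": "node",
                "args": [server_path],
                "env": {
                    "GITLAB_API_TOKEN": token,
                    "GITLAB_API_URL": api_url,
                },
            }
        }
    }
```

The resulting dictionary has the same shape as the config above, so it can be passed to MCPClient.from_dict in the same way.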

Load Tools from the MCP Server:

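Retrieve the available tools from the connected MCP server. The exact call depends on your mcp-use version: some releases expose tools through an asynchronous LangChain adapter rather than a method on the client, so check the library's documentation if load_tools() is not available.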

mcp_tools = client.load_tools()

Integrate Tools into Your Agent:

Combine the MCP tools with the RAG tool and pass them to your LangChain agent:

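from langchain.agents import AgentType, initialize_agent
from langchain.llms import OpenAI

# Initialize your LLM
llm = OpenAI(temperature=0)

all_tools = mcp_tools + [rag_tool]

# Create the agent with the loaded tools.
# `agent` must name a built-in agent type; the structured-chat type
# accepts tools that take several arguments, as MCP tools often do.
agent = initialize_agent(
    all_tools,
    llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)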

Use the Agent:

Now, you can prompt your agent to perform tasks:

# Example prompt to review a merge request
prompt = "Review merge request #42 and provide feedback on code quality and potential issues."

# Run the agent with the prompt
agent.run(prompt)
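
A merge request number alone is ambiguous when the token can see several projects, so it can help to build the prompt from explicit parameters. The sketch below is one way to do that. The function name build_review_prompt and its parameters are hypothetical, and the project path in the usage note is a placeholder.

```python
from typing import Sequence


def build_review_prompt(
    project_path: str,
    mr_iid: int,
    focus: Sequence[str] = ("code quality", "potential issues"),
) -> str:
    """Compose a review instruction that names the project and merge request explicitly."""
    if mr_iid <= 0:
        raise ValueError("mr_iid must be a positive integer")
    focus_text = " and ".join(focus)
    return (
        f"Review merge request {mr_iid} in project {project_path} "
        f"and provide feedback on {focus_text}. "
        "Use the RetrieveCodeSnippets tool to look up related code before commenting."
    )
```

For example, build_review_prompt("my-group/my-app", 42) produces a prompt that can be passed to agent.run in place of the hand-written string.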

By leveraging the mcp-use library and RAG, you can efficiently integrate GitLab functionalities into your LangChain agent. The agent will utilize the tools to fetch merge request details and retrieve relevant code snippets, enabling it to provide comprehensive feedback. This approach enhances scalability and maintainability, especially when dealing with multiple tools or MCP servers.


🧪 Testing the Agent

To ensure the agent functions correctly:

Submit a Merge Request: Create a new merge request in your GitLab repository.

Run the Agent: Use the agent to review the merge request by providing the appropriate prompt.

Evaluate Feedback: Assess the feedback provided by the agent for accuracy and usefulness.
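
Part of the evaluation step can be automated with a rough sanity check: does the review mention the files that actually changed? The sketch below is a crude heuristic under that assumption, not a measure of review quality, and files_mentioned is a hypothetical helper. The list of changed files would come from the merge request itself.

```python
import posixpath
from typing import Dict, List


def files_mentioned(review_text: str, changed_files: List[str]) -> Dict[str, bool]:
    """Map each changed file path to whether its file name appears in the review text."""
    text = review_text.lower()
    return {
        path: posixpath.basename(path).lower() in text
        for path in changed_files
    }
```

A review that mentions none of the changed files is a sign that the agent did not fetch the diff and answered from general knowledge instead.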


🚀 Conclusion

By integrating LangChain with GitLab via an MCP server and employing RAG for code indexing, we've built a powerful AI agent capable of automating code reviews. This setup not only saves time but also enhances code quality by providing consistent and thorough feedback.

Feel free to expand upon this foundation by adding more tools, improving the embedding model, or integrating with other services to further enhance the agent's capabilities.
