December 19, 2025 / AI, Self-developed Tools / Read Time: 13 Min

I Trained a Better Legal Embedding Model Than Google and Alibaba?

A complete walkthrough of fine-tuning a legal-domain embedding model: starting from Google's EmbeddingGemma-300M, I trained it on a dataset that pairs legal provisions with colloquial questions. In my tests it outperforms both Google's and Alibaba's models at legal provision retrieval, and it is now open source.

A while ago, I built a legal provision lookup tool.

I initially made it for personal use, but since I’d already built it, I decided to share it:

【Self-made】Local AI Legal Provision Database — Instant Search Results, AI-Powered Conclusions, Quick Cross-Reference Navigation, Fast Citation, and More!

While using it, I felt the legal provision recall wasn’t good enough. So I looked at my GPU and thought:

“Why not fine-tune my own model?”

Note: This is an Embedding model, not an LLM — it cannot be used for conversation.

1. Results

Models Compared

Base model (before fine-tuning): Google’s EmbeddingGemma-300M

A strong embedding model from China: Alibaba’s Qwen3-embedding-0.6b

Both are excellent lightweight embedding models, ranking highly on the global open-source embedding model leaderboard, and suitable for most hardware.

That’s why I initially chose EmbeddingGemma-300M as the base model for the legal provision database.

Comparison Results

The comparison process: I randomly selected over 100 legal provisions from the database, had DeepSeek generate a colloquial question for each provision (similar to what an ordinary user would ask), and then compared where each model ranked the target provision among all retrieved results.
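The comparison above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the evaluation: the function names are hypothetical, cosine similarity is an assumed metric, and the embeddings are assumed to be available as NumPy arrays.

```python
import numpy as np

def target_rank(query_vec, provision_vecs, target_idx):
    """Return the 1-based rank of the target provision for one query."""
    query_vec = np.asarray(query_vec, dtype=float)
    provision_vecs = np.asarray(provision_vecs, dtype=float)
    # Normalize so that the dot product equals cosine similarity (assumed metric).
    q = query_vec / np.linalg.norm(query_vec)
    p = provision_vecs / np.linalg.norm(provision_vecs, axis=1, keepdims=True)
    scores = p @ q
    # Rank = 1 + number of provisions that score strictly higher than the target.
    return int(np.sum(scores > scores[target_idx])) + 1

def compare_models(ranks_a, ranks_b):
    """Share of queries where model A ranks the target better, equal, or worse than model B."""
    a, b = np.asarray(ranks_a), np.asarray(ranks_b)
    return {
        "better": float(np.mean(a < b)),  # a lower rank number means a better result
        "tie": float(np.mean(a == b)),
        "worse": float(np.mean(a > b)),
    }
```

Running `target_rank` once per test question and per model gives one list of ranks per model, and `compare_models` then reports how often one model places the target provision higher than the other.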

After fine-tuning, the results are shown below (long image warning). In more than 1/3 of the test queries, the fine-tuned model ranked the target provision higher than Alibaba’s Qwen3-embedding-0.6b did; compared with the original Google EmbeddingGemma-300M, the ranking improved in more than 1/4 of the queries:

2. Try It Yourself

Fully Open

The trained model has been uploaded to Ollama, ModelScope, and HuggingFace.

Ollama users can pull it with:

ollama pull demonbyron/embeddinggemma-300m-lawvault

Note: This is the bf16 (16-bit floating point) version.
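Once the model is pulled, it can also be called outside LawVault. The sketch below uses only the Python standard library and Ollama's `/api/embed` endpoint; the address `http://localhost:11434` is an assumption (Ollama's usual local default), and the helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "demonbyron/embeddinggemma-300m-lawvault"
OLLAMA_URL = "http://localhost:11434/api/embed"  # assumption: default local Ollama address

def build_embed_request(texts, model=MODEL, url=OLLAMA_URL):
    """Build the POST request that asks Ollama to embed a list of texts."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts):
    """Send the request and return one embedding (a list of floats) per input text."""
    with urllib.request.urlopen(build_embed_request(texts)) as resp:
        return json.load(resp)["embeddings"]
```

With Ollama running locally, `embed(["What happens if my landlord refuses to return my deposit?"])` should return a list containing a single vector.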

ModelScope page:

https://modelscope.cn/models/ByronLeeee/EmbeddingGemma-300M-LawVault

HuggingFace page:

https://huggingface.co/ByronLeeee/EmbeddingGemma-300M-LawVault

How to Use

To use it with LawVault, you must update the legal database file, otherwise it won’t return correct provisions.

Download the new vector database file package from:

https://pan.xunlei.com/s/VOgpB1Qqjfe8uxRokBXzyy-BA1#

Then delete the original vector database folder (law_db.lancedb) and extract the new database folder in its place (no need to delete the content.db file).

In the settings, change the model name to:

demonbyron/embeddinggemma-300m-lawvault:latest

Then just ask questions normally:

New Features

Compared to the previous version, several new features have been added. Feel free to try them:

【Legal Provision Search Agent】

After enabling 【Deep Thinking Mode】 and 【AI Q&A】, AI will first break down the search question, list multiple search directions, and automatically perform searches.

The agent automatically determines whether the search results are sufficient to answer the question. If not, it continues searching for subsequent questions or adds new keywords.

When AI determines the search is complete (or has reached the maximum search rounds), it will consolidate all retrieved provisions and generate a search report based on the question:
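The agent behavior described above can be pictured as a simple loop. The following is only an illustrative sketch, not LawVault's actual implementation: every function name is hypothetical, the four callables stand in for LLM and vector search calls, and the "adds new keywords" step is left out for brevity.

```python
from typing import Callable, List

def run_search_agent(
    question: str,
    decompose: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    summarize: Callable[[str, List[str]], str],
    max_rounds: int = 5,
) -> str:
    """Search sub-question by sub-question until the results suffice or rounds run out."""
    pending = decompose(question)  # the search directions produced in the first step
    found: List[str] = []
    rounds = 0
    while pending and rounds < max_rounds:
        query = pending.pop(0)
        for provision in search(query):
            if provision not in found:  # keep each provision only once
                found.append(provision)
        rounds += 1
        if is_sufficient(question, found):
            break  # the agent judges that it can already answer the question
    # Consolidate everything retrieved so far into the final report.
    return summarize(question, found)
```

Passing the LLM and search steps in as functions keeps the control flow visible: the loop ends either when the sufficiency check passes or when `max_rounds` is reached.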

【Writing Assistant】

You can now add retrieved provisions or full-text selections directly to the material library, and use the writing assistant to have AI draft the required text content.

You can also use the “Smart Material Search” feature to let AI search for the required provisions and draft content:

Various export formats are supported. For example, you can copy the text pre-formatted for pasting into Word:

3. Principles and Fine-Tuning Process

This is the dry technical section. If you’re not interested, feel free to share and like this article before closing. Thank you!

Embedding Principle

The legal provision RAG workflow is shown above.

Why do many legal databases struggle to find the right provisions?

Because vector search is a text similarity search: it ranks provisions by how close their embedding vectors are to the embedding vector of the query.

This means the search content must have high similarity to the “legal language” of the provision text to successfully return the correct provision.

So there’s a paradox:

Only if “I” already know the content of the provision can “I” search for it;

But if “I” already know the provision, why would “I” need to search?
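To make "similarity" concrete, here is a minimal sketch of similarity-based retrieval over embedding vectors. Cosine similarity is used as a common choice; the post does not state which metric the database actually uses, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, provision_vecs, k=3):
    """Return (index, score) for the k provisions most similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in provision_vecs]
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(int(i), scores[i]) for i in order]
```

A provision is only returned if its vector lies close to the query vector. If the model maps a colloquial question and the formal provision text to distant vectors, the right provision never reaches the top results, which is exactly the gap fine-tuning tries to close.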

Fine-Tuning Process

So I decided to try fine-tuning the embedding model.

First, I took over 20,000 pre-split legal provisions (statutes only, excluding judicial interpretations and local regulations) and used the DeepSeek V3.2 model to generate colloquial questions based on each provision.

For each provision, questions were generated in 3 different tones, yielding 65,783 question-answer pairs in total.

EmbeddingGemma (like embedding models in general) is best fine-tuned on triplets (query, positive example, negative example), and negative examples generated by an LLM might be unstable. So for each training question, I used as the negative example the closest provision returned from the original model’s vector database, excluding the target provision. That completed the full training dataset.
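This way of picking negatives is often called hard negative mining. A minimal sketch is given below, assuming the base model's embeddings for questions and provisions are already available as NumPy arrays; the function names are hypothetical, cosine similarity is an assumed metric, and this is not the exact script used for the dataset.

```python
import numpy as np

def mine_hard_negatives(query_vecs, provision_vecs, target_idx):
    """For each query, return the index of the most similar provision that is not its target."""
    q = np.asarray(query_vecs, dtype=float)
    p = np.asarray(provision_vecs, dtype=float)
    target_idx = np.asarray(target_idx)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    scores = q @ p.T  # shape: (num_queries, num_provisions)
    # Mask out each query's own target so it can never be chosen as the negative.
    scores[np.arange(len(target_idx)), target_idx] = -np.inf
    return scores.argmax(axis=1)

def build_triplets(questions, provisions, target_idx, negative_idx):
    """Assemble (query, positive, negative) text triplets for training."""
    return [
        (question, provisions[t], provisions[n])
        for question, t, n in zip(questions, target_idx, negative_idx)
    ]
```

Negatives chosen this way are the provisions the base model already confuses with the target, so the training signal lands where the original model is weakest.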

Then came routine training. Parameters:

Batch size: 24 (effective batch size = 144 via gradient accumulation, i.e. 6 accumulation steps)

Epoch     Step    Training Loss
0.0022    1       3.5148
1         457     0.2123
2         914     0.0749
3         1371    0.0369

On an RTX 5070 Ti (16 GB), 3 epochs took about 2.5 hours, which is an acceptable speed overall.
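As a sanity check, the logged step counts follow directly from the dataset size and the effective batch size. The arithmetic below assumes the last, partially filled batch of each epoch still counts as one optimizer step; the variable names are only illustrative.

```python
import math

num_pairs = 65_783      # training examples, from the dataset section
batch_size = 24         # examples per forward/backward pass
effective_batch = 144   # examples per optimizer step
epochs = 3
train_hours = 2.5

accumulation_steps = effective_batch // batch_size
steps_per_epoch = math.ceil(num_pairs / effective_batch)  # assumption: partial batch is kept
total_steps = steps_per_epoch * epochs
seconds_per_step = train_hours * 3600 / total_steps

print(accumulation_steps, steps_per_epoch, total_steps)  # 6 457 1371
print(round(seconds_per_step, 1))                        # 6.6
```

The 457 steps per epoch and 1371 total steps are consistent with the step counts logged during training, and the first logged epoch value of 0.0022 corresponds to step 1 of 457.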

4. Finally

The title is a bit “clickbaity.”

After all, the fine-tuned model only has an advantage in the specific scenario of correlating colloquial legal questions with specific provisions.

But what I want to show is that model fine-tuning is actually very accessible.

The models provided by major companies are usually “common denominators.” Fine-tuning a model for your own use can further improve model utilization.

I believe the future will inevitably involve a combination of local personalized small models and online large models. Having your own model is definitely satisfying.

Boyang Li
Author

Chinese Attorney — Beijing Longan (Guangzhou) Law Firm

A lawyer focused on game law, AI regulation, data compliance, and digital content rights. I write about practical legal insights for innovative tech teams.
