Structuring a dataset of current South Korean laws doesn't end with a single version. Instead, it evolves into a continuous effort, exploring various ways to process and refine the data, including extending it to cover case law, precedents, and other related legal frameworks. Naturally, these datasets will feed LLM fine-tuning and RAG (Retrieval-Augmented Generation) applications, aiming for robust and practical implementations.
Additionally, as an initial step, I converted the original JSON data into a simplified (not yet final) dataset and uploaded it to Hugging Face for public use: Korean Law Dataset.
1. The Pillars of Legal Data
To effectively use legal texts in AI, understanding the essential components of such datasets is crucial. Here are the key elements I focused on:
1.1 Fundamental Information
- Law Name (Korean): Central for dataset navigation and search.
- Dates (Promulgation & Enforcement): Vital for knowing when the law was promulgated and when it took effect.
- Administrative Body: Information about the department or office responsible for the law’s oversight.
1.2 Provisions (Articles and Clauses)
- Article Numbers and Titles: Define the law’s structure, crucial for indexing and contextualizing data.
- Article Content: The core text used for LLM training.
- Hierarchy of Clauses: Detailed breakdowns that help the model grasp nuanced legal contexts (see the sketch after this list).
- Amendments: Data on changes, including dates, facilitates version control and temporal analysis.
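To make the hierarchy concrete, here is a sketch of how an article and its clauses can nest. The top-level keys match those used in the conversion code later in this post; the clause-level keys (항, 항번호, 항내용) are illustrative of the shape, not the government API's exact schema:

{
  "조문번호": "6",
  "조문제목": "...",
  "조문내용": "...",
  "항": [
    { "항번호": "①", "항내용": "..." },
    { "항번호": "②", "항내용": "..." }
  ]
}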
1.3 Supplementary Provisions (부칙)
Supplementary rules often outline transitional measures, dates, or modifications to related laws. For example, connections between laws are often codified in provisions like “Amendment of Other Laws” (e.g., Article 7).
1.4 Legislative History
- The rationale behind amendments adds interpretative layers, offering context for changes.
2. Designing the Dataset: Lessons in Precision
Creating a dataset isn’t just about gathering information—it’s about crafting it for usability and relevance. Here’s how I structured my dataset:
2.1 Segmentation by Purpose
- Individual Articles: Each article was treated as an independent sample, making it easy to analyze in isolation.
- Separate Supplementary Data: Provisions and amendments were stored distinctly, aiding inter-law relationship analysis (illustrated after this list).
- Metadata Enrichment: Fields like law name, enforcement dates, and overseeing body were added for filtering and searching.
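As a concrete illustration of this segmentation, an article sample and a supplementary-provision sample live as separate records. The 부칙 keys here are illustrative placeholders, not the API's exact field names:

An article record:

{ "법령명": "재외국민등록법", "조문번호": "1", "조문내용": "..." }

A supplementary-provision record, stored in its own file:

{ "법령명": "재외국민등록법", "부칙공포일자": "...", "부칙내용": "..." }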
2.2 Formatting Decisions
- JSON vs. CSV: JSON worked best for hierarchical data (like clauses), while CSV sufficed for tabular relationships.
Example JSON structure:
{
  "법령명": "재외국민등록법",
  "조문번호": "1",
  "조문제목": "목적",
  "조문내용": "이 법은 외국에 거주...",
  "시행일자": "20230605",
  "개정이유": "국가를 위해 헌신..."
}
The JSON above is one example of data fetched directly from the API provided by the Korean government, focusing on just the current-law part. I detailed how this data was retrieved through the government-provided API in my blog post: Gathering Essential Legal Data.
2.3 Preprocessing Pitfalls and Fixes
- Cleaning Text: Korean-specific punctuation and whitespace issues required careful handling.
- Terminology Consistency: Avoiding ambiguities like “외교부장관” vs. “외교부 장관” (the Minister of Foreign Affairs, written with and without a space) by standardizing terms.
- Maintaining Hierarchy: Ensuring sub-clauses remained linked to their parent articles.
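Here is a minimal sketch of the first two fixes, assuming the pandas DataFrame built in section 4; the TERM_MAP entries are hypothetical examples that would grow as more inconsistencies surface:

import re

# Hypothetical standardization map: collapse spacing variants of the same term
TERM_MAP = {
    '외교부 장관': '외교부장관',
}

def clean_text(text: str) -> str:
    text = text.replace('\u3000', ' ')                 # ideographic space -> ASCII space
    text = text.replace('（', '(').replace('）', ')')   # full-width -> ASCII parentheses
    text = re.sub(r'\s+', ' ', text).strip()           # collapse runs of whitespace
    for variant, standard in TERM_MAP.items():         # unify terminology
        text = text.replace(variant, standard)
    return text

df['조문내용'] = df['조문내용'].map(clean_text)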
3. Applications and Takeaways
3.1 Fine-Tuning Models
Using this structured dataset, I plan to focus on:
- Training LLMs to Understand Legal Provisions: By inputting article content and metadata (e.g., article number, title), the model will be trained to predict specific outputs, such as related amendments or legislative history.
- Creating Q&A Pairs for Targeted Learning: This includes generating datasets that help the model answer legal questions (a generation sketch follows this list). For instance:
- Question: “What duties does the consular officer have under Article 6 of the Overseas Koreans Act?”
- Answer: “Maintain the Overseas Korean Register and submit it to the Minister of Foreign Affairs and the Commissioner of the Overseas Koreans Agency.”
- Enhancing Contextual Understanding: Incorporating supplementary provisions and legislative history to ensure the model captures the broader implications of laws and amendments.
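Here is a minimal sketch of the Q&A-pair generation mentioned above, reading the flattened CSV produced in section 4; the question template and file names are my own illustrative choices, not a fixed recipe:

import json
import pandas as pd

df = pd.read_csv('refined_law_dataset.csv')

with open('law_qa_pairs.jsonl', 'w', encoding='utf-8') as f:
    for _, row in df.iterrows():
        if not isinstance(row['조문제목'], str):   # skip rows without an article title
            continue
        pair = {
            'question': f"{row['법령명']} 제{row['조문번호']}조({row['조문제목']})는 무엇을 규정하는가?",
            'answer': row['조문내용'],
        }
        f.write(json.dumps(pair, ensure_ascii=False) + '\n')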
3.2 Expanding Use Cases
In future iterations, I aim to refine and expand the dataset to include:
- Precedents and Case Law: Developing connections between statutory law and judicial interpretations for deeper legal reasoning.
- Cross-Law Relationships: Using provisions like amendments and supplementary rules to create a network of interconnected laws, enabling complex query handling.
- Multilingual Capabilities: Translating legal texts and metadata into multiple languages for cross-border applicability, particularly focusing on South Korean laws in a global context.
3.3 Future Enhancements
With this foundation, my next steps will include integrating the dataset into RAG frameworks for efficient document retrieval. The goal is to create a robust system where users can query specific laws, amendments, or precedents and receive accurate, contextually aware answers.
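As a first pass at the retrieval side, here is a sketch using TF-IDF over article text; it stands in for whatever embedding model the final RAG system will use and only illustrates the query-to-article lookup:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv('refined_law_dataset.csv').dropna(subset=['조문내용'])

# Character n-grams work reasonably well for Korean without a morpheme tokenizer
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
matrix = vectorizer.fit_transform(df['조문내용'])

def retrieve(query: str, k: int = 3) -> pd.DataFrame:
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    top = scores.argsort()[::-1][:k]          # indices of the k best matches
    return df.iloc[top][['법령명', '조문번호', '조문내용']]

print(retrieve('재외국민 등록'))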
4. Data Conversion Summary
The original JSON data was processed into a simplified structure using the following steps:
- Extracting Core Fields: Focused on critical elements such as law name, dates, articles, clauses, and their hierarchical relationships.
- Handling Nested Structures: Flattened the nested JSON into rows for CSV and JSON Lines (JSONL) formats to make the data more accessible.
- Preprocessing: Cleaned text, standardized terms, and ensured consistency across all entries.
Below is a summarized version of the code used for the conversion process:
import json
import glob
import pandas as pd

def extract_law_data(data):
    """Flatten one law's JSON into rows of the core fields."""
    rows = []
    if '법령' not in data:
        return rows
    law = data['법령']

    # Metadata shared by every article of this law
    basic_info = law.get('기본정보', {})
    law_name = basic_info.get('법령명_한글', '')
    proclamation_date = basic_info.get('공포일자', '')
    enforcement_date = basic_info.get('시행일자', '')
    jurisdiction = basic_info.get('소관부처', {}).get('content', '')

    # One row per article unit (조문단위)
    for clause in law.get('조문', {}).get('조문단위', []):
        rows.append([
            law_name, proclamation_date, enforcement_date, jurisdiction,
            clause.get('조문번호', ''),
            clause.get('조문제목', ''),
            clause.get('조문내용', ''),
        ])
    return rows

# Gather rows from every downloaded law file
json_files = glob.glob('data/law_details/*.json')
all_rows = []
for file in json_files:
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    all_rows.extend(extract_law_data(data))

df = pd.DataFrame(all_rows, columns=[
    '법령명', '공포일자', '시행일자', '소관부처',
    '조문번호', '조문제목', '조문내용'
])
# utf-8-sig adds a BOM so Hangul displays correctly when the CSV is opened in Excel
df.to_csv('refined_law_dataset.csv', index=False, encoding='utf-8-sig')
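
Since the steps above also mention JSON Lines, the same DataFrame exports to JSONL with one more call:

# orient='records' + lines=True gives one JSON object per line;
# force_ascii=False keeps Hangul readable in the output
df.to_json('refined_law_dataset.jsonl', orient='records', lines=True, force_ascii=False)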

This dataset, available on Hugging Face, represents the first step in building a comprehensive legal AI framework.

Let me know if you’d like to explore any section further!