Structuring a dataset of current South Korean laws doesn't end with a single version. Instead, it evolves into a continuous effort, exploring various ways to process and refine the data, including extending it to cover case law, precedents, and other related legal frameworks. Naturally, these datasets will feed LLM fine-tuning and RAG (Retrieval-Augmented Generation) applications, aiming for robust and practical implementations.
Additionally, as an initial step, I converted the original JSON data into a simplified (not yet final) dataset and uploaded it to Hugging Face for public use: Korean Law Dataset.
1. The Pillars of Legal Data
To effectively use legal texts in AI, understanding the essential components of such datasets is crucial. Here are the key elements I focused on:
1.1 Fundamental Information
- Law Name (Korean): Central for dataset navigation and search.
- Dates (Promulgation & Enforcement): Vital for knowing when the law was promulgated and when it took effect.
- Administrative Body: Information about the department or office responsible for the law’s oversight.
1.2 Provisions (Articles and Clauses)
- Article Numbers and Titles: Define the law’s structure, crucial for indexing and contextualizing data.
- Article Content: The core text used for LLM training.
- Hierarchy of Clauses: Detailed breakdowns that help the model grasp nuanced legal contexts (see the sketch after this list).
- Amendments: Data on changes, including dates, facilitates version control and temporal analysis.
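To make the hierarchy concrete, here is a sketch of how an article and its clauses can nest. The top-level keys match those used in the conversion code later in this post; the clause-level keys (항, 항번호, 항내용) are illustrative of the shape, not the government API's exact schema:

{
  "조문번호": "6",
  "조문제목": "...",
  "조문내용": "...",
  "항": [
    { "항번호": "①", "항내용": "..." },
    { "항번호": "②", "항내용": "..." }
  ]
}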
1.3 Supplementary Provisions (부칙)
Supplementary rules often outline transitional measures, dates, or modifications to related laws. For example, connections between laws are often codified in provisions like “Amendment of Other Laws” (e.g., Article 7).
1.4 Legislative History
- The rationale behind amendments adds interpretative layers, offering context for changes.
2. Designing the Dataset: Lessons in Precision
Creating a dataset isn’t just about gathering information—it’s about crafting it for usability and relevance. Here’s how I structured my dataset:
2.1 Segmentation by Purpose
- Individual Articles: Each article was treated as an independent sample, making it easy to analyze in isolation.
- Separate Supplementary Data: Provisions and amendments were stored distinctly, aiding inter-law relationship analysis (illustrated after this list).
- Metadata Enrichment: Fields like law name, enforcement dates, and overseeing body were added for filtering and searching.
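As a concrete illustration of this segmentation, an article sample and a supplementary-provision sample live as separate records. The 부칙 keys here are illustrative placeholders, not the API's exact field names:

An article record:

{ "법령명": "재외국민등록법", "조문번호": "1", "조문내용": "..." }

A supplementary-provision record, stored in its own file:

{ "법령명": "재외국민등록법", "부칙공포일자": "...", "부칙내용": "..." }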
2.2 Formatting Decisions
- JSON vs. CSV: JSON worked best for hierarchical data (like clauses), while CSV sufficed for tabular relationships.
Example JSON structure:
{
  "법령명": "재외국민등록법",
  "조문번호": "1",
  "조문제목": "목적",
  "조문내용": "이 법은 외국에 거주...",
  "시행일자": "20230605",
  "개정이유": "국가를 위해 헌신..."
}
The JSON above is one example of data fetched directly from the API provided by the Korean government, focusing on just the current-law part. I detailed how this data was retrieved through the government-provided API in my blog post: Gathering Essential Legal Data.
2.3 Preprocessing Pitfalls and Fixes
- Cleaning Text: Korean-specific punctuation and whitespace issues required careful handling.
- Terminology Consistency: Avoiding ambiguities like “외교부장관” vs. “외교부 장관” (the Minister of Foreign Affairs, written with and without a space) by standardizing terms.
- Maintaining Hierarchy: Ensuring sub-clauses remained linked to their parent articles.
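Here is a minimal sketch of the first two fixes, assuming the pandas DataFrame built in section 4; the TERM_MAP entries are hypothetical examples that would grow as more inconsistencies surface:

import re

# Hypothetical standardization map: collapse spacing variants of the same term
TERM_MAP = {
    '외교부 장관': '외교부장관',
}

def clean_text(text: str) -> str:
    text = text.replace('\u3000', ' ')                 # ideographic space -> ASCII space
    text = text.replace('（', '(').replace('）', ')')   # full-width -> ASCII parentheses
    text = re.sub(r'\s+', ' ', text).strip()           # collapse runs of whitespace
    for variant, standard in TERM_MAP.items():         # unify terminology
        text = text.replace(variant, standard)
    return text

df['조문내용'] = df['조문내용'].map(clean_text)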
3. Applications and Takeaways
3.1 Fine-Tuning Models
Using this structured dataset, I plan to focus on:
- Training LLMs to Understand Legal Provisions: By inputting article content and metadata (e.g., article number, title), the model will be trained to predict specific outputs, such as related amendments or legislative history.
- Creating Q&A Pairs for Targeted Learning: This includes generating datasets that help the model answer legal questions (a generation sketch follows this list). For instance:
- Question: “What duties does the consular officer have under Article 6 of the Overseas Koreans Act?”
- Answer: “Maintain the Overseas Korean Register and submit it to the Minister of Foreign Affairs and the Commissioner of the Overseas Koreans Agency.”
- Enhancing Contextual Understanding: Incorporating supplementary provisions and legislative history to ensure the model captures the broader implications of laws and amendments.
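Here is a minimal sketch of the Q&A-pair generation mentioned above, reading the flattened CSV produced in section 4; the question template and file names are my own illustrative choices, not a fixed recipe:

import json
import pandas as pd

df = pd.read_csv('refined_law_dataset.csv')

with open('law_qa_pairs.jsonl', 'w', encoding='utf-8') as f:
    for _, row in df.iterrows():
        if not isinstance(row['조문제목'], str):   # skip rows without an article title
            continue
        pair = {
            'question': f"{row['법령명']} 제{row['조문번호']}조({row['조문제목']})는 무엇을 규정하는가?",
            'answer': row['조문내용'],
        }
        f.write(json.dumps(pair, ensure_ascii=False) + '\n')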
3.2 Expanding Use Cases
In future iterations, I aim to refine and expand the dataset to include:
- Precedents and Case Law: Developing connections between statutory law and judicial interpretations for deeper legal reasoning.
- Cross-Law Relationships: Using provisions like amendments and supplementary rules to create a network of interconnected laws, enabling complex query handling.
- Multilingual Capabilities: Translating legal texts and metadata into multiple languages for cross-border applicability, particularly focusing on South Korean laws in a global context.
3.3 Future Enhancements
With this foundation, my next steps will include integrating the dataset into RAG frameworks for efficient document retrieval. The goal is to create a robust system where users can query specific laws, amendments, or precedents and receive accurate, contextually aware answers.
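As a first pass at the retrieval side, here is a sketch using TF-IDF over article text; it stands in for whatever embedding model the final RAG system will use and only illustrates the query-to-article lookup:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv('refined_law_dataset.csv').dropna(subset=['조문내용'])

# Character n-grams work reasonably well for Korean without a morpheme tokenizer
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
matrix = vectorizer.fit_transform(df['조문내용'])

def retrieve(query: str, k: int = 3) -> pd.DataFrame:
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    top = scores.argsort()[::-1][:k]          # indices of the k best matches
    return df.iloc[top][['법령명', '조문번호', '조문내용']]

print(retrieve('재외국민 등록'))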
4. Data Conversion Summary
The original JSON data was processed into a simplified structure using the following steps:
- Extracting Core Fields: Focused on critical elements such as law name, dates, articles, clauses, and their hierarchical relationships.
- Handling Nested Structures: Flattened the nested JSON into rows for CSV and JSON Lines (JSONL) formats to make the data more accessible.
- Preprocessing: Cleaned text, standardized terms, and ensured consistency across all entries.
Below is a summarized version of the code used for the conversion process:
import json
import glob
import pandas as pd

def extract_law_data(data):
    """Flatten one law's JSON into rows of the core fields."""
    rows = []
    if '법령' not in data:
        return rows
    law = data['법령']

    # Metadata shared by every article of this law
    basic_info = law.get('기본정보', {})
    law_name = basic_info.get('법령명_한글', '')
    proclamation_date = basic_info.get('공포일자', '')
    enforcement_date = basic_info.get('시행일자', '')
    jurisdiction = basic_info.get('소관부처', {}).get('content', '')

    # One row per article unit (조문단위)
    for clause in law.get('조문', {}).get('조문단위', []):
        rows.append([
            law_name, proclamation_date, enforcement_date, jurisdiction,
            clause.get('조문번호', ''),
            clause.get('조문제목', ''),
            clause.get('조문내용', ''),
        ])
    return rows

# Gather rows from every downloaded law file
json_files = glob.glob('data/law_details/*.json')
all_rows = []
for file in json_files:
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    all_rows.extend(extract_law_data(data))

df = pd.DataFrame(all_rows, columns=[
    '법령명', '공포일자', '시행일자', '소관부처',
    '조문번호', '조문제목', '조문내용'
])
# utf-8-sig adds a BOM so Hangul displays correctly when the CSV is opened in Excel
df.to_csv('refined_law_dataset.csv', index=False, encoding='utf-8-sig')
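
Since the steps above also mention JSON Lines, the same DataFrame exports to JSONL with one more call:

# orient='records' + lines=True gives one JSON object per line;
# force_ascii=False keeps Hangul readable in the output
df.to_json('refined_law_dataset.jsonl', orient='records', lines=True, force_ascii=False)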

This dataset, available on Hugging Face, represents the first step in building a comprehensive legal AI framework.

Let me know if you’d like to explore any section further!