XBRL GL Palette Taxonomy Parser

This article introduces a Python-based parser to extract a logical hierarchical model (LHM) structure from the XBRL GL taxonomy. The parser also retrieves multilingual labels and documentation from the label linkbase. The output is a structured CSV file useful for semantic analysis, implementation, and documentation.

1. Motivation

The XBRL Global Ledger (XBRL GL) Palette taxonomy defines an XML-based standard for representing accounting and audit data. However, its hierarchical structure—especially when modularised—can be difficult to navigate, particularly when multilingual labels are defined using labelArc.

This script provides a bridge between raw schema definitions and a friendly CSV format enriched with English and localised labels (e.g., Japanese).

2. What This Script Does

  • Loads all gl-*.xsd and gl-*-content.xsd schemas

  • Detects complexType definitions with anyType base as tuples

  • Extracts all element names, types, and cardinality

  • Extracts labels from label.xml and label-ja.xml via labelArc

  • Supports fallback resolution of label identifiers

  • Outputs a fully annotated CSV representing the logical structure defined by complexType and complexContent/xs:sequence declarations in the schema
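The label extraction step in the list above can be sketched as follows. This is a minimal illustration, not the script itself: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages, and the linkbase fragment SAMPLE and the function name resolve_labels are hypothetical. It shows how a loc, a labelArc, and label resources combine to attach texts to an element id such as gl-cor_documentInfo.

```python
import xml.etree.ElementTree as ET

LINK = "http://www.xbrl.org/2003/linkbase"
XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical, hand-written linkbase fragment (not copied from the taxonomy).
SAMPLE = f"""
<link:linkbase xmlns:link="{LINK}" xmlns:xlink="{XLINK}">
  <link:labelLink>
    <link:loc xlink:label="loc_1" xlink:href="gl-cor-2016-12-01.xsd#gl-cor_documentInfo"/>
    <link:label xlink:label="lbl_1" xlink:role="http://www.xbrl.org/2003/role/label">Document Information</link:label>
    <link:label xlink:label="lbl_1" xlink:role="http://www.xbrl.org/2003/role/documentation">Parent for descriptive information.</link:label>
    <link:labelArc xlink:from="loc_1" xlink:to="lbl_1"/>
  </link:labelLink>
</link:linkbase>
"""

def resolve_labels(xml_text):
    """Return {element id: {"label": ..., "documentation": ...}}."""
    root = ET.fromstring(xml_text)

    # Step 1: locator label -> element id (the fragment after '#').
    locators = {}
    for loc in root.iter(f"{{{LINK}}}loc"):
        href = loc.get(f"{{{XLINK}}}href", "")
        if "#" in href:
            locators[loc.get(f"{{{XLINK}}}label")] = href.split("#", 1)[1]

    # Step 2: resource label -> texts keyed by the last segment of the role URI.
    resources = {}
    for res in root.iter(f"{{{LINK}}}label"):
        role = res.get(f"{{{XLINK}}}role", "")
        kind = role.rsplit("/", 1)[-1]
        key = res.get(f"{{{XLINK}}}label")
        resources.setdefault(key, {})[kind] = (res.text or "").strip()

    # Step 3: follow each arc from a locator to its label resources.
    result = {}
    for arc in root.iter(f"{{{LINK}}}labelArc"):
        target = locators.get(arc.get(f"{{{XLINK}}}from"))
        texts = resources.get(arc.get(f"{{{XLINK}}}to"))
        if target and texts:
            result.setdefault(target, {}).update(texts)
    return result

labels = resolve_labels(SAMPLE)
print(labels["gl-cor_documentInfo"]["label"])
```

Keying the result by the href fragment matches how the script later looks labels up, by replacing the colon in a QName with an underscore.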

3. Requirements

  • Python 3.7 or later

  • lxml library:

    pip install lxml

4. Usage Instructions

4.1. Command-Line Execution

python xbrl_gl_palette_parser.py --base-dir XBRL-GL-PWD-2016-12-01

4.2. Optional Parameters

Argument      Description
--base-dir    (Required) Path to the root directory of your XBRL GL taxonomy
--palette     Subdirectory name of the palette folder (default: case-c-b-m-u-e-t-s)
--lang        Language code for labels (default: ja). Accepts values like en, ja, etc.
--debug       Enable detailed debug logging
--trace       Enable top-level trace output
--output      Output CSV filename (default: XBRL_GL_Parsed_LHM_Structure.csv)

4.3. Example (in launch.json for VSCode)

"args": [
  "--base-dir", "XBRL-GL-PWD-2016-12-01",
  "--palette", "case-c-b",
  "--lang", "ja",
  "--debug",
  "--trace",
  "--output", "XBRL_GL_case-c-b_Structure.csv"
]

5. Input Directory Structure

Your XBRL GL taxonomy should be structured like this:

XBRL-GL-PWD-2016-12-01/
├── gl/
│   ├── cor/
│   │   ├── gl-cor-2016-12-01.xsd
│   │   └── lang/
│   │       ├── gl-cor-2016-12-01-label.xml
│   │       └── gl-cor-2016-12-01-label-ja.xml
│   ├── bus/
│   ├── muc/
│   └── ...
├── gl/plt/case-c-b/
│   ├── gl-cor-content-2016-12-01.xsd
│   └── ...
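As a rough sketch of how this layout maps to file paths, the helper below builds the three paths the parser needs for one module. The function name taxonomy_paths and its version parameter are assumptions made for illustration; the naming patterns themselves follow the directory tree above.

```python
import os

def taxonomy_paths(base_dir, palette, mod, lang="ja", version="2016-12-01"):
    """Return the schema, content schema, and label linkbase paths for one module."""
    mod_dir = os.path.join(base_dir, "gl", mod)
    # English labels live in label.xml; other languages use label-<lang>.xml.
    label_suffix = "label.xml" if lang == "en" else f"label-{lang}.xml"
    return {
        "schema": os.path.join(mod_dir, f"gl-{mod}-{version}.xsd"),
        "content": os.path.join(
            base_dir, "gl", "plt", palette, f"gl-{mod}-content-{version}.xsd"
        ),
        "labels": os.path.join(mod_dir, "lang", f"gl-{mod}-{version}-{label_suffix}"),
    }

paths = taxonomy_paths("XBRL-GL-PWD-2016-12-01", "case-c-b", "cor")
for kind, path in paths.items():
    # exists= makes a misplaced taxonomy folder easy to spot before parsing
    print(f"{kind:8} {path}  exists={os.path.exists(path)}")
```

Running this before the parser is a quick way to confirm that --base-dir and --palette point at the expected folders.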

6. Output

The script generates a CSV file:

Level,Element,Type,Path,isTuple,minOccurs,maxOccurs,BaseType,Label,Documentation,LocalLabel,LocalDocumentation
1,accountingEntries,gl-cor:accountingEntriesComplexType,/gl-cor:accountingEntries,True,1,unbounded,,Accounting Entries,Root for XBRL GL. No entry made here.,【会計仕訳】,XBRL GLのルート要素。 この要素にはデータは登録されない。
2,gl-cor:documentInfo,gl-cor:documentInfoComplexType,/gl-cor:accountingEntries/gl-cor:documentInfo,True,1,1,,Document Information,Parent for descriptive information about the accountingEntries section in which it is contained.,【文書情報】,この会計仕訳に関する情報の親タグ。
3,gl-cor:entriesType,gl-gen:entriesTypeItemType,/gl-cor:accountingEntries/gl-cor:documentInfo/gl-cor:entriesType,False,1,1,xbrli:tokenItemType,Document Type,"account: information to fill in a chart of accounts file.  
balance: the results of accumulation of a complete and validated list of entries for an account (or a list of account) in a specific period - sometimes called general ledger  
entries: a list of individual accounting entries, which might be posted/validated or nonposted/validated   
journal: a self-balancing (Dr = Cr) list of entries for a specific period including beginning balance for that period.  
ledger: a complete list of entries for a specific account (or list of accounts) for a specific period; note - debits do not have to equal credits.   
assets: a listing of open receivables, payables, inventory, fixed assets or other information that can be extracted from but are not necessarily included as part of a journal entry.  
trialBalance: the self-balancing (Dr = Cr) result of accumulation of a complete and validated list of entries for the entity in a complete list of accounts in a specific period. 

6.1. CSV Columns

Column               Meaning
Level                Depth level in the hierarchy
Element              QName (e.g. gl-cor:uniqueID)
Type                 Schema type (e.g. gl-cor:uniqueIDItemType)
Path                 Hierarchy path
isTuple              True if the type is a tuple
minOccurs            Minimum cardinality
maxOccurs            Maximum cardinality
BaseType             Underlying XBRL base type (e.g. xbrli:stringItemType)
Label                English label from label.xml
Documentation        English description
LocalLabel           Localised label (e.g. Japanese)
LocalDocumentation   Localised description
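A minimal sketch of consuming this CSV is shown below. The sample rows are shortened and hand-made (not real output), and the function name load_structure is hypothetical. It relies on the csv module to handle quoted multi-line Documentation fields, and it prints the hierarchy indented by Level.

```python
import csv
import io

def load_structure(fileobj):
    """Read rows from the parser's CSV and convert Level and isTuple."""
    rows = list(csv.DictReader(fileobj))
    for row in rows:
        row["Level"] = int(row["Level"])
        row["isTuple"] = row["isTuple"] == "True"
    return rows

# Shortened, hand-made sample using the documented header (not real output).
SAMPLE = (
    "Level,Element,Type,Path,isTuple,minOccurs,maxOccurs,BaseType,"
    "Label,Documentation,LocalLabel,LocalDocumentation\n"
    "1,accountingEntries,gl-cor:accountingEntriesComplexType,/gl-cor:accountingEntries,"
    "True,1,unbounded,,Accounting Entries,Root for XBRL GL.,,\n"
    "2,gl-cor:documentInfo,gl-cor:documentInfoComplexType,"
    "/gl-cor:accountingEntries/gl-cor:documentInfo,True,1,1,,Document Information,,,\n"
    "3,gl-cor:entriesType,gl-gen:entriesTypeItemType,"
    "/gl-cor:accountingEntries/gl-cor:documentInfo/gl-cor:entriesType,"
    "False,1,1,xbrli:tokenItemType,Document Type,\"line one\nline two\",,\n"
)

rows = load_structure(io.StringIO(SAMPLE, newline=""))
for row in rows:
    marker = "+" if row["isTuple"] else "-"
    print("  " * (row["Level"] - 1) + f"{marker} {row['Element']} ({row['Label']})")

# For a real file written by the script, open it with the same encoding:
# with open("XBRL-GL-2025/XBRL_GL_Parsed_LHM_Structure.csv",
#           newline="", encoding="utf-8-sig") as f:
#     rows = load_structure(f)
```

Opening the real file with encoding utf-8-sig matters because the script writes a byte order mark, which would otherwise end up inside the first column name.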

6.2. Notes

  • Tuples are determined by checking if complexType is based on anyType.

  • Localised labels (e.g. ja) can be extracted by using --lang ja.

  • The script is modular and extensible to support other taxonomies.
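The tuple check in the first note can be illustrated with a small self-contained sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, the schema fragment SAMPLE is hypothetical, and, as an assumption that differs slightly from the script, it compares only the local part of the base attribute so that both anyType and xs:anyType count.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
NS = {"xs": XS}

# Hypothetical schema fragment, written for illustration only.
SAMPLE = f"""
<xs:schema xmlns:xs="{XS}">
  <xs:complexType name="documentInfoComplexType">
    <xs:complexContent>
      <xs:restriction base="xs:anyType">
        <xs:sequence>
          <xs:element ref="gl-cor:entriesType"/>
        </xs:sequence>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="entriesTypeItemType">
    <xs:simpleContent>
      <xs:restriction base="xbrli:tokenItemType"/>
    </xs:simpleContent>
  </xs:complexType>
</xs:schema>
"""

def is_tuple(complex_type):
    """True when the complexType derives from anyType via complexContent."""
    content = complex_type.find("xs:complexContent", NS)
    if content is None:
        # simpleContent types are items, not tuples
        return False
    for tag in ("xs:restriction", "xs:extension"):
        inner = content.find(tag, NS)
        if inner is not None:
            base = inner.get("base", "")
            # Assumption: ignore the prefix and compare the local name only
            return base.split(":")[-1] == "anyType"
    return False

schema = ET.fromstring(SAMPLE)
result = {ct.get("name"): is_tuple(ct) for ct in schema.findall("xs:complexType", NS)}
print(result)
```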

7. Questions or Feedback?

If you have suggestions, encounter issues, or need support adapting the script to other taxonomies, feel free to comment on this page. Contributions and improvements are always welcome.

You can also fork the script or submit enhancements by referencing the source file:
Source: xbrl_gl_palette_parser.py (Google Drive)

#!/usr/bin/env python3
# coding: utf-8
"""
xbrl_gl_palette_parser.py
Parses XBRL Global Ledger (XBRL GL) taxonomy and extracts labeled hierarchical element structures into CSV format.

Designed by SAMBUICHI, Nobuyuki (Sambuichi Professional Engineers Office)
Written by SAMBUICHI, Nobuyuki (Sambuichi Professional Engineers Office)

Creation Date: 2025-04-02

MIT License

(c) 2025 SAMBUICHI, Nobuyuki (Sambuichi Professional Engineers Office)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Usage:
    python xbrl_gl_palette_parser.py --base-dir <taxonomy-root-directory> [--palette <palette-subdir>] [--lang <language-code>] [--debug] [--trace] [--output <filename>]

Arguments:
    --base-dir     Required. Path to the root of the XBRL GL taxonomy (e.g., XBRL-GL-PWD-2016-12-01).
    --palette      Optional. Subdirectory name of the palette folder (default: case-c-b-m-u-e-t-s).
    --lang         Optional. Language code for multilingual labels. Default is 'ja'.
    --debug        Optional. Enables detailed debug output.
    --trace        Optional. Enables trace messages.
    --output       Optional. Filename for the output CSV (default: XBRL_GL_Parsed_LHM_Structure.csv).

Example:
    python xbrl_gl_palette_parser.py --base-dir XBRL-GL-PWD-2016-12-01 --palette case-c-b --lang ja --debug --output my_labels.csv
"""

import lxml.etree as ET
import os
import re
import csv
import argparse
from collections import defaultdict

TRACE = True
DEBUG = True

def trace_print(text):
    if TRACE or DEBUG:
        print(text)

def debug_print(text):
    if DEBUG:
        print(text)

# Helper to clean label IDs
def clean_label_id(label_id):
    label_id = re.sub(r"^label_", "", label_id)
    label_id = re.sub(r"(_lbl|_\d+(_\d+)?)$", "", label_id)
    return label_id

# Argument parser for base directory
parser = argparse.ArgumentParser(description="Parse XBRL-GL schemas and extract labeled hierarchy.")
parser.add_argument("--palette", type=str, default="case-c-b-m-u-e-t-s", help="Palette subdirectory under gl/plt/ (e.g. case-c-b or case-c-b-m-u-e-t-s)")
parser.add_argument("--base-dir", type=str, required=True, help="Base directory path to XBRL GL taxonomy, e.g. XBRL-GL-PWD-2016-12-01")
parser.add_argument("--debug", action="store_true", help="Enable debug output")
parser.add_argument("--trace", action="store_true", help="Enable trace output")
parser.add_argument("--lang", type=str, default="ja", help="Language code for local labels (e.g. 'ja', 'en')")
parser.add_argument("--output", type=str, default="XBRL_GL_Parsed_LHM_Structure.csv", help="Output CSV filename")

args = parser.parse_args()
base_dir = args.base_dir
palette = args.palette
DEBUG = args.debug
TRACE = args.trace
LANG = args.lang
output_filename = args.output

xsd_path = os.path.join(base_dir, f"gl/plt/{palette}/gl-cor-content-2016-12-01.xsd")
namespaces = {
    'xs': "http://www.w3.org/2001/XMLSchema",
    'xbrli': "http://www.xbrl.org/2003/instance"
}
modules = ['gen', 'cor', 'bus', 'muc', 'usk', 'ehm', 'taf', 'srcd']

# Load base schemas and build type maps
element_type_map = {}
type_base_map = {}
type_base_lookup = {}
complex_type_lookup = {}
for mod in modules:
    path = os.path.join(base_dir, f"gl/{mod}/gl-{mod}-2016-12-01.xsd")
    if os.path.exists(path):
        tree = ET.parse(path)
        root = tree.getroot()
        for el in root.xpath("//xs:element", namespaces=namespaces):
            name, type_ = el.get("name"), el.get("type")
            if name and type_:
                # debug_print(f"gl-{mod}:{name}")
                element_type_map[f"gl-{mod}:{name}"] = type_
        for tdef in root.xpath("//xs:simpleType | //xs:complexType", namespaces=namespaces):
            name = tdef.get("name")
            if name:
                # debug_print(name)
                complex_type_lookup[name] = tdef
                restriction = tdef.find(".//xs:restriction", namespaces)
                if restriction is not None:
                    base = restriction.get("base")
                    if base:
                        type_base_map[name] = base
                        type_base_lookup[name] = base
                extension = tdef.find(".//xs:extension", namespaces)
                if extension is not None:
                    base = extension.get("base")
                    if base:
                        type_base_map[name] = base
                        type_base_lookup[name] = base

# Load content schemas
content_roots = {}
for mod in modules:
    path = os.path.join(base_dir, f"gl/plt/{palette}/gl-{mod}-content-2016-12-01.xsd")
    if os.path.exists(path):
        # Parse each content schema once and reuse the root
        root = ET.parse(path).getroot()
        content_roots[mod] = root
        for el in root.xpath("//xs:element", namespaces=namespaces):
            name, type_ = el.get("name"), el.get("type")
            if name and type_:
                # debug_print(f"gl-{mod}:{name}")
                element_type_map[f"gl-{mod}:{name}"] = type_
        for tdef in root.xpath("//xs:simpleType | //xs:complexType", namespaces=namespaces):
            name = tdef.get("name")
            if name:
                # debug_print(name)
                complex_type_lookup[name] = tdef
                restriction = tdef.find(".//xs:restriction", namespaces)
                if restriction is not None:
                    base = restriction.get("base")
                    if base:
                        type_base_map[name] = base
                        type_base_lookup[name] = base
                extension = tdef.find(".//xs:extension", namespaces)
                if extension is not None:
                    base = extension.get("base")
                    if base:
                        type_base_map[name] = base
                        type_base_lookup[name] = base

# Load label linkbases (EN and JA)
def load_labels(mod, lang):
    label_map = defaultdict(dict)
    suffix = "label.xml" if lang == "en" else f"label-{lang}.xml"
    path = os.path.join(base_dir, f"gl/{mod}/lang/gl-{mod}-2016-12-01-{suffix}")
    if not os.path.exists(path):
        return label_map
    tree = ET.parse(path)
    root = tree.getroot()
    ns = {'link': 'http://www.xbrl.org/2003/linkbase', 'xlink': 'http://www.w3.org/1999/xlink'}

    locator_map = {}
    label_resources = {}

    # Map locator label -> href target
    for loc in root.xpath(".//link:loc", namespaces=ns):
        label_id = loc.get("{http://www.w3.org/1999/xlink}label")
        href = loc.get("{http://www.w3.org/1999/xlink}href")
        # Split only after confirming the href carries a fragment identifier
        if label_id and href and '#' in href:
            locator_map[label_id] = href.split("#", 1)[1]

    # Collect label resources
    for label in root.xpath(".//link:label", namespaces=ns):
        label_id = label.get("{http://www.w3.org/1999/xlink}label")
        # A label without xlink:role would otherwise raise on endswith()
        role = label.get("{http://www.w3.org/1999/xlink}role") or ""
        label_text = label.text.strip() if label.text else ""
        if label_id not in label_resources:
            label_resources[label_id] = {}
        if role.endswith("label"):
            label_resources[label_id]["label"] = label_text
        elif role.endswith("documentation"):
            label_resources[label_id]["documentation"] = label_text


    # Resolve labelArcs and map labels to href anchors
    for arc in root.xpath(".//link:labelArc", namespaces=ns):
        from_label = arc.get("{http://www.w3.org/1999/xlink}from")
        to_label = arc.get("{http://www.w3.org/1999/xlink}to")
        href = locator_map.get(from_label)
        label = label_resources.get(to_label)
        if href and label is not None:
            # English texts use plain keys; other languages get a _<lang> suffix
            suffix = "" if lang == "en" else f"_{lang}"
            for kind in ("label", "documentation"):
                if kind in label:
                    label_map[href][f"{kind}{suffix}"] = label[kind]

    return label_map

label_texts = defaultdict(dict)
for mod in modules:
    labels = [load_labels(mod, "en")]
    if LANG != "en":
        labels.append(load_labels(mod, LANG))
    for label_map in labels:
        for k, v in label_map.items():
            label_texts[k].update(v)

# Helpers
def is_tuple_type(complex_type_element):
    if complex_type_element is None:
        return False
    if complex_type_element.find("xs:simpleContent", namespaces) is not None:
        return False
    complex_content = complex_type_element.find("xs:complexContent", namespaces)
    if complex_content is not None:
        for tag in ["xs:restriction", "xs:extension"]:
            inner = complex_content.find(tag, namespaces)
            if inner is not None:
                base = inner.get("base")
                return base == "anyType"
    return False

def resolve_base_type(type_str):
    type_name = type_str.split(":")[-1]
    return type_base_lookup.get(type_name, "")

# Traversal
records = []
def process_sequence(seq, _type, module, path, base, namespaces):
    debug_print(f" - Processing xs:sequence in path: /{path}")
    for el in seq.findall("xs:element", namespaces=namespaces):
        ref = el.get("ref")
        name = el.get("name")
        el_name = ref or name
        el_type = element_type_map.get(el_name, "")
        type_name = el_type.split(":")[-1]
        complex_type = complex_type_lookup.get(type_name)
        is_tuple = False
        if complex_type is not None:
            is_tuple = is_tuple_type(complex_type)

        path_str = f"gl-{module}:{path}" if "gl-" not in path else path
        new_path = f"{path_str}/{el_name}"
        min_occurs = el.get("minOccurs", "1")
        max_occurs = el.get("maxOccurs", "1")
        base_type = resolve_base_type(el_type) if not is_tuple and el_type else ""
        level = 1 + new_path.count("/")

        raw_key = el_name.replace(":", "_")
        label_info = label_texts.get(raw_key, {})

        # Local columns follow --lang instead of a hard-coded 'ja'
        record = {
            "Level": level,
            "Element": el_name,
            "Type": el_type,
            "Path": f"/{new_path}",
            "isTuple": is_tuple,
            "minOccurs": min_occurs,
            "maxOccurs": max_occurs,
            "BaseType": base_type,
            "Label": label_info.get("label", ""),
            "Documentation": label_info.get("documentation", ""),
            "LocalLabel": label_info.get(f"label_{LANG}", ""),
            "LocalDocumentation": label_info.get(f"documentation_{LANG}", "")
        }
        records.append(record)
        if not el_type:
            continue
        type_name = el_type.split(":")[-1]
        if is_tuple:
            mod = el_type.split(":")[0][3:]
            for _path in [
                os.path.join(base_dir, f"gl/{mod}/gl-{mod}-2016-12-01.xsd"),
                os.path.join(base_dir, f"gl/plt/{palette}/gl-{mod}-content-2016-12-01.xsd")
            ]:
                if os.path.exists(_path):
                    tree = ET.parse(_path)
                    nested = tree.xpath(f".//xs:complexType[@name='{type_name}']", namespaces=namespaces)
                    if nested:
                        walk_complex_type(type_name, nested[0], "tuple", mod, new_path, namespaces)
                        break

def walk_complex_type(name, element, _type, module, path, namespaces):
    if ":" not in path:
        trace_print(f"Walking {_type} type '{name}' at path: /gl-{module}:{path}")
    else:
        trace_print(f"Walking {_type}: '{name}' at path: /{path}")
    sequence = element.find("xs:sequence", namespaces)
    if sequence is not None:
        process_sequence(sequence, _type, module, path, name, namespaces)
        return
    complex_content = element.find("xs:complexContent", namespaces)
    if complex_content is not None:
        for tag in ["xs:restriction", "xs:extension"]:
            inner = complex_content.find(tag, namespaces)
            if inner is not None:
                base = inner.get("base")
                seq = inner.find("xs:sequence", namespaces)
                if seq is not None:
                    process_sequence(seq, _type, module, path, base, namespaces)
                return

# Start with root complexType
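# Guard against a missing gl-cor content schema instead of raising KeyError
root = content_roots.get("cor")
complex_type_list = (
    root.xpath(".//xs:complexType[@name='accountingEntriesComplexType']", namespaces=namespaces)
    if root is not None else []
)
if complex_type_list: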
    href = "gl-cor_accountingEntries"
    # Local columns follow --lang instead of a hard-coded 'ja'
    record = {
        "Level": 1,
        "Element": "accountingEntries",
        "Type": "gl-cor:accountingEntriesComplexType",
        "Path": "/gl-cor:accountingEntries",
        "isTuple": True,
        "minOccurs": "1",
        "maxOccurs": "unbounded",
        "BaseType": "",
        "Label": label_texts[href].get("label", ""),
        "Documentation": label_texts[href].get("documentation", ""),
        "LocalLabel": label_texts[href].get(f"label_{LANG}", ""),
        "LocalDocumentation": label_texts[href].get(f"documentation_{LANG}", "")
    }
    records.append(record)
    
    walk_complex_type("accountingEntriesComplexType", complex_type_list[0], "tuple", "cor", "accountingEntries", namespaces)
else:
    print("❌ Not found: accountingEntriesComplexType")

# Output to CSV
output_dir = "XBRL-GL-2025"
os.makedirs(output_dir, exist_ok=True)
output_file = os.path.join(output_dir, output_filename)

if records:
    with open(output_file, mode='w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
    # Report success only when a file was actually written
    print(f"\n✅ Saved parsed structure to: {output_file}")
else:
    print("⚠️ No records to write.")

