Hierarchical Tidy Data and Data Transformation

Nobuyuki SAMBUICHI
ISO/TC295 Audit data services/SG1 Semantic model Convener

Hierarchical Tidy Data is an important concept in data analysis and information management, with characteristics that set it apart from conventional relational databases and simple Tidy Data. This article describes these characteristics and explains the standard CSV format based on Hierarchical Tidy Data and its related technologies.

This article serves as an introduction. For more detailed information, please refer to “The New Era of Data Conversion: Data Binding through Hierarchical Tidy Data”.

1. Characteristics of Hierarchical Tidy Data

Hierarchical Tidy Data extends Hadley Wickham’s Tidy Data concept to represent data structures that span multiple levels of hierarchy.

2.3. Tidy data
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

Each variable forms a column.

Each observation forms a row.

Each type of observational unit forms a table.

Wickham, H. (2014). Tidy Data. Journal of Statistical Software.

Traditional Tidy Data groups the same kinds of observations into one table. In contrast, Hierarchical Tidy Data accommodates various kinds of observations within a single table.

Expanding on Tidy Data, Hierarchical Tidy Data provides a framework for observational units spanning various levels. It underscores the ability of an observational unit to cover diverse observations. For instance, in a digital invoice, it might represent both the document header and line items.

The format simplifies data representation, condensing it into a single table and eliminating the need for JOIN operations. This efficiency streamlines data handling and promotes consistent organization and analysis of unstructured data. The outcome? Easier data analysis, visualization, and a unified structure for diverse datasets.

In conventional relational databases or simple Tidy Data, data is managed in a flat two-dimensional structure. Although appropriate for some data analysis scenarios, it can pose constraints when dealing with data that has a complex hierarchical structure. In contrast, Hierarchical Tidy Data is a unique data model that retains data hierarchy while enabling analysis simply by specifying conditions, without the need for relational database processing.

Hierarchical Tidy Data offers a way to represent data more clearly and efficiently. Compared to traditional relational databases or simple Tidy Data, Hierarchical Tidy Data can reflect the hierarchical relationships within the data, which proves highly effective when dealing with structurally complex data.

Consider the example of invoice data. In the traditional Tidy Data format, each item on the invoice (such as invoice number, issue date, seller, buyer, etc.) is treated as an individual column. However, invoices inherently possess hierarchical relationships between items. For instance, each line item belongs to the invoice as a whole and makes up a part of it. Hierarchical Tidy Data is well-suited for expressing such relationships.

Taking invoice data as an example, below is a comparison between the traditional flat data structure and how it would be represented using Hierarchical Tidy Data:

In the traditional flat data structure, the header fields and the line item fields are placed side by side in every row. For example:

Table 1. **Table of Invoice:**
Invoice ID	Issue Date	Seller	Buyer	Document Total	Item ID	Item Name	Unit Price	Quantity	Line Amount
001	2023-08-05	Corporation A	Corporation B	5000	Item01	Product A	1000	2	2000
001	2023-08-05	Corporation A	Corporation B	5000	Item02	Product B	3000	1	3000

Although the representation in Table 1 is simple and seemingly straightforward, it is redundant because the same invoice ID and header information (issue date, seller, buyer, document total) are repeated on every row. It also does not express the hierarchical relationship between the invoice as a whole and its line items.

It is also possible to represent invoice data in the following way using conventional Tidy Data:

Table 2. **Table of Invoice Header Information:**
Invoice ID	Issue Date	Seller	Buyer	Document Total
001	2023-08-05	Corporation A	Corporation B	5000

Table 3. **Table of Invoice Item Information:**
Invoice ID	Item ID	Item Name	Unit Price	Quantity	Line Amount
001	Item01	Product A	1000	2	2000
001	Item02	Product B	3000	1	3000

In the representation shown in Table 2 and Table 3, each level of the hierarchy is managed in a separate table, which maintains the hierarchical relationships while avoiding data redundancy. However, using the data requires relational database operations such as JOIN, so the tables are not readily usable as they are.
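The JOIN step that Tables 2 and 3 require can be sketched as follows. This is only an illustration in plain Python, assuming the two tables are already loaded as lists of dictionaries; the function name `join_on_invoice_id` is hypothetical and is not part of any tool mentioned in this article.

```python
# Table 2 (header) and Table 3 (line items) as lists of dicts.
headers = [
    {"Invoice ID": "001", "Issue Date": "2023-08-05", "Seller": "Corporation A",
     "Buyer": "Corporation B", "Document Total": 5000},
]
items = [
    {"Invoice ID": "001", "Item ID": "Item01", "Item Name": "Product A",
     "Unit Price": 1000, "Quantity": 2, "Line Amount": 2000},
    {"Invoice ID": "001", "Item ID": "Item02", "Item Name": "Product B",
     "Unit Price": 3000, "Quantity": 1, "Line Amount": 3000},
]


def join_on_invoice_id(headers, items):
    """Inner join: attach the matching header fields to every line item."""
    header_by_id = {h["Invoice ID"]: h for h in headers}
    return [
        {**header_by_id[i["Invoice ID"]], **i}
        for i in items
        if i["Invoice ID"] in header_by_id
    ]


joined = join_on_invoice_id(headers, items)
for row in joined:
    print(row["Invoice ID"], row["Seller"], row["Item ID"], row["Line Amount"])
```

The joined rows reproduce the flat layout of Table 1, with the header fields repeated on every line, which is the redundancy the separate tables were meant to avoid.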

Hierarchical Tidy Data is designed to accurately represent information across multiple hierarchies. It maintains the hierarchical structure while eliminating redundancy. For instance, Hierarchical Tidy Data expresses the hierarchical relationship between the invoice as a whole and its line items using column structures. Table 4 shows an example:

Table 4. **Table of Invoice in Hierarchical Tidy Data:**
Invoice ID	Issue Date	Seller	Buyer	Document Total	Item ID	Item Name	Unit Price	Quantity	Line Amount
001	2023-08-05	Corporation A	Corporation B	5000
001					Item01	Product A	1000	2	2000
001					Item02	Product B	3000	1	3000

In the row with the invoice header information, the line item data is blank (or null). In the line item rows, the Invoice ID and Item ID are defined, representing a hierarchical structure indicating that the line items belong to the specified header. In these line item rows, the header information is blank (or null). This method eliminates data redundancy while retaining hierarchical relationships between data.
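As a minimal sketch of this layout, the following Python code builds the header row and the line item rows from a nested invoice and then selects rows simply by testing which columns are filled. The nested `invoice` dictionary, the `"lines"` key, and the helper names are assumptions made for illustration; the column names follow Table 1.

```python
import csv
import io

COLUMNS = ["Invoice ID", "Issue Date", "Seller", "Buyer", "Document Total",
           "Item ID", "Item Name", "Unit Price", "Quantity", "Line Amount"]


def to_hierarchical_rows(invoice):
    """Return one header row followed by one row per line item."""
    header = {k: v for k, v in invoice.items() if k != "lines"}
    rows = [header]
    for line in invoice["lines"]:
        # A line row carries only the parent key plus its own fields.
        rows.append({"Invoice ID": invoice["Invoice ID"], **line})
    return rows


def is_header(row):
    """A row with no Item ID describes the invoice as a whole."""
    return not row.get("Item ID")


invoice = {
    "Invoice ID": "001", "Issue Date": "2023-08-05",
    "Seller": "Corporation A", "Buyer": "Corporation B", "Document Total": 5000,
    "lines": [
        {"Item ID": "Item01", "Item Name": "Product A",
         "Unit Price": 1000, "Quantity": 2, "Line Amount": 2000},
        {"Item ID": "Item02", "Item Name": "Product B",
         "Unit Price": 3000, "Quantity": 1, "Line Amount": 3000},
    ],
}

rows = to_hierarchical_rows(invoice)

# Write the rows as CSV; columns a row does not define are left blank.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())

# Analysis by condition only, with no JOIN.
header_rows = [r for r in rows if is_header(r)]
line_rows = [r for r in rows if not is_header(r)]
line_total = sum(r["Line Amount"] for r in line_rows if r["Invoice ID"] == "001")
```

Here the sum of the line amounts for invoice 001 can be compared directly with the Document Total on the header row, because both kinds of rows live in the same table.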

The above is an example of the differences between traditional data structures and Hierarchical Tidy Data.

The standard CSV format based on Hierarchical Tidy Data can make data more understandable and manageable by reflecting such hierarchical relationships in its column structure. Moreover, by using semantic binding and syntactic binding, it’s possible to convert between the standard CSV format and specific CSV formats, facilitating smooth data exchange.

By aligning these bindings with taxonomies (classification systems), it is possible to ensure data compatibility across different software and platforms.

2. Standard format CSV and Data Binding

Hierarchical Tidy Data used in standard format CSV provides a foundation for managing data in a consistent format. This standardized structure allows for the application of semantic binding and syntactic binding, which link data semantics (meaning) and syntax (structure).

Specifically, semantic binding allows for mutual conversion between proprietary CSV files provided by accounting software and standard format CSV, while maintaining the semantic content of the data (Python programs csv2tidy and tidy2csv). This means that the standard format CSV plays a role as a kind of “interpreter,” facilitating data exchange between different formats.
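The idea of semantic binding can be illustrated with a small mapping between column names. The sketch below is not the implementation of csv2tidy or tidy2csv; the proprietary column names (`inv_no`, `date`, `vendor`, `customer`, `total`) and the function names are hypothetical, chosen only to show a conversion in both directions that keeps the values unchanged.

```python
# Hypothetical binding table: proprietary column name -> standard column name.
SEMANTIC_BINDING = {
    "inv_no": "Invoice ID",
    "date": "Issue Date",
    "vendor": "Seller",
    "customer": "Buyer",
    "total": "Document Total",
}


def to_standard(row, binding=SEMANTIC_BINDING):
    """Rename proprietary columns to their standard equivalents."""
    return {binding[k]: v for k, v in row.items() if k in binding}


def to_proprietary(row, binding=SEMANTIC_BINDING):
    """Rename standard columns back to the proprietary names."""
    reverse = {std: prop for prop, std in binding.items()}
    return {reverse[k]: v for k, v in row.items() if k in reverse}


proprietary_row = {"inv_no": "001", "date": "2023-08-05",
                   "vendor": "Corporation A", "customer": "Corporation B",
                   "total": "5000"}

standard_row = to_standard(proprietary_row)
round_trip = to_proprietary(standard_row)
print(standard_row)
```

Because the binding is a one-to-one mapping in this sketch, converting to the standard names and back returns the original row.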

Similarly, syntactic binding allows for mutual conversion between standard format CSV and XML files that express the same semantic content with different syntax rules (Java programs Invoice2csv and Csv2invoice).
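Syntactic binding can be sketched in the same spirit. The example below is written in Python for consistency with the other sketches and is not the Invoice2csv or Csv2invoice program; the XML element names (`Invoice`, `ID`, `IssueDate`, `Line`, `ItemID`, `LineAmount`) and the two binding tables are assumptions for illustration, not taken from any real invoice schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML invoice; the element names are illustrative only.
XML = """<Invoice>
  <ID>001</ID>
  <IssueDate>2023-08-05</IssueDate>
  <Line><ItemID>Item01</ItemID><LineAmount>2000</LineAmount></Line>
  <Line><ItemID>Item02</ItemID><LineAmount>3000</LineAmount></Line>
</Invoice>"""

# Hypothetical bindings: XML element name -> standard CSV column name.
HEADER_BINDING = {"ID": "Invoice ID", "IssueDate": "Issue Date"}
LINE_BINDING = {"ItemID": "Item ID", "LineAmount": "Line Amount"}


def xml_to_rows(xml_text):
    """Flatten one invoice into a header row plus one row per line item."""
    root = ET.fromstring(xml_text)
    invoice_id = root.findtext("ID")
    rows = [{col: root.findtext(tag) for tag, col in HEADER_BINDING.items()}]
    for line in root.findall("Line"):
        row = {"Invoice ID": invoice_id}
        row.update({col: line.findtext(tag) for tag, col in LINE_BINDING.items()})
        rows.append(row)
    return rows


rows_from_xml = xml_to_rows(XML)
for row in rows_from_xml:
    print(row)
```

The nesting of `Line` inside `Invoice` in the XML becomes the header row and line item rows of the hierarchical table, linked by the Invoice ID column.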

The diagram below shows the relationship between this data and processing.

Figure 1. Semantic binding, Syntax binding, and Standard CSV format based on Hierarchical Tidy Data

Figure 1 shows how semantic binding and syntactic binding play a central role in data conversion. Proprietary CSV from Accounting Software is converted to standard format CSV using semantic binding, and vice versa. Similarly, XML files with different syntax rules are converted to standard format CSV using syntactic binding, and vice versa.

3. Cooperation with Taxonomy

By linking this Hierarchical Tidy Data-based standard format CSV and its surrounding semantic and syntactic bindings with a taxonomy, it is possible to enhance the reliability and consistency of data exchange. A taxonomy is like a dictionary that defines the relationship between the meaning and structure of data. Using this dictionary automates the interpretation and exchange of data, making it more efficient and reliable.

4. Conclusion

Hierarchical Tidy Data and its related technologies offer a new paradigm for data exchange. They enable consistent data exchange through standard format CSV, even between systems with different data formats and syntax rules, providing a powerful means to enhance the value of data utilization. In the future, these technologies are expected to be at the heart of a data-driven economy and society and to become a source of new value.

Through the design and implementation of bindings centered on Hierarchical Tidy Data, data standardization and exchange are realized, which promotes the effective use and sharing of data.