Hierarchical Tidy Data – 三分一技術士事務所

1. Wickham’s Definition

Hadley Wickham defines tidy data as follows:
1. Each variable forms a column: Each variable in a dataset is placed in its own column.
2. Each observation forms a row: Each observation is placed in its own row.
3. Each type of observational unit forms a table: Each cell contains a single value.

This definition of tidy data facilitates data analysis and visualization and allows consistent data handling. It is important to note that tidy data imposes restrictions on the types of observations by specifying “each type of observational unit”. If we remove this restriction and just specify “observational units”,this means that various types of data can be handled in a consistent format, significantly enhancing the flexibility of data analysis.

2. Extension to Hierarchical Tidy Data

Hierarchical Tidy Data extends Wickham’s principles of tidy data to efficiently represent datasets with multiple hierarchies. This approach integrates data into a single tabular format while maintaining relationships between different levels of hierarchy. Traditionally, normalized multiple tables and JOIN operations were used to manipulate necessary information. In contrast, Hierarchical Tidy Data incorporates all necessary information into a single table, using rows to correspond to different observations, thereby eliminating data redundancy.

3. Example

The following table represents information about classes, students, subjects, and scores as hierarchical tidy data:

ClassID	Student ID	Subject ID	Class Name	Student Name	Subject	Score	Observation
NULL	1	NULL	NULL	Alice	NULL	NULL	Student Alice
NULL	2	NULL	NULL	Bob	NULL	NULL	Student Bob
NULL	NULL	1	NULL	NULL	Math	NULL	Subject Math
NULL	NULL	2	NULL	NULL	Science	NULL	Subject Science
1	NULL	NULL	Class 1	NULL	NULL	NULL	Class 1
1	1	NULL	NULL	Alice	NULL	NULL	Alice is in Class 1
1	1	1	NULL	NULL	NULL	90	Test result of Alice’s score of Math is 90 in Class 1
1	1	2	NULL	NULL	NULL	85	Test result of Alice’s score of Science is 85 in Class 1
2	NULL	NULL	Class 2	NULL	NULL	NULL	Class 2
2	2	NULL	NULL	Bob	NULL	NULL	Bob is in Class 2
2	2	1	NULL	NULL	NULL	80	Test result of Bob’s score of Math is 80 in Class 2
2	2	2	NULL	NULL	NULL	88	Test result of Bob’s score of Science is 88 in Class 2

ClassID

Student ID

Subject ID

Class Name

Student Name

Subject

Score

Observation

NULL

Alice

NULL

Student Alice

NULL

Bob

NULL

Student Bob

NULL

Math

NULL

Subject Math

NULL

Science

NULL

Subject Science

NULL

Class 1

NULL

Class 1

NULL

Alice

NULL

Alice is in Class 1

NULL

Test result of Alice’s score of Math is 90 in Class 1

NULL

Test result of Alice’s score of Science is 85 in Class 1

NULL

Class 2

NULL

Class 2

NULL

Bob

NULL

Bob is in Class 2

NULL

Test result of Bob’s score of Math is 80 in Class 2

NULL

Test result of Bob’s score of Science is 88 in Class 2

In this table, information at each level (class, student, subject, score) is integrated, with each observation represented as a row. This eliminates data redundancy and allows efficient management of hierarchical relationships.

4. Explanation of Hierarchical Levels

Class Level: ClassID and Class Name represent the details of each class.
Student Level: Student ID represents the details of each student.
Subject Level: Subject represents the details of each subject.
Score Level: Score represents the test scores for a subject for each student within their respective class.
Observation: Provides a human-readable description of the observation at each level, encapsulating the hierarchical relationship.

5. Unique Key Constraints Allowing Nulls

To maintain the integrity of hierarchical tidy data, unique key constraints must be defined at each level of the hierarchy, allowing nulls to accommodate the hierarchical nature. These constraints ensure that the combination of ClassID, Student ID, and Subject ID uniquely identifies each observation.

Class Level:
- Unique Key: ClassID
- Allows nulls in Student ID and Subject ID to uniquely identify each class.
Student Level:
- Unique Key: Student ID
- Allows nulls in ClassID and Subject ID to uniquely identify each student.
Subject Level:
- Unique Key: Subject ID
- Allows nulls in ClassID and Student ID to uniquely identify each subject.
Student in Class Level:
- Unique Key: Composition of ClassID and Student ID
- Allows nulls in Subject ID to uniquely identifies each student within their respective class.
Score Level:
- Unique Key: Composition of ClassID, Student ID, and Subject ID
- Uniquely identifies each test score for a subject for each student within their respective class.

6. Example Explanation

Class Level: Class 1 with ClassID = 1 and Class 2 with ClassID = 2 are uniquely identified by their ClassID.
Student in Class Level: Alice with Student ID = 1 in Class 1　with ClassID = 1 and Bob with Student ID = 2 in Class 2 with ClassID = 2 are uniquely identified by the combination of ClassID and Student ID.
Score Level: Alice’s score of 90 in Math for Class 1 is uniquely identified by the combination of ClassID, Student ID, Subject ID.

By allowing nulls in the unique key components, we ensure that each level’s identifiers uniquely identify the corresponding entity while maintaining the hierarchical relationships and avoiding redundancy.

Dimensions defined in the XBRL definition linkbase can represent this structure and generate hierarchical tidy data in a single CSV file using xBRL-CSV. This approach eliminates the need for traditional normalized multiple tables and joins in a relational database. All necessary data is integrated into one file, making it easy to handle without a database.

Additionally, historical data, such as records from 10 years ago, can be accessed without needing a database. This frees users from regular version updates and maintenance. Since all data, including master data, is stored as is, past data can be used without needing to restore and join past master data. This ensures that all information, including historical data, is consistently maintained, facilitating data analysis and reporting.

Contents

1. Wickham’s Definition
2. Extension to Hierarchical Tidy Data
3. Example
4. Explanation of Hierarchical Levels
5. Unique Key Constraints Allowing Nulls
6. Example Explanation