Hierarchical Tidy Data

Views: 9

1. Wickham’s Definition

Hadley Wickham defines tidy data as follows:
1. Each variable forms a column: Each variable in a dataset is placed in its own column.
2. Each observation forms a row: Each observation is placed in its own row.
3. Each type of observational unit forms a table: Each cell contains a single value.

This definition of tidy data facilitates data analysis and visualization and allows consistent data handling. It is important to note that tidy data imposes restrictions on the types of observations by specifying “each type of observational unit”. If we remove this restriction and just specify “observational units”,this means that various types of data can be handled in a consistent format, significantly enhancing the flexibility of data analysis.

2. Extension to Hierarchical Tidy Data

Hierarchical Tidy Data extends Wickham’s principles of tidy data to efficiently represent datasets with multiple hierarchies. This approach integrates data into a single tabular format while maintaining relationships between different levels of hierarchy. Traditionally, normalized multiple tables and JOIN operations were used to manipulate necessary information. In contrast, Hierarchical Tidy Data incorporates all necessary information into a single table, using rows to correspond to different observations, thereby eliminating data redundancy.

3. Example

The following table represents information about classes, students, subjects, and scores as hierarchical tidy data:

ClassID Student ID Subject ID Class Name Student Name Subject Score Observation

NULL

1

NULL

NULL

Alice

NULL

NULL

Student Alice

NULL

2

NULL

NULL

Bob

NULL

NULL

Student Bob

NULL

NULL

1

NULL

NULL

Math

NULL

Subject Math

NULL

NULL

2

NULL

NULL

Science

NULL

Subject Science

1

NULL

NULL

Class 1

NULL

NULL

NULL

Class 1

1

1

NULL

NULL

Alice

NULL

NULL

Alice is in Class 1

1

1

1

NULL

NULL

NULL

90

Test result of Alice’s score of Math is 90 in Class 1

1

1

2

NULL

NULL

NULL

85

Test result of Alice’s score of Science is 85 in Class 1

2

NULL

NULL

Class 2

NULL

NULL

NULL

Class 2

2

2

NULL

NULL

Bob

NULL

NULL

Bob is in Class 2

2

2

1

NULL

NULL

NULL

80

Test result of Bob’s score of Math is 80 in Class 2

2

2

2

NULL

NULL

NULL

88

Test result of Bob’s score of Science is 88 in Class 2

In this table, information at each level (class, student, subject, score) is integrated, with each observation represented as a row. This eliminates data redundancy and allows efficient management of hierarchical relationships.

4. Explanation of Hierarchical Levels

  1. Class Level: ClassID and Class Name represent the details of each class.

  2. Student Level: Student ID represents the details of each student.

  3. Subject Level: Subject represents the details of each subject.

  4. Score Level: Score represents the test scores for a subject for each student within their respective class.

  5. Observation: Provides a human-readable description of the observation at each level, encapsulating the hierarchical relationship.

5. Unique Key Constraints Allowing Nulls

To maintain the integrity of hierarchical tidy data, unique key constraints must be defined at each level of the hierarchy, allowing nulls to accommodate the hierarchical nature. These constraints ensure that the combination of ClassID, Student ID, and Subject ID uniquely identifies each observation.

  1. Class Level:

    • Unique Key: ClassID

    • Allows nulls in Student ID and Subject ID to uniquely identify each class.

  2. Student Level:

    • Unique Key: Student ID

    • Allows nulls in ClassID and Subject ID to uniquely identify each student.

  3. Subject Level:

    • Unique Key: Subject ID

    • Allows nulls in ClassID and Student ID to uniquely identify each subject.

  4. Student in Class Level:

    • Unique Key: Composition of ClassID and Student ID

    • Allows nulls in Subject ID to uniquely identifies each student within their respective class.

  5. Score Level:

    • Unique Key: Composition of ClassID, Student ID, and Subject ID

    • Uniquely identifies each test score for a subject for each student within their respective class.

6. Example Explanation

  • Class Level: Class 1 with ClassID = 1 and Class 2 with ClassID = 2 are uniquely identified by their ClassID.

  • Student in Class Level: Alice with Student ID = 1 in Class 1 with ClassID = 1 and Bob with Student ID = 2 in Class 2 with ClassID = 2 are uniquely identified by the combination of ClassID and Student ID.

  • Score Level: Alice’s score of 90 in Math for Class 1 is uniquely identified by the combination of ClassID, Student ID, Subject ID.

By allowing nulls in the unique key components, we ensure that each level’s identifiers uniquely identify the corresponding entity while maintaining the hierarchical relationships and avoiding redundancy.

Dimensions defined in the XBRL definition linkbase can represent this structure and generate hierarchical tidy data in a single CSV file using xBRL-CSV. This approach eliminates the need for traditional normalized multiple tables and joins in a relational database. All necessary data is integrated into one file, making it easy to handle without a database.

Additionally, historical data, such as records from 10 years ago, can be accessed without needing a database. This frees users from regular version updates and maintenance. Since all data, including master data, is stored as is, past data can be used without needing to restore and join past master data. This ensures that all information, including historical data, is consistently maintained, facilitating data analysis and reporting.


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *