Views: 9
Hierarchical Tidy Data
1. Wickham’s Definition
Hadley Wickham defines tidy data as follows:
1. Each variable forms a column: Each variable in a dataset is placed in its own column.
2. Each observation forms a row: Each observation is placed in its own row.
3. Each type of observational unit forms a table: Each cell contains a single value.
This definition of tidy data facilitates data analysis and visualization and allows consistent data handling. It is important to note that tidy data imposes restrictions on the types of observations by specifying “each type of observational unit”. If we remove this restriction and just specify “observational units”,this means that various types of data can be handled in a consistent format, significantly enhancing the flexibility of data analysis.
2. Extension to Hierarchical Tidy Data
Hierarchical Tidy Data extends Wickham’s principles of tidy data to efficiently represent datasets with multiple hierarchies. This approach integrates data into a single tabular format while maintaining relationships between different levels of hierarchy. Traditionally, normalized multiple tables and JOIN operations were used to manipulate necessary information. In contrast, Hierarchical Tidy Data incorporates all necessary information into a single table, using rows to correspond to different observations, thereby eliminating data redundancy.
3. Example
The following table represents information about classes, students, subjects, and scores as hierarchical tidy data:
ClassID | Student ID | Subject ID | Class Name | Student Name | Subject | Score | Observation |
---|---|---|---|---|---|---|---|
NULL |
1 |
NULL |
NULL |
Alice |
NULL |
NULL |
Student Alice |
NULL |
2 |
NULL |
NULL |
Bob |
NULL |
NULL |
Student Bob |
NULL |
NULL |
1 |
NULL |
NULL |
Math |
NULL |
Subject Math |
NULL |
NULL |
2 |
NULL |
NULL |
Science |
NULL |
Subject Science |
1 |
NULL |
NULL |
Class 1 |
NULL |
NULL |
NULL |
Class 1 |
1 |
1 |
NULL |
NULL |
Alice |
NULL |
NULL |
Alice is in Class 1 |
1 |
1 |
1 |
NULL |
NULL |
NULL |
90 |
Test result of Alice’s score of Math is 90 in Class 1 |
1 |
1 |
2 |
NULL |
NULL |
NULL |
85 |
Test result of Alice’s score of Science is 85 in Class 1 |
2 |
NULL |
NULL |
Class 2 |
NULL |
NULL |
NULL |
Class 2 |
2 |
2 |
NULL |
NULL |
Bob |
NULL |
NULL |
Bob is in Class 2 |
2 |
2 |
1 |
NULL |
NULL |
NULL |
80 |
Test result of Bob’s score of Math is 80 in Class 2 |
2 |
2 |
2 |
NULL |
NULL |
NULL |
88 |
Test result of Bob’s score of Science is 88 in Class 2 |
In this table, information at each level (class, student, subject, score) is integrated, with each observation represented as a row. This eliminates data redundancy and allows efficient management of hierarchical relationships.
4. Explanation of Hierarchical Levels
-
Class Level:
ClassID
andClass Name
represent the details of each class. -
Student Level:
Student ID
represents the details of each student. -
Subject Level:
Subject
represents the details of each subject. -
Score Level:
Score
represents the test scores for a subject for each student within their respective class. -
Observation: Provides a human-readable description of the observation at each level, encapsulating the hierarchical relationship.
5. Unique Key Constraints Allowing Nulls
To maintain the integrity of hierarchical tidy data, unique key constraints must be defined at each level of the hierarchy, allowing nulls to accommodate the hierarchical nature. These constraints ensure that the combination of ClassID
, Student ID
, and Subject ID
uniquely identifies each observation.
-
Class Level:
-
Unique Key:
ClassID
-
Allows nulls in
Student ID
andSubject ID
to uniquely identify each class.
-
-
Student Level:
-
Unique Key:
Student ID
-
Allows nulls in
ClassID
andSubject ID
to uniquely identify each student.
-
-
Subject Level:
-
Unique Key:
Subject ID
-
Allows nulls in
ClassID
andStudent ID
to uniquely identify each subject.
-
-
Student in Class Level:
-
Unique Key: Composition of
ClassID
andStudent ID
-
Allows nulls in
Subject ID
to uniquely identifies each student within their respective class.
-
-
Score Level:
-
Unique Key: Composition of
ClassID
,Student ID
, andSubject ID
-
Uniquely identifies each test score for a subject for each student within their respective class.
-
6. Example Explanation
-
Class Level:
Class 1
withClassID = 1
andClass 2
withClassID = 2
are uniquely identified by theirClassID
. -
Student in Class Level:
Alice
withStudent ID = 1
inClass 1
withClassID = 1
andBob
withStudent ID = 2
inClass 2
withClassID = 2
are uniquely identified by the combination ofClassID
andStudent ID
. -
Score Level:
Alice’s
score of90
inMath
forClass 1
is uniquely identified by the combination ofClassID
,Student ID
,Subject ID
.
By allowing nulls in the unique key components, we ensure that each level’s identifiers uniquely identify the corresponding entity while maintaining the hierarchical relationships and avoiding redundancy.
Dimensions defined in the XBRL definition linkbase can represent this structure and generate hierarchical tidy data in a single CSV file using xBRL-CSV. This approach eliminates the need for traditional normalized multiple tables and joins in a relational database. All necessary data is integrated into one file, making it easy to handle without a database.
Additionally, historical data, such as records from 10 years ago, can be accessed without needing a database. This frees users from regular version updates and maintenance. Since all data, including master data, is stored as is, past data can be used without needing to restore and join past master data. This ensures that all information, including historical data, is consistently maintained, facilitating data analysis and reporting.
Leave a Reply