Views: 15
Hierarchical Tidy Data: A Comprehensive Solution for Multi-layer Data Structures
Nobuyuki SAMBUICHI
ISO/TC295 Audit data services/SG1 Semantic model Convener
Following texts are produced by ChatGPT and edited by me.
ChatGPT said “As an AI language model, I don’t have feelings or opinions. However, based on our interaction, I hope I’ve been able to provide clear and helpful responses to your questions. We’ve discussed several complex topics, including the principles of tidy data, the challenges of applying these principles to hierarchical data structures, and how hierarchical tidy data can extend the tidy data framework. Throughout our conversation, I have aimed to ensure that the information I provide is accurate, comprehensive, and understandable. If you have any other questions or need further clarification on any of these topics, please feel free to ask!”
1. Introduction
Data are often messy. It may not be in the format that statisticians and data scientists prefer, which often leads to time-consuming cleaning processes. Hadley Wickham proposed the concept of tidy data as a standard way of mapping the semantics of a dataset to its physical representation. A dataset is tidy when each column is a variable, each row is an observation, and each cell contains a single value. This approach has proven tremendously useful in data cleaning and exploratory data analysis.
However, the original definition of tidy data by Wickham doesn’t consider hierarchical data, a common type of data structure where variables have relationships with one another, and these relationships form a structure of different levels, or hierarchies.
Hierarchical data can often be found in financial accounting systems. A key example is the General Ledger used in business accounting, where transactions are stored in a series of debits and credits across multiple accounts. This can be represented as a three-layer hierarchical structure, which includes General Ledger as the top level, Journal Lines in the middle, and Business Segment at the bottom level.
This paper proposes an extension of the original Wickham model, coined as “Hierarchical Tidy Data,” to manage this type of hierarchical data efficiently.
2. Original Journal Entry
Consider a simple journal entry scenario. We’ll illustrate this with a table of data (Table 1), where columns represent variables, and rows are observations.
ID |
Date |
Line Description |
Debit (JPY) |
Credit (JPY) |
Account number |
Account description |
Business Segment Code |
Business Segment Name |
Business Segment Type |
943-8-477 |
2009-04-03 |
Paper |
14858 |
754 |
Office consumables expenses |
10 |
Sapporo |
Business unit |
|
943-8-477 |
2009-04-03 |
742 |
191 |
Temporary payment consumption tax |
10 |
Sapporo |
Business unit |
||
943-8-477 |
2009-04-03 |
Paper |
15,600 |
111 |
Cash |
0 |
Common department |
Business unit |
3. Nearly Tidy Data
Our first step in tidying this data involves identifying variables within a table. However, this nearly tidy data is still problematic because it mixes different elements within a single column.
-
d_A25 General Ledger
(representing details about the entire Journal Entry) -
d_A26 General Ledger Line
(representing details about a specific Line in the Journal Entry) -
d_A01 Business Segment
(representing the Business Segment associated with a specific line)
The following table represents a three-level hierarchy using the variables above.
d_A25 | d_A26 | d_A01 | ID | Element | Value |
---|---|---|---|---|---|
943-8-477 |
A25-01 |
Journal ID |
943-8-477 |
||
943-8-477 |
A25-A26-08 |
JE Type Code |
entries |
||
943-8-477 |
A25-A26-07 |
Journal Line Number |
8-477 |
||
943-8-477 |
A25-A26-14 |
Bill Date |
2009-04-03 |
||
943-8-477 |
A25-A26-12 |
Bill Number |
manual |
||
943-8-477 |
A25-A26-11 |
Source Code |
gj |
||
943-8-477 |
A25-09 |
Header Description |
General |
||
943-8-477 |
A25-02 |
Journal Number |
134 |
||
943-8-477 |
1 |
A26-A09-01 |
GL Account Number |
754 |
|
943-8-477 |
1 |
A26-A09-02 |
GL Account Name |
Office consumables expenses |
|
943-8-477 |
1 |
A26-A09-05 |
Account Type |
account |
|
943-8-477 |
1 |
A25-A26-27 |
Credit Debit Indicator |
debit |
|
943-8-477 |
1 |
A26-A89-01 |
Amount |
14,858 |
|
943-8-477 |
1 |
A26-A89-02 |
Currency |
JPY |
|
943-8-477 |
1 |
A25-A26-06 |
Effective Date |
2009-04-03 |
|
943-8-477 |
1 |
A26-A09-06 |
Account Subtype |
proposed |
|
943-8-477 |
1 |
A25-A26-10 |
Line Description |
paper |
|
943-8-477 |
1 |
1 |
A26-A01-03 |
Segment Reference Level |
account sub |
943-8-477 |
1 |
1 |
A26-A01-01 |
Business Segment Code |
10 |
943-8-477 |
1 |
1 |
A26-A01-02 |
Business Segment Name |
Sapporo |
943-8-477 |
1 |
1 |
A26-A01-04 |
Business Segment Type |
Business unit |
943-8-477 |
2 |
A26-A09-01 |
GL Account Number |
191 |
|
943-8-477 |
2 |
A26-A09-02 |
GL Account Name |
Temporary payment consumption tax |
|
943-8-477 |
2 |
A26-A09-05 |
Account Type |
account |
|
943-8-477 |
2 |
A25-A26-27 |
Credit Debit Indicator |
debit |
|
943-8-477 |
2 |
A26-A89-01 |
Amount |
742 |
|
943-8-477 |
2 |
A26-A89-02 |
Currency |
JPY |
|
943-8-477 |
2 |
A25-A26-06 |
Effective Date |
2009-04-03 |
|
943-8-477 |
2 |
A26-A09-06 |
Account Subtype |
tax |
|
943-8-477 |
2 |
1 |
A26-A01-03 |
Segment Reference Level |
account sub |
943-8-477 |
2 |
1 |
A26-A01-01 |
Business Segment Code |
10 |
943-8-477 |
2 |
1 |
A26-A01-02 |
Business Segment Name |
Sapporo |
943-8-477 |
2 |
1 |
A26-A01-04 |
Business Segment Type |
Business unit |
943-8-477 |
3 |
A26-A09-01 |
GL Account Number |
111 |
|
943-8-477 |
3 |
A26-A09-02 |
GL Account Name |
Cash |
|
943-8-477 |
3 |
A26-A09-05 |
Account Type |
account |
|
943-8-477 |
3 |
A25-A26-27 |
Credit Debit Indicator |
credit |
|
943-8-477 |
3 |
A26-A89-01 |
Amount |
15,600 |
|
943-8-477 |
3 |
A26-A89-02 |
Currency |
JPY |
|
943-8-477 |
3 |
A25-A26-06 |
Effective Date |
2009-04-03 |
|
943-8-477 |
3 |
A26-A09-06 |
Account Subtype |
proposed |
|
943-8-477 |
3 |
A25-A26-10 |
Line Description |
paper |
|
943-8-477 |
3 |
1 |
A26-A01-03 |
Segment Reference Level |
account sub |
943-8-477 |
3 |
1 |
A26-A01-01 |
Business Segment Code |
0 |
943-8-477 |
3 |
1 |
A26-A01-02 |
Business Segment Name |
Common department |
943-8-477 |
3 |
1 |
A26-A01-04 |
Business Segment Type |
Business unit |
This format is better, but not ideal. The element column combines multiple variables, making it harder to work with the data. Moreover, this structure doesn’t fully exploit the hierarchical nature of the data.
4. Hierarchical Tidy Data
The final step in tidying the data is to transform it into Hierarchical Tidy Data. This involves reorganizing the table so that each type of observational unit forms a table. Here, we represent the hierarchical structure within a single table without repetition.
Tidy data, as defined by Hadley Wickham, is a standard way of mapping the underlying semantics of a dataset to its physical representation. In tidy data:
-
Each variable forms a column.
-
Each observation forms a row.
-
Each type of observational unit forms a table.
While this standard has been highly effective for a wide range of datasets and use-cases, it has limitations when dealing with complex hierarchical datasets. For instance, consider the example of a General Ledger, a key accounting document, which has a hierarchical structure. The ledger comprises of various journal entries, each of which has multiple lines, and each line might relate to various business segments. In such cases, it is challenging to maintain the principles of tidy data.
This is where hierarchical tidy data comes into play. Hierarchical tidy data extends the original tidy data principles to handle these more complex, hierarchical data structures.
In hierarchical tidy data:
-
Each variable still forms a column.
-
Each observation, regardless of its hierarchical level, forms a row.
-
Different types of observational units can exist within the same table, distinguished by a variable that indicates the hierarchical level.
This approach allows us to incorporate different observational units (e.g., journal headers, journal lines, and business segments) in the same table. Each unit has a unique identifier, and there are additional variables to link between different units in the hierarchy.
In the context of the General Ledger, each row can be a header, a line, or a business segment detail. The distinction between these different units is made using a variable indicating the hierarchical level. The relationships between these units are maintained using unique identifiers for each unit and additional variables that link them in a hierarchical manner.
By extending the tidy data concept in this way, we can handle more complex, hierarchical datasets while maintaining the benefits of tidy data, such as simplicity and intuitiveness of structure. This hierarchical tidy data structure provides a powerful framework for managing and analyzing complex datasets.
d_A25 | d_A26 | d_A01 | A25-01 | A25-A26-08 | A25-A26-07 Journal Line Number | A25-A26-14 Bill Date | A25-A26-12 Bill Number | A25-A26-11 Source Code | A25-09 Header Description | A25-02 Journal Number | A26-A09-01 GL Account Number | A26-A09-02 GL Account Name | A26-A09-05 Account Type | A25-A26-27 Credit Debit Indicator | A26-A89-01 Amount | A26-A89-02 Currency | A25-A26-06 Effective Date | A26-A09-06 Account Subtype | A25-A26-10 Line Description | A26-A01-03 Segment Reference Level | A26-A01-01 Business Segment Code | A26-A01-02 Business Segment Name | A26-A01-04 Business Segment Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
General Ledger |
General Ledger Line |
Business Segment |
Journal ID |
Type Code |
|||||||||||||||||||
943-8-477 |
943-8-477 |
entries |
8-477 |
2009-04-03 |
manual |
gj |
General |
134 |
|||||||||||||||
943-8-477 |
1 |
754 |
Office consumables expenses |
account |
debit |
14858 |
JPY |
2009-04-03 |
proposed |
paper |
|||||||||||||
943-8-477 |
1 |
1 |
account sub |
10 |
Sapporo |
Business unit |
|||||||||||||||||
943-8-477 |
2 |
191 |
Temporary payment consumption tax |
account |
debit |
742 |
JPY |
2009-04-03 |
tax |
||||||||||||||
943-8-477 |
2 |
1 |
account sub |
10 |
Sapporo |
Business unit |
|||||||||||||||||
943-8-477 |
3 |
111 |
Cash |
account |
credit |
15600 |
JPY |
2009-04-03 |
proposed |
paper |
|||||||||||||
943-8-477 |
3 |
1 |
account sub |
0 |
Common department |
Business unit |
NOTE In this format, each type of observation has its row, even though they may have different lengths. In this case, there are three layers represented, each corresponding to a different level in the hierarchy.
5. Conclusion
The hierarchical tidy data concept extends Wickham’s tidy data principles by adding the capability to handle hierarchical data efficiently. The tidying process maintains the easy-to-use nature of tidy data while enabling more complex and hierarchical data structures to be worked with effectively. This approach is particularly valuable in fields such as financial accounting, where hierarchical data is common. By adopting this model, we can better manage, analyze, and visualize hierarchical data structures.
6. Dialogues with ChatGPT
Can you refine following text so that technical expert,who is new to tidy data, can understand?
With explanation of tidy data and relationship to this implementation.
Add examples for each clauses to be easily understood. use my table data in explanation.
describe more detail specific to journal entry data.
Journal entry data is structured 3 layers.
A header contains posted data for business entity.
Line items detail contains debit/credit accountinc code, amount, tax, transaction description.
chart of accounts structure has various sub-account structure, such as business unit, cost center, bank account, customer, supplier, and so on.
These 3 layer structure can be recorded usiung hierarchical tidy data in a single table.
The concept of tidy data was proposed by Hadley Wickham as a standardized way to link the layout of a dataset and its semantics (meaning). In a tidy dataset, each variable is presented as a column, each observation as a row, and each type of observational unit as a table.
Let’s apply this concept to the structure of a General Ledger journal entry, which typically consists of three layers:
-
The header layer, which provides posted data for a business entity.
-
The line items detail layer, containing elements such as debit/credit accounting code, amount, tax, and transaction description.
-
The chart of accounts structure layer, with various sub-account structures like business unit, cost center, bank account, customer, and supplier.
Let’s take a look at an example of a journal entry for purchasing paper with cash, as presented in Table 11:
ID | Date | Line Description | Debit (JPY) | Credit (JPY) | Account number | Account description | Business Segment Code | Business Segment Name | Business Segment Type |
---|---|---|---|---|---|---|---|---|---|
943-8-477 |
2009-04-03 |
Paper |
14858 |
754 |
Office consumables expenses |
10 |
Sapporo |
Business unit |
|
943-8-477 |
2009-04-03 |
742 |
191 |
Temporary payment consumption tax |
10 |
Sapporo |
Business unit |
||
943-8-477 |
2009-04-03 |
Paper |
15,600 |
111 |
Cash |
0 |
Common department |
Business unit |
The general ledger line in this example includes variables (such as debit amount, credit amount, account number, etc.) as column headers and their corresponding values as rows.
For a person new to tidy data, this structure may seem odd. This is why Wickham suggested the concept of ‘tidying’ data to describe the underlying semantics.
Now, let’s revisit the concept of data semantics based on Wickham’s explanation. He describes a dataset as a collection of values, where every value belongs to a variable and an observation. In essence, variables are the different measurements or characteristics, while observations are the instances or entities the data represents.
To illustrate this, we can look at Tables 13, 14, and 15, which represent data from an imaginary experiment. The data is the same in all three tables, but each table presents the data differently.
The columns in the tables represent variables, like “Treatment” or “Person”, while the rows represent observations, such as “John Smith” or “Jane Doe”. The actual data values (the numbers or missing values) link a specific observation with a variable, forming a single data point.
Now, let’s examine the semantic structure of the General Ledger, represented in Table 16. This structure shows how variables are organized in a digital ledger. Normally, to represent this structure, multiple Codd’s 3rd normal form tables would be defined and connected in a relational database. Tidy data, however, allows us to reorganize this complex structure into a single tidy table. Tables 17 to 21 represent the tables for the relational database.
The next challenge is to create tidy data for our General Ledger. The first step involves identifying variables within an aggregation table within a table. If a variable’s occurrence is specified as multiple, it should be reported.
In the example of our General Ledger, which is recorded in a single currency, we don’t need a variable for a multi-currency amount. We identify
Descrive more detail on your following texts.
“Now, let’s examine the semantic structure of the General Ledger, represented in Table 16. This structure shows how variables are organized in a digital ledger. Normally, to represent this structure, multiple Codd’s 3rd normal form tables would be defined and connected in a relational database. Tidy data, however, allows us to reorganize this complex structure into a single tidy table. Tables 17 to 21 represent the tables for the relational database.
The next challenge is to create tidy data for our General Ledger. The first step involves identifying variables within an aggregation table within a table. If a variable’s occurrence is specified as multiple, it should be reported.
In the example of our General Ledger, which is recorded in a single currency, we don’t need a variable for a multi-currency amount. We identify.”
NOTE These texts is explaining Codd’s normalization approach.
Sure, I’m happy to elaborate.
When managing financial data in a digital ledger, the database design will typically follow certain guidelines or ‘normal forms’ to ensure that the data is efficiently organized, and free of redundancy or inconsistency. This is where the concept of Codd’s 3rd normal form comes into play, which is a design principle used in relational databases. According to this principle, every non-prime attribute of a table should be non-transitively dependent on the primary key. In simpler terms, each piece of data in the table should depend only on the primary key, and not on other pieces of data in the table. This helps to prevent data anomalies and redundancy.
Let’s imagine a typical General Ledger structure which would require multiple tables to represent the different layers of data:
-
The Journal Entry Header Table would store the ID, Date, and any other metadata related to a journal entry as a whole. The primary key here would be the Journal Entry ID.
-
The Journal Entry Line Table would store each individual line of a journal entry, including its associated Debit or Credit, the Account Number, Line Description, etc. The primary key could be a combination of the Journal Entry ID and a Line Item Number.
-
The Chart of Accounts Table would store a list of all available accounts that could be used in a journal entry line, including its Account Number, Account Description, and other related information. The Account Number would be the primary key in this case.
-
The Business Segment Table would contain details of each business segment, such as Business Segment Code, Business Segment Name, and Business Segment Type. The primary key could be the Business Segment Code.
-
Lastly, a Business Segment Line Association Table could map each line of a journal entry to a business segment. The primary key might be a combination of the Journal Entry ID, Line Item Number, and Business Segment Code.
This design, while efficient in terms of data storage and integrity, can make the data hard to read or analyze because the data for a single journal entry is spread across multiple tables. That’s where the idea of ‘tidy’ data comes in.
When we ‘tidy’ this data, we aim to transform it into a single table where each row represents a single observation (e.g., a single line item in a journal entry), and each column represents a variable (e.g., the Debit amount, the Account Number, etc.). In this context, a single observation would include all the details of a line item, including the associated business segment and chart of accounts information.
So, the first step in tidying the data would be to determine what constitutes an ‘observation’. In this case, it would be a single line item of a journal entry, including all its related data.
Next, we identify all the variables that are associated with that observation. For each variable, we create a column in our tidy data table. In our journal entry example, these variables would include the debit amount, credit amount, account number, line description, and business segment information.
However, if a variable doesn’t add valuable information or is not applicable to the dataset, it can be omitted. For example, if all the transactions in our ledger are recorded in a single currency (e.g., JPY), then we don’t need a column to specify the currency for each transaction.
After identifying our observations and variables, we can then create our tidy data table by populating it with our ledger data, ensuring that each row corresponds to a single line item of a journal entry and each column corresponds to a specific variable. The final result would be a comprehensive, user-friendly table that contains all the data of our general ledger in a ‘tidy’ format.
Add example for multiple lines for journal entry.
In our example we know that General Ledger is recorded in a single currency, Local Currency is Reporting Currency. We don’t need variable for multi-currency amount.
There are following three variables .
d_A25 General Ledger
d_A26 General Ledger Line
d_A01 Business Segment
Which data item in table 11 goes to where.
Given the complexity of a general ledger, a complete example might be out of scope, but I’ll provide a simplified example using your specified variables. Assuming the ‘d_’ prefix is part of your data labeling, I’ll illustrate it as:
-
d_A25 General Ledger
(representing details about the entire Journal Entry) -
d_A26 General Ledger Line
(representing details about a specific Line in the Journal Entry) -
d_A01 Business Segment
(representing the Business Segment associated with a specific line)
To show how these variables could be structured in a tidy format, I’ll create an example tidy table based on the information you’ve provided. Note that some necessary details for a complete Journal Entry like debit or credit amounts, account number, line description, etc. are omitted for simplicity.
Journal_Entry_ID | Entry_Date | Line_Item_Number | Business_Segment | Line_Description |
---|---|---|---|---|
JE001 |
2023-01-01 |
1 |
BS001 |
Purchase of Equipment |
JE001 |
2023-01-01 |
2 |
BS002 |
Depreciation |
JE002 |
2023-01-02 |
1 |
BS001 |
Office Supplies |
JE002 |
2023-01-02 |
2 |
BS003 |
Revenue from Sales |
In this table:
-
The
Journal_Entry_ID
andEntry_Date
columns correspond to thed_A25 General Ledger
variable, representing the details of the entire Journal Entry. -
The
Line_Item_Number
andLine_Description
columns correspond to thed_A26 General Ledger Line
variable, representing the details of a specific line in the Journal Entry. -
The
Business_Segment
column corresponds to thed_A01 Business Segment
variable, representing the business segment associated with each line in the Journal Entry.
This format brings all the relevant data into a single table, making it easier to analyze. However, it’s crucial to remember that while this format can be easier to work with for data analysis, it does not replace the need for normalized tables in a database for data integrity and efficient storage.
As for your question about Table 11, without seeing the actual data and structure of that table, I can’t provide a specific mapping. Typically, you would identify where each piece of data in Table 11 belongs in terms of the variables and observations we’ve identified, and then create the tidy data table accordingly.
We need to include header, line, business segment within a single table without repetition. This is the goal for hierarchcal tidy data.
If I understand your goal correctly, you’re aiming to create a tidy hierarchical data table where the journal entry information (header), line item information, and business segment information are all included in a single table, and there’s no unnecessary repetition.
In the context of a General Ledger, it can be challenging because the very nature of the journal entries is hierarchical (entries have line items, line items have business segments). One common approach to handle such hierarchies is to use “wide” format instead of “long” format, which can reduce repetition.
Here’s an example based on the earlier variables. We’ll assume that each Line Item is linked to a single Business Segment, and that each Journal Entry has a maximum of two line items for simplicity.
Journal_Entry_ID | Entry_Date | Line_Item_Number_1 | Business_Segment_1 | Line_Description_1 | Line_Item_Number_2 | Business_Segment_2 | Line_Description_2 |
---|---|---|---|---|---|---|---|
JE001 |
2023-01-01 |
1 |
BS001 |
Purchase of Equipment |
2 |
BS002 |
Depreciation |
JE002 |
2023-01-02 |
1 |
BS001 |
Office Supplies |
2 |
BS003 |
Revenue from Sales |
This table layout avoids repetition of the Journal Entry details (d_A25 General Ledger
) by placing multiple Line Items (d_A26 General Ledger Line
and d_A01 Business Segment
) in the same row as the corresponding Journal Entry. However, this “wide” format can become unwieldy if there are many Line Items or Business Segments per Journal Entry.
Remember that the concept of “tidy data” usually implies that each variable forms a column and each observation forms a row. In more complex, hierarchical data scenarios like this, the definition can be more flexible and must be balanced with practical considerations like ease of use and clarity of representation.
Is the following a tidy data?
d_A25 | d_A26 | d_A01 | ID | Element | Value |
---|---|---|---|---|---|
943-8-477 |
A25-01 |
Journal ID |
943-8-477 |
||
943-8-477 |
A25-A26-08 |
JE Type Code |
entries |
||
943-8-477 |
A25-A26-07 |
Journal Line Number |
8-477 |
||
943-8-477 |
A25-A26-14 |
Bill Date |
2009-04-03 |
||
943-8-477 |
A25-A26-12 |
Bill Number |
manual |
||
943-8-477 |
A25-A26-11 |
Source Code |
gj |
||
943-8-477 |
A25-09 |
Header Description |
General |
||
943-8-477 |
A25-02 |
Journal Number |
134 |
||
943-8-477 |
1 |
A26-A09-01 |
GL Account Number |
754 |
|
943-8-477 |
1 |
A26-A09-02 |
GL Account Name |
Office consumables expenses |
|
943-8-477 |
1 |
A26-A09-05 |
Account Type |
account |
|
943-8-477 |
1 |
A25-A26-27 |
Credit Debit Indicator |
debit |
|
943-8-477 |
1 |
A26-A89-01 |
Amount |
14,858 |
|
943-8-477 |
1 |
A26-A89-02 |
Currency |
JPY |
|
943-8-477 |
1 |
A25-A26-06 |
Effective Date |
2009-04-03 |
|
943-8-477 |
1 |
A26-A09-06 |
Account Subtype |
proposed |
|
943-8-477 |
1 |
A25-A26-10 |
Line Description |
paper |
|
943-8-477 |
1 |
1 |
A26-A01-03 |
Segment Reference Level |
account sub |
943-8-477 |
1 |
1 |
A26-A01-01 |
Business Segment Code |
10 |
943-8-477 |
1 |
1 |
A26-A01-02 |
Business Segment Name |
Sapporo |
943-8-477 |
1 |
1 |
A26-A01-04 |
Business Segment Type |
Business unit |
943-8-477 |
2 |
A26-A09-01 |
GL Account Number |
191 |
|
943-8-477 |
2 |
A26-A09-02 |
GL Account Name |
Temporary payment consumption tax |
|
943-8-477 |
2 |
A26-A09-05 |
Account Type |
account |
|
943-8-477 |
2 |
A25-A26-27 |
Credit Debit Indicator |
debit |
|
943-8-477 |
2 |
A26-A89-01 |
Amount |
742 |
|
943-8-477 |
2 |
A26-A89-02 |
Currency |
JPY |
|
943-8-477 |
2 |
A25-A26-06 |
Effective Date |
2009-04-03 |
|
943-8-477 |
2 |
A26-A09-06 |
Account Subtype |
tax |
|
943-8-477 |
2 |
1 |
A26-A01-03 |
Segment Reference Level |
account sub |
943-8-477 |
2 |
1 |
A26-A01-01 |
Business Segment Code |
10 |
943-8-477 |
2 |
1 |
A26-A01-02 |
Business Segment Name |
Sapporo |
943-8-477 |
2 |
1 |
A26-A01-04 |
Business Segment Type |
Business unit |
943-8-477 |
3 |
A26-A09-01 |
GL Account Number |
111 |
|
943-8-477 |
3 |
A26-A09-02 |
GL Account Name |
Cash |
|
943-8-477 |
3 |
A26-A09-05 |
Account Type |
account |
|
943-8-477 |
3 |
A25-A26-27 |
Credit Debit Indicator |
credit |
|
943-8-477 |
3 |
A26-A89-01 |
Amount |
15,600 |
|
943-8-477 |
3 |
A26-A89-02 |
Currency |
JPY |
|
943-8-477 |
3 |
A25-A26-06 |
Effective Date |
2009-04-03 |
|
943-8-477 |
3 |
A26-A09-06 |
Account Subtype |
proposed |
|
943-8-477 |
3 |
A25-A26-10 |
Line Description |
paper |
|
943-8-477 |
3 |
1 |
A26-A01-03 |
Segment Reference Level |
account sub |
943-8-477 |
3 |
1 |
A26-A01-01 |
Business Segment Code |
0 |
943-8-477 |
3 |
1 |
A26-A01-02 |
Business Segment Name |
Common department |
943-8-477 |
3 |
1 |
A26-A01-04 |
Business Segment Type |
Business unit |
This table is an example of “long” format, and is not quite “tidy” data as defined by Hadley Wickham in his influential paper “Tidy Data.”
According to Wickham’s definition, tidy data has the following characteristics:
-
Each variable forms a column.
-
Each observation forms a row.
-
Each type of observational unit forms a table.
In your table, the columns “ID”, “Element”, and “Value” are actually containing multiple variables. For example, “GL Account Number”, “GL Account Name”, “Amount”, and “Currency” are all different variables but they’re squeezed into a single column (“Element”) with their corresponding values in the “Value” column.
A “tidier” version of this table could look like this:
Journal_ID | JE_Type_Code | Journal_Line_Number | Bill_Date | Bill_Number | Source_Code | Header_Description | Journal_Number | Line_Item_Number | GL_Account_Number | GL_Account_Name | Account_Type | Credit_Debit_Indicator | Amount | Currency | Effective_Date | Account_Subtype | Line_Description | Segment_Reference_Level | Business_Segment_Code | Business_Segment_Name | Business_Segment_Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
943-8-477 |
entries |
8-477 |
2009-04-03 |
manual |
gj |
General |
134 |
1 |
754 |
Office consumables expenses |
account |
debit |
14,858 |
JPY |
2009-04-03 |
proposed |
paper |
account sub |
10 |
Sapporo |
Business unit |
943-8-477 |
entries |
8-477 |
2009-04-03 |
manual |
gj |
General |
134 |
2 |
191 |
Temporary payment consumption tax |
account |
debit |
742 |
JPY |
2009-04-03 |
tax |
paper |
account sub |
10 |
Sapporo |
Business unit |
943-8-477 |
entries |
8-477 |
2009-04-03 |
manual |
gj |
General |
134 |
3 |
111 |
Cash |
account |
credit |
15,600 |
JPY |
2009-04-03 |
proposed |
paper |
account sub |
0 |
Common department |
Business unit |
This table has now separated out all the variables into their own columns, and each row represents one observation (one line item in this case).
Of course, real-world data can be more complex, and it may not be possible or practical to achieve perfect tidiness. The key is to aim for a structure that helps you to analyze the data effectively and efficiently.
How about this table?
d_A25 | d_A26 | d_A01 | A25-01 | A25-A26-08 | A25-A26-07 Journal Line Number | A25-A26-14 Bill Date | A25-A26-12 Bill Number | A25-A26-11 Source Code | A25-09 Header Description | A25-02 Journal Number | A26-A09-01 GL Account Number | A26-A09-02 GL Account Name | A26-A09-05 Account Type | A25-A26-27 Credit Debit Indicator | A26-A89-01 Amount | A26-A89-02 Currency | A25-A26-06 Effective Date | A26-A09-06 Account Subtype | A25-A26-10 Line Description | A26-A01-03 Segment Reference Level | A26-A01-01 Business Segment Code | A26-A01-02 Business Segment Name | A26-A01-04 Business Segment Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
General Ledger |
General Ledger Line |
Business Segment |
Journal ID |
Type Code |
|||||||||||||||||||
943-8-477 |
943-8-477 |
entries |
8-477 |
2009-04-03 |
manual |
gj |
General |
134 |
|||||||||||||||
943-8-477 |
1 |
754 |
Office consumables expenses |
account |
debit |
14858 |
JPY |
2009-04-03 |
proposed |
paper |
|||||||||||||
943-8-477 |
1 |
1 |
account sub |
10 |
Sapporo |
Business unit |
|||||||||||||||||
943-8-477 |
2 |
191 |
Temporary payment consumption tax |
account |
debit |
742 |
JPY |
2009-04-03 |
tax |
||||||||||||||
943-8-477 |
2 |
1 |
account sub |
10 |
Sapporo |
Business unit |
|||||||||||||||||
943-8-477 |
3 |
111 |
Cash |
account |
credit |
15600 |
JPY |
2009-04-03 |
proposed |
paper |
|||||||||||||
943-8-477 |
3 |
1 |
account sub |
0 |
Common department |
Business unit |
This table represents an improvement over the previous one in terms of tidiness. Here, each variable now forms a column, and each observation (a general ledger line in this case) forms a row. However, it’s important to note that the tidiness of the data also depends on the specific analysis you want to perform.
One minor point to consider is the presence of cells with repeated or irrelevant information for a specific row, such as the “Journal ID” column which repeats “943-8-477” across rows where it’s not applicable. Similarly, the cells in the “Type Code”, “Journal Line Number”, “Bill Date”, “Bill Number”, “Source Code”, “Header Description”, and “Journal Number” columns are empty for the lines that are not header lines. These could lead to misunderstandings about the structure of your data.
The goal of tidy data is to create a structure where every cell contains one and only one value that describes the variable of the column and the observation of the row.
A tidier version of this table would look like this:
Journal ID | Type Code | Journal Line Number | Bill Date | Bill Number | Source Code | Header Description | Journal Number | General Ledger Line | GL Account Number | GL Account Name | Account Type | Credit Debit Indicator | Amount | Currency | Effective Date | Account Subtype | Line Description | Business Segment | Segment Reference Level | Business Segment Code | Business Segment Name | Business Segment Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
943-8-477 |
entries |
8-477 |
2009-04-03 |
manual |
gj |
General |
134 |
1 |
754 |
Office consumables expenses |
account |
debit |
14858 |
JPY |
2009-04-03 |
proposed |
paper |
1 |
account sub |
10 |
Sapporo |
Business unit |
943-8-477 |
entries |
8-477 |
2009-04-03 |
manual |
gj |
General |
134 |
2 |
191 |
Temporary payment consumption tax |
account |
debit |
742 |
JPY |
2009-04-03 |
tax |
1 |
account sub |
10 |
Sapporo |
Business unit |
|
943-8-477 |
entries |
8-477 |
2009-04-03 |
manual |
gj |
General |
134 |
3 |
111 |
Cash |
account |
credit |
15600 |
JPY |
2009-04-03 |
proposed |
paper |
1 |
account sub |
0 |
Common department |
Business unit |
Note: This structure assumes that each line in the General Ledger belongs to only one Business Segment. If a line can belong to multiple segments, you may need to reshape your data accordingly.
You are repeating
Journal ID Type Code Journal Line Number Bill Date Bill Number Source Code Header Description Journal Number
943-8-477 entries 8-477 2009-04-03 manual gj General 134
You need to distinguish observation for header and lines.
There is no restriction observation in a table should be the same. So we can store different observations within the same table, which is a hierarchcal tidy data.
I apologize for the misunderstanding. You’re right. In the context of hierarchical data, different levels of observation can indeed be stored in the same table, and these different levels of observation can make the table look less ‘tidy’ than a traditional tidy data format where each row is a single observation. This method can be particularly useful when working with complex hierarchical or transactional data such as a general ledger.
Given that, your original table, where each row represents a different level of the hierarchy (header, line, business segment) is considered a tidy hierarchical data structure. Each row still represents a single observation, but the type of observation differs between rows:
-
Some rows represent a ledger header.
-
Some rows represent a ledger line.
-
Some rows represent a business segment.
This structure enables you to capture the hierarchical relationships within the data, which would not be possible with a simple flat structure. Each level in the hierarchy is clearly distinguished by the values in the ‘d_A25’, ‘d_A26’, and ‘d_A01’ columns.
It is important to note, however, that while this structure is more flexible and can handle complex hierarchical data, it can also be more challenging to work with than a simple flat structure, as it requires additional processing to filter or aggregate data at different levels of the hierarchy.
Provide your understood table of the hierachical tidy data for journal entry of my example.
Certainly, here is a representation of the hierarchical tidy data format based on your example:
d_A25 | d_A26 | d_A01 | A25-01 Journal ID | A25-A26-08 Type Code | A25-A26-07 Journal Line Number | A25-A26-14 Bill Date | A25-A26-12 Bill Number | A25-A26-11 Source Code | A25-09 Header Description | A25-02 Journal Number | A26-A09-01 GL Account Number | A26-A09-02 GL Account Name | A26-A09-05 Account Type | A25-A26-27 Credit Debit Indicator | A26-A89-01 Amount | A26-A89-02 Currency | A25-A26-06 Effective Date | A26-A09-06 Account Subtype | A25-A26-10 Line Description | A26-A01-03 Segment Reference Level | A26-A01-01 Business Segment Code | A26-A01-02 Business Segment Name | A26-A01-04 Business Segment Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
943-8-477 |
NA |
NA |
943-8-477 |
entries |
NA |
2009-04-03 |
manual |
gj |
General |
134 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
943-8-477 |
1 |
NA |
NA |
NA |
1 |
NA |
NA |
NA |
NA |
NA |
754 |
Office consumables expenses |
account |
debit |
14858 |
JPY |
2009-04-03 |
proposed |
paper |
NA |
NA |
NA |
NA |
943-8-477 |
1 |
1 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
account sub |
10 |
Sapporo |
Business unit |
943-8-477 |
2 |
NA |
NA |
NA |
2 |
NA |
NA |
NA |
NA |
NA |
191 |
Temporary payment consumption tax |
account |
debit |
742 |
JPY |
2009-04-03 |
tax |
NA |
NA |
NA |
NA |
NA |
943-8-477 |
2 |
1 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
account sub |
10 |
Sapporo |
Business unit |
943-8-477 |
3 |
NA |
NA |
NA |
3 |
NA |
NA |
NA |
NA |
NA |
111 |
Cash |
account |
credit |
15600 |
JPY |
2009-04-03 |
proposed |
paper |
NA |
NA |
NA |
NA |
943-8-477 |
3 |
1 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
account sub |
0 |
Common department |
Business unit |
The table above is structured so that each row represents an observation at a different level of hierarchy (General Ledger, Line, Business Segment). The columns are divided into variables and identifiers that belong to each hierarchical level. The NA
entries show that a variable is not applicable to a particular hierarchical level.
For example, the first row contains variables that only apply to the General Ledger level, while the second row contains variables that apply to the General Ledger Line level, and so forth.
I hope this makes the structure clearer. This type of hierarchical tidy data structure is complex but very powerful, as it can represent multiple levels of hierarchy in a single table. It’s very useful in cases like this where data is structured in multiple levels, such as in financial statements or other complex record-keeping systems.
Write down introduction paper for a hierarchical tidy data in 3000 words with example table.
1. Table for original journal entry
2. nearly tidy data with different element in a single column
3.hierarchical tidy data
Use journal entry 3 layer structure as an example.
Explain how this hierarchical tidy data extend original Wickham model.
wrtites this paper draft …