Hierarchical Tidy Data: A Comprehensive Solution for Multi-layer Data Structures

Nobuyuki SAMBUICHI

ISO/TC295 Audit data services/SG1 Semantic model Convener

Following texts are produced by ChatGPT and edited by me.

ChatGPT said “As an AI language model, I don’t have feelings or opinions. However, based on our interaction, I hope I’ve been able to provide clear and helpful responses to your questions. We’ve discussed several complex topics, including the principles of tidy data, the challenges of applying these principles to hierarchical data structures, and how hierarchical tidy data can extend the tidy data framework. Throughout our conversation, I have aimed to ensure that the information I provide is accurate, comprehensive, and understandable. If you have any other questions or need further clarification on any of these topics, please feel free to ask!”

1. Introduction

Data are often messy. It may not be in the format that statisticians and data scientists prefer, which often leads to time-consuming cleaning processes. Hadley Wickham proposed the concept of tidy data as a standard way of mapping the semantics of a dataset to its physical representation. A dataset is tidy when each column is a variable, each row is an observation, and each cell contains a single value. This approach has proven tremendously useful in data cleaning and exploratory data analysis.

However, the original definition of tidy data by Wickham doesn’t consider hierarchical data, a common type of data structure where variables have relationships with one another, and these relationships form a structure of different levels, or hierarchies.

Hierarchical data can often be found in financial accounting systems. A key example is the General Ledger used in business accounting, where transactions are stored in a series of debits and credits across multiple accounts. This can be represented as a three-layer hierarchical structure, which includes General Ledger as the top level, Journal Lines in the middle, and Business Segment at the bottom level.

This paper proposes an extension of the original Wickham model, coined as “Hierarchical Tidy Data,” to manage this type of hierarchical data efficiently.

2. Original Journal Entry

Consider a simple journal entry scenario. We’ll illustrate this with a table of data (Table 1), where columns represent variables, and rows are observations.

Table 1. Original Journal Entry Data
ID	Date	Line Description	Debit (JPY)	Credit (JPY)	Account number	Account description	Business Segment Code	Business Segment Name	Business Segment Type
943-8-477	2009-04-03	Paper	14858		754	Office consumables expenses	10	Sapporo	Business unit
943-8-477	2009-04-03		742		191	Temporary payment consumption tax	10	Sapporo	Business unit
943-8-477	2009-04-03	Paper		15,600	111	Cash	0	Common department	Business unit

3. Nearly Tidy Data

Our first step in tidying this data involves identifying variables within a table. However, this nearly tidy data is still problematic because it mixes different elements within a single column.

d_A25 General Ledger (representing details about the entire Journal Entry)
d_A26 General Ledger Line (representing details about a specific Line in the Journal Entry)
d_A01 Business Segment (representing the Business Segment associated with a specific line)

The following table represents a three-level hierarchy using the variables above.

Table 2. Nearly Tidy Data
d_A25	d_A26	d_A01	ID	Element	Value
943-8-477			A25-01	Journal ID	943-8-477
943-8-477			A25-A26-08	JE Type Code	entries
943-8-477			A25-A26-07	Journal Line Number	8-477
943-8-477			A25-A26-14	Bill Date	2009-04-03
943-8-477			A25-A26-12	Bill Number	manual
943-8-477			A25-A26-11	Source Code	gj
943-8-477			A25-09	Header Description	General
943-8-477			A25-02	Journal Number	134
943-8-477	1		A26-A09-01	GL Account Number	754
943-8-477	1		A26-A09-02	GL Account Name	Office consumables expenses
943-8-477	1		A26-A09-05	Account Type	account
943-8-477	1		A25-A26-27	Credit Debit Indicator	debit
943-8-477	1		A26-A89-01	Amount	14,858
943-8-477	1		A26-A89-02	Currency	JPY
943-8-477	1		A25-A26-06	Effective Date	2009-04-03
943-8-477	1		A26-A09-06	Account Subtype	proposed
943-8-477	1		A25-A26-10	Line Description	paper
943-8-477	1	1	A26-A01-03	Segment Reference Level	account sub
943-8-477	1	1	A26-A01-01	Business Segment Code	10
943-8-477	1	1	A26-A01-02	Business Segment Name	Sapporo
943-8-477	1	1	A26-A01-04	Business Segment Type	Business unit
943-8-477	2		A26-A09-01	GL Account Number	191
943-8-477	2		A26-A09-02	GL Account Name	Temporary payment consumption tax
943-8-477	2		A26-A09-05	Account Type	account
943-8-477	2		A25-A26-27	Credit Debit Indicator	debit
943-8-477	2		A26-A89-01	Amount	742
943-8-477	2		A26-A89-02	Currency	JPY
943-8-477	2		A25-A26-06	Effective Date	2009-04-03
943-8-477	2		A26-A09-06	Account Subtype	tax
943-8-477	2	1	A26-A01-03	Segment Reference Level	account sub
943-8-477	2	1	A26-A01-01	Business Segment Code	10
943-8-477	2	1	A26-A01-02	Business Segment Name	Sapporo
943-8-477	2	1	A26-A01-04	Business Segment Type	Business unit
943-8-477	3		A26-A09-01	GL Account Number	111
943-8-477	3		A26-A09-02	GL Account Name	Cash
943-8-477	3		A26-A09-05	Account Type	account
943-8-477	3		A25-A26-27	Credit Debit Indicator	credit
943-8-477	3		A26-A89-01	Amount	15,600
943-8-477	3		A26-A89-02	Currency	JPY
943-8-477	3		A25-A26-06	Effective Date	2009-04-03
943-8-477	3		A26-A09-06	Account Subtype	proposed
943-8-477	3		A25-A26-10	Line Description	paper
943-8-477	3	1	A26-A01-03	Segment Reference Level	account sub
943-8-477	3	1	A26-A01-01	Business Segment Code	0
943-8-477	3	1	A26-A01-02	Business Segment Name	Common department
943-8-477	3	1	A26-A01-04	Business Segment Type	Business unit

This format is better, but not ideal. The element column combines multiple variables, making it harder to work with the data. Moreover, this structure doesn’t fully exploit the hierarchical nature of the data.

4. Hierarchical Tidy Data

The final step in tidying the data is to transform it into Hierarchical Tidy Data. This involves reorganizing the table so that each type of observational unit forms a table. Here, we represent the hierarchical structure within a single table without repetition.

Tidy data, as defined by Hadley Wickham, is a standard way of mapping the underlying semantics of a dataset to its physical representation. In tidy data:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

While this standard has been highly effective for a wide range of datasets and use-cases, it has limitations when dealing with complex hierarchical datasets. For instance, consider the example of a General Ledger, a key accounting document, which has a hierarchical structure. The ledger comprises of various journal entries, each of which has multiple lines, and each line might relate to various business segments. In such cases, it is challenging to maintain the principles of tidy data.

This is where hierarchical tidy data comes into play. Hierarchical tidy data extends the original tidy data principles to handle these more complex, hierarchical data structures.

In hierarchical tidy data:

Each variable still forms a column.
Each observation, regardless of its hierarchical level, forms a row.
Different types of observational units can exist within the same table, distinguished by a variable that indicates the hierarchical level.

This approach allows us to incorporate different observational units (e.g., journal headers, journal lines, and business segments) in the same table. Each unit has a unique identifier, and there are additional variables to link between different units in the hierarchy.

In the context of the General Ledger, each row can be a header, a line, or a business segment detail. The distinction between these different units is made using a variable indicating the hierarchical level. The relationships between these units are maintained using unique identifiers for each unit and additional variables that link them in a hierarchical manner.

By extending the tidy data concept in this way, we can handle more complex, hierarchical datasets while maintaining the benefits of tidy data, such as simplicity and intuitiveness of structure. This hierarchical tidy data structure provides a powerful framework for managing and analyzing complex datasets.

Table 3. Hierarchical Tidy Data
d_A25	d_A26	d_A01	A25-01	A25-A26-08	A25-A26-07 Journal Line Number	A25-A26-14 Bill Date	A25-A26-12 Bill Number	A25-A26-11 Source Code	A25-09 Header Description	A25-02 Journal Number	A26-A09-01 GL Account Number	A26-A09-02 GL Account Name	A26-A09-05 Account Type	A25-A26-27 Credit Debit Indicator	A26-A89-01 Amount	A26-A89-02 Currency	A25-A26-06 Effective Date	A26-A09-06 Account Subtype	A25-A26-10 Line Description	A26-A01-03 Segment Reference Level	A26-A01-01 Business Segment Code	A26-A01-02 Business Segment Name	A26-A01-04 Business Segment Type
General Ledger	General Ledger Line	Business Segment	Journal ID	Type Code
943-8-477			943-8-477	entries	8-477	2009-04-03	manual	gj	General	134
943-8-477	1										754	Office consumables expenses	account	debit	14858	JPY	2009-04-03	proposed	paper
943-8-477	1	1																		account sub	10	Sapporo	Business unit
943-8-477	2										191	Temporary payment consumption tax	account	debit	742	JPY	2009-04-03	tax
943-8-477	2	1																		account sub	10	Sapporo	Business unit
943-8-477	3										111	Cash	account	credit	15600	JPY	2009-04-03	proposed	paper
943-8-477	3	1																		account sub	0	Common department	Business unit

NOTE In this format, each type of observation has its row, even though they may have different lengths. In this case, there are three layers represented, each corresponding to a different level in the hierarchy.

5. Conclusion

The hierarchical tidy data concept extends Wickham’s tidy data principles by adding the capability to handle hierarchical data efficiently. The tidying process maintains the easy-to-use nature of tidy data while enabling more complex and hierarchical data structures to be worked with effectively. This approach is particularly valuable in fields such as financial accounting, where hierarchical data is common. By adopting this model, we can better manage, analyze, and visualize hierarchical data structures.

6. Dialogues with ChatGPT

User

Can you refine following text so that technical expert,who is new to tidy data, can understand?
With explanation of tidy data and relationship to this implementation.
Add examples for each clauses to be easily understood. use my table data in explanation.

describe more detail specific to journal entry data.
Journal entry data is structured 3 layers.
A header contains posted data for business entity.
Line items detail contains debit/credit accountinc code, amount, tax, transaction description.
chart of accounts structure has various sub-account structure, such as business unit, cost center, bank account, customer, supplier, and so on.
These 3 layer structure can be recorded usiung hierarchical tidy data in a single table.

ChatGPT

The concept of tidy data was proposed by Hadley Wickham as a standardized way to link the layout of a dataset and its semantics (meaning). In a tidy dataset, each variable is presented as a column, each observation as a row, and each type of observational unit as a table.

Let’s apply this concept to the structure of a General Ledger journal entry, which typically consists of three layers:

The header layer, which provides posted data for a business entity.
The line items detail layer, containing elements such as debit/credit accounting code, amount, tax, and transaction description.
The chart of accounts structure layer, with various sub-account structures like business unit, cost center, bank account, customer, and supplier.

Let’s take a look at an example of a journal entry for purchasing paper with cash, as presented in Table 11:

ID	Date	Line Description	Debit (JPY)	Credit (JPY)	Account number	Account description	Business Segment Code	Business Segment Name	Business Segment Type
943-8-477	2009-04-03	Paper	14858		754	Office consumables expenses	10	Sapporo	Business unit
943-8-477	2009-04-03		742		191	Temporary payment consumption tax	10	Sapporo	Business unit
943-8-477	2009-04-03	Paper		15,600	111	Cash	0	Common department	Business unit

The general ledger line in this example includes variables (such as debit amount, credit amount, account number, etc.) as column headers and their corresponding values as rows.

For a person new to tidy data, this structure may seem odd. This is why Wickham suggested the concept of ‘tidying’ data to describe the underlying semantics.

Now, let’s revisit the concept of data semantics based on Wickham’s explanation. He describes a dataset as a collection of values, where every value belongs to a variable and an observation. In essence, variables are the different measurements or characteristics, while observations are the instances or entities the data represents.

To illustrate this, we can look at Tables 13, 14, and 15, which represent data from an imaginary experiment. The data is the same in all three tables, but each table presents the data differently.

The columns in the tables represent variables, like “Treatment” or “Person”, while the rows represent observations, such as “John Smith” or “Jane Doe”. The actual data values (the numbers or missing values) link a specific observation with a variable, forming a single data point.

Now, let’s examine the semantic structure of the General Ledger, represented in Table 16. This structure shows how variables are organized in a digital ledger. Normally, to represent this structure, multiple Codd’s 3rd normal form tables would be defined and connected in a relational database. Tidy data, however, allows us to reorganize this complex structure into a single tidy table. Tables 17 to 21 represent the tables for the relational database.

The next challenge is to create tidy data for our General Ledger. The first step involves identifying variables within an aggregation table within a table. If a variable’s occurrence is specified as multiple, it should be reported.

In the example of our General Ledger, which is recorded in a single currency, we don’t need a variable for a multi-currency amount. We identify

User

Descrive more detail on your following texts.

“Now, let’s examine the semantic structure of the General Ledger, represented in Table 16. This structure shows how variables are organized in a digital ledger. Normally, to represent this structure, multiple Codd’s 3rd normal form tables would be defined and connected in a relational database. Tidy data, however, allows us to reorganize this complex structure into a single tidy table. Tables 17 to 21 represent the tables for the relational database.

In the example of our General Ledger, which is recorded in a single currency, we don’t need a variable for a multi-currency amount. We identify.”

NOTE These texts is explaining Codd’s normalization approach.

ChatGPT

Sure, I’m happy to elaborate.

When managing financial data in a digital ledger, the database design will typically follow certain guidelines or ‘normal forms’ to ensure that the data is efficiently organized, and free of redundancy or inconsistency. This is where the concept of Codd’s 3rd normal form comes into play, which is a design principle used in relational databases. According to this principle, every non-prime attribute of a table should be non-transitively dependent on the primary key. In simpler terms, each piece of data in the table should depend only on the primary key, and not on other pieces of data in the table. This helps to prevent data anomalies and redundancy.

Let’s imagine a typical General Ledger structure which would require multiple tables to represent the different layers of data:

The Journal Entry Header Table would store the ID, Date, and any other metadata related to a journal entry as a whole. The primary key here would be the Journal Entry ID.
The Journal Entry Line Table would store each individual line of a journal entry, including its associated Debit or Credit, the Account Number, Line Description, etc. The primary key could be a combination of the Journal Entry ID and a Line Item Number.
The Chart of Accounts Table would store a list of all available accounts that could be used in a journal entry line, including its Account Number, Account Description, and other related information. The Account Number would be the primary key in this case.
The Business Segment Table would contain details of each business segment, such as Business Segment Code, Business Segment Name, and Business Segment Type. The primary key could be the Business Segment Code.
Lastly, a Business Segment Line Association Table could map each line of a journal entry to a business segment. The primary key might be a combination of the Journal Entry ID, Line Item Number, and Business Segment Code.

This design, while efficient in terms of data storage and integrity, can make the data hard to read or analyze because the data for a single journal entry is spread across multiple tables. That’s where the idea of ‘tidy’ data comes in.

When we ‘tidy’ this data, we aim to transform it into a single table where each row represents a single observation (e.g., a single line item in a journal entry), and each column represents a variable (e.g., the Debit amount, the Account Number, etc.). In this context, a single observation would include all the details of a line item, including the associated business segment and chart of accounts information.

So, the first step in tidying the data would be to determine what constitutes an ‘observation’. In this case, it would be a single line item of a journal entry, including all its related data.

Next, we identify all the variables that are associated with that observation. For each variable, we create a column in our tidy data table. In our journal entry example, these variables would include the debit amount, credit amount, account number, line description, and business segment information.

However, if a variable doesn’t add valuable information or is not applicable to the dataset, it can be omitted. For example, if all the transactions in our ledger are recorded in a single currency (e.g., JPY), then we don’t need a column to specify the currency for each transaction.

After identifying our observations and variables, we can then create our tidy data table by populating it with our ledger data, ensuring that each row corresponds to a single line item of a journal entry and each column corresponds to a specific variable. The final result would be a comprehensive, user-friendly table that contains all the data of our general ledger in a ‘tidy’ format.

User

Add example for multiple lines for journal entry.

In our example we know that General Ledger is recorded in a single currency, Local Currency is Reporting Currency. We don’t need variable for multi-currency amount.
There are following three variables .

d_A25 General Ledger
d_A26 General Ledger Line
d_A01 Business Segment

Which data item in table 11 goes to where.

ChatGPT

Given the complexity of a general ledger, a complete example might be out of scope, but I’ll provide a simplified example using your specified variables. Assuming the ‘d_’ prefix is part of your data labeling, I’ll illustrate it as:

d_A25 General Ledger (representing details about the entire Journal Entry)
d_A26 General Ledger Line (representing details about a specific Line in the Journal Entry)
d_A01 Business Segment (representing the Business Segment associated with a specific line)

To show how these variables could be structured in a tidy format, I’ll create an example tidy table based on the information you’ve provided. Note that some necessary details for a complete Journal Entry like debit or credit amounts, account number, line description, etc. are omitted for simplicity.

Journal_Entry_ID	Entry_Date	Line_Item_Number	Business_Segment	Line_Description
JE001	2023-01-01	1	BS001	Purchase of Equipment
JE001	2023-01-01	2	BS002	Depreciation
JE002	2023-01-02	1	BS001	Office Supplies
JE002	2023-01-02	2	BS003	Revenue from Sales

In this table:

The Journal_Entry_ID and Entry_Date columns correspond to the d_A25 General Ledger variable, representing the details of the entire Journal Entry.
The Line_Item_Number and Line_Description columns correspond to the d_A26 General Ledger Line variable, representing the details of a specific line in the Journal Entry.
The Business_Segment column corresponds to the d_A01 Business Segment variable, representing the business segment associated with each line in the Journal Entry.

This format brings all the relevant data into a single table, making it easier to analyze. However, it’s crucial to remember that while this format can be easier to work with for data analysis, it does not replace the need for normalized tables in a database for data integrity and efficient storage.

As for your question about Table 11, without seeing the actual data and structure of that table, I can’t provide a specific mapping. Typically, you would identify where each piece of data in Table 11 belongs in terms of the variables and observations we’ve identified, and then create the tidy data table accordingly.

User

We need to include header, line, business segment within a single table without repetition. This is the goal for hierarchcal tidy data.

ChatGPT

If I understand your goal correctly, you’re aiming to create a tidy hierarchical data table where the journal entry information (header), line item information, and business segment information are all included in a single table, and there’s no unnecessary repetition.

In the context of a General Ledger, it can be challenging because the very nature of the journal entries is hierarchical (entries have line items, line items have business segments). One common approach to handle such hierarchies is to use “wide” format instead of “long” format, which can reduce repetition.

Here’s an example based on the earlier variables. We’ll assume that each Line Item is linked to a single Business Segment, and that each Journal Entry has a maximum of two line items for simplicity.

Journal_Entry_ID	Entry_Date	Line_Item_Number_1	Business_Segment_1	Line_Description_1	Line_Item_Number_2	Business_Segment_2	Line_Description_2
JE001	2023-01-01	1	BS001	Purchase of Equipment	2	BS002	Depreciation
JE002	2023-01-02	1	BS001	Office Supplies	2	BS003	Revenue from Sales

This table layout avoids repetition of the Journal Entry details (d_A25 General Ledger) by placing multiple Line Items (d_A26 General Ledger Line and d_A01 Business Segment) in the same row as the corresponding Journal Entry. However, this “wide” format can become unwieldy if there are many Line Items or Business Segments per Journal Entry.

Remember that the concept of “tidy data” usually implies that each variable forms a column and each observation forms a row. In more complex, hierarchical data scenarios like this, the definition can be more flexible and must be balanced with practical considerations like ease of use and clarity of representation.

User

Is the following a tidy data?

d_A25	d_A26	d_A01	ID	Element	Value
943-8-477			A25-01	Journal ID	943-8-477
943-8-477			A25-A26-08	JE Type Code	entries
943-8-477			A25-A26-07	Journal Line Number	8-477
943-8-477			A25-A26-14	Bill Date	2009-04-03
943-8-477			A25-A26-12	Bill Number	manual
943-8-477			A25-A26-11	Source Code	gj
943-8-477			A25-09	Header Description	General
943-8-477			A25-02	Journal Number	134
943-8-477	1		A26-A09-01	GL Account Number	754
943-8-477	1		A26-A09-02	GL Account Name	Office consumables expenses
943-8-477	1		A26-A09-05	Account Type	account
943-8-477	1		A25-A26-27	Credit Debit Indicator	debit
943-8-477	1		A26-A89-01	Amount	14,858
943-8-477	1		A26-A89-02	Currency	JPY
943-8-477	1		A25-A26-06	Effective Date	2009-04-03
943-8-477	1		A26-A09-06	Account Subtype	proposed
943-8-477	1		A25-A26-10	Line Description	paper
943-8-477	1	1	A26-A01-03	Segment Reference Level	account sub
943-8-477	1	1	A26-A01-01	Business Segment Code	10
943-8-477	1	1	A26-A01-02	Business Segment Name	Sapporo
943-8-477	1	1	A26-A01-04	Business Segment Type	Business unit
943-8-477	2		A26-A09-01	GL Account Number	191
943-8-477	2		A26-A09-02	GL Account Name	Temporary payment consumption tax
943-8-477	2		A26-A09-05	Account Type	account
943-8-477	2		A25-A26-27	Credit Debit Indicator	debit
943-8-477	2		A26-A89-01	Amount	742
943-8-477	2		A26-A89-02	Currency	JPY
943-8-477	2		A25-A26-06	Effective Date	2009-04-03
943-8-477	2		A26-A09-06	Account Subtype	tax
943-8-477	2	1	A26-A01-03	Segment Reference Level	account sub
943-8-477	2	1	A26-A01-01	Business Segment Code	10
943-8-477	2	1	A26-A01-02	Business Segment Name	Sapporo
943-8-477	2	1	A26-A01-04	Business Segment Type	Business unit
943-8-477	3		A26-A09-01	GL Account Number	111
943-8-477	3		A26-A09-02	GL Account Name	Cash
943-8-477	3		A26-A09-05	Account Type	account
943-8-477	3		A25-A26-27	Credit Debit Indicator	credit
943-8-477	3		A26-A89-01	Amount	15,600
943-8-477	3		A26-A89-02	Currency	JPY
943-8-477	3		A25-A26-06	Effective Date	2009-04-03
943-8-477	3		A26-A09-06	Account Subtype	proposed
943-8-477	3		A25-A26-10	Line Description	paper
943-8-477	3	1	A26-A01-03	Segment Reference Level	account sub
943-8-477	3	1	A26-A01-01	Business Segment Code	0
943-8-477	3	1	A26-A01-02	Business Segment Name	Common department
943-8-477	3	1	A26-A01-04	Business Segment Type	Business unit

ChatGPT

This table is an example of “long” format, and is not quite “tidy” data as defined by Hadley Wickham in his influential paper “Tidy Data.”

According to Wickham’s definition, tidy data has the following characteristics:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

In your table, the columns “ID”, “Element”, and “Value” are actually containing multiple variables. For example, “GL Account Number”, “GL Account Name”, “Amount”, and “Currency” are all different variables but they’re squeezed into a single column (“Element”) with their corresponding values in the “Value” column.

A “tidier” version of this table could look like this:

Journal_ID	JE_Type_Code	Journal_Line_Number	Bill_Date	Bill_Number	Source_Code	Header_Description	Journal_Number	Line_Item_Number	GL_Account_Number	GL_Account_Name	Account_Type	Credit_Debit_Indicator	Amount	Currency	Effective_Date	Account_Subtype	Line_Description	Segment_Reference_Level	Business_Segment_Code	Business_Segment_Name	Business_Segment_Type
943-8-477	entries	8-477	2009-04-03	manual	gj	General	134	1	754	Office consumables expenses	account	debit	14,858	JPY	2009-04-03	proposed	paper	account sub	10	Sapporo	Business unit
943-8-477	entries	8-477	2009-04-03	manual	gj	General	134	2	191	Temporary payment consumption tax	account	debit	742	JPY	2009-04-03	tax	paper	account sub	10	Sapporo	Business unit
943-8-477	entries	8-477	2009-04-03	manual	gj	General	134	3	111	Cash	account	credit	15,600	JPY	2009-04-03	proposed	paper	account sub	0	Common department	Business unit

This table has now separated out all the variables into their own columns, and each row represents one observation (one line item in this case).

Of course, real-world data can be more complex, and it may not be possible or practical to achieve perfect tidiness. The key is to aim for a structure that helps you to analyze the data effectively and efficiently.

User

How about this table?

d_A25	d_A26	d_A01	A25-01	A25-A26-08	A25-A26-07 Journal Line Number	A25-A26-14 Bill Date	A25-A26-12 Bill Number	A25-A26-11 Source Code	A25-09 Header Description	A25-02 Journal Number	A26-A09-01 GL Account Number	A26-A09-02 GL Account Name	A26-A09-05 Account Type	A25-A26-27 Credit Debit Indicator	A26-A89-01 Amount	A26-A89-02 Currency	A25-A26-06 Effective Date	A26-A09-06 Account Subtype	A25-A26-10 Line Description	A26-A01-03 Segment Reference Level	A26-A01-01 Business Segment Code	A26-A01-02 Business Segment Name	A26-A01-04 Business Segment Type
General Ledger	General Ledger Line	Business Segment	Journal ID	Type Code
943-8-477			943-8-477	entries	8-477	2009-04-03	manual	gj	General	134
943-8-477	1										754	Office consumables expenses	account	debit	14858	JPY	2009-04-03	proposed	paper
943-8-477	1	1																		account sub	10	Sapporo	Business unit
943-8-477	2										191	Temporary payment consumption tax	account	debit	742	JPY	2009-04-03	tax
943-8-477	2	1																		account sub	10	Sapporo	Business unit
943-8-477	3										111	Cash	account	credit	15600	JPY	2009-04-03	proposed	paper
943-8-477	3	1																		account sub	0	Common department	Business unit

ChatGPT

This table represents an improvement over the previous one in terms of tidiness. Here, each variable now forms a column, and each observation (a general ledger line in this case) forms a row. However, it’s important to note that the tidiness of the data also depends on the specific analysis you want to perform.

One minor point to consider is the presence of cells with repeated or irrelevant information for a specific row, such as the “Journal ID” column which repeats “943-8-477” across rows where it’s not applicable. Similarly, the cells in the “Type Code”, “Journal Line Number”, “Bill Date”, “Bill Number”, “Source Code”, “Header Description”, and “Journal Number” columns are empty for the lines that are not header lines. These could lead to misunderstandings about the structure of your data.

The goal of tidy data is to create a structure where every cell contains one and only one value that describes the variable of the column and the observation of the row.

A tidier version of this table would look like this:

Journal ID	Type Code	Journal Line Number	Bill Date	Bill Number	Source Code	Header Description	Journal Number	General Ledger Line	GL Account Number	GL Account Name	Account Type	Credit Debit Indicator	Amount	Currency	Effective Date	Account Subtype	Line Description	Business Segment	Segment Reference Level	Business Segment Code	Business Segment Name	Business Segment Type
943-8-477	entries	8-477	2009-04-03	manual	gj	General	134	1	754	Office consumables expenses	account	debit	14858	JPY	2009-04-03	proposed	paper	1	account sub	10	Sapporo	Business unit
943-8-477	entries	8-477	2009-04-03	manual	gj	General	134	2	191	Temporary payment consumption tax	account	debit	742	JPY	2009-04-03	tax		1	account sub	10	Sapporo	Business unit
943-8-477	entries	8-477	2009-04-03	manual	gj	General	134	3	111	Cash	account	credit	15600	JPY	2009-04-03	proposed	paper	1	account sub	0	Common department	Business unit

Note: This structure assumes that each line in the General Ledger belongs to only one Business Segment. If a line can belong to multiple segments, you may need to reshape your data accordingly.

User

You are repeating
Journal ID Type Code Journal Line Number Bill Date Bill Number Source Code Header Description Journal Number
943-8-477 entries 8-477 2009-04-03 manual gj General 134

You need to distinguish observation for header and lines.
There is no restriction observation in a table should be the same. So we can store different observations within the same table, which is a hierarchcal tidy data.

ChatGPT

I apologize for the misunderstanding. You’re right. In the context of hierarchical data, different levels of observation can indeed be stored in the same table, and these different levels of observation can make the table look less ‘tidy’ than a traditional tidy data format where each row is a single observation. This method can be particularly useful when working with complex hierarchical or transactional data such as a general ledger.

Given that, your original table, where each row represents a different level of the hierarchy (header, line, business segment) is considered a tidy hierarchical data structure. Each row still represents a single observation, but the type of observation differs between rows:

Some rows represent a ledger header.
Some rows represent a ledger line.
Some rows represent a business segment.

This structure enables you to capture the hierarchical relationships within the data, which would not be possible with a simple flat structure. Each level in the hierarchy is clearly distinguished by the values in the ‘d_A25’, ‘d_A26’, and ‘d_A01’ columns.

It is important to note, however, that while this structure is more flexible and can handle complex hierarchical data, it can also be more challenging to work with than a simple flat structure, as it requires additional processing to filter or aggregate data at different levels of the hierarchy.

User

Provide your understood table of the hierachical tidy data for journal entry of my example.

ChatGPT

Certainly, here is a representation of the hierarchical tidy data format based on your example:

d_A25	d_A26	d_A01	A25-01 Journal ID	A25-A26-08 Type Code	A25-A26-07 Journal Line Number	A25-A26-14 Bill Date	A25-A26-12 Bill Number	A25-A26-11 Source Code	A25-09 Header Description	A25-02 Journal Number	A26-A09-01 GL Account Number	A26-A09-02 GL Account Name	A26-A09-05 Account Type	A25-A26-27 Credit Debit Indicator	A26-A89-01 Amount	A26-A89-02 Currency	A25-A26-06 Effective Date	A26-A09-06 Account Subtype	A25-A26-10 Line Description	A26-A01-03 Segment Reference Level	A26-A01-01 Business Segment Code	A26-A01-02 Business Segment Name	A26-A01-04 Business Segment Type
943-8-477	NA	NA	943-8-477	entries	NA	2009-04-03	manual	gj	General	134	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
943-8-477	1	NA	NA	NA	1	NA	NA	NA	NA	NA	754	Office consumables expenses	account	debit	14858	JPY	2009-04-03	proposed	paper	NA	NA	NA	NA
943-8-477	1	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	account sub	10	Sapporo	Business unit
943-8-477	2	NA	NA	NA	2	NA	NA	NA	NA	NA	191	Temporary payment consumption tax	account	debit	742	JPY	2009-04-03	tax	NA	NA	NA	NA	NA
943-8-477	2	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	account sub	10	Sapporo	Business unit
943-8-477	3	NA	NA	NA	3	NA	NA	NA	NA	NA	111	Cash	account	credit	15600	JPY	2009-04-03	proposed	paper	NA	NA	NA	NA
943-8-477	3	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	account sub	0	Common department	Business unit

The table above is structured so that each row represents an observation at a different level of hierarchy (General Ledger, Line, Business Segment). The columns are divided into variables and identifiers that belong to each hierarchical level. The NA entries show that a variable is not applicable to a particular hierarchical level.

For example, the first row contains variables that only apply to the General Ledger level, while the second row contains variables that apply to the General Ledger Line level, and so forth.

I hope this makes the structure clearer. This type of hierarchical tidy data structure is complex but very powerful, as it can represent multiple levels of hierarchy in a single table. It’s very useful in cases like this where data is structured in multiple levels, such as in financial statements or other complex record-keeping systems.

User

Write down introduction paper for a hierarchical tidy data in 3000 words with example table.
1. Table for original journal entry
2. nearly tidy data with different element in a single column
3.hierarchical tidy data
Use journal entry 3 layer structure as an example.
Explain how this hierarchical tidy data extend original Wickham model.

ChatGPT

wrtites this paper draft …