Search Posts

Visits: 44

Nobuyuki SAMBUICHI

ISO/TC295 Audit data services/SG1 Semantic model Convener

Following texts are produced by ChatGPT and edited by me.

ChatGPT said “As an AI language model, I don’t have feelings or opinions. However, based on our interaction, I hope I’ve been able to provide clear and helpful responses to your questions. We’ve discussed several complex topics, including the principles of tidy data, the challenges of applying these principles to hierarchical data structures, and how hierarchical tidy data can extend the tidy data framework. Throughout our conversation, I have aimed to ensure that the information I provide is accurate, comprehensive, and understandable. If you have any other questions or need further clarification on any of these topics, please feel free to ask!”

1. Introduction

Data are often messy. It may not be in the format that statisticians and data scientists prefer, which often leads to time-consuming cleaning processes. Hadley Wickham proposed the concept of tidy data as a standard way of mapping the semantics of a dataset to its physical representation. A dataset is tidy when each column is a variable, each row is an observation, and each cell contains a single value. This approach has proven tremendously useful in data cleaning and exploratory data analysis.

However, the original definition of tidy data by Wickham doesn’t consider hierarchical data, a common type of data structure where variables have relationships with one another, and these relationships form a structure of different levels, or hierarchies.

Hierarchical data can often be found in financial accounting systems. A key example is the General Ledger used in business accounting, where transactions are stored in a series of debits and credits across multiple accounts. This can be represented as a three-layer hierarchical structure, which includes General Ledger as the top level, Journal Lines in the middle, and Business Segment at the bottom level.

This paper proposes an extension of the original Wickham model, coined as “Hierarchical Tidy Data,” to manage this type of hierarchical data efficiently.

2. Original Journal Entry

Consider a simple journal entry scenario. We’ll illustrate this with a table of data (Table 1), where columns represent variables, and rows are observations.

Table 1. Original Journal Entry Data

ID

Date

Line Description

Debit (JPY)

Credit (JPY)

Account number

Account description

Business Segment Code

Business Segment Name

Business Segment Type

943-8-477

2009-04-03

Paper

14858

754

Office consumables expenses

10

Sapporo

Business unit

943-8-477

2009-04-03

742

191

Temporary payment consumption tax

10

Sapporo

Business unit

943-8-477

2009-04-03

Paper

15,600

111

Cash

0

Common department

Business unit

3. Nearly Tidy Data

Our first step in tidying this data involves identifying variables within a table. However, this nearly tidy data is still problematic because it mixes different elements within a single column.

  • d_A25 General Ledger (representing details about the entire Journal Entry)

  • d_A26 General Ledger Line (representing details about a specific Line in the Journal Entry)

  • d_A01 Business Segment (representing the Business Segment associated with a specific line)

The following table represents a three-level hierarchy using the variables above.

Table 2. Nearly Tidy Data
d_A25 d_A26 d_A01 ID Element Value

943-8-477

A25-01

Journal ID

943-8-477

943-8-477

A25-A26-08

JE Type Code

entries

943-8-477

A25-A26-07

Journal Line Number

8-477

943-8-477

A25-A26-14

Bill Date

2009-04-03

943-8-477

A25-A26-12

Bill Number

manual

943-8-477

A25-A26-11

Source Code

gj

943-8-477

A25-09

Header Description

General

943-8-477

A25-02

Journal Number

134

943-8-477

1

A26-A09-01

GL Account Number

754

943-8-477

1

A26-A09-02

GL Account Name

Office consumables expenses

943-8-477

1

A26-A09-05

Account Type

account

943-8-477

1

A25-A26-27

Credit Debit Indicator

debit

943-8-477

1

A26-A89-01

Amount

14,858

943-8-477

1

A26-A89-02

Currency

JPY

943-8-477

1

A25-A26-06

Effective Date

2009-04-03

943-8-477

1

A26-A09-06

Account Subtype

proposed

943-8-477

1

A25-A26-10

Line Description

paper

943-8-477

1

1

A26-A01-03

Segment Reference Level

account sub

943-8-477

1

1

A26-A01-01

Business Segment Code

10

943-8-477

1

1

A26-A01-02

Business Segment Name

Sapporo

943-8-477

1

1

A26-A01-04

Business Segment Type

Business unit

943-8-477

2

A26-A09-01

GL Account Number

191

943-8-477

2

A26-A09-02

GL Account Name

Temporary payment consumption tax

943-8-477

2

A26-A09-05

Account Type

account

943-8-477

2

A25-A26-27

Credit Debit Indicator

debit

943-8-477

2

A26-A89-01

Amount

742

943-8-477

2

A26-A89-02

Currency

JPY

943-8-477

2

A25-A26-06

Effective Date

2009-04-03

943-8-477

2

A26-A09-06

Account Subtype

tax

943-8-477

2

1

A26-A01-03

Segment Reference Level

account sub

943-8-477

2

1

A26-A01-01

Business Segment Code

10

943-8-477

2

1

A26-A01-02

Business Segment Name

Sapporo

943-8-477

2

1

A26-A01-04

Business Segment Type

Business unit

943-8-477

3

A26-A09-01

GL Account Number

111

943-8-477

3

A26-A09-02

GL Account Name

Cash

943-8-477

3

A26-A09-05

Account Type

account

943-8-477

3

A25-A26-27

Credit Debit Indicator

credit

943-8-477

3

A26-A89-01

Amount

15,600

943-8-477

3

A26-A89-02

Currency

JPY

943-8-477

3

A25-A26-06

Effective Date

2009-04-03

943-8-477

3

A26-A09-06

Account Subtype

proposed

943-8-477

3

A25-A26-10

Line Description

paper

943-8-477

3

1

A26-A01-03

Segment Reference Level

account sub

943-8-477

3

1

A26-A01-01

Business Segment Code

0

943-8-477

3

1

A26-A01-02

Business Segment Name

Common department

943-8-477

3

1

A26-A01-04

Business Segment Type

Business unit

This format is better, but not ideal. The element column combines multiple variables, making it harder to work with the data. Moreover, this structure doesn’t fully exploit the hierarchical nature of the data.

4. Hierarchical Tidy Data

The final step in tidying the data is to transform it into Hierarchical Tidy Data. This involves reorganizing the table so that each type of observational unit forms a table. Here, we represent the hierarchical structure within a single table without repetition.

Tidy data, as defined by Hadley Wickham, is a standard way of mapping the underlying semantics of a dataset to its physical representation. In tidy data:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

While this standard has been highly effective for a wide range of datasets and use-cases, it has limitations when dealing with complex hierarchical datasets. For instance, consider the example of a General Ledger, a key accounting document, which has a hierarchical structure. The ledger comprises of various journal entries, each of which has multiple lines, and each line might relate to various business segments. In such cases, it is challenging to maintain the principles of tidy data.

This is where hierarchical tidy data comes into play. Hierarchical tidy data extends the original tidy data principles to handle these more complex, hierarchical data structures.

In hierarchical tidy data:

  1. Each variable still forms a column.

  2. Each observation, regardless of its hierarchical level, forms a row.

  3. Different types of observational units can exist within the same table, distinguished by a variable that indicates the hierarchical level.

This approach allows us to incorporate different observational units (e.g., journal headers, journal lines, and business segments) in the same table. Each unit has a unique identifier, and there are additional variables to link between different units in the hierarchy.

In the context of the General Ledger, each row can be a header, a line, or a business segment detail. The distinction between these different units is made using a variable indicating the hierarchical level. The relationships between these units are maintained using unique identifiers for each unit and additional variables that link them in a hierarchical manner.

By extending the tidy data concept in this way, we can handle more complex, hierarchical datasets while maintaining the benefits of tidy data, such as simplicity and intuitiveness of structure. This hierarchical tidy data structure provides a powerful framework for managing and analyzing complex datasets.

Table 3. Hierarchical Tidy Data
d_A25 d_A26 d_A01 A25-01 A25-A26-08 A25-A26-07 Journal Line Number A25-A26-14 Bill Date A25-A26-12 Bill Number A25-A26-11 Source Code A25-09 Header Description A25-02 Journal Number A26-A09-01 GL Account Number A26-A09-02 GL Account Name A26-A09-05 Account Type A25-A26-27 Credit Debit Indicator A26-A89-01 Amount A26-A89-02 Currency A25-A26-06 Effective Date A26-A09-06 Account Subtype A25-A26-10 Line Description A26-A01-03 Segment Reference Level A26-A01-01 Business Segment Code A26-A01-02 Business Segment Name A26-A01-04 Business Segment Type

General Ledger

General Ledger Line

Business Segment

Journal ID

Type Code

943-8-477

943-8-477

entries

8-477

2009-04-03

manual

gj

General

134

943-8-477

1

754

Office consumables expenses

account

debit

14858

JPY

2009-04-03

proposed

paper

943-8-477

1

1

account sub

10

Sapporo

Business unit

943-8-477

2

191

Temporary payment consumption tax

account

debit

742

JPY

2009-04-03

tax

943-8-477

2

1

account sub

10

Sapporo

Business unit

943-8-477

3

111

Cash

account

credit

15600

JPY

2009-04-03

proposed

paper

943-8-477

3

1

account sub

0

Common department

Business unit

NOTE In this format, each type of observation has its row, even though they may have different lengths. In this case, there are three layers represented, each corresponding to a different level in the hierarchy.

5. Conclusion

The hierarchical tidy data concept extends Wickham’s tidy data principles by adding the capability to handle hierarchical data efficiently. The tidying process maintains the easy-to-use nature of tidy data while enabling more complex and hierarchical data structures to be worked with effectively. This approach is particularly valuable in fields such as financial accounting, where hierarchical data is common. By adopting this model, we can better manage, analyze, and visualize hierarchical data structures.

6. Dialogues with ChatGPT

User

Can you refine following text so that technical expert,who is new to tidy data, can understand?
With explanation of tidy data and relationship to this implementation.
Add examples for each clauses to be easily understood. use my table data in explanation.

describe more detail specific to journal entry data.
Journal entry data is structured 3 layers.
A header contains posted data for business entity.
Line items detail contains debit/credit accountinc code, amount, tax, transaction description.
chart of accounts structure has various sub-account structure, such as business unit, cost center, bank account, customer, supplier, and so on.
These 3 layer structure can be recorded usiung hierarchical tidy data in a single table.

ChatGPT

The concept of tidy data was proposed by Hadley Wickham as a standardized way to link the layout of a dataset and its semantics (meaning). In a tidy dataset, each variable is presented as a column, each observation as a row, and each type of observational unit as a table.

Let’s apply this concept to the structure of a General Ledger journal entry, which typically consists of three layers:

  1. The header layer, which provides posted data for a business entity.

  2. The line items detail layer, containing elements such as debit/credit accounting code, amount, tax, and transaction description.

  3. The chart of accounts structure layer, with various sub-account structures like business unit, cost center, bank account, customer, and supplier.

Let’s take a look at an example of a journal entry for purchasing paper with cash, as presented in Table 11:

ID Date Line Description Debit (JPY) Credit (JPY) Account number Account description Business Segment Code Business Segment Name Business Segment Type

943-8-477

2009-04-03

Paper

14858

754

Office consumables expenses

10

Sapporo

Business unit

943-8-477

2009-04-03

742

191

Temporary payment consumption tax

10

Sapporo

Business unit

943-8-477

2009-04-03

Paper

15,600

111

Cash

0

Common department

Business unit

The general ledger line in this example includes variables (such as debit amount, credit amount, account number, etc.) as column headers and their corresponding values as rows.

For a person new to tidy data, this structure may seem odd. This is why Wickham suggested the concept of ‘tidying’ data to describe the underlying semantics.

Now, let’s revisit the concept of data semantics based on Wickham’s explanation. He describes a dataset as a collection of values, where every value belongs to a variable and an observation. In essence, variables are the different measurements or characteristics, while observations are the instances or entities the data represents.

To illustrate this, we can look at Tables 13, 14, and 15, which represent data from an imaginary experiment. The data is the same in all three tables, but each table presents the data differently.

The columns in the tables represent variables, like “Treatment” or “Person”, while the rows represent observations, such as “John Smith” or “Jane Doe”. The actual data values (the numbers or missing values) link a specific observation with a variable, forming a single data point.

Now, let’s examine the semantic structure of the General Ledger, represented in Table 16. This structure shows how variables are organized in a digital ledger. Normally, to represent this structure, multiple Codd’s 3rd normal form tables would be defined and connected in a relational database. Tidy data, however, allows us to reorganize this complex structure into a single tidy table. Tables 17 to 21 represent the tables for the relational database.

The next challenge is to create tidy data for our General Ledger. The first step involves identifying variables within an aggregation table within a table. If a variable’s occurrence is specified as multiple, it should be reported.

In the example of our General Ledger, which is recorded in a single currency, we don’t need a variable for a multi-currency amount. We identify

User

Descrive more detail on your following texts.

“Now, let’s examine the semantic structure of the General Ledger, represented in Table 16. This structure shows how variables are organized in a digital ledger. Normally, to represent this structure, multiple Codd’s 3rd normal form tables would be defined and connected in a relational database. Tidy data, however, allows us to reorganize this complex structure into a single tidy table. Tables 17 to 21 represent the tables for the relational database.

The next challenge is to create tidy data for our General Ledger. The first step involves identifying variables within an aggregation table within a table. If a variable’s occurrence is specified as multiple, it should be reported.

In the example of our General Ledger, which is recorded in a single currency, we don’t need a variable for a multi-currency amount. We identify.”

NOTE These texts is explaining Codd’s normalization approach.

ChatGPT

Sure, I’m happy to elaborate.

When managing financial data in a digital ledger, the database design will typically follow certain guidelines or ‘normal forms’ to ensure that the data is efficiently organized, and free of redundancy or inconsistency. This is where the concept of Codd’s 3rd normal form comes into play, which is a design principle used in relational databases. According to this principle, every non-prime attribute of a table should be non-transitively dependent on the primary key. In simpler terms, each piece of data in the table should depend only on the primary key, and not on other pieces of data in the table. This helps to prevent data anomalies and redundancy.

Let’s imagine a typical General Ledger structure which would require multiple tables to represent the different layers of data:

  1. The Journal Entry Header Table would store the ID, Date, and any other metadata related to a journal entry as a whole. The primary key here would be the Journal Entry ID.

  2. The Journal Entry Line Table would store each individual line of a journal entry, including its associated Debit or Credit, the Account Number, Line Description, etc. The primary key could be a combination of the Journal Entry ID and a Line Item Number.

  3. The Chart of Accounts Table would store a list of all available accounts that could be used in a journal entry line, including its Account Number, Account Description, and other related information. The Account Number would be the primary key in this case.

  4. The Business Segment Table would contain details of each business segment, such as Business Segment Code, Business Segment Name, and Business Segment Type. The primary key could be the Business Segment Code.

  5. Lastly, a Business Segment Line Association Table could map each line of a journal entry to a business segment. The primary key might be a combination of the Journal Entry ID, Line Item Number, and Business Segment Code.

This design, while efficient in terms of data storage and integrity, can make the data hard to read or analyze because the data for a single journal entry is spread across multiple tables. That’s where the idea of ‘tidy’ data comes in.

When we ‘tidy’ this data, we aim to transform it into a single table where each row represents a single observation (e.g., a single line item in a journal entry), and each column represents a variable (e.g., the Debit amount, the Account Number, etc.). In this context, a single observation would include all the details of a line item, including the associated business segment and chart of accounts information.

So, the first step in tidying the data would be to determine what constitutes an ‘observation’. In this case, it would be a single line item of a journal entry, including all its related data.

Next, we identify all the variables that are associated with that observation. For each variable, we create a column in our tidy data table. In our journal entry example, these variables would include the debit amount, credit amount, account number, line description, and business segment information.

However, if a variable doesn’t add valuable information or is not applicable to the dataset, it can be omitted. For example, if all the transactions in our ledger are recorded in a single currency (e.g., JPY), then we don’t need a column to specify the currency for each transaction.

After identifying our observations and variables, we can then create our tidy data table by populating it with our ledger data, ensuring that each row corresponds to a single line item of a journal entry and each column corresponds to a specific variable. The final result would be a comprehensive, user-friendly table that contains all the data of our general ledger in a ‘tidy’ format.

User

Add example for multiple lines for journal entry.

In our example we know that General Ledger is recorded in a single currency, Local Currency is Reporting Currency. We don’t need variable for multi-currency amount.
There are following three variables .

d_A25 General Ledger
d_A26 General Ledger Line
d_A01 Business Segment

Which data item in table 11 goes to where.

ChatGPT

Given the complexity of a general ledger, a complete example might be out of scope, but I’ll provide a simplified example using your specified variables. Assuming the ‘d_’ prefix is part of your data labeling, I’ll illustrate it as:

  • d_A25 General Ledger (representing details about the entire Journal Entry)

  • d_A26 General Ledger Line (representing details about a specific Line in the Journal Entry)

  • d_A01 Business Segment (representing the Business Segment associated with a specific line)

To show how these variables could be structured in a tidy format, I’ll create an example tidy table based on the information you’ve provided. Note that some necessary details for a complete Journal Entry like debit or credit amounts, account number, line description, etc. are omitted for simplicity.

Journal_Entry_ID Entry_Date Line_Item_Number Business_Segment Line_Description

JE001

2023-01-01

1

BS001

Purchase of Equipment

JE001

2023-01-01

2

BS002

Depreciation

JE002

2023-01-02

1

BS001

Office Supplies

JE002

2023-01-02

2

BS003

Revenue from Sales

In this table:

  • The Journal_Entry_ID and Entry_Date columns correspond to the d_A25 General Ledger variable, representing the details of the entire Journal Entry.

  • The Line_Item_Number and Line_Description columns correspond to the d_A26 General Ledger Line variable, representing the details of a specific line in the Journal Entry.

  • The Business_Segment column corresponds to the d_A01 Business Segment variable, representing the business segment associated with each line in the Journal Entry.

This format brings all the relevant data into a single table, making it easier to analyze. However, it’s crucial to remember that while this format can be easier to work with for data analysis, it does not replace the need for normalized tables in a database for data integrity and efficient storage.

As for your question about Table 11, without seeing the actual data and structure of that table, I can’t provide a specific mapping. Typically, you would identify where each piece of data in Table 11 belongs in terms of the variables and observations we’ve identified, and then create the tidy data table accordingly.

User

We need to include header, line, business segment within a single table without repetition. This is the goal for hierarchcal tidy data.

ChatGPT

If I understand your goal correctly, you’re aiming to create a tidy hierarchical data table where the journal entry information (header), line item information, and business segment information are all included in a single table, and there’s no unnecessary repetition.

In the context of a General Ledger, it can be challenging because the very nature of the journal entries is hierarchical (entries have line items, line items have business segments). One common approach to handle such hierarchies is to use “wide” format instead of “long” format, which can reduce repetition.

Here’s an example based on the earlier variables. We’ll assume that each Line Item is linked to a single Business Segment, and that each Journal Entry has a maximum of two line items for simplicity.

Journal_Entry_ID Entry_Date Line_Item_Number_1 Business_Segment_1 Line_Description_1 Line_Item_Number_2 Business_Segment_2 Line_Description_2

JE001

2023-01-01

1

BS001

Purchase of Equipment

2

BS002

Depreciation

JE002

2023-01-02

1

BS001

Office Supplies

2

BS003

Revenue from Sales

This table layout avoids repetition of the Journal Entry details (d_A25 General Ledger) by placing multiple Line Items (d_A26 General Ledger Line and d_A01 Business Segment) in the same row as the corresponding Journal Entry. However, this “wide” format can become unwieldy if there are many Line Items or Business Segments per Journal Entry.

Remember that the concept of “tidy data” usually implies that each variable forms a column and each observation forms a row. In more complex, hierarchical data scenarios like this, the definition can be more flexible and must be balanced with practical considerations like ease of use and clarity of representation.

User

Is the following a tidy data?

d_A25 d_A26 d_A01 ID Element Value

943-8-477

A25-01

Journal ID

943-8-477

943-8-477

A25-A26-08

JE Type Code

entries

943-8-477

A25-A26-07

Journal Line Number

8-477

943-8-477

A25-A26-14

Bill Date

2009-04-03

943-8-477

A25-A26-12

Bill Number

manual

943-8-477

A25-A26-11

Source Code

gj

943-8-477

A25-09

Header Description

General

943-8-477

A25-02

Journal Number

134

943-8-477

1

A26-A09-01

GL Account Number

754

943-8-477

1

A26-A09-02

GL Account Name

Office consumables expenses

943-8-477

1

A26-A09-05

Account Type

account

943-8-477

1

A25-A26-27

Credit Debit Indicator

debit

943-8-477

1

A26-A89-01

Amount

14,858

943-8-477

1

A26-A89-02

Currency

JPY

943-8-477

1

A25-A26-06

Effective Date

2009-04-03

943-8-477

1

A26-A09-06

Account Subtype

proposed

943-8-477

1

A25-A26-10

Line Description

paper

943-8-477

1

1

A26-A01-03

Segment Reference Level

account sub

943-8-477

1

1

A26-A01-01

Business Segment Code

10

943-8-477

1

1

A26-A01-02

Business Segment Name

Sapporo

943-8-477

1

1

A26-A01-04

Business Segment Type

Business unit

943-8-477

2

A26-A09-01

GL Account Number

191

943-8-477

2

A26-A09-02

GL Account Name

Temporary payment consumption tax

943-8-477

2

A26-A09-05

Account Type

account

943-8-477

2

A25-A26-27

Credit Debit Indicator

debit

943-8-477

2

A26-A89-01

Amount

742

943-8-477

2

A26-A89-02

Currency

JPY

943-8-477

2

A25-A26-06

Effective Date

2009-04-03

943-8-477

2

A26-A09-06

Account Subtype

tax

943-8-477

2

1

A26-A01-03

Segment Reference Level

account sub

943-8-477

2

1

A26-A01-01

Business Segment Code

10

943-8-477

2

1

A26-A01-02

Business Segment Name

Sapporo

943-8-477

2

1

A26-A01-04

Business Segment Type

Business unit

943-8-477

3

A26-A09-01

GL Account Number

111

943-8-477

3

A26-A09-02

GL Account Name

Cash

943-8-477

3

A26-A09-05

Account Type

account

943-8-477

3

A25-A26-27

Credit Debit Indicator

credit

943-8-477

3

A26-A89-01

Amount

15,600

943-8-477

3

A26-A89-02

Currency

JPY

943-8-477

3

A25-A26-06

Effective Date

2009-04-03

943-8-477

3

A26-A09-06

Account Subtype

proposed

943-8-477

3

A25-A26-10

Line Description

paper

943-8-477

3

1

A26-A01-03

Segment Reference Level

account sub

943-8-477

3

1

A26-A01-01

Business Segment Code

0

943-8-477

3

1

A26-A01-02

Business Segment Name

Common department

943-8-477

3

1

A26-A01-04

Business Segment Type

Business unit

ChatGPT

This table is an example of “long” format, and is not quite “tidy” data as defined by Hadley Wickham in his influential paper “Tidy Data.”

According to Wickham’s definition, tidy data has the following characteristics:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

In your table, the columns “ID”, “Element”, and “Value” are actually containing multiple variables. For example, “GL Account Number”, “GL Account Name”, “Amount”, and “Currency” are all different variables but they’re squeezed into a single column (“Element”) with their corresponding values in the “Value” column.

A “tidier” version of this table could look like this:

Journal_ID JE_Type_Code Journal_Line_Number Bill_Date Bill_Number Source_Code Header_Description Journal_Number Line_Item_Number GL_Account_Number GL_Account_Name Account_Type Credit_Debit_Indicator Amount Currency Effective_Date Account_Subtype Line_Description Segment_Reference_Level Business_Segment_Code Business_Segment_Name Business_Segment_Type

943-8-477

entries

8-477

2009-04-03

manual

gj

General

134

1

754

Office consumables expenses

account

debit

14,858

JPY

2009-04-03

proposed

paper

account sub

10

Sapporo

Business unit

943-8-477

entries

8-477

2009-04-03

manual

gj

General

134

2

191

Temporary payment consumption tax

account

debit

742

JPY

2009-04-03

tax

paper

account sub

10

Sapporo

Business unit

943-8-477

entries

8-477

2009-04-03

manual

gj

General

134

3

111

Cash

account

credit

15,600

JPY

2009-04-03

proposed

paper

account sub

0

Common department

Business unit

This table has now separated out all the variables into their own columns, and each row represents one observation (one line item in this case).

Of course, real-world data can be more complex, and it may not be possible or practical to achieve perfect tidiness. The key is to aim for a structure that helps you to analyze the data effectively and efficiently.

User

How about this table?

d_A25 d_A26 d_A01 A25-01 A25-A26-08 A25-A26-07 Journal Line Number A25-A26-14 Bill Date A25-A26-12 Bill Number A25-A26-11 Source Code A25-09 Header Description A25-02 Journal Number A26-A09-01 GL Account Number A26-A09-02 GL Account Name A26-A09-05 Account Type A25-A26-27 Credit Debit Indicator A26-A89-01 Amount A26-A89-02 Currency A25-A26-06 Effective Date A26-A09-06 Account Subtype A25-A26-10 Line Description A26-A01-03 Segment Reference Level A26-A01-01 Business Segment Code A26-A01-02 Business Segment Name A26-A01-04 Business Segment Type

General Ledger

General Ledger Line

Business Segment

Journal ID

Type Code

943-8-477

943-8-477

entries

8-477

2009-04-03

manual

gj

General

134

943-8-477

1

754

Office consumables expenses

account

debit

14858

JPY

2009-04-03

proposed

paper

943-8-477

1

1

account sub

10

Sapporo

Business unit

943-8-477

2

191

Temporary payment consumption tax

account

debit

742

JPY

2009-04-03

tax

943-8-477

2

1

account sub

10

Sapporo

Business unit

943-8-477

3

111

Cash

account

credit

15600

JPY

2009-04-03

proposed

paper

943-8-477

3

1

account sub

0

Common department

Business unit

ChatGPT

This table represents an improvement over the previous one in terms of tidiness. Here, each variable now forms a column, and each observation (a general ledger line in this case) forms a row. However, it’s important to note that the tidiness of the data also depends on the specific analysis you want to perform.

One minor point to consider is the presence of cells with repeated or irrelevant information for a specific row, such as the “Journal ID” column which repeats “943-8-477” across rows where it’s not applicable. Similarly, the cells in the “Type Code”, “Journal Line Number”, “Bill Date”, “Bill Number”, “Source Code”, “Header Description”, and “Journal Number” columns are empty for the lines that are not header lines. These could lead to misunderstandings about the structure of your data.

The goal of tidy data is to create a structure where every cell contains one and only one value that describes the variable of the column and the observation of the row.

A tidier version of this table would look like this:

Journal ID Type Code Journal Line Number Bill Date Bill Number Source Code Header Description Journal Number General Ledger Line GL Account Number GL Account Name Account Type Credit Debit Indicator Amount Currency Effective Date Account Subtype Line Description Business Segment Segment Reference Level Business Segment Code Business Segment Name Business Segment Type

943-8-477

entries

8-477

2009-04-03

manual

gj

General

134

1

754

Office consumables expenses

account

debit

14858

JPY

2009-04-03

proposed

paper

1

account sub

10

Sapporo

Business unit

943-8-477

entries

8-477

2009-04-03

manual

gj

General

134

2

191

Temporary payment consumption tax

account

debit

742

JPY

2009-04-03

tax

1

account sub

10

Sapporo

Business unit

943-8-477

entries

8-477

2009-04-03

manual

gj

General

134

3

111

Cash

account

credit

15600

JPY

2009-04-03

proposed

paper

1

account sub

0

Common department

Business unit

Note: This structure assumes that each line in the General Ledger belongs to only one Business Segment. If a line can belong to multiple segments, you may need to reshape your data accordingly.

User

You are repeating
Journal ID Type Code Journal Line Number Bill Date Bill Number Source Code Header Description Journal Number
943-8-477 entries 8-477 2009-04-03 manual gj General 134

You need to distinguish observation for header and lines.
There is no restriction observation in a table should be the same. So we can store different observations within the same table, which is a hierarchcal tidy data.

ChatGPT

I apologize for the misunderstanding. You’re right. In the context of hierarchical data, different levels of observation can indeed be stored in the same table, and these different levels of observation can make the table look less ‘tidy’ than a traditional tidy data format where each row is a single observation. This method can be particularly useful when working with complex hierarchical or transactional data such as a general ledger.

Given that, your original table, where each row represents a different level of the hierarchy (header, line, business segment) is considered a tidy hierarchical data structure. Each row still represents a single observation, but the type of observation differs between rows:

  • Some rows represent a ledger header.

  • Some rows represent a ledger line.

  • Some rows represent a business segment.

This structure enables you to capture the hierarchical relationships within the data, which would not be possible with a simple flat structure. Each level in the hierarchy is clearly distinguished by the values in the ‘d_A25’, ‘d_A26’, and ‘d_A01’ columns.

It is important to note, however, that while this structure is more flexible and can handle complex hierarchical data, it can also be more challenging to work with than a simple flat structure, as it requires additional processing to filter or aggregate data at different levels of the hierarchy.

User

Provide your understood table of the hierachical tidy data for journal entry of my example.

ChatGPT

Certainly, here is a representation of the hierarchical tidy data format based on your example:

d_A25 d_A26 d_A01 A25-01 Journal ID A25-A26-08 Type Code A25-A26-07 Journal Line Number A25-A26-14 Bill Date A25-A26-12 Bill Number A25-A26-11 Source Code A25-09 Header Description A25-02 Journal Number A26-A09-01 GL Account Number A26-A09-02 GL Account Name A26-A09-05 Account Type A25-A26-27 Credit Debit Indicator A26-A89-01 Amount A26-A89-02 Currency A25-A26-06 Effective Date A26-A09-06 Account Subtype A25-A26-10 Line Description A26-A01-03 Segment Reference Level A26-A01-01 Business Segment Code A26-A01-02 Business Segment Name A26-A01-04 Business Segment Type

943-8-477

NA

NA

943-8-477

entries

NA

2009-04-03

manual

gj

General

134

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

943-8-477

1

NA

NA

NA

1

NA

NA

NA

NA

NA

754

Office consumables expenses

account

debit

14858

JPY

2009-04-03

proposed

paper

NA

NA

NA

NA

943-8-477

1

1

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

account sub

10

Sapporo

Business unit

943-8-477

2

NA

NA

NA

2

NA

NA

NA

NA

NA

191

Temporary payment consumption tax

account

debit

742

JPY

2009-04-03

tax

NA

NA

NA

NA

NA

943-8-477

2

1

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

account sub

10

Sapporo

Business unit

943-8-477

3

NA

NA

NA

3

NA

NA

NA

NA

NA

111

Cash

account

credit

15600

JPY

2009-04-03

proposed

paper

NA

NA

NA

NA

943-8-477

3

1

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

NA

account sub

0

Common department

Business unit

The table above is structured so that each row represents an observation at a different level of hierarchy (General Ledger, Line, Business Segment). The columns are divided into variables and identifiers that belong to each hierarchical level. The NA entries show that a variable is not applicable to a particular hierarchical level.

For example, the first row contains variables that only apply to the General Ledger level, while the second row contains variables that apply to the General Ledger Line level, and so forth.

I hope this makes the structure clearer. This type of hierarchical tidy data structure is complex but very powerful, as it can represent multiple levels of hierarchy in a single table. It’s very useful in cases like this where data is structured in multiple levels, such as in financial statements or other complex record-keeping systems.

User

Write down introduction paper for a hierarchical tidy data in 3000 words with example table.
1. Table for original journal entry
2. nearly tidy data with different element in a single column
3.hierarchical tidy data
Use journal entry 3 layer structure as an example.
Explain how this hierarchical tidy data extend original Wickham model.

ChatGPT

wrtites this paper draft …​

コメントを残す

メールアドレスが公開されることはありません。 が付いている欄は必須項目です