Transforming a pre-formatted excel report into a tabular format without macros — Reporting system part 2

This article follows on from my previous article https://alexanderamlani24.medium.com/extracting-and-writing-email-attachments-to-xlsx-using-the-ms-graph-api-reporting-system-part-1-82819256b630

In the past when dealing with pre-formatted reports I’ve either had to ask for direct access to the reporting database and when thats not been possible i’ve had to rely on vba/macros or manual intervention to cleanse the report into a format that i could then work with. So I thought this would be a good challenge for my new python skills.

This is the structure of the data that I extracted in the last blog post. I’ve only put the structure of the data and no column names to avoid identification of the supplier.

The first thing I noticed is that there was some hidden files in the folder, now this might have been because I had a file open and a temporary file was created, however I thought I’d write a function to make sure we don’t get anything we dont want in.

# Import libraries
import pandas as pd
import datetime
from os import listdir
from os.path import isfile, join
import re
import time

The next step is to create a function that will clean each individual file and then concatenate them into one dataframe

def cleandata(folder):
df_list = []

The above code cleans the data up so that each table of data for each category in the single report appends into one table and empty cells are filled along with data types are assigned.

We now however need to somehow associate each row in the data frame with the associated category which was above each table in the original report

There is one column which is a sub category in the table which we will use to create a lookup table and then join it on to the main table in the next step

def category_df(folder):
category_df_list = []

The above code assigns a column that dertmines whether the column string contains a numeric value or not. The categories above the tables all contained numbers and I used this to help me create the category column. As the category column itself fell directly on top of the subcategory column.

def category_df(folder):
category_df_list = []

The end result is two files one with all reports cleaned into one large data frame and a look up table to match sub-category to major category

We had to derive the sub-category and category this way, as we were not able to manually create a lookup table due to the fact that we do not control any new sub-categories/categories being added

The below code then takes these staging tables and creates a star schema with some additional meta data for the image

# Import libraries
import pandas as pd
import datetime
from os import listdir
from os.path import isfile, join
from airflow.models import Variable

The following article will go over the generation of the reports