Extracting and writing email attachments to xlsx using the MS graph API— Reporting System part 1

I was receiving data from a data supplier on a weekly basis, there was no history and it wasn’t in a format that you could easily bring into a BI tool.

Because I wanted to experiment with open source technologies I thought I’d build a mini data mart from this data to capture the history and automate the data cleansing and create some reports from it.

I’m new to open source after a few years as a Power BI Consultant, so if anyone has any improvements for my methodology/code let me know, equally if my blog has helped you in any way do let me know!

I decided to use the following stack for this project

  • apache airlflow — for orchestration of the system
  • python — for ETL , (airlfow is also coded with python)
  • GCP storage buckets and Big Query (I did also have airflow running on a VM for practice however its not worth the cost for this project)
  • WeasyPrint — Really great PDF Generator for python
  • Flask (Jinja2) , HTML, CSS — to create the reports (I really enjoyed created pixel perfect reports this way! )

Further to do on this project

  • Unit Testing
  • CI / CD integration
  • Live dashboard using something like cube.js/ custom react dashboard (https://cube.dev/)

I did my formal python training on a data science bootcamp with General Assembly and this Udemy course really helped me understand airflow and was the foundation for this exercise. https://www.udemy.com/share/103qo0AEAfcFhUQXQD/

I did this project some time ago so I can’t remember some of the supporting articles that helped me do this however will post the ones I do.

I’ve renamed all column names to alphabet characters to avoid any identification of the data sources.

The airflow pipeline pretty much describes the entire process and is visible below.

So the first requirement is to automatically extract the data from the email attachments. (refers to bash_extract_data on the airflow pipeline)

The first step is to get your oauth access token that we will use to access the microsoft graph end point — I’ve used airflow variables in a config file to store my client and secret keys along with other variables.


The next step is to call the graph api and get a response, note I set a rule on outlook to move these emails to a specific mail folder and the below uses that mail folder id which can be found by calling this api https://docs.microsoft.com/en-us/graph/api/user-list-mailfolders?view=graph-rest-1.0&tabs=http

The next part of the code deals with pagination, I found a python function that handles this perfectly, so heres the link https://journeyofthegeek.com/2019/01/08/using-python-to-pull-data-from-ms-graph-api-part-2/

After defining the function I then list all the emails received from the paginated response by calling the function and looping through the email id’s.

Using the email id’s I then create a list of request urls that we will use to call the graph for each email and receive the attachment

The next code block deals with extracting the json response and storing them in a list

Then we write the byte information to csv’s using base64 to decode the bytes into a string and write to xlsx. We also remove any existing xlsx files since we downloading all the attachments everytime. (In future probably would best to only download the latest attachment)

That was it for the first part , we’ve extracted the data from outlook attachments and stored them in xlsx files in an automated fashion.

Up next will be dealing with transforming the data and scraping the web for images.

Full code below