What are we doing?
If you’ve ever fancied being one of the cool kids, you know it starts with being fluent in Python. No one likes faffing about pulling CSVs from different silos into Google Sheets month by month to compile their reports. Absolutely no one. We’re big proponents here of “If it can be automated, it should be!” As this is the first article in the series, we’re going to start with baby steps: how can we automate the extraction of Google Analytics data using Python? Subsequent articles will deal with how we warehouse this data and combine it with other data sources.
This is a basic step-by-step guide to start you on your Google Analytics API journey using Python. The end result will be a CSV listing the page paths in your reports that carry cardinality-causing query parameters you might not want, sorted by sessions. Python experience isn’t strictly necessary if you are just looking for a quick guide to getting started, but you’ll need some to understand what the process is doing in its entirety.
Step 1. Starting With Python
Getting a useful, easy-to-use workflow in Python is pretty simple. There are two parts, and I’m going to use a fun theatre analogy: a virtual environment, which you can think of as ‘setting the stage’, and pip modules, which you can think of as the actors on this stage. The Python script can’t be performed without them. Get it?
This part can be a little confusing so I have recorded a video below to guide you, though if you prefer the text version, read on:
First we enter the Python theatre. If you haven’t already, you can download Python (and pip along with it) from python.org. You can check it’s installed globally with a quick ‘python -V’ in the CMD console. If you’re having trouble with Python or pip, I can’t help you in this article, as setups vary a lot depending on the machine, but every error has an answer on Google.
I like to work within a ‘Dev’ folder for this. Within it, I keep a folder for each Python project I’m working on (which will contain our .py script) and a single ‘venv’ folder that holds all my virtual environments in one place.
Now we get into ‘setting our stage’. Open up CMD, navigate to the ‘venv’ folder using ‘cd’, and install the virtualenv module. This is our ‘stage builder’.
venv> python -m pip install --user virtualenv
You can check you’ve done this correctly with a ‘virtualenv --version’ once complete. Now, in the same place, we are going to create our Google Analytics ‘stage’.
venv> virtualenv ga-api
It should have created a folder called ‘ga-api’, which is what we use to activate our environment, like so:
venv> .\ga-api\Scripts\activate
If all’s gone well, you should see a ‘(ga-api)’ next to your directory. We can now navigate out of the venv folder and into our project folder using ‘cd’ commands. We need to add pip modules to our environment, or to continue the analogy, hire actors for our stage. These actors are called ‘google-api-python-client’ and ‘oauth2client’, plus ‘pandas’, which the script uses later to write the CSV. We can hire them all at the same time, like so:
analytics-reporting-project(ga-api) > python -m pip install --upgrade google-api-python-client oauth2client pandas
You may be asked to confirm the install with the Y key. Once complete, we’re ready for the next step, but leave this CMD window open for later!
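Before moving on, it can be worth confirming that the stage is actually set. The snippet below is a minimal sketch, not part of the original script: the function names `in_virtual_environment` and `missing_modules` are made up for illustration, and the module list assumes the three imports used later in this guide (`googleapiclient`, `oauth2client` and `pandas`).

```python
import importlib.util
import sys

# Import names (not pip package names) used by the script in this guide.
REQUIRED_MODULES = ["googleapiclient", "oauth2client", "pandas"]


def in_virtual_environment():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer ones (and venv)
    # make sys.prefix differ from sys.base_prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_modules(names):
    """Return the top-level module names the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Virtual environment active:", in_virtual_environment())
    print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

If anything shows up as missing, the likely culprit is that pip installed into a different Python than the one behind the ‘(ga-api)’ prompt.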
Step 2. API Access Through The Google Cloud Console
For Python to be able to call our data from Google Analytics, we first must set up the means of communication with a new Google Cloud project. If you’ve never used Google Cloud Console before, it’s as simple as going to console.cloud.google.com and creating a new project.
Use the navigation menu to get yourself in the APIs & Services Library. Here you can enable the Google Analytics Reporting API.
On the next screen, create credentials for a Service Account with Owner-level project access. You should see a button to create a new JSON key, which we will use to give our script permissions.
When prompted to save this JSON file, keep it in your recently created project folder for safekeeping and rename it ‘client_secrets.json’.
In the JSON file, you should see a field called ‘client_email’, which contains the new service account email address our script will use. Give this email address ‘Read & Analyze’ access to the Google Analytics Views you would like to run this script for. This is done the same way as for any other account, using User Management in the Admin section.
You will also want to take note of the View ID when giving permissions to the service account, as it will be used in our Python script in the next step.
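If you’d rather not dig through the JSON by eye, a few lines of Python can pull the email address out for you. This is only a sketch: the helper names `load_key_file` and `service_account_email` are hypothetical, and it assumes the key file has the ‘client_email’ field described above.

```python
import json


def load_key_file(key_file_path):
    """Read a service account JSON key file into a dictionary."""
    with open(key_file_path, encoding="utf-8") as key_file:
        return json.load(key_file)


def service_account_email(key_data):
    """Return the 'client_email' value, or raise a readable error if it is absent."""
    if "client_email" not in key_data:
        raise KeyError("No 'client_email' field; is this a service account key?")
    return key_data["client_email"]
```

From your project folder, something like `print(service_account_email(load_key_file('client_secrets.json')))` should print the address to paste into User Management.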
Step 3. The Authenticating Function
It’s time to start writing some code (or ctrl+c / ctrl+v some code in your case).
I like Visual Studio Code as my IDE. It’s very easy to use, and you can get it from code.visualstudio.com. Once it’s installed, go back to that CMD window we were using earlier; it should still be in your project directory, with the environment active. We can quickly create a new ‘main.py’ script file by typing:
analytics-reporting-project(ga-api) > code main.py
It should open up in VS Code. Now we can paste in the following:
from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = 'client_secrets.json'
VIEW_ID = 'XXXXXX'


def initialize_analyticsreporting():
    # Authenticate with the service account key file.
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        KEY_FILE_LOCATION, SCOPES)

    # Build the service object.
    analytics = build('analyticsreporting', 'v4', credentials=credentials)
    return analytics
Add your view ID to the ‘VIEW_ID’ variable. The ‘initialize_analyticsreporting()’ function will allow the next section to access that reporting view.
Step 4. Pulling The Data
Add the following to the file:
def get_report(analytics):
    # Regex for a literal '?' followed by anything, i.e. paths with parameters.
    rgx = r'\?.*'
    return analytics.reports().batchGet(
        body={
            'reportRequests': [{
                'viewId': VIEW_ID,
                'dateRanges': [{'startDate': '30daysAgo', 'endDate': 'yesterday'}],
                'metrics': [{'expression': 'ga:sessions'}],
                'dimensions': [{'name': 'ga:pagePath'}],
                # '=~' is the "matches regex" filter operator.
                'filtersExpression': f'ga:pagePath=~{rgx}',
                'orderBys': [{'fieldName': 'ga:sessions', 'sortOrder': 'DESCENDING'}],
            }]
        }
    ).execute()
We now have a function to retrieve our data. This is a basic request for the number of sessions split by page. There is also a filter, using our old pal regex in the ‘rgx’ variable, to retrieve only the pages that have a parameter.
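To get a feel for what that filter keeps, you can try the regular expression part, `\?.*`, locally. The sketch below uses Python’s own `re` module as a stand-in, so treat it as an approximation of how Google Analytics evaluates the `=~` filter rather than a guarantee. The sample paths and the `has_query_parameters` helper are invented for illustration.

```python
import re

# A literal '?' followed by anything, as in the report filter.
PARAMETER_PATTERN = re.compile(r"\?.*")


def has_query_parameters(page_path):
    """Return True when the path contains a '?', i.e. carries parameters."""
    # re.search looks for the pattern anywhere in the string.
    return PARAMETER_PATTERN.search(page_path) is not None


sample_paths = ["/blog/", "/blog/?utm_source=newsletter", "/search?q=python&page=2"]
matching = [path for path in sample_paths if has_query_parameters(path)]
print(matching)  # ['/blog/?utm_source=newsletter', '/search?q=python&page=2']
```

The backslash matters: an unescaped `?` is a regex quantifier, not a question mark.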
I’d recommend using Google’s Dimensions & Metrics Explorer if you are interested in modifying the dimensions or metrics being requested.
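If you expect to swap dimensions and metrics around a lot, one option is to build the request from plain lists. This is a sketch of that idea, not part of the original script: `build_report_request` is a hypothetical helper, and ‘ga:pageviews’ and ‘ga:deviceCategory’ are just example names you would replace with your own picks.

```python
def build_report_request(view_id, metrics, dimensions,
                         start_date="30daysAgo", end_date="yesterday",
                         filters_expression=None):
    """Build one entry for the 'reportRequests' list sent to batchGet()."""
    request = {
        "viewId": view_id,
        "dateRanges": [{"startDate": start_date, "endDate": end_date}],
        "metrics": [{"expression": metric} for metric in metrics],
        "dimensions": [{"name": dimension} for dimension in dimensions],
    }
    # Only include the filter when one was supplied.
    if filters_expression:
        request["filtersExpression"] = filters_expression
    return request


example = build_report_request(
    view_id="XXXXXX",
    metrics=["ga:sessions", "ga:pageviews"],
    dimensions=["ga:pagePath", "ga:deviceCategory"],
)
print(example)
```

The returned dictionary can be dropped into the ‘reportRequests’ list in place of the hand-written one. Bear in mind that the parsing in the next step assumes one dimension and one metric, so it would need adjusting too.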
Step 5. Catching The Response
To do something useful with the response, we want to parse whatever gets thrown back at us when running this. This function breaks the response down into columns and rows that are easier for us humans to comprehend.
def save_response(response):
    _dimension = []
    _value = []
    for report in response.get('reports', []):
        columnHeader = report.get('columnHeader', {})
        dimensionHeaders = columnHeader.get('dimensions', [])
        metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])

        for row in report.get('data', {}).get('rows', []):
            dimensions = row.get('dimensions', [])
            dateRangeValues = row.get('metrics', [])

            # Collect the dimension value (the page path) for this row.
            for header, dimension in zip(dimensionHeaders, dimensions):
                _dimension.append(dimension)

            # Collect the metric value (sessions) for each date range.
            for i, values in enumerate(dateRangeValues):
                for metricHeader, value in zip(metricHeaders, values.get('values')):
                    _value.append(value)

    # Build a two-column table and write it to disk.
    _data = pd.DataFrame()
    _data["Sessions"] = _value
    _data["pagePath"] = _dimension
    _data = _data[["pagePath", "Sessions"]]
    _data.to_csv("parameter_pages.csv")
When this runs, it will save a CSV file named ‘parameter_pages.csv’ in your project folder.
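It helps to see the shape of the thing being parsed. The dictionary below is a hand-written, hypothetical response trimmed down to the fields the function above reads. The `response_to_rows` helper is likewise just an illustration: it walks the same structure without pandas, assuming a single dimension and a single metric.

```python
# Hypothetical response, trimmed to the fields the parsing code reads.
sample_response = {
    "reports": [{
        "columnHeader": {
            "dimensions": ["ga:pagePath"],
            "metricHeader": {
                "metricHeaderEntries": [{"name": "ga:sessions", "type": "INTEGER"}]
            },
        },
        "data": {
            "rows": [
                {"dimensions": ["/blog/?page=2"], "metrics": [{"values": ["120"]}]},
                {"dimensions": ["/shop/?sort=price"], "metrics": [{"values": ["45"]}]},
            ]
        },
    }]
}


def response_to_rows(response):
    """Flatten a one-dimension, one-metric report into (pagePath, sessions) pairs."""
    rows = []
    for report in response.get("reports", []):
        for row in report.get("data", {}).get("rows", []):
            page_path = row["dimensions"][0]
            # Metric values arrive as strings, so convert before doing maths on them.
            sessions = int(row["metrics"][0]["values"][0])
            rows.append((page_path, sessions))
    return rows


print(response_to_rows(sample_response))
# [('/blog/?page=2', 120), ('/shop/?sort=price', 45)]
```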
Step 6. Running The Script
The last part of our script puts these functions in sequence. You might notice I like to print the response when it’s returned; this is helpful if you are having problems.
def main():
    analytics = initialize_analyticsreporting()
    response = get_report(analytics)
    print(response)
    save_response(response)


if __name__ == '__main__':
    main()
All Together Now
Your final .py file should contain all of the following now:
from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = 'client_secrets.json'
VIEW_ID = 'XXXXXX'


def initialize_analyticsreporting():
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        KEY_FILE_LOCATION, SCOPES)

    # Build the service object.
    analytics = build('analyticsreporting', 'v4', credentials=credentials)
    return analytics


def get_report(analytics):
    # Regex for a literal '?' followed by anything, i.e. paths with parameters.
    rgx = r'\?.*'
    return analytics.reports().batchGet(
        body={
            'reportRequests': [{
                'viewId': VIEW_ID,
                'dateRanges': [{'startDate': '30daysAgo', 'endDate': 'yesterday'}],
                'metrics': [{'expression': 'ga:sessions'}],
                'dimensions': [{'name': 'ga:pagePath'}],
                'filtersExpression': f'ga:pagePath=~{rgx}',
                'orderBys': [{'fieldName': 'ga:sessions', 'sortOrder': 'DESCENDING'}],
            }]
        }
    ).execute()


def save_response(response):
    _dimension = []
    _value = []
    for report in response.get('reports', []):
        columnHeader = report.get('columnHeader', {})
        dimensionHeaders = columnHeader.get('dimensions', [])
        metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])

        for row in report.get('data', {}).get('rows', []):
            dimensions = row.get('dimensions', [])
            dateRangeValues = row.get('metrics', [])

            for header, dimension in zip(dimensionHeaders, dimensions):
                _dimension.append(dimension)

            for i, values in enumerate(dateRangeValues):
                for metricHeader, value in zip(metricHeaders, values.get('values')):
                    _value.append(value)

    _data = pd.DataFrame()
    _data["Sessions"] = _value
    _data["pagePath"] = _dimension
    _data = _data[["pagePath", "Sessions"]]
    _data.to_csv("parameter_pages.csv")


def main():
    analytics = initialize_analyticsreporting()
    response = get_report(analytics)
    print(response)
    save_response(response)


if __name__ == '__main__':
    main()
All that’s left to do now is run the script! Go back to our old friend, that CMD window we were using earlier. It should still be in your project directory, with the environment active. Run the following:
analytics-reporting-project(ga-api) > python main.py
Take a sip of water, and enjoy the sweet CSV of success. This can be found in your root project folder, and should be full of page paths with parameters attached.
If you don’t see anything printed in the console when running this, or you have a blank CSV, you may want to step back through this guide and make sure your connection to the API, the service account email’s access to your reports, and the client_secrets file are all hooked up correctly.
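A quick way to tell ‘the API answered but found nothing’ apart from ‘something is not hooked up’ is to count the rows in the response before it gets written out. The `count_rows` helper below is a hypothetical sketch for that purpose, and the two sample dictionaries are invented.

```python
def count_rows(response):
    """Return the total number of data rows across all reports in a response."""
    total = 0
    for report in (response or {}).get("reports", []):
        total += len(report.get("data", {}).get("rows", []))
    return total


# Invented examples: one empty report and one with two rows.
empty_response = {"reports": [{"data": {}}]}
filled_response = {"reports": [{"data": {"rows": [{"dimensions": ["/a?b=1"]},
                                                  {"dimensions": ["/c?d=2"]}]}}]}

print(count_rows(empty_response))   # 0
print(count_rows(filled_response))  # 2
```

A count of zero with no error suggests the request itself went through, so the view ID, the date range and the filter are the next things to check.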
Hopefully, this guide gave you a bit of a flavour for what is possible in Python with the Google Analytics API. Shoot any feedback or questions you might have on this article to me on LinkedIn or email at richard@bedrock42.com