tobiaswillmann.de

How to automatically filter for suspicious Canonicals with Screaming Frog SEO Spider exports +…

2018-03-31

Python Pandas is super useful if you want to check the same specific things in your crawls over and over again. Just create a Python script and run it with a new Screaming Frog SEO Spider crawl.

Pandas is much better than Excel with really big crawls and with the new Screaming Frog SEO Spider database storage feature you can process them without having a lot of memory.

This shows step by step what’s possible with Canonicals, but can be used for other columns in the crawl, too.

Example:

Load Screaming Frog SEO Spider csv to pandas:

import pandas as pd
import numpy as np
df =  pd.DataFrame(pd.read_csv('/Users/LOAD-THE-SCREAMING-FROG-SEO-SPIDER-EXPORT-HERE.csv', low_memory=False, header=1))

header=1 is important to use row 2 for column names. Row 2 for column names is default in Screaming Frog SEO Spider.

If you want to just check some specific columns you can select them like this.

df = df[['Address', 'Status Code', 'Meta Robots 1', 'Canonical Link Element 1']].copy()

It’s useful to filter just status = 200 URLs

df = df[df['Status Code'] == 200]

and just indexable URLs

df = df[~df['Meta Robots 1'].str.contains("noindex", na=False)]

Now we will check some Canonical related things in the crawl. We will create a table with a lot of information about Canonicals and related stuff out of the crawl’s raw data:

Canonical empty

This will set an additional column with ‘yes’ if the Canonical element is empty and ‘no’ if not.

df['Canonical empty'] = np.where((df['Canonical Link Element 1'].isnull()), 'yes', 'no')

Canonical equals adress

This checks if the Canonical equals the adress. It will set a new column with ‘yes’ or ‘no’, too.

df[‘Canonical equals adress’] = np.where((df[‘Address’] == df[‘Canonical Link Element 1’]), ‘yes’, ‘no’)

Adress without parameters

In many cases Canonicals are used to clean up e.g. tracking parameters. This will extract the Adress URL without parameters.

df[‘Adress without parameters’] = df[‘Address’].str.extract(‘^([^?]+)’, expand=True)

Parameters

This will add just the parameter to another column. Can be useful to find often used Parameters.

df[‘Parameters’] = df[‘Address’].str.extract(‘(?.*)’, expand=True)

Canonical equals adress without parameters

df[‘Canonical equals adress without parameters’] = np.where((df[‘Adress without parameters’] == df[‘Canonical Link Element 1’]), ‘yes’, ‘no’)

With this you can identify all the Canonicals, which were probably set to clean up tracking or filter parameters.

Canonical differs from adress without parameters + canonical is not empty

This is especially interesting because Canonicals, which are really set and not just there to clean up tracking or filter parameters are a hint for real duplicate content. Just in theory these Canonicals have higher probability that something is messed up. A column with ‘yes, please check!’ is added with this:

df['Canonical differs adress without parameters and canonical is not empty'] = np.where((df['Canonical equals adress without parameters'] == "no") & (df['Canonical empty'] == "no"), 'yes, please check!', 'no')

I exported and filtered… and found cross domain Canonicals with http instead of https instantly 🙂

Check the filter in column K, the Adress in B and the Canonical in E

Parameter and no canonical

This one is also suspicious. In most cases you should have a Canonical to clean up parameters. This is looking for the parameters and an empty canonical column.

df[‘Parameter and no canonical’] = np.where((df[‘Parameters’].notnull()) & (df[‘Canonical empty’] == “yes”), ‘yes, please check!’, ‘no’)

Results

We generated a script which shows two special Canonical use-cases with just one export

  • Canonical differs from adress without parameters + canonical is not empty
  • Parameter and no canonical

It’s easy to extend for other use-cases and will produce these filtered tables with just one click.

I like to export to Excel in the end:

df.to_csv(‘dfTestCanonicals.csv’, sep=’t’)

If the files are too big to filter in Excel you can of course pre-filter in Pandas and create e.g. exports for every usecase

df = df[df[‘Parameter and no canonical’] == "yes, please check!"]
df.to_csv(‘dfParameters-and-no-Canonical.csv’, sep=’t’)

Which patterns related to Canonicals do you check? Where do you spend time with data cleanup an filtering the same stuff over an over again?

© 2020 Tobias Willmann