
A Jupyter notebook / Python script to explore the subdirectories of a website based on XML sitemaps

2020-04-19

Sometimes SEOs are looking for a subdirectory / section structure overview of a website. This Jupyter notebook can be your starting point to get a list of all subdirectories and the number of articles in each directory.

Why check a website’s subdirectory structure

A hierarchically ordered website with a reasonable structure is kind of an SEO best practice: it helps users and bots to group content and find their way within your site. From time to time (or if you take over a new project) you want to validate whether this structure (still) makes sense. If this check hasn’t been done for a while, expect forgotten / legacy subdirectories that disturb your nice website structure.

This script can help you get an idea of the structure of your site.

About the script’s output

The script can generate something like this based on XML Sitemaps.

Output CSV

You will get a CSV listing every subdirectory’s full URL, an example URL within the directory, a count of how deep you are in the hierarchy, and each subdirectory level in a separate column.

In addition you can expect some information about how many articles are in each subdirectory.

Another Output example: https://docs.google.com/spreadsheets/d/1SBW3rMp3OxUOWNyAn8iS2ORx2I4xFaTTFeP0M6RDi_g/edit?usp=sharing
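The exact columns come from the notebook, but conceptually the overview is just an aggregation over the crawled URL list. A rough sketch of that idea (the URLs, column names and depth formula below are illustrative, not necessarily what the notebook uses):

import pandas as pd

# Illustrative only: two URLs stand in for a full sitemap crawl.
urls = pd.DataFrame({'loc': [
    'https://www.example.com/life/essen/rezept/artikel-1.html',
    'https://www.example.com/life/essen/rezept/artikel-2.html',
]})
urls['SubdirURL'] = urls['loc'].str.extract(r'^(.*/)', expand=False)

overview = (urls.groupby('SubdirURL')
                .agg(exampleURL=('loc', 'first'), articleCount=('loc', 'count'))
                .reset_index())
overview['depth'] = overview['SubdirURL'].str.count('/') - 3  # slashes left after protocol and domain
print(overview)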

How to use the script

It’s a Jupyter notebook. I added a lot of comments … please contact me if something isn’t working.

Whether the script works depends heavily on your website’s structure and your ability to write a proper regex for your site. It works fine for several publishers I built it for, but websites and URLs can be very different, so you will probably have to adapt the code.

👉👉👉 The Script

You can download it here or run a copy of it directly in Google Colab (recommended):

https://colab.research.google.com/drive/1hYR_8uVBtNfyRLXAoNWmlXOhlsnVdggD

How to configure

You have to define 5 things:

1. The sitemaps you want to get data from

sitemaps = ['https://www.blick.ch/video.xml','https://www.blick.ch/image.xml','https://www.blick.ch/article.xml']

2. The regex pattern to detect a subdirectory

pattern = '(.*?)/' #Get the subdir

For a URL like https://www.blick.ch/life/essen/rezept/schnelle-rezepte-so-einfach-bereiten-sie-couscous-richtig-zu-id15838923.html this extracts “life” as the level 1 subdirectory if used like this in the code (check here):

dfSitemap['L1'] = dfSitemap['loc'].str.extract('https?://.*?/' + pattern, expand=False)

3. The regex pattern to detect the full URL of the subdirectory

patternPath = '^(.*[\/])'

For a URL like https://www.blick.ch/life/essen/rezept/schnelle-rezepte-so-einfach-bereiten-sie-couscous-richtig-zu-id15838923.html this gets https://www.blick.ch/life/essen/rezept/. Used like this (see the sketch after this list for a quick check of both patterns):

dfSitemap['SubdirURL'] = dfSitemap['loc'].str.extract(patternPath, expand=False)

4. The brandname for file names … (optional)

brandname = 'Blick'

5. A testing variable. If you set it to e.g. 5, only 5 sub-sitemaps of an index sitemap will be crawled. 0 will crawl all sitemaps.

maxSubsitemapsToCrawl = 0
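To sanity-check the two regex patterns before running the full crawl, you can apply them to a single example URL first. A minimal sketch, assuming the pattern and patternPath values from above and using str.extract the same way the notebook does:

import pandas as pd

dfSitemap = pd.DataFrame({'loc': [
    'https://www.blick.ch/life/essen/rezept/schnelle-rezepte-so-einfach-bereiten-sie-couscous-richtig-zu-id15838923.html'
]})

dfSitemap['L1'] = dfSitemap['loc'].str.extract('https?://.*?/' + pattern, expand=False)
dfSitemap['SubdirURL'] = dfSitemap['loc'].str.extract(patternPath, expand=False)

print(dfSitemap[['L1', 'SubdirURL']])
# L1 -> life, SubdirURL -> https://www.blick.ch/life/essen/rezept/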

Example configs:

No gzip test

sitemaps = ['https://www.spiegel.de/sitemap.xml']
pattern = '(.*?)/'
patternPath = '^(.*[\/])'
brandname = 'Spiegel'

Example URL: https://www.spiegel.de/kultur/filme-fuer-die-corona-krise-lagerkollerfilme-a-34a730fe-6cd9-4064-b06e-d98cd17a47f2

gzip test

sitemaps = ['https://www.blick.ch/video.xml','https://www.blick.ch/image.xml','https://www.blick.ch/article.xml']
pattern = '(.*?)/'
patternPath = '^(.*[\/])'
brandname = 'Blick'

Example URL: https://www.blick.ch/life/essen/rezept/schnelle-rezepte-so-einfach-bereiten-sie-couscous-richtig-zu-id15838923.html

mixed

sitemaps = ['https://www.welt.de/sitemaps/sitemap/sitemap.xml']
pattern = '(.*?)/.*[0-9]{6,}.*?/.*html'
patternPath = '^(.*[\/]).*[0-9]{6,}.*'
brandname = 'Welt'

Example URL: https://www.welt.de/food/essen/article206809645/Kekse-backen-ohne-Mehl-Schoko-Cookies-mit-Meersalz.html

Example config problems:

With URLs like this https://www.welt.de/food/essen/article206809645/Kekse-backen-ohne-Mehl-Schoko-Cookies-mit-Meersalz.html the part article206809645 isn’t really a subdirectory, so you have to use a regex to detect and exclude it.

^(.*[\/])article[0-9]{6,}.*

can do it and get the real URL of the subdirectory. Check: https://regex101.com/r/ZZcusp/1 Looking forward to comments on how to do it more elegantly.
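A quick way to verify that pattern on the example URL (just a sketch with Python’s re module):

import re

url = 'https://www.welt.de/food/essen/article206809645/Kekse-backen-ohne-Mehl-Schoko-Cookies-mit-Meersalz.html'
match = re.match(r'^(.*[\/])article[0-9]{6,}.*', url)
print(match.group(1))  # https://www.welt.de/food/essen/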

I haven’t tried to find a solution for https://www.theguardian.com/world/live/2020/apr/18/coronavirus-live-news-global-deaths-pass-150000-trump-china-china-denies-any-concealment-pence-origins-europe-germany yet, but “2020/apr/18” would need to be handled too…

Scrape XML Sitemaps in Python

The script contains some code to scrape XML sitemaps in Python.

It can handle multiple XML sitemaps and index sitemaps, both gzipped and non-gzipped. It uses a “stop variable” so it doesn’t run too long while testing, plus super basic exception handling.

Codesnippet here
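The snippet itself is linked above; as a minimal sketch of the same approach (the function names and the gzip detection below are my own, the actual notebook code may differ):

import gzip
import requests
import pandas as pd
import xml.etree.ElementTree as ET

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def fetch_sitemap(url):
    """Download a sitemap and return its XML root, transparently un-gzipping .gz files."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    content = resp.content
    if url.endswith('.gz') or content[:2] == b'\x1f\x8b':  # gzip magic bytes
        content = gzip.decompress(content)
    return ET.fromstring(content)

def collect_urls(sitemap_urls, max_subsitemaps=0):
    """Collect all <loc> URLs; follow index sitemaps, stopping after max_subsitemaps children (0 = no limit)."""
    urls = []
    for sitemap_url in sitemap_urls:
        try:
            root = fetch_sitemap(sitemap_url)
        except Exception as e:  # super basic exception handling
            print('Skipping', sitemap_url, e)
            continue
        if root.tag == NS + 'sitemapindex':  # index sitemap -> recurse into the child sitemaps
            children = [loc.text for loc in root.iter(NS + 'loc')]
            if max_subsitemaps:
                children = children[:max_subsitemaps]
            urls += collect_urls(children, max_subsitemaps)
        else:  # regular sitemap -> collect the page URLs
            urls += [loc.text for loc in root.iter(NS + 'loc')]
    return urls

dfSitemap = pd.DataFrame({'loc': collect_urls(sitemaps, maxSubsitemapsToCrawl)})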

Export Pandas dataframe to Google Sheets (in Google Colab)

I’m using https://github.com/robin900/gspread-dataframe with set_with_dataframe.

Codesnippet here
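As a minimal sketch of that export step in Colab (the authentication lines follow the standard Colab gspread example; the spreadsheet name and the dfSubdirs dataframe are placeholders, not the notebook’s actual names):

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials
from gspread_dataframe import set_with_dataframe

gc = gspread.authorize(GoogleCredentials.get_application_default())

sh = gc.create(brandname + ' subdirectories')  # or gc.open('an existing sheet')
set_with_dataframe(sh.sheet1, dfSubdirs)       # dfSubdirs = the aggregated overview dataframe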

Why use a spreadsheet to visualize a website’s structure

In theory you could go with tree graphs or mind maps to visualize how your website is structured, but unless the site is super small it always feels cluttered to me at some point.

In addition, once you have visualized the website structure you will have follow-up questions and want more data:

  • How many organic entries does this section generate compared to the other ones?
  • How many articles are in there?

If you want to add additional data, spreadsheets work fine.

© 2020 Tobias Willmann