abstract kitchen

The Definitive Guide To Sitemaps With Python

by Dmitry

Sitemaps are important, especially for big websites. It is always a good idea to develop your website with SEO in mind. Unfortunately, most developers ignore this part. This article describes the general idea and shows how to build your sitemaps with Python. I wrote it for myself in the first place, because I tend to forget things.

What Is A Sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <lastmod>2022-06-04</lastmod>
  </url>
</urlset>

Sitemaps help search engines discover the pages of your website. You list your most important URLs in one or more XML files. Different sitemaps can describe different types of content: plain URLs, images, videos, and news entries. Images, videos, and news entries are just URLs with additional metadata.

Sitemaps are especially important if you have a website with a lot of pages. Now, I will not go into details, because obviously you're a smart person and will find everything at Google Search Central or sitemaps.org.

Just a few simple rules for you:

  • You can combine sitemaps in index sitemaps.
  • A single sitemap must not exceed 50 MB (uncompressed) or 50,000 URLs.
  • A sitemap can be compressed with gzip.
  • Don't forget to link your sitemaps in robots.txt.
  • A sitemap and the URLs it lists must be on the same domain.
  • "priority" and "changefreq" are ignored by Google, so don't bother wasting space.
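The 50,000 URL limit from the list above usually means splitting a long URL list into several sitemaps. Here is a minimal sketch of that split. The helper name chunk_urls and the generated example URLs are my own placeholders, not part of any library.

```python
from itertools import islice

MAX_URLS_PER_SITEMAP = 50_000  # URL limit from the rules above


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield lists of at most `size` URLs from any iterable."""
    iterator = iter(urls)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# hypothetical input: 120,000 generated URLs
all_urls = (f"https://example.com/page-{i}" for i in range(120_000))
chunks = list(chunk_urls(all_urls))

# each chunk can become its own sitemap file
print([len(chunk) for chunk in chunks])
```

Because the helper accepts any iterable, you can feed it a database cursor or a generator without loading every URL into memory first.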

Don't forget to sign up for Google Search Console and submit your sitemaps.

Can I Link To Multiple Sitemaps In robots.txt?

Yes, you can. The Sitemap directive can be used multiple times. Here is a real-world example:

Sitemap: https://zip.international/sitemaps/sitemaps.en.us.xml
Sitemap: https://zip.international/sitemaps/sitemaps.en.gb.xml
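If you want to double-check that every Sitemap line in your robots.txt is picked up, the standard library can parse it for you. This is a small sketch that assumes Python 3.8 or newer (that is when site_maps() was added). The User-agent and Allow lines are placeholders I added to make the file look complete.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content; only the two Sitemap lines come from the example above
robots_txt = """\
User-agent: *
Allow: /

Sitemap: https://zip.international/sitemaps/sitemaps.en.us.xml
Sitemap: https://zip.international/sitemaps/sitemaps.en.gb.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of URLs, or None if there are no Sitemap lines
sitemap_urls = parser.site_maps()
print(sitemap_urls)
```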

Create Your First Sitemap With Python

Here is the idea. You'll need three standard-library modules: xml, os, and, optionally, gzip. This snippet shows how a sitemap can be created.

import os
import gzip

# cElementTree was removed in Python 3.9; ElementTree is the same API
from xml.etree import ElementTree as cElementTree


def add_url(root_node, url, lastmod):
    doc = cElementTree.SubElement(root_node, "url")
    cElementTree.SubElement(doc, "loc").text = url
    cElementTree.SubElement(doc, "lastmod").text = lastmod

    return doc


def save_sitemap(root_node, save_as, **kwargs):
    compress = kwargs.get("compress", False)

    # split "sitemaps/sitemap" into a folder and a file name
    dest_path, sitemap_name = os.path.split(save_as)

    sitemap_name = f"{sitemap_name}.xml"
    if compress:
        sitemap_name = f"{sitemap_name}.gz"

    save_as = os.path.join(dest_path, sitemap_name)

    # create the sitemap folder if it does not exist yet
    if dest_path:
        os.makedirs(dest_path, exist_ok=True)

    tree = cElementTree.ElementTree(root_node)
    if not compress:
        tree.write(save_as, encoding='utf-8', xml_declaration=True)
    else:
        # gzip sitemap; the file is closed automatically
        with gzip.open(save_as, 'wb') as gzipped_sitemap_file:
            tree.write(gzipped_sitemap_file, encoding='utf-8', xml_declaration=True)

    return sitemap_name


# create root XML node
sitemap_root = cElementTree.Element('urlset')
sitemap_root.attrib['xmlns'] = "http://www.sitemaps.org/schemas/sitemap/0.9"

# add urls
add_url(sitemap_root, "https://example.com/url-1", "2022-04-07")
add_url(sitemap_root, "https://example.com/url-2", "2022-04-07")
add_url(sitemap_root, "https://example.com/url-3", "2022-04-07")

# save sitemap. xml extension will be added automatically
save_sitemap(sitemap_root, "sitemaps/sitemap")

# if you want to gzip sitemap
save_sitemap(sitemap_root, "sitemaps/sitemap", compress=True)
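
After saving, it can be useful to read a sitemap back and check what actually ended up in the file. This is a rough sketch of such a check. The helper name read_sitemap_urls is my own, and the demo writes a tiny gzipped sitemap to a temporary folder so the snippet runs on its own.

```python
import gzip
import os
import tempfile
from xml.etree import ElementTree

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_urls(path):
    """Return the <loc> values of a sitemap file (.xml or .xml.gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as sitemap_file:
        root = ElementTree.parse(sitemap_file).getroot()
    return [loc.text for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


# demo: write a tiny gzipped sitemap to a temporary folder and read it back
xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/url-1</loc></url>"
    "<url><loc>https://example.com/url-2</loc></url>"
    "</urlset>"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "sitemap.xml.gz")
    with gzip.open(demo_path, "wb") as demo_file:
        demo_file.write(xml.encode("utf-8"))
    urls = read_sitemap_urls(demo_path)

print(urls)
```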

  

If you want to add images, videos, or news sections, you'll need to add XML namespace attributes to your root node.

# create root XML node
sitemap_root = cElementTree.Element('urlset')
sitemap_root.attrib['xmlns'] = "http://www.sitemaps.org/schemas/sitemap/0.9"

# for images add
sitemap_root.attrib["xmlns:image"] = "http://www.google.com/schemas/sitemap-image/1.1"

# for videos add
sitemap_root.attrib["xmlns:video"] = "http://www.google.com/schemas/sitemap-video/1.1"

# for news add
sitemap_root.attrib["xmlns:news"] = "http://www.google.com/schemas/sitemap-news/0.9"

# add this snippet to attach image to url
def add_url_image(url_node, image_url):
    image_node = cElementTree.SubElement(url_node, "image:image")
    cElementTree.SubElement(image_node, "image:loc").text = image_url

    return image_node

# now, when you want to add an image to a url
# (no trailing comma here, otherwise url_1 becomes a tuple)
url_1 = add_url(sitemap_root, "https://example.com/url-1", "2022-04-07")
add_url_image(url_1, "https://example.com/image-1.jpg")
  

I will not describe here how to add videos or news to your url, because with this code you can easily do it yourself.
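For reference, here is one possible shape for the video case, following the same pattern as add_url_image. It is only a sketch: the helper name add_url_video and the dictionary input are my own choices, and it handles plain text tags only (no attributes such as currency).

```python
from xml.etree import ElementTree

urlset = ElementTree.Element("urlset")
urlset.attrib["xmlns"] = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset.attrib["xmlns:video"] = "http://www.google.com/schemas/sitemap-video/1.1"


def add_url_video(url_node, video):
    """Attach a <video:video> node; `video` maps tag names to text values."""
    video_node = ElementTree.SubElement(url_node, "video:video")
    for tag, value in video.items():
        ElementTree.SubElement(video_node, f"video:{tag}").text = value
    return video_node


url_node = ElementTree.SubElement(urlset, "url")
ElementTree.SubElement(url_node, "loc").text = "https://example.com/video-page"

add_url_video(url_node, {
    "thumbnail_loc": "https://example.com/thumbs/123.jpg",
    "title": "Grilling steaks for summer",
    "description": "How to get perfectly done steaks every time",
    "content_loc": "https://example.com/video123.mp4",
})

xml_text = ElementTree.tostring(urlset, encoding="unicode")
print(xml_text)
```

News entries can be added the same way with the news: prefix, as long as the matching xmlns:news attribute is set on the root node.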

How To Create An Index Sitemap

If you have a lot of pages on your website, or you simply want to split your sitemaps into sections, you'll need index sitemaps. An index sitemap is just an XML file with the root tag sitemapindex, which contains sitemap tags with the URLs of your sitemaps.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml</loc>
  </sitemap>
</sitemapindex>

Let's improve our code to create an index sitemap. Add the function add_sitemap_url at the beginning of your file.

def add_sitemap_url(root_node, sitemap_url):
    sitemap_url_node = cElementTree.SubElement(root_node, "sitemap")
    cElementTree.SubElement(sitemap_url_node, "loc").text = sitemap_url

    return sitemap_url_node

Then use it whenever you need it.

# create sitemapindex tag
sitemap_index_node = cElementTree.Element('sitemapindex')
sitemap_index_node.attrib['xmlns'] = "http://www.sitemaps.org/schemas/sitemap/0.9"

# append links to other sitemaps
add_sitemap_url(sitemap_index_node, "https://example.com/sitemap1.xml")
add_sitemap_url(sitemap_index_node, "https://example.com/sitemap2.xml")

# pass the node created above (sitemap_index_node)
save_sitemap(sitemap_index_node, "sitemaps/sitemap")
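The sitemaps.org protocol also allows an optional lastmod tag inside each sitemap entry of an index. Here is a small sketch of how that could look. The helper name add_sitemap_entry is my own, and the dates are made up.

```python
from xml.etree import ElementTree


def add_sitemap_entry(index_node, sitemap_url, lastmod=None):
    """Append a <sitemap> entry, with <lastmod> only when a date is given."""
    entry = ElementTree.SubElement(index_node, "sitemap")
    ElementTree.SubElement(entry, "loc").text = sitemap_url
    if lastmod is not None:
        ElementTree.SubElement(entry, "lastmod").text = lastmod
    return entry


index_node = ElementTree.Element("sitemapindex")
index_node.attrib["xmlns"] = "http://www.sitemaps.org/schemas/sitemap/0.9"

add_sitemap_entry(index_node, "https://example.com/sitemap1.xml", "2022-06-04")
add_sitemap_entry(index_node, "https://example.com/sitemap2.xml")

index_xml = ElementTree.tostring(index_node, encoding="unicode")
print(index_xml)
```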

You can find code here. Feel free to comment or ask questions.


Sitemapa Library


Here is the code: GitHub
And package: PyPI

Now, for small sitemaps, it's all pretty easy. If you need to generate lots of sitemaps with images, videos, or news metadata, your code will become messy at some point. I created sitemapa as a small abstraction over the XML burden.

Sitemapa is a small package that reduces your work when generating sitemaps. You describe your sitemaps with a JSON-like structure (plain Python dictionaries and lists). Sitemapa is framework-agnostic and does not crawl or index your website; it just generates sitemaps from your description. Nothing more. I use it to generate sitemaps for millions of URLs on my websites.

Keep in mind that it's your job to validate your URLs and lastmod dates.
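Here is a rough idea of what such validation could look like before you hand the data to any sitemap generator. It is a sketch under my own assumptions: the function names are made up, only http and https URLs shorter than 2,048 characters pass, and lastmod is accepted either as YYYY-MM-DD or as a full ISO date-time that datetime.fromisoformat can parse.

```python
import re
from datetime import date, datetime
from urllib.parse import urlsplit

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_loc(url):
    """Check scheme, host and the 2,048 character limit for <loc>."""
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")
        and bool(parts.netloc)
        and len(url) < 2048
    )


def is_valid_lastmod(value):
    """Accept YYYY-MM-DD or an ISO date-time such as 2021-11-05T19:20:30+08:00."""
    try:
        if DATE_RE.fullmatch(value):
            date.fromisoformat(value)
        elif DATE_RE.match(value) and value[10:11] == "T":
            datetime.fromisoformat(value)
        else:
            return False
    except ValueError:
        return False
    return True


print(is_valid_loc("https://example.com/url-1"), is_valid_loc("example.com/url-1"))
print(is_valid_lastmod("2022-06-04"), is_valid_lastmod("2022-13-01"))
```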

Features

  • Use JSON to describe your sitemaps. Don't waste your time with XML.
  • No extra dependencies.
  • Create regular sitemaps. URLs, Images, Videos and News are supported.
  • Create index sitemaps to combine your regular sitemaps.
  • Create extra attributes for your tags like <video:price currency="EUR">1.99</video:price>.
  • Compress sitemaps with gzip.
  • Automatic xmlns attributes for images, videos, and news.

Installation

pip install sitemapa

# import in your script
from sitemapa import Sitemap, IndexSitemap

Create Standard Sitemap. Sitemap Class API.

You need to import the Sitemap class to create a standard sitemap: from sitemapa import Sitemap. The Sitemap class has two methods: append_url and save.

append_url(url, url_data=None)
Parameters: url(str) — Website URL
            url_data(Optional[dict]) — URL Description
            url_data can contain the following keys:
              - lastmod
              - changefreq. Ignored by Google
              - priority. Ignored by Google
              - images. To describe URL images
              - videos. To describe URL videos
              - news. To describe URL news


Return type: dict. Dictionary with all urls and url_data

# ------

save(save_as, **kwargs)
Parameters: save_as(str) — Sitemap name and where to save. For example: sitemap1.xml or sitemap1.xml.gz

Return type: str. For example sitemap1.xml or sitemap1.xml.gz

Let's create a sitemap like this and save it as sitemap1.xml.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/url1.html</loc>
  </url>
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <lastmod>2022-06-04</lastmod>
  </url>
</urlset>

And this is the implementation with sitemapa:

from sitemapa import Sitemap

standard_sitemap = Sitemap()

standard_sitemap.append_url("http://www.example.com/url1.html")
standard_sitemap.append_url("http://www.example.com/foo.html", {
    "lastmod": "2022-06-04"
})

# method 'save' will reset inner dictionary with URLs
sitemap1_name = standard_sitemap.save("sitemap1.xml")

# now, if you want to create new sitemap, just do this:
standard_sitemap.append_url("http://www.example.com/url-2.html")
standard_sitemap.append_url("http://www.example.com/url-3.html")
sitemap2_name = standard_sitemap.save("sitemap2.xml")

Add Images To Your Standard Sitemap

Let's take this example from Google.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://example.com/sample1.html</loc>
    <image:image>
      <image:loc>http://example.com/image.jpg</image:loc>
    </image:image>
    <image:image>
      <image:loc>http://example.com/photo.jpg</image:loc>
    </image:image>
  </url>
  <url>
    <loc>http://example.com/sample2.html</loc>
    <image:image>
      <image:loc>http://example.com/picture.jpg</image:loc>
    </image:image>
  </url>
</urlset>

To do so, we'll use url_data description.

from sitemapa import Sitemap

sitemap_with_images = Sitemap()

sitemap_with_images.append_url("http://example.com/sample1.html", {
    "images": [
        "http://example.com/image.jpg",
        "http://example.com/photo.jpg"
    ]
})

# you can also describe like this
sitemap_with_images.append_url("http://example.com/sample2.html", {
    "images": [
        {
            "loc": "http://example.com/picture.jpg",
            "lastmod": "2022-05-05"
        }
    ]
})

sitemap_with_images.save("sitemap.xml")

As you can see, you can use a list of image URLs or a list of dictionaries. I prefer the first option, since Google deprecated all keys except loc.

Add Videos To Your Standard Sitemap

This is where it gets a little tricky. Videos have a more complex structure. Let's dive into details, using an example from Google.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
   <url>
     <loc>http://www.example.com/videos/some_video_landing_page.html</loc>
     <video:video>
       <video:thumbnail_loc>http://www.example.com/thumbs/123.jpg</video:thumbnail_loc>
       <video:title>Grilling steaks for summer</video:title>
       <video:description>Alkis shows you how to get perfectly done steaks every
         time</video:description>
       <video:content_loc>
          http://streamserver.example.com/video123.mp4</video:content_loc>
       <video:player_loc>
         http://www.example.com/videoplayer.php?video=123</video:player_loc>
       <video:duration>600</video:duration>
       <video:expiration_date>2021-11-05T19:20:30+08:00</video:expiration_date>
       <video:rating>4.2</video:rating>
       <video:view_count>12345</video:view_count>
       <video:publication_date>2007-11-05T19:20:30+08:00</video:publication_date>
       <video:family_friendly>yes</video:family_friendly>
       <video:restriction relationship="allow">IE GB US CA</video:restriction>
       <video:price currency="EUR">1.99</video:price>
       <video:requires_subscription>yes</video:requires_subscription>
       <video:uploader
         info="http://www.example.com/users/grillymcgrillerson">GrillyMcGrillerson
       </video:uploader>
       <video:live>no</video:live>
     </video:video>
   </url>
</urlset>

And this is the implementation with sitemapa:

from sitemapa import Sitemap

sitemap = Sitemap()

sitemap.append_url("http://www.example.com/videos/some_video_landing_page.html", {
    "videos": [
        {
            "thumbnail_loc": "http://www.example.com/thumbs/123.jpg",
            "title": "Grilling steaks for summer",
            "description": "Alkis shows you how to get perfectly done steaks every time",
            "content_loc": "http://streamserver.example.com/video123.mp4",
            "player_loc": "http://www.example.com/videoplayer.php?video=123",
            "duration": "600",
            "expiration_date": "2021-11-05T19:20:30+08:00",
            "rating": "4.2",
            "view_count": "12345",
            "publication_date": "2007-11-05T19:20:30+08:00",
            "family_friendly": "yes",
            "restriction": {
                "$value": "IE GB US CA",
                "relationship": "allow"
            },
            "price": {
                "$value": "1.99",
                "currency": "EUR"
            },
            "requires_subscription": "yes",
            "uploader": {
                "info": "http://www.example.com/users/grillymcgrillerson",
                "$value": "GrillyMcGrillerson"
            },
            "live": "no"
        }
    ]
})

sitemap.save("sitemap.xml")

You can see that each item in the videos list describes one <video:video> tag. Take a look at the "restriction" key. Each property (except $value) becomes an attribute of <video:restriction>. $value is a special property: it holds the content of the tag. So basically it works like this: <video:restriction relationship="allow">restriction[$value]</video:restriction>.

Add Google News To Your Standard Sitemap

Keep in mind that Google requires you to publish only new articles in a news sitemap. Read more about this here.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
    <url>
        <loc>http://www.example.org/business/article55.html</loc>
        <news:news>
            <news:publication>
                <news:name>The Example Times</news:name>
                <news:language>en</news:language>
            </news:publication>
            <news:publication_date>2008-12-23</news:publication_date>
            <news:title>Companies A, B in Merger Talks</news:title>
        </news:news>
    </url>
</urlset>

And this is the implementation with sitemapa:

from sitemapa import Sitemap

sitemap = Sitemap()

sitemap.append_url("http://www.example.org/business/article55.html", {
    "news": [
        {
            "publication": {
                "$tags": {
                    "name": "The Example Times",
                    "language": "en"
                }
            },
            "publication_date": "2008-12-23",
            "title": "Companies A, B in Merger Talks"
        }
    ]
})

sitemap.save("sitemap.xml")

As you can see, we just added new tags (<news:name> and <news:language>) inside <news:publication> using the $tags key.
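To make the $value and $tags conventions more concrete, here is a toy converter that turns such a description into XML with ElementTree. This is not sitemapa's actual implementation, just my own illustration of the idea. The function name build_tag is hypothetical.

```python
from xml.etree import ElementTree


def build_tag(parent, prefix, name, description):
    """Append <prefix:name> to parent from a string or a dict description."""
    node = ElementTree.SubElement(parent, f"{prefix}:{name}")
    if isinstance(description, str):
        # plain string: it is simply the content of the tag
        node.text = description
        return node
    for key, value in description.items():
        if key == "$value":
            node.text = value  # content of the tag
        elif key == "$tags":
            # nested tags, each described the same way
            for child_name, child_description in value.items():
                build_tag(node, prefix, child_name, child_description)
        else:
            node.set(key, value)  # any other key becomes an attribute
    return node


news_node = ElementTree.Element("news:news")
build_tag(news_node, "news", "publication", {
    "$tags": {"name": "The Example Times", "language": "en"}
})
build_tag(news_node, "news", "title", "Companies A, B in Merger Talks")

video_node = ElementTree.Element("video:video")
build_tag(video_node, "video", "price", {"$value": "1.99", "currency": "EUR"})

news_xml = ElementTree.tostring(news_node, encoding="unicode")
video_xml = ElementTree.tostring(video_node, encoding="unicode")
print(news_xml)
print(video_xml)
```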

Can I Describe Images, Videos and Google News all at once?

Yes. Put all the keys into the same url_data dictionary:

sitemap.append_url("http://www.example.org/business/article55.html", {
    "lastmod": "",
    "images": [],
    "videos": [],
    "news": []
})

Create Index Sitemap with Sitemapa

We'll use the index sitemap example from earlier in this article. Import IndexSitemap from sitemapa. The IndexSitemap class has two methods: append_sitemap and save.

from sitemapa import IndexSitemap


index_sitemap = IndexSitemap()

index_sitemap.append_sitemap("http://www.example.com/sitemap1.xml")
index_sitemap.append_sitemap("http://www.example.com/sitemap2.xml")

index_sitemap.save("index-sitemap.xml")

Postface

This article is my summary of sitemaps. I hope it helps you on your journey. Don't forget to verify everything with official resources. If you have any questions or you see mistakes in this text, don't be shy and drop me a line.
