
An update about our green web datasets


Earlier this year we wrote about a few updates to the daily open data snapshots of our green domains dataset. In this post we detail a few changes relevant to people using the dataset, including a 12 month cutoff for our daily snapshots, and how to get commercial support using this data.

Introducing a 12 month cutoff for daily snapshots of our green domains dataset

On August 6th, we introduced a 12 month cutoff to our Green Domains dataset accessible on datasets.greenweb.org. This means only green domains that have been checked and shown as green in the last 12 months will be included in the daily snapshot that we make available for download and on the datasets website.

Why make this change?

Our Green Domains dataset exhibits a strong power law – there is a fairly short list of domains that receive lots and lots of checks each day, and there is a long tail of domains associated with our hosting providers that are very rarely checked.

Over time, this long tail has grown, making up a larger and larger share of the dataset, and correspondingly a larger share of the dataset’s size.

These domains were associated with providers marked as “green” when they were last checked. However, because domains expire and can be pointed at other servers, a later lookup against one of them could resolve to a new provider that is not in our database. This meant that we couldn’t make the same guarantees for the snapshot as we do at present via the API.

So, rather than running an expensive set of queries every night against millions of domains when we aren’t seeing anyone actively wanting to check them, we’ve opted to remove them from our daily snapshot if it’s been more than 12 months since they were last checked.

This means that from August 6th onwards, the number of domains in the daily snapshots drops from around 8 million to between 2 and 3 million.

Getting domains back into the daily snapshot of green domains

If a domain has been checked in the last 12 months via our Green Web Check services, and it came up as “green”, we will include it in our daily green domains database snapshot.

This means that as soon as someone runs a check against your website and it shows as “green”, it will appear in the daily snapshot for the next 12 months. Any time a new check is run, this “Time To Live” counter is reset, so the domain will stay in the dataset for 12 months from that latest check. As long as your domain is actively being used and checked, it should keep appearing in the green domains dataset.
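To make the “Time To Live” behaviour concrete, here is a minimal sketch in Python of how such a rolling window could work. It is an illustration only, not our actual implementation: the function names are hypothetical, and it assumes “12 months” can be approximated as 365 days.

```python
from datetime import date, timedelta

# Assumption: "12 months" is approximated as 365 days for this illustration.
WINDOW = timedelta(days=365)


def record_green_check(last_checks, domain, checked_on):
    """Store the date of the latest green check for a domain.

    A newer green check replaces the older date, which is what
    "resets" the Time To Live counter.
    """
    previous = last_checks.get(domain)
    if previous is None or checked_on > previous:
        last_checks[domain] = checked_on


def in_daily_snapshot(last_green_check, snapshot_date, window=WINDOW):
    """Return True if the latest green check falls inside the window."""
    if last_green_check is None:
        # Never checked as green, so never included.
        return False
    return snapshot_date - last_green_check <= window


# Hypothetical example: one domain, checked once, then checked again later.
checks = {}
record_green_check(checks, "example.org", date(2023, 1, 10))
still_listed = in_daily_snapshot(checks["example.org"], date(2023, 12, 1))
dropped = not in_daily_snapshot(checks["example.org"], date(2024, 6, 1))
```

In this sketch, a domain that drops out of the window comes back as soon as a new green check is recorded for it.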

Of course, we still offer the ability to check any domain against the Green Check Database via our API, so if a domain is showing as “grey” rather than “green” in the dataset you have downloaded, you can always confirm the result by running a check against the API for the most recent results. This lets you use the green domains snapshot as a cache if you are making multiple checks in a short period of time.
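As a rough sketch of this cache-then-confirm pattern, the Python below looks a domain up in a downloaded snapshot first, and only falls back to a live check when the domain is missing. The names here are hypothetical: `snapshot_domains` stands in for the domains you have loaded from a snapshot, and `api_check` stands in for whatever function you write to call the API, so no specific endpoint or response format is assumed.

```python
from typing import Callable, Iterable


def make_green_checker(
    snapshot_domains: Iterable[str],
    api_check: Callable[[str], bool],
) -> Callable[[str], bool]:
    """Build a lookup that treats the snapshot as a cache.

    api_check is a placeholder for your own function that queries the
    live API and returns True when the domain is reported as green.
    """
    # Normalise once so lookups are case-insensitive.
    snapshot = {d.strip().lower() for d in snapshot_domains}

    def is_green(domain: str) -> bool:
        key = domain.strip().lower()
        if key in snapshot:
            # Listed in the snapshot: seen as green within the last 12 months.
            return True
        # Not in the snapshot ("grey"): confirm against the live API.
        return api_check(key)

    return is_green


# Hypothetical usage with a stand-in for the real API call.
def fake_api_check(domain: str) -> bool:
    return domain == "newly-green.example"


is_green = make_green_checker(["example.org"], fake_api_check)
cached_hit = is_green("Example.org")            # answered from the snapshot
confirmed = is_green("newly-green.example")     # answered by the fallback
```

Whether to trust a snapshot hit without re-checking depends on how fresh you need the result to be; for the most recent result, the API remains the reference.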

If you have further technical questions about the content of the dataset, and what the columns mean, consult the dedicated page on our datasets site.

Getting commercial support with analysis of green domains data

We think this should strike a good balance between keeping the dataset useful for people integrating it into their software, and keeping snapshots at a sensible size.

For people who need to run queries further back than 12 months, we still do keep this data to support various kinds of analysis – particularly for tracking the transition to green hosting on the internet, or how the concentration of domains amongst hosting providers is changing.

If you have a particular custom query you want to run, or want help understanding how to use this data, we offer commercial support with analysis of this data. If you need some bespoke help, let us know via our support form.