When we started Sondz, one of our assumptions was that we would be able to rely on search engines to quickly acquire traffic. This assumption was key as we were looking to build a website that would feature 30 Million+ URLs from day one… and one can hope that Google might want to take a look at that and start displaying at least some of it on its mighty search service!

Long story short, that was not quite the case, and the following post recaps the unexpected struggles and the SEO learning curve we have been on since we hit play...


Building a flawless content factory

A little context first: the Sondz team had prior experience with SEO management, from local services to e-commerce websites, from personal websites to global brands and user-generated platforms. We had already faced plenty of SEO challenges, but never on the scale of publishing tens of millions of pages on a single domain with no prior history. This was uncharted territory for us. And, as SEO is a secretive game, there is not much content about that scenario out there…

We set our standards pretty high and used every tool at our disposal to give ourselves the best possible chance of a strong start. The ABC of SEO is the quality of your code; in the case of Sondz, every single code error had to be multiplied by a factor of 30M. No room for error…

The quality of our HTML markup (i.e. code structure) has always been a concern of ours. From the beginning, we used an HTML prototype to build our content architecture, design and code patterns. At each step of the project, we refined the logic behind the product structure, building reusable components following an atomic design-like methodology. We looked at every div and span to see which ones we could remove while keeping the design as functional as needed. We reviewed every section, heading, text, list structure, link and image alternative text… to make our code as readable as possible for algorithms. To that end, we also decided to follow schema.org’s microdata standard, which lets websites present an even clearer picture of their content to Google’s engines. Who wouldn’t want that?
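For illustration, here is the kind of markup such a reusable component ends up emitting; the component name, interface and fields below are hypothetical, but the itemscope / itemtype / itemprop attributes are the schema.org microdata mechanism described above:

// Hypothetical artist card: plain semantic HTML enriched with schema.org
// microdata attributes so crawlers get an explicit data model.
interface ArtistCard {
  name: string;
  url: string;
  imageUrl: string;
}

function renderArtistCard({ name, url, imageUrl }: ArtistCard): string {
  return `
    <article itemscope itemtype="https://schema.org/MusicGroup">
      <a itemprop="url" href="${url}">
        <img itemprop="image" src="${imageUrl}" alt="${name}" />
        <h3 itemprop="name">${name}</h3>
      </a>
    </article>`;
}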

In 2020, along with a major iteration on design aesthetics finalized before our November launch, we also reviewed the frontend code with a more mobile-first approach. We rebuilt our CSS layouts with grid and flexbox as much as we could, removing the Bootstrap grids that were initially put in place across the prototype. This allowed us to hit the key mobile UX metrics Google looks at and, on top of that, cut our CSS weight by 40%, giving us a much better starting point design-wise.

Once we knew the project was going to be technically feasible, i.e. with a release date months away rather than years, we started to use the Sondz domain with a blog, adding content once a week, then three times a week, with news posts about music, album reviews and trivia. This content, promoted through social media posts and the use of sitemaps, allowed us to get the domain within hearing distance of Google’s automated ears and to set the website’s main semantic theme: music information.

Up until that point, everything was going according to plan: we had a lot of content, good markup, a comprehensive content architecture, an efficient, lightweight design and a domain with a bit of existing SEO history. Then, on November 17th 2020, the site went live.

Optimizing speed by all means

Although we hit play, we still kept the site out of Google’s sight for the first few weeks, using the magical “noindex, nofollow” formula as well as a proper robots.txt file. This gave us a little more time to adjust various settings, which turned out to be a sound decision, as we still needed to improve the website’s performance and speed quite a bit…
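As a rough sketch of what hiding a site looks like, assuming an Express-style frontend server (the setup below is illustrative, not our production code): a robots.txt that disallows all crawling, plus a noindex, nofollow directive sent with every response.

import express from "express";

const app = express();

// Send the robots directive as an HTTP header on every response,
// equivalent to a <meta name="robots" content="noindex, nofollow"> tag.
app.use((_req, res, next) => {
  res.setHeader("X-Robots-Tag", "noindex, nofollow");
  next();
});

// Serve a robots.txt that blocks all crawlers while the site is being tuned.
app.get("/robots.txt", (_req, res) => {
  res.type("text/plain").send("User-agent: *\nDisallow: /\n");
});

app.listen(3000);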

A project based on 30M+ URLs and 150M+ data points requires a strong technical infrastructure. We rely on the Neo4j graph database, which we access through an API, and we need to do so in the shortest time possible: response time is a very important metric when it comes to SEO. If a page loads slowly, Google will rate it negatively, as it does not provide a quality experience for the end user; a slow page also takes Google’s crawler longer to scan… For all these reasons, speed has been one of our key concerns since day one, and the project gradually evolved towards the right performance formula.

When we launched, the rendering of a “large” page took between 3 and 6 seconds, and for major artists and labels it could take far longer than that. We therefore had two problems: 1) certain pages took too long to load, and 2) loading time was uneven from page to page. We needed to improve both, significantly.

We reviewed complex data queries to find faster ways to generate and display responses. For example, one query retrieved every album and song an artist was credited on, while only the top 10 were displayed on the page. The upside: since all elements had been retrieved, searching or sorting them was almost instantaneous. The downside: for Bob Dylan or The Rolling Stones, such a query could take up to 15 seconds… By completely retooling the API and moving search filters, the number of elements returned and the sort criteria down to the database level, query time dropped to around 1 second.
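To illustrate the principle (the labels, relationship names and connection details below are hypothetical, not our actual schema), here is what pushing sorting and pagination down into the Cypher query looks like with the neo4j-driver package:

import neo4j from "neo4j-driver";

// Placeholder connection details; the point is that ORDER BY, SKIP and LIMIT
// run inside the database, so only the rows the page needs are ever returned.
const driver = neo4j.driver(
  "bolt://localhost:7687",
  neo4j.auth.basic("neo4j", "password")
);

async function topCredits(artistId: string, limit = 10, offset = 0) {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (a:Artist {id: $artistId})-[:CREDITED_ON]->(r:Release)
       RETURN r.title AS title, r.year AS year
       ORDER BY r.year DESC
       SKIP $offset LIMIT $limit`,
      { artistId, offset: neo4j.int(offset), limit: neo4j.int(limit) }
    );
    return result.records.map((rec) => ({
      title: rec.get("title"),
      year: rec.get("year"),
    }));
  } finally {
    await session.close();
  }
}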

To boost speed, we also looked at images, the largest type of content we host on Sondz (audio and video files are embedded rather than hosted). Any given page on the platform contains quite a few images. If they are not optimized, they slow down the page and hurt its Google rating, both things we do not want. In theory, Google favors images that fit exactly the size displayed on the page. As that is impossible with responsive images, we went for a multi-size approach, defining four sizes depending on image location: 192, 512, 1280 and 1980 px. We also decided to rely on a third-party service, Cloudinary, for image optimization and cache management. These two decisions combined allowed us to reduce image weight by up to 80%!
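To give an idea of the mechanics (the cloud name and public IDs below are placeholders, not our real account), here is a small helper that builds one Cloudinary transformation URL per target width and exposes them as a srcset, letting the browser pick the closest fit:

// The four target widths mentioned above.
const WIDTHS = [192, 512, 1280, 1980] as const;

// c_fill crops to the exact size; f_auto and q_auto let Cloudinary pick the
// best format and compression level for the requesting browser.
function cloudinaryUrl(publicId: string, width: number): string {
  return `https://res.cloudinary.com/sondz-demo/image/upload/c_fill,f_auto,q_auto,w_${width}/${publicId}`;
}

function srcsetFor(publicId: string): string {
  return WIDTHS.map((w) => `${cloudinaryUrl(publicId, w)} ${w}w`).join(", ");
}

// Usage in a template:
// <img src="..." srcset="${srcsetFor('covers/abc123')}" sizes="(max-width: 640px) 192px, 512px" alt="...">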

But the most important performance improvement has to do with caching. Being able to send page content to the user without running complex queries saves a lot of time. We set up a multi-level cache architecture, starting with the server, where all data is kept in memory. We then added another layer on the frontend servers via GraphQL response caching, configured on our Apollo server with the apollo-server-plugin-response-cache package:

cacheControl: { defaultMaxAge: 3600 },
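For context, a fuller sketch of that setup on Apollo Server 2 would look something like this (the typeDefs / resolvers import is a placeholder for our actual schema):

import { ApolloServer } from "apollo-server";
import responseCachePlugin from "apollo-server-plugin-response-cache";
import { typeDefs, resolvers } from "./schema"; // hypothetical local module

// Cache full GraphQL responses for an hour by default; individual types or
// fields can still override this with their own cache hints.
const server = new ApolloServer({
  typeDefs,
  resolvers,
  plugins: [responseCachePlugin()],
  cacheControl: { defaultMaxAge: 3600 },
});

server.listen().then(({ url }) => console.log(`GraphQL ready at ${url}`));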

As performance issues were getting fixed, we were soon displaying pages full of content within two seconds. These days, our median response time is around 55 ms and the 95th percentile is around 1.1 seconds. More improvements could still be made, but good is better than perfect!

On December 11th 2020, we removed the settings hiding the app from Google and let it crawl without any guidance, knowing what our next SEO step would be: sitemaps.

Setting a path with sitemaps

If you use a CMS, as we did with our blog, building sitemaps sounds like a no-brainer. There is a plugin or a library that takes care of the architecture, sets up a sitemap index and automatically adds multiple sitemaps (generally one per content type). But Sondz is a 100% custom app with 30M+ URLs to expose, and as soon as we started asking ourselves which architecture to set up, a few questions came up:

  • How many URLs can we add in a single sitemap file?
  • Is there a weight limit?
  • How many sitemaps can one have in a single index?
  • Can a sitemap index include another sitemap index for multi-level architecture?
  • Where can we store our sitemaps?

Let it be said, Google provides a lot of information about high-volume sitemaps. For our four-content-type structure, we ended up with the following setup (sketched in code right after the list):

  • A single sitemap index (it turns out you can only have one);
  • An unlimited number of sitemaps per content type, with 49,999 URLs per sitemap (there is a limit), in a fixed order based on each URL’s initial entry date (seeing as we add new URLs all the time: music is an endless flow of new releases and future Grammy winners…);
  • Sitemaps compressed with the standard gzip library;
  • Sitemaps stored at the root of the “public” folder (which does make for a messy repo).
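To give an idea of what the generation step involves (the file names, base URL and folder layout below are illustrative, not our exact implementation), here is a sketch that splits a list of URLs into gzipped sitemaps of at most 49,999 entries each and writes a single index referencing them:

import { createWriteStream, writeFileSync } from "fs";
import { createGzip } from "zlib";

const MAX_URLS_PER_SITEMAP = 49_999;
const BASE = "https://www.example.com"; // placeholder base URL

// Write one gzipped sitemap file containing the given URLs.
function writeSitemap(urls: string[], path: string): Promise<void> {
  const body =
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    urls.map((u) => `  <url><loc>${u}</loc></url>`).join("\n") +
    `\n</urlset>\n`;
  return new Promise((resolve, reject) => {
    const gzip = createGzip();
    const out = createWriteStream(path);
    out.on("finish", () => resolve()).on("error", reject);
    gzip.pipe(out);
    gzip.end(body);
  });
}

// Split a content type's URLs into 49,999-entry chunks and return the file names.
async function buildSitemaps(contentType: string, urls: string[]): Promise<string[]> {
  const files: string[] = [];
  for (let i = 0; i * MAX_URLS_PER_SITEMAP < urls.length; i++) {
    const chunk = urls.slice(i * MAX_URLS_PER_SITEMAP, (i + 1) * MAX_URLS_PER_SITEMAP);
    const file = `sitemap-${contentType}-${i + 1}.xml.gz`;
    await writeSitemap(chunk, `public/${file}`);
    files.push(file);
  }
  return files;
}

// Write the single sitemap index referencing every generated sitemap file.
function writeIndex(allFiles: string[]): void {
  const index =
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    allFiles.map((f) => `  <sitemap><loc>${BASE}/${f}</loc></sitemap>`).join("\n") +
    `\n</sitemapindex>\n`;
  writeFileSync("public/sitemap-index.xml", index);
}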

There was still a catch: our 30+ million pages were going to amount to no fewer than 600 files. We decided to adapt and take into account our first user-experience learning: users on the platform mostly search for artists. Consequently, we set out to push our sitemaps in stages: artists and labels first, a first batch of 2M+ URLs, thinking that would be an acceptable volume of content for Google.

Sitemaps were ready and, on January 22nd 2021, we deployed them in production, letting Google crawl through them as well.

Before sitemaps, Google’s slow crawling pace (around 200 URLs per day) seemed logical: the only way to discover our content was through our Twitter posts… With the launch of sitemaps, though, we hoped for a jump. We kept looking at Google Search Console metrics to check on Google’s progress, but let’s just say that’s not the best Google product when it comes to providing real-time feedback and transparency about what is happening… In hindsight, here is how it played out:


At first, it seems, Google takes your sitemap index into account and lists all the sitemaps you sent it. The ones that have not been parsed yet are shown with an ominous “error” next to them. Then, you wait… What we found out is that, if you give Google 2M+ sitemapped URLs, the system realizes you are a high-stakes player and therefore takes its time, analyzing sitemaps at a rate of one or two per day, i.e. 50,000 to 100,000 URLs per day.

With the sitemaps finally scanned, we were hoping that Google would then start indexing the site. It first did so at what we thought was a pretty good rate of 200 URLs per day. The thing is, sitemap scanning was only part one of a longer process, one that requires building trust. From time to time we did see peaks of thousands of URLs indexed per day, but every peak was rapidly followed by a return to our initial rate, as Google encountered crawling errors.

To help Google crawl our server faster, we tried to fix any issue that might slow its pace. We rapidly realized that Google constantly probes how much crawling your service can effectively handle: if it encounters errors, it slows down and waits for you to fix them. The crawl errors reported in Search Console are a key point in all this, and any error Google finds has to become your priority. In our case, most errors were due to a single frontend node that could not cope with the load generated by Google’s bots. After we added more nodes and put load balancing in place, we could at least ensure that we would not reach a critical number of failed crawl requests.

Our next focus was URL-based errors. In various languages, properly slugifying artist or label names (i.e. turning them into URL segments) can be tricky and needs finessing to eliminate unsupported special characters. Every fix sent to Google then sets it off on another crawl round, until the next error comes in.
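To illustrate the kind of finessing involved (a simplified sketch, not our actual helper, and one that only deals with Latin scripts), a slugify function typically normalizes away diacritics and collapses anything unsupported into hyphens:

// Turn an artist or label name into a URL-safe slug.
function slugify(name: string): string {
  return name
    .normalize("NFKD")                // split accented chars into base + combining mark
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")      // anything unsupported becomes a hyphen
    .replace(/^-+|-+$/g, "");         // trim leading/trailing hyphens
}

console.log(slugify("Mötley Crüe")); // "motley-crue"
console.log(slugify("Sigur Rós"));   // "sigur-ros"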

In the end

It took Google a full month to analyze our sitemap files. Then the magic finally happened: Google started crawling our service at a much higher pace, with peaks of 4,000 to 5,000 URLs per hour, which led to our server… being overwhelmed! That one was easy: we bought more cloud hosting power… We started monitoring Google’s crawling in near real time and adding more processing power whenever it was required. In the next few days, we hit a new record: Google crawled 45,000 URLs per day for three consecutive days!

It has now been over a month since this initial boost, and we would love to say that our Google traffic is skyrocketing… but Google algorithms have their moods. In other words, things are looking up, just not high enough for us to be fully satisfied yet. Then again, this story is still very much unfolding, and we hope to share more (good news) in the future!