A tale of the never-ending pursuit for the non-broken search algorithms using Neo4j’s Full-text search...

 

Introduction

Either something works or, it is broken. In a funny and inspiring lecture that can be found on TED’s website, Seth Godin describes how, from a user’s perspective, simple design mistakes may have dire consequences on the usability of a feature. For instance, to offer discounts, a company sent him 3 debit cards, each for a $30 value and with an expiration date 60 days in the future. However, he notes, where can you use those cards? If you want to buy a product that is worth slightly more than $30, you have to ask for the possibility to use two different cards, and if you buy something that is less than said $30, you will never use the remaining money, which will therefore be lost. Conclusion: the three cards have expired, and he never used them. It didn’t work, he is angry: the discount system is broken.

Never has this been truer than in the world of search: either you find what you are looking for, or you don’t, and you will not try a lot of different options. You may check your spelling to ensure no mistakes were made (even though a certain margin of error should be allowed: who doesn’t use Google as a spell-checker after all?), but not much further. This is all the more true as the likes of Google, Bing… offer an incredibly high level of quality in the field these days. User expectations are therefore extremely high, and the barrier for “brokenness” is very low…

At Sondz, our core feature is to help users find correct, up-to-date and comprehensive information in the realm of music. In order to achieve that, we designed a search bar that is front and center on our homepage and present on every page via a single click. Our question was: how to make the search feature work, or how to make it as “non-broken” as possible?

 

The issue(s)

The issue is the following: how to make it as seamless and effective as possible for users to find the information they want on their music. After all, this is what they should expect on a website that makes the bold claim to help you “Know your music, find any information about any music”. Part of the answer, for us, was to provide a single search field to find artists, albums, songs and labels alike. No dropdown to select a search category should be present. The result should be displayed to the user in the top 5 results for any given category. We all know the simplest way to hide something is to have it show on Google’s second result page… Sounds simple enough, however this is where things actually get complicated.

 

Different search categories have different issues:

  • Artists and labels tend to have different and distinctive names: this is how they brand themselves. Lest we forget, David Bowie had to change his stage name twice from David Robert Jones. So, for a given set of search terms, only a few results will be found in our yet 1.7 Million+ artist database. The issue is usually more on the spelling side. For instance, I always thought the band Weezer was called “The Weezers” (shame on me, I guess) and I am never quite certain on Megan thee Stallion’s correct name (yes, it is “thee” and not “the”). But such margins of error should be forgiven by the algorithm. Another issue is that a lot of bands are known by their nicknames more than by their actual names. For instance, Rage Against The Machine is usually known as RATM — and they were still all the rage…
  • Regarding albums and songs, things are a bit different. First, there are about 20 million songs in our database and song names vary way less from one another. Try searching for “I love you” and you’ll see how all the top results from various artists have the exact same name. Second, album and song names are usually longer and can sometimes even be small sentences. As such, accommodating for spelling provides a huge number of results. And, even without it, a sorting algorithm or scoring had to be implemented.
  • Globally, the issue remains how to order results and show the most likely searched result at the top.

 

Current implementation

Our current implementation (emphasis on current) has been the following to address those issues. First, we use neo4J community edition as our graph database (version 4.0 currently) that comes bundled with, as most current modern databases, a full text search system which allows for a bit of tweaking to serve our different use cases :

 

1. Creating indexes on more than a single field for artists ensures that you can find them not only using their legal name, but also their aliases or other known search terms. Looking for RATM:

CALL db.index.fulltext.createNodeIndex(“artistNameIndex”, [“Artist”],[“name”,”artist_alias”,”sort_name”]);

 

As a result, searching for RATM will suggest Rage Against the Machine:

 

2. Searching for artists while allowing for approximation, or a margin of error in how it is written. This is done by using the Call.apoc.cypher queries and adding ~0.85 at the end of the elements searched. This can be fine-tuned to be more or less forgiving. (Going all the way up to 0.95 or more would be less forgiving):

 

CALL apoc.cypher.run(“CALL db.index.fulltext.queryNodes(‘artistNameIndex’, $searchString+’~0.85′) YIELD node, score return node order by score desc

 

3. Search for songs, releases & labels, but only through exact matches by adding quotes around the search string. That way, only exact matches are provided, thus heavily narrowing down results. This makes finding results such as “The long and winding road” by the Beatles way easier:

 

4. Implementing a scoring mechanism to sort between results. For artists, we want to take into account the approximation score provided by the database, but also how “important” an artist is. The latter aspect we deduce by counting how many albums a given artist has released, which is of course biased as a long lasting artist likely has a larger catalogue to show for:

 

CALL apoc.cypher.run(“CALL db.index.fulltext.queryNodes(‘artistNameIndex’, $searchString+’~0.85′) YIELD node, score
WITH node, score limit 20
OPTIONAL MATCH (node)-[rel3:`is credited to`]-(s2:Release_group)  return node, count(rel3)*score*score as BalancedScore limit 20″,

{searchString: $searchString}) yield value
return value, value.BalancedScore as BalancedScore   order by BalancedScore desc limit 12

 

For tracks, the scoring is done differently — and is arguably more complex. First, we count the number of mediums on which the track was released, then we use scoring from the database engine and multiply it by what we call “typeScore”. This score equals 2 for albums , EPs and singles and 1 for lives releases, compilations… so as to favor “major” releases. Finally, we use the year when an album, single or EP has been released, thus slightly favoring older releases.

 

CALL apoc.cypher.run(“CALL db.index.fulltext.queryNodes(‘trackNameIndex’, $searchString) YIELD node, score
WITH node, score limit 20
OPTIONAL MATCH (r:Release)<-[rel:`Found on release`]-(m:Medium)<-[:`Found on medium`]-(node)
WITH node, (CASE when s.secondary_type is null then 2 ELSE 1 END) as typeScore return coalesce(min(r.release_date_year),2050) as release_date_year, count(m)*score*score*typeScore)-coalesce(min(r.release_date_year),0) as BalancedScore   limit 20″, {searchString: $searchString}) yield value

return value, value.score, value.NumberRelations as NumberRelations, value.BalancedScore as BalancedScore  order by BalancedScore desc limit 12

 

(First) conclusion:

So, is our search system “Broken” as Seth Godin would say? I would quite humbly say so, as in some instances it is still hard to find what you are looking for. Our scoring mechanisms are still biased towards popular and thus older artists and are sometimes too restrictive (exact match only). Going forward, we are looking at implementing the following changes:

  1. search for both approximated and exact matches and provide exact matches first (or score them favorably) but still allow for a margin of error on track names;
  2. take into account artist, albums and track ratings as provided by users, so as not to have to rely on the sheer number of releases: a young upcoming artist could thus be sorted way more favourably — as he should;
  3. provide some level of autocompletion, so that users are geared towards “winning searches”;
  4. use the number of times an artist, album… page has been visited as part of the scoring…

In other words, this is only the beginning of our quest towards a fully non-broken search. We firmly believe that we still have a lot to learn, but also that it will be an exciting journey!