Technical: Getting Hashes For Searches

This document contains technical information about hashing and the strategies you can use to attempt to match a download on TorBox.

This document will first explain the reasoning for all of the methods, what methods are recommended, and why we have multiple hashes for a single file.

The Why

When you add a file to TorBox, whether torrent, usenet or web download, TorBox must accurately determine whether this file is already in our system. If it is, a symbolic link to the file is created on your account; otherwise, we create a new file on our system and begin downloading it. The issue comes from the limited information TorBox is given when it decides whether the file already exists.

Let's use this link as an example of what happens when you add a link to TorBox:

"https://website.site/download/123456?filename=hello.png"

In this case, when you add this link to your account, we take the entire link at face value and calculate a hash using the MD5 algorithm. When you send TorBox this link, what we really see is "484b93eac20cddf3cdd379754a4cc9df". MD5 hashing is deterministic, so when someone adds the exact same link, it outputs the exact same hash. But what happens if something about the link changes? For example, another user, who intends to download the same file but got the link from a different website, submits the following:

"https://website.site/download/123456"

The content on the website is exactly the same, but now the "filename" parameter is missing. TorBox treats it as an entirely new download with the MD5 hash "d69e0b720f24907f348fd9ef1b8f624b". This is a problem: we now have two copies of the same file under different hashes. That doubles the storage space for no reason, and it forces the second user to wait for the download to complete before they can use it.

Alternative hashes solve this problem by calculating all the possible hashes (or links) that somebody could submit and saving them in the database. So now, when user A adds "https://website.site/download/123456?filename=hello.png", we can save the hash for "https://website.site/download/123456" alongside it, along with dozens of other possibilities given the number of strategies we have. That way, the next time somebody adds "https://website.site/download/123456", TorBox already knows the file is on its system and will give the user access to it.
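The idea above can be sketched as a small in-memory lookup. This is only an illustration, not TorBox's actual implementation: the names "candidate_hashes", "hash_index", "register_download" and "find_existing" are hypothetical, and only two strategies (the raw link and the link without its query string) are included.

```python
import hashlib
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()

def candidate_hashes(link: str) -> set:
    # Hypothetical helper: hash the raw link and the link with query/fragment removed.
    parts = urlsplit(link)
    stripped = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return {md5_hex(link), md5_hex(stripped)}

# Hypothetical stand-in for the database: alternative hash -> file identifier.
hash_index = {}

def register_download(link: str, file_id: str) -> None:
    # Store every alternative hash so later submissions of a variant still match.
    for h in candidate_hashes(link):
        hash_index.setdefault(h, file_id)

def find_existing(link: str) -> Optional[str]:
    # Return the known file for any matching alternative hash, else None.
    for h in candidate_hashes(link):
        if h in hash_index:
            return hash_index[h]
    return None

register_download("https://website.site/download/123456?filename=hello.png", "file-1")
print(find_existing("https://website.site/download/123456"))  # file-1
```

The second user's link was never submitted before, yet it resolves to the same file because its hash was stored as an alternative when the first link was added.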

There are lots of alternatives to this strategy of saving Alternative Hashes, but we decided this would be the best way for a few reasons:

1. It happens before the user adds the download. Unlike post-download de-duplication, which many services employ, this happens much quicker and saves the user the time spent waiting for the download.
2. Alternative hashes are logically simpler. While it would be great to somehow match the checksum of the actual physical file, this is impossible to do beforehand.
3. Strategies can be added, modified, or removed. If a strategy, such as parsing and cleaning the filename, is found to lead to too many issues, it can be removed.
4. No major service-breaking changes. This system can be added on top of TorBox's already stable service, meaning that there is no chance this will break existing functionality.

Strategies

In this section, which is mostly for developers, we will go over the strategies you can use in applications that rely on TorBox to provide proper cache checking.

Magnet Links (Torrents)

This is probably the easiest one to use since it is already employed by applications and is the simplest.

When a user submits a magnet link, in the format "magnet:?xt=urn:btih:xxxxxxxxxxxxx", all you must do is submit the "xxxxxxxxxxxxx" part of the magnet. This is the hash of the torrent, and TorBox will match it to the correct file. There is nothing more needed here. Below is a code snippet showing how you might do this manually, and also with a library:

python

import urllib.parse

def getMagnetHash(magnet: str) -> str:
    try:
        parsed_url = urllib.parse.urlparse(magnet)
        query_params = urllib.parse.parse_qs(parsed_url.query)
        # "xt" looks like "urn:btih:<hash>"; take the first one if there are several
        xt_param = query_params.get("xt", [None])[0]
        if xt_param:
            hash_value = xt_param.split(":")[-1] # keep only the part after the last ":"
            return hash_value.lower()
        return ""
    except Exception:
        return "" # malformed magnet

python

import libtorrent as lt

def getMagnetHash(magnet: str) -> str:
    try:
        info = lt.parse_magnet_uri(magnet)
        info_hash = str(info.info_hash) # avoid shadowing the built-in hash()
        return info_hash.lower()
    except Exception:
        return "" # malformed magnet
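One detail the manual snippet above does not cover: magnet links sometimes carry the torrent hash as 32 Base32 characters instead of 40 hexadecimal characters. The sketch below converts such a value to hex. Whether TorBox needs the hex form for matching is an assumption here, and "normalize_info_hash" is a hypothetical helper name.

```python
import base64

def normalize_info_hash(value: str) -> str:
    # Hypothetical helper: return the torrent hash as lowercase hex.
    value = value.strip()
    if len(value) == 32:
        # 32 characters: treat as Base32 and decode to the 20 raw bytes
        return base64.b32decode(value.upper()).hex()
    # otherwise assume it is already hexadecimal
    return value.lower()
```

You could pass the output of the manual "getMagnetHash" function through this helper before submitting it.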

Links (Usenet, Web Downloads)

Next are basic links. All you have to do is MD5 hash the link using your language's standard library, and use the result as the hash.

python

import hashlib

def hashLink(link: str) -> str:
    # you can use this as a standard function for all the following strategies too, because they use MD5 hashing
    md5_hash = hashlib.md5()
    md5_hash.update(link.encode())
    return md5_hash.hexdigest().lower() # hexdigest() is already lowercase; lower() just makes the normalization explicit

Link Normalization (Usenet, Web Downloads)

This is mostly the same as above, except that we first normalize the URL by removing its query parameters. In most cases, query parameters do not affect the actual content behind the URL; they carry authorization, filenames for downloaders, or other extra information that is useful elsewhere but not for our case here.

There are many ways you can attempt to remove query params, such as just discarding everything after the "?", but this won't work in some cases where the user doesn't submit the URL correctly, or the URL includes other information such as "#". Simply deconstructing and reconstructing the URL is the best way to handle all the edge cases (although if you do submit this, TorBox will still be able to match it).

python

import urllib.parse
import hashlib

def badlyNormalizeLink(link: str) -> str:
    if "?" in link:
        link = link.split("?")[0]
    else:
        link = link.split("&")[0] # handle malformed urls where the indexer was lazy

    return link.lower()

def properlyNormalizeLink(link: str) -> str:
    parsed_url = urllib.parse.urlparse(link)
    normalized_url = urllib.parse.urlunparse((
        parsed_url.scheme,
        parsed_url.netloc,
        parsed_url.path,
        "", # removes params
        "", # removes query
        "", # removes fragment
    ))
    return normalized_url.lower()

def hashLink(link: str) -> str:
    md5_hash = hashlib.md5()
    md5_hash.update(link.encode())
    return md5_hash.hexdigest().lower()

def hashAndNormalizeLink(link: str) -> str:
    normalized_link = properlyNormalizeLink(link=link)
    md5_link = hashLink(link=normalized_link)
    return md5_link
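To make the "#" edge case mentioned above concrete, here is a small comparison of the two approaches. It is a sketch using only the standard library; "strip_by_split" and "strip_by_parsing" are illustrative names, and the "#details" fragment is a made-up example.

```python
from urllib.parse import urlparse, urlunparse

def strip_by_split(link: str) -> str:
    # naive approach: discard everything after the first "?"
    return link.split("?")[0]

def strip_by_parsing(link: str) -> str:
    # deconstruct the URL and rebuild it without params, query, or fragment
    p = urlparse(link)
    return urlunparse((p.scheme, p.netloc, p.path, "", "", ""))

with_query = "https://website.site/download/123456?filename=hello.png"
with_fragment = "https://website.site/download/123456#details"

print(strip_by_split(with_query))       # https://website.site/download/123456
print(strip_by_parsing(with_query))     # https://website.site/download/123456
print(strip_by_split(with_fragment))    # https://website.site/download/123456#details
print(strip_by_parsing(with_fragment))  # https://website.site/download/123456
```

Both approaches agree when a query string is present, but only the parsed version also drops the fragment, so only it produces the same hash for both links.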

NZB File (Usenet)

If you allow NZB files to be used in your application, you can simply take the entire NZB file and MD5 hash it, and use that as a possible hash. You must first clean the file using the following:

python

import hashlib
import re
import xml.dom.minidom

def cleanUsenetFile(file: bytes) -> bytes:
    try:
        no_comments = re.sub(r'<!--[\s\S]*?-->', '', file.decode('utf-8'))
        clean_xml = no_comments.encode('utf-8').strip()
        
        dom = xml.dom.minidom.parseString(clean_xml)
        for elem in dom.getElementsByTagName('*'):
            if 'poster' in elem.attributes.keys():
                elem.removeAttribute('poster')
        cleaned_xml = dom.toxml().encode('utf-8')
        return cleaned_xml
    except Exception:
        return file # bad file or not normal nzb

def hashFile(file: bytes) -> str:
    cleaned_file = cleanUsenetFile(file=file)
    md5_hash = hashlib.md5()
    md5_hash.update(cleaned_file)
    return md5_hash.hexdigest().lower()
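To see what the cleaning step buys you, the sketch below hashes two tiny made-up NZB documents that differ only in a comment and in the poster attribute. "clean_nzb" repeats the same steps as the cleaning function above (without its error handling) so that the example is self-contained; the sample documents are not real NZB files.

```python
import hashlib
import re
import xml.dom.minidom

def clean_nzb(data: bytes) -> bytes:
    # Same steps as the cleaning function above: drop comments, then poster attributes.
    no_comments = re.sub(r"<!--[\s\S]*?-->", "", data.decode("utf-8"))
    dom = xml.dom.minidom.parseString(no_comments.encode("utf-8").strip())
    for elem in dom.getElementsByTagName("*"):
        if elem.hasAttribute("poster"):
            elem.removeAttribute("poster")
    return dom.toxml().encode("utf-8")

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# Made-up samples: same content, different comment and poster.
nzb_a = (b'<nzb><!-- added by indexer A --><file poster="alice" subject="demo">'
         b'<segments><segment number="1">part1@example</segment></segments></file></nzb>')
nzb_b = (b'<nzb><file poster="bob" subject="demo">'
         b'<segments><segment number="1">part1@example</segment></segments></file></nzb>')

raw_match = md5_hex(nzb_a) == md5_hex(nzb_b)
clean_match = md5_hex(clean_nzb(nzb_a)) == md5_hex(clean_nzb(nzb_b))
print(raw_match, clean_match)  # False True
```

Note that only the comment itself is removed. A comment sitting on its own line leaves its surrounding whitespace behind, so two files can still hash differently after cleaning.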

Message IDs (Usenet)

This is probably the best strategy as far as Usenet goes. Message IDs are what is actually used to download the files from Usenet. There are usually hundreds or thousands of them in an NZB, so instead of using them all, simply take the Message ID of the first segment of every file inside the NZB. In this example, we use a dependency called "xmltodict" which makes it easy to work with XML files (which is what NZB files really are), but you can use any XML parser you choose.

Once you get the list of Message IDs, you can MD5 hash each one as you see them, and use those hashes. If any of them match, it will direct you to the correct TorBox file.

python

import xmltodict

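def getUsenetMessageIDs(file: bytes) -> list[str]:
    try:
        message_ids = []
        root = xmltodict.parse(file)
        files = root["nzb"]["file"]
        if isinstance(files, dict):
            files = [files] # a single <file> is parsed as a dict, not a list
        for nzb_file in files:
            if not isinstance(nzb_file, dict):
                continue
            segments = (nzb_file.get("segments") or {}).get("segment", [])
            if not isinstance(segments, list):
                segments = [segments] # a single <segment> is not wrapped in a list either
            for segment in segments:
                # a segment with attributes is a dict holding its text under "#text"
                message_id = segment.get("#text") if isinstance(segment, dict) else segment
                if message_id:
                    message_ids.append(message_id)
                    break # only get the first message id per file
        return message_ids
    except Exception:
        return [] # bad file or not a normal nzb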
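Since any XML parser will do, here is a sketch of the same idea using only Python's standard library, including the MD5 step described above. The function name "first_message_id_hashes" is hypothetical, and the code assumes the usual NZB layout of file elements containing segment elements.

```python
import hashlib
import xml.etree.ElementTree as ET

def local_name(tag: str) -> str:
    # Strip an XML namespace prefix such as "{http://www.newzbin.com/DTD/2003/nzb}file".
    return tag.rsplit("}", 1)[-1]

def first_message_id_hashes(nzb_bytes: bytes) -> list:
    # Return the MD5 hash of the first segment's Message ID for every file in the NZB.
    hashes = []
    root = ET.fromstring(nzb_bytes)
    for file_elem in root.iter():
        if local_name(file_elem.tag) != "file":
            continue
        for elem in file_elem.iter():
            if local_name(elem.tag) == "segment" and elem.text and elem.text.strip():
                message_id = elem.text.strip()
                hashes.append(hashlib.md5(message_id.encode()).hexdigest())
                break  # only the first segment of each file
    return hashes
```

Each returned hash can then be submitted on its own; per the description above, a match on any one of them is enough.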

NZB File Normalization (Usenet)

This is very similar to the raw NZB file, except we clean out anything that may distinguish one NZB file from another without removing the core of the file. Things you can remove include comments and posters. While this doesn't change much, some indexers add them while others don't. We don't recommend using this strategy.

Raw Titles (Torrents, Usenet, Web Downloads)

If you know the title of the torrent, such as through a search engine, simply MD5 hash the title as you see it and submit that hash. We don't recommend using this strategy, due to possible inaccuracies.
