One of the many challenges of categorizing content on the internet is the sheer volume and variety of content types. While threat categories such as Terrorism and Hate, P2P & Illegal, Weapons, and Adult Content do much of the heavy lifting in blocking sensitive content, there has always been additional content that exists on the periphery of these categories. Is a lingerie shop an adult website? Perhaps technically, but it is also a business. Are there instances of graphic content that are also intended to instruct? The short answer is yes.
These wide and varied cases can be a genuine challenge to pin down properly, and can lead to a rabbit hole of second-guessing that only research can resolve. Content categorization is not infallible, even with humans performing manual review; the best that can be done is due diligence.
Our primary goal in the Domain Intelligence department at DNSFilter is to be as empirical as possible, while leaving room for the adage “Absence of evidence is not evidence of absence.” Some content necessitates contextual assessment and interpretation.
Idle threats on forums, propaganda, questionable medical advice, conspiracies concerning world governments—enough of this content comes across the Domain Intelligence Desk that it became necessary to create a new place for this kind of material. This need drove the creation of the “Contentious and Misinformation” category.
What precisely does this category cover? The guiding definition is:
Sites that are contentious or controversial, often causing argument or controversy, characterized by strong opposing arguments; and sites that spread or aid in the spreading of misinformation.
This may seem broad, but anti-vaccination sites, or sites making unfactual claims about the coronavirus, are prime examples that fit neatly inside this category. Previously, unreliable pandemic sites were filtered as deceptive. That not only caused problems; it also wasn’t an accurate categorization for some of the activity.
Sites whose claims can be fact-checked against credible sources are also candidates for this category, as are election misinformation, conspiracy presented as fact, and sites with potentially dangerous content. The various -chan boards are an example as well. Previously they had mainly been categorized as Message Boards and Forums, but due to their transient nature it could be difficult to pin down, hour to hour, where posts might spread from or which statements were a meme versus a genuine threat. There have been many instances of live-posted violence that then result in a trickle-down effect of posts and ideas that pervade elsewhere. These boards are now included under the Contentious and Misinformation category, providing our customers with additional context regarding the sites their users are navigating.
Domain Generation Algorithms (DGAs) have been prevalent on the internet for ages, and they have only grown more complex and harder to decipher (and thus identify as DGA output) over time. It is notoriously challenging to uncover the patterns behind these algorithms and block each specific permutation in use: DGAs appear in copious amounts of ads, but also on sites directly. The pattern behind such domains can be anything from characters and numbers positionally shifted in a cipher to a long, complex equation that produces an encoded domain string to be decoded later. The methods and means of creating them are many, and entire companies have been devoted solely to studying algorithmically generated domain names.
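To make the idea concrete, here is a minimal, purely hypothetical sketch of how a seed-and-date driven DGA might produce domain strings. No real malware family's scheme is implied; the hash, label length, and TLD here are illustrative assumptions chosen for simplicity.

```python
import hashlib
from datetime import date

def toy_dga(seed: str, day: date, length: int = 12, tld: str = ".xyz") -> str:
    """Toy DGA: hash a seed plus the current date, then map the digest
    bytes onto lowercase letters to form a domain label.

    Real DGAs vary widely; this only illustrates the general shape:
    deterministic output that both malware and operator can compute.
    """
    digest = hashlib.md5(f"{seed}-{day.isoformat()}".encode()).digest()
    # Map each byte to a letter a-z to produce the "soup" label.
    label = "".join(chr(ord("a") + b % 26) for b in digest[:length])
    return label + tld
```

Because both sides share the seed and the date, the operator can register the day's domain in advance while the malware independently derives the same name, which is exactly what makes enumeration-based blocking so laborious.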
This raises the question: if there is no easy way to detect a specific algorithm, is there a trend to analyze in the TLDs (.com, .org, .xyz, etc.) that could improve spam ad blocking? If our systems could recognize alphanumeric “soup” followed by a range of less common TLDs as DGA output, we’d be able to confidently block this spam.
Using data already leveraged to confidently mark spam on a daily basis, we hope to gain traction in building these associations, and perhaps to create reference sets for simpler ciphers (e.g., two-word combo domains like word1.word2.TLD), increasing customer security and blocking spam ads and trackers more reliably.
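The "alphanumeric soup plus uncommon TLD" idea above could be sketched roughly as follows. This is a simplified illustration, not DNSFilter's actual detection logic: the TLD watch list, entropy threshold, and length cutoff are all assumed values for demonstration.

```python
import math
from collections import Counter

# Assumption: an illustrative watch list of less common TLDs,
# not an actual DNSFilter policy list.
SUSPECT_TLDS = {"xyz", "top", "club", "info"}

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character; random-looking strings score higher."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_dga(domain: str, entropy_threshold: float = 3.2) -> bool:
    """Heuristic flag for DGA-like domains: a high-entropy or
    digit-laden label combined with a watch-listed TLD."""
    parts = domain.lower().rstrip(".").split(".")
    if len(parts) < 2:
        return False
    label, tld = parts[-2], parts[-1]
    if tld not in SUSPECT_TLDS:
        return False
    has_digits = any(ch.isdigit() for ch in label)
    return shannon_entropy(label) >= entropy_threshold or (
        has_digits and len(label) >= 10
    )
```

A heuristic like this would flag `q7x2kf9ba3zm.xyz` while leaving `example.com` and ordinary words on watched TLDs (e.g. `news.xyz`) alone; in practice it would be one weak signal among many, combined with the reference sets and traffic data described above.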
In conjunction with our release of the Contentious and Misinformation category, we’ve also been improving our measures against other harmful material on the internet. We’re now expanding into more Dark Web research on the Data Intelligence team.
We’ve gained resources to continue this work more safely and efficiently. We are already partnered with several anti-CSAM organizations whose feeds inform the data we block and filter, and we are committed to contributing more actively to the fight against this content. DNSFilter also plans to partner with anti-Human Trafficking organizations and to leverage our machine learning algorithms against internal data to help identify high-risk content and surface trends. We will also leverage our Dark Web capabilities to improve research into crypto scams, malware, and phishing. All of this improves our knowledge of threats both physical and digital, and makes a difference not just in our customers’ lives, but in the world as a whole.