Yesterday we announced some details about our upcoming Domain Name Filter software. We used an early version of the software to split the domain names in the Guru, XYZ and Club zone files into words.
We used the default English language dictionary that ships with the software. This dictionary has over 75,000 English words and also includes the names of countries. We did not use the dictionaries for common names, places and other proper nouns.
The software splits domain names into component words. Domain names with only numbers or a combination of numbers and valid words were accepted. So 101domain.club is split as “101 domain” and accepted. But a domain name like xfrtyclub.guru is ignored because ‘xfrty’ is not in the dictionary.
Domains with hyphens were always accepted. For example, 1-koelner-pfeifen.club was split into “1 koelner pfeifen” and accepted even though neither koelner nor pfeifen is in the default English dictionary.
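The acceptance rules above can be sketched in a few lines of Python. This is only a rough illustration, not the actual code of the Domain Name Filter software: the names `WORDS`, `split_label` and `split_domain` are made up for this example, the tiny word set stands in for the real dictionary, and trying the longest dictionary word first is our assumption for the sketch.

```python
import re
from functools import lru_cache

# Tiny stand-in for the real dictionary (hypothetical; the real one has over 75,000 words).
WORDS = {"domain", "club", "guru"}

def split_label(label, words=WORDS):
    """Return the component words of a domain label, or None if it is rejected."""
    label = label.lower()
    # Hyphenated names are always accepted; the hyphens mark the word breaks.
    if "-" in label:
        return [part for part in label.split("-") if part]

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(label):
            return ()
        # A run of digits counts as one valid token.
        digits = re.match(r"\d+", label[start:])
        if digits:
            rest = solve(start + digits.end())
            return None if rest is None else (digits.group(),) + rest
        # Otherwise try dictionary words, longest candidate first (an assumption).
        for end in range(len(label), start, -1):
            if label[start:end] in words:
                rest = solve(end)
                if rest is not None:
                    return (label[start:end],) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)

def split_domain(domain, words=WORDS):
    """Drop the TLD and split the remaining label."""
    return split_label(domain.rsplit(".", 1)[0], words)

print(split_domain("101domain.club"))          # digits plus a dictionary word
print(split_domain("xfrtyclub.guru"))          # rejected, so None
print(split_domain("1-koelner-pfeifen.club"))  # hyphenated, always accepted
```

In this sketch a rejected name comes back as `None`, so filtering a zone file is just a matter of keeping the names for which `split_domain` returns a list.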
A Side Note: The software splits domain names accurately in most cases. However, in rare instances it creates unintended word combinations. For example, GreatIdeasInAging.xyz was split into “Great Idea Sin Aging” instead of “Great Ideas In Aging”.
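To see why such a name is ambiguous, here is a small Python sketch that lists every way a string can be cut into dictionary words. The function `all_splits` and the word set `SAMPLE_WORDS` are hypothetical and only for illustration; they say nothing about how the actual software picks one split over another.

```python
def all_splits(text, words):
    """Return every way to split text into words from the given set."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Keep this word and split whatever is left over.
            for tail in all_splits(text[end:], words):
                results.append([head] + tail)
    return results

# Hypothetical sample of dictionary entries relevant to this one name.
SAMPLE_WORDS = {"great", "idea", "ideas", "in", "sin", "aging"}

for split in all_splits("greatideasinaging", SAMPLE_WORDS):
    print(" ".join(split))
# great idea sin aging
# great ideas in aging
```

Both splits consist entirely of valid English words, so a dictionary alone cannot tell which one the registrant intended.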
Here are the results:
Total Domains Excluding IDN – 739,448
Valid Keyword Phrases after Splitting – 242,909
33% of the domains are valid English word combinations
Total Domains Excluding IDN – 152,817
Valid Keyword Phrases after Splitting – 79,580
52% of the domains are valid English word combinations
Total Domains Excluding IDN – 78,271
Valid Keyword Phrases after Splitting – 49,983
64% of the domains are valid English word combinations
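For anyone who wants to check the percentages, each one is simply the number of valid keyword phrases divided by the total number of non-IDN domains, rounded to the nearest whole percent. A quick Python sketch (the names `results` and `valid_share` are ours, used only for this check):

```python
# (total domains excluding IDN, valid keyword phrases), in the order listed above
results = [
    (739_448, 242_909),
    (152_817, 79_580),
    (78_271, 49_983),
]

def valid_share(total, valid):
    """Share of valid phrases as a whole-number percentage."""
    return round(100 * valid / total)

shares = [valid_share(total, valid) for total, valid in results]
print(shares)  # [33, 52, 64]
```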