Cleaning up the ODP
Moz with Mop

News Flash!

ODP Janitors' Union Formed

Strikes for a Cleaner Directory
MPEGs at 11

I recently got a note from an editor, saying "Thank you for correcting that spelling error in Regional/Pangaea/Erewhon/Weather, I've looked at it a hundred times and not seen that it was wrong. How did you find it? Can you teach me your tricks?" Other editors, too, have looked in their category edit logs, seen my footprints, and wondered what brought me wandering by. 

The answer is simple, of course: I take a copy of the RDF dump (the way the ODP publishes its database), print it out, feed it into a shredder, and let several thousand specially-trained penguins root through the debris and bring me fragments with spelling errors on them. 

That's what works for me, anyway. :-) 

If you don't happen to have access to trained penguins, you'll have to look for spelling errors yourself. The first habit to get into is to actually look at the categories that you edit. It can be hard to find problems when you're accustomed to seeing them. For some people it may help to print out the public view of a category to look at, or to swap proofreading services with someone who edits in a different part of the directory. Another way is to cut and paste (or otherwise load) the public or edit view of a category into a word processor, and use the built-in spell checker to find misspelled words. Scroll around to match each highlighted error with the entry at fault, and edit it. Remember, though, that most spell-checking programs won't catch usage problems like typing "that" instead of "than", or using "it's" (it is) rather than "its" (belonging to it). 

The second big trick is to realize that errors aren't just random. There are patterns to them, whether it's common typos like "teh" instead of "the", or common misspellings like "freind" instead of "friend". Chances are, if you make a typo or misspell a word, you've made the same mistake before, and so have other people. (This is a special case of what used to be known on Usenet as "Ugol's Law" - "If you ever have to ask 'am I the only one who does <anything>?' the answer is invariably 'no.'") 

So, if you want to clean up a lot of errors in a hurry, instead of hunting through a category looking for errors, you might want to look for one specific error in lots of categories. Obviously this works better the higher up in the tree you have access, but even lower-level editors can do the searching, and then smile cheerfully as they pass the results on to other people to fix. (More on that later.) 
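
As a rough illustration of hunting known errors across many categories at once, here is a minimal Python sketch. The category names, the listing text, and the KNOWN_TYPOS table are all made up for the example, and it says nothing about how the ODP search or any ODP tool actually works.

```python
import re

# Hypothetical table of common typos and their corrections.
KNOWN_TYPOS = {"teh": "the", "freind": "friend"}

def find_known_typos(listings, typos=KNOWN_TYPOS):
    """Return (category, word, correction) for every whole-word match."""
    hits = []
    for category, text in listings.items():
        # Split into words so "teh" does not match inside a longer word.
        for word in re.findall(r"[A-Za-z']+", text):
            fix = typos.get(word.lower())
            if fix is not None:
                hits.append((category, word, fix))
    return hits

# Made-up sample data standing in for listing descriptions.
sample = {
    "Regional/Pangaea/Erewhon/Weather": "Forecasts for teh whole region.",
    "Arts/Example": "A site for you and a freind.",
    "Science/Example": "Nothing wrong here.",
}

for category, word, fix in find_known_typos(sample):
    print(f"{category}: '{word}' should probably be '{fix}'")
```

The point of the sketch is the shape of the job: one short list of errors, checked against everything, rather than one category checked for every possible error.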

It used to be that there was only one way to automate hunting batches of spelling errors: Use the search box (at the top of every ODP page) to look for specific misspelled words, limited to the tree in question, open the category in a new window, click the "edit" link to get to the edit view, scroll around to find the site at fault, and edit it. Repeat the process for every site within the category, and every category that has search results. Needless to say, this is a slow and painful process. 

Finding and fixing errors is a lot easier with ODP Search for Editors, a tool written by editor newwave. Like all of the other productivity-boosters in the ODP Tools area, this automates simple repetitive tasks so that human energy can be focused on the things that only humans can do. It works the same way as the regular ODP search - enter one or more keywords, and optionally a category to restrict to. Results are displayed in batches of 20, but the links go to the edit view of the category or listing in question. Unfortunately, this also has the same limitations as the regular search, notably that the search index is only updated every day or so. You might open up all the edit windows and find that someone else fixed the errors an hour ago. 

Why would two editors be looking for the same error on the same day? Well, we all know that the ODP is a team effort, that we get better results by working together rather than all going off in different directions. Traditionally this cooperation has been among editors in the same category, or parallel parts of the directory structure, but in recent months there's been a concerted group effort among editors from a wide range of topics to improve overall directory quality, mostly by fixing spelling errors. This group has become known as the "janitor gang", and we're always looking for fresh meat, um, er, we always welcome new participants. :-) 

The primary turf of the janitor gang is the spelling threads in the General Forum. Look for a thread with "Spelling Errors" in the title; the current incarnation is "Spelling Errors - But Wait, There's More!". Inside you will find lots of cryptic entries in this format:

http://newwave.primeline.com/dmoz/search2.kizml?search=porgram* (3 of 4)

Translated into English, this means that there are currently four search matches for "porgram" or a longer word that starts with "porgram" (perhaps "porgrams" or "porgramming"), and that the poster felt that three of the four needed to be corrected. The fourth is to be excluded for some reason that is expected to be obvious - perhaps the "typo" is part of a functional URL, a proper name, or an acronym. The poster couldn't or didn't fix the errors directly, but published them for others to take care of. 
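
The trailing "*" in the search term acts as a prefix wildcard. Here is a minimal Python sketch of that matching rule, assuming a simple word-by-word comparison; the real search engine may well behave differently, and the function name and sample descriptions are invented for the example.

```python
import re

def prefix_matches(term, text):
    """Return the words in text that match term.

    A trailing '*' matches the term itself or any longer word starting with it;
    without the '*', only the exact word matches.
    """
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.findall(pattern, text, flags=re.IGNORECASE)

# Made-up descriptions standing in for search results.
descriptions = [
    "Free porgrams for download.",
    "Porgramming tutorials and tips.",
    "Home page of the Porgram family.",  # a proper name: leave it alone
    "Programs for every taste.",         # spelled correctly: not a match
]

for text in descriptions:
    found = prefix_matches("porgram*", text)
    if found:
        print(f"{found} in: {text}")
```

Note that the proper name still shows up as a match. Deciding which matches are real errors is the part that needs a human, which is why posters report counts like "3 of 4".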

If you get in the habit of doing a global search every time you fix a typo or spelling error, you will find that many of them are, unfortunately, duplicated elsewhere. Even if you can't fix all of the occurrences, you can post a note in the "janitors' spelling thread" with your findings. 

If you are a lower-level editor, your responsibility ends there; you can walk away smugly, knowing that someone else will almost certainly jump in and make the necessary fixes, and the end result is a cleaner directory for everyone. Of course, since diligence and competence are their own punishments at the ODP, you may one day find yourself magically turned into one of those higher-level editors who are compelled to fix as many errors as they can... 

If you are a higher-level editor (and this can mean anything from an editall or top-level category editor, to the new editor who signs on one day and discovers that there's an even newer editor residing in a subcategory of the one they edit), there are several ways you can handle errors you find deeper in the tree: 

  • Send a note to local editors, pointing out the problem and requesting a fix. Remember to be polite and courteous, and if possible look at the big picture: someone who inherited a messy category and is gradually cleaning it up will benefit from a different tone than someone who is committing new errors themselves.
  • Fix it yourself. This is often the fastest and easiest solution, but misses out on the benefit of getting other editors involved in the process.
  • Unreview it with a comment requesting cleanup. This can be problematic; if there is a large backlog of unreviewed in the category, or if the editor in question only logs on occasionally, it may be a while before the site is re-added. Personally I only use this tactic when there are multiple obvious guideline violations in the site's listing, it appears to be grossly miscategorized, or there are editor's notes that suggest it may have been added by mistake.
  • Post a note in a janitorial thread about the error's global incidence. While it can make you look lazy to ask others to fix a problem you have access to, it's better to have the problem fixed than not fixed, and not all of us have the time to do all the editing we want.

When spell-checking, though, there are a few pitfalls. These become more common if you start editing a broader range of topics, but can trip up any editor. Watch out for: 
Site titles
Sometimes web authors will deliberately misspell a word for some reason. The usual policy is to follow the site's actual title (if it actually is a title instead of a keyword list or description) as long as it doesn't seem to have been designed just to get a higher ranking in alphabetical search results. Leave good editors' notes explaining that the "error" is deliberate, though, so someone else doesn't have to repeat your research. 
Proper names
Names of people and places often have odd spellings. If in any doubt, check against the site itself, and make sure you aren't just copying from a single typo on the site. (But see above.)
Jargon
Sometimes a word that appears to be misspelled is simply an unusual word that appears similar to a common one. For example, "calender" is usually a misspelling of "calendar", but is correct usage in reference to a step in papermaking, or a Sufi devotional practice. 
World/ 
Sometimes a global search for an English misspelling will turn up many listings from the World/ (foreign-language) hierarchy. Obviously, spellings differ from language to language. As long as listings are in the right language (English everywhere outside World/, non-English within it) you should only pay attention to results in languages where you are fluent. (If you are completely convinced that something is wrong, contact an editor from that part of the directory.)
UK vs. US spelling
Each side thinks the other spells funny, and they're both right. There are both obvious differences (e.g. "color" versus "colour") and subtle ones (e.g. "fulfill" versus "fulfil"). There is no general consensus on how to handle this in titles and descriptions. Some editors follow the usage on the site, others standardize to US spelling. Category names, though, should be in US English except for specific parts of the Regional hierarchy, where UK English is the standard (for categories, titles, and descriptions).
Hype
While listings full of self-advertising language ("marketing hype") like "The first and best site on the Internet for widget lovers!" should be high-priority cleanup targets, there are many subtler problems, like ending a list of site contents with "and much more." While there is common agreement that this language should not be used for new listings, opinions vary on whether it is worthwhile to spend time hunting down existing instances globally.
Common Bad Usage
Some errors are so common we almost don't notice them, like using "CD's" instead of "CDs". Most users of the directory won't notice them either. So, even though they are errors, there's no great priority placed on a global fix.

While low-level hype and common bad usage may not be the most productive targets for global cleanup efforts, they should still be fixed up in areas where you edit directly, because we all should want to make our own special interests into the best category in the directory. :-) 
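
If you keep your own list of suspect words, a couple of these pitfalls can be screened out before you start opening edit windows. The following Python sketch is only an illustration: the ALLOWED table, the needs_review function, and every category path in it are made up, and site titles and proper names still have to be checked by hand.

```python
# Hypothetical exceptions: words that look wrong but are correct in context,
# mapped to the (made-up) category prefixes where they are expected.
ALLOWED = {
    "calender": ("Business/Example/Papermaking",),
}

def needs_review(category, word):
    """Return True if a suspect word in a category still deserves a human look."""
    # Skip the foreign-language hierarchy; spellings differ between languages.
    if category.startswith("World/"):
        return False
    # Skip known jargon when it turns up where that jargon is expected.
    for prefix in ALLOWED.get(word.lower(), ()):
        if category.startswith(prefix):
            return False
    return True

# Made-up search hits: (category, suspect word).
hits = [
    ("Business/Example/Papermaking/Machinery", "calender"),
    ("Recreation/Example/Events", "calender"),
    ("World/Deutsch/Beispiel", "calender"),
]

for category, word in hits:
    if needs_review(category, word):
        print(f"check '{word}' in {category}")
```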

Of course, some editors are always going to say "spell-checking bores me", or "as soon as this reorg is done", or "right after I clear all the unreviewed." That's OK; we all have different editing styles, and different skills and talents we bring to the project. As long as your edits are good, they are helping to make this directory the best in the world, no matter how few or how many you do. Just remember that some people will discount all the hard work you do if they see spelling errors or sloppy editing, and a bad experience in one category will affect their willingness to use the ODP for another subject. Neatness counts. 

Stay tuned for Part II next month: Your Friendly Neighborhood Serial Killer.

-- Pinkie
   (ODP editor hotpink)
   2000-07-18

Mop Mozzie courtesy of editor etoile.
