editall projects

One of the things that is important to me is the quality of our data. And judging from the recent survey, it's important to many other editors as well. One of the benefits of having editall permissions is that I can concentrate on quality control issues that I think are important.

In this article, I'll detail three of the projects I've been working on in the last year. Here they are, in no particular order.

Project 1: Vanity URLs

Even before becoming an editall, I had seen the problems with allowing vanity URLs (also called cloaked URLs and framed redirects) to be listed in the directory. Site owners claim that URL cloaking is beneficial because the resulting URL is usually shorter and easier to remember. A second claimed benefit is that the vanity URL always stays the same, no matter how often the real URL changes, so we would never have to update the listing.

However, the negatives associated with vanity URLs outweigh the positives. If the target of a vanity URL returns an error, Robozilla will not see it, because "he" does not look at the underlying page: it is the framed content that is broken, not the URL itself. A vanity URL also allows site owners to spam their pages across the ODP without editors being able to see the editor notes associated with the site. In this way, a single site can be listed multiple times, even in the same category. Therefore, we deal with these the same way as mirrors: the original site is found and listed, and all other copies are red-tagged as not listable. This ensures that the site is only listed once.
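The frame trick described above can be spotted mechanically. As a rough sketch (the host names are made up, and this is only my illustration, not how Robozilla actually works), a page whose entire body is a single frame pointing at another domain is a likely framed redirect:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class FrameFinder(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> on a page."""
    def __init__(self):
        super().__init__()
        self.frame_srcs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.frame_srcs.append(src)

def looks_like_vanity(page_url, page_html):
    """Heuristic: a page whose whole body is one frame pointing at a
    different domain is probably a framed redirect (vanity URL)."""
    parser = FrameFinder()
    parser.feed(page_html)
    if len(parser.frame_srcs) != 1:
        return False
    target = urlparse(parser.frame_srcs[0]).netloc
    return bool(target) and target != urlparse(page_url).netloc
```

Note that this is exactly the check an HTTP-level link checker never makes, which is why the broken target stays invisible.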

In order to combat the many vanity sites listed in the directory, I compiled a list of common vanity URLs. Every once in a while, I run a program which uses itubert's search to see how many of each type of vanity URL are listed in the directory. By comparing the counts from one date to the next, it's possible to tell where we need to put more resources into educating editors about vanities.
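The counting step amounts to tallying listed URLs against known vanity hosts. A minimal sketch, with a hypothetical host list standing in for the real one (which isn't reproduced here):

```python
from urllib.parse import urlparse
from collections import Counter

# Hypothetical list of known vanity/redirect hosts; the actual list
# the article describes is not public.
VANITY_HOSTS = {"come.to", "surf.to", "welcome.to", "cjb.net"}

def count_vanities(listed_urls):
    """Tally how many listed URLs use each known vanity host,
    matching the host itself or any subdomain of it."""
    tally = Counter()
    for url in listed_urls:
        host = urlparse(url).netloc.lower()
        for vanity in VANITY_HOSTS:
            if host == vanity or host.endswith("." + vanity):
                tally[vanity] += 1
    return tally
```

Running this on two snapshots and diffing the counters gives exactly the date-to-date comparison described above.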

Project 2: Cat Cleanup

Vanity URLs are scattered all over the directory, and many cannot be found using the above method; someone has to go into the category and look at the sites. That leads into the next project I have been working on.

I like browsing the directory. It's fun learning a little bit about all sorts of topics I would normally not know about. The downside (or upside, depending on how you look at it) of this extensive browsing is that I come across categories that are less than perfect. When I see a category that has a few bad links, I like to run Tulip Chain, a cross-platform link checker written in Java. It's an amazingly simple yet powerful way to check categories for dead sites, redirects, and sites with other problems.

I run Tulip Chain on categories ranging in size from 100 to 1,000 sites. Anything above that seems counterproductive: Tulip Chain can take a long time to check a large category, and since I have the attention span of a monkey, I don't like things that take a long time. Hence the small categories. Once Tulip Chain has done its thing, I generate a report, and it's editing time!

I like to run through the list only once, so I start at the top and work my way down. The easiest URLs to fix are redirecting ones, because Tulip Chain provides an autocorrect button. But use caution: autocorrect should only be used after you have verified that the redirect is correct. Sometimes even the best programs make mistakes, though usually it's not the program's fault. Often a website will recognize that a web client lacks some specific piece of software (such as Flash) and redirect the user to a different page so that the content is still visible. This is a legitimate use of a redirect and should not be changed. There are many legitimate uses of redirects; it's the robot's job to find them, and it's the human's job to determine whether each redirect is good or not.
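One conservative way to decide when an autocorrect suggestion is trustworthy is to accept only redirects that stay on the same site. This is my own rule of thumb as a sketch, not a Tulip Chain feature:

```python
from urllib.parse import urlparse

def _bare_host(url):
    """Hostname, lowercased, with any leading 'www.' stripped."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def safe_to_autocorrect(old_url, new_url):
    """Conservative check before accepting a link checker's suggested
    redirect target: only auto-accept moves that stay on the same site
    (a scheme change or an added 'www.' is fine). A redirect that hops
    to some other host -- say, a 'download Flash' page -- needs human
    eyes, for exactly the reason described above."""
    return _bare_host(old_url) == _bare_host(new_url) != ""
```

Anything this function rejects isn't necessarily wrong; it just goes on the pile a human reviews by hand.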

Another thing that I look for in the Tulip Chain reports is dead sites. Sites returning codes in the 400s (401, 403, 404) I move to the unreviewed pile after verifying that they are indeed dead. This lets another editor come behind me and check a second time whether they really are dead; two strikes and you're out in this league. It also gives another editor a chance to perform a detailed search for the dead site. Since I only do a very basic search on the site's title, someone with more experience in that area can try to root out a good replacement.
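The triage by status code can be written down directly. The action labels here are my own shorthand for the workflow above, not ODP terminology:

```python
def triage(status):
    """Map an HTTP status code to a cleanup action: 401/403/404 go
    back to the unreviewed pile for a second editor's check;
    redirects and server errors get their own handling."""
    if status in (401, 403, 404):
        return "move to unreviewed"
    if 300 <= status < 400:
        return "check redirect target"
    if status >= 500:
        return "server error, retry later"
    return "leave listed"
```

Note the 5xx case: a server that is merely down today is not the same as a site that is gone, which is part of why the manual verification step matters.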

Project 3: Spelling Errors

Often while doing a cat cleanup, I'll notice spelling errors in some of the listings, which leads me to run a spelling check on the category to see if there are any other misspelt words.

Another way I find misspelt words is with two tools. The first is itubert's random typo generator. After the spelling errors are found, I use rpfuller's search4 to correct them. As with Tulip Chain, search4 is very powerful, and it should only be used if you know what you're doing. If you already know which typos you want to correct, there is no need for the first tool. Another good place to find spelling mistakes is in Quality_Control/Misspellings,_Typos,_Errors. Everyone can help reduce the number of spelling mistakes we have: make sure you use the spell check button on every edit you make. There is no excuse not to; it takes very little time, and it will help you improve your spelling, and possibly your typing too.
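To illustrate what a typo generator might produce (the actual method behind itubert's tool isn't documented here, so this is purely a guess at the idea), adjacent-letter transpositions alone already yield classics like "recieve":

```python
def transposition_typos(word):
    """Generate one family of plausible typos for a correctly spelled
    word: every adjacent-letter transposition that changes the word."""
    typos = set()
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:
            typos.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return sorted(typos)
```

Each generated typo is then a candidate search term; anything a search turns up is a listing that needs fixing.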

Conclusion:

I hope this has given some insight into the editing style of at least one editall. Of course, everyone is different, so every editall will edit in a different fashion. I expect most focus on categories they were interested in before becoming an editall. After all, editors are most efficient if they edit where they are interested, and editalls are no different.

- ishtar


Please send all comments, questions or suggestions to the newsletter editor.
Copyright 2000-2005 Open Directory Project