One of the things that is important to me is the quality of our data. And judging from the recent survey, it's important to many other editors
as well. One of the benefits of having editall permissions is that I can concentrate on quality control issues that I think are important.
In this article, I'll detail three of the projects I've been working on in the last year. Here they are, in no particular order.
Project 1: Vanity URLs
Even before becoming an editall, I had seen the problems with allowing
vanity URLs (also called cloaked URLs and framed redirects) to be
listed in the directory. Site owners claim that URL cloaking is beneficial because the resulting URL is usually shorter and
easier to remember. The second reason is that the vanity URL always stays the same, no matter how often the real URL changes,
so we don't have to update their URL in the directory every time.
However, the negatives associated with vanity URLs outweigh the positives. If the target of a vanity URL returns an error, Robozilla will
not see it, because "he" does not look at the underlying page. It is only the page inside the frame that is broken; the listed URL itself still loads.
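To make the problem concrete, here is a minimal sketch (the host names are made up, and this is not how Robozilla actually works): fetching the listed vanity URL only retrieves the frameset, which still answers 200 even when the page inside the frame is long gone.

    import urllib.request
    import urllib.error

    def status(url):
        """Return the HTTP status code for url."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code

    # Hypothetical example: the frameset still loads, the real page does not.
    print(status("http://example-vanity-host.example/somesite"))     # e.g. 200
    print(status("http://members.example.net/somesite/index.html"))  # e.g. 404
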
A vanity URL also allows site owners to spam their pages across the ODP without editors being able to see the editor notes associated
with the site. In this way, it is possible for a single site to be listed multiple times, even in the same category. Therefore, we deal
with these the same way as mirrors: the original site is found and listed, and all other copies are red-tagged as not listable. This
ensures that the site is only listed once.
To combat the many vanity sites listed in the directory, I compiled a list of
common vanity URLs. Every once in a while, I run a program that
uses itubert's search to see how many of each type of vanity URL are listed in the directory. Comparing the number of vanity URLs from
one date to another, it's possible to tell where we need to put more resources into educating editors about vanities.
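As a rough sketch of what that program does (this is not itubert's search; the host names and file names below are invented for illustration), it simply tallies how many listed URLs sit on each known vanity host and compares two snapshots:

    from collections import Counter
    from urllib.parse import urlparse

    VANITY_HOSTS = {"example-vanity.example", "shorturl.example", "redirect.example"}

    def tally(url_file):
        """Count listed URLs per known vanity host in an exported URL list."""
        counts = Counter()
        with open(url_file) as fh:
            for line in fh:
                host = urlparse(line.strip()).hostname or ""
                for vanity in VANITY_HOSTS:
                    if host == vanity or host.endswith("." + vanity):
                        counts[vanity] += 1
        return counts

    old, new = tally("listed_urls_old.txt"), tally("listed_urls_new.txt")
    for host in sorted(VANITY_HOSTS):
        print(f"{host}: {old[host]} -> {new[host]}")

A host whose count keeps climbing is where the educating effort goes.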
Project 2: Cat Cleanup
Vanity URLs are all over the directory, and many are not found using the above method. Someone has to go to the category and look at the
sites, which leads into the next project I have been working on.
I like browsing the directory. It's fun learning a little bit about all sorts of topics I would normally not know about. The downside
(or upside, depending on how you look at it) of this extensive browsing is that I come across categories that are less than perfect. When
I see a category that has a few bad links, I like to run Tulip Chain, a cross-platform
link checker written in Java. It's an amazingly simple yet powerful way to check categories for dead sites, redirects and sites with
other problems.
I run Tulip Chain on categories ranging in size from 100 sites to 1000 sites. Anything above that and it seems counterproductive. Tulip Chain
can take a long time to check a large category, and since I have the attention span of a monkey, I don't like things that take a long time,
hence the small categories. Once Tulip Chain has done its thing, I generate a report, and it's editing time!
I like to run through the list only once, so I start at the top and work my way down. The easiest URLs to fix are redirecting ones, because
Tulip Chain provides an autocorrect button. But use caution: autocorrect should only be used after you have verified that the
redirect is correct. Sometimes even the best programs can make mistakes. Usually it's not the program's fault, though. Many times a website
will recognize that a web client lacks some specific piece of software (such as Flash) and redirect the user to a different page
so that the content will still be visible. This is a legitimate use of a redirect and should not be changed. There are many examples of
legitimate uses of redirects; it's the robot's job to find them, and it's the human's job to determine whether each redirect is good or not.
Another thing that I look for in the Tulip Chain reports is dead sites. Sites returning codes in the 400s (401, 403, 404) I move to the
unreviewed pile after verifying that they are indeed dead. This allows another editor to come behind me and check a second time to see
if they are in fact dead. Two strikes and you're out in this league. It also gives another editor a chance to perform a detailed search for
the dead site. Since I only do a very basic search on the site's title, it allows someone with more experience in that area to try to root
out a good site.
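For anyone curious what such a check boils down to, here is a small sketch in the same spirit (it is not Tulip Chain, and the URLs are placeholders): it notes where a redirect finally lands so a human can judge whether it is legitimate, and flags 4xx responses for a second look rather than removing anything outright.

    import urllib.request
    import urllib.error

    def check(url):
        """Classify a listed URL as ok, redirect, possibly dead, or unreachable."""
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                final = resp.geturl()
                if final != url:
                    return ("redirect", final)   # verify before "correcting"
                return ("ok", final)
        except urllib.error.HTTPError as err:
            if 400 <= err.code < 500:
                return ("dead?", err.code)       # 401/403/404: recheck by hand
            return ("error", err.code)
        except urllib.error.URLError as err:
            return ("unreachable", err.reason)

    for url in ["http://example.org/", "http://example.org/old-page"]:
        print(url, check(url))
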
Project 3: Spelling Errors
Often while doing a cat cleanup, I'll notice spelling errors in some of the listings, which leads me to do a spelling check on
the category to see if there are any other misspelt words.
Another way I find misspelt words is to use two tools. The first is itubert's random typo generator.
After the spelling errors are found, I use rpfuller's search4 to correct them. As
with Tulip Chain, search4 is very powerful, and it should only be used if you know what you're doing. If you already know which typos you
want to correct, there is no need for the first tool. Another good place to find spelling mistakes is in
Quality_Control/Misspellings,_Typos,_Errors.
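The idea behind a typo generator is simple enough to sketch (this is not itubert's tool, just an illustration of the principle): take a word, produce the usual slips, and then search the directory for each variant.

    def typo_variants(word):
        """Generate common typo patterns: transposed, dropped and doubled letters."""
        variants = set()
        for i in range(len(word) - 1):
            # Transposed neighbouring letters, e.g. the "recieve" style of slip.
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
        for i in range(len(word)):
            variants.add(word[:i] + word[i + 1:])        # dropped letter
            variants.add(word[:i] + word[i] + word[i:])  # doubled letter
        variants.discard(word)
        return sorted(variants)

    print(typo_variants("restaurant"))

Each variant that actually turns up hits in the directory is a candidate for correction.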
Everyone can help control the number of spelling mistakes we have. Make sure you use the spell-check button on every edit you make. There
is no excuse not to; it takes very little time, and it will help you improve your spelling, and possibly your typing too.
Conclusion:
I hope this has given some insight into the editing style of at least one editall. Of course, everyone is different, so every editall
will edit in a different fashion. I expect most to focus on categories they were interested in before becoming an editall. After all,
editors are most efficient when they edit where they are interested, and editalls are no different.
- ishtar