
My employer, the San Jose Public Library, has recently been putting a lot of time and effort into researching internet filters in response to a city council member’s proposal that the city implement mandatory filters on all of the library’s public access computers.  You can read about our process, documentation, and so forth on our public webpage about the issue (linked from our homepage).

We presented the issue to the Library Commission at its last meeting, and the Commission recommended that the library not install filters, but instead continue its practice of providing privacy screens.  The issue now goes to the City Council in March.

The venerable Mary Minow attended to speak out against filters from a legal standpoint.  She blogged about her experience at the meeting.  She also linked to the report on internet filtering that I wrote, examining the functionality of three internet filtering products: WebSense Enterprise, CyberPatrol, and FilterGate.   

The average results from all three filters were telling.  The accuracy rates for filtering keyword searches and direct URL access to sites were on par with previous studies – approximately 85% of "trigger sites" are blocked correctly, and approximately 85% of "non-trigger sites" are allowed through correctly, leaving a 15% error rate in either direction.  Accuracy for image searches (38%) and image email attachments (25%) was much lower.  RSS feeds barely passed the halfway mark, at 53% accurate.  Catalog searches fared better at 67%, and database searches better yet, at 83%.  Conclusions?  Traditional text content is handled at around 85% effectiveness, but non-traditional text content, and non-text content like images, are not handled at nearly the same levels.  Here’s a table to make the data easier to read.

Average Filter Accuracy (margin of error +/- 5%)

| Type of content tested | Accuracy Percentage |
|---|---|
| Content of an Adult Sexual Nature – direct URL access | 89% |
| Content of an Adult Sexual Nature – keyword searches | 83% |
| Content not of an Adult Sexual Nature – direct URL access | 88% |
| Content not of an Adult Sexual Nature – keyword searches | 62% |
| Image Searches | 38% |
| Email Attachments | 25% |
| RSS Feeds | 53% |
| Library Catalog Searches | 67% |
| Library Database Searches | 83% |
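
For anyone who wants to play with these numbers, here is a minimal Python sketch that turns each accuracy figure from the table above into its matching error rate (100 minus accuracy) and lists the content types that fall below the halfway mark. The dictionary keys are shortened labels and the variable and function names are my own illustration, not part of the study. The percentages are the averages from the table, so the usual +/- 5% margin of error still applies.

```python
# Average accuracy percentages copied from the table above.
# Labels are shortened for readability; the values are unchanged.
accuracy = {
    "Adult sexual content, direct URL access": 89,
    "Adult sexual content, keyword searches": 83,
    "Non-adult content, direct URL access": 88,
    "Non-adult content, keyword searches": 62,
    "Image Searches": 38,
    "Email Attachments": 25,
    "RSS Feeds": 53,
    "Library Catalog Searches": 67,
    "Library Database Searches": 83,
}


def error_rate(accuracy_pct):
    """Return the share of test cases the filters handled incorrectly."""
    return 100 - accuracy_pct


# Error rate for every content type in the table.
errors = {name: error_rate(pct) for name, pct in accuracy.items()}

# Content types where the filters were wrong more often than right.
below_half = [name for name, pct in accuracy.items() if pct < 50]

# Print the table from least accurate to most accurate.
for name, pct in sorted(accuracy.items(), key=lambda item: item[1]):
    print(f"{name}: {pct}% accurate, {errors[name]}% handled incorrectly")

print("Below the halfway mark:", ", ".join(below_half))
```

Sorting by accuracy puts the non-text content (email attachments and image searches) at the top of the printout, which is the same pattern the table shows.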

We do hope to write something formal up for the library community about this experience.  Stay tuned! 

“SJPL Internet Filtering Study – Testing Results”

  1. Anne-Lise Says:

    Oh, it’s nice to have confirmation of what I’ve sorta assumed. This is definitely going into the bookmarks for the day when the politicians here get the idea that filtering is a great and wonderful idea.

    Actually, while the fact that not all of the content that should be blocked is blocked is probably what’ll sell the idea that we shouldn’t waste the money to the politicians – the fact that content that shouldn’t be blocked is, is what freaks me out, as a librarian.

  2. Lori Ayre Says:

    Sarah,
    This report packs just the punch libraries need to help explain to the public why filters aren’t the solution so many people think they are. Thanks for building nicely on my efforts which sometimes feel a bit in vain as more and more libraries slap filters on their computers.

    I’m relieved that your study confirms my findings (which were based on much more anecdotal evidence) that filters are only 85% effective — at their best. Some are much worse, of course.

    I’m wondering how you settled on those three to test?

    Lori

  3. Sarah Houghton-Jan (Librarian in Black) Says:

    As stated in our report, this was not a scientific study. However, I stand by our results and am confident that the results are accurate within the stated margin of error. By the by, some of the other testimony listed in the ACLU vs. Gonzales case rated filters as being 88%, 90%, or lower. And those are numbers of how successful they were at blocking “trigger” content. The accuracy ratings for over-blocking Constitutionally protected content are much lower – meaning that content that should get through often does not, up to 1 out of 3 times. All of that is well within the error margins of what we found as well.

    And that’s just for normal text content – keyword searching in search engines and direct URL access. The accuracy for “newer” stuff – image searches, image email attachments, RSS feeds, database and catalog content, etc. was much lower in each case. Some of the tests we ran in these three filters did block breast cancer and health information. Few studies have been conducted to test filters’ ability to deal with this content – especially non-text content like still images and video. This is, supposedly, what many people are trying to block when they implement filters in libraries. If that is the case, why do the accuracy ratings generally only test text? Let’s test images and see what happens in a controlled, scientific study, measuring real user searches, real user needs, and seeing what gets over- and under-blocked.

    Please also notice that the testimony in the ACLU vs. Gonzales case highlights that the more accurate filters are at blocking, the more restrictive they are too (in other words, the more they over-block things that should be allowed through). The AOL filter mentioned in the case testimony as being highly effective at blocking “trigger” content also over-blocks content a disturbing amount of the time: “Mewett found that the AOL filter overblocked 22.3 to 23.6 percent of non-sexually explicit Web pages.” This has held up in study after study. So, yes, some filter may have a 95% accuracy rating in blocking “trigger” content. But that filter will also have a higher incidence of blocking Constitutionally protected content.

    I stand by our findings, and I encourage libraries who have the time to create similar test situations, to see what they find. I would love to create a grant request to do a larger study – to scientifically test a dozen or so filters on all types of content, especially multimedia content, which has been largely ignored in past studies, and to use those results to better guide libraries’ decisions on this, the most important information access issue of our time.

  4. Sarah Houghton-Jan (Librarian in Black) Says:

    Lori – choosing the filters to test was difficult. We were given a specific charge by the city council member’s report – to block “websites that contain child pornography or material that is obscene.” I spoke with nearly three dozen internet filtering companies about their products, and in doing so formed a clearer picture of the types of products out there. The statement made by the council member was that filters had improved significantly in the last few years, and therefore were accurate enough for implementation in our libraries.

    To address that issue, and to see what filters in general are capable of today, we chose three filters of differing size and scope. We wanted to find filters that represented the market – after talking with vendors, there seemed to be three real categories for what is available: Simple filters that are very cheap and often client-based, Moderate filters that are more expensive but offer many more options and granular levels of filtering, and Complex filters that offer a slew of options, probably have the biggest and most complex trigger word and URL databases, and are very, very expensive. Choosing one from each category, we felt, would give us a good overview of the market, let us know the capability out there right now for addressing images, and, realistically, what our options would be if we were charged to begin filtering.

    FilterGate is relatively simple – with an “AdultFilter” category that you block wholesale.

    CyberPatrol is a bit more granular, with several different options for blocking, and we chose “Adult/Sexually Explicit,” “Glamour & Intimate Apparel content,” as well as “Remote Proxies” (a known adult content enabler).

    For WebSense, which was the one network-based appliance we tested, and was by far the most option-rich of the products, we chose to filter only a couple of sub-categories within Adult Material (including “Adult Content,” “Lingerie & Swimsuits,” “Nudity,” and “Sex”), and then a couple of categories that are known adult content enablers: “Illegal or Questionable sites” and two “Information Technology” subcategories that are also adult content enablers: “Proxy Avoidance” and “URL Translation Sites.”

    Overall, the goal was to block what we felt we would block were we trying to meet the goals set forward in the proposal given to us. And testing it in three different filters of varying abilities and strengths gave us, I believe, an accurate view of what the market is like right now.

    They all over-block and under-block – that is to be expected. But the accuracy numbers don’t necessarily coincide with how complex or expensive the filter is. For example, the simple option FilterGate was 100% accurate on RSS feeds while the other two came in at 33% and 25%. FilterGate also outperformed CyberPatrol on databases, and outperformed both of the other filters on accurately blocking direct URL access to content of an adult sexual nature.

  5. Daisy Porter Says:

    What were the results like for podcasts? I remember testing a few during the whole fun process…

  6. Sarah Houghton-Jan (LiB) Says:

    One of the RSS feeds we tested had a podcast attached to it, and in almost all cases people were able to bypass the filters and listen to audio files with content of an adult sexual nature. So, for audio access, not so good. I would like to do a more formal study of multiple types of audio files, and multiple ways of accessing them, to see what gets filtered (if anything). But once again, non-text content is hard for text-based filters to handle.

  7. Sarah Houghton-Jan (Librarian in Black) Says:

    Exactly what aspect of this individual’s report, for his library and his library’s filtering software specifically, disagrees with what I have asserted? He seems to have similar concerns about image over-blocking, though the types of images and access to those images were tested differently than we tested ours. Additionally, he does note that “controversial” health care sites with topics like safe sex and homosexuality still tend to be blocked. Again, other than a slight fluctuation in the numbers, I don’t see any disagreement here. His results were as follows: 93.1% accuracy blocking trigger websites and 48% accuracy blocking trigger images. By “trigger” I mean something the filter is supposed to catch. Perhaps I am not reading something the same way you are.

  8. Sarah Houghton-Jan (LiB) Says:

    Thank you for pointing out the new study. It is, in fact, primarily a study of how effective the filter is at blocking what it is “supposed to block.” There is a mention that 1% returned a false positive (meaning over-blocking), but if you look at the actual results from the test searches they ran, there are far fewer searches that are “non-porn” in nature than “porn” searches (about 1/5 the number). In addition, the criteria that their “human” used to classify content as porn or not-porn were very unclear and not at all documented. The sites selected are all “sexual” sites, and not other types of content, which would be required for a representative test. As a result, the test’s evaluation of the efficacy of the filter in letting constitutionally protected material through is poor and in my mind inaccurate because of their methodology. In other words, they didn’t test in a comprehensive way how often it falsely blocks material that should be allowed.

    Past studies have shown time and time again that the higher the accuracy of the filter in blocking what it’s supposed to block, the lower the accuracy of the filter in allowing what it is supposed to let through. It is shameful, to me, that filtering companies miss this part of the testing process. It is easy to say “we block ninety-some percent!” without giving the caveat that the filter also over-blocks quite a bit of content.

    Until I see some substantial results and methodology documentation from Untangle (the authors of this recent study) about letting constitutionally protected material through, I will continue to view this as an incomplete test.

  9. Snuppy.dk » 404 Day: A Day of Action Against Censorship in Libraries Says:

    […] blocked by Internet filters abound. Websites have been blocked that host medical information about breast cancer, chicken breasts, and even a New York Times article about Internet gambling all due to keyword […]

  10. Open Source Intelligence [OSINT] Agency of Massachusetts | EFF: 404 Day: A Day of Action Against Censorship in Libraries Says:

    […] blocked by Internet filters abound. Websites have been blocked that host medical information about breast cancer, chicken breasts, and even a New York Times article about Internet gambling all due to keyword […]

  11. 404 Day: A Day of Action Against Censorship in Libraries | Michigan Standard Says:

    […] blocked by Internet filters abound. Websites have been blocked that host medical information about breast cancer, chicken breasts, and even a New York Times article about Internet gambling all due to keyword […]

  12. 404 Day: A Day of Action Against Censorship in Libraries | americanpeacenik technology journal Says:

    […] blocked by Internet filters abound. Websites have been blocked that host medical information about breast cancer, chicken breasts, and even a New York Times article about Internet gambling all due to keyword […]

  13. 404 Day: A Day of Action Against Censorship in Libraries | Electronic Frontier Foundation Says:

    […] blocked by Internet filters abound. Websites have been blocked that host medical information about breast cancer, chicken breasts, and even a New York Times article about Internet gambling all due to keyword […]

  14. Help fight Internet censorship on April 4: 404 Day | Susan Ricker Says:

    […] However, the broad reach of the 2000 law has caused many libraries to opt for over-censoring web content available rather than risk breaking the law. As a result, students and library patrons are being denied access to constitutionally protected websites and resources (think art museums or sites with health and sexual well-being information). In addition, the First Amendment takes a beating as Internet filters can’t make the legal distinction between what’s “harmful” and what’s okay. […]

  15. 404 Day: A Day of Fighting Internet Censorship in Libraries | Теплица Социальных Технологий (ТеСТ) | Crowdsourcing, technologies for the third sector, and Says:

    […] such as a site containing medical information about breast cancer, or a New York Times article about gambling in […]
