[CivicAccess-discuss] What page scraping means

Russell McOrmond russell at flora.ca
Tue Sep 23 21:58:13 EDT 2008


Robin Millette wrote:
> On Tue, 23 Sep 2008 19:56:55 -0400, Russell McOrmond
> <russell at flora.ca> wrote:
> 
>> Tracey P. Lauriault wrote:
>>> Someone was asking me what page scraping means.  Could you explain
>>> it in sort-of layperson's terms?
>> A computer automated cut-and-paste where what page you go to is 
>> automated, and what piece of information you try to learn from the
>>  resulting page is automated.
> 
> When it's _really_ automated, it's called a feed or a microformat.
> It's called scraping because it usually also involves manual labor to
> get the job done right, as HTML pages are often modified with no
> regard to their semantic value.


   Aren't definitions fun?  The difference in my mind between a 
feed/microformat and 'scraping' is whether the relevant output format 
was designed to be human readable (i.e., HTML) or machine readable (XML, 
CSV, etc.).  Whether there is manual labour is unrelated in my mind.

   It is often called "screen scraping" from back in the days when a 
screen of information was drawn, and then we tried to pull information 
from that screen based on the location of information.   Re-interpreting 
HTML like we are doing here is a bit different, but we are still talking 
about taking a page intended to be read by a human (rendered by a 
browser) and instead interpreting it as data, as input to a 
program/database/etc.
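
   To make the distinction concrete, here is a minimal Python sketch. 
The page contents, the department names and amounts, and the helper 
names (TableScraper, scrape, read_feed) are all made up for 
illustration.  The same data is read twice: once by scraping an HTML 
page meant for people, and once from a CSV "feed" meant for programs.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page written for human readers (invented for this sketch).
HTML_PAGE = """
<html><body>
<h1>City Budget</h1>
<table>
<tr><th>Department</th><th>Amount</th></tr>
<tr><td>Parks</td><td>1200</td></tr>
<tr><td>Transit</td><td>3400</td></tr>
</table>
</body></html>
"""

# The same hypothetical data published in a machine-readable format.
CSV_FEED = "Department,Amount\nParks,1200\nTransit,3400\n"


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # The header row has only <th> cells, so it stays empty
            # and is skipped here.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def scrape(html):
    """Scraping: depends on the page's layout (tags and cell order)."""
    parser = TableScraper()
    parser.feed(html)
    return {name: int(amount) for name, amount in parser.rows}


def read_feed(text):
    """Feed: the format itself names the fields, no layout guessing."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Department"]: int(row["Amount"]) for row in reader}


if __name__ == "__main__":
    print(scrape(HTML_PAGE))
    print(read_feed(CSV_FEED))
```

   Both functions return the same dictionary, but scrape() would 
likely break if the page author reordered the columns or switched 
from a table to a list, while read_feed() only depends on the 
column names.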

-- 
  Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
  Please help us tell the Canadian Parliament to protect our property
  rights as owners of Information Technology. Sign the petition!
  http://www.digital-copyright.ca/petition/ict/

  "The government, lobbied by legacy copyright holders and hardware
   manufacturers, can pry my camcorder, computer, home theatre, or
   portable media player from my cold dead hands!"


