PHP Screenscraping Using Curl

// October 22nd, 2005 // Technology Bits

Tim was looking to try something new, so I decided to introduce him to PHP's Client URL (cURL) functions. As the example at hand, we looked at hitting the USPS site to look up cities and states based on ZIP code.

For the uninitiated, cURL basically lets you programmatically simulate a user browsing a web site. You can send POST, GET, and PUT requests, and maintain cookies and session information. In the following example we are using a technique called "screen scraping", which is rarely recommended but a good skill to have, because sometimes it's the only solution.
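To give a rough idea of the cookie and session side, here is a minimal sketch of a handle that remembers cookies between requests. The helper name `make_session_handle`, the cookie file path, and the example.com URLs in the usage comments are all made up for illustration; only the `CURLOPT_*` constants are real cURL options.

```php
<?php
// Hypothetical helper: returns a cURL handle with the cookie engine switched on,
// so a session cookie set by one response is sent back on the next request.
function make_session_handle(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_COOKIEFILE     => $cookieFile, // read previously saved cookies from here
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies back here when the handle is released
        CURLOPT_RETURNTRANSFER => true,        // curl_exec() returns the body instead of printing it
        CURLOPT_TIMEOUT        => 10,          // give up after 10 seconds
    ]);
    return $ch;
}

// Usage sketch (placeholder URLs and form fields, not a real site):
// $ch = make_session_handle(sys_get_temp_dir() . '/cookies.txt');
// curl_setopt($ch, CURLOPT_URL, 'http://example.com/login');
// curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(['user' => 'tim', 'pass' => 'secret']));
// $loginPage = curl_exec($ch);               // server sets a session cookie
// curl_setopt($ch, CURLOPT_URL, 'http://example.com/account');
// curl_setopt($ch, CURLOPT_HTTPGET, true);   // switch back from POST to GET
// $accountPage = curl_exec($ch);             // cookie is sent along automatically
```

Reusing the same handle for every request is what keeps the session alive; the cookie file only matters if you want the session to survive beyond one script run.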

The reason it's bad is that it is extremely fragile. If a webmaster of the site you are accessing makes even a slight change, it could break your page parsing. The other reason to shy away from this is that some web sites really don't like it when you do this. As a rule, if the webmaster of the site you are scraping contacts you and wants you to stop, you should, immediately. You should also recommend, though, that they provide the info you are scraping as a service through something like REST or SOAP. It would be very Web 2.0 of them to comply, so it's worth a shot.
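One way to soften the fragility is to have the parser notice when the page no longer looks the way it expects and bail out, instead of printing garbage. The sketch below is an assumption-laden illustration: the helper name `extract_city_state` is hypothetical, the two marker strings are the ones used in the example code in this post, and it swaps in a regular expression for a "CITY, ST" shaped string rather than counting lines.

```php
<?php
// Hypothetical helper: returns [city, state] or null when the page layout
// does not match what we expect (markers missing, no "CITY, ST" text found).
function extract_city_state(string $html): ?array
{
    $marker = 'Actual City name';
    $start = strpos($html, $marker);
    if ($start === false) {
        return null; // first marker is gone: the layout probably changed
    }
    $start += strlen($marker);

    $end = strpos($html, 'images/spacer.gif', $start);
    if ($end === false) {
        return null; // second marker is gone
    }

    // Plain text between the two markers
    $text = strip_tags(substr($html, $start, $end - $start));

    // Look for an upper-case city name, a comma, and a two-letter state code
    if (!preg_match('/([A-Z][A-Z .\'-]*), ([A-Z]{2})\b/', $text, $m)) {
        return null;
    }
    return [ucwords(strtolower(trim($m[1]))), $m[2]];
}
```

A `null` result is your cue to log the problem and go look at the page by hand, which beats silently serving the wrong city.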

Anyway, check out this example code; it's kinda fun.

PHP:
  <?php

  // Build a POST request to the USPS ZIP code lookup page
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, "http://zip4.usps.com/zip4/zcl_3_results.jsp");
  curl_setopt($ch, CURLOPT_POST, 1);
  curl_setopt($ch, CURLOPT_POSTFIELDS, "zip5=" . urlencode($_GET['zip']));

  // curl_exec() prints the response, so capture it with output buffering
  ob_start();
  curl_exec($ch);
  $string = ob_get_contents();
  ob_end_clean();

  // Cut the page down to the chunk between the two marker strings
  list(, $second) = explode('Actual City name', $string);
  list($first) = explode('images/spacer.gif', $second);
  $junk = explode("\n", $first);

  // The seventh line of that chunk holds "CITY, ST"
  list($city, $state) = explode(', ', trim(strip_tags($junk[6])));

  $city = ucwords(strtolower($city));

  print $city . ',' . $state;
  ?>
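The example relies on output buffering to catch what `curl_exec()` prints. If you would rather skip the buffer, cURL can hand the page back as a string via `CURLOPT_RETURNTRANSFER`. Here is a minimal sketch with some basic error checking; the helper name `fetch_page` and the timeout values are my own assumptions, not anything the USPS site requires.

```php
<?php
// Hypothetical helper: POSTs form fields to a URL and returns the response body,
// or null if the request failed or the server did not answer with HTTP 200.
function fetch_page(string $url, array $postFields): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($postFields), // URL-encodes each field
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false) {
        // Transport-level failure: DNS, refused connection, timeout, and so on
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }
    return $status === 200 ? $body : null;
}

// Usage sketch with the URL and field name from the example above:
// $string = fetch_page('http://zip4.usps.com/zip4/zcl_3_results.jsp', ['zip5' => $_GET['zip']]);
```

From there `$string` can go through the same `explode()` steps as before, with a `null` check first.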

web20, web2.0, code, programming, screen scraping, php, soap, rest, curl, usps, address information, zip code, zip, zip codes

19 Responses to “PHP Screenscraping Using Curl”

  1. brian says:

    good job with the most incomplete piece of code ive ever seen.

  2. zbtirrell says:

    This is purely an example of how one might use cURL and output buffering. There is certainly no error checking or particularly useful output… but this is not a chunk of code anyone would drop in and use somewhere. It’s an example… From this example you should be able to extrapolate your own useful application, API, or what have you.

    If there is some other incompleteness beyond it being short with no error checking, please enlighten people with your insight. I’d love constructive discourse on this. I’d love someone to show me other ways to accomplish this, or better uses of output buffering and/or cURL.

  3. georges says:

    incomplete or not, it was handy to see a decent exmaple. thanks.

  4. zbtirrell says:

    Thanks georges, that was my intent…

  5. hi,

    How to maintain the session in curl in php.

  6. Kumi Rauf says:

    Nice example… pretty black and white. Time to scrape some pages!

  7. QiLxvY0EWI says:

    Hi! Very nice site! Thanks you very much! zpqSG3PlScj

  8. madsh says:

    And if you want to do the same in 5 min without writing a single line of code… have a look at http://www.openkapow.com

  9. Stephie says:

    Nice post on screen scraping, simple and too the point :), For screen scraping i use python for simple things, but for larger projects i used http://www.extractingdata.com/screen%20scraping.htm which worked great, they build quick custom screen scraper and web scraper programs

  10. Nice post on screen scraping and good job with the most incomplete code.

    http://www.elantechnologies.com

  11. Fannie says:

    Thanks fot your sharing..Would be pleasure to read this,Valen
