Murmurhash3 and the Bloom Filter

There is often a need to speed up de-duplication by hashing objects into a Bloom filter. Bloom filters are arrays of bits to which objects are mapped using a set of hashing functions; a lookup tells you that an object is either definitely new or probably seen before.

Using some knowledge from building a hashing trick algorithm, I came across the idea of deploying MurmurHash3 in Scala to improve performance and generate a fairly unique key. My immediate use case is avoiding the collection of millions of duplicate pages from a large online scrape of a single source.

More information is available from this great blog article as I am a bit busy to get into details.
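Below is a minimal sketch of that combination, assuming only Scala's built-in scala.util.hashing.MurmurHash3; the bit-array size, hash count, and URL are made up for illustration.

   import scala.collection.mutable.BitSet
   import scala.util.hashing.MurmurHash3

   //a tiny Bloom filter for page/URL de-duplication; sizes are illustrative, not tuned
   class BloomFilter(numBits: Int, numHashes: Int) {
     private val bits = new BitSet(numBits)

     //derive several bit indices from MurmurHash3 by varying the seed
     private def indices(item: String): Seq[Int] =
       (0 until numHashes).map { seed =>
         val h = MurmurHash3.stringHash(item, seed)
         ((h % numBits) + numBits) % numBits
       }

     def add(item: String): Unit = indices(item).foreach(bits.add)

     //false means definitely unseen; true means probably seen (false positives are possible)
     def mightContain(item: String): Boolean = indices(item).forall(bits.contains)
   }

   val seen = new BloomFilter(1 << 24, 3)
   val url = "http://example.com/page/1"
   if (!seen.mightContain(url)) seen.add(url) //only process pages the filter has not seen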

Headless Testing and Scraping with Java FX

There is a lot of JavaScript in the world today and there is a need to get things moving quickly. Whether testing multiple websites or acquiring data for ETL and/or analysis, a tool is needed that does not leak memory the way Selenium does. Until recently, Selenium was really the only mainstream option for WebKit; JCEF and writing native bindings for Chromium have been alternatives for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools. These tools can be used to automate scraping and testing in cases where plain network calls for HTML, JSON, CSVs, PDFs, and the like are more tedious and difficult.

The FX Package

FX is much better than the television channel, with some exceptions. Java ships a sleek WebKit-based browser component in JavaFX. While WebKit suffers from some serious setbacks, Java FX also incorporates nearly every part of the java.net framework. Setting SSL handlers, proxies, and the like works the same as with java.net. Therefore, FX can be used to intercept traffic (e.g. directly stream incoming images to a file named by URL without making extra network calls), present a nifty front end controlled by JavaScript, and query the page for components.
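As a rough sketch of that java.net integration (the proxy host, port, and target URL are placeholders, and this assumes JavaFX is on the classpath), the standard proxy properties are set before launch and the rendered DOM is pulled back out with executeScript:

   import javafx.application.Application
   import javafx.beans.value.{ChangeListener, ObservableValue}
   import javafx.concurrent.Worker
   import javafx.scene.Scene
   import javafx.scene.web.WebView
   import javafx.stage.Stage

   class ScrapeApp extends Application {
     override def start(stage: Stage): Unit = {
       val view = new WebView()
       val engine = view.getEngine
       //once the page finishes loading, pull the rendered DOM back out as a string
       engine.getLoadWorker.stateProperty.addListener(new ChangeListener[Worker.State] {
         override def changed(obs: ObservableValue[_ <: Worker.State],
                              oldState: Worker.State, newState: Worker.State): Unit =
           if (newState == Worker.State.SUCCEEDED)
             println(engine.executeScript("document.documentElement.innerHTML"))
       })
       engine.load("http://example.com")
       stage.setScene(new Scene(view))
       stage.show()
     }
   }

   object ScrapeApp {
     def main(args: Array[String]): Unit = {
       //the same JVM-wide java.net proxy settings mentioned above
       System.setProperty("http.proxyHost", "my.proxy.host")
       System.setProperty("http.proxyPort", "8080")
       Application.launch(classOf[ScrapeApp], args: _*)
     }
   }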

Ui4J

Ui4j is just as nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on the command line or through system properties after jdk1.8.0_20. Oracle removed Monocle from the JDK after this release, forcing programs that use it over to the open-source Monocle build. However, xvfb-run -a works equally well; the -a option automatically chooses a display number. The GitHub site does claim compatibility with Monocle, though.

On top of headless mode, the authors have made working with FX simple. Run JavaScript as needed, incorporate interceptors with ease, and avoid nasty waitFor calls and Selenese (an entire language within your existing language).
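A minimal sketch of a Ui4j session, with class and method names recalled from the Ui4j README (treat them as assumptions) and a placeholder URL:

   import com.ui4j.api.browser.BrowserFactory

   object Ui4jScrape extends App {
     System.setProperty("ui4j.headless", "true") //or launch the whole JVM under xvfb-run -a

     val browser = BrowserFactory.getWebKit
     val page = browser.navigate("http://example.com")
     //the same executeScript call used for assertions in the TestFX section below
     val html = page.executeScript("document.documentElement.innerHTML").asInstanceOf[String]
     println(html.take(200))
   }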

TestFX

There is an alternative to Ui4j in TestFX. It is geared towards testing. Rather than wrapping an Assert around a call like ((String) page.executeScript("document.documentElement.innerHTML")), methods such as verifyThat exist. Combine with Scala and have a wonderfully compact day. The authors have also managed to work around the Monocle problem.

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies; Java, and Scala for that matter, set one proxy per JVM. Luckily, both Java and Scala have subprocess modules. The lovely, data-friendly language that is Scala makes this task as simple as Process("java -jar myjar.jar -p my:proxy").!. Simply run the command, which blocks until complete and returns the exit status (see Futures for a non-blocking version), and use a tool like Scopt to read the proxy and set it in a new browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don't pass it on the command line). RMI would probably be a bit better for large code, but it may be possible to better secure a file than compiled code using checksums.
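As a sketch of that pattern, assuming a hypothetical myjar.jar that accepts a -p proxy flag, each proxy gets its own JVM and the blocking .! call is pushed onto a Future:

   import scala.concurrent.{ExecutionContext, Future}
   import scala.sys.process._
   import ExecutionContext.Implicits.global

   //one JVM per proxy, since java.net proxy settings are JVM-wide
   val proxies = Seq("10.0.0.1:8080", "10.0.0.2:8080")

   val exitCodes: Seq[Future[Int]] = proxies.map { proxy =>
     Future {
       //.! blocks until the subprocess exits and returns its exit code
       Process(Seq("java", "-jar", "myjar.jar", "-p", proxy)).!
     }
   }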

Conclusion

Throw out Selenium, get rid of the extra Selenese parsing, and get Ui4j or TestFX for WebKit testing. Sadly, it does not work with Gecko, so Chromium is needed to replace those tests and obtain such terrific options as --ignore-certificate-errors. There are cases where fonts served over SSL will wreak havoc before you can even handle the incoming text, no matter how low level you write your connections. For simple page pulls, stick to Apache HTTP Components, which contains a fairly fast asynchronous thread pool with mid-tier RAM usage, usable from Java or Scala. Sorry for the brevity folks, but I tried to answer a question or two that was not in tutorials or documentation. Busy!

Lack of Nested and Implicit Data Support in Drill, PostgreSQL and Pentaho when Working with Json Data

JSON is great. It can contain a variety of data types in an expected format, and it is becoming easier and easier to work with in existing tools, making it a future workhorse for NoSQL-based ETL. However, not least because NoSQL ingestion needs to end up in relational tables that follow SQL standards, there is still one bug to work out: ingestion of JSON will not break out nested tables and requires direct knowledge of the data to complete the task.

This may seem petty, but when millions of records are being read, it clearly is not.

In Drill, this could potentially be overcome by creating a table for every single submap we wish to analyze, but CREATE TABLE from the tool itself will not work. Instead, it is necessary to limit use cases to the data we want to use.

In PostgreSQL, it is possible to concatenate JSON data using a query whose individual results can then be saved. It is also possible to 'pop' keys that are unneeded. However, this approach requires many different tables, roughly one per somewhat-normalized form, and it requires recombining the data afterwards.


SELECT row_to_json(r.*)
FROM (
    SELECT nonJson AS nonJson,
           ((((data->'level1')::jsonb - 'bkey1')::jsonb - 'bkey2')::jsonb - 'bkey3')::jsonb AS jdata
    FROM table
    WHERE data::jsonb ?| array['mustHaveKey']
      AND (data::jsonb ?| array['notHaveKey']) IS FALSE
) r

Drill is still much further ahead of the game than Pentaho and PostgreSQL in terms of conversion, though. PostgreSQL can guess types but has no function to attempt to automatically generate tables. Pentaho requires explicit conversion as well.

Of course, if one already knows every key that will be present, this is not a problem. That, however, means more maintenance, as it then becomes impossible to write programs that automatically handle changes to the data. Perhaps implicit conversion will happen soon, but any argument over data type issues should really look at the depth of the standards and conform to them instead of complaining.

Mixing Java XML and Scala XML for Maximum Speed

Let's face it: ETL and data tasks are expected to be finished quickly. However, the amount of crud pushed through XML can be overwhelming. XML manipulation in Java can take a while to write, despite being relatively easy, due to the need to loop through every attribute one wishes to create or read. On the other hand, Scala, despite being a terrifically complete language built on the JVM, suffers from some of the issues of a functional language. Only Scales XML offers a way to actually perform in-place changes on XML. It is not, however, necessary to download an entirely new dependency on top of the Java and Scala libraries already available when using Scala.

Scala XML

There are definite advantages to Scala XML. Imported under scala.xml._ and other relevant subpackages, creation of XML is much faster and easier than with Java. Nodes and attributes can be created from plain strings, making picking up Scala XML a breeze.

   
   val txtVal="myVal"
   val otherTxtVal="Hello World"
   val node = <tag attr={txtVal} attr2="other"> {otherTxtVal} </tag>

In order to generate child nodes, simply create an iterable with the nodes.

   val attrs:List[String] = List("1","2","3") //because we rarely know what the actual values are, these will be mapped
   val node = <property name="list"><list>{attrs.map(v => <value>{v}</value>)}</list></property> // a Spring list

Another extremely useful addition to Scala XML is the pretty printer, scala.xml.PrettyPrinter(width: Int, step: Int), which makes generated XML far easier to read.

     val printer = new PrettyPrinter(250,5) //will create lines of width 250 and a tab space of 5 chars at each depth level
     val output = printer.format(xml)

There is one serious issue with Scala XML: it uses immutable structures, which means nodes are difficult to append to. RewriteRules and transformations are available and are particularly useful for filtering and deleting, but they are not necessarily good at appending.

   import scala.xml._
   import scala.xml.transform._

   class Filter extends RewriteRule{
         override def transform(n:Node):Seq[Node] = n match{
            //drop any <bean id="myId"> element (please bear with me as I am still learning Scala)
            case e:Elem if e.label == "bean" && (e \ "@id").text == "myId" => NodeSeq.Empty
            case e:Elem => e
            case other => other
         }
   }

   val newXML = new RuleTransformer(new Filter()).transform(xml) //returns filtered xml

Such code can become cumbersome when writing a ton of append rules, each with its own special case. Looking at the example from the GitHub page makes this abundantly clear. Another drawback is the speed impact a large number of cases can generate, since the XML is re-written with every pass. Scales XML (https://github.com/chris-twiner/scalesXml) overcomes this burden but has a much less Scala-esque API and requires an additional download. Memory is impacted as well, since copies of the original information are stored.

Java Document

Unlike Scala XML, Java has had a built-in, in-place XML editor since Java 1.4, and importing it is well within the standard uses of Scala. Perhaps the Scala developers intended for their library to be used mainly for parsing. Hence there is a mkString in the native library, but a developer must pull in sbt's sbt.IO for a Scala file-system API that itself wraps Java's File class. There is also the lack of networking tools, leading to third-party libraries, and a JSON library that never quite excelled at re-writing JSON in older Scala versions (and is non-existent in 2.11.0+). While it would be nice to see this change, there is no guarantee that it will. Instead, Java fills in the gaps.

Here the javax.xml libraries come in particularly handy. The parsers and transformers can read and generate strings, making interoperability a breeze. XPath expressions may not be as simple as in Scala, where xml \\ "xpath" or xml \ "xpath" will return the nodes, but they are painless.

   import javax.xml.parsers._ //quick and dirty
   import javax.xml.xpath._
   import org.w3c.dom._
   import java.io.ByteArrayInputStream

   val xmlstr = <tag>My Text Value</tag>.toString
   val doc:Document = DocumentBuilderFactory.newInstance.newDocumentBuilder.parse(new ByteArrayInputStream(xmlstr.getBytes))
   val xpath = XPathFactory.newInstance.newXPath
   val nodes:NodeList = xpath.compile("//tag").evaluate(doc,XPathConstants.NODESET).asInstanceOf[NodeList]

With a NodeList, one can call getAttributes(), setAttribute(key, value), appendChild(node), insertBefore(nodeA, nodeB), and other useful methods such as getNextSibling() or getPreviousSibling() from the API. Just be careful to work through the owning document when making calls on the nodes where necessary.
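For instance, a small sketch of appending a child in place, continuing from the doc and nodes above (the element and attribute names are invented):

   import org.w3c.dom.Element

   val tag = nodes.item(0).asInstanceOf[Element]
   val child = doc.createElement("child") //new nodes must be created by the owning document
   child.setTextContent("appended value")
   child.setAttribute("source", "java-dom")
   tag.appendChild(child)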

Inter-operability

Inter-operability between Scala and Java is quite simple. Use a builder with an input stream obtained from the string to read an XML document into Java, and a transformer to write that XML back to something Scala enjoys.

   //an example getting a string with Pretty Printer from an existing document (imports as above)
   val xmlstr = <tag>My Text Value</tag>.toString
   val doc: Document = DocumentBuilderFactory.newInstance.newDocumentBuilder.parse(new ByteArrayInputStream(xmlstr.getBytes))
   val sw = new java.io.StringWriter()
   //transform writes into the StringWriter; it does not return the string
   javax.xml.transform.TransformerFactory.newInstance().newTransformer().transform(new javax.xml.transform.dom.DOMSource(doc),new javax.xml.transform.stream.StreamResult(sw))
   val printer = new scala.xml.PrettyPrinter(250,5)
   val xml = printer.format(scala.xml.XML.loadString(sw.toString))

A New Project: Is a Distance Based Regular Expression Method Feasible?

So, I would like to find specific data in scraped web pages, PDFs, and just about anything under the sun without taking a lot of time. After looking over the various fuzzy-matching algorithms such as Jaro-Winkler, Metaphone, and Levenshtein and finding that no single one had a wide enough application, I decided that developing a distance algorithm based on regular expressions might be more feasible.

The idea is simple: start with a regular expression, build a probability distribution across one or more good, known data sets, and test for the appropriate expression across every web page. The best score across multiple columns would be the clear winner.

Building out the expression would include taking known good data and finding a combination between the base pattern and the data that works, or building an entirely new one. Patterns that appear across a large proportion of the set should be combined. If [A-Z]+[\s]+[0-9]+[A-Z] and [A-Z]+[\s]+[0-9]+ appear often in the same or equivalent place, or even [a-z]+[\s]+[0-9]+, then the combined pattern should likely be [A-Z\s0-9a-z]+, provided the set is similarly structured. Since the goal is to save time in programming regular expressions to further parse XPath or other regular expression results, this is useful.
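As a rough sketch of the scoring side of this idea (the candidate patterns and sample values below are invented), candidates can be ranked by the proportion of a known-good sample they match:

   //rank candidate patterns by how much of the known-good sample they cover
   val candidates = Seq("[A-Z]+\\s+[0-9]+[A-Z]", "[A-Z]+\\s+[0-9]+", "[A-Za-z\\s0-9]+")
   val knownGood = Seq("ABC 123", "XY 9Z", "Denver 80210")

   val scores = candidates.map { p =>
     p -> knownGood.count(_.matches(p)).toDouble / knownGood.size
   }

   //the highest-coverage pattern wins; ties could be broken with re-occurring words
   scores.sortBy(-_._2).foreach { case (p, s) => println(f"$p%-22s $s%.2f") }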

The tricky part of the project will be designing a similarity score that adequately equates the expressions without too many outliers. Whether this is done with a simple difference test resulting in a statistical distribution or a straightforward score still needs to be tested.

In all likelihood, re-occurring words should be used to break ties or bolster weak scores.

The new project will hopefully be available on SourceForge for data integration and pre-curation processes.

Avalanche Data Part I: Collecting and Analyzing Public Information for Patterns

It has been a goal of mine for a while to collect and analyze publicly available avalanche data to discover patterns and raise awareness. My home state of Colorado is a disastrous combination of climate, tourists, newcomers, and testosterone junkies of varying IQ who give little to no thought before jumping onto our Continental slopes. The result can be 20 fatalities each winter season. While the availability of public information is appalling, I did manage to wrangle together a large database with hundreds of incidents, millions of weather records, and a variety of locations across many different states.

As of today, this data is available via POST or by visiting my website; be kind, as I will ban people bogging down my small amount of resources and may even retaliate. Use Wireshark or Firebug to decipher the request format. The port will hopefully go away once I set up Apache; port forwarding is not allowed by my hosting service, and I needed a bizarre install of Tomcat with authbind that is missing the configuration file.

My goals for this project were simple: establish an automated network for data collection, ETL, and the combination of the data into a relational database. That database is then analyzed using a set of open source tools and custom code, with Apache Commons Math handling clustering and some of the statistics.

Attributes I Needed and What I Found

I wished for prediction, so I needed everything from crystal type to weather patterns. Avalanche, crown, base layer type, depth, path size, destructive force, terrain traps, and a variety of other factors are important. Regression testing on what I did receive showed previous storms, terrain traps, and the base layer to be the most important factors for size and destructive force.
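A tiny sketch of how such a check can be run with Commons Math's OLS regression; the sample below is invented, with columns standing in for previous-storm count, a terrain-trap flag, and a base-layer code, and y for destructive size:

   import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression

   //invented sample: y = destructive size, columns = previous storms, terrain trap flag, base layer code
   val y = Array(2.0, 3.5, 1.0, 4.0, 2.5)
   val x = Array(
     Array(1.0, 0.0, 2.0),
     Array(3.0, 1.0, 3.0),
     Array(0.0, 0.0, 1.0),
     Array(4.0, 1.0, 3.0),
     Array(2.0, 0.0, 2.0))

   val ols = new OLSMultipleLinearRegression()
   ols.newSampleData(y, x)
   //coefficient magnitudes give a rough view of which factors matter most
   println(ols.estimateRegressionParameters().mkString(", "))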

However, this data was dirty, not cleanable without great expense, and somewhat unreliable. Only two sites reliably reported crown depth, width, and even base layer. Agencies are likely not forthcoming with this information since it relates directly to sales revenue.

Only the weather data, which can be acquired from many government sites, was forthcoming.

Web Framework

I decided to use Tomcat as the servlet container to deploy my WAR; this is only my second attempt at Spring. Tomcat is incredibly fast, as evidenced by my site. Spring is an incredibly decent framework for handling requests, requiring much less code when set up than most other frameworks. In particular, the request handling is of significant benefit. POST requests are handled with GSon.

Below is a basic request mapping:

        @RequestMapping(value = "/", method = RequestMethod.GET)
        public String home(Locale locale, Model model) {
                //grab the caller's IP from the current request
                ServletRequestAttributes requestAttributes = (ServletRequestAttributes) RequestContextHolder.currentRequestAttributes();
                String ip = requestAttributes.getRequest().getRemoteAddr();

                //localhost resolves to a loopback address, so swap in a usable IP here
                if (ip.contains("0:0:0:0:0:0:0:1") || ip.contains("127.0.0.1")) {
                        //ip detection
                }

                ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext("avy.xml");

                GeoDBPath gpath = (GeoDBPath) ctx.getBean("GPS");
                GeoServicesDB geodb = GeoServicesDB.getInstance(gpath.getFpath());
                ArrayList<Double> coords = geodb.getGPS(ip);

                double latitude = coords.get(0);
                double longitude = coords.get(1);

                AvyDAO avyobjs = (AvyDAO) ctx.getBean("AvyDAO");

                List<Avalanche> avs = avyobjs.findAll(latitude, longitude);
                String outhtml = "";
                //table head
                int i = 0;
                for (Avalanche av : avs) {
                        if ((i % 2) == 0) {
                                //even table row
                        } else {
                                //odd table row
                        }
                        i++;
                }
                //end table
                model.addAttribute("avyTable", outhtml.replaceAll("null", ""));
                return "home";
        }

The Tools That Worked

Standard deviations, elementary statistics, and other basics are easily handled with custom code. Fast clustering algorithms and the more complex math that benefits from optimization are handled well by Apache Commons Math.

Clustering is of particular interest. Commons Math does not have affinity propagation but does have a quick k-means clusterer, a downer when you want to discover patterns without known relations. However, the relations can be estimated using sqrt(n/2) centroids, which is roughly the number affinity propagation often chooses. With this method, it is possible to obtain decent relations in the time taken to process a simple POST request.
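A minimal sketch of that estimate with Commons Math's k-means++ clusterer, using made-up feature rows in place of the real incident and weather attributes:

   import org.apache.commons.math3.ml.clustering.{DoublePoint, KMeansPlusPlusClusterer}
   import scala.collection.JavaConverters._

   //hypothetical feature rows (e.g. depth, path size, averaged weather values)
   val rows = Seq(Array(1.0, 2.0), Array(1.1, 2.2), Array(7.9, 9.2), Array(8.0, 9.0))
   val points = rows.map(new DoublePoint(_)).asJava

   //sqrt(n/2) centroids as a stand-in for affinity propagation's choice of k
   val k = math.max(1, math.round(math.sqrt(rows.size / 2.0)).toInt)
   val clusters = new KMeansPlusPlusClusterer[DoublePoint](k).cluster(points).asScala

   clusters.zipWithIndex.foreach { case (c, i) =>
     println(s"cluster $i: ${c.getPoints.size} points")
   }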

The Collection Process

Data collection resulted in an automated set of interrelated scrapers, ETL processes, and triggers. Scrapers were set up for nearly every reporting site available, which in practice meant the American Northwest, Alaska, California, and British Columbia were the only regions available for collection. The Colorado Avalanche Information Center and Utah's avalanche center were the best in terms of data, with Utah providing a wide range of attributes. This data was fed to weather-collecting scrapers and finally to an ETL process. I wrapped the entire process in a Spring program and configuration file.

The Final Product and v2.0

My final product is a site that delivers reports on incidents, weather, and other key factors, as well as the opportunity to cluster what little complete data there is in your region. A heat map and a Google map show the incident locations. I will hopefully include relevant date filters and eventually graphs and predictions as the data grows stronger and more numerous. Weather is currently averaged over the two weeks before an avalanche event; this could grow to accommodate historical patterns. Weather data may be the only solidly available data at the present time and so will likely improve sooner than the predictive features.

A Plea for Better Data From Avalanche Centers and Why No Predictions are Available

In the end, I was appalled by the lack of data. People die because they know nothing of the conditions generating avalanches. I myself have felt the force of a concussion wave rippling my clothes from half a mile away. This must end. Selling data should not take precedence over safety. My site suffers at the moment from poor reporting, a lack of publicly available clean data, and simple mis-reportings not caught in ETL. My data set actually shrank in cleaning from 1000 to 300 avalanches across the entire Northwest.

Still, weather data was incredibly public. The Natural Resources Conservation Service, which sells data, is a powerful tool when placed in combination with the National Oceanic and Atmospheric Administration and Air Force weather station data sets.

Overall, I can only provide public clustering because of this poor data. Over time this may change as clusters become more distinct and patterns and predictions more powerful. However, I would feel personally responsible for someone's untimely end at this point. I have tried running multiple regression on this topic before, but the results were dismal. While better than 2011, the data sets still need improvement.

The Future

I have no intention of stopping collection and will document my development work on the project here. I also plan to document any attempts to develop a device that uses the data it collects and my weather and/or other data to make predictions on the snowpack.

Could Watson Be Used to Dynamically Create Databases?

Data collection, cleaning, and presentation are a pain, especially when dealing with a multitude of sources. When APIs aren't available and every step is taken to keep people from getting data, it can be incredibly tedious just to get it. Parsing, of course, can be made easier by relating terms in a dictionary and using the document's structure to your advantage. At worst it is just a few lines of regex or several XPath expressions and more cleaning with Pentaho. I've gone a bit further by enforcing key constraints and naming conventions with the help of Java Spring.

It seems that IBM is making this process a little less time consuming with Watson. Watson appears to have the capacity to find patterns and relations with minimal effort from a CSV or other structured file.

This could really benefit database design by applying a computer to finding and illuminating the patterns that drive key creation and normalization. After all, I would love to be able to focus less on key creation in a maximum-volume industry and more on pumping scripts into my automated controllers. The less work and more productivity per person, the more profit.

Morning Joe: Can Computer Vision Technology Help De-Militarize the Police and Provide Assistance?

There has been an explosion of computer vision technology in the past few years, or even the last decade or so considering OpenCV has been around that long. The recent events in Ferguson have created a need for keeping the police in line as well as a need to present credible evidence regarding certain situations.

Many police departments are starting to test programs that place snake cams like those used in the military on officers. While this could be viewed as more militarization, it can also present departments with a black eye if power is abused.

What if the lawyers, police, and ethics commissions could have a way of recognizing potentially dangerous situations before they happen? What if there was a light weight solution that allowed data programs to monitor situations in real or near real time, spot troublesome incidents, and provide alerts when situations were likely to get out of hand? What if potentially unethical situations could be flagged?

The answer is that this is possible without too much development already.

Statistical patterns can be used to predict behaviour long before anything happens. Microsoft and Facebook can accurately predict what you will be doing a year from now. The sad state of the current near police state is that the government has as much or more data on officers and citizens than Microsoft and Facebook.

These patterns can be used to narrow the video from those snake cams to potentially harmful situations for real time monitoring.

From there, a plethora of strong open source tools can be used to spot everything from weapons and the potential use of force, using the training capabilities of OpenCV and some basic kinematics (video is just a bunch of really quickly taken photos played in order), to speech using Sphinx4 (a work in progress for captchas but probably not for clear speech), and even optical character recognition with pytesser. A bit of image pre-processing and OCR in Tesseract can already break nearly every captcha on the market in under one second with a single core and less than 2 GB of RAM. The same goes for using corner detection and OCR on a PDF table. Why can't it be used in this situation?

The result in this case should be a more ethical police force and better safety to calm the fears of officers and civilians alike.

Call me crazy, but we can go deeper than just putting snake cams on officers in order to police officers and provide assistance. Quantum computing and/or better processors and graphics cards will only make this more of a reality.

Morning Joe/Python PDF Part 3: Straight Optical Character Recognition

*Due to time constraints, I will be publishing large articles on the weekends with a daily small article for the time being.

Now we start to delve into PDF images, since the PDF text-processing articles are quite popular. Not everything in a PDF can be stripped using straight text conversion, and the biggest headache is the PDF image. Luckily, our do "no evil" (heavy emphasis here) friends came up with Tesseract, which, with training, is also quite good at breaking their own paid Captcha products, to my great amusement and my company's profit.

A plethora of image pre-processing libraries and a bit of post-processing are still necessary when completing this task. Images must have high enough contrast and be large enough to make sense of. Basically, the algorithm consists of pre-processing an image, saving the image, running optical character recognition, and then performing clean-up tasks.

Saving Images Using Software and by Finding Stream Objects

For Linux users, saving images from a PDF is best done with Poppler utils (poppler-utils), which come with Fedora, CentOS, and Ubuntu distributions and save images to a specified directory. The command format is pdfimages [options] [pdf file path] [image root]. Options are included for specifying a starting page [-f int], an ending page [-l int], and more. Just type pdfimages into a Linux terminal to see all of the options.


pdfimages -j /path/to/file.pdf /image/root/

To see whether there are images, just type pdfimages -list [pdf file path].

Windows users can use a similar command with the open source XPdf.

It is also possible to use the magic numbers I wrote about in a different article to find the images while iterating across the PDF stream objects, finding the starting and ending bytes of an image, and writing them to a file with open().write(). A stream object is the way Adobe embeds objects in a PDF and is represented below. The find command can be used to ensure they exist, and the regular expression re.finditer("(?mis)(?<=stream).*?(?=endstream)", pdf) will find all of the streams.


stream

....our gibberish looking bytes....

endstream

Pre-Processing

Python offers a variety of extremely good tools via Pillow that eliminate the need for the hard-coded pre-processing found in my image tools for Java.

Some of the features that pillow includes are:

  1. Blurring
  2. Contrast
  3. Edge Enhancement
  4. Smoothing

These classes should work for most pdfs. For more, I will be posting a decluttering algorithm in a Captcha Breaking post soon.

For resizing, OpenCV includes a module that avoids pixelation with a bit of math magic.


#! /usr/bin/python
import cv2

im = cv2.imread("impath")

# cv2.resize expects (width, height), i.e. (columns, rows)
im = cv2.resize(im, (im.shape[1]*2, im.shape[0]*2))

OCR with Tesseract

With a subprocess call or the use of pytesser (which includes faster support for Tesseract by implementing a subprocess call and ensuring reliability), it is possible to OCR the document.


#! /usr/bin/python

from PIL import Image

import pytesser

im=Image.open("fpath")

string=pytesser.image_to_string(im)

If the string comes out as complete garbage, try using the pre-processing modules to fix it or look at my image tools for ways to write custom algorithms.

Post-Processing

Unfortunately, Tesseract is not a commercial grade product for performing PDF OCR. It will require post processing. Fortunately, Python provides a terrific set of modules and capabilities for dealing with data quickly and effectively.

The regular expression module re, list comprehension, and substrings are useful in this case.

An example of post-processing would be (in continuance of the previous section):


import re

lines = string.split("\n")
lines = [x for x in lines if "bad stuff" not in x]

results = []
for line in lines:
    if re.search("pattern ", line):
        results.append(re.sub("bad pattern", "replacement", line))

Conclusion

It is definitely possible to obtain text from a PDF using Tesseract, but post-processing is a requirement. For documents that are proving difficult, commercial software is the best solution, with SimpleOCR and ABBYY FineReader offering quality products. ABBYY offers the best quality in my opinion but comes at a high price, with a quote for an API for just reading PDF documents coming in at $5000. I have managed to use the Tesseract approach successfully at work, but the process is time consuming and not guaranteed.

Morning Joe: Legality of Acquiring Scraped Data

One of my tasks at the entry level, besides basic normalization, network programming, ETL, and IT work, is to acquire data using just about anything. Being in the US, this sort of data acquisition can be problematic.

I did some research since recent court rulings seem a bit mixed. Legally, in the US, there are a few factors that seem to be important.

Illegal acts obviously include targeting others in an attack. Are you doing anything akin to hacking or gaining unauthorized access under the Computer Fraud and Abuse Act? Exploiting vulnerabilities and passing SQL in the URL to open a database, no matter how badly it was programmed, is illegal at the felony level with up to a 15 year sentence (see the cases where an individual exploited security vulnerabilities in Verizon's systems). Also, add a timeout even if you round-robin or use proxies; DDoS attacks are attacks, and 1000 requests per second can shut down a lot of servers providing public information. The result here is up to 15 years in jail.

Copyright law is starting to become important as well. Pure replication of data that is protected is illegal; even 4% replication has been deemed a breach. With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties when knowingly or negligently serving this data to others. It is nearly impossible to tell whether mixed data was obtained illegally, though.

The following excerpt from the Wikipedia entry on web scraping, where all of the cases cited are real, says it all.

U.S. courts have acknowledged that users of “scrapers” or “robots” may be held liable for committing trespass to chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder’s Edge, resulted in an injunction ordering Bidder’s Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

Paywalls and paid products offer another significant, though easy to skirt, boundary. When going behind paywalls, contracts are breached by clicking an agreement not to do something and then doing it. This is particularly damaging since you add fuel to the question of negligence versus willfulness (an issue for damages and penalties, not guilt) in civil and any criminal trials. Ignorance is no defense.

Outside of the US, things are quite different; EU law and other jurisdictions are much more lax. Corporations with big budgets dominate our legal landscape. They control the system with their money in a very real way, in a way that they do not elsewhere, at least not as much, despite being just as powerful in almost every respect.

The gist of the cases and laws seems to point towards getting public information and information that is available without going behind a paywall. Think like a user of the internet and combine a bunch of sources into a unique product. Don't just 'steal' an entire protected site.

Here are a few examples. Trulia owns its information, but you could use it to go to an agent's website or collect certain information. However, accessing protected data is not legal, and simply re-purposing an entire site does not appear to be either. The legal amount of pulled information is determinable. A public MLS listing lookup site with no agreement or terms, offering data to the public, is fair game. The MLS number lists, however, are normally not fair game, since access is heavily guarded behind a wall of registration requiring some fakery to get through.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a million corporate researchers at your disposal.

As for company policy, it is usually used internally to shield from liability and serves as a warning, but it is not entirely enforceable. The legal parts letting you know about copyrights and the like are, and are usually supposed to be known by everyone anyway; complete ignorance is not a legal protection. It does provide a ground set of rules: be nice or get banned is the message, as far as I know.

My personal strategy is to start with public data and embellish it within legal means.