Headless Testing and Scraping with JavaFX

There is a lot of JavaScript in the world today, and there is a need to get things moving quickly. Whether you are testing multiple websites or acquiring data for ETL and/or analysis, you need a tool that does not leak memory the way Selenium does. Until recently, Selenium was really the only option for WebKit; JCEF and writing native bindings for Chromium have been options for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools, which can be used to automate scraping and testing where plain network calls for HTML, JSON, CSVs, PDFs, and the like are more tedious and difficult.

The FX Package

FX is much better than the television channel, with some exceptions. JavaFX's WebView wraps a WebKit-based engine that feels like a sleeker take on Chromium. While WebKit suffers from some serious setbacks, JavaFX also incorporates nearly every part of the java.net framework: setting SSL handlers, proxies, and the like works the same as with java.net. Therefore, FX can be used to intercept traffic (e.g. stream incoming images directly to a file named by URL without making extra network calls), present a nifty front end controlled by JavaScript, and query for components.
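
As a rough sketch of what that looks like, the snippet below loads a page in WebEngine, waits for the load worker to succeed, and dumps the rendered DOM. The proxy host, port, and URL are placeholders; a real scraper would pull them from configuration.

    import javafx.application.Application;
    import javafx.concurrent.Worker;
    import javafx.scene.Scene;
    import javafx.scene.web.WebEngine;
    import javafx.scene.web.WebView;
    import javafx.stage.Stage;

    // Minimal sketch: load a page in the WebKit-based WebEngine and print its HTML.
    public class FxScrape extends Application {

        @Override
        public void start(Stage stage) {
            // JavaFX honours the usual java.net proxy properties, so one proxy per JVM.
            System.setProperty("http.proxyHost", "my.proxy.host");   // placeholder
            System.setProperty("http.proxyPort", "8080");            // placeholder

            WebView view = new WebView();
            WebEngine engine = view.getEngine();

            // Wait for the load to finish, then pull the rendered DOM as a string.
            engine.getLoadWorker().stateProperty().addListener((obs, old, state) -> {
                if (state == Worker.State.SUCCEEDED) {
                    String html = (String) engine.executeScript("document.documentElement.innerHTML");
                    System.out.println(html);
                }
            });

            engine.load("https://example.com");   // placeholder URL
            stage.setScene(new Scene(view));
            stage.show();
        }

        public static void main(String[] args) {
            launch(args);
        }
    }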

Ui4J

Ui4j is equally as nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on the command line or via system properties after jdk1.8.0_20; Oracle removed Monocle from the JDK after that release, forcing programs that relied on it over to OpenMonocle. However, xvfb-run -a works equally well, with the -a option automatically choosing a server number. The GitHub site does claim compatibility with Monocle, though.
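
Something like the following is all it takes, assuming the BrowserFactory/Page API from the project's README; the URL is a placeholder, and the system property mirrors the -Dui4j.headless=true flag mentioned above.

    import com.ui4j.api.browser.BrowserEngine;
    import com.ui4j.api.browser.BrowserFactory;
    import com.ui4j.api.browser.Page;

    // Rough sketch of headless scraping with Ui4j.
    public class Ui4jScrape {

        public static void main(String[] args) {
            // Same effect as passing -Dui4j.headless=true on the command line.
            System.setProperty("ui4j.headless", "true");

            BrowserEngine browser = BrowserFactory.getWebKit();
            Page page = browser.navigate("https://example.com");   // placeholder URL

            // Grab the rendered DOM without any waitFor gymnastics.
            String html = (String) page.executeScript("document.documentElement.innerHTML");
            System.out.println(html);

            // The FX toolkit keeps a non-daemon thread alive; exit explicitly in this sketch.
            System.exit(0);
        }
    }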

On top of headless mode, the authors have made working with FX simple: run JavaScript as needed, incorporate interceptors with ease, and avoid nasty waitFor calls and Selenese (an entire language within your existing language).

TestFX

There is an alternative to Ui4j in TestFX, which is geared towards testing. Rather than calling ((String) page.executeScript("document.documentElement.innerHTML")) and wrapping the result in an assert, methods such as verifyThat exist. Combine it with Scala and have a wonderfully compact day. The authors have also managed a workaround for the Monocle problem.
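
A hedged sketch of what such a test can look like with TestFX 4 and JUnit 4; the #greeting label and its text are invented purely for illustration.

    import javafx.scene.Scene;
    import javafx.scene.control.Label;
    import javafx.scene.layout.StackPane;
    import javafx.stage.Stage;
    import org.junit.Test;
    import org.testfx.framework.junit.ApplicationTest;

    import static org.testfx.api.FxAssert.verifyThat;
    import static org.testfx.matcher.control.LabeledMatchers.hasText;

    public class GreetingTest extends ApplicationTest {

        @Override
        public void start(Stage stage) {
            // Tiny UI under test: a single labelled node.
            Label label = new Label("hello");
            label.setId("greeting");
            stage.setScene(new Scene(new StackPane(label), 200, 100));
            stage.show();
        }

        @Test
        public void labelShowsGreeting() {
            // verifyThat replaces the manual executeScript-then-assert dance.
            verifyThat("#greeting", hasText("hello"));
        }
    }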

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies: Java, and Scala for that matter, set one proxy per JVM. Luckily, both languages have subprocess modules. The lovely, data-friendly language that is Scala makes this task as simple as Process("java -jar myjar.jar -p my:proxy").!. Simply run the command, which blocks until complete and returns the exit status (see Futures for a non-blocking version), and use a tool like Scopt to read the proxy and set it in a new Browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don't pass code on the command line). RMI would probably be a bit better for large amounts of code, but it may be possible to secure a file better than compiled code using checksums.
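
For those staying in Java, the parent side is just as short with ProcessBuilder; the sketch below mirrors the Scala one-liner, with the jar name and proxy list as placeholders.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    // One child JVM per proxy, since proxy settings are JVM-wide.
    public class ProxyLauncher {

        public static void main(String[] args) throws IOException, InterruptedException {
            List<String> proxies = Arrays.asList("proxy1.example:8080", "proxy2.example:8080");

            for (String proxy : proxies) {
                ProcessBuilder pb = new ProcessBuilder("java", "-jar", "myjar.jar", "-p", proxy)
                        .inheritIO();                 // pipe the child's output to this console
                int exit = pb.start().waitFor();      // blocks; wrap in a Future/executor to run children in parallel
                System.out.println(proxy + " exited with " + exit);
            }
        }
    }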

Conclusion

Throw out Selenium, get rid of the extra Selenese parsing, and get Ui4j or TestFX for WebKit testing. Sadly, neither works with Gecko, so Chromium is still needed to replace those tests and obtain such terrific options as --ignore-certificate-errors. There are cases where fonts delivered over SSL will wreak havoc before you can even handle the incoming text, no matter how low level you write your connections. For simple page pulls, stick to Apache HttpComponents, which contains a fairly fast asynchronous client with mid-tier RAM usage, usable from Java or Scala. Sorry for the brevity, folks, but I tried to answer a question or two that was not in the tutorials or documentation. Busy!
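
For reference, a simple page pull with the 4.x async client looks roughly like this; the URL is a placeholder.

    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
    import org.apache.http.impl.nio.client.HttpAsyncClients;

    import java.util.concurrent.Future;

    // Minimal sketch of a simple page pull with Apache HttpComponents' async client.
    public class SimplePull {

        public static void main(String[] args) throws Exception {
            try (CloseableHttpAsyncClient client = HttpAsyncClients.createDefault()) {
                client.start();
                Future<HttpResponse> future = client.execute(new HttpGet("https://example.com"), null);
                HttpResponse response = future.get();   // block here; pass a FutureCallback instead to stay async
                System.out.println(response.getStatusLine());
            }
        }
    }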

Morning Joe: Legality of Acquiring Scraped Data

One of my tasks at the entry level, besides basic normalization, network programming, ETL, and IT work, is to acquire data using just about anything. Being in the US, this sort of data acquisition can be problematic.

I did some research since recent court rulings seem a bit mixed. Legally, in the US, there are a few factors that seem to be important.

Illegal acts obviously include targeting others in an attack. Are you doing anything akin to hacking or gaining unauthorized access under the Computer Fraud and Abuse Act? Exploiting vulnerabilities, such as passing SQL in the URL to open a database, is illegal at the felony level with up to a 15-year sentence, no matter how badly the site was programmed (see the cases where an individual exploited security vulnerabilities in Verizon). Also, add a timeout even if you round-robin or use proxies: DDoS attacks are attacks, and 1,000 requests per second can shut down a lot of servers providing public information. The result here is also up to 15 years in jail.

Copyright law is starting to become important as well. Pure replication of protected data is illegal; even 4% replication has been deemed a breach. With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties when knowingly or negligently serving this data to others. It is nearly impossible to tell whether mixed data was obtained illegally, though.

The following excerpt from the Wikipedia entry on web scraping, where all of the cited cases are real, says it all.

U.S. courts have acknowledged that users of “scrapers” or “robots” may be held liable for committing trespass to chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder’s Edge, resulted in an injunction ordering Bidder’s Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

Paywalls and products offer another significant, though easy to skirt, boundary. When going behind paywalls, clicking an agreement not to do something and then doing it is a breach of contract. This is particularly damaging, since you add fuel to the question of negligence v. willfulness [an issue for damages and penalties, not guilt] in civil and any criminal trials. Ignorance is no defense.

Outside of the US, things are quite different: EU law and other jurisdictions are far more lax. Corporations with big budgets dominate the US legal landscape, controlling the system with their money in a very real way that they do not elsewhere, at least not to the same degree, despite being just as powerful in almost every other respect.

The gist of the cases and laws seems to point towards taking public information and information that is available without going behind a paywall. Think like a user of the internet and combine a bunch of sources into a unique product. Don't just 'steal' an entire protected site.

Here are a few examples. Trulia owns its information, but you could use it to go to an agent's website or collect certain information. However, accessing protected data is not legal, and simply re-purposing an entire site appears to be treated the same way; the amount of information you can legally pull is determinable. A public MLS listing lookup site with no agreement or terms, offering data to the public, is fair game. The MLS number lists, however, are normally not fair game, since access is heavily guarded behind a registration wall that requires some fakery to get past.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a million corporate researchers at your disposal.

As for company policy, it is usually used internally to shield from liability and serves as a warning, but it is not entirely enforceable. The legal parts letting you know about copyrights and such are enforceable, and are usually supposed to be known by everyone anyway; complete ignorance is not a legal protection. Policy does provide a ground set of rules: be nice or get banned is the message, as far as I know.

My personal strategy is to start with public data and embellish it within legal means.