Headless Testing and Scraping with Java FX

There is a lot of JavaScript in the world today and there is a need to get things moving quickly. Whether testing multiple websites or acquiring data for ETL and/or analysis, a tool needs to exist that does not leak memory the way Selenium does. Until recently, Selenium was really the only option for WebKit; JCEF and writing native bindings for Chromium have been options for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools. These tools can be used to automate scraping and testing where raw network calls for HTML, JSON, CSVs, PDFs, and the like are more tedious and difficult.

The FX Package

FX is much better than the television channel, with some exceptions. Java created a sleeker, WebKit-based alternative to embedding Chromium. While WebKit suffers from some serious setbacks, Java FX also incorporates nearly every part of the java.net framework. Setting SSL handlers, proxies, and the like works the same as with java.net. Therefore, FX can be used to intercept traffic (e.g. stream incoming images directly to a file named by URL without making more network calls), present a nifty front end controlled by JavaScript, and query for components.
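As a minimal sketch of the idea, the snippet below loads a page in the JavaFX WebEngine and pulls the rendered DOM once loading succeeds. The URL and proxy host are placeholders; the proxy settings, as noted above, are just the usual java.net system properties.

import javafx.application.Application;
import javafx.concurrent.Worker;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

public class FxScrapeExample extends Application {

    @Override
    public void start(Stage stage) {
        // Proxies are plain java.net settings, exactly as described above (placeholder values).
        System.setProperty("https.proxyHost", "my.proxy.host");
        System.setProperty("https.proxyPort", "8080");

        WebEngine engine = new WebView().getEngine();

        // Fire once the page has finished loading, then pull the rendered DOM as a string.
        engine.getLoadWorker().stateProperty().addListener((obs, old, state) -> {
            if (state == Worker.State.SUCCEEDED) {
                String html = (String) engine.executeScript("document.documentElement.innerHTML");
                System.out.println(html);
            }
        });

        engine.load("https://example.com"); // placeholder URL
    }

    public static void main(String[] args) {
        launch(args);
    }
}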

Ui4J

Ui4j is every bit as nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on the command line or via system properties after jdk1.8.0_20. Oracle removed Monocle from the JDK after this release, forcing programs that rely on it to switch to OpenMonocle. However, xvfb-run -a works equally well; the -a option automatically chooses a display number. The GitHub site does still claim compatibility with Monocle.

On top of headless mode, the authors have made working with FX simple: run JavaScript as needed, incorporate interceptors with ease, and avoid nasty waitFor calls and Selenese (an entire language within your existing language).
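A rough sketch of a headless pull with Ui4j might look like the following. The package and factory names are taken from the Ui4j README as I recall them, so double-check them against the version you use; the URL is a placeholder.

import io.webfolder.ui4j.api.browser.BrowserEngine;
import io.webfolder.ui4j.api.browser.BrowserFactory;
import io.webfolder.ui4j.api.browser.Page;

public class Ui4jScrapeExample {

    public static void main(String[] args) {
        // Request headless mode before the toolkit starts (see the Monocle/Xvfb notes above).
        System.setProperty("ui4j.headless", "true");

        BrowserEngine browser = BrowserFactory.getWebKit();
        Page page = browser.navigate("https://example.com"); // placeholder URL

        // No waitFor calls or Selenese, just plain JavaScript against the loaded page.
        String html = (String) page.executeScript("document.documentElement.innerHTML");
        System.out.println(html);
    }
}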

TestFX

There is an alternative to Ui4j in TestFX, which is geared towards testing. Rather than wrapping an assert around something like (String) page.executeScript("document.documentElement.innerHTML"), methods such as verifyThat exist. Combine with Scala and have a wonderfully compact day. The authors have also managed to get a workaround for the Monocle problem.
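As a hedged sketch, assuming TestFX 4 with JUnit 4 on the classpath (the button, its id, and the class names are made up for illustration), a test might look like this:

import javafx.scene.Scene;
import javafx.scene.control.Button;
import javafx.stage.Stage;
import org.junit.Test;
import org.testfx.framework.junit.ApplicationTest;

import static org.testfx.api.FxAssert.verifyThat;
import static org.testfx.matcher.control.LabeledMatchers.hasText;

public class ButtonTest extends ApplicationTest {

    @Override
    public void start(Stage stage) {
        // Hypothetical control under test.
        Button button = new Button("Click me");
        button.setId("myButton");
        button.setOnAction(e -> button.setText("Clicked"));
        stage.setScene(new Scene(button, 200, 100));
        stage.show();
    }

    @Test
    public void buttonChangesLabelWhenClicked() {
        clickOn("#myButton");
        // verifyThat replaces the manual assert-over-executeScript pattern.
        verifyThat("#myButton", hasText("Clicked"));
    }
}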

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies. Java, and Scala for that matter, sets one proxy per JVM. Luckily, both Java and Scala have subprocess modules. The lovely, data-friendly language that is Scala makes this task as simple as Process("java -jar myjar.jar -p my:proxy").!. Simply run the command, which returns the exit status and blocks until complete (see Futures to make this a better, non-blocking version), and use tools like Scopt to read the proxy and set it in a new Browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don't pass it on the command line). RMI would probably be a bit better for large code, but it may be possible to better secure a file than compiled code using checksums.
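The same one-JVM-per-proxy trick is easy from plain Java as well. This is just a sketch; the jar name and proxy host are placeholders.

import java.io.IOException;

public class ProxyLauncher {

    public static void main(String[] args) throws IOException, InterruptedException {
        // Proxies are JVM-wide, so set them per child process (placeholder jar and proxy).
        ProcessBuilder pb = new ProcessBuilder(
                "java",
                "-Dhttp.proxyHost=my.proxy.host",
                "-Dhttp.proxyPort=8080",
                "-jar", "myjar.jar");
        pb.inheritIO();                   // stream the child's output to this console
        Process child = pb.start();
        int exitStatus = child.waitFor(); // blocks until the child exits
        System.out.println("Child exited with " + exitStatus);
    }
}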

Conclusion

Throw out Selenium, get rid of the extra Selenese parsing, and get Ui4j or TestFX for WebKit testing. Sadly, neither works with Gecko, so Chromium is needed to replace those tests and obtain such terrific options as --ignore-certificate-errors. There are cases where fonts served over SSL will wreak havoc before you can even handle the incoming text, no matter how low level you write your connections. For simple page pulls, stick to Apache HttpComponents, which contains a fairly fast asynchronous client with mid-tier RAM usage, usable from Java or Scala. Sorry for the brevity folks, but I tried to answer a question or two that was not in the tutorials or documentation. Busy!
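For the curious, a simple asynchronous pull with HttpComponents looks roughly like this; it assumes HttpAsyncClient 4.x on the classpath, and the URL is a placeholder.

import java.util.concurrent.Future;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;
import org.apache.http.util.EntityUtils;

public class SimplePull {

    public static void main(String[] args) throws Exception {
        try (CloseableHttpAsyncClient client = HttpAsyncClients.createDefault()) {
            client.start();
            // Block on the Future here; pass a FutureCallback instead to stay fully async.
            Future<HttpResponse> future = client.execute(new HttpGet("https://example.com"), null);
            HttpResponse response = future.get();
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}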


Is there an Ethical Imperative to Own Certain Domains?

Surfing the internet, I came across the site goole.com. Click it at your own risk. It brings up an ethical question: should websites that serve results to a wide range of people, and can expect an occasional typo from any of them, own certain look-alike domains? How far should they go?

Please consider the golden rule here. The majority of spelling errors occur within an edit distance of two. Peter Norvig found a value as high as 98.9 percent of errors within two units, with roughly eighty percent at one distance unit. A quick review of Levenshtein distance from code can be found here; it is the minimum number of insertions, deletions, and substitutions needed to turn one string into another (the Damerau variant also counts transpositions). Apache also produces a version.
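Apache's implementation lives in Commons Text; here is a tiny, hedged example using only the strings discussed in this post.

import org.apache.commons.text.similarity.LevenshteinDistance;

public class EditDistanceExample {

    public static void main(String[] args) {
        LevenshteinDistance distance = new LevenshteinDistance();
        // "goole" is one deletion away from "google".
        System.out.println(distance.apply("google", "goole"));   // prints 1
        // "goilele" is further away and probably not worth owning.
        System.out.println(distance.apply("google", "goilele"));
    }
}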

Pro Ownership

Let's consider Google, the obvious target of the above link. If someone were looking for a handout and became angry, they could easily try to turn the site into a virus-ridden hell-fest for any unsuspecting victim. People make mistakes. Therefore, it is a decent proposal to at least try to protect the user by owning some of these sites.

Ownership has quite a few pros, some of which are more commercial than ethical.

  • Attaining credibility by attempting to protect users
  • Acknowledging humanity
  • Alerting ill-doers that you take some stance against ill will
  • Protecting against likeness and image issues

Against Ownership

The issue with ownership is that going so far may create an expectation of going even further. If a company such as Google purchased Goole, would it then need to purchase Gogle and Googls at a Levenshtein distance of 1? What about goilele? Perhaps the user then fails to take matters into their own hands and correct their mistakes. Even worse, what if an expectation of a payout follows, and failing to pay up creates more virus-ridden hell-fests? If this were the case, the acknowledgement may even negate some of the pros.

Cons include:

  • Generating expectations of protection creating complacency
  • Creating a drive to use the site to do ill will
  • Generating opportunity to achieve a payout without effort
  • Cost (especially given how many near-miss domains are available)
  • Errors may not matter for a small, specialized, or obscure website

Conclusion

The pros and cons are actually more numerous than those mentioned, but they are still interesting in this case. There should be some attempt to protect a site; not everyone will hit the target page every time, and errors are human. Since most errors fall within two Levenshtein distance units, obtaining the look-alike domains within that distance that most users are likely to hit is helpful. Goole may be so close to Google that it would make a solid purchase. This weighs the need to protect oneself against errors, but why not show an alert or post a blank page instead of a redirect? That approach avoids extreme costs by considering appropriateness, acknowledges human error, and meets most of the criteria laid out here.

A Neat Little REST Trick

So there is a shared-memory problem. We are using Spring REST, and a few solutions come to mind. Perhaps it is time not to rule out statics.

The Controller component is established as a singleton whose methods are called by every connection. It is possible to include a static, thread-safe object to avoid shared-state issues; Java's concurrent classes provide a few of these. Synchronized wrapper classes and maps of locks are way more complicated and likely to slow things down. Synchronized instance methods, in fact, lock the whole instance (equivalent to synchronized(this)), so every synchronized method on the singleton contends for the same lock.

As a warning though, the CopyOnWriteArrayList is thread-safe but slow to write to, since every mutation copies the backing array.

The example below uses a ConcurrentHashMap<String, Integer>.

import java.util.concurrent.ConcurrentHashMap;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class RESTClass {
    // Shared across every request thread hitting this singleton controller.
    private static final ConcurrentHashMap<String, Integer> mp = new ConcurrentHashMap<>();
}
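To round it out, here is a hedged sketch of how a request handler might mutate that shared map safely; the class name, endpoint, and mapping are made up for illustration, assuming Spring MVC is on the classpath.

import java.util.concurrent.ConcurrentHashMap;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HitCounterController {

    private static final ConcurrentHashMap<String, Integer> hits = new ConcurrentHashMap<>();

    // merge() is atomic, so concurrent requests cannot lose updates.
    @GetMapping("/hits/{key}")
    public Integer countHit(@PathVariable String key) {
        return hits.merge(key, 1, Integer::sum); // increment atomically, starting at 1
    }
}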