Headless Testing and Scraping with JavaFX

A great deal of the web runs on JavaScript today, and there is often a need to get things moving quickly. Whether testing multiple websites or acquiring data for ETL and/or analysis, a tool needs to exist that does not leak memory as much as Selenium does. For a long time, Selenium was really the only option for WebKit, although JCEF and writing native bindings for Chromium have been options for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools. These tools can automate scraping and testing in cases where making direct network calls for HTML, JSON, CSVs, PDFs, or what not is more tedious and difficult.

The FX Package

FX is much better than the television channel, with some exceptions. JavaFX ships a sleek embedded browser based on WebKit. While WebKit suffers from some serious setbacks, JavaFX also incorporates nearly any part of the java.net framework. Setting SSL handlers, proxies, and the like works the same as with java.net. Therefore, FX can be used to intercept traffic (e.g. directly stream incoming images to a file named by URL without making more network calls) and to present a nifty front end controlled by JavaScript and by querying for components.
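As a rough illustration of the java.net point above, the sketch below sets a JVM-wide proxy through the standard http.proxyHost / https.proxyHost system properties and then asks the default ProxySelector which proxy it would use. The class name JvmProxyConfig and the host proxy.example.org are made up for the example, and the assumption that an FX browser picks up these settings rests on the statement above that it behaves the same as java.net.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

// Hypothetical helper, not part of JavaFX or Ui4j.
public class JvmProxyConfig {

    /** Sets the JVM-wide HTTP and HTTPS proxy via the standard java.net system properties. */
    public static void useProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", Integer.toString(port));
    }

    /** Describes the first proxy the default selector picks for a URL: "DIRECT" or "TYPE host:port". */
    public static String proxyFor(String url) {
        List<Proxy> proxies = ProxySelector.getDefault().select(URI.create(url));
        Proxy first = proxies.get(0);
        if (first.type() == Proxy.Type.DIRECT) {
            return "DIRECT";
        }
        InetSocketAddress address = (InetSocketAddress) first.address();
        return first.type() + " " + address.getHostString() + ":" + address.getPort();
    }

    public static void main(String[] args) {
        // proxy.example.org:8080 is a placeholder, not a real proxy.
        useProxy("proxy.example.org", 8080);
        System.out.println(proxyFor("http://example.com/"));
    }
}
```

Because these are system properties, they apply to the whole JVM, which is why the Multiple Proxies section ends up using one process per proxy.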

Ui4j

Ui4j is every bit as nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on the command line or through system properties after jdk1.8.0_20. Oracle removed Monocle from the JDK after this release, forcing programs that rely on it over to OpenMonocle. However, xvfb-run -a works equally well; the -a option automatically chooses a server number. The GitHub site does claim compatibility with Monocle, though.

On top of headless mode, the authors have made working with FX simple. Run JavaScript as needed, incorporate interceptors with ease, and avoid nasty waitFor calls and Selenese (an entire language within your existing language).

TestFX

There is an alternative to Ui4j in TestFX, which is geared towards testing. Rather than using an Assert after calling or with ((String) page.executeScript("document.documentElement.innerHTML")), methods such as verifyThat exist. Combine it with Scala and have a wonderfully compact day. The authors have also managed to get a workaround for the Monocle problem.

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies. Java, and Scala for that matter, set one proxy per JVM. Luckily, both Java and Scala can launch subprocesses. The lovely data-friendly language that is Scala makes this task as simple as Process("java -jar myjar.jar -p my:proxy").! (from scala.sys.process). This runs the command, blocks until it completes, and returns the exit status (see Futures for a non-blocking version). Use tools like Scopt to read the proxy from the command line and set it in a new Browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don't pass it on the command line). RMI would probably be a bit better for large code, but it may be possible to better secure a file than compiled code using checksums.
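For those staying in Java, a minimal sketch of the same one-JVM-per-proxy idea with ProcessBuilder might look like the following. The class name ProxyWorkers is hypothetical, and the -p flag is simply the one from the command above; it assumes myjar.jar parses that flag itself.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher: one child JVM per proxy.
public class ProxyWorkers {

    /** Builds the command line for one worker JVM bound to a single proxy. */
    public static List<String> commandFor(String jarPath, String proxy) {
        return Arrays.asList("java", "-jar", jarPath, "-p", proxy);
    }

    /** Starts one JVM per proxy, waits for all of them, and returns the exit codes in order. */
    public static List<Integer> runAll(String jarPath, List<String> proxies)
            throws IOException, InterruptedException {
        List<Process> running = new ArrayList<>();
        for (String proxy : proxies) {
            // inheritIO() forwards the child's stdout/stderr to this process.
            running.add(new ProcessBuilder(commandFor(jarPath, proxy)).inheritIO().start());
        }
        List<Integer> exitCodes = new ArrayList<>();
        for (Process process : running) {
            exitCodes.add(process.waitFor());
        }
        return exitCodes;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: ProxyWorkers <jar> <proxy> [<proxy> ...]");
            return;
        }
        List<String> proxies = Arrays.asList(args).subList(1, args.length);
        System.out.println(runAll(args[0], proxies));
    }
}
```

Unlike the blocking Scala one-liner, this starts every worker before waiting on any of them, so the proxies are used in parallel.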

Conclusion

Throw out Selenium, get rid of the extra Selenese parsing, and get Ui4j or TestFX for WebKit testing. Sadly, it does not work with Gecko, so Chromium is needed to replace these tests and obtain such terrific options as --ignore-certificate-errors. There are cases where fonts in the SSL will wreak havoc before you can even handle the incoming text, no matter how low level you write your connections. For simple page pulls, stick to Apache HTTP Components, which contains a fairly fast asynchronous thread pool with mid-tier RAM usage, usable from Java or Scala. Sorry for the brevity, folks, but I tried to answer a question or two that was not in tutorials or documentation. Busy!

Using JCEF in Eclipse

After some serious issues with Selenium, and with a need to stay up on the latest and greatest so that I do not fall behind in running test-like scripts, I came across the Chromium Embedded Framework (CEF). Having most of my tools in Java, I decided to install and use JCEF. This proved to be a bit of a nightmare. Hopefully, the following post will alleviate problems for users.

Before continuing, I did receive extensive support from MaGreenblatt (a chief contributor and maybe the lead for JCEF).

Morning Joe: Why IT Needs Software Requirements and the Disaster of Diving In Headfirst

First off, I plan on writing more technical articles at night but just moved into a new apartment. I have 4 articles currently in my queue dealing with optical character recognition of PDFs, tables, and CAPTCHAs, as well as creating a wireless Arduino device.

That said, I am now on my fifth call to IT regarding what should be the simple task of setting up my email in Office 365. I use Linux, as a large number of developers and techies do, and sadly each step to get my email online has been a painstaking process, with IT only fixing the problem directly in front of its face. I have had to have the service first reinitialized, then a key attached to my account, and, finally, I am wrangling access to each product I should already have.

Something tells me that software analysis and the process of discovery could have saved my pain and reduced the number of insults hurled at a few college kids looking to make some side money with 0 industry skills. Of course, that is what we do with college students and immigrants, pay them peanuts, give them a little, and let them handle the work no one wants. On the other hand, the massive IT budgets that make my own look like a needle in a haystack are being misappropriated to pay for God knows what.

In today’s environment, where a day can cost a lifetime, this needs to change. Deploying resources to study upgrades and major changes is a must and that means analyzing the means of failure and working around them.

Failures come from a variety of sources. Sara Baase offers a good review of them in A Gift of Fire. They include:

  1. Lack of understanding of the material that can be overcome with research
  2. Arrogance
  3. Sloppy user interfaces
  4. Too little testing
  5. Failure to understand a product's uses and potential conflicts
  6. Funding
  7. The pressure for profit
  8. Product loyalty and fads (one that I have personally noticed)
  9. An unqualified workforce (another thing I have noticed)
  10. Not understanding the depth and needs of the user base to an adequate degree

In this instance, issues 1, possibly 2, 4, 5, 8, 9, and 10 have created a perfect storm that is now harming every aspect of the institution's communications.

Most of the issues can be overcome with research and testing of a product. I implore IT to take these issues to heart. It costs time and a great deal of money not to do this.