ETL 1 Billion Rows in 2.5 Hours Without Paying, on 4 Cores and 7 GB of RAM

There are a ton of ETL tools in the world: Alteryx, Tableau, Pentaho; the list goes on. Of these, only Pentaho offers a quality free version. Alteryx pricing can reach $100,000 per year for a six-person company, and it is awful and awfully slow. Pentaho is not the greatest solution for streaming ETL either, as it is not reactive, but it is a solid choice over the competitors.

How, then, is it possible to ETL large datasets, stream on the same system from a TCP socket, or run flexible computations at speed? This article describes how to do just that using Celery and a tool I am currently working on, CeleryETL.

Celery

Python is clearly an easier language to learn than others such as Scala, Java, and, of course, C++. Together, these languages handle the vast majority of tasks in data science, AI, and mathematics outside of specialized languages such as R, and they are the front runners for building production-grade systems.

In place of the actor model popular in other languages, Python, being more arcane and outdated than the other popular languages, relies on task queues. My own foray into actor systems in Python led to a design that was, in effect, Celery backed by Python's Thespian.

Celery handles tasks through RabbitMQ or other brokers; RabbitMQ's developers claim it can achieve up to 50 million messages per second. Testing that is beyond the scope of this article, but it would theoretically cause my test case to outstrip the capacity of my database to write records. I can only hazard a guess at what it would do to my file system.

Task queues are clunky, just like Python. Still, especially on modern hardware, they get the job done fast, blazingly fast. A task is queued by its module-qualified name, since task modules are loaded into a registry at run time. The queues are processed by a distributed set of workers, each running much like an actor in Akka, and can be managed externally.
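As a rough illustration (a minimal sketch, not CeleryETL itself; the broker URL and task name are placeholders), a task is just a decorated function, and it can be queued purely by its registered name:

    from celery import Celery

    # Placeholder broker URL; point this at your own RabbitMQ instance.
    app = Celery('etl', broker='amqp://guest:guest@localhost:5672//')

    @app.task(name='etl.transform')
    def transform(record):
        # Trivial example transformation: uppercase every string field.
        return {k: v.upper() if isinstance(v, str) else v for k, v in record.items()}

    # The worker loads this module into its task registry, so any process that
    # knows the broker can queue the task by name alone:
    # app.send_task('etl.transform', args=[{'name': 'foo'}])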

Celery allows for task streaming through chains and chords. The technical documentation is quite extensive and requires a decent chunk of time to get through.
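Here is a minimal sketch of both, assuming hypothetical extract, transform, and load tasks defined with the decorator above:

    from celery import chain, chord, group

    # A chain pipes each task's return value into the next task.
    pipeline = chain(extract.s('source_table'), transform.s(), load.s())
    pipeline.delay()

    # A chord runs a group of tasks in parallel and fires a callback
    # once every member of the group has finished.
    batches = [['rec1', 'rec2'], ['rec3', 'rec4']]
    chord(group(transform.s(batch) for batch in batches))(load.s())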

Processing at Speed

Processing in Python at speed requires little more than properly chunking operations, batching record processing appropriately to remove latency, and performing other simple tasks as described in the Akka Streams documentation. In fact, I wrote my layer on Celery using the Akka Streams playbook.

The only truly important operation: chunk your records. When streaming over TCP, this may not be necessary unless connections arrive extremely rapidly; thresholding may be the appropriate solution there. If more connection attempts arrive than can be completed at once, buffer the requests and drain the buffer as each chain completes. I personally found a maximum bucket size of 1,000 appropriate for typical records and 100 for large records, including those containing text blobs.
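A minimal sketch of that batching, using the bucket sizes above (the read_records source and load_batch task are hypothetical):

    def buckets(records, size=1000):
        """Yield lists of at most `size` records."""
        bucket = []
        for record in records:
            bucket.append(record)
            if len(bucket) >= size:
                yield bucket
                bucket = []
        if bucket:
            yield bucket

    # Queue one Celery task per bucket instead of one per record;
    # drop to size=100 for records containing large text blobs.
    for bucket in buckets(read_records(), size=1000):
        load_batch.delay(bucket)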

Take a look at my tool for the implementation. With it, I was able to remap fields, split fields to rows, perform string operations, and write to my Neo4J graph database at anywhere from 80,000 to 120,000 records per second.

Conclusion

While this article is shorter than my others, it is something I felt necessary to write in the short time I had. This discovery allows me to build a single-language system for an entire company on Celery, Neo4J, Django, PyQt, and PyTorch. That is phenomenal, and it is rivaled only by Scala, which is, sadly, dying despite being a far superior, faster, and less arcane language. By all measures, Scala should have won over the data science community, but people detest the JVM. Until that changes, there is Celery.



How Can the Tools of Big Data Detect Malicious Activity?

With Apple in the news, security becoming a large concern, and companies trying new ways to protect their online presence, finding malicious activity has become an exploding topic. The tools of big data offer some deeper insight into how to discover users with bad intentions before data is lost. This article deals with protecting an online presence.

Detection can go well beyond knowing when a bad credit card hits the system or a blocked IP address attempts to access a website.

Similarity: The Neural Net or Cluster

The neural net has become an enormous topic. Today it is used to discern categories in fields ranging from biology to dating, or even terrorist activity. Similarity-based algorithms have come into their own since their inception, largely in the Cold War intelligence game. Yet how different is finding political discussions in conversational data captured at the Soviet embassy, or discovering a sleeper cell in Berlin, from finding a hacker? Not terribly different at the procedural level, actually. Find the appropriate vectors, train the neural net or clustering algorithm, and try to find the clusters representing those with an aim to steal your data; these are your state secrets. With fuzzy c-means, k-means, and RBF neural nets, the line between good and bad doesn't even need to look like a middle school dance.

Here is just a sampling of the traits that could be used in similarity algorithms, which require shaping a vector to train on. Using them in conjunction with data taken from previous hacking attempts, it shouldn't be extremely difficult to flag the riff-raff.

Traits that Can be Useful

Useful traits come in a variety of forms. They can be encoded as a 1 or 0 for a Boolean value such as a known malicious IP (always block these). They could be a Levenshtein distance on that IP. Perhaps the frequency of requests per second is important. They may even be a probability or weight describing the likelihood of belonging to one category or another based on content. Whichever they are, they should be informative for your case, with an eye toward general trends. A sketch of encoding such a vector follows the list below.

  • Types of items purchased: Are they trivial, like a stick of gum?
  • Number of pages accessed while skipping a level of depth on a website: Do they attempt to skip pages despite a viewstate or a typical usage pattern?
  • Number of malformed requests: Are they sending bad headers?
  • Number of each type of error sent from the server: Are malformed requests generating a lot of errors?
  • Frequency of requests to your website: Does it look like a DoS attack?
  • Time spent on each page: Is it too brief to be human?
  • Number of recent purchases: Perhaps they appear to be window shopping?
  • Spam score or another label derived from the sending IP address: Perhaps a common proxy is being used?
  • Validity or threat level of a given email address: Is it a known spam address, or even real?
  • Validity of user information: Do they seem real, or do they live at 123 Main Street and go by Rhinoceros?
  • Frequencies of words that represent code: Is the user always using the word var, or curly braces and semicolons?
  • Bayesian probability of belonging to one category or another based on word frequencies: Are words like var appearing?
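A minimal sketch of shaping such a vector (the field names, blocklist, and thresholds below are made up for illustration):

    KNOWN_BAD_IPS = {'203.0.113.7'}            # example blocklist entry
    CODE_TOKENS = {'var', ';', '{', '}', '='}  # tokens that hint at injected code

    def request_vector(req):
        """Encode one request record as a numeric trait vector."""
        words = req.get('content', '').split()
        code_like = sum(1 for w in words if w in CODE_TOKENS)
        return [
            1.0 if req['ip'] in KNOWN_BAD_IPS else 0.0,   # Boolean trait
            float(req.get('requests_per_second', 0)),     # frequency trait
            float(req.get('malformed_requests', 0)),      # count trait
            float(req.get('seconds_on_page', 0)),         # too brief to be human?
            code_like / max(len(words), 1),               # weight-style trait
        ]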

Traits that May not Be Useful

People looking for your data will try to appear normal, either accessing your site periodically or attempting an attack in one fell swoop. Some traits are therefore less informative, and all traits depend on your particular activity. The traits below may, in fact, be representative, but they likely are not.

  • Commonality of user name: Not extremely informative, but good to study
  • Validity of user information: Perhaps your users actually value their secrecy, and your plans to get to know them are ill-advised

Do not Immediately Discount Traits and Always Test

Not all traits that seem discountable are. Perhaps users value their privacy and provide fake credentials; however, which credentials they provide can be key. More often, such information provides a slight degree of similarity with a bad cluster, or just enough of a nudge in an activation function to tip the scales from good to bad, or vice versa. A confusion matrix and test data should always be used to discern whether the traits you picked are actually informative.
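A minimal sketch of that check, using scikit-learn with synthetic stand-in data (any classifier would do here; the point is the confusion matrix on held-out records):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for labelled traffic: 100 benign (0) and 100 malicious (1)
    # trait vectors drawn from two different distributions.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(3, 1, (100, 5))])
    y = np.array([0] * 100 + [1] * 100)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    model = LogisticRegression().fit(X_train, y_train)

    # Rows are true classes, columns are predictions; large off-diagonal counts
    # mean the chosen traits are not actually separating good from bad.
    print(confusion_matrix(y_test, model.predict(X_test)))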

Bayes, Cosines, and Text Content

Not all attacks can be detected by behaviour. Perhaps a vulnerability is already known. In this case, it is useful to look at Bayesian probabilities and perhaps cosine similarities. Even obfuscated code contains specific keywords. For example, variables in JavaScript are typically declared with var, most languages use semicolons, and obfuscated code is often a one-line mess. Bayesian reasoning says that the frequencies of such items, compared against the frequencies observed in each category, yield a probability of belonging to that category.

If Bayes is failing, then perhaps similarity is useful. Tokens like e and var and characters such as ; or = carry far more weight in code than in ordinary text, so cosine similarity over token-frequency vectors can help separate the two.
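A minimal sketch of both ideas over raw token counts (the tiny corpora here are stand-ins for real request content):

    import math
    from collections import Counter

    code_corpus = Counter('var x = 1 ; var y = x ;'.split())
    text_corpus = Counter('please send me the latest invoice thanks'.split())

    def log_likelihood(tokens, corpus):
        """Naive-Bayes-style score: sum of log P(token | category), add-one smoothed."""
        total, vocab = sum(corpus.values()), len(corpus)
        return sum(math.log((corpus[t] + 1) / (total + vocab)) for t in tokens)

    def cosine(a, b):
        """Cosine similarity between two token-count vectors."""
        dot = sum(a[t] * b[t] for t in a)
        return dot / (math.sqrt(sum(v * v for v in a.values())) *
                      math.sqrt(sum(v * v for v in b.values())))

    suspect = 'var payload = document.cookie ;'.split()
    print(log_likelihood(suspect, code_corpus) > log_likelihood(suspect, text_corpus))
    print(cosine(Counter(suspect), code_corpus), cosine(Counter(suspect), text_corpus))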

Avalanche Data Part I: Collecting and Analyzing Public Information for Patterns

It has been a goal of mine for a while to collect and analyze publicly available avalanche data to discover patterns and raise awareness. My home state of Colorado is a disastrous combination of climate, tourists, newcomers, and testosterone junkies of varying IQ who give little to no thought before jumping on our continental slopes. The result can be 20 fatalities each winter season. While the availability of public information is appalling, I did manage to wrangle together a large database with hundreds of incidents, millions of weather records, and a variety of locations across many different states.

As of today, this data is available via POST request or by visiting my website. Be kind: I will ban people bogging down my small amount of resources and may even retaliate. Use Wireshark or Firebug to decipher the request format. The port in the URL will hopefully go away once I set up Apache; port forwarding is not allowed by my hosting service, and I needed a bizarre install of Tomcat, missing its configuration file, together with authbind.

My goals for this project were simple: establish an automated network for data collection, ETL, and combination of the data into a relational database. That database is then analyzed using a set of open source tools, custom code for statistical analysis, and Apache Commons Math for clustering.

Attributes I Needed and What I Found

I wished for prediction, so I needed everything from crystal type to weather patterns. Avalanche type, crown, base layer type, depth, path size, destructive force, terrain traps, and a variety of other factors are important. Regression on what I did receive showed previous storms, terrain traps, and the base layer to be the most important factors for size and destructive force.
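As a rough sketch of that kind of regression (the column names and sample values below are hypothetical, not my actual database):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical feature columns: storms in the prior two weeks, terrain trap flag,
    # and an ordinal code for the base (weak) layer type.
    X = np.array([[3, 1, 2], [1, 0, 1], [5, 1, 3], [2, 0, 2], [4, 1, 3]])
    y = np.array([2.5, 1.0, 3.5, 1.5, 3.0])   # destructive size (D scale)

    model = LinearRegression().fit(X, y)
    # The coefficients give a first cut at which factors drive destructive force.
    print(dict(zip(['prior_storms', 'terrain_trap', 'base_layer'], model.coef_)))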

However, this data was dirty, not cleanable without great expense, and somewhat unreliable. Only two sites reliably reported crown depth, width, and even base layer. Agencies are likely not forthcoming with this information since it relates directly to sales revenue.

Only the weather data, which can be acquired from many government sites, was forthcoming.

Web Framework

I decided to use Tomcat as the servlet container to deploy my WAR; this is only my second attempt at Spring. Tomcat is incredibly fast, as evidenced by my site. Spring is a decent framework for handling requests, requiring much less code once set up than most other frameworks; the request mapping in particular is of significant benefit. POST requests are handled with Gson.

Below is a basic request mapping:

    @RequestMapping(value = "/", method = RequestMethod.GET)
    public String home(Locale locale, Model model) {
        // Resolve the caller's IP address from the current request.
        ServletRequestAttributes requestAttributes =
                (ServletRequestAttributes) RequestContextHolder.currentRequestAttributes();
        String ip = requestAttributes.getRequest().getRemoteAddr();

        // Localhost shows up as all zeros (or 127.0.0.1), so swap in a usable IP here.
        if (ip.contains("0:0:0:0:0:0:0:1") || ip.contains("127.0.0.1")) {
            // ip detection
        }

        // Load the beans defined in avy.xml.
        ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext("avy.xml");

        GeoDBPath gpath = (GeoDBPath) ctx.getBean("GPS");
        GeoServicesDB geodb = GeoServicesDB.getInstance(gpath.getFpath());
        ArrayList<Double> coords = geodb.getGPS(ip);

        double latitude = coords.get(0);
        double longitude = coords.get(1);

        // Look up avalanches near the caller's location.
        AvyDAO avyobjs = (AvyDAO) ctx.getBean("AvyDAO");
        List<Avalanche> avs = avyobjs.findAll(latitude, longitude);

        String outhtml = "";
        // table head
        int i = 0;
        for (Avalanche av : avs) {
            if ((i % 2) == 0) {
                // even table row
            } else {
                // odd table row
            }
            i++;
        }
        // end table

        model.addAttribute("avyTable", outhtml.replaceAll("null", ""));
        return "home";
    }

The Tools That Worked

Standard deviations and other elementary statistics are handleable with custom code. Fast clustering algorithms and more complex math that benefits from an efficient implementation are handled well by Apache Commons Math.

Clustering is of particular interest. Commons Math does not have affinity propagation but does have a quick k-means clusterer, a downer when you want to discover patterns without knowing the number of clusters in advance. However, the number of clusters can be estimated as sqrt(n/2), which is roughly the count affinity propagation often chooses. With this method, it is possible to obtain decent clusters in the time taken to process a simple POST request.
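My clustering runs on Commons Math, but the sqrt(n/2) heuristic itself is language-agnostic; here is an equivalent sketch in Python with scikit-learn's k-means, just to show the idea:

    import math
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.random.rand(300, 2)              # stand-in for incident feature vectors
    k = max(1, int(math.sqrt(len(points) / 2)))  # sqrt(n/2) heuristic for cluster count

    labels = KMeans(n_clusters=k, n_init=10).fit_predict(points)
    print(k, np.bincount(labels))                # cluster count and sizes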

The Collection Process

Data collection resulted in an automated set of interrelated scrapers, ETL processes, and triggers. Scrapers were set up for nearly every reporting site available, which meant the American Northwest, Alaska, California, and British Columbia were the only regions available for collection. The Colorado Avalanche Information Center and Utah's avalanche center were the best in terms of data, with Utah providing a wide range of attributes. This data was fed to weather-collecting scrapers and finally to an ETL process. I wrapped the entire process in a Spring program and configuration file.

The Final Product and v2.0

My final product is a site that delivers reports on incidents, weather, and other key factors, as well as the opportunity to cluster what little complete data there is in your region. A heat map and Google map show the incident locations. I will hopefully add relevant date filters and, eventually, graphs and predictions as the data grows stronger and more numerous. Weather is currently averaged over the two weeks before an avalanche event, though this could grow to accommodate historical patterns. Weather data may be the only solidly available data at present, so weather features will likely arrive sooner than predictive ones.

A Plea for Better Data From Avalanche Centers and Why No Predictions are Available

In the end, I was appalled by the lack of data. People die because they know nothing of the conditions generating avalanches. I myself have felt the force of a concussion wave rippling my clothes from half a mile away. This must end. Selling data should not take precedence over safety. My site suffers at the moment from poor reporting, a lack of publicly available clean data, and simple mis-reported records not caught in ETL. My data set actually shrank in cleaning from 1,000 to 300 avalanches across the entire Northwest.

Still, weather data was incredibly public. The Natural Resources Conservation Service, which sells data, is a powerful source when combined with the National Oceanic and Atmospheric Administration and Air Force weather station data sets.

Overall, because of this poor data, I can only offer public clustering. Over time this may change as clusters become more distinct and patterns and predictions more powerful; at this point, however, I would feel personally responsible for someone's untimely end. I have tried running multiple regression on this topic before, but the results were dismal. While better than in 2011, the data sets still need improvement.

The Future

I have no intention of stopping collection and will document my development work on the project here. I also plan to document any attempts to develop a device that uses the data it collects and my weather and/or other data to make predictions on the snowpack.

How Bad Data Collection is Messing Up Data Analysis

Big data is driving the world, but are companies driving big data programs correctly? Here I make an argument for more genericism (now that I know more about this subject after working on it for the past year) and better test data. Basically, my rant from my initial research is now a plug for SimplrTek and SimplrTerms, and whichever ABC-style company comes out of SimplrTek research.

Data Collection

I need to make a clarification and a confession: I make up data for my own LLC, but only for testing (a previous statement was a little ambiguous here, as I work for a company and am also trying to create one, futile as that may seem in this increasingly competitive market). It is this sort of task that can hurt a company's bottom line if done wrong, and such data should never be sold.

But why does using this sort of data mess with building large-scale, timely algorithms?

Companies are basing their decisions on distributions built from samples that may not be representative or even correct. Algorithms follow suit: they are driven and affected in large part by the shape of the data they are built with. They are predictive but work more like exponential smoothing than rectangular or triangular smoothing; they weight decisions toward what they were trained on in the past. Basically, current approaches are often not adaptive to change or corrective for awful data and, while likely using machine learning, use it in a way that is rigid and inflexible.
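To make the smoothing analogy concrete, here is a tiny sketch contrasting the two: exponential smoothing keeps dragging its history along, while a rectangular (moving-average) window reflects only recent values.

    def exponential_smoothing(xs, alpha=0.2):
        """Each output leans heavily on the accumulated past."""
        out, s = [], xs[0]
        for x in xs:
            s = alpha * x + (1 - alpha) * s
            out.append(s)
        return out

    def rectangular_smoothing(xs, window=3):
        """Each output reflects only the last few observations."""
        return [sum(xs[max(0, i - window + 1):i + 1]) / (i - max(0, i - window + 1) + 1)
                for i in range(len(xs))]

    data = [1, 1, 1, 1, 10, 10, 10, 10]    # the underlying distribution shifts
    print(exponential_smoothing(data))      # drifts slowly toward the new regime
    print(rectangular_smoothing(data))      # adapts within a few observations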

The results of making up data or using poor distributions or records can thus have a deeply damaging impact on a company's bottom line. If the distribution shows that the best way to expand the number of records (testing often occurs on a portion of the records) without throwing it off is to create or use a 30-year-old, Camino-driving pot smoker who also happens to be a judge, something is seriously wrong. If your models and algorithms are based on this, your company is screwed: your algorithm may take pot smoking to be the key to how that judge rules. In production, with thousands or even hundreds of thousands of records being requested in a timely manner, there is no time to make sure that the different groups in the data used to build a model are good representations of the groups marked for analysis.

This affects everything from clustering to hypothesis testing (whether or not the hypothesis comes from clustering). How well received would marketing aimed at the MMJ crowd be by our supposedly Camino-driving judge, unless, of course, he really isn't sober as a judge? So, by all means, find a representative sample when building projects and spend the money to purchase good test data.

Bad data is a huge problem.

A good part of the solution is to collect data from the environment related to the specific task. I would say design better surveys with open-ended questions, keep better track of customers with better database design, centralize data collection, and modernize the process with a decent system and little downtime. It is also possible to just flat-out purchase everything. However, this should be incredibly obvious.

Fix a Problem with Responsive Algorithms and Clustering Techniques or Neural Nets

Now for a plug. I am working on algorithms that can help tackle this very problem: generic algorithms that remove intuition, pre-trained modelling, and thus the aforementioned problems from the data. Check us out at http://www.simplrterms.com; our demos are starting to materialize. If you would like to help or meet with us, definitely contact us.

Still, one thing I am finding, as the bright deer-in-the-headlights look greets related questions, is that people fail to adequately generate test data. Cluster the data on the known factions that will use it. For instance, my pot-smoking judge could be ferreted out by clustering against representative samples of judges and criminals and setting a cosine cutoff distance, keeping only test records that fit the judge category well. For more variation, maybe use a neural net trained on good records, blatantly bad records, and records somewhere in between, and use the same cutoff approach to generate test data.
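A minimal sketch of that cutoff approach (the vectors and the 0.9 threshold are made up for illustration):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Representative "judge" records, encoded as numeric vectors.
    judges = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.15, 0.05]])
    centroid = judges.mean(axis=0)

    # Candidate synthetic records: keep only those close enough to the judge centroid.
    candidates = np.array([[0.88, 0.12, 0.02], [0.1, 0.7, 0.9]])
    cutoff = 0.9
    accepted = [c for c in candidates if cosine(c, centroid) >= cutoff]
    print(accepted)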

You may ask why not just make the records by hand. It is time-consuming: big data algorithms need gigabytes or terabytes of data, and with realistic data you can do things like map or predict fake income ranges, map people to actual locations, and build demos and the like.

Whatever is chosen, a little thought goes a long way.

Morning Joe: Big Data Tools for Review

Big data is a big topic, and many tools are starting to spring up. However, implementations of the old SQL standard are falling far behind these tools. My recent post on using a fork-join pool for inserting and pulling data can help if you are using PostgreSQL, but it also destroys bandwidth; it causes headaches for everyone whenever I need to do some serious checking of my ETL, parsing, normalization, or other data work outside of our co-location. Although this is nothing new, I've compiled a list of tools to go forth and make little rocks from big rocks all day.

I've found some companies and tools for a bigger review later on; this is my pre-work post while my IDE starts up. So far, I have found technologies for SQL databases that use Hadoop, fractal trees, and Cassandra to speed up the process. They are not focused specifically on speed but can help create faster database access and lower coding time.

What I’ve found so far:

  • Cassandra (open source): promises scalability and availability alongside a plethora of features (perhaps as a base for implementing other tools)
  • Oracle Data Integration Adapter for Hadoop: promises the speed of Hadoop connected to an Oracle database
  • BigSQL (open source): promises to combine Cassandra, PostgreSQL, and Hadoop into a blazing fast package for analysis
  • MapR Technologies (somewhat open source): offers a wide variety of products to improve speed in querying and analysis, from Hive and MapReduce to actual Hadoop
  • Fractal tree indexing (open source): Tokutek's fractal tree indexing speeds up insertions using buffers on each tree node
  • Alteryx: a tool for quicker data processing, though not quite as fast as the others (good if your budget does not allow clusters but allows something better than Pentaho)
  • MongoDB (open source): combines map-reduce and other technologies with large databases; Tokutek tested its fractal tree indexes on MongoDB
  • Pentaho (open source): the open source alternative to Alteryx

Many of these tools are already implemented in others, such as Pentaho. Personally, I would like to see a SQL-like language that uses these tools alongside a query processor. It would make these tasks even faster to write; think Java vs. Python. You could have 10 lines of MapReduce code, 5 minutes of click-and-drag, or a one-line, easy-going query that writes as you think.

To be clear, I am not ranking these, only marking them for future review since this is what has piqued my interest today. Cheers!