Tracking and Graphs as a Security Mechanism

Being busy means I cannot keep up regular posts or finish the code-heavy topics cluttering my drafts box, many of which just need a piece of code here or there. Theory, however, is easy to write about, and I cannot do much with code that reaches out to the internet interactively while everything is being patched and updated yet again.

Why is this happening yet again? Why is ShellShock allowed to exist in 2014, almost two decades after the world discovered it could do things like cookie-based SQL injection or flat-out cookie manipulation? Why is this still a problem when so many algorithms exist to put an end to this deviant nonsense? Why is every attempt at web security half-cocked?

I have a simple proposal. Track IP addresses and other headers, monitor behavior, use statistical patterns, and attach to each IP address a graph representing the valid paths of pages reachable from the current page, to stop everything from scraping to manipulation. On top of that, implement a system of permissions that makes this feasible and ensures that trusted IP addresses and identifiers really are who they say they are. Deviants are deviants through and through, but we control the environment they operate in. We cannot protect against every attack, but we can make a far better attempt than we do today to keep these people out of our systems; their patterns of behavior are likely to be unusual. We cannot stop everyone, but we can make a better attempt and be safer.

This may require separating the web and test environments entirely, but that is not a bad thing, right? It is easy to implement. Any CS I student worth half a penny can build a de-duplicated, hash-map-backed graph that searches in well under a second (it has been a while since I had to search 1,000,000 nodes in less than 60 seconds). Combine that with the fact that today's machines can have over 100 GB of RAM and terabytes or more of disk, not to mention the processors, and current security mechanisms start to look like David versus Goliath. Does "good enough for government work" apply to hacking, or should the NSA teach Congress how this works?
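
To make the graph idea concrete, here is a minimal sketch in Java; the page names and in-memory maps are hypothetical, and a real deployment would persist this per visitor and combine it with the header tracking and statistical checks described above.

        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.Map;
        import java.util.Set;

        public class PathGuard {

            // De-duplicated adjacency list: page -> pages that may legally follow it.
            private final Map<String, Set<String>> validNext = new HashMap<String, Set<String>>();

            // Last page each IP (or other identifier) was seen on.
            private final Map<String, String> lastPageByIp = new HashMap<String, String>();

            public void allow(String from, String to) {
                Set<String> next = validNext.get(from);
                if (next == null) {
                    next = new HashSet<String>();
                    validNext.put(from, next);
                }
                next.add(to);
            }

            // Returns false when the request does not follow a valid edge from the visitor's last page.
            public boolean checkAndRecord(String ip, String requestedPage) {
                String last = lastPageByIp.get(ip);
                Set<String> next = last == null ? null : validNext.get(last);
                boolean ok = last == null || (next != null && next.contains(requestedPage));
                if (ok) {
                    lastPageByIp.put(ip, requestedPage);
                }
                return ok;
            }
        }

A scraper that jumps straight to deep pages, or replays the same transition thousands of times, immediately falls off this graph and can be throttled, challenged, or banned, and the hash-map lookup keeps the check constant time per request.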

Avalanche Data Part I: Collecting and Analyzing Public Information for Patterns

It has been a goal of mine for a while to collect and analyze publicly available avalanche data to discover patterns and raise awareness. My home state of Colorado is a disastrous combination of climate, tourists, newcomers, and testosterone junkies of varying IQ who give little to no thought before jumping onto our continental slopes. The result can be 20 fatalities in a single winter season. While the availability of public information is appalling, I did manage to wrangle together a large database with hundreds of incidents, millions of weather records, and locations across many different states.

As of today, this data is available via POST request or by visiting my website. Be kind: I will ban anyone bogging down my small pool of resources, and may even retaliate. Use Wireshark or Firebug to decipher the request format. The nonstandard port will hopefully go away once I set up Apache; port forwarding is not allowed by my hosting service, so I needed a bizarre authbind-based Tomcat install that is missing its configuration file.

My goals for this project were simple: establish an automated pipeline for data collection and ETL, combine the data into a relational database, and analyze that database with a mix of open source tools and custom code, with Apache Commons Math handling clustering and some of the statistical analysis.

Attributes I Needed and What I Found

I wished for prediction, so I needed everything from crystal type to weather patterns. Avalanche type, crown, base layer type, depth, path size, destructive force, terrain traps, and a variety of other factors are important. Regression analysis on what I did receive showed previous storms, terrain traps, and the base layer to be the most important factors for size and destructive force.

However, this data was dirty, not cleanable without great expense, and somewhat unreliable. Only two sites reliably reported crown depth, width, and even base layer. Agencies are likely not forthcoming with this information since it relates directly to sales revenue.

Only the weather data, which can be acquired from many government sites, was forthcoming.

Web Framework

I decided to deploy my WAR on Tomcat; this is only my second attempt at Spring. Tomcat is an incredibly fast servlet container, as evidenced by my site, and Spring is a decent framework for handling requests, requiring much less code once set up than most other frameworks. The request mapping in particular is a significant benefit. POST requests are handled with Gson.

Below is a basic request mapping:

        @RequestMapping(value = "/", method = RequestMethod.GET)
        public String home(Locale locale, Model model) {
            // Resolve the caller's IP address from the current request.
            ServletRequestAttributes requestAttributes =
                    (ServletRequestAttributes) RequestContextHolder.currentRequestAttributes();
            String ip = requestAttributes.getRequest().getRemoteAddr();

            // Localhost shows up as the IPv6 or IPv4 loopback address, so substitute a routable IP here.
            if (ip.contains("0:0:0:0:0:0:0:1") || ip.contains("127.0.0.1")) {
                // ip detection / substitution
            }

            // Pull the geolocation helper and the avalanche DAO from the Spring configuration.
            ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext("avy.xml");
            GeoDBPath gpath = (GeoDBPath) ctx.getBean("GPS");
            GeoServicesDB geodb = GeoServicesDB.getInstance(gpath.getFpath());
            ArrayList<Double> coords = geodb.getGPS(ip);

            double latitude = coords.get(0);
            double longitude = coords.get(1);

            AvyDAO avyobjs = (AvyDAO) ctx.getBean("AvyDAO");
            List<Avalanche> avs = avyobjs.findAll(latitude, longitude);

            // Build the HTML table, alternating row styles.
            String outhtml = "";
            // table head
            int i = 0;
            for (Avalanche av : avs) {
                if (i % 2 == 0) {
                    // even table row
                } else {
                    // odd table row
                }
                i++;
            }
            // end table

            model.addAttribute("avyTable", outhtml.replaceAll("null", ""));
            return "home";
        }
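
The POST side mentioned above is only slightly longer. The sketch below is illustrative rather than the site's actual endpoint: AvalancheQuery and its getters are hypothetical stand-ins for whatever the request body actually contains.

        @RequestMapping(value = "/reports", method = RequestMethod.POST)
        @ResponseBody
        public String reports(@RequestBody String body) {
            // Parse the JSON body with Gson; AvalancheQuery is a hypothetical request class.
            Gson gson = new Gson();
            AvalancheQuery query = gson.fromJson(body, AvalancheQuery.class);

            // Look up the DAO the same way the GET handler does and query by the posted coordinates.
            ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext("avy.xml");
            AvyDAO avyobjs = (AvyDAO) ctx.getBean("AvyDAO");
            List<Avalanche> avs = avyobjs.findAll(query.getLatitude(), query.getLongitude());

            // Serialize the matching incidents back to JSON for the caller.
            return gson.toJson(avs);
        }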

The Tools That Worked

Standard deviations and other elementary statistics are easily handled with custom code. Fast clustering algorithms and the more complex math that benefits from an efficient implementation are well covered by Apache Commons Math.

Clustering is of particular interest. Commons Math does not have affinity propagation but does have a quick k-means clusterer, a downer when you want to discover patterns without knowing the relations in advance. However, the number of relations can be estimated using sqrt(n/2) centroids, which is close to the number affinity propagation often chooses. With this method, it is possible to obtain decent relations in the time taken to process a simple POST request.
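
A minimal sketch of that heuristic against Commons Math 3's k-means++ clusterer; the feature vectors are placeholders for whatever avalanche attributes end up in a query.

        import java.util.ArrayList;
        import java.util.List;

        import org.apache.commons.math3.ml.clustering.CentroidCluster;
        import org.apache.commons.math3.ml.clustering.DoublePoint;
        import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

        public class QuickCluster {

            // Cluster feature vectors with k = sqrt(n/2) centroids as a stand-in for affinity propagation.
            public static List<CentroidCluster<DoublePoint>> cluster(double[][] features) {
                List<DoublePoint> points = new ArrayList<DoublePoint>();
                for (double[] row : features) {
                    points.add(new DoublePoint(row));
                }
                int k = Math.max(1, (int) Math.sqrt(points.size() / 2.0));
                KMeansPlusPlusClusterer<DoublePoint> clusterer =
                        new KMeansPlusPlusClusterer<DoublePoint>(k, 100); // k centroids, at most 100 iterations
                return clusterer.cluster(points);
            }
        }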

The Collection Process

Data collection ended up as an automated set of interrelated scrapers, ETL processes, and triggers. Scrapers were set up for nearly every reporting site available, which in practice meant the American Northwest, Alaska, California, and British Columbia were the only regions available for collection. The Colorado Avalanche Information Center and Utah's avalanche center were the best in terms of data, with Utah providing a wide range of attributes. This data was fed to weather-collecting scrapers and finally to an ETL process. I wrapped the entire pipeline in a Spring program and configuration file.
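
Conceptually, that wrapper boils down to a scheduled pipeline. The sketch below is a rough approximation using Spring's scheduling support, not my actual configuration; the step methods and the cron expression are placeholders.

        import org.springframework.context.annotation.Configuration;
        import org.springframework.scheduling.annotation.EnableScheduling;
        import org.springframework.scheduling.annotation.Scheduled;
        import org.springframework.stereotype.Component;

        @Configuration
        @EnableScheduling
        class PipelineConfig { }

        @Component
        class AvalanchePipeline {

            // Run nightly at 02:00 (cron expression is illustrative): incidents first,
            // then the matching weather, then clean and load into the relational store.
            @Scheduled(cron = "0 0 2 * * *")
            public void runNightly() {
                scrapeIncidentReports();
                scrapeMatchingWeather();
                transformAndLoad();
            }

            private void scrapeIncidentReports() { /* pull reports from each avalanche center */ }

            private void scrapeMatchingWeather() { /* pull weather records for each new incident */ }

            private void transformAndLoad() { /* clean, de-duplicate, and insert into the database */ }
        }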

The Final Product and v2.0

My final product is a site that delivers reports on incidents, weather, and other key factors, as well as the opportunity to cluster what little complete data there is in your region. A heat map and a Google map show the incident locations. I hope to include relevant date filters and, eventually, graphs and predictions as the data grows stronger and more numerous. Weather is currently averaged over the two weeks before an avalanche event, though this could grow to accommodate historical patterns. Weather data may be the only solidly available data at present, so those features will likely arrive sooner than predictive ones.

A Plea for Better Data From Avalanche Centers and Why No Predictions Are Available

In the end, I was appalled by the lack of data. People die because they know nothing of the conditions generating avalanches. I myself have felt the concussion wave of an avalanche rippling my clothes from half a mile away. This must end. Selling data should not take precedence over safety. My site suffers at the moment from poor reporting, a lack of publicly available clean data, and simple mis-reportings not caught in ETL. My data set actually shrank in cleaning from 1,000 to 300 avalanches across the entire Northwest.

Still, weather data was incredibly public. The Natural Resources Conservation Service, which sells data, is a powerful tool when combined with the National Oceanic and Atmospheric Administration and Air Force weather station data sets.

Overall, I can only offer public clustering because of this poor data. Over time, this may change as clusters become more distinct and patterns and predictions more powerful; for now, I would feel personally responsible for someone's untimely end. I have tried running multiple regression on this topic before, but the results were dismal. While better than in 2011, the data sets still need improvement.

The Future

I have no intention of stopping collection and will document my development work on the project here. I also plan to document any attempts to develop a device that uses the data it collects and my weather and/or other data to make predictions on the snowpack.

Could Watson Be Used to Dynamically Create Databases?

Data collection, cleaning, and presentation are a pain, especially when dealing with a multitude of sources. When APIs aren't available and every step has been taken to keep people from getting data, it can be incredibly tedious just to acquire it. Parsing, of course, can be made easier by relating terms via a dictionary and using the document's structure to your advantage. At worst it is a few lines of regex or several XPath expressions and more cleaning with Pentaho. I've gone a bit further by enforcing key constraints and naming conventions with the help of Java Spring.
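
As a rough illustration of the XPath route, here is a self-contained example; the table fragment is made up, and a real page usually needs an HTML tidy pass before it parses as XML.

        import java.io.StringReader;

        import javax.xml.parsers.DocumentBuilderFactory;
        import javax.xml.xpath.XPath;
        import javax.xml.xpath.XPathConstants;
        import javax.xml.xpath.XPathFactory;

        import org.w3c.dom.Document;
        import org.w3c.dom.NodeList;
        import org.xml.sax.InputSource;

        public class TableScrape {
            public static void main(String[] args) throws Exception {
                // Made-up fragment standing in for a report page with no API.
                String page = "<table><tr><td>2014-02-12</td><td>Loveland Pass</td></tr>"
                        + "<tr><td>2014-02-19</td><td>Berthoud Pass</td></tr></table>";

                Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new InputSource(new StringReader(page)));

                // One XPath expression pulls the first column out of every row.
                XPath xpath = XPathFactory.newInstance().newXPath();
                NodeList dates = (NodeList) xpath.evaluate("//tr/td[1]", doc, XPathConstants.NODESET);
                for (int i = 0; i < dates.getLength(); i++) {
                    System.out.println(dates.item(i).getTextContent());
                }
            }
        }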

It seems that IBM is making this process a little less time consuming with Watson. Watson appears to have the capacity to find patterns and relations with minimal effort from a CSV or other structured file.

This could really benefit database design by applying a computer to finding and illuminating the patterns that drive key creation and normalization. After all, I would love to focus less on key creation in a maximum-volume industry and more on pumping scripts into my automated controllers. Less work and more productivity per person means more profit.

How Bad Data Collection is Messing Up Data Analysis

Big data is driving the world, but are companies driving big data programs correctly? Here I make an argument for more genericism (now that I know more on this subject after working on it for the past year) and better test data. Basically, my rant from my initial research is now an awesome plug for SimplrTek and SimplrTerms and whichever ABC-style company comes out of SimplrTek research.

Data Collection

I need to make a clarification and a confession: I make up data for my own purposes within my own LLC, but only for testing (a previous statement was a little ambiguous here, as I work for a company and am also trying to create one, futile as that may seem in this increasingly competitive market). This sort of task can hurt a company's bottom line if done wrong, and such data should never be sold.

But how can using this sort of data mess up the building of large-scale, timely algorithms?

Company’s are basing their own decisions on the results of distributions based on samples that may not really be representative or even correct. Algorithms have followed this and are driven and effected in large part by the shape of the data they are built with. They are predictive but work more like exponential smoothing than rectangular or triangular smoothing (they base decisions on what they were trained on in the past). Basically, current approaches often are not adaptive to change or corrective for awful data and, while likely using machine learning, use it in a way that is rigid and inflexible.

The results of making up data and using poor distributions or records can thus have a deeply wrong impact on a company's bottom line. If the distribution says the best way to expand the number of records (testing often occurs on a portion of records) without throwing it off is to create or use a 30-year-old, Camino-driving pot smoker who also happens to be a judge, something is seriously wrong. If your models and algorithms are based on this, your company is screwed; your algorithm may take pot smoking to be the key to what that judge rules. In production, with thousands and even hundreds of thousands of records being requested in a timely manner, there is no time to make sure that the different groups in the data used to build a model are good representations of the groups marked for analysis.

This affects everything from clustering to hypothesis testing (whether or not the test is the result of clustering). How well received would marketing aimed at the MMJ crowd be by our supposedly Camino-driving judge, unless, of course, he really isn't sober as a judge? So, by all means, find a representative sample when building projects and spend the money to purchase good test data.

Bad data is a huge problem.

A good part of the solution is to collect data from the environment related to the specific task. Design better surveys with open-ended questions, keep better track of customers with better database design, centralize data collection, and modernize the process with a decent system and little downtime. It is also possible to just flat out purchase everything. However, this should be incredibly obvious.

Fix a Problem with Responsive Algorithms and Clustering Techniques or Neural Nets

Now for a plug. I am working on algorithms that can help tackle this very problem: generic algorithms that remove intuition, pre-trained modelling, and thus the aforementioned problems from data. Check us out at http://www.simplrterms.com. Our demos are starting to materialize. If you would like to help or meet with us, definitely get in touch.

Still, one thing I am finding, as the bright deer-in-the-headlights look greets related questions, is that people fail to adequately generate test data. Cluster the data on the known factions that will use it, as sketched below. For instance, my pot-smoking judge could be ferreted out by clustering against representative samples of judges and criminals and setting a cosine cutoff so that only test records that fit the judge category well are kept. For more variation, maybe use a neural net trained on good records, blatantly bad records, and records from somewhere in between, and use the same cutoff approach to generate test data.
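
A minimal sketch of that cutoff, assuming records have already been converted to numeric feature vectors and that a centroid for the known "judge" group exists; the feature values and the cutoff are illustrative.

        public class TestDataFilter {

            // Cosine similarity between a candidate record and a group centroid.
            static double cosine(double[] a, double[] b) {
                double dot = 0, na = 0, nb = 0;
                for (int i = 0; i < a.length; i++) {
                    dot += a[i] * b[i];
                    na += a[i] * a[i];
                    nb += b[i] * b[i];
                }
                return dot / (Math.sqrt(na) * Math.sqrt(nb));
            }

            // Keep a generated record only if it sits close enough to the representative centroid.
            static boolean accept(double[] candidate, double[] groupCentroid, double cutoff) {
                return cosine(candidate, groupCentroid) >= cutoff;
            }

            public static void main(String[] args) {
                double[] judgeCentroid = {0.9, 0.1, 0.8}; // hypothetical "representative judge" features
                double[] candidate = {0.2, 0.95, 0.1};    // hypothetical generated record
                System.out.println(accept(candidate, judgeCentroid, 0.85)); // false: too far from the group
            }
        }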

You may ask: why not just make the records by hand? It is time consuming, big data algorithms genuinely need gigabytes or terabytes of data, and with real data you can do things like map or predict fake income ranges, map people to actual locations, and build demos and the like.

Whatever is chosen, a little thought goes a long way.

Morning Joe: Can Computer Vision Technology Help De-Militarize the Police and Provide Assistance?

There has been an explosion of computer vision technology in the past few years, or even the last decade or so considering OpenCV has been around that long. The recent events in Ferguson have created a need to keep the police in line, as well as a need for credible evidence about certain situations.

Many police departments are starting to test programs that place cameras like the snake cams used in the military on officers. While this could be viewed as more militarization, it can also hand departments a black eye when power is abused.

What if lawyers, police, and ethics commissions had a way of recognizing potentially dangerous situations before they happen? What if there were a lightweight solution that allowed data programs to monitor situations in real or near-real time, spot troublesome incidents, and provide alerts when situations were likely to get out of hand? What if potentially unethical situations could be flagged?

The answer is that this is possible without too much development already.

Statistical patterns can be used to predict behaviour long before anything happens; Microsoft and Facebook can predict with unsettling accuracy what you will be doing a year from now. The sad state of the current near-police state is that the government has as much or more data on officers and citizens than Microsoft and Facebook do.

These patterns can be used to narrow the video from those snake cams down to potentially harmful situations for real-time monitoring.

From there, a plethora of strong open source tools can be used to spot everything from weapons to the potential use of force, using the training capabilities of OpenCV and some basic kinematics (video is just a series of quickly taken photos played in order); speech, using Sphinx4 (a work in progress for captchas but probably not for clear speech); and even optical character recognition with pytesser. A bit of image pre-processing and OCR in Tesseract can already break nearly every captcha on the market in under one second with a single core and less than 2 GB of RAM, and the same approach works for corner detection and OCR on a PDF table. Why can't it be used in this situation?
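
To give a sense of scale, the detection side is only a handful of lines with OpenCV's Java bindings (3.x assumed here); the cascade file and video path are placeholders, and a real system would need a model actually trained on weapons or use-of-force cues.

        import org.opencv.core.Core;
        import org.opencv.core.Mat;
        import org.opencv.core.MatOfRect;
        import org.opencv.objdetect.CascadeClassifier;
        import org.opencv.videoio.VideoCapture;

        public class CamMonitor {
            public static void main(String[] args) {
                // The native OpenCV library must be on java.library.path.
                System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

                // Placeholder cascade; a real deployment would train one on the cues of interest.
                CascadeClassifier detector = new CascadeClassifier("cascade_of_interest.xml");
                VideoCapture video = new VideoCapture("body_cam_feed.mp4");

                Mat frame = new Mat();
                while (video.read(frame)) {                 // video is just frames played in order
                    MatOfRect hits = new MatOfRect();
                    detector.detectMultiScale(frame, hits); // look for the trained object in this frame
                    if (hits.toArray().length > 0) {
                        System.out.println("Flagged frame for review: " + hits.toArray().length + " detections");
                    }
                }
                video.release();
            }
        }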

The result in this case should be a more ethical police force and better safety to qualm the fears of officers and civilians alike.

Call me crazy, but we can go deeper than snake cams alone in policing officers and providing assistance. Quantum computing and/or better processors and graphics cards will only make this more of a reality.