With Apple in the news, security becoming a major concern, and companies trying new ways to protect their online presence, finding malicious activity has become an exploding topic. One area, similarity analysis, offers deeper insight into how to discover users with bad intentions before data is lost. This article deals with using it to protect an online presence.
Detection can go well beyond knowing when a bad credit card hits the system or a certain blocked IP address attempts to access a website.
Similarity: The Neural Net or Cluster
The neural net has become an enormous topic. Today it is used to discern categories in fields ranging from biology to dating or even terrorist activity. Similarity-based algorithms have come into their own since their inception, largely in the Cold War intelligence game. Yet how different is finding political discussions in conversational data captured at the Soviet embassy, or discovering a sleeper cell in Berlin, from finding a hacker? Not terribly different at the procedural level, actually. Find the appropriate vectors, train the neural net or clustering algorithm, and try to find clusters representing those with an aim to steal your data. These are your state secrets. With Fuzzy C Means, K Means, and RBF neural nets, the line between good and bad doesn’t even need to be straight, like the divide at a middle school dance.
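The procedure above (build vectors, cluster, then look for the cluster that resembles past attackers) can be sketched in a few lines. This is a minimal illustration, not a production detector: plain K Means stands in for any of the algorithms named above, and the two features (requests per second, malformed requests), the sample visitors, and the `known_bad` vector are all made-up assumptions.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain K-Means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignments


# Hypothetical visitors: (requests per second, malformed requests).
visitors = [(0.2, 0), (0.3, 1), (0.25, 0), (0.4, 0),
            (9.0, 14), (11.5, 20), (10.2, 17)]

# Hypothetical vector taken from a previous hacking attempt.
known_bad = (10.0, 15)

centroids, labels = kmeans(visitors, k=2)

# The cluster whose centroid sits closest to the known attack is the suspect one.
bad_cluster = min(range(2), key=lambda c: math.dist(known_bad, centroids[c]))
flagged = [v for v, a in zip(visitors, labels) if a == bad_cluster]
```

In real data the features would need scaling so that one large-valued trait does not dominate the distance, and the number of clusters is something to test rather than assume.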
Here is just a sampling of the traits that could be used in similarity algorithms, which require shaping a vector to train on. Used in conjunction with data taken from previous hacking attempts, they should make it not extremely difficult to flag the riff raff.
Traits that Can Be Useful
Useful traits come in a variety of forms. They can be encoded as a 1 or 0 for a Boolean value such as a known malicious IP (always block these). They could be a Levenshtein distance between an IP and a known malicious one. Perhaps a frequency, such as the number of requests per second, is important. They may even be a probability or weight describing the likelihood of belonging to one category or another based on content. Whichever they are, they should be informative to your case with an eye towards general trends.
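As a rough sketch of how such encodings might sit side by side in one vector, the snippet below combines a Boolean flag, a Levenshtein distance, a request frequency, and a probability. The blocklist, the function name `trait_vector`, and its parameters (including `p_code_like`, a probability assumed to come from some other classifier) are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical blocklist; the addresses come from documentation-only ranges.
KNOWN_BAD_IPS = {"203.0.113.7", "198.51.100.23"}


def trait_vector(ip, request_count, window_seconds, p_code_like):
    """Shape one visitor's traits into a numeric vector."""
    is_known_bad = 1.0 if ip in KNOWN_BAD_IPS else 0.0
    nearest_bad = float(min(levenshtein(ip, bad) for bad in KNOWN_BAD_IPS))
    requests_per_second = request_count / window_seconds
    return [is_known_bad, nearest_bad, requests_per_second, p_code_like]


vector = trait_vector("203.0.113.9", request_count=120,
                      window_seconds=60, p_code_like=0.8)
```

The components live on very different scales, so some normalization is usually worth testing before the vector is handed to a distance-based algorithm.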
- Types of Items purchased : Are they trivial like a stick of gum?
- Number of Pages Accessed while skipping a level of depth on a website : Do they attempt to skip pages despite a viewstate or a typical usage pattern?
- Number of Malformed Requests : Are they sending bad headers?
- Number of Each Type of Error Sent from the Server : Are there a lot of errors caused by malformed requests?
- Frequency of Requests to your website : Does it look like a denial-of-service (DoS) attack?
- Time spent on each Page : Is it too brief to be human?
- Number of Recent Purchases : Perhaps they appear to be window shopping
- Spam level or another derived level usually associated with an IP address : Perhaps a common proxy is being used?
- Validity or threat of a given email address : Is it a known spam address or even real?
- Validity of user information : Do they seem real or do they live at 123 Main Street and are named Rhinoceros?
- Frequencies of Words that Represent Code : Is the user always using the word var, curly braces, and semicolons?
- Bayesian probability of belonging to one category or another based on word frequencies : Are words like var appearing?
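The last trait in the list, a Bayesian probability derived from word frequencies, might be computed along these lines. This is a small multinomial naive Bayes sketch with add-one smoothing; the class name, the tokenizer pattern, the two labels, and the four training strings are illustrative assumptions, and a real model would need far more data.

```python
import math
import re
from collections import Counter

# Words plus a few punctuation marks that are common in code (an assumed choice).
TOKEN = re.compile(r"[A-Za-z_]+|[;={}()]")


def tokenize(text):
    return TOKEN.findall(text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over token counts with add-one smoothing."""

    def __init__(self):
        self.token_counts = {}       # label -> Counter of tokens
        self.doc_counts = Counter()  # label -> number of training texts

    def train(self, text, label):
        self.token_counts.setdefault(label, Counter()).update(tokenize(text))
        self.doc_counts[label] += 1

    def probabilities(self, text):
        vocabulary = set().union(*self.token_counts.values())
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # prior
            for token in tokenize(text):
                score += math.log(
                    (counts[token] + 1) / (total_tokens + len(vocabulary))
                )
            log_scores[label] = score
        # Convert log scores to probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


model = NaiveBayes()
model.train("var x = 1; while (x) { x = x + 1; }", "code")
model.train("select name from users; drop table users;", "code")
model.train("I love these shoes and the fast shipping", "prose")
model.train("Great price, would buy again", "prose")

code_like = model.probabilities("var payload = {};")
prose_like = model.probabilities("love the shoes")
```

The resulting `code_like["code"]` value is the kind of probability that could be dropped into a trait vector as a single component.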
Traits that May Not Be Useful
People looking for your data will try to appear normal, whether they access your site periodically or attempt an attack in one fell swoop. Some traits may therefore be less informative, and all traits depend on your particular activity. The traits below may in fact be representative but likely are not.
- Commonality of User Name : Not extremely informative but good to study
- Validity of user information : Perhaps your users actually value their secrecy and your plans to get to know them are ill-advised
Do Not Immediately Discount Traits and Always Test
Not all traits that seem discountable are. Perhaps users value their privacy and provide fake credentials. However, which credentials they provide can be key. More often, such information provides a slight degree of similarity with a bad cluster, or just enough of an edge in an activation function, to tip the scales from good to bad or vice versa. A confusion matrix and test data should always be used in discerning whether the traits you picked are actually informative.
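A confusion matrix is simple enough to compute by hand. The sketch below assumes two labels, "good" and "bad", and uses made-up test labels and predictions; the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive="bad"):
    """Count true/false positives and negatives for one positive label."""
    cells = Counter(tp=0, fp=0, fn=0, tn=0)
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            cells["tp" if truth == positive else "fp"] += 1
        else:
            cells["fn" if truth == positive else "tn"] += 1
    return cells


def precision_recall(cells):
    """Precision: how many flagged users were bad. Recall: how many bad users were flagged."""
    flagged = cells["tp"] + cells["fp"]
    truly_bad = cells["tp"] + cells["fn"]
    precision = cells["tp"] / flagged if flagged else 0.0
    recall = cells["tp"] / truly_bad if truly_bad else 0.0
    return precision, recall


# Made-up held-out labels and the model's guesses for the same users.
actual = ["bad", "bad", "good", "good", "good", "bad"]
predicted = ["bad", "good", "good", "bad", "good", "bad"]

cells = confusion_matrix(actual, predicted)
precision, recall = precision_recall(cells)
```

One way to judge a questionable trait is to train twice, once with it and once without, and compare the two matrices on the same test data.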
Bayes, Cosines, and Text Content
If Bayes is failing, then perhaps a similarity measure such as cosine similarity is useful. Words like e and var and characters such as ; or = may carry more weight in code than in ordinary text.
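A minimal sketch of cosine similarity over token counts follows. The tokenizer pattern and the two reference texts are assumptions chosen for illustration; in practice the references would be built from real samples of code-like and ordinary input.

```python
import math
import re
from collections import Counter

# Words plus a few punctuation marks that are common in code (an assumed choice).
TOKEN = re.compile(r"[A-Za-z_]+|[;={}()]")


def token_counts(text):
    return Counter(TOKEN.findall(text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical reference texts for each category.
reference_code = token_counts("var e = 1; try { run(e); } catch (e) { log(e); }")
reference_prose = token_counts("I would like to return these shoes for a refund")

incoming = token_counts("var user = {}; send(user);")

similarity_to_code = cosine_similarity(incoming, reference_code)
similarity_to_prose = cosine_similarity(incoming, reference_prose)
```

Because cosine similarity looks at the direction of the count vector rather than its length, a short snippet and a long script with the same mix of tokens score as alike.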