Opinion: Why Self-Driving Cars Will Fail in Customer Service

All too often, we are faced with a technology that feels innovative and destined to upend an entire economy. That seems nearly certain with delivery, right? Wrong. If anything, the next 20 years will see self-driving vehicles fail critically in delivery, an industry far less suited to them than long haul trucking, train driving, and other even more mundane or menial tasks.

So many decent initiatives in technology fail because they do not consider the most important end games: people, security, and cost. In the self-driving delivery game, that comes down to the debate over base pay versus maintenance and LIDAR costs, security, and the customer service side.

There is one important reason why delivery is the wrong industry for self-driving cars: customer service.

No matter how many times a person is called, how hard the door is knocked, or how friendly the front desk is, 25% or more of deliveries require problem solving well beyond route finding, either to provide quality customer service or simply to complete a sale. That is on top of the more basic point, itself crucially important, that people enjoy interacting with other people more than with vehicles.

Even a UPS driver will run into the person who is simply too unaware of their surroundings, or too distrusting of their phone, to answer the door.

Then there is the slew of problems that cannot be solved by a car that will not fit through the front entrance, the same problems that can drive the average tip over $5:

  • some high-end establishments require talking to the front desk and arranging pickup for ‘security’ reasons
  • some people require two or three approaches before they can be reached
  • an increasing number of gated neighborhoods hide miles of housing
  • people need an outlet to complain, and the busy storefront will just hang up on them; they get their satisfaction by lodging the complaint with the driver
  • stores tend to shift blame for failed quality control onto drivers, treated as near independent contractors, so the blame never lands on the store
  • many people, whether they consciously realize it or not, order delivery precisely because someone comes to their door with good etiquette, a smile, and an assurance that their order is 100% correct
  • people tend to feel more secure when a trusted agent is in control of their goods and makes that fact known
  • wrong orders happen to everyone and everything, but good quality control catches them; pizza chains follow CASQ better than most IT companies and achieve close to a 95% success rate

Imagine if every store lost even 15% of its business. Chains and restaurants paying drivers are already stretched to capacity within a delivery radius that cannot be changed. That 15% will hurt, and possibly close, a store.

Consider, next, the lesser factor: maintenance and vehicle costs.

The cost of maintaining an electric, self-driving car is in fact higher than the cost of maintaining a fleet of drivers. If the average driver is paid $9.50 per hour on average ($7.25 when tips do not make up the difference, or $10.25 per hour in store at a better store that attracts better quality employees, plus $1 per mile), and the cost of LIDAR maintenance alone is roughly equal to that before counting the additional IT support staff and technicians, the store loses money.

That is not to mention the expenditure per car. The base delivery vehicle in the tech company’s target industry costs the store $0; the driver provides the vehicle. The cost per vehicle for a fleet of 13 cars running constantly is much higher, even if that cost drops significantly by the time of writing. There is the $17,000 base price per vehicle; the breakdown and maintenance of even electric vehicles at $120 per month; electricity at $30 per vehicle per month; wifi at $100 per month per vehicle; $500 per month for an entire phone line setup for the fleet; and wear and tear at $0.50 per mile due to the technical nature of the vehicle. Gasoline will run between $20 and $40 per night if the vehicle is gasoline fueled. A new car will be required every 4 years. Repairs, not even counting LIDAR, can add thousands of dollars per year as the vehicle ages.
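To make that arithmetic concrete, here is a rough back-of-the-envelope sketch using the figures above; the miles-per-month value and the even split of the phone line across the 13-car fleet are my own illustrative assumptions.

# Rough per-vehicle monthly cost from the figures quoted above.
# miles_per_month and the even fleet split are illustrative assumptions.
VEHICLE_BASE = 17000.0      # purchase price, replaced every 4 years
MAINTENANCE = 120.0         # per vehicle per month
ELECTRICITY = 30.0          # per vehicle per month
WIFI = 100.0                # per vehicle per month
PHONE_SETUP_FLEET = 500.0   # per month for the whole fleet
FLEET_SIZE = 13
WEAR_PER_MILE = 0.50

def monthly_cost_per_vehicle(miles_per_month=1500):
    amortized_purchase = VEHICLE_BASE / (4 * 12)      # spread the car over 4 years
    shared_phone = PHONE_SETUP_FLEET / FLEET_SIZE     # fleet phone cost split per car
    wear = WEAR_PER_MILE * miles_per_month
    return amortized_purchase + MAINTENANCE + ELECTRICITY + WIFI + shared_phone + wear

print(round(monthly_cost_per_vehicle(), 2))           # roughly $1,393 per car per month at 1,500 miles

Even before gasoline or major repairs, that per-car figure is well above what the same store pays out per driver-hour.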

Finally, let’s consider security. In the past few years, the military has reportedly lost at least two drones to Iran through relatively simple electronic attacks. How hard is it to lob requests at a car’s network? Not difficult. Ask Charlie Miller.

If every delivery drone carried $1,000 in cash on a good day and there were dozens of drones to hack, that is more money than even a help desk employee makes in an entire year. Most attacks can simply be bought on the dark web; there is far less skill involved in hacking than there used to be.

In sum, delivery is the wrong industry to target with self-driving vehicles.

There is a reason delivery drivers in my city, Denver, can earn $21 to $25+ per hour (delivery is my part-time go-to when starting a company, so I have seen this firsthand). That reason is not the store, which usually pays around $9 per hour per driver. The reason is the bizarre nuances of the delivery game.

It does not matter whether you are driving for Brown (UPS), FedEx, or Papa Johns. That extra mile can make your business.

The right industry will always be long haul trucking, train engineering, and, to a lesser extent, air travel. Anywhere the task is more monotonous and there is no customer service, people are more replaceable. For the food industry, that means the back of house.


A New Kind of CRM

CRM software today is not oriented toward IT, creative, technology, or manufacturing work. Aside from a plugin for JIRA, there is not much dedicated to project and lead generation in the CRM space. This article introduces an effort my company, Simplr Insites, LLC, is committed to: a CRM that not only separates clients from infrastructure project workflow tools, but that can adequately group tickets for generic project creation and provide useful statistics.

This project is being deployed by Simplr Insites, LLC for use with our clients. The tool will likely grow other features that supplement JIRA, such as workflow test integration with our backend tool.

The Problem

IT solutions in the CRM, manufacturing, construction, and technology space are oddly lacking. There are project management tools and a few high-priced options, but most stick to the old chat server systems and barely scrape the surface of lead generation.

So what is needed in this space:

  • the ability to converse, which has already been beaten to death
  • the ability to manage a project, which is fairly well covered
  • the ability to spot similar tickets, problems, and issues to limit redundancy
  • the ability to generate grouped leads based on these tickets
  • the ability to share samples
  • role separation directly related to the space

The Solution

Simplr Insites, LLC is building an open source attempt to get around this issue. After all, we are a low-profit company serving non-profits and public goods companies. We aim to tackle all of these issues using Python, Flask, JavaScript, jQuery, Bootstrap, and connections to CeleryETL.

This project:

  • creates the conversational and ticket management tools we all love
  • adds the analytics we need for lead generation and genericism through NLP and AI
  • separates clients, project managers, developers, and observers with varying levels of access
  • allows sample sharing
  • allows data validation tasks to be kicked off on CeleryETL, because we need it and not because it is related
  • other tasks

It is 2018; let’s create something useful.

Conclusion

Want to help? Join us on GitHub. While you're at it, check out CeleryETL, an entire distributed system that Simplr Insites, LLC currently uses to handle ETL, analytics, backend, and frontend tasks.

NotePad: A NickName Recognition Idea

So this is just an idea. Nicknames can apparently be attached to multiple formal names, which means we could build a graph of nicknames, and also that a direct lookup list may be tricky. Here is a quick thought on how to approach a solution.

 

A. Graph the Names and their known connections to formal names.

B. Attach probability data to each node specifying which characteristics trigger that resolution (Markovian, in a sense)

C. Start a Server accepting requests

D. When a request comes in, first look for all potential connections from the nickname to a formal name; if there are none, save the nickname as a new node. The search can be performed depth first or breadth first, or even split across random starting points on the graph in a more stochastic manner such as a random walk (I will need to study these more). This could also be done by simply choosing multiple random search points.

E. If there is one node, return the formal name

F. If there are multiple names, take one of several approaches:

1. Calculate Bayesian probabilities given the specified characteristics and return the strongest match (this is the weakest solution).

2. Train a neural net with an appropriate kernel (RBF, linear, etc.) and return the result from this net. (This is slow, as keeping a million neural nets in storage seems like a bad idea.)

When generating stand-alone nodes, it may be possible to use Levenshtein distance and other characteristics to attach nickname nodes to formal nodes based on a threshold. A clustering algorithm could use the formal name averages as cluster centers, and a hard cutoff (e.g., a Levenshtein distance of 1 or 2) could solidify the connection and throw out Type I and Type II errors.
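As a rough sketch of the idea, and only a sketch, the graph could start as a simple adjacency map from nicknames to formal names, with unknown nicknames attached by a Levenshtein cutoff; the seed names and the threshold of 2 below are illustrative assumptions, not a finished design.

# Illustrative sketch: nickname-to-formal-name graph with a Levenshtein cutoff.
# The seed data and the threshold of 2 are assumptions for demonstration only.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

# Known nickname -> formal name edges (a tiny seed graph).
graph = {
    "bob": {"Robert"},
    "rob": {"Robert"},
    "liz": {"Elizabeth"},
    "beth": {"Elizabeth"},
}

def resolve(nickname, threshold=2):
    """Return candidate formal names, attaching unknown nicknames by edit distance."""
    key = nickname.lower()
    if key in graph:
        return graph[key]
    # No direct edge: attach to the closest known nickname within the cutoff.
    best = min(graph, key=lambda known: levenshtein(key, known))
    if levenshtein(key, best) <= threshold:
        graph[key] = set(graph[best])   # solidify the new connection
        return graph[key]
    graph[key] = set()                  # save as a new, unresolved node
    return graph[key]

print(resolve("bobb"))   # close to "bob" -> {'Robert'}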

Stay tuned while I flesh this post out with actual code. It may remain just a proposal for a while.

Shooting the Gap Statistic with Map Reduce

Finding the right k in k-means, n in LSA, n in LSI, or whichever grouping algorithm you use is fuzzy, especially with text. However, Tibshirani, Walther, and Hastie developed a superb algorithm, the gap statistic, to deal with this. It is much better than hunting for an inflection point in your data, since it handles abnormalities and compares your clustering against a proper reference distribution.

An easier understanding of the algorithm can be found through a blog post from the Data Science Lab.

This algorithm is intensive but, luckily, map-reduce exists. Splitting the procedure among a large number of machines helps enormously. Python and Scala offer map-reduce and concurrency frameworks that can handle fairly significant amounts of data either directly or via a framework such as Celery.

Scala’s callback mechanism is a personal favorite of mine, although the tools for experimentation are not as readily available as they are in Python, where tools such as gensim and scikit-learn are readily available and easy to test.

Using map-reduce for this algorithm is almost a necessity for anything more than a small data set, especially for text data. Unlike the Pham et al. algorithm, the Tibshirani gap statistic does not rely on a previous run's output at all, so it can be spread over a large number of machines to obtain a comparable result (use Pham for smaller data sets, say fewer than 30,000 text documents, that fit on a single machine).

The following is an example of the fastest way I have run this algorithm in Python. Whereas a single-threaded version ran 20 tests in 10 minutes, the following tests ran at 10.4 tests per minute on a 3.5 GHz Core i7 with 15 GB of RAM.

The proper tradeoffs must be made between speed and memory.

Firstly, Python offers a multiprocessing version of map. The map function is applied using a multiprocessing pool, which is easy to set up.


from multiprocessing import Pool
import psutil

# One worker per logical CPU; tune this for your workload.
pool=Pool(psutil.cpu_count(logical=True))

You will need to fiddle with the number of processes handed to Pool; the logical CPU count may not be the best choice for your data set.

This pool can be used (please read the Data Science Lab article first) to find Wk, run multiple KMeans fits, and split work up among other processes.

Since I am text mining, my Wk and distance functions become the following. I will not be releasing my text-based KMeans here, sorry, but note that it involves a special function for finding clusters based on a dissimilarity rating.

  
import scipy
import scipy.sparse
from scipy.spatial.distance import cosine

def Wk(X):
    '''Within-cluster dispersion for one cluster (X is a sparse matrix of row vectors).'''
    if X.shape[0]>1:
        try:
            # the 1/(2n) pairwise factor cancels with the 2n from using distances to the centroid
            mul=float(1/float(2*X.shape[0]))*float(2*X.shape[0])
            mvect=scipy.mean(X,axis=0)[0].todense()
            mp=map(lambda x: float(mul*angulareDistance(x[0].todense(),mvect)),X)
            res=sum(mp)
        except Exception:
            res=0
    elif X.shape[0] == 1:
        res=1
    else:
        res=0
    return res

def calculateQueuedCosines(x):
    '''Return the index of the centroid in x[1] most similar to the vector x[0].'''
    d=[]
    for j in xrange(len(x[1])):
        d.append(cosine(x[0].todense(),x[1][j]))
    return d.index(max(d))

import gc
import numpy
import scipy.sparse
from sklearn.cluster import KMeans

def performWK(X):
    '''
    This function performs WK for a map and requires an input matrix,predictions from KMeans, and a k value.
    0=>ref
    1=>preds
    2=>K
    '''
    preds=X[1]
    results=[]
    for currk in range(X[2]):
        set=None
        for i in range(X[0].shape[0]):
            if preds[i] == currk:
                if set is None:     
                    set=scipy.sparse.csr_matrix(X[0][i])
                else:
                    set=scipy.sparse.vstack((set,X[0][i]))
        res=0
        try:
            if set is not None:
                if set.shape[0]>1:
                    try:
                        mvect=set.sum(axis=0)/set.shape[0]
                        mp=map(lambda x: float(angulareDistance(x[0].todense(),mvect)),set)
                        res=sum(mp)
                    except:
                        res=0
                elif set.shape[0] == 1:
                    res=1
        except Exception,e:
            print str(e)
        results.append(res)
        gc.collect()
        del gc.garbage[:]
    return results

def performTest(X):
    '''
    For performing tests in separate processes.
    
    Requires a tuple with
    0=>ks
    1=>mtx
    '''
    kmn=KMeans()
    kmn.n_clusters=X[0]
    kmn.fit(X[1])
    preds=kmn.predict(X[1])
    return preds
	
def mapWKResults(X):
    return numpy.log(sum(numpy.nan_to_num(X)))

I then use these functions to split the intense processing up among multiple cores. Keep in mind that this is designed for text data: the distance formula becomes the angular distance derived from the cosine similarity, 2*acos(cos(veca, vecb))/pi, whose complementary similarity is 1 - 2*acos(cos(veca, vecb))/pi.
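The angulareDistance helper belongs to the unreleased text-based KMeans, so it is not shown here. As a hedged stand-in, here is a minimal sketch of what such a helper might look like, assuming it implements the angular relation just described; the name and signature come from the calls above, everything else is an assumption.

# Hypothetical sketch of the angulareDistance helper used above, assuming it
# implements the angular distance 2*acos(cosine similarity)/pi for dense row vectors.
import math
import numpy

def angulareDistance(veca, vecb):
    '''Angular distance in [0, 1] for non-negative vectors: 0 for identical directions.'''
    a = numpy.asarray(veca).ravel()
    b = numpy.asarray(vecb).ravel()
    denom = numpy.linalg.norm(a) * numpy.linalg.norm(b)
    if denom == 0:
        return 1.0  # treat empty vectors as maximally distant
    cos_sim = float(numpy.dot(a, b) / denom)
    cos_sim = max(-1.0, min(1.0, cos_sim))  # clamp for numerical safety
    return 2.0 * math.acos(cos_sim) / math.pi

With non-negative tf-idf vectors the cosine stays between 0 and 1, so this distance stays between 0 and 1 as well.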

  
def discoverCategories(self,file,refs=20,maxTime=600,ks=range(1,500),mx=20):
        '''
        A 1 to n estimator of the best categories using cosines to be used as  a backup.
        This is a sci-kit learn /python/scipy learning experience as well.
        
        Reference: https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/
        '''
        pool=Pool(psutil.cpu_count(logical=False))
        vectorizer=CountVectorizer(analyzer='word', binary=False, decode_error='strict',dtype=numpy.int64, encoding='utf-8', input='content',lowercase=True, max_df=1.0, max_features=None, min_df=1,ngram_range=(1, 1), preprocessor=None, stop_words=None,strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',tokenizer=None, vocabulary=None)
        logging.info("Starting KMN "+datetime.datetime.fromtimestamp(time.time()).strftime("%m/%d/%Y"))
        kmn=KMeans()
        lines=[]
        with open(file,'rb') as fp:
            lines=fp.read().split("\n")
            
        if len(lines)>0:
            mtx=vectorizer.fit_transform(lines)
            tfidf=TfidfTransformer(norm='l2',smooth_idf=True,sublinear_tf=False,use_idf=True)
            mtx=tfidf.fit_transform(mtx)
            del tfidf
            del vectorizer
            lines=[]
            del lines
            gc.collect()
            del gc.garbage[:]
            
            (ymin, xmin), (ymax, xmax)=self.bounding_box(mtx)
            Wks=[]
            Wkbs=[]
            sk=[]
            
            for k in ks:
                #get the overall WK
                print "Testing at Cluster "+str(k)
                tempk=[]
                kmn.n_clusters=k
                kmn.fit(mtx)
                preds=kmn.predict(mtx)
                sets=[]
                for currk in range(k):
                    set=None
                    for i in range(mtx.shape[0]):
                        if preds[i]==currk:
                            if set is None:     
                                set=scipy.sparse.csr_matrix(mtx[i])
                            else:
                                set=scipy.sparse.vstack((set,mtx[i]))
                    if set is not None and set.shape[0]>0:
                        sets.append(set)
                res=pool.map(Wk,sets)
                Wks.append(numpy.log(sum(res)))

                del set
                gc.collect()
                del gc.garbage[:]
                
                BWkbs=[]
                #generate individual sets and calculate Gap(k)
                mres=[]
                for i in range(refs):
                    print "Setting Up Test "+str(i)
                    ref=scipy.sparse.rand(xmax,ymax,format='csr')             
                    mres.append(ref)
                    if len(mres) == mx or (i+1 == refs and len(mres)>0):
                        print "Performing Async Tests" #if we were to create our own distributed framework with pyHFS,asyncore, and a db
                        preds=pool.map(performTest,[[k,x] for x in mres])
                        res=pool.map(performWK,[[mres[j],preds[j],k] for j in range(len(mres))])
                        BWkbs.extend(pool.map(mapWKResults,res))
                        mres=[]
                        gc.collect()
                        del gc.garbage[:]
                    del ref
                    gc.collect()
                    del gc.garbage[:]
                s=sum(BWkbs)/refs
                Wkbs.append(s)
                sk.append(numpy.sqrt(sum((numpy.asarray(BWkbs)-s)**2)/refs))
        
        sk=numpy.asarray(sk)*numpy.sqrt(1+1.0/refs) # s_k = sd_k * sqrt(1 + 1/B)
        return(ks,Wks,Wkbs,sk)

The results take a while, but there is a basic rule of thumb based on the non-zero values in the matrix for text data. It can be used to bound the maximum number of clusters to run, leaving some buffer room.

  
import numpy
(mtx.shape[0]*mtx.shape[1])/numpy.count_nonzero(mtx)

The code splits the most intense function up into map and reduce tasks. Map-reduce could be used in each case (see the sum functions attached to the final result).

  
preds=pool.map(performTest,[[k,x] for x in mres])
res=pool.map(performWK,[[mres[j],preds[j],k] for j in range(len(mres))])
BWkbs.extend(pool.map(mapWKResults,res))

These lines run the KMeans tests, perform the reduction across all vectors, and gain a slight performance boost by performing the log reduction on each result to obtain our Wk value for the overall equation.

Ideally, this process can be used to find the best number of categories or groups in a variety of algorithms such as fuzzy clustering with membership matrices or LSA.

Morning Joe: Categorizing Text, Which Algorithms are Best in What Situations

There is a ton of buzz around data mining and, as always, many new names being injected into the same topics with little study behind them. While the buzz exists, knowing when to deploy an algorithm can be tricky. Based on a recent deep dive into the subject and a dire need to program these algorithms in Java, I present a brief overview of the tools, with benchmarks and examples to hopefully follow later. Basic concepts are presented first, followed by some algorithms that I have not yet honed (benchmarks are not feasible for now but will be soon).

To read this thoroughly, I highly recommend following the links. This is a starting point and a summary of what I have found so far. Basically, the goal is to save you the hours it takes to find information on each topic and to spare you some stumbling.

A word of caution: effective matrix-based solutions require a large amount of data. Other fuzzy algorithms exist for discovering relatedness between small sets of items. For strings, there is distance matching such as Jaro-Winkler or Levenshtein, combined with rule-based comparisons and lookup tables to minimize error (say, between Adama and Obama). Statistics can enhance the process when the best rating must be taken: after filtering out common issues, train a model to test the hypothesis that the distance between two entities means they are the same, against the null hypothesis that it does not.
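As a small, hedged illustration of why close distances need rule-based backstops, the standard library's difflib ratio (standing in here for Jaro-Winkler or Levenshtein, which live in third-party packages) scores Adama and Obama as quite similar; the lookup-table rule is an invented example:

# Illustration of why near-miss string matches need rule-based backstops:
# "Adama" and "Obama" score as quite similar under a generic similarity ratio.
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1] from difflib's longest-matching-blocks heuristic."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [("Adama", "Obama"), ("Adama", "Adams"), ("Obama", "Osama")]
for a, b in pairs:
    print(a, b, round(similarity(a, b), 3))

# A lookup table of known entities can veto near-miss matches like Adama/Obama.
KNOWN = {"obama", "adama"}
def is_same(a, b, threshold=0.8):
    if a.lower() in KNOWN and b.lower() in KNOWN and a.lower() != b.lower():
        return False            # rule: distinct known entities are never merged
    return similarity(a, b) >= threshold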

The Matrix

Linear algebra is a critical foundation of text mining. When we have different documents or images, each document or image is often thought of as forming an equation, and those equations can then be arranged in a matrix, a simple and compact way to avoid thinking in overly complex terms.

You may have heard of the vaunted differential equation. If not, some reading is in order from a site I used in college when there was no time for the book. A large portion of differential equations can be written in matrix form, which matters because of eigenvectors and eigenvalues. These concepts are crucial for solving matrices to find equations that explain a set of models. Drexel University's eigenfaces tutorial provides a fairly solid understanding of the way a matrix is used in most text mining, although for most tasks similarity ratings are used to compare documents rather than a covariance matrix.

The end result of studying these methods is the ability to look under the hood at today’s hottest text mining technologies.

The Vector Space and TF-IDF

Understanding vectors and vector operations is another crucial step in understanding the mining process. A basic vector is a set of points representing a position in a space. Vectors can be added, subtracted, multiplied, and, most importantly, stuffed into a matrix, where their points can be used with basic linear algebra to find relationships.

Vectors have magnitude and direction, so their angles and distances can be compared. Note that, while some data loss may occur in choosing the right magnitude and direction, the units used should be the same (it would be a terrible idea to think in terms of, say, millisecond-meter-document-ice cream cones); this provides a sane basis for comparing data. It is up to the miner to choose the most representative points for the vector.

In text mining, the term frequency-inverse document frequency (TF-IDF) rating is used in many commercial algorithms, including search engines. If the name is not enough, it is basically the product of how often a term appears in a single document and how rare that term is across all documents. It works best on more than one document, and an offset of 0.5 for term frequency helps dampen the effect of large documents by bumping up the rating of less frequent terms. Inverse document frequency uses a logarithm to scale down the weight of terms that appear in most documents.

Multiply the two components together to find the result, as described by Wikipedia: the augmented term frequency tf(t, d) = 0.5 + 0.5*f(t, d)/max{f(t', d) : t' in d} and the inverse document frequency idf(t, D) = log(N/|{d in D : t appears in d}|), giving tfidf(t, d, D) = tf(t, d)*idf(t, D).
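As a quick, hedged illustration of those two components (not the exact weighting any particular search engine uses), the augmented term frequency and the logarithmic inverse document frequency can be computed directly; the tiny corpus is invented:

# Minimal TF-IDF sketch using the augmented term frequency (0.5 offset) and
# the plain logarithmic inverse document frequency described above.
import math
from collections import Counter

docs = [
    "the pizza was delivered late".split(),
    "the driver delivered the pizza with a smile".split(),
    "trucking routes are long and monotonous".split(),
]

def tf(term, doc):
    counts = Counter(doc)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, corpus):
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / float(containing)) if containing else 0.0

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(round(tfidf("pizza", docs[0], docs), 3))     # appears in 2 of 3 documents
print(round(tfidf("trucking", docs[2], docs), 3))  # rarer term scores higher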

Similarity Ratings

No matter what you do, similarity ratings are the key to making the process work. There are several that can be used. If the data can be represented well, covariance is an option. However, text data is not well suited to covariance, due to varying styles that represent the same object and, most importantly, issues with quantization. Natural language is naturally fuzzy. Therefore, the cosine similarity usually offers a much better solution.

The cosine equation takes the dot product of two vectors and divides it by the product of their norms. It follows from vector algebra. The result is the cosine of the angle between the vectors, which represents their ‘degree’ of similarity and can be used for comparison.
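A minimal sketch of that calculation on dense vectors (for reference, scipy.spatial.distance.cosine, used in the gap statistic code earlier, returns one minus this value):

# Cosine similarity: dot product over the product of the vector norms.
import numpy

def cosine_similarity(a, b):
    a = numpy.asarray(a, dtype=float)
    b = numpy.asarray(b, dtype=float)
    return float(numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b)))

# Two term-count vectors over the same vocabulary.
print(round(cosine_similarity([1, 2, 0, 1], [2, 4, 0, 2]), 3))  # 1.0, same direction
print(round(cosine_similarity([1, 0, 0, 0], [0, 3, 5, 0]), 3))  # 0.0, no shared terms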

WordNet, Disambiguation, and Stemming

The processes of disambiguation and stemming are crucial to text mining. There are many sentence processing methods, as NLTK shows. At their core are WordNet and other dictionaries. WordNet is a freely available graph of an English dictionary. Most tools work with WordNet for finding root words, disambiguation, and cleaning.

Part of speech (POS) tagging is involved in both disambiguation and stemming. Maximum entropy models are used to discover a part of speech based on common usage.

Disambiguation attempts to resolve words with multiple meanings to their most probable meaning. The weakest algorithm is the original Lesk, which relies only on WordNet; accuracy hovers around 50 percent. Simplified Lesk achieves better results. Lesk finds overlapping words and frequencies to determine the best synonym to replace an ambiguous word. Better algorithms use clustering and Bayesian methods for word sense discovery, and cosines may be used to improve Lesk as well.

Stemming reduces words to their roots. Most WordNet tools use existing classifications with POS tagging to achieve this result.
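For a concrete, hedged example, NLTK ships a WordNet lemmatizer for root finding and a Lesk implementation for disambiguation; the sentences below are made up, and the WordNet corpora must already be downloaded for this to run.

# Sketch of WordNet-based root finding and Lesk disambiguation with NLTK.
# Requires nltk.download('wordnet') (and possibly 'omw-1.4') beforehand.
from nltk.stem import WordNetLemmatizer
from nltk.wsd import lesk

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("deliveries", pos="n"))   # -> delivery
print(lemmatizer.lemmatize("running", pos="v"))      # -> run

# NLTK's lesk picks the WordNet sense whose gloss overlaps the context most.
sentence = "I went to the bank to deposit my tips from the delivery run".split()
sense = lesk(sentence, "bank", pos="n")
print(sense, sense.definition() if sense else "no sense found")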

A Note on Regression Models

Let's be clear: prediction is not well suited to categorization. Changes in word choice across a large number of documents and decisions about importance do not always mean the same thing, so regression models tend to work poorly; the data is not likely to be continuous either. Think of writing like a magnetic field with eddy currents. Predicting the effect of an encounter with these currents is really, really difficult; basically, run into an eddy current and you are going to have a really bad day. That is not to say an equation cannot be created that fits most of the data with respect to the location of a point, essentially a differential equation, but it will likely not be generic and will be incredibly difficult to find.

Regression works well on continuous and more natural events.

Classification Tree and a Random Forest

Another often poor performer for categorizing text data is the classification tree. Trees are only as good as the number of rules you are willing to create. However, they may be combined with multinomial Bayes for writing that is uniform and professional (say, a legal document) to achieve some success. They are particularly useful after filtering data using LSA/HDP or multinomial Bayes, with decisions that work like a Bayesian model when thinking about the bigger picture.

Basically, a classification tree uses probabilities within groupings to ascertain an outcome, moving down to the appropriate left or right child node based on a yes or no answer to the question 'do you belong?'

This process works well with well-defined data when there is a good degree of knowledge about the subject (say, gene mapping), but text mining often uses fuzzy data with multiple possible meanings, and disambiguation is not entirely accurate; Lesk's original algorithm only achieved 50 percent accuracy, while an LSA model hovers between 80 and 90 percent. Quality can be improved with multiple trees or possibly by training on an extremely large set using cosines instead of raw frequencies.

There are multiple methods for building trees; two are random forests and bagging. Bagging takes multiple trees and averages their probabilities for decisions, using the average at the respective nodes. Random forests pick random subsets of features, find probabilities based on them, and select the stronger predictor for each node. The latter approach is best with a much larger set of known features; the number of features considered at each split is usually the square root of the total number of features.

Again, the features must be known and fairly homogeneous. Text data is often not.
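As a hedged illustration of the difference in scikit-learn terms (the tiny feature matrix below is invented), bagging averages full trees grown on bootstrap samples, while a random forest also restricts each split to a random subset of features:

# Bagged trees versus a random forest on a tiny, made-up feature matrix.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 0, 0]]
y = [0, 0, 1, 1, 1, 0]

bagging = BaggingClassifier(n_estimators=25)  # defaults to bagged decision trees
# max_features='sqrt' limits each split to sqrt(n_features) candidate features.
forest = RandomForestClassifier(n_estimators=25, max_features="sqrt")

bagging.fit(X, y)
forest.fit(X, y)
print(bagging.predict([[1, 1, 1]]), forest.predict([[1, 1, 1]]))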

Multinomial Bayesian Classifier

Multinomial Bayesian classification is a method that classifies data based on the frequency of words in different categories and their probabilities of occurrence. It is fairly straightforward: find the frequencies, or train a set of frequencies, word by word or gram by gram (a gram being an n-pairing, and thus an n-gram, of words); find probabilities by sentence; and take the best one.

MNB works well when the writing differs starkly, say with subject matters that differ greatly. It is good for tasks such as separating spam from policies and code in HTML data when large amounts of training data are present.
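Here is a small, hedged sketch using scikit-learn's CountVectorizer (already used in the gap statistic code above) with a multinomial naive Bayes classifier; the training snippets are invented stand-ins for starkly different subject matter:

# Multinomial naive Bayes on word counts for two starkly different categories.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "win cash now click this link free prize",              # spam-like
    "claim your free prize now limited offer",              # spam-like
    "the refund policy covers deliveries within 30 days",   # policy-like
    "this policy describes how customer data is stored",    # policy-like
]
train_labels = ["spam", "spam", "policy", "policy"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X, train_labels)

test = vectorizer.transform(["free cash prize click now"])
print(clf.predict(test))          # expected: ['spam']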

Clustering with LSA or HDP

Clustering works well when something is known about the data but manual categorization is impractical. Most approaches avoid affinity propagation, which usually ends up with roughly the square root of the total inputs as the number of clusters anyway. Matrices are used heavily here, as eigenvalues and eigenvectors derive an equation that can be used to find relatedness between documents.

LSA uses raw frequencies, or more effectively cosines, in the same manner as eigenfaces to compare vectors. The end result, however, is an equation representing a category. By matrix decomposition and multiplication, all elements are compared; in this case, each ij entry in the matrix is a cosine or the frequency of word i in document j. HDP (hierarchical Dirichlet process) is similar but attempts to learn more about the results and improve on the process. It takes much longer than LSA and is experimental.

If you are trying to discover new information about text or to find the best fit for a number of categories, these methods are useful.
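As a hedged sketch in gensim (which was mentioned earlier), LSA corresponds to the LsiModel built over a TF-IDF corpus; the documents and the choice of two topics below are invented for illustration:

# Latent semantic analysis with gensim: TF-IDF corpus -> LsiModel topics.
from gensim import corpora, models

docs = [
    "pizza delivery driver customer tip".split(),
    "driver delivery late pizza refund".split(),
    "long haul trucking freight route".split(),
    "freight truck route monotonous highway".split(),
]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

tfidf = models.TfidfModel(bow_corpus)
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)

for topic_id, topic in lsi.print_topics(num_topics=2, num_words=4):
    print(topic_id, topic)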

Maximum Entropy Models

Maximum entropy models work well on heterogeneous data in a manner similar to Bayes. Gensim's sentence tagger classifies sentences from non-sentences in this way. The models find the most likely outcome under the maximum entropy principle, using frequencies and likelihoods of occurrence. They work quite well with the correct training sets.

If conditional independence cannot be assumed and nothing else is known about a set, this model is useful. Categories should be known beforehand.
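Multinomial logistic regression is the standard maximum entropy classifier for featurized text; here is a hedged scikit-learn sketch (the snippets and labels are invented):

# A maximum entropy (multinomial logistic regression) classifier on word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "order arrived cold and late",
    "driver was friendly and quick",
    "payment failed at checkout twice",
    "refund processed without any issue",
]
labels = ["complaint", "praise", "complaint", "praise"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# LogisticRegression fits the maxent model: the least-committed distribution
# consistent with the observed feature expectations.
maxent = LogisticRegression(max_iter=1000)
maxent.fit(X, labels)

print(maxent.predict(vectorizer.transform(["late and cold order"])))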

Tools

Java

Python

Common Resources

A New Project: Is a Distance Based Regular Expression Method Feasible?

So, I would like to find specific data in scraped web pages, PDFs, and just about anything under the sun without spending a lot of time. After looking over the various fuzzy matching algorithms such as Jaro-Winkler, Metaphone, and Levenshtein, and finding that no single one had an incredibly wide application, I decided that developing a regular expression based distance algorithm may be more feasible.

The idea is simple: start with a regular expression, build a probability distribution across one or more good, known data sets, and test for the appropriate expression across every web page. The best score across multiple columns would be the clear winner.

Building out the expression would include taking known good data and finding a combination of the base pattern and the data that works, or building an entirely new pattern. Patterns that appear across a large proportion of the set should be combined. If [A-Z]+[\s]+[0-9]+[A-Z] and [A-Z]+[\s]+[0-9]+ appear often in the same or equivalent place, or even [a-z]+[\s]+[0-9]+, then the result should likely be [A-Z\s0-9a-z]+, provided the set is similarly structured. Since the goal is to save the time spent programming regular expressions to further parse XPath or other regular expression results, this is useful.
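As a rough, hedged sketch of that idea (the sample strings and scoring rule are invented for illustration), each candidate pattern can be scored by the proportion of known-good values it matches in full, which is how the merged, broader pattern wins on a similarly structured set:

# Score candidate regular expressions against known-good samples and pick a winner.
import re

known_good = ["ACME 42B", "DELTA 7", "gamma 19", "OMEGA 3X"]

candidates = [
    r"[A-Z]+[\s]+[0-9]+[A-Z]",
    r"[A-Z]+[\s]+[0-9]+",
    r"[a-z]+[\s]+[0-9]+",
    r"[A-Z\s0-9a-z]+",          # the merged, broader pattern
]

def score(pattern, samples):
    """Fraction of samples the pattern matches in full."""
    compiled = re.compile(pattern)
    return sum(1 for s in samples if compiled.fullmatch(s)) / float(len(samples))

scores = {p: score(p, known_good) for p in candidates}
for pattern, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(round(s, 2), pattern)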

The tricky part of the project will be designing a similarity score that adequately equates the expressions without too many outliers. Whether this is done with a simple difference test resulting in a statistical distribution or with a straightforward score needs to be tested.

In all likelihood, recurring words should be used to break ties or bolster weak scores.

The new project will hopefully be available on SourceForge for data integration and pre-curation processes.