Which to Use: Microservices or Actor Systems?

Which is better: an actor system whose nodes handle requests and register themselves on a network, or a set of microservices? That question is probably not asked often enough. This article examines it from a theoretical standpoint, before either approach reaches a production system.

Benefits of Microservices

We often incorporate microservices simply because they have been the go-to architecture for quite some time, forgetting about the actor model altogether. To frame the debate, we first need to lay out the benefits of the microservice.

Microservices:

  • Allow flexibility by isolating code changes within a single application
  • Allow different applications to scale autonomously based on need
  • Allow different teams to focus on their own tasks without needing to know the implementation of another team's work

However, they also:

  • Can grow complicated and fail to account for direct interaction between services, slowing the system down
  • Require more knowledge of other teams' APIs (a minor con)
  • Require a layer of abstraction that leaves them somewhat vulnerable, and are often implemented insecurely

Benefits of Actor Models

Actor models are fast and efficient, and they have benefits of their own. In particular, they:

  • Allow flexibility by letting a node change its behavior or state
  • Allow different parts of a system to scale automatically based on need
  • Take into account the different interactions between services
  • Tend to be more secure; think of blockchain as a database version of an actor system, maintaining state and staying self-contained
  • Simplify the individual components (both a pro and a con)

However, actor models:

  • May or may not require knowledge of the implementation produced by other teams
  • Require a great deal more work to become as flexible and abstract as microservices
  • Are more difficult to change
  • Tend to perform poorly on blocking requests
  • Require extra work to stabilize

Conclusion

Microservices are clearly great for company-wide tasks, but what about internally to each team, where their inherent complexity can complicate matters significantly? This is where actor models shine. For instance, if we want to manage different machines in a non-blocking manner to kick off different tasks, actor models perform well. Perhaps we are creating a set of nodes in the system to manage Celery workers and producers. This is the debate going on in my open source CeleryETL system at the moment.

Feel free to comment!


Ways Uber Can Improve Their App: A Pro-Uber Position

I recently landed a lucrative opportunity. As with all such opportunities, the question came up as to what to do next. The answer, sadly, was to look for part-time work while building an extensive system. As anyone seeking to stay out of a homeless shelter after leaving a position might do, I decided to try Uber as a bike courier.

If you are already wondering why I chose to be a bike courier: I need the exercise after sitting 10-12 hours per day working on my backend, establishing best practices, working on agreements, and having meetings.

Being a bike courier gives me a unique perspective on the actual Uber driver app that might pique the interest of the folks who ordered sushi from Hai Sushi last night. My roommate and I give the restaurant a 5/5, and I really do not intend to have a delivery cancelled again, especially mid-delivery when I need to continue eating.

The Issues

Uber's promises are overblown. You really need to come down to earth and realize that actual bike couriers earn about $10.50-11 per hour, and car couriers do not fare much better. I already knew this after doing my research.

There are other problems. Instead of merely listing them, here are examples:

  • The application only considers cars: The major street on my alluded-to pickup (which only took 5 minutes) allows cars to travel 50 mph. Uber failed to take into account that the app caps a bicycle at 15 mph; it does not even show a delivery partner the distance. So a 5-minute pickup led to a 4-minute wait and a nasty 35-minute bicycle ride at 15 mph, praying the customer would enjoy cold sushi. A 15-minute delivery now takes 45 minutes, which is both false advertising and bad data engineering. I know the latter because that is my trade, and this lack of consideration is a fireable offense.
  • The application fails to work with the Android device's gyroscope correctly: This one really has me beat. Google Maps, Uber's old map vendor, works perfectly. Considering what Google charges for its services, the decision to build their own app was not surprising. However, when I see a long journey on a road I know and quickly pack my phone away, I really don't want to find out 2-3 minutes later that the app was pointing in the wrong direction. Worst of all, moving five to ten feet does not resolve the problem.
  • The application fails to live up to a Google replacement in route finding: Beyond missing traffic information, the application fails to consider the type of vehicle it should find routes for. As I learn the bike routes I get faster, but even then the application eats battery and probably keeps re-estimating my time to arrival as I take them.

All of these problems lead to some glaringly bad quality issues. If Uber were Postmates, it would be dead.

Why These Are Problems

The perception of quality at UberEATS and Uber depends on five things: timely service, the degradation of food or other delivered items where applicable, the connections with partners allowing faster food or item preparation where applicable, the knowledge of the driver, and the personality and customer service presented by the driver.

These problems have serious side effects:

  • They result in less timely service
  • They result in partners (businesses) questioning your service
  • They result in angrier drivers and thus worse experiences
  • All of this leads to disloyal customers and higher turnover, hurting the bottom line

It is not uncommon for the CAQ quality score to come off as a 2/5 when many of your drivers start to badmouth the company and complain about the application.

The Solutions

For those wondering how Uber can improve its application, my take follows:

  1. Peg speed limits and vehicle speed caps to distance when estimating times. This will greatly improve time estimates.
  2. Limit the radius of operation by vehicle type and warn customers outside of it about timing problems. This is easy to accomplish on a graph structure. Graph databases like Neo4j are quite fast and can easily accommodate distance calculations and route finding within these radii; there is even a geo-spatial tool with a haversine function for Neo4j. It could be accomplished quickly on Postgres as well. For reference, Postgres handles all of the real-time satellite data at Digital Globe, a company with multi-billion dollar revenue (not the inflated valuations most tech companies are judged on); if you want to work for a super-awesome version of Google that feels like NASA, Digital Globe is it. Heuristics, thresholds, and genetic algorithms can also help. A simple distance-related cutoff in Neo4j and a fast program, even in Python, should be sufficient for millions to possibly billions of requests. Think Celery and Flask; it no longer takes C++ to accomplish such tasks. A short distance sketch follows this list.
  3. Figure out why your application is not working as well as Google Maps. I know it eats my battery due to the GPS requirement, which is admittedly an inevitability and why I carry a backup battery, but please generate some tickets for the gyroscope and look into battery use, likely with some less intensive code and more server-side processing.
  4. Add route information per vehicle type. I can travel faster when I am not pushed onto a major state highway in a new area, forcing me up on the sidewalk. Most cities offer bike route information, and in a graph database, is_bikeroute is really just an edge attribute.
  5. Add traffic information. Please add traffic information.
  6. Push the feedback form and incorporate the feedback. Quality is an iterative process. It is good to have responsive customer service for the driver application, but these people are the bread and butter. Make sure drivers know the form exists, and treat the feedback representative as a typical organization treats a shift manager in terms of said feedback. It should bubble up to the appropriate position, with the filters turned on, and make it to the data, application, and front-end teams at some point.
  7. As always, keep working to improve battery life.
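
On point 2, the underlying math is straightforward. The sketch below is purely illustrative, not Uber's implementation: it computes the haversine (great-circle) distance between two coordinates and derives a rough ETA from a per-vehicle speed cap. Only the 15 mph bicycle cap comes from my experience above; the other figures are assumptions.

    import math

    # illustrative speed caps in mph; only the bicycle figure is from the text above
    SPEED_CAPS_MPH = {'bicycle': 15.0, 'car': 50.0}

    def haversine_miles(lat1, lon1, lat2, lon2):
        """Great-circle distance between two (lat, lon) points in miles."""
        radius_miles = 3959.0  # mean Earth radius
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * radius_miles * math.asin(math.sqrt(a))

    def eta_minutes(lat1, lon1, lat2, lon2, vehicle='bicycle'):
        """Estimate travel time from the vehicle's capped speed, not a car's."""
        return haversine_miles(lat1, lon1, lat2, lon2) / SPEED_CAPS_MPH[vehicle] * 60.0

A radius-of-operation check is then just a comparison of eta_minutes against a per-vehicle threshold before the job is ever offered.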

Conclusion

Despite the size and revenue of this very real company, Uber still feels a bit off in the development category. I am not at odds with working for the company to earn side money, but I really do hope the iterative software development ideologies make their way to Uber, Postmates, and the other companies. As always, consider promoting the engineers to positions of architecture and power. They are more knowledgeable about achieving customer goals with smooth, scalable systems.

ETL 1 Billion Rows in 2.5 Hours Without Paying on 4 cores and 7gb of RAM

There are a ton of ETL tools in the world: Alteryx, Tableau, Pentaho; the list goes on. Of them, only Pentaho offers a quality free version. Alteryx prices can reach as high as $100,000 per year for a six-person company, and it is awful and awfully slow. Pentaho is not the greatest solution for streaming ETL either, as it is not reactive, but it is a solid choice over the competitors.

How, then, is it possible to ETL large datasets, stream from a TCP socket on the same system, or run flexible computations at speed? This article describes how to do just that using Celery and a tool I am currently working on, CeleryETL.

Celery

Python is clearly an easier language to learn than others such as Scala, Java, and, of course, C++. Those languages handle the vast majority of tasks for data science, AI, and mathematics outside of specialized languages such as R, and they are likely the front runners in building production-grade systems.

In place of the actor model popular in other languages, Python, being more arcane and outdated than any of the popular languages, relies on task queues. My own foray into actor systems in Python led to a design which was, in fact, Celery backed by Python's Thespian.

Celery handles tasks through RabbitMQ or other brokers; RabbitMQ reportedly can achieve up to 50 million messages per second. Verifying that is beyond the scope of this article, but it would theoretically cause my test case to outstrip the capacity of my database to write records. I only hazard to guess at what it would do to my file system.

Task queues are clunky, just like Python. Still, especially with modern hardware, they get the job done fast, blazingly fast. A task is queued with a module name, as modules are loaded into a registry at run time. The queues, processed by a distributed set of workers running much like actors in Akka, can be managed externally.

Celery allows for task streaming through chains and chords. The technical documentation is quite extensive and requires a decent chunk of time to get through.
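
To make that concrete, here is a minimal sketch of a chained pipeline. This is illustrative, not CeleryETL's code; the broker URL and task bodies are placeholders:

    from celery import Celery, chain

    # the broker URL is a placeholder; any RabbitMQ instance will do
    app = Celery('etl', broker='amqp://guest@localhost//', backend='rpc://')

    @app.task
    def extract(source):
        # stand-in for reading real records from a file or socket
        return [{'name': ' jane doe '}, {'name': ' john roe '}]

    @app.task
    def transform(rows):
        # stand-in for remapping and string operations
        return [{'name': row['name'].strip().title()} for row in rows]

    @app.task
    def load(rows):
        # stand-in for writing to a database such as Neo4j
        print('loaded {0} rows'.format(len(rows)))

    # a chain runs the tasks in sequence, feeding each result to the next
    chain(extract.s('source.csv'), transform.s(), load.s()).delay()

A chord works similarly: chord([transform.s(b) for b in batches])(load.s()) runs a group in parallel and hands the collected results to a single callback.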

Processing at Speed

Processing in Python at speed requires little more than properly chunking operations, batching record processing appropriately to remove latency, and performing the other simple tasks described in the Akka Streams documentation. In fact, I wrote my layer on Celery using the Akka Streams playbook.

The only truly important operation is chunking your records. When streaming over TCP, this may not be necessary unless connections arrive extremely rapidly; thresholding may be an appropriate solution in that case. If there are more connection attempts than can be completed at once, buffer requests and empty the buffer appropriately upon completion of each chain. I personally found that a maximum bucket size of 1000 was appropriate for typical records and 100 for large records, including those containing text blobs.
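
A chunking helper can be as small as the following sketch; the bucket sizes are the figures mentioned above, while the function itself is illustrative rather than CeleryETL's code:

    def chunk(records, size=1000):
        """Yield lists of at most `size` records so each task handles a batch."""
        batch = []
        for record in records:
            batch.append(record)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch  # flush the remainder

    # queue one task per batch instead of one per record, e.g.:
    # for batch in chunk(rows, size=100):  # 100 for large records with text blobs
    #     transform.delay(batch)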

Take a look at my tool for the implementation. Using it, I was able to remap, split fields to rows, perform string operations, and write to my Neo4j graph database at anywhere from 80,000 to 120,000 records per second.

Conclusion

While this article is shorter than my others, it is something I felt necessary to write in the short time I have. This discovery allows me to build a single-language system for an entire company through Celery, Neo4j, Django, PyQt, and PyTorch. That is phenomenal, and it is rivaled only by Scala, which is sadly dying despite being a far superior, faster, and less arcane language. By all measures, Scala should have won over the data science community, but people detest the JVM. Until this changes, there is Celery.


Reactive Streaming with Thespian

In this article I cover my new attempts at building reactive software for the world in Python, helping to streamline the tasks that hinder scale in this operationally efficient language. We will review Thespian and my reactive software built on this actively developed project.

Problem

Well folks, my attempts at building reactive software in asyncio have stalled. After beating my head against a wall when my loops became stuck, even when using a future instead of solely a loop, I have given up for now.

After a search I should have done weeks ago, I discovered I am not the only one to give up on an asyncio actor model (see Xudd and Cleveland). Thankfully, I stumbled on Thespian.

Thespian

Thespian is a basic actor system written in Python. While it is possible to wrap CAF, Thespian allows for wildly popular Python tools to be spread across an actor system while only minimally dealing with serialization or data packaging.

The system is stable, having originally been used in a production environment at GoDaddy. It is also well documented.

Reactive Software

Despite being stable, Thespian is fairly light on features. Compared with Akka, it makes little to no attempt at implementing Reactive Streams, lacks decent cluster support, and is even missing basic routers.

To resolve this, I am building a series of reactive software programs that implement streams, routers, and the other basic features of Akka. I am additionally focusing on building ETL, ingestion, and networking tools based around the reactive streams model and Thespian.
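
As an example of one gap being filled, here is a minimal round-robin router written against plain Thespian. This is my own sketch rather than code from the project, and the Worker actor and pool size are illustrative:

    from itertools import cycle
    from thespian.actors import Actor, ActorSystem

    class Worker(Actor):
        def receiveMessage(self, message, sender):
            payload, reply_to = message
            self.send(reply_to, 'processed: {0}'.format(payload))

    class RoundRobinRouter(Actor):
        """Fans incoming work out to a fixed pool of routees, round robin."""
        def receiveMessage(self, message, sender):
            if not hasattr(self, 'routees'):
                # create the pool lazily; Thespian discourages overriding __init__
                self.routees = cycle([self.createActor(Worker) for _ in range(4)])
            # forward the original sender so the routee can reply directly
            self.send(next(self.routees), (message, sender))

    if __name__ == '__main__':
        system = ActorSystem()
        router = system.createActor(RoundRobinRouter)
        print(system.ask(router, 'job-1', 1))  # prints: processed: job-1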

Help Build the Future

We need your help. Our platforms should help streamline the monstrosities that Python-based backends and systems can become. Help build the kind of scale that will hopefully power future Python applications, hone your skills, and learn a new framework.

If interested, contact me at aevans48@simplrinsites.com. I can invite you to our Trello boards.

CompAktor: Python Actor System

While interest in Scala wanes, many of the tools that remain popular, and a reason the language persists, will likely be sidelined or moved to Java. While Java is a popular language, it may not be appropriate for some use cases.

Imagine building a series of interconnected, reactive robots based on the Raspberry Pi, each with its own 'nervous system'. Asyncio in Python 3.5+ gives the programmer this unique capability by allowing them to create several actor systems.

This article explores some of the basics behind actors in Python while serving as conceptual documentation for CompAktor. All code comes from CompAktor.

Message Passing

Actor systems utilize message passing to perform units of work. Queues store messages that the actor uses to perform certain tasks:

[Image: how actors process messages, courtesy of Petabridge]

In CompAktor, this is represented by the following code:

    @abstractmethod
    async def _task(self):
        """
        The running task. It is not recommended to override this function.
        """
        # pull the next message from the actor's inbox queue
        message = await self.__inbox.get()
        try:
            # handlers are registered per message type
            handler = self._handlers[type(message)]
            is_query = isinstance(message, QueryMessage)
            try:
                if handler:
                    response = await handler(message)
                else:
                    logging.warning("Handler is NoneType")
                    self.handle_fail()
            except Exception as ex:
                if is_query:
                    # propagate the failure to the caller awaiting the result
                    message.result.set_exception(ex)
                else:
                    logging.warning('Unhandled exception from handler of '
                                    '{0}'.format(type(message)))
                    self.handle_fail()
            else:
                if is_query:
                    # queries expect a response; fulfill the attached future
                    message.result.set_result(response)
        except KeyError as ex:
            self.handle_fail()
            raise HandlerNotFoundError(type(message)) from ex
.......

Actor in Python

An actor is, of course, the core consumer and producer of messages in the actor system. The actor maintains a queue and in non-asyncio environments typically runs on its own thread.

Actors maintain and can change state, another critical component of actor systems. They use provided messages to perform units of work.

State is handled in CompAktor with the following code:

self._handlers = {}
self.register_handler(PoisonPill, self._stop_message_handler)

The PoisonPill kills an actor and is a common construct.
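
Registering additional behavior follows the same pattern. The snippet below is a hypothetical sketch rather than CompAktor code: the Greet message, the _greet handler, and the BaseActor name are invented for illustration, assuming a base actor that exposes register_handler as shown above.

    class Greet:  # hypothetical message type
        pass

    class Greeter(BaseActor):  # assumes a base actor class with register_handler
        def __init__(self):
            super().__init__()
            # route Greet messages to the _greet coroutine
            self.register_handler(Greet, self._greet)

        async def _greet(self, message):
            # mutating actor state here is how an actor changes behavior over time
            return 'hello'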

Asyncio Loop

Asyncio runs on event loops. Multiple loops can be run in a program.

The loop works around Python's GIL by using generator-like behavior to mimic fully asynchronous behavior. Loops do not block on I/O, allowing multiple tasks to run at once.
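
A minimal illustration of that interleaving, using only the standard library:

    import asyncio

    async def tick(name, delay):
        await asyncio.sleep(delay)  # yields control back to the loop while waiting
        return name

    loop = asyncio.get_event_loop()
    # both coroutines wait concurrently, so this finishes in about 1 second, not 2
    results = loop.run_until_complete(asyncio.gather(tick('a', 1), tick('b', 1)))
    print(results)  # ['a', 'b']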

Straight from python.org:

[Image: tulip_coro, the coroutine diagram from the asyncio documentation]

Notes on Asynchrony and Event Loops

The actor model is safer internally than most other systems. Actors themselves perform a single task at a time.

Despite being safer, actors still should not block the loop for a long time. If a task will take a while, it is recommended to use a separate thread or process to complete it rather than blocking the loop and wreaking potential havoc on the system.
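
In asyncio terms, that usually means handing the blocking call to an executor. A minimal sketch:

    import asyncio
    import time

    def slow_task():
        time.sleep(2)  # stands in for blocking I/O or heavy computation
        return 'done'

    async def main():
        loop = asyncio.get_event_loop()
        # run the blocking call in the default thread pool instead of on the loop
        result = await loop.run_in_executor(None, slow_task)
        print(result)

    asyncio.get_event_loop().run_until_complete(main())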

Conclusion

This article explored the creation of actors in Python using asyncio. The actor is the basic object in CompAktor.

Checking and Increasing the Connection Limit in PostgreSQL

It finally happened. I blew through the max connection limit in my new PostgreSQL install by starting too many grid nodes with associated connection pools for a system I am writing.

The default limit of 100 connections is far too few, especially with two people doing database-intensive tasks. This article explains how to check the number of connections used, as a standard user and as an administrator, and how the administrator can change the connection limit.

Default Connection Limit and Buffer Size

By default, PostgreSQL allows a relatively low maximum number of connections: the limit is 100.

The limit is related to the size of the shared buffers, since connections utilize memory in the shared buffers. The default shared buffer size is likewise modest, at 128 megabytes in recent releases.

PostgreSQL is a versatile database; Amazon built Redshift on it. However, the default sizes are low, and they should remain so to encourage developers and students not to abuse their systems' resources.

Common Error When Exceeding Size Limit

With an ever-increasing amount of data available and a still-growing need for technology, connection limits will be exceeded on many systems. PostgreSQL throws the following error when this occurs:

psql: FATAL: remaining connection slots are reserved for non-replication superuser connections

In my case, the connection pool failed to reserve the required ten to twenty connections per highly database-intensive job, releasing what connections it did acquire back to the database instantly. This left me with 96 used connections and four that were unattainable; some of the 96 were held by an ETL tool and pgAdmin.

Querying the Connection Limit

It is possible for non-administrators to check the number of connections in use with the following query:

SELECT sum(numbackends) FROM pg_stat_database

This query reads the per-database statistics view and totals the currently active backend connections.

Administrators have the option of querying connections from the psql command line using:

SHOW max_connections

This query asks the database to return the value of the max_connections configuration variable described below.

Most configuration variables can be queried in psql, and some can even be set from the command line console. The max_connections variable cannot be set from the command line due to the associated increase or decrease in memory usage.

Setting Connection Related Variables

While the initial settings are low by today's standards, they can be changed in your postgresql.conf file. The file is typically found at a location such as:

/var/lib/pgsql/X.X/data/postgresql.conf  or /usr/local/postgres/postgresql.conf

The default location is the data directory under your PostgreSQL installation.

Look for the following variables in the file:

shared_buffers = 8000MB
max_connections = 100

Choosing Size Limits

It is important to set size limits relevant to the size of the shared buffers. Dividing current RAM usage by the number of connections in use, or by the maximum number allotted, gives a generous over-estimate of memory use per connection. Ensure that this estimate, multiplied by the desired number of connections, exceeds neither the shared buffer size nor the capacity of your RAM. The shared buffer size should generally be less than the amount of RAM in your machine.

The inequalities to check are:

shared_buffers < (RAM - padding)
(current_ram_use / max_connections) * desired_connections < (RAM - padding)
(current_ram_use / max_connections) * desired_connections < shared_buffers

Basically, ensure that you have enough RAM and that the potential memory use exceeds neither RAM nor the shared buffer size. The check uses the max_connections setting instead of the actual number of used connections for some extra assurance.
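
As a quick worked example, here is a throwaway sketch; every figure is made up and would come from your own monitoring and postgresql.conf:

    # all values in megabytes; purely illustrative figures
    ram = 16000                # total system RAM
    padding = 2000             # headroom for the OS and other processes
    shared_buffers = 8000      # from postgresql.conf
    current_ram_use = 4000     # observed PostgreSQL memory use
    max_connections = 100      # from postgresql.conf
    desired_connections = 150  # the new limit being considered

    per_connection = current_ram_use / max_connections  # 40 MB over-estimate
    projected = per_connection * desired_connections    # 6000 MB

    assert shared_buffers < ram - padding  # 8000 < 14000
    assert projected < ram - padding       # 6000 < 14000
    assert projected < shared_buffers      # 6000 < 8000, so the raise is safe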

Reloading and Restarting

The configuration file can be reloaded in an active session, without a restart, from the psql command line console or through pgAdmin with the query:

SELECT pg_reload_config()

This query typically requires superuser privileges, and it will not apply changes to the maximum number of allotted connections or the buffer size. Any variable in the configuration file marked with the following comment requires restarting the database server:

# (change requires restart)

Both max_connections and shared_buffers cause a change in memory usage, which cannot be reworked on the fly. From the Linux command line, restart the server with the command:

./postgresql restart

On Ubuntu, the restart script normally resides at /etc/init.d.

If PostgreSQL is setup as a service, use:

service postgresql restart

Conclusion

In today's world where data is bountiful, it is easy to exceed the connection limit in PostgreSQL. This article reviewed how to check connection usage, how to raise the connection limit, and the considerations around resetting the limit and the associated shared buffer size.

A Guide to Defining Business Objectives

It can be said that, in the world of the actual developer, when the client or boss stops complaining and critiquing, the developer has been automated to the point of redundancy. Basically, if no one is complaining, the developer is out of work.

Everyone wants a flying car, even when the job involves ingestion. However, there is a fine line in a business between a business requirement and idealism or ignorance. This is especially true when reporting to a boss whose technical competence and knowledge are less than stellar. So what rules have I discovered that will keep you employed? How do you avoid going too far off track when creating SCRUM tasks? Read on.

This article is an opinion piece on defining objectives and, to some extent, building the infrastructure required for a project.

Defining Core Objectives

All actual rules should be tightly scoped to the end client's needs. This means being social with the end user. The first thing to do is find the most pessimistic person in the office and the person who will be using your product most; they may be the same person. Strike up a nice conversation with them. Get to know them, and then start asking questions about what they absolutely need. If there are multiple people in the office interacting directly with your end product, work from the most pessimistic and necessary to the least, establishing an absolute core of necessity.

By default, your boss will usually be the most optimistic person in the office. They are often less knowledgeable about the technical requirements, and often about technology in general. Your goal is to temper their expectations to something reasonable; I have found this to be true of business partners as well. You should understand what they want, as they will provide a small set of absolute requirements, but also keep in mind that if they can squeeze gold from water, they will.

If you find yourself limiting your objectives, use memos and presentations to provide a solid line of reasoning for why something cannot be done. Learn the IEEE standards for documentation and get familiar with Microsoft Office or LibreOffice. Always offer a solution, and state what would be needed to actually accomplish an objective and why it may be infeasible. In doing so, you may find a compromise or a solution; offer them as alternatives. Do not be overly technical with the less technical.

My line of work requires providing relatively sensitive information in bulk, at speed, with a fair degree of normalization and quality. Basically, I am developing and building a large distributed ingestion and ETL engine with unique requirements that do not fit existing tools. This process has been a wreck, as I came in new to development, was given a largely inappropriate set of technology, was ignored, and was asked to build a Netflix-style stack from the hulk of GeoCities.

Defining business requirements was the first task I struggled with. Competing and even conflicting interests came from everywhere. My boss wanted, and to a large degree still wants, an auto-scaling system with a great degree of statistical prediction on top of a massive number of sources, all from just one person. My co-workers, clients in the strictest sense, want normalization, de-duplication, anomaly detection, and a large number of enhancements on top of a high degree of speed. No one seemed to grasp the technical requirements, but everyone had an idea of what they needed.

In solving this mess, I found that my most valuable resource was the end user who was both the most pessimistic about what we could deliver and less technically skilled than hoped for. She is extremely bright and quite amazing, so bringing her up to speed was a simple task. However, she was very vague about what she wanted. In this case, I was able to discern requirements from my boss's optimism and a set of questions posed to her. As she also creates the tickets stemming from issues in the system, she indirectly defines our objectives as well.

Available Technology

The availability of technology will determine how much you can do. Your company will often try to provide less than the required amount of technology. Use your standards based documentation, cost models, and business writing to jockey for more. If you are under-respected, find someone who has clout and push them to act on your behalf.

As a junior employee several years ago, I found myself needing to push for basic technologies. My boss wanted me to use Pentaho for highly concurrent yet state-based networking tasks on large documents ranging from HTML to PDF. In addition, he wanted automation and a tool that could scale easily. Pentaho was a poor tool choice. Worse, I had no server access. It took one year before I was able to start leaning on a more senior employee to lobby for more leniency, another year and a half before we had servers, and another year before I had appropriate access. If I were not developing a company, one that now has clients, I would have quit. The important takeaway: get to know your senior employees and use them on your behalf when you need to. Bribes can be paid in donuts where I work.

Promise Appropriately, Deliver With Quality

Some organizations require under-promising and over-delivering. These tend to be large organizations with performance review systems in desperate need of an overhaul. If you find yourself in such a situation, lobby to change the system. A solid set of reasoning goes a long way.

Most of us are in a position to promise an appropriate number of features with improvements over time. If you use SCRUM, this model fits perfectly; devise your tasks around it. Know who is on your team and promise an appropriate unit of work. Sales targets are built around what you can deliver; they are kept through your quality and the ease of handling your product. Do not deliver too little, or you will be fired, but do not promise so much as to raise exuberance to an unsatisfiable level.

In my ingestion job, promises are made based on quality and quantity. I use the SCRUM model to refine our promises. Knowing my new co-worker's capacity (fairly dismal) and my own (swamped with creating the tool), I can temper our tasks to meet business goals. Over time, we are able to include more business requirements on top of increasing the number of sources being output and improving existing tools.

Hire Talent

If you are in the position of hiring people to expand what you can achieve, I do not recommend telling your boss that an entry-level position will suffice, as they will then find someone with no skill. Also, push to be in the loop on the hiring process. The right person can make or break a project. My current co-worker is stuck re-running old tasks, as he had no knowledge of our required tools and concepts despite my memo to my boss. Over time he will get better, but with little skill, that may take too long. Sometimes the most difficult higher-ups are those who are nice at heart but ignorant in practice.

Tickets

Your core requirements are not the Ten Commandments. You are not defining a society and universal morals but a more organic project. Requirements and objectives will change over time. The best thing you can do is establish a ticket system; choose a solid one, as changing to a different tool later is difficult. Patterns from these tickets will create new tools, define more requirements, and help you better define your process.

In finding an appropriate system, ask:

  • Do I need an API that can interact with my tools or my clients' tools?
  • Do I need SCRUM and Kanban capabilities on top of ticketing?
  • How hard is it to communicate with the client or end user?

At my work, I implemented a manual SCRUM board for certain tasks, which had a positive impact on my overwhelmed co-worker, who found JIRA cumbersome and full of lag. It is. We use JIRA for bug reporting and the associated Kanban capabilities.

Cost

Cost is the lurking issue many will ignore. You need to document cost and use it to explain the feasibility of an objective. When possible, create statistical tools that you can use to predict the burden on profitability and justify decisions. Money is the most powerful reasoning tool you have.

Conclusion

This opinion piece reviewed my lessons for entry level software developers looking to learn how to define business objectives. Overall, my advice is to:

  • Define core objectives starting with the most important and pessimistic users
  • Dive into your boss's core requirements and use their optimism to define the icing on the cake
  • Build on your objectives and requirements over time
  • Be involved in your boss's decisions
  • Define an appropriate number of objectives that allow you to deliver quality work (you will build on your past work over time)
  • Communicate and use an appropriate project management framework
  • Track costs and build statistical tools
  • Learn IEEE standards-based documentation, such as Software Design Documents and Database Design Documents, and get familiar with business writing
  • Make sure you hire the right people