sauce sauce - 15 days ago 6
Java Question

App engine consistent latency spikes under low load

I've noticed periodic, but consistent latency spices from my app running on app engine. At first I thought the network may be slow but app stats confirmed that not to be the case.

I've been able to reproduce the latency spikes using older and newer versions of the SDKs, currently I'm using the following:


  • App engine SDK: 1.9.42

  • Google cloud endpoints: 1.9.42

  • Objectify: 5.1.13

  • Appstats (to debug network latency)



So usage on the app is pretty low, over the last 30 days I'm generally under 0.04 requests a second:

requests per second

Most work is done with one instance as well:
enter image description here

Most operations' latency is well under a second, but an alarming number of requests take 10-30x longer.

Latency density distribution

5% of requests take 23 seconds or longer...

So I figured it must just be network latency, but every appstat of a slower operation disproved this. Datastore and the network have always been incredibly reliable. Here is the anatomy of a slow request taking over 30 seconds:

app stats of operation taking 31 seconds

Here is the anatomy of a normal request:
enter image description here

At a high level my code is pretty uninteresting: it's a simple api that makes a few network calls and saves / reads data from the cloud datastore. The entire source can be found on github here. The app runs on a single auto scaling app engine instance and is warmed up.

CPU usage over the last month doesnt seem to show anything exciting either:
enter image description here

It's really strange to see that even for quick operations, a huge percentage of time is spent on the CPU, even though the code simply creates a few objects, persists them, and returns JSON. I'm wondering if the CPU is getting pegged on my app engine instance by another app which could be causing performance to periodically degrade.

My appengine.xml config looks like this:

<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
<application>sauce-sync</application>
<version>1</version>
<threadsafe>true</threadsafe>
<automatic-scaling>
<!-- always keep an instance up in order to keep startup time low-->
<min-idle-instances>1</min-idle-instances>
</automatic-scaling>
</appengine-web-app>


And my web.xml looks like this:

<web-app xmlns="http://java.sun.com/xml/ns/javaee" version="2.5">
<servlet>
<servlet-name>SystemServiceServlet</servlet-name>
<servlet-class>com.google.api.server.spi.SystemServiceServlet</servlet-class>
<init-param>
<param-name>services</param-name>
<param-value>com.sauce.sync.SauceSyncEndpoint</param-value>
</init-param>
</servlet>
<servlet-mapping>
<servlet-name>SystemServiceServlet</servlet-name>
<url-pattern>/_ah/spi/*</url-pattern>
</servlet-mapping>

<!--reaper-->
<servlet>
<servlet-name>reapercron</servlet-name>
<servlet-class>com.sauce.sync.reaper.ReaperCronServlet</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>reapercron</servlet-name>
<url-pattern>/reapercron</url-pattern>
</servlet-mapping>

<servlet>
<servlet-name>reaper</servlet-name>
<servlet-class>com.sauce.sync.reaper.ReaperServlet</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>reaper</servlet-name>
<url-pattern>/reaper</url-pattern>
</servlet-mapping>


<welcome-file-list>
<welcome-file>index.html</welcome-file>
</welcome-file-list>

<filter>
<filter-name>ObjectifyFilter</filter-name>
<filter-class>com.googlecode.objectify.ObjectifyFilter</filter-class>
</filter>
<filter-mapping>
<filter-name>ObjectifyFilter</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
</web-app>


TLDR I'm completely stuck and I'm not sure how to debug or fix this issue and I'm starting to think this is business as usual for smaller apps on app engine.

I'm thinking about turning off the resident instance for a while hoping that my app has just been running some bunk hardware or along side an app that is consuming alot of resources. Has anyone run into similar performance issues or knows of additional ways to profile your app?

EDIT:

I've tried running on 1 resident instance, I've also tried setting concurrent requests at 2-4 per this question with no results. Logs and appstats both confirm that an inordinate amount of time is spent waiting for my code to initially run. Here is a request that takes 25 seconds before my first line of code is run, not sure what endpoints / app engine is doing in this time.

25 seconds before my code is run

Again load is still low and this is request is on a warmed up instance.

EDIT 2:

Seems like for whatever reason app engine + endpoints doesnt play well with
min-idle-instances
set. Ripping out appstats and reverting to the default app engine config fixed my problem.

enter image description here

Answer

I don't have an answer, but I can offer you some debugging tips.

Appstats may or may not be reporting correctly. However, log messages get timestamped. Log before & after each of the RPC operations. That should give you some insight.

30s delays sound a lot like warmup requests, which should be marked clearly in the logs. One thing that I've found in the past is that setting any resident instances for low-traffic apps (unintuitively) tended to route a lot of requests to cold instances. Use the default setting and set up a cron task to ping and endpoint once per minute.