Archis's Blog

August 17, 2010

What does 100% CPU usage mean?

Filed under: Technology — archisgore @ 1:32 am

Traditional scientists (mathematicians, physicists, and engineers in traditional disciplines of mechanics, dynamics, etc.) have always known that certain entities are temporal in nature. Temporal means time-dependent (using the term loosely here): quantities or values that only make sense “across time”. The longer the period you observe them for, the more sense they make, and the less you know about local details (you lose details of when something happened, but you gain a greater understanding of what exactly happened). This is Heisenberg’s uncertainty principle applied to time-domain metrics. I don’t want to go into all the stuff Fourier gave us, but heck, that dude really changed the way we do stuff today.

A quick refresher example (or an introductory one for those who didn’t study much signal analysis). Imagine you are sitting in a large theatre at 8:00pm, with the theatre partially filled (say 30% of the seats were taken when you entered at 8:00pm sharp). At 8:01, you observe one person coming into the room. Given this data, could you say either by what time the theatre will be full, or how many people will be in it at 8:30pm the same evening? Now you doze off for a few minutes, and when you look again at 8:05 you notice another person entering the room (because of the darkness, you don’t know how many people are currently in the theatre). Do you have a better estimate of the answers to the two questions? Suppose the person who entered at 8:01 assures you that nobody else came in between him and the next person at 8:05, while you were dozing. Would that make your answer more accurate? If by 8:20 you had details of how many people entered and when, would your estimate be even more accurate?

As you see, when making such estimates, a single data point has very little value. Knowing that 40 people were seated at 8:15 is quite useless for figuring out when your show should start. Knowing the rate of entry per minute, even without knowing how many people are currently seated, is much more valuable. Knowing both is more valuable still. Knowing how the rate of entry changes from minute to minute (whether it is decreasing, increasing or constant) is even better.
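The theatre reasoning can be sketched in a few lines of Python. This is only an illustration: the function names, the assumption of a constant rate of entry, and the figure of 40 people seated at 8:05 are all hypothetical.

```python
def entry_rate(arrival_minutes):
    """Estimate people per minute from observed arrival times (minutes after 8:00)."""
    if len(arrival_minutes) < 2:
        raise ValueError("need at least two arrivals to estimate a rate")
    span = arrival_minutes[-1] - arrival_minutes[0]
    if span <= 0:
        raise ValueError("arrivals must be sorted and span a positive time")
    # N arrivals delimit N-1 gaps between them.
    return (len(arrival_minutes) - 1) / span


def projected_count(seated_now, rate_per_minute, minutes_ahead):
    """Project the head count, assuming the rate stays constant (a big assumption)."""
    return seated_now + rate_per_minute * minutes_ahead


rate = entry_rate([1, 5])  # one person at 8:01, the next at 8:05
print(rate)                # 0.25 people per minute
print(projected_count(40, rate, 25))  # hypothetical 40 seated at 8:05, projected to 8:30
```

Note that a single arrival cannot produce a rate at all, and the estimate from two arrivals only holds if nobody slipped in unobserved between them. More observations over a longer period make the rate more trustworthy.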

This is all fairly introductory textbook stuff for most other disciplines. In computing though, many programmers, while aware of temporal quantities, either misinterpret them or overlook their usefulness. This has implications for quality, performance and correctness. My comments are based on observations and interpretations I have seen in this industry over the last few years, and on how we misinterpret benchmarks and metrics.

The most commonly misinterpreted statistic thrown around out there is “CPU usage”. We see people panic at “100%” CPU usage, while there are also apps out there that use only 10% CPU but bring systems to a crawl. Ask a common coder, “My app uses 100% CPU sometimes, is that good?” and the immediate response is, “What kind of coder would write such a program?” Let’s look at how a CPU works for a minute.

The CPU runs on a clock signal (a square wave, to be precise). It has a certain number of vibrations (cycles) per second, and it does work every time it vibrates (just like the piston of an engine). What you see as CPU usage for a certain program is the share of those vibrations the program used to do its own work over the interval being measured (if a car engine could divide its piston strokes between the axle and the air conditioner’s compressor, this would be the share of strokes it allocated to the axle). We often have a tendency to interpret “100%” CPU usage as bad. While that is generally a reasonable rule of thumb, when you are developing a controlled system it can lead to some quirky scenarios.
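To make “share of the CPU over an interval” concrete, here is a minimal sketch that divides the CPU time a piece of work consumed by the wall-clock time that passed. The helper names `cpu_usage_during` and `spin` are made up for this illustration, and the number describes only this one process, not the whole machine.

```python
import time


def cpu_usage_during(work, *args):
    """Run work(*args) and return (CPU seconds used) / (wall-clock seconds elapsed)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work(*args)
    cpu_used = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return cpu_used / wall_elapsed


def spin(seconds):
    """Keep the CPU busy for roughly `seconds` of wall-clock time."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


# Spinning should report a share close to 100%, sleeping close to 0%.
print(f"spinning: {cpu_usage_during(spin, 0.2):.0%}")
print(f"sleeping: {cpu_usage_during(time.sleep, 0.2):.0%}")
```

The exact figures depend on the machine and on what else is running, which is rather the point: the percentage only means something relative to the interval it was measured over.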

A CPU, just like the car engine, is running whether or not it is used to get work done. Of course, 100% usage naturally means nobody else gets a bit of that power when they need it, but there’s no reason why it shouldn’t be used whenever possible (when nobody else needs it). At times, you turn off your air conditioner so that you get a boost in acceleration. In the same way, for certain applications, a program that is deliberately NOT using 100% CPU is a very, very bad thing.

I’ve been developing server applications for a while now, and this gets brought up a lot. When my webserver is hit with one request, the server goes to 100% CPU usage for about 2-3 milliseconds. This is not only not bad, it is actually helpful, because what else did I expect the server to be doing anyway? What would it mean if I got 20 requests in one second? The server would still use 100% CPU, answer the requests in order, and answer them fairly fast too. I’ve seen people going nuts on forums when they see their OS running at 90% CPU. You can imagine how the question is framed: “If it used 100% CPU with only one request, how will it handle 2 requests at all?”
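The arithmetic behind that answer is worth spelling out. The sketch below computes utilization as busy time divided by the length of the observation window. The 2.5 ms per request (a midpoint of the 2-3 ms above), the even 50 ms spacing, and the function name `utilization` are assumptions chosen for illustration.

```python
def utilization(busy_intervals, window_start, window_end):
    """Fraction of the window [window_start, window_end) the CPU spent busy.

    busy_intervals is a list of non-overlapping (start, end) pairs in milliseconds.
    """
    busy = 0.0
    for start, end in busy_intervals:
        # Count only the part of each busy interval that falls inside the window.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            busy += overlap
    return busy / (window_end - window_start)


# One hypothetical request keeps the CPU busy from t=0 ms to t=2.5 ms.
one_request = [(0.0, 2.5)]
print(utilization(one_request, 0.0, 2.5))     # window = just the burst
print(utilization(one_request, 0.0, 1000.0))  # window = a full second

# Twenty such requests in one second, spaced 50 ms apart.
twenty_requests = [(i * 50.0, i * 50.0 + 2.5) for i in range(20)]
print(utilization(twenty_requests, 0.0, 1000.0))
```

Seen through a 2.5 ms window the server is at 100%. Over a full second the same single request is a quarter of a percent, and twenty of them add up to about 5%. “100% CPU” says nothing until you know the window it was averaged over.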

It’s not all that hard to reproduce on your home desktop too. Ever notice how media players always seem to be using 80% CPU and the system is still responsive (I compile large code-trees while playing a movie on my 2nd screen)? Well, why shouldn’t they, if nobody is using those ‘piston strokes’ for driving the axle? Conversely, sometimes a program goes unresponsive, you open up Task Manager, see barely 1-2% of total CPU usage, and wonder why the program is stuck. Happens to me too, even on programs I’ve written, until I realise that the whole “noble coding” era has passed. Back then we used “efficient” workarounds for common functions to do things faster. Modern OSes expect you to be more semantic than syntactic: tell the OS what you need in no indirect terms, and the OS knows best how to provide it to you. Try doing custom memory tricks, and you end up with inefficient code. This doesn’t mean that the “hacker culture”, where super-smart minds find new ways to improve speed and efficiency, no longer exists. It’s just that you can no longer read books on using “a+a” instead of “a*2” and hope to gain a lot of applause.
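One way to see what “telling the OS what you need” buys you is to compare a hand-rolled polling loop with a plain blocking wait. The following is a rough sketch using Python’s `threading.Event`; the helper names and the 0.2 second delay are hypothetical, and a `threading.Timer` stands in for whatever slow operation the program is really waiting on.

```python
import threading
import time


def cpu_seconds_while_waiting(wait_fn, delay=0.2):
    """Return the CPU seconds this process burned while wait_fn waited for an event."""
    ready = threading.Event()
    timer = threading.Timer(delay, ready.set)  # sets the event after `delay` seconds
    cpu_start = time.process_time()
    timer.start()
    wait_fn(ready)
    cpu_used = time.process_time() - cpu_start
    timer.join()
    return cpu_used


def busy_poll(ready):
    """The 'clever' way: keep asking whether the event has happened yet."""
    while not ready.is_set():
        pass


def blocking_wait(ready):
    """The 'semantic' way: say what you are waiting for and let the OS park the thread."""
    ready.wait()


print(f"busy poll:     {cpu_seconds_while_waiting(busy_poll):.3f} CPU seconds")
print(f"blocking wait: {cpu_seconds_while_waiting(blocking_wait):.3f} CPU seconds")
```

Both versions wait the same wall-clock time. The polling loop burns CPU for the whole of it while accomplishing nothing, whereas the blocking wait uses almost none and leaves those cycles to whoever can use them.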


7 Comments »

  1. As much as I would love to have my web server run at 100% usage, more often than not it means that there is a thread somewhere running on a busy wait. Most applications depend on some kind of I/O (who runs static pages anymore?), which means that the CPU usage should drop when moving into an I/O wait state. My experience has been that the PHP module for both Apache and lighttpd does a poor job at this.

    Did I mention that the solution to this problem is to context switch more rapidly between the busy wait processes? :-)

    Comment by Prashanth — August 17, 2010 @ 6:51 am

  2. Prashanth, your point is substantially valid, and that’s exactly where the metrics are misinterpreted.

    “100% CPU usage over _what_ time?” There are places where you’re doing computations (coercing lists, rendering charts, computing results, etc.) where the data is available and in RAM (think building the Facebook home feed from front-end caches every few seconds). All I’m saying is that artificial attempts in application code to reduce usage really go against the efforts the OS is making at optimizing. If the system really does need to do something else, the OS wouldn’t allocate a large slice to your CPU-intensive process anyway, and when it does allocate it, it means there was no better operation really waiting for it.

    I/O wait state problems also really arise from just bad coding. Once more, I’ve seen a lot of super-coders trying to do all sorts of Async calls without understanding the underlying mechanism and polling for responses. Counter-intuitively, if they depended on a blocking synchronous call, the OS could have taken care of the waiting efficiently, and timed out appropriately to provide responsiveness.

    Actually the fact that nobody runs static pages is precisely why a 100% CPU state for short bursts is a good thing – your backend/IO responses are being cached in RAM until you have enough actionable data, instead of waking up a thread for just one or two operations (unless they are conditional) before it sleeps again, and your CPU is being utilized properly by other threads in the meantime. Otherwise a thread wakes up for every response and every thread proceeds only a little bit (yes, a lot of webservers are configured with an unnecessary number of worker threads).

    Comment by archisgore — August 17, 2010 @ 11:16 am

  3. Ok this is from a n00b @ PHP. I thought that any process that requested I/O would get thrown into a wait queue pertaining to that device and it gets brought into a “ready” queue when the request is serviced. Why would PHP try its own busy-wait implementation when round-robin is available?

    Comment by Shriphani Palakodety — August 17, 2010 @ 11:21 am

  4. I agree. Threading was never my forte. Python on the other hand has an excellent alternative to threading called the multiprocessing library (partly because Python cannot execute 2 threads simultaneously on 2 cores, owing to a “global interpreter lock”). The library has thread-like semantics, with shared data structures that are all explicitly marked as shared, while the programmer is still aware that they run as different processes. I really like this abstraction. Of course this is not optimal if you are doing a lot of sharing between the processes.

    Regarding your comment on caching: this is especially relevant in today’s large web services, where almost the whole system runs off RAM. More than just latency, it is the variability in latency that kills disk. There is an excellent paper called RAMClouds by some Stanford people if you are interested.

    Comment by Prashanth — August 17, 2010 @ 11:26 am

  5. I’ve been thinking about this a lot over the past few days. Mainly because I did work on something that was a fan-in model of coding. Basically, any webpage today fans in data from multiple backend sources and generates something consumable by outsiders.

    A lot of wakeups happen when you interleave CPU/local operations with the read/write calls. So you have your thread waking up intermittently for one or two additions before sleeping again.

    To me an ideal CPU/IO graph would be a lot of 10%’s while the necessary data to render the page is being brought in, then a sudden burst of 100% (or whatever the OS can possibly allocate), and then back down to 10-20% which would be keep-alive mode.

    The only reason I posted this was because I did write code once which exactly met these conditions, and put three testers and two developers on a wild goose chase finding out why we were getting such bad numbers and such good response times.

    Comment by archisgore — August 19, 2010 @ 10:54 am

  6. Something is broken if you are idling at 10%.

    Comment by Prashanth — August 19, 2010 @ 11:56 am

  7. LOL. That’s certainly true. :-) I started with 10% and hence somehow felt symmetrical to go back down to 10%. One of those times when reality and a poetic style of writing conflict.

    Comment by archisgore — August 19, 2010 @ 12:22 pm

