Cross-section of a Quantum Viking 1 hard drive, cut in half to reveal the internal details
A simple overview of hard drive internals.
AD = Albert Dayes
JT = John Treder
AD: How big is the design team for a hard drive? What type of members comprise
a design team? (e.g. Electrical Engineers, Mechanical Engineers, etc.)
JT: The size varies, depending on the product's specifications, whether it's a new design
or an update to an existing design, and the urgency. The team for Quantum's Viking 2 could be thought
of as more or less typical. There were half a dozen mechanical engineers, two technicians, seven electrical
engineers, one and a half circuit board layout designers, a heads and media engineer, eight people who
wrote control and interface code, three product support engineers, two product test engineers, and half
a dozen managers. Quantum is unusual in assigning product support engineers near the beginning of a program.
It helps a lot, because they contribute to making the design robust from the beginning, and they understand
the ins and outs very thoroughly, so their support after the drive is in production is based more on knowledge than on guesswork.
In addition to the team members, there are purchasing, marketing, documentation, servo-writer development
and laboratory specialists who get involved, though they're in "service groups". There can be
half a dozen of those people who are "project assigned".
AD: Is price point the most important factor in a new hard drive?
JT: Yes. I suspect it even overrides profitability, though hard drive companies can't afford
many "loss leaders".
AD: How long is the development cycle from original design specification to production?
JT: Again, it varies. A "bump" product, changing the amount of data per platter,
can be done in as little as 6 months. A design from scratch, starting a new product family, will take
from 2 1/2 to 5 years, but if it takes more than 3 years, something went wrong. Viking 1 was a crash program
for Quantum, a new design and new product series, and it took 16 months. Viking 2 was a significant enhancement
of Viking 1, and it took just over 2 years.
AD: In the late 1980's or early 1990's one hard drive maker (small company from
what I recall) sued all of the other hard drive manufacturers over 3.5 inch disk patents which the company
owned. There does not seem to be much litigation in the hard drive business today. Is this due only to
the extensive patent cross licensing between the hard drive manufacturers?
JT: That was Rodime. Conner settled for an amount rumored within the company (I worked there
at the time) to be > $10 million. Quantum fought, and eventually won, but probably spent more money
than Conner. I think the major hard drive manufacturers' upper management has recognized that they'll
spend less money on the lawyers if they cross-license than if they sue.
Of course, there aren't nearly as many hard drive companies as there were 10 years ago, so there aren't
so many opportunities for litigation. The number of manufacturers has decreased enormously--there are
only 7 or 8 that I can point to today, where in the mid to late 80's there were at least 50.
AD: Can you describe the HD design process from your (Mechanical Engineer's) point of view?
JT: Before a program formally begins, there's a period of product definition. People study
head and disk technology, look in their crystal balls (cracked and cloudy, of course) to guess what the
hot product is going to be in two years, and generally try to figure out how many disks, what kind of
heads, what RPM and what capacity the drive should have. Engineering gets a hand in the process. Usually
marketing sets up an approximate specification and the guffaws from engineering cause changes.
There are usually two or three Engineering builds, then two or three Pre-production builds before mass
production begins. Before the first E-build, we often take an existing drive and try one or two ideas
on it. We'll only build two or three samples to test the ideas. The first E-build will be 10 or 20 units.
The base is either machined from a solid block of aluminum, or modified from an existing base. These units
will have the right number of heads and disks, and any key mechanical developments. They'll usually run
on a previous product's PCBA, or a new PCBA that's out of form factor and has a socketed processor so
the EE's can plug in an emulator (ICE).
Massive changes can happen between the first and second E-builds. Number of disks, RPM, head technology
(MR to GMR, for instance), and spindle motor internal design have all changed in my experience. For the
second E-build we try to have all the production technologies, though very often the base casting and
the actuator won't be made by production methods. At this time, the circuit board will be specific to
the product, though it's often still out of form factor and always socketed. The drive will have close
to the right TPI, but it usually won't be formatted to full capacity. Data and servo format details change
almost weekly at this stage.
There will be maybe 100 drives built for E-2. Mechanically, we'll be measuring runouts, shock and vibration
performance, EMI (electromagnetic interference), contamination and sealing issues, acoustics, seek performance,
and whatever else we can think of. It's our last chance to find and fix major problems without affecting
the program schedule.
The first P-build is critical to a project's success. It's when it all comes together. Mechanically, the
castings will be castings (not machined from solid), stampings will be stamped, and the drive will generally
look like a production unit. Electrically, they'll have real silicon for the main ASICs. For the first
time, we'll try to have at least some of the drives written to full capacity. At Quantum, they build 1,000
or 1,500 drives at this time. Formal product testing begins. We show samples to OEM customers, but don't
give them any.
The second P-build should be a cleanup of problems found so far, and usually doesn't involve major
mechanical changes. If big changes are needed in the mechanics, a third P-build will usually have to happen.
P-2 drives are given to OEM customers to begin evaluation. They should show 90% of the production units'
performance. Code changes come two or three times a week now.
P-2 is usually the darkest time, emotionally. You have a year or more invested, and all you can see are
problems. Marketing has been asking why it wasn't ready 6 months ago.
Then you have the mass production launch. There isn't generally much to do, mechanically, at this stage.
If there is, you've got BIG trouble! Servo engineers and interface code guys are working like mad to squeeze
the performance and get the bugs out.
AD: What is the hardest part of the design and/or the design process?
JT: Inventing a good solution to a key problem. That's also what's the most fun. When you
find an answer, it's just wonderful, and usually involves some amount of serendipity. As an example, we
were having intermittent spindle motor performance problems with Viking 1, during the E-2 time. Motors
that performed poorly also often, but not always, made a buzzing noise. My boss happened to pick up a
motor that had been torn apart and twisted the stator, and it came off in his hands. Now, the stator is
supposed to be firmly fastened down!
So, over the next month, the motor company engineers and I worked out several ways to improve the stator's
fastening to the base. We ended up adding about 30 cents to the motor's cost, but Viking 1, Viking 2 and
Atlas 4 have among the quietest motors in the industry.
AD: Are the interface/firmware engineers (I would assume EEs mostly) involved from
the beginning of the design or only after the mechanical engineers are done?
JT: The interface guys start a few months after the MEs start. They usually begin to work
on some of the E-1 drives and are fully involved by E-2.
AD: What is the most interesting discovery that you made (which is not a trade
secret) during the course of your design or re-design?
JT: The one that's most interesting to me is maybe a little bit abstruse. The disks and spindle
motor bearings and the base casting all form a complicated set of springs that vibrate with various frequencies
as the drive spins. The source of some of the vibration is obvious--imbalance in the disks, irregularities
in the ball bearings, torque pulses from the motor, for example. But especially at 7200 RPM and above,
there's more vibration amplitude, especially at higher frequencies, than these sources ought to produce.
It's air turbulence. At 5400 RPM, most of the air flow is laminar. At 7200 RPM, the airflow at the
outside of the disks is in the transition range between laminar and turbulent. And that causes odd vibrations
to come and go, as the airflow changes back and forth.
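As a rough illustration of why the rim airflow sits near the laminar-turbulent boundary, here's a back-of-the-envelope rotational Reynolds number calculation. The platter radius, the air viscosity, and indeed the use of the rim Reynolds number at all are my assumptions, not figures from the interview; published transition thresholds for enclosed rotating disks vary widely, so this only shows how the number scales with RPM.

```python
from math import pi

# Ballpark assumptions: a 95 mm (3.5-inch-class) platter and air at
# room conditions. Neither figure comes from the interview.
NU_AIR = 1.5e-5   # kinematic viscosity of air, m^2/s
RADIUS = 0.0475   # platter radius, m

def rim_reynolds(rpm):
    """Rotational Reynolds number omega * r^2 / nu at the platter rim."""
    omega = rpm * 2 * pi / 60   # angular velocity, rad/s
    return omega * RADIUS ** 2 / NU_AIR

for rpm in (5400, 7200):
    print(f"{rpm} RPM: Re ~ {rim_reynolds(rpm):,.0f}")
```

The Reynolds number grows linearly with RPM, so stepping from 5400 to 7200 RPM pushes the rim flow a third of the way closer to whatever transition threshold the particular enclosure exhibits.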
AD: Do you use computer simulators for all of your designs? (e.g. hardware-software
co-design simulator software similar to the software that CARDtools Systems provides)
JT: The EE and code guys do a lot of simulation. I'm not sure what tools they use. Sometimes
it looks a lot like Doom. <vbg> The servo guys use a combination of Matlab and custom-developed tools.
Mechanically, we do most of our design these days on a solid modeler. Different companies use different
solid modeling software. We also use FEM extensively. One person in each mechanical team is usually proficient
with FEM. We also do Monte Carlo analysis of assembly tolerances.
AD: Can you discuss/define the terms FEM and Monte Carlo analysis?
JT: FEM = Finite Element Modeling. You use programs such as Nastran or Ansys or Algor or
Fluent to model various mechanical problems--stress and deflection, magnetic fields, fluid flow, vibration
modes, and so forth.
To become really proficient at using an FEM program requires working at it full time for a year or two.
Once you've learned one of them, you can pick up another in six months or so. But I've never met anyone
who could use more than one FEM program at the same time--they're all very finicky and different from each other.
You can get results quite easily. It's very difficult to get meaningful results that correlate well with reality.
FEM used to be mainframe or mini stuff. It started to run on Unix workstations about 10 years ago, and
in the last two or three years it's become practical to run FEM on a high-end Windows NT box. Models run
in from a few minutes to overnight. If you have a really big, slow model, it could have taken months to
build, and a weekend run doesn't seem all that slow. Big models, in Unix, can usually be parceled out
across various machines on your local network, over night or over the weekend. FEM is basically inverting
a few zillion enormous matrices. The problems involve ill-conditioning and slow inner loops. That's why
it takes an expert to get good results--ill-conditioning, especially, can give very bad answers without warning.
Monte Carlo analysis is used for studying the effect of assembly variables. "Monte Carlo" refers
to "rolling the dice". Any time you have many statistically independent variables, for each
one of which you can propose a statistical model of its values, and for which you can make a mathematical
model of how they combine, you can use a Monte Carlo analysis to come up with a statistical model of how
the variables might work together. I don't know of any system that can make a general mathematical model
of how the variables might combine, so a Monte Carlo analysis requires writing the core engine of a program
for each problem.
Here's an absurdly simple example, the sort of thing that's commonly checked out with a spreadsheet.
Say you have a stack of 6 bricks in your assembly. All the bricks are nominally the same thickness, but
there are three brick factories where you buy them, and of course, bricks aren't all =exactly= the same
thickness, and the bricks from each factory tend to be a little different. You're going to make a million
of the assemblies, so you want to know what you can expect the height of the tallest, shortest, and average
stack will be. You'd also like to know what the odds are that the stack will be higher than some "magic
height" where it won't fit.
You measure a bunch of bricks from each factory, and calculate the mean and standard deviation of each
factory's output. If you're clever, you also make a histogram of thicknesses and see if the distribution
matches (within reason) the "normal" distribution (Bell curve, Gaussian distribution).
Then you write a program that takes into account the number of bricks coming from each factory,
and each factory's distribution, and roll the dice to make, on paper (or computer, whatever), a large
number of brick stacks. Say 20,000 just for laughs. You put the results into a histogram and report the
mean, standard deviation, min, max, number over "magic", and so forth.
The advantage of simulation is that you can tinker with the variables.
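The brick-stack procedure described above is easy to sketch in a few lines of Python. All the numbers here (factory means, sigmas, supply shares, the "magic height") are invented for illustration:

```python
import random
import statistics

# Hypothetical per-factory thickness distributions (mean, sigma in mm)
# and each factory's share of the supply -- all numbers are made up.
FACTORIES = [
    (65.0, 0.4, 0.5),   # factory A: mean, sigma, 50% of supply
    (65.3, 0.6, 0.3),   # factory B: 30%
    (64.8, 0.3, 0.2),   # factory C: 20%
]
BRICKS_PER_STACK = 6
MAGIC_HEIGHT = 394.0    # the stack must fit under this height, mm

def random_brick():
    """Pick a factory weighted by its supply share, then draw a thickness."""
    r = random.random()
    for mean, sigma, share in FACTORIES:
        if r < share:
            return random.gauss(mean, sigma)
        r -= share
    mean, sigma, _ = FACTORIES[-1]          # guard against float round-off
    return random.gauss(mean, sigma)

def simulate(trials):
    """Roll the dice `trials` times; return the list of stack heights."""
    return [sum(random_brick() for _ in range(BRICKS_PER_STACK))
            for _ in range(trials)]

heights = simulate(20_000)
print(f"mean  {statistics.mean(heights):.2f} mm")
print(f"stdev {statistics.stdev(heights):.2f} mm")
print(f"min   {min(heights):.2f} mm   max {max(heights):.2f} mm")
print(f"over 'magic' height: {sum(h > MAGIC_HEIGHT for h in heights)}")
```

Tinkering then amounts to editing the `FACTORIES` table (say, dropping the sloppiest supplier) and re-running, which is exactly the point of the method.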
For Atlas 4, I wrote a Monte Carlo simulation of where the tracks would be on the disk. I used 34 independent
variables, and the "assembly" used a lot of trigonometry to account for the angles as the actuator
rotates in going across the disk, and for various "tilts" that happen. I used Borland Pascal
7 to do the job, with objects. The simulation ran at about 100 assemblies per second on an Intel Pentium-90,
so you could simulate 50,000 assemblies in less than 10 minutes. It took a couple of hours to make sense
of the results, of course.
We ran 14 different sets of input variables before we were happy with the answers. It took me a month
to write the program (I was actually rewriting a similar one that I did for Viking 1), and about 3 weeks
to go through the analysis loops.
The engineer who did a similar analysis for Viking 2 used a Microsoft Excel add-in, and he used to let
the program run overnight on an Intel Pentium-200.
AD: Can you discuss what is involved in the testing process for a hard drive? What
basic tests are absolutely required to be passed before shipping the drive?
JT: People make careers out of testing hard drives. There are the various engineering and
qualification tests that each product has to pass before it's "shippable", then the detailed
production tests that each drive has to pass before it's shipped.
Engineering and qualification tests are by no means identical, but I'll lump them together for an SST-altitude
view. I'm a mechanical guy, so I may miss some electrical or software testing in this list. It's not that
I don't care, just that I'm ignorant of many details outside my specialty.
Operating and non-operating shock and vibration performance. Non-operating tests look for physical damage.
Operating tests look for error rates and performance degradation in addition to physical damage.
Four-corner tests. Drive performance is measured at various combinations and rates of change of temperature
and humidity, ranging generally about 5C beyond the specified temperatures, and usually some amount beyond
the specified humidity (it's a lot harder to control humidity precisely).
Altitude tests. Drive performance is measured from 200 feet below sea level to at least 10,000 feet above
sea level. Flying height and flying height variation is particularly scrutinized.
Voltage limits. Drives are typically specified to run at plus or minus 5% of specified voltage. Testing
is commonly done to plus or minus at least 10%. There's normally a test to find out how far off you can
go before the drive fails. All combinations of high, low and nominal 5V and 12V are tested.
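Enumerating the "all combinations" matrix is a one-liner with itertools. The +/-10% excursion below matches the testing range mentioned, but the exact levels any given manufacturer uses are an assumption:

```python
from itertools import product

# Nominal supply rails and a hypothetical +/-10% test excursion.
NOMINAL = {"5V": 5.0, "12V": 12.0}
LEVELS = (-0.10, 0.0, +0.10)   # low, nominal, high

# One test corner per combination of (level on 5V rail, level on 12V rail).
corners = [
    {rail: round(v * (1 + d), 2) for (rail, v), d in zip(NOMINAL.items(), deltas)}
    for deltas in product(LEVELS, repeat=len(NOMINAL))
]
for c in corners:
    print(c)
print(len(corners))   # 3 levels on 2 rails -> 9 combinations
```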
RFI/EMI tests. The drive's electronic emissions are measured, and its susceptibility to external electromagnetic
fields is measured.
Start/stop reliability. Samples are started and stopped massive numbers of times. Starting current, error
rates and acoustics are measured at intervals. For a 40,000 start/stop spec, about 1000 drives are spun
up and down maybe 80,000 to 100,000 times each. The test takes months.
Acoustics, both idle and seeking, both "new condition" and after various torture sessions. If
a drive fails, it can be very hard to find out why and what to do about it. I've probably spent a total
of 5 years working on acoustical issues, interspersed with other tasks.
Contamination measurements. This is usually done with drives that have been going through 4-corner or
some kind of reliability testing. Test results can be incredibly baffling and hard to interpret and hard
to figure out what to do.
All the various interface tests (data rates, error rates and so forth). There are many such tests, and
I'm afraid I just don't know much about them in detail.
Latch reliability. All modern drives include some kind of a lock to keep the actuator parked in the landing
zone while it's stopping and starting. There are several kinds, and many variations of each design. The
lock has to keep the actuator parked while spinning down and not allow any combination of shocks and accelerations
to let the heads move out of the landing zone while power is off. Both linear and rotational shocks and
accelerations are tested. This is one of the most difficult tests to pass and one of the most hated assignments
for a mechanical engineer.
TMR measurements and other servo performance measurements. This is a critical item. Servo performance
is subject to strange failures on totally unpredictable combinations of seeking and external influences.
Servo engineers have as hard a life as mechanical engineers! I've given you a very cursory discussion
of TMR, and it shouldn't be hard to dream up dozens of tests from that, if you have a sufficiently evil imagination.
Reliability tests. Thousands of drives are run for a few thousand hours each and power consumption, data-handling parameters,
and reliability are measured. The final reliability test is so stringent that one hard data error in a
couple of thousand drives, over a thousand or more hours each, can halt the program. Such a thing may
happen once in two out of three development programs. There's hell to pay when it does!
A small percentage of new drives are destructively tested for non-operating shock, internal cleanliness,
and such things.
Samples are measured for acoustic performance.
The rest of these tests are run on 100% of drives. It's typical for such tests to take about 8 hours. The 36-GB Quantum
Atlas 4 drive needs about 20 hours to do its testing. That's partly because the error testing takes time
directly proportional to the number of disks and partly because that drive gets unusually stringent testing
because of its intended market.
Every head on every drive is measured for its reading and writing properties (amplitude, resolution, overwrite
capability, PW50 [a measure of how cleanly a transition can be read], and nowadays some MR characteristics
that I don't recall). The drive maintains tables of these parameters by head and zone on the disk. There
are typically 16 data rate zones.
[ Note by JT: Look at the article about Disk
Layout for more information about zones. ]
[ Note by AD: The data are kept in a "secret room" which John Treder will explain
a bit more about:
A drive might have 4 disks, 8 heads, and 16 "data rate zones", each of which may have a different
number of data sectors. Each head and each zone will have some of its critical read/write parameters measured
during the factory test and stored away. That means a dozen or so tables of 128 values each (probably
integers) to be kept somewhere. You don't want to put it in ROM, because it would be too expensive to
have a unique ROM for each unit you build. And a hard drive is designed to hold variable information.
So you simply keep all kinds of running and testing data on the drive. You also keep a good deal of the
drive's operating code on disk, and page it to the drive's RAM as needed.
All that stuff is stored in "extra" tracks outside the user's data space. That's our "secret
room". The extra tracks are formatted exactly the same as regular data space, it's merely on tracks
-1 to -28 (or whatever), and they're part of the outermost data rate zone. There are usually 25 or 30
tracks "reserved" for that purpose, so a hard drive has room to store perhaps 20 or 30 MB of
private programs and data. ]
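As a toy illustration of the per-head, per-zone tables described above, here's a sketch that packs 8 heads x 16 zones x 4 parameters into the sort of binary blob a drive might stash in its reserved tracks. The field names, 16-bit encoding, and layout are my guesses; real drives store many more parameters in proprietary formats:

```python
import struct

HEADS = 8
ZONES = 16
# Hypothetical read/write calibration fields -- illustrative names only.
FIELDS = ("amplitude", "resolution", "overwrite", "pw50")

def pack_tables(tables):
    """Serialize tables[head][zone] -> {field: int} into little-endian
    16-bit integers, head-major then zone-major, field order fixed."""
    blob = bytearray()
    for head in range(HEADS):
        for zone in range(ZONES):
            params = tables[head][zone]
            for field in FIELDS:
                blob += struct.pack("<h", params[field])
    return bytes(blob)

# Build a dummy table: 8 heads x 16 zones x 4 parameters.
tables = [[{f: head * 100 + zone for f in FIELDS} for zone in range(ZONES)]
          for head in range(HEADS)]
blob = pack_tables(tables)
print(len(blob))   # 8 * 16 * 4 * 2 = 1024 bytes
```

Even this toy table is a kilobyte per parameter set, which is why a fixed per-unit ROM would be awkward and the reserved tracks are the natural home for it.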
Every surface is scanned for media defects and hard error locations are reallocated. There are algorithms
used for "scratch-fill" to eliminate sectors between detected errors; those sectors are likely
to have errors that just didn't quite get detected. Several hundred hard errors per surface is normal.
The test includes deliberately moving the heads off the track center to find scratches or pits "between" tracks.
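A minimal sketch of the "scratch-fill" idea: given the sectors that failed the media scan, also map out the short runs between nearby defects, since a scratch or pit often extends into sectors that only just passed. The gap threshold is invented; a production drive's algorithm is surely more elaborate:

```python
def scratch_fill(bad_sectors, gap=8):
    """Return bad_sectors plus every sector lying in a short run
    between two detected defects. `gap` is the largest in-between
    run to bridge -- a made-up threshold for illustration."""
    if not bad_sectors:
        return []
    filled = set(bad_sectors)
    for a, b in zip(bad_sectors, bad_sectors[1:]):
        if 1 < b - a <= gap:
            filled.update(range(a + 1, b))   # fill the suspect run
    return sorted(filled)

print(scratch_fill([100, 103, 104, 140]))
# -> [100, 101, 102, 103, 104, 140]
# Sectors 101 and 102 get bridged; 140 is too far away to bridge.
```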
Actuator latch opening and closing speed is tested.
Data rates and soft data error rates are measured by head and zone. Even if a drive may pass overall,
it can fail on a detail.
Servo parameters such as raw seek times, settling times, stability parameters, and quality of the recording
of the servo data are measured. Information about how much current it takes to stay on track, and what
the relation is between seek current and acceleration (the torque constant) for several places across
the disk is stored in the "secret room".
At Quantum, the production testing is done in an environment that's roughly equivalent to what a typical
operating environment might be--ambient temperature around 40C, humidity whatever it is (factories are
in Singapore, southern Japan and Ireland, so humidity is generally high), and about 100 drives running
in a test cell, 10 drives per shelf, sort of bouncing around.
Criteria to pass the tests are always more stringent than the specs. That's been true everywhere I've
worked. I thought about the possibility of trying to give specific numbers for these tests, but as I thought
about it I realized that the test criteria change so fast that what I know is certainly already obsolete.
AD: Has there been any consideration to make flash ROM (similar to what most modems
have) a standard for hard drives? Or is that considered too dangerous?
JT: During development and often leaking into the early production drives, there's a flashable
ROM on the drives. It's replaced as early as possible with hard-coded ROM to save costs. And afterwards,
if the ROM needs changing, it's an earthquake-level task.
So it isn't danger, it's a combination of $$$ and tradition.
AD: If cost was not an issue what kind of hard drive would you design for your own use?
JT: 10K RPM, 2 1/2" disks, about 20 GB, with two complete drives, striped, in one housing.
It should be able to put about 60 MB/s across the interface, continuously.
It won't happen. It's WAY too expensive, but technically pretty easy to do.
AD: Any thoughts on other storage technologies such as CD-R/CD-RW, Magneto-Optical
or DVD? Do you think these technologies will replace hard drives as the primary storage medium?
JT: I don't think CD and its derivatives will replace hard drives because the physics in
the way they write, especially, is slower than hard drive magnetics. MO has no advantage over hard drives
in speed or data density.
However, sometime before 2010, hard drives will hit the wall in terms of data density, and at this time
I don't know of a way around it. That's the first time I've had to say that in my hard drive career. In
the past, I've been able to perceive one or more ways around some supposed barrier to speed or capacity.
When the data density barrier (technically it's called the superparamagnetic limit) is reached, the only candidate
I see for replacement today is some development of flash ROM. They need to cut cost by an order of magnitude
and improve writing speed by a couple of orders of magnitude. Those are formidable challenges!
AD: Any common misconceptions about hard drives that end users have that you would
like to clear up?
JT: The biggest one is that you'll wear out the bearings by letting the drive run. If you
leave a drive running continuously, there's roughly a 1% chance that a bearing will fail in the first few years.
The only other one is the argument about whether to leave it running or shut it off. I just said it won't
hurt to leave it running. Well, the standard test for starting and stopping ends up with a 0.3% chance
of a drive failing to start in the first 20,000 starts.
So my advice is, leave it running if you like, shut it off if you like. It doesn't matter. If your
drive fails, it isn't because of your choice in that matter.
AD: Did you work a 40 hour work week during the design process?
JT: As I said before: ha, ha, ha, ha!! During the heart of a project, 60 or 70 hours was pretty
much normal. When there's a crisis, or during a build, 80 to 100 hours of actual working (not just being
there) is what you do.
It's funny, of course. One guy will be busting butt, and the guy in the next cube will have nothing out
of the ordinary on the fire. Yet engineering is essentially an intellectual sport, so you can't just hand
off half your work when you have a crisis.
AD: How much documentation was produced for the Atlas 4? An estimate would be fine.
JT: Depends on what "documentation" means--but let me see--print docs, maybe a
pile 20 feet high, if you don't count all the drafts.
The hard-copy documents I kept in my file filled an entire file drawer for each of three programs I worked
on at Quantum. I threw away a lot more paper than I kept.
There were 200+ mechanical drawings, with 2 to 15 revisions each (average maybe around 5 revisions). I
didn't have a complete solid model file of the drive; my solid model directories for Katana (Quantum's
internal project name for Quantum Atlas 4 (SCSI) and Fireball +KA (IDE); they're the same except for the
interface) never ran more than about 200 MB. Two other engineers kept solid model directories, too.
The firmware manager had a graph on his office wall about size of the code files--I think it peaked around
4 or 5 gigs.
E-mail & phone-mail messages, I have no idea. My E-mail was constantly overflowing my 4 MB allocation--about
500 messages or so. I had to purge it once every couple of months. Managers had more space <g>.
In other words, lots and lots of docs.
AD: What is the estimated cost to bring a new hard disk to market from start to finish?
JT: These days, in the range of $50 million.
AD: Can you discuss a bit about the ATM (Automatic Teller Machine) deposit mechanism
that you designed (20+ years ago)?
JT: 20 years is a long time! The problem was to accept deposit envelopes of various sizes
and shapes and thicknesses, maybe containing coins, print an identification number on them, pass the UL
test for theft resistance, work when it's pouring with heavy rain, fit in the available space, be easy
to maintain, and cheap. In general, the usual engineering challenges. <g>
The hardest task was to pass the UL break-in test. The tester was a massive, muscular fellow armed with
punches, crowbars, sledgehammers, long grabbers, and other tools. He could study the deposit system for
as long as he wanted to, inside and out, before he began his attack. He had half an hour of actual "breaking
in" time to try to fish out an envelope. That half hour didn't have to be contiguous. He could bang
on the depository, then go around and see if his attack was working. Another UL person timed him with
a stopwatch. We passed, barely.
It was also difficult to come up with a reliable envelope printer. I eventually designed a sort of rotary
rubber stamp that printed the number every couple of inches along the envelope. If there were coins, it
seemed from our testing that there was always a way to make out the number, maybe combining a couple of impressions.
This picture of a Corvette was taken in June of '67 at the Corkscrew at Laguna Seca.
AD: Since you have done race car driving via Sports Car Club of America (SCCA)
road racing do you ever play with car racing simulations/video games?
JT: I've tried a couple, also tried a couple of coin-op games. They're boring.
One of the arcade games had pretty good visuals, comparable to the in-car cameras you see on TV occasionally.
The problem I have with all those things is that you don't get the physical feedback. 1.5G+ cornering
forces, 1G+ braking, etc. If you get a flat-spotted tire it can literally cause you to see double. The
simulations don't do that stuff. Also, they're generally over way too quick. If you imagine more intensity
than an arcade racing game, then have it last for 45 minutes or so, that's what real racing is like. I'm a skinny sort
of guy, 5' 8" and 140 lb, and I used to sweat off 5 lb or so in a 40 minute session. It was fun (the
most fun you can have while dressed), it was intense, and it required total commitment, not only on the
race track, but in preparation too. When I was no longer prepared to give the commitment, I retired.
This picture of a Ralt RT-4, was taken in March of '86 at Firebird Raceway in Phoenix.
AD: Thank you.
Mr. John Treder has also written more details on hard drive history, design, and performance issues which
are included in the following sections.
Copyright © 1999 by Albert Dayes