Another Way To Find Similar Players – Introducing The Rate Similarity Score
Bill James introduced the concept of similarity scores by comparing career totals in games played, at bats, runs, hits and other offensive stats. Most of the stats used are counting stats, and therefore players can be similar only if they are similar hitters and they have a similar number of plate appearances.
This is fine as far as it goes, but it doesn’t identify similar batters with very different career lengths. So what if you made a similarity score based purely on rate stats? It should identify similar offensive players, no matter how many games they played.
Enter a new stat, the Rate Similarity Score (RSS), which compares the rates of singles, doubles, triples, home runs, walks, and hit by pitches per plate appearance, and the number of stolen bases divided by the number of singles plus walks plus hit by pitches. The RSS is determined by first finding the difference between two players’ rates for each stat and dividing it by the biggest difference in that rate between any two players in the full sample being used. Then a root mean square difference (RMSD) is found over the seven scaled rate differences; the RSS is 1 minus the RMSD. Two identical players would have an RSS of 1, while two perfectly dissimilar players (at opposite extremes of the sample in all seven stats) would have an RSS of 0.
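The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the results in this article: the function names, the dictionary keys and the three players are invented, and the "biggest rate difference" is taken to be the maximum rate minus the minimum rate in the sample.

```python
import math

PER_PA_STATS = ("1B", "2B", "3B", "HR", "BB", "HBP")

def rates(p):
    """Return the seven rate stats for one player's career totals."""
    per_pa = [p[s] / p["PA"] for s in PER_PA_STATS]
    # Stolen bases per time on first base (singles + walks + hit by pitches).
    on_first = p["1B"] + p["BB"] + p["HBP"]
    sb_rate = p["SB"] / on_first if on_first else 0.0
    return per_pa + [sb_rate]

def rate_ranges(all_rates):
    """Biggest difference (max - min) in each rate across the whole sample."""
    return [max(col) - min(col) for col in zip(*all_rates)]

def rss(ra, rb, ranges):
    """1 minus the root mean square of the range-scaled rate differences."""
    squares = [((a - b) / r) ** 2 if r else 0.0
               for a, b, r in zip(ra, rb, ranges)]
    return 1.0 - math.sqrt(sum(squares) / len(squares))

# Invented career totals for three hypothetical players (not real data).
sample = {
    "slugger":   {"PA": 6000, "1B": 700,  "2B": 250, "3B": 10, "HR": 450,
                  "BB": 1000, "HBP": 60, "SB": 10},
    "speedster": {"PA": 6000, "1B": 1300, "2B": 200, "3B": 90, "HR": 30,
                  "BB": 400,  "HBP": 20, "SB": 500},
    "middle":    {"PA": 9000, "1B": 1500, "2B": 340, "3B": 75, "HR": 360,
                  "BB": 1050, "HBP": 60, "SB": 390},
}

player_rates = {name: rates(totals) for name, totals in sample.items()}
ranges = rate_ranges(list(player_rates.values()))

for a, b in [("slugger", "slugger"), ("slugger", "middle"),
             ("slugger", "speedster")]:
    print(a, b, round(rss(player_rates[a], player_rates[b], ranges), 5))
```

Because each difference is scaled by the range within the sample, the same pair of players can get a different RSS when the sample changes (for example, a 3000 PA cutoff instead of 5000).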
The most similar players among those with 5000 PA or more:
The best match is between Harold Baines and Richie Zisk. They had very different career lengths (Baines had almost two times as many PA), but very similar slash stats (.289/.356/.465 for Baines versus .287/.353/.466 for Zisk). You could just use slash stats to determine a rate similarity, but RSS has more detail since it considers doubles, triples and home runs separately (rather than simply total bases), differentiates between walks and hit by pitches, and also includes stolen bases.
Duke Snider and Dick Allen are a close match; Snider had 923 more PA and is in the Hall of Fame. Note also that RSS doesn’t account for different eras (although in principle it could if you used neutralized stats). The Alomar/Larkin pairing is also interesting, since they are both middle infielders and Alomar just made the Hall of Fame. Their main difference was that Larkin had 1343 fewer PA, and of course one played second base and the other shortstop.
The most unusual player according to his rate stats (among the 910 players with at least 5000 PA) is Billy Hamilton; his most similar player (Eddie Collins, .85816) had the lowest RSS of any player’s best match. Others whose most similar player had a low RSS were Hughie Jennings (Tommy Tucker, .86747), Otis Nixon (Tommy McCarthy, .86967), Barry Bonds (Jim Thome, .87135), Harry Stovey (Oyster Burns, .87927) and Rickey Henderson (Joe Morgan, .88106). Babe Ruth and Ted Williams were each other’s most similar player using RSS, with a score of .88176. They also show up in each other’s top-ten most similar according to the standard similarity score.
Other interesting closest pairs:
In many of the cases shown here, one player of the pair is in the Hall of Fame, or is often discussed as a possible candidate, while the other is not. Usually it is a difference in career length (number of PA) or fielding position that accounts for our different assessments of two players who were actually very similar offensively, as measured by their RSS.
Pete Rose is an interesting case. Using the standard similarity ratings (taken from Baseball-reference.com, which has a slightly modified version of Bill James’ original similarity formula), his nine most similar players are all in the Hall of Fame, and the tenth, Craig Biggio, might very well get there. The most similar player is Paul Molitor with a score of 678, which is not very close to the maximum possible value of 1000 (standard similarity scores tend to be in the 800s and 900s for most similar players). But using RSS, his most similar players are Billy Goodman, Joe Sewell (.94917), Kevin Seitzer (.94885), Woody English (.94821) and Dom DiMaggio (.94669), a pretty nondescript bunch (if you’re wondering who Billy Goodman is, you’re not alone). These RSS scores are fairly high, so Rose, Goodman, et al., were in fact very similar hitters. Clearly Rose’s Hall of Fame merit comes mainly from his longevity. This is no surprise, but the RSS values make that point very convincingly.
The above table has a number of other interesting comparisons. Winfield and Smith, most similar to each other by RSS, were also most similar by the standard similarity score at ages 27, 29 and 30, but Winfield went on to a much longer career. Likewise, Vizquel and Concepcion were most similar to each other at various ages from 23 to 39, but Vizquel has amassed 2048 more PA. Oliver and Puckett were most similar at ages 34 and 35 using the standard similarity score, but Oliver went on to have 1947 more PA. Olerud and Martinez are actually the third and second most similar to each other, respectively, by the standard similarity score, since their PA totals are only 391 apart.
The two most dissimilar players (lowest RSS) with 5000 or more PA are Mark McGwire and Hughie Jennings, with an RSS of .35970. The next eight lowest scores also pair McGwire with another player: Willie Keeler (RSS .37876), Buck Ewing (.37997), Vince Coleman (.38141), Joe Jackson (.38624), Ty Cobb (.39705), Nap Lajoie (.40121), Willie Wilson (.40137) and Sam Crawford (.40523). This group hit few home runs but had a lot of triples and stolen bases, which are their major distinguishing characteristics compared to McGwire.
The players who were the least similar match (lowest RSS) for the largest number of other players are:
|Player|# most dissimilar to|
Clearly Mark McGwire had a very unusual batting profile.
The player whose lowest RSS with any other player was the highest was Jim Landis, whose smallest RSS (.62394) was with Hughie Jennings. He might be considered the most average player, not too far from anyone else. A close second was Jackie Robinson, with a smallest RSS of .62344, with Mark McGwire.
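Both of these rankings (how often a player is someone’s least similar match, and whose lowest RSS is highest) come from the same table of pairwise scores. Here is a minimal sketch of that bookkeeping, using invented RSS values for four hypothetical players labeled A through D; the variable names are illustrative only, and none of the numbers are real results.

```python
from collections import Counter

# Invented pairwise RSS values for four hypothetical players (not real data).
pair_rss = {
    ("A", "B"): 0.90,
    ("A", "C"): 0.70,
    ("A", "D"): 0.40,
    ("B", "C"): 0.75,
    ("B", "D"): 0.45,
    ("C", "D"): 0.60,
}

def score(p, q):
    """Look up the RSS for a pair regardless of the order it was stored in."""
    return pair_rss[(p, q)] if (p, q) in pair_rss else pair_rss[(q, p)]

players = sorted({name for pair in pair_rss for name in pair})

# For every player, find the least similar other player and that RSS.
least_similar = {}
for p in players:
    partner = min((q for q in players if q != p), key=lambda q: score(p, q))
    least_similar[p] = (partner, score(p, partner))

# How many players have each player as their least similar match.
counts = Counter(partner for partner, _ in least_similar.values())

# The "most average" player: the one whose lowest RSS is the highest.
most_average = max(least_similar, key=lambda p: least_similar[p][1])

print(counts.most_common())
print(most_average, least_similar[most_average])
```

In this toy sample, D plays the role McGwire plays in the real data (the least similar match for most of the others), while C is the player who is not too far from anyone.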
There is a lot more that hasn’t been shown here. If anyone wants to look at an Excel spreadsheet with the five most similar and five most dissimilar players for every hitter with at least 5000 PA, the file may be found here: http://www.public.iastate.edu/~whisnant/similar.xls. The spreadsheet also has the five top similars and dissimilars for all players with 3000 or more PA.