Sep 8, 2016

Benford's Law in R (cont.): Actual Data

This is the second post based on Sara Silverstein's blog on Benford’s Law. Previously we duplicated her comparison of the proportions of first digits from a series of randomly generated numbers, and from successive arithmetic operations on those numbers, and saw that the more complicated the operation, the closer the conformance.
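For readers who skipped the earlier post, that experiment can be sketched roughly as follows. This is a minimal base R sketch under stated assumptions, not the code from the previous post: the uniform draws, the sample size, the seed, and the helper names `first_digit` and `max_dev` are all chosen here for illustration.

```r
# Benford's Law: expected proportion of first digit d
benlaw <- function(d) log10(1 + 1 / d)

# Leading significant digit, read from scientific notation to avoid
# floating-point surprises (e.g. 0.3 / 0.1 is slightly less than 3)
first_digit <- function(x) as.integer(substr(sprintf("%.15e", abs(x)), 1, 1))

# Largest absolute gap between observed first-digit proportions and Benford
max_dev <- function(x) {
  obs <- tabulate(first_digit(x), nbins = 9) / length(x)
  max(abs(obs - benlaw(1:9)))
}

set.seed(2016)                       # arbitrary seed, for reproducibility
n <- 1e5                             # assumed sample size
u <- matrix(runif(4 * n), ncol = 4)  # four independent sets of random numbers

# Deviation from Benford for raw numbers and for products of 2 and 4 of them
dev <- c(single   = max_dev(u[, 1]),
         product2 = max_dev(u[, 1] * u[, 2]),
         product4 = max_dev(u[, 1] * u[, 2] * u[, 3] * u[, 4]))
print(round(dev, 3))
```

The deviation shrinks as more random numbers are multiplied together, which is the "more complicated operation, closer conformance" pattern described above.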

In this post we investigate the conformance with actual data, similar to Ms. Silverstein's investigation of "all the values from Apple's financials for every quarter over the past ten years."

Four different types of financial documents from property/casualty insurance were investigated:

1. An exhibit of estimated ultimate loss using various actuarial methods, and related calculated values

This exhibit includes financial values as well as some non-financial numbers, such as rows labeled with years, which could skew the results.

2. A Massachusetts insurance company rate filing 

In addition to many financial values, rate filings include much text and many numbers that are non-financial in nature.

3. An insurance company annual statement from 2009

Annual statements (aka, the Yellow Book) include many, many, many, many, many, many financial values.

4.  Schedule P data compiled by the Casualty Actuarial Society

Schedule P for six different lines of business for all U.S. property casualty insurers can be found at this link. The six files were combined into a single document. To isolate the investigation to purely financial numbers sans labels, company codes, and the like, the columns investigated are "IncurLoss_", "CumPaidLoss_", and "BulkLoss_".
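A minimal sketch of that column selection, assuming the combined data sit in a data frame. The toy data frame `schedule_p`, its column suffixes, and its values are made up for illustration; only the three column-name prefixes come from the text above.

```r
# Hypothetical stand-in for the combined Schedule P data
schedule_p <- data.frame(
  GRCODE        = c(86, 86, 337),       # company code: not a financial value
  AccidentYear  = c(1988, 1989, 1988),  # label: not a financial value
  IncurLoss_X   = c(367404, 0, 18),
  CumPaidLoss_X = c(70571, 155905, 0),
  BulkLoss_X    = c(127737, -60, 4)
)

# Keep only the columns whose names start with one of the three prefixes
prefixes <- c("IncurLoss_", "CumPaidLoss_", "BulkLoss_")
keep <- grepl(paste0("^(", paste(prefixes, collapse = "|"), ")"),
              names(schedule_p))

# Pool the selected columns into one numeric vector and drop the zeros,
# which have no first significant digit
values <- unlist(schedule_p[keep], use.names = FALSE)
values <- values[values != 0]
print(values)
```

Matching on a prefix rather than on exact names lets the same code pick up the line-of-business suffix each file attaches to these columns.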

Here are the results. The number of non-zero numbers in each document is indicated on the plot.

The Schedule P data is the most purely financial in nature, and its plot in black matches Benford's Law almost exactly. Perhaps surprisingly, the Exhibits document is also quite close even though it holds the fewest observations. Perhaps a better job of pulling purely financial numbers out of the Rate Filing and the Annual Statement would improve their conformance.
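The tabulation behind a comparison like this can be sketched in a few lines of base R. The vector `amounts` is invented for illustration, and the helper `first_digit` is an assumption, not code from the original analysis.

```r
benlaw <- function(d) log10(1 + 1 / d)

# Leading significant digit, taken from the scientific-notation string
first_digit <- function(x) as.integer(substr(sprintf("%.15e", abs(x)), 1, 1))

# Invented values standing in for the numbers pulled from one document
amounts <- c(1250.75, 0, 18, 0.045, 2009, 367404, 1000, -93.2, 110)
nonzero <- amounts[amounts != 0]   # zeros have no first significant digit

digits   <- first_digit(nonzero)
observed <- tabulate(digits, nbins = 9) / length(nonzero)

# Observed proportions side by side with Benford's predictions
comparison <- data.frame(digit    = 1:9,
                         observed = round(observed, 3),
                         benford  = round(benlaw(1:9), 3))
print(comparison)
```

With only eight non-zero values the observed column is far too noisy to judge conformance; the point is the mechanics, which are the same for a document with thousands of numbers.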


For reading PDF documents into R as text strings, I used the readPDF function in the tm package. Look at this link to learn how to download the binary files that make readPDF work easily, and the suggestion of where to store them for expediency.

To divide strings of characters into individual "words", I used 'scan' in base R. See this link.
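A small sketch of that splitting step. The text line is invented, and the choice of `quote = ""` is an assumption made here so that apostrophes in the text are not treated as quote delimiters.

```r
# One line of text as it might come out of a PDF (invented for illustration)
txt <- "Total incurred loss 1,234.56 as of 12/31/2009"

# what = "" asks scan() for character "words" split on whitespace;
# quote = "" keeps apostrophes and quotation marks as ordinary characters
words <- scan(text = txt, what = "", quote = "", quiet = TRUE)
print(words)
```

Note that "1,234.56" survives as a single word, so the thousands separator can be handled later at the number-parsing step.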

For parsing numbers, in all their various forms with commas, decimal points, etc., I used the parse_number function in the readr package.
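A brief sketch of that parsing step, assuming readr is installed. The example words are invented; wrapping the call in `suppressWarnings` is a choice made here because words that contain no number produce a parsing warning.

```r
library(readr)

# Invented "words" of the kind found in financial documents
words <- c("1,234.56", "$5,000", "12.5%", "loss")

# parse_number() drops non-numeric characters around the first number and
# ignores the grouping mark; a word with no number becomes NA
values <- suppressWarnings(parse_number(words))
print(values)

# Keep only the words that yielded a non-zero number
numbers <- values[!is.na(values) & values != 0]
```

Filtering out the `NA` results is what separates the numbers from the surrounding text, although labels such as years still parse as numbers and remain in the sample.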


R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] readr_1.0.0 tm_0.6-2 NLP_0.1-9

loaded via a namespace (and not attached):
[1] assertthat_0.1 rsconnect_0.4.3 parallel_3.3.1 tools_3.3.1 tibble_1.2
[6] Rcpp_0.12.5 slam_0.1-38

Aug 30, 2016

Benford's Law Graphed in R

Using R to replicate Sara Silverstein's post at

A first-year student near and dear to my heart at the Kellogg School of Management thought I would be interested in this Business Insider story by Sara Silverstein on Benford’s Law. After sitting through the requisite ad, I became engrossed in Ms. Silverstein’s talk about what that law theoretically is and how it can be applied in financial forensics.

I thought I would try duplicating the demonstration in R.1 This gave me a chance to compare and contrast the generation of combined bar- and line-plots using base R and ggplot2. It also gave me an opportunity to learn how to post RMarkdown output to Blogger.

Using base R

Define the Benford Law function using log base 10 and plot the predicted values.
benlaw <- function(d) log10(1 + 1 / d)
digits <- 1:9
baseBarplot <- barplot(benlaw(digits), names.arg = digits, xlab = "First Digit", 
                       ylim = c(0, .35))
  • That was easy!
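As a quick sanity check on the function above, the nine predicted proportions should sum to 1, because the sum telescopes: log10(2/1) + log10(3/2) + ... + log10(10/9) = log10(10) = 1. A minimal sketch in base R, redefining `benlaw` so the block stands alone:

```r
benlaw <- function(d) log10(1 + 1 / d)
digits <- 1:9

# Predicted proportion for each first digit; they decrease from 1 to 9
probs <- benlaw(digits)
print(round(probs, 3))

# The proportions form a complete probability distribution
total <- sum(probs)
print(total)
```

The largest value, for digit 1, is log10(2), or about 0.301, which is why a `ylim` upper bound of .35 leaves room above the tallest bar.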

Aug 11, 2016

Forking, Cloning, and Pull Requests with Github Desktop

This is the best explanation I've found of how to collaborate on someone else's repository. Bonus! it's a video:

Jul 31, 2016

A Diversified R in Insurance Conference

I visited London this month for the first time in many years, having been honored to participate in the fourth annual R in Insurance conference held at the Cass Business School. Mired in the deep-rooted polarity of the current American presidential election, this traveler was refreshed and uplifted by London's surprising and multi-faceted diversity. The conference program organized by Markus Gesmann and Andreas Tsanakas was similarly multi-faceted and equally enjoyable. See highlights in Markus' Notes from the Conference and this amateur's images below.

In addition to the conference, I had the pleasure of meeting up with old friends and making new ones.

Apr 1, 2016

R Tools for Visual Studio (RTVS) now available: good news for MS-only shops

Microsoft announces in Newsletter #2 that it is looking for people willing to evaluate an "early access trial" version of its Visual Studio IDE for R, called RTVS (for R Tools for Visual Studio).

Based on the video, RTVS has the same four-window design as RStudio, so there's not an immediate struggle with an unfamiliar layout. David Smith's blog lists some of RTVS's current shortcomings, such as automated package support, that may or may not be a problem for you. I looked for signs that VS might facilitate the integration of R with other languages – such as C# for a front-end and R for the back-end – but not a whiff.

The greatest advantage of RTVS I can see is for IT shops that are comfortable

Mar 24, 2016

Control totals of a data.frame

When you are conducting a business analysis project with a data extract from the company's internal system, professional risk management suggests you make sure you are not missing any records or double counting any records. But you certainly don't want to look at every record. Yikes!

Auditors solve this predicament with control totals. When the sums of key fields and the numbers of records match known values, usually from some well-established "production report," it can be assumed your data "reconciles." *

What does it mean to calculate "control totals" of a general data.frame?
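One possible reading of that question, offered as a sketch rather than as the post's own answer: count the records and sum every numeric column. The function name `control_totals` and the toy `claims` data are hypothetical.

```r
# Record count plus the sum of every numeric column; non-numeric columns
# (IDs, labels, dates stored as text) are skipped
control_totals <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  c(records = nrow(df), colSums(df[is_num], na.rm = TRUE))
}

# Invented data extract for illustration
claims <- data.frame(
  claim_id = c("A1", "A2", "A3"),
  paid     = c(100.50, 250.00, 0),
  reserve  = c(1000, NA, 400),
  stringsAsFactors = FALSE
)

totals <- control_totals(claims)
print(totals)
```

Using `na.rm = TRUE` keeps a single missing value from turning a whole column total into `NA`, at the cost of hiding that the value is missing; a fuller version might report the `NA` count per column as well.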

Mar 17, 2016

Google's New Search Algorithm Introduces Bias

Larry Magid has a technology "article" on the local radio station. I always turn up the volume when Magid comes on. Today's spot tells how Google Search going forward may be biased for you personally based on your Google-stored relationships. This might be handy sometimes. For example, when looking for a restaurant you may want results skewed toward your friends' favorites. Google calls these "private results." For other searches, "private results" could hide or demote the actual results you'd hoped to find. On his website Magid shows how to turn off the privatizing feature after each search, as well as how to remove it for all searches via your Google settings.

Magid mentions a third option: "Incognito" mode. In Incognito mode, it's as if you're not logged in to Google, in which case your bias-influencing relationships are (presumably!) not available. You can open a new Incognito window in Chrome via Ctrl-Shift-N. Here is the link to Google's instructions on how to browse Incognito-ly on various devices.