Abstract
The purpose of this paper is to introduce the R package InfoTrad for estimating the probability of informed trading (PIN) initially proposed by Easley et al. (1996). PIN is a popular information asymmetry measure that proxies the proportion of informed traders in the market. This study provides a short survey of alternative estimation techniques for PIN. The existing literature documents many problems in estimating PIN; the InfoTrad package aims to address two of them. First, the sequential trading structure proposed by Easley et al. (1996) and later extended by Easley, Hvidkjaer, and O’Hara (2002) is prone to sample selection bias for stocks with large trading volumes, due to floating-point exception (FPE). This problem is solved by different factorizations provided by Easley, Hvidkjaer, and O’Hara (2010) (EHO factorization) and Lin and Ke (2011) (LK factorization). Second, the estimates are prone to bias due to boundary solutions. A grid-search algorithm (YZ algorithm) is proposed by Yan and Zhang (2012) to overcome the bias introduced by boundary estimates. In recent years, clustering algorithms have become popular due to their flexibility in quickly handling large data sets. Gan, Wei, and Johnstone (2015) propose an algorithm (GAN algorithm) to estimate PIN using hierarchical agglomerative clustering, which is later extended by Ersan and Alici (2016) (EA algorithm). The package InfoTrad offers the LK and EHO factorizations given an input matrix and initial parameter vector. In addition, these factorizations can be used to estimate PIN through the YZ, GAN and EA algorithms.
The main aim of this paper is to present the InfoTrad package, which estimates the probability of informed trading (PIN) initially proposed by Easley et al. (1996). PIN is one of the primary measures used to proxy information asymmetry in the market. The parameters of the structural model are estimated by maximum likelihood estimation (MLE).
A wide range of studies use PIN to answer questions in different fields of finance.
Although it is a heavily used measure in the finance literature, the
development of applications that calculate PIN has been quite slow. An
initial attempt for the R community was made by Zagaglia (2012). The FinAsym
package of Zagaglia (2012) and the PIN
package of Zagaglia (2013) provide the
trade classification algorithm of Lee and Ready
(1991), which is an important tool for studies that use the TAQ
database. Both packages also provide PIN estimates through
pin_likelihood() functions. However, those estimates are
prone to bias due to misspecification and other limitations.
The InfoTrad package aims to overcome such
limitations and provide users with a wide range of options when
estimating PIN.
Due to the popularity of the measure, problems in estimating PIN have recently gained attention in the finance literature. Easley, Hvidkjaer, and O’Hara (2010) indicate that for stocks with a large trading volume, it is not possible to estimate PIN due to floating-point exception (FPE). Two different numerical factorizations are provided by Easley, Hvidkjaer, and O’Hara (2010) and Lin and Ke (2011) to overcome the bias created by FPE.
In addition, boundary solutions in estimating PIN are also shown to create bias in empirical studies. Yan and Zhang (2012) show that, independent of the type of factorization, the likelihood maximization can get stuck at a local optimum and provide biased PIN estimates. They propose an algorithm (YZ algorithm) that spans the parameter space by using 125 different initial values for the MLE problem and selects the PIN estimate that gives the highest likelihood value among non-boundary solutions. Although the YZ algorithm provides estimates with higher likelihood and guarantees non-boundary solutions, its iterative structure makes it time-consuming, especially for studies that use large datasets.
Because recent studies that estimate PIN use large datasets, the effectiveness of the YZ algorithm has been questioned. In recent years, clustering algorithms have become popular due to their efficiency in processing large sets of data. Gan, Wei, and Johnstone (2015) propose an algorithm that uses hierarchical agglomerative clustering to estimate PIN. Ersan and Alici (2016) later extend this framework.
FPE and boundary solutions are not the only problems of the PIN model. Duarte and Young (2009) indicate that the structural model of Easley et al. (1996) enforces a negative contemporaneous covariance between intraday buy and sell orders, which is contrary to the empirical evidence for symmetric order shocks. In addition, they show through simulations that the PIN model fails to capture the volatility of buy and sell orders. Moreover, Duarte and Young (2009) adjust PIN to take into account the liquidity impact and show that liquidity has a more prominent effect on stock returns than information asymmetry. Finally, it is important to note that PIN does not consider any strategic behaviour of investors such as order splitting. Order splitting can be more evident when a stock is jointly trading on multiple venues (Menkveld 2008). Even for a stock that is traded on a single market, an informed investor may want to split her order to avoid revealing her private information too quickly (Foucault, Pagano, and Röell 2013). The PIN model, by construction, fails to attach multiple small orders to a single informed investor.
This paper introduces and discusses the R (R
Core Team 2016) InfoTrad package for
estimating PIN. InfoTrad provides users with
the methods needed to address two of these problems, FPE and
boundary solutions. The package contains the likelihood factorizations
of EHO and LK as separate functions (EHO() and
LK(), respectively) which provide likelihood specifications
to avoid FPE. In addition, through YZ(), GAN()
and EA() functions, PIN estimates can be obtained using the
grid-search algorithm of Yan and Zhang
(2012) and clustering algorithms of Gan,
Wei, and Johnstone (2015) and Ersan and
Alici (2016). For all of the algorithms, likelihood specification
can be set to EHO or LK.
The paper is organized as follows. Section 2 provides a brief description of PIN. Specifically, Section 2.1 discusses the problem of FPE and the alternative factorizations EHO and LK. Section 2.2 reviews the problem of boundary solutions and the YZ algorithm. Section 2.3 describes the clustering algorithms of Gan, Wei, and Johnstone (2015) and Ersan and Alici (2016). Section 3 introduces the package InfoTrad along with examples. Section 4 evaluates the performance of each method through simulations. Section 5 provides concluding remarks.
The structural model of Easley et al. (1996) and Easley, Hvidkjaer, and O’Hara (2002) consists of three types of agents: informed traders, uninformed traders and market makers. On a trading day \(t\), one risky asset is continuously traded. The market maker sets the price for a given stock by observing the buy orders \((B_t)\) and sell orders \((S_t)\). For that stock, an information event is assumed to follow a Bernoulli distribution with success probability \(\alpha\). This event reveals either a high or a low signal for the stock value. The event is assumed to provide a low signal with probability \(\delta\). When informed traders observe a high (low) signal, they are assumed to place buy (sell) orders at a rate of \(\mu\). Uninformed traders are assumed to place orders independently of the information event and the signal. They arrive at the market to place a buy (sell) order at a rate of \(\epsilon_b\) (\(\epsilon_s\)). Orders of both informed and uninformed investors are assumed to follow independent Poisson processes.
The joint probability distribution with respect to the parameter vector \(\Theta \equiv \{ \alpha, \delta, \mu, \epsilon_b, \epsilon_s \}\) and the number of buys and sells \((B_t,S_t)\), is specified by \[\label{eq:1} \begin{split} f(B_t,S_t|\Theta) \equiv \alpha\delta exp(-\epsilon_b) \frac{\epsilon_b^{B_t}}{{B_t}!} exp[-(\epsilon_s+\mu)] \frac{(\epsilon_s+\mu)^{S_t}}{{S_t}!} \hspace{0.9cm} \\ + \alpha (1-\delta) exp[-(\epsilon_b+\mu)] \frac{(\epsilon_b +\mu)^{B_t}}{{B_t}!} exp(-\epsilon_s) \frac{\epsilon_s^{S_t}}{{S_t}!} \hspace{0.1cm}\\ + (1-\alpha) exp(-\epsilon_b) \frac{\epsilon_b^{B_t}}{{B_t}!} exp(-\epsilon_s) \frac{\epsilon_s^{S_t}}{{S_t}!} \hspace{1.6cm} \end{split} (\#eq:1)\]
The estimates of arrival rates (\(\hat{\mu}, \hat{\epsilon_s}\) and \(\hat{\epsilon_b}\)), along with estimates of the probabilities (\(\hat{\alpha}\) and \(\hat{\delta}\)), can be obtained by maximizing the joint log-likelihood function given the order input matrix \((B_t,S_t)\) over \(T\) trading days. The non-linear objective function of this problem can be written as \[\label{eq:2} L(\Theta|T)\equiv \sum_{t=1}^{T}L(\Theta|(B_t,S_t))= \sum_{t=1}^Tlog[f(B_t,S_t|\Theta)] (\#eq:2)\] The maximization problem is subject to the boundary constraints \(\alpha,\delta \in [0,1]\) and \(\mu,\epsilon_b,\epsilon_s \in [0,\infty)\). The PIN estimate is then given by \[\label{eq:3} \widehat{PIN}=\frac{\hat{\alpha}\hat{\mu}}{\hat{\alpha}\hat{\mu}+\hat{\epsilon_b}+\hat{\epsilon_s}} (\#eq:3)\]
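As a small illustration of equation @ref(eq:3), the PIN estimate can be computed directly from a parameter vector. The helper name `pin_from_par` is hypothetical and is not part of InfoTrad; the sketch assumes the ordering \((\alpha,\delta,\mu,\epsilon_b,\epsilon_s)\) and uses made-up parameter values.

```r
# Hypothetical helper (not part of InfoTrad): PIN from a parameter vector
# ordered as (alpha, delta, mu, epsilon_b, epsilon_s).
pin_from_par <- function(par) {
  alpha <- par[1]; mu <- par[3]; eb <- par[4]; es <- par[5]
  (alpha * mu) / (alpha * mu + eb + es)
}

# Illustrative values only. delta does not enter the PIN formula,
# so changing it leaves PIN unchanged.
pin_a <- pin_from_par(c(0.4, 0.2, 100, 80, 80))
pin_b <- pin_from_par(c(0.4, 0.9, 100, 80, 80))

# A boundary estimate alpha = 0 forces PIN to zero.
pin_zero <- pin_from_par(c(0, 0.5, 100, 80, 80))
```

The last call shows why boundary estimates of \(\hat{\alpha}\) matter: whatever the arrival rates are, the resulting PIN is zero.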
PIN estimates are prone to selection bias, especially for stocks for which the number of buy and sell orders is large. Lin and Ke (2011) show that an increase in the number of buy and sell orders for a given stock significantly shrinks the feasible solution set for the maximization of the log likelihood function in equation @ref(eq:2). To maximize the non-linear function @ref(eq:1), the optimization software introduces initial values for the parameters in \(\Theta\). The numerical optimization method is applied after those initial parameters are introduced. Therefore, for large enough \(B_t\) and \(S_t\) whose factorials cannot be calculated by mainstream computers (i.e., FPE), the optimal value for equation @ref(eq:2) becomes undefined. The FPE problem is therefore more pronounced in actively traded stocks.
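The overflow can be reproduced in a few lines. The sketch below evaluates equation @ref(eq:1) literally, with explicit powers and factorials. The name `naive_loglik_day` is hypothetical, and this is not how InfoTrad evaluates the likelihood. In double-precision arithmetic both the power and the factorial overflow to `Inf` once the counts are large enough, so their ratio is undefined.

```r
# Hypothetical, deliberately naive evaluation of the per-day log-likelihood.
# par is ordered as (alpha, delta, mu, epsilon_b, epsilon_s).
naive_loglik_day <- function(par, B, S) {
  alpha <- par[1]; delta <- par[2]; mu <- par[3]; eb <- par[4]; es <- par[5]
  # Poisson probability written exactly as in the formula
  pois <- function(rate, k) exp(-rate) * rate^k / factorial(k)
  log(alpha * delta * pois(eb, B) * pois(es + mu, S) +
      alpha * (1 - delta) * pois(eb + mu, B) * pois(es, S) +
      (1 - alpha) * pois(eb, B) * pois(es, S))
}

# A thinly traded day: every term is representable.
small <- naive_loglik_day(c(0.5, 0.5, 3, 4, 5), B = 5, S = 6)

# An active day: 400^350 and factorial(350) both overflow, giving Inf / Inf.
large <- suppressWarnings(
  naive_loglik_day(c(0.5, 0.5, 300, 400, 500), B = 350, S = 382)
)

is.finite(small)
is.nan(large)
```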
To avoid the bias created due to FPE, one factorization of the equation @ref(eq:2) is provided by Easley, Hvidkjaer, and O’Hara (2010) as \(L_{EHO}(\Theta|T)\equiv \sum_{t=1}^T L_{EHO}(\Theta|B_t,S_t)\) where \[\label{eq:4} \begin{split} L_{EHO}(\Theta|B_t,S_t) = log[\alpha \delta exp(-\mu)x_b^{B_t-M_t}x_s^{-M_t}+\alpha(1-\delta)exp(-\mu)x_b^{-M_t}x_s^{S_t-M_t}+(1-\alpha)x_b^{B_t-M_t}x_s^{S_t-M_t}] \\ + B_t log(\epsilon_b + \mu)+S_t log (\epsilon_s + \mu)-(\epsilon_b+\epsilon_s)+ M_t[log(x_b)+log(x_s)]-log(S_t!B_t!),\hspace{1.1cm} \end{split} (\#eq:4)\] where \(M_t=min(B_t,S_t)+max(B_t,S_t)/2\), \(x_b=\epsilon_b/(\mu+\epsilon_b)\) and \(x_s=\epsilon_s/(\mu+\epsilon_s)\).
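The per-day term in equation @ref(eq:4) can be written out as a short function. This is a sketch under our own naming (`eho_loglik_day` and `mixture_loglik_day` are hypothetical) and is not the code inside InfoTrad's EHO(). The term \(log(S_t!B_t!)\) is kept here, through `lfactorial()`, only so that the value can be compared with a direct evaluation of equation @ref(eq:1) based on `dpois()`.

```r
# Hypothetical re-implementation of the per-day EHO term.
# par is ordered as (alpha, delta, mu, epsilon_b, epsilon_s).
eho_loglik_day <- function(par, B, S) {
  alpha <- par[1]; delta <- par[2]; mu <- par[3]; eb <- par[4]; es <- par[5]
  M  <- pmin(B, S) + pmax(B, S) / 2
  xb <- eb / (mu + eb)
  xs <- es / (mu + es)
  inner <- alpha * delta * exp(-mu) * xb^(B - M) * xs^(-M) +
    alpha * (1 - delta) * exp(-mu) * xb^(-M) * xs^(S - M) +
    (1 - alpha) * xb^(B - M) * xs^(S - M)
  log(inner) + B * log(eb + mu) + S * log(es + mu) - (eb + es) +
    M * (log(xb) + log(xs)) - (lfactorial(B) + lfactorial(S))
}

# Reference: the three-state Poisson mixture written with dpois(),
# which is accurate for moderate counts such as the ones used below.
mixture_loglik_day <- function(par, B, S) {
  alpha <- par[1]; delta <- par[2]; mu <- par[3]; eb <- par[4]; es <- par[5]
  log(alpha * delta * dpois(B, eb) * dpois(S, es + mu) +
      alpha * (1 - delta) * dpois(B, eb + mu) * dpois(S, es) +
      (1 - alpha) * dpois(B, eb) * dpois(S, es))
}

par0    <- c(0.5, 0.5, 300, 400, 500)
eho_val <- eho_loglik_day(par0, B = 350, S = 382)
ref_val <- mixture_loglik_day(par0, B = 350, S = 382)
all.equal(eho_val, ref_val)
```

Summing `eho_loglik_day()` over all days and negating the result gives an objective that a general-purpose optimizer can minimize.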
Lin and Ke (2011) introduce another algebraically equivalent factorization of equation @ref(eq:2), \(L_{LK}(\Theta|T)\equiv \sum_{t=1}^T L_{LK}(\Theta|B_t,S_t)\) where \[\label{eq:5} \begin{split} L_{LK}(\Theta|B_t,S_t) = log[\alpha \delta exp(e_{1t}-e_{max t}) +\alpha (1-\delta)exp(e_{2t}-e_{max t})+(1-\alpha)exp(e_{3t}-e_{max t})] \\ + B_tlog(\epsilon_b+\mu) + S_tlog(\epsilon_s+\mu)-(\epsilon_b+\epsilon_s)+e_{max t} -log(S_t!B_t!), \hspace{2.3cm} \end{split} (\#eq:5)\] where \(e_{1t}=-\mu-B_tlog(1+\mu/\epsilon_b)\), \(e_{2t}=-\mu-S_tlog(1+\mu/\epsilon_s)\), \(e_{3t}=-B_tlog(1+\mu/\epsilon_b)-S_tlog(1+\mu/\epsilon_s)\) and \(e_{max t} = max(e_{1t},e_{2t},e_{3t})\).
The last term \(log(S_t!B_t!)\) is constant with respect to the parameter vector \(\Theta\) and is therefore dropped in the MLE for both factorizations.
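To make the role of \(e_{max t}\) concrete, the sketch below implements the per-day LK term and evaluates it on an extreme, made-up day for which a direct `dpois()` mixture underflows to zero. The function names are hypothetical and this is not the code inside InfoTrad's LK(). The constant \(log(S_t!B_t!)\) is retained via `lfactorial()` purely for comparability with the direct mixture.

```r
# Hypothetical re-implementation of the per-day LK term.
# par is ordered as (alpha, delta, mu, epsilon_b, epsilon_s).
lk_loglik_day <- function(par, B, S) {
  alpha <- par[1]; delta <- par[2]; mu <- par[3]; eb <- par[4]; es <- par[5]
  e1 <- -mu - B * log(1 + mu / eb)
  e2 <- -mu - S * log(1 + mu / es)
  e3 <- -B * log(1 + mu / eb) - S * log(1 + mu / es)
  emax <- pmax(e1, e2, e3)
  # Subtracting emax keeps the largest exponent at zero, so at least one
  # term inside the log equals its mixing weight and the sum cannot vanish.
  log(alpha * delta * exp(e1 - emax) +
      alpha * (1 - delta) * exp(e2 - emax) +
      (1 - alpha) * exp(e3 - emax)) +
    B * log(eb + mu) + S * log(es + mu) - (eb + es) + emax -
    (lfactorial(B) + lfactorial(S))
}

# Direct mixture of Poisson probabilities, for comparison only.
mixture_loglik_day <- function(par, B, S) {
  alpha <- par[1]; delta <- par[2]; mu <- par[3]; eb <- par[4]; es <- par[5]
  log(alpha * delta * dpois(B, eb) * dpois(S, es + mu) +
      alpha * (1 - delta) * dpois(B, eb + mu) * dpois(S, es) +
      (1 - alpha) * dpois(B, eb) * dpois(S, es))
}

par0 <- c(0.5, 0.5, 300, 400, 500)

# Moderate day: both evaluations agree.
lk_moderate  <- lk_loglik_day(par0, B = 350, S = 382)
mix_moderate <- mixture_loglik_day(par0, B = 350, S = 382)

# Extreme day: every dpois() product underflows to 0, so the direct
# mixture returns -Inf, while the LK form remains finite.
lk_extreme  <- lk_loglik_day(par0, B = 20000, S = 100)
mix_extreme <- mixture_loglik_day(par0, B = 20000, S = 100)
```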
Another source of bias in estimating PIN arises from boundary solutions. Yan and Zhang (2012) indicate that in calculating PIN, parameter estimates \(\hat{\alpha}\) and \(\hat{\delta}\) usually fall onto the boundaries of the parameter space, that is, they are equal to zero or one. The PIN estimate presented in equation (@ref(eq:3)) is directly related to the estimate \(\hat{\alpha}\): if \(\hat{\alpha}\) equals zero, PIN is zero as well. This can create a sample selection bias in portfolio formation, especially for quarterly estimations. Yan and Zhang (2012) show that
\[E(B)=\alpha(1-\delta)\mu+\epsilon_b\]
\[E(S)=\alpha\delta\mu+\epsilon_s\]
Then, they propose the following algorithm to overcome the bias created by boundary solutions. Let \((\alpha^0,\delta^0,\epsilon_b^0,\epsilon_s^0,\mu^0)\) be the initial parameter vector to be placed in the non-linear program presented in equation (@ref(eq:4)). In addition, let \(\bar{B}\) and \(\bar{S}\) be the average number of buy and sell orders.
\[\alpha^0=\alpha_i, \hspace{0.5cm} \delta^0=\delta_j, \hspace{0.5cm} \epsilon_b^0=\gamma_k\bar{B}, \hspace{0.5cm} \mu^0=\frac{\bar{B}-\epsilon_b^0}{\alpha^0(1-\delta^0)} \hspace{0.2cm} \text{and} \hspace{0.2cm} \epsilon_s^0=\bar{S}-\alpha^0\delta^0\mu^0\] where \(\alpha_i,\delta_j, \gamma_k \in \{0.1,0.3,0.5,0.7,0.9\}\). This will yield 125 different PIN estimates along with their likelihood values. In line with Yan and Zhang (2012), we drop any initial parameter vector having negative values for \(\epsilon_s^0\). In addition, following Ersan and Alici (2016), we also drop any initial parameter vector with \(\mu^0>max(B_t,S_t)\). Yan and Zhang (2012) then select the estimate with non-boundary parameters yielding highest likelihood value. This method, by construction, spans the parameter space and tries to avoid local optima and provides non-boundary estimates for \(\alpha\).
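The construction of the initial values can be sketched as follows. The function name `yz_initial_grid` and the ten days of order counts are illustrative assumptions, and the sketch covers only the grid and the two filters described above, not the subsequent likelihood maximization or the internals of InfoTrad's YZ().

```r
# Ten hypothetical trading days of buy and sell counts.
Buy  <- c(350, 250, 500, 552, 163, 345, 847, 923, 123, 349)
Sell <- c(382, 500, 463, 550, 200, 323, 456, 342, 578, 455)

# Hypothetical helper: build the 5 x 5 x 5 grid of initial values and
# drop vectors with negative epsilon_s^0 or with mu^0 > max(B_t, S_t).
yz_initial_grid <- function(B, S, levels = c(0.1, 0.3, 0.5, 0.7, 0.9)) {
  grid <- expand.grid(alpha0 = levels, delta0 = levels, gamma = levels)
  B_bar <- mean(B)
  S_bar <- mean(S)
  grid$eb0 <- grid$gamma * B_bar
  grid$mu0 <- (B_bar - grid$eb0) / (grid$alpha0 * (1 - grid$delta0))
  grid$es0 <- S_bar - grid$alpha0 * grid$delta0 * grid$mu0
  keep <- grid$es0 >= 0 & grid$mu0 <= max(B, S)
  list(all = grid, kept = grid[keep, ])
}

init <- yz_initial_grid(Buy, Sell)
nrow(init$all)   # number of candidate starting points before filtering
nrow(init$kept)  # number remaining after the two filters
```

Each remaining row would then serve as a starting point for one run of the optimizer, and the non-boundary solution with the highest likelihood is retained.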
In recent years, clustering algorithms have become increasingly
popular for estimating the probability of informed trading due to
efficiency concerns. Gan, Wei, and Johnstone
(2015) and Ersan and Alici (2016)
use clustering algorithms to estimate PIN. Gan,
Wei, and Johnstone (2015) introduce a method that clusters the
data into three groups (good news, bad news, no news) based on the mean
absolute difference in order imbalance. Let \(X_t=B_t-S_t\) be the order imbalance on day
\(t\) computed as the difference
between buy orders and sell orders. The clustering is then based on the
distance function defined as \(D(i,j)=|X_i-X_j|, \quad 1\leq i,j \leq T\)
where \(i \neq j\). They use
hierarchical agglomerative clustering (HAC) to group the data elements
based on the distance matrix. Specifically, they use
hclust() function of Müllner
(2013) in R. The algorithm sequentially clusters, in a
bottom-up fashion, each observation into groups based on \(X_t\) and stops when it reaches three
clusters. The theoretical framework of Easley et
al. (1996) indicates that a stock has high (low) \(X_t\) on good (bad) days. Therefore, the
cluster which has the highest (lowest) mean \(X_t\) is labelled as good (bad) news. The
remaining cluster is then labelled as no news. Once each observation is
grouped into their respective clusters (good news, bad news, no news),
\(c \in \{G,B,N\}\), the parameter
estimates for \(\Theta \equiv
\{\alpha,\delta,\mu,\epsilon_b,\epsilon_s\}\) are calculated
simply by counting. Let \(\omega_c\) be
the proportion of the \(T\) trading days assigned to cluster \(c\),
such that \(\sum_{c \in \{G,B,N\}}\omega_c=1\). Similarly, let
\(\bar{B_c}\) and \(\bar{S_c}\) be the average number of buys
and sells on cluster \(c\),
respectively.
Then, the probability of an information event is given by \(\hat{\alpha}=\omega_B+\omega_G\). Moreover, the estimate for the probability that an information event releases bad news is given by \(\hat{\delta}=\omega_B/\hat{\alpha}\). The estimate for the arrival rate of buy orders of uninformed traders is \(\hat{\epsilon_b}=\frac{\omega_B}{\omega_B+\omega_N}\bar{B_B}+\frac{\omega_N}{\omega_B+\omega_N}\bar{B_N}\). Similarly, the estimate for the arrival rate of sell orders of uninformed traders is \(\hat{\epsilon_s}=\frac{\omega_G}{\omega_G+\omega_N}\bar{S_G}+\frac{\omega_N}{\omega_G+\omega_N}\bar{S_N}\). Finally, the arrival rate for the informed investors is calculated as \(\hat{\mu}=\frac{\omega_G}{\omega_B+\omega_G}(\bar{B_G}-\hat{\epsilon_b})+\frac{\omega_B}{\omega_B+\omega_G}(\bar{S_B}-\hat{\epsilon_s})\), where \((\bar{B_G}-\hat{\epsilon_b})\) corresponds to the buy rate of informed investors \(\hat{\mu_b}\) and \((\bar{S_B}-\hat{\epsilon_s})\) corresponds to the sell rate of informed investors \(\hat{\mu_s}\).
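The counting step can be sketched in base R. The function name `gan_initial_estimates` and the ten days of order counts are illustrative assumptions. The sketch relies on the default complete linkage of `hclust()`, which is our assumption rather than a documented choice of Gan, Wei, and Johnstone (2015), and it takes the uninformed sell rate on no-news days from \(\bar{S_N}\). It does not reproduce the MLE step or the internals of InfoTrad's GAN().

```r
# Ten hypothetical trading days of buy and sell counts.
Buy  <- c(350, 250, 500, 552, 163, 345, 847, 923, 123, 349)
Sell <- c(382, 500, 463, 550, 200, 323, 456, 342, 578, 455)

# Hypothetical helper: cluster on order imbalance, label the clusters,
# then compute the parameter estimates by counting.
gan_initial_estimates <- function(B, S) {
  X  <- B - S
  cl <- cutree(hclust(dist(X)), k = 3)   # complete linkage (assumption)
  mean_X <- tapply(X, cl, mean)
  good <- as.integer(names(which.max(mean_X)))   # highest mean imbalance
  bad  <- as.integer(names(which.min(mean_X)))   # lowest mean imbalance
  none <- setdiff(1:3, c(good, bad))
  w <- function(k) mean(cl == k)                 # share of days in cluster k
  wG <- w(good); wB <- w(bad); wN <- w(none)
  alpha <- wB + wG
  delta <- wB / alpha
  eb <- wB / (wB + wN) * mean(B[cl == bad]) +
        wN / (wB + wN) * mean(B[cl == none])
  es <- wG / (wG + wN) * mean(S[cl == good]) +
        wN / (wG + wN) * mean(S[cl == none])
  mu <- wG / (wB + wG) * (mean(B[cl == good]) - eb) +
        wB / (wB + wG) * (mean(S[cl == bad]) - es)
  c(alpha = alpha, delta = delta, mu = mu, epsilon_b = eb, epsilon_s = es)
}

est <- gan_initial_estimates(Buy, Sell)
est
```

The resulting vector is in the order \((\alpha,\delta,\mu,\epsilon_b,\epsilon_s)\), so it could be passed as the starting value of an optimizer.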
Through simulations, Gan, Wei, and Johnstone (2015) show that estimates calculated as above are proper candidates for the initial parameter values to be used in the MLE process. Ersan and Alici (2016) argue that the estimates for the informed arrival rate, \(\mu\), contain a downward bias with the GAN algorithm. This is what we observe in this study as well. In addition, they state that the GAN algorithm provides inaccurate estimates for \(\delta\). In order to overcome these issues, instead of using \(X_t\), Ersan and Alici (2016) use absolute daily order imbalance, \(|X_t|\), to cluster the data. They first cluster \(|X_t|\) into two groups, again by using hclust(). The cluster with the lower mean daily absolute order imbalance is labelled as the "no event" cluster and the remaining one as the "event" cluster. Then, the "good" and "bad" event day clusters are formed by separating the days in the "event" cluster into two with respect to the sign of the daily order imbalances. The parameter estimates are then computed with the same procedure presented above.
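The labelling step of this modified procedure can be sketched as follows. The function name `ea_label_days` and the sample counts are illustrative assumptions, complete linkage (the `hclust()` default) is again our assumption, and a zero imbalance inside the event cluster is arbitrarily assigned to bad news. The sketch stops at the labels and does not reproduce InfoTrad's EA().

```r
# Ten hypothetical trading days of buy and sell counts.
Buy  <- c(350, 250, 500, 552, 163, 345, 847, 923, 123, 349)
Sell <- c(382, 500, 463, 550, 200, 323, 456, 342, 578, 455)

# Hypothetical helper: split days into "event" and "no event" using |X_t|,
# then split the event days by the sign of X_t.
ea_label_days <- function(B, S) {
  X  <- B - S
  cl <- cutree(hclust(dist(abs(X))), k = 2)      # complete linkage (assumption)
  mean_abs <- tapply(abs(X), cl, mean)
  event_id <- as.integer(names(which.max(mean_abs)))
  label <- ifelse(cl != event_id, "no news",
                  ifelse(X > 0, "good news", "bad news"))
  factor(label, levels = c("good news", "bad news", "no news"))
}

labels <- ea_label_days(Buy, Sell)
table(labels)
```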
The R package InfoTrad provides five
different functions:
EHO(), LK(), YZ(), GAN()
and EA(). The first two functions provide likelihood
specifications whereas the last three functions can be used to obtain
parameter estimates for \(\Theta\) to
calculate PIN in equation @ref(eq:3). All five functions require a data
frame that contains \(B_t\) in the
first column, and \(S_t\) in the second
column. We create \(B_t\) and \(S_t\) for ten hypothetical trading days.
EHO() and LK() read \((B_t,S_t)\) and return the related
functional form of the negative log likelihood. These objects can be
used in any optimization procedure such as optim() to
obtain the parameter estimates \(\hat{\Theta}\equiv\{\hat{\alpha},\hat{\delta},\hat{\mu},\hat{\epsilon_b},\hat{\epsilon_s}\}\),
the likelihood value and other specifications, in one iteration with a
pre-specified initial value vector, \(\Theta_0\), for parameters. We define
EHO() and LK() as simple likelihood
specifications rather than functions that execute the MLE procedure.
This is due to the fact that MLE estimators vary depending on the
optimization procedure. Users who wish to develop alternative estimation
techniques, based on the proposed likelihood factorization, can use
EHO() and LK(). This is the underlying reason
why those functions do not have built-in optimization procedures. By
specifying EHO() and LK() as simple likelihood
functions, we give developers the flexibility to select the most
suitable optimization procedure for their application.
For researchers who want to calculate an estimate of PIN,
YZ(), GAN() and EA() functions
have built-in optimization procedures. Those functions read a likelihood
specification value along with data. Likelihood specification can be set
either to "LK" or to "EHO", with "LK" being the default. All estimation
functions use neldermead() function of nloptr
package to conduct MLE with the specified factorization. GAN and EA
functions also use hclust() function of Müllner (2013) to conduct clustering. The output
of these three functions is an object that provides \(\{\hat{\alpha},\hat{\delta},
\hat{\mu},\hat{\epsilon_b},\hat{\epsilon_s},f(\hat{\Theta}),\widehat{PIN}\}\),
where \(f(\hat{\Theta})\) represents
the optimal likelihood value given the parameter estimates \(\hat{\Theta}\).
An example is provided below for EHO() with a sample
data and initial parameter values. Notice that the first column of
sample data is for \(B_t\) and second
column is for \(S_t\). Similarly, the
initial parameter values are constructed as; \(\Theta_0\) = \(\{\alpha,\delta,\mu,\epsilon_b,\epsilon_s\}\).
We use optim() with the "Nelder-Mead" method to execute MLE;
however, developers are free to use other methods as well.
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy<-c(350,250,500,552,163,345,847,923,123,349)
Sell<-c(382,500,463,550,200,323,456,342,578,455)
data=cbind(Buy,Sell)
# Initial parameter values
# par0 = (alpha, delta, mu, epsilon_b, epsilon_s)
par0 = c(0.5,0.5,300,400,500)
# Call EHO function
EHO_out = EHO(data)
model = optim(par0, EHO_out, gr = NULL, method = c("Nelder-Mead"), hessian = FALSE)
## Parameter Estimates
model$par[1] # Estimate for alpha
# [1] 0.9111102
model$par[2] # Estimate for delta
#[1] 0.0001231429
model$par[3] # Estimate for mu
# [1] 417.1497
model$par[4] # Estimate for eb
# [1] 336.075
model$par[5] # Estimate for es
# [1] 466.2539
## Estimate for PIN
(model$par[1]*model$par[3])/((model$par[1]*model$par[3])+model$par[4]+model$par[5])
# [1] 0.3214394
####
In this example, \(B_t\) and \(S_t\) vectors are selected so that the likelihood function cannot be represented as in equation @ref(eq:1). We set the initial parameters to be \(\Theta_0\)=(0.5,0.5,300,400,500). For the given \(B_t\), \(S_t\) and \(\Theta_0\) vectors, PIN measure is calculated as 0.32 with EHO factorization.
An example is provided below for LK() function with a
sample data and initial parameter values. Notice that the first column
of sample data is for \(B_t\) and
second column is for \(S_t\).
Similarly, the initial parameter values are constructed as; \(\Theta_0\) = \(\{\alpha,\delta,\mu,\epsilon_b,\epsilon_s\}\).
We use optim() with the "Nelder-Mead" method to execute MLE;
however, developers are free to use other methods as well.
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy<-c(350,250,500,552,163,345,847,923,123,349)
Sell<-c(382,500,463,550,200,323,456,342,578,455)
data=cbind(Buy,Sell)
# Initial parameter values
# par0 = (alpha, delta, mu, epsilon_b, epsilon_s)
par0 = c(0.5,0.5,300,400,500)
# Call LK function
LK_out = LK(data)
model = optim(par0, LK_out, gr = NULL, method = c("Nelder-Mead"), hessian = FALSE)
## The structure of the model output ##
model
#$par
#[1] 0.480277 0.830850 315.259805 296.862318 400.490830
#$value
#[1] -44343.21
#$counts
#function gradient
# 502 NA
#$convergence
#[1] 1
#$message
#NULL
## Parameter Estimates
model$par[1] # Estimate for alpha
# [1] 0.480277
model$par[2] # Estimate for delta
# [1] 0.830850
model$par[3] # Estimate for mu
# [1] 315.259805
model$par[4] # Estimate for eb
# [1] 296.862318
model$par[5] # Estimate for es
# [1] 400.4908
## Estimate for PIN
(model$par[1]*model$par[3])/((model$par[1]*model$par[3])+model$par[4]+model$par[5])
# [1] 0.178391
####
For the given \(B_t\), \(S_t\) and \(\Theta_0\) vectors, PIN measure is calculated as 0.18 with LK factorization.
An example is provided below for YZ() function with a
sample data. Notice that the first column of sample data is for \(B_t\) and second column is for \(S_t\). In addition, the first example is
with default likelihood specification LK and the second one is with EHO.
Notice that the YZ() function does not require any initial
parameter vector \(\Theta_0\).
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy<-c(350,250,500,552,163,345,847,923,123,349)
Sell<-c(382,500,463,550,200,323,456,342,578,455)
data<-cbind(Buy,Sell)
# Parameter estimates using the LK factorization of Lin and Ke (2011)
# with the algorithm of Yan and Zhang (2012).
# Default factorization is set to be "LK"
result=YZ(data)
print(result)
# Alpha: 0.3999999
# Delta: 0
# Mu: 442.1667
# Epsilon_b: 263.3333
# Epsilon_s: 424.9
# Likelihood Value: 44371.84
# PIN: 0.2004457
# Parameter estimates using the EHO factorization of Easley et al. (2010)
# with the algorithm of Yan and Zhang (2012).
result=YZ(data,likelihood="EHO")
print(result)
# Alpha: 0.9000001
# Delta: 0.9000001
# Mu: 489.1111
# Epsilon_b: 396.1803
# Epsilon_s: 28.72002
# Likelihood Value: Inf
# PIN: 0.3321033
For the given \(B_t\) and \(S_t\) vectors, PIN measure is calculated as 0.20 with YZ algorithm along with LK factorization. Moreover, PIN measure is calculated as 0.33 with YZ algorithm along with EHO factorization.
An example is provided below for GAN() function with a
sample data. Notice that the first column of sample data is for \(B_t\) and second column is for \(S_t\). In addition, the first example is
with default likelihood specification LK and the second one is with EHO.
Notice that the GAN() function does not require any initial
parameter vector \(\Theta_0\).
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy<-c(350,250,500,552,163,345,847,923,123,349)
Sell<-c(382,500,463,550,200,323,456,342,578,455)
data<-cbind(Buy,Sell)
# Parameter estimates using the LK factorization of Lin and Ke (2011)
# with the algorithm of Gan et al. (2015).
# Default factorization is set to be "LK"
result=GAN(data)
print(result)
# Alpha: 0.3999998
# Delta: 0
# Mu: 442.1667
# Epsilon_b: 263.3333
# Epsilon_s: 424.9
# Likelihood Value: 44371.84
# PIN: 0.2044464
# Parameter estimates using the EHO factorization of Easley et al. (2010)
# with the algorithm of Gan et al. (2015)
result=GAN(data, likelihood="EHO")
print(result)
# Alpha: 0.3230001
# Delta: 0.4780001
# Mu: 481.3526
# Epsilon_b: 356.6359
# Epsilon_s: 313.136
# Likelihood Value: Inf
# PIN: 0.1884001
For the given \(B_t\) and \(S_t\) vectors, PIN measure is calculated as 0.20 with GAN algorithm along with LK factorization. Moreover, PIN measure is calculated as 0.19 with GAN algorithm along with EHO factorization.
An example is provided below for EA() function with a
sample data. Notice that the first column of sample data is for \(B_t\) and second column is for \(S_t\). In addition, the first example is
with default likelihood specification LK and the second one is with EHO.
Notice that the EA() function does not require any initial
parameter vector \(\Theta_0\).
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy=c(350,250,500,552,163,345,847,923,123,349)
Sell=c(382,500,463,550,200,323,456,342,578,455)
data=cbind(Buy,Sell)
# Parameter estimates using the LK factorization of Lin and Ke (2011)
# with the modified clustering algorithm of Ersan and Alici (2016).
# Default factorization is set to be "LK"
result=EA(data)
print(result)
# Alpha: 0.9511418
# Delta: 0.2694005
# Mu: 76.7224
# Epsilon_b: 493.7045
# Epsilon_s: 377.4877
# Likelihood Value: 43973.71
# PIN: 0.07728924
# Parameter estimates using the EHO factorization of Easley et al. (2010)
# with the modified clustering algorithm of Ersan and Alici (2016).
result=EA(data,likelihood="EHO")
print(result)
# Alpha: 0.9511418
# Delta: 0.2694005
# Mu: 76.7224
# Epsilon_b: 493.7045
# Epsilon_s: 377.4877
# Likelihood Value: 43973.71
# PIN: 0.07728924
For the given \(B_t\) and \(S_t\) vectors, PIN measure is calculated as 0.08 with EA algorithm along with LK factorization. Moreover, PIN measure is calculated, again, as 0.08 with EA algorithm along with EHO factorization.
In this section, we investigate the performance of the estimates obtained for \(\Theta\) and PIN using the existing methods. We evaluate the methods based on their accuracy, proxied by mean absolute errors (MAE). We first examine how the estimates vary across trade intensity levels. To this end, we follow the methodology in Gan, Wei, and Johnstone (2015). Let \(I\) be the set of trade intensity levels ranging from 50 to 5000 at a step size of 50, that is, \(I=\{50,100,150,\dots,5000\}\). We first set our parameters as \(\Theta= \{\alpha=0.5,\delta=0.5,\mu=0.2i,\epsilon_b=0.4i,\epsilon_s=0.4i \}\), where \(i \in I\). For each trade intensity level, we generate \(N\)=50 random samples of \(\tilde{\alpha}\) and \(\tilde{\delta}\) that are binomially distributed with parameters \(\alpha\) and \(\delta\) respectively. \(\tilde{\alpha}\) and \(\tilde{\delta}\) proxy the content of the information event. For each pair of \(\tilde{\alpha}\), \(\tilde{\delta}\) values, we generate buy and sell values \((B_t,S_t)\) for hypothetical \(T\)=60 days in the following manner:
if \(\tilde{\alpha}\) = 0, then there is no information event, therefore, generate \(B_t \sim Pois(\epsilon_b)\) and \(S_t \sim Pois(\epsilon_s)\).
if \(\tilde{\alpha}\) = 1, and \(\tilde{\delta}\) =1, then there is bad news, therefore generate \(B_t \sim Pois(\epsilon_b)\) and \(S_t \sim Pois(\epsilon_s+\mu)\)
if \(\tilde{\alpha}\) = 1, and \(\tilde{\delta}\) =0, then there is good news, therefore generate \(B_t \sim Pois(\epsilon_b+\mu)\) and \(S_t \sim Pois(\epsilon_s)\)
We then form the joint likelihood function represented by equation
@ref(eq:4) in EHO form or by equation @ref(eq:5) in LK form and obtain
the estimates using YZ(), GAN() or
EA() methods.
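The data-generating step described above can be sketched in base R. The function name `simulate_orders`, the seed and the chosen intensity level are illustrative assumptions. Each day receives its own Bernoulli draws for the event and for its sign, which is one reading of the procedure, and the sketch is not the simulation code behind the reported tables.

```r
# Hypothetical helper: simulate n_days of (B_t, S_t) from the PIN model.
simulate_orders <- function(alpha, delta, mu, eb, es, n_days = 60) {
  event <- rbinom(n_days, size = 1, prob = alpha)  # 1 = information event
  bad   <- rbinom(n_days, size = 1, prob = delta)  # 1 = bad news, given an event
  # Informed traders buy on good-news days and sell on bad-news days.
  B <- rpois(n_days, eb + mu * event * (1 - bad))
  S <- rpois(n_days, es + mu * event * bad)
  cbind(Buy = B, Sell = S)
}

set.seed(123)
i   <- 1000  # one trade intensity level from the set I
sim <- simulate_orders(alpha = 0.5, delta = 0.5,
                       mu = 0.2 * i, eb = 0.4 * i, es = 0.4 * i)
head(sim)
```

The resulting two-column matrix has buys in the first column and sells in the second, which is the layout the estimation functions expect.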
The results are presented in Table 1 which
indicates that YZ() method with LK()
factorization provides the PIN estimates with lowest MAE. Although the
clustering algorithms, especially GAN() method, provide
powerful estimates of \(\hat{\alpha},\hat{\delta},\hat{\epsilon_b},\hat{\epsilon_s}\),
they fail to estimate the arrival rate of informed investors, \(\hat{\mu}\), accurately. This is in line
with Ersan and Alici (2016). On the
contrary, YZ() method with EHO() factorization
provides the best estimates for \(\hat{\mu}\), but fails to provide good
estimates for other parameters.
| Method | Factorization | \(\widehat{PIN}\) | \(\hat{\alpha}\) | \(\hat{\delta}\) | \(\hat{\mu}\) | \(\hat{\epsilon_b}\) | \(\hat{\epsilon_s}\) |
|---|---|---|---|---|---|---|---|
| YZ | LK | 0.075 | 0.199 | 0.059 | 415.2 | 104.3 | 109.0 |
| YZ | EHO | 0.134 | 0.428 | 0.310 | 154.6 | 288.3 | 247.4 |
| GAN | EHO | 0.101 | 0.087 | 0.083 | 479.4 | 124.1 | 117.3 |
| GAN | LK | 0.101 | 0.087 | 0.083 | 479.5 | 123.8 | 118.1 |
| EA | LK | 0.102 | 0.268 | 0.274 | 484.6 | 128.7 | 119.3 |
| EA | EHO | 0.102 | 0.270 | 0.275 | 483.1 | 128.5 | 107.8 |
A more general way of examining the accuracy of PIN estimates is proposed in several studies (e.g., Lin and Ke (2011), Gan, Wei, and Johnstone (2015), Ersan and Alici (2016)). In this setting, we fix the trade intensity at \(I=2500\). The total trade intensity represents the overall presence of informed and uninformed traders, that is, \(I=\mu+\epsilon_b+\epsilon_s\). We then generate three probability terms \(p_1,p_2,p_3\) with \(N\)=5000 random observations that are distributed uniformly between 0 and 1. \(p_1\) represents the fraction of informed investors in total trade intensity, that is, \(\mu=p_1 I\). The rest of the trade intensity is distributed equally between buy and sell orders of uninformed investors, that is, \(\epsilon_b=\epsilon_s=(1-p_1)I/2\). \(p_2\) represents the true parameter for the probability of news arrival, \(\alpha\), and \(p_3\) is the true parameter for the content of the news, \(\delta\). We generate observations for \(\tilde{\alpha}\) and \(\tilde{\delta}\), as described earlier. For each pair of \(\tilde{\alpha}\) and \(\tilde{\delta}\), we generate buy and sell values \((B_t,S_t)\) for hypothetical \(T\)=60 days, again in the manner presented above, form the likelihood and obtain the parameter estimates.
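The construction of the true parameters in this second setting, together with the accuracy measure, can be sketched as follows. The variable names, the seed and the helper `mae` are illustrative assumptions, and the sketch covers only the parameter draws and the error measure, not the estimation runs.

```r
set.seed(42)
n         <- 5000   # number of simulated parameter sets
intensity <- 2500   # fixed total trade intensity

p1 <- runif(n)  # share of informed trading in total intensity
p2 <- runif(n)  # true alpha
p3 <- runif(n)  # true delta

truth <- data.frame(
  alpha = p2,
  delta = p3,
  mu    = p1 * intensity,
  eb    = (1 - p1) * intensity / 2,
  es    = (1 - p1) * intensity / 2
)
# True PIN implied by each parameter set
truth$pin <- with(truth, alpha * mu / (alpha * mu + eb + es))

# Mean absolute error between a vector of estimates and the true values
mae <- function(estimate, true_value) mean(abs(estimate - true_value))

# Toy check of the error measure on three made-up numbers
mae_toy <- mae(c(1, 2, 3), c(1, 2, 5))
```

Applying `mae()` column by column to the estimates from each method would yield one row of a table such as Table 2.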
The results are presented in Table 2. Similar
to the first simulation, GAN() captures the true nature of
\(\hat{\alpha}\) and \(\hat{\delta}\) better than any other method
with both factorizations. YZ() method with
EHO() factorization performs best when estimating the
arrival of informed traders, \(\hat{\mu}\). The importance of estimating
\(\hat{\mu}\) becomes quite evident in
Table 2. Although other methods outperform
YZ() method with EHO() factorization in
estimating \(\alpha,\epsilon_b\) and
\(\epsilon_s\), it provides the best
estimate for PIN due to it’s performance on estimating \(\hat{\mu}\).
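The role of \(\mu\) follows from the definition \(PIN=\alpha\mu/(\alpha\mu+\epsilon_b+\epsilon_s)\). The short R sketch below evaluates this expression over a grid of \(\mu\) values while holding the other parameters fixed. All numbers are hypothetical and are not taken from Table 2; the helper name pin is introduced only for illustration.

```r
# PIN as a function of the model parameters.
pin <- function(alpha, mu, eps_b, eps_s) {
  alpha * mu / (alpha * mu + eps_b + eps_s)
}

# Hypothetical values: vary mu while alpha and the uninformed rates stay fixed.
mu_grid   <- c(750, 1000, 1250, 1500)
pin_by_mu <- pin(alpha = 0.4, mu = mu_grid, eps_b = 625, eps_s = 625)
print(round(pin_by_mu, 3))
```

Because \(\mu\) enters both the numerator and the denominator, PIN rises monotonically with \(\mu\), so an error in \(\hat{\mu}\) carries over directly into \(\widehat{PIN}\).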
| Method | Factorization | \(\widehat{PIN}\) | \(\hat{\alpha}\) | \(\hat{\delta}\) | \(\hat{\mu}\) | \(\hat{\epsilon}_b\) | \(\hat{\epsilon}_s\) |
|---|---|---|---|---|---|---|---|
| YZ | LK | 0.323 | 0.428 | 0.432 | 1,212.0 | 303.4 | 325.0 |
| YZ | EHO | 0.237 | 0.437 | 0.357 | 942.9 | 386.0 | 470.2 |
| GAN | LK | 0.348 | 0.380 | 0.410 | 1,218.7 | 314.5 | 323.3 |
| GAN | EHO | 0.347 | 0.357 | 0.397 | 1,216.2 | 328.5 | 339.5 |
| EA | LK | 0.348 | 0.437 | 0.421 | 1,224.0 | 325.1 | 336.3 |
| EA | EHO | 0.347 | 0.428 | 0.413 | 1,222.0 | 331.3 | 345.9 |
This paper provides a short survey of the five most widely used
estimation techniques for the probability of informed trading (PIN)
measure and introduces the R package
InfoTrad, which covers estimation procedures for
PIN using the EHO and LK factorizations along with the YZ, GAN and EA algorithms
(EHO(), LK(), YZ(),
GAN() and EA()). The functions EHO()
and LK() read a \(T \times 2\) matrix whose first
column contains the total number of buy orders on a given trading day t,
\(B_t\), and whose second
column contains the total number of sell orders on that
day, \(S_t\), where t \(\in\) \(\{1,2,\dots,T\}\). In addition, they
require an initial parameter vector of the form \(\Theta_0\) = \(\{\alpha,\delta,\mu,\epsilon_b,\epsilon_s\}\).
Both functions produce the respective log-likelihood functions.
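To illustrate why a factorized log-likelihood is needed for large trade counts, the following R sketch evaluates the sequential trade likelihood entirely in logs using a log-sum-exp shift. It is a simplified stand-in in the spirit of the EHO and LK factorizations, not the code behind EHO() or LK(). The function name neg_loglik, the ten-day data and the parameter values are hypothetical.

```r
# Negative log-likelihood of the sequential trade model, evaluated in logs.
# par = c(alpha, delta, mu, eps_b, eps_s); data = matrix with columns (B_t, S_t).
neg_loglik <- function(par, data) {
  alpha <- par[1]; delta <- par[2]; mu <- par[3]
  eb <- par[4]; es <- par[5]
  B <- data[, 1]; S <- data[, 2]
  # Log of each mixture component: no news, bad news, good news.
  comp <- cbind(
    log(1 - alpha) +
      dpois(B, eb, log = TRUE) + dpois(S, es, log = TRUE),
    log(alpha * delta) +
      dpois(B, eb, log = TRUE) + dpois(S, es + mu, log = TRUE),
    log(alpha * (1 - delta)) +
      dpois(B, eb + mu, log = TRUE) + dpois(S, es, log = TRUE)
  )
  m <- apply(comp, 1, max)               # per-day shift for log-sum-exp
  -sum(m + log(rowSums(exp(comp - m))))  # stable log of the mixture sum
}

# Hypothetical data with large daily trade counts.
Buy  <- c(3500, 2500, 5000, 5520, 1630, 3450, 8470, 9230, 1230, 3490)
Sell <- c(3820, 5000, 4630, 5500, 2000, 3230, 4560, 3420, 5780, 4550)
data <- cbind(Buy, Sell)
par0 <- c(0.5, 0.5, 300, 400, 500)

# With these rates, dpois(3500, 400) underflows to 0 in double precision,
# so the plain product form would give log(0) = -Inf for the first day.
nll <- neg_loglik(par0, data)
print(is.finite(nll))
```

A function of this shape could then be handed to a derivative-free optimizer, provided the parameters are kept inside their admissible ranges.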
The functions YZ(), GAN() and
EA() read \((B_t,S_t)\) as
input along with a likelihood specification that is set to
LK by default. These functions do not require an initial
parameter matrix to obtain the parameter estimates when calculating PIN.
All three functions use the neldermead() method of
nloptr as the built-in optimization procedure for
MLE. YZ(), GAN() and EA() produce
an object that gives the parameter estimates \(\hat{\Theta}\) along with the likelihood value
and \(\widehat{PIN}\).
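A typical call sequence might look like the sketch below. It assumes that InfoTrad is installed and that the likelihood specification is passed through an argument named likelihood taking the values "LK" or "EHO"; the argument name and the data values are assumptions here rather than details stated in this section. The calls are guarded so that the snippet also runs when the package is unavailable.

```r
# Hypothetical ten-day sample of daily buy and sell counts.
Buy  <- c(350, 250, 500, 552, 163, 345, 847, 923, 123, 349)
Sell <- c(382, 500, 463, 550, 200, 323, 456, 342, 578, 455)
data <- cbind(Buy, Sell)

if (requireNamespace("InfoTrad", quietly = TRUE)) {
  # Assumed interface: data matrix plus a likelihood specification.
  fit_yz  <- InfoTrad::YZ(data, likelihood = "LK")
  fit_gan <- InfoTrad::GAN(data, likelihood = "EHO")
  print(fit_yz)   # parameter estimates, likelihood value and PIN
  print(fit_gan)
}
```

Ten days are used only to keep the example short; a longer sample would be needed for reliable estimates.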
This research is supported by the Scientific and Technological Research Council of Turkey (TUBITAK), Grant Number: 116K335.
For instance, analyst coverage (Easley, O’Hara, and Paperman 1998), stock splits (Easley, O’Hara, and Saar 2001), initial public offerings (Ellul and Pagano 2006), credit ratings (Odders-White and Ready 2006), M&A announcements (Aktas et al. 2007) and asset returns (Easley, Hvidkjaer, and O’Hara 2002, 2010), among others.↩︎
Both the PIN package of Zagaglia (2013) and the FinAsym package of Zagaglia (2012) fail to acknowledge the boundary constraints on the arrival rates \(\mu,\epsilon_b,\epsilon_s\). As with the event probabilities, they restrict these parameters to \([0,1]\), which forces the estimated arrival rates of informed and uninformed traders on a given day to be at most one. This creates significant bias in PIN estimates.↩︎
For example, Zagaglia
(2012) provides sample data to calculate PIN. In the sample data,
the maximum trade number is 19. If each observation in the
sample data is multiplied by 10, the pin_likelihood() function of the
FinAsym package fails to provide results with
the sample initial parameter vector.↩︎
For quarterly estimations of PIN, one can be sure that there is at least one information event, namely the earnings announcement. Therefore, \(\hat{\alpha}\) cannot be equal to zero.↩︎
The hclust() function is used with its default
settings, in line with Gan, Wei, and Johnstone
(2015).↩︎
Neither Gan, Wei, and Johnstone (2015) nor Ersan and Alici (2016) mentions the case where \(\hat{\mu_b}<0\) or \(\hat{\mu_s}<0\). It is fair to assume that in such cases, informed investors are not present on the buy (sell) side. Therefore, we set \(\mu_b\) and \(\mu_s\) equal to zero when we obtain a negative estimate.↩︎
We also show that estimates for \(\mu\) contain a significant downward bias due to a poor choice of the initial parameter value \(\mu_0\) when the GAN algorithm is used.↩︎
Ersan and Alici (2016) also provide an iterative process in which they systematically update the clusters. We plan to introduce this methodology in future versions of our package.↩︎
The numbers are randomly selected. We set the numbers to be high enough that the original likelihood framework presented in equation (1) cannot be used due to FPE. Easley et al. (1996) indicate that at least 60 days' worth of data is required in order to obtain proper convergence for \(\widehat{PIN}\). We use ten days for demonstration purposes.↩︎
All estimations are conducted on a 2.6 GHz Intel i7-6700HQ CPU. We do not consider speed as a performance measure since the average processing time for each method is less than 10 seconds.↩︎