So a few weeks ago I had the good fortune to get my hands on intraday trading data, thanks to the kindness of the folks at Nanex Research. I asked for a plain old text file rather than learn their custom software and API, and what I received was a 23 GB text file. Now, it’s a bit difficult just to open a 23 GB text file and see what’s inside, since most programs aren’t built to load that much data into memory.
But I was excited to see what was inside nonetheless. This was also the dataset I’d been waiting all semester to receive, and the one I was looking forward to exploring in more depth in Mark Hansen’s Data class. Unfortunately, it came to me when there were only about 2 weeks left in the semester, but Mark was kind enough to sit down with me and go over some basic techniques for opening, parsing, and modeling the data inside this massive file.
So, first of all, there are some helpful commands in Terminal that let me see what was inside the file and then make smaller files that R could actually open. For instance, just to see what the file contained, I typed:
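(I don’t have the exact line saved, but it was essentially plain head, which prints the first ten lines of a file by default; ‘filename’ stands in for the real file name.)

head 'filename'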
Then, in order to see more of the file, I typed:
head -10000 'filename' > tenthousandlines.txt
This gave me a new text file, named “tenthousandlines.txt” which contained the first ten thousand lines of the original file.
Another useful command is grep. Grep lets you search for a term, and returns all lines that contain that term. After looking through the head file with Mark, we noticed that one of the first actual trades was of the stock symbol eNOK, which stands for Nokia. So we made another file that searched for “eNOK” and then saved all the lines containing that term into a new file.
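The command would have looked something like this (a sketch; I’m guessing at the output file name based on the enok.txt file that shows up in the R code below):

grep 'eNOK' 'filename' > enok.txt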
Also, by typing ‘tail’ we were able to check the last timestamp in the large file, and it turns out that a 23 GB file of electronic exchange activity (from what day, I’m not sure) translates to only about 4 hours’ worth of data.
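tail is the mirror image of head, printing the last ten lines of a file by default, so checking the final timestamp is just:

tail 'filename'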
We also made a file that counted how many lines shared the same timestamp (truncated to the second). The resolution of the exchange data is 25 milliseconds, but we wanted to get a feel for how much activity there was at different times of the day.
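I don’t have the exact pipeline we used, but one way to do it is to cut out the seconds-resolution timestamp field and pipe it through uniq -c, which counts runs of identical adjacent lines (this assumes the fields are pipe-delimited and the truncated timestamp is the fifth field, as in the eNOK rows shown further down):

cut -d'|' -f5 'filename' | uniq -c > count.txt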
Next we brought the files into R.
We took a few approaches to understanding what was in the data. Keeping with the quote activity over the course of 4 hours, we first graphed that activity by typing:
cntPerSec = read.table("Desktop/count.txt")
plot(cntPerSec$V1,type="l")
This yielded a plot that looks like this:
We wanted to get a better look at what was happening in the later part of the data, so we typed:
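(The exact line isn’t recorded here, but it amounts to plotting just the tail end of the vector; the 5,000-second window below is an arbitrary choice on my part.)

plot(tail(cntPerSec$V1, 5000), type="l")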
Which gave us a plot that looked like this:
Then we did the next logical thing, which was to plot it as a histogram:
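That would be R’s built-in hist(); the exact call is my reconstruction:

hist(cntPerSec$V1)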
This looked like the familiar “hockey stick” so we did the obvious next step, graphing the log histogram:
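Which is just a histogram of the logged counts (again a reconstruction; the base of the log only shifts the scale, not the shape):

hist(log(cntPerSec$V1))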
This gave us a very normal-looking distribution:
So normal, in fact, that Mark showed me how to plot the data as a QQ-norm plot, which, from a little googling, is used to compare the distribution of the data to another distribution; in our case, to see whether the data follows a normal distribution (right, Mark?).
So we typed:
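Roughly the following, I believe: qqnorm() draws the plot and qqline() adds the reference line through the quartiles (whether we fed it the raw or the logged counts I’d have to double-check, so take the exact argument as an assumption):

qqnorm(log(cntPerSec$V1))
qqline(log(cntPerSec$V1))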
Which gave us the QQ-norm plot and the “best-fit” (linear regression) line.
So the timing of things was one way to approach the dataset, but we also looked at a more traditional way of approaching financial data, stock by stock. Using the eNOK file we made from the grep commands in terminal, we took that into R, and looked at what was there.
The main thing we did was plot the price of eNOK stock over the course of the 4 hours. We typed:
> enok <- read.table("Desktop/enok.txt",sep="|",as.is=T)
> head(enok)
  V1   V2   V3           V4       V5   V6   V7  V8   V9 V10
1 MQ NYSE eNOK 04:00:00.175 04:00:00 PACF 4.03 190 4.05  10
2 MQ NYSE eNOK 04:00:01.125 04:00:01 PACF 4.03 290 4.05  10
3 MQ NYSE eNOK 04:00:03.875 04:00:03 PACF 4.06 500 4.09 500
4 MQ NYSE eNOK 04:00:03.875 04:00:03 PACF 4.07 500 4.09 500
5 MQ NYSE eNOK 04:00:13.100 04:00:13 PACF 4.08 150 4.09 500
6 MQ NYSE eNOK 04:00:13.125 04:00:13 PACF 4.07 500 4.09 500
> plot(enok$V7,type="l")
And plotted this chart:
While it’s not necessarily the most interesting thing to learn about the data, to my delight the chart looks a lot like the Nanex charts I’ve been poring over the whole semester. So I realized that R is the way it’s done.
A huge thanks to Mark Hansen for encouraging me to keep asking for data even after getting refused numerous times. I’m looking forward to using the techniques he showed me to keep poring through this dataset. My first plan will most likely be a sonification of actual trades set against the constant noise of orders posted and cancelled. But I’m very excited to take a look at the activity happening at a sub-second timescale. It’s pretty mind-boggling, but I guess that’s how 4 hours fills up a 23 GB text file.