Bit more

2016-12-27 00:01:35 +00:00 · 2016-12-27 00:01:35 +00:00 · 8ddd6a2d49
commit 8ddd6a2d49
parent 57800c7114
4 changed files with 697 additions and 0 deletions
--- a/scraping-alpha/Scraping_Alpha/README
+++ b/scraping-alpha/Scraping_Alpha/README
@@ -0,0 +1,81 @@
+# Scraping Alpha
+
+### Author
+
+Ben Goldsworthy
+<[email](mailto:b.goldsworthy@lancaster.ac.uk)> 
+<[website](http://www.bengoldsworthy.uk/)>
+
+### Version
+
+1.0
+
+### Abstract
+
+Scraping Alpha is a series of Python scripts used to scrape 
+[Seeking Alpha](http://seekingalpha.com/) earnings call transcripts and produce 
+SQL from them.
+
+It was created for Dr Lars Hass of the Lancaster University Management School.
+
+### Usage
+
+The instructions for each step of the process can be found at the beginning of 
+each of the files involved: `transcript_spider.py`, `JSONtoSQL.py` and 
+`execsAndAnalysts.py`. They are repeated here for convenience.
+
+#### `transcript_spider.py`
+
+This file is the web spider that Scrapy uses to retrieve the information from
+the website. Left unattended, it will scrape all 4,000+ pages of results.
+To interrupt this behaviour and still be able to proceed with the other steps, 
+stop the script with `CTRL+Z`. This will likely leave an unfinished JSON item 
+at the end of the output file. To clear this up, open the file in `vim` and
+press the following keys in order:
+```vim
+G
+V
+d
+$
+a
+BACKSPACE
+ENTER
+]
+ESC
+:wq
+ENTER
+```
+
+This will truncate the file at the last complete record and seal it off.
+
+For installation instructions for Scrapy, see the
+[Scrapy installation guide](https://doc.scrapy.org/en/latest/intro/install.html).
+This file should be in the `spiders` directory of the project, and is run via
+`scrapy crawl transcripts -o transcripts.json` at the command line (the output
+file will be placed in the current working directory).
+
+#### `JSONtoSQL.py`
+
+This file takes the `transcripts.json` file output by `transcript_spider.py`
+and converts it into SQL.
+
+This file should be located in the same directory as `transcripts.json`, and is 
+run via `python JSONtoSQL.py > [FILE].sql`, where `[FILE]` is the desired name 
+of the output file. 
+
+#### `execsAndAnalysts.py`
+
+First, import the output file of `JSONtoSQL.py` into your chosen DBMS (I've
+tested it with phpMyAdmin). Then, run the following query:
+```SQL
+SELECT `id`, `execs`, `analysts` FROM `transcripts`
+```
+
+Export the resulting table ([instructions](http://serverfault.com/a/435443)) to 
+`transcripts.sql`, and place the file in the same directory as 
+`execsAndAnalysts.py`. Run it with `python execsAndAnalysts.py`.
+
+From this, it creates two files (`execs.sql` and `analysts.sql`). Import them 
+into your DBMS to create two linking tables. The final instruction of 
+`analysts.sql` then deletes the superfluous `execs` and `analysts` columns from 
+the `transcripts` table.
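
Note on the README above: the manual `vim` clean-up described under `transcript_spider.py` could also be scripted. What follows is a minimal Python 3 sketch and is not part of this commit. It assumes the interrupted `transcripts.json` holds one JSON object per line, separated by commas, which is how Scrapy's JSON feed export typically writes it. The names `seal_truncated_json_array` and `repair_file` are hypothetical.

```python
import json


def seal_truncated_json_array(text):
    """Cut `text` back to its last complete record and close it with `]`.

    Assumption: one JSON object per line, with a comma after every
    record except the last.
    """
    lines = text.rstrip().split("\n")
    # Drop trailing lines until what remains parses as a JSON array.
    while lines:
        # Remove the dangling comma (if any) and append the closing bracket.
        candidate = "\n".join(lines).rstrip().rstrip(",") + "\n]"
        try:
            json.loads(candidate)
            return candidate
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            lines.pop()
    raise ValueError("no complete record found")


def repair_file(path):
    """Rewrite the file at `path` in place with the sealed array."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(seal_truncated_json_array(text))
```

Calling `repair_file("transcripts.json")` overwrites the file, so it may be worth copying the original first. Each attempt re-parses the whole file, which should be acceptable when only the last line or two is damaged.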