Custom APIs and Web Scraping for Science

My team’s most recent application, Helix, involved genome visualization. We integrated it with the 23andMe API, but still needed a way to look up interesting information about specific RSIDs (the IDs researchers and databases use to refer to specific base pairs of DNA). By far the most useful and open source repository of genetic information is SNPedia, but I needed access to lots of its information and a way to make calls for specific SNPs. Basically, I needed an API. So, being ever resourceful, I decided to make my own.

Tools for the task were an easy choice. I needed a small, fast server that I could implement a web scraper on. I have always wanted a reason to use BeautifulSoup, but it’s a Python library, so I knew it would be easier to build a Python server to run the API endpoints. I chose Flask because of its lightweight nature and how much it reminds me of a Node/Express server at times.

Thankfully there are some really good tutorials for both Flask and BeautifulSoup. My favorites (and the ones I referenced when I hit weirdness) were Designing a RESTful API and Website Scraping with BeautifulSoup. Both of these tutorials said a lot of things better than I could have myself.

For access to my SNPedia API and information on how to use it, check out my project on GitHub.

Week 9: Highs and Lows and… WTF I only have three weeks left??

My week started out fairly average. We were all rolling along on our projects and then I noticed an event on the Hack Reactor Senior calendar. Tuesday, three weeks from this past Tuesday, is Hiring Day. Three weeks?? Not even now, more like two?? Oh, god. And yet, as much of a whirlwind as this has been and as often as I have impostor syndrome, I’m a little excited. I want to see what’s out there for me and find a job and learn and grow and do my instructors proud.

One slight stumbling block for me this week: Hacker in Residence positions. I applied and think I would have been accepted, but I had to bow out. After I sat down and thought about it, I just couldn’t justify being out of work that much longer (even on a stipend). It would have been fun to learn how to teach and spend some more time hacking on personal ideas, but that’s what weekends are for, right?

We also got to demo Helix for the first time. Helix is a gene visualization app that shows you your SNPs (base pairs) from 23andMe that have traits attached to them (according to SNPedia.com). You can search traits or just browse your chromosomes for interesting info. It was built using a private beta framework (called Famo.us) that my team was lucky enough to get to be involved with. We have *fingers crossed* two more opportunities to demo Helix: one more run-through at Hack Reactor and, if all goes well, a private party/meetup for Famo.us.

Another fun thing that came out of Helix was that I got to dust off my Python knowledge. I had wanted to try BeautifulSoup (a Python web scraper) for a while, and I needed an easy way to pull RSID information from SNPedia, so I created my own API wrapper! The code is available (including instructions on how to run it on your own) on my GitHub account. It’s a tiny Python/Flask server that only has a couple of endpoints (the ones I really needed), but I’m thinking about expanding it eventually.

And then I got sick. I came down with a cold on Friday and haven’t been to Hack Reactor since. I’ve been working from home, but mostly just trying to sleep, having weird dreams, and sounding pitiful. I’m getting better though and I will definitely be on-point on Monday to work out the last-minute details of Helix before all the demos come crashing around us.

Three more weeks until I graduate! My gift to myself – I’m attending the LAUNCH hackathon with two other women from Hack Reactor the weekend after it’s all over. I just don’t want to get lazy!

Week 8: New Surroundings, Same Routine

Sorry for the delay in this post. My roommate, Ava, was worried about my long hours all week so she wouldn’t let me touch my laptop on Sunday. Saturday night was card games with Hack Reactor peeps so I was out late. It was a jumble of a week and I’m writing this so late that this week is already upon me. So I think this one will be very short.

We started working at Famo.us this past week. They have a beautiful office that was converted from an apartment. It’s weird to go back to Hack Reactor now with their darker rooms and only two bathrooms, but I still miss it something fierce when I’m away. Hack Reactor feels like home, but Famo.us is a nice vacation. Our project is slowly progressing. We have some really neat ideas about gene visualization and if all goes well we’ll get to demo the awesomeness in front of a bunch of people.

Other fun things from this week included a talk on Thursday from the author of Cracking the Coding Interview and Saturday social night where I got my ass handed to me in Marvel vs. Capcom and then made people squirm in Cards Against Humanity.

One final thing: Ava talked me into buying a FitBit! I’ve been meeting all my goals every day and it’s pink so life is pretty amazing. You can find me on Fitbit here.

Week 7: Is This What Confidence Feels Like?

Coming back from break was wonderful. I really missed this place and these people and I’m at a point now where I’m excited to walk in the front door of this space. I am really, truly a Software Engineer. I have been for a long time, but it took this place and these people to pull that knowledge out of myself. I started the week with giant hugfests of awesome. It was great to see everyone after two weeks. There was some unexpected lack of (and new growth of) facial hair and general fun stories about hijinks had during our time away. We all quickly felt the glory of being seniors and then were promptly blown away by how awesome the new batch of juniors are.

There wasn’t much time to chat, though. The juniors were starting their hell week and we were about to embark on a different sort of hell – Hiring Day Assessments. I was terrified. I’ve decided my brain just needs to have something to focus on being terrified about to function at all – I’m starting to wonder if losing my fear would also diminish my awesomeness. We had all day to finish our assessments and as I dove in my confidence built. I knew this stuff. I knew it from the times it had been drilled into my head and the moments when I was working on something alone and would need to Google a concept and those times at the lunch table with my peers discussing wild and crazy new concepts. It rocked to realize how awesome we all are now. Everyone can tell us we are awesome until they’re blue in the face, but it’s moments like that when it clicks for me.

The other moments when it clicks for me are the new, terribly unfunny programming jokes we’ve all started making. It’s getting ridiculous.

After Monday’s stress, we quickly got our hands dirty in our code. Our first round of group projects wrapped this week. I worked with Sara and João to make a custom HTML5 video player plugin to vote on moments in videos and visualize the user data. Our project is called HeatVote. We’re still hacking on it in our “free time”, but its production cycle is officially over. There are a few previous posts on things I worked on for this project and I feel like I have a book more to write about the experience, but time is, as my faithful readers know, very short lately so I’m going to close this book for now.

Our next project period starts on Tuesday. I was fortunate enough to get a client project working with an awesome team to create mobile web apps at Famo.us! I am very excited to dive into unfamiliar territory, learn, and help out a team of super talented people.

D3.js Rollups

Do you have all the data and none of the visuals? Do you just want a pretty, fast way to compare lots of data that centers around maybe just a handful of moments?

D3.js can help you tame all of your data, and the rollup method on d3.nest is especially useful if you have lots of data that you need to combine into just a couple of data points. All it takes is a couple of (pretty long) lines of code and you will have an awesome visual that’s very customizable.

Let’s start with a really straightforward example of a rollup. In all of these examples, I’m using code straight from my HeatVote project, which pulls voting data from our server API as a JSON blob. Here’s an example entry:

{ video_id: 'T-D1KVIuvjA',
  timestamp: 2,
  vote: 1,
  id: 1,
  createdAt: Sat Dec 21 2013 14:55:42 GMT-0800 (PST),
  updatedAt: Sat Dec 21 2013 14:55:42 GMT-0800 (PST) }

Now obviously there are a bunch of these, and technically there are easier ways to do this, but to show off the structure of a rollup, let’s count how many entries we had in our database using a d3 rollup!

var total = d3.nest()
  .rollup(function(d){
    return d.length;
  })
  .entries(data);

Remember, data here is my array of JSON entries, so in our rollup function the d is just shorthand for all of the data. This isn’t a very interesting example, though, so let’s take a look at something that really shows off the beauty of a d3 rollup.

var averages = d3.nest()
  .key(function(d) {
    return d.timestamp;
  })
  .sortKeys(d3.ascending)
  .rollup(function(d) {
    return d3.mean(d, function(g) {
      return +g.vote;
    });
  })
  .entries(data);

Now there is a lot going on in these few very compact lines, so we’ll go through them one by one, but the result is that averages is equal to an array of objects with the properties key (that is equal to each unique timestamp) and values (that is equal to the mean of all votes at that timestamp; D3 version 4 and later name this property value instead).

So let’s break it down:

  • .key(...) tells the nest which property to group by; each unique value of that property becomes one key.
  • .sortKeys is just a prettiness thing; it sorts my keys into an order (when they’re pulled off the server the only order is by the time they were created on the database). Nest keys are strings, so d3.ascending compares them as text and "10" sorts before "2"; pass a numeric comparator if that matters for your data.
  • and finally our lovely .rollup(...). Instead of d being an array of the whole data, it’s now an array of only the data for each individual key (so all of the data with the same timestamp). The inner function d3.mean takes a specific property from all of the data for each key and averages the values.
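
To see what the nest is doing without D3 in the way, here is a minimal plain-JavaScript sketch of the same grouping and averaging. The helper name meanVoteByTimestamp, the mean property, and the sample votes are all made up for illustration; this is not how D3 implements nest, and unlike d3.ascending it sorts the keys numerically.

```javascript
// Hypothetical helper: group votes by timestamp, then average each group.
// This is the same grouping that nest().key(...).rollup(...) performs.
function meanVoteByTimestamp(data) {
  var groups = {};
  data.forEach(function (d) {
    // Object keys are strings, just like nest keys
    if (!groups[d.timestamp]) groups[d.timestamp] = [];
    groups[d.timestamp].push(+d.vote);
  });

  return Object.keys(groups)
    .sort(function (a, b) { return a - b; }) // numeric sort, so 10 comes after 2
    .map(function (key) {
      var votes = groups[key];
      var sum = votes.reduce(function (total, v) { return total + v; }, 0);
      return { key: key, mean: sum / votes.length };
    });
}

// Made-up sample data in the same shape as the example entry
var sampleVotes = [
  { video_id: 'T-D1KVIuvjA', timestamp: 10, vote: 1 },
  { video_id: 'T-D1KVIuvjA', timestamp: 2, vote: 1 },
  { video_id: 'T-D1KVIuvjA', timestamp: 2, vote: -1 },
  { video_id: 'T-D1KVIuvjA', timestamp: 10, vote: 1 }
];

var sampleAverages = meanVoteByTimestamp(sampleVotes);
console.log(sampleAverages);
```

Each group only ever sees the rows that share its key, which is exactly the d that the rollup function receives.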

And that is d3 rollup in a nutshell. It’s really lovely at coaxing relationships out of your raw data, and you can obviously do a lot more with it than just averaging things. The d3 nest docs are probably the next best place to look to get your hands dirty (.rollup is a method of nest).

Cloud PostgreSQL Servers with Heroku

While it might be possible that I’m the only person out there who didn’t want to hassle with creating a Postgres database on my computer that needed to be passed around to my project-mates, I have a feeling there is one lost soul in the universe besides me who might find this useful. When I first tried to use Heroku’s PostgreSQL database add-on the only documentation I really found was for attaching it to existing Heroku apps. I just wanted a dev environment and currently don’t have plans to deploy the app with Heroku so I just wanted to “borrow” their free level of database storage.

It is shockingly easy. All I needed to do was log in to my Heroku dashboard and click the databases tab. From there I created a new database (be sure to choose the almost hidden Dev Plan (free) option on the lower left). Once it was done spinning up, I clicked on it and was taken to a page with all the login info you could ever need:

[Screenshot: Heroku Postgres login info]

I just plugged all of that info into Sequelize (see my previous post). Because I’m using Heroku Postgres, I had to make sure I used the pg.native options in both Node and Sequelize.

A few caveats! For a Mac, if you just install the Postgres standalone app from PostgreSQL, life will not be pleasant (I learned this the hard way). The easiest way to make life happy is to use Homebrew to install Postgres. Please have Homebrew, it makes life super easy. All you need to do then is brew install postgresql.

The second awesome thing about Heroku Postgres is that although I’m only on the free version, I can still easily log in to my database from the command line and change things. On the page with the connection info, click the double arrows on the right and choose URL. Now, in Terminal type psql DATABASE_URL_HERE and voila! direct access to your database.
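
If the URL is the only thing you copied, the individual connection fields can be pulled back out of it. Here is a minimal sketch, assuming the URL has the usual postgres://user:password@host:port/database shape; parseDatabaseUrl and the example URL are made up for illustration, and the only API used is Node’s built-in URL class.

```javascript
// Hypothetical helper: split a postgres:// connection URL into separate fields.
function parseDatabaseUrl(databaseUrl) {
  var parsed = new URL(databaseUrl); // URL is a global in current versions of Node
  return {
    database: parsed.pathname.replace(/^\//, ''), // drop the leading slash
    username: decodeURIComponent(parsed.username),
    password: decodeURIComponent(parsed.password),
    host: parsed.hostname,
    port: Number(parsed.port) || 5432 // 5432 is the default Postgres port
  };
}

// Made-up URL, not a real database
var exampleConfig = parseDatabaseUrl('postgres://alice:s3cret@db.example.com:5432/helix_dev');
console.log(exampleConfig);
```

The resulting object has the same kind of fields as the login page, so it can be handed to whatever is making the connection.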

Setting Up a Database Node Module on a Node/Express Server with Sequelize

If you just need to get a node server up and running with very few lines of code then you hopefully already know about Node/Express (if you don’t, the Express website has a pretty good intro tutorial and has good documentation to get started serving up static assets). If you need all of that and to be able to route queries to a database quickly and easily then you might not know about the awesome power of Sequelize.

If you haven’t messed around with an ORM (Object Relational Mapper, a library that maps objects in your code to tables in a database) before, Sequelize is a really straightforward one to start with.

var voteTable = sequelize.define('vote', {
  video_id: Sequelize.STRING,
  timestamp: Sequelize.INTEGER,
  vote: Sequelize.INTEGER
});

voteTable.sync();

Once I had my server set up (and had installed the Sequelize dependencies), the code above was all I needed to create a new table for my current group project. Sequelize automatically includes extra columns like a unique id and createdAt/updatedAt timestamps (there are ways to tell it not to, too).

The rest of the initial set up is easy too, but it’s well documented over in the Sequelize docs. The fun part comes when you can start to make your database more modular. The first thing I did was to take out the actual username/password information from my database connection. I stored them as an object in a separate JavaScript file (so I could add it to .gitignore and not share my passwords with the world).

//module.exports allows us to use this code in other places as a node module, 
//we'll see it again when I make the database calls modular
module.exports = { 
  database: 'database_name', 
  username: 'username', 
  password: 'secret_password',
  host: 'host_url',
  port: 5432,
  dialect: 'postgres', //obviously you don't have to use PostgreSQL
  native: true //required for Heroku Postgres (I'll cover that in another post)
};

I saved that snippet to a file called db_config.js at the root directory and then created my main database module, database.js, in the subfolder /controller/. To get access to the private config object, all I need to do is set up my dependencies and import the db_config file; then I can start using my config variables to connect to my database:

var Sequelize = require('sequelize');
var pg = require('pg').native; //again this line is specific to using a Postgres database
var config = require('../db_config');

var sequelize = new Sequelize(config.database, config.username, config.password, {
  host: config.host,
  port: config.port,
  dialect: config.dialect,
  native: config.native //Heroku Postgres again
});

Between that code and the table definition I have full access to a database, but I still need to get this all into app.js, the simple server I created with Node/Express. This is a super easy leap. First I create my functions to send and retrieve data in /controller/database.js:


module.exports.createVote = function(req, res){
  //code to bundle up the created object and save it to the database
};

module.exports.getVotes = function(req, res){
  //code to find votes based on specific requests from the user
};
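
To give a feel for the shape of those two handlers, here is a minimal sketch that runs without Express or a database. Everything in it is an assumption for illustration: the fakeVotes array stands in for the Sequelize table, fakeResponse stands in for the Express response object, the vote is assumed to arrive on req.body (which needs a body parser in a real app), and the video ID is assumed to arrive as a route parameter named vidID. It is not the code from my project.

```javascript
// In-memory stand-in for the votes table (the real code talks to Sequelize)
var fakeVotes = [];

// Save a vote from the request body and echo it back
function createVote(req, res) {
  var vote = {
    video_id: req.body.video_id,
    timestamp: Number(req.body.timestamp),
    vote: Number(req.body.vote)
  };
  fakeVotes.push(vote);
  res.status(201).json(vote);
}

// Return every stored vote for the video named in the URL
function getVotes(req, res) {
  var matches = fakeVotes.filter(function (v) {
    return v.video_id === req.params.vidID;
  });
  res.json(matches);
}

// Tiny fake response object so the handlers can be called directly
function fakeResponse() {
  return {
    statusCode: 200,
    body: null,
    status: function (code) { this.statusCode = code; return this; },
    json: function (payload) { this.body = payload; return this; }
  };
}

var created = fakeResponse();
createVote({ body: { video_id: 'T-D1KVIuvjA', timestamp: '2', vote: '1' } }, created);

var fetched = fakeResponse();
getVotes({ params: { vidID: 'T-D1KVIuvjA' } }, fetched);
console.log(created.statusCode, fetched.body);
```

Because each handler only depends on req and res, swapping the array for real database calls does not change how the server wires them up.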

Now to have access to these functions (which are a part of the database.js node module, thanks to the module.exports object which is a feature of Node) I only need to require database.js in my main server app and call the functions where I need them:

var database = require('./controller/database');

//many lines later
app.post('/votes', database.createVote);
app.get('/votes/:vidID', database.getVotes);

The /votes/:vidID is a handy trick of Express to pass information to the server. The value gets attached to the req.params.vidID property so I can use it when I request specific information from the server. For example, if I wanted to query for results from a video ID of 123, I would send a GET request to /votes/123 and then my req.params.vidID === '123' (route parameters always arrive as strings).
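
One detail worth seeing in action: Express stores route parameters on req.params, and the values come straight out of the URL, so they are strings. Here is a rough, dependency-free sketch of the idea; matchRoute is a made-up helper and is not how Express actually implements routing.

```javascript
// Hypothetical matcher: compare a route pattern to a request path and
// collect the :named segments, roughly the way req.params gets filled in.
function matchRoute(pattern, path) {
  var patternParts = pattern.split('/');
  var pathParts = path.split('/');
  if (patternParts.length !== pathParts.length) return null;

  var params = {};
  for (var i = 0; i < patternParts.length; i++) {
    if (patternParts[i].charAt(0) === ':') {
      // Values are taken from the URL text, so they are always strings
      params[patternParts[i].slice(1)] = decodeURIComponent(pathParts[i]);
    } else if (patternParts[i] !== pathParts[i]) {
      return null; // a literal segment did not match
    }
  }
  return params;
}

var params = matchRoute('/votes/:vidID', '/votes/123');
console.log(params.vidID === '123');       // compare against a string
console.log(Number(params.vidID) === 123); // or convert before comparing to a number
```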

One last trick: in Sequelize, when you query the database you get back quite a bit more data than you might expect. When I query my voteTable (the one I only explicitly created three columns for?) I get back something that looks like this monster:

 { dataValues: 
     { video_id: '7QBgK0_RbkE', timestamp: 2, vote: 1, id: 192,
       createdAt: Sun Dec 22 2013 22:19:33 GMT-0800 (PST),
       updatedAt: Sun Dec 22 2013 22:19:33 GMT-0800 (PST) },
    __options: 
     { timestamps: true, createdAt: 'createdAt', updatedAt: 'updatedAt',
       deletedAt: 'deletedAt', touchedAt: 'touchedAt', instanceMethods: {}, classMethods: {}, 
       validate: {}, freezeTableName: false, underscored: false, syncOnAssociation: true,
       paranoid: false, whereCollection: [Object], schema: null, schemaDelimiter: '',
       language: 'en', defaultScope: null, scopes: null, hooks: [Object], omitNull: false, 
       hasPrimaryKeys: false },
    hasPrimaryKeys: false,
    selectedValues: 
     { video_id: '7QBgK0_RbkE', timestamp: 2, vote: 1, id: 192,
       createdAt: Sun Dec 22 2013 22:19:33 GMT-0800 (PST),
       updatedAt: Sun Dec 22 2013 22:19:33 GMT-0800 (PST) },
    __eagerlyLoadedAssociations: [],
    isDirty: false,
    isNewRecord: false,
    daoFactoryName: 'votes',
    daoFactory: 
     { options: [Object], name: 'votes', tableName: 'votes', rawAttributes: [Object],
       daoFactoryManager: [Object], associations: {}, scopeObj: {}, primaryKeys: {},
       primaryKeyCount: 0, hasPrimaryKeys: false, autoIncrementField: 'id', DAO: [Object] } 
 } 

And that’s just ONE entry in the database! To fix that, add an option to your Sequelize query, {raw: true}, so the query would look like:

voteTable.findAll(query, {raw: true})

And one entry of output would be:

{ video_id: 'T-D1KVIuvjA',
    timestamp: 2,
    vote: 1,
    id: 1,
    createdAt: Sat Dec 21 2013 14:55:42 GMT-0800 (PST),
    updatedAt: Sat Dec 21 2013 14:55:42 GMT-0800 (PST) }
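
If you already have the full instances in hand, the monster output above hints at another way out: each one carries its plain row under dataValues. Here is a minimal sketch with a mocked-up instance; toPlainRows is a made-up helper, and the mock is just a trimmed copy of the printed structure, not a real Sequelize object.

```javascript
// Hypothetical helper: keep only the row data from Sequelize-style instances
function toPlainRows(instances) {
  return instances.map(function (instance) {
    return instance.dataValues;
  });
}

// Mock instance, trimmed down from the output printed earlier
var mockInstances = [{
  dataValues: { video_id: '7QBgK0_RbkE', timestamp: 2, vote: 1, id: 192 },
  isDirty: false,
  isNewRecord: false
}];

var plainRows = toPlainRows(mockInstances);
console.log(plainRows);
```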

That is enough database tricks for today. If you want to just stare at my code for a while to create & retrieve votes from our database you can find my gist here (with sanitized login info for the db_config). The real (still being modded by the team) code is forked on my GitHub. I have a few more of these in the works from my adventures slinging code and I hope to post a few more before Hack Reactor starts back up and I lose all free time again.