sneakers-the-rat 0c1d340541 working version! 2022-10-29 20:08:02 -07:00



Scrape Open Behavior and convert it to structured data


  • scrape.py - Utility functions for listing and downloading files
  • parse.py - Parse downloaded HTML files
  • main.py - Wrapper/entrypoint scripts


>>> scrape_ob --help
usage: scrape_ob [-h] [-u URL] [-o OUTPUT] [-d]

Scrape Open Behavior and return structured data

  -h, --help            show this help message and exit
  -u URL, --url URL     Root URL for open behavior's open source project directory. Default is https://edspace.american.edu/openbehavior/open-source-tools/
  -o OUTPUT, --output OUTPUT
                        Output directory to store downloaded html files in. Default is ./html
  -d, --download        Just download html files without parsing them

Be kind and always give credit where labor has been done


We try to always work on local copies of the files so that we don't send too many requests to their server, so we start by downloading HTML copies of their project descriptions, say to the ./html folder:


from scrape_ob.scrape import download_all

From the CLI:

scrape_ob --download
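
As a rough illustration of the local-copy approach (not the actual `download_all` implementation, whose signature is not shown here), the sketch below only contacts the server when no copy of a page exists on disk yet. The names `local_path`, `fetch_bytes`, and `download_if_missing`, the `fetch` parameter, and the file naming scheme are all hypothetical.

```python
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse
from urllib.request import urlopen


def local_path(url: str, output_dir: Path) -> Path:
    """Map a project URL to an html file in output_dir (hypothetical naming scheme)."""
    # Use the last non-empty path segment as the file stem,
    # e.g. .../project/autopilot/ -> autopilot.html
    segments = [s for s in urlparse(url).path.split("/") if s]
    stem = segments[-1] if segments else "index"
    return output_dir / f"{stem}.html"


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: a single GET request using the standard library."""
    with urlopen(url) as response:
        return response.read()


def download_if_missing(
    url: str,
    output_dir: Path,
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> Path:
    """Save url into output_dir unless a local copy already exists."""
    output_dir.mkdir(parents=True, exist_ok=True)
    target = local_path(url, output_dir)
    if not target.exists():  # only hit the server on the first run
        target.write_bytes(fetch(url))
    return target
```

Passing the fetcher in as a parameter keeps the caching logic testable without any network access.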


From there we can parse an individual project's HTML representation into a structured one using two primary dataclasses:

  • Project: The primary representation of a project
  • Blurb: The image and link boxes at the bottom of a page that link to further details about a project. Since these have no structural information, use varying terminology, embed links within text, etc., we are leaving them relatively unprocessed for now, pending further refinement of the parsing classes.

The Project class can parse projects given a URL or a file, e.g.:

from scrape_ob.parse import Project

ap = Project.from_file('html/autopilot.html')

This gives us a structured representation of the project, whose fields we can access directly as attributes or pull out as a dictionary:

(omitting the body attribute for clarity)

>>> print(ap.dict())

{
    'name': 'AutoPilot: python framework for behavior experiments with raspberry pi',
    'url': 'https://edspace.american.edu/openbehavior/project/autopilot/',
    'date': datetime.datetime(2019, 12, 12, 0, 0),
    'tags': [...],
    'categories': [...],
    'rrids': ['SCR_021448', 'SCR_021518'],
    'blurbs': [
        {
            'links': [...],
            'name': 'Paper',
            'type': None
        },
        {
            'links': [...],
            'name': 'Github',
            'type': None
        },
        {
            'links': [...],
            'name': 'Website',
            'type': None
        }
    ],
    'docs': None,
    'repo': None,
    'paper': None
}

Note how we are able to pull out the "category" and "tags" information which is usually hidden as part of the page metadata.
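
One way this kind of hidden metadata can be recovered is sketched below. It assumes a WordPress-style page that encodes taxonomy terms as CSS classes on the `<article>` element. That assumption is for illustration only: the class prefixes `project_category-` and `project_tag-` and the names `ArticleClassParser` and `extract_taxonomy` are hypothetical and may not match what parse.py actually looks for.

```python
from html.parser import HTMLParser


class ArticleClassParser(HTMLParser):
    """Collect category and tag slugs from the class list of <article> elements."""

    # Hypothetical prefixes; adjust to whatever the real pages use.
    CATEGORY_PREFIX = "project_category-"
    TAG_PREFIX = "project_tag-"

    def __init__(self):
        super().__init__()
        self.categories = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "article":
            return
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls.startswith(self.CATEGORY_PREFIX):
                self.categories.append(cls[len(self.CATEGORY_PREFIX):])
            elif cls.startswith(self.TAG_PREFIX):
                self.tags.append(cls[len(self.TAG_PREFIX):])


def extract_taxonomy(html: str) -> dict:
    """Return the category and tag slugs found in an HTML document."""
    parser = ArticleClassParser()
    parser.feed(html)
    return {"categories": parser.categories, "tags": parser.tags}
```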

The extra docs, repo, and paper fields are currently left unfilled; we will eventually use the Blurb class to parse out that information.

The body of the project description is extracted into body and converted to markdown :)

Jonny Saunders from Michael Wehrs lab at the University of Oregon
recently posted a preprint documenting their project Autopilot, which is
a python framework for running behavioral experiments:


[Autopilot](https://auto-pi-lot.com/)\xa0is a python framework for
behavioral experiments through utilizing\xa0[Raspberry Pi
microcontrollers](https://www.raspberrypi.org/). Autopilot incorporates
all aspects of an experiment, including the hardware, stimuli,
behavioral task paradigm, data management, data visualization, and a
user interface. The authors propose that Autopilot is the fastest, least
expensive, most flexibile behavioral system that is currently available.

The benefit of using Autopilot is that it allows more experimental
flexibility, which lets researchers to optimize it for their specific
experimental needs. Additionally, this project exemplifies how useful a
raspberry pi can be for performing experiments and recording data. The
preprint discusses many benefits of raspberry pis, including their
speed, precision and proper data logging, and they only cost $35 (!!).
Ultimately, the authors developed Autopilot in an effort to encourage
users to write reusable, portable experiments that is put into a public
central library to push replication and reproducibility.

*This research tool was created by your colleagues. Please acknowledge
the Principal Investigator, cite the article in which the tool was
described, and include an\xa0RRID\xa0in the Materials and Methods of your
future publications.\xa0\xa0Project portal\xa0RRID:SCR_021448;
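
The rrids field in the dictionary corresponds to identifiers such as `RRID:SCR_021448` in the body text. A minimal sketch of pulling them out with a regular expression follows, assuming they always take the `RRID:SCR_<digits>` form seen here. The helper name `find_rrids` is hypothetical, and this is not necessarily how parse.py does it.

```python
import re

# Matches e.g. "RRID:SCR_021448", allowing optional whitespace after the colon
RRID_PATTERN = re.compile(r"RRID:\s*(SCR_\d+)")


def find_rrids(text: str) -> list:
    """Return the unique RRIDs in text, in order of first appearance."""
    seen = {}
    for match in RRID_PATTERN.finditer(text):
        seen.setdefault(match.group(1), None)  # dict keys preserve insertion order
    return list(seen)
```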

The utility function parse_folder can be used to parse the entire downloaded folder of html documents!
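
The real signature of parse_folder is not shown here, so the following is only a sketch of the general idea: walk the download folder and apply a per-file parser to every html document. The function name `parse_html_folder` and its `parse_file` parameter are hypothetical.

```python
from pathlib import Path
from typing import Callable, List, TypeVar

T = TypeVar("T")


def parse_html_folder(folder: Path, parse_file: Callable[[Path], T]) -> List[T]:
    """Apply parse_file to every .html file in folder, in sorted filename order."""
    return [parse_file(path) for path in sorted(Path(folder).glob("*.html"))]
```

With the package installed, `parse_file` would presumably be something like `Project.from_file`, though whether it accepts a `Path` as well as a string is an assumption.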


  • Complete parsing of blurbs
  • Export Project to structured markdown with YAML headers or to whatever format open neuro ppl want
  • Make mediawiki template and client code to push into the open neuro wiki
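
For the markdown export item above, one possible shape is sketched here using only the standard library. It relies on JSON values being valid YAML flow syntax, so `json.dumps` takes care of quoting. The function name `to_front_matter` is hypothetical, keys are assumed to be plain identifiers, and only top-level datetime values are handled.

```python
import json
from datetime import datetime


def to_front_matter(metadata: dict, body: str) -> str:
    """Render metadata as a YAML header followed by the markdown body."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, datetime):
            # Keep only the calendar date, e.g. "2019-12-12"
            value = value.date().isoformat()
        # JSON scalars, lists, and objects are also valid YAML
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.strip() + "\n"
```

A dictionary like the one printed by `ap.dict()` could be passed as `metadata`, with the body attribute passed separately as `body`.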