scrape_ob
Scrape open behavior and convert to structured data
Components
- scrape.py: Utility functions for listing and downloading files
- parse.py: Parse downloaded HTML files
- main.py: Wrapper/entrypoint scripts
Usage
>>> scrape_ob --help
usage: scrape_ob [-h] [-u URL] [-o OUTPUT] [-d]
Scrape Open Behavior and return structured data
options:
-h, --help show this help message and exit
-u URL, --url URL Root URL for open behavior's open source project directory. Default is https://edspace.american.edu/openbehavior/open-source-tools/
-o OUTPUT, --output OUTPUT
Output directory to store downloaded html files in. Default is ./html
-d, --download Just download html files without parsing them
Be kind and always give credit where labor has been done
Downloading
We try to always work on local copies of the files so that we do not make too many requests to their server, so we start by downloading HTML copies of their project descriptions, say to the ./html folder.
Interactively:
from scrape_ob.scrape import download_all
download_all(output_folder='./html')
From the CLI:
scrape_ob --download
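As a rough illustration of the "work on local copies" idea, the sketch below downloads a page only when no local copy exists yet. It is not the implementation of download_all; the names fetch, filename_for, and download_missing are hypothetical, and the file-naming rule (last URL path segment plus .html) is an assumption.

```python
from pathlib import Path
from typing import Callable, Iterable, List
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Fetch one page over HTTP (the only place network access happens)."""
    with urlopen(url) as response:
        return response.read()


def filename_for(url: str) -> str:
    """Assumed naming rule: last path segment of the project URL + '.html'."""
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    return f"{slug or 'index'}.html"


def download_missing(
    urls: Iterable[str],
    output_folder: str = "./html",
    fetcher: Callable[[str], bytes] = fetch,
) -> List[Path]:
    """Hypothetical helper: save each page once, skipping existing files."""
    out = Path(output_folder)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        target = out / filename_for(url)
        if not target.exists():
            # Reuse the local copy instead of requesting the page again
            target.write_bytes(fetcher(url))
        paths.append(target)
    return paths
```

Running it twice over the same URLs should hit the server only on the first pass; the fetcher argument exists so the network call can be swapped out when testing.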
Parsing
From there we can parse an individual project's HTML representation into a structured one using two primary dataclasses:
- Project: The primary representation of a project
- Blurb: The image and link boxes at the bottom of a page that link to further details about a project. Since these don't have structural information, use varying terminology, embed links within text, etc., we are leaving them relatively unprocessed for now, pending further refinement of the parsing classes.
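To make the shape of these two classes concrete, here is a minimal sketch using the standard library's dataclasses. The field names are taken from the example output shown later in this README; the types, defaults, and the use of dataclasses.asdict are assumptions, and the real classes in scrape_ob.parse may be defined differently.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Blurb:
    """A link box at the bottom of a project page (left mostly unprocessed)."""
    name: str
    links: List[str] = field(default_factory=list)
    type: Optional[str] = None


@dataclass
class Project:
    """Structured representation of one project page."""
    name: str
    url: str
    date: datetime
    body: str = ""
    tags: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    rrids: List[str] = field(default_factory=list)
    blurbs: List[Blurb] = field(default_factory=list)
    # Assumed to stay None until blurb parsing is complete
    docs: Optional[str] = None
    repo: Optional[str] = None
    paper: Optional[str] = None


example = Project(
    name="AutoPilot: python framework for behavior experiments with raspberry pi",
    url="https://edspace.american.edu/openbehavior/project/autopilot/",
    date=datetime(2019, 12, 12),
    blurbs=[Blurb(name="Paper",
                  links=["https://www.biorxiv.org/content/10.1101/807693v1"])],
)
# asdict recurses into the nested Blurb objects, giving plain dicts and lists
as_dictionary = asdict(example)
```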
The Project class can parse projects given a URL or a file, e.g.:
from scrape_ob.parse import Project
ap = Project.from_file('html/autopilot.html')
This gives us a structured representation of the project, which we can access directly as attributes or pull out as a dictionary (omitting the body attribute for clarity):
>>> print(ap.dict())
{
'name': 'AutoPilot: python framework for behavior experiments with raspberry pi',
'url': 'https://edspace.american.edu/openbehavior/project/autopilot/',
'date': datetime.datetime(2019, 12, 12, 0, 0),
'tags': [
'automated',
'cognition',
'decision-making',
'gui',
'hardware',
'perceptual',
'raspberrypi',
'sensory',
'software'
],
'categories': [
'behavior-apparatus',
'behavior-measurement',
'data-analysis-software',
'behavior-rigs',
'behavior-analysis',
'behavioral-tasks',
'freely-moving',
'integrated-systems',
'stimuli'
],
'rrids': ['SCR_021448', 'SCR_021518'],
'blurbs': [
{
'links': [
'https://www.biorxiv.org/content/10.1101/807693v1'
],
'name': 'Paper',
'type': None
},
{
'links': [
'https://github.com/wehr-lab/autopilot',
'http://docs.auto-pi-lot.com/'
],
'name': 'Github',
'type': None
},
{
'links': [
'https://auto-pi-lot.com/',
'https://auto-pi-lot.com/presentation/#/'
],
'name': 'Website',
'type': None
}
],
'docs': None,
'repo': None,
'paper': None
}
Note how we are able to pull out the "category" and "tags" information, which is usually hidden as part of the page metadata.
The extra docs, repo, and paper fields are currently left unfilled, as we will eventually use the Blurb class to parse out that information.
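One way category and tag information can hide in page metadata is in CSS class names: WordPress themes commonly emit classes such as category-<slug> and tag-<slug> on a post's wrapping element. The sketch below reads those classes with the standard library's HTML parser. That Open Behavior's pages follow this convention, and that scrape_ob extracts the values this way, are both assumptions; TaxonomyParser and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class TaxonomyParser(HTMLParser):
    """Collect category-* and tag-* CSS classes from <article> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.categories: List[str] = []
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "article":
            return
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls.startswith("category-"):
                self.categories.append(cls[len("category-"):])
            elif cls.startswith("tag-"):
                self.tags.append(cls[len("tag-"):])


# Made-up markup for illustration, not copied from the real site
sample = (
    '<article class="post-1 category-behavior-apparatus category-stimuli '
    'tag-gui tag-decision-making"><p>Project text</p></article>'
)
parser = TaxonomyParser()
parser.feed(sample)
```

After feeding the sample, parser.categories and parser.tags hold the slugs with their prefixes stripped.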
The body of the project description is extracted into body and converted to markdown :)
Jonny Saunders from Michael Wehr’s lab at the University of Oregon
recently posted a preprint documenting their project Autopilot, which is
a python framework for running behavioral experiments:
------------------------------------------------------------------------
[Autopilot](https://auto-pi-lot.com/) is a python framework for
behavioral experiments through utilizing [Raspberry Pi
microcontrollers](https://www.raspberrypi.org/). Autopilot incorporates
all aspects of an experiment, including the hardware, stimuli,
behavioral task paradigm, data management, data visualization, and a
user interface. The authors propose that Autopilot is the fastest, least
expensive, most flexibile behavioral system that is currently available.
The benefit of using Autopilot is that it allows more experimental
flexibility, which lets researchers to optimize it for their specific
experimental needs. Additionally, this project exemplifies how useful a
raspberry pi can be for performing experiments and recording data. The
preprint discusses many benefits of raspberry pis, including their
speed, precision and proper data logging, and they only cost $35 (!!).
Ultimately, the authors developed Autopilot in an effort to encourage
users to write reusable, portable experiments that is put into a public
central library to push replication and reproducibility.
*This research tool was created by your colleagues. Please acknowledge
the Principal Investigator, cite the article in which the tool was
described, and include an RRID in the Materials and Methods of your
future publications. Project portal RRID:SCR_021448;
Software RRID:SCR_021518*
The utility function parse_folder can be used to parse the entire downloaded folder of HTML documents!
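The signature of parse_folder is not shown here, so the following is only a sketch of the general pattern: walk a folder of downloaded HTML files and apply a per-file parser to each. The name parse_html_folder and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Callable, List, TypeVar

T = TypeVar("T")


def parse_html_folder(
    folder: str,
    parse_one: Callable[[str], T],
    pattern: str = "*.html",
) -> List[T]:
    """Apply parse_one to every matching file in folder, in sorted order."""
    return [parse_one(str(path)) for path in sorted(Path(folder).glob(pattern))]


# With scrape_ob installed, the per-file parser could be Project.from_file:
#   from scrape_ob.parse import Project
#   projects = parse_html_folder("./html", Project.from_file)
```

Passing the parser in as an argument keeps the folder walk independent of how a single page is parsed.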
TODO
- Complete parsing of blurbs
- Export Project to structured markdown with YAML headers or to whatever format open neuro ppl want
- Make mediawiki template and client code to push into the open neuro wiki
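The export item above could be approached along the following lines. This is a sketch of one possible approach, not a planned design: it takes a dictionary like the one shown earlier and writes the metadata as a YAML header followed by the markdown body. It leans on JSON values being readable as YAML flow syntax so that only the standard library is needed; the function name and the choice to write dates as ISO strings are assumptions.

```python
import json
from datetime import datetime
from typing import Any, Dict


def to_front_matter_markdown(project: Dict[str, Any]) -> str:
    """Hypothetical exporter: YAML header from metadata, then the body."""
    lines = ["---"]
    for key, value in project.items():
        if key == "body":
            continue  # the body goes below the header, not inside it
        if isinstance(value, datetime):
            # Assumption: keep only the date part, as an ISO 8601 string
            value = value.date().isoformat()
        # JSON-encoded scalars, lists, and dicts are used as YAML flow values
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    lines.append("")
    lines.append(project.get("body", ""))
    return "\n".join(lines)


document = to_front_matter_markdown({
    "name": "AutoPilot",
    "date": datetime(2019, 12, 12),
    "tags": ["gui", "hardware"],
    "docs": None,
    "body": "Body text in markdown.",
})
```

A dedicated YAML library would give more readable block-style output; the JSON shortcut is only meant to keep the sketch dependency-free.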