# scrape_ob

Scrape Open Behavior and convert to structured data

## Components

- `scrape.py` - utility functions for listing and downloading files
- `parse.py` - Parse downloaded HTML files
- `main.py` - Wrapper/entrypoint scripts

## Usage

```
>>> scrape_ob --help
usage: scrape_ob [-h] [-u URL] [-o OUTPUT] [-d]

Scrape Open Behavior and return structured data

options:
  -h, --help            show this help message and exit
  -u URL, --url URL     Root URL for open behavior's open source project
                        directory. Default is
                        https://edspace.american.edu/openbehavior/open-source-tools/
  -o OUTPUT, --output OUTPUT
                        Output directory to store downloaded html files in.
                        Default is ./html
  -d, --download        Just download html files without parsing them

Be kind and always give credit where labor has been done
```

### Downloading

We try to always work on local copies of the files so that we are not requesting too much from their server, so we start by downloading HTML copies of the project descriptions, say to the `./html` folder.

Interactively:

```python
from scrape_ob.scrape import download_all

download_all(output_folder='./html')
```

From the CLI:

```
scrape_ob --download
```

### Parsing

From there we can parse an individual project's HTML representation into a structured one using two primary dataclasses:

* `Project`: The primary representation of a project
* `Blurb`: The image and link boxes at the bottom of a page that link to further details about a project.

Since the blurbs don't carry structured information, use inconsistent terminology, embed links within text, etc., we are leaving them relatively unprocessed for now, pending further refinement of the parsing classes.

The `Project` class can parse projects given a URL or a file, e.g.

```python
from scrape_ob.parse import Project

ap = Project.from_file('html/autopilot.html')
```

This gives us a structured representation of the project, which we can access directly as attributes or pull out as a dictionary (omitting the `body` attribute for clarity):

```python
>>> print(ap.dict())
{
    'name': 'AutoPilot: python framework for behavior experiments with raspberry pi',
    'url': 'https://edspace.american.edu/openbehavior/project/autopilot/',
    'date': datetime.datetime(2019, 12, 12, 0, 0),
    'tags': [
        'automated', 'cognition', 'decision-making', 'gui', 'hardware',
        'perceptual', 'raspberrypi', 'sensory', 'software'
    ],
    'categories': [
        'behavior-apparatus', 'behavior-measurement', 'data-analysis-software',
        'behavior-rigs', 'behavior-analysis', 'behavioral-tasks',
        'freely-moving', 'integrated-systems', 'stimuli'
    ],
    'rrids': ['SCR_021448', 'SCR_021518'],
    'blurbs': [
        {
            'links': ['https://www.biorxiv.org/content/10.1101/807693v1'],
            'name': 'Paper',
            'type': None
        },
        {
            'links': [
                'https://github.com/wehr-lab/autopilot',
                'http://docs.auto-pi-lot.com/'
            ],
            'name': 'Github',
            'type': None
        },
        {
            'links': [
                'https://auto-pi-lot.com/',
                'https://auto-pi-lot.com/presentation/#/'
            ],
            'name': 'Website',
            'type': None
        }
    ],
    'docs': None,
    'repo': None,
    'paper': None
}
```

Note how we are able to pull out the "categories" and "tags" information, which is usually hidden as part of the page metadata.
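The same fields can also be read directly as attributes on the dataclass, e.g. (values from the example above):

```python
>>> ap.name
'AutoPilot: python framework for behavior experiments with raspberry pi'
>>> ap.rrids
['SCR_021448', 'SCR_021518']
```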
The extra `docs`, `repo`, and `paper` fields are currently left unfilled, as we will eventually use the `Blurb` class to parse out that information.

The body of the project description is extracted into `body` and converted to markdown :)

```markdown
Jonny Saunders from Michael Wehr’s lab at the University of Oregon recently posted a preprint documenting their project Autopilot, which is a python framework for running behavioral experiments:

------------------------------------------------------------------------

[Autopilot](https://auto-pi-lot.com/)\xa0is a python framework for behavioral experiments through utilizing\xa0[Raspberry Pi microcontrollers](https://www.raspberrypi.org/). Autopilot incorporates all aspects of an experiment, including the hardware, stimuli, behavioral task paradigm, data management, data visualization, and a user interface. The authors propose that Autopilot is the fastest, least expensive, most flexibile behavioral system that is currently available.

The benefit of using Autopilot is that it allows more experimental flexibility, which lets researchers to optimize it for their specific experimental needs. Additionally, this project exemplifies how useful a raspberry pi can be for performing experiments and recording data. The preprint discusses many benefits of raspberry pis, including their speed, precision and proper data logging, and they only cost $35 (!!). Ultimately, the authors developed Autopilot in an effort to encourage users to write reusable, portable experiments that is put into a public central library to push replication and reproducibility.

*This research tool was created by your colleagues. Please acknowledge the Principal Investigator, cite the article in which the tool was described, and include an\xa0RRID\xa0in the Materials and Methods of your future publications.\xa0\xa0Project portal\xa0RRID:SCR_021448; Software\xa0RRID:SCR_021518*
```

The utility function `parse_folder` can be used to parse the entire downloaded folder of HTML documents! (A short usage sketch follows the TODO list below.)

## TODO

- Complete parsing of blurbs
- Export `Project` to structured markdown with YAML headers, or to whatever format the open neuro people want
- Make a mediawiki template and client code to push into the open neuro wiki
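As mentioned above, `parse_folder` can process the whole downloaded folder at once. A minimal usage sketch follows; the module it lives in, its argument name, and its return type (a list of `Project` objects) are assumptions here, so check `parse.py` for the actual signature:

```python
# Usage sketch only: parse_folder is assumed to live in scrape_ob.parse,
# take the download folder as its argument, and return a list of Project objects.
from scrape_ob.parse import parse_folder

projects = parse_folder('./html')
for project in projects:
    print(project.name, project.url)
```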